Chef Knives To Go (CKTG) features a catalog of around 1000 knives spread over various form factors and steel make-ups.
Working out a sensible price point for each form factor and steel type is arduous: there are more than 20 different form factors, and each form factor can come in close to 20 different steel types.
Process
To build a dataset of knives, the most practical approach is to crawl the website more than once.
CKTG lists knives in several ways; the two main ways to browse them are by form factor and by steel type.
By first crawling all of the knives by form factor, then crawling again by steel type and joining the results, we avoid having to extract the steel type from each individual product page.
Using R and the ‘rvest’ package, we can load each page into our script and pull the relevant information for each knife.
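The join step can be sketched as follows. This is a minimal illustration with made-up rows, not the original script: it assumes each crawl yields one row per knife and that the product link uniquely identifies a knife. The column names (link, title, style, steel) and the example.com URLs are hypothetical.

```r
library(dplyr)
library(tibble)

# Hypothetical result of the crawl by form factor (placeholder data).
by_style <- tibble(
  link  = c("https://example.com/knife-a", "https://example.com/knife-b"),
  title = c("Knife A Gyuto 210mm", "Knife B Petty 150mm"),
  style = c("Gyuto", "Petty")
)

# Hypothetical result of the crawl by steel type (placeholder data).
by_steel <- tibble(
  link  = c("https://example.com/knife-a", "https://example.com/knife-b"),
  steel = c("Aogami #2", "VG-10")
)

# Join on the product link (assumed unique per knife) so that each knife
# carries both its form factor and its steel type.
knives <- left_join(by_style, by_steel, by = "link")
print(knives)
```

A left join keeps every knife found in the form-factor crawl, even if the steel-type crawl missed it; such knives would simply have a missing steel value.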
The following views from CKTG show how the website is organized.
After clicking through any of the above, we are led either to a sub-category or to a page of knives.
For our analysis, this is all of the data we need. By using nested for loops, we can navigate past all of the sub-categories and collect all of the knife information.
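The nested-loop traversal can be sketched as follows. To keep the example runnable offline, the site is replaced by a small in-memory list of HTML snippets; the URLs, the CSS selectors (a.cat, a.knife), and the helper names (fetch, links_of, collect) are all hypothetical. In a real crawl, fetch would call read_html on the page URL instead.

```r
library(rvest)

# Stand-in for the website: hypothetical URLs mapped to HTML snippets.
site <- list(
  "/styles"    = '<a class="cat" href="/gyuto">Gyuto</a>
                  <a class="cat" href="/petty">Petty</a>',
  "/gyuto"     = '<a class="cat" href="/gyuto-210">210mm</a>
                  <a class="cat" href="/gyuto-240">240mm</a>',
  "/gyuto-210" = '<a class="knife" href="/k1">Knife 1</a>',
  "/gyuto-240" = '<a class="knife" href="/k2">Knife 2</a>',
  "/petty"     = '<a class="knife" href="/k3">Knife 3</a>'
)
fetch <- function(url) minimal_html(site[[url]])

# Return the href and visible text of every link matching a CSS selector.
links_of <- function(page, css) {
  nodes <- html_elements(page, css)
  data.frame(href = html_attr(nodes, "href"),
             text = html_text2(nodes),
             stringsAsFactors = FALSE)
}

# Collect the knives on a page and tag them with the titles traversed so far.
collect <- function(page, style, sub_style) {
  k <- links_of(page, "a.knife")
  k$style <- rep(style, nrow(k))
  k$sub_style <- rep(sub_style, nrow(k))
  k
}

styles <- links_of(fetch("/styles"), "a.cat")
rows <- list()
for (i in seq_len(nrow(styles))) {
  style_page <- fetch(styles$href[i])
  subs <- links_of(style_page, "a.cat")
  if (nrow(subs) == 0) {
    # No sub-categories: this page already lists knives.
    rows[[length(rows) + 1]] <- collect(style_page, styles$text[i], NA_character_)
  } else {
    for (j in seq_len(nrow(subs))) {
      sub_page <- fetch(subs$href[j])
      rows[[length(rows) + 1]] <- collect(sub_page, styles$text[i], subs$text[j])
    }
  }
}
knives <- do.call(rbind, rows)
print(knives)
```

Carrying the category titles down through the loops means each knife row records the path taken to reach it, without visiting the knife's own page.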
The following is an excerpt of the dataset
For each knife, we store a link to the knife on CKTG, its title, its price, the type of steel used, the manufacturer, and the form factor, along with a few other traits:
SubType: A sub-category encountered when crawling by steel type. With more processing, we could potentially extract extra information from it.
SubStyle: A sub-category encountered when crawling by form factor. It represents the different lengths that some knife styles come in.
SubSubStyle: A further sub-category encountered when crawling by form factor. It represents the different handle styles that some knives come with (Western vs. Japanese).
Code
Most of the data we need can be scraped with the following function:
This function uses the pipe operator (%>%) to pass the result of each command into the next one.
html_elements selects, by CSS selector, each item on the pages shown under Process.
html_attrs returns each item's attributes, which include the link to the page the item corresponds to and its title.
tibble and unnest_wider transform what we extracted from the ‘nodeset’ into a data frame that is easy to loop over.
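A function along the lines described above might look like the following. This is a sketch under stated assumptions rather than the original code: the function name get_items, the selector a.item, and the in-memory test page are hypothetical, and the real selector depends on CKTG's markup.

```r
library(rvest)
library(tibble)
library(tidyr)

# Collect the attributes of every element matching `css` on a parsed page.
# `css` is a placeholder selector, not CKTG's actual markup.
get_items <- function(page, css) {
  page %>%
    html_elements(css) %>%   # nodes matching the CSS selector
    html_attrs() %>%         # list of named character vectors, one per node
    tibble(attrs = .) %>%    # wrap the list as a single list-column
    unnest_wider(attrs)      # spread each attribute name into its own column
}

# Stand-in page built in memory so the sketch runs without network access.
page <- minimal_html('
  <a class="item" href="/gyuto.html" title="Gyuto">Gyuto</a>
  <a class="item" href="/petty.html" title="Petty">Petty</a>
')

items <- get_items(page, "a.item")
print(items)
```

The result has one row per matched element and one column per attribute (here class, href, and title), so a loop can read items$href to find the next page to visit.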
After looping over these categories, we eventually arrive at a collection of knives, like the third image in Process.
Here we run the same function as above, along with a similar function that pulls the price data.
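A price-pulling function of this kind could be sketched as follows. The function name get_prices, the selector span.price, and the sample page are assumptions for illustration; the actual markup on CKTG may differ.

```r
library(rvest)

# Pull prices from a page of knives and convert them to numbers.
# The default selector "span.price" is a placeholder, not CKTG's real markup.
get_prices <- function(page, css = "span.price") {
  raw <- page %>%
    html_elements(css) %>%
    html_text2()                        # price as displayed text, e.g. "$189.00"
  as.numeric(gsub("[^0-9.]", "", raw))  # drop "$" and thousands separators
}

# Stand-in page built in memory so the sketch runs without network access.
page <- minimal_html('
  <div class="knife"><span class="price">$189.00</span></div>
  <div class="knife"><span class="price">$1,249.95</span></div>
')

prices <- get_prices(page)
print(prices)
```

Converting to numeric at scrape time keeps the price column ready for summaries by form factor or steel type.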
Now that we have reached individual knives, we compile the titles of all the pages we traversed, which gives us the manufacturer and steel type.
For the second round of web scraping, we do the same thing as above and simply add the additional information that we collect about the knife’s form factor.