Use Case

Deep Web Crawling Using Byteline Web Scraper

CHECK OUT THE BYTELINE FLOW

Here is the Byteline flow used for this use case. You can clone it by following the link: Deep web crawling using Byteline Web Scraper
1. Byteline flow runs on a scheduler
The Byteline flow runs in the cloud at the scheduled time. You can configure the schedule based on your requirements.
2. Scrape from Etsy
Data is scraped from Etsy using the Byteline Web Scraper and its Chrome extension.
3. Deep Web Crawling
The data scraped by the first scraper node is then used to perform deep crawling and extract additional information from the site.

Step by Step Instructions

Byteline Web Scraper lets you perform deep web crawling easily. The platform gives you the flexibility to export the scraped data to any of Byteline's integrations, including Airtable and Webflow CMS. This documentation explains the steps to perform deep web crawling using Byteline Web Scraper. We will use the Byteline Web Scraper Chrome Extension to configure the data to be scraped.

We will be configuring the following nodes to perform deep web crawling:

Scheduler Trigger Node: First, we’ll configure the Scheduler node to run the flow at a regular interval.

Web Scraper Node: Next, we’ll configure the Web Scraper node to extract data and links from a web page. Here, we scrape data from the www.etsy.com site.

Second Web Scraper Node: This scraper node takes each link from the first scraper node and scrapes the linked page to extract additional information from the site.
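Conceptually, the two scraper nodes form a two-stage pipeline: the first stage returns a list of records that each carry a link, and the second stage visits every link to collect more fields. The sketch below illustrates only that idea in plain Python. The names `deep_crawl`, `scrape_list`, and `scrape_detail`, the example.com URLs, and the merged output shape are all assumptions made for illustration; they are not part of Byteline.

```python
from typing import Callable, Dict, List

Record = Dict[str, str]

def deep_crawl(
    listing_url: str,
    scrape_list: Callable[[str], List[Record]],
    scrape_detail: Callable[[str], Record],
) -> List[Record]:
    """Stage 1: scrape the listing page. Stage 2: follow each record's link."""
    results = []
    for item in scrape_list(listing_url):
        detail = scrape_detail(item["link"])
        # Merging both stages into one record is an illustrative choice.
        results.append({**item, **detail})
    return results

# Hypothetical in-memory stand-ins for the two scraper nodes (no network access).
FAKE_LISTING = [
    {"title": "Sunset over hills", "price": "40.00", "link": "https://example.com/item/1"},
    {"title": "Forest path", "price": "55.00", "link": "https://example.com/item/2"},
]
FAKE_DETAILS = {
    "https://example.com/item/1": {"reviews": "12 reviews"},
    "https://example.com/item/2": {"reviews": "3 reviews"},
}

crawled = deep_crawl(
    "https://example.com/search?q=painting",
    scrape_list=lambda url: FAKE_LISTING,
    scrape_detail=lambda url: FAKE_DETAILS[url],
)
for record in crawled:
    print(record["title"], "-", record["reviews"])
```

In the Byteline flow, the first Web Scraper node plays the role of `scrape_list` and the second Web Scraper node plays the role of `scrape_detail`.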

Follow the steps outlined below to perform deep web crawling.

Let’s get started.

Create Flow

In this section, you’ll learn to create the flow. For more details, see How to Create your First Flow Design.

Configure Web Scraper

Step 1: Select the Web Scraper node from the Select Node window. 

Step 2: Click on the Edit button to open the Web Scraper node configuration window. 

Step 3: On another browser tab, copy the URL of the website for which you want to perform deep web scraping or crawling. For this documentation, we are scraping data of paintings from https://www.etsy.com/in-en/search?q=painting

 

Step 4: Enter the Website URL you want to scrape in the Web Scraper URL field in the Byteline console. 

Step 5: Download and install the Byteline Web Scraper Chrome extension in your browser.

Note: Download the Byteline Web Scraper Chrome Extension from here.

Click on the puzzle-piece-shaped Extensions button in the top-right corner of your browser.

After that, click on the Pin button as shown below to pin the Byteline Web Scraper extension to your browser.  

Step 6: Click on the toggle button to enable the Byteline extension.

Step 7: Select Capture List Elements, as we are extracting data from multiple items.

Once it is selected, hover the cursor over the element you want to copy, and the whole list will be highlighted in yellow. When you’re happy with the list elements shown in yellow, click once to confirm the list selection. The selection then turns green.

Click the element you want to copy, and a dialog box will appear. Select one of the offered options as needed.

Here, the Text option is chosen because we have selected the title of the painting to scrape.

Note: Make sure the selection covers every element in the list. If some elements are not selected, move the cursor slightly until all of them are highlighted.

Click on the Copy button to copy the XPath.

Note: We use Capture List Elements here because we want to scrape repeating elements, i.e. the titles of the paintings.

Step 8: Switch to the Byteline console and click on the 'Paste from the Chrome Extension' button, and the console will automatically paste the copied XPath of the elements into the XPath field.

Step 9: Enter the name of the list in the List Name field.

Step 10: Specify a name for the data you want to scrape in the Field Name field.

Step 11: Now, switch back to the Etsy tab to copy another set of elements you want to scrape. We have selected the price to scrape.

Note: As before, make sure the selection covers every element in the list. If some elements are not selected, move the cursor slightly until all of them are highlighted.

Step 12: Click on the 'Paste from the Chrome Extension' button and give a name to the field you are scraping.

Step 13: Now, switch back to the Etsy tab to copy the link of the repeating elements. This link will be used to perform deep scraping to get additional information from each page.

Step 14: Click on the 'Paste from the Chrome Extension' button and give a name to the field you are scraping.

Step 15: Click on the Advanced button and select the Limit repeating elements option from the drop-down menu to set a limit for repeating elements. Limiting the number of elements is good practice because it makes the flow easier to test and avoids consuming too many actions.

Step 16: Enter the number of elements you want to scrape in the given field.

Step 17: Click on the Save button to save changes.  
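To make the configuration above more concrete, here is a minimal sketch of what capturing list elements by XPath amounts to: one XPath selects the repeating items, each field is read from every item, and an optional limit caps how many items are returned. It uses Python's standard `xml.etree.ElementTree` on a simplified, hypothetical page. The markup, the `capture_list` helper, and the field names are assumptions for illustration, and how Byteline evaluates XPaths internally may differ.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for a search results page (not real Etsy markup).
PAGE = """
<html><body><ul>
  <li class="listing"><h3>Sunset over hills</h3><span class="price">40.00</span><a href="/listing/1">view</a></li>
  <li class="listing"><h3>Forest path</h3><span class="price">55.00</span><a href="/listing/2">view</a></li>
  <li class="listing"><h3>Blue harbour</h3><span class="price">72.50</span><a href="/listing/3">view</a></li>
</ul></body></html>
""".strip()

def capture_list(root, item_xpath, fields, limit=None):
    """Return one dict per repeating element matched by item_xpath.

    fields maps a field name to (relative_xpath, attribute); an attribute of
    None means "take the element text", like the Text option in the extension.
    """
    items = root.findall(item_xpath)
    if limit is not None:
        items = items[:limit]  # same idea as "Limit repeating elements"
    rows = []
    for item in items:
        row = {}
        for name, (xpath, attribute) in fields.items():
            node = item.find(xpath)
            if node is None:
                row[name] = None
            elif attribute is None:
                row[name] = (node.text or "").strip()
            else:
                row[name] = node.get(attribute)
        rows.append(row)
    return rows

root = ET.fromstring(PAGE)
paintings = capture_list(
    root,
    ".//li[@class='listing']",
    {
        "title": ("h3", None),
        "price": ("span[@class='price']", None),
        "link": ("a", "href"),
    },
    limit=2,
)
print(paintings)
```

Here `paintings` corresponds to the List Name, and `title`, `price`, and `link` correspond to the Field Names entered in the console. With `limit=2`, only the first two of the three listings are returned, which keeps test runs small.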

Test Run the Flow

Click on the Test Run button in the top right corner of the interface to run the created flow.

Now, click on the 'i' (more information) button on the top-right corner of the Web Scraper node to check the data content extracted. 

You will see an output window as illustrated below: 

Click on the Back to Flow editing button to return to the Byteline Flow Designer. 

 

Configure second Web Scraper node

Step 1: Select the Web Scraper node from the Select Node window. 

Step 2: Click on the Edit button to open the Web Scraper node configuration window. 

Step 3: Tick the Loop over checkbox.

Step 4: Choose the web_scraper_1 option from the drop-down menu, as we want to get the links from this node.

Step 5: Click on the Expression selector icon located at the extreme right end of the URL field. 

Step 6: Click on the link option to copy the link value of the first web scraper. 

Once you click on the link, it will appear in the URL field. 

Step 7: Switch back to the Etsy tab and click on the Byteline Web Scraper extension button in the top-right corner of your browser.

After that, click on the toggle button to disable the Byteline extension.

Step 8: Click on the link of any painting, and then click on the toggle button to enable the Byteline Web Scraper extension.

Step 9: Select Capture Single Element to copy a single element. On this page, we are getting additional data for each painting. This data has only a single instance, so we use Capture Single Element.

Step 10: Once it is selected, hover the cursor over the element you want to copy, and it will be highlighted in yellow. We have selected “Reviews” to scrape.

Step 11: Click on the element you want to copy, and a dialog box will appear. Select the Text option and click on the Copy button to copy the XPath.

Step 12: Switch to the Byteline console and click on the 'Paste from the Chrome Extension' button, and the console will automatically paste the copied XPath of the element into the XPath field.

Step 13: Specify the name for the column you want to scrape. 

 

Step 14: Now, switch back to the Etsy tab to copy another element you want to scrape. We have selected the first review to scrape.

Step 15: Click on the 'Paste from the Chrome Extension' button and give a name to the field you are scraping.

Note: You can copy as many elements as you want by repeating steps 14 and 15.

Step 16: Click on the Save button to save changes.  
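The second node's behaviour can be sketched in the same spirit: loop over the records produced by the first scraper, resolve each link to a full URL, and read single elements from each detail page. In this sketch the `web_scraper_1` list, the `shop.example` base URL, the page markup, and the helper names are all hypothetical, and pages are served from an in-memory dictionary rather than fetched from the web.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

BASE_URL = "https://shop.example/"  # hypothetical site root

# Hypothetical output of the first scraper node.
web_scraper_1 = [
    {"title": "Sunset over hills", "link": "/listing/1"},
    {"title": "Forest path", "link": "/listing/2"},
]

# Hypothetical detail pages, keyed by absolute URL (stands in for HTTP fetches).
DETAIL_PAGES = {
    "https://shop.example/listing/1": (
        "<html><body><div id='reviews'><h2>12 reviews</h2>"
        "<p class='review'>Lovely colours.</p>"
        "<p class='review'>Fast shipping.</p></div></body></html>"
    ),
    "https://shop.example/listing/2": (
        "<html><body><div id='reviews'><h2>3 reviews</h2>"
        "<p class='review'>Exactly as pictured.</p></div></body></html>"
    ),
}

def capture_single(root, xpath):
    """Return the text of the first element matching xpath, or None."""
    node = root.find(xpath)  # find() returns only the first match
    return None if node is None else (node.text or "").strip()

def crawl_details(items):
    """Loop over the first scraper's records and scrape each linked page."""
    rows = []
    for item in items:
        url = urljoin(BASE_URL, item["link"])  # relative link -> absolute URL
        root = ET.fromstring(DETAIL_PAGES[url])
        rows.append({
            "link": url,
            "reviews": capture_single(root, ".//div[@id='reviews']/h2"),
            "first_review": capture_single(root, ".//p[@class='review']"),
        })
    return rows

details = crawl_details(web_scraper_1)
print(details)
```

The contrast with list capture is that `capture_single` keeps only the first match, which mirrors choosing Capture Single Element for data that appears once per page.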

Test Run the Flow

Click on the Test Run button in the top right corner of the interface to run the created flow.

Now, click on the 'i' (more information) button on the top-right corner of the Web Scraper node to check the data content extracted. 

You will see an output window as illustrated below: 

Note: You can export the scraped data to any Byteline integration, such as Airtable or Google Sheets, depending on your use case.
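If you want to reason about what an exported table might look like, the sketch below flattens scraped records into CSV text, a spreadsheet-friendly shape. This is a generic illustration using Python's standard `csv` module, not a Byteline integration; the sample records and the `to_csv` helper are hypothetical.

```python
import csv
import io

# Hypothetical records, shaped like the combined output of both scraper nodes.
rows = [
    {"title": "Sunset over hills", "price": "40.00", "reviews": "12 reviews"},
    {"title": "Forest path", "price": "55.00", "reviews": "3 reviews"},
]

def to_csv(records, columns):
    """Serialize records to CSV text with a header row, keeping only columns."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=columns, extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()

csv_text = to_csv(rows, ["title", "price", "reviews"])
print(csv_text)
```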

This completes the deep web crawling setup. Feel free to contact us with any questions. Develop fast!