Use Case

How to build a lists crawler using Byteline Web Scraper

CHECK OUT THE BYTELINE FLOW

Here is the Byteline flow used for this use case. You can clone it by following the link.
How to build a lists crawler?
1. Configure Scheduler with URLs list: the Scheduler node is configured with a Google Spreadsheet containing the list of URLs to crawl.
2. Web Scraper with URL expression: the Web Scraper node scrapes data from Coinbase by using an expression for the URL. This expression points to a URL record from the Google Spreadsheet.

Step by Step Instructions


Byteline allows you to crawl a list of URLs from any site by using an expression for the URL in its Web Scraper. In this documentation, we crawl a list of Coinbase URLs using the Web Scraper node. This makes Byteline an effective lists crawler for scraping data from any site and pushing it directly to a cloud service.

Here we will configure the following nodes to scrape a list of URLs:

Scheduler Trigger Node - First, we'll configure the Scheduler node to run the flow at a regular interval. A Google Spreadsheet with the list of URLs is configured on the Scheduler (see the example sheet after this list).
Web Scraper Node - Next, we'll configure the Web Scraper node to scrape data from a webpage. Here, we will scrape data from a list of Coinbase URLs (Coinbase is a centralized exchange for buying and selling cryptocurrencies).
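
For reference, the Google Sheet only needs a single column holding one URL per row. A hypothetical layout (the URLs below are illustrative examples):

URL
https://www.coinbase.com/price/bitcoin
https://www.coinbase.com/price/ethereum
https://www.coinbase.com/price/dogecoin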

Let’s get started.



Create flow

In this section, you’ll learn to create the flow. For more details, you can check How to Create your First Flow Design.

 

Step 1: Enter the flow name to create your new flow.




Step 2: Select the Scheduler trigger node to initiate your flow. 


Now, you need to configure the Scheduler Node to run the flow at a regular interval of time.


So, let’s get started with Scheduler node configuration! 

Configure scheduler with URLs list

Step 1: Click on the edit button to configure the Scheduler node.


Note: The Scheduler comes with a default configuration, which you can change according to your requirements.


For more information, please refer to this guide - 


In this documentation, we’re using the default settings for the Scheduler node.

Providing URLs list

In the Data section, provide the spreadsheet ID of the Google Sheet that contains the URLs for the flow.


Step 1: Copy the spreadsheet ID of the sheet where you have curated a list of Coinbase URLs.
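
The spreadsheet ID is the long token in the sheet's URL between /d/ and /edit. For example, in the URL pattern below, the <SPREADSHEET_ID> placeholder marks the value to copy:

https://docs.google.com/spreadsheets/d/<SPREADSHEET_ID>/edit#gid=0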





Step 2: Enter the copied spreadsheet ID in the field.



Step 3: Click on the Save button to save the Scheduler node configuration.



Configuring Web Scraper

Step 1: Click on the add button to view the nodes in the Select Node window.




Step 2: Select the Web Scraper node to add it to your flow. 




Step 3: Click on the Edit button to configure the Web Scraper node. 





Configure URL using Byteline expression

Enter an expression for the web page URL in the field to scrape the web page data. Use the Select Variable tool on the right, then click on the field you want to reference.
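
As a rough sketch, if the sheet column holding the URLs is named url, the expression could look something like the line below. The exact variable path depends on your flow and column names, which is why the Select Variable tool is the reliable way to build it:

${scheduler.url}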

For Title


Launch ZipRecruiter in a new browser tab and enable the Byteline Web Scraper Chrome extension. This walkthrough uses ZipRecruiter as the example page; the same steps apply to any URLs in your list, including the Coinbase URLs configured on the Scheduler.

Here, we are scraping a few fields, such as the job title, company name, job posting link, and location, from the ZipRecruiter website.


Step 1: Double-click on the title to select the job title you would like to scrape.




Step 2: Select the Text option to specify the data type for scraping. 



Step 3: Click on Repeating Elements to scrape multiple job titles across the web page. We use repeating elements because multiple jobs are scraped from this page.


 



The Web Scraper will automatically copy the data to the clipboard. 


Step 4: In the Web Scraper configuration window, click on Paste from the Chrome Extension to paste the scraped data.




Step 5: Enter the Array Name to specify the JSON array from which you want to fetch elements (for example, jobs).

 

Step 6: Enter Title in the field; its XPath is automatically filled in for scraping the job title.
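
For a sense of what gets captured, a repeating-element XPath for job titles might look roughly like the line below. The class names here are hypothetical; in practice, the Chrome extension records the real XPath for you:

//article[contains(@class, 'job_result')]//h2[contains(@class, 'title')]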



For Company Name


Step 1: Double-click on the company name to select it for scraping.




Step 2: Select the Text option to specify the data type for scraping. 




Step 3: Click on Repeating Elements to scrape multiple company names across the web page.



The Web Scraper will automatically copy the data to the clipboard.


Step 4: In the Web Scraper configuration window, click on Paste from the Chrome Extension to paste the scraped data.



Step 5: Enter Company in the field; its XPath is automatically filled in for scraping the company name.


For Link


Step 1: Double-click on the job title (which carries the hyperlink) to select the link for scraping.




Step 2: Select the Link option to specify the data type.


Note: You can also preview the link. 




Step 3: Click on Repeating Elements to scrape multiple job links across the web page.




Note: The Web Scraper will automatically copy the data to the clipboard.


Step 4: In the Web Scraper configuration window, click on Paste from the Chrome Extension to paste the scraped data.


Step 5: Enter Link in the field; its XPath is automatically filled in for scraping the link.

For Location


Step 1: Double-click on the location to select the company location for scraping.





Step 2: Select the Text option to specify the data type for scraping. 




Step 3: Click on Repeating Elements to scrape multiple company locations across the web page.




Note: The Web Scraper will automatically copy the data to the clipboard. 


Step 4: In the Web Scraper configuration window, click on Paste from the Chrome Extension to paste the scraped data.



Step 5: Enter Location in the field; its XPath is automatically filled in for scraping the location.




Step 6: Once you’re done with the configuration, click on the Save button. 



You have now configured the Web Scraper to scrape the required job details.

After configuring the flow, perform a test run to make sure the Web Scraper task works.

Test Run

Click on the Test Run button to test the flow. 



Now, click on the 'i' (more information) button on the top-right corner of the Web Scraper node to check the data content extracted. 


You will see a SUCCESS window as shown in the snippet below: 
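
The extracted data comes back as a JSON array under the Array Name you configured earlier. Assuming the array was named jobs and the fields were named as above, the output would look roughly like this (all values are hypothetical):

{
  "jobs": [
    {
      "title": "Senior Software Engineer",
      "Company": "Acme Corp",
      "Link": "https://www.ziprecruiter.com/jobs/...",
      "Location": "Austin, TX"
    }
  ]
}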




Your Web Scraper node has been configured successfully.  



Flow runs for all URLs


Step 7: Click on the Runs button located at the top-right corner of the console to view the flow runs for each URL in the Google Spreadsheet.



You can view the Flow Runs table showing Instance Id, Start time, End time, Status, and Action Consumed as shown in the snippet below. 




You can click on any of the flow instances and then check the information on its Scheduler node to see the URL used for that run.



Your lists crawler flow is now fully configured and will run for every URL in the Google Spreadsheet.


If you have any questions, feel free to connect with us.