Web scraping is a popular method of automatically collecting information from different websites. It allows you to obtain data quickly, without having to browse through numerous pages and copy and paste everything by hand.
The result is usually exported to a file with structured information, such as a CSV. Scraping tools are also capable of keeping the collected information up to date as it changes.
There are numerous applications, websites, and browser plugins that allow you to parse information quickly and efficiently. It is also possible to create your own web scraper – this is not as hard as it may seem.
In this article, you will learn more about web scraping, its types, and possible applications. We will also tell you how to scrape websites with Ruby.
Ways of collecting information
There are two ways to automatically collect information: web scraping and web crawling. Both are used for extracting content from websites, but their areas of work are different.
Web scraping refers to collecting data from a particular source (a website or database) or a local machine.
It does not involve working with large datasets; even a simple download of a web page is considered a form of data scraping.
Web crawling involves processing large sets of data across numerous resources.
The crawler visits the main page of a website and gradually scans the entire resource. Generally, the bot is programmed to visit numerous sites of the same type (for example, online furniture shops).
Both processes result in presenting the collected information. Since the Internet is an open network, and the same content can be reposted on different resources, the output can contain lots of duplicated information.
Data crawling involves processing the output and removing the duplicates. This can also be done while scraping the information, but it is not necessarily part of it.
How web scraping works and how to choose the tool
Scraping scripts follow a simple algorithm: the program visits the web page and selects the required HTML elements according to the specified CSS or XPath selectors.
The extracted information is then processed, and the result is saved to a document.
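As a rough illustration, here is a minimal sketch of that algorithm in Ruby, using the Nokogiri and HTTParty gems introduced later in this article; the URL and the h2 selector are placeholders only:
require 'nokogiri'
require 'httparty'

# Fetch the page and parse it into a queryable document
page = HTTParty.get('https://example.com')
doc = Nokogiri::HTML(page.body)

# Select elements with a CSS selector...
headings = doc.css('h2').map(&:text)
# ...or with an equivalent XPath expression
headings_xpath = doc.xpath('//h2').map(&:text)

# Save the processed result to a file
File.write('headings.txt', headings.join("\n"))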
The web offers quite a lot of out-of-the-box scraping tools, such as online and desktop applications, browser extensions, etc. They provide different functionality suitable for different needs. That is why choosing a web scraper requires a bit of market research.
Let’s have a look at the key features to consider when choosing a web scraping tool.
- Input and output
Different scrapers process different types of information: articles, blog and forum comments, online shop databases, tables, dropdowns, JavaScript elements, etc. The result can also be presented in different formats, such as XML or CSV, or be written directly into a database.
- Type of license
Out-of-the-box scrapers come with either a free or a commercial license.
The free tools generally have fewer options for customization, less capacity, and less thorough scraping. The paid scrapers offer wider functionality, work more efficiently, and are well suited for professional use.
- Technical background for usage
Some of the tools can be used through a visual interface alone, without writing a single line of code.
Others require a basic technical background, and there are also tools aimed at advanced computer users. The difference between them lies in the customization options.
It is also possible to develop a custom web scraper from scratch. The application can be written in any of the existing programming languages, including Ruby. A custom Ruby parser will have all the necessary functionality, and the output will be pre-processed exactly the way you need it.
Having considered the existing types of web scraping tools, let’s see how to choose a scraper according to your needs:
- A free out-of-box tool will be sufficient for processing small amounts of information for personal use.
- A scraper with a paid license is necessary for users collecting large, yet similar, sets of information for business and scientific needs (e.g. collecting financial statistics).
- A custom tool for scraping the web with Ruby is suitable for users who need a fully tailored scraper for professional tasks on a regular basis.
The application of web scraping
Data scraping and crawling are used for processing sets of unstructured information and logically presenting them as a database or a spreadsheet. The output is valuable information for analysts and researchers, and it can be applied in many different areas.
- Machine learning
A Ruby web crawler can collect information from different resources and output the dynamics of market changes (such as changes in currency rates or prices for securities, oil, gold, real estate, etc.). The output can then be used for predictive analytics and for training artificial intelligence.
- Collecting product characteristics and prices
Web scraping is widely used by aggregators: they collect information about goods in different online shops and then present it on their own websites.
This gives users the opportunity to compare the prices and characteristics of the item they need across different platforms, without having to browse through numerous sites.
- Collecting contact details
Web scraping can be useful for establishing both B2B and B2C relationships.
With the help of scraping tools, companies can create lists of suppliers, partners, etc., and build databases of existing and potential clients. In other words, web scraping can help to obtain lists of virtually any individuals of interest.
- Collecting job opportunities
Recruitment companies can extract the contact details of potential applicants for different vacancies, and vice versa: information about job opportunities in different companies can be collected as well.
This output is a good basis not only for finding the right specialists and jobs, but also for market analysis: building statistics about the demand and requirements for different specialists, their salary rates, etc.
- Collecting information on a topic
With the help of scraping, you can download all the necessary information in bulk and then use it offline.
For example, it is possible to extract all the questions and answers on a particular topic from Quora or any other Q&A service. You can also collect blog posts or the results of internet searches.
- Conducting market research
Data scraping can be used by marketing specialists for researching a target audience, building email lists for newsletters, etc.
It helps to monitor competitors’ activities and track changes to their catalogs. SEO specialists can also scrape competitors’ web pages in order to analyze the semantics of their websites.
How to do web scraping using Ruby
Having considered the variety of web scraping tools and the possible ways to apply the scraped data, now let’s talk about creating your own custom tool. We are going to present you with a brief guide covering the basic stages of web scraping in Ruby.
Useful tools
Ruby provides a wide range of ready-made tools for performing typical operations.
They allow developers to use official and reliable solutions instead of reinventing the wheel. For Ruby web scraping, you will need the following gems installed on your computer (a brief installation sketch follows the list):
- Nokogiri is an HTML, SAX and RSS parser providing access to elements based on XPath and CSS3 selectors. This gem can be applied not only for web parsing but also for processing different types of XML files.
- HTTParty is a client for RESTful services that sends HTTP requests to the scraped pages and automatically parses JSON and XML responses into Ruby data structures.
- Pry is a tool used for debugging. It will help us inspect the data parsed from the scraped pages.
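If the gems are not installed yet, one way to add them is to declare them in a Gemfile and run bundle install (a minimal sketch, assuming Ruby and RubyGems are already set up; installing each gem directly with gem install also works):
# Gemfile – declares the gems this scraper depends on
source 'https://rubygems.org'

gem 'nokogiri'  # HTML/XML parsing
gem 'httparty'  # HTTP requests
gem 'pry'       # interactive debugging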
Web scraping is quite a simple operation and, generally, there is no need to install the Rails framework for this. However, it does make sense if the scraper is part of a more complicated service.
Having installed the necessary gems, you are now ready to learn how to make a web scraper. Let’s proceed!
Step 1. Creating the scraping file
Create the directory where the application data will be stored. Then add a blank text file named after the application and save it to the folder. Let’s call it “web_scraper.rb”.
In the file, pull in the Nokogiri, HTTParty and Pry gems by adding the following require statements:
require 'nokogiri'
require 'httparty'
require 'pry'
Step 2. Sending the HTTP request
Create a variable and send an HTTP request to the page you are going to scrape:
page = HTTParty.get('https://www.iana.org/domains/reserved')
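Before parsing, it can be worth checking that the request actually succeeded – a small optional check, not part of the original steps:
puts page.code          # HTTP status code, e.g. 200 on success
puts page.body[0, 200]  # preview the first 200 characters of the returned HTML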
Step 3. Launching Nokogiri
The aim of this stage is to convert the fetched page into Nokogiri objects for further parsing. Set a new variable named “parsed_page” and assign it the result of Nokogiri’s HTML conversion method – you will use it throughout the process.
parsed_page = Nokogiri::HTML(page.body)
Pry.start(binding)
Save your file and launch it once again. In the Pry session, evaluate the “parsed_page” variable to retrieve the page as a set of Nokogiri objects.
In the same folder, create an HTML file (let’s call it “output.html”), and save the output of “parsed_page” there. You will be able to refer to this document later.
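One way to do that from the same Pry session is to write the parsed markup straight to the file (a sketch; output.html is simply the file name chosen above):
# Dump the parsed document to a file for later reference
File.write('output.html', parsed_page.to_html)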
Before proceeding, exit from Pry in the terminal.
Step 4. Parsing
Now you need to extract all the needed elements. To do this, pass the appropriate CSS selector to the Nokogiri document. You can locate the selector by viewing the page’s source code:
array = parsed_page.css('h2').map(&:text)
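The same elements could also be selected with an XPath expression, and Pry makes it easy to check what ended up in the array (a sketch; the h2 selector is only an example for this page):
# Equivalent selection using XPath instead of CSS
array = parsed_page.xpath('//h2').map(&:text)

# Quick sanity check of the extracted data
puts array.length
puts array.first(5)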
Once the parsing is complete, it is necessary to export the parsed data to the CSV file so it won’t get lost.
Step 5. Export
Having parsed the information, you need to complete the scraping and convert the data into a structured table. Return to the terminal and execute the commands:
require 'csv'
CSV.open('reserved.csv', 'w') { |csv| csv << array }
You will receive a new CSV file with all the parsed data inside.
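Note that the snippet above writes the whole array as a single CSV row. If you would rather have one item per row, a variant like the following works as well (a sketch under the same assumptions):
require 'csv'

# Write each parsed item on its own row
CSV.open('reserved.csv', 'w') do |csv|
  array.each { |item| csv << [item] }
end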
Conclusion
We have covered the process of web scraping, its types, benefits, and possible applications. You are now aware of the basic features of the existing tools and know how to choose the right one.
If your business needs a customized solution, drop us a line. We have solid expertise in Ruby and were recently ranked the world’s top Ruby on Rails agency by Clutch.