How to do Web Scraping with Ruby?

Victor Rak

What is web scraping?

Web scraping is the process of retrieving desired data from websites. It's used when the site you want to pull data from does not offer an API or another purpose-built way to access it.

HTML scraping and data crawling

The simplest form of web scraping is HTML scraping. We just download the relevant page files and parse them with Nokogiri or another HTML parser. It's a high-speed solution, suitable for retrieving any amount of data from sites that deliver all the needed data directly in HTML and don't block traffic from clients that aren't actual browsers. Using HTML parsers, we can build our own website parser or even a search bot that crawls data across a large part of the World Wide Web. Before JavaScript became ubiquitous, search-engine bots used simple HTML scraping tools for data crawling.

JavaScript handling and screen scraping

Nowadays, many sites don't provide the full information in the HTML code of their pages, but load it using JavaScript. When scraping such sites, we must use an actual web browser; otherwise we won't get the rendered content. If the site can only be visited with the latest browsers, or if we want to retrieve the data from such a site just once and it isn't so large that it runs into gigabytes, we can use Capybara or Watir together with drivers for popular web browsers and automate a process that would otherwise take many days of manual work. This bundle also lets us do screen scraping, if we need to capture graphical information too.
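A minimal Watir sketch of this approach might look like the following. It assumes Chrome and chromedriver are installed locally; the URL and the CSS classes are hypothetical placeholders for whatever the target site actually uses:

```ruby
require 'watir'

# Launch a real Chrome window driven through Selenium.
browser = Watir::Browser.new :chrome

# Hypothetical target page that fills its product list via JavaScript.
browser.goto 'https://example.com/products'

# Wait until the JavaScript-rendered element is present, then read it.
browser.div(class: 'product-list').wait_until(&:present?)
names = browser.divs(class: 'product-name').map(&:text)

# Screen scraping: save the rendered page as an image.
browser.screenshot.save 'products.png'

browser.close
```

Because Watir drives a genuine browser, the waits and screenshots behave exactly as they would for a human visitor, which is what makes this suitable for JavaScript-heavy sites.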

There are also cases when your application needs to scrape some data regularly, but the target site loads that data using JavaScript. The web server where your application is hosted rarely has a graphical user interface where a real browser like Google Chrome or Mozilla Firefox could be launched. That used to be no problem for PhantomJS, a great server-oriented scraping tool for website testing and page automation, based on WebKit, the engine behind Safari and many other browsers. For a new application, however, it's better to use a newer feature of Google Chrome, headless mode, since work on PhantomJS was suspended after that feature was released. Headless Chrome weighs more than PhantomJS, but is more reliable for screen scraping, because it renders web content the same way as Google Chrome does in normal mode.
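Run on a server, the same idea can be sketched with Capybara, which registers a `:selenium_chrome_headless` driver out of the box; again the URL and selector below are invented examples, and chromedriver must be available on the host:

```ruby
require 'capybara'
require 'capybara/dsl'

# Use the headless Chrome driver Capybara registers by default.
Capybara.default_driver = :selenium_chrome_headless
Capybara.run_server = false  # we are scraping, not testing a local app

include Capybara::DSL

# Hypothetical JavaScript-rendered page on the target site.
visit 'https://example.com/dashboard'

heading = find('h1').text          # waits for the element to appear
save_screenshot 'dashboard.png'    # screen scraping without a display
```

No display server is needed: Chrome renders the page off-screen, so this can run from a cron job on an ordinary web host.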

P.S.: Keep in mind that if the target site noticeably changes its layout, the parser code tied to it will probably break. It can be fixed quickly if the change was moderate; but if the way the site delivers its data has changed, the parser may need to be almost completely rewritten.
