How to do Web Scraping with Ruby?

What is web scraping?


Web scraping is the process of retrieving the data you need from websites. It is used when the site you want to extract data from has no API, which leaves scraping as the only option.

HTML scraping and data crawling

The simplest form of web scraping is HTML scraping: we download the required pages and parse them with Nokogiri or another HTML parser. It is a fast solution, suitable for retrieving any amount of data from sites that serve all the needed data in plain HTML and do not block traffic from clients that are not real browsers. With an HTML parser we can build our own website parser, or even a search bot that crawls a large part of the World Wide Web. Before JavaScript became ubiquitous, search engine bots relied on simple HTML scraping tools for data crawling.



JavaScript handling and screen scraping

Nowadays, many sites do not include all of their information in the HTML of the page; they load it with JavaScript instead. To scrape such sites, we must use an actual web browser. If a site can only be visited with the latest browsers, or if we need the data just once and its volume is not measured in gigabytes, we can combine Capybara or Watir with drivers for popular web browsers and automate a process that would otherwise take many days of manual work. This combination also lets us do screen scraping when we need graphical information as well.

There are also cases where your application needs to scrape data regularly, but the target site uses JavaScript to deliver it. The web server that hosts your application rarely has a graphical user interface in which a real browser such as Google Chrome or Mozilla Firefox could be launched. This is not a problem for PhantomJS, a headless browser well suited to servers, built for website testing and page automation and based on WebKit, as are Safari and a number of other browsers. For a new application, however, it is better to use Google Chrome's headless mode, since work on PhantomJS stopped after that mode was released. Headless Chrome is heavier than PhantomJS, but it is more reliable for screen scraping because it processes web content the same way Google Chrome does in normal mode.

P.S.: Keep in mind that if a target site noticeably changes its layout, the parser code written for it will probably break. A moderate change can be fixed quickly, but if the way the data is retrieved changes, the parser may have to be rewritten almost completely.

Victor Rak

Backend Developer
