Alex Watt


Scraping with Selenium

In college, I had a friend with a part-time research job. One time we were chatting and he told me he had to copy and paste data from a government website into a Word document. He explained that, at his current pace, he had over 20 hours of work left and wasn’t sure how he would get it all done alongside his other responsibilities.

I asked him if he was interested in getting it done faster with software. He said yes, and I had some time, so I wrote a small Python program with Beautiful Soup to scrape the data for him. It was pretty easy: The data was on a public website, the pagination just meant passing a number into a query string, and there were obvious CSS selectors for targeting the right data. I wasn’t the fastest; I wanted to explain what I was doing as I went, and it took me almost two hours. Still, the result amazed my friend, because it saved him at least 18 hours of work.

Recently, I ran into another situation where scraping would be useful. Unfortunately, it didn’t look to be so easy this time.

I needed to pull some data from a web app. The web app has a page where you can look up data by an identifier; the problem is that looking up one identifier at a time is slow. I would get a lot of value out of automatically pulling the data for hundreds of identifiers every week and storing the results in a database for querying, but only if the cost was low; it wasn’t worth doing manually week after week.

Since this web app had no documented API, I spent some time looking at the requests being made, using Developer Tools in Chrome. I hoped to find some REST API behind the lookup. If I had found such an API, I might not have needed to scrape data at all; I might instead have found some way to get an active user session token and integrated directly with the REST API. That would have been simpler and faster: Instead of rendering an entire page, I could have made a single API call. API responses are usually structured, lightweight JSON, while web pages carry extra HTML, JavaScript, and styling that make scraping more fragile. API calls also tend to be more stable over time, whereas scraping relies on the structure of the page remaining unchanged.
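Had there been such an endpoint, the integration might have been just a handful of lines. Here is a rough sketch of what I mean; the URL, header, and response shape are all made up for illustration:

```ruby
require 'net/http'
require 'json'
require 'uri'

# Hypothetical sketch: if the lookup had been backed by a REST endpoint,
# one authenticated request per identifier would have done the job.
# The URL, header, and response shape are invented for illustration.
def lookup(identifier, session_token)
  uri = URI("https://app.example.com/api/lookup?id=#{identifier}")
  request = Net::HTTP::Get.new(uri)
  request['Authorization'] = "Bearer #{session_token}"

  response = Net::HTTP.start(uri.host, uri.port, use_ssl: true) do |http|
    http.request(request)
  end

  JSON.parse(response.body) # structured JSON, no HTML to pick apart
end
```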

Since I couldn’t find an API, I resorted to scraping. The end result is a weekly process that pulls data from the web app. It uses a long enough sleep between requests that the script doesn’t flood the target app with traffic.
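The pacing itself is nothing fancy. A minimal sketch of the loop, where the identifier list, the lookup, and the persistence step are placeholders for the real thing:

```ruby
# Minimal sketch of the weekly sync loop. The identifier list, the lookup,
# and the database write are all placeholders; so is the pause length.
identifiers = %w[1001 1002 1003] # stand-in for the real list of hundreds

identifiers.each do |identifier|
  record = fetch_data_for(identifier) # placeholder: the browser-backed lookup
  save_record(record)                 # placeholder: the database write
  sleep 30 # space out requests so the target app never sees a burst of traffic
end
```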

Disclaimer: Ethics of Scraping

Not all sites allow scraping. Before scraping, you should check the terms of service, the site’s robots.txt, etc. I am not a lawyer, and this is not legal advice; you should consult a legal expert for that!

Selenium

I used Selenium for this because I needed to simulate the browser’s activity (unlike the college project above, this app had too much JavaScript magic for plain HTML parsing). I would have preferred to find and use a REST API, but I couldn’t. I decided that simulating the browser was reasonable: I have been using this app for years and have never noticed a frontend change, so it seems pretty unlikely to change anytime soon.

First, I downloaded the Selenium plugin for Chrome (Selenium IDE) and used it to record a script of me using the app myself.

Then, I used the plugin to export the recorded routine to Ruby, and tried running the script to reproduce the results. The script included logging into the web app with a username and password. At this point, I started tweaking some things, including adding sleep calls to increase the odds that the app had finished loading certain data before the script read it. (This is not robust, but it works well enough for what I am doing.)
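For a sense of what the tweaked script looks like, here is a stripped-down sketch; the URLs, selectors, and wait times are placeholders rather than the real app’s:

```ruby
require 'selenium-webdriver'

# Stripped-down sketch of the exported-and-tweaked script. Every URL and
# selector here is a placeholder; the real ones came from the recording.
driver = Selenium::WebDriver.for :chrome

# Log in with credentials kept outside the script.
driver.navigate.to 'https://app.example.com/login'
driver.find_element(name: 'username').send_keys ENV.fetch('APP_USERNAME')
driver.find_element(name: 'password').send_keys ENV.fetch('APP_PASSWORD')
driver.find_element(css: 'button[type="submit"]').click
sleep 5 # crude, but gives the app time to finish logging in

# Look up a single identifier and grab the rendered result.
driver.navigate.to 'https://app.example.com/lookup?id=1001'
sleep 5 # again, wait for the JavaScript to finish rendering the data
puts driver.find_element(css: '#results').text

driver.quit
```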

Once I got that script working, the last trick was getting it running in production. Chrome has a flavor designed for testing and automation (Chrome for Testing), which I added to my production build. I also figured out the flags I needed to pass to Chrome through the Selenium driver: --headless, --no-sandbox, and --disable-dev-shm-usage. With some trial and error, I got it working!
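Wiring those flags into the Ruby driver looks roughly like this; the Chrome for Testing binary path is just an example of where it might land in a build:

```ruby
require 'selenium-webdriver'

# Headless setup for the production run. The binary path is an example of
# pointing the driver at a Chrome for Testing install; adjust to your build.
options = Selenium::WebDriver::Chrome::Options.new
options.binary = '/opt/chrome-for-testing/chrome'
options.add_argument('--headless')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')

driver = Selenium::WebDriver.for(:chrome, options: options)
```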

Now, the data is synced regularly from the other web app into a database, and I have a playbook for doing something similar in the future.

Playwright

After I did this project, I learned that Playwright is another option for browser automation. Playwright has built-in features for waiting, which sounds more robust than the explicit sleep calls I used with Selenium, and it supports multiple browsers out of the box. If I were starting this project today, I’d likely evaluate Playwright to see if it provides a smoother experience.

Posted on 08 Feb 2025.