How to scrape the web and not get caught

Published by Karol Majta on 19th Apr 2018

Tiny Endian

This article will be a quick one. It's a few-lines-of-code recipe for mitigating IP restrictions and WAFs when crawling the web. If you're reading this, you have probably already tried web scraping. It's all easy breezy until one day someone managing the website you're harvesting data from realizes what is happening and blocks your IP. If you run your scrapers in an automated way, you'll start seeing them fail miserably. You'll probably want to solve this problem fast, before any precious data slips through your fingers.

Say hello to proxies

While it might be tempting to use one of the paid providers of such services, it isn't that hard to craft a home-baked solution that will cost you no money. This is thanks to an awesome project: scrapy-rotating-proxies.

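Just add it to your project as described in its documentation: register the rotating-proxy middlewares in your Scrapy settings and point the ROTATING_PROXY_LIST_PATH setting at a proxies.txt file.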
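For reference, the setup from the scrapy-rotating-proxies documentation can be sketched roughly like this. Treat it as a minimal sketch: the middleware priorities 610 and 620 are the ones the project's README suggests, and the relative path assumes proxies.txt sits in the directory you start scrapy from.

```python
# settings.py (sketch): enable scrapy-rotating-proxies

# Assumption: proxies.txt lives in the directory scrapy is started from.
ROTATING_PROXY_LIST_PATH = "proxies.txt"

DOWNLOADER_MIDDLEWARES = {
    # ... your other downloader middlewares, if any ...
    "rotating_proxies.middlewares.RotatingProxyMiddleware": 610,
    "rotating_proxies.middlewares.BanDetectionMiddleware": 620,
}
```

If your project already defines DOWNLOADER_MIDDLEWARES, merge the two entries into the existing dictionary instead of redefining it.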

So, where do you get this proxies.txt list from? This is easier than you think. I was not able to find a Python project that would provide a list of free proxies out of the box, but there is a proxy-lists Node module made exactly for that!

Installation is extremely simple, and so is usage:

proxy-lists getProxies --sources-white-list="gatherproxy,sockslist"

This will save a bulky list of proxies in your proxies.txt file.
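Free proxy lists tend to be messy, so it can be worth a quick sanity check before handing the file to scrapy. Below is a small sketch of that idea. The helper name clean_proxy_lines is made up for this example, and it assumes the file holds one proxy per line as host:port with an optional scheme:// prefix; if your list looks different, adjust the pattern.

```python
import re

# Assumption: each entry is "host:port", optionally prefixed with "scheme://".
PROXY_RE = re.compile(r"^(?:\w+://)?[\w.-]+:\d{1,5}$")


def clean_proxy_lines(lines):
    """Return unique, well-formed proxy entries, keeping their original order."""
    seen = set()
    proxies = []
    for line in lines:
        entry = line.strip()
        # Skip blank lines, malformed entries and duplicates.
        if entry and PROXY_RE.match(entry) and entry not in seen:
            seen.add(entry)
            proxies.append(entry)
    return proxies
```

You could call it as clean_proxy_lines(open("proxies.txt")) and write the result back, or just use it to print how many usable entries you got.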

Say hello to Makefiles

Now you're essentially running a mixed-language project (with Python for scrapy and JS for proxy-lists). You need a way to synchronize these two tools. What could be better than the lingua franca of builds and orchestration, the Makefile?

Just create a target:

# Fetch a fresh proxy list, crawl through it, then remove the list.
all:
	yarn run proxy-lists getProxies --sources-white-list=$$PROXIES_SOURCE_LIST
	scrapy crawl mycrawler -o myoutput.csv
	rm -r proxies.txt

And after you're done with that, your build step in Jenkins becomes just:

make all

Things to consider

Of course there's an overhead to pay for using this. After introducing proxies, my crawl times grew by an order of magnitude, from minutes to hours! But hey, it works and it's free, so if you're not willing to pay for data in cash, you need to pay for it with time. Luckily for you, with this sweet hack it's the build server's time, not yours.
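If the slowdown hurts too much, a few settings may claw some of that time back. This is only a sketch: the setting names come from Scrapy and the scrapy-rotating-proxies README, but the values are illustrative guesses I have not benchmarked, so tune them against your own crawl.

```python
# settings.py (sketch): waste less time on dead or slow proxies

# Scrapy: give up on a request after 15 seconds (the default is 180).
DOWNLOAD_TIMEOUT = 15

# scrapy-rotating-proxies: how many times a page is retried through
# different proxies before it is counted as failed.
ROTATING_PROXY_PAGE_RETRY_TIMES = 10

# scrapy-rotating-proxies: base back-off, in seconds, before a proxy
# marked as dead is checked again.
ROTATING_PROXY_BACKOFF_BASE = 300

# scrapy-rotating-proxies: keep re-checking proxies instead of stopping
# the spider when no live proxy is left.
ROTATING_PROXY_CLOSE_SPIDER = False
```

A shorter timeout trades a few lost responses from slow but working proxies for much less waiting on dead ones.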