This will be a quick one: a recipe of a few lines of code for mitigating IP restrictions and WAFs when crawling the web. If you're reading this, you've probably already tried web scraping. It's all easy breezy until one day someone managing the website you're harvesting data from realizes what is happening and blocks your IP. If you run your scrapers in an automated way, you'll start seeing them fail miserably. You'll probably want to solve this problem fast, before any precious data slips through your fingers.
While it might be tempting to use one of the paid providers of such services, it isn't that hard to craft a home-baked solution that costs you no money. This is thanks to an awesome project, scrapy-rotating-proxies.
Just add it to your project as described in the documentation:
# settings.py
# ...
ROTATING_PROXY_LIST = [
    'proxy1.com:8000',
    'proxy2.com:8031',
    # ...
]
ROTATING_PROXY_LIST_PATH = 'proxies.txt'
# ...
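The snippet above elides the rest of the settings. As a minimal sketch of what the remaining wiring might look like, the block below enables the two downloader middlewares with the paths and priorities given in the scrapy-rotating-proxies README. It assumes a file-based list only; per that README, `ROTATING_PROXY_LIST_PATH` takes precedence over `ROTATING_PROXY_LIST` when both are set, so the inline list is left out here.

```python
# settings.py (sketch): enable proxy rotation in a Scrapy project.
# Middleware paths and priorities follow the scrapy-rotating-proxies README;
# merge these entries into your own DOWNLOADER_MIDDLEWARES dict if you
# already have one.
DOWNLOADER_MIDDLEWARES = {
    "rotating_proxies.middlewares.RotatingProxyMiddleware": 610,
    "rotating_proxies.middlewares.BanDetectionMiddleware": 620,
}

# File with one proxy per line, e.g. "proxy1.com:8000".
ROTATING_PROXY_LIST_PATH = "proxies.txt"
```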
So, where do you get this proxies.txt list from? This is easier than you think.
I was not able to find a Python project that would provide a list of free proxies out of the box, but there is a node module made exactly for that!
Installation is extremely simple, and so is usage:
proxy-lists getProxies --sources-white-list="gatherproxy,sockslist"
This will save a bulky list of proxies in your
Now you're essentially running a mixed-language project (with Python for scrapy and JS for proxy-lists). You need a way to synchronize these two tools. What would be better than the lingua franca of builds and orchestration, the Makefile?
Just create a target:
all:
	yarn run proxy-lists getProxies --sources-white-list=$$PROXIES_SOURCE_LIST
	scrapy crawl mycrawler -o myoutput.csv
	rm -r proxies.txt
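Free proxy lists tend to contain duplicates and the occasional malformed line, so it may be worth cleaning the file between the two steps of the target. The sketch below is one way to do that. It assumes the list holds one `host:port` entry per line, optionally with a scheme prefix; the function names `clean_proxy_lines` and `clean_proxy_file` are hypothetical helpers, not part of either tool.

```python
import re

# Assumed line format: "host:port", optionally prefixed with a scheme
# such as "http://" or "socks5://".
PROXY_RE = re.compile(r"^(?:[a-z0-9]+://)?[A-Za-z0-9.-]+:(\d{1,5})$")


def clean_proxy_lines(lines):
    """Return unique, well-formed proxy entries in first-seen order."""
    seen = set()
    cleaned = []
    for raw in lines:
        entry = raw.strip()
        match = PROXY_RE.match(entry)
        if not match:
            continue  # skip blank and malformed lines
        if not 1 <= int(match.group(1)) <= 65535:
            continue  # skip impossible port numbers
        if entry not in seen:
            seen.add(entry)
            cleaned.append(entry)
    return cleaned


def clean_proxy_file(path="proxies.txt"):
    """Rewrite the proxy file in place and return the number of entries kept."""
    with open(path, encoding="utf-8") as fh:
        cleaned = clean_proxy_lines(fh)
    with open(path, "w", encoding="utf-8") as fh:
        fh.write("\n".join(cleaned) + ("\n" if cleaned else ""))
    return len(cleaned)
```

Calling `clean_proxy_file()` from a small script placed between the `proxy-lists` and `scrapy` lines of the target would keep the crawl from wasting retries on entries that can never work. It does not check whether a proxy is actually alive; the rotating middleware handles that during the crawl.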
And after you're done with that, your build step in Jenkins becomes just:
Of course there's a price to pay for using this: after introducing proxies, my crawl times grew by an order of magnitude, from minutes to hours! But hey, it works and it's free, so if you're not willing to pay for data in cash, you need to pay for it with time. Luckily for you, with this sweet hack it's the build server's time, not yours.
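If the slowdown hurts, a few settings may help claw some time back. The fragment below is a sketch with illustrative values that have not been benchmarked against any particular site. `DOWNLOAD_TIMEOUT` and `CONCURRENT_REQUESTS` are standard Scrapy settings, and the `ROTATING_PROXY_*` names are taken from the scrapy-rotating-proxies README.

```python
# settings.py (sketch): illustrative values, not recommendations.

# Give up on a slow proxy sooner than Scrapy's default timeout would.
DOWNLOAD_TIMEOUT = 20

# Allow more requests in flight, since many of them will sit waiting on proxies.
CONCURRENT_REQUESTS = 32

# How many different proxies to try before a page is counted as failed.
ROTATING_PROXY_PAGE_RETRY_TIMES = 10

# Backoff (in seconds) before a dead proxy is re-checked, and its upper bound.
ROTATING_PROXY_BACKOFF_BASE = 120
ROTATING_PROXY_BACKOFF_CAP = 1800

# Keep crawling and re-check dead proxies when none are alive.
ROTATING_PROXY_CLOSE_SPIDER = False
```

The trade-off is the usual one: shorter timeouts and more retries discard slow proxies faster, but they also burn through a free list more quickly.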