07-31-2015, 07:24 PM
Unfortunately, what has happened is that thomsonlocal has blocked the user agent the addon uses. The addon is tested with a user agent chosen to work with as many sites as possible.
I manually submitted a GET request using that same user agent and it came back 503 immediately, but any other agent returns a 200.
So basically they must have figured out that you (and/or other people) were scraping, and it generated enough traffic that they decided to block the user agent and send back a 502/503 instead of a standard 403, so you would think their server is having trouble and leave them alone. That's my guess.
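If you want to reproduce that check yourself, a few lines of Python will do it. This is just a minimal sketch with the standard library; the user agent string and URL here are placeholders, not the ones the addon actually sends:

```python
import urllib.request
import urllib.error

def check_agent(url, user_agent):
    """Send a GET with the given User-Agent and return the HTTP status code."""
    req = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(req, timeout=10) as resp:
            return resp.getcode()
    except urllib.error.HTTPError as e:
        # Blocked responses (403, 502, 503) arrive as HTTPError exceptions
        return e.code

def is_blocked(status):
    """Treat a 403 or the fake 502/503 as a block; a 200 means the agent passes."""
    return status in (403, 502, 503)
```

Run `check_agent()` once with the addon's agent and once with a normal browser agent; if only the first one comes back as blocked, the server is filtering on the User-Agent header.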
The user agent can't be changed, and if it were changed it would break many other sites and then other users would complain.
You could probably build your own URL grabber with the ScrapeBox custom data grabber; I tried a sample GET request with it and it's not blocked. However, if you did a lot of heavy scraping they would probably block that as well, and you can't change the user agent there either.
I suppose you could try to build it in as a custom engine for the harvester, since that's meant to harvest URLs and you can select the user agent there. That said, you would probably want to change the user agents periodically to avoid getting any one of them banned.
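The rotation idea is simple enough to sketch: keep a pool of agents and cycle through them so no single one carries all the traffic. The agent strings below are just illustrative stand-ins, not real browser identifiers:

```python
import itertools

# Illustrative pool; swap in whatever agent strings you actually want to rotate.
USER_AGENTS = [
    "AgentA/1.0",
    "AgentB/2.0",
    "AgentC/3.0",
]

# itertools.cycle loops over the pool forever in round-robin order
_rotation = itertools.cycle(USER_AGENTS)

def next_agent():
    """Return the next user agent in the rotation."""
    return next(_rotation)
```

Each request then calls `next_agent()` for its User-Agent header, so the load spreads evenly across the pool instead of hammering the site with one identity.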