Scraping a directory problem
#1
Hi

I am a newbie, so please excuse me if I'm missing something simple.
Last week I began successfully scraping a directory (thomsonlocal.com) for emails. I first scraped the directory for thousands of internal pages, then scraped those pages for external links, and then scraped those links for emails.

This process worked fine and I managed to get quite a few emails. Then suddenly I found that I could no longer get outbound links from the Thomson Local pages; I received a 502 error.

I tried changing my proxies, this time with a 30-second delay and only 1 connection, but I receive the same error. I have now changed my proxies numerous times and informed buyproxies.com of my problem; they have told me that as long as the proxies are Google passed, this should be OK.

I would really appreciate any help in this as I am completely stuck.

I look forward to hearing from you.
Reply
#2
So you can't get outbound links from thomsonlocal, and you're using the link extractor addon to get the links?

Maybe you can give a screenshot or more details.

If you are using an addon, make sure you are using Scrapebox V2 to start with, and that you have the use proxies box checked before you start the addon. Addons only pull proxies upon startup, so if you make changes to the proxies after the addon is started, you must close and restart the addon for those changes to take effect.

If you can give screenshots and a few example urls that don't work I can test them and give you more feedback.
Reply
#3
Hi Looplines,

I have attached a screenshot. Buyproxies has stated that the proxies seem fine when viewing the website via Firefox, and that the problem occurs as soon as the pages are called via Scrapebox. That would be fine, apart from the fact that I was successfully scraping the website with Scrapebox only last week.
I was managing to scrape loads of outbound links with no problem, and now it seems to throw up the 502 error as soon as I start scraping, even after changing proxies dozens of times.


Reply
#4
Unfortunately, what has happened is that thomsonlocal has blocked the useragent that the addon uses. The addon is tested, and a useragent that works with the most possible sites is used.

I manually submitted a GET request using the same useragent and it came back 503 immediately, but when I use any other agent it gives a 200.
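For anyone who wants to repeat that check outside of Scrapebox, here is a rough sketch in Python using only the standard library. It is not how the addon works internally; the function names are made up for illustration, and the agent strings and URL in the usage comment are placeholders you would replace with your own.

```python
import urllib.error
import urllib.request


def fetch_status(url, user_agent, timeout=30):
    """Return the HTTP status code for a GET sent with the given User-Agent."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # urllib raises on 4xx/5xx responses; the status code is still what we want.
        return error.code


def compare_user_agents(url, user_agents, fetch=fetch_status):
    """Map each User-Agent string to the status code the server returns for it.

    `fetch` is injectable so the comparison can be tried without network access.
    """
    return {agent: fetch(url, agent) for agent in user_agents}


# Example usage (placeholder values, not the addon's real agent string):
# statuses = compare_user_agents(
#     "http://www.example.com/some-listing-page",
#     ["AgentUsedByTheAddon/1.0", "SomeOtherAgent/1.0"],
# )
# print(statuses)
```

If one agent consistently comes back 502/503 while the others return 200 for the same URL through the same proxy, the block is most likely on the agent string rather than on the proxy.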

So basically they must have figured out that you, and/or some other people, were scraping, and it must have generated enough traffic that they just decided to block the user agent and send back a 502/503 instead of a standard 403, so that you would think their server is having trouble and leave them alone. That's my guess.

The user agent can't be changed, and if it were changed it would just break many other sites and then other users would complain.

You could probably build your own url grabber with the Scrapebox custom data grabber; I tried a sample GET request with it and it's not blocked. However, if you did a lot of heavy scraping they would probably block that as well, and you can't change the user agent there either.

I suppose you could try to build it in as a custom engine for the harvester, as it's meant to harvest urls and you can select the user agent there. That said, you would probably want to change the user agents periodically to avoid getting any one of them banned.
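To illustrate the idea of changing agents periodically, here is a small Python sketch. It is a generic illustration and not a Scrapebox feature or setting; `rotate_user_agents` and `batch_size` are hypothetical names, and the agent strings would be whatever you choose to supply.

```python
import itertools


def rotate_user_agents(urls, user_agents, batch_size):
    """Yield (url, user_agent) pairs, switching to the next agent every batch_size URLs."""
    if not user_agents:
        raise ValueError("user_agents must not be empty")
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    agents = itertools.cycle(user_agents)  # wraps around when the list runs out
    current = next(agents)
    for index, url in enumerate(urls):
        # Move on to the next agent at the start of every new batch.
        if index and index % batch_size == 0:
            current = next(agents)
        yield url, current
```

Each pair can then be handed to whatever does the actual request, so that no single agent string carries all of the traffic.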
Reply



