ScrapeBox Forum
Email scraping large site - Printable Version




Email scraping large site - scrapingby - 05-12-2017

So I'm trying to scrape emails from this large site and am hitting my limits.
I'm using an 18k seed list of URLs with the "grab emails from URL list" option, but
SB kept crashing and I lost the harvested emails. Now I'm trying with
just 10 connections (no proxies yet) and it takes ages.

As loopline mentioned, I probably need proxies.
Now the question is: how many do I need?

The site in question has 20 million pages indexed.
If the crawler goes through URLs at a rate of X per hour/minute/second,
can I assume that if I use more proxies/connections I can simply multiply
that rate by the number of proxies?

It's been an hour now and it went through roughly 70k URLs.
Can I assume/extrapolate that it will reach 20 million pages in 285 hours, more
or less? Or is it not that simple?


Please help... I feel stupid.


RE: Email scraping large site - loopline - 06-03-2017

My experience has been that with that size of a scraping run things aren't always linear, but you can try and see. I would get some shared proxies, which should work fine and be inexpensive.
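
For a rough feel of the numbers in the question, here is a minimal back-of-the-envelope sketch in Python. It is not anything ScrapeBox provides; `estimate_hours`, `scale` and `efficiency` are hypothetical names made up for this illustration. It assumes the observed rate of 70k URLs per hour stays constant, and it treats "not always linear" as a simple efficiency factor below 1.0, which is only a guess at how real runs behave.

```python
def estimate_hours(total_pages, pages_per_hour, scale=1.0, efficiency=1.0):
    """Estimate hours needed to process total_pages.

    pages_per_hour: rate observed on the current setup.
    scale: hypothetical multiplier for added connections/proxies
           (e.g. 5.0 means five times as many as now).
    efficiency: hypothetical fraction of the ideal linear speedup that is
                actually achieved (1.0 = perfectly linear, an assumption).
    """
    effective_rate = pages_per_hour * scale * efficiency
    return total_pages / effective_rate


TOTAL_PAGES = 20_000_000   # pages indexed, from the original post
OBSERVED_RATE = 70_000     # URLs processed in the first hour, from the original post

# Current setup, assuming the rate never changes.
baseline = estimate_hours(TOTAL_PAGES, OBSERVED_RATE)
print(f"current setup: {baseline:.0f} hours (~{baseline / 24:.0f} days)")

# Five times the connections, if scaling were perfectly linear.
linear = estimate_hours(TOTAL_PAGES, OBSERVED_RATE, scale=5)
print(f"5x, perfectly linear: {linear:.0f} hours")

# Same, but only half of the ideal speedup is achieved (made-up figure).
degraded = estimate_hours(TOTAL_PAGES, OBSERVED_RATE, scale=5, efficiency=0.5)
print(f"5x, 50% efficiency: {degraded:.0f} hours")
```

With the figures from the post the baseline comes out at about 286 hours, or roughly 12 days, so the 285-hour guess is right as long as the rate holds. The scaled estimates are only as good as the `efficiency` value, which has to be measured on a real test run rather than assumed.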