ScrapeBox Forum
email scraping


email scraping - scrapingby - 02-16-2017

I have a huge site that has hundreds of thousands of emails that I'd like to scrape.
All the emails are on the surface, so a simple site:domain.com would suffice.
I succeeded in scraping a few hundred.

Can someone tell me how many proxies I would need to scrape something like 500,000 pages?

Also, how would I best go about organizing this?


RE: email scraping - loopline - 02-17-2017

You can use the grab mails by crawling site function. If you set it at 1 connection you probably don't need any proxies, but I would do 1 proxy for every 1-2 connections.

https://www.youtube.com/watch?v=CykedqJg92w&t=5s
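
As a rough illustration of the rule of thumb above (no proxies at 1 connection, otherwise 1 proxy for every 1-2 connections), the proxy count for a given connection setting could be estimated like this. The function name and the `connections_per_proxy` parameter are made up for this sketch and are not ScrapeBox settings.

```python
import math


def proxies_needed(connections: int, connections_per_proxy: int = 2) -> int:
    """Estimate a proxy count from the '1 proxy per 1-2 connections' rule of thumb."""
    if connections < 1 or connections_per_proxy < 1:
        raise ValueError("both arguments must be at least 1")
    # With a single connection, the advice is that proxies are probably not needed.
    if connections == 1:
        return 0
    # Round up so that no proxy carries more than connections_per_proxy connections.
    return math.ceil(connections / connections_per_proxy)


# Range for 10 connections: the relaxed end (2 per proxy) and the cautious end (1 per proxy).
relaxed = proxies_needed(10, connections_per_proxy=2)
cautious = proxies_needed(10, connections_per_proxy=1)
```

Note that the estimate depends on the number of connections, not on the number of pages to scrape.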


RE: email scraping - scrapingby - 02-20-2017

ty loopline,
that works like a charm.

Generally speaking, does having more proxies make things faster?
Should I spend more time on harvesting free proxies or invest in premium proxies?

Does it scrape only within the site, or does it also follow links to outside domains?


RE: email scraping - loopline - 02-20-2017

It can. I mean, it keeps the end site from blocking your IP, because you're spreading the requests out across proxies. But if the end site's server can't handle all the requests you throw at it, then it's just slowing things down. So "it depends" is about the best answer I can give you.
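
The idea of spreading requests across proxies can be sketched as a simple round-robin assignment. This is only an illustration of the concept, not how ScrapeBox assigns proxies internally; the function name, URLs, and proxy addresses below are placeholders.

```python
from itertools import cycle


def assign_proxies(urls, proxies):
    """Pair each URL with a proxy in round-robin order so no single IP carries every request."""
    if not proxies:
        # Without proxies, every request would go out from the local IP.
        return [(url, None) for url in urls]
    rotation = cycle(proxies)
    return [(url, next(rotation)) for url in urls]


# Placeholder values for illustration only.
pages = [f"https://example.com/page/{n}" for n in range(1, 6)]
plan = assign_proxies(pages, ["10.0.0.1:8080", "10.0.0.2:8080"])
```

Rotation like this only changes which IP the end site sees; it does nothing for a server that is already struggling with the total request rate.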

You don't want to use free proxies for anything except scraping from the search engines. They are built for speed at the cost of accuracy, but with email scraping you want premium proxies or no proxies.

It's only scraping inside the domain for email scraping; it's not following outside links.
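
For readers curious what "staying inside the domain" means for a crawler, here is a minimal sketch of a same-host link filter. It is an assumption about how such a filter might look in general, not ScrapeBox's actual code; `same_domain_links` and the example URLs are hypothetical.

```python
from urllib.parse import urljoin, urlparse


def same_domain_links(page_url, hrefs):
    """Resolve hrefs against page_url and keep only http(s) links on the same host."""
    host = urlparse(page_url).netloc.lower()
    kept = []
    for href in hrefs:
        absolute = urljoin(page_url, href)  # turns relative links into absolute ones
        parsed = urlparse(absolute)
        if parsed.scheme in ("http", "https") and parsed.netloc.lower() == host:
            kept.append(absolute)
    return kept


links = same_domain_links(
    "https://example.com/members/",
    [
        "/contact",                         # same host, root-relative
        "page2.html",                       # same host, relative to the current page
        "https://other.example.org/about",  # different host, dropped
        "mailto:someone@example.com",       # not a page link, dropped
    ],
)
```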


RE: email scraping - scrapingby - 03-06-2017

"It's only scraping inside the domain for email scraping; it's not following outside links."

That's what I thought too. But for some reason I found emails from outside the domain.
Does the depth level influence the likelihood of going outside the domain, or did something else happen?
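
One possible explanation, which is a guess and not confirmed anywhere in this thread: a crawl that stays inside the domain can still pick up addresses at other domains, because pages on the site may simply list them. If only addresses at the site's own domain are wanted, the results could be filtered afterwards. A minimal sketch, using a simplified email pattern and hypothetical names:

```python
import re

# Simplified pattern; real-world address syntax is more permissive than this.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def emails_on_domain(text, domain):
    """Return unique lowercased emails whose host is `domain` or one of its subdomains."""
    domain = domain.lower()
    found = []
    for email in EMAIL_RE.findall(text):
        email = email.lower()
        host = email.rsplit("@", 1)[1]
        if (host == domain or host.endswith("." + domain)) and email not in found:
            found.append(email)
    return found


sample = "Contact anna@example.com or bob@mail.example.com, partner: carl@other.org."
on_site = emails_on_domain(sample, "example.com")
```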