ScrapeBox Forum
Yelp Scrapper Harvesting Guide - Printable Version


Yelp Scrapper Harvesting Guide - tirmizi - 12-03-2018

Hi Guys,
 
Yesterday I ran a campaign to harvest a specific niche and it scraped 150K records. After I cleaned them up I was left with only 70 records; the rest were all duplicates. Is there a way to prevent duplicates from being harvested in the first place? For example, could I set a condition on email, name, or website so that I only get 1 record per name, website, or address? Otherwise it was a total waste of time and I wouldn't want the same thing to happen again, as I am after a unique email for each specific company / organization.

Also, for harvesting, please share a footprint that will ensure the keywords I harvest won't return duplicate URLs or duplicate emails.

Thanks,


RE: Yelp Scrapper Harvesting Guide - loopline - 12-05-2018

Are you using the ypscraper or just harvesting google?

I don't have a footprint to stop duplicates.


RE: Yelp Scrapper Harvesting Guide - tirmizi - 12-05-2018

(12-05-2018, 05:06 AM)loopline Wrote: Are you using the ypscraper or just harvesting google?

I don't have a footprint to stop duplicates.

Yscraper


RE: Yelp Scrapper Harvesting Guide - loopline - 12-05-2018

I would guess you're using locations that are too close together. Yellow pages in general has a lot of overlap between results and locations, so "plumber" in city X can show up for tons of cities that are either close by or not even that close by, depending on density and various factors.

I've never seen 150K go down to 70, but if the areas and keywords are related then there may just not be that many results that are actually unique. If that's the case, I would reduce the number of results you pull for each search, then broaden your keywords and locations and try again.
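
Since there is no footprint that stops duplicates at harvest time, the "1 record per email or website" condition asked about above can at least be applied to the exported results afterwards. What follows is only a rough sketch in Python, not a ScrapeBox or plugin feature. The field names ("name", "email", "website") are assumptions about what the export contains, and the helper names are made up for illustration. Records with a blank value in the chosen field are dropped, since the goal here is a unique email per company.

```python
from urllib.parse import urlparse


def normalize_email(email):
    # Treat "Info@Acme.com " and "info@acme.com" as the same address.
    return email.strip().lower()


def normalize_website(url):
    # Reduce a URL to its bare host so that http/https, "www." and
    # trailing slashes do not create false "unique" entries.
    url = url.strip().lower()
    if not url:
        return ""
    if "://" not in url:
        url = "http://" + url  # urlparse needs a scheme to find the host
    host = urlparse(url).netloc
    if host.startswith("www."):
        host = host[4:]
    return host


def dedupe(records, field, normalize):
    # Keep only the first record seen for each normalized value of `field`.
    # `records` is a list of dicts; `field` is a hypothetical column name.
    seen = set()
    unique = []
    for rec in records:
        key = normalize(rec.get(field, ""))
        if not key or key in seen:
            continue  # skip blanks and anything already kept
        seen.add(key)
        unique.append(rec)
    return unique
```

If the export is a CSV file, the rows could be loaded with `csv.DictReader` and then passed through `dedupe(rows, "email", normalize_email)` or `dedupe(rows, "website", normalize_website)`. Running both in sequence gives one record per email and per website.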