How is it that Scrapebox does not scrape all URLs in Google's index?

How is it that Scrapebox does not scrape all URLs in Google's index? - theone - 04-27-2014

Not all URLs I see in Google's index are scraped by Scrapebox. Like this one:

http://www.freshhomebuilders.co.uk/interior/painting-and-decorating-in-london/painters-and-decorators-hampstead/

You will see it in Google's index but not in Scrapebox.

RE: How is it that Scrapebox does not scrape all URLs in Google's index? - loopline - 04-28-2014

(04-27-2014, 07:08 AM)theone Wrote: Not all URLs I see in Google's index are scraped by Scrapebox. Like this one:

http://www.freshhomebuilders.co.uk/interior/painting-and-decorating-in-london/painters-and-decorators-hampstead/

You will see it in Google's index but not in Scrapebox.

Scrapebox acts just like a browser, so Scrapebox is scraping the URLs that Google gives back. If you don't "see it in Scrapebox" then Google isn't giving it back.

This could be because Google gives back all manner of different result sets based on IP, or it could be other things.

But if you can provide more detail we can help you more specifically.

What footprint/keyword are you using to see the result in google in a browser but not in scrapebox?

Which harvester in scrapebox are you using, custom, single or multi?

RE: How is it that Scrapebox does not scrape all URLs in Google's index? - paki250 - 05-09-2014

It does not work in Google. You should search for it in the Yahoo search engine.

RE: How is it that Scrapebox does not scrape all URLs in Google's index? - weechoochter - 04-10-2015

Hi. I really need to scrape the top 10 URLs in Google but the results are very inconsistent.
http://i.imgur.com/skUitHF.jpg
Sometimes a keyword will deliver only similar results to Google, but that's OK as it could be a different IP or whatever. Other times I'll just get a list of YouTube videos or URLs of a domain connected to that keyword.
I am using the multi-threaded harvester.
I would really appreciate any help on this. Thanks

RE: How is it that Scrapebox does not scrape all URLs in Google's index? - loopline - 04-10-2015

Well, if you look at the term Scrapebox in Google in a browser, it probably has a 6 pack or something similar, so Scrapebox is seeing all those results as the first 10. Technically that's correct, so it's a matter of what you want vs what is technically correct. You could filter out those domains, for example by removing duplicate domains.
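
To illustrate what "removing duplicate domains" means for a harvested list, here is a minimal Python sketch. It is not how Scrapebox implements the feature; the function name and the sample URLs are made up for the example, and it assumes that stripping a leading "www." is enough to treat two hostnames as the same domain.

```python
from urllib.parse import urlsplit

def remove_duplicate_domains(urls):
    """Keep only the first URL seen for each hostname, preserving order."""
    seen = set()
    kept = []
    for url in urls:
        host = urlsplit(url).netloc.lower()
        # Assumption: "www.example.com" and "example.com" count as one domain.
        if host.startswith("www."):
            host = host[4:]
        if host not in seen:
            seen.add(host)
            kept.append(url)
    return kept

# Illustrative sample, not real harvester output.
harvested = [
    "http://www.scrapebox.com/",
    "http://www.scrapebox.com/v2-beta",
    "https://www.scrapeboxforum.com/",
    "http://scrapebox.com/some-page",  # hypothetical path
]
print(remove_duplicate_domains(harvested))
```

With this sample only the first scrapebox.com URL and the forum URL survive, which is why this kind of filter leaves a single link per domain.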

RE: How is it that Scrapebox does not scrape all URLs in Google's index? - weechoochter - 04-11-2015

thanks, I didn't think of six packs etc.
When I set to scrape the first 40 links of google I get 40 links from that domain. That's way more than a sixpack so scrapebox is not acting like a browser here and I don't know where it is getting these links from because they are not showing in any browser. This is a core function of scrapebox and it seems a bit hit n' miss.
Thanks for the duplicates suggestion. I could do that but then for those keywords I would only get 1 link instead of the 10 that I need.

RE: How is it that Scrapebox does not scrape all URLs in Google's index? - loopline - 04-11-2015

My first problem is that I can't duplicate your issue: across a half dozen servers and also on my home and office machines it all works perfectly. Scrapebox has literally tens of thousands of users, and in 5 years of using Scrapebox and being heavily involved I've never heard of anyone else having this issue.

Try unchecking the use proxies box. Does it still do the same thing?

Try downloading a fresh copy of Scrapebox to make sure you do not have some corrupt files and/or try V2.

http://www.scrapebox.com/v2-beta

RE: How is it that Scrapebox does not scrape all URLs in Google's index? - weechoochter - 04-12-2015

Hi and thanks for your suggestions.
My IP is blocked so I can't try it without proxies until it's unblocked. I tried it with only one proxy but no change.
Downloaded V1.16.4 again and same thing.
I downloaded V2 (nice) and used server proxies and the result is exactly the same.
I get 40 links from the scrapebox.com domain when scraping for the word scrapebox.
It is understandable that many people are using scrapebox and not seeing this problem but it is also conceivable that they are scraping millions of urls and just not looking at them.
In this case I only need the first 10 URLs for every keyword.
So you get different results when scraping the first 40 google urls?
I can video it if you want :-)
Any other suggestions would be appreciated
Thanks for your help.

RE: How is it that Scrapebox does not scrape all URLs in Google's index? - loopline - 04-12-2015

OK, so here is what's happening: it's working with 100 results per page, and if you do that in your browser you will see a "more results from scrapebox.com" link under the page. Those URLs are being included in the non-JavaScript version of the results, so there are something like 63 URLs in total, and if you try to get around 75 results you will start to see some YouTube URLs etc. I'm not sure what, if anything, can be done about it, as Google seems to be including the results in the no-JS version of the page, whereas in a browser that most likely wouldn't be the case because JS would sort it out. But I'll let them know.
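
If you want to check whether a harvested list is dominated by one domain like this, a rough way is to count URLs per hostname. The sketch below is only an illustration in Python; the function name and the sample list are hypothetical and not real Scrapebox output.

```python
from collections import Counter
from urllib.parse import urlsplit

def count_by_domain(urls):
    """Return a Counter mapping each hostname to its number of URLs."""
    counts = Counter()
    for url in urls:
        counts[urlsplit(url).netloc.lower()] += 1
    return counts

# Illustrative sample, not real harvester output.
sample = [
    "http://www.scrapebox.com/",
    "http://www.scrapebox.com/v2-beta",
    "http://www.scrapebox.com/some-page",  # hypothetical path
    "https://www.youtube.com/watch?v=example",
]
for host, n in count_by_domain(sample).most_common():
    print(n, host)
```

A hostname with a count far above the rest is a hint that the extra "more results from ..." links were picked up.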

RE: How is it that Scrapebox does not scrape all URLs in Google's index? - loopline - 04-13-2015

Because the harvester is geared towards getting the most URLs using the fewest requests and the least amount of bandwidth, it uses a useragent that returns an older style Google design. The pages Google returns to modern browsers are huge, with a lot of useless fluff/HTML/JS which can add up to GBs of additional bandwidth consumed on big scrapes.

If you want the full fluff version like a browser, in the custom harvester you just need to change the useragent to something modern like Chrome v41:

Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.101 Safari/537.36

And remove the &num=100 from the URL string.

You do this under settings >> harvester engine configuration. Click google and then remove the &num=100 from the string and update the useragent and then click update engine, or save it as a new engine.

Bear in mind that the extra fluff of the page may slow things down and will use more bandwidth.
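
To make the difference between the two setups concrete, here is a small Python sketch of a query URL with and without num=100, plus the modern useragent as a request header. This is not the Scrapebox engine string (that is not shown in this thread); the base URL, the build_search_url helper and the headers dict are assumptions for illustration only, and nothing here sends a request.

```python
from urllib.parse import urlencode

# Useragent quoted above (Chrome v41).
MODERN_UA = (
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/41.0.2272.101 Safari/537.36"
)

def build_search_url(keyword, results_per_page=None):
    """Build a search URL; include num only when results_per_page is set."""
    params = {"q": keyword}
    if results_per_page is not None:
        params["num"] = results_per_page
    # Assumed base URL, used here only to show where num=100 sits.
    return "https://www.google.com/search?" + urlencode(params)

# Default-style query: 100 results per page.
print(build_search_url("scrapebox", 100))

# Browser-like query: num removed, modern useragent in the headers.
print(build_search_url("scrapebox"))
headers = {"User-Agent": MODERN_UA}
```

The first URL ends in &num=100 and the second does not, which is the only change to the query string; the useragent is changed separately.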