ScrapeBox Forum
How is it that Scrapebox does not scrape all URLs in Google's index?



RE: How is it that Scrapebox does not scrape all URLs in Google's index? - weechoochter - 04-13-2015

OK, thanks, that sounds logical for the bottom results (although I didn't understand all of it, lol, or why they are included when I limit to 10 results).
In other results I am seeing tons of YouTube URLs.
Great that you will pass it on. Sounds better coming from you :-).
So I removed the &num=100 and got mixed results. The first 10 look promising and include six links from the six pack, but then at about 30 there are 10 YouTube URLs in a row (maybe that's a valid result?). I'll keep testing it. I need the first 10 only for now and it's getting better.
I also tried changing the engine to Chrome v41 by pasting that line of code but only got errors.
Anyway, thanks for all your help :-)


RE: How is it that Scrapebox does not scrape all URLs in Google's index? - loopline - 04-14-2015

What errors did you get? Changing the user agent is a vital piece of the puzzle. Can you post a screenshot?


RE: How is it that Scrapebox does not scrape all URLs in Google's index? - weechoochter - 04-18-2015

I tried to put my errors here but the post is still in moderation, maybe because of the links in it.


RE: How is it that Scrapebox does not scrape all URLs in Google's index? - loopline - 04-18-2015

It's OK, I can see it. It did get stuck in moderation.

Try unchecking the clear cookies box. Cookies actually help with Google harvesting; when there are no cookies, it thinks you're a bot.

Can you go into your ScrapeBox folder >> configuration >> engines.dat and open that engines.dat file in Notepad or similar? Then find the engine you made, probably the last one in the list. Can you copy and paste those engine settings here, like:

[Engine1]
Displayname=Google
QueryString=http://www.google.com/search?complete=0&output=ie&hl=en&q={KEYWORD}&num=100&start={PAGENUM}
MustBeInLink=http
MustNotBeInLink=webcache.google|q=related:
JustBeforeLink=/url?q=
RightAfterLink=&sa=U&
PageStart=0
PageInc=100
NextPageMarker=<span style="display:block;margin-left:53px">|<img border="0" src="nav_next.gif"|src="nav_next_2.gif" width="100"
Translation=%3f=?|%3d==|%26=&|%3a=:|&amp;=&|%2F=/
headerData=
SSLVersion=0
Favicon=/9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAAMCAgMCAgMDAwMEAwMEBQgFBQQEBQoHBwYIDAoMDAsKCwsNDhIQDQ4RDgsLEBYQERMUFRUVDA8XGBYUGBIUFRT/2wBDAQMEBAUEBQkFBQkUDQsNFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBT/wAARCAAQABADASIAAhEBAxEB/8QAHwAAAQUBAQEBAQEAAAAAAAAAAAECAwQFBgcICQoL/8QAtRAAAgEDAwIEAwUFBAQAAAF9AQIDAAQRBRIhMUEGE1FhByJxFDKBkaEII0KxwRVS0fAkM2JyggkKFhcYGRolJicoKSo0NTY3ODk6Q0RFRkdISUpTVFVWV1hZWmNkZWZnaGlqc3R1dnd4eXqDhIWGh4iJipKTlJWWl5iZmqKjpKWmp6ipqrKztLW2t7i5usLDxMXGx8jJytLT1NXW19jZ2uHi4+Tl5ufo6erx8vP09fb3+Pn6/8QAHwEAAwEBAQEBAQEBAQAAAAAAAAECAwQFBgcICQoL/8QAtREAAgECBAQDBAcFBAQAAQJ3AAECAxEEBSExBhJBUQdhcRMiMoEIFEKRobHBCSMzUvAVYnLRChYkNOEl8RcYGRomJygpKjU2Nzg5OkNERUZHSElKU1RVVldYWVpjZGVmZ2hpanN0dXZ3eHl6goOEhYaHiImKkpOUlZaXmJmaoqOkpaanqKmqsrO0tba3uLm6wsPExcbHyMnK0tPU1dbX2Nna4uPk5ebn6Onq8vP09fb3+Pn6/9oADAMBAAIRAxEAPwDpfB2m2/ivVtXfU7lZr8WU11bR3UxX7ZcgqfLLZBJILsBkElQO9WfH8dnoGs6OdJsToGpw2cUl9axSS77W8DNuXLszKcBGxnKliDyKi8KxaVYRa7Z6w8Wl64AiWc+o2rywwsrnzVdArEMRgAlGxg9Mgix488Qt44u9GQNHrHiARmC61C0gZDesW/dDaVUswXA3FQTwOcZP9MvneLTV/Zr1ttv/ACuP4p+R+O+6qHTm/Hf70/wt5n//2Q==
GracePeriod=0
UserAgent=
Selected=1
FollowRelocation=0
AddRelative=
AddFieldValue=
ClearCookies=0
ReadOnly=0
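
For anyone who would rather not scroll through the file by hand, here is a minimal sketch of pulling out the last engine entry with Python. It assumes engines.dat is plain INI-style text like the entry above; the sample text is trimmed from engine entries posted in this thread, and the function name last_engine is made up for illustration.

```python
import configparser

# Trimmed sample in the same shape as the engine entries in this thread.
SAMPLE = """\
[Engine1]
Displayname=Google
QueryString=http://www.google.com/search?complete=0&output=ie&hl=en&q={KEYWORD}&num=100&start={PAGENUM}
Translation=%3f=?|%3d==|%26=&|%3a=:|&amp;=&|%2F=/
UserAgent=

[Engine33]
Displayname=Google Chrome V41
QueryString=http://www.google.com/search?complete=0&hl=en&q={KEYWORD}&start={PAGENUM}&filter=0
UserAgent=Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.101 Safari/537.36
"""


def last_engine(text):
    """Return (section name, settings dict) for the last [EngineN] block."""
    # interpolation=None: values such as "%3f=?" contain "%" and must stay raw.
    # delimiters=("=",): split each line on the first "=" only.
    parser = configparser.ConfigParser(interpolation=None, delimiters=("=",))
    parser.optionxform = str  # keep key case (Displayname, QueryString, ...)
    parser.read_string(text)
    name = parser.sections()[-1]
    return name, dict(parser[name])


name, settings = last_engine(SAMPLE)
print(name)
for key, value in settings.items():
    print(f"{key}={value}")
```

To run it against a real file, read the file's text and pass it to last_engine in place of SAMPLE. Whether the real file needs a specific encoding is not something this thread confirms.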


RE: How is it that Scrapebox does not scrape all URLs in Google's index? - weechoochter - 04-19-2015

Hi, I tried allowing cookies but no luck.
Here are the engine settings:
[Engine33]
Displayname=Google Chrome V41
QueryString=http://www.google.com/search?complete=0&hl=en&q={KEYWORD}&start={PAGENUM}&filter=0
MustBeInTag=
MustNotBeInTag=
MustBeInLink=http
MustNotBeInLink=webcache.google|q=related:
JustBeforeLink=/url?q=
RightAfterLink=&amp;sa=U&amp;
PageStart=0
PageInc=100
NextPageMarker=<span style="display:block;margin-left:53px">|<img border="0" src="nav_next.gif"|src="nav_next_2.gif" width="100"
Translation=%3f=?|%3d==|%26=&|%3a=:|&amp;=&|%2F=/
headerData=
Referer=0
SSLVersion=2
Favicon=/9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAAMCAgMCAgMDAwMEAwMEBQgFBQQEBQoHBwYIDAoMDAsKCwsNDhIQDQ4RDgsLEBYQERMUFRUVDA8XGBYUGBIUFRT/2wBDAQMEBAUEBQkFBQkUDQsNFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBT/wAARCAAQABADASIAAhEBAxEB/8QAHwAAAQUBAQEBAQEAAAAAAAAAAAECAwQFBgcICQoL/8QAtRAAAgEDAwIEAwUFBAQAAAF9AQIDAAQRBRIhMUEGE1FhByJxFDKBkaEII0KxwRVS0fAkM2JyggkKFhcYGRolJicoKSo0NTY3ODk6Q0RFRkdISUpTVFVWV1hZWmNkZWZnaGlqc3R1dnd4eXqDhIWGh4iJipKTlJWWl5iZmqKjpKWmp6ipqrKztLW2t7i5usLDxMXGx8jJytLT1NXW19jZ2uHi4+Tl5ufo6erx8vP09fb3+Pn6/8QAHwEAAwEBAQEBAQEBAQAAAAAAAAECAwQFBgcICQoL/8QAtREAAgECBAQDBAcFBAQAAQJ3AAECAxEEBSExBhJBUQdhcRMiMoEIFEKRobHBCSMzUvAVYnLRChYkNOEl8RcYGRomJygpKjU2Nzg5OkNERUZHSElKU1RVVldYWVpjZGVmZ2hpanN0dXZ3eHl6goOEhYaHiImKkpOUlZaXmJmaoqOkpaanqKmqsrO0tba3uLm6wsPExcbHyMnK0tPU1dbX2Nna4uPk5ebn6Onq8vP09fb3+Pn6/9oADAMBAAIRAxEAPwDpfB2m2/ivVtXfU7lZr8WU11bR3UxX7ZcgqfLLZBJILsBkElQO9WfH8dnoGs6OdJsToGpw2cUl9axSS77W8DNuXLszKcBGxnKliDyKi8KxaVYRa7Z6w8Wl64AiWc+o2rywwsrnzVdArEMRgAlGxg9Mgix488Qt44u9GQNHrHiARmC61C0gZDesW/dDaVUswXA3FQTwOcZP9MvneLTV/Zr1ttv/ACuP4p+R+O+6qHTm/Hf70/wt5n//2Q==
GracePeriod=0
UserAgent=Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.101 Safari/537.36
Selected=1
DetailedSelected=0
FollowRelocation=1
AddRelative=
AddFieldValue=
ClearCookies=0
ListMode=0
ReadOnly=0

Thanks :-)
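
One difference between the two engine entries in this thread is RightAfterLink: one uses &sa=U& and the other uses &amp;sa=U&amp;. Going only by the field names, JustBeforeLink and RightAfterLink look like text markers that bracket each result URL, and Translation looks like a list of find=replace pairs. The sketch below shows that idea in Python. It is a guess at the mechanism, not ScrapeBox's actual code, and the HTML snippet and function names are hypothetical.

```python
def extract_links(html, before, after, must_contain="http"):
    """Collect the text found between each 'before' marker and the next 'after' marker."""
    links = []
    pos = 0
    while True:
        start = html.find(before, pos)
        if start == -1:
            break
        start += len(before)
        end = html.find(after, start)
        if end == -1:
            break
        link = html[start:end]
        if must_contain in link:  # rough stand-in for MustBeInLink
            links.append(link)
        pos = end + len(after)
    return links


def translate(link, table):
    """Apply 'find=replace' pairs separated by '|', in order."""
    for pair in table.split("|"):
        src, _, dst = pair.partition("=")
        link = link.replace(src, dst)
    return link


# Hypothetical result markup with entity-encoded ampersands.
HTML = '<a href="/url?q=http://example.com/page%3fid%3d1&amp;sa=U&amp;ved=x">result</a>'
TABLE = "%3f=?|%3d==|%26=&|%3a=:|&amp;=&|%2F=/"

encoded = extract_links(HTML, "/url?q=", "&amp;sa=U&amp;")
plain = extract_links(HTML, "/url?q=", "&sa=U&")
print([translate(link, TABLE) for link in encoded])
print(plain)
```

Under these assumptions, the end marker only matches when it is written the same way the page source writes it, so an engine whose RightAfterLink does not match the markup would harvest nothing from that page.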


RE: How is it that Scrapebox does not scrape all URLs in Google's index? - loopline - 04-21-2015

I'll mess around and get back to you. The error is coming from the user agent, so we just need a newer user agent that Google accepts.

I have to run at the moment, but basically it's the user agent causing errors on my end. So try a few until you find one that works.

http://www.useragentstring.com/pages/useragentstring.php

has a massive list.


RE: How is it that Scrapebox does not scrape all URLs in Google's index? - weechoochter - 05-05-2015

Thanks. I tried about 15 custom agents and got nothing.
I can't even get any results from V2 using the standard engine from a fresh install. All I get is errors using private and server proxies.
Any help would be appreciated.
Thanks :-)


RE: How is it that Scrapebox does not scrape all URLs in Google's index? - loopline - 05-05-2015

I have been talking with support and they can't replicate anything. I mean, they tried all kinds of things and everything worked fine.

It's tricky when one person can't get it to fail and the other can't get it to work, haha.

So I got it to fail some, but it works 99.9% of the time for me. I'm going to monkey around with some stuff, see if I can get it to fail, log it with some debugging, and send them all the details to see if we can get to the bottom of it.

So if you have any more feedback from tests, please post back, as anything helps.

Otherwise hang loose and we will see if we can't get it sorted over the next few days.


RE: How is it that Scrapebox does not scrape all URLs in Google's index? - loopline - 05-07-2015

OK, update to V.39 and this is fixed. However, you should basically use an https URL string. What was happening was that when http was used, Google redirected to https and the https connection was not being initialized.

Anyway, that is fixed now, but it's better to use an https URL string because Google will see all the http redirects and will probably block the proxy IPs quicker.
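
Switching an engine to an https URL string comes down to changing the scheme on its QueryString line. A minimal sketch in Python, assuming the engine text looks like the entries posted earlier in this thread; the function name use_https is made up, and it only touches lines that start with QueryString=http://.

```python
def use_https(engines_text):
    """Rewrite http:// to https:// on QueryString lines, leaving other lines alone."""
    out = []
    for line in engines_text.splitlines():
        if line.startswith("QueryString=http://"):
            line = line.replace("QueryString=http://", "QueryString=https://", 1)
        out.append(line)
    return "\n".join(out)


# Sample lines in the shape of the engine entries from this thread.
SAMPLE = "\n".join([
    "[Engine33]",
    "Displayname=Google Chrome V41",
    "QueryString=http://www.google.com/search?complete=0&hl=en&q={KEYWORD}&start={PAGENUM}&filter=0",
    "MustBeInLink=http",
])

print(use_https(SAMPLE))
```

Lines such as MustBeInLink=http are left as they are, since they are filters rather than request URLs. Back up engines.dat before changing the real file.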


RE: How is it that Scrapebox does not scrape all URLs in Google's index? - weechoochter - 05-07-2015

Great, thanks. So do you mean I should also edit the standard engines to use https instead of http?
I am not getting lots of errors anymore, so that's great.
Tried lots of agents but no luck. I just get piles of links from 1 domain and lots of YouTube video links. I'll keep popping away at different agents until I get it to work.

Thanks for all your help :-)