ScrapeBox Forum
Scrapebox2.0 Stuck While Harvesting, Errors Climb Though (30 Proxies, One Connection) - Printable Version

+- ScrapeBox Forum (https://www.scrapeboxforum.com)
+-- Forum: ScrapeBox Main Discussion (https://www.scrapeboxforum.com/Forum-scrapebox-main-discussion)
+--- Forum: General ScrapeBox Talk (https://www.scrapeboxforum.com/Forum-general-scrapebox-talk)
+--- Thread: Scrapebox2.0 Stuck While Harvesting, Errors Climb Though (30 Proxies, One Connection) (/Thread-scrapebox2-0-stuck-while-harvesting-errors-climb-though-30-proxies-one-connection)



Scrapebox2.0 Stuck While Harvesting, Errors Climb Though (30 Proxies, One Connection) - scanner737 - 03-20-2015

[Image: scrapeboxerror.png]

Basically, as the title suggests, my harvest has been stuck on the same number for at least 12 hours (I left it running overnight).

I have 30 private proxies, and I set the harvester to just 1 connection because I'm running advanced operator searches. Around 12 hours ago (when I last checked and saw it was at the same number), it got stuck at 1600056 harvested URLs. It has stayed stuck on that number ever since, yet the error count keeps climbing at about 2 per second.

I just tested all of my proxies in a second instance of Scrapebox and they're all still working just fine, Google Passed, etc.

As you can see from the image, I'm harvesting with roughly 1.4 million keywords/advanced search operators. My PC is pretty close to top of the line and I was thinking SB v2.0 could handle a massive search like this. Any ideas or insight would be massively appreciated.

Side question: What's the "completed %" at the bottom? Is that really the portion of the queries it has run so far? If so, I probably need to change something, like reducing the number of search queries.

I don't mind if it takes a few days to complete as long as I'm safe and getting good results (hence the one connection), but I feel I'm doing something wrong here. That's my situation, so if anyone has any insights or advice I'd really appreciate it.

Thanks all!


RE: Scrapebox2.0 Stuck While Harvesting, Errors Climb Though (30 Proxies, One Connection) - scanner737 - 03-21-2015

I paused the harvester for about 6 hours to see if that would do anything. When I unpaused, the harvested URL number began going up again with no errors, and it climbed pretty quickly, pretty much in increments of 100 every few seconds. It's almost as if it was still gathering URLs while it was paused and only now actually adding them to the total. Also, the average URLs per second has been stuck at 9 for about 12 hours now. Strange.

Update: Then, after about 20 minutes of this, it reverted to the old behavior: no new harvested URLs, and the error counter climbing again.

Again, if anyone has experienced similar behavior and wants to chime in, that'd be great.


RE: Scrapebox2.0 Stuck While Harvesting, Errors Climb Though (30 Proxies, One Connection) - gopo2k - 03-21-2015

Yes. You have too few proxies and too many threads.
Get 100 proxies and use 5 threads at once.


RE: Scrapebox2.0 Stuck While Harvesting, Errors Climb Though (30 Proxies, One Connection) - loopline - 03-21-2015

(03-21-2015, 09:25 AM)gopo2k Wrote: Yes. You have too few proxies and too many threads.
Get 100 proxies and use 5 threads at once.

In ratio to what he's using now, his setup is equivalent to roughly 100 proxies at 3 connections, so your suggestion would actually make the problem worse. You're on the right track, though, just not with those numbers of proxies and threads per se.
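Just to put rough numbers on that ratio (back-of-the-envelope only, nothing ScrapeBox calculates for you):

Code:
# Back-of-the-envelope: proxies per simultaneous connection.
# More proxies per connection means each IP gets hit less often.
setups = [
    ("OP now: 30 proxies, 1 connection", 30.0 / 1),
    ("Suggested: 100 proxies, 5 connections", 100.0 / 5),
    ("Roughly equivalent to OP: 100 proxies, 3 connections", 100.0 / 3),
]
for name, ratio in setups:
    print("{0}: {1:.1f} proxies per connection".format(name, ratio))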

~~~~~~~~~~~~~~~

OP, your proxies are getting banned; that's why the errors. When you paused, some of them got unblocked, which is why it worked fine for 20 minutes and then started erroring again.

The Google test only tests against basic keywords, not advanced operators. You can have proxies that work fine for one query and are blocked for the next. I have a video on it here:

https://www.youtube.com/watch?v=P9CbGhfc1aY


You could build a custom test with your exact query and then you would see exact results.
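If you wanted to script that kind of check yourself outside ScrapeBox, a rough Python sketch might look like the below. The proxy list, the query and the "ban marker" strings are just placeholders to swap for your own, and this isn't how ScrapeBox tests internally:

Code:
# Rough sketch: test each proxy against the exact advanced-operator query you harvest with.
import requests

PROXIES = ["1.2.3.4:8080", "5.6.7.8:8080"]          # placeholder proxies (ip:port)
QUERY = 'inurl:blog "leave a comment" seo tips'      # one of your real advanced-operator queries
BAN_MARKERS = ("unusual traffic", "captcha")         # text Google tends to show when an IP is blocked

for proxy in PROXIES:
    try:
        r = requests.get(
            "https://www.google.com/search",
            params={"q": QUERY},
            proxies={"http": "http://" + proxy, "https": "http://" + proxy},
            timeout=15,
        )
        blocked = r.status_code == 429 or any(m in r.text.lower() for m in BAN_MARKERS)
        print(proxy, "BLOCKED for this query" if blocked else "looks OK")
    except requests.RequestException as exc:
        print(proxy, "ERROR:", exc)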

You would just need more proxies, perhaps 50+ total with 1 connection, for advanced operators.

Also, I would recommend using the detailed harvester; it still only harvests at 1 connection, but it gives you more detail and has the option to add a delay.

So you could wait 48 hours for your IPs to be unbanned and then harvest with a 2-5 second delay.

Or you could try some backconnect proxies for harvesting; they change the IP every 10 minutes on the back end.

Recommendations here:
http://scrapeboxfaq.com/scrapebox-proxies

The bar at the bottom is the percentage of total keywords completed.

Scrapebox will have no problem with that volume of keywords, as long as you are using the 64-bit version.

The thing to bear in mind is that V2 is highly optimized for scraping and is VERY fast at scraping Google. Using 1 connection on 30 proxies is probably fine in V1, because that harvester is slower; the idea is just to keep things slow enough that you don't get your IPs banned. V2 is so fast and optimized that it rotates through the IPs quicker, hence the bans. So you have to add a delay or more proxies to slow things back down to a speed where you won't get banned, or use backconnect proxies that rotate on their own so you don't have to worry about bans as much.
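If it helps to see the math, here's a rough sketch of how often each IP gets hit with and without a delay. The one-query-per-second figure is an assumption for illustration, not a measurement from ScrapeBox:

Code:
# Rough illustration only - the per-query timing is an assumption.
proxies = 30
for label, seconds_per_query in [("no delay", 1.0), ("5 second delay", 6.0)]:
    per_ip_per_hour = 3600.0 / seconds_per_query / proxies
    print("{0}: each IP makes roughly {1:.0f} queries per hour".format(label, per_ip_per_hour))
# no delay: ~120 queries/hour per IP - easy to trip Google's limits on operator searches
# 5 second delay: ~20 queries/hour per IP - much gentler on each proxy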


RE: Scrapebox2.0 Stuck While Harvesting, Errors Climb Though (30 Proxies, One Connection) - scanner737 - 03-24-2015

(03-21-2015, 09:24 PM)loopline Wrote: OP, your proxies are getting banned; that's why the errors. [...]

Good advice, Loopline, thanks. I was away for the weekend and had turned off my computer, so I'll try starting it over with the detailed harvester and a delay as you recommend.

*Edit*

So I set the detailed harvester up with a delay of 5 seconds, still at only one connection to be safe and slow. It's at around 750,000 URLs after about 15 hours (unfortunately there's no timer on the detailed harvester yet). Every few thousand searches I'll notice a group/cluster of searches with 0 results where it still says "harvesting with proxy..." but never seems to complete those queries to get results. Are those instances where it banned the proxies for a while again?

Is there a way to save harvesting progress, as in turn off my computer for the night and then pick it back up the next day in the same spot? That would be HUGE. Then I could also swap out proxies if I needed to without losing my progress, and I wouldn't have to worry about leaving my PC on constantly.

I was also going to ask which backconnect proxies package you'd recommend, but it seems like it's really just about how much I want to spend vs. how long I'm willing to wait for complete results when dealing with lots and lots of advanced searches.

Thanks for the help again!


RE: Scrapebox2.0 Stuck While Harvesting, Errors Climb Though (30 Proxies, One Connection) - loopline - 03-24-2015

(03-24-2015, 02:57 AM)scanner737 Wrote: [...] I was also going to ask which backconnect proxies package you'd recommend, but it seems like it's really just about how much I want to spend vs. how long I'm willing to wait for complete results when dealing with lots and lots of advanced searches.

On which package: yes, it's about how much you want to spend vs. how long you want to wait for results.

It's possible that those groups of queries with no results actually have no results. It's also possible that, yes, proxies were banned.

You can pause the harvester, but you can't "save the state" of it so that you can shut down your PC. However, you can keep proxies in a file and tell it to refresh proxies every X minutes, or just have it refresh from the file.

If you tell it to refresh every X minutes, you can just change out the proxies whenever you want. If you tell it to refresh from file, but not every X minutes, it will harvest until it has no working proxies and then get new proxies from the file.

All in all, though, it's not built for saving a state, shutting down a PC, and starting back up again later.

May I ask why you want to shut down your PC? Are you limited on bandwidth, or electricity, or something else? Just curious.

A cheap VPS might be a way to go too; it stays running non-stop and doesn't use anything on your PC, etc.


RE: Scrapebox2.0 Stuck While Harvesting, Errors Climb Though (30 Proxies, One Connection) - scanner737 - 03-25-2015

I think I was mostly concerned about power outages. I should invest in a UPS to cover short, seconds-long cutouts, because that's a lot of harvesting progress lost in that situation. I also don't like to leave my PC on when I'm away for an extended period of time, like a weekend away or vacation.

I'm not sure how realistic a save feature would be from a practical point of view (even an auto-save every 60 minutes or something), but I figured I'd mention that it might make a nice addition.

I will say, though, I just realized (my brain works on occasion) that there's a workaround: simply note which keyword the harvester is up to when you want to stop it. Because the harvester works top to bottom through your keyword list, you can just start from that part of the list the next time you boot it up, if you did want to actively stop the harvester midway through. Of course you'd have to build your own master harvested-URL list, because otherwise it will be broken into multiple folders, but that would work just fine for me, come to think of it.
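Something like this little Python sketch is what I have in mind for trimming the list down to the unfinished part (the file names and the example keyword are made up):

Code:
# Trim the master keyword list down to whatever hasn't been run yet.
last_done = 'inurl:blog "seo tips"'   # the keyword the harvester was on when I stopped it

with open("keywords_all.txt") as f:
    keywords = [line.strip() for line in f if line.strip()]

idx = keywords.index(last_done)       # raises ValueError if the keyword isn't in the list
remaining = keywords[idx + 1:]

with open("keywords_remaining.txt", "w") as f:
    f.write("\n".join(remaining))

print("{0} keywords left to run".format(len(remaining)))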

By the way, I believe you're correct; each cluster that's returning 0 results is from the same main advanced operator (plus all of my base keywords). Even when it isn't, and the keyword sits in a cluster of other keywords that return a handful of results, a manual Google search for the 0-result keyword confirms there really are no results for it. So it's good to know it's not my proxies so far. Now I'm wondering if my initial, non-detailed harvester run was counting 0-result searches as errors, or if those were indeed banned IPs. I like the detailed harvester for the added detail, as you have a better idea of what's going on, though it'd still be great to have a specific notification that proxies were banned on a search, and maybe SB could give them some time to breathe before trying them again.

This is great, though; I'm learning more about SB and getting a better handle on it.


RE: Scrapebox2.0 Stuck While Harvesting, Errors Climb Though (30 Proxies, One Connection) - loopline - 03-25-2015

(03-25-2015, 03:47 PM)scanner737 Wrote: [...]

Yes, you could just observe it, and yes, you would have to build your own master list.

You could always break the keyword list up into chunks that seem to take about one day's worth, then harvest, shut down, and start over the next day.

Also, you could get fancy with the automator to help with this. You could even have the automator execute an exe that shuts down your PC when it's done.
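For example (just an illustration, not something ScrapeBox ships with), the exe could be as simple as a tiny script that calls the standard Windows shutdown command:

Code:
# Tiny "shut the PC down when the job is done" script the automator could call.
# /s = shut down, /t 60 = wait 60 seconds (cancel with "shutdown /a" if needed).
import subprocess

subprocess.call(["shutdown", "/s", "/t", "60"])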

A UPS would be a good way to go for short power outages.

Yeah, when you get into mass queries it's easy enough to hit keyword/operator combos that just have no results.

Errors should be IP bans, not 0 results; a 0-result search should just show as completed with 0 results, as that's not an error, it's just nothing.

Indeed, much can be done, and I think of new ways to apply it all the time, or people ask things that spark ideas. For instance, I hadn't thought about having the automator call a program that shuts your PC down when it's done. Lots to learn.


RE: Scrapebox2.0 Stuck While Harvesting, Errors Climb Though (30 Proxies, One Connection) - scanner737 - 03-26-2015

Hmm, SB 2.0 had a hiccup at some point last night while I was asleep and must have crashed in the middle of harvesting, as I see no trace of it today. I had a second instance of Scrapebox (version 1, actually) doing something else, and that was still open when I woke up, so it's not as if my computer restarted itself. I have the saved harvested list for as far as the detailed harvester got, but I have no idea how far it got through the keyword list, so I'll have to start from scratch and just remove duplicates between the two lists.

I think this is where, short of the harvester having an autosave feature that tracks where it was in your keyword list when it crashed, your idea of splitting the keyword list into something like 20 smaller lists to run daily comes in. That way I'll have a much smaller keyword list to repeat if it crashes again.
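Roughly this kind of split is what I mean (file names and the chunk count are just examples):

Code:
# Split the master keyword file into roughly 20 smaller daily lists.
CHUNKS = 20

with open("keywords_all.txt") as f:
    keywords = [line.strip() for line in f if line.strip()]

size = -(-len(keywords) // CHUNKS)    # ceiling division so no keyword is dropped
written = 0
for i in range(CHUNKS):
    part = keywords[i * size:(i + 1) * size]
    if not part:
        break
    with open("keywords_day_{0:02d}.txt".format(i + 1), "w") as f:
        f.write("\n".join(part))
    written += 1

print("wrote {0} files of up to {1} keywords each".format(written, size))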


RE: Scrapebox2.0 Stuck While Harvesting, Errors Climb Though (30 Proxies, One Connection) - loopline - 03-26-2015

Yeah, you can mail support and request an auto-save feature, but the problem is that each engine gets its own array of the keywords. So if you select the maximum of 4 engines in the detailed harvester, it loads 4 copies of the keywords into memory, one for each engine.

So each engine can be at a different spot and saving 4 sets of keywords to disk in real time could get hairy at best and cause crashes of its own at worst.

The real downside is that it would slow the harvester down, and it would probably take a rewrite of half the harvester to accomplish.

I know it would cripple the custom harvester with over 20 engines, but it might be possible with the detailed harvester.

You can mail support at

support (at) scrapebox (dot) com

But yes breaking up the keyword list would work as well.

In the main scrapebox folder there should be a bugreport.txt file.

If you attach it here I can possibly tell you what went wrong, or you can also mail it to support at

scrapeboxhelp (at) gmail (dot) com