ScrapeBox Forum
Connections that never end... - Printable Version

+- ScrapeBox Forum (https://www.scrapeboxforum.com)
+-- Forum: ScrapeBox Main Discussion (https://www.scrapeboxforum.com/Forum-scrapebox-main-discussion)
+--- Forum: General ScrapeBox Talk (https://www.scrapeboxforum.com/Forum-general-scrapebox-talk)
+--- Thread: Connections that never end... (/Thread-connections-that-never-end)



Connections that never end... - DigitalMu - 12-22-2019

Probably THE biggest annoyance about Scrapebox for me has been situations where a job refuses to end (even when you press stop) due to open connections.  This monster rears its head in several places, but most often when running Check Links on a bunch of domains.

I've tried everything...
  - reducing the number of connections to a crawl
  - waiting for hours (and even a full day)
  - hitting stop and waiting
  - shutting down Scrapebox and trying it again (and again and again)
  - writing the vendor
...and more

Nothing seems to help.

Right now, for example, I have a list of about 100,000 urls that I want to link check. The first pass made it through just fine. It found about 7000 successful links.  I've found that I often need to run several more passes to check all the urls, so I ran it a second time (with 150 threads)...it choked up, leaving me 113 open threads when I returned a few hours later.  I tried it again...same result.  I tried it again with 90 threads...same result.  I'm in the middle of some other gymnastics at the moment.

I wrote the creator a few months ago and his answer really didn't seem satisfying...it could be summed up as "Yeah, there's no way to close down threads that remain open on Windows".  First and foremost, that seems almost inconceivable. Surely there is some software way to simply terminate threads (especially after a period of time, or after hitting stop).  I can't imagine that Windows forces threads to remain open...indefinitely.

But the second issue is this: even if the above were true and there's no way to force threads to close, I should at least be able to regain control of Scrapebox so I can save the data that just took hours to collect.  When harvesting, I'm able to save the URLs on a periodic basis (every 10,000, for example)...and there are always the files in the /Harvester_Sessions directory.  With Check Links, though, it seems like I cannot get any such files.  If the Active Threads count stalls (as it often does), I'm just out of luck.  I cannot get a listing of my successful/unsuccessful links.  I simply have to start over...and over...and over...sometimes finally taking the time to split up my large lists and process them in groups of 10,000 instead of 100,000+.  This is very time consuming.

Surely there is some reasonable, better way?  Maybe I'm still not getting something fundamental?

Again, it's inconceivable to me that simply hitting stop doesn't.....uhmmm....stop.  It's inconceivable that Windows forces the threads to remain open with no option of forcibly closing them, and even more inconceivable that I cannot save my data when this happens (and have to simply shut down the Scrapebox task).

So that's my rant today as I'm now experimenting with the forty-leventh method that I'm hoping might skirt this issue :)

Any thoughts, ideas? :)


RE: Connections that never end... - loopline - 12-23-2019

First I recommend you try this approach.
https://www.youtube.com/watch?v=bZRh6sZZyz0

If that doesn't work, you can try settings >> connections, timeouts and other settings >> other >> link checker min threads.

That attempts to force close all connections when you reach the threshold. I like 20%. So if you are using 100 threads on link checking, then I would set link checker min threads to 20. Note that if you set the link checker min threads equal to or greater than your total link checker connections, checking will end instantly when you push start on the link checker.

Believe it or not, Scrapebox actually can't just stop the threads. The thing is, Scrapebox uses threads/sockets, and these are a global convention. Scrapebox did not design them; they are used by programs around the world. They are built in such a way that Windows literally has control over them.

Coding isn't as cut and dried as it seems. Windows literally has control over the threads, and if Windows or any 3rd party software locks a thread, then Scrapebox is forced to wait till it's unlocked.

Windows doesn't really lock threads "intentionally" (unless it doesn't like the content on a url), but Windows just was not built for Scrapebox, so it has "brain farts", if you will, and locks threads sometimes.

On my servers I combat this by working in ultra small chunks. I run the Automator and first split my source urls into roughly 100-url chunks, then dump them in a folder.

Then I have the Automator link check one file and output the results, then run a custom script that copies in the next file from the folder, and loop the Automator to link check it.
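
If you want to script the split-and-feed part yourself, here's a rough sketch of the idea in Python. To be clear, this is just an outline, not my actual script; all the file and folder names are placeholders.

Code:
# Rough sketch of the split-into-chunks step. File and folder names here
# are placeholders, not my actual setup.
import os
import shutil

CHUNK_SIZE = 100  # urls per chunk file

def split_urls(source_file, out_dir, chunk_size=CHUNK_SIZE):
    # Split a big URL list into numbered chunk files for the Automator to pick up.
    os.makedirs(out_dir, exist_ok=True)
    with open(source_file, "r", encoding="utf-8") as f:
        urls = [line.strip() for line in f if line.strip()]
    for i in range(0, len(urls), chunk_size):
        chunk_path = os.path.join(out_dir, f"chunk_{i // chunk_size:05d}.txt")
        with open(chunk_path, "w", encoding="utf-8") as out:
            out.write("\n".join(urls[i:i + chunk_size]))

def copy_next_chunk(chunk_dir, input_file, done_dir):
    # Copy the next unprocessed chunk into the file the link checker reads,
    # then move that chunk to a "done" folder so it isn't processed twice.
    os.makedirs(done_dir, exist_ok=True)
    chunks = sorted(f for f in os.listdir(chunk_dir) if f.endswith(".txt"))
    if not chunks:
        return False  # nothing left to process
    nxt = os.path.join(chunk_dir, chunks[0])
    shutil.copyfile(nxt, input_file)
    shutil.move(nxt, os.path.join(done_dir, chunks[0]))
    return True

if __name__ == "__main__":
    split_urls("source_urls.txt", "chunks")
    copy_next_chunk("chunks", "linkcheck_input.txt", "chunks_done")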

This works well. Even Google runs everything in small chunks on small servers, as it's more effective than larger machines etc..

I take it a step further, which is probably overkill for you, but I want stuff to run 24/7/365. So I have another script that monitors the link checker and arbitrarily force closes it every 12 hours. I lose a few links every 12 hours this way, but you can always load them back in to check again, which you're doing anyway.

When the link checker is force closed, I kill everything, Scrapebox and the Automator, the works. Then the script pauses a few seconds and restarts Scrapebox and the Automator, which kicks off the link checker again.
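
The watchdog itself can be tiny. Here's a rough sketch of the 12-hour kill-and-restart idea in Python, for Windows. The install path and process name below are assumptions (check Task Manager for the real name on your machine), and this is an outline, not my exact script.

Code:
# Rough sketch of the 12-hour watchdog (Windows). The install path and
# process name are assumptions, not my exact setup.
import subprocess
import time

SCRAPEBOX_EXE = r"C:\scrapebox\Scrapebox.exe"  # assumed install path
PROCESS_NAME = "Scrapebox.exe"                 # assumed process name
RESTART_EVERY = 12 * 60 * 60                   # seconds: force-restart every 12 hours

while True:
    subprocess.Popen([SCRAPEBOX_EXE])  # start Scrapebox; the Automator job resumes the link check
    time.sleep(RESTART_EVERY)
    # Force close everything, the works. A few in-flight links get lost,
    # but they can be loaded back in on the next pass anyway.
    subprocess.run(["taskkill", "/IM", PROCESS_NAME, "/F"], check=False)
    time.sleep(5)  # brief pause before restarting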

So it runs year round seamlessly, for the most part. There are always anomalies, of course.

Anyway, link checker min threads is probably your best bet. Also, you can fire up another instance of Scrapebox and split your list in half; that way if one instance locks up and the other completes, you only lose half the work. You can run unlimited instances of Scrapebox on a single machine.

As for Windows having control over the threads: it sucks, but it's the best option out there, so it's a case of being better than the alternative. But you can work around it. Small list sizes and more than one instance is how I do it.


RE: Connections that never end... - DigitalMu - 12-23-2019

(12-23-2019, 03:23 AM)loopline Wrote: First I recommend you try this approach...


Thanks for your lengthy and thoughtful reply.  I freely admit that I don't understand the behind-the-scenes coding controlling such threads.  The model I have in my mind, which must be incorrect, can't accommodate the notion of being unable to terminate threads.  I concede my ignorance. :)

I'm not opposed to setting up Automator scripts and doing things in smaller batches, and I've increasingly been doing things just like that.  It makes sense to do smaller batches for a variety of reasons, but I'm also often impatient and love the feeling of doing big jobs quickly, haha :)

I will follow your advice and try the other things, though...but probably not the first piece, as I really would like to keep my monitor on my desk :)


RE: Connections that never end... - loopline - 12-23-2019

HAHA :D Yes, I like my monitor as well.

Yes, small batches is the way to go. But it doesn't have to be 100 urls like me; that's for servers running 24/7.

If you have 100K, just split it and run 2 sets of 50K each, and if they work, great. If one fails, run it again, or split it further and run 25K in each instance.

If it's locking up like 95% of the time, then I would investigate the root causes, like security software, your router, etc... Close down any 3rd party software and turn it back on one by one as a test, etc..

Merry Christmas!