ScrapeBox Forum
Similar site - Printable Version



RE: Similar site - Nosh - 07-18-2019

(07-02-2019, 01:47 AM)loopline Wrote: A 301 is a permanent redirect; it’s not a page that gets loaded. Just adding this URL to similarsitesearch.com will work:

https://www.similarsitesearch.com/alternatives-to/{KEYWORD}

There is no pagination; they just return around 11 results, so there is no need for the pagenum variable.

You mean like this?
[Image: Captura-de-pantalla-2019-07-18-a-las-20-55-13.png]


RE: Similar site - loopline - 07-19-2019

Yes, if the before/after markers are correct, that should work.


RE: Similar site - Nosh - 07-19-2019

(07-19-2019, 06:07 PM)loopline Wrote: Yes, if the before/after markers are correct, that should work.

What do you mean exactly? With this configuration I don't get any results.

[Image: Captura-de-pantalla-2019-07-19-a-las-21-21-49.png]


RE: Similar site - loopline - 07-20-2019

It worked at the time I posted it, which was some time ago.

You had posted your setup, I think, and the URL was the only thing that needed to be changed.

However, since then the site may have changed its HTML, so the before/after markers may no longer match.

So double check that the before and after markers are still correct against the current HTML.
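
As a rough illustration of what a before/after marker pair does, here is a minimal Python sketch. It is not ScrapeBox's own code, and the markers and sample HTML in it are made-up assumptions, not the real ones for similarsitesearch.com. It only shows why a small change in the page's HTML makes the extraction come back empty.

```python
def extract_between(html, before, after):
    """Return every substring found between `before` and the next `after`."""
    results = []
    pos = 0
    while True:
        start = html.find(before, pos)
        if start == -1:
            break  # no further occurrence of the before marker
        start += len(before)
        end = html.find(after, start)
        if end == -1:
            break  # before marker found, but no closing after marker
        results.append(html[start:end])
        pos = end + len(after)
    return results


# Hypothetical sample page and markers, for illustration only.
sample = '<a href="https://example.com/a">A</a> <a href="https://example.org/b">B</a>'
print(extract_between(sample, '<a href="', '"'))

# If the site switches to single quotes, the same markers match nothing.
changed = sample.replace('"', "'")
print(extract_between(changed, '<a href="', '"'))
```

Running the same function against a saved copy of the current page with your own markers is a quick way to see whether they still match.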


RE: Similar site - Nosh - 07-21-2019

Do you mean something like in the screenshot?
[Image: Captura-de-pantalla-2019-07-21-a-las-21-59-13.png]


RE: Similar site - loopline - 07-22-2019

Maybe. It's been a while since this thread was started. What exact element are you trying to extract?


RE: Similar site - Nosh - 07-22-2019

(07-22-2019, 08:01 PM)loopline Wrote: Maybe. It's been a while since this thread was started. What exact element are you trying to extract?

Only the URLs of the links
(the <a href= values)
[Image: Captura-de-pantalla-2019-07-22-a-las-22-44-36.png]


RE: Similar site - loopline - 07-26-2019

So, similarsitesearch.com only uses one page of results, I believe, so there is no point in a next-page marker, and handling multiple result pages is the whole point of the harvester. So why not just use the merge feature to merge all your keywords into the URLs and load them all into the link extractor? Then you don't have to mess around with a custom harvester engine that you have to update every time the site changes something.


Merge info
http://scrapeboxfaq.com/how-do-i-use-tokens-with-the-m-merge-option

and link extractor
https://www.youtube.com/watch?v=t6pxt-4C6Xc&t=2s
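
For anyone who wants to see what the merge step amounts to, here is a small Python sketch of the idea. It is not how ScrapeBox does it internally. The {KEYWORD} placeholder is the one from the URL quoted earlier in the thread; the token the merge feature itself expects may differ (see the FAQ link above). The URL-encoding step and the sample keywords are assumptions for illustration.

```python
from urllib.parse import quote

# URL pattern quoted earlier in the thread.
TEMPLATE = "https://www.similarsitesearch.com/alternatives-to/{KEYWORD}"


def merge_keywords(template, keywords):
    """Build one URL per non-empty keyword by filling the placeholder."""
    urls = []
    for kw in keywords:
        kw = kw.strip()
        if kw:
            # Percent-encode the keyword so spaces and slashes stay URL-safe.
            urls.append(template.replace("{KEYWORD}", quote(kw, safe="")))
    return urls


# Hypothetical keywords; blank entries are skipped.
for url in merge_keywords(TEMPLATE, ["example.com", "  ", "another-site.org"]):
    print(url)
```

The resulting list is what you would then load into the link extractor, one URL per line.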


RE: Similar site - Nosh - 07-27-2019

Sounds good, but it does not work.
[Image: Captura-de-pantalla-2019-07-27-a-las-11-26-38.png]


RE: Similar site - loopline - 07-28-2019

The site is blocking you. I just rebuilt the entire engine from scratch and saved the test HTML, and similarsitesearch.com is returning just this:

Code:
<!DOCTYPE html>
<html>

<head>
<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
<meta http-equiv="cache-control" content="max-age=0" />
<meta http-equiv="cache-control" content="no-cache" />
<meta http-equiv="expires" content="0" />
<meta http-equiv="expires" content="Tue, 01 Jan 1980 1:00:00 GMT" />
<meta http-equiv="pragma" content="no-cache" />
<meta http-equiv="refresh" content="10; url=/site/cloud" />
<script type="text/javascript">
    (function(window){
        try {
            if (typeof sessionStorage !== 'undefined'){
                sessionStorage.setItem('distil_referrer', document.referrer);
            }
        } catch (e){}
    })(window);
</script>
<script type="text/javascript" src="/smlrdstil.js" defer></script><style type="text/css">#d__fFH{position:absolute;top:-5000px;left:-5000px}#d__fF{font-family:serif;font-size:200px;visibility:hidden}#bvbfavdddwv{display:none!important}</style></head>
<body>
<div id="distilIdentificationBlock">&nbsp;</div>
</body>
</html>
Notice this part:

Code:
<div id="distilIdentificationBlock">&nbsp;</div>

It's blocked. I tried a different user agent, but that didn't help. You can experiment with the header data and user agent and see if you can get it to work, but otherwise they may simply have a blocking system good enough that it's not going to work.
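
If you save the returned HTML, a quick check like the Python sketch below can tell you whether you got the block page rather than real results. The two marker strings are taken from the HTML posted above; treating them as reliable signs of a block is an assumption, and other block pages may look different.

```python
# Strings copied from the block page posted above; other block pages
# may not contain them, so treat this as a heuristic only.
BLOCK_MARKERS = (
    'id="distilIdentificationBlock"',
    "distil_referrer",
)


def looks_blocked(html):
    """Return True if the HTML contains any known block-page marker."""
    return any(marker in html for marker in BLOCK_MARKERS)


blocked_page = '<body>\n<div id="distilIdentificationBlock">&nbsp;</div>\n</body>'
normal_page = '<html><body><a href="https://example.com">x</a></body></html>'

print(looks_blocked(blocked_page))
print(looks_blocked(normal_page))
```

Checking this first saves time adjusting before/after markers on a page that never contained any results.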