Nosh · 07-18-2019, 06:57 PM

(07-02-2019, 01:47 AM)loopline Wrote: A 301 is a permanent redirect, it’s not a page that gets loaded. Just adding this URL to similarsitesearch.com will work

https://www.similarsitesearch.com/altern...o/{KEYWORD}

There is no pagination, they just return around 11 results so no need for the pagenum variable.

You mean like this?
[Image: Captura-de-pantalla-2019-07-18-a-las-20-55-13.png]

**loopline** · 07-19-2019, 06:07 PM

Yes if the markers are correct for before/after that should work.

Nosh · 07-19-2019, 07:24 PM

(07-19-2019, 06:07 PM)loopline Wrote: Yes if the markers are correct for before/after that should work.

What do you mean exactly? With this configuration I don't get any results.

[Image: Captura-de-pantalla-2019-07-19-a-las-21-21-49.png]

**loopline** · 07-20-2019, 09:33 PM

It worked at the time I posted it, which was some time ago.

I think you had posted your setup, and the URL was the only thing that needed to change.

However, since then the site may have changed the before/after markers.

So double check that the before and after markers are still correct against the current HTML.
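
To make the before/after idea concrete, here is a minimal Python sketch of marker-based extraction. It is not ScrapeBox's actual implementation; the function name `extract_between`, the sample HTML, and the marker strings are all made up for illustration. If the site changes the text around its links, the `before` string stops matching and the result is simply empty, which is the "no results" symptom described above.

```python
def extract_between(html: str, before: str, after: str) -> list[str]:
    """Return every substring that sits between a `before` marker
    and the next `after` marker (hypothetical helper, for illustration)."""
    results = []
    pos = 0
    while True:
        start = html.find(before, pos)
        if start == -1:
            break  # no more "before" markers
        start += len(before)
        end = html.find(after, start)
        if end == -1:
            break  # "before" found but never closed by "after"
        results.append(html[start:end])
        pos = end + len(after)
    return results


# Made-up sample page; the real site's HTML will differ.
sample = (
    '<li><a class="r" href="http://a.example/">A</a></li>'
    '<li><a class="r" href="http://b.example/">B</a></li>'
)

urls = extract_between(sample, 'class="r" href="', '"')
stale = extract_between(sample, 'class="old" href="', '"')  # outdated marker
print(urls)
print(stale)
```

With a marker that no longer appears in the page, the second call returns an empty list rather than raising an error, so a silent empty result usually means the markers need updating.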

Nosh · 07-21-2019, 08:03 PM

(07-20-2019, 09:33 PM)loopline Wrote: It worked at the time I posted it, which was some time ago.

I think you had posted your setup, and the URL was the only thing that needed to change.

However, since then the site may have changed the before/after markers.

So double check that the before and after markers are still correct against the current HTML.

Do you mean something like in the screenshot?
[Image: Captura-de-pantalla-2019-07-21-a-las-21-59-13.png]

**loopline** · 07-22-2019, 08:01 PM

Maybe. It's been a while since this thread was started. What exact element are you trying to extract?

Nosh · 07-22-2019, 08:46 PM

(07-22-2019, 08:01 PM)loopline Wrote: Maybe. It's been a while since this thread was started. What exact element are you trying to extract?

Only the URLs of the links (the <a href= values)
[Image: Captura-de-pantalla-2019-07-22-a-las-22-44-36.png]

**loopline** · 07-26-2019, 09:28 PM

Similar Site only uses one page of results, I believe, so there is no point in a next-page marker, and paging through results is the point of the harvester. So why not just use the merge feature to merge all your keywords into the URLs and load them all into the link extractor? Then you don't have to mess around with a custom harvester engine that you have to update every time Similar Site changes something.

Merge info
http://scrapeboxfaq.com/how-do-i-use-tok...rge-option

and link extractor
https://www.youtube.com/watch?v=t6pxt-4C6Xc&t=2s
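
For anyone who wants to see what the merge step amounts to outside ScrapeBox, here is a rough Python sketch. It only mimics the idea of substituting each keyword into a URL template; the function name `merge_keywords` and the `example.com` template are placeholders, not the real site URL. URL-encoding the keyword is also my assumption, not something ScrapeBox is confirmed to do.

```python
from urllib.parse import quote


def merge_keywords(template: str, keywords: list[str],
                   token: str = "{KEYWORD}") -> list[str]:
    """Substitute each non-empty keyword into the URL template
    (hypothetical helper; percent-encoding is an assumption)."""
    urls = []
    for kw in keywords:
        kw = kw.strip()
        if kw:  # skip blank lines in the keyword list
            urls.append(template.replace(token, quote(kw, safe="")))
    return urls


# Placeholder template; substitute the real URL pattern you are using.
template = "https://example.com/similar/{KEYWORD}"
urls = merge_keywords(template, ["example.org", "  ", "my site.com"])
for u in urls:
    print(u)
```

The resulting list is one URL per keyword, which is the form the link extractor expects as input.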

Nosh · 07-27-2019, 09:31 AM

(07-26-2019, 09:28 PM)loopline Wrote: Similar Site only uses one page of results, I believe, so there is no point in a next-page marker, and paging through results is the point of the harvester. So why not just use the merge feature to merge all your keywords into the URLs and load them all into the link extractor? Then you don't have to mess around with a custom harvester engine that you have to update every time Similar Site changes something.

Merge info
http://scrapeboxfaq.com/how-do-i-use-tok...rge-option

and link extractor
https://www.youtube.com/watch?v=t6pxt-4C6Xc&t=2s

Sounds good, but it does not work.
[Image: Captura-de-pantalla-2019-07-27-a-las-11-26-38.png]

**loopline** · (This post was last modified: 07-28-2019, 04:26 AM by loopline.)

The site is blocking you. I just rebuilt the entire engine from scratch and saved off the test HTML, and Similar Sites is returning just this:

Code:
<!DOCTYPE html>
<html>
<head>
<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
<meta http-equiv="cache-control" content="max-age=0" />
<meta http-equiv="cache-control" content="no-cache" />
<meta http-equiv="expires" content="0" />
<meta http-equiv="expires" content="Tue, 01 Jan 1980 1:00:00 GMT" />
<meta http-equiv="pragma" content="no-cache" />
<meta http-equiv="refresh" content="10; url=/site/cloud" />
<script type="text/javascript">
    (function(window){
        try {
            if (typeof sessionStorage !== 'undefined'){
                sessionStorage.setItem('distil_referrer', document.referrer);
            }
        } catch (e){}
    })(window);
</script>
<script type="text/javascript" src="/smlrdstil.js" defer></script><style type="text/css">#d__fFH{position:absolute;top:-5000px;left:-5000px}#d__fF{font-family:serif;font-size:200px;visibility:hidden}#bvbfavdddwv{display:none!important}</style></head>
<body>
<div id="distilIdentificationBlock">&nbsp;</div>
</body>
</html>

Notice this part

Code:
<div id="distilIdentificationBlock">&nbsp;</div>

It's blocked. I tried a few different user agents, but that didn't do it. You can monkey around with the header data and user agent and see if you can get it to work, but otherwise they may simply have a good enough blocking system that it's not going to work.
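
If you are testing outside ScrapeBox, a quick way to tell a block page from a real results page is to look for the markers visible in the HTML above. This is only a sketch: the function name `looks_blocked` is made up, and the two marker strings are taken from this one saved response, so the site may use different ones at other times.

```python
def looks_blocked(html: str) -> bool:
    """Heuristic check for the block page shown above.
    The markers come from one saved response and may change."""
    markers = (
        'id="distilIdentificationBlock"',
        "distil_referrer",
    )
    return any(marker in html for marker in markers)


block_page = '<body><div id="distilIdentificationBlock">&nbsp;</div></body>'
normal_page = '<body><a href="http://a.example/">A</a></body>'

print(looks_blocked(block_page))
print(looks_blocked(normal_page))
```

Running a check like this on each saved response before extracting links avoids mistaking a block page for an empty result set.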
