ScrapeBox Forum
Google Scholar Engine - Printable Version


Google Scholar Engine - Leatherneck - 01-28-2022

Hello, I'm having difficulty getting a custom engine for Google Scholar to work. My query string works in the browser, but when I test the engine I get a page full of HTML. I have reviewed the page source and identified the before and after strings, but they don't seem to work.

Here is the query string:

https://scholar.google.com/scholar?start={PAGENUM}&q={KEYWORD}&hl=en&as_sdt=0,44

If I replace {PAGENUM} and {KEYWORD} with values and paste the string into the browser, I get good results.
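For anyone who wants to repeat that manual check outside the browser, here is a minimal Python sketch of the substitution. The function name `build_query` is hypothetical, and URL-encoding the keyword with `quote_plus` is an assumption about what the browser does, not a statement about how ScrapeBox fills in the placeholders.

```python
from urllib.parse import quote_plus

# Query string copied from the post above.
QUERY_TEMPLATE = (
    "https://scholar.google.com/scholar"
    "?start={PAGENUM}&q={KEYWORD}&hl=en&as_sdt=0,44"
)


def build_query(keyword: str, pagenum: int, template: str = QUERY_TEMPLATE) -> str:
    """Fill in the two placeholders (hypothetical helper, for illustration only)."""
    # Replace the page number first so a keyword containing "{PAGENUM}"
    # is not substituted by accident.
    url = template.replace("{PAGENUM}", str(pagenum))
    # Assumption: the keyword is URL-encoded the way a browser form would do it.
    return url.replace("{KEYWORD}", quote_plus(keyword))


print(build_query("machine learning", 0))
```

Pasting the printed URL into a browser is the same test described above, just with the placeholder replacement made explicit.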

Here is the rest of the information for the engine:

[Engine1]
Displayname=Google Scholar
QueryString=https://scholar.google.com/scholar?start={PAGENUM}&q={KEYWORD}&hl=en&as_sdt=0,44
MustBeInLink=
MustNotBeInLink=scholar?q=related:_
JustBeforeLink=" href="
RightAfterLink=" data-clk=
PageStart=0
PageInc=1
NextPageMarker= <b style="display:block;margin-left:53px">Next</b>
Translation=%3f=?|%3d==|%26=&|%3a=:|&amp;=&|%2F=/
headerData=
Referer=0
SSLVersion=1
Favicon=
GracePeriod=0
UserAgent=
Selected=1
DetailedSelected=0
FollowRelocation=0
AddRelative=
AddFieldValue=
ClearCookies=0
ListMode=0
ReadOnly=0

One thing I see but do not know how to deal with is the JustBeforeLink. The source HTML looks like this: <a id="_Om2lsUOWQEJ" href="http......


I think that for ScrapeBox to recognize the URL, the part of the string containing id="_Om2lsUOWQEJ" needs to be replaced with a wildcard, because each found URL has a unique id.



I'd appreciate it if someone could help me finish this. I think I'm close; however, ignorance is bliss!

Leatherneck


RE: Google Scholar Engine - loopline - 01-28-2022

Then just ignore the wild card string.
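To see why the unique id can be ignored, here is a minimal Python sketch of marker-based extraction. It assumes the engine simply takes the text between JustBeforeLink and RightAfterLink; that is an assumption for illustration, not a description of ScrapeBox internals. The function `extract_links`, the sample HTML, the example.org URL and the `data-clk` value are all made up.

```python
def extract_links(html: str, before: str, after: str) -> list[str]:
    """Return every substring found between a `before` and an `after` marker."""
    links = []
    pos = 0
    while True:
        start = html.find(before, pos)
        if start == -1:
            break  # no more "before" markers
        start += len(before)
        end = html.find(after, start)
        if end == -1:
            break  # "before" marker without a matching "after" marker
        links.append(html[start:end])
        pos = end + len(after)
    return links


# Made-up snippet shaped like the anchor tag quoted in the first post.
sample = '<a id="_Om2lsUOWQEJ" href="https://example.org/paper" data-clk="x">Title</a>'
print(extract_links(sample, '" href="', '" data-clk='))
```

Under that assumption, only the text immediately before the URL has to match, so the changing id never needs a wildcard.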

When you test the engine, it gives you the option to save the raw HTML. Do that, and take your markers from the saved file, because that is the exact formatting ScrapeBox will see. The source code shown in a browser may be different, which is why markers taken from it might not work.
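One way to check markers against the saved raw HTML is to count how often each one occurs in it. This is a minimal Python sketch; `marker_report` is a hypothetical helper, and the sample string is a made-up stand-in for the saved file (in practice you would read the file contents, for example with `pathlib.Path(...).read_text()`).

```python
def marker_report(raw_html: str, markers: dict[str, str]) -> dict[str, int]:
    """Count how many times each marker string occurs in the raw HTML."""
    return {name: raw_html.count(text) for name, text in markers.items()}


# Made-up stand-in for the saved raw HTML.
raw_html = '<h3 tabindex="-1"><a href="https://example.org/paper" data-clk="x">Title</a></h3>'

report = marker_report(
    raw_html,
    {
        "JustBeforeLink (from browser source)": '" href="',
        "JustBeforeLink (from raw HTML)": 'tabindex="-1"><a href="',
        "RightAfterLink": '" data-clk=',
    },
)
for name, count in report.items():
    print(f"{name}: {count}")
```

A count of zero means that marker never appears in what the scraper actually received, so it needs to be re-copied from the raw file.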


RE: Google Scholar Engine ~ RESOLVED - Leatherneck - 01-28-2022

(01-28-2022, 05:46 AM)loopline Wrote: Then just ignore the wild card string. 

When you test the engine, it will give you the option to save the raw html, you want to do that and use the markers from that.  Because that is the exact formatting that scrapebox will see.  If you are looking at the source code in a browser, it might be different, which is why it might not work for the markers.

Thanks for the quick reply.  

I did ultimately figure it out. As you said, the HTML returned was slightly different from the page source.

I found the URL was preceded by a different string. When I changed JustBeforeLink to match that, it worked!

Here is the engine definition I'm using for Google Scholar, in case anyone else needs it.

[Engine1]
Displayname=Google Scholar
QueryString=https://scholar.google.com/scholar?start={PAGENUM}&q={KEYWORD}&hl=en&as_sdt=0,44
MustBeInLink=
MustNotBeInLink="javascript:void(0)"
JustBeforeLink=tabindex="-1"><a href="
RightAfterLink=" data-clk=
PageStart=0
PageInc=5
NextPageMarker=<b style="display:block;margin-left:53px">Next</b>
Translation=%3f=?|%3d==|%26=&|%3a=:|&amp;=&|%2F=/
headerData=
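For readers unsure what the Translation, PageStart and PageInc lines are doing, here is a rough Python sketch. It assumes Translation is a `|`-separated list of `from=to` replacements applied in order, and that {PAGENUM} takes the values PageStart, PageStart + PageInc, and so on. Both are guesses from the field names and values, not documented ScrapeBox behavior, and all three function names are hypothetical.

```python
def parse_translation(spec: str) -> list[tuple[str, str]]:
    """Split 'from=to|from=to' into pairs, cutting each item at its first '='."""
    pairs = []
    for item in spec.split("|"):
        source, _, target = item.partition("=")
        pairs.append((source, target))
    return pairs


def apply_translation(link: str, pairs: list[tuple[str, str]]) -> str:
    """Apply each replacement in order (assumed semantics)."""
    for source, target in pairs:
        link = link.replace(source, target)
    return link


def page_values(page_start: int, page_inc: int, pages: int) -> list[int]:
    """Values assumed to be substituted for {PAGENUM} on successive requests."""
    return [page_start + i * page_inc for i in range(pages)]


pairs = parse_translation("%3f=?|%3d==|%26=&|%3a=:|&amp;=&|%2F=/")
# Made-up encoded link, for illustration only.
print(apply_translation("https%3a%2F%2Fexample.org%2Fview%3fid%3d1&amp;p%3d2", pairs))
print(page_values(0, 5, 3))
```

Cutting at the first `=` is what lets an entry like `%3d==` map `%3d` to a literal `=`.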

Is there a repository of custom engines that people have set up, a place to share and exchange engines for various uses?

Thanks, Leatherneck