ScrapeBox Forum
Harvesting Research Documents - Printable Version




Harvesting Research Documents - Leatherneck - 01-28-2022

I am working on a research project that requires me to access many documents on the web.  

I have successfully created the footprints and the engine (just one) I want to use, and I am finding a fair number of documents.

Although I have used the harvester to download many of the documents, ScrapeBox is not getting the files (like PDFs) that open in a new browser window and download automatically when I use my browser.

Is there a custom grab idea/strategy to collect these? Alternatively, how might I determine in advance that these will auto-download, so I can remove them from the URL list?

Any thoughts?

Thanks


RE: Harvesting Research Documents - loopline - 01-29-2022

You would have to look at the links. Like open a link and look at the HTML source code of the page. You may want to turn off JavaScript in your browser first and then open the link, as it's probably some sort of script that initiates the download.
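
If you'd rather check a batch of pages outside the browser, fetching the raw HTML with a script gets you the same view, since no JavaScript runs. Just a rough sketch in Python with the requests library, the URL is a placeholder:

Code:
# fetch the raw HTML (no javascript runs) to see what actually
# triggers the download
import requests

url = "https://example.com/some-document-page"  # placeholder, not a real page
resp = requests.get(url, timeout=30)

print(resp.status_code, resp.headers.get("Content-Type"))
print(resp.text[:2000])  # first chunk of raw HTML, eyeball it for the real file link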

That said, if you look at the HTML, you could probably work something out between the Page Scanner addon and the Redirect Checker (if it's redirecting) to scan all the URLs and see what happens, but it really depends on what's actually happening.
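
For example, to work out in advance which URLs hand back a file directly instead of a normal page, you could check the response headers. Rough sketch, assuming Python with requests and servers that answer HEAD requests honestly:

Code:
# classify harvested URLs by what the server says it will send back
import requests

urls = ["https://example.com/a", "https://example.com/b"]  # your harvested list

for url in urls:
    try:
        r = requests.head(url, allow_redirects=True, timeout=30)
    except requests.RequestException as e:
        print("error:", url, e)
        continue
    ctype = r.headers.get("Content-Type", "")
    disp = r.headers.get("Content-Disposition", "")
    # direct downloads usually announce themselves as application/pdf
    # or Content-Disposition: attachment instead of text/html
    if "attachment" in disp.lower() or "application/pdf" in ctype:
        print("auto-download:", r.url)  # r.url is the final URL after redirects
    else:
        print("normal page: ", r.url)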

I'm not sure you need to remove them from the list; just export all the failed URLs when done and they'll probably be in that list.

As for how to download them: if you can look at the HTML source and find the actual download URL, then you could possibly set up a custom data scraper that pulls that data and then run the actual URLs through the file downloader. It may or may not work depending on how the site is set up. If it's a convenience mechanism, you can probably get around it with enough work. If it's a security mechanism, it may be harder or not possible with ScrapeBox.
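
As a rough picture of that two-step idea outside of ScrapeBox (scrape the real link out of the page, then fetch the file), here's a sketch in Python. The URL and the regex are just examples and would need adjusting to whatever the site's HTML actually looks like:

Code:
# step 1: pull the real download link out of the page HTML
# step 2: fetch the file itself
import re
import requests

page = requests.get("https://example.com/doc-page", timeout=30)  # placeholder URL
match = re.search(r'href="([^"]+\.pdf)"', page.text)  # naive example pattern

if match:
    file_url = requests.compat.urljoin(page.url, match.group(1))  # resolve relative links
    data = requests.get(file_url, timeout=60)
    with open(file_url.rsplit("/", 1)[-1], "wb") as f:
        f.write(data.content)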


RE: Harvesting Research Documents - Leatherneck - 02-02-2022

(01-29-2022, 09:50 PM)loopline Wrote: You would have to look at the links. Like open a link and look at the HTML source code of the page. You may want to turn off JavaScript in your browser first and then open the link, as it's probably some sort of script that initiates the download.

That said, if you look at the HTML, you could probably work something out between the Page Scanner addon and the Redirect Checker (if it's redirecting) to scan all the URLs and see what happens, but it really depends on what's actually happening.

I'm not sure you need to remove them from the list; just export all the failed URLs when done and they'll probably be in that list.

As for how to download them: if you can look at the HTML source and find the actual download URL, then you could possibly set up a custom data scraper that pulls that data and then run the actual URLs through the file downloader. It may or may not work depending on how the site is set up. If it's a convenience mechanism, you can probably get around it with enough work. If it's a security mechanism, it may be harder or not possible with ScrapeBox.

Thanks. After experimenting with this, I have decided to use the strengths of ScrapeBox for finding information, data, links... and have deployed a standalone document downloader to retrieve them.
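
For anyone going the same route, a bare-bones standalone downloader doesn't need to be much. Minimal sketch in Python, assuming a plain text file of harvested URLs, one per line:

Code:
# minimal standalone downloader: read urls.txt, save each file to downloads/
import os
import requests

with open("urls.txt") as f:  # assumed format: one URL per line
    urls = [line.strip() for line in f if line.strip()]

os.makedirs("downloads", exist_ok=True)
for url in urls:
    try:
        r = requests.get(url, timeout=60)
        name = url.rsplit("/", 1)[-1] or "index.html"
        with open(os.path.join("downloads", name), "wb") as out:
            out.write(r.content)
    except requests.RequestException as e:
        print("failed:", url, e)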

Thanks!


RE: Harvesting Research Documents - loopline - 02-04-2022

Sounds good.