
OneBox / Universal Search footprints
I need to scrape the universal search / OneBox elements that appear inside organic search results, such as images, news, the local 3-pack, and videos.

At the moment I do this with iMacros: I locate the HTML elements I need with XPath and/or CSS, scrape them, and save the results as a CSV file. But the volume of scraping keeps growing, so I can no longer do it in a single thread. I plan to use Scrapebox for this.

Before I start, I'd like to know in detail: should I train the harvester with XPath/CSS, is there already something that works out of the box, or is there some other approach?

Scrapebox ninjas, please share your knowledge :)

PS: As far as I currently know, the only way to do this with Scrapebox is to use a regex mask to pick out the URLs I need. Is that true?
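
To make the "regex mask" idea concrete, here is a minimal Python sketch of pulling URLs out of page HTML with a regular expression. It is only an illustration of the technique, not how Scrapebox works internally; the sample markup, the `URL_MASK` pattern, and the `find_urls` helper are all assumptions made up for this example.

```python
import re

# Hypothetical sample of result-page HTML; real markup will differ.
SAMPLE_HTML = '''
<div class="news"><a href="https://example.com/news/1">Story</a></div>
<div class="video"><a href="https://example.org/watch?v=abc">Clip</a></div>
<div class="organic"><a href="/relative/link">Skip me</a></div>
'''

# The "mask": capture absolute http(s) URLs found inside href attributes.
URL_MASK = re.compile(r'href="(https?://[^"]+)"')


def find_urls(html):
    """Return every absolute URL matched by the mask, in document order."""
    return URL_MASK.findall(html)


print(find_urls(SAMPLE_HTML))
```

A mask this loose matches every absolute link on the page, so in practice it would probably need extra context (for example a class name before `href`) to target only one result type.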
Honestly, I don't clearly understand what you're trying to scrape, so I can't really give you an accurate answer.

Probably "test and see" is the best answer.

But it sounds like you could possibly use the custom harvester or the custom data scraper. The custom data scraper does support regex.

But the image grabber is separate, and there is a Google image grabber.

So I'm not even totally sure Scrapebox will do what you want. If you just need image URLs and video URLs, then maybe. But if you need to download the videos and images, collect the URLs, and aggregate all of that into a CSV, it could get tricky or impossible, depending on the volume you want to handle and how precisely things need to be matched up.
Thank you, Loopline. Scrapebox does what I need, but in a slightly different way than I first thought. It doesn't use XPath, regular expressions, or CSS selectors. What I need to do is create a custom (new) search engine and enter as footprints exactly the text that appears before and after the URLs I want to scrape.

This workaround runs into trouble if the markup around the URLs to be scraped contains a unique ID. I don't have that case, though, and maybe Scrapebox can also scrape with a single footprint, the one that comes after the URL.
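
For anyone trying to picture the before/after footprint approach, here is a rough Python sketch of the same idea: grab whatever sits between a fixed "before" string and a fixed "after" string. This is not Scrapebox's actual implementation; the sample markup, the `BEFORE` and `AFTER` markers, and the `extract_between` helper are hypothetical and exist only for illustration.

```python
def extract_between(html, before, after):
    """Collect every substring that sits between `before` and `after`."""
    results = []
    pos = 0
    while True:
        start = html.find(before, pos)
        if start == -1:
            break  # no more "before" markers
        start += len(before)
        end = html.find(after, start)
        if end == -1:
            break  # "before" found but never closed by "after"
        results.append(html[start:end])
        pos = end + len(after)
    return results


# Hypothetical markup: only the links with class "vid" are wanted.
SAMPLE = (
    '<div><a class="vid" href="https://example.com/v1">A</a></div>'
    '<div><a class="img" href="https://example.com/i1">B</a></div>'
    '<div><a class="vid" href="https://example.com/v2">C</a></div>'
)
BEFORE = '<a class="vid" href="'  # assumed text right before the URL
AFTER = '">'                      # assumed text right after the URL

print(extract_between(SAMPLE, BEFORE, AFTER))
```

Because the markers are matched literally, a "before" string that contains a per-result unique ID would never match twice, which is exactly the limitation described above.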
