ScrapeBox Forum

scrape article content - gavner25 - 08-15-2020

Hi, I'm trying to scrape only the content (text) from this site, which has the same layout on every page, so I can use the same formula for each one. I'm using the premium article scraper plugin, but I'm having trouble scraping.

See my video for the issue. https://www.loom.com/share/417ad357633b4a869ff73ae6408823c9

The example URL I'm scraping is here:
https://fr.wikihow.com/entrainer-votre-chien-%C3%A0-chasser


RE: scrape article content - loopline - 08-15-2020

I don't know what the after marker should be, but it's an after marker issue.

ScrapeBox starts at the before marker, then finds the first occurrence of the after marker, and scrapes all the data in between.

Your after marker is found at each paragraph, which is why each paragraph is being scraped separately.

You need an after marker that is found only after all the paragraphs.
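
The before/after marker behaviour described above can be illustrated with a small Python sketch. This is not ScrapeBox's actual code: the function `scrape_between` and the sample page are hypothetical, and the assumption is that matching repeats for every occurrence of the before marker.

```python
def scrape_between(html: str, before: str, after: str) -> list[str]:
    """Collect each span that starts right after `before` and ends at the
    FIRST following occurrence of `after` (assumed marker behaviour)."""
    results = []
    pos = 0
    while True:
        start = html.find(before, pos)
        if start == -1:
            break
        start += len(before)
        end = html.find(after, start)  # first occurrence after the before marker
        if end == -1:
            break
        results.append(html[start:end])
        pos = end + len(after)
    return results


# Hypothetical page: two paragraphs inside one body container.
page = '<div id="body"><p>One</p><p>Two</p></div><div class="footer">'

# After marker occurs at the end of every paragraph: paragraphs come out separately.
per_paragraph = scrape_between(page, "<p>", "</p>")
print(per_paragraph)  # ['One', 'Two']

# After marker occurs only once, after all paragraphs: one combined result.
whole_body = scrape_between(page, '<div id="body">', "</div>")
print(whole_body)  # ['<p>One</p><p>Two</p>']
```

The second call is the situation to aim for: a before marker that opens the article body and an after marker that first appears only once the last paragraph has ended.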


RE: scrape article content - serialscraper - 08-27-2020

The following config will work.
However, you will need to filter out the image code between the {...}.
This can be done with a wildcard search and replace.
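
If you prefer to do that cleanup outside ScrapeBox, a rough Python equivalent of a wildcard search and replace might look like the sketch below. It assumes the leftover image code really is delimited by a single pair of curly braces with no nesting; the function name `strip_braced` and the sample text are made up for illustration.

```python
import re


def strip_braced(text: str) -> str:
    """Remove every {...} run, similar to replacing the wildcard {*} with nothing."""
    # Non-greedy match so each {...} run is removed on its own;
    # DOTALL lets a run span several lines.
    cleaned = re.sub(r"\{.*?\}", "", text, flags=re.DOTALL)
    # Collapse the doubled spaces left behind by the removals.
    return re.sub(r"[ \t]{2,}", " ", cleaned).strip()


# Made-up sample of scraped text with leftover image code.
sample = "Step one. {img src=a.jpg} Step two. {img src=b.jpg}"
cleaned_sample = strip_braced(sample)
print(cleaned_sample)  # Step one. Step two.
```

The config itself follows.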

[fr.wikihow title]
Before1=%3Ctitle%3E
After1=%3C%2Ftitle%3E
Before2=
After2=
Before3=
After3=
UseRegex=0
RegEx=
Width=150
Type=0

[fr.wikihow body]
Before1=%3Cdiv%20class%3D%22mf-section-0%22%20id%3D%22mf-section-0%22%3E
After1=%3Cdiv%20class%3D%22printfooter%22%3E
Before2=
After2=
Before3=
After3=
UseRegex=0
RegEx=
Width=150
Type=1

[EXPORTPARAMETERS]
ExportMode=0
ExportEncoding=3
ExportPrefix=<DATE>_<TIME>_<NUM>
ExportFolder=
ExportFilenameField=0
ExportSeparator=<LF>
ExportFileExist=1
ExportFields=0,1

[Other]
DoNotApplyCharSet=0
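
The Before/After values in the config are percent-encoded. To see the HTML they actually match, or to encode a marker of your own, something like the following Python sketch should work. It only uses the standard library; the dictionary name `markers` is just for illustration, and the assumption is that the values use plain URL percent-encoding, which is what they look like.

```python
from urllib.parse import quote, unquote

# Encoded marker values copied from the config above.
markers = {
    "title Before1": "%3Ctitle%3E",
    "title After1": "%3C%2Ftitle%3E",
    "body Before1": "%3Cdiv%20class%3D%22mf-section-0%22%20id%3D%22mf-section-0%22%3E",
    "body After1": "%3Cdiv%20class%3D%22printfooter%22%3E",
}

# Decode each marker back to readable HTML.
decoded = {name: unquote(value) for name, value in markers.items()}
for name, html in decoded.items():
    print(f"{name}: {html}")

# Going the other way: encode a marker for pasting into a config.
# safe="" makes quote() encode "/" as well, matching the %2F seen above.
encoded_footer = quote('<div class="printfooter">', safe="")
print(encoded_footer)
```

Decoded, the body markers are `<div class="mf-section-0" id="mf-section-0">` and `<div class="printfooter">`, so the body is captured from the start of the first section down to the print footer, which appears only once, after all the paragraphs.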