scrape article content
#1
Hi, I'm trying to scrape only the text content from this site. Every page has the same layout, so I should be able to use the same formula for all of them. I'm using the Premium Article Scraper plugin, but I'm having trouble scraping.

See my video for the issue. https://www.loom.com/share/417ad357633b4...e6408823c9

Here is an example of a URL I'm scraping:
https://fr.wikihow.com/entrainer-votre-c...A0-chasser
Reply
#2
I don't know what the after marker should be, but this is an after-marker issue.

ScrapeBox starts at the before marker, finds the first occurrence of the after marker, and scrapes all the data in between.

Your after marker occurs inside each paragraph, which is why each paragraph is being scraped separately.

You need an after marker that appears only after all of the paragraphs.
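
To illustrate why the choice of after marker matters, here is a rough Python sketch of the "first before marker, then first after marker" behaviour described above. It is only a model of that description, not ScrapeBox's actual code; the function name `scrape_between` and the sample HTML are made up for the example.

```python
def scrape_between(html, before, after):
    """Return the text between the first `before` marker and the
    first `after` marker that follows it, or None if either is missing."""
    start = html.find(before)
    if start == -1:
        return None
    start += len(before)
    end = html.find(after, start)
    if end == -1:
        return None
    return html[start:end]


# Hypothetical page: two paragraphs inside a body div, then a footer div.
page = "<div id='body'><p>One.</p><p>Two.</p></div><div class='footer'>"

# An after marker that occurs inside every paragraph stops the scrape early.
early = scrape_between(page, "<div id='body'>", "</p>")

# An after marker that only occurs after all paragraphs captures the lot.
late = scrape_between(page, "<div id='body'>", "<div class='footer'>")

print(repr(early))
print(repr(late))
```

With the `</p>` marker the result ends at the first paragraph, while the footer marker returns both paragraphs in one piece.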
Reply
#3
The following config will work.
However, you will need to filter out the image code that appears between the {...} braces.
This can be done with a wildcard search and replace.
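
If you would rather do that clean-up outside ScrapeBox, the same wildcard replace can be approximated with a regular expression. This is a minimal sketch that assumes the image code always sits inside a single, non-nested pair of braces; the function name `strip_braced` and the sample text are invented for illustration.

```python
import re

# Matches an opening brace, any run of non-brace characters, and a closing brace.
# Assumption: the braces are never nested.
BRACED = re.compile(r"\{[^{}]*\}")


def strip_braced(text):
    """Remove every {...} span from the scraped text."""
    return BRACED.sub("", text)


sample = "Step one.{image code here}Step two.{more image code}Step three."
cleaned = strip_braced(sample)
print(cleaned)
```

If the real image code can contain nested braces, this pattern would need to be adjusted.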

[fr.wikihow title]
Before1=%3Ctitle%3E
After1=%3C%2Ftitle%3E
Before2=
After2=
Before3=
After3=
UseRegex=0
RegEx=
Width=150
Type=0

[fr.wikihow body]
Before1=%3Cdiv%20class%3D%22mf-section-0%22%20id%3D%22mf-section-0%22%3E
After1=%3Cdiv%20class%3D%22printfooter%22%3E
Before2=
After2=
Before3=
After3=
UseRegex=0
RegEx=
Width=150
Type=1

[EXPORTPARAMETERS]
ExportMode=0
ExportEncoding=3
ExportPrefix=<DATE>_<TIME>_<NUM>
ExportFolder=
ExportFilenameField=0
ExportSeparator=<LF>
ExportFileExist=1
ExportFields=0,1

[Other]
DoNotApplyCharSet=0
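
The Before1/After1 values in the config above are percent-encoded, which makes them hard to read. As a quick way to see the actual HTML markers, they can be decoded with Python's standard `urllib.parse.unquote`. The dictionary keys below are just labels chosen for this example.

```python
from urllib.parse import unquote

# Marker values copied from the config above; the keys are only labels.
encoded_markers = {
    "title_before": "%3Ctitle%3E",
    "title_after": "%3C%2Ftitle%3E",
    "body_before": "%3Cdiv%20class%3D%22mf-section-0%22%20id%3D%22mf-section-0%22%3E",
    "body_after": "%3Cdiv%20class%3D%22printfooter%22%3E",
}

# Decode each percent-encoded marker back into plain HTML.
decoded_markers = {name: unquote(value) for name, value in encoded_markers.items()}

for name, marker in decoded_markers.items():
    print(f"{name}: {marker}")
```

The body scrape therefore starts at the `mf-section-0` div and stops at the `printfooter` div, which only appears after all of the paragraphs.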
[-] The following 1 user says Thank You to serialscraper for this post:
  • loopline
Reply