[Email Scraper Plugin - Custom Crawler] How to scrape websites with no distinct markers
#1
Hello,

I purchased the Email Scraper Plugin and I'm trying to scrape a directory. I want to scrape each company's name and email.

However, the website I'm scraping doesn't use unique markers. For example, the company name is just a <span>, so SB catches every other <span> on the page.
Do you know of any solution to this?
I've tried writing the marker with the code above the <span> to make it unique, but ScrapeBox doesn't allow line breaks.

Ex:
Code:
<h2>Company</h2>
 <span>ScrapeBox</span>


Also, some companies have no email, so the marker picks up the next company's email in the directory, which mixes up all the names and emails.
I'm trying to figure out how to solve this and I would love some help...
#2
ScrapeBox can work with line breaks: you can use #13#10 (carriage return + line feed). I don't think it needs it, though; it should search without it.

I wouldn't know unless I saw the actual HTML, though.

As for the email not being there, that I'm not sure about.
#3
I'll find out about skipping the missing email.
#4
(07-11-2018, 04:31 AM)loopline Wrote: Scrapebox can work with line breaks, you can use #13#10  but I don't think it needs it, it should search without it.  

I wouldn't know unless I saw the actual html though.  

Thanks for your help.

This is the source code with the COMPANY field:
Code:
           <div class="lbb-result__header">

 <h2>
   <span>COMPANY</span>
   - <small>CITY</small>
   
 </h2>

I tried using 
Code:
<div class="lbb-result__header"> <h2> <span>
or 
Code:
<div class="lbb-result__header"><h2><span>
or 
Code:
<div class="lbb-result__header">#13#10<h2>#13#10<span>
as "before" markers (with a closing </span> as the "after" marker), but without any success.
#5
You will likely need to take into account every space, tab, line feed and carriage return.  
So

Code:
<div class="lbb-result__header">#13#10#13#10 <h2>#13#10   <span>
Should work
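
Rather than counting the whitespace by hand, one option is to generate the marker from the raw page source. The following is only a rough Python sketch: `to_marker` and `before_marker` are hypothetical helper names, and the only notation it relies on is the #13 (carriage return) and #10 (line feed) codes mentioned above. Whether ScrapeBox accepts #10 on its own is an assumption I have not verified.

```python
def to_marker(raw: str) -> str:
    """Encode CR and LF using the #13 / #10 notation from this thread.

    Spaces and tabs are left untouched, so they must be pasted as-is.
    """
    return raw.replace("\r", "#13").replace("\n", "#10")


def before_marker(page: str, start: str, end: str) -> str:
    """Return the encoded text from `start` through the first `end` after it."""
    i = page.index(start)              # raises ValueError if `start` is missing
    j = page.index(end, i) + len(end)  # include the closing fragment itself
    return to_marker(page[i:j])


if __name__ == "__main__":
    # Sample built from the HTML posted above, assuming CR LF line endings.
    sample = (
        '<div class="lbb-result__header">\r\n\r\n'
        ' <h2>\r\n'
        '   <span>COMPANY</span>\r\n'
    )
    print(before_marker(sample, '<div class="lbb-result__header">', "<span>"))
```

If you load a saved copy of the page, open it with `open(path, newline="")` so Python does not translate the line endings. If the site turns out to use bare line feeds, the generated marker will contain #10 without #13, which could be one reason a #13#10 marker fails to match.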
#6
(07-12-2018, 05:52 AM)loopline Wrote: You will likely need to take into account every space, tab, line feed and carriage return.  
So

Code:
<div class="lbb-result__header">#13#10#13#10 <h2>#13#10   <span>
Should work

I tried, but that's not working either...
#7
Not sure. There is probably a space I missed or something, but you get the idea. You can revisit the HTML, compare it with what I posted, and build the marker the same way, but make it match the exact HTML.

As for the actual skipping of data, it's only looking for markers.

So if the email isn't there, it's going to proceed to the next marker; there is no other way to do it. You would need to have a custom scraper coded by a developer that knows to move on to the next entry if the email isn't present, so that each name stays paired with the right email.
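
To give an idea of what such a custom scraper might look like, here is a minimal Python sketch. It is not how ScrapeBox works internally. It assumes every entry starts with the `<div class="lbb-result__header">` block posted earlier and that the email, when present, appears as plain text somewhere before the next entry; the real page may be laid out differently. `scrape_entries` and the regular expressions are hypothetical.

```python
import re

# Assumption: each directory entry begins with this exact opening tag.
HEADER = '<div class="lbb-result__header">'
# \s* absorbs any spaces, tabs and line breaks between <h2> and <span>.
NAME_RE = re.compile(r"<h2>\s*<span>(.*?)</span>", re.S)
# Deliberately simple email pattern, good enough for a sketch.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def scrape_entries(page: str) -> list[tuple[str, str]]:
    """Return (company, email) pairs; email is "" when the entry has none."""
    rows = []
    # Everything between two headers belongs to a single entry, so an email
    # from the next company can never be attached to the current one.
    for block in page.split(HEADER)[1:]:
        name = NAME_RE.search(block)
        if name is None:
            continue  # not a company entry after all
        email = EMAIL_RE.search(block)
        rows.append((name.group(1).strip(), email.group(0) if email else ""))
    return rows


if __name__ == "__main__":
    demo = (
        HEADER + "\n\n <h2>\n   <span>Acme</span>\n   - <small>Paris</small>\n </h2>\n"
        + HEADER + "\n <h2>\n   <span>Beta</span>\n </h2>\n contact: info@beta.example\n"
    )
    for company, email in scrape_entries(demo):
        print(company, email or "(no email)")
```

The point of splitting the page per entry first is that a missing email simply yields an empty field instead of shifting every later email up by one row.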
#8
OK, thanks for your help!
#9
You're welcome, cheers!