ScrapeBox Forum
Custom data masks capturing too much irrelevant data




Custom data masks capturing too much irrelevant data - jamesmel - 05-30-2016

I'm trying to build a custom data grabber but am having issues with the masks. I'm trying to get the external URL from this HTML:

Code:
<a href="http://www.somesite.com" compid="Profile_Website" target="_blank" class="proWebsiteLink" data-trackinglink="http://www.houzz.co.uk/trk/aHR0cDovL3d3dy5taGNvc3RhLmNvbQ/547eced61fa4c4209238cfc20c387147/ue/MTcwODAyOTg/1a31f260679248b4074981417ba52232" rel="nofollow">


To extract this I was trying:

Code:
before_after="|" compid="Profile_Website" target="_blank" class="proWebsiteLink"

But it returns too much data, most of it irrelevant. I only need the http://www.somesite.com part.

Any ideas where I'm going wrong on this?

Similarly, I'm trying to extract the profile name from this HTML:

Code:
<a class="profile-full-name" itemprop="name" href="http://www.houzz.co.uk/pro/mhcostaconstruction/mh-costa-construction-ltd">Acme Co Ltd</a>

But I'm unsure how to extract this using before|after, as the href URL will change for each profile, and if I just target before_after=">|</a> it will pull every link. Any idea how I could capture the company name, e.g. Acme Co Ltd, from the above HTML?


RE: Custom data masks capturing too much irrelevant data - loopline - 05-31-2016

before_after=href="|" compid="Profile_Website"

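seems like it would work. Your original before marker was only a double quote, which is not specific enough; adding href= pins it to the start of the link.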
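
If you want to sanity-check that mask outside ScrapeBox, here is a rough Python equivalent. It is only a sketch: the names extract_website, WEBSITE_RE and SAMPLE_LINK are made up for illustration, the sample tag is trimmed from the one posted above, and it says nothing about how ScrapeBox applies masks internally.

```python
import re

# Anchor tag from the first post, trimmed to the attributes that matter here.
SAMPLE_LINK = (
    '<a href="http://www.somesite.com" compid="Profile_Website" '
    'target="_blank" class="proWebsiteLink" rel="nofollow">'
)

# Capture the text between href=" and the closing quote that is followed
# by the compid attribute. [^"]* cannot cross a quote, so the capture
# stays inside the href value instead of swallowing other attributes.
WEBSITE_RE = re.compile(r'href="([^"]*)"\s+compid="Profile_Website"')


def extract_website(html):
    """Return the profile's external URL, or None if the tag is absent."""
    match = WEBSITE_RE.search(html)
    return match.group(1) if match else None


print(extract_website(SAMPLE_LINK))  # prints http://www.somesite.com
```

This assumes compid comes straight after href, as it does in the posted HTML. If the attribute order differs on other pages, the pattern would need loosening.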


Well, you could extract the whole URL and then just toss it in Excel and strip away the extra data pretty easily

or

you could try to build a regex, which I think would work fine, but I am no expert on regex. You can try a regex or coding forum for that

or you could just try

before_after=">|</a>

but that might not give you what you're after; it would give you pretty much every anchor on the page. So a regex is probably the way to go.
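
As a starting point for that regex, something along these lines might do it. Treat it as a sketch, tested here in Python rather than in ScrapeBox itself. It assumes class="profile-full-name" is the first attribute of the tag, as in the posted HTML, and that the name contains no nested tags. The names extract_profile_names, PROFILE_NAME_RE and SAMPLE_PAGE are made up, and the "Home" link is a hypothetical extra anchor added only to show it gets skipped.

```python
import re

# The profile anchor from the first post, preceded by a hypothetical
# unrelated link to show that ordinary anchors are not captured.
SAMPLE_PAGE = (
    '<a href="http://www.houzz.co.uk/">Home</a>'
    '<a class="profile-full-name" itemprop="name" '
    'href="http://www.houzz.co.uk/pro/mhcostaconstruction/mh-costa-construction-ltd">'
    'Acme Co Ltd</a>'
)

# Anchor on the class attribute, skip the remaining attributes (including
# the href that changes per profile) with [^>]*, then capture the link text.
PROFILE_NAME_RE = re.compile(r'<a\s+class="profile-full-name"[^>]*>([^<]*)</a>')


def extract_profile_names(html):
    """Return the text of every profile-full-name anchor in the page."""
    return [name.strip() for name in PROFILE_NAME_RE.findall(html)]


print(extract_profile_names(SAMPLE_PAGE))  # prints ['Acme Co Ltd']
```

The [^>]* part is what gets around the changing href: it does not care what the other attributes contain, only that the tag starts with the right class.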