ScrapeBox Forum
Custom data grabber with regex issue




Custom data grabber with regex issue - Splendens - 01-15-2020

Hello,

I'm looking to use Scrapebox to scrape every domain name mentioned on a list of just under 4000 web page URLs.

The domain names are formatted on the pages like so:

Scrapeboxforum.com
Scrapeboxinfo.net
Scrapeboxhub.org

The domain names are plain text. They are not hyperlinks.

If it helps, they always sit between <td> and </td> tags.

I already have my list of almost 4000 urls I want to scan.

I am using 5 private proxies that have been tested and saved.
I think they're being applied when using the Custom Data Grabber, but honestly I struggle with Scrapebox.

I created inbound and outbound rules for Scrapebox in Windows Firewall.
Other things in Scrapebox do work for me, like grabbing internal links on the domain I'm getting the URLs from.

I created a Custom Data Grabber Module and under that a Module Mask:

https://imgur.com/a/TpER4Q3


I tried several regex examples and found this one:

Code:
^(?=.{1,253}\.?$)(?:(?!-|[^.]+_)[A-Za-z0-9-_]{1,63}(?<!-)(?:\.|$)){2,}$


Source: https://stackoverflow.com/a/41193739/5048548


I tested it using the tool on https://regex101.com/ and 3 sample urls come up as matches (as far as I can tell?):

https://imgur.com/iVR422q


However, when I run my Module all I get is this:

https://imgur.com/dGgD3Ft


The Module data folder contains a csv for every time I run the Module, containing two odd characters in the first cell:

https://imgur.com/OS3uupX


I ran several of the urls through browseo.net and the domain names on those urls are readable according to that tool.

Does anyone know where I'm going wrong here?
Or is there a better way to scrape domain name MENTIONS from a list of urls?

Thank you in advance!


RE: Custom data grabber with regex issue - loopline - 01-16-2020

Remove the ^ at the front and the $ at the end; Scrapebox adds those automatically.

Also here is a little info directly from support:
It uses PCRE, which is Perl Compatible Regular Expressions https://www.google.com/search?q=Perl+Compatible+Regular+Expressions&ie=utf-8



Any regex should have the leading ^ and trailing $ removed. Those anchors mean "match at the start and end of the line", but when scraping from HTML the data isn't going to sit neatly at the start and end of a line of source code; there will be other HTML and content before and after the data being scraped.

The regexes here should all work http://www.regexlib.com/Search.aspx?k=phone%20number but if one starts with ^ and ends with $, those simply need to be removed.
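
To illustrate the point about anchors, here is a minimal Python sketch. It assumes Python's re module treats ^ and $ the same way PCRE does for this simple case; the HTML snippet and the domain pattern are made up for the example and say nothing about how Scrapebox processes a page internally.

```python
import re

# Hypothetical page source: the domain sits inside a table cell,
# not on a line of its own.
html = "<tr><td>Scrapeboxforum.com</td></tr>"

# Hypothetical domain pattern, written once with anchors and once without.
anchored = re.compile(r"^[A-Za-z0-9-]+\.(?:com|net|org)$")
unanchored = re.compile(r"[A-Za-z0-9-]+\.(?:com|net|org)")

# ^ and $ require the domain to be the entire text,
# so the tags around it block the match.
anchored_hit = anchored.search(html)

# Without anchors the match may start and stop anywhere in the source.
unanchored_hit = unanchored.search(html)

print(anchored_hit)              # None
print(unanchored_hit.group(0))   # Scrapeboxforum.com
```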


RE: Custom data grabber with regex issue - Splendens - 01-16-2020

Hi Matt,

Thank you for taking the time to reply. And for all the awesome instructional content you have created for Scrapebox.

I didn't come across any information on the type of regex used by Scrapebox (and didn't know there were several types). Is there any official Scrapebox documentation that I haven't come across?

I like finding solutions on my own without bothering anyone, but sometimes I feel that's hard to do within the Scrapebox ecosystem.

I removed the ^ at the front and the $ at the end. I read you're not the biggest fan of regex, so thank you again for having a look at this. :)

The Custom Data Grabber still didn't return any domains using the updated Module. However, it did create empty txt files in the module data folder instead of csv files with two weird characters in the first cell.

Then I tried the following regex from the site you linked:

Code:
[a-zA-Z0-9\-\.]+\.(com|org|net|mil|edu|COM|ORG|NET|MIL|EDU)


Source: http://www.regexlib.com/REDetails.aspx?regexp_id=25

Removed the ^ at the front and the $ at the end.
Added the specific tlds I wanted to have returned.

And now it's working.

I don't quite know why. But here we are. Hopefully it will help others searching for a solution in the future.
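
A possible explanation, sketched in Python: even with the outer ^ and $ removed, the first pattern still contains a $ inside its lookahead and inside (?:\.|$), so every label has to be followed by a dot or by the end of the input. In a page, "com" is followed by "<", so nothing matches. The second pattern has no such requirement. This assumes Python's re module behaves like PCRE for these constructs, and the HTML string is invented for the test; it is not a statement about Scrapebox internals.

```python
import re

# Hypothetical page source with the three sample domains in table cells.
html = (
    "<td>Scrapeboxforum.com</td>"
    "<td>Scrapeboxinfo.net</td>"
    "<td>Scrapeboxhub.org</td>"
)

# First pattern from the thread, with only the leading ^ and trailing $ removed.
first = re.compile(
    r"(?=.{1,253}\.?$)(?:(?!-|[^.]+_)[A-Za-z0-9-_]{1,63}(?<!-)(?:\.|$)){2,}"
)

# Second pattern from the thread, already without anchors.
second = re.compile(r"[a-zA-Z0-9\-\.]+\.(com|org|net|mil|edu|COM|ORG|NET|MIL|EDU)")

# group(0) is the whole match; findall() would return only the captured TLD.
first_hits = [m.group(0) for m in first.finditer(html)]
second_hits = [m.group(0) for m in second.finditer(html)]

# The first pattern does match when the domain is the entire input,
# which is the situation in a regex tester fed one domain per line.
bare_hit = first.search("Scrapeboxforum.com")

print(first_hits)         # []
print(second_hits)        # ['Scrapeboxforum.com', 'Scrapeboxinfo.net', 'Scrapeboxhub.org']
print(bare_hit.group(0))  # Scrapeboxforum.com
```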

Thank you again for taking the time to reply. What's the best way people can support you considering you have no official ties to Scrapebox, but do so much for Scrapebox users?