Hello,
I'm looking to use Scrapebox to scrape all domain name mentions from a list of just under 4,000 web page URLs.
The domain names are formatted on the pages like so:
Scrapeboxforum.com
Scrapeboxinfo.net
Scrapeboxhub.org
The domain names are plain text. They are not hyperlinks.
If it helps, they are also always in between <td> and </td> elements.
I already have my list of almost 4000 urls I want to scan.
I am using 5 private proxies that have been tested and saved.
I think they're being applied when using the Custom Data Grabber, but honestly I struggle with Scrapebox.
I created inbound and outbound rules for Scrapebox in Windows Firewall.
Other things I do with Scrapebox work fine, such as grabbing internal links on the domain I'm getting the URLs from.
I created a Custom Data Grabber Module and under that a Module Mask:
https://imgur.com/a/TpER4Q3
I tried several regex examples and found this one:
Code:
^(?=.{1,253}\.?$)(?:(?!-|[^.]+_)[A-Za-z0-9-_]{1,63}(?<!-)(?:\.|$)){2,}$
Source: https://stackoverflow.com/a/41193739/5048548
I tested it using the tool on https://regex101.com/ and 3 sample urls come up as matches (as far as I can tell?):
https://imgur.com/iVR422q
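One possible explanation, sketched in Python rather than in Scrapebox (whose regex engine may behave differently): the pattern starts with ^ and ends with $, so it only matches when the domain is the entire string or line being tested. That would fit a test where each sample sits on its own line, but not a page where the domain is wrapped in <td> and </td>. The UNANCHORED pattern below is a hypothetical, looser replacement written for this illustration; it is not from the Stack Overflow answer and does not validate label lengths.

```python
import re

# The pattern from the Stack Overflow answer, unchanged (note the ^ and $ anchors).
ANCHORED = re.compile(
    r"^(?=.{1,253}\.?$)(?:(?!-|[^.]+_)[A-Za-z0-9-_]{1,63}(?<!-)(?:\.|$)){2,}$"
)

# Hypothetical looser pattern with no anchors, so it can match inside a longer string.
UNANCHORED = re.compile(r"\b[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)*\.[A-Za-z]{2,}\b")

bare = "Scrapeboxforum.com"
row = "<td>Scrapeboxforum.com</td>"

# The anchored pattern accepts the bare domain...
print(ANCHORED.search(bare) is not None)   # True
# ...but finds nothing once the same domain sits inside a table cell.
print(ANCHORED.search(row) is not None)    # False
# The unanchored pattern pulls the domain out of the surrounding markup.
print(UNANCHORED.findall(row))             # ['Scrapeboxforum.com']
```

If this is the cause, a mask without the ^ and $ anchors would be the thing to try in the Module Mask.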
However, when I run my Module all I get is this:
https://imgur.com/dGgD3Ft
The Module data folder contains a csv for every time I run the Module, containing two odd characters in the first cell:
https://imgur.com/OS3uupX
I ran several of the urls through browseo.net and the domain names on those urls are readable according to that tool.
Does anyone know where I'm going wrong here?
Or is there a better way to scrape domain name MENTIONS from a list of urls?
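For comparison, here is a minimal sketch of one alternative outside Scrapebox, using only the Python standard library. It assumes the page HTML has already been downloaded and that each domain is the only text inside its <td> cell. The names TdDomainParser, domains_in_cells and DOMAIN are made up for this example, and the domain pattern is a rough approximation, not a full validator.

```python
import re
from html.parser import HTMLParser

# Rough, assumed shape of a domain name: dot-separated labels ending in a letters-only TLD.
DOMAIN = re.compile(r"[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)*\.[A-Za-z]{2,}")


class TdDomainParser(HTMLParser):
    """Collects the text of every <td> cell whose whole content looks like a domain."""

    def __init__(self):
        super().__init__()
        self.in_td = False
        self.buffer = []
        self.domains = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.in_td = True
            self.buffer = []

    def handle_data(self, data):
        # Text may arrive in several chunks, so gather it until the cell closes.
        if self.in_td:
            self.buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self.in_td:
            text = "".join(self.buffer).strip()
            if DOMAIN.fullmatch(text):
                self.domains.append(text)
            self.in_td = False


def domains_in_cells(html):
    """Return the domain-like cell texts found in one page of HTML, in page order."""
    parser = TdDomainParser()
    parser.feed(html)
    parser.close()
    return parser.domains


sample = (
    "<table><tr><td>Scrapeboxforum.com</td><td>not a domain</td></tr>"
    "<tr><td> Scrapeboxinfo.net </td><td>Scrapeboxhub.org</td></tr></table>"
)
print(domains_in_cells(sample))
```

Fetching the pages (and routing the requests through the proxies) is left out here; the function only needs the HTML text of each page.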
Thank you in advance!