ScrapeBox Forum
Scrape ALL URLs for a Domain - Printable Version

+- ScrapeBox Forum (https://www.scrapeboxforum.com)
+-- Forum: ScrapeBox Main Discussion (https://www.scrapeboxforum.com/Forum-scrapebox-main-discussion)
+--- Forum: General ScrapeBox Talk (https://www.scrapeboxforum.com/Forum-general-scrapebox-talk)
+--- Thread: Scrape ALL URLs for a Domain (/Thread-scrape-all-urls-for-a-domain)



Scrape ALL URLs for a Domain - karlos - 10-15-2011

I want to scrape one particular domain to submit all their URLs (about 19,000 pages) for indexing.

The domain apparently doesn't have a sitemap (at least I couldn't find one), and if I put site:http://www.rootdomain.com in the footprint window, use proxies and harvest, ScrapeBox deletes 99% of the results, apparently because "the keywords were maybe too similar".

Should I use the site: search differently, or is there another way to scrape just one domain?

Can you help?

K


RE: Scrape ALL URLs for a Domain - s4nt0s - 10-15-2011

(10-15-2011, 12:38 PM)karlos Wrote: I want to scrape one particular domain to submit all their URLs (about 19,000 pages) for indexing.

The domain apparently doesn't have a sitemap (at least I couldn't find one), and if I put site:http://www.rootdomain.com in the footprint window, use proxies and harvest, ScrapeBox deletes 99% of the results, apparently because "the keywords were maybe too similar".

Should I use the site: search differently, or is there another way to scrape just one domain?

Can you help?

K

First, go to the toolbar at the top of ScrapeBox, select "Options", and uncheck "Automatically Remove Duplicate Domains".

Second, make sure you only have Google, Bing and AOL checked when using the site: command. Yahoo doesn't support the "site:" command as far as I know.

After you're done harvesting, go to Remove/Filter > Remove Duplicate URLs.

Since you're harvesting from three different search engines, you will get a lot of the same URLs.

Problem solved. Then sit back and enjoy a beer.
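If you ever want to do that duplicate-removal step outside ScrapeBox (say, on an exported URL list), it boils down to a first-seen filter. A minimal Python sketch, assuming only that the list is one URL per line; the sample URLs below are hypothetical, and the trailing-slash normalization is my own addition, not something ScrapeBox necessarily does:

```python
def dedupe_urls(urls):
    """Keep the first occurrence of each URL, treating a trailing
    slash as the same page (an assumption, not ScrapeBox's rule)."""
    seen = set()
    out = []
    for url in urls:
        key = url.strip().rstrip("/")
        if key and key not in seen:
            seen.add(key)
            out.append(url.strip())
    return out

# Hypothetical harvest: the same pages returned by different engines.
harvested = [
    "http://www.rootdomain.com/page1",
    "http://www.rootdomain.com/page1/",   # same page, trailing slash
    "http://www.rootdomain.com/page2",
    "http://www.rootdomain.com/page2",    # exact duplicate from another engine
]
print(dedupe_urls(harvested))
# → ['http://www.rootdomain.com/page1', 'http://www.rootdomain.com/page2']
```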




RE: Scrape ALL URLs for a Domain - scrapebrokers - 10-16-2011

This will scrape only indexed URLs, which is good, but I prefer to scrape as many URLs as possible, even the non-indexed ones. As the original poster said, he wants to send the URLs for indexing, so there is a high chance the site has a lot of non-indexed URLs.

I load the ScrapeBox Links Extractor addon, set the connections to 30-50, tick "Internal" only, and then load a file with the list of URLs whose pages I want to harvest. I hit start, and when it's done I end up with a list of URLs. If the list is relatively small (less than 100-200K), you can click show/edit links and delete the duplicates or unwanted links. Then save the list and repeat the procedure, but each time LOAD the new file you just saved. This way you harvest the URLs like a spider. When you stop getting new URLs, simply stop, remove duplicates, and you are done.
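That spider-style loop (extract internal links, re-feed the new list, stop when nothing new appears) can be sketched outside ScrapeBox as a small Python script. This is a stdlib-only sketch, not the addon's actual code: the fetch function is pluggable (a real run would fetch pages with urllib.request), and the max_pages cap is my own safety limit:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class LinkParser(HTMLParser):
    """Collect raw href values from <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def internal_links(base_url, html):
    """Return absolute links from html that stay on base_url's host."""
    parser = LinkParser()
    parser.feed(html)
    host = urlparse(base_url).netloc
    out = []
    for href in parser.links:
        absolute = urljoin(base_url, href).split("#")[0]
        if urlparse(absolute).netloc == host:
            out.append(absolute)
    return out

def spider(start_url, fetch, max_pages=1000):
    """Breadth-first crawl: keep feeding newly found internal URLs
    back in; stop when no new URLs appear (or max_pages is hit)."""
    seen = {start_url}
    queue = [start_url]
    while queue and len(seen) < max_pages:
        url = queue.pop(0)
        try:
            html = fetch(url)
        except Exception:
            continue  # skip unreachable pages
        for link in internal_links(url, html):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return sorted(seen)
```

For a live crawl you would pass something like `lambda u: urllib.request.urlopen(u).read().decode("utf-8", "replace")` as `fetch`; keeping it pluggable also makes the crawl logic easy to test with canned pages.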