I'll just give you the idea of how the process should work, at least you'll learn something too.
Assuming you have a database where you stored the URLs you want to crawl (rough Python sketches of each step follow after Step 3):
Step 1) Download all the web pages and save them locally.
Step 2) Once everything is downloaded, load the regular expression table and parse the downloaded files. This is normally the part that really takes time. Most popular programming languages have regular expression functions. Every time you finish parsing a page, save the result as a temporary text file.
Step 3) Once all the parsing is done, you can start uploading the results into your database. You can also add extra parsing before uploading to the database, but that is optional.
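
Here is a rough Python sketch of Step 1 (Python just for illustration, the same idea works in VBA or anything else). It assumes the URL list sits in a SQLite file called crawler.db with a table urls(id, url); those names are placeholders, adjust them to your own database.

```python
# Step 1 sketch: download every URL from the database and save the raw HTML locally.
import sqlite3
import urllib.request
from pathlib import Path

DOWNLOAD_DIR = Path("pages")
DOWNLOAD_DIR.mkdir(exist_ok=True)

conn = sqlite3.connect("crawler.db")                  # hypothetical database file
rows = conn.execute("SELECT id, url FROM urls").fetchall()
conn.close()

for page_id, url in rows:
    target = DOWNLOAD_DIR / f"{page_id}.html"
    if target.exists():                               # skip pages already downloaded
        continue
    try:
        with urllib.request.urlopen(url, timeout=30) as resp:
            target.write_bytes(resp.read())           # save the raw HTML to disk
    except Exception as exc:                          # log the failure, keep crawling
        print(f"failed {url}: {exc}")
```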
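
A sketch of Step 2 along the same lines. The two patterns in the regex table are made-up examples; the real table would come from your own configuration or database, and each parsed page ends up as a tab-separated temporary text file.

```python
# Step 2 sketch: run the regex table over every saved page and dump the matches
# into one temporary text file per page.
import re
from pathlib import Path

DOWNLOAD_DIR = Path("pages")
PARSED_DIR = Path("parsed")
PARSED_DIR.mkdir(exist_ok=True)

# field name -> compiled pattern (placeholder patterns only)
REGEX_TABLE = {
    "title": re.compile(r"<title>(.*?)</title>", re.IGNORECASE | re.DOTALL),
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
}

for page in DOWNLOAD_DIR.glob("*.html"):
    html = page.read_text(encoding="utf-8", errors="ignore")
    lines = []
    for field, pattern in REGEX_TABLE.items():
        for match in pattern.findall(html):
            lines.append(f"{field}\t{match.strip()}")
    # temporary tab-separated text file, one per downloaded page
    (PARSED_DIR / f"{page.stem}.txt").write_text("\n".join(lines), encoding="utf-8")
```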
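
And a sketch of Step 3: reading the temporary files back and inserting them into a hypothetical results table. This is also the place where the optional extra parsing could happen before the insert.

```python
# Step 3 sketch: load the temporary text files and upload them to the database.
import sqlite3
from pathlib import Path

PARSED_DIR = Path("parsed")

conn = sqlite3.connect("crawler.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS results (page_id TEXT, field TEXT, value TEXT)"
)

for txt in PARSED_DIR.glob("*.txt"):
    for line in txt.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        field, _, value = line.partition("\t")
        # optional extra cleanup/parsing could go here before the insert
        conn.execute(
            "INSERT INTO results (page_id, field, value) VALUES (?, ?, ?)",
            (txt.stem, field, value),
        )

conn.commit()
conn.close()
```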
The accuracy of these steps is around 60%-80%. You still have to proofread the resulting data, since not all websites are built the same way and a change of design can break your regular expression table.
This is similar to the web crawler used at Innodata, except that they don't do it in levels; they just combine the 3 steps into one process. Bad, very bad...
I made my first web crawler while I was still at Innodata; I used VBA from Excel... It runs the process by level, avoiding the heavy processing load on the application.
I can only give you the idea of how it should be done, since the procedure is pretty straightforward and most of the functions are readily available on the net. All it needs is your fingers on the Google search field and you're off to go...