Page 3 of 3
Results 21 to 24 of 24

Thread: web crawling

  1. #21

    Re: web crawling


    I'll just give you an idea of how the process should go, so that at least you learn something too.

    Assuming you have a database where you store the URLs you want to crawl:

    Step 1) Download all the web pages and save them locally.
    Step 2) Once everything is downloaded, load the regular expression table and parse the downloaded files. This is normally the part that really takes time. Most popular programming languages have regular expression functions. Every time you finish parsing a file, save the result as a temporary text file.
    Step 3) Once all the parsing is done, you can start uploading the results into your database. You can also add extra parsing before uploading to the database, but that's optional.
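
    The three steps above can be sketched roughly as follows in Python. This is only an illustration under assumptions, not the VBA crawler itself: the `REGEX_TABLE` patterns, the file layout, the `pages` table and all function names are made up for the example, and SQLite stands in for whatever database you actually use.

```python
import json
import re
import sqlite3
import urllib.request
from pathlib import Path

# Hypothetical "regular expression table": field name -> pattern with one group.
REGEX_TABLE = {
    "title": re.compile(r"<title[^>]*>(.*?)</title>", re.I | re.S),
    "heading": re.compile(r"<h1[^>]*>(.*?)</h1>", re.I | re.S),
}


def fetch(url):
    """Default downloader; pass a different callable to download_all for testing."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        return resp.read().decode("utf-8", errors="replace")


def download_all(urls, raw_dir, fetcher=fetch):
    """Step 1: save every page locally before any parsing happens."""
    raw_dir.mkdir(parents=True, exist_ok=True)
    for i, url in enumerate(urls):
        (raw_dir / f"{i}.html").write_text(fetcher(url), encoding="utf-8")
        (raw_dir / f"{i}.url").write_text(url, encoding="utf-8")


def parse_all(raw_dir, parsed_dir):
    """Step 2: apply the regex table to each saved page, one text file per page."""
    parsed_dir.mkdir(parents=True, exist_ok=True)
    for page in sorted(raw_dir.glob("*.html")):
        html = page.read_text(encoding="utf-8")
        record = {"url": page.with_suffix(".url").read_text(encoding="utf-8")}
        for field, pattern in REGEX_TABLE.items():
            match = pattern.search(html)
            # A pattern that finds nothing is stored as None, not dropped.
            record[field] = match.group(1).strip() if match else None
        out = parsed_dir / f"{page.stem}.json"
        out.write_text(json.dumps(record), encoding="utf-8")


def load_all(parsed_dir, db_path):
    """Step 3: upload the parsed records into the database."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages (url TEXT, title TEXT, heading TEXT)"
    )
    for parsed in sorted(parsed_dir.glob("*.json")):
        r = json.loads(parsed.read_text(encoding="utf-8"))
        conn.execute(
            "INSERT INTO pages VALUES (?, ?, ?)",
            (r["url"], r["title"], r["heading"]),
        )
    conn.commit()
    conn.close()
```

    Because each stage only reads what the previous one wrote to disk, you can rerun the parsing with a corrected pattern without downloading anything again.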

    The accuracy of these steps is around 60%-80%. You still have to proofread the resulting data, since not all websites are built the same way and a change of design may break your regular expression table.
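
    One possible way to narrow down the proofreading, sketched in Python: set aside any record where a pattern matched nothing, so a human checks those first. The function name and the record layout (a dict per page) are assumptions for illustration. It only catches missing fields, not fields that matched the wrong text, so it does not replace proofreading.

```python
def split_for_review(records, required_fields):
    """Separate complete records from those that need a human check."""
    complete, review = [], []
    for record in records:
        # A field counts as missing if it is absent, None, or empty.
        missing = [f for f in required_fields if not record.get(f)]
        (review if missing else complete).append(record)
    return complete, review
```

    If the review pile suddenly grows for one site, that is a hint its design changed and the matching entries in the regular expression table need updating.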

    This is similar to the web crawler used at Innodata, except that they don't do it in stages; they just combine the 3 steps into one process. Bad, very bad...

    I made my first web crawler while I was still at Innodata, using VBA from Excel... It runs the process stage by stage, which avoids heavy processing in the application.

    I can only give you the idea of how it should be done, since the procedure is pretty straightforward and most of the functions are readily available on the net. All it needs is your finger on the Google search field and you're good to go...
    Last edited by salbahis; 08-26-2011 at 03:50 PM.

  2. #22

    Re: web crawling

    Quote: Originally Posted by kamahak
    Apache has a Lucene-based crawler: Nutch. It looks odd, but you can use it. Just read the documentation or search for introductory tutorials.

    There are also Heritrix and OpenSpider.

    I haven't used these two yet, but they have good reviews, especially Heritrix.

    See, someone has seconded my idea about Apache Lucene, bro..

    It really is very easy.. just read the Lucene documentation..

  3. #23

    Re: web crawling

    Quote: Originally Posted by salbahis
    The accuracy of these steps is around 60%-80%. You still have to proofread the resulting data, since not all websites are built the same way and a change of design may break your regular expression table.

    This is why we need a well-defined algorithm that works for the general case and can be restructured to fit your specific needs.

    I applaud you for making your own, but in cases like these, where competition is intense, why reinvent the wheel?

    Lucene has one of the best (the very best, IMHO) search algorithms on the net. It can match Google in terms of scalability and speed. A crawler based on it will definitely be worth your while.

  4. #24

    Re: web crawling

    Quote: Originally Posted by kamahak
    This is why we need a well-defined algorithm that works for the general case and can be restructured to fit your specific needs.

    I applaud you for making your own, but in cases like these, where competition is intense, why reinvent the wheel?

    Lucene has one of the best (the very best, IMHO) search algorithms on the net. It can match Google in terms of scalability and speed. A crawler based on it will definitely be worth your while.

    Yeah, good question. When I made it I was still working at Innodata as non-production staff, and the only programming tools readily available were HTA (VBScript + HTML), WSH and VBA. Even with restricted user privileges, I was able to work around those limitations... At that time I was still working as a "Lawyer" (supposedly)!!

    Yes, I had to reinvent the wheel because they didn't allow us to use any other wheel, so I just made use of what was within arm's reach!!!

    Just a question about Lucene: does it support DTD parsing?

    Google uses Python to parse!! And AFAIK Google views the site as text with the tags stripped...
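
    To show what "text with the tags stripped" can look like, here is a small Python sketch using only the standard library's `html.parser`. It is just an illustration; I'm not claiming this is how Google does it. The `TextExtractor` class and `strip_tags` function are names made up for the example.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes and skips the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self._chunks.append(data)

    def text(self):
        # Join the pieces and collapse runs of whitespace into single spaces.
        return " ".join(" ".join(self._chunks).split())


def strip_tags(html):
    """Return the visible text of an HTML string with all tags removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

    Unlike a regular expression table, a parser like this does not depend on one site's layout, but it also only gives you plain text, so you still need your own rules to pick out specific fields.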
    Last edited by salbahis; 08-26-2011 at 04:29 PM.
