Page 2 of 3
Results 11 to 20 of 24

Thread: web crawling

  1. #11

    Re: web crawling


    Quote Originally Posted by lestat1116
    Do you have a setup for this?
    You can just use a normal Unix or Linux box. Here's a sample of how to spider the URLs from PNOY's website:

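    Code:
    # Mirror the site recursively; -l 0 means no depth limit.
    $ wget -r -l 0 http://president.gov.ph/

    # Pull anything URL-like out of the downloaded files and emit one INSERT per match.
    $ for i in $(egrep -Ihsoiwr '(http[s]*[:][/]+|www[.])[^"\<>]*' * | egrep -v '^$|^#'); do
           echo "INSERT INTO lestat('url') VALUES('$i');"
      done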
    Sample output:

    Code:
    INSERT INTO lestat('url') VALUES('http://www.twinhelix.com^M');
    INSERT INTO lestat('url') VALUES('http://creativecommons.org/licenses/LGPL/2.1/^M');
    INSERT INTO lestat('url') VALUES('http://www.president.gov.ph/images/seal.png');
    INSERT INTO lestat('url') VALUES('http://www.facebook.com/pages/Noynoy-Aquino-P-Noy/141976959168393?ref=mf?');
    INSERT INTO lestat('url') VALUES('http://twitter.com/#!/PresidentNoy');
    INSERT INTO lestat('url') VALUES('http://www.youtube.com/user/RTVMPNoy');
    INSERT INTO lestat('url') VALUES('http://api.recaptcha.net/challenge?k=6Lc_TbwSAAAAAFnJFB4XffLPFsftPkyexkr143PJ');
    INSERT INTO lestat('url') VALUES('http://api.recaptcha.net/noscript?k=6Lc_TbwSAAAAAFnJFB4XffLPFsftPkyexkr143PJ');
    INSERT INTO lestat('url') VALUES('http://www.president.gov.ph');
    INSERT INTO lestat('url') VALUES('WWW.PRESIDENT.GOV.PH');
    INSERT INTO lestat('url') VALUES('http://gov.ph');
    INSERT INTO lestat('url') VALUES('http://www.idocs.com^M');
    NOTE: Some of the output is invalid (for example, entries with a trailing ^M carriage return, or without a scheme). You still need to fix the regex or add URL validation to make it more robust.
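
One possible way to do the cleanup mentioned in the note, as a rough sketch rather than a finished tool. The function names `clean_urls` and `to_sql` are made up for illustration. The table name `lestat` and column `url` come from the sample above; the column is left unquoted here because standard SQL treats single quotes as string literals. The pattern only accepts absolute http(s) URLs with a plain host name, which is an assumption about what counts as "valid".

```shell
# Hypothetical helper: reads candidate URLs on stdin (one per line) and
# prints only those that look like absolute http(s) URLs.
# Step 1 strips carriage returns (the ^M in the sample output),
# step 2 keeps lines with a scheme and a host, step 3 removes duplicates.
clean_urls() {
    tr -d '\r' |
    grep -E '^https?://[A-Za-z0-9.-]+(:[0-9]+)?(/[^[:space:]]*)?$' |
    LC_ALL=C sort -u
}

# Hypothetical helper: turns each URL on stdin into an INSERT statement.
# Single quotes inside the URL are doubled so the SQL string stays well-formed.
to_sql() {
    sed -e "s/'/''/g" -e "s/.*/INSERT INTO lestat(url) VALUES('&');/"
}
```

It could be dropped into the pipeline above in place of the for loop, for example: `egrep -Ihsoiwr '(http[s]*[:][/]+|www[.])[^"\<>]*' * | clean_urls | to_sql`. Bare entries like WWW.PRESIDENT.GOV.PH are discarded rather than repaired.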

    [ simon.cpu ]

  2. #12

    Re: web crawling

    Quote Originally Posted by salbahis
    then you just data mine...
    I'm looking for a program that will do the data mining. What I'm using now is manual: copy stuff, then paste it into the DB.

  3. #13

    Re: web crawling

    Quote Originally Posted by lestat1116
    I'm looking for a program that will do the data mining. What I'm using now is manual: copy stuff, then paste it into the DB.
    Hmm... oDesk? Data entry?

  4. #14

    Re: web crawling

    Quote Originally Posted by salbahis
    Hmm... oDesk? Data entry?
    Yup, data entry, but not oDesk. Do you have a setup?

  5. #15

    Re: web crawling

    Quote Originally Posted by lestat1116
    Yup, data entry, but not oDesk. Do you have a setup?
    Yeah, but it's not for sharing; it's for my wife's own oDesk job!!
    Last edited by salbahis; 08-24-2011 at 08:57 AM.

  6. #16

    Re: web crawling

    Nosebleed.

  7. #17

    Re: web crawling

    Quote Originally Posted by salbahis
    Yeah, but it's not for sharing; it's for my wife's own oDesk job!!
    I'm not on oDesk... hehe. I just need help.

  8. #18

    Re: web crawling

    Just bumping this.

  9. #19

    Re: web crawling

    Apache has a Lucene-based crawler: Nutch. It takes some studying, but you can use it. Just read the documentation or search for introductory tutorials.

    There are also Heritrix and OpenSpider.

    I haven't used those two yet, but they have good reviews, especially Heritrix.

  10. #20

    Re: web crawling

    Thanks. Looks like I'll really be in heavy study mode for this.
