Originally Posted by lestat1116
Do you have a setup for this?
You can just use a normal Unix or Linux box. Here's an example of how to spider PNOY's website and extract the URLs from the downloaded pages:
Code:
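$ # Mirror the site recursively; -l 0 means unlimited depth
$ wget -r -l 0 http://president.gov.ph/
$ # Pull anything URL-like out of the mirrored files, skipping empty and '#' lines
$ for i in $(egrep -Ihsoiwr '(http[s]*[:][/]+|www[.])[^"\<>]*' * | egrep -v '^$|^#'); do
    echo "INSERT INTO lestat('url') VALUES('$i');"
done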
Sample output:
Code:
INSERT INTO lestat('url') VALUES('http://www.twinhelix.com^M');
INSERT INTO lestat('url') VALUES('http://creativecommons.org/licenses/LGPL/2.1/^M');
INSERT INTO lestat('url') VALUES('http://www.president.gov.ph/images/seal.png');
INSERT INTO lestat('url') VALUES('http://www.facebook.com/pages/Noynoy-Aquino-P-Noy/141976959168393?ref=mf?');
INSERT INTO lestat('url') VALUES('http://twitter.com/#!/PresidentNoy');
INSERT INTO lestat('url') VALUES('http://www.youtube.com/user/RTVMPNoy');
INSERT INTO lestat('url') VALUES('http://api.recaptcha.net/challenge?k=6Lc_TbwSAAAAAFnJFB4XffLPFsftPkyexkr143PJ');
INSERT INTO lestat('url') VALUES('http://api.recaptcha.net/noscript?k=6Lc_TbwSAAAAAFnJFB4XffLPFsftPkyexkr143PJ');
INSERT INTO lestat('url') VALUES('http://www.president.gov.ph');
INSERT INTO lestat('url') VALUES('WWW.PRESIDENT.GOV.PH');
INSERT INTO lestat('url') VALUES('http://gov.ph');
INSERT INTO lestat('url') VALUES('http://www.idocs.com^M');
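NOTE: Some of the output is invalid, for example the entries ending in ^M (a stray carriage return) and the scheme-less WWW.PRESIDENT.GOV.PH. You still need to fix the regex or add URL validation to make it more robust.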
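One possible way to add that validation step is sketched below. The helper names (clean_url, is_valid_url, emit_inserts) are made up for illustration, and the pattern is a deliberately strict assumption: it only accepts http:// or https:// URLs with a host part, so bare hosts without a scheme are dropped rather than repaired. The INSERT format is copied from the loop above.

```shell
#!/bin/sh
# Hypothetical helpers, not part of the original one-liner.

# Strip carriage returns (the ^M seen in the sample output).
clean_url() {
  printf '%s' "$1" | tr -d '\r'
}

# Succeed only for http(s) URLs that have a host part.
# Assumption: scheme-less entries such as WWW.PRESIDENT.GOV.PH are rejected.
is_valid_url() {
  printf '%s\n' "$1" | grep -Eiq '^https?://[a-z0-9.-]+(:[0-9]+)?(/[^[:space:]]*)?$'
}

# Read candidate URLs on stdin, one per line, and print an INSERT
# statement for each one that survives cleaning and validation.
emit_inserts() {
  while IFS= read -r raw; do
    url=$(clean_url "$raw")
    if is_valid_url "$url"; then
      # Double any single quotes so the value cannot end the SQL string early.
      esc=$(printf '%s' "$url" | sed "s/'/''/g")
      printf "INSERT INTO lestat('url') VALUES('%s');\n" "$esc"
    fi
  done
}
```

To use it, pipe the egrep output into emit_inserts instead of looping over it with for. Reading line by line also avoids the word splitting that the backtick loop relies on.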
[ simon.cpu ]