[nycphp-talk] Curl & Traversing Pages

Rolan Yang rolan at omnistep.com
Tue Nov 22 21:46:02 EST 2005


A great tool for debugging bots and spiders is the "Tamper Data 0.85" 
extension for the Firefox browser. Download the browser, then install the 
extension, which can be found here:
https://addons.mozilla.org/extensions/showlist.php?application=firefox&category=Developer%20Tools&numpg=10&pageid=7
It logs and displays all of the browser's inbound and outbound traffic. 
This is very useful, especially when writing bots that interface with 
SSL pages. I used to debug with a packet sniffer, but having to tackle 
an SSL-only application prompted me to seek out and discover this 
wonderful app.
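
For quick checks from inside a script, PHP's curl extension can produce 
a similar (if cruder) log on its own. A minimal sketch; the URL and the 
log path are just placeholders:

<?php
// Send curl's request/response chatter to a log file for debugging,
// similar in spirit to what Tamper Data shows in the browser.
$log = fopen('/tmp/curl_debug.log', 'w');

$ch = curl_init('https://example.com/login');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
curl_setopt($ch, CURLOPT_VERBOSE, true);        // dump headers and SSL handshake details
curl_setopt($ch, CURLOPT_STDERR, $log);         // write the verbose output to our log
curl_setopt($ch, CURLOPT_HEADER, true);         // keep response headers in the result

$page = curl_exec($ch);
curl_close($ch);
fclose($log);
?>

Since the verbose output is written before encryption and after 
decryption, you see the plaintext exchange even for HTTPS requests.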

An alternative way to spider a website is to grab the pages with "wget", 
then parse and process them all offline later. One caveat: some dynamic 
scripts may generate links on the fly, resulting in loops. Googlebot, for 
example, has been endlessly spidering the same photo album pages on my 
site for the past year and a half.
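
If you go that route, the offline pass might look something like this 
rough sketch (the "mirror/" directory is made up; assume it holds 
whatever wget saved). Keeping a list of URLs you've already seen is the 
same check a live spider needs to avoid those loops:

<?php
$seen = array(); // URLs already recorded, so loops and duplicates get skipped

foreach (glob('mirror/*.html') as $file) {
    $html = file_get_contents($file);

    // Crude link extraction; good enough for a first pass offline.
    preg_match_all('/href="([^"]+)"/i', $html, $matches);

    foreach ($matches[1] as $url) {
        if (!isset($seen[$url])) {
            $seen[$url] = true;
            echo $url . "\n"; // or queue it for further processing
        }
    }
}
?>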

~Rolan

Joseph Crawford wrote:

>Hello Everyone,
>
>Let me explain a bit what I am trying to do.  I have a script that
>will grab the first page, which I specify via a URL such as
>
>http://yellowpages.superpages.com/listings.jsp?PS=45&OO=1&R=N&PP=L&CB=1&STYPE=S&F=1&L=VT&CID=00000518939&paging=1&PI=0
>
>  
>


