[nycphp-talk] Curl & Traversing Pages

Joseph Crawford codebowl at gmail.com
Wed Nov 23 10:30:09 EST 2005


Guys, I am still in need of help with this ;)

Here is an explanation of what I have tried so far.

I am trying to fetch data from yellowpages.superpages.com. The script
I have written takes a category URL, grabs the records, and checks
whether there is a next page. If there is, it grabs that URL and
re-executes with the new URL until it hits the last page of the
category.
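
For reference, the traversal described above can be sketched as a plain loop. This is a minimal sketch, not the code from YellowPages.phps: the function name traverseCategory and the two callables ($fetch and $findNextUrl) are hypothetical stand-ins for the page download and the "next page" link extraction.

```php
<?php
/**
 * Walk a paginated category, following "next page" links until none is left.
 *
 * $fetch and $findNextUrl are hypothetical callables supplied by the caller:
 *   $fetch(string $url): string           returns the page HTML
 *   $findNextUrl(string $html): ?string   returns the next URL, or null on the last page
 */
function traverseCategory(string $startUrl, callable $fetch, callable $findNextUrl, int $maxPages = 500): array
{
    $pages = [];
    $seen = [];
    $url = $startUrl;

    while ($url !== null && count($pages) < $maxPages) {
        if (isset($seen[$url])) {
            break; // guard against a "next" link that loops back to a visited page
        }
        $seen[$url] = true;

        $html = $fetch($url);
        $pages[$url] = $html;

        // null here means there is no next page, which ends the loop
        $url = $findNextUrl($html);
    }

    return $pages;
}
```

A loop is used here instead of re-executing the function so that a long category does not grow the call stack, and the $maxPages cap is an arbitrary safety limit.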

The issue I am having is this: it reaches out to the first page and
grabs the results, but when it reaches out to grab the second page I
get the following error:
http://codebowl.dontexist.net/images/ypresult.jpg

What doesn't make any sense to me is that if I echo the URL that is
grabbed (second page) and paste it into my browser, I get the results
fine; I don't see that error page. If I feed the second page URL to
the script, it grabs the records and then errors when trying to go to
the 3rd page.

The following is the current code I have:

		$url = explode('?', $url);
		$url = $url[0].'?'.$url[1];
		echo $url . '<br>';
		$c = new Curl($url[0]);
		$c->SetOpt(CURLOPT_FOLLOWLOCATION, 1);
		$c->SetOpt(CURLOPT_RETURNTRANSFER, 1);
		//$c->SetOpt(CURLOPT_URL, $url);
		$c->SetOpt(CURLOPT_POST, 1);
		$c->SetOpt(CURLOPT_POSTFIELDS, $url[1]);
		$c->SetOpt(CURLOPT_HEADER, 1);
		$c->SetOpt(CURLOPT_COOKIE, 1);
		$c->SetOpt(CURLOPT_ENCODING, "gzip,deflate");
		$c->SetOpt(CURLOPT_USERAGENT, "User-Agent=Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7.12) Gecko/20050915 Firefox/1.0.7");
		$c->SetOpt(CURLOPT_REFERER, "http://yellowpages.superpages.com/");
		$c->SetOpt(CURLOPT_COOKIEJAR, 'e:\htdocs\tmp\cookies\superpages.cookiejar.txt');
		//$c->SetOpt(CURLOPT_COOKIEFILE, 'e:\htdocs\tmp\cookies\superpages.cookiefile.txt');
		$this->source = $c->Execute();
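
One thing worth checking in the snippet above: after its second line, $url is a string again, so $url[0] and $url[1] are single characters rather than the base URL and the query string. How the Curl wrapper class reacts to that is not shown here. A minimal sketch that keeps the two parts in separate variables (the helper name splitUrl is hypothetical):

```php
<?php
/**
 * Split a URL into its base and query string without reusing one variable
 * for both the exploded array and the re-joined string.
 */
function splitUrl(string $url): array
{
    $parts = explode('?', $url, 2);

    return [
        'base'  => $parts[0],
        'query' => $parts[1] ?? '', // empty when the URL has no query string
    ];
}

$example = 'http://yellowpages.superpages.com/listings.jsp?PS=15&CB=1';
$parts = splitUrl($example);

// Indexing a string returns one character, not an array element.
$firstChar = $example[0];
$secondChar = $example[1];
```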

That's the cURL code I have. I have tried using POST and POSTFIELDS,
I have tried encoding the query string values, and I also have cURL
setting the cookie and have tried with and without that.
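
For comparison, here is a minimal sketch of the same request as a plain GET using PHP's curl functions directly. It rests on a few assumptions: that the results page is meant to be fetched with GET (as a browser does when the URL is pasted in), and that the site expects its cookies back on the next request. CURLOPT_COOKIEJAR only saves cookies when the handle is closed, CURLOPT_COOKIEFILE is what reads them back, CURLOPT_COOKIE expects a "name=value" string rather than 1, and CURLOPT_USERAGENT takes the agent string without a "User-Agent=" prefix. The function name buildGetOptions and the cookie path are hypothetical.

```php
<?php
/**
 * Build cURL options for a plain GET that both saves and re-sends cookies.
 * $cookiePath is a hypothetical writable file path chosen by the caller.
 */
function buildGetOptions(string $url, string $cookiePath): array
{
    return [
        CURLOPT_URL            => $url,
        CURLOPT_HTTPGET        => true,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_ENCODING       => 'gzip,deflate',
        CURLOPT_USERAGENT      => 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7.12) Gecko/20050915 Firefox/1.0.7',
        CURLOPT_REFERER        => 'http://yellowpages.superpages.com/',
        CURLOPT_COOKIEJAR      => $cookiePath, // written when the handle is closed
        CURLOPT_COOKIEFILE     => $cookiePath, // read back and sent on later requests
    ];
}

// Usage (needs the curl extension and network access):
// $ch = curl_init();
// curl_setopt_array($ch, buildGetOptions($url, '/tmp/superpages.cookies.txt'));
// $html = curl_exec($ch);
// curl_close($ch);
```

Reusing one handle for every page, or closing each handle before opening the next, keeps the cookies from the first response available to the second request.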

Here's the cookie file written by cURL:

# Netscape HTTP Cookie File
# http://www.netscape.com/newsref/std/cookie_spec.html
# This file was generated by libcurl! Edit at your own risk.

.superpages.com	TRUE	/	FALSE	1290438669	SPC	1132758669510-yellowpages.superpages.com-42072-519770
.superpages.com	TRUE	/	FALSE	0	web	
.superpages.com	TRUE	/	FALSE	0	shopping	
.superpages.com	TRUE	/	FALSE	0	yp	PS:45$
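
To check what actually ends up in the jar between requests, the file can be read back in the script. A minimal sketch, assuming the standard tab-separated Netscape layout shown above (domain, subdomain flag, path, secure flag, expiry, name, value); the function name parseCookieJar is hypothetical.

```php
<?php
/**
 * Parse the contents of a Netscape-format cookie file into associative arrays.
 * Fields are tab-separated: domain, subdomain flag, path, secure flag,
 * expiry (Unix time, 0 = session cookie), name, value.
 */
function parseCookieJar(string $contents): array
{
    $cookies = [];

    foreach (preg_split('/\r\n|\r|\n/', $contents) as $line) {
        // Skip blank lines and comments (this also skips libcurl's "#HttpOnly_" entries).
        if (trim($line) === '' || $line[0] === '#') {
            continue;
        }

        $fields = explode("\t", $line);
        if (count($fields) < 6) {
            continue; // not a cookie line
        }

        $cookies[] = [
            'domain'            => $fields[0],
            'includeSubdomains' => $fields[1] === 'TRUE',
            'path'              => $fields[2],
            'secure'            => $fields[3] === 'TRUE',
            'expires'           => (int) $fields[4],
            'name'              => $fields[5],
            'value'             => $fields[6] ?? '',
        ];
    }

    return $cookies;
}
```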

Here are the headers returned by the cURL exec:

HTTP/1.1 200 OK
P3P: CP="NOI DSP COR DEVa TAIa OUR BUS UNI"
Set-Cookie: SPC=1132758669510-yellowpages.superpages.com-42072-519770; Domain=.superpages.com; Expires=Mon, 22-Nov-2010 15:11:09 GMT; Path=/
Content-Encoding: gzip
Set-Cookie: web=; Domain=.superpages.com; Path=/
Set-Cookie: shopping=; Domain=.superpages.com; Path=/
Set-Cookie: yp=PS:45$; Domain=.superpages.com; Path=/
Content-Type: text/html;charset=ISO-8859-1
Content-Language: en-US
Content-Length: 9082
Date: Wed, 23 Nov 2005 15:11:09 GMT
Server: Apache Coyote/1.0

You can also see the page running at the following URL
http://codebowl.homelinux.net:8001/codebowl/yp.php

What really strikes me as odd is that I can feed the script an array of
URLs and make it loop over each one to grab each page. However, making
it traverse the pages automatically is where I am having the issues.

I have also compared the URL from the actual HTML page with the one
grabbed by cURL.

Here is the one from the HTML page:

http://yellowpages.superpages.com/listings.jsp?PS=15&CB=1&L=VT&CID=00000518939&paging=1&F=1&OO=1&PI=15

And here is the one grabbed by cURL:

http://yellowpages.superpages.com/listings.jsp?PS=15&CB=1&L=VT&CID=00000518939&paging=1&F=1&OO=1&PI=15

As you can see, they are exactly the same, so I am not sure what could
be going wrong here.
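
Two URLs can look the same once echoed into a browser and still differ in the script, for example if a scraped link still contains an HTML-encoded "&amp;" (this is an assumption about a possible cause, not something confirmed here). A minimal sketch for comparing them field by field; the function name diffUrls is hypothetical.

```php
<?php
/**
 * Compare two URLs by scheme, host, path and decoded query parameters.
 * Returns an array of differences, empty when the URLs match on all of them.
 */
function diffUrls(string $a, string $b): array
{
    $diff = [];
    $pa = parse_url($a);
    $pb = parse_url($b);

    foreach (['scheme', 'host', 'path'] as $key) {
        if (($pa[$key] ?? '') !== ($pb[$key] ?? '')) {
            $diff[$key] = [$pa[$key] ?? '', $pb[$key] ?? ''];
        }
    }

    // Decode both query strings so each parameter can be compared by name.
    parse_str($pa['query'] ?? '', $qa);
    parse_str($pb['query'] ?? '', $qb);

    $names = array_unique(array_merge(array_keys($qa), array_keys($qb)));
    foreach ($names as $name) {
        if (($qa[$name] ?? null) !== ($qb[$name] ?? null)) {
            $diff["query:$name"] = [$qa[$name] ?? null, $qb[$name] ?? null];
        }
    }

    return $diff;
}
```

Running it on the raw string taken from the page source, before it is echoed anywhere, is what makes the comparison meaningful.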

As before, the latest code is located at
http://codebowl.dontexist.net/codebowl/System/Misc/YellowPages.phps
http://codebowl.dontexist.net/codebowl/System/Misc/Curl.phps

I have taken a look at that Mozilla plugin; however, I don't know what
I am looking for, as it doesn't show what happens in cURL, and that's
really what I need to know.

Any help would be appreciated.


--
Joseph Crawford Jr.
Zend Certified Engineer
Codebowl Solutions, Inc.
1-802-671-2021
codebowl at gmail.com


