NYPHP.org

[nycphp-talk] Curl & Traversing Pages

inforequest 1j0lkq002 at sneakemail.com
Wed Nov 23 01:08:16 EST 2005


By the way, site scraping may not be a crime, but many webmasters 
consider it "hostile". Besides possibly lifting someone's database, you 
are also using their bandwidth to do it, muddying their stats, etc.

So if they are smart, they will do as much as they can to hinder your 
ability to scrape (that's what I do on competitive sites).

When you hit the site with curl, what user agent are you offering? Ditto 
for wget. There are recommended configurations that deny (or better, 
misdirect) requests from user agents such as:

    Client        User-Agent string
    wget 1.1      wget 1.1
    Wget 1.8.2    Wget/1.8.2 modified
    Wget 1.9      Wget/1.9+cvs-stable (Red Hat modified)
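
A minimal sketch of the kind of check such a configuration performs, 
written in Python rather than PHP or server config. The pattern list, 
the curl pattern, the function name is_denied_agent, and the choice to 
treat an empty User-Agent as suspicious are all assumptions made for 
illustration, not a recommended production list.

```python
import re

# Hypothetical deny list: patterns covering the wget strings listed
# above, plus a curl pattern added as an assumption.
DENY_PATTERNS = [
    re.compile(r"^wget\b", re.IGNORECASE),
    re.compile(r"^curl\b", re.IGNORECASE),
]


def is_denied_agent(user_agent):
    """Return True if the User-Agent header matches a deny pattern.

    An empty or missing header is also treated as suspicious here.
    That is an assumption of this sketch, not a universal rule.
    """
    if not user_agent:
        return True
    ua = user_agent.strip()
    return any(pattern.search(ua) for pattern in DENY_PATTERNS)
```

A site that wants to misdirect rather than deny would use the same test 
to decide which response to serve instead of refusing the request.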


How about a scraper honeypot? There are many of those out there tying up 
the harvesters because, well, harvesters are evil.

Another method is to list a /whatever path in robots.txt as a Disallow, 
with a link to it on the home page. Anyone who asks for a file in there 
is by definition a bad bot (as you would be if you simply parsed the 
home page for links and followed them, scraping). A script can tie that 
bot up with endless loops of autogenerated garbage or perhaps common SQL 
injection attempts, followed by an IP ban after N attempts. This stuff 
is out there.
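
A rough sketch of the bookkeeping behind such a trap, in Python. It 
assumes robots.txt contains a line like "Disallow: /whatever/" and that 
the server calls check() once per request. The class name TrapTracker, 
the prefix, and the limit of 3 are hypothetical, and the counts live in 
memory only.

```python
from collections import defaultdict

TRAP_PREFIX = "/whatever/"  # hypothetical path disallowed in robots.txt
MAX_TRAP_HITS = 3           # hypothetical value of N


class TrapTracker:
    """Count requests into the trap path and ban an IP after N hits."""

    def __init__(self, trap_prefix=TRAP_PREFIX, max_hits=MAX_TRAP_HITS):
        self.trap_prefix = trap_prefix
        self.max_hits = max_hits
        self.hits = defaultdict(int)  # IP address -> trap requests so far
        self.banned = set()           # IP addresses already banned

    def check(self, ip, path):
        """Return 'ban', 'trap', or 'ok' for a single request."""
        if ip in self.banned:
            return "ban"
        if path.startswith(self.trap_prefix):
            self.hits[ip] += 1
            if self.hits[ip] >= self.max_hits:
                self.banned.add(ip)
                return "ban"
            return "trap"  # caller serves the autogenerated garbage
        return "ok"
```

Well-behaved crawlers read robots.txt and never request the trap path, 
so they always get "ok".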

Personally I like serving realistic (but fake) Apache error codes 
whenever I am hit with wget because I know those are *nix folk and it's 
fun to mess with their heads.

I mention all this because it seems you haven't considered that 
countermeasures like these might be part of your problem. When you're 
ready to step up to the big leagues of Spy vs. Spy site scraping, you 
might start with the Snoopy class. I think it's a nice piece of work. 
It lets you focus on the randomization and proxy management, since the 
fetch is really solid.
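
To give an idea of what "randomization and proxy management" can mean, 
here is a small Python sketch that only plans a request and fetches 
nothing. It does not use Snoopy, and the user-agent strings, proxy 
addresses, delay range, and the name plan_request are all made up for 
illustration.

```python
import random

# Hypothetical pools; a real setup would load these from configuration.
USER_AGENTS = [
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) Gecko/20051111",
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)",
]
PROXIES = [
    "192.0.2.10:8080",  # documentation-range addresses, not real proxies
    "192.0.2.11:8080",
]


def plan_request(url, rng=random):
    """Pick a random user agent, proxy, and delay for one fetch.

    Returns a plain dict; the actual fetch is left to whatever HTTP
    client you use.
    """
    return {
        "url": url,
        "user_agent": rng.choice(USER_AGENTS),
        "proxy": rng.choice(PROXIES),
        "delay_seconds": rng.uniform(2.0, 8.0),
    }
```

Passing a seeded random.Random instance as rng makes the choices 
repeatable, which helps when debugging.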



-=john andrews
http://www.seo-fun.com
"I could tell you how to beat my anti-scraper code, but then I'd have to 
kill you"






Joseph Crawford codebowl-at-gmail.com |nyphp dev/internal group use| wrote:

>Hello Everyone,
>
>Let me explain a bit what I am trying to do.  I have a script that
>grabs the first page, which I specify with a URL such as
>
>http://yellowpages.superpages.com/listings.jsp?PS=45&OO=1&R=N&PP=L&CB=1&STYPE=S&F=1&L=VT&CID=00000518939&paging=1&PI=0
>
>When it grabs this page, it scours the returned HTML and extracts
>all the information for each record under Yellow Page Listings.
>Once it has all the records, it checks whether there is a Next page;
>Next will either be a link or not.
>
>If it is a link, the script runs again using the URL from the Next
>link.  Here's where I am running into problems.  I want to feed it one
>URL and have it go through every page until there is no next page.
>
>The issue I am having is that with the URL grabbed from the link, curl
>fetches a page, but not the expected one.  Instead it is an error
>page from superpages stating that I have not supplied enough search
>criteria.
>
>This is the link extracted from the source of the first page:
>http://yellowpages.superpages.com/listings.jsp?PS=45&PP=L&CB=1&L=VT&CID=00000518939&paging=1&F=1&OO=1&PI=45
>
>When curl fetches that URL, the site complains about search criteria;
>however, if you paste it into a browser it works just fine.
>
>Here is a screenshot of the page that is returned by cURL:
>http://codebowl.dontexist.net/images/ypresult.jpg
>
>I am not sure what is going on here, but if anyone can lend a
>hand with curl I would appreciate it.  I have made
>the cookie directory writable by Apache; I also read that you have to
>specify the exact path to the cookie file on Windows using Apache 2.
>
>Here is my code
>
>http://codebowl.dontexist.net/codebowl/System/Misc/Curl.phps
>http://codebowl.dontexist.net/codebowl/System/Misc/YellowPages.phps
>
>
>Note that I created a curl class because I am thinking of expanding on
>what I have, and the framework I am working on
>is going to be 100% object-oriented.
>
>Any help is appreciated.
>
>--
>Joseph Crawford Jr.
>Zend Certified Engineer
>Codebowl Solutions, Inc.
>1-802-671-2021
>codebowl at gmail.com
>  
>



