Techniques that can be used to grab contents of a site

Programming for Search Engines 101. An area for avid PHP and .NET developers to chat about Programming techniques and how to make better use of search engines.

Moderator: Moderators

Techniques that can be used to grab contents of a site

Postby silvester » Fri Sep 04, 2009 5:39 am


1)fsockopen & pfsockopen


int fsockopen(string hostname, int port, int [errno], string [errstr], double [timeout]);

Initiates a stream connection in the Internet (AF_INET) or Unix (AF_UNIX) domain. For the Internet domain, it will open a TCP socket connection to hostname on port port. For the Unix domain, hostname will be used as the path to the socket, port must be set to 0 in this case. The optional timeout can be used to set a timeout in seconds for the connect system call.

$fp = fsockopen("www.php.net", 80, &$errno, &$errstr, 30);
if(!$fp) {
echo "$errstr ($errno)<br>\n";
} else {
fputs($fp,"GET / HTTP/1.0\n\n");
while(!feof($fp)) {
echo fgets($fp,128);
}
fclose($fp);
}


2)File_Get_Contents


It works the same as file () except it returns a string instead of an array. This function retrieves data from an external file and then puts it into a string for use.
Examples:

<?php
$file = file_get_contents ('YourFile.txt');
Echo $file;
?>

3)CURL

// create a new curl resource
$ch = curl_init();

/*
Here you find more options for curl:
http://www.php.net/curl_setopt
*/

// set URL to download
curl_setopt($ch, CURLOPT_URL, $Url);

// set referer:
curl_setopt($ch, CURLOPT_REFERER, "http://www.google.com/");

// user agent:
curl_setopt($ch, CURLOPT_USERAGENT, "MozillaXYZ/1.0");

// remove header? 0 = yes, 1 = no
curl_setopt($ch, CURLOPT_HEADER, 0);

// should curl return or print the data? true = return, false = print
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);

// timeout in seconds
curl_setopt($ch, CURLOPT_TIMEOUT, 10);

// download the given URL, and return output
$output = curl_exec($ch);

// close the curl resource, and free system resources
curl_close($ch);


4)Snoopy

Snoopy is a PHP class that simulates a web browser. It automates the
task of retrieving web page content and posting forms, for example.

Some of Snoopy's features:

* easily fetch the contents of a web page
* easily fetch the text from a web page (strip html tags)
* easily fetch the the links from a web page
* supports proxy hosts
* supports basic user/pass authentication
* supports setting user_agent, referer, cookies and header content
* supports browser redirects, and controlled depth of redirects
* expands fetched links to fully qualified URLs (default)
* easily submit form data and retrieve the results
* supports following html frames (added v0.92)
* supports passing cookies on redirects (added v0.92)

You can download it from http://sourceforge.net/projects/snoopy/


REQUIREMENTS:

Snoopy requires PHP with PCRE (Perl Compatible Regular Expressions),
which should be PHP 3.0.9 and up. For read timeout support, it requires
PHP 4 Beta 4 or later. Snoopy was developed and tested with PHP 3.0.12.

CLASS METHODS:

fetch($URI)
-----------

This is the method used for fetching the contents of a web page.
$URI is the fully qualified URL of the page to fetch.
The results of the fetch are stored in $this->results.
If you are fetching frames, then $this->results
contains each frame fetched in an array.

fetchtext($URI)
---------------

This behaves exactly like fetch() except that it only returns
the text from the page, stripping out html tags and other
irrelevant data.

fetchform($URI)
---------------

This behaves exactly like fetch() except that it only returns
the form elements from the page, stripping out html tags and other
irrelevant data.

fetchlinks($URI)
----------------

This behaves exactly like fetch() except that it only returns
the links from the page. By default, relative links are
converted to their fully qualified URL form.

submit($URI,$formvars)
----------------------

This submits a form to the specified $URI. $formvars is an
array of the form variables to pass.


submittext($URI,$formvars)
--------------------------

This behaves exactly like submit() except that it only returns
the text from the page, stripping out html tags and other
irrelevant data.

submitlinks($URI)
----------------

This behaves exactly like submit() except that it only returns
the links from the page. By default, relative links are
converted to their fully qualified URL form.


CLASS VARIABLES: (default value in parenthesis)

$host the host to connect to
$port the port to connect to
$proxy_host the proxy host to use, if any
$proxy_port the proxy port to use, if any
$agent the user agent to masqerade as (Snoopy v0.1)
$referer referer information to pass, if any
$cookies cookies to pass if any
$rawheaders other header info to pass, if any
$maxredirs maximum redirects to allow. 0=none allowed. (5)
$offsiteok whether or not to allow redirects off-site. (true)
$expandlinks whether or not to expand links to fully qualified URLs (true)
$user authentication username, if any
$pass authentication password, if any
$accept http accept types (image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, */*)
$error where errors are sent, if any
$response_code responde code returned from server
$headers headers returned from server
$maxlength max return data length
$read_timeout timeout on read operations (requires PHP 4 Beta 4+)
set to 0 to disallow timeouts
$timed_out true if a read operation timed out (requires PHP 4 Beta 4+)
$maxframes number of frames we will follow
$status http status of fetch
$temp_dir temp directory that the webserver can write to. (/tmp)
$curl_path system path to cURL binary, set to false if none


EXAMPLES:

Example: fetch a web page and display the return headers and
the contents of the page (html-escaped):

include "Snoopy.class.php";
$snoopy = new Snoopy;

$snoopy->user = "joe";
$snoopy->pass = "bloe";

if($snoopy->fetch("http://www.slashdot.org/"))
{
echo "response code: ".$snoopy->response_code."<br>\n";
while(list($key,$val) = each($snoopy->headers))
echo $key.": ".$val."<br>\n";
echo "<p>\n";

echo "<PRE>".htmlspecialchars($snoopy->results)."</PRE>\n";
}
else
echo "error fetching document: ".$snoopy->error."\n";



Example: submit a form and print out the result headers
and html-escaped page:

include "Snoopy.class.php";
$snoopy = new Snoopy;

$submit_url = "http://lnk.ispi.net/texis/scripts/msearch/netsearch.html";

$submit_vars["q"] = "amiga";
$submit_vars["submit"] = "Search!";
$submit_vars["searchhost"] = "Altavista";


if($snoopy->submit($submit_url,$submit_vars))
{
while(list($key,$val) = each($snoopy->headers))
echo $key.": ".$val."<br>\n";
echo "<p>\n";

echo "<PRE>".htmlspecialchars($snoopy->results)."</PRE>\n";
}
else
echo "error fetching document: ".$snoopy->error."\n";



Example: showing functionality of all the variables:


include "Snoopy.class.php";
$snoopy = new Snoopy;

$snoopy->proxy_host = "my.proxy.host";
$snoopy->proxy_port = "8080";

$snoopy->agent = "(compatible; MSIE 4.01; MSN 2.5; AOL 4.0; Windows 98)";
$snoopy->referer = "http://www.microsnot.com/";

$snoopy->cookies["SessionID"] = 238472834723489l;
$snoopy->cookies["favoriteColor"] = "RED";

$snoopy->rawheaders["Pragma"] = "no-cache";

$snoopy->maxredirs = 2;
$snoopy->offsiteok = false;
$snoopy->expandlinks = false;

$snoopy->user = "joe";
$snoopy->pass = "bloe";

if($snoopy->fetchtext("http://www.phpbuilder.com"))
{
while(list($key,$val) = each($snoopy->headers))
echo $key.": ".$val."<br>\n";
echo "<p>\n";

echo "<PRE>".htmlspecialchars($snoopy->results)."</PRE>\n";
}
else
echo "error fetching document: ".$snoopy->error."\n";


Example: fetched framed content and display the results

include "Snoopy.class.php";
$snoopy = new Snoopy;

$snoopy->maxframes = 5;

if($snoopy->fetch("http://www.ispi.net/"))
{
echo "<PRE>".htmlspecialchars($snoopy->results[0])."</PRE>\n";
echo "<PRE>".htmlspecialchars($snoopy->results[1])."</PRE>\n";
echo "<PRE>".htmlspecialchars($snoopy->results[2])."</PRE>\n";
}
else
echo "error fetching document: ".$snoopy->error."\n";
silvester
 
Posts: 89
Joined: Mon Nov 10, 2008 12:59 am

Return to Programming

Who is online

Users browsing this forum: No registered users and 9 guests