Building a basic PHP web-crawler

I'm fairly new to PHP, but hear me out please.
I'd like to build a web crawler that traverses all the links on a given site, gets specific content from each page, and returns the link plus that content.
I got the link-traversal function from a YouTube tutorial: https://www.youtube.com/watch?v=KBemN_bTnHU, but I can't seem to make the final part work: when I try to follow the links, nothing is output (noob alert).
Here is the function that extracts the links from a website (it doesn't fully work):
<?php
$to_crawl = "http://reteteculinare.ro";
$c = array();
function get_links($to_crawl){
global $c;
$input = @file_get_contents($to_crawl);
$base_url = parse_url($to_crawl, PHP_URL_HOST);
$regexp = '<a\s[^>]*href=(\"??)([^\" >]*?)\\1[^>]*>(.*)<\/a>';
preg_match_all("/$regexp/siU", $input, $matches);
$l = $matches[2];
foreach ($l as $link) {
if(strpos($link, "#")) {
$link = substr($link, 0, strpos($link, "#"));
}
if(substr($link, 0, 1) == "."){
$link = substr($link, 1);
}
if(substr($link, 0, 7) == "http://"){
$link = $link;
} else if (substr($link, 0, 8) == "https://"){
$link = $link;
} else if (substr($link, 0, 4) == "www."){
$link = substr($link, 4);
} else if (substr($link, 0, 6) == "//wwww."){
$link = substr($link, 6);
} else if (substr($link, 0, 2) == "//"){
$link = substr($link, 2);
} else if (substr($link, 0, 1) == "#"){
$link = $to_crawl;
} else if (substr($link, 0, 7) == "mailto:"){
$link = "[".$link."]";
} else {
if(substr($link, 0, 1) != "/") {
$link = $base_url."/".$link;
} else {
$link = $base_url.$link;
}
}
if(substr($link, 0, 4) == "www."){
$link = substr($link, 4);
}
if(substr($link, 0, 7) != "http://" && substr($link, 0, 8) != "https://" && substr($link, 0, 1) != "[") {
$link = "http://".$link;
} else {
$link = "https://".$link;
}
if (!in_array($link, $c)) {
array_push($c, $link);
}
}
}
get_links($to_crawl);
foreach ($c as $page) {
get_links($page);
}
foreach ($c as $page) {
echo $page."<br />";
}
?>
The code works until it tries to follow the links. Any clues? :D In the video it seems to work fine for the guy...
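A likely culprit is the final scheme check: when a link already starts with http:// or https://, the else branch prepends https:// a second time, producing invalid URLs like https://http://.... A hedged sketch of that normalization step, pulled out into a standalone helper (the function name is mine, not from the tutorial):

```php
<?php
// Hypothetical helper: add a scheme only when one is missing.
function ensure_scheme($link) {
    if (substr($link, 0, 7) == "http://" || substr($link, 0, 8) == "https://") {
        return $link; // already absolute: leave it untouched (no else branch!)
    }
    if (substr($link, 0, 1) == "[") {
        return $link; // bracketed mailto: placeholder, skip
    }
    return "http://" . $link; // host-only link: default to http
}
```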
Finally, here is my function for getting certain information from a page and saving it into an array:
<?php
include('simple_html_dom.php');
header('Content-type: text/plain');
$html = new simple_html_dom();
$page = 'http://www.reteteculinare.ro/carte_de_bucate/dulciuri/gauffre-de-liege-1687/';
$base_url = parse_url($page, PHP_URL_HOST);
function getRecipe($page) {
global $recipe, $page, $base_url;
$html = new simple_html_dom();
$html->load_file($page);
$reteta = $html->getElementById('hrecipe');
$r_title = $reteta->children(4)->outertext;
$r_title = strip_tags($r_title);
$r_title = trim($r_title);
$r_poza = $reteta->getElementById('.div_photo_reteta')->children(0)->src;
$r_poza = $base_url.$r_poza;
$r_ingrediente = $reteta->getElementById('#ingrediente-lista')->outertext;
$r_preparare = $reteta->getElementById('.instructions')->children(1)->outertext;
$r_preparare = strip_tags($r_preparare);
// $r_durata = $reteta->getElementById('.duration')->children(0)->outertext;
// $r_durata = preg_replace('/\s/', '', $r_durata);
// $r_durata = strip_tags($r_durata);
$recipe = array(
"Titlu: " => $r_title,
// "Durata: " => $r_durata,
"Link Poza: " => $r_poza,
"Ingrediente: " => $r_ingrediente,
"Preparare: " => $r_preparare
);
echo '<pre>';
print_r($recipe);
echo '</pre>';
}
getRecipe($page); // pass the URL, not the simple_html_dom object
?>
This works fine and gets the info I want into an array - a noob method of data mining, I'm sure, but I don't know better :)
Finally, I would like to connect these two functions so that as the crawler traverses each link, it runs the second function and returns an array containing the link where the data was found plus the data itself.
If any of you can throw a helping hand in my direction, I would certainly appreciate it.
What I'm trying to achieve is kinda out of my grasp, but I really want to learn and expand my knowledge.
Cheers!
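For the last step, a hedged sketch of gluing the two snippets together, assuming get_links() fills the global $c with absolute URLs and getRecipe() is modified to return its array instead of printing it (both are assumptions, not shown above):

```php
<?php
// Sketch: crawl first, then scrape each discovered page.
$results = array();
get_links($to_crawl);               // fills the global $c
foreach ($c as $page) {
    $recipe = getRecipe($page);     // assumed to return the recipe array, or null on failure
    if ($recipe !== null) {
        $results[$page] = $recipe;  // key = the link where the data was found
    }
}
echo '<pre>'; print_r($results); echo '</pre>';
```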

Related

PHP Link Shortener

Zack here... I've recently run into a spot of trouble with my URL shortener. Can you check the code below for errors? Thanks so much!
<?php
$db = new mysqli("localhost","pretzel_main","00WXE5fMWtaVd6Dd","pretzel");
function code($string_length = 3) {
$permissible_characters = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ';
$char_length = strlen($permissible_characters);
$random = '';
for ($i = 0; $i < $string_length; $i++) {
$random .= $permissible_characters[rand(0, $char_length - 1)];
}
return $random;
}
if (substr($_POST['long'], 0, 7) == http://) {
$url = $_POST['long'];
} elseif (substr($_POST['long'], 0, 8) == https://) {
$url = $_POST['long'];
} else {
$url = "http://".$_POST['long'];
}
$result = $db->prepare("INSERT INTO links VALUES('',?,?)");
$result->bind_param("ss",$url, $title);
$result->execute();
$title = $code();
?>
Good luck, and thanks in advance!
- Zack
Try this:
<?php
$db = new mysqli("localhost","pretzel_main","00WXE5fMWtaVd6Dd","pretzel");
function code($string_length = 3) {
$permissible_characters = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ';
$char_length = strlen($permissible_characters);
$random = '';
for ($i = 0; $i < $string_length; $i++) {
$random .= $permissible_characters[rand(0, $char_length - 1)];
}
return $random;
}
if (substr($_POST['long'], 0, 7) == "http://") {
$url = $_POST['long'];
} elseif (substr($_POST['long'], 0, 8) == "https://") {
$url = $_POST['long'];
} else {
$url = "http://".$_POST['long'];
}
$title = code(); // generate the short code first (and no $ sign on a function call)
$result = $db->prepare("INSERT INTO links VALUES('',?,?)");
$result->bind_param("ss", $url, $title);
$result->execute();
?>

Php Error in_array() null given

I followed a tutorial on making a web crawler app. It simply pulls all the links from a page and then follows them. I have a problem with pushing the links from the foreach loop into the global variable. I keep getting an error saying the second parameter of in_array should be an array, which is what I set it to. Do you guys see anything bugging up the code?
Error:
in_array() expects parameter 2 to be array, null given
HTML:
<?php
$to_crawl = "http://thechive.com/";
$c = array();
function get_links($url){
global $c;
$input = file_get_contents($url);
$regexp = "<a\s[^>]*href=(\"??)([^\" >]*?)\\1[^>]*>(.*)<\/a>";
preg_match_all("/$regexp/siU", $input, $matches);
$l = $matches[2];
$base_url = parse_url($url, PHP_URL_HOST);
foreach($l as $link){
if(strpos($link, '#')){
$link = substr($link, 0, strpos($link, '#'));
}
if(substr($link, 0, 1) == "."){
$link = substr($link, 1);
}
if(substr($link, 0, 7) == "http://"){
$link = $link;
} elseif(substr($link, 0, 8) == "https://"){
$link = $link;
} elseif(substr($link, 0, 2) == "//"){
$link = substr($link, 2);
} elseif(substr($link, 0, 1) == "#"){
$link = $url;
} elseif(substr($link, 0, 7) == "mailto:"){
$link = "[".$link."]";
} else{
if(substr($link, 0,1) != "/"){
$link = $base_url."/".$link;
} else{
$link = $base_url.$link;
}
}
if(substr($link, 0, 7) != "http://" && substr($link, 0, 8) != "https://" && substr($link, 0, 1) != "["){
if(substr($link, 0 , 8) == "https://"){
$link = "https://".$link;
} else{
$link= "http://".$link;
}
}
if (!in_array($link, $c)){
array_push($c, $link);
}
}
}
get_links($to_crawl);
foreach($c as $page){
get_links($page);
}
foreach($c as $page){
echo $page."<br/ >";
}
?>
Making your $c "global" at each iteration is bad design; you should avoid "global" whenever possible.
Here I see two choices:
1/ Pass your array by reference (search Google for that) as a parameter of the get_links function. That lets you fill the array from inside the function.
Example:
function getlinks($url, &$links){
//do your stuff to find the links
//then add each link to the array
$links[] = $oneLink;
}
$allLinks = array();
getlinks("thefirsturl.com", $allLinks);
//call getlinks as many as you want
//then your array will contain all the links
print_r($allLinks);
Or 2/ Make "get_links" return an array of links, and concatenate it into a bigger one to store all your links.
function getlinks($url){
$links = array();
//do your stuff to find the links
//then add each link to the array
$links[] = $oneLink;
return $links;
}
$allLinks = array();
$allLinks = array_merge($allLinks, getlinks("thefirsturl.com"));
//call getlinks as many times as you want; array_merge appends the new links
//(the += union operator would silently drop links whose numeric keys collide)
print_r($allLinks);
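An end-to-end sketch of choice 1, hedged: it uses DOMDocument instead of the question's regex (generally more robust for HTML) and assumes PHP's DOM extension is enabled:

```php
<?php
// Collect links by reference - no globals involved.
function getlinks($url, array &$links) {
    $html = @file_get_contents($url);  // @ suppresses warnings if the fetch fails
    if ($html === false) {
        return;
    }
    $doc = new DOMDocument();
    @$doc->loadHTML($html);            // @ silences warnings about malformed HTML
    foreach ($doc->getElementsByTagName('a') as $a) {
        $href = $a->getAttribute('href');
        if ($href !== '' && !in_array($href, $links)) {
            $links[] = $href;          // still relative; normalize as in the question if needed
        }
    }
}

$allLinks = array();
getlinks("http://thechive.com/", $allLinks);
print_r($allLinks);
```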

I keep getting this error "in_array() expects parameter 2 to be array, null given" [closed]

I'm getting this error "in_array() expects parameter 2 to be array, null given" even though I have "$c = array();".
Here's my code below:
Route::get('/', function () {
return view('welcome');
});
get('pro', function(){
$to_crawl = "http://bestspace.co";
$c = array();
function get_links($url)
{
global $c;
$input = #file_get_contents($url);
$regexp = '<a\s[^>]*href=(\"??)([^" >]*?)\\1[^>]*>(.*)<\/a>';
preg_match_all("/$regexp/siU", $input, $matches);
$base_url = parse_url($url, PHP_URL_HOST);
$l = $matches[2];
foreach($l as $link)
{
if (strpos($link, "#"))
{
$link = substr($link, 0, strpos($link, "#"));
}
if (substr($link,0,1) == ".")
{
$link = substr($link, 1);
}
if (substr($link, 0, 7) == "http://")
{
$link = $link;
}
else if (substr($link, 0, 8) == "https://")
{
$link = $link;
}
else if (substr($link, 0, 2) == "//")
{
$link = substr($link, 2);
}
else if (substr($link, 0, 2) == "#")
{
$link = $url;
}
else if (substr($link, 0, 7) == "mailto:")
{
$link = "[". $link."]";
}
else
{
if (substr($link, 0, 1) != "/")
{
$link = $base_url."/".$link;
}
else
{
$link = $base_url.$link;
}
}
if (substr($link, 0, 7) != "http://" && substr($link, 0, 8) != "https://" && substr($link, 0, 1) != "[")
{
if (substr($url, 0, 8) == "https://")
{
// prepend https
$link = "https://".$link;
}
else
{
// prepend http
$link = "http://".$link;
}
}
//echo $link."<br>";
if (!in_array($link, $c))
{
array_push($c, $link);
}
}
}
get_links($to_crawl);
foreach ($c as $page)
{
echo $page."<br>";
}
});
The problem is coming from here:
if (!in_array($link, $c))
{
array_push($c, $link);
}
It's complaining about $c, but it's global, and I also have it as
$c = array();
any help as to why this is occurring will be appreciated.
This is where the problem really originates:
get('pro', function(){
$to_crawl = "http://bestspace.co";
$c = array();
function get_links($url)
{
global $c;
The $c = array() is inside the closure that you passed as the second argument to the get() function, so it lives in that closure's local scope, not in the global scope; global $c won't find it there.
To fix this, pass $c into get_links() by reference, and call it as get_links($to_crawl, $c):
function get_links($url, array &$c)
{
In your get_links() function, have you tried a use() clause? Note that use() only works on anonymous functions, and you need use (&$c) (by reference) if you want to modify the array:
$get_links = function($url) use (&$c) {
print_r($c);
};

PHP script to extract artist & title from Shoutcast/Icecast stream

I found a script which can extract the artist & title from an Icecast or Shoutcast stream.
I want the script to update automatically when the song changes; at the moment it only works when I execute it. I'm new to PHP, so any help will be appreciated.
Thanks!
define('CRLF', "\r\n");
class streaminfo{
public $valid = false;
public $useragent = 'Winamp 2.81';
protected $headers = array();
protected $metadata = array();
public function __construct($location){
$errno = $errstr = '';
$t = parse_url($location);
$sock = fsockopen($t['host'], $t['port'], $errno, $errstr, 5);
$path = isset($t['path'])?$t['path']:'/';
if ($sock){
$request = 'GET '.$path.' HTTP/1.0' . CRLF .
'Host: ' . $t['host'] . CRLF .
'Connection: Close' . CRLF .
'User-Agent: ' . $this->useragent . CRLF .
'Accept: */*' . CRLF .
'icy-metadata: 1'.CRLF.
'icy-prebuffer: 65536'.CRLF.
(isset($t['user'])?'Authorization: Basic '.base64_encode($t['user'].':'.$t['pass']).CRLF:'').
'X-TipOfTheDay: Winamp "Classic" rulez all of them.' . CRLF . CRLF;
if (fwrite($sock, $request)){
$theaders = $line = '';
while (!feof($sock)){
$line = fgets($sock, 4096);
if('' == trim($line)){
break;
}
$theaders .= $line;
}
$theaders = explode(CRLF, $theaders);
foreach ($theaders as $header){
$t = explode(':', $header);
if (isset($t[0]) && trim($t[0]) != ''){
$name = preg_replace('/[^a-z][^a-z0-9]*/i','', strtolower(trim($t[0])));
array_shift($t);
$value = trim(implode(':', $t));
if ($value != ''){
if (is_numeric($value)){
$this->headers[$name] = (int)$value;
}else{
$this->headers[$name] = $value;
}
}
}
}
if (!isset($this->headers['icymetaint'])){
$data = ''; $metainterval = 512;
while(!feof($sock)){
$data .= fgetc($sock);
if (strlen($data) >= $metainterval) break;
}
$this->print_data($data);
$matches = array();
preg_match_all('/([\x00-\xff]{2})\x0\x0([a-z]+)=/i', $data, $matches, PREG_OFFSET_CAPTURE);
preg_match_all('/([a-z]+)=([a-z0-9\(\)\[\]., ]+)/i', $data, $matches, PREG_SPLIT_NO_EMPTY);
echo '<pre>';var_dump($matches);echo '</pre>';
$title = $artist = '';
foreach ($matches[0] as $nr => $values){
$offset = $values[1];
$length = ord($values[0][0]) +
(ord($values[0][1]) * 256)+
(ord($values[0][2]) * 256*256)+
(ord($values[0][3]) * 256*256*256);
$info = substr($data, $offset + 4, $length);
$seperator = strpos($info, '=');
$this->metadata[substr($info, 0, $seperator)] = substr($info, $seperator + 1);
if (substr($info, 0, $seperator) == 'title') $title = substr($info, $seperator + 1);
if (substr($info, 0, $seperator) == 'artist') $artist = substr($info, $seperator + 1);
}
$this->metadata['streamtitle'] = $artist . ' - ' . $title;
}else{
$metainterval = $this->headers['icymetaint'];
$intervals = 0;
$metadata = '';
while(1){
$data = '';
while(!feof($sock)){
$data .= fgetc($sock);
if (strlen($data) >= $metainterval) break;
}
//$this->print_data($data);
$len = join(unpack('c', fgetc($sock))) * 16;
if ($len > 0){
$metadata = str_replace("\0", '', fread($sock, $len));
break;
}else{
$intervals++;
if ($intervals > 100) break;
}
}
$metarr = explode(';', $metadata);
foreach ($metarr as $meta){
$t = explode('=', $meta);
if (isset($t[0]) && trim($t[0]) != ''){
$name = preg_replace('/[^a-z][^a-z0-9]*/i','', strtolower(trim($t[0])));
array_shift($t);
$value = trim(implode('=', $t));
if (substr($value, 0, 1) == '"' || substr($value, 0, 1) == "'"){
$value = substr($value, 1);
}
if (substr($value, -1) == '"' || substr($value, -1) == "'"){
$value = substr($value, 0, -1);
}
if ($value != ''){
$this->metadata[$name] = $value;
}
}
}
}
fclose($sock);
$this->valid = true;
}else echo 'unable to write.';
}else echo 'no socket '.$errno.' - '.$errstr.'.';
}
public function print_data($data){
$data = str_split($data);
$c = 0;
$string = '';
echo "<pre>\n000000 ";
foreach ($data as $char){
$string .= addcslashes($char, "\n\r\0\t");
$hex = dechex(join(unpack('C', $char)));
if ($c % 4 == 0) echo ' ';
if ($c % (4*4) == 0 && $c != 0){
foreach (str_split($string) as $s){
//echo " $string\n";
if (ord($s) < 32 || ord($s) > 126){
echo '\\'.ord($s);
}else{
echo $s;
}
}
echo "\n";
$string = '';
echo str_pad($c, 6, '0', STR_PAD_LEFT).' ';
}
if (strlen($hex) < 1) $hex = '00';
if (strlen($hex) < 2) $hex = '0'.$hex;
echo $hex.' ';
$c++;
}
echo " $string\n</pre>";
}
public function __get($name){
if (isset($this->metadata[$name])){
return $this->metadata[$name];
}
if (isset($this->headers[$name])){
return $this->headers[$name];
}
return null;
}
}
$t = new streaminfo('http://64.236.34.196:80/stream/1014'); // get metadata
echo Meta Interval: $t->icymetaint;
echo Current Track: $t->streamtitle;
You will need to query the stream at a set interval to detect when the song changes.
This is best done by scheduling a cron job.
On Windows, you should use the Windows Task Scheduler instead.
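As an alternative to cron, here is a hedged sketch of a long-running CLI poller built on the streaminfo class above (the file name streaminfo.php and the 15-second interval are my placeholders):

```php
<?php
// Sketch: poll the stream from the command line and print track changes.
require 'streaminfo.php';  // hypothetical file containing the class above

$last = '';
while (true) {
    $t = new streaminfo('http://64.236.34.196:80/stream/1014');
    if ($t->valid && $t->streamtitle !== $last) {
        $last = $t->streamtitle;
        echo date('H:i:s') . '  ' . $last . "\n";
    }
    sleep(15);             // poll interval in seconds; tune to taste
}
```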
If you want to run the PHP script to keep your metadata up to date (I'm assuming you're making a website and using HTML audio tags here), you can use the ontimeupdate event with an AJAX function. If you're not, you should probably look up your audio playback documentation for something similar.
<audio src="http://ip:port/;" ontimeupdate="loadXMLDoc()">
You can find a great example here http://www.w3schools.com/php/php_ajax_php.asp
You'll want to echo all the relevant information at once, using one PHP variable at the very end of your script:
<?php ....
$phpVar=$streamtitle;
$phpVar2=$streamsong;
$result="I want my string to look like this: <br> {$phpVar} {$phpVar2}";
echo $result;
?>
Then use the function called by .onreadystatechange to modify the elements you want on your website via .responseText (it will contain the same content as your PHP script's echo).
After SCOURING the web for 4 hours, this is the only Shoutcast metadata script I've found that works! Thank you.
To run this constantly, why not use a setInterval combined with jQuery's AJAX call?
<script>
$(function() {
setInterval(getTrackName,16000);
});
function getTrackName() {
$.ajax({
url: "track_name.php"
})
.done(function( data ) {
$( "#results" ).text( data );
});
}
</script>
Also, your last couple of 'echo' lines were breaking the script for me. Just put quotes around the Meta Interval, etc.

How to turn a bad URL into a clean URL in PHP?

When I spider a website, I get a lot of bad URLs like these:
http://example.com/../../.././././1.htm
http://example.com/test/../test/.././././1.htm
http://example.com/.//1.htm
http://example.com/../test/..//1.htm
All of these should be http://example.com/1.htm.
How can I do this in PHP? Thanks.
PS: I use http://snoopy.sourceforge.net/
I get a lot of repeated links in my database; 'http://example.com/../test/..//1.htm' should be 'http://example.com/1.htm'.
You could do it like this, assuming all the URLs you have provided are expected to be http://example.com/1.htm:
$test = array('http://example.com/../../../././.\./1.htm',
'http://example.com/test/../test/../././.\./1.htm',
'http://example.com/.//1.htm',
'http://example.com/../test/..//1.htm');
foreach ($test as $url){
$u = parse_url($url);
$path = $u['scheme'].'://'.$u['host'].'/'.basename($u['path']);
echo $path.'<br />'.PHP_EOL;
}
/* result
http://example.com/1.htm<br />
http://example.com/1.htm<br />
http://example.com/1.htm<br />
http://example.com/1.htm<br />
*/
//or as a function #lpc2138
function getRealUrl($url){
$u = parse_url($url);
$path = $u['scheme'].'://'.$u['host'].'/'.basename($u['path']);
$path .= (!empty($u['query'])) ? '?'.$u['query'] : '';
return $path;
}
You seem to be looking for an algorithm to remove the dot segments (RFC 3986, section 5.2.4):
function remove_dot_segments($abspath) {
$ib = $abspath;
$ob = '';
while ($ib !== '') {
if (substr($ib, 0, 3) === '../') {
$ib = substr($ib, 3);
} else if (substr($ib, 0, 2) === './') {
$ib = substr($ib, 2);
} else if (substr($ib, 0, 2) === '/.' && (strlen($ib) === 2 || $ib[2] === '/')) { // length check first avoids an out-of-range offset read
$ib = '/'.substr($ib, 3);
} else if (substr($ib, 0, 3) === '/..' && (strlen($ib) === 3 || $ib[3] === '/')) {
$ib = '/'.substr($ib, 4);
$ob = substr($ob, 0, strlen($ob)-strlen(strrchr($ob, '/')));
} else if ($ib === '.' || $ib === '..') {
$ib = '';
} else {
$pos = strpos($ib, '/', 1);
if ($pos === false) {
$ob .= $ib;
$ib = '';
} else {
$ob .= substr($ib, 0, $pos);
$ib = substr($ib, $pos);
}
}
}
return $ob;
}
This removes the . and .. segments. Removing any other kind of segment, such as an empty one (//) or .\., is not standard-conforming, as it changes the semantics of the path.
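A hedged usage sketch that combines parse_url() with the remove_dot_segments() function above (query strings and fragments are ignored here):

```php
<?php
// Sketch: normalize a crawled URL by cleaning only its path component.
function clean_url($url) {
    $u = parse_url($url);
    $path = remove_dot_segments($u['path']); // defined above
    if ($path === '' || $path[0] !== '/') {
        $path = '/' . $path;                 // keep the path absolute
    }
    return $u['scheme'] . '://' . $u['host'] . $path;
}

echo clean_url('http://example.com/test/../test/.././././1.htm');
// http://example.com/1.htm
```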
You could do some fancy regex, but simple string replacement also handles the examples above:
fixUrl('http://example.com/../../../././.\./1.htm');
function fixUrl($str) {
$str = str_replace('../', '', $str);
$str = str_replace('./', '', $str);
$str = str_replace('\.', '', $str);
return $str;
}
