I followed a tutorial on making a web crawler app. It simply pulls all the links from a page and then follows them. I have a problem with pushing the links from the foreach loop into the global variable. I keep getting an error saying that the second argument to in_array() should be an array, which is what I set it to. Can anyone see what is breaking the code?
Error:
in_array() expects parameter 2 to be array, null given
PHP:
<?php
$to_crawl = "http://thechive.com/";
$c = array();
function get_links($url){
global $c;
$input = file_get_contents($url);
$regexp = "<a\s[^>]*href=(\"??)([^\" >]*?)\\1[^>]*>(.*)<\/a>";
preg_match_all("/$regexp/siU", $input, $matches);
$l = $matches[2];
$base_url = parse_url($url, PHP_URL_HOST);
foreach($l as $link){
if(strpos($link, '#')){
$link = substr($link, 0, strpos($link, '#'));
}
if(substr($link, 0, 1) == "."){
$link = substr($link, 1);
}
if(substr($link, 0, 7) == "http://"){
$link = $link;
} elseif(substr($link, 0, 8) == "https://"){
$link = $link;
} elseif(substr($link, 0, 2) == "//"){
$link = substr($link, 2);
} elseif(substr($link, 0, 1) == "#"){
$link = $url;
} elseif(substr($link, 0, 7) == "mailto:"){
$link = "[".$link."]";
} else{
if(substr($link, 0,1) != "/"){
$link = $base_url."/".$link;
} else{
$link = $base_url.$link;
}
}
if(substr($link, 0, 7) != "http://" && substr($link, 0, 8) != "https://" && substr($link, 0, 1) != "["){
if(substr($url, 0, 8) == "https://"){
$link = "https://".$link;
} else{
$link= "http://".$link;
}
}
if (!in_array($link, $c)){
array_push($c, $link);
}
}
}
get_links($to_crawl);
foreach($c as $page){
get_links($page);
}
foreach($c as $page){
echo $page."<br/ >";
}
?>
Making $c global and pulling it into the function on every call is a fragile design. You should avoid "global" whenever possible.
I see two choices here:
1/ Pass your array by reference (search for "PHP passing by reference") as a parameter of the "get_links" function. This lets the function fill the array directly.
Example:
function getlinks($url, &$links){
//do your stuff to find the links
//then add each link to the array
$links[] = $oneLink;
}
$allLinks = array();
getlinks("thefirsturl.com", $allLinks);
//call getlinks as many as you want
//then your array will contain all the links
print_r($allLinks);
Or 2/ Make "get_links" return an array of links, and concatenate it into a bigger one to store all your links.
function getlinks($url){
$links = array();
//do your stuff to find the links
//then add each link to the array
$links[] = $oneLink;
return $links;
}
$allLinks = array();
$allLinks = array_merge($allLinks, getlinks("thefirsturl.com"));
//call getlinks as many times as you want. Note array_merge: the + operator would be a union by key and drop links whose numeric keys collide
print_r($allLinks);
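One caveat on option 2: for arrays, PHP's + operator is a union by key rather than an append, so with numerically indexed lists it silently drops entries. A minimal sketch of the difference, using made-up URLs as placeholders:

```php
<?php
// Placeholder link lists (assumed for illustration); both are indexed from 0.
$first  = array('http://example.com/a', 'http://example.com/b');
$second = array('http://example.com/c');

// Union by key: key 0 already exists in $first, so $second[0] is dropped.
$union = $first + $second;

// array_merge renumbers numeric keys and appends every element.
$merged = array_merge($first, $second);

// Optional: remove duplicates collected from several pages.
$unique = array_values(array_unique(array_merge($merged, $first)));

echo count($union), ' ', count($merged), ' ', count($unique), PHP_EOL;
```

If the crawler should never store the same link twice, running array_unique over the merged list is one simple way to get there.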
I am fairly new to PHP, but hear me out please.
I would like to build a web crawler that traverses all the links on a certain site, gets specific content from each page, and returns the link plus the specific info from each page.
I got the link traversal function from a YouTube tutorial: https://www.youtube.com/watch?v=KBemN_bTnHU, but I can't make the final part work: when it tries to follow the links, nothing is output (noob alert).
Here is the function to get the links out of a website (it doesn't fully work):
<?php
$to_crawl = "http://reteteculinare.ro";
$c = array();
function get_Links($to_crawl){
global $c;
$input = @file_get_contents($to_crawl);
$base_url = parse_url($to_crawl, PHP_URL_HOST);
$regexp = '<a\s[^>]*href=(\"??)([^\" >]*?)\\1[^>]*>(.*)<\/a>';
preg_match_all("/$regexp/siU", $input, $matches);
$l = $matches[2];
foreach ($l as $link) {
if(strpos($link, "#")) {
$link = substr($link, 0, strpos($link, "#"));
}
if(substr($link, 0, 1) == "."){
$link = substr($link, 1);
}
if(substr($link, 0, 7) == "http://"){
$link = $link;
} else if (substr($link, 0, 8) == "https://"){
$link = $link;
} else if (substr($link, 0, 4) == "www."){
$link = substr($link, 4);
} else if (substr($link, 0, 6) == "//www."){
$link = substr($link, 6);
} else if (substr($link, 0, 2) == "//"){
$link = substr($link, 2);
} else if (substr($link, 0, 1) == "#"){
$link = $to_crawl;
} else if (substr($link, 0, 7) == "mailto:"){
$link = "[".$link."]";
} else {
if(substr($link, 0, 1) != "/") {
$link = $base_url."/".$link;
} else {
$link = $base_url.$link;
}
}
if(substr($link, 0, 4) == "www."){
$link = substr($link, 4);
}
if(substr($link, 0, 7) != "http://" && substr($link, 0, 8) != "https://" && substr($link, 0, 1) != "[") {
$link = "http://".$link;
} else {
$link = "https://".$link;
}
if (!in_array($link, $c)) {
array_push($c, $link);
}
}
}
get_links($to_crawl);
foreach ($c as $page) {
get_links($page);
}
foreach ($c as $page) {
echo $page."<br />";
}
?>
The code works until it tries to follow the links. Any clues? :D In the video it seems to work fine for the guy...
Finally, here is my function for getting certain information from a page and saving it into an array:
<?php
include('simple_html_dom.php');
header('Content-type: text/plain');
$html = new simple_html_dom();
$page = ('http://www.reteteculinare.ro/carte_de_bucate/dulciuri/gauffre-de-liege-1687/');
$base_url = parse_url($page, PHP_URL_HOST);
function getRecipe($page) {
global $recipe, $page, $base_url;
$html = new simple_html_dom();
$html->load_file($page);
$reteta = $html->getElementById('hrecipe');
$r_title = $reteta->children(4)->outertext;
$r_title = strip_tags($r_title);
$r_title = trim($r_title);
$r_poza = $reteta->getElementById('.div_photo_reteta')->children(0)->src;
$r_poza = $base_url.$r_poza;
$r_ingrediente = $reteta->getElementById('#ingrediente-lista')->outertext;
$r_preparare = $reteta->getElementById('.instructions')->children(1)->outertext;
$r_preparare = strip_tags($r_preparare);
// $r_durata = $reteta->getElementById('.duration')->children(0)->outertext;
// $r_durata = preg_replace('/\s/', '', $r_durata);
// $r_durata = strip_tags($r_durata);
$recipe = array(
"Titlu: " => $r_title,
// "Durata: " => $r_durata,
"Link Poza: " => $r_poza,
"Ingrediente: " => $r_ingrediente,
"Preparare: " => $r_preparare
);
echo '<pre>';
print_r($recipe);
echo '</pre>';
}
getRecipe($html);
?>
This works fine and gets the info that I want into an array. It is a noob method of data mining, I am sure, but I don't know better :)
Finally, I would like to connect these two functions so that, for each link the crawler traverses, it runs the second function and returns an array that contains the link where the data was found plus the data.
If any of you can lend a helping hand, I would certainly appreciate it.
What I am trying to achieve is a bit beyond my grasp, but I really want to learn and expand my knowledge.
Cheers!
I'm getting the error "in_array() expects parameter 2 to be array, null given" even though I have "$c = array();".
Here's my code:
Route::get('/', function () {
return view('welcome');
});
get('pro', function(){
$to_crawl = "http://bestspace.co";
$c = array();
function get_links($url)
{
global $c;
$input = @file_get_contents($url);
$regexp = '<a\s[^>]*href=(\"??)([^" >]*?)\\1[^>]*>(.*)<\/a>';
preg_match_all("/$regexp/siU", $input, $matches);
$base_url = parse_url($url, PHP_URL_HOST);
$l = $matches[2];
foreach($l as $link)
{
if (strpos($link, "#"))
{
$link = substr($link, 0, strpos($link, "#"));
}
if (substr($link,0,1) == ".")
{
$link = substr($link, 1);
}
if (substr($link, 0, 7) == "http://")
{
$link = $link;
}
else if (substr($link, 0, 8) == "https://")
{
$link = $link;
}
else if (substr($link, 0, 2) == "//")
{
$link = substr($link, 2);
}
else if (substr($link, 0, 1) == "#")
{
$link = $url;
}
else if (substr($link, 0, 7) == "mailto:")
{
$link = "[". $link."]";
}
else
{
if (substr($link, 0, 1) != "/")
{
$link = $base_url."/".$link;
}
else
{
$link = $base_url.$link;
}
}
if (substr($link, 0, 7) != "http://" && substr($link, 0, 8) != "https://" && substr($link, 0, 1) != "[")
{
if (substr($url, 0, 8) == "https://")
{
// prepend https
$link = "https://".$link;
}
else
{
// prepend http
$link = "http://".$link;
}
}
//echo $link."<br>";
if (!in_array($link, $c))
{
array_push($c, $link);
}
}
}
get_links($to_crawl);
foreach ($c as $page)
{
echo $page."<br>";
}
});
The problem is coming from here:
if (!in_array($link, $c))
{
array_push($c, $link);
}
It's complaining about $c, but it's global and I also have it as
$c = array();
Any help as to why this is occurring will be appreciated.
This is where the problem really originates:
get('pro', function(){
$to_crawl = "http://bestspace.co";
$c = array();
function get_links($url)
{
global $c;
The $c = array() is inside the closure that you passed as the second argument to the get() function, so it lives in the closure's local scope, not in the global scope. The global $c inside get_links() therefore refers to a different, undefined global variable (null), which is why in_array() complains.
To fix this, pass $c into the get_links() function by reference:
function get_links($url, array &$c)
{
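To see the scoping issue in isolation, here is a minimal sketch. The names add_link, peek_global and $crawl_links are made up for illustration, and $handler only stands in for the route closure from the question; it is not part of any framework API.

```php
<?php
// Hypothetical helper: adds $link to the array passed by reference, once.
function add_link($link, array &$c)
{
    if (!in_array($link, $c)) {
        $c[] = $link;
    }
}

// Returns whatever "global $crawl_links" sees from inside a function.
function peek_global()
{
    global $crawl_links;
    return $crawl_links;
}

// Stand-in for the route closure from the question.
$handler = function () {
    $crawl_links = array();          // local to the closure, not a global
    $seen_as_global = peek_global(); // null: no global of that name exists

    add_link('http://example.com/a', $crawl_links);
    add_link('http://example.com/a', $crawl_links); // duplicate, skipped
    add_link('http://example.com/b', $crawl_links);

    return array($seen_as_global, $crawl_links);
};

list($seen_as_global, $links) = $handler();
var_dump($seen_as_global);
print_r($links);
```

The by-reference parameter keeps the data flow explicit: the function can only touch the array it was handed, wherever that array was created.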
Have you tried the "use" clause? Note that it is only available on anonymous functions, so get_links() would have to become a closure:
$get_links = function () use (&$c) {
print_r($c);
};
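For reference, the use clause attaches to anonymous functions only, and it captures by value unless the variable is prefixed with &. A small sketch under those rules, where $add_link is a made-up name and the URLs are placeholders:

```php
<?php
$c = array();

// Anonymous function capturing $c by reference, so changes are visible outside.
$add_link = function ($link) use (&$c) {
    if (!in_array($link, $c)) {
        $c[] = $link;
    }
};

$add_link('http://example.com/a');
$add_link('http://example.com/a'); // duplicate, skipped
$add_link('http://example.com/b');

print_r($c);
```

Without the &, the closure would work on its own copy and $c would stay empty after the calls.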
I want to extract the website name, from a link, so I write the following function:
protected function getWebsiteName()
{
$prefixs = ['https://', 'http://', 'www.'];
foreach($prefixs as $prefix)
{
if(strpos($this->website_link, $prefix) !== false)
{
$len = strlen($prefix);
$this->website_name = substr($this->website_link, $len);
$this->website_name = substr($this->website_name, 0, strpos($this->website_name, '.'));
}
}
}
The problem is that when I use a website link that looks like https://www.github.com, the result is: s://www, and the function only works when I remove 'www.' from the array.
Any ideas why this is happening, or how I can improve this function?
You could use parse_url(). Try:
print_r(parse_url('https://www.name/'));
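As a rough sketch of where that leads: parse_url() splits the URL into named parts, and the website name can then be taken from the host. The sample URL is only an illustration, and treating the first label after an optional "www." as the name is an assumption that will not hold for every host (for example other subdomains).

```php
<?php
$parts = parse_url('https://www.github.com/php/php-src?tab=readme#top');
// $parts now holds the keys scheme, host, path, query and fragment.

$host = isset($parts['host']) ? $parts['host'] : '';

// Assumption: a leading "www." is not part of the website name.
$domain = preg_replace('/^www\./i', '', $host);

// Take the first label of what remains.
$labels = explode('.', $domain);
$name = $labels[0];

echo $host, ' -> ', $name, PHP_EOL;
```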
Let's look at your code. On every pass through the foreach, you apply your logic to the original website_link again. This means that when you reach the 'www.' prefix, after the first two iterations, this happens:
$prefix is www.
Therefore, $len = 4 (the length of $prefix)
$this->website_link is still https://www.github.com
You apply substr($this->website_link, 4)
Result is $this->website_name = 's://www.github.com'
You apply substr($this->website_name, 0, 7) (7 being the result of strpos($this->website_name, '.'))
The result is $this->website_name = 's://www'
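The trace above can be reproduced with a short standalone script. This is only a sketch: it replaces the class properties from the question with plain variables and records the name computed for each prefix that matches.

```php
<?php
$website_link = 'https://www.github.com';
$trace = array();

foreach (array('https://', 'http://', 'www.') as $prefix) {
    // Same test as in the question: the prefix may match anywhere in the link.
    if (strpos($website_link, $prefix) !== false) {
        // Always cuts from the ORIGINAL link, using only the prefix length.
        $name = substr($website_link, strlen($prefix));
        $name = substr($name, 0, strpos($name, '.'));
        $trace[$prefix] = $name;
    }
}

print_r($trace);
```

The 'http://' prefix never matches here because the link contains 'https://', so the last assignment comes from 'www.' and overwrites the earlier result.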
To fix this, copy $this->website_link into $temp, strip each prefix from $temp only when it appears at the start, and then use the following code:
$temp = $this->website_link;
foreach($prefixs as $prefix)
{
if(strpos($temp, $prefix) === 0)
{
$len = strlen($prefix);
$temp = substr($temp, $len);
}
}
$this->website_name = substr($temp, 0, strpos($temp, '.'));
I'd suggest @dynamic's answer, but if you want to continue with the string replacement strategy, use str_replace. It accepts an array for the search argument!
$prefixes = ['https://', 'http://', 'www.'];
$this->website_name = str_replace($prefixes, '', $this->website_link);
$this->website_name = substr($this->website_name, 0, strpos($this->website_name, '.'));
Yes, parse_url together with preg_match should do the job:
function getWebsiteName($url)
{
$pieces = parse_url($url);
$domain = isset($pieces['host']) ? $pieces['host'] : '';
if (preg_match('/(?P<domain>[a-z0-9][a-z0-9\-]{1,63}\.[a-z\.]{2,6})$/i', $domain, $regs)) {
return $regs['domain'];
}
return false;
}
Here is a fixed version of your code:
function getWebsiteName()
{
$this->website_name = $this->website_link;
$prefixs = array('https://', 'http://', 'www.');
foreach($prefixs as $prefix)
{
if (substr($this->website_name, 0, strlen($prefix)) == $prefix) {
$this->website_name = substr($this->website_name, strlen($prefix));
}
}
}
Say I have two strings:
$first = 'http://www.example.com';
$second = 'www.example.com/';
How could I determine that they match? I just care that the example part matches. I'm thinking some form of regex pattern would work, but I can't figure it out at all.
Don't use a regex if you're trying to evaluate structured data. Regexes are not a magic wand you wave at every problem that happens to involve strings. What if you have a URL like http://www.some-other-domain.com/blah/blah/?www.example.com?
If you're trying to match a domain name to a domain name, then break apart the URL to get the host and compare that. In PHP, use the parse_url function. That will give you www.example.com as the host name, and then you can compare that to make sure it is the hostname you expect.
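A minimal sketch of that comparison follows. The helper name host_of is made up, and two choices in it are assumptions based on the question rather than anything parse_url does by itself: a missing scheme is filled in with http:// so that parse_url can find the host, and a leading "www." is ignored.

```php
<?php
// Hypothetical helper: returns the lower-cased host without "www.", or null.
function host_of($url)
{
    // parse_url() treats "www.example.com/" as a path, so add a scheme first.
    if (!preg_match('#^[a-z][a-z0-9+.\-]*://#i', $url)) {
        $url = 'http://' . $url;
    }
    $host = parse_url($url, PHP_URL_HOST);
    if (!is_string($host)) {
        return null; // malformed URL or no host part
    }
    return preg_replace('/^www\./', '', strtolower($host));
}

$first  = 'http://www.example.com';
$second = 'www.example.com/';
$tricky = 'http://www.some-other-domain.com/blah/blah/?www.example.com';

echo host_of($first) === host_of($second) ? 'Match' : 'Not Match', PHP_EOL;
echo host_of($first) === host_of($tricky) ? 'Match' : 'Not Match', PHP_EOL;
```

Because only the host is compared, the query string in the third URL cannot produce a false match.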
Try this
function DomainUrl($x) {
$url = $x;
if ( substr($url, 0, 7) == 'http://') { $url = substr($url, 7); }
if ( substr($url, 0, 8) == 'https://') { $url = substr($url, 8); }
if ( substr($url, 0, 4) == 'www.') { $url = substr($url, 4); }
if ( substr($url, 0, 5) == 'www9.') { $url = substr($url, 5); }
if ( strpos($url, '/') !== false) {
$ex = explode('/', $url);
$url = $ex[0];
}
return $url;
}
$first = DomainUrl('http://www.example.com');
$second = DomainUrl('www.example.com/');
if($first == $second){
echo 'Match';
}else{
echo 'Not Match';
}
When I spider a website, I get a lot of bad URLs like these:
http://example.com/../../.././././1.htm
http://example.com/test/../test/.././././1.htm
http://example.com/.//1.htm
http://example.com/../test/..//1.htm
All of these should be http://example.com/1.htm.
How can I do this in PHP? Thanks.
PS: I use http://snoopy.sourceforge.net/
I get a lot of repeated links in my database; 'http://example.com/../test/..//1.htm' should be 'http://example.com/1.htm'.
You could do it like this, assuming all the URLs you have provided are expected to be http://example.com/1.htm:
$test = array('http://example.com/../../../././.\./1.htm',
'http://example.com/test/../test/../././.\./1.htm',
'http://example.com/.//1.htm',
'http://example.com/../test/..//1.htm');
foreach ($test as $url){
$u = parse_url($url);
$path = $u['scheme'].'://'.$u['host'].'/'.basename($u['path']);
echo $path.'<br />'.PHP_EOL;
}
/* result
http://example.com/1.htm<br />
http://example.com/1.htm<br />
http://example.com/1.htm<br />
http://example.com/1.htm<br />
*/
//or as a function @lpc2138
function getRealUrl($url){
$u = parse_url($url);
$path = $u['scheme'].'://'.$u['host'].'/'.basename($u['path']);
$path .= (!empty($u['query'])) ? '?'.$u['query'] : '';
return $path;
}
You seem to be looking for an algorithm to remove the dot segments (see RFC 3986, section 5.2.4):
function remove_dot_segments($abspath) {
$ib = $abspath;
$ob = '';
while ($ib !== '') {
if (substr($ib, 0, 3) === '../') {
$ib = substr($ib, 3);
} else if (substr($ib, 0, 2) === './') {
$ib = substr($ib, 2);
} else if (substr($ib, 0, 2) === '/.' && (strlen($ib) === 2 || $ib[2] === '/')) {
$ib = '/'.substr($ib, 3);
} else if (substr($ib, 0, 3) === '/..' && (strlen($ib) === 3 || $ib[3] === '/')) {
$ib = '/'.substr($ib, 4);
$ob = substr($ob, 0, strlen($ob)-strlen(strrchr($ob, '/')));
} else if ($ib === '.' || $ib === '..') {
$ib = '';
} else {
$pos = strpos($ib, '/', 1);
if ($pos === false) {
$ob .= $ib;
$ib = '';
} else {
$ob .= substr($ib, 0, $pos);
$ib = substr($ib, $pos);
}
}
}
return $ob;
}
This removes the . and .. segments. Removing any other segment, such as an empty one (//) or .\., is not covered by the standard, because it changes the semantics of the path.
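If collapsing empty segments is acceptable for deduplicating crawled links, a shorter segment-stack variant can handle the URLs from the question. This is a sketch of a different technique, not the standard buffer algorithm: it deliberately drops empty segments as well, does not keep a trailing slash, and assumes absolute URLs with a scheme and host. The function names collapse_path and normalize_url are made up.

```php
<?php
// Collapse ".", ".." and empty segments of an absolute path.
function collapse_path($path)
{
    $out = array();
    foreach (explode('/', $path) as $segment) {
        if ($segment === '' || $segment === '.') {
            continue;            // skip empty and "." segments
        }
        if ($segment === '..') {
            array_pop($out);     // go up one level; no-op at the root
            continue;
        }
        $out[] = $segment;
    }
    return '/' . implode('/', $out);
}

// Rebuild scheme://host/path?query; assumes scheme and host are present.
function normalize_url($url)
{
    $u = parse_url($url);
    $path = isset($u['path']) ? collapse_path($u['path']) : '/';
    $query = isset($u['query']) ? '?' . $u['query'] : '';
    return $u['scheme'] . '://' . $u['host'] . $path . $query;
}

$urls = array(
    'http://example.com/../../.././././1.htm',
    'http://example.com/test/../test/.././././1.htm',
    'http://example.com/.//1.htm',
    'http://example.com/../test/..//1.htm',
);
foreach ($urls as $url) {
    echo normalize_url($url), PHP_EOL;
}
```

Unlike taking only the basename of the path, this keeps real directories, so http://example.com/a/./b/1.htm stays under /a/b/.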
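You could do some fancy regex, but plain str_replace handles the simple cases. Note that it only deletes the dot sequences and does not drop the parent directory for '../', so 'test/../1.htm' becomes 'test/1.htm'.
fixUrl('http://example.com/../../../././.\./1.htm');
function fixUrl($str) {
$str = str_replace('\.', '', $str); // strip '\.' first so the remaining './' pairs line up
$str = str_replace('../', '', $str);
$str = str_replace('./', '', $str);
return $str;
}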