Recursive Web Link Search in PHP

I am trying to do recursive web link search using PHP, but the code doesn't seem to work. I get a timeout error.
function linksearch($url)
{
    $text = file_get_contents($url);
    if (!empty($text))
    {
        $res1 = preg_match_all("/\b(?:(?:https?|ftp):\/\/|www\.)[-a-z0-9+&@#\/%?=~_|!:,.;]*[-a-z0-9+&@#\/%=~_|]/i",
            $text,
            $matches);
        if ($res1)
        {
            foreach(array_unique($matches[0]) as $link)
            {
                linksearch($url);
            }
        }
        else
        {
            // echo "No links found.";
        }
    }
}

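Your function never terminates, because inside the loop it calls itself again with the same argument:
linksearch($url);
A recursive function needs a condition that ends it. In a proper recursion the input changes on every call until that condition is met. Here it is always the same $url, so the same page is fetched over and over until the script times out. You presumably meant to pass $link instead, and even then you would have to keep track of the pages you have already visited.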
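To make the point concrete, here is a minimal sketch of a terminating version. It assumes two things that are not in the question: a $visited array that records pages already seen, and a $depth limit. The $fetch callable and the extract_links() helper are hypothetical names introduced only so the sketch can be run without touching the network.

```php
<?php
// Hypothetical helper: pull every URL-looking string out of a page.
function extract_links(string $text): array
{
    $pattern = '/\b(?:(?:https?|ftp):\/\/|www\.)[-a-z0-9+&@#\/%?=~_|!:,.;]*[-a-z0-9+&@#\/%=~_|]/i';
    preg_match_all($pattern, $text, $matches);
    return array_values(array_unique($matches[0]));
}

// Recursive search that stops on two conditions:
// the depth limit is exhausted, or the URL was already visited.
function linksearch(string $url, callable $fetch, array &$visited = [], int $depth = 2): array
{
    if ($depth < 0 || isset($visited[$url])) {
        return array_keys($visited);
    }
    $visited[$url] = true;

    $text = $fetch($url);
    if (is_string($text) && $text !== '') {
        foreach (extract_links($text) as $link) {
            // Recurse on the link that was found, not on the current $url.
            linksearch($link, $fetch, $visited, $depth - 1);
        }
    }
    return array_keys($visited);
}
```

For real pages you could pass 'file_get_contents' as $fetch. Keep the depth small, since the number of pages can grow very quickly with each level.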

Why don't you first save the page locally and tune your script against that local test file, instead of making a remote call on every run? The code that follows your file_get_contents() call won't cause a timeout by itself unless the HTML file is enormous.

Related

image_container remains null therefore it throws the error? [duplicate]

I am using this library (PHP Simple HTML DOM parser) to parse a link, here's the code:
function getSemanticRelevantKeywords($keyword){
    $results = array();
    $html = file_get_html("http://www.semager.de/api/keyword.php?q=". urlencode($keyword) ."&lang=de&out=html&count=2&threshold=");
    foreach($html->find('span') as $e){
        $results[] = $e->plaintext;
    }
    return $results;
}
but I am getting this error when I output the results:
Fatal error: Call to a member function find() on a non-object in
/var/www/vhosts/efamous.de/subdomains/sandbox/httpdocs/getNewTrusts.php
on line 25
(line 25 is the foreach loop), the odd thing is that it outputs everything (at least seemingly) correctly but I still get that error and can't figure out why.

The reason for this error is that Simple HTML DOM does not return an object if the response from the URL is larger than 600000 bytes (the library's MAX_FILE_SIZE).
You can avoid it by changing the simple_html_dom.php file: remove strlen($contents) > MAX_FILE_SIZE from the if condition in the file_get_html function.
This will solve your issue.

You just need to increase the MAX_FILE_SIZE constant in simple_html_dom.php.
For example:
define('MAX_FILE_SIZE', 999999999999999);

This error usually means that $html isn't an object.
It's odd that you say this seems to work. What happens if you output $html?
I'd imagine that the url isn't available and that $html is null.
Edit:
Looks like this may be an error in the parser. Someone has submitted a bug and added a check in his code as a workaround.
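A defensive check along these lines can be sketched as a small wrapper. The name find_plaintext() is hypothetical and not part of Simple HTML DOM. It only assumes that a usable parser object exposes find() and that matched elements have a plaintext property, as in the question's code.

```php
<?php
// Hypothetical helper: return the plain text of every element matching
// $selector, or an empty array when $html is not a usable parser object
// (for example when the loader returned false or null).
function find_plaintext($html, string $selector): array
{
    if (!is_object($html) || !method_exists($html, 'find')) {
        return [];
    }
    $results = [];
    foreach ($html->find($selector) as $element) {
        $results[] = $element->plaintext;
    }
    return $results;
}
```

With the library loaded, the loop in the question could then become something like $results = find_plaintext(file_get_html($url), 'span');, which yields an empty array instead of a fatal error when the page could not be loaded.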

Before calling file_get_html/load_file, you should first check whether the URL exists.
If the URL exists, you have passed the first step.
(Some servers serve the 404 page as a valid HTML page. It has a proper HTML structure with head, body, etc., but contains only a text such as "This page couldn't be found. 404 error ...".)
If the URL returns 200 OK, you should then check whether the fetched thing is an object and whether its nodes are set.
This is the code I use in my pages:
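function url_exists($url){
    // Prepend a scheme when the URL has none
    if (strpos($url, "http") === false) $url = "http://" . $url;
    // @ suppresses the warning get_headers() raises when the request fails
    $headers = @get_headers($url);
    // print_r($headers);
    if (is_array($headers)){
        // The first header is the status line, e.g. "HTTP/1.1 404 Not Found"
        if (strpos($headers[0], '404 Not Found') !== false)
            return false;
        else
            return true;
    }
    else
        return false;
}
$pageAddress = 'http://www.google.com';
// Create the parser object before calling load_file() on it
$htmlPage = new simple_html_dom();
if ( url_exists($pageAddress) ) {
    $htmlPage->load_file( $pageAddress );
} else {
    echo 'url doesn t exist, i stop';
    return;
}
if( $htmlPage && is_object($htmlPage) && isset($htmlPage->nodes) )
{
    // do your work here...
} else {
    echo 'fetched page is not ok, i stop';
    return;
}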
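Matching the literal text '404 Not Found' misses other failures such as 403 or 500. A possible refinement, sketched here with the hypothetical helper name final_status_code(), is to parse the numeric status code out of the array that get_headers() returns. As far as I know get_headers() follows redirects by default and lists every response in turn, so the sketch keeps the last status line it sees.

```php
<?php
// Hypothetical helper: extract the final HTTP status code from the array
// returned by get_headers(). Returns null when no status line is present.
function final_status_code(array $headers): ?int
{
    $status = null;
    foreach ($headers as $line) {
        // Status lines look like "HTTP/1.1 200 OK" or "HTTP/2 200".
        if (is_string($line) && preg_match('#^HTTP/\d+(?:\.\d+)?\s+(\d{3})#', $line, $m)) {
            $status = (int) $m[1]; // Later status lines override earlier ones.
        }
    }
    return $status;
}
```

A caller could then treat only codes in the 200 to 299 range as "the URL exists", which is an assumption you may want to adjust for your own pages.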

For those arriving here via a search engine (as I did): after reading the info (and linked bug report) above, I did some code prodding and ended up fixing my problems with two extra checks after loading the DOM:
$html = file_get_html('<your url here>');
// first check if $html->find exists
if (method_exists($html,"find")) {
    // then check if the html element exists to avoid trying to parse non-html
    if ($html->find('html')) {
        // and only then start searching (and manipulating) the dom
    }
}

I'm having the same error come up in my logs and apart from the solutions mentioned above, it could also be that there is no 'span' in the document. I get the same error when searching for divs with a particular class that doesn't exist on the page, but when searching for something that I know exists on the page, the error doesn't pop up.

Your script is OK.
I get this error when the parser does not find the element I'm looking for on the page.
In your case, please check whether the page you are accessing has a 'span' element.

The simplest solution to this problem:
if ($html = file_get_html("http://www.semager.de/api/keyword.php?q=". urlencode($keyword) ."&lang=de&out=html&count=2&threshold=")) {
    // $html is usable here
} else {
    // do something else because couldn't find html
}

The error means that the find() function is either not defined yet or not available. Make sure you have loaded or included the file that defines it.

PHP replace {replace_me} with <?php include ?> in output buffer

I have a file like this:
**buffer.php**
<?php ob_start(); ?>
<h1>Welcome</h1>
{replace_me_with_working_php_include}
<h2>I got a problem..</h2>
<?php ob_end_flush(); ?>
Everything inside the buffer is generated dynamically with data from the database, and inserting PHP into the database is not an option.
The issue is that I have my output buffer and I want to replace '{replace}' with a working PHP include, which includes a file that also contains some HTML/PHP.
So my actual question is: how do I replace a string with working PHP code in an output buffer?
I hope you can help, I have spent way too much time on this.
Best regards - user2453885
EDIT - 25/11/14
I know WordPress and Joomla use something similar: you can write {rate} in your post, and it gets replaced with a rating system (from some rating plugin). This is the secret knowledge I desire.

You can use preg_replace_callback and let the callback include the file you want to include and return the output. Or you could replace the placeholders with textual includes, save that as a file and include that file (sort of compile the thing)
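A minimal sketch of the callback approach might look like this. The function name render_placeholders() and the $includes map (placeholder name to file path) are assumptions made for illustration. The map doubles as a whitelist, so text coming from the database cannot make the script include arbitrary files.

```php
<?php
// Hypothetical helper: replace {name} placeholders with the captured output
// of an included PHP file. $includes maps placeholder names to file paths.
function render_placeholders(string $template, array $includes): string
{
    return preg_replace_callback(
        '/\{(\w+)\}/',
        function (array $match) use ($includes): string {
            $name = $match[1];
            if (!isset($includes[$name]) || !is_file($includes[$name])) {
                return $match[0]; // Unknown placeholder: leave it untouched.
            }
            // Run the include inside its own buffer and return what it printed.
            ob_start();
            include $includes[$name];
            return ob_get_clean();
        },
        $template
    );
}
```

Usage would be along the lines of echo render_placeholders($main, ['rate' => 'plugins/rate.php']);, where $main holds the buffered page and the path is just a placeholder for your own file.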

For simple text you could use explode (though it's probably not the most efficient approach for large blocks of text):
function StringSwap($text ="", $rootdir ="", $begin = "{", $end = "}") {
    // Collected blocks, each marked as text or as a file to include
    $new = array();
    // Explode beginning
    $go = explode($begin,$text);
    // Loop through the array
    if(is_array($go)) {
        foreach($go as $value) {
            // Split ends if available
            $value = explode($end,$value);
            // If there is an end, key 0 should be the replacement
            if(count($value) > 1) {
                // Check if the file exists based on your root
                if(is_file($rootdir . $value[0])) {
                    // If it is a real file, mark it and remove it
                    $new[]['file'] = $rootdir . $value[0];
                    unset($value[0]);
                }
                // All others set as text
                $new[]['txt'] = implode($value);
            }
            else
                // No end delimiter found, so not a file: just assign as text
                $new[]['txt'] = $value;
        }
    }
    // Loop through new array and handle each block as text or include
    foreach($new as $block) {
        if(isset($block['txt'])) {
            echo (is_array($block['txt']))? implode(" ",$block['txt']): $block['txt']." ";
        }
        elseif(isset($block['file'])) {
            include_once($block['file']);
        }
    }
}
// To use, drop your text in here as a string
// You need to set a root directory so it can map properly
StringSwap($text, $rootdir);

I might be misunderstanding something here, but something simple like this might work?
<?php
# Main page (retrieved from the database or wherever into a variable - output buffer example shown)
ob_start();
?>
<h1>Welcome</h1>
{replace_me_with_working_php_include}
<h2>I got a problem..</h2>
<?php
$main = ob_get_clean();
# Replacement
ob_start();
include 'whatever.php';
$replacement = ob_get_clean();
echo str_replace('{replace_me_with_working_php_include}', $replacement, $main);
You can also use a return statement from within an include file if you wish to remove the output buffer from that task too.
Good luck!

Thank you all for the lovely input.
I will try to answer my own question as clearly as I can.
Problem: I first thought that I wanted to run a PHP function or include inside a buffer. That, however, is not what I actually wanted, and it is not how a buffer is meant to be used.
Solution: a callback function that returns my desired content. With preg_replace_callback() I could find the text I wanted to replace in my buffer and replace it with whatever the callback function returns.
The callback then includes the necessary files/classes and calls the functions that produce the content.
Tell me if anything is unclear or if you want me to elaborate on my solution.

Using WordPress, calling a function twice on same page fails second time

I have a function in my theme's functions.php file that calls the Edmunds API and retrieves a stock vehicle image.
From a page template, if I call the function more than once, it fails on the second call. It works perfectly the first time, but doesn't output anything the second time. When I print_r the $aVehImage array, it's empty. (I've verified that images are available in the API for the vehicles in the later calls, by the way.)
Code below:
function get_edmunds_image($vehicleMake, $vehicleModel, $vehicleYear) {
    $getVehicleStyle = 'https://api.edmunds.com/api/vehicle/v2/'.$vehicleMake.'/'.$vehicleModel.'/'.$vehicleYear.'/styles?state=used&fmt=json&api_key=XXX';
    $vehicleStyleID = json_decode(file_get_contents($getVehicleStyle), true);
    $getImages = 'https://api.edmunds.com/v1/api/vehiclephoto/service/findphotosbystyleid?styleId='.$vehicleStyleID['styles'][0]['id'].'&fmt=json&api_key=XXX';
    $aImages = json_decode(file_get_contents($getImages), true);
    $aVehImage = array();
    foreach ($aImages as $image) {
        $iURL = 'http://media.ed.edmunds-media.com'.str_replace('dam/photo','',$image['id']).'_';
        array_push($aVehImage, $iURL);
    }
    echo '<img src="'.$aVehImage[0].'500.jpg" />';
}

Thanks Marcos! That did, indeed, appear to be the issue. For now, I just used the sleep() function to pause it for a second, until I find a better solution.

Can I retry file_get_contents() until it opens a stream?

I am using PHP to get the contents of an API. The problem is, sometimes that API just sends back a 502 Bad Gateway error and the PHP code can’t parse the JSON and set the variables correctly. Is there some way I can keep trying until it works?

This is not an easy question because PHP is a synchronous language by default.
You could do this:
$a = false;
$i = 0;
while($a == false && $i < 10)
{
    $a = file_get_contents($path);
    $i++;
    usleep(100000); // usleep() takes microseconds: this pauses for 100 ms
}
$result = json_decode($a);
The pause keeps your server from hammering the API, and tying itself up, whenever the API is unavailable. Note that usleep() takes microseconds, so usleep(10) would be a negligible pause. The loop also gives up after 10 attempts, which prevents the script from hanging during a long outage.

Since you didn't provide any code it's kind of hard to help you, but here is one way to do it.
$data = null;
while(!$data) {
    $json = file_get_contents($url);
    $data = json_decode($json); // Returns null if the JSON is not valid
}
// The while loop won't stop until the JSON is valid and $data contains an object
var_dump($data);
I suggest you add a counter so the loop stops trying after X attempts.

Based on your comment, here is what I would do:
1. You have a PHP script that makes the API call and, if successful, records the price and when that price was acquired.
2. You put that script in a cronjob/scheduled task that runs every 10 minutes.
3. Your PHP view pulls the most recent price from the database and uses that for whatever display/calculations it needs. If pertinent, also show the date/time that price was captured.
The other answers suggest doing a loop. A combo approach probably works best here: in your script, put in a few loops just in case the interface is down for a short blip. If it's not up after say a minute, use the old value until your next try.
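The combined approach could be sketched roughly like this. The function name fetch_json_with_fallback(), the $fetch callable and the parameter names are all hypothetical. The idea is only: try a few times with a short pause, and hand back the old value if every attempt fails.

```php
<?php
// Hypothetical helper: call $fetch up to $maxAttempts times, pausing between
// attempts, and return $fallback (e.g. the last stored price) if all fail.
function fetch_json_with_fallback(callable $fetch, $fallback, int $maxAttempts = 3, int $delayMicroseconds = 200000)
{
    for ($attempt = 1; $attempt <= $maxAttempts; $attempt++) {
        $raw = $fetch();
        if (is_string($raw) && $raw !== '') {
            $data = json_decode($raw, true);
            // Accept the response only if it parsed as valid JSON.
            if (json_last_error() === JSON_ERROR_NONE) {
                return $data;
            }
        }
        if ($attempt < $maxAttempts) {
            usleep($delayMicroseconds); // Microseconds: 200000 is 0.2 s.
        }
    }
    return $fallback;
}
```

In the cron script, $fetch could be a closure around file_get_contents($url) and $fallback the most recent price read from the database.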

A loop can solve this problem, but so can a recursive function like this one:
function file_get_contents_retry($url, $attemptsRemaining=3) {
    $content = file_get_contents($url);
    $attemptsRemaining--;
    if( empty($content) && $attemptsRemaining > 0 ) {
        return file_get_contents_retry($url, $attemptsRemaining);
    }
    return $content;
}
// Usage:
$retryAttempts = 6; // Default is 3.
echo file_get_contents_retry("http://google.com", $retryAttempts);
