HTTP 500 error in simple PHP web crawler - php

I'm trying to run a web crawler that is pointed at a single URL with no links. The code seems fine, but I am getting an HTTP 500 error.
All it does with the content it crawls is echo it.
Any idea why?
<?php
error_reporting( E_ERROR );
define( "CRAWL_LIMIT_PER_DOMAIN", 50 );
$domains = array();
$urls = array();
function crawl( $url )
{
global $domains, $urls;
$parse = parse_url( $url );
$domains[ $parse['host'] ]++;
$urls[] = $url;
$content = file_get_contents( $url );
if ( $content === FALSE ){
echo "Error: No content";
return;
}
$content = stristr( $content, "body" );
preg_match_all( '/http:\/\/[^ "\']+/', $content, $matches );
// do something with content.
echo $content;
foreach( $matches[0] as $crawled_url ) {
$parse = parse_url( $crawled_url );
if ( count( $domains[ $parse['host'] ] ) < CRAWL_LIMIT_PER_DOMAIN && !in_array( $crawled_url, $urls ) ) {
sleep( 1 );
crawl( $crawled_url );
}
}
}
crawl(http://the-irf.com/hello/hello6.html);
?>

Replace:
crawl(http://the-irf.com/hello/hello6.html);
with:
crawl('http://the-irf.com/hello/hello6.html');
The URL is a text string, so it must be enclosed in quotes.
About your problem with stristr, the manual describes its return value like this:
Returns all of haystack starting from and including the first occurrence of needle to the end.
So, your code:
$content = stristr( $content, "body" );
will return all of $content starting from and including the first occurrence of body.
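A minimal sketch of that stristr() behaviour, using a made-up HTML string (the markup and the variable name $fromBody are assumptions for illustration, not taken from the crawled page):

```php
<?php
// Hypothetical page content, only for illustration.
$content = '<html><head><title>Hi</title></head><body><p>Hello</p></body></html>';

// stristr() is case-insensitive and keeps everything from the first
// occurrence of the needle (the needle itself included) to the end.
$fromBody = stristr( $content, 'body' );

echo $fromBody, "\n";
```

Note that the result starts at the word body itself, so the leading "<" of the opening tag is not part of it.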

Related

Cut a string/URL to always get a final string/URL with a specific parameter and its value in PHP

I have a URL that contains the parameter "&key".
The "&key" parameter can be near the beginning or near the end of the URL.
Ex1= http://xxxxx.com?c1=xxx&c2=xxx&c3=xxx&key=xxx&c4=xxx&f1=xxx
Ex2= http://xxxxx.com?c1=xxx&key=xxx&c2=xxx&c3=xxx&c4=xxx&f1=xxx
What I would like to get every time is the URL up to and including the key parameter and its value.
R1: http://xxxxx.com?c1=xxx&c2=xxx&c3=xxx&key=xxx
R2: http://xxxxx.com?c1=xxx&key=xxx
Here is what I have done:
$lp_sp_ad_publisher = "http://xxxxx.com?c1=xxx&c2=xxx&c3=xxx&key=xxxc4=xxxf1=xxx";
$lp_sp_ad_publisher_cut_link = explode("&", $lp_sp_ad_publisher_cut[1]); // tab
$lp_sp_ad_publisher_cut_link_final = $lp_sp_ad_publisher_cut_link[0]; // http://xxxxx.com?c1=xxx
$counter = 1;
// finding &key inside $lp_sp_ad_publisher_cut_link_final
while ((strpos($lp_sp_ad_publisher_cut_link_final, '&key')) !== false);
{
$lp_sp_ad_publisher_cut_link_final .= $lp_sp_ad_publisher_cut_link[$counter];
echo 'counter: ' . $counter . ' link: ' . $lp_sp_ad_publisher_cut_link_final . '<br/>';
$counter++;
}
The loop body only ever runs once. I guess the while condition isn't being re-evaluated with the new value set inside the loop. Any solution?
EDIT: Sorry, I misunderstood the question.
This is tricky because the URL key and value can be anything, so it might be safer to break down the URL using a combination of parse_url() and parse_str(), then put the URL back together, leaving off the part you don't want. Something like this:
function cut_url( $url='', $key='' )
{
$output = '';
$parts = parse_url( $url );
$query = array();
if( isset( $parts['scheme'] ) )
{
$output .= $parts['scheme'].'://';
}
if( isset( $parts['host'] ) )
{
$output .= $parts['host'];
}
if( isset( $parts['path'] ) )
{
$output .= $parts['path'];
}
if( isset( $parts['query'] ) )
{
$output .= '?';
parse_str( $parts['query'], $query );
}
foreach( $query as $qkey => $qvalue )
{
$output .= $qkey.'='.$qvalue.'&';
if( $qkey == $key ) break;
}
return rtrim( $output, '&' );
}
Usage:
$input = 'https://www.xxxxx.com/test/path/index.php?c1=xxx&c2=xxx&key=xxx&c3=xxx&c4=xxx&f1=xxx';
$output = cut_url( $input, 'key' );
Output:
https://www.xxxxx.com/test/path/index.php?c1=xxx&c2=xxx&key=xxx
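As a variation on the same parse_str() idea, here is a hedged sketch that rebuilds the query with http_build_query() so the values are URL-encoded again. The function name cut_url_at_key is hypothetical, and the sketch assumes the URL has no #fragment and that parameter names are not repeated:

```php
<?php
// Hypothetical helper: keep query parameters up to and including $key.
// Assumes no #fragment and no duplicate parameter names.
function cut_url_at_key( string $url, string $key ): string
{
    $queryStart = strpos( $url, '?' );
    if ( $queryStart === false ) {
        return $url; // no query string, nothing to cut
    }
    $base = substr( $url, 0, $queryStart );
    parse_str( substr( $url, $queryStart + 1 ), $params );

    $kept = array();
    foreach ( $params as $name => $value ) {
        $kept[ $name ] = $value;
        // Cast because parse_str() turns numeric names into integers.
        if ( (string) $name === $key ) {
            break;
        }
    }
    // '&' is passed explicitly so the result does not depend on php.ini.
    return $base . '?' . http_build_query( $kept, '', '&' );
}

echo cut_url_at_key( 'http://xxxxx.com?c1=xxx&c2=xxx&c3=xxx&key=xxx&c4=xxx&f1=xxx', 'key' ), "\n";
echo cut_url_at_key( 'http://xxxxx.com?c1=xxx&key=xxx&c2=xxx&c3=xxx&c4=xxx&f1=xxx', 'key' ), "\n";
```

If the key is not present at all, this sketch returns the URL with all of its parameters, which may or may not be what you want.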
If the intention is to always ensure that the parameter key and its associated value appear at the end of the string, how about something like:
$tmp=array();$key='';
$parts=explode( '&', parse_url( $_SERVER['REQUEST_URI'], PHP_URL_QUERY ) );
foreach( $parts as $pair ) {
list( $param,$value )=explode( '=',$pair );
if( $param=='key' )$key=$pair;
else $tmp[]=$pair;
}
$query = implode( '&', array( implode( '&', $tmp ), $key ) );
echo $query;
or,
parse_str( $_SERVER['QUERY_STRING'], $pieces );
foreach( $pieces as $param => $value ){
if( $param=='key' ) $key=$param.'='.$value;
else $tmp[]=$param.'='.$value;
}
$query = implode( '&', array( implode( '&', $tmp ), $key ) );
Update
I'm puzzled that you were "not getting the good result"!
Consider the URL:
https://localhost/index.php?sort=0&dir=false&tax=23&cost=99&aardvark=creepy&key=banana&tree=large&ac=dc&limit=1000#569f945674935
The above would output:
sort=0&dir=false&tax=23&cost=99&aardvark=creepy&tree=large&ac=dc&limit=1000&key=banana
So key=banana is placed last with either method above.

Remove exact embed code (WordPress) with preg_replace

I'm using the following code to find the first YouTube/Vimeo embed in the post content:
function compare_by_offset( $a, $b ) {
return $a['order'] - $b['order'];
}
function first_video_url($post_id = null) {
if ( $post_id == null OR $post_id == '' ) $post_id = get_the_ID();
$post_array = get_post( $post_id );
$markup = $post_array->post_content;
$regexes = array(
'#(?:https?:)?//www\.youtube(?:\-nocookie)?\.com/(?:v|e|embed)/([A-Za-z0-9\-_]+)#', // Comprehensive search for both iFrame and old school embeds
'#(?:https?(?:a|vh?)?://)?(?:www\.)?youtube(?:\-nocookie)?\.com/watch\?.*v=([A-Za-z0-9\-_]+)#', // Any YouTube URL. After http(s) support a or v for Youtube Lyte and v or vh for Smart Youtube plugin
'#(?:https?(?:a|vh?)?://)?youtu\.be/([A-Za-z0-9\-_]+)#', // Any shortened youtu.be URL. After http(s) a or v for Youtube Lyte and v or vh for Smart Youtube plugin
'#<div class="lyte" id="([A-Za-z0-9\-_]+)"#', // YouTube Lyte
'#data-youtube-id="([A-Za-z0-9\-_]+)"#', // LazyYT.js
'#<object[^>]+>.+?http://vimeo\.com/moogaloop.swf\?clip_id=([A-Za-z0-9\-_]+)&.+?</object>#s', // Standard Vimeo embed code
'#(?:https?:)?//player\.vimeo\.com/video/([0-9]+)#', // Vimeo iframe player
'#\[vimeo id=([A-Za-z0-9\-_]+)]#', // JR_embed shortcode
'#\[vimeo clip_id="([A-Za-z0-9\-_]+)"[^>]*]#', // Another shortcode
'#\[vimeo video_id="([A-Za-z0-9\-_]+)"[^>]*]#', // Yet another shortcode
'#(?:https?://)?(?:www\.)?vimeo\.com/([0-9]+)#', // Vimeo URL
'#(?:https?://)?(?:www\.)?vimeo\.com/channels/(?:[A-Za-z0-9]+)/([0-9]+)#' // Channel URL
);
$provider_videos = array();
foreach ( $regexes as $regex ) {
if ( preg_match_all( $regex, $markup, $matches, PREG_OFFSET_CAPTURE ) ) {
$provider_videos = array_merge( $provider_videos, $matches[0] );
}
}
if ( empty( $provider_videos ) ) return;
foreach ( $provider_videos as $video ) {
$videos[] = array(
'url' => $video[0],
'order' => $video[1]
);
}
usort( $videos, 'compare_by_offset' );
$first_video_url = current(array_column($videos, 'url'));
if ( empty( $first_video_url ) ) return;
return $first_video_url;
}
Now that I have the link to the first video in the post, I want to remove it from the post content. And that's where I'm stuck. My attempt so far:
function remove_first_image ($content) {
$url = first_video_url();
$parsed = parse_url($url);
$video_id = $parsed['query'];
$embed_code = wp_oembed_get($url);
$pattern = 'a pattern for that embed which I fail to make';
$content = preg_replace($pattern, '', $content);
return $content;
}
add_filter('the_content', 'remove_first_image');
Thanks!
I guess one couldn't answer his own stupid question until he asks it. Here comes the answer:
function remove_first_image ($content) {
if ( is_single() && has_post_format('video') ) {
$url = first_video_url();
$embed_code = wp_oembed_get($url);
$content = str_replace($embed_code, '', $content);
}
return $content;
}
add_filter('the_content', 'remove_first_image');
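One caveat: str_replace() removes every occurrence of the search string, so if the same embed code appears twice in the content, both copies go. If only the first one should be removed, a sketch along these lines might help. The helper name remove_first_occurrence is hypothetical, and the sample markup is made up; it is plain PHP and does not depend on WordPress:

```php
<?php
// Hypothetical helper: remove only the first occurrence of $needle.
function remove_first_occurrence( string $haystack, string $needle ): string
{
    if ( $needle === '' ) {
        return $haystack; // nothing to remove
    }
    $pos = strpos( $haystack, $needle );
    if ( $pos === false ) {
        return $haystack; // needle not present
    }
    // Cut out exactly strlen($needle) bytes starting at the first match.
    return substr_replace( $haystack, '', $pos, strlen( $needle ) );
}

// Made-up content with the same embed markup twice.
$embed   = '<iframe src="x"></iframe>';
$content = '<p>a</p>' . $embed . '<p>b</p>' . $embed;

echo remove_first_occurrence( $content, $embed ), "\n";
```

Inside the filter, the call would take the place of the str_replace() line, assuming the embed markup in the content is byte-for-byte identical to what wp_oembed_get() returns.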

Using strpos, check if given URL has 'any one' of the strings in the array?

I have a variable $url (which I have no control over) whose value is a URL (as a string). For example:
$url = 'http://www.youtube.com/watch?v=rSnzy2dZtsE';
I have a list of hosts (example.com) that I'd like to check against the $url, and see if any one of them matches the host in the URL.
I am doing it like this:
<?php
function itsme_custom_oembed( $html, $url, $attr, $post_id ) {
// Supported video embeds
$hosts = array( 'blip.tv', 'money.cnn.com', 'dailymotion.com', 'flickr.com', 'hulu.com', 'kickstarter.com', 'vimeo.com', 'vine.co', 'youtube.com' );
foreach( $hosts as $host ) {
// check if it's a supported video embed
if( strpos( $url, $host ) === false )
return $html;
return '<div class="flex-video">'. $html .'</div>';
}
}
add_filter( 'embed_oembed_html', 'itsme_custom_oembed', 10, 4 );
?>
But it's not working (i.e. strpos( $url, $host ) is always returning false), and as I see it, the problem is with the foreach construct. Especially because this works:
<?php
function itsme_custom_oembed( $html, $url, $attr, $post_id ) {
// Supported video embeds
$host = 'youtube.com';
// check if it's a supported video embed
if( strpos( $url, $host ) === false )
return $html;
return '<div class="flex-video">'. $html .'</div>';
}
add_filter( 'embed_oembed_html', 'itsme_custom_oembed', 10, 4 );
?>
Clearly, foreach isn't meant for this purpose.
So, how am I supposed to check if given URL has any one of the strings in the array? (i.e. true if any one of the hosts in the list matches the host in the URL.)
The problem is that you are returning inside the loop on every path. Once you return from a function, the function stops. So you end up checking only the first value on the first pass through the loop, and the return stops the function from checking any of the remaining hosts.
To fix it, return from inside the loop only when a host matches, and move the other return outside the loop. The function then loops over each value in the array until it finds a match. If a match is found, the function exits (return). If no match is found, it hits the return after the loop.
function itsme_custom_oembed( $html, $url, $attr, $post_id ) {
// Supported video embeds
$hosts = array( 'blip.tv', 'money.cnn.com', 'dailymotion.com', 'flickr.com', 'hulu.com', 'kickstarter.com', 'vimeo.com', 'vine.co', 'youtube.com' );
//loop over all the hosts
foreach( $hosts as $host ) {
// check if it's a supported video embed
if( strpos( $url, $host ) !== false )
return '<div class="flex-video">'. $html .'</div>'; //it is supported, so return the html wrapped in a div
}
//no hosts matched, so return the original html unchanged
return $html;
}
I am not sure what you want to return but you can try to use this!
function itsme_custom_oembed( $html, $url, $attr, $post_id ) {
$hosts = array('blip.tv', 'money.cnn.com', 'dailymotion.com', 'flickr.com', 'hulu.com', 'kickstarter.com', 'vimeo.com', 'vine.co', 'youtube.com');
$success = false;
foreach ($hosts as $host) {
if (stripos($url, $host) !== false) {
$success = true;
break;
}
}
if ($success) {
// put your return when it DOES contain the host here
}
else {
// put your return when it DOES NOT contain the host here
}
}
(Based on Jonathan Kuhn's answer and suggestions.) This does it:
<?php
function itsme_custom_oembed( $html, $url, $attr, $post_ID ) {
// Supported video embeds
$hosts = array( 'blip.tv', 'money.cnn.com', 'dailymotion.com', 'flickr.com', 'hulu.com', 'kickstarter.com', 'vimeo.com', 'vine.co', 'youtube.com' );
foreach( $hosts as $host ) {
// check if it's a supported video embed
if( strpos( $url, $host ) !== false ) {
return '<div class="flex-video">'. $html .'</div>';
}
}
return $html;
}
add_filter( 'embed_oembed_html', 'itsme_custom_oembed', 10, 4 );
?>
Then an idea struck me; that I could do it in a much simpler way, like this:
<?php
function itsme_custom_oembed( $html, $url, $attr, $post_ID ) {
// Supported video embeds
$hosts = array( 'blip.tv', 'money.cnn.com', 'dailymotion.com', 'flickr.com', 'hulu.com', 'kickstarter.com', 'vimeo.com', 'vine.co', 'youtube.com' );
foreach( $hosts as $host ) {
// check if it's a supported video embed
if( strpos( $url, $host ) !== false ) {
$html = '<div class="flex-video">'. $html .'</div>';
break;
}
}
return $html;
}
add_filter( 'embed_oembed_html', 'itsme_custom_oembed', 10, 4 );
?>
Seems like a much better way to do what I am after.
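Since the goal is to match the host of the URL, it may be worth comparing against the parsed host rather than the whole string, because strpos() also matches when a host name only appears in the path or query string. A minimal sketch, assuming a hypothetical helper named url_host_matches and treating subdomains such as www. as matches:

```php
<?php
// Hypothetical helper: true if the URL's host equals one of $hosts
// or is a subdomain of one of them.
function url_host_matches( string $url, array $hosts ): bool
{
    $urlHost = parse_url( $url, PHP_URL_HOST );
    if ( ! is_string( $urlHost ) ) {
        return false; // malformed URL or no host component
    }
    $urlHost = strtolower( $urlHost );

    foreach ( $hosts as $host ) {
        $host   = strtolower( $host );
        $suffix = '.' . $host;
        if ( $urlHost === $host || substr( $urlHost, -strlen( $suffix ) ) === $suffix ) {
            return true;
        }
    }
    return false;
}

$hosts = array( 'vimeo.com', 'vine.co', 'youtube.com' );
var_dump( url_host_matches( 'http://www.youtube.com/watch?v=rSnzy2dZtsE', $hosts ) );
var_dump( url_host_matches( 'http://example.com/?ref=youtube.com', $hosts ) );
```

In the filter, the loop could then be replaced by a single if around the wrapping line. Whether the stricter match is wanted depends on what URLs the embed filter actually receives.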

I call a function in itself recursively, but it doesn't seem to be working

I am calling a crawl() function recursively that gets the content of a page and echoes it.
It works when I manually call it from outside the function, but when I recursively call it from inside the function I get no output from the recursive calls. The only output I get is from the one manual call.
Why isn't this working? What am I doing wrong?
<?php
error_reporting( E_ERROR );
define( "CRAWL_LIMIT_PER_DOMAIN", 50 );
$domains = array();
$urls = array();
$dom = new DOMDocument();
$matches = array();
function crawl( $domObject, $url, $matchList )
{
global $domains, $urls;
$parse = parse_url( $url );
$domains[ $parse['host'] ]++;
$urls[] = $url;
$content = file_get_contents( $url );
if ( $content === FALSE ){
return;
}
echo strip_tags($content) . "<br /><br /><br />";
array_push($matchList, 'http://www.the-irf.com/hello/hello5.html');
array_push($matchList, 'http://www.the-irf.com/hello/hello6.html');
array_push($matchList, 'http://www.the-irf.com/hello/end.html');
foreach( $matchList[0] as $crawled_url ) {
$parse = parse_url( $crawled_url );
if ( count( $domains[ $parse['host'] ] ) < CRAWL_LIMIT_PER_DOMAIN && !in_array( $crawled_url, $urls ) ) {
sleep( 1 );
crawl( $domObject, $crawled_url, $matchList );
}
}
}
crawl($dom, 'http://the-irf.com/hello/hello6.html', $matches);
?>
The problem lies in how you are using your foreach.
foreach($matchList[0] as $crawled_url){
Your array $matchList looks like: array('http://www.the-irf.com/hello/hello5.html', 'http://www.the-irf.com/hello/hello6.html', 'http://www.the-irf.com/hello/end.html').
foreach expects its first operand to be an array. $matchList[0] is not an array but the string 'http://www.the-irf.com/hello/hello5.html', so the loop body never runs.
In other words, if you change that line to
foreach($matchList as $crawled_url){
you will start calling the function recursively.
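A small standalone sketch of the difference, reusing the URLs from the question (the $seen array is only there for illustration). It also raises the error reporting level, on the assumption that you are debugging: passing a string to foreach only produces a warning, which error_reporting( E_ERROR ) hides.

```php
<?php
// Show warnings while debugging; E_ERROR alone would hide them.
error_reporting( E_ALL );

$matchList = array();
array_push( $matchList, 'http://www.the-irf.com/hello/hello5.html' );
array_push( $matchList, 'http://www.the-irf.com/hello/hello6.html' );
array_push( $matchList, 'http://www.the-irf.com/hello/end.html' );

// The first element is a string, not an array of URLs.
var_dump( is_array( $matchList[0] ) );

// Iterating the list itself visits every URL.
$seen = array();
foreach ( $matchList as $crawled_url ) {
    $seen[] = $crawled_url;
}
echo count( $seen ), " URLs\n";
```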

php spider script not working

I have been using the following script to create sitemaps for my clients' websites. The issue is that it does not work for every site. I have found that many, if not all, of the sites hosted on GoDaddy do not get spidered. If anyone can see an error in my script or knows what is causing the fault, I would greatly appreciate the help.
Thanks in advance
set_time_limit(0);
class spider_man
{
var $url;
var $limit;
var $cache;
var $crawled;
var $banned_ext;
var $domain;
function spider_man( $url, $banned_ext, $limit ){
$this->domain = $url;
$this->url = 'http://'.$url ;
$this->banned_ext = $banned_ext ;
$this->limit = $limit ;
if( !fopen( $this->url, "r") ) return false;
else $this->_spider($this->url);
}
function _spider( $url ){
$this->cache = @file_get_contents( urldecode( $url ) );
if( !$this->cache ) return false;
$this->crawled[] = urldecode( $url ) ;
preg_match_all( "#href=\"(https?://[&=a-zA-Z0-9-_./]+)\"#si", $this->cache, $links );
if ( $links ) :
foreach ( $links[1] as $hyperlink ){
if(strpos($hyperlink,$this->domain)===false){ break; }
else{
$this->limit--;
if( ! $this->limit ) return;
if( $this->is_valid_ext( trim( $hyperlink ) ) and !$this->is_crawled( $hyperlink ) ) :
$this->crawled[] = $hyperlink;
echo "Crawling $hyperlink<br />\n";
unset( $this->cache );
$this->_spider( $hyperlink );
endif;
}
}
endif;
}
function is_valid_ext( $url ){
foreach( $this->banned_ext as $ext ){
if( $ext == substr( $url, strlen($url) - strlen( $ext ) ) ) return false;
}
return true;
}
function is_crawled( $url ){
return in_array( $url, $this->crawled );
}
}
$banned_ext = array(".dtd",".css",".xml",".js",".gif",".jpg",".jpeg",".bmp",".ico",".rss",".pdf",".png",".psd",".aspx",".jsp",".srf",".cgi",".exe",".cfm");
$spider = new spider_man( 'domain.com', $banned_ext, 100 );
print_r( $spider->crawled );
When you access a site using fopen() or file_get_contents(), you don't send User-Agent, Referer, or other header information. It's blatantly obvious that this is an automated script.
You need to look at sending a context with your fopen() (check the docs and read the context section) or, better still, using cURL. This allows you to set the User-Agent and Referer headers to simulate a browser.
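A minimal sketch of the stream context approach. The helper name make_browser_like_context, the User-Agent string, and the referer value are all assumptions for illustration; whether a particular host accepts the request still depends on that host.

```php
<?php
// Hypothetical helper: build an HTTP stream context that sends
// User-Agent and Referer headers with fopen()/file_get_contents().
function make_browser_like_context( string $userAgent, string $referer )
{
    return stream_context_create( array(
        'http' => array(
            'method'     => 'GET',
            'user_agent' => $userAgent,               // sent as the User-Agent header
            'header'     => "Referer: {$referer}\r\n", // extra request headers
            'timeout'    => 10,                        // seconds
        ),
    ) );
}

// Example values only; substitute your own.
$context = make_browser_like_context(
    'Mozilla/5.0 (compatible; ExampleSpider/1.0)',
    'http://domain.com/'
);

// The context is the third argument of file_get_contents():
// $html = file_get_contents( 'http://domain.com/', false, $context );
```

In the spider class, the same context could be passed to both the fopen() check in the constructor and the file_get_contents() call in _spider().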
