Error PHP website crawler class using Simple HTML Dom

I'm getting the following error when trying to use Simple HTML Dom inside a web crawler class. The class seems to be working well but I get many errors in my error_log file.
[01-Apr-2016 23:16:51 UTC] PHP Warning: Invalid argument supplied for foreach() in /home/scrybs/public_html/order/uploader/php/simple_html_dom.php on line 357
If I check Simple HTML Dom, the error comes from here:
function innertext()
{
if (isset($this->_[HDOM_INFO_INNER])) return $this->_[HDOM_INFO_INNER];
if (isset($this->_[HDOM_INFO_TEXT])) return $this->dom->restore_noise($this->_[HDOM_INFO_TEXT]);
$ret = '';
foreach ($this->nodes as $n)
$ret .= $n->outertext();
return $ret;
}
The crawler class in question is as follows:
class crawler
{
protected $_url;
protected $_depth;
protected $_host;
protected $_useHttpAuth = false;
protected $_user;
protected $_pass;
protected $_seen = array();
protected $_filter = array();
public $contenu = array();
public function __construct($url, $depth = 5)
{
$this->_url = $url;
$this->_depth = $depth;
$parse = parse_url($url);
$this->_host = $parse['host'];
$this->html = new simple_html_dom();
}
protected function _processAnchors($content, $url, $depth)
{
//$dom = new DOMDocument('1.0');
//@$dom->loadHTML($content, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
//$dom->formatOutput = true;
$this->html->load($content);
$metatitle = $this->html->find('title',0)->innertext;
foreach($this->html->find("meta[name='description']") as $element){
$metadescription = $element->content;
}
foreach($this->html->find("meta[name='keywords']") as $element){
$metakeywords = $element->content;
}
if(!empty($metatitle)){
$this->contenu['meta_titles'][] = $metatitle;
}
if(!empty($metadescription)){
$this->contenu['meta_titles'][] = $metadescription;
}
if(!empty($metakeywords)){
$this->contenu['meta_titles'][] = $metakeywords;
}
// IMAGE ALTS
foreach($this->html->find('img') as $e){
if(!empty($e->alt)){
if(!$this->search_array($e->alt, $this->contenu)){
$this->contenu['alt_images'][] = $e->alt;
}
}
}
// LINKS
$links = $this->html->find('a');
foreach($links as $element){
// GET LINK TEXTS
$a = $element->innertext;
$a = preg_replace("/<a.*?>(.*?)<\/a>/", '\1', $a);
$a = preg_replace("/<p.*?>.*?<\/p>/", "{{P}}", $a);
$a = preg_replace("/<img.*?>/", "{{IMG}}", $a);
$a = preg_replace('#(<br */?>\s*)+#i', "{{BR}}", $a);
$a = preg_replace('#<button.*?>.*?</button>#i', '{{BUTTON}}', $a);
$a = preg_replace('#<time.*?>(.*?)</time>#i', '{{TIME}}', $a);
$a = preg_replace('#<span.*?>(.*?)</span>#i', '{{SPAN}}\1{{/SPAN}}', $a);
$a = preg_replace('#<strong.*?>(.*?)</strong>#i', '{{STRONG}}\1{{/STRONG}}', $a);
$a = preg_replace('#<b.*?>(.*?)</b>#i', '{{B}}\1{{/B}}', $a);
$a = preg_replace('#<i.*?>(.*?)</i>#i', '{{I}}\1{{/I}}', $a);
$a = preg_replace('#<small.*?>(.*?)</small>#i', '{{SMALL}}\1{{/SMALL}}', $a);
$a = preg_replace('#<abbr.*?>(.*?)</abbr>#i', '{{ABBR}}\1{{/ABBR}}', $a);
$a = trim(strip_tags($a));
$a = preg_replace('/\s+/', ' ', $a);
// CHECK IF NOT ONLY VARIABLES AND SPACES
$atmp = strip_tags($a);
$atmp = preg_replace("/{{.*?}}/", '', $atmp);
$atmp = preg_replace('/\s+/', '', $atmp);
if(!empty($a) && $a != '' && $atmp != ''){
if(!$this->search_array($a, $this->contenu)){
$this->contenu['link_texts'][] = $a;
}
}
// GET LINK TITLES
$title = $element->title;
if(!empty($title)){
if(!$this->search_array($title, $this->contenu)){
$this->contenu['link_titles'][] = $title;
}
}
$href = $element->href;
if (0 !== strpos($href, 'http')) {
$path = '/' . ltrim($href, '/');
if (extension_loaded('http')) {
$href = http_build_url($url, array('path' => $path));
} else {
$parts = parse_url($url);
$href = $parts['scheme'] . '://';
if (isset($parts['user']) && isset($parts['pass'])) {
$href .= $parts['user'] . ':' . $parts['pass'] . '@';
}
$href .= $parts['host'];
if (isset($parts['port'])) {
$href .= ':' . $parts['port'];
}
$href .= $path;
}
}
// Crawl only link that belongs to the start domain
$this->crawl_page($href, $depth - 1);
}
return $this->contenu;
}
protected function _getContent($url)
{
$handle = curl_init($url);
if ($this->_useHttpAuth) {
curl_setopt($handle, CURLOPT_HTTPAUTH, CURLAUTH_ANY);
curl_setopt($handle, CURLOPT_USERPWD, $this->_user . ":" . $this->_pass);
}
// follows 302 redirect, creates problem wiht authentication
// curl_setopt($handle, CURLOPT_FOLLOWLOCATION, TRUE);
// return the content
curl_setopt($handle, CURLOPT_RETURNTRANSFER, TRUE);
/* Get the HTML or whatever is linked in $url. */
$response = curl_exec($handle);
// response total time
$time = curl_getinfo($handle, CURLINFO_TOTAL_TIME);
/* Check for 404 (file not found). */
$httpCode = curl_getinfo($handle, CURLINFO_HTTP_CODE);
curl_close($handle);
return array($response, $httpCode, $time);
}
protected function _printResult($url, $depth, $httpcode, $time)
{
ob_end_flush();
$currentDepth = $this->_depth - $depth;
$count = count($this->_seen);
//echo "N::$count,CODE::$httpcode,TIME::$time,DEPTH::$currentDepth URL::$url <br>";
ob_start();
flush();
}
protected function isValid($url, $depth)
{
if (strpos($url, $this->_host) === false
|| $depth === 0
|| isset($this->_seen[$url])
|| preg_match("/#/i", $url)
|| preg_match("/.png/i", $url)
|| preg_match("/.jpg/i", $url)
|| preg_match("/.jpeg/i", $url)
|| preg_match("/.gif/i", $url)
|| preg_match("/.pdf/i", $url)
|| preg_match("/javascript/i", $url)
|| preg_match("/twitter.com/i", $url)
|| preg_match("/google.com/i", $url)
|| preg_match("/facebook.com/i", $url)
|| preg_match("/youtube.com/i", $url)
|| preg_match("/instagram.com/i", $url)
|| preg_match("/wp-login.php/i", $url)
) {
return false;
}
foreach ($this->_filter as $excludePath) {
if (strpos($url, $excludePath) !== false) {
return false;
}
}
return true;
}
public function search_array($needle, $haystack) {
if(in_array($needle, $haystack)) {
return true;
}
foreach($haystack as $element) {
if(is_array($element) && $this->search_array($needle, $element))
return true;
}
return false;
}
public function crawl_page($url, $depth)
{
if (!$this->isValid($url, $depth)) {
return;
}
// add to the seen URL
$this->_seen[$url] = true;
// get Content and Return Code
list($content, $httpcode, $time) = $this->_getContent($url);
// print Result for current Page
//$this->_printResult($url, $depth, $httpcode, $time);
// process subPages
$this->_processAnchors($content, $url, $depth, $contenu = array());
}
public function addFilterPath($path)
{
$this->_filter[] = $path;
}
public function run()
{
$this->crawl_page($this->_url, $this->_depth);
}
}
The error seems to come from this line, which reads the innertext property:
// GET LINK TEXTS
$a = $element->innertext;
I don't get any error when I use:
$a = $element->plaintext;
But that is not ideal, as I would like to keep the HTML tags. I don't get any error when I use Simple HTML Dom outside the class, so does it have something to do with the fact that Simple HTML Dom is used inside a class? Does anybody have an idea?
Thanks for your help!

I have found the bug.
In my (limited) tests, the problem happens when you set depth > 1, that is, judging from your code, when you load more than one page URL. One of the countless Simple HTML DOM problems is that the ->load() method doesn't work correctly on multiple loads.
After re-instantiating the html object, the script seems to work:
protected function _processAnchors( $content, $url, $depth )
{
    $this->html = new simple_html_dom(); # <-----
    $this->html->load( $content );
I also tested $this->html = str_get_html($content); but it works only on a limited number of sites.
Additional note: in HTML the <title> tag is mandatory, but not all sites have well-formed HTML. Consider checking that the <title> tag (and every other tag you read) exists, to avoid additional errors.
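The existence check suggested in the note above can be sketched as a small guard. This is a minimal sketch, assuming that find('title', 0) returns null when nothing matches; the helper name inner_text_or_empty is hypothetical and is not part of Simple HTML DOM.

```php
<?php
// Hypothetical helper (not part of Simple HTML DOM): returns the node's
// inner HTML, or an empty string when the lookup found no node.
function inner_text_or_empty($node)
{
    return $node !== null ? (string) $node->innertext : '';
}

// Assumed usage inside _processAnchors(), after re-instantiating the parser:
//   $this->html = new simple_html_dom();
//   $this->html->load($content);
//   $metatitle = inner_text_or_empty($this->html->find('title', 0));
```

With this guard, a page without a <title> tag yields an empty string instead of a property access on null, and the existing !empty($metatitle) check skips it.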

Related

How to fix "user is invalid" error, when adding user to a group with redmine API REST

I would like to assign a user to a group with the REST API, but it doesn't work.
I use the POST /groups/:id/users.:format syntax (see Rest Groups).
The user with this id exists in Redmine, and so does the group.
In the Redmine log I can see:
Processing by GroupsController#add_users as XML
Parameters: {"group"=>{"user_id"=>"34"}, "key"=>"81aa228c55ac5cfe4264a566ef67ac27702da8eb", "id"=>"5"}
Current user: admin (id=1)
Rendering common/error_messages.api.rsb
Rendered common/error_messages.api.rsb (0.1ms)
Completed 422 Unprocessable Entity in 4ms (Views: 0.4ms | ActiveRecord: 1.3ms)
And in the API's response:
Code Error :422
Message : User is invalid
In request body : id of user
I use ActiveResource for the REST API.
$method = 'users';
$options = array('user_id' => $userId); // $userId: id of the user to add
/**
* Posts to a specified custom method on the current object via:
*
* POST /collection/id/method.xml
*/
function post ($method, $options = array (), $start_tag = false) {
$req = $this->site . $this->element_name_plural;
if ($this->_data['id']) {
$req .= '/' . $this->_data['id'];
}
$req .= '/' . $method . '.xml';
return $this->_send_and_receive ($req, 'POST', $options, $start_tag);
}
And this function sends the request and parses the response:
/**
* Build the request, call _fetch() and parse the results.
*/
function _send_and_receive ($url, $method, $data = array (), $start_tag = false) {
$params = '';
$el = $start_tag ? $start_tag : $this->element_name; // Singular this time
if ($this->request_format == 'url') {
foreach ($data as $k => $v) {
if ($k != 'id' && $k != 'created-at' && $k != 'updated-at') {
$params .= '&' . $el . '[' . str_replace ('-', '_', $k) . ']=' . rawurlencode ($v);
}
}
$params = substr ($params, 1);
} elseif ($this->request_format == 'xml') {
$params = '<?xml version="1.0" encoding="UTF-8"?><' . $el . ">\n";
foreach ($data as $k => $v) {
if ($k != 'id' && $k != 'created-at' && $k != 'updated-at') {
$params .= $this->_build_xml ($k, $v);
}
}
$params .= '</' . $el . '>';
}
if ($this->extra_params !== false) {
if(strpos($url, '?'))
{
$url = $url .'&'.$this->extra_params;
}
else
{
$url = $url .'?'.$this->extra_params;
}
}
$this->request_body = $params;
$this->request_uri = $url;
$this->request_method = $method;
$res = $this->_fetch ($url, $method, $params);
if ($res === false) {
return $this;
}
// Keep splitting off any top headers until we get to the (XML) body:
while (strpos($res, "HTTP/") === 0) {
list ($headers, $res) = explode ("\r\n\r\n", $res, 2);
$this->response_headers = $headers;
$this->response_body = $res;
if (preg_match ('/HTTP\/[0-9]\.[0-9] ([0-9]+)/', $headers, $regs)) {
$this->response_code = $regs[1];
} else {
$this->response_code = false;
}
if (! $res) {
return $this;
} elseif ($res == ' ') {
$this->error = 'Empty reply';
return $this;
}
}
// parse XML response
$xml = new SimpleXMLElement ($res);
// normalize xml element name in case rails ressource contains an underscore
if (str_replace ('-', '_', $xml->getName ()) == $this->element_name_plural) {
// multiple
$res = array ();
$cls = get_class ($this);
foreach ($xml->children () as $child) {
$obj = new $cls;
foreach ((array) $child as $k => $v) {
$k = str_replace ('-', '_', $k);
if (isset ($v['nil']) && $v['nil'] == 'true') {
continue;
} else {
$obj->_data[$k] = $v;
}
}
$res[] = $obj;
}
return $res;
} elseif ($xml->getName () == 'errors') {
// parse error message
$this->error = $xml->error;
$this->errno = $this->response_code;
return false;
}
foreach ((array) $xml as $k => $v) {
$k = str_replace ('-', '_', $k);
if (isset ($v['nil']) && $v['nil'] == 'true') {
continue;
} else {
$this->_data[$k] = $v;
}
}
return $this;
}
Thank you
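The log above shows user_id nested under group, which matches how _send_and_receive() wraps every option in the element name. The sketch below reproduces that wrapping next to a flat alternative. It is only an illustration: build_nested_params and build_flat_params are hypothetical names, and the idea that Redmine expects a top-level user_id for this endpoint is an assumption to verify against the Redmine REST documentation.

```php
<?php
// Mirrors the 'url' branch of _send_and_receive(): each option is wrapped
// in the element name, e.g. group[user_id]=34.
function build_nested_params(string $el, array $data): string
{
    $params = '';
    foreach ($data as $k => $v) {
        if ($k != 'id' && $k != 'created-at' && $k != 'updated-at') {
            $params .= '&' . $el . '[' . str_replace('-', '_', $k) . ']=' . rawurlencode($v);
        }
    }
    return substr($params, 1);
}

// Hypothetical flat alternative: sends user_id at the top level.
function build_flat_params(array $data): string
{
    return http_build_query($data);
}

echo build_nested_params('group', array('user_id' => '34')), "\n";
echo build_flat_params(array('user_id' => '34')), "\n";
```

Note that passing $start_tag does not remove the wrapper in this class, because $el falls back to the element name whenever $start_tag is false.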

Removing an entire PHP function based on its suffix

I want to remove all functions ending with _example from my code. I am processing the code using token_get_all. The code I currently have, shown below, changes the opening tags and strips the comments out.
foreach ($files as $file) {
$content = file_get_contents($file);
$tokens = token_get_all($content);
$output = '';
foreach($tokens as $token) {
if (is_array($token)) {
list($index, $code, $line) = $token;
switch($index) {
case T_OPEN_TAG_WITH_ECHO:
$output .= '<?php echo';
break;
case T_COMMENT:
case T_DOC_COMMENT:
$output .= '';
break;
case T_FUNCTION:
// ???
default:
$output .= $code;
break;
}
} else {
$output .= $token;
}
}
file_put_contents($file, $output);
}
I just can't figure out how I can modify it to strip entire functions based on their names.

OK, I wrote new code for your problem:
First, it finds every function and its declaration in your source code.
Second, it checks whether the function name ends with "_example" and removes its code.
$source = file_get_contents($filename); // Obtain source from filename $filename
$tokens = token_get_all($source); // Get php tokens
// Init variables
$in_fnc = false;
$fnc_name = null;
$functions = array();
// Loop $tokens to locate functions
foreach ($tokens as $token){
$t_array = is_array($token);
if ($t_array){
list($t, $c) = $token;
if (!$in_fnc && $t == T_FUNCTION){ // "function": we register one function start
$in_fnc = true;
$fnc_name = null;
$nb_opened = $nb_closed = 0;
continue;
}
else if ($in_fnc && null === $fnc_name && $t == T_STRING){ // we check and store the name of function if exists
if (preg_match('`function\s+'.preg_quote($c).'\s*\(`sU', $source)){ // "function function_name ("
$fnc_name = $c;
continue;
}
}
}
else {
$c = $token; // single char: content is $token
}
if ($in_fnc && null !== $fnc_name){ // we are in declaration of function
$nb_closed += substr_count($c, '}'); // we count number of } to extract later complete code of this function
if (!$t_array){
$nb_opened += substr_count($c, '{') - substr_count($c, '}'); // we count number of { not closed (num "{" - num "}")
if ($nb_closed > 0 && $nb_opened == 0){ // once "}" parsed and all "{" are closed by "}"
if (preg_match('`function\s+'.preg_quote($fnc_name).'\s*\((.*\}){'.$nb_closed.'}`sU', $source, $m)){
$functions[$fnc_name] = $m[0]; // we store entire code of this function in $functions
}
$in_fnc = false; // we declare that function is finished
}
}
}
}
// Ok, now $functions contains all functions found in $filename
$source_changed = false; // Prevents re-write $filename with the original content
foreach ($functions as $f_name => $f_code){
if (preg_match('`_example$`', $f_name)){
$source = str_replace($f_code, '', $source); // remove the function if its name ends with "_example"
$source_changed = true;
}
}
if ($source_changed){
file_put_contents($filename, $source); // replace $filename file contents
}
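The same goal can also be approached with tokens alone, without matching the source against regular expressions. The following is a minimal sketch rather than the code above rewritten: strip_functions_by_suffix is a hypothetical name, by-reference declarations (function &name) are not handled, and any doc comment preceding a removed function is left in place.

```php
<?php
// Hypothetical helper: returns $source without the named functions whose
// name ends in $suffix, using only the token stream.
function strip_functions_by_suffix(string $source, string $suffix): string
{
    $tokens = token_get_all($source);
    $count  = count($tokens);
    $output = '';

    for ($i = 0; $i < $count; $i++) {
        $token = $tokens[$i];

        if (is_array($token) && $token[0] === T_FUNCTION) {
            // Look past whitespace for the function name (closures have none).
            $j = $i + 1;
            while ($j < $count && is_array($tokens[$j]) && $tokens[$j][0] === T_WHITESPACE) {
                $j++;
            }
            $named = $j < $count && is_array($tokens[$j]) && $tokens[$j][0] === T_STRING;

            if ($named && substr($tokens[$j][1], -strlen($suffix)) === $suffix) {
                // Skip everything up to the brace that closes the function body.
                $depth = 0;
                for ($k = $j + 1; $k < $count; $k++) {
                    $t = $tokens[$k];
                    $opensInString = is_array($t)
                        && in_array($t[0], array(T_CURLY_OPEN, T_DOLLAR_OPEN_CURLY_BRACES), true);
                    if ($t === '{' || $opensInString) {
                        $depth++;            // "{$var}" in strings also closes with "}"
                    } elseif ($t === '}') {
                        if (--$depth === 0) {
                            break;           // end of the function body
                        }
                    } elseif ($t === ';' && $depth === 0) {
                        break;               // declaration without a body
                    }
                }
                $i = $k;                     // resume after the removed function
                continue;
            }
        }

        $output .= is_array($token) ? $token[1] : $token;
    }

    return $output;
}
```

Counting the T_CURLY_OPEN and T_DOLLAR_OPEN_CURLY_BRACES tokens matters because their closing brace arrives as a plain "}" token, which would otherwise unbalance the depth counter.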

The PHP crawler I am using has a memory leak, what is causing this?

I am using a PHP crawler that has a memory leak. It is good for the first ~3125 links, then it runs out of memory. I tried getting rid of the MySQL insert, but that did not change anything. Can someone help me diagnose this problem? Thank you so much.
<?php
include $_SERVER['DOCUMENT_ROOT'] . '/config.php';
ini_set('max_execution_time', 0);
// USAGE
$startURL = $your_url;
$depth = 9999;
$crawler = new crawler($startURL, $depth);
// Exclude path with the following structure to be processed
$crawler->addFilterPath('customer/account/login/referer');
$crawler->run();
class crawler
{
protected $_url;
protected $_depth;
protected $_host;
protected $_seen = array();
protected $_filter = array();
public function __construct($url, $depth = 5)
{
$this->_url = $url;
$this->_depth = $depth;
$parse = parse_url($url);
$this->_host = $parse['host'];
}
protected function _processAnchors($content, $url, $depth)
{
$dom = new DOMDocument('1.0');
@$dom->loadHTML($content);
$anchors = $dom->getElementsByTagName('a');
foreach ($anchors as $element) {
$href = $element->getAttribute('href');
if (0 !== strpos($href, 'http')) {
$path = '/' . ltrim($href, '/');
if (extension_loaded('http')) {
$href = http_build_url($url, array('path' => $path));
} else {
$parts = parse_url($url);
$href = $parts['scheme'] . '://';
if (isset($parts['user']) && isset($parts['pass'])) {
$href .= $parts['user'] . ':' . $parts['pass'] . '@';
}
$href .= $parts['host'];
if (isset($parts['port'])) {
$href .= ':' . $parts['port'];
}
$href .= $path;
}
}
// Crawl only link that belongs to the start domain
$this->crawl_page($href, $depth - 1);
}
}
protected function _getContent($url)
{
$handle = curl_init($url);
curl_setopt($handle, CURLOPT_RETURNTRANSFER, TRUE);
$response = curl_exec($handle);
curl_close($handle);
return array($response);
}
protected function _printResult($url, $depth)
{
ob_end_flush();
$currentDepth = $this->_depth - $depth;
$count = count($this->_seen);
echo "$url <br>";
include $_SERVER['DOCUMENT_ROOT'] . '/config.php';
$databaseconnect = new PDO("mysql:dbname=DB_NAME;host=$mysqlhost;charset=utf8", $mysqlusername, $mysqlpassword);
$statement = $databaseconnect->prepare("INSERT INTO data(url,name) VALUES(:url,:name)");
$statement->execute(array(':url' => $url,
':name' => $url));
ob_start();
flush();
}
protected function isValid($url, $depth)
{
if (strpos($url, $this->_host) === false
|| $depth === 0
|| isset($this->_seen[$url])
) {
return false;
}
foreach ($this->_filter as $excludePath) {
if (strpos($url, $excludePath) !== false) {
return false;
}
}
return true;
}
public function crawl_page($url, $depth)
{
if (!$this->isValid($url, $depth)) {
return;
}
// add to the seen URL
$this->_seen[$url] = true;
// get Content and Return Code
list($content) = $this->_getContent($url);
// print Result for current Page
$this->_printResult($url, $depth);
// process subPages
$this->_processAnchors($content, $url, $depth);
}
public function addFilterPath($path)
{
$this->_filter[] = $path;
}
public function run()
{
$this->crawl_page($this->_url
, $this->_depth);
}
}
?>

I'm not sure if this classifies as a memory leak exactly. You are essentially using recursion without a terminating case. Before the crawl_page() method finishes it calls _processAnchors(), which in turn may call crawl_page() again if it finds any links (very likely). Every recursive call eats up more memory because the originating crawl_page() call (and most thereafter) can't be removed from the call stack until all of its recursive calls terminate.
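One way to avoid the growing call stack described above is to replace the recursion with an explicit queue. This is a minimal sketch, assuming a hypothetical $fetchLinks callable that stands in for _getContent() plus the anchor extraction; it is not the original crawler's code.

```php
<?php
// Breadth-first crawl with an explicit queue instead of recursive calls.
// $fetchLinks is a hypothetical callable: it takes a URL and returns the
// list of URLs found on that page.
function crawl_iteratively(string $startUrl, int $maxDepth, callable $fetchLinks): array
{
    $seen  = array($startUrl => true);
    $queue = new SplQueue();
    $queue->enqueue(array($startUrl, 0));

    while (!$queue->isEmpty()) {
        list($url, $depth) = $queue->dequeue();
        if ($depth >= $maxDepth) {
            continue; // depth limit reached: do not expand this page
        }
        foreach ($fetchLinks($url) as $link) {
            if (!isset($seen[$link])) {
                $seen[$link] = true;
                $queue->enqueue(array($link, $depth + 1));
            }
        }
    }

    return array_keys($seen);
}
```

Each page is fully processed before the next one is dequeued, so no stack frames (and no parsed DOM objects) stay alive while deeper pages are fetched. The $seen array still grows with the number of distinct URLs.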

Change php script with variables from working in http to working in shell

I use a script from here to generate my sitemaps.
I can call it from the browser with http://www.example.com/sitemap.php?update=pages and it's working fine.
I need to call it as a shell script so that I can automate it with the Windows Task Scheduler. But the script needs to be changed so that it still receives the variable ?update=pages, and I haven't managed to change it correctly.
Could anybody help me so that I can execute the script from the command line like this:
...\php C:\path\to\script\sitemap.php update=pages
It would also be fine for me to hardcode the variables into the script since I won't change them anyway.
define("BASE_URL", "http://www.example.com/");
define ('BASE_URI', $_SERVER['DOCUMENT_ROOT'] . '/');
class Sitemap {
private $compress;
private $page = 'index';
private $index = 1;
private $count = 1;
private $urls = array();
public function __construct ($compress=true) {
ini_set('memory_limit', '75M'); // 50M required per tests
$this->compress = ($compress) ? '.gz' : '';
}
public function page ($name) {
$this->save();
$this->page = $name;
$this->index = 1;
}
public function url ($url, $lastmod='', $changefreq='', $priority='') {
$url = htmlspecialchars(BASE_URL . 'xx' . $url);
$lastmod = (!empty($lastmod)) ? date('Y-m-d', strtotime($lastmod)) : false;
$changefreq = (!empty($changefreq) && in_array(strtolower($changefreq), array('always', 'hourly', 'daily', 'weekly', 'monthly', 'yearly', 'never'))) ? strtolower($changefreq) : false;
$priority = (!empty($priority) && is_numeric($priority) && abs($priority) <= 1) ? round(abs($priority), 1) : false;
if (!$lastmod && !$changefreq && !$priority) {
$this->urls[] = $url;
} else {
$url = array('loc'=>$url);
if ($lastmod !== false) $url['lastmod'] = $lastmod;
if ($changefreq !== false) $url['changefreq'] = $changefreq;
if ($priority !== false) $url['priority'] = ($priority < 1) ? $priority : '1.0';
$this->urls[] = $url;
}
if ($this->count == 50000) {
$this->save();
} else {
$this->count++;
}
}
public function close() {
$this->save();
}
private function save () {
if (empty($this->urls)) return;
$file = "sitemaps/xx-sitemap-{$this->page}-{$this->index}.xml{$this->compress}";
$xml = '<?xml version="1.0" encoding="UTF-8"?>' . "\n";
$xml .= '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">' . "\n";
foreach ($this->urls as $url) {
$xml .= ' <url>' . "\n";
if (is_array($url)) {
foreach ($url as $key => $value) $xml .= " <{$key}>{$value}</{$key}>\n";
} else {
$xml .= " <loc>{$url}</loc>\n";
}
$xml .= ' </url>' . "\n";
}
$xml .= '</urlset>' . "\n";
$this->urls = array();
if (!empty($this->compress)) $xml = gzencode($xml, 9);
$fp = fopen(BASE_URI . $file, 'wb');
fwrite($fp, $xml);
fclose($fp);
$this->index++;
$this->count = 1;
$num = $this->index; // should have already been incremented
while (file_exists(BASE_URI . "xxb-sitemap-{$this->page}-{$num}.xml{$this->compress}")) {
unlink(BASE_URI . "xxc-sitemap-{$this->page}-{$num}.xml{$this->compress}");
$num++;
}
$this->index($file);
}
private function index ($file) {
$sitemaps = array();
$index = "sitemaps/xx-sitemap-index.xml{$this->compress}";
if (file_exists(BASE_URI . $index)) {
$xml = (!empty($this->compress)) ? gzfile(BASE_URI . $index) : file(BASE_URI . $index);
$tags = $this->xml_tag(implode('', $xml), array('sitemap'));
foreach ($tags as $xml) {
$loc = str_replace(BASE_URL, '', $this->xml_tag($xml, 'loc'));
$lastmod = $this->xml_tag($xml, 'lastmod');
$lastmod = ($lastmod) ? date('Y-m-d', strtotime($lastmod)) : date('Y-m-d');
if (file_exists(BASE_URI . $loc)) $sitemaps[$loc] = $lastmod;
}
}
$sitemaps[$file] = date('Y-m-d');
$xml = '<?xml version="1.0" encoding="UTF-8"?>' . "\n";
$xml .= '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">' . "\n";
foreach ($sitemaps as $loc => $lastmod) {
$xml .= ' <sitemap>' . "\n";
$xml .= ' <loc>' . BASE_URL . $loc . '</loc>' . "\n";
$xml .= ' <lastmod>' . $lastmod . '</lastmod>' . "\n";
$xml .= ' </sitemap>' . "\n";
}
$xml .= '</sitemapindex>' . "\n";
if (!empty($this->compress)) $xml = gzencode($xml, 9);
$fp = fopen(BASE_URI . $index, 'wb');
fwrite($fp, $xml);
fclose($fp);
}
private function xml_tag ($xml, $tag, &$end='') {
if (is_array($tag)) {
$tags = array();
while ($value = $this->xml_tag($xml, $tag[0], $end)) {
$tags[] = $value;
$xml = substr($xml, $end);
}
return $tags;
}
$pos = strpos($xml, "<{$tag}>");
if ($pos === false) return false;
$start = strpos($xml, '>', $pos) + 1;
$length = strpos($xml, "</{$tag}>", $start) - $start;
$end = strpos($xml, '>', $start + $length) + 1;
return ($end !== false) ? substr($xml, $start, $length) : false;
}
public function __destruct () {
$this->save();
}
}
// start part 2
$sitemap = new Sitemap;
if (get('pages')) {
$sitemap->page('pages');
$result = mysql_query("SELECT uri FROM app_uri");
while (list($url, $created) = mysql_fetch_row($result)) {
$sitemap->url($url, $created, 'monthly');
}
}
$sitemap->close();
unset ($sitemap);
function get ($name) {
return (isset($_GET['update']) && strpos($_GET['update'], $name) !== false) ? true : false;
}
?>

You could install wget (it's available for Windows as well) and then call the URL via localhost in the Task Scheduler script:
wget.exe "http://localhost/path/to/script.php?pages=test"
This way you wouldn't have to rewrite the PHP script.
Otherwise, if the script is meant for shell usage only, then pass the variables via the command line:
php yourscript.php variable1 variable2 ...
In the PHP script you can then access those variables using the $argv variable:
$variable1 = $argv[1];
$variable2 = $argv[2];
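For the sitemap script in the question, the $argv approach can be sketched so that the existing get() function keeps working in both modes. This is a minimal sketch: request_params is a hypothetical helper name, and it assumes arguments are passed in key=value form, as in update=pages.

```php
<?php
// Hypothetical helper: returns the request parameters for either mode.
// On the command line, each "key=value" argument is parsed like a query string.
function request_params(array $argv, array $get, string $sapi): array
{
    if ($sapi !== 'cli') {
        return $get; // web request: keep the real query parameters
    }
    $params = array();
    foreach (array_slice($argv, 1) as $arg) { // $argv[0] is the script name
        parse_str($arg, $pair);
        $params = array_merge($params, $pair);
    }
    return $params;
}

// Assumed usage at the top of sitemap.php, before get('pages') is called:
//   $_GET = request_params(isset($argv) ? $argv : array(), $_GET, PHP_SAPI);
```

Note that $_SERVER['DOCUMENT_ROOT'] is typically empty on the command line, so BASE_URI would probably need a hardcoded path when the script runs from the Task Scheduler.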

Have a look at:
How to pass GET variables to php file with Shell?
which already answers the same question :).

How to Add background or change text

My goal is to create pages with information needed about the movie and with links that redirect to a video host where you will watch it. I found an IMDB php plugin and I am trying to insert in jquery mobile so mobile users will easily find the movies and stream them on their iDevices or any device that supports MP4 formats. Something close to http://freeflix.technoinsidr.com/watch.php?m=tt1190080
I've made this: http://ivids.tk/test.php?m=tt0068646 How can I remove TITLE, YEAR (everything that's in BOLD) and put TITLE with the YEAR on the same line, if it's possible to use jQuery Mobile as the design? Is it even possible?
<?php
class Imdb
{
function getMovieInfo($title)
{
$imdbId = $this->getIMDbIdFromGoogle(trim($title));
if($imdbId === NULL){
$arr = array();
$arr['error'] = "No Title found in Search Results!";
return $arr;
}
return $this->getMovieInfoById($imdbId);
}
function getMovieInfoById($imdbId)
{
$arr = array();
$imdbUrl = "http://www.imdb.com/title/" . trim($imdbId) . "/";
$html = $this->geturl($imdbUrl);
if(stripos($html, "<meta name=\"application-name\" content=\"IMDb\" />") !== false){
$arr = $this->scrapMovieInfo($html);
$arr['imdb_url'] = $imdbUrl;
} else {
$arr['error'] = "No Title found on IMDb!";
}
return $arr;
}
function getIMDbIdFromGoogle($title){
$url = "http://www.google.com/search?q=imdb+" . rawurlencode($title);
$html = $this->geturl($url);
$ids = $this->match_all('/<a href="http:\/\/www.imdb.com\/title\/(tt\d+).*?".*?>.*?<\/a>/ms', $html, 1);
if (!isset($ids[0])) //if Google fails
return $this->getIMDbIdFromBing($title); //search using Bing
else
return $ids[0]; //return first IMDb result
}
function getIMDbIdFromBing($title){
$url = "http://www.bing.com/search?q=imdb+" . rawurlencode($title);
$html = $this->geturl($url);
$ids = $this->match_all('/<a href="http:\/\/www.imdb.com\/title\/(tt\d+).*?".*?>.*?<\/a>/ms', $html, 1);
if (!isset($ids[0]))
return NULL;
else
return $ids[0]; //return first IMDb result
}
// Scan movie meta data from IMDb page
function scrapMovieInfo($html)
{
$arr = array();
$arr['title'] = trim($this->match('/<title>(IMDb \- )*(.*?) \(.*?<\/title>/ms', $html, 2));
$arr['year'] = trim($this->match('/<title>.*?\(.*?(\d{4}).*?\).*?<\/title>/ms', $html, 1));
$arr['rating'] = $this->match('/ratingValue">(\d.\d)</ms', $html, 1);
$arr['genres'] = array();
foreach($this->match_all('/<a.*?>(.*?)<\/a>/ms', $this->match('/Genre.?:(.*?)(<\/div>|See more)/ms', $html, 1), 1) as $m)
array_push($arr['genres'], $m);
//Get extra inforation on Release Dates and AKA Titles
if($arr['title_id'] != ""){
$releaseinfoHtml = $this->geturl("http://www.imdb.com/title/" . $arr['title_id'] . "/releaseinfo");
$arr['also_known_as'] = $this->getAkaTitles($releaseinfoHtml, $usa_title);
$arr['usa_title'] = $usa_title;
$arr['release_date'] = $this->match('/Release Date:<\/h4>.*?([0-9][0-9]? (January|February|March|April|May|June|July|August|September|October|November|December) (19|20)[0-9][0-9]).*?(\(|<span)/ms', $html, 1);
$arr['release_dates'] = $this->getReleaseDates($releaseinfoHtml);
}
$arr['plot'] = trim(strip_tags($this->match('/<p itemprop="description">(.*?)(<\/p>|<a)/ms', $html, 1)));
$arr['poster'] = $this->match('/img_primary">.*?<img src="(.*?)".*?<\/td>/ms', $html, 1);
$arr['poster_small'] = "";
if ($arr['poster'] != '' && strrpos($arr['poster'], "nopicture") === false && strrpos($arr['poster'], "ad.doubleclick") === false) { //Get large and small posters
$arr['poster_small'] = preg_replace('/_V1\..*?.jpg/ms', "_V1._SY150.jpg", $arr['poster']);
} else {
$arr['poster'] = "";
}
$arr['runtime'] = trim($this->match('/Runtime:<\/h4>.*?(\d+) min.*?<\/div>/ms', $html, 1));
if($arr['runtime'] == '') $arr['runtime'] = trim($this->match('/infobar.*?(\d+) min.*?<\/div>/ms', $html, 1));
$arr['storyline'] = trim(strip_tags($this->match('/Storyline<\/h2>(.*?)(<em|<\/p>|<span)/ms', $html, 1)));
$arr['language'] = array();
foreach($this->match_all('/<a.*?>(.*?)<\/a>/ms', $this->match('/Language.?:(.*?)(<\/div>|>.?and )/ms', $html, 1), 1) as $m)
array_push($arr['language'], trim($m));
$arr['country'] = array();
foreach($this->match_all('/<a.*?>(.*?)<\/a>/ms', $this->match('/Country:(.*?)(<\/div>|>.?and )/ms', $html, 1), 1) as $c)
array_push($arr['country'], $c);
if($arr['title_id'] != "") $arr['media_images'] = $this->getMediaImages($arr['title_id']);
return $arr;
}
// Scan all Release Dates
function getReleaseDates($html){
$releaseDates = array();
foreach($this->match_all('/<tr>(.*?)<\/tr>/ms', $this->match('/Date<\/th><\/tr>(.*?)<\/table>/ms', $html, 1), 1) as $r)
{
$country = trim(strip_tags($this->match('/<td><b>(.*?)<\/b><\/td>/ms', $r, 1)));
$date = trim(strip_tags($this->match('/<td align="right">(.*?)<\/td>/ms', $r, 1)));
array_push($releaseDates, $country . " = " . $date);
}
return $releaseDates;
}
// Scan all AKA Titles
function getAkaTitles($html, &$usa_title){
$akaTitles = array();
foreach($this->match_all('/<tr>(.*?)<\/tr>/msi', $this->match('/Also Known As(.*?)<\/table>/ms', $html, 1), 1) as $m)
{
$akaTitleMatch = $this->match_all('/<td>(.*?)<\/td>/ms', $m, 1);
$akaTitle = trim($akaTitleMatch[0]);
$akaCountry = trim($akaTitleMatch[1]);
array_push($akaTitles, $akaTitle . " = " . $akaCountry);
if ($akaCountry != '' && strrpos(strtolower($akaCountry), "usa") !== false) $usa_title = $akaTitle;
}
return $akaTitles;
}
// Collect all Media Images
function getMediaImages($titleId){
$url = "http://www.imdb.com/title/" . $titleId . "/mediaindex";
$html = $this->geturl($url);
$media = array();
$media = array_merge($media, $this->scanMediaImages($html));
foreach($this->match_all('/<a href="\?page=(.*?)">/ms', $this->match('/<span style="padding: 0 1em;">(.*?)<\/span>/ms', $html, 1), 1) as $p)
{
$html = $this->geturl($url . "?page=" . $p);
$media = array_merge($media, $this->scanMediaImages($html));
}
return $media;
}
// Scan all media images
function scanMediaImages($html){
$pics = array();
foreach($this->match_all('/src="(.*?)"/ms', $this->match('/<div class="thumb_list" style="font-size: 0px;">(.*?)<\/div>/ms', $html, 1), 1) as $i)
{
array_push($pics, preg_replace('/_V1\..*?.jpg/ms', "_V1._SY0.jpg", $i));
}
return $pics;
}
// ************************[ Extra Functions ]******************************
function geturl($url)
{
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5);
$ip=rand(0,255).'.'.rand(0,255).'.'.rand(0,255).'.'.rand(0,255);
curl_setopt($ch, CURLOPT_HTTPHEADER, array("REMOTE_ADDR: $ip", "HTTP_X_FORWARDED_FOR: $ip"));
curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/".rand(3,5).".".rand(0,3)." (Windows NT ".rand(3,5).".".rand(0,2)."; rv:2.0.1) Gecko/20100101 Firefox/".rand(3,5).".0.1");
$html = curl_exec($ch);
curl_close($ch);
return $html;
}
function match_all($regex, $str, $i = 0)
{
if(preg_match_all($regex, $str, $matches) === false)
return false;
else
return $matches[$i];
}
function match($regex, $str, $i = 0)
{
if(preg_match($regex, $str, $match) == 1)
return $match[$i];
else
return false;
}
}
?>

This really shouldn't be done in jQuery, and you could still use a few lessons in being clear with what you're looking for, but a question is a question, and here is my answer:
$('th').hide();
var $titlerow = $('tr td:first'),
$yearrow = $('tr:eq(1) td:first'),
title = $titlerow.text(),
year = $yearrow.text();
$titlerow.text(title + ' - ' + year);
$yearrow.remove();
Some things to note:
You should not be doing this in jQuery. You should rearrange your PHP. If the code is copy/pasted, then I suggest reading through it. I'll be honest, I didn't read a single line of what you posted, as it was irrelevant to a client-side question once you gave a link.
You should be sure to include jQuery in your site. It is not on the page you linked to. Otherwise, the code I provided will not work.
You should put the above code in document ready. I left that last bit somewhat obfuscated. The reason is that if you don't understand any of this bullet point, some googling of the terms in it will do you good.
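The "rearrange your PHP" suggestion could look roughly like this on the server side. It is only a sketch: format_heading is a hypothetical helper, it assumes the page is rendered from the array returned by getMovieInfoById() (with its 'title' and 'year' keys), and the h1 markup is an arbitrary choice.

```php
<?php
// Hypothetical helper: renders title and year on one line, without the
// separate bold TITLE / YEAR labels.
function format_heading(array $movie): string
{
    $title = isset($movie['title']) ? trim($movie['title']) : '';
    $year  = isset($movie['year']) ? trim($movie['year']) : '';
    $text  = $year !== '' ? $title . ' - ' . $year : $title;

    // Escape the scraped text before placing it in the page.
    return '<h1>' . htmlspecialchars($text, ENT_QUOTES, 'UTF-8') . '</h1>';
}

echo format_heading(array('title' => 'Some Movie', 'year' => '1999'));
```

Building the combined heading in PHP means the page arrives in its final form, so no client-side script has to hide or merge table cells afterwards.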
