PHP Notice: Undefined offset: 0 fixable? - php

I am using a "get first image" script that is all over the internet, but I am getting this error:
PHP Notice: Undefined offset: 0
The script is:
function get_first_image() {
global $post, $posts;
$first_img = '';
ob_start();
ob_end_clean();
$output = preg_match_all('/<img.+src=[\'"]([^\'"]+)[\'"].*>/i',$post->post_content, $matches);
$first_img = $matches [1] [0];
return $first_img;
}
can this be fixed?

Based on your regex this could happen if the <img> tag has no src attribute or if there are no <img> tags at all.
As others have suggested, you could fix this by checking $matches first, but I'd like to suggest an alternative approach that may be more robust for parsing HTML in PHP, since using regular expressions for this is discouraged.
function get_first_image() {
global $post;
$first_img = '';
$dom = new DOMDocument();
$dom->loadHtml($post->post_content);
foreach ($dom->getElementsByTagName('img') as $img) {
if ($img->hasAttribute('src')) {
$first_img = $img->getAttribute('src');
break;
}
}
return $first_img;
}
The above function uses PHP's DOMDocument class to iterate over the <img> tags and return the src attribute of the first one that has it. (Note: I removed the ob_start() and ob_end_clean() calls from your code because I don't see what purpose they were serving.)
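DOMDocument tends to emit warnings on the loose markup found in real post content, and it rejects an empty string. A minimal sketch of the same idea with those two cases handled follows; the function name first_image_src() and its string parameter are illustrative, not part of the original answer or of WordPress.

```php
<?php
// Sketch: return the src of the first <img> that has one, or '' if none.
// first_image_src() is an illustrative name; pass it e.g. $post->post_content.
function first_image_src(string $html): string
{
    if (trim($html) === '') {
        return ''; // loadHTML() rejects an empty string
    }

    // Collect libxml parser warnings instead of emitting them for sloppy markup.
    $previous = libxml_use_internal_errors(true);

    $dom = new DOMDocument();
    $dom->loadHTML($html);

    // Discard the collected warnings and restore the previous setting.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($dom->getElementsByTagName('img') as $img) {
        if ($img->hasAttribute('src')) {
            return $img->getAttribute('src');
        }
    }

    return '';
}
```

Taking the HTML as a parameter rather than reading the global $post makes the function easier to try out on its own.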

You can do this:
$first_img = isset($matches[1][0]) ? $matches[1][0] : false;
This sets $first_img to false if the first element of the two-dimensional $matches array does not exist.
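Put together with the regex from the question, the guard might look like the sketch below. The function name first_image_from_regex() and the string parameter are assumptions made for illustration; the pattern is the one from the question, unchanged.

```php
<?php
// Sketch: same regex as the question, but $matches is checked before use.
// first_image_from_regex() is an illustrative name.
function first_image_from_regex(string $content)
{
    preg_match_all('/<img.+src=[\'"]([^\'"]+)[\'"].*>/i', $content, $matches);

    // $matches[1] is an empty array when nothing matched,
    // so index 0 may not exist; isset() avoids the notice.
    return isset($matches[1][0]) ? $matches[1][0] : false;
}
```

The caller can then test the result with === false and fall back to a default image.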

Before the statement:
$first_img = $matches [1] [0];
insert the line:
var_dump($matches);
Make sure that $matches is an array with two dimensions, and that $matches[1] actually contains an element at index 0.

Related

PHP Fatal error: Cannot use object of type simple_html_dom as array

I am working on web scraping application using simple_html_dom. I need to extract all the images in a web page. The following are the possibilities:
<img> tag images
if there is a css with the <style> tag in the same page.
if there is an image with the inline style with <div> or with some other tag.
I can scrape all the images by using the following code.
function download_images($html, $page_url , $local_url){
foreach($html->find('img') as $element) {
$img_url = $element->src;
$img_url = rel2abs($img_url, $page_url);
$parts = parse_url($img_url);
$img_path= $parts['path'];
$url_to_be_change = $GLOBALS['website_server_root'].$img_path;
download_file($img_url, $GLOBALS['website_local_root'].$img_path);
$element->src=$url_to_be_change;
}
$css_inline = $html->find("style");
$matches = array();
preg_match_all( "/url\((.*?)\)/", $css_inline, $matches, PREG_SET_ORDER );
foreach ( $matches as $match ) {
$img_url = trim( $match[1], "\"'" );
$img_url = rel2abs($img_url, $page_url);
$parts = parse_url($img_url);
$img_path= $parts['path'];
$url_to_be_change = $GLOBALS['website_server_root'].$img_path ;
download_file($img_url , $GLOBALS['website_local_root'].$img_path);
$html = str_replace($img_url , $url_to_be_change , $html );
}
return $html;
}
$html = download_images($html , $page_url , $dir); // working fine
$html = str_get_html ($html);
$html->save($dir. "/" . $ff);
Please note that I am also modifying the HTML after downloading the images.
The downloading works fine, but when I try to save the HTML, I get the following error:
PHP Fatal error: Cannot use object of type simple_html_dom as array
Important: it works perfectly fine if I leave out the str_replace and the second loop.
Fatal error: Cannot use object of type simple_html_dom as array in /var/www/html/app/framework/cache/includes/simple_html_dom.php on line 1167
Guess №1
I see a possible mistake here:
$html = str_get_html($html);
It looks like you are passing an object to str_get_html(), which accepts a string as its argument. Let's fix that this way:
$html = str_get_html($html->plaintext);
We can only guess at the contents of the $html variable that reaches this piece of code.
Guess №2
Or maybe we just need to use a separate variable in download_images so that the code is correct in both cases:
function download_images($html, $page_url , $local_url){
foreach($html->find('img') as $element) {
$img_url = $element->src;
$img_url = rel2abs($img_url, $page_url);
$parts = parse_url($img_url);
$img_path= $parts['path'];
$url_to_be_change = $GLOBALS['website_server_root'].$img_path ;
download_file($img_url , $GLOBALS['website_local_root'].$img_path);
$element->src=$url_to_be_change;
}
$css_inline = $html->find("style");
// Start from the serialized document so a string is returned even when there are no matches.
$result_html = $html->save();
$matches = array();
preg_match_all( "/url\((.*?)\)/", $css_inline, $matches, PREG_SET_ORDER );
foreach ( $matches as $match ) {
$img_url = trim( $match[1], "\"'" );
$img_url = rel2abs($img_url, $page_url);
$parts = parse_url($img_url);
$img_path= $parts['path'];
$url_to_be_change = $GLOBALS['website_server_root'].$img_path ;
download_file($img_url , $GLOBALS['website_local_root'].$img_path);
// Replace in the accumulated string so earlier replacements are kept.
$result_html = str_replace($img_url , $url_to_be_change , $result_html );
}
}
return $result_html;
}
$html = download_images($html , $page_url , $dir); // working fine
$html = str_get_html ($html);
$html->save($dir. "/" . $ff);
Explanation: in the original code, if there are no matches (the $matches array is empty), the second loop never runs, so $html is still the object it was at the beginning of the function; otherwise str_replace turns it into a string. This is a common mistake when the same variable is reused where two different variables are needed.
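One way to avoid the mixed object/string return value is to do the url(...) rewriting on a plain string and parse the result once afterwards. The sketch below assumes that approach; rewrite_css_urls() and the $mapUrl callback are hypothetical helpers, not part of simple_html_dom or of the code above.

```php
<?php
// Sketch: rewrite every url(...) reference in a string of HTML/CSS.
// rewrite_css_urls() and $mapUrl are illustrative, not part of simple_html_dom.
function rewrite_css_urls(string $html, callable $mapUrl): string
{
    $result = preg_replace_callback(
        '/url\(\s*([\'"]?)(.*?)\1\s*\)/i',
        function (array $m) use ($mapUrl): string {
            // $m[1] is the optional quote character, $m[2] the URL itself.
            return 'url(' . $m[1] . $mapUrl($m[2]) . $m[1] . ')';
        },
        $html
    );

    // preg_replace_callback() returns null on a regex error; fall back to the input.
    return $result ?? $html;
}
```

The function takes a string and always returns a string, so the caller can pass its output straight to a parser without checking what type came back.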
As the error message states, you are dealing with an Object where you should have an array.
You could try typecasting your object:
$array = (array) $yourObject;
That should solve it.
I had this error and solved it (in my case) by using return $html->save(); at the end of the function.
I can't explain why two instances with different variable names, scoped in different functions, caused this error. I guess this is how the Simple HTML DOM class works.
So, just to be clear: try calling $html->save() before you do anything else afterwards.
I hope this information helps somebody :)

Find image or iframe with regular expressions

I've got the following code, it spits out the first image of each post, on WordPress:
function catch_that_image() {
global $post, $posts;
$first_img = '';
ob_start();
ob_end_clean();
$output = preg_match_all('/<img.+src=[\'"]([^\'"]+)[\'"].*>/i', $post->post_content, $matches);
$first_img = $matches [1] [0];
if(empty($first_img)){ //Defines a default image
}
echo "<img src=" . $first_img . ">";
}
However, I also need to catch the first iframe, and echo whichever is first. I'm not experienced with regular expressions, so any help or resources would be great :)
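Use the | (or) operator: replace img with (?:img|iframe). The group is non-capturing so that the src value stays in capture group 1; with a plain (img|iframe) group it would move to group 2.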
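As a rough sketch of the alternation idea, the function below returns the tag name and src of whichever of the two tags appears first. It deliberately uses a tighter pattern than the one in the question (it stays inside a single tag instead of using a greedy .+), and first_media_src() is an illustrative name.

```php
<?php
// Sketch: tag name and src of whichever comes first, an <img> or an <iframe>.
// first_media_src() is an illustrative name; returns null when neither is found.
function first_media_src(string $content): ?array
{
    // [^>]*? keeps the match inside one tag; \s before src skips data-src etc.
    $pattern = '/<(img|iframe)\b[^>]*?\ssrc=[\'"]([^\'"]+)[\'"]/i';

    // preg_match() stops at the first match, which is the earliest tag in the text.
    if (preg_match($pattern, $content, $m) === 1) {
        return ['tag' => strtolower($m[1]), 'src' => $m[2]];
    }

    return null;
}
```

Here the tag name is captured on purpose, so the src is in group 2 and the caller can decide whether to echo an <img> or an <iframe>.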

Return value without numeric and punctuations

I'm trying to strip the numbers and punctuation from a string, leaving only alphabetic characters, using Simple HTML DOM. I've tried multiple approaches with no success.
Example string: The Amazing Retard (2012) #1
Output string: The Amazing Retard
I understand the error is about an undefined method, and I've looked at multiple pages about it, but I can't work out how to include the method. Any help would be appreciated. The error I get is:
Fatal error: Call to undefined method simple_html_dom_node::preg_replace() in /home/**/public_html/wp-content/themes/*/***.php on line 123
The code is as follows:
<?php
function scraping_comic()
{
// create HTML DOM
$html = file_get_html('http://page-to-scrape.com');
// get block
foreach($html->find('li.browse_result') as $article)
{
// get title
$item['title'] = trim($article->find('h4', 0)->find('span',0)->outertext);
// get title url
$item['title_url'] = trim($article->find('h4', 0)->find('a.grid-hidden',0)->href);
// get image
$item['image_url'] = trim($article->find('img.main_thumb',0)->src);
// get details
$item['details'] = trim($article->find('p.browse_result_description_release', 0)->plaintext);
// get sale info
$item['on_sale'] = trim($article->find('.browse_comics_release_dates', 0)->plaintext);
// strip numbers and punctuations
$item['title2'] = trim($article->find('h4',0)->find('span',0)->preg_replace("/[^A-Za-z]/","",$item['title2'], 0)->plaintext);
$ret[] = $item;
}
// clean up memory
$html->clear();
unset($html);
return $ret;
}
// -----------------------------------------------------------------------------
$ret = scraping_comic();
if ( ! empty($ret))
{
$scrape = 'http://the-domain.com';
foreach($ret as $v)
{
echo '<p>'.$v['title2'].'</p>';
echo '<p>'.$v['title'].'</p>';
echo '<p><img src="'.$v['image_url'].'"></p>';
echo '<p>'.$v['details'].'</p>';
echo '<p> '.$v['on_sale'].'</p>';
}
}
else { echo 'Could not scrape site!'; }
?>
preg_replace is a PHP function, not a method of the simple_html_dom_node class. Call it like this:
$result = preg_replace($pattern, $replacement, $subject);
http://php.net/manual/en/function.preg-replace.php
Your $replacement is fine; just pass the text you are trying to change as $subject. Note that the pattern /[^A-Za-z]/ also removes spaces, so add a space to the character class if you want to keep them.
For example, this might be what you're trying to achieve:
$item['title2'] =
trim(preg_replace("/[^A-Za-z ]/", "", $article->find('h4',0)->find('span',0)->plaintext));
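The string handling can be tried in isolation, without the scraper. The sketch below keeps ASCII letters and spaces and then collapses the leftover runs of spaces; letters_only() is an illustrative name and the sample title in the usage note is made up.

```php
<?php
// Sketch: keep only ASCII letters and spaces, then tidy the whitespace.
// letters_only() is an illustrative name.
function letters_only(string $text): string
{
    // Drop everything that is not a letter or a space (digits, punctuation, ...).
    $letters = preg_replace('/[^A-Za-z ]/', '', $text);

    // Collapse runs of spaces left behind by the removed characters.
    return trim(preg_replace('/ +/', ' ', $letters));
}
```

For a title such as "The Amazing Hero (2012) #1" this returns "The Amazing Hero". With Simple HTML DOM you would pass the node's plaintext property, not the node object itself.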
I think it's because of this line:
// strip numbers and punctuations
$item['title2'] = trim($article->find('h4',0)->find('span',0)->preg_replace("/[^A-Za-z]/","",$item['title2'], 0)->plaintext);
Written like this, it means that preg_replace is a method of your simple_html_dom_node class, which it is not: it is a standard PHP function.
Your class might have something like execute_php_function("a_php_function", anArrayOfArguments),
in which case you would write something like this:
// strip numbers and punctuations
$item['title2'] = trim($article->find('h4',0)->find('span',0)->execute_php_function("preg_replace",anArrayOfArguments)->plaintext);

Extract doctype with simple_html_dom

I am using simple_html_dom to parse a website.
Is there a way to extract the doctype?
You can use the file_get_contents function to fetch the page's raw HTML and match the doctype with a regular expression.
For example:
<?php
$html = file_get_contents("http://google.com");
$html = str_replace("\n","",$html);
$get_doctype = preg_match_all("/(<!DOCTYPE.+\">)<html/i",$html,$matches);
$doctype = $matches[1][0];
?>
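The pattern above expects the doctype to end with a quote and to be followed directly by <html, so a short HTML5 doctype would not match and $matches[1][0] would then be undefined. A more forgiving sketch, assuming the doctype has no internal subset containing a > character, could look like this; extract_doctype() is an illustrative name.

```php
<?php
// Sketch: return the doctype declaration from raw HTML, or null if there is none.
// extract_doctype() is an illustrative name; assumes no internal subset with '>'.
function extract_doctype(string $html): ?string
{
    // [^>]* also spans line breaks, so multi-line doctypes are handled.
    if (preg_match('/<!DOCTYPE[^>]*>/i', $html, $m) === 1) {
        return $m[0];
    }

    return null;
}
```

The raw HTML could come from file_get_contents() as in the example above; checking the return value for null avoids the undefined offset notice.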
You can use $html->find('unknown'). This works - at least - in version 1.11 of the simplehtmldom library. I use it as follows:
function get_doctype($doc)
{
$els = $doc->find('unknown');
foreach ($els as $e => $el)
if ($el->parent()->tag == 'root')
return $el;
return NULL;
}
That's just to handle any other 'unknown' elements which might be found; I'm assuming the first will be the doctype. You can explicitly inspect ->innertext if you want to ensure it starts with '!DOCTYPE ', though.

Php: Find first img or object tag in string

I want to ask what mistake I am making in this code.
I am trying to find the first occurrence of an image tag or an object tag and return a piece of HTML if one matches.
Currently I can get the image tag, but I can't get any results for the object tag.
I think I am making a mistake in my regex pattern. I hope the requirement is clear enough.
My code is here:
function get_first_image(){
global $post, $posts;
$first_img = '';
ob_start();
ob_end_clean();
$output = preg_match_all('/<img.+src=[\'"]([^\'"]+)[\'"].*>/i', $post->post_content, $matches) || preg_match_all('/<object[0-9 a-z_?*=\":\-\/\.#\,<>\\n\\r\\t]+<\/object>/smi', $post->post_content, $matches);
$first_img = $matches [1] [0];
if(empty($first_img)){ //Defines a default image
$mediaSearch = preg_match_all('/<object[0-9 a-z_?*=\":\-\/\.#\,<>\\n\\r\\t]+<\/object>/smi', $post->post_content, $matches2);
$first_media = $matches2 [1] [0];
$first_img = "/images/default.jpg";
}
if(!empty($first_img)){
$result = "<div class=\"alignleft\"><img src=\"$first_img\" style=\"max-width: 200px;\" /></div>";
}
if(!empty($first_media)){
$result = "<p>" . $first_media . "</p>";
}
return $result;
}
While regular expressions are good for a large variety of tasks, I find they usually fall short when parsing HTML. The problem with HTML is that the structure of a document is so variable that it is hard to extract a tag accurately (and by accurately I mean a 100% success rate with no false positives).
What I recommend is using a DOM parser such as SimpleHTML, like this:
function get_first_image(){
global $post, $posts;
require_once('SimpleHTML.class.php');
$post_dom = str_get_dom($post->post_content);
$first_img = $post_dom->find('img', 0);
if($first_img !== null) {
$first_img->style = $first_img->style . ';max-width: 200px';
return '<div class="alignleft">' . $first_img->outertext . '</div>';
} else {
$first_obj = $post_dom->find('object', 0);
if($first_obj !== null) {
return '<p>' . $first_obj->outertext . '</p>';
}
}
return '<div class="alignleft"><img src="/images/default.jpg" style="max-width: 200px;" /></div>';
}
Some may think this is overkill, but in the end, it will be easier to maintain and also allows for more extensibility. For example, using the DOM parser, I can add to the styles of your current image.
A regular expression could be devised to achieve the same goal but would be limited in such way that it would force the style attribute to be after the src or the opposite, and to overcome this limitation would add more complexity to the regular expression.
Also, consider the following. To properly match an <img> tag using regular expressions and to get only the src attribute (captured in group 2), you need the following regular expression:
<\s*?img\s+?[^>]*?\s*?src\s*?=\s*?(["'])((\\?+.)*?)\1[^>]*?>
And then again, the above can fail if:
The attribute or tag name is in capitals and the i modifier is not used.
Quotes are not used around the src attribute.
An attribute other than src uses the > character somewhere in its value.
Some other reason I have not foreseen.
So again, simply don't use regular expressions to parse a DOM document.
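If adding a third-party parser is not an option, the same lookup can be sketched with PHP's built-in DOM extension. This is an alternative to the SimpleHTML code above, not a rewrite of it; first_img_or_object() and the shape of its return value are assumptions made for illustration.

```php
<?php
// Sketch: find the first <img> or <object> in document order with built-in DOM.
// first_img_or_object() is an illustrative name; returns null when neither exists.
function first_img_or_object(string $html): ?array
{
    if (trim($html) === '') {
        return null; // loadHTML() rejects an empty string
    }

    // Collect libxml parser warnings instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // A single step with a predicate yields nodes in document order.
    $xpath = new DOMXPath($dom);
    $node = $xpath->query('//*[self::img or self::object]')->item(0);

    if (!$node instanceof DOMElement) {
        return null;
    }

    return ['tag' => $node->tagName, 'html' => $dom->saveHTML($node)];
}
```

The caller can wrap the returned markup in a div or p depending on the tag, and fall back to the default image when the result is null.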
Try this: (You need to define what you want to get in the matches array)
function get_first_image(){
global $post, $posts;
$first_img = '';
ob_start();
ob_end_clean();
$output = preg_match_all('/<img.+src=[\'"]([^\'"]+)[\'"].*>/i', $post->post_content, $matches) || preg_match_all('/(<object[0-9 a-z_?*=\":\-\/\.#\,<>\\n\\r\\t]+<\/object>)/smi', $post->post_content, $matches);
$first_img = $matches [1] [0];
if(empty($first_img)){ //Defines a default image
$mediaSearch = preg_match_all('/(<object[0-9 a-z_?*=\":\-\/\.#\,<>\\n\\r\\t]+<\/object>)/smi', $post->post_content, $matches2);
$first_media = $matches2 [1] [0];
$first_img = "/images/default.jpg";
}
if(!empty($first_img)){
$result = "<div class=\"alignleft\"><img src=\"$first_img\" style=\"max-width: 200px;\" /></div>";
}
if(!empty($first_media)){
$result = "<p>" . $first_media . "</p>";
}
return $result;
}
