PHP way of parsing HTML string - php

I have a PHP string containing the HTML below, which I am retrieving from an RSS feed. I am using SimplePie and can't find any other way of splitting the two pieces of data it returns from <description>. If anyone knows of a way in SimplePie to select child elements, that would be great.
<div style="example"><div style="example"><img title="example" alt="example" src="example.jpg"/></div><div style="example">EXAMPLE TEXT</div></div>
to:
$image = '<img title="example" alt="example" src="example.jpg">';
$description = 'EXAMPLE TEXT';

$received_str = 'Your received html';
$html = str_get_html($received_str);
//Image tag
$img_tag = $html->find("img", 0)->outertext;
//Example Text
$example_text = $html->find('div[style=example]', 0)->last_child()->innertext;
See Here: http://simplehtmldom.sourceforge.net/manual.htm

Try Simple HTML Dom Parser
// Create DOM from HTML string
$html = str_get_html('Your HTML here');
// Find all images
foreach($html->find('img') as $element)
echo $element->src . '<br>';
// Description (find() without an index returns an array, so request the first match)
$description = $html->find('div[style=example]', 0)->last_child()->plaintext;

Try using strip_tags:
<?php
$html ='<div style="example"><div style="example"><img title="example" alt="example" src="example.jpg"/></div><div style="example">EXAMPLE TEXT</div></div>';
$html = strip_tags($html,'<img>');
// $html == '<img title="example" alt="example" src="example.jpg"/>EXAMPLE TEXT' (the div tags are removed, but their text stays)
?>
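If you would rather not depend on a third-party library, the same split can be sketched with PHP's built-in DOMDocument. This is only a minimal sketch under stated assumptions: splitDescription() is a hypothetical helper name, and it assumes the fragment contains at most one <img> and that all remaining text belongs to the description.

```php
<?php
// Hypothetical helper: split a feed description into its <img> tag and its text.
function splitDescription(string $fragment): array
{
    libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
    $dom = new DOMDocument();
    $dom->loadHTML($fragment);          // libxml wraps the fragment in <html><body>
    libxml_clear_errors();

    $img  = $dom->getElementsByTagName('img')->item(0);
    $body = $dom->getElementsByTagName('body')->item(0);

    return [
        // saveHTML($node) serializes just that node
        'image'       => $img ? $dom->saveHTML($img) : null,
        // textContent concatenates every text node below <body>
        'description' => $body ? trim($body->textContent) : '',
    ];
}

$received = '<div style="example"><div style="example"><img title="example" alt="example" src="example.jpg"/></div><div style="example">EXAMPLE TEXT</div></div>';
$parts = splitDescription($received);

echo $parts['image'], "\n";
echo $parts['description'], "\n";
```

Unlike strip_tags, this keeps the image and the text in separate variables.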

Related

Question about using simple html dom parser to store HTML tags as objects

I am building a web scraper using the Simple HTML DOM parser, but I ran into issues figuring out how to store the HTML elements on a web page as objects. I would like to take an input URL and turn all of its HTML elements (tags, divs, fields, etc.) into objects that get output on a page. The code I have written works when I type in a URL, but the output is not what I am trying to achieve. My code so far is below; I am looking for a way to achieve this.
I have tried finding all images and links as well as creating a DOM object. I can't figure out how to convert these elements into objects that I can use to learn more about a website and possibly store in a database.
<?php
require('simple_html_dom.php');
// Create DOM from URL or file
$url = $_POST["url"];
$html = file_get_html($url);
echo $html;
// Find all images
$element = new simple_html_dom();
foreach($html->find('img') as $element)
echo $element->src . '<br>';
// Find all links
$element = new simple_html_dom();
foreach($html->find('a') as $element)
echo $element->href . '<br>';
// Create a DOM object
$html = new simple_html_dom();
// Load HTML from a URL
$html->load_file($url);
echo $html;
?>
I am expecting an output of objects, but I am instead getting an actual output of images and links on a web page.
<?php
require('simple_html_dom.php');
// Create DOM from URL or file
// $url = $_POST["url"];
$url = 'Your-Url'; // Your url: 'www.example.com'
$html = file_get_html($url);
// Find all images
$images = []; //create empty images array
foreach($html->find('img') as $element){
$images[] = $element->src . '<br>'; //Store the found elements in the images array
}
echo '<pre>Output $images: '; var_dump($images); echo '</pre>'; //An output from the images array
// Find all links
$links = []; //create empty links array
foreach($html->find('a') as $element){
$links[] = $element->href . '<br>'; //Store the found elements in the links array
}
echo '<pre>Output $links: '; var_dump($links); echo '</pre>'; //An output from the links array
The echo statements display the arrays filled with the src and href values of the img and a tags on your page.
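If the goal is structured data that can later be stored in a database, one option is to collect each element's attributes into plain arrays rather than echoing them. The following is a rough sketch using the built-in DOMDocument instead of Simple HTML DOM; collectElements() and the array keys are hypothetical names chosen for illustration, and the HTML is passed in as a string so the example runs without a network request.

```php
<?php
// Hypothetical helper: gather images and links from an HTML string as structured arrays.
function collectElements(string $html): array
{
    libxml_use_internal_errors(true);   // real-world pages are rarely valid HTML
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();

    $result = ['images' => [], 'links' => []];

    foreach ($dom->getElementsByTagName('img') as $img) {
        $result['images'][] = [
            'src' => $img->getAttribute('src'),
            'alt' => $img->getAttribute('alt'),
        ];
    }

    foreach ($dom->getElementsByTagName('a') as $a) {
        $result['links'][] = [
            'href' => $a->getAttribute('href'),
            'text' => trim($a->textContent),
        ];
    }

    return $result;
}

$page = '<p><a href="/about">About us</a><img src="logo.png" alt="Logo"></p>';
$data = collectElements($page);

// json_encode() gives a readable dump; the same arrays could be inserted into a database.
echo json_encode($data, JSON_PRETTY_PRINT | JSON_UNESCAPED_SLASHES);
```

For a live page, the string would come from something like file_get_contents($url) after validating the user-supplied URL.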

How can I get the img src from an HTML page using PHP?

I want to get the img src from an HTML page, but I am getting an error. Please help. My server shows this message:
Notice: Undefined offset: 0 in F:\xamppppp\htdocs\Arslan_Sir\img download from google.php on line 13
Notice: Array to string conversion in F:\xamppppp\htdocs\Arslan_Sir\img download from google.php on line 15
Array
My code is:
<?php
// this code picks an image from an HTML page
$ctual_link="https://www.google.com/search?q=9780333993385&ie=utf-8&oe=utf-8&client=firefox-b-ab";
define('DIRECTORY', '/imgg/m/');
$text = file_get_contents($ctual_link);
preg_match_all('/<div class=\"image\">(.*?)<\/div>/s', $text, $out);
//preg_match('/~src="(.*)"itemprop="image" \/>/',$text,$out);
preg_match('~src="(.*)"\s*itemprop="image"[^>]*>~',$text,$out);
//$out = explode(' ',$out[1]);
$z=trim($out[0],'"');
echo $out;
//}
?>
I am not quite sure, but consider the PHP Simple HTML DOM Parser.
This is the example from the library's landing page:
$html = file_get_html('http://www.google.com/');
// Find all images
foreach($html->find('img') as $element)
echo $element->src . '<br>';
This should solve your issue; you can apply this code:
<?php
$ctual_link="https://www.google.com/search?q=9780333993385&ie=utf-8&oe=utf-8&client=firefox-b-ab";
$html = file_get_contents($ctual_link);
//Create a new DOM document
$dom = new DOMDocument;
@$dom->loadHTML($html); // @ suppresses warnings about malformed HTML; without this call the document stays empty
$links = $dom->getElementsByTagName('img');
foreach ($links as $link){
//Extract and show the "src" attribute of image.
echo $link->nodeValue;
echo $link->getAttribute('src'), '<br>';
}
?>
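Both notices in the question follow from the code itself: when preg_match finds nothing, $out is an empty array, so $out[0] is an undefined offset, and echoing an array prints "Array". A hedged sketch of a DOM-based alternative that checks for a match before using it is shown below. findItempropImage() is a hypothetical helper, the sample markup is invented for illustration, and the assumption that the page marks its image with itemprop="image" is taken from the regex in the question.

```php
<?php
// Hypothetical helper: return the src of the first <img itemprop="image">, or null if none exists.
function findItempropImage(string $html): ?string
{
    libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    // Select the src attribute node directly.
    $nodes = $xpath->query('//img[@itemprop="image"]/@src');

    // Check that something matched before reading index 0.
    return $nodes->length > 0 ? $nodes->item(0)->nodeValue : null;
}

// Invented sample markup; a real page may be structured differently.
$sample = '<div class="image"><img src="cover.jpg" itemprop="image" /></div>';
$src = findItempropImage($sample);

echo $src ?? 'no image found';
```

Returning null instead of indexing blindly is what avoids the "Undefined offset" notice.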

String replace with regex in PHP

I want to modify the contents of an html file with php.
I am applying a style to img tags, and I need to check whether the tag already has a style attribute. If it does, I want to replace it with my own.
$pos = strpos($theData, "src=\"".$src."\" style=");
if (!$pos){
$theData = str_replace("src=\"".$src."\"", "src=\"".$src."\" style=\"width:".$width."px\"", $theData);
}
else{
$theData = preg_replace("src=\"".$src."\" style=/\"[^\"]+\"/", "src=\"".$src."\" style=\"width: ".$width."px\"", $theData);
}
$theData is the html source code I receive.
If a style attribute has not been found, I successfully insert my own style, but I think the problem comes when there is already a style attribute defined so my regex is not working.
I want to replace the style attribute with everything inside it, with my new style attribute.
How should my regex look?
Instead of using regex for this, you should use a DOM parser.
Example using DOMDocument:
<?php
$html = '<img src="http://example.com/image.jpg" width=""/><img src="http://example.com/image.jpg"/>';
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML('<meta http-equiv="Content-Type" content="text/html;charset=utf-8" />'.$html);
$dom->formatOutput = true;
foreach ($dom->getElementsByTagName('img') as $item)
{
//Remove the width attribute if it is there
$item->removeAttribute('width');
//Get the style attribute if it is there
$style = $item->getAttribute('style');
//Set the style, appending the existing style if necessary; 123px could be your $width variable
$item->setAttribute('style','width:123px;'.$style);
}
//Remove the unwanted doctype, html, head and body tags
$ret = preg_replace('~<(?:!DOCTYPE|/?(?:html|body|head))[^>]*>\s*~i', '', $dom->saveHTML());
echo trim(str_replace('<meta http-equiv="Content-Type" content="text/html;charset=utf-8">','',$ret));
//<img src="http://example.com/image.jpg" style="width:123px;">
//<img src="http://example.com/image.jpg" style="width:123px;">
?>
Here is the regexp variant of solving this problem:
<?php
$theData = "<img src=\"/image.png\" style=\"lol\">";
$src = "/image.png";
$width = 10;
//you must escape potential special characters in $src,
//before using it in regexp
$regexp_src = preg_quote($src, "/");
$theData = preg_replace(
'/src="'. $regexp_src .'" style=".*?"/i',
'src="'. $src .'" style="width: '. $width . 'px;"',
$theData);
print $theData;
prints:
<img src="/image.png" style="width: 10px;">
Regex expression:
(<[^>]*)style\s*=\s*('|")[^\2]*?\2([^>]*>)
Usage:
$1$3
Example:
http://rubular.com/r/28tCIMHs50
Search for:
<img([^>]*)style="([^"]*)"
and replace with:
<img\1style="attribute1: value1; attribute2: value2;"
http://regex101.com/r/zP2tV9
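As a complement to the answers above, here is a minimal sketch of the DOM approach that replaces (rather than appends to) the style attribute of the one image whose src matches, and returns only the original markup without the doctype wrapper. setImageWidth() is a hypothetical helper; $src and $width mirror the variables in the question.

```php
<?php
// Hypothetical helper: set style="width:Npx" on every <img> whose src equals $src.
function setImageWidth(string $html, string $src, int $width): string
{
    libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);              // libxml wraps the markup in <html><body>
    libxml_clear_errors();

    foreach ($dom->getElementsByTagName('img') as $img) {
        if ($img->getAttribute('src') === $src) {
            // setAttribute() overwrites an existing style or creates a new one.
            $img->setAttribute('style', 'width:' . $width . 'px');
        }
    }

    // Serialize only the children of <body> so no doctype or html tags are emitted.
    $body = $dom->getElementsByTagName('body')->item(0);
    $out = '';
    if ($body !== null) {
        foreach ($body->childNodes as $child) {
            $out .= $dom->saveHTML($child);
        }
    }
    return $out;
}

$theData = '<p>Logo: <img src="/image.png" style="border:0"></p>';
$updated = setImageWidth($theData, '/image.png', 10);

echo $updated;
```

Because the attribute is set through the DOM, it does not matter whether the style attribute comes before or after src, which is the case the regex in the question cannot handle.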

How to get page title in php?

I have this function to get title of a website:
function getTitle($Url){
$str = file_get_contents($Url);
if(strlen($str)>0){
preg_match("/\<title\>(.*)\<\/title\>/",$str,$title);
return $title[1];
}
}
However, this function makes my page take too long to respond. Someone told me to get the title by requesting only the beginning of the page instead of reading the whole file, but I don't know how. Can anyone please tell me which code and function I should use to do this? Thank you very much.
Using regex is not a good idea for HTML; use a DOM parser instead:
$html = new simple_html_dom();
$html->load_file('****'); //put url or filename
$title = $html->find('title', 0); //index 0 returns the first match instead of an array
echo $title->plaintext;
or
// Create DOM from URL or file
$html = file_get_html('*****');
// Find all title elements
foreach($html->find('title') as $element)
echo $element->plaintext . '<br>';
Good read
RegEx match open tags except XHTML self-contained tags
Use jQuery instead to get the title of your page:
$(document).ready(function() {
alert($("title").text());
});
Demo : http://jsfiddle.net/WQNT8/1/
Try this:
include_once 'simple_html_dom.php';
$oHtml = file_get_html($url); // file_get_html() loads from a URL; str_get_html() expects an HTML string
$Title = $oHtml->find('title', 0)->innertext;
$Description = $oHtml->find("meta[name='description']", 0)->content;
$keywords = $oHtml->find("meta[name='keywords']", 0)->content;
echo $Title;
echo $Description;
echo $keywords;
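None of the answers above addresses the speed concern in the question. One possible approach, sketched here under assumptions, is to read only the first few kilobytes of the page, since <title> normally sits near the top of <head>. extractTitle() and fetchTitle() are hypothetical helpers, and the 16384-byte limit is an arbitrary guess that may be too small for pages with very large heads. A regex is used only because a truncated document may not parse cleanly.

```php
<?php
// Hypothetical helper: pull the <title> text out of a (possibly truncated) HTML string.
function extractTitle(string $html): ?string
{
    // i = case-insensitive, s = let "." match newlines inside the title
    if (preg_match('~<title[^>]*>(.*?)</title>~is', $html, $m)) {
        return trim(html_entity_decode($m[1], ENT_QUOTES | ENT_HTML5, 'UTF-8'));
    }
    return null;
}

// Hypothetical helper: download at most $maxBytes of the page, then extract the title.
function fetchTitle(string $url, int $maxBytes = 16384): ?string
{
    // The fifth argument of file_get_contents() limits how many bytes are read.
    $head = @file_get_contents($url, false, null, 0, $maxBytes);
    return $head === false ? null : extractTitle($head);
}

// Offline demonstration with a literal string.
$title = extractTitle("<html><head><title>\n Example &amp; Page </title></head></html>");
echo $title;
```

fetchTitle() returns null both when the request fails and when no title appears within the first $maxBytes bytes, so callers should handle that case.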

How to extract title and meta description using PHP Simple HTML DOM Parser?

How can I extract a page's title and meta description using the PHP Simple HTML DOM Parser?
I just need the title of the page and the keywords in plain text.
$html = new simple_html_dom();
$html->load_file('some_url');
//To get Meta Title
$meta_title = $html->find("meta[name='title']", 0)->content;
//To get Meta Description
$meta_description = $html->find("meta[name='description']", 0)->content;
//To get Meta Keywords
$meta_keywords = $html->find("meta[name='keywords']", 0)->content;
NOTE: The names of meta tags are case-sensitive!
I just took a look at the HTML DOM Parser, try:
$html = new simple_html_dom();
$html->load_file('xxx'); //put url or filename in place of xxx
$title = $html->find('title');
echo $title->plaintext;
$descr = $html->find('meta[description]');
echo $descr->plaintext;
$html = new simple_html_dom();
$html->load_file('http://www.google.com');
$title = $html->find('title',0)->innertext;
$html->find('title') returns an array,
so you should use $html->find('title',0). The same applies to meta[description].
Taken from LeiXC's solution above, you need to use the simple html dom class:
$dom = new simple_html_dom();
$dom->load_file( 'websiteurl.com' );// put your own url in here for testing
$html = str_get_html($dom);
$descr = $html->find("meta[name=description]", 0);
$description = $descr->content;
echo $description;
I have tested this code and yes it is case sensitive (some meta tags use a capital D for description)
Here is some error checking for spelling errors:
if( is_object( $html->find("meta[name=description]", 0)) ){
echo $html->find("meta[name=description]", 0)->content;
} elseif( is_object( $html->find("meta[name=Description]", 0)) ){
echo $html->find("meta[name=Description]", 0)->content;
}
$html->find('meta[name=keywords]',0)->attr['content'];
$html->find('meta[name=description]',0)->attr['content'];
$html = new simple_html_dom();
$html->load_file('xxx');
//put url or filename in place of xxx
$title = $html->find('title', 0)->innertext; // index 0 avoids passing a function result to array_shift()
echo $title;
$descr = $html->find("meta[name='description']", 0)->content;
echo $descr;
You can also use PHP's built-in get_meta_tags() function, which is simple to use. For example:
$result = 'site.com';
$tags = get_meta_tags("html/".$result);
The correct answer is:
$html = str_get_html($html);
$descr = $html->find("meta[name=description]", 0);
$description = $descr->content;
The above code gets html into an object format, then the find method looks for a meta tag with the name description, and finally you need to return the value of the meta tag's content, not the innertext or plaintext as outlined by others.
This has been tested and used in live code.
I found an easy way to get the description:
$html = new simple_html_dom();
$html->load_file('your_url');
$title = $html->find('title', 0)->plaintext; //<title>**Text from here**</title>
$description = $html->find("meta[name='description']", 0)->content; //<meta name="description" content="**Text from here**">
If your line contains extra spaces, then try this
$title = trim($title);
$description = trim($description);
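Several answers above work around the case sensitivity of the name attribute by testing each spelling separately. A possible alternative, sketched here with the built-in DOMDocument rather than Simple HTML DOM, is to lowercase every meta name once and look it up afterwards. extractMeta() and the array keys are hypothetical names used for illustration.

```php
<?php
// Hypothetical helper: return the page title plus all named meta tags, keyed by lowercase name.
function extractMeta(string $html): array
{
    libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();

    $meta = [];
    foreach ($dom->getElementsByTagName('meta') as $tag) {
        // Lowercasing makes "Description" and "description" land on the same key.
        $name = strtolower($tag->getAttribute('name'));
        if ($name !== '') {
            // The value lives in the content attribute, not in the tag's text.
            $meta[$name] = trim($tag->getAttribute('content'));
        }
    }

    $titleNode = $dom->getElementsByTagName('title')->item(0);

    return [
        'title' => $titleNode ? trim($titleNode->textContent) : null,
        'meta'  => $meta,
    ];
}

$page = '<html><head><title> Demo </title>'
      . '<meta name="Description" content="A demo page">'
      . '<meta name="keywords" content="php, dom"></head><body></body></html>';
$info = extractMeta($page);

echo $info['title'], "\n";
echo $info['meta']['description'] ?? '(no description)', "\n";
echo $info['meta']['keywords'] ?? '(no keywords)', "\n";
```

The ?? fallbacks matter because many pages omit one or both meta tags.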
