This question already has answers here:
Closed 13 years ago.
Possible Duplicate:
How to extract img src, title and alt from html using php?
Hi,
I have found solution to get first image from string:
preg_match('~<img[^>]*src\s?=\s?[\'"]([^\'"]*)~i',$string, $matches);
But I can't manage to get all images from string.
One more thing... If image contains alternative text (alt attribute) how to get it too and save to another variable?
Thanks in advance,
Ilija
Don't do this with regular expressions. Instead, parse the HTML. Take a look at Parse HTML With PHP And DOM. The DOM extension is a standard feature in PHP 5. Basically the logic for getting images is roughly:
$dom = new domDocument;
$dom->loadHTML($html);
$dom->preserveWhiteSpace = false;
$images = $dom->getElementsByTagName('img');
foreach ($images as $image) {
echo $image->getAttribute('src');
}
This should be trivial to adapt to finding images.
This is what I tried but can't get it print value of src
$dom = new domDocument;
/*** load the html into the object ***/
$dom->loadHTML($html);
/*** discard white space ***/
$dom->preserveWhiteSpace = false;
/*** the table by its tag name ***/
$images = $dom->getElementsByTagName('img');
/*** loop over the table rows ***/
foreach ($images as $img)
{
/*** get each column by tag name ***/
$url = $img->getElementsByTagName('src');
/*** echo the values ***/
echo $url->nodeValue;
echo '<hr />';
}
EDIT: I solved this problem
$dom = new domDocument;
/*** load the html into the object ***/
$dom->loadHTML($string);
/*** discard white space ***/
$dom->preserveWhiteSpace = false;
$images = $dom->getElementsByTagName('img');
foreach($images as $img)
{
$url = $img->getAttribute('src');
$alt = $img->getAttribute('alt');
echo "Title: $alt<br>$url<br>";
}
Note that Regular Expressions are a bad approach to parsing anything that involves matching braces.
You'd be better off using the DOMDocument class.
You assume that you can parse HTML using regular expressions. That may work for some sites, but not all sites. Since you are limiting yourself to only a subset of all web pages, it would be interesting to know how you limit yourself... maybe you can parse the HTML in a quite easy way from php.
Look at preg_match_all to get all matches.
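If you do stay with the regex route, here is a minimal sketch of what preg_match_all would look like with the pattern from the question. The sample markup is made up for illustration, and as the other answers point out, a pattern like this is fragile on real-world HTML.

```php
<?php
// Sample markup (hypothetical); stands in for the $string from the question.
$string = '<p><img src="a.png" alt="First"> text <img alt="Second" src=\'b.jpg\'></p>';

// Same pattern as in the question; capture group 1 holds the src value.
preg_match_all('~<img[^>]*src\s?=\s?[\'"]([^\'"]*)~i', $string, $matches);

// $matches[0] holds the full matches, $matches[1] the captured src values.
$sources = $matches[1];
print_r($sources);
```

Unlike preg_match, which stops at the first hit, preg_match_all fills $matches with one entry per image.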
This question already has answers here:
How do you parse and process HTML/XML in PHP?
(31 answers)
Closed 5 years ago.
I am trying to get a specific div element (i.e. with attribute id="vung_doc") from a website, but I get almost every element. Do you have any idea what's wrong?
$doc = new DOMDocument;
// We don't want to bother with white spaces
$doc->preserveWhiteSpace = true;
// Most HTML Developers are chimps and produce invalid markup...
$doc->strictErrorChecking = false;
$doc->recover = true;
$doc->loadHTMLFile('http://lightnovelgate.com/chapter/epoch_of_twilight/chapter_300');
$xpath = new DOMXPath($doc);
$query = "//*[@class='vung_doc']";
$entries = $xpath->query($query);
var_dump($entries->item(0)->textContent);
Actually, it appears that that one element, which has both id and class attributes with value vung_doc, has many paragraphs inside its text content. Perhaps you are thinking each paragraph should be in its own div element.
<div id="vung_doc" class="vung_doc" style="font-size: 18px;">
<p></p>
"Mayor song..."
In the screenshot at the bottom of this post, I added an outline style to that element, to show just how many paragraphs are within that element.
If you wanted to separate the paragraphs, you could use preg_split() to split on any new line characters:
$entries = $xpath->query($query);
foreach($entries as $entry) {
$paragraphs = preg_split("/[\r\n]+/s",$entry->textContent);
foreach($paragraphs as $paragraph) {
if (trim($paragraph)) {
echo '<b>paragraph:</b> '.$paragraph;
break;
}
}
}
See a demonstration of this in this playground example. Note that before loading the HTML file, libxml_use_internal_errors() is called, to suppress the XML errors:
libxml_use_internal_errors(true);
Screenshot of the target div element with outline added:
Change
$query = "//*[@class='vung_doc']";
to
$query = "//*[@id='vung_doc']";
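For reference, XPath addresses attributes with the @ prefix. Here is a small self-contained sketch of selecting by id, run against made-up markup rather than the live page (the HTML string and the variable names are assumptions for illustration):

```php
<?php
// Hypothetical markup standing in for the fetched page.
$html = '<html><body><div class="menu">Menu</div>'
      . '<div id="vung_doc" class="vung_doc"><p>First paragraph.</p><p>Second paragraph.</p></div>'
      . '</body></html>';

libxml_use_internal_errors(true);   // collect parse warnings instead of printing them
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$entries = $xpath->query("//*[@id='vung_doc']");

$paragraphs = [];
if ($entries->length > 0) {
    // Query relative to the matched div, so only its own <p> children are returned.
    foreach ($xpath->query('.//p', $entries->item(0)) as $p) {
        $paragraphs[] = trim($p->textContent);
    }
}
print_r($paragraphs);
```

Querying the p elements relative to the matched div avoids having to split the text content on line breaks when the paragraphs are real elements.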
This question already has answers here:
How do you parse and process HTML/XML in PHP?
(31 answers)
Closed 7 years ago.
I need to extract all the links to news articles from the NY Times RSS feed to a MySQL database periodically. How do I go about doing this? Can I use some regular expression (in PHP) to match the links? Or is there some other alternative way? Thanks in advance.
UPDATE 2 I tested the code below and had to modify the
$links = $dom->getElementsByTagName('a');
and change it to:
$links = $dom->getElementsByTagName('link');
It successfully outputted the links. Good Luck
UPDATE Looks like there is a complete answer here: How do you parse and process HTML/XML in PHP.
I developed a solution so that I could recurse all the links in my website. I've removed the code which verified the domain was the same with each recursion (since the question didn't ask for this), but you can easily add one back in if you need it.
Using PHP's DOMDocument, you can parse an HTML or XML document to read links. It is better than using regex. Try something like this:
<?php
//300 seconds = 5 minutes - or however long you need so php won't time out
ini_set('max_execution_time', 300);
// using a global to store the links in case there is recursion, it makes it easy.
// You could of course pass the array by reference for cleaner code.
$alinks = array();
// set the link to whatever you are reading
$link = "http://rss.nytimes.com/services/xml/rss/nyt/HomePage.xml";
// do the search
linksearch($link, $alinks);
// show results
var_dump($alinks);
function linksearch($url, & $alinks) {
// use $queue if you want this fn to be recursive
$queue = array();
echo "<br>Searching: $url";
$href = array();
//Load the HTML page
$html = file_get_contents($url);
//Create a new DOM document
$dom = new DOMDocument;
//Parse the HTML. The @ is used to suppress any parsing errors
//that will be thrown if the $html string isn't valid XHTML.
@$dom->loadHTML($html);
//Get all links. You could also use any other tag name here,
//like 'img' or 'table', to extract other tags.
$links = $dom->getElementsByTagName('link');
//Iterate over the extracted links and display their URLs
foreach ($links as $link){
//Extract and show the "href" attribute.
$href[] = $link->getAttribute('href');
}
foreach (array_unique($href) as $link) {
// add to list of links found
$queue[] = $link;
}
// remove duplicates
$queue = array_unique($queue);
// get links that haven't yet been processed
$queue = array_diff($queue, $alinks);
// update array passed by reference with new links found
$alinks = array_merge($alinks, $queue);
if (count($queue) > 0) {
foreach ($queue as $link) {
// recursive search - uncomment out if you use this
// remember to check that the domain is the same as the one starting from
// linksearch($link, $alinks);
}
}
}
DOM+Xpath allows you to fetch nodes using expressions.
RSS Item Links
To fetch the RSS link elements (the link for each item):
$xml = file_get_contents($url);
$document = new DOMDocument();
$document->loadXml($xml);
$xpath = new DOMXPath($document);
$expression = '//channel/item/link';
foreach ($xpath->evaluate($expression) as $link) {
var_dump($link->textContent);
}
Atom Links
The atom:link elements have different semantics: they are part of the Atom namespace and are used to describe relations. NYT uses the standout relation to mark featured stories. To fetch the Atom links you need to register a prefix for the namespace. Attributes are nodes too, so you can fetch them directly:
$xml = file_get_contents($url);
$document = new DOMDocument();
$document->loadXml($xml);
$xpath = new DOMXPath($document);
$xpath->registerNamespace('a', 'http://www.w3.org/2005/Atom');
$expression = '//channel/item/a:link[@rel="standout"]/@href';
foreach ($xpath->evaluate($expression) as $link) {
var_dump($link->value);
}
There are other relations like prev and next.
HTML Links (a elements)
The description elements contain HTML fragments. To extract the links from them you have to load the HTML into a separate DOM document.
$xml = file_get_contents($url);
$document = new DOMDocument();
$document->loadXml($xml);
$xpath = new DOMXPath($document);
$xpath->registerNamespace('a', 'http://www.w3.org/2005/Atom');
$expression = '//channel/item/description';
foreach ($xpath->evaluate($expression) as $description) {
$fragment = new DOMDocument();
$fragment->loadHtml($description->textContent);
$fragmentXpath = new DOMXpath($fragment);
foreach ($fragmentXpath->evaluate('//a[@href]/@href') as $link) {
var_dump($link->value);
}
}
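To try the item and Atom expressions without hitting the network, here is a rough sketch against a tiny inline feed. The feed content and the example.com URLs are invented for illustration; only the Atom namespace URI is the real one.

```php
<?php
// Hypothetical miniature feed, used here instead of fetching a live URL.
$xml = <<<'XML'
<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Example</title>
    <item>
      <title>One</title>
      <link>http://example.com/one</link>
      <atom:link rel="standout" href="http://example.com/one-featured"/>
    </item>
    <item>
      <title>Two</title>
      <link>http://example.com/two</link>
    </item>
  </channel>
</rss>
XML;

$document = new DOMDocument();
$document->loadXML($xml);
$xpath = new DOMXPath($document);
// The prefix is local to this XPath object; only the namespace URI has to match the feed.
$xpath->registerNamespace('a', 'http://www.w3.org/2005/Atom');

// Plain RSS <link> elements: the URL is the element's text content.
$itemLinks = [];
foreach ($xpath->evaluate('//channel/item/link') as $link) {
    $itemLinks[] = $link->textContent;
}

// Atom links: the URL lives in the href attribute.
$standout = [];
foreach ($xpath->evaluate('//channel/item/a:link[@rel="standout"]/@href') as $href) {
    $standout[] = $href->value;
}

print_r($itemLinks);
print_r($standout);
```

Note that the unprefixed link in the first expression does not match atom:link, because the Atom element sits in its own namespace.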
This question already has answers here:
RegEx match open tags except XHTML self-contained tags
(35 answers)
Closed 9 years ago.
I use this regex to match all images. How can I rewrite it to match all images WITHOUT </a> at the end?
preg_match_all ("/\<img ([^>]*)\/*\>/i", $text, $dst);
soap box
I don't recommend using regex to parse an html string.
however
However you might want to try using DOM to first loop through all the images and store them in an array.
foreach ($dom->getElementsByTagName('img') as $img) {
$array[$img->getAttribute('src')]=1;
}
Then loop through all links and try to find an image inside to remove from your array.
foreach ($dom->getElementsByTagName('a') as $a) {
//loop to catch multiple IMGs in LINKS
foreach ($a->getElementsByTagName('img') as $img) {
unset($array[$img->getAttribute('src')]);
}
}
You could use DOMDocument instead of a regex. The syntax here may not be exactly right, but it should give you an idea.
$dom = new domDocument;
$dom->loadHTML($html);
$dom->preserveWhiteSpace = false;
$images = $dom->getElementsByTagName('img');
$images_array = array();
foreach ($images as $image) {
if ($image->parentNode->nodeName != 'a')
$images_array[] = $image->getAttribute('src');
}
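Checking only parentNode misses images that are wrapped in another element inside the link. As a rough alternative sketch (not from the answers above), an XPath query with the ancestor axis handles both cases in one pass; the sample markup is invented:

```php
<?php
// Hypothetical markup: one image directly in a link, one plain, one nested deeper in a link.
$html = '<div><a href="/x"><img src="linked.png"></a><img src="plain.png">'
      . '<a href="/y"><span><img src="nested.png"></span></a></div>';

libxml_use_internal_errors(true);   // keep parser warnings out of the output
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$unlinked = [];
// ancestor::a also covers images nested deeper inside a link, not only direct children.
foreach ($xpath->query('//img[not(ancestor::a)]') as $img) {
    $unlinked[] = $img->getAttribute('src');
}
print_r($unlinked);
```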
I am using this PHP function to grab all <img> tags within any given HTML.
function extract_images($content)
{
$img = strip_tags(html_entity_decode($content),'<img>');
$regex = '~src="[^"]*"~';
preg_match_all($regex, $img, $all_images);
return $all_images;
}
This works and returns all images (gif, png, jpg, etc).
Anyone know how to change the regex...
~src="[^"]*"~
in order to only get files with JPG or JPEG extension?
Thanks a bunch.
Sooner or later the Regex Enforcement Agency will show up. It might as well be me :)
The proper way to do this is with a proper HTML DOM parser. Here's a DOMDocument solution. The usefulness of this is in that it's more robust than parsing the HTML by regex, and also gives you the ability to access or modify other HTML attributes on your <img> nodes at the same time.
$dom = new DOMDocument();
$dom->loadHTML($content);
// To hold all your links...
$links = array();
// Get all images
$imgs = $dom->getElementsByTagName("img");
foreach($imgs as $img) {
// Check the src attr of each img
$src = "";
$src = $img->getAttribute("src");
if (preg_match("/\.jp[e]?g$/i", $src)) {
// Add it onto your $links array.
$links[] = $src;
}
}
See other answers for the simple regex solution, or adapt from the regex inside my foreach loop.
/src="[^"]*\.(jpg|jpeg)"/i
i -> case insensitive match
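One thing both patterns can trip over is a query string after the file name. Here is a hedged sketch that combines the DOM approach with parse_url and pathinfo, so the extension check only looks at the path part. The sample markup and variable names are assumptions:

```php
<?php
// Hypothetical content with mixed extensions, upper case, and a query string.
$content = '<p><img src="a.JPG"><img src="b.png"><img src="c.jpeg?v=2"><img src="d.jpg.gif"></p>';

libxml_use_internal_errors(true);   // keep parser warnings out of the output
$dom = new DOMDocument();
$dom->loadHTML($content);
libxml_clear_errors();

$jpegs = [];
foreach ($dom->getElementsByTagName('img') as $img) {
    $src = $img->getAttribute('src');
    // Look only at the path part, so a query string does not hide the extension.
    $path = (string) parse_url($src, PHP_URL_PATH);
    $ext  = strtolower(pathinfo($path, PATHINFO_EXTENSION));
    if ($ext === 'jpg' || $ext === 'jpeg') {
        $jpegs[] = $src;
    }
}
print_r($jpegs);
```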
Hey,
Consider I have the following HTML syntax
<p>xyz</p>
<p>abc</p>
I want to retrieve the text (xyz and abc) using DOM.
This is my code.
<?php
$link='http://www.xyz.com';
$ret= getLinks($link);
print_r ($ret);
function getLinks($link)
{
/*** return array ***/
$ret = array();
/*** a new dom object ***/
$dom = new domDocument;
/*** get the HTML (suppress errors) ***/
@$dom->loadHTML(file_get_contents($link));
/*** remove silly white space ***/
$dom->preserveWhiteSpace = false;
/*** get the links from the HTML ***/
$text = $dom->getElementsByTagName('p');
/*** loop over the links ***/
foreach ($text as $tag)
{
$ret[] = $tag->innerHTML;
}
return $ret;
}
?>
But I get an empty result. What am I missing here?
To suppress parsing errors, do not use
@$dom->loadHTML(file_get_contents($link));
but
libxml_use_internal_errors(TRUE);
Also, there is no reason to use file_get_contents. DOM can load from remote resources.
libxml_use_internal_errors(TRUE);
$dom->loadHTMLFile($link);
libxml_clear_errors();
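If you want to see what the parser complained about instead of just discarding it, the buffered errors can be read with libxml_get_errors() before clearing them. A minimal sketch, using deliberately sloppy made-up markup instead of a remote page:

```php
<?php
// Deliberately sloppy sample markup (hypothetical), so the parser has something to complain about.
$html = '<p>unclosed <b>tag</p><foo>bar</foo>';

libxml_use_internal_errors(true);      // buffer libxml errors instead of emitting PHP warnings
$dom = new DOMDocument();
$dom->loadHTML($html);

foreach (libxml_get_errors() as $error) {
    // Each entry is a LibXMLError object with message, line and level properties.
    echo trim($error->message), ' (line ', $error->line, ")\n";
}
libxml_clear_errors();                 // free the buffered errors

// The document is still usable despite the reported problems.
$count = $dom->getElementsByTagName('p')->length;
```

Which messages are reported depends on the installed libxml version, so treat the output as diagnostic only.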
Also, Tag Names are case sensitive. You are querying for <P> when the snippet contains <p>. Change to
$text = $dom->getElementsByTagName('p');
And finally, there is no innerHTML. A userland solution to fetch it is in
How to get innerHTML of DOMNode?
You can fetch the outerHTML with
$ret[] = $dom->saveHtml($tag); // requires PHP 5.3.6+
or
$ret[] = $dom->saveXml($tag); // that will make it XML compliant though
To get the text content of the P tag, use
$ret[] = $tag->nodeValue;
First, case matters:
$dom->getElementsByTagName('P');
Should be:
$dom->getElementsByTagName('p');
Second, innerHTML is not a valid DOMElement property.
Try:
echo $tag->textContent;
echo $tag->nodeValue;
However, this won't return the inner HTML tags and will strip them. There are a few examples on how to make it work in the PHP manual.
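For completeness, here is one possible userland sketch of an inner-HTML helper. The function name innerHtml is made up for this example, and it assumes PHP 7 or later for the type declarations (the node argument to saveHTML needs 5.3.6+). It serialises each child node and concatenates the results:

```php
<?php
// Hypothetical helper: serialise a node's children to approximate innerHTML.
function innerHtml(DOMNode $node): string
{
    $html = '';
    foreach ($node->childNodes as $child) {
        // saveHTML() with a node argument serialises just that node.
        $html .= $node->ownerDocument->saveHTML($child);
    }
    return $html;
}

// Sample markup based on the question, with one nested tag added for illustration.
$source = '<p>xyz <b>bold</b></p><p>abc</p>';

libxml_use_internal_errors(true);   // keep parser warnings out of the output
$dom = new DOMDocument();
$dom->loadHTML($source);
libxml_clear_errors();

$texts = [];
$inner = [];
foreach ($dom->getElementsByTagName('p') as $tag) {
    $texts[] = $tag->textContent;   // text only, nested tags stripped
    $inner[] = innerHtml($tag);     // markup of the children preserved
}
print_r($texts);
print_r($inner);
```

Use textContent when you only need the text, and the helper when the nested markup matters.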