This question already has answers here:
RegEx match open tags except XHTML self-contained tags
(35 answers)
Closed 9 years ago.
I use this regex to match all images. How can I rewrite it to match only images that are NOT followed by </a>?
preg_match_all ("/\<img ([^>]*)\/*\>/i", $text, $dst);
soap box
I don't recommend using regex to parse an html string.
however
However, you might want to try using DOM: first loop through all the images and store them in an array.
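$dom = new DOMDocument();
$dom->loadHTML($text); // $text holds the HTML from the question
$array = array();
foreach ($dom->getElementsByTagName('img') as $img) {
    $array[$img->getAttribute('src')] = 1;
}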
Then loop through all links and try to find an image inside to remove from your array.
foreach ($dom->getElementsByTagName('a') as $a) {
    // loop to catch multiple IMGs in LINKS
    foreach ($a->getElementsByTagName('img') as $img) {
        unset($array[$img->getAttribute('src')]);
    }
}
You could use DOMDocument instead of a regex. The syntax here may not be exactly right, but it should give you an idea.
$dom = new DOMDocument;
$dom->preserveWhiteSpace = false;
$dom->loadHTML($html);
$images = $dom->getElementsByTagName('img');
$images_array = array();
foreach ($images as $image) {
    // keep only images whose direct parent is not a link
    if ($image->parentNode->nodeName != 'a') {
        $images_array[] = $image->getAttribute('src');
    }
}
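The parent check above only looks one level up, so an image wrapped in, say, a span inside a link would still be collected. A minimal sketch of an alternative, assuming XPath is acceptable here: the expression `//img[not(ancestor::a)]` selects images that have no link anywhere above them. The sample markup and the `$sources` variable are made up for illustration.

```php
<?php
// Sample markup, invented for this example.
$html = '<p><a href="/page"><span><img src="linked.png"></span></a><img src="plain.png"></p>';

$dom = new DOMDocument();
libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$sources = array();
// not(ancestor::a) excludes images at any depth inside a link
foreach ($xpath->query('//img[not(ancestor::a)]') as $img) {
    $sources[] = $img->getAttribute('src');
}
print_r($sources);
```

Only `plain.png` survives the filter, because `linked.png` has an `a` element among its ancestors.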
This question already has answers here:
How do you parse and process HTML/XML in PHP?
(31 answers)
Closed 7 years ago.
I need to extract all the links to news articles from the NY Times RSS feed to a MySQL database periodically. How do I go about doing this? Can I use some regular expression (in PHP) to match the links? Or is there some other alternative way? Thanks in advance.
UPDATE 2 I tested the code below and had to modify the
$links = $dom->getElementsByTagName('a');
and change it to:
$links = $dom->getElementsByTagName('link');
It successfully outputted the links. Good Luck
UPDATE Looks like there is a complete answer here: How do you parse and process HTML/XML in PHP.
I developed a solution so that I could recurse all the links in my website. I've removed the code which verified the domain was the same with each recursion (since the question didn't ask for this), but you can easily add one back in if you need it.
Using DOMDocument, you can parse an HTML or XML document to read the links. It is better than using regex. Try something like this:
<?php
//300 seconds = 5 minutes - or however long you need so php won't time out
ini_set('max_execution_time', 300);
// using a global to store the links in case there is recursion, it makes it easy.
// You could of course pass the array by reference for cleaner code.
$alinks = array();
// set the link to whatever you are reading
$link = "http://rss.nytimes.com/services/xml/rss/nyt/HomePage.xml";
// do the search
linksearch($link, $alinks);
// show results
var_dump($alinks);
function linksearch($url, & $alinks) {
// use $queue if you want this fn to be recursive
$queue = array();
echo "<br>Searching: $url";
$href = array();
//Load the HTML page
$html = file_get_contents($url);
//Create a new DOM document
$dom = new DOMDocument;
//Parse the HTML. The @ is used to suppress any parsing errors
//that will be thrown if the $html string isn't valid XHTML.
@$dom->loadHTML($html);
//Get all links. You could also use any other tag name here,
//like 'img' or 'table', to extract other tags.
$links = $dom->getElementsByTagName('link');
//Iterate over the extracted links and display their URLs
foreach ($links as $link){
//Extract and show the "href" attribute.
$href[] = $link->getAttribute('href');
}
foreach (array_unique($href) as $link) {
// add to list of links found
$queue[] = $link;
}
// remove duplicates
$queue = array_unique($queue);
// get links that haven't yet been processed
$queue = array_diff($queue, $alinks);
// update array passed by reference with new links found
$alinks = array_merge($alinks, $queue);
if (count($queue) > 0) {
foreach ($queue as $link) {
// recursive search - uncomment out if you use this
// remember to check that the domain is the same as the one starting from
// linksearch($link, $alinks);
}
}
}
DOM+Xpath allows you to fetch nodes using expressions.
RSS Item Links
To fetch the RSS link elements (the link for each item):
$xml = file_get_contents($url);
$document = new DOMDocument();
$document->loadXml($xml);
$xpath = new DOMXPath($document);
$expression = '//channel/item/link';
foreach ($xpath->evaluate($expression) as $link) {
var_dump($link->textContent);
}
Atom Links
The atom:link elements have different semantics: they are part of the Atom namespace and are used to describe relations. NYT uses the standout relation to mark featured stories. To fetch the Atom links you need to register a prefix for the namespace. Attributes are nodes too, so you can fetch them directly:
$xml = file_get_contents($url);
$document = new DOMDocument();
$document->loadXml($xml);
$xpath = new DOMXPath($document);
$xpath->registerNamespace('a', 'http://www.w3.org/2005/Atom');
$expression = '//channel/item/a:link[@rel="standout"]/@href';
foreach ($xpath->evaluate($expression) as $link) {
var_dump($link->value);
}
There are other relations, like prev and next.
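To try both expressions without fetching the live feed, here is a small sketch that runs them against an inline document. The feed content and the example.com URLs are invented for illustration; only the namespace URI is real. Note that the prefix registered in PHP (`a`) does not have to match the prefix used inside the feed (`atom`), since XPath matches on the namespace URI.

```php
<?php
// Invented sample feed; a real NYT feed has many more elements.
$xml = <<<'XML'
<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <item>
      <link>http://example.com/story-1</link>
      <atom:link rel="standout" href="http://example.com/story-1"/>
    </item>
    <item>
      <link>http://example.com/story-2</link>
    </item>
  </channel>
</rss>
XML;

$document = new DOMDocument();
$document->loadXML($xml);
$xpath = new DOMXPath($document);
$xpath->registerNamespace('a', 'http://www.w3.org/2005/Atom');

// Plain RSS links: element nodes, so read textContent
$itemLinks = array();
foreach ($xpath->evaluate('//channel/item/link') as $link) {
    $itemLinks[] = $link->textContent;
}

// Atom links marked as standout: attribute nodes, so read value
$standout = array();
foreach ($xpath->evaluate('//channel/item/a:link[@rel="standout"]/@href') as $attr) {
    $standout[] = $attr->value;
}

print_r($itemLinks);
print_r($standout);
```

The unprefixed `link` step only matches elements without a namespace, so the `atom:link` element does not show up in the first list.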
HTML Links (a elements)
The description elements contain HTML fragments. To extract the links from them you have to load the HTML into a separate DOM document.
$xml = file_get_contents($url);
$document = new DOMDocument();
$document->loadXml($xml);
$xpath = new DOMXPath($document);
$xpath->registerNamespace('a', 'http://www.w3.org/2005/Atom');
$expression = '//channel/item/description';
foreach ($xpath->evaluate($expression) as $description) {
$fragment = new DOMDocument();
$fragment->loadHtml($description->textContent);
$fragmentXpath = new DOMXpath($fragment);
foreach ($fragmentXpath->evaluate('//a[@href]/@href') as $link) {
var_dump($link->value);
}
}
This question already has answers here:
How to get innerHTML of DOMNode?
(9 answers)
Closed 2 years ago.
Description of the current situation:
I have a folder full of pages (pages-folder), each page inside that folder has (among other things) a div with id="short-info".
I have code that pulls every <div id="short-info">...</div> from that folder and displays the text inside it using textContent (which for this purpose is the same as nodeValue).
The code that loads the divs:
<?php
$filename = glob("pages-folder/*.php");
sort($filename);
foreach ($filename as $filenamein) {
$doc = new DOMDocument();
$doc->loadHTMLFile($filenamein);
$xpath = new DOMXpath($doc);
$elements = $xpath->query("*//div[@id='short-info']");
foreach ($elements as $element) {
$nodes = $element->childNodes;
foreach ($nodes as $node) {
echo $node->textContent;
}
}
}
?>
Now the problem is that if the page I am loading has a child, like an image: <div id="short-info"> <img src="picture.jpg"> Hello world </div>, the output will only be Hello world rather than the image and then Hello world.
Question:
How do I make the code display the full html inside the div id="short-info" including for instance that image rather than just the text?
You have to make an undocumented call on the node.
$node->c14n() Will give you the HTML contained in $node.
Crazy right? I lost some hair over that one.
http://php.net/manual/en/class.domnode.php#88441
Update
c14n() canonicalizes the markup, so it will modify the HTML to conform to strict HTML. It is better to use
$html = $node->ownerDocument->saveHTML( $node );
instead.
You'd want what amounts to 'innerHTML', which PHP's dom doesn't directly support. One workaround for it is here in the PHP docs.
Another option is to take the $node you've found, insert it as the top-level element of a new DOM document, and then call saveHTML() on that new document.
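As a rough sketch of the 'innerHTML' idea, one option is to serialize each child of the node with saveHTML() and join the results. The helper name `inner_html` is made up for this example, and passing a node to saveHTML() assumes PHP 5.3.6 or later.

```php
<?php
// Hypothetical helper: returns the markup of everything inside $node,
// without the node's own opening and closing tags.
function inner_html(DOMNode $node)
{
    $html = '';
    foreach ($node->childNodes as $child) {
        // saveHTML($child) serializes only that child and its subtree
        $html .= $node->ownerDocument->saveHTML($child);
    }
    return $html;
}

$doc = new DOMDocument();
$doc->loadHTML('<div id="short-info"><img src="picture.jpg"> Hello world </div>');

$xpath = new DOMXPath($doc);
$inner = '';
foreach ($xpath->query("//div[@id='short-info']") as $div) {
    $inner .= inner_html($div);
}
echo $inner;
```

Unlike textContent, this keeps the img element in the output along with the text.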
I am using this PHP function to grab all <img> tags within any given HTML.
function extract_images($content)
{
$img = strip_tags(html_entity_decode($content),'<img>');
$regex = '~src="[^"]*"~';
preg_match_all($regex, $img, $all_images);
return $all_images;
}
This works and returns all images (gif, png, jpg, etc).
Anyone know how to change the regex...
~src="[^"]*"~
in order to only get files with JPG or JPEG extension?
Thanks a bunch.
Sooner or later the Regex Enforcement Agency will show up. It might as well be me :)
The proper way to do this is with a proper HTML DOM parser. Here's a DOMDocument solution. The usefulness of this is in that it's more robust than parsing the HTML by regex, and also gives you the ability to access or modify other HTML attributes on your <img> nodes at the same time.
$dom = new DOMDocument();
$dom->loadHTML($content);
// To hold all your links...
$links = array();
// Get all images
$imgs = $dom->getElementsByTagName("img");
foreach($imgs as $img) {
    // Check the src attr of each img
    $src = $img->getAttribute("src");
    if (preg_match("/\.jpe?g$/i", $src)) {
        // Add it onto your $links array.
        $links[] = $src;
    }
}
See other answers for the simple regex solution, or adapt from the regex inside my foreach loop.
/src="[^"]*\.(jpg|jpeg)"/i
i -> case insensitive match
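Both patterns above expect the extension to sit at the very end of the src value, so a URL such as `photo.jpg?v=2` would be missed. A possible variation, sketched under the assumption that query strings should be ignored, combines DOMDocument with parse_url() and pathinfo(). The function name `extract_jpeg_sources` and the sample markup are invented for illustration.

```php
<?php
// Hypothetical helper: returns the src of every <img> whose path ends in .jpg or .jpeg
function extract_jpeg_sources($content)
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true);   // keep malformed markup from emitting warnings
    $dom->loadHTML($content);
    libxml_clear_errors();

    $sources = array();
    foreach ($dom->getElementsByTagName('img') as $img) {
        $src = $img->getAttribute('src');
        // Look only at the path part, so "?v=2" or "#top" do not hide the extension
        $path = parse_url($src, PHP_URL_PATH);
        if (!is_string($path)) {
            continue;
        }
        $ext = strtolower(pathinfo($path, PATHINFO_EXTENSION));
        if ($ext === 'jpg' || $ext === 'jpeg') {
            $sources[] = $src;
        }
    }
    return $sources;
}

$html = '<p><img src="a.JPG"><img src="b.png"><img src="c.jpeg?v=2"><img src="d.gif"></p>';
$jpegs = extract_jpeg_sources($html);
print_r($jpegs);
```

The comparison is done on the lowercased extension, which plays the same role as the `i` flag in the regex.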
Using the following code I get the "img" tags from some HTML and check whether each one is wrapped in an "a" tag. If the current "img" tag is not inside an "a" (hyperlink), I want to wrap it in one: add the opening and closing link tags and set the target. For this I need the full HTML of the "img" tag to work with.
The question is how to pass the "img" tag's HTML into the regexp. I need some PHP variable in the call below; the place is marked with ???.
$doc = new DOMDocument();
$doc->loadHTML($article_header);
$imgs = $doc->getElementsByTagName('img');
foreach ($imgs as $img) {
if ($img->parentNode->tagName != "a") {
preg_match_all("|<img(.*)\/>|U", ??? , $matches, PREG_PATTERN_ORDER);
}
}
You do not want to use regex for this. You already have a DOM, so use it:
foreach ($imgs as $img) {
$container = $img->parentNode;
if ($container->tagName != "a") {
$a = $doc->createElement("a");
$a->appendChild( $img->cloneNode(true) );
$container->replaceChild($a, $img);
}
}
see documentation on
DOMDocument::createElement
DOMNode::appendChild
DOMNode::cloneNode
DOMNode::replaceChild
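The question also asked about setting a link target, which setAttribute() can do on the new element. The following is only a sketch: using the image's own src as the href and `_blank` as the target are assumptions, and the sample markup is invented. It also moves the original image into the link instead of cloning it, and copies the node list into an array first so that changing the tree cannot disturb the loop.

```php
<?php
$doc = new DOMDocument();
$doc->loadHTML('<div><img src="one.png"><a href="/x"><img src="two.png"></a></div>');

// getElementsByTagName() returns a live list; snapshot it before modifying the tree
$imgs = iterator_to_array($doc->getElementsByTagName('img'));

foreach ($imgs as $img) {
    $container = $img->parentNode;
    if ($container->nodeName !== 'a') {
        $a = $doc->createElement('a');
        $a->setAttribute('href', $img->getAttribute('src'));  // assumed link destination
        $a->setAttribute('target', '_blank');                 // assumed target
        $container->replaceChild($a, $img);  // put the new <a> where the image was
        $a->appendChild($img);               // then move the image inside it
    }
}

$result = $doc->saveHTML();
echo $result;
```

The image that was already inside a link is left untouched.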
This question already has answers here:
Closed 13 years ago.
Possible Duplicate:
How to extract img src, title and alt from html using php?
Hi,
I have found solution to get first image from string:
preg_match('~<img[^>]*src\s?=\s?[\'"]([^\'"]*)~i',$string, $matches);
But I can't manage to get all images from string.
One more thing... If image contains alternative text (alt attribute) how to get it too and save to another variable?
Thanks in advance,
Ilija
Don't do this with regular expressions. Instead, parse the HTML. Take a look at Parse HTML With PHP And DOM. This is a standard feature in PHP 5.2.x (and earlier PHP 5 releases). Basically the logic for getting images is roughly:
$dom = new domDocument;
$dom->loadHTML($html);
$dom->preserveWhiteSpace = false;
$images = $dom->getElementsByTagName('img');
foreach ($images as $image) {
echo $image->getAttribute('src');
}
This should be trivial to adapt to reading other attributes, such as alt.
This is what I tried, but I can't get it to print the value of src:
$dom = new domDocument;
/*** load the html into the object ***/
$dom->loadHTML($html);
/*** discard white space ***/
$dom->preserveWhiteSpace = false;
/*** the table by its tag name ***/
$images = $dom->getElementsByTagName('img');
/*** loop over the table rows ***/
foreach ($images as $img)
{
/*** get each column by tag name ***/
$url = $img->getElementsByTagName('src');
/*** echo the values ***/
echo $url->nodeValue;
echo '<hr />';
}
EDIT: I solved this problem
$dom = new domDocument;
/*** load the html into the object ***/
$dom->loadHTML($string);
/*** discard white space ***/
$dom->preserveWhiteSpace = false;
$images = $dom->getElementsByTagName('img');
foreach($images as $img)
{
$url = $img->getAttribute('src');
$alt = $img->getAttribute('alt');
echo "Title: $alt<br>$url<br>";
}
Note that Regular Expressions are a bad approach to parsing anything that involves matching braces.
You'd be better off using the DOMDocument class.
You assume that you can parse HTML using regular expressions. That may work for some sites, but not all sites. Since you are limiting yourself to only a subset of all web pages, it would be interesting to know how you limit yourself... maybe you can parse the HTML in a quite easy way from php.
Look at preg_match_all to get all matches.
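For completeness, here is roughly what that looks like with the pattern from the question, keeping in mind the caveats about regex and HTML raised above. The sample string is invented; `$matches[1]` holds the first capture group of every match.

```php
<?php
// Invented sample input with two images, one using single quotes
$string = '<img src="a.png" alt="First"><p>text</p><img src=\'b.jpg\'>';

// Same pattern as in the question; preg_match_all collects every match, not just the first
preg_match_all('~<img[^>]*src\s?=\s?[\'"]([^\'"]*)~i', $string, $matches);

// $matches[0] has the full matches, $matches[1] the captured src values
$sources = $matches[1];
print_r($sources);
```

Reading the alt text reliably with the same pattern is harder, because attributes can appear in any order; the DOM-based getAttribute('alt') call shown earlier avoids that problem.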