I have an array of HTML in different formats:
[amp;src]=>image, anotherone [posthtml]=>image2, anothertwo [nbsp;image3
How can I extract the images and the text with a single preg_match() call, so that I reliably get each image src and the text from the HTML? If this is not possible with preg_match(), is there another way to do it?
If anyone knows how to fix this, please reply.
I'd appreciate a hand.
The recommended way is to use DOM:
$dom = new DOMDocument;
$dom->loadHTML($html);
$images = $dom->getElementsByTagName('img');
foreach ($images as $im) {
    // attributes is a property (a DOMNamedNodeMap), not a method
    $attrs = $im->attributes;
    $src = $attrs->getNamedItem('src')->nodeValue;
}
Using Regular expression:
preg_match_all("/<img .*?(?=src)src=\"([^\"]+)\"/si", $html, $m);
print_r($m);
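Since the question asks for both the image sources and the text, here is a minimal sketch of how the DOM approach could return the two together. The function name extractImagesAndText() and the sample markup are made up for illustration, and the wrapping <div> is just an assumption that the input is an HTML fragment rather than a full page.

```php
<?php
// Hypothetical helper: returns the src of every <img> plus the remaining text.
function extractImagesAndText(string $html): array
{
    // Collect parser warnings internally so sloppy markup does not flood the output.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    // The wrapper <div> gives fragment input a single parent element.
    $dom->loadHTML('<div>' . $html . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $sources = [];
    foreach ($dom->getElementsByTagName('img') as $img) {
        $sources[] = $img->getAttribute('src');
    }

    // textContent of the root element is all text nodes concatenated.
    $text = trim($dom->documentElement->textContent);

    return ['images' => $sources, 'text' => $text];
}

$result = extractImagesAndText('<img src="a.jpg">first caption <img src="b.png">second caption');
print_r($result);
```

The text comes back as one string here; if each caption has to stay paired with its image, walking the child nodes in order would be the next step.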
Related
I have a string with multiple image tags in it.
Like this
<img src="/files/028ou2p5g/blogs/9d66329f4/5844644f69fe7-64.jpg">
I want to find the FIRST such tag and get the image name from it:
5844644f69fe7-64.jpg
How can this be done in PHP, assuming there is a lot of other text and tags in the string?
You should do what @moopet suggested. Here is the code, but please give the credit to @moopet.
$str = '<img src="/files/028ou2p5g/blogs/9d66329f4/5844644f69fe7-64.jpg">';
$doc = new DOMDocument();
$doc->loadHTML($str);
$first_img = $doc->getElementsByTagName("img")->item(0); // item(0) is the first <img>
var_dump( basename($first_img->getAttribute('src')) );
Don't use regex for this. Use PHP's DOM parser or an alternative to extract the tags, then use PHP's basename() function on the src attribute to extract the filename.
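The DOM-plus-basename() advice can be wrapped in a small helper. This is only a sketch: firstImageName() is a made-up name, and the ?v=2 query string in the sample is an assumption added to show why parse_url() is worth running before basename().

```php
<?php
// Hypothetical helper: file name of the first <img> in $html, or null if there is none.
function firstImageName(string $html): ?string
{
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $first = $doc->getElementsByTagName('img')->item(0);
    if ($first === null) {
        return null; // no <img> in the input
    }

    // parse_url() drops any ?query or #fragment before basename() runs.
    $path = parse_url($first->getAttribute('src'), PHP_URL_PATH);

    return (is_string($path) && $path !== '') ? basename($path) : null;
}

$html = '<p>Some text</p>'
      . '<img src="/files/028ou2p5g/blogs/9d66329f4/5844644f69fe7-64.jpg?v=2">'
      . '<img src="/other.png">';
echo firstImageName($html), "\n";
```

Returning null when there is no image keeps the caller from calling getAttribute() on a missing node.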
Use preg_match_all() to find all occurrences and then take the first one.
Example:
<?php
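// the + after the last character class lets the file name be longer than one character
preg_match_all('/<\s*img[^<>]+?src\s*=\s*[\'\"][^<>\'\"]+?\/([^<>\'\"\/]+\.jpg)/', $html, $matches, PREG_SET_ORDER);
var_dump($matches[0]); // first match; the file name is in $matches[0][1]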
?>
Here is my regex to scrape images from a page.
preg_match_all('/\bhttps?:\/\/\S+(?:png|jpg)\b/', $html, $matches);
But it fails when image url is like this:
src="//upload.wikimedia.org/wikipedia/commons/thumb/e/ec/Adolescent_girl_sad_0001.jpg/200px-Adolescent_girl_sad_0001.jpg"
I think I need to add an OR to the regex above to allow image URLs starting with //.
The documentation says the | (pipe) performs an OR operation, but how do I add it to the regex above?
You could just avoid the wrath of the pony instead...
$dom = new DOMDocument();
$dom->loadHTML($html);
$images = $dom->getElementsByTagName('img');
$sources = array();
foreach($images as $img) $sources[] = $img->getAttribute("src");
Done!
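To cover the protocol-relative URL from the question, the DOM loop can normalise sources that start with // and keep the png/jpg filter. A rough sketch follows; imageUrls() and the default https scheme are assumptions for illustration, not part of the original answer.

```php
<?php
// Hypothetical helper: collect .png/.jpg image URLs, making //host/... URLs absolute.
function imageUrls(string $html, string $scheme = 'https'): array
{
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $urls = [];
    foreach ($dom->getElementsByTagName('img') as $img) {
        $src = $img->getAttribute('src');
        if (strncmp($src, '//', 2) === 0) {
            $src = $scheme . ':' . $src; // protocol-relative URL
        }
        // Test the extension on the path only, so a query string cannot hide it.
        $path = (string) parse_url($src, PHP_URL_PATH);
        if (preg_match('/\.(?:png|jpg)$/i', $path)) {
            $urls[] = $src;
        }
    }

    return $urls;
}

$html = '<img src="//upload.wikimedia.org/wikipedia/commons/thumb/e/ec/Adolescent_girl_sad_0001.jpg/200px-Adolescent_girl_sad_0001.jpg">'
      . '<img src="http://example.com/a.png">'
      . '<img src="http://example.com/b.gif">';
print_r(imageUrls($html));
```

If the regex has to stay, making the scheme optional with (?:https?:)?\/\/ is the usual way to express that OR.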
I'm trying to find ALL images in my blog posts with regex. The code below returns images IF the code is clean and the src attribute comes right after the img tag name. However, I also have images with other attributes such as height and width, and my regex does not pick those up. Any ideas?
The following code returns images that look like this:
<img src="blah_blah_blah.jpg">
But not images that look like this:
<img width="290" height="290" src="blah_blah_blah.jpg">
Here is my code
$pattern = '/<img\s+src="([^"]+)"[^>]+>/i';
preg_match($pattern, $data, $matches);
echo $matches[1];
Use DOM or another parser for this, don't try to parse HTML with regular expressions.
$html = <<<DATA
<img width="290" height="290" src="blah.jpg">
<img src="blah_blah_blah.jpg">
DATA;
$doc = new DOMDocument();
$doc->loadHTML($html); // load the html
$xpath = new DOMXPath($doc);
$imgs = $xpath->query('//img');
foreach ($imgs as $img) {
echo $img->getAttribute('src') . "\n";
}
Output
blah.jpg
blah_blah_blah.jpg
Ever think of using the DOM object instead of regex?
$doc = new DOMDocument();
$doc->loadHTML('<img src="http://example.com/img/image.jpg" ... />');
$imageTags = $doc->getElementsByTagName('img');
foreach($imageTags as $tag) {
echo $tag->getAttribute('src');
}
You'd be better off using a parser, but here is a way to do it with regex:
$pattern = '/<img\s.*?src="([^"]+)"/i';
The problem is that you only accept \s+ after <img. Try this instead:
$pattern = '/<img\s+[^>]*?src="([^"]+)"[^>]+>/i';
preg_match($pattern, $data, $matches);
echo $matches[1];
Try this:
$pattern = '/<img\s.*?src=["\']([^"\']+)["\']/i';
This accepts single or double quotes and a src attribute at any position in the tag.
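A quick way to check a pattern like this is to run it over both tag shapes from the question. The variant below is a sketch that differs from the answers in two ways: [^>]*? keeps the search inside one tag, and a backreference makes the closing quote match the opening one. The sample input is made up.

```php
<?php
$data = '<img src="blah_blah_blah.jpg">'
      . "<img width=\"290\" height=\"290\" src='blah.jpg'>";

// Group 1 is the opening quote, group 2 the src value; \1 must repeat the same quote.
$pattern = '/<img\s[^>]*?src=(["\'])(.*?)\1/i';

preg_match_all($pattern, $data, $matches);
print_r($matches[2]); // the captured src values
```

It is still a regex, so attributes such as data-src or a > inside an attribute value will trip it up.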
I am using preg_replace to delete from $content certain <img>:
$content=preg_replace('/(?!<img.+?id="img_menu".*?\/>)(?!<img.+?id="featured_img".*?\/>)<img.+?\/>/','',$content);
When I then display the content using the WordPress the_content function, the <img>s have indeed been removed from $content.
Beforehand, I'd like to capture these images so I can place them elsewhere in the template. I am using the same regex pattern with preg_match_all:
preg_match_all('/(?!<img.+?id="img_menu".*?\/>)(?!<img.+?id="featured_img".*?\/>)<img.+?\/>/', $content, $matches);
But I can't get my images:
preg_match_all('/(?!<img.+?id="img_menu".*?\/>)(?!<img.+?id="featured_img".*?\/>)<img.+?\/>/', $content, $matches);
print_r($matches);
Array ( [0] => Array ( ) )
Assuming (and hoping) you are using PHP 5, this is a task for DOMDocument and XPath. Regex on HTML elements will mostly work, but check the following example from
<img alt=">" src="/path.jpg" />
Regex will fail on this. Since there aren't many guarantees in programming, take the guarantee that XPath will find EXACTLY what you want, at a performance cost. So, to code it:
$doc = new DOMDocument();
$doc->loadHTML('<span><img src="com.png" /><img src="com2.png" /></span>');
$xpath = new DOMXPath($doc);
$imgs = $xpath->query('//span/img');
$html = '';
foreach($imgs as $img){
$html .= $doc->saveXML($img);
}
Now you have all the img elements in $html. Use str_replace() to remove them from $content, and from there you can have a drink and be pleased that XPath with HTML elements is painless, just a little slower.
P.S. I couldn't be bothered to understand your regex; I just think XPath is better in your situation.
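For the specific case in the question (take out every image except img_menu and featured_img, and keep the removed ones for the template), the XPath idea might look like the sketch below. extractImages() is a hypothetical helper, the ids are assumed to contain no quote characters, and the sample content is invented.

```php
<?php
// Hypothetical helper: removes every <img> whose id is not in $keepIds.
// Returns [content without those images, markup of the removed images].
function extractImages(string $content, array $keepIds): array
{
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    // The wrapper <div> is a known parent we can serialise again afterwards.
    $doc->loadHTML('<div>' . $content . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Build //img[not(@id="...") and not(@id="...")] from the ids to keep.
    $conditions = [];
    foreach ($keepIds as $id) {
        $conditions[] = sprintf('not(@id="%s")', $id);
    }
    $query = '//img' . ($conditions ? '[' . implode(' and ', $conditions) . ']' : '');

    $removed = [];
    $xpath = new DOMXPath($doc);
    // query() returns a static list, so removing nodes inside the loop is safe.
    foreach ($xpath->query($query) as $img) {
        $removed[] = $doc->saveHTML($img);
        $img->parentNode->removeChild($img);
    }

    // Serialise the wrapper's children to get the fragment back without html/body.
    $remaining = '';
    $wrapper = $doc->getElementsByTagName('div')->item(0);
    foreach ($wrapper->childNodes as $child) {
        $remaining .= $doc->saveHTML($child);
    }

    return [$remaining, $removed];
}

[$content, $images] = extractImages(
    'Intro <img id="featured_img" src="keep.png" /><img src="com.png" /> middle <img src="com2.png" /> outro',
    ['img_menu', 'featured_img']
);
print_r($images);
echo $content, "\n";
```

Removing the nodes in the DOM avoids having to match the serialised markup against the original string with str_replace().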
In the end I used preg_replace_callback:
$content2 = get_the_content();
$removed_imgs = array();
$content2 = preg_replace_callback('#(?!<img.+?id="featured_img".*?\/>)(<img.+? />)#', function ($r) use (&$removed_imgs) {
    // capture by reference instead of using global, so this also works inside a function
    $removed_imgs[] = $r[1];
    return '';
}, $content2);
foreach($removed_imgs as $img){
echo $img;
}
I need to convert
$text = 'We had <i>fun</i>. Look at <a href="http://example.com">this photo</a> of Joe';
[Edit] There could be multiple links in the text.
to
$text = 'We had fun. Look at this photo (http://example.com) of Joe';
All HTML tags are to be removed and the href value from <a> tags needs to be added like above.
What would be an efficient way to solve this with regex? Any code snippet would be great.
First do a preg_replace to keep the link. You could use:
preg_replace('/<a href="(.*?)">(.*?)<\/a>/', '$2 ($1)', $str);
Then use strip_tags which will finish off the rest of the tags.
Try an XML parser to replace every tag with its inner HTML and the a tags with their href attribute.
http://www.php.net/manual/en/book.domxml.php
The DOM solution:
$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
foreach($xpath->query('//a[@href]') as $node) {
$textNode = new DOMText(sprintf('%s (%s)',
$node->nodeValue, $node->getAttribute('href')));
$node->parentNode->replaceChild($textNode, $node);
}
echo strip_tags($dom->saveHTML());
and the same without XPath:
$dom = new DOMDocument;
$dom->loadHTML($html);
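// copy the live node list first; replaceChild() would otherwise change it mid-loop
foreach(iterator_to_array($dom->getElementsByTagName('a')) as $node) {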
if($node->hasAttribute('href')) {
$textNode = new DOMText(sprintf('%s (%s)',
$node->nodeValue, $node->getAttribute('href')));
$node->parentNode->replaceChild($textNode, $node);
}
}
echo strip_tags($dom->saveHTML());
All it does is load the HTML into a DOMDocument instance. In the first case it uses an XPath expression, which is kinda like SQL for XML, to get all links with an href attribute. It then creates a text node from the link text and the href attribute and replaces the link with it. The second version just uses the DOM API and no XPath.
Yes, it's a few lines more than regex, but it is clean and easy to understand, and it won't give you any headaches when you need to add additional logic.
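As a variation on the DOM solution, the text can be read straight from textContent, which avoids the strip_tags() pass and the html/body wrapper that saveHTML() emits. This is a sketch only: htmlToTextWithLinks() is a made-up name and the two links in the sample are invented to show the multiple-link case.

```php
<?php
// Hypothetical helper: plain text with each link rewritten as "label (href)".
function htmlToTextWithLinks(string $html): string
{
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML('<div>' . $html . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    // query() returns a static list, so replacing nodes inside the loop is safe.
    foreach ($xpath->query('//a[@href]') as $node) {
        $label = sprintf('%s (%s)', $node->textContent, $node->getAttribute('href'));
        $node->parentNode->replaceChild($dom->createTextNode($label), $node);
    }

    // textContent concatenates every text node, so all remaining tags simply drop out.
    return trim($dom->documentElement->textContent);
}

// Made-up sample input with two links.
$text = 'We had <i>fun</i>. Look at <a href="http://example.com">this photo</a> of Joe'
      . ' and <a href="http://example.org">this one</a> of Ann.';
echo htmlToTextWithLinks($text), "\n";
```

Because textContent returns decoded text, entities such as &amp; come back as plain characters.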
I've done things like this using variations of substring and replace. I'd probably use regex today but you wanted an alternative so:
For the <i> tags, I'd do something like:
$text = str_replace("<i>", "", $text);
$text = str_replace("</i>", "", $text);
(My PHP is really rusty, so double-check the function name and argument order; the idea is what I'm sharing.)
The <a> tag is a bit more tricky. But, it can be done. You need to find the point that <a starts and that the > ends with. Then you extract the entire length and replace the closing </a>
That might go something like:
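$start = strrpos( $text, "<a" );           // where the last <a tag starts
$end = strrpos( $text, "</a>", $start );   // where its closing </a> starts
$link = substr( $text, $start, $end - $start ); // substr() takes a length, not an end position
$text = str_replace("</a>", "", $text);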
(I don't know if this will work, again the idea is what I want to communicate. I hope the code fragments help but they probably don't work "out of the box". There are also a lot of possible bugs in the code snippets depending on your exact implementation and environment)
Reference:
strrpos - http://www.php.net/manual/en/function.strrpos.php
str_replace - http://www.php.net/manual/en/function.str-replace.php
substr - http://php.net/manual/en/function.substr.php
It's also very easy to do with a parser:
# available from http://simplehtmldom.sourceforge.net
include('simple_html_dom.php');
# parse and echo
$html = str_get_html('We had <i>fun</i>. Look at <a href="http://example.com">this photo</a> of Joe');
$a = $html->find('a');
$a[0]->outertext = "{$a[0]->innertext} ( {$a[0]->href} )";
echo strip_tags($html);
And that produces the output you want for your test case.