image problems with regular expressions - php

When I run the following script, the image is not rendered well. What is the problem here? This is the code:
<?php
header('Content-Type: text/html; charset=utf-8');
$url = "http://www.asaphshop.nl/epages/asaphnl.sf/nl_NL/?ObjectPath=/Shops/asaphnl/Products/80203122";
$htmlcode = file_get_contents($url);
$pattern = "/class=\"noscript\"\>(.*)\<\/div\>/isU";
preg_match_all($pattern, $htmlcode, $matches);
//print_r ($matches);
$image = ($matches[0][0]);
print_r ($image);
?>
This is the part of the link I need to copy (the data-src-l part):
<div id="ProductImages" class="noscript">
<ul>
<li>
<a href="/WebRoot/products/8020/80203122/bilder/80203122.jpg">
<img itemprop="image" alt="Jesus Remember Me - Taize Songs (2CD)"
src="/WebRoot/AsaphNL/Shops/asaphnl/5422/8F43/62EE/
D698/EF8E/4DEB/AED5/3B0E/80203122_xs.jpg"
data-src-xs="/WebRoot/AsaphNL/Shops/asaphnl/5422/8F43/62EE/
D698/EF8E/4DEB/AED5/3B0E/80203122_xs.jpg"
data-src-s="/WebRoot/products/8020/80203122/bilder/80203122_s.jpg"
data-src-m="/WebRoot/products/8020/80203122/bilder/80203122_m.jpg"
data-src-l="/WebRoot/products/8020/80203122/bilder/80203122.jpg"
/>
</a>
</li>
</ul>
</div>

$pattern = "#class=\"noscript\">.*data-src-l=([\"'])(?<url>.*)\\1.*</div>#isU";
That said, it is better to treat the page as a DOM structure than as a string. In the pattern, \\1 is a backreference to ([\"']), so the closing quote must be the same character as the opening one. That is not strictly necessary for URLs, which should not contain unescaped quotes, but it is a good general-purpose habit.
P.S. If you need the whole tag, from <img to /> inclusive, use $pattern = '#class="noscript">.*(<img.*>).*</div>#isU';
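As a quick check of the named-group pattern above, here is a minimal sketch that runs it against a trimmed copy of the markup from the question rather than the live page. The $html sample and the variable names are assumptions for illustration.

```php
<?php
// Trimmed copy of the markup from the question (an assumption; the live page may differ).
$html = <<<'HTML'
<div id="ProductImages" class="noscript">
<ul><li>
<a href="/WebRoot/products/8020/80203122/bilder/80203122.jpg">
<img itemprop="image" alt="Jesus Remember Me - Taize Songs (2CD)"
data-src-m="/WebRoot/products/8020/80203122/bilder/80203122_m.jpg"
data-src-l="/WebRoot/products/8020/80203122/bilder/80203122.jpg"
/>
</a>
</li></ul>
</div>
HTML;

// U makes the quantifiers lazy, s lets . match newlines, i ignores case.
$pattern = '#class="noscript">.*data-src-l=(["\'])(?<url>.*)\1.*</div>#isU';

$url = null;
if (preg_match($pattern, $html, $m)) {
    $url = $m['url']; // the named group holds the value between the quotes
}
echo $url, "\n";
```

Note that the captured path is site-relative (it starts with /WebRoot), so it presumably needs the shop's host prepended before it can be used in an <img> tag on another page.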

Use DOMDocument (I hope that your schoolmistress will not scold you):
$dom = new DOMDocument();
$dom->loadHTMLFile('http://www.asaphshop.nl/epages/asaphnl.sf/nl_NL/?ObjectPath=/Shops/asaphnl/Products/80203122');
$xpath = new DOMXPath($dom);
$url = $xpath->query('//div[@id="ProductImages"]/ul/li/a/img/@data-src-l')->item(0)->nodeValue;
echo $url;
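A slightly more defensive variant might look like the sketch below. It parses a sample string instead of fetching the page, silences libxml warnings about imperfect markup, and checks that the query found something before reading the value. The sample markup is an assumption standing in for the real page.

```php
<?php
// Sample markup standing in for the fetched page (assumption for illustration).
$html = '<div id="ProductImages" class="noscript"><ul><li>'
      . '<a href="/WebRoot/products/8020/80203122/bilder/80203122.jpg">'
      . '<img alt="cover" data-src-l="/WebRoot/products/8020/80203122/bilder/80203122.jpg" />'
      . '</a></li></ul></div>';

libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$nodes = $xpath->query('//div[@id="ProductImages"]//img/@data-src-l');

// item(0) is null when nothing matched, so check the length first.
$url = $nodes->length > 0 ? $nodes->item(0)->nodeValue : null;
echo $url ?? 'not found', "\n";
```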

Related

preg_match() get the source link of image using regex

I want to get the image link from the html content with preg_match() function.
I tried like this but not getting the correct source link.
$data = '<div class="poster">
<div class="pic">
<img class="xfieldimage img" src="https://bobtor.com/uploads/posts/2019-01/1546950927_mv5bnji5yta2mtetztmzny00odc5lwfimzctnme2owqwnwnkywm1xkeyxkfqcgdeqxvyntm3mdmymdq._v1_-1.jpg" alt="Song of Back and Neck 2018" title="Song of Back and Neck 2018">
</div>
</div>';
preg_match("'<img class=\"xfieldimage img\" src=\"(.*?)\" alt=\"(.*?)\" title=\"(.*?)\" />'si", $data, $movie_poster);
print_r($movie_poster);
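It's not working.
Your pattern expects a self-closing tag (" />"), but the <img> in $data ends with a plain ">", so it never matches. A DOM parser avoids this kind of problem: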
$dom = new DOMDocument();
$dom->loadHTML($data);
$xpath = new DOMXPath($dom);
$image = $xpath->query("//img[@class='xfieldimage img']")->item(0);
echo $image->getAttribute("src");
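If a regex is still preferred, one possible sketch is to match only up to the src attribute, so the way the tag is closed no longer matters. The pattern below is an illustration, not a general HTML parser: it assumes class comes before src and that both values are double-quoted. The URL in $data is a shortened placeholder.

```php
<?php
// Shortened stand-in for $data; the URL is a placeholder, not the one from the question.
$data = '<div class="pic"><img class="xfieldimage img" src="https://example.com/poster.jpg" '
      . 'alt="Song of Back and Neck 2018" title="Song of Back and Neck 2018"></div>';

// Stop at the end of the src value; whether the tag ends in ">" or "/>" is irrelevant.
$pattern = '~<img\b[^>]*\bclass="xfieldimage img"[^>]*\bsrc="([^"]*)"~i';
$src = preg_match($pattern, $data, $m) ? $m[1] : null;
echo $src, "\n";
```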

Get specific html portion with regex string matching in php

I am trying to extract a specific portion of HTML with preg_match_all by matching on its class attribute, but it returns an empty array.
This is the HTML portion I want to get from the complete HTML:
<div class="details">
<div class="title">
<a href="citation.cfm?id=2892225&CFID=598850954&CFTOKEN=15595705"
target="_self">Restrictification of function arguments</a>
</div>
</div>
I am using this regex:
preg_match_all('~<div class=\'details\'>\s*(<div.*?</div>\s*)?(.*?)</div>~is', $html, $matches );
NOTE: the $html variable holds the whole HTML document that I want to search.
Thanks.
You are looking for single quotes in your regex in contrast to the double quotes in $html.
Your regex should look like:
'~<div class="details">\s*(<div.*?</div>\s*)?(.*?)</div>~is'
or better:
'~<div class=[\'"]details[\'"]>\s*(<div.*?</div>\s*)?(.*?)</div>~is'
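A small check of the quote-tolerant pattern above, run against both quoting styles. The href in $inner is trimmed for brevity and the variable names are made up for this sketch.

```php
<?php
$pattern = '~<div class=[\'"]details[\'"]>\s*(<div.*?</div>\s*)?(.*?)</div>~is';

// Inner block shared by both samples (href shortened for illustration).
$inner = '<div class="title"><a href="citation.cfm?id=2892225">Restrictification of function arguments</a></div>';
$doubleQuoted = '<div class="details">' . "\n" . $inner . "\n" . '</div>';
$singleQuoted = "<div class='details'>\n" . $inner . "\n</div>";

$hitsDouble = preg_match_all($pattern, $doubleQuoted, $m1);
$hitsSingle = preg_match_all($pattern, $singleQuoted, $m2);

echo $hitsDouble, ' ', $hitsSingle, "\n"; // number of matches for each sample
echo trim($m1[1][0]), "\n";               // group 1: the inner <div class="title">...</div>
```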
Better to use a DOM approach:
<?php
$html = '<div class="details">
<div class="title">
<a href="citation.cfm?id=2892225&CFID=598850954&CFTOKEN=15595705"
target="_self">Restrictification of function arguments</a>
</div>
</div>';
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXpath($doc);
$divs = $xpath->query('//div[@class="title"]');
print_r($divs);
?>
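print_r() on a DOMNodeList does not show the matched markup, so a follow-up step is usually needed. The sketch below is one way to pull the link text and href out of the same HTML. The $links array and its keys are names chosen for this example.

```php
<?php
$html = '<div class="details">
<div class="title">
<a href="citation.cfm?id=2892225&CFID=598850954&CFTOKEN=15595705"
target="_self">Restrictification of function arguments</a>
</div>
</div>';

libxml_use_internal_errors(true); // the unescaped & in the href may trigger parser warnings
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$links = [];
foreach ($xpath->query('//div[@class="details"]/div[@class="title"]/a') as $a) {
    // textContent is the visible link text; getAttribute() reads a single attribute
    $links[] = ['text' => trim($a->textContent), 'href' => $a->getAttribute('href')];
}
print_r($links);
```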

Regex extract image links

I am reading some HTML content. It contains image tags such as
<img onclick="document.location='http://abc.com'" src="http://a.com/e.jpg" onload="javascript:if(this.width>250) this.width=250">
or
<img src="http://a.com/e.jpg" onclick="document.location='http://abc.com'" onload="javascript:if(this.width>250) this.width=250" />
I tried to reformat these tags to become
<img src="http://a.com/e.jpg" />
However, I was not successful. The code I have so far is
$image=preg_replace('/<img(.*?)(\/)?>/','',$image);
Can anyone help?
Here's a version using DOMDocument that removes all attributes from <img> tags except for the src attribute. Note that doing a loadHTML and saveHTML with DOMDocument can alter other html as well, especially if that html is malformed. So be careful - test and see if the results are acceptable.
<?php
$html = <<<ENDHTML
<!doctype html>
<html><body>
<img onclick="..." src="http://a.com/e.jpg" onload="...">
<div><p>
<img src="http://a.com/e.jpg" onclick="..." onload="..." />
</p></div>
</body></html>
ENDHTML;
$dom = new DOMDocument;
if (!$dom->loadHTML($html)) {
throw new Exception('could not load html');
}
$xpath = new DOMXPath($dom);
foreach ($xpath->query('//img') as $img) {
// unfortunately, cannot removeAttribute() directly inside
// the loop, as this breaks the attributes iterator.
$remove = array();
foreach ($img->attributes as $attr) {
if (strcasecmp($attr->name, 'src') != 0) {
$remove[] = $attr->name;
}
}
foreach ($remove as $attr) {
$img->removeAttribute($attr);
}
}
echo $dom->saveHTML();
Match one piece at a time, then concatenate the strings. I am unsure which language you are using, so I'll explain in pseudocode:
1. Find <img with a regex and store the match in a string variable.
2. Find src="..." with src=".*?" and store the match in a string variable.
3. Find the closing /> with \/> and store the match in a string variable.
4. Concatenate the variables.
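The steps above can be sketched in PHP roughly as follows. keepOnlySrc() is a hypothetical helper name, and the sketch assumes the src value is double-quoted. The tag pattern skips over quoted attribute values because the onload handler in the question contains a ">" character.

```php
<?php
// Hypothetical helper: rebuilds every <img> tag with only its src attribute.
function keepOnlySrc(string $html): string
{
    // Step 1: match a whole <img ...> tag; quoted values may contain ">".
    $tag = '/<img\b(?:[^>"\']|"[^"]*"|\'[^\']*\')*>/i';

    return preg_replace_callback($tag, function (array $m): string {
        // Step 2: pull out src="..." (assumes a double-quoted src).
        if (!preg_match('/\ssrc\s*=\s*"([^"]*)"/i', $m[0], $src)) {
            return $m[0]; // no src found: leave the tag untouched
        }
        // Steps 3 and 4: concatenate the opening, the src and the closing part.
        return '<img src="' . $src[1] . '" />';
    }, $html);
}

$a = '<img onclick="document.location=\'http://abc.com\'" src="http://a.com/e.jpg" onload="javascript:if(this.width>250) this.width=250">';
$b = '<img src="http://a.com/e.jpg" onclick="document.location=\'http://abc.com\'" onload="javascript:if(this.width>250) this.width=250" />';
echo keepOnlySrc($a), "\n", keepOnlySrc($b), "\n";
```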

How to get img tag value inside a specific div and specific anchor tag using regular expression

I am new to regular expressions and have tried many times to get the image tag value inside an anchor tag.
This is my HTML:
<div class="smallSku" id="ctl00_ContentPlaceHolder1_smallImages">
<a title="" name="http://www.playg.in/productImages/med/PNC000051_PNC000051.jpg" href="http://www.playg.in/productImages/lrg/PNC000051_PNC000051.jpg" onclick="return showPic(this)" onmouseover="return showPic(this)">
<img border="0" alt="" src="http://www.playg.in/productImages/thmb/PNC000051_PNC000051.jpg"></a> <a title="PNC000051_PNC000051_1.jpg" name="http://www.playg.in/productImages/med/PNC000051_PNC000051_1.jpg" href="http://www.playg.in/productImages/lrg/PNC000051_PNC000051_1.jpg" onclick="return showPic(this)" onmouseover="return showPic(this)">
<img border="0" alt="PNC000051_PNC000051_1.jpg" src="http://www.playg.in/productImages/thmb/PNC000051_PNC000051_1.jpg"></a>
</div>
I want to return only the src value of each image tag. I tried a matching pattern in "preg_match_all()", and the pattern was
"#<div[\s\S]class="smallSku"[\s\S]id="ctl00_ContentPlaceHolder1_smallImages"\><a title=\"\" name="[\w\W]" href="[\w\W]" onclick=\"[\w\W]" onmouseover="[\w\W]"\><img[\s\S]src="(.*)"[\s\S]></a><\/div>#"
Please help. I have tried many times, and I also tried the answer from this link: Match image tag not nested in an anchor tag using regular expression
Regular expression is not the right tool for parsing HTML. See this FAQ: How to parse and process HTML/XML?
Here is an example on how to get the src property using your example:
$doc = new DOMDocument();
$doc->loadHTML($your_html_string);
$xpath = new DOMXPath($doc);
foreach ($xpath->query('//div[@class="smallSku"]/a/img/@src') as $attr) {
$src = $attr->value;
print $src;
}
Try this, sunith:
$content = file_get_contents('your url');
preg_match_all("|<div class='items'>.*</div>|", $content, $arr, PREG_PATTERN_ORDER);
preg_match_all("/src='([^']+)'/", $arr[0][0], $arrr, PREG_PATTERN_ORDER);
echo '<pre>';
print_r($arrr);
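The snippet above looks for class='items' and single-quoted attributes, while the markup in the question uses class="smallSku" and double quotes. A sketch of the same two-step idea adapted to that markup might look like this. The sample is a shortened copy of the question's HTML with some attributes trimmed.

```php
<?php
// Shortened copy of the markup from the question (attributes trimmed for brevity).
$content = '<div class="smallSku" id="ctl00_ContentPlaceHolder1_smallImages">
<a href="http://www.playg.in/productImages/lrg/PNC000051_PNC000051.jpg">
<img border="0" alt="" src="http://www.playg.in/productImages/thmb/PNC000051_PNC000051.jpg"></a>
<a href="http://www.playg.in/productImages/lrg/PNC000051_PNC000051_1.jpg">
<img border="0" alt="" src="http://www.playg.in/productImages/thmb/PNC000051_PNC000051_1.jpg"></a>
</div>';

$srcs = [];
// First isolate the div; s lets . span newlines, and .*? stops at the first </div>.
if (preg_match('~<div class="smallSku"[^>]*>(.*?)</div>~s', $content, $block)) {
    // Then collect every double-quoted src inside that block.
    preg_match_all('~<img\b[^>]*\ssrc="([^"]+)"~i', $block[1], $found);
    $srcs = $found[1];
}
print_r($srcs);
```

This only works while the div contains no nested </div>; the DOM version does not have that limitation.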

PHP - extract data from a web page HTML

I need to extract the words FIESTA ERASMUS and /event/83318 from the following HTML code:
<div id="tab-soiree" class=""><div class="soireeagenda cat_1">
<img src="http://www.parisbouge.com/img/fly/resize/100/83318.jpg" alt="fiesta erasmus" class="fly">
<ul>
<li class="nom"><h2>FIESTA ERASMUS </h2></li>
<li class="genre" style="margin-bottom:4px;">
soirée étudiante </li>
<li class="lieu">Duplex</li> <li class="musique">house, electro, r&b chic, latino, disco</li>
<li class="pass-label">pass</li> </ul>
<img src="/img/salles/resize/50/10.jpg" alt="duplex" class="flysalle">
<hr class="clearleft">
</div>
I tested something like this
$PATTERN = "/\<div id="tab-soiree".*(.*)/"
preg_match($PATTERN, $html, $matches);
but it doesn't work.
You don't parse HTML with Regular Expressions. Instead, use the built-in DOM parsing tools within PHP itself: http://php.net/manual/en/book.dom.php
Assuming your HTML is accessible from a variable named $html:
$doc = new DOMDocument();
$doc->loadHTML( $html );
$item = $doc->getElementsByTagName("li")->item(0);
$link = $item->getElementsByTagName("a")->item(0);
echo $link->attributes->getNamedItem('href')->nodeValue;
echo $link->textContent;
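The excerpt in the question does not show an <a> element inside the <li>, so the snippet above depends on markup that is not visible here. As a sketch limited to what the excerpt does contain, the event name can be read from the <h2> with XPath. The trimmed $html sample is an assumption for illustration.

```php
<?php
// Only the relevant part of the markup from the question.
$html = '<div id="tab-soiree"><div class="soireeagenda cat_1"><ul>'
      . '<li class="nom"><h2>FIESTA ERASMUS </h2></li>'
      . '<li class="lieu">Duplex</li>'
      . '</ul></div></div>';

libxml_use_internal_errors(true); // keep parser warnings out of the output
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$h2 = $xpath->query('//div[@id="tab-soiree"]//li[@class="nom"]/h2')->item(0);

// item(0) is null when the element is missing; trim() drops the trailing space.
$name = $h2 !== null ? trim($h2->textContent) : null;
echo $name, "\n";
```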
I suggest the following pattern:
$PATTERN = '%<h2>(.*?)[\s]+</h2>%i';
preg_match($PATTERN, $html, $matches);
The (.*?) part is a non-greedy (lazy) group: instead of running to the end of the supplied string, it stops as soon as the rest of the pattern, here the whitespace followed by </h2>, can match.
You may also want to pre-process the HTML before applying the regex, e.g. remove all line breaks, so that the [\s]+ part is no longer needed.
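Applied to the relevant line of the question's HTML, the pattern behaves like this (the one-line $html sample is cut down from the question for illustration):

```php
<?php
$html = '<li class="nom"><h2>FIESTA ERASMUS </h2></li>';

$PATTERN = '%<h2>(.*?)[\s]+</h2>%i';
// Group 1 is the heading text without the trailing whitespace.
$name = preg_match($PATTERN, $html, $matches) ? $matches[1] : null;
echo $name, "\n"; // prints "FIESTA ERASMUS"
```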
