PHP - extract data from a web page HTML - php

I need to extract the words FIESTA ERASMUS and /event/83318 from the following HTML code:
<div id="tab-soiree" class=""><div class="soireeagenda cat_1">
<img src="http://www.parisbouge.com/img/fly/resize/100/83318.jpg" alt="fiesta erasmus" class="fly">
<ul>
<li class="nom"><h2>FIESTA ERASMUS </h2></li>
<li class="genre" style="margin-bottom:4px;">
soirée étudiante </li>
<li class="lieu">Duplex</li> <li class="musique">house, electro, r&b chic, latino, disco</li>
<li class="pass-label">pass</li> </ul>
<img src="/img/salles/resize/50/10.jpg" alt="duplex" class="flysalle">
<hr class="clearleft">
</div>
I tested something like this
$PATTERN = "/\<div id="tab-soiree".*(.*)/"
preg_match($PATTERN, $html, $matches);
but it doesn't work.

You don't parse HTML with Regular Expressions. Instead, use the built-in DOM parsing tools within PHP itself: http://php.net/manual/en/book.dom.php
Assuming your HTML is accessible from a variable named $html:
$doc = new DOMDocument();
$doc->loadHTML( $html );
$item = $doc->getElementsByTagName("li")->item(0);
$link = $item->getElementsByTagName("a")->item(0);
echo $link->attributes->getNamedItem('href')->nodeValue;
echo $link->textContent;
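For the markup shown in the question, a minimal sketch along these lines may work. It assumes the event name sits in the h2 inside the li with class "nom", as in the snippet. The snippet as posted contains no a element, so the link lookup is guarded and only yields an href if one exists. The variable names and the trimmed sample are illustrative, not taken from the real page.

```php
<?php
// Trimmed copy of the question's markup (assumption: the live page has the same structure).
$html = <<<'HTML'
<div id="tab-soiree" class=""><div class="soireeagenda cat_1">
<ul>
<li class="nom"><h2>FIESTA ERASMUS </h2></li>
<li class="lieu">Duplex</li>
</ul>
</div></div>
HTML;

libxml_use_internal_errors(true); // real-world HTML is rarely valid; collect warnings silently
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$h2 = $xpath->query('//div[@id="tab-soiree"]//li[@class="nom"]/h2')->item(0);

// Event name, with the trailing space removed.
$name = $h2 !== null ? trim($h2->textContent) : null;

// If the h2 wraps a link, read its href; otherwise $href stays null.
$a = $h2 !== null ? $xpath->query('.//a', $h2)->item(0) : null;
$href = $a !== null ? $a->getAttribute('href') : null;

echo $name, "\n";
```

Checking for null before reading textContent or attributes avoids a fatal error when the page layout changes and the query matches nothing.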

I suggest the following pattern:
$PATTERN = '%<h2>(.*?)[\s]+</h2>%i';
preg_match($PATTERN, $html, $matches);
The (.*?) part is a non-greedy pattern, which means the match won't run all the way to the end of the supplied string but will stop at the first whitespace that is followed by </h2>.
You may also want to pre-process the HTML before applying the regex, e.g. remove all line breaks so that the [\s]+ part is no longer needed.
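As a rough usage sketch of that pattern, assuming the heading looks exactly like the one in the question (the one-line $html sample is illustrative):

```php
<?php
// One-line sample taken from the question's markup.
$html = '<li class="nom"><h2>FIESTA ERASMUS </h2></li>';

$pattern = '%<h2>(.*?)[\s]+</h2>%i';

$name = null;
if (preg_match($pattern, $html, $matches)) {
    // $matches[0] is the whole match, $matches[1] the first capture group.
    $name = $matches[1];
    echo $name, "\n";
}
```

Note that [\s]+ requires at least one whitespace character before </h2>, so a heading without trailing whitespace would not match; \s* would be the more forgiving choice.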

Related

Get specific html portion with regex string matching in php

I am trying to extract a specific portion of HTML with preg_match_all by matching on its class attribute, but it returns an empty array.
This is the portion I want to extract from the complete HTML:
<div class="details">
<div class="title">
<a href="citation.cfm?id=2892225&CFID=598850954&CFTOKEN=15595705"
target="_self">Restrictification of function arguments</a>
</div>
</div>
I am using this regex:
preg_match_all('~<div class=\'details\'>\s*(<div.*?</div>\s*)?(.*?)</div>~is', $html, $matches );
NOTE: the $html variable holds the whole HTML document that I want to search.
Thanks.
Your regex looks for single quotes, but the HTML in $html uses double quotes.
Your regex should look like:
'~<div class="details">\s*(<div.*?</div>\s*)?(.*?)</div>~is'
or better:
'~<div class=[\'"]details[\'"]>\s*(<div.*?</div>\s*)?(.*?)</div>~is'
Better still, use a DOM approach:
<?php
$html = '<div class="details">
<div class="title">
<a href="citation.cfm?id=2892225&CFID=598850954&CFTOKEN=15595705"
target="_self">Restrictification of function arguments</a>
</div>
</div>';
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXpath($doc);
$divs = $xpath->query('//div[@class="title"]');
print_r($divs);
?>
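print_r() on the node list only dumps an object. To get at the actual link, a sketch like the following could be used; the $results array and its keys are illustrative names, and the HTML is the sample from the question.

```php
<?php
$html = '<div class="details">
<div class="title">
<a href="citation.cfm?id=2892225&CFID=598850954&CFTOKEN=15595705"
target="_self">Restrictification of function arguments</a>
</div>
</div>';

libxml_use_internal_errors(true); // the bare & in the href is not valid HTML
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$results = [];
// Select the link inside div.title inside div.details.
foreach ($xpath->query('//div[@class="details"]/div[@class="title"]/a') as $a) {
    $results[] = [
        'href'  => $a->getAttribute('href'),
        'title' => trim($a->textContent),
    ];
}
print_r($results);
```

The [@class="title"] test matches only when the attribute is exactly "title"; for elements carrying several classes, contains(@class, "title") is a common, if looser, alternative.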

image problems with regular expressions

When I run the following script, the image is not rendered well. What is the problem here? This is the code:
<?php
header('Content-Type: text/html; charset=utf-8');
$url = "http://www.asaphshop.nl/epages/asaphnl.sf/nl_NL/
ObjectPath=/Shops/asaphnl/Products/80203122";
$htmlcode = file_get_contents($url);
$pattern = "/class=\"noscript\"\>(.*)\<\/div\>/isU";
preg_match_all($pattern, $htmlcode, $matches);
//print_r ($matches);
$image = ($matches[0][0]);
print_r ($image);
?>
This is the part of the link I need to copy (the data-src-l part):
<div id="ProductImages" class="noscript">
<ul>
<li>
<a href="/WebRoot/products/8020/80203122/bilder/80203122.jpg">
<img itemprop="image" alt="Jesus Remember Me - Taize Songs (2CD)"
src="/WebRoot/AsaphNL/Shops/asaphnl/5422/8F43/62EE/
D698/EF8E/4DEB/AED5/3B0E/80203122_xs.jpg"
data-src-xs="/WebRoot/AsaphNL/Shops/asaphnl/5422/8F43/62EE/
D698/EF8E/4DEB/AED5/3B0E/80203122_xs.jpg"
data-src-s="/WebRoot/products/8020/80203122/bilder/80203122_s.jpg"
data-src-m="/WebRoot/products/8020/80203122/bilder/80203122_m.jpg"
data-src-l="/WebRoot/products/8020/80203122/bilder/80203122.jpg"
/>
</a>
</li>
</ul>
</div>
$pattern = "#class=\"noscript\">.*data-src-l=([\"'])(?<url>.*)\\1.*</div>#isU";
But it is better to treat the page as a DOM structure, not as a string. \\1 is a backreference to ([\"']), so the value must be closed by the same kind of quote that opened it. This is not strictly necessary for URLs, since they should not contain unescaped quotes, but it is good general practice.
PS: if you need everything from <img to /> (inclusive), use $pattern = '#class="noscript">.*(<img.*>).*</div>#isU';
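A short sketch of how the named group and the backreference play out, run against a shortened copy of the markup from the question (the heredoc sample and the $url variable are illustrative; no network request is made):

```php
<?php
// Shortened copy of the product markup from the question.
$htmlcode = <<<'HTML'
<div id="ProductImages" class="noscript">
<ul><li>
<a href="/WebRoot/products/8020/80203122/bilder/80203122.jpg">
<img itemprop="image" alt="Jesus Remember Me - Taize Songs (2CD)"
data-src-m="/WebRoot/products/8020/80203122/bilder/80203122_m.jpg"
data-src-l="/WebRoot/products/8020/80203122/bilder/80203122.jpg"
/>
</a>
</li></ul>
</div>
HTML;

// (["\']) captures the opening quote as group 1; \1 demands the same quote at the end.
// The U modifier makes every quantifier lazy, so (?<url>.*) stops at the first closing quote.
$pattern = '#class="noscript">.*data-src-l=(["\'])(?<url>.*)\1.*</div>#isU';

$url = preg_match($pattern, $htmlcode, $matches) ? $matches['url'] : null;
echo $url, "\n";
```

The named group means the URL can be read as $matches['url'] instead of counting parentheses to find the right numeric index.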
Use DOMDocument (I hope that your schoolmistress will not scold you):
$dom = new DOMDocument();
$dom->loadHTMLFile('http://www.asaphshop.nl/epages/asaphnl.sf/nl_NL/?ObjectPath=/Shops/asaphnl/Products/80203122');
$xpath = new DOMXPath($dom);
$url = $xpath->query('//div[@id="ProductImages"]/ul/li/a/img/@data-src-l')->item(0)->nodeValue;
echo $url;

How to scrape img src value of each li tag

<ul class="vehicle__gallery cf">
<li><img src="AETV19098412_2a.jpg"></li>
<li><img src="AETV19098412_3a.jpg"></li>
<li><img src="AETV19098412_4a.jpg"></li>
</ul>
and my preg match syntax is as below:
preg_match_all('/<ul class="vehicle__gallery cf">.*?<li>.*?<a(.*?)href="(.*?)"(.*?)>(.*?)<\/a>.*?<\/li>.*?<\/ul>/s', $html_image,$posts, PREG_SET_ORDER);
Please don't use regular expressions to parse HTML. PHP has a fine DOM implementation you can use to loadHTML() and query() it with XPath expressions such as //ul/li//img/@src to retrieve what you're after, or maybe import it as a SimpleXML object if you prefer that toolset.
Example:
$html = <<<HTML
<ul class="vehicle__gallery cf">
<li><img src="AETV19098412_2a.jpg"></li>
<li><img src="AETV19098412_3a.jpg"></li>
<li><img src="AETV19098412_4a.jpg"></li>
</ul>
HTML;
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXpath($dom);
$imgs = $xpath->query("//ul/li//img/@src"); // matches img directly in li or nested in a link
foreach ($imgs as $img) {
echo $img->nodeValue . "\n";
}
Output:
AETV19098412_2a.jpg
AETV19098412_3a.jpg
AETV19098412_4a.jpg
Don't use regex to parse HTML. It won't work reliably:
<li> tags don't always have an end tag, and neither do <img> tags.
A tag can have any number of attributes.
Attribute values aren't always wrapped in double quotes.
Use an HTML parser like simpledomparser.
I won't even attempt to come up with a regex for this, because at some point it would fail.
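To illustrate the points above, here is a small sketch that feeds deliberately sloppy markup (invented for this example, not taken from the real page) to PHP's built-in DOMDocument, used here in place of a third-party parser. It has unclosed li tags, an unquoted attribute, a single-quoted attribute and extra attributes, all of which would trip up a fixed regex.

```php
<?php
// Deliberately sloppy markup: unclosed <li>, unquoted and single-quoted attributes, extra attributes.
$html = <<<'HTML'
<ul class="vehicle__gallery cf">
<li><img src=AETV19098412_2a.jpg>
<li><img alt='side view' class="big" src='AETV19098412_3a.jpg'>
</ul>
HTML;

libxml_use_internal_errors(true); // keep parser warnings out of the output
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$srcs = [];
// Any img below the gallery list, regardless of attribute order or quoting.
foreach ($xpath->query('//ul[contains(@class, "vehicle__gallery")]//img/@src') as $attr) {
    $srcs[] = $attr->nodeValue;
}
print_r($srcs);
```

The parser repairs the missing end tags and normalizes the attribute quoting, so the query only has to describe the structure, not the exact spelling of the markup.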
If you give your img tags a class, for example:
<img class="gallery_item" src="AETV19098412_2a.jpg">
<img class="gallery_item" src="AETV19098412_3a.jpg">
it becomes easier:
preg_match_all('/<img class="gallery_item" src="(.*?)">/', $html_image, $matches);
However, this is still very hacky: if you ever add a CSS class or other HTML attributes, or otherwise modify your markup, the pattern may stop matching.
This solution is anything but clean. You should consider using jQuery or a form, as stated in my earlier comment; that would make your life a lot easier, and the code would not break because of minor HTML changes that might come up any day.
Another approach is to use JavaScript (jQuery):
var imgArr = [];
$("ul.vehicle__gallery li img").each(function(){
imgArr.push($(this).attr('src'));
});

How to get everything between <span> & </span> including tags and text

I tried using preg_match_all to get all the contents between a given HTML tag, but it produces an empty result, and I'm not good at PHP.
Is there a way to get the contents between tags? Like this:
<span class="st"> EVERYTHING IN HERE INCLUDING TAGS<B></B><EM></EM><DIV></DIV>&+++ TEXT </span>
preg_match is not very good at HTML parsing, especially in your case, which is a bit more complex.
Instead, use an HTML parser and obtain the elements you're looking for. The following is a simple example that selects the first span element. The selection can be narrowed further, for example by also checking the class attribute; this is just to give you some pointers to start with:
$html = '<span class="st"> EVERYTHING IN HERE INCLUDING TAGS<B></B><EM></EM><DIV></DIV>&+++ TEXT </span>';
$doc = new DOMDocument();
$doc->loadHTML($html);
$span = $doc->getElementsByTagName('span')->item(0);
echo $doc->saveHTML($span);
Output:
<span class="st"> EVERYTHING IN HERE INCLUDING TAGS<b></b><em></em><div></div>&amp;+++ TEXT </span>
If you look closely, you can see that even HTML errors have been fixed on the fly: the bare & in &+++ was not valid HTML and has been escaped as &amp;.
If you only need the inner HTML, you need to iterate over the children of the span element:
foreach($span->childNodes as $child)
{
echo $doc->saveHTML($child);
}
Which gives you:
EVERYTHING IN HERE INCLUDING TAGS<b></b><em></em><div></div>&amp;+++ TEXT
I hope this is helpful.
Try this with preg_match
$str = "<span class=\"st\"> EVERYTHING IN HERE INCLUDING TAGS<B></B><EM></EM><DIV></DIV>&+++ TEXT </span>";
preg_match("/<span class=\"st\">(.*?)<\/span>/is", $str, $matches); // lazy match; s lets . span newlines
print_r($matches);

create anchors in a page with the content of <h2></h2> in PHP

I have an HTML string in a variable:
$html = "<h1>title</h1><h2>subtitle 1</h2> <h2>subtitle 2</h2>";
I want to add an anchor to each subtitle, named after the subtitle text, then print the HTML to the browser and also get the subtitles as an array.
I think this can be done with regex. Please help.
I think this will do the trick for you:
$pattern = "|<h2>(.*)</h2>|U";
preg_match_all($pattern, $html, $matches);
foreach ($matches[1] as $match) {
// Replace the whole heading so the same text elsewhere in the page is left alone.
$html = str_replace("<h2>".$match."</h2>", "<h2><a name='".$match."'></a>".$match."</h2>", $html);
}
$array_of_elements = $matches[1];
Just make sure that $html holds the existing HTML before this code starts. Afterwards, each <h2> will have an <a name='foo'></a> inserted before its text, and $array_of_elements will hold the array of matching text values.
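A DOM-based variant is also possible. The sketch below swaps in a different technique: instead of inserting <a name> elements, it puts an id attribute on each h2, which browsers also accept as a link target (#subtitle-1). The slug rule and the variable names $subtitles and $out are assumptions made for this example.

```php
<?php
$html = "<h1>title</h1><h2>subtitle 1</h2> <h2>subtitle 2</h2>";

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$subtitles = [];
foreach ($doc->getElementsByTagName('h2') as $h2) {
    $text = trim($h2->textContent);
    $subtitles[] = $text;
    // Hypothetical slug rule: lowercase, runs of non-alphanumerics collapsed to hyphens.
    $slug = trim(preg_replace('/[^a-z0-9]+/', '-', strtolower($text)), '-');
    $h2->setAttribute('id', $slug);
}

// loadHTML() wraps the fragment in <html><body>; print only the body's children.
$out = '';
foreach ($doc->getElementsByTagName('body')->item(0)->childNodes as $node) {
    $out .= $doc->saveHTML($node);
}
echo $out, "\n";
```

Using a slug rather than the raw heading text keeps spaces and punctuation out of the anchor name, which makes the resulting links easier to type and share.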
