I have the following code below on my website. It's used to find the images in a block of html that don't have http:// or / in front. If this is the case, it will add the website url to the front of the image source.
For example:
<img src="http://domain.com/image.jpg"> will stay the same
<img src="/image.jpg"> will stay the same
<img src="image.jpg"> will be changed to <img src="http://domain.com/image.jpg">
I feel my code is really inefficient... Any ideas on how I could make it run with less code?
preg_match_all('/<img[\s]+[^>]*src\s*=\s*[\"\']?([^\'\" >]+)[\'\" >]/i', $content_text, $matches);
if (isset($matches[1])) {
foreach($matches[1] AS $link) {
if (!preg_match("/^(https?|ftp)\:\/\//i", $link) && !preg_match("/^\//", $link)) {
$full_link = get_option('siteurl') . '/' . $link;
$content_text = str_replace($link, $full_link, $content_text);
}
}
}
For a start you could stop using regular expressions to process HTML, particularly when what you're doing is so easily done with an HTML parser (of which PHP has at least 3). For example:
$dom = new DOMDocument;
$dom->loadHTML($html);
$images = $dom->getElementsByTagName('img');
foreach ($images as $image) {
    $src = $image->getAttribute('src');
    $url = parse_url($src);
    // http_build_url() comes from the pecl_http extension (1.x)
    $image->setAttribute('src', http_build_url('http://www.example.com', $url));
}
$html = $dom->saveHTML();
Problem solved. Well, almost. The case where you add the hostname to relative URLs but not to those beginning with / is a little puzzling and not handled in this snippet but it's a relatively minor change (it involves checking $url['path']).
See Parse HTML With PHP And DOM, the Document Object Model, parse_url() and http_build_url(). PHP has much better tools for this than regular expressions.
Oh and for good measure read Parsing Html The Cthulhu Way.
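For the case described in the question (prefix only bare relative paths, leave absolute URLs and paths starting with / alone), here is a minimal sketch of the $url['path'] check mentioned above. It skips http_build_url() so it doesn't need pecl_http. The function name absolutize_image_sources and the sample HTML are made up for illustration, not taken from the question.

```php
<?php
// Hypothetical helper: prefix only bare relative image paths with $baseUrl.
function absolutize_image_sources(string $html, string $baseUrl): string
{
    $dom = new DOMDocument();
    // Collect libxml warnings internally instead of printing them for sloppy markup.
    libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();

    foreach ($dom->getElementsByTagName('img') as $image) {
        $src = $image->getAttribute('src');
        $url = parse_url($src);

        // Skip empty or unparsable values, absolute URLs (scheme or host present)
        // and root-relative paths (path starts with "/").
        if ($src === '' || $url === false
            || isset($url['scheme']) || isset($url['host'])
            || (isset($url['path'][0]) && $url['path'][0] === '/')) {
            continue;
        }

        $image->setAttribute('src', rtrim($baseUrl, '/') . '/' . $src);
    }

    return $dom->saveHTML();
}

// Sample input covering the three cases from the question (assumed markup).
$html = '<p><img src="http://domain.com/a.jpg"><img src="/b.jpg"><img src="c.jpg"></p>';
$result = absolutize_image_sources($html, 'http://domain.com');
echo $result;
```

Note that saveHTML() returns a full document (doctype, html and body wrappers), so if you only want the fragment back you would have to serialize the body's children yourself.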
Maybe a completely different approach may work, too:
<base href="http://domain.com/" />
Trying to match HTML with regular expressions is very difficult.
Even though your code may seem to work, there is a good chance that some IMG tags will slip through as they are not in the exact format you have described.
This isn't tested, but I'm thinking something like this...
preg_match_all('/<img\b[^>]*\bsrc\s*=\s*[\'"]?([^\'">]*)/i', $content_text, $matches);
The best answers I was able to find for this issue are using XSLT, but I'm just not sure how to apply those answers to my problem.
Basically, DOMDocument is doing a fine job of escaping URLs (in href attributes) that are passed in, but I'm actually using it to build a Twig/Django style template, and I'd rather it leave them alone. Here's a specific example, illustrating the "problem":
<?php
$doc = new DOMDocument();
$doc->loadHTML('<html><body>Test<br></body></html>');
echo $doc->saveHTML();
Which outputs the following:
<html><body>Test<br></body></html>
Is it possible to NOT percent-encode the href attribute?
If it's not possible directly, can you suggest a concise and reliable workaround? I'm doing other processing, and the DOMDocument usage will have to stay. So perhaps a pre/post processing trick?
I'm not happy with the 'hack'/duct-tape solution, but this is how I'm currently solving the problem:
function fix_template_variable_tokens($template_string)
{
$pattern = "/%7B%7B(\w+)%7D%7D/";
$replacement = '{{$1}}';
return preg_replace($pattern, $replacement, $template_string);
}
$html = $doc->saveHTML();
$html = fix_template_variable_tokens($html);
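To make the workaround easier to try in isolation, here is a small self-contained sketch. The template token {{profile_url}} is an invented name, and whether the braces actually get percent-encoded depends on the libxml version PHP is built against, so treat the comments as an assumption about typical behaviour rather than a guarantee.

```php
<?php
// Hypothetical template fragment; {{profile_url}} is an invented token name.
$doc = new DOMDocument();
$doc->loadHTML('<html><body><a href="{{profile_url}}">Profile</a></body></html>');

// On many libxml builds the braces inside href come back as %7B%7B...%7D%7D.
$html = $doc->saveHTML();

// Restore {{token}} placeholders that were percent-encoded on output.
// The /i flag also catches lowercase hex digits (%7b, %7d).
$restored = preg_replace('/%7B%7B(\w+)%7D%7D/i', '{{$1}}', $html);
echo $restored;
```

If the serializer left the braces alone, the preg_replace() call simply matches nothing and the markup passes through unchanged.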
I'm pulling images from my Flickr account to my website, and I had used about nine lines of code to create a preg_match_all function that would pull the images.
I've read several times that it is better to parse HTML through DOM.
Personally, I've found it more complicated to parse HTML through DOM. I made up a similar function to pull the images with PHP's DOMDocument, and it's about 22 lines of code. It took awhile to create, and I'm not sure what the benefit was.
The page loads at about the same time for each code, so I'm not sure why I would use DOMDocument.
Does DOMDocument work faster than preg_match_all?
I'll show you my code, if you're interested (you can see how lengthy the DOMDocument code is):
//here's the URL
$flickrGallery = 'http://www.flickr.com/photos/***/collections/***/';
//below is the DOMDocument method
$flickr = new DOMDocument();
$flickr->validateOnParse = true;
$flickr->loadHTMLFile($flickrGallery);
$elements = $flickr->getElementById('ViewCollection')->getElementsByTagName('div');
$flickr = array();
for($i=0;$i<$elements->length;$i++){
if($elements->item($i)->hasAttribute('class')&&$elements->item($i)->getAttribute('class')=='setLinkDiv'){
$flickr[] = array(
'href' => $elements->item($i)->getElementsByTagName('a')->item(0)->getAttribute('href'),
'src' => $elements->item($i)->getElementsByTagName('img')->item(0)->getAttribute('src'),
'title' => $elements->item($i)->getElementsByTagName('img')->item(0)->getAttribute('alt')
);
}
}
$elements = NULL;
foreach($flickr as $k=>$v){
$setQuery = explode("/",$flickr[$k]['href']);
$setQuery = $setQuery[4];
echo '<img src="'.$flickr[$k]['src'].'" title="'.$flickr[$k]['title'].'" width=75 height=75 />';
}
$flickr = NULL;
//preg_match_all code is below
$sets = file_get_contents($flickrGallery);
preg_match_all('/(class="setLink" href="(.*?)".*?class="setThumb" src="(.*?)".*?alt="(.*?)")+/s',$sets,$sets,PREG_SET_ORDER);
foreach($sets as $k=>$v){
$setQuery = explode("/",$sets[$k][2]);
$setQuery = $setQuery[4];
echo '<img src="'.$sets[$k][3].'" title="'.$sets[$k][4].'" width=75 height=75 />';
}
$sets = NULL;
If you're willing to sacrifice correctness for speed, then go ahead and try to roll your own parser with regular expressions.
You say "Personally, I've found it more complicated to parse HTML through DOM." Are you optimizing for correctness of results, or how easy it is for you to write the code?
If all you want is speed and code that's not complicated, why not just use this:
$array_of_photos = Array( 'booger.jpg', 'aunt-martha-on-a-horse.png' );
or maybe just
$array_of_photos = Array();
Those run in constant time, and they're easy to understand. No problem, right?
What's that? You want accurate results? Then don't parse HTML with regular expressions.
Finally, when you're working with a parser like DOM, you're working with a piece of code that has been well-tested and debugged for years. When you're writing your own regular expressions to do the parsing, you're working with code that you're going to have to write, test and debug yourself. Why would you not want to work with the tools that many people have been using for many years? Do you think you can do a better job yourself on the fly?
I would use DOM as this is less likely to break if any small changes are made to the page.
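The DOM version also doesn't have to be 22 lines. Here is a rough sketch using DOMXPath against a hand-written sample that reuses the id and class names from the question's code (ViewCollection, setLinkDiv); the real Flickr markup may well differ, so the sample HTML is an assumption.

```php
<?php
// Assumed sample markup, modelled on the id/class names in the question's code.
$sample = '<div id="ViewCollection">'
        . '<div class="setLinkDiv"><a href="/photos/me/sets/1/"><img src="a.jpg" alt="Set A"></a></div>'
        . '<div class="other"><img src="skip.jpg" alt="Skip"></div>'
        . '<div class="setLinkDiv"><a href="/photos/me/sets/2/"><img src="b.jpg" alt="Set B"></a></div>'
        . '</div>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // keep parser warnings out of the output
$doc->loadHTML($sample);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$sets = [];

// One XPath query replaces the manual loop over every div and its class check.
foreach ($xpath->query('//div[@id="ViewCollection"]//div[@class="setLinkDiv"]') as $div) {
    $link  = $xpath->query('.//a', $div)->item(0);
    $image = $xpath->query('.//img', $div)->item(0);
    if ($link === null || $image === null) {
        continue;   // skip blocks that lack a link or a thumbnail
    }
    $sets[] = [
        'href'  => $link->getAttribute('href'),
        'src'   => $image->getAttribute('src'),
        'title' => $image->getAttribute('alt'),
    ];
}

print_r($sets);
```

Unlike the regex, this doesn't care about attribute order or what sits between the link and the thumbnail.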
After using curl on an external page, I've got all of its source code, which contains something like this (the part I'm interested in):
(page...)<td valign='top' class='rdBot' align='center'><img src="/images/buy_tickets.gif" border="0" alt="T"></td> (page...)
So I'm using preg_match_all, and I want to get only "buy_tickets.gif"
$pattern_before = "<td valign='top' class='rdBot' align='center'>";
$pattern_after = "</td>";
$pattern = '#'.$pattern_before.'(.*?)'.$pattern_after.'#si';
preg_match_all($pattern, $buffer, $matches, PREG_SET_ORDER);
Everything is fine up to now... but the problem is that sometimes the external page changes and the image I'm looking for is inside a link
(page...)<td valign='top' class='rdBot' align='center'><img src="/images/buy_tickets.gif" border="0" alt="T"></td> (page...)
and I don't know how to make my code work in both cases (not just when the image has no link)
hope u understand
thanks in advance
Don't use regex to parse HTML, Use PHP's DOM Extension. Try this:
$doc = new DOMDocument;
@$doc->loadHTMLFile( 'http://ventas.entradasmonumental.com/eventperformances.asp?evt=18' ); // Using the @ operator to hide parse errors
$xpath = new DOMXPath( $doc );
$img = $xpath->query( '//td[@class="BrdBot"][@align="center"][1]//img[1]')->item( 0 ); // Xpath->query returns a 'DOMNodeList', get the first item which is a 'DOMElement' (or null)
$imgSrc = $img->getAttribute( 'src' );
$imgSrcInfo = pathinfo( $imgSrc );
$imgFilename = $imgSrcInfo['basename']; // All you need
You're going to get lots of advice not to use regex for pulling stuff out of HTML code.
There are times when it's appropriate to use regex for this kind of thing, and I don't always agree with the somewhat rigid advice given on the subject here (and elsewhere). However in this case, I would say that regex is not the appropriate solution for you.
The problem with using regex for searching for things in HTML code is exactly the problem you've encountered -- HTML code can vary wildly, making any regex virtually impossible to get right.
It is just about possible to write a regex for your situation, but it will be an insanely complex regex, and very brittle -- ie prone to failing if the HTML code is even slightly outside the parameters you expect.
Contrast this with the recommended solution, which is to use a DOM parser. Load the HTML code into a DOM parser, and you will immediately have an object structure which you can query for individual elements and attributes.
The details you've given make it almost a no-brainer to go with this rather than a regex.
PHP has a built-in DOM parser, which you can call as follows:
$mydom = new DOMDocument;
$mydom->loadHTMLFile("http://....");
You can then use XPath to search the DOM for your specific element or attribute that you want:
$myxpath = new DOMXPath($mydom);
$myattr = $myxpath->query("//td[@class='rdBot']//img[1]/@src");
Hope that helps.
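Here is a minimal runnable sketch of that approach, run against an inline string instead of the live page. The first table cell is the one from the question; the second cell, with the image wrapped in a link, is an assumed variant (the /buy href and sold_out.gif are invented), there to show that the same query handles both cases.

```php
<?php
// First cell: markup from the question. Second cell: assumed "image inside a link" variant.
$buffer = "<table><tr>"
        . "<td valign='top' class='rdBot' align='center'><img src=\"/images/buy_tickets.gif\" border=\"0\" alt=\"T\"></td>"
        . "<td valign='top' class='rdBot' align='center'><a href=\"/buy\"><img src=\"/images/sold_out.gif\" border=\"0\" alt=\"S\"></a></td>"
        . "</tr></table>";

$mydom = new DOMDocument();
libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$mydom->loadHTML($buffer);
libxml_clear_errors();

$myxpath = new DOMXPath($mydom);
$filenames = [];

// "//img" matches at any depth below the cell, so a wrapping <a> makes no difference.
foreach ($myxpath->query("//td[@class='rdBot']//img/@src") as $srcAttr) {
    $filenames[] = basename($srcAttr->nodeValue);
}

print_r($filenames);
```

Keep in mind that XPath attribute comparisons are case-sensitive, so the class value has to match the page exactly.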
function GetFilename($file) {
$filename = substr($file, strrpos($file,'/')+1,strlen($file)-strrpos($file,'/'));
return $filename;
}
echo GetFilename('/images/buy_tickets.gif');
This will output buy_tickets.gif
Do you only need images inside of the "td" tags?
$regex='/<img src="\/images\/([^"]*)"[^>]*>/im';
edit:
to grab the specific image this should work:
$regex='/<td valign=\'top\' class=\'rdBot\' align=\'center\'>.*src="\/images\/([^"]*)".*<\/td>/';
Parsing HTML with Regex is not recommended, as has been mentioned by several posters.
However, if the path of your images always follows the pattern src="/images/name.gif", you can easily extract it in Regex:
$pattern = <<<EOD
#src\s*=\s*['"]/images/(.*?)["']#
EOD;
If you are sure that the images always follow the path "/images/name.ext" and that you don't care where the image link is located in the page, this will do the job. If you have more detailed requirements (such matching only within a specific class), forget Regex, it's not the right tool for the job.
I just read in your comments that you need to match within a specific tag. Use a parser, it will save you untold headaches.
If you still want to go through regex, try this:
\(?<=<td .*?class\s*=\s*['"]rdBot['"][^<>]*?>.*?)(?<!</td>.*)<img [^<>]*src\s*=\s*["']/images/(.*?)["']\i
This should work. It does work in C#, I am not totally sure about php's brand of regex.
I'm using PHP's "simplexml_load_file" to get some data from Flickr.
My goal is to get the photo url.
I'm able to get the following value (assigned to PHP variable):
<p>codewrecker posted a photo:</p>
<p><img src="http://farm3.static.flickr.com/2298/2302759205_4fb109f367_m.jpg" width="180" height="240" alt="Santa Monica Pier" /></p>
How can I extract just this part of it?
http://farm3.static.flickr.com/2298/2302759205_4fb109f367_m.jpg
Just in case it helps, here's the code I'm working with:
<?php
$xml = simplexml_load_file("http://api.flickr.com/services/feeds/photos_public.gne?id=19725893@N00&lang=en-us&format=xml&tags=carousel");
foreach($xml->entry as $child) {
$flickr_content = $child->content; // gets html including img url
// how can I get the img url from "$flickr_content"???
}
?>
You can probably get away with using a regular expression for this, assuming that the way the HTML is formed is pretty much going to stay the same, e.g.:
if (preg_match('/<img src="([^"]+)"/i', $string, $matches)) {
$imageUrl = $matches[1];
}
This is fairly un-robust, and if the HTML is going to change (e.g. the order of parameters in the <img> tag, risk of malformed HTML etc.), you would be better off using an HTML parser.
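For comparison, a parser-based version could look roughly like the sketch below. It hard-codes the HTML fragment from the question in place of $child->content, so it can be run without hitting the feed.

```php
<?php
// HTML fragment as shown in the question (stands in for $child->content).
$flickr_content = '<p>codewrecker posted a photo:</p>'
    . '<p><img src="http://farm3.static.flickr.com/2298/2302759205_4fb109f367_m.jpg" '
    . 'width="180" height="240" alt="Santa Monica Pier" /></p>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // fragments rarely parse without warnings
$doc->loadHTML($flickr_content);
libxml_clear_errors();

// Take the first <img> in the fragment, if there is one.
$img = $doc->getElementsByTagName('img')->item(0);
$imageUrl = $img !== null ? $img->getAttribute('src') : null;

echo $imageUrl, "\n";
```

This keeps working if the attribute order changes or extra attributes appear before src.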
It's not solving your problem(and probably total overkill), but worth mentioning because I've used the library on 2 projects and it's well written.
phpFlickr - http://phpflickr.com/
Easy way: Combination of substr and strpos to extract first the tag and then the src='...' value, and finally the target string.
Slightly more difficult way (BUT MUCH MORE ROBUST): Use an XML parsing library such as simpleXML
I hope this is helpful. I enjoy using xpath to cut through the XML I get back from SimpleXML:
<?php
$xml = new SimpleXMLElement("http://api.flickr.com/services/feeds/photos_public.gne?id=19725893@N00&lang=en-us&format=xml&tags=carousel", 0, true);
$images = $xml->xpath('//img'); //use xpath on the XML to find the img tags
foreach($images as $image){
echo $image['src'] ; //here is the image URL
}
?>
I am trying to create a simple alert app for some friends.
Basically I want to be able to extract the "price" and "stock availability" data from a webpage like the following two:
http://www.sparkfun.com/commerce/product_info.php?products_id=5
http://www.sparkfun.com/commerce/product_info.php?products_id=9279
I have made the alert via e-mail and sms part but now i want to be able to get the quantity and price out of the webpages (those 2 or any other ones) so that i can compare the price and quantity available and alert us to make an order if a product is between some thresholds.
I have tried some regex (found in some tutorials, but I am way too n00b for this) but haven't managed to get it working. Any good tips or examples?
$content = file_get_contents('http://www.sparkfun.com/commerce/product_info.php?products_id=9279');
preg_match('#<tr><th>(.*)</th> <td><b>price</b></td></tr>#', $content, $match);
$price = $match[1];
preg_match('#<input type="hidden" name="quantity_on_hand" value="(.*?)">#', $content, $match);
$in_stock = $match[1];
echo "Price: $price - Availability: $in_stock\n";
It's called screen scraping, in case you need to google for it.
I would suggest that you use a dom parser and xpath expressions instead. Feed the HTML through HtmlTidy first, to ensure that it's valid markup.
For example:
$html = file_get_contents("http://www.example.com");
$html = tidy_repair_string($html);
$doc = new DomDocument();
$doc->loadHtml($html);
$xpath = new DomXPath($doc);
// Now query the document:
foreach ($xpath->query('//table[@class="pricing"]//th') as $node) {
    echo $node->nodeValue, "\n";
}
What ever you do: Don't use regular expressions to parse HTML or bad things will happen. Use a parser instead.
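As a rough sketch of what that could look like for the price and stock values: the sample markup below is reconstructed from the patterns in the regex answer above (a th next to a "price" cell, and a hidden quantity_on_hand input). The actual page structure and the values $19.95 and 42 are assumptions, so adjust the XPath expressions to whatever the real page contains.

```php
<?php
// Assumed sample markup, reconstructed from the regex patterns above; values are invented.
$content = '<table><tr><th>$19.95</th> <td><b>price</b></td></tr></table>'
         . '<form><input type="hidden" name="quantity_on_hand" value="42"></form>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$doc->loadHTML($content);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// evaluate() with string() returns the text directly, or '' when nothing matches.
$price    = trim($xpath->evaluate('string(//tr[td/b="price"]/th)'));
$in_stock = (int) $xpath->evaluate('string(//input[@name="quantity_on_hand"]/@value)');

echo "Price: $price - Availability: $in_stock\n";
```

An empty $price is a cheap signal that the page layout changed, which is worth checking before sending any alert.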
First, this question is asking for too much detail. Second, extracting data from a website might not be legitimate. That said, here are some hints:
Use Firebug or the Chrome/Safari Inspector to explore the HTML and find the pattern around the information you want
Test your regex to see if it matches. You may need several passes (multi-pass parsing/extraction)
Write a client with cURL or, even simpler, use file_get_contents (note that some hosts disable loading URLs with file_get_contents)
Personally, I would rather use Tidy to convert the page to valid XHTML and then use XPath to extract the data, instead of regex. Why? Because XHTML is not regular and XPath is very flexible. You can also learn XSLT to transform it.
Good luck!
You are probably best off loading the HTML code into a DOM parser like this one and searching for the "pricing" table. However, any kind of scraping you do can break whenever they change their page layout, and is probably illegal without their consent.
The best way, though, would be to talk to the people who run the site, and see whether they have alternative, more reliable forms of data delivery (Web services, RSS, or database exports come to mind).
This is the simplest method I know to extract data from a website. I found that all of my data sits inside <h3> tags, so I prepared this:
<?php
include('simple_html_dom.php');
// Create DOM from URL, paste your destined web url in $page
$page = 'http://facebook4free.com/category/facebookstatus/amazing-facebook-status/';
$html = new simple_html_dom();
//Within $html your webpage will be loaded for further operation
$html->load_file($page);
// Find all links
$links = array();
//Within find() function, I have written h3 so it will simply fetch the content from <h3> tag only. Change as per your requirement.
foreach($html->find('h3') as $element)
{
$links[] = $element;
}
reset($links);
//$out will be having each of HTML element content you searching for, within that web page
foreach ($links as $out)
{
echo $out;
}
?>