Wrong images, regular expressions [duplicate] - php

This question already has answers here:
image problems with regular expressions
(2 answers)
Closed 8 years ago.
I need a little bit of help. I got an assignment for school: I need to write a regular expression script that gets an image (and later uploads it to the database, but that's not the problem). The real problem is that I get an array with all the images from the page, when I should get just one image, which is:
data-src-l="/WebRoot/products/8020/80203122/bilder/80203122.jpg"
This is the full markup for that image:
<li>
<a href="/WebRoot/products/8020/80203122/bilder/80203122.jpg">
<img
itemprop="image"
alt="Jesus Remember Me - Taize Songs (2CD)"
src="/WebRoot/AsaphNL/Shops/asaphnl/5422/8F43/62EE/D698/EF8E/4DEB/AED5/3B0E/80203122_xs.jpg"
data-src-xs="/WebRoot/AsaphNL/Shops/asaphnl/5422/8F43/62EE/D698/EF8E/4DEB/AED5/3B0E/80203122_xs.jpg"
data-src-s="/WebRoot/products/8020/80203122/bilder/80203122_s.jpg"
data-src-m="/WebRoot/products/8020/80203122/bilder/80203122_m.jpg"
data-src-l="/WebRoot/products/8020/80203122/bilder/80203122.jpg"
/>
</a>
</li>
</ul>
This is the code with PHP:
<?php
header('Content-Type: text/html; charset=utf-8');
$url = "http://www.asaphshop.nl/epages/asaphnl.sf/nl_NL/?ObjectPath=/Shops/asaphnl/Products/80203122";
$htmlcode = file_get_contents($url);
$pattern = "/<img\s[^>]*?src\s*=\s*['\"]([^'\"]*?)['\"][^>]*?>/";
preg_match_all($pattern, $htmlcode, $matches);
//print_r ($matches);
$image = ($matches[0]);
$image = str_replace('src="/', 'src="http://www.asaphshop.nl/', $image);
print_r ($image);
?>
UPDATE: the image link must be prefixed with http://www.asaphshop.nl, so the image is looked up on that site and not on my localhost. If you don't understand me, just ask ;)

(<img\s[^>]*?data-src-l\s*=\s*['\"])([^'\"]*?['\"])([^>]*?>)
Try this. It will match the required img. Replace with $1http://www.asaphshop.nl$2$3. See the demo:
http://regex101.com/r/wQ1oW3/29
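As a rough sketch of how that pattern and replacement could be applied in PHP with preg_replace: the sample tag is shortened from the question's markup, and the "~" delimiters and the ${1} group syntax are choices made here, not part of the answer.

```php
<?php
// Sample <img> tag, shortened from the markup in the question.
$html = '<img itemprop="image"'
      . ' src="/WebRoot/AsaphNL/Shops/asaphnl/5422/8F43/62EE/D698/EF8E/4DEB/AED5/3B0E/80203122_xs.jpg"'
      . ' data-src-l="/WebRoot/products/8020/80203122/bilder/80203122.jpg" />';

// The pattern from the answer, wrapped in "~" delimiters so "/" needs no escaping.
$pattern = '~(<img\s[^>]*?data-src-l\s*=\s*[\'"])([^\'"]*?[\'"])([^>]*?>)~';

// ${1} rather than $1 keeps the group reference unambiguous next to literal text.
$result = preg_replace($pattern, '${1}http://www.asaphshop.nl${2}${3}', $html);

echo $result, "\n";
```

Only the data-src-l attribute gets the host prefix; the other attributes are left as they were.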

I need a little bit of help. I got an assignment for school, I need to make a regular expression script which get an image (and later upload to the database, but that's not the problem).
Tell your school that regular expressions are not the best tool for the job.
Sure, there is this argument that regular expressions are not so regular and can be used for tasks such as palindrome matching. But that doesn't mean you should use them, since it will cause a lot of headache to you and other developers that might need to work with your code later.
What you should use instead is a proper HTML/XML parser.
Fortunately enough, PHP has what it needs, and it's called DOMDocument. Take a look at its getElementsByTagName method, for example. You could use it to retrieve images. Then you could iterate through all the attributes and parse them the way you want.
Not only is it safer, since you don't have to worry about edge cases, it's also more readable.
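A minimal sketch of that DOMDocument approach, assuming the markup looks like the snippet in the question (shortened here) and that only the data-src-l attribute is wanted:

```php
<?php
// Markup shortened from the question; on the real page it would come from file_get_contents().
$html = '<ul><li><a href="/WebRoot/products/8020/80203122/bilder/80203122.jpg">'
      . '<img itemprop="image" alt="Jesus Remember Me - Taize Songs (2CD)"'
      . ' src="/WebRoot/AsaphNL/Shops/asaphnl/5422/8F43/62EE/D698/EF8E/4DEB/AED5/3B0E/80203122_xs.jpg"'
      . ' data-src-l="/WebRoot/products/8020/80203122/bilder/80203122.jpg" />'
      . '</a></li></ul>';

$doc = new DOMDocument();
libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$doc->loadHTML($html);
libxml_clear_errors();

$images = [];
foreach ($doc->getElementsByTagName('img') as $img) {
    // Keep only images that carry the large-size attribute.
    if ($img->hasAttribute('data-src-l')) {
        $images[] = 'http://www.asaphshop.nl' . $img->getAttribute('data-src-l');
    }
}

print_r($images);
```

Images without a data-src-l attribute are skipped, so the thumbnails elsewhere on the page would not end up in the array.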

Related

Put HTML String in PHP [duplicate]

This question already has answers here:
Reference Guide: What does this symbol mean in PHP? (PHP Syntax)
(24 answers)
What is the difference between client-side and server-side programming?
(3 answers)
Closed 9 years ago.
So I'm trying to define $picurl1 so that it uses the value in $pic1. So in the end I want it to be:
<img src="./pictures/{definition of pic1}.png">
Right now I use this php code:
$pic1 = '<script src="pic.js"></script>';
$picurl1 = '<img src="./pictures/' + $pic1 + '.png'">';
Sorry if I'm not being very clear. I don't really know how to explain it. I hope you understand.
In other words, please tell me what I should change $picurl1 to.
By the way the script comes up with a random picture name without the '.png'.
Thanks in advance.
For starters, you're using the wrong operator to concatenate strings in PHP. I think you mean this:
$picurl1 = '<img src="./pictures/' . $pic1 . '.png">';
More to the point, what is "definition of pic1"? Do you mean that the code in pic.js will randomly choose a file name, and you want its result to be the URL used in the img tag?
The problem you're encountering, then, is that PHP runs on the server while JavaScript runs on the client. So your PHP code can't use the result of pic.js because it won't have a result until the browser runs it, after the PHP code is done.
So you need to get that result client-side in JavaScript code.
How does pic.js create that result? That is, is there a function in pic.js? For now I'm going to assume there is, and I'm going to assume that function is called something like getFileName. (Just for the purpose of this example.)
After you included the JavaScript code, and after the img tag is in the document, you can call that function and set the src of the img tag to its results. To help us identify the img tag, let's give it an id:
<img src="default.gif" id="theImage" alt="This is a dynamic image" />
(I gave it a default value for the src since an empty value is invalid. I also gave it an alt value for completeness.) To change its src value to the result of a function, you'd do something like the following:
document.getElementById('theImage').src = getFileName();
Remember, this is all client-side code. The only way you can use the "result" in PHP code is if the calculation is done in PHP, not in JavaScript.
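If the random choice can move to the server, a sketch along these lines would let PHP build the tag itself. The list of names is hypothetical and stands in for whatever pic.js picks from:

```php
<?php
// Hypothetical picture names (without ".png"); replace with the real set.
$names = ['greenButterfly7', 'redRose2', 'blueSky1'];

// array_rand() returns a random key of the array.
$pic1 = $names[array_rand($names)];

// Escape the value before placing it inside an HTML attribute.
$picurl1 = '<img src="./pictures/' . htmlspecialchars($pic1) . '.png">';

echo $picurl1;
```

Because the name is chosen before the page is sent, no JavaScript is needed to fill in the src.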
Keep in mind that all server-side code is executed before any client-side code (JavaScript, HTML, CSS, ...). So your code does not make sense: you cannot embed the result of code that has not run yet inside code that executes earlier.
If your JS code must return something, remove the PHP code and simply use HTML instead.
I tested this successfully:
$picName = "greenButterfly7"; //note no spaces inbetween green and butterfly
$picurl1 = "<img src='./pictures/" . $picName . ".png'>";
echo $picurl1;
or in pure HTML form:
<img src='pictures/greenButterfly7.png'>
or in embedded form (PHP inside HTML):
<img src='pictures/<?php echo $picName; ?>.png'>

Changing/deleting html from file_get_contents

I'm currently using this code:
$blog= file_get_contents("http://powback.tumblr.com/post/" . $post);
echo $blog;
And it works. But tumblr has added a script that activates each time you enter a password-field. So my question is:
Can I remove certain parts of what file_get_contents returns? Or just remove everything above the <html> tag? Could I possibly kill a whole div so it won't load at all? And if so, how?
edit:
I managed to do it the simple way, by skipping the first 766 characters. The script now works as intended!
$blog = file_get_contents("http://powback.tumblr.com/post/" . $post, false, null, 766);
After file_get_contents returns, you have in your hands a string. You can do anything you want to it, including cutting out parts of it.
There are two ways to actually do the cutting:
1. Using string functions like str_replace, preg_replace and others; the exact recipe depends on what you need to do. This approach is somewhat frowned upon because you are working at the wrong level of abstraction, but in some cases it offers an unmatched ratio of results to time spent.
2. Parsing the HTML into a DOM tree, modifying it appropriately (this time working at the right level of abstraction), and then turning it back into a string and echoing it. This is more convenient and easier to maintain if your requirements are not dead simple, but it typically requires more code.
If you want to do something that's most naturally expressed in HTML document terms ("cutting out this <div>"), don't be tempted by the string functions; go with the second approach.
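A minimal sketch of the second approach, using PHP's built-in DOMDocument. The markup and the "login" id are hypothetical stand-ins for the real Tumblr page:

```php
<?php
// Stand-in for the string returned by file_get_contents(); the markup is made up.
$blog = '<html><body><div id="login"><script>track();</script></div>'
      . '<div id="post"><p>Hello</p></div></body></html>';

$doc = new DOMDocument();
libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$doc->loadHTML($blog);
libxml_clear_errors();

// Copy the node list first: it is live, so removing while iterating would skip nodes.
foreach (iterator_to_array($doc->getElementsByTagName('script')) as $script) {
    $script->parentNode->removeChild($script);
}

// Drop one whole <div>, selected by its id with XPath.
$xpath = new DOMXPath($doc);
foreach (iterator_to_array($xpath->query('//div[@id="login"]')) as $div) {
    $div->parentNode->removeChild($div);
}

// Turn the modified tree back into a string.
$clean = $doc->saveHTML();
echo $clean;
```

Unlike skipping a fixed number of characters, this keeps working if the unwanted block moves or changes length.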
At that point, $blog is just a string, so you can use normal PHP functions to alter it. Look into these 2:
http://php.net/manual/en/function.str-replace.php
http://us2.php.net/manual/en/function.preg-replace.php
You can parse your output using Simple HTML DOM Parser and display only the contents that you really want to display.

How to pull specific content from HTML using PHP? [duplicate]

This question already has answers here:
Closed 11 years ago.
Possible Duplicate:
How to parse and process HTML with PHP?
How do I go about pulling specific content from a given live online HTML page?
For example: http://www.gumtree.com/p/for-sale/ovation-semi-acoustic-guitar/93991967
I want to retrieve the text description, the path to the main image and the price only. So basically, I want to retrieve content which is inside specific divs with maybe specific IDs or classes inside a html page.
Pseudocode:
$page = load_html_contents('http://www.gumtr..');
$price = getPrice($page);
$description = getDescription($page);
$title = getTitle($page);
Please note I do not intend to steal any content from gumtree, or anywhere else for that matter, I am just providing an example.
First of all, what you want to do is called web scraping.
Basically, you load the HTML content into a variable and then use regular expressions to search for specific ids, etc.
Search for "web scraping".
Here is a basic tutorial.
This book should be useful too.
Something like this would be a good starting point if you want tabular output:
$raw = file_get_contents($url) or die('could not select');
// Strip whitespace runs and line breaks so the markers below are easy to find
$newlines = array("\t", "\n", "\r", "\x20\x20", "\0", "\x0B", "<br/>");
$content = str_replace($newlines, "", html_entity_decode($raw));
// '<some id> ' and '</ending id>' are placeholders for the markup around the table
$start = strpos($content, '<some id> ');
$end = strpos($content, '</ending id>');
$table = substr($content, $start, $end - $start);
preg_match_all("|<tr(.*)</tr>|U", $table, $rows);
foreach ($rows[0] as $row) {
    // Skip header rows
    if (strpos($row, '<th') === false) {
        // array to vars
        preg_match_all("|<td(.*)</td>|U", $row, $cells);
        $var1 = strip_tags($cells[0][0]);
        $var2 = strip_tags($cells[0][1]);
        // ...and so on for the remaining cells
    }
}
The tutorial Easy web scraping with PHP, recommended by robotrobert, is a good place to start; I have left several comments on it. For better performance, use curl: among other things, it handles HTTP headers, SSL, cookies, proxies, etc. Cookies are something you must pay attention to.
I just found HTML Parsing and Screen Scraping with the Simple HTML DOM Library. It is more advanced, and it makes page parsing easier and faster by using a DOM parser instead of regular expressions, which are hard to master and resource-consuming. I recommend this last one 100%.
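For comparison, here is a sketch of the same idea using PHP's built-in DOMDocument and DOMXPath instead of the Simple HTML DOM library. The page fragment, the ids, the class name and the firstText() helper are all hypothetical; the real page would need its own selectors.

```php
<?php
// Hypothetical page fragment; ids, class names and values are made up for illustration.
$page = '<html><body>'
      . '<h1 id="ad-title">Ovation semi-acoustic guitar</h1>'
      . '<span class="price highlight">250.00</span>'
      . '<img id="main-image" src="/images/guitar.jpg" alt="">'
      . '<div id="description"> Good condition, with case. </div>'
      . '</body></html>';

$doc = new DOMDocument();
libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$doc->loadHTML($page);
libxml_clear_errors();
$xpath = new DOMXPath($doc);

// Hypothetical helper: trimmed text of the first node matching the query, or null.
function firstText(DOMXPath $xpath, string $query): ?string
{
    $nodes = $xpath->query($query);
    return ($nodes !== false && $nodes->length > 0)
        ? trim($nodes->item(0)->textContent)
        : null;
}

$title = firstText($xpath, '//h1[@id="ad-title"]');
// Matches "price" as one of several space-separated class names.
$price = firstText($xpath, '//span[contains(concat(" ", normalize-space(@class), " "), " price ")]');
$description = firstText($xpath, '//div[@id="description"]');
$image = firstText($xpath, '//img[@id="main-image"]/@src');

print_r(compact('title', 'price', 'description', 'image'));
```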

PHP Regular expressions problem [duplicate]

This question already has answers here:
Parse HTML with PHP's HTML DOMDocument
(2 answers)
PHP Regular expressions class/id
(1 answer)
Closed 8 years ago.
I'm using regex to pull info from a html table.
But I'm messing up somehow, and have no idea why.
PHP CODE:
$printable = file_get_contents('./testplaylist.php', true);
if(preg_match_all('/<TR[^>]*>(.*?)<\/TR>/si', $printable, $matches, PREG_SET_ORDER)); {
foreach($matches as $match) {
$data = "$match[1]";
echo("$data <br />");
}
}
HTML DATA:
<TR class=" light ">
Stuff in here
</TR>
Any help would be appreciated,
Thanks!
Try this one instead
http://sandbox.phpcode.eu/g/bba70.php
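// s lets . match newlines and i ignores case; the U flag is left out because it would turn .*? greedy
if(preg_match_all('/<TR[^>]*>(.*?)<\/TR>/si', $printable, $matches)) {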
foreach($matches[1] as $match) {
echo("$match <br />");
}
}
I know what your first problem is. regex! I kid! but have you checked out PHP DOM?
http://www.php.net/manual/en/domdocument.loadhtmlfile.php
It would probably work in your case just fine. It would be 10x easier too.
Some people, when confronted with a problem, think “I know, I'll use
regular expressions.” Now they have two problems. -Jamie Zawinski
Works fine here. It should work unless you have nested tables.
The problem must be in your data source. Do some tracing with var_dump.
Use PHP's document object model to be safe when parsing HTML. Except for very simple regexes, HTML parsing rapidly gets out of control when you DIY. There's a bit of overhead to set it up, but once you get going it's straightforward.
See DOM for instructions on how to use it.
If you stick with the regex technique, you can also escape every '<' and '>' (PCRE does not require it, but it does no harm), e.g.:
/\<TR[^>]*\>(.*?)\<\/TR\>/si
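A sketch of the DOM route for this table, assuming the rows hold their text in cells. The sample string stands in for the contents of testplaylist.php, and the second row is invented for the example:

```php
<?php
// Stand-in for the file contents; the second row is made up.
$printable = '<TABLE>'
           . '<TR class=" light "><TD>Stuff in here</TD></TR>'
           . '<TR class=" dark "><TD>More stuff</TD></TR>'
           . '</TABLE>';

$doc = new DOMDocument();
libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$doc->loadHTML($printable);
libxml_clear_errors();

$rows = [];
// The HTML parser lowercases tag names, so 'tr' also finds <TR>.
foreach ($doc->getElementsByTagName('tr') as $tr) {
    $rows[] = trim($tr->textContent);
}

foreach ($rows as $row) {
    echo $row, " <br />\n";
}
```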

using preg_match_all to get name of image

After fetching an external page with curl, I have its full source code, which contains something like this (the part I'm interested in):
(page...)<td valign='top' class='rdBot' align='center'><img src="/images/buy_tickets.gif" border="0" alt="T"></td> (page...)
So I'm using preg_match_all, and I want to get only "buy_tickets.gif":
$pattern_before = "<td valign='top' class='rdBot' align='center'>";
$pattern_after = "</td>";
$pattern = '#'.$pattern_before.'(.*?)'.$pattern_after.'#si';
preg_match_all($pattern, $buffer, $matches, PREG_SET_ORDER);
Everything is fine up to now, but the problem is that the external page sometimes changes and the image I'm looking for ends up inside a link:
(page...)<td valign='top' class='rdBot' align='center'><img src="/images/buy_tickets.gif" border="0" alt="T"></td> (page...)
and I don't know how to make my code work in both cases (not just when the image has no link).
I hope you understand.
Thanks in advance.
Don't use regex to parse HTML, Use PHP's DOM Extension. Try this:
$doc = new DOMDocument;
@$doc->loadHTMLFile( 'http://ventas.entradasmonumental.com/eventperformances.asp?evt=18' ); // Using the @ operator to hide parse errors
$xpath = new DOMXPath( $doc );
$img = $xpath->query( '//td[@class="BrdBot"][@align="center"][1]//img[1]')->item( 0 ); // Xpath->query returns a 'DOMNodeList', get the first item which is a 'DOMElement' (or null)
$imgSrc = $img->getAttribute( 'src' );
$imgSrcInfo = pathinfo( $imgSrc );
$imgFilename = $imgSrcInfo['basename']; // All you need
You're going to get lots of advice not to use regex for pulling stuff out of HTML code.
There are times when it's appropriate to use regex for this kind of thing, and I don't always agree with the somewhat rigid advice given on the subject here (and elsewhere). However in this case, I would say that regex is not the appropriate solution for you.
The problem with using regex for searching for things in HTML code is exactly the problem you've encountered -- HTML code can vary wildly, making any regex virtually impossible to get right.
It is just about possible to write a regex for your situation, but it will be an insanely complex regex, and very brittle -- ie prone to failing if the HTML code is even slightly outside the parameters you expect.
Contrast this with the recommended solution, which is to use a DOM parser. Load the HTML code into a DOM parser, and you will immediately have an object structure which you can query for individual elements and attributes.
The details you've given make it almost a no-brainer to go with this rather than a regex.
PHP has a built-in DOM parser, which you can call as follows:
$mydom = new DOMDocument;
$mydom->loadHTMLFile("http://....");
You can then use XPath to search the DOM for your specific element or attribute that you want:
$myxpath = new DOMXPath($mydom);
$myattr = $myxpath->query("//td[@class='rdBot']//img[1]/@src");
Hope that helps.
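To illustrate why the DOM and XPath route copes with the changing markup, here is a small sketch run against both variants from the question. The imageName() helper, the surrounding table tags and the "#" href are assumptions added for the example; the original link target is not shown in the question.

```php
<?php
// The cell without a link, as in the question (wrapped in a table for the parser).
$plain = "<table><tr><td valign='top' class='rdBot' align='center'>"
       . '<img src="/images/buy_tickets.gif" border="0" alt="T"></td></tr></table>';

// The same cell with the image inside a link; the href is a placeholder.
$linked = "<table><tr><td valign='top' class='rdBot' align='center'>"
        . '<a href="#"><img src="/images/buy_tickets.gif" border="0" alt="T"></a></td></tr></table>';

// Hypothetical helper: file name of the first image inside a td with class "rdBot".
function imageName(string $html): ?string
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // collect parser warnings instead of printing them
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    // "//img" matches at any depth below the cell, so a wrapping <a> makes no difference.
    $nodes = $xpath->query("//td[@class='rdBot']//img/@src");
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }
    return basename($nodes->item(0)->nodeValue);
}

echo imageName($plain), "\n";
echo imageName($linked), "\n";
```

Both calls return the same file name, because the query does not care how deeply the img is nested inside the cell.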
function GetFilename($file) {
$filename = substr($file, strrpos($file,'/')+1,strlen($file)-strrpos($file,'/'));
return $filename;
}
echo GetFilename('/images/buy_tickets.gif');
This will output buy_tickets.gif
Do you only need images inside of the "td" tags?
$regex='/<img src="\/images\/([^"]*)"[^>]*>/im';
edit:
to grab the specific image this should work:
$regex='/<td valign=\'top\' class=\'rdBot\' align=\'center\'>.*src="\/images\/([^"]*)".*<\/td>/';
Parsing HTML with Regex is not recommended, as has been mentioned by several posters.
However, if the path of your images always follows the pattern src="/images/name.gif", you can easily extract it in Regex:
$pattern = <<<EOD
#src\s*=\s*['"]/images/(.*?)["']#
EOD;
If you are sure that the images always follow the path "/images/name.ext" and that you don't care where the image link is located in the page, this will do the job. If you have more detailed requirements (such as matching only within a specific class), forget Regex, it's not the right tool for the job.
I just read in your comments that you need to match within a specific tag. Use a parser, it will save you untold headaches.
If you still want to go through regex, try this:
\(?<=<td .*?class\s*=\s*['"]rdBot['"][^<>]*?>.*?)(?<!</td>.*)<img [^<>]*src\s*=\s*["']/images/(.*?)["']\i
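This should work. It does work in C# (.NET), which allows variable-length lookbehind. PHP's PCRE engine rejects an unbounded lookbehind such as .*?, so the pattern will not compile there as written.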
