PHP Regular expressions problem [duplicate] - php

This question already has answers here:
Parse HTML with PHP's HTML DOMDocument
(2 answers)
PHP Regular expressions class/id
(1 answer)
Closed 8 years ago.
I'm using a regex to pull info from an HTML table.
But I'm messing up somehow, and have no idea why.
PHP CODE:
$printable = file_get_contents('./testplaylist.php', true);
if(preg_match_all('/<TR[^>]*>(.*?)<\/TR>/si', $printable, $matches, PREG_SET_ORDER)); {
foreach($matches as $match) {
$data = "$match[1]";
echo("$data <br />");
}
}
HTML DATA:
<TR class=" light ">
Stuff in here
</TR>
Any help would be appreciated,
Thanks!

Try this one instead
http://sandbox.phpcode.eu/g/bba70.php
if(preg_match_all('/<TR[^>]*>(.*?)<\/TR>/si', $printable, $matches)) {
foreach($matches[1] as $match) {
echo("$match <br />");
}
}

I know what your first problem is: regex! I kid! But have you checked out PHP DOM?
http://www.php.net/manual/en/domdocument.loadhtmlfile.php
It would probably work in your case just fine. It would be 10x easier too.
Some people, when confronted with a problem, think “I know, I'll use
regular expressions.” Now they have two problems. -Jamie Zawinski

Works fine here. It should work unless you have nested tables.
The problem must be in your data source. Do some tracing with var_dump.

Use PHP's document object model to be safe when parsing HTML. Except for very simple regexes, HTML parsing rapidly gets out of control when you DIY. There's a bit of overhead to set it up, but once you get going it's straightforward.
See DOM for instructions on how to use it.
If you stick to the regex technique, you can escape the '<' and '>' characters for clarity, although PCRE does not require it, e.g.
/\<TR[^>]*\>(.*?)\<\/TR\>/si
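A minimal sketch of the DOM approach recommended in the answers above, assuming the table markup is already available as a string. The function name extractRowTexts and the sample markup are illustrative and do not come from the original posts.

```php
<?php
// Hypothetical helper: returns the trimmed text of every <tr> in an HTML string.
function extractRowTexts(string $html): array
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid; collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $rows = [];
    // The HTML parser lower-cases tag names, so 'tr' also matches <TR>.
    foreach ($doc->getElementsByTagName('tr') as $row) {
        $rows[] = trim($row->textContent);
    }
    return $rows;
}

// Sample markup (an assumption): the row from the question, wrapped in a table cell.
$sample = '<table><TR class=" light "><td>Stuff in here</td></TR></table>';
print_r(extractRowTexts($sample));
```

Unlike the regex, this keeps working when rows carry extra attributes or span several lines, because the parser handles the markup rather than the pattern.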


Wrong images, regular expressions [duplicate]

This question already has answers here:
image problems with regular expressions
(2 answers)
Closed 8 years ago.
I need a little bit of help. I got an assignment for school: I need to write a regular expression script which gets an image (and later uploads it to the database, but that's not the problem). The real problem is that I get an array with all the images from the page, when there should be only one image, which is:
data-src-l="/WebRoot/products/8020/80203122/bilder/80203122.jpg"
this is the code from the whole image:
<li>
<a href="/WebRoot/products/8020/80203122/bilder/80203122.jpg">
<img
itemprop="image"
alt="Jesus Remember Me - Taize Songs (2CD)"
src="/WebRoot/AsaphNL/Shops/asaphnl/5422/8F43/62EE/D698/EF8E/4DEB/AED5/3B0E/80203122_xs.jpg"
data-src-xs="/WebRoot/AsaphNL/Shops/asaphnl/5422/8F43/62EE/D698/EF8E/4DEB/AED5/3B0E/80203122_xs.jpg"
data-src-s="/WebRoot/products/8020/80203122/bilder/80203122_s.jpg"
data-src-m="/WebRoot/products/8020/80203122/bilder/80203122_m.jpg"
data-src-l="/WebRoot/products/8020/80203122/bilder/80203122.jpg"
/>
</a>
</li>
</ul>
This is the code with PHP:
<?php
header('Content-Type: text/html; charset=utf-8');
$url = "http://www.asaphshop.nl/epages/asaphnl.sf/nl_NL/?ObjectPath=/Shops/asaphnl/Products/80203122";
$htmlcode = file_get_contents($url);
$pattern = "/<img\s[^>]*?src\s*=\s*['\"]([^'\"]*?)['\"][^>]*?>/";
preg_match_all($pattern, $htmlcode, $matches);
//print_r ($matches);
$image = ($matches[0]);
$image = str_replace('src="/', 'src="http://www.asaphshop.nl/', $image);
print_r ($image);
?>
UPDATE: The image link must be prefixed with http://www.asaphshop.nl, so that the image is looked up on that site and not on my localhost. If you don't understand me, just ask ;)
(<img\s[^>]*?data-src-l\s*=\s*['\"])([^'\"]*?['\"])([^>]*?>)
Try this. It will match the required img. Replace with $1http://www.asaphshop.nl$2$3. See the demo.
http://regex101.com/r/wQ1oW3/29
I need a little bit of help. I got an assignment for school: I need to write a regular expression script which gets an image (and later uploads it to the database, but that's not the problem).
Tell your school that regular expressions are not the best tool for the job.
Sure, there is this argument that regular expressions are not so regular and can be used for tasks such as palindrome matching. But that doesn't mean you should use them, since it will cause a lot of headache to you and other developers that might need to work with your code later.
What you should use instead is a proper HTML/XML parser.
Fortunately enough, PHP has what it needs, and it's called DOMDocument. Take a look at its getElementsByTagName method, for example. You could use it to retrieve images. Then you could iterate through all the attributes and parse them the way you want.
Not only is it safer, since you don't have to worry about edge cases, it's also more readable.
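As a rough sketch of that suggestion, the snippet below reads the data-src-l attribute from each img element and prefixes site-relative paths with the shop's host. The helper name largeImageUrls and the shortened sample markup are assumptions made for illustration; only DOMDocument and getElementsByTagName come from the answer.

```php
<?php
// Hypothetical helper: collects the data-src-l attribute of every <img>
// and prefixes site-relative paths with the given base URL.
function largeImageUrls(string $html, string $baseUrl): array
{
    $doc = new DOMDocument();
    // Collect parser warnings silently; scraped pages are rarely valid HTML.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $urls = [];
    foreach ($doc->getElementsByTagName('img') as $img) {
        // getAttribute() returns '' when the attribute is absent.
        $path = $img->getAttribute('data-src-l');
        if ($path === '') {
            continue;
        }
        // Only site-relative paths (starting with "/") get the host prepended.
        $urls[] = ($path[0] === '/') ? rtrim($baseUrl, '/') . $path : $path;
    }
    return $urls;
}

// Shortened version of the markup from the question (the src value is abbreviated).
$html = '<ul><li><a href="/WebRoot/products/8020/80203122/bilder/80203122.jpg">'
      . '<img itemprop="image" src="/80203122_xs.jpg" '
      . 'data-src-l="/WebRoot/products/8020/80203122/bilder/80203122.jpg" /></a></li></ul>';
print_r(largeImageUrls($html, 'http://www.asaphshop.nl'));
```

Images without a data-src-l attribute are skipped, which is what narrows the result down to the one image the question asks for.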

php search for string, then find another [duplicate]

This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
How to parse and process HTML with PHP?
I am brand new to php, only a couple of hours in, trying to understand searching and finding. Let's say I want to extract the rank of Diablo 3 from Amazon's top seller list here. There I can search for the string "Diablo III" or similar to find the following block (sorry about the formatting):
http://www.amazon.com/Diablo-III-Standard-Edition-Pc/dp/B00178630A/ref=zg_bs_4924894011_1
"><img src="http://ecx.images-amazon.com/images/I/41kXCp%2BUyeL._SL160_SL160_.jpg" alt="Diablo III: Standard Edition" title="Diablo III: Standard Edition" onload="if (typeof uet == 'function') { uet('af'); }"/></a></div></div><div class="zg_itemRightDiv_normal"><div class="zg_rankLine"><span class="zg_rankNumber">1.</span><span class="zg_rankMeta"></span></div><div class="zg_title"><a href="
http://www.amazon.com/Diablo-III-Standard-Edition-Pc/dp/B00178630A/ref=zg_bs_4924894011_1
">Diablo III: Standard Edition</a></div><div class="zg_byline">by Blizzard Entertainment
Now, I want to try to extract the rank, which is defined in this part <span class="zg_rankNumber">1.</span> and is currently 1.
Could someone please advise on the best way on extracting that number so that if it falls to second, third or whatever place (up until 20) I will still be able to extract it?
I have looked a bit into preg_match and regex but I couldn't quite understand the use.
You can start by using Simple HTML DOM Parser.
So, if you wanna find this:
<span class="zg_rankNumber">
you can do it like this: ($str contains the html data)
$html = str_get_html($str);
echo $html->find("span[class='zg_rankNumber']",0)->innertext;
EDITED:
If you want to get a specific rank of game (Diablo III), then based on formatting, you just call:
echo $html->find("img[title^='Diablo III']",0)->find("span[class='zg_rankNumber']",0)->innertext;
preg_match_all( '/<span class=\"zg_rankNumber\">(.*?)<\/span>/is', $string, $matches );
print_r($matches);
It'll take a couple of hours to write the exact code, but I can tell you the logic:
Extract all "" from the html and store it in an array.
Loop through the array and check for the title.
If you found the title, extract the rank from that array element
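The loop-and-check logic described above could be sketched with PHP's built-in DOM extension instead of a third-party library. This assumes each title div shares a parent with its rank line, as in the excerpt from the question. The function name findRank and the example.com links are hypothetical.

```php
<?php
// Hypothetical helper: returns the rank shown next to the first title containing $title.
function findRank(string $html, string $title): ?int
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // Step 1: collect every title block.
    foreach ($xpath->query('//div[@class="zg_title"]') as $titleDiv) {
        // Step 2: skip blocks whose text does not contain the wanted title.
        if (strpos($titleDiv->textContent, $title) === false) {
            continue;
        }
        // Step 3: look for the rank inside the same parent block.
        $rank = $xpath->query('..//span[@class="zg_rankNumber"]', $titleDiv)->item(0);
        if ($rank !== null) {
            return (int) $rank->textContent; // "1." casts to 1
        }
    }
    return null; // title not found
}

// Sample markup modelled on the excerpt in the question; the links are placeholders.
$sample = '<div class="zg_itemRightDiv_normal"><div class="zg_rankLine">'
        . '<span class="zg_rankNumber">1.</span><span class="zg_rankMeta"></span></div>'
        . '<div class="zg_title"><a href="http://example.com/a">Diablo III: Standard Edition</a></div></div>'
        . '<div class="zg_itemRightDiv_normal"><div class="zg_rankLine">'
        . '<span class="zg_rankNumber">2.</span></div>'
        . '<div class="zg_title"><a href="http://example.com/b">Another Game</a></div></div>';
var_dump(findRank($sample, 'Diablo III'));
```

Because the rank is read relative to the matching title, the result follows the game if it drops to second or third place.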

PHP Regex problem!

I was creating a syntax highlighter in PHP, but I failed. While adding gray highlighting for script comments (//), I ran into a problem, so I created a shortened version of my highlighting function to demonstrate it. Whenever a PHP variable, e.g. $example, appears inside a comment, it does not get grayed out as it should. I'm using preg_replace() for this, but the regex I'm currently using doesn't seem to be right. I have tried almost everything I know, but it doesn't work. See the demo code below.
Problem Demo Code
<?php
$str = '
<?php
//This is a php comment $test and resulted bad!
$text_cool++;
?>
';
$result = str_replace(array('<','>','/'),array('[',']','%%'),$str);
$result = preg_replace("/%%%%(.*?)(?=(\n))/","<span style=\"color:gray;\">$0</span>",$result);
$result = preg_replace("/(?<!\"|'|%%%%\w\s\t)[\$](?!\()(.*?)(?=(\W))/","<span style=\"color:green;\">$0</span>",$result);
$result = str_replace(array('[',']','%%'),array('<','>','/'),$result);
$resultArray = explode("\n",$result);
foreach ($resultArray as $i) {
echo $i.'</br>';
}
?>
Problem Demo Screen
So the result I want is that $test in the comment string of the 'Demo Screen' above should also be colored gray! (See below.)
Can anyone help me solve this problem?
I'm Aware of highlight_string() function!
THANKS IN ADVANCE!
Reinventing the wheel?
highlight_string()
Also, this is why they have parsers, and regex (despite popular demand) should not be used as a parser.
I agree that you should use existing parsers. Every IDE has a PHP parser, and many people have written more of them.
That said, I do think it is worth the mental exercise. So, you can replace:
$result = preg_replace("/(?<!\"|')[\$](?!\()(.*?)(?=(\W))/","<span style=\"color:green;\">$0</span>",$result);
with
//regular expression.:
//#([^(%%%%|\"|')]*)([\$](?!\()(.*?)(?=(\W)))#
//replacement text:
//$1<span style=\"color:green;\">$2</span>
$result = preg_replace("#([^(%%%%|\"|')]*)([\$](?!\()(.*?)(?=(\W)))#","$1<span style=\"color:green;\">$2</span>",$result);
Personally, I think your best bet is to use CSS selectors. Replace style=\"color:gray;\" with class="comment-text" and style=\"color:green;\" with class="variable-text" and this CSS should work for you:
.variable-text {
color: #00E;
}
.comment-text,
.comment-text .variable-text {
color: #DDD;
}
[Insert "don't use regex to parse irregular languages" here]
Anyway, it looks like you've run into a prime example of why regular expressions are not suited for this kind of problem. You'd be better off looking into PHP's highlight_string functionality.
Well, you don't seem to care that php already has a function like this.
But because of the structure of php code one cannot simply use a regex for this or walk into mordor (the latter being the easier).
You have to use a parser or you will fly over the cuckoo's nest soon.
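For the mental exercise, one way to follow the "use a parser" advice is PHP's own tokenizer. The sketch below uses token_get_all() so that a variable inside a comment stays part of the comment token. The function name highlightSketch is hypothetical, and the colors simply mirror the question.

```php
<?php
// Hypothetical helper: wraps comments and variables in <span> tags using PHP's tokenizer.
function highlightSketch(string $source): string
{
    $out = '';
    foreach (token_get_all($source) as $token) {
        // Single-character tokens such as ";" arrive as plain strings.
        if (is_string($token)) {
            $out .= htmlspecialchars($token);
            continue;
        }
        // Other tokens are arrays: [token id, source text, line number].
        [$id, $text] = $token;
        $escaped = htmlspecialchars($text);
        if ($id === T_COMMENT || $id === T_DOC_COMMENT) {
            // The whole comment, including any "$name" inside it, is one token.
            $out .= '<span style="color:gray;">' . $escaped . '</span>';
        } elseif ($id === T_VARIABLE) {
            $out .= '<span style="color:green;">' . $escaped . '</span>';
        } else {
            $out .= $escaped;
        }
    }
    return $out;
}

// The demo input from the question.
$code = "<?php\n//This is a php comment \$test and resulted bad!\n\$text_cool++;\n?>";
echo nl2br(highlightSketch($code));
```

Since the tokenizer decides where a comment starts and ends, no lookbehind tricks are needed to keep $test gray.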

using preg_match_all to get name of image

After fetching an external page with curl, I have its full source code, which contains something like this (the part I'm interested in):
(page...)<td valign='top' class='rdBot' align='center'><img src="/images/buy_tickets.gif" border="0" alt="T"></td> (page...)
So I'm using preg_match_all, and I want to get only "buy_tickets.gif":
$pattern_before = "<td valign='top' class='rdBot' align='center'>";
$pattern_after = "</td>";
$pattern = '#'.$pattern_before.'(.*?)'.$pattern_after.'#si';
preg_match_all($pattern, $buffer, $matches, PREG_SET_ORDER);
Everything is fine up to now, but the problem is that the external page sometimes changes and the image I'm looking for ends up inside a link:
(page...)<td valign='top' class='rdBot' align='center'><img src="/images/buy_tickets.gif" border="0" alt="T"></td> (page...)
and I don't know how to make my code work in both cases (not just when the image has no link).
Hope you understand.
Thanks in advance.
Don't use regex to parse HTML, Use PHP's DOM Extension. Try this:
$doc = new DOMDocument;
@$doc->loadHTMLFile( 'http://ventas.entradasmonumental.com/eventperformances.asp?evt=18' ); // Using the @ operator to hide parse errors
$xpath = new DOMXPath( $doc );
$img = $xpath->query( '//td[@class="BrdBot"][@align="center"][1]//img[1]')->item( 0 ); // Xpath->query returns a 'DOMNodeList', get the first item which is a 'DOMElement' (or null)
$imgSrc = $img->getAttribute( 'src' );
$imgSrcInfo = pathInfo( $imgSrc );
$imgFilename = $imgSrcInfo['basename']; // All you need
You're going to get lots of advice not to use regex for pulling stuff out of HTML code.
There are times when it's appropriate to use regex for this kind of thing, and I don't always agree with the somewhat rigid advice given on the subject here (and elsewhere). However in this case, I would say that regex is not the appropriate solution for you.
The problem with using regex for searching for things in HTML code is exactly the problem you've encountered -- HTML code can vary wildly, making any regex virtually impossible to get right.
It is just about possible to write a regex for your situation, but it will be an insanely complex regex, and very brittle -- ie prone to failing if the HTML code is even slightly outside the parameters you expect.
Contrast this with the recommended solution, which is to use a DOM parser. Load the HTML code into a DOM parser, and you will immediately have an object structure which you can query for individual elements and attributes.
The details you've given make it almost a no-brainer to go with this rather than a regex.
PHP has a built-in DOM parser, which you can call as follows:
$mydom = new DOMDocument;
$mydom->loadHTMLFile("http://....");
You can then use XPath to search the DOM for your specific element or attribute that you want:
$myxpath = new DOMXPath($mydom);
$myattr = $myxpath->query("//td[@class='rdBot']//img[1]/@src");
Hope that helps.
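To illustrate why the DOM route copes with the image sometimes being wrapped in a link, here is a small sketch that runs the same query against both variants. The function name firstCellImageName and the placeholder href are assumptions; the cell markup is taken from the question.

```php
<?php
// Hypothetical helper: returns the file name of the first image inside a td with class "rdBot".
function firstCellImageName(string $html): ?string
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // "//img" matches at any depth, so an <a> wrapper around the image does not matter.
    $img = $xpath->query('//td[@class="rdBot"]//img')->item(0);
    return $img === null ? null : basename($img->getAttribute('src'));
}

$cell   = "<td valign='top' class='rdBot' align='center'>";
$image  = '<img src="/images/buy_tickets.gif" border="0" alt="T">';
$plain  = "<table><tr>{$cell}{$image}</td></tr></table>";
// The href below is a placeholder; the real link target is not given in the question.
$linked = "<table><tr>{$cell}<a href=\"#\">{$image}</a></td></tr></table>";

var_dump(firstCellImageName($plain));
var_dump(firstCellImageName($linked));
```

Both calls go through the same XPath expression, so the code does not need a second pattern for the linked case.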
function GetFilename($file) {
$filename = substr($file, strrpos($file,'/')+1,strlen($file)-strrpos($file,'/'));
return $filename;
}
echo GetFilename('/images/buy_tickets.gif');
This will output buy_tickets.gif
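For comparison, PHP's built-in basename() and pathinfo() functions cover the same ground as the hand-written helper above. This is a short sketch using the path from the question.

```php
<?php
$path = '/images/buy_tickets.gif';

// basename() returns the trailing name component of a path.
echo basename($path), "\n";                        // buy_tickets.gif

// pathinfo() can split the name further when the extension is needed separately.
echo pathinfo($path, PATHINFO_FILENAME), "\n";     // buy_tickets
echo pathinfo($path, PATHINFO_EXTENSION), "\n";    // gif
```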
Do you only need images inside of the "td" tags?
$regex='/<img src="\/images\/([^"]*)"[^>]*>/im';
edit:
to grab the specific image this should work:
$regex='/<td valign=\'top\' class=\'rdBot\' align=\'center\'>.*src="\/images\/([^"]*)".*<\/td>/';
Parsing HTML with Regex is not recommended, as has been mentioned by several posters.
However, if the path of your images always follows the pattern src="/images/name.gif", you can easily extract it in Regex:
$pattern = <<<EOD
#src\s*=\s*['"]/images/(.*?)["']#
EOD;
If you are sure that the images always follow the path "/images/name.ext" and you don't care where the image link is located in the page, this will do the job. If you have more detailed requirements (such as matching only within a specific class), forget Regex, it's not the right tool for the job.
I just read in your comments that you need to match within a specific tag. Use a parser, it will save you untold headaches.
If you still want to go through regex, try this:
\(?<=<td .*?class\s*=\s*['"]rdBot['"][^<>]*?>.*?)(?<!</td>.*)<img [^<>]*src\s*=\s*["']/images/(.*?)["']\i
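This works in C#. I am not totally sure about PHP's brand of regex: PCRE requires lookbehind assertions to have a bounded length, so the .*? inside the lookbehinds will likely need restructuring there.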

crawling a html page using php?

This website lists over 250 courses in one list. I want to get the name of each course and insert that into my mysql database using php. The courses are listed like this:
<td> computer science</td>
<td> media studeies</td>
…
Is there a way to do that in PHP, instead of me having a mad data entry nightmare?
Regular expressions work well.
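$page = file_get_contents($url); // get the page; $url is assumed to hold its address
$lines = preg_split("/\n/", $page);
foreach ($lines as $text) {
$matches = array();
// trim() removes indentation and a trailing "\r" so the anchors can match
if (preg_match("/^<td>(.*)<\/td>$/", trim($text), $matches)) {
// insert $matches[1] into the database
}
}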
See the documentation for preg_match.
How to parse HTML has been asked and answered countless times before. While (for your specific UseCase) Regular Expressions will work, it is - in general - better and more reliable to use a proper parser for this task. Below is how to do it with DOM:
$dom = new DOMDocument;
$dom->loadHTMLFile('http://courses.westminster.ac.uk/CourseList.aspx');
foreach($dom->getElementsByTagName('td') as $title) {
echo $title->nodeValue;
}
For inserting the data into MySQL, you should use the mysqli extension. Examples are plentiful on Stack Overflow, so please use the search function.
You can use this HTML parsing PHP library to achieve this: http://simplehtmldom.sourceforge.net/
I encountered the same problem.
Here is a good class library called Simple HTML DOM:
http://simplehtmldom.sourceforge.net/
It works much like jQuery.
Just for fun, here's a quick shell script to do the same thing.
curl http://courses.westminster.ac.uk/CourseList.aspx \
| sed '/<td>\(.*\)<\/td>/ { s/.*">\(.*\)<\/a>.*/\1/; b }; d;' \
| uniq > courses.txt
