Is regex the right tool to find a line of HTML?


I have a PHP script that pulls some content off of a server, but the line the content is on changes every day, so I can't just pull a specific line. However, the content is contained within a div that has a unique id. Is it possible (and is it the best way) to use regex to search for this unique id and then pass the line it's on back to my script?
Example:
HTML file:
<html><head><title>Example</title></head>
<body>
<div id="Alpha"> Blah blah blah </div>
<div id="Beta"> Blah Blah Blah </div>
</body>
</html>
So let's say that I'm looking for the line with an opening div tag with an id of Alpha. The code should return 3, because the div with the id of Alpha is on the third line.

At the risk of providing more up-votes for Jeff who has already crossed the mountains of madness... see here
The argument rages back and forth, but... if it's a simple one-off or little-used script you are writing, then sure, use regex. If it's more complex and needs to be reliable with little future tweaking, then I'd suggest using an HTML parser. HTML is a nasty, often non-regular beast to tame. Use the right tool for the job... maybe in your case it's regex, or maybe it's a full-blown parser.

Generally, NO. But if you are sure that the div will always be on one line and that there is no other div inside it, you can use it without problems. Something like /<div id=\"mydivid\">(.*?)<\/div>/ or similar.
Otherwise, DOMDocument would be a saner way.
EDIT: Having seen your HTML example, my answer would be "YES". RegEx is a very good tool for this.
I assume that you have the HTML as one continuous text, not as separate lines (which would be slightly different). I also assume that you want the line number more than the line content.
Here is some rough PHP code to extract it (just to give some idea).
$HTML =
"<html><head><title>Example</title></head>
<body>
<div id=\"Alpha\"> Blah blah blah </div>
<div id=\"Beta\"> Blah Blah Blah </div>
</body>
</html>";
$ID = "Alpha";
function GetLineOfDIV($HTML, $ID) {
    // preg_quote() keeps special characters in the id from being read as regex syntax.
    $RegEx_Alpha = '/\n(<div id="'.preg_quote($ID, '/').'">.*?<\/div>)\n/m';
    // preg_match() returns 1 on a match; bail out before touching $Match otherwise.
    if (preg_match($RegEx_Alpha, $HTML, $Match, PREG_OFFSET_CAPTURE) !== 1)
        return -1;
    $Match = $Match[1]; // Only the one in '(...)'
    //$MatchStr = $Match[0]; Since you do not want it, so we comment it out.
    $MatchOffset = $Match[1];
    $StartLines = preg_split("/\n/", $HTML, -1, PREG_SPLIT_OFFSET_CAPTURE);
    foreach($StartLines as $I => $StartLine) {
        $LineOffset = $StartLine[1];
        if ($MatchOffset <= $LineOffset)
            return $I + 1;
    }
    return count($StartLines);
}
echo GetLineOfDIV($HTML, $ID);
I hope this gives you some idea.
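For what it's worth, once preg_match() has reported the byte offset of the match, the line number can also be derived by counting the newlines before that offset. A minimal sketch, assuming the HTML is held in one string; the function name lineOfDivId() is made up for illustration, and the pattern only handles double-quoted id attributes:

```php
<?php
// Hypothetical helper: 1-based line of the opening <div> with the given id, or -1.
function lineOfDivId(string $html, string $id): int
{
    // Assumption: the id attribute is written with double quotes.
    $pattern = '/<div\b[^>]*\bid="' . preg_quote($id, '/') . '"/';
    if (preg_match($pattern, $html, $m, PREG_OFFSET_CAPTURE) !== 1) {
        return -1;
    }
    $offset = $m[0][1]; // byte offset where the match starts
    // Every newline before the match pushes it one line further down.
    return substr_count(substr($html, 0, $offset), "\n") + 1;
}

$html = "<html><head><title>Example</title></head>\n"
      . "<body>\n"
      . "<div id=\"Alpha\"> Blah blah blah </div>\n"
      . "<div id=\"Beta\"> Blah Blah Blah </div>\n"
      . "</body>\n"
      . "</html>";

echo lineOfDivId($html, 'Alpha'), "\n"; // 3
```

This avoids splitting the whole document into lines, at the cost of still depending on how the opening tag is written.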

According to Jeff Atwood, you should never parse HTML using regex.

Since the line number is important to you here and not the actual contents of the div, I'd be inclined not to use regex at all. I'd probably explode() the string into an array and loop through that array looking for your marker. Like so:
<?php
$myContent = "[your string of html here]";
$myArray = explode("\n", $myContent);
$arraylen = count($myArray); // So you don't waste time counting the array at every loop
$lineNo = 0;
for($i = 0; $i < $arraylen; $i++)
{
    $pos = strpos($myArray[$i], 'id="Alpha"');
    if($pos !== false)
    {
        $lineNo = $i+1;
        break;
    }
}
?>
Disclaimer: I haven't got a PHP installation readily available to test this so some debugging may be required.
Hope this helps as I think it's probably just going to be a waste of time for you to implement a parsing engine just to do something so simple - especially if it's a one-off.
Edit: if the content is important to you at this stage too then you can use this in combination with the other answers which provide an adequate regex for the job.
Edit #2: Oh what the hey... here's my two cents:
"/<div.*?id=\"Alpha\".*?>.*?(<div.*<\/div>)*.*?<\/div>/s"
The (<div.*<\/div>) tells the regex engine that it may find nested div tags and to just incorporate them into the match if it finds them rather than just stopping at the first </div>. However this only solves the problem if there is only one level of nesting. If there's more, then regex is not for you sorry :(.
The /s makes the dot match linebreaks too, so you don't have to dirty up your expressions with [\S\s] everywhere.
Again, sorry, I've no environment to test this in at the moment so you may need to debug.
Cheers
Iain

The fact that a unique id is involved sounds promising, but since it will be a DIV, and not necessarily a single line of HTML, it will be difficult to construct a regular expression, and the usual objections to parsing HTML with regexes apply.
Not recommended.

Instead of RegEx, use a parser that is made especially to handle (messy) HTML. This will make your application less brittle in case the HTML changes slightly, and you don't have to hand-craft custom RegEx each time you want to pull out a new piece of data.
See this Stack Overflow page: Mature HTML Parsers for PHP
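To give an idea of what the parser route looks like with PHP's bundled DOM extension, here is a rough sketch that fetches the text of the div by its id rather than by its line. The helper name textOfDivById() is hypothetical, and the sample HTML is the one from the question:

```php
<?php
// Hypothetical helper: trimmed text of the <div> with the given id, or null.
function textOfDivById(string $html, string $id): ?string
{
    $doc = new DOMDocument();
    // Collect parse warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($doc->getElementsByTagName('div') as $div) {
        if ($div->getAttribute('id') === $id) {
            return trim($div->textContent);
        }
    }
    return null;
}

$html = <<<'HTML'
<html><head><title>Example</title></head>
<body>
<div id="Alpha"> Blah blah blah </div>
<div id="Beta"> Blah Blah Blah </div>
</body>
</html>
HTML;

echo textOfDivById($html, 'Alpha'), "\n"; // Blah blah blah
```

Because the lookup goes through the parsed tree, it keeps working when the div moves to another line or spans several lines.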

@OP since your requirement is that easy, you can just use string methods
$f = fopen("file","r");
if($f){
    $i = 0; // line counter
    while( ($line = fgets($f,4096)) !== false ){
        $i += 1;
        if (stripos($line,'<div id="Alpha">')!==FALSE){
            print "line number: $i\n";
        }
    }
    fclose($f);
}

Related

PHP Custom Markup Language Parser

I am making a website, and I would like to make a custom Markup type language in PHP. I want the tags to be surrounded with [ and ]. Now, I was thinking about this, like anyone would, and I could do something like this:
function formatMarkup($markup = ''){
    $markup = str_replace('[color=blue]', '<span style="color: blue">', $markup);
    return $markup;
}
Even though that might work, it would be more programmatically correct to do something like explode(), but starting at every [ and ending at every ]. It would be great to find out how. Thank you for your time and effort.
EDIT:
I have decided to use preg_split(). It seems nice, and all, but I cannot get the regex. Here is my code.
EDIT #2:
I have got most of the regex done, but there are unneeded extra keys in the array. How would I fix them? Here is my new code.

I have made my markup language. I used
$split = preg_split("/(\[|\])/", $markup);
to get the individual "tags" and used
foreach($split as $k => $v){
    if(strlen($v) < 1){
        continue;
    }
    // ... checks for each tag go here ...
}
to iterate through each of them and skip any empty values. After that, I do all of my checks, parse the code blocks together, and rebuild the text line by line.
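The empty entries that the strlen() check skips can also be dropped by preg_split() itself through the PREG_SPLIT_NO_EMPTY flag. A small sketch, assuming made-up sample markup (the tag names are only illustrative):

```php
<?php
// Hypothetical sample input.
$markup = '[color=blue]Hello[/color] world';

// Split on either bracket; PREG_SPLIT_NO_EMPTY discards zero-length pieces,
// such as the one in front of the leading "[".
$parts = preg_split('/[\[\]]/', $markup, -1, PREG_SPLIT_NO_EMPTY);

print_r($parts); // color=blue, Hello, /color, " world"
```

Note that this loses the information about which pieces were inside brackets, so a real parser would still need to track that separately.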

PHP preg_replace inside for loop

I'm currently trying out PHP's preg_replace function and I've run into a small problem. I want to replace all the <br /> tags with a div with an ID, unique for every div, so I thought I would put it in a for loop. But strangely, it only does the first line and gives it an ID of 49, which is the last ID they can get. Here's my code:
$res = mysqli_query($mysqli, "SELECT * FROM song WHERE id = 1");
$row = mysqli_fetch_assoc($res);
mysqli_set_charset("utf8");
$lyric = $row['lyric'];
$lyricHTML = nl2br($lyric);
$lines_arr = preg_split('[<br />]',$lyricHTML);
$lines = count($lines_arr);
for($i = 0; $i < $lines; $i++) {
$string = preg_replace(']<br />]', '</h4><h4 id="no'.$i.'">', $lyricHTML, 1);
echo $i;
}
echo '<h4>';
echo $string;
echo '</h4>';
How it works is that I have a large amount of text in my database, and when I load it into the lyric variable, it's just plain text. But when I nl2br it, it gets a <br /> after every line, which I use here. I get the number of <br /> tags by using the little "lines_arr" method as you can see, and then basically iterate in a for loop.
The only problem is that it only outputs on the first line and gives that an ID of 49. When I move it outside the for loop and remove the limit, it works and all lines get an <h4> around them, but then I don't get the unique ID I need.
This is some text I pulled out from the database
Mama called about the paper turns out they wrote about me
Now my broken heart´s the only thing that's broke about me
So many people should have seen what we got going on
I only wanna put my heart and my life in songs
Writing about the pain I felt with my daddy gone
About the emptiness I felt when I sat alone
About the happiness I feel when I sing it loud
He should have heard the noise we made with the happy crowd
Did my Gran Daddy know he taught me what a poem was
How you can use a sentence or just a simple pause
What will I say when my kids ask me who my daddy was
I thought about it for a while and I'm at a loss
Knowing that I´m gonna live my whole life without him
I found out a lot of things I never knew about him
All I know is that I´ll never really be alone
Cause we gotta lot of love and a happy home
And my goal is to give every line an <h4 id="no1">TEXT</h4> for example, and the number after no, like no1 or no4 should be incremented every iteration, that's why I chose a for-loop.

Looks like you need to delimit and escape your regexp
preg_replace('/<br \/>/', ...);
Really though, this is a classic XY Problem. Instead of asking us how to fix your solution, you should ask us how to solve your problem.
Show us some example text in the database and then show us how you would like it to be formatted. It's very likely there's a better way.
I would use array_walk for this. ideone demo here
$lines = preg_split("/[\r\n]+/", $row['lyric']);
array_walk($lines, function(&$line, $idx) {
    $line = sprintf('<h4 id="no%d">%s</h4>', $idx+1, $line);
});
echo implode("\n", $lines);
Output
<h4 id="no1">Mama called about the paper turns out they wrote about me</h4>
<h4 id="no2">Now my broken heart's the only thing that's broke about me</h4>
<h4 id="no3">So many people should have seen what we got going on</h4>
...
<h4 id="no16">Cause we gotta lot of love and a happy home</h4>
Explanation of solution
nl2br doesn't really help us here. It converts \n to <br /> but then we'd just end up splitting the string on the br. We might as well split using \n to start with. I'm going to use /[\r\n]+/ because it splits one or more \r, \n, and \r\n.
$lines = preg_split("/[\r\n]+/", $row['lyric']);
Now we have an array of strings, each containing one line of lyrics. But we want to wrap each string in an <h4 id="noX">...</h4> where X is the number of the line.
Ordinarily we would use array_map for this, but the array_map callback does not receive an index argument. Instead we will use array_walk which does receive the index.
One more note about this line: the callback parameter is declared as &$line. This allows us to alter the contents of $line and have it "saved" in our original $lines array. (See Example #1 in the PHP docs to compare the difference.)
array_walk($lines, function(&$line, $idx) {
Here's where the h4 comes in. I use sprintf for formatting HTML strings because I think they are more readable. And it allows you to control how the arguments are output without adding a bunch of view logic in the "template".
Here's the world's tiniest template: '<h4 id="no%d">%s</h4>'. It has two inputs, %d and %s. The first will be output as a number (our line number), and the second will be output as a string (our lyrics).
$line = sprintf('<h4 id="no%d">%s</h4>', $idx+1, $line);
Close the array_walk callback function
});
Now $lines is an array of our newly-formatted lyrics. Let's output the lyrics by separating each line with a \n.
echo implode("\n", $lines);
Done!
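As a side note on the array_map remark above: array_map can still see the index if the keys are passed in as a second array. A rough sketch, not the method used in this answer; wrapLines() is a made-up name, and the \R escape (any newline sequence) and the htmlspecialchars() call are additions of mine:

```php
<?php
// Hypothetical helper: wraps each non-empty line in a numbered <h4>.
function wrapLines(string $lyric): array
{
    // \R matches \n, \r\n or \r, so runs of blank lines collapse into one split.
    $lines = preg_split('/\R+/', trim($lyric));

    return array_map(
        function (string $line, int $idx): string {
            // Escape the text so stray < or & cannot break the markup.
            return sprintf('<h4 id="no%d">%s</h4>', $idx + 1, htmlspecialchars($line));
        },
        $lines,
        array_keys($lines) // second array supplies the index
    );
}

echo implode("\n", wrapLines("first line\r\nsecond line\n\nthird line")), "\n";
```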

If every lyric in your db text is on its own line, why not just explode it on the \n character?
Try to find a solution without the preg_* functions where a plain string function will do, since the regex functions are heavier:
I would go like this:
$lyric = $row['lyric'];
$lyrics = explode("\n", $lyric);
$lyricsHtml = array();
$i = 0;
foreach($lyrics as $val){
    $i++;
    $lyricsHtml[] = '<h4 id="no'.$i.'">'.$val.'</h4>';
}
$lyricsHtml = implode("\n", $lyricsHtml);

Another way, with preg_replace_callback:
$id = 0;
$lyric = preg_replace_callback('~(^)|$~m',
    function ($m) use (&$id) {
        // Group 1 is only set when the match is at the start of a line.
        return (isset($m[1])) ? '<h4 id="no' . ++$id . '">' : '</h4>';
    },
    $lyric);

How to expand variables in a string

Problem
I'd like to expand variables in a string in the same manner that variables in a double-quoted string get expanded.
$string = '<p>It took $replace s</>';
$replace = 40;
expression_i_look_for;
$string should become '<p>It took 40 s</>';
I see an obvious solution like this:
$string = str_replace('"', '\"', $string);
eval('$string = "$string";');
But I really don't like it, because eval() is insecure. Is there any other way to do this ?
Context
I'm building a simple templating engine, that's where I need this.
Example Template (view_file.php)
<h1>$title</h1>
<p>$content</p>
Template rendering (simplified code):
$params = array('title' => ...);
function render($view_file, $params) {
    extract($params);
    ob_start();
    include($view_file);
    $text = ob_get_contents();
    ob_end_clean();
    expression_i_look_for; // this will expand the variables in the template
    return $text;
}
The expansion of the variables in the template simplifies its syntax. Without it, the above example template would be:
<h1><?php echo $title;?></h1>
<p><?php echo $content;?></p>
Do you think this approach is good ? Or should I look in another direction ?
Edit
Finally I understand that there is no simple solution, due to the flexible way PHP expands variables (even ${$var}->member[0] would be valid).
So there are only two options:
Adopt an existing full fledged templating system
Stick with something very basic that essentially is limited to including the view files via include.

I would rather suggest using an existing template engine, for example Smarty. But if you really want to do it yourself, you can use a simple regular expression to match all variables made up of, for example, letters and numbers, and then replace them with the corresponding variables:
<?php
$text = 'hello $world, what is the $matter? I like $world!';
preg_match_all('/\$([a-zA-Z0-9]+)/',
$text,
$out, PREG_PATTERN_ORDER);
$world = 'World';
$matter = 'matter';
foreach(array_unique($out[1]) as $variable){
$text=str_replace('$'.$variable, $$variable, $text);
}
echo $text;
?>
prints
hello World, what is the matter? I like World!

Parse
Parse the string, looking for $ followed by a valid variable name (i.e. [a-zA-Z_\x7f-\xff][a-zA-Z0-9_\x7f-\xff]*).
Variable²
Use the variable variables syntax (i.e. the $$var notation).
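The two steps above could be sketched roughly as follows. One deliberate swap: instead of variable variables ($$var), this version looks the names up in an explicit array, so the template can only reach values that were handed to it. The function name expandVars() is made up for illustration:

```php
<?php
// Hypothetical helper: replaces $name placeholders with values from $vars.
function expandVars(string $template, array $vars): string
{
    // $ followed by a valid PHP variable name, as described above.
    $pattern = '/\$([a-zA-Z_\x7f-\xff][a-zA-Z0-9_\x7f-\xff]*)/';

    return preg_replace_callback(
        $pattern,
        function (array $m) use ($vars) {
            // Unknown names are left in place rather than replaced with nothing.
            return array_key_exists($m[1], $vars) ? (string) $vars[$m[1]] : $m[0];
        },
        $template
    );
}

echo expandVars('<p>It took $replace s</p>', ['replace' => 40]), "\n";
// <p>It took 40 s</p>
```

Matching the whole name in one go also means $world is not accidentally replaced inside a longer name such as $worldwide. It handles plain names only, not array or property access.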

Are you trying to do this?
templater.php:
<?php
$first = "first";
$second = "second";
$third = "third";
include('template.php');
template.php:
<?php
echo 'The '.$first.', '.$second.', and '.$third.' variables in a string!';
When templater.php is run, produces:
"The first, second, and third variables in a string!"

Do you want something like this?
$replace = 40;
$string = "<p>It took {$replace}s</p>";

Instead of using single quotes
$string = '<p>It took $replace s</>';
$replace = 40;
use double quotes
$replace = 40;
$string = "<p>It took $replace s</>";
However, for readability and to enable you to remove the space between $replace and the s I would use:
$replace = 40;
$string = '<p>It took ' . $replace . 's</>';

The correct way is probably to parse your document as a tree, identify your parser tags ( because you are managing your own parser they don't have to follow php conventions if you don't want them to ) and then add in your values from an associative array or other data structure as the opportunity arises.
This is a more complex solution but will make it far easier when you realise that you want to be able to display lists whose length is unknown ahead of time using some kind of looping structure based on a standard display option. In the long run, you won't find many serious templating systems that aren't parsing the documents into some kind of in-memory tree where the placeholders can be inserted and then the document constructed as required. This also offers many opportunities for caching. Also, if you are unafraid of recursion you will be able to perform a lot of operations on it fairly simply.
However, this is not an uncommon problem to solve and as I commented on the question, there are almost guaranteed to be libraries and extensions around that provide most of the functionality you need. Unless this is a purely academic process for you, I would find some existing solutions and either use one of those or get a solid understanding of how it works so you have a starting point for adapting your own solution.

This is a snippet I pulled out from Lejlot's answer. I tested it and it works fine.
function resolve_vars_in_str( $input )
{
preg_match_all('/\$([a-zA-Z0-9]+)/', $input, $out, PREG_PATTERN_ORDER);
foreach(array_unique($out[1]) as $variable) $input=str_replace('$'.$variable, $GLOBALS["$variable"], $input);
return $input ;
}

using preg_match_all to get name of image

After using curl on an external page I've got all of its source code, including something like this (the part I'm interested in):
(page...)<td valign='top' class='rdBot' align='center'><img src="/images/buy_tickets.gif" border="0" alt="T"></td> (page...)
So I'm using preg_match_all; I want to get only "buy_tickets.gif"
$pattern_before = "<td valign='top' class='rdBot' align='center'>";
$pattern_after = "</td>";
$pattern = '#'.$pattern_before.'(.*?)'.$pattern_after.'#si';
preg_match_all($pattern, $buffer, $matches, PREG_SET_ORDER);
Everything is fine up to now, but the problem is that sometimes the external page changes and the image I'm looking for is inside a link
(page...)<td valign='top' class='rdBot' align='center'><img src="/images/buy_tickets.gif" border="0" alt="T"></td> (page...)
and I don't know how to make my code always work (not just when the image has no link).
I hope you understand.
Thanks in advance.

Don't use regex to parse HTML. Use PHP's DOM extension. Try this:
$doc = new DOMDocument;
@$doc->loadHTMLFile( 'http://ventas.entradasmonumental.com/eventperformances.asp?evt=18' ); // Using the @ operator to hide parse errors
$xpath = new DOMXPath( $doc );
$img = $xpath->query( '//td[@class="BrdBot"][@align="center"][1]//img[1]')->item( 0 ); // Xpath->query returns a 'DOMNodeList', get the first item which is a 'DOMElement' (or null)
$imgSrc = $img->getAttribute( 'src' );
$imgSrcInfo = pathinfo( $imgSrc );
$imgFilename = $imgSrcInfo['basename']; // All you need

You're going to get lots of advice not to use regex for pulling stuff out of HTML code.
There are times when it's appropriate to use regex for this kind of thing, and I don't always agree with the somewhat rigid advice given on the subject here (and elsewhere). However in this case, I would say that regex is not the appropriate solution for you.
The problem with using regex for searching for things in HTML code is exactly the problem you've encountered -- HTML code can vary wildly, making any regex virtually impossible to get right.
It is just about possible to write a regex for your situation, but it will be an insanely complex regex, and very brittle -- ie prone to failing if the HTML code is even slightly outside the parameters you expect.
Contrast this with the recommended solution, which is to use a DOM parser. Load the HTML code into a DOM parser, and you will immediately have an object structure which you can query for individual elements and attributes.
The details you've given make it almost a no-brainer to go with this rather than a regex.
PHP has a built-in DOM parser, which you can call as follows:
$mydom = new DOMDocument;
$mydom->loadHTMLFile("http://....");
You can then use XPath to search the DOM for your specific element or attribute that you want:
$myxpath = new DOMXPath($mydom);
$myattr = $myxpath->query("//td[@class='rdBot']//img[1]/@src");
Hope that helps.
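To make that concrete, here is a self-contained sketch that works on an HTML string and copes with both cases from the question: the image directly inside the cell, and the image wrapped in a link. The function name imageNameInCell() and the href="#" in the second sample are made up; the td markup is taken from the question.

```php
<?php
// Hypothetical helper: file name of the first image inside a td.rdBot, or null.
function imageNameInCell(string $html): ?string
{
    $doc = new DOMDocument();
    // Collect parse warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // "//img" matches at any depth, so an <a> around the image does not matter.
    $srcNodes = $xpath->query("//td[@class='rdBot']//img/@src");
    if ($srcNodes === false || $srcNodes->length === 0) {
        return null;
    }
    return basename($srcNodes->item(0)->nodeValue);
}

$cell   = "<td valign='top' class='rdBot' align='center'>%s</td>";
$img    = '<img src="/images/buy_tickets.gif" border="0" alt="T">';
$plain  = '<table><tr>' . sprintf($cell, $img) . '</tr></table>';
$linked = '<table><tr>' . sprintf($cell, '<a href="#">' . $img . '</a>') . '</tr></table>';

echo imageNameInCell($plain), "\n";  // buy_tickets.gif
echo imageNameInCell($linked), "\n"; // buy_tickets.gif
```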

function GetFilename($file) {
$filename = substr($file, strrpos($file,'/')+1,strlen($file)-strrpos($file,'/'));
return $filename;
}
echo GetFilename('/images/buy_tickets.gif');
This will output buy_tickets.gif

Do you only need images inside of the "td" tags?
$regex='/<img src="\/images\/([^"]*)"[^>]*>/im';
edit:
to grab the specific image this should work:
$regex='/<td valign=\'top\' class=\'rdBot\' align=\'center\'>.*src="\/images\/([^"]*)".*<\/td>/';

Parsing HTML with Regex is not recommended, as has been mentioned by several posters.
However, if the path of your images always follows the pattern src="/images/name.gif", you can easily extract it in Regex:
$pattern = <<<EOD
#src\s*=\s*['"]/images/(.*?)["']#
EOD;
If you are sure that the images always follow the path "/images/name.ext" and that you don't care where the image link is located in the page, this will do the job. If you have more detailed requirements (such as matching only within a specific class), forget Regex, it's not the right tool for the job.
I just read in your comments that you need to match within a specific tag. Use a parser, it will save you untold headaches.
If you still want to go through regex, try this:
\(?<=<td .*?class\s*=\s*['"]rdBot['"][^<>]*?>.*?)(?<!</td>.*)<img [^<>]*src\s*=\s*["']/images/(.*?)["']\i
This should work. It does work in C#, but PHP's PCRE engine rejects lookbehinds of unbounded length such as the ones above, so it would need adapting there.

Errors with basic PHP template using eval()

I'm just about ready to cry. I have read the php.net manual page, tried a dozen Google searches with various phrases, and scanned hundreds of stackoverflow threads. I have seen this method in use, but it just doesn't work for me. It seems so basic. I have a couple related questions or problems I don't know how to find answers to.
The important part of the PHP file looks like this:
switch() {
…other cases…
default:
$tpl['title'] = "Newsletter Signup";
$tpl['description'] = "Newsletter description";
$tpl['page-content'] = file_get_contents('signup.html');
}
$tpl_src = addslashes(file_get_contents('index.tpl'));
eval("\$html = \"$tpl_src\";");
echo $html;
My index.tpl file includes lines like these:
<title>{$tpl['title']}</title>
<meta name="description" content="{$tpl['description']}" />
nav, etc…
<div id="main-content"> {$tpl['page-content']} </div>
I like how neat and clean the code is, without a whole bunch of extra <?=…?>'s.
First, when curly brackets {} appear in a string, what is that called? I might be able to look it up and learn how to use them if I knew.
Next, this just doesn't work at all. If I remove the single quotes from the variable keys, it's good, but php.net says you should never do that in case my name becomes a language constant at some point. Fair enough. But how do I fix this? I like using an array for the vars in case I want to build an evalTemplate subroutine and can just pass $tpl to it.
Lastly, $tpl['page-content'] doesn't print out. The variable is set okay; I can use echo $tpl['page-content'] to test, but it appears as a single blank line in the final HTML.
I'm sure there's just some aspect of the language I don't know yet. Any help is much appreciated!!

As Volker pointed out, addslashes seems to be an issue. Try addcslashes instead. Also, I'd strongly recommend making this a function, to simplify sanitisation / parsing:
function render ($file, $vars = array())
{
    // .. extra sanitisation, validation, et al.
    $_html = '';
    $_raw_file = addcslashes (file_get_contents ($file), '"\\');
    extract ($vars, EXTR_SKIP);
    // The evaluated statement needs its own terminating semicolon.
    eval ('$_html = "'.$_raw_file.'";');
    return $_html;
}
And called thus:
switch() {
// …other cases…
default:
$tpl['title'] = "Newsletter Signup";
$tpl['description'] = "Newsletter description";
$tpl['page-content'] = render ('signup.html');
}
echo render ('index.tpl', $tpl);
PS: The use of extract above means that your vars will simply be $title, not $tpl['title'], etc.

Usually you don't use the '' string delimiters in string variable expansion. I.e. "$tpl[content]" instead of "$tpl['content']".
As for the braces, they delimit variables when identifier characters may come straight before or after the name. For example:
$item = "Cup";
$text = "I smashed four $items"; // won't work
$text = "I smashed four {$item}s"; // will work.
// 2nd output: "I smashed four Cups"

addslashes() adds backslashes before both single and double quotes within the string.
The code generated for your example would be
$html = "<title>{$tpl[\'title\']}</title>
<meta name=\"description\" content=\"{$tpl[\'description\']}\" />
nav, etc…
<div id=\"main-content\"> {$tpl[\'page-content\']} </div>";
And {$tpl[\'title\']} doesn't parse well.
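The difference is easy to see side by side. A small sketch, assuming a one-line template string in place of index.tpl; renderString() is a made-up helper, and like the original approach it relies on eval(), so it is only reasonable for templates you wrote yourself:

```php
<?php
// Hypothetical helper: expands {$tpl[...]} placeholders in a trusted template string.
function renderString(string $src, array $tpl): string
{
    $html = '';
    // Escape only double quotes and backslashes; single quotes stay untouched,
    // so {$tpl['title']} is still valid inside the double-quoted string.
    eval('$html = "' . addcslashes($src, '"\\') . '";');
    return $html;
}

$src = '<title>{$tpl[\'title\']}</title> <meta name="description" />';

echo addslashes($src), "\n";          // single quotes get a backslash too
echo addcslashes($src, '"\\'), "\n";  // only the double quotes are escaped
echo renderString($src, ['title' => 'Newsletter Signup']), "\n";
```

The last line prints the template with the title filled in and the double quotes around "description" intact.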
