PHP - Advanced Regex Help needed

So I have many large text paragraphs to parse.
The end goal is to separate the paragraphs into smaller postings, so I can insert them into mysql.
Here's a very short example of one of the paragraphs in a string:
<?php
$longstring = '
(<b>John Smith</b>) at <b class="datetimeGMT">2011-01-10 22:13:01 GMT</b><hr>
Lots of text entered here under the first line.<br>And most of it is html, since it is for displaying in a web browser.<br></br></br>
(<b>Alan Slappy</b>) at <b class="datetimeGMT">2011-01-11 13:12:00 GMT</b><hr>
Forgot to put one more thing in the notes.........<br>blah blah blah
(<b>Joe Mama</b>) at <b class="datetimeGMT">2011-01-13 10:15:00 GMT</b><hr>
Groceries list:<br>Watermelons<br>Floss<br><br>email doctor
';
?>
Yep, I have a freaky project of parsing these strings for each entry.
Yes, I agree with anyone who thinks this is not a cool task. The original developer allowed new text to be appended to the original text. That's not a bad idea on some occasions, but for me it is.
I do need help with how to RegEx this beast and place it into a foreach loop so I can start cleaning it up.
Here's how far I got:
<?php
if(preg_match_all('/\(<b>.*?<hr>/', $longstring, $matches)){
print_r($matches);
}
/* output:
Array
(
[0] => Array
(
[0] => (<b>John Smith</b>) at <b class="datetimeGMT">2011-01-10 22:13:01 GMT</b><hr>
[1] => (<b>Alan Slappy</b>) at <b class="datetimeGMT">2011-01-11 13:12:00 GMT</b><hr>
[2] => (<b>Joe Mama</b>) at <b class="datetimeGMT">2011-01-13 10:15:00 GMT</b><hr>
)
)
*/
?>
So, I'm actually doing pretty good with looping through the tops of each entry. I'm kinda proud I figured that out. (regex is my nemesis)
So now I'm stuck figuring out how to include the actual text below each iteration.
Anyone have an idea on how I can adjust the preg_match_all to account for the text below each "header"?

Try to use preg_split instead:
$matches = preg_split("/\s*(\(<b>.*?<hr>)\s*/s", trim($longstring), -1, PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE);
print_r($matches);
Note: trim is applied to your string to strip leading and trailing whitespace. The limit argument is -1 (no limit); passing null there is deprecated in newer PHP versions.
Result will be something like
Array
(
[0] => (<b>John Smith</b>) at <b class="datetimeGMT">2011-01-10 22:13:01 GMT</b><hr>
[1] => Lots of text entered here under the first line.<br>And most of it is html, since it is for displaying in a web browser.<br></br></br>
[2] => (<b>Alan Slappy</b>) at <b class="datetimeGMT">2011-01-11 13:12:00 GMT</b><hr>
[3] => Forgot to put one more thing in the notes.........<br>blah blah blah
[4] => (<b>Joe Mama</b>) at <b class="datetimeGMT">2011-01-13 10:15:00 GMT</b><hr>
[5] => Groceries list:<br>Watermelons<br>Floss<br><br>email doctor
)
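To get from that flat array to the foreach loop the question asks for, the headers and bodies can be paired up. This is a minimal sketch, assuming every header is followed by exactly one non-empty body (an empty body would be dropped by PREG_SPLIT_NO_EMPTY and shift the pairing). The shortened sample text, the $posts array and the header pattern are illustrative and not part of the original answer:

```php
<?php
$longstring = '
(<b>John Smith</b>) at <b class="datetimeGMT">2011-01-10 22:13:01 GMT</b><hr>
Lots of text.<br>More text.
(<b>Alan Slappy</b>) at <b class="datetimeGMT">2011-01-11 13:12:00 GMT</b><hr>
Forgot one more thing.
';

// Same split as above: headers are kept because of PREG_SPLIT_DELIM_CAPTURE.
$parts = preg_split('/\s*(\(<b>.*?<hr>)\s*/s', trim($longstring), -1,
    PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE);

$posts = [];
// Walk the array two items at a time: [header, body], [header, body], ...
foreach (array_chunk($parts, 2) as $pair) {
    if (count($pair) < 2) {
        continue; // a trailing header without a body
    }
    [$header, $body] = $pair;
    // Pull the author and the timestamp out of the header line.
    if (preg_match('/^\(<b>(.*?)<\/b>\) at <b class="datetimeGMT">(.*?)<\/b>/', $header, $m)) {
        $posts[] = ['author' => $m[1], 'date' => $m[2], 'body' => $body];
    }
}

print_r($posts);
```

Each element of $posts then holds one entry that can be cleaned up and inserted into the database.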

Try this
if(preg_match_all('/\(<b>(?:(?!\(<b>).)*/s', $longstring, $matches)){
print_r($matches);
}
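The part doing the work here is (?:(?!\(<b>).)*, which consumes one character at a time for as long as the next characters do not start a new "(<b>" header. As a rough sketch of how the same idea can also capture the pieces separately, here is a variant with named groups. The group names author, date and body and the shortened sample text are my own additions, and it assumes the header format is exactly as shown in the question:

```php
<?php
$longstring = '
(<b>John Smith</b>) at <b class="datetimeGMT">2011-01-10 22:13:01 GMT</b><hr>
Lots of text.<br>More text.
(<b>Alan Slappy</b>) at <b class="datetimeGMT">2011-01-11 13:12:00 GMT</b><hr>
Forgot one more thing.
';

// Header with named groups, then a body that stops right before the next "(<b>".
$pattern = '/\(<b>(?<author>.*?)<\/b>\) at <b class="datetimeGMT">(?<date>.*?)<\/b><hr>'
         . '(?<body>(?:(?!\(<b>).)*)/s';

// PREG_SET_ORDER gives one array per entry instead of one array per group.
preg_match_all($pattern, $longstring, $entries, PREG_SET_ORDER);

foreach ($entries as $entry) {
    printf("%s | %s | %s\n", $entry['author'], $entry['date'], trim($entry['body']));
}
```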

This is going to be easier if you parse the HTML rather than just trying to regex it, unless you can guarantee the format of the HTML.
You might want to look at Robust and Mature HTML Parser for PHP.

Related

skipping html tags in php regex

I'm a stickler for correct-ish English (yes, I know "stickler" and "correct-ish" are an oxymoron). I have created a CMS for use on my company's sites, but there is one thing that is really on my nerves - creating "smart" quotes in the published content.
I have a reg-ex that does it, but I run into problems when I encounter html tags in the copy. For instance, one of the published stories used by my CMS may contain a bunch of plain text and a few HTML tags, such as a link tag, which contains quotation marks that I do NOT want to change to "smart" quotes for obvious reasons.
15 years ago, I was a Perl RegEx ace, but I'm totally drawing a blank on this one. What I want to do is process a string, ignoring all text inside html tags, replace all quotes in the string with "smart" quotes, then return the string with its html tags intact.
I have a function that I cobbled together to handle the most common scenarios I face with the CMS, but I hate that it's ugly and not elegant at all, and that if unforeseen tags come up, my solution completely breaks.
Here's the code (please don't laugh, it was slammed together over half a bottle of Scotch):
function educate_quotes($string) {
$pattern = array('/\b"/',//right double
'/"\b/',//left double
'/"/',//any remaining double quote, treated as right double
"/(\w+)'(\w+)/",//apostrophe
"/\b'/",//right single (quote after a word character)
"/'\b/",//left single (quote before a word character)
"/'$/",//right single end of line
"/--/"//emdash
);
$replace = array("”",//right double quote
"“",//left double
"”",//any remaining double quote
"$1"."’"."$2",//apostrophe
"’",//right single
"‘",//left single
"’",//right single end of line
"—"//emdash
);
$string = preg_replace($pattern,$replace,$string);
//remove smart quotes around urls
$string = preg_replace("/href=“(.+)”/","href=\"$1\"",$string);
//remove smart quotes around images
$string = preg_replace("/src=“(.+?)”/","src=\"$1\" ",$string);
//remove smart quotes around alt tags
$string = str_replace('alt=”"','',$string);
$pat = "/alt=“(.+?)”/is";
$rep = "alt=\"$1\" ";
$string = preg_replace($pat,$rep,$string);
//i'm too lazy to figure out why this artifact keeps appearing
$string = str_replace("alt=“",'alt="',$string);
//same thing here
$string = preg_replace("/” target/","\" target",$string);
return $string;
}
Like I said, I know the code is ugly, and I'm open to more elegant solutions. It works, but in the future, it will break if unforeseen tags come along. For the record, I want to reiterate that I'm not trying to get a regex to PARSE html tags; I'm trying to get it to IGNORE them while parsing all the rest of the text in the string.
Any solutions? I've done a LOT of online searching and can't seem to find the solution, and I'm unfamiliar enough with PHP's implementation of regex that it's consternating.
OK. I sort of answered my own question after Slacks suggested DOM parsing, but now I have the problem that the regex isn't working on the strings created. Here's my code:
function educate_quotes($string) {
$pattern = array(
'/"(\w+)"/',//quotes
"/(\w+)'(\w+)/",//apostrophe
"/'(\w+)'/",//single quotes
"/'\b/",//right single
"/--/"//emdash
);
$replace = array(
"“"."$1"."”",//quotes
"$1"."’"."$2",//apostrophe
"’"."$1"."‘",//single quotes
"‘",//right single
"—"//emdash
);
$xml = new DOMDocument();
$xml->loadHTML($string);
$text = (string)$xml->textContent;
$smart = preg_replace($pattern,$replace,$text);
$xml->textContent = $smart;
$html = $xml->saveHTML();
return $html;
}
The DOM parsing is working fine; the issue now is that my regex isn't actually replacing any of the quotation marks in the strings. (I've changed it from the one above, but the one above already wasn't working on the new strings either.)
Also, I'm getting the following annoying warnings when there is imperfect HTML code in the string:
Warning: DOMDocument::loadHTML() [domdocument.loadhtml]: Unexpected end tag : p in Entity, line: 2 in /home/leifw/now/cms_functions.php on line 418
Since I can't count on the reporters to always use perfect HTML code, that's a problem, too.
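On the warnings: libxml_use_internal_errors(true) makes libxml buffer parse errors instead of emitting PHP warnings. Below is a hedged sketch of a DOM-based variant that touches only text nodes, so attribute values such as href or alt are never seen by the regex. The helper name convert_text_nodes and the sample markup are made up for illustration. It also assumes ASCII (or entity-encoded) input, since DOMDocument::loadHTML does not assume UTF-8 by default, and saveHTML may write the smart quotes back as entities:

```php
<?php
// Hypothetical helper: applies $convert to text nodes only, never to tags or attributes.
function convert_text_nodes(string $html, callable $convert): string
{
    // Buffer parser errors instead of printing warnings for sloppy markup.
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    // Wrap the fragment in a <div> so it has a single root element; the flags
    // keep libxml from adding <html>/<body> and a doctype around it.
    $doc->loadHTML('<div>' . $html . '</div>', LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Visit every text node and rewrite its content in place.
    $xpath = new DOMXPath($doc);
    foreach ($xpath->query('//text()') as $textNode) {
        $textNode->nodeValue = $convert($textNode->nodeValue);
    }

    return $doc->saveHTML();
}

// Sample with a stray closing </p>, the kind of markup that triggers the warning.
$html = '<p>"Hi" <a href="http://example.com">there</a></p></p>';
$out = convert_text_nodes($html, function (string $text): string {
    return preg_replace('/"(\w+)"/', '“$1”', $text);
});
echo $out;
```

The returned markup includes the wrapper <div>, which would still need to be stripped if it is not wanted.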
Is it possible to split based on html < > tags and then piece it back together?
$text = "<div sdfas=\"sdfsd\" >ksdfsdf\"dfsd\" dfs </div> <span sdf='dsfs'> dfsd 'dsf ds' </span> ";
$new_text = preg_split("/(<.*?>)/", $text, -1, PREG_SPLIT_DELIM_CAPTURE);
echo htmlspecialchars(print_r($new_text, 1));
so what you get is:
Array
(
[0] =>
[1] => <div sdfas="sdfsd" >
[2] => ksdfsdf"dfsd" dfs
[3] => </div>
[4] =>
[5] => <span sdf='dsfs'>
[6] => dfsd 'dsf ds'
[7] => </span>
[8] =>
)
Then what you can do is just piece the entire thing back together, while using preg_replace, if it doesn't have a < >.
Using A. Lau's suggestion, I think I have a solution, and it turned out it actually was regex, not an XML parser.
Here's my code:
$string = '<p>"This" <b>is</b> a "string" with <a href="http://somewhere.com">quotes</a> in it. <img src="blah.jpg" alt="This is an alt tag"></p><p>Whatever, you know?</p>';
$new_string = preg_split("/(<.*?>)/",$string, -1, PREG_SPLIT_DELIM_CAPTURE);
echo "<pre>";
print_r($new_string);
echo "</pre>";
for($i=0;$i<count($new_string);$i++) {
$str = $new_string[$i];
if ($str) {
if (strpos($str,"<") === false) {
$new_string[$i] = convert_quotes($str);
}
}
}
$str = join('',$new_string);
echo $str;
function convert_quotes($string) {
$pattern = array('/\b"/',//right double
'/"\b/',//left double
'/"/',//any remaining double quote, treated as right double
"/(\w+)'(\w+)/",//apostrophe
"/\b'/",//right single (quote after a word character)
"/'\b/",//left single (quote before a word character)
"/'$/",//right single end of line
"/--/"//emdash
);
$replace = array("”",//right double quote
"“",//left double
"”",//any remaining double quote
"$1"."’"."$2",//apostrophe
"’",//right single
"‘",//left single
"’",//right single end of line
"—"//emdash
);
return preg_replace($pattern,$replace,$string);
}
That code outputs the following:
Array
(
[0] =>
[1] => <p>
[2] => "This"
[3] => <b>
[4] => is
[5] => </b>
[6] => a "string" with
[7] => <a href="http://somewhere.com">
[8] => quotes
[9] => </a>
[10] => in it.
[11] => <img src="blah.jpg" alt="This is an alt tag">
[12] =>
[13] => </p>
[14] =>
[15] => <p>
[16] => Whatever, you know?
[17] => </p>
[18] =>
)
“This” is a “string” with quotes in it. This is an alt tag
Whatever, you know?

php, strpos extract digit from string

I have a huge chunk of HTML to scan. Until now I have been using preg_match_all to extract the desired parts from it. The problem from the start was that it was extremely CPU-intensive, so we finally decided to use some other method for extraction. I read in some articles that preg_match can be compared in performance with strpos; they claim that strpos beats the regex scanner by up to 20 times in efficiency. I thought I would try this method, but I don't really know how to get started.
Let's say I have this HTML string:
<li id="ncc-nba-16451" class="che10">23 - Star</li>
<li id="ncd-bbt-5674" class="che10">54 - Moon</li>
<li id="ertw-cxda-c6543" class="che10">34,780 - Sun</li>
I want to extract only the number from each id and only the text (letters) from the content of the a tags, so I do this preg_match_all scan:
'/<li.*?id=".*?([\d]+)".*?<a.*?>.*?([\w]+)<\/a>/s'
here you can see the result: LINK
Now, if I wanted to replace my method with strpos, what would the approach look like? I understand that strpos returns the index at which the match starts. But how can I use it to:
get all possible matches, not just one
extract numbers or text from desired place in string
Thank you for all the help and tips ;)
Using DOM
$html = '
<html>
<head></head>
<body>
<li id="ncc-nba-16451" class="che10">23 - Star</li>
<li id="ncd-bbt-5674" class="che10">54 - Moon</li>
<li id="ertw-cxda-c6543" class="che10">34,780 - Sun</li>
</body>
</html>';
$dom_document = new DOMDocument();
$dom_document->loadHTML($html);
$rootElement = $dom_document->documentElement;
$getId = $rootElement->getElementsByTagName('li');
$res = [];
foreach($getId as $tag)
{
$data = explode('-',$tag->getAttribute('id'));
$res['li_id'][] = end($data);
}
$getNode = $rootElement->getElementsByTagName('a');
foreach($getNode as $tag)
{
$res['a_node'][] = $tag->parentNode->textContent;
}
print_r($res);
Output :
Array
(
[li_id] => Array
(
[0] => 16451
[1] => 5674
[2] => c6543
)
[a_node] => Array
(
[0] => 23 - Star
[1] => 54 - Moon
[2] => 34,780 - Sun
)
)
This regex finds a match in 24 steps using 0 backtracks
(?:id="[^\d]*(\d*))[^<]*(?:<a href="[^>]*>[^a-z]*([a-z]*))
The regex you posted requires 134 steps. Maybe you will notice a difference? Note that regex engines can apply optimizations that minimize backtracking. I used the RegexBuddy debugger to arrive at these numbers.
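For the strpos approach the question actually asks about, the usual pattern is a loop that keeps calling strpos with a moving offset. This is only a sketch under assumptions: the function name extract_items is made up, it expects the id attribute to come first in each <li> exactly as in the sample, and strip_tags is there in case the content is wrapped in <a> tags. Whether it is really faster than the regex would have to be measured on the real data:

```php
<?php
// Hypothetical helper: scans for <li id="..."> ... </li> using only string functions.
function extract_items(string $html): array
{
    $items  = [];
    $offset = 0;
    $open   = '<li id="';

    // "Get all possible matches": search again from where the last item ended.
    while (($start = strpos($html, $open, $offset)) !== false) {
        $idStart = $start + strlen($open);
        $idEnd   = strpos($html, '"', $idStart);           // closing quote of the id
        if ($idEnd === false) break;
        $tagEnd  = strpos($html, '>', $idEnd);             // end of the opening <li ...>
        if ($tagEnd === false) break;
        $close   = strpos($html, '</li>', $tagEnd);        // start of the closing tag
        if ($close === false) break;

        $id   = substr($html, $idStart, $idEnd - $idStart);
        $text = trim(strip_tags(substr($html, $tagEnd + 1, $close - $tagEnd - 1)));

        // Trailing digits of the id: strip them with rtrim, keep what was stripped.
        $number = (string) substr($id, strlen(rtrim($id, '0123456789')));

        // Label: whatever follows the last " - " in the text.
        $sep   = strrpos($text, ' - ');
        $label = $sep === false ? $text : substr($text, $sep + 3);

        $items[] = ['number' => $number, 'label' => $label];
        $offset  = $close + strlen('</li>');
    }
    return $items;
}

$html = '<li id="ncc-nba-16451" class="che10">23 - Star</li>
<li id="ncd-bbt-5674" class="che10">54 - Moon</li>
<li id="ertw-cxda-c6543" class="che10">34,780 - Sun</li>';

$items = extract_items($html);
print_r($items);
```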

PHP - Complicated Regex extraction

I have some strings to parse, and it's getting a little more complex.
<?php
$notecomments = '
This is the first of the notes, and so whatever comes later is appended.<br>
(<b>John Smith</b>) at <b class="datetimeGMT">2012-02-07 00:00:20 GMT</b><hr>This is a comment posted<br><br>(<b>Alex Boom</b>) at <b class="datetimeGMT">2013-02-07 00:08:06 GMT</b><hr>And let's put some more in here<br />with a new line.';
if(preg_match_all('/\(<b>(?:(?!\(<b>).)*/s', $notecomments, $matches)){
print_r($matches);
}
/* result of code:
Array
(
[0] => Array
(
[0] => (<b>John Smith</b>) at <b class="datetimeGMT">2012-02-07 00:00:20 GMT</b><hr>This is a comment posted<br><br>
[1] => (<b>Alex Boom</b>) at <b class="datetimeGMT">2013-02-07 00:08:06 GMT</b><hr>And let's put some more in here<br />with a new line.
)
)
*/
?>
I'm able to loop through "appended" notes, since I have indicators to work with in the preg_match_all regex rules.
However, many of my notes have text before the first iteration from my preg_match_all.
(in this case: "This is the first of the notes, and so whatever comes later is appended.")
My first goal was met. Which is the result of my code above. I'm extracting appended notes to the first note.
My next goal is to detect anything before the first iteration. And that's where I'm stuck. (detecting anything before the first iteration, in my regex statement above)
I use preg_replace_callback with two regexes for this, like so:
$notecomments = "This is the first of the notes, and so whatever comes later is appended.<br>(<b>John Smith</b>) at <b class=\"datetimeGMT\">2012-02-07 00:00:20 GMT</b><hr>This is a comment posted<br><br>(<b>Alex Boom</b>) at <b class=\"datetimeGMT\">2013-02-07 00:08:06 GMT</b><hr>And let's put some more in here<br />with a new line.";
$output=preg_replace_callback(array("~<b (.*?)>(.+?)</b>~si","~<b>(.+?)</b>~si"),function($matches){
if(isset($matches[2])){
print_r($matches[2]."\n");
}else{
print_r($matches[1]."\n");
}
return '';},' '.$notecomments.' ');
output:
2012-02-07 00:00:20 GMT
2013-02-07 00:08:06 GMT
John Smith
Alex Boom
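If the goal is only to get at the text that sits before the first appended note, one simple option is to cut the string at the first "(<b>" marker and run the existing preg_match_all on the remainder. A minimal sketch, assuming the literal "(<b>" never occurs inside the original note itself; the variable names $original, $rest and $appended and the shortened sample text are mine:

```php
<?php
$notecomments = 'This is the first of the notes.<br>'
    . '(<b>John Smith</b>) at <b class="datetimeGMT">2012-02-07 00:00:20 GMT</b><hr>A comment<br><br>'
    . '(<b>Alex Boom</b>) at <b class="datetimeGMT">2013-02-07 00:08:06 GMT</b><hr>More text';

// Everything before the first "(<b>" is the original, un-appended note.
$firstHeader = strpos($notecomments, '(<b>');
$original = $firstHeader === false ? $notecomments : substr($notecomments, 0, $firstHeader);
$rest     = $firstHeader === false ? '' : substr($notecomments, $firstHeader);

// The appended notes are extracted with the same regex as in the question.
preg_match_all('/\(<b>(?:(?!\(<b>).)*/s', $rest, $matches);
$appended = $matches[0];

echo "Original note: ", $original, "\n";
print_r($appended);
```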

PHP split content when a HTML element is found

I have a PHP variable that holds some HTML, and I want to be able to split the variable into two pieces. I want the split to take place when a second bold <strong> or <b> is found. Essentially, I have content that looks like this:
My content
This is my content. Some more bold content, that would spilt into another variable.
is this at all possible?
Something like this would basically work:
preg_split('/(<strong>|<b>)/', $html1, 3, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
Given your test string of:
$html1 = '<strong>My content</strong>This is my content.<b>Some more bold</b>content';
you'd end up with
Array (
[0] => <strong>
[1] => My content</strong>This is my content.
[2] => <b>
[3] => Some more bold</b>content
)
Now, if your sample string did NOT start with strong/b:
$html2 = 'like the first, but <strong>My content</strong>This is my content.<b>Some more bold</b>content, has some initial none-tag content';
Array (
[0] => like the first, but
[1] => <strong>
[2] => My content</strong>This is my content.
[3] => <b>
[4] => Some more bold</b>content, has some initial none-tag content
)
A simple test of whether element #0 is a tag or text then tells you where your "second tag and onwards" text starts: the second tag is element #2 or #3, and the text after it is element #3 or #4.
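Instead of testing element #0, one can also just count the tags while walking the split result. The following is a sketch of that idea, not the answerer's code: the function name split_at_second_bold is hypothetical, it splits without a limit, and like the regex above it only recognises bare <strong> and <b> tags without attributes:

```php
<?php
// Hypothetical helper: returns [before, from-second-bold-tag-onwards].
function split_at_second_bold(string $html): array
{
    $parts = preg_split('/(<strong>|<b>)/', $html, -1,
        PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);

    $seen = 0;
    foreach ($parts as $i => $part) {
        if ($part === '<strong>' || $part === '<b>') {
            $seen++;
            if ($seen === 2) {
                // Glue the pieces back together on either side of the second tag.
                return [
                    implode('', array_slice($parts, 0, $i)),
                    implode('', array_slice($parts, $i)),
                ];
            }
        }
    }
    return [$html, '']; // fewer than two bold tags: nothing to split off
}

$html1 = '<strong>My content</strong>This is my content.<b>Some more bold</b>content';
[$first, $second] = split_at_second_bold($html1);
echo $first, "\n", $second, "\n";
```

Because the tags are counted rather than located by index, the same function works whether or not the string starts with a tag.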
It is possible with 'positive lookbehind' in regular expressions. E.g., (?<=a)b matches the b (and only the b) in cab, but does not match bed or debt.
In your case, (?<=(\<strong|\<b)).*(\<strong|\<b) should do the trick. Use this regex in a preg_split() call and make sure to set PREG_SPLIT_DELIM_CAPTURE if you want those tags <b> or <strong> to be included.
If you really do need to split the string, the regular expression approach might work. Parsing HTML this way is fragile in many ways, though.
If you just want to know the second node that has either a strong or b tag, using a DOM is so much easier. Not only is the code very obvious, all the parsing bits are taken care of for you.
<?php
$testHtml = '<p><strong>My content</strong><br>
This is my content. <strong>Some more bold</strong> content, that would spilt into another variable.</p>
<p><b>This should not be found</b></p>';
$htmlDocument = new DOMDocument;
if ($htmlDocument->loadHTML($testHtml) === false) {
// crash and burn
die();
}
$xPath = new DOMXPath($htmlDocument);
$boldNodes = $xPath->query('//strong | //b');
$secondNodeIndex = 1;
if ($boldNodes->item($secondNodeIndex) !== null) {
$secondNode = $boldNodes->item($secondNodeIndex);
var_dump($secondNode->nodeValue);
} else {
// crash and burn
}

Regex to replace reg trademark

I need some help with regex:
I have some HTML output and I need to wrap all the registered trademark symbols (®) in <sup></sup>.
I can not insert the <sup> tag in title and alt properties and obviously I don't need to wrap regs that are already superscripted.
The following regex matches text that is not part of a HTML tag:
(?<=^|>)[^><]+?(?=<|$)
An example of what I'm looking for:
$original = '<div>asd® asdasd. asd<sup>®</sup>asd <img alt="qwe®qwe" /></div>';
The filtered string should output:
<div>asd<sup>®</sup> asdasd. asd<sup>®</sup>asd <img alt="qwe®qwe" /></div>
thanks a lot for your time!!!
Well, here is a simple way, if you agree to the following limitation:
Those regs that are already processed have the </sup> right after the ®
echo preg_replace('#®(?!\s*</sup>|[^<]*>)#','<sup>®</sup>', $s);
The logic behind it is:
we replace only those ® which are not followed by </sup> and...
which are not followed by a > symbol without an opening < symbol before it (that is, which are not inside a tag)
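To see both conditions at work, here is the same pattern wrapped in a small function and run against the example string from the question. The function name superscript_reg is only illustrative. The sketch inherits the stated limitation, and a literal > in the text after a ® (before the next <) would also make that ® look as if it were inside a tag:

```php
<?php
// Hypothetical wrapper around the pattern from this answer.
function superscript_reg(string $html): string
{
    // Skip a ® that is directly followed by </sup>, or that is followed by a ">"
    // with no "<" in between (which means we are inside a tag, e.g. an alt attribute).
    return preg_replace('#®(?!\s*</sup>|[^<]*>)#', '<sup>®</sup>', $html);
}

$original = '<div>asd® asdasd. asd<sup>®</sup>asd <img alt="qwe®qwe" /></div>';
$filtered = superscript_reg($original);
echo $filtered, "\n";
```

Running the function on its own output should leave it unchanged, since every ® in the text is then followed by </sup>.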
I would really use an HTML parser in place of regular expressions, since HTML is not regular and will present more edge cases than you can dream of (ignoring your contextual limitations that you've identified above).
You don't say what technology you're using. If you post that up, someone can undoubtedly recommend the appropriate parser.
Regex is not enough for what you want. First you must write code to identify whether content is the value of an attribute or a text node of an element. Then you must go through all that content and use some replace method. I am not sure what it is in PHP, but in JavaScript it would look something like:
content[i].replace(/\®/g, "<sup>®</sup>");
I agree with Brian that regular expressions are not a good way to parse HTML, but if you must use regular expressions, you could try splitting the string into tokens and then running your regexp on each token.
I'm using preg_split to split the string on HTML tags, as well as on the phrase <sup>®</sup>. This leaves tokens that are either plain text, an already superscripted ®, or a tag. Then, in each text token, ® can be replaced with <sup>®</sup>:
$regex = '/(<sup>®<\/sup>|<.*?>)/i';
$original = '<div>asd® asdasd. asd<sup>®</sup>asd <img alt="qwe®qwe" /></div>';
// we need to capture the tags so that the string can be rebuilt
$tokens = preg_split($regex, $original, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
/* $tokens => Array
(
[0] => <div>
[1] => asd® asdasd. asd
[2] => <sup>®</sup>
[3] => asd
[4] => <img alt="qwe®qwe" />
[5] => </div>
)
*/
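foreach ($tokens as &$token)
{
if ($token[0] == "<") continue; // Skip tokens that are tags (or an already superscripted ®)
$token = str_replace('®', '<sup>®</sup>', $token); // str_replace, not substr_replace, and it needs the subject string
}
unset($token); // break the reference left over from the foreach
$tokens = join("", $tokens); // reassemble the string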
// $tokens => "<div>asd<sup>®</sup> asdasd. asd<sup>®</sup>asd <img alt="qwe®qwe" /></div>"
Note that this is a naive approach, and if the output isn't formatted as expected it might not parse like you'd like (again, regular expression is not good for HTML parsing ;) )
