PHP - Complicated Regex extraction


I have some strings to parse, and it's getting a little more complex.
<?php
$notecomments = '
This is the first of the notes, and so whatever comes later is appended.<br>
(<b>John Smith</b>) at <b class="datetimeGMT">2012-02-07 00:00:20 GMT</b><hr>This is a comment posted<br><br>(<b>Alex Boom</b>) at <b class="datetimeGMT">2013-02-07 00:08:06 GMT</b><hr>And let's put some more in here<br />with a new line.';
if (preg_match_all('/\(<b>(?:(?!\(<b>).)*/s', $notecomments, $matches)) {
    print_r($matches);
}
/* result of code:
Array
(
[0] => Array
(
[0] => (<b>John Smith</b>) at <b class="datetimeGMT">2012-02-07 00:00:20 GMT</b><hr>This is a comment posted<br><br>
[1] => (<b>Alex Boom</b>) at <b class="datetimeGMT">2013-02-07 00:08:06 GMT</b><hr>And let's put some more in here<br />with a new line.
)
)
*/
?>
I'm able to loop through "appended" notes, since I have indicators to work with in the preg_match_all regex rules.
However, many of my notes have text before the first iteration from my preg_match_all.
(in this case: "This is the first of the notes, and so whatever comes later is appended.")
My first goal is met, as the result of the code above shows: I can extract the notes appended to the first note.
My next goal is to capture whatever comes before the first match of the regex above, and that's where I'm stuck.

I use preg_replace_callback with two regexes for this, like so:
$notecomments = "This is the first of the notes, and so whatever comes later is appended.<br>(<b>John Smith</b>) at <b class=\"datetimeGMT\">2012-02-07 00:00:20 GMT</b><hr>This is a comment posted<br><br>(<b>Alex Boom</b>) at <b class=\"datetimeGMT\">2013-02-07 00:08:06 GMT</b><hr>And let's put some more in here<br />with a new line.";
$output = preg_replace_callback(array("~<b (.*?)>(.+?)</b>~si", "~<b>(.+?)</b>~si"), function ($matches) {
    // The first pattern has two groups (attributes, text); the second has only the text.
    if (isset($matches[2])) {
        print_r($matches[2] . "\n");
    } else {
        print_r($matches[1] . "\n");
    }
    return '';
}, ' ' . $notecomments . ' ');
output:
2012-02-07 00:00:20 GMT
2013-02-07 00:08:06 GMT
John Smith
Alex Boom
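Neither regex above returns the text that precedes the first appended note, which is what the question asks for. One possible approach, sketched here under the assumption that every appended note starts with the literal marker (<b>, is to split the string in front of each marker with a zero-width lookahead and treat a first piece that lacks the marker as the original note. The names $parts and $original are illustrative, and the sample text is shortened.

```php
<?php
// Shortened sample in the same shape as the question's $notecomments.
$notecomments = 'This is the first of the notes.<br>'
    . '(<b>John Smith</b>) at <b class="datetimeGMT">2012-02-07 00:00:20 GMT</b><hr>This is a comment posted<br><br>'
    . '(<b>Alex Boom</b>) at <b class="datetimeGMT">2013-02-07 00:08:06 GMT</b><hr>And some more in here.';

// Split in front of every "(<b>" without consuming it (zero-width lookahead).
$parts = preg_split('/(?=\(<b>)/', $notecomments, -1, PREG_SPLIT_NO_EMPTY);

// If the first piece does not start with the marker, it is the original note.
$original = '';
if ($parts && strpos($parts[0], '(<b>') !== 0) {
    $original = array_shift($parts);
}

echo $original, "\n"; // text before the first appended note
print_r($parts);      // the appended notes, one per element
```

If the string starts directly with an appended note, $original stays empty and $parts holds every note.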

Related

PHP - Scraping a DIV Element from a Web Page using preg_match

I am trying to use preg_match currently just to retrieve 1 value (before I move onto retrieving multiple values), however, I am having no luck. When I perform a print_r() there is nothing stored in my array.
Here is my code what i am trying so far:
<?php
$content = '<div class="text-right font-90 text-default text-light last-updated vertical-offset-20">
Reported ETA Received:
<time datetime="2017-02-02 18:12">2017-02-02 18:12</time>
UTC
</div>';
preg_match('|Reported ETA Received: <time datetime=".+">(.*)</time>(.*)\(<span title=".+">(.*)<time datetime=".+">(.*)</time></span>\)|', $content, $reported_eta_received);
if ($reported_eta_received) {
    $arr_parsed['reported_eta_received'] = $reported_eta_received[1];
}
?>
Required Output:
2017-02-02 18:12
My above-mentioned code is not working. Any help on this regards would be appreciated. Thanks in advance.

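It may not match because there is a newline between "Reported ETA Received:" and the <time> tag, where your pattern has only a single space; use \s+ there instead of " " (\s already covers \n, \r and \t). Your pattern also requires a (<span title="..."> block after </time>, which the sample content does not contain, so it cannot match as written.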
Also, why don't you simply use:
preg_match('|<time datetime=".*?">(.*?)</time>|', $content, $reported_eta_received);
You can also use a named group, (?P<name>...), for easier access: an associative key stays stable, whereas numeric indexes shift if you add more capture groups.
preg_match('|<time datetime=".*?">(?P<name>.*?)</time>|', $content, $match);
print_r($match); // $match['name'] should be there if matched.
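Putting the two suggestions together, a minimal sketch run against the question's sample might look like the following. The group name eta and the shortened class attribute are illustrative choices, not part of the original code.

```php
<?php
$content = '<div class="last-updated">
Reported ETA Received:
<time datetime="2017-02-02 18:12">2017-02-02 18:12</time>
UTC
</div>';

$eta = null;
// \s* tolerates the newline after the label; [^"]* and [^<]* cannot run past the tag.
if (preg_match('|Reported ETA Received:\s*<time datetime="[^"]*">(?P<eta>[^<]*)</time>|', $content, $m)) {
    $eta = trim($m['eta']);
}

echo $eta, "\n"; // prints 2017-02-02 18:12
```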

php, strpos extract digit from string

I have a huge HTML document to scan. Until now I have been using preg_match_all to extract the desired parts from it. The problem from the start was that it consumed a lot of CPU time, so we finally decided to use some other method for extraction. I read some articles that compare preg_match with strpos and claim that strpos beats the regex scanner by up to 20 times in efficiency. I thought I would try this method, but I don't really know how to get started.
Let's say I have this HTML string:
<li id="ncc-nba-16451" class="che10">23 - Star</li>
<li id="ncd-bbt-5674" class="che10">54 - Moon</li>
<li id="ertw-cxda-c6543" class="che10">34,780 - Sun</li>
I want to extract only the number from each id and only the text (letters) from the content of the a tags, so I do this preg_match_all scan:
'/<li.*?id=".*?([\d]+)".*?<a.*?>.*?([\w]+)<\/a>/s'
here you can see the result: LINK
Now, if I wanted to replace my method with strpos, what would the approach look like? I understand that strpos returns the index where the match starts. But how can I use it to:
get all possible matches, not just one
extract numbers or text from a desired place in the string
Thank you for all the help and tips ;)

Using DOM
$html = '
<html>
<head></head>
<body>
<li id="ncc-nba-16451" class="che10">23 - Star</li>
<li id="ncd-bbt-5674" class="che10">54 - Moon</li>
<li id="ertw-cxda-c6543" class="che10">34,780 - Sun</li>
</body>
</html>';
$dom_document = new DOMDocument();
$dom_document->loadHTML($html);
$rootElement = $dom_document->documentElement;
$getId = $rootElement->getElementsByTagName('li');
$res = [];
foreach ($getId as $tag) {
    $data = explode('-', $tag->getAttribute('id'));
    $res['li_id'][] = end($data);
}
$getNode = $rootElement->getElementsByTagName('a');
foreach ($getNode as $tag) {
    $res['a_node'][] = $tag->parentNode->textContent;
}
print_r($res);
Output :
Array
(
[li_id] => Array
(
[0] => 16451
[1] => 5674
[2] => c6543
)
[a_node] => Array
(
[0] => 23 - Star
[1] => 54 - Moon
[2] => 34,780 - Sun
)
)

This regex finds a match in 24 steps with 0 backtracks:
(?:id="[^\d]*(\d*))[^<]*(?:<a href="[^>]*>[^a-z]*([a-z]*))
The regex you posted requires 134 steps, so you may notice a difference. Note that regex engines can apply optimizations that minimize backtracking. I used the RegexBuddy debugger to get these numbers.
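For the strpos approach the question asks about, a rough sketch follows. strpos accepts a start offset as its third argument, so calling it in a loop and advancing the offset after each hit finds every occurrence. The helper name extractItems is hypothetical, the code assumes each item has the form <li id="...">text</li> as in the sample, and keeping only the trailing digits of the id is one interpretation of "only the number". Whether this is actually faster than the regex would need to be measured.

```php
<?php
// Hypothetical helper: walks the string with strpos() instead of a regex.
function extractItems(string $html): array
{
    $items = [];
    $offset = 0;
    $needle = '<li id="';
    while (($start = strpos($html, $needle, $offset)) !== false) {
        $idStart = $start + strlen($needle);
        $idEnd = strpos($html, '"', $idStart);
        $textStart = $idEnd === false ? false : strpos($html, '>', $idEnd);
        $textEnd = $textStart === false ? false : strpos($html, '</li>', $textStart);
        if ($textEnd === false) {
            break; // malformed tail: stop instead of guessing
        }
        $id = substr($html, $idStart, $idEnd - $idStart);
        // Take the piece after the last dash, then drop leading non-digits.
        $dash = strrpos($id, '-');
        $tail = $dash === false ? $id : substr($id, $dash + 1);
        $items[] = [
            'number' => substr($tail, strcspn($tail, '0123456789')),
            'text'   => trim(substr($html, $textStart + 1, $textEnd - $textStart - 1)),
        ];
        $offset = $textEnd; // continue searching after this item
    }
    return $items;
}

$html = '<li id="ncc-nba-16451" class="che10">23 - Star</li>
<li id="ncd-bbt-5674" class="che10">54 - Moon</li>
<li id="ertw-cxda-c6543" class="che10">34,780 - Sun</li>';

$items = extractItems($html);
print_r($items);
```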

PHP - Advanced Regex Help needed

So I have many large text paragraphs to parse.
The end goal is to separate the paragraphs into smaller postings, so I can insert them into mysql.
Here's a very short example of one of the paragraphs in a string:
<?php
$longstring = '
(<b>John Smith</b>) at <b class="datetimeGMT">2011-01-10 22:13:01 GMT</b><hr>
Lots of text entered here under the first line.<br>And most of it is html, since it is for displaying in a web browser.<br></br></br>
(<b>Alan Slappy</b>) at <b class="datetimeGMT">2011-01-11 13:12:00 GMT</b><hr>
Forgot to put one more thing in the notes.........<br>blah blah blah
(<b>Joe Mama</b>) at <b class="datetimeGMT">2011-01-13 10:15:00 GMT</b><hr>
Groceries list:<br>Watermelons<br>Floss<br><br>email doctor
';
?>
Yep, I have a freaky project of parsing these strings for each entry.
Yes, I agree with anyone who says this is not a cool task. The original developer allowed appending text to the original text. Not a bad idea on some occasions, but for me it is.
I do need help with how to RegEx this beast and place it into a foreach loop so I can start cleaning it up.
Here's how far I got:
<?php
if (preg_match_all('/\(<b>.*?<hr>/', $longstring, $matches)) {
    print_r($matches);
}
/* output:
Array
(
[0] => Array
(
[0] => (<b>John Smith</b>) at <b class="datetimeGMT">2011-01-10 22:13:01 GMT</b><hr>
[1] => (<b>Alan Slappy</b>) at <b class="datetimeGMT">2011-01-11 13:12:00 GMT</b><hr>
[2] => (<b>Joe Mama</b>) at <b class="datetimeGMT">2011-01-13 10:15:00 GMT</b><hr>
)
)
*/
?>
So, I'm actually doing pretty good with looping through the tops of each entry. I'm kinda proud I figured that out. (regex is my nemesis)
So now I'm stuck figuring out how to include the actual text below each iteration.
Anyone have an idea on how I can adjust the preg_match_all to account for the text below each "header"?

Try preg_split instead:
$matches = preg_split("/\s*(\(<b>.*?<hr>)\s*/s", trim($longstring), -1, PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE);
print_r($matches);
Note: trim is applied to your string to cut leading and trailing whitespace. The limit argument is -1 (no limit); passing null there is deprecated as of PHP 8.1.
Result will be something like
Array
(
[0] => (<b>John Smith</b>) at <b class="datetimeGMT">2011-01-10 22:13:01 GMT</b><hr>
[1] => Lots of text entered here under the first line.<br>And most of it is html, since it is for displaying in a web browser.<br></br></br>
[2] => (<b>Alan Slappy</b>) at <b class="datetimeGMT">2011-01-11 13:12:00 GMT</b><hr>
[3] => Forgot to put one more thing in the notes.........<br>blah blah blah
[4] => (<b>Joe Mama</b>) at <b class="datetimeGMT">2011-01-13 10:15:00 GMT</b><hr>
[5] => Groceries list:<br>Watermelons<br>Floss<br><br>email doctor
)
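Since the goal is one database row per posting, the flat header/body list can be paired up afterwards. The sketch below assumes the string starts with a header and that every header is followed by non-empty text (otherwise PREG_SPLIT_NO_EMPTY would shift the pairs). The array keys author, posted_at and body are illustrative, and the sample is shortened.

```php
<?php
$longstring = '
(<b>John Smith</b>) at <b class="datetimeGMT">2011-01-10 22:13:01 GMT</b><hr>
First entry text.<br>More text.
(<b>Alan Slappy</b>) at <b class="datetimeGMT">2011-01-11 13:12:00 GMT</b><hr>
Second entry text.
';

$parts = preg_split(
    '/\s*(\(<b>.*?<hr>)\s*/s',
    trim($longstring),
    -1,
    PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE
);

$posts = [];
// Elements alternate header, body, header, body, ...
foreach (array_chunk($parts, 2) as $chunk) {
    $header = $chunk[0];
    $body = $chunk[1] ?? '';
    // Pull the author and timestamp out of the header line.
    if (preg_match('~\(<b>(.*?)</b>\) at <b class="datetimeGMT">(.*?)</b>~s', $header, $m)) {
        $posts[] = ['author' => $m[1], 'posted_at' => $m[2], 'body' => $body];
    }
}

print_r($posts);
```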

Try this
if (preg_match_all('/\(<b>(?:(?!\(<b>).)*/s', $longstring, $matches)) {
    print_r($matches);
}

This is going to be easier if you parse the HTML rather than just trying to regex it, unless you can guarantee the format of the HTML.
You might want to look at Robust and Mature HTML Parser for PHP.

Unusual behaviour of regex

My Setup:
index.php:
<?php
$page = file_get_contents('a.html');
$arr = array();
preg_match('/<td class=\"myclass\">(.*)\<\/td>/s',$page,$arr);
print_r($arr);
?>
a.html:
...other content
<td class="myclass">
THE
CONTENT
</td>
other content...
Output:
Array
(
[0] => Array
(
)
)
If I change the line 4 of index.php to:
preg_match('/<td class=\"myclass\">(.*)\<\/t/s',$page,$arr);
The output is:
Array
(
[0] => <td class="myclass">
THE
CONTENT
</t
[1] =>
THE
CONTENT
)
I can't make out what's wrong. Please help me match the content between <td class="myclass"> and </td>.

Your code appears to work. I edited the regex to use a different separator and get a clearer view. You may want to use the ungreedy modifier in case there is more than one myclass TD in your HTML.
I have not been able to reproduce the "array of array" behaviour you note, unless I manipulate the code to add an error -- see at bottom.
<?php
$page = <<<PAGE
...other content
<td class="myclass">
THE
CONTENT
</td>
other content...
PAGE;
preg_match('#<td class="myclass">(.*)</td>#s',$page,$arr);
print_r($arr);
?>
returns, as expected:
Array
(
[0] => <td class="myclass">
THE
CONTENT
</td>
[1] =>
THE
CONTENT
)
The code below is similar to yours but has been modified to cause an identical error. Doesn't seem likely you did this, though. The regexp is modified in order to not match, and the resulting empty array is stored into $arr[0] instead of $arr.
preg_match('#<td class="myclass">(.*)</ td>#s',$page,$arr[0]);
Returns the same error you observe:
Array
(
[0] => Array
(
)
)
I can duplicate the same behaviour you observe (works with </t, does not work with </td>) if I use your regexp, but modify the HTML to have </t d>. I still need to write to $arr[0] instead of $arr if I also want to get an identical output.
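The ungreedy suggestion at the top of this answer can be illustrated with a small sketch. The two-cell $page string is made up for the example; it is not the asker's a.html.

```php
<?php
// Made-up input with two cells of the same class.
$page = '<tr><td class="myclass">first</td><td class="myclass">second</td></tr>';

// Greedy: (.*) runs to the LAST </td>, swallowing both cells.
preg_match('#<td class="myclass">(.*)</td>#s', $page, $greedy);

// Lazy: (.*?) stops at the first </td> that lets the pattern match.
preg_match('#<td class="myclass">(.*?)</td>#s', $page, $lazy);

// preg_match_all with the lazy pattern collects every cell.
preg_match_all('#<td class="myclass">(.*?)</td>#s', $page, $all);

echo $greedy[1], "\n"; // first</td><td class="myclass">second
echo $lazy[1], "\n";   // first
print_r($all[1]);      // first, second
```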

Do you understand that the 3rd parameter of preg_match receives the matches? Element 0 contains the full match, and the following elements contain the captured subpatterns.
http://ca3.php.net/manual/en/function.preg-match.php
If matches is provided, then it is filled with the results of search. $matches[0] will contain the text that matched the full pattern, $matches[1] will have the text that matched the first captured parenthesized subpattern, and so on.
This code
preg_match('/<td class=\"myclass\">(.*)\<\/t/s',$page,$arr);
When applied on
...other content
<td class="myclass">
THE
CONTENT
</td>
other content...
Will return the match in $arr[0] and the result of (.*) in $arr[1]. This result is correct: There is your content in [1]
Array
(
[0] => <td class="myclass">
THE
CONTENT
</t
[1] =>
THE
CONTENT
)
Example two
<?php
header('Content-Type: text/plain');
$page = 'A B C D E F';
$arr = array();
preg_match('/C (D) E/', $page, $arr);
print_r($arr);
Example output
Array
(
[0] => C D E // This is the string found
[1] => D // this is what I wanted to look for and extracted out of [0], the matched parenthesis
)

Your regex seems correct. Isn't the syntax of preg_match as follows?
preg_match('/<td class=\"myclass\">(.*)\<\/td>/s',$page,$arr);
The | in the regex represents or

PHP split content when a HTML element is found

I have a PHP variable that holds some HTML. I want to be able to split the variable into two pieces, with the split taking place where a second bold tag (<strong> or <b>) is found. Essentially, if I have content that looks like this:
My content
This is my content. Some more bold content, that would split into another variable.
Is this at all possible?

Something like this would basically work:
preg_split('/(<strong>|<b>)/', $html1, 3, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
Given your test string of:
$html1 = '<strong>My content</strong>This is my content.<b>Some more bold</b>content';
you'd end up with
Array (
[0] => <strong>
[1] => My content</strong>This is my content.
[2] => <b>
[3] => Some more bold</b>content
)
Now, if your sample string did NOT start with strong/b:
$html2 = 'like the first, but <strong>My content</strong>This is my content.<b>Some more bold</b>content, has some initial none-tag content';
Array (
[0] => like the first, but
[1] => <strong>
[2] => My content</strong>This is my content.
[3] => <b>
[4] => Some more bold</b>content, has some initial none-tag content
)
Then a simple test of whether element #0 is a tag or plain text tells you where your "second tag and onwards" text starts (element #3 or element #4).
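That test could be sketched as follows. The function name splitAtSecondBold is hypothetical, and this version keeps the second tag itself at the start of the second piece, so it cuts one element earlier than the text positions mentioned above.

```php
<?php
// Hypothetical helper: returns [before second bold tag, second bold tag onwards].
function splitAtSecondBold(string $html): array
{
    $parts = preg_split('/(<strong>|<b>)/', $html, 3, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY) ?: [];

    // If element 0 is a tag, the second tag sits at index 2; otherwise at index 3.
    $startsWithTag = in_array($parts[0] ?? '', ['<strong>', '<b>'], true);
    $cut = $startsWithTag ? 2 : 3;

    return [
        implode('', array_slice($parts, 0, $cut)),
        implode('', array_slice($parts, $cut)),
    ];
}

$html1 = '<strong>My content</strong>This is my content.<b>Some more bold</b>content';
$html2 = 'like the first, but <strong>My content</strong>This is my content.<b>Some more bold</b>content';

$split1 = splitAtSecondBold($html1);
$split2 = splitAtSecondBold($html2);
print_r($split1);
print_r($split2);
```

If the input contains fewer than two bold tags, the second piece comes back as an empty string.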

It is possible with 'positive lookbehind' in regular expressions. E.g., (?<=a)b matches the b (and only the b) in cab, but does not match bed or debt.
In your case, (?<=(\<strong|\<b)).*(\<strong|\<b) should do the trick. Use this regex in a preg_split() call and make sure to set PREG_SPLIT_DELIM_CAPTURE if you want those tags <b> or <strong> to be included.

If you truly really need to split the string, the regular expression approach might work. There are many fragilities about parsing HTML, though.
If you just want to know the second node that has either a strong or b tag, using a DOM is so much easier. Not only is the code very obvious, all the parsing bits are taken care of for you.
<?php
$testHtml = '<p><strong>My content</strong><br>
This is my content. <strong>Some more bold</strong> content, that would spilt into another variable.</p>
<p><b>This should not be found</b></p>';
$htmlDocument = new DOMDocument;
if ($htmlDocument->loadHTML($testHtml) === false) {
    // crash and burn
    die();
}
$xPath = new DOMXPath($htmlDocument);
$boldNodes = $xPath->query('//strong | //b');
$secondNodeIndex = 1;
if ($boldNodes->item($secondNodeIndex) !== null) {
    $secondNode = $boldNodes->item($secondNodeIndex);
    var_dump($secondNode->nodeValue);
} else {
    // crash and burn
}
