Unusual behaviour of regex

My Setup:
index.php:
<?php
$page = file_get_contents('a.html');
$arr = array();
preg_match('/<td class=\"myclass\">(.*)\<\/td>/s',$page,$arr);
print_r($arr);
?>
a.html:
...other content
<td class="myclass">
THE
CONTENT
</td>
other content...
Output:
Array
(
[0] => Array
(
)
)
If I change line 4 of index.php to:
preg_match('/<td class=\"myclass\">(.*)\<\/t/s',$page,$arr);
The output is:
Array
(
[0] => <td class="myclass">
THE
CONTENT
</t
[1] =>
THE
CONTENT
)
I can't make out what's wrong. Please help me match the content between <td class="myclass"> and </td>.

Your code appears to work. I edited the regex to use a different separator and get a clearer view. You may want to use the ungreedy modifier in case there is more than one myclass TD in your HTML.
I have not been able to reproduce the "array of array" behaviour you note, unless I manipulate the code to add an error -- see at bottom.
<?php
$page = <<<PAGE
...other content
<td class="myclass">
THE
CONTENT
</td>
other content...
PAGE;
preg_match('#<td class="myclass">(.*)</td>#s',$page,$arr);
print_r($arr);
?>
returns, as expected:
Array
(
[0] => <td class="myclass">
THE
CONTENT
</td>
[1] =>
THE
CONTENT
)
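The ungreedy suggestion above can be illustrated with a minimal sketch. The two-cell input is a hypothetical example, not the asker's a.html; it only shows how a greedy (.*) differs from a lazy (.*?) when more than one myclass cell is present.

```php
<?php
// Hypothetical input with two matching cells (not taken from the question).
$page = '<td class="myclass">FIRST</td><td class="myclass">SECOND</td>';

// Greedy: (.*) runs to the LAST </td>, so both cells end up in one capture.
preg_match('#<td class="myclass">(.*)</td>#s', $page, $greedy);

// Lazy: (.*?) stops at the first </td>. The U modifier applied to (.*)
// has the same effect on this pattern.
preg_match('#<td class="myclass">(.*?)</td>#s', $page, $lazy);

echo $greedy[1], "\n"; // FIRST</td><td class="myclass">SECOND
echo $lazy[1], "\n";   // FIRST
```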
The code below is similar to yours but has been modified to cause an identical error. Doesn't seem likely you did this, though. The regexp is modified in order to not match, and the resulting empty array is stored into $arr[0] instead of $arr.
preg_match('#<td class="myclass">(.*)</ td>#s',$page,$arr[0]);
Returns the same error you observe:
Array
(
[0] => Array
(
)
)
I can duplicate the same behaviour you observe (works with </t, does not work with </td>) if I use your regexp, but modify the HTML to have </t d>. I still need to write to $arr[0] instead of $arr if I also want to get an identical output.
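One way to tell these cases apart is to check the return value of preg_match before reading the matches. The sketch below assumes the HTML from the question, inlined as a string; the variable names $result and $nested are illustrative only.

```php
<?php
$page = "<td class=\"myclass\">\nTHE\nCONTENT\n</td>";

// preg_match returns 1 on a match, 0 on no match, and false on error.
$result = preg_match('#<td class="myclass">(.*)</td>#s', $page, $arr);

if ($result === 1) {
    echo trim($arr[1]), "\n";
} elseif ($result === 0) {
    echo "no match\n"; // $arr would be an empty array here
} else {
    echo "regex error code: ", preg_last_error(), "\n";
}

// Writing the matches into $nested[0] with a non-matching pattern
// reproduces the "array of array" output: $nested becomes [0 => []].
$nested = [];
preg_match('#<td class="myclass">(.*)</ td>#s', $page, $nested[0]);
print_r($nested);
```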

Do you understand that the third parameter of preg_match receives the matches? Its first element contains the full match, and the following elements contain the captured subpatterns.
http://ca3.php.net/manual/en/function.preg-match.php
If matches is provided, then it is filled with the results of search. $matches[0] will contain the text that matched the full pattern, $matches[1] will have the text that matched the first captured parenthesized subpattern, and so on.
This code
preg_match('/<td class=\"myclass\">(.*)\<\/t/s',$page,$arr);
When applied on
...other content
<td class="myclass">
THE
CONTENT
</td>
other content...
This will return the full match in $arr[0] and the result of (.*) in $arr[1]. This result is correct: your content is in [1].
Array
(
[0] => <td class="myclass">
THE
CONTENT
</t
[1] =>
THE
CONTENT
)
Example two
<?php
header('Content-Type: text/plain');
$page = 'A B C D E F';
$arr = array();
preg_match('/C (D) E/', $page, $arr);
print_r($arr);
Example output
Array
(
[0] => C D E // This is the string found
[1] => D // this is what I wanted to look for and extracted out of [0], the matched parenthesis
)

Your regex seems correct. Isn't the syntax of preg_match as follows?
preg_match('/<td class=\"myclass\">(.*)\<\/td>/s',$page,$arr);
The | in the regex represents or

Related

Regex to match placeholders that contain HTML within them

I have placeholders that users can insert into a WYSIWYG editor (which contains HTML code). Sometimes when they paste from apps like Word etc it injects HTML within them.
Eg: It pastes %<span>firstname</span>% instead of %firstname%.
Here is an example of my regex code:
$html = '
<p>%firstname%</p>
<p>%<span>firstname</span>%</p>
<p>%<span class="blah">firstname</span>%</p>
<p>%<span><span>firstname</span></span>%</p>
<p>%<span><span><span>firstname</span></span></span>%</p>
<p>%<span class="blah"><span>firstname</span></span>%</p>
<div>other random <strong>HTML</strong> that needs to be preserved.</div>
';
preg_match_all(
'/\%(?![0-9])((?:<[^<]+?>)?[a-zA-z0-9_-]+(?:[\s]?<[^<]+?>)?)\%/U',
$html,
$matches
);
echo '<pre>';
print_r($matches);
echo '</pre>';
Which outputs the following:
Array
(
[0] => Array
(
[0] => %firstname%
[1] => %firstname%
[2] => %firstname%
)
[1] => Array
(
[0] => firstname
[1] => firstname
[2] => firstname
)
)
As soon as there is more than one span inside the placeholder it doesn't work. I'm not quite sure what to adjust in my regex.
/\%(?![0-9])((?:<[^<]+?>)?[a-zA-z0-9_-]+(?:[\s]?<[^<]+?>)?)\%/U
How would I achieve this?

Try this Regex. It should help you out!
/\%(?![0-9])(?:<[^<]+?>)*([a-zA-Z0-9_-]+)(?:[\s]?<\/[^<]+?>)*\%/U
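As a rough sketch of how such a pattern could be used to clean the placeholders in place, the example below uses a variant of my own rather than the exact pattern above: it drops the U modifier and allows any number of opening and closing tags around the name. The shortened $html sample and the variable names are assumptions for illustration.

```php
<?php
$html = '
<p>%firstname%</p>
<p>%<span>firstname</span>%</p>
<p>%<span class="blah"><span>firstname</span></span>%</p>
<div>other random <strong>HTML</strong> that needs to be preserved.</div>
';

// Any number of tags, then the placeholder name, then any number of closing tags.
$pattern = '/%(?![0-9])(?:<[^<>]+>)*([a-zA-Z0-9_-]+)(?:\s?<\/[^<>]+>)*%/';

// Rewrite every placeholder as %name%, leaving the rest of the HTML alone.
$clean = preg_replace($pattern, '%$1%', $html);
echo $clean;
```

Because only the text between the two percent signs is rewritten, markup outside the placeholders, such as the strong element, is left untouched.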

You could use a parser and the textContent property if it is a WYSIWYG editor anyway:
<?php
$html = '
<p>%firstname%</p>
<p>%<span>firstname</span>%</p>
<p>%<span class="blah">firstname</span>%</p>
<p>%<span><span>firstname</span></span>%</p>
<p>%<span><span><span>firstname</span></span></span>%</p>
<p>%<span class="blah"><span>firstname</span></span>%</p>
<div>A cool div with %firstname%</div>
<span>And a very neat span with %firstname%</span>';
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
# query only root elements here
$containers = $xpath->query("/*");
foreach ($containers as $container) {
echo $container->textContent . "\n";
}
?>
This outputs %firstname% a couple of times, see a demo on ideone.com.

Do you really need a regex for this? You could have simply used strip_tags() here.
Try this:
echo strip_tags($html);
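One caveat worth sketching: strip_tags() removes every tag by default, including markup the question wants to preserve. Its second parameter takes a list of tags to keep, which may be enough if the pasted wrappers are always span elements. The shortened $html below is a hypothetical sample.

```php
<?php
$html = '<p>%<span>firstname</span>%</p><div>other <strong>HTML</strong> to keep.</div>';

// Default: all tags are removed, including <strong>.
$plain = strip_tags($html);
echo $plain, "\n";

// Keep a whitelist of tags; <span> is not listed, so it is stripped.
$kept = strip_tags($html, '<p><div><strong>');
echo $kept, "\n";
```

Note that this also strips span elements outside placeholders, so it only fits when no other span needs to survive.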

how to php preg_match_all find exact words

I am trying to get the exact word 846002 from HTML code by using preg_match_all.
PHP code:
<?php
$sText =file_get_contents('test.html', true);
preg_match_all('#download.php/.[^\s.,"\']+#i', $sText, $aMatches);
print_r($aMatches[0][0]);
?>
test.html
<tr>
<td class=ac>
<img src="/dl3.png" alt="Download" title="Download">
</td>
</tr>
output:
download.php/829685/Dark
But I want to output only:
829685

Just add the slash in the character class and capture in group 1:
preg_match_all('#download.php/([^\s.,"\'/]+)#i', $sText, $aMatches);
The value you're looking for is in $aMatches[1][0].
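A minimal runnable sketch of this approach follows. The link markup is hypothetical, since the anchor element in the question's test.html is not shown; the dot in download.php is also escaped here so that it matches a literal dot only.

```php
<?php
// Hypothetical markup: the href below is an assumption, not the asker's file.
$sText = '<td class=ac><a href="/download.php/829685/Dark.zip">'
       . '<img src="/dl3.png" alt="Download" title="Download"></a></td>';

// Group 1 captures everything after "download.php/" up to the next slash.
if (preg_match_all('#download\.php/([^\s.,"\'/]+)#i', $sText, $aMatches)) {
    echo $aMatches[1][0], "\n"; // 829685
}
```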

You need to create a capture group with ( and ):
preg_match_all ( '#download.php/([^/]+)#i', $sText, $aMatches );
print_r ( $aMatches[1][0] );

Using RegEx to Capture All Links & In Between Text From A String

<Link to: http://www.someurl(.+)> maybe some text here(.*) <Link: www.someotherurl(.+)> maybe even more text(.*)
Given that this is all on one line, how can I match or better yet extract all full urls and text? ie. for this example I wish to extract:
http://www.someurl(.+) . maybe some text here(.*) . www.someotherurl(.+) . maybe even more text(.*)
Basically, <Link.*:.* would start each link capture and > would end it. Then all text after the first capture would be captured as well up until zero or more occurrences of the next link capture.
I have tried:
preg_match_all('/<Link.*?:.*?(https|http|www)(.+?)>(.*?)/', $v1, $m4);
but I need a way to capture the text after the closing >. The problem is that there may or may not be another link after the first one (of course there could also be no links to begin with!).

$string = "<Link to: http://www.someurl(.+)> maybe some text here(.*) <Link: www.someotherurl(.+)> maybe even more text(.*)";
$string = preg_split('~<link(?: to)?:\s*([^>]+)>~i',$string,-1,PREG_SPLIT_DELIM_CAPTURE|PREG_SPLIT_NO_EMPTY);
echo "<pre>";
print_r($string);
output:
Array
(
[0] => http://www.someurl(.+)
[1] => maybe some text here(.*)
[2] => www.someotherurl(.+)
[3] => maybe even more text(.*)
)

You can use this pattern:
preg_match_all('~<link\b[^:]*:\s*\K(?<link>[^\s>]++)[^>]*>\s*(?<text>[^<]++)~i',
$txt, $matches, PREG_SET_ORDER);
foreach($matches as $match) {
printf("<br/>link: %s\n<br/>text: %s", $match['link'], $match['text']);
}

PHP preg_match_all - how to get a content from HTML?

$contents contains an HTML document:
$contents = curl_exec($ch);
I need to get the content from:
<span class="Menu1">Artur €2000</span>
It's repeated several times, so I want to save it into an array.
I tried to do it this way:
preg_match_all('<span class=\"Menu1\">(.*?)</span>#si',$contents,$wynik2);
But I've got an error
Warning: preg_match_all() [function.preg-match-all]: Unknown modifier '('
Can you guys help me please?
EDIT: $contents = curl_exec ($ch)
SOLVED: The error was caused by wrong HTML on the cURLed website:
<span class="Menu1">Content</tr>
instead of:
<span class="Menu1">Content</span>
I didn't expect that someone could write wrong HTML. Thank you guys for your help!

You forgot the first delimiter (#):
$contents = '<span class="Menu1">Artur $2000</span> somehtml <span class="Menu1">Mark $1000</span>';
preg_match_all('#<span class="Menu1">(.*?)</span>#si', $contents, $wynik2);
print_r($wynik2);
/*
Array
(
[0] => Array
(
[0] => <span class="Menu1">Artur $2000</span>
[1] => <span class="Menu1">Mark $1000</span>
)
[1] => Array
(
[0] => Artur $2000
[1] => Mark $1000
)
)
*/
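To see why the missing delimiter produces that particular warning, here is a small sketch. PHP accepts bracket-style delimiters, so without the leading # the < is taken as the opening delimiter and the first > as the closing one; everything after it, starting with (, is then read as modifiers. The input string and variable names are illustrative.

```php
<?php
$contents = '<span class="Menu1">Artur</span>';

// No leading '#': the pattern is read as <...>, and '(' becomes an unknown modifier.
// The @ only suppresses the warning so the return value can be inspected.
$broken = @preg_match_all('<span class="Menu1">(.*?)</span>#si', $contents, $m1);
var_dump($broken); // bool(false)

// With both delimiters in place the call succeeds.
$fixed = preg_match_all('#<span class="Menu1">(.*?)</span>#si', $contents, $m2);
echo $fixed, ' match: ', $m2[1][0], "\n"; // 1 match: Artur
```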

You should put the "|" sign at the start and the end of your regular expression:
preg_match_all("|<span class=\"Menu1\">(.*?)</span>|U",$contents,$wynik2);

PHP split content when a HTML element is found

I have a PHP variable that holds some HTML. I want to be able to split the variable into two pieces, and I want the split to take place when a second bold <strong> or <b> is found. Essentially, if I have content that looks like this,
My content
This is my content. Some more bold content, that would spilt into another variable.
Is this at all possible?

Something like this would basically work:
preg_split('/(<strong>|<b>)/', $html1, 3, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
Given your test string of:
$html1 = '<strong>My content</strong>This is my content.<b>Some more bold</b>content';
you'd end up with
Array (
[0] => <strong>
[1] => My content</strong>This is my content.
[2] => <b>
[3] => Some more bold</b>content
)
Now, if your sample string did NOT start with strong/b:
$html2 = 'like the first, but <strong>My content</strong>This is my content.<b>Some more bold</b>content, has some initial none-tag content';
Array (
[0] => like the first, but
[1] => <strong>
[2] => My content</strong>This is my content.
[3] => <b>
[4] => Some more bold</b>content, has some initial none-tag content
)
A simple test of whether element #0 is a tag or text then determines where your "second tag and onwards" text starts (element #3 or element #4).
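The test described above could be wrapped up roughly as follows. splitAtSecondBold is a hypothetical helper name, and instead of checking only element #0 it scans the pieces for the second tag, which is a small variation on the idea.

```php
<?php
// Hypothetical helper: returns [text before the second bold tag, the rest].
function splitAtSecondBold(string $html): array
{
    $parts = preg_split('/(<strong>|<b>)/', $html, 3,
        PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);

    $tagsSeen = 0;
    foreach ($parts as $i => $part) {
        // Count opening bold tags and split at the second one.
        if (($part === '<strong>' || $part === '<b>') && ++$tagsSeen === 2) {
            return [
                implode('', array_slice($parts, 0, $i)),
                implode('', array_slice($parts, $i)),
            ];
        }
    }
    return [$html, '']; // fewer than two bold tags: nothing to split off
}

[$first, $second] = splitAtSecondBold(
    '<strong>My content</strong>This is my content.<b>Some more bold</b>content'
);
echo $first, "\n";  // <strong>My content</strong>This is my content.
echo $second, "\n"; // <b>Some more bold</b>content
```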

It is possible with 'positive lookbehind' in regular expressions. E.g., (?<=a)b matches the b (and only the b) in cab, but does not match bed or debt.
In your case, (?<=(\<strong|\<b)).*(\<strong|\<b) should do the trick. Use this regex in a preg_split() call and make sure to set PREG_SPLIT_DELIM_CAPTURE if you want those tags <b> or <strong> to be included.
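The lookbehind behaviour described for (?<=a)b can be checked with a short sketch. It exercises only that simple example; the longer pattern suggested for the split is not tested here.

```php
<?php
$pattern = '/(?<=a)b/';

// Only a "b" that is directly preceded by "a" matches.
foreach (['cab', 'bed', 'debt'] as $word) {
    $found = preg_match($pattern, $word, $m);
    printf("%s: %s\n", $word, $found ? "matched '{$m[0]}'" : 'no match');
}
// cab: matched 'b'
// bed: no match
// debt: no match
```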

If you really do need to split the string, the regular expression approach might work. Parsing HTML with regular expressions is fragile in many ways, though.
If you just want to know the second node that has either a strong or b tag, using a DOM is so much easier. Not only is the code very obvious, all the parsing bits are taken care of for you.
<?php
$testHtml = '<p><strong>My content</strong><br>
This is my content. <strong>Some more bold</strong> content, that would spilt into another variable.</p>
<p><b>This should not be found</b></p>';
$htmlDocument = new DOMDocument;
if ($htmlDocument->loadHTML($testHtml) === false) {
// crash and burn
die();
}
$xPath = new DOMXPath($htmlDocument);
$boldNodes = $xPath->query('//strong | //b');
$secondNodeIndex = 1;
if ($boldNodes->item($secondNodeIndex) !== null) {
$secondNode = $boldNodes->item($secondNodeIndex);
var_dump($secondNode->nodeValue);
} else {
// crash and burn
}
