Regex to replace reg trademark - php

I need some help with regex:
I got a html output and I need to wrap all the registration trademarks with a <sup></sup>
I can not insert the <sup> tag in title and alt properties and obviously I don't need to wrap regs that are already superscripted.
The following regex matches text that is not part of a HTML tag:
(?<=^|>)[^><]+?(?=<|$)
An example of what I'm looking for:
$original = `<div>asd® asdasd. asd<sup>®</sup>asd <img alt="qwe®qwe" /></div>`
The filtered string should output:
<div>asd<sup>®</sup> asdasd. asd<sup>®</sup>asd <img alt="qwe®qwe" /></div>
thanks a lot for your time!!!

Well, here is a simple way, if you agree to following limitation:
Those regs that are already processed have the </sup> following right after the ®
echo preg_replace('#®(?!\s*</sup>|[^<]*>)#','<sup>®</sup>', $s);
The logic behind is:
we replace only those ® which are not followed by </sup> and...
which are not followed by > simbol without opening < symbol

I would really use an HTML parser in place of regular expressions, since HTML is not regular and will present more edge cases than you can dream of (ignoring your contextual limitations that you've identified above).
You don't say what technology you're using. If you post that up, someone can undoubtedly recommend the appropriate parser.

Regex is not enough for what you want. First you must write code to identify when content is a value of an attribute or a text node of an element. Then you must through all that content and use some replace method. I am not sure what it is in PHP, but in JavaScript it would look something like:
content[i].replace(/\®/g, "<sup>®</sup>");

I agree with Brian that regular expressions are not a good way to parse HTML, but if you must use regular expressions, you could try splitting the string into tokens and then running your regexp on each token.
I'm using preg_split to split the string on HTML tags, as well as on the phrase <sup>&reg</sup> -- this will leave text that's either not an already superscript ® or a tag as tokens. Then for each token, ® can be replaced with <sup>®</sup>:
$regex = '/(<sup>®<\/sup>|<.*?>)/i';
$original = '<div>asd® asdasd. asd<sup>®</sup>asd <img alt="qwe®qwe" /></div>';
// we need to capture the tags so that the string can be rebuilt
$tokens = preg_split($regex, $original, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
/* $tokens => Array
(
[0] => <div>
[1] => asd® asdasd. asd
[2] => <sup>®</sup>
[3] => asd
[4] => <img alt="qwe®qwe" />
[5] => </div>
)
*/
foreach ($tokens as &$token)
{
if ($token[0] == "<") continue; // Skip tokens that are tags
$token = substr_replace('®', '<sup>®</sup>');
}
$tokens = join("", $tokens); // reassemble the string
// $tokens => "<div>asd<sup>®</sup> asdasd. asd<sup>®</sup>asd <img alt="qwe®qwe" /></div>"
Note that this is a naive approach, and if the output isn't formatted as expected it might not parse like you'd like (again, regular expression is not good for HTML parsing ;) )

Related

PHP regexp parse HTML

My regexp:
<([a-zA-Z0-9]+)>[\na-zA-Z0-9]*<\/\1+>
my string:
<div>
<f>
</f>
</div>
the result is:
array(2
0 => array(1
0 => <f>
</f>
)
1 => array(1
0 => f
)
)
why it is capturing <f></f>, and ignoring the first <div> ?
The answer is USE A PARSER INSTEAD (sorry for my shouting). While it is sometimes faster to use a regular expression to obtain an ID or URL string, html tags need a rather error-prone way of understanding via regex. Consider the following code, isn't that much more beautiful than druidic characters with special meanings?
<?php
$str = "
<container>
<div class='someclass' data='somedata'>
<f>some content here</f>
</div>
</container>";
$xml = simplexml_load_string($str);
echo $xml->div->f; // some content here
$attributes = $xml->div->attributes();
print_r($attributes); // class and data as keys
?>
I'd say it's because your second character class statement tries to find 0 or more of the characters before the ending tag comes, and that doesn't match with the <div>...</div> block.

Using RegEx to Capture All Links & In Between Text From A String

<Link to: http://www.someurl(.+)> maybe some text here(.*) <Link: www.someotherurl(.+)> maybe even more text(.*)
Given that this is all on one line, how can I match or better yet extract all full urls and text? ie. for this example I wish to extract:
http://www.someurl(.+) . maybe some text here(.*) . www.someotherurl(.+) . maybe even more text(.*)
Basically, <Link.*:.* would start each link capture and > would end it. Then all text after the first capture would be captured as well up until zero or more occurrences of the next link capture.
I have tried:
preg_match_all('/<Link.*?:.*?(https|http|www)(.+?)>(.*?)/', $v1, $m4);
but I need a way to capture the text after the closing >. The problem is that there may or may not be another link after the first one (of course there could also be no links to begin with!).
$string = "<Link to: http://www.someurl(.+)> maybe some text here(.*) <Link: www.someotherurl(.+)> maybe even more text(.*)";
$string = preg_split('~<link(?: to)?:\s*([^>]+)>~i',$string,-1,PREG_SPLIT_DELIM_CAPTURE|PREG_SPLIT_NO_EMPTY);
echo "<pre>";
print_r($string);
output:
Array
(
[0] => http://www.someurl(.+)
[1] => maybe some text here(.*)
[2] => www.someotherurl(.+)
[3] => maybe even more text(.*)
)
You can use this pattern:
preg_match_all('~<link\b[^:]*:\s*\K(?<link>[^\s>]++)[^>]*>\s*(?<text>[^<]++)~',
$txt, $matches, PREG_SET_ORDER);
foreach($matches as $match) {
printf("<br/>link: %s\n<br/>text: %s", $match['link'], $match['text']);
}

How do I make my preg_replace only look for the words while they are NOT within an <acronym> tag?

In PHP I have a String $string and an array $acronyms (in the form "UK" => "United Kingdom").
Now I want to replace all acronyms within $string by some HTML Tags. For example Hello UK should turn into Hello <acronym title="United Kingdom">UK</acronym></pre>
I do it this way:
foreach($acronyms as $acronym => $tooltip){
$string = preg_replace('/'.$acronym.'/i', ''.$acronym.'', $string);
}
The problem is: Let's say I have a text Hello UK and have an array to replace "UK" with "United Kingdom" and "Kingdom" with "RandomWord". Then the text will replace into Hello <acronym title="United <acronym title="RandomWord">Kingdom</acronym>">UK</acronym> which obviously is chaos.
So the question is: How do I make my preg_replace only look for the words while they are NOT within an <acronym> tag? (neither in title-attribute, nor within the tag itself)
Edit: second attempt according to a response (because I can't put code in reply). Still the same problem, the text within acronym gets replaced a second time...
foreach($acronyms as $acronym => $tooltip){
$acronyms[$acronym] = '<acronym title="'.$tooltip.'">'.$acronym.'</acronym>';
}
$string = str_ireplace(array_keys($acronyms), array_values($acronyms), $string);
You can use strtr(). It doesn't rescan the string after performing a replacement:
foreach ($acronyms as $acronym => $tooltip) {
$acronyms[$acronym] = sprintf('<acronym title="%s">%s</acronym>',
htmlspecialchars($tooltip),
htmlspecialchars($acronym)
);
}
echo strtr($str, $acronyms);
Here's an attempt at the regex version:
foreach($acronyms as $acronym => $tooltip){
$rexp = '/' . $acronym . '(?!((?!<acronym).)*<\/acronym>)/i';
$string = preg_replace($rexp, ''.$acronym.'', $string);
}
Seems to work for me. It does the following:
Match the $acronym variable with a negative look ahead...
where a closing acronym tag can be found
but stop the lookahead when an opening acronym tag is before it.
Ultimately this matches only where it's not within an acronym tag (including all attributes such as the title).
Here's an example of it in action: gSkinner regex example
Don't try to do everything with regexes :
Parse your HTML using a HTML/XML parsing library.
Iterate over your HTML tags, replace what you have to replace.
Ask your "html parsing lib" to convert this back to a "HTML string".

PHP split content when a HTML element is found

I have a PHP variable that holds some HTML I wanting to be able to split the variable into two pieces, and I want the spilt to take place when a second bold <strong> or <b> is found, essentially if I have content that looks like this,
My content
This is my content. Some more bold content, that would spilt into another variable.
is this at all possible?
Something like this would basically work:
preg_split('/(<strong>|<b>)/', $html1, 3, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
Given your test string of:
$html1 = '<strong>My content</strong>This is my content.<b>Some more bold</b>content';
you'd end up with
Array (
[0] => <strong>
[1] => My content</strong>This is my content.
[2] => <b>
[3] => Some more bold</b>content
)
Now, if your sample string did NOT start with strong/b:
$html2 = 'like the first, but <strong>My content</strong>This is my content.<b>Some more bold</b>content, has some initial none-tag content';
Array (
[0] => like the first, but
[1] => <strong>
[2] => My content</strong>This is my content.
[3] => <b>
[4] => Some more bold</b>content, has some initial none-tag content
)
and a simple test to see if element #0 is either a tag or text to determine where your "second tag and onwards" text starts (element #3 or element #4)
It is possible with 'positive lookbehind' in regular expressions. E.g., (?<=a)b matches the b (and only the b) in cab, but does not match bed or debt.
In your case, (?<=(\<strong|\<b)).*(\<strong|\<b) should do the trick. Use this regex in a preg_split() call and make sure to set PREG_SPLIT_DELIM_CAPTURE if you want those tags <b> or <strong> to be included.
If you truly really need to split the string, the regular expression approach might work. There are many fragilities about parsing HTML, though.
If you just want to know the second node that has either a strong or b tag, using a DOM is so much easier. Not only is the code very obvious, all the parsing bits are taken care of for you.
<?php
$testHtml = '<p><strong>My content</strong><br>
This is my content. <strong>Some more bold</strong> content, that would spilt into another variable.</p>
<p><b>This should not be found</b></p>';
$htmlDocument = new DOMDocument;
if ($htmlDocument->loadHTML($testHtml) === false) {
// crash and burn
die();
}
$xPath = new DOMXPath($htmlDocument);
$boldNodes = $xPath->query('//strong | //b');
$secondNodeIndex = 1;
if ($boldNodes->item($secondNodeIndex) !== null) {
$secondNode = $boldNodes->item($secondNodeIndex);
var_dump($secondNode->nodeValue);
} else {
// crash and burn
}

php preg_match_all html dates with slashes error

I've trying to preg_match_all a date with slashes in it sitting between 2 html tags; however its returning null.
here is the html:
> <td width='40%' align='right'class='SmallDimmedText'>Last Login: 11/14/2009</td>
Here is my preg_match_all() code
preg_match_all('/<td width=\'40%\' align=\'right\' class=\'SmallDimmedText\'>Last([a-zA-Z0-9\s\.\-\',]*)<\/td>/', $h, $table_content, PREG_PATTERN_ORDER);
where $h is the html above.
what am i doing wrong?
thanks in advance
It (from a quick glance) is because you are trying to match:
Last Login: 11/14/2009
With this regex:
Last([a-zA-Z0-9\s\.\-\',]*)
The regex doesn't contain the required characters of : and / which are included in the text string. Changing the required part of the regex to:
Last([a-zA-Z0-9\s\.\-\',:/]*)
Gives a match
Would it be better to simply use a DOM parser, and then preform the regex on the result of the DOM lookup? It makes for nicer regex...
EDIT
The other issue is that your HTML is:
...40%' align='right'class='SmallDimmedText'>...
Where there is no space between align='right' and class='SmallDimmedText'
However your regex for that section is:
...40%\' align=\'right\' class=\'SmallDimmedText\'>...
Where it is indicated there is a space.
Use a DOM Parser It will save you more headaches caused by subtle bugs than you can count.
Just to give you an idea on how simple it is to parse using Simple HTML DOM.
$html = str_get_html(...);
$elems = $html->find('.SmallDimmedText');
if ( count($elems->children()) != 1 ){
throw new Exception('Too many/few elements found');
}
$text = $elems->children(0)->plaintext;
//parsing here is only an example, but you have removed all
//the html so that any regex used is really simple.
$date = substr($text, strlen('Last Login: '));
$unixTime = strtotime($date);
I see at least two problems :
in your HTML string, there is no space between 'right' and class=, and there is one space there in your regex
you must add at least these 3 characters to the list of matched characters, between the [] :
':' (there is one between "Login" and the date),
' ' (there are spaces between "Last" and "Login", and between ":" and the date),
and '/' (between the date parts)
With this code, it seems to work better :
$h = "<td width='40%' align='right'class='SmallDimmedText'>Last Login: 11/14/2009</td>";
if (preg_match_all("#<td width='40%' align='right'class='SmallDimmedText'>Last([a-zA-Z0-9\s\.\-',: /]*)<\/td>#",
$h, $table_content, PREG_PATTERN_ORDER)) {
var_dump($table_content);
}
I get this output :
array
0 =>
array
0 => string '<td width='40%' align='right'class='SmallDimmedText'>Last Login: 11/14/2009</td>' (length=80)
1 =>
array
0 => string ' Login: 11/14/2009' (length=18)
Note I have also used :
# as a regex delimiter, to avoid having to escape slashes
" as a string delimiter, to avoid having to escape single quotes
My first suggestion would be to minimize the amount of text you have in the preg_match_all, why not just do between a ">" and a "<"? Second, I'd end up writing the regex like this, not sure if it helps:
/>.*[0-9]{1,2}/[0-9]{1,2}/[0-9]{2,4}</
That will look for the end of one tag, then any character, then a date, then the beginning of another tag.
I agree with Yacoby.
At the very least, remove all reference to any of the HTML specific and simply make the regex
preg_match_all('#Last Login: ([\d+/?]+)#', ...

Categories