PHP split content when a HTML element is found

I have a PHP variable that holds some HTML, and I want to split it into two pieces. The split should take place where the second bold element (<strong> or <b>) is found. Essentially, if I have content that looks like this:
My content
This is my content. Some more bold content, that would split into another variable.
Is this at all possible?

Something like this would basically work:
preg_split('/(<strong>|<b>)/', $html1, 3, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
Given your test string of:
$html1 = '<strong>My content</strong>This is my content.<b>Some more bold</b>content';
you'd end up with
Array (
[0] => <strong>
[1] => My content</strong>This is my content.
[2] => <b>
[3] => Some more bold</b>content
)
Now, if your sample string did NOT start with strong/b:
$html2 = 'like the first, but <strong>My content</strong>This is my content.<b>Some more bold</b>content, has some initial none-tag content';
Array (
[0] => like the first, but
[1] => <strong>
[2] => My content</strong>This is my content.
[3] => <b>
[4] => Some more bold</b>content, has some initial none-tag content
)
A simple test of whether element #0 is a tag or text then tells you where the second tag sits (element #2 or element #3) and where the text after it starts (element #3 or element #4).
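The element #0 test can be wrapped in a small function. This is only a sketch: split_at_second_bold() is a hypothetical name, and like the pattern above it only recognises bare <strong> and <b> tags (no attributes, lowercase only).

```php
<?php
// Hypothetical helper: returns [text before the second bold tag, text from that tag onwards].
function split_at_second_bold(string $html): array
{
    $parts = preg_split('/(<strong>|<b>)/', $html, 3, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);

    // If element 0 is itself a tag, the second tag is at index 2; otherwise it is at index 3.
    $startsWithTag = $parts[0] === '<strong>' || $parts[0] === '<b>';
    $second = $startsWithTag ? 2 : 3;

    // Fewer than two bold tags: nothing to split off.
    if (!isset($parts[$second])) {
        return [$html, ''];
    }

    return [
        implode('', array_slice($parts, 0, $second)),
        implode('', array_slice($parts, $second)),
    ];
}

[$before, $after] = split_at_second_bold('<strong>My content</strong>This is my content.<b>Some more bold</b>content');
echo $before, "\n"; // <strong>My content</strong>This is my content.
echo $after, "\n";  // <b>Some more bold</b>content
```

Gluing the pieces back with implode() keeps the delimiters, so the two halves concatenate back to the original string.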

It is possible with 'positive lookbehind' in regular expressions. E.g., (?<=a)b matches the b (and only the b) in cab, but does not match bed or debt.
In your case, (?<=(\<strong|\<b)).*(\<strong|\<b) should do the trick. Use this regex in a preg_split() call and make sure to set PREG_SPLIT_DELIM_CAPTURE if you want those tags <b> or <strong> to be included.
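The small (?<=a)b example is easy to check with preg_match(). This sketch covers only that illustration, not the longer split pattern:

```php
<?php
// (?<=a)b matches a "b" only when the character right before it is an "a".
$pattern = '/(?<=a)b/';

foreach (['cab', 'bed', 'debt'] as $word) {
    printf("%s: %s\n", $word, preg_match($pattern, $word) ? 'match' : 'no match');
}
// cab: match
// bed: no match
// debt: no match
```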

If you really need to split the string, the regular expression approach might work. Parsing HTML with regular expressions is fragile, though.
If you just want to find the second node that is either a strong or b tag, using a DOM parser is much easier. Not only is the code obvious, all the parsing is taken care of for you.
<?php
$testHtml = '<p><strong>My content</strong><br>
This is my content. <strong>Some more bold</strong> content, that would spilt into another variable.</p>
<p><b>This should not be found</b></p>';
$htmlDocument = new DOMDocument;
if ($htmlDocument->loadHTML($testHtml) === false) {
    // crash and burn
    die();
}
$xPath = new DOMXPath($htmlDocument);
// Matches come back in document order, so index 1 is the second bold node.
$boldNodes = $xPath->query('//strong | //b');
$secondNodeIndex = 1;
if ($boldNodes->item($secondNodeIndex) !== null) {
    $secondNode = $boldNodes->item($secondNodeIndex);
    var_dump($secondNode->nodeValue);
} else {
    // crash and burn
}

Related

skipping html tags in php regex

I'm a stickler for correct-ish English (yes, I know "stickler" and "correct-ish" are an oxymoron). I have created a CMS for use on my company's sites, but there is one thing that really gets on my nerves: creating "smart" quotes in the published content.
I have a reg-ex that does it, but I run into problems when I encounter html tags in the copy. For instance, one of the published stories used by my CMS may contain a bunch of plain text and a few HTML tags, such as a link tag, which contains quotation marks that I do NOT want to change to "smart" quotes for obvious reasons.
15 years ago, I was a Perl RegEx ace, but I'm totally drawing a blank on this one. What I want to do is process a string, ignoring all text inside html tags, replace all quotes in the string with "smart" quotes, then return the string with its html tags intact.
I have a function that I cobbled together to handle the most common scenarios I face with the CMS, but I hate that it's ugly and not elegant at all, and that if unforeseen tags come up, my solution completely breaks.
Here's the code (please don't laugh, it was slammed together over half a bottle of Scotch):
function educate_quotes($string) {
$pattern = array('/\b"/',//right double
'/"\b/',//left double
'/"/',//left double end of line
"/(\w+)'(\w+)/",//apostrophe
"/\b'/",//left single
"/'\b/",//right single
"/'$/",//right single end of line
"/--/"//emdash
);
$replace = array("”",//right double quote
"“",//left double
"”",//left double end of line
"$1"."’"."$2",//apostrophe
"’",//left single
"‘",//right single
"’",//right single end of line
""//emdash
);
$string = preg_replace($pattern,$replace,$string);
//remove smart quotes around urls
$string = preg_replace("/href=“(.+)”/","href=\"$1\"",$string);
//remove smart quotes around images
$string = preg_replace("/src=“(.+?)”/","src=\"$1\" ",$string);
//remove smart quotes around alt tags
$string = str_replace('alt=”"','',$string);
$pat = "/alt=“(.+?)”/is";
$rep = "alt=\"$1\" ";
$string = preg_replace($pat,$rep,$string);
//i'm too lazy to figure out why this artifact keeps appearing
$string = str_replace("alt=“",'alt="',$string);
//same thing here
$string = preg_replace("/” target/","\" target",$string);
return $string;
}
Like I said, I know the code is ugly, and I'm open to more elegant solutions. It works, but in the future, it will break if unforeseen tags come along. For the record, I want to reiterate that I'm not trying to get a regex to PARSE html tags; I'm trying to get it to IGNORE them while parsing all the rest of the text in the string.
Any solutions? I've done a LOT of online searching and can't seem to find the solution, and I'm unfamiliar enough with PHP's implementation of regex that it's consternating.

OK. I sort of answered my own question after Slacks suggested DOM parsing, but now I have the problem that the regex isn't working on the strings created. Here's my code:
function educate_quotes($string) {
$pattern = array(
'/"(\w+)"/',//quotes
"/(\w+)'(\w+)/",//apostrophe
"/'(\w+)'/",//single quotes
"/'\b/",//right single
"/--/"//emdash
);
$replace = array(
"“"."$1"."”",//quotes
"$1"."’"."$2",//apostrophe
"’"."$1"."‘",//single quotes
"‘",//right single
""//emdash
);
$xml = new DOMDocument();
$xml->loadHTML($string);
$text = (string)$xml->textContent;
$smart = preg_replace($pattern,$replace,$text);
$xml->textContent = $smart;
$html = $xml->saveHTML();
return $html;
}
The DOM parsing is working fine; the issue now is that my regex isn't actually replacing any of the quotation marks in the strings. (I've changed it from the one above, but only after the one above had already stopped working on the new strings.)
Also, I'm getting the following annoying warnings when there is imperfect HTML code in the string:
Warning: DOMDocument::loadHTML() [domdocument.loadhtml]: Unexpected end tag : p in Entity, line: 2 in /home/leifw/now/cms_functions.php on line 418
Since I can't count on the reporters to always use perfect HTML code, that's a problem, too.
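One way to approach both problems is to tell libxml to collect its parse warnings instead of printing them, and to rewrite only the text nodes so that attributes such as href and alt are never touched. The following is a minimal sketch, not a drop-in fix: map_text_nodes() is a hypothetical helper, and strtoupper stands in for the quote conversion. Note that saveHTML() returns a whole document, so the doctype and html/body wrappers that loadHTML() adds are part of the result.

```php
<?php
// Hypothetical helper: apply $fn to every text node and leave tags and attributes alone.
function map_text_nodes(string $html, callable $fn): string
{
    // Collect libxml parse warnings internally rather than emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // //text() selects every text node in the document (including any inside script or style).
    $xpath = new DOMXPath($doc);
    foreach ($xpath->query('//text()') as $node) {
        $node->nodeValue = $fn($node->nodeValue);
    }

    return $doc->saveHTML();
}

// The stray </p> would normally trigger an "Unexpected end tag" warning.
$out = map_text_nodes('<p>say "hi" <a href="http://example.com">there</a></p></p>', 'strtoupper');
echo $out;
```

Only the visible text is upper-cased here; the href value stays as it was, which is the property the quote conversion needs.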

Is it possible to split based on html < > tags and then piece it back together?
$text = "<div sdfas=\"sdfsd\" >ksdfsdf\"dfsd\" dfs </div> <span sdf='dsfs'> dfsd 'dsf ds' </span> ";
$new_text = preg_split("/(<.*?>)/", $text, -1, PREG_SPLIT_DELIM_CAPTURE);
echo htmlspecialchars(print_r($new_text, 1));
so what you get is:
Array
(
[0] =>
[1] => <div sdfas="sdfsd" >
[2] => ksdfsdf"dfsd" dfs
[3] => </div>
[4] =>
[5] => <span sdf='dsfs'>
[6] => dfsd 'dsf ds'
[7] => </span>
[8] =>
)
Then you can piece the whole thing back together, running preg_replace only on the elements that are not tags (those without < >).

Using A. Lau's suggestion, I think I have a solution, and it turned out to be regex after all, not an XML parser.
Here's my code:
$string = '<p>"This" <b>is</b> a "string" with <a href="http://somewhere.com">quotes</a> in it. <img src="blah.jpg" alt="This is an alt tag"></p><p>Whatever, you know?</p>';
$new_string = preg_split("/(<.*?>)/",$string, -1, PREG_SPLIT_DELIM_CAPTURE);
echo "<pre>";
print_r($new_string);
echo "</pre>";
for($i=0;$i<count($new_string);$i++) {
$str = $new_string[$i];
if ($str) {
if (strpos($str,"<") === false) {
$new_string[$i] = convert_quotes($str);
}
}
}
$str = join('',$new_string);
echo $str;
function convert_quotes($string) {
$pattern = array('/\b"/',//right double
'/"\b/',//left double
'/"/',//left double end of line
"/(\w+)'(\w+)/",//apostrophe
"/\b'/",//left single
"/'\b/",//right single
"/'$/",//right single end of line
"/--/"//emdash
);
$replace = array("”",//right double quote
"“",//left double
"”",//left double end of line
"$1"."’"."$2",//apostrophe
"’",//left single
"‘",//right single
"’",//right single end of line
""//emdash
);
return preg_replace($pattern,$replace,$string);
}
That code outputs the following:
Array
(
[0] =>
[1] => <p>
[2] => "This"
[3] => <b>
[4] => is
[5] => </b>
[6] => a "string" with
[7] => <a href="http://somewhere.com">
[8] => quotes
[9] => </a>
[10] => in it.
[11] => <img src="blah.jpg" alt="This is an alt tag">
[12] =>
[13] => </p>
[14] =>
[15] => <p>
[16] => Whatever, you know?
[17] => </p>
[18] =>
)
“This” is a “string” with quotes in it. This is an alt tag
Whatever, you know?

php, strpos extract digit from string

I have a huge piece of HTML to scan. Until now I have been using preg_match_all to extract the desired parts from it. The problem from the start was that it consumed a great deal of CPU time, so we finally decided to use some other method for extraction. I read in some articles that strpos can be compared with preg_match in performance; they claim that strpos beats the regex scanner by up to 20 times in efficiency. I thought I would try this method, but I don't really know how to get started.
Lets say i have this html string:
<li id="ncc-nba-16451" class="che10">23 - Star</li>
<li id="ncd-bbt-5674" class="che10">54 - Moon</li>
<li id="ertw-cxda-c6543" class="che10">34,780 - Sun</li>
I want to extract only the number from each id and only the text (letters) from the content of the a tags, so I do this preg_match_all scan:
'/<li.*?id=".*?([\d]+)".*?<a.*?>.*?([\w]+)<\/a>/s'
here you can see the result: LINK
Now, if I wanted to replace my method with strpos, what would the approach look like? I understand that strpos returns the index where the match starts. But how can I use it to:
get all possible matches, not just one
extract numbers or text from the desired place in the string
Thank you for all the help and tips ;)

Using DOM
$html = '
<html>
<head></head>
<body>
<li id="ncc-nba-16451" class="che10">23 - Star</li>
<li id="ncd-bbt-5674" class="che10">54 - Moon</li>
<li id="ertw-cxda-c6543" class="che10">34,780 - Sun</li>
</body>
</html>';
$dom_document = new DOMDocument();
$dom_document->loadHTML($html);
$rootElement = $dom_document->documentElement;
$getId = $rootElement->getElementsByTagName('li');
$res = [];
foreach($getId as $tag)
{
$data = explode('-',$tag->getAttribute('id'));
$res['li_id'][] = end($data);
}
$getNode = $rootElement->getElementsByTagName('a');
foreach($getNode as $tag)
{
$res['a_node'][] = $tag->parentNode->textContent;
}
print_r($res);
Output :
Array
(
[li_id] => Array
(
[0] => 16451
[1] => 5674
[2] => c6543
)
[a_node] => Array
(
[0] => 23 - Star
[1] => 54 - Moon
[2] => 34,780 - Sun
)
)

This regex finds a match in 24 steps using 0 backtracks
(?:id="[^\d]*(\d*))[^<]*(?:<a href="[^>]*>[^a-z]*([a-z]*))
The regex you posted requires 134 steps. You may notice a difference. Note that regex engines can optimize a pattern so that it minimizes backtracking. I used the RegexBuddy debugger to get these numbers.
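For the strpos route the question asks about, the usual pattern is a loop that passes the end of the previous match as the offset of the next search. The sketch below is a rough illustration only: extract_between() and trailing_segment() are hypothetical helpers, and the delimiters are assumptions based on the sample markup as shown. It makes no claim about relative speed.

```php
<?php
// Hypothetical helper: collect every substring found between $open and $close.
function extract_between(string $haystack, string $open, string $close): array
{
    $found = [];
    $offset = 0;
    // The third argument of strpos() makes each search start after the previous match.
    while (($start = strpos($haystack, $open, $offset)) !== false) {
        $start += strlen($open);
        $end = strpos($haystack, $close, $start);
        if ($end === false) {
            break; // opening delimiter without a closing one
        }
        $found[] = substr($haystack, $start, $end - $start);
        $offset = $end + strlen($close);
    }
    return $found;
}

// Hypothetical helper: keep only what follows the last "-" of an id.
function trailing_segment(string $id): string
{
    $pos = strrpos($id, '-');
    return $pos === false ? $id : substr($id, $pos + 1);
}

$html = '<li id="ncc-nba-16451" class="che10">23 - Star</li>' . "\n"
      . '<li id="ncd-bbt-5674" class="che10">54 - Moon</li>';

$numbers = array_map('trailing_segment', extract_between($html, 'id="', '"'));
$texts = extract_between($html, '">', '</li>');

print_r($numbers); // 16451, 5674
print_r($texts);   // 23 - Star, 54 - Moon
```

This is as brittle as the markup it assumes: a change in attribute order or quoting breaks the delimiters, which is the trade-off against the DOM approach above.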

Using RegEx to Capture All Links & In Between Text From A String

<Link to: http://www.someurl(.+)> maybe some text here(.*) <Link: www.someotherurl(.+)> maybe even more text(.*)
Given that this is all on one line, how can I match or better yet extract all full urls and text? ie. for this example I wish to extract:
http://www.someurl(.+) . maybe some text here(.*) . www.someotherurl(.+) . maybe even more text(.*)
Basically, <Link.*:.* would start each link capture and > would end it. Then all text after the first capture would be captured as well up until zero or more occurrences of the next link capture.
I have tried:
preg_match_all('/<Link.*?:.*?(https|http|www)(.+?)>(.*?)/', $v1, $m4);
but I need a way to capture the text after the closing >. The problem is that there may or may not be another link after the first one (of course there could also be no links to begin with!).

$string = "<Link to: http://www.someurl(.+)> maybe some text here(.*) <Link: www.someotherurl(.+)> maybe even more text(.*)";
$string = preg_split('~<link(?: to)?:\s*([^>]+)>~i',$string,-1,PREG_SPLIT_DELIM_CAPTURE|PREG_SPLIT_NO_EMPTY);
echo "<pre>";
print_r($string);
output:
Array
(
[0] => http://www.someurl(.+)
[1] => maybe some text here(.*)
[2] => www.someotherurl(.+)
[3] => maybe even more text(.*)
)

You can use this pattern:
preg_match_all('~<link\b[^:]*:\s*\K(?<link>[^\s>]++)[^>]*>\s*(?<text>[^<]++)~i',
$txt, $matches, PREG_SET_ORDER);
foreach($matches as $match) {
printf("<br/>link: %s\n<br/>text: %s", $match['link'], $match['text']);
}

Extract <span> tag contents in PHP

Say I have a WordPress post, and certain words are wrapped in span tags.
For example:
<p>John went to the <span>bakery</span> today,
and after picking up his favourite muffin
he made his way across to the <span>park</span>
and spent a couple hours on the <span>swings</span>
with his friends.</p>
Is there then a way, using PHP, to dynamically spit them (the words in the span tags) out as an ordered list in my template file?
Like so:
<h3>What John Did Today</h3>
<ol>
<li>bakery</li>
<li>park</li>
<li>swings</li>
</ol>
If someone could point me in the right direction on how to do something like this, it would be much appreciated. Thank you.

$str = '<p>John went to the <span>bakery</span> today, and after picking up his favourite muffin he made his way across to the <span>park</span> and spent a couple hours on the <span>swings</span> with his friends.</p>';
$d = new DomDocument;
$d->loadHTML($str);
$xpath = new DOMXPath($d);
echo "<h3>What John Did Today</h3>\n";
echo "<ol>\n";
foreach ($xpath->query('//span') as $span)
echo "<li>".$span->nodeValue."</li>\n";
echo "</ol>\n";

A simple possibility is using regular expressions, take a look at preg_match function

Parse the DOM :
http://simplehtmldom.sourceforge.net/

I'm not a regex whiz, but this SHOULD do the job of replacing <span> tags with <li> tags:
$str = preg_replace("/<span>([^<]*)<\/span>/i", "<li>$1</li>", $str);
I know this doesn't directly answer your question, but it should help you at some point.
EDIT: a full, working regex solution that collects all the span tags into an array and converts them to list items at the same time:
// input string:
$str = '<span>Walk</span> blah <span>Drive</span> blah blee blah <span>Eat</span>';
// get array of span matches
preg_match_all("/(<span>)(.*?)(<\/span>)/i", $str, $matches, PREG_SET_ORDER);
// build array using the exact matches
$spanArray = array();
foreach ($matches as $val) {
    // $val[0] is a single <span>...</span> match
    $spanArray[] = preg_replace("/<span>([^<]*)<\/span>/i", "<li>$1</li>", $val[0]);
}
if you then print_r($spanArray); you should get something that looks like this:
Array
(
[0] => <li>Walk</li>
[1] => <li>Drive</li>
[2] => <li>Eat</li>
)

Regex to replace reg trademark

I need some help with regex:
I have some HTML output, and I need to wrap every registered trademark symbol (®) in <sup></sup>.
I cannot insert the <sup> tag in title and alt attributes, and obviously I don't need to wrap regs that are already superscripted.
The following regex matches text that is not part of a HTML tag:
(?<=^|>)[^><]+?(?=<|$)
An example of what I'm looking for:
$original = '<div>asd® asdasd. asd<sup>®</sup>asd <img alt="qwe®qwe" /></div>';
The filtered string should output:
<div>asd<sup>®</sup> asdasd. asd<sup>®</sup>asd <img alt="qwe®qwe" /></div>
thanks a lot for your time!!!

Well, here is a simple way, if you accept the following limitation:
regs that are already processed have the </sup> right after the ®.
echo preg_replace('#®(?!\s*</sup>|[^<]*>)#','<sup>®</sup>', $s);
The logic behind it is:
we replace only those ® which are not followed by </sup>, and...
which are not followed by a > symbol without an opening < symbol before it (that is, which are not inside a tag)
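Applied to the sample string from the question, the two conditions play out as follows. This is a small check using the same pattern, assuming the source file is saved as UTF-8:

```php
<?php
$s = '<div>asd® asdasd. asd<sup>®</sup>asd <img alt="qwe®qwe" /></div>';

// 1st ®: plain text, so it gets wrapped.
// 2nd ®: followed by </sup>, so the first alternative of the lookahead skips it.
// 3rd ®: inside the alt attribute, so a ">" is reached before any "<" and it is skipped.
$result = preg_replace('#®(?!\s*</sup>|[^<]*>)#', '<sup>®</sup>', $s);

echo $result, "\n";
// <div>asd<sup>®</sup> asdasd. asd<sup>®</sup>asd <img alt="qwe®qwe" /></div>
```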

I would really use an HTML parser in place of regular expressions, since HTML is not regular and will present more edge cases than you can dream of (ignoring your contextual limitations that you've identified above).
You don't say what technology you're using. If you post that up, someone can undoubtedly recommend the appropriate parser.

Regex is not enough for what you want. First you must write code to identify whether content is the value of an attribute or a text node of an element. Then you must go through all that content and use some replace method. I am not sure what it is in PHP, but in JavaScript it would look something like:
content[i].replace(/\®/g, "<sup>®</sup>");

I agree with Brian that regular expressions are not a good way to parse HTML, but if you must use regular expressions, you could try splitting the string into tokens and then running your regexp on each token.
I'm using preg_split to split the string on HTML tags, as well as on the phrase <sup>®</sup>. This leaves tokens that are either tags (including an already superscripted ®) or plain text. Then, in each plain-text token, ® can be replaced with <sup>®</sup>:
$regex = '/(<sup>®<\/sup>|<.*?>)/i';
$original = '<div>asd® asdasd. asd<sup>®</sup>asd <img alt="qwe®qwe" /></div>';
// we need to capture the tags so that the string can be rebuilt
$tokens = preg_split($regex, $original, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
/* $tokens => Array
(
[0] => <div>
[1] => asd® asdasd. asd
[2] => <sup>®</sup>
[3] => asd
[4] => <img alt="qwe®qwe" />
[5] => </div>
)
*/
foreach ($tokens as &$token)
{
    if ($token[0] == "<") continue; // Skip tokens that are tags
    $token = str_replace('®', '<sup>®</sup>', $token);
}
unset($token); // break the reference left over from the loop
$tokens = join("", $tokens); // reassemble the string
// $tokens => "<div>asd<sup>®</sup> asdasd. asd<sup>®</sup>asd <img alt="qwe®qwe" /></div>"
Note that this is a naive approach, and if the output isn't formatted as expected it might not parse like you'd like (again, regular expression is not good for HTML parsing ;) )
