PHP preg_split Input by <br>, <br/>, <p> into Separate Paragraphs

I am curling a page with very ill-formed code. There is a particular snippet of the page I am trying to parse into paragraphs. This input snippet may be divided by <p> and </p>, or separated by one or more <br> or <br/> tags. Where two <br> tags follow one another, I don't want them to produce two separate paragraphs.
The code I'm currently using to parse and display it is:
$paragraphs = preg_split('/(<\s*p\s*\/?>)|(<\s*br\s*\/?>)|(\s\s+)|(<\s*\/p\s*\/?>)/', $article, -1, PREG_SPLIT_NO_EMPTY);
$paragraphcount = count($paragraphs);
for($x = 1; $x <= $paragraphcount; $x++ )
{
echo "<p>".$paragraphs[$x-1]."</p>";
}
However, this is not working as expected. Some different inputs/outputs are as follows:
Input 1: first part </p> <p> second part </p> <p> third part </p> <p> fourth part <br/>
Output 1: <p>first part </p><p> </p><p>second part </p><p> </p><p> third part </p><p> </p><p>fourth part</p><p> </p>
My code is parsing the input into paragraphs; however, it's also adding extra paragraphs containing only a space.
Any help would be appreciated.
Input is UTF-8 if it makes a difference.

Here is a solution with preg_replace:
$article = "first part </p> <p> second part </p> <p> third part </p>
<p> fourth part <br/> <br> fifth part";
$healed = substr(
preg_replace('/(\s*<(\/?p|br)\s*\/?>\s*)+/u', "</p><p>", "<p>$article<p>"),
4, -3);
It first wraps the string in <p> and then replaces (repetitions of) the variants of breaks by </p><p>, to finally remove the starting </p> and ending <p>. Note that this does not produce an (intermediate) array, but the final string.
echo $healed;
outputs:
<p>first part</p><p>second part</p><p>third part</p><p>fourth part</p><p>fifth part</p>
Note that you need the u modifier at the end of the regular expression to get UTF-8 support.
If on the other hand you need the paragraphs in an array, then preg_split is better suited (using the same regular expression):
$paragraphs = preg_split('/(\s*<(\/?p|br)\s*\/?>\s*)+/u',
$article, -1, PREG_SPLIT_NO_EMPTY); // -1 means "no limit"; passing null is deprecated in newer PHP versions
If you then write:
foreach ($paragraphs as $paragraph) {
echo "$paragraph\n";
}
You get:
first part
second part
third part
fourth part
fifth part
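As a rough sketch, the split-and-wrap steps from this answer can be combined into one helper. The function name to_paragraphs is hypothetical (it is not part of the answer), and the input is assumed to be UTF-8 markup that needs no further escaping.

```php
<?php
// Hypothetical helper: split messy markup on runs of <p>, </p>, <br>, <br/>
// and wrap each non-empty piece in a clean <p>...</p> pair.
function to_paragraphs(string $article): string
{
    // preg_split() returns false on error (e.g. invalid UTF-8 with /u); fall back to an empty list
    $parts = preg_split('/(\s*<(\/?p|br)\s*\/?>\s*)+/u', $article, -1, PREG_SPLIT_NO_EMPTY) ?: [];
    $out = '';
    foreach ($parts as $part) {
        $part = trim($part); // drop whitespace at the very start or end of the input
        if ($part !== '') {
            $out .= "<p>$part</p>";
        }
    }
    return $out;
}

$article = "first part </p> <p> second part </p> <p> third part </p>\n<p> fourth part <br/> <br> fifth part";
echo to_paragraphs($article), "\n";
// <p>first part</p><p>second part</p><p>third part</p><p>fourth part</p><p>fifth part</p>
```

Because the whole run of tags and whitespace is one delimiter, consecutive <br> tags never yield an empty paragraph.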

print_r(preg_split('/((<\s*p\s*\/?>\s*)|(<\s*br\s*\/?>\s*)|(\s\s+)|(<\s*\/p\s*\/?>\s*))+/', $article, -1, PREG_SPLIT_NO_EMPTY));
result:
Array
(
[0] => first part
[1] => second part
[2] => third part
[3] => fourth part
)

Related

Truncate Text Within Specific HTML Tag

This might not even be possible but I have quite a limited knowledge of PHP so I can't figure out if it is or not.
Basically I have a string $myText and this string outputs HTML in the following format:
<p>This is the main bit of text</p>
<small> This is some additional text</small>
My aim is to limit the number of characters displayed specifically within the <p> tag, for example 10 characters.
I have been playing around with PHP substr but I can only get this to work on all of the text, not just the text in the <p> tag.
Do you know if this is possible and if it is, do you know how to do it? Any pointers at all would be appreciated.
Thank you
The simplest solution is:
<?php
$text = '
<p>This is the main bit of text</p>
<small> This is some additional text</small>';
$pos = strpos($text,'<p>');
$pos2 = strpos($text,'</p>');
$text = '<p>' . substr($text,$pos+strlen('<p>'),10).substr($text,$pos2);
echo $text;
but it works only for the first <p> ... </p> pair.
If you need more, you can use regular expressions:
<?php
$text = '
<p>This is the main bit of text</p>
<small> This is some additional text</small>
<p>
werwerwrewre
</p>';
preg_match_all('#<p>(.*)</p>#isU', $text, $matches);
foreach ($matches[1] as $match) {
$text = str_replace('<p>'.$match.'</p>', '<p>'.substr($match,0,10).'</p>', $text);
}
echo $text;
or even
<?php
$text = '
<p>This is the main bit of text</p>
<small> This is some additional text</small>
<p>
werwerwrewre
</p>';
$text = preg_replace_callback('#<p>(.*)</p>#isU', function($matches) {
$matches[1] = '<p>'.substr($matches[1],0,10).'</p>';
return $matches[1];
}, $text);
echo $text;
However, in all three cases whitespace counts as part of the string. If the content of <p>...</p> starts with 3 spaces and you limit it to 3 characters, you will display only those 3 spaces and nothing more. This is easy to change (for example with trim()), but it is worth being aware of.
One more thing: you will quite possibly need the multibyte versions of these functions to get a correct result. For example, use mb_strpos() instead of strpos() and mb_substr() instead of substr(), and set the encoding beforehand with mb_internal_encoding('UTF-8');.
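A minimal sketch of the multibyte-aware variant, assuming UTF-8 input and that the mbstring extension is loaded. The function name truncate_paragraphs is made up for illustration, and the trim() call is one possible way to keep leading spaces from using up the limit.

```php
<?php
// Truncate the text inside every <p>...</p> to $limit characters (not bytes).
// truncate_paragraphs is a hypothetical name; UTF-8 input is assumed.
function truncate_paragraphs(string $html, int $limit): string
{
    $result = preg_replace_callback(
        '#<p>(.*?)</p>#isu',
        function (array $m) use ($limit): string {
            // trim() removes surrounding whitespace before counting characters
            return '<p>' . mb_substr(trim($m[1]), 0, $limit, 'UTF-8') . '</p>';
        },
        $html
    );
    // preg_replace_callback() returns null on error; keep the input unchanged in that case
    return $result ?? $html;
}

$text = "<p>  héllo wörld</p>\n<small> This is some additional text</small>";
echo truncate_paragraphs($text, 5), "\n";
```

With plain substr() the same limit would count bytes, so a two-byte character such as é could be cut in half.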
You can achieve this quite simply:
<?php
$max_length = 5;
$input = "<b>example: </b><div align=left>this is a test</div><div>another very very long item</div>";
$elements_count = preg_match_all("|(<[^>]+>)(.*)(</[^>]+>)|U",
$input,
$out, PREG_PATTERN_ORDER);
for($i=0; $i<$elements_count; $i++){
echo $out[1][$i].substr($out[2][$i], 0, $max_length).$out[3][$i]."\n";
}
This works for any tag, whatever class or other attributes it carries.
Example input:
<b>example: </b><div align=left>this is a test</div><div>another very very long item</div>
output:
<b>examp</b>
<div align=left>this </div>
<div>anoth</div>

Preg_replace only replaces first match

I'm relatively new to regular expressions and I'm having a problem with this one. I've searched this site and found nothing that works.
I want it to remove all <br /> between <div class='quote'> and </div>. The reason for this is that the whitespace is preserved anyway by the CSS and I want to remove any extra linebreaks the user puts into it.
For example, say I have this:
<div class='quote'>First line of text<br />
Second line of text<br />
Third line of text</div>
I've been trying to use this to remove both <br /> tags:
$TEXT = preg_replace("/(<div class='quote'>(.*?))<br \/>((.*?)<\/div>)/is","$1$3",$TEXT);
This works to an extent because the result is:
<div class='quote'>First line of text
Second line of text<br />
Third line of text</div>
However it won't remove the second <br />. Can someone help please? I figure it's probably something small I'm missing :)
Thanks!
If you want to remove all <br /> tags inside a single div block, you first need to capture the content of that block and then remove the tags from it.
Your regexp contains only one <br />, so it replaces only one <br />.
You need something like this:
function clear_br($a)
{
return str_replace("<br />", "", $a[0]);
}
$TEXT = preg_replace_callback("/<div class='quote'>.*?<br \/>.*?<\/div>/is", "clear_br", $TEXT);
preg_replace does replace more than once by default: you did not pass a 4th argument (the limit), so the number of replacements is unlimited. It replaced only once here because your pattern includes the wrapping <div>, and your string contains only one such <div>, so the whole pattern matched just once.
Assuming we already have:
<div class='quote'>First line of text<br />
Second line of text<br />
Third line of text</div>
we can simply do something like:
$s = "<div class='quote'>First line of text<br />\nSecond line of text<br>\nThird line of text</div>";
echo preg_replace("{<br\s*/?>}", " ", $s);
The \s* allows optional whitespace, so both <br /> and <br/> match. The /? makes the slash optional, so <br> matches too. If the system inserts those <br /> tags for you and you are sure they always have this form, you can use a simpler pattern instead.
One word of caution: I would replace the tag with a space rather than an empty string. For hello<br>world, an empty replacement produces helloworld, merging two words into one.
(If you do not have this <div ... > ... </div> extracted already, then you probably would need to first do that using an HTML parser, say if the original content is a whole webpage (we use a parser because what if the content inside this outer <div>...</div> has <div> and </div> or even nested as well? If there isn't any <div> inside, then it is easier to extract it just using regex))
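A hedged sketch of the parser route mentioned above, using PHP's DOM extension (assumed to be installed). The sample markup is invented for illustration; unlike the regex approaches, nested <div> elements inside the quote block are handled by the parser.

```php
<?php
// Sketch: remove every <br> inside <div class="quote"> with DOM instead of regex.
$html = "<div>before</div><div class='quote'>First line<br />\nSecond line<br />\nThird line</div>";

$doc = new DOMDocument();
// Wrap the fragment in a full document with an explicit charset;
// @ silences libxml warnings about sloppy markup.
@$doc->loadHTML('<html><head><meta charset="utf-8"></head><body>' . $html . '</body></html>');

$xpath = new DOMXPath($doc);
// Copy the node list into an array before removing nodes from the tree
foreach (iterator_to_array($xpath->query('//div[@class="quote"]//br')) as $br) {
    $br->parentNode->removeChild($br);
}

$remaining = $xpath->query('//div[@class="quote"]//br')->length;
$quoteText = $xpath->query('//div[@class="quote"]')->item(0)->textContent;

echo $remaining, "\n"; // number of <br> elements left in the quote block
echo $quoteText, "\n"; // the line breaks in the text itself are kept
```

The @class="quote" test matches only an exact class attribute; a div with several classes would need a contains() expression instead.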
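I don't follow your (.*?): in this position the ? does not mean "zero or one time", it only makes .* lazy (match as few characters as possible). If the text consists of just one quote block, a plain .* works too. Note that preg_replace cannot call a function on $1; preg_replace_callback passes the match array to the function instead:
function clear_br($a){ return str_replace("<br />","",$a[1]); }
$TEXT = preg_replace_callback("/(<div class='quote'>.*<br \/>.*<\/div>)/s", "clear_br", $TEXT);
With that change it should work.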
You have to be careful about how you capture the div that contains the br elements. Mr. 動靜能量 pointed out that you need to watch out for nested divs. My solution does not.
<?php
$subject ="
<div>yomama</div>
<div class='quote'>First line of text<br />
Second line of text<br />
Third line of text</div>
<div>hasamustache</div>
";
$result = preg_replace_callback( '#<div[^>]+class.*quote.*?</div>#s',
function ($matches) {
print_r($matches);
return preg_replace('#<br ?/?>#', '', $matches[0]);
}
, $subject);
echo "$result\n";
?>
# is used as a regex delimiter instead of the conventional /
<div[^>]+ prevents the yomama div from being matched. With <div.*class.*quote it would have been, because the s modifier lets . match newlines as well.
quote.*? means a non-greedy match to prevent hasamustache</div> from being caught.
So the strategy is to match only the quote div in a string with newlines, and run a function on it that will kill all br tags.
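output (after the print_r dump of the match):
<div>yomama</div>
<div class='quote'>First line of text
Second line of text
Third line of text</div>
<div>hasamustache</div>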

Detect paragraph in a form

How can I detect that the text entered in a form contains several paragraphs? In this example, if the user writes several paragraphs, the echo runs them all together. I tried white-space:pre and it did not work. What else can I do to echo the text wrapped in <p> tags?
CSS:
#text {
white-space:pre;
}
HTML:
<form action='normal-html.php' method='post'>
<textarea id="text" name='text' rows='15' cols='60'></textarea> <br/>
<input type='submit' value='Convert to HTML' />
</form>
<br />
<?php
$text = $_POST['text'];
echo $text;
?>
This sounds like a job for http://php.net/manual/en/function.nl2br.php
string nl2br ( string $string [, bool $is_xhtml = true ] )
Returns string with '<br />' or '<br>' inserted before all
newlines (\r\n, \n\r, \n and \r).
You can use this as you echo out the data, so that way you are never changing what is in the database - or you can simply alter the user input as you save it to the database. Personally I am a fan of the first option, but whatever works best for your application.
Edit: If you want to use only <p> tags, you could also do this using str_replace:
$text = '<p>';
$text .= str_replace("\n", '</p><p>', $_POST['text']);
The "\n" must be in double quotes so that PHP treats it as a newline rather than a literal backslash followed by n. Depending on how the input arrives, you may need to replace "\r\n" instead. This leaves the final <p> unclosed (or a spare <p> if the text ends with a newline), but you see where this is going.
You can use the explode function (php manual page):
$your_array = explode("\n", $your_string_from_db);
Example:
$str = "Lorem Ipsum\nAlle jacta est2\nblblbalbalbal";
$arr = explode("\n", $str);
foreach ( $arr as $item){
echo "<p>".$item."</p>";
}
Output (as rendered by the browser):
Lorem Ipsum
Alle jacta est2
blblbalbalbal
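The answers above treat every newline as a paragraph break. As an alternative sketch, the helper below (text_to_paragraphs is a hypothetical name) treats blank lines as paragraph separators, keeps single newlines as <br />, and escapes the user input before echoing it. The UTF-8 encoding is an assumption.

```php
<?php
// Hypothetical helper: turn plain textarea input into <p> blocks.
function text_to_paragraphs(string $text): string
{
    // Normalise Windows (\r\n) and old Mac (\r) line endings to \n
    $text = str_replace(["\r\n", "\r"], "\n", $text);
    // One or more blank lines separate paragraphs
    $paragraphs = preg_split('/\n\s*\n/', trim($text), -1, PREG_SPLIT_NO_EMPTY) ?: [];
    $html = '';
    foreach ($paragraphs as $p) {
        // Escape user input first, then turn single line breaks into <br />
        $html .= '<p>' . nl2br(htmlspecialchars($p, ENT_QUOTES, 'UTF-8')) . "</p>\n";
    }
    return $html;
}

echo text_to_paragraphs("First paragraph\r\nstill first\r\n\r\nSecond <b>paragraph</b>");
```

Escaping with htmlspecialchars() means any markup typed into the textarea is shown as text instead of being interpreted by the browser.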

What's the right pattern for this hidden input

I have this field Returned by curl_exec:
<input name="NUMBER_R" type="hidden" value="1500000">
1500000 is a random number and may change; the other attributes are constant.
I tried:
preg_match ('/<input name="NUMBER_R" type="hidden" value="([^"]*)" \/>/', $result, $number)
and also:
preg_match ('/<input name=\'NUMBER_R\' type=\'hidden\' value=\'(\\d+)\'>/ims', $result, $number)
but no luck...
Here is the full code:
$result=curl_exec($cid);
curl_close($cid);
$number = array();
if (preg_match ('REGEX', $result, $number))
{
echo $number[1];
}
EDIT 1:
Sorry i forgot [1] in echo $number[1];
Also 1500000 is a random number and may change
Description
This regex will find the input tag which has the attributes name="number_r" and type="hidden" in any order. It then pulls out the value attribute, which it requires to be all digits.
<input\b\s+(?=[^>]*name=(["'])number_r\1)(?=[^>]*type=(["'])hidden\2)[^>]*value=(["'])(\d+)\3[^>]*>
<input\b\s+ consume the open bracket and the tag name, and ensure a word boundary followed by whitespace
(?=[^>]*name=(["'])number_r\1) look ahead to ensure this tag includes the correct name attribute
(?=[^>]*type=(["'])hidden\2) look ahead to ensure this tag also includes the type attribute
[^>]*value= move the cursor forward until we find the value attribute
(["']) capture the open quote
(\d+) capture the substring and require it to be all digits
\3 match the correct close quote. This can be omitted, as you've already captured the desired substring.
[^>]*> match the rest of the characters in the tag. This can also be omitted for the same reason.
Groups
Group 0 gets the entire input tag
Group 1 gets the open quote for name, which is back referenced to ensure the correct close quote is matched
Group 2 gets the open quote for type, which is back referenced to ensure the correct close quote is matched
Group 3 gets the open quote for value, which is back referenced to ensure the correct close quote is matched
Group 4 gets the value in the attribute named value
PHP Code Example:
<?php
$sourcestring='<input name="NUMBER_R" type="hidden" value="1500000">';
preg_match('/<input\b\s+(?=[^>]*name=(["\'])number_r\1)(?=[^>]*type=(["\'])hidden\2)[^>]*value=(["\'])(\d+)\3[^>]*>/im',$sourcestring,$matches);
echo "<pre>".print_r($matches,true);
?>
$matches Array:
(
[0] => <input name="NUMBER_R" type="hidden" value="1500000">
[1] => "
[2] => "
[3] => "
[4] => 1500000
)
Try using DOM and XPath to get that.
$xml = new DomDocument;
$xml->loadXml('<input name="NUMBER_R" type="hidden" value="1500000" />');
$xpath = new DomXpath($xml);
// traverse all results
foreach ($xpath->query('//input[@name="NUMBER_R"]') as $rowNode) {
var_dump($rowNode->getAttribute('value'));
}
Tested: http://codepad.viper-7.com/8dwu9f
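Since the page returned by curl_exec is HTML rather than well-formed XML, a variant with loadHTML() may be more forgiving. This is only a sketch: the $result string below is made-up stand-in markup, and the DOM extension is assumed to be available.

```php
<?php
// $result stands in for the string returned by curl_exec(); the markup is invented.
$result = '<html><body><form><input name="NUMBER_R" type="hidden" value="1500000"></form></body></html>';

$doc = new DOMDocument();
// loadHTML() tolerates non-XML markup such as an unclosed <input>; @ hides parser warnings
@$doc->loadHTML($result);

$xpath = new DOMXPath($doc);
$nodes = $xpath->query('//input[@name="NUMBER_R" and @type="hidden"]');
// Take the value attribute of the first match, or null when the field is missing
$number = $nodes->length > 0 ? $nodes->item(0)->getAttribute('value') : null;

var_dump($number); // string(7) "1500000"
```

Attribute order and quote style no longer matter here, because the parser normalises them before the XPath query runs.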

How to grab the contents of HTML tags?

Hey so what I want to do is snag the content for the first paragraph. The string $blog_post contains a lot of paragraphs in the following format:
<p>Paragraph 1</p><p>Paragraph 2</p><p>Paragraph 3</p>
The problem I'm running into is that I am writing a regex to grab everything between the first <p> tag and the first closing </p> tag. However, it is grabbing the first <p> tag and the last closing </p> tag which results in me grabbing everything.
Here is my current code:
if (preg_match("/[\\s]*<p>[\\s]*(?<firstparagraph>[\\s\\S]+)[\\s]*<\\/p>[\\s\\S]*/",$blog_post,$blog_paragraph))
echo "<p>" . $blog_paragraph["firstparagraph"] . "</p>";
else
echo $blog_post;
Well, sysrqb will let you match anything in the first paragraph assuming there's no other html in the paragraph. You might want something more like this
<p>.*?</p>
Placing the ? after your * makes it non-greedy, meaning it will only match as little text as necessary before matching the </p>.
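To make the difference concrete, here is a small comparison of the greedy and non-greedy forms on the sample string from the question. The variable names $greedy and $lazy are only illustrative.

```php
<?php
$blog_post = '<p>Paragraph 1</p><p>Paragraph 2</p><p>Paragraph 3</p>';

// Greedy: .* runs on to the last </p> in the string
preg_match('#<p>(.*)</p>#s', $blog_post, $greedy);
// Non-greedy: .*? stops at the first </p>
preg_match('#<p>(.*?)</p>#s', $blog_post, $lazy);

echo $greedy[1], "\n"; // Paragraph 1</p><p>Paragraph 2</p><p>Paragraph 3
echo $lazy[1], "\n";   // Paragraph 1
```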
If you use preg_match, use the "U" flag to make it un-greedy.
preg_match("/<p>(.*)<\/p>/U", $blog_post, $matches);
$matches[1] will then contain the first paragraph.
It would probably be easier and faster to use strpos() to find the position of the first <p> and the first </p>, then use substr() to extract the paragraph.
$paragraph_start = strpos($blog_post, '<p>');
$paragraph_end = strpos($blog_post, '</p>', $paragraph_start);
$paragraph = substr($blog_post, $paragraph_start + strlen('<p>'), $paragraph_end - $paragraph_start - strlen('<p>'));
Edit: Actually the regex in others' answers will be easier and faster... your big complex regex in the question confused me...
Using Regular Expressions for html parsing is never the right solution. You should be using XPATH for this particular case:
$string = <<<XML
<a>
<b>
<c>texto</c>
<c>cosas</c>
</b>
<d>
<c>código</c>
</d>
</a>
XML;
$xml = new SimpleXMLElement($string);
/* Selects <p> elements that are the first <p> of their parent (the sample XML above contains none) */
$resultado = $xml->xpath('//p[1]');
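Applied to the $blog_post string from the question, the XPath idea might look like the sketch below. It uses DOMDocument::loadHTML() rather than SimpleXMLElement so that imperfect HTML is tolerated; that swap, the wrapper markup, and the fallback to the whole post are assumptions, not part of the answer.

```php
<?php
$blog_post = '<p>Paragraph 1</p><p>Paragraph 2</p><p>Paragraph 3</p>';

$doc = new DOMDocument();
// Wrap the fragment in a full document with an explicit charset; @ hides parser warnings
@$doc->loadHTML('<html><head><meta charset="utf-8"></head><body>' . $blog_post . '</body></html>');

$xpath = new DOMXPath($doc);
// (//p)[1] selects the first <p> in document order
$first = $xpath->query('(//p)[1]')->item(0);
// saveHTML($node) serialises just that element; fall back to the whole post if no <p> exists
$first_paragraph = $first !== null ? $doc->saveHTML($first) : $blog_post;

echo $first_paragraph, "\n";
```

Any markup nested inside the first paragraph is kept intact, which the simple regex cannot guarantee when paragraphs contain other tags.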
