how do I use preg_replace_callback on unknown values? - php

I got some great help today with starting to understand preg_replace_callback with known values. But now I want to tackle unknown values.
$string = '<p id="keepthis"> text</p><div id="foo">text</div><div id="bar">more text</div><a id="red"href="page6.php">Page 6</a><a id="green"href="page7.php">Page 7</a>';
With that as my string, how would I go about using preg_replace_callback to remove all ids from the div and a tags while keeping the id in place on the p tag?
So from my string:
<p id="keepthis"> text</p>
<div id="foo">text</div>
<div id="bar">more text</div>
<a id="red"href="page6.php">Page 6</a>
<a id="green"href="page7.php">Page 7</a>
to
<p id="keepthis"> text</p>
<div>text</div>
<div>more text</div>
<a href="page6.php">Page 6</a>
<a href="page7.php">Page 7</a>

There's no need of a callback.
$string = preg_replace('/(?<=<div|<a)( *id="[^"]+")/', ' ', $string);
However, if you do want to use preg_replace_callback:
echo preg_replace_callback(
'/(?<=<div|<a)( *id="[^"]+")/',
function ($match)
{
return " ";
},
$string
);

For your example, the following should work:
$result = preg_replace('/(<(a|div)[^>]*\s+)id="[^"]*"\s*/', '\1', $string);
In general, though, you'd do better to avoid parsing HTML with regular expressions and use a proper parser instead (for example, load the HTML into a DOMDocument and use the removeAttribute method, like in this answer). That way you can handle variations in markup and malformed HTML much better.
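As a rough illustration of the DOMDocument route mentioned above, here is a minimal sketch. It assumes the dom extension is available and that the snippet is a body fragment, so the result is rebuilt from the children of the <body> element that loadHTML adds; the variable names are only for illustration.

```php
<?php
// Sketch: strip the id attribute from <div> and <a>, leave <p> untouched.
$string = '<p id="keepthis"> text</p><div id="foo">text</div><div id="bar">more text</div><a id="red"href="page6.php">Page 6</a><a id="green"href="page7.php">Page 7</a>';

$doc = new DOMDocument();
libxml_use_internal_errors(true); // keep parser warnings about sloppy markup quiet
$doc->loadHTML($string);
libxml_clear_errors();

foreach (['div', 'a'] as $tag) {
    foreach ($doc->getElementsByTagName($tag) as $element) {
        $element->removeAttribute('id');
    }
}

// loadHTML wraps the fragment in <html><body>; serialize only the body's children.
$result = '';
$body = $doc->getElementsByTagName('body')->item(0);
foreach ($body->childNodes as $child) {
    $result .= $doc->saveHTML($child);
}
echo $result;
```

Because the parser works on elements rather than on text, attribute order, quoting style, and extra whitespace in the tags no longer matter.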

Related

Truncate Text Within Specific HTML Tag

This might not even be possible but I have quite a limited knowledge of PHP so I can't figure out if it is or not.
Basically I have a string $myText and this string outputs HTML in the following format:
<p>This is the main bit of text</p>
<small> This is some additional text</small>
My aim is to limit the number of characters displayed specifically within the <p> tag, for example 10 characters.
I have been playing around with PHP substr but I can only get this to work on all of the text, not just the text in the <p> tag.
Do you know if this is possible and if it is, do you know how to do it? Any pointers at all would be appreciated.
Thank you
The simplest solution is:
<?php
$text = '
<p>This is the main bit of text</p>
<small> This is some additional text</small>';
$pos = strpos($text,'<p>');
$pos2 = strpos($text,'</p>');
$text = '<p>' . substr($text,$pos+strlen('<p>'),10).substr($text,$pos2);
echo $text;
However, it only works for the first <p> ... </p> pair.
If you need more, you can use regular expressions:
<?php
$text = '
<p>This is the main bit of text</p>
<small> This is some additional text</small>
<p>
werwerwrewre
</p>';
preg_match_all('#<p>(.*)</p>#isU', $text, $matches);
foreach ($matches[1] as $match) {
$text = str_replace('<p>'.$match.'</p>', '<p>'.substr($match,0,10).'</p>', $text);
}
echo $text;
or even
<?php
$text = '
<p>This is the main bit of text</p>
<small> This is some additional text</small>
<p>
werwerwrewre
</p>';
$text = preg_replace_callback('#<p>(.*)</p>#isU', function($matches) {
$matches[1] = '<p>'.substr($matches[1],0,10).'</p>';
return $matches[1];
}, $text);
echo $text;
However, in all three cases whitespace counts as part of the string, so if the content of <p>...</p> starts with 3 spaces and you limit it to 3 characters, you will simply display those 3 spaces and nothing more. Of course, this can be changed quite easily, but it is worth pointing out.
One more thing: you will quite possibly need the multibyte versions of these functions to get the right result. For example, use mb_strpos() instead of strpos(), and set the encoding beforehand with mb_internal_encoding('UTF-8');.
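The two caveats above (leading whitespace and multibyte characters) could be handled together along these lines. This is only a sketch: truncateParagraphs is a hypothetical helper name, and it assumes the mbstring extension is installed and that the input is valid UTF-8.

```php
<?php
// Hypothetical helper: cut the text of every <p>...</p> down to $limit characters.
function truncateParagraphs(string $html, int $limit): string
{
    $result = preg_replace_callback(
        '#<p>(.*?)</p>#isu',                 // s: dot matches newlines, u: treat input as UTF-8
        function (array $m) use ($limit) {
            $content = trim($m[1]);          // drop leading/trailing whitespace first
            return '<p>' . mb_substr($content, 0, $limit, 'UTF-8') . '</p>';
        },
        $html
    );
    // preg_replace_callback returns null on error (e.g. invalid UTF-8); fall back to the input.
    return $result ?? $html;
}

$text = "<p>   This is the main bit of text</p>\n<small> This is some additional text</small>";
echo truncateParagraphs($text, 10);
```

Trimming before truncating means the limit is spent on visible characters, and mb_substr counts characters rather than bytes, so a multibyte character is never split in half.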
You can achieve this quite simply:
<?php
$max_length = 5;
$input = "<b>example: </b><div align=left>this is a test</div><div>another very very long item</div>";
$elements_count = preg_match_all("|(<[^>]+>)(.*)(</[^>]+>)|U",
$input,
$out, PREG_PATTERN_ORDER);
for($i=0; $i<$elements_count; $i++){
echo $out[1][$i].substr($out[2][$i], 0, $max_length).$out[3][$i]."\n";
}
This will work for any tag, with any class or other attribute on it.
Example input:
<b>example: </b><div align=left>this is a test</div><div>another very very long item</div>
output:
<b>examp</b>
<div align=left>this </div>
<div>anoth</div>

Get data from div classes

How can I get data from div classes?
Example: A <div class=ab>1 2</div>b <div class=ab>3 4</div>c.
I want: 1 2, 3 4 etc - between <div class=ab> and </div>.
Second example: http://www.imdb.com/title/tt29747">
I want: tt29747 - between http://www.imdb.com/title/ and ">.
With strstr everything works, except that I only get the first result. I tried some solutions I found here but had no success; regular expressions are beyond me. Thank you!
Try parsing the HTML using DOMDocument() instead of regex.
However, here is the regex to parse assuming there will be no nested div:
$html= 'Example: Lorem <div class=ab>1 2</div>ipsum <div class=ab>3 4</div>dolor.';
preg_match_all('|<div class=ab>([^<]*)</div>|i', $html, $m);
print_r($m[1]);
And for parsing the title id:
$html = 'http://www.imdb.com/title/tt29747">';
preg_match('|imdb\.com/title/(tt\d+)|i', $html, $m);
print_r($m[1]);
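Since the question was about getting every result rather than only the first, the title pattern can also be run with preg_match_all. A small sketch follows; the second title id (tt12345) is made up purely for illustration.

```php
<?php
// Sketch: collect every IMDb title id in the string, not just the first one.
$html = '<a href="http://www.imdb.com/title/tt29747">A</a> '
      . '<a href="http://www.imdb.com/title/tt12345">B</a>'; // tt12345 is a made-up id

preg_match_all('|imdb\.com/title/(tt\d+)|i', $html, $m);

// $m[1] holds the contents of the first capture group for each match.
print_r($m[1]);
```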

How replace all spaces inside HTML PRE elements with &nbsp;

Similar to "How replace all spaces inside HTML elements with &nbsp; using preg_replace?"
Except I only want to modify spaces found between PRE tags. For example:
<table atrr="zxzx"><tr>
<td>adfa a adfadfaf></td><td><br /> dfa dfa</td>
</tr></table>
<pre class="abc" id="abc">abc abc</pre>
<pre>123 123</pre>
would be converted to (note the pre tag may contain attributes, or may not):
<table atrr="zxzx"><tr>
<td>adfa a adfadfaf></td><td><br /> dfa dfa</td>
</tr></table>
<pre class="abc" id="abc">abc&nbsp;abc</pre>
<pre>123&nbsp;123</pre>
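// Note: the /e modifier was deprecated in PHP 5.5 and removed in PHP 7.0.
// The s modifier is needed so that . also matches newlines inside <pre>.
$html = preg_replace(
'#(\<pre[^>]*>)(.*)(</pre>)#Usie'
, "'$1'.str_replace(' ', '&nbsp;', '$2').'$3'"
, $html);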
Has been tested, works with the sample string you provided. It's ungreedy, you don't want to replace spaces between </pre> and <pre>. Also works if the <pre></pre> section spans several lines.
Note: this will fail if you have nested situations like <pre> <pre> </pre> </pre>. If you want to be able to parse that, you need to parse the (X)HTML using the Document Object Model.
Update:
I have done some benchmarking and it turns out the callback version is faster by about 1 second per 100,000 iterations, so I think I should also mention that option.
$html = preg_replace_callback(
// the s modifier lets . match newlines, so multi-line <pre> sections are handled
'#(\<pre[^>]*>)(.*)(</pre>)#Uis'
, function($matches){
return $matches[1].str_replace(' ', '&nbsp;', $matches[2]).$matches[3];
}
, $html);
This requires PHP 5.3 or newer; earlier versions do not support anonymous functions.
do
$html = preg_replace('/(<pre.*>.*) (.*<\/pre>)/', '$1&nbsp;$2', $html, 1, $count);
while($count);
echo $html;
I'm not sure if there's a better solution. I'm not very familiar with all the preg functions.

Regex Replace with Backreference modified by functions

I want to replace the class with the div text like this :
This: <div class="grid-flags" >FOO</div>
Becomes: <div class="iconFoo" ></div>
So the class is changed to "icon". ucfirst(strtolower(FOO)) and the text is removed
Test HTML
<div class="grid-flags" >FOO</div>
Pattern
'/class=\"grid-flags\" \>(FOO|BAR|BAZ)/e'
Replacement
'class="icon'.ucfirst(strtolower($1).'"'
This is one example of a replacement string I've tried out of seemingly hundreds. I read that the /e modifier evaluates the PHP code but I don't understand how it works in my case because I need the double quotes around the class name so I'm lost as to which way to do this.
I tried variations on the backref eg. strtolower('$1'), strtolower('\1'), strtolower('{$1}')
I've tried single and double quotes and various escaping etc and nothing has worked yet.
I even tried preg_replace_callback() with no luck
function callback($matches){
return 'class="icon"'.ucfirst(strtolower($matches[0])).'"';
}
It was difficult for me to try to work out what you meant, but I think you want something like this:
preg_replace('/class="grid-flags" \>(FOO|BAR|BAZ)/e',
'\'class="icon\'.ucfirst(strtolower("$1")).\'">\'',
$text);
Output for your example input:
<div class="iconFoo"></div>
If this isn't what you want, could you please give us some example inputs and outputs?
And I have to agree that this would be easier with an HTML parser.
Instead of using the e(valuate) option you can use preg_replace_callback().
$text = '<div class="grid-flags" >FOO</div>';
$pattern = '/class="grid-flags" >(FOO|BAR|BAZ)/';
$myCB = function($cap) {
return 'class="icon'.ucfirst(strtolower($cap[1])).'" >';
};
echo preg_replace_callback($pattern, $myCB, $text);
But instead of using regular expressions you might want to consider a more suitable parser for html like simple_html_dom or php's DOM extension.
This works for me
$html = '<div class="grid-flags" >FOO</div>';
echo preg_replace_callback(
'/class *= *\"grid-flags\" *\>(FOO|BAR|BAZ)/'
// create_function() was removed in PHP 8.0, so an anonymous function is used instead
, function ($matches) { return 'class="icon' . ucfirst(strtolower($matches[1])) . '">' . $matches[1]; }
, $html
);
Just be aware of the problems of parsing HTML with regex.
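For comparison, here is a minimal sketch of the parser-based approach that several answers recommend, using PHP's DOM extension. It assumes the class attribute is exactly "grid-flags" and that the div contains only the flag text.

```php
<?php
// Sketch: rename the class from the div's text, then remove the text.
$html = '<div class="grid-flags" >FOO</div>';

$doc = new DOMDocument();
$doc->loadHTML($html);

$output = '';
foreach ($doc->getElementsByTagName('div') as $div) {
    if ($div->getAttribute('class') !== 'grid-flags') {
        continue;
    }
    // FOO -> iconFoo
    $div->setAttribute('class', 'icon' . ucfirst(strtolower(trim($div->textContent))));
    // Remove the text by dropping all child nodes.
    while ($div->firstChild) {
        $div->removeChild($div->firstChild);
    }
    $output .= $doc->saveHTML($div);
}
echo $output;
```

Here the quoting problem from the question never comes up, because the class name is set through setAttribute rather than spliced into a replacement string.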

regex php: find everything in div

I'm trying to find everything inside a div using regexp. I'm aware that there is probably a smarter way to do this, but I've chosen regexp.
So currently my regexp pattern looks like this:
$gallery_pattern = '/<div class="gallery">([\s\S]*)<\/div>/';
And it does the trick - somewhat.
The problem is when I have two divs one after the other, like this:
<div class="gallery">text to extract here</div>
<div class="gallery">text to extract from here as well</div>
I want to extract the information from both divs, but when testing, I don't get the text in between as a result but instead:
"text to extract here </div>
<div class="gallery">text to extract from here as well"
So to sum up: it skips the first closing </div> and continues on to the next one.
The text inside the div can contain <, / and line breaks, just so you know!
Does anyone have a simple solution to this problem? I'm still a regexp novice.
You shouldn't be using regex to parse HTML when there's a convenient DOM library:
$str = '
<div class="gallery">text to extract here</div>
<div class="gallery">text to extract from here as well</div>
';
$doc = new DOMDocument();
$doc->loadHTML($str);
$divs = $doc->getElementsByTagName('div');
if ( $divs->length ) {
foreach ( $divs as $div ) {
echo $div->nodeValue . '<br>';
}
}
What about something like this :
$str = <<<HTML
<div class="gallery">text to extract here</div>
<div class="gallery">text to extract from here as well</div>
HTML;
$matches = array();
preg_match_all('#<div[^>]*>(.*?)</div>#s', $str, $matches);
var_dump($matches[1]);
Note the '?' in the regex, so it is "not greedy".
Which will get you :
array
0 => string 'text to extract here' (length=20)
1 => string 'text to extract from here as well' (length=33)
This should work fine, as long as you don't have nested divs. If you do... well, are you really sure you want to use regular expressions to parse HTML, which is not all that regular itself?
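To make the greedy versus non-greedy difference concrete, here is a small sketch that runs the pattern from the question and a lazy variant of it against the same two divs. The variable names are only for illustration.

```php
<?php
$str = '<div class="gallery">text to extract here</div>' . "\n"
     . '<div class="gallery">text to extract from here as well</div>';

// Greedy: [\s\S]* runs to the LAST </div>, so both divs end up in one match.
preg_match_all('/<div class="gallery">([\s\S]*)<\/div>/', $str, $greedy);

// Lazy: [\s\S]*? stops at the FIRST </div> it can, giving one match per div.
preg_match_all('/<div class="gallery">([\s\S]*?)<\/div>/', $str, $lazy);

echo count($greedy[1]), "\n"; // 1
echo count($lazy[1]), "\n";   // 2
print_r($lazy[1]);
```

The lazy version still breaks on nested divs, since it simply stops at the first closing tag it sees.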
A possible answer to this problem can be found at http://simplehtmldom.sourceforge.net/
That class helped me solve a similar problem quickly.
