Preg replace callback validation - php

I need to rewrite some old code that I found in a library.
$text = preg_replace("/(<\/?)(\w+)([^>]*>)/e",
"'\\1'.strtolower('\\2').'\\3'", $text);
$text = preg_replace("/<br[ \/]*>\s*/","\n",$text);
$text = preg_replace("/(^[\r\n]*|[\r\n]+)[\s\t]*[\r\n]+/", "\n",
$text);
For the first one I have tried this:
$text = preg_replace_callback(
"/(<\/?)(\w+)([^>]*>)/",
function($subs) {
return strtolower($subs[0]);
},
$text);
I'm a bit confused because I don't understand this part: "'\\1'.strtolower('\\2').'\\3'", so I'm not sure what I should replace it with.
As far as I understand, the first line looks for tags and makes them lowercase, in case I have data like
<B>FOO</B>
Can you help me out with a clarification, and tell me whether my code is correct?

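$subs is an array that contains the whole match in the first item and the captured texts in the subsequent items. So Group 1 is in $subs[1], Group 2 is in $subs[2], etc. With the e modifier (deprecated in PHP 5.5 and removed in PHP 7), the replacement "'\\1'.strtolower('\\2').'\\3'" was evaluated as PHP code after \\1, \\2 and \\3 had been replaced with the group values, so only Group 2 (the tag name) was lowercased. $subs[0] contains the whole match, and you applied strtolower to it, but the original code left the Group 3 value (captured with ([^>]*>), which may also contain uppercase letters) intact.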
Use
$text = preg_replace_callback("~(</?)(\w+)([^>]*>)~", function($subs) {
return $subs[1] . strtolower($subs[2]) . $subs[3];
}, $text);
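To make the difference concrete, here is a minimal sketch (assuming PHP 7 or later) that wraps the callback in a helper. The function name lowercase_tag_names and the sample input are made up for illustration; only the pattern and the callback body come from the answer above.

```php
<?php
// Lowercase only the tag name (group 2); keep the opening bracket (group 1)
// and the rest of the tag (group 3) exactly as they were.
function lowercase_tag_names(string $html): string
{
    return preg_replace_callback(
        '~(</?)(\w+)([^>]*>)~',
        function (array $subs): string {
            return $subs[1] . strtolower($subs[2]) . $subs[3];
        },
        $html
    );
}

echo lowercase_tag_names('<B CLASS="Big">FOO</B>'), "\n";
// prints: <b CLASS="Big">FOO</b>
```

Applying strtolower to $subs[0] instead would also lowercase the attribute, giving <b class="big">FOO</b>.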

PHP str_replace scraped content with wild card?

I'm looking for a solution to strip some HTML from a scraped HTML page. The page has some repetitive data I would like to delete so I tried with preg_replace() to delete the variable data.
Data I want to strip:
Producent:<td class="datatable__body__item" data-title="Producent">Example
Groep:<td class="datatable__body__item" data-title="Produkt groep">Example1
Type:<td class="datatable__body__item" data-title="Produkt type">Example2
....
...
Must be like this afterwards:
Producent:Example
Groep:Example1
Type:Example2
So a big piece is the same except the word within the data-title piece. How could I delete this piece of data?
I tried a few things like this one:
$pattern = '/<td class=\"datatable__body__item\"(.*?)>/';
$tech_specs = str_replace($pattern,"", $tech_specs);
But that didn't work. Is there any solution to this?
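str_replace() treats its first argument as a literal string, not as a pattern, so use preg_replace() with a wildcard:
$newstr = preg_replace('/<td class="datatable__body__item" data-title=".*?">/', '', $str);
.*? matches any characters, but lazily: it stops at the first "> that follows instead of the last one.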
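As a quick check, the sketch below runs that pattern over two of the rows from the question. The helper name strip_cells is hypothetical and the sample string is pasted together by hand for the demonstration.

```php
<?php
// Two rows from the question, joined with a newline.
$html = 'Producent:<td class="datatable__body__item" data-title="Producent">Example' . "\n"
      . 'Groep:<td class="datatable__body__item" data-title="Produkt groep">Example1';

// Remove every opening <td> cell tag, whatever its data-title contains.
function strip_cells(string $html): string
{
    return preg_replace('/<td class="datatable__body__item" data-title=".*?">/', '', $html);
}

echo strip_cells($html), "\n";
// prints:
// Producent:Example
// Groep:Example1
```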
Assuming that the string looked like this:
$string = 'Producent:<td class="datatable__body__item" data-title="Producent">Example';
You could get the beginning and the end of the string with this:
preg_match('/^(\w+:).*\>(\w+)/', $string, $matches);
echo implode([$matches[1], $matches[2]]);
Which, in this case, will output Producent:Example. You could then add this output to another variable or array you intend to use.
OR, since you mentioned replacing:
$string = preg_replace('/^(\w+:).*\>(\w+)/', '$1$2', $string);
But since the input will probably come as a variable number of lines, process it line by line:
$string = 'Producent:<td class="datatable__body__item" data-title="Producent">Example
Groep:<td class="datatable__body__item" data-title="Produkt groep">Example1
Type:<td class="datatable__body__item" data-title="Produkt type">Example2';
$stringRows = explode(PHP_EOL, $string);
$pattern = '/^(\w+:).*\>(\w+)/';
$replacement = '$1$2';
foreach ($stringRows as &$stringRow) {
$stringRow = preg_replace($pattern, $replacement, $stringRow);
}
$string = implode(PHP_EOL, $stringRows);
Which will then output the string like you expect.
Explaining my regex:
The first group catches the first word up to the colon (:), then another group catches the last word. I had previously specified anchors for both ends, but when breaking the input into lines this did not work as expected, so I kept only the one at the beginning.
^(\w+:) => the word at the beginning of the string, up to and including the colon
.*\> => everything else up to the last greater-than symbol (the backslash escape is optional here)
(\w+) => the word after the greater-than symbol
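As an alternative to splitting the input into rows, the same pattern can probably be applied in a single call by adding the m (multiline) modifier. This is a sketch under that assumption, not part of the original answer; the helper name clean_rows is made up and the rows are joined with "\n" here.

```php
<?php
// Sample rows copied from the question, joined with "\n".
$rows = implode("\n", [
    'Producent:<td class="datatable__body__item" data-title="Producent">Example',
    'Groep:<td class="datatable__body__item" data-title="Produkt groep">Example1',
    'Type:<td class="datatable__body__item" data-title="Produkt type">Example2',
]);

// The m modifier lets ^ match at the start of every line; "." still stops
// at line breaks, so each row is rewritten on its own.
function clean_rows(string $rows): string
{
    return preg_replace('/^(\w+:).*>(\w+)/m', '$1$2', $rows);
}

echo clean_rows($rows), "\n";
// prints:
// Producent:Example
// Groep:Example1
// Type:Example2
```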
Well, maybe my question wasn't very well written. I had a table that I needed to scrape from a website. I needed the info in the table, but had to clean up some parts as mentioned. The solution I finally came up with is this one, and it works. It still needs a few manual replacements, but that is because of the stupid " they use for inch. ;-)
Solution:
// find the table in the source code
foreach($techdata->find('table') as $table){
    // filter out the rows
    foreach($table->find('tr') as $row){
        // take the innertext using simplehtmldom
        $tech_specs = $row->innertext;
        // strip some 'garbage'
        $tech_specs = str_replace(" \t\t\t\t\t\t\t\t\t\t\t<td class=\"datatable__body__item\">","", $tech_specs);
        // find the first word of the string so I can use it
        $spec1 = explode('</td>', $tech_specs)[0];
        // use the found string to strip down the rest of the table
        $tech_specs = str_replace("<td class=\"datatable__body__item\" data-title=\"" . $spec1 . "\">",":", $tech_specs);
        // manual correction because of the " used
        $tech_specs = str_replace("<td class=\"datatable__body__item\" data-title=\"tbv Montage benodigde 19\">",":", $tech_specs);
        // manual correction because of the " used
        $tech_specs = str_replace("<td class=\"datatable__body__item\" data-title=\"19\">",":", $tech_specs);
        // strip some 'garbage'
        $tech_specs = str_replace("\t\t\t\t\t\t\t\t\t\t","\n", $tech_specs);
        $tech_specs = str_replace("</td>","", $tech_specs);
        $tech_specs = str_replace(" ","", $tech_specs);
        // put the clean row in an array ready for usage
        $specs[] = $tech_specs;
    }
}

php preg_replace pattern - replace text between commas

I have a string of words in an array, and I am using preg_replace to make each word into a link. Currently my code works, and each word is transformed into a link.
Here is my code:
$keywords = "shoes,hats,blue curtains,red curtains,tables,kitchen tables";
$template = '%1$s';
$newkeys = preg_replace("/(?!(?:[^<]+>|[^>]+<\/a>))\b([a-z]+)\b/is", sprintf($template, "\\1"), $keywords);
The only problem arises when I want 2 or 3 words to be a single link. For example, I have a keyword "blue curtains". The script creates a link for the words "blue" and "curtains" separately. I have the keywords separated by commas, and I would like preg_replace to only replace the text between the commas.
I've tried playing around with the pattern, but I just can't figure out what the pattern would be.
Just to clarify, currently the output looks as follows:
shoes,hats,blue curtains,red curtains,tables,kitchen tables
While I want to achieve the following output:
shoes,hats,blue curtains,red curtains,tables,kitchen tables
A small change to the preg_replace pattern will do the job:
$keywords = "shoes,hats,blue curtains,red curtains,tables,kitchen tables";
$template = '%1$s';
$newkeys = preg_replace("/(?!(?:[^<]+>|[^>]+<\/a>))\b([a-z ' ']+)\b/is", sprintf($template, "\\1"), $keywords);
OR
$newkeys = preg_replace("/(?!(?:[^<]+>|[^>]+<\/a>))\b([a-z' ']+)\b/is", sprintf($template, "\\1"), $keywords);
echo $newkeys;
Output:- http://prntscr.com/77tkyb
Note: I just added a white-space character to the character class in your pattern, so that spaces are matched along with the words; that was the only thing missing.
Inside a character class the quotes are matched literally and are not needed, so [a-z ] is enough.
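Since the keywords are already comma-separated, another option is to skip the regex and split on the commas. The sketch below assumes a made-up link template (the real markup from the question is not shown in this copy), and link_keywords is a hypothetical helper name.

```php
<?php
// Hypothetical template: %1$s is the URL-encoded keyword, %2$s the link text.
$template = '<a href="/search?q=%1$s">%2$s</a>';

// Turn each comma-separated keyword (which may contain spaces) into one link.
function link_keywords(string $keywords, string $template): string
{
    $links = array_map(
        function (string $keyword) use ($template): string {
            $keyword = trim($keyword);
            return sprintf($template, rawurlencode($keyword), htmlspecialchars($keyword));
        },
        explode(',', $keywords)
    );
    return implode(',', $links);
}

echo link_keywords('shoes,blue curtains', $template), "\n";
// prints: <a href="/search?q=shoes">shoes</a>,<a href="/search?q=blue%20curtains">blue curtains</a>
```

This treats everything between two commas as one keyword, so multi-word entries stay together without touching the character class.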

PHP:preg_replace function

$text = "
<tag>
<html>
HTML
</html>
</tag>
";
I want to replace all the text present inside the tags with htmlspecialchars(). I tried this:
$regex = '/<tag>(.*?)<\/tag>/s';
$code = preg_replace($regex,htmlspecialchars($regex),$text);
But it doesn't work.
The output I am getting is htmlspecialchars of the regex pattern. I want to replace the match with htmlspecialchars of the data matched by the pattern.
What should I do?
You're replacing the match with the pattern itself; you're not using the back-references (or the e flag). In this case, though, preg_replace_callback is the way to go:
$code = preg_replace_callback($regex, 'replaceCallback', $text);
This passes the match groups to the callback and uses its return value as the replacement. The groups always arrive as an array (the first element is the whole match), so htmlspecialchars cannot be passed as the callback directly; wrap it in a function instead:
function replaceCallback($matches)
{
if (is_array($matches))
{
$matches = implode ('', array_slice($matches, 1));//first element is full string
}
return htmlspecialchars($matches);
}
Or, if your PHP version permits it (anonymous functions need PHP 5.3 or later):
$code = preg_replace_callback($regex, function($matches)
{
    $return = '';
    for ($i = 1, $j = count($matches); $i < $j; $i++)
    {// starting at 1 skips the whole match in $matches[0] and allows for any number of groups
        $return .= htmlspecialchars($matches[$i]);
    }
    return $return;
}, $text);
Try any of the above until you find something that works. Incidentally, if all you want is to remove <tag> and </tag> and escape the rest, why not go for the much simpler:
echo htmlspecialchars(str_replace(array('<tag>','</tag>'), '', $text));
That keeps it simple, and it will almost certainly be faster, too.
If you want to isolate the actual contents as defined by your pattern, you could use preg_match($regex,$text,$hits);. This gives you an array of hits: $hits[0] contains the whole matched string, and the bits that were between the parentheses in the pattern start at $hits[1]. You can then manipulate these matches, possibly using htmlspecialchars, and combine them again into $code.
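Putting the callback idea from this thread together, a minimal runnable sketch (assuming PHP 7 or later) could look like the following. The helper name escape_tag_contents is made up; the pattern is the one from the question.

```php
<?php
// Replace every <tag>...</tag> block with the HTML-escaped form of its contents.
function escape_tag_contents(string $text): string
{
    return preg_replace_callback(
        '/<tag>(.*?)<\/tag>/s',
        function (array $matches): string {
            // $matches[0] is the whole block, $matches[1] the text between the tags.
            return htmlspecialchars($matches[1]);
        },
        $text
    );
}

echo escape_tag_contents('<tag><html>HTML</html></tag>'), "\n";
// prints: &lt;html&gt;HTML&lt;/html&gt;
```

Text outside a <tag> block is left untouched, which is the main difference from the str_replace one-liner.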

PHP: Bolding of overlapping keywords in string

This is a problem that I have figured out how to solve, but I want to solve it in a simpler way... I'm trying to improve as a programmer.
Have done my research and have failed to find an elegant solution to the following problem:
I have a hypothetical array of keywords to search for:
$keyword_array = array('he','heather');
and a hypothetical string:
$text = "What did he say to heather?";
And, finally, a hypothetical function:
function bold_keywords($text, $keyword_array)
{
$pattern = array();
$replace = array();
foreach($keyword_array as $keyword)
{
$pattern[] = "/($keyword)/is";
$replace[] = "<b>$1</b>";
}
$text = preg_replace($pattern, $replace, $text);
return $text;
}
The function (not too surprisingly) is returning something like this:
"What did <b>he</b> say to <b>he</b>ather?"
Because it is not recognizing "heather" when there is a bold tag in the middle of it.
What I want the final solution to do is, as simply as possible, return one of the two following strings:
"What did <b>he</b> say to <b>heather</b>?"
"What did <b>he</b> say to <b><b>he</b>ather</b>?"
Some final conditions:
--I would like the final solution to deal with a very large number of possible keywords
--I would like it to deal with the following two situations (lines represent overlapping strings):
One string engulfs the other, like the following two examples:
-- he, heather
-- sanding, and
Or one string does not engulf the other:
-- entrain, training
Possible way to solve:
-A regex that ignores tags in keywords
-Long way (that I am trying to avoid):
*Search string for all occurrences of each keyword, store an array of positions (start and end) of keywords to be bolded
*Process this array recursively to combine overlapping keywords, so there is no redundancy
*Add the bold tags (starting from the end of the string, to avoid the positions of information shifting from the additional characters)
Many thanks in advance!
Example
$keyword_array = array('he','heather');
$text = "What did he say to heather?";
$pattern = array();
$replace = array();
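// longest keyword first, so "heather" is wrapped before "he" can split it
usort($keyword_array, function($a, $b) { return strlen($b) - strlen($a); });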
foreach($keyword_array as $keyword)
{
$pattern[] = "/ ($keyword)/is";
$replace[] = " <b>$1</b>";
}
$text = preg_replace($pattern, $replace, $text);
echo $text; // What did <b>he</b> say to <b>heather</b>?
You need to change your regex pattern to recognize that each "term" you are searching for is followed by whitespace or punctuation (a \b word boundary does this), so that the pattern does not match terms followed by an alphanumeric character.
A simplistic and lazy-ish approach off the top of my head:
Sort your initial array by item length, descending! No more "not recognized because there's already a tag in the middle" issues!
Edit: The nested tags issue is then easily fixed by extending your regex so that >foo and foo< are no longer matched.
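One way to combine the "longest first" idea with a single pass is to build one alternation from all keywords, so inserted tags are never scanned again. This is only a sketch, assuming PHP 7 or later; bold_keywords_once is a hypothetical name, and it does not merge partially overlapping keywords such as entrain and training (only the first match at a given position is wrapped).

```php
<?php
// Wrap each keyword occurrence in <b> tags using one combined pattern.
function bold_keywords_once(string $text, array $keywords): string
{
    if (!$keywords) {
        return $text; // nothing to search for
    }
    // Longest first, so "heather" wins over "he" at the same position.
    usort($keywords, function (string $a, string $b): int {
        return strlen($b) <=> strlen($a);
    });
    // Escape regex metacharacters in every keyword.
    $alternatives = array_map(function (string $keyword): string {
        return preg_quote($keyword, '/');
    }, $keywords);
    $pattern = '/' . implode('|', $alternatives) . '/i';
    return preg_replace($pattern, '<b>$0</b>', $text);
}

echo bold_keywords_once('What did he say to heather?', ['he', 'heather']), "\n";
// prints: What did <b>he</b> say to <b>heather</b>?
```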

Preg Replace - replace second occurrence of a match

I am relatively new to PHP, and hope someone can help me with a replace regex, or maybe a match-and-replace; I am not exactly sure.
I want to automatically bold the second occurrence of a match, then make the 4th occurrence italic, and then underline the 7th occurrence.
This is basically for SEO purposes in content.
I have done some replacements with the following and was thinking this should do the trick:
preg_replace( pattern, replacement, subject [, limit ])
I already know the word I want to use:
'pattern' is a word that is already defined, like [word].
'replacement' is a variable I am getting from a MySQL db.
'subject' is text from a db.
Let's say I have this content, which explains more or less what I want to do:
This is an example of the text that I want to replace. In this text I want to make the second occurrence of the word example bold. Then I want to skip the next time example occurs in the text, and make the 4th time the word example appears italic. Then I want to skip the 5th time the word example appears in the text, as well as the 6th time, and lastly I want to underline the 7th time example appears in the text. In this example I have used a hyperlink as the underline example, as I do not see an underline function in the text editor. The word example may appear more times in the text, but my only requirement is to underline once, make bold once and make italic once. I may later decide to put quotes around the word "example" as well, but that is not yet a priority.
It is also important for the code not to throw an error if there are not at least 7 occurrences of the word.
How would I do this? Any ideas would be appreciated.
You could use preg_split to split the text at the matches, apply the modifications, and then put everything back together:
// limit 8: splitting off seven separate matches needs eight surrounding pieces
$parts = preg_split('/(example)/', $str, 8, PREG_SPLIT_DELIM_CAPTURE);
if (isset($parts[3])) $parts[3] = '<b>'.$parts[3].'</b>';
if (isset($parts[7])) $parts[7] = '<i>'.$parts[7].'</i>';
if (isset($parts[13])) $parts[13] = '<u>'.$parts[13].'</u>';
$str = implode('', $parts);
The index formula for the i-th match is index = i · 2 - 1.
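The index formula can be checked with a short sketch. It assumes PHP 7 or later and splits without a limit, so every match gets its own odd index; format_occurrences and the occurrence-to-tag map are illustrative names, not part of the answer above.

```php
<?php
// Wrap the 2nd, 4th and 7th occurrence of $word in <b>, <i> and <u>.
function format_occurrences(string $text, string $word): string
{
    // With PREG_SPLIT_DELIM_CAPTURE the matches sit at the odd indexes 1, 3, 5, ...
    $parts = preg_split('/(' . preg_quote($word, '/') . ')/', $text, -1, PREG_SPLIT_DELIM_CAPTURE);
    $tags = [2 => 'b', 4 => 'i', 7 => 'u']; // occurrence number => tag
    foreach ($tags as $occurrence => $tag) {
        $index = $occurrence * 2 - 1;
        if (isset($parts[$index])) { // silently skip occurrences that do not exist
            $parts[$index] = "<$tag>{$parts[$index]}</$tag>";
        }
    }
    return implode('', $parts);
}

echo format_occurrences('eg eg eg eg', 'eg'), "\n";
// prints: eg <b>eg</b> eg <i>eg</i>
```

With only four occurrences the 7th index is simply absent, so nothing is underlined and no error is raised.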
The regular expression itself cannot count, and the preg_ functions provide little help. You need a workaround. If you were to actually search for just a word, you might want to use string functions. Otherwise try:
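// just counting: only format when there are at least 7 matches
// (drop this check to format however many matches exist)
if (preg_match_all($pattern, $subject, $matches) >= 7) {
    $cb_num = 0;
    $subject = preg_replace_callback($pattern, "cb_ibu", $subject);
}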
function cb_ibu($match) {
global $cb_num;
$match = $match[0];
switch (++$cb_num) {
case 2: return "<b>$match</b>";
case 4: return "<i>$match</i>";
case 7: return "<u>$match</u>";
default: return $match;
}
}
The trick is to have a callback which does the accounting. And there it's quite easy to add any rules.
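The same accounting can be done without a global variable by letting a closure capture the counter by reference. A minimal sketch, assuming PHP 7 or later; mark_occurrences is a hypothetical name and the tag choices mirror the question.

```php
<?php
// Bold the 2nd, italicize the 4th and underline the 7th match of $pattern.
function mark_occurrences(string $pattern, string $subject): string
{
    $seen = 0; // running match count, shared with the closure by reference
    return preg_replace_callback(
        $pattern,
        function (array $match) use (&$seen): string {
            switch (++$seen) {
                case 2:  return "<b>{$match[0]}</b>";
                case 4:  return "<i>{$match[0]}</i>";
                case 7:  return "<u>{$match[0]}</u>";
                default: return $match[0];
            }
        },
        $subject
    );
}

echo mark_occurrences('/eg/', 'eg eg eg eg eg eg eg eg'), "\n";
// prints: eg <b>eg</b> eg <i>eg</i> eg eg <u>eg</u> eg
```

Because the counter lives inside the function, repeated calls start from zero each time and texts with fewer than 7 matches pass through without errors.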
That's an interesting question. My implementation would be:
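function replace_exact($word, $tag, $string, $limit) {
    // str_replace() has no limit argument (its 4th parameter only returns a count),
    // so walk to the $limit-th occurrence of $word with strpos() instead
    $pos = -1;
    for ($i = 0; $i < $limit; $i++) {
        $pos = strpos($string, $word, $pos + 1);
        if ($pos === false) return $string; // fewer than $limit occurrences: leave text unchanged
    }
    return substr_replace($string, '<'.$tag.'>'.$word.'</'.$tag.'>', $pos, strlen($word));
}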
Use it like this:
$source_text = replace_exact('Example', 'b', $source_text, 2);
$source_text = replace_exact('Example', 'i', $source_text, 4);
echo $source_text;
I don't know how fast this will be, but for a plain word it avoids the regex engine, so it should be no slower than preg_replace.
