Searching page and replacing some elements - php

I have 2 sets of tags on page, first is
{tip}tooltip text{/tip}
and second is
{tip class="someClass"}tooltip text{/tip}
I need to replace those with
<span class="sstooltip"><i>?</i><em>tooltip text</em></span>
I don't know how to deal with adding a new class to the <span> tag. (The tooltip class is always present.)
This is my regex /\{tip.*?(?:class="([a-z]+)")?\}(.*?)\{\/tip\}/.
I guess I need to check array indexes for the class value, but those differ depending on the {tip} tag version. Do I need two regular expressions, one for each version, or is there some way to extract and replace the class value?
php code:
$regex = "/\{tip.*?(?:class=\"([a-z]+)\")?\}(.*?)\{\/tip\}/";
$matches = null;
preg_match_all($regex, $article->text, $matches);
if (is_array($matches)) {
foreach ($matches as $match) {
$article->text = preg_replace(
$regex,
"<span class=tooltip \$1"."><i>?</i><em>"."\$2"."</em></span>",
$article->text
);
}
}

Here's your answer (I've also made it a bit more robust):
{tip(?:\s+class\s*=\s*"([a-zA-Z\s]+)")?}([^{]*){\/tip}
PCRE (the engine behind PHP's preg_* functions) treats the first capture group (which grabs the classes) as empty in the first case, and just substitutes the empty string for $1 in the replacement. The second case is self-explanatory.
Your replacement code, then, will look like this:
$article->text = preg_replace(
'/{tip(?:\s+class\s*=\s*"([a-zA-Z\s]+)")?}([^{]*){\/tip}/',
'<span class="tooltip $1"><i>?</i><em>$2</em></span>',
$article->text
);
You don't need to check whether the regex matches beforehand. That's implied by preg_replace, which performs a regex match and then replaces any text matched by the pattern with the replacement string. If there are no matches, no replacement occurs.
Regex Demo on Regex101
Code Demo on repl.it
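As a rough illustration of the answer above, here is a minimal sketch that runs both tag variants through one pattern. It swaps preg_replace for preg_replace_callback so the output has no trailing space in class="tooltip " when no class is given; that swap, the sample text and the class name "wide" are assumptions made for the example, not part of the original answer.

```php
<?php
// Made-up sample input containing both tag variants.
$text = 'A {tip}plain hint{/tip} and a {tip class="wide"}styled hint{/tip}.';

// Same idea as the pattern above, with the braces escaped for clarity.
$pattern = '/\{tip(?:\s+class\s*=\s*"([a-zA-Z\s]+)")?\}([^{]*)\{\/tip\}/';

$result = preg_replace_callback(
    $pattern,
    function (array $m): string {
        // $m[1] is an empty string when the class attribute is absent.
        $class = trim('tooltip ' . $m[1]);
        return '<span class="' . $class . '"><i>?</i><em>' . $m[2] . '</em></span>';
    },
    $text
);

echo $result, "\n";
```

The tooltip text is inserted as-is here; if it can contain user input, it would probably need escaping before being placed in HTML.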

Related

Matching substrings with PHP preg_match_all()

I'm attempting to create a lightweight BBCode parser without hardcoding regex matches for each element. My way is utilizing preg_replace_callback() to process the match in the function.
My simple yet frustrating way involves using regex to capture the element's name and then parsing each one differently with a switch.
Here is my regex pattern:
'~\[([a-z]+)(?:=(.*))?(?: (.*))?\](.*)(?:\[/\1\])~siU'
And here is the preg_replace_callback() I've got to test.
return preg_replace_callback(
'~\[([a-z]+)(?:=(.*))?(?: (.*))?\](.*)(?:\[/\1\])~siU',
function($matches) {
var_dump($matches);
return "<".$matches[1].">".$matches[4]."</".$matches[1].">";
},
$this->raw
);
This one issue has stumped me. The regex pattern won't seem to recursively match, meaning if it matches an element, it won't match elements inside it.
Take this BBCode for instance:
[i]This is all italics along with a [b]bold[/b].[/i]
This will only match the [i], and won't match any of the elements inside of it, so it looks like
<i>This is all italics along with a [b]bold[/b].</i>
preg_match_all() continues to show this to be the case, and I've tried messing with greedy syntax and modes.
How can I solve this?
Thanks to @Casimir et Hippolyte for their comment, I was able to solve this using a loop and the count parameter like they said.
The basic regex strings don't work because I would like to use values in the tags like [color=red] or [img width=""].
Here is the finalized code. It isn't perfect but it works.
$str = $this->raw;
do {
$str = preg_replace_callback(
'~\[([a-z]+)(?:=([^]\s]*))?(?: ([^[]*))?\](.*?)(?:\[/\1\])~si',
function($matches) {
return "<".$matches[1].">".$matches[4]."</".$matches[1].">";
},
$str,
-1,
$count
);
} while ($count);
return $str;
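To see the loop in action outside of a class, here is a small self-contained sketch. The function name bbcodeToHtml is hypothetical (the original code lives in a method that reads $this->raw); the pattern and the loop are the same as above.

```php
<?php
// Hypothetical standalone wrapper around the loop shown above.
function bbcodeToHtml(string $str): string
{
    do {
        $str = preg_replace_callback(
            '~\[([a-z]+)(?:=([^]\s]*))?(?: ([^[]*))?\](.*?)(?:\[/\1\])~si',
            function (array $m): string {
                // $m[1] is the tag name, $m[4] the content between the tags.
                return '<' . $m[1] . '>' . $m[4] . '</' . $m[1] . '>';
            },
            $str,
            -1,
            $count // number of replacements made in this pass
        );
    } while ($count > 0); // repeat until a pass changes nothing

    return $str;
}

$html = bbcodeToHtml('[i]This is all italics along with a [b]bold[/b].[/i]');
echo $html, "\n";
```

The first pass converts the outer [i] pair, the second pass converts the inner [b] pair, and the third pass finds nothing and ends the loop.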

I need to find a way explode a specific string that has quotes in it

I'm having serious trouble with this and I'm not really experienced enough to understand how I should go about it.
To start off I have a very long string known as $VC. Each time it's slightly different but will always have some things that are the same.
$VC is an htmlspecialchars() string that looks something like
Example Link... Lots of other stuff in between here... 80] ,[] ,"","3245697351286309258",[] ,["812750926... and it goes on ...80] ,[] ,"","6057413202557366578",[] ,["103279554... and it continues on
In this case the <a> tag is always the same so I take my information from there. The numbers listed after it such as ,"3245697351286309258",[] and ,"6057413202557366578",[] will also always be in the same format, just different numbers and one of those numbers will always be a specific ID.
I then find that specific ID I want, I will always want that number inside pid%3D and %26oid.
$pid = explode("pid%3D", $VC, 2);
$pid = explode("%26oid", $pid[1], 2);
$pid = $pid[0];
In this case that number is 6057413202557366578. Next I want to explode $VC in a way that lets me put everything after ,"6057413202557366578",[] into a variable as its own string.
This is where things start to break down. What I want to do is the following
$vinfo = explode(',"'.$pid.'",[]',$VC,2);
$vinfo = $vinfo[1]; //Everything after the value I used to explode it.
Now naturally I did look around and try other things such as preg_split and preg_replace but I've got to admit, it is beyond me and as far as I can tell, those don't let you put your own variable in the middle of them (e.g. ',"'.$pid.'",[]').
If I'm understanding the whole regular expression idea, there might be other problems in that if I look for it without the $pid variable (e.g. just the surrounding characters), it will pick up the similar parts of the string before it gets to the one I want, (e.g. the ,"3245697351286309258",[]).
I hope I've explained this well enough, the main question though is - How can I get the information after that specific part of the string (',"'.$pid.'",[]') into a variable?
I hope this does what you want:
pid%3D(?P<id>\d+).*?"(?P=id)",\[\](?P<vinfo>.*?)}\);<\/script>
It captures the number after pid%3D in group id, and everything after "id",[] (until the next occurrence of });</script>) in group vinfo.
Here's a demo with shortened text.
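A minimal sketch of how that pattern could be used from PHP follows. The sample string is invented and heavily shortened; only the pid%3D / %26oid markers and the ,"number",[] format come from the question.

```php
<?php
// Invented, shortened stand-in for $VC.
$vc = '<a href="watch?pid%3D6057413202557366578%26oid%3D1">Example Link</a> '
    . '80] ,[] ,"","3245697351286309258",[] ,["812750926"] '
    . ',[] ,"","6057413202557366578",[] ,["103279554"]});</script>';

// (?P<id>...) names the group; (?P=id) requires the same digits again later.
$pattern = '~pid%3D(?P<id>\d+).*?"(?P=id)",\[\](?P<vinfo>.*?)}\);</script>~s';

if (preg_match($pattern, $vc, $m)) {
    echo $m['id'], "\n";    // the ID found after pid%3D
    echo $m['vinfo'], "\n"; // everything after "ID",[] up to });</script>
}
```

Because the backreference must repeat the captured ID exactly, the earlier ,"3245697351286309258",[] entry is skipped.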
The problem of capturing more than you want is solved with capture groups: wrap part of a regular expression in parentheses to capture it.
You can use preg_match_all for more robust capturing. It fills an array whose first element holds the strings that matched the entire pattern, followed by one element per capture group holding the partial matches. We'll start by matching the parts of the string you want. There are no capture groups at this point:
$text = 'Example Link... Lots of other stuff in between here... 80] ,[] ,"","3245697351286309258",[] ,["812750926... and it goes on ...80] ,[] ,"","6057413202557366578",[] ,["103279554... and it continues on"';
$pattern = '/,"\\d+",\\[\\]/';
preg_match_all($pattern,
$text,
$out, PREG_PATTERN_ORDER);
echo $out[0][0]; //echo ,"3245697351286309258",[]
Now to get just the pids into a variable, you can add a capture group to your pattern. A capture group is created by adding parentheses:
$text = ...
$pattern = '/,"(\\d+)",\\[\\]/'; // the \d+ match will be captured
preg_match_all($pattern,
$text,
$out, PREG_PATTERN_ORDER);
$pids = $out[1];
echo $pids[0]; // echo 3245697351286309258
Notice the first (and only in this case) capture group is in $out[1] (which is an array). What we have captured is all the digits.
To capture everything else, assuming everything is between square brackets, you could match more and capture it. To address the question, we'll use two capture groups. The first will capture the digits and the second will capture everything matching square brackets and everything in between:
$text = ...;
$pattern = '/,"(\\d+)",\\[\\] ,(\\[.+?\\])/';
preg_match_all($pattern,
$text,
$out, PREG_PATTERN_ORDER);
$pids = $out[1];
$contents = $out[2];
echo $pids[0] . "=" . $contents[0] ."\n";
echo $pids[1] . "=". $contents[1];
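The question also worried that regex functions don't allow a variable in the middle of the pattern. They do: the pattern is an ordinary string, and preg_quote() escapes any characters in the variable that would otherwise be treated as regex syntax. A minimal sketch, using a shortened, invented stand-in for $VC:

```php
<?php
// Shortened, invented stand-in for $VC.
$vc  = '80] ,[] ,"","3245697351286309258",[] ,["812750926"] ,[] ,"","6057413202557366578",[] ,["103279554"]';
$pid = '6057413202557366578';

// Build the pattern around the variable; preg_quote() escapes metacharacters.
$pattern = '/,"' . preg_quote($pid, '/') . '",\[\]/';

// Limit 2: one piece before the marker, one piece with everything after it.
$parts = preg_split($pattern, $vc, 2);
$vinfo = $parts[1] ?? '';

echo $vinfo, "\n";
```

Since the pattern contains the specific ID, the similar-looking ,"3245697351286309258",[] part is not matched.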

preg_match_all to include all results plus ones without a certain value

I'm trying to do a preg_match_all on the following string:
$string1 = '/<a href="(.*?).(jpg|jpeg|png|gif|bmp|ico)"><img(.*?)class="(.*?)wp-image-(.*?)" title="(.*?)" (.*?) \/><\/a>/i';
preg_match_all( $string1, $content, $matches, PREG_SET_ORDER);
The above works fine for what i'm doing, the problem is I also need to detect images without the "title" tag.
Is there a way to do a preg_match_all and also add matches if the string doesn't have value[6] (the title attribute is value[6]), and give those results (without title) a special name (e.g. $matches_no_title)?
My current solution is to run two preg_match_all on two different strings (same string except one doesn't have the title="" part), but if I could do it all in one preg_match_all to optimize the website speed, that would be better!
Regex is not the best approach for what you want. You can parse the HTML instead and pull out what you need.
$dom = new DOMDocument();
$dom->preserveWhiteSpace = false; // only takes effect if set before loading
$dom->loadHTML($html);
$images = $dom->getElementsByTagName('img');
foreach ($images as $image) {
echo $image->getAttribute('src');
}
If you're sure that the title attribute comes (right) after the class attribute, it's simple. Just make it optional.
$string1 = '/<a href="(.*?)\.(jpg|jpeg|png|gif|bmp|ico)"><img(.*?)class="(.*?)wp-image-(.*?)"(?: title="(.*?)")? (.*?) \/><\/a>/i';
Do note that the regex is too specific to match general HTML.
In this case you might be better off using SimpleXML with XPath or a library like PHP Simple HTML DOM Parser.
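Here is a rough sketch of how the optional title group could be used to split the results into two arrays in a single pass. The sample HTML and the name $matches_with_title are made up for illustration; $matches_no_title is the name suggested in the question.

```php
<?php
// Made-up sample: one image with a title attribute, one without.
$content = '<a href="a.jpg"><img src="a.jpg" class="alignleft wp-image-12" title="First" width="10" /></a>'
         . '<a href="b.png"><img src="b.png" class="wp-image-34" width="20" /></a>';

$pattern = '/<a href="(.*?)\.(jpg|jpeg|png|gif|bmp|ico)"><img(.*?)class="(.*?)wp-image-(.*?)"(?: title="(.*?)")? (.*?) \/><\/a>/i';

preg_match_all($pattern, $content, $matches, PREG_SET_ORDER);

$matches_with_title = [];
$matches_no_title   = [];
foreach ($matches as $m) {
    // Group 6 is an empty string when the optional title part did not match.
    if ($m[6] !== '') {
        $matches_with_title[] = $m;
    } else {
        $matches_no_title[] = $m;
    }
}

echo count($matches_with_title), ' with title, ', count($matches_no_title), " without\n";
```

One caveat of this check: an image with an empty title="" ends up in the "no title" array as well.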
I would think alternation with a null will do what you want:
$string1 = '/<a href="(.*?).(jpg|jpeg|png|gif|bmp|ico)"><img(.*?)class="(.*?)wp-image-(.*?)" (|title="(.*?)") (.*?) \/><\/a>/i';
preg_match_all( $string1, $content, $matches, PREG_SET_ORDER);
You may also need to get fancy about optional whitespace; as it is, you'll be expecting to match a space before and after the optional title="blah" tokens, which means that the match would look for two spaces if the title="blah" isn't there... so you may want
wp-image-(.*?)"(| title="(.*?)" )(.*?) \/>
or
wp-image-(.*?)"(|\s+title="(.*?)"\s+)(.*?) \/>
instead of
wp-image-(.*?)" (|title="(.*?)") (.*?) \/>

How to ignore regex matches wrapped by a particular string?

I had a great idea for some functionality on a project and I've tried to implement it to the best of my ability but I need a little help achieving the desired effect. The page in question is: http://dev.favorcollective.com/guidelines/ (just to provide some context)
I'm using php's preg_replace to go through a particular page's contents (giant string) and I'm having it search for glossary terms and then I wrap the terms with a bit of html that enables dynamic glossary definition tooltips.
Here is my current code:
function annotate($content)
{
global $glossary_terms;
$search = array();
$replace = array();
$count=1;
foreach ($glossary_terms as $term):
array_push($search,'/\b('.preg_quote($term['term'],'/').')[?=a-zA-Z]*/i');
$id = "annotation-".$count;
$replacement = ''.$term['term'].'<span id="'.$id.'" style="display:none;"><span class="term">'.$term['term'].'</span><span class="definition">'.$term['def'].'</span></span>';
array_push($replace,(string)$replacement);
$count++;
endforeach;
return preg_replace($search, $replace, $content);
}
• But what if I want to ignore matches inside of <h#> </h#> tags?
• I also have a particular string that I do not want a specific term to match within. For example, I want the word "proficiency" to match any time it is NOT used in the context of "ACTFL Proficiency Guidelines" how would I go about adding exceptions to my regular expression? Is that even an option?
• Finally, how can I return the matched text as a variable? Currently when I match for a term ending in 's' or 'ing' (on purpose) my script prints the glossary term rather than the original string that was matched (i.e. it's replacing "descriptions" with "description"). Is there any way to do that?
not a php guy (c#), but here goes. I assume that:
'/\b('.preg_quote($term['term'],'/').')[?=a-zA-Z]*/i' will map to this far more readable pattern:
/\b(ESCAPED_TERM)[?=a-zA-Z]*/i
so, as far as excluding <h#> type tags, regex is ok only if you can assume your data would be the simple, non-nested case: <h#>TERM</h#>. If you can, you can use a negative lookahead assertion:
/\b(ESCAPED_TERM)(?!<\/h\d>)[?=a-zA-Z]*/i
you can use a lookahead with a lookbehind to handle your special case:
/\b(ESCAPED_TERM|(?<!ACTFL )Proficiency(?!\sGuidelines))(?!<\/h\d>)[?=a-zA-Z]*/i
note: if you have a bunch of these special cases, PCRE's x (extended) modifier, which PHP supports, makes the engine ignore whitespace in the pattern, so you can put each token on its own line.
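A minimal sketch of the lookbehind/lookahead idea, written with the x modifier so each token sits on its own line. It also touches the third bullet of the question: $0 in the replacement string stands for the text that actually matched, so "descriptions" would stay "descriptions". The sample sentence and the span markup are invented for the example.

```php
<?php
// Invented sample sentence.
$content = 'See the ACTFL Proficiency Guidelines. Language proficiency grows with practice.';

$pattern = '/
    \b
    (?<!ACTFL\s)       # not preceded by "ACTFL "
    proficiency
    (?!\sGuidelines)   # not followed by " Guidelines"
    [a-z]*             # optional suffix such as "s" or "ing"
/xi';

// $0 inserts the text that was matched, not the glossary term itself.
$result = preg_replace($pattern, '<span class="term">$0</span>', $content);

echo $result, "\n";
```

Only the second occurrence is wrapped; the one inside "ACTFL Proficiency Guidelines" is left alone.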
Regular expressions are awesome, wonderful, magical. But everything has its limits.
That's why it's nice to have a language like PHP to provide the extra functionality. :)
Can you strip out headers with a non-greedy regexp?
$content = preg_replace('/<h[1-6]>.*?<\/h[1-6]>/sim', "", $content);
If non-greedy evaluations aren't working, what about just assuming that there won't be any other HTML inside your headers?
$content = preg_replace('/<h[1-6]>[^<]*<\/h[1-6]>/im', "", $content);
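The two calls above remove the headings from the content. If the headings should stay in the output and merely be skipped, one possible alternative (not what the answer above does) is to split the content on headings and annotate only the pieces in between. A sketch, with a hypothetical helper name and a stand-in annotation callback:

```php
<?php
// Hypothetical helper: applies $annotate only to text outside <h1>..<h6>.
function annotateOutsideHeadings(string $content, callable $annotate): string
{
    // With PREG_SPLIT_DELIM_CAPTURE the captured headings are kept in the
    // result, so pieces alternate: text, heading, text, heading, ...
    $parts = preg_split(
        '/(<h[1-6]\b[^>]*>.*?<\/h[1-6]>)/is',
        $content,
        -1,
        PREG_SPLIT_DELIM_CAPTURE
    );

    foreach ($parts as $i => $part) {
        if ($i % 2 === 0) { // even indexes are the text between headings
            $parts[$i] = $annotate($part);
        }
    }

    return implode('', $parts);
}

// Stand-in for the real annotate() logic: wrap the word "term" in <b>.
$out = annotateOutsideHeadings(
    '<h2>Term list</h2><p>A term here.</p>',
    function (string $s): string {
        return preg_replace('/\bterm\b/i', '<b>$0</b>', $s);
    }
);

echo $out, "\n";
```

Like the patterns above, this assumes headings are not nested inside one another.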
Also, you might want to use sprintf to simplify your replacement:
/*
1 get_bloginfo('url')
2 preg_replace( '/\s+/', '', $term['term']).
3 $id
4 $term['term']
5 $term['def']
*/
$rfmt = '%4$s<span id="%3$s" style="display:none;"><span class="term">%4$s</span><span class="definition">%5$s</span></span>';
...
$replacement = sprintf($rfmt, get_bloginfo('url'), preg_replace( '/\s+/', '', $term['term']), $id, $term['term'], $term['def'] );

How to use preg match all in php?

Hi, I want to retrieve certain information from a website.
This is what is displayed on the website, with HTML tags.
<a href="ProductDisplay?catalogId=10051&storeId=90001&productId=258033&langId=-1" id="WC_CatalogSearchResultDisplay_Link_6_3" class="s_result_name">
SALT - Fine
</a>
What I want to extract is "SALT - Fine" using preg_match, but I can't get it to work. Is it because the tag is spread over several lines? I noticed that if everything is on a single line, I can retrieve what I want.
This is my code -
$pattern = '/id="WC_CatalogSearchResultDisplay_Link_6_3.*<\/a>/';
preg_match_all($pattern, $response, $match);
print_r($match);
I do not get anything in my array, but if everything is on a single line it works. Why is that so?
Have a look at:
http://www.php.net/manual/en/reference.pcre.pattern.modifiers.php
especially the m and s modifiers.
Also, I would recommend, changing the pattern to something like:
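$pattern = '/id="WC_CatalogSearchResultDisplay_Link_6_3"[^>]*>(.*?)<\/a>/ims';
The capture group keeps the closing </a> out of the captured text, and the lazy .*? stops at the first </a> rather than the last one on the page.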
And on a side note, don't use regex to parse html/xml.
Something like this:
<?php
$dom = new DOMDocument();
$dom->loadHTML($response);
$xpath = new DOMXPath($dom);
$node = $xpath->query('//*[@id="WC_CatalogSearchResultDisplay_Link_6_3"]/text()')->item(0);
if ($node instanceof DOMText) {
echo trim($node->nodeValue);
}
will also work, and will be a lot more robust.
You should encapsulate what you want to match by (). So i guess your pattern would then become
$pattern = '/id="WC_CatalogSearchResultDisplay_Link_6_3(.*)<\/a>/';
I however don't fully see how you arrived at this pattern, since it would be simpler to just match everything enclosed by a-tags.
Edit:
You also need the s modifier as mentioned by Yoshi so the . matches a newline. I would thus suggest you use this code:
$pattern = '/<a[^>]*>(.+?)<\/a>/si';
preg_match_all($pattern, $response, $match);
print_r($match);
You're right, it's because it's a multi-line input string.
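You need the s modifier on the pattern so that it can match across lines (m is included here too, although it only matters for patterns that use ^ or $):
$pattern = '/id="WC_CatalogSearchResultDisplay_Link_6_3.*<\/a>/ms';
The m modifier makes ^ and $ match at the start and end of each line instead of only at the start and end of the whole string.
The s modifier makes the . (dot) match newline characters as well as all others (by default it doesn't match newlines).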
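The effect of the s modifier can be checked with a short sketch like the following. The markup is a trimmed copy of the snippet from the question (the href is shortened), and a lazy capture group is added so only the link text is captured.

```php
<?php
// Trimmed copy of the multi-line link from the question.
$response = "<a href=\"ProductDisplay?productId=258033\" id=\"WC_CatalogSearchResultDisplay_Link_6_3\" class=\"s_result_name\">\n"
          . "    SALT - Fine\n"
          . "</a>";

$base = 'id="WC_CatalogSearchResultDisplay_Link_6_3"[^>]*>(.*?)<\/a>';

// Without s, the dot cannot cross the line breaks, so nothing matches.
$withoutS = preg_match('/' . $base . '/', $response, $m1);

// With s, the dot also matches newlines and the link text is captured.
$withS = preg_match('/' . $base . '/s', $response, $m2);

var_dump($withoutS);         // 0 matches
var_dump(trim($m2[1] ?? '')); // the link text without surrounding whitespace
```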
