How to use preg_match_all in PHP?

Hi, I want to retrieve certain information from a website.
This is what is displayed on the website, with HTML tags:
<a href="ProductDisplay?catalogId=10051&storeId=90001&productId=258033&langId=-1" id="WC_CatalogSearchResultDisplay_Link_6_3" class="s_result_name">
SALT - Fine
</a>
What I want to extract is "SALT - Fine" using preg_match_all, but it does not work and I do not know why. Is it because the tag and its text are on different lines? I noticed that if they are all on a single line, I can retrieve what I want.
This is my code:
$pattern = '/id="WC_CatalogSearchResultDisplay_Link_6_3.*<\/a>/';
preg_match_all($pattern, $response, $match);
print_r($match);
I do not get anything in my array. If everything is on a single line, it works. Why is that?

Have a look at:
http://www.php.net/manual/en/reference.pcre.pattern.modifiers.php
especially the m and s modifiers.
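Also, I would recommend changing the pattern to something like:
$pattern = '/id="WC_CatalogSearchResultDisplay_Link_6_3"[^>]*>(.*?)<\/a>/ims';
The [^>]*> part skips the rest of the opening a-tag, so the capture group holds only the link text; otherwise, you'll also match the end of your opening a-tag. The lazy .*? stops at the first </a> instead of the last one on the page.
And on a side note, don't use regex to parse HTML/XML.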
Something like this:
<?php
$dom = new DOMDocument();
$dom->loadHTML($response); // loadHTML is an instance method; calling it statically fails on PHP 8
$xpath = new DOMXPath($dom);
$node = $xpath->query('//*[@id="WC_CatalogSearchResultDisplay_Link_6_3"]/text()')->item(0);
if ($node instanceof DOMText) {
echo trim($node->nodeValue);
}
will also work, and will be a lot more robust.

You should enclose the part you want to capture in parentheses. So I guess your pattern would then become
$pattern = '/id="WC_CatalogSearchResultDisplay_Link_6_3(.*)<\/a>/';
However, I don't fully see how you arrived at this pattern, since it would be simpler to just match everything enclosed by a-tags.
Edit:
You also need the s modifier, as mentioned by Yoshi, so that the . matches a newline. I would thus suggest you use this code (the lazy .+? keeps each match inside a single a-tag):
$pattern = '/<a[^>]*>(.+?)<\/a>/si';
preg_match_all($pattern, $response, $match);
print_r($match);

You're right, it's because it's a multi-line input string.
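You need to add the s modifier to the regex pattern so that it can match across lines:
$pattern = '/id="WC_CatalogSearchResultDisplay_Link_6_3.*<\/a>/ms';
The m modifier makes ^ and $ match at the start and end of every line instead of only at the start and end of the whole string. It is harmless here, but since this pattern has no anchors it changes nothing.
The s modifier makes the . (dot) match newline characters as well as all others (by default it doesn't match newlines). This is the modifier that fixes the problem.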
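To see the effect of the s modifier in isolation, here is a minimal sketch. The $response string is a shortened stand-in for the fetched page (an assumption, not the real markup), and the pattern uses a lazy capture group so that only the link text is grabbed:

```php
<?php
// Shortened stand-in for the page content from the question (assumed sample data).
$response = <<<HTML
<a href="ProductDisplay?productId=258033" id="WC_CatalogSearchResultDisplay_Link_6_3" class="s_result_name">
SALT - Fine
</a>
HTML;

// Without "s" the dot stops at the first newline, so the pattern cannot reach </a>.
$withoutS = '/id="WC_CatalogSearchResultDisplay_Link_6_3"[^>]*>(.*?)<\/a>/';
// With "s" the dot also matches newlines; the lazy .*? stops at the first </a>.
$withS = '/id="WC_CatalogSearchResultDisplay_Link_6_3"[^>]*>(.*?)<\/a>/s';

$countWithout = preg_match_all($withoutS, $response, $m1);
$countWith = preg_match_all($withS, $response, $m2);

// Group 1 still contains the newlines around the text, so trim each hit.
$names = array_map('trim', $m2[1]);

echo $countWithout, "\n"; // 0
echo $countWith, "\n";    // 1
print_r($names);          // the single element is "SALT - Fine"
```

The m modifier is left out on purpose: the pattern has no ^ or $, so it would make no difference.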

Related

Searching page and replacing some elements

I have 2 sets of tags on page, first is
{tip}tooltip text{/tip}
and second is
{tip class="someClass"}tooltip text{/tip}
I need to replace those with
<span class="tooltip"><i>?</i><em>tooltip text</em></span>
I don't know how to deal with adding the new class to the <span> tag. (The tooltip class is always present.)
This is my regex /\{tip.*?(?:class="([a-z]+)")?\}(.*?)\{\/tip\}/.
I guess I need to check the array indexes for the class value, but those differ depending on the {tip} tag version. Do I need two regular expressions, one for each version, or is there some way to extract and replace the class value?
php code:
$regex = "/\{tip.*?(?:class=\"([a-z]+)\")?\}(.*?)\{\/tip\}/";
$matches = null;
preg_match_all($regex, $article->text, $matches);
if (is_array($matches)) {
foreach ($matches as $match) {
$article->text = preg_replace(
$regex,
"<span class=tooltip \$1"."><i>?</i><em>"."\$2"."</em></span>",
$article->text
);
}
}
Here's your answer (I've also made it a bit more robust):
{tip(?:\s+class\s*=\s*"([a-zA-Z\s]+)")?}([^{]*){\/tip}
PCRE (which PHP uses, if memory serves) will automatically pick up that the first capture group (which grabs the classes) is empty in the first case, and just substitute the empty string in the replacement. The second case is self-explanatory.
Your replacement code, then, will look like this:
$article->text = preg_replace(
'/{tip(?:\s+class\s*=\s*"([a-zA-Z\s]+)")?}([^{]*){\/tip}/',
'<span class="tooltip $1"><i>?</i><em>$2</em></span>',
$article->text
);
You don't need to check whether the regex matches beforehand. That's implied by preg_replace, which performs the regex match and then replaces any text matched by the pattern with the replacement. If there are no matches, no replacement occurs.
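One side effect of the plain preg_replace version is that a {tip} without a class produces class="tooltip " with a trailing space. If that matters, a callback can build the class list instead. This is only a sketch: the sample $text is made up, and the braces are escaped for readability rather than out of necessity:

```php
<?php
// Made-up input covering both tag variants from the question.
$text = 'A {tip}plain{/tip} and B {tip class="wide"}styled{/tip}.';

$pattern = '/\{tip(?:\s+class\s*=\s*"([a-zA-Z\s]+)")?\}([^{]*)\{\/tip\}/';

$result = preg_replace_callback(
    $pattern,
    function (array $m): string {
        // $m[1] is an empty string when the optional class attribute is absent,
        // so trim() removes the space that would otherwise trail "tooltip".
        $classes = trim('tooltip ' . $m[1]);
        return '<span class="' . $classes . '"><i>?</i><em>' . $m[2] . '</em></span>';
    },
    $text
);

echo $result, "\n";
```

Both variants are handled by the same pattern, so one pass over the text is enough.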

Regex pattern for matching mm<sup>3</sup>

I’m trying to write a regular expression to change mm3 to mL:
<?php
$match = 'mm<sup>3</sup>';
if(preg_match('/\b(mm<sup>3</sup>)\b/', $match))
{
$replacement = 'ml';
$replac = preg_replace('/\b(mm<sup>3</sup>)\b/', $replacement, $match);
echo $replac;
}
?>
But my regular expression doesn't capture the content in $match variable, and the $replac value isn't output. What am I doing wrong?
Change:
if(preg_match('/\b(mm<sup>3</sup>)\b/',$match))
to:
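if(preg_match('#\bmm<sup>3</sup>#',$match))
and similarly in the preg_replace call.
Since your regular expression contains /, you need to either escape it or use a different delimiter around the regular expression.
The trailing \b has to go as well: a word boundary needs a word character on one side, and here > is followed by the end of the string, so \b can never match at that position.
There's also no need for the parentheses, since you're not doing anything with the groups.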
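You need to either escape that / in your regexp (preg_quote on the literal text, with / passed as the delimiter argument, does this) or use a different delimiter (# is a common choice).
Also, the \b after the > should be removed: there is no word boundary between > and the end of the string, so with it the pattern cannot match this input. The parentheses are unnecessary too, since you don't seem to be doing any capture; you're basically doing a more expensive str_replace.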
Finally, you can do everything in one move. If there's no match, nothing will happen.
<?php
$match = 'mm<sup>3</sup>';
$replacement='ML';
$replac = preg_replace('#\\bmm<sup>3</sup>#',
$replacement,
$match);
echo $replac;
?>
If you want to be picky, I guess you should also replace with 'ml', not 'ML' :-)
(for replacement of multiple strings, preg_replace supports arrays).
Note: unless you're sure that this is exactly the HTML you want replaced, maybe you ought to try
$pattern = '#\bmm\s*<sup>\s*3\s*</sup>#';
in order to catch "mm <sup>3</sup>" and similar, in addition to "mm<sup>3</sup>" (in some circumstances they may look alike, and some editors might use or automatically "correct" either form into the other).
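A quick way to check the word-boundary behaviour and the whitespace-tolerant variant is a sketch like the following. The sample strings are invented for illustration:

```php
<?php
// Invented sample; the unit sits at the very end of the string.
$subject = 'Volume: 5 mm<sup>3</sup>';

// \b after ">" needs a word character next to it; at the end of the string
// there is none, so this pattern does not match.
$trailingBoundary = preg_match('#\bmm<sup>3</sup>\b#', $subject);
// Dropping the trailing \b lets the same text match.
$noTrailingBoundary = preg_match('#\bmm<sup>3</sup>#', $subject);

// Whitespace-tolerant variant: optional spaces around and inside the tag.
$loose = '#\bmm\s*<sup>\s*3\s*</sup>#';
$converted = preg_replace($loose, 'mL', 'a 5 mm <sup> 3 </sup> sample');

echo $trailingBoundary, "\n";   // 0
echo $noTrailingBoundary, "\n"; // 1
echo $converted, "\n";          // a 5 mL sample
```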

Make me understand preg_replace

I've been looking all over the internet for some useful information and I think I found too much. I'm trying to understand regular expressions but don't get it.
Let's say, for instance, $data="A bunch of text [link=123] another bunch of text.", and the tag should get replaced with "<a href=\"123.html\">123</a>".
I've been trying around a lot with code similar to this:
$find = "/[link=[([0-9])]/";
$replace = "<a href=\"$1\">$1</a>";
echo preg_replace ($find, $replace, $data);
but the output is always the same as the original $data.
I think I need to see something relevant to my problem to understand the basics.
Remove the extra [] around the (), and add + after the [0-9] to quantify it. Also, escape the [] that make up the tag itself.
$find = "/\[link=(\d+)\]/"; // "\d" is equivalent to "[0-9]"
$replace = '<a href="$1.html">$1</a>';
echo preg_replace($find,$replace,$data);
The regex would be \[link=([\d]+)\]
A good source for a quick overview of regular expressions is http://www.regular-expressions.info/
If you are really interested in the power of regular expressions, you should buy this book: Mastering Regular Expressions
A good program for testing your regex on a Windows client is RegEx-Trainer
You are missing the + quantifier, so your pattern only matches if a single digit follows link=.
There is also an extra pair of [..]; as a result, the outer [...] is treated as a character class.
You also forgot to escape the opening [ and the closing ] of the tag itself.
Solution:
$find = "/\[link=([0-9]+)\]/";
<?php
$data= "A bunch of text [link=123] another bunch of text.";
$find = '/\[link=([0-9]+?)\]/';
echo preg_replace($find, '<a href="$1.html">$1</a>', $data);
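Putting the pieces from the answers together, a complete run could look like the sketch below. The second tag in $data is added only to show that every occurrence is replaced, and the optional count argument of preg_replace is used to report how many replacements were made:

```php
<?php
// The question's sample text, plus a second tag added for illustration.
$data = 'A bunch of text [link=123] another bunch of text [link=45].';

// \[ and \] are literal brackets; (\d+) captures one or more digits as group 1.
$find = '/\[link=(\d+)\]/';

// $1 in the replacement refers to group 1. Single quotes keep PHP from
// treating anything in the string as a variable.
$replace = '<a href="$1.html">$1</a>';

// -1 means "no limit"; $count receives the number of replacements made.
$out = preg_replace($find, $replace, $data, -1, $count);

echo $out, "\n";
echo $count, "\n"; // 2
```

If a group reference is ever followed directly by a digit, writing it as ${1} avoids ambiguity.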

preg_match a key and date from string

I am working on a project that involves a type of caching. Multiple caches can exist for different situations, distinguished by cache name. In the files I am storing a cache like so:
{cache:2011-12-11 02:01:47}
And when I search for it, I am trying to preg_match it like this:
$match = "{cache:/\d{4}\-\d{2}\-\d{2} \d{2}:\d{2}:\d{2}/}";
$str = 'FIND ME! {cache:2011-12-11 02:01:47}';
if (preg_match($match, $str, $matches)) {
print "it's a match";
print_r($match);
}
The problem is, it never finds it. But this will work if I do:
$match = "/\d{4}\-\d{2}\-\d{2} \d{2}:\d{2}:\d{2}/";
What am I doing wrong with my preg_match statement? And is there some type of string search I could use that is faster than preg_match?
Your in-code regex will not work, because you copy&pasted the delimiters where they don't belong:
$match = "{cache:/\d{4}\-\d{2}\-\d{2} \d{2}:\d{2}:\d{2}/}";
                 ^                                     ^
                 Eeek!                                 Eeek!
This way your { and } became the regex delimiters, and the inner slashes were interpreted as literal characters to search for.
It rather should have been:
$match = "/\{cache:\d{4}\-\d{2}\-\d{2} \d{2}:\d{2}:\d{2}}/";
Note also the escaped leading \{ curly.
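On the speed part of the question: a literal search such as strpos is cheaper than a regex when you only need to know whether the marker is present, and the regex can then be reserved for pulling out the timestamp. A minimal sketch under that assumption (the capture group around the date is an addition, not part of the pattern above):

```php
<?php
$str = 'FIND ME! {cache:2011-12-11 02:01:47}';

// Cheap literal check first; strpos returns false when the marker is absent.
$hasMarker = strpos($str, '{cache:') !== false;

// Both braces are escaped for clarity; the parentheses capture the timestamp.
$pattern = '/\{cache:(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\}/';

$timestamp = null;
if ($hasMarker && preg_match($pattern, $str, $matches)) {
    $timestamp = $matches[1]; // group 1 holds only the date and time
}

echo $timestamp, "\n"; // 2011-12-11 02:01:47
```

Whether the extra strpos call pays off depends on how often the marker is missing; for short strings the difference is unlikely to be noticeable.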

Regular expression anchor text for a link

I am trying to pull the anchor text from a link that is formatted this way:
<h3><b>File</b> : <a href="/en/browse/file/variable_text">i_want_this</a></h3>
I want only the anchor text for the link : "i_want_this"
"variable_text" varies according to the filename so I need to ignore that.
I am using this regex:
<a href=\"\/en\/browse\/file\/variable_text\">(.*?)<\/a>
This is matching of course the complete link.
PHP's preg_* functions use PCRE (Perl Compatible Regular Expressions), which is very close to Perl's own regex syntax. If you want to learn a lot about regex, visit perlretut.org. Also, look into regex tools like Expresso.
For your use, know that regex quantifiers are greedy by default. That means that when you ask for something, followed by anything (any number of repetitions), followed by something else, the "anything" part grabs as much as it can, so the match runs to the last occurrence of that second something rather than the first.
to be more clear, what you want is this:
<a href="
any character, any number of times (regex = .* )
">
any character, any number of times (regex = .* )
</a>
Beyond that, you want to capture the second group of "any character, any number of times". You can do that using what are called capture groups (anything inside parentheses is captured as a group for later reference; references to such groups are called back references).
I would also look into named subpatterns. With those, you can reference your choice with a human-readable string rather than an array index. The syntax for those in PHP is (?P<name>pattern), where name is the name you want and pattern is the actual regex. I'll use that below.
So all that being said, here's the "lazy web" for your regex:
<?php
$str = '<h3><b>File</b> : <a href="/en/browse/file/variable_text">i_want_this</a></h3>';
$regex = '/(<a href\=".*">)(?P<target>.*)(<\/a>)/';
preg_match($regex, $str, $matches);
print $matches['target'];
?>
//This should output "i_want_this"
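The greedy behaviour described above only becomes visible once there is more than one link in the input. Here is a small sketch with a made-up two-link string that compares the greedy pattern with a lazy one (.*? instead of .*):

```php
<?php
// Made-up input with two links, to make the difference visible.
$html = '<a href="/a">first</a> and <a href="/b">second</a>';

// Greedy: the first .* runs to the LAST "> in the string, so the named
// group ends up holding the text of the second link.
preg_match('/<a href=".*">(?P<target>.*)<\/a>/', $html, $greedy);

// Lazy: .*? stops at the FIRST possible "> and </a>, so the named group
// holds the text of the first link.
preg_match('/<a href=".*?">(?P<target>.*?)<\/a>/', $html, $lazy);

echo $greedy['target'], "\n"; // second
echo $lazy['target'], "\n";   // first
```

With a single link, as in the question, both patterns give the same result.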
Oh, and one final thought. Depending on what exactly you are doing, you may want to look into SimpleXML instead of using regex for this. That would probably require that the tags we see are just snippets of a larger whole, as SimpleXML requires well-formed XML (or XHTML).
I'm sure someone will have a more elegant solution, but I think this will do what you want done.
Where:
$subject = "<h3><b>File</b> : <a href=\"/en/browse/file/variable_text\">i_want_this</a></h3>";
Option 1:
$pattern1 = '/(<a href=")(.*)(">)(.*)(<\/a>)/i';
preg_match($pattern1, $subject, $matches1);
print($matches1[4]);
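Option 2 (ereg is POSIX-based and was removed in PHP 7.0, so this only runs on older versions):
$pattern2 = '(<a href=")(.*)(">)(.*)(</a>)';
ereg($pattern2, $subject, $matches2);
print($matches2[4]);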
Do not use regex to parse HTML. Use a DOM parser. Specify the language you're using, too.
Since it's in a captured group and since you claim it's matching, you should be able to reference it through $1 or \1 depending on the language.
$blah = preg_match( $pattern, $subject, $matches );
print_r($matches);
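For the DOM parser route mentioned above, a minimal PHP sketch could look like this. It assumes the dom extension is available and that the link's href always starts with /en/browse/file/, as in the question:

```php
<?php
$subject = '<h3><b>File</b> : <a href="/en/browse/file/variable_text">i_want_this</a></h3>';

// Collect parser warnings internally instead of printing them; real pages
// often contain markup that libxml complains about.
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($subject);
libxml_clear_errors();

// Select every <a> whose href begins with the fixed part of the path.
$xpath = new DOMXPath($dom);
$links = $xpath->query('//a[starts-with(@href, "/en/browse/file/")]');

// textContent is the anchor text without any markup.
$anchorText = $links->length > 0 ? trim($links->item(0)->textContent) : null;

echo $anchorText, "\n"; // i_want_this
```

The variable part of the URL never has to be spelled out, which is the part the regex approaches struggle with.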
The thing to remember is that a regex returns everything you searched for if it matches. You need to specify that you only care about the part you've surrounded in parentheses (the anchor text). I'm not sure what language you're using the regex in, but here's an example in Ruby:
string = '<a href="/en/browse/file/variable_text">i_want_this</a>'
data = string.match(/<a href=\"\/en\/browse\/file\/variable_text\">(.*?)<\/a>/)
puts data # => outputs the whole link, '<a href="/en/browse/file/variable_text">i_want_this</a>'
If you specify what you want in parentheses, you can reference it:
string = '<a href="/en/browse/file/variable_text">i_want_this</a>'
data = string.match(/<a href=\"\/en\/browse\/file\/variable_text\">(.*?)<\/a>/)[1]
puts data # => outputs 'i_want_this'
Perl will have you use $1 instead of [1] like this:
$string = '<a href="/en/browse/file/variable_text">i_want_this</a>';
$string =~ m/<a href=\"\/en\/browse\/file\/variable_text\">(.*?)<\/a>/;
$data = $1;
print $data . "\n";
Hope that helps.
I'm not 100% sure I understand what you want. This will match the content between the anchor tags. The URL must start with /en/browse/file/, but may end with anything.
#<a href="/en/browse/file/[^"]*">(.*?)</a>#
I used # as a delimiter as it made it clearer. It'll also help if you put the pattern in single quotes instead of double quotes, so you don't have to escape anything at all.
If you want to limit the last part to numbers instead, you can use:
#<a href="/en/browse/file/\d+">(.*?)</a>#
If it should have just 5 numbers:
#<a href="/en/browse/file/\d{5}">(.*?)</a>#
If it should have between 3 and 6 numbers:
#<a href="/en/browse/file/\d{3,6}">(.*?)</a>#
If it should have more than 2 numbers:
#<a href="/en/browse/file/\d{3,}">(.*?)</a>#
This should work:
<a href="[^"]*">([^<]*)
This says: take everything you find until you meet "
[^"]*
Same again: take everything until you meet <
[^<]*
The parentheses around [^<]*
([^<]*)
group it, so you can collect that data in PHP. If you look at preg_match in the PHP manual, you will see many fine examples there.
Good luck!
And for your concrete example:
<a href="/en/browse/file/variable_text">([^<]*)
I use
[^<]*
because in some cases
.*?
can be extremely slow. You shouldn't use it if you can use
[^<]*
You should try the tool Expresso for creating regular expressions. Pretty handy.
http://www.ultrapico.com/Expresso.htm
