How to optimize this regex - php

Can someone help me optimize my regex pattern so that I don't have to run each of the regexes below? It should match all of the strings in the example I provided.
$pattern = "/__\(\"(.*)\"/";
preg_match_all($pattern, $content, $matches, PREG_SET_ORDER);
$pattern = "/__\(\"(.*)\",/";
preg_match_all($pattern, $content, $matches, PREG_SET_ORDER);
$pattern = "/__\(\'(.*)\'/";
preg_match_all($pattern, $content, $matches, PREG_SET_ORDER);
$pattern = "/__\(\'(.*)\',/";
preg_match_all($pattern, $content, $matches, PREG_SET_ORDER);
$pattern = "/_e\(\"(.*)\"/";
preg_match_all($pattern, $content, $matches, PREG_SET_ORDER);
$pattern = "/_e\(\"(.*)\",/";
preg_match_all($pattern, $content, $matches, PREG_SET_ORDER);
$pattern = "/_e\(\'(.*)\'/";
preg_match_all($pattern, $content, $matches, PREG_SET_ORDER);
$pattern = "/_e\(\'(.*)\',/";
preg_match_all($pattern, $content, $matches, PREG_SET_ORDER);
Example:
_e('string');
_e("string");
_e('string', 'string2');
_e("string", 'string2');
__('string');
__("string");
__('string', 'string2');
__("string", 'string2');
Also, if possible, I would like to match the strings below as well.
"string"|trans
'string'|trans
"string"|trans({}, "string2")
'string'|trans({}, 'string2')
'string'|trans({}, "string2")
"string"|trans({}, 'string2')
If possible, I would also like to get the value string2. In the worst case, single and double quotes are mixed within the same file.
As you can see in my preg_match_all code, I currently use 8 patterns for the first set, and another 8 patterns for the second set, just to get the first string.
Note:
I only run this script from a console command, not in a PHP application, so performance does not matter.
Thank you for your help!
Edited
Thank you for the responses. I tried both of your regexes, and they are almost there. My question might be confusing, as I am not a native English speaker. I copied my examples to regex101, which might make it easier to understand what I am trying to achieve.
https://regex101.com/r/uX5nqR/2
and this one too
https://regex101.com/r/Fxs7yY/1
Please check these out. I am trying to extract translations from a WordPress project and also from Twig files that use the "trans" filter. I know there are .mo/.po editors, but the editor doesn't recognize the file extension I use.

I took the liberty of writing this in JavaScript, but the regex will work the same.
My complete code looks like this:
const r = /^_[e_]\((\"(.*)\"|\'(.*)\')(, (\"(.*)\"|\'(.*)\'))?\);$/;
const xs = [
"_e('string');",
"_e(\"string\");",
"_e('string', 'string2');",
"_e(\"string\", 'string2');",
"__('string');",
"__(\"string\");",
"__('string', 'string2');",
"__(\"string\", 'string2');",
];
xs.forEach((x) => {
const matches = x.match(r);
if(matches){
console.log('matches are:\n ', matches.filter(m => m !== undefined).join('\n '));
}else{
console.log('no matches for', x);
}
});
Now let me explain how the regex works and how I arrived at it:
First I noticed that all your strings start with _ and end with );,
so I knew the regex had to look something like ^…\);$.
Here ^ and $ mark the beginning and end of the string, and you should leave them out if they're not required.
After the initial _ you've got either another _ or an e, so we put these into a character class followed by the opening parenthesis: [e_]\(.
Now we have a string that is either in " or in ', and we put it down as alternatives: (\"(.*)\"|\'(.*)\').
This string is repeated, but optionally, with a leading , in front.
So we get (, …)? for the optional part, and (\"(.*)\"|\'(.*)\') for the whole second portion.
For the second portion of your problem you can use the same strategy:
"string"|trans
'string'|trans
"string"|trans({}, "string2")
'string'|trans({}, 'string2')
'string'|trans({}, "string2")
"string"|trans({}, 'string2')
Start building up your regex from the similarities. We've got the same string pattern as before used twice, and the optional second part now looks like (\(\{\}, (\"(.*)\"|\'(.*)\')\))?.
This way we can end up with a regex like this:
^(\"(.*)\"|\'(.*)\')\|trans\(\{\}, (\"(.*)\"|\'(.*)\')\))?$
Please note that this regex is not tested, but just a guess from my side.
Upon further discussion it became apparent that we're looking at several matches in a larger bunch of text. To adapt to this we need to exclude the ' and " characters from the innermost groups, which leaves us with these regexes:
_[e_]\(("([^"]*)"|\'([^']*)\')(, ("([^"]*)"|\'([^']*)\'))?\);
("([^"]*)"|\'([^']*)\')\|trans(\(\{\}, ("([^"]*)"|\'([^']*)\')\))?
I've also noted that my second regex apparently had an unmatched parenthesis in it.
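Since the question is about PHP, here is a rough sketch of how the two "larger text" patterns could be used with preg_match_all. The sample $content and the helper names pickGroup and extractPairs are made up for illustration, and the sketch assumes the quote characters are excluded from the inner groups of both patterns, as described above. It also assumes the strings contain no escaped quotes.

```php
<?php
// Hypothetical sample text; in practice $content would be the file contents.
$content = <<<'TXT'
_e('Hello');
echo __("Save", 'my-domain');
{{ "Title"|trans }}
{{ 'Body'|trans({}, "messages") }}
TXT;

// Both patterns use [^"]* / [^']* in every inner group.
// Nowdoc syntax avoids escaping the quotes for PHP.
$callPattern = <<<'RE'
/_[e_]\(("([^"]*)"|'([^']*)')(, ("([^"]*)"|'([^']*)'))?\);/
RE;
$transPattern = <<<'RE'
/("([^"]*)"|'([^']*)')\|trans(\(\{\}, ("([^"]*)"|'([^']*)')\))?/
RE;

// Return whichever of two alternative groups matched, or null if neither did.
function pickGroup(array $m, int $a, int $b): ?string
{
    foreach ([$a, $b] as $i) {
        if (($m[$i] ?? '') !== '') {
            return $m[$i];
        }
    }
    return null;
}

// Collect [string, string2-or-null] pairs for one pattern.
function extractPairs(string $pattern, string $content): array
{
    preg_match_all($pattern, $content, $matches, PREG_SET_ORDER);
    $pairs = [];
    foreach ($matches as $m) {
        // Groups 2/3 hold the first string, groups 6/7 the optional second one.
        $pairs[] = [pickGroup($m, 2, 3), pickGroup($m, 6, 7)];
    }
    return $pairs;
}

$calls = extractPairs($callPattern, $content);
$trans = extractPairs($transPattern, $content);
print_r($calls);
print_r($trans);
```

Because each quoted string is written as two alternatives, the text lands in a different group depending on the quote style, which is why the sketch checks both groups. Note that an empty string such as _e('') would be reported as null here.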

I tried to understand the purpose of these regexes - here's what I think. (Let me omit the slashes on both sides, also the string quotes belonging to the language instead of the regex itself.)
(__|_e)\(\"(.*)\"
(__|_e)\(\'(.*)\'
This way you get all the hits of your 8 regexes above; but that's probably not what you were trying to achieve.
As far as I understand, you want to list the I18N refs in your code, with one or more arguments between the brackets. I think the best way to do it is run a preg_match_all with the simplest form of the pattern:
(__|_e)\(.*\)
or maybe this one is better:
(__|_e)\(([^\)]+)\) // works for multiple calls in one line, ignores empties
...and then iterate over the results one by one (using PREG_SET_ORDER) and split the arguments by comma:
foreach ($matches as $m) {
    $args = explode(",", $m[2]); // [2] = second subpattern, the argument list
    // now $args holds the arguments of this function call
}
If this answer is not helping, let's refine the question :)
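A minimal sketch of this match-then-split idea, assuming a second capture group is placed around the argument list and PREG_SET_ORDER is used. The sample $content is invented, and the quote trimming is an addition that is not part of the answer.

```php
<?php
// Hypothetical one-line sample with two calls on the same line.
$content = "echo __('Save', 'my-domain') . _e(\"Cancel\");";

preg_match_all('/(__|_e)\(([^\)]+)\)/', $content, $matches, PREG_SET_ORDER);

$calls = [];
foreach ($matches as $m) {
    // $m[1] is the function name, $m[2] the raw argument list.
    $args = array_map(
        function ($arg) {
            // Strip surrounding whitespace and quotes from each argument.
            return trim($arg, " \t'\"");
        },
        explode(',', $m[2])
    );
    $calls[] = ['function' => $m[1], 'args' => $args];
}
print_r($calls);
```

Splitting on commas is only safe while the translated strings themselves contain no commas or closing parentheses; text like __('Hello, world') would be split in the wrong place.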

Related

Regex expression for repetitive groups

This one is for the regex experts. I am trying to write a regex for key-value pairs in cookies, where each key is followed by a = and each value ends with a ;.
So, basically, key=value; should pass. The pair may be repeated, in which case it should pass too: key1=value1; key2=value2; should pass.
However, anything else should fail. For example, key=value1;key=value2;; should fail because it has two ; at the end. Strings like key==valu1;;, =value;, key=; and key=value should also fail.
So far, I have been learning about grouping in regex and came up with this: (?<pat>([a-zA-Z0-9 ]*?=[a-zA-Z0-9\- :]+;)). But this is not working. Can anyone help me?
Maybe,
^(?:\b[a-z0-9]+=[a-z0-9]+\b;\s*)*$
or some similar expression might work OK.
Demo
Test
$re = '/^(?:\b[a-z0-9]+=[a-z0-9]+\b;\s*)*$/s';
$str = 'key1=value1; key2=value2;';
preg_match_all($re, $str, $matches, PREG_SET_ORDER, 0);
var_dump($matches);
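To check the suggested expression against every example from the question, a small test loop like the following might help. The $cases list is taken from the question plus an empty string; the $strict variant with + instead of * is an assumption about what is wanted for empty input, not part of the answer.

```php
<?php
$re = '/^(?:\b[a-z0-9]+=[a-z0-9]+\b;\s*)*$/';
// Assumed variant: + instead of * so that at least one pair is required.
$strict = '/^(?:\b[a-z0-9]+=[a-z0-9]+\b;\s*)+$/';

$cases = [
    'key=value;',
    'key1=value1; key2=value2;',
    'key=value1;key=value2;;',
    'key==valu1;;',
    '=value;',
    'key=;',
    'key=value',
    '',
];

$results = [];
$strictResults = [];
foreach ($cases as $case) {
    $results[$case] = preg_match($re, $case) === 1;
    $strictResults[$case] = preg_match($strict, $case) === 1;
}
var_dump($results);
var_dump($strictResults);
```

With the * quantifier the empty string passes as well, because zero repetitions are allowed; the + variant rejects it while treating the other cases the same way.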

PHP preg_match_all not matching properly

I am trying to get some data off of a website source code. What I am trying to do is get everything after /collections/(whatever that follows here). My pattern matched "most" of what I am looking for. The problem occurs when my preg_match_all gets to a pattern with the "&", at which point it will simply read to the point of "&" and stop reading the remainder. Here is my script:
$homepage = file_get_contents('http://www.harrisfarm.com.au/');
$pattern = '/collections([\w-&\/]*)/i';
preg_match_all($pattern, $processedHomePage, $collections);
print_r($collections);
Notice that when printing like this, things after the "&" are ignored, meaning it will get me this:
/collections/seafood/Shellfish-&
But when I am pattern matching on one string such as below:
$subject = 'a href="/collections/organic/Pantry/sickmonster/grandma" <a href="/collections/seafood/Shellfish-&-Crustaceans">Oysters, Shellfish & Crustaceans';
it gets me everything I want:
/collections/seafood/Shellfish-&-Crustaceans
So I wonder... why is this happening? I am really stumped here.
There is no problem with the provided code when you use $homepage instead of $processedHomePage in preg_match_all.
BTW:
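You should escape the minus sign inside square brackets (or put it at the beginning or end of the bracket expression). Surprisingly, it makes no difference in your case, although newer PCRE2-based PHP versions (7.3 and later) may reject a hyphen placed between \w and another character as an invalid range:
$pattern = '/collections([-\w&\/]*)/i';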
See http://php.net/manual/regexp.reference.meta.php for further information.
try this:
$re = "/\\/collections([\\w\\-\\&\\/;]*)/mi";
$str = "<a href=\"/collections/seafood/Shellfish-&-Crustaceans\">Oysters, Shellfish & Crustaceans';\n<a href=\"/collections/seafood/Shellfish-&-Crustaceans\">Oysters,collections Shellfish & Crustaceans';";
preg_match_all($re, $str, $matches);
live demo
Your updated code:
$homepage = file_get_contents('http://www.harrisfarm.com.au/');
$pattern = "/\\/collections([\\w\\-\\&\\/;]*)/mi";
preg_match_all($pattern, $homepage, $collections);
print_r($collections);
I figured out what the problem is - maybe this will help others later.
I had used htmlspecialchars() on the content of http://www.harrisfarm.com.au/ before reading it in as a string. This converted special characters such as & into longer entity sequences.
The conversion turns & into &amp;, which contains a ;, and that's not in my regular expression. Since ; is not part of the character class, the regex stopped matching at that point.
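The effect can be reproduced on a single link. This is only a sketch: the $html sample is made up, and the pattern is the variant with the hyphen moved to the front of the character class.

```php
<?php
// Hypothetical one-link sample instead of the downloaded page.
$html = '<a href="/collections/seafood/Shellfish-&-Crustaceans">Oysters</a>';
$pattern = '/collections([-\w&\/]*)/i';

// Matching the raw HTML captures the whole path.
preg_match($pattern, $html, $raw);

// After htmlspecialchars(), & becomes &amp; and the match stops at the ";".
$escaped = htmlspecialchars($html);
preg_match($pattern, $escaped, $cut);

// Decoding the entities again (or never encoding them) restores the full match.
preg_match($pattern, htmlspecialchars_decode($escaped), $fixed);

echo $raw[1], "\n";   // /seafood/Shellfish-&-Crustaceans
echo $cut[1], "\n";   // /seafood/Shellfish-&amp
echo $fixed[1], "\n"; // /seafood/Shellfish-&-Crustaceans
```

Adding ; to the character class instead would also let the match run through entities such as &quot; that follow the path in escaped text, so matching against the unescaped string is probably the safer choice.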

I need to find a way to explode a specific string that has quotes in it

I'm having serious trouble with this and I'm not really experienced enough to understand how I should go about it.
To start off I have a very long string known as $VC. Each time it's slightly different but will always have some things that are the same.
$VC is an htmlspecialchars() string that looks something like
Example Link... Lots of other stuff in between here... 80] ,[] ,"","3245697351286309258",[] ,["812750926... and it goes on ...80] ,[] ,"","6057413202557366578",[] ,["103279554... and it continues on
In this case the <a> tag is always the same so I take my information from there. The numbers listed after it such as ,"3245697351286309258",[] and ,"6057413202557366578",[] will also always be in the same format, just different numbers and one of those numbers will always be a specific ID.
I then find that specific ID I want, I will always want that number inside pid%3D and %26oid.
$pid = explode("pid%3D", $VC, 2);
$pid = explode("%26oid", $pid[1], 2);
$pid = $pid[0];
In this case that number is 6057413202557366578. Next I want to explode $VC in a way that lets me put everything after ,"6057413202557366578",[] into a variable as its own string.
This is where things start to break down. What I want to do is the following
$vinfo = explode(',"'.$pid.'",[]',$VC,2);
$vinfo = $vinfo[1]; //Everything after the value I used to explode it.
Now naturally I did look around and try other things such as preg_split and preg_replace but I've got to admit, it is beyond me and as far as I can tell, those don't let you put your own variable in the middle of them (e.g. ',"'.$pid.'",[]').
If I'm understanding the whole regular expression idea, there might be other problems in that if I look for it without the $pid variable (e.g. just the surrounding characters), it will pick up the similar parts of the string before it gets to the one I want, (e.g. the ,"3245697351286309258",[]).
I hope I've explained this well enough, the main question though is - How can I get the information after that specific part of the string (',"'.$pid.'",[]') into a variable?
I hope this does what you want:
pid%3D(?P<id>\d+).*?"(?P=id)",\[\](?P<vinfo>.*?)}\);<\/script>
It captures the number after pid%3D in group id, and everything after "id",[] (until the next occurrence of });</script>) in group vinfo.
Here's a demo with shortened text.
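In PHP the pattern could be applied roughly as follows. The $VC value is an invented, shortened stand-in with the same shape as the text in the question, and the s modifier is an assumption in case the real text spans several lines.

```php
<?php
// Hypothetical shortened input: the pid appears once in the link and once in the data.
$VC = 'x?pid%3D6057413202557366578%26oid%3D1 '
    . '80] ,[] ,"","3245697351286309258",[] ,["812750926 other"] '
    . '80] ,[] ,"","6057413202557366578",[] ,["103279554 wanted"]'
    . '});</script>';

// (?P=id) is a backreference: it must repeat exactly what group "id" captured.
$pattern = '/pid%3D(?P<id>\d+).*?"(?P=id)",\[\](?P<vinfo>.*?)}\);<\/script>/s';

if (preg_match($pattern, $VC, $m)) {
    echo $m['id'], "\n";
    echo $m['vinfo'], "\n";
}
```

The backreference is what skips the similar-looking ,"3245697351286309258",[] entry: only the quoted number equal to the captured pid can satisfy that part of the pattern.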
The problem of capturing more than you want is fixed using capture groups. You wrap part of a regular expression in parentheses to capture it.
You can use preg_match_all for more robust regular expression capture. It fills an array whose first element holds the strings that matched the entire pattern, followed by one element per capture group holding the partial matches. We'll start by matching the parts of the string you want. There are no capture groups at this point:
$text = 'Example Link... Lots of other stuff in between here... 80] ,[] ,"","3245697351286309258",[] ,["812750926... and it goes on ...80] ,[] ,"","6057413202557366578",[] ,["103279554... and it continues on"';
$pattern = '/,"\\d+",\\[\\]/';
preg_match_all($pattern,
$text,
$out, PREG_PATTERN_ORDER);
echo $out[0][0]; //echo ,"3245697351286309258",[]
Now to get just the pids into a variable, you can add a capture group to your pattern. A capture group is created by adding parentheses:
$text = ...
$pattern = '/,"(\\d+)",\\[\\]/'; // the \d+ match will be captured
preg_match_all($pattern,
$text,
$out, PREG_PATTERN_ORDER);
$pids = $out[1];
echo $pids[0]; // echo 3245697351286309258
Notice the first (and only in this case) capture group is in $out[1] (which is an array). What we have captured is all the digits.
To capture everything else, assuming everything is between square brackets, you could match more and capture it. To address the question, we'll use two capture groups. The first will capture the digits and the second will capture everything matching square brackets and everything in between:
$text = ...;
$pattern = '/,"(\\d+)",\\[\\] ,(\\[.+?\\])/';
preg_match_all($pattern,
$text,
$out, PREG_PATTERN_ORDER);
$pids = $out[1];
$contents = $out[2];
echo $pids[0] . "=" . $contents[0] ."\n";
echo $pids[1] . "=". $contents[1];
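The question also asks whether a variable such as $pid can be placed in the middle of the search string or pattern. A minimal sketch, assuming a shortened made-up $VC: explode works with a concatenated needle, and preg_quote makes the same needle safe inside a regex. The last part is a guess at why the original explode may have failed, since the question says $VC went through htmlspecialchars().

```php
<?php
// Hypothetical shortened input in the same shape as the question's $VC.
$VC = '80] ,[] ,"","3245697351286309258",[] ,["812750926 other"] '
    . '80] ,[] ,"","6057413202557366578",[] ,["103279554 wanted"]';
$pid = '6057413202557366578';
$needle = ',"' . $pid . '",[]';

// Plain string split: everything after the needle.
$parts = explode($needle, $VC, 2);
$vinfo = $parts[1] ?? null;

// Same split as a regex; preg_quote() escapes [ and ] in the needle.
$regexParts = preg_split('/' . preg_quote($needle, '/') . '/', $VC, 2);
$vinfoRegex = $regexParts[1] ?? null;

// If the text was escaped with htmlspecialchars(), its quotes are stored as
// &quot;, so the needle has to be escaped the same way before splitting.
$escapedParts = explode(htmlspecialchars($needle), htmlspecialchars($VC), 2);
$vinfoEscaped = isset($escapedParts[1])
    ? htmlspecialchars_decode($escapedParts[1])
    : null;

var_dump($vinfo, $vinfoRegex, $vinfoEscaped);
```

All three variables end up with the same text here, which suggests the choice between explode and a regex matters less than making sure the needle and the haystack are in the same escaped or unescaped form.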

to fetch case insensitive word from a regular expression

Suppose I have a string like
$res = "there are many restaurants in the city. Restaurants like xyz,abc. one restaurant like.....";
In the above example, restaurant appears in 3 places. I need the count to be 3.
$pattern = '/Restaurant/';
preg_match($pattern, substr($res,10), $matches, PREG_OFFSET_CAPTURE);
print_r($matches);
One more problem
This is related to the above question: I have text like Food & Drinks. I need to match it against food or drinks or seafood, etc. Can anyone please help me with this? Thanks in advance.
You can use a regex like this:
$pattern = '/restaurants?/i';
There are two changes that I made to your original regex:
Adding the i modifier - this is the case insensitive flag.
Adding s? to the end of the search string. This makes the last s character optional. It matches zero or one occurrences of s.
Note that because we are using the case insensitive flag, this regex will also match things like :
ResTaurants
rEstaurantS
RESTauRANTS
The i modifier is used for case-insensitive matching. The ? quantifier makes the preceding token optional, in this case matching the preceding s either zero or one time.
You are using preg_match() but want all matches, so you need preg_match_all():
$pattern = '/restaurants?/i';
preg_match_all($pattern, substr($res,10), $matches, PREG_OFFSET_CAPTURE);
print_r($matches[0]);
See working demo
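To get the count itself, the return value of preg_match_all can be used directly, as in the sketch below. The second half is one possible reading of the follow-up question about Food & Drinks: the candidate words and the "contains any word of the category" rule are assumptions for illustration.

```php
<?php
$res = "there are many restaurants in the city. Restaurants like xyz,abc. one restaurant like.....";

// preg_match_all() returns the number of full-pattern matches.
$count = preg_match_all('/restaurants?/i', $res, $matches);
echo $count, "\n";
print_r($matches[0]);

// Assumed reading of the follow-up: a candidate word matches if it contains
// any word of the category name.
$category = 'Food & Drinks';
$words = preg_split('/\W+/', strtolower($category), -1, PREG_SPLIT_NO_EMPTY);
$quoted = array_map(function ($w) {
    return preg_quote($w, '/');
}, $words);
$categoryPattern = '/' . implode('|', $quoted) . '/i';

$hits = [];
foreach (['seafood', 'Drinks', 'hotel'] as $candidate) {
    $hits[$candidate] = preg_match($categoryPattern, $candidate) === 1;
}
print_r($hits);
```

Here the full string is searched rather than substr($res, 10); the offset from the question does not change the count for this input, since all three occurrences come after the first ten characters.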
I suggest looking at a regex guide - this is a very simple request.
| in regex means or and ? means 0 or 1 of previous char or group, so the following pattern should work for your specification:
$pattern = '/[Rr]estaurants?/';
As a solution to your problem please try executing following code snippet
$url = "http://www.examplesite.com/";
$curl = new Curl();
$res = $curl->get($url);
$pattern = '/Restaurant(s)*/i';
preg_match_all($pattern, substr($res,10), $matches, PREG_OFFSET_CAPTURE);
print_r($matches);

Preg_match_all and complex regexp for text match

For example, I have a data response from here:
http://www.facebook.com/ajax/shares/view/?target_fbid=410558838979218&__a=1
there is a pattern that looks like this:
data-hovercard=\"\/ajax\/hovercard\/hovercard.php?id=655581307\">
How can I parse it with preg_match_all() in PHP?
I know I need a complex regular expression, but I don't have a clue how to write one for such a pattern in the text.
Thanks for the help.
UPD:
the following code does give the id:
$str = 'hovercard.php?id=655581307';
preg_match_all('/[0-9]{9}/', $str , $matches);
print_r($matches);
BUT
this one doesn't:
$url = 'http://www.facebook.com/ajax/shares/view/?target_fbid=410558838979218&__a=1';
$html = file_get_contents($url);
preg_match_all('/[0-9]{9}/', $html, $matches);
print_r($matches);
This gets a bit messy due to the backslashes escaping stuff, but to match exactly that string this call to preg_match_all() should work:
preg_match_all('#(data-hovercard=\\\\"\\\\/ajax\\\\/hovercard\\\\/hovercard.php\?id=[0-9]+\\\\">)#', $str, $matches);
That will give you the whole string you posted in $matches. However, if you only want the numbers from id, you can add extra parentheses around that part, like so:
preg_match_all('#(data-hovercard=\\\\"\\\\/ajax\\\\/hovercard\\\\/hovercard.php\?id=([0-9]+)\\\\">)#', $str, $matches);
And the numbers will appear separately in $matches (similarly, you can remove the parentheses that wrap the whole regexp to stop capturing the whole string).
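As a quick check of which index holds what, the pattern with the extra group can be run against a made-up fragment of the response. The $str sample below is hypothetical; inside single quotes the backslashes stay literal, which mirrors the escaped text in the real response.

```php
<?php
// Hypothetical fragment with two hovercard attributes.
$str = 'x data-hovercard=\"\/ajax\/hovercard\/hovercard.php?id=655581307\"> y '
     . 'data-hovercard=\"\/ajax\/hovercard\/hovercard.php?id=12345\">';

// Each \\\\ in the PHP source becomes one literal backslash in the subject.
$pattern = '#(data-hovercard=\\\\"\\\\/ajax\\\\/hovercard\\\\/hovercard.php\?id=([0-9]+)\\\\">)#';

$count = preg_match_all($pattern, $str, $matches);

// $matches[0]: full matches, $matches[1]: outer group, $matches[2]: the ids.
print_r($matches[2]);
```

With the default PREG_PATTERN_ORDER, $matches[2] is a flat list of the captured ids, so no further looping is needed to collect them.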
Update:
And now I see the question is updated. If your new example fails, it's because there is no sequence of 9 digits in the data you get. When I try it myself, I simply get a response that says I need to log in, so maybe your matching issue is in fact due to you not getting the data you expect? Try dumping $html to see if what you are looking for is in fact in there.
