Preg_match_all and complex regexp for text match

For example, I have a data response from here:
http://www.facebook.com/ajax/shares/view/?target_fbid=410558838979218&__a=1
It contains a pattern that looks like this:
data-hovercard=\"\/ajax\/hovercard\/hovercard.php?id=655581307\">
How can I parse it with preg_match_all() in PHP?
I know I need a complex regular expression, but I don't have a clue how to write one for such a pattern in the text.
Thanks for the help.
UPD:
The following code does give the id:
$str = 'hovercard.php?id=655581307';
preg_match_all('/[0-9]{9}/', $str , $matches);
print_r($matches);
But this one doesn't:
$url = 'http://www.facebook.com/ajax/shares/view/?target_fbid=410558838979218&__a=1';
$html = file_get_contents($url);
preg_match_all('/[0-9]{9}/', $html, $matches);
print_r($matches);

This gets a bit messy because of all the escaped backslashes, but to match exactly that string this call to preg_match_all() should work:
preg_match_all('#(data-hovercard=\\\\"\\\\/ajax\\\\/hovercard\\\\/hovercard\.php\?id=[0-9]+\\\\">)#', $str, $matches);
That will give you the whole string you posted in $matches. However, if you only want the number from id, you can add an extra pair of parentheses around it, like so:
preg_match_all('#(data-hovercard=\\\\"\\\\/ajax\\\\/hovercard\\\\/hovercard\.php\?id=([0-9]+)\\\\">)#', $str, $matches);
The numbers will then appear on their own in $matches[2] (similarly, you can remove the parentheses that wrap the whole regexp to stop capturing the whole string a second time; the id then moves to $matches[1]).
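As a minimal sketch of the escaping involved, assuming the fetched text contains the fragment exactly as quoted in the question (backslashes included): the sample string below is copied from the question, and only the id is captured.

```php
<?php
// Sample input copied from the question. In a single-quoted PHP string,
// \" and \/ stay as a literal backslash followed by the character.
$str = 'data-hovercard=\"\/ajax\/hovercard\/hovercard.php?id=655581307\">';

// Four backslashes in single-quoted PHP source become two in the regex,
// and those two match one literal backslash in the subject.
$pattern = '#data-hovercard=\\\\"\\\\/ajax\\\\/hovercard\\\\/hovercard\.php\?id=([0-9]+)\\\\">#';

$count = preg_match_all($pattern, $str, $matches);

// $matches[1] holds the captured ids, one entry per match.
print_r($matches[1]);
```

If the response does not contain the backslashes (for example after JSON decoding), the pattern would need the `\\\\` parts removed.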
Update:
And now I see the question is updated. If your new example fails, it's because there is no sequence of 9 digits in the data you get. When I try it myself I simply get a response that says I need to log in, so maybe your matching issue is in fact due to you not getting the data you expect? Try dumping $html to see if what you are looking for is actually in there.

How to optimize this regex

Can someone help me optimize my regex patterns so I don't have to run each of the regexes below? A single pattern should match all of the strings in the example I provided.
$pattern = "/__\(\"(.*)\"/";
preg_match_all($pattern, $content, $matches, PREG_SET_ORDER);
$pattern = "/__\(\"(.*)\",/";
preg_match_all($pattern, $content, $matches, PREG_SET_ORDER);
$pattern = "/__\(\'(.*)\'/";
preg_match_all($pattern, $content, $matches, PREG_SET_ORDER);
$pattern = "/__\(\'(.*)\',/";
preg_match_all($pattern, $content, $matches, PREG_SET_ORDER);
$pattern = "/_e\(\"(.*)\"/";
preg_match_all($pattern, $content, $matches, PREG_SET_ORDER);
$pattern = "/_e\(\"(.*)\",/";
preg_match_all($pattern, $content, $matches, PREG_SET_ORDER);
$pattern = "/_e\(\'(.*)\'/";
preg_match_all($pattern, $content, $matches, PREG_SET_ORDER);
$pattern = "/_e\(\'(.*)\',/";
preg_match_all($pattern, $content, $matches, PREG_SET_ORDER);
Example:
_e('string');
_e("string");
_e('string', 'string2');
_e("string", 'string2');
__('string');
__("string");
__('string', 'string2');
__("string", 'string2');
Also, if possible, I would like to match the strings below as well.
"string"|trans
'string'|trans
"string"|trans({}, "string2")
'string'|trans({}, 'string2')
'string'|trans({}, "string2")
"string"|trans({}, 'string2')
If possible, I'd like to get the value string2 too. In the worst case, single and double quotes are mixed within the same file.
As you can see in my preg_match_all code, I currently use 8 patterns for the first set and another 8 patterns for the second set just to get the first string.
Note:
I only run this script as a console command, not in a PHP application, so performance doesn't matter.
Thank you for your help!
Edited
Thank you for the responses. I tried both of your regexes, and they are almost there. My question might be confusing, as I am not a native English speaker. I copied my test cases to regex101, which might make it easier to understand what I am trying to achieve:
https://regex101.com/r/uX5nqR/2
and this one too:
https://regex101.com/r/Fxs7yY/1
Please check these out. I am trying to extract translations from a WordPress project and also from Twig files that use the "trans" filter. I know there are .mo/.po editors, but the editor doesn't recognize the file extension I use.

I took the liberty of writing this in JavaScript, but the regex will work the same.
My complete code looks like this:
const r = /^_[e_]\((\"(.*)\"|\'(.*)\')(, (\"(.*)\"|\'(.*)\'))?\);$/;
const xs = [
"_e('string');",
"_e(\"string\");",
"_e('string', 'string2');",
"_e(\"string\", 'string2');",
"__('string');",
"__(\"string\");",
"__('string', 'string2');",
"__(\"string\", 'string2');",
];
xs.forEach((x) => {
const matches = x.match(r);
if(matches){
console.log('matches are:\n ', matches.filter(m => m !== undefined).join('\n '));
}else{
console.log('no matches for', x);
}
});
Now let me explain how the regex works and how I arrived at it:
First I noticed that all your strings start with _ and end with );,
so I knew the regex had to look something like ^…\);$.
Here ^ and $ mark the beginning and end of the string, and you should leave them out if they're not required.
After the initial _ you've got either another _ or an e, so we put these into a character class followed by the opening parenthesis: _[e_]\(.
Now we have a string that is either in " or in ', and we put it down as alternatives: (\"(.*)\"|\'(.*)\').
This string is repeated, but optionally, with a leading comma and space in front.
So we get (, …)? for the optional part, with the same (\"(.*)\"|\'(.*)\') inside it for the second string.
For the second portion of your problem you can use the same strategy:
"string"|trans
'string'|trans
"string"|trans({}, "string2")
'string'|trans({}, 'string2')
'string'|trans({}, "string2")
"string"|trans({}, 'string2')
Start building up your regex from the similarities. We've got the same string pattern as before used twice, and the optional second part now looks like (\(\{\}, (\"(.*)\"|\'(.*)\')\))?.
This way we can end up with a regex like this:
^(\"(.*)\"|\'(.*)\')\|trans\(\{\}, (\"(.*)\"|\'(.*)\')\))?$
Please note that this regex is not tested, but just a guess from my side.
Upon further discussion it became apparent that we're looking at several matches in a larger bunch of text. To adapt to this we need to exclude the ' and " characters from the innermost groups, which leaves us with these regexes:
_[e_]\(("([^"]*)"|\'([^']*)\')(, ("([^"]*)"|\'([^']*)\'))?\);
("([^"]*)"|\'([^']*)\')\|trans(\(\{\}, ("([^"]*)"|\'([^']*)\')\))?
I've also noted that my second regex apparently had an unmatched parenthesis in it.
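To carry the two final regexes over to PHP, a sketch along these lines could work. It assumes the quote-excluding form ([^"]* and [^']*) for both patterns, as described above. The helper name extract_strings and the sample inputs are made up for illustration. Nowdoc strings are used so the patterns need no extra PHP-level escaping.

```php
<?php
// Hypothetical helper: runs one pattern and returns [first, second] pairs.
function extract_strings($pattern, $text) {
    $found = [];
    preg_match_all($pattern, $text, $matches, PREG_SET_ORDER);
    foreach ($matches as $m) {
        // Group 1 is the first quoted string, group 5 the optional second one.
        // substr(..., 1, -1) strips the surrounding quotes.
        $first  = substr($m[1], 1, -1);
        $second = isset($m[5]) ? substr($m[5], 1, -1) : null;
        $found[] = [$first, $second];
    }
    return $found;
}

// Pattern for _e(...) and __(...) calls, delimited by ~.
$callPattern = <<<'RE'
~_[e_]\(("([^"]*)"|'([^']*)')(, ("([^"]*)"|'([^']*)'))?\);~
RE;

// Pattern for the Twig "trans" filter.
$transPattern = <<<'RE'
~("([^"]*)"|'([^']*)')\|trans(\(\{\}, ("([^"]*)"|'([^']*)')\))?~
RE;

// Made-up sample inputs.
$php = <<<'TXT'
_e('Save');
__("Title", 'my-domain');
TXT;

$twig = <<<'TXT'
{{ "Hello"|trans }}
{{ 'Bye'|trans({}, "messages") }}
TXT;

print_r(extract_strings($callPattern, $php));
print_r(extract_strings($transPattern, $twig));
```

Both patterns share the same group layout, which is why one helper can serve both. Strings that contain escaped quotes are not handled by this sketch.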

I tried to understand the purpose of these regexes - here's what I think. (Let me omit the slashes on both sides, also the string quotes belonging to the language instead of the regex itself.)
(__|_e)\(\"(.*)\"
(__|_e)\(\'(.*)\'
This way you get all the hits of your 8 regexes above; but that's probably not what you were trying to achieve.
As far as I understand, you want to list the I18N references in your code, with one or more arguments between the parentheses. I think the best way to do it is to run preg_match_all() with the simplest form of the pattern, capturing the argument list in a second group:
(__|_e)\((.*)\)
or maybe this one is better:
(__|_e)\(([^\)]+)\) // works for multiple calls in one line, ignores empties
...and then iterate over the results (fetched with PREG_SET_ORDER) one by one and split the argument list by comma:
foreach ($matches as $m) {
    $args = explode(",", $m[2]); // [2] = second subpattern, the argument list
    // now you have the arguments of this function call
}
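A runnable sketch of this match-then-split approach might look as follows. The sample $content is invented, and the argument list is captured in its own group so it can be split. Note that a comma inside a translated string would be split as well, which this sketch does not guard against.

```php
<?php
// Made-up sample source line with two translation calls.
$content = "echo __('Hello', 'my-domain'); _e(\"Bye\");";

// Group 1: function name, group 2: raw argument list.
preg_match_all('/(__|_e)\(([^)]+)\)/', $content, $matches, PREG_SET_ORDER);

$calls = [];
foreach ($matches as $m) {
    // Split on commas, then strip whitespace and surrounding quotes.
    $args = array_map(
        function ($arg) { return trim($arg, " \t\"'"); },
        explode(',', $m[2])
    );
    $calls[] = [$m[1], $args];
}

print_r($calls);
```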
If this answer is not helping, let's refine the question :)

How to get a number from a html source page?

I'm trying to retrieve the "followed by" count on my Instagram page. I can't seem to get the regex right and would very much appreciate some help.
Here's what I'm looking for:
y":{"count":
That's the beginning of the string, and I want the 4 digits after that.
$string = preg_replace("{y"\"count":([0-9]+)\}","",$code);
Someone suggested this ^ but I can't get the formatting right...

You haven't posted your strings, so it's a guess as to what the regex should be. Instead, I'll explain why your code fails.
preg_replace('"followed_by":{"count":\d')
This is very far from the correct preg_replace usage. You need to give it the replacement string and the string to search on. See http://php.net/manual/en/function.preg-replace.php
Your second usage:
$string = preg_replace(/^y":{"count[0-9]/","",$code);
This is closer, but preg_replace is global, so it searches your whole file (or it would, if not for the anchor) and replaces every found value with nothing. What you really want (I think) is preg_match:
$string = preg_match('/y":\{"count(\d{4})/', $code, $match);
$counted = $match[1];
This presumes your regex was kind of correct already.
Per your update:
Demo: https://regex101.com/r/aR2iU2/1
$code = 'y":{"count:1234';
$string = preg_match('/y":\{"count:(\d{4})/', $code, $match);
$counted = $match[1];
echo $counted;
PHP Demo: https://eval.in/489436
I removed the ^, which requires the match to begin at the start of your string, escaped the {, and made the \d match exactly 4 digits. The () is a capture group and stores whatever is matched inside it, in this case the 4 digits.
Also if this isn't just for learning you should be prepared for this to stop working at some point as the service provider may change the format. The API is a safer route to go.
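For a count that is not always 4 digits long, a sketch with \d+ could look like this. The sample $code string is an assumption about what the page source contains (built from the "followed_by" fragment quoted above); the real markup may differ.

```php
<?php
// Assumed sample of the page source, not real data.
$code = '"followed_by":{"count":1234},"follows":{"count":56}';

$count = null;
// \d+ accepts any number of digits; { and } are escaped to be literal.
if (preg_match('/"followed_by":\{"count":(\d+)\}/', $code, $match)) {
    $count = (int) $match[1];
}

echo $count, "\n";
```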

This regexp should capture value you're looking for in the first group:
\{"count":([0-9]+)\}
Use it with the preg_match_all function to capture what you want into an array (you're using preg_replace, which isn't for retrieving data but for... well, replacing it).
Your regexp isn't working because you didn't escape the curly brackets. You also didn't put a quantifier after the digit class (the plus sign in my example), so it would only capture the first digit anyway.

preg_replace with Regex - find number-sequence in URL

I'm a regex newbie, so sorry for this "simple" question:
I've got a URL like the following:
http://stellenanzeige.monster.de/COST-ENGINEER-AUTOMOTIVE-m-w-Job-Mainz-Rheinland-Pfalz-Deutschland-146370543.aspx
What I'm trying to achieve is getting the number sequence (aka the Job-ID) right before the ".aspx" with preg_replace.
I've already figured out that the regex for finding it could be
(?!.*-).*(?=\.)
Now preg_replace needs the opposite of that regular expression. How can I achieve that? Also worth mentioning:
The URL can have multiple numbers in it. I only need the sequence right before ".aspx". Also, there could be some parameters after the ".aspx", like "&mobile=true".
Thank you for your answers!

You can use:
$re = '/[^-.]+(?=\.aspx)/i';
preg_match($re, $input, $matches);
//=> 146370543
This matches a run of characters that are neither hyphens nor dots and that is immediately followed by .aspx, checked with the lookahead (?=\.aspx).

You can just use preg_match (you don't need preg_replace, as you don't want to change the original string) and capture the number before the .aspx. Assuming .aspx is always at the end of the string, the simplest way I could think of is:
<?php
$string = "http://stellenanzeige.monster.de/COST-ENGINEER-AUTOMOTIVE-m-w-Job-Mainz-Rheinland-Pfalz-Deutschland-146370543.aspx";
$regex = '/([0-9]+)\.aspx$/';
preg_match($regex, $string, $results);
print $results[1];
?>
A short explanation:
$results contains an array of results. Its first element holds the text matched by the complete regex, which is 146370543.aspx in this example. The second element contains the group captured by the parentheses around [0-9]+.
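The question mentions that parameters such as "&mobile=true" may follow the ".aspx", in which case the $ anchor would prevent a match. A hedged variation without the end anchor is sketched below. The second URL is the question's URL with the parameter appended for illustration.

```php
<?php
$urls = [
    'http://stellenanzeige.monster.de/COST-ENGINEER-AUTOMOTIVE-m-w-Job-Mainz-Rheinland-Pfalz-Deutschland-146370543.aspx',
    'http://stellenanzeige.monster.de/COST-ENGINEER-AUTOMOTIVE-m-w-Job-Mainz-Rheinland-Pfalz-Deutschland-146370543.aspx&mobile=true',
];

$ids = [];
foreach ($urls as $url) {
    // Digits directly before ".aspx"; no end anchor, so trailing
    // parameters are allowed.
    if (preg_match('/(\d+)\.aspx/i', $url, $m)) {
        $ids[] = $m[1];
    }
}

print_r($ids);
```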

You can get the opposite by using this regex:
(\D*)\d+(.*)
Result for the URL in the question:
MATCH 1
1. [0-100] `http://stellenanzeige.monster.de/COST-ENGINEER-AUTOMOTIVE-m-w-Job-Mainz-Rheinland-Pfalz-Deutschland-`
2. [109-114] `.aspx`
Even if you just want the number for that url you can use this regex:
(\d+)
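Since the question asked for preg_replace specifically, one possible sketch is to match the whole URL and replace it with the captured digits. The pattern below is a suggestion rather than one taken from the answers. If the pattern does not match, preg_replace returns the input unchanged, so the result may need checking.

```php
<?php
$url = 'http://stellenanzeige.monster.de/COST-ENGINEER-AUTOMOTIVE-m-w-Job-Mainz-Rheinland-Pfalz-Deutschland-146370543.aspx&mobile=true';

// Everything before the digits and everything after ".aspx" is consumed,
// so the replacement '$1' leaves only the captured job id.
$jobId = preg_replace('/^.*?(\d+)\.aspx.*$/i', '$1', $url);

echo $jobId, "\n";
```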

PHP preg_match_all not matching properly

I am trying to get some data out of a website's source code. What I am trying to do is get everything after /collections/ (whatever follows it). My pattern matches most of what I am looking for. The problem occurs when preg_match_all gets to a link containing "&": it reads up to the "&" and ignores the rest. Here is my script:
$homepage = file_get_contents('http://www.harrisfarm.com.au/');
$pattern = '/collections([\w-&\/]*)/i';
preg_match_all($pattern, $processedHomePage, $collections);
print_r($collections);
Notice that when printing like this, everything after the "&" is ignored, so I get this:
/collections/seafood/Shellfish-&
But when I am pattern matching on one string such as below:
$subject = 'a href="/collections/organic/Pantry/sickmonster/grandma" <a href="/collections/seafood/Shellfish-&-Crustaceans">Oysters, Shellfish & Crustaceans';
it gets me everything I want:
/collections/seafood/Shellfish-&-Crustaceans
So I wonder... why is this happening? I am really stumped here.

There is no problem with the provided code when you use $homepage instead of $processedHomePage in preg_match_all.
BTW:
You should escape the minus sign in square brackets (or write it at the beginning or end of the expression in square brackets). It happens to make no difference in your case, but stricter PCRE versions can reject a hyphen in that position, so don't rely on it:
$pattern = '/collections([-\w&\/]*)/i';
See http://php.net/manual/regexp.reference.meta.php for further information.

Try this:
$re = "/\\/collections([\\w\\-\\&\\/;]*)/mi";
$str = "<a href=\"/collections/seafood/Shellfish-&-Crustaceans\">Oysters, Shellfish & Crustaceans';\n<a href=\"/collections/seafood/Shellfish-&-Crustaceans\">Oysters,collections Shellfish & Crustaceans';";
preg_match_all($re, $str, $matches);
Your updated code:
$homepage = file_get_contents('http://www.harrisfarm.com.au/');
$pattern = "/\\/collections([\\w\\-\\&\\/;]*)/mi";
preg_match_all($pattern, $homepage, $collections);
print_r($collections);

I figured out what the problem is - maybe this will help others later.
I had used htmlspecialchars() on the source of http://www.harrisfarm.com.au/ and then read the result in as a string. This converted special characters such as & (and a few others) into multi-character entities.
The conversion turns & into &amp;, which contains a ;, and that's not in my character class. Since ; is not part of the regular expression, the regex stopped matching at that point.
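The effect can be reproduced with a small sketch. The sample link is taken from the question; the pattern is the one with the hyphen moved to the front of the character class.

```php
<?php
// One link from the question, before and after htmlspecialchars().
$raw     = '<a href="/collections/seafood/Shellfish-&-Crustaceans">';
$escaped = htmlspecialchars($raw);

$pattern = '/collections([-\w&\/]*)/i';

preg_match($pattern, $raw, $a);      // matches up to the closing quote
preg_match($pattern, $escaped, $b);  // stops at the ";" of "&amp;"

echo $a[1], "\n";
echo $b[1], "\n";
```

Matching against the unescaped source (or decoding it first with html_entity_decode()) avoids the problem. Simply adding ; to the character class would also let the match run on through entities such as &quot; that follow the URL.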

RegEx Capture Group with PHP preg_match Not Returning Values

I'm trying to capture the text "Capture This" in $string below.
$string = "</th><td>Capture This</td>";
$pattern = "/<\/th>\r.*<td>(.*)<\/td>$/";
preg_match ($pattern, $string, $matches);
echo($matches);
However, that just returns "Array". I also tried printing $matches using print_r, but that gave me "Array ( )".
This pattern will only come up once, so I just need it to match one time. Can somebody please tell me what I'm doing wrong?

The problem is that your pattern requires a CR character (\r), which the input doesn't contain. You should also make the quantifier inside the capturing group lazy and use print_r to output the array. Like this:
$pattern = "/<\/th>.*<td>(.*?)<\/td>$/";
You can see it in action here: http://codepad.viper-7.com/djRJ0e
Note that it's recommended to parse HTML with a proper HTML parser rather than with regex.

Two things:
You need to drop the \r from your regex as there is no carriage return character in your input string.
Change echo($matches) to print_r($matches) or var_dump($matches)
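Putting both points together, a minimal sketch could look like this. Using \s* in place of \r.* is an assumption: it tolerates optional whitespace (including a line break) between the two tags without requiring it.

```php
<?php
$string = "</th><td>Capture This</td>";

// \s* allows, but does not require, whitespace between </th> and <td>.
// (.*?) is lazy, so it stops at the first closing </td>.
$pattern = '/<\/th>\s*<td>(.*?)<\/td>$/';

if (preg_match($pattern, $string, $matches)) {
    // $matches[0] is the whole match, $matches[1] the captured text.
    echo $matches[1], "\n";
}
```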
