How can I extract a query string from these logs? - php

I have a bunch of lines in a log file where I need to extract only the query string part. I have identified these patterns:
/path/optin.html?e=somebase64string&l=somedifferentbase64string HTTP...
"/path/optin.html?e=somebase64string%3D&l=somedifferentbase64string" "browser info"...
"/path/optin.html?" "browser info"...
Some notes:
Sometimes path and query string are enclosed in double quotes
Sometimes there's no query string at all, obviously the ones with no query string are to be discarded.
Sometimes the base64 string was URL-encoded, so the trailing "=" comes through as "%3D" instead. I don't think this has affected my script, but I thought I'd note it as well.
So, I was able to correctly extract - hopefully - all of the lines that follow the first pattern above, but the others I'm having some trouble with.
This is the pattern I'm trying with:
$pattern = '/html\?(.*)\s*HTTP/';
then I run a preg_match against the log line.
Can anyone help me out with a better regex pattern?
I need to grab this part off the log lines:
e=somebase64string&l=somedifferentbase64string
Thanks

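You can use a pattern like ~\?([^\s]*)~ to match everything after a ? until you reach a whitespace character (assuming a rule that "URLs will never have spaces in them [that aren't %20]"). Note that the character class should not contain a dot, otherwise the match would also stop at the first literal "." in the query string:
$pattern = '~\?([^\s]*)~';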
preg_match_all($pattern, $logs, $output);
Then trim off any quotes (e.g. in your last example):
$output = array_map(function($var) { return rtrim($var, '"'); }, $output[1]);
Giving you:
Array
(
[0] => e=somebase64string&l=somedifferentbase64string
[1] => e=somebase64string%3D&l=somedifferentbase64string
[2] =>
)
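The question asks for lines without a query string to be discarded, so here is a minimal sketch that does the matching and the filtering in one pass. It is a variation on the answer above rather than the same code: it excludes the double quote inside the character class instead of trimming it afterwards. The function name extract_query_strings and the sample lines (including the GET and HTTP/1.1 parts) are made up for illustration.

```php
<?php
// Hypothetical helper: returns the non-empty query strings found in the given log lines.
function extract_query_strings(array $lines) {
    $found = array();
    foreach ($lines as $line) {
        // Capture everything after the first "?" up to the next whitespace or double quote.
        if (preg_match('~\?([^\s"]*)~', $line, $m) && $m[1] !== '') {
            $found[] = $m[1];
        }
    }
    return $found;
}

// Sample lines modelled on the three patterns in the question (assumed data).
$lines = array(
    'GET /path/optin.html?e=somebase64string&l=somedifferentbase64string HTTP/1.1',
    '"/path/optin.html?e=somebase64string%3D&l=somedifferentbase64string" "browser info"',
    '"/path/optin.html?" "browser info"',
);

$queries = extract_query_strings($lines);
print_r($queries);
```

With these sample lines the third entry is dropped, because its capture group is empty.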

Related

Use regex to quote the name in name-value pair of a list of pairs

I am trying to put quotes around the names of name-value pairs separated by commas. I use preg_replace and regex to achieve that. However, my pattern is not working properly.
$str="f1=1,f2='2',f3='a',f4=4,f5='5'";
$newstr=Preg_replace(/'(?.[^=]+)'/,"'$1'",$str);
I expected $newstr to come out like so:
'f1'=1,'f2'='2','f3'='a','f4'=4,'f5'='5'
But it doesn't, and the quotes don't contain the name.
What should the pattern be and how can I use the comma to get all of them correctly?
There are a few issues with your attempt:
PHP does not have a regex-literal syntax as in JavaScript, so starting the regex value with a forward slash is a syntax error. It should be a string, so start with a quote. Maybe you accidentally swapped the slash and quote at the start and the end.
(?. is not valid. Maybe you intended (?:, but then there is no capture group and $1 is not a valid back reference. To have the capture group, you should not have (?., but just (.
[^=]+ could include substrings like 1,f2. There should be logic to not start matching while still inside a value (whether quoted or not).
I would suggest a regex where you match both parts around the = (both key and value), and then in the replacement, just reproduce the second part without change. This will ensure you don't accidentally use anything on the value side for wrapping in quotes:
$newstr = preg_replace("/([^,=]+)=('[^']*'|[^,]*)/","'$1'=$2",$str);
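To see why matching the value as well matters, here is a small sketch that runs the same pattern on an input where a quoted value contains a comma. The input string is made up for illustration and is not from the question.

```php
<?php
// Sample input with a comma inside a quoted value (assumed data).
$str = "f1=1,f2='a,b',f3='c'";

// Match the key and the value together; only the key ($1) gets wrapped in quotes,
// the value ($2) is reproduced unchanged.
$newstr = preg_replace("/([^,=]+)=('[^']*'|[^,]*)/", "'$1'=$2", $str);

echo $newstr, "\n";
```

Because the quoted alternative '[^']*' consumes the whole value, the comma inside 'a,b' is never mistaken for a separator.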
Basically, match the beginning of the line or a comma (using a lookbehind, so it is not part of the match) and then capture everything up to the next =
$reg = "/(?<=^|,)([^=]+)/";
$str = "f1=1,f2='2',f3='a',f4=4,f5='5'";
print_r(preg_replace($reg, "'$1'", $str));
// output:
// 'f1'=1,'f2'='2','f3'='a','f4'=4,'f5'='5'
This will also work. It is a different approach, and it assumes there are no commas in the names or values other than the separators, and that every name is at least two characters long:
$newstr = preg_replace("/(.)(?==)|(?<=,|^)(.)/", "$1'$2", $str);
But I believe plain string and array operations will be faster, as the regex is getting complex and takes many steps to find the characters. Here is the same output using array functions only:
$newstr = implode(",", array_map(function($element){ return "'". implode("'=", explode("=", $element)); }, explode(",", $str)));
Regex is not always faster than string or array operations, but it can do complex things with very little code.

How to get a number from a html source page?

I'm trying to retrieve the "followed by" count on my Instagram page. I can't seem to get the regex right and would very much appreciate some help.
Here's what I'm looking for:
y":{"count":
That's the beginning of the string, and I want the 4 numbers after that.
$string = preg_replace("{y"\"count":([0-9]+)\}","",$code);
Someone suggested this ^ but I can't get the formatting right...
You haven't posted your strings, so what the regex should be is a guess. Instead I'll explain why your code fails.
preg_replace('"followed_by":{"count":\d')
This is very far from the correct preg_replace usage. You need to give it the replacement string and the string to search on. See http://php.net/manual/en/function.preg-replace.php
Your second usage:
$string = preg_replace(/^y":{"count[0-9]/","",$code);
This is closer, but preg_replace is global, so it searches the whole string (or it would, if not for the anchor) and replaces the found value with nothing. What you really want (I think) is preg_match:
$string = preg_match('/y":\{"count(\d{4})/', $code, $match);
$counted = $match[1];
This presumes your regex was kind of correct already.
Per your update:
Demo: https://regex101.com/r/aR2iU2/1
$code = 'y":{"count:1234';
$string = preg_match('/y":\{"count:(\d{4})/', $code, $match);
$counted = $match[1];
echo $counted;
PHP Demo: https://eval.in/489436
I removed the ^, which requires the match to begin at the start of the string, escaped the {, and made the \d match exactly 4 digits. The () is a capture group and stores whatever is matched inside it, in this case the 4 digits.
Also if this isn't just for learning you should be prepared for this to stop working at some point as the service provider may change the format. The API is a safer route to go.
This regexp should capture the value you're looking for in the first group:
\{"count":([0-9]+)\}
Use it with the preg_match_all function to capture what you want into an array (you're using preg_replace, which isn't for retrieving data but for, well, replacing it).
Your regexp isn't working because you didn't escape the curly brackets. You also didn't add a quantifier (the plus sign in my example), so it would only capture the first digit anyway.
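Putting the points from both answers together (escaped braces, a quantifier, and preg_match instead of preg_replace), a minimal sketch might look like this. The $code value is a made-up fragment, and the "followed_by" key is taken from the code quoted in the first answer; the real page source may well look different.

```php
<?php
// Made-up fragment of page source; the real markup may differ.
$code = '"followed_by":{"count":1234},"follows":{"count":56}';

$counted = null;
// Escape the braces and let \d+ match a count of any length.
if (preg_match('/"followed_by":\{"count":(\d+)\}/', $code, $match)) {
    $counted = (int) $match[1];
}

var_dump($counted);
```

Using \d+ rather than \d{4} keeps the pattern working when the count grows past four digits.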

extract text between two words in php

I got the following URL
http://www.amazon.com/LEGO-Ultimate-Building-Set-Pieces/dp/B000NO9GT4/ref=sr_1_1?m=ATVPDKIKX0DER&s=toys-and-games&ie=UTF8&qid=1350518571&sr=1-1&keywords=lego
and I want to extract
B000NO9GT4
That is the ASIN. So far I can search within the string, but not in the way I need. I looked at the split function and at explode, but can't find a way to do it. Also, the URLs will differ in length, so I can't hardcode the length either. The only thing that makes sense to me is to split the string so that
http://www.amazon.com/LEGO-Ultimate-Building-Set-Pieces/dp/
becomes the first part
and
B000NO9GT4/ref=sr_1_1?m=ATVPDKIKX0DER&s=toys-and-games&ie=UTF8&qid=1350518571&sr=1-1&keywords=lego
becomes the second part. From the second part, I should extract B000NO9GT4.
In the same way, I would want to get the product name LEGO-Ultimate-Building-Set-Pieces from the first part.
I am very bad at regex and can't find a way to do it.
Can somebody guide me on how to do this in PHP?
thanks
This grabs both pieces of information that you are looking to capture:
$url = 'http://www.amazon.com/LEGO-Ultimate-Building-Set-Pieces/dp/B000NO9GT4/ref=sr_1_1?m=ATVPDKIKX0DER&s=toys-and-games&ie=UTF8&qid=1350518571&sr=1-1&keywords=lego';
$path = parse_url($url, PHP_URL_PATH);
if (preg_match('#^/([^/]+)/dp/([^/]+)/#i', $path, $matches)) {
echo "Description = {$matches[1]}<br />"
."ASIN = {$matches[2]}<br />";
}
Output:
Description = LEGO-Ultimate-Building-Set-Pieces
ASIN = B000NO9GT4
Short Explanation:
Any expressions enclosed in ( ) will be saved as a capture group. This is how we get at the data in $matches[1] and $matches[2].
The expression ([^/]+) says to match all characters EXCEPT / so in effect it captures everything in the URL between the two / separators. I use this pattern twice. The [ ] actually defines the character class which was /, the ^ in this case negates it so instead of matching / it matches everything BUT /. Another example is [a-f0-9] which would say to match the characters a,b,c,d,e,f and the numbers 0,1,2,3,4,5,6,7,8,9. [^a-f0-9] would be the opposite.
# is used as the delimiter for the expression
^ following the delimiter means match from the beginning of the string.
See www.regular-expressions.info and PCRE Pattern Syntax for more info on how regexps work.
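The difference between a character class and its negation is easy to check on a small string. This is only an illustration; the sample input is made up.

```php
<?php
$sample = 'cafe-2024-xyz'; // made-up input

// [a-f0-9]+ matches runs of the characters a-f and the digits 0-9.
preg_match_all('#[a-f0-9]+#', $sample, $inside);

// [^a-f0-9]+ matches runs of everything else.
preg_match_all('#[^a-f0-9]+#', $sample, $outside);

print_r($inside[0]);
print_r($outside[0]);
```

Here $inside[0] holds "cafe" and "2024", while $outside[0] holds "-" and "-xyz".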
You can try
$str = "http://www.amazon.com/LEGO-Ultimate-Building-Set-Pieces/dp/B000NO9GT4/ref=sr_1_1?m=ATVPDKIKX0DER&s=toys-and-games&ie=UTF8&qid=1350518571&sr=1-1&keywords=lego" ;
list(,$desc,,$num,) = explode("/",parse_url($str,PHP_URL_PATH));
var_dump($desc,$num);
Output
string 'LEGO-Ultimate-Building-Set-Pieces' (length=33)
string 'B000NO9GT4' (length=10)

PHP preg_match part of url

I am trying to create a URL router in PHP that works like Django's.
The problem is, I don't know PHP regular expressions very well.
I would like to be able to match urls like this:
/post/5/
/article/slug-goes-here/
I've got an array of regexes:
$urls = array(
"(^[/]$)" => "home.index",
"/post/(?P<post_id>\d+)/" => "home.post",
);
The first regex in the array works to match the home page at / but I can't get the second one to work.
Here's the code I am using to match them:
foreach($urls as $regex => $mapper) {
if (preg_match($regex, $uri, $matches)) {
...
}
}
I should also note that in the example above, I am trying to match the post_id in the url: /post/5/ so that I can pass the 5 along to my method.
You must delimit the regex. Delimiting allows you to provide 'options' (such as 'i' for case insensitive matching) as part of the pattern:
,/post/(?P<post_id>\d+)/,
here, I have delimited the regex with commas.
As you have posted it, your regex was being delimited with /, which means it was treating everything after the second / as 'options', and only trying to match the "post" part.
The example you are trying to match against looks like it isn't what you're actually after based on your current regex.
If you are after a regex which will match something like;
/post/P1234/
Then, the following:
preg_match(',/post/(P\d+)/,', '/post/P1234/', $matches);
print_r($matches);
will result in:
Array
(
[0] => /post/P1234/
[1] => P1234
)
Hopefully that clears it up for you :)
Edit
Based on the comment to your OP, you are only trying to match a number after the /post/ part of the URL, so this slightly simplified version:
preg_match(',/post/(\d+)/,', '/post/1234/', $matches);
print_r($matches);
will result in:
Array
(
[0] => /post/1234/
[1] => 1234
)
If your second RegExp is meant to match urls like /article/slug-goes-here/, then the correct regular expression is
#\/article\/[\w-]+\/#
That should do it! The / characters don't need to be escaped here, because # is the delimiter, so you can also write the pattern without the backslashes. The (?P<post_id>...) construct is a named group; the syntax comes from Python, and PCRE (which PHP uses) supports it as well.
I hope I can be of help!
php 5.2.2: Named subpatterns now accept the syntax (?<name>) and
(?'name') as well as (?P<name>). Previous versions accepted only
(?P<name>).
http://php.net/manual/fr/function.preg-match.php
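Pulling the answers together, a router along the lines of the question could look like the sketch below. Every pattern is delimited with # and anchored, and the named groups are passed along as parameters. The route function, the article route, and the home.article mapping are made up for illustration; only home.index and home.post come from the question. ARRAY_FILTER_USE_KEY needs PHP 5.6 or later.

```php
<?php
// Each pattern is wrapped in # delimiters and anchored with ^ and $.
$urls = array(
    '#^/$#' => 'home.index',
    '#^/post/(?P<post_id>\d+)/$#' => 'home.post',
    '#^/article/(?P<slug>[\w-]+)/$#' => 'home.article', // hypothetical route
);

// Hypothetical helper: returns array(mapper, named parameters), or null if nothing matches.
function route(array $urls, $uri) {
    foreach ($urls as $regex => $mapper) {
        if (preg_match($regex, $uri, $matches)) {
            // Keep only the named groups (string keys); drop the numeric duplicates.
            $params = array_filter($matches, 'is_string', ARRAY_FILTER_USE_KEY);
            return array($mapper, $params);
        }
    }
    return null;
}

print_r(route($urls, '/post/5/'));
print_r(route($urls, '/article/slug-goes-here/'));
```

The first call yields home.post with post_id set to the string "5", which can then be passed on to the handler method.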

PHP Regex Parse query string containing un-encoded ampersands

I'm receiving a query string (from a terrible payment system whose name I do not wish to sully publicly) that contains un-encoded ampersands
name=joe+jones&company=abercrombie&fitch&other=no
parse_str can't handle this, and I don't know enough of regex to come up with my own scheme (though I did try). My hang up was look-ahead regex which I did not quite understand.
What I'm looking for:
Array
(
[name] => joe jones
[company] => abercrombie&fitch
[other] => no
)
I thought about traipsing through the string, ampersand by ampersand, but that just seemed silly. Help?
How about this:
If two ampersands occur with no = between them, encode the first one. Then pass the result to the normal query string parser.
That should accomplish your task. This works because the pattern for a "normal" query string should always alternate equals signs and ampersands; thus two ampersands in a row means one of them should have been encoded, and as long as keys don't have ampersands in them, the last ampersand in a row is always the "real" ampersand preceding a new key.
You should be able to use the following regex to do the encoding:
$better_qs = preg_replace("/&(?=[^=]*&)/", "%26", $bad_qs);
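As a quick check of the idea, here is the same replacement applied to the string from the question and then handed to parse_str. A minimal sketch; it assumes, as stated above, that keys never contain ampersands.

```php
<?php
$bad_qs = 'name=joe+jones&company=abercrombie&fitch&other=no';

// Encode every "&" that is followed by another "&" before any "=" appears.
$better_qs = preg_replace('/&(?=[^=]*&)/', '%26', $bad_qs);

// parse_str decodes "+" to a space and "%26" back to "&".
parse_str($better_qs, $params);

print_r($params);
```

One limitation: a stray ampersand in the very last value is not followed by another ampersand, so it would not be encoded by this pattern.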
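You could also split the string on ampersands with explode() (the old split() function was deprecated in PHP 5.3 and removed in PHP 7). After that, you could split each element again on the delimiter "=". Something like this:
$resultarray = array();
$myarray = explode("&", $mystring);
foreach ($myarray as $element) {
    $keyvalue = explode("=", $element, 2);
    $resultarray[$keyvalue[0]] = isset($keyvalue[1]) ? $keyvalue[1] : "";
}
print_r($resultarray);
Not tested, but you should get the idea. Note that a piece with no "=" in it (such as fitch in your example) ends up as a key of its own here, so you would still have to glue it back onto the previous value.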
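The split-on-ampersands idea can be extended so that a piece with no "=" is glued back onto the previous value. This is a rough sketch using explode(); the function name parse_loose_query is made up for illustration, and it assumes keys never contain ampersands.

```php
<?php
// Hypothetical helper: parses a query string whose values may contain raw "&".
function parse_loose_query($qs) {
    $result = array();
    $lastKey = null;
    foreach (explode('&', $qs) as $piece) {
        if (strpos($piece, '=') === false) {
            // No "=": treat this piece as the tail of the previous value.
            if ($lastKey !== null) {
                $result[$lastKey] .= '&' . urldecode($piece);
            }
            continue;
        }
        // Split on the first "=" only, so values may contain "=" themselves.
        list($key, $value) = explode('=', $piece, 2);
        $lastKey = urldecode($key);
        $result[$lastKey] = urldecode($value);
    }
    return $result;
}

$parsed = parse_loose_query('name=joe+jones&company=abercrombie&fitch&other=no');
print_r($parsed);
```

Unlike the regex approach, this also handles a stray ampersand in the last value, at the cost of a little more code.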
