Regular expression starting with http and ending with pdf? - php

I have loaded the entire HTML of a page and want to retrieve all the URLs which start with http and end with pdf. I wrote the following, which didn't work:
$html = file_get_contents( "http://www.example.com" );
preg_match( '/^http(pdf)$/', $html, $matches );
I'm pretty new to regex but from what I've learned ^ marks the beginning of a pattern and $ marks the end. What am I doing wrong?

You need to match the characters in the middle of the URL:
/\bhttp[\w%+\/:.-]+?pdf\b/
\b matches a word boundary
^ and $ mark the beginning and end of the entire string. You don't want them here.
[...] matches any character in the brackets
\w matches any word character
+ matches one or more of the previous match
? makes the + lazy rather than greedy

preg_match( '/http[^\s]+pdf/', $html, $matches );
Matches http followed by not ([^...]) spaces (\s) one or more times (+) followed by pdf

Try this,
preg_match( '/\bhttp\S*pdf\b/', $html, $matches );
You need to match the part between the http and the pdf; this is what \S* is doing.
^ matches the start of the string and $ the end, but that is not what you want when you are extracting links from a longer text.
\b matches a word boundary
Update
For completeness: the .*? I used originally would still match too much, so I exchanged it with \S*
\S matches a non-whitespace character

Try this one:
preg_match_all('/\bhttp\S*?pdf\b/', $html, $matches);
Note that you need to use the preg_match_all() function here, since you are trying to match more than one occurrence. ^ and $ won't work, because they only match at string or line boundaries (depending on the modifiers used).
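A minimal sketch of the preg_match_all() approach, assuming the HTML is already in a string. The helper name extractPdfUrls and the sample markup are made up for illustration, and the pattern is a slightly stricter variant than the ones above: it requires :// after http(s), a literal .pdf at the end, and stops at quotes and angle brackets so a match cannot run across attribute boundaries.

```php
<?php
// Hypothetical helper: collect every http(s) URL ending in ".pdf" from an HTML string.
function extractPdfUrls(string $html): array
{
    // [^\s"'<>]*? lazily consumes URL characters but stops at whitespace,
    // quotes, and angle brackets; \b ensures "pdf" ends a word.
    preg_match_all('/\bhttps?:\/\/[^\s"\'<>]*?\.pdf\b/i', $html, $matches);
    return $matches[0];
}

// Sample markup (an assumption, not from the original page).
$html = '<a href="http://www.example.com/a.pdf">A</a> '
      . '<a href="http://www.example.com/page.html">B</a> '
      . '<a href="https://www.example.com/docs/b.pdf">C</a>';

print_r(extractPdfUrls($html));
```

With this sample input only the two .pdf links are returned; the .html link is skipped because the character class cannot cross the closing quote.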

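preg_match( '/^http.*pdf$/', $html, $matches );
works, but only when the string being tested is a single URL; because of the ^ and $ anchors it will not find URLs inside a full HTML page.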

Related

preg match between two strings

I need help with this preg_match. I tried solutions from other posts but did not get the result I wanted, so I am finally posting it.
I am trying to extract z,a,b from first and a from second example.
1) Write a function operations with parameter z,a,b and returns b.
2) write a function factorial with parameter a.
This is what I tried so far:
preg_match_all('/\parameter(.*?)\and?/', $question, $match);
$questionVars = $match[1];
print $questionVars;
Thank you so much!
Your solution can be different depending on actual requirements.
If you need a string after parameter as a whole word that can consist of word and comma chars you may use
preg_match('~\bparameter\s+\K\w+(?:\s*,\s*\w+)*~', $s, $m)
See the regex demo. The \bparameter\s+ matches a word boundary, parameter and 1+ whitespace chars, and all this text is omitted with the help of \K, the match reset operator. \w+(?:\s*,\s*\w+)* matches and returns the 1+ word chars followed with 0+ repetitions of a comma enclosed with optional whitespace chars and again 1+ word chars.
If you plan to get those comma-separated chunks separately, use
preg_match_all('~(?:\G(?!^)\s*,\s*|\bparameter\s+)\K\w+~', $s, $m)
See another regex demo. Here, (?:\G(?!^)\s*,\s*|\bparameter\s+) will either match the whole word parameter with 1+ whitespace chars after it (\bparameter\s+, as in the previous solution) or the end of the previous successful match followed by a comma enclosed with optional whitespace chars (\G(?!^)\s*,\s*). The \K will omit the text matched so far and \w+ will grab the value. You may replace \w+ with [^,]* to grab 0+ chars other than a comma.
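As a rough sketch, the second pattern can be applied to the two sentences from the question like this. The helper name parameterNames is hypothetical and only wraps the preg_match_all() call shown above.

```php
<?php
// The two sample sentences from the question.
$questions = [
    'Write a function operations with parameter z,a,b and returns b.',
    'write a function factorial with parameter a.',
];

// Hypothetical helper: return the names listed after the word "parameter".
function parameterNames(string $s): array
{
    // \G(?!^) continues from the end of the previous match (but not at the
    // string start); \K drops the separator or the "parameter " prefix.
    preg_match_all('~(?:\G(?!^)\s*,\s*|\bparameter\s+)\K\w+~', $s, $m);
    return $m[0];
}

foreach ($questions as $q) {
    echo implode(',', parameterNames($q)), "\n";
}
```

The first sentence yields z, a and b; the second yields only a. The trailing "b" in "returns b" is not picked up because it is neither preceded by parameter nor adjacent to the previous match.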

PHP regex to get WordPress category slug from $_SERVER['REQUEST_URI']

I am trying to get the category slug from $_SERVER['REQUEST_URI'] using a preg_match pattern, but it's not working.
For example, the $_SERVER['REQUEST_URI'] returns /category/current-affairs/ and I want to set current-affairs to a variable that I want to use.
So far I came up with this but it's not working
^\/category\/(?:\/(\w+))*$/g
Any help with this will be very much appreciated.
You don't need a regex. WordPress has functions that do it for you:
if (is_category()) {
    // 'cat' holds the ID of the category being viewed
    $category = get_query_var('cat');
    $current_cat = get_category($category);
    echo 'The slug is ' . $current_cat->slug;
}
Your regex ^\/category\/(?:\/(\w+))*$/g matches:
From the beginning of the string ^
Match a forward slash \/
Match category
Match a forward slash \/
A non capturing group (?: repeated zero or more times *
In this non capturing group, match a forward slash and, in a capturing group, \w one or more times: \/(\w+)
The end of the string $
With this part \/(\w+) you are trying to match current-affairs
But this part matches
A forward slash \/
Capture in a group [A-Za-z0-9_] one or more times
But your text has a hyphen - in it.
The full pattern expects to match, for example, /category, then 2 forward slashes //, then [A-Za-z0-9_]+
It would match:
/category//currentaffairs
But not
/category//currentaffairs/
/category/currentaffairs/
/category//current-affairs/
/category//current-affairs
I think you can get your match like this:
^\/category\/([\w-]+)\/$
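A small sketch of how that pattern might be used in PHP. The helper name categorySlug is hypothetical, and two details are my own assumptions rather than part of the answer: the query string is stripped with parse_url() first, and the trailing slash is made optional.

```php
<?php
// Hypothetical helper: pull the slug out of a path such as /category/current-affairs/.
function categorySlug(string $requestUri): ?string
{
    // Strip any query string first so "?page=2" does not defeat the $ anchor.
    $path = (string) parse_url($requestUri, PHP_URL_PATH);

    // # is used as the delimiter so the slashes need no escaping;
    // [\w-]+ allows letters, digits, underscores and hyphens.
    if (preg_match('#^/category/([\w-]+)/?$#', $path, $m)) {
        return $m[1];
    }
    return null; // not a category URL
}

var_dump(categorySlug('/category/current-affairs/'));
var_dump(categorySlug('/about/'));
```

Returning null for non-category paths lets the caller distinguish "no slug" from an empty string.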
A very easy way is to explode the uri to an array and read the 3rd value:
$request_uri = explode('/', $_SERVER['REQUEST_URI']);
$category = $request_uri[2];
Your pattern isn't working for a few reasons:
You're trying to match // (in this section \/(?:\/)
\w doesn't match -. It matches a-zA-Z0-9_
You're always ensuring something follows the last /: (?:\/(\w+))*$
Code
Note: The regex below uses a different regular expression delimiter (in the link, for example, I use # instead of / to delimit the pattern). This allows us to use / inside the pattern without first having to escape it.
See regex in use here
/category/\K[^/]*
/category/ Match this literally.
\K Resets the starting point of the reported match. Any previously consumed characters are no longer included in the final match.
[^/]* Match any character except / any number of times.
Usage
$re = '#/category/\K[^/]*#';
preg_match_all($re, $_SERVER['REQUEST_URI'], $matches, PREG_SET_ORDER, 0);
var_dump($matches);
This should work:
\/category(.*)

Regex to get only characters without space inside special tags

I have 2 texts in a string:
%Juan%
%Juan Gonzalez%
And I want to get only %Juan% and not the one with the space. I have been trying several regexes without luck. I currently use:
/%(.*)%/U
but it gets both things. I tried adding and playing with [^\s] but it doesn't work.
Any help please?
The issue is that . matches any character but a newline. The /U ungreedy mode only makes .* lazy and it captures a text from the % up to the first % to the right of the first %.
If your strings contain one pair of %...%, you may use
/%(\S+)%/
See the regex demo
The \S+ pattern matches 1+ characters other than whitespace. The [^\h%] negated character class used in the next pattern matches any character but a horizontal whitespace and the % symbol.
If you have multiple %...% pairs, you may use
/%([^\h%]+)%/
See another regex demo, where \h matches any horizontal whitespace.
PHP demo:
$re = '/%([^\h%]+)%/';
$str = "%Juan%\n%Juan Gonzalez%";
preg_match_all($re, $str, $matches);
print_r($matches[1]);

PHP regular expression start and end with given strings

I have a string like this
05/15/2015 09:19 PM pt_Product2017.9.abc.swl.px64_kor_7700
I need to select pt_Product2017.9.abc.swl.px64_kor from that (it starts with pt_ and ends with _kor).
$str = "05/15/2015 09:19 PM pt_Product2017.9.abc.swl.px64_kor_7700";
preg_match('/^pt_*_kor$/',$str, $matches);
But it doesn't work.
You need to remove the anchors, add a \b at the beginning so that pt_ is only matched when preceded by a non-word character (or the start of the string), and use \S with * (\S is a shorthand character class that matches any character but whitespace):
preg_match('/\bpt_\S*_kor/',$str, $matches);
See regex demo
In your regex, ^ and $ force the regex engine to search for pt at the beginning and _kor at the end of the string, and _* matches 0 or more underscores. Note that regex patterns are not the same as wildcards.
In case there can be whitespace between pt_ and _kor, use .*:
preg_match('/\bpt_.*_kor/',$str, $matches);
I should also mention greediness: if you have pt_something_kor_more_kor, the .*/\S* will match the whole string, but .*?/\S*? will match just pt_something_kor. Please adjust according to your requirements.
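The greedy versus lazy difference can be seen in a short sketch. The input string below is made up for illustration and simply contains _kor twice.

```php
<?php
// Illustrative input with two "_kor" occurrences inside one token.
$str = 'x pt_something_kor_more_kor y';

// Greedy: \S* grabs as much as possible, then backtracks to the LAST _kor.
preg_match('/\bpt_\S*_kor/', $str, $greedy);

// Lazy: \S*? grabs as little as possible and stops at the FIRST _kor.
preg_match('/\bpt_\S*?_kor/', $str, $lazy);

echo $greedy[0], "\n"; // pt_something_kor_more_kor
echo $lazy[0], "\n";   // pt_something_kor
```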
^ and $ are the start and end of the complete string, not only of the matched part. So simply use (pt_.+_kor) to match everything between pt_ and _kor: preg_match('/(pt_.+_kor)/',$str, $matches);
Here's a demo: https://regex101.com/r/qL4fW9/1
The ^ and $ that you have used in the regular expression mean that the string should start with pt_ AND end with _kor. But it neither starts with pt_ nor ends with _kor (in fact, it ends with _kor_7700).
Try removing the ^ and $, and you'll get the match:
preg_match('/pt_.*_kor/',$str, $matches);

Finding #mentions in string

Trying to replace all occurrences of an #mention with an anchor tag, so far I have:
$comment = preg_replace('/#([^# ])? /', '#$1 ', $comment);
Take the following sample string:
"#name kdfjd fkjd as#name # lkjlkj #name"
Everything matches okay so far, but I want to ignore that single "#" symbol. I've tried using "+" and "{2,}" after the "[^# ]" which I thought would enforce a minimum amount of matches, but it's not working.
Replace the question mark (?) quantifier ("optional") and add in a + ("one or more") after your character class:
#([^# ]+)
The regex
(^|\s)(#\w+)
Might be what you are after.
It basically means: the start of the string or a whitespace character, then a # symbol followed by 1 or more word characters.
E.g.
preg_match_all('/(^|\s)(#\w+)/', '#name1 kdfjd fkjd as#name2 # lkjlkj #name3', $result);
print_r($result[2]);
Gives you
Array
(
[0] => #name1
[1] => #name3
)
I like Petah's answer but I adjusted it slightly
preg_replace('/(^|\s)#([\w.]+)/', '$1#$2', $text);
The main differences are:
the # symbol is not included. That's for display only, should not be in the URL
allows . character (note: \w includes underscore)
in the replacement, I added $1 at the beginning to preserve the whitespace
Replacing ? with + will work but not as you expect.
Your expression does not match #name at the end of string.
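$comment = preg_replace('/#(\w+)/', '$0 ', $comment);
This should do what you want. \w+ matches one or more word characters (a-zA-Z0-9_). Note that / is used as the delimiter here, because # cannot serve as both the delimiter and an unescaped character inside the pattern.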
I recommend using a lookbehind before matching the # then one or more characters which are not a space or #.
The "one or more" quantifier (+) prevents the matching of mentions that mention no one.
Using a lookbehind is a good idea because it not only prevents the matching of email addresses and other such unwanted substrings, it asks the regex engine to primarily search #s then check the preceding character. This should improve pattern performance since the number of spaces should consistently outnumber the number of mentions in comments.
If the input text is multiline or may contain newlines, then adding an m pattern modifier will tell ^ to match at every line start. If newlines and tabs are possible, it will be more reliable to use (?<=^|\s)#([^#\s]+).
Code: (Demo)
$comment = "#name kdfjd ## fkjd as#name # lkjlkj #name";
var_export(
preg_replace(
'/(?<=^| )#([^# ]+)/',
'#$1',
$comment
)
);
Output: (single-quotes are from var_export())
'#name kdfjd ## fkjd as#name # lkjlkj #name'
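For the whitespace-aware variant with the m modifier, a minimal sketch could look like the following. The sample comment is invented for illustration and uses preg_match_all() to list the mentions rather than replace them.

```php
<?php
// Sample comment with a tab and a newline before mentions (made up for illustration).
$comment = "#alice hi\t#bob\n#carol as#dave # ##";

// (?<=^|\s) requires a line start or any whitespace character before the #;
// the m modifier lets ^ match after each newline as well.
// [^#\s]+ then needs at least one character that is neither # nor whitespace.
preg_match_all('/(?<=^|\s)#([^#\s]+)/m', $comment, $matches);

print_r($matches[1]);
```

Here alice, bob and carol are captured, while as#dave (no whitespace before the #), the lone # and the ## are skipped.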
Try:
'/#(\w+)/i'
