Regular Expression to get all links with certain extensions - php

I'm looking for a regular expression that will grab all of the URLs that have one of the extensions in the following array:
Array
(
[0] => mp4
[1] => m4v
[2] => webm
[3] => ogv
[4] => wmv
[5] => flv
)
This array is returned by an internal WordPress function called wp_get_video_extensions() and lists the video file extensions that WordPress recognizes.
A block of content with URLs inside it would look like this:
'Yes, but I grow at a reasonable pace,' said the Dormouse: 'not in
that ridiculous fashion.' And he got up very sulkily and crossed over
to the other side of the court.
All this time the Queen had never left off staring at the Hatter, and,
just as the Dormouse crossed the court, she said to one of the
officers of the court, 'Bring me the list of the singers in the last
concert!' on which the wretched Hatter trembled so, that he shook both
his shoes off.
[video mp4="http://www.example.com/files/video/video1.mp4"][/video]
'Give your evidence,' the King repeated angrily, 'or I'll have you
executed, whether you're nervous or not.'
http://www.example.com/files/video/video2.flv
'I'm a poor man, your Majesty,' the Hatter began, in a trembling
voice, '—and I hadn't begun my tea—not above a week or so—and what
with the bread-and-butter getting so thin—and the twinkling of the
tea—'
I am trying to get it to find both video URLs in there and return each entire URL in the array.
Here is what I have:
preg_match_all( '/^https?:\/\/(?:[a-z\-]+\.)+[a-z]{2,6}(?:/[^/#?]+)+\.(?:' . implode( '|', wp_get_video_extensions() ) . ')$/', $post->post_content, $matches);
And I am getting this:
Warning: preg_match_all(): Unknown modifier '['
Ideally, I would like to get this:
Array
(
[0] => Array
(
[0] => http://www.example.com/files/video/video1.mp4
[1] => http://www.example.com/files/video/video2.flv
)
[1] => Array
(
[0] => http://www.example.com/
[1] => http://www.example.com/
)
[2] => Array
(
[0] => files/video/
[1] => files/video/
)
[3] => Array
(
[0] => video1.mp4
[1] => video2.flv
)
)
But this would also be perfect, as I can use parse_url() to break the rest out later on:
Array
(
[0] => http://www.example.com/files/video/video1.mp4
[1] => http://www.example.com/files/video/video2.flv
)

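Your first problem is that you didn't escape every "/" in the pattern. With "/" as the delimiter, the first unescaped "/" (the one in "(?:/[^/#?]+)") ends the pattern, and PHP then tries to read the following "[" as a modifier, hence the warning. The second problem is that the ^ and $ anchors only allow a match when the URL is the entire subject string, so URLs embedded in the content are never found. This should take care of both: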
preg_match_all('~https?://(?:[a-z\-]+\.)+[a-z]{2,6}(?:/[^/#?]+)+\.(?:' . implode( '|', wp_get_video_extensions() ) . ')~', $post->post_content, $matches);
Using "~" as the delimiter means you don't have to escape the "/".
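To try the corrected pattern outside WordPress, here is a minimal sketch. It assumes a hard-coded $extensions array as a stand-in for wp_get_video_extensions() and a shortened copy of the sample content. It also adds two small tweaks that are not part of the answer above: the path segments stop at whitespace and quotes as well, so a match cannot run on into the surrounding text, and the i flag makes the match case-insensitive.

```php
<?php
// Stand-in for wp_get_video_extensions(), hard-coded so the sketch runs without WordPress.
$extensions = array('mp4', 'm4v', 'webm', 'ogv', 'wmv', 'flv');

// Shortened sample content (assumption: same shape as the post content in the question).
$content = <<<TEXT
his shoes off.
[video mp4="http://www.example.com/files/video/video1.mp4"][/video]
'Give your evidence,' the King repeated angrily.
http://www.example.com/files/video/video2.flv
'I'm a poor man, your Majesty,' the Hatter began.
TEXT;

// Quote each extension in case one ever contains a regex metacharacter.
$alternation = implode('|', array_map(function ($ext) {
    return preg_quote($ext, '~');
}, $extensions));

// "~" as delimiter, no ^/$ anchors; path segments also stop at whitespace and quotes.
$pattern = '~https?://(?:[a-z\-]+\.)+[a-z]{2,6}(?:/[^/#?\s"\']+)+\.(?:' . $alternation . ')~i';

preg_match_all($pattern, $content, $matches);

// $matches[0] holds the full URLs; parse_url() can split them further if needed.
print_r($matches[0]);
```

Since only the full URLs are collected, this corresponds to the second, simpler result shape from the question.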

Related

regex for html attributes, need fix

I need to fix this regex, which I use with PHP's preg_match_all() to extract HTML attributes into an array:
(\S+)=["']?((?:.(?!["']?\s+(?:\S+)=|[>"']))+.)["']?
An example attribute string is:
style="width: 462px;" src=".......=" data-filename="Screenshot from 2016-02-09 21:54:47.png"
Working example on regex101: https://regex101.com/r/QE9XGD/1
Because of the equals sign at the end of the src attribute, I get the wrong array:
Array
(
[0] => Array
(
[0] => style="width: 462px;"
[1] => src=".......=" data-filename="
)
[1] => Array
(
[0] => style
[1] => src=".......
)
[2] => Array
(
[0] => width: 462px;
[1] => data-filename=
)
)
The correct array should look like this:
Array
(
[0] => Array
(
[0] => style="width: 462px;"
[1] => src=".......="
[2] => data-filename="Screenshot from 2016-02-09 21:54:47.png"
)
[1] => Array
(
[0] => style
[1] => src
[2] => data-filename
)
[2] => Array
(
[0] => width: 462px;
[1] => .......=
[2] => Screenshot from 2016-02-09 21:54:47.png
)
)
How can I fix this regex to get the correct result?
Keep in mind that I use this regex not only for image attributes; it needs to work universally for all kinds of HTML tags.
(\S+?)=["']?((?:.(?!["']?\s+(?:\S+)=|[>"']))+.)["']?
The change makes the attribute-name group lazy, so it only consumes characters until it finds the first =.
Working example on regex101
That being said, I'm fairly confident this regex can be reduced.
([^\s=]+)=('?)("?)([^>"']*)\2\3 is probably the best option:
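It takes about 2% of the time of the lazy version and handles both single- and double-quoted attributes. The main difference is that the capture groups you want are now the 1st (name) and 4th (value). As far as I'm aware, this works on quoted attributes as long as the value itself contains no quote character or >; it fails on input such as tag='"value', and it does not delimit unquoted values.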
regex101
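As a quick check of the shorter pattern, the sketch below runs it over the attribute string from the question and pairs group 1 (names) with group 4 (values). The variable names are illustrative only, and the sample uses the 21:54:47 filename from the question's input.

```php
<?php
// Attribute string from the question.
$html = 'style="width: 462px;" src=".......=" data-filename="Screenshot from 2016-02-09 21:54:47.png"';

// Group 1: attribute name, groups 2/3: optional opening quote, group 4: value.
$pattern = '/([^\s=]+)=(\'?)("?)([^>"\']*)\2\3/';

preg_match_all($pattern, $html, $m);

// Combine names and values into one associative array.
$attributes = array_combine($m[1], $m[4]);
print_r($attributes);
```

Note that array_combine() keeps only the last value if the same attribute name occurs twice, which may or may not matter for your input.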

PHP Array Break Down/Explode

STARTING ARRAY
Array
(
[0] => Array
(
[0] => /searchnew.aspx?Make=Toyota&Model=Tundra&Trim=CrewMax+5.7L+V8+6-Spd+AT+SR5&st=Price+asc
[1] => 19
)
)
I have been struggling to break down this array for the past couple of days. I have found a few useful functions that extract the strings I need when a start and end point are defined, but I can't see that holding up in the long term. Basically, I'm trying to take the string at index [0] and extract the values following "Model=" and "Trim=", hoping to end up with an array like this:
Array
(
[0] => Array
(
[0] => Tundra ***model***
[1] => CrewMax+5.7L+V8+6-Spd+AT+SR5 ***trim***
[2] => 19
)
)
I'm getting this information from an API, so coming up with a dynamic solution is my biggest challenge. I realize this is a big question, but is there a better, less hacky way of approaching this problem?
parse_url() will get you the query string and parse_str() parses the variables from that:
$q = parse_url($array[0][0], PHP_URL_QUERY);
parse_str($q, $result);
print_r($result);
Yields:
Array
(
[Make] => Toyota
[Model] => Tundra
[Trim] => CrewMax 5.7L V8 6-Spd AT SR5
[st] => Price asc
)
Now just echo $result['Model'] etc...
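To get from there to the array layout asked for in the question, a minimal sketch could loop over the rows and pick out the two keys. The names $rows, $url and $count are illustrative, and the missing-key fallback to null is an assumption about what you would want.

```php
<?php
// Starting array from the question: [0] is the URL, [1] is the count.
$array = array(
    array('/searchnew.aspx?Make=Toyota&Model=Tundra&Trim=CrewMax+5.7L+V8+6-Spd+AT+SR5&st=Price+asc', 19),
);

$rows = array();
foreach ($array as $entry) {
    list($url, $count) = $entry;

    // Extract the query string, then parse it into key/value pairs.
    $query = array();
    parse_str((string) parse_url($url, PHP_URL_QUERY), $query);

    $rows[] = array(
        isset($query['Model']) ? $query['Model'] : null, // model
        isset($query['Trim']) ? $query['Trim'] : null,   // trim ("+" is decoded to a space)
        $count,
    );
}

print_r($rows);
```

Because parse_str() URL-decodes the values, the trim comes back with spaces instead of "+"; pass it through urlencode() again if you need the original form.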

preg_replace json string match same character beginning/end

OK, so what I have is a JSON string that can contain one or many elements. Below is an example of the string; the real one is much more complicated, but this one highlights the issues I'm having.
{"elements":[{"id":2,"string":"something","string2":"","string3":"no html here","integer":2,"array":{"options":[{"id":1,"value":"data"},{"id":2,"value":"more data"}]},"string4":"text with <a href=\"http:\/\/www.example.com\">html<\/a>","string5":"naughty <a href=\"http:\/\/www.example.com\">link<\/a>"},{"id":2,"string":"something","string2":"","string3":"no html here","integer":2,"array":{"options":[{"id":1,"value":"data"},{"id":2,"value":"more data"}]},"string4":"text with <a href=\"http:\/\/www.example.com\">html<\/a>","string5":"naughty <a href=\"http:\/\/www.example.com\">link<\/a>"}]}
What I'm trying to do is match all of the strings (the data type, not the key name) in the JSON data and then, depending on whether a key is allowed to contain HTML (using a blacklist), strip out the HTML. I'm no regex expert, so I can't work out what's going wrong.
Here is my regex:-
([{,]"(?!(elements|string3|string4)":)(.*?)":)(?!,")"(.*?)",
I'm having two issues with it:
It matches elements whose values are integers or arrays by simply jumping ahead to the " found in the next string. I expected the match to fail and move on.
I can't get it to handle the \" in the URL, so I need the , at the end of the regex, but that then stops the next string from matching. I tried \G, but it seemed to have no effect; I have a feeling it starts after the , of the previous match. I also tried a number of solutions that were supposed to allow for escaped text, but they all failed in my case.
The thought was that this would be quicker than converting the JSON string into an object and then traversing the array of hundreds of elements to remove the HTML. If decoding is quicker, I'll just do that, as it would be a whole lot easier.
Don't work on the json directly, decode it using json_decode().
Then cleanup your HTML using HTMLPurifier, which does a great job at cleaning HTML code.
Then encode your data to json again using json_encode().
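The decode, clean, encode round trip can be sketched roughly as follows. This is only an illustration: strip_tags() stands in for HTMLPurifier so that the example has no external dependency, the helper strip_html_except() and the $allowHtml key list are hypothetical, and the JSON is a trimmed version of the sample from the question.

```php
<?php
// Trimmed sample; in single quotes the \" and \/ sequences reach json_decode() unchanged.
$json = '{"elements":[{"id":2,"string3":"no html here","string4":"text with <a href=\"http:\/\/www.example.com\">html<\/a>","string5":"naughty <a href=\"http:\/\/www.example.com\">link<\/a>"}]}';

// Hypothetical list of keys whose values may keep their HTML.
$allowHtml = array('string3', 'string4');

// Hypothetical helper: walks the decoded data and strips tags from every
// string whose key is not in the allow list. Non-string values pass through.
function strip_html_except($value, array $allowHtml, $key = null)
{
    if (is_array($value)) {
        foreach ($value as $k => $v) {
            $value[$k] = strip_html_except($v, $allowHtml, $k);
        }
        return $value;
    }
    if (is_string($value) && !in_array($key, $allowHtml, true)) {
        return strip_tags($value); // stand-in for an HTMLPurifier call
    }
    return $value;
}

$data  = json_decode($json, true);              // decode to nested arrays
$clean = strip_html_except($data, $allowHtml);  // clean the strings
echo json_encode($clean);                       // encode back to JSON
```

Working on the decoded structure sidesteps both regex problems from the question: integers and arrays are never mistaken for strings, and escaped quotes are handled by the JSON parser.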
Description
There were several problems with your expression. For example, .*? keeps consuming characters until the next required character can be matched, even if that means running past the closing quote. I replaced it with [^"]*?, which matches only non-quote characters, so the capture cannot extend outside the quoted group.
I also put the opening quote in its own capture group, (["]). This is probably overkill, but it lets you simply add a single quote to the character class later. I then refer back to this group to ensure that the corresponding close quote is matched. If the opening quote is optional in your input, put a question mark inside the group, (["]?), so that the backreference matches an empty string when no quote was opened.
I also moved the [{,] outside the capture group.
This is my cleaned-up version of the regex:
[{,]((")(?!(elements|string3|string4)\2:)([^"]*?)\2:)(")([^"]*?)\5(?=,)
PHP Code Example:
<?php
$sourcestring="your source string";
preg_match_all('/[{,]((")(?!(elements|string3|string4)\2:)([^"]*?)\2:)(")([^"]*?)\5(?=,)/i',$sourcestring,$matches);
echo "<pre>".print_r($matches,true);
?>
$matches Array:
(
[0] => Array
(
[0] => ,"string0":"something0"
[1] => ,"string1":""
[2] => ,"string":"something"
[3] => ,"string5":""
)
[1] => Array
(
[0] => "string0":
[1] => "string1":
[2] => "string":
[3] => "string5":
)
[2] => Array
(
[0] => "
[1] => "
[2] => "
[3] => "
)
[3] => Array
(
[0] =>
[1] =>
[2] =>
[3] =>
)
[4] => Array
(
[0] => string0
[1] => string1
[2] => string
[3] => string5
)
[5] => Array
(
[0] => "
[1] => "
[2] => "
[3] => "
)
[6] => Array
(
[0] => something0
[1] =>
[2] => something
[3] =>
)
)

Get name from hashtag using regex

I have this string/content:
#Salome, #Jessi H and #O'Ren were playing at the #Lean's yard with "#Ziggy" the mouse.
I am trying to extract all of the names marked above. I use the # symbol to create a kind of hashtag on my website. Notice that some names contain spaces, like #Jessi H, and some have characters directly before and after them, like "#Ziggy". I don't mind if you suggest a different way of marking the hashtags so that this works correctly. I was thinking that users whose names contain white space could write the hashtag with quotes, like #"Jessi H". What do you think? Other examples:
#Lean's => #"Lean"'s
#Jessi H => #"Jessi H"
"#Jessi H" => (sorry, I don't know how to parse it)
#O'Ren => #"O'Ren"
What have I tried?
I'm just starting out with regex in PHP, but some SO questions have been useful in getting me started. These are my attempts so far, using the preg_match_all function:
Result of /#(.*?)[,\" ]/:
Array ( [0] => Salome [1] => Jessi [2] => Charlie [3] => Lean's [4] => Ziggy" ) )
Result of /#"(.*?)"/ for names like #"name":
Empty array
I don't expect you to do it all for me. Pseudo-code or something similar would be enough to point me in the right direction.
Try the following regex: '/#(?:"([^"]+)|([^\b]+?))\b/'
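This will return two match groups, the first containing any quoted names (e.g. #"Jessi H" and #"O'Ren"), and the second containing any unquoted names (e.g. #Salome, #Lean). Note that inside a character class \b means a backspace character rather than a word boundary, so [^\b]+? is effectively a lazy "match anything" that stops at the first word boundary.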
$matches = array();
preg_match_all('/#(?:"([^"]+)|([^\b]+?))\b/', '#Salome, #"Jessi H" and #"O\'Ren" were playing at the #Lean\'s yard with "#Ziggy" the mouse.', $matches);
print_r($matches);
Output:
Array
(
[0] => Array
(
[0] => #Salome
[1] => #"Jessi H
[2] => #"O'Ren
[3] => #Lean
[4] => #Ziggy
)
[1] => Array
(
[0] =>
[1] => Jessi H
[2] => O'Ren
[3] =>
[4] =>
)
[2] => Array
(
[0] => Salome
[1] =>
[2] =>
[3] => Lean
[4] => Ziggy
)
)
Are these requirements fixed, or can you choose them? If you can choose them, I would suggest using _ instead of spaces in names, which would allow you to use a regex like:
/#(\w+)/
If spaces must be allowed and you're going with quotes, then the quotes should probably span the entire name, allowing for a regex like:
/#"([^"]+)"/
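If quoted hashtags always span the whole name, one pattern can cover both the quoted and the bare form, and the two capture groups can be merged into a single list. The sketch below assumes bare names consist of word characters only (so #Lean's yields Lean); the pattern and the $names variable are illustrative, not taken from either answer.

```php
<?php
$text = '#Salome, #"Jessi H" and #"O\'Ren" were playing at the #Lean\'s yard with "#Ziggy" the mouse.';

// Alternative 1: #"quoted name" (group 1). Alternative 2: bare #name (group 2).
preg_match_all('/#(?:"([^"]+)"|(\w+))/', $text, $matches);

// In the default pattern order, the unused group is an empty string for each match.
$names = array();
foreach ($matches[0] as $i => $whole) {
    $names[] = $matches[1][$i] !== '' ? $matches[1][$i] : $matches[2][$i];
}

print_r($names);
```

Unlike the first pattern above, the closing quote is consumed as part of the match, so $matches[0] contains the complete #"Jessi H" token.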

PHP Subpattern without Numbering Array

Using preg_match with named subpatterns always returns each capture twice, once under the subpattern name and once under its numeric index. Because I'm matching hundreds of thousands of lines at a few kilobytes per row, I'm afraid the numeric entries are taking up extra memory. Is there a proper way to stop the numerically indexed entries from being returned?
Example:
<?php
header('Content-Type: text/plain');
$data = <<<START
I go to school.
He goes to funeral.
START;
preg_match_all('#^(?<who>.*?) go(es)* to (?<place>.*?)$#m', $data, $matches);
print_r($matches);
?>
Output:
Array
(
[0] => Array
(
[0] => I go to school.
[1] => He goes to funeral.
)
[who] => Array
(
[0] => I
[1] => He
)
[1] => Array
(
[0] => I
[1] => He
)
[2] => Array
(
[0] =>
[1] => es
)
[place] => Array
(
[0] => school.
[1] => funeral.
)
[3] => Array
(
[0] => school.
[1] => funeral.
)
)
From php.net - Subpatterns:
It is possible to name a subpattern using the syntax (?P<name>pattern). This subpattern will then be indexed in the matches array by its normal numeric position and also by name.
I see no option to return only the named indexes.
So I think that if you don't want this data twice, the only possibility is not to use named groups.
Is this really an issue? In my opinion, you should optimize this only if you actually run into problems because of the additional memory usage. The improved readability should be worth the memory.
Update
It looks like go(es)* is only meant to match an optional "es". Here you can save memory by using a non-capturing group:
preg_match_all('#^(?<who>.*?) go(?:es)? to (?<place>.*?)$#m', $data, $matches);
Starting the group with ?: means the matched content is not stored. I also replaced the *, which means 0 or more and would also match "goeseses", with ?, which means 0 or 1.
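If the numeric entries only get in the way after matching, one option is to drop them with array_filter() and ARRAY_FILTER_USE_KEY (available since PHP 5.6). This is a sketch, not a memory fix: preg_match_all() still builds the full array first, so peak memory is not reduced, only what you keep afterwards. The $named variable is illustrative.

```php
<?php
$data = "I go to school.\nHe goes to funeral.";

preg_match_all('#^(?<who>.*?) go(?:es)? to (?<place>.*?)$#m', $data, $matches);

// Keep only the entries whose key is a string, i.e. the named groups.
$named = array_filter($matches, 'is_string', ARRAY_FILTER_USE_KEY);

print_r($named);
```

For hundreds of thousands of lines, processing the input in chunks and discarding $matches after each chunk is likely to matter more than the duplicated keys.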
