I need to fix this regex, which extracts HTML attributes into an array for me via the preg_match_all function in PHP:
(\S+)=["']?((?:.(?!["']?\s+(?:\S+)=|[>"']))+.)["']?
An example of the attributes:
style="width: 462px;" src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAg4AAALoCAYAAAAQpn2mAAAABHNCSVQICAgIfAhkiAAAABl0RVh0U29mdHdhcmUAZ25vbWUtc2NyZWVuc2hvdO8Dv4AACAASURBVHic7L15fNTVufj/PjOTyWSyTfaEJBD2EJBNQFQEtFVRXMD7VQG1dfu2tLW92t77unaxam+t9nbTXze9tW61Vdqvgre9FXcqUHFBFiUEkX0PgSQkmf1zzu+Pzz6ZhBBwg3l4kZn5fM7yPM8553me85znnAMZyEAGMpCBDGQgAxnIQAYykIEMZCADGchABjKQgQxkIAMZyEAGMpCBDGQgAxnIQAYykIEMZCADGchABjKQgQxkIAMZyEAGMpCBDGQgAxnIQAYykIEMZCADGchABjKQgQxkIAMZyEAGMpCBDGQgAxnIQAYykIEMZCADGchABjKQgQxkIAMZyEAGMpCBDGQgAxnIQAqIiy66SDXM/SW7DyUQgEIBAiFAKTOZQn8p7N/OQhB6PgFCgUI43ull6mmwyhUolFWJMB.......=" data-filename="Screenshot from 2016-02-09 21:54:47.png"
Working example in a fiddle: https://regex101.com/r/QE9XGD/1
Because of the equals sign at the end of the src attribute, I get the wrong array:
Array
(
[0] => Array
(
[0] => style="width: 462px;"
[1] => src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAg4AAALoCAYAAAAQpn2mAAAABHNCSVQICAgIfAhkiAAAABl0RVh0U29mdHdhcmUAZ25vbWUtc2NyZWVuc2hvdO8Dv4AACAASURBVHic7L15fNTVufj/PjOTyWSyTfaEJBD2EJBNQFQEtFVRXMD7VQG1dfu2tLW92t77unaxam+t9nbTXze9tW61Vdqvgre9FXcqUHFBFiUEkX0PgSQkmf1zzu+Pzz6ZhBBwg3l4kZn5fM7yPM8553me85znnAMZyEAGMpCBDGQgAxnIQAYykIEMZCADGchABjKQgQxkIAMZyEAGMpCBDGQgAxnIQAYykIEMZCADGchABjKQgQxkIAMZyEAGMpCBDGQgAxnIQAYykIEMZCADGchABjKQgQxkIAMZyEAGMpCBDGQgAxnIQAYykIEMZCADGchABjKQgQxkIAMZyEAGMpCBDGQgAxnIQAqIiy66SDXM/SW7DyUQgEIBAiFAKTOZQn8p7N/OQhB6PgFCgUI43ull6mmwyhUolFWJMB.......=" data-filename="
)
[1] => Array
(
[0] => style
[1] => src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAg4AAALoCAYAAAAQpn2mAAAABHNCSVQICAgIfAhkiAAAABl0RVh0U29mdHdhcmUAZ25vbWUtc2NyZWVuc2hvdO8Dv4AACAASURBVHic7L15fNTVufj/PjOTyWSyTfaEJBD2EJBNQFQEtFVRXMD7VQG1dfu2tLW92t77unaxam+t9nbTXze9tW61Vdqvgre9FXcqUHFBFiUEkX0PgSQkmf1zzu+Pzz6ZhBBwg3l4kZn5fM7yPM8553me85znnAMZyEAGMpCBDGQgAxnIQAYykIEMZCADGchABjKQgQxkIAMZyEAGMpCBDGQgAxnIQAYykIEMZCADGchABjKQgQxkIAMZyEAGMpCBDGQgAxnIQAYykIEMZCADGchABjKQgQxkIAMZyEAGMpCBDGQgAxnIQAYykIEMZCADGchABjKQgQxkIAMZyEAGMpCBDGQgAxnIQAqIiy66SDXM/SW7DyUQgEIBAiFAKTOZQn8p7N/OQhB6PgFCgUI43ull6mmwyhUolFWJMB.......
)
[2] => Array
(
[0] => width: 462px;
[1] => data-filename=
)
)
The correct array should look like this:
Array
(
[0] => Array
(
[0] => style="width: 462px;"
[1] => src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAg4AAALoCAYAAAAQpn2mAAAABHNCSVQICAgIfAhkiAAAABl0RVh0U29mdHdhcmUAZ25vbWUtc2NyZWVuc2hvdO8Dv4AACAASURBVHic7L15fNTVufj/PjOTyWSyTfaEJBD2EJBNQFQEtFVRXMD7VQG1dfu2tLW92t77unaxam+t9nbTXze9tW61Vdqvgre9FXcqUHFBFiUEkX0PgSQkmf1zzu+Pzz6ZhBBwg3l4kZn5fM7yPM8553me85znnAMZyEAGMpCBDGQgAxnIQAYykIEMZCADGchABjKQgQxkIAMZyEAGMpCBDGQgAxnIQAYykIEMZCADGchABjKQgQxkIAMZyEAGMpCBDGQgAxnIQAYykIEMZCADGchABjKQgQxkIAMZyEAGMpCBDGQgAxnIQAYykIEMZCADGchABjKQgQxkIAMZyEAGMpCBDGQgAxnIQAqIiy66SDXM/SW7DyUQgEIBAiFAKTOZQn8p7N/OQhB6PgFCgUI43ull6mmwyhUolFWJMB.......="
[2] => data-filename="Screenshot from 2016-02-09 21:54:47.png"
)
[1] => Array
(
[0] => style
[1] => src
[2] => data-filename
)
[2] => Array
(
[0] => width: 462px;
[1] => data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAg4AAALoCAYAAAAQpn2mAAAABHNCSVQICAgIfAhkiAAAABl0RVh0U29mdHdhcmUAZ25vbWUtc2NyZWVuc2hvdO8Dv4AACAASURBVHic7L15fNTVufj/PjOTyWSyTfaEJBD2EJBNQFQEtFVRXMD7VQG1dfu2tLW92t77unaxam+t9nbTXze9tW61Vdqvgre9FXcqUHFBFiUEkX0PgSQkmf1zzu+Pzz6ZhBBwg3l4kZn5fM7yPM8553me85znnAMZyEAGMpCBDGQgAxnIQAYykIEMZCADGchABjKQgQxkIAMZyEAGMpCBDGQgAxnIQAYykIEMZCADGchABjKQgQxkIAMZyEAGMpCBDGQgAxnIQAYykIEMZCADGchABjKQgQxkIAMZyEAGMpCBDGQgAxnIQAYykIEMZCADGchABjKQgQxkIAMZyEAGMpCBDGQgAxnIQAqIiy66SDXM/SW7DyUQgEIBAiFAKTOZQn8p7N/OQhB6PgFCgUI43ull6mmwyhUolFWJMB.......=
[2] => Screenshot from 2016-02-09 21:54:47.png
)
)
How can I fix this regex to get the correct result?
Keep in mind that I use this regex not only for extracting image attributes; it is a universal regex for all types of HTML tags.
(\S+?)=["']?((?:.(?!["']?\s+(?:\S+)=|[>"']))+.)["']?
The change is to make the attribute name evaluation lazy, so it only eats until it finds an =.
Working example on regex101
That being said, I'm fairly confident this regex can be reduced.
([^\s=]+)=('?)("?)([^>"']*)\2\3 is probably the best option:
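It takes about 2% of the time of the lazy version and handles both single-quoted and double-quoted attributes. The big change here is that the capture groups you want are the 1st and 4th. Because the value is matched with [^>"']*, it stops at any quote character or >, so as far as I'm aware this will work on any HTML except values that contain one of those characters, such as tag='"value'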
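As a rough illustration of the quoted-value pattern above, here is a minimal sketch. The attribute string is a shortened stand-in for the one in the question (the base64 payload is abbreviated but still ends in "="), and the variable names are only for illustration.

```php
<?php
// Shortened stand-in for the attribute string in the question; the base64
// payload is abbreviated but still ends in "=" like the original.
$attrs = 'style="width: 462px;" src="data:image/png;base64,iVBORw0KGgo=" '
       . 'data-filename="Screenshot from 2016-02-09 21:54:47.png"';

// Group 1: attribute name. Groups 2 and 3: optional opening quote.
// Group 4: the value. \2\3 requires the same quote to close the value.
$pattern = '/([^\s=]+)=(\'?)("?)([^>"\']*)\2\3/';

preg_match_all($pattern, $attrs, $m);

$names  = $m[1]; // attribute names
$values = $m[4]; // attribute values without the surrounding quotes

// Pair each name with its value for easier inspection.
print_r(array_combine($names, $values));
```

The trailing "=" in the src value is no longer mistaken for the start of a new attribute, because the name group cannot contain "=" or whitespace and the value group is bounded by the quotes.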
regex101
I'm trying to find a regex capable of capturing the content of shortcodes produced in WordPress.
My shortcodes have the following structure:
[shortcode name param1="value1" param2="value2" param3="value3"]
The number of parameters is variable.
I need to capture the shortcode name, the parameter name and its value.
The closest result I have achieved is with this:
/(?:\[(.*?)|\G(?!^))(?=[^][]*])\h+([^\s=]+)="([^\s"]+)"/
If I have the following content in the same string:
[specs product="test" category="body"]
[pricelist keyword="216"]
[specs product="test2" category="network"]
I get this:
0=>array(
0=>[specs product="test"
1=> category="body"
2=>[pricelist keyword="216"
3=>[specs product="test2"
4=> category="network")
1=>array(
0=>specs
1=>
2=>pricelist
3=>specs
4=>)
2=>array(
0=>product
1=>category
2=>keyword
3=>product
4=>category)
3=>array(
0=>test
1=>body
2=>216
3=>test2
4=>network)
)
I have tried different regex patterns, but I always end up with the same issue: if a shortcode has more than one parameter, the shortcode name is not captured for the additional parameters.
Do you have any idea of how I could achieve this?
Thanks
Laurent
You could make use of the \G anchor using 3 capture groups, where capture group 1 is the name of the shortcode, and group 2 and 3 the key value pairs.
Then you can remove the first entry of the array, and remove the empty entries in the 1st, 2nd and 3rd entry.
This is a slightly updated pattern
(?:\[(?=[^][]*])(\w+)|\G(?!^))\h+(\w+)="([^"]+)"
Regex demo | Php demo
Example
$s = '[specs product="test" category="body"]';
$pattern = '/(?:\[(?=[^][]*])(\w+)|\G(?!^))\h+(\w+)="([^"]+)"/';
$strings = [
'[specs product="test" category="body"]',
'[pricelist keyword="216"]',
'[specs product="test2" category="network" key="value"]'
];
foreach($strings as $s) {
if (preg_match_all($pattern, $s, $matches)) {
unset($matches[0]);
$matches = array_map('array_filter', $matches);
print_r($matches);
}
}
Output
Array
(
[1] => Array
(
[0] => specs
)
[2] => Array
(
[0] => product
[1] => category
)
[3] => Array
(
[0] => test
[1] => body
)
)
Array
(
[1] => Array
(
[0] => pricelist
)
[2] => Array
(
[0] => keyword
)
[3] => Array
(
[0] => 216
)
)
Array
(
[1] => Array
(
[0] => specs
)
[2] => Array
(
[0] => product
[1] => category
[2] => key
)
[3] => Array
(
[0] => test2
[1] => network
[2] => value
)
)
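If the shortcodes all live in one string, as in the question, the same pattern can also be consumed with PREG_SET_ORDER and folded into one entry per shortcode. This is only a sketch: the $shortcodes structure and its 'name' and 'params' keys are assumptions made for illustration, not part of the answer above.

```php
<?php
// Same pattern as in the answer above.
$pattern = '/(?:\[(?=[^][]*])(\w+)|\G(?!^))\h+(\w+)="([^"]+)"/';

$text = "[specs product=\"test\" category=\"body\"]\n"
      . "[pricelist keyword=\"216\"]";

$shortcodes = [];
preg_match_all($pattern, $text, $sets, PREG_SET_ORDER);

foreach ($sets as $m) {
    // Group 1 is non-empty only for the first parameter of a shortcode,
    // so a non-empty name marks the start of a new shortcode.
    if ($m[1] !== '') {
        $shortcodes[] = ['name' => $m[1], 'params' => []];
    }
    // Attach the key/value pair to the most recently opened shortcode.
    $shortcodes[count($shortcodes) - 1]['params'][$m[2]] = $m[3];
}

print_r($shortcodes);
```

Grouping this way avoids the array_filter step and keeps each parameter attached to the shortcode it belongs to.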
I need to get all "src" values from a text. "src" can be quoted with " or '.
The values are found fine, but if the element also has id, style, etc., those get grabbed as well.
I only need the src value.
My code:
$html = 'text text <img src="img1.png"/> as as <img src=\'second.gif\' id ="test" /> as';
preg_match_all('/src=("|\')([^"]*)("|\')/', $html, $htmlSrc);
echo '<pre>';
print_r($htmlSrc);
Array
(
[0] => Array
(
[0] => src="img1.png"
[1] => src='second.gif' id ="
)
[1] => Array
(
[0] => "
[1] => '
)
[2] => Array
(
[0] => img1.png
[1] => second.gif' id =
)
[3] => Array
(
[0] => "
[1] => "
)
)
Regexp is a bad idea and you will probably end up with unmaintainable and unreliable code. It would be easy and reliable if you use an HTML parser. You can find an example here: http://simplehtmldom.sourceforge.net/
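For comparison, a parser-based version could look like the sketch below. It uses PHP's bundled DOMDocument class rather than the Simple HTML DOM library linked above, and it assumes the dom extension is enabled; the $srcs variable name is just for illustration.

```php
<?php
$html = 'text text <img src="img1.png"/> as as <img src=\'second.gif\' id ="test" /> as';

// Collect parser warnings internally instead of printing them,
// since the input is an HTML fragment rather than a full document.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

// Read the src attribute of every <img>, whichever quote style was used.
$srcs = [];
foreach ($doc->getElementsByTagName('img') as $img) {
    $srcs[] = $img->getAttribute('src');
}

print_r($srcs);
```

Other attributes such as id or style never leak into the result, because the parser separates attributes before the code reads them.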
preg_match_all('/src=("|\')([^"\']*)\1/', $html, $htmlSrc);
print_r($htmlSrc[2]);
Seems to work better.
I have preg_match_all function:
preg_match_all('#<h2>(.*?)</h2>#is', $source, $output, PREG_SET_ORDER);
It works as intended, BUT the problem is that it seems to match every item twice and returns a huge multidimensional array. In this example it matched all 11 items as intended, but each one appears twice inside a multidimensional array:
Array
(
[0] => Array
(
[0] => <h2>10. <em>Cruel</em> by St. Vincent</h2>
[1] => 10. <em>Cruel</em> by St. Vincent
)
[1] => Array
(
[0] => <h2>9. <em>Robot Rock</em> by Daft Punk</h2>
[1] => 9. <em>Robot Rock</em> by Daft Punk
)
[2] => Array
(
[0] => <h2>8. <em>Seven Nation Army</em> by the White Stripes</h2>
[1] => 8. <em>Seven Nation Army</em> by the White Stripes
)
[3] => Array
(
[0] => <h2>7. <em>Do You Want To</em> by Franz Ferdinand</h2>
[1] => 7. <em>Do You Want To</em> by Franz Ferdinand
)
[4] => Array
(
[0] => <h2>6. <em>Teenage Dream</em> by Katie Perry</h2>
[1] => 6. <em>Teenage Dream</em> by Katie Perry
)
[5] => Array
(
[0] => <h2>5. <em>Crazy</em> by Gnarls Barkley</h2>
[1] => 5. <em>Crazy</em> by Gnarls Barkley
)
[6] => Array
(
[0] => <h2>4. <em>Kids</em> by MGMT</h2>
[1] => 4. <em>Kids</em> by MGMT
)
[7] => Array
(
[0] => <h2>3. <em>Bad Romance</em> by Lady Gaga</h2>
[1] => 3. <em>Bad Romance</em> by Lady Gaga
)
[8] => Array
(
[0] => <h2>2. <em>Pumped Up Kicks</em> by Foster the People</h2>
[1] => 2. <em>Pumped Up Kicks</em> by Foster the People
)
[9] => Array
(
[0] => <h2>1. <em>Paradise</em> by Coldplay</h2>
[1] => 1. <em>Paradise</em> by Coldplay
)
[10] => Array
(
[0] => <h2>Song That Get Stuck In Your Head YouTube Playlist</h2>
[1] => Song That Get Stuck In Your Head YouTube Playlist
)
)
How can I convert this array into a simple one without the duplicated items? Thank you very much.
You will always get a multidimensional array back, however, you can get close to what you want like this:
if (preg_match_all('#<h2>(.*?)</h2>#is', $source, $output, PREG_PATTERN_ORDER))
$matches = $output[0]; // reduce the multi-dimensional array to the array of full matches only
And if you don't want the submatch at all, then use a non-capturing grouping:
if (preg_match_all('#<h2>(?:.*?)</h2>#is', $source, $output, PREG_PATTERN_ORDER))
$matches = $output[0]; // reduce the multi-dimensional array to the array of full matches only
Note that this call to preg_match_all is using PREG_PATTERN_ORDER instead of PREG_SET_ORDER:
PREG_PATTERN_ORDER Orders results so that $matches[0] is an array of
full pattern matches, $matches[1] is an array of strings matched by
the first parenthesized subpattern, and so on.
PREG_SET_ORDER Orders results so that $matches[0] is an array of first
set of matches, $matches[1] is an array of second set of matches, and
so on.
See: http://php.net/manual/en/function.preg-match-all.php
Use
#<h2>(?:.*?)</h2>#is
as your regex. If you use a non capturing group (which is what ?: signifies), a backreference won't show up in the array.
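If only the text inside each heading is wanted, one possible approach is to keep the capture group and pull out its column. The sketch below assumes a shortened $source with two headings; array_column and strip_tags are standard PHP functions, and the variable names are illustrative.

```php
<?php
// Shortened stand-in for the page source in the question.
$source = '<h2>2. <em>Kids</em> by MGMT</h2><p>text</p>'
        . '<h2>1. <em>Paradise</em> by Coldplay</h2>';

// PREG_SET_ORDER yields one [full match, group 1] pair per heading.
preg_match_all('#<h2>(.*?)</h2>#is', $source, $output, PREG_SET_ORDER);

// Flatten to a simple list holding only the captured inner text.
$titles = array_column($output, 1);

// Optionally drop the inline <em> markup as well.
$plain = array_map('strip_tags', $titles);

print_r($plain);
```

With PREG_PATTERN_ORDER the same list is available directly as $output[1], so array_column is only needed when the set-ordered result is used elsewhere.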
I have this string/content:
#Salome, #Jessi H and #O'Ren were playing at the #Lean's yard with "#Ziggy" the mouse.
Well, I am trying to get all the names marked above. I have used the # symbol to create a kind of hash tag to be used on my website. As you may note, there are names containing spaces, like #Jessi H, and names with characters before and after, like #Ziggy. So I don't mind if you suggest another way to manage the hash so that it works correctly. I was thinking that users whose names contain white space could write the hash with quotes, like #"Jessi H". What do you think? Other examples:
#Lean's => #"Lean"'s
#Jessi H => #"Jessi H"
"#Jessi H" => (sorry, I don't know how to parse it)
#O'Ren => #"O'Ren"
What have I tried?
I'm just starting to use regex in PHP, but some SO questions have been useful for getting started, so these are my attempts using the preg_match_all function:
Result of /#(.*?)[,\" ]/:
Array ( [0] => Salome [1] => Jessi [2] => Charlie [3] => Lean's [4] => Ziggy" ) )
Result of /#"(.*?)"/ for names like #"name":
Empty array
Guys, I don't expect that you do it all for me. I think that a pseudo-code or something like this will be helpful to guide me to the right direction.
Try the following regex: '/#(?:"([^"]+)|([^\b]+?))\b/'
This will return two match groups, the first containing any quoted names (e.g. #"Jessi H" and #"O'Ren"), and the second containing any unquoted names (e.g. #Salome, #Lean).
$matches = array();
preg_match_all('/#(?:"([^"]+)|([^\b]+?))\b/', '#Salome, #"Jessi H" and #"O\'Ren" were playing at the #Lean\'s yard with "#Ziggy" the mouse.', $matches);
print_r($matches);
Output:
Array
(
[0] => Array
(
[0] => #Salome
[1] => #"Jessi H
[2] => #"O'Ren
[3] => #Lean
[4] => #Ziggy
)
[1] => Array
(
[0] =>
[1] => Jessi H
[2] => O'Ren
[3] =>
[4] =>
)
[2] => Array
(
[0] => Salome
[1] =>
[2] =>
[3] => Lean
[4] => Ziggy
)
)
Are you setting these requirements, or can you choose them? If you can set the requirements, I would suggest using _ instead of spaces, which would allow you to use the regex (the quantifier is lazy so that each match stops at the first space):
/#(.+?) /
If spaces must be allowed and you're going with quotes, then the quotes should probably span the entire name, allowing for this regex:
/#\"(.+?)\" /
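To handle quoted and unquoted names in one pass, one option is a PCRE branch-reset group, (?|...), which makes both alternatives write into the same capture group. The sketch below assumes the quoting convention proposed in the question (#"Jessi H", #"Lean"'s) and treats unquoted names as plain word characters; both are assumptions, not requirements stated by the answers.

```php
<?php
// Input written with the quoting convention proposed in the question.
$text = '#Salome, #"Jessi H" and #"O\'Ren" were playing at the '
      . '#"Lean"\'s yard with "#Ziggy" the mouse.';

// (?| ... ) is a branch-reset group: the quoted and the unquoted
// alternative both capture into group 1.
preg_match_all('/#(?|"([^"]+)"|(\w+))/', $text, $matches);

$names = $matches[1];
print_r($names);
```

Because every name lands in group 1, there is no need to merge two partially empty result arrays afterwards.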
I have some text strings like this
{hello|hi}{there|you}
I want to count the instances of {..anything..}, so in the example above, I would want to return:
hello|hi
there|you
in the matches array created by preg_match_all()
Right now my code looks like:
preg_match_all('/{(.*?)}/', $text,$text_pieces);
And $text_pieces contains:
Array ( [0] => Array ( [0] => {hello|hi} [1] => {there|you} ) [1] => Array ( [0] => hello|hi [1] => there|you ) )
All I need is this:
[0] => hello|hi [1] => there|you
preg_match_all cannot omit the full text matches, only subpattern matches, therefore the only solution is to set $text_pieces to $text_pieces[1] after the function call:
if(preg_match_all('/{(.*?)}/', $text,$text_pieces))
{
$text_pieces = $text_pieces[1];
}