preg_replace json string match same character beginning/end


OK, so what I have is a JSON string which can contain one or many elements. Below I've put an example of the string, but this is only an example; the real string is much more complicated. This one highlights the issues I'm having.
{"elements":[{"id":2,"string":"something","string2":"","string3":"no html here","integer":2,"array":{"options":[{"id":1,"value":"data"},{"id":2,"value":"more data"}]},"string4":"text with <a href=\"http:\/\/www.example.com\">html<\/a>","string5":"naughty <a href=\"http:\/\/www.example.com\">link<\/a>"},{"id":2,"string":"something","string2":"","string3":"no html here","integer":2,"array":{"options":[{"id":1,"value":"data"},{"id":2,"value":"more data"}]},"string4":"text with <a href=\"http:\/\/www.example.com\">html<\/a>","string5":"naughty <a href=\"http:\/\/www.example.com\">link<\/a>"}]}
What I'm trying to do is match all of the strings (the data type, not the key name) in the JSON data and then, depending on whether each one is allowed to contain HTML or not (using a blacklist), strip out the HTML. I'm no regex expert, so I can't work out what's going wrong.
Here is my regex:-
([{,]"(?!(elements|string3|string4)":)(.*?)":)(?!,")"(.*?)",
I'm having two issues with it:-
It is matching elements with both integers and arrays by simply jumping to the " found within the next string. I expected the match to fail and move on.
I can't get it to handle the \" in the URL, so I need the , on the end of the regex, but this then stops the next string from matching. I tried \G but this seemed to have no effect; I have a feeling it starts after the , in the previous match. I also tried a number of solutions that were supposed to allow for escaped text, but these all failed to work in my case.
The thought was that this would be quicker than converting the JSON string into an object and then traversing the array of hundreds of elements to remove the HTML. If that's quicker, then I'll just do that; it'll be a whole lot easier.

Don't work on the JSON directly; decode it using json_decode().
Then clean up your HTML using HTMLPurifier, which does a great job at cleaning HTML code.
Then encode your data to JSON again using json_encode().
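As a minimal sketch of that decode, clean, re-encode round trip: the block below uses PHP's built-in strip_tags() as a stand-in for HTMLPurifier (which is a separate library), and the $allowHtml list and cleanStrings() helper are hypothetical names made up for illustration.

```php
<?php
// Hypothetical list of keys whose string values may keep their HTML.
$allowHtml = array('string4');

// Recursively walk the decoded data and strip tags from every string
// value whose key is not in the allow list.
function cleanStrings($value, array $allowHtml, $key = null)
{
    if (is_array($value)) {
        foreach ($value as $k => $v) {
            $value[$k] = cleanStrings($v, $allowHtml, $k);
        }
        return $value;
    }
    if (is_string($value) && !in_array($key, $allowHtml, true)) {
        // strip_tags() is only a stand-in here; swap in HTMLPurifier as needed.
        return strip_tags($value);
    }
    return $value; // integers, booleans and null pass through untouched
}

$json = '{"elements":[{"id":2,"string3":"no html here",'
      . '"string4":"text with <a href=\"http:\/\/www.example.com\">html<\/a>",'
      . '"string5":"naughty <a href=\"http:\/\/www.example.com\">link<\/a>"}]}';

$data  = json_decode($json, true);        // true: decode objects as arrays
$clean = cleanStrings($data, $allowHtml);

echo json_encode($clean), "\n";
```

Because only string values are ever touched, the integers and nested arrays that tripped up the regex need no special handling here.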

Description
There were several problems with your expression. For example, .*? keeps consuming characters until the next required character is matched, so it can run past the closing quote into the next key or value. I replaced it with [^"]*?, which matches only non-quote characters and so forces the capture to stop at the end of the quoted group.
I also made a capture group for the open quote, (["]), written as (") in the expression below. Although probably overkill, this allows you to simply add a single quote to the character class. I then refer back to this captured group later to ensure that the close quote is the same character as the open quote. If the quotes are optional in your input string, put the question mark inside the group, (["]?), so that the group captures an empty string and the back-reference matches nothing when there is no open quote.
I also moved the [{,] outside the capture group.
This is my cleaned-up version of the regex:
[{,]((")(?!(elements|string3|string4)\2:)([^"]*?)\2:)(")([^"]*?)\5(?=,)
PHP Code Example:
<?php
$sourcestring="your source string";
preg_match_all('/[{,]((")(?!(elements|string3|string4)\2:)([^"]*?)\2:)(")([^"]*?)\5(?=,)/i',$sourcestring,$matches);
echo "<pre>".print_r($matches,true);
?>
$matches Array:
(
[0] => Array
(
[0] => ,"string0":"something0"
[1] => ,"string1":""
[2] => ,"string":"something"
[3] => ,"string5":""
)
[1] => Array
(
[0] => "string0":
[1] => "string1":
[2] => "string":
[3] => "string5":
)
[2] => Array
(
[0] => "
[1] => "
[2] => "
[3] => "
)
[3] => Array
(
[0] =>
[1] =>
[2] =>
[3] =>
)
[4] => Array
(
[0] => string0
[1] => string1
[2] => string
[3] => string5
)
[5] => Array
(
[0] => "
[1] => "
[2] => "
[3] => "
)
[6] => Array
(
[0] => something0
[1] =>
[2] => something
[3] =>
)
)
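One thing [^"]* does not cover is the escaped \" from the question, since it stops at the first quote it sees, escaped or not. A common way to allow for that, sketched here under the assumption that the input is well-formed JSON, is to treat a backslash plus the character after it as a single unit. The sample string is made up for illustration.

```php
<?php
// Sample input (assumed): one value contains escaped quotes.
$json = '{"a":"say \"hi\"","n":2,"b":""}';

// Body of a JSON string: either a character that is neither a quote nor a
// backslash, or a backslash followed by any one character.
// In a single-quoted PHP string, \\\\ reaches the regex engine as \\.
$pattern = '/"((?:[^"\\\\]|\\\\.)*)"/';

preg_match_all($pattern, $json, $matches);
print_r($matches[1]);
```

This matches keys as well as values, so it is only the string-matching building block, not a replacement for the full expression above.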

Related

Regex match HTML tag and attributes [duplicate]

This question already has answers here:
RegEx match open tags except XHTML self-contained tags
(35 answers)
Closed 9 years ago.
I'm trying to match an HTML tag name along with its attributes. In the example below, I am trying to match div, class, style and id.
$html='<div class="nav" style="float:left;" id="navigation">';
preg_match_all("/(([^<]\w+\s)|(\S+)=)/", $html, $match);
This returns the array like below.
As you can see, the correct results are kept in Array[2] and Array[3]. I was wondering if it is possible to put the results in a single array, perhaps in Array[1]? I'm not sure how to do this.
Array
(
[0] => Array
(
[0] => div
[1] => class=
[2] => style=
[3] => id=
)
[1] => Array
(
[0] => div
[1] => class=
[2] => style=
[3] => id=
)
[2] => Array
(
[0] => div
[1] =>
[2] =>
[3] =>
)
[3] => Array
(
[0] =>
[1] => class
[2] => style
[3] => id
)
)

You can use this simple regex:
(?<=<)\w++|\b\w++(?==)
where (?<=...) is a lookbehind and (?=...) is a lookahead.
Example:
preg_match_all('~(?<=<)\w++|\b\w++(?==)~', $html, $matches);
print_r($matches);
But if you use several capturing groups and want the results in a single array, you can use the branch reset feature, (?|...), which numbers the groups in each alternative starting from the same index. Example (without lookarounds):
preg_match_all('~(?|<(\w++)|\b(\w++)=)~', $html, $matches);
(About the ++: it is a possessive quantifier, which tells the regex engine that it doesn't need to backtrack into the quantified token, so backtrack positions are not recorded. This improves the performance of the pattern, but it is not essential, in particular for small strings.)
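To see the two variants side by side, here is a small sketch run against the $html string from the question. The variable names $whole and $grouped are just illustrative.

```php
<?php
$html = '<div class="nav" style="float:left;" id="navigation">';

// Lookaround version: nothing is captured, the names are the full matches.
preg_match_all('~(?<=<)\w++|\b\w++(?==)~', $html, $whole);

// Branch reset version: both alternatives write into capture group 1.
preg_match_all('~(?|<(\w++)|\b(\w++)=)~', $html, $grouped);

print_r($whole[0]);   // names as whole matches
print_r($grouped[1]); // the same names, collected in group 1
```

With the lookaround pattern the names end up in index 0; with the branch reset pattern index 0 still holds the surrounding "<" or "=", and the clean names are in index 1.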

Regular Expression to get all links with certain extensions

I'm looking for a regular expression that will grab all of the URLs that have the extensions in the following array:
Array
(
[0] => mp4
[1] => m4v
[2] => webm
[3] => ogv
[4] => wmv
[5] => flv
)
This array is returned by an internal WordPress function called wp_get_video_extensions() and lists the video extensions that WordPress recognizes.
A block of content would look like this, with URLs inside it:
'Yes, but I grow at a reasonable pace,' said the Dormouse: 'not in
that ridiculous fashion.' And he got up very sulkily and crossed over
to the other side of the court.
All this time the Queen had never left off staring at the Hatter, and,
just as the Dormouse crossed the court, she said to one of the
officers of the court, 'Bring me the list of the singers in the last
concert!' on which the wretched Hatter trembled so, that he shook both
his shoes off.
[video mp4="http://www.example.com/files/video/video1.mp4"][/video]
'Give your evidence,' the King repeated angrily, 'or I'll have you
executed, whether you're nervous or not.'
http://www.example.com/files/video/video2.flv
'I'm a poor man, your Majesty,' the Hatter began, in a trembling
voice, '—and I hadn't begun my tea—not above a week or so—and what
with the bread-and-butter getting so thin—and the twinkling of the
tea—'
I am trying to get it to find both of the video URLs in there and return the entire URL in the array.
Here is what I have:
preg_match_all( '/^https?:\/\/(?:[a-z\-]+\.)+[a-z]{2,6}(?:/[^/#?]+)+\.(?:' . implode( '|', wp_get_video_extensions() ) . ')$/', $post->post_content, $matches);
And I am getting this:
Warning: preg_match_all(): Unknown modifier '['
Ideally, I would like to get this:
Array
(
[0] => Array
(
[0] => http://www.example.com/files/video/video1.mp4
[1] => http://www.example.com/files/video/video2.flv
)
[1] => Array
(
[0] => http://www.example.com/
[1] => http://www.example.com/
)
[2] => Array
(
[0] => files/video/
[1] => files/video/
)
[3] => Array
(
[0] => video1.mp4
[1] => video2.flv
)
)
But this would also be perfect, as I can use parse_url() to break the rest out later on:
Array
(
[0] => http://www.example.com/files/video/video1.mp4
[1] => http://www.example.com/files/video/video2.flv
)

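Your first problem is that you didn't escape all the "/" characters. Because "/" is also the pattern delimiter, the first unescaped "/" ends the pattern and the "[" after it is read as a modifier, hence the warning. The second problem is that the ^ and $ anchors only allow a match that spans the whole string, so URLs embedded in other text are never found. This should take care of it.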
preg_match_all('~https?://(?:[a-z\-]+\.)+[a-z]{2,6}(?:/[^/#?]+)+\.(?:' . implode( '|', wp_get_video_extensions() ) . ')~', $post->post_content, $matches);
Using "~" makes it so you don't have to escape the "/".
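For trying this outside WordPress, here is a rough, self-contained sketch. The $extensions array is a stand-in for wp_get_video_extensions(), the sample $content is shortened from the question, and the path character class is a variation on the pattern above that also excludes whitespace and quotes, so a path segment cannot run on into the surrounding text. That tightening is my assumption, not part of the original pattern.

```php
<?php
// Stand-in for wp_get_video_extensions(), which only exists inside WordPress.
$extensions = array('mp4', 'm4v', 'webm', 'ogv', 'wmv', 'flv');

// Sample content (assumed), shortened from the question.
$content = "[video mp4=\"http://www.example.com/files/video/video1.mp4\"][/video]\n"
         . "'Give your evidence,' the King repeated angrily.\n"
         . "http://www.example.com/files/video/video2.flv\n"
         . "'I'm a poor man, your Majesty,' the Hatter began.";

// Quote each extension so it is safe inside a pattern delimited by "~".
$alternatives = implode('|', array_map(function ($ext) {
    return preg_quote($ext, '~');
}, $extensions));

// Path segments may not contain "/", "#", "?", whitespace or quotes.
$pattern = '~https?://(?:[a-z\-]+\.)+[a-z]{2,6}(?:/[^/#?\s"\']+)+\.(?:'
         . $alternatives . ')~i';

preg_match_all($pattern, $content, $matches);
print_r($matches[0]);
```

$matches[0] holds the full URLs, which can then be passed to parse_url() as mentioned in the question.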

Get name from hashtag using regex

I have this string/content:
#Salome, #Jessi H and #O'Ren were playing at the #Lean's yard with "#Ziggy" the mouse.
Well, I am trying to get all of the names highlighted above. I have used the # symbol as a kind of hashtag for use on my website. As you can see, there are names with spaces in them, like #Jessi H, and names with characters before and after them, like "#Ziggy". I don't mind if you suggest another way to handle the hashtags so that this works correctly. I was thinking that users whose names contain white space could write the hashtag with quotes, like #"Jessi H". What do you think? Other examples:
#Lean's => #"Lean"'s
#Jessi H => #"Jessi H"
"#Jessi H" => (sorry, I don't know how to parse it)
#O'Ren => #"O'Ren"
What have I tried?
I'm just starting to use regex in PHP, but some SO questions have been useful in getting me started. These are my attempts, using the preg_match_all function:
Result of /#(.*?)[,\" ]/:
Array ( [0] => Salome [1] => Jessi [2] => Charlie [3] => Lean's [4] => Ziggy" ) )
Result of /#"(.*?)"/ for names like #"name":
Empty array
Guys, I don't expect you to do it all for me. I think pseudo-code or something similar would be enough to point me in the right direction.

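Try the following regex: '/#(?:"([^"]+)|([^\b]+?))\b/'
This will return two match groups, the first containing any quoted names (e.g. #"Jessi H" and #"O'Ren"), and the second containing any unquoted names (e.g. #Salome, #Lean). Note that inside a character class \b means a backspace character, not a word boundary, so [^\b]+? lazily matches any characters up to the first word boundary that follows.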
$matches = array();
preg_match_all('/#(?:"([^"]+)|([^\b]+?))\b/', '#Salome, #"Jessi H" and #"O\'Ren" were playing at the #Lean\'s yard with "#Ziggy" the mouse.', $matches);
print_r($matches);
Output:
Array
(
[0] => Array
(
[0] => #Salome
[1] => #"Jessi H
[2] => #"O'Ren
[3] => #Lean
[4] => #Ziggy
)
[1] => Array
(
[0] =>
[1] => Jessi H
[2] => O'Ren
[3] =>
[4] =>
)
[2] => Array
(
[0] => Salome
[1] =>
[2] =>
[3] => Lean
[4] => Ziggy
)
)
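If you would rather have every name in one list, a possible variation (a sketch, not the pattern above) is to consume the closing quote as well and wrap the alternatives in a branch reset group, so that quoted and unquoted names both land in group 1. Using \w+ for unquoted names is an assumption about which characters a name may contain.

```php
<?php
$text = '#Salome, #"Jessi H" and #"O\'Ren" were playing at the '
      . '#Lean\'s yard with "#Ziggy" the mouse.';

// (?|...) is a branch reset group: each alternative's capture is group 1.
// Quoted names run to the closing quote; unquoted names are word characters.
preg_match_all('/#(?|"([^"]+)"|(\w+))/', $text, $matches);

print_r($matches[1]);
```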

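Are these requirements fixed, or can you choose them? If you can set the requirements, I would suggest using _ instead of spaces, which would allow you to use the regex:
/#(\w+)/
(\w+ stops at the first character that is not a letter, digit or underscore, so trailing punctuation is not captured, whereas a greedy .+ would run on to the last space in the text.)
If spaces must be allowed and you're going with quotes, then the quotes should probably span the entire name, allowing for this regex:
/#"([^"]+)"/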

PHP Subpattern without Numbering Array

Using preg_match with named subpatterns always returns an array with duplicated data: each match is stored once under the subpattern name and once under its numeric index. Because I'm matching hundreds of thousands of lines, each a few kilobytes long, I'm afraid the numeric entries are taking up extra memory. Is there a proper way to stop the numerically indexed entries from being returned?
Example:
<?php
header('Content-Type: text/plain');
$data = <<<START
I go to school.
He goes to funeral.
START;
preg_match_all('#^(?<who>.*?) go(es)* to (?<place>.*?)$#m', $data, $matches);
print_r($matches);
?>
Output:
Array
(
[0] => Array
(
[0] => I go to school.
[1] => He goes to funeral.
)
[who] => Array
(
[0] => I
[1] => He
)
[1] => Array
(
[0] => I
[1] => He
)
[2] => Array
(
[0] =>
[1] => es
)
[place] => Array
(
[0] => school.
[1] => funeral.
)
[3] => Array
(
[0] => school.
[1] => funeral.
)
)

From php.net, Subpatterns:
It is possible to name a subpattern using the syntax (?P<name>pattern). This subpattern will then be indexed in the matches array by its normal numeric position and also by name.
I see no option to get only the index by name.
So I think that if you don't want this data twice, the only possibility is not to use named groups.
Is this really an issue? IMO you should optimize this only if you actually run into problems because of the additional memory usage. The improved readability should be worth the memory!
Update
It looks like go(es)* should only match an optional "es". Here you can save memory by using a non-capturing group.
preg_match_all('#^(?<who>.*?) go(?:es)? to (?<place>.*?)$#m', $data, $matches);
When the group starts with ?:, the matched content is not stored. I also replaced the *, which means 0 or more and would also match "goeseses", with ?, which means 0 or 1.
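If the goal is simply to work with the named entries afterwards, one possible approach (a sketch, and not something that reduces memory use during the match itself) is to drop the numeric keys after the call with array_filter() and the ARRAY_FILTER_USE_KEY flag. The $named variable is just an illustrative name.

```php
<?php
$data = "I go to school.\nHe goes to funeral.";

preg_match_all('#^(?<who>.*?) go(?:es)? to (?<place>.*?)$#m', $data, $matches);

// Keep only the entries whose key is a string, i.e. the named groups.
// The numeric duplicates still exist in $matches until it is unset.
$named = array_filter($matches, 'is_string', ARRAY_FILTER_USE_KEY);
unset($matches);

print_r($named);
```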

problem with php regular expression

Hi, I have data in the format below:
<option value="http://www.torontoairportlimoflatrate.com/aurora-limousine-service.html">Aurora</option>
<option value="http://www.torontoairportlimoflatrate.com/alexandria-limousine-service.html">Alexandria</option>
After banging my head on the table ten times, I figured out that I could use the regular expression below:
preg_match_all("#>\w*#",$data,$result);
This returns the results as below
Array
(
[0] => Array
(
[0] => >Ajax
[1] => >
[2] => >Aurora
[3] => >
[4] => >Alexandria
[5] => >
[6] => >Alliston
I only want a single array containing the values, i.e.
cities
[0] => Ajax
[1] => Aurora
...... so on.
Pleas

If you'd prefer not to use an HTML parser, you can do it with a regex, but keep in mind that you'll probably need to modify it based on what you'll receive as input in the future. For your specific problem, this is a regex that does the job:
<?php
preg_match_all('/<option\svalue=\"([a-zA-Z0-9-_.\/:]+)\">([a-zA-Z\s]+)<\/option>/', $data, $result);
var_dump($result[2]);
Note:
If you want to match every URL, you should replace ([a-zA-Z0-9-_.\/:]+) with a more capable URL-matching regex. You can find some on Stack Overflow too, but which one to use is a matter of personal taste.
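For comparison, the HTML parser route could look roughly like this, using PHP's bundled DOMDocument class. The surrounding <select> element in the sample markup is an assumption, since the question only shows the option lines.

```php
<?php
// Sample markup (the <select> wrapper is assumed).
$data = '<select>'
      . '<option value="http://www.torontoairportlimoflatrate.com/aurora-limousine-service.html">Aurora</option>'
      . '<option value="http://www.torontoairportlimoflatrate.com/alexandria-limousine-service.html">Alexandria</option>'
      . '</select>';

// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($data);
libxml_clear_errors();

$cities = array();
foreach ($doc->getElementsByTagName('option') as $option) {
    // textContent is the text between <option ...> and </option>.
    $cities[] = trim($option->textContent);
}

print_r($cities);
```

The URL of each entry is available through $option->getAttribute('value') if it is needed as well.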
