Get name from hashtag using regex - php

I have this string/content:
#Salome, #Jessi H and #O'Ren were playing at the #Lean's yard with "#Ziggy" the mouse.
I am trying to extract all the names marked above. I use the # symbol as a hashtag marker on my website. Note that some names contain spaces, like #Jessi H, and some have characters before and after them, like "#Ziggy". I don't mind if you suggest another way of writing the hashtags so that this works correctly. I was thinking that users whose names contain whitespace could write the hashtag with quotes, like #"Jessi H". What do you think? Other examples:
#Lean's => #"Lean"'s
#Jessi H => #"Jessi H"
"#Jessi H" => (sorry, I don't know how to parse it)
#O'Ren => #"O'Ren"
What have I tried?
I'm just starting to use regex in PHP, but some SO questions were useful to get me started. These are my attempts so far, using the preg_match_all function:
Result of /#(.*?)[,\" ]/:
Array ( [0] => Salome [1] => Jessi [2] => Charlie [3] => Lean's [4] => Ziggy" ) )
Result of /#"(.*?)"/ for names like #"name":
Empty array
Guys, I don't expect that you do it all for me. I think that a pseudo-code or something like this will be helpful to guide me to the right direction.

Try the following regex: '/#(?:"([^"]+)|([^\b]+?))\b/'
This will return two match groups, the first containing any quoted names (e.g. #"Jessi H" and #"O'Ren"), and the second containing any unquoted names (e.g. #Salome, #Lean).
$matches = array();
preg_match_all('/#(?:"([^"]+)|([^\b]+?))\b/', '#Salome, #"Jessi H" and #"O\'Ren" were playing at the #Lean\'s yard with "#Ziggy" the mouse.', $matches);
print_r($matches);
Output:
Array
(
[0] => Array
(
[0] => #Salome
[1] => #"Jessi H
[2] => #"O'Ren
[3] => #Lean
[4] => #Ziggy
)
[1] => Array
(
[0] =>
[1] => Jessi H
[2] => O'Ren
[3] =>
[4] =>
)
[2] => Array
(
[0] => Salome
[1] =>
[2] =>
[3] => Lean
[4] => Ziggy
)
)
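If a single flat list of names is more convenient than two half-empty groups, the same idea can be sketched with a branch-reset group, so that quoted and unquoted names both land in group 1. This is only a sketch under assumptions: the helper name extractHashNames is hypothetical, unquoted names are assumed to consist of word characters only, and quoted names are assumed to carry a closing quote.

```php
<?php
// Hypothetical helper: collect hashtag names, quoted or unquoted.
function extractHashNames(string $text): array
{
    // (?|...) is a branch-reset group: both alternatives capture into group 1.
    //   "([^"]+)"  a quoted name, which may contain spaces or apostrophes
    //   (\w+)      an unquoted name made of word characters only
    preg_match_all('/#(?|"([^"]+)"|(\w+))/', $text, $matches);

    return $matches[1];
}

$text = '#Salome, #"Jessi H" and #"O\'Ren" were playing at the #Lean\'s yard with "#Ziggy" the mouse.';
print_r(extractHashNames($text));
```

For the sample sentence this yields Salome, Jessi H, O'Ren, Lean and Ziggy. An unquoted #O'Ren would stop at the apostrophe and give only O, which is why names like that need the quoted form.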

Are these requirements fixed, or can you choose them? If you can set them, I would suggest using _ instead of spaces in names, which would allow you to use the regex:
/#(\w+)/
If spaces must be allowed and you're going with quotes, then the quotes should span the entire name, which allows this regex:
/#"([^"]+)"/


preg_replace json string match same character beginning/end

OK, so what I have is a JSON string which can contain one or many elements. Below is an example of the string; the real string is much more complicated, but this one highlights the issues I'm having.
{"elements":[{"id":2,"string":"something","string2":"","string3":"no html here","integer":2,"array":{"options":[{"id":1,"value":"data"},{"id":2,"value":"more data"}]},"string4":"text with <a href=\"http:\/\/www.example.com\">html<\/a>","string5":"naughty <a href=\"http:\/\/www.example.com\">link<\/a>"},{"id":2,"string":"something","string2":"","string3":"no html here","integer":2,"array":{"options":[{"id":1,"value":"data"},{"id":2,"value":"more data"}]},"string4":"text with <a href=\"http:\/\/www.example.com\">html<\/a>","string5":"naughty <a href=\"http:\/\/www.example.com\">link<\/a>"}]}
What I'm trying to do is match all of the strings (the data type, not the key name) in the JSON data and then, depending on whether each one is allowed to contain HTML (using a blacklist), strip out the HTML. I'm no regex expert, so I can't work out what's going wrong.
Here is my regex:-
([{,]"(?!(elements|string3|string4)":)(.*?)":)(?!,")"(.*?)",
I'm having two issues with it:
It matches elements holding integers and arrays by simply jumping ahead to the " found in the next string. I expected the match to fail and move on.
I can't get it to handle the \" in the URL, so I need the , at the end of the regex, but this then stops the next string from matching. I tried \G but it seemed to have no effect; I have a feeling it starts after the , in the previous match. I also tried a number of solutions that were supposed to allow for escaped text, but they all failed in my case.
My thinking was that this would be quicker than converting the JSON string into an object and then traversing the array of hundreds of elements to remove the HTML. If that turns out to be quicker, I'll just do that, since it would be a whole lot easier.
Don't work on the json directly, decode it using json_decode().
Then cleanup your HTML using HTMLPurifier, which does a great job at cleaning HTML code.
Then encode your data to json again using json_encode().
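The three steps above can be sketched roughly as follows. To keep the example self-contained, PHP's built-in strip_tags() stands in for HTMLPurifier, which is a much cruder substitute; the function name stripDisallowedHtml and the list of keys allowed to keep HTML are assumptions made for illustration.

```php
<?php
// Hypothetical helper: decode, strip HTML from every string value whose key
// is not in $allowHtml, then encode again.
// strip_tags() is only a stand-in here; HTMLPurifier would replace it.
function stripDisallowedHtml(string $json, array $allowHtml): string
{
    $data = json_decode($json, true);
    if (!is_array($data)) {
        throw new InvalidArgumentException('Input is not a JSON object or array');
    }

    // Visits every leaf value; integers and nested arrays are never touched.
    array_walk_recursive($data, function (&$value, $key) use ($allowHtml) {
        if (is_string($value) && !in_array($key, $allowHtml, true)) {
            $value = strip_tags($value);
        }
    });

    return json_encode($data);
}

$json = '{"elements":[{"id":2,"string4":"text with <a href=\"http:\/\/www.example.com\">html<\/a>","string5":"naughty <a href=\"http:\/\/www.example.com\">link<\/a>"}]}';
echo stripDisallowedHtml($json, ['string3', 'string4']);
```

Because the decoder handles the escaped \" and \/ sequences, none of the quoting problems from the regex approach come up here.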
Description
There were several problems with your expression. For one, .*? keeps consuming characters, including quotes, until the next required character can be matched. I replaced it with [^"]*?, which matches only non-quote characters, so the capture cannot run past the end of the quoted group.
I also made a capture group for the opening quote, ("). This is probably overkill, but written as a character class, (["]), it lets you allow single quotes too by simply adding one to the class. I then refer back to this captured group later to ensure that the corresponding closing quote is the same character. If the opening quote is not required in your input string, you can insert a question mark, (["])?, and the back-reference will still tie the closing quote to whatever opening quote was found.
I also moved the [{,] outside the capture group.
This is my cleaned up version of the regex
[{,]((")(?!(elements|string3|string4)\2:)([^"]*?)\2:)(")([^"]*?)\5(?=,)
PHP Code Example:
<?php
$sourcestring="your source string";
preg_match_all('/[{,]((")(?!(elements|string3|string4)\2:)([^"]*?)\2:)(")([^"]*?)\5(?=,)/i',$sourcestring,$matches);
echo "<pre>".print_r($matches,true);
?>
$matches Array:
(
[0] => Array
(
[0] => ,"string0":"something0"
[1] => ,"string1":""
[2] => ,"string":"something"
[3] => ,"string5":""
)
[1] => Array
(
[0] => "string0":
[1] => "string1":
[2] => "string":
[3] => "string5":
)
[2] => Array
(
[0] => "
[1] => "
[2] => "
[3] => "
)
[3] => Array
(
[0] =>
[1] =>
[2] =>
[3] =>
)
[4] => Array
(
[0] => string0
[1] => string1
[2] => string
[3] => string5
)
[5] => Array
(
[0] => "
[1] => "
[2] => "
[3] => "
)
[6] => Array
(
[0] => something0
[1] =>
[2] => something
[3] =>
)
)

PHP regex preg_split - split by largest group only

I have the following regex
((\$|(\\\[)).*?(\$|(\\\])))
which should capture everything between $$ and \[\] and I tested it on http://gskinner.com/RegExr/ and it's working.
PHP variant is (doubled backslashes)
((\$|(\\\\\[)).*?(\$|(\\\\\])))
and I would like to split my text based on that regex. How can I tell preg_split to return only the first (and largest) group and not the smaller nested ones?
preg_split('/((\$|(\\\\\[)).*?(\$|(\\\\\])))/', $text, -1, PREG_SPLIT_DELIM_CAPTURE);
So for text This is my $test$ for something. I should get an array
[0] => This is my
[1] => $test$
[2] => for something.
But I get
[0] => This is my
[1] => $test$
[2] => $
[3] =>
[4] => $
[5] => for something.
You would need something like this:
$text = 'This is my $test$ for \[something\] new!';
print_r(preg_split('/(\$.*?\$|\\\\\[.*?\\\\\])/', $text, -1, PREG_SPLIT_DELIM_CAPTURE));
Output:
Array
(
[0] => This is my
[1] => $test$
[2] => for
[3] => \[something\]
[4] => new!
)
IMHO, your regex is (probably) wrong: it also matches mismatched delimiters, as in Hello $there\]. If you need to capture text between two $s or between a pair of \[ and \], then you need a regexp like:
          <-------------> Match text between \[ and \]
/(\$.*?\$|\\\\\[.*?\\\\\])/
  <-----> Match text between dollars
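The extra entries in the question's output appear because PREG_SPLIT_DELIM_CAPTURE returns every capturing group in the delimiter pattern, not just the outermost one. If the original pattern has to be kept as it is, mismatched delimiters and all, a minimal sketch is to turn the inner groups into non-capturing ones with (?:...) so that only the outer group is reported:

```php
<?php
$text = 'This is my $test$ for something.';

// Same pattern as in the question, but the inner groups use (?:...) and
// therefore add nothing to the result; only the outer group is captured.
$parts = preg_split(
    '/((?:\$|\\\\\[).*?(?:\$|\\\\\]))/',
    $text,
    -1,
    PREG_SPLIT_DELIM_CAPTURE
);

print_r($parts);
```

This gives three entries: the text before the delimiter, $test$ itself, and the text after it. The surrounding spaces stay attached to the outer pieces.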

PHP Subpattern without Numbering Array

Using preg_match with named subpatterns always returns each capture twice: once under the subpattern name and once under its numeric index. Because I'm matching hundreds of thousands of lines, each a few kilobytes long, I'm afraid the numeric entries are taking up extra memory. Is there a proper way to stop the numerically indexed entries from being returned?
Example:
<?php
header('Content-Type: text/plain');
$data = <<<START
I go to school.
He goes to funeral.
START;
preg_match_all('#^(?<who>.*?) go(es)* to (?<place>.*?)$#m', $data, $matches);
print_r($matches);
?>
Output:
Array
(
[0] => Array
(
[0] => I go to school.
[1] => He goes to funeral.
)
[who] => Array
(
[0] => I
[1] => He
)
[1] => Array
(
[0] => I
[1] => He
)
[2] => Array
(
[0] =>
[1] => es
)
[place] => Array
(
[0] => school.
[1] => funeral.
)
[3] => Array
(
[0] => school.
[1] => funeral.
)
)
From php.net- Subpatterns
It is possible to name a subpattern using the syntax (?P<name>pattern). This subpattern will then be indexed in the matches array by its normal numeric position and also by name.
I see no option to give only the index by name.
So, I think, if you don't want this data two times, the only possibility is: don't use named groups.
Is this really an issue? IMO optimize this only if you run into problems, because of this additional memory usage! The improved readability should be worth the memory!
Update
It looks like go(es)* is only meant to match an optional "es". Here you can save memory by using a non-capturing group.
preg_match_all('#^(?<who>.*?) go(?:es)? to (?<place>.*?)$#m', $data, $matches);
By starting the group with ?:, the matched content is not stored. I also replaced the *, which means 0 or more and would also match "goeseses", with ?, which means 0 or 1.
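If the goal is simply a result array with named keys only, one option is to drop the numeric keys after the match. This is a sketch, not a way to prevent the duplicates: preg_match_all still builds both sets of entries, so it only helps when the filtered array is what gets stored or passed along.

```php
<?php
$data = "I go to school.\nHe goes to funeral.";

preg_match_all('#^(?<who>.*?) go(?:es)? to (?<place>.*?)$#m', $data, $matches);

// Keep only the entries whose key is a string, i.e. the named groups.
$named = array_filter($matches, 'is_string', ARRAY_FILTER_USE_KEY);

print_r($named);
```

After filtering, only the who and place entries remain.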

PHP REGEX separate by different criteria

I'm trying to separate some strings by different criteria but I can't get the desired results.
Here are 3 examples:
$ppl[0] = "Balko, Vlado \"Panelák\" (2008) {Byt na tretom (#1.55)}";
$ppl[1] = "'Abd Al-Hamid, Ja'far A Two Hour Delay (2001)";
$ppl[2] = "'t Hoen, Frans De reünie (1963) (TV)";
I'm currently using this for the last 2:
$pattern = '#,|\t|\(#';
But I get an empty element.
result:
Array ( [0] => 'Abd Al-Hamid [1] => Ja'far [2] => A Two Hour Delay [3] => 2001) )
Array ( [0] => 't Hoen [1] => Frans [2] => [3] => De reünie [4] => 1963) [5] => TV) )
As for the 1st expression I used another pattern but I still get empty spaces. Any ideas?
EDIT:
Thanks this helped indeed. I tried using a modified version on the first string:
$pattern4 = '#[",\t]+|[{}]+|[()]+#';
However I still get an empty space:
Array ( [0] => Balko [1] => Vlado [2] => Panelák [3] => [4] => 2008 [5] => [6] => Byt na tretom [7] => #1.55 [8] => [9] => )
What should I do? I think that the " and the brackets are causing the problem but I don't know how to fix it.
I would surmise you have two tabs as separator in your second and third example string. (Can't see that here, the SO editor converts them into spaces).
But you could adapt your regex slightly in that case:
$pattern = '#,|\t+|\(#';
Or even simpler:
$pattern = '#[,\t(]+#';
And the alternative, btw, would be to just apply array_filter() to the result arrays to remove the empty entries.
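Both suggestions can be combined in a short sketch. It assumes that the name and title really are separated by two tab characters, which the rendered page no longer shows, and it adds the closing parenthesis to the character class so that 1963) becomes 1963.

```php
<?php
// Assumption: the original data has two tabs between the name and the title.
$line = "'t Hoen, Frans\t\tDe reünie (1963) (TV)";

// Split on runs of commas, tabs and parentheses, skipping empty pieces.
$parts = preg_split('#[,\t()]+#', $line, -1, PREG_SPLIT_NO_EMPTY);

// Trim stray spaces, then drop pieces that held nothing but whitespace.
$parts = array_values(array_filter(
    array_map('trim', $parts),
    function ($part) {
        return $part !== '';
    }
));

print_r($parts);
```

PREG_SPLIT_NO_EMPTY removes only pieces that are completely empty; the lone space between ") (" survives the split, which is why the trim and filter step is still needed.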

problem with php regular expression

Hi, I have data in the format below:
<option value="http://www.torontoairportlimoflatrate.com/aurora-limousine-service.html">Aurora</option>
<option value="http://www.torontoairportlimoflatrate.com/alexandria-limousine-service.html">Alexandria</option>
After banging my head on the table 10 times, I figured out that I could use the regular expression below:
preg_match_all("#>\w*#",$data,$result);
This returns the results as below
Array
(
[0] => Array
(
[0] => >Ajax
[1] => >
[2] => >Aurora
[3] => >
[4] => >Alexandria
[5] => >
[6] => >Alliston
I only want single array having values i.e.
cities
[0] => Ajax
[1] => Aurora
...... so on.
Pleas
If you'd prefer not to use an HTML parser, you can do it with a regex, but keep in mind that you'll probably need to modify it based on what you'll receive as input in the future. For your specific problem, this is a regex that does the job:
<?php
preg_match_all('/<option\svalue=\"([a-zA-Z0-9-_.\/:]+)\">([a-zA-Z\s]+)<\/option>/', $data, $result);
var_dump($result[2]);
Note:
If you want to match every URL, you should replace ([a-zA-Z0-9-_.\/:]+) with a more capable URL-matching regex. You can find some on Stack Overflow as well; which one to pick is largely a matter of personal taste.
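For comparison, the HTML parser route mentioned at the start of this answer might look like the sketch below, using PHP's bundled DOMDocument class. The surrounding select element is an assumption about the real markup, since the question shows only the option lines.

```php
<?php
// Assumption: the options sit inside a <select>, which the question omits.
$data = '<select>
<option value="http://www.torontoairportlimoflatrate.com/aurora-limousine-service.html">Aurora</option>
<option value="http://www.torontoairportlimoflatrate.com/alexandria-limousine-service.html">Alexandria</option>
</select>';

libxml_use_internal_errors(true); // keep parser warnings about the fragment quiet
$doc = new DOMDocument();
$doc->loadHTML($data);

// Collect the visible text of each <option> into one flat array.
$cities = [];
foreach ($doc->getElementsByTagName('option') as $option) {
    $cities[] = trim($option->textContent);
}

print_r($cities);
```

The URL can be read in the same loop with $option->getAttribute('value'), so no URL-matching regex is needed at all.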
