Regex CSV : Match quotes that are not delimiters - php

I'm working on a CSV file that was badly built. I created a regex that matches only the quotes that ARE NOT delimiters, and in this link I succeeded. However, can you optimize my regex so that it matches only the quotes and not the letters around them? The constraint is that the quotation marks at the beginning or at the end of a field must not be taken into account. Example:
"ModifTextePub";"ModifObservation";"Resume"Vitrine";"Observations"Criteres"";"InternetOK";"NumPhoto";"AmianteLe";"SNavantLe";"ActePrec";"ProprietairesPrec";"Situation";"FraisNotaires"
In this example, only the quote between Resume and Vitrine and the quotes around Criteres should be matched.
The regex I am using is
(.){1}(?<!;|\n|\r|\t)(")(?!;|\n|\r|\t)(.){1}
with $1$3 as replacement.

Your regex with negative lookarounds containing positive character classes can be transformed into a pattern with positive lookarounds containing negated character classes:
(?<=[^;\n\r\t])"(?=[^;\n\r\t])
See the regex demo. The replacement will be an empty string.
Now, a match will only occur when a " is immediately preceded and followed by any char other than ;, CR, LF or TAB.
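A minimal PHP sketch of the replacement described above, assuming the header line from the question (shortened to five fields) is held in a variable; the names $line and $fixed are made up for illustration:

```php
<?php
// Sample line from the question, shortened to the first five fields.
$line = '"ModifTextePub";"ModifObservation";"Resume"Vitrine";"Observations"Criteres"";"InternetOK"';

// Remove every quote that has a non-delimiter character on both sides.
$fixed = preg_replace('/(?<=[^;\n\r\t])"(?=[^;\n\r\t])/', '', $line);

echo $fixed, PHP_EOL;
// "ModifTextePub";"ModifObservation";"ResumeVitrine";"ObservationsCriteres";"InternetOK"
```

The quotes at the very start and end of the line survive because both lookarounds require an actual character to be present next to the quote.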

Related

Ignoring apostrophe while capturing contents in single quotes REGEX

The issue for me here is capturing the content inside single quotes (like 'xyz').
But the apostrophe, which is the same symbol as a single quote ('), is getting in the way!
The regex I've written is: /(\w\'\w)(*SKIP)(*F)|(\'[^\']*\')/
The example I have used is: Hello ma'am 'This is Prashanth's book.'
What needs to be captured is: 'This is Prashanth's book.'.
But what's captured is: 'This is Prashanth'!
Here is the link to what I tried in an online regex tester
Any help is greatly appreciated. Thank you!
You can't use [^\'] to capture text that itself contains a ', and in your example, This is Prashanth's book. contains a ' character within the text. You need to modify your regex to use .*? instead of [^\']*, so it can be written like this:
(\w'\w)(*SKIP)(*F)|('.*?'\B)
Demo with your updated regex
Also, you don't need to escape a single quote ', as it has no special meaning in regex.
From your example, it is not clear whether you want the captured match to include the surrounding ' or not. If you don't want the ' to be part of the match, you can use a lookaround-based regex like this:
(?<=\B').*?(?='\B)
Explanation of regex:
(?<=\B') - This positive lookbehind ensures that the match is preceded by a single quote, and \B ensures that this quote is not itself preceded by a word character
.*? - Captures the text in a non-greedy manner
(?='\B) - Ensures the matched text is followed by a single quote, and \B ensures it doesn't match a quote that is immediately followed by a word character. E.g. it won't treat the quote in 's as the ending quote
Demo
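A short PHP sketch of the two patterns above, run against the sample sentence from the question. The variable names are made up for illustration, and the quotes inside the patterns are escaped only because the patterns sit in single-quoted PHP strings:

```php
<?php
$text = "Hello ma'am 'This is Prashanth's book.'";

// Variant 1: the surrounding quotes are part of the match.
preg_match('/(\w\'\w)(*SKIP)(*F)|(\'.*?\'\B)/', $text, $withQuotes);

// Variant 2: lookarounds leave the quotes out of the match.
preg_match('/(?<=\B\').*?(?=\'\B)/', $text, $withoutQuotes);

echo $withQuotes[0], PHP_EOL;    // 'This is Prashanth's book.'
echo $withoutQuotes[0], PHP_EOL; // This is Prashanth's book.
```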
For the string you have provided, you can use the regex:
\B'\K(?:(?!'\B).)+
Click for Demo
Explanation:
\B - a non-word boundary
' - matches a '
\K - forget everything matched so far
(?:(?!'\B).)+ - matches 1+ occurrences of any character (except newline), as long as that character is not a ' followed by a non-word boundary
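As a rough illustration, the \K variant can be tried in PHP like this (the variable names are assumptions, not from the answer):

```php
<?php
$text = "Hello ma'am 'This is Prashanth's book.'";

// \K drops the opening quote from the reported match.
preg_match('/\B\'\K(?:(?!\'\B).)+/', $text, $m);

echo $m[0], PHP_EOL; // This is Prashanth's book.
```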

What is the difference between 2 regex patterns?

I want users to enter their username using only alphanumeric characters and dots.
So I wrote a regex pattern as following:
'/([a-zA-Z0-9\.]+)/'
But I want to know whether it is the same as:
'/([a-zA-Z0-9.]+)/'
Are these two patterns the same? Thank you for your help! :-)
You don't need to escape a dot inside a character class. Inside a character class, both the dot . and the escaped dot \. match a literal dot. So both regexes are the same.
Also, for validation purposes, I would suggest adding anchors, as in '/^[a-zA-Z0-9.]+$/'. Anchors force the whole string to match. That is, /[a-zA-Z0-9.]+/ would match the substring foo in the input string ()foo, but with start and end anchors, /^[a-zA-Z0-9.]+$/ won't match even a single character of that string. The anchored pattern accepts only strings made up entirely of one or more alphanumeric or dot characters; if the string contains any other character, the match fails.
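The difference the anchors make can be sketched in PHP as follows. The helper isValidUsername is a hypothetical name introduced only for this example:

```php
<?php
// Hypothetical helper for illustration; the name is not from the question.
function isValidUsername(string $name): bool
{
    return preg_match('/^[a-zA-Z0-9.]+$/', $name) === 1;
}

// Escaped and unescaped dots behave identically inside a character class.
preg_match('/([a-zA-Z0-9\.]+)/', '()foo', $m1);
preg_match('/([a-zA-Z0-9.]+)/', '()foo', $m2);

var_dump($m1[1] === $m2[1]);           // bool(true): both capture "foo"
var_dump(isValidUsername('()foo'));    // bool(false)
var_dump(isValidUsername('john.doe')); // bool(true)
```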

Explode and/or regex text to HTML link in PHP

I have a database of texts that contains this kind of syntax in the middle of English sentences that I need to turn into HTML links using PHP
"text1(text1)":http://www.example.com/mypage
Notes:
text1 is always identical to the text in parenthesis
The whole string always has the quotation marks, parentheses and colon, so the syntax is the same for each.
Sometimes there is a space at the end of the string, but other times there is a question mark or comma or other punctuation mark.
I need to turn these into basic links, like
<a href="http://www.example.com/mypage">text1</a>
How do I do this? Do I need explode or regex or both?
"(.*?)\(\1\)":(.*\/[a-zA-Z0-9]+)(?=\?|\,|\.|$)
You can use this.
See Demo.
http://regex101.com/r/zF6xM2/2
You can use this replacement:
$pattern = '~"([^("]+)\(\1\)":(http://\S+)(?=[\s\pP]|\z)~';
$replacement = '<a href="\2">\1</a>';
$result = preg_replace($pattern, $replacement, $text);
pattern details:
([^("]+) this part captures text1 in group 1. Using a negated character class (one that excludes the double quote and the opening parenthesis) has several advantages:
it allows the use of a greedy quantifier, which is faster
since the class excludes the opening parenthesis and is immediately followed by a parenthesis in the pattern, if another part of the text has content between double quotes but without parentheses inside, the regex engine will not go backward to test other possibilities; it will skip this substring without backtracking. (This is because the PCRE regex engine automatically converts [^a]+a into [^a]++a before processing the string)
\S+ means one or more non-whitespace characters
(?=[\s\pP]|\z) is a lookahead assertion that checks that the url is followed by a whitespace, a punctuation character (\pP) or the end of the string.
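A small PHP sketch of how this pattern might be applied. The sample sentence is invented, and the replacement string (group 2 as the href, group 1 as the link text) is an assumption about the intended output:

```php
<?php
// Made-up sample sentence containing the syntax from the question.
$text = 'See "text1(text1)":http://www.example.com/mypage for details.';

$pattern = '~"([^("]+)\(\1\)":(http://\S+)(?=[\s\pP]|\z)~';
// Assumed replacement: group 2 is the URL, group 1 the link text.
$replacement = '<a href="\2">\1</a>';

$result = preg_replace($pattern, $replacement, $text);

echo $result, PHP_EOL;
// See <a href="http://www.example.com/mypage">text1</a> for details.
```

Note that \S+ is greedy, so a punctuation mark placed directly after the URL (for example a trailing comma) is also non-whitespace and ends up inside group 2.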
You can use this regex:
"(.*?)\(.*?:(.*)
Working demo
An appropriate Regular Expression could be:
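$str = '"text1(text1)":http://www.example.com/mypage';
// $m[1] is the text before the parenthesis, $m[2] the text inside it, $m[3] the URL.
preg_match('#^"([^\(]+)' .
'\(([^\)]+)\)[^"]*":(.+)#', $str, $m);
print '<a href="' . $m[3] . '">' . $m[2] . '</a>' . PHP_EOL;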

PHP regular expression pattern allows unwanted literal asterisks

I have a regular expression that allows only specific characters from the name fields in an HTML form, namely letters, white space, single quotes, hyphens and periods. Here is the pattern:
return mb_ereg_match("^[\w\s'-\.]+$", $name);
Problem is this pattern, for some reason, returns true when there are literal asterisks in $name. This shouldn't be possible unless I'm missing something. I've done multiple searches on literal asterisks and all I found was the "\*" pattern for intentionally matching them.
The same pattern in preg_match() also returns a match when passed a string like "*John".
What the heck am I missing?
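Doubling the backslashes in a double-quoted PHP string is good practice (one backslash escapes the other), although PHP leaves unrecognized sequences such as \w and \s untouched, so that alone does not let the asterisk through.
You do need to escape the -, otherwise the class accepts all characters "between" ' and . (that is, ' ( ) * + , - and .).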
return mb_ereg_match("^[\\w\\s'\\-\\.]+$", $name);
Have a look at a working case (using preg_match): http://ideone.com/E8afAM
When enclosed in square brackets, the hyphen acts as a special character that denotes a range. In your case, it matches all characters in the range from ' to ., which includes the asterisk.
Escaping the hyphen should return the desired result:
^[\w\s'\-\.]+$
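The effect of the unescaped hyphen can be checked with a short PHP sketch using preg_match. The variable names $loose and $strict and the second test name are made up for illustration:

```php
<?php
// Unescaped hyphen: '-\. is a range from ' (0x27) to . (0x2E), which contains *.
$loose = '/^[\w\s\'-\.]+$/';
// Escaped hyphen: only the literal characters ', - and . are added to \w and \s.
$strict = '/^[\w\s\'\-.]+$/';

var_dump(preg_match($loose, '*John'));                  // int(1)
var_dump(preg_match($strict, '*John'));                 // int(0)
var_dump(preg_match($strict, "Anne-Marie O'Neil Jr.")); // int(1)
```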
I have a regular expression that allows only specific characters from the name fields in an HTML form, namely letters, white space, single quotes, hyphens and periods.
You are overlooking that \w is not just a letter character. php.net says:
A "word" character is any letter or digit or the underscore character, that is, any character which can be part of a Perl "word".
And, the perl definition is:
A \w matches a single alphanumeric character (an alphabetic character, or a decimal digit) or a connecting punctuation character, such as an underscore ("_").
The connecting punctuation character should mean only _ as far as I can tell, but this may be a bug in the multibyte extension.
If you use mb_ereg_match only to match whole Unicode strings, try preg_match's /u modifier and the Unicode character properties feature, available since PHP 5.1.0.

regex: remove all text within "double-quotes" (multiline included)

I'm having a hard time removing text within double-quotes, especially those spread over multiple lines:
$file=file_get_contents('test.html');
$replaced = preg_replace('/"(\n.)+?"/m','', $file);
I want to remove ALL text within double-quotes (included). Some of the text within them will be spread over multiple lines.
I read that newlines can be \r\n and \n as well.
Try this expression:
"[^"]+"
Also make sure you replace globally (in many languages this needs a g flag; PHP's preg_replace already replaces every match by default).
Another edit: daalbert's solution is best: a quote followed by one or more non-quotes ending with a quote.
I would make one slight modification if you're parsing HTML: make it 0 or more non-quote characters...so the regex will be:
"[^"]*"
EDIT:
On second thought, here's a better one:
"[\S\s]*?"
This says: "a quote followed by either a non-whitespace character or white-space character any number of times, non-greedily, ending with a quote"
The one below uses capture groups when they aren't necessary, and a bare wildcard doesn't make it explicit that it matches everything but the newline char. So it's clearer to say "either a non-whitespace char or a whitespace char" :) (not that it makes any difference in the result).
There are many regexes that can solve your problem, but here's one:
"(.*?(\s)*?)*?"
This reads as:
find a quote optionally followed by: (any number of characters that are not newline characters, non-greedily, followed by any number of whitespace characters, non-greedily), repeated any number of times non-greedily
Greedy means it will go to the end of the string and try matching it. If it can't find the match, it steps back one character from the end and tries to match, and so on. Non-greedy means it will consume as few characters as possible while trying to satisfy the pattern.
great link on regex: http://www.regular-expressions.info
great link to test regexes: http://regexpal.com/
Remember that your regex may have to change slightly based on what language you're using to search using regex.
You can use single line mode (also known as dotall) and the dot will match even newlines (whatever they are):
/".+?"/s
You are using multiline mode, which simply changes the meaning of ^ and $ from beginning/end of the string to beginning/end of each line. You don't need it here.
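A minimal PHP sketch of the difference the s flag makes, using an invented two-line sample string:

```php
<?php
// Made-up sample with a quoted span that crosses a line break.
$text = "before \"line one\nline two\" after";

// Without /s the dot stops at the newline, so nothing is removed.
$withoutS = preg_replace('/".+?"/', '', $text);
// With /s the dot also matches newlines.
$withS = preg_replace('/".+?"/s', '', $text);

var_dump($withoutS === $text); // bool(true)
echo $withS, PHP_EOL;          // before  after
```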
"[^"]+"
Something like below. s is dotall mode where . will match even newline:
/".+?"/s
$replaced = preg_replace('/"[^"]*"/s','', $file);
will do this for you. However, note that it won't allow for any escaped double quotes (e.g. A "test \" quoted string" B will result in A  quoted string" B, not in A  B as you might expect).
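The behaviour described above can be sketched in PHP as follows. The sample strings and variable names are invented for illustration:

```php
<?php
// Made-up sample: one quoted span across two lines and one empty pair of quotes.
$text = "keep \"drop\nthis\" keep \"\" end";

// [^"] matches newlines too, so no flag is needed; * also removes an empty "".
$clean = preg_replace('/"[^"]*"/', '', $text);
echo $clean, PHP_EOL; // keep  keep  end

// Limitation: a backslash-escaped quote ends the match early.
$escaped = preg_replace('/"[^"]*"/', '', 'A "test \" quoted string" B');
echo $escaped, PHP_EOL; // A  quoted string" B
```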
