Explode and/or regex text to HTML link in PHP

I have a database of texts that contain this kind of syntax in the middle of English sentences, and I need to turn each occurrence into an HTML link using PHP:
"text1(text1)":http://www.example.com/mypage
Notes:
text1 is always identical to the text in parentheses.
The whole string always has the quotation marks, parentheses and colon, so the syntax is the same for each occurrence.
Sometimes there is a space at the end of the string, but other times there is a question mark, comma or other punctuation mark.
I need to turn these into basic links, like
<a href="http://www.example.com/mypage">text1</a>
How do I do this? Do I need explode or regex or both?

"(.*?)\(\1\)":(.*\/[a-zA-Z0-9]+)(?=\?|\,|\.|$)
You can use this.
See Demo.
http://regex101.com/r/zF6xM2/2

You can use this replacement:
$pattern = '~"([^("]+)\(\1\)":(http://\S+)(?=[\s\pP]|\z)~';
$replacement = '<a href="\2">\1</a>';
$result = preg_replace($pattern, $replacement, $text);
Pattern details:
([^("]+) captures text1 in group 1. Using a negated character class (one that excludes the double quote and the opening parenthesis) has two advantages:
it allows a greedy quantifier, which is faster
since the class excludes the opening parenthesis and is immediately followed by a parenthesis in the pattern, the regex engine skips quoted content elsewhere in the text that has no parentheses inside, without going back to test other possibilities. (This is because the PCRE regex engine automatically converts [^a]+a into [^a]++a before processing the string.)
\S+ matches one or more non-whitespace characters.
(?=[\s\pP]|\z) is a lookahead assertion that checks that the URL is followed by whitespace, a punctuation character (\pP) or the end of the string. Note that because \S+ is greedy, a punctuation mark sitting directly before whitespace or the end of the string is still captured as part of the URL.
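To see the whole replacement in action, here is a minimal sketch. The sample sentence is made up, and it deliberately swaps the lookahead for a negative lookbehind so that trailing sentence punctuation is left outside the link; which characters count as "trailing punctuation" is an assumption, not something stated in the question.

```php
<?php
// Hypothetical sample text; the wording and the second URL are invented for illustration.
$text = 'See "text1(text1)":http://www.example.com/mypage, or "docs(docs)":http://www.example.com/docs?';

// "label(label)":url   \1 forces the text in parentheses to repeat the label.
// (?<![.,;:!?]) makes the greedy \S+ give back trailing punctuation
// (assumption: a URL in these texts never ends with one of these characters).
$pattern = '~"([^("]+)\(\1\)":(https?://\S+)(?<![.,;:!?])~';

$result = preg_replace_callback($pattern, function (array $m): string {
    // Escape both parts before placing them into HTML.
    return '<a href="' . htmlspecialchars($m[2], ENT_QUOTES) . '">'
         . htmlspecialchars($m[1], ENT_QUOTES) . '</a>';
}, $text);

echo $result, PHP_EOL;
```

The callback form is used here only so the label and URL can be HTML-escaped; a plain preg_replace with a replacement string works just as well when the data is trusted.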

You can use this regex:
"(.*?)\(.*?:(.*)
Working demo

An appropriate Regular Expression could be:
$str = '"text1(text1)":http://www.example.com/mypage';
preg_match('#^"([^\(]+)' .
'\(([^\)]+)\)[^"]*":(.+)#', $str, $m);
print '<a href="'.$m[3].'">'.$m[2].'</a>' . PHP_EOL;

Related

Regex: Select multiple lines between characters

I can't select multiple lines between two characters with a regex.
How can I solve this?
{
example
example 1
}
I want to select 'example', but I can't.
I tried this regex:
#\n.*#
Thank you.
Your pattern does not do what you expect; it matches a newline character followed by any character except newline "zero or more" times. You need to use the s (dotall) modifier which forces the dot to match across newline sequences.
For example, to match everything between the two curly braces:
preg_match('/{(.*)}/s', $str, $match);
echo $match[1];
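A short sketch of the difference the s modifier makes, using a made-up $str that mirrors the question's input. The variable names are illustrative only.

```php
<?php
// Sample input mirroring the question: two lines wrapped in curly braces.
$str = "{\nexample\nexample 1\n}";

// Without the s modifier the dot cannot cross a newline, so nothing matches.
$withoutDotall = preg_match('/\{(.*)\}/', $str, $m1);

// With the s modifier the dot also matches newlines.
$withDotall = preg_match('/\{(.*)\}/s', $str, $m2);

// Split the captured block into individual lines (\R matches any newline sequence).
$lines = preg_split('/\R/', trim($m2[1]));

var_dump($withoutDotall, $withDotall);
print_r($lines);
```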

regular expression to match words with space or no space

I am trying to find a PHP regular expression that matches "Hello World" with a space and also matches "HelloWorld" without a space.
You could use:
/^Hello ?World$/
Or, if you don't care about the number of spaces:
/^Hello *World$/
Or, if the separator could also be a blank character such as a tab, use \s instead of a space.
Generally that would be:
/[a-zA-Z ]+/
If you want numbers, too:
/[a-zA-Z0-9 ]+/
You would need to set some sort of boundary. If the string just contains this, you can use start and end delimiters:
/^[a-zA-Z0-9 ]+$/
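As a quick check of the anchored variants above, here is a small sketch using the \s form. The candidate strings are invented for illustration.

```php
<?php
// \s* allows any amount of whitespace (including none) between the two words.
$pattern = '/^Hello\s*World$/';

// Hypothetical inputs chosen for illustration.
$candidates = ['Hello World', 'HelloWorld', "Hello\tWorld", 'Hello, World', 'Say Hello World'];

$matches = [];
foreach ($candidates as $candidate) {
    // preg_match returns 1 on a match and 0 otherwise.
    $matches[$candidate] = preg_match($pattern, $candidate) === 1;
}

var_export($matches);
```

The last two candidates fail because of the comma and because ^ anchors the match to the start of the string.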

exploding a string using a regular expression

I have a string as below (the letters in the example could be numbers or text, and could be uppercase, lowercase or both. If a value is a sentence, it is enclosed in single quotes):
$string="a,b,c,(d,e,f),g,'h, i j.',k";
How can I explode that to get the following result?
Array([0]=>"a",[1]=>"b",[2]=>"c",[3]=>"(d,e,f)",[4]=>"g",[5]=>"'h,i j'",[6]=>"k")
I think using regular expressions would be a fast as well as clean solution. Any ideas?
EDIT:
This is what I have done so far, which is very slow for strings that have a long part between parentheses:
$separator="*"; // whatever which is not used in the string
$Pattern="'[^,]([^']+),([^']+)[^,]'";
while(ereg($Pattern,$String,$Regs)){
$String=ereg_replace($Pattern,"'\\1$separator\\2'",$String);
}
$Pattern="\(([^(^']+),([^)^']+)\)";
while(ereg($Pattern,$String,$Regs)){
$String=ereg_replace($Pattern,"(\\1$separator\\2)",$String);
}
return $String;
This replaces all the commas between the parentheses. Then I can explode the string by commas and replace the $separator with the original comma.
You can do the job using preg_match_all:
$string="a,b,c,(d,e,f),g,'h, i j.',k";
preg_match_all("~'[^']+'|\([^)]+\)|[^,]+~", $string, $result);
print_r($result[0]);
Explanation:
The trick is to try the quoted and parenthesised alternatives before the comma-delimited one.
~ Pattern delimiter
'
[^'] All characters except a single quote
+ one or more times
'
| or
\([^)]+\) the same with parentheses
| or
[^,]+ Any characters except commas one or more times
~
Note that the quantifiers in [^']+', in [^)]+\) and also in [^,]+ are all automatically optimized to possessive quantifiers at compile time thanks to "auto-possessification". The first two because the character class doesn't contain the next character, and the last because it is at the end of the pattern. In both cases, backtracking would be pointless.
If you have more than one delimiter like quotes (where the opening and closing characters are the same), you can write your pattern like this, using a capture group:
$string="a,b,c,(d,e,f),g,'h, i j.',k,°l,m°,#o,p#,#q,r#,s";
preg_match_all('~([\'##°]).*?\1|\([^)]+\)|[^,]+~', $string, $result);
print_r($result[0]);
Explanation:
(['##°]) one character in the class is captured in group 1
.*? any character zero or more times in lazy mode
\1 group 1 content
With nested parentheses:
$string="a,b,(c,(d,(e),f),t),g,'h, i j.',k,°l,m°,#o,p#,#q,r#,s";
preg_match_all('~([\'##°]).*?\1|(\((?:[^()]+|(?-1))*+\))|[^,]+~', $string, $result);
print_r($result[0]);
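The nested case relies on pattern recursion: (?-1) re-enters the most recently opened capture group. A minimal sketch of the same idea, using a named group instead of a relative reference (the group name paren is chosen here for readability and is not part of the answer above):

```php
<?php
$string = "a,b,(c,(d,(e),f),t),g,'h, i j.',k";

// (?<paren> ... )   a balanced pair of parentheses:
//   \(              opening parenthesis
//   (?:[^()]++      a run of non-parenthesis characters
//   |(?&paren))*+   or a nested balanced pair (recursion into the same group)
//   \)              closing parenthesis
// '[^']*'           a single-quoted value
// [^,]+             anything else up to the next comma
$pattern = "~(?<paren>\((?:[^()]++|(?&paren))*+\))|'[^']*'|[^,]+~";

preg_match_all($pattern, $string, $result);
$fields = $result[0]; // index 0 holds the full matches

print_r($fields);
```

Note that, as with the patterns above, empty fields (two commas in a row) produce no match and are silently skipped.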

How to replace only the last match of a string with preg_replace?

I have to replace the last match of a string (for example the word foo) in an HTML document. The problem is that the structure of the HTML document is always random.
I'm trying to accomplish that with preg_replace, but so far I know how to replace only the first match, but not the last one.
Thanks.
Use a negative lookahead (?!...)
$str = 'text abcd text text efgh';
echo preg_replace('~text(?!.*text)~', 'bar', $str),"\n";
output:
text abcd text bar efgh
A common approach to match all text to the last occurrence of the subsequent pattern(s) is using a greedy dot, .*. So, you may match and capture the text before the last text and replace with a backreference + the new value:
$str = 'text abcd text text efgh';
echo preg_replace('~(.*)text~su', '${1}bar', $str);
// => text abcd text bar efgh
If text is some value inside a variable that must be treated as plain text, use preg_quote to ensure all special chars are escaped correctly:
preg_replace('~(.*)' . preg_quote($text, '~') . '~su', '${1}bar', $str)
See the online PHP demo and a regex demo.
Here, (.*) matches and captures into Group 1 any zero or more chars (note that the s modifier makes the dot match line break chars, too), as many as possible, up to the rightmost (last) occurrence of text. If text is a Unicode substring, the u modifier comes in handy in PHP (it enables the (*UTF) PCRE verb, which parses the incoming string as a sequence of Unicode code points rather than bytes, and the (*UCP) verb, which makes all shorthand character classes Unicode-aware, if any).
The ${1} is a replacement backreference, a placeholder holding the value captured into Group 1 that lets you restore that substring inside the resulting string. You can use $1, but a problem might arise if the replacement text that follows it starts with a digit.
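The digit problem is easiest to see with a replacement that begins with a number. The sketch below uses made-up values ('foo' as the search term, '42' as the replacement) to compare the braced and unbraced forms.

```php
<?php
$str = 'foo one foo two foo three';
$search = 'foo'; // treated as plain text, hence preg_quote below

$pattern = '~(.*)' . preg_quote($search, '~') . '~su';

// Braces delimit the group number, so "42" stays literal text.
$safe = preg_replace($pattern, '${1}42', $str);

// Without braces PHP reads up to two digits after the $, so this refers to
// group 14 (which does not exist here) followed by a literal "2".
$ambiguous = preg_replace($pattern, '$142', $str);

echo $safe, PHP_EOL;
echo $ambiguous, PHP_EOL;
```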
An example
<?php
$str = 'Some random text';
$str_Pattern = '/[^ ]*$/';
preg_match($str_Pattern, $str, $results);
print $results[0];
?>
Of course the accepted solution given here is correct. Nevertheless, you might also want to have a look at this post. I use that approach where no pattern is needed and the string does not contain characters that the functions used cannot handle (i.e. multibyte ones). I also added a parameter for respecting or disregarding case.
The first line then is:
$pos = $case === true ? strripos($subject, $search) : strrpos($subject, $search);
I have to admit that I did not test the performance. However, I guess that preg_replace() is slower, especially on large strings.
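Since the linked post is not reproduced here, the following is only a sketch of what such a pattern-free helper might look like. The function name replace_last and its parameter order are hypothetical; only strrpos, strripos and substr_replace are real PHP functions.

```php
<?php
// Hypothetical helper: replace only the last occurrence of $search in $subject.
// Byte-based, so it assumes single-byte text as the answer above notes.
function replace_last(string $search, string $replace, string $subject, bool $ignoreCase = false): string
{
    if ($search === '') {
        return $subject; // nothing sensible to look for
    }

    // Position of the last occurrence, optionally ignoring case.
    $pos = $ignoreCase ? strripos($subject, $search) : strrpos($subject, $search);

    if ($pos === false) {
        return $subject; // no occurrence, return the input unchanged
    }

    // Swap the matched bytes for the replacement.
    return substr_replace($subject, $replace, $pos, strlen($search));
}

echo replace_last('text', 'bar', 'text abcd text text efgh'), PHP_EOL;
```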

regex: remove all text within "double-quotes" (multiline included)

I'm having a hard time removing text within double-quotes, especially those spread over multiple lines:
$file=file_get_contents('test.html');
$replaced = preg_replace('/"(\n.)+?"/m','', $file);
I want to remove ALL text within double-quotes (included). Some of the text within them will be spread over multiple lines.
I read that newlines can be \r\n and \n as well.
Try this expression:
"[^"]+"
Also make sure you replace globally (in many languages this needs a g flag; in PHP, preg_replace already replaces every match by default).
Another edit: daalbert's solution is best: a quote followed by one or more non-quotes ending with a quote.
I would make one slight modification if you're parsing HTML: make it 0 or more non-quote characters...so the regex will be:
"[^"]*"
EDIT:
On second thought, here's a better one:
"[\S\s]*?"
This says: "a quote followed by either a non-whitespace character or white-space character any number of times, non-greedily, ending with a quote"
The one below uses capture groups when it isn't necessary...and the use of a wildcard here isn't explicit about showing that wildcard matches everything but the new-line char...so it's more clear to say: "either a non-whitespace char or whitespace char" :) -- not that it makes any difference in the result.
There are many regexes that can solve your problem, but here's one:
"(.*?(\s)*?)*?"
This reads as:
find a quote optionally followed by: (any number of characters that are not new-line characters non-greedily, followed by any number of whitespace characters non-greedily), repeated any number of times non-greedily
Greedy means it will go to the end of the string and try matching it. If it can't find the match, it goes back one character from the end and tries to match, and so on. So non-greedy means it will find as few characters as possible to try matching the criteria.
great link on regex: http://www.regular-expressions.info
great link to test regexes: http://regexpal.com/
Remember that your regex may have to change slightly based on what language you're using to search using regex.
You can use single line mode (also known as dotall) and the dot will match even newlines (whatever they are):
/".+?"/s
You are using multiline mode, which simply changes the meaning of ^ and $ from beginning/end of string to beginning/end of line. You don't need it here.
"[^"]+"
Something like below. s is dotall mode where . will match even newline:
/".+?"/s
$replaced = preg_replace('/"[^"]*"/s','', $file);
will do this for you. However, note that it won't allow for any escaped double quotes (e.g. A "test \" quoted string" B will result in A quoted string" B with a leading space, not in A B as you might expect).
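If backslash-escaped quotes do need to be handled, one common variation is to let the pattern consume a backslash together with the character that follows it. This is a sketch under the assumption that a backslash is the only escape mechanism in the input; the sample $file content is invented.

```php
<?php
// Hypothetical input: one quoted string with an escaped quote, one spanning two lines.
$file = 'A "test \" quoted string" B' . "\n" . 'C "spread over' . "\n" . 'two lines" D';

// "            opening quote
// (?:[^"\\]    any character that is neither a quote nor a backslash
// |\\.)*       or a backslash plus the character after it (s lets . match a newline)
// "            closing quote
$pattern = '/"(?:[^"\\\\]|\\\\.)*"/s';

$replaced = preg_replace($pattern, '', $file);

echo $replaced, PHP_EOL;
```

The doubled backslashes are needed because the pattern passes through PHP's single-quoted string parsing before it reaches the regex engine.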
