I'm banging my head against the wall trying to figure out a (regexp?) based parser rule for the following problem. I'm developing a text markup parser similar to Textile (using PHP), but I don't know how to get the inline formatting rules right, and I noticed that the Textile parsers I found cannot format the following text the way I want:
-*deleted* -- text- and -more deleted text-
The result I want to have is:
<del><strong>deleted</strong> -- text</del> and <del>more deleted text</del>
What I do not want is:
<del><strong>deleted</strong> </del>- text- and <del>more deleted text</del>
Any ideas are very appreciated! thanks very much!
UPDATE
I think I should have mentioned that '-' should still be a valid character (hyphen) :) For example, the following should be possible:
-american-football player-
expected result:
<del>american-football player</del>
Based on the RedCloth library's parser description, with some modification for double-dash.
#
(?<!\S) # Start of string, or after space or newline
- # Opening dash
( # Capture group 1
(?: # : (see note 1)
[^-\s]+ # :
[-\s]+ # :
)*? # :
[^-\s]+? # :
) # End
- # Closing dash
(?![^\s!"\#$%&',\-./:;=?\\^`|~[\]()<]) # (see note 2)
#x
Note 1: This should match up to the next dash lazily, while consuming any non-single dashes, and single dashes surrounded by whitespace.
Note 2: Followed by space, punctuation, line break or end of string.
Or compacted:
#(?<!\S)-((?:[^-\s]+[-\s]+)*?[^-\s]+?)-(?![^\s!"\#$%&',\-./:;=?\\^`|~[\]()<])#
A few examples:
// The # inside the character class is escaped because # is also the delimiter.
$regex = '#(?<!\S)-((?:[^-\s]+[-\s]+)*?[^-\s]+?)-(?![^\s!"\#$%&\',\-./:;=?\\\^`|~[\]()<])#';
$replacement = '<del>\1</del>';
echo preg_replace($regex, $replacement, '-*deleted* -- text- and -more deleted text-'), "\n";
echo preg_replace($regex, $replacement, '-*deleted*--text- and -more deleted text-'), "\n";
echo preg_replace($regex, $replacement, '-american-football player-'), "\n";
Will output:
<del>*deleted* -- text</del> and <del>more deleted text</del>
<del>*deleted*</del>-text- and <del>more deleted text</del>
<del>american-football player</del>
In the second example, it will match just -*deleted*-, since there are no spaces before the --. -text- will not be matched, because the initial - is not preceded by a space.
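To get the nested result from the question, the del rule can be followed by a rule for strong. This is only a minimal sketch: the helper name inline_markup() is hypothetical, and the `*...*` rule assumes that strong text never contains an asterisk.

```php
<?php
// Hypothetical helper: applies the del rule above, then an assumed *...* rule for strong.
function inline_markup(string $text): string
{
    // Same del pattern as above; the # inside the character class is escaped
    // because # is also the pattern delimiter.
    $del = '#(?<!\S)-((?:[^-\s]+[-\s]+)*?[^-\s]+?)-(?![^\s!"\#$%&\',\-./:;=?\\\^`|~[\]()<])#';
    // Assumption: strong text never contains an asterisk.
    $strong = '~[*]([^*]+)[*]~';

    $text = preg_replace($del, '<del>$1</del>', $text);
    return preg_replace($strong, '<strong>$1</strong>', $text);
}

echo inline_markup('-*deleted* -- text- and -more deleted text-'), "\n";
echo inline_markup('-american-football player-'), "\n";
```

With the two inputs from the question this should give the nested `<del><strong>deleted</strong> -- text</del>` form and leave the hyphen in american-football intact. A real parser would probably need more rules, so treat the ordering (del first, then strong) as one possible choice.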
The strong tag is easy:
$string = preg_replace('~[*](.+?)[*]~', '<strong>$1</strong>', $string);
Working on the others.
Shameless hack for the del tag:
$string = preg_replace('~-(.+?)-~', '<del>$1</del>', $string);
$string = str_replace('<del></del>', '--', $string);
For a single token, you can simply match:
-((?:[^-]|--)*)-
and replace with:
<del>$1</del>
and similarly for \*((?:[^*]|\*{2,})*)\* and <strong>$1</strong>.
The regex is quite simple: a literal - at both ends. In the middle, in a capturing group, we allow anything that isn't a hyphen, or two hyphens in a row.
To also allow single dashes in words, as in objective-c, this can work, by accepting dashes surrounded by two alphanumeric letters:
-((?:[^-]|--|\b-\b)*)-
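As a quick check, the last pattern can be wrapped in a small function and run against the inputs from the question. The function name mark_deleted() is made up for this sketch.

```php
<?php
// Hypothetical helper wrapping the single-token pattern that also
// accepts hyphens sitting between two word characters.
function mark_deleted(string $text): string
{
    // [^-]    any character that is not a hyphen
    // --      two hyphens in a row
    // \b-\b   a hyphen with a word character on both sides
    return preg_replace('~-((?:[^-]|--|\b-\b)*)-~', '<del>$1</del>', $text);
}

echo mark_deleted('-*deleted* -- text- and -more deleted text-'), "\n";
echo mark_deleted('-american-football player-'), "\n";
```

Note that this variant does not require whitespace before the opening dash, so it may be more eager than wanted on text that contains stray hyphens.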
You could try something like:
'/-.*?[^-]-\b/'
Where the ending hyphen must be at a word boundary and preceded by something that is not a hyphen.
I think you should read this warning sign first
You can't parse [X]HTML with regex
Perhaps you should try googling for a php html library
Related
Here is my code
<img src="folder/img1.jpg?somestring">
<img src="folder/img2.jpg?somediffstring">
I want to replace somestring and somediffstring with another string throughout the HTML. Please suggest a regular expression for this in PHP.
example
change to using regular expression or anything
First of all, you shouldn't parse HTML with Regular Expressions.
Solution 1
Now, if you are exclusively parsing img tags, you could come up with a satisfying enough solution like this:
(\b\.jpg|\b\.png)\?(.+?)\"
That is:
(\b\.jpg|\b\.png) # 1st Capturing Group
\b\.jpg # 1st Alternative: match ``.jpg`` literally
\b\.png # 2nd Alternative: match ``.png`` literally
\? # Match the character ? literally
(.+?) # 2nd Capturing Group
.+? # Match any character between one and unlimited times,
# as few times as possible, expanding as needed.
\" # Match the character " literally
Problem
What's the problem? We are not checking if we are inside an img tag. This will match everywhere in the HTML.
Solution 2
Let's add the check for img > src:
<img.+?src=\".*?(\b\.jpg|\b\.png)\?(.+?)\"
That is:
<img # Match ``<img`` literally
.+? # Match any character between one and unlimited times,
# as few times as possible, expanding as needed.
# Needed in case there are rel or alt options inside the img tag.
src=\" # Match ``src="`` literally
... # The rest is same as before.
Problem
Does this really do its job? Apparently yes, but in reality no.
Consider the following HTML code
<img src="data:image/png;base64,iVBORw0KG" />
<div style="background-image: url(../images/test-background.jpg?)">
blah blah
</div>
It shouldn't match right? But it does (if you remove line-breaks). The regular expression above starts the match at <img src=", and will stop at "> of the div tag. The capturing group will contain the characters between ? and ": ), substituting it will break the HTML.
This was just an example, but many other situations will match even if they should not.
Other solutions...?
No matter how many constraints you can add to your RegEx and how sophisticated it becomes... HTML is a Context-Free Language and it can't be captured by a Regular Expression, which only recognizes Regular Languages.
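For comparison, here is a rough sketch of the parser-based route using PHP's DOMDocument, assuming the DOM extension is available. The helper name replace_img_query() is hypothetical, and loadHTML() wraps fragments in html/body elements, so the returned markup is a full document rather than the original fragment.

```php
<?php
// Hypothetical helper: replaces everything after "?" in each <img src> value.
function replace_img_query(string $html, string $newQuery): string
{
    $previous = libxml_use_internal_errors(true); // keep parser warnings quiet
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($doc->getElementsByTagName('img') as $img) {
        $src = $img->getAttribute('src');
        $pos = strpos($src, '?');
        if ($pos !== false) {
            // Keep the part up to and including "?", then append the new string.
            $img->setAttribute('src', substr($src, 0, $pos + 1) . $newQuery);
        }
    }
    return $doc->saveHTML();
}

echo replace_img_query('<img src="folder/img1.jpg?somestring">', 'bar');
```

Because only src attributes of img elements are touched, the data: URI and the div with a background image from the example above are left alone.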
In PHP
Still sure you're gonna use Regular Expressions? Alright, then your PHP function is preg_replace. You only need to keep in mind that it will replace everything that matched, not only the capturing groups. Hence, you need to wrap what you want to "remember" into another capturing group:
$str = '<img src="folder/img1.jpg?foo">';
$pattern = '/(<img.+?src=\".*?(\b\.jpg|\b\.png)\?)(.+?)(\")/';
$replacement = '$1' . 'bar' . '$4';
$str_replaced = preg_replace($pattern, $replacement, $str);
// Now you have $str_replaced = '<img src="folder/img1.jpg?bar">';
With reference to this: How can I use the captured group in the same regex
Suppose you want to change img1.jpg?somestring to img1.jpg?somestringAAA
and img2.jpg?somediffstring to img2.jpg?somediffstringAAA.
Search for: src="([a-zA-Z.0-9_]*)[?]([a-zA-Z.0-9_]*)">
Replace with: src="$1?$2AAA">
Here $1 represents whatever is matched inside the first pair of parentheses (), i.e., img1.jpg,
and $2 represents whatever is matched inside the second pair.
UPDATE:
$string = 'img1.jpg?somestring';
$pattern = '/([a-zA-Z.0-9_]*)[?]([a-zA-Z.0-9_]*)/i';
$replacement = '$1?$2AAA';
echo preg_replace($pattern, $replacement, $string);
You can do it in this way :
<?php
$url_value = "folder/img2.jpg?somediffstring";
echo $url =substr($url_value, 0, strpos($url_value, "?"));
?>
You can use the regex \?(\w*)"
If you want to replace somestring and somediffstring with xx, search with the regex \?(\w*)" and replace with ?xx" (the pattern consumes the closing quote, so the replacement has to put it back).
https://regex101.com/r/S5pPuW/1
I'm sure I'm missing something. I know just enough to be dangerous.
In my php code I use file_get_contents() to put a file into a variable.
I then loop through an array and use preg_match to search the same variable many times. The file is a tab-delimited txt file. It does fine 800 times but one time randomly in the middle it does something very odd.
$current = file_get_contents($file);
foreach($blahs as $blah){
$image = 'somefile.jpg';
$pattern = '/https:\/\/www\.example\.com\/media(.*)\/' . preg_quote($image) . '/';
preg_match($pattern, $current, $matches);
echo $matches[0];
}
For some reason, that one time it returns two URLs with a tab between them. When I look at the txt file, the image I'm looking for is listed first, followed by the second image, but echo $matches[0] returns them in reverse order. The file does not contain the text the way echo $matches[0] returns it. It would be like searching the string 'one two' and having $matches return 'two one'.
The regex engine is trying to do you a favor and capture the longest match. The \t tab between the two urls is being matched by the . (dot / any character).
Demonstration: (Link)
$blah='test case: https://www.example.com/media/foo/bar.jpg https://www.example.com/media/cat/fish.jpg some text';
$image = 'fish.jpg';
$your_pattern = '/https:\/\/www\.example\.com\/media(.*)\/'.preg_quote($image).'/';
echo preg_match($your_pattern,$blah,$matches)?$matches[0]:'fail';
echo "\n----\n";
$my_pattern='~https://www\.example\.com/media(?:[^/\s]*/)+'.preg_quote($image).'~';
echo preg_match($my_pattern,$blah,$out)?$out[0]:'fail';
Output:
https://www.example.com/media/foo/bar.jpg https://www.example.com/media/cat/fish.jpg
----
https://www.example.com/media/cat/fish.jpg
To crystallize...
test case: https://www.example.com/media/foo/bar.jpg https://www.example.com/media/cat/fish.jpg some text
// your (.*) is matching ---------------^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
My suggested pattern (I may be able to refine the pattern if you provide some sample strings) uses (?:[^/\s]*/)+ instead of the (.*).
My non-capturing group breaks down like this:
(?: #start non-capturing group
[^/\s]* #greedily match zero or more non-slash, non-whitespace characters
/ #match a slash
) #end non-capturing group
+ #allow the group to repeat one or more times
*note1: You can use \t where I use \s if you want to be more literal, I am using \s because a valid url shouldn't contain a space anyhow. You may make this adjustment in your project without any loss of accuracy.
*note2: Notice that I changed the pattern delimiters to ~ so that / doesn't need to be escaped inside the pattern.
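One thing worth checking for yourself: simply making the quantifier lazy, (.*?), does not help here, because the match still begins at the first URL and then stretches across the tab until the file name is found. The sketch below shows that, next to a simpler variant that forbids whitespace with \S*. The helper name find_media_url() and the sample line are made up for illustration.

```php
<?php
// Hypothetical helper: returns the first media URL ending in the given file name, or null.
function find_media_url(string $haystack, string $image): ?string
{
    // \S* cannot cross the tab, so a match never spans two URLs.
    $pattern = '~https://www\.example\.com/media\S*/' . preg_quote($image, '~') . '~';
    return preg_match($pattern, $haystack, $m) ? $m[0] : null;
}

$line = "https://www.example.com/media/foo/bar.jpg\thttps://www.example.com/media/cat/fish.jpg";

// Lazy quantifier: still starts at the first URL and runs across the tab.
preg_match('~https://www\.example\.com/media(.*?)/fish\.jpg~', $line, $lazy);
var_dump(strpos($lazy[0], "\t") !== false); // true: the match contains the tab

echo find_media_url($line, 'fish.jpg'), "\n";
```

Passing '~' as the second argument to preg_quote() makes sure the delimiter is escaped too, in case a file name ever contains it.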
I have this string stored in php:
Keyboard layout codes found here https://msdn.microsoft.com/en-us/library/cc233982.aspx test 123
test https://google.com
test google.com
<img src='http://example.com/pages/projects/uploader/files/2017-06-16%2011_27_36-Settings.png'>Link Converted to Image</img>
The img element was made with a previous regex:
$url = '~(https|http)?(://)((\S)+(png|jpg|gif|jpeg))~';
$output = preg_replace($url, "<img src='$0'>Link Converted to Image</img>", $output);
My problem is, now I want to convert the regular links to an a element.
I have this regex, which works except for one problem.
$url = '~(https|http)?(://)?((\S)+[.]+(\w*))~';
$output = preg_replace($url, "<img src='$0'>Link Converted to Image</img>", $output);
This regex ALSO converts the link that has already become an img element, so it puts an a element in the source of the img element. My thinking on avoiding this problem is to ignore a preg match checking if the match starts with src=', but I can't figure out how to actually do this.
Am I doing this incorrectly? What is the most common/efficient way to accomplish this?
A good example for (*SKIP)(*FAIL):
<img.+?</img>(*SKIP)(*FAIL) # match <img> tags and throw them away
| # or
\bhttps?\S+\b # a link starting with http/https
In PHP:
<?php
$string = <<<DATA
Keyboard layout codes found here https://msdn.microsoft.com/en-us/library/cc233982.aspx test 123
test https://google.com
<img src='http://example.com/pages/projects/uploader/files/2017-06-16%2011_27_36-Settings.png'>Link Converted to Image</img>
DATA;
$regex = '~<img.+?</img>(*SKIP)(*FAIL)|\bhttps?\S+\b~';
$string = preg_replace($regex, "<a href='$0'>$0</a>", $string);
echo $string;
?>
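If the backtracking verbs feel too magical, roughly the same effect can be had with preg_replace_callback(): match either an img block or a URL, and return the img block untouched. This is a sketch of an alternative, not the approach above; the function name linkify() is hypothetical.

```php
<?php
// Hypothetical helper: links bare http/https URLs but leaves
// existing <img>...</img> blocks exactly as they are.
function linkify(string $text): string
{
    $regex = '~<img.+?</img>|\bhttps?\S+\b~';
    return preg_replace_callback($regex, function (array $m): string {
        // Already-converted image: hand it back unchanged.
        if (strncmp($m[0], '<img', 4) === 0) {
            return $m[0];
        }
        return "<a href='{$m[0]}'>{$m[0]}</a>";
    }, $text);
}

echo linkify("test https://google.com\n<img src='http://example.com/a.png'>Link Converted to Image</img>");
```

The img alternative comes first in the pattern, so a URL inside an img block is consumed as part of that block before the URL alternative can see it.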
Adding to @Jan's answer, although there may be some drawbacks with this workaround, it will match URL-like strings:
<img.+?</img>(*SKIP)(*FAIL)|(?:https?\S+|(?:(?!:)(?(1)\S|(\w)))*\.\w{2,5})
Live demo
Breakdown:
(?: # Open a NCG (a)
(?!:) # Next immediate character shouldn't be a colon `:`
(?(1)\S # If CG #1 exists match a non-whitespace character
| # otherwise
(\w)) # Match a word character (a URL begins with a word character)
)* # As much as possible (this cluster denotes a tempered pattern)
\.\w{2,5} # Match TLD
Drawbacks:
TLD's character limit
Partial match of URLs containing a port number
I'm going to replace all <pre><code>CONTENT</code></pre> content .
<pre class="language-php"><code>m1
</code></pre>
lets go
<pre class="language-php"><code>m2
</code></pre>
but running preg_replace("/<pre(.*)<\/pre>/s", "SAMAN", $input_lines); would output this:
SAMAN
while i need this output:
SAMAN lets go SAMAN
here is my live test result
* quantifier is greedy, append it with ? to turn it into non-greedy. i.e. your regex should be: /<pre(.*?)<\/pre>/s
The Code is:
<?php
$input_lines='<pre class="language-php"><code>m1
</code></pre>
lets go
<pre class="language-php"><code>m2
</code></pre>';
$new_string=preg_replace("/(\r?\n?<pre.*?\/pre>\r?\n?)/s","SAMAN",$input_lines);
echo $new_string;
?>
Outputs:
SAMANlets goSAMAN
Regex Pattern Explanation:
( # Begin capture group
\r?\n? # Optional newline characters on Windows and Linux
<pre.*?\/pre> # Match from opening pre tag to closing pre tag
\r?\n? # Optional newline characters on Windows and Linux
) # End capture group
/s # Force all dots in pattern to allow newline characters
My answer very closely resembles LeleDumbo's answer which probably performs to the asker's satisfaction. I merely omitted the unnecessary < from the closing pre tag, and included some newline characters so that $new_string doesn't have any hidden characters in it (this may or may not be an issue depending on usage).
Recently I've been playing with BBCode in phpBB3. Looking at a random post in the posts table of my database, I found that the image tag is written as [img:fcjsgy5j]. Eight random characters are generated between [img: and ] for each post.
[img:fcjsgy5j]http://imageurl.jpg[/img]
My question is: how can I use preg_replace() to turn a tag with these random characters into this?
<img src="http://imageurl.jpg">
$output = preg_replace("`\[img:.+?\](.*?)\[/img\]`i", '<img src="$1"/>', $input);
[ begins a character set. We don't want that; we want to match the literal [ character, so we have to escape it with a \
. matches any character
+ means we match 1 or more of the previous thing (any character)
? makes the previous quantifier ungreedy (.+ would match everything, right to the very end of the string; that's not what we want, we want it to match as little as possible, just up to the next ])
(.*?) matches all the junk between the [img] tags. Ungreedy again. We put () around it to make it a capturing group
The ` (back-tick) at the start and the end could be any character... whatever character you start with, you have to end with. A lot of people use / but I prefer the back-tick because it rarely appears anywhere inside the regular expression, thus I don't need to escape it.
The i at the very end means The expression will be case insensitive. (will match img, IMG, ImG, etc.)
The $1 in the replace refers back to the () section we denoted earlier... it basically takes whatever was matched there, and plops it into the place of $1
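Putting those pieces together, here is a small sketch that uses the same pattern with preg_replace_callback() so the captured URL can be escaped before it lands in the attribute. The escaping step and the function name bbcode_img() are additions for illustration, not something phpBB does for you.

```php
<?php
// Hypothetical helper: converts phpBB-style [img:uid]url[/img] tags to <img> elements.
function bbcode_img(string $input): string
{
    return preg_replace_callback(
        '`\[img:.+?\](.*?)\[/img\]`i',
        function (array $m): string {
            // $m[1] is the text captured by (.*?), i.e. the image URL.
            // Escape it so quotes or angle brackets cannot break the attribute.
            return '<img src="' . htmlspecialchars($m[1], ENT_QUOTES) . '">';
        },
        $input
    );
}

echo bbcode_img('[img:fcjsgy5j]http://imageurl.jpg[/img]'), "\n";
```

For the sample tag from the question this should print the plain img element with the URL as its src.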
$result = preg_replace('%\[img:[^]]+\]([^[]+)\[/img\]%', '<img src="\1">', $subject);
or, as a commented regex:
$result = preg_replace(
'%\[img: # match [img:
[^]]+ # match one or more non-] characters
\] # match ]
([^[]+) # match one or more non-[ characters
\[/img\] # match [/img]
%x',
'<img src="\1">', $subject);
Try this code :
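<?php
// Each pattern needs delimiters; the i modifier also matches [IMG:...].
$search = array(
'/\[img:.+?\](.*?)\[\/img\]/i'
);
// $1 is the only capturing group: the URL between the tags.
$replace = array(
'<img src="$1">'
);
$result = preg_replace($search, $replace, $string);
?>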
I used the array form of preg_replace so that you can add more search and replace patterns in the future. I think you are trying to replace some BBCode tags. There are plenty of libraries on the net that handle BBCode correctly.
Edited
Like this one :
http://php.net/manual/en/book.bbcode.php