PHP: Preg_match and replace all - php

I have a text containing hyperlinks in an unusual format, and I want to replace all of them with normal HTML hyperlinks.
So this just works for one hyperlink:
$string = '<u>\\n\\\\*HYPERLINK \\"http://www.youtube.com/watch?v=A0VUsoeT9aM\\"A Youtube Video</u>';
$pattern = '/http[?.:=\\w\\d\\/]*/';
$namePattern = '/(?:")([\\s\\w]*)</';
preg_match($pattern, $string, $matches);
preg_match($namePattern, $string, $nameMatches);
echo '<a href="'.$matches[0].'">'.$nameMatches[1].'</a>';
But the text contains more than one hyperlink, so I want to convert all of them:
<?php
$input = 'Blablabla Beginning Text <u>\\n\\\\*HYPERLINK \\"http://www.youtube.com/watch?v=A0VUsoeT9aM\\"1.A Youtube Video</u> blablabla Text Middle <u>\\n\\\\*HYPERLINK \\"http://www.youtube.com/watch?v=A0VUsoeT9aM\\"2. A Youtube Video</u> blabla Text after';
//To become:
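$output = 'Blablabla Beginning Text <a href="http://www.youtube.com/watch?v=A0VUsoeT9aM">1. A Youtube Video</a> blablabla Text Middle <a href="http://www.youtube.com/watch?v=A0VUsoeT9aM">2. A Youtube Video</a> blabla Text after';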
?>
How would I do that?

Since you want to replace the matches you find, use preg_replace(), which does exactly that. However, you'll run into one obvious problem: there are currently two calls to preg_match(). Should they be replaced by two calls to preg_replace()? No. Combine the patterns.
$pattern = '/http[?.:=\w\d\\/]*/';
$namePattern = '/(?:")([\s\w]*)</';
Can be combined to (I added . to the $namePattern part, so it can work with the second example text where the link description contains a dot):
$replacePattern = '/(http[?.:=\w\d\\/]*)\\\\"([\s\w.]*)</';
Because link and text are separated by \\" in the original text. I tested via preg_match_all() if this pattern works and it does. Also by adding () to the first pattern, they are now grouped.
$replacePattern = '/(http[?.:=\w\d\\/]*)\\\\"([\s\w.]*)</';
// ^-group1-----------^ ^-group2-^
These groups can now be used in the replace statement.
$replaceWith = '<a href="\\1">\\2</a><';
Where \\1 points to the first group and \\2 to the second. The < at the end is necessary because preg_replace() will replace the whole found pattern (not just the groups) and since the < is at the end of the pattern, we would lose it if it wasn't in the replace part.
All you need now is to call preg_replace() with these parameters, as follows:
$output = preg_replace($replacePattern, $replaceWith, $string);
All occurrences of the $replacePattern will now be replaced with their version of $replaceWith, and the result is saved in the variable $output.
If you want a larger part to be removed, just extend the $replacePattern.
$replacePattern = '/<u>.*?(http[?.:=\w\d\\/]*)\\\\"([\s\w.]+)<\/u>/';
$replaceWith = '<a href="\\1">\\2</a>';
.*? matches any characters and is not greedy, meaning it stops at the first occurrence of whatever comes after it (here, http...).
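The combined pattern can be tried end to end with a short script. This is only a sketch: the <a href="...">...</a> replacement is an assumption about the desired output, the input is written as a nowdoc so the backslashes appear exactly as in the raw text, and ~ is used as the delimiter to avoid escaping the slashes.

```php
<?php
// Raw text: each link looks like <u>\n\\*HYPERLINK \"URL\"Label</u>.
// A nowdoc keeps every backslash literally as typed.
$input = <<<'TXT'
Beginning <u>\n\\*HYPERLINK \"http://www.youtube.com/watch?v=A0VUsoeT9aM\"1. A Youtube Video</u> middle <u>\n\\*HYPERLINK \"http://www.youtube.com/watch?v=A0VUsoeT9aM\"2. A Youtube Video</u> end
TXT;

// Group 1 is the URL, group 2 is the label.
// In a single-quoted PHP string, \\\\" becomes the regex \\" (a literal backslash, then a quote).
$replacePattern = '~<u>.*?(http[?.:=\w/]*)\\\\"([\s\w.]+)</u>~';

// Assumption: the target markup is a plain anchor element.
$replaceWith = '<a href="$1">$2</a>';

// preg_replace() replaces every match, not just the first one.
$output = preg_replace($replacePattern, $replaceWith, $input);

echo $output, "\n";
```

Because preg_replace() handles all matches in one call, no loop over preg_match() results is needed.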

Related

preg_match_all has different result set than preg_replace using the same pattern

I find that preg_match_all and preg_replace do not find the same matches based on the same pattern.
My pattern is:
/<(title|h1|h2|h3|h4|h5|ul|ol|p|figure|caption|span)(.*?)><\/(\1)>/
When I run this against a snippet containing the likes of
<span class="blue"></span>
with preg_match_all I get 17 matches.
When I use the same pattern in preg_replace I get 0 matches. Replacing the \1 with the selection list does find the matches, but of course that won't work as a solution because it no longer ensures that the closing tag is the same type as the opening tag.
The overall goal is to find instances of tags with no content that should not be present without content...a holy crusade, I assure you.
In testing whether the regex works, I have also tried it in php cli. Here is the output:
Interactive shell
php > $str = 'abc<span class="blue"></span>def';
php > $pattern = "/<(title|h1|h2|h3|h4|h5|ul|ol|p|figure|caption|span)(.*?)><\/(\1)>/";
php > $final = preg_replace($pattern, '', $str);
php > print $final;
abc<span class="blue"></span>def
$str = 'abc<span class="blue"></span>def';
$pattern = "/<(title|h1|h2|h3|h4|h5|ul|ol|p|figure|caption|span)(.*?)><\/(\\1)>/";
// added \ ^
$final = preg_replace($pattern, '', $str);
print $final;
// echos 'abcdef'
Explanation:
"\1" // <-- character in octal notation
is very different from
'\1' // <-- backslash and 1
because the first is an escape sequence, so the regex engine never sees the backreference. This is also the reason I almost exclusively use single-quoted strings. See http://php.net/string#language.types.string.syntax.double
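The difference between the two quoting styles can be checked directly. The following is a small sketch; the shortened tag list (span|p) is only for illustration.

```php
<?php
// In double quotes, "\1" is an octal escape: one byte with the value 1.
$double = "\1";
// In single quotes, '\1' stays as two characters: a backslash and the digit 1.
$single = '\1';

var_dump(strlen($double)); // int(1)
var_dump(ord($double));    // int(1)
var_dump(strlen($single)); // int(2)

// With a single-quoted pattern the backreference \1 reaches the regex engine intact.
$str = 'abc<span class="blue"></span>def';
$pattern = '/<(span|p)(.*?)><\/\1>/';
$result = preg_replace($pattern, '', $str);

echo $result, "\n"; // abcdef
```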

find match of 1st word and last

I have a URL that looks somewhat like this:
for-sale/stuff/state/used-bla-bla2-bla3-bla4-(bla5)---f10-85934.html
I'm trying to validate the format in my function using this regex:
if (preg_match('/(?:^|(?:\-))(\w+)/g', $pathInfo, $matches)) {
echo $digit = $matches[0];
}
$pathInfo is the URL given above.
Basically, I want to check that:
the directory is for-sale/stuff/
the file name (used-bla-bla2-bla3-bla4-(bla5)---f10-85934.html) starts with either used or new and ends with an integer followed by .html
no spaces are present.
After validating, I want to get the ID, which in this case is 85934.
Seems like you want something like this,
'~^for-sale/stuff/\S+/(?:used|new)\S*?(\d+)\.html$~'
I'd suggest this sample piece of code and the following regex:
$re = "~\\bfor\\-sale\\/stuff\\/[^<> ]*?\\/(?:used|new)[^/ ]*?\\-(\\d+)\\.html\\b~";
$str = "\n";
preg_match_all($re, $str, $matches);
Regex: \bfor\-sale\/stuff\/[^<> ]*?\/(?:used|new)[^/ ]*?\-(\d+)\.html\b
I assume you have several URLs to validate in a variable string of text, thus I suggest using \b, and that the URL is inside some tag, so I'd use [^<> ]*? in order to limit capturing to just inside a tag.
The ID will be in the first capturing group (captured by \d+).
Spaces are also disallowed: [^<> ]*?, [^/ ]*?.
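As a sketch of how the first suggested pattern could be used for both validation and ID extraction, consider the following. The function name extractListingId is hypothetical and only serves the illustration.

```php
<?php
// Hypothetical helper: returns the numeric ID if the path has the expected
// format, or null if the path does not validate.
function extractListingId(string $pathInfo): ?int
{
    // ^for-sale/stuff/  fixed directory prefix
    // \S+/              one or more further path segments, no whitespace
    // (?:used|new)      the file name must start with "used" or "new"
    // \S*?(\d+)\.html$  lazily skip the rest, then capture the trailing integer
    $pattern = '~^for-sale/stuff/\S+/(?:used|new)\S*?(\d+)\.html$~';

    if (preg_match($pattern, $pathInfo, $matches) === 1) {
        return (int) $matches[1];
    }

    return null;
}

var_dump(extractListingId('for-sale/stuff/state/used-bla-bla2-bla3-bla4-(bla5)---f10-85934.html')); // int(85934)
var_dump(extractListingId('for-sale/stuff/state/used bla-1.html')); // NULL (contains a space)
var_dump(extractListingId('for-sale/stuff/state/old-car-123.html')); // NULL (does not start with used/new)
```

Note that preg_match() stores the captured ID in $matches[1]; $matches[0] holds the whole matched string.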

Remove unnecessary close tags using regex

I'm looking for a regex, which removes close tags, and everything, until it finds an open tag. For example:
</xy>..</zz>..<a>... -> <a>...
</b>..</cc>..... -> ...
I tried this, but it doesn't work for some reason:
$html = preg_replace("/^.*<.*>/","<.*>",$html);
The regex below captures all the text before an opening tag into one group (group 1) and the remaining string into another group. So the second group contains the text starting at the opening tag.
(.*)(<\w.*)
Your php code would be,
<?php
$re = '~(.*)(<\w.*)~';
$str= '</b>..</cc>..... -> ...';
$replacement = "$2";
echo preg_replace($re, $replacement, $str);
?> //=> ...
OR
<?php
$re = '~(?:.*)(<\w.*)~';
$str= '</p>\n<p> </p>';
$replacement = "$1";
echo preg_replace($re, $replacement, $str);
?>
Explanation:
(.*)(<\w.*) captures from the beginning of the string and stops capturing when it finds a < followed by a \w word character. The text before <\w is stored in group 1 and the text from <\w onwards (including <\w) is stored in group 2.
If I understand your responses to Avinash Raj's answer correctly, you need something that matches any number of lines of input up to the first open tag, but that matches only once so that all subsequent content is kept.
.*(\n.*?)*?(<\w.*(\n.*)*)
The first part
.*(\n.*?)*?
Matches any number of lines but not greedily (hence the ?s), so it will stop at the first line which contains an open tag:
<\w
This is then followed once again by any number of lines of anything:
.*(\n.*)*
So to extract what you want you would replace
.*(\n.*?)*?(<\w.*(\n.*)*)
With
\2
Which is everything from and including the first open tag.
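For comparison, here is a sketch of a different technique than the patterns above: the s flag lets . match newlines, so one lazy .*? anchored at the start of the input can stop right before the first open tag via a lookahead. The function name stripBeforeFirstOpenTag is hypothetical. One known limitation: if the input contains no open tag at all, nothing matches and the input is returned unchanged.

```php
<?php
// Hypothetical helper: drop everything before the first open tag.
function stripBeforeFirstOpenTag(string $html): string
{
    // \A       anchor at the very start of the input
    // .*?      as few characters as possible (the s flag includes newlines)
    // (?=<\w)  stop right before a "<" that is followed by a word character,
    //          which skips close tags such as </xy>
    $result = preg_replace('~\A.*?(?=<\w)~s', '', $html);

    // preg_replace() returns null on error; fall back to the input in that case.
    return $result ?? $html;
}

echo stripBeforeFirstOpenTag("</xy>..</zz>..\n<a>...\n<b>..."), "\n";
```

Everything from the first open tag onwards, including later lines, is kept as is.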

PHP converting plain text to hashtag link

I am trying to convert user's posts (text) into hashtag clickable links, using PHP.
From what I found, hashtags should only contain alpha-numeric characters.
$text = 'Testing#one #two #three.test';
$text = preg_replace('/#([0-9a-zA-Z]+)/i', '#$1', $text);
It places links on all of them (#one #two #three), but I think #one should not be converted, because it directly follows another alphanumeric character. How can I adjust the regex to fix that?
The 3rd one is also OK, it matches just #three, which I think is correct.
You could modify your regex to include a negative lookbehind for a non-whitespace character, like so:
(?<!\S)#([0-9a-zA-Z]+)
Working regex example:
http://regex101.com/r/mR4jZ7
PHP:
$text = preg_replace('/(?<!\S)#([0-9a-zA-Z]+)/', '#$1', $text);
Edit:
And to make the expression compatible with other languages (non-english characters):
(?<!\S)#([0-9\p{L}]+)
Working example:
https://regex101.com/r/Pquem3/1
A combined, Unicode-aware regex that is also safe for HTML-encoded text (it skips a # that follows an &, as in numeric entities): ~(?<!&)#([\pL\d]+)~u
It handles tags such as #tag1 #tag2#tag3, etc.
I finally found a solution similar to how Facebook and other sites turn hashtags into URLs; it may help you too. This code also works with Unicode. I tested it with some Bangla text, and I think it will work with any language; let me know how it behaves with others.
$str = '#Your Text #Unicode #ফ্রিকেলস বা #তিল মেলানিনের #অতিরিক্ত উৎপাদনের জন‍্য হয় যা #সূর্যালোকে #বাড়ে';
$regex = '/(?<!\S)#([0-9a-zA-Z\p{L}\p{M}]+)/mu';
$text = preg_replace($regex, '#$1', $str);
echo $text;
To catch the second and third hashtags without the first one, you need to specify that the hashtag should start at the beginning of the line, or be preceded by one or more whitespace characters, as follows:
$text = 'Testing#one #two #three.test';
$text = preg_replace('/(^|\s+)#([0-9a-zA-Z]+)(\b|$)/', '$1#$2', $text);
The \b in the third group defines a word boundary, which allows the pattern to match #three when it is immediately followed by a non-word character.
Edit: MElliott's answer above is more efficient, for the record.
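Putting the lookbehind and the Unicode classes together, a minimal sketch might look like this. The function name linkHashtags and the /tag/ URL scheme are assumptions made for the example, since the link target is not shown in the snippets above.

```php
<?php
// Hypothetical helper: wrap hashtags in links, but only when the "#" is at
// the start of the text or preceded by whitespace.
function linkHashtags(string $text): string
{
    $result = preg_replace(
        // (?<!\S)  the "#" must not follow a non-whitespace character
        // \p{L}    any letter, \p{M} combining marks (needed for scripts such as Bangla)
        // u flag   treat pattern and subject as UTF-8
        '/(?<!\S)#([0-9\p{L}\p{M}]+)/u',
        // Assumption: tags live under /tag/<name>; adjust to the real URL scheme.
        '<a href="/tag/$1">#$1</a>',
        $text
    );

    // With the u flag, preg_replace() returns null for invalid UTF-8 input.
    return $result ?? $text;
}

echo linkHashtags('Testing#one #two #three.test'), "\n";
```

In this example #one is left alone because it directly follows the letter g, while #two and #three are linked; the .test suffix stays outside the link.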

Automatically convert keywords to links in php

I am trying to convert specific keywords in text, which are stored in array, to the links.
Example text:
$text='This text contains many keywords, but also formated keywords.';
So now I want to convert the word keywords to the #keywords.
I used the very simple preg_replace function
preg_replace('/keywords/i',' keywords ',$text);
but obviously it also converts the string that is already formatted as a link, so I get messy HTML like:
$text='This text contains many keywords, but also formated keywords" title="keywords">keywords</a>.'
Expected result:
$text='This text contains many keywords, but also formated keywords.'
Any suggestions?
THX
EDIT
We are one step away from the perfect function, but it still does not work well in this case:
$text='This text contains many keywords, but also formated
keywords.'
In this case it replaces also the word keywords in the href, so we again get the messy code like
keywords.com/keywords" title="keywords">keywords</a>
I'm not great with regular expressions, but maybe this one will work:
/[^#>"]keywords/i
What I think it will do is ignore any instances of #keywords, >keywords, and "keywords and find the rest.
EDIT:
After testing it out, it looks like that replaces the space before the word as well, and doesn't work if keywords is the beginning of the string. It also didn't preserve original capitalization. I have tested this one, and it works perfectly for me:
$string = "Keywords and keywords, plus some more keywords with the original keywords.";
$string = preg_replace("/(?<![#>\"])keywords/i", "$0", $string);
echo $string;
The first three are replaced, preserving the original capitalization, and the last one is left untouched. This one uses a negative lookbehind and backreferences.
EDIT 2:
OP edited question. With the new example provided, the following regex will work:
$string = 'This text contains many keywords, but also formated keywords.';
$string = preg_replace("/(?<![#>\".\/])keywords/i", "$0", $string);
echo $string;
// outputs: This text contains many keywords, but also formated keywords.
This will replace all instances of keywords that are not preceded by #, >, ", ., or /.
Here is the problem:
The keyword could be inside the href, the title, or the text of the link, and anywhere in there (like if the keyword was sanity and you already had href="insanity"). Or even worse, you could have a non-keyword link that happens to contain a keyword, something like:
Click here to find more keywords and such!
In the above example, even though it fits every other possible criteria (it's got spaces before and after being the easiest one to test for), it still would result in a link within a link, which I think breaks the internet.
Because of this, you need to use lookaheads and lookbehinds to check if the keyword is wrapped in a link. But there is one catch: lookbehinds have to have a defined pattern (meaning no wild cards).
I thought I'd be the hero and show you the easy fix for your issue, which would be something to the effect of:
'/(?<!\<a.?>)[list|of|keywords](?!\<\/a>)/'
Except you can't do that because the lookbehind in this case has that wildcard. Without it, you end up with a super greedy expression.
So my proposed alternative is to use regex to find all link elements, then str_replace to swap them out with a placeholder, and then restore the original links from the placeholders at the end.
Here's how I did it:
$text='This text contains many keywords, but also formated keywords.';
$keywords = array('text', 'formatted', 'keywords');
// This is just to make the regex easier. A group (a|b|c) holds the alternatives;
// a character class [a|b|c] would only match single characters.
$quoted_keywords = array_map(function ($keyword) { return preg_quote($keyword, '/'); }, $keywords);
$keyword_list_pattern = '(?:' . implode('|', $quoted_keywords) . ')';
// First, get all matching keywords that are inside link elements.
// The quantifiers are lazy so that one match does not run from the first <a to the last </a>.
preg_match_all('/<a\b.*?' . $keyword_list_pattern . '.*?<\/a>/i', $text, $links);
$links = array_unique($links[0]); // Cleaning up array for next step.
// Second, swap out all matches with a placeholder, and build restore array:
$restore_links = array();
foreach($links as $count => $link) {
$link_key = "xxx_{$count}_xxx";
$restore_links[$link_key] = $link;
$text = str_replace($link, $link_key, $text);
}
// Third, we build a nice replacement array for the keywords:
$keyword_links = array();
foreach($keywords as $keyword) {
$keyword_links[$keyword] = "<a href='#$keyword'>$keyword</a>";
}
// Merge the restore links to the bottom of the keyword links for one mass replacement:
$keyword_links = array_merge($keyword_links, $restore_links);
$text = str_replace(array_keys($keyword_links), $keyword_links, $text);
echo $text;
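A variation on the same idea, sketched here with a different technique than the placeholder swap: preg_split() with PREG_SPLIT_DELIM_CAPTURE cuts the text into pieces outside and inside existing links, so keywords are replaced only in the outside pieces. The function name linkKeywords and the #keyword link target are assumptions for the example, and the sketch assumes plain, non-nested <a> elements.

```php
<?php
// Hypothetical helper: link keywords everywhere except inside existing <a> elements.
function linkKeywords(string $text, array $keywords): string
{
    if ($keywords === []) {
        return $text; // nothing to link
    }

    // Escape each keyword and join them into one alternation: text|formatted|keywords
    $alternation = implode('|', array_map(
        static function (string $keyword): string {
            return preg_quote($keyword, '~');
        },
        $keywords
    ));

    // With one capturing group and PREG_SPLIT_DELIM_CAPTURE, the result alternates:
    // even indexes hold text outside links, odd indexes hold the <a>...</a> elements.
    $parts = preg_split('~(<a\b[^>]*>.*?</a>)~is', $text, -1, PREG_SPLIT_DELIM_CAPTURE);

    foreach ($parts as $i => $part) {
        if ($i % 2 === 0) {
            // $0 is the whole match, so the original capitalization is preserved.
            $parts[$i] = preg_replace('~\b(?:' . $alternation . ')\b~i', '<a href="#$0">$0</a>', $part);
        }
    }

    return implode('', $parts);
}

$text = 'Many keywords, see <a href="http://keywords.com/keywords" title="keywords">keywords</a>.';
echo linkKeywords($text, ['keywords']), "\n";
```

Because existing links are never passed to the keyword replacement, the href, title, and link text of those links stay untouched without any lookbehind.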
You can change your regex so that it only targets keywords with a space in front, since the formatted keywords do not contain a space. Here is an example.
$text = preg_replace('/ keywords/i',' keywords',$text);
