PHP Regular Expression to convert html entities to their respective characters

PHP Regular Expression to convert html entities to their respective characters - php

I want to change
<lang class='brush:xhtml'>test</lang>
to
<pre class='brush:xhtml'>test</pre>
my code like that.
<?php
$content="<lang class='brush:xhtml'>test</lang>";
$pattern=array();
$replace=array();
$pattern[0]="/<lang class=([A-Za-z='\":])* </";
$replace[0]="<pre $1>";
$pattern[1]="/<lang>/";
$replace[1]="</pre>";
echo preg_replace($pattern, $replace,$content);
?>
but it's not working. How to change my code or something wrong in my code ?

There's quite a few problems:
Pattern 0 has the * outside the group, so the group only matches one character
Pattern 0 doesn't include the class= in the group, and the replacement doesn't have it either, so there won't be a class= in the replaced string
Pattern 0 has a space after the class, but there isn't one in the content string
Pattern 1 looks for lang instead of /lang
This will work:
$pattern[0]="/<lang (class=[A-Za-z='\":]*) ?>/";
$replace[0]="<pre $1>";
$pattern[1]="/<\/lang>/";
$replace[1]="</pre>";

How bout without regex? :)
<?php
$content="<lang class='brush:xhtml'>test</lang>";
$content = html_entity_decode($content);
$content = str_replace('lang','pre',$content);
echo $content;
?>

Using preg_replace is a lot faster than str_replace.
$str = preg_replace("/<lang class=([A-Za-z'\":]+)>(.*?)<\/lang>/", "<pre class=$1>$2</pre>", $str);
Execution time: 0.039815s
[preg_replace]
Time: 0.009518s (23.9%)
[str_replace]
Time: 0.030297s (76.1%)
Test Comparison:
[preg_replace]
compared with.........str_replace 218.31% faster
So preg_replace is 218.31% faster than the str_replace method mentioned above. Each tested 1000 times.

Related

how to do echo from a string, only from values that are between a specific stretch[href tag] of the string?

[PHP]I have a variable for storing strings (a BIIGGG page source code as string), I want to echo only interesting strings (that I need to extract to use in a project, dozens of them), and they are inside the quotation marks of the tag
but I just want to capture the values that start with the letter: N (news)
[<a href="/news7044449/exclusive_news_sunday_"]
<a href="/n[ews7044449/exclusive_news_sunday_]"
that is, I think you will have to work with match using: [a href="/n]
how to do that to define that the echo will delete all the texts of the variable, showing only:
note that there are other hrefs tags with values that start with other letters, such as the letter 'P' : href="/profiles... (This does not interest me.)
$string = '</div><span class="news-hd-mark">HD</span></div><p>exclusive_news_sunday_</p><p class="metadata"><span class="bg">Czech AV<span class="mobile-hide"> - 5.4M Views</span>
- <span class="duration">7 min</span></span></p></div><script>xv.thumbs.preparenews(7044449);</script>
<div id="news_31720715" class="thumb-block "><div class="thumb-inside"><div class="thumb"><a href="/news31720715/my_sister_running_every_single_morning"><img src="https://static-hw.xnewss.com/img/lightbox/lightbox-blank.gif"';
I imagine something like this:
$removes_everything_except_values_from_the_href_tag_starting_with_the_letter_n = ('/something regex expresion I think /' or preg_match, substring?);
echo $string = str_replace($removes_everything_except_values_from_the_href_tag_starting_with_the_letter_n,'',$string);
expected output: /news7044449/exclusive_news_sunday_
NOTE: it is not essential to be through a variable, it can be from a .txt file the place where the extracts will be extracted, and not necessarily a variable.
thanks.

I believe this will help her.
<?php
$source = file_get_contents("code.html");
preg_match_all("/<a href=\"(\/n(?:.+?))\"[^>]*>/", $source, $results);
var_export( end($results) );
Step by Step Regex:
Regex Demo
Regex Debugger

To get just the links out of the $results array from Valdeir's answer:
foreach ($results as $r) {
echo $r;
// alt: to display them with an HTML break tag after each one
echo $r."<br>\n";
}

Regex to select url except when = is directly infront of it

I'm trying to use a regex to find and replace all URLs in a forum system. This works but it also selects anything that is within bbcode. This shouldn't be happening.
My code is as follows:
<?php
function make_links_clickable($text){
return preg_replace('!(([^=](f|ht)tp(s)?://)[-a-zA-Zа-яА-Я()0-9#:%_+.~#?&;//=]+)!i', '$1', $text);
}
//$text = "https://www.mcgamerzone.com<br>http://www.mcgamerzone.com/help/support<br>Just text<br>http://www.google.com/<br><b>More text</b>";
$text = "#Theareak We know this and [b][url=https://www.mcgamerzone.com/news/67/False-positive-proxy-bans-and-bot-attacks]here[/url] [/b]is an explanation, we are trying to fix this asap! https://www.mcgamerzone.com/news/67/False-positive-proxy-bans-and-bot-attacks aaa";
echo "<b>Unparsed text:</b><br>";
echo $text;
echo "<br><br>";
echo "<b>Parsed text:</b><br>";
echo make_links_clickable($text);
?>
All urls that occur in bb-code are following up on a = character, meaning that I don't want anything that starts with = to be selected.
I basically have that working but this results in selecting 1 extra character in in front of the string that should be selected.
I'm not very familiar with regex. The final output of my code is this:
<b>Unparsed text:</b><br>
#Theareak We know this and [b][url=https://www.mcgamerzone.com/news/67/False-positive-proxy-bans-and-bot-attacks]here[/url] [/b]is an explanation, we are trying to fix this asap! https://www.mcgamerzone.com/news/67/False-positive-proxy-bans-and-bot-attacks aaa<br>
<br>
<b>Parsed text:</b><br>
#Theareak We know this and [b][url=https://www.mcgamerzone.com/news/67/False-positive-proxy-bans-and-bot-attacks]here[/url] [/b]is an explanation, we are trying to fix this asap! https://www.mcgamerzone.com/news/67/False-positive-proxy-bans-and-bot-attacks aaa

You can match and skip [url=...] like this:
\[url=[^\]]*](*SKIP)(?!)|(((f|ht)tps?://)[-a-zA-Zа-яёЁА-Я()0-9#:%_+.\~#?&;/=]+)
See regex demo
That way, you will only match the URLs outside the [url=...] tag.
IDEONE demo:
function make_links_clickable($text){
return preg_replace('~\[url=[^\]]*](*SKIP)(?!)|(((f|ht)tps?://)[-a-zA-Zа-яёЁА-Я()0-9#:%_+.\~#?&;/=]+)~iu', '$1', $text);
}
$text = "#Theareak We know this and [b][url=https://www.mcgamerzone.com/news/67/False-positive-proxy-bans-and-bot-attacks]here[/url] [/b]is an explanation, we are trying to fix this asap! https://www.mcgamerzone.com/news/67/False-positive-proxy-bans-and-bot-attacks aaa";
echo "<b>Parsed text:</b><br>";
echo make_links_clickable($text);

You can use a negative lookbehind (?<!=) instead of your negated class. It asserts that what is going to be matched isn't preceded by something.
Example

PHP replace : find and replace the same characters with different text

How can I find and replace the same characters in a string with two different characters? I.E. The first occurrence with one character, and the second one with another character, for the entire string in one go?
This is what I'm trying to do (so users need not type html in the body): I've used preg_replace here, but I'll willing to use anything else.
$str = $str = '>>Hello, this is code>> Here is some text >>This is more code>>';
$str = preg_replace('#[>>]+#','[code]',$str);
echo $str;
//output from the above
//[code]Hello, this is code[code] Here is some text [code]This is more code[code]
//expected output
//[code]Hello, this is code[/code] Here is some text [code]This is more code[/code]
But problem here is, both >> get replaced with [code]. Is it possible to somehow replace the first >> with [code] and the second >> with a [/code] for the entire output?
Does php have something to do this in one go? How can this be done?

$str = '>>Hello, this is code>> Here is some text >>This is more code>>';
echo preg_replace( "#>>([^>]+)>>#", "[code]$1[/code]", $str );
The above will fail if something like the following is your input:
>>Here is code >to break >stuff>>
To deal with this, use negative lookahead:
#>>((?!>[^>]).+?)>>#
will be your pattern.
echo preg_replace( "#>>((?!>[^>]).+?)>>#", "[code]$1[/code]", $str );

unable to understand how to match all characters except a given sequence with preg_replace() in php

So what I am trying to do is to match a regular expression which has an opening <p>; tag and a closing &lt/;p> tag.This is the code I wrote:
<?php
$input = "<p&gtjust some text</p&gt more text!";
$input = preg_replace('/<p&gt[^(<\/p&gt)]+?&lt\/;p&gt/','<p>$1</p>',$tem);
echo $input;
?>
So the code does not seem to replace <p&gt with <p> or replace </p&gt with </p>.I think the problem is in the part where I am checking all characters expect '</p&gt. I don't think the code [^(<\/p&gt)] is grouping all the characters correctly. I think it checks if any of the characters are not present and not if the entire group of characters is not present. Please help me out here.

[] in a RegEx is a character group, you can not match strings this way, only characters or unicode codepoints.
If you have escaped HTML entities, you can use htmlspecialchars_decode() to convert them back into characters.
After you have valid HTML, you can use the DOM to to parse, traverse and manipulate it.
How do you parse and process HTML/XML in PHP?

I think i figured it out.Here is the code:
<?php
$input = "<p>text</p>";
$tem = $input;
$tem = htmlspecialchars($input);
$tem = preg_replace('/<p>(.+?)<\/p>/','<p>$1</p>',$tem);
echo $tem;
?>

You don't need to capture the content between p tags, you only need to replace p tags:
$html = preg_replace('~<(/?p)>~', '<$1>', $html);
However, you don't regex too:
$trans = array('<p>' => '<p>', '</p>' => '</p>');
$html = strtr($html, $trans);

At least part of the trouble you're having is probably due to the fact that you seem to be playing fast and loose with the semicolons in your HTML entities. They always start with an ampersand, and end with a semicolon. So it's >, not &gt as you have scattered through your post.
That said, why not use html_entity_decode(), which doesn't require abusing regular expressions?
$string = 'shoop <p>da</p> woop';
echo html_entity_decode($string);
// output: shoop <p>da</p> woop

Regex Replace with Backreference modified by functions

I want to replace the class with the div text like this :
This: <div class="grid-flags" >FOO</div>
Becomes: <div class="iconFoo" ></div>
So the class is changed to "icon". ucfirst(strtolower(FOO)) and the text is removed
Test HTML
<div class="grid-flags" >FOO</div>
Pattern
'/class=\"grid-flags\" \>(FOO|BAR|BAZ)/e'
Replacement
'class="icon'.ucfirst(strtolower($1).'"'
This is one example of a replacement string I've tried out of seemingly hundreds. I read that the /e modifier evaluates the PHP code but I don't understand how it works in my case because I need the double quotes around the class name so I'm lost as to which way to do this.
I tried variations on the backref eg. strtolower('$1'), strtolower('\1'), strtolower('{$1}')
I've tried single and double quotes and various escaping etc and nothing has worked yet.
I even tried preg_replace_callback() with no luck
function callback($matches){
return 'class="icon"'.ucfirst(strtolower($matches[0])).'"';
}

It was difficult for me to try to work out what you meant, but I think you want something like this:
preg_replace('/class="grid-flags" \>(FOO|BAR|BAZ)/e',
'\'class="icon\'.ucfirst(strtolower("$1")).\'">\'',
$text);
Output for your example input:
<div class="iconFoo"></div>
If this isn't what you want, could you please give us some example inputs and outputs?
And I have to agree that this would be easier with an HTML parser.

Instead of using the e(valuate) option you can use preg_replace_callback().
$text = '<div class="grid-flags" >FOO</div>';
$pattern = '/class="grid-flags" >(FOO|BAR|BAZ)/';
$myCB = function($cap) {
return 'class="icon'.ucfirst($cap[1]).'" >';
};
echo preg_replace_callback($pattern, $myCB, $text);
But instead of using regular expressions you might want to consider a more suitable parser for html like simple_html_dom or php's DOM extension.

This works for me
$html = '<div class="grid-flags" >FOO</div>';
echo preg_replace_callback(
'/class *= *\"grid-flags\" *\>(FOO|BAR|BAZ)/'
, create_function( '$matches', 'return \'class="icon\' . ucfirst(strtolower($matches[1])) .\'">\'.$matches[1];' )
, $html
);
Just be aware of the problems of parsing HTML with regex.

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.

PHP Regular Expression to convert html entities to their respective characters - php

How bout without regex? :) <?php $content="<lang class='brush:xhtml'>test</lang>"; $content = html_entity_decode($content); $content = str_replace('lang','pre',$content); echo $content; ?>

Related

how to do echo from a string, only from values that are between a specific stretch[href tag] of the string?

Regex to select url except when = is directly infront of it

PHP replace : find and replace the same characters with different text

unable to understand how to match all characters except a given sequence with preg_replace() in php

Regex Replace with Backreference modified by functions

Categories

Resources