How to match and add a class name using preg_replace? - php

I'm trying to match the class attribute of the <html> tag and add a class name to it using preg_replace().
Here is what I have tried so far:
$content = '<!DOCTYPE html><html lang="en" class="dummy"><head></head><body></body></html>';
$pattern = '/< *html[^>]*class *= *["\']?([^"\']*)/i';
if(preg_match($pattern, $content, $matches)){
$content = preg_replace($pattern, '<html class="$1 my-custom-class">', $content);
}
echo htmlentities($content);
But this is what I get back:
<!DOCTYPE html><html class="dummy my-custom-class">"><head></head><body></body></html>
The lang="en" attribute is dropped, and the original "> is left behind after the new tag, so the closing characters are duplicated (">">). Please help me.

Please try this code; it works perfectly well :)
<?php
$content = '<!DOCTYPE html><html lang="en" class="dummy"><head></head><body></body></html>';
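$pattern = '/(<html[^>]*class="([^"]+)"[^>]*>)/i';
$callback_fn = 'process';
$content = preg_replace_callback($pattern, $callback_fn, $content);
// $matches[1] is the whole <html ...> tag, $matches[2] is the current class value.
function process($matches) {
    // Rewrite only the class attribute so the rest of the tag is left alone.
    return str_replace('class="' . $matches[2] . '"', 'class="' . $matches[2] . ' my-custom-class"', $matches[1]);
}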
echo htmlentities($content);
?>

For the regex approach, remove the " *" after the "<" in your pattern.
Use this pattern:
/<html[^>]*class *= *["\']?([^"\']*)/i
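Whatever pattern is used, the match starts at <html and stops right after the class value, so the replacement has to put back everything that was matched before the class value. Here is a minimal sketch of that idea, assuming the class attribute is quoted. The two-group pattern and the `$result` variable are my own illustration, not part of the answer above.

```php
<?php
$content = '<!DOCTYPE html><html lang="en" class="dummy"><head></head><body></body></html>';

// Group 1: "<html" up to and including the opening quote of the class attribute.
// Group 2: the existing class value (may be empty).
$pattern = '/(<html\b[^>]*\bclass\s*=\s*["\'])([^"\']*)/i';

// Re-emit both groups and append the new class; the closing quote and ">" are
// outside the match, so they stay where they are.
$result = preg_replace($pattern, '$1$2 my-custom-class', $content, 1);

echo htmlentities($result);
```

Because the closing quote and ">" are never part of the match, nothing is duplicated and lang="en" survives.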
I suggest using a DOM parser to parse the HTML:
<?php
libxml_use_internal_errors(true);
$html="<!DOCTYPE html><html lang='en' class='dummy'><head></head><body></body></html>";
$dom = new DOMDocument;
$dom->loadHTML($html);
foreach ($dom->getElementsByTagName('html') as $node) {
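// Append to whatever class value is already there instead of hard-coding it.
$node->setAttribute('class', trim($node->getAttribute('class') . ' my-custom-class'));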
}
$html=$dom->saveHTML();
echo $html;
OUTPUT:
<!DOCTYPE html>
<html lang="en" class="dummy my-custom-class"><head></head><body></body></html>
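If the code may run more than once on the same markup, it can help to check whether the class is already present. The following is only a sketch: `addHtmlClass()` is a hypothetical helper name, and it assumes the input is a full HTML document with an <html> element.

```php
<?php
// Hypothetical helper: add $class to the <html> element unless it is already there.
function addHtmlClass(string $html, string $class): string
{
    $previous = libxml_use_internal_errors(true); // silence parser warnings
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $root = $dom->documentElement; // the <html> element

    // Split the current class attribute into individual class names.
    $classes = preg_split('/\s+/', $root->getAttribute('class'), -1, PREG_SPLIT_NO_EMPTY);
    if (!in_array($class, $classes, true)) {
        $classes[] = $class;
    }
    $root->setAttribute('class', implode(' ', $classes));

    return $dom->saveHTML();
}

$html = '<!DOCTYPE html><html lang="en" class="dummy"><head></head><body></body></html>';
$once = addHtmlClass($html, 'my-custom-class');
$twice = addHtmlClass($once, 'my-custom-class'); // second call changes nothing
echo $twice;
```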

Related

Replace custom tag with anchor tag using preg_replace

I want to replace <cast>Test Cast</cast> with a link to casts/test-cast (lowercase, spaces replaced by hyphens) whose text is Test Cast.
function replace_synopsis_tags($short_synopsis) {
$pattern = '/<cast>(.+?)<\/cast>/i';
$replacement = "<a href='".base_url()."casts/".str_replace(" ","-",strtolower("$1"))."'>$1</a>";
$short_synopsis = preg_replace($pattern, $replacement, $short_synopsis);
return $short_synopsis;
}
$synopsis = "<cast>Test Cast</cast>";
echo replace_synopsis_tags($synopsis);
What is returned instead is a link whose href still contains Test Cast unchanged (mixed case, with the space).
How do I solve this?
You can use DOMDocument; it's a much better fit for the job.
Online Eval : https://3v4l.org/0l9hT
$html = "
<!DOCTYPE html>
<html>
<body>
<cast>cast-test</cast>
<cast>cast two !</cast>
</body>
</html>";
function castTags(string $html)
{
$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML(mb_convert_encoding($html, 'HTML-ENTITIES', 'UTF-8'));
libxml_clear_errors();
$casts = $dom->getElementsByTagName('cast');
while($cast = $casts->item(0)) {
$value = $cast->nodeValue;
$link = $dom->createElement('a');
$link->setAttribute('href', "www.example.com/cast/" . rawurlencode(str_replace(' ','-',strtolower($value))));
$link->nodeValue = $value;
$cast->parentNode->replaceChild($link, $cast);
}
return $dom->saveHTML();
}
echo castTags($html);
// <!DOCTYPE html>
// <html>
// <body>
// <a href="www.example.com/cast/cast-test">cast-test</a>
// <a href="www.example.com/cast/cast-two-%21">cast two !</a>
// </body>
// </html>
On PHP 5.x you could add the e modifier, which evaluates the replacement as PHP code (so the replacement would have to be written as a PHP expression), although it raises E_DEPRECATED as of PHP 5.5. PHP 7 no longer supports the e modifier, so there you need preg_replace_callback() instead.
Your script can be updated to use preg_replace_callback() for compatibility with PHP 7:
function replace_synopsis_tags($short_synopsis) {
$pattern = '/<cast>(.+?)<\/cast>/i';
$replacement = function($matches) { return "<a href='".base_url()."casts/".str_replace(" ","-",strtolower($matches[1]))."'>".$matches[1]."</a>"; };
$short_synopsis = preg_replace_callback($pattern, $replacement, $short_synopsis);
return $short_synopsis;
}
$synopsis = "<cast>Test Cast</cast>";
echo replace_synopsis_tags($synopsis);
From the changelog of preg_replace():
As of PHP 5.5.0 E_DEPRECATED level error is emitted when passing in the "\e" modifier. As of PHP 7.0.0 E_WARNING is emitted in this case and "\e" modifier has no effect.
In PHP 5.4, you could have used this pattern:
$pattern = '/<cast>(.+?)<\/cast>/ie'; // with trailing e
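The reason the original attempt fails is that strtolower("$1") runs on the literal string "$1" before preg_replace() ever sees a match; a callback runs per match instead. Here is a self-contained sketch of the callback version. `replaceCastTags()` and the `$baseUrl` parameter are placeholders of mine standing in for the question's function and base_url().

```php
<?php
// $baseUrl stands in for base_url() from the question (an assumption).
function replaceCastTags(string $text, string $baseUrl): string
{
    return preg_replace_callback(
        '/<cast>(.+?)<\/cast>/i',
        function (array $m) use ($baseUrl): string {
            // Build the slug from the actual match, not from the literal "$1".
            $slug = str_replace(' ', '-', strtolower($m[1]));
            $href = $baseUrl . 'casts/' . rawurlencode($slug);
            return '<a href="' . htmlspecialchars($href, ENT_QUOTES) . '">' . $m[1] . '</a>';
        },
        $text
    );
}

$link = replaceCastTags('<cast>Test Cast</cast>', 'http://example.com/');
echo $link;
// <a href="http://example.com/casts/test-cast">Test Cast</a>
```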

How to change the href (url) in a link (a) element?

Here is my full link:
<a href="http://localhost/mysite/client-portal/">Client Portal</a>
I want the link above to look like this:
<a href="#popup">Client Portal</a>
I really don't know how to use preg_replace to get this done.
preg_replace('\/localhost\/mysite\/client-portal\/', '#popup', $output)
If it is only this one link, you can achieve your goal with str_replace():
<?php
$link = '<a href="http://localhost/mysite/client-portal/">Client Portal</a>';
$href = 'http://localhost/mysite/client-portal/';
$new_href = '#popup';
$new_link = str_replace($href, $new_href, $link);
echo $new_link;
?>
Output:
<a href="#popup">Client Portal</a>
If you want you can use DOM:
<?php
$link = '<a href="http://localhost/mysite/client-portal/">Client Portal</a>';
$new_href = '#popup';
$doc = new DOMDocument;
$doc->loadHTML($link);
foreach ($doc->getElementsByTagName('a') as $link) {
$link->setAttribute('href', $new_href);
}
echo $doc->saveHTML();
?>
Output:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><a href="#popup">Client Portal</a></body></html>
Or you can use preg_replace() like this:
<?php
$link = '<a href="http://localhost/mysite/client-portal/">Client Portal</a>';
$new_href = '#popup';
$regex = "((https?|ftp)\:\/\/)?"; // SCHEME
$regex .= "(localhost)"; // Host or IP
$regex .= "(\/([a-z0-9+\$_-]\.?)+)*\/?"; // Path
$pattern = "/$regex/";
$newContent = preg_replace($pattern, $new_href, $link);
echo $newContent;
?>
Output:
<a href="#popup">Client Portal</a>
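A narrower variant of the preg_replace() approach is to match one known URL only when it appears inside an href attribute. This is a sketch under that assumption; `$old` and the pattern are my own, and preg_quote() is used so the slashes and dots in the URL are treated literally.

```php
<?php
$link = '<a href="http://localhost/mysite/client-portal/">Client Portal</a>';
$old = 'http://localhost/mysite/client-portal/';
$new_href = '#popup';

// Group 1: "<a ... href=" plus the opening quote. Group 2: the closing quote.
$pattern = '/(<a\b[^>]*\bhref\s*=\s*["\'])' . preg_quote($old, '/') . '(["\'])/i';

// ${1} keeps the start of the tag, then the new href, then the closing quote.
$result = preg_replace($pattern, '${1}' . $new_href . '$2', $link);

echo $result;
// <a href="#popup">Client Portal</a>
```

Text elsewhere on the page that merely mentions the URL is left alone, because the pattern requires the href= prefix.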
You can also do this with jQuery:
<script src="https://code.jquery.com/jquery-1.10.2.js"></script>
<a class="popupClass" href="http://localhost/mysite/client-portal/">Client Portal</a>
$(document).ready(function(){
$('.popupClass').attr('href', '#popup');
});

Simple_Html_Dom how to parse chinese character

I would like to crawl data from the Taobao site.
<!DOCTYPE html>
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title></title>
</head>
<body>
<?php
include_once('simple_html_dom.php');
$target_url = "http://item.taobao.com/item.htm?spm=a2106.m893.1000384.54.61Q4Fp&id=37676614376&_u=fm86qe4d813&scm=1029.newlist-0.1.50006843&ppath=&sku=&ug=#detail";
$html = new simple_html_dom();
$html->load_file($target_url);
foreach ($html->find('h3[class=tb-main-title]') as $post) {
echo html_entity_decode($post, ENT_QUOTES, "ISO-8859-1") . "<br />";
}
?>
</body>
</html>
But it displays the product title like this:
2014��ЬŮʿ�������¿��ϸ��ƽ���ļ��¿����ϴ���ƽ����Ь��
To avoid that, use the iconv() function to convert the text from the page's encoding (gb2312 here) to UTF-8 before printing it. Consider this example:
include 'simple_html_dom.php';
$target_url = "http://item.taobao.com/item.htm?spm=a2106.m893.1000384.54.61Q4Fp&id=37676614376&_u=fm86qe4d813&scm=1029.newlist-0.1.50006843&ppath=&sku=&ug=#detail";
$contents = file_get_contents($target_url);
$html = str_get_html($contents);
foreach($html->find('h3[class=tb-main-title]') as $post) {
$text = $post->innertext;
$text = iconv('gb2312', 'utf-8', $text);
echo $text;
// 2014拖鞋女士人字拖新款豹纹细带平底夏季新款凉拖大码平底拖鞋潮
}
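To see what the conversion does without fetching a live page, here is a small round trip. It is only a sketch: the GBK bytes are produced locally to simulate the downloaded content, and I use GBK rather than gb2312 on the assumption that a superset is the safer guess when the exact charset is unknown.

```php
<?php
$original = '2014拖鞋'; // UTF-8 text, as typed in this source file

// Simulate what file_get_contents() would return from a GBK-encoded page.
$gbkBytes = iconv('UTF-8', 'GBK', $original);

// Echoing $gbkBytes on a UTF-8 page is what produces the garbled title.
// Converting back to UTF-8 first makes it readable again.
$utf8 = iconv('GBK', 'UTF-8', $gbkBytes);

echo $utf8;
```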

How to convert <textarea> output to clean html and write to file?

My script writes the content of a <textarea> to a text file:
<!DOCTYPE html>
<html <?php language_attributes(); ?>>
<head> etc
Is there any way I can convert the output to clean HTML so that it looks like this:
<!DOCTYPE html>
<html <?php language_attributes(); ?>>
<head>
etc :)
$file = 'wp.txt';
$regex = '/<textarea name="example" id="newcontent">(.*?)<\/textarea>/s';
if ( preg_match($regex, $page, $list) )
echo $list[0];
else
print "Error";
$file = 'wp.txt';
file_put_contents($file, $list, FILE_APPEND | LOCK_EX);
Thanks!
html_entity_decode
http://php.net/manual/en/function.html-entity-decode.php
That should do the trick.
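Use the html_entity_decode() function, and write the decoded string (not the $list array) to the file:
$file = 'wp.txt';
$regex = '/<textarea name="example" id="newcontent">(.*?)<\/textarea>/s';
if ( preg_match($regex, $page, $list) ) {
    // $list[1] is the text between the tags; decode entities such as &lt; and &gt;
    $clean = html_entity_decode($list[1]);
    echo $clean;
    file_put_contents($file, $clean, FILE_APPEND | LOCK_EX);
} else {
    print "Error";
}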
You'll need the html_entity_decode() function.
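Here is a self-contained sketch of the whole step, capture then decode then write. The `$page` string is a made-up stand-in for the fetched page (the question does not show it), and a temporary file is used instead of wp.txt so the example has no side effects in the working directory.

```php
<?php
// Stand-in for the fetched page; the real $page is not shown in the question.
$page = '<textarea name="example" id="newcontent">&lt;!DOCTYPE html&gt;' . "\n"
      . '&lt;head&gt;</textarea>';
$regex = '/<textarea name="example" id="newcontent">(.*?)<\/textarea>/s';

$clean = null;
$file = tempnam(sys_get_temp_dir(), 'wp'); // temporary file instead of wp.txt

if (preg_match($regex, $page, $list)) {
    // $list[1] holds the text between the tags, still entity-encoded.
    $clean = html_entity_decode($list[1], ENT_QUOTES | ENT_HTML5, 'UTF-8');
    file_put_contents($file, $clean, FILE_APPEND | LOCK_EX);
} else {
    print "Error";
}
```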

How to avoid DOM parsing adding html doctype, <head> and <body> tags?

<?
$string = '
Some photos<br>
<span class="naslov_slike">photo_by_ile_IMG_1676-01</span><br />
<span class="naslov_slike">photo_by_ile_IMG_1699-01</span><br />
<span class="naslov_slike">photo_by_ile_IMG_1697-01</span><br />
<span class="naslov_slike">photo_by_ile_IMG_1695-01</span><br />
';
$dom = new DOMDocument();
$dom->loadHTML($string);
$dom->preserveWhiteSpace = false;
$elements = $dom->getElementsByTagName('span');
$spans = array();
foreach($elements as $span) {
$spans[] = $span;
}
foreach($spans as $span) {
$span->parentNode->removeChild($span);
}
echo $dom->saveHTML();
?>
I'm using this code to parse strings. When the string is returned, it has some added tags:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><p>Some photos<br><br><br><br><br></p></body></html>
Is there any way to avoid this and to have clean string returned? This input string is just for example, in usage it can be any html string.
PHP versions since 5.4, when compiled with Libxml 2.6.0 or later, can use the options parameter of DOMDocument::loadHTML(). With it you can do this:
$dom = new DomDocument();
$dom->loadHTML($string, LIBXML_HTML_NODEFDTD | LIBXML_HTML_NOIMPLIED);
// do stuff
echo $dom->saveHTML();
We pass two libxml constants: LIBXML_HTML_NODEFDTD says not to add a document type definition, and LIBXML_HTML_NOIMPLIED says not to add implied elements like <html> and <body>.
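One caveat I would hedge on: with LIBXML_HTML_NOIMPLIED the parser tends to expect a single root element, so a fragment that starts with bare text may come back restructured. A common workaround, sketched below, is to wrap the fragment in a temporary <div> and serialize only its children. The wrapper and the shortened `$string` are my own assumptions for illustration.

```php
<?php
$string = 'Some photos<br>'
    . '<span class="naslov_slike">photo_by_ile_IMG_1676-01</span><br />'
    . '<span class="naslov_slike">photo_by_ile_IMG_1699-01</span><br />';

$previous = libxml_use_internal_errors(true);
$dom = new DOMDocument();
// The temporary <div> gives the parser exactly one root element.
$dom->loadHTML('<div>' . $string . '</div>', LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
libxml_clear_errors();
libxml_use_internal_errors($previous);

// Copy the live node list to an array before removing nodes from the tree.
foreach (iterator_to_array($dom->getElementsByTagName('span')) as $span) {
    $span->parentNode->removeChild($span);
}

// Serialize the wrapper's children one by one, so the <div> itself is dropped.
$clean = '';
foreach ($dom->documentElement->childNodes as $child) {
    $clean .= $dom->saveHTML($child);
}

echo $clean;
```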
I'm actually looking for the same solution. I've been using the following method, but the <p> around the text node is still added when you call loadHTML(). I don't think there's a way around that without using another parser, unless there's some hidden flag that tells it not to do that.
This code:
<?php
function innerHTML($node){
$doc = new DOMDocument();
foreach ($node->childNodes as $child)
$doc->appendChild($doc->importNode($child, true));
return $doc->saveHTML();
}
$string = '
Some photos<br>
<span class="naslov_slike">photo_by_ile_IMG_1676-01</span><br />
<span class="naslov_slike">photo_by_ile_IMG_1699-01</span><br />
<span class="naslov_slike">photo_by_ile_IMG_1697-01</span><br />
<span class="naslov_slike">photo_by_ile_IMG_1695-01</span><br />
';
$dom = new DOMDocument();
$dom->preserveWhiteSpace = false;
$dom->loadHTML($string);
$elements = $dom->getElementsByTagName('span');
$spans = array();
foreach($elements as $span) {
$spans[] = $span;
}
foreach($spans as $span) {
$span->parentNode->removeChild($span);
}
echo innerHTML( $dom->documentElement->firstChild );
Will output:
<p>Some photos<br><br><br><br><br></p>
Of course, this solution does not keep the markup 100% intact, but it's close.
After using loadHTML, you can do this:
# loadHTML causes a !DOCTYPE tag to be added, so remove it:
$dom->removeChild($dom->firstChild);
# it also wraps the code in <html><body></body></html>, so remove that:
$dom->replaceChild($dom->firstChild->firstChild->firstChild, $dom->firstChild);
The !DOCTYPE tag will be removed, and the first tag inside the body tag will replace the html tag.
Obviously, this will only work if you're only interested in the first tag inside the body, as I was when I encountered this problem. But this example could be adapted to copy everything inside the body with a little bit of effort.
Edit: Meh, nevermind. I like meder's solution.
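For completeness, copying everything inside the body (rather than only its first child) could look roughly like this. `bodyInnerHtml()` is a hypothetical helper name; it relies on saveHTML() accepting a node argument, which it does since PHP 5.3.6.

```php
<?php
// Hypothetical helper: parse $html and return only what ends up inside <body>.
function bodyInnerHtml(string $html): string
{
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html); // adds DOCTYPE, <html> and <body> as usual
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $body = $dom->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return '';
    }

    // Serialize each child of <body> individually and join the pieces.
    $inner = '';
    foreach ($body->childNodes as $child) {
        $inner .= $dom->saveHTML($child);
    }
    return $inner;
}

$result = bodyInnerHtml('<div id="a">one</div><div id="b">two</div>');
echo $result;
```

The added wrapper tags are still created during parsing; they are simply skipped when the output is assembled.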
You could always just use a regex to strip that first bit out:
echo preg_replace("/<!DOCTYPE [^>]+>/", "", $dom->saveHTML());
From the manual:
http://php.net/manual/en/domdocument.savehtml.php
$html_fragment = preg_replace('/^<!DOCTYPE.+?>/', '', str_replace( array('<html>', '</html>', '<body>', '</body>'), array('', '', '', ''), $dom->saveHTML()));
Works for me.
I'm not sure if either of these will actually work, but you could try using DOMImplementation::createDocument when constructing your DOMDocument - the third argument is the DOCTYPE you wish to use.
Also, instead of saveHTML(), you could try saveXML()
