Reading and encoding html


I am trying to read and display the title (the content of the h1 tag) of many HTML files, all stored in the same folder.
This is what the HTML files look like:
<!DOCTYPE html PUBLIC '-//W3C//DTD HTML 4.01//EN'>
<html>
<head>
<title>A title</title>
<style type='text/css'>
... Styles here ...
</style>
</head>
<body>
<h1>Être aidant</h1>
<p>En général, les aidants doivent équilibrer...</p>
... more tags ...
</body>
I have tried to display the content from the H1 tag with this PHP script :
<?php
foreach (glob("test/*.html") as $file) {
$file_handle = fopen($file, "r");
$doc = new DOMDocument();
$doc->loadHTMLfile($file);
$title = $doc->getElementsByTagName('h1');
if ( $title && 0<$title->length ) {
$title = $title->item(0);
$content = $doc->savehtml($title);
echo $content;
}
fclose($file_handle);
}
?>
But the output contains wrong characters. For the example file, the output is :
ÃŠtre aidant
How can I achieve this output?
Être aidant

You should state a charset in the <head> of your HTML document.
<meta charset="utf-8">

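The content needs to be handled as UTF-8 from the moment it is parsed. Changing echo $content to echo utf8_encode($content); is unlikely to help: utf8_encode() converts ISO-8859-1 to UTF-8, so it would re-encode text that was already decoded with the wrong charset, and the function is deprecated as of PHP 8.2.
Tell DOMDocument the encoding before it parses the markup instead, either with a charset declaration in the document or with an encoding hint prepended to the string passed to loadHTML().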
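A minimal sketch of the encoding-hint approach, assuming the files are saved as UTF-8. The helper name firstHeadingText is hypothetical, and the prepended XML declaration is only a commonly used hint for libxml, not part of the original script:

```php
<?php
// Hypothetical helper: return the text of the first <h1> in a UTF-8 HTML string.
function firstHeadingText(string $html): ?string
{
    $doc = new DOMDocument();
    // Collect parser warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    // Assumption: the markup is UTF-8. The declaration tells libxml so.
    $doc->loadHTML('<?xml encoding="UTF-8">' . $html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $headings = $doc->getElementsByTagName('h1');
    // textContent is returned as UTF-8, without the surrounding tags.
    return $headings->length > 0 ? $headings->item(0)->textContent : null;
}

// Same folder layout as in the question.
foreach (glob('test/*.html') ?: [] as $file) {
    $html = file_get_contents($file);
    if ($html === false) {
        continue; // unreadable file
    }
    $title = firstHeadingText($html);
    if ($title !== null) {
        echo $title, PHP_EOL;
    }
}
```

The page that prints the titles still has to be served as UTF-8 for the browser to display them correctly.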

Related

dompdf exclude menu and footer

Hi,
I use dompdf, and the generated PDF includes the menu and footer of the parent page. I don't know where to edit the code to exclude them.
Controller code:
function export_invoice($param1 = 'export' ) {
$page_data['action'] = $param1;
$page_data['page_name'] = 'export_invoice';
$page_data['page_title'] = get_phrase('export_invoice');
$this->load->view('frontend/'.get_frontend_settings('theme').'/index', $page_data);
$html1 =$this->load->view('frontend/'.get_frontend_settings('theme').'/index',$page_data, true);
$html = mb_convert_encoding($html1, 'HTML-ENTITIES', 'UTF-8');
$this->pdf->loadHtml($html);
$this->pdf->set_paper("a5", "portrait" ); //landscape
$this->pdf->render();
// FILE DOWNLOADING CODES
$url = current_url();
$str = substr(strrchr($url, '/'), 1);
$str1=$str;
$fileName = 'Invoice-'.$str.'.pdf';
$this->pdf->stream($fileName, array("Attachment" => 0)); // originally 1 to force a download; 0 previews in the browser
}
And view:
<html>
<head>
<meta charset="utf-8">
<title>PDF Output</title>
<style>
/* your style here */
</style>
</head>
<body>
{contents}
</body>
</html>

I solved it with CSS. Hiding the menu works with:
.menu-area{display: none;}
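As an alternative to hiding the elements with CSS, the unwanted blocks could be removed from the HTML before it is handed to dompdf. This is only a sketch: the helper name stripByClass is hypothetical, and it assumes the menu carries the menu-area class used above and that the markup is UTF-8.

```php
<?php
// Hypothetical helper: remove every element that carries one of the given CSS classes.
function stripByClass(string $html, array $classes): string
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    // Assumption: the markup is UTF-8; the declaration is a hint for libxml.
    $doc->loadHTML('<?xml encoding="UTF-8">' . $html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    foreach ($classes as $class) {
        // Match the class as a whole word inside the class attribute.
        $query = sprintf(
            '//*[contains(concat(" ", normalize-space(@class), " "), " %s ")]',
            $class
        );
        // Take a snapshot of the matches, then detach each one from its parent.
        foreach (iterator_to_array($xpath->query($query)) as $node) {
            $node->parentNode->removeChild($node);
        }
    }
    return $doc->saveHTML($doc->documentElement);
}

$html = '<html><body><div class="menu-area">Menu</div><p>Invoice #1</p></body></html>';
$clean = stripByClass($html, ['menu-area']);
echo $clean, PHP_EOL;
```

In the controller from the question, the cleaned string would be what gets passed to $this->pdf->loadHtml().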

php getimagesize with persian file name

I'm trying to write a Joomla plugin that adds width and height attributes to each <img> in an HTML file.
Some image file names are Persian, and getimagesize fails on them.
This is the code:
@$dom->loadHTML('<?xml version="1.0" encoding="UTF-8"?>' . "\n" . '
<!DOCTYPE html>
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body>
<img src="images\banners\س.jpg" style="max-width: 90%;" >
</body>
</html>
');
$x = new DOMXPath($dom);
foreach($x->query("//img") as $node)
{
$imgtag = $node->getAttribute("src");
$imgtag = pathinfo($imgtag);
$imgtag = $imgtag['dirname'].'\\'.$imgtag['basename'];
$imgtag = getimagesize($imgtag);
$node->setAttribute("width",$imgtag[0]);
$node->setAttribute("height",$imgtag[1]);
}
$newHtml = urldecode($dom->saveHtml($dom->documentElement));
When the file name contains Persian characters, getimagesize shows:
Warning: getimagesize(images\banners\س.jpg): failed to open stream: No such file or directory in C:\wamp64\www\plugin.php
How can I solve this?

Thanks to all.
I couldn't get it to work on a WAMP server (a local server on Windows),
but after I migrated to a Linux server, this code finally worked properly.
$html = $app->getBody();
setlocale(LC_ALL, '');
$dom = new DOMDocument();
@$dom->loadHTML($html);
$x = new DOMXPath($dom);
foreach($x->query("//img") as $node)
{
$imgtag = $node->getAttribute("src");
if(strpos($imgtag,"data:image")===false)
{
$imgtag = getimagesize($imgtag);
$node->setAttribute("width",$imgtag[0]);
$node->setAttribute("height",$imgtag[1]);
}
}
$bodytag = $x->query("//body");
$node = $dom->createElement("script", ' /* java script which may be necessary on client */ ');
$bodytag[0]->appendChild($node);
$html = '<!DOCTYPE html>'."\n" . $dom->saveHtml($dom->documentElement);
Some hints:
The code shouldn't touch base64 image sources, so I added a condition for that.
If a script (or anything else: div, p, ...) should be added to the body tag, you can use the appendChild method.
<!DOCTYPE html> should be prepended to the final DOM output :)
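The loop above assumes getimagesize always succeeds, but it returns false when a file cannot be opened, which is what the warning in the question shows. A hedged variant that skips such images could look like this. The helper name addImageSizes and the temporary demo file are illustrative only:

```php
<?php
// Hypothetical helper: add width/height to <img> tags whose files can be read.
function addImageSizes(string $html): string
{
    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    // Assumption: the markup is UTF-8; the declaration is a hint for libxml.
    $dom->loadHTML('<?xml encoding="UTF-8">' . $html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($dom->getElementsByTagName('img') as $img) {
        $src = $img->getAttribute('src');
        // Leave empty sources and inline base64 images alone.
        if ($src === '' || strpos($src, 'data:image') === 0) {
            continue;
        }
        // getimagesize() returns false when the file cannot be read.
        $size = @getimagesize($src);
        if ($size === false) {
            continue;
        }
        $img->setAttribute('width', (string) $size[0]);
        $img->setAttribute('height', (string) $size[1]);
    }
    return '<!DOCTYPE html>' . "\n" . $dom->saveHTML($dom->documentElement);
}

// Demo: a 1x1 GIF written to a temporary file, plus one image that does not exist.
$gif = base64_decode('R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7');
$path = sys_get_temp_dir() . DIRECTORY_SEPARATOR . 'pixel_demo.gif';
file_put_contents($path, $gif);
$result = addImageSizes(
    '<html><body><img src="' . $path . '"><img src="missing.png"></body></html>'
);
echo $result, PHP_EOL;
unlink($path);
```

Skipping unreadable files keeps one bad file name from writing empty width and height attributes into the page.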

Change stylesheet attribute inside iframe using DOMDocument class

I'm using the DOMDocument class to change a link tag attribute when the page is loaded.
The tag that I want to manipulate is inside an iframe block, so my code has no effect in that case.
Here's the code:
$page_content = file_get_contents($page_link);
$dom = new DOMDocument();
$dom->loadHtml($page_content);
$links = $dom->getElementsByTagName('link');
foreach( $links as $k => $link ){
if( $link->getAttribute('rel') === 'stylesheet' ){
$link->setAttribute('rel', 'test'); // just for testing
}
}
$newHtml = $dom->saveHtml();
echo $newHtml;
But only the <link> in the main document is affected, while the <link> inside the iframe block isn't:
<head>
<link rel="test" href="style.css"/>
</head>
<body>
.
.
.
.
<iframe>
<head>
<link rel="stylesheet" href="style.css"/><!--"rel" stays the same-->
</head>
</iframe>
.
.
.
</body>
<footer>
</footer>
I'd much appreciate your help!

Can't read arabic txt file in PHP

I have many text (.txt) files that I want to read with PHP and display in a web browser. Some of the files are in Arabic.
I am using the file_get_contents function to read the files, but I can't get the proper result.
Here is a sample of my input and output.
Input Text ===> لة إلى الشعب الأردني العزيز والى شعوب العالم الحر والى المنظمات الدولية للحرية وحقوق الإنسان والى معاقل الديم
Output Text ===> J2 H'DI 49H( 'D9'DE 'D-1 H'DI 'DEF8E'* 'D/HDJ) DD-1J) H-BHB 'D%F3'F H'DI E9'BD 'D/JEHB
My page already uses the UTF-8 charset. I have also tried the fopen function, with the same result.
What am I missing?

This works for me:
<html>
<head>
<!-- <link rel="stylesheet" href="css/style.css" /> -->
<meta content="text/html; charset=utf-8" http-equiv="Content-Type" />
</head>
<body>
<?php
$file = "arabic.txt";
$data = file_get_contents($file);
echo $data; ?>
</body>
</html>
Here arabic.txt is saved with UTF-8 encoding.
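If a file turns out not to be saved as UTF-8, it could be converted as it is read. The sketch below is an assumption-heavy illustration: the helper name toUtf8 is hypothetical, it relies on the mbstring extension, and the ISO-8859-6 fallback is only a guess that has to be replaced with the real encoding of the files.

```php
<?php
// Hypothetical helper: return the given bytes as UTF-8 text.
// $fallback is an assumption about legacy files; set it to their real encoding.
function toUtf8(string $raw, string $fallback = 'ISO-8859-6'): string
{
    if (strncmp($raw, "\xEF\xBB\xBF", 3) === 0) {
        return substr($raw, 3); // UTF-8 with a byte order mark
    }
    if (strncmp($raw, "\xFF\xFE", 2) === 0) {
        return mb_convert_encoding(substr($raw, 2), 'UTF-8', 'UTF-16LE');
    }
    if (strncmp($raw, "\xFE\xFF", 2) === 0) {
        return mb_convert_encoding(substr($raw, 2), 'UTF-8', 'UTF-16BE');
    }
    if (mb_check_encoding($raw, 'UTF-8')) {
        return $raw; // already valid UTF-8
    }
    return mb_convert_encoding($raw, 'UTF-8', $fallback);
}

// Demo: a UTF-16LE string with a byte order mark, as some editors save it.
$sample = "\xFF\xFE" . mb_convert_encoding('مرحبا', 'UTF-16LE', 'UTF-8');
echo toUtf8($sample), PHP_EOL;
```

With real files, the call would be toUtf8(file_get_contents($file)) in place of the plain file_get_contents($file).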

You can try this:
<html>
<head>
<meta http-equiv='Content-Type' content='text/html; charset=UTF-8'/>
</head>
<body>
<?php
// put your files in this folder
$path = 'D:\test';
$files = scandir($path);
foreach ($files as $key => $value) {
    // skip the directory entries "." and ".."
    if ($value != "." && $value != "..") {
        echo file_get_contents($path . "/" . $value);
    }
}
?>
</body>
</html>
See also: How to print all the txt files inside a folder with PHP?

how to access DOM in php that will echo out everything between <html></html> [duplicate]

Possible Duplicate:
Export particular element in DOMDocument to string
I know how to access different elements by id, but I don't know how to get everything between the opening and closing html tags. Can anyone please help me?
Thanks.

If you would like to parse an HTML page with PHP, you could use PHP's DOMDocument class, like this:
// a new DOM object
$dom = new DOMDocument;
// keep white space (set before loading so it can take effect)
$dom->preserveWhiteSpace = true;
// nicely format output
$dom->formatOutput = true;
// load the HTML into the object
$dom->loadHTML($html);
// the root <html> element
$htmlRootElement = $dom->documentElement;
// serialize only that element, escaped for display in a page
echo htmlspecialchars($dom->saveHTML($htmlRootElement), ENT_QUOTES);
Or you could do this with JavaScript on the client side:
// getElementsByTagName returns a collection, so take its first item
var htmlRootElement = document.getElementsByTagName("html")[0];
alert(htmlRootElement.innerHTML);

You can access each element in the <html> tag with the DOMDocument class.
Example
$htmlDoc = new DOMDocument;
$html = <<<HTML
<!doctype html>
<html>
<head>
<meta charset="utf-8">
<title>My Site</title>
<meta name="description" content="DOM test">
</head>
<body>
<h1>Hello</h1>
<p>This is a DOM test</p>
</body>
</html>
HTML;
$htmlDoc->loadHTML($html);
$htmlElement = $htmlDoc->getElementsByTagName("html");
foreach ($htmlElement->item(0)->childNodes as $element) {
echo 'Element name: ' . $element->nodeName . PHP_EOL;
echo 'Element value: '. $element->nodeValue . PHP_EOL;
}
