Hebrew chars from PDF file show gibberish using PHP

I'm trying to extract text from a PDF file that contains Hebrew and manipulate it, but when I echo it, it shows these characters instead of Hebrew:
Ço̬mÀÃ6ÜÍzWÃýCW¶°ÐÞ]Aµ±¸¤:ÄÞ[JÞaCå+wÎ[n6GZù>"âÊù+ýÕ9^6ÓF½íoßEcì¸_pùnÚbïjÅÅß^UtýÝ-®»þgåĿٻƷ8ԯβzÅr
I made sure the page is in UTF-8 and converted the returned text to UTF-8, but that doesn't fix it.
When the text wasn't converted to UTF-8, it showed these symbols:
��G�W����/��<� ������%�M����>����z.�m47�M �O�4�Nf�/7ʓ쓻#2FGj��,U8�J
I feel like I'm just missing something.
This is my code:
<?php
header('Content-type: text/html; charset=UTF-8');
$formReturn = $_POST["formReturn"];
if ($formReturn)
{
$file = $_FILES["gradesPdf"]["tmp_name"];
$text = file_get_contents($file);
$text = utf8_encode($text);
}
$html = '
<!DOCTYPE html>
<html lang="he">
<meta charset="utf-8" />
<head>
<title>נסיון</title>
</head>
<body>
<form enctype="multipart/form-data" method="post">
<input type="file" name="gradesPdf" id="gradesPdf">
<br><br>
<button type="submit">run</button>
<input type="hidden" name="formReturn" value="1">
</form>
'. $text .'
</body>
</html>
';
echo $html;
By the way, I can't use pdfParser: I tried the demo on their site and it didn't return the text the way I wanted, I think because my PDF has a table in it.

Related

TinyMCE angle brackets

I am new to using TinyMCE but am frustrated with its handling of angle brackets. It appears to interpret input such as <foo> or <foo>Foo</foo> as tags, even though the page source shows that both cases are converted to &lt;foo&gt; and &lt;foo&gt;Foo&lt;/foo&gt; respectively.
I reduced my code for SO, it is below:
<?php
// Simplified for SO, no file writing / reading
$content = isset($_POST["forSo"]) ? $_POST["forSo"] : "";
?>
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<title>For Stack Overflow</title>
<script src="/tinymce/js/tinymce/tinymce.min.js"></script>
<script>
tinymce.init({ selector : "#forSo" });
</script>
</head>
<body>
<?php
// Behaves as expected, TinyMCE correctly automatically converts HTML Entities
echo $content . "\n";
?>
<form action="/forSo.php" method="post" enctype="multipart/form-data">
<textarea id="forSo" name="forSo">
<?php
// Page source shows that this has HTML Entities, still loses information
echo $content . "\n";
?>
</textarea><br>
<input type="submit" value="Submit">
</form>
</body>
</html>
If I input say <foo> then the resulting page source is:
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<title>For Stack Overflow</title>
<script src="/tinymce/js/tinymce/tinymce.min.js"></script>
<script>
tinymce.init({ selector : "#forSo" });
</script>
</head>
<body>
<p>&lt;foo&gt;</p>
<form action="/forSo.php" method="post" enctype="multipart/form-data">
<textarea id="forSo" name="forSo">
<p>&lt;foo&gt;</p>
</textarea><br>
<input type="submit" value="Submit">
</form>
</body>
</html>
However, TinyMCE seems to have thrown away the textarea's content (hitting submit again causes all of $content to be an empty string).
Replacing the second <?php echo $content . "\n"; ?> inside the textarea with <?php echo str_replace("&", "&amp;", $content) . "\n"; ?> accomplishes the task.
This does prevent writing valid HTML tags, but text that is meant to be taken literally, such as <foo>, is preserved inside the TinyMCE editor as &lt;foo&gt; rather than as <foo>, which TinyMCE interprets as an HTML tag.
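As a rough illustration of the replacement described above, the sketch below shows what the ampersand escaping does to the value TinyMCE submitted. The helper name escapeForTextarea is made up for this example and is not part of TinyMCE or the original post.

```php
<?php
// Hypothetical helper (an assumption for illustration, not a TinyMCE API):
// escape only the ampersands so the entities survive the browser's
// decoding of the textarea's contents.
function escapeForTextarea(string $content): string
{
    return str_replace('&', '&amp;', $content);
}

// The value submitted for the literal text <foo>, as shown in the page source.
$content = '<p>&lt;foo&gt;</p>';

// Only the ampersands change; everything else is passed through as-is.
echo escapeForTextarea($content), "\n";
// <p>&amp;lt;foo&amp;gt;</p>
```

The browser decodes &amp;lt; back to &lt; when it builds the textarea's value, so the editor receives the entity rather than a raw angle bracket.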

How to stop HTML text in textarea to be interpreted as code

I have a textarea that users can edit. After the edit I save the text in a PHP variable $bio. When I want to display it I do this:
<?php
$bio = nl2br($bio);
echo $bio;
?>
But if a user types an HTML tag such as <strong> in their text, my site will actually output the text as bold, which is not what I want.
How can I print/echo $bio on the screen just as text and not as HTML code?
Thanks in advance!
Replace echo $bio; with echo htmlspecialchars($bio);
http://php.net/htmlspecialchars
When you output text to the HTML / the browser and you want to make sure that the output does not break the HTML, you should always use htmlspecialchars().
In your case you do want to show the <br> tags, so you should escape the text before you add them:
$bio = nl2br(htmlspecialchars($bio));
You can also use strip_tags() to get rid of the html tags altogether, but you would still need to use htmlspecialchars() so that for example a < character will not break your html.
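A minimal sketch of both variants from this answer, using a made-up sample string in place of the real $bio. The ENT_QUOTES flag and the explicit 'UTF-8' argument are assumptions added here; the answer itself only calls htmlspecialchars($bio).

```php
<?php
// Sample input standing in for the $bio value from the question.
$bio = "Hi <strong>there</strong>\nbye";

// Escape first, then add the <br> tags, so only the <br> tags are real HTML.
$escaped = nl2br(htmlspecialchars($bio, ENT_QUOTES, 'UTF-8'));

// Alternative: drop the tags entirely, then still escape whatever is left.
$stripped = nl2br(htmlspecialchars(strip_tags($bio), ENT_QUOTES, 'UTF-8'));

echo $escaped, "\n";  // the <strong> tag is shown as text, not rendered
echo $stripped, "\n"; // the <strong> tag is gone
```

Doing it in the other order, htmlspecialchars(nl2br($bio)), would escape the <br> tags as well and show them as literal text.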
You can also use htmlentities():
<!DOCTYPE HTML>
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<title></title>
</head>
<body>
<form method="POST" action="">
<p><textarea rows="8" name="bio" cols="40"></textarea></p>
<p><input type="submit" value="Submit"></p>
</form>
<p>Result:</p>
<?php echo isset($_POST['bio']) ? htmlentities($_POST['bio']) : null; ?>
</body>
</html>

Missing characters in displaying UTF that written to file

I am trying to write some Vietnamese text from a text box to a file, then read from that file and display it on another page.
The display part works well: I tested it by pasting some Vietnamese directly into the file. However, the writing part is somehow not right, because when I input some Vietnamese and check the display, some characters are missing in places. Here is the code I am using to write to the file:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<body>
<form name="form" method="post">
<style type="text/css">
.inputtext { width: 550px; height: 550px; }
</style>
<input type="text" name="text_box" class="inputtext" size="250"/>
<input type="submit" id="search-submit" value="SAVE" />
</form>
</body>
</html>
<?php
if(isset($_POST['text_box'])) { //only do file operations when appropriate
$a = $_POST['text_box'];
$myFile = mb_convert_encoding("test.txt", "UTF-8", "auto");
$data = mb_convert_encoding($a, 'UTF-8', "auto");
$fh = fopen($myFile, 'w') or die("can't open file");
fwrite($fh,utf8_encode($data));
fclose($fh);
}
?>
So what is the right way to write UTF-8 (or any multilingual text) to a file?
If you make sure all your pages are already using UTF-8, then the solution would be: do nothing special with the file, simply write the string (which already is UTF-8) to the file.
So now you need to find out how to make everything in your page UTF-8. You should start by sending a Content-type header: header('Content-type: text/html; charset=utf-8');
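A minimal sketch of the "do nothing special" approach, assuming the page is served as UTF-8 so the submitted string already arrives as UTF-8. The helper name saveText, the validity check, and the temp-directory path are assumptions for illustration; the mbstring extension is assumed to be available, as in the question's code.

```php
<?php
// Hypothetical helper: writes the text exactly as received, with no
// utf8_encode() or mb_convert_encoding() step in between.
function saveText(string $path, string $text): bool
{
    // Reject input that is not valid UTF-8 rather than trying to "fix" it.
    if (!mb_check_encoding($text, 'UTF-8')) {
        return false;
    }
    return file_put_contents($path, $text) !== false;
}

// Example path and text; this source file itself must be saved as UTF-8.
$path = sys_get_temp_dir() . '/test.txt';
$text = 'Tiếng Việt';

$ok = saveText($path, $text);

// Reading the file back should give the same bytes that were written.
$roundTrip = $ok && file_get_contents($path) === $text;
```

Calling utf8_encode() on a string that is already UTF-8, as the question's code does, treats each byte as a Latin-1 character and encodes it a second time.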

PHP not printing special characters

I'm receiving a text file upload and then just printing it back out. For some reason, special characters are appearing as black boxes with a white check mark. I tried htmlentities() and utf8_encode() on the content to be printed to the screen, but that didn't help.
Here's all of my code:
<?php ini_set("auto_detect_line_endings", true);
header('Content-Type: text/html; charset=utf-8');
?><!DOCTYPE html>
<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=utf-8">
</head>
<body style="overflow:visible;">
<form method="post" enctype="multipart/form-data">
<input type="file" name="file" />
<button type="submit" name="upload" value="upload">Upload</button>
</form>
<pre>
<?php
if($_POST['upload']) {
//$fileName = 'old.txt';
$fileName = $_FILES['file']['tmp_name'];
if(file_exists($fileName)) {
$file = fopen($fileName,'r');
while(!feof($file)) {
$name = fgets($file);
echo(htmlentities($name));
}
fclose($file);
}
}
?>
</pre>
</body>
</html>
This code works on my localhost LAMP server, but the character problem appears on some other people's servers. What can I do to make the special characters show up?

Replace non standard characters in php

I'm trying to replace some non-standard characters like ë, Ë, ç, Ç with numeric entities like &#203;, &#39; etc., but I ran into a bit of a problem.
When I try to replace them directly like this, it works fine:
$string = "Ë";
$vname = str_replace("Ë","AAAA",$string);
echo $vname."<br>";
and I get AAAA as a result.
But when I try to replace the characters in a string that I get from a form via POST, it doesn't change the characters. Here is an example:
<?php
if(isset($_POST['submit'])) {
$string = $_POST['title'];
if ($string == "Ë")
echo "Yes";
else
echo "No";
$vname = str_replace("Ë","AAAA",$string);
echo $vname."<br>";
echo $string;
}
?>
<form method="post" name="Form">
Title: <input name="title" type="text" value="" size="20"/>
<input name="submit" type="submit" value="submit"/>
</form>
Any help would be great!!
Most likely your character set is wrong. I would suggest sending the following header when outputting HTML:
<?php header("content-type: text/html; charset=utf-8"); ?>
where the charset matches the charset in which you are storing your file.
Edit: Just some more information. The file you store is in one charset, for example latin1, while your browser interprets your HTML page as another charset (UTF-8, for example). When the browser then sends the Ë character, it will send the UTF-8 bytes 0xc38b, while the same character in latin1 is 0xcb. As you can see, these do not match.
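The byte mismatch can be checked with a short sketch like the one below. It assumes the mbstring extension is available and writes the character as explicit bytes so the result does not depend on how this file itself is saved.

```php
<?php
// "\xC3\x8B" is Ë written as explicit UTF-8 bytes.
$utf8 = "\xC3\x8B";

// The same character converted to latin1 (ISO-8859-1) is a single byte.
$latin1 = mb_convert_encoding($utf8, 'ISO-8859-1', 'UTF-8');

echo bin2hex($utf8), "\n";   // c38b
echo bin2hex($latin1), "\n"; // cb

// A byte-wise comparison or str_replace() between the two never matches.
var_dump($utf8 === $latin1); // bool(false)
```

This is why the comparison against "Ë" in the question prints "No": the literal in the script and the posted value are the same character in two different encodings.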
Edit: You can also set the charset via HTML5 or XHTML:
HTML5
<meta charset="UTF-8"/>
XHTML
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
