html decimal coded string - php

I'm parsing HTML from a website using simplehtmldom_1_5. When I echo the parsed text to the screen it prints correctly, but when I try to save it to a file using file_put_contents, the string comes out as HTML decimal character codes:
&#40&#98&#46&#32&#97&#110&#100&#101&#114&#115&#115&#111&#110&#44&#32
I've already tried every combination of utf8_encode, utf8_decode, htmlentities... but nothing worked. I get the same problem when I try to insert the text into a MySQL table.
mb_detect_encoding for the parsed text returns ASCII.
Any suggestions ?
header('Content-Type: text/html; charset=utf-8');
ini_set('max_execution_time', 0);
include 'simplehtmldom_1_5/simple_html_dom.php';
$html = file_get_html($curr_url);
$texts = $html->find('div[id=content_h]');
foreach($texts as $text) {
file_put_contents('queries.txt', $text->innertext . "\n", FILE_APPEND);
}

Did you also try html_entity_decode (http://de1.php.net/html_entity_decode)?
That's the function that converts entities back to plain text.
*edit
I just tested this to verify it's working.
Yes, it works, BUT:
your data is incorrect!
Every single entity is missing the semicolon at its end!
That's why decoding only works in lenient browser rendering engines...
Your data should look like this:
&#40;&#98;&#46;
and not like this:
&#40&#98&#46
See the difference?

Finally this worked for me
preg_replace('/&#(\d+)/me',"chr(\\1)", $text)
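Note that the /e modifier used above was deprecated in PHP 5.5 and removed in PHP 7.0, and chr() only covers code points below 256. Below is a minimal sketch of the same idea for current PHP, assuming the mbstring extension is available (PHP 7.2+ for mb_chr). The function name decodeNumericEntities is made up for this example.

```php
<?php
// Decode decimal character references such as "&#40" or "&#40;".
// The trailing semicolon is optional because the scraped data omits it.
function decodeNumericEntities(string $text): string
{
    return preg_replace_callback(
        '/&#(\d+);?/',
        static function (array $m): string {
            $char = mb_chr((int) $m[1], 'UTF-8');
            // Leave the reference untouched if the code point is invalid.
            return $char === false ? $m[0] : $char;
        },
        $text
    );
}

$sample = '&#40&#98&#46&#32&#97&#110&#100&#101&#114&#115&#115&#111&#110&#44&#32';
echo decodeNumericEntities($sample), "\n"; // prints "(b. andersson, "
```

The decoded text could then be passed to file_put_contents or a database insert as ordinary UTF-8.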

Related

html2text result deletes some special characters

I am trying to display a message using the html2text function. The result is encoded in UTF-8; the only problem is that in some cases characters are deleted from the words.
Example: instead of n'hésitez I get nhsitez. Here is my code:
$h2t = new html2text($leMessage);
$altBody = $h2t->get_text();
logMessagePreformate($id_dossier, utf8_decode($sujet),$altBody, $pour1, $pour2);
I tried utf8_encode and mb_convert_encoding, but neither worked. Any suggestions?
For those who face the same problem: I added html_entity_decode() to my code to decode the data I send to the database:
$h2t = new html2text(html_entity_decode($leMessage));
then, to display it, I used:
mb_convert_encoding($h2t, "HTML-ENTITIES", 'UTF-8')
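As a rough illustration of why the decode step helps, here is a sketch that assumes the message contained HTML entities for the apostrophe and the accented letter. That is a guess, since the original message is not shown, and the sample string is invented. It only shows the html_entity_decode() step and does not use the html2text class.

```php
<?php
// Hypothetical input: the accented letter and apostrophe arrive as entities.
$leMessage = 'n&#39;h&eacute;sitez pas';

// Decode entities into real UTF-8 characters before any text conversion.
$decoded = html_entity_decode($leMessage, ENT_QUOTES | ENT_HTML5, 'UTF-8');

// Escape again only at the point where the text is written into HTML.
$forHtml = htmlspecialchars($decoded, ENT_QUOTES, 'UTF-8');

echo $decoded, "\n"; // n'hésitez pas
```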

The proper use of PHP_EOL and how to get rid of character count when reading a file

I am trying to write a function that appends a string of text with a timestamp to the file "text2.txt". I want each entry to be on a new line, but PHP_EOL does not seem to work for me. The strings are simply written on the same line instead of each starting a new line.
Could anyone give me some pointers or ideas as to how to force the script to write to a new line every time the function is activated?
Some sort of example would be highly appreciated.
Thank you in advance.
<?php
if($_SERVER['REQUEST_METHOD'] == "POST" and isset($_POST['sendmsg']))
{
writemsg();
}
function writemsg()
{
$txt = $_POST['tbox'];
$file = 'text2.txt';
$str = date("Y/m/d H:i:s",time()) . ":" . $txt;
file_put_contents($file, $str . PHP_EOL , FILE_APPEND );
header("Refresh:0");
}
?>
Also, I want to get rid of the character count at the end of the output when using the code below:
<?php
echo readfile("text2.txt");
?>
Is there any way for the character count to be disabled or another way to read the text file so it does not show the character count?
"Could anyone give me some pointers or ideas as to how to force the script to write to a new line every time the function is activated? Some sort of example would be highly appreciated."
Given the code you posted, I'm pretty sure newlines are properly appended to the lines you are writing to the file.
Try opening text2.txt in a text editor to confirm this.
Note that if you insert text2.txt as part of an HTML document, newlines won't cause a line break when the browser renders the HTML.
You have to turn them into line break tags <br/>.
To do that, simply use:
<?php
echo nl2br( file_get_contents( "text2.txt" ) );
?>
Using file_get_contents will also solve your issue with the character count being displayed.
A note about readfile, which you (mis)used in the code in your question.
According to the documentation:
Reads a file and writes it to the output buffer.
[...]
Returns the number of bytes read from the file. If an error occurs, FALSE is returned and unless the function was called as @readfile(), an error message is printed.
As readfile reads a file and sends the contents to the output buffer you would have:
$bytes_read = readfile( "text2.txt" );
Without the echo.
But in your case you need to operate on the contents of the file (replacing line breaks with their equivalent html tags) so using file_get_contents is more suitable.
To put a new line in the text, simply append "\r\n" (it must be in double quotes).
Please note that if you read this file and output it as HTML, every newline (no matter which combination) is rendered as a plain space, because a line break in HTML is <br/>. Use nl2br($text) to convert newlines to <br/> tags.
To read the file, use file_get_contents($file);
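The points from both answers can be put together in one short sketch. It writes to a temporary file instead of text2.txt so that it can run anywhere, and the helper name appendMessage is invented for this example.

```php
<?php
// Append one timestamped line to the file; PHP_EOL ends each entry.
function appendMessage(string $file, string $text): void
{
    $line = date('Y/m/d H:i:s') . ': ' . $text;
    file_put_contents($file, $line . PHP_EOL, FILE_APPEND | LOCK_EX);
}

$file = tempnam(sys_get_temp_dir(), 'msg');
appendMessage($file, 'first entry');
appendMessage($file, 'second entry');

// file_get_contents returns the text itself, so no byte count is printed.
// Escape user text first, then turn newlines into <br /> tags for HTML.
$html = nl2br(htmlspecialchars(file_get_contents($file), ENT_QUOTES, 'UTF-8'));
echo $html;

unlink($file);
```

In contrast, readfile() sends the file to the output buffer and returns the byte count, so echoing its return value is what prints the extra number.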

Decoding UTF-8 Encoded Header

I'm using PHP imap to read emails out of an inbox. It extracts some information from headers. One of the headers looks like this:
X-My-Custom-Header: =?UTF-8?B?RXVnZW4gQmFiacSH?=
The original value of that encoded string is Eugen Babić.
When I try to decode that string using PHP, I can't get it quite right, the ć always comes back messed up.
I've tried imap_utf8, imap_mime_header_decode and a bunch of others I can't quite recall. They either don't return anything at all, or they mess up the ć as I mentioned before.
What is the correct way to decode this?
imap_utf8 and imap_mime_header_decode work just fine; there's also iconv_mime_decode:
php > echo imap_utf8('X-My-Custom-Header: =?UTF-8?B?RXVnZW4gQmFiacSH?='), "\n";
X-My-Custom-Header: Eugen Babić
php > list($k,$v) = imap_mime_header_decode('X-My-Custom-Header: =?UTF-8?B?RXVnZW4gQmFiacSH?=');
php > echo $v->text, "\n";
Eugen Babić
php > echo iconv_mime_decode('X-My-Custom-Header: =?UTF-8?B?RXVnZW4gQmFiacSH?=', 0, "utf8"), "\n";
X-My-Custom-Header: Eugen Babić
It seems that imap_utf8 returns its output in NFD, so that the accent over the c may appear out of place in some settings.
Here's what you're doing wrong: your HTML (as generated by the PHP) is not declared as UTF-8. So even though the function is returning the accented c, the page isn't displaying it correctly.
To fix it, add this in your <head> tag:
<meta http-equiv='Content-Type' content='text/html; charset=utf-8'>
The function mb_decode_mimeheader() solved the problem:
"fromName" => (isset($fromInfo->personal))
? mb_decode_mimeheader( $fromInfo->personal) : "",
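To tell a decoding problem apart from a display problem, it can help to look at the decoded bytes directly. Here is a minimal sketch, assuming the iconv and mbstring extensions are loaded. It uses only the encoded value from the question, without the header name.

```php
<?php
$raw = '=?UTF-8?B?RXVnZW4gQmFiacSH?=';

// Decode with iconv, asking for UTF-8 output.
$viaIconv = iconv_mime_decode($raw, ICONV_MIME_DECODE_CONTINUE_ON_ERROR, 'UTF-8');

// mb_decode_mimeheader converts to the internal encoding, so set it first.
mb_internal_encoding('UTF-8');
$viaMb = mb_decode_mimeheader($raw);

echo $viaIconv, "\n";
// The precomposed "ć" (U+0107) is the two bytes c4 87 in UTF-8.
echo bin2hex($viaIconv), "\n";
```

If the hex dump ends in c487, the decoding is correct and any garbling is happening when the string is displayed, for example because the page is not served as UTF-8.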

Storing HTML in MySQL

I'm storing HTML and text data in my database table in its raw form - however I am having a slight problem in getting it to output correctly. Here is some sample data stored in the table AS IS:
<p>Professional Freelance PHP & MySQL developer based in Manchester.
<br />Providing an unbeatable service at a competitive price.</p>
To output this data I do:
echo $row['details'];
And this outputs the data correctly, however when I do a W3C validator check it says:
character "&" is the first character of a delimiter but occurred as data
So I tried using htmlentities and htmlspecialchars, but this just causes the HTML tags to be printed on the page as text.
What is the correct way of doing this?
Use &amp; instead of &.
What you want to do is use the php function htmlentities()...
It will convert your input into HTML entities, and when it is output the browser will interpret those entities and display the result. For example:
$mything = "<b>BOLD & BOLD</b>";
//normally would throw an error if not converted...
//lets convert!!
$mynewthing = htmlentities($mything);
Now, just insert $mynewthing to your database!!
htmlentities is basically a superset of htmlspecialchars, and htmlspecialchars also replaces < and >.
Actually, what you are trying to do is to fix invalid HTML code, and I think this needs an ad-hoc solution:
$row['details'] = preg_replace("/&(?![#0-9a-z]+;)/i", "&amp;", $row['details']);
This is not a perfect solution, since it will fail for strings like: someone&son; (with a trailing ;), but at least it won't break existing HTML entities.
However, if you have decision power over how the data is stored, please enforce that the HTML code stored in the database is correct.
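As a small runnable illustration of this ad-hoc fix, the sketch below escapes only ampersands that do not already start an entity. The function name escapeBareAmpersands and the sample string are invented for this example.

```php
<?php
// Replace "&" with "&amp;" unless it already begins something that
// looks like an entity (letters, digits or "#", followed by ";").
function escapeBareAmpersands(string $html): string
{
    return preg_replace('/&(?![#0-9a-z]+;)/i', '&amp;', $html);
}

$stored = '<p>PHP & MySQL developer &amp; consultant &#169; 2010</p>';
echo escapeBareAmpersands($stored), "\n";
// <p>PHP &amp; MySQL developer &amp; consultant &#169; 2010</p>
```

The tags and the existing &amp; and &#169; entities pass through unchanged, and only the bare & is rewritten.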
In my projects I use an XSLT parser, so I had to change to   (e.g.). But this is the safest way I found...
here is my code
$html = trim(addslashes(htmlspecialchars(
html_entity_decode($_POST['html'], ENT_QUOTES, 'UTF-8'),
ENT_QUOTES, 'UTF-8'
)));
And when you read from DB, don't forget to use stripslashes();
$html = stripslashes($mysq_row['html']);

Character Format in php

Sorry, I can't log in; ClaimID is having server issues (I'm normally Arthur Gibbs).
Data from my database currently outputs this when there are strange characters...
This is just a example
What I get: De&#8730;ilscrat&#8482;
What I want: De√ilscrat™
It seems that some characters are being translated into character codes by the other guy's system.
So what I want to know is:
Is there a function that will expand character codes within a string?
Turning FUNCTION(De&#8730;ilscrat&#8482;) >>> De√ilscrat™.
This &#8730; stuff looks like an HTML entity, so let's try de-entitying it...
This can be done using the html_entity_decode function, that's provided by PHP.
For instance, with the string you provided, here's a sample of code :
// So the browser interprets the correct charset
header('Content-type: text/html; charset=UTF-8');
$input = 'De&#8730;ilscrat&#8482;';
$output = html_entity_decode($input, ENT_NOQUOTES, 'UTF-8');
var_dump($input, $output);
And the output I'm getting is this one :
string 'De&#8730;ilscrat&#8482;' (length=19)
string 'De√ilscrat™' (length=15)
(First one is the original version, and second one is the "decoded" version)
So, it seems to do the trick ;-)
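For reference, here is a self-contained sketch of the same decode step. It assumes the stored value uses decimal character references (&#8730; for √ and &#8482; for ™), which is a guess about how the data looks in the database.

```php
<?php
// Assumed stored form: decimal character references for the two symbols.
$input = 'De&#8730;ilscrat&#8482;';

// Decode into real UTF-8 characters.
$output = html_entity_decode($input, ENT_QUOTES | ENT_HTML5, 'UTF-8');

echo $output, "\n";
// strlen counts bytes: each of the two symbols takes 3 bytes in UTF-8.
echo strlen($input), ' -> ', strlen($output), "\n";
```

If the result is echoed into a page, the page still has to be served as UTF-8, as in the header() call above.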
