PHP - Convert hex character to HTML entity - php

I'm using imagettftext to create images of certain text characters. To do this, I need to convert an array of hexadecimal character codes to their HTML equivalents, and I can't seem to find any functionality built in to PHP to do this. What am I missing? (Through searching, I came across this: PHP function imagettftext() and unicode, but none of the answers seem to do what I need; some characters convert but most don't.)
Here's the resulting HTML representation (in the browser):
[characters] => Array
(
[33] => A
[34] => B
[35] => C
[36] => D
[37] => E
[38] => F
[39] => G
[40] => H
[41] => I
[42] => J
[43] => K
[44] => L
)
Which comes from this array (not capable of rendering in imagettftext):
[characters] => Array
(
[33] => &#x41
[34] => &#x42
[35] => &#x43
[36] => &#x44
[37] => &#x45
[38] => &#x46
[39] => &#x47
[40] => &#x48
[41] => &#x49
[42] => &#x4a
[43] => &#x4b
[44] => &#x4c
)

Based on a sample from the PHP manual, you could do this with a regex:
$newText = preg_replace('/&#x([a-f0-9]+)/mei', 'chr(0x\\1)', $oldText);
I'm not sure a raw html_entity_decode() would work in your case, as your array elements are missing the trailing ; -- a necessary part of these entities.
EDIT, July 2015:
In response to Ben's comment noting the /e modifier being deprecated, here's how to write this using preg_replace_callback() and an anonymous function:
$newText = preg_replace_callback(
    '/&#x([a-f0-9]+)/mi',
    function ($m) {
        // $m[1] holds the hex digits captured after "&#x"
        return chr(hexdec($m[1]));
    },
    $oldText
);
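One caveat with the callback above: chr() produces a single byte, so it only yields usable text for code points in the ASCII range, while imagettftext expects UTF-8. A minimal sketch of a multibyte-safe variant follows, assuming PHP 7.2+ with mbstring; the helper name decodeHexEntities and the sample values are hypothetical, not from the original answer.

```php
<?php
// Hypothetical helper: decodes "&#x41"-style references (trailing ";" optional)
// into UTF-8 characters. Requires the mbstring extension (mb_chr, PHP 7.2+).
function decodeHexEntities(string $text): string
{
    return preg_replace_callback(
        '/&#x([0-9a-f]+);?/i',
        function (array $m): string {
            $codePoint = hexdec($m[1]);
            // hexdec() returns a float for very large inputs; treat that as invalid
            $char = is_int($codePoint) ? mb_chr($codePoint, 'UTF-8') : false;
            // Leave the reference untouched if it is not a valid code point
            return $char === false ? $m[0] : $char;
        },
        $text
    );
}

// Sample values (assumed for illustration)
$characters = ['&#x41', '&#x4a', '&#x263a'];
$decoded = array_map('decodeHexEntities', $characters);
print_r($decoded);
```

The optional ";" in the pattern lets the same helper handle both the truncated references from the question and well-formed ones.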

Well, you obviously haven't searched hard enough: html_entity_decode().

Related

PHP JSON_encode() is getting "Malformed UTF-8 characters, possibly incorrectly encoded" (error)

I cannot solve this issue and it's driving me crazy.
json_encode() is throwing the error "Malformed UTF-8 characters, possibly incorrectly encoded" on a few records (2 or 3) from a set of 10k records.
This seems impossible to fix.
MySQL is already utf8mb4 everywhere (database, tables, columns, and collation).
PHP is 7.2 and of course uses UTF-8.
The Apache default charset is UTF-8 (however, the error is thrown at the PHP level).
I can also print the record to the screen correctly in PHP, without issue, in a simple HTML debug page. However, if I try to encode it as JSON, I get the error.
I found that these records were imported from a CSV, probably bypassing the cleaner. What is strange is that the entire CSV file is parsed with:
$this->encoding = mb_detect_encoding($source, mb_detect_order(), true);
if ($this->encoding != "" && $this->encoding != "UTF8") {
    $source = iconv($this->encoding, "UTF-8", $source);
}
I cannot post the full broken data for privacy reasons (and GDPR).
However, I managed to extract the part that seems to be broken:
RESIDENCE �PRINCIPE
UPDATES
I tried to get the byte values of these broken characters. This is what I found.
Treating the string as single bytes, using the native functions str_split and ord, the character is:
'�' 160
I would also like to find the code point in UTF-8, so I found this useful function on PHP.net: http://php.net/manual/en/function.ord.php#109812
It tries to find the code point of multibyte strings, and it gives me:
-2096
Which is... negative?
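For reference, the byte inspection described above can be sketched as follows. The sample string is an assumption: "\xA0" stands in for the broken character, since the real data cannot be posted. A lone byte of value 160 (0xA0) is a continuation byte in UTF-8 and cannot start a character, which is presumably why a UTF-8-aware ord() returns nonsense for it.

```php
<?php
// Assumed sample: "\xA0" stands in for the broken character from the record.
$broken = "RESIDENCE \xA0PRINCIPE";

// Byte-by-byte view, as with str_split() and ord() in the question
$bytes = array_map('ord', str_split($broken));
echo implode(' ', $bytes), PHP_EOL;

// Hex view of the same bytes, often easier to compare against encoding tables
echo bin2hex($broken), PHP_EOL;

// Check whether the string is well-formed UTF-8 (requires mbstring)
$isValidUtf8 = mb_check_encoding($broken, 'UTF-8');
var_dump($isValidUtf8);
```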
SOLVED!
The issue was in the function mb_detect_order(); it just doesn't work the way I expected. I thought it returned the full list of supported encodings, ordered by how commonly they are used, in order to speed up the detection process.
But I just found that this function returns only 2 encodings:
// print_r(mb_detect_order());
Array
(
[0] => ASCII
[1] => UTF-8
)
Which is almost completely useless in my case.
The mb functions can detect many more charsets.
You can list them by running mb_list_encodings():
//print_r(mb_list_encodings());
Array
(
[0] => pass
[1] => auto
[2] => wchar
[3] => byte2be
[4] => byte2le
[5] => byte4be
[6] => byte4le
[7] => BASE64
[8] => UUENCODE
[9] => HTML-ENTITIES
[10] => Quoted-Printable
[11] => 7bit
[12] => 8bit
[13] => UCS-4
[14] => UCS-4BE
[15] => UCS-4LE
[16] => UCS-2
[17] => UCS-2BE
[18] => UCS-2LE
[19] => UTF-32
[20] => UTF-32BE
[21] => UTF-32LE
[22] => UTF-16
[23] => UTF-16BE
[24] => UTF-16LE
[25] => UTF-8
[26] => UTF-7
[27] => UTF7-IMAP
[28] => ASCII
[29] => EUC-JP
[30] => SJIS
[31] => eucJP-win
[32] => EUC-JP-2004
[33] => SJIS-win
[34] => SJIS-Mobile#DOCOMO
[35] => SJIS-Mobile#KDDI
[36] => SJIS-Mobile#SOFTBANK
[37] => SJIS-mac
[38] => SJIS-2004
[39] => UTF-8-Mobile#DOCOMO
[40] => UTF-8-Mobile#KDDI-A
[41] => UTF-8-Mobile#KDDI-B
[42] => UTF-8-Mobile#SOFTBANK
[43] => CP932
[44] => CP51932
[45] => JIS
[46] => ISO-2022-JP
[47] => ISO-2022-JP-MS
[48] => GB18030
[49] => Windows-1252
[50] => Windows-1254
[51] => ISO-8859-1
[52] => ISO-8859-2
[53] => ISO-8859-3
[54] => ISO-8859-4
[55] => ISO-8859-5
[56] => ISO-8859-6
[57] => ISO-8859-7
[58] => ISO-8859-8
[59] => ISO-8859-9
[60] => ISO-8859-10
[61] => ISO-8859-13
[62] => ISO-8859-14
[63] => ISO-8859-15
[64] => ISO-8859-16
[65] => EUC-CN
[66] => CP936
[67] => HZ
[68] => EUC-TW
[69] => BIG-5
[70] => CP950
[71] => EUC-KR
[72] => UHC
[73] => ISO-2022-KR
[74] => Windows-1251
[75] => CP866
[76] => KOI8-R
[77] => KOI8-U
[78] => ArmSCII-8
[79] => CP850
[80] => JIS-ms
[81] => ISO-2022-JP-2004
[82] => ISO-2022-JP-MOBILE#KDDI
[83] => CP50220
[84] => CP50220raw
[85] => CP50221
[86] => CP50222
)
I was wrong to think that mb_detect_order() was just an ordered version of this list; for this purpose it is just... useless. To convert to UTF-8 correctly, use the following code:
$my_encoding_list = [
    "UTF-8",
    "UTF-7",
    "UTF-16",
    "UTF-32",
    "ISO-8859-16",
    "ISO-8859-15",
    "ISO-8859-10",
    "ISO-8859-1",
    "Windows-1254",
    "Windows-1252",
    "Windows-1251",
    "ASCII",
    // add your preferred encodings
];
// remove unsupported encodings
$encoding_list = array_intersect($my_encoding_list, mb_list_encodings());
// detect the encoding, finally
$this->encoding = mb_detect_encoding($source, $encoding_list, true);
This worked and solved my issue with bad data saved in the database.
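Putting the pieces together, here is a minimal standalone sketch of the detect-then-convert step. It is an illustration under assumptions, not the original class: the helper name toUtf8 and the sample string are hypothetical, and it uses mb_convert_encoding() in place of the iconv() call from the question.

```php
<?php
// Hypothetical helper: detect the encoding from an explicit candidate list,
// then convert to UTF-8. Requires the mbstring extension.
function toUtf8(string $source, array $candidates): string
{
    // Keep only the encodings this mbstring build actually supports
    $supported = array_values(array_intersect($candidates, mb_list_encodings()));

    // Strict mode: encodings in which the input is invalid are rejected
    $encoding = mb_detect_encoding($source, $supported, true);
    if ($encoding === false) {
        throw new RuntimeException('Could not detect the source encoding');
    }

    return $encoding === 'UTF-8'
        ? $source
        : mb_convert_encoding($source, 'UTF-8', $encoding);
}

// Assumed sample: a lone 0xA0 byte, which is invalid as UTF-8
$fixed = toUtf8("RESIDENCE \xA0PRINCIPE", ['UTF-8', 'ISO-8859-1']);
echo json_encode($fixed), PHP_EOL;
```

Order matters in the candidate list: single-byte encodings such as ISO-8859-1 accept almost any input, so UTF-8 should come before them.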
You can filter out these unknown characters by using the UTF-8//IGNORE charset in your iconv call.
$this->encoding = mb_detect_encoding($source, mb_detect_order(), true);
if ($this->encoding != "" && $this->encoding != "UTF8") {
    $source = iconv($this->encoding, "UTF-8//IGNORE", $source);
}
With //IGNORE appended to the target charset, every character that cannot be represented in the target charset is silently discarded.

PHP Array to XML

I have created a multi-dimensional array using fgetcsv from a CSV file.
Using both DOMDocument and SimpleXML, I am trying to create an XML file from the CSV document.
The array and XML variables are passed to a function within the same class file. The XML document is created without any issues, but no value passes from the array into the XML. It does work if I use a static value instead of a value from the array, and if I print_r the array, the structure and values are all correct.
I have tried htmlspecialchars and utf8_encode before passing the value into the XML.
An example of the code is below; $product is the multi-dimensional array.
public function array_to_xml($product, &$xml)
{
    foreach ($product as $row)
    {
        $element = $xml->createElement("Product");
        $xml->appendChild($element);
        $element = $xml->createElement("ID", ($row[38]));
        $xml->appendChild($element);
    }
}
The problem is obviously with the array, but I can't find the answer. Any help would be greatly appreciated.
The output currently looks like this (with no value in the ID element). Once it is working, Product will have about 20 child elements.
<?xml version="1.0"?>
<ProductList/>
<Product>
<ID/>
</Product>
</ProductList>
Example of $row when printed to screen:
Array ( [0] => [1] => [2] => 6/10/2016 [3] => [4] => [5] => 7.35 [6] => N [7] => N [8] => N [9] => 0 [10] => 0 [11] => 0 [12] => 0 [13] => 0 [14] => 80 [15] => 0 [16] => 80 [17] => 0 [18] => 80 [19] => N [20] => N [21] => N [22] => N [23] => 236.50 [24] => 0.00 [25] => 4.86 [26] => AFG Home Loans - Alpha [27] => 100% Offset Lo Doc Fixed [28] => 100% Offset Lo Doc 4 Year Fixed Owner Occupied [29] => 250.00 [30] => [31] => 7.35 [32] => 0.00 [33] => 4.9 [34] => N [35] => 325.00 [36] => 48 [37] => 4.52 [38] => 1-1MX78TF [39] => N [40] => [41] => [42] => N [43] => N [44] => [45] => Y [46] => 0.00 [47] => 10,000.00 [48] => 2,000,000.00 [49] => Y [50] => 30 [51] => [52] => [53] => Y [54] => 0.00 )
A couple of things stand out. First, the parentheses around $row[38] on this line are redundant (they are not a syntax error, but they add nothing):
$element = $xml->createElement("ID", ($row[38]));
The createElement method takes a string for its second parameter.
Second, you're not adding the ID to the product, but to the root XML. Fixing that, your code should look closer to this:
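public function array_to_xml($product, &$xml)
{
    foreach ($product as $row)
    {
        // Use a separate variable so the $product array is not overwritten
        $productElement = $xml->createElement("Product");
        $id = $xml->createElement("ID", $row[38]);
        // Attach the ID to the Product, then the Product to the document
        $productElement->appendChild($id);
        $xml->appendChild($productElement);
    }
}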
If you need it as an attribute, as @Barmar commented, you'd use the DOMElement->setAttribute() method, and it would look like:
public function array_to_xml($product, &$xml)
{
    foreach ($product as $row)
    {
        // Use a separate variable so the $product array is not overwritten
        $productElement = $xml->createElement("Product");
        $productElement->setAttribute('ID', $row[38]);
        $xml->appendChild($productElement);
    }
}
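For a version that can be run on its own, here is a minimal sketch under a few assumptions: the sample rows are made up, the root element is named ProductList as in the question's output, and products are appended to that root element rather than to the document itself (a DOM document can hold only one root element). It also uses createTextNode(), which escapes characters such as "&" that the value argument of createElement() does not.

```php
<?php
// Hypothetical sample rows; only index 38 (the ID column) matters here.
$rows = [
    [38 => '1-1MX78TF'],
    [38 => 'A&B'],
];

$xml = new DOMDocument('1.0');
$xml->formatOutput = true;
$root = $xml->appendChild($xml->createElement('ProductList'));

foreach ($rows as $row) {
    $productElement = $xml->createElement('Product');

    // createTextNode() escapes special characters such as "&"
    $id = $xml->createElement('ID');
    $id->appendChild($xml->createTextNode($row[38]));
    $productElement->appendChild($id);

    // Append to the root element, not to the document
    $root->appendChild($productElement);
}

$output = $xml->saveXML();
echo $output;
```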

best way to handle a tree array

I have a tree array from the CakePHP 2.0 Tree behavior. Now I need to separate it into levels so I can build a selector for each level. Here is an example of the array:
Array
(
[25] => Global
[26] => _Region1
[29] => __Pais1
[41] => ___Distrito1
[42] => ___Distrito2
[30] => __Pais2
[43] => ___Distrito1
[44] => ___Distrito2
[31] => __Pais3
[45] => ___Distrito1
[32] => __Pais4
[46] => ___Distrito1
[27] => _Region2
[33] => __Pais1
[47] => ___Distrito1
[34] => __Pais2
[48] => ___Distrito1
[35] => __Pais3
[36] => __Pais4
[28] => _Region3
[37] => __Pais1
[38] => __Pais2
[39] => __Pais3
[40] => __Pais4
)
Now what I need is to create one select for Global, another select for Region1, another for Pais1, and another for Distrito1.
The problem is that I can't use IDs to generate it, since these will be changed by the user.
What would be the best way to manipulate this array so it builds a select for each level of the array?
Thanks in advance.
You can get the tree as a nested array via find('threaded'). This will give you each item in the tree, with a 'children' key containing all child nodes as an array:
Retrieving Your Data - find('threaded')
However, as indicated, because the content will be modified by the user, the depth of the tree may change. To accommodate those changes, you'll need to use a recursive function to loop through the tree, regardless of its depth.
Mockup Code:
echo $this->MyTreeHelper->buildTreeDropDowns($treeData);

class MyTreeHelper extends AppHelper {
    public function buildTreeDropDowns($treeData)
    {
        $output = '';
        foreach ($treeData as $item) {
            // createAdropDown() is a placeholder for your own rendering code
            $output .= createAdropDown($item);
            if (!empty($item['children'])) {
                // Recurse into the child nodes, whatever the depth
                $output .= $this->buildTreeDropDowns($item['children']);
            }
        }
        return $output;
    }
}
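As a framework-free illustration of the recursive idea, the sketch below groups nodes by depth and renders one select per level. The node shape (id, name, children keys), the function names, and the sample tree are all assumptions; a real find('threaded') result nests the fields under the model alias, so the key lookups would need adjusting.

```php
<?php
// Assumed node shape: ['id' => int, 'name' => string, 'children' => array]

// Recursively collect id => name pairs, grouped by depth in the tree.
function collectLevels(array $nodes, int $depth = 0, array &$levels = []): array
{
    foreach ($nodes as $node) {
        $levels[$depth][$node['id']] = $node['name'];
        if (!empty($node['children'])) {
            collectLevels($node['children'], $depth + 1, $levels);
        }
    }
    return $levels;
}

// Render one <select> per level; the "levelN" field names are hypothetical.
function buildSelects(array $levels): string
{
    $html = '';
    foreach ($levels as $depth => $options) {
        $html .= '<select name="level' . $depth . '">';
        foreach ($options as $id => $name) {
            $html .= '<option value="' . $id . '">'
                . htmlspecialchars($name) . '</option>';
        }
        $html .= '</select>' . "\n";
    }
    return $html;
}

// Sample tree (assumed), loosely following the question's data
$tree = [
    ['id' => 25, 'name' => 'Global', 'children' => [
        ['id' => 26, 'name' => 'Region1', 'children' => [
            ['id' => 29, 'name' => 'Pais1', 'children' => []],
        ]],
        ['id' => 27, 'name' => 'Region2', 'children' => []],
    ]],
];

$levels = collectLevels($tree);
echo buildSelects($levels);
```

Because the levels are derived from depth rather than from fixed IDs, the output adapts when users add or move nodes.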

Php regular expressions work different on different servers

I am using regex to get URLs from a webpage.
On localhost (PHP 5.3.15 with Suhosin-Patch (cli) (built: Aug 24 2012 17:45:44)) code:
$file = file_get_contents("http://www.etech.haw-hamburg.de/Stundenplan/");
$pattern = "/<a href=\"([^\"]*.pdf)\">(.*)<\/a>/iU";
preg_match_all($pattern, $file, $matches);
echo "<pre>";
print_r($matches);
echo "</pre>";
gives:
=> Array
(
[0] => Sem_IuE_E1a.pdf
[1] => Sem_IuE_E2a.pdf
[2] => Sem_IuE_E3a.pdf
[3] => Sem_IuE_E4a.pdf
[4] => Sem_IuE_E6AT.pdf
[5] => Sem_IuE_E7.pdf
[6] => Sem_IuE_E1b.pdf
[7] => Sem_IuE_E2b.pdf
[8] => Sem_IuE_E3b.pdf
[9] => Sem_IuE_E4b.pdf
[10] => Sem_IuE_E6II.pdf
[11] => Sem_IuE_E6KT.pdf
[12] => Sem_IuE_BMT1.pdf
[13] => Laborplan%20BMT1%20KoP%201.pdf
[14] => Sem_IuE_BMT2.pdf
[15] => Sem_IuE_BMT3.pdf
[16] => Sem_IuE_BMT4.pdf
[17] => Sem_IuE_BMT5.pdf
[18] => Sem_IuE_BMT6.pdf
[19] => Sem_IuE_IE2.pdf
[20] => Sem_IuE_IE4.pdf
[21] => Sem_IuE_IE6.pdf
[22] => Sem_IuE_AM.pdf
[23] => Sem_IuE_IKM1.pdf
[24] => Legende_Stud.pdf
[25] => Kalender.pdf
[26] => Doz.pdf
[27] => Doz.pdf
)
while, on the remote server (PHP 5.3.3 (cli) (built: Feb 22 2013 02:51:11)) the same code gives:
=> Array
(
[0] => Sem_IuE_E2a.pdf
[1] => Sem_IuE_E7.pdf
[2] => Sem_IuE_E1b.pdf
[3] => Sem_IuE_E2b.pdf
[4] => Sem_IuE_E3b.pdf
[5] => Sem_IuE_E6II.pdf
[6] => Sem_IuE_E6KT.pdf
[7] => Sem_IuE_BMT1.pdf
[8] => Laborplan%20BMT1%20KoP%201.pdf
[9] => Sem_IuE_BMT2.pdf
[10] => Sem_IuE_BMT3.pdf
[11] => Sem_IuE_BMT4.pdf
[12] => Sem_IuE_BMT5.pdf
[13] => Sem_IuE_BMT6.pdf
[14] => Sem_IuE_IE2.pdf
[15] => Sem_IuE_IE4.pdf
[16] => Sem_IuE_IE6.pdf
[17] => Sem_IuE_AM.pdf
[18] => Doz.pdf
[19] => Doz.pdf
)
What is the problem?
I have no precise answer. But in your question you mention that you have different results by using PHP 5.3.3 and PHP 5.3.15.
I took a look at PHP5 ChangeLog, where the answer probably lies, and saw the following possible explanations.
PHP 5.3.6:
Upgraded bundled PCRE to version 8.11. (Ilia)
PHP 5.3.7
Upgraded bundled PCRE to version 8.12. (Scott)
I read the release notes for both PCRE versions, and I am not sure what could affect matching in your case, except for a few corrections mentioning UTF-8 encoding.
But, while looking at the U modifier, I noticed in PCRE Configuration Options that:
PCRE's backtracking limit. Defaults to 100000 for PHP < 5.3.7.
My guess is that some fix in the U (PCRE_UNGREEDY) modifier changed the way that the part between the <a> tags is matched. This makes sense because, looking at the source of the page you are scraping, the only ones that match in the earlier PHP version are the <a> tags that don't contain inner HTML.
Example, this one matches:
E2a
This one doesn't:
<span lang=IT style='mso-ansi-language:IT'>E4a</span>
Very interesting, but how to fix it?
I don't have access to an earlier PHP version, so I cannot test it, but I would say remove the part of your regular expression that matches the link text, because you don't need to match the part inside the <a></a> tags, since the value is already contained in the PDF filename:
$pattern = "/<a href=\"([^\"]*\.pdf)\">/i";
Or
Use a DOM Parser.
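A minimal sketch of the DOM parser route, using PHP's bundled DOMDocument. The inline markup is an assumption standing in for the fetched page; it includes one link with nested inner HTML, like the ones the regex missed.

```php
<?php
// Inline sample markup (an assumption) standing in for the fetched page.
$html = '<html><body>'
    . '<a href="Sem_IuE_E2a.pdf">E2a</a>'
    . '<a href="Sem_IuE_E4a.pdf"><span lang=IT style=\'mso-ansi-language:IT\'>E4a</span></a>'
    . '<a href="index.html">Home</a>'
    . '</body></html>';

// Collect parser warnings internally instead of emitting them for sloppy HTML
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$pdfLinks = [];
foreach ($doc->getElementsByTagName('a') as $anchor) {
    $href = $anchor->getAttribute('href');
    // Keep only links whose target ends in ".pdf" (case-insensitive)
    if (preg_match('/\.pdf$/i', $href)) {
        // textContent flattens any nested tags inside the link
        $pdfLinks[$href] = trim($anchor->textContent);
    }
}
print_r($pdfLinks);
```

Since the parser handles nesting itself, the result does not depend on PCRE version or backtracking limits.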
I've come up with a workaround. If you open the page, strip the tags, and then parse, you should get more consistent results. The markup generated by Microsoft apps (as on the target page) is horrible.
<?php
$file = file_get_contents("http://www.etech.haw-hamburg.de/Stundenplan/");
$file = strip_tags($file,'<a>');
$pattern = "!\<a href=[\"|']([^.]+\.pdf)[\"|']\>([^\<]+)\<\/a\>!iU";
preg_match_all($pattern, $file, $matches);
echo "<pre>";
print_r($matches);
echo "</pre>";
?>

Problem with regular expression for some special pattern

I ran into a problem when I tried to find some characters with the following code:
$str = "统计类型目前分为0日Q统计,月统q计及287年7统1计三7种,如需63自定义时间段,点1击此hell处进入自o定w义统or计d!页面。其他统计:客服工作量统计 | 本周服务统计EXCEL";
preg_match_all('/[\w\uFF10-\uFF19\uFF21-\uFF3A\uFF41-\uFF5A]/',$str,$match); //line 5
print_r($match);
And I got the error below:
Warning: preg_match_all() [function.preg-match-all]: Compilation failed: PCRE does not support \L, \l, \N, \U, or \u at offset 4 in E:\mycake\app\webroot\re.php on line 5
I'm not very familiar with regular expressions and have no idea about this error. How can I fix this? Thanks.
The problem is that the PCRE regular expression engine does not understand the \uXXXX syntax for denoting characters via their Unicode code points. Instead, the PCRE engine uses a \x{XXXX} syntax combined with the u modifier:
preg_match_all('/[\w\x{FF10}-\x{FF19}\x{FF21}-\x{FF3A}\x{FF41}-\x{FF5A}]/u',$str,$match);
print_r($match);
See my answer here for some more information.
EDIT:
$str = "统计类型目前分为0日Q统计,月统q计及287年7统1计三7种,如需63自定义时间段,点1击此hell处进入自o定w义统or计d!页面。其他统计:客服工作量统计 | 本周服务统计EXCEL";
preg_match_all('/[\w\x{FF10}-\x{FF19}\x{FF21}-\x{FF3A}\x{FF41}-\x{FF5A}]/u',$str,$match);
//                                                                       ^
//                                                                       |
print_r($match);
/* Array
(
[0] => Array
(
[0] => 0
[1] => Q
[2] => q
[3] => 2
[4] => 8
[5] => 7
[6] => 7
[7] => 1
[8] => 7
[9] => 6
[10] => 3
[11] => 1
[12] => h
[13] => e
[14] => l
[15] => l
[16] => o
[17] => w
[18] => o
[19] => r
[20] => d
[21] => E
[22] => X
[23] => C
[24] => E
[25] => L
)
) */
Are you sure that you used the u modifier (see the arrow above)? If so, you'd have to check whether your PHP supports the u modifier at all (PHP >= 4.1.0 on Unix and >= 4.2.3 on Windows).
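To see the \x{XXXX} ranges on their own, here is a small self-contained sketch. The sample string is an assumption, built with "\u{...}" string escapes (PHP 7+) so the fullwidth characters survive copy and paste; \w is left out of the character class so that only the fullwidth ranges are tested.

```php
<?php
// Assumed sample: "a", fullwidth 1 (U+FF11), a CJK character (U+7EDF),
// fullwidth A (U+FF21), fullwidth z (U+FF5A)
$str = "a\u{FF11}\u{7EDF}\u{FF21}\u{FF5A}";

// Single quotes keep \x{...} intact so that PCRE, not PHP, interprets it.
// The trailing u modifier makes PCRE treat pattern and subject as UTF-8.
$pattern = '/[\x{FF10}-\x{FF19}\x{FF21}-\x{FF3A}\x{FF41}-\x{FF5A}]/u';

preg_match_all($pattern, $str, $match);
print_r($match[0]);
```

Only the three fullwidth characters fall inside the ranges; the ASCII "a" and the CJK character are skipped.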
