I want to get the width of each Japanese character inside a string to avoid text being cut off inside the PDF. However, it gives me this error:
array_map(): Argument #2 ($array) must be of type array, bool given
Description of the problem
I am trying to import email content into a database table. Sometimes, I get an SQL error while inserting a message. I found that it fails when the message content is a binary string instead of a string.
For example, I get this in the console if I print a message that is imported successfully (truncated):
However, I get this with problematic import:
I found out that if I use the function utf8_encode, I am able to import it into SQL successfully. The problem is that it "breaks" accented characters in previously successful imports:
What I have tried
Detecting whether the string was binary with ctype_print; it returned false for both the non-binary and the binary string. I would then have been able to call utf8_encode only when the string was binary.
Using unpack, which did not work.
Detecting the string encoding with mb_detect_encoding, which returned UTF-8 for both.
Using iconv, which failed with iconv(): Detected an illegal character in input string.
Casting the content to string using (string) / settype($html, 'string').
Question
How can I transform the binary string in a normal string so I can then import it in my database without breaking accented characters in other imports?
This is pretty late, but for anyone else reading... Apparently the b prefix is meaningless in PHP, it's a bit of a red herring. See: https://stackoverflow.com/a/51537602/6111743
What encodings did you pass to iconv()? This is the correct solution but you have to give it the correct first argument, which depends on your input. In my example I use "LATIN1" because that turned out to be the correct way to interpret my input but your use case may vary.
You can use mb_check_encoding() to check if it is valid UTF-8 or not. This returns a boolean.
Assuming the question is really something like "how to convert an extended ASCII string to a valid UTF-8 string in PHP" - here is how I did it in my application:
if (!mb_check_encoding($string, "UTF-8")) {
    $string = iconv("LATIN1", "UTF-8//TRANSLIT//IGNORE", $string);
}
The "TRANSLIT" part tells it to attempt transliteration, that's optional for you. The "IGNORE" will prevent it from throwing Detected an illegal character in input string if it does detect one; instead the character will just get ignored, meaning, removed. Your use case may not need either of these.
When you're debugging, I recommend just using "UTF-8" as the second argument so you can see what it's doing. It's useful to see if it throws an error. For me, I had given it the wrong first argument at first (I wrote "ASCII" instead of "LATIN-1") and it threw the illegal character error on an accented character. That error went away once I passed it the correct encoding.
By the way, mb_detect_encoding() was no help to me in figuring out that Latin-1 was what I needed. What helped was dumping the contents of unpack("C*", $string) to see what exact bytes were in there. That's more debugging advice than solution but worth mentioning in case it helps.
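Putting the pieces together, here is a minimal, self-contained sketch of the approach, including the byte-dump debugging trick (the Latin-1 sample string is hypothetical; your input encoding may differ):

```php
<?php
// Hypothetical sample: "été" encoded as Latin-1, where "é" is the
// single byte 0xE9 - not valid UTF-8 on its own.
$string = "\xE9t\xE9";

// Debugging aid: dump the raw bytes. Latin-1 "é" shows up as 233;
// a UTF-8 "é" would show up as the pair 195, 169.
print_r(unpack("C*", $string));

// Convert only if the string is not already valid UTF-8, so that
// already-good imports are left untouched.
if (!mb_check_encoding($string, "UTF-8")) {
    $string = iconv("LATIN1", "UTF-8//TRANSLIT//IGNORE", $string);
}

echo $string, "\n"; // "été", now valid UTF-8
```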
I have code that extracts text from a PDF using a filetotext class. It worked until last week, when something changed in the PDFs being generated. The weird thing is that the characters appear to be there and correct once I add 29 to the ord of each character.
Example response debug printout:
/F1 7.31 Tf
0 0 0 rg
1 0 0 1 195.16 597.4 Tm
($PRXQW)Tj
ET
BT
The code uses gzuncompress on the stream section of the pdf.
The $PRXQW is Amount; adding 29 (decimal) to the ord of each character gives me that. But sometimes a character will not follow this exact translation, such as what should be a ) in the text appearing as the two bytes 5C 66.
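A sketch of the shift being described. Note that the 5C 66 pair fits the same pattern: it is the PDF literal-string escape "\f", i.e. the single byte 0x0C, and 0x0C + 29 = 0x29, which is ")". (This per-byte shift matches the sample, but as the answer explains, the real mapping is font-specific.)

```php
<?php
// Decode a Tj string operand by adding 29 to each byte, after first
// resolving the PDF literal-string escapes observed here (a sketch,
// not a full PDF string parser).
function decodeShifted(string $tjString): string {
    $raw = strtr($tjString, ["\\f" => "\x0C", "\\n" => "\x0A", "\\r" => "\x0D"]);
    $out = '';
    foreach (str_split($raw) as $byte) {
        $out .= chr(ord($byte) + 29);
    }
    return $out;
}

echo decodeShifted('$PRXQW'), "\n"; // Amount
echo decodeShifted("\\f"), "\n";    // )
```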
Just wondering about this decoder-ring type of encoding coming out of PDFs now, and whether anyone has seen this kind of thing?
The encoding of the string argument of the Tj operation depends entirely on the PDF font used (F1 in the case at hand):
A string operand of a text-showing operator shall be interpreted as a sequence of character codes identifying the glyphs to be painted.
With a simple font, each byte of the string shall be treated as a separate character code. The character code shall then be looked up in the font’s encoding to select the glyph, as described in 9.6.6, "Character Encoding".
With a composite font (PDF 1.2), multiple-byte codes may be used to select glyphs. In this instance, one or more consecutive bytes of the string shall be treated as a single character code. The code lengths and the mappings from codes to glyphs are defined in a data structure called a CMap, described in 9.7, "Composite Fonts".
(section 9.4.3 "Text-Showing Operators" in ISO 32000-1)
The OP's code seems to assume a standard encoding like MacRomanEncoding or WinAnsiEncoding, but these are merely special cases. As indicated in the quote above, the encoding may just as well be some ad-hoc, mixed multibyte encoding.
The PDF specification in a later section describes how to properly extract text:
A conforming reader can use these methods, in the priority given, to map a character code to a Unicode value. Tagged PDF documents, in particular, shall provide at least one of these methods (see 14.8.2.4.2, "Unicode Mapping in Tagged PDF"):
If the font dictionary contains a ToUnicode CMap (see 9.10.3, "ToUnicode CMaps"), use that CMap to convert the character code to Unicode.
If the font is a simple font that uses one of the predefined encodings MacRomanEncoding, MacExpertEncoding, or WinAnsiEncoding, or that has an encoding whose Differences array includes only character names taken from the Adobe standard Latin character set and the set of named characters in the Symbol font (see Annex D):
a) Map the character code to a character name according to Table D.1 and the font’s Differences array.
b) Look up the character name in the Adobe Glyph List (see the Bibliography) to obtain the corresponding Unicode value.
If the font is a composite font that uses one of the predefined CMaps listed in Table 118 (except Identity–H and Identity–V) or whose descendant CIDFont uses the Adobe-GB1, Adobe-CNS1, Adobe-Japan1, or Adobe-Korea1 character collection:
a) Map the character code to a character identifier (CID) according to the font’s CMap.
b) Obtain the registry and ordering of the character collection used by the font’s CMap (for example, Adobe and Japan1) from its CIDSystemInfo dictionary.
c) Construct a second CMap name by concatenating the registry and ordering obtained in step (b) in the format registry–ordering–UCS2 (for example, Adobe–Japan1–UCS2).
d) Obtain the CMap with the name constructed in step (c) (available from the ASN Web site; see the Bibliography).
e) Map the CID obtained in step (a) according to the CMap obtained in step (d), producing a Unicode value.
If these methods fail to produce a Unicode value, there is no way to determine what the character code represents in which case a conforming reader may choose a character code of their choosing.
(section 9.10.2 "Mapping Character Codes to Unicode Values" in ISO 32000-1)
Thus:
Just wondering about this decoder-ring type of encoding coming out of PDFs now, and whether anyone has seen this kind of thing?
Yes, it is fairly common for PDFs in the wild to have text-drawing operator string arguments in an encoding entirely different from anything ASCII-ish. And as the last paragraph in the second quote above hints, there are situations that do not allow text extraction at all (without OCR, that is), even though there are additional places one can look for the mapping to Unicode.
What you need in order to decode the mystery string, in the most general case, is the /Encoding field of the selected font, in your case the font /F1. More than likely, the encoding is /Identity-H, under which PDF strings carry 16-bit character codes; the (arbitrary) mapping from those codes to Unicode is then supplied by the font's /ToUnicode CMap.
Here is an example from the PDF parser I'm writing. Each page contains a dictionary of resources, which contains a dictionary of fonts:
[&3|0] => Array [
[/Type] => |/Page|
[/Resources] => Array [
[/Font] => Array [
[/F1] => |&5|0|
[/F2] => |&7|0|
[/F3] => |&9|0|
[/F4] => |&14|0|
[/F5] => |&16|0|
]
]
[/Contents] => |&4|0|
]
In my case, /F3 was producing unusable text, so looking at /F3:
[&9|0] => Array [
[/Type] => |/Font|
[/Subtype] => |/Type0|
[/BaseFont] => |/Arial|
[/Encoding] => |/Identity-H|
[/DescendantFonts] => |&10|0|
[/ToUnicode] => |&96|0|
]
Here you can see the /Encoding type is /Identity-H. The mapping used to decode the characters in /F3 is stored in the stream referenced by /ToUnicode. Here is the relevant text from the stream referenced by '&96|0' (96 0 R) - the rest is boilerplate and can be ignored:
...
beginbfchar
<0003> <0020>
<000F> <002C>
<0015> <0032>
<001B> <0038>
<002C> <0049>
<003A> <0057>
endbfchar
...
beginbfrange
<0044> <0045> <0061>
<0047> <004C> <0064>
<004F> <0053> <006C>
<0055> <0059> <0072>
endbfrange
...
beginbfchar
<005C> <0079>
<00B1> <2013>
<00B6> <2019>
endbfchar
...
The 16-bit pairs between beginbfchar/endbfchar are mappings of individual characters. For example <0003> (0x0003) is mapped onto <0020> (0x0020), which is the space character.
The 16-bit triplets between beginbfrange/endbfrange are mappings of ranges of characters. For example, characters from <0055> (first) to <0059> (last) are mapped onto <0072>, <0073>, <0074>, <0075> and <0076> ('r' through 'v' in UTF-16 & ASCII).
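As a rough illustration of how these mappings get applied, here is a sketch (a hypothetical helper, using the code points from the /ToUnicode excerpt above) that decodes a string of 16-bit Identity-H character codes via the bfchar/bfrange tables:

```php
<?php
// bfchar entries from the excerpt: individual code -> Unicode mappings.
$bfchar = [0x0003 => 0x0020, 0x000F => 0x002C, 0x0015 => 0x0032,
           0x001B => 0x0038, 0x002C => 0x0049, 0x003A => 0x0057,
           0x005C => 0x0079, 0x00B1 => 0x2013, 0x00B6 => 0x2019];

// bfrange entries: [first code, last code, Unicode of first code].
$bfrange = [[0x0044, 0x0045, 0x0061], [0x0047, 0x004C, 0x0064],
            [0x004F, 0x0053, 0x006C], [0x0055, 0x0059, 0x0072]];

function toUnicode(string $codes, array $bfchar, array $bfrange): string {
    $out = '';
    foreach (unpack('n*', $codes) as $code) {   // 'n' = big-endian uint16
        $uni = $bfchar[$code] ?? null;
        if ($uni === null) {
            foreach ($bfrange as [$lo, $hi, $dst]) {
                if ($code >= $lo && $code <= $hi) {
                    $uni = $dst + ($code - $lo); // offset within the range
                    break;
                }
            }
        }
        $out .= mb_chr($uni ?? 0xFFFD, 'UTF-8'); // U+FFFD if unmapped
    }
    return $out;
}

// <002C> -> 'I' (bfchar), <0055> -> 'r' (bfrange)
echo toUnicode("\x00\x2C\x00\x55", $bfchar, $bfrange), "\n"; // Ir
```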
I want to check the size of a string that can contain any type of data.
I have checked strlen and mb_strlen but I am unsure about the differences relating to different data contents.
Some background: what I need to do in the end is cut the string into chunks to serialize it and store it in pieces (being able to restore it afterwards). Chunks always have the same size (32 KB) and contain a serialized object with data plus the part of the string that I cut, so I need the exact size of the string to be able to do that.
From PHP's manual:
Note:
strlen() returns the number of bytes rather than the number of characters in a string.
By contrast, mb_strlen takes the character encoding into consideration: it returns the number of actual characters as defined by the character encoding of the string. For multibyte/variable-width character encodings, strlen can (and typically will) be bigger than mb_strlen.
mb_strlen may also return FALSE if you specify a character encoding to which the string being tested doesn't conform.
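A quick illustration of the difference, and of why the byte-oriented functions are the right tools for fixed-size 32 KB chunks (the sample string is assumed to be UTF-8):

```php
<?php
$s = "héllo"; // "é" is two bytes in UTF-8

var_dump(strlen($s));             // int(6) - bytes
var_dump(mb_strlen($s, 'UTF-8')); // int(5) - characters

// For fixed-size chunks, bytes are what matter, so split and
// reassemble with the byte-based functions:
$chunks = str_split($s, 4);            // byte-based chunking
var_dump(implode('', $chunks) === $s); // bool(true) - lossless round trip
```

A chunk boundary may fall in the middle of a multibyte character; that is harmless as long as chunks are only ever concatenated back together, never treated as standalone text.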
I'm generating text in PHP using imagettftext. The text is being pulled from a MySQL database. Some characters are not appearing in the rendered text despite being in the character map for the font and appearing in the database; for example, em-dashes (—) and smart quotes/apostrophes (“”’).
The characters either don't appear or are replaced by question marks.
I suspect this has to do with encoding, but I don't know enough about encoding to know where to start. Any help would be much appreciated.
Try using htmlentities() on the text before you pass it to the function.
The text string in UTF-8 encoding.
May include decimal numeric character references (of the form: &#8364;) to access characters in a font beyond position 127. The hexadecimal format (like &#xA9;) is supported. Strings in UTF-8 encoding can be passed directly.
Named entities, such as &copy;, are not supported. Consider using html_entity_decode() to decode these named entities into UTF-8 strings (html_entity_decode() supports this as of PHP 5.0.0).
If a character is used in the string which is not supported by the font, a hollow rectangle will replace the character.
Source: http://www.php.net/manual/en/function.imagettftext.php
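Given the manual text quoted above, the usual fix is to decode any named entities to literal UTF-8 before rendering; a sketch (the GD image $im, color $color and font path $font are assumed):

```php
<?php
// Text as it might arrive from a CMS or database: HTML named entities.
$text = "&ldquo;quoted&rdquo; &mdash; it&rsquo;s";

// Decode named entities to literal UTF-8 characters, which
// imagettftext() accepts directly.
$decoded = html_entity_decode($text, ENT_QUOTES, 'UTF-8');
echo $decoded, "\n"; // “quoted” — it’s

// Then render (assumes the font actually contains these glyphs):
// imagettftext($im, 12, 0, 10, 30, $color, $font, $decoded);
```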
I have the following string:
ᴰᴶ Bagi
Is it possible to let iconv make it into DJ Bagi?
First I tried with:
$text = iconv('utf-8', 'us-ascii//TRANSLIT', $text);
Which resulted in the following notice:
Notice: iconv() [function.iconv]: Detected an illegal character in input string
On the PHP site I saw someone using:
//IGNORE//TRANSLIT
While this prevents the notice I only get:
Bagi
I initially thought that this was an encoding problem on your end, but if I copy + paste those characters locally from the SoundCloud source page:
ᴰᴶ Bagi
and try to iconv them, I get the same result as you do. That means that the data is UTF-8, but iconv does not recognize ᴰ as a "child" of D. Unable to convert the character, it complains (a bit misleadingly IMO) about an illegal character.
Edit: This indeed seems to be true. Superscript D is not in the Unicode Superscripts and Subscripts range; it is a phonetic character. That's probably why these characters can't be mapped back to their "parent" letters. Here is more info on ᴰ
As far as I can see, your only choice is to replace the characters manually.
The most primitive example of a replace is
str_replace("ᴰ", "D", $string);
(note that your source file needs to be stored as UTF-8 for this to work)
For an elegant solution, you could build an array out of the source and replacement characters, and pass that to the str_replace call.
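For instance, using strtr() with a small lookup table (only the two characters needed here; extend as required):

```php
<?php
// Manual transliteration table for the superscript modifier letters.
$map = [
    "ᴰ" => "D", // superscript capital D
    "ᴶ" => "J", // superscript capital J
];

echo strtr("ᴰᴶ Bagi", $map), "\n"; // DJ Bagi
```

strtr() with an array replaces the longest matches first and operates on the raw UTF-8 byte sequences, so the source file must itself be saved as UTF-8 (as noted above).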
Or call DJ Bagi and tell him to get the damn letters straight. You will notice that Soundcloud's URL builder encountered exactly the same problem.
soundcloud.com/bagi