Issue with file_get_contents encoding - php

I'm fetching a URI with file_get_contents and getting back JSON that I'm unable to decode.
I tried several encodings and str_replace but I don't quite understand what the issue is.
This is the start of my json with file_get_contents:
string(67702) "��{"localidades"
I know it's finding unknown characters and that's what the ? are for, but I don't understand how to solve it.
I've tried this but to no avail
if(substr($s, 0, 2) == chr(0xFF).chr(0xFE)){
return substr($s,3);
}
else{
return $s;
}
}
This is xxd | head from terminal
00000000: fffe 7b00 2200 6c00 6f00 6300 6100 6c00 ..{.".l.o.c.a.l.
00000010: 6900 6400 6100 6400 6500 7300 2200 3a00 i.d.a.d.e.s.".:.
00000020: 2000 5b00 7b00 2200 6900 6400 4c00 6f00 .[.{.".i.d.L.o.
00000030: 6300 6100 6c00 6900 6400 6100 6400 2200 c.a.l.i.d.a.d.".
00000040: 3a00 2000 3300 2c00 2200 6c00 6f00 6300 :. .3.,.".l.o.c.
00000050: 6100 6c00 6900 6400 6100 6400 2200 3a00 a.l.i.d.a.d.".:.
00000060: 2000 2200 4200 7500 6500 6e00 6f00 7300 .".B.u.e.n.o.s.
00000070: 2000 4100 6900 7200 6500 7300 2200 2c00 .A.i.r.e.s.".,.
00000080: 2200 6900 6400 5000 7200 6f00 7600 6900 ".i.d.P.r.o.v.i.
00000090: 6e00 6300 6900 6100 2200 3a00 2000 2200 n.c.i.a.".:. .".

What you have there is UTF-16LE, in which each code point is encoded as at least two bytes, even "basic ASCII". The first two bytes of the document are the Byte Order Mark [BOM], which declares the byte order [endianness] in which those code units are stored.
$input = "\xff\xfe{\x00}\x00"; // UTF-16-LE with BOM
function convert_utf16($input, $charset=NULL) {
// if your data has no BOM you must explicitly define the charset.
if( is_null($charset) ) {
$bom = substr($input, 0, 2);
switch($bom) {
case "\xff\xfe":
$charset = "UTF-16LE";
break;
case "\xfe\xff":
$charset = "UTF-16BE";
break;
default:
throw new \Exception("No encoding specified, and no BOM detected");
break;
}
$input = substr($input, 2);
}
return mb_convert_encoding($input, "UTF-8", $charset);
}
$output = convert_utf16($input);
var_dump(
$output,
bin2hex($output),
json_decode($output, true)
);
Output:
string(2) "{}"
string(4) "7b7d"
array(0) {
}
It's also worth noting that the current JSON specification (RFC 8259) requires UTF-8 for JSON exchanged between systems, so you should tell whoever is giving you this data to fix their app.

What you are getting is UTF-16 LE. The fffe at the beginning is called a BOM. You can use iconv:
$data = iconv( 'UTF-16', 'UTF-8', $data);
And now you have UTF-8. Depending on the iconv implementation, the result may still begin with a BOM (the bytes EF BB BF). json_decode does not accept a leading BOM, so you should strip it if it is there (see @Sammitch's comment):
$data = preg_replace('/^\xEF\xBB\xBF/', '', $data);
I recreated a part of your file and i get this:
$data = file_get_contents('/var/www/html/utf16le.json');
$data = preg_replace('/^\xEF\xBB\xBF/', '', iconv('UTF-16', 'UTF-8', $data));
print_r(json_decode($data));
Output:
stdClass Object
(
[localidades] => Array
(
[0] => stdClass Object
(
[idLocalidad] => 3
[localidad] => Buenos Aires
)
)
)
And from xxd:

The file you are trying to process is encoded in UTF-16, which PHP's string functions don't handle natively. To process it, remove the BOM (the first two bytes) and then convert the rest to UTF-8 using iconv or mbstring.
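A minimal sketch of those two steps with iconv, assuming the input starts with the little-endian BOM FF FE as in the question's dump. The function name utf16le_bom_to_utf8 and the sample string are made up for illustration:

```php
<?php
// Hypothetical helper: strip a UTF-16LE BOM and convert the rest to UTF-8.
function utf16le_bom_to_utf8(string $raw): string
{
    if (substr($raw, 0, 2) !== "\xFF\xFE") {
        throw new InvalidArgumentException('Input does not start with a UTF-16LE BOM');
    }
    // Skip exactly two bytes (the BOM), then convert the remaining code units.
    $utf8 = iconv('UTF-16LE', 'UTF-8', substr($raw, 2));
    if ($utf8 === false) {
        throw new RuntimeException('iconv could not convert the input');
    }
    return $utf8;
}

// Build a small UTF-16LE sample instead of reading a real file.
$raw  = "\xFF\xFE" . iconv('UTF-8', 'UTF-16LE', '{"localidades": []}');
$json = utf16le_bom_to_utf8($raw);
var_dump(json_decode($json, true));
```

Note that the BOM is two bytes long, so the offset passed to substr is 2, not 3 as in the question's attempt.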

Related

Save data to CSV and encode to utf-8 [duplicate]

I have a database encoded as utf8mb4. I connect to this database and set the utf8mb4 charset:
$dbHandler = new PDO("mysql:host=$dbHost;dbname=$dbName;charset=utf8mb4", $dbUsername, $dbPassword);
All data is properly encoded in the DB. I want to fetch the data and save it as CSV:
$fp = fopen('data.csv', 'w+');
foreach ($result as $row) {
...
fputcsv($fp, $csvData, ';');
}
But then all the encoding is broken:
groÃ<9f>e,
Zubehör. etc.
I've tried adding a BOM (didn't help) and converting with array_map("utf8_encode", $csvData); (some characters then display correctly: große, Zubehör, but some do not: Kabelverl?ng, F?r). Any ideas?
EDIT:
Hexdump output beginning of file:
00000000: efbb bf70 726f 6475 6374 3b61 7274 6963 ...product;artic
00000010: 6c65 3b73 6b75 3b64 6174 653b 6e61 6d65 le;sku;date;name
00000020: 0a30 3030 3239 3039 3530 3030 3b3b 3b3b .00028151000;;;;
00000030: 2242 7265 616b 6f75 742d 626f 7820 4b70 "Breakout-box Kp
00000040: 6c2e 223b 223c 7374 726f 6e67 3e42 7265 l.";"<strong>Bre
00000050: 616b 6f75 742d 626f 7820 4b70 6c2e 3c2f akout-box Kpl.</
Hexdump output of file with 1 record where we can see the issue (F..r instead of Für). By the way - original string was modified by ucwords and strtolower:
00000000: 3030 3032 3930 3936 3030 333b 3b3b 3b22 00028151000;;;;"
00000010: 4e65 747a 7465 696c 2032 3230 762f 3132 Netzteil 220v/12
00000020: 7620 46e3 9c72 2041 766c 223b 223c 7374 v F..r Avl";"<st
00000030: 726f 6e67 3e4e 6574 7a74 6569 6c20 3232 rong>Netzteil 22
00000040: 3076 2f31 3276 2046 e39c 7220 4176 6c3c 0v/12v F..r Avl<
00000050: 2f73 7472 6f6e 673e 3c62 723e 3c62 723e /strong><br><br>
00000060: 4f45 4d20 4e75 6d6d 6572 3a20 3030 3032 OEM Nummer: 0002
00000070: 3930 3936 3030 3322 3b31 3038 2e34 363b 9096003";108.46;
00000080: 3030 3032 3930 3936 3030 332d 6e65 747a 00028151000-netz
00000090: 7465 696c 2d32 3230 762d 3132 762d 6675 teil-220v-12v-fu
000000a0: 722d 6176 6c3b 4875 7371 7661 726e 613b r-avl;Husqvarna;
000000b0: 4452 4f50 444f 574e 3b59 3b4e 3b68 7474 DROPDOWN;Y;N;htt
000000c0: 7073 3a2f 2f73 7061 7265 7061 7274 7366 ps://sparepartsf
000000d0: 696e 6465 722e 6b74 6d2e 636f 6d2f 5350 inder.fha.com/SP
000000e0: 462f 496d 6167 6573 2f6d 6170 732f 3130 F/Images/maps/10
000000f0: 3030 3032 3932 302e 6769 663b 313b 4154 0002920.gif;1;AT
00000100: 3b57 6964 6765 743b 224b 544d 204f 7269 ;Ponret;"KTM Ori
00000110: 6769 6e61 6c20 4572 7361 747a 7465 696c ginal Ersatzteil
00000120: 6522 3b22 4875 7371 7661 726e 6120 4e65 e";"Husqvarna Ne
00000130: 747a 7465 696c 2032 3230 762f 3132 7620 tzteil 220v/12v
00000140: 46e3 9c72 2041 766c 202d 204f 454d 204e F..r Avl - OEM N
00000150: 756d 6d65 723a 2030 3030 3239 3039 3630 ummer: 000290960
00000160: 3033 223b 3b22 4b61 7566 656e 2053 6965 03";;"Kaufen Sie
00000170: 2048 7573 7176 6172 6e61 204e 6574 7a74 Husqvarna Netzt
00000180: 6569 6c20 3232 3076 2f31 3276 2046 e39c eil 220v/12v F..
00000190: 7220 4176 6c20 6d69 7420 4f45 4d2d 4e75 r Avl mit OEM-Nu
000001a0: 6d6d 6572 2030 3030 3239 3039 3630 3033 mmer 00028151000
000001b0: 2062 6569 2065 696e 656d 2048 7573 7176 bei einem Husqv
000001c0: 6172 6e61 2d56 6572 7472 6167 7368 c3a4 arna-Vertragsh..
000001d0: 6e64 6c65 722e 2057 6972 2068 6162 656e ndler. Wir haben
000001e0: 2065 696e 6520 6772 6fc3 9f65 2041 7573 eine gro..e Aus
000001f0: 7761 686c 2061 6e20 4875 7371 7661 726e wahl an Husqvarn
00000200: 612d 4572 7361 747a 7465 696c 656e 2c20 a-Ersatzteilen,
00000210: 4163 6365 7373 6f72 6965 732c 2043 6c6f Accessories, Clo
00000220: 7468 696e 672c 204d 5820 4265 6b6c 6569 thing, MX Beklei
00000230: 6475 6e67 2075 6e64 205a 7562 6568 c3b6 dung und Zubeh..
00000240: 722e 220a r.".
file data.csv output:
data.csv: Non-ISO extended-ASCII text, with very long lines
The problem was that I was using strtolower and ucfirst. I changed it to
$name = mb_convert_case($name, MB_CASE_LOWER, "UTF-8");
$name = mb_convert_case($name, MB_CASE_TITLE, "UTF-8");
and it works.
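For context: strtolower and ucwords operate byte by byte, so depending on the PHP version and locale they can alter bytes that belong to a multibyte UTF-8 character such as ü. A small sketch of the mbstring approach, using a made-up sample value rather than the real database row:

```php
<?php
// Sample value ("\u{00DC}" is Ü); the real data comes from the database as UTF-8.
$name = "NETZTEIL F\u{00DC}R AVL";

// mb_convert_case is encoding-aware, so "Ü" is treated as one character.
$title = mb_convert_case($name, MB_CASE_TITLE, 'UTF-8');

echo $title, PHP_EOL;
// The result is still valid UTF-8.
var_dump(mb_check_encoding($title, 'UTF-8'));
```

MB_CASE_TITLE already lowercases the rest of each word, so a separate MB_CASE_LOWER pass beforehand should not be strictly necessary.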

Why is str_getcsv() stripping quotation marks within fields?

The str_getcsv() function is designed to work on a line of CSV text, not on a set of lines. I am trying to use it twice, once to split multiple lines of CSV into an array of lines, and then again on each of those. This solution was working for me, and indeed I supplied it as an answer to another question.
However, I now have a problem whereby the line "AC150AC,",service tool,845.71,-2 is returned as AC150AC,,service tool,845.71,-2, with the quotation marks removed, so the comma is now treated as a separator. In debugging that, I found that multi-line values are now also not working, and are now split in the middle despite being enclosed correctly.
How can I debug this?
$ cat csv.php
<?php
$csv = '130,TEST A 1258 (U10 001),28.66,2
"AC150AC,",service tool,845.71,-2
AL7951,SEA LION,47.19,2
T11,"Test multi-
line segments",587.36,4';
$n = str_getcsv($csv, "\n");
$r = str_getcsv($csv, "\r");
print_r($n);
print_r($r);
$ xxd csv.php
00000000: 3c3f 7068 700a 0a24 6373 7620 3d20 2731 <?php..$csv = '1
00000010: 3330 2c54 4553 5420 4120 3132 3538 2028 30,TEST A 1258 (
00000020: 5531 3020 3030 3129 2c32 382e 3636 2c32 U10 001),28.66,2
00000030: 0a22 4143 3135 3041 432c 222c 7365 7276 ."AC150AC,",serv
00000040: 6963 6520 746f 6f6c 2c38 3435 2e37 312c ice tool,845.71,
00000050: 2d32 0a41 4c37 3935 312c 5345 4120 4c49 -2.AL7951,SEA LI
00000060: 4f4e 2c34 372e 3139 2c32 0a54 3131 2c22 ON,47.19,2.T11,"
00000070: 5465 7374 206d 756c 7469 2d0a 6c69 6e65 Test multi-.line
00000080: 2073 6567 6d65 6e74 7322 2c35 3837 2e33 segments",587.3
00000090: 362c 3427 3b0a 0a24 6e20 3d20 7374 725f 6,4';..$n = str_
000000a0: 6765 7463 7376 2824 6373 762c 2022 5c6e getcsv($csv, "\n
000000b0: 2229 3b0a 2472 203d 2073 7472 5f67 6574 ");.$r = str_get
000000c0: 6373 7628 2463 7376 2c20 225c 7222 293b csv($csv, "\r");
000000d0: 0a0a 7072 696e 745f 7228 246e 293b 0a70 ..print_r($n);.p
000000e0: 7269 6e74 5f72 2824 7229 3b0a 0a rint_r($r);..
$ php csv.php
Array
(
[0] => 130,TEST A 1258 (U10 001),28.66,2
[1] => AC150AC,,service tool,845.71,-2
[2] => AL7951,SEA LION,47.19,2
[3] => T11,"Test multi-
[4] => line segments",587.36,4
)
Array
(
[0] => 130,TEST A 1258 (U10 001),28.66,2
"AC150AC,",service tool,845.71,-2
AL7951,SEA LION,47.19,2
T11,"Test multi-
line segments",587.36,4
)
The problem is that you are splitting it by end of line characters(of various types) which you don't know if they are mid line or genuine end of lines.
A fudge is to use fgetcsv() to do the work for you, so you first have to give it a file to work with. This code creates a temporary file, writes the contents to it and then rewinds the file so the read starts from the beginning...
$fh = tmpfile();
fwrite($fh, $csv);
fseek($fh, 0);
while ( $row = fgetcsv($fh)) {
print_r($row);
}
fclose($fh);
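The same idea can be wrapped in a function that works on an in-memory stream instead of a temporary file. This is only a sketch: parse_csv_string is a hypothetical name, and the delimiter, enclosure and escape characters are assumed to be the usual defaults.

```php
<?php
// Hypothetical helper: parse a whole CSV string, including quoted line breaks.
function parse_csv_string(string $csv): array
{
    $fh = fopen('php://temp', 'r+');   // in-memory stream, spills to disk if large
    fwrite($fh, $csv);
    rewind($fh);

    $rows = [];
    // Pass delimiter, enclosure and escape explicitly (the usual defaults).
    while (($row = fgetcsv($fh, 0, ',', '"', '\\')) !== false) {
        if ($row === [null]) {
            continue; // fgetcsv returns [null] for a blank line
        }
        $rows[] = $row;
    }
    fclose($fh);
    return $rows;
}

$csv = "130,TEST A 1258 (U10 001),28.66,2\n"
     . "\"AC150AC,\",service tool,845.71,-2\n"
     . "AL7951,SEA LION,47.19,2\n"
     . "T11,\"Test multi-\nline segments\",587.36,4";

print_r(parse_csv_string($csv));
```

Because fgetcsv tracks the quoting state itself, the comma inside "AC150AC," and the line break inside the last record stay within their fields.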
What is the expected result? I'm not sure I understand what you want.
I guess explode() will help you:
$csv = '130,TEST A 1258 (U10 001),28.66,2
"AC150AC,",service tool,845.71,-2
AL7951,SEA LION,47.19,2
T11,"Test multi-
line segments",587.36,4';
$n = explode("\n", $csv);
var_dump(array_map('str_getcsv', $n));

Merging Different PDF formats with PHP?

I am trying to merge a few PDF files with Setasign FPDI. This package works fine for some PDF formats but fails for others.
There are three different formats of PDF I could find.
Format 1:
%PDF-1.4
%´µ¶·
%
1 0 obj
<<
/Type /Catalog
/PageMode /UseNone
/ViewerPreferences 2 0 R
/Pages 3 0 R
/PageLayout /OneColumn
>>
Format 2:
--uuid:3c4caf6a-2a7e-4ca5-9e0a-63346610deae
Content-Type: application/octet-stream
Content-Transfer-Encoding: binary
Content-ID: <1>
%PDF-1.4
%âãÏÓ
1 0 obj
<</ColorSpace/DeviceGray/Subtype/Image
Format 3:
2550 4446 2d31 2e34 0a25 aaab acad 0a34
2030 206f 626a 0a3c 3c0a 2f43 7265 6174
6f72 2028 4170 6163 6865 2046 4f50 2056
6572 7369 6f6e 2031 2e30 290a 2f50 726f
6475 6365 7220 2841 7061 6368 6520 464f
5020 5665 7273 696f 6e20 312e 3029 0a2f
4372 6561 7469 6f6e 4461 7465 2028 443a
3230 3136 3131 3130 3135 3437 3532 5a29
0a3e 3e0a 656e 646f 626a 0a35 2030 206f
FPDI works great with Format 1 but fails for Format 2.
When I tried to merge two Format 2 files using another PDF merging website, I got a combined PDF in Format 3.
My question is: how can I merge two Format 2 files into any format in PHP?
And if anyone can explain these formats, that would be great too.
"Format 2" is a corrupted file, because it includes invalid header data which will corrupt the byte offset positions in the PDF (FPDI will not repair such files but requires valid PDFs).
"Format 3" is only a bunch of hex values not a PDF file.
Thanks to Setasign's answer, I have cleaned the invalid format into a valid one.
I am using simple content splitting:
public function parseRawResponse($raw, $from)
{
$positionMap = [
'PDF' => [ 'init' => "%PDF-1.4\n", 'end' => "\n%%EOF"]
];
$initPos = strpos($raw,$positionMap[$from]['init']);
$endPos = strrpos($raw, $positionMap[$from]['end']) + strlen($positionMap[$from]['end']);
$content = substr($raw, $initPos, ($endPos - $initPos));
return $content;
}
Where $raw is format 2 and $content is actual content for PDF.
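A slightly more general variant, as a sketch only: it searches for any "%PDF-" header rather than the hard-coded "%PDF-1.4\n", and returns null when no PDF is found. The function name extract_pdf and the sample body are hypothetical.

```php
<?php
// Hypothetical helper: cut the PDF out of a multipart-style response body.
function extract_pdf(string $raw): ?string
{
    $start = strpos($raw, '%PDF-');    // first PDF header, any version
    $end   = strrpos($raw, '%%EOF');   // last end-of-file marker
    if ($start === false || $end === false || $end < $start) {
        return null;                   // no recognisable PDF in the input
    }
    return substr($raw, $start, $end + strlen('%%EOF') - $start);
}

// Tiny stand-in for "Format 2": MIME part headers wrapped around a PDF body.
$raw = "--uuid:example\r\nContent-Type: application/octet-stream\r\n\r\n"
     . "%PDF-1.4\n1 0 obj\n<<>>\nendobj\n%%EOF\r\n--uuid:example--";

var_dump(extract_pdf($raw));
```

Checking both positions before calling substr avoids silently returning garbage when the response does not contain a PDF at all.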

cURL font encoding-error

I want to get contents via cURL from this page.
Here is my code:
$url = $_GET["url"];
$url = str_replace(" ", "%20", $url);
$curlSession = curl_init();
curl_setopt($curlSession, CURLOPT_URL, $url);
curl_setopt($curlSession, CURLOPT_BINARYTRANSFER, true);
curl_setopt($curlSession, CURLOPT_RETURNTRANSFER, true);
$jsonData = curl_exec($curlSession);
curl_close($curlSession);
if (strpos($url, "toomva.com") >= 0) {
$jsonData = str_replace("toomva.com", "http://av.bsquochoai.ga ⇔ ", $jsonData);
}
if (strpos($url, "Toomva -") >= 0){
$jsonData = str_replace("toomva.com", "http://av.bsquochoai.ga ⇔ ", $jsonData);
}
echo($jsonData);
Here you can find a live demo.
My problem is that the returned text is not as I expect. It has a lot of �����:
��1� � �0�0�:�0�0�:�2�4�,�4�0�0� �-�-�>� �0�0�:�0�0�:�3�3�,�1�4�0� �
�M��i� �k�h�i� �a�n�h� �t�r���n�g� �t�h��y� �k�h�u���n� �m��t� �e�m�,�
�t�h�� �g�i�a�n� �n���y� �n�h�� �c�h��t� �t�a�n� �b�i��n� � �
Can you please help me with this?
Here are the first few bytes of the file you're trying to access:
$ curl -s 'http://toomva.com/Data/subtitle/Duncan%20James%20ft.%20Keedie%20-%20I%20Believe%20My%20Heart.Vie_Syned.srt' | xxd | head
0000000: fffe 3100 0d00 0a00 3000 3000 3a00 3000 ..1.....0.0.:.0.
0000010: 3000 3a00 3200 3400 2c00 3400 3000 3000 0.:.2.4.,.4.0.0.
0000020: 2000 2d00 2d00 3e00 2000 3000 3000 3a00 .-.-.>. .0.0.:.
0000030: 3000 3000 3a00 3300 3300 2c00 3100 3400 0.0.:.3.3.,.1.4.
0000040: 3000 0d00 0a00 4d00 d71e 6900 2000 6b00 0.....M...i. .k.
0000050: 6800 6900 2000 6100 6e00 6800 2000 7400 h.i. .a.n.h. .t.
0000060: 7200 f400 6e00 6700 2000 7400 6800 a51e r...n.g. .t.h...
0000070: 7900 2000 6b00 6800 7500 f400 6e00 2000 y. .k.h.u...n. .
0000080: 6d00 b71e 7400 2000 6500 6d00 2c00 2000 m...t. .e.m.,. .
0000090: 7400 6800 bf1e 2000 6700 6900 6100 6e00 t.h... .g.i.a.n.
It starts with 0xff 0xfe, which is the byte order mark for UTF-16 Little Endian. This information should really be provided in the file's HTTP headers, but apparently not in this case.
You can use PHP's mb_convert_encoding() function to change the file's content into whatever character set you're using for your website. For example, this will convert it into utf-8:
$src = file_get_contents('http://toomva.com/Data/subtitle/Duncan%20James%20ft.%20Keedie%20-%20I%20Believe%20My%20Heart.Vie_Syned.srt');
$utf8src = mb_convert_encoding($src,'UTF-8','UTF-16LE');
header('Content-Type: text/plain; charset=utf-8');
die($utf8src);
However, the file doesn't contain JSON data. Here are the first few lines:
1
00:00:24,400 --> 00:00:33,140
Mỗi khi anh trông thấy khuôn mặt em, thế gian này như chợt tan biến
2
00:00:33,140 --> 00:00:42,700
Tất cả đều phơi bày trong một ánh nhìn thoáng qua
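One consequence for the code in the question: str_replace searches for the ASCII bytes of "toomva.com", but in UTF-16LE every such character is followed by a 00 byte, so the search string cannot match until the body has been converted. A rough sketch, with a hand-made sample standing in for the downloaded subtitle (the function name, the sample text and the replacement are assumptions):

```php
<?php
// Hypothetical helper: convert a UTF-16LE body (with BOM) to UTF-8,
// and leave anything else untouched.
function body_to_utf8(string $body): string
{
    if (substr($body, 0, 2) === "\xFF\xFE") {
        return mb_convert_encoding(substr($body, 2), 'UTF-8', 'UTF-16LE');
    }
    return $body;
}

// Stand-in for the result of curl_exec(): UTF-16LE text with a BOM.
$body = "\xFF\xFE" . mb_convert_encoding("1\r\nvisit toomva.com\r\n", 'UTF-16LE', 'UTF-8');

// On the raw bytes the ASCII needle is not found ...
var_dump(strpos($body, 'toomva.com'));

// ... but after conversion an ordinary str_replace works.
$text = str_replace('toomva.com', 'example.org', body_to_utf8($body));
echo $text;
```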
Use utf8_encode when you echo your $jsonData:
echo(utf8_encode($jsonData));

How to get the number of pages in a Word Document on linux?

I saw the question "PHP - Get number of pages in a Word document". I also need to determine the page count of a given Word file (doc/docx). I tried to investigate phplivedocx/ZF (@hobodave linked to those in the answers to the original post), but I got completely lost there. I can't use any external web service either (such as DOC2PDF sites, then counting the pages in the PDF version, or similar).
Simply: is there any PHP code (using ZF or anything else in PHP, excluding COM objects or external executables such as AbiWord; I'm on a shared Linux server without exec or similar functions) to find the page count of a Word file?
EDIT: The Word versions that need to be supported are Microsoft Word 2003 and 2007.
Getting the number of pages for docx files is very easy:
function get_num_pages_docx($filename)
{
$zip = new ZipArchive();
if($zip->open($filename) === true)
{
if(($index = $zip->locateName('docProps/app.xml')) !== false)
{
$data = $zip->getFromIndex($index);
$zip->close();
$xml = new SimpleXMLElement($data);
return (int) $xml->Pages; // cast the SimpleXMLElement to a plain integer
}
$zip->close();
}
return false;
}
For the 97-2003 format it's certainly challenging, but by no means impossible. The number of pages is stored in the SummaryInformation section of the document, but the OLE format of the files makes it a pain to find. The structure is defined extremely thoroughly (though badly imo) here and simpler here. I looked at this for an hour today but didn't get very far (it's not a level of abstraction I'm used to). I did write code to output the hex, to better understand the structure:
function get_num_pages_doc($filename)
{
$handle = fopen($filename, 'rb'); // binary-safe read
$line = @fread($handle, filesize($filename));
echo '<div style="font-family: courier new;">';
$hex = bin2hex($line);
$hex_array = str_split($hex, 4);
$i = 0;
$line = 0;
$collection = '';
foreach($hex_array as $key => $string)
{
$collection .= hex_ascii($string);
$i++;
if($i == 1)
{
echo '<b>'.sprintf('%05X', $line).'0:</b> ';
}
echo strtoupper($string).' ';
if($i == 8)
{
echo ' '.$collection.' <br />'."\n";
$collection = '';
$i = 0;
$line += 1;
}
}
echo '</div>';
exit();
}
function hex_ascii($string, $html_safe = true)
{
$return = '';
$conv = array($string);
if(strlen($string) > 2)
{
$conv = str_split($string, 2);
}
foreach($conv as $string)
{
$num = hexdec($string);
$ascii = '.';
if($num > 32)
{
$ascii = unichr($num);
}
if($html_safe AND ($num == 62 OR $num == 60))
{
$return .= htmlentities($ascii);
}
else
{
$return .= $ascii;
}
}
return $return;
}
function unichr($intval)
{
return mb_convert_encoding(pack('n', $intval), 'UTF-8', 'UTF-16BE');
}
which will output a hex dump in which you can find sections such as:
007000: 0500 5300 7500 6D00 6D00 6100 7200 7900 ..S.u.m.m.a.r.y.
007010: 4900 6E00 6600 6F00 7200 6D00 6100 7400 I.n.f.o.r.m.a.t.
007020: 6900 6F00 6E00 0000 0000 0000 0000 0000 i.o.n...........
007030: 0000 0000 0000 0000 0000 0000 0000 0000 ................
Which will allow you to see the referencing info such as:
007040: 2800 0201 FFFF FFFF FFFF FFFF FFFF FFFF (...ÿÿÿÿÿÿÿÿÿÿÿÿ
007050: 0000 0000 0000 0000 0000 0000 0000 0000 ................
007060: 0000 0000 0000 0000 0000 0000 0000 0000 ................
007070: 0000 0000 2500 0000 0010 0000 0000 0000 ....%...........
Which will allow you to determine properties described:
_ab = ("SummaryInformation")
_cb = 0028
_mse = 02 (STGTY_STREAM)
_bflags = 01 (DE_BLACK)
_sidLeftSib = FFFF FFFF
_sidRightSib = FFFF FFFF (none)
_sidChild = FFFF FFFF (n/a for STGTY_STREAM)
_clsid = 0000 0000 0000 0000 0000 0000 0000 0000 (n/a)
_dwUserFlags = 0000 0000 (n/a)
_time[0] = CreateTime = 0000 0000 0000 0000 (n/a)
_time[1] = ModifyTime = 0000 0000 0000 0000 (n/a)
_startSect = 0000 0000
_ulSize = 0000 1000
_dptPropType = 0000 (n/a)
Which will let you find the relevant section of code, unpack it and get the page number. Of course this is the hard bit that I just don't have time for, but should set you in the right direction.
M$ don't make it easy!
Have a look at PhpWord on Microsoft CodePlex: http://phpword.codeplex.com/
It will allow you to open and read a Word-formatted file in PHP and do whatever processing you require.
To get metadata properties of doc, docx, ppt and pptx files, such as the number of pages or slides, using PHP, I followed the process below and it worked like a charm. I hope it helps someone.
Download and configure Apache Tika.
Once that's done, you can try executing the following commands; they will print all the metadata about your file:
java -jar tika-app-1.5.jar -m test.docx
java -jar tika-app-1.5.jar -m test.doc
java -jar tika-app-1.5.jar -m test.pptx
java -jar tika-app-1.5.jar -m test.ppt
Once tested, you can execute this command from a PHP script. Thanks.
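A rough sketch of that last step. It assumes that exec-style functions are available (the question says they are not on the asker's host), that Tika prints one "Key: value" pair per line, and that the page count appears under "xmpTPg:NPages" or "Page-Count". The key names, the function name and the sample text are assumptions to check against your Tika version.

```php
<?php
// Hypothetical helper: pull a page count out of Tika's "-m" metadata text.
// Assumption: one "Key: value" pair per line, with the page count stored
// under "xmpTPg:NPages" or "Page-Count".
function tika_page_count(string $metadata): ?int
{
    if (preg_match('/^(?:xmpTPg:NPages|Page-Count):[ \t]*(\d+)\s*$/m', $metadata, $m)) {
        return (int) $m[1];
    }
    return null; // no page count found in the metadata
}

// With exec-style functions enabled, the metadata could be fetched like this:
// $metadata = shell_exec('java -jar tika-app-1.5.jar -m ' . escapeshellarg($file));

// Made-up sample output, only to show the parsing step.
$sample = "Content-Type: application/msword\nPage-Count: 3\nWord-Count: 512\n";
var_dump(tika_page_count($sample));
```

Passing the file name through escapeshellarg keeps user-supplied names from being interpreted by the shell.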
Excluding using Abiword or OpenOffice? Impossible - number of pages will depend on number of words/letters, fonts used, justification and kerning, margin size, line spacing, paragraph spacing, number of paragraphs, columns, size of graphics / embedded objects, page / column breaks and page margins.
You need something which can understand all of these.
Even if you use OpenOffice or Abiword, reflowing the text may change the number of pages. Indeed, in some cases opening the same document on a different instance of MSWord may result in a difference.
The best you could probably manage would be a statistical approach based on a representation of the document - but you'll still see huge variance.
