I'm trying to convert some archived CSV data. It worked well on a couple of thousand files: I parse out a date and convert it to a timestamp. On one file, however, it doesn't work. I use (int) $string to cast the parsed strings to int values, and it returns int(0). I also tried intval(), with the same result. When I use var_dump($string), I get some weird output, for example string(9) "2008", which should actually be string(4) "2008". I tried using preg_match on the string, without success. Is this an encoding problem?
Here is some code, it's just pretty standard stuff:
date_default_timezone_set('UTC');
$ms = 0;
function convert_csv($filename)
{
global $ms; // $ms is defined at file scope, so it has to be imported into the function
$target = "tmp.csv";
$fp = fopen("$filename","r") or die("Can't read the file!");
$fpo = fopen("$target","w") or die("Can't write the file!");
while($line = fgets($fp,1024))
{
$linearr = explode(",","$line");
$time = $linearr[2];
$bid = $linearr[3];
$ask = $linearr[4];
$time = explode(" ",$time);
$date = explode("-",$time[0]);
$year = (int)$date[0];
$month = (int)$date[1];
$day = (int)$date[2];
$time = explode(":",$time[1]);
$hour = (int)$time[0];
$minute = (int)$time[1];
$second = (int)$time[2];
$time = mktime($hour,$minute,$second,$month,$day,$year);
if($ms >= 9)
{
$ms = 0;
}else
{
$ms ++;
}
$time = $time.'00'.$ms;
$newline = "$time,$ask,$bid,0,0\n";
fwrite($fpo,$newline);
}
fclose($fp);
fclose($fpo);
unlink($filename);
rename($target,$filename);
}
Here is a link to the file we are talking about:
http://ratedata.gaincapital.com/2008/04%20April/EUR_USD_Week1.zip
The file seems to be encoded in UTF-16, so it is indeed an encoding problem. The string(9) is caused by the null-bytes that you get if UTF-16 is interpreted as a single-byte encoding.
This makes the file hard to read with functions like fgets, since they are binary-safe and thus not encoding aware. You could read the entire file in memory and perform an encoding conversion, but this is horribly inefficient.
I'm not sure if it's possible to read the file properly as UTF-16 using native PHP functions. You might need to write or use an external library.
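To see where string(9) and int(0) come from, here is a small sketch. It assumes the file is UTF-16LE (little-endian) and uses a made-up sample line; iconv() is only used to build the test data.

```php
<?php
// Hypothetical sample line; the real file's columns may differ.
$utf8Line = "a,2008-04-01";
// Encode it the way the problem file is assumed to be stored.
$utf16Line = iconv('UTF-8', 'UTF-16LE', $utf8Line);

// explode() works on bytes, so "," matches only the low byte of the comma
// and the null byte that follows stays glued to the start of the next field.
$fields = explode(",", $utf16Line);
$date   = explode("-", $fields[1]);

var_dump(strlen($date[0])); // int(9): one leading "\0" plus 4 digits * 2 bytes
var_dump((int) $date[0]);   // int(0): the string starts with "\0", not a digit
```

The leading null byte is also why the output looks like a plain "2008": terminals usually do not render null bytes.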
You may try to convert your file to plain ASCII using iconv.
If you are on Linux or a similar system that has the iconv command:
$ iconv -f UTF-16 -t ASCII EUR_USD_Week1.csv > clean.csv
Otherwise you may find the PHP iconv function useful:
http://php.net/manual/en/function.iconv.php
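For a PHP-only route, one option that may work is the built-in convert.iconv.* stream filter, which converts the data while fgets() reads it. This is a rough sketch, not a tested fix for this particular file: the function name readUtf16Lines is made up, and the UTF-16LE default is an assumption about the file's byte order.

```php
<?php
// Read a UTF-16 file line by line, converting to UTF-8 on the fly.
// $from is an assumption: use 'UTF-16BE' if the file is big-endian.
function readUtf16Lines(string $path, string $from = 'UTF-16LE'): array
{
    $fp = fopen($path, 'r');
    if ($fp === false) {
        return [];
    }
    // convert.iconv.<from>/<to> is a built-in filter (needs the iconv extension).
    stream_filter_append($fp, "convert.iconv.$from/UTF-8", STREAM_FILTER_READ);

    $lines = [];
    while (($line = fgets($fp)) !== false) {
        // Drop a leading byte order mark (U+FEFF as UTF-8) and the line ending.
        $line = preg_replace('/^\xEF\xBB\xBF/', '', $line);
        $lines[] = rtrim($line, "\r\n");
    }
    fclose($fp);
    return $lines;
}
```

Each returned line is ordinary UTF-8, so explode() and (int) casts behave as they do on the other files.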
Related
I wrote a script that uses file_get_contents() to open files on the server; the path is delivered by POST. I only had to receive the path from POST and use str_replace() to change "%2F" to "/". Everything worked fine until today. Now, for some of the supplied paths, PHP produces unescaped hexadecimal codes that cannot be passed to file_get_contents().
Example:
2 parameters:
end = wypisZgierz/Łagiewniki Nowe Zachód/koncowe.htm
start = wypisZgierz/Łagiewniki Nowe Zachód/ogolne.htm
source:
start=wypisZgierz%2F%C5%81agiewniki+Nowe+Zach%C3%B3d%2Fogolne.htm&end=wypisZgierz%2F%C5%81agiewniki+Nowe+Zach%C3%B3d%2Fkoncowe.htm
Start works fine, but end does not.
PHP error log:
PHP Warning: file_get_contents(wypisZgierz/\xc5\x81agiewniki Nowe Zach\xc3\xb3d/koncowe.htm): failed to open stream: No such file or directory
*EDIT 1 *
Thanks to rawurldecode() I could skip my str_replace(), but this doesn't help with the original problem.
Real php code:
<?php
session_start();
$start = "";
$end = "";
foreach ($_POST as $key => $value) {
if($key == "start") {
$start = $value;
$start = str_replace("%2F","/",$start);}
if($key == "end") {
$end = $value;
$end = str_replace("%2F","/",$end);
}
}
$tempstart = file_get_contents($start);
$html = $tempstart;
$tempend = file_get_contents($end);
$html = $html.$tempend;
echo $html;
?>
The problem is that it works fine for old data, but now sometimes neither $start nor $end works, and sometimes only $start works. And if I supply start and end with the same value, it still shows an error for end (start may or may not work).
The PHP function rawurldecode() should do the trick! But if you posted some real PHP code, it would be easier to determine your exact problem.
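As a small illustration of the two decoding functions, applied to the raw value from the question: rawurldecode() leaves "+" alone, while urldecode() also turns it into a space, which is what form-encoded POST data uses.

```php
<?php
// Raw form-encoded value as shown in the question's request source.
$raw = 'wypisZgierz%2F%C5%81agiewniki+Nowe+Zach%C3%B3d%2Fogolne.htm';

// rawurldecode(): percent escapes are decoded, "+" is kept as is.
echo rawurldecode($raw), "\n";
// urldecode(): percent escapes are decoded and "+" becomes a space.
echo urldecode($raw), "\n";
```

Note that values read from $_POST have already been decoded by PHP, so decoding them a second time is normally not needed. The \xc5\x81 in the warning is most likely just how the log escapes the two UTF-8 bytes of "Ł", not a sign that the path was decoded incorrectly.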
I'm sorry if I'm asking the obvious, but I can't seem to find a working solution for a simple task. On the input I have a string, provided by a user, encoded as UTF-8. I need to sanitize it by removing all characters below 0x20 (space), except 0x9 (tab).
The following works for ANSI strings, but not for UTF-8:
$newName = "";
$ln = strlen($name);
for($i = 0; $i < $ln; $i++)
{
$ch = substr($name, $i, 1);
$och = ord($ch);
if($och >= 0x20 ||
$och == 0x9)
{
$newName .= $ch;
}
}
It completely misses UTF-8 encoded characters and treats them as bytes. I keep finding posts where people suggest using mb_ functions, but that still doesn't help me. (For instance, I tried calling mb_strlen($name, "utf-8"); instead of strlen, but it still returns the length of the string in bytes instead of characters.)
Any idea how to do this in PHP?
PS. Sorry, my PHP is somewhat rusty.
If you use multibyte functions (mb_) then you have to use them for everything. In this example you should use mb_strlen() and mb_substr().
The reason it is not working is probably because you are using ord(). It only works with ASCII values:
ord
(PHP 4, PHP 5)
ord — Return ASCII value of character
...
Returns the ASCII value of the first character of string.
In other words, if you throw a multibyte character into ord() it will only use the first byte, and throw away the rest.
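A quick way to see the byte-versus-character difference, assuming the mbstring extension is loaded:

```php
<?php
$ch = "\xC5\x81"; // "Ł" (U+0141) encoded as UTF-8: two bytes

var_dump(strlen($ch));             // int(2): strlen() counts bytes
var_dump(mb_strlen($ch, 'UTF-8')); // int(1): one character
var_dump(ord($ch));                // int(197): 0xC5, the first byte only
```

Every byte of a multibyte UTF-8 sequence is 0x80 or higher, so none of them can be mistaken for a control character below 0x20.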
Wow, PHP is one messed up language. Here's what worked for me (but how much slower will this run for a longer chunk of text...):
function normalizeName($name, $encoding_2_use, $encoding_used)
{
//'$name' = string to normalize
// INFO: Must be encoded with '$encoding_used' encoding
//'$encoding_2_use' = encoding to use for return string (example: "utf-8")
//'$encoding_used' = encoding used to encode '$name' (can be also "utf-8")
//RETURN:
// = Name normalized, or
// = "" if error
$resName = "";
$ln = mb_strlen($name, $encoding_used);
if($ln !== false)
{
for($i = 0; $i < $ln; $i++)
{
$ch = mb_substr($name, $i, 1, $encoding_used);
$arp = unpack('N', mb_convert_encoding($ch, 'UCS-4BE', $encoding_used));
if(count($arp) >= 1)
{
$och = intval($arp[1]); // unpack() numbers unnamed elements starting at 1, hence index 1
if($och >= 0x20 || $och == 0x9)
{
$ch2 = mb_convert_encoding('&#'.$och.';', $encoding_2_use, 'HTML-ENTITIES');
$resName .= $ch2;
}
}
}
}
return $resName;
}
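If the input is known to be UTF-8, a shorter alternative is a single regular expression instead of a per-character loop. This is a sketch, not a drop-in replacement for the function above: the name stripControlChars is made up, and returning "" on invalid input simply mirrors the error convention used there.

```php
<?php
function stripControlChars(string $name): string
{
    // Remove U+0000..U+001F except tab (U+0009).
    // The /u modifier makes PCRE treat pattern and subject as UTF-8.
    $clean = preg_replace('/[\x00-\x08\x0A-\x1F]/u', '', $name);

    // preg_replace() returns null on error, e.g. when $name is not valid UTF-8.
    return $clean === null ? '' : $clean;
}
```

For example, stripControlChars("a\x01b\tc") keeps the tab and drops the 0x01 byte.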
I have to read a file and identify its encoding. I used mb_detect_encoding() to detect UTF-16, but I am getting the wrong result. How can I detect UTF-16 encoding in PHP?
The PHP file is UTF-16 and my header was windows-1256 (because of Arabic):
header('Content-Type: text/html; charset=windows-1256');
$delimiter = '\t';
$f= file("$fileName");
foreach($f as $dailystatmet)
{
$transactionData = str_replace("'", '', $dailystatmet);
preg_match_all("/('?\d+,\d+\.\d+)?([a-zA-Z]|[0-9]|)[^".$delimiter."]+/",$transactionData,$matches);
array_push($matchesz, $matches[0]);
}
$searchKeywords = array ("apple", "orange", 'mango');
$rowCount = count($matchesz);
for ($row = 1; $row <= $rowCount; $row++) {
$myRow = $row;
$cell = $matchesz[$row];
foreach ($searchKeywords as $val) {
if (partialArraySearch($cell[$c_description], $val)) {
}
}}
function partialArraySearch($cell, $searchword)
{
if (strpos(strtoupper($cell), strtoupper($searchword)) !== false) {
return true;
}
return false;
}
The code above searches within the uploaded file. If the file is UTF-8, the match is found, but with the same file in UTF-16 or UTF-32 I get no result.
So how can I get the encoding of the uploaded file?
If someone is still searching for a solution, I have hacked something like this in the "voku/portable-utf8" repo on github. => "UTF8::file_get_contents()"
The "file_get_contents"-wrapper will detect the current encoding via "UTF8::str_detect_encoding()" and will convert the content of the file automatically into UTF-8.
e.g.: from the PHPUnit tests ...
$testString = UTF8::file_get_contents(dirname(__FILE__) . '/test1Utf16pe.txt');
$this->assertContains('<p>Today’s Internet users are not the same users who were online a decade ago. There are better connections.', $testString);
$testString = UTF8::file_get_contents(dirname(__FILE__) . '/test1Utf16le.txt');
$this->assertContains('<p>Today’s Internet users are not the same users who were online a decade ago. There are better connections.', $testString);
My solution to detect UTF-16 and convert the content to ISO-8859-15 (Latin-9) is:
preg_match_all('/\x00/',$content,$count);
if(count($count[0])/strlen($content)>0.4) {
$content = iconv('UTF-16', 'ISO-8859-15', $content);
}
In other words, I check the frequency of the byte 0x00. If it is higher than 0.4, the text probably consists of characters from the basic set encoded in UTF-16: each character takes two bytes, and one of the two is usually 00.
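When the file starts with a byte order mark (BOM), checking for it is more reliable than counting null bytes. The sketch below is a complement to the heuristic above, not a replacement: the function name detectBom is made up, and files without a BOM return null.

```php
<?php
// Return the encoding announced by a byte order mark, or null if there is none.
function detectBom(string $bytes): ?string
{
    // The 4-byte UTF-32 marks come first: the UTF-32LE BOM starts with the UTF-16LE one.
    $boms = [
        'UTF-32LE' => "\xFF\xFE\x00\x00",
        'UTF-32BE' => "\x00\x00\xFE\xFF",
        'UTF-8'    => "\xEF\xBB\xBF",
        'UTF-16LE' => "\xFF\xFE",
        'UTF-16BE' => "\xFE\xFF",
    ];
    foreach ($boms as $encoding => $bom) {
        // strncmp() is binary safe, so null bytes in the BOM are compared too.
        if (strncmp($bytes, $bom, strlen($bom)) === 0) {
            return $encoding;
        }
    }
    return null;
}
```

If detectBom() returns an encoding, the content can be passed to iconv() with that name as the source encoding; otherwise the null-byte frequency check is still available as a fallback.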
I have a set of ZLIB compressed / base64 encoded strings (done in a C program) that are stored in a database. I have written a small PHP page that should retrieve these values and plot them (the string originally was a list of floats).
Chunk of C program that compresses/encodes:
error=compress2(comp_buffer, &comp_length,(const Bytef*)data.mz ,(uLongf)length,Z_DEFAULT_COMPRESSION); /* compression */
if (error != Z_OK) {fprintf(stderr,"zlib error..exiting"); exit(EXIT_FAILURE);}
mz_binary=g_base64_encode (comp_buffer,comp_length); /* encoding */
(Example) of original input format:
292.1149 8379.5928
366.1519 101313.3906
367.3778 20361.8105
369.1290 17033.3223
375.4355 1159.1841
467.3191 8445.3926
Each column was compressed/encoded as a single string. To reconstruct the original data i am using the following code:
//$row[4] is retrieved from the DB and contains the compressed/encoded string
$mz = base64_decode($row[4]);
$unc_mz = gzuncompress($mz);
echo $unc_mz;
Yet this gives me the following output:
f6jEÍ„]EšiSE#IEfŽ
Could anyone give me a tip/hint about what I might be missing?
------ Added Information -----
I feel that the problem comes from the fact that PHP currently views $unc_mz as a single string, while in reality I would have to reconstruct an array containing X lines (this output was from a 9-line file), but I have no idea how to do that assignment.
The C program that did that went roughly like this:
uncompress( pUncompr , &uncomprLen , (const Bytef*)pDecoded , decodedSize );
pToBeCorrected = (char *)pUncompr;
for (n = 0; n < (2 * peaksCount); n++) {
pPeaks[n] = (RAMPREAL) ((float *) pToBeCorrected)[n];
}
where peaksCount would be the amount of 'lines' in the input file.
EDIT (15-2-2012): The problem with my code was that I was not reconstructing the array, the fixed code is as follows (might be handy if someone needs a similar snippet):
while ($row = mysql_fetch_array($result, MYSQL_NUM)) {
$mz = base64_decode($row[4]);
$unc_mz = gzuncompress($mz);
$max = strlen($unc_mz);
$counter = 0;
for ($i = 0; $i < $max; $i = $i + 4) {
$temp= substr($unc_mz,$i,4);
$temp = unpack("f",$temp);
$mz_array[$counter] = $temp[1];
$counter++;
}
}
The uncompressed string has to be chopped into chunks corresponding to the length of a float; unpack() then reconstructs the float data from the binary chunk. That's the simplest description that I can give for the above snippet.
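The same idea can be written more compactly by letting unpack() read all floats in one call with the "f*" format. This is a sketch under two assumptions: the zlib extension is available, and the floats were written on a machine with the same float size and byte order as the one running PHP. The function name decodeFloats is made up.

```php
<?php
// Decode a base64 string holding zlib-compressed 4-byte floats into an array.
function decodeFloats(string $encoded): array
{
    $compressed = base64_decode($encoded, true);
    if ($compressed === false) {
        return [];
    }
    $binary = gzuncompress($compressed);
    if ($binary === false) {
        return [];
    }
    // "f*" reads every float in the buffer (machine byte order and size).
    // unpack() returns a 1-indexed array, so reindex it from 0.
    return array_values(unpack('f*', $binary));
}
```

If the data may come from a machine with a different byte order, the explicit format codes "g" (little-endian float) and "G" (big-endian float) in newer PHP versions are the safer choice.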
compress2() produces the zlib format (RFC 1950). Despite its name, PHP's gzuncompress() reads exactly that format; gzdecode() is the function that expects the gzip format (RFC 1952), and gzinflate() expects raw deflate data. So the formats already match, and gzuncompress() should not fail here.
You would only need deflateInit2() in zlib, to request that deflate() produce gzip-formatted output, if you wanted to read the data with gzdecode() instead.
I searched google for my problem but found no solution.
I want to read a file and convert the buffer to binary like 10001011001011001.
If I have something like this from the file
bmoov���lmvhd�����(tF�(tF�_�
K�T��������������������������������������������#���������������������������������trak���\tkh
d����(tF�(tF������� K������������������������������������������������#������������$edts��
How can I convert all characters (including also this stuff ��) to 101010101000110010 representation??
I hope someone can help me :)
Use ord() on each byte to get its decimal value and then sprintf() to print it in binary form (forcing each byte to 8 bits by left-padding with zeros).
<?php
$buffer = file_get_contents(__FILE__);
$length = filesize(__FILE__);
if (!$buffer || !$length) {
die("Reading error\n");
}
$_buffer = '';
for ($i = 0; $i < $length; $i++) {
$_buffer .= sprintf("%08b", ord($buffer[$i]));
}
var_dump($_buffer);
$ php test.php
string(2096) "001111000011111101110000011010000111000000001010001001000110001001110101011001100110011001100101011100100010000000111101001000000110011001101001011011000110010101011111011001110110010101110100010111110110001101101111011011100111010001100101011011100111010001110011001010000101111101011111010001100100100101001100010001010101111101011111001010010011101100001010001001000110110001100101011011100110011101110100011010000010000000111101001000000110011001101001011011000110010101110011011010010111101001100101001010000101111101011111010001100100100101001100010001010101111101011111001010010011101100001010000010100110100101100110001000000010100000100001001001000110001001110101011001100110011001100101011100100010000001111100011111000010000000100001001001000110110001100101011011100110011101110100011010000010100100100000011110110000101000100000001000000110010001101001011001010010100000100010010100100110010101100001011001000110100101101110011001110010000001100101011100100111001001101111011100100101110001101110001000100010100100111011000010100111110100001010000010100010010001011111011000100111010101100110011001100110010101110010001000000011110100100000001001110010011100111011000010100110011001101111011100100010000000101000001001000110100100100000001111010010000000110000001110110010000000100100011010010010000000111100001000000010010001101100011001010110111001100111011101000110100000111011001000000010010001101001001010110010101100101001001000000111101100001010001000000010000000100100010111110110001001110101011001100110011001100101011100100010000000101110001111010010000001110011011100000111001001101001011011100111010001100110001010000010001000100101001100000011100001100100001000100010110000100000011001000110010101100011011000100110100101101110001010000110111101110010011001000010100000100100011000100111010101100110011001100110010101110010010110110010010001101001010111010010100100101001001010010011101100001010011111010000101000001010011101100110000101110010010111110110010001
11010101101101011100000010100000100100010111110110001001110101011001100110011001100101011100100010100100111011"
One thing you could do is read the file into a string variable, then print the string in binary representation using printf (it takes the same format syntax as sprintf):
$string = file_get_contents($file);
for($l=strlen($string), $i=0; $i<$l; $i++)
{
printf('%08b', ord($string[$i]));
}
If you're just looking for a hexadecimal representation, you can use bin2hex:
echo bin2hex($string);
If you're looking for a nicer form of hexdump, please see the related question:
How can I get a hex dump of a string in PHP?
Reading a file word-wise (32 bits at once) would be faster than byte-wise:
$s = file_get_contents("filename");
foreach(unpack("L*", $s) as $n)
$buf[] = sprintf("%032b", $n);
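One caveat with the word-wise version: "L" uses machine byte order, so on a little-endian machine the bytes inside each 32-bit word come out reversed compared to the byte-wise output, and trailing bytes that do not fill a whole word are not converted. A sketch that keeps the original byte order uses "N" (big-endian) and handles the tail separately. The function names are made up, and a 64-bit PHP build is assumed.

```php
<?php
// Reference implementation: one byte at a time.
function toBitsBytewise(string $s): string
{
    $out = '';
    for ($i = 0, $n = strlen($s); $i < $n; $i++) {
        $out .= sprintf('%08b', ord($s[$i]));
    }
    return $out;
}

// Word-wise variant: 32 bits at a time, same output as the byte-wise one.
function toBitsWordwise(string $s): string
{
    $whole = strlen($s) - strlen($s) % 4; // bytes that form complete words
    $out = '';
    if ($whole > 0) {
        // "N" is an unsigned 32-bit big-endian integer, so the bit order
        // matches the order of the bytes in the file.
        foreach (unpack('N*', substr($s, 0, $whole)) as $word) {
            $out .= sprintf('%032b', $word);
        }
    }
    // Convert the remaining 0 to 3 bytes one by one.
    return $out . toBitsBytewise((string) substr($s, $whole));
}
```

Whether this is actually faster than the byte-wise loop is worth measuring on the real files before relying on it.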