php - Piping input to perl process automatically decodes url-encoded string

I'm using proc_open to pipe some text over to a perl script for faster processing. The text includes url-encoded strings as well as literal spaces. When a url-encoded space appears in the raw text, it seems to be decoded into a literal space by the time it reaches the perl script. In the perl script, I rely on the positioning of the literal spaces, so these unwanted spaces mess up my output.
Why is this happening, and is there a way to prevent it from happening?
Relevant code snippet:
$descriptorspec = array(
0 => array("pipe", "r"),
1 => array("pipe", "w"),
);
$cmd = "perl script.pl";
$process = proc_open($cmd, $descriptorspec, $pipes);
$output = "";
if (is_resource($process)) {
fwrite($pipes[0], $raw_string);
fclose($pipes[0]);
while (!feof($pipes[1])) {
$output .= fgets($pipes[1]);
}
fclose($pipes[1]);
proc_close($process);
}
and a line of raw text input looks something like this:
key url\tvalue1\tvalue2\tvalue3
I might be able to avoid the issue by changing the format of my input, but for various reasons that is undesirable, and it circumvents rather than solves the key issue.
Furthermore, I know that the issue is occurring somewhere between the PHP script and the Perl script, because I have examined the raw text (with an echo) immediately before writing it to the Perl script's STDIN pipe, and I have tested my Perl script directly on URL-encoded raw strings.
I've now added the perl script below. It basically boils down to a mini map-reduce job.
use strict;
my %rows;
# First pass: sum the numeric columns for every full key.
while (<STDIN>) {
    chomp;
    my @line = split(/\t/);
    my $key = $line[0];
    if (defined $rows{$key}) {
        for my $i (1..$#line) {
            $rows{$key}->[$i-1] += $line[$i];
        }
    } else {
        my @new_row;
        for my $i (1..$#line) {
            push(@new_row, $line[$i]);
        }
        $rows{$key} = [ @new_row ];
    }
}
my %newrows;
# Second pass: drop the last space-separated token of each key and count
# how many full keys had a positive total in each column.
for my $key (keys %rows) {
    my @temparray = split(/ /, $key);
    pop(@temparray);
    my $newkey = join(" ", @temparray);
    if (defined $newrows{$newkey}) {
        for my $i (0..$#{ $rows{$key}}) {
            $newrows{$newkey}->[$i] += $rows{$key}->[$i] > 0 ? 1 : 0;
        }
    } else {
        my @new_row;
        for my $i (0..$#{ $rows{$key}}) {
            push(@new_row, $rows{$key}->[$i] > 0 ? 1 : 0);
        }
        $newrows{$newkey} = [ @new_row ];
    }
}
for my $key (keys %newrows) {
    print "$key\t", join("\t", @{ $newrows{$key} }), "\n";
}

Note to self: always check your assumptions. It turns out that somewhere in my hundreds of millions of lines of input there were, in fact, literal spaces where there should have been url-encoded spaces. It took a while to find them, since there were hundreds of millions of correct literal spaces, but there they were.
Sorry guys!
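For anyone who wants to rule out the pipe itself before hunting through the data, here is a minimal sketch of a round-trip check. It assumes PHP 7.4+ run from the command line (so that PHP_BINARY points at the CLI binary and proc_open accepts an array command); the helper name roundTripThroughPipe and the sample line are made up for illustration. The child process simply copies STDIN to STDOUT, so any difference between input and output would have to come from the pipe.

```php
<?php
// Hypothetical helper: send a string through a child process's STDIN and
// read back whatever the child writes to STDOUT.
function roundTripThroughPipe(string $input): string
{
    $descriptorspec = [
        0 => ['pipe', 'r'], // child's STDIN
        1 => ['pipe', 'w'], // child's STDOUT
    ];
    // The child is a second PHP process that echoes STDIN unchanged.
    $cmd = [PHP_BINARY, '-r', 'echo stream_get_contents(STDIN);'];
    $process = proc_open($cmd, $descriptorspec, $pipes);
    if (!is_resource($process)) {
        throw new RuntimeException('Could not start child process');
    }
    fwrite($pipes[0], $input);
    fclose($pipes[0]);
    $output = stream_get_contents($pipes[1]);
    fclose($pipes[1]);
    proc_close($process);
    return $output;
}

// A sample line with a URL-encoded space in the key and a literal space after it.
$line = "key%20one url\t1\t2\t3\n";
var_dump(roundTripThroughPipe($line) === $line);
```

If the comparison holds, the encoded space reached the child byte for byte, and the stray spaces are more likely to be in the input data, as turned out to be the case here.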

Related

PHP - output text one character at a time

Trying to create an old-school terminal text effect (one character at a time with a small delay) in PHP, without JavaScript if possible.
All text written to the screen should go through this function.
I was thinking of something like a buffer you can dynamically append text to, to make sure it finishes one line before starting on the next.
Not sure how to proceed, or if it's even possible without using JavaScript.

Inefficient, but to achieve the goal you set (without javascript), you could use PHP's output buffering to achieve a small delay between characters output:
<?php
ob_start();
$buffer = str_repeat(" ", 4096); // fill the buffer
$string = 'Hello World';
$len = strlen($string);
$sleep = 0.5; // sleep half a second between output chars
for($i=0; $i < $len; $i++) {
echo $buffer . $string[$i];
ob_flush();
flush();
usleep($sleep * 1000000);
}

Editing a Python script from PHP

I want to edit the code of a Python script from PHP, to change a formula and initial value for one of my projects.
The Python script:
import numpy as np
import scipy as sp
import math
from scipy.integrate import odeint
import matplotlib.pyplot as plt
def g(y, x):
    y0 = y[0]
    y1 = x #formula
    return y1
init = 90#formula2
x= np.linspace(0,1,100)
sol=odeint(g, init, x)
plt.legend()
plt.plot(x, sol[:,0], color='b')
plt.show()
PHP:
<?php
if (isset($_POST['equation1'])&&isset($_POST['constant'])) {
$eqn1 = $_POST['equation1'];
//$equation = $_POST['equation'];
$eqn2 = $_POST['constant'];
//str_replace("^", "**", $equation);
$filename ='ode3.py';
$pgm = "C:\\wamp\\www\\working\\".$filename;
function get_string_between($string, $start, $end) {
$string = " ".$string;
$ini = strpos($string,$start);
if ($ini == 0) return "";
$ini += strlen($start);
$len = strpos($string,$end,$ini) - $ini;
return substr($string,$ini,$len);
}
//write the equation in python code
$pgm_file = $pgm;
$myfile = fopen($pgm_file, "r") or die("Unable to open file!");
$data = fread($myfile,filesize($pgm_file));
$parsed = get_string_between($data, "y1 = ", " #formula");
$parsed2 = get_string_between($data, "init = ", "#formula2");
$datanew1 = str_replace($parsed, $eqn1, $data);
$datanew2 = str_replace($parsed2,$eqn2, $datanew1);
fclose($myfile);
$myfile = fopen($pgm_file, "w") or die("Unable to open file!");
$datacode = fwrite($myfile,$datanew2);
$pgmfile = fopen($pgm_file, "r") or die("Unable to open file!");
$pgmdata = fread($pgmfile,filesize($pgm_file));
fclose($pgmfile);
/*$demo = fopen($pgm, "r") or die("Unable to open file!");
$xxx = fread($demo,filesize($pgm));
echo $xxx;*/
$pyscript = "C:\\wamp\\www\\working\\".$filename;
$python = 'C:\\Python34\\python.exe';
$cmd = "$python $pyscript";
exec("$cmd", $output);
/* if (count($output) == 0) {
echo "error%";
} else {
echo $output[0]."%";
} */
var_dump($output);
}
?>
I want it to only replace the part of the Python script marked by #formula and #formula2, where x should be replaced by x+45 and the init value should be set to 85.
However, when I run the PHP script, every instance of x in the Python script is replaced with x+45. How can I limit it to only the part marked by #formula inside the g() function?

I think this is a poor way to approach this problem. Rather than editing the source of your Python script, you should re-engineer it so that the variable parts can be passed in as arguments when the script is called. You can then access these execution arguments using sys.argv.
#!/usr/bin/env python
import sys
# ... import numpy etc.

def g(y, x, expr):
    y0 = y[0]
    # use the Python interpreter to evaluate the expression.
    # It should be of the form 'x**2 + 3' or something - a valid Python/math
    # expression otherwise it will crash with a syntax error.
    y1 = eval(expr)
    return y1

if __name__ == '__main__':
    init = int(sys.argv[1])        # the first argument passed to the script
    expr = sys.argv[2].strip('"')  # the second argument - if it includes
                                   # whitespace, wrap it in "" so it works.
    # ... do stuff
You can then call this as a standard script using PHP's exec or shell_exec function. Pass the function expression in as arguments like so:
python /path/to/ode3.py 85 "x+45".
Please be very careful with Python's eval - it will evaluate anything you put in there. Since you are calling this from a web-facing PHP script that accepts POST arguments, you must be careful to sanitize the input. From the script you've posted, it doesn't look like you validate it, so if it is publicly accessible, it will allow arbitrary code execution with the same permissions as the PHP script. This is true whether you use eval or edit the Python file as you have been attempting.
Update: Also, the operators have different meanings in different languages. For example, in Python, the ^ operator does not mean exponent, it actually means bitwise XOR. If you want to get x^2 (x-squared), the Python syntax is x**2. If you want to be able to write x^2, you would need to write a parser. See here for more information.
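On the PHP side, the sanitizing step mentioned above could look roughly like the sketch below. The helper name buildOdeCommand and the exact whitelist are assumptions, not part of the original script: it accepts only an integer initial value and an expression made of x, digits, whitespace, parentheses, decimal points and arithmetic operators, then quotes every argument with escapeshellarg. This narrows what reaches eval, but it is not a complete defence (an expression such as a very large power can still tie up the Python process).

```php
<?php
// Hypothetical helper: validate untrusted input and build the command line.
function buildOdeCommand(string $python, string $script, string $init, string $expression): string
{
    // The initial value must be a plain (optionally negative) integer.
    if (!preg_match('/\A-?\d+\z/', $init)) {
        throw new InvalidArgumentException('init must be an integer');
    }
    // Whitelist: only x, digits, whitespace, ( ) . and + - * / are allowed,
    // so names such as __import__ or os can never appear in the expression.
    if (!preg_match('/\A[x0-9+\-*\/().\s]+\z/', $expression)) {
        throw new InvalidArgumentException('expression contains forbidden characters');
    }
    // Quote each part so the shell treats it as a single literal argument.
    return implode(' ', array_map('escapeshellarg', [$python, $script, $init, $expression]));
}
```

The result can then be passed to exec(), for example exec(buildOdeCommand($python, $pyscript, $_POST['constant'], $_POST['equation1']), $output), with the call wrapped in a try/catch that rejects the request when validation fails.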

How can Detect UTF 16 decoding

I have to read a file and identify its encoding. I used mb_detect_encoding() to detect UTF-16, but I am getting the wrong result. How can I detect UTF-16 encoding in PHP?
The PHP file is UTF-16 and my header is windows-1256 (because of Arabic):
header('Content-Type: text/html; charset=windows-1256');
$delimiter = '\t';
$f= file("$fileName");
foreach($f as $dailystatmet)
{
$transactionData = str_replace("'", '', $dailystatmet);
preg_match_all("/('?\d+,\d+\.\d+)?([a-zA-Z]|[0-9]|)[^".$delimiter."]+/",$transactionData,$matches);
array_push($matchesz, $matches[0]);
}
$searchKeywords = array ("apple", "orange", 'mango');
$rowCount = count($matchesz);
for ($row = 1; $row <= $rowCount; $row++) {
$myRow = $row;
$cell = $matchesz[$row];
foreach ($searchKeywords as $val) {
if (partialArraySearch($cell[$c_description], $val)) {
}
}}
function partialArraySearch($cell, $searchword)
{
if (strpos(strtoupper($cell), strtoupper($searchword)) !== false) {
return true;
}
return false;
}
The code above searches within the uploaded file. If the file is UTF-8, the match works, but with the same file in UTF-16 or UTF-32 I get no result.
So how can I get the encoding of the uploaded file?

If someone is still searching for a solution, I have hacked something like this in the "voku/portable-utf8" repo on github. => "UTF8::file_get_contents()"
The "file_get_contents"-wrapper will detect the current encoding via "UTF8::str_detect_encoding()" and will convert the content of the file automatically into UTF-8.
e.g.: from the PHPUnit tests ...
$testString = UTF8::file_get_contents(dirname(__FILE__) . '/test1Utf16pe.txt');
$this->assertContains('<p>Today’s Internet users are not the same users who were online a decade ago. There are better connections.', $testString);
$testString = UTF8::file_get_contents(dirname(__FILE__) . '/test1Utf16le.txt');
$this->assertContains('<p>Today’s Internet users are not the same users who were online a decade ago. There are better connections.', $testString);

My solution is to detect UTF-16 and convert the content to Latin-9 (ISO-8859-15):
preg_match_all('/\x00/',$content,$count);
if(count($count[0])/strlen($content)>0.4) {
    $content = iconv('UTF-16', 'ISO-8859-15', $content);
}
In other words, I check the frequency of the NUL byte (hexadecimal 00). If it makes up more than 0.4 of the content, the text probably consists of basic Latin characters encoded in UTF-16, which uses two bytes per character, one of which is usually 00 for that range.
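A slightly extended sketch of the same idea is shown below. It first looks for a byte-order mark and only then falls back to the NUL-byte frequency heuristic, using the position of the NUL bytes to guess the byte order. The function name guessUtf16 is made up for illustration, and the 0.4 threshold is taken from the answer above. Note the limits: the heuristic only works for mostly-ASCII text (Arabic text in UTF-16 contains few NUL bytes), and a UTF-32LE file also starts with the bytes FF FE.

```php
<?php
// Hypothetical helper: return 'UTF-16LE', 'UTF-16BE', or null if the
// content does not look like UTF-16.
function guessUtf16(string $content): ?string
{
    // A byte-order mark is the most reliable signal when it is present.
    if (strncmp($content, "\xFF\xFE", 2) === 0) {
        return 'UTF-16LE';
    }
    if (strncmp($content, "\xFE\xFF", 2) === 0) {
        return 'UTF-16BE';
    }
    $length = strlen($content);
    if ($length < 2) {
        return null;
    }
    // Count NUL bytes separately at even and odd byte offsets.
    $even = 0;
    $odd = 0;
    for ($i = 0; $i < $length; $i++) {
        if ($content[$i] === "\0") {
            if ($i % 2 === 0) {
                $even++;
            } else {
                $odd++;
            }
        }
    }
    // Heuristic threshold from the answer above.
    if (($even + $odd) / $length <= 0.4) {
        return null;
    }
    // Little-endian stores the low byte first, so the NUL high byte of an
    // ASCII character lands on odd offsets.
    return $odd > $even ? 'UTF-16LE' : 'UTF-16BE';
}
```

Once an encoding is returned, the content can be converted before searching, for example with iconv($encoding, 'UTF-8', $content).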

Clear PHP CLI output

I'm trying to get a "live" progress indicator working on my php CLI app. Rather than outputting as
1Done
2Done
3Done
I would rather it cleared and just showed the latest result. system("command \C CLS") doesn't work. Nor does ob_flush(), flush() or anything else that I've found.
I'm running Windows 7 64-bit Ultimate. I noticed the command line outputs in real time, which was unexpected. Everyone warned me that output wouldn't... but it does... a 64-bit perk?
Cheers for the help!
I want to avoid echoing 24 new lines if I can.

Try outputting a line of text and terminating it with "\r" instead of "\n".
The "\n" character is a line-feed which goes to the next line, but "\r" is just a return that sends the cursor back to position 0 on the same line.
So you can:
echo "1Done\r";
echo "2Done\r";
echo "3Done\r";
etc.
Make sure to output some spaces before the "\r" to clear the previous contents of the line.
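A minimal sketch of that padding step, assuming a fixed line width of 40 columns (the width and the helper name statusLine are arbitrary choices for illustration): each message is padded with spaces so that a shorter message fully covers a longer previous one before the cursor returns to column 0.

```php
<?php
// Hypothetical helper: pad the message to a fixed width, then return the
// cursor to the start of the line without advancing to the next one.
function statusLine(string $message, int $width = 40): string
{
    return str_pad($message, $width) . "\r";
}

foreach (['Processing 100 items', 'Done'] as $message) {
    echo statusLine($message);
    usleep(500000); // half a second, so the overwrite is visible
}
echo "\n"; // move off the status line when finished
```

Messages longer than the chosen width are left unpadded by str_pad, so the width should be at least as long as the longest message.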
[Edit] Optional: Interested in some history & background? Wikipedia has good articles on "\n" (line feed) and "\r" (carriage return)

I came across this while searching for a multi line solution to this problem. This is what I eventually came up with. You can use Ansi Escape commands. http://www.inwap.com/pdp10/ansicode.txt
<?php
function replaceOut($str)
{
$numNewLines = substr_count($str, "\n");
echo chr(27) . "[0G"; // Set cursor to first column
echo $str;
echo chr(27) . "[" . $numNewLines ."A"; // Set cursor up x lines
}
while (true) {
replaceOut("First Ln\nTime: " . time() . "\nThird Ln");
sleep(1);
}
?>

I recently wrote a function that will also keep track of the number of lines it last output, so you can feed it arbitrary string lengths, with newlines, and it will replace the last output with the current one.
With an array of strings:
$lines = array(
'This is a pretty short line',
'This line is slightly longer because it has more characters (i suck at lorem)',
'This line is really long, but I an not going to type, I am just going to hit the keyboard... LJK gkjg gyu g uyguyg G jk GJHG jh gljg ljgLJg lgJLG ljgjlgLK Gljgljgljg lgLKJgkglkg lHGL KgglhG jh',
"This line has newline characters\nAnd because of that\nWill span multiple lines without being too long",
"one\nmore\nwith\nnewlines",
'This line is really long, but I an not going to type, I am just going to hit the keyboard... LJK gkjg gyu g uyguyg G jk GJHG jh gljg ljgLJg lgJLG ljgjlgLK Gljgljgljg lgLKJgkglkg lHGL KgglhG jh',
"This line has newline characters\nAnd because of that\nWill span multiple lines without being too long",
'This is a pretty short line',
);
One can use the following function:
function replaceable_echo($message, $force_clear_lines = NULL) {
static $last_lines = 0;
if(!is_null($force_clear_lines)) {
$last_lines = $force_clear_lines;
}
$term_width = exec('tput cols', $toss, $status);
if($status) {
$term_width = 64; // Arbitrary fall-back term width.
}
$line_count = 0;
foreach(explode("\n", $message) as $line) {
$line_count += count(str_split($line, $term_width));
}
// Erasure MAGIC: Clear as many lines as the last output had.
for($i = 0; $i < $last_lines; $i++) {
// Return to the beginning of the line
echo "\r";
// Erase to the end of the line
echo "\033[K";
// Move cursor Up a line
echo "\033[1A";
// Return to the beginning of the line
echo "\r";
// Erase to the end of the line
echo "\033[K";
// Return to the beginning of the line
echo "\r";
// Can be consolidated into
// echo "\r\033[K\033[1A\r\033[K\r";
}
$last_lines = $line_count;
echo $message."\n";
}
In a loop:
foreach($lines as $line) {
replaceable_echo($line);
sleep(1);
}
And all lines replace each other.
The name of the function could use some work, just whipped it up, but the idea is sound. Feed it an (int) as the second param and it will replace that many lines above instead. This would be useful if you were printing after other output, and you didn't want to replace the wrong number of lines (or any, give it 0).
Dunno, seemed like a good solution to me.
I make sure to echo the ending newline so that it allows the user to still use echo/print_r without killing the line (use the override to not delete such outputs), and the command prompt will come back in the correct place.

i know the question isn't strictly about how to clear a SINGLE LINE in PHP, but this is the top google result for "clear line cli php", so here is how to clear a single line:
function clearLine()
{
echo "\033[2K\r";
}

function clearTerminal () {
DIRECTORY_SEPARATOR === '\\' ? popen('cls', 'w') : exec('clear');
}
Tested on Windows 7 with PHP 7. The Linux branch should work, according to other users' reports.

something like this :
for ($i = 0; $i <= 100; $i++) {
echo "Loading... {$i}%\r";
usleep(10000);
}

Use this command for clear cli:
echo chr(27).chr(91).'H'.chr(27).chr(91).'J'; //^[H^[J

Console functions are platform dependent and as such PHP has no built-in functions to deal with this. system and other similar functions won't work in this case because PHP captures the output of these programs and prints/returns them. What PHP prints goes to standard output and not directly to the console, so "printing" the output of cls won't work.

<?php
error_reporting(E_ERROR | E_WARNING | E_PARSE);
function bufferout($newline, $buffer=null){
$count = strlen(rtrim($buffer));
$buffer = $newline;
if(($whilespace = $count-strlen($buffer))>=1){
$buffer .= str_repeat(" ", $whilespace);
}
return $buffer."\r";
};
$start = "abcdefghijklmnopqrstuvwxyz0123456789";
$i = strlen($start);
while ($i >= 0){
$new = substr($start, 0, $i);
if($old){
echo $old = bufferout($new, $old);
}else{
echo $old = bufferout($new);
}
sleep(1);
$i--;
}
?>
A simple implementation of @dkamins's answer. It works well. It's a bit hack-ish, but it does the job. It won't work across multiple lines.

function (int $count = 1) {
foreach (range(1,$count) as $value){
echo "\r\x1b[K"; // remove this line
echo "\033[1A\033[K"; // cursor back
}
}
See the full example here

Unfortunately, PHP 8.0.2 does not have a built-in function for this. However, if you just want to clear the console, try print("\033[2J\033[;H"); or use popen('cls', 'w');
It works in PHP 8.0.2 on Windows 10. It is equivalent to calling system('cls') in C.

Tried some of solutions from answers:
<?php
...
$messages = [
'11111',
'2222',
'333',
'44',
'5',
];
$endlines = [
"\r",
"\033[2K\r",
"\r\033[K\033[1A\r\033[K\r",
chr(27).chr(91).'H'.chr(27).chr(91).'J',
];
foreach ($endlines as $i=>$end) {
foreach ($messages as $msg) {
output()->write("$i. ");
output()->write($msg);
sleep(1);
output()->write($end);
}
}
Of these, \033[2K\r seems to work correctly.

Problem reading files greater than 1GB with XMLReader

Is there a maximum file size the XMLReader can handle?
I'm trying to process an XML feed about 3GB large. There are certainly no PHP errors as the script runs fine and successfully loads to the database after it's been run.
The script also runs fine with smaller test feeds - 1GB and below. However, when processing larger feeds the script stops reading the XML File after about 1GB and continues running the rest of the script.
Has anybody experienced a similar problem? and if so how did you work around it?
Thanks in advance.

I had the same kind of problem recently and thought I'd share my experience.
It seems that the problem is in the way PHP was compiled: whether it was built with support for 64-bit file sizes/offsets or only 32-bit.
With 32 bits you can only address 4GB of data. You can find a slightly confusing but good explanation here: http://blog.mayflower.de/archives/131-Handling-large-files-without-PHP.html
I had to split my files with the Perl utility xml_split, which you can find here: http://search.cpan.org/~mirod/XML-Twig/tools/xml_split/xml_split
I used it to split my huge XML file into manageable chunks. The good thing about the tool is that it splits XML files on whole elements. Unfortunately, it's not very fast.
I needed to do this only once and it suited my needs, but I wouldn't recommend it for repeated use. After splitting, I used XMLReader on smaller files of about 1GB each.

Splitting up the file will definitely help. Other things to try...
adjust the memory_limit variable in php.ini. http://php.net/manual/en/ini.core.php
rewrite your parser using SAX -- http://php.net/manual/en/book.xml.php . This is a stream-oriented parser that doesn't need to parse the whole tree. Much more memory-efficient but slightly harder to program.
Depending on your OS, there might also be a 2gb limit on the RAM chunk that you can allocate. Very possible if you're running on a 32-bit OS.
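The SAX suggestion above could be sketched as follows, assuming PHP's xml extension is available. The function name countElements and the 64 KB chunk size are illustrative choices. The file is fed to the parser one chunk at a time, so memory use stays flat regardless of file size; whether this also avoids the 1GB cut-off described in the question is untested here.

```php
<?php
// Hypothetical helper: count elements with a given name by streaming the
// file through PHP's event-based (SAX-style) XML parser.
function countElements(string $path, string $elementName): int
{
    $count = 0;
    $parser = xml_parser_create();
    // Keep element names as written instead of upper-casing them.
    xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);
    xml_set_element_handler(
        $parser,
        function ($parser, string $name, array $attributes) use (&$count, $elementName) {
            if ($name === $elementName) {
                $count++;
            }
        },
        function ($parser, string $name) {
            // Nothing to do on closing tags in this sketch.
        }
    );
    $handle = fopen($path, 'rb');
    if ($handle === false) {
        throw new RuntimeException("Unable to open $path");
    }
    while (!feof($handle)) {
        $chunk = (string) fread($handle, 65536);
        // The third argument tells the parser when the final chunk arrives.
        if (xml_parse($parser, $chunk, feof($handle)) !== 1) {
            $error = xml_error_string(xml_get_error_code($parser));
            fclose($handle);
            throw new RuntimeException("XML error: $error");
        }
    }
    fclose($handle);
    return $count;
}
```

For a feed built from repeated elements, countElements('feed.xml', 'item') would return how many of them the parser saw, which also makes it easy to check whether parsing stopped early.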

It should be noted that PHP in general has a maximum file size. PHP has no unsigned or long integer type, so integers are capped at 2^31 - 1 (or 2^63 - 1 on 64-bit builds). This matters because PHP uses an integer for the file pointer (your position in the file as you read through), so on a 32-bit build it cannot process a file larger than 2^31 bytes.
However, this should be more than 1 gigabyte. I ran into issues with two gigabytes (as expected, since 2^31 is roughly 2 billion).
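A quick way to see which limit applies to a given PHP build is to inspect PHP_INT_SIZE and PHP_INT_MAX. The sketch below also relies on the fact that integer arithmetic which overflows in PHP silently becomes a float; the helper name fitsInNativeInt is made up for illustration.

```php
<?php
// Hypothetical helper: true if the value is still a native integer, false
// if PHP had to fall back to a float because the integer range overflowed.
function fitsInNativeInt($bytes): bool
{
    return is_int($bytes);
}

// 3 GiB, roughly the size of the feed in the question.
$threeGiB = 3 * 1024 * 1024 * 1024;

// PHP_INT_SIZE is 4 on builds with 32-bit integers and 8 with 64-bit ones.
printf("PHP_INT_SIZE=%d PHP_INT_MAX=%d\n", PHP_INT_SIZE, PHP_INT_MAX);
printf("3 GiB fits in an int: %s\n", fitsInNativeInt($threeGiB) ? 'yes' : 'no');
```

If the second line prints "no", file offsets past 2^31 - 1 cannot be represented exactly as integers on that build.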

I've run into a similar issue when parsing large documents. What I wound up doing is breaking the feed into smaller chunks using filesystem functions, then parsing those smaller chunks... So if you have a bunch of <record> tags that you are parsing, parse them out with string functions as a stream, and when you get a full record in the buffer, parse that using the xml functions... It sucks, but it works quite well (and is very memory efficient, since you only have at most 1 record in memory at any one time)...
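The chunking approach described above might look like this sketch. It assumes the records are not nested, that the closing tag never appears inside CDATA or comments, and that the tag is called record; the function name streamRecords and the chunk size are illustrative. Only the current buffer is held in memory, although the buffer will keep growing if the file contains no closing record tag at all.

```php
<?php
// Hypothetical helper: yield each complete <record>...</record> string
// from a large file, reading it in fixed-size chunks.
function streamRecords(string $path, string $tag = 'record', int $chunkSize = 65536): Generator
{
    // Match "<record" followed by whitespace or ">", so that a wrapper
    // element such as <records> is not mistaken for a record.
    $openPattern = '/<' . preg_quote($tag, '/') . '[\s>]/';
    $close = "</$tag>";
    $handle = fopen($path, 'rb');
    if ($handle === false) {
        throw new RuntimeException("Unable to open $path");
    }
    try {
        $buffer = '';
        while (!feof($handle)) {
            $buffer .= (string) fread($handle, $chunkSize);
            // Emit every record that is now complete in the buffer.
            while (($end = strpos($buffer, $close)) !== false) {
                $end += strlen($close);
                if (preg_match($openPattern, $buffer, $m, PREG_OFFSET_CAPTURE)
                    && $m[0][1] < $end) {
                    yield substr($buffer, $m[0][1], $end - $m[0][1]);
                }
                // Drop everything up to and including the closing tag.
                $buffer = substr($buffer, $end);
            }
        }
    } finally {
        fclose($handle);
    }
}
```

Each yielded string is a small, self-contained XML fragment, so it can be handed to simplexml_load_string() or a similar function inside a foreach loop over streamRecords('feed.xml').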

Do you get any errors with
libxml_use_internal_errors(true);
libxml_clear_errors();
// your parser stuff here....
$r = new XMLReader(...);
// ....
foreach( libxml_get_errors() as $err ) {
printf(". %d %s\n", $err->code, $err->message);
}
when the parser stops prematurely?

Using WindowsXP, NTFS as filesystem and php 5.3.2 there was no problem with this test script
<?php
define('SOURCEPATH', 'd:/test.xml');
if ( 0 ) {
build();
}
else {
echo 'filesize: ', number_format(filesize(SOURCEPATH)), "\n";
timing('read');
}
function timing($fn) {
$start = new DateTime();
echo 'start: ', $start->format('Y-m-d H:i:s'), "\n";
$fn();
$end = new DateTime();
echo 'end: ', $start->format('Y-m-d H:i:s'), "\n";
echo 'diff: ', $end->diff($start)->format('%I:%S'), "\n";
}
function read() {
$cnt = 0;
$r = new XMLReader;
$r->open(SOURCEPATH);
while( $r->read() ) {
if ( XMLReader::ELEMENT === $r->nodeType ) {
if ( 0===++$cnt%500000 ) {
echo '.';
}
}
}
echo "\n#elements: ", $cnt, "\n";
}
function build() {
$fp = fopen(SOURCEPATH, 'wb');
$s = '<catalogue>';
//for($i = 0; $i < 500000; $i++) {
for($i = 0; $i < 60000000; $i++) {
$s .= sprintf('<item>%010d</item>', $i);
if ( 0===$i%100000 ) {
fwrite($fp, $s);
$s = '';
echo $i/100000, ' ';
}
}
$s .= '</catalogue>';
fwrite($fp, $s);
fflush($fp);
fclose($fp);
}
output:
filesize: 1,380,000,023
start: 2010-08-07 09:43:31
........................................................................................................................
#elements: 60000001
end: 2010-08-07 09:43:31
diff: 07:31
(as you can see I screwed up the output of the end-time but I don't want to run this script another 7+ minutes ;-))
Does this also work on your system?
As a side note: the corresponding C# test application took only 41 seconds instead of 7.5 minutes. My slow hard drive might have been the (or one) limiting factor in this case.
filesize: 1.380.000.023
start: 2010-08-07 09:55:24
........................................................................................................................
#elements: 60000001
end: 2010-08-07 09:56:05
diff: 00:41
and the source:
using System;
using System.IO;
using System.Xml;
namespace ConsoleApplication1
{
class SOTest
{
delegate void Foo();
const string sourcepath = @"d:\test.xml";
static void timing(Foo bar)
{
DateTime dtStart = DateTime.Now;
System.Console.WriteLine("start: " + dtStart.ToString("yyyy-MM-dd HH:mm:ss"));
bar();
DateTime dtEnd = DateTime.Now;
System.Console.WriteLine("end: " + dtEnd.ToString("yyyy-MM-dd HH:mm:ss"));
TimeSpan s = dtEnd.Subtract(dtStart);
System.Console.WriteLine("diff: {0:00}:{1:00}", s.Minutes, s.Seconds);
}
static void readTest()
{
XmlTextReader reader = new XmlTextReader(sourcepath);
int cnt = 0;
while (reader.Read())
{
if (XmlNodeType.Element == reader.NodeType)
{
if (0 == ++cnt % 500000)
{
System.Console.Write('.');
}
}
}
System.Console.WriteLine("\n#elements: " + cnt + "\n");
}
static void Main()
{
FileInfo f = new FileInfo(sourcepath);
System.Console.WriteLine("filesize: {0:N0}", f.Length);
timing(readTest);
return;
}
}
}
