I am writing a JSONScanner class that takes a string and scans the whole thing to construct a JSONObject. Currently I'm writing the read_string() method, which reads a string. When reading a string that contains an escaped '\', I get invalid output.
Here is my JSONScanner class
class JSONScanner {
private $in;
private $pos;
public function __construct($in) {
$this->in = $in;
$this->pos = 0;
}
#########################################################
############### Method used for debugging ###############
#########################################################
public function display() {
$this->pos = 1;
echo $this->read_string($this->get_char());
}
#########################################################
#########################################################
private function read_string($quote) {
$str = "";
while(($c = $this->get_char()) != $quote) {
if($c == '\\') {
$str .= $this->get_escaped_char();
} else {
$str .= $c;
}
}
return $str;
}
private function get_escaped_char() {
$c = $this->get_char();
switch($c) {
case 'n':
return "\n";
case 't':
return "\t";
case 'r':
return "\r";
// display the characters being escaped
case '\\':
case '\'':
case '"':
default:
return $c;
}
}
private function get_char() {
if($this->pos >= strlen($this->in)) {
return -1; // END OF INPUT
}
return substr($this->in, $this->pos++, 1);
}
}
Here is my running code
$str = '{"a\\":1,"b":2}';
$jscan = new JSONScanner($str);
$jscan->display();
With the above string, I'm getting
a":1,
However when I try
$str = '{"a\\\":1,"b":2}';
$jscan = new JSONScanner($str);
$jscan->display();
I get what I need, which is
a\
Why am I needing to put 2 backslashes to escape 1 backslash?
EDIT:
I tried the same JSON string with json_decode, and it gave me the same results: with 2 backslashes I get nothing, but with 3 backslashes I get a\. Why is that? Doesn't escaping a backslash take 2 consecutive ones (\\)?
$str = '{"a\\":1,"b":2}';
This is a PHP string literal, which has its own escaping rules. The actual string you're representing with the above is:
{"a\":1,"b":2}
If you want to represent one backslash in a PHP string literal, you need to write two backslashes. So the correct string representation for what you want is:
$str = '{"a\\\\":1,"b":2}';
It happens to work with three backslashes, because \\ becomes one \ and the next \ isn't followed by any special character, so it by itself also represents a single backslash.
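To make the difference visible, here is a small sketch that compares the three ways of writing the literal. The variable names are only for illustration; it assumes the json extension is available.

```php
<?php
// The same literal written with two, three and four backslashes (single quotes).
$two   = '{"a\\":1,"b":2}';    // runtime value: {"a\":1,"b":2}
$three = '{"a\\\":1,"b":2}';   // runtime value: {"a\\":1,"b":2}
$four  = '{"a\\\\":1,"b":2}';  // runtime value: {"a\\":1,"b":2}

echo $two, "\n", $three, "\n", $four, "\n";

// Three and four backslashes end up as the identical string.
var_dump($three === $four);

// With two backslashes the JSON parser sees \" (an escaped quote), so the
// document is malformed and json_decode returns NULL.
var_dump(json_decode($two, true));

// With four backslashes the key decodes to a single backslash after the "a".
var_dump(json_decode($four, true));
```

Writing four backslashes is the unambiguous form, since it does not rely on PHP leaving an unrecognised escape sequence untouched.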
Related
str_getcsv () has some odd behaviour. It removes all characters that match the enclosing character, instead of just the enclosing ones. I'm trying to parse a CSV string (contents of an uploaded file) in two steps:
split the CSV string into an array of lines
split each line into an array of fields
with this code:
$whole_file_string = file_get_contents($file);
$array_of_lines = str_getcsv ($whole_file_string, "\n", "\""); // step 1. split csv into lines
foreach ($array_of_lines as $one_line_string) {
$splitted_line = str_getcsv ($one_line_string, ",", "\""); // step 2. split line into fields
};
In the code example, nothing is done with $splitted_line, to keep the example clear.
Then I feed this script a file with the following contents: "text,with,delimiter",secondfield.
When step 1 is performed the first (and only) element of $array_of_lines is text,with,delimiter,secondfield. So when step 2 is performed, it splits the line into 4 fields, but that needs to be 2.
I can't use fgetcsv() because some string conversion is done (checking BOM, converting encoding accordingly and stuff like that) after reading the file and before splitting it into lines in step 1.
I'm at the point of writing my own string parser (which isn't that complicated for CSV format), but before I do so I want to make sure that that's the best approach. I'm a bit disappointed that the PHP functions are letting me down on this simple (and I guess quite common) use case: processing an uploaded csv file with varying encoding.
Any tips?
You should only be calling str_getcsv() on one line at a time, not the whole file.
$array_of_lines = file($file, FILE_IGNORE_NEW_LINES); // split CSV into lines
foreach($array_of_lines as $one_line_string) {
$splitted_line = str_getcsv($one_line_string, ",", "\""); // split line into fields
}
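Since the question mentions that the file contents are already in a string (after the encoding conversion), file() may not be usable directly. A minimal sketch of the same idea on an in-memory string follows. The helper name csv_lines_to_rows is hypothetical, and the sketch assumes that no quoted field contains a line break; if that can happen, splitting on line endings first will not work.

```php
<?php
// Hypothetical helper: split an in-memory CSV string into rows of fields.
// Assumption: no quoted field contains a line break.
function csv_lines_to_rows($csv, $delimiter = ',', $enclosure = '"')
{
    $rows = array();
    // Accept \r\n, \n and \r as line endings and drop empty lines.
    $lines = preg_split('/\r\n|\n|\r/', $csv, -1, PREG_SPLIT_NO_EMPTY);
    foreach ($lines as $line) {
        // The escape character is passed explicitly so newer PHP versions do not warn.
        $rows[] = str_getcsv($line, $delimiter, $enclosure, '\\');
    }
    return $rows;
}

$rows = csv_lines_to_rows("\"text,with,delimiter\",secondfield\nthird,fourth\n");
print_r($rows);
```

The first row comes back as two fields, because str_getcsv only ever sees one line at a time and its enclosure handling is applied exactly once.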
Here's my own CSV parser, fully compliant with IETF rfc 4180.
I'm curious to know if this can be done in regular expressions, those are not my forte.
/**
* Parse a string according to CSV format (https://tools.ietf.org/html/rfc4180), with a variable delimiter (default ,).
* @param string $string String to be parsed as CSV
* @param string $delimiter Character to be used as field delimiter
* @param bool $line_mode True to parse on line level first (default), false to parse on field level
* @return array Array with, for each line, an array of CSV field values
*/
function csv_parse ($string, $delimiter = ",", $line_mode = true) {
// This function parses on line-level first ($line_mode = true) and calls itself recursively to parse each line on field-level ($line_mode = false).
// when in line mode, the delimiter is eol (\n, \r\n and \n\r).
// when in field mode, the delimiter is the passed $delimiter.
$delimiter = substr ($delimiter,0,1); // delimiter is one character
$length = strlen ($string);
$parsed_array = array();
$end_of_line_state = false;
$enclosed_state = false;
$i = 0;
$field = "";
do {
switch (true) {
case (!$enclosed_state && $end_of_line_state && ($string[$i] == "\r" && $string[$i-1] == "\n")) :
case (!$enclosed_state && $end_of_line_state && ($string[$i] == "\n" && $string[$i-1] == "\r")) :
// ...found second character of eol (\r\n or \n\r). Ignore
$end_of_line_state = false;
break;
case (!$enclosed_state && !$end_of_line_state && ($string[$i] == "\n" || $string[$i] == "\r")) :
// ... found first character of eol (\n, \r\n or \n\r)
$end_of_line_state = true; // eol can be two characters, so prepare for the second
if ($field != "") { // ignore empty lines. Prohibited in csv
$parsed_array [] = csv_parse ($field, $delimiter, false); // recursive call to parse on field-level. Flush result
};
$field = ""; // prepare for next one
break;
case (!$enclosed_state && $string[$i] == $delimiter && !$line_mode) :
// ...delimiter found
$parsed_array [] = $field; // flush field as new array element
$field = ""; // prepare for next one
break;
case ($string[$i] == "\"") :
// ...encloser found
if ($enclosed_state) {
if ($i + 1 < $length && $string[$i+1] == "\"") { // look ahead only when a next character exists
// ... escaped " found
if (!$line_mode) {
$field .= "\""; // when parsing fieldlevel, only " is part of the line
} else {
$field .= "\"\""; // when parsing line level, the escaping " is also part of the line
};
$i++;
} else {
// ...closing encloser found
$enclosed_state = false;
if ($line_mode) {
$field .= $string[$i]; // when parsing line level, the enclosing " are part of the line
};
};
} else {
// ... opening encloser found
$enclosed_state = true;
if ($line_mode) {
$field .= $string[$i]; // when parsing line level, the enclosing " are part of the line
};
};
break;
default:
// ...regular character found
$field .= $string[$i];
};
$i++;
if ($i >= $length) { // end of string
if ($line_mode) {
$parsed_array [] = csv_parse ($field, $delimiter, false); // recursive call to parse on field-level. Flush result.
} else {
$parsed_array [] = $field; // flush last field
};
};
} while ($i < $length);
return $parsed_array;
};
Hm. Strange. I'm fiddling around a bit and I discovered something interesting:
$a = str_getcsv("\"#ne piece, #f text\"", "\n");
$b = str_getcsv("\"#ne piece, #f text\"", "\n","\"");
$c = str_getcsv("\"#ne piece, #f text\"", "\n","#");
echo $a[0]; // #ne piece, #f text
echo $b[0]; // #ne piece, #f text
echo $c[0]; // "#ne piece, #f text"
So passing a # as the enclosing character instead of " (or leaving it out and using the default, which is " as well) is semantically bollocks, but it does the job. It leaves the enclosing " around the field, so if you then call str_getcsv($c[0], ",") it results in one value, as it should. And if you have a field enclosed in #, it leaves it untouched. I tested that.
It is clear that str_getcsv has its flaws. On splitting to lines it shouldn't strip quotes, just like it doesn't strip # when that's the enclosing parameter. But sadly it does and therefore isn't CSV compliant.
Problem:
I'm looking for a PHP function to easily and efficiently normalise CSV content in a string (not in a file). I have made a function for that and provide it in an answer, because it is a possible solution. Unfortunately it doesn't work when the separator is included in incoming string values.
Can anyone provide a better solution?
Why not use fputcsv / fgetcsv?
Because:
it requires at least PHP 5.1.0 (which is sometimes not available)
it can only read from files, but not from a string. even though, sometimes the input is not a file (eg. if you fetch the CSV from an email)
putting the content into a temporary file might be unavailable due to security policies.
Why / what kind of normalisation?
Normalise in a way, that the encloser encloses every field. Because the encloser can be optional and different per line and per field. This can happen if one is implementing unclean/incomplete specifications and/or using CSV content from different sources/programs/developers.
Example function call:
$csvContent = "'a a',\"b\",c,1, 2 ,3 \n a a,'bb',cc, 1, 2, 3 ";
echo "BEFORE:\n$csvContent\n";
normaliseCSV($csvContent);
echo "AFTER:\n$csvContent\n";
Output:
BEFORE:
'a a',"b",c,1, 2 ,3
a a,'bb',cc, 1, 2, 3
AFTER:
"a a","b","c","1","2","3"
"a a","bb","cc","1","2","3"
To specifically address your concern regarding f*csv working only with files:
Since PHP 5.3 there's str_getcsv.
For at least PHP >= 5.1 (and I really hope that's the oldest you'll have to deal with these days), you can use stream wrappers:
$buffer = fopen('php://memory', 'r+');
fwrite($buffer, $string);
rewind($buffer);
fgetcsv($buffer) ..
Or obviously the reverse if you want to use fputcsv.
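The stream-wrapper approach above could be wrapped up roughly like this. The helper name csv_string_to_rows is hypothetical, and the explicit escape argument to fgetcsv is an assumption that needs PHP 5.3 or later; on older versions it would have to be dropped.

```php
<?php
// Hypothetical helper: parse a CSV string with fgetcsv via an in-memory stream.
function csv_string_to_rows($string, $delimiter = ',', $enclosure = '"')
{
    $buffer = fopen('php://memory', 'r+');
    fwrite($buffer, $string);
    rewind($buffer);

    $rows = array();
    // fgetcsv returns false at end of input; a length of 0 means "no line length limit".
    while (($row = fgetcsv($buffer, 0, $delimiter, $enclosure, '\\')) !== false) {
        $rows[] = $row;
    }
    fclose($buffer);
    return $rows;
}

// The second record contains a quoted field with an embedded line break.
$rows = csv_string_to_rows("a,\"b,c\"\n\"multi\nline\",d\n");
print_r($rows);
```

Because fgetcsv reads from the stream itself, it can keep a quoted field together across a line break, which a split-into-lines-first approach cannot do.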
This is a possible solution. But it doesn't consider the case that the separator (,) might be included in incoming strings.
function normaliseCSV(&$csv,$lineseperator = "\n", $fieldseperator = ',', $encloser = '"')
{
$csvArray = explode ($lineseperator,$csv);
foreach ($csvArray as &$line)
{
$lineArray = explode ($fieldseperator,$line);
foreach ($lineArray as &$field)
{
$field = $encloser.trim($field,"\0\t\n\x0B\r \"'").$encloser;
}
$line = implode ($fieldseperator,$lineArray);
}
$csv = implode ($lineseperator,$csvArray);
}
It is a simple chain of explode -> explode -> trim -> implode -> implode .
Although I agree with @deceze that you could expect at least 5.1 these days, I'm sure there are some internal company servers somewhere whose owners don't want to update.
I altered your method to be able to use field and line separators between double quotes, or in your case the $encloser value.
<?php
/*
In regards to the specs on http://tools.ietf.org/html/rfc4180 I use the following rules:
- "Fields containing line breaks (CRLF), double quotes, and commas should be enclosed in double-quotes."
- "If double-quotes are used to enclose fields, then a double-quote appearing inside a field must be escaped by preceding it with another double quote."
Exception:
Even though the specs says use double quotes, I 'm using your $encloser variable
*/
echo normaliseCSV('a,b,\'c\',"d,e","f","g""h""i","""j"""' . "\n" . "\"k\nl\nm\"");
function normaliseCSV($csv,$lineseperator = "\n", $fieldseperator = ',', $encloser = '"')
{
//We need 3 temporary replacement values:
//escaped (doubled) encloser, line separator, field separator
$keys = array();
while (count($keys)<3) {
$tmp = "##".md5(rand().rand().microtime())."##";
if (strpos($csv, $tmp)===false) {
$keys[] = $tmp;
}
}
//first we exchange "" (double $encloser) and """ to make sure its not exploded
$csv = str_replace($encloser.$encloser.$encloser, $keys[0], $csv);
$csv = str_replace($encloser.$encloser, $keys[0], $csv);
//Explode on $encloser
//Every odd index is within quotes
//Exchange line and field seperators for something not used.
$content = explode($encloser,$csv);
$len = count($content);
if ($len>1) {
for ($x=1;$x<$len;$x=$x+2) {
$content[$x] = str_replace($lineseperator,$keys[1], $content[$x]);
$content[$x] = str_replace($fieldseperator,$keys[2], $content[$x]);
}
}
$csv = implode('',$content);
$csvArray = explode ($lineseperator,$csv);
foreach ($csvArray as &$line)
{
$lineArray = explode ($fieldseperator,$line);
foreach ($lineArray as &$field)
{
$val = trim($field,"\0\t\n\x0B\r '");
//put back the exchanged values
$val = str_replace($keys[0],$encloser.$encloser,$val);
$val = str_replace($keys[1],$lineseperator,$val);
$val = str_replace($keys[2],$fieldseperator,$val);
$val = $encloser.$val.$encloser;
$field = $val;
}
$line = implode ($fieldseperator,$lineArray);
}
$csv = implode ($lineseperator,$csvArray);
return $csv;
}
?>
Output would be:
"a","b","c","d,e","f","g""h""i","""j"""
"k
l
m"
When I first read this question I wasn't sure whether it should be solved at all, since <5.1 environments should have gone extinct a long time ago. Despite that, it is a hell of a question to solve, so we should think about which approach to take, and my guess is a char-by-char examination.
I have separated the logic into three main scenarios:
A: CHAR is a separator
B: CHAR is a quotation mark
C: CHAR is a value
The result is this weapon class (including a log) for our arsenal:
<?php
Class CSVParser
{
#basic requirements
public $input;
public $separator;
public $currentQuote;
public $insideQuote;
public $result;
public $field;
public $quotation = array();
public $parsedArray = array();
# for logging purposes only
public $logging = TRUE;
public $log = array();
function __construct($input, $separator, $quotation=array())
{
$this->separator = $separator;
$this->input = $input;
$this->quotation = $quotation;
}
/**
* The main idea is to go through the string char by char and analyze it;
* when a complete field is detected, it is quoted accordingly and added to an array
*/
public function parse()
{
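for($i = 0; $i < strlen($this->input); $i++){
$this->processStream($i);
}
// flush a trailing unquoted field, which no separator or quote would otherwise save
if(!is_null($this->field)){
$this->saveField($this->field);
$this->field = NULL;
}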
foreach($this->parsedArray as $value)
{
if(!is_null($value))
$this->result .= '"'.addslashes($value).'",';
}
return rtrim($this->result, ',');
}
private function processStream($i)
{
#A case (its a separator)
if($this->input[$i]===$this->separator){
$this->log("A", $this->input[$i]);
if($this->insideQuote){
$this->field .= $this->input[$i];
}else
{
$this->saveField($this->field);
$this->field = NULL;
}
}
#B case (it's a quote)
if(in_array($this->input[$i], $this->quotation)){
$this->log("B", $this->input[$i]);
if(!$this->insideQuote){
$this->insideQuote = TRUE;
$this->currentQuote = $this->input[$i];
}
else{
if($this->currentQuote===$this->input[$i]){
$this->insideQuote = FALSE;
$this->currentQuote ='';
$this->saveField($this->field);
$this->field = NULL;
}else{
$this->field .= $this->input[$i];
}
}
}
#C case (its a value :-) )
if(!in_array($this->input[$i], array_merge(array($this->separator), $this->quotation))){
$this->log("C", $this->input[$i]);
$this->field .= $this->input[$i];
}
}
private function saveField($field)
{
$this->parsedArray[] = $field;
}
private function log($type, $value)
{
if($this->logging){
$this->log[] = "CASE ".$type." WITH ".$value." AS VALUE";
}
}
}
An example of how to use it would be:
$original = 'a,"ab",\'ab\'';
$test = new CSVParser($original, ',', array('"', "'"));
echo "<PRE>ORIGINAL: ".$original."</PRE>";
echo "<PRE>PARSED: ".$test->parse()."</PRE>";
echo "<pre>";
print_r($test->log);
echo "</pre>";
and here are the results:
ORIGINAL: a,"ab",'ab'
PARSED: "a","ab","ab"
Array
(
[0] => CASE C WITH a AS VALUE
[1] => CASE A WITH , AS VALUE
[2] => CASE B WITH " AS VALUE
[3] => CASE C WITH a AS VALUE
[4] => CASE C WITH b AS VALUE
[5] => CASE B WITH " AS VALUE
[6] => CASE A WITH , AS VALUE
[7] => CASE B WITH ' AS VALUE
[8] => CASE C WITH a AS VALUE
[9] => CASE C WITH b AS VALUE
[10] => CASE B WITH ' AS VALUE
)
I might have made mistakes since I only dedicated 25 minutes to it, so any comment will be appreciated and the answer edited.
I'm trying to write a php function that will grab the first letter from a long string ("ZT-FUL-ULT-10SF-S" would return "Z").
Some of the strings start with numbers, and for those, the function needs to return "#".
function returnFirst($rsrnum) {
substr("$rsrnum", 0, 1); {
echo "$rsrnum";
}
}
That's as far as I've gotten. How would I differentiate between numbers, and if it is a number, return #?
Thanks!
Edit: Seems to be working like a champ with:
function returnFirst($rsrnum) {
$char = substr($rsrnum, 0, 1);
return ctype_alpha($char) ? $char : "#";
}
Thanks!
Use ctype_alpha to check if the first character is a letter and return accordingly:
$char = substr($rsrnum, 0, 1);
return ctype_alpha($char) ? $char : "#";
A little late, but I think this might do the trick just fine!
function CheckFirst($String,$CharToCheck = 1){
if ($CharToCheck < 1){
return false;
}
$First_Char = $String[$CharToCheck - 1]; // Character at the requested 1-based position (the {} offset syntax was removed in PHP 8)
if (ctype_alpha($First_Char) === true){
// If the first letter is in the alphabet (A-Z/a-z)
return "First Character Is A String";
}elseif (ctype_digit($First_Char) === true){
// If The first Letter is a digit (0-9)
return "First Character Is A Digit";
}
}
I've left the function returning a string, so it's clear what comes back when you pass a string to it. Adapt the return values to your requirements.
Edit: the function can now be called to check more than just the first character of a string; it defaults to the first.
I'm using a PHP function to return words instead of characters. It works fine when I pass a string literal to the function, but I have a variable that is assigned from another variable containing the string, and passing it doesn't work. I've tried the original variable too, but that didn't work either.
////////////////////////////////////////////////////////
function words($text)
{
$words_in_text = str_word_count($text,1);
$words_to_return = 2;
$result = array_slice($words_in_text,0,$words_to_return);
return '<em>'.implode(" ",$result).'</em>';
}
$intro = $blockRow03['News_Intro'];
echo words($intro);
/* echo words($blockRow03['News_Intro']); didn't work either */
the result is nothing
str_word_count won't work correctly with accented (multi-byte) characters. You can use the sanitize_words function below to overcome this problem:
function sanitize_words($string) {
preg_match_all("/\p{L}[\p{L}\p{Mn}\p{Pd}'\x{2019}]*/u",$string,$matches,PREG_PATTERN_ORDER);
return $matches[0];
}
function words($text)
{
$words_in_text = sanitize_words($text);
$words_to_return = 2;
$result = array_slice($words_in_text,0,$words_to_return);
return '<em>'.implode(" ",$result).'</em>';
}
$intro = "aşağı yukarı böyle birşey";
echo words($intro);
I want to check that all brackets in a given string open and close properly, and also check whether the string is a mathematical expression.
For example:
$str1 = "(A1+A2*A3)+A5+(B3^B5)*(C1*((A3/C2)+(B2+C1)))";
$str2 = "(A1+A2*A3)+A5)*C1+(B3^B5*(C1*((A3/C2)+(B2+C1)))";
$str3 = "(A1+A2*A3)+A5++(B2+C1)))";
$str4 = "(A1+A2*A3)+A5+(B3^B5)*(C1*(A3/C2)+(B2+C1))";
In the above example, $str1 and $str4 are valid strings.
Please Help....
You'll need a kind of parser. I don't think you can handle this with a regular expression, because you have to check both the number and the order of the parentheses, including nested ones. The class below is a quick PHP port of a Python-based parentheses validator for math expressions that I found:
class MathExpression {
private static $parentheses_open = array('(', '{', '[');
private static $parentheses_close = array(')', '}', ']');
protected static function getParenthesesType( $c ) {
if(in_array($c,MathExpression::$parentheses_open)) {
return array_search($c, MathExpression::$parentheses_open);
} elseif(in_array($c,MathExpression::$parentheses_close)) {
return array_search($c, MathExpression::$parentheses_close);
} else {
return false;
}
}
public static function validate( $expression ) {
$size = strlen( $expression );
$tmp = array();
for ($i=0; $i<$size; $i++) {
if(in_array($expression[$i],MathExpression::$parentheses_open)) {
$tmp[] = $expression[$i];
} elseif(in_array($expression[$i],MathExpression::$parentheses_close)) {
if (count($tmp) == 0 ) {
return false;
}
if(MathExpression::getParenthesesType(array_pop($tmp))
!= MathExpression::getParenthesesType($expression[$i])) {
return false;
}
}
}
if (count($tmp) == 0 ) {
return true;
} else {
return false;
}
}
}
//Mathematical expressions to validate
$tests = array(
'(A1+A2*A3)+A5+(B3^B5)*(C1*((A3/C2)+(B2+C1)))',
'(A1+A2*A3)+A5)*C1+(B3^B5*(C1*((A3/C2)+(B2+C1)))',
'(A1+A2*A3)+A5++(B2+C1)))',
'(A1+A2*A3)+A5+(B3^B5)*(C1*(A3/C2)+(B2+C1))'
);
// running the tests...
foreach($tests as $test) {
$isValid = MathExpression::validate( $test );
echo 'test of: '. $test .'<br>';
var_dump($isValid);
}
Well, I suppose what you are looking for is a context-free grammar or a pushdown automaton. It cannot be done using regular expressions alone (at least there is no easy or nice way).
That is because you are dealing with nested structures. Some ideas for an implementation can be found in the question "Regular expression to detect semi-colon terminated C++ for & while loops".
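You could use a regular expression (or a simple count) to find out how many opening and closing brackets there are, and then compare the two numbers. If they differ, the expression is definitely wrong. Note that equal counts are necessary but not sufficient: the order matters too, so an expression such as ")(" would pass a pure count.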
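Comparing counts alone accepts input where a closing bracket comes before its opening one. A running depth counter avoids that for a single bracket type. The following is only a sketch with a hypothetical function name, parens_balanced; it checks round brackets only and says nothing about whether the rest of the string is a valid mathematical expression.

```php
<?php
// Hypothetical helper: true when every ")" closes an earlier "(" and none is left open.
function parens_balanced($expression)
{
    $depth = 0;
    $length = strlen($expression);
    for ($i = 0; $i < $length; $i++) {
        if ($expression[$i] === '(') {
            $depth++;
        } elseif ($expression[$i] === ')') {
            $depth--;
            if ($depth < 0) {
                return false; // closing bracket with nothing open
            }
        }
    }
    return $depth === 0;
}

// Equal counts, but wrong order: a pure count cannot tell the difference.
$wrong_order = ')A1+A2(';
var_dump(substr_count($wrong_order, '(') === substr_count($wrong_order, ')'));
var_dump(parens_balanced($wrong_order));

// Two of the strings from the question.
var_dump(parens_balanced('(A1+A2*A3)+A5+(B3^B5)*(C1*(A3/C2)+(B2+C1))'));
var_dump(parens_balanced('(A1+A2*A3)+A5++(B2+C1)))'));
```

For several bracket types that must match each other, a stack is needed instead of a single counter.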