I have a CSV parser that takes an Outlook 2010 contact export .CSV file and produces an array of values.
I break each row on the newline character and each column on the comma. It works fine until someone puts a newline inside a field (typically Address). This newline, which I assume is "\n" or "\r\n", splits the row where it shouldn't, and the whole file is misparsed from there on.
In my case, it happens when Business Street is written on two lines:
123 Apple Dr.
Unit A
My code:
$file = file_get_contents("outlook.csv");
$rows = explode("\r\n",$file);
foreach($rows as $row)
{
$columns = explode(",",$row);
// Further manipulation here.
}
I have tried both "\n" and "\r\n", same result.
I figured I could calculate the number of columns in the first row (keys), and then find a way to not allow a new line until this many columns have been parsed, but it feels shady.
Is there another character for the new line that I can try, that would not be inside the data fields themselves?
The most common way of handling newlines in CSV files is to "quote" fields which contain significant characters such as newlines or commas. It may be worth looking into whether your CSV generator does this.
I recommend using PHP's fgetcsv() function, which is intended for this purpose. As you've discovered, splitting strings on commas works only in the most trivial cases.
In cases where that doesn't work, a more sophisticated, reportedly RFC 4180-compliant parser is available here.
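As a rough sketch of the fgetcsv() approach: the sample data below is made up to stand in for an Outlook export, and it assumes the exporter wraps multi-line fields in double quotes. The delimiter, enclosure and escape arguments are passed explicitly only to avoid relying on defaults.

```php
<?php
// Made-up sample standing in for an Outlook export.
// The quoted address field contains a comma-free line break; the name contains a comma.
$csv = "Name,Business Street,City\r\n"
     . "\"Doe, Jane\",\"123 Apple Dr.\r\nUnit A\",Springfield\r\n";

// An in-memory stream is used here; a real script would fopen("outlook.csv", "r").
$handle = fopen('php://memory', 'w+');
fwrite($handle, $csv);
rewind($handle);

$rows = [];
// fgetcsv() reads one logical record at a time, even if it spans several physical lines.
while (($columns = fgetcsv($handle, 0, ',', '"', '\\')) !== false) {
    $rows[] = $columns;
}
fclose($handle);

echo count($rows), " rows, ", count($rows[1]), " columns in the data row\n";
```

The line break and the comma inside the quoted fields stay part of their field instead of starting a new row or column.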
I also recommend fgetcsv()
fgetcsv() will also take care of commas inside strings (between quotes).
Interesting parsing tutorial
+1 to the previous answer ;)
PS: fgetcsv() is a bit slower than opening the file and exploding the contents, but IMO it's worth it.
I have a file with the next structure:
concept
[at0000] -- Blood Pressure
language
original_language =
translations =
author =
["organisation"] =
["email"] =
>
accreditation =
>
>
description
original_author =
["organisation"] =
["email"] =
["date"] =
>
details =
purpose =
I need to open and parse this file, but I must take the indentation of each line into account, because the indentation represents the hierarchical structure. Is there any way in PHP to analyse the indentation line by line, whether it occurs at the beginning, middle or end of the line?
//rant on
It's simple: who provides such a crappy data structure to parse.
It's 2014. XML all over the place and lightweight JSON.
What do we get? Not even CSV :)
//rant off
Maybe a fixed column width parser would fit:
https://github.com/t-geindre/fixed-column-width-parser
Basically you get lines with $lines = file("file.txt");
Then it's a matter of detecting the spaces or tabs in front of each line.
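A minimal sketch of that detection step, assuming the indentation consists of spaces or tabs. The sample lines and their indentation widths are invented for illustration; a real script would fill $lines with file("file.txt", FILE_IGNORE_NEW_LINES).

```php
<?php
// Invented sample lines; the indentation widths are an assumption.
$lines = [
    "concept",
    "    [at0000] -- Blood Pressure",
    "language",
    "    original_language =",
    "        author =",
];

$parsed = [];
foreach ($lines as $line) {
    // strspn() returns the length of the leading run of spaces and tabs.
    $indent = strspn($line, " \t");
    $parsed[] = ['indent' => $indent, 'text' => substr($line, $indent)];
}

foreach ($parsed as $entry) {
    echo $entry['indent'], ": ", $entry['text'], "\n";
}
```

Comparing each line's indent with the previous one tells you whether the hierarchy goes deeper, stays level or closes.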
Update
Turns out this "data" has a structure.
The data-structure "Archetype Definition Language" (ADL) is described in ISO 13606-2.
http://pangea.upv.es/en13606/index.php/resources/files/doc_download/2-en13606-part-2
This document contains a grammar description in Chapter 8.
You might use this grammar for parser construction.
Parsing indentation is your smallest problem. Getting the data structure right is the real task.
Happy test writing - this will be a lot of work... be warned.
Let me also point to OpenEHR.
OpenEHR uses Java and Eiffel as programming languages.
The ADL parser is implemented in Java.
You might find it at https://github.com/openEHR/java-libs/blob/master/adl-parser/src/main/javacc/adl.jj
This is the parser ADL v1.4 in Ruby:
https://github.com/skoba/openehr-ruby/tree/master/lib/openehr/parser
This should get you pretty close to a solution.
Hope this helps a bit..
You can use ltrim and rtrim functions.
For example using the following code:
$line = ' concept';
echo strlen(ltrim($line));
echo strlen($line);
you can calculate the length of the string with and without the whitespace at the beginning of the line; the difference between the two is the indentation width.
However, I don't know what you mean by calculating indentation in the middle of the line. In that case you should probably use the substr() function to jump to the place where you expect the indentation, and then again use ltrim() and strlen() to count the whitespace at the beginning of the substring.
You may also want to use the mb_* string functions in case your data contains non-ASCII characters.
To read the file into an array of lines you can simply use the file() function.
ADL doesn't have a parser for PHP. But ADL can be transformed to XML using the CKM (http://ckm.openehr.org/ckm/) or the Archetype Editor (http://www.openehr.org/downloads/modellingtools).
You should use the XML in PHP.
Here is a line of code from a PHP file. Specifically, it is from zstore.php, a file included as part of the "Zazzle Store Builder" toolset from Zazzle.com.
The set of files allows someone like me, who has products for sale on Zazzle, to massage that data into a nicer "storefront" that I can set up my way, instead of being confined by the CMS structure of Zazzle.com, where they understandably want to keep the monkeys (uhmmm... users like myself) from causing too much mayhem.
So... here is the code:
$keywords = str_replace(" ",",",str_replace(",","",$keywords));
Two questions:
Am I understanding what it does and
Is there an extra single or double quote in the string that does not need to be there?
Here is what I think the line of code is saying:
Take the string of characters that the user inputs (dance diva) and assign it to the variable called
$keywords
then run the following function on that character string
= str_replace
(" ","," <<< look for spaces. If you find a space, replace it with a comma
,str_replace(",","" <<< this is the bit I don't understand or which may have a typo
I THINK that it is saying "if you find commas, leave them alone", but I'm not certain.
,$keywords)); <<< then put the edited string of characters back into the variable called $keywords.
What led me to look at this was that I was inputting the following:
dance,diva which is what I THOUGHT the script was wanting from me based on the commented text in the README.txt file:
// Search terms. Comma separated keywords you can use to select products for your store
So..
Am I understanding what this line of code is supposed to do?
which, assuming I am correct, and I'm pretty sure that the first half is supposed to work as I've described, now brings me to my second question:
Why isn't the second bit working? Is there a typo?
To review:
dance diva produces results
dance,diva does not
Both SHOULD work.
Thanks in advance for your help. I have a lot of HTML experience and computer experience but PHP is new to me.
$keywords = str_replace(" ",",",str_replace(",","",$keywords));
You can split into
$temp = str_replace(",","",$keywords);
$keywords = str_replace(" ",",",$temp);
First it replaces all commas with an empty string, i.e. it removes all commas. Then it replaces all spaces with commas.
For "dance diva" there are no commas, so the first call does nothing; the second then replaces the space and the result is "dance,diva".
For "dance,diva" the first call removes the comma, giving "dancediva", and there is no space left to replace, so that is your result.
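To see both cases side by side, here is a small sketch that wraps the two calls in a function. The name normalizeKeywords() is made up for this example and is not part of the Zazzle Store Builder.

```php
<?php
// Hypothetical helper wrapping the two str_replace() calls from zstore.php.
function normalizeKeywords(string $keywords): string
{
    // Inner call: remove every comma.
    $withoutCommas = str_replace(",", "", $keywords);
    // Outer call: turn every space into a comma.
    return str_replace(" ", ",", $withoutCommas);
}

$fromSpaces = normalizeKeywords("dance diva");
$fromCommas = normalizeKeywords("dance,diva");

echo $fromSpaces, "\n"; // dance,diva
echo $fromCommas, "\n"; // dancediva
```

So the script expects space-separated input and builds the comma-separated list itself; typing the commas yourself glues the words together.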
I have a text file of approximately 25,000 lines. About 525kb.
Some lines have random text at the beginning.
Some have long strings of semicolons.
Some others only have three semi-colons and then a space and optionally more text on the same line. These are the lines I want to remove.
Here is a sample....
;;; Updated Time 20120706122706
;;; Generic DEveloper Output
;;; Some Random Comments
;;; I got some more...
;;; Yet another uneeded line
;;; Thanks for using StackOverflow <http://stackoverflow.com>, or...
;;; Not.
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; Banana Production
[Data_Release_Version]
Version=12586
Released=20120706122706
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; Baseline Properties
[BaseLineProperties]
Comment=BaselineProperties
----- and so on.
Once it gets to the first line with 4 or more ; on the line, I need the rest of the file as there are no ";;; " lines.
I'm trying to find something faster than reading every line and writing it back out if it doesn't match ";;; ".
File is ASCII (possibly UTF-8) text type file.
Any ideas?
Thank you for your time, assistance and knowledge.
What I would suggest is to use file_get_contents() to read the file's contents into a string, explode() that string at every newline character, and then, in a foreach loop, use preg_match() to check whether each line begins with three semicolons and a space. If it doesn't, put it in another array named $output. After the loop, implode() $output with a newline character and use file_put_contents() to write it to another file. Hope this helps :-)
code:
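<?php
// $filename (input) and $path (output) are assumed to be set by the caller.
$string = file_get_contents($filename);
$array = explode("\n", $string);
$output = array();
foreach ($array as $arr) {
    // Keep every line that does not start with three semicolons and whitespace.
    // preg_match() patterns need delimiters, here the slashes.
    if (!preg_match('/^;;;\s/', $arr)) {
        $output[] = $arr;
    }
}
$out = implode("\n", $output);
file_put_contents($path, $out);
?>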
It depends. I would try loading the file into a string, then explode() it on the newline so it's in an array, then run a foreach that skips every line where strpos($line, ";;; ") === 0. The strict comparison matters, because strpos() returns false when there is no match, and false == 0. You can use continue to skip to the next line when a line matches.
Another option is to parse and skip while reading, or even to use fseek() and the like. Which approach is fastest depends on a lot of different factors.
You can implode() later on to add the newlines back in, and then write out a file and/or use line breaks, depending on where the output is supposed to go.
I think you gave the answer yourself:
Make a script that reads the input file line by line in a loop (while). It skips a line only if two conditions are met: 1. a flag ("done") is FALSE and 2. the line starts with ";;; " (note the blank). Every other line is written to the output file. This removes the lines starting with three semicolons and a space. Once you come across a line containing four or more semicolons, you set the flag to TRUE, so the remaining lines will be copied without being examined.
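A sketch of that loop, under the assumption that the marker line starts with at least four semicolons. The function name and the in-memory streams are only for illustration; a real script would fopen() the input and output files.

```php
<?php
// Hypothetical helper: copies $in to $out, dropping ";;; " lines until the marker line.
function copyWithoutHeaderComments($in, $out): void
{
    $done = false;
    while (($line = fgets($in)) !== false) {
        if (!$done) {
            if (strncmp($line, ';;; ', 4) === 0) {
                continue; // header comment: skip it
            }
            if (strncmp($line, ';;;;', 4) === 0) {
                $done = true; // first line with 4+ semicolons: stop examining
            }
        }
        fwrite($out, $line);
    }
}

// Made-up sample input; the last line only shows that nothing is examined after the marker.
$in = fopen('php://memory', 'w+');
fwrite($in, ";;; Updated Time 20120706122706\nrandom text\n;;; Not.\n"
          . ";;;;;;;; Banana Production\n[Data_Release_Version]\n;;; kept after marker\n");
rewind($in);

$out = fopen('php://memory', 'w+');
copyWithoutHeaderComments($in, $out);
rewind($out);
$result = stream_get_contents($out);
fclose($in);
fclose($out);

echo $result;
```

Reading with fgets() keeps memory use flat, and after the flag flips the loop does nothing but copy.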
I have a text field in the database whose datatype is also text. It holds comments and the like. When I read this and export it to a CSV, if there is a newline in the comment, such as
This is a comment
This is another line
the CSV import shows "This is another line" on the next row and thus messes up my data.
So far I have tried str_replace() and trim(), but they don't seem to do anything. I have looked for similar answers on Stack Overflow but couldn't find one that suits my problem.
Thanks
What are you searching for in the str_replace() call? You should be able to search for "\r" and/or "\n" in the given string. Examples are documented in the PHP str_replace documentation. Example links below:
http://php.net/manual/en/function.str-replace.php#example-4450
http://www.php.net/manual/en/function.str-replace.php#97374
If you convert your newlines to printf-style escape sequences, you can avoid this problem.
<?php
$t='This is a comment
This is another line
';
print str_replace("\n","\\n",$t);
Output:
This is a comment\nThis is another line\n
Also look in to htmlspecialchars(). You need to protect other characters that could break your CSV.
Lastly, have a close look at fputcsv() to see if it'll do what you need. Might be better to use builtins than to roll your own.
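For illustration, a small sketch of the fputcsv() route with a made-up two-column row. The separator, enclosure and escape arguments are spelled out only to avoid relying on defaults, and an in-memory stream stands in for the real export file.

```php
<?php
// Made-up row: an id and a comment that contains a line break.
$row = ['1', "This is a comment\nThis is another line"];

$handle = fopen('php://memory', 'w+');
// fputcsv() wraps the multi-line comment in double quotes.
fputcsv($handle, $row, ',', '"', '\\');

// Read it back to confirm the comment is still a single field of a single record.
rewind($handle);
$readBack = fgetcsv($handle, 0, ',', '"', '\\');
fclose($handle);

echo count($readBack), " fields\n";
```

A CSV reader that honours quoted fields will then see the line break as part of the comment, not as the start of a new row.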
I'm learning PHP and MySQL together from Head First PHP & MySQL and in the book, they often split their long strings (over 80~ characters) and concatenate them, like this:
$variable = "a very long string " .
"that requires a new line " .
"and apparently needs to be concatenated.";
I have no issue with this, but what strikes me as odd is that whitespace in other languages usually doesn't need concatenation.
$variable = "you guys probably already know
that this simply works too.";
I tried this and it worked just fine. Aren't line breaks always interpreted with a space at the end? Even the PHP manual doesn't concatenate in the echo examples if they span over one line.
Should I follow my book's example or what? I can't tell which is more correct or "proper" since both work and the manual even takes a shorter approach. I would also like to know how important it is to keep code under 80 characters in width. I have always been fine with word wrap since my monitor is pretty large and I hate my code getting cut short when I have the screen space.
There are 3 basic ways of building multiline strings in PHP.
a. building string via concatenation and embedded newlines:
$str = "this is the first line, with a line break\n";
$str .= "this is the second line, but won't have a break";
$str .= "this would've been the 3rd line, but since there's no line break in the previous line...";
b. multi-line string assignment, with embedded newlines:
$str = "this is the first line, with an escaped line break\n
this is actually the third line, because the literal line break above adds a second newline.
this is the fourth line, because the literal line break before it counts as a newline too.";
c. HEREDOC syntax:
$str = <<<EOL
this is the first line
this is the second line, note the lack of a newline
this is the third line\n
this is actually the fifth line, because the newline previously isn't necessary.
EOL;
Heredocs are generally preferable for building multi-line strings. You don't have to escape quotes within the text, variables are interpolated within them as if it was a regular double-quoted string, and newlines within the text are honored.
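A short sketch of those heredoc properties; the variable $name and its value are invented for the example. The closing marker sits at the start of its line, which older PHP versions require.

```php
<?php
$name = 'Jane'; // invented value for illustration

// Quotes need no escaping, $name is interpolated, and the line break is kept.
$message = <<<EOL
$name said "It's hot today!"
This is the second line.
EOL;

// Split on any line ending, so the result does not depend on how the file was saved.
$lines = preg_split('/\R/', $message);

echo count($lines), " lines\n";
echo $lines[0], "\n";
```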
In PHP long strings don't need concatenation but keep in mind that:
$variable = "you guys probably already know
that this simply works too.";
is the equivalent of
$variable = "you guys probably already know\nthat this simply works too.";
The newline is the same in these 2 examples, provided the source file is saved with \n line endings (a file saved with Windows line endings contains \r\n instead).
So to answer your question, no, you don't have to break large strings into many smaller ones. Doing so is just a matter of preference (one I don't see very often).
The 80-character "limit" is a throwback to the old days when terminal screens were 80 characters wide. If you ever need to edit something in a narrow terminal, respecting 80 characters can be helpful. However, if wrapping lines at 80 characters is causing you headaches in your editor, don't follow that convention.
When you have a multi-line string as in your second example, the string will be exactly as you type it in your editor. If you have a whole bunch of spaces before your return character, those will be in your string variable. The only exception is when your editor is soft-wrapping a long line: then there is no actual return character in the string, and none shows up in the variable.
PHP syntax allows literal line feeds in the strings. Your second example equals this:
you guys probably already know[LF][SPACE][SPACE][SPACE][SPACE]that this simply works too.
where [LF] will be \r\n or \n depending on your editor settings. Those redundant spaces may be an issue or not (not everything is HTML), but it's not the same as concatenating.
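A quick way to convince yourself of the difference, as a sketch; the four-space indentation of the second line is an assumption matching the [SPACE] markers above.

```php
<?php
// Concatenated version: one line, single spaces only.
$concatenated = "you guys probably already know " .
    "that this simply works too.";

// Literal multi-line version: the line break and the indentation end up in the string.
$literal = "you guys probably already know
    that this simply works too.";

var_dump($concatenated === $literal); // bool(false)

// Collapsing every run of whitespace to one space makes the two equal again.
$normalized = preg_replace('/\s+/', ' ', $literal);
var_dump($normalized === $concatenated); // bool(true)
```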
No.
1) open quotes
2) write as much as you need, adding spaces, tabs, whatever else
3) close quotes.
If you're using the same quotes within, escape them with \
"Jane said \"It's hot today!\"";
or
'Jane said "It\'s hot today!"';