Reading CSV file with unescaped enclosures


I am reading a CSV file but some of the values are not escaped so PHP is reading it wrong. Here is an example of a line that is bad:
" 635"," ","AUBREY R. PHILLIPS (1920- ) - Pastel depicting cottages in
a steep sided river valley, possibly North Wales, signed and dated
2000, framed, 66cm by 48cm. another of a rural landscape, titled verso
"Harvest Time, Somerset" signed and dated '87, framed, 69cm by 49cm.
(2) NB - Aubrey Phillips is a Worcestershire artist who studied at
the Stourbridge School of Art.","40","60","WAT","Paintings, prints and
watercolours",
You can see that Harvest Time, Somerset has quotes around it, causing PHP to think it's a new value.
When I do print_r() on each line, the broken lines end up looking like this:
Array
(
[0] => 635
[1] =>
[2] => AUBREY R. PHILLIPS (1920- ) - Pastel depicting cottages in a steep sided river valley, possibly North Wales, signed and dated 2000, framed, 66cm by 48cm. another of a rural landscape, titled verso Harvest Time
[3] => Somerset" signed and dated '87
[4] => framed
[5] => 69cm by 49cm. (2) NB - Aubrey Phillips is a Worcestershire artist who studied at the Stourbridge School of Art."
[6] => 40
[7] => 60
[8] => WAT
[9] => Paintings, prints and watercolours
[10] =>
)
This is obviously wrong, as it now contains many more array elements than the other, correct rows.
Here is the PHP I am using:
$i = 1;
if (($file = fopen($this->request->data['file']['tmp_name'], "r")) !== FALSE) {
while (($row = fgetcsv($file, 0, ',', '"')) !== FALSE) {
if ($i == 1){
$header = $row;
}else{
if (count($header) == count($row)){
$lots[] = array_combine($header, $row);
}else{
$error_rows[] = $row;
}
}
$i++;
}
fclose($file);
}
Rows with the wrong number of values get put into $error_rows and the rest get put into a big $lots array.
What can I do to get around this? Thanks.

If you know that you'll always get entries 0 and 1, and that the last 5 entries in the array are always correct, so it's just the descriptive entry that's "corrupted" because of unescaped enclosure characters, then you could extract the first 2 and last 5 using array_slice(), implode() the remainder back into a single string (restoring the lost quotes), and rebuild the array correctly.
$testData = '" 635"," ","AUBREY R. PHILLIPS (1920- ) - Pastel depicting cottages in a steep sided river valley, possibly North Wales, signed and dated 2000, framed, 66cm by 48cm. another of a rural landscape, titled verso "Harvest Time, Somerset" signed and dated \'87, framed, 69cm by 49cm. (2) NB - Aubrey Phillips is a Worcestershire artist who studied at the Stourbridge School of Art.","40","60","WAT","Paintings, prints and watercolours",';
$result = str_getcsv($testData, ',', '"');
$hdr = array_slice($result,0,2);
$bdy = array_slice($result,2,-5);
$bdy = trim(implode('"',$bdy),'"');
$ftr = array_slice($result,-5);
$fixedResult = array_merge($hdr,array($bdy),$ftr);
var_dump($fixedResult);
result is:
array
0 => string ' 635' (length=4)
1 => string ' ' (length=1)
2 => string 'AUBREY R. PHILLIPS (1920- ) - Pastel depicting cottages in a steep sided river valley, possibly North Wales, signed and dated 2000, framed, 66cm by 48cm. another of a rural landscape, titled verso Harvest Time" Somerset" signed and dated '87" framed" 69cm by 49cm. (2) NB - Aubrey Phillips is a Worcestershire artist who studied at the Stourbridge School of Art.' (length=362)
3 => string '40' (length=2)
4 => string '60' (length=2)
5 => string 'WAT' (length=3)
6 => string 'Paintings, prints and watercolours' (length=34)
7 => string '' (length=0)
Not perfect, but possibly good enough.
The alternative is to get whoever is generating the CSV to properly escape their enclosures.
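The slice-and-rejoin idea above can be wrapped in a small helper. This is only a sketch under the same assumption as the answer: exactly one field is damaged, and the number of trustworthy leading and trailing fields is known. The name repairRow and its parameters are hypothetical. It glues the pieces with a comma instead of a quote, which brings the commas back but cannot restore the enclosure characters the parser has already stripped.

```php
<?php
// Hypothetical helper (not part of PHP): merge the surplus middle fields of a
// mis-parsed CSV row back into a single field.
// $head = number of trusted leading fields, $tail = number of trusted trailing fields.
function repairRow(array $row, int $head, int $tail, string $glue = ','): array
{
    $expected = $head + 1 + $tail;
    if (count($row) <= $expected) {
        return $row; // already the expected width (or too short to repair)
    }

    $middle = array_slice($row, $head, count($row) - $head - $tail);
    // array_slice($row, -0) would return the whole row, so guard $tail === 0
    $last = $tail > 0 ? array_slice($row, -$tail) : array();

    return array_merge(array_slice($row, 0, $head), array(implode($glue, $middle)), $last);
}

// A row that was split into 6 fields although only 5 were expected.
$broken = array('635', '', 'titled Harvest Time', ' Somerset" signed', '40', '60');
print_r(repairRow($broken, 2, 2));
```

In the loop from the question, this could be called on rows where count($row) is larger than count($header), with $head = 2 and $tail = count($header) - 3.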

If you can escape the " in the text like this: \"
then specify the escape character in the fgetcsv call:
fgetcsv($file, 0, ',', '"', '\\');

This is a long shot, so don't take it too seriously.
I saw a pattern in the text: every ',' that you want to ignore has a space after it.
Search and replace ', ' with 'FUU' or something unique.
Now parse the CSV file. It might come out in the correct format. Afterwards you only need to replace 'FUU' back with ', '.
:)

You are probably reading the contents of the CSV file as an array of lines, then splitting each line on the comma. This fails since some of the fields also contain commas. One trick that could help you out is to look for ",", which would indicate a field separator which would be unlikely (but not impossible, unfortunately) to occur inside a field.
<?php
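$csv = file_get_contents("yourfile.csv");
// split() was removed in PHP 7; explode() does the same job for a fixed separator
$lines = explode("\r\n", $csv);
echo "<pre>";
foreach($lines as $line)
{
// mark every "," field boundary with a placeholder, then split on the placeholder
$line = str_replace("\",\"", "\"###\"", $line);
$fields = explode("###", $line);
print_r($fields);
}
echo "</pre>";
?>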
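The same idea can be sketched as a function that also strips the outer quotes. It assumes every field in the file is quoted and that each record sits on a single line (the sample row in the question spans several lines, so those would have to be joined first). The name splitOnQuotedCommas is made up for illustration.

```php
<?php
// Hypothetical helper: split one CSV record on the "," boundary between quoted
// fields, so commas and stray quotes inside a field are left alone.
function splitOnQuotedCommas(string $line): array
{
    $line = trim($line);

    // A bare trailing comma means the record ends with an empty field.
    $trailingEmpty = substr($line, -1) === ',';
    if ($trailingEmpty) {
        $line = substr($line, 0, -1);
    }

    // Strip the opening quote of the first field and the closing quote of the last.
    if (substr($line, 0, 1) === '"') {
        $line = substr($line, 1);
    }
    if (substr($line, -1) === '"') {
        $line = substr($line, 0, -1);
    }

    $fields = explode('","', $line);
    if ($trailingEmpty) {
        $fields[] = '';
    }
    return $fields;
}

print_r(splitOnQuotedCommas('" 635"," ","titled "Harvest Time, Somerset" signed","40",'));
```

As the answer notes, a field that itself contains "," would still be split in the wrong place.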

$csv = explode(' ', $csv);
foreach ($csv as $k => $v) if($v[0] == '"' && substr($v, -1) == '"') {
$csv[$k] = mb_convert_encoding('“' . substr($v, 1, -1) . '”', 'UTF-8', 'HTML-ENTITIES');
}
$csv = implode(' ', $csv);
$csv = str_getcsv($csv);

Related

How can multiple identical values be printed from an array in PHP?

I'm trying to create a basic concordance script that will print the ten words before and after the value found inside an array. I did this by splitting the text into an array, identifying the position of the value, and then printing -10 and +10 with the searched value in the middle. However, this only presents the first such occurrence. I know I can find the others by using array_keys (found in positions 52, 78, 80), but I'm not quite sure how to cycle through the matches, since array_keys also results in an array. Thus, using $matches (with array_keys) in place of $location below doesn't work, since you cannot use the same operands on an array as an integer. Any suggestions? Thank you!!
<?php
$text = <<<EOD
The spread of a deadly new virus is accelerating, Chinese President Xi Jinping warned, after holding a special government meeting on the Lunar New Year public holiday.
The country is facing a "grave situation" Mr Xi told senior officials.
The coronavirus has killed at least 42 people and infected some 1,400 since its discovery in the city of Wuhan.
Meanwhile, UK-based researchers have warned of a real possibility that China will not be able to contain the virus.
Travel restrictions have come in place in several affected cities. From Sunday, private vehicles will be banned from central districts of Wuhan, the source of the outbreak.
EOD;
$new = explode(" ", $text);
$location = array_search("in", $new, FALSE);
$concordance = 10;
$top_range = $location + $concordance;
$bottom_range = $location - $concordance;
while($bottom_range <= $top_range) {
echo $new[$bottom_range] . " ";
$bottom_range++;
}
?>

You can just iterate over the values returned by array_keys, using array_slice to extract the $concordance words either side of the location and implode to put the sentence back together again:
$words = explode(' ', $text);
$concordance = 10;
$results = array();
foreach (array_keys($words, 'in') as $idx) {
$results[] = implode(' ', array_slice($words, max($idx - $concordance, 0), $concordance * 2 + 1));
}
print_r($results);
Output:
Array
(
[0] => least 42 people and infected some 1,400 since its discovery in the city of Wuhan.
Meanwhile, UK-based researchers have warned of a
[1] => not be able to contain the virus.
Travel restrictions have come in place in several affected cities. From Sunday, private vehicles will
[2] => able to contain the virus.
Travel restrictions have come in place in several affected cities. From Sunday, private vehicles will be banned
)
If you want to avoid generating similar phrases where a word occurs twice within $concordance words (e.g. indexes 1 and 2 in the above array), you can maintain a position for the end of the last match, and skip occurrences that occur in that match:
$words = explode(' ', $text);
$concordance = 10;
$results = array();
$last = 0;
foreach (array_keys($words, 'in') as $idx) {
if ($idx < $last) continue;
$results[] = implode(' ', array_slice($words, max($idx - $concordance, 0), $concordance * 2 + 1));
$last = $idx + $concordance;
}
print_r($results);
Output
Array
(
[0] => least 42 people and infected some 1,400 since its discovery in the city of Wuhan.
Meanwhile, UK-based researchers have warned of a
[1] => not be able to contain the virus.
Travel restrictions have come in place in several affected cities. From Sunday, private vehicles will
)
Demo on 3v4l.org

Try this:
<?php
$text = <<<EOD
The spread of a deadly new virus is accelerating, Chinese President Xi Jinping warned, after holding a special government meeting on the Lunar New Year public holiday.
The country is facing a "grave situation" Mr Xi told senior officials.
The coronavirus has killed at least 42 people and infected some 1,400 since its discovery in the city of Wuhan.
Meanwhile, UK-based researchers have warned of a real possibility that China will not be able to contain the virus.
Travel restrictions have come in place in several affected cities. From Sunday, private vehicles will be banned from central districts of Wuhan, the source of the outbreak.
EOD;
$words = explode(" ", $text);
$concordance = 10; // range -+
$result = []; // result array
$index = 0;
if (count($words) === 0) // be sure there is no empty array
exit;
do {
$location = array_search("in", $words, false);
if ($location === false) // break loop if "in" is not found (index 0 is a valid match)
break;
$count = count($words);
// check range of array indexes
$minRange = max($location - $concordance, 0); // array can't have an index less than 0
$maxRange = min($location + $concordance, $count - 1); // the last valid index is count - 1
for ($range = $minRange; $range <= $maxRange; $range++) {
$result[$index][] = $words[$range]; // group words by index
}
unset($words[$location]); // delete the element which contains "in"
$words = array_values($words); // reindex array
$index++;
} while (true); // repeat until the break above fires
print_r($result); // <--- here's your results
?>

PHP: String indexing inconsistent?

I have created a function which randomly generates a phrase from a hardcoded list of words. I have a function get_words() which has a string of hardcoded words, which it turns into an array then shuffles and returns.
get_words() is called by generate_random_phrase(), which iterates through the result of get_words() n times and on every iteration concatenates the nth word onto the final phrase, which is then returned to the user.
My problem is that, for some reason, PHP keeps giving me inconsistent results. The words it gives me are randomized, but the number of words is inconsistent. I specify 4 words as the default and it gives me phrases ranging from 1 to 4 words instead of 4. This program is so simple that it is almost unbelievable that I can't pinpoint the exact issue. The broken link in the chain seems to be the $words array being indexed; for some reason the indexing sometimes fails. I am unfamiliar with PHP, can someone explain this to me?
<?php
function generate_random_phrase() {
$words = get_words();
$number_of_words = get_word_count();
$phrase = "";
$symbols = "!##$%^&*()";
echo print_r($phrase);
for ($i = 0;$i < $number_of_words;$i++) {
$phrase .= " ".$words[$i];
}
if (isset($_POST['include_numbers']))
$phrase = $phrase.rand(0, 9);
if (isset($_POST['include_symbols']))
$phrase = $phrase.$symbols[rand(0, 9)];
return $phrase;
}
function get_word_count() {
if ($_POST['word_count'] < 1 || $_POST['word_count'] > 9)
$word_count = 4; #default
else
$word_count = $_POST['word_count'];
return $word_count;
}
function get_words() {
$BASE_WORDS = "my sentence really hope you
like narwhales bacon at midnight but only
ferver where can paper laptops spoon door knobs
head phones watches barbeque not say";
$words = explode(' ', $BASE_WORDS);
shuffle($words);
return $words;
}
?>

In $BASE_WORDS, the tabs and newlines end up in the exploded array, either as empty entries or glued onto the neighbouring words, because explode() only splits on a single space. Remove the newlines and tabs and it will generate the correct result, i.e.:
$BASE_WORDS = "my sentence really hope you like narwhales bacon at midnight but only ferver where can paper laptops spoon door knobs head phones watches barbeque not say";
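If the word list is easier to maintain across several lines, an alternative is to split on whitespace of any kind instead of on a single space. A minimal sketch, with splitWords as a hypothetical name and a shortened word list:

```php
<?php
// Split on any run of whitespace (spaces, tabs, newlines) and drop empty pieces.
function splitWords(string $text): array
{
    return preg_split('/\s+/', $text, -1, PREG_SPLIT_NO_EMPTY);
}

// Shortened sample list with a mix of newlines, spaces and a tab.
$BASE_WORDS = "my sentence really hope you
    like narwhales bacon at midnight but only
\tferver where can";

$words = splitWords($BASE_WORDS);
echo count($words), "\n"; // 15
```

Every element of $words is then a real word, so indexing the first n entries after shuffle() always yields n words.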

The array returned by get_words() also contains empty and whitespace-only entries, and your loop counts those as words. What looks like 4 words may really be 5 array entries (4 real words plus one whitespace entry), which is not what you want. You can filter out the whitespace entries first.
Here is the visual representation of what I mean:
Array
(
[0] => // hello im a whitespace, i should not be in here since im not really a word
[1] => but
[2] =>
[3] => bacon
[4] => spoon
[5] => head
[6] => barbeque
[7] =>
[8] =>
[9] => sentence
[10] => door
[11] => you
[12] =>
[13] => watches
[14] => really
[15] => midnight
[16] =>
So when you loop over it, you pick up the whitespace entries as well. If you asked for 5 words, you don't really get 5: from indexes 0 to 4 it looks like you only got 3 (1 => but, 3 => bacon, 4 => spoon).
Here is a modified version:
function generate_random_phrase() {
$words = get_words();
$number_of_words = get_word_count();
$phrase = "";
$symbols = "!##$%^&*()";
$words = array_filter(array_map('trim', $words)); // filter empty words
$phrase = implode(' ', array_slice($words, 0, $number_of_words)); // no need for a loop
// this simply gets the array from the first until the desired number of words (0,5 or 0,9 whatever)
// and then implode, just glues all the words with space
// so this ensure its always according to how many words you want
if (isset($_POST['include_numbers']))
$phrase = $phrase.rand(0, 9);
if (isset($_POST['include_symbols']))
$phrase = $phrase.$symbols[rand(0, 9)];
return $phrase;
}

Inconsistent spacing in your words list is the issue.
Here is a fix:
function get_words() {
$BASE_WORDS = "my|sentence|really|hope|you|
|like|narwhales|bacon|at|midnight|but|only|
|ferver|where|can|paper|laptops|spoon|door|knobs|
|head|phones|watches|barbeque|not|say";
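$words = explode('|', $BASE_WORDS);
// the line breaks between "|" separators produce whitespace-only entries; drop them
$words = array_filter(array_map('trim', $words), 'strlen');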
shuffle($words);
return $words;
}

Creating an accurate Bible Search

I am creating a Bible search. The trouble with Bible searches is that people often enter different kinds of searches, and I need to split them up accordingly. So I figured the best way to start would be to remove all spaces and work through the string from there. Different types of searches could be:
Genesis 1:1 - Genesis Chapter 1, Verse 1
1 Kings 2:5 - 1 Kings Chapter 2, Verse 5
Job 3 - Job Chapter 3
Romans 8:1-7 - Romans Chapter 8 Verses 1 to 7
1 John 5:6-11 - 1 John Chapter 5 Verses 6 - 11.
I am not too fazed by the different types of searches, but if anyone can find a simpler way to do this, or knows of a great way to do this, then please tell me how!
Thanks

The easiest thing to do here is to write a regular expression to capture the text, then parse out the captures to see what you got. To start, let's assume you have your test bench:
$tests = array(
'Genesis 1:1' => 'Genesis Chapter 1, Verse 1',
'1 Kings 2:5' => '1 Kings Chapter 2, Verse 5',
'Job 3' => 'Job Chapter 3',
'Romans 8:1-7' => 'Romans Chapter 8, Verses 1 to 7',
'1 John 5:6-11' => '1 John Chapter 5, Verses 6 to 11'
);
So, you have, from left to right:
A book name, optionally prefixed with a number
A chapter number
A verse number, optional, optionally followed by a range.
So, we can write a regex to match all of those cases:
$regex = '/((?:\d+\s)?\w+)\s+(\d+)(?::(\d+(?:-\d+)?))?/';
And now see what we get back from the regex:
foreach( $tests as $test => $answer) {
// Match the regex against the test case
preg_match( $regex, $test, $match);
// Ignore the first entry, the 2nd and 3rd entries hold the book and chapter
list( , $book, $chapter) = array_map( 'trim', $match);
$output = "$book Chapter $chapter";
// If the fourth match exists, we have a verse entry
if( isset( $match[3])) {
// If there is no dash, it's a single verse
if( strpos( $match[3], '-') === false) {
$output .= ", Verse " . $match[3];
} else {
// Otherwise it's a range of verses
list( $start, $end) = explode( '-', $match[3]);
$output .= ", Verses $start to $end";
}
}
// Here $output matches the value in $answer from our test cases
echo $answer . "\n" . $output . "\n\n";
}
You can see it working in this demo.

I think I understand what you are asking here. You want to devise an algorithm that extracts information (e.g. book name, chapter, verse or verses).
This looks to me like a job for pattern matching (e.g. regular expressions), because you could then define patterns, extract data for all the scenarios that make sense, and work from there.
There are actually quite a few variants that could exist, so perhaps you should also take a look at natural language processing. Fuzzy string matching on names could provide better results (e.g. when people misspell book names).
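As a rough illustration of fuzzy matching on book names, PHP's built-in levenshtein() can pick the known name closest to what the user typed. This is only a sketch: closestBook, the $maxDistance threshold of 2, and the shortened book list are assumptions made for the example.

```php
<?php
// Hypothetical helper: return the known book name closest to the user's input,
// or null when nothing is within $maxDistance edits.
function closestBook(string $input, array $books, int $maxDistance = 2): ?string
{
    $needle = strtolower(trim($input));
    $best = null;
    $bestDistance = $maxDistance + 1;
    foreach ($books as $book) {
        $distance = levenshtein($needle, strtolower($book));
        if ($distance < $bestDistance) {
            $best = $book;
            $bestDistance = $distance;
        }
    }
    return $best;
}

// Shortened sample list; a real search would hold every book name.
$books = array('Genesis', 'Exodus', 'Job', '1 Kings', 'Romans', '1 John');
var_dump(closestBook('Genisis', $books)); // string(7) "Genesis"
var_dump(closestBook('xyzzy', $books));   // NULL
```

The corrected name could then be fed into whichever parser handles the chapter and verse part.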
Best of luck

Try out something based on preg_match_all, like:
$ php -a
Interactive shell
php > $s = '1 kings 2:4 and 1 sam 4-5';
php > preg_match_all("/(\\d*|[^\\d ]*| *)/", $s, $parts);
php > print_r($parts);

Okay, well, I am not too sure about regular expressions and I haven't studied them yet, so I am stuck with the more procedural approach. I have made the following (which is still a huge improvement on the code I wrote 5 years ago, which was what I was aiming to achieve), and it seems to work flawlessly:
You need this function first of all:
function varType($str) {
if(is_numeric($str)) {return false;}
if(is_string($str)) {return true;}
}
$bible = array("BookNumber" => "", "Book" => "", "Chapter" => "", "StartVerse" => "", "EndVerse" => "");
$pos = 1; // 1 - Book Number
// 2 - Book
// 3 - Chapter
// 4 - ':' or 'v'
// 5 - StartVerse
// 6 - is a dash for spanning verses '-'
// 7 - EndVerse
$scan = ""; $compile = array();
//Divide into character type groups.
for($x=0;$x<=(strlen($collapse)-1);$x++)
{ if($x>=1) {if(varType($collapse[$x]) != varType($collapse[$x-1])) {array_push($compile,$scan);$scan = "";}}
$scan .= $collapse[$x];
if($x==strlen($collapse)-1) {array_push($compile,$scan);}
}
//If the first element is not a number, then it is not a numbered book (AKA 1 John, 2 Kings), So move the position forward.
if(varType($compile[0])) {$pos=2;}
foreach($compile as $val)
{ if(!varType($val))
{ switch($pos)
{ case 1: $bible['BookNumber'] = $val; break;
case 3: $bible['Chapter'] = $val; break;
case 5: $bible['StartVerse'] = $val; break;
case 7: $bible['EndVerse'] = $val; break;
}
} else {switch($pos)
{ case 2: $bible['Book'] = $val; break;
case 4: //Colon or 'v'
case 6: break; //Dash for verse spanning.
}}
$pos++;
}
This will give you an array called $bible at the end that holds all the data needed to run a query on an SQL database, or whatever else you might want it for. Hope this helps others.

I know this is crazy talk, but why not just have a form with 4 fields so they can specify:
Book
Chapter
Starting Verse
Ending Verse [optional]

Splitting string by fixed length

I am looking for ways to split a Unicode alphanumeric string into fixed lengths.
For example:
992000199821376John Smith          20070603
and the array should look like this:
Array (
[0] => 99,
[1] => 2,
[2] => 00019982,
[3] => 1376,
[4] => "John Smith",
[5] => 20070603
)
array data will be split like this:
Array[0] - Account type - must be 2 characters long,
Array[1] - Account status - must be 1 character long,
Array[2] - Account ID - must be 8 characters long,
Array[3] - Account settings - must be 4 characters long,
Array[4] - User Name - must be 20 characters long,
Array[5] - Join Date - must be 8 characters long.

Or if you want to avoid preg:
$string = '992000199821376John Smith          20070603'; // name padded to 20 characters
$intervals = array(2, 1, 8, 4, 20, 8);
$start = 0;
$parts = array();
foreach ($intervals as $i)
{
$parts[] = mb_substr($string, $start, $i);
$start += $i;
}
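A variation on the loop above keeps the field names next to their widths and trims the padding. This is a sketch only: parseFixedWidth and the field names are hypothetical, chosen from the layout described in the question, and it assumes the mbstring extension is available.

```php
<?php
// Hypothetical helper: cut a fixed-width record into named, trimmed fields.
// $layout maps field name => width in characters (not bytes).
function parseFixedWidth(string $record, array $layout): array
{
    $fields = array();
    $start = 0;
    foreach ($layout as $name => $width) {
        $fields[$name] = trim(mb_substr($record, $start, $width, 'UTF-8'));
        $start += $width;
    }
    return $fields;
}

$layout = array('type' => 2, 'status' => 1, 'id' => 8, 'settings' => 4, 'name' => 20, 'date' => 8);

// Build the sample record with the name padded to its full 20 characters.
$record = '99' . '2' . '00019982' . '1376' . str_pad('John Smith', 20) . '20070603';
print_r(parseFixedWidth($record, $layout));
```

Note that str_pad() counts bytes, so it is only used here because the sample name is plain ASCII.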

$s = '992000199821376Николай Шмидт       20070603'; // name padded to 20 characters
if (preg_match('~(.{2})(.{1})(.{8})(.{4})(.{20})(.{8})~u', $s, $match))
{
list (, $type, $status, $id, $settings, $name, $date) = $match;
}

Using the substr function would do this quite easily.
$accountDetails = "992000199821376John Smith          20070603";
$accountArray = array(substr($accountDetails,0,2),substr($accountDetails,2,1),substr($accountDetails,3,8),substr($accountDetails,11,4),substr($accountDetails,15,20),substr($accountDetails,35,8));
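That should do the trick, although substr() counts bytes, so for multibyte (Unicode) text you would use mb_substr() with the same offsets. Other than that, regular expressions (as suggested by akond) are probably the way to go (and more flexible). (I figured this was still valid as an alternative option.)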

It is not possible to split a Unicode string at arbitrary fixed positions the way you ask for, at least not without risking invalid parts.
Some code points cannot stand on their own. For example, שׁ is 2 code points (4 bytes in both UTF-8 and UTF-16), and splitting between them leaves a combining mark without its base letter.
When you work with Unicode, "character" is a very slippery term. There are code points, glyphs, etc. See more at http://www.utf8everywhere.org, in the part on "length of a string".
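The different notions of "length" can be compared directly. A minimal sketch, assuming the mbstring extension and a PCRE build with Unicode support; splitGraphemes is a made-up helper name.

```php
<?php
// HEBREW LETTER SHIN (U+05E9) followed by HEBREW POINT SHIN DOT (U+05C1)
$shin = "\u{05E9}\u{05C1}";

echo strlen($shin), "\n";                      // 4 (bytes in UTF-8)
echo mb_strlen($shin, 'UTF-8'), "\n";          // 2 (code points)
echo preg_match_all('/\X/u', $shin, $m), "\n"; // 1 (grapheme cluster)

// Splitting by grapheme cluster (\X) keeps a base letter and its marks together.
function splitGraphemes(string $text): array
{
    preg_match_all('/\X/u', $text, $matches);
    return $matches[0];
}

print_r(splitGraphemes($shin . 'a'));
```

A fixed-width format therefore has to state which of these units its widths are measured in.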

In PHP, what is a fast way to search an array for values which contain a substring?

I have an array of street names sorted alphabetically that I have gathered from a web service. This array exists on the server side.
On the client side, a user starts typing the name of the street he lives on and AJAX is used to return a list of the closest match to the partial street name, plus the next 9 street names in the array (the list is updated while he is typing).
For example, if the user typed "al", I would expect the results to be something like the following:
Albany Hwy
Albens Vale
Alcaston Rd
Alex Wood Dr
Alice Rd
Allawah Ct
Allen Rd
Alloway Pl
Allwood Av
Alola St
Amanda Dr
This is my try at it:
$matches = array();
for($i = 0; $i < count($streetNames); $i++)
{
if( (stripos($streetNames[$i], $input) === 0 && count($matches) == 0) || count($matches) < 10 ){
$matches[] = $streetNames[$i];
} else {
break;
}
}
Does anyone else know a faster way?
Please note: I have no control over how this list is obtained from the database - it's from an external web service.

Use preg_grep() (the i flag makes the match case-insensitive):
$matches = preg_grep('/al/i', $streetNames);
Note: this method, like yours, is a brute-force search. If you're searching a huge list of names (hundreds of thousands) or searching a huge number of times, then you may need something better. For small data sets, however, this is fine.

The only way to get faster than looking through all the strings would be to have a data structure optimized for this kind of thing, a trie. You may not have control over what the webservice gives you, but if you can cache the result on your server and reuse it for serving many requests, then building a trie and using that would be much faster.
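Short of building a trie, the fact that the array is already sorted (as the question states) can be used for a binary search. The sketch below is that simpler technique, not a trie. closestStreets is a hypothetical name, and it assumes the list is ordered the way strcasecmp() orders strings.

```php
<?php
// Hypothetical helper: binary-search a sorted list for the first entry that is
// not less than $prefix, then return that entry and the ones following it.
function closestStreets(array $sorted, string $prefix, int $limit = 10): array
{
    $lo = 0;
    $hi = count($sorted);
    while ($lo < $hi) {
        $mid = intdiv($lo + $hi, 2);
        if (strcasecmp($sorted[$mid], $prefix) < 0) {
            $lo = $mid + 1; // everything up to $mid sorts before the prefix
        } else {
            $hi = $mid;     // $mid could be the first candidate
        }
    }
    return array_slice($sorted, $lo, $limit);
}

// Sample data; 'Abbey Rd' and 'Baker St' are invented for the example.
$streetNames = array('Abbey Rd', 'Albany Hwy', 'Albens Vale', 'Alice Rd', 'Baker St');
print_r(closestStreets($streetNames, 'al', 3));
```

Each request then needs a number of comparisons that grows with the logarithm of the list size rather than a scan of the whole array.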

I think what you're looking for is preg_grep().
You can search either for elements starting with the input text:
$result = preg_grep('/^' . preg_quote($input, '/') . '/i', $streetNames);
or for elements that contain the text anywhere:
$result = preg_grep('/' . preg_quote($input, '/') . '/i', $streetNames);
You can also anchor the search to the end, but that doesn't look so useful here.

Can't really tell if it is faster, but this is my version of it.
$input = 'al';
// create_function() was removed in PHP 8; closures do the same job
$matches = array_filter($streetNames, function ($v) use ($input) { return stripos($v, $input) !== false; });
$weight = array_map(function ($v) use ($input) { return array($v, levenshtein($input, $v)); }, $matches);
uasort($weight, function ($a, $b) { return $a[1] <=> $b[1]; });
$weight = array_slice($weight, 0, 10);
This creates a weighted list of matches. They are sorted according to the distance between the input string and the street name. A distance of 0 represents an exact match.
Resulting array looks like this
array (
0 =>
array (
0 => 'Alola St',
1 => 7,
),
1 =>
array (
0 => 'Allen Rd',
1 => 7,
)
)
Where 0 => street name and 1 => levenshtein distance
