Finding duplicate column values in a CSV - php

I'm importing a CSV that has 3 columns; one of these columns could have duplicate records.
I have 2 things to check:
1. The field 'NAME' is not null and is a string
2. The field 'ID' is unique
So far, I'm parsing the CSV file once and checking 1. (NAME is valid); if that check fails, it simply breaks out of the while loop and stops.
I guess the question is: how would I check that ID is unique?
I have fields like the following:
NAME, ID,
Bob, 1,
Tom, 2,
James, 1,
Terry, 3,
Joe, 4,
This would output something like "Duplicate ID on line 3".
P.S. This CSV file has more columns and can have around 100,000 records. I have simplified it here to focus on the duplicate column/field check.
Thanks

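<?php
$cnt = 0;
$arr = array();
$arrdup = array(); // initialise so print_r() also works when there are no duplicates
if (($handle = fopen("1.csv", "r")) !== FALSE) {
    while (($data = fgetcsv($handle, 1000, ",")) !== FALSE) {
        $num = count($data);
        $cnt++;
        for ($c = 0; $c < $num; $c++) {
            $value = trim($data[$c]); // " 1" and "1" should count as the same ID
            if (is_numeric($value)) {
                if (array_key_exists($value, $arr)) {
                    $arrdup[] = "duplicate value at ".($cnt-1);
                } else {
                    // remember the ID together with the preceding column (the name), if there is one
                    $arr[$value] = $c > 0 ? $data[$c-1] : null;
                }
            }
        }
    }
    fclose($handle);
}
print_r($arrdup);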

Give it a try:
$row = 1;
$totalIDs = array();
if (($handle = fopen('/tmp/test1.csv', "r")) !== FALSE)
{
    while (($data = fgetcsv($handle)) !== FALSE)
    {
        $name = '';
        if (isset($data[0]) && $data[0] != '')
        {
            $name = $data[0];
            if (is_numeric($data[0]) || !is_string($data[0]))
                echo "Name is not a string for row $row\n";
        }
        else
        {
            echo "Name not set for row $row\n";
        }
        $id = '';
        if (isset($data[1]))
        {
            $id = $data[1];
        }
        else
        {
            echo "ID not set for row $row\n";
        }
        if (isset($totalIDs[$id])) {
            echo "Duplicate ID on line $row\n";
        }
        else {
            $totalIDs[$id] = 1;
        }
        $row++;
    }
    fclose($handle);
}

I assumed a certain design and stripped out the CSV part, but the idea remains the same:
<?php
/* Let's make an array of 100,000 rows (be careful, you might run into memory issues with this, issues you won't have with a CSV read line by line) */
$arr = [];
for ($i = 0; $i < 100000; $i++) {
    $arr[] = [rand(0, 1000000), 'Hey'];
}
/* Now let's have fun */
$ids = [];
foreach ($arr as $line => $couple) {
    // isset() avoids an "undefined array key" warning for IDs that have not been seen yet
    if (isset($ids[$couple[0]])) {
        echo "Id " . $couple[0] . " on line " . $line . " already used<br />";
    } else {
        $ids[$couple[0]] = true;
    }
}
?>
100,000 rows isn't that many, so this will be enough. (It ran in 3 seconds on my machine.)
EDIT: As pointed out, in_array is less efficient than key lookup. I've updated my code consequently.

Are the IDs sorted with possible duplicates in between or are they randomly distributed?
If they are sorted and there are no holes in the list (1,2,3,4 is OK; 1,3,4,7 is NOT OK), then just store the last ID you read and compare it with the current ID. If the current ID is equal to or less than the last one, then it's a duplicate.
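A minimal sketch of that sequential case, assuming the IDs are integers that increase without gaps and that duplicates are interleaved among them. The function name is hypothetical, and it tracks the highest ID seen so far rather than only the previous one, which is an adaptation so that several repeated older IDs in a row are still caught:

```php
<?php
// Hypothetical helper: returns the positions (0-based) of IDs that were already seen.
// Assumption: new IDs always arrive in increasing order, so anything that is
// not larger than the highest ID seen so far must be a repeat.
function findDuplicatesInSequentialIds(array $ids): array
{
    $duplicates = [];
    $highest = null;
    foreach ($ids as $position => $id) {
        if ($highest !== null && $id <= $highest) {
            $duplicates[] = $position; // equal to or less than the highest: duplicate
        } else {
            $highest = $id;            // a genuinely new ID
        }
    }
    return $duplicates;
}

print_r(findDuplicatesInSequentialIds([1, 2, 1, 3, 4]));
```

This needs constant memory, but it silently gives wrong answers if the no-holes assumption does not hold.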
If the IDs are in random order then you'll have to store them in an array. You have multiple options here. If you have plenty of memory just store the ID as a key in a plain PHP array and check it:
$ids = array();
// ... read and parse CSV
if (isset($ids[$newId])) {
    // you have a duplicate
} else {
    $ids[$newId] = true; // new value, not a duplicate
}
PHP arrays are hash tables and have a very fast key lookup. Storing IDs as values and searching with in_array() will hurt performance a lot as the array grows.
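If you have to save memory and you know the number of lines you are going to read from the CSV, you could use SplFixedArray instead of a plain PHP array. Note that SplFixedArray only accepts integer indexes from 0 up to its size minus 1, so this only works if the IDs are integers within that range. The duplicate check would otherwise be the same as above.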
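As a rough sketch of the key-lookup approach applied to a stream read line by line: the function name, the header handling, and the ID column index are assumptions for illustration, not part of the original question. Line numbers here count the header as line 1.

```php
<?php
// Hypothetical helper: scans a CSV stream and reports duplicate IDs.
// Assumptions: the first row is a header and the ID sits in column 1.
function findDuplicateIds($handle, int $idColumn = 1): array
{
    $seen = [];     // ID => line number where it first appeared
    $messages = [];
    $line = 0;
    while (($data = fgetcsv($handle, 0, ',', '"', '\\')) !== false) {
        $line++;
        if ($line === 1 || !isset($data[$idColumn])) {
            continue; // skip the header and rows without an ID column
        }
        $id = trim($data[$idColumn]);
        if ($id === '') {
            continue; // nothing to compare
        }
        if (isset($seen[$id])) {
            $messages[] = "Duplicate ID $id on line $line (first seen on line {$seen[$id]})";
        } else {
            $seen[$id] = $line;
        }
    }
    return $messages;
}

// Usage with an in-memory stream standing in for the real file
$handle = fopen('php://memory', 'w+');
fwrite($handle, "NAME, ID,\nBob, 1,\nTom, 2,\nJames, 1,\nTerry, 3,\nJoe, 4,\n");
rewind($handle);
$messages = findDuplicateIds($handle);
fclose($handle);
print_r($messages);
```

Only the IDs are kept in memory, not the rows, so the memory cost grows with the number of distinct IDs rather than with the file size.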

Related

PHP & CSV - Echo Result only once

I am fairly new to PHP and have spent several hours trying to get something going, sadly without a result. I hope you can point me in the right direction.
What I have is a CSV file containing articles. They are separated into different columns and always have the same structure, for example:
ArtNo, ArtName, ColorCode, Color, Size
When an article has different color codes in the CSV, the article is simply repeated with the same information except for the color code. See an example:
ABC237;Fingal Edition;48U;Nautical Blue;S - 5XL;
ABC237;Fingal Edition;540;Navy;S - 5XL;
My problem is that I want to display all the articles in a table, including an article image etc. So far I have that working, but instead of showing the article twice for every different color code, I want to create only one line per ArtNo (first CSV line) while still reading the second, duplicate line to add its color to the first one, like:
ABC237; Fingal Edition ;540;Nautical Blue, Navy;S - 5XL;
Is this even possible, or am I going in completely the wrong direction here? My code looks like this:
<?php
$csv = readCSV('filename.csv');
foreach ($csv as $c) {
$artNo = $c[0]; $artName = $c[1]; $colorCode = $c[2]; $color = $c[3]; $sizes = $c[4]; $catalogue = $c[5]; $GEP = $c[6]; $UVP = $c[7]; $flyerPrice = $c[8]; $artDesc = $c[9]; $size1 = $c[10]; $size2 = $c[11]; $size3 = $c[12]; $size4 = $c[13]; $size5 = $c[14]; $size6 = $c[15]; $size7 = $c[16]; $size8 = $c[17]; $picture = $c[0] . "-" . $c[2] . "-d.jpg";
// Echo HTML Stuff
}
?>
Read CSV Function
<?php
function readCSV($csvFile){
$file_handle = fopen($csvFile, 'r');
while (!feof($file_handle) )
{
$line_of_text[] = fgetcsv($file_handle, 0, ";");
}
fclose($file_handle);
return $line_of_text;
}
?>
I tried to get along with array_unique etc but couldn't find a proper solution.
Read all the data into an array, using the article number as the key....
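$data = array();
// fgetcsv() returns false at the end of the file, so test its result instead of feof()
while (($values = fgetcsv($file_handle, 0, ";")) !== false) {
    if ($values === array(null)) continue; // skip blank lines
    $artno = array_shift($values);
    if (!isset($data[$artno])) $data[$artno] = array();
    $data[$artno][] = $values;
}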
And then output it:
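foreach ($data as $artno => $rows) {
    $first = $rows[0]; // name, color code and sizes come from the first row of the group
    // the color is at index 2 once the article number has been shifted off
    $first[2] = implode(", ", array_column($rows, 2));
    print $artno . ";" . implode(";", $first) . "\n";
}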
(code not tested, YMMV)
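A self-contained variant of the same grouping idea, as a sketch only: the function name is hypothetical, the column order ArtNo;ArtName;ColorCode;Color;Size is taken from the question, and keeping the first row's color code is an arbitrary choice.

```php
<?php
// Hypothetical helper: one output line per article number, colors joined with ", ".
// Assumption: each line has at least the five columns named in the question.
function mergeColours(string $csv): array
{
    $grouped = [];
    foreach (preg_split('/\R/', trim($csv)) as $line) {
        $row = str_getcsv($line, ';', '"', '\\');
        if (count($row) < 5) {
            continue; // ignore blank or incomplete lines
        }
        $artNo = $row[0];
        if (!isset($grouped[$artNo])) {
            $grouped[$artNo] = $row;          // first row supplies name, code and size
            $grouped[$artNo][3] = [$row[3]];  // colors are collected in a list
        } elseif (!in_array($row[3], $grouped[$artNo][3], true)) {
            $grouped[$artNo][3][] = $row[3];  // add each new color once
        }
    }
    $lines = [];
    foreach ($grouped as $row) {
        $row[3] = implode(', ', $row[3]);
        $lines[] = implode(';', $row);
    }
    return $lines;
}

$csv = "ABC237;Fingal Edition;48U;Nautical Blue;S - 5XL\n"
     . "ABC237;Fingal Edition;540;Navy;S - 5XL";
print_r(mergeColours($csv));
```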
You need to know exactly how many items belong to each ArtNo group. This means a loop to group, and another loop to display.
When grouping, I steal the ArtNo from the row of data and use it as the grouping key. The remaining data in the row will be an indexed subarray of that group/ArtNo.
I am going to show you some printf() and sprintf() syntax to keep things clean. printf() will display the first parameter's content, using any subsequent values to replace the placeholders in the string. In this case, the 2nd parameter is a conditional expression. On the first iteration of the group ($i = 0), we want to show the ArtNo as the first cell of the row and declare the number of rows that it should span. sprintf() is just like printf() except that it returns the string instead of printing it. Upon any subsequent iterations of the group, $i will be greater than zero and therefore an empty string is passed as the value.
Next, I'm going to use implode() which is beautifully flexible when you don't know exactly how many columns your table will have (or if the number of columns may change during the lifetime of your project).
Tested Code:
$csv = <<<CSV
ABC237;Fingal Edition;48U;Nautical Blue;S - 5XL
ABC236;Fingal Edition;540;Navy;S - 5XL
ABC237;Fingal Edition;49U;Sea Foam;L - XL
ABC237;Fingal Edition;540;Navy;S - 5XL
CSV;
$lines = explode(PHP_EOL, $csv);
foreach ($lines as $line) {
    $row = str_getcsv($line, ';');
    $grouped[array_shift($row)][] = $row;
}
echo '<table>';
foreach ($grouped as $artNo => $group) {
    foreach ($group as $i => $values) {
        printf(
            '<tr>%s<td>%s</td></tr>',
            (!$i ? sprintf('<td rowspan="%s">%s</td>', count($group), $artNo) : ''),
            implode('</td><td>', $values)
        );
    }
}
echo '</table>';

More elegant way of looping through array and aggregating result in PHP

I need to loop through a set of data (example below) and generate an aggregate. Original data format is CSV (but could be other kind).
LOGON;QUERY;COUNT
L1;Q1;1
L1;Q1;2
L1;Q2;3
L2;Q2;1
I need to group the quantities by LOGON and QUERY, so at the end I would have an array like:
"L1-Q1" => 3,
"L1-Q2" => 3,
"L2-Q2" => 1,
I usually use code like this:
$logon = NULL;
$query = NULL;
$count = 0;
$result = array();
// just imagine I get each line as a named array
foreach ($csvline as $line) {
if ($logon != $line['logon'] || $query != $line['query']) {
if ($logon !== NULL) {
$result[$logon . $query] = $count;
}
$logon = $line['logon'];
$query = $line['query'];
$count = 0;
}
$count += $line['count'];
}
$result[$logon . $query] = $count;
Honestly, I don't think this is nice, as I have to repeat the last statement to include the last line. So, is there a more elegant way of solving this in PHP?
Thanks!
You simply need to check for the existence of a key and then increment; create missing keys with the value 0 whenever you first encounter them.
Then you don't need to repeat anything:
$result = array();
foreach ($csvline as $line) {
    if (!isset($result[$line['logon'] . $line['query']])) {
        // create entry
        $result[$line['logon'] . $line['query']] = 0;
    }
    // increment, no matter what we encounter
    $result[$line['logon'] . $line['query']] += $line['count'];
}
For readability and to avoid mistakes, you should generate the key just once instead of performing the same concatenation over and over:
$result = array();
foreach ($csvline as $line) {
    $curKey = $line['logon'] . $line['query'];
    if (!isset($result[$curKey])) {
        // create entry
        $result[$curKey] = 0;
    }
    // increment, no matter what we encounter
    $result[$curKey] += $line['count'];
}
This would allow you to refactor the key without touching several lines of code.
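The same idea can be written more compactly with the null coalescing operator (PHP 7 and later). This is a sketch under assumptions: the function name is hypothetical, and the "-" separator is added to match the keys shown in the question and to keep pairs such as ("L1", "1Q") and ("L11", "Q") from producing the same key.

```php
<?php
// Hypothetical helper: sums 'count' per logon/query pair.
function aggregateCounts(array $rows): array
{
    $result = [];
    foreach ($rows as $row) {
        $key = $row['logon'] . '-' . $row['query'];
        // ?? 0 supplies the starting value the first time a key is seen
        $result[$key] = ($result[$key] ?? 0) + (int) $row['count'];
    }
    return $result;
}

$rows = [
    ['logon' => 'L1', 'query' => 'Q1', 'count' => '1'],
    ['logon' => 'L1', 'query' => 'Q1', 'count' => '2'],
    ['logon' => 'L1', 'query' => 'Q2', 'count' => '3'],
    ['logon' => 'L2', 'query' => 'Q2', 'count' => '1'],
];
$totals = aggregateCounts($rows);
print_r($totals);
```

Unlike the loop in the question, this does not depend on the rows being sorted by logon and query.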

count rows of a file with same first character

In a comma-separated CSV file, I want to count the number of rows that share the same first number.
The following is example data. I want to get the number of rows which start with 2 (2 rows) and the number of rows which start with 4 (3 rows). This is just an example; the numbers are random:
2,0,0
2,1,0
4,0,0
4,3,0
4,4,0
I'm trying the following code. I can count all rows of the file, but I do not know how to count only the rows which have the same first number.
$i = 0;
while ($i < 5) { //fixed number of times
$i++;
$rows = 0;
$fp = fopen("test.csv", "r");
while (fgetcsv($fp)) { //don't want all rows
$rows++;
}
fclose($fp);
echo $rows;
}
Edit:
Sorry, I forgot to mention the numbers in above file are random, they are not always 2 or 4.
You will need an array to get statistics.
$i = 0;
while ($i < 5) {
    $i++;
    $agg = []; // to count stats; reset on every pass, otherwise the counts accumulate
    $fp = fopen("test.csv", "r");
    while ($line = fgetcsv($fp)) {
        $number = $line[0]; // get first number
        if (isset($agg[$number])) { // if there is the number in stats
            $agg[$number]++; // count new one
        } else {
            $agg[$number] = 1; // mark as found one
        }
    }
    fclose($fp);
    print_r($agg);
}
You could try something like this:
$i = 0;
while ($i < 5) { // fixed number of times
    $i++;
    $rows = array();
    $fp = fopen("test.csv", "r");
    while ($data = fgetcsv($fp)) {
        $first = $data[0];
        if (isset($rows[$first])) {
            $rows[$first] += 1;
        } else {
            $rows[$first] = 1;
        }
    }
    fclose($fp);
    print_r($rows);
}
I didn't test my code, so it may contain errors.
After execution, $rows will contain an associative array of the form 'first number' => 'count'.
What I don't understand, however, is why you are doing it five times, and why you are using a while loop for that instead of a for loop.
Assuming you know how to get the contents of the text file, something like this should work:
$ar = explode("\n", $txtfileContents);
$counts = array();
foreach ($ar as $row) {
    if ($row === '') continue; // skip empty lines, e.g. after a trailing newline
    $counts[$row[0]] = ($counts[$row[0]] ?? 0) + 1;
}
$counts now contains an array of all first characters and how many times they appeared. You can now simply access them like this:
echo "'2' appeared ".$counts['2']." times<br/>";
echo "'4' appeared ".$counts['4']." times";
Step by step:
Split the text file into an array of individual rows:
$ar = explode("\n", $txtfileContents);
(Alternatively, try file() here, which may be more reliable than explode().)
Loop through it:
foreach ($ar as $row) {
}
Then check the first character inside the loop and count each character separately, starting from 0 the first time a character is seen:
$counts[$row[0]] = ($counts[$row[0]] ?? 0) + 1;
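If the first field can have more than one digit, counting the first character is not enough. A sketch that counts the whole first field instead, using str_getcsv() and array_count_values(); the function name is hypothetical and the input is assumed to be the file contents as one string:

```php
<?php
// Hypothetical helper: counts how often each first-column value occurs.
function countByFirstColumn(string $csv): array
{
    $firsts = [];
    // \R matches any line ending; PREG_SPLIT_NO_EMPTY drops blank lines
    foreach (preg_split('/\R/', $csv, -1, PREG_SPLIT_NO_EMPTY) as $line) {
        $firsts[] = str_getcsv($line, ',', '"', '\\')[0];
    }
    return array_count_values($firsts);
}

$counts = countByFirstColumn("2,0,0\n2,1,0\n4,0,0\n4,3,0\n4,4,0\n");
print_r($counts);
```

Numeric strings used as array keys are converted to integers by PHP, so the result is keyed by 2 and 4 rather than "2" and "4".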

Create PHP associative array with any number of associations

I am writing code that reads in .csv files and creates associative arrays in PHP. I want to organize the array such that each column (in order) before the last column is a level of an associative array. In the non-generalized case:
$data = array();
$file =fopen("data.csv", "r");
while (($line = fgetcsv($file)) !== FALSE) {
$var1 = $line[0];
$var2 = $line[1];
$value = $line[2];
$data[$var1][$var2] = $value;
}
I want to be able to do this regardless of the number of columns there are (i.e., var1... varN). It will be organized such that variables (columns) 1 to N uniquely identify each row, and the desired value is always the last column.
Just put another loop inside your while loop that iterates over however many columns are in the line you read from the file. That should be enough.
I was able to do this by building the line of code and using the eval statement:
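$str = '$allData["data"]';
// every column except the last becomes one nesting level
for ($x = 0; $x < $numVars - 1; $x++) {
    // var_export() quotes and escapes the key so it cannot break the generated code
    $str .= '[' . var_export($line[$x], true) . ']';
}
// the last column is the value; it has to be quoted as well, or string values fail
$str .= ' = ' . var_export($line[$numVars - 1], true) . ';';
eval($str);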
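The same result can be reached without eval() by walking down the array with a reference. This is a sketch of that alternative technique, not the approach used above; the function name is hypothetical and each line is assumed to have at least two columns.

```php
<?php
// Hypothetical helper: stores the last column of $line in $data,
// nested under keys taken from all preceding columns.
function setNested(array &$data, array $line): void
{
    $value = array_pop($line);   // the last column is the value
    $ref = &$data;
    foreach ($line as $key) {    // the remaining columns are the nesting levels
        if (!isset($ref[$key]) || !is_array($ref[$key])) {
            $ref[$key] = [];
        }
        $ref = &$ref[$key];      // step one level deeper
    }
    $ref = $value;
}

$data = [];
setNested($data, ['a', 'b', 'c', '1']);
setNested($data, ['a', 'b', 'd', '2']);
setNested($data, ['x', 'y', 'z', '3']);
print_r($data);
```

Inside the fgetcsv() loop this would be called once per line, for example `setNested($data, $line);`.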

about 20% of time opendir script fails. See example

Hit refresh several times and see sometimes I get "null".
This script loops through a folder to get all mp3 files and randomly selects one.
What am I doing wrong? Thanks
if ($handle = opendir('../../hope/upload/php/files/')) {
while (false !== ($entry = readdir($handle))) {
$entry = trim($entry);
if(preg_match('/.mp3/', $entry))
{
$mp3[] = "$entry";
}
}
closedir($handle);
$count = count($mp3);
$rand = rand(0,$count -1); /// FIXED BY adding a -1 after count**
$mp3 = $mp3[$rand];
if($mp3)
{
echo "http://MyWebsite.com/hope/upload/php/files/$mp3";
}
else
{
echo "null";
}
}
This is happening because array indexes go from 0 to length - 1, but your script is generating a random index from 0 to length. The preferred way to fix this would be to use array_rand():
$rand = array_rand($mp3);
$mp3 = $mp3[$rand];
Your random range is off (the maximum is the result of count(), and remember that the count of a 0-based array is one higher than its highest index), and your code is more verbose than it needs to be.
Try...
$mp3s = glob('../../hope/upload/php/files/*.mp3');
$key = array_rand($mp3s);
$randomMp3 = $mp3s[$key];
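One thing to guard against with array_rand() is an empty list, for instance when the folder contains no mp3 files. A hedged sketch, where the helper name is hypothetical and the path and URL are the ones from the question:

```php
<?php
// Hypothetical helper: returns a random element, or null when the list is empty.
function pickRandom(array $files): ?string
{
    if ($files === []) {
        return null; // array_rand() must not be called on an empty array
    }
    return $files[array_rand($files)];
}

// glob() can return false on error, so fall back to an empty array
$mp3s = glob('../../hope/upload/php/files/*.mp3') ?: [];
$randomMp3 = pickRandom($mp3s);
echo $randomMp3 === null
    ? "null"
    : "http://MyWebsite.com/hope/upload/php/files/" . basename($randomMp3);
```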
