I have a series of keywords in an HTML web page. They are comma separated, so I could get them into CSV, and I would like to know which ones are NOT in another CSV file that is displayed as an HTML web page.
How would you do that comparison? I have ideas involving MySQL and tables, but these are CSV or HTML sources.
Thanks!
In Python, given 2 csv files, a.csv and b.csv, this script will create (or overwrite, if it already exists) a new file out.csv that contains everything in a.csv that's not found in b.csv.
from urllib.request import urlretrieve  # Python 3; on Python 2 use urllib.urlretrieve

url = 'http://www.website.com/x.csv'
urlretrieve(url, 'b.csv')

# Read both keyword lists and strip surrounding whitespace.
with open('a.csv') as file_a, open('b.csv') as file_b:
    list_a = [x.strip() for x in file_a.read().split(',')]
    list_b = [x.strip() for x in file_b.read().split(',')]

list_out = list(set(list_a) - set(list_b))  # Reverse the operands if necessary

with open('out.csv', 'w') as file_out:
    file_out.write(','.join(list_out))
If it is just a list of keywords, you want to do a search and replace (you can use sed) to replace all the commas with newlines. You will end up with a file containing one keyword per line. Do that to both versions of the list. Then use the "join" command:
join -v 1 leftfile rightfile
This will report all the entries in leftfile that are not in rightfile. Don't forget to sort the files first, or join won't work. There is a standard tool for sorting too (it's called, not surprisingly, "sort").
PHP solution:
Get the keywords as strings, convert them into arrays, and use the array_diff function:
<?php
$csv1 = 'a1, a2, a3, a4';
$csv2 = 'a1, a4';

// Trim whitespace so that "a4" and " a4" compare as equal.
$csv1_arr = array_map('trim', explode(',', $csv1));
$csv2_arr = array_map('trim', explode(',', $csv2));

$diff = array_diff($csv1_arr, $csv2_arr); // everything in $csv1 that is not in $csv2
print_r($diff);
?>
I have been scouring the ole interweb for this solution but have not found anything successful. I have a CSV output from one script that presents data in a specific way, and I need to match that and merge it with another file. Bonus points if I can round the numbers to two decimal places.
File 1: dataset1.csv (column 1 is the primary key, i.e. what I want to search the other file for):
5033db62b38f86605f0baeccae5e6cbc,20.875,20.625,41.5
5033d9951846c1841437b437f5a97f0a,3.3529411764705882,12.4117647058823529,13.7647058823529412
50335ab3ab5411f88b77900736338bc6,6.625,1.0625,3
5033db62b38f86605f0baeccae5e6cbc,2.9375,1,1.4375
File 2: dataset2.csv (if column 2 of file 2 matches column 1 of file 1, join column 1 from file 2, replacing the data in column 1 of file 1):
"dc2","5033db62b38f86605f0baeccae5e6cbc"
"dc1","5033d9951846c1841437b437f5a97f0a"
Desired results:
File 1 (or new file3):
dc1,3.35,12.41,13.76
dc2,20.875,20.625,41.5
Just to demonstrate that I have been trying to find a way, and not just randomly asking a question hoping someone else would solve my problem.
I have found a number of resources that say to use join.
join -o 1.1,1.2,1.3,1.4,2.3 file1 file2, etc. I have tested this a number of different ways. I read in a number of posts that the input needs to be sorted; with strings that long, that's a little hard. Not to mention file 1 may have 30 to 40 entries while file 2 may only have 10. I just need a name associated with the long string.
I started looking at grep, but then I would need a foreach loop to cycle through all the results, and there has to be an easier way.
I have also looked at awk. Now this is a fun one, trying to figure out exactly how to make it work:
awk 'FNR==NR {a[$2]; next} $2 in a' file.csv testfile2.csv
Yeah... I've tried many ways to get this to compare, as this seems to be the general idea, but I still haven't got it to work. I would like this to be some kind of shell script for Linux, very simple and something I can call from a PHP page and have it run. Like if the user hits refresh, it churns through the data and digests it.
Any help would be greatly appreciated!
Thank you.
j.
You can use a combination of sort and GNU awk:
mergef.awk:
BEGIN { FS= "[ ,\"]+"; }
FNR == NR { if ( !($1 in vals) ) vals [ $1 ] = sprintf("%.2f,%.2f,%.2f", $2, $3,$4) ;}
FNR != NR { print $2 "," vals[ $3 ]; }
Say your files are f1.csv and f2.csv then use this command:
awk -f mergef.awk f1.csv f2.csv | sort
The first line in the script deals with the quotes present in the second file (because of this field separator there is an empty field $1 for the second file).
The second line reads in the first file. The if takes care that only the first occurrence of a key is used.
The last line prints the new keys from the second file along with the stored values from the first file, retrieved via the old keys.
FNR == NR is true only while the first file is being read.
Using Python and the pandas library:
import pandas as pd

# Read in the csv files.
df1 = pd.read_csv('dataset1.csv', header=None, index_col=0)
df2 = pd.read_csv('dataset2.csv', header=None, index_col=1)

# Round values in the first file to two decimal places.
df1 = df1.round(2)

# Merge the two files on the shared hash index.
df3 = pd.merge(df2, df1, how='inner', left_index=True, right_index=True)

# Write the output.
df3.to_csv('output.csv', index=False, header=False)
Except for formatting the numbers, this does the job:
$ join -t, -1 1 -2 2 -o2.1,1.2,1.3,1.4 <(sort file1) <(tr -d '"' <file2 | sort -t, -k2)
dc1,3.3529411764705882,12.4117647058823529,13.7647058823529412
dc2,2.9375,1,1.4375
dc2,20.875,20.625,41.5
Note that there are two matches for dc2.
Bonus: for the required formatting, pipe the output of the previous command to
$ ... | tr ',' ' ' | xargs printf "%s,%.2f,%.2f,%.2f\n"
dc1,3.35,12.41,13.76
dc2,2.94,1.00,1.44
dc2,20.88,20.62,41.50
But then, perhaps awk is a better alternative. This is just to show that no programming is required if you can leverage the existing Unix toolset.
Here is a solution with PHP:
foreach (file("dataset1.csv") as $line_no => $csv) {
if (!$line_no) continue; // in case you have a header on first line
$fields = str_getcsv($csv);
$key = array_shift($fields);
$data1[$key] = array_map(function ($v) { return number_format($v, 2); }, $fields);
};
foreach (file("dataset2.csv") as $csv) {
$fields = str_getcsv($csv);
if (!isset($data1[$fields[1]])) continue;
$data2[$fields[0]] = array_merge(array($fields[0]), $data1[$fields[1]]);
};
ksort($data2);
$csv = implode("\n", array_map(function ($v) {
return implode(',', $v);
}, $data2));
file_put_contents("dataset3.csv", $csv);
NB: As you mentioned that the first file uses column 1 as a primary key, a duplicate key value should not occur. If it does, the last occurrence prevails.
I'm using the Sebastian Bergmann PHPUnit Selenium WebDriver.
Currently I have:
$csv = file_get_contents('functions.csv', NULL,NULL,1);
var_dump($csv);
// select 1 random line here
This will load my CSV file and give me all the data from the file.
It has multiple rows, for example:
Xenoloog-FUNC/8_4980
Xylofonist-FUNC/8_4981
IJscoman-FUNC/8_4982
Question: how can I get that data randomly?
I just want to use 1 (random) line of data.
Would it be easier to just grab 1 (random) line from the file instead of reading everything?
Split the string into an array, then grab a random index from that array:
$lines = explode("\n", trim($csv)); // trim avoids an empty element after a trailing newline
$item = $lines[array_rand($lines)];
You could use the offset and maxlen parameters to grab part of the file using file_get_contents. You could also use fseek after fopen to jump to a particular part of a file. Both of these take byte counts as arguments. This post has a little more information:
Get part of a file by byte number in php
It may require some hacking to translate a particular row index of a CSV file into a byte offset. You might need to generate and load a small metadata file that lists the byte offset of each row of CSV data. That would probably help.
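Here is a minimal sketch of that idea (the file names and the one-offset-per-line index format are made up for illustration):
<?php
// Build the offset index once: one byte offset per data line.
$dataFile  = 'functions.csv';   // hypothetical data file
$indexFile = 'functions.idx';   // hypothetical index file
$fh = fopen($dataFile, 'rb');
$offsets = array();
while (($pos = ftell($fh)) !== false && fgets($fh) !== false) {
    $offsets[] = $pos;
}
file_put_contents($indexFile, implode("\n", $offsets));

// Later: jump straight to one random line with fseek.
$offsets = file($indexFile, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
fseek($fh, (int) $offsets[array_rand($offsets)]);
echo trim(fgets($fh));
fclose($fh);
The index only pays off if the CSV is large and read often; for a small file, the explode() approach above is simpler.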
I'm looking to read the contents of a file between two tags in a large text file (so I can't read the whole file at once, due to memory restrictions imposed by my server provider). This file has around 500,000 lines of text.
This (PHP: Read Specific Line From File) isn't an option, I don't think, as the text I need to read varies in length and spans multiple lines (anywhere from 20 to 5000).
I am planning to use fopen, fread (read only) and fclose to read the file contents. I already have experience using these functions.
I am looking to read all the contents in a selected part of the file, i.e.:
File contents example
<<TAGNAME-1>>AAAA AAAA AAAA<<//TAGNAME-1>>
<<TAGNAME-2>>TEXT TEXT TEXT<<//TAGNAME-2>>
I want to select the text "AAAA AAAA AAAA" between <<TAGNAME-1>> and <<//TAGNAME-1>> when TAGNAME-1 is passed as a variable in my script.
How could I go about selecting all the text between the two tags that I require (and ignoring the remainder of the file)? I can create the two tags as required in my PHP script; my issue is implementing this within the fread function.
You could grep the text file, which will return only the lines containing a matching tag.
$tagnum = 2; // variable
$pattern = "<<TAGNAME-";
$searchstr = $pattern . $tagnum; // concatenate the prefix with the tag number
$fpath = "testtext.txt"; // path to the text file
// Note: exec() returns only the last line of output; pass an array as the
// second argument if you need to capture all matching lines.
$result = exec('grep -in "' . $searchstr . '" ' . $fpath);
echo $result;
Where $tagnum selects the tag to search for. I've tested it in my sandbox and it works as expected. Note this will read the whole line until the end tag or a newline is reached.
Regards,
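If you'd rather stick with the fopen/fread plan from the question, here is a minimal chunked-read sketch. The file name, tag name, and chunk size are assumptions for illustration; a small tail is kept in the buffer so a tag split across two chunks is still found.
<?php
function read_between($path, $tag, $chunkSize = 8192) {
    $open  = "<<$tag>>";
    $close = "<<//$tag>>";
    $fh = fopen($path, 'rb');
    $buffer = '';
    $collecting = false;
    $out = '';
    while (($chunk = fread($fh, $chunkSize)) !== false && $chunk !== '') {
        $buffer .= $chunk;
        if (!$collecting) {
            $start = strpos($buffer, $open);
            if ($start === false) {
                // Keep only a tail that could hold a partial opening tag.
                $buffer = substr($buffer, -strlen($open));
                continue;
            }
            $buffer = substr($buffer, $start + strlen($open));
            $collecting = true;
        }
        $end = strpos($buffer, $close);
        if ($end !== false) {
            fclose($fh);
            return $out . substr($buffer, 0, $end);
        }
        // No closing tag yet: flush all but a tail that might hold part of it.
        $out .= substr($buffer, 0, -strlen($close));
        $buffer = substr($buffer, -strlen($close));
    }
    fclose($fh);
    return null; // tag pair not found
}

echo read_between('largefile.txt', 'TAGNAME-1');
Only one chunk plus a small tail is ever held in memory, so the 500,000-line file is never loaded at once.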
I have exported my SQL dump in CSV format. Suppose my schema was name,email,country: I want to remove the email column and all its data from the CSV. What would be the most efficient way to do that, either using a tool or any technique? I tried to load the dump in Excel but it didn't look right.
Thanks
You could copy the table inside the MySQL database, delete the email column using a MySQL client, and export back to CSV.
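A rough sketch of that route via PDO (the table name, credentials, and output path are placeholders; SELECT ... INTO OUTFILE also needs the FILE privilege and a path the MySQL server may write to):
<?php
$pdo = new PDO('mysql:host=localhost;dbname=mydb', 'user', 'pass');
// Copy the table, then drop the unwanted column from the copy.
$pdo->exec('CREATE TABLE users_export AS SELECT * FROM users');
$pdo->exec('ALTER TABLE users_export DROP COLUMN email');
// Export the trimmed copy back to CSV.
$pdo->exec("SELECT * FROM users_export
            INTO OUTFILE '/tmp/newdump.csv'
            FIELDS TERMINATED BY ',' ENCLOSED BY '\"'
            LINES TERMINATED BY '\n'");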
Importing to Excel should work with well-formed data; you might need to consider alternative delimiters if your data contains commas (such as in addresses). If possible, use an alternative delimiter, add quote marks around troublesome fields, or shift to fixed-width output.
Any tool you write or use will need to parse your data, and that will always be an issue if the delimiter is scattered through the data.
Alternatively, rewrite the view / select / procedure that is generating the data set in the first place.
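If you end up writing the CSV yourself, fputcsv handles the quoting for you. A small sketch with made-up data:
<?php
$fh = fopen('php://stdout', 'w');
// Fields containing the delimiter are quoted automatically:
// Jane Doe,"12 Main St, Springfield",US
fputcsv($fh, array('Jane Doe', '12 Main St, Springfield', 'US'));
fclose($fh);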
This command should do it (assuming a Unix-like OS):
$ cut -d ',' -f 1,3- dump.csv > newdump.csv
UPDATE: DevZer0 is right, this is unfit for the general case (a quoted field can itself contain commas). So you could do this instead (it's tested):
#!/usr/bin/env perl
use strict;
use warnings;
use Text::ParseWords;

my $file = 'dump.csv';
my $indexOfFieldToBeRemoved = 1; # 0, 1, ...

open(my $fh, '<', $file) or die "Can't read file '$file' [$!]\n";
while (my $line = <$fh>) {
    chomp $line;
    # parse_line() copes with quoted fields that contain commas.
    my @fields = Text::ParseWords::parse_line(',', 0, $line);
    splice(@fields, $indexOfFieldToBeRemoved, 1);
    print join(',', map { "\"$_\"" } @fields), "\n";
}
close $fh;
Sorry, nothing simpler (if you can't re-generate the CSV dump, as suggested)...
Do you know about the Australian Bankers' Association (.aba) file format? It is used for batch transactions and is quite similar to CSV. However, what I don't understand is how the columns are separated from each other. For example, in CSV files we use delimiters like commas or semicolons. Also, I can't find a sample file. Here is a link that could help you help me quickly if you don't already know the format:
http://www.cemtexaba.com/aba-format/cemtex-aba-file-format-details.html
However, what I don't understand is how the columns are separated from each other. For example, in CSV files we use delimiters like commas or semicolons.
It is similar to CSV, but it is a plain text file consisting of strings and lines...
Here are some ready-made solutions:
Symfony 2 bundle -> https://github.com/latysh/aba-bundle
php library -> https://github.com/simonblee/aba-file-generator
It looks like a simple file format. Instead of thinking of it as a CSV file, which is delimited using some symbol, think of it as a string of characters.
So, if you have an ABA file then you can parse it using fopen() and fread().
<?php
$fh = fopen('example.aba', 'rb');
// Field widths follow the descriptive (type 0) record of the spec linked above.
$block1 = fread($fh, 1);  // record type, '0' for the descriptive record
$block2 = fread($fh, 17); // blank filler
$block3 = fread($fh, 2);  // reel sequence number
$block4 = fread($fh, 3);  // financial institution abbreviation
// And so on...
Of course it would make sense to also have some mechanism that validates the data, and makes sure that the file is not corrupted, but this is just a simple example.
I know this is a late reply, but for other people coding for the ABA file format,
http://www.cemtexaba.com/aba-format/ has a sample file.
And this file has a great explanation of each of the fields:
https://github.com/mjec/aba/blob/master/sample-with-comments.aba (but don't treat it as your single source of truth).
As Sverri M. Olsen mentioned, the columns are not separated by a delimiter; instead, each field simply occupies the fixed width given in the specification.
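That makes substr() the natural tool. A minimal sketch for a detail (type 1) record; the offsets below follow the common layout but are illustrative, so verify them against the spec before relying on them:
<?php
function parse_detail_record($line) {
    return array(
        'record_type' => substr($line, 0, 1),    // '1' for a detail record
        'bsb'         => substr($line, 1, 7),    // e.g. "063-000"
        'account'     => trim(substr($line, 8, 9)),
        'indicator'   => substr($line, 17, 1),
        'tx_code'     => substr($line, 18, 2),
        'amount'      => (int) substr($line, 20, 10) / 100, // cents to dollars
    );
}

foreach (file('example.aba', FILE_IGNORE_NEW_LINES) as $line) {
    if ($line !== '' && $line[0] === '1') { // only detail records
        print_r(parse_detail_record($line));
    }
}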