generating mysql statistics - php

I have a csv file, which is generated weekly, and loaded into a mysql database. I need to make a report, which will include various statistics on the records imported. The first such statistic is how many records were imported.
I use PHP to interface with the database, and will be using php to generate a page showing such statistics.
However, the csv files are imported via a mysql script, quite separate from any PHP.
Is it possible to calculate the records that were imported and store the number in a different field/table, or some other way?
Adding an additional time field to work out which rows were added after a certain time is not possible, as the structure of the database cannot be changed.
Is there a query I can use while importing from a mysql script, or a better way to generate/count the number of imported records from within php?

You can get the number of records in a table using the following query.
SELECT COUNT(*) FROM tablename
So what you can do is you can count the number of records before the import and after the import and then select the difference like so.
$before_count = mysql_fetch_assoc(mysql_query("SELECT COUNT(*) AS c FROM tablename"));
// Run mysql script
$after_count = mysql_fetch_assoc(mysql_query("SELECT COUNT(*) AS c FROM tablename"));
$records_imported = $after_count['c'] - $before_count['c'];
You could do this all from the MySQL script if you like, but I think doing it in PHP turns out to be a bit cleaner.
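If you are on a newer PHP where the old mysql_* extension is unavailable, a minimal sketch of the same idea using PDO (assuming an existing $pdo connection and that the import target is tablename) would be:
// Count the rows before and after the import and take the difference.
$before = (int) $pdo->query("SELECT COUNT(*) FROM tablename")->fetchColumn();

// ... run the weekly MySQL import script here ...

$after = (int) $pdo->query("SELECT COUNT(*) FROM tablename")->fetchColumn();
$records_imported = $after - $before;
Note that this only measures the net change in row count, so any deletions that happen during the import would skew the figure.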

A bit of a barrel-scraper, but depending on permissions you could edit the cron-executed MySQL script to output some pre-update stats into a file using INTO OUTFILE, and then parse the resulting file in PHP. You'd then have the 'before' stats and could run the stats queries against the database via PHP to obtain the 'after' stats.
However, as with many of these approaches, it'll be next to impossible to detect updates to existing rows this way. (Although new rows should be trivial to detect.)

Not really sure what you're after, but here's a bit more detail:
1) Get MySQL to export the relevant stats to a known directory using SELECT ... INTO OUTFILE. This directory would need to be readable/writable by the MySQL user/group and your web server's user/group (or whatever user/group you're running PHP as, if you're going to automate the CLI via cron on a weekly basis). The file should be in CSV format and datestamped as "stats_export_YYYYMMDD.csv".
2) Get PHP to scan the export directory for files beginning with "stats_export_", perhaps using the "scandir" function with a simple substr test. You can then add each matching filename to an array. Once you've run out of files, sort the array to ensure it's in date order.
3) Read the stats data from each of the files listed in the array in turn using fgetcsv (see the sketch after this list). It would be wise to place this data into a clean array which also contains the relevant datestamp as extracted from the filename.
4) At this point you'll have a summary of the stats at the end of each day in an array. You can then execute the relevant stats SQL queries again (if required) directly from PHP and add the results to the data array.
5) Compare/contrast and output as required.
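A minimal sketch of steps 2 and 3, assuming the exports land in /var/mysql-exports (the directory, like the exact CSV layout, is an assumption):
// Hypothetical export directory; point this at wherever SELECT ... INTO OUTFILE writes.
$dir = '/var/mysql-exports';
$files = array();
foreach (scandir($dir) as $file) {
    // keep only files named stats_export_YYYYMMDD.csv
    if (substr($file, 0, 13) === 'stats_export_') {
        $files[] = $file;
    }
}
sort($files); // datestamped names sort into date order

$stats = array();
foreach ($files as $file) {
    $date = substr($file, 13, 8); // extract YYYYMMDD from the filename
    if (($handle = fopen("$dir/$file", 'r')) === false) {
        continue;
    }
    while (($row = fgetcsv($handle)) !== false) {
        $stats[] = array('date' => $date, 'row' => $row);
    }
    fclose($handle);
}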

Load the files using PHP and 'LOAD DATA INFILE .... INTO TABLE ..', and then get the number of imported rows using mysqli_affected_rows() (or mysql_affected_rows()).
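A minimal sketch of that approach, assuming an existing mysqli connection, that LOCAL INFILE is enabled, and that the CSV has a header row (the file path and table name are placeholders):
// LOAD DATA reports how many rows it loaded via affected_rows.
$sql = "LOAD DATA LOCAL INFILE '/path/to/weekly.csv'
        INTO TABLE tablename
        FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"'
        LINES TERMINATED BY '\\n'
        IGNORE 1 LINES";

if ($mysqli->query($sql)) {
    $records_imported = $mysqli->affected_rows;
    echo "Imported $records_imported records";
}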

Related

Optimizing EXCEL + MySQL Processing

I have a module in my application whereby a user will upload an Excel sheet with around 1000-2000 rows. I am using excel-reader to read the Excel file.
In the Excel file there are the following columns:
1) SKU_CODE
2) PRODUCT_NAME
3) OLD_INVENTORY
4) NEW_INVENTORY
5) STATUS
I have a MySQL table inventory which contains the data for the SKU codes:
1) SKU_CODE : VARCHAR(100), primary key
2) NEW_INVENTORY : INT
3) STATUS : 0/1 BOOLEAN
There are two options available with me:
Option 1: Process all the records from PHP, extract all the sku_codes, and do a single MySQL IN query:
Select * from inventory where SKU_CODE in ('xxx','www','zzz'.....so on ~ 1000-2000 values);
- Single query
Option 2: Process each record one by one, querying for the current SKU's data:
Select * from inventory where SKU_CODE = 'xxx';
..
...
around 1000-2000 queries
So can you please help me choose the best way of achieving the above task, with a proper explanation, so that I can be sure of a good product module?
As you've probably realized, both options have their pros and cons. On a properly indexed table, both should perform fairly well.
Option 1 is most likely faster, and can be better if you're absolutely sure that the number of SKU's will always be fairly limited, and users can only do something with the result after the entire file is processed.
Option 2 has a very important advantage in that you can process each record in your Excel file separately. This offers some interesting options, in that you can begin generating output for each row you read from the Excel instead of having to parse the entire file in one go, and then run the big query.
You should find a middle way: pick a specific, optimal BATCH_SIZE and use that as the criterion for querying your database.
An example batch size could be 5000.
So if your Excel file contains 2000 rows, all the data gets returned in a single query.
If the Excel file contains 19,000 rows, you do four queries, i.e. SKU codes 1-5000, 5001-10000, and so on.
Tune BATCH_SIZE according to your benchmarks.
It is always good to save on database queries.
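A minimal sketch of that batching with PDO (the $pdo connection and the $sku_codes array read from the Excel file are assumptions):
// Query the inventory table in chunks of BATCH_SIZE SKU codes.
define('BATCH_SIZE', 5000);

$results = array();
foreach (array_chunk($sku_codes, BATCH_SIZE) as $batch) {
    // one placeholder per SKU in this chunk: ?,?,?,...
    $placeholders = implode(',', array_fill(0, count($batch), '?'));
    $stmt = $pdo->prepare("SELECT * FROM inventory WHERE SKU_CODE IN ($placeholders)");
    $stmt->execute($batch);
    $results = array_merge($results, $stmt->fetchAll(PDO::FETCH_ASSOC));
}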

Improve SQL query results when running php app on GAE and DB on Amazon RDS

Here is something that hit me, and I wanted to know if I was right or if it could be done better. I am currently running the PHP part on GAE and use Amazon RDS, since it is cheaper than Google Cloud SQL, and also since PHP on GAE does not have a native API for the Datastore. I know there is a workaround, but this is simpler, and I bet a lot of others would rather have their GAE app sync with their DB than move the whole thing.
I run two queries.
1. This is a join statement that runs when the page loads:
$STH = $DBH->prepare("SELECT .....a few selected colmns with time coversion.....
List of Associates.Supervisor FROM Box Scores INNER JOIN
List of Associates ON Box Scores.Initials = List of
Associates.Initials WHERE str_to_date(Date, '%Y-%m-%d') BETWEEN
'{$startDate}' AND '{$endDate}' AND Box Scores.Initials LIKE
'{$initials}%' AND List of Associates.Supervisor LIKE'{$team}%'
GROUP BY Login");
What I get back I calculate on and then display as a table, with each username as a link:
echo("<td >$row[0]</td>");
So when someone clicks on this link, it calls another PHP script via AJAX to display the output, and there I run the second query.
2. Second query. This time I am getting everything:
$STH = $DBH->prepare("SELECT * FROM `Box Scores` INNER JOIN `List of Associates` ON
`Box Scores`.`Initials` = `List of Associates`.`Initials`
WHERE str_to_date(`Date`, '%Y-%m-%d') BETWEEN '{$startDate}' AND '{$endDate}'
AND `V2 Box Scores`.`Initials` LIKE '{$Agent}%'
AND `List of Associates`.`Supervisor` LIKE '{$team}%'");
I display the output in a small pop-up, as a lightbox, after formatting it as a table.
I find the first query to be faster, so it got me thinking: should I do something to the second one to make it faster?
Would selecting only the needed columns make it faster? Or should I do a SELECT * as in the first case, save it all to a unique file in a Google Cloud Storage bucket, and then make the corresponding SELECT calls against that file?
I am trying to make it scale and not slow down when the query has to go through tens of thousands of rows in the DB. The above queries are executed using PDO (PHP Data Objects).
So what are your thoughts?
Amazon Redshift stores each column in a separate partition -- something called a columnar database or vertical partitioning. This results in some unusual performance issues.
For instance, I have run a query like this on a table with hundreds of millions of rows, and it took about a minute to return:
select *
from t
limit 10;
On the other hand, a query like this would return in a few seconds:
select count(*), count(distinct field)
from t;
This takes some getting used to, but you should explicitly limit the columns you refer to in the query to get the best performance on Amazon (and other columnar databases). Each additional referenced column requires reading that data from disk into memory.
Limiting the number of columns also reduces the I/O to the application. This can be significant if you are storing wide-ish data in some of the columns and you don't use that data.
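To answer the direct question: yes, selecting only the columns the popup actually renders should help, both on a columnar store and simply in terms of data sent over the wire. A sketch of the second query with an explicit column list (the columns named here are placeholders for whatever the lightbox displays):
$STH = $DBH->prepare("SELECT `Box Scores`.`Initials`, `Box Scores`.`Date`, `Box Scores`.`Score`,
       `List of Associates`.`Supervisor`
       FROM `Box Scores` INNER JOIN `List of Associates` ON
       `Box Scores`.`Initials` = `List of Associates`.`Initials`
       WHERE str_to_date(`Date`, '%Y-%m-%d') BETWEEN '{$startDate}' AND '{$endDate}'
       AND `Box Scores`.`Initials` LIKE '{$Agent}%'
       AND `List of Associates`.`Supervisor` LIKE '{$team}%'");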

Large mysql query in PHP

I have a large table of about 14 million rows. Each row contains a block of text. I also have another table with about 6000 rows, where each row has a word and six numerical values for that word. I need to take each block of text from the first table, find the number of times each word from the second table appears in it, then calculate the mean of the six values for each block of text and store it.
I have a Debian machine with an i7 and 8 GB of memory, which should be able to handle it. At the moment I am using the PHP substr_count() function. However, PHP just doesn't feel like it's the right solution for this problem. Other than working around time-out and memory limit problems, does anyone have a better way of doing this? Is it possible to use just SQL? If not, what would be the best way to execute my PHP without overloading the server?
Do each record from the 'big' table one at a time. Load that single 'block' of text into your program (PHP or whatever), do the searching and calculation, then save the appropriate values wherever you need them.
Do each record as its own transaction, in isolation from the rest. If you are interrupted, use the saved values to determine where to start again.
Once you are done with the existing records, you only need to do this in the future when you enter or update a record, so it's much easier. You just need to take the big bite right now to get the existing data updated.
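A rough sketch of that row-at-a-time loop, assuming PDO, a hypothetical blocks table (id, text, m1..m6) for the 14 million rows, and a words table holding each word with its six values; two connections are used so the big result set can be streamed unbuffered while updates run on the other:
// Load the ~6000 words and their six values into memory once.
$words = $pdoWrite->query("SELECT word, v1, v2, v3, v4, v5, v6 FROM words")
                  ->fetchAll(PDO::FETCH_ASSOC);

// Stream the big table without buffering the whole result set in PHP.
$pdoRead->setAttribute(PDO::MYSQL_ATTR_USE_BUFFERED_QUERY, false);
$blocks = $pdoRead->query("SELECT id, `text` FROM blocks");

$update = $pdoWrite->prepare(
    "UPDATE blocks SET m1=?, m2=?, m3=?, m4=?, m5=?, m6=? WHERE id=?");

while ($block = $blocks->fetch(PDO::FETCH_ASSOC)) {
    $totals = array_fill(0, 6, 0);
    $hits = 0;
    foreach ($words as $w) {
        $count = substr_count($block['text'], $w['word']);
        if ($count === 0) continue;
        $hits += $count;
        for ($i = 1; $i <= 6; $i++) {
            $totals[$i - 1] += $count * $w["v$i"];
        }
    }
    // mean of the six values across all word occurrences in this block
    $means = $hits > 0
        ? array_map(function ($t) use ($hits) { return $t / $hits; }, $totals)
        : $totals;
    // each UPDATE autocommits, so every row is effectively its own transaction
    $update->execute(array_merge($means, array($block['id'])));
}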
What are you trying to do exactly? If you are trying to create something like a search engine with a weighting function, you should maybe drop that and instead use the MySQL fulltext search functions and indexes that are already there. If you still need this specific solution, you can of course do it completely in SQL, either in one query or with a trigger that runs each time after a row is inserted or updated. You won't be able to get this done properly in PHP without jumping through a lot of hoops.
To give you a specific answer, we indeed would need more information about the queries, data structures and what you are trying to do.
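For reference, a minimal sketch of the built-in fulltext approach mentioned above (table and column names are assumptions; before MySQL 5.6 the table would need to use the MyISAM engine):
-- index the text column once
ALTER TABLE blocks ADD FULLTEXT INDEX ft_text (`text`);

-- then let MySQL do the matching and ranking
SELECT id, MATCH(`text`) AGAINST ('some search term') AS relevance
FROM blocks
WHERE MATCH(`text`) AGAINST ('some search term')
ORDER BY relevance DESC
LIMIT 10;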
Redesign it:
If size on disk is not important, just join the tables into one.
Put the 6000-row table into memory (a MEMORY table) and make a backup every hour:
INSERT IGNORE INTO back.table SELECT * FROM my.table;
Create your "own" index in the big table, e.g. add a "name index" column to the big table holding the id of the matching row.
More info about the query is needed to find a solution.

Import and process text file within MySQL

I'm working on a research project that requires me to process large csv files (~2-5 GB) with 500,000+ records. These files contain information on government contracts (from USASpending.gov). So far, I've been using PHP or Python scripts to attack the files row-by-row, parse them, and then insert the information into the relevant tables. The parsing is moderately complex. For each record, the script checks to see if the entity named is already in the database (using a combination of string and regex matching); if it is not, it first adds the entity to a table of entities and then proceeds to parse the rest of the record and inserts the information into the appropriate tables. The list of entities is over 100,000.
Here are the basic functions (part of a class) that try to match each record with any existing entities:
private function _getOrg($data)
{
    // if name of organization is null, skip it
    if($data[44] == '') return null;

    // use each of the possible names to check if organization exists
    $names = array($data[44], $data[45], $data[46], $data[47]);

    // cycle through the names
    foreach($names as $name) {
        // check to see if there is actually an entry here
        if($name != '') {
            if(($org_id = $this->_parseOrg($name)) != null) {
                $this->update_org_meta($org_id, $data); // updates some information of existing entity based on record
                return $org_id;
            }
        }
    }
    return $this->_addOrg($data);
}

private function _parseOrg($name)
{
    // check to see if it matches any org names
    // db class function, performs simple "like" match
    $this->db->where('org_name', $name, 'like');
    $result = $this->db->get('orgs');
    if(mysql_num_rows($result) == 1) {
        $row = mysql_fetch_object($result);
        return $row->org_id;
    }

    // check to see if it matches any org aliases
    $this->db->where('org_alias_name', $name, 'like');
    $result = $this->db->get('orgs_aliases');
    if(mysql_num_rows($result) == 1) {
        $row = mysql_fetch_object($result);
        return $row->org_id;
    }

    return null; // no matches, have to add new entity
}
The _addOrg function inserts the new entity's information into the db, where hopefully it will match subsequent records.
Here's the problem: I can only get these scripts to parse about 10,000 records per hour, which, given the size, means a few solid days for each file. The way my db is structured requires several different tables to be updated for each record because I'm compiling multiple external datasets. So, each record updates two tables, and each new entity updates three tables. I'm worried that this adds too much lag time between the MySQL server and my script.
Here's my question: is there a way to import the text file into a temporary MySQL table and then use internal MySQL functions (or PHP/Python wrapper) to speed up the processing?
I'm running this on my Mac OS 10.6 with local MySQL server.
Load the file into a temporary/staging table using LOAD DATA INFILE and then use a stored procedure to process the data - it shouldn't take more than 1-2 minutes at most to completely load and process the data.
you might also find some of my other answers of interest:
Optimal MySQL settings for queries that deliver large amounts of data?
MySQL and NoSQL: Help me to choose the right one
How to avoid "Using temporary" in many-to-many queries?
60 million entries, select entries from a certain month. How to optimize database?
Interesting presentation:
http://www.mysqlperformanceblog.com/2011/03/18/video-the-innodb-storage-engine-for-mysql/
example code (may be of use to you)
truncate table staging;
start transaction;
load data infile 'your_data.dat'
into table staging
fields terminated by ',' optionally enclosed by '"'
lines terminated by '\n'
(
org_name
...
)
set
org_name = nullif(org_name,'');
commit;
drop procedure if exists process_staging_data;
delimiter #
create procedure process_staging_data()
begin
insert ignore into organisations (org_name) select distinct org_name from staging;
update...
etc..
-- or use a cursor if you have to ??
end#
delimiter ;
call process_staging_data();
Hope this helps
It sounds like you'd benefit most from tuning your SQL queries, which is probably where your script spends most of its time. I don't know how the PHP MySQL client performs, but MySQLdb for Python is fairly fast. Doing naive benchmark tests, I can easily sustain 10k insert/select queries per second on one of my older quad-cores. Instead of doing one SELECT after another to test whether an organization exists, using a REGEXP to check for them all at once might be more efficient (discussed here: MySQL LIKE IN()?). MySQLdb lets you use executemany() to do multiple inserts at once, and you could almost certainly leverage that to your advantage; perhaps your PHP client lets you do the same thing?
Another thing to consider: with Python you can use multiprocessing to parallelize as much as possible. PyMOTW has a good article about multiprocessing.
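A sketch of the 'check them all at once' idea in PHP with PDO (the $names batch, the orgs layout, and the use of preg_quote as a rough escape are all assumptions; MySQL's REGEXP syntax is not identical to PCRE):
// Resolve a whole batch of candidate names with a single query instead of one SELECT each.
$pattern = implode('|', array_map('preg_quote', $names));

$stmt = $pdo->prepare("SELECT org_id, org_name FROM orgs WHERE org_name REGEXP ?");
$stmt->execute(array($pattern));

// map matches back by name so each CSV row can be resolved without its own query
$found = array();
foreach ($stmt->fetchAll(PDO::FETCH_ASSOC) as $row) {
    $found[$row['org_name']] = $row['org_id'];
}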

Sorting postgresql database dump (pg_dump)

I am creating two pg_dumps, DUMP1 and DUMP2.
DUMP1 and DUMP2 are exactly the same, except DUMP2 was dumped in the REVERSE order of DUMP1.
Is there any way that I can sort the two dumps so that the two dump files are exactly the same (when using a diff)?
I am using PHP and Linux. I tried using "sort" in Linux, but that does not work...
Thanks!
From your previous question, I assume that what you are really trying to do is compare two databases to see if they are the same, including the data.
As we saw there, pg_dump is not going to behave deterministically. The fact that one file is the reverse of the other is probably just coincidental.
Here is a way that you can do the total comparison including schema and data.
First, compare the schema using this method.
Second, compare the data by dumping it all to a file in an order that will be consistent. Order is guaranteed by first sorting the tables by name and then by sorting the data within each table by primary key column(s).
The query below generates the COPY statements.
select
    'copy (select * from '||r.relname||' order by '||
    array_to_string(array_agg(a.attname), ',')||
    ') to STDOUT;'
from
    pg_class r,
    pg_constraint c,
    pg_attribute a
where
    r.oid = c.conrelid
    and r.oid = a.attrelid
    and a.attnum = ANY(conkey)
    and contype = 'p'
    and relkind = 'r'
group by
    r.relname
order by
    r.relname;
Running that query will give you a list of statements like copy (select * from test order by a,b) to STDOUT; Put those all in a text file, run them through psql for each database, and then compare the output files. You may need to tweak the output settings for COPY.
My solution was to write my own program for sorting the pg_dump output. Feel free to download PgDumpSort, which sorts the dump by primary key. With the Java default heap of 512 MB it should work with up to 10 million records per table, since the record info (primary key value, file offsets) is held in memory.
You use this little Java program e.g. with
java -cp ./pgdumpsort.jar PgDumpSort db.sql
And you get a file named "db-sorted.sql", or specify the output file name:
java -cp ./pgdumpsort.jar PgDumpSort db.sql db-$(date +%F).sql
And the sorted data is in a file like "db-2013-06-06.sql"
Now you can create patches using diff
diff --speed-large-files -uN db-2013-06-05.sql db-2013-06-06.sql >db-0506.diff
This allows you to create incremental backups, which are usually way smaller. To restore the files you have to apply the patch to the original file using
patch -p1 < db-0506.diff
(Source code is inside of the JAR file)
If
- performance is less important than order,
- you only care about the data, not the schema, and
- you are in a position to recreate both dumps (you don't have to work with existing dumps),
you can dump the data in CSV format in a determined order like this:
COPY (select * from your_table order by some_col) to stdout
with csv header delimiter ',';
See COPY (v14)
It's probably not worth the effort to parse out the dump.
It will be far, far quicker to restore DUMP2 into a temporary database and dump the temporary in the right order.
Here is another solution to the problem: https://github.com/tigra564/pgdump-sort
It allows sorting both DDL and DML including resetting volatile values (like sequence values) to some canonical values to minimize the resulting diff.
