I have multiple datasets, as below, and I want to handle them using PHP.
Dataset #1 (75 cols * 27,000 rows)
col #1 col #2 ...
record #1
record #2
...
Dataset #2 (32 cols * 7,500 rows)
....
Dataset #3 (44 cols * 17,500 rows)
....
Here, the numbers of records and columns differ between datasets, so it is hard to fit them into a single database structure.
Note that each 'cell' of a dataset consists only of either a real number or N/A, and the datasets are completely fixed, i.e., they will never change.
What I've done so far is store them as file-based tables and record the starting offset of each record in the file.
This gave a decent access speedup, but it is still not satisfactory, because each record access requires parsing it into a PHP data structure.
What I ultimately want is to eliminate the parsing step. Serialization was not a good choice because it loads the entire dataset. Of course it is possible to serialize each record individually and keep their offsets as I've done now, but that doesn't seem very elegant to me.
So here's the question: is there any method to load part of a dataset without a parsing step, better than the per-record serialization I suggested?
Many thanks in advance.
More information
Maybe I confused readers a little.
Each dataset is separate and exists as an independent file.
The usual data access pattern is row-wise. Each row has a unique string ID, and an ID in one dataset may also exist in another dataset, but not necessarily. Above all, what I care about is accelerating access when I have a query that fetches specific row(s) from a dataset. For example, suppose there is a dataset like the one below.
Dataset #1 (plain-text file)
obs1 obs2 obs3 ...
my1 3.72 5.28 10.22 ...
xu1 3.44 5.82 15.33 ...
...
qq7 8.24 10.22 47.54 ...
And there is a corresponding index file, serialized using PHP. The key of each item is the unique ID of a row in the dataset, and its value is the row's byte offset in the file.
Index #1 (PHP-serialized; shown here in readable form, not the actual serialized bytes)
Array (
"my1" => 0,
"xu1" => 337,
...
"qq7" => 271104
)
So it is possible to know that record "xu1" starts 337 bytes from the beginning of the dataset file.
To access and fetch rows by their unique IDs, I:
1) Load the serialized index file.
2) Find the IDs matching the query.
3) Seek to those positions, fetch the rows, and parse them into PHP arrays (a rough sketch of this access path is shown below).
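For reference, a minimal sketch of this access path in PHP (the file names and the whitespace delimiter are assumptions; the real layout may differ):

<?php
// Load the PHP-serialized index: unique ID => byte offset in the dataset file.
$index = unserialize(file_get_contents('dataset1.idx'));

// IDs to fetch (exact matches against the index keys).
$wanted = array('xu1', 'qq7');

$fh   = fopen('dataset1.txt', 'rb');
$rows = array();
foreach ($wanted as $id) {
    if (!isset($index[$id])) {
        continue;                                   // ID not in this dataset
    }
    fseek($fh, $index[$id]);                        // jump to the record start
    $line = fgets($fh);                             // read a single row
    $rows[$id] = preg_split('/\s+/', trim($line));  // the parsing step I want to eliminate
}
fclose($fh);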
The problems I have are:
1) Since I use exact matching, it is impossible to fetch multiple rows that only partially match a query (for example, fetching the "xu1" row from the query "xu").
2) Even though the dataset is indexed, fetch speed is not satisfactory (about 0.05 sec for a single query).
3) When I tried to solve the above problems by serializing an entire dataset, the loading speed (perhaps unsurprisingly) became substantially slower.
The easiest way to solve these problems would be to put them in a database, and I would do so if I had to,
but I hope to find a better way that keeps them in plain text or some text-like format (for example, serialized or JSON-encoded).
Many thanks for your interest in my problem!
I think I understand your question to some extent. You've got three sets of data, which may or may not be related, with different numbers of columns and rows.
This may not be the cleanest-looking solution, but I think it could serve the purpose. You can use MySQL to store the data to avoid parsing the files again and again. You can store the data in three tables, or put them in one table with all the columns (rows that don't need a given column can have NULL as the field value).
You can also use SQL unions, in case you want to run queries on all three datasets collectively, using tricks like:
select null as "col1", col2, col3 from table1 where col2="something"
union all
select col1, null as "col2", null as "col3" from table2 where col1="something else"
order by col1
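If you go the MySQL route, importing each fixed plain-text file is a one-time step. A rough sketch, assuming a whitespace-delimited file and an already-created table dataset1 with an id column followed by one column per observation (all names here are placeholders):

<?php
$pdo = new PDO('mysql:host=localhost;dbname=datasets', 'user', 'pass');

$fh     = fopen('dataset1.txt', 'rb');
$header = preg_split('/\s+/', trim(fgets($fh)));   // obs1 obs2 obs3 ...
$cols   = count($header);

// One placeholder for the row ID plus one per observation column.
$placeholders = implode(',', array_fill(0, $cols + 1, '?'));
$stmt = $pdo->prepare("INSERT INTO dataset1 VALUES ($placeholders)");

while (($line = fgets($fh)) !== false) {
    if (trim($line) === '') {
        continue;                                  // skip blank lines
    }
    $fields = preg_split('/\s+/', trim($line));    // row ID followed by the values
    // Store N/A cells as SQL NULL rather than the literal string.
    $fields = array_map(function ($v) { return $v === 'N/A' ? null : $v; }, $fields);
    $stmt->execute($fields);
}
fclose($fh);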
Related
I need to solve the following task: I have a fairly large array of IDs in a PHP script, and I need to select from a MySQL DB all rows whose IDs are NOT IN this array.
There are several similar questions (How to find all records which are NOT in this array? (MySql)), and the most popular answer is to use a NOT IN () construct with implode(',', $array) inside the brackets.
And this worked... until my array grew to 2007 IDs and about 20 kB (in my case), at which point I got a "MySQL server has gone away" error. As I understand it, this is because of the length of the query.
There are also some solutions to this problem like this:
SET GLOBAL max_allowed_packet=1073741824;
(just taken from this question).
I could probably do it this way, but now I doubt that the NOT IN (implode) approach is a good one for big arrays (I expect that in my case the array could grow to 8000 IDs and 100 kB).
Is there any better solution for big arrays?
Thanks!
EDIT 1
As a solution it was recommended to insert all IDs from the array into a temporary table and then use a JOIN to solve the initial task. This is clear. However, I have never used temporary tables, so I have an additional question (probably worth asking separately, but I decided to leave it here):
If I need to run this routine several times during one MySQL session, which approach is better?
Each time I need to SELECT IDs NOT IN the PHP array, I create a NEW temporary table (all such tables are deleted when the MySQL connection terminates, i.e. when my script finishes).
I create a temporary table and drop it after I have made the needed SELECT.
I create one temporary table and TRUNCATE it after each SELECT.
Which is better? Or have I missed something else?
In such cases it is usually better to create a temporary table and perform the query against it instead. It'd be something along the lines of:
CREATE TEMPORARY TABLE t1 (a int);
INSERT INTO t1 VALUES (1),(2),(3);
SELECT * FROM yourtable
LEFT JOIN t1 on (yourtable.id=t1.a)
WHERE t1.a IS NULL;
Of course, the INSERT statement should be constructed so that all the values from your array end up in the temporary table.
Edit: Inserting all values in a single INSERT statement would most probably lead to the same problem you already faced. Hence I'd suggest using a prepared statement that is executed repeatedly to insert the data into the temporary table while you iterate through the PHP array, as sketched below.
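A hedged sketch of that approach with PDO (the table and column names follow the example above; the DSN and credentials are placeholders):

<?php
$pdo = new PDO('mysql:host=localhost;dbname=mydb', 'user', 'pass');
$ids = array(1, 2, 3); // in practice, the large PHP array of IDs

// The temporary table exists only for this connection.
$pdo->exec('CREATE TEMPORARY TABLE t1 (a INT PRIMARY KEY)');

// Reuse one prepared statement to insert the IDs while iterating the array,
// instead of building one huge NOT IN (...) string.
$ins = $pdo->prepare('INSERT INTO t1 (a) VALUES (?)');
foreach ($ids as $id) {
    $ins->execute(array($id));
}

// All rows of yourtable whose id is NOT in the array.
$stmt = $pdo->query(
    'SELECT yourtable.* FROM yourtable
     LEFT JOIN t1 ON yourtable.id = t1.a
     WHERE t1.a IS NULL'
);
$rows = $stmt->fetchAll(PDO::FETCH_ASSOC);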
I once had to tackle this problem, but with an IN(id) WHERE clause containing approximately 20,000-30,000 identifiers (indexes).
The way I got around this, with a SELECT query, was to reduce the number of identifiers per filter and increase the number of times I sent the same query, in order to extract the same data.
You could use PHP's array_chunk to split 20,000 identifiers into 15 batches of roughly 1,300-1,500 each, giving 15 separate SQL calls (you can split into more batches to reduce the number of identifiers per call even further). In your case, simply splitting 2,007 identifiers into 10 batches would bring the number of identifiers you push to the database down to about 200 per SQL request. There are other ways to optimize this further, with temporary tables and so forth.
By dividing up the identifiers you're filtering on, each query runs faster than if you were to send every identifier to the database in a single dump.
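A minimal sketch of the chunking idea with array_chunk (the chunk size, table, and column names are only examples):

<?php
$pdo = new PDO('mysql:host=localhost;dbname=mydb', 'user', 'pass');
$ids = range(1, 20000);                    // the identifiers to filter on

$results = array();
foreach (array_chunk($ids, 1500) as $chunk) {
    // One placeholder per ID keeps each individual query short.
    $placeholders = implode(',', array_fill(0, count($chunk), '?'));
    $stmt = $pdo->prepare("SELECT * FROM yourtable WHERE id IN ($placeholders)");
    $stmt->execute($chunk);
    $results = array_merge($results, $stmt->fetchAll(PDO::FETCH_ASSOC));
}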
What is the best way to find a match across a few columns in the database?
I'd like to do something like this:
If a match for $a is found, display the ID of the row.
I am debating between two approaches:
Select the entire table once, keep the rows in a PHP array, look for matches there, and then present the results from the array;
Or query the table for a match each time.
The problem is that each time I query the whole table (it is very large), I run into the memory limit.
So I'm looking for the approach that uses the least memory.
If all the data is in a single table, make sure the columns you are querying are indexed. This will ensure an 'optimal' search on your table.
In terms of memory, if you have an extremely large result set and slam the entire dataset into an array, you may run out of memory. To deal with this, you should page the data, e.g. load a limited number of results into the array for display, then fetch more data as the user asks for it.
Generally, selecting limited results from the database is faster and less memory intensive than populating large arrays. For a large table, be sure you only select the data you require. You might be looking for something like:
SELECT record_id FROM your_table WHERE your_table.your_column = '$a' LIMIT 1;
This will only return one record in your result set.
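If you do need to walk a larger result set, paging keeps memory bounded. A rough sketch, assuming the same table and column names as above, a PDO connection, and a page size of 100:

<?php
$pdo = new PDO('mysql:host=localhost;dbname=mydb', 'user', 'pass');
$a   = 'value to match';                   // the search value
$pageSize = 100;

$stmt = $pdo->prepare(
    'SELECT record_id FROM your_table
     WHERE your_column = :a
     LIMIT :lim OFFSET :off'
);

for ($offset = 0; ; $offset += $pageSize) {
    $stmt->bindValue(':a', $a);
    $stmt->bindValue(':lim', $pageSize, PDO::PARAM_INT);
    $stmt->bindValue(':off', $offset, PDO::PARAM_INT);
    $stmt->execute();
    $page = $stmt->fetchAll(PDO::FETCH_ASSOC);
    if (!$page) {
        break;                             // no more rows
    }
    // ... display or process this page before fetching the next one ...
}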
I have a series of rows in MySQL with a 'location' column, which represents the location of an object on a two-dimensional xy grid. I want to search the table for rows with a location within a given distance of a certain tile.
For example, if I ran a search within 10 tiles of [34,56], it would return any rows whose 'location' has x between 24-44 and y between 46-66.
My solution to this problem was to create an array (using for loops) with all of the possible tiles that would fall within that search term, and then query MySQL thusly:
"SELECT * FROM table WHERE localcoordinate IN ('$rangearray')"
This solution works fine, but is very resource intensive. I'd like to be able to run many searches at a distance of hundreds or thousands of tiles. Can anyone suggest a better approach that might run faster?
I reduced my resource consumption by a factor of 100 by making the following changes to my strategy.
1) I broke the xy coordinate into two fields within the table.
2) I searched natively in MySQL with the "BETWEEN" function.
The final query looked something like this. You can extrapolate the data structure from the query.
SELECT * FROM table WHERE localcoordinateX BETWEEN $xLo AND $xHi AND localcoordinateY BETWEEN $yLo AND $yHi
I should have thought of this the first time around but I didn't. Just the act of posting to stack exchange got me thinking clearly again, though!
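In PHP, the bounding-box version might look roughly like this (the table name tiles, the PDO connection, and the centre/distance values are placeholders; the column names follow the query above):

<?php
$pdo = new PDO('mysql:host=localhost;dbname=mydb', 'user', 'pass');

$centerX = 34;
$centerY = 56;
$dist    = 10;

// One range condition per axis: the database can use an index on the two
// coordinate columns instead of PHP enumerating every candidate tile.
$stmt = $pdo->prepare(
    'SELECT * FROM tiles
     WHERE localcoordinateX BETWEEN :xlo AND :xhi
       AND localcoordinateY BETWEEN :ylo AND :yhi'
);
$stmt->execute(array(
    ':xlo' => $centerX - $dist,
    ':xhi' => $centerX + $dist,
    ':ylo' => $centerY - $dist,
    ':yhi' => $centerY + $dist,
));
$rows = $stmt->fetchAll(PDO::FETCH_ASSOC);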
So, a LOT of details here; my main objective is to do this as fast as possible.
I am calling an API which returns a large json encoded string.
I am storing the quoted encoded string into MySQL (InnoDB) with 3 fields: (tid (key), json, tags) in a table called store.
I will, at up to 3+ months later, pull information from this database by using:
WHERE tags LIKE "%something%" AND tags LIKE "%somethingelse%"
Tags are + delimited. (Which makes them too long to be efficiently keyed.)
Example:
'anime+pikachu+shingeki no kyojin+pokemon+eren+attack on titan+'
I do not wish to repeat API calls at ANY time. If you are going to include an API call, use:
API(tag, time);
All of the JSON data is needed.
This table is an active archive.
One idea I had was to put the tags into their own two-column table (pid, tag (key)), where pid points to tid in the store table.
Questions
Are there any MySQL configurations I can change to make this faster?
Are there any table structure changes I can do to make this faster?
Is there anything else I can do to make this faster?
Quoted JSON example (messy; for a cleaner example see the Tumblr API v2):
'{\"blog_name\":\"roxannemariegonzalez\",\"id\":62108559921,\"post_url\":\"http:\\/\\/roxannemariegonzalez.tumblr.com\\/post\\/62108559921\",\"slug\":\"\",\"type\":\"photo\",\"date\":\"2013-09-24 00:36:56 GMT\",\"timestamp\":1379983016,\"state\":\"published\",\"format\":\"html\",\"reblog_key\":\"uLdTaScb\",\"tags\":[\"anime\",\"pikachu\",\"shingeki no kyojin\",\"pokemon\",\"eren\",\"attack on titan\"],\"short_url\":\"http:\\/\\/tmblr.co\\/ZxlLExvrzMen\",\"highlighted\":[],\"bookmarklet\":true,\"note_count\":19,\"source_url\":\"http:\\/\\/weheartit.com\\/entry\\/78231354\\/via\\/roxannegonzalez?page=2\",\"source_title\":\"weheartit.com\",\"caption\":\"\",\"link_url\":\"http:\\/\\/weheartit.com\\/entry\\/78231354\\/via\\/roxannegonzalez\",\"image_permalink\":\"http:\\/\\/roxannemariegonzalez.tumblr.com\\/image\\/62108559921\",\"photos\":[{\"caption\":\"\",\"alt_sizes\":[{\"width\":500,\"height\":444,\"url\":\"http:\\/\\/31.media.tumblr.com\\/c8a87bee925b0b0674773af63e43f954\\/tumblr_mtltpkLvuo1qmfyxko1_500.png\"},{\"width\":400,\"height\":355,\"url\":\"http:\\/\\/25.media.tumblr.com\\/c8a87bee925b0b0674773af63e43f954\\/tumblr_mtltpkLvuo1qmfyxko1_400.png\"},{\"width\":250,\"height\":222,\"url\":\"http:\\/\\/31.media.tumblr.com\\/c8a87bee925b0b0674773af63e43f954\\/tumblr_mtltpkLvuo1qmfyxko1_250.png\"},{\"width\":100,\"height\":89,\"url\":\"http:\\/\\/25.media.tumblr.com\\/c8a87bee925b0b0674773af63e43f954\\/tumblr_mtltpkLvuo1qmfyxko1_100.png\"},{\"width\":75,\"height\":75,\"url\":\"http:\\/\\/25.media.tumblr.com\\/c8a87bee925b0b0674773af63e43f954\\/tumblr_mtltpkLvuo1qmfyxko1_75sq.png\"}],\"original_size\":{\"width\":500,\"height\":444,\"url\":\"http:\\/\\/31.media.tumblr.com\\/c8a87bee925b0b0674773af63e43f954\\/tumblr_mtltpkLvuo1qmfyxko1_500.png\"}}]}'
Look into the MySQL MATCH()/AGAINST() functions and the FULLTEXT index feature; this is probably what you are looking for. Check that a FULLTEXT index behaves reasonably on a JSON document.
What kind of data sizes are we talking about? Huge amounts of memory are cheap these days, so having the entire MySQL dataset buffered in memory, where you can do full-text scans, isn't unreasonable.
Breaking out some of the JSON field values and putting them into their own columns would let you search quickly on those... but that doesn't help you in the general case.
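A hedged sketch of the FULLTEXT route (note that FULLTEXT indexes on InnoDB require MySQL 5.6+; on older versions they are only available for MyISAM, and the index name here is made up):

<?php
$pdo = new PDO('mysql:host=localhost;dbname=mydb', 'user', 'pass');

// One-time: add a full-text index over the tags column of the store table.
$pdo->exec('ALTER TABLE store ADD FULLTEXT INDEX ft_tags (tags)');

// Boolean mode lets you require several tags in a single search.
$stmt = $pdo->prepare(
    'SELECT tid, json FROM store
     WHERE MATCH(tags) AGAINST(:terms IN BOOLEAN MODE)'
);
$stmt->execute(array(':terms' => '+pikachu +pokemon'));
$rows = $stmt->fetchAll(PDO::FETCH_ASSOC);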
This option you suggested is the correct design:
One Idea I had was to put the tags into their own 2 column table (pid,
tag (key)). pid points to tid in the store table.
But if you're searching LIKE '%something%' then the leading '%' will mean the index can only be used to reduce disk reads - you will still need to scan the entire index.
If you can drop the leading % (because each tag is now stored whole in its own row), then this is certainly the way to go. The trailing '%' is not as important.
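A rough sketch of that two-table layout and the exact-match lookup (store, tid, and json come from the question; the store_tags table name and the index names are assumptions):

<?php
$pdo = new PDO('mysql:host=localhost;dbname=mydb', 'user', 'pass');

// One row per (post, tag) pair; the key on tag makes exact lookups cheap.
$pdo->exec(
    'CREATE TABLE IF NOT EXISTS store_tags (
         pid BIGINT NOT NULL,          -- points to store.tid
         tag VARCHAR(100) NOT NULL,
         KEY idx_tag (tag),
         KEY idx_pid (pid)
     )'
);

// Fetch every stored post carrying a given tag: no leading %, so the index is used.
$stmt = $pdo->prepare(
    'SELECT s.tid, s.json
     FROM store s
     JOIN store_tags t ON t.pid = s.tid
     WHERE t.tag = :tag'
);
$stmt->execute(array(':tag' => 'pikachu'));
$rows = $stmt->fetchAll(PDO::FETCH_ASSOC);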
I am trying to optimize my DB. I am at the beginning of indexing it, but I am not sure how to do it right.
I have this query:
$year = date("Y");
$thisYear = $year;
//$nextYear = $thisYear + 1;
$sql = mysql_query("SELECT SUM(points) as userpoints
FROM ".$prefix."_publicpoints
WHERE date BETWEEN '$thisYear" . "-01-01' AND '$thisYear" . "-12-31' AND fk_player_id = $playerid");
$row = mysql_fetch_assoc($sql);
$userPoints = $row['userpoints'];
$sql = mysql_query("SELECT
fk_player_id
FROM ".$prefix."_publicpoints
WHERE date BETWEEN '$thisYear" . "-01-01' AND '$thisYear" . "-12-31'
GROUP BY fk_player_id
HAVING SUM(points) > $userPoints");
$row = mysql_fetch_assoc($sql);
$userWrank = mysql_num_rows($sql)+1;
I am not sure how to index this. I have tried indexing fk_player_id, but it still looks through all the rows (287,937).
I have indexed the date field which gives me this back in EXPLAIN:
id: 1
select_type: SIMPLE
table: nf_publicpoints
type: range
possible_keys: IDXdate
key: IDXdate
key_len: 3
ref: NULL
rows: 143969
Extra: Using where with pushed condition; Using temporary...
I also have 2 calls to the same table... Could that be done in one?
How do I index this and/or could it be done smarter?
You should definitely spend some time reading up on indexing; there's a lot written about it, and it's important to understand what's going on.
Broadly speaking, an index imposes an ordering on the rows of a table.
For simplicity's sake, imagine a table is just a big CSV file. Whenever a row is inserted, it's inserted at the end. So the "natural" ordering of the table is just the order in which rows were inserted.
Imagine you've got that CSV file loaded up in a very rudimentary spreadsheet application. All this spreadsheet does is display the data, and numbers the rows in sequential order.
Now imagine that you need to find all the rows that have some value "M" in the third column. Given what you have available, you have only one option: you scan the table, checking the value of the third column for each row. If you've got a lot of rows, this method (a "table scan") can take a long time!
Now imagine that in addition to this table, you've got an index. This particular index is the index of values in the third column. The index lists all of the values from the third column, in some meaningful order (say, alphabetically) and for each of them, provides a list of row numbers where that value appears.
Now you have a good strategy for finding all the rows where the value of the third column is M! For instance, you can perform a binary search! Whereas the table scan requires you to look at N rows (where N is the number of rows), the binary search only requires that you look at log N index entries, in the very worst case. Wow, that's sure a lot easier!
Of course, if you have this index, and you're adding rows to the table (at the end, since that's how our conceptual table works), you need to update the index each and every time. So you do a little more work while you're writing new rows, but you save a ton of time when you're searching for something.
So, in general, indexing creates a tradeoff between read efficiency and write efficiency. With no indexes, inserts can be very fast -- the database engine just adds a row to the table. As you add indexes, the engine must update each index while performing the insert.
On the other hand, reads become a lot faster.
Hopefully that covers your first two questions (as others have answered -- you need to find the right balance).
Your third scenario is a little more complicated. If you're using LIKE, indexing engines will typically help with your read speed up to the first "%". In other words, if you're SELECTing WHERE column LIKE 'foo%bar%', the database will use the index to find all the rows where column starts with "foo", and then need to scan that intermediate rowset to find the subset that contains "bar". SELECT ... WHERE column LIKE '%bar%' can't use the index. I hope you can see why.
Finally, you need to start thinking about indexes on more than one column. The concept is the same, and it behaves similarly to the LIKE case: essentially, if you have an index on (a,b,c), the engine will continue using the index from left to right as best it can. So a search on column a might use the (a,b,c) index, as would one on (a,b). However, the engine would need to do a full table scan if you were searching WHERE b=5 AND c=1.
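As a concrete (hedged) illustration for the queries in the question: a composite index that covers both the player filter and the date range could be added roughly like this. The table and column names come from the question; the index name is made up, and whether this is the best index depends on your data.

<?php
$pdo = new PDO('mysql:host=localhost;dbname=mydb', 'user', 'pass');

// Left-to-right: the equality on fk_player_id comes first, then the date range,
// so both WHERE conditions of the per-player points query can use the index.
$pdo->exec('ALTER TABLE nf_publicpoints ADD INDEX idx_player_date (fk_player_id, date)');

// The ranking query filters on date and groups by fk_player_id, so a separate
// index starting with date (e.g. (date, fk_player_id, points)) may help it more.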
Hopefully this helps shed a little light, but I must reiterate that you're best off spending a few hours digging around for good articles that explain these things in depth. It's also a good idea to read your particular database server's documentation. The way indices are implemented and used by query planners can vary pretty widely.
For more information and examples, visit: http://blog.sqlauthority.com/category/sql-index/
Try creating an index on the date column; indexing fk_player_id alone will not help with this query. If that does not work, paste the EXPLAIN output...
For more information about indexes in MySQL, look here: http://hackmysql.com/case1
Why not index the date column, seeing how that's the main criterion that will be evaluated in the lookup?