How to build datastore indexes (PHP GAE)

How to build datastore indexes (PHP GAE) - php

I am using Tom Walder's Google Datastore Library for PHP to insert data into my Google App Engine Datastore.
$obj_schema = (new GDS\Schema('Add Log'))
->addString('name', TRUE)
->addDatetime('time', TRUE);
$obj_store = new GDS\Store($obj_gateway, $obj_schema);
$obj_store->upsert($obj_store->createEntity(['name' => "test",'time' => date('Y-m-d H:i:s', time())]));
When I insert data like the above code, everything seems to be importing properly (each property say they are indexed).
But when I go to do a query with multiple selectors it says "You need an index to execute this query".
My query
The error message
Does anyone know what I need to do to make sure my queries are being indexed? This is what my dashboard hows with plenty of data using the code I showed.

As Alex Martelli mentioned in a comment, most of the time, your indexes are built when you run your app on the devserver and have your datastore get queried there (this adds the required indexes for any question into your index.yaml file.
So you have two ways you can go at it.
1- Run your app on local devserver, go to your dev "Developer Console" to add one or two entities to your datastore. Run your queries, that'll populate your index.yaml with all required indexes. You can then run appcfg.py update_indexes to just deploy your index.yaml (bottom of this page)
2- Your other solution would be to read this, a page on how datastore indexes work. Then read this advanced article on indexes. You should also watch the following presentation that will give you a better insight into indexes and the datastore. Once that's all done, figure out which queries you want, and flesh out the required indexes in your index.yaml, then deploy with the same method as in 1.
Quick summary of how indexes work
So you can think of the datastore as a pure READER. It doesn't, like normal relational databases, do any kind of computation as it reads your data and returns it. Therefore, to be able to run a given query (say "all client orders passed before christmas 2013"), then you need a table where all your client orders are ordered by date (so the system doesn't have to check every row's date to see if it matches. It just takes the first "chunk" of your data, up to the date you're looking for, and returns it).
Therefore, you need to have those indexes built, and they will influence the queries you can run. By default, every attribute is indexed by itself, in descending order. For any queries on more than one attribute (or with a different sort order), you need to have the index (in that case they are called composite indexes) built by the datastore, so you need to declare it in your index.yaml.
In the last years, Google added the zigzag merge join algorithm, which is basically a way to take 2 composite indexes (that start with the same attributes, so there is common ground between the 2 sub-queries) and run 2 sub-queries on them, then have the algorithm join the responses of both sub-queries.

Related

Apply SQL Query to in-memory PHP Object or Array

Use Case:
I'm building a site where users can search records - with SQL. BUT - they should also be able to save their search and be notified when a new submitted record meets the criteria.
It's not a car buying site, but for example: The user searches for a 1967 Ford Mustang with a 289 V8 engine, within the 90291 ZIP code. Can't find the right one, but they want to be notified if a matching car is submitted 2 weeks later.
So of course, every time a new car is added to the DB, I can retrieve all the user search queries, and run all of them over all the cars in the DB. But that is not scalable.
Rather than search the entire "car" table with every "search" query every time a new car is submitted, I would like to just check that single "car" object/array in memory, with the existing user queries.
I'm doing this in PHP with Laravel and Eloquent, but I am implementation agnostic and welcome any theoretical approaches.
Thanks,
Chris

I would rather run the saved searches in batches at scheduled intervals and not run them avery time a record is appended to the tables.

It comes down to how you structure your in memory cache.
Whatever cache it is it usually relies on key, value pairs. It will be the same for the cache you are using:
http://laravel.com/docs/4.2/cache
So in the end it is all about using the right key. If you want to update the cached objects based on a car, then you would need to make the key in a way so that you can retrieve all objects from the cache using the car as (part of) the key. Usually you would concat multiple things for key like userId+carId+xyz and then make a MD5 checksum of that.
So that would be the answer to your question. However generally I would not recommend this approach. It sounds like your search results are more like persisted long term available results. So you would probably want to store them somewhere more permanent like a simple table. Then you can use standard SQL tools to join the table and find out what is needed.

My approach would be to use a MySQL stored procedure and use https://dev.mysql.com/doc/refman/5.1/en/event-scheduler.html to review the configs for possible changes and then flag them storing some kind of dirty indicator which is then checked by a php script which would be executed on demand or from cron etc periodically.
You could use the trigger to simply flag that the event scheduler has work to do. However you approach there are a number of state variables which starts to get ugly however this use case doesn't seem to map neatly into a queuing architecture as far as I can see.

A possible approach would be to use a trigger in SQL to send a notification. Here is something related with it: 1s link or 2nd link.

Redis CRUD patterns

i've recently started learning Redis and am currently building an app using it as sole datastore and I'd like to check with other Redis users if some of my conclusions are correct as well as ask a few questions. I'm using phpredis if that's relevant but I guess the questions should apply to any language as it's more of a pattern thing.
As an example, consider a CRUD interface to save websites (name and domain) with the following requirements:
Check for existing names/domains when saving/validating a new site (duplicate check)
Listing all websites with sorting and pagination
I have initially chosen the following "schema" to save this information:
A key "prefix:website_ids" in which I use INCR to generate new website id's
A set "prefix:wslist" in which I add the website id generated above
A hash for each website "prefix:ws:ID" with the fields name and website
The saving/validation issue
With the above information alone I was unable (as far as I know) to check for duplicate names or domains when adding a new website. To solve this issue I've done the following:
Two sets with keys "prefix:wsnames" and "prefix:wsdomains" where I also SADD the website name and domains.
This way, when adding a new website I can check if the submitted name or domain already exist in either of these sets with SISMEMBER and fail the validation if needed.
Now if i'm saving data with 50 fields instead of just 2 and wanted to prevent duplicates I'd have to create a similar set for each of the fields I wanted to validate.
QUESTION 1: Is the above a common pattern to solve this problem or is there any other/better way people use to solve this type of issue?
The listing/sorting issue
To list websites and sort by name or domain (ascending or descending) as well as limiting results for pagination I use something like:
SORT prefix:wslist BY prefix:ws:*->name ALPHA ASC LIMIT 0 10
This gives me 10 website ids ordered by name. Now to get these results I came to the following options (examples in php):
Option 1:
$wslist = the sort command here;
$websites = array();
foreach($wslist as $ws) {
$websites[$ws] = $redis->hGetAll('prefix:ws:'.$ws);
}
The above gives me a usable array with website id's as key and an array of fields. Unfortunately this has the problem that I'm doing multiple requests to redis inside a loop and common sense (at least coming from RDBMs) tells me that's not optimal.
The better way it would seem to be to use redis pipelining/multi and send all request in a single go:
Option 2:
$wslist = the sort command here;
$redis->multi();
foreach($wslist as $ws) {
$redis->hGetAll('prefix:ws:'.$ws);
}
$websites = $redis->exec();
The problem with this approach is that now I don't get each website's respective ID unless I then loop the $websites array again to associate each one. Another option is to maybe also save a field "id" with the respective website id inside the hash itself along with name and domain.
QUESTIONS 2/3: What's the best way to get these results in a usable array without having to loop multiple times? Is it correct or good practice to also save the id number as a field inside the hash just so I can also get it with the results?
Disclaimer: I understand that the coding and schema building paradigms when using a key->value datastores like Redis are different from RDBMs and document stores and so notions of "best way to do X" are likely to be different depending on the data and application at hand.
I also understand that Redis might not even be the most suitable datastore to use in mostly CRUD type apps but I'd still like to get any insights from more experienced developers since CRUD interfaces are very common on most apps.

Answer 1
Your proposal looks pretty common. I'm not sure why you need an auto-incrementing ID though. I imagine the domain name has to be unique, or the website name has to be unique, or at the very least the combination of the two has to be unique. If this is the case it sounds like you already have a perfectly good key, so why invent an integer key when you don't need it?
Having a SET for domains and a SET for website names is a perfect solution for quickly checking to see if a specific domain or website name already exists. Though, if one of those (domain or website name) is your key you might not even need these SETs since you could just look if the key prefix:ws:domain-or-ws-name-here exists.
Also, using a HASH for each website so you can store your 50 fields of details for the website inside is perfect. That is what hashes are for.
Answer 2
First, let me point out that if your websites and domain names are stored in SORTED SETs instead of SETs, they will already be alphabetized (assuming they are given the same score). If you are trying to support other sort options this might not help much, but wanted to point it out.
Your Option 1 and Option 2 are actually both relatively reasonable. Redis is lightning fast, so Option 1 isn't as unreasonable as it seems at first. Option 2 is clearly even more optimal from the perspective of redis since all the commands will be bufferred and executed all at once. Though, it will require additional processing in PHP afterwards as you noted if you want the array to be indexed by the id.
There is a 3rd option: lua scripting. You can have redis execute a Lua script that returns both the ids and hash values all in one shot. But, not being super familiar with PHP anymore and how redis's multibyte replies map to PHPs arrays I'm not 100% sure what the lua script would look like. You'll need to look for examples or do some trial and error. It should be a pretty simple script, though.
Conclusion
I think redis sounds like a decent solution for your problem. Just keep in mind the dataset needs to always be small enough to keep in memory. If that's not really a concern (unless your fields are huge, you should be able to fit thousands of websites into only a few MB) or if you don't mind having to upgrade your RAM to grow your DB, then Redis is perfectly suitable.
Be familiar with the various persistence options and configurations for redis and what they mean for availability and reliability. Also, make sure you have a backup solution in place. I would recommend having both a secondary redis instance that slaves off of your main instance, and a recurring process that backs up your redis database file at least daily.

Simulate MySQL connection to analyze queries to rebuild table structure (reverse-engineering tables)

I have just been tasked with recovering/rebuilding an extremely large and complex website that had no backups and was fully lost. I have a complete (hopefully) copy of all the PHP files however I have absolutely no clue what the database structure looked like (other than it is certainly at least 50 or so tables...so fairly complex). All data has been lost and the original developer was fired about a year ago in a fiery feud (so I am told). I have been a PHP developer for quite a while and am plenty comfortable trying to sort through everything and get the application/site back up and running...but the lack of a database will be a huge struggle. So...is there any way to simulate a MySQL connection to some software that will capture all incoming queries and attempt to use the requested field and table names to rebuild the structure?
It seems to me that if i start clicking through the application and it passes a query for
SELECT name, email, phone from contact_table WHERE
contact_id='1'
...there should be a way to capture that info and assume there was a table called "contact_table" that had at least 4 fields with those names... If I can do that repetitively, each time adding some sample data to the discovered fields and then moving on to another page, then eventually I should have a rough copy of most of the database structure (at least all public-facing parts). This would be MUCH easier than manually reading all the code and pulling out every reference, reading all the joins and subqueries, and sorting through it all manually.
Anyone ever tried this before? Any other ideas for reverse-engineering the database structure from PHP code?

mysql> SET GLOBAL general_log=1;
With this configuration enabled, the MySQL server writes every query to a log file (datadir/hostname.log by default), even those queries that have errors because the tables and columns don't exist yet.
http://dev.mysql.com/doc/refman/5.6/en/query-log.html says:
The general query log can be very useful when you suspect an error in a client and want to know exactly what the client sent to mysqld.
As you click around in the application, it should generate SQL queries, and you can have a terminal window open running tail -f on the general query log. As you see queries run by that reference tables or columns that don't exist yet, create those tables and columns. Then repeat clicking around in the app.
A number of things may make this task even harder:
If the queries use SELECT *, you can't infer the names of columns or even how many columns there are. You'll have to inspect the application code to see what column names are used after the query result is returned.
If INSERT statements omit the list of column names, you can't know what columns there are or how many. On the other hand, if INSERT statements do specify a list of column names, you can't know if there are more columns that were intended to take on their default values.
Data types of columns won't be apparent from their names, nor string lengths, nor character sets, nor default values.
Constraints, indexes, primary keys, foreign keys won't be apparent from the queries.
Some tables may exist (for example, lookup tables), even though they are never mentioned by name by the queries you find in the app.
Speaking of lookup tables, many databases have sets of initial values stored in tables, such as all possible user types and so on. Without the knowledge of the data for such lookup tables, it'll be hard or impossible to get the app working.
There may have been triggers and stored procedures. Procedures may be referenced by CALL statements in the app, but you can't guess what the code inside triggers or stored procedures was intended to be.
This project is bound to be very laborious, time-consuming, and involve a lot of guesswork. The fact that the employer had a big feud with the developer might be a warning flag. Be careful to set the expectations so the employer understands it will take a lot of work to do this.
PS: I'm assuming you are using a recent version of MySQL, such as 5.1 or later. If you use MySQL 5.0 or earlier, you should just add log=1 to your /etc/my.cnf and restart mysqld.

Crazy task. Is the code such that the DB queries are at all abstracted? Could you replace the query functions with something which would log the tables, columns and keys, and/or actually create the tables or alter them as needed, before firing off the real query?
Alternatively, it might be easier to do some text processing, regex matching, grep/sort/uniq on the queries in all of the PHP files. The goal would be to get it down to a manageable list of all tables and columns in those tables.

I once had a similar task, fortunately I was able to find an old backup.
If you could find a way to extract the queries, like say, regex match all of the occurrences of mysql_query or whatever extension was used to query the database, you could then use something like php-sql-parser to parse the queries and hopefully from that you would be able to get a list of most tables and columns. However, that is only half the battle. The other half is determining the data types for every single column and that would be rather impossible to do autmatically from PHP. It would basically require you inspect it line by line. There are best practices, but who's to say that the old dev followed them? Determining whether a column called "date" should be stored in DATE, DATETIME, INT, or VARCHAR(50) with some sort of manual ugly string thing can only be determined by looking at the actual code.
Good luck!

You could build some triggers with the BEFORE action time, but unfortunately this will only work for INSERT, UPDATE, or DELETE commands.
http://dev.mysql.com/doc/refman/5.0/en/create-trigger.html

Preferred method for Materialized Views (Summary Tables) with MySQL

I am developing a project at work for which I need to create and maintain Summary Tables for performance reasons. I believe the correct term for this is Materialized Views.
I have 2 main reasons to do this:
Denormalization
I normalized the tables as much as possible. So there are situations where I would have to join many tables to pull data. We work with MySQL Cluster, which has pretty poor performance when it comes to JOIN's.
So I need to create Denormalized Tables that can run faster SELECT's.
Summarize Data
For example, I have a Transactions table with a few million records. The transactions come from different websites. The application needs to generate a report will display the daily or monthly transaction counts, and total revenue amounts per website. I don't want the report script to calculate this every time, so I need to generate a Summary Table that will have a breakdown by [site,date].
That is just one simple example. There are many different kinds of summary tables I need to generate and maintain.
In the past I have done these things by writing several cron scripts to keep each summary table updated. But in this new project, I am hoping to implement a more elegant and proper solution.
I would prefer a PHP based solution, as I am not a server administrator, and I feel the most comfortable when I can control everything through my application code.
Solutions that I have considered:
Copying VIEW's
If the resulting table can be represented as a single SELECT query, I can generate a VIEW. Since they are slow, there can be a cronjob that copies this VIEW into a real table.
However, some of these SELECT queries can be so slow that it's not acceptable even for cronjobs. It is not very efficient to recreate the whole summary data, if older rows are not even being updated much.
Custom Cronjobs for each Summary Table
This is the solution I have used before, but now I am trying to avoid it if possible. If there will be many summary tables, it can be messy to maintain.
MySQL Triggers
It is possible to add triggers to the main tables so that every time there is an INSERT, UPDATE or DELETE, the summary tables get updated accordingly.
There would be no cronjobs and the summaries would be in real time. However if there is ever a need to rebuild a summary table from scratch, it would have to be done with another solution (probably #1 above).
Using ORM Hooks/Triggers
I am using Doctrine as my ORM. There is a way to add event listeners that will trigger stuff on INSERT/UPDATE/DELETE, which in turn can update the summary tables. In a sense this solution is similar to #3 above, but I will have better control over these triggers since they will be implemented in PHP.
Implementation Considerations:
Complete Rebuilds
I want to avoid having to rebuild the summary tables, for efficiency, and only update for new data. But in case something goes wrong, I need the capability to rebuild the summary table from scratch using existing data on the main tables.
Ignoring UPDATE/DELETE on Old Data
Some summaries can assume that older records will never be updated or deleted, but only new records will be inserted. The summary process can save a lot of work by making the assumption that it doesn't need to check for updates on older data.
But of course this won't apply to all tables.
Keeping a Log
Let's assume that I won't have access to, or do not want to use the binary MySQL logs.
For summarizing new data, the summary process just needs to remember the last primary key id's for the last records it summarized. Next time it runs, it can summarize everything after that id. However, to keep track of older records that have been updated/deleted, it needs another log so it can go back and re-summarize that data.
I would appreciate any kind of strategies, suggestions or links that can help. Thank you!

As noted above materialized views in Oracle are different than indexed views in SQL Server. They are very cool and useful. See http://download.oracle.com/docs/cd/B10500_01/server.920/a96567/repmview.htm for details
MySql does not have support for these however.
One thing you mention several times is poor performance. Have you checked your database design for proper indexing and run explain plans on the queries to see why they are slow. See here http://dev.mysql.com/doc/refman/5.1/en/using-explain.html. This is of course assuming that your server is tuned properly, you have mysql setup and tuned, e.g. buffer caches, etc. etc. etc.
To your direct question. What you sound like you want to do is something we do often in a data warehouse situation. We have a production database and a DW that pulls in all sorts of information, aggregates and pre-caclulates it to speed up querying. This may be overkill for you but you can decide. Depending on the latency you define for your reports, i.e. how often you need them, we normally go through an ETL (extract transform load) process periodically (daily, weekly, etc.) to populate the DW from the production system. This keeps impact low on the production system and moves all reporting to another set of servers which also lessens the load. On the DW side, I would normally design my schemas different, i.e. using star schemas. (http://www.orafaq.com/node/2286) Star schemas have fact tables (things you want to measure) and dimensions (things you want to aggregate the measures by (time, geography, product categories, etc.) On SQL Server they also include an additional engine called SQL Server Analysis services (SSAS) to look at fact tables and dimensions, pre calculate and build OLAP data cubes. In these data cubes you can drill down and look at all types of patterns, do data analysis and data mining. Oracle does things slightly differently but the outcome is the same.
Whether you want to go the about route really depends on the business need and how much value you get from data analysis. As I said it is likely overkill if you just have a few summary tables but some of the concepts you may find helpful as you think things through. If your business is going toward a business intelligence solution then this is something to consider.
PS You can actually set a DW up to work in "real-time" using something called ROLAP if that is the business need. Microstrategy has a good product that works well for this.
PPS You also may want to look at PowerPivot from MS (http://www.powerpivot.com/learn.aspx) I have only played with it so I cannot tell you how it works on very large datasets.

Flexviews (http://flexvie.ws) is an open source PHP/MySQL based project. Flexviews adds incrementally refreshable materialized views (like the materialized views in Oracle) to MySQL, usng PHP and stored procedures.
It includes FlexCDC, a PHP based change data capture utility which reads binary logs, and the Flexviews MySQL stored procedures which are used to define and maintain the views.
Flexviews supports joins (inner join only) and aggregation so it can be used to create summary tables. Moreover, you can use Flexviews in combination with Mondrian's (a ROLAP server) aggregation designer to create summary tables that the ROLAP tool can automatically use.
If you don't have access to the logs (it can read them remotely, btw, so you don't need server access, but you do need SUPER privs) then you can use 'COMPLETE' refresh with Flexviews. This automates creating a new table with 'CREATE TABLE ... AS SELECT' under a new table name. It then uses RENAME TABLE to swap the new table for the one, renaming the old with an _old postfix. Finally it drops the old table. The advantage here is that the SQL to create the view is stored in the database (flexviews.mview) and can be refreshed with a simple API call which automates the swapping process.

Tracking the views of a given row

I have a site where the users can view quite a large number of posts. Every time this is done I run a query similar to UPDATE table SET views=views+1 WHERE id = ?. However, there are a number of disadvantages to this approach:
There is no way of tracking when the pageviews occur - they are simply incremented.
Updating the table that often will, as far as I understand it, clear the MySQL cache of the row, thus making the next SELECT of that row slower.
Therefore I consider employing an approach where I create a table, say:
object_views { object_id, year, month, day, views }, so that each object has one row pr. day in this table. I would then periodically update the views column in the objects table so that I wouldn't have to do expensive joins all the time.
This is the simplest solution I can think of, and it seems that it is also the one with the least performance impact. Do you agree?
(The site is build on PHP 5.2, Symfony 1.4 and Doctrine 1.2 in case you wonder)
Edit:
The purpose is not web analytics - I know how to do that, and that is already in place. There are two purposes:
Allow the user to see how many times a given object has been shown, for example today or yesterday.
Allow the moderators of the site to see simple view statistics without going into Google Analytics, Omniture or whatever solution. Furthermore, the results in the backend must be realtime, a feature which GA cannot offer at this time. I do not wish to use the Analytics API to retrieve the usage data (not realtime, GA requires JavaScript).

Quote : Updating the table that often will, as far as I understand it, clear the MySQL cache of the row, thus making the next SELECT of that row slower.
There is much more than this. This is database killer.
I suggest u make table like this :
object_views { object_id, timestamp}
This way you can aggregate on object_id (count() function).
So every time someone view the page you will INSERT record in the table.
Once in a while you must clean the old records in the table. UPDATE statement is EVIL :)
On most platforms it will basically mark the row as deleted and insert a new one thus making the table fragmented. Not to mention locking issues .
Hope that helps

Along the same lines as Rage, you simply are not going to get the same results doing it yourself when there are a million third party log tools out there. If you are tracking on a daily basis, then a basic program such as webtrends is perfectly capable of tracking the hits especially if your URL contains the ID's of the items you want to track... I can't stress this enough, it's all about the URL when it comes to these tools (Wordpress for example allows lots of different URL constructs)
Now, if you are looking into "impression" tracking then it's another ball game because you are probably tracking each object, the page, the user, and possibly a weighted value based upon location on the page. If this is the case you can keep your performance up by hosting the tracking on another server where you can fire and forget. In the past I worked this using SQL updating against the ID and a string version of the date... that way when the date changes from 20091125 to 20091126 it's a simple query without the overhead of let's say a datediff function.

First just a quick remark why not aggregate the year,month,day in DATETIME, it would make more sense in my mind.
Also I am not really sure what is the exact reason you are doing that, if it's for a marketing/web stats purpose you have better to use tool made for that purpose.
Now there is two big family of tool capable to give you an idea of your website access statistics, log based one (awstats is probably the most popular), ajax/1pixel image based one (google analytics would be the most popular).
If you prefer to build your own stats database you can probably manage to build a log parser easily using PHP. If you find parsing apache logs (or IIS logs) too much a burden, you would probably make your application ouput some custom logs formated in a simpler way.
Also one other possible solution is to use memcached, the daemon provide some kind of counter that you can increment. You can log view there and have a script collecting the result everyday.

If you're going to do that, why not just log each access? MySQL can cache inserts in continuous tables quite well, so there shouldn't be a notable slowdown due to the insert. You can always run Show Profiles to see what the performance penalty actually is.
On the datetime issue, you can always use GROUP BY MONTH( accessed_at ) , YEAR( accessed_at) or WHERE MONTH(accessed_at) = 11 AND YEAR(accessed_at) = 2009.

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.