Check input data with MySQL schema or PHP? - php

Suppose I'd like to check whether a comment is duplicated or not.
I have two options:
1) Run a query against the database and check the result:
select * from comments where content='$sanitized_content' and post_id=$id
2) Create a unique index on content and post_id and catch the MySQL error on insert.
It's important for my complex and busy app to decrease the number of queries to the database as much as possible. However, the first option is more common and readable.
You can generalize this question to other situations.

MySQL is definitely faster than PHP. I would always prefer using a failing INSERT or a REPLACE against a suitable key than checking in PHP.
The only exception would probably be if your key becomes very complex, which will obviously create overhead on MySQL for all queries run against that table. However, there isn't a one-size-fits-all answer to the question of what is too complex to be worthwhile; it's largely a matter of real-life testing.
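If you go with option 2, the PHP side mostly reduces to catching the duplicate-key error. Here is a minimal sketch using PDO; the comments table layout, the content_hash column, and the $postId/$content variables are illustrative assumptions, not taken from the question:

<?php
// Sketch of option 2: let a UNIQUE key reject duplicates and catch the error.
// Assumes a table roughly like:
//   CREATE TABLE comments (
//     id INT AUTO_INCREMENT PRIMARY KEY,
//     post_id INT NOT NULL,
//     content TEXT NOT NULL,
//     content_hash CHAR(40) NOT NULL,   -- hash, since a long TEXT can't be fully unique-indexed
//     UNIQUE KEY uniq_comment (post_id, content_hash)
//   );

$pdo = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass', [
    PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION,
]);

$stmt = $pdo->prepare(
    'INSERT INTO comments (post_id, content, content_hash) VALUES (?, ?, ?)'
);

// $postId and $content are assumed to hold already-sanitized input.
try {
    $stmt->execute([$postId, $content, sha1($content)]);
    echo "Comment saved.";
} catch (PDOException $e) {
    // MySQL error 1062 = duplicate entry for a unique key
    if ($e->errorInfo[1] == 1062) {
        echo "Duplicate comment ignored.";
    } else {
        throw $e; // some other problem
    }
}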

I prefer option 2: your PHP script will need less code, and MySQL will check automatically whether the comment is a duplicate or not. That is exactly what UNIQUE is for.

Related

Using database as cache. Rebuilding vs updating

My question is about the most efficient way to store an entire JSON document in a database table and refresh it periodically.
Essentially, I'm calling the Google Analytics API once every 15 minutes via a cron job to pull out data about my site. I'm dumping this information into a SQL table so that my front end application can search, sort and consume it. This JSON is paginated, such that only 5,000 rows come through at a time. I'll be storing as many as 100,000.
What I'm trying to do is optimize the way I rebuild the table. The most naive approach would be to truncate the table and insert every row from the JSON fresh. I have the feeling this is a bad approach, but maybe I'm underestimating SQL.
I could also update each existing row and add new rows as necessary. However, I'm struggling with how I should delete old rows that might not be in the freshest JSON object.
Or perhaps I'm missing a more obvious solution.
The real answer to this question is that it depends on what works best. As I am not familiar with the data, I can't give you a straightforward answer, but here are some guidelines.
Firstly, 100,000 rows is nothing for a SQL server to handle, so truncating the table and inserting the values fresh might actually be workable. However, if this data were to grow substantially, this might not be a solution that scales well. The main disadvantage of this approach is that for a period of time the table will be empty, and this might be a problem for some users.
Summary of this approach:
Easy and quick to code and maintain.
Truncate will always be fast, but insert will slow down as volumes increase.
Data will be offline during truncate and insert cycle.
Inserting and updating as we go along is known as an upsert/merge. This approach will involve more work, but the data will always be online. One of the difficulties you face is comparing the JSON data with the SQL data (finding differences between the native JSON dataset and the SQL table); doing this in application code is going to be inefficient and cumbersome.
So I would create a staging table for the JSON. This table will be an exact copy of the final production table. I would then use LEFT and RIGHT JOINs to insert the new data and remove the deleted data. You could also create a hash for each row and compare these hashes to identify the rows that have changed, then update only where necessary. All of these transformations can be handled in a simple SQL script; see the sketch after the summary below. Yes, you are underestimating SQL a bit...
Summary of this approach:
More complicated to code, but not difficult; simple joins and hash comparisons will do the trick.
Only inserts new values, updates changed values and deletes old values. As volumes grow, this solution will eventually outperform the truncate-and-insert cycle.
Data remains online all the time.
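To make the staging-table merge concrete, here is a rough sketch in PHP/SQL. The table names (analytics_data, analytics_staging), the id key and the row_hash column are all assumptions for illustration; the staging table is presumed to be already loaded with the fresh JSON rows:

<?php
// Sketch of the staging-table merge. analytics_data is the production table,
// analytics_staging an identical copy holding the freshly fetched rows.
$pdo = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');

// 1. Insert rows that exist only in the staging table.
$pdo->exec(
    "INSERT INTO analytics_data (id, metric, value, row_hash)
     SELECT s.id, s.metric, s.value, s.row_hash
     FROM analytics_staging s
     LEFT JOIN analytics_data d ON d.id = s.id
     WHERE d.id IS NULL"
);

// 2. Update rows whose hash has changed.
$pdo->exec(
    "UPDATE analytics_data d
     JOIN analytics_staging s ON s.id = d.id
     SET d.metric = s.metric, d.value = s.value, d.row_hash = s.row_hash
     WHERE d.row_hash <> s.row_hash"
);

// 3. Delete rows that no longer exist in the fresh data.
$pdo->exec(
    "DELETE d FROM analytics_data d
     LEFT JOIN analytics_staging s ON s.id = d.id
     WHERE s.id IS NULL"
);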
If you need clarification around this please ask away.

Simulate MySQL connection to analyze queries to rebuild table structure (reverse-engineering tables)

I have just been tasked with recovering/rebuilding an extremely large and complex website that had no backups and was fully lost. I have a complete (hopefully) copy of all the PHP files, however I have absolutely no clue what the database structure looked like (other than that it certainly had at least 50 or so tables...so fairly complex). All data has been lost and the original developer was fired about a year ago in a fiery feud (so I am told). I have been a PHP developer for quite a while and am plenty comfortable trying to sort through everything and get the application/site back up and running...but the lack of a database will be a huge struggle. So...is there any way to simulate a MySQL connection to some software that will capture all incoming queries and attempt to use the requested field and table names to rebuild the structure?
It seems to me that if I start clicking through the application and it passes a query for
SELECT name, email, phone FROM contact_table WHERE contact_id='1'
...there should be a way to capture that info and assume there was a table called "contact_table" that had at least 4 fields with those names... If I can do that repetitively, each time adding some sample data to the discovered fields and then moving on to another page, then eventually I should have a rough copy of most of the database structure (at least all public-facing parts). This would be MUCH easier than manually reading all the code and pulling out every reference, reading all the joins and subqueries, and sorting through it all manually.
Anyone ever tried this before? Any other ideas for reverse-engineering the database structure from PHP code?
mysql> SET GLOBAL general_log=1;
With this configuration enabled, the MySQL server writes every query to a log file (datadir/hostname.log by default), even those queries that have errors because the tables and columns don't exist yet.
http://dev.mysql.com/doc/refman/5.6/en/query-log.html says:
The general query log can be very useful when you suspect an error in a client and want to know exactly what the client sent to mysqld.
As you click around in the application, it should generate SQL queries, and you can have a terminal window open running tail -f on the general query log. As you see queries run that reference tables or columns that don't exist yet, create those tables and columns. Then repeat clicking around in the app.
A number of things may make this task even harder:
If the queries use SELECT *, you can't infer the names of columns or even how many columns there are. You'll have to inspect the application code to see what column names are used after the query result is returned.
If INSERT statements omit the list of column names, you can't know what columns there are or how many. On the other hand, if INSERT statements do specify a list of column names, you can't know if there are more columns that were intended to take on their default values.
Data types of columns won't be apparent from their names, nor string lengths, nor character sets, nor default values.
Constraints, indexes, primary keys, foreign keys won't be apparent from the queries.
Some tables may exist (for example, lookup tables), even though they are never mentioned by name by the queries you find in the app.
Speaking of lookup tables, many databases have sets of initial values stored in tables, such as all possible user types and so on. Without the knowledge of the data for such lookup tables, it'll be hard or impossible to get the app working.
There may have been triggers and stored procedures. Procedures may be referenced by CALL statements in the app, but you can't guess what the code inside triggers or stored procedures was intended to be.
This project is bound to be very laborious, time-consuming, and involve a lot of guesswork. The fact that the employer had a big feud with the developer might be a warning flag. Be careful to set the expectations so the employer understands it will take a lot of work to do this.
PS: I'm assuming you are using a recent version of MySQL, such as 5.1 or later. If you use MySQL 5.0 or earlier, you should just add log=1 to your /etc/my.cnf and restart mysqld.
Crazy task. Is the code such that the DB queries are at all abstracted? Could you replace the query functions with something which would log the tables, columns and keys, and/or actually create the tables or alter them as needed, before firing off the real query?
Alternatively, it might be easier to do some text processing, regex matching, grep/sort/uniq on the queries in all of the PHP files. The goal would be to get it down to a manageable list of all tables and columns in those tables.
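As a starting point for the text-processing route, here is a crude sketch that scans the PHP sources for literal query strings and lists the table names it can spot. The site path is a placeholder, and the regex will miss any query that is built up dynamically:

<?php
// Crude sketch: scan all .php files for SQL keywords and list table names.
// Only catches literal queries; dynamically built SQL will be missed.
$tables = [];

$files = new RecursiveIteratorIterator(
    new RecursiveDirectoryIterator('/path/to/site')   // placeholder site root
);

foreach ($files as $file) {
    if ($file->getExtension() !== 'php') {
        continue;
    }
    $src = file_get_contents($file->getPathname());

    // Grab the identifier following FROM / JOIN / INTO / UPDATE.
    if (preg_match_all('/\b(?:FROM|JOIN|INTO|UPDATE)\s+`?(\w+)`?/i', $src, $m)) {
        foreach ($m[1] as $t) {
            $tables[strtolower($t)] = true;
        }
    }
}

ksort($tables);
echo implode("\n", array_keys($tables)), "\n";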
I once had a similar task, fortunately I was able to find an old backup.
If you could find a way to extract the queries, say by regex-matching all of the occurrences of mysql_query or whatever extension was used to query the database, you could then use something like php-sql-parser to parse the queries and hopefully from that you would be able to get a list of most tables and columns. However, that is only half the battle. The other half is determining the data types for every single column, and that would be rather impossible to do automatically from PHP. It would basically require you to inspect it line by line. There are best practices, but who's to say that the old dev followed them? Determining whether a column called "date" should be stored as DATE, DATETIME, INT, or VARCHAR(50) with some sort of ugly manual string handling can only be done by looking at the actual code.
Good luck!
You could build some triggers with the BEFORE action time, but unfortunately this will only work for INSERT, UPDATE, or DELETE commands.
http://dev.mysql.com/doc/refman/5.0/en/create-trigger.html

SELECT * FROM versus SELECT specified FROM

What will be faster?
SELECT * FROM
or
SELECT specified FROM
Background: the table has one field (specified) which is also the primary index.
In your particular case it may very well be the same, but as a matter of good practice, you should always specify the columns you want.
In addition to the various good reasons Dark Falcon put in a comment, it also creates a form of self-documentation in your application code, since the query itself lists each field you're expecting.
As a matter of good practice, it's usually better to explicitly specify the columns you want, regardless of the performance implications you're concerned about in this question.
But in general, the answer will depend heavily on your version of mysql. Profile it and see:
explain select * from ...;
explain select specified from ...;
I suspect strongly that this is a case of premature optimization, and that you don't really need to know which is faster.
IMHO the explicit version will be faster, because MySQL doesn't need to look up what fields the table contains.
Depending on table structure (including indexes) it may not make a difference -- running some benchmarks and using EXPLAIN SELECT to see where things can be improved will help you along the way. But in general, if you know you only want n fields, only select n fields.
Just specify the columns, in case more columns are added in the future that you don't want to retrieve. In any case it is better to be specific.
Write a console app with two functions, one for each method, loop 1000 times on each, and print out the average time it took. This would be your fastest way to test the performance.
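A quick-and-dirty harness along those lines might look like this; the PDO connection details, the my_table name and the specified column are placeholders for your own setup:

<?php
// Rough timing harness: run each query 1000 times and report the average.
$pdo = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');

function benchmark(PDO $pdo, $sql, $iterations = 1000)
{
    $start = microtime(true);
    for ($i = 0; $i < $iterations; $i++) {
        $pdo->query($sql)->fetchAll();
    }
    return (microtime(true) - $start) / $iterations;
}

printf("SELECT *:         %.6f s/query\n",
    benchmark($pdo, 'SELECT * FROM my_table'));
printf("SELECT specified: %.6f s/query\n",
    benchmark($pdo, 'SELECT specified FROM my_table'));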
Generally it's better, and I think faster, to specify the columns in your SQL query to avoid fetching data you don't need.
The "select * from" format imho is just a fast way when quering as dba you just want a quick glance at the table. Even though that will work in programming I wouldn't recommend it, by listing the columns as a programmer it keeps you from having to go back and forth to your db and see what column you want to use or query for. It keeps you in one spot..that's just me though..this really is up to you and how you want to program.
You are looking at this bass ackwards. Do you need the content of the column or not? If you don't and you fetch it anyway, that will take longer than not fetching it.
Parsing the SQL to check the column names, order them, and potentially alias them is trivial compared to flooding the network with loads of data you don't need.

Which is faster: process data inside PHP or make several MySQL requests?

I have some extensive data to process: a very big table in MySQL. The processing takes place three times. Right now I am doing one request to MySQL and then in PHP I run a while loop three times to extract the necessary values.
Is what I'm currently doing the best option, or would it use less server resources to make three separate requests to MySQL with certain filters?
Use MySQL as much as possible in this situation. It is potentially much quicker.
EDIT: As kindly pointed out below, it is not always better to use SQL queries over processing in PHP, and as such, the statement above may be misleading.
However, from the wording of this question I had assumed he was returning a large record set from the MySQL query and using several while loops to extract only certain values from it. If this assumption is true, then I believe it would be quicker and less resource-intensive to perform the whole operation within the MySQL query.
As this answer isn't very helpful to people finding this with similar problems it would be grand if the original poster could post some code to clarify the exact situation.

Is it possible to do count(*) while doing insert...select... query in mysql/php?

Is it possible to do a simple count(*) query in a PHP script while another PHP script is doing insert...select... query?
The situation is that I need to create a table with ~1M or more rows from another table, and while inserting, I do not want the user to feel the page is freezing. So I am trying to keep updating a count, but by using a select count(*) from table while the background insert is running, I get only 0 until the insert is completed.
So is there any way to ask MySQL to return a partial result first? Or is there a fast way to do a series of inserts with data fetched from a previous select query while keeping about the same performance as an insert...select... query?
The environment is PHP 4.3 and MySQL 4.1.
Without reducing performance? Not likely. With a little performance loss, maybe...
But why are you regularly creating tables and inserting millions of rows? If you do this only very seldom, can't you just warn the admin (presumably the only one allowed to do such a thing) that it takes a long time? If you're doing this all the time, are you really sure you're not doing it wrong?
I agree with Stein's comment that this is a red flag if you're copying 1 million rows at a time during a PHP request.
I believe that in a majority of cases where people are trying to micro-optimize SQL, they could get much greater performance and throughput by approaching the problem in a different way. SQL shouldn't be your bottleneck.
If you're doing a single INSERT...SELECT, then no, you won't be able to get intermediate results. In fact this would be a Bad Thing, as users should never see a database in an intermediate state showing only a partial result of a statement or transaction. For more information, read up on ACID compliance.
That said, the MyISAM engine may play fast and loose with this. I'm pretty sure I've seen MyISAM commit some but not all of the rows from an INSERT...SELECT when I've aborted it part of the way through. You haven't said which engine your table is using, though.
The other users can't see the insertion until it's committed. That's normally a good thing, since it makes sure they can't see half-done data. However, if you want them to see intermediate data, you could throw in an occasional call to "commit" while you're inserting.
By the way, don't let anybody tell you to turn autocommit on. That's a HUGE time waster. I have a "delete and re-insert" job on my database that takes 1/3rd as long when I turn off autocommit.
Just to be clear, MySQL 4 isn't configured by default to use transactions. It uses the MyISAM table type which locks the entire table for each insert, if I remember correctly.
Your best bet would be to use one of the MySQL bulk insertion functions, such as LOAD DATA INFILE, as these are dramatically faster at inserting large amounts of data. As for the counting, well, you could break the inserts into N groups of 1000 (or Y) then divide your progress meter into N sections and just update it on each group's request.
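A sketch of that chunked approach, using the old mysql_* API to match the PHP 4.3 environment; the table names, the id ordering column, the chunk size, and the copy_progress status table are placeholders:

<?php
// Copy source_table into target_table in chunks so the front end
// can read a progress counter while the copy runs.
mysql_connect('localhost', 'user', 'pass');   // placeholder credentials
mysql_select_db('app');

$chunk  = 1000;
$offset = 0;
$total  = mysql_result(mysql_query('SELECT COUNT(*) FROM source_table'), 0);

while ($offset < $total) {
    mysql_query(sprintf(
        'INSERT INTO target_table
         SELECT * FROM source_table
         ORDER BY id
         LIMIT %d, %d',
        $offset, $chunk
    ));

    $offset += $chunk;

    // Record progress somewhere the front end can read it,
    // e.g. a one-row status table or a file.
    mysql_query(sprintf(
        'UPDATE copy_progress SET rows_done = %d', min($offset, $total)
    ));
}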
Edit: Another thing to consider is, if this is static data for a template, then you could use a "select into" to create a new table with the same data. Not sure what your application is, or the intended functionality, but that could work as well.
If you can get to the console, you can ask various status questions that will give you the information you are looking for. There's a command that goes something like SHOW PROCESSLIST.
