Typically, when I make applications that need slugs in the URL, I query the database by the slug to get the content (with an index on the slug field, of course). Using a typical LAMP stack (PHP & MySQL), what are the advantages or disadvantages of doing this from a database perspective? Would it make more sense to always query by id and simply create some sort of route for slugs? Could this application design pose any security problems?
I'm using cakePHP, specifically, so if there are any cake-specific answers, that would be appreciated, but not necessary.
If you are absolutely positive that the slug won't change, you can use it. But numbers are probably faster, and certainly safer.
I just ran queries on a database with 1 000 000 rows; you can see that querying by ID is 230x faster (though the subject column is not indexed). Besides that, as Col Shrapnel said, what if you change the subject? Your URL will be broken, and you cannot keep track of every change in subject.
Numbers are numbers, computers work with numbers :) JK
SELECT tid FROM `forum_threads` WHERE subject = "New BMW X5";
Showing rows 0 - 0 (1 total, Query took 0.0230 sec)
SELECT tid FROM `forum_threads` WHERE tid = 19906;
Showing rows 0 - 0 (1 total, Query took 0.0001 sec)
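For reference, a hedged sketch of how that subject/slug lookup could be indexed (table and column names taken from the benchmark above); with the index in place the gap narrows considerably, though the plain integer ID lookup remains the cheapest:

ALTER TABLE forum_threads ADD INDEX idx_subject (subject);
-- or, if the slug must also be unique:
ALTER TABLE forum_threads ADD UNIQUE INDEX uniq_subject (subject);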
Two things:
Keep in mind that slug values often need to be unique.
Also, with an i18n database schema, queries on the table need an additional join, since the slug is usually tied to a language.
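As a hedged illustration of that extra join (the table and column names here are assumptions, not from the question):

ALTER TABLE article_translations ADD UNIQUE KEY uniq_slug_lang (slug, language_code);

SELECT a.id, t.title
FROM articles a
JOIN article_translations t ON t.article_id = a.id
WHERE t.slug = 'new-bmw-x5' AND t.language_code = 'en';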
I have a result set from MySQL which I'm displaying in a paging scenario using PHP (Prev/Next links).
Many result rows may have "child" rows associated with them, i.e. they share a column containing the same "root number".
Due to the paging and limit arguments in my query, those groups of rows with common root numbers can be split between pages, which makes the display awkward.
I need the query to take that root number column into consideration and NOT split those child rows across to a second page. Instead, it should include all of the rows sharing that root number together on the same page. In my mind, the query would take the root number into account and adjust the LIMIT upwards if the last row in the result has other rows with the same root number.
Seems like the offset value could also be exploited to achieve the desired result, but I'm not sure how I might do that on the fly.
Does anyone have thoughts on how to accomplish this?
SELECT * FROM (`tablename`) LIMIT 3600, 100
Example data:
id name rootnumber
-------------------------------------------------
1 Joe 789
2 Susan 789
3 Bill 789
4 Peter 123
Pagination with LIMIT has several problems. First, you normally count the complete result set, which takes almost as much work for the MySQL server as retrieving the whole set. Second, as soon as you have LIMIT 2000,50, you put as much work on the server as retrieving the first 2050 rows and throwing the first 2000 away. The third problem is that there is no other solution as easy as LIMIT. ;-)
So, you could try different things:
Send bigger data packets of many pages to the client and do the pagination in HTML/JavaScript/CSS; just fetch a new packet when the user reaches the last of those pages. Here you can use the trick of fetching one row more than needed, so you can see whether that extra row has the same rootnumber as the last one shown (in which case you discard that rootnumber completely) or a new rootnumber (in which case the last rootnumber was read completely) - see the sketch after these suggestions.
Give the user better search parameters - no user really reads through 250 lines completely; the user normally just searches for a certain date, a certain keyword, or some property of the root. As soon as the user 'paginates' through months or weeks, she has a clue of when it was. This does have the problem of sometimes very different 'page' sizes, but you could fix that in the client.
The MySQL server is very happy to do searches like where date between '2013-12-01' and '2014-01-01' or where color='blue' and customer.sex='f'; there it can work its magic with indices. Much better than that LIMIT 2000, 50.
This is work, this is not easy, but if you are good you can find better solutions for the customer, who does not really like to read all the lines in between.
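A hedged sketch of the fetch-one-extra-row trick from the first suggestion (table and column names taken from the example data; the page size matches the 100-row LIMIT shown in the question):

-- ask for one row more than the page needs; in PHP, compare the extra row's
-- rootnumber with the last one shown to decide whether the final group is complete
SELECT id, name, rootnumber
FROM tablename
ORDER BY rootnumber, id
LIMIT 3600, 101;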
EDIT:
There are technical solutions to that. Showing entries together when they share the same root number implies that you sort by it. So first run an inner query (and hope your MySQL server is up to it) that fetches only the root numbers:
select t.* from tablename as t
inner join
(
    select distinct rootnumber from tablename order by rootnumber limit 3600, 50  # put your own sort here
) as mt on mt.rootnumber = t.rootnumber
order by t.rootnumber, t.id;
If your MySQL server version uses indices on where ... in (subquery) (try EXPLAIN), you can also use the nicer version
/* TRY EXPLAIN AND BEWARE OF FULL TABLE SCAN!*/
select t.* from tablename as t where t.rootnumber in
    ( select distinct rootnumber from tablename order by rootnumber limit 3600, 50 )
;
But right now that might be really slow. Also note that many MySQL versions simply reject LIMIT inside an IN subquery ("LIMIT & IN/ALL/ANY/SOME subquery" is not supported); in that case stick with the derived-table join above.
But: try to provide search parameters to reduce the table walking to an absolute minimum!
Much Fun!
Hi, I am creating a simple news website and I need to count news views. Currently I have 25,000 rows and 25 columns. The hit count increases on every page reload, like Joomla. How should I structure the tables?
I have 2 approaches to this issue:
Create a column named hits in the content table.
Create a new table that has 2 columns: content_id and hits.
I used the first approach and I think it slows my site down. Will the second approach perform better than the first one? Is there a better approach?
Well, I don't know what your logic in MySQL or PHP is, or what your current table structure for news looks like, but I would suggest you use a stored procedure in MySQL, something like this (the names below are just placeholders):
DELIMITER //
CREATE PROCEDURE add_hit(IN p_content_id INT)  -- procedure and parameter names are placeholders
BEGIN
    UPDATE tblnews SET hits = hits + 1 WHERE content_id = p_content_id;
    SELECT news FROM tblnews WHERE content_id = p_content_id;
END //
DELIMITER ;
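Each page view would then call the procedure with the id of the article being shown (add_hit is just the placeholder name from the sketch above):

CALL add_hit(123);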
and of course use PHP PDO prepared statements for performance
And if you are trying to get the last 10 news items or something like that, then set up indexing on content_id, say a primary key with auto_increment, for better query retrieval; otherwise don't even use a content_id column. I don't think there needs to be any complicated table structure. This should definitely perform well, and even 100,000,000 rows should not make a big difference, I hope. I don't think there is a better solution, because these two queries need to be performed on every page view.
Option 1 sounds the best. Option 2 is redundant because you're just storing the hits - and you would then need a join query to pull them.
I have a table that stores specific updates for all customers.
Some sample table:
record_id | customer_id | unit_id | time_stamp | data1 | data2 | data3 | data4 | more
When I created the application, I did not realize how much this table would grow - currently I have over 10 million records within 1 month. I am facing issues where PHP stops executing due to the amount of time the queries take. Some queries produce top-1 results based on time_stamp + customer_id + unit_id.
How would you suggest handling this type of issue? For example, I could create a new table for each customer, although I don't think that is a good solution.
I am stuck with no good solution in mind.
If you're on the cloud (where you're charged for moving data between the database and the application server), ignore this answer.
Move all logic to the server
The fastest query is a SELECT with a WHERE on the PRIMARY key. It won't matter how large your database is; it will come back just as fast as with a table of 1 row (as long as your hardware isn't unbalanced).
I can't tell exactly what you're doing with your query, but first pull all of the sorting and limiting data into PHP. Once you've got what you need, SELECT the data directly with a WHERE on record_id (I assume that's your PRIMARY key).
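For instance, a hedged example of that final fetch once the ids have been worked out in PHP (the table name customer_updates is an assumption; record_id is assumed to be the primary key):

SELECT * FROM customer_updates WHERE record_id IN (101, 202, 303);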
It looks like your on demand data is pretty computationally intensive and huge, so I recommend using a faster language. http://blog.famzah.net/2010/07/01/cpp-vs-python-vs-perl-vs-php-performance-benchmark/
Also, when you start sorting and limiting on the server rather than the db, you can start identifying shortcuts to speed it up even further.
This is what the server's for.
I suggest you use partitioning of your data following some criteria.
You can make horizontal or vertical partition of your data.
For example, group your customer_id values into 10 partitions, using the id modulo 10.
So customer_ids ending in 0 go to partition 0, those ending in 1 go to partition 1, and so on.
MySQL can do this for you easily.
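A hedged sketch of what that could look like with MySQL's built-in hash partitioning (the table name is an assumption; note that MySQL requires the partitioning column to be part of every unique key, including the primary key):

ALTER TABLE customer_updates
    PARTITION BY HASH(customer_id)
    PARTITIONS 10;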
What is the record count within the tables? With relational databases, it's often not how much data you have (millions of rows are nothing to a relational database), it's how you're retrieving it.
From the look of your select, you probably just need to optimize the statement itself and avoid the multiple subselects, which are probably the main cause of the slowdown. Try running an EXPLAIN on that statement, or just get the ids and run the interior select individually on the ids of the records that you've actually found and retrieved in the first run.
The very fact that you have those subselects within your overall statement means that you haven't optimized very far into the process anyway. For example, you could run a nightly or hourly cron job that aggregates sets like the one created by SELECT gps_unit.idgps_unit into a new table, and then run your selects against that previously generated table instead of creating blocks of data equivalent to a table on the fly.
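As a hedged illustration of that pre-aggregation idea (the original query isn't shown, so the table and column names here are assumptions):

-- run hourly or nightly from cron: rebuild a small summary table so reports
-- read from it instead of recomputing the subselects over the raw rows
CREATE TABLE IF NOT EXISTS unit_latest_summary (
    unit_id INT UNSIGNED NOT NULL PRIMARY KEY,
    last_seen DATETIME NOT NULL
);

REPLACE INTO unit_latest_summary (unit_id, last_seen)
SELECT unit_id, MAX(time_stamp)
FROM customer_updates
GROUP BY unit_id;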
If you find yourself unable to effectively optimize that select statement, you have "final" options like:
Categorize via some criteria and split into different tables.
Keep a deep archive, such that anything past the first year or so is migrated to a less used table and requires special retrieval.
Finally, if you have a lot of small, low-value data, you may be able to archive certain tables entirely, keep them around in file form only, and truncate everything past a certain date. With web-tracking data that isn't that important and is kind of spammy, I often end up doing this after a few years, when the data is really not going to do anyone any good any more.
I have recently written a survey application that has done its job and all the data is gathered. Now I have to analyze the data, and I'm having some time issues.
I have to find out how many people selected what option and display it all.
I'm using this query, which does do its job:
SELECT COUNT(*)
FROM survey
WHERE users = ? AND `table` = ? AND col = ? AND row = ? AND selected = ?
GROUP BY users, `table`, col, row, selected
As evident from the "?" placeholders, I'm using MySQLi (in PHP) to fetch the data when needed, but I fear this is what is causing it to be so slow.
The table consists of all the elements above (+ a unique ID) and all of them are integers.
To explain some of the fields:
Each survey was divided into 3 or 4 tables (sized from 2x3 to 5x5) with a 1 to 10 happiness grade to select from. (Questions are on the right and top of the table; you answer where the questions intersect.)
users - age groups
table, row, col - explained above
selected - dooooh explained above
Now, with the surveys complete and around 1 million entries in the table, the query is getting very slow. Sometimes it takes about 3 minutes; sometimes (I guess) the time limit expires and you get no data at all. I also don't have access to the full database, just my empty "testing" one, since the customer is kind of paranoid :S (and his server seems to be a bit slow).
Now (after the initial essay) my questions are: I left indexing out intentionally because, with a lot of data being written during the survey, it would have been a bad idea. But since no new data is coming in at this point, would it make sense to index all the fields of the table? How much sense does it make to index integers that never go above 10? (As you can guess, I haven't got a clue about indexes.) Do I need the primary unique ID in this table?
I read somewhere that indexing may help GROUP BY, but only if you group by the leading columns of an index - and since my ID is first and, from my point of view, useless, can I remove it and gain anything by it?
Is there another way to write my query that would basically do the same thing but in a shorter period of time?
Thanks for all your suggestions in advance!
Add an index on the columns that you GROUP BY or use in the WHERE. So that's ONE index incorporating users, table, col, row and selected in your case (a sketch is given at the end of this answer).
Some quick rules:
Combine fields so that the WHERE columns come first and the GROUP BY elements come last.
If you have other queries that only use part of it (e.g. users, table, col and selected), then put the missing value (row, in this example) last.
Don't use too many indexes, as each one slows table updates marginally - so on a really large system you need to balance queries against indexes.
Edit: do you need the GROUP BY users, col, row at all, as these are already fixed by the WHERE? If the WHERE has already filtered them out, you only need to group by selected.
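As mentioned above, a hedged sketch of that composite index (column names taken from the question; `table` is backticked because TABLE is a reserved word in MySQL):

ALTER TABLE survey
    ADD INDEX idx_survey_lookup (users, `table`, col, row, selected);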
I'm working on the next version of a local online dating site, PHP & MySQL based and I want to do things right. The user table is quite massive and is expected to grow even more with the new version as there will be a lot of money spent on promotion.
The current version, which I guess is 7-8 years old, was probably done by someone not very knowledgeable in PHP and MySQL, so I have to start over from scratch.
The community currently has 200k+ users and is expected to grow to 500k-1mil in the next one or two years. There are more than 100 attributes for each user's profile, and I have to be able to search by at least 30-40 of them.
As you can imagine, I'm a little wary of making a table with 200k rows and 100 columns. My predecessor split the user table in two: one with the most used and searched columns and one with the rest (and bulk) of the columns. But this led to big synchronization problems between the two tables.
So, what do you think is the best way to go about it?
This is not an answer per se, but since a few answers here suggested the attribute-value model, I just wanted to jump in and share my experience.
I once tried this model with a table of 120+ attributes (growing by 5-10 every year) and about 100k+ new rows every 6 months; the indexes grew so big that it took forever to add or update a single user_id.
The problem I find with this type of design (not that it's completely unfit for every situation) is that you need to put a primary key on (user_id, attrib) on that second table. Not knowing the potential length of attrib, you would usually use a generous length, thus increasing the indexes. In my case, attribs could have from 3 to 130 chars. The value column most certainly suffers from the same assumption.
And as the OP said, this leads to synchronization problems. Imagine if every attribute (or say at least 50% of them) NEEDS to exist.
Also, as the OP says, the search needs to be done on 30-40 attributes, and I just can't imagine how 30-40 joins would be efficient, or even a group_concat(), given its length limitation.
My only viable solution was to go back to a table with as many columns as there are attributes. My indexes are now much smaller, and searches are easier.
EDIT: Also, there are no normalization problems: either have lookup tables for attribute values or make them ENUMs.
EDIT 2: Of course, one could say I should have a lookup table for the possible attribute values (reducing index sizes), but then I would need a join on that table.
What you could do is split the user data across two tables.
1) Table: user
This will contain the "core" fixed information about a user such as firstname, lastname, email, username, role_id, registration_date and things of that nature.
Profile related information can go in its own table. This will be an infinitely expandable table with a key => val nature.
2) Table: user_profile
Fields: user_id, option, value
user_id: 1
option: profile_image
value: /uploads/12/myimage.png
and
user_id: 1
option: questions_answered
value: 24
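A hedged sketch of those two tables (the field names come from this answer, the column types are assumptions; `option` is backticked because OPTION is a reserved word in MySQL):

CREATE TABLE user (
    user_id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    firstname VARCHAR(100),
    lastname VARCHAR(100),
    email VARCHAR(255) NOT NULL,
    username VARCHAR(50) NOT NULL,
    role_id INT UNSIGNED,
    registration_date DATETIME
);

CREATE TABLE user_profile (
    user_id INT UNSIGNED NOT NULL,
    `option` VARCHAR(64) NOT NULL,
    value TEXT,
    PRIMARY KEY (user_id, `option`)
);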
Hope this helps,
Paul.
The entity-attribute-value model might be a good fit for you:
http://en.wikipedia.org/wiki/Entity-attribute-value_model
Rather than having 100 (and growing) columns, add one table with three columns:
user_id, property, value.
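A hedged sketch of such a table (the table name, column types and secondary index are assumptions; the matching queries in the longer answer further down assume a similar shape with numeric, ordinal values):

CREATE TABLE user_attributes (
    user_id INT UNSIGNED NOT NULL,
    property SMALLINT UNSIGNED NOT NULL,   -- which of the ~100 attributes
    value INT NOT NULL,                    -- ordinal value of that attribute
    PRIMARY KEY (user_id, property),
    KEY idx_property_value (property, value)
);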
In general, you shouldn't sacrifice database integrity for performance.
The first thing I would do about this is create a table with 1 million rows of dummy data and test some typical queries on it, using a stress tool like ab. It will most probably turn out that it performs just fine - 1 million rows is a piece of cake for MySQL. So, before trying to solve a problem, make sure you actually have it.
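A hedged sketch of how that dummy data could be generated (table and column names here are assumptions): insert a few seed rows by hand, then keep doubling the table by selecting it into itself until it passes a million rows.

INSERT INTO users_test (age, city_id, height_cm)
SELECT age, city_id, height_cm FROM users_test;
-- each run doubles the row count; roughly 20 runs from a handful of seed rows reaches 1 million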
If you find the performance poor and the database really turns out to be a bottleneck, consider general optimizations, like caching (on all levels, from mysql query cache to html caching), getting better hardware etc. This should work out in most cases.
In general you should always get the schema formally correct before you worry about performance!
That way you can make informed decisions about adapting the schema to resolve specific performance problems, rather than guessing.
You definitely should go down the 2-table route. This will significantly reduce the amount of storage, the code complexity, and the effort of changing the system to add new attributes.
Assuming that each attribute can be represented by an ordinal number, and that you're only looking for symmetrical matches (i.e. you're trying to match people based on similar attributes, rather than an expression of intention)...
At a simple level, the query to find suitable matches may be very expensive. Effectively you are looking for nodes within the same proximity in an N-dimensional space; unfortunately most relational databases aren't really set up for this kind of operation (I believe PostgreSQL has support for this). So most people would probably start with something like:
SELECT candidate.id,
       COUNT(*)
FROM users candidate,
     attributes candidate_attrs,
     attributes current_user_attrs
WHERE current_user_attrs.user_id=$current_user
AND candidate.id<>$current_user
AND candidate.id=candidate_attrs.user_id
AND candidate_attrs.attr_type=current_user_attrs.attr_type
AND candidate_attrs.attr_value=current_user_attrs.attr_value
GROUP BY candidate.id
ORDER BY COUNT(*) DESC;
However, this forces the system to compare every available candidate to find the best match. Apply a little heuristics and you get a much more effective query:
SELECT candidate.id,
       COUNT(*)
FROM users candidate,
     attributes candidate_attrs,
     attributes current_user_attrs
WHERE current_user_attrs.user_id=$current_user
AND candidate.id<>$current_user
AND candidate.id=candidate_attrs.user_id
AND candidate_attrs.attr_type=current_user_attrs.attr_type
AND candidate_attrs.attr_value
    BETWEEN current_user_attrs.attr_value-$tolerance
        AND current_user_attrs.attr_value+$tolerance
GROUP BY candidate.id
ORDER BY COUNT(*) DESC;
(The value of $tolerance will affect the number of rows returned and the query performance, provided you've got an index on (attr_type, attr_value).)
This can be further refined into a points scoring system:
SELECT candidate.id,
       SUM(1/(1+
           (candidate_attrs.attr_value - current_user_attrs.attr_value)
           *(candidate_attrs.attr_value - current_user_attrs.attr_value)
       )) as match_score
FROM users candidate,
     attributes candidate_attrs,
     attributes current_user_attrs
WHERE current_user_attrs.user_id=$current_user
AND candidate.id<>$current_user
AND candidate.id=candidate_attrs.user_id
AND candidate_attrs.attr_type=current_user_attrs.attr_type
AND candidate_attrs.attr_value
    BETWEEN current_user_attrs.attr_value-$tolerance
        AND current_user_attrs.attr_value+$tolerance
GROUP BY candidate.id
ORDER BY match_score DESC;
This approach lets you do lots of different things - including searching by a subset of attributes, e.g.
SELECT candidate.id,
       SUM(1/(1+
           (candidate_attrs.attr_value - current_user_attrs.attr_value)
           *(candidate_attrs.attr_value - current_user_attrs.attr_value)
       )) as match_score
FROM users candidate,
     attributes candidate_attrs,
     attributes current_user_attrs,
     attribute_subsets s
WHERE current_user_attrs.user_id=$current_user
AND candidate.id<>$current_user
AND candidate.id=candidate_attrs.user_id
AND candidate_attrs.attr_type=current_user_attrs.attr_type
AND s.subset_name=$required_subset
AND s.attr_type=current_user_attrs.attr_type
AND candidate_attrs.attr_value
    BETWEEN current_user_attrs.attr_value-$tolerance
        AND current_user_attrs.attr_value+$tolerance
GROUP BY candidate.id
ORDER BY match_score DESC;
Obviously this does not accommodate non-ordinal data (e.g. birth sign, favourite pop band). Without knowing a lot more about the structure of the existing data, it's rather hard to say exactly how effective this will be.
If you want to add more attributes, then you don't need to make any changes to your PHP code nor the database schema - it can be completely data-driven.
Another approach would be to identify stereotypes - i.e. reference points within the N-dimensional space - then work out which of these a particular user is closest to. You collapse all the attributes down to a single composite identifier, and then you just need to apply the same approach to find the best match within the subset of candidates who have also been matched to that stereotype.
Can't really suggest anything without seeing the schema. Generally, a MySQL database should be normalized to at least 3NF or BCNF. It rather sounds like it is not normalized right now, with 100 columns in 1 table.
Also, you can easily enforce referential integrity with foreign keys, using transactions and the InnoDB engine.
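As a hedged example, using the hypothetical user/user_profile tables sketched in an earlier answer (both tables must use the InnoDB engine for the constraint to be enforced):

ALTER TABLE user_profile
    ADD CONSTRAINT fk_profile_user
    FOREIGN KEY (user_id) REFERENCES user (user_id)
    ON DELETE CASCADE;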