php speed vs. mysql speed

I'm making a feeds aggregator using PHP and MySQL, and writing a paper about it which must contain some math.
I have a table feeds (id, title, description, link) where id is the primary key.
When I collect new feeds I need to add them to the database, but I must not let any duplicates in. I see two ways to do that:
1) for each feed run something like this:
SELECT id FROM feeds
WHERE title=$feed.title AND description=$feed.description;
And see if it returns any feeds.
2) Assume that feeds which came from different sources never match. In this case:
for each source of feeds run something like this:
SELECT title, description, source FROM feeds WHERE source=$source;
Then use PHP to match collected feeds against this array.
I admit, I don't have any performance problem. But I'm writing a paper about it and I must find some way to apply math to the problem. I've chosen the second approach because it allows me to go into mathematical detail about why it can be faster.
But I suspect that PHP might do the work much more slowly than MySQL would, and it might actually be faster to run a query for each feed.
Am I right? Is there any practical reason to choose the second approach? How can I justify my choice?

Have you considered using a composite unique index instead?
alter table feeds add unique index(title, description);
This would prevent adding new rows when the title and description, taken together, are already present in the table.
You would have to do a large number of inserts into a large database to get meaningful performance numbers, though.
Edit:
This does have one downfall: in MySQL, NULL values are never considered equal in a unique index, so you could end up with several rows where title=NULL and description=NULL. You should check for this before attempting to insert the data.
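A minimal PHP sketch of how that could look, assuming a mysqli connection in $db and the collected feeds as associative arrays ($feeds and the key names are placeholders, not code from the question):
// Reject NULLs ourselves, since the unique index won't stop duplicate NULLs,
// and let INSERT IGNORE silently drop real duplicates.
$stmt = $db->prepare('INSERT IGNORE INTO feeds (title, description, link) VALUES (?, ?, ?)');
foreach ($feeds as $feed) {
    if ($feed['title'] === null || $feed['description'] === null) {
        continue;
    }
    $stmt->bind_param('sss', $feed['title'], $feed['description'], $feed['link']);
    $stmt->execute();
}
$stmt->close();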

For the math, consider what the scaling implications are for your database. How long does an add of a new feed take for the first feed? How about the 10,000th? What about the 10 millionth? In what way does the increase in number of existing feeds affect the speed by which a new feed can be added?
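As a rough sketch of the kind of math that question leads to (a cost model under assumptions, not a measurement: treat the per-query round trip, per-row transfer and per-comparison work as constants), with n total stored feeds, m feeds from one source and k newly collected items:
T_{\mathrm{sql}}(k, n) \approx k \, ( c_{\mathrm{rt}} + c_{\mathrm{idx}} \log n )
T_{\mathrm{php}}(k, m) \approx c_{\mathrm{rt}} + c_{\mathrm{tx}} \, m + c_{\mathrm{cmp}} \, k m
Approach 1 pays the round trip k times but each indexed lookup only grows like log n; approach 2 pays one round trip but must transfer and compare against all m rows of the source (and the k*m term drops to roughly k + m if the fetched rows are put into a PHP hash map first).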

PHP and MySQL both run on the server side, unlike JavaScript, which runs client-side in the browser.
Unless you have millions of rows, it won't be slow either way.

Why not just add a unique index on title and description? I don't know if it's the best option performance-wise, but it will handle the logic for you in the most correct way.

I think the fastest way would be to put a UNIQUE index on the source column, and simply do an INSERT IGNORE, sending all your collected feeds in one query without even manually checking for duplicates. Not only will this save you the processing/network overhead of doing one query per feed, the index will ensure you don't have any duplicates (assuming source is actually unique per feed).
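A minimal sketch of that single-query idea, assuming the unique index is already in place, a mysqli connection in $db and the collected items in $feeds (all of these names are placeholders):
// Build one multi-row INSERT IGNORE; the unique index drops duplicates,
// so nothing has to be checked row by row from PHP.
$placeholders = [];
$params = [];
foreach ($feeds as $feed) {
    $placeholders[] = '(?, ?, ?)';
    array_push($params, $feed['title'], $feed['description'], $feed['link']);
}
if ($placeholders) {
    $sql = 'INSERT IGNORE INTO feeds (title, description, link) VALUES ' . implode(', ', $placeholders);
    $stmt = $db->prepare($sql);
    $stmt->bind_param(str_repeat('s', count($params)), ...$params);
    $stmt->execute();
    $stmt->close();
}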


RSS aggregator; how to insert only new items

A tutorial here shows how to build an aggregator in PHP, but I'm having some trouble finding the best way to avoid inserting the same items into my database more than once.
If I were to run the script on http://visualwebsiteoptimizer.com/split-testing-blog/feed/ and then run it again in 5 minutes it'll just insert the same items again.
That tutorial just has a specified interval at which it will reload the RSS feed and save all the items.
I was wondering if RSS implements some request header that will only send items after a certain date. I see here that I could use lastBuildDate and maybe ignore channels whose date is older than the last fetch, but it doesn't say whether that element is mandatory.
My question here is: how can I check RSS feeds regularly and insert it in a database without inserting the same item more than once?
I'm thinking the only way to do it is to check whether a record already exists using the link, and only insert if it doesn't exist already. I know link is optional, but I won't save items that don't have one anyway. This seems a bit inefficient though; checking before every insert might be fine in the beginning, but when the database starts filling up it might get very slow.
You might have to use a few different strategies depending on how well the site you are consuming has implemented the spec.
First I would try adding a unique index in the database on the GUID value; GUIDs by their nature should be unique (http://en.wikipedia.org/wiki/Globally_unique_identifier). Then, depending on which DB you are using, you should be able to use syntax like INSERT IGNORE INTO ... or INSERT ... ON DUPLICATE KEY UPDATE ... and simply have the update clause not really do anything.
If some sites don't have a guid field (I am assuming you will end up consuming more than just the example), you could add the unique index on the siteId field plus either the time or the title; both are less than ideal, of course. Contacting the site owner to get them to implement a guid might work too ;)
You could also run an md5 hash on the post content and store that alongside the post; that should stop duplicates too.
How big are you expecting the DB to get? With proper indexing I would have thought it would have to be huge before it runs slow; lookups using indexes on siteId, guid, time and/or hash, limited to just one row and returning just the row ID, should be quick enough, especially if you can get your script to run from the command line / on a cron job rather than through a web server.
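A minimal sketch of the GUID/hash approach described above, assuming a unique index on the guid column and a mysqli connection in $db (the table name, columns and the shape of $items are assumptions):
$sql = 'INSERT INTO rss_items (site_id, guid, title, link)
        VALUES (?, ?, ?, ?)
        ON DUPLICATE KEY UPDATE guid = guid'; // no-op update: existing rows are left alone
$stmt = $db->prepare($sql);
foreach ($items as $item) {
    // Fall back to an md5 of the content when the feed provides no <guid>.
    $guid = $item['guid'] !== '' ? $item['guid'] : md5($item['title'] . $item['link']);
    $stmt->bind_param('isss', $item['site_id'], $guid, $item['title'], $item['link']);
    $stmt->execute();
}
$stmt->close();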

Autocomplete concept

I'm programming a search engine for my website in PHP, SQL and jQuery. I have experience in adding autocomplete with existing data in the database (i.e. searching article titles). But what if I want to use the most common search queries that users type, something similar to what Google has, without having that many users to contribute to the creation of the data (the most common queries)? Is there some kind of open-source SQL table with autocomplete data in it, or something similar?
For now, use the static data that you have for autocomplete.
Create another table in your database to store the actual user queries. The schema of the table can be <queryID, query, count>, where count is incremented each time the same query is submitted by another user (a kind of rank). Build an n-gram index over the queries (so that you could also autocomplete something like "Manchester United" when a person just types "United", i.e. not only match on the starting string) and simply return the top N after sorting by count.
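A minimal sketch of that table in use, assuming a unique index on the query column and a mysqli connection in $db; this only does prefix matching, the n-gram part would sit on top of it:
// Log a search: insert the query, or bump its count if it is already known.
$stmt = $db->prepare('INSERT INTO search_queries (query, `count`) VALUES (?, 1) ON DUPLICATE KEY UPDATE `count` = `count` + 1');
$stmt->bind_param('s', $query);
$stmt->execute();
// Suggest: the 10 most frequent logged queries starting with what was typed.
$stmt = $db->prepare("SELECT query FROM search_queries WHERE query LIKE CONCAT(?, '%') ORDER BY `count` DESC LIMIT 10");
$stmt->bind_param('s', $prefix);
$stmt->execute();
$suggestions = $stmt->get_result()->fetch_all(MYSQLI_ASSOC);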
The above table will gradually keep improving as your user base grows.
One more thing: the algorithm for accomplishing your task is pretty simple. The real challenge lies in returning the data to be displayed in a fraction of a second. So when your query database/store grows, you can use a search engine like Solr or Sphinx, which will be very fast at returning the results to be rendered.
You can use the Lucene search engine for this functionality; refer to this link.
Or you may also take a look at Lucene Solr autocomplete...
Google has (and keeps adding to) thousands of entries, arranged according to day, time, geolocation, language and so on, and the set grows with users' entries. Whenever a user types a word, the system checks the table of "most used words belonging to that location + day + time" and, if there is no answer, falls back to "general words". So you should categorize every word entered by users, or build a general word-relation table in your database from which the most suitable suggestion will be drawn.
Yesterday I stumbled on something that answered my question. Google draws autocomplete suggestions from this XML file, so it is wise to use it if you have too few users to build your own database of keywords:
http://google.com/complete/search?q=[keyword]&output=toolbar
Just replacing [keyword] with some word will give suggestions for that word; then the task is just to parse the returned XML and format the output to suit your needs.
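A minimal sketch of consuming that endpoint from PHP; the element and attribute names below are what the toolbar feed used to return, so treat them as assumptions and adjust to the XML you actually get back:
$keyword = urlencode('mysql');
$xml = simplexml_load_file("http://google.com/complete/search?q={$keyword}&output=toolbar");
$suggestions = [];
if ($xml !== false) {
    // Each <CompleteSuggestion> holds a <suggestion data="..."/> element.
    foreach ($xml->CompleteSuggestion as $entry) {
        $suggestions[] = (string) $entry->suggestion['data'];
    }
}
print_r($suggestions);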

How can I make this SQL query the most effective?

I am making a website with a large pool of images added by users.
I want to choose randomly one image out of this pool, and display it to the user, but I want to make sure that this user has never seen this image before.
So I was thinking that when a user views an image, I make a row INSERT in MySQL that would say "this USER has watched THIS IMAGE at (TIME)" for every view.
But the thing is, since there might be a lot of users and a lot of images, this table can easily grow to tens of thousands of entries quite rapidly.
So alternatively, it might be done like that:
I was thinking of making one row INSERT per USER, and in ONE field inserting an array of all IDs of images that user has watched.
I can even do that to the array:
base64_encode(gzcompress(serialize($array)))
And then:
unserialize(gzuncompress(base64_decode($array)))
What do you think I should do?
Are the encoding/decoding functions fast enough, or at least faster than the conventional way I was describing at the beginning of the post?
Is that compression good enough to store large chunks of data in only ONE database field? (Imagine if the user has viewed thousands of images.)
Thanks a lot
in ONE field inserting an array of all IDs
In almost all cases, serializing values like this is bad practice. Let the database do what it's designed to do -- efficiently handle large amounts of data. As long as you ensure that your cross table has an index on the user field, retrieving the list of images that a user has seen will not be an expensive operation, regardless of the number of rows in the table. Tens of thousands of entries is nothing.
You should create a new table UserImageViews with columns user_id and image_id (additionally, you could add more information on the view, such as Date/Time, IP and Browser).
That will make queries like "What images the user has (not) seen" much faster.
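A minimal sketch of such a query, assuming tables images(id, ...) and user_image_views(user_id, image_id) and a mysqli connection in $db; ORDER BY RAND() is fine at moderate sizes and can be replaced later if it becomes a bottleneck:
// Pick one random image this user has not viewed yet.
$stmt = $db->prepare(
    'SELECT i.id
     FROM images i
     LEFT JOIN user_image_views v ON v.image_id = i.id AND v.user_id = ?
     WHERE v.image_id IS NULL
     ORDER BY RAND()
     LIMIT 1'
);
$stmt->bind_param('i', $userId);
$stmt->execute();
$image = $stmt->get_result()->fetch_assoc();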
You should use a table. Serializing data into a single field in a database is bad practice, as the DBMS has no clue what that data represents and the data cannot be used in ANY queries. For example, if you wanted to see which users had viewed an image, you wouldn't be able to do it in SQL alone.
Tens of thousands of entries isn't much, BTW. The main application we develop has multiple tables with hundreds of thousands of records, and we're not that big. Some web applications have tables with millions of rows. Don't worry about having "too much data" unless it starts becoming a problem - the solutions for that problem will be complex and might even slow down your queries until you get to that amount of data.
EDIT: Oh yeah, and joins against those 100k+ tables happen in under a second. Just some perspective for ya...
I don't really think that tens of thousands of rows will be a problem for a database lookup. I will recommend using the first approach over the second.
I want to choose randomly one image out of this pool, and display it to the user, but I want to make sure that this user has never seen this image before.
For what it's worth, that's not a random algorithm; that's a shuffle algorithm. (Knowing that will make it easier to Google when you need more details about it.) But that's not your biggest problem.
So I was thinking that when a user views an image, I make a row INSERT in MySQL that would say "this USER has watched THIS IMAGE at (TIME)" for every view.
Good thought. Using a table that stores the fact that a user has seen a specific image makes sense in your case. Unless I've missed something, you don't need to store the time. (And you probably shouldn't. It doesn't seem to serve any useful business purpose.) Something along these lines should work well.
-- Predicate: User identified by [user_id] has seen image identified by
-- [image_filename] at least once.
create table images_seen (
    user_id integer not null references users (user_id),
    image_filename varchar(255) not null references images (image_filename), -- match the type of images.image_filename
    primary key (user_id, image_filename)
);
Test that and look at the output of EXPLAIN. If you need a secondary index on image_filename . . .
create index images_seen_img_filename on images_seen (image_filename);
This still isn't your biggest problem.
The biggest problem is that you didn't test this yourself. If you know any scripting language, you should be able to generate 10,000 rows for testing in a matter of a couple of minutes. If you'd done that, you'd find that a table like that will perform well even with several million rows.
I sometimes generate millions of rows to test my ideas before I answer a question on Stack Overflow.
Learning to generate large amounts of random(ish) data for testing is a fundamental skill for database and application developers.
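A minimal sketch of generating test data for a table like the one above, purely as an illustration; the connection details and row counts are placeholders, and it assumes the foreign keys are not enforced (or that matching users/images rows already exist):
$db = new mysqli('localhost', 'user', 'password', 'testdb');
$stmt = $db->prepare('INSERT IGNORE INTO images_seen (user_id, image_filename) VALUES (?, ?)');
for ($user = 1; $user <= 1000; $user++) {
    for ($i = 0; $i < 100; $i++) {
        // Random-ish filenames; INSERT IGNORE absorbs the occasional collision.
        $filename = 'img_' . mt_rand(1, 50000) . '.jpg';
        $stmt->bind_param('is', $user, $filename);
        $stmt->execute();
    }
}
$stmt->close();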

performance issue on displaying records

I have a table with just 3,000 records.
I render these 3,000 records on the home page without pagination; my client is not interested in pagination...
It takes around 1 minute 15 seconds to show the page completely. What can be done to make the page load more quickly?
My table structure:
customer table: customer id, customer name, guider id, and a few other columns
guider table: guider id, guider name, and a few other columns
Where's the slow down? The query or the serving?
If the former, see the comments above. If the latter:
Enable gzip on the server. Otherwise capture the [HTML?] output to a file, compress it (zip), then serve it as a download. Same for any other format if you think something else can render it better than a browser (CSV and Open Office).
If you're outputting the data into a HTML table then you may have an issue where the browser is waiting for the end of the table before rendering it. You can either break this into multiple table chunks like every 500 records/rows or try CSS "table-layout: fixed;".
Check these to-dos:
1) SQL connection: don't open the connection inside a loop; it should be opened once and reused for all queries.
2) Check your queries and analyse whether any complex logic can be replaced.
3) Use a standard class for the SQL connection and queries; use ezSQL.
4) Follow SQL query best practices.
While you could implement a cache to do this, you don't necessarily need to do so, and introducing unnecessary cache structures can often cause problems of its own. Depending on where the bottleneck is, it may not even help you much, or at all.
You need to look in two places for your analysis:
1) The query you're using to get your data. Take a look at its plan, or if you're not comfortable doing that, run it in your favorite query tool and see how long it takes to come back. If it doesn't take too long, you've got a pretty good idea that your bottleneck isn't the query. If the query itself takes a long time, that's where you should focus your efforts.
2) How your page is rendering. What is the size of your page, in bytes? It may be too big. Can you cut the size down by formatting? Can you more effectively use CSS to eliminate duplicate styling on the page? Are you using a fixed or dynamic table layout? Dynamic is generally going to be quite a bit slower, especially for large tables. Try to avoid nesting tables. Do everything you can to make the page as small as possible, and keep testing!
While displaying records I want to display the guider name, so I wrote a function that returns the guider name.
Sounds like you need to use a JOIN. Here's a simple example:
SELECT * FROM customer JOIN guider ON guider.id=customer.guider_id
This will change your page from using N + 1 (3001) queries to just one.
Make sure both guider.id and customer.guider_id are indexed and of appropriate data types (such as integers).
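A minimal PHP sketch of fetching and rendering with that single JOIN ($db is a mysqli connection; the column names are guesses based on the structure above):
$result = $db->query(
    'SELECT c.customer_id, c.customer_name, g.guider_name
     FROM customer c
     JOIN guider g ON g.id = c.guider_id'
);
foreach ($result->fetch_all(MYSQLI_ASSOC) as $row) {
    echo htmlspecialchars($row['customer_name']) . ' / ' . htmlspecialchars($row['guider_name']) . "<br>\n";
}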
This is a little list of what you should think about for improving the performance. The importance of each point is relative, so the first is not necessarily the most important to you; it depends on the details of your project.
Check your database structure. If there are just these two tables, there might be little you can do. But keep in mind that there is stuff like indices, and that, as the number of records increases, a second denormalized table structure can improve the speed of retrieving results.
Rather use one query for selecting your data than iterating through IDs and doing selects repeatedly.
Run a separate query for the guiders; I assume there are only a few of them. Save all guiders in a data structure first, e.g. a dictionary, and use the foreign key to apply the correct one to the current record; this might save a lot of data that has to be transmitted from the database to your web server.
Get your result set by using something like mysqli_result::fetch_all(), which returns a 2-dimensional array with all results. This should be faster than iterating through each row with fetch_row().
Slim down your HTML output and use (external) CSS. This will save a lot of output space if you currently format your markup with style="... a lot of formatting code ..." attributes on each line. If you use one large table, split it up into multiple tables (some browsers wait for the complete table to load before rendering it).
In a lot of languages this is very important: use a string builder for concatenating your results into the output string!
Caching: think about generating the output once a day or once an hour. Write it to a cache file which is served instead of querying the database and building the same page on every request (see the sketch after this list). Maybe you even want to offer this generated file as a download, rather than displaying it as a plain HTML page on the web.
Last but not least, check the connections to the web server and database, the server load, as well as the number of requests. If your servers are running under heavy load, everything else here might help reduce the load, or you may just have to upgrade the hardware.
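A minimal sketch of the file-cache idea mentioned in the list above; the path, the one-hour lifetime and the render_customer_table() helper are placeholders:
$cacheFile = __DIR__ . '/cache/customers.html';
$maxAge = 3600; // regenerate at most once per hour
if (is_file($cacheFile) && (time() - filemtime($cacheFile)) < $maxAge) {
    readfile($cacheFile); // serve the cached page and skip the database entirely
    exit;
}
ob_start();
render_customer_table($db); // hypothetical function that queries and prints the table
$html = ob_get_clean();
file_put_contents($cacheFile, $html, LOCK_EX);
echo $html;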
LOL
Everyone is talking about big boys' toys like database structure, caching and stuff, while the problem most likely lies in mere HTML and browsers.
Just splitting the whole HTML table into chunks will help: the first chunk shows up immediately while the others eventually follow.
The only ones who were right are those who said to profile the whole thing first. Trying to answer without profiling results is shooting in the dark.

search big database

I have a database which holds URLs in a table (along with many other details about the URL). I have another table which stores strings that I'm going to use to perform searches on each and every link. My database will be big; I'm expecting at least 5 million entries in the links table.
The application which communicates with the user is written in PHP. I need some suggestions about how I can search over all the links with all the patterns (n x m searches) without causing a high load on the server and without losing speed. I want it to operate at high speed with low resource usage. If you have any hints or suggestions in pseudo-code, they are all welcome.
Right now I don't know whether to perform these searches with SQL commands (perhaps with some help from PHP) or to do it completely in PHP.
First I'd suggest that you rethink the layout. It seems a little unnecessary to run this query for every user; instead, create a result table into which you insert the results of that query, running it once and then again every time the patterns change.
Otherwise, make sure you have indexes (full text) set on the fields you need. For the query itself you could join the tables:
SELECT
    yourFieldsHere
FROM
    theUrlTable AS tu
JOIN
    thePatternTable AS tp ON tu.link LIKE CONCAT('%', tp.pattern, '%');
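And a minimal sketch of the result-table idea from the first paragraph: run the expensive JOIN once, and again whenever the patterns change, so user-facing requests only read the precomputed matches ($db is a mysqli connection, and url_pattern_matches plus the id columns are assumed names):
$db->query('TRUNCATE TABLE url_pattern_matches');
$db->query(
    "INSERT INTO url_pattern_matches (url_id, pattern_id)
     SELECT tu.id, tp.id
     FROM theUrlTable AS tu
     JOIN thePatternTable AS tp ON tu.link LIKE CONCAT('%', tp.pattern, '%')"
);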
I would say that you pretty definitely want to do that in the SQL code, not the PHP code. Also, searching on the strings of the URLs is going to be a slow operation, so perhaps some form of hashing would be good. I have seen someone use a variant of a Zobrist hash for this before (Google will bring a load of results back).
Hope this helps,
Dan.
Do as much searching as you practically can within the database. If you're ending up with an n x m result set, and start with at least 5 million hits, that's a LOT of data to be repeatedly slurping across the wire (or socket, however you're connecting to the db) just to end up throwing away most of it each time. Even if the DB's native search capabilities ('like' matches, regexp, full-text, etc...) aren't up to the task, culling unwanted rows BEFORE they get sent to the client (your code) will still be useful.
You should optimize your tables in the DB. Use an md5 hash: a new column holding the md5 of the text can use an index, so exact-match lookups become fast.
But it doesn't help if you use LIKE '%text%'.
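A minimal sketch of what that could look like, assuming a links table with a url column ($db is a mysqli connection and the new column/index names are placeholders); note it only speeds up exact lookups:
// One-time schema change:
//   ALTER TABLE links ADD COLUMN url_md5 CHAR(32) NOT NULL, ADD INDEX idx_url_md5 (url_md5);
// An exact lookup then becomes an indexed equality match instead of a string scan:
$hash = md5($searchUrl);
$stmt = $db->prepare('SELECT id, url FROM links WHERE url_md5 = ?');
$stmt->bind_param('s', $hash);
$stmt->execute();
$row = $stmt->get_result()->fetch_assoc();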
You can use Sphinx or Lucene.
