Create a PHP cache system in a MySQL database?

I'm creating a web service that often scrapes data from remote web pages. After scraping, I have a simple multidimensional array of information to use. The scraping process is fairly taxing on my server, and the page load takes a while. I was considering adding a simple cache system using a MySQL database, where I create one row per remote web page, with the array of information pulled from it stored as a JSON-encoded string. Is this a good enough system? Or would something like a text file per web page be a better idea?

Since you're scraping multiple web pages and you want your data to be persistently cached, you have a few options -- the best of which would be to use Memcached or a database such as MySQL. Using text files is not a good idea, because you would have to serialize / deserialize your data and read from your filesystem; querying a database or Memcached is many times more efficient.
Since you're probably looking for your cache to be somewhat persistent, I would suggest going with MySQL. You would simply create a table that has an auto-incrementing primary key, with a column for each element in your parsed JSON object. (Note that MySQL does not support array columns; to emulate them, you would need to use relational tables, or serialize your array data and store it in a text field. The former method is preferred.)
Every time you scrape a page, you would run an UPDATE statement to refresh that page's row in the database. If you put a unique index on whatever you use to uniquely identify a page (the URL, etc.), you will get optimal look-up performance.
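A minimal sketch of that table and upsert, assuming PDO; the table name, $url and $scrapedData are illustrative, and an INSERT ... ON DUPLICATE KEY UPDATE replaces the plain UPDATE so new pages get a row too:
// Hypothetical schema: a unique key on an md5 of the URL keeps the index small
// and lets one statement insert new pages or refresh existing ones.
$pdo = new PDO('mysql:host=localhost;dbname=scraper', 'user', 'pass');
$pdo->exec("
    CREATE TABLE IF NOT EXISTS page_cache (
        id         INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
        url_hash   CHAR(32)     NOT NULL,
        url        TEXT         NOT NULL,
        payload    MEDIUMTEXT   NOT NULL,  -- JSON-encoded scrape result
        scraped_at TIMESTAMP    NOT NULL DEFAULT CURRENT_TIMESTAMP
                                ON UPDATE CURRENT_TIMESTAMP,
        UNIQUE KEY uniq_url (url_hash)
    )
");
$stmt = $pdo->prepare("
    INSERT INTO page_cache (url_hash, url, payload)
    VALUES (:hash, :url, :payload)
    ON DUPLICATE KEY UPDATE payload = VALUES(payload)
");
$stmt->execute([
    ':hash'    => md5($url),
    ':url'     => $url,
    ':payload' => json_encode($scrapedData),
]);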

If you're looking to store the cache locally on one server (e.g. if your MySQL server and HTTP server are on the same box), you might be better off using APC, an in-memory cache extension for PHP (its modern successor is APCu).
If you're looking to store the data remotely (e.g. on a dedicated cache box), then I would go with Memcached instead of MySQL.
"When all you have is a hammer ..."

I don't tend to have particularly large APC configs, 64-128 MB max. Memcached can go to a couple of gigabytes or more (far more if you run multiple instances). Both are also transient: a restart of Apache, or of Memcached (though the latter is restarted less often), will lose the data.
It depends, then, on how often you are willing to process the data to produce the cache, and how long that cache would otherwise stay useful. If it was good for weeks before you re-scraped the pages, MySQL is an entirely suitable backing store.
Potential other options, depending on how many items are being cached and how big the data is, are, as you suggest, a file-based cache, SQLite, or other systems.

Related

File-based cache or many MySQL queries

Let's assume you're developing a multiplayer game where the data is stored in a MySQL database: for example, the names and description texts of items, attributes, buffs, NPCs, quests, etc.
That data:
won't change often
is frequently requested
is required on server-side
and cannot be cached locally (JSON, JavaScript)
To solve this problem, I wrote a file-based caching system that creates .php files on the server and copies entire MySQL tables into them as pre-defined PHP variables.
Like this:
$item_names = [0 => "name", 1 => "name"];
$item_descriptions = [0 => "text", 1 => "text"];
That file contains a lot of data, ends up being around 500 KB in size, and is then loaded on every user request.
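For reference, a hedged sketch of such a generator; the table and column names are illustrative. Writing the cache as a .php file has one real advantage: OPcache can keep the parsed arrays in shared memory.
// Regenerate the cache file whenever the source tables change.
$pdo = new PDO('mysql:host=localhost;dbname=game', 'user', 'pass');
$names = $descriptions = [];
foreach ($pdo->query('SELECT id, name, description FROM items') as $row) {
    $names[$row['id']]        = $row['name'];
    $descriptions[$row['id']] = $row['description'];
}
$php = "<?php\n"
     . '$item_names = '        . var_export($names, true) . ";\n"
     . '$item_descriptions = ' . var_export($descriptions, true) . ";\n";
file_put_contents(__DIR__ . '/cache/items.php', $php, LOCK_EX);

// On each request:
require __DIR__ . '/cache/items.php';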
Is that a good approach to avoid unnecessary queries, considering that the query cache has been removed in MySQL 8.0? Or is it better to fetch just the data needed through individual queries, even if that means hundreds of them per request?
I suggest you use some kind of PSR-6 compliant cache system (it could be filesystem-based as well); later, when your request volume grows, you can easily swap in a more performant cache, such as a PSR-6 Redis cache.
Example for PSR-6 compatible file system cache.
More info about PSR-6 Caching Interface
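As a concrete starting point, a sketch using the PSR-6 FilesystemAdapter from the symfony/cache package (composer require symfony/cache); loadItemsFromDatabase() is a hypothetical loader:
use Symfony\Component\Cache\Adapter\FilesystemAdapter;

// FilesystemAdapter implements PSR-6's CacheItemPoolInterface, so it can
// later be swapped for e.g. RedisAdapter without touching this code path.
$pool = new FilesystemAdapter('game', 0, __DIR__ . '/cache');
$item = $pool->getItem('items.all');
if (!$item->isHit()) {
    $item->set(loadItemsFromDatabase());  // hypothetical expensive query
    $item->expiresAfter(3600);            // one hour
    $pool->save($item);
}
$items = $item->get();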
Instead of building your own caching mechanism, you can use Redis, as it will handle all your caching requirements.
It is easy to implement.
Follow these links to learn more about Redis:
Redis
Redis in PHP
Redis PHP tutorials
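To make that concrete, a cache-aside sketch using the phpredis extension; loadFromDatabase() is a hypothetical stand-in for your real query:
$redis = new Redis();
$redis->connect('127.0.0.1', 6379);
$key  = 'items:all';
$json = $redis->get($key);
if ($json === false) {                             // miss or expired
    $data = loadFromDatabase();                    // hypothetical
    $redis->setex($key, 3600, json_encode($data)); // expire after an hour
} else {
    $data = json_decode($json, true);
}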
In my experience...
You should only optimize for performance when you can prove you have a problem, and when you know where that problem is.
That means in practice that you should write load tests to exercise your application under "reasonable worst-case scenario" loads, and instrument your application so you can see what its performance characteristics are.
Doing any kind of optimization without a load test framework means you're coding on instinct; you may be making things worse without knowing it.
Your solution - caching entire tables in arrays - means every PHP process is loading that data into memory, which may or may not become a performance hit in its own right (do you know which request will need which data?). It also looks like you'll be doing a lot of relational logic in PHP (in your example, gluing the item_name to the item_description). This is something MySQL is really good at; your PHP code could easily be slower than MySQL at joins.
Then you have the problem of cache invalidation - how and when do you refresh the cached data? How does your application behave when the data is being refreshed? I've seen web sites slow to a crawl when cached data was being refreshed.
In short - it's a complicated decision, there are no obvious right/wrong answers. My first recommendation is "build a test framework so you can approach performance based on evidence", my second is "don't roll your own - consider using an ORM with built-in cache support", my third is "consider using something like Redis or memcached to store your cache information".
There are many possible solutions, depending on your requirements. Possible options are:
A. File-based caching in JSON format: data retrieved from the database is saved to a file and reused on subsequent requests before hitting the database again.
B. A memory-based cache, such as Memcached, APC or Redis. Similar to the option above, with better performance, but more integration code is required.
C. A memory-based / NoSQL database, such as MongoDB.
D. Multiple database servers: one master database for writes with multiple slaves for reads, with replication between the servers.
To keep things quick and minimise code changes, I suggest option B.

How to store PHP Trie for all later uses?

I'm designing an application in PHP that involves a trie data structure.
For time-efficient prefix search, I'm using a trie.
I'm constructing the trie from records in the database.
Now, the database has millions of records, so it is not feasible to build the trie from scratch and then search it for every new user request.
Can I instead create the trie only once and persist it somewhere, so that it does not have to be re-created on every request and searching can begin immediately? Is there some way to cache the created trie (not just for one user session, but across all user requests) using PHP?
Any help would be much appreciated.
You have a couple of standard options.
Cache the database result in memory, using a simple cache like Memcached.
Cache using Redis, perhaps taking advantage of some of its extra features: this might involve loading the data into native Redis structures and having your trie search code work against Redis directly rather than against a database result set.
In either case, you are going to cache the result for some period of time that is acceptable, and since the database result will be in memory in some form, there is no load placed on the RDBMS.
In your related question, you indicated that the raw serialized form of the variable would be about 200 MB. That is well within the maximum object size for Redis (512 MB), but could be problematic for Memcached. I personally use Redis for most app-server caching these days.
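A sketch of the simpler first approach, assuming phpredis; buildTrieFromDatabase() is hypothetical. Note that unserializing a ~200 MB payload on every request is itself costly, which is why restructuring the trie as native Redis keys (the second option) may pay off:
$redis = new Redis();
$redis->connect('127.0.0.1', 6379);
$payload = $redis->get('trie:prefix-search');
if ($payload === false) {
    $trie = buildTrieFromDatabase();                  // expensive, done once
    $redis->set('trie:prefix-search', serialize($trie));
} else {
    $trie = unserialize($payload);                    // shared across requests
}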

PHP - MySQL call or JSON static file for infrequently updated information

I've got a heavy-read website backed by a MySQL database. I also have some small "auxiliary" information (it fits in an array of 30-40 elements as of now), hierarchically organized, which gets updated slowly and periodically, 4-5 times per year. It's not quite a configuration file, since this information concerns the subject of the website rather than its functioning, but it behaves like one. Until now I just used a static PHP file containing an array of info, but now I need a way to update it via a backend CMS from my admin panel.
I'm thinking of a simple CMS that lets the admin create/edit/delete entries (a rare, periodic job) and then writes a static JSON file to be used by the page-building scripts instead of pulling this information from the database.
The question is: given the heavy-read nature of the website, is it better to read a rarely updated JSON file on the server when building pages, or to retrieve the raw info from the database on every request?
I just used a static PHP
This sounds like a contradiction to me. Either it's static, or it's PHP.
given the heavy-read nature of the website, is it better to read a rarely updated JSON file on the server when building pages, or to retrieve the raw info from the database on every request?
Caching was invented for a reason :) The same goes for your case: it all depends on how often the data changes versus how often it is read. If the data changes once a day and remains static for 100k reads during the day, then not caching it, or not serving it from a flat file, would simply be stupid. If the data changes once a day and you average 20 reads per day, then perhaps returning the data from code on each request would be less stupid; on the other hand, 19 of those requests could still be served from cache anyway, so... If you can, serve from a flat file.
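A hedged sketch of that flat-file setup; the paths and query are illustrative. The admin panel calls the regenerate function on save, and page builds only ever read the file:
// Rewrite the JSON atomically so a concurrent reader never sees a half-written file.
function regenerateSiteInfo(PDO $pdo, string $file): void {
    $rows = $pdo->query('SELECT * FROM site_info')->fetchAll(PDO::FETCH_ASSOC);
    $tmp  = $file . '.tmp';
    file_put_contents($tmp, json_encode($rows), LOCK_EX);
    rename($tmp, $file);  // atomic on the same filesystem
}

function readSiteInfo(string $file): array {
    return json_decode(file_get_contents($file), true);
}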
Caching is your best option; Redis and Memcached are common, excellent choices. Between a flat file and a database, it's hard to say without knowing the SQL schema you're using (how many columns, what the datatype definitions are, how many foreign keys and indexes, etc.).
SQL is about relational data; if you have non-relational data, you don't really have a reason to use SQL. Many people are switching to NoSQL databases to handle this, since modifying SQL schemas after the fact is a huge pain.

JSON Flat file vs DB querying

I'm working on a site that has a store locator built in.
Since I have developed similar sites in the past, I have experienced some trouble when search peaks hit the database (MySQL) hard.
All of those past location search engines queried the database to get their results.
Now I have taken a different approach, but since I'm not 100% sure about it, I thought asking this great community could make me feel more secure about this direction, or convince me to stick with what I did before.
So for this new search, instead of hitting the database on every request, I'm serving the search from a JSON file that is regenerated (by querying the database) only when something on the locations list is updated, created or deleted.
My doubt is: can a high load of requests against the JSON file have the same effect as a high load of queries against the database?
Is serving the search results from JSON to lower the impact on the db (and server resources) a good approach, or is it a bad idea?
Maybe someone out there has had to make the same decision and can share the experience, or maybe you just know how things really are and can recommend an approach.
Flat files are the poor man's database and can be even more problematic than a heavily pounded database. For example, reading and writing the file still requires a lock, and this will not scale, as the same file may not be accessible to all app servers.
My suggestion would be any one of the following:
Benchmark your current hardware, identify bottlenecks, and scale out or up accordingly.
Implement a caching layer; this will save on costly queries for read-only data.
Consider higher-performance storage solutions such as Aerospike or Redis.
Implement a real full-text search engine such as Elasticsearch or Solr.
Response to comment #1:
You could accomplish the same thing without having to read/write a flat file (which must be accessible by all app servers) by caching the data. Here's a quick-and-dirty rundown of how I would do it:
Zip + 10 miles:
Query the database, pull the store data, json_encode it, and cache it under a key construct like 92562_10. Now when other users enter 92562 + 10, they will pull the data from cache instead of the database (or a flat file).
City, State + 50 miles:
Same as above, except key construct may look like murrieta_ca_50.
But with the caching layer you get better performance, and the cache server is available to all of your app servers, which is much easier than having to install and configure NFS to share a file across the network.
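A quick sketch of that key scheme with Memcached; the radius query is a hypothetical placeholder for your real geo search:
$mc = new Memcached();
$mc->addServer('127.0.0.1', 11211);

function storesNear(Memcached $mc, PDO $pdo, string $zip, int $miles): array {
    $key    = $zip . '_' . $miles;   // e.g. "92562_10"
    $cached = $mc->get($key);
    if ($mc->getResultCode() === Memcached::RES_SUCCESS) {
        return $cached;              // served from cache, no db hit
    }
    // Placeholder query; a real implementation would filter by distance.
    $stmt = $pdo->prepare('SELECT * FROM stores WHERE zip = :zip');
    $stmt->execute([':zip' => $zip]);
    $stores = $stmt->fetchAll(PDO::FETCH_ASSOC);
    $mc->set($key, $stores, 3600);   // visible to all app servers
    return $stores;
}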

Store data in db or in files? Which one is faster?

I'm developing a PageRank checker widget, and I want to cache the ranks, because sending a request to Google on every page load takes several seconds.
The general question:
Is it faster and more efficient to cache (store, and retrieve afterwards) the rank of each URL in a database or in files (one file for all, or one file per URL)?
Sorry for my terrible English.
Thanks
It depends.
Have a read through this article by Chris Davis - particularly section 7.10. You should also have a think about the differences between speed and scalability.
While, in theory, the file-based approach (using the directory hierarchy for indexing, with one URL per file) will be faster, PHP does not have good facilities for managing concurrent file access. OTOH, this is a key feature of a DBMS (be it relational or NoSQL). Another consideration is how you will be interacting with the data: you may not be retrieving it by the same index path you stored it under (you can still implement multiple indexes with files, but it's a lot easier in a database).
Go with the database, and remember to add indexes on the columns you care about.
How about using something like Memcached, which stores the data in memory? If it's just a cache, I don't see the downside.
Using files will be slower than using a database.
A database uses many optimizations and well-tuned algorithms for storing and retrieving data, so it is the better choice.
You can add indexes to your database, and (if using MySQL) you can choose the MEMORY (HEAP) storage engine for faster performance.
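For illustration, a hypothetical rank-cache table on the MEMORY engine; note MEMORY tables live in RAM only, are emptied on a server restart, and do not support TEXT/BLOB columns:
$pdo = new PDO('mysql:host=localhost;dbname=ranks', 'user', 'pass');
$pdo->exec("
    CREATE TABLE IF NOT EXISTS rank_cache (
        url_hash   CHAR(32)         NOT NULL PRIMARY KEY,  -- md5 of the URL
        page_rank  TINYINT UNSIGNED NOT NULL,
        fetched_at TIMESTAMP        NOT NULL DEFAULT CURRENT_TIMESTAMP
    ) ENGINE=MEMORY
");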
Caching the result is better. Use Memcached or something similar. First check whether you have cached data; if so, don't send a request to the API, and take the data from the cache instead. But set a lifetime on the cache (after which the cached entry is destroyed); this will help keep your data in sync with the live source. If you have no cached data, send a request to the API and store the latest data in the cache. That's better in my opinion.
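A Memcached sketch of that pattern; fetchRankFromGoogle() is a hypothetical stand-in for the slow remote request:
$mc = new Memcached();
$mc->addServer('127.0.0.1', 11211);

function getRank(Memcached $mc, string $url): int {
    $key  = 'rank:' . md5($url);
    $rank = $mc->get($key);
    if ($rank === false) {                 // miss or expired
        $rank = fetchRankFromGoogle($url); // slow remote call
        $mc->set($key, $rank, 86400);      // re-sync with live data daily
    }
    return (int) $rank;
}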
