Querying XML/JSON API with loading only Unique Rows - php

I'm trying to minimize the load time with some scripts that I've created. I'm connecting to multiple JSON or XML APIs either through a developer's API resource or through RSS feeds and I'm trying to find a way to only insert new rows into the mysql databases.
I have used INSERT IGNORE, and it works, but the problem is that the script still runs an insert for every row and MySQL only skips it when the UNIQUE key is a duplicate. What I'd like to do is create a faster-running script that only attempts to insert a row if it is actually new.
I have thought about pulling the unique keys from the DB into an array and using in_array() to check each record before inserting, but eventually that array will get too big to handle (i.e. 1,000,000 records).
Is there a better way to do this? Within MySQL alone I can do it easily with a WHERE clause against records already in the database, but when the data is coming from an XML/JSON API this gets harder, especially considering how enormously large the tables will grow.
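For illustration, one way to "only insert if new" without holding every existing key in a PHP array is to check each batch of feed keys against the table first. This is only a sketch; the items table, the guid column and the field names are placeholders:

    <?php
    // $pdo is an existing PDO connection; $feedRows is the parsed feed,
    // each row carrying a unique 'guid'. The items table has a UNIQUE index on guid.
    if (!$feedRows) {
        return;                                   // nothing in this feed run
    }

    // Ask MySQL which of this batch's keys it already has.
    $guids = array_column($feedRows, 'guid');
    $placeholders = implode(',', array_fill(0, count($guids), '?'));
    $stmt = $pdo->prepare("SELECT guid FROM items WHERE guid IN ($placeholders)");
    $stmt->execute($guids);
    $existing = array_flip($stmt->fetchAll(PDO::FETCH_COLUMN));

    // Only rows that are actually new get an INSERT.
    $insert = $pdo->prepare("INSERT INTO items (guid, title, url) VALUES (?, ?, ?)");
    foreach ($feedRows as $row) {
        if (!isset($existing[$row['guid']])) {
            $insert->execute([$row['guid'], $row['title'], $row['url']]);
        }
    }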

Related

Question about storing/retrieving data for a complex page

My stack is php and mysql.
I am trying to design a page to display details of a mutual fund.
Data for a single fund is distributed over 15-20 different tables.
Currently, my front-end is a brute-force PHP page that queries/joins these tables using 8 different queries for a single scheme. It's messy and performs poorly.
I am considering alternatives. The good thing is that the data changes only once a day, so I can do some preprocessing.
One option I am considering is to run these queries for every fund (about 2,000 funds), create a complex JSON object for each of them, store it in MySQL indexed by fund code, retrieve the JSON at run time and show the data. I am thinking of using the JSON_OBJECT() MySQL function to create the JSON, and json_decode() in PHP to get the values for display. Is this a good approach?
I was tempted to store them in a separate MongoDB store instead - would that be overkill for this?
Any other suggestion?
Thanks much!
To meet your objective of quick pageviews, your overnight-run approach is very good. You could generate JSON objects with your distilled data, or even prerendered HTML pages, and store them.
You can certainly store JSON objects in MySQL columns. If you don't need the database server to search the objects, simply use TEXT (or LONGTEXT) data types to store them.
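For example, the overnight job could cache one JSON blob per fund in a plain MySQL table. This is only a sketch; the table, column and helper names are placeholders (buildFundDetails stands in for your existing queries):

    <?php
    // Nightly cron: build one JSON blob per fund and cache it in MySQL.
    // Hypothetical cache table:
    //   CREATE TABLE fund_cache (
    //       fund_code VARCHAR(20) PRIMARY KEY,
    //       payload   LONGTEXT NOT NULL,
    //       built_at  DATETIME NOT NULL
    //   );
    $stmt = $pdo->prepare(
        "REPLACE INTO fund_cache (fund_code, payload, built_at) VALUES (?, ?, NOW())"
    );
    foreach ($fundCodes as $code) {
        $data = buildFundDetails($pdo, $code);   // your 8 queries, distilled into one array
        $stmt->execute([$code, json_encode($data)]);
    }

    // At request time the page does a single primary-key lookup:
    $stmt = $pdo->prepare("SELECT payload FROM fund_cache WHERE fund_code = ?");
    $stmt->execute([$fundCode]);                 // $fundCode comes from the page request
    $fund = json_decode($stmt->fetchColumn(), true);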
To my way of thinking, adding a new type of server (MongoDB) to your operations just to store a few thousand JSON objects does not seem worth the trouble. If you find it necessary to search the contents of your JSON objects, however, another type of server might be useful.
Other things to consider:
Optimize your SQL queries. Read up: https://use-the-index-luke.com and other sources of good info. Consider your queries one-by-one starting with the slowest one. Use the EXPLAIN or even the EXPLAIN ANALYZE command to get your MySQL server to tell you how it plans each query. And judiciously add indexes. Using the query-optimization tag here on StackOverflow, you can get help. Many queries can be optimized by adding indexes to MySQL without changing anything in your php code or your data. So this can be an ongoing project rather than a big new software release.
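For example (the table, column and index names here are only placeholders for one of your fund queries):

    <?php
    // $pdo is an existing PDO connection to the MySQL database.
    $plan = $pdo->query("EXPLAIN SELECT * FROM nav_history
                         WHERE fund_code = 'ABC123' AND nav_date >= '2024-01-01'")
                ->fetchAll(PDO::FETCH_ASSOC);
    print_r($plan);   // look for type = ALL (full table scan) and the rows estimate

    // If the plan shows a full scan, an index on the filter columns usually helps:
    $pdo->exec("ALTER TABLE nav_history ADD INDEX idx_fund_date (fund_code, nav_date)");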
Consider measuring your query times. You can do this with MySQL's slow query log. The point of this is to identify your "dirty dozen" slowest queries in a particular time period. Then, see step one.
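A sketch of enabling the slow query log at runtime; the account needs the privilege to set global variables, and the 1-second threshold is only an example:

    <?php
    // $pdo is an existing PDO connection with sufficient privileges.
    $pdo->exec("SET GLOBAL slow_query_log = 'ON'");
    $pdo->exec("SET GLOBAL long_query_time = 1");   // log anything slower than 1 second
    // The server reports where the log file lives:
    print_r($pdo->query("SHOW VARIABLES LIKE 'slow_query_log%'")->fetchAll(PDO::FETCH_ASSOC));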
Make your pages fill up progressively, to keep your users busy reading while you fetch the data they need. Put the top-level stuff (fund name, etc.) in server-side HTML so search engines can see it. Use some sort of front-end tech (React, maybe, or DataTables fetching data via AJAX) to render your pages client-side, and provide REST endpoints on your server that return the data, in JSON format, for each data block in the page.
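A minimal sketch of one such endpoint, reusing the hypothetical fund_cache table from the earlier sketch ($pdo is an existing PDO connection):

    <?php
    // fund.php - a tiny JSON endpoint the front end can call via AJAX.
    header('Content-Type: application/json');

    $stmt = $pdo->prepare("SELECT payload FROM fund_cache WHERE fund_code = ?");
    $stmt->execute([$_GET['fund'] ?? '']);
    $payload = $stmt->fetchColumn();

    echo $payload !== false ? $payload : json_encode(['error' => 'unknown fund']);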
In your overnight run create a sitemap file along with your JSON data rows. That lets you control exactly how you want search engines to present your data.

Comparing JSON Data to MySQL Table, Updating Rows and Inserting New Rows

I'm trying to wrap my head around a problem and would appreciate some advice.
I've built a PHP script that pulls a large amount of JSON data from a URL. It then runs through that data and inserts it into a MySQL database.
I want to use a Cron job that pulls the JSON data every 2 or 3 hours, and if anything has changed compared to the data in the MySQL table, it updates it. If there are new records, it adds those.
The old system a friend was using would basically pull all of the data every 2/3 hours and overwrite the old data. This is fine for small amounts of data, but it seems super impractical to be writing 10,000-20,000 rows to a table every 2/3 hours.
Each JSON object has a unique identifier - so I was thinking of doing something like:
Pull the MySQL table data into an array;
Pull the JSON data into an array;
Use the unique identifier for each entry in the JSON data to search against the MySQL data. If the entries are not the same, update the MySQL table. If an entry doesn't exist, insert a new row.
I'm looking for some tips on the best way / most efficient and fastest way to do this. I've been told I'm super bad at explaining things so let me know if I need to add any more detail.
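To make the steps concrete, here is a rough sketch of that compare-and-sync loop. The table and column names are placeholders, $pdo is an existing PDO connection, and $jsonRecords is the decoded feed:

    <?php
    // Step 1: key the existing MySQL data by the unique identifier.
    $existing = [];
    foreach ($pdo->query("SELECT external_id, title, price FROM listings") as $row) {
        $existing[$row['external_id']] = $row;
    }

    // Steps 2-3: walk the JSON data, insert what's new, update what changed.
    $insert = $pdo->prepare(
        "INSERT INTO listings (external_id, title, price) VALUES (?, ?, ?)"
    );
    $update = $pdo->prepare(
        "UPDATE listings SET title = ?, price = ? WHERE external_id = ?"
    );

    foreach ($jsonRecords as $r) {
        if (!isset($existing[$r['id']])) {
            $insert->execute([$r['id'], $r['title'], $r['price']]);
        } elseif ($existing[$r['id']]['title'] !== $r['title']
               || $existing[$r['id']]['price'] != $r['price']) {
            $update->execute([$r['title'], $r['price'], $r['id']]);
        }
    }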

Using database as cache. Rebuilding vs updating

My question is about the most efficient way to store an entire JSON document in a database table and refresh it periodically.
Essentially, I'm calling the Google Analytics API once every 15 minutes via a cron job to pull out data about my site. I'm dumping this information into a SQL table so that my front end application can search, sort and consume it. The JSON is paginated, so only 5,000 rows come through at a time; I'll be storing as many as 100,000.
What I'm trying to do is optimize the way I rebuild the table. The most naive approach would be to truncate the table and insert every row from the JSON fresh. I have the feeling this is a bad approach, but maybe I'm underestimating SQL.
I could also update each existing row and add new rows as necessary. However, I'm struggling with how I should delete old rows that might not be in the freshest JSON object.
Or perhaps I'm missing a more obvious solution.
The real answer to this question is that it depends on what works best. As I am not familiar with the data I can't give you a straightforward answer, but here are some guidelines.
Firstly, 100,000 rows is nothing for a SQL server to handle, so truncating the table and inserting the values fresh might actually be workable. However, if this data were to grow substantially, this might not be a solution that scales well. The main disadvantage of this approach is that for a period of time the table will be empty, and that might be a problem for some users.
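A rough sketch of that rebuild cycle in PHP (the table and column names are placeholders, and $gaRows stands for the rows already extracted from the API response):

    <?php
    // Naive rebuild: empty the table, then reload everything from the fresh JSON.
    // $pdo is an existing PDO connection; the table is empty for readers
    // from the TRUNCATE until the inserts finish.
    $pdo->exec("TRUNCATE TABLE analytics_rows");

    $insert = $pdo->prepare(
        "INSERT INTO analytics_rows (page, visits, bounce_rate) VALUES (?, ?, ?)"
    );
    $pdo->beginTransaction();                 // batch the inserts into one commit
    foreach ($gaRows as $row) {
        $insert->execute([$row['page'], $row['visits'], $row['bounceRate']]);
    }
    $pdo->commit();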
Summary of this approach:
Easy and quick to code and maintain.
Truncate will always be fast, but inserts will slow down as volume increases.
Data will be offline during the truncate-and-insert cycle.
Inserting and updating as we go along is known as upserts/merges. This approach involves more work, but the data stays online the whole time. One of the difficulties you face is comparing the JSON data with the SQL data (finding differences between the native JSON dataset and the SQL table); doing that row by row in code is going to be inefficient and cumbersome.
So I would create a staging table for the JSON. This table will be an exact copy of the final production table. I would then use LEFT and RIGHT JOINs to insert the new data and remove the deleted data. You could also create a hash for each row and compare these hashes to identify the rows that have changed, and then update only where necessary. All these transformations can be handled in a simple SQL script. Yes, you are underestimating SQL a bit...
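A sketch of that staging-table merge (analytics_rows and analytics_staging are placeholder names; the staging table has the same structure as the live table and is filled from the fresh JSON before this runs):

    <?php
    // $pdo is an existing PDO connection. analytics_staging holds the fresh feed,
    // analytics_rows is the live table; `page` acts as the unique key here.

    // 1. Insert rows that are in the staging data but not in production yet.
    $pdo->exec("
        INSERT INTO analytics_rows (page, visits, bounce_rate)
        SELECT s.page, s.visits, s.bounce_rate
        FROM analytics_staging s
        LEFT JOIN analytics_rows p ON p.page = s.page
        WHERE p.page IS NULL
    ");

    // 2. Delete rows that are in production but no longer in the staging data.
    $pdo->exec("
        DELETE p
        FROM analytics_rows p
        LEFT JOIN analytics_staging s ON s.page = p.page
        WHERE s.page IS NULL
    ");

    // 3. Update rows whose content has changed, using a hash comparison.
    $pdo->exec("
        UPDATE analytics_rows p
        JOIN analytics_staging s ON s.page = p.page
        SET p.visits = s.visits, p.bounce_rate = s.bounce_rate
        WHERE MD5(CONCAT_WS('|', p.visits, p.bounce_rate))
           <> MD5(CONCAT_WS('|', s.visits, s.bounce_rate))
    ");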
Summary of this approach:
More complicated to code, but not difficult: simple joins and hash comparisons will do the trick.
Only inserts new values, updates changed values and deletes old values. As volumes grow, this solution will eventually outperform the truncate-and-insert cycle.
Data remains online all the time.
If you need clarification around this please ask away.

Fastest way to store table of data and retrieve

I am storing some history information on my website for future retrieval of the users. So when they visit certain pages it will record the page that they visited, the time, and then store it under their user id for future additions/retrievals.
So my original plan was to store all of the data in an array, and then serialize/unserialize it on each retrieval and then store it back in a TEXT field in the database. The problem is: I don't know how efficient or inefficient this will get with large arrays of data if the user builds up a history of (e.g.) 10k pages.
EDIT: So I want to know the most efficient way to do this. I was also considering just inserting a new row in the database for each page view, but then that makes a very large table to select from.
The question is: which is faster/more efficient, a massive number of rows in the database or a massive serialized array? Any other, better solutions are obviously welcome. I will eventually be switching to Python, but for now this has to be done in PHP.
There is no benefit to storing the data as serialized arrays. Retrieving a big blob of data, de-serializing, modifying it and re-serializing to update is slow - and worse, will get slower the larger the piece of data (exactly what you're worried about).
Databases are specifically designed to handle large numbers of rows, so use them. You have no extra cost per insert as the data grows, unlike your proposed method, and you're still storing the same amount of data, so let the database do what it does best, and keep your code simple.
Storing the data as an array also makes any sort of querying and aggregation near impossible. If the purpose of the system is to (for example) see how many visits a particular page got, you would have to de-serialize every record, find all the matching pages, etc. If you have the data as a series of rows with user and page, it's a trivial SQL count query.
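For instance, with a plain history table (the schema and names below are illustrative only), recording a visit and answering that question are one short query each:

    <?php
    // Illustrative schema:
    //   CREATE TABLE page_history (
    //       id         BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    //       user_id    INT UNSIGNED NOT NULL,
    //       page       VARCHAR(255) NOT NULL,
    //       visited_at DATETIME NOT NULL,
    //       KEY idx_user (user_id),
    //       KEY idx_page (page)
    //   );

    // Recording a visit is one small insert ($pdo is an existing PDO connection,
    // $userId is the logged-in user's id):
    $stmt = $pdo->prepare(
        "INSERT INTO page_history (user_id, page, visited_at) VALUES (?, ?, NOW())"
    );
    $stmt->execute([$userId, $_SERVER['REQUEST_URI']]);

    // "How many visits did this page get?" is a trivial count:
    $count = $pdo->prepare("SELECT COUNT(*) FROM page_history WHERE page = ?");
    $count->execute(['/some/page']);
    echo $count->fetchColumn();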
If, one day, you find that you have so many rows (10,000 is not a lot of rows) that you're starting to see performance issues, find ways to optimize it, perhaps through aggregation and de-normalization.
You could also collect the data in a session variable, store everything for one session there, and dump it into the database in one go.
You can add indexes at the DB level to save time.
Last, and most effective: do the manipulation/aggregation of the data up front, store the result in a separate table, and always select from that precomputed table. You can achieve this with a cron job or some scheduler.

Multiple MySQL queries and how to make my script go faster

Using PHP (1900 secs time limit and more than 1GB memory limit) and MySQL (using PEAR::MDB2) on this one...
I am trying to create a search engine that will load data from site feeds into a MySQL database. Some sites have rather big feeds with lots of data in them (for example more than 80,000 records in just one file). Some data checking for each of the records is done prior to inserting the record in the database (data checking that might also insert or update a MySQL table).
My problem, as many of you might have already understood, is... time! For each record in the feed there are more than 20 checks, and for a feed with e.g. 10,000 records there might be more than 50,000 inserts to the database.
I tried to do this in two ways:
Read the feed and store the data in an array and then loop through the array and do the data checking and inserts. (This proves to be the fastest of all)
Read the feed and do the data checking line by line and insert.
The database uses indexes on each field that is constantly queried. The PHP code is tweaked with no extra variables and the SQL queries are simple select, update and insert statements.
Setting higher time limits and memory is not a problem. The problem is that I want this operation to be faster.
So my question is:
How can I make the process of importing the feed's data faster? Are there any other tips that I might not be aware of?
Using LOAD DATA INFILE is often many times faster than using INSERT to do a bulk load.
Even if you have to do your checks in PHP code, dump the checked rows to a CSV file and then use LOAD DATA INFILE; this can be a big win.
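A rough sketch of that dump-then-load pattern (the file path, table and column names are placeholders, and LOAD DATA LOCAL INFILE has to be enabled on the connection, e.g. via PDO::MYSQL_ATTR_LOCAL_INFILE):

    <?php
    // Write the already-validated records to a CSV file...
    $fh = fopen('/tmp/feed_batch.csv', 'w');
    foreach ($validRecords as $r) {          // $validRecords = rows that passed your checks
        fputcsv($fh, [$r['guid'], $r['title'], $r['url']]);
    }
    fclose($fh);

    // ...then hand the whole file to MySQL in a single statement.
    $sql = "LOAD DATA LOCAL INFILE '/tmp/feed_batch.csv'
            INTO TABLE items
            FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"'
            LINES TERMINATED BY '\\n'
            (guid, title, url)";
    $pdo->exec($sql);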
If your import is a one-time thing and you use a FULLTEXT index, a simple tweak to speed the import up is to remove the index, import all your data and add the FULLTEXT index once the import is done. This is much faster, according to the docs:
"For large data sets, it is much faster to load your data into a table that has no FULLTEXT index and then create the index after that, than to load data into a table that has an existing FULLTEXT index."
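In practice that looks roughly like this (the index, table and helper names are placeholders; importFeedData stands in for the existing import code):

    <?php
    // Drop the FULLTEXT index, bulk-load, then rebuild the index once at the end.
    $pdo->exec("ALTER TABLE items DROP INDEX ft_title_body");

    importFeedData($pdo);   // hypothetical: your existing bulk import

    $pdo->exec("ALTER TABLE items ADD FULLTEXT ft_title_body (title, body)");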
You might take a look at PHP's PDO extension and its support for prepared statements. You could also consider using stored procedures in MySQL.
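For example, preparing the insert once and executing it per record keeps MySQL from re-parsing the statement for every row. The connection details and the table/column names below are placeholders:

    <?php
    // Placeholder DSN/credentials; native prepares so the statement is parsed once.
    $pdo = new PDO('mysql:host=localhost;dbname=feeds;charset=utf8', 'user', 'pass', [
        PDO::ATTR_ERRMODE          => PDO::ERRMODE_EXCEPTION,
        PDO::ATTR_EMULATE_PREPARES => false,
    ]);

    $stmt = $pdo->prepare(
        "INSERT INTO items (guid, title, url) VALUES (:guid, :title, :url)"
    );

    foreach ($records as $r) {               // $records = the parsed feed
        $stmt->execute([
            ':guid'  => $r['guid'],
            ':title' => $r['title'],
            ':url'   => $r['url'],
        ]);
    }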
You could also take a look at other database systems, such as CouchDB, and sacrifice consistency for performance.
I managed to double the amount of inserted data in 1800 seconds using the INSERT DELAYED command. The LOAD DATA INFILE suggestion didn't fit my case, since the data has to be strongly validated and it would mess up my code.
Thanks for all your answers and suggestions :)
