Using PHP to dump large databases into JSON - php

I have a slight problem with an application I am working on. The application is used as a developer tool to dump tables from a MySQL database to JSON, which the devs grab using the Unix curl command. So far the tables we've been using have been relatively small (2GB or less), but recently we've moved into another stage of testing that uses fully populated tables (40GB+), and my simple PHP script breaks. Here's my script:
<?php
$database = $_GET['db'];
ini_set('display_errors', 'On');
error_reporting(E_ALL);
# Connect
mysql_connect('localhost', 'root', 'root') or die('Could not connect: ' . mysql_error());
# Choose a database
mysql_select_db('user_recording') or die('Could not select database');
# Perform database query
$query = "SELECT * from `" . $database . "`";
$result = mysql_query($query) or die('Query failed: ' . mysql_error());
while ($row = mysql_fetch_object($result)) {
    echo json_encode($row);
    echo ",";
}
?>
My question to you is: what can I do to make this script better at handling larger database dumps?

This is what I think the problem is: you are using mysql_query. mysql_query buffers the entire result set in memory, and mysql_fetch_object then just fetches rows from that buffer. For very large tables you simply don't have enough memory (most likely you are pulling all 40GB of rows into that one call).
Use mysql_unbuffered_query instead. There is more info on the MySQL Performance Blog, where you can find some other possible causes for this behavior.
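A minimal sketch of the unbuffered version, keeping the rest of the original script as-is (the fixed table name my_table stands in for the $_GET value, which should really be whitelisted anyway):
<?php
// Sketch: stream rows with an unbuffered query so the 40GB result set is never
// held in PHP's memory. Same connection details as the original script.
mysql_connect('localhost', 'root', 'root') or die('Could not connect: ' . mysql_error());
mysql_select_db('user_recording') or die('Could not select database');

$result = mysql_unbuffered_query("SELECT * FROM `my_table`")
    or die('Query failed: ' . mysql_error());

echo "[";
$first = true;
while ($row = mysql_fetch_object($result)) {
    if (!$first) {
        echo ",";
    }
    echo json_encode($row);   // one row at a time; nothing accumulates in PHP
    $first = false;
}
echo "]";
?>
As a side effect this also fixes the output, which becomes a valid JSON array instead of a comma-separated list with a trailing comma.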

I'd say just let MySQL do it for you, not PHP:
SELECT
  CONCAT("[",
    GROUP_CONCAT(
      CONCAT("{field_a:'", field_a, "'"),
      CONCAT(",field_b:'", field_b, "'}")
    ),
  "]")
AS json FROM table;
It should generate something like this:
[
{field_a:'aaa',field_b:'bbb'},
{field_a:'AAA',field_b:'BBB'}
]
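If you go this route, the PHP side is just a pass-through that echoes the single json column (table and field names are the placeholders from the answer above):
<?php
// Sketch: MySQL assembles the JSON string; PHP just echoes the one row that comes back.
mysql_connect('localhost', 'root', 'root') or die('Could not connect: ' . mysql_error());
mysql_select_db('user_recording') or die('Could not select database');

$sql = <<<SQL
SELECT
  CONCAT("[",
    GROUP_CONCAT(
      CONCAT("{field_a:'", field_a, "'"),
      CONCAT(",field_b:'", field_b, "'}")
    ),
  "]")
AS json FROM `table`
SQL;

$result = mysql_query($sql) or die('Query failed: ' . mysql_error());
$row = mysql_fetch_assoc($result);
echo $row['json'];
?>
Be aware that GROUP_CONCAT truncates its output at group_concat_max_len (1024 bytes by default), so for anything beyond small tables you would need to raise that setting, and the output above is JavaScript-object-literal style (unquoted keys) rather than strict JSON.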

You might have a problem with MySQL buffering. But, you might also have other problems. If your script is timing out, try disabling the timeout with set_time_limit(0). That's a simple fix, so if that doesn't work, you could also try:
1. Try dumping your database offline, then transfer it via script or just direct HTTP. You might try making a first PHP script call a shell script which calls a PHP-CLI script that dumps your database to text. Then just pull the dump via HTTP.
2. Try having your script dump part of the database at a time (rows 0 through N, N+1 through 2N, etc.); see the sketch below.
3. Are you using compression on your HTTP connections? If your lag is transfer time (not script processing time), then speeding up the transfer via compression might help.
4. If it's the data transfer, JSON might not be the best way to transfer the data. Maybe it is, I don't know. This question might help you: Preferred method to store PHP arrays (json_encode vs serialize)
Also, for options 1 and 3, you might try looking at this question:
What is the best way to handle this: large download via PHP + slow connection from client = script timeout before file is completely downloaded
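For option 2, here's a rough sketch of the chunked dump; the offset parameter and the 10,000-row chunk size are made up for illustration:
<?php
// Sketch: dump the table in slices so no single request handles all 40GB.
// The client calls ?offset=0, ?offset=10000, ... until it gets an empty body.
$chunkSize = 10000;
$offset = isset($_GET['offset']) ? (int) $_GET['offset'] : 0;

mysql_connect('localhost', 'root', 'root') or die('Could not connect: ' . mysql_error());
mysql_select_db('user_recording') or die('Could not select database');

$result = mysql_unbuffered_query(
    "SELECT * FROM `my_table` LIMIT " . $offset . ", " . $chunkSize
) or die('Query failed: ' . mysql_error());

while ($row = mysql_fetch_object($result)) {
    echo json_encode($row), "\n";   // one JSON object per line
}
?>
Keep in mind that LIMIT with a large offset gets slower as the offset grows, because MySQL still has to skip over all the earlier rows; if the table has an auto-increment key, paging with WHERE id > ? ORDER BY id LIMIT 10000 scales much better.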

Related

phpMyAdmin executes queries faster than normal PHP script

I have an SQL query:
SELECT country_code FROM GeoIP WHERE 3111478102>=ip_from AND 3111478102<=ip_to;
I tried to execute the same query in both phpMyAdmin and a normal PHP script, and here are the results (screenshots of the phpMyAdmin and PHP script timings):
As you can see, the query took only 0.8s to fully execute in phpMyAdmin, whereas it took 4.5s in the normal PHP script!
The php script code:
<?php
// ob_start();
// error_reporting(0);
error_reporting(E_ALL);
ini_set('display_errors', 'On');
header_remove("X-Powered-By");
ini_set('session.gc_maxlifetime', 13824000);
session_set_cookie_params(13824000);
session_start();
include_once './config.php';
// include_once './functions/mainClass.php';
$beforeMicro = microtime(true);
$res = $conn->query("SELECT country_code FROM GeoIP WHERE 3111478102>=ip_from AND 3111478102<=ip_to");
if ($res->rowCount() > 0) {
    $sen = $res->fetch(PDO::FETCH_OBJ);
    $country_code = $sen->country_code;
    print_r([$country_code]);
}
$afterMicro = microtime(true);
echo 'time:'.round(($afterMicro-$beforeMicro)*1000);
?>
I recently set up a new web server, which is running the following:
CentOS 8
Apache 2.4.46
PHP 7.2.24
Please note: the table I'm searching in (GeoIP) contains over 39 million records, but I don't think that is the problem, because the same query runs faster on the same server in phpMyAdmin. I also tried uploading the same database and the PHP script to a shared hosting account (not my own server), and there it executed the query from the PHP script in just 1.6s.
This might be a helpful thread for your question:
Why would phpmyadmin be significantly faster than the mysql command line?
Front-end tools like phpMyAdmin often staple on a LIMIT clause in order to paginate results and not crash your browser or app on large tables. A query that might return millions of records, and in so doing take a lot of time, will run faster if more constrained.
It's not really fair to compare a limited query with a complete one; the retrieval time is going to be significantly different. Check that both tools are fetching all records.
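One quick way to test that theory is to time the PHP side with the same kind of LIMIT phpMyAdmin adds (LIMIT 1 here is illustrative; use whatever LIMIT phpMyAdmin shows in its generated SQL):
<?php
// Sketch: constrain the PHP query the same way phpMyAdmin does before comparing timings.
// Assumes $conn is the PDO connection created in config.php, as in the script above.
include_once './config.php';

$t0 = microtime(true);
$res = $conn->query("SELECT country_code FROM GeoIP
                     WHERE 3111478102 >= ip_from AND 3111478102 <= ip_to
                     LIMIT 1");
$row = $res->fetch(PDO::FETCH_OBJ);
echo 'time with LIMIT: ' . round((microtime(true) - $t0) * 1000) . ' ms';
?>
If the two timings converge once both sides are limited, the difference was never PHP versus phpMyAdmin; it was how much of the 39-million-row scan each one actually waited for.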

Updating page info using jQuery from a PHP script that performs an external connection

I have a PHP script that performs a connection to my other server using file_get_contents, and then retrieves and displays the data.
//authorize connection to the ext. server
$xml_data = file_get_contents("http://server.com/connectioncounts");
$doc = new DOMDocument();
$doc->loadXML($xml_data);
//variables to check for name / connection count
$wmsast = $doc->getElementsByTagName('Name');
$wmsasct = $wmsast->length;
//start the loop that fetches and displays each name
for ($sidx = 0; $sidx < $wmsasct; $sidx++) {
    $strname = $wmsast->item($sidx)->getElementsByTagName("WhoIs")->item(0)->nodeValue;
    $strctot = $wmsast->item($sidx)->getElementsByTagName("Sessions")->item(0)->nodeValue;
    /**************************************
    Display only one instance of their name.
    strpos will check to see if the string contains a _ character
    **************************************/
    if (strpos($strname, '_') !== FALSE) {
        //null. ignoring any duplicates
    } else {
        //Leftovers. This section contains the names that are only the BASE (no _jibberish, etc)
        echo $sidx . " <b>Name: </b>" . $strname . " Sessions: " . $strctot . "<br />";
    } //end display base check
} //end name loop
From the client side, I'm calling this script with jQuery's load() and executing it on mousemove():
$(document).mousemove(function(event){
    $('.xmlData').load('./connectioncounts.php').fadeIn(1000);
});
I've also experimented with setInterval, which works just as well:
var auto_refresh = setInterval(function () {
    $('.xmlData').load('./connectioncounts.php').fadeIn("slow");
}, 1000); //refresh, 1000 milli = 1 second
It all works and the contents appear in "real time", but I can already notice an effect on performance and it's just me using it.
I'm trying to come up with a better solution but falling short. The problem with what I have now is that each client would be forcing the script to initiate a new connection to the other server, so I need a solution that will consistently keep the information updated without involving the clients making a new connection directly.
One idea I had was to use a cron job that executes the script, and modify the PHP to log the contents. Then I could simply get the contents of that cache from the client side. This would mean that there is only one connection being made instead of forcing a new connection every time a client wants the data.
The only problem is that the cron would have to run frequently, like every few seconds. I've read about people running cron that often before, but none of the examples I've come across were also making an external connection each time.
Is there any option for me other than cron to achieve this or in your experience is that good enough?
How about this:
When the first client reads your data, you retrieve them from the remote server and cache them together with a timestamp.
When the next clients request the same data, you check how old the cached contents are, and only if they're older than 2 seconds (or whatever) do you access the remote server again.
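A minimal sketch of that idea with a plain file cache (the cache path and the 2-second TTL are just placeholders):
<?php
// Sketch: serve a cached copy unless it is older than $ttl seconds,
// so many clients polling at once still cause only one remote fetch per window.
$cacheFile = __DIR__ . '/cache/connectioncounts.xml';
$ttl = 2; // seconds

if (is_file($cacheFile) && (time() - filemtime($cacheFile)) < $ttl) {
    $xml_data = file_get_contents($cacheFile);          // fresh enough, reuse it
} else {
    $xml_data = file_get_contents("http://server.com/connectioncounts");
    file_put_contents($cacheFile, $xml_data, LOCK_EX);  // refresh the cache
}

// ...then parse $xml_data with DOMDocument exactly as in the original script.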
Make yourself familiar with APC as a global storage. Once you have fetched the file, store it in the APC cache and set a timeout. You only need to connect to the remote server once a page is not in the cache or is outdated.
Mousemove: are you sure? That generates gazillions of parallel requests unless you set a semaphore client-side to stop issuing AJAX queries.
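A sketch of the same idea with the apcu_* functions (APCu being the userland successor to the APC cache mentioned above); the key name and TTL are placeholders:
<?php
// Sketch: keep the fetched XML in shared memory with a TTL, so only one
// request per TTL window actually hits the remote server.
$xml_data = apcu_fetch('connectioncounts_xml', $hit);
if (!$hit) {
    $xml_data = file_get_contents("http://server.com/connectioncounts");
    apcu_store('connectioncounts_xml', $xml_data, 2); // expire after 2 seconds
}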

is mysql_connect in header bad practice?

I have a normal website. It uses PHP to query a MySQL table. To save typing it out all the time, I include() a connect.php file which connects to the database for me on every page. The website keeps getting a "mysql too many connections" error. Is it a bad idea to connect at the start of a page like this?
Should I create a new connection each time a PHP script needs it, then close that connection with mysql_close() after that script is done? I had avoided doing that as it would add repeated lines of code to the website, but I'm wondering if that's what's causing the issue.
So at the moment my code is similar to this:
<?php
include("connect.php"); //connects to database using mysql_connect() function
...
PHP script that calls mysql
...
another PHP script that calls mysql
?>
Should I change it to something like this?
<?php
mysql_connect('host', 'username', 'password');
mysql_select_db('db');
PHP code that calls mysql
mysql_close();
...
mysql_connect('host', 'username', 'password');
mysql_select_db('db');
more PHP code that calls mysql
mysql_close();
?>
You should avoid making new connections to your database as long as possible.
Making a connection to the database is a slow and expensive process; keeping a connection open consumes few resources.
By the way, stop using the mysql_* functions. Use mysqli_* or PDO.
Should I create a new connection each time a PHP script needs it [...] ?
Yes, that makes the most sense, especially if not every page needs a MySQL connection.
In PHP this works by setting up the database credentials in the php.ini file; you can then just call mysql_select_db() and it will automatically connect if no connection exists so far.
If you write new code, encapsulate the database connection in an object of its own so that you have more fine-grained control over when to connect to the database.
Modern frameworks such as Silex allow you to lazy-load such central components (services), so they are configured and you can make use of them when you need them, but you don't need to worry about the resources (like the connection limit in your example).
[...] close that connection with mysql_close after that script is done?
You don't need that normally because PHP does this for you.
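As a sketch of the "encapsulate the connection and connect lazily" advice (the class name and credentials are made up for illustration, and it uses PDO rather than the mysql_* functions):
<?php
// Sketch: a lazily-connecting wrapper. The connection is only opened the first
// time a page actually needs it, and PHP closes it at the end of the request,
// so pages that never query the database never use up a connection.
class Database
{
    private static $pdo = null;

    public static function get()
    {
        if (self::$pdo === null) {
            self::$pdo = new PDO(
                'mysql:host=localhost;dbname=db;charset=utf8',
                'username',
                'password',
                array(PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION)
            );
        }
        return self::$pdo;
    }
}

// Usage on a page that needs the database:
// $rows = Database::get()->query("SELECT ...")->fetchAll();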
I do not think there is anything really wrong with this style of coding. It all depends on what kind of app you are writing. Just make sure you check your scripts well and get rid of any errors, and you should be fine.
This is what I usually do:
<?php
session_start();
include "inc/config.php";
//require_once('chartz/lib/idiorm.php');
if ($_SESSION["login1"] == "Superadmin" or $_SESSION["login1"] == "User" or $_SESSION["login1"] == "Administrator") {
    $snames = $_SESSION["name1"];
    $id = $_SESSION["id1"];
    $stype = $_SESSION["login1"];
    $stokperm = $_SESSION['stokperm'];
    $navtype = $_GET['nav'];
    include("inc/ps_pagination.php");
} else {
    header("location: ../../../index.php");
}
?>
I'm not saying it's the very best way out there; we are all still learning...
I also think you should take the advice of blue112 very seriously. Most of my apps are written in the old-fashioned way, but mysqli_* or PDO is the way to go.

PHP/MySql Connection Overhead

I have recently completed a data layer for a PHP application. In my data layer I have various methods for executing different sql tasks such as selects, inserts, deletes, etc. Coming from a .NET background, I practice opening connections, doing whatever with the connection, and closing them.
In a recent code review, I was questioned about this practice, and a colleague stated it was best to leave connections open for the life of the application. Their reasoning is that opening/closing connections is time consuming. My argument is that leaving them open is resource consuming. Following is a code sample from the data layer that executes a select query. I am fairly new to PHP so I don't really have a response for the critique. Can anyone provide any insight into this?
public static final function executeSelectQuery($qry){
    $connection = mysql_connect(ADS_DB_HOST, ADS_DB_USERNAME, ADS_DB_PASSWORD) or die(ADS_ERROR_MSG . mysql_error());
    $db = mysql_select_db(ADS_DB_NAME) or die(ADS_ERROR_MSG . mysql_error());
    $result = mysql_query($qry) or die(ADS_ERROR_MSG . mysql_error());
    mysql_close();
    $results = array();
    while($rows = mysql_fetch_assoc($result)){
        $results[] = $rows;
    }
    return sprintf('{"results":{"rows":%s}}', json_encode($results));
}
Classic case of the speed vs memory usage trade-off. There is no one-answer-fits-all to this question; it comes down to which of the two is the most important for the particular project you are undertaking.
Edit: After reading the new comments I see this project is aimed at mobile devices. In this case I would say that due to the limited memory of mobile devices (as opposed to the boatload available to most desktops nowadays) prioritising memory usage would be the right call since everyone expects mobile devices to be rather slow in comparison to desktops anyway.
Your colleague is right and this code is wrong.
Ask yourself: what resources does an open connection actually consume?
To add to the code review: your idea of error handling is wrong. Never use die() to handle an error; throw an Exception instead.
Also, do not craft JSON manually:
return json_encode(array("results" => array("rows" => $results)));
Also, consider amending this function to make it accept values for a parameterized query.
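A sketch of what that amended method could look like with PDO, exceptions instead of die(), json_encode instead of hand-built JSON, and bound parameters (self::$pdo is an assumed shared connection, not something from the original class):
<?php
// Sketch: the same select helper, but it throws on errors, builds JSON with
// json_encode, and takes values for a parameterized query.
public static function executeSelectQuery($qry, array $params = array())
{
    // Assumes self::$pdo was created elsewhere with PDO::ERRMODE_EXCEPTION set.
    $stmt = self::$pdo->prepare($qry);
    $stmt->execute($params);                      // throws PDOException on failure
    $results = $stmt->fetchAll(PDO::FETCH_ASSOC);

    return json_encode(array('results' => array('rows' => $results)));
}

// Usage:
// executeSelectQuery('SELECT * FROM ads WHERE owner_id = ?', array($ownerId));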

Browser crashes when around 4 million records are inserted into MySQL

I downloaded a database that was exported to TXT format; it is about 700MB with 7 million records (1 per line).
I made a script to import the data into a MySQL database, but when about 4 million records have been inserted, the browser crashes.
I have tested in Firefox and IE.
Can someone give me an opinion and some advice about this?
The script is this:
<?php
set_time_limit(0);
ini_set('memory_limit', '128M');
$conexao = mysql_connect("localhost", "root", "") or die (mysql_error());
$base = mysql_select_db("lista102", $conexao) or die (mysql_error());
$ponteiro = fopen("TELEFONES_1.txt", "r");
$conta = 0;

function myflush() { ob_flush(); flush(); }

while (!feof($ponteiro)) {
    $conta++;
    $linha = fgets($ponteiro, 4096);
    $linha = str_replace("\"", "", $linha);
    $arr = explode(";", $linha);
    $sql = "insert into usuarios (CPF_CNPJ,NOME,LOG,END,NUM,COMPL,BAIR,MUN,CEP,DDD,TELEF) values ('".$arr[0]."','".$arr[1]."','".$arr[2]."','".$arr[3]."','".$arr[4]."','".$arr[5]."','".$arr[6]."','".$arr[7]."','".$arr[8]."','".$arr[9]."','".trim($arr[10])."')";
    $rs = mysql_query($sql);
    if (!$rs) { echo $conta . " error"; }
    if (($conta % 5000) == 4999) { sleep(10); echo "<br>Pause: " . $conta; }
    myflush();
}
echo "<BR>Eof, import complete";
fclose($ponteiro);
mysql_close($conexao);
?>
Try splitting the file into 100 MB chunks. This is a quick suggestion to get the job done; the browser issue can get complicated to solve. Also try different browsers.
phpMyAdmin has an option to continue the import if a crash happens: it allows interrupting the import when the script detects it is close to the time limit. This might be a good way to import large files, however it can break transactions.
I'm not sure why you need a web browser to insert records into MySQL. Why not just use the import facilities of the database itself and leave the web out of it?
If that's not possible, I'd wonder if chunking the inserts into groups of 1000 at a time would help. Rather than committing the entire import as a single transaction, I'd recommend breaking it up (see the sketch below).
Are you using InnoDB?
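A sketch of the batched-transaction idea with PDO (the DSN and credentials mirror the original script; the batch size of 1000 comes from the suggestion above):
<?php
// Sketch: wrap every 1000 inserts in one transaction instead of autocommitting
// each row; on InnoDB this avoids a flush to disk per row and keeps transactions small.
$pdo = new PDO('mysql:host=localhost;dbname=lista102', 'root', '', array(
    PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION,
));
$stmt = $pdo->prepare(
    "INSERT INTO usuarios (CPF_CNPJ,NOME,LOG,END,NUM,COMPL,BAIR,MUN,CEP,DDD,TELEF)
     VALUES (?,?,?,?,?,?,?,?,?,?,?)"
);

$fh = fopen("TELEFONES_1.txt", "r");
$count = 0;
$pdo->beginTransaction();
while (($linha = fgets($fh, 4096)) !== false) {
    $stmt->execute(explode(";", str_replace("\"", "", trim($linha))));
    if ((++$count % 1000) === 0) {   // commit every 1000 rows
        $pdo->commit();
        $pdo->beginTransaction();
    }
}
$pdo->commit();
fclose($fh);
?>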
The first thing I noticed is that you are using flush() unsafely. Calling flush() when the httpd buffer is full results in an error and your script dies. Give up the myflush() workaround and use a single ob_implicit_flush() instead.
You don't need to watch it in your browser to make it run to the end: you can place an ignore_user_abort() call so your code completes its job even if your browser dies.
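Concretely, that just means putting something like this at the top of the import script:
<?php
// Sketch: let the script run to completion even if the browser gives up,
// and flush output implicitly instead of the manual ob_flush()/flush() pair.
ignore_user_abort(true);
set_time_limit(0);
ob_implicit_flush(true);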
Not sure why your browser is dying. Maybe your script is generating too much content.
Try it with no
<br> Pause: nnnn
output to the browser, and see if that helps. It may be simply that the browser is choking on the long web page it's asked to render.
Also, is PHP timing out during the long transfer?
It doesn't help, also, that you have sleep(10) adding to the time it takes.
You can try splitting the file up into several TXT files and redoing the process with each of them. I know I've used that approach at least once.
The browser is choking because the request is taking too long to complete. Is there a reason this process should be part of a web page? If you absolutely have to do it this way, consider splitting up your data in manageable chunks.
Run your code on the command line using PHP-CLI. This way you will never hit a time-out for a long-running process (although in your situation the browser crashes before any time-out ^^).
If you're on a hosting server where you don't have shell access, run the code using crontab. But you have to make sure that the crontab entry only runs once!
