I have some very large data files and for business reasons I have to do extensive string manipulation (replacing characters and strings). This is unavoidable. The number of replacements runs into hundreds of thousands.
It's taking longer than I would like. PHP is generally very quick but I'm doing so many of these string manipulations that it's slowing down and script execution is running into minutes. This is a pain because the script is run frequently.
I've done some testing and found that str_replace is fastest, followed by strstr, followed by preg_replace. I've also tried individual str_replace statements as well as constructing arrays of patterns and replacements.
I'm toying with the idea of isolating the string manipulation operations and writing them in a different language, but I don't want to invest time in that option only to find that the improvements are negligible. Plus, I only know Perl, PHP and COBOL, so for any other language I would have to learn it first.
I'm wondering how other people have approached similar problems?
I have searched and I don't believe that this duplicates any existing questions.
Well, considering that in PHP some string operations are faster than array operations, and you are still not satisfied with its speed, you could write an external program as you mentioned, probably in some "lower-level" language. I would recommend C/C++.
There are two ways of handling this, IMO:
[easy] Precompute some generic replacements in a background process and store them in a DB/file (this trick comes from gamedev, where all the sines and cosines are precomputed once and then stored in RAM). You can easily run into the curse of dimensionality here, though;
[not so easy] Implement the replacement tool in C++ or another fast, compiled programming language and use it afterwards. Sphinx is a good example of a fast manipulation tool for big textual data sets implemented in C++.
If you'd allow the replacement to be handled over multiple executions, you could create a script that processes each file, temporarily creating replacement files with duplicate content. This would allow you to extract data from one file to another, process the copy, and then merge the changes; or, if you use a stream buffer, you might be able to remember each row so the copy/merge step can be skipped.
The problem, though, might be that you process a file without completing it, leaving it in a mixed state. Therefore a temporary file is suitable.
This would allow the script to run as many times as there are still changes to be made; all you need is a temporary file that remembers which files have been processed.
The limiting factor is PHP rebuilding the strings. Consider:
$out = str_replace('bad', 'good', 'this is a bad example');
It's a relatively low-cost operation to locate 'bad' in the string, but in order to make room for the longer substitution, PHP then has to shift each of the trailing characters (e, l, p, m, a, x, e, space) before writing in the new value.
Passing arrays for the needle and haystack will improve performance, but not as much as it might.
AFAIK, PHP does not have low-level memory access functions, hence an optimal solution would have to be written in a different language, dividing the data up into 'pages' which can be stretched to accommodate changes. You could approximate this using chunk_split to divide the string up into smaller units (so each replacement would require less memory juggling).
Another approach would be to dump it into a file and use sed (this still operates one search/replace at a time), e.g.
sed -i 's/bad/good/g;s/worse/better/g' file_containing_data
If you have to do this operation only once and you have to replace with static content, you can use Dreamweaver or another editor, so you will not need PHP. It will be much faster.
Still, if you do need to do this dynamically with PHP (because you need database records or the like), you can use shell commands via exec - google search for search-replace.
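For instance, a minimal sketch of shelling out to the sed command shown above (the file path is a placeholder):
$file = '/path/to/file_containing_data';                  // placeholder path
$cmd  = 'sed -i ' . escapeshellarg('s/bad/good/g;s/worse/better/g')
      . ' ' . escapeshellarg($file);
exec($cmd, $output, $exitCode);                           // run sed from PHP
if ($exitCode !== 0) {
    // sed failed; fall back to the PHP replacement or log the error
}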
It is possible that you have hit a wall with PHP. PHP is great, but in some areas it fails, such as processing LOTS of data. There are a few things you could do:
Use more than one PHP process to accomplish the task (two processes could potentially take half as long).
Install a faster CPU.
Do the processing on multiple machines.
Use a compiled language to process the data (Java, C, C++, etc)
I think the question is why are you running this script frequently? Are you performing the computations (the string replacements) on the same data over and over again, or are you doing it on different data every time?
If you're doing it on different data every time, then there isn't much more you can do to improve performance on the PHP side. You can improve performance in other ways, such as using better hardware (SSDs for faster reads/writes on the files), multicore CPUs and breaking the data into smaller pieces handled by multiple scripts at the same time to process the data concurrently, and faster RAM (i.e. higher bus speeds).
If you're performing the computations on the same data over and over again, then you might want to consider caching the result using something like Memcached or Redis (key/value cache stores), so that you only perform the computation once and afterwards it's just a linear read from memory, which is very cheap and involves virtually no CPU overhead (you might also utilize the CPU cache at this level).
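As a rough sketch of that idea (assuming the Memcached extension and a local memcached server; the file name and replacement pairs are placeholders):
$mc = new Memcached();
$mc->addServer('127.0.0.1', 11211);

$key    = 'replaced:' . md5_file('in.txt');      // cache key tied to the input's contents
$result = $mc->get($key);
if ($result === false) {                          // cache miss: do the expensive work once
    $result = str_replace(array('bad', 'worse'), array('good', 'better'), file_get_contents('in.txt'));
    $mc->set($key, $result, 3600);                // keep it for an hour
}
echo $result;                                     // later runs are just a cheap memory read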
String manipulation in PHP is already cheap because PHP strings are essentially just byte arrays. There's virtually no overhead from PHP in reading a file into memory and storing it in a string. If you have some sample code that demonstrates where you're seeing performance issues and some benchmark numbers, I might have some better advice, but right now it just looks like you need to refactor your approach based on what your underlying needs are.
For example, there are both CPU and I/O costs to consider individually when you're dealing with data in different situations. I/O involves blocking since it's a system call. This means your CPU has to wait for more data to come over the wire (while your disk transfers data to memory) before it can continue to process or compute that data. Your CPU is always going to be much faster than memory and memory is always much faster than disk.
Here's a simple benchmark to show you the difference:
/* First, let's create a simple test file to benchmark */
file_put_contents('in.txt', str_repeat(implode(" ", range('a', 'z')), 10000));

/* Now let's write two different tests that replace all vowels with asterisks */

// The first test reads the entire file into memory and performs the computation all at once
function test1($filename, $newfile) {
    $start   = microtime(true);
    $data    = file_get_contents($filename);
    $changes = str_replace(array('a','e','i','o','u'), '*', $data);
    file_put_contents($newfile, $changes);
    return sprintf("%.6f", microtime(true) - $start);
}
// The second test reads only 8KB chunks at a time and performs the computation on each chunk
function test2($filename, $newfile) {
    $start   = microtime(true);
    $fp      = fopen($filename, "r");
    $changes = '';
    while (!feof($fp)) {
        $changes .= str_replace(array('a','e','i','o','u'), '*', fread($fp, 8192));
    }
    fclose($fp);
    file_put_contents($newfile, $changes);
    return sprintf("%.6f", microtime(true) - $start);
}
The above two tests do the same exact thing, but Test2 proves significantly faster for me when I'm using smaller amounts of data (roughly 500KB in this test).
Here's the benchmark you can run...
// Conduct 100 iterations of each test and average the results
for ($i = 0; $i < 100; $i++) {
    $test1[] = test1('in.txt', 'out.txt');
    $test2[] = test2('in.txt', 'out.txt');
}

echo "Test1 average: ", sprintf("%.6f", array_sum($test1) / count($test1)), "\n",
     "Test2 average: ", sprintf("%.6f\n", array_sum($test2) / count($test2));
For me the above benchmark gives Test1 average: 0.440795 and Test2 average: 0.052054, which is an order of magnitude difference and that's just testing on 500KB of data. Now, if I increase the size of this file to about 50MB Test1 actually proves to be faster since there are fewer system I/O calls per iteration (i.e. we're just reading from memory linearly in Test1), but more CPU cost (i.e. we're performing a much larger computation per iteration). The CPU generally proves to be able to handle much larger amounts of data at a time than your I/O devices can send over the bus.
So it's not a one-size-fits-all solution in most cases.
Since you know Perl, I would suggest doing the string manipulations in Perl using regular expressions and using the final result in the PHP web page.
This seems better for the following reasons:
You already know Perl
Perl does string processing better
You can use PHP only where necessary.
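A minimal sketch of that split (the file name and patterns are placeholders): let a Perl one-liner do the heavy replacement in place, then pick the result up in PHP.
$file = escapeshellarg('data.txt');                        // placeholder path
shell_exec("perl -i -pe 's/bad/good/g; s/worse/better/g' $file");
$clean = file_get_contents('data.txt');                    // use the processed data in the page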
does this manipulation have to happen on the fly? if not, might i suggest pre-processing... perhaps via a cron job.
define what rules you're going to be using.
is it just one str_replace or a few different ones?
do you have to do the entire file in one shot? or can you split it into multiple batches? (e.g. half the file at a time)
once your rules are defined decide when you will do the processing. (e.g. 6am before everyone gets to work)
then you can set up a job queue. i have used cron jobs to run my php scripts on a given time schedule.
for a project i worked on a while ago i had a setup like this:
7:00 - pull 10,000 records from mysql and write them to 3 separate files.
7:15 - run a complex regex on file one.
7:20 - run a complex regex on file two.
7:25 - run a complex regex on file three.
7:30 - combine all three files into one.
8:00 - walk into the meeting with the formatted file your boss wants. *profit*
hope this helps get you thinking...
Related
I intend to make a dynamic list in PHP, for which I have a plain text file with an element of the list on every line. Every line has a string that needs to be parsed into several smaller chunks before rendering the final HTML document.
Last time I did something similar, I used the file() function to load my file into an array, but in this case I have a 12KB file with more than 50 lines, which will most certainly grow bigger over time. Should I load the entries from the file into a SQL database to avoid performance issues?
Yes, put the information into a data base. Not for performance reasons (in terms of sequential reading) because a 12KB file will be read very quickly, but for the part about parsing into separate chunks. Make those chunks into columns of your DB table. It will make the whole programming process go faster, with greater flexibility.
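A rough sketch of that approach (the table, column names and delimiter are hypothetical): parse each line once on import and render from the table afterwards.
$pdo  = new PDO('mysql:host=localhost;dbname=site', 'user', 'pass');
$stmt = $pdo->prepare('INSERT INTO list_items (title, url, blurb) VALUES (?, ?, ?)');

foreach (file('list.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES) as $line) {
    list($title, $url, $blurb) = explode('|', $line, 3);   // hypothetical delimiter
    $stmt->execute(array($title, $url, $blurb));           // one column per parsed chunk
}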
Breaking stuff up into a properly formatted database is -almost- always a good idea and will be a performance saver.
However, 50 lines is pretty minor (even a few hundred lines is pretty minor). A bit of quick math, 12KB / 50 lines tells me each line is only about 240 characters long on average.
I doubt that amount of processing (or even several times that much) will be a significant enough performance hit to cause dread unless this is a super high performance site.
While 50 lines doesn't seem like much, it would be a good idea to use the database now rather than making the change later. One thing you have to remember is that using a database won't straight away eliminate performance issues, but it will help you make better use of resources. In fact, you can write a similarly optimized process using files too, and it would work just about the same except for the I/O difference.
I reread the question and realized that you might mean you would load the file into the database every time. I don't see how this can help unless you are using the database as a form of cache to avoid repeated hits to the file. Ultimately, reading from a file or a database would only differ in how the script uses I/O, disk caches, etc. The processing you do on the list might make more of a difference here.
Memory management is not something that most PHP developers ever need to think about. I'm running into an issue where my command line script is running out of memory. It performs multiple iterations over a large array of objects, making multiple database requests per iteration. I'm sure that increasing the memory ceiling may be a short term fix, but I don't think it's an appropriate long-term solution. What should I be doing to make sure that my script is not using too much memory, and using memory efficiently?
The golden rule
The number one thing to do when you encounter (or expect to encounter) memory pressure is: do not read massive amounts of data into memory at once if you intend to process them sequentially.
Examples:
Do not fetch a large result set in memory as an array; instead, fetch each row in turn and process it before fetching the next
Do not read large text files in memory (e.g. with file); instead, read one line at a time
This is not always the most convenient thing in PHP (arrays don't cut it, and there is a lot of code that only works on arrays), but in recent versions and especially after the introduction of generators it's easier than ever to stream your data instead of chunking it.
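For example, a minimal sketch of streaming a file with a generator instead of file() (requires PHP 5.5+; the file name is a placeholder):
function readLines($path) {
    $fp = fopen($path, 'r');
    while (($line = fgets($fp)) !== false) {
        yield rtrim($line, "\r\n");      // hand back one line at a time
    }
    fclose($fp);
}

foreach (readLines('huge.log') as $line) {
    // process a single line; memory use stays flat regardless of file size
}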
Following this practice religiously will "automatically" take care of other things for you as well:
There is no longer any need to clean up resources with a big memory footprint by closing them and losing all references to them on purpose, because there will be no such resources to begin with
There is no longer a need to unset large variables after you are done with them, because there will be no such variables as well
Other things to do
Be careful of creating closures inside loops; this should be easy to avoid, as creating closures inside loops is a bad code smell anyway. You can always lift the closure upwards and give it more parameters.
When expecting massive input, design your program and pick algorithms accordingly. For example, you can mergesort any amount of text files of any size using a constant amount of memory.
You could try profiling it by putting some calls to memory_get_usage() in, to look for the place where it's peaking.
Of course, knowing what the code really does, you'll have more information to help reduce its memory usage.
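Something along these lines (the expensive step is hypothetical):
error_log(sprintf('before load: %.1f MB', memory_get_usage(true) / 1048576));

$objects = loadAllObjects();                       // hypothetical expensive step

error_log(sprintf('after load: %.1f MB (peak %.1f MB)',
    memory_get_usage(true) / 1048576,
    memory_get_peak_usage(true) / 1048576));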
When you compute your large array of objects, try not to compute it all at once. Walk through it in steps, process elements as you go, then free the memory and take the next elements.
It will take more time, but you can manage the amount of memory you use.
I am working on a website and I am trying to make it as fast as possible - especially the small things that can make my site a little bit quicker.
So, to my question: I have a loop that runs 5 times, and each time it echoes something. If I make a variable, have the loop append the text I want to echo to that variable, and only echo the variable at the end - will it be faster?
loop 1 (with the echo inside the loop)
for ($i = 0; $i < 5; $i++)
{
    echo "test";
}
loop 2 (with the echo outside [when the loop finish])
$echostr = "";
for ($i = 0;$i < 5;$i++)
{
$echostr .= "test";
}
echo $echostr;
I know that loop 2 will increase the file size a bit and therefore the user will have to download more bytes, but if I have a huge loop, will it be better to use the second loop or not?
Thanks.
The difference is negligible. Do whatever is more readable (which in this case is definitely the first case). The first approach is not a "naive" approach so there will be no major performance difference (it may actually be faster, I'm not sure). The first approach will also use less memory. Also, in many languages (not sure about PHP), appending to strings is expensive, and therefore so is concatenation (because you have to seek to the end of the string, reallocate memory, etc.).
Moreover, file size does not matter because PHP is entirely server-side -- the user never has to download your script (in fact, it would be scary if they did/could). These types of things may matter in Javascript but not in PHP.
Long story short -- don't write code constantly trying to make micro-optimizations like this. Write the code in the style that is most readable and idiomatic, test to see if performance is good, and if performance is bad then profile and rewrite the sections that perform poorly.
I'll end on a quote:
"premature emphasis on efficiency is a big mistake which may well be the source of most programming complexity and grief."
- Donald Knuth
This is a classic case of premature optimization. The performance difference is negligible.
I'd say that in general you're better off constructing a string and echoing it at the end, but because it leads to cleaner code (side effects are bad, mkay?) not because of performance.
If you optimize like this, from the ground up, you're at risk of obfuscating your code for no perceptible benefit. If you really want your script to be as fast as possible then profile it to find out where the real bottlenecks are.
Someone else mentioned that using string concatenation instead of an immediate echo will use more memory. This isn't true unless the size of the string exceeds the size of the output buffer. In any case, to actually echo immediately you'd need to call flush() (perhaps preceded by ob_flush()), which adds the overhead of a function call*. The web server may still keep its own buffer, which would thwart this anyway.
If you're spending a lot of time on each iteration of the loop then it may make sense to echo and flush early so the user isn't kept waiting for the next iteration, but that would be an entirely different question.
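In that case, a sketch of echoing and flushing per iteration might look like this (process() is a stand-in for the real per-chunk work):
foreach ($chunks as $chunk) {
    echo process($chunk);        // hypothetical per-chunk work
    if (ob_get_level() > 0) {
        ob_flush();              // empty PHP's output buffer first
    }
    flush();                     // then ask the SAPI / web server to send it
}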
Also, the size of the PHP file has no effect on the user - it may take marginally longer to parse but that would be negated by using an opcode cache like APC anyway.
To sum up, while it may be marginally faster to echo each iteration, depending on circumstance, it makes the code harder to maintain (think Wordpress) and it's most likely that your time for optimization would be better spent elsewhere.
* If you're genuinely worried about this level of optimization then a function call isn't to be sniffed at. Flushing in pieces also implies extra protocol overhead.
The size of your PHP file does not increase the size of the download by the user. The output of the PHP file is all that matters to the user.
Generally, you want to do the first option: echo as soon as you have the data. Assuming you are not using output buffering, this means that the user can stream the data while your PHP script is still executing.
The user does not download the PHP file, but only its output, so the second loop has no effect on the user's download size.
It's best not to worry about small optimizations, but instead focus on quickly delivering working software. However, if you want to improve the performance of your site, Yahoo! has done some excellent research: developer.yahoo.com/performance/rules.html
The code you identify as "loop 2" wouldn't be any larger of a file size for users to download. String concatenation is faster than calling a function like echo, so I'd go with loop 2. For only 5 iterations of a loop I don't think it really matters all that much.
Overall, there are a lot of other areas to focus on such as compiling PHP instead of running it as a scripted language.
http://phplens.com/lens/php-book/optimizing-debugging-php.php
Your first example would, in theory, be fastest. Because your provided code is so extremely simplistic, I doubt any performance increase over your second example would be noticed or even useful.
In your first example the only variable PHP needs to initialize and utilize is $i.
In your second example PHP must first create an empty string variable. Then create the loop and its variable, $i. Then append the text to $echostr and then finally echo $echostr.
I've a PHP website which displays recipes - www.trymasak.my, to be exact. The recipes displayed on the index page are updated about once a day. To get the latest recipes, I just use a MySQL query which is something like "select recipe_name, page_views, image from table order by last_updated". So if I get 10,000 visitors a day, obviously the query is run 10,000 times a day. A friend told me a better way (in terms of reducing server load) is that when I update the recipes, I just put the latest recipe details (names, images, etc.) into a text file, and have my page, instead of running the same query 10,000 times, just get the data from the text file. Is his suggestion really better? If yes, which are the best PHP commands to open, read and close the text file?
thanks
The typical solution is to cache in memory. Either the query result or the whole page.
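As a rough sketch (assuming the APCu extension and a PDO connection in $db; the query mirrors the question and the table name is a placeholder):
$recipes = apcu_fetch('front_page_recipes', $hit);
if (!$hit) {                                            // cache miss: hit MySQL once
    $recipes = $db->query('SELECT recipe_name, page_views, image
                           FROM recipes ORDER BY last_updated DESC')
                  ->fetchAll(PDO::FETCH_ASSOC);
    apcu_store('front_page_recipes', $recipes, 3600);   // recompute at most once an hour
}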
Benchmark
To know the truth about something you should really benchmark it. "Simple is Hard" by Rasmus Lerdorf (author of PHP) is a really interesting video/slide deck (my opinion ;)) which explains how to benchmark your website. It will teach you to tackle the low-hanging fruit of your website instead of wasting your time on premature optimizations.
Donald Knuth made the following two statements on optimization: "We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil" and "In established engineering disciplines a 12% improvement, easily obtained, is never considered marginal and I believe the same viewpoint should prevail in software engineering".
In a nutshell, you will run benchmarks using tools like Siege, ab, httperf, etc. I would really like to advise you to watch this video if you aren't familiar with this topic, because I found it a really interesting watch/read.
Speed
If speed is your concern you should at least consider:
Using a bytecode cache => APC. Precompiling your PHP will really speed up your website for at least these two big reasons:
Most PHP accelerators work by caching the compiled bytecode of PHP scripts to avoid the overhead of parsing and compiling source code on each request (some or all of which may never even be executed). To further improve performance, the cached code is stored in shared memory and directly executed from there, minimizing the amount of slow disk reads and memory copying at runtime.
PHP accelerators can substantially increase the speed of PHP applications. Improvements of web page generation throughput by factors of 2 to 7 have been observed - 50 times faster for compute-intensive analysis programs.
Use an in-memory database to store your query results => Redis or Memcached. There is a very, very big mismatch between memory and the disk (I/O).
Thus, we observe that the main memory is about 10 times slower, and I/O units 1000 times slower, than the processor.
The analogy part is also an interesting read (can't copy from Google Books :)).
Databases are more flexible, secure and scalable in the long run. 10,000 queries per day isn't really that much for a modern RDBMS either. Go with (or stay with) the database.
Optimize on the caching side of things; the HTTP specification has its own section on that:
http://www.w3.org/Protocols/rfc2616/rfc2616-sec13.html
Reading files:
file_get_contents(), if you want to store the contents in a variable before outputting
readfile(), if you just want to output the file's contents
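For example (the cache file name is a placeholder):
// store the contents in a variable before outputting
$recipes = file_get_contents('latest_recipes.txt');
echo $recipes;

// or stream the file straight to the browser
readfile('latest_recipes.txt');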
If you try it, store the recipes in a folder structure like this:
/recipes/X/Y/Z/id.txt
where X, Y and Z are random integers from 1 to 25
example:
/recipes/3/12/22/12345.txt
This is because the filesystem is just another database, and it has a lot more hidden metadata updates to deal with.
I think MySQL will be faster, and certainly more manageable, since you'd have to backup the MySQL db anyway.
Opening ONE file is faster than doing a MySQL connect + query.
BUT, if your website already needs a MySQL connection to retrieve some other information, you probably want to stick with your query, because the longest part is the connection and your query is very light.
On the other hand, opening 10 files takes longer than querying 10 records from a database, because you only open one MySQL connection.
In any case, you have to consider how long is your query and if caching it in a text file will have more pros than cons.
Ok, I'll try and keep this short, sweet and to-the-point.
We do massive GeoIP updates to our system by uploading a MASSIVE CSV file to our PHP-based CMS. This thing usually has more than 100k records of IP address information. Now, doing a simple import of this data isn't an issue at all, but we have to run checks against our current regional IP address mappings.
This means that we must validate the data, compare and split overlapping IP addresses, etc. And these checks must be made for each and every record.
Not only that, but I've just created a field mapping solution that would allow other vendors to implement their GeoIP updates in different formats. This is done by applying rules to IP records within the CSV update.
For instance a rule might look like:
if 'countryName' == 'Australia' then send to the 'Australian IP Pool'
There might be multiple rules that have to be run and each IP record must apply them all. For instance, 100k records to check against 10 rules would be 1 million iterations; not fun.
We're finding 2 rules for 100k records takes up to 10 minutes to process. I'm fully aware of the bottleneck here, which is the sheer number of iterations that must occur for a successful import; I'm just not fully aware of any other options we may have to speed things up a bit.
Someone recommended splitting the file into chunks, server-side. I don't think this is a viable solution as it adds yet another layer of complexity to an already complex system. The file would have to be opened, parsed and split. Then the script would have to iterate over the chunks as well.
So, question is, considering what I just wrote, what would the BEST method be to speed this process up a bit? Upgrading the server's hardware JUST for this tool isn't an option unfortunately, but they're pretty high-end boxes to begin with.
Not as short as I thought, but yeah. Halps? :(
Perform a BULK IMPORT into a database (SQL Server is what I use). The BULK IMPORT literally takes seconds, and 100,000 records is peanuts for a database to crunch on business rules. I regularly perform similar data crunches on a table with over 4 million rows and it doesn't take the 10 minutes you listed.
EDIT: I should point out, yeah, I don't recommend PHP for this. You're dealing with raw DATA, use a DATABASE.. :P
The simple key to this is keeping as much work out of the inner loop as possible.
Simply put, anything you do in the inner loop is done "100K times", so doing nothing is best (but certainly not practical), and doing as little as possible is the next best bet.
If you have the memory, for example, and it's practical for the application, defer any "output" until after the main processing. Cache any input data if practical as well. This works best for summary data or occasional data.
Ideally, save for the reading of the CSV file, do as little I/O as possible during the main processing.
Does PHP offer any access to the Unix mmap facility? That is typically the fastest way to read files, particularly large files.
Another consideration is to batch your inserts. For example, it's straightforward to build up your INSERT statements as simple strings, and ship them to the server in blocks of 10, 50, or 100 rows. Most databases have some hard limit on the size of the SQL statement (like 64K, or something), so you'll need to keep that in mind. This will dramatically reduce your round trips to the DB.
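A rough sketch of that batching (the table and column names are hypothetical):
$batch     = array();
$batchSize = 100;                                  // rows per INSERT statement

foreach ($records as $row) {
    $batch[] = sprintf("('%s','%s')",
        mysqli_real_escape_string($db, $row['ip']),
        mysqli_real_escape_string($db, $row['country']));

    if (count($batch) >= $batchSize) {
        mysqli_query($db, "INSERT INTO geoip_map (ip, country) VALUES " . implode(',', $batch));
        $batch = array();
    }
}
if ($batch) {                                      // flush the final partial batch
    mysqli_query($db, "INSERT INTO geoip_map (ip, country) VALUES " . implode(',', $batch));
}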
If you're creating primary keys through simple increments, do that en masse (in blocks of 1000, 10000, whatever). This is another thing you can remove from your inner loop.
And, for sure, you should be processing all of the rules at once for each row, and not run the records through for each rule.
100k records isn't a large number. 10 minutes isn't a bad job processing time for a single thread. The amount of raw work to be done in a straight line is probably about 10 minutes, regardless if you're using PHP or C. If you want it to be faster, you're going to need a more complex solution than a while loop.
Here's how I would tackle it:
Use a map/reduce solution to run the process in parallel. Hadoop is probably overkill. Pig Latin may do the job. You really just want the map part of the map/reduce problem, i.e. you're forking off a chunk of the file to be processed by a sub-process. Your reducer is probably cat. A simple version could be having PHP fork processes for each 10K-record chunk, wait for the children, then re-assemble their output (a rough sketch of this follows after these options).
Use a queue/grid processing model. Queue up chunks of the file, then have a cluster of machines checking in, grabbing jobs and sending the data somewhere. This is very similar to the map/reduce model, just using different technologies, plus you could scale by adding more machines to the grid.
If you can write your logic as SQL, do it in a database. I would avoid this because most web programmers can't work with SQL on this level. Also, SQL is sort of limited for doing things like RBL checks or ARIN lookups.
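As a rough sketch of the "simple version" from the first option (CLI only, assumes the pcntl extension; the chunk files and worker function are hypothetical):
$chunks = glob('/tmp/geoip_chunk_*.csv');   // hypothetical pre-split 10K-record chunks
$pids   = array();

foreach ($chunks as $chunk) {
    $pid = pcntl_fork();
    if ($pid === 0) {                       // child: process one chunk, then exit
        processChunk($chunk);               // hypothetical worker function
        exit(0);
    }
    $pids[] = $pid;                         // parent: remember the child's pid
}

foreach ($pids as $pid) {
    pcntl_waitpid($pid, $status);           // wait for every worker before merging output
}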
One thing you can try is running the CSV import under command line PHP. It generally provides faster results.
If you are using PHP to do this job, switch the parsing to Python, since it is WAY faster than PHP on these matters; this change should speed up the process by 75% or even more.
If you are using MySQL you can also use the LOAD DATA INFILE operator; I'm not sure if you need to check the data before you insert it into the database, though.
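A rough sketch of the latter (the staging table and file path are placeholders, and LOCAL has to be enabled on both client and server):
$sql = "LOAD DATA LOCAL INFILE '/tmp/geoip_update.csv'
        INTO TABLE geoip_staging
        FIELDS TERMINATED BY ',' ENCLOSED BY '\"'
        LINES TERMINATED BY '\\n'
        IGNORE 1 LINES";
mysqli_query($db, $sql);
// then validate, split overlapping ranges and apply the rules against the staging table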
Have worked on this problem intensively for a while now. And yes, the better solution is to only read in a portion of the file at any one time, parse it, do validation, do filtering, then export it, and then read the next portion of the file. I would agree that this is probably not a solution for PHP, although you can probably do it in PHP, as long as you have a seek function so that you can start reading from a particular location in the file. You are right that it does add a higher level of complexity, but it's worth that little extra effort.
If your data is clean, i.e. delimited correctly, string-qualified, free of broken lines, etc., then by all means bulk upload it into a SQL database. Otherwise you want to know where, when and why errors occur and be able to handle them.
I'm working on something similar. The CSV file I'm working with contains Portuguese dates (dd/mm/yyyy) that I have to convert to MySQL's yyyy-mm-dd, and Portuguese monetary values (R$ 1.000,15) that have to be converted to the MySQL decimal 1000.15. I trim the possible spaces and, finally, addslashes.
There are 25 variables to be treated before the insert.
If I check every $notafiscal value (a SELECT against the table to see if it exists, then update), PHP handles around 60k rows. But if I don't check it, PHP handles more than 1 million rows. The server works with 4GB of memory; scripting on localhost (2GB of memory), it handles half as many rows in both cases.
mysqli_query($db,"SET AUTOCOMMIT=0");
mysqli_query($db, "BEGIN");
mysqli_query($db, "SET FOREIGN_KEY_CHECKS = 0");
fgets($handle); //ignore the header line of csv file
while (($data = fgetcsv($handle, 100000, ';')) !== FALSE):
//if $notafiscal lower than 1, ignore the record
$notafiscal = $data[0];
if ($notafiscal < 1):
continue;
else:
$serie = trim($data[1]);
$data_emissao = converteDataBR($data[2]);
$cond_pagamento = trim(addslashes($data[3]));
//...
$valor_total = trim(moeda($data[24]));
//check if the $notafiscal already exist, if so, update, else, insert into table
$query = "SELECT * FROM venda WHERE notafiscal = ". $notafiscal ;
$rs = mysqli_query($db, $query);
if (mysqli_num_rows($rs) > 0):
//UPDATE TABLE
else:
//INSERT INTO TABLE
endif;
endwhile;
mysqli_query($db,"COMMIT");
mysqli_query($db,"SET AUTOCOMMIT=1");
mysqli_query($db,"SET FOREIGN_KEY_CHECKS = 1");
mysqli_close($db);