Suggestion for using php array? - php

Let's say I have 100k records in a table. After fetching those records, I push them into an array with some calculations and then send them to a server for further processing.
I have tested the scenario with 1k records and it works perfectly, but I am worried about performance, because the page that fetches the records from the DB and does the calculations runs every 2 minutes.
My question is: can I use an array for more than 2 million records?

There's no fixed limit on how much data an array can hold; the limit is the server's memory and PHP's memory_limit setting.
Why would you push 100k records into an array? You know databases have sorting and limiting for that reason!

My question is: can I use an array for more than 2 million records?
Yes, you can; 2 million array entries is not a limit for PHP arrays. The array limit depends on the memory that is available to PHP.
ini_set('memory_limit', '320M');
$moreThan2Million = 2000001;
// range(1, N) yields exactly N entries; range(0, N) would yield N + 1
$array = range(1, $moreThan2Million);
echo count($array); # 2000001, i.e. $moreThan2Million
You wrote:
The page is scheduled and run after 2 min, so I am worrying about the performance issue.
And:
But I need to fetch all, not 100 at time, and send them to server for further processing.
Performance for array operations is dependent on processing power. With a fast enough computer, you should not run into any problems. However, keep in mind that PHP is an interpreted language and therefore considerably slower than compiled binaries.
If you need to run the same script every 2 minutes but its runtime is longer than two minutes, you can distribute script execution over multiple computers. That way one process does not eat the CPU and memory resources of another, and one run can finish its work while the next run starts on an additional box.
Edit
Good answer, but can you write your consideration, about how much time the script will need to complete, if the there is no issue with the server processor and RAM.
That depends on the size of the array, the amount of processing each entry needs (in relation to the overall size of the array) and, naturally, the processor power and the amount of RAM. None of these are specified in your question, so the only specific thing I can say is that the runtime is unspecified as well. You'll need to test this on your own and build metrics for your application by profiling it.
I have 10GB RAM and More than 8 Squad processor.
For example you could do a rough metric for 1, 10, 100, 1000, 10000, 100000 and 1 million entries to see how your (unspecified) script scales on that computer.
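A rough metric of that kind could be sketched as follows. This is only an illustration: processEntry() is a hypothetical stand-in for the real per-record calculation, and the sizes stop at 100,000 to keep the run short.

```php
<?php
// Rough scaling metric: time a per-entry calculation for growing array sizes.
// processEntry() is a hypothetical stand-in; replace it with the real calculation.
function processEntry(int $value): int
{
    return ($value * 31) % 97;
}

function measure(int $size): array
{
    $start = microtime(true);
    $data = range(1, $size);
    $result = [];
    foreach ($data as $value) {
        $result[] = processEntry($value);
    }
    return [
        'size' => $size,
        'seconds' => microtime(true) - $start,
        'peak_mb' => memory_get_peak_usage(true) / 1048576,
        'count' => count($result),
    ];
}

foreach ([1, 10, 100, 1000, 10000, 100000] as $size) {
    $m = measure($size);
    printf("%7d entries: %.4f s, peak %.1f MB\n", $m['size'], $m['seconds'], $m['peak_mb']);
}
```

Plotting seconds against size shows whether the script scales roughly linearly before you try it with millions of entries.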
I am sending this array to another page for further processing.
Also measure the amount of data you send between computers and how much bandwidth you have available for communication over the wire.

Let say i have 100k records in table, after fetching that records from table i am pushing it to an array with some filters.
Filters? Can't you just write a query that implements those filters instead? A database (depending on vendor) isn't just a data store, it can do calculations and most of the time it's much quicker than transferring the data to PHP and doing the calculations there. If you have a database in, say, PostgreSQL, you can do pretty much everything you've ever wanted with plpgsql.

Related

LAMP / Laravel - Report generation maxing out single CPU

So I have developed a report generation system in Laravel. We are using php 7 (opcache enabled) / apache / mysql / on a centos 7 box. With one report, grabbing all the information ends up taking about 15 seconds but then I have to loop through and do a bunch of filtering on Collections etc etc. I have optimized this from top to bottom for about a week and have got the entire report generation to take about 45 seconds (dealing with multiple tables with greater than 1 million entries). This maxes out my CPU until its done of course.
My issue is that when we pushed it live to the client, their CPU was not up to the task. They have 4 CPUs @ 8 cores each @ 2.2 GHz. However, since PHP runs as a single process, it only uses one CPU and maxes it out, and since that CPU is so slow, the report takes closer to 10 minutes to run.
Is there any way to get apache / php / linux ...whatever....to use all 4 cpu's for a single php process? The only other option is to tell the client they need a better server....not an option. Please help.
So I stopped trying to find a way to have the server handle my code better and found a few ways to optimize my code.
First off, I used the collection groupBy() method to group my collection so that I had a bunch of sub-arrays with the id as key. When I looped through these I just grabbed that sub-array instead of using the collection's filter() method, which is REALLY slow when dealing with this many items. That saved me a LOT of processing power.
Secondly, every time I used a sub-array I removed it from the main array, so the array became smaller and smaller on each pass through the foreach.
These optimizations ended up saving me a LOT of processing power and now my reports run fine. After days of searching for a way to let PHP handle parallel processing, I have come to the conclusion that it's simply not possible to spread a single PHP process across CPUs.
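The two optimizations can be sketched with plain PHP arrays. This is not the Laravel Collection code from the report; groupById() and the sample rows are hypothetical, and the sketch only shows the idea of grouping once and dropping each group after use.

```php
<?php
// Index rows by id once, then look each group up directly
// instead of filtering the full list again for every id.
function groupById(array $rows, string $key): array
{
    $groups = [];
    foreach ($rows as $row) {
        $groups[$row[$key]][] = $row;
    }
    return $groups;
}

// Hypothetical sample data standing in for the report rows.
$rows = [
    ['customer_id' => 1, 'amount' => 10],
    ['customer_id' => 2, 'amount' => 5],
    ['customer_id' => 1, 'amount' => 7],
];

$groups = groupById($rows, 'customer_id');
$totals = [];
foreach (array_keys($groups) as $id) {
    $totals[$id] = array_sum(array_column($groups[$id], 'amount'));
    unset($groups[$id]); // release each group once it has been used
}
print_r($totals);
```

Grouping is a single pass over the rows, whereas filtering per id rescans every row for every id.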
Hope this helps.

Zend Lucene exhausts memory when indexing

An oldish site I'm maintaining uses Zend Lucene (ZF 1.7.2) as its search engine. I recently added two new tables to be indexed, together containing about 2000 rows of text data ranging between 31 bytes and 63kB.
The indexing worked fine a few times, but after the third run or so it started terminating with a fatal error due to exhausting its allocated memory. The PHP memory limit was originally set to 16M, which was enough to index all other content, 200 rows of text at a few kilobytes each. I gradually increased the memory limit to 160M but it still isn't enough and I can't increase it any higher.
When indexing, I first need to clear the previously indexed results, because the path scheme contains numbers which Lucene seems to treat as stopwords, returning every entry when I run this search:
$this->index->find('url:/tablename/12345');
After clearing all of the results I reinsert them one by one:
foreach ($urls as $v) {
    $doc = new Zend_Search_Lucene_Document();
    $doc->addField(Zend_Search_Lucene_Field::UnStored('content', $v['data']));
    $doc->addField(Zend_Search_Lucene_Field::Text('title', $v['title']));
    $doc->addField(Zend_Search_Lucene_Field::Text('description', $v['description']));
    $doc->addField(Zend_Search_Lucene_Field::Text('url', $v['path']));
    $this->index->addDocument($doc);
}
After about a thousand iterations the indexer runs out of memory and crashes. Strangely doubling the memory limit only helps a few dozen rows.
I've already tried adjusting the MergeFactor and MaxMergeDocs parameters (to values of 5 and 100 respectively) and calling $this->index->optimize() every 100 rows but neither is providing consistent help.
Clearing the whole search index and rebuilding it seems to result in a successful indexing most of the time, but I'd prefer a more elegant and less CPU intensive solution. Is there something I'm doing wrong? Is it normal for the indexing to hog so much memory?
I had a similar problem for a site I had to maintain that had at least three different languages and had to re-index the same 10'000+ (and growing) localized documents for each different locale separately (each using their own localized search engine). Suffice to say that it failed usually within the second pass.
We ended up implementing an Ajax based re-indexing process that called the script a first time to initialize and start re-indexing. That script aborted at a predefined number of processed documents and returned a JSON value indicating if it was completed or not, along with other progress information. We then re-called the same script again with the progress variables until the script returned a completed state.
This also allowed us to show a progress bar for the process in the admin area.
For the cron job, we simply made a bash script doing the same task but with exit codes.
This was about 3 years ago and nothing has failed since then.
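A minimal sketch of that resumable approach, under assumptions: reindexBatch(), the field names in the returned state, and the batch size of 10 are all hypothetical, and the while loop stands in for the repeated Ajax calls (or the bash loop used for cron).

```php
<?php
// Process one batch and report progress so the caller can call again with next_offset.
function reindexBatch(array $documents, int $offset, int $batchSize, callable $indexOne): array
{
    $batch = array_slice($documents, $offset, $batchSize);
    foreach ($batch as $document) {
        $indexOne($document);
    }
    $next = $offset + count($batch);
    return [
        'completed' => $next >= count($documents),
        'next_offset' => $next,
        'total' => count($documents),
    ];
}

// Driver loop standing in for the repeated calls from the browser or cron.
$documents = range(1, 25);          // hypothetical document ids
$indexed = [];
$state = ['completed' => false, 'next_offset' => 0];
while (!$state['completed']) {
    $state = reindexBatch($documents, $state['next_offset'], 10, function ($doc) use (&$indexed) {
        $indexed[] = $doc;          // a real handler would add the document to the index
    });
    echo json_encode($state), "\n";
}
```

In the setup described above each call is a separate request, so PHP's memory is released between batches; the in-process loop here only demonstrates the progress protocol.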

How can I make my curl-based URL monitoring service lightweight to run?

I'm currently building a user panel which will scrape daily information using curl. For each URL it will INSERT a new row to the database. Every user can add multiple URLs to scrape. For example: the database might contain 1,000 users, and every user might have 5 URLs to scrape on average.
How do I run the curl scraping: by a cron job once a day at a specific time? Will a single dedicated server handle this without lag? Are there any techniques to reduce the server load? And about MySQL databases: with 5,000 new rows a day the database will be huge after a single month.
In case you wonder, I'm building a statistics service which will show the daily growth of their pages (not talking about traffic), so as I understand it I need to insert a new value per user per day.
Any suggestions will be appreciated.
5000 x 365 is only 1.8 million... nothing to worry about for the database. If you want, you can stuff the data into mongodb (need 64bit OS). This will allow you to expand and shuffle loads around to multiple machines more easily when you need to.
If you want to run curl non-stop until it is finished from a cron, just "nice" the process so it doesn't use too many system resources. Otherwise, you can run a script which sleeps a few seconds between each curl pull. If each scrape takes 2 seconds, that would allow you to scrape 43,200 pages per 24-hour period. If you slept 4 seconds after each 2-second pull, that would let you do 14,400 pages per day (5k is about 35% of 14.4k, so you should be done in a little over 8 hours with a 4-second sleep between 2-second scrapes).
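The throughput arithmetic can be checked with a small sketch. The function names are hypothetical; the 2-second scrape and 4-second sleep are the figures from the answer.

```php
<?php
// Pages per day and total runtime for a given scrape time and sleep between scrapes.
function pagesPerDay(float $scrapeSeconds, float $sleepSeconds): int
{
    return (int) floor(86400 / ($scrapeSeconds + $sleepSeconds));
}

function hoursFor(int $pages, float $scrapeSeconds, float $sleepSeconds): float
{
    return $pages * ($scrapeSeconds + $sleepSeconds) / 3600;
}

echo pagesPerDay(2, 0), "\n";                  // 43200
echo pagesPerDay(2, 4), "\n";                  // 14400
printf("%.1f hours\n", hoursFor(5000, 2, 4));  // 8.3 hours
```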
This seems very doable on a minimal VPS machine for the first year, at least for the first 6 months. Then, you can think about utilizing more machines.
(edit: also, if you want you can store the binary GZIPPED scraped page source if you're worried about space)
I understand that each customer's pages need to be checked at the same time each day to make the growth stats accurate. But, do all customers need to be checked at the same time? I would divide my customers into chunks based on their ids. In this way, you could update each customer at the same time every day, but not have to do them all at once.
For the database size problem I would do two things. First, use partitions to break up the data into manageable pieces. Second, if the value did not change from one day to the next, I would not insert a new row for the page. In my processing of the data, I would then extrapolate for presentation the values of the data. UNLESS all you are storing is small bits of text. Then, I'm not sure the number of rows is going to be all that big a problem if you use proper indexing and pagination for queries.
Edit: adding a bit of an example
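function do_curl($start_index, $stop_index) {
    // Cast to int so the interpolated values cannot inject SQL
    $start_index = (int) $start_index;
    $stop_index = (int) $stop_index;
    // Do query here to get all pages with ids between start index and stop index
    $query = "select * from db_table where id >= $start_index and id <= $stop_index";
    for ($i = $start_index; $i <= $stop_index; $i++) {
        // do curl here
    }
}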
urls would look roughly like
http://xxx.example.com/do_curl?start_index=1&stop_index=10;
http://xxx.example.com/do_curl?start_index=11&stop_index=20;
The best way to deal with the growing database size is to perhaps write a single cron script that would generate the start_index and stop_index based on the number of pages you need to fetch and how often you intend to run the script.
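Generating those ranges could look roughly like this. It is a sketch that assumes page ids are contiguous starting at 1; indexRanges() is a hypothetical helper, and the /do_curl path simply mirrors the example URLs above.

```php
<?php
// Split ids 1..$totalPages into inclusive [start_index, stop_index] ranges.
function indexRanges(int $totalPages, int $chunkSize): array
{
    if ($chunkSize < 1) {
        throw new InvalidArgumentException('chunkSize must be at least 1');
    }
    $ranges = [];
    for ($start = 1; $start <= $totalPages; $start += $chunkSize) {
        $ranges[] = [
            'start_index' => $start,
            'stop_index' => min($start + $chunkSize - 1, $totalPages),
        ];
    }
    return $ranges;
}

foreach (indexRanges(25, 10) as $range) {
    // http_build_query produces "start_index=1&stop_index=10" and so on
    echo '/do_curl?' . http_build_query($range), "\n";
}
```

The chunk size would come from the total number of pages divided by how many runs the cron schedule allows per day.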
Use multi curl, and properly optimise your database design rather than simply normalising it. If I were to run this cron job, I would spend time studying whether it is possible to do this in chunks. Regarding hardware, start with an average configuration, keep monitoring it, and increase CPU or memory as needed. Remember, there is no silver bullet.

Optimizing mysql / PHP based website | 300 qps

Hey,
I currently have 300+ qps on my MySQL. There are roughly 12000 UIP a day / no cron on fairly heavy PHP websites. I know it's pretty hard to judge whether that is OK without seeing the website, but do you think it is total overkill?
What is your experience? If I optimize the scripts, do you think I would be able to get a substantially lower qps? I mean, if I get to 200 qps that won't help me much. Thanks
currently have over 300+ qps on my mysql
Your website can run on a Via C3, good for you !
do you think that it is a total overkill?
That depends if it's
1 page/s doing 300 queries, yeah you got a problem.
30-60 pages/s doing 5-10 queries each, then you got no problem.
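The two cases above are the same 300 qps divided by different page rates, which a tiny sketch makes explicit (queriesPerPage() is a hypothetical helper):

```php
<?php
// Average queries per page = queries per second / dynamic pages served per second.
function queriesPerPage(float $qps, float $pagesPerSecond): float
{
    return $qps / $pagesPerSecond;
}

echo queriesPerPage(300, 1), "\n";   // 300
echo queriesPerPage(300, 30), "\n";  // 10
echo queriesPerPage(300, 60), "\n";  // 5
```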
12000 UIP a day
We had a site with 50-60.000, and it ran on a Via C3 (your toaster is a datacenter compared to that crap server) but the torrent tracker used about 50% of the cpu, so only half of that tiny cpu was available to the website, which never seemed to use any significant fraction of it anyway.
What is your experience?
If you want to know if you are going to kill your server, or if your website is optimized, the following has close to zero information content:
UIP (unless you get facebook-like numbers)
queries/s (unless you're above 10.000) (I've seen a cheap dual core blast 20.000 qps using postgres)
But the following is extremely important :
dynamic pages/second served
number of queries per page
time duration of each query (ALL OF THEM)
server architecture
vmstat, iostat outputs
database logs
webserver logs
database's own slow_query, lock, and IO logs and statistics
You're not focusing on the right metric...
I think you are missing the point here. Whether 300+ qps is too much depends heavily on the website itself, on the number of users per second that visit the website, on the background scripts that are running concurrently, and so on. You should be able to test and/or compute an average query throughput for your server, to understand whether 300+ qps is fair or not. And, by the way, it depends on what these queries are asking for (a couple of fields, or a large amount of binary data?).
Surely, if you optimize the scripts and/or reduce the number of queries, you can lower the load on the database, but without having specific data we cannot properly answer your question. To lower a 300+ qps load to under 200 qps, you should on average lower your total queries by at least 1/3rd.
Optimizing a script can do wonders. I've taken scripts that took 3 minutes before to .5 seconds after simply by optimizing how the calls were made to the server. That is an extreme situation, of course. I would focus mainly on minimizing the number of queries by combining them if possible. Maybe get creative with your queries to include more information in each hit.
And going from 300 to 200 qps is actually a huge improvement. That's a 33% drop in traffic to your server... that's significant.
You should not focus on the script, focus on the server.
You are not saying whether these 300+ queries are causing issues. If your server is not dead, there is no reason to lower the amount. And if you have already done optimization, you should focus on the server. Upgrade it or buy more servers.

Saving data to a file vs. saving it to MySQL DB

Using a PHP script I need to update a number every 5 seconds while somebody is on my page. So let's say I have 300 visitors, each one spending about 1 minute on the page and every 5 seconds they stay on the page the number will be changed...which is a total of 3600 changes per minute. I would prefer to update the number in my MySQL database, except I'm not sure if it's not too inefficient to have so many MySQL connections (just for the one number change), when I could just change the number in a file.
P.S.: I have no idea whether 3600 connections/minute is a high number or not, but what about this case in general, considering an even higher number of visitors? What is the most efficient way to do this?
Doing 3,600 reads and writes per minute against the same file is just out of the question. It's complicated (you need to be extremely careful with file locking), it's going to have awful performance, and sooner or later your data will get corrupted.
DBMSs like MySQL are designed for concurrent access. If they can't cope with your load, a file won't do it better.
It will fail eventually if the user count grows, but the performance depends on your server setup and other tasks that are related to this update.
You can do a quick test: open up 300 persistent connections to your database and fire as many queries as you can in a minute.
If you don't need it to be transactional (the order of executed queries is not important), then I suggest you use memcached (or redis if you need to save stuff on disk) for this instead.
If you save to file, you have to solve concurrency issues (and all but the currently reading/writing process will have to wait). The db solves this for you. For better performance you could use memcached.
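To show what "solving concurrency" for a file involves, here is a minimal sketch of a locked counter update. It is an illustration of the extra work, not a recommendation; incrementCounterFile() and the temporary path are hypothetical.

```php
<?php
// Increment a counter stored in a file while holding an exclusive lock.
function incrementCounterFile(string $path): int
{
    $handle = fopen($path, 'c+');       // create if missing, do not truncate
    if ($handle === false) {
        throw new RuntimeException("Cannot open $path");
    }
    try {
        if (!flock($handle, LOCK_EX)) { // blocks until other writers release the lock
            throw new RuntimeException("Cannot lock $path");
        }
        $current = (int) stream_get_contents($handle);
        $next = $current + 1;
        ftruncate($handle, 0);
        rewind($handle);
        fwrite($handle, (string) $next);
        fflush($handle);                // flush before the lock is released
        flock($handle, LOCK_UN);
        return $next;
    } finally {
        fclose($handle);
    }
}

$path = tempnam(sys_get_temp_dir(), 'counter'); // hypothetical location
incrementCounterFile($path);
echo incrementCounterFile($path), "\n"; // 2
unlink($path);
```

Every request has to wait for the lock in turn, which is the serialisation cost described above.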
Maybe you could do without this "do every 5s for each user" by another means (e.g. saving current time and subtracting next time the user does something). This depends on your real problem.
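The timestamp idea can be sketched as follows, assuming the page stores a Unix timestamp when the visitor arrives and the 5-second interval from the question; ticksBetween() is a hypothetical helper.

```php
<?php
// Derive how many 5-second ticks passed between two timestamps
// instead of writing an update every 5 seconds.
function ticksBetween(int $enteredAt, int $leftAt, int $tickSeconds = 5): int
{
    return intdiv(max(0, $leftAt - $enteredAt), $tickSeconds);
}

$enteredAt = 1700000000;          // hypothetical Unix timestamp stored on page load
$leftAt = $enteredAt + 60;        // next request from the same visitor, one minute later
echo ticksBetween($enteredAt, $leftAt), "\n"; // 12
```

This turns twelve writes per visitor-minute into one write when the visitor next does something.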
Don't even think about trying to handle this with files. It's just not going to work unless you build a lock queue manager, and if you're going to all that trouble you might as well use the daemon to manage the value rather than just queue locks.
Using a DBMS is the simplest approach.
For a more efficient but massively more esoteric approach, write a single-threaded socket server daemon and have the clients connect to that. (there's a lib here for doing the socket handling, and there's a PEAR class for running PHP as a daemon)
Files aren't transactional and you don't want to lose count, so the database is the way to go.
memcached's incr command is faster than the database and was the basis of, I think, one really fast view-counting setup.
If you use, say, a key per hour and switch so that when a page view happens an incr on page:time occurs, you can have a background process collect the counts from the past hour and insert them into a database. If memcached fails you might lose the count for that hour, but you will not have double counted or missed any, and keeping counts per period gives interesting statistics.
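The key-per-hour scheme might look like this in outline. CounterStore is a hypothetical in-memory stand-in so the sketch runs without a memcached server; a real setup would call Memcached::increment() instead, and the key format is an assumption.

```php
<?php
// In-memory stand-in for memcached; a real setup would call Memcached::increment().
class CounterStore
{
    private array $counts = [];

    public function increment(string $key): int
    {
        $this->counts[$key] = ($this->counts[$key] ?? 0) + 1;
        return $this->counts[$key];
    }

    // Return and remove a counter, as the background collector would.
    public function take(string $key): int
    {
        $value = $this->counts[$key] ?? 0;
        unset($this->counts[$key]);
        return $value;
    }
}

// Build a "page:time" key with one bucket per hour (UTC).
function hourKey(string $page, int $timestamp): string
{
    return $page . ':' . gmdate('YmdH', $timestamp);
}

$store = new CounterStore();
$now = 1700000000;                                   // hypothetical "current" time
$store->increment(hourKey('home', $now));
$store->increment(hourKey('home', $now + 10));
$store->increment(hourKey('home', $now + 3600));     // falls into the next hour's key

// Background job: collect the finished hour and write it to the database.
echo $store->take(hourKey('home', $now)), "\n";      // 2
```

Because each hour has its own key, the collector only touches counters that are no longer being incremented.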
Using a dedicated temporary file will certainly be the most efficient disk access you can have. However, you will not be protected from concurrent access to the file if your server uses multiple threads or processes. If what you want to do is update one number per user, then using a $_SESSION sub-variable will work. Note that PHP's default session handler stores sessions in files, so this is only memory-backed if the server is configured with an in-memory session handler such as memcached. Then you can easily store this number in your database every 5 minutes per user.
