I was wondering if it would be possible to write a script in PHP that would work through an extremely large data set (100 million+ records) to locate specific strings within it.
If it is feasible, would it be an efficient way of identifying a keyword within the dataset?
If there is a better way of processing such a large dataset to detect a string, I am all ears.
Well, like Jari said, everything is possible in programming.
I deal with large data via Hadoop, MapReduce etc.
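If the data lives in one big flat file, here is a minimal sketch of the simplest PHP-only approach: stream the file line by line so memory use stays flat regardless of file size. The file name and keyword are placeholders.

<?php
// Minimal sketch: scan a very large file for a keyword without loading
// the whole file into memory. The path and keyword are just examples.
$keyword = 'needle';
$matches = 0;

$handle = fopen('huge_dataset.txt', 'r');
while (($line = fgets($handle)) !== false) {
    if (strpos($line, $keyword) !== false) {
        $matches++;           // or record the line / offset / record id here
    }
}
fclose($handle);

echo "Found '$keyword' on $matches line(s)\n";

That keeps memory constant, but it still reads every byte on every run; for repeated searches over the same data, an indexed store (database, search engine, or the Hadoop route above) will scale much better.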
I'm starting an Incident Tracking System for IT, and it's likely to be my first PHP project.
I've been designing it in my mind based on software I've seen like vBulletin, and I'd like it to have i18n and editable styles.
So my first question goes here:
What is the best method to store these things, knowing they will likely be static? I've been thinking about getting the file content with PHP, showing it in a text editor, and when it is saved, replacing the old one (making a copy if it has never been edited before, so we keep the "original").
I think this would be considerably faster than using MySQL to store the language / style.
What about security here? Should I create a .htaccess file to ask for a password on this folder?
I know how to do the replacement with a foreach over an array fetched from the database and str_replace($name, $value, $file), but if I store the language in a file, can I build an associative array from its content (like JSON)?
Thanks a lot and sorry for so many questions, I'm a newbie.
This is what I'm doing in my CMS:
For each plugin/program/entity (you name it) I develop, I create a /translations folder.
I put all my translations there, named like el.txt, de.txt, uk.txt etc. for all languages.
I store the translation data in JSON, because it's easy to write to, easy to read from, and easiest for everyone to contribute theirs.
Files can easily be UTF-8 encoded without involving the database, making it possible to read them directly in file mode (just JSON-parse them).
On installation of such plugins, I just loop through all translations and put them in the database, one language per table row (e.g. a data column of TEXT datatype).
For each page render I query the database once for the row of the selected language, call json_decode() on the whole result, and then put it in $_SESSION so that subsequent requests get flash-speed translated strings for the currently selected language.
The whole thing was developed with both performance and compatibility in mind.
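A rough sketch of that render-time step, assuming a translations table with lang and data columns (the table, column names and connection details here are made up):

<?php
// Rough sketch only: table name "translations" and columns "lang"/"data"
// are assumptions, as are the connection details.
session_start();
$lang = isset($_SESSION['lang']) ? $_SESSION['lang'] : 'en';

if (!isset($_SESSION['i18n'][$lang])) {
    $pdo  = new PDO('mysql:host=localhost;dbname=cms', 'user', 'pass');
    $stmt = $pdo->prepare('SELECT data FROM translations WHERE lang = ?');
    $stmt->execute(array($lang));

    // One row per language, JSON in the data column -> associative array.
    $_SESSION['i18n'][$lang] = json_decode($stmt->fetchColumn(), true);
}

$t = $_SESSION['i18n'][$lang];   // e.g. echo $t['welcome_message'];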
The benefit of storing on the HDD vs. the DB is that backups won't waste as much space: once a file has been backed up, it doesn't take up tape again the next day, whereas a DB gets fully backed up every day and takes up an increasing amount of space. The downside of writing to disk is that it increases the chance of somebody uploading something malicious and being clever enough to figure out how to execute it. You just need to be more careful, that's all.
Yes, use .htaccess to limit any action on a writable folder. Good job thinking ahead about that risk.
Your approach sounds like a good strategy.
Good luck.
I am doing a small website project. On one page there is a section where the client posts new updates; at any given time there will be a maximum of 5 to 6 posts in this section. I was going to create a MySQL database for the content, but I wonder if there is any way I could keep all the entries as XML files and use PHP to parse them. Is that possible?
Which is the better option, MySQL or XML?
XML is a horrid piece of crap in my opinion. It's bloated and rather unpleasant to work with. However, it is a viable option as long as your number of entries and the amount of traffic stay small.
You can use SimpleXML to parse the XML, but performance is going to degrade as the file size increases. MySQL, however, will handle quite a lot of data before performance becomes a concern, provided the schema is properly set up.
If you do use XML, you could always use a half-way XML solution: parse the file once, then store a serialized array of it.
Though really, if you're going to store it in a file of some sort, I would suggest, in order: SQLite, serialized array, JSON, XML. (Depending on your situation that order may change.)
If you abstract away the low level details enough, you should be able to make adapters that can be used interchangeably, thus allowing you to easily switch out storage backends. (On a large project, that would likely be unfeasible, but it sounds like your data storage/retrieval will remain fairly simple.)
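As a sketch of the half-way approach mentioned above, assuming the XML holds simple <post> elements with <title> and <body> children (the file names and element names are placeholders): parse it once with SimpleXML and reuse a serialized cache until the XML file changes.

<?php
// Sketch: rebuild the cache only when posts.xml is newer than the cache
// file. File names and the <post><title>/<body> structure are assumptions.
$xmlFile   = 'posts.xml';
$cacheFile = 'posts.cache';

if (is_file($cacheFile) && filemtime($cacheFile) >= filemtime($xmlFile)) {
    $posts = unserialize(file_get_contents($cacheFile));
} else {
    $posts = array();
    foreach (simplexml_load_file($xmlFile)->post as $post) {
        $posts[] = array(
            'title' => (string) $post->title,
            'body'  => (string) $post->body,
        );
    }
    file_put_contents($cacheFile, serialize($posts));
}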
Possible Duplicate: Cache Object in PHP without using serialize
So I have built a rather large data structure which cannot easily be turned into a relational database format. Using this data structure, the requests I make are very fast, but it takes about 4-5 seconds to load into memory. What I want is to load it into memory once, have it sit there, and be able to quickly answer individual requests, which of course is not the normal request/response flow of the PHP scripts I normally write. Is there any good way to do this in PHP? (Again, no using a database; it has to use this specialized precomputed structure which takes a long time to load into memory.)
EDIT: This tutorial kind of gives what I want, but it is pretty complicated and I was hoping someone would have a more elegant solution. As he says in the tutorial, the whole problem is that PHP is naturally stateless.
You absolutely must do something like what your linked tutorial proposes.
No PHP state persists between requests. This is by design.
Thus you will need some kind of separate long-running process and some kind of IPC method, or else a better data structure that you can load piecemeal.
If you really can't put this into a relational database (such as SQLite, which doesn't have to run as a separate server process), explore using some other kind of database, such as a file-based key-value store.
Note that it is extremely unlikely that any long-running process you write, in any language, will be faster, easier, or better than getting this data structure of yours into a real database, relational or otherwise! Get your data structure into a database! It's the easiest among your possible paths.
Another thing you can do is just make loading your data structure as quick as possible. You can serialize it to a file and then deserialize the file; if that is not fast enough you can try igbinary, which is a much-faster-than-standard-php serializer.
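A small sketch of that last idea, falling back to plain serialize() when the igbinary extension isn't loaded; build_structure() and the cache path are hypothetical stand-ins for however the structure is actually computed:

<?php
// Sketch: cache the precomputed structure on disk. build_structure() and
// the cache path are hypothetical stand-ins for your own code.
$cacheFile   = '/tmp/structure.cache';
$useIgbinary = extension_loaded('igbinary');

if (is_file($cacheFile)) {
    $raw  = file_get_contents($cacheFile);
    $data = $useIgbinary ? igbinary_unserialize($raw) : unserialize($raw);
} else {
    $data = build_structure();   // the slow 4-5 second build step
    $raw  = $useIgbinary ? igbinary_serialize($data) : serialize($data);
    file_put_contents($cacheFile, $raw);
}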
Possible Duplicate: Least memory intensive way to read a file in PHP
I have a problem with speed vs. memory usage.
I have a script which needs to be able to run very quickly. All it does is load multiple files of 1-100MB each, consisting of a list of values, and check how many of these exist in another list.
My preferred way of doing this is to load the values from the file into an array (explode), then loop through this array and check whether each value exists using isset.
The problem I have is that there are too many values and it uses up >10GB of memory (I don't know why it uses so much). So I have resorted to loading the values from the file into memory a few at a time, instead of just exploding the whole file. This cuts the memory usage right down, but is VERY slow.
Is there a better method?
Code Example:
$check = array('lots', 'of', 'values', 'here');
$check = array_flip($check);   // value => index, so isset() lookups are O(1)

$values = explode('|', file_get_contents('bigfile.txt'));

$matches = 0;
foreach ($values as $value) {
    if (isset($check[$value])) {
        $matches++;
    }
}
Maybe you could code your own C extension of PHP (see e.g. this question), or code a small utility program in C and have PHP run it (perhaps using popen)?
This seems like a classic use case for some form of key/value-oriented NoSQL datastore (MongoDB, CouchDB, Riak), or maybe even just a large memcache instance.
Assuming you can load the large data files into the datastore ahead of when you need to do the searching, and that you'll be using the data from the loaded files more than once, you should see some impressive gains (as long as your queries, map-reduce jobs, etc. aren't awful). Judging by the size of your data, you may want to look at a datastore which doesn't need to hold everything in memory to be quick.
There are plenty of PHP drivers (and tutorials) for each of the datastores mentioned above.
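For the memcache route specifically, here is a hedged sketch using the Memcached extension; the server address and the "val:" key prefix are assumptions, and note it counts each distinct check value once rather than counting duplicates from the file like the original loop does.

<?php
// Sketch of the memcache variant: stream the big pipe-separated file into
// memcached once, ahead of time, then check the smaller list against it.
// Server address and the "val:" key prefix are assumptions.
$mc = new Memcached();
$mc->addServer('127.0.0.1', 11211);

// One-off import, streamed so PHP memory stays low.
$fh = fopen('bigfile.txt', 'r');
while (($value = stream_get_line($fh, 1024, '|')) !== false) {
    $mc->set('val:' . $value, 1);
}
fclose($fh);

// Later, per run: count how many of the check values exist in the store.
$check   = array('lots', 'of', 'values', 'here');
$matches = 0;
foreach ($check as $value) {
    if ($mc->get('val:' . $value) !== false) {
        $matches++;
    }
}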
Open the files and read through them line by line. Maybe use MySQL, either for the import (LOAD DATA INFILE), for the resulting data, or both.
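If you stay in plain PHP, a minimal sketch of that streaming idea: keep the smaller flipped check list in memory and read the big file in pipe-delimited chunks, so the file never has to fit in memory at once (assuming individual values are well under 1 KB).

<?php
// Sketch: same counting logic as the question's example, but the big file
// is streamed in pipe-delimited chunks instead of being exploded at once.
$check = array_flip(array('lots', 'of', 'values', 'here'));

$matches = 0;
$fh = fopen('bigfile.txt', 'r');
while (($value = stream_get_line($fh, 1024, '|')) !== false) {
    if (isset($check[$value])) {
        $matches++;
    }
}
fclose($fh);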
It seems you need a proper search engine.
The Sphinx search server can be used to search through your values really fast.
Dropping my lurker status to finally ask a question...
I need to know how I can improve on the performance of a PHP script that draws its data from XML files.
Some background:
I've already mapped the bottleneck to CPU - but want to optimize the script's performance before taking a hit on processor costs. Specifically, the most CPU-consuming part of the script is the XML loading.
The reason I'm using XML to store object data is that the data needs to be accessible via a browser Flash interface, and we want to provide fast user access in that area. The project is still in its early stages though, so if best practice would be to abandon XML altogether, that would be a good answer too.
Lots of data: currently plotting roughly 100k objects, albeit usually small ones, and they must ALL be loaded by the script, with perhaps a few rare exceptions. The data set will only grow with time.
Frequent runs: Ideally, we'd run the script ~50k times an hour; realistically, we'd settle for ~1k/h runs. This, coupled with the data size, makes performance optimization imperative.
I've already taken the optimization step of making several runs on the same data rather than loading it for each run, but it's still taking too long. The runs should generally use "fresh" data reflecting the modifications made by users.
Just to clarify: is the data you're loading coming from XML files for processing in its current state and is it being modified before being sent to the Flash application?
It looks like you'd be better off using a database to store your data and generating XML as needed, rather than reading it in from XML first. If building the XML files gets slow, you could cache the files as they're generated to avoid redundant generation of the same file.
If the XML stays relatively static, you could cache it as a PHP array, something like this:
<xml><foo>bar</foo></xml>
is cached in a file as
<?php return array('foo' => 'bar');
It should be faster for PHP to just include the arrayified version of the XML.
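A hedged sketch of generating that cached file, using the common json_encode/json_decode round-trip as a quick-and-dirty SimpleXML-to-array conversion (the file names are placeholders):

<?php
// Sketch: convert the XML to a plain PHP array once and cache it as an
// includable file. File names are examples.
$cacheFile = 'objects.cache.php';

if (!is_file($cacheFile)) {
    $xml  = simplexml_load_file('objects.xml');
    $data = json_decode(json_encode($xml), true);   // SimpleXML -> array
    file_put_contents($cacheFile, '<?php return ' . var_export($data, true) . ';');
}

$objects = include $cacheFile;   // fast path on every later run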
~1k runs/hour against 3600 seconds per hour means a run roughly every 3.6 seconds, and 50k/hour would be around 14 runs per second...
There are many questions. Some of them are:
Does your PHP script need to read/process all records of the data source for each single run? If not, what kind of subset does it need (~size, criteria, ...)?
Same question for the Flash application: who sends it the data? The PHP script? A "direct" request for the complete, static XML file?
What operations are performed on the data source?
Do you need some kind of concurrency mechanism?
...
And just because you want to deliver XML data to the Flash clients, it doesn't necessarily mean that you have to store XML data on the server. If, for example, the clients only need a tiny subset of the available records, it is probably a lot faster not to store the data as XML but as something more suited to speed and "searchability", and then create the XML output for the subset on the fly, maybe assisted by some caching depending on what data the clients request and how much/how often the data changes.
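As a rough illustration of generating such an XML subset on demand, here is a sketch in which the table, columns and the "region" request parameter are all hypothetical:

<?php
// Rough illustration: pull only the requested subset from a database and
// emit XML for the Flash client on the fly. The table, columns and the
// "region" parameter are made-up placeholders.
$pdo  = new PDO('mysql:host=localhost;dbname=world', 'user', 'pass');
$stmt = $pdo->prepare('SELECT id, x, y FROM objects WHERE region = ?');
$stmt->execute(array($_GET['region']));

$xml = new SimpleXMLElement('<objects/>');
while ($row = $stmt->fetch(PDO::FETCH_ASSOC)) {
    $node = $xml->addChild('object');
    $node->addAttribute('id', $row['id']);
    $node->addAttribute('x', $row['x']);
    $node->addAttribute('y', $row['y']);
}

header('Content-Type: text/xml');
echo $xml->asXML();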
Edit: Let's assume that you really, really need the whole dataset and need a continuous simulation. Then you might want to consider a continuous process that keeps the complete "world model" in memory and operates on this model on each run (world tick). That way at least you wouldn't have to load the data on each tick. But such a process is usually written in something other than PHP.