How can I write data into a text file at the fastest speed? [closed] - php

It takes quite a long time to randomly generate 4 billion IP addresses and write them into a text file. Do you have a better idea for finishing this faster?
This is my code:
$ip_long = array(
array('607649792', '608174079'), //36.56.0.0-36.63.255.255
array('1038614528', '1039007743'), //61.232.0.0-61.237.255.255
array('1783627776', '1784676351'), //106.80.0.0-106.95.255.255
array('2035023872', '2035154943'), //121.76.0.0-121.77.255.255
array('2078801920', '2079064063'), //123.232.0.0-123.235.255.255
array('-1950089216', '-1948778497'), //139.196.0.0-139.215.255.255
array('-1425539072', '-1425014785'), //171.8.0.0-171.15.255.255
array('-1236271104', '-1235419137'), //182.80.0.0-182.92.255.255
array('-770113536', '-768606209'), //210.25.0.0-210.47.255.255
array('-569376768', '-564133889'), //222.16.0.0-222.95.255.255
);
$rand_key = mt_rand(0, 9);
$handle = fopen('ip_data.dat', 'a+');
for ($i=0; $i<4000000000; $i++) {
    $ip = long2ip(mt_rand($ip_long[$rand_key][0], $ip_long[$rand_key][1]));
    fwrite($handle, decbin( ip2long( $ip )) . "\r\n");
}

Because of the restricted ranges you identified, the total number of distinct values you can generate is considerably smaller than 4 billion (depending on the value of $rand_key, which is only evaluated once, it's never more than about 79*256*256 ~ 5M), so you are going to get lots of duplicates. That being the case, you will be much faster if you generate an array of strings, one for each valid IP address in the range, then pick a random string from that list and append it to a buffer string. Write the buffer when it reaches a typical block size, reset it to "", and repeat.
More importantly, I question how sensible it is to use decbin: it turns your IP into a string of ones and zeros, so a single IP address takes 32 bytes (plus \r\n, that's 34). Multiply by 4G and you have 120GB+. That's actually a lot more data than the 50 GB that @Jon was computing above...
If you store the IP address as a binary number instead, there will be just four bytes per number, and you might as well leave the CRLF off at that point. It will be faster to write and faster to read. So the suggestion becomes:
Create an array with the range of valid values (this is a range of integers; you would like to think of them as unsigned, but that is not a type PHP knows about)
Pick a random value from the array (random index)
Put the value pointed to by the random index into another array (of predefined size; 2048 elements works well)
Do a binary write of this array when you have filled it
Repeat
What you end up with is a file full of random binary numbers, each of which represents an IP address. It's very fast (although it still involves writing a 16GB file, but that's as small as it can get). A minimal sketch of this approach follows.
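Here is a minimal sketch of that idea, assuming one of the ranges from the question. The file name, the iteration count, and the ~8 KB flush size are illustrative only, and the 2048-element array is replaced by a string buffer, which amounts to the same thing in PHP because pack() returns raw bytes:
$start = ip2long('36.56.0.0');
$end   = ip2long('36.63.255.255');
$valid = range($start, $end);          // every valid address in the chosen block
$count = count($valid);
$handle = fopen('ip_data.bin', 'wb');
$buffer = '';
for ($i = 0; $i < 1000000; $i++) {     // 1M values here purely for illustration
    $buffer .= pack('N', $valid[mt_rand(0, $count - 1)]); // 4 bytes per address
    if (strlen($buffer) >= 8192) {     // flush in roughly 8 KB blocks
        fwrite($handle, $buffer);
        $buffer = '';
    }
}
fwrite($handle, $buffer);              // write whatever is left
fclose($handle);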

Since you're writing 50GB of data, your disk is most likely your bottleneck. A few suggestions:
Stop calling fwrite so often: build up roughly 1000 values, then fwrite them all at once; try the same with 10000 values and measure the performance (see the sketch after this list)
Use C... or better yet assembly
Buy a hard drive with a higher RPM, or solid-state storage
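A rough benchmark sketch of the first suggestion (the line count, the batch sizes, and the file names are arbitrary examples):
foreach (array(1000, 10000, 100000) as $batchSize) {
    $handle = fopen("bench_$batchSize.dat", 'w');
    $buffer = '';
    $begin  = microtime(true);
    for ($i = 1; $i <= 1000000; $i++) {
        $buffer .= decbin(mt_rand()) . "\r\n";
        if ($i % $batchSize === 0) {
            fwrite($handle, $buffer);   // one fwrite per $batchSize lines
            $buffer = '';
        }
    }
    fwrite($handle, $buffer);           // flush the remainder
    fclose($handle);
    printf("batch size %d: %.2f s\n", $batchSize, microtime(true) - $begin);
}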

Don't run heavy functions in loops. Appending to a variable and then writing once is much better. Note that the amount of data you write in your example is still extremely heavy, so I chose to treat 4 billion as just an "immense" number and reduced it, in order to really get the time differences to show while tweaking your functions.
$ip_long = array(
array('607649792', '608174079'), //36.56.0.0-36.63.255.255
array('1038614528', '1039007743'), //61.232.0.0-61.237.255.255
array('1783627776', '1784676351'), //106.80.0.0-106.95.255.255
array('2035023872', '2035154943'), //121.76.0.0-121.77.255.255
array('2078801920', '2079064063'), //123.232.0.0-123.235.255.255
array('-1950089216', '-1948778497'), //139.196.0.0-139.215.255.255
array('-1425539072', '-1425014785'), //171.8.0.0-171.15.255.255
array('-1236271104', '-1235419137'), //182.80.0.0-182.92.255.255
array('-770113536', '-768606209'), //210.25.0.0-210.47.255.255
array('-569376768', '-564133889'), //222.16.0.0-222.95.255.255
);
$rand_key = mt_rand(0, 9);
$ip = '';
for ($i=0; $i<40000; $i++) {
    $ip .= decbin( ip2long( long2ip(mt_rand($ip_long[$rand_key][0], $ip_long[$rand_key][1])))) . "\r\n";
}
$handle = fopen('ip_data.dat', 'a+');
fwrite($handle, $ip);

Related

Faster Way to Read File Line by Line?

In PHP, I use fopen(), fgets(), and fclose() to read a file line by line. It works well. But I have a script (run from the CLI) that has to process three hundred 5GB text files. That's approximately 3 billion fgets() calls. So it works well enough, but at this scale tiny speed savings add up extremely fast. So I'm wondering if there are any tricks to speed up the process?
The only potential thing I thought of was getting fgets() to read more than one line at once. It doesn't look like it supports that, but I could in theory do, let's say, 20 consecutive $line[] = fgets($file); calls and then process the array. That's not quite the same thing as reading multiple lines in one command, so it may not have any effect. But I know queuing my MySQL inserts and sending them as one giant insert (another trick I'm going to implement in this script after more testing and benchmarking) will save a lot of time.
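For what it's worth, that batching idea would look something like this (the 20-line batch size is the questioner's own example, and the file name is a placeholder):
$file = fopen('big.txt', 'r');
while (!feof($file)) {
    $lines = array();
    for ($i = 0; $i < 20 && ($line = fgets($file)) !== false; $i++) {
        $lines[] = $line;           // collect up to 20 lines per batch
    }
    // ... process $lines here ...
}
fclose($file);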
Update 4/13/19
Here is the solution I went with. Originally I had a much more complicated method of slicing off the end of each read, but then I realized it can be done much more simply.
$read_length = 131072; // 128 KB per read (see the note below)
$index_file = fopen("path/to/file", "r");
$chunk = "";
while ( !feof($index_file) )
{
    $chunk .= fread($index_file, $read_length);
    $payload_lines = explode("\n", $chunk);
    if ( !feof($index_file) )
    {   // put the (possibly partial) last line back into the read buffer
        $chunk = array_pop($payload_lines);
    }
    // ... process $payload_lines here ...
}
fclose($index_file);
Of course PHP has a function for everything. So I break every read into an array of lines and array_pop() the last item in the array back to the beginning of the 'read buffer'. That last item is probably a partial line, but not necessarily. Either way, it goes back in and gets processed with the next loop (unless we're done with the file, in which case we don't pop it).
The only thing you have to watch out for is a line so long that a single read won't capture the whole thing. But if you know your data, that probably won't be a hassle. In my case, I'm parsing a JSON-ish file and reading 128 KB at a time, so there are always many line breaks in each read.
Note: I settled on 128 KB by doing a million benchmarks and finding the size my server processes the absolute fastest. This parsing function will run 300 times, so every second I save cuts 5 minutes off the total runtime.
One possible approach that might be faster would be to read large chunks of the file in with fread(), split them by newlines, and then process the lines. You'd have to take into account that the chunks may sever lines, so you'd have to detect this and glue them back together.
Generally speaking, the larger the chunk you can read in one go, the faster your process should become, within the limits of your available memory.
From fread() docs:
Note that fread() reads from the current position of the file pointer. Use ftell() to find the current position of the pointer and rewind() to rewind the pointer position.

php read from text file and combine duplicate entries [closed]

I have this little piece of code I'm just testing out that basically redirects a user if their IP doesn't match a predefined IP and, if it doesn't match, writes that IP into a text file.
$file = fopen("ips.txt", "w");
if ($ip == "iphere") {
echo "Welcome";
fclose($file);
} else {
header('Location: http://www.google.com');
fwrite($file, "\n" . $ip);
if ($file) {
$array = explode("\n", fread($file, filesize("ips.txt")));
}
$result = print_r($array, TRUE);
fclose($file);
}
What I want to do is take the IPs that I'm writing to the text file, put them all into an array to find the duplicates, make note of the duplicates, filter them out, then write them back into that file or another txt file, but I'm stuck and not sure where to go from here.
I would suggest you use serialize or json_encode to store the IPs in a file; that way you could also add more info (how many times an IP has visited, last visit, etc.).
I'll show you a simple example.
1: Create some dummy IPs for testing.
$IPs = array(
'192.168.0.1' => array(
'visits' => 23,
'last' => '2015-07-20'
),
'192.168.0.2' => array(
'visits' => 32,
'last' => '2015-06-23'
)
);
So here we created an associative array with two IP addresses, each of which also stores a visit count and the last visit.
Save the file using PHP's serialize function or json_encode (I prefer the JSON format, because it can be used by other languages).
$for_save = json_encode($IPs); // OR serialize($IPs)
file_put_contents("FILE_NAME",$for_save); //Save the file with the IP's
Now it's time to read the file:
$file = file_get_contents("FILE_NAME");
$file = json_decode($file, true); // or unserialize($file);
Now we have the array to use as we wish: we can search for IPs using PHP's array functions and, of course, modify the information about each IP:
if(array_key_exists("YOUR_IP_HERE", $file)){
    // What to do if we have found the IP in the file, for example:
    $file["YOUR_IP_HERE"]['visits']++; // we add +1 visit for that IP
}
And now we can save the file again
$file = json_encode($file);
file_put_contents("IP_FILE_NAME",$file);
There are a couple of issues with this approach, around concurrency and performance. What happens if two people hit the webpage and write to the same file at the same time? Also, this file can grow to an unbounded size, which will be slow. You don't need to manually check all IPs, only whether one exists.
It might be better to use a database table for this. Otherwise, you'll need to handle file locking as well.
Pseudocode for the check_ips function (a rough sketch follows below):
Select * from ips where ip = ?. Check the user id.
If no result, insert the IP; it's unknown. (If needed, you can also add a constraint to the table to prevent duplicate IPs.)
Otherwise, the IP is known.
You can log counts, dates, last access, or other stats in the table as a calculated summary.
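A minimal PDO sketch of that check_ips idea (the DSN, the credentials, and the ips table layout are assumptions for illustration only):
function check_ips(PDO $pdo, $ip)
{
    $stmt = $pdo->prepare('SELECT id FROM ips WHERE ip = ?');
    $stmt->execute(array($ip));
    if ($stmt->fetch() === false) {
        // Unknown IP: insert it (a UNIQUE constraint on ips.ip guards against duplicates).
        $pdo->prepare('INSERT INTO ips (ip) VALUES (?)')->execute(array($ip));
        return false; // the IP was not known
    }
    return true;      // the IP is already known
}
$pdo = new PDO('mysql:host=localhost;dbname=stats', 'user', 'pass');
$known = check_ips($pdo, $_SERVER['REMOTE_ADDR']);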
You can easily do this by reading the file with the IPs into an array and then getting the unique values from the array, like this:
$ipList = file('ips.txt');
$ipUnique = array_unique($ipList);
Then you can save or parse $ipUnique for your purpose.

Modify a line with "KEY - AMOUNT" of a file in PHP

This has been bugging me for ages now, but I can't figure it out.
Basically I'm using a hit counter which stores unique IP addresses in a file. But what I'm trying to do is get it to count how many hits each IP address has made.
So instead of the file reading:
222.111.111.111
222.111.111.112
222.111.111.113
I want it to read:
222.111.111.111 - 5
222.111.111.112 - 9
222.111.111.113 - 41
This is the code I'm using:
$file = "stats.php";
$ip_list = file($file);
$visitors = count($ip_list);
if (!in_array($_SERVER['REMOTE_ADDR'] . "\n", $ip_list))
{
$fp = fopen($file,"a");
fwrite($fp, $_SERVER['REMOTE_ADDR'] . "\n");
fclose($fp);
$visitors++;
}
What i was trying to do is change it to:
if (!in_array($_SERVER['REMOTE_ADDR'] . " - [ANY NUMBER] \n", $ip_list))
{
$fp = fopen($file,"a");
fwrite($fp, $_SERVER['REMOTE_ADDR'] . " - 1 \n");
fclose($fp);
$visitors++;
}
else if (in_array($_SERVER['REMOTE_ADDR'] . " - [ANY NUMBER] \n", $ip_list))
{
CHANGE [ANY NUMBER] TO [ANY NUMBER]+1
}
I think I can figure out the last adding part, but how do I represent the [ANY NUMBER] part so that it finds the IP whatever the following number is?
I realise I'm probably going about this all wrong, but if someone could give me a clue I'd really appreciate it.
Thanks.
This is a bad idea; don't do it this way.
It's normal to store website statistics in the file system, but not with pre-aggregation applied.
If you are going to use the file system, then do post-aggregation on the data; otherwise, use a database.
What you are doing is a very bad idea
But let's first answer the actual question you are asking.
To be able to do that, you will have to parse the file into some kind of data structure that allows it. I'd personally recommend an array in the form IP => AMOUNT.
For example (untested code):
$fd = file($file);
$ip_list = array();
foreach ($fd as $line) {
    list($ip, $amount) = explode("-", $line);
    $ip_list[$ip] = $amount;
}
Note that this code is not perfect: it leaves a trailing space on $ip and a leading space on $amount because of the format of your original data. But it works well enough to point you in the right direction. A more accurate solution would involve regular expressions, or changing the original data source to a more convenient format.
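Building on that, a rough sketch of the remaining step (my own illustration, not part of the original answer): trim the parsed keys and values, bump the counter for the current visitor, and rewrite the file in the same "IP - count" format.
$counts = array();
foreach ($ip_list as $addr => $amount) {
    $counts[trim($addr)] = (int) trim($amount);   // normalize the parsed data
}
$ip = $_SERVER['REMOTE_ADDR'];
$counts[$ip] = isset($counts[$ip]) ? $counts[$ip] + 1 : 1;
$out = '';
foreach ($counts as $addr => $amount) {
    $out .= $addr . " - " . $amount . "\n";
}
file_put_contents($file, $out);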
Now the real answer to your actual problem
Your process will quickly become a performance bottleneck, as you would have to open up that file, process it, and write it all back again afterwards (I'm not sure you can do in-line editing of an open file) for every request.
As you are trying to do a per-IP hit count, there are a lot of better solutions to your problem:
Use an existing solution for it (like piwik)
Use an actual database for your data
Keep your file simple, with just a list of IPs, and post-process it offline periodically to turn it into the format you want
You can avoid writing that file altogether if you have access to your webserver's logs (and they are set up to log every request with the originating IP); you can post-process that log instead
in_array() simply does a basic string match; it will NOT look for substrings. Ignoring how bad an idea it is to use a flat file for data storage, what you want is preg_grep(), which allows you to use regexes:
$ip_list = file('ips.txt');
$matches = preg_grep('/^\d+\.\d+\.\d+\.\d+ - \d+$/', $ip_list);
of course, this is a very basic and very broken IP address match, and will not help you actually CHANGE the value in $ip_list, because you don't get the actual index(es) of the matched lines.

Encoding unique IDs to a maximum document size of 100kb

This is going to be a nice little brain-bender, I think. It is a real-life problem, and I am stuck trying to figure out how to implement it. I don't expect it to be a problem for years, and at that point it will be one of those "nice problems to have".
So, I have documents in my search engine index. The documents can have a number of fields; however, each field is limited to only 100kb.
I would like to store the IDs of the particular sites which have access to this document. The site ID count is low, so it is never going to get up into the extremely high numbers.
So, for example, this document can be accessed by sites which have an ID of 7 and 10:
Document: {
    docId: "1239",
    text: "Some Cool Document",
    access: "7 10"
}
Now, because the "access" field is limited to 100kb, that means that if you store consecutive IDs as space-separated decimals, only 18917 unique IDs fit.
Reference:
http://codepad.viper-7.com/Qn4N0K
<?php
$ids = range(1,18917);
$ids = implode(" ", $ids);
echo mb_strlen($ids, '8bit') / 1024 . "kb";
?>
// Output
99.9951171875kb
In my application, a particular site with site ID 7 tries to search, and it will have access to that "Some Cool Document".
So now, my question is: is there any way that I could somehow fit more IDs into that field?
I've thought about proper encoding, applying something like a Huffman tree, but seeing as each document has different IDs, it would be impossible to apply a single encoding set to every document.
Perhaps I could use something like tokenized Roman numerals?
Anyway, I'm open to ideas.
I should add that I want to keep all IDs in the same field for as long as possible. Searching over a second field would take a considerable performance hit, so I will only switch to using a second access2 field when I have milked the access field for as long as possible.
Edit:
Convert to Hex
<?php
function hexify(&$item){
$item = dechex($item);
}
$ids = range(1,21353);
array_walk( $ids, "hexify");
$ids = implode(" ", $ids);
echo mb_strlen($ids, '8bit') / 1024 . "kb";
?>
This raises the capacity to 21353 consecutive IDs.
So that is up about 12.8%.
Important Caveat
I think the fact that my fields can only store UTF encoded characters makes it next to impossible to get anything more out of it.
Where did 18917 come from? 100kb is a big number.
You have 100,000 or so bytes. Each byte can hold a value up to 255 if you store it as a number.
If you encode as hex, you'll get 100,000 ^ 16, which is a very large number, and that's just hex encoding.
What about base64? You stuff 3 bytes into a 4 byte space (a little loss), but you get 64 characters per character. So 100,000 ^ 64. That's a big number.
You won't have any problems. Just do a simple hex encoding.
EDIT:
TL;DR
Let's say you use base64. You could fit 6.4 times more data in the same spot. No compression needed.
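To make that concrete, here is a rough packing sketch (my own illustration, not from the original answer): store each ID as 4 raw bytes with pack(), then base64-encode so the field stays plain text. Every ID then costs a fixed ~5.33 characters regardless of how many decimal digits it has, so roughly 19200 IDs fit in 100kb even when the ID values themselves grow large.
<?php
$ids = range(1, 19200);                 // approximate capacity per 100kb
$packed = '';
foreach ($ids as $id) {
    $packed .= pack('N', $id);          // unsigned 32-bit, big-endian: 4 bytes per ID
}
$encoded = base64_encode($packed);      // ~5.33 characters per ID after encoding
echo mb_strlen($encoded, '8bit') / 1024 . "kb"; // 100kb
// Decoding back to integers:
$decoded = array_values(unpack('N*', base64_decode($encoded)));
?>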
How about using data compression?
$ids = range(1,18917);
$ids = implode(" ", $ids);
$ids = gzencode($ids);
echo mb_strlen($ids, '8bit') / 1024 . "kb"; // 41.435546875kb

php file random access and object to file saving

I have a CSV file with records sorted on the first field. I managed to write a function that does a binary search through that file, using fseek for random access.
However, this is still a pretty slow process, since when I seek to some file position, I actually need to scan left looking for a \n character, so I can make sure I'm reading a whole line (once a whole line is read, I can check the first field value mentioned above).
Here is the function that returns the line containing the character at position x:
function fgetLineContaining( $fh, $x ) {
    if( $x > 125145411 ) // 12514511 is the last pos in my file
        return "";
    // now go as much left as possible, until newline is found
    // or beginning of the file
    $c = '';
    while( $x > 0 && $c != "\n" && $c != "\r") {
        fseek($fh, $x);
        $x--; // go left in the file
        $c = fgetc( $fh );
    }
    $x += 2; // skip newline char
    fseek( $fh, $x );
    return fgets( $fh, 1024 ); // return the line from the beginning until \n
}
While this works as expected, I have to say that my CSV file has ~1.5 million lines, and these left-seeks slow things down quite a bit.
Is there a better way to seek to a line containing position x inside a file?
Also, it would be much better if an object of a class could be saved to a file without serializing it, thus enabling reading of a file object by object. Does PHP support that?
Thanks
I think you really should consider using SQLite or MySQL again (like others have suggested in the comments). Most of the suggestions about pre-calculating indexes are already implemented "properly" in these SQL engines.
You said the speed wasn't good enough in SQL. Did you have the fields indexed properly? How were you querying the data? Were you using bulk queries? Were you using prepared statements? Did the SQL process have enough RAM to store its indexes?
One thing you could try to speed up the current algorithm is to load the (~100MB?) file onto a RAM disk. No matter what you choose, either CSV or SQLite, this WILL help speed things up, especially if the hard drive's seek time is your bottleneck.
You could possibly even read the whole file into PHP arrays (assuming your computer has enough RAM for that). That would allow you to do your search via index ($big_array[$offset]) lookups; a rough sketch of that follows.
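A sketch of that in-memory approach, assuming the file fits in RAM and the first CSV column is the search key (the file name is an example):
$lookup = array();
$fh = fopen('data.csv', 'r');
while (($row = fgetcsv($fh)) !== false) {
    $lookup[$row[0]] = $row;   // index every record by its first field
}
fclose($fh);
// O(1) lookups instead of repeated left-seeking with fseek():
$record = isset($lookup['some_key']) ? $lookup['some_key'] : null;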
Also, one thing to keep in mind: PHP isn't exactly super fast at doing low-level things. You might want to consider moving away from PHP in favor of C or C++.
