I have my code already working; it parses the files and inserts the records. The issue that is stumping me, since I have never had to do this before, is how to tell my code to parse files 1-300, wait, then parse the next "batch" and so on until it has finished parsing all the files. I need to parse over 50 thousand files, so obviously I'm hitting PHP's memory limit and execution time. Both have already been increased, but I don't think I can set them high enough to process 50 thousand files in one go.
I need help with how to tell my code to run files 1 to x, then rerun and process x to y.
My code is below (note: I am gathering more information than what's in my snippet):
$xml_files = glob(storage_path('path/to/*.xml'));

foreach ($xml_files as $file) {
    $data = simplexml_load_file($file);

    // ... Parse XML and get certain nodes ...
    $name = $data->record->memberRole->member->name;

    // ... SQL to insert record into DB ...
    Members::firstOrCreate(
        ['name' => $name]
    );
}
The simplest, if inelegant, solution is calling the script multiple times with an offset, and using a for loop instead of a foreach.
$xml_files = glob(storage_path('path/to/*.xml'));

$offset = (int) $_GET['offset'];
// Or if calling the script via the command line:
// $offset = (int) $argv[1];

// Cap the batch at the end of the file list so we never index past it
$limit = min($offset + 300, count($xml_files));

for ($i = $offset; $i < $limit; $i++) {
    $data = simplexml_load_file($xml_files[$i]);
    // process and whatever
}
If you're calling the script as a web page, just add a query param like my-xml-parser.php?offset=300 and read the offset like this: $offset = $_GET['offset'].
If you're calling this as a command line script, call it like php my-xml-parser.php 300 and get the offset from argv: $offset = $argv[1].
EDIT
If it's a web script, you can try adding a curl call that makes the script call itself with the next offset, without waiting for an answer.
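A minimal, untested sketch of that idea (the host and path in the URL are placeholders for wherever your script lives):

$nextOffset = $offset + 300;
$ch = curl_init('http://your-host/my-xml-parser.php?offset=' . $nextOffset); // placeholder URL
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_NOSIGNAL, 1);     // allow sub-second timeouts on some builds
curl_setopt($ch, CURLOPT_TIMEOUT_MS, 200); // give up almost immediately instead of waiting
curl_exec($ch);                            // the timeout "error" is expected and harmless here
curl_close($ch);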
I'm very new to PHP, making errors and learning as I go. Please be gentle! :)
I want to access some data from Blizzard.com's API. For this particular data set, it's not a block of data in JSON; rather, each object has its own URL to access. I estimate that there are approx 150000 objects, however I don't know the start or end points of the number range, so I'm having to assume 1 and work past the highest number I know of (269065).
To get the data, I need to access each object's data via a JSON file, which I read, get the contents of and drop into a text file (this could be written as an insert into a SQL db too, as I'm able to do this if it's the text file that's the issue). But to be honest, I would love to get to the bottom of why this is happening as much as anything!
I wasn't going to try and run ~250000 iterations in a for loop, so I thought I'd try something I considered small: 2000.
The for loop starts with $a as 1 and uses $a as part of the URL, loads and decodes the JSON, then checks whether the first field (ID) in the object is set. If it is, it writes a few fields to data.txt; if the first field (ID) isn't set, it just writes $a to data.txt (so I know it's a null for other purposes not outlined here).
Simple! Or so I thought. After approx 183 iterations, the data written to the text file goes awry, as seen in the quote below. It is out of sequence: it starts at 1 again, then goes back to 184, ad nauseam. The loop then seems to be locked in some kind of infinite loop, outputting in a random order until I close the page 10-20 minutes later.
I have obviously made a big mistake! But I have no idea what I have done wrong to cause this. During my attempts I have rewritten the code with new variable names, so that a new test does not conflict with code that could still be running in memory.
I've tried resetting variables to blank at the end of the loop, in case something was being reused that was causing a problem.
If anyone could point out any errors in my code, or suggest something for me to look into to handle bigger loops, that would be brilliant. I am assuming my issue may be a timeout or memory problem, but I don't know where to start and was hoping I'd find some suggestions here.
If it's relevant, I am using 000webhostapp.com as my host provider for now, until I get some paid for hosting.
1 ... 182 183 1 184 2 3 185 4 186 5 187 6 188 7 189 190 8 191
for ($a = 1; $a <= 2000; $a++) {
    $json = "https://eu.api.battle.net/wow/recipe/".$a."?locale=en_GB&<MYPRIVATEAPIKEY>";
    $contents = file_get_contents($json);
    $data = json_decode($contents, true);
    if (isset($data['id'])) {
        $file = fopen("data.txt", "a");
        fwrite($file, $data['id'].",'".$data['name']."'\n");
        fclose($file);
    } else {
        $file = fopen("data.txt", "a");
        fwrite($file, $a."\n");
        fclose($file);
    }
}
The content of the file I'm trying to access is
{"id":33994,"name":"Precise Strikes","profession":"Enchanting","icon":"spell_holy_greaterheal"}
I scrapped the original plan and wrote this instead. Thank you again to everyone who took time out of their day to help and offer suggestions!
$b = $mysqli->query("SELECT id FROM `static_recipes` ORDER BY id DESC LIMIT 1;")->fetch_object()->id;
if (empty($b)) { $b = 1; }
$count = $b + 101;
$write = [];
for ($a = $b + 1; $a < $count; $a++) {
    $json = "https://eu.api.battle.net/wow/recipe/".$a."?locale=en_GB&apikey=";
    $contents = @file_get_contents($json);
    $data = json_decode($contents, true);
    if (isset($data['id'])) {
        $write[] = "(".$data['id'].",'".addslashes($data['name'])."','".addslashes($data['profession'])."','".addslashes($data['icon'])."')";
    } else {
        $write[] = "(".$a.",'a','a','a')";
    }
}
$SQL = 'INSERT INTO `static_recipes` (id, name, profession, icon) VALUES '.implode(',', $write);
$mysqli->query($SQL);
$mysqli->close();
$write = [];
for ($a = 1; $a <= 2000; $a++) {
    $json = "https://eu.api.battle.net/wow/".$a."?locale=en_GB&<MYPRIVATEAPIKEY>";
    $contents = file_get_contents($json);
    $data = json_decode($contents, true);
    if (isset($data['id'])) {
        $write[] = $data['id'].",'".$data['name']."'\n";
    } else {
        $write[] = $a."\n";
    }
}
$file = fopen("data.txt", "a");
fwrite($file, implode('', $write));
fclose($file);
Also, why do you think that some IDs aren't duplicated across the data at several "https://eu.api.battle.net/wow/[N]" URLs?
Also, since you mention running ~250000 requests, think about curl_multi_init(): http://php.net/manual/en/function.curl-multi-init.php
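For what it's worth, here is a rough sketch of fetching a batch of those recipe URLs in parallel with curl_multi (untested; the batch size and the apikey=YOURKEY placeholder are assumptions, and you would still write the results out the way your loop already does):

$start = 1;
$batch = 50; // how many requests to run in parallel; tune to what the API tolerates
$mh = curl_multi_init();
$handles = [];
for ($a = $start; $a < $start + $batch; $a++) {
    $ch = curl_init("https://eu.api.battle.net/wow/recipe/$a?locale=en_GB&apikey=YOURKEY");
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_multi_add_handle($mh, $ch);
    $handles[$a] = $ch;
}
// Run all handles until every transfer has finished
do {
    curl_multi_exec($mh, $running);
    curl_multi_select($mh);
} while ($running > 0);
foreach ($handles as $a => $ch) {
    $data = json_decode(curl_multi_getcontent($ch), true);
    // ...write $data (or just $a when the id is missing), as in the original loop...
    curl_multi_remove_handle($mh, $ch);
    curl_close($ch);
}
curl_multi_close($mh);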
I can't really see anything obviously wrong with your code; I can't run it though, as I don't have the JSON.
It could be possible that there is some kind of race condition, since you're opening and closing the same file hundreds of times very quickly.
File operations might seem atomic, but they are not necessarily so - here's an interesting SO thread:
Does PHP wait for filesystem operations (like file_put_contents) to complete before moving on?
Like some others suggested - maybe just open the file before you enter the loop, then close the file when the loop breaks.
I'd try it first and see if it helps.
There's nothing in your original code that would cause that sort of behaviour. PHP will not arbitrarily change the value of a variable. You are opening this file in append mode; are you certain that you're not looking at old data? Maybe output some debug messages as you process the data. It's likely you'd run up against some rate limiting on the API server, so putting a pause in there somewhere may improve reliability.
The only substantive change I'd suggest to your code is opening the file once and closing it when you're done.
$file = fopen("data_1_2000.txt", "w");
for ($a = 1; $a <= 2000; $a++) {
$json = "https://eu.api.battle.net/wow/recipe/$a?locale=en_GB&<MYPRIVATEAPIKEY>";
$contents = file_get_contents($json);
$data = json_decode($contents, true);
if (!empty($data['id'])) {
$data["name"] = str_replace("'", "\\'", $data["name"]);
$record = "$data[id],'$data[name]'";
} else {
$record = $a;
}
fwrite($file, "$record\n");
sleep(1);
echo "$a "; if ($a % 50 === 0) echo "\n";
}
fclose($file);
Currently I have a file that is set to read a CSV file. The CSV file contains 1600 API queries, and each API query then returns more queries that need to be run. I am using XAMPP v3.2.2 on Windows 10 and running PHP v5.6.15. When running the file through my browser, it ran fine for the first 800+ records in the CSV before timing out. When I rerun the file now I get the error "Site can't be reached, ERR_CONNECTION_RESET". Not sure what could be causing this. An abbreviated version of the code is included below.
<?php
error_reporting(E_ERROR | E_PARSE);
set_time_limit(28800);

$csv = array_map('str_getcsv', file('file.csv'));

for ($i = 0; $i < count($csv); $i++) {
    if ($csv[$i][2] == 1) { // this item in csv is flag to check if this item has been run yet
        if ($csv[$i][3] != 'NULL' && trim($csv[$i][3]) != '') { // check to make sure there is a URL
            $return = file_get_contents(trim($csv[$i][3])); // get the contents that links to new api calls
            if ($return) {
                $isCall = array(); // array to store all new calls
                $data = array();   // array to store all data to put in csv
                $doc = new DOMDocument('1.0'); // create new DOM object
                $doc->loadHTML($return);       // load page string into DOM object
                $links = $doc->getElementsByTagName('a'); // get all <a> tags on the page
                if ($links->length > 0) { // if there is at least one <a> tag on page
                    for ($j = 0; $j < $links->length; $j++) { // loop through <a> tags
                        $isCall[] = $links->item($j)->getAttribute('href'); // get href attribute from <a> tag and push into array
                    }
                    for ($x = 0; $x < count($isCall); $x++) { // loop through all the calls and search for data
                        $string = file_get_contents($isCall[$x]);
                        if ($string) {
                            $thispage = new DOMDocument('1.0');
                            $thispage->loadHTML($string);
                            $pagedata = $thispage->getElementsByTagName('div');
                            if ($pagedata->length > 0) {
                                for ($j = 0; $j < $pagedata->length; $j++) {
                                    $data[] = $pagedata->item($j)->C14N();
                                }
                            }
                        }
                        if (count($data) >= 5) break; // limiting to 5 data points to be added to csv
                    }
                }
                if (!empty($data)) $csv[$i] = array_merge($csv[$i], $data); // if we have data points lets add them to the row in the csv
            }
        }
        $csv[$i][2] = 2; // set the flag to 2
        // write the contents to the csv each time through the loop so if it fails we start at the last completed record
        $fp = fopen('file.csv', 'w');
        foreach ($csv as $f) {
            fputcsv($fp, $f);
        }
        fclose($fp);
    }
}
?>
Usually ERR_CONNECTION_RESET is an error that occurs when the site you are trying to connect to is unable to establish the connection. This usually happens for reasons like firewall blocking, issues with ISP caching, etc.
However, in your case, I feel that the site you are connecting to is voluntarily closing connection attempts, because what you are trying to do is loop over and hit that API site 1600 times continuously.
The API site allows the first 800-odd attempts, but after that it gets worried that you are perhaps a malicious script trying to harm it - a classic example of a DoS (Denial of Service) attempt.
You should check whether there is any restriction on the number of attempts a client can make to the API site within a fixed time window (say, 500 hits every 24 hours), or you should try sleeping N seconds after each hit to the site, or after every X hits, as in the sketch below.
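If rate limiting is indeed the cause, a simple throttle inside the existing loop may be enough; the pause lengths below are only guesses to adjust against the API's documented limits:

for ($i = 0; $i < count($csv); $i++) {
    // ... existing processing of $csv[$i] ...
    usleep(250000);            // pause 0.25 s between requests
    if ($i > 0 && $i % 100 === 0) {
        sleep(30);             // take a longer break every 100 requests
    }
}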
I have a file with a size of around 10 GB or more. The file contains only numbers ranging from 1 to 10, one per line and nothing else. Now the task is to read the numbers from the file, sort them in ascending or descending order, and create a new file with the sorted numbers.
Can any of you please help me with the answer?
I'm assuming this is some kind of homework and the goal is to sort more data than you can hold in your RAM?
Since you only have the numbers 1-10, this is not that complicated a task. Just open your input file and count how many occurrences of each specific number you have. After that you can construct a simple loop and write the values into another file. The following example is pretty self-explanatory.
$inFile = '/path/to/input/file';
$outFile = '/path/to/output/file';

$input = fopen($inFile, 'r');
if ($input === false) {
    throw new Exception('Unable to open: ' . $inFile);
}

// $map will be an array of size 10, keyed 1-10 and filled with 0-s
$map = array_fill(1, 10, 0);

// Read the file line by line and count how many of each specific number you have
while (($line = fgets($input)) !== false) {
    $int = (int) $line;
    if ($int >= 1 && $int <= 10) { // skip blank or malformed lines
        $map[$int]++;
    }
}
fclose($input);

$output = fopen($outFile, 'w');
if ($output === false) {
    throw new Exception('Unable to open: ' . $outFile);
}

/*
 * Reverse the array if you need to change direction between
 * ascending and descending order (true preserves the 1-10 keys)
 */
//$map = array_reverse($map, true);

// Write the values into your output file
foreach ($map as $number => $count) {
    $string = ((string) $number) . PHP_EOL;
    for ($i = 0; $i < $count; $i++) {
        fwrite($output, $string);
    }
}
fclose($output);
Taking into account the fact that you are dealing with huge files, you should also check the script execution time limit for your PHP environment. The example above will take VERY long for 10GB+ files, but since I didn't see any limitations concerning execution time and performance in your question, I'm assuming it is OK.
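For reference, if the hosting environment allows it, those limits can be lifted at the top of the script (the memory value here is only an example):

set_time_limit(0);               // remove the max execution time for this run
ini_set('memory_limit', '512M'); // raise the memory limit if the host permits it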
I had a similar issue before. Trying to manipulate such a large file ended up being a huge drain on resources and the script couldn't cope. The easiest solution I ended up with was to import it into a MySQL database using the fast bulk-load statement LOAD DATA INFILE:
http://dev.mysql.com/doc/refman/5.1/en/load-data.html
Once it's in, you should be able to manipulate the data.
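A rough sketch of that approach with mysqli, assuming a one-column table named numbers (INT column n), that LOCAL INFILE is enabled on both client and server, and placeholder credentials and paths:

$mysqli = mysqli_init();
$mysqli->options(MYSQLI_OPT_LOCAL_INFILE, true);            // must be set before connecting
$mysqli->real_connect('localhost', 'user', 'pass', 'mydb');
// Bulk-load the raw file, one number per line, into the table
$mysqli->query("LOAD DATA LOCAL INFILE '/path/to/input/file' INTO TABLE numbers (n)");
// Let MySQL do the sorting and stream the result back out to a file
$result = $mysqli->query('SELECT n FROM numbers ORDER BY n ASC', MYSQLI_USE_RESULT);
$output = fopen('/path/to/output/file', 'w');
while ($row = $result->fetch_row()) {
    fwrite($output, $row[0] . PHP_EOL);
}
fclose($output);
$mysqli->close();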
Alternatively, you could just read the file line by line while outputting the result into another file line by line with the sorted numbers. Not too sure how well this would work though.
Have you had any previous attempts at it or are you just after a possible method of doing it?
If that's all, you don't need PHP (if you have a Linux machine at hand):
sort -n file > file_sorted-asc
sort -nr file > file_sorted-desc
Edit: OK, here's your solution in PHP (if you have a Linux machine at hand):
<?php
// Sort ascending
`sort -n file > file_sorted-asc`;
// Sort descending
`sort -nr file > file_sorted-desc`;
?>
:)
So I have a problem with a script I'm working on. I have a folder full of JSON files called roster0.json, roster1.json, etc.
$dir = "responses/";
$files = glob($dir . "roster*");
$failed = array();
$failcnt = 0;
if (isset($files)) {
$data = null;
for ($i = 0; $i < count($files); $i++) {
$data = json_decode(utf8_decode(file_get_contents($files[$i])));
if(isset($data)){
// Process stuff
When I var_dump($files) I get an array with over 100 paths like "responses/roster0.json".
When I test $data I get a proper array of data.
However, once the loop goes to the next file, it never loads it and never processes it.
Here's the crazy part: if I change the start of the for loop, e.g. to $i = 20, it will load the 21st file in the directory, parse it, and insert it into the db properly!
Ignoring the failcnt stuff at the bottom, here's the current version of the script in its entirety: http://pastebin.com/yqyKi5Ag
PS - I have full WARNING/ERROR reporting on in PHP and am not getting any error messages... HELP! Thanks!
When I was writing the insert string, the ID was being duplicated and thus the insert was invalid. I switched to an auto-increment column and, tada, it works. Thanks for the assistance.
I have a bunch of files I need to crunch and I'm worrying about scalability and speed.
The filename and file data (only the first line) are stored in an array in RAM to create some statistical files later in the script.
The files must remain files and can't be put into a database.
The filenames are formatted in the following fashion:
Y-M-D-title.ext (where Y is Year, M for Month and D for Day)
I'm actually using glob to list all the files and create my array.
Here is a sample of the code creating the array "for year" or "month" (it's used in a function with only one parameter -> $period):
[...]
function create_data_info($period = NULL) {
    $data = array();
    $files = glob(ROOT_DIR.'/'.'*.ext');
    $size = sizeof($files);
    $existing_title = array(); // Used so we can handle having the same title twice at different dates.

    if (isset($period)) {
        if ("year" === $period) {
            for ($i = 0; $i < $size; $i++) {
                $info = extract_info($files[$i], $existing_title);
                // Create the data array with all the data ordered by year/month/day
                $data[(int)$info[5]][] = $info;
                unset($info);
            }
        } elseif ("month" === $period) {
            for ($i = 0; $i < $size; $i++) {
                $info = extract_info($files[$i], $existing_title);
                $key = $info[5].$info[6];
                // Create the data array with all the data ordered by year/month/day
                $data[(int)$key][] = $info;
                unset($info);
            }
        }
    }
    [...]
}
function extract_info($file, &$existing) {
    $full_path_file = $file;
    $file = basename($file);
    $info_file = explode("-", $file, 4);
    $filetitle = explode(".", $info_file[3]);

    $info[0] = $filetitle[0];
    if (!isset($existing[$info[0]]))
        $existing[$info[0]] = -1;
    $existing[$info[0]] += 1;
    if ($existing[$info[0]] > 0) {
        // We have already found a post with this title.
        // The creation of the cache is based on info[4] data for the filename,
        // so we need to tune it.
        $info[0] = $info[0]."-".$existing[$info[0]];
    }

    $info[1] = $info_file[3];
    $info[2] = $full_path_file;

    $post_content = file(ROOT_DIR.'/'.$file, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
    $info[3] = $post_content[0]; // first line of the file
    unset($post_content);

    $info[4] = filemtime(ROOT_DIR.'/'.$file);
    $info[5] = $info_file[0]; // year
    $info[6] = $info_file[1]; // month
    $info[7] = $info_file[2]; // day

    return $info;
}
So in my script I only call create_data_info(PERIOD) (PERIOD being "year", "month", etc.).
It returns an array filled with the info I need, and then I can loop through it to create my statistics files.
This process is done every time the PHP script is launched.
My question is: is this code optimal (certainly not), and what can I do to squeeze some juice out of my code?
I don't know how I can cache this (even if it's possible), as there is a lot of I/O involved.
I can change the tree structure if it would change things compared to a flat structure, but from what I found out with my tests, flat seems to be the best.
I already thought about writing a little "booster" in C doing only the crunching, but since it's I/O bound, I don't think it would make a huge difference and the application would be a lot less compatible for shared hosting users.
Thank you very much for your input, I hope I was clear enough here. Let me know if you need clarification (and forgive my English mistakes).
To begin with, you should use DirectoryIterator instead of the glob function. When it comes to scandir vs opendir vs glob, glob is as slow as it gets.
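For example, a minimal sketch of the listing part with DirectoryIterator, assuming the same ROOT_DIR constant and .ext extension as in your code:

$files = [];
foreach (new DirectoryIterator(ROOT_DIR) as $entry) {
    // Skip "." and ".." and anything that isn't one of the data files
    if ($entry->isFile() && $entry->getExtension() === 'ext') {
        $files[] = $entry->getPathname();
    }
}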
Also, when you are dealing with a large number of files, you should try to do all your processing inside one loop; PHP function calls are rather slow.
I see you are using unset($info); yet in every loop iteration $info gets a new value. PHP does its own garbage collection, if that's your concern. unset is a language construct, not a function, and should be pretty fast, but when it's not needed it still makes the whole thing a bit slower.
You are passing $existing as a reference. Is there a practical benefit to this? In my experience, references make things slower.
And lastly, your script seems to do a lot of string processing. You might want to consider some kind of "serialize data and base64 encode/decode" solution, but you should benchmark that specifically; it might be faster or slower depending on your whole code. (My thinking is that serialize/unserialize MIGHT run faster, as these are native PHP functions, while custom functions with string processing are slower.)
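One way to read that suggestion is to cache the parsed array between runs. A sketch only, where the cache path and the freshness window are arbitrary assumptions:

$cacheFile = ROOT_DIR . '/data_cache.ser'; // hypothetical cache location
$maxAge = 300;                             // rebuild at most every 5 minutes (arbitrary)
if (is_file($cacheFile) && (time() - filemtime($cacheFile)) < $maxAge) {
    // Reuse the previously parsed array instead of re-reading every file
    $data = unserialize(file_get_contents($cacheFile));
} else {
    $data = create_data_info('year');      // the function from the question
    file_put_contents($cacheFile, serialize($data));
}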
My answer was not very I/O related but I hope it was helpful.