Find all occurrences of a string in a file - php

Please keep in mind the file I am opening can be 10mb to 125mb. I have researched various ways to open a file and am still not sure as to the best approach if any one is best. Please advise!
I am opening a large file and trying to extract the text between two strings each time the first occurs. I can find the first string and extract the text to the second string, however, my loop gives me that result 12 times (number of times string occurs in this file. I can see what I am doing wrong in the loop, basically finding the first occurrence and repeating its output 12 times. How can I loop through the file and get the text between the 2-12th occurrences?
Also, any tips for proper opening of large files and handling memory limits would be great.
If this is put in an array, do I lose the whitespace? I am using PRE to display it correctly as it is. Ultimately, I want to parse each string found into smaller elements either in an array or a db. I don't want to get ahead of myself, so ignore the array comments if necessary.
<?php
ini_set('memory_limit', '-1');
/*
Functions
*/
function get_string_between($string, $start, $end){
$string = " ".$string;
$ini = strpos($string,$start);
if ($ini == 0) return "";
$ini += strlen($start);
$len = strpos($string,$end,$ini) - $ini;
return substr($string,$ini,$len);
}
/*
Pre Loop
*/
$string1 = "String 1";
$string2 = "String 2";
$report = file_get_contents('report.rpt','r');
$cbcount = substr_count($report,$string1);
echo $cbcount;
/*
Loop
*/
for ($i=0; $i<$cbcount; $i++){
$output = get_string_between($report, $string1, $string2);
echo "<pre>".$output."</pre>";
}
?>

You're never actually advancing any pointer of any kind, so it has no way of knowing that it already found the first match.
Now, depending on your input, you may be able to just use a regex:
preg_match_all("(".preg_quote($string1).".*?".preg_quote($string2).")s",$report,$matches);
(Replace the entire loop with this)
Then you can var_dump($matches[0]) to see your output.

$startfrom = 0;
while (($start = strpos($string1, $report, $startfrom)) !== false) {
$end = strpos($string2, $report, $start);
echo "<pre>".substr($report, $start, $end-$start)."</pre>";
$startfrom = $end + 1;
}
Regarding dealing with large files, instead of reading the entire thing into memory, you can use fopen() and fgets() to read it line by line. When you find a line containing the $string1 you start accumulating lines in in a variable, until you find the line containing $string2. This only works simply if the match strings cannot contain newlines.

Related

very large php string magically turns into array

I am getting an "Array to string conversion error on PHP";
I am using the "variable" (that should be a string) as the third parameter to str_replace. So in summary (very simplified version of whats going on):
$str = "very long string";
str_replace("tag", $some_other_array, $str);
$str is throwing the error, and I have been trying to fix it all day, the thing I have tried is:
if(is_array($str)) die("its somehow an array");
serialize($str); //inserted this before str_replace call.
I have spent all day on it, and no its not something stupid like variables around the wrong way - it is something bizarre. I have even dumped it to a file and its a string.
My hypothesis:
The string is too long and php can't deal with it, turns into an array.
The $str value in this case is nested and called recursively, the general flow could be explained like this:
--code
//pass by reference
function the_function ($something, &$OFFENDING_VAR, $something_else) {
while(preg_match($something, $OFFENDING_VAR)) {
$OFFENDING_VAR = str_replace($x, y, $OFFENDING_VAR); // this is the error
}
}
So it may be something strange due to str_replace, but that would mean that at some point str_replace would have to return an array.
Please help me work this out, its very confusing and I have wasted a day on it.
---- ORIGINAL FUNCTION CODE -----
//This function gets called with multiple different "Target Variables" Target is the subject
//line, from and body of the email filled with << tags >> so the str_replace function knows
//where to replace them
function perform_replacements($replacements, &$target, $clean = TRUE,
$start_tag = '<<', $end_tag = '>>', $max_substitutions = 5) {
# Construct separate tag and replacement value arrays for use in the substitution loop.
$tags = array();
$replacement_values = array();
foreach ($replacements as $tag_text => $replacement_value) {
$tags[] = $start_tag . $tag_text . $end_tag;
$replacement_values[] = $replacement_value;
}
# TODO: this badly needs refactoring
# TODO: auto upgrade <<foo>> to <<foo_html>> if foo_html exists and acting on html template
# Construct a regular expression for use in scanning for tags.
$tag_match = '/' . preg_quote($start_tag) . '\w+' . preg_quote($end_tag) . '/';
# Perform the substitution until all valid tags are replaced, or the maximum substitutions
# limit is reached.
$substitution_count = 0;
while (preg_match ($tag_match, $target) && ($substitution_count++ < $max_substitutions)) {
$target = serialize($target);
$temp = str_replace($tags,
$replacement_values,
$target); //This is the line that is failing.
unset($target);
$target = $temp;
}
if ($clean) {
# Clean up any unused search values.
$target = preg_replace($tag_match, '', $target);
}
}
How do you know $str is the problem and not $some_other_array?
From the manual:
If search and replace are arrays, then str_replace() takes a value
from each array and uses them to search and replace on subject. If
replace has fewer values than search, then an empty string is used for
the rest of replacement values. If search is an array and replace is a
string, then this replacement string is used for every value of
search. The converse would not make sense, though.
The second parameter can only be an array if the first one is as well.

Manually move the fgetc file pointer to the next line

Question 1: How can I manually move the fgetc file pointer from its current location to the next line?
I'm reading in data character by character until a specified number of delimiters are counted. Once the delimiter count reaches a certain number, it needs to copy the remainder of the line until a new line (the record delimiter). Then I need to start copying character by character again starting at the next record.
Question 2: Is manually moving the file pointer to the next line the right idea? I would just explode(at "\n") but I have to count the pipe delimiters first because "\n" isn't always the record delimiter.
Here's my code (it puts all the data into the correct record until it reaches the last delimiter '|' in the record. It then puts the rest of the line into the next record because I haven't figured out how to make it correctly look for the '\n' after specified # of | are counted):
$file=fopen("source_data.txt","r") or exit ("File Open Error");
$record_incrementor = 0;
$pipe_counter = 0;
while (!feof($file))
{
$char_buffer = fgetc($file);
$str_buffer[] = $char_buffer;
if($char_buffer == '|')
{
$pipe_counter++;
}
if($pipe_counter == 46) //Maybe Change to 46
{
$database[$record_incrementor] = $str_buffer;
$record_incrementor++;
$str_buffer = NULL;
$pipe_counter = 0;
}
}
Sample Data:
1378|2009-12-13 11:51:45.783000000|"Pro" |"B13F28"||""|1||""|""|""|||False|||""|""|""|""||""||||||2010-12-15 11:51:51.330000000|108||||||""||||||False|""|""|False|""|||False
1379|2009-12-13 12:23:23.327000000|"TLUG"|"TUG"||""|1||""|""|""|||False|||""|""|""|""||""||||||1943-04-19 00:00:00|||||||""||||||False|""|""|False|""|||False
I'd say that doing this via file handling functions is a bit clumsy, when it could be done via regular expression quite easily. Just read the entire file into a string using file_get_contents() and doing a regular expression like /^(([^|]*\|){47}([^\r\n]*))/m with preg_match_all() could find you all the rows (which you can then explode() using | as the delimiter and setting 48 as the limit for number of fields.
Here is a working example function. The function takes the file name, field delimiter and the number of fields per row as the arguments. The function returns 2 dimensional array where first index is the data row number and the second is the field number.
function loadPipeData ($file, $delim = '|', $fieldCount = 48)
{
$contents = file_get_contents($file);
$d = preg_quote($delim, '/');
preg_match_all("/^(([^$d]*$d){" . ($fieldCount - 1) . '}([^\r\n]*))/m', $contents, $match);
$return = array();
foreach ($match[0] as $line)
{
$return[] = explode($delim, $line, $fieldCount);
}
return $return;
}
var_dump(loadPipeData('source_data.txt'));
(Note: this is a solution to the original problem)
You can read to the end of the line like this:
while (!feof($file) && fgetc($file) !== '\n');
As for whether or not fgetc is the right way to do this... your format makes it difficult to use anything else. You can't split on \n, because there may be newlines within a field, and you can't split on |, because the end of the record doesn't have a pipe.
The only other option I can think is to use preg_match_all:
$buffer = file_get_contents('test.txt');
preg_match_all('/((?:[^|]*\|){45}[^\n]*\n)/', $buffer, $matches);
foreach ($matches[0] as $row) {
$fields = explode('|', $row);
}
Answer to the modified question:
To read from the file pointer to the end of the line, you can simply use the file reading function fgets(). It returns everything from the current file pointer position until it reaches the end of the line (and also returns the end of the line character(s)). After the function call, the file reading pointer has been moved to the beginning of the next line.

How to use str_replace() to remove text a certain number of times only in PHP?

I am trying to remove the word "John" a certain number of times from a string. I read on the php manual that str_replace excepts a 4th parameter called "count". So I figured that can be used to specify how many instances of the search should be removed. But that doesn't seem to be the case since the following:
$string = 'Hello John, how are you John. John are you happy with your life John?';
$numberOfInstances = 2;
echo str_replace('John', 'dude', $string, $numberOfInstances);
replaces all instances of the word "John" with "dude" instead of doing it just twice and leaving the other two Johns alone.
For my purposes it doesn't matter which order the replacement happens in, for example the first 2 instances can be replaced, or the last two or a combination, the order of the replacement doesn't matter.
So is there a way to use str_replace() in this way or is there another built in (non-regex) function that can achieve what I'm looking for?
As Artelius explains, the last parameter to str_replace() is set by the function. There's no parameter that allows you to limit the number of replacements.
Only preg_replace() features such a parameter:
echo preg_replace('/John/', 'dude', $string, $numberOfInstances);
That is as simple as it gets, and I suggest using it because its performance hit is way too tiny compared to the tedium of the following non-regex solution:
$len = strlen('John');
while ($numberOfInstances-- > 0 && ($pos = strpos($string, 'John')) !== false)
$string = substr_replace($string, 'dude', $pos, $len);
echo $string;
You can choose either solution though, both work as you intend.
You've misunderstood the wording of the manual.
If passed, this will be set to the number of replacements performed.
The parameter is passed by reference and its value is changed by the function to indicate how many times the string was found and replaced. Its initial value is discarded.
There are a few things you could do to achieve this, but I can't think of one specific php function that will easily let you do this.
One option is to create your own replace function and utilize strripos and substr to do the replaces.
Another thing you can do is use preg_replace_callback and count the number of replacements you have done in the callback.
There's probably more ways but that's all I can think of on the fly. If performance is an issue I suggest you give both a try and do some simple benchmarks.
The cleanest, most-direct, single function call is to use preg_replace(). Its replacement limiting parameter makes the task intuitive and readable.
$string = preg_replace('/John/', 'dude', $string, $numberOfInstances);
The function is also attractive because making the search case-insensitive is as simple as adding the i pattern modifier to the end of the pattern. I won't delve into the usefulness of word boundaries (\b).
If a search string might contain characters with special meaning to the regex engine, then preg_quote() will be necessary -- this diminishes the beauty of the technique but not prohibitively so.
$search = '$5.99';
$pattern = '/' . preg_quote($search, '/') . '/';
$string = preg_replace($pattern, 'free', $string, $numberOfInstances);
For anyone who has an unnatural bias against regex functions, this can be done without regex and without looping -- it will be case-sensitive though.
Limited Explode & Implode: (Demo)
$numberOfInstances = 2;
$string = 'Hello John, how are you John. John are you happy with your life John?';
// explode here -^^^^ and ---------^^^^ only to create the following array:
// 0 => 'Hello ',
// 1 => ', how are you ',
// 2 => '. John are you happy with your life John?'
echo implode('dude', explode('John', $string, $numberOfInstances + 1));
Output:
Hello dude, how are you dude. John are you happy with your life John?
Notice the explode's limiting parameter dictates how many elements are generated, not how many explosions are executed on the string.
function str_replace_occurrences($find, $replace, $string, $count = -1) {
// current occrurence
$current = 0;
// while any occurrence
while (($pos = strpos($string, $find)) != false) {
// update length of str (size of string is changing)
$len = strlen($find);
// found next one
$current++;
// check if we've reached our target
// -1 is used to replace all occurrence
if($current <= $count || $count == -1) {
// do replacement
$string = substr_replace($string, $replace, $pos, $len);
} else {
// we've reached our
break;
}
}
return $string;
}
Artelius has already described how the function works, ill just show you how to do this via the manual methods:
function str_replace_occurrences($find,$replace,$string,$count = 0)
{
if($count == 0)
{
return str_replace($find,$replace,$string);
}
$pos = 0;
$len = strlen($find);
while($pos < $count && false !== ($pos = strpos($string,$find,$pos)))
{
$string = substr_replace($string,$replace,$pos,$len);
}
return $string;
}
This is untested but should work.

PHP Split a string with start and stop value

I have fooled around with regex but can't seem to get it to work. I have a file called includes/header.php I am converting the file into one big string so that I can pull out a certain portion of the code to paste in the html of my document.
$str = file_get_contents('includes/header.php');
From here I am trying to get return only the string that starts with <ul class="home"> and ends with </ul>
try as I may to figure out an expression I am still confused.
Once I trim down the string I can just print that on my page but I can't figure out the trimming part
If you need something really hardcore, http://www.php.net/manual/en/book.xmlreader.php.
If you just want to rip out the text that fits that pattern try something like this.
$string = "stuff<ul class=\"home\">alsdkjflaskdvlsakmdf<another></another></ul>stuff";
if( preg_match( '/<ul class="home">(.*)<\/ul>/', $string, $match ) ) {
//do stuff with $match[0]
}
I'm assuming that the difficulty you're having has to do with escaping the regex special characters in the string(s) you're using as a delimiter. If so, try using the preg_quote() function:
$start = preg_quote('<ul class="home">');
$end = preg_quote('</ul>', '/');
preg_match("/" . $start. '.*' . $end . "/", $str, $matching_html_snippets);
The html you want should be in $matching_html_snippets[0]
You probably want an XML parser such as the built in one. Here is an example you might want to take a look at.
http://www.php.net/manual/en/function.xml-parse.php#90733
If you want to use regex then something along the lines of
$str = file_get_contents('includes/header.php');
$matchedstr = preg_match("<place your pattern here>", $str, $matches);
You probably want the pattern
'/<ul class="home">.*?<\/ul>/s'
Where $matches will contain an array of the matches it found so you can grab whatever element you want from the array with
$matchedstr[0];
which will return the first element. And then output that.
But I'd be a bit wary, regular expressions do tend to match to surprising edge cases and you need to feed them actual data to get reliable results as to when they are failing. However if you are just passing templates it should be ok, just do some tests and see if it all works. If not I'd still recommend using the PHP XML Parser.
Hope that helps.
If you feel like not using regexes you could use string finding, which I think the PHP manual implies is quicker:
function substrstr($orig, $startText, $endText) {
//get first occurrence of the start string
$start = strpos($orig, $startText);
//get last occurrence of the end string
$end = strrpos($orig, $endText);
if($start === FALSE || $end === FALSE)
return $orig;
$start++;
$length = $end - $start;
return substr($orig, $start, $length);
}
$substr = substrstr($string, '<ul class="home">', '</ul>');
You'll need to make some adjustments if you want to include the terminating strings in the output, but that should get you started!
Here's a novel way to do it; I make no guarantees about this technique's robustness or performance, other than it does work for the example given:
$prefix = '<ul class="home">';
$suffix = '</ul>';
$result = $prefix . array_shift(explode($suffix, array_pop(explode($prefix, $str)))) . $suffix;

php - why does this regex truncate my string to zero length?

Yesterday I tracked down a strange bug which caused a website display only a white page - no content on it, no error message visible.
I found that a regular expression used in preg_replace was the problem.
I used the regex in order to replace the title html tag in the accumulated content just before echo´ing the html. The html got rather large on the page where the bug occured (60 kb - not too large) and it seemed like preg_replace / the regex used can only handle a string of certain length - or my regex is really messed up (also possible).
Look at this sample program which reproduces the problem (tested on PHP 5.2.9):
function replaceTitleTagInHtmlSource($content, $replaceWith) {
return preg_replace('#(<title>)([\s\S]+)(<\/title>)#i', '$1'.$replaceWith.'$3', $content);
}
$dummyStr = str_repeat('A', 6000);
$totalStr = '<title>foo</title>';
for($i = 0; $i < 10; $i++) {
$totalStr .= $dummyStr;
}
print 'orignal: ' . strlen($totalStr);
print '<hr />';
$replaced = replaceTitleTagInHtmlSource($totalStr, 'bar');
print 'replaced: ' . strlen($replaced);
print '<hr />';
Output:
orignal: 60018
replaced: 0
So - the function gets a string of length 60000 and returns a string with 0 length. Not what I wanted to do with my regex.
Changing
for($i = 0; $i < 10; $i++) {
to
for($i = 0; $i < 1; $i++) {
in order to decrease the total string length, the output is:
orignal: 6018
replaced: 6018
When I removed the replacing, the content of the page was displayed without any problems.
It seems like you're running into the backtracking limit.
This is confirmed if you print preg_last_error(): it returns PREG_BACKTRACK_LIMIT_ERROR.
You can either increase the limit in your ini file or using ini_set() or change your regular expression from ([\s\S]+) to .*?, which will stop it from backtracking so much.
It thas been said many times before on SO, eg Regex to match the first ending HTMl tag (and probably will be mentioned again) that regexes are not appropriate for HTML because tags are too irregular.
Use DOM functions where they're available.
Backtracking: [\s\S]+ will match ALL available characters, then go backwards through the string looking for the </title>. [^<]+ matches all characters that aren't < and therefore grabs </title> faster.
function replaceTitleTagInHtmlSource($content, $replaceWith) {
return preg_replace('#(<title>)([^<]+)(</title>)#i', '$1'.$replaceWith.'$3', $content);
}
Your regex seems to be a little funny.
([\s\S]+) matches all space and non-space. you should try (.*?) instead.
changing your function works for me:
function replaceTitleTagInHtmlSource($content, $replaceWith) {
return preg_replace('`\<title\>(.*?)\<\/title\>`i', '<title>'.$replaceWith.'</title>', $content);
}
and the problem seems to be you trying to use $1 and $3 to match and

Categories