php - why does this regex truncate my string to zero length?

Yesterday I tracked down a strange bug that caused a website to display only a white page: no content, no visible error message.
I found that a regular expression used in preg_replace was the problem.
I used the regex to replace the title HTML tag in the accumulated content just before echoing the HTML. The HTML got rather large on the page where the bug occurred (60 KB, not too large), and it seemed that preg_replace or the regex can only handle a string of a certain length. Or my regex is really messed up (also possible).
Look at this sample program, which reproduces the problem (tested on PHP 5.2.9):
function replaceTitleTagInHtmlSource($content, $replaceWith) {
return preg_replace('#(<title>)([\s\S]+)(<\/title>)#i', '$1'.$replaceWith.'$3', $content);
}
$dummyStr = str_repeat('A', 6000);
$totalStr = '<title>foo</title>';
for($i = 0; $i < 10; $i++) {
$totalStr .= $dummyStr;
}
print 'orignal: ' . strlen($totalStr);
print '<hr />';
$replaced = replaceTitleTagInHtmlSource($totalStr, 'bar');
print 'replaced: ' . strlen($replaced);
print '<hr />';
Output:
orignal: 60018
replaced: 0
So the function gets a string of length 60018 and returns a string of length 0. Not what I wanted to do with my regex.
Changing
for($i = 0; $i < 10; $i++) {
to
for($i = 0; $i < 1; $i++) {
in order to decrease the total string length, the output is:
orignal: 6018
replaced: 6018
When I removed the replacing, the content of the page was displayed without any problems.

It seems like you're running into the backtracking limit.
This is confirmed if you print preg_last_error(): it returns PREG_BACKTRACK_LIMIT_ERROR.
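You can either increase the limit (the pcre.backtrack_limit setting) in your ini file or with ini_set(), or change your regular expression from ([\s\S]+) to (.*?), which will stop it from backtracking so much. Note that . does not match newlines unless you add the s modifier.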
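A minimal sketch of how the failure could be detected instead of silently printing an empty page. The helper name replaceTitleChecked is made up for illustration; it assumes the lazy pattern with the i and s modifiers and checks for the null that preg_replace_callback returns on a PCRE error.

```php
<?php
// Hypothetical helper (not from the original post): replaces the first <title>
// and reports PCRE failures instead of returning an empty result unnoticed.
function replaceTitleChecked($content, $replaceWith) {
    $result = preg_replace_callback(
        '#(<title>).*?(</title>)#is',           // lazy match, s lets . span newlines
        function ($m) use ($replaceWith) {
            return $m[1] . $replaceWith . $m[2]; // a callback avoids $1/$3 escaping issues
        },
        $content,
        1                                        // only the first <title> is replaced
    );
    if ($result === null) {
        // null signals a PCRE error such as PREG_BACKTRACK_LIMIT_ERROR
        throw new RuntimeException('preg error code: ' . preg_last_error());
    }
    return $result;
}

$html = '<title>foo</title>' . str_repeat('A', 60000);
echo strlen(replaceTitleChecked($html, 'bar')); // length is preserved: 60018
```

Throwing on null is only one option; logging the error code and returning the original content unchanged would work as well.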

It has been said many times before on SO, e.g. in "Regex to match the first ending HTML tag" (and it will probably be mentioned again), that regexes are not appropriate for HTML because tags are too irregular.
Use DOM functions where they're available.
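As a rough illustration of the DOM route, assuming the dom extension is available: the helper name replaceTitleWithDom is hypothetical, and saveHTML() may re-serialize the markup (for example by adding a doctype), so this is a sketch rather than a drop-in replacement.

```php
<?php
// Hypothetical helper: sets the text of the first <title> element via DOM.
function replaceTitleWithDom($html, $newTitle) {
    $previous = libxml_use_internal_errors(true); // real-world HTML is rarely valid
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $title = $doc->getElementsByTagName('title')->item(0);
    if ($title === null) {
        return $html; // no <title> found, leave the input untouched
    }
    // Drop the old text and append the new one as a plain text node
    while ($title->firstChild) {
        $title->removeChild($title->firstChild);
    }
    $title->appendChild($doc->createTextNode($newTitle));
    return $doc->saveHTML();
}

$page = '<html><head><title>foo</title></head><body><p>Hello</p></body></html>';
echo replaceTitleWithDom($page, 'bar');
```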

Backtracking: [\s\S]+ will match ALL available characters, then go backwards through the string looking for the </title>. [^<]+ matches only characters that aren't <, so it stops at the < of </title> and needs far less backtracking.
function replaceTitleTagInHtmlSource($content, $replaceWith) {
return preg_replace('#(<title>)([^<]+)(</title>)#i', '$1'.$replaceWith.'$3', $content);
}

Your regex seems to be a little funny.
([\s\S]+) matches every whitespace and non-whitespace character, i.e. anything. You should try (.*?) instead.
Changing your function like this works for me:
function replaceTitleTagInHtmlSource($content, $replaceWith) {
return preg_replace('`\<title\>(.*?)\<\/title\>`i', '<title>'.$replaceWith.'</title>', $content);
}
and the problem seems to be you trying to use $1 and $3 to match and

Related

Find all occurrences of a string in a file

Please keep in mind the file I am opening can be 10 MB to 125 MB. I have researched various ways to open a file and am still not sure which approach is best, if any. Please advise!
I am opening a large file and trying to extract the text between two strings each time the first one occurs. I can find the first string and extract the text up to the second string; however, my loop gives me that result 12 times (the number of times the string occurs in this file). I can see what I am doing wrong in the loop: I basically find the first occurrence and repeat its output 12 times. How can I loop through the file and get the text between the 2nd through 12th occurrences?
Also, any tips for proper opening of large files and handling memory limits would be great.
If this is put in an array, do I lose the whitespace? I am using PRE to display it correctly as it is. Ultimately, I want to parse each string found into smaller elements either in an array or a db. I don't want to get ahead of myself, so ignore the array comments if necessary.
<?php
ini_set('memory_limit', '-1');
/*
Functions
*/
function get_string_between($string, $start, $end){
$string = " ".$string;
$ini = strpos($string,$start);
if ($ini == 0) return "";
$ini += strlen($start);
$len = strpos($string,$end,$ini) - $ini;
return substr($string,$ini,$len);
}
/*
Pre Loop
*/
$string1 = "String 1";
$string2 = "String 2";
$report = file_get_contents('report.rpt','r');
$cbcount = substr_count($report,$string1);
echo $cbcount;
/*
Loop
*/
for ($i=0; $i<$cbcount; $i++){
$output = get_string_between($report, $string1, $string2);
echo "<pre>".$output."</pre>";
}
?>

You're never actually advancing any pointer of any kind, so it has no way of knowing that it already found the first match.
Now, depending on your input, you may be able to just use a regex:
preg_match_all("(".preg_quote($string1).".*?".preg_quote($string2).")s",$report,$matches);
(Replace the entire loop with this)
Then you can var_dump($matches[0]) to see your output.
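A small sketch of the same idea wrapped in a function, with a capturing group so that only the text between the two markers is returned. The name findAllBetween and the sample text are made up for illustration.

```php
<?php
// Hypothetical helper: returns every chunk of text found between $start and $end.
function findAllBetween($text, $start, $end) {
    // preg_quote() escapes regex metacharacters in the markers; the s modifier
    // lets . match newlines so a chunk may span several lines.
    $pattern = '/' . preg_quote($start, '/') . '(.*?)' . preg_quote($end, '/') . '/s';
    if (preg_match_all($pattern, $text, $matches) === false) {
        throw new RuntimeException('preg error code: ' . preg_last_error());
    }
    return $matches[1]; // group 1 holds the text between the markers
}

$sample = "A String 1 foo String 2 B String 1 bar\nbaz String 2";
print_r(findAllBetween($sample, 'String 1', 'String 2'));
```

On a file of 100 MB or more this still needs the whole content in memory, so it only fits when the memory limit allows it.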

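$startfrom = 0;
// strpos() takes the haystack first, then the needle
while (($start = strpos($report, $string1, $startfrom)) !== false) {
$start += strlen($string1); // skip past the start marker itself
$end = strpos($report, $string2, $start);
if ($end === false) break; // no closing marker left
echo "<pre>".substr($report, $start, $end-$start)."</pre>";
$startfrom = $end + strlen($string2);
}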
Regarding large files: instead of reading the entire thing into memory, you can use fopen() and fgets() to read it line by line. When you find a line containing $string1, you start accumulating lines in a variable until you find the line containing $string2. This only works simply if the match strings cannot contain newlines.
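The line-by-line approach could look roughly like this. It is a sketch under the assumptions stated above: the markers never span a line break, and each marker sits on its own line. The function name extractBlocks is hypothetical, and an in-memory stream stands in for the real report file.

```php
<?php
// Hypothetical helper: collects every block of lines from a line containing
// $startMarker up to and including the next line containing $endMarker.
function extractBlocks($handle, $startMarker, $endMarker) {
    $blocks = array();
    $current = null; // null means "currently outside a block"
    while (($line = fgets($handle)) !== false) {
        if ($current === null && strpos($line, $startMarker) !== false) {
            $current = ''; // start accumulating with this line
        }
        if ($current !== null) {
            $current .= $line;
            if (strpos($line, $endMarker) !== false) {
                $blocks[] = $current; // block complete
                $current = null;
            }
        }
    }
    return $blocks;
}

// Demo with an in-memory stream; for a real file use fopen('report.rpt', 'r').
$handle = fopen('php://memory', 'w+');
fwrite($handle, "a\nString 1 x\nfoo\nString 2 y\nb\nString 1\nbar\nString 2\n");
rewind($handle);
print_r(extractBlocks($handle, 'String 1', 'String 2'));
fclose($handle);
```

Only one line plus the block being built is held in memory at a time, so memory_limit does not have to be lifted.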

Random String Generator (PHP)

I am trying to write a PHP function that returns a random string of a given length. I wrote this:
<?
function generate_string($lenght) {
$ret = "";
for ($i = 0; $i < $lenght; $i++) {
$ret .= chr(mt_rand(32,126));
}
return $ret;
}
echo generate_string(150);
?>
The above function generates a random string, but the length of the string is not constant, i.e. one time it is 30 characters, another time 60 (obviously I call it with the same length as input every time). I've searched for other examples of random string generators, but they all use a base string to pick letters from. I am wondering why this method is not working properly.
Thanks!

Educated guess: you are displaying your plain-text string as HTML. The browser, having been told it's HTML, handles it as such. As soon as a < character is generated, the following characters are parsed as an (unknown) HTML tag and, as the HTML standards mandate, are not displayed.
Fix:
echo htmlspecialchars(generate_string(150));

This is the conclusion I reached after testing it for a while: your function works correctly. It depends on what you do with the randomly generated string. If you are simply echoing it, it might generate something like <ck1ask, which will be treated like a tag. Try excluding certain characters from the string.
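A quick way to confirm both answers is to compare the byte length of the raw string with what is sent to the browser. The sketch below reuses the function from the question (with the parameter spelled $length); the variable names $raw and $safe are only for illustration.

```php
<?php
function generate_string($length) {
    $ret = '';
    for ($i = 0; $i < $length; $i++) {
        $ret .= chr(mt_rand(32, 126)); // printable ASCII, including < > & "
    }
    return $ret;
}

$raw = generate_string(150);
$safe = htmlspecialchars($raw); // < becomes &lt;, > becomes &gt;, and so on

echo strlen($raw), "\n"; // always 150, whatever the browser appears to show
echo $safe, "\n";        // safe to embed in an HTML page
```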

This function will work to generate a random string in PHP
function getRandomString($maxlength=12, $isSpecialChar=false)
{
$randomString = '';
// initialise the character set with lower case, upper case and digits
$charSet = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";
// to include special characters, pass $isSpecialChar = true
if ($isSpecialChar) $charSet .= "~##$%^*()_±={}|][";
// append one randomly chosen character per iteration until the requested length is reached
for ($i=0; $i<$maxlength; $i++) $randomString .= $charSet[(mt_rand(0, (strlen($charSet)-1)))];
// return the random string
return $randomString;
}
// call the function with the required string length (default: 12)
$random8char=getRandomString(8);
echo $random8char;
Source: Generate random string in php

PHP method for stripping duplicate chars from a multibyte string?

Arrrgh. Does anyone know how to create a function that's the multibyte character equivalent of the PHP count_chars($string, 3) command?
Such that it will return a list of ONLY ONE INSTANCE of each unique character. If that was English and we had
"aaabggxxyxzxxgggghq xcccxxxzxxyx"
It would return "abcgh qxyz" (Note the space IS counted).
(The order isn't important in this case, can be anything).
If Japanese kanji (not sure browsers will all support this):
漢漢漢字漢字私私字私字漢字私漢字漢字私
And it will return just the 3 kanji used:
漢字私
It needs to work on any UTF-8 encoded string.

Hey Dave, you're never going to see this one coming.
php > $kanji = '漢漢漢字漢字私私字私字漢字私漢字漢字私';
php > $not_kanji = 'aaabcccbbc';
php > $pattern = '/(.)\1+/u';
php > echo preg_replace($pattern, '$1', $kanji);
漢字漢字私字私字漢字私漢字漢字私
php > echo preg_replace($pattern, '$1', $not_kanji);
abcbc
What, you thought I was going to use mb_substr again?
In regex-speak, it's looking for any one character, then one or more instances of that same character. The matched region is then replaced with the one character that matched.
The u modifier turns on UTF-8 mode in PCRE, in which it deals with UTF-8 sequences instead of 8-bit characters. As long as the string being processed is UTF-8 already and PCRE was compiled with Unicode support, this should work fine for you.
Hey, guess what!
$not_kanji = 'aaabbbbcdddbbbbccgggcdddeeedddaaaffff';
$l = mb_strlen($not_kanji);
$unique = array();
for($i = 0; $i < $l; $i++) {
$char = mb_substr($not_kanji, $i, 1);
if(!array_key_exists($char, $unique))
$unique[$char] = 0;
$unique[$char]++;
}
echo join('', array_keys($unique));
This uses the same general trick as the shuffle code. We grab the length of the string, then use mb_substr to extract it one character at a time. We then use that character as a key in an array. We're taking advantage of PHP's ordered arrays: keys stay in the order in which they were first inserted. Once we've gone through the string and identified all of the characters, we grab the keys and join 'em back together in the same order that they first appeared in the string. You also get a per-character count from this technique.
This would have been much easier if there was such a thing as mb_str_split to go along with str_split.
(No Kanji example here, I'm experiencing a copy/paste bug.)
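For what it's worth, PHP 7.4 and later do ship mb_str_split. On older versions, a similar effect can be had by splitting on the empty pattern in UTF-8 mode. The sketch below takes that route; the function name mb_unique_chars is made up, and it assumes the input is valid UTF-8 and PCRE has Unicode support.

```php
<?php
// Hypothetical helper: returns each distinct character once, in order of first appearance.
function mb_unique_chars($input) {
    // '//u' splits between every UTF-8 character; NO_EMPTY drops the empty edges
    $chars = preg_split('//u', $input, -1, PREG_SPLIT_NO_EMPTY);
    if ($chars === false) {
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }
    return implode('', array_unique($chars));
}

echo mb_unique_chars('漢漢漢字漢字私私字私字漢字私漢字漢字私'), "\n"; // 漢字私
echo mb_unique_chars('aaabggxxyxzxxgggghq xcccxxxzxxyx'), "\n";      // abgxyzhq c
```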
Here, try this on for size:
function mb_count_chars_kinda($input) {
$l = mb_strlen($input);
$unique = array();
for($i = 0; $i < $l; $i++) {
$char = mb_substr($input, $i, 1);
if(!array_key_exists($char, $unique))
$unique[$char] = 0;
$unique[$char]++;
}
return $unique;
}
function mb_string_chars_diff($one, $two) {
$left = array_keys(mb_count_chars_kinda($one));
$right = array_keys(mb_count_chars_kinda($two));
return array_diff($left, $right);
}
print_r(mb_string_chars_diff('aabbccddeeffgg', 'abcde'));
/* =>
Array
(
[5] => f
[6] => g
)
*/
You'll want to call this twice, the second time with the left string on the right, and the right string on the left. The output will be different -- array_diff just gives you the stuff in the left side that's missing from the right, so you have to do it twice to get the whole story.
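The "call it twice" step can be folded into one function. This is only a sketch: the names mb_distinct_chars and mb_chars_symmetric_diff are hypothetical, and the splitting again assumes valid UTF-8 input.

```php
<?php
// Hypothetical helper: list of distinct characters in a UTF-8 string.
function mb_distinct_chars($input) {
    $chars = preg_split('//u', $input, -1, PREG_SPLIT_NO_EMPTY);
    return $chars === false ? array() : array_unique($chars);
}

// Characters that occur in exactly one of the two strings.
function mb_chars_symmetric_diff($one, $two) {
    $left = mb_distinct_chars($one);
    $right = mb_distinct_chars($two);
    // array_diff is one-directional, so run it both ways and merge the results
    return array_values(array_merge(array_diff($left, $right), array_diff($right, $left)));
}

print_r(mb_chars_symmetric_diff('aabbccddeeffgg', 'abcdeh')); // f, g, h
```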

Have a look at the iconv_strlen function from the PHP standard library. I can't speak for East Asian encodings, but it works fine for Western and Eastern European languages. In any case it gives you some freedom!

$name = "My string";
$name_array = str_split($name);
$name_array_uniqued = array_unique($name_array);
print_r($name_array_uniqued);
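Much easier. Use str_split to turn the phrase into an array with each character as an element. Then use array_unique to remove duplicates. Pretty simple. Nothing complicated. I like it that way. Note that str_split works on bytes, so this only holds for single-byte characters and will break multibyte UTF-8 text such as the kanji example.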

Using preg_match to get filename

We use custom bbcode in our news posts
[newsImage]imageName.jpg[/newsImage]
And I'd like to use preg_match to get the imageName.jpg from between those tags. The whole post is stored in a variable called $newsPost.
I'm new to regex and I just can't figure out the right expression to use in preg_match to get what I want.
Any help is appreciated. Also, do any of you know a good resource for learning what each of the characters in regex do?

preg_match_all('/\[newsImage\]([^\[]+)\[\/newsImage\]/i', $newsPost, $images);
The variable $images should then contain your list of matches.
http://www.php.net/manual/en/regexp.introduction.php

To answer your second question: A very good regex tutorial is regular-expressions.info.
Among other things, it also contains a regular expression syntax reference.
Since different regex flavors use a different syntax, you'll also want to look at the regex flavor comparison page.

As Rob said, but escaping the last ]:
preg_match('/\[newsImage\]([^\[]+)\[\/newsImage\]/i', $newsPost, $images);
$images[1] will contain the name of image file.
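Put together as a small function, that could look like the sketch below. The name extractNewsImage and the sample post text are made up for illustration.

```php
<?php
// Hypothetical helper: returns the file name inside the first [newsImage] tag, or null.
function extractNewsImage($newsPost) {
    // [^\[]+ captures everything up to the next "[", i.e. the start of [/newsImage]
    if (preg_match('/\[newsImage\]([^\[]+)\[\/newsImage\]/i', $newsPost, $images) === 1) {
        return $images[1];
    }
    return null; // no tag found (or the tag was empty)
}

$newsPost = 'Some intro text [newsImage]imageName.jpg[/newsImage] and more text.';
var_dump(extractNewsImage($newsPost)); // string(13) "imageName.jpg"
```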

This is not exactly what you asked for, but you can replace your [newsImage] tags with <img> tags using the following code. It's not perfect, as it will fall down if you have an empty tag, e.g. [newsImage][/newsImage]
function process_image_code($text) {
//regex looks for [newsImage]sometext[/newsImage]
$urlReg ="/((?:\[newsImage]{1}){1}.{1,}?(?:\[\/newsImage]){1})/i";
$pregResults = preg_split ($urlReg , $text, -1, PREG_SPLIT_DELIM_CAPTURE);
$output = "";
//loop array to build the output string
for($i = 0; $i < sizeof($pregResults); $i++) {
//if the array item has a regex match process the result
if(preg_match($urlReg, $pregResults[$i]) ) {
$pregResults[$i] = preg_replace ("/(?:\[\/newsImage]){1}/i","\" alt=\"Image\" border=\"0\" />",$pregResults[$i] ,1);
// find if it has a http:// at the start of the image url
if(preg_match("/(?:\[newsImage]http:\/\/?){1}/i",$pregResults[$i])) {
$pregResults[$i] = preg_replace ("/(?:\[newsImage]?){1}/i","<img src=\"",$pregResults[$i] ,1);
}else {
$pregResults[$i] = preg_replace ("/(?:\[newsImage]?){1}/i","<img src=\"http://",$pregResults[$i] ,1);
}
$output .= $pregResults[$i];
}else {
$output .= $pregResults[$i];
}
}
return $output;
}

keep HTMLformat after replace some text (using PHP and JS)

I would like to modify HTML like
I am <b>Sadi, novice</b> programmer.
to
I am <b>Sadi, learner</b> programmer.
To do it I will search using the string "novice programmer". How can I do this? Any ideas?
The search uses more than one word, such as "novice programmer"; it could be a whole sentence. Extra whitespace (e.g. newlines, tabs) and any tags must be ignored during the search, but the tags must be preserved during the replacement.
It is a sort of converter. It would be better if it were case-insensitive.
Thank you
Sadi
More clarification:
I got some nice replies with possible solutions, but please keep posting if you have any ideas.
I would like to clarify the problem further in case anyone missed it. The main post shows the problem as an example scenario.
1) The problem is to find and replace a string without considering the tags. Tags may show up within a single word. The string may contain multiple words. Tags only appear in the content string (the document); the search phrase never contains any tags.
We could easily remove all tags and do a plain text operation, but then another problem shows up.
2) The tags must be preserved, even after replacing the text. That is what the example shows.
Thank you again for helping

OK, I think this is what you want. It takes your search and replacement strings, splits them into arrays of words delimited by spaces, generates a regexp that finds the search sentence with any number of whitespace/HTML tags between the words, and replaces it with the replacement sentence, putting the same tags back between the words.
If the word count of the search sentence is higher than that of the replacement, it just uses spaces between any extra words, and if the replacement word count is higher than the search, it adds all 'orphaned' tags at the end. It also handles regexp characters in the find and replace strings.
<?php
function htmlFriendlySearchAndReplace($find, $replace, $subject) {
$findWords = explode(" ", $find);
$replaceWords = explode(" ", $replace);
$findRegexp = "/";
for ($i = 0; $i < count($findWords); $i++) {
$findRegexp .= preg_replace("/([\\$\\^\\|\\.\\+\\*\\?\\(\\)\\[\\]\\{\\}\\\\\\-])/", "\\\\$1", $findWords[$i]);
if ($i < count($findWords) - 1) {
$findRegexp .= "(\s?(?:<[^>]*>)?\s(?:<[^>]*>)?)";
}
}
$findRegexp .= "/i";
$replaceRegexp = "";
for ($i = 0; $i < count($findWords) || $i < count($replaceWords); $i++) {
if ($i < count($replaceWords)) {
$replaceRegexp .= str_replace("$", "\\$", $replaceWords[$i]);
}
if ($i < count($findWords) - 1) {
$replaceRegexp .= "$" . ($i + 1);
} else {
if ($i < count($replaceWords) - 1) {
$replaceRegexp .= " ";
}
}
}
return preg_replace($findRegexp, $replaceRegexp, $subject);
}
?>
Here are the results of a few tests:
Original : <b>Novice Programmer</b>
Search : Novice Programmer
Replace : Advanced Programmer
Result : <b>Advanced Programmer</b>
Original : Hi, <b>Novice Programmer</b>
Search : Novice Programmer
Replace : Advanced Programmer
Result : Hi, <b>Advanced Programmer</b>
Original : I am not a <b>Novice</b> Programmer
Search : Novice Programmer
Replace : Advanced Programmer
Result : I am not a <b>Advanced</b> Programmer
Original : Novice <b>Programmer</b> in the house
Search : Novice Programmer
Replace : Advanced Programmer
Result : Advanced <b>Programmer</b> in the house
Original : <i>I am not a <b>Novice</b> Programmer</i>
Search : Novice Programmer
Replace : Advanced Programmer
Result : <i>I am not a <b>Advanced</b> Programmer</i>
Original : I am not a <b><i>Novice</i> Programmer</b> any more
Search : Novice Programmer
Replace : Advanced Programmer
Result : I am not a <b><i>Advanced</i> Programmer</b> any more
Original : I am not a <b><i>Novice</i></b> Programmer any more
Search : Novice Programmer
Replace : Advanced Programmer
Result : I am not a <b><i>Advanced</i></b> Programmer any more
Original : I am not a Novice<b> <i> </i></b> Programmer any more
Search : Novice Programmer
Replace : Advanced Programmer
Result : I am not a Advanced<b> <i> </i></b> Programmer any more
Original : I am not a Novice <b><i> </i></b> Programmer any more
Search : Novice Programmer
Replace : Advanced Programmer
Result : I am not a Advanced <b><i> </i></b> Programmer any more
Original : <i>I am a <b>Novice</b> Programmer</i> too, now
Search : Novice Programmer too
Replace : Advanced Programmer
Result : <i>I am a <b>Advanced</b> Programmer</i> , now
Original : <i>I am a <b>Novice</b> Programmer</i>, now
Search : Novice Programmer
Replace : Advanced Programmer Too
Result : <i>I am a <b>Advanced</b> Programmer Too</i>, now
Original : <i>I make <b>No money</b>, now</i>
Search : No money
Replace : Mucho$1 Dollar$
Result : <i>I make <b>Mucho$1 Dollar$</b>, now</i>
Original : <i>I like regexp, you can do [A-Z]</i>
Search : [A-Z]
Replace : [Z-A]
Result : <i>I like regexp, you can do [Z-A]</i>

I would do this:
if (preg_match('/(.*)novice((?:<.*>)?\s(?:<.*>)?programmer.*)/',$inString,$attributes)) {
$inString = $attributes[1].'learner'.$attributes[2];
}
It should match any of the following:
novice programmer
novice</b> programmer
novice </b>programmer
novice<span> programmer
A text version of what the regex states would be something like: match any characters until you reach "novice" and put them into a capturing group. Then optionally match something that starts with '<', has any number of characters after it, and ends with '>' (without capturing it), followed by exactly one whitespace character, optionally followed by another such tag (again without capturing it), which must then be followed by "programmer" and any number of characters. Everything after "novice" goes into the second capturing group.
I would do some specific testing though, as I may have missed some stuff. Regex is a programmer's best friend!

Well, there might be a better way, but off the top of my head (assuming that tags won't appear in the middle of words, HTML is well-formed, etc.)...
Essentially, you'll need three things (sorry if this sounds patronising, not intended that way):
1. A method of sub-string matching that ignores tags.
2. A way of making the replacement preserving the tags.
3. A way of putting it all together.
1 - This is probably the most difficult bit. One method would be to iterate through all of the characters in the source string (strings are basically arrays of characters so you can access the characters as if they are array elements), attempting to match as many characters as possible from the search string, stopping when you've either matched all of the characters or run out of characters to match. Any characters between and including '<' and '>' should be ignored. Some pseudo-code (check this over, it's late and there may be mistakes):
findMatch(startingPos : integer, subject : string, searchString : string){
//Variables for keeping track of characters matched, positions, etc.
inTag = false;
matchFound = false;
matchedCharacters = 0;
matchStart = 0;
matchEnd = 0;
for(i from startingPos to length(subject) - 1){
//Work out when entering or exiting tags, ignore tag contents
if(subject[i] == '<' || subject[i] == '>'){
inTag = !inTag;
}
else if(!inTag){
//Check if the character matches expected in search string
if(subject[i] == searchString[matchedCharacters]){
if(!matchFound){
matchFound = true;
matchStart = i;
}
matchedCharacters++;
//If all of the characters have been matched, return the start and end positions of the substring
if(matchedCharacters == length(searchString)){
matchEnd = i;
return matchStart, matchEnd;
}
}
else{
//Reset counts if not found
matchFound = false;
matchedCharacters = 0;
}
}
}
//If no full matches were found, return error
return -1;
}
2 - Split the HTML source code into three strings - the bit you want to work on (between the two positions returned by the matching function) and the part before and after. Split up the bit you want to modify using, for example:
$parts = preg_split("/(<[^>]*>)/",$string, -1, PREG_SPLIT_DELIM_CAPTURE);
Keep a record of where the tags are, concatenate the non-tag segments and perform substring replace on this as normal, then split the modified string up again and reassemble with the tags in place.
3 - This is the easy part, just concatenate the modified part and the other two bits back together.
I may have horribly over complicated this mind, if so just ignore me.
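The tag-skipping match from step 1 can be sketched in PHP roughly as follows. This is not the answerer's exact algorithm: the function name findMatchIgnoringTags is made up, it retries from every text position instead of resetting a counter, it works on bytes, and it matches whitespace literally (so extra newlines or tabs are not ignored).

```php
<?php
// Hypothetical helper: finds $search in $subject while skipping over anything
// between '<' and '>'. Returns array(startOffset, endOffset) or null.
function findMatchIgnoringTags($subject, $search, $startingPos = 0) {
    $subjectLen = strlen($subject);
    $searchLen = strlen($search);
    if ($searchLen === 0) {
        return null;
    }
    $outerInTag = false; // tag state while walking candidate start positions
    for ($start = 0; $start < $subjectLen; $start++) {
        $c = $subject[$start];
        if ($c === '<') { $outerInTag = true; continue; }
        if ($c === '>') { $outerInTag = false; continue; }
        if ($outerInTag || $start < $startingPos) {
            continue; // never start a match inside a tag or before $startingPos
        }
        $matched = 0;
        $inTag = false;
        for ($i = $start; $i < $subjectLen; $i++) {
            $ch = $subject[$i];
            if ($ch === '<') { $inTag = true; continue; }
            if ($ch === '>') { $inTag = false; continue; }
            if ($inTag) { continue; }               // ignore tag contents
            if ($ch !== $search[$matched]) { break; } // mismatch, try next start
            $matched++;
            if ($matched === $searchLen) {
                return array($start, $i);
            }
        }
    }
    return null;
}

$html = 'I am <b>Sadi, novice</b> programmer.';
$pos = findMatchIgnoringTags($html, 'novice programmer');
echo substr($html, $pos[0], $pos[1] - $pos[0] + 1); // novice</b> programmer
```

The returned offsets delimit the slice that step 2 would then split into tag and text segments.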

Unless cOm's already written it, the regex would be the best way to go:
$cleaned_string = preg_replace('/<[^>]*>/', '', $raw_text);
Or something like that. I would need to research/test the regex.
Then you can just use a simple $foobar = str_replace($find, $replace_with, $cleaned_string); to find the text you want to replace.
Didn't realize he wanted to put the HTML back in. It's all regex for that, and more than I know at the moment.
Knowing what I do know, technique-wise I would probably use an expression that didn't ignore whitespace between the words, but did between the < and > brackets, then use the variable-containing abilities of regex to output.

Interesting problem.
I would use the DOM and XPath to find the closest nodes containing that text and then use substring matching to find out which bit of the string is in what node. That will involve character-per-character matching and possible backtracking, though.
Here is the first part, finding the container nodes:
<?php
error_reporting(E_ALL);
header('Content-Type: text/plain; charset=UTF-8');
$doc = new DOMDocument();
$doc->loadHTML(<<<EOD
<p>
<span>
<i>
I am <b>Sadi, novice</b> programmer.
</i>
</span>
</p>
<ul>
<li>
<div>
I am <em>Cornholio, novice</em> programmer of television shows.
</div>
</li>
</ul>
EOD
);
$xpath = new DOMXPath($doc);
// First, get a list of all nodes containing the text anywhere in their tree.
$nodeList = $xpath->evaluate('//*[contains(string(.), "programmer")]');
$deepestNodes = array();
// Now only keep the deepest nodes, because the XPath query will also return HTML, BODY, ...
foreach ($nodeList as $node) {
$deepestNodes[] = $node;
$ancestor = $node;
while (($ancestor = $ancestor->parentNode) && ($ancestor instanceof DOMElement)) {
$deepestNodes = array_filter($deepestNodes, function ($existingNode) use ($ancestor) {
return ($ancestor !== $existingNode);
});
}
}
foreach ($deepestNodes as $node) {
var_dump($node->tagName);
}
I hope that helps you along.

Since you didn't give exact specifics on what you will use this for, I will use your example of "I am sadi, novice programmer".
$before = 'I am <b>sadi, novice</b> programmer';
$after = preg_replace ('/I am (<.*>)?(.*), novice(<.*>)? programmer/','I am $1$2, learner$3 programmer',$before);
Alternatively, for any text:
$string = '<b>Hello</b>, world!';
$orig = 'Hello';
$replace = 'Goodbye';
$pattern = "/(<.*>)?$orig(<.*>)?/";
$final = "$1$replace$2";
$result = preg_replace($pattern,$final,$string);
//$result should now be '<b>Goodbye</b>, world!'
Hope that helped. :d
Edit: An example of your example, with the second piece of code:
$string = 'I am sadi, novice programmer.';
$orig = 'novice';
$replace = 'learner';
$pattern = "/(<.>)?$orig(<.>)?/";
$final = "$1$replace$2";
$result = htmlspecialchars(preg_replace($pattern,$final,$string));
echo $result;
The only problem is if you were searching for something that was more than a word long.
Edit 2: Finally came up with a way to do it across multiple words. Here's the code:
function htmlreplace($string,$orig,$replace)
{
$orig = explode(' ',$orig);
$replace = explode(' ',$replace);
$result = $string;
while (count($orig)>0)
{
$shift = array_shift($orig);
$rshift = array_shift($replace);
$pattern = "/$shift\s?(<.*>)?/";
$replacement = "$rshift$1";
$result = preg_replace($pattern,$replacement,$result);
}
$result .= implode(' ',$replace);
return $result;
}
Have fun! :d
