Settings that could influence PHP str_replace behaviour


I am currently working on a replacement tool that dynamically replaces certain strings (including HTML) in a website using a Smarty output filter.
For the replacement I am using PHP's str_ireplace function. The code to be replaced and the replacement code are read from a database, and the result is passed to the Smarty output (using an output filter), similar to the following:
$tpl_source = str_ireplace($replacements['sourceHTML'], $replacements['replacementHTML'], $tpl_source);
The problem is that although it works great on my dev server, once uploaded to the live server the replacements occasionally fail. The same replacements work just fine on my dev version. After some investigation and googling I could not find much about this issue. So my question is: what could influence str_replace's behaviour?
Thanks
Edit with replacement example:
$htmlsource = file_get_contents('somefile.html');
$newstr = str_replace('Some text', 'sometext', $htmlsource); // the text to be replaced does exist in the html source
fails to replace. After some checking, it looks like the combination of "> creates a problem. But just the combination of it. If I try to change only (") it works, if I try to change only (>) it works.

It might be that special characters like umlauts do not display correctly on the live server, so str_replace() would fail if there are special characters inside the string you want to replace.

Is the input string identical on both systems? Have you verified this? Are you sure?
Things to check:
Are the HTML attributes in the same order?
Are the attribute values using the same kind of quote marks? (eg <a href='#'> vs <a href="#">)
Is there any other stray HTML code getting in there?
Is the entity encoding the same? (eg &nbsp; vs &#160; - same character; different HTML)
Is the character-set the same? (eg utf-8 vs ISO 8859-1: Accented characters will be encoded differently)
Any of these things will affect the result and produce the failures you're describing.
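
The checks above can be roughed out in code. The following is only a sketch: diagnoseNeedle() is a hypothetical helper name, and it assumes both strings are meant to be UTF-8. It reports whether the search string is found as-is, and whether it would be found if whitespace or entity-encoding differences were ignored, which hints at where the two inputs diverge.

```php
<?php
// Hypothetical diagnostic helper (not part of PHP or Smarty).
// Assumes UTF-8 input; reports why $needle may not be found in $haystack.
function diagnoseNeedle(string $needle, string $haystack): array
{
    // Collapse every run of whitespace to a single space.
    $squash = static fn(string $s): string => preg_replace('/\s+/', ' ', $s);
    // Turn named and numeric entities into the characters they stand for.
    $decode = static fn(string $s): string =>
        html_entity_decode($s, ENT_QUOTES | ENT_HTML5, 'UTF-8');

    return [
        'exact_match'               => strpos($haystack, $needle) !== false,
        'needle_is_utf8'            => mb_check_encoding($needle, 'UTF-8'),
        'haystack_is_utf8'          => mb_check_encoding($haystack, 'UTF-8'),
        'match_ignoring_whitespace' => strpos($squash($haystack), $squash($needle)) !== false,
        'match_ignoring_entities'   => strpos($decode($haystack), $decode($needle)) !== false,
    ];
}
```

If exact_match is false but one of the "ignoring" checks is true, the two systems are most likely feeding str_replace differently formatted or differently encoded input. Dumping both strings with bin2hex() then shows the exact bytes that differ.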

This was a tricky problem, and it ended up having nothing to do with the str_replace function itself.
We are using Smarty as a templating system. The str_replace function was used by a Smarty output filter to change the HTML on some occasions, just before it was delivered to the user.
Here is the Smarty output filter code:
function smarty_outputfilter_replace($tpl_source, &$smarty)
{
    $replacements = Content::getReplacementsForPage();
    if (is_array($replacements))
    {
        foreach ($replacements as $replacementData)
        {
            $tpl_source = str_replace($replacementData['sourcecode'], $replacementData['replacementcode'], $tpl_source);
        }
    }
    return ($tpl_source);
}
So this code failed now and then for no apparent reason... until I realized that the HTML code from the Smarty template was being manipulated by an Apache filter.
As a result, the source code seen in the browser (which we were using as the code to be replaced by something else) was not identical to the template code (which Smarty was trying to modify). Result? str_replace failed :)

Related

Replace // comments by /* comments */ Except in URLs [duplicate]

I need to remove the comment lines from my code.
preg_replace('!//(.*)!', '', $test);
It works fine, but it also strips website URLs, leaving only http:
To avoid this I changed it to preg_replace('![^:]//(.*)!', '', $test);
That works, but the problem appears when my code has a line like the one below:
$code = 'something';// comment here
The semicolon is removed together with the comment, so after the replacement the above code becomes
$code = 'something'
which generates an error.
I just need to delete the single-line comments while the URLs stay intact.
Please help. Thanks in advance

try this
preg_replace('#(?<!http:)//.*#','',$test);
also read more about PCRE assertions http://cz.php.net/manual/en/regexp.reference.assertions.php
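
To illustrate how the negative lookbehind behaves, here is a small sketch. The function name stripLineComments() is made up for this example, and the pattern is a slight variant of the one above: it uses (?<!:) instead of (?<!http:), on the assumption that any // directly after a colon belongs to a URL scheme such as https:. Unlike [^:], a lookbehind consumes no characters, so the semicolon before the comment survives.

```php
<?php
// Illustrative sketch only: removes "// ..." up to the end of the line,
// unless the slashes directly follow a colon (as in "https://").
function stripLineComments(string $source): string
{
    return preg_replace('#(?<!:)//.*#', '', $source);
}

echo stripLineComments("\$code = 'something';// comment here"), "\n";
echo stripLineComments("\$url = 'https://example.com';"), "\n";
```

This still mangles any string literal that contains // without a preceding colon, so it is a stopgap and not a parser.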

If you want to parse a PHP file, and manipulate the PHP code it contains, the best solution (even if a bit difficult) is to use the Tokenizer : it exists to allow manipulation of PHP code.
Working with regular expressions for such a thing is a bad idea...
For instance, you thought about http:// ; but what about strings that contain // ?
Like this one, for example :
$str = "this is // a test";
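
A minimal sketch of the tokenizer approach, assuming the input really is PHP source (the function name stripPhpComments() is made up for this example). token_get_all() labels comments as T_COMMENT or T_DOC_COMMENT, so a // inside a string literal is never touched:

```php
<?php
// Sketch: rebuild PHP source from its tokens, skipping comment tokens.
// Requires the tokenizer extension (enabled by default).
function stripPhpComments(string $source): string
{
    $out = '';
    foreach (token_get_all($source) as $token) {
        if (!is_array($token)) {
            $out .= $token;            // single-character token such as ";" or "="
            continue;
        }
        [$id, $text] = $token;
        if ($id === T_COMMENT || $id === T_DOC_COMMENT) {
            // Older PHP versions include the trailing newline in "//" comments;
            // keep it so following code stays on its own line.
            if (substr($text, -1) === "\n") {
                $out .= "\n";
            }
            continue;
        }
        $out .= $text;
    }
    return $out;
}
```

The source passed in has to start with an opening <?php tag, otherwise the tokenizer treats everything as inline HTML.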

This can get complicated fast. There are more uses for // in strings. If you are parsing PHP code, I highly suggest you take a look at the PHP tokenizer. It's specifically designed to parse PHP code.
Question: Why are you trying to strip comments in the first place?
Edit: I see now you are trying to parse JavaScript, not PHP. So, why not use a javascript minifier instead? It will strip comments, whitespace and do a lot more to make your file as small as possible.

PHP Escape a string if it hasn't already been escaped with entities

I'm using a 3rd party API that seems to return its data with the entity codes already in there. Such as The Lion&#8217;s Pride.
If I print the string as-is from the API it renders just fine in the browser (in the example above it would put in an apostrophe). However, I can't trust that the API will always use the entities in the future so I want to use something like htmlentities or htmlspecialchars myself before I print it. The problem with this is that it will encode the ampersand in the entity code again and the end result will be The Lion&amp;#8217;s Pride in the HTML source which doesn't render anything user friendly.
How can I use htmlentities or htmlspecialchars only if it hasn't already been used on the string? Is there a built-in way to detect if entities are already present in the string?

No one seems to be answering your actual question, so I will
How can I use htmlentities or htmlspecialchars only if it hasn't already been used on the string? Is there a built-in way to detect if entities are already present in the string?
It's impossible. What if I'm making an educational post about HTML entities and I want to actually print this on the screen:
The Lion&#8217;s Pride
... it would need to be encoded as...
The Lion&amp;#8217;s Pride
But what if that was the actual string we wanted to print on the screen? ... and so on.
Bottom line is, you have to know what you've been given and work from there – which is where the advice from the other answers comes in – which is still just a workaround.
What if they give you double-encoded strings? What if they start wrapping the html-encoded strings in XML? And then wrap that in JSON? ... And then the JSON is converted to binary strings? the possibilities are endless.
It's not impossible for the API you depend on to suddenly switch the output type, but it's also a pretty big violation of the original contract with your users. To some extent, you have to put some trust in the API to do what it says it's going to do. Unit/Integration tests make up the rest of the trust.
And because you could never write a program that works for any possible change they could make, it's senseless to try to anticipate any change at all.

Decode the string, then re-encode the entities. (Using html_entity_decode())
$string = htmlspecialchars(html_entity_decode($string));
https://eval.in/662095

There is NO WAY to do what you ask for!
You must know what kind of data is the service giving back.
Anything else would be guessing.
Example:
What if the service is giving back &amp; but is not escaping?
You would guess it IS escaping, so you would wrongly interpret it as & while the correct value is &amp;

I think the best solution, is first to decode all html entities/special chars from the original string, and then html encode the string again.
That way you will end up with a correctly encoded string, no matter if the original string was encoded or not.
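
A rough sketch of this decode-then-encode idea, assuming the data is UTF-8 (normalizeEntities() is a made-up name for this example):

```php
<?php
// Sketch: bring a possibly-encoded string into one known state by
// decoding all entities first and then escaping the HTML special characters.
function normalizeEntities(string $value): string
{
    $decoded = html_entity_decode($value, ENT_QUOTES | ENT_HTML5, 'UTF-8');
    return htmlspecialchars($decoded, ENT_QUOTES, 'UTF-8');
}

// Both calls print the same text, with a real apostrophe character:
echo normalizeEntities('The Lion&#8217;s Pride'), "\n";
echo normalizeEntities("The Lion\u{2019}s Pride"), "\n";
```

Note the trade-off raised in the other answers: if the API ever sends text that is supposed to show a literal entity such as &amp;, this round trip cannot tell the difference and will change its meaning.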

You also have the option of using htmlspecialchars_decode();
$string = htmlspecialchars_decode($string);

It's already in htmlentities:
php > echo htmlentities('Hi&amp;mom', ENT_HTML5, ini_get('default_charset'), false);
Hi&amp;mom
php > echo htmlentities('Hi&amp;mom', ENT_HTML5, ini_get('default_charset'), true);
Hi&amp;amp&semi;mom
Just use the optional 4th argument (double_encode) to NOT double-encode.
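
The same flag exists on htmlspecialchars. A short sketch, assuming UTF-8 input (encodeOnce() is a made-up wrapper name):

```php
<?php
// Sketch: escape HTML special characters but leave existing entities alone
// by passing false as the 4th argument (double_encode).
function encodeOnce(string $value): string
{
    return htmlspecialchars($value, ENT_QUOTES, 'UTF-8', false);
}

echo encodeOnce('The Lion&#8217;s Pride'), "\n"; // entity is left untouched
echo encodeOnce('Tom & Jerry'), "\n";            // bare ampersand becomes &amp;
```

This is still a guess about what the input means: a string that legitimately contains the text &amp; is left as-is and will render as a plain ampersand.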

Replace '&' with 'and' on the fly in PHP

Is there a way to replace the character & with and in a PHP web form as the user types it rather than after submitting the form?
When & is inserted into our database our search engine doesn't interpret the & correctly, replacing it with &amp; and returning an incorrect search result (i.e. not the result that included &).
Here is the field we would like to run this on:
<input type="text" name="project_title" id="project_title" value="<?php echo $project_title; ?>" size="60" class="btn_input2"/>

Is there a way to replace the character & with and in a PHP web form as the user types it rather than after submitting the form?
PHP is on the server, it has no control over anything taking place under any circumstances what-so-ever on the client-side. It sends raw text from the web server, a 100megaton thermonuclear device explodes, and PHP never exists anymore after the content is sent. Just the document received on your client side remains. To work with effects on your client side, you need to work with JavaScript.
To do that, you would pick your favorite JavaScript library and add an event listener for "keyup" events. Replace ampersands with "and", and drop the replacement text back in the box. mugur has posted an answer that shows you how to do this.
This is a horrible solution in practice because your users will be screaming for bloody justice to deliver them from such an awful user experience. What you've ended up doing is replacing the input text with something they didn't want. Other search tools do this, why can't yours? You hit backspace, then what? When you hit in the text, you probably lose your cursor position.
Not only that, you're treating a symptom rather than the cause. Look at why you're doing this:
The reason is when & is inserted into our database our search engine flips out and replaces it with &amp; which then returns an incorrect result (i.e. not the result that included &).
No, your database and search engine do no such thing as "flipping out". You're not aware of what's going on and try to treat symptoms rather than learn the cause and fix it. Your symptom cure will create MORE issues down the road. Don't do it.
&amp; is an HTML entity code. Every "special" character has one. This means your database also encodes > as &gt;, as well as characters with accents in them (such as French, German, or Spanish texts). You get "wrong" results for all of these.
You didn't show any code so you don't get any code. But here's what your problem is.
Your code is converting raw text into HTML Entity codes where appropriate, you're searching against a non-encoded string.
Option 1: Fix the cause
Encode your search text with HTML entities so that it matches for all these cases. Match accented characters with their non-accented cousins so searching for "francais" might return "français".
Option 2: Fix one symptom
Do a string replace for ampersands either on the client or server side, your search breaks for all other encodings. Never find texts such as "Bob > Sally". Never find "français".
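
As a sketch of Option 1, and assuming the stored text was written through htmlentities() with these same flags and charset (an assumption, since the original code was never shown), the search term can be run through the identical encoding before it is compared. encodeSearchTerm() is a made-up name:

```php
<?php
// Sketch: encode the user's search term the same way the stored text
// is assumed to have been encoded, so both sides of the comparison match.
function encodeSearchTerm(string $term): string
{
    return htmlentities($term, ENT_QUOTES, 'UTF-8');
}

echo encodeSearchTerm('R&D'), "\n";         // R&amp;D
echo encodeSearchTerm('Bob > Sally'), "\n"; // Bob &gt; Sally
```

The encoded term is then what goes into the search query (bound as a parameter, not concatenated into SQL).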

Before submitting the form you'd need to use JavaScript to change as the user types it in. Not ideal since JS can be turned off.
You'd be much better to "clean" the ampersands after submitting but before inserting into the database.
A simple str_replace should work:
str_replace(' & ',' and ', $_POST['value']);
But as others have pointed out, this isn't a good solution. The best solution would be to encode the ampersands as they go into the database (which seems to be happening just now), then modify your search script to allow for this.

You can do that as they complete the form with jquery like this:
$('#input').change(function() { // edited conforming Icognito suggestion
    // a regex with the g flag replaces every ampersand, not just the first one
    var some_val = $('#input').val().replace(/&/g, 'and');
    $('#input').val( some_val );
});
EDIT: working example (http://jsfiddle.net/4gXZW/13/)
JS:
$('.target').change(function() {
    $('.target').val($('.target').val().replace(/&/g, 'and'));
});
HTML:
<input class="target" type="text" value="Field 1" />
Otherwise you can do that in PHP before the insert sql.
$to_insert = str_replace("&", "and", $_POST['your_variable']);

XML parser error: entity not defined

I have searched stackoverflow on this problem and did find a few topics, but I feel like there isn't really a solid answer for me on this.
I have a form that users submit and the field's value is stored in a XML file. The XML is set to be encoded with UTF-8.
Every now and then a user will copy/paste text from somewhere and that's when I get the "entity not defined error".
I realize XML only supports a select few entities and anything beyond that is not recognized - hence the parser error.
From what I gather, there's a few options I've seen:
I can find and replace all &nbsp; and swap them out with &#160; or an actual space.
I can place the code in question within a CDATA section.
I can include these entities within the XML file.
What I'm doing with the XML file is that the user can enter content into a form, it gets stored in a XML file, and that content then gets displayed as XHTML on a Web page (parsed with SimpleXML).
Of the three options, or any other option(s) I'm not aware of, what's really the best way to deal with these entities?
Thanks,
Ryan
UPDATE
I want to thank everyone for the great feedback. I actually determined what caused my entity errors. All the suggestions made me look into it more deeply!
Some textboxes were plain old textboxes, but my textareas were enhanced with TinyMCE. It turns out, on closer inspection, that the PHP warnings always referenced data from the TinyMCE-enhanced textareas. Later I noticed on a PC that all the characters were taken out (because it couldn't read them), but on a Mac you could see little square boxes referencing the Unicode number of that character. The reason it showed up as squares on a Mac in the first place is that I used utf8_encode to encode data that wasn't in UTF-8 to prevent other parsing errors (which is somehow also related to TinyMCE).
The solution to all this was quite simple:
I added this line entity_encoding : "utf-8" in my tinyMCE.init. Now, all the characters show up the way they are supposed to.
I guess the only thing I don't understand is why the characters still show up when placed in textboxes, because nothing converts them to UTF, but with TinyMCE it was a problem.

I agree that it is purely an encoding issue. In PHP, this is how I solved this problem:
Before passing the html-fragment to SimpleXMLElement constructor I decoded it by using html_entity_decode.
Then further encoded it using utf8_encode().
$headerDoc = '<temp>' . utf8_encode(html_entity_decode($headerFragment)) . '</temp>';
$xmlHeader = new SimpleXMLElement($headerDoc);
Now the above code does not throw any undefined entity errors.

You could HTML-parse the text and have it re-escaped with the respective numeric entities only (like: &nbsp; → &#160;). In any case, simply using un-sanitized user input is a bad idea.
All of the numeric entities are allowed in XML, only the named ones known from HTML do not work (with the exception of &amp;, &quot;, &lt;, &gt;, &apos;).
Most of the time though, you can just write the actual character (&ouml; → ö) to the XML file so there is no need to use an entity reference at all. If you are using a DOM API to manipulate your XML (and you should!) this is your safest bet.
Finally (this is the lazy developer solution) you could build a broken XML file (i.e. not well-formed, with entity errors) and just pass it through tidy for the necessary fix-ups. This may work or may fail depending on just how broken the whole thing is. In my experience, tidy is pretty smart, though, and lets you get away with a lot.
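
A minimal sketch of the DOM route in PHP, assuming UTF-8 input. The helper name buildXmlFragment() and the element name content are made up for this example. Named HTML entities are first decoded into real characters, and the DOM then does whatever escaping XML needs when the text node is serialized:

```php
<?php
// Sketch: store user text in XML as real characters instead of HTML entities.
// Requires the dom and simplexml extensions (both enabled by default).
function buildXmlFragment(string $userInput): string
{
    $doc  = new DOMDocument('1.0', 'UTF-8');
    $root = $doc->createElement('content');   // element name is illustrative
    $doc->appendChild($root);

    // &eacute;, &nbsp; etc. become the characters they stand for.
    $text = html_entity_decode($userInput, ENT_QUOTES | ENT_HTML5, 'UTF-8');

    // createTextNode() escapes & and < itself, so no manual escaping is needed.
    $root->appendChild($doc->createTextNode($text));

    return $doc->saveXML();
}

$xml    = buildXmlFragment('caf&eacute;&nbsp;&amp; bar <b>');
$parsed = simplexml_load_string($xml);   // parses without "entity not defined"
```

Because only the five predefined XML entities can end up in the output, SimpleXML has nothing undefined to complain about when reading the file back.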

1. I can find and replace all [&nbsp;?] and swap them out with [&#160;?] or an actual space.
This is a robust method, but it requires you to have a table of all the HTML entities (I assume the pasted input is coming from HTML) and to parse the pasted text for entity references.
2. I can place the code in question within a CDATA section.
In other words disable parsing for the whole section? Then you would have to parse it some other way. Could work.
3. I can include these entities within the XML file.
You mean include the entity definitions? I think this is an easy and robust way, if you don't mind making the XML file quite a bit bigger. You could have an "included" file (find one on the web) which is an external entity, which you reference from the top of your main XML file.
One downside is that the XML parser you use has to be one that processes external entities (which not all parsers are required to do). And it must correctly resolve the (possibly relative) URL of the external entity to something accessible. This is not too bad but it may increase constraints on your processing tools.
4. You could forbid non-XML in the pasted content. Among other things, this would disallow entity references that are not predefined in XML (the 5 that Tomalak mentioned) or defined in the content itself. However this may violate the requirements of the application, if users need to be able to paste HTML in there.
5. You could parse the pasted content as HTML into a DOM tree by setting someDiv.innerHTML = thePastedContent;
In other words, create a div somewhere (probably display=none, except for debugging). Say you then have a javascript variable myDiv that holds this div element, and another variable myField that holds the element that is your input text field. Then in javascript you do
myDiv.innerHTML = myField.value;
which takes the unparsed text from myField, parses it into an HTML DOM tree, and sticks it into myDiv as HTML content.
Then you would use some browser-based method for serializing (= "de-parsing") the DOM tree back into XML. See for example this question. Then you send the result to the server as XML.
Whether you want to do this fix in the browser or on the server (as @Hannes suggested) will depend on the size of the data, how quick the response has to be, how beefy your server is, and whether you care about hackers sending not-well-formed XML on purpose.

Use "htmlentities()" with flag "ENT_XML1": htmlentities($value, ENT_XML1);
If you use "SimpleXMLElement" class:
$SimpleXMLElement->addChild($name, htmlentities($value, ENT_XML1));

If you want to convert all characters, this may help you (I wrote it a while back) :
http://www.lautr.com/convert-all-applicable-characters-to-numeric-entities-for-use-in-xml
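// Note: assumes $content is valid UTF-8; mb_ord() needs PHP 7.2+ with mbstring.
function _convertAlphaEntitysToNumericEntitys($entity) {
    // Decode the named entity; unknown names come back unchanged and are left alone.
    $char = html_entity_decode($entity[0], ENT_QUOTES | ENT_HTML5, 'UTF-8');
    return $char === $entity[0] ? $entity[0] : '&#'.mb_ord($char, 'UTF-8').';';
}
$content = preg_replace_callback(
    '/&([\w\d]+);/i',
    '_convertAlphaEntitysToNumericEntitys',
    $content);
function _convertAsciOver127toNumericEntitys($entity) {
    // mb_ord() returns the full code point; ord() would only see the first byte.
    return '&#'.mb_ord($entity[0], 'UTF-8').';';
}
// The u flag makes the pattern match whole UTF-8 characters outside ASCII.
$content = preg_replace_callback(
    '/[^\x00-\x7F]/u',
    '_convertAsciOver127toNumericEntitys', $content);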

This question is a general problem for any language that parses XML or JSON (so, basically, every language).
The above answers are for PHP, but a Perl solution would be as easy as...
my $excluderegex =
'^\n\x20-\x20' . # Don't Encode Spaces
'\x30-\x39' . # Don't Encode Numbers
'\x41-\x5a' . # Don't Encode Capitalized Letters
'\x61-\x7a' ; # Don't Encode Lowercase Letters
# in case anything is already encoded
$value = HTML::Entities::decode_entities($value);
# encode properly to numeric
$value = HTML::Entities::encode_numeric($value, $excluderegex);

Scrape a price off a website

I'm trying to scrape a price from a web page using PHP and Regexes. The price will be in the format £123.12 or $123.12 (i.e., pounds or dollars).
I'm loading up the contents using libcurl. The output of which is then going into preg_match_all. So it looks a bit like this:
$contents = curl_exec($curl);
preg_match_all('/(?:\$|£)[0-9]+(?:\.[0-9]{2})?/', $contents, $matches);
So far so simple. The problem is, PHP isn't matching anything at all - even when there are prices on the page. I've narrowed it down to there being a problem with the '£' character - PHP doesn't seem to like it.
I think this might be a charset issue. But whatever I do, I can't seem to get PHP to match it! Anyone have any ideas?
(Edit: I should note if I try using the Regex Test Tool using the same regex and page content, it works fine)

Have you tried putting a \ in front of the £?
preg_match_all('/(\$|\£)[0-9]+(\.[0-9]{2})/', $contents, $matches);
I tried this expression in .NET with \£ and it works. I just edited it and removed some ":".
Read my comment about the possibility of Curl giving you bad encoding (comment of this post).

Maybe the pound sign shows up as its HTML entity instead? I think you should try your regexp in some sort of regex testing program (i.e. match it against fixed text locally).
i'd change my regexp like this: '/(?:\$|£)\d+(?:\.\d{2})?/'
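
If the cause is a charset or entity mismatch, one way to rule both out is to normalize the page before matching. This is only a sketch: extractPrices() is a made-up name, and the source charset has to be supplied by the caller (for example from the Content-Type header), since guessing it is unreliable.

```php
<?php
// Sketch: convert the fetched page to UTF-8, decode entities such as &pound;,
// then match with the u flag so the pound sign is treated as one character.
function extractPrices(string $html, string $sourceCharset = 'UTF-8'): array
{
    $utf8 = mb_convert_encoding($html, 'UTF-8', $sourceCharset);
    $utf8 = html_entity_decode($utf8, ENT_QUOTES | ENT_HTML5, 'UTF-8');

    // \x{00A3} is the pound sign; writing it as an escape avoids depending
    // on the encoding this PHP file itself is saved in.
    preg_match_all('/[$\x{00A3}]\d+(?:\.\d{2})?/u', $utf8, $matches);

    return $matches[0];
}
```

Called as extractPrices($contents, 'ISO-8859-1') for a Latin-1 page, this returns the matched prices as UTF-8 strings. Like the patterns above, it does not handle thousand separators.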

This should work for simple values.
'#(?:\$|\£|\€)(\d+(?:\.\d+)?)#'
This will not work with thousand separator like 234,343 and 34,454.45.
