First, the string is being pulled from an XML file.
There's a special character that I am trying to replace: '£'
When I use str_replace like so:
$ability1 = str_replace("£", "", $ability);
This is what var_dump shows:
string(138) "Argothian Pixies can't be blocked by artifact creatures.�Prevent all damage that would be dealt to Argothian Pixies by artifact creatures."
Once $ability1 is passed along and WordPress inserts it into the post, this is the result:
Argothian Pixies can’t be blocked by artifact creatures.
It deletes everything after the � character.
Why would £ be changed to � when it's supposed to be replaced with ""? I'm not quite sure what I'm missing.
Make sure the string is using the correct encoding: try encoding or decoding to UTF-8 and then apply the str_replace.
Maybe your string is in UTF-8? In PHP, you would have to do something like this:
// Convert to ISO-8859-1, where £ is the single byte 0xA3
$ability1 = utf8_decode($ability);
$ability1 = preg_replace("/\xA3/", "", $ability1);
// Convert back to UTF-8
$ability1 = utf8_encode($ability1);
How is the XML file encoded? I suspect it may be UTF-8, in which case you'll need to use a function such as utf8_decode() to handle it correctly in your code (assuming your code is in ANSI).
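A minimal sketch of the advice in these answers, assuming the mbstring extension is available and that any non-UTF-8 input is ISO-8859-1 (the thread never confirms the feed's encoding). The helper name remove_pound_sign is made up for illustration. It normalizes the string to UTF-8 first and only then removes the pound sign, so no stray byte is left behind to show up as �:

```php
<?php
// Hypothetical helper; the ISO-8859-1 fallback is an assumption.
function remove_pound_sign(string $text): string
{
    // A lone 0xA3 byte is not valid UTF-8, so treat such input as ISO-8859-1.
    if (!mb_check_encoding($text, 'UTF-8')) {
        $text = mb_convert_encoding($text, 'UTF-8', 'ISO-8859-1');
    }
    // "\u{00A3}" is the UTF-8 pound sign regardless of how this file is saved.
    return str_replace("\u{00A3}", '', $text);
}

$fromLatin1 = remove_pound_sign("creatures.\xA3Prevent all damage");
$fromUtf8   = remove_pound_sign("creatures.\u{00A3}Prevent all damage");
```

Writing the search string as an escape sequence avoids depending on the encoding of the PHP source file itself.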
So I have this string
"5. Before Dash—AfterDash"
inside a file.
However, when opening the file using file_get_contents, the dash gets converted to weird characters. Here's what it looks like if I echo the file_get_contents output:
"5. Before Dash�AfterDash"
How do I go about converting that � character to a valid long dash again in PHP? And how can I prevent � from appearing in place of other characters as well?
This causes json_decode to fail when I try to decode the string.
To generalize Amal Murali's comment: Make sure the encoding of the text file, the encoding of the PHP file (both can be determined and changed with Notepad++ or other editors) and the output encoding (set by PHP's header() function as stated by Amal or by HTML meta tags) are the same. In case you work with any kind of database, make sure the connection uses the same encoding, too.
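As a rough illustration of keeping encodings consistent, the sketch below assumes the file was saved as Windows-1252, where the long dash is the single byte 0x97 (the question does not say which encoding the file actually uses), and that mbstring is installed. Converting to UTF-8 right after reading lets the JSON functions accept the text:

```php
<?php
// Simulates file_get_contents() output for a Windows-1252 file (assumed encoding).
$raw = "5. Before Dash\x97AfterDash";

// Convert once, right after reading, so everything downstream sees UTF-8.
$utf8 = mb_convert_encoding($raw, 'UTF-8', 'Windows-1252');

// JSON functions require UTF-8: the raw bytes fail, the converted text works.
$rawJson  = json_encode($raw);                          // false (malformed UTF-8)
$utf8Json = json_encode($utf8, JSON_UNESCAPED_UNICODE);
$decoded  = json_decode($utf8Json);
```

If the file turns out to be in a different encoding, only the third argument of mb_convert_encoding needs to change.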
I am consuming an XML feed which contains a great deal of whitespace.
When I echo out the raw feed, it looks as though the columns of the tabled data are properly formatted with just the white space.
I have tried many regex patterns to remove it, to only allow visible characters, trim, chop, utf-8 encode/decode, nothing is touching it. It's like it is laughing in my face when I echo out a value and see this:
string(17) "72"
I opened the data in Notepad++ with "show all characters" on, and it simply shows it as spaces. I am at a loss as to where to go with this.
I did receive the following error:
simplexml_load_string(): Entity: line 265: parser error : Input is not proper UTF-8, indicate encoding !
Bytes: 0xB0 0x43 0x20 0x74
I just found this regex (untested)
$xml_data = preg_replace("/>\s+</", "><", $xml_data);
If you are using the xml parser, I think you can use the 'XML_OPTION_SKIP_WHITE' option referenced here:
http://php.net/manual/en/function.xml-parser-set-option.php
Try running the data through utf8_encode(). It might seem like a hack, but it seems like the originating data isn't properly set up.
My theory is that you're grabbing it with the wrong encoding, and the proper solution would be to load it differently.
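One way to load it differently, sketched under the assumption that the feed is ISO-8859-1 (the reported bytes 0xB0 0x43 would read as "°C" in that encoding) and that the invisible padding includes non-breaking spaces, which trim() does not remove by default. The function name is hypothetical and mbstring is assumed:

```php
<?php
// Hypothetical helper: convert an assumed ISO-8859-1 value to UTF-8,
// then strip leading/trailing whitespace including non-breaking spaces.
function normalize_feed_value(string $raw): string
{
    $utf8 = mb_convert_encoding($raw, 'UTF-8', 'ISO-8859-1');
    return preg_replace('/^[\s\x{00A0}]+|[\s\x{00A0}]+$/u', '', $utf8) ?? '';
}

$value = normalize_feed_value("  \xA0\xA072\xA0 \t");
$temp  = normalize_feed_value("21\xB0C ");

// bin2hex() reveals bytes that an editor may render as plain spaces.
$hex = bin2hex("72\xA0");
```

Dumping a suspicious value with bin2hex() first is a cheap way to confirm which bytes are really there before choosing a conversion.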
Solution
My very hacky workaround that works:
$raw = file_get_contents('http://stupidwebservice.com/xmldata.asmx/Feed');
$raw = urlencode(utf8_encode($raw));
$raw = str_replace('++','',$raw);
$raw = urldecode($raw);
URL-encoding after the UTF-8 encoding turned the spaces into +'s. I simply removed all instances of ++ and decoded it back. Works great.
I'm having trouble with data from a database containing German umlauts. Basically, whenever I receive data containing umlauts, each one shows up as a black square with a question mark. I solved this by putting
mysql_query ('SET NAMES utf8')
before the query.
The problem is, as soon as I use json_encode(...) on the result of a query, the value containing an umlaut becomes null. I can see this by calling the PHP file directly in the browser. Is there a solution other than replacing these characters before encoding to JSON and decoding them in JS?
Check out this pretty elegant solution mentioned here:
json_encode( $json_full, JSON_UNESCAPED_UNICODE );
If the problem isn't anywhere else in your code this should fix it.
Edit: Umlaut problems can be caused by a variety of sources, like the charset of your HTML document, the database format, or some previous PHP functions your strings run through (you should definitely look into the multibyte functions when having problems with umlauts).
These problems tend to be pretty annoying because they are hard to track down in most cases (although this isn't as bad as it was a few years ago). The function above fixes – as asked – umlaut problems with json_encode … but there is a good chance that the problem is caused by a different part of your application and not this specific function.
I know this might be old, but here is a better solution:
Define the document type with utf-8 charset:
<?php header('Content-Type: application/json; charset=utf-8'); ?>
Make sure that all content is UTF-8 encoded. JSON works only with UTF-8!
function encode_items(&$item, $key)
{
    // Only strings need converting (assumes they are ISO-8859-1)
    if (is_string($item)) {
        $item = utf8_encode($item);
    }
}
array_walk_recursive($rows, 'encode_items');
Hope this helps someone.
You probably just want to show the texts somehow in the browser, so one option would be to change the umlauts to HTML entities by using htmlentities().
The following test worked for me:
<?php
$test = array( 'bla' => 'äöü' );
$test['bla'] = htmlentities( $test['bla'] );
echo json_encode( $test );
?>
The only important point here is that json_encode() only supports UTF-8 encoding.
http://www.php.net/manual/en/function.json-encode.php
All string data must be UTF-8 encoded.
So when you have any special characters in a non-UTF-8 string, json_encode will return a null value.
So either you switch the whole project to UTF-8 or you make sure you utf8_encode() any string before using json_encode().
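To make the failure mode concrete, here is a small sketch assuming the database returned ISO-8859-1 bytes and that mbstring is available. On current PHP versions json_encode() returns false for such input rather than a null value, and json_last_error() reports JSON_ERROR_UTF8; after converting to UTF-8 the umlaut survives:

```php
<?php
// "Müller" as ISO-8859-1 bytes (assumed source encoding): ü is the single byte 0xFC.
$row = ['name' => "M\xFCller"];

$failed = json_encode($row);   // false: the input is not valid UTF-8
$error  = json_last_error();   // JSON_ERROR_UTF8

// Convert every string to UTF-8 before encoding.
array_walk_recursive($row, function (&$item) {
    if (is_string($item)) {
        $item = mb_convert_encoding($item, 'UTF-8', 'ISO-8859-1');
    }
});

// JSON_UNESCAPED_UNICODE keeps ü readable instead of emitting \u00fc.
$json = json_encode($row, JSON_UNESCAPED_UNICODE);
```

Checking json_last_error() right after a failed call is usually quicker than guessing which layer mangled the string.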
Make sure the translation file itself was explicitly stored as UTF-8.
After that, reload the cache blocks and translations.
I have a bunch of files which are supposed to be HTML documents for the most part. However, sometimes the editor(s) copied and pasted text from other sources into them, so now I come across some weird chars every now and then: for example a non-encoded copyright sign, or weird things that look like a dash or minus but are something else (ascii #146?), or a single char that looks like "...".
I had a look at get_html_translation_table(); however, this will only replace the "usual" special chars like &, euro signs etc. It seems like I need a regex that specifies only the allowed chars and discards all the unknown ones. I tried this, but it didn't work at all:
function fixNpChars($string)
{
    //characters in the hexadecimal ranges 00–08, 0B–0C, 0E–1F, 7F, and 80–9F cannot be used in an HTML document, not even by reference.
    $pattern = '/[\x{0000}-\x{0008}][\x{000B}-\x{000C}][\x{000E}-\x{001F}][\x{0080}-\x{009F}][x{007F}]/u';
    $replacement = '';
    return preg_replace($pattern, $replacement, $string);
}
Any idea what's wrong here?
EDIT:
The database where I store my imported files and the PHP side are all set to UTF-8 (content type utf-8, db table charset utf8/utf8_general_ci, and mysql_set_charset('utf8',$this->mHandle); executed after the db connection is established). Most of the imported files are either utf8 or iso-8859-1.
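Your regex syntax looks a little problematic: the pattern chains five separate character classes, so it only matches five such characters in a row, and the last class is missing its backslash. Maybe merge the ranges into one class?:
$pattern = '/[\x00-\x08\x0B\x0C\x0E-\x1F\x7F-\x9F]/u';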
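For comparison, a sketch that puts all the ranges from the question into a single character class, so each forbidden character is matched on its own instead of requiring five in a row. It assumes the input is already valid UTF-8 (with the /u modifier preg_replace returns null otherwise, in which case this version hands back the input unchanged); the function name is made up:

```php
<?php
// Hypothetical helper: removes U+0000-0008, U+000B-000C, U+000E-001F and U+007F-009F.
function strip_control_chars(string $html): string
{
    $pattern = '/[\x{0000}-\x{0008}\x{000B}\x{000C}\x{000E}-\x{001F}\x{007F}-\x{009F}]/u';
    // preg_replace() yields null on invalid UTF-8; fall back to the original text.
    return preg_replace($pattern, '', $html) ?? $html;
}

// Tab and newline are kept; BEL, U+0092 and DEL are removed.
$clean = strip_control_chars("a\x07b\u{0092}c\x7Fd\te\n");
```

If the pasted text is really Windows-1252, converting it to UTF-8 from that encoding first would turn byte 0x92 (decimal 146) into a proper right single quote rather than a control character, so it may not need stripping at all.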
Don't think of removing the invalid characters as the best option; this problem can be solved using the htmlentities and html_entity_decode functions.
So I'm working on a project that takes data from a file. Some lines in the file require UTF-8 symbols but are encoded oddly: they contain \xC6, for example, rather than Æ.
If I do as follows:
$name = "\xC6ther";
$name = preg_replace('/x([a-fA-F0-9]{2})/', '&#$1;', $name);
echo utf8_encode($name);
It works fine. I get this:
Æther
But if I pull the same data from MySQL, and do as follows:
$name = $row['OracleName'];
$name = preg_replace('/x([a-fA-F0-9]{2})/', '\&#$1;', $name);
$name = utf8_encode($name);
Then I receive this as output:
\&#C6;ther
Anyone know why this is?
As requested, var_dump of $row['OracleName']:
string(15) "xC6ther Barrier"
In your second preg_replace, why is there a \ in the replacement?
preg_replace('/x([a-fA-F0-9]{2})/', '&#$1;', $name);
OK, I think there is some confusion here. Your regular expression matches something like x66 and replaces it with '&#66;', which looks like HTML entity encoding to me, but you are then using utf8_encode, which does this (from the manual):
utf8_encode — Encodes an ISO-8859-1 string to UTF-8
So the entity would never get converted (or, to be more precise, '&#66;' would remain '&#66;', since those characters are the same in ISO-8859-1 and UTF-8).
Also note that in your first snippet you use \xC6, but this would never be caught by the preg_replace since it's already an encoded character. In a double-quoted string, \x means that the following hex number (0x00 to 0xFF) is dropped into the string as a raw byte; it won't produce the literal string xC6.
So I am kind of confused about what you really want to do. What is the preg_replace for?
If you want to convert HTML entities to UTF-8, look into mb_convert_encoding (manual); if you want to do the reverse and encode UTF-8 as HTML entities, look into htmlentities (manual).
And if it has nothing to do with any of that and you simply want to change the encoding, mb_convert_encoding is still there.
Figured out the problem: on the SQL pull I missed an 'x' in the preg_replace replacement.
preg_replace('/x([a-fA-F0-9]{2})/', '&#x$1;', $name);
Once I added in the x, it worked like a charm.
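Putting the fix together as a sketch: the preg_replace builds a hexadecimal HTML entity, and html_entity_decode (not used in the thread; swapped in here instead of relying on the browser to render the entity) turns it into the UTF-8 character. Note that the pattern also matches an ordinary x followed by two characters from 0-9 or a-f, as in "Exact", so it is only safe on data known to use this notation:

```php
<?php
$name = 'xC6ther Barrier';   // value from the var_dump in the question

// Step 1: xC6 -> &#xC6; (hexadecimal numeric entity)
$entity = preg_replace('/x([a-fA-F0-9]{2})/', '&#x$1;', $name);

// Step 2: decode the entity into a real UTF-8 character
$utf8 = html_entity_decode($entity, ENT_QUOTES | ENT_HTML5, 'UTF-8');
```

Without the x after &#, the entity would be read as a decimal reference, and "C6" is not a valid decimal number, which is why the earlier output stayed as literal text.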