Google can't read a sitemap with special characters in URLs - php

I have a large sitemap generated dynamically with PHP. It has a sitemap index with some 230 separate sitemaps, and each individual sitemap has between 3,000 and 15,000 URLs.
Most of those 230 sitemaps are fine, but in some of them a few URLs contain special characters, and Google returns an error and rejects the whole sitemap. An example of a normal, accepted URL:
http://www.site.com/Gentofte-Greve/Denmark 1 Badmintonligaen/12-fe-juice_a-1091627-1-33-1-odds/
An example of a URL that corrupts the entire sitemap file for Google:
http://www.site.com/Team%20%C5rhus%20Elite-Solr%F8d%20Strand/Denmark 1 Badmintonligaen/12-fe-juice_a-1091631-1-33-1-odds/
Any special character, for example the Nordic ones, will wreck the sitemap. Here is an example of Nordic characters: http://www.borgos.nndata.no/alfabet.htm
My question is: how do I encode those special characters (and other similar ones) so that the sitemap still validates? Which PHP function should I use, if that's a solution? Is the only solution to use str_replace and replace those characters with plain ones? That wouldn't be an issue, since the URL works no matter what you write in the first part of it (that part is for SEO only), but it would be time-consuming. I'd prefer to be able to write those special characters in a way that doesn't wreck the sitemap for Google.
Everything else regarding my sitemaps is fine, they're coded in UTF-8 or at least they should be with this line:
<?xml version='1.0' encoding='UTF-8'?>

Are the %C5 and %F8 sequences meant to represent the characters U+00C5 (Å) and U+00F8 (ø)? If so, you need to use their UTF-8 encodings, not their raw Unicode codepoint numbers. 'Å' should be %C3%85, and 'ø' should be %C3%B8.
For more information about URI encoding, see RFC 3986.
Doing this in PHP is complicated by the fact that PHP strings are really byte strings, not Unicode character strings. They can't store abstract Unicode characters; they can only store the encoded representation of those characters, in a particular encoding such as UTF-8 or UTF-16. You can use the mbstring extension to work with encoded Unicode strings, but doing this correctly will probably mean using the mbstring functions for all handling of Unicode text throughout your application.
You should be looking to fix this encoding problem at the source: how did your program get a string that contains the byte 0xC5 to represent the character U+00C5? Something, somewhere, must've assumed that Unicode codepoint numbers translate directly into bytes, which is wrong. Find and fix that, so that your data is read into the PHP string in UTF-8 form to begin with, and then use the mbstring functions for any manipulation of the string afterward.
Once you have a string that contains the UTF-8 representation of your URL, rawurlencode() should give you the correct percent-escaped result.
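A minimal sketch of the difference, assuming the team name reaches PHP as ISO-8859-1 bytes and that the mbstring extension is available. The sample string and variable names are illustrative only:

```php
<?php
// Hypothetical path segment held as ISO-8859-1 bytes: 0xC5 is 'Å', 0xF8 is 'ø'.
$latin1 = "Team \xC5rhus Elite-Solr\xF8d Strand";

// Percent-encoding the raw bytes reproduces the escapes that Google rejects.
$broken = rawurlencode($latin1);

// Convert to UTF-8 first, then percent-encode.
$utf8  = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');
$fixed = rawurlencode($utf8);

echo $broken, "\n"; // Team%20%C5rhus%20Elite-Solr%F8d%20Strand
echo $fixed, "\n";  // Team%20%C3%85rhus%20Elite-Solr%C3%B8d%20Strand
```

Converting at output time like this only helps if the data really is ISO-8859-1. Reading the data as UTF-8 in the first place is still the more robust fix.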

Related

How to convert a Chinese character to UTF-16 code units?

I'm using PHP for this web development project. Right now, I'm working on a user page, where the user can add words that he knows. Of course, I'm starting out crude, without any special features yet, such as "Do you know this character?" suggestions.
I have tackled the challenges of adding UTF-16 collation and charset set to UTF-16 in my MySQL Database, in fact online at http://freemysqlhosting.net to support Chinese characters in my website. Now what I'm struggling with is to support automatic PinYin generation for my Chinese characters.
I have found this after searching all over SO: https://github.com/reorx/pinyindep/blob/master/Uni2Pinyin. Each line begins with a Chinese character, in UTF-16 Code Units.
Take for example, 爱. In UTF-16, it is 7231. I convert this at https://r12a.github.io/apps/conversion/. When I do a lookup in the file, I get the pinyin associated. :D This is the functionality I need, though looking it up in GitHub is in JS, rather than PHP.
In the manual lookup, ai4 is returned, which is the correct intonation. Now, what I'm looking for is either a PHP built-in library or a code snippet that converts a string input, say “爱”, into its four-hex-digit UTF-16 code unit, here 7231.
So what's the question:
How should I convert a Chinese character, in form of a string, to UTF-16 code units? (Either through built-in library, or through a suggested PHP Code Snippet)
P.S. I don't really like third-party tools unless they are really popular worldwide, or there's no other option.
You need to use PHP's multibyte string module:
$c = "爱";
list(, $d) = unpack('N', mb_convert_encoding($c, 'UCS-4BE', 'UTF-8'));
echo dechex($d);
// => 7231
Change UTF-8 to UTF-16 if your string is coming from the database in that encoding.
mb_convert_encoding changes the string into a four-bytes-per-character encoding; unpack then converts the four bytes into an unsigned long; finally, dechex converts that number to a hexadecimal string.
If you are using PHP 7.2+ you can use mb_ord to simplify the conversion.
echo dechex(mb_ord("爱"));
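Both snippets return the Unicode code point, which equals the single UTF-16 code unit only for characters in the Basic Multilingual Plane (爱 is one of them). For characters outside it, UTF-16 uses two code units, a surrogate pair. The following is a sketch, assuming UTF-8 input and the mbstring extension; utf16_code_units() is a hypothetical helper name, not a built-in:

```php
<?php
// Returns the UTF-16 code units of a UTF-8 string as upper-case hex strings.
// utf16_code_units() is a hypothetical helper, not part of PHP.
function utf16_code_units(string $utf8): array
{
    $utf16 = mb_convert_encoding($utf8, 'UTF-16BE', 'UTF-8');
    $units = [];
    // 'n*' reads every pair of bytes as an unsigned 16-bit big-endian integer.
    foreach (unpack('n*', $utf16) as $unit) {
        $units[] = sprintf('%04X', $unit);
    }
    return $units;
}

// "\u{...}" escapes (PHP 7+) avoid depending on the source file's encoding.
echo implode(' ', utf16_code_units("\u{7231}")), "\n";  // 7231
echo implode(' ', utf16_code_units("\u{1F600}")), "\n"; // D83D DE00
```

For a lookup file keyed by a single four-digit code, the BMP case is the only one that matters, so the simpler mb_ord approach is enough there.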

Decode HTML-encoded characters with extended ASCII

I have an XML with special Characters encoded as &#xxx; in it. As long as I'd output these characters to a browser, that would work fine as they're HTML-Encodings (sort of).
But I need to read the XML-File with simplexml_load_string, which results in garbage for certain characters, because they're in the extended ASCII-table.
For example:
&#154; translates to š - but when I try to use html_entity_decode, I get an empty character.
I tried almost everything from iconv to mb_decode_numericentity - nothing worked.
How do I convert those &#xxx; to the real characters???
[Edit]
I found this table http://www.ascii-code.com that claims the š is an extended ASCII Character using ISO-8859-1
I'm confused...
You're apparently dealing with two different characters that look almost identical when printed:
'LATIN SMALL LETTER S WITH CARON' (U+0161) actually encodes as &#353;
&#154; corresponds to 'SINGLE CHARACTER INTRODUCER' (U+009A)
I've found that none of my fonts or text editors handle the second one properly. So you most likely get a blank character for that precise reason.
The second one appears to be some kind of obscure control character whose exact purpose escapes me:
To be followed by a single printable character (0x20 through 0x7E) or
format effector (0x08 through 0x0D). The intent was to provide a means
by which a control function or a graphic character that would be
available regardless of which graphic or control sets were in use
could be defined. Definitions of what the following byte would invoke
was never implemented in an international standard. Not part of the
first edition of ISO/IEC 6429
It's worth noting that character references in XML use numeric codes from a fixed encoding (some UCS variant). If the author of the XML file doesn't follow this convention you'll be faced with either invalid XML (something that effectively prevents it from being parsed with an XML library) or valid XML that contains corrupted data (something that, at most, will require tedious post-processing).
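If the file's author evidently meant Windows-1252 byte values when writing references in the range 128 to 159, one possible post-processing step is to rewrite those references before parsing. This is only a sketch under that assumption; fix_cp1252_references() is a hypothetical helper, it handles decimal references only, and it needs the mbstring extension:

```php
<?php
// Rewrites decimal character references in the C1 range (128-159) as the
// characters that the same byte values denote in Windows-1252.
// Assumption: the XML author used Windows-1252 byte values as reference numbers.
function fix_cp1252_references(string $xml): string
{
    return preg_replace_callback('/&#(\d+);/', function (array $m): string {
        $code = (int) $m[1];
        if ($code >= 128 && $code <= 159) {
            // chr() builds the single Windows-1252 byte; convert it to UTF-8.
            return mb_convert_encoding(chr($code), 'UTF-8', 'Windows-1252');
        }
        return $m[0]; // leave every other reference untouched
    }, $xml);
}

$fixed = fix_cp1252_references('<name>Milo&#154; &#38; co</name>');
echo bin2hex("\u{161}"), "\n"; // c5a1, the UTF-8 bytes of U+0161
var_dump($fixed === "<name>Milo\u{161} &#38; co</name>"); // bool(true)
```

The result can then be passed to simplexml_load_string, provided the document is declared as UTF-8. A few byte values in that range are unassigned in Windows-1252, so their output depends on how mbstring substitutes them.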

PHP Encoding Conversion to Windows-1252 whilst keeping UTF-8 Compatibility

I need to convert uploaded filenames with an unknown encoding to Windows-1252 whilst also keeping UTF-8 compatibility.
As I pass on those files to a controller (on which I don't have any influence), the files have to be Windows-1252 encoded. This controller then again generates a list of valid file(names) that are stored via MySQL into a database - therefore I need UTF-8 compatibility. Filenames passed to the controller and filenames written to the database MUST match. So far so good.
In some rare cases, when converting to "Windows-1252" (as with the character "ï"), the character is converted to something that is invalid in UTF-8. MySQL then drops those invalid characters, so filenames on disk and filenames stored in the database no longer match. This conversion, which sometimes fails, is done with a simple recoding:
$sEncoding = mb_detect_encoding($sOriginalFilename);
$sTargetFilename = iconv($sEncoding, "Windows-1252//IGNORE", $sOriginalFilename);
To prevent invalid characters being generated by the conversion, I then again can remove all invalid UTF-8 characters from the recoded string:
ini_set('mbstring.substitute_character', "none");
$sEncoding = mb_detect_encoding($sOriginalFilename);
$sTargetFilename = iconv($sEncoding, "Windows-1252//TRANSLIT", $sOriginalFilename);
$sTargetFilename = mb_convert_encoding($sTargetFilename, 'UTF-8', 'Windows-1252');
But this will completely remove or recode any special characters left in the string. For example, I lose all "äöüÄÖÜ" etc., which are quite common in German.
If you know a cleaner and simpler way of encoding to Windows-1252 (without losing valid special characters), please let me know.
Any help is very appreciated. Thank you in advance!
I think the main problem is that mb_detect_encoding() does not do exactly what you think it does. It attempts to detect the character encoding, but it does so from a fairly limited list of predefined encodings. By default, those encodings are the ones returned by mb_detect_order(). On my computer they are:
ASCII
UTF-8
So this function is completely useless unless you take care of compiling a list of candidate encodings and feeding the function with it.
Additionally, there's basically no reliable way to guess the encoding of an arbitrary input string, even if you restrict yourself to a small subset of encodings. In your case, Windows-1252 is so close to ISO-8859-1 and ISO-8859-15 that you have no way to tell them apart other than visual inspection of key characters like ¤ or €.
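One common workaround, sketched here under the assumption that uploads are either valid UTF-8 or Windows-1252 and nothing else, is to test for valid UTF-8 instead of guessing. The helper name to_utf8() and the sample filenames are hypothetical:

```php
<?php
// Normalises a filename to UTF-8.
// Assumption: anything that is not valid UTF-8 is Windows-1252.
function to_utf8(string $name): string
{
    if (mb_check_encoding($name, 'UTF-8')) {
        return $name;
    }
    return mb_convert_encoding($name, 'UTF-8', 'Windows-1252');
}

$fromUtf8   = to_utf8("na\u{EF}ve.txt"); // 'ï' already encoded as UTF-8
$fromCp1252 = to_utf8("na\xEFve.txt");   // 'ï' as the single byte 0xEF
var_dump($fromUtf8 === $fromCp1252);     // bool(true)

// Derive the Windows-1252 copy for the controller from the UTF-8 original.
$forController = mb_convert_encoding($fromUtf8, 'Windows-1252', 'UTF-8');
var_dump($forController === "na\xEFve.txt"); // bool(true)
```

Keeping the UTF-8 form as the stored value and deriving the Windows-1252 form from it means both names always come from the same source string. Characters with no Windows-1252 equivalent still cannot survive the second conversion.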
You can't have a string be Windows-1252 and UTF-8 at the same time. The two encodings are identical for the first 128 characters (which include, e.g., the basic Latin alphabet), but beyond that (as with umlauts) it's either one or the other: those characters are encoded as different bytes in UTF-8 than in Windows-1252.
Keep to ASCII in the filesystem - if you need to preserve characters outside ASCII in a filename, there are schemes you can use to represent Unicode characters while keeping to ASCII.
For example, percent encoding:
äöüÄÖÜ.txt <-> %C3%A4%C3%B6%C3%BC%C3%84%C3%96%C3%9C.txt
Of course this will hit the file name limit pretty fast and is not very optimal.
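The percent-encoding scheme can be sketched with PHP's built-in rawurlencode() and rawurldecode(), assuming the original name is UTF-8 (the variable names are illustrative):

```php
<?php
// "äöüÄÖÜ.txt" written with escapes so the source file's encoding doesn't matter.
$original = "\u{E4}\u{F6}\u{FC}\u{C4}\u{D6}\u{DC}.txt";

// ASCII-only name for the filesystem and the controller.
$onDisk = rawurlencode($original);
echo $onDisk, "\n"; // %C3%A4%C3%B6%C3%BC%C3%84%C3%96%C3%9C.txt

// The original UTF-8 name can be recovered at any time.
$restored = rawurldecode($onDisk);
var_dump($restored === $original); // bool(true)
```

Because the encoded name is pure ASCII, it is byte-for-byte identical in Windows-1252 and UTF-8, so the name on disk and the name in the database cannot drift apart.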
How about punycode?
äöüÄÖÜ.txt <-> xn--4caa7cb2ac.txt

Strange behaviour when encoding cURL response as UTF-8

I'm making a cURL request to a third-party website which returns a text file on which I need to do a few string replacements, to replace certain characters with their HTML entity equivalents, e.g. I need to replace í with &iacute;.
Using str_replace/preg_replace_callback on the response directly didn't produce any matches (whether searching for í directly or using its hex code \x00\xED), so I called utf8_encode() before carrying out the replacement. But utf8_encode replaces all the í characters with Ã.
Why is this happening, and what's the correct approach to carrying out UTF-8 replacements on an arbitrary piece of text using php?
*edit - some further research reveals
utf8_decode("í") == í;
utf8_encode("í") == í;
utf8_encode("\xc3\xad") == í;
utf8_encode is definitely not the way to go here (you're double-encoding if you do that).
Re. searching for the character directly or using its hex code, did you make sure to add the u modifier at the end of the regex? e.g. /\x00\xED/u?
You're probably specifying the characters/strings you want replaced via string literals in the PHP source code? If you do, then the values of those string literals depend on the encoding you save your PHP file in. So while you see the character í, the literal value may be an ISO-8859-1 í, a Windows cp1252 í, a UTF-8 í, or maybe even a UTF-32 í. I don't know offhand how many of those differ, but I know at least some have different byte representations, and so won't match in a PHP string comparison.
My point is, you need to specify the correct character that will match whatever encoding your incoming text is in.
Here's an example without using literals:
$iso8859_1 = chr(237); // 0xED, 'í' in ISO-8859-1
$utf8 = utf8_encode(chr(237)); // "\xC3\xAD", 'í' in UTF-8
Be warned: if you decide to change the file encoding to UTF-8, text editors may or may not convert the existing characters when you change the encoding. I've seen editors do really bizarre things when changing the encoding. Start with a fresh file.
Also, just because the other server claims its output is UTF-8 doesn't mean it really is.
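A sketch of the whole replacement, assuming the response turns out to be ISO-8859-1. The sample string is made up, and mb_convert_encoding() is used here in place of utf8_encode(), which is deprecated as of PHP 8.2:

```php
<?php
// Hypothetical response body: "María" as ISO-8859-1 bytes (0xED is 'í').
$response = "Mar\xEDa";

// Convert once, then search for the UTF-8 byte sequence of 'í'.
$utf8 = mb_convert_encoding($response, 'UTF-8', 'ISO-8859-1');
$html = str_replace("\xC3\xAD", '&iacute;', $utf8);
echo $html, "\n"; // Mar&iacute;a

// Converting text that is already UTF-8 double-encodes it:
// 0xC3 0xAD becomes 0xC3 0x83 0xC2 0xAD, which displays as "Ã" plus a soft hyphen.
$twice = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');
echo bin2hex($twice), "\n"; // 4d6172c383c2ad61
```

Writing the search string as byte escapes keeps the match independent of the encoding the PHP file is saved in.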

PHP: Fixing encoding issues with database content - removing accents from characters

I'm trying to make a URL-safe version of a string.
In my database I have a value medúlla - I want to turn this into medulla.
I've found plenty of functions to do this, but when I retrieve the value from the database it comes back as medúlla.
I've tried:
Setting the column as utf_8 encoding
Setting the table as utf_8 encoding
Setting the entire database as utf_8 encoding
Running `SET NAMES utf8` on the database before querying
When I echo the value onto the screen it displays as I want it to, but the conversion function doesn't see the ú character (even a simple str_replace() doesn't work either).
Does anybody know how I can force the system to recognise this as UTF-8 and allow me to run the conversion?
Thanks,
Matt
To transform a UTF-8 string into a URL-safe string you should use:
$str = iconv('UTF-8', 'ASCII//IGNORE//TRANSLIT', $str);
The IGNORE part tells iconv() not to fail when it meets a character it can't convert (by default it emits a notice and returns false), and the TRANSLIT part converts a UTF-8 character into its nearest ASCII equivalent ('ú' into 'u' and such).
The next step is to preg_replace() spaces into underscores and substitute or drop any character which is unsafe within a URL, either with preg_replace() or urlencode().
As for the database stuff, you really should have done all this setup before INSERTing UTF-8 content. Changing the charset of an existing table is somewhat like changing a file extension in Windows - it doesn't convert a JPEG into a GIF. But don't worry, and remember that the database will return, byte for byte, exactly what you've stored in it, no matter which charset has been declared. Just keep the settings you used when INSERTing and treat the returned strings as UTF-8.
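The two steps can be combined roughly as follows. This is a sketch only: slugify() is a hypothetical helper, it uses plain 'ASCII//TRANSLIT' with a fallback rather than the flag combination above, and what TRANSLIT produces for a given accented character depends on the iconv implementation and the current locale, so no output is shown for accented input:

```php
<?php
// Builds a lower-case ASCII slug from a UTF-8 string.
// slugify() is a hypothetical helper, not a PHP built-in.
function slugify(string $utf8): string
{
    // Transliteration results vary by iconv implementation and locale.
    $ascii = @iconv('UTF-8', 'ASCII//TRANSLIT', $utf8);
    if ($ascii === false) {
        // Fallback: drop every byte outside printable ASCII.
        $ascii = preg_replace('/[^\x20-\x7E]/', '', $utf8);
    }
    // Collapse every run of non-alphanumeric characters into one underscore.
    $slug = preg_replace('/[^a-z0-9]+/', '_', strtolower($ascii));
    return trim($slug, '_');
}

echo slugify('Hello, World!'), "\n"; // hello_world
echo slugify("Med\u{FA}lla"), "\n";  // output depends on iconv and locale
```

If transliteration yields '?' on a given system, calling setlocale() with a UTF-8 locale before iconv() is a commonly suggested remedy.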
I'm trying to make a URL-safe version of a string.
Whilst it is common to use ASCII-only ‘slugs’ in URLs, it is actually possible to have web addresses including non-ASCII characters. eg.:
http://en.wikipedia.org/wiki/Medúlla
This is a valid IRI. For inclusion in a URI, you should UTF-8-encode and then %-encode it:
http://en.wikipedia.org/wiki/Med%C3%BAlla
Either way, most browsers (with the occasional exception of IE) will display the IRI version in the address bar. Sites such as Wikipedia use this to get pretty addresses.
the conversion function doesn't see the ú character
What conversion function? rawurlencode() will correctly spit out %C3%BA for ú, if, as presumably you do, you have it in UTF-8 encoding. This is the correct way to include text in a URL's path component. (urlencode() also gives the same results, but it should only be used for query components.)
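A short illustration of both points, assuming the string is UTF-8 (the variable names are illustrative):

```php
<?php
// "Medúlla" in UTF-8, written with an escape so the file encoding is irrelevant.
$title = "Med\u{FA}lla";

$path = rawurlencode($title);
echo $path, "\n"; // Med%C3%BAlla

// The two functions differ in how they treat spaces.
$asPath  = rawurlencode('two words'); // suitable for a path component
$asQuery = urlencode('two words');    // form-style encoding for query strings
echo $asPath, "\n";  // two%20words
echo $asQuery, "\n"; // two+words
```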
If you mean htmlentities()... do not use this function. It converts all non-ASCII characters to HTML character references, which makes your output unnecessarily larger, and means it has to know what encoding the string you pass in is. Unless you give it a UTF-8 $charset argument it will use ISO-8859-1 (the default before PHP 5.4), and consequently screw up all your non-ASCII characters.
Unless you are specifically authoring for an environment which mangles non-ASCII characters, it is better to use htmlspecialchars(). This gives smaller output, and it doesn't matter(*) if you forget to include the $charset argument, since all it changes is a couple of characters like < and &.
(*: Actually it could matter for some East Asian multibyte character sets where < could be part of a multibyte sequence and so shouldn't be escaped. But in general you'd want to avoid these legacy encodings, as UTF-8 is less horrific.)
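The difference in output, with the charset passed explicitly in both calls (a small sketch; the sample string is made up):

```php
<?php
$text = "Med\u{FA}lla <b>"; // UTF-8 input containing 'ú' and markup

// Escapes only the HTML-significant characters; 'ú' stays as UTF-8 bytes.
$special = htmlspecialchars($text, ENT_QUOTES, 'UTF-8');

// Also turns 'ú' into a named character reference.
$entities = htmlentities($text, ENT_QUOTES, 'UTF-8');

echo $special, "\n";  // Medúlla &lt;b&gt;
echo $entities, "\n"; // Med&uacute;lla &lt;b&gt;
```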
(even a simple str_replace() doesn't work either).
If you wrote str_replace(..., 'ú', ...) in the PHP source code, you would have to be sure that you saved the source code in the same encoding as you'll be handling, otherwise it won't match.
It is unfortunate that most Windows text editors still save in the (misleadingly-named) “ANSI” code page, which is locale-specific, instead of just using UTF-8. But it should be possible to save the file as UTF-8, and then the replace should work. Alternatively, write "\xc3\xba" (in double quotes, so that PHP interprets the escapes) to avoid the problem.
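A sketch of the byte-escape approach, assuming the value comes back from the database as UTF-8:

```php
<?php
// "medúlla" as UTF-8 bytes, as it would arrive from a UTF-8 connection.
$value = "med\xc3\xballa";

// Double quotes make PHP turn \xc3\xba into the two bytes of 'ú'.
$plain = str_replace("\xc3\xba", 'u', $value);
echo $plain, "\n"; // medulla

// Single quotes keep the backslashes literally, so this would never match.
var_dump('\xc3\xba' === "\xc3\xba"); // bool(false)
```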
Running SET NAMES utf8 on the database before querying
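Use mysql_set_charset() in preference (or mysqli_set_charset() with the mysqli extension, since the mysql_* functions were removed in PHP 7).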
