Convert special characters to regular alphabet in PHP

I'm trying to build a search page for a bunch of menu items in my database which often contain special characters like é (as in sautéed), and so I want to convert both the search query and the database content to regular alphabets, and I'm having trouble. I'm using ISO-8859-1 so that special characters will display properly on the website, and I get the feeling this is hindering my attempts at conversion...
header('Content-Type: text/html; charset=ISO-8859-1');
The search query is sent to search.php using the GET method, so the query "sautéed" will appear like this in the address bar:
search.php?q=saut%E9ed
This is the function I'm trying to build, that's not working:
$q = $_GET['q'];
function clean_str($a) {
    $fix = array('é' => 'e');
    $str = str_replace(array_keys($fix), array_values($fix), $a);
    return $str;
}
$fixed = clean_str($q); // currently has no effect
I've tried using %29 as the array key, as well as the HTML character code (&eacute;). I've tried utf8_encode($q); to no avail. Other characters like ! and + work fine in the clean_str() function, but not special letters like é.

Though you might want to reconsider the way you're doing this, as has been suggested, I believe this will get you there.
function clean_str($a) {
    $fix = array('é' => 'e');
    $str = str_replace(array_keys($fix), array_values($fix), $a);
    return $str;
}
$fixed = clean_str(utf8_encode($_GET['q'])); // convert the ISO-8859-1 query to UTF-8 before replacing
echo $fixed;
For more on utf8_encode, see the PHP manual.
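Why the conversion helps can be sketched as follows. This is a minimal illustration, assuming the script file itself is saved as UTF-8 (so a literal 'é' in the source is the two bytes 0xC3 0xA9) and that the mbstring extension is available. It uses mb_convert_encoding() in place of utf8_encode(), which is deprecated as of PHP 8.2; the sample value "saut\xE9ed" stands in for $_GET['q'].

```php
<?php
function clean_str(string $a): string
{
    // "\xC3\xA9" is the UTF-8 byte sequence for é, i.e. what a UTF-8 source
    // file stores when you type 'é' as a literal.
    $fix = array("\xC3\xA9" => 'e');
    return str_replace(array_keys($fix), array_values($fix), $a);
}

// "saut%E9ed" in the URL decodes to the single ISO-8859-1 byte 0xE9.
$raw = "saut\xE9ed";

// Convert the incoming bytes to UTF-8 so they can match the UTF-8 key above.
$utf8 = mb_convert_encoding($raw, 'UTF-8', 'ISO-8859-1');

var_dump(clean_str($raw) === $raw);        // bool(true): nothing matched, no effect
var_dump(clean_str($utf8) === 'sauteed');  // bool(true): the é was replaced
```

The byte 0xE9 never matches the two-byte key, which is consistent with the original function appearing to do nothing.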

To wit, é is the regular alphabet in several languages =) While you're suggesting you would like to know how to convert the text to ASCII (which English speakers may consider 'regular'), what you really should be doing is working with the modern web's most permissive encoding, which is UTF-8.
That way, you will be able to accept input in any language, save it, process it, and serve it back up, without needing to normalise or ill-convert to another codepage.
Serve your pages with <meta charset="utf-8"> in the source code, and an HTTP Content-Type header that indicates UTF-8 encoding, and things should go a lot smoother. (Note that for the now defunct HTML 4.01 or XHTML 1/1.1 you will need to use the older meta tag syntax. Using those flavours for new projects is, however, very much not recommended.)

Related

converting special characters in HTML into the appropriate coding for PHP

I am making a website where one fills out a form and it creates a PDF. The user will be able to put in diacritic and special characters. The way I am sending the characters to the PHP, those characters will come into the PHP as HTML coded characters, e.g. &agrave;. I need to change this to whatever it is PHP will read so that, when I put it through the PDF maker we have, it has the diacritic character and not the HTML code for it.
I wrote a test to try this out but I haven't been able to figure it out. If I have to I will end up writing an array for every possible character they can use and translate the incoming string but I am trying to find an easier solution.
Here is the code of my test:
$title = "Test of Title for use With This Project and it should also wrap because it is s&ograve; long! Acutally it is even longer than previously expected!";
$ti = htmlspecialchars_decode($title);
I have been attempting to use htmlspecialchars_decode() to convert it, but it still comes out as &ograve; and not ò. Is there an easy way to do this?
See the documentation which tells you it won't touch most of the characters you care about and to use html_entity_decode instead.
Use the html_entity_decode function instead of htmlspecialchars_decode (which only decodes entities such as &amp;, &quot;, &lt; and &gt;, i.e. the special HTML characters, not all entities).
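The difference between the two functions can be seen in a short sketch. The sample string is made up for illustration, and the 'UTF-8' output charset is an assumption about what the PDF library expects.

```php
<?php
$title = "it is s&ograve; long &amp; wraps";

// htmlspecialchars_decode() only touches the few markup-significant entities.
$partial = htmlspecialchars_decode($title);

// html_entity_decode() handles named entities in general; the third argument
// is the charset of the returned string.
$full = html_entity_decode($title, ENT_QUOTES | ENT_HTML5, 'UTF-8');

var_dump($partial === "it is s&ograve; long & wraps");  // bool(true)
var_dump($full === "it is s\xC3\xB2 long & wraps");     // bool(true), \xC3\xB2 is ò in UTF-8
```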

How should I deal with character encodings when storing crawled web content for a search engine into a MySQL database?

I have a crawler that downloads webpages, scrapes specific content and then stores that content into a MySQL database. Later that content is displayed on a webpage when it's searched for ( standard search engine type setup ).
The content is generally of two different encoding types... UTF-8 or ISO-8859-1 or it is not specified. My database tables use cp1252 west european ( latin1 ) encoding. Up until now, I've simply filtered all characters that are not alphanumeric, spaces or punctuation using a regular expression before storing the content to MySQL. For the most part, this has eliminated all character encoding problems, and content is displayed properly when recalled and outputted to HTML. Here is the code I use:
function clean_string( $string, $mysqli ) // pass the connection in; it is not visible inside the function otherwise
{
    $string = trim( $string );
    $string = preg_replace( '/[^a-zA-Z0-9\s\p{P}]/', '', $string );
    $string = $mysqli->real_escape_string( $string );
    return $string;
}
I now need to start capturing "special" characters like trademark, copyright, and registered symbols, and am having trouble. No matter what I try, I end up with weird characters when I redisplay the content in HTML.
From what I've read, it sounds like I should use UTF-8 for my database encoding. How do I ensure all my data is converted properly before storing it to the database? Remember that my original content comes from all over the web in various encoding formats. Are there other steps I'm overlooking that may be giving me problems?
You should convert your database encoding to UTF-8.
About the content: for every page you crawl, fetch the page's encoding (from the HTTP header or the meta charset) and use that encoding to convert the content to UTF-8 like this:
$string = iconv("THIS STRING'S ENCODING", "UTF-8", $string);
Where THIS STRING'S ENCODING is the one you just grabbed as described above. Note that iconv() takes the source encoding first and the target encoding second.
PHP manual on iconv: http://be2.php.net/manual/en/function.iconv.php
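A minimal sketch of that step, assuming the crawler has kept the raw Content-Type header value. The helper name to_utf8(), the ISO-8859-1 fallback, and the choice to return the input unchanged on failure are all assumptions made for illustration, not part of the answer above.

```php
<?php
// Hypothetical helper: $contentType is the Content-Type header value captured
// by the crawler; $fallback is used when no charset is declared (an assumption).
function to_utf8(string $body, string $contentType, string $fallback = 'ISO-8859-1'): string
{
    $charset = $fallback;
    if (preg_match('/charset\s*=\s*"?([A-Za-z0-9._:-]+)/i', $contentType, $m)) {
        $charset = $m[1];
    }
    // iconv(from, to, string): the source charset comes first.
    $converted = @iconv($charset, 'UTF-8', $body);
    // iconv() returns false on failure, e.g. for an unknown charset name.
    return $converted === false ? $body : $converted;
}

$latin1 = "caf\xE9";  // "café" encoded as ISO-8859-1
var_dump(to_utf8($latin1, 'text/html; charset=ISO-8859-1') === "caf\xC3\xA9");  // bool(true)
var_dump(to_utf8("caf\xC3\xA9", 'text/html; charset=utf-8') === "caf\xC3\xA9"); // bool(true)
```

A page that declares its charset only in a meta tag would need the same lookup applied to the HTML itself before falling back.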
UTF-8 encompasses just about everything. It would definitely be my choice.
As far as storing the data, just ensure the connection to your database is using the proper charset. See the manual.
To deal with the ISO encoding, simply use utf8_encode when you store it, and utf8_decode when you retrieve it.
Try doing the encoding/decoding even when it's supposedly UTF-8 and see if that works for you. I've often seen people say something is UTF-8 when it isn't.
You'll also need to change your database to UTF-8.
The following worked for me when scraping data and presenting it on an HTML page.
While scraping the data from the external website, do a utf8_encode:
utf8_encode(trim(str_replace(array("\t","\n\r","\n","\r"),"",trim($th->plaintext))));
Before writing to the HTML page, set the charset to UTF-8: <meta charset="UTF-8">
While writing or echoing out to HTML, do a utf8_decode:
echo "Menu Item:". utf8_decode ($value['item']);
This helped me solve my HTML scraping issues. Hope someone else finds it useful.

Convert ASCII and UTF-8 to non-special characters with one function

So I'm building a website that is using a database feed that was already set up and has been used by the client for all their other websites for quite some time.
They fill this database through an external program, and I have no way to change the way I get my data.
Now I have the following problem, sometimes I get strings in UTF-8 and sometimes in ASCII (I hope I've got these terms right, they're still a bit vague to me sometimes).
So I could get either this: Sc&eacute;nic or Scénic.
Now the problem is, I have to convert this to non-special characters (so it would become Scenic) for URLs.
I don't think there's a function for converting é to e (if there is, do tell), so I'll probably need to create an array for that containing all the sources and destinations, but the bigger problem is converting &eacute; to é without breaking é when it comes through that function.
Or should I just create an array containing everything (for example: array('&eacute;'=>'e','é'=>'e'); etc.)?
I know how to get &eacute; to é, by doing utf8_encode(html_entity_decode('&eacute;')); however, putting é through this same function will return Ã©.
Maybe I'm approaching this the wrong way, but in that case I'd love to know how I should approach it.
Thanks to @XzKto and this comment on PHP.net I changed my slug function to the following:
static function slug($input){
    $string = html_entity_decode($input,ENT_COMPAT,"UTF-8");
    $oldLocale = setlocale(LC_CTYPE, '0');
    setlocale(LC_CTYPE, 'en_US.UTF-8');
    $string = iconv("UTF-8","ASCII//TRANSLIT",$string);
    setlocale(LC_CTYPE, $oldLocale);
    return strtolower(preg_replace('/[^a-zA-Z0-9]+/','-',$string));
}
I feel like the setlocale part is a bit dirty but this works perfectly for translating special characters to their 'normal' equivalents.
Input a áñö ïß éèé returns a-ano-iss-eee
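The part of this that handles both input forms is html_entity_decode() with a UTF-8 target: entities become UTF-8 characters, while text that is already UTF-8 passes through unchanged. The sketch below isolates that step. It is not the function above: a small hand-written map stands in for the iconv() //TRANSLIT call so the result does not depend on the platform's locale, the function name slug_sketch() is hypothetical, and the trailing trim() is an extra.

```php
<?php
// Hypothetical variant for illustration; extend $map for other characters.
function slug_sketch(string $input): string
{
    // "&eacute;" and a literal UTF-8 "é" both end up as the bytes 0xC3 0xA9 here.
    $string = html_entity_decode($input, ENT_COMPAT, 'UTF-8');
    $map = array(
        "\xC3\xA9" => 'e',  // é
        "\xC3\xA8" => 'e',  // è
        "\xC3\xA1" => 'a',  // á
        "\xC3\xB1" => 'n',  // ñ
        "\xC3\xB6" => 'o',  // ö
    );
    $string = strtr($string, $map);
    // Collapse anything that is not a letter or digit into single hyphens.
    return strtolower(trim(preg_replace('/[^a-zA-Z0-9]+/', '-', $string), '-'));
}

var_dump(slug_sketch('Sc&eacute;nic') === slug_sketch("Sc\xC3\xA9nic"));  // bool(true)
echo slug_sketch('Sc&eacute;nic'), "\n";  // scenic
```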

How to replace garbled characters in a string?

I have this text...
“I’m not trying to be credible,†David admits with a smile broadening"
...and I would like to delete those funny characters, I've tried str_replace() but it does not work.
Any ideas?
You have probably handled the text in a different encoding than its source encoding.
So if the text is UTF-8, you are currently not handling it as UTF-8. The easiest way is to send a header such as...
header('Content-Type: text/html; charset=UTF-8');
You could also add the meta element, but ensure it is the first child of your head element.
You need to fix that at the source instead of trying to patch it later (which will never work well).
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
...
</head>
Different sources often have different encodings, so you need to specify the encoding in which you are presenting the view. Utf-8 is the most popular, since it covers all of ASCII and many, many other languages.
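How the garbling in the question arises can be reproduced in a few lines. This is a sketch under the assumption that the text was UTF-8 and got read as Windows-1252 somewhere along the way, and that the mbstring extension is available; the sample word is taken from the question.

```php
<?php
// "I’m" with a right single quote (U+2019) encoded as UTF-8.
$original = "I\xE2\x80\x99m";

// Misreading those bytes as Windows-1252 and re-encoding them as UTF-8
// yields the garbled sequence from the question.
$garbled = mb_convert_encoding($original, 'UTF-8', 'Windows-1252');
echo $garbled, "\n";  // Iâ€™m

// Reversing the mistaken step recovers the original bytes.
$repaired = mb_convert_encoding($garbled, 'Windows-1252', 'UTF-8');
var_dump($repaired === $original);  // bool(true)
```

The round trip is not reliable in general, since some byte values have no Windows-1252 character assigned, which is one more reason to fix the encoding at the source as the answers suggest.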
PHP's utf8_encode() and utf8_decode() convert from ISO-8859-1 to UTF-8 and back, and the regular string manipulation functions are not aware of multibyte characters (which UTF-8 can contain). Either use the mb_* string functions or enable encoding handling with certain parameters.
//comment if i'm mistaken
Well, you are using a different character encoding than the one you should probably use (you should be using UTF-8), so I would change that instead of trying to fix it on the spot with a quick fix (you will run into fewer problems overall that way).
If you really want to fix it using PHP, you can use the ctype_alpha() function; you should be able to do something like this:
$theString = "your text here"; // your input string
$newString = ""; // your new string
$length = strlen($theString);
for ($i = 0; $i < $length; $i++) // walk the string one byte at a time
{
    if (ctype_alpha($theString[$i])) // if the byte is a letter in the current locale
    {
        $newString .= $theString[$i]; // add it to the new string
    }
    // anything else (including spaces, digits and punctuation) is skipped
}
Then use $newString however you want to. Really though, just change your character encoding instead of doing it this way. You want the encoding to be the same across your entire project.

Correct character encoding

I'm currently scraping a website for various pieces of textual data (with permission, of course). The issue I'm seeing is that certain characters aren't correctly encoded in the process. This is particularly prominent with apostrophes ('): leading to characters such as: .
Currently, I use the following code to convert various HTML entities from the scraped data:
htmlentities($content, ENT_COMPAT, 'UTF-8', FALSE)
Is there a better way to handle this sort of thing?
HTML entities have two goals:
Escape characters that have a special meaning in HTML, such as angle brackets, so they can be used as literals.
Display characters that are not supported by the character set you are using, such as the euro symbol in an ISO-8859-1 document.
They are not exactly an encoding tool.
If you want to convert from one charset into another one, I suggest you use iconv(). However, you must know both the source and the target charset. The source charset should be mentioned in the Content-Type response header and the target charset is something you decided when you started the site (although in your case it looks like UTF-8 is the most reasonable option).
You don't want to use htmlentities right away; I would use that on the data at the last point before you store it. One of the problems you'll run into is that people don't always encode their entities properly anyway. Not everyone uses &trade;; some just paste the trademark symbol in. If you put some logic in to try and grab whatever they put in and encode it properly, you may be better off. For example:
$patterns = array();
$patterns[0] = '/—/';
$patterns[1] = '/&nbsp;/';
$patterns[2] = '/®/';
$replacements = array();
$replacements[0] = '&#8212;'; // em dash
$replacements[1] = '&#160;'; // non-breaking space
$replacements[2] = '&#174;'; // registered trademark sign
$ourhtml = preg_replace($patterns, $replacements, $html);
You could find all the "gotcha" characters like dashes and single quotes, apostrophes etc and encode them by hand, as well as use a set standard to the entities (text or numeric).
You could also use more general regular expressions to do the same thing, which would probably be a more elegant solution. But my suggestion would be to take some time filtering out what you don't want by hand, and then you know your data will be prepared exactly how you like.
It's a little bit difficult to suggest things based on the information provided. Can you provide an example snippet of text maybe?
Failing that, I'll employ the shotgun approach (i.e., suggesting a bunch of things and hoping one of them hits).
First of all, are you sure the page you're accessing is encoded in UTF-8? What does mb_detect_encoding say?
One option (may not work depending on your needs) would be to use iconv with the TRANSLIT option to convert the characters into something easier to handle using PHP. You could also look at using the mb_* functions for working with multibyte strings.
Are you sure htmlentities is the problem? If the content is UTF-8, and your site is set to serve ISO-8859-1, you're going to see odd characters. Check the encoding your browser is using to make sure it matches the encoding of the characters you're producing.
I don't see any issue with using htmlentities() as long as you pass false as the last parameter. This will ensure that you don't encode anything twice (such as turning &amp; into &amp;amp;).
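The effect of that last parameter (double_encode) can be checked with a short sketch; the sample string is made up for illustration.

```php
<?php
// Scraped text that already contains one entity plus a raw UTF-8 é.
$content = "Tom &amp; Jerry caf\xC3\xA9";

$once  = htmlentities($content, ENT_COMPAT, 'UTF-8', false);  // existing entities left alone
$twice = htmlentities($content, ENT_COMPAT, 'UTF-8', true);   // existing entities encoded again

echo $once, "\n";   // Tom &amp; Jerry caf&eacute;
echo $twice, "\n";  // Tom &amp;amp; Jerry caf&eacute;
```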
