Getting the title of a page in PHP - php

When I want to get the title of a remote webiste, I use this script:
function get_remotetitle($urlpage) {
$file = #fopen(($urlpage),"r");
$text = fread($file,16384);
if (preg_match('/<title>(.*?)<\/title>/is',$text,$found)) {
$title = $found[1];
} else {
$title = 'Title N/A';
}
return $title;
}
But when I parase a webiste title with accents, I get "�". But if I look in PHPMyAdmin, I see the accents correctly. What's happening?

This is most likely a character encoding issue. You are probably getting the character correctly but the page that displays it has the wrong character encoding so it doesn't display right.

check out PHP Simple HTML DOM Parser
use it something like:
$html = file_get_html('http://www.google.com/');
$ret = $html->find('title', 0);

The trouble is that the text has a different encoding from what you're using on the page you're displaying it on.
What you want to do is find out what encoding the data is (for instance by looking at what encoding the page you take the text from is using) and converting it to the encoding you're using yourself.
For doing the actual conversion, you can use iconv (for the general case), utf8_decode (UTF8 -> ISO-8859-1), utf8_encode (ISO-8859-1 -> UTF8) or mb_convert_encoding.
To help you find out what the encoding of the source page is, you could for instance put the website through the w3c Validator which automatically detects encoding.
If want an automatic way to determine encoding, you'll have to look at the HTML itself. The ways you can determine the selected charset can be fonud in the HTML 4 specification.
In addition, it's worth having a look at The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) for a bit more information on encoding.

I solved it. I added htmlentities($text) and now displays the accents and so.

Try this:
echo iconv('UTF-8', 'ASCII//TRANSLIT', $title);

Related

convert special characters to regular alphabet in php

I'm trying to build a search page for a bunch of menu items in my database which often contain special characters like é (as in sautéed), and so I want to convert both the search query and the database content to regular alphabets, and I'm having trouble. I'm using ISO-8859-1 so that special characters will display properly on the website, and I get the feeling this is hindering my attempts at conversion...
header('Content-Type: text/html; charset=ISO-8859-1');
The search query is sent to search.php using the GET method, so the query "sautéed" will appear like this in the address bar:
search.php?q=saut%E9ed
This is the function I'm trying to build, that's not working:
$q = $_GET['q'];
function clean_str($a) {
$fix = array('é' => 'e');
$str = str_replace(array_keys($fix), array_values($fix), $a);
return $str;
}
$fixed = clean_str($q); // currently has no effect
I'm tried using %29 as the array key, as well as the HTML character code (é). I've tried utf8_encode($q); to no avail. Other characters like ! and + work fine in the clean_str() function, but not special alphabets like é.
Though you might want to reconsider the way you're doing this, as has been suggested, I believe this will get you there.
function clean_str($a) {
$fix = array('é' => 'e');
$str = str_replace(array_keys($fix), array_values($fix), $a);
return $str;
}
$fixed = clean_str(utf8_encode($_GET['q'])); // return an encoded utf8 string.
echo $fixed;
For more on utf8_encode see here.
To wit, é is the regular alphabet in several languages =) While you're suggesting you would like to know how to covert the text to ASCII (which English speakers may consider 'regular') what you really should be doing is working with the modern web's most permissive encoding, which is UTF8.
That way, you will be able to accept input in any language, save it, process it, and serve it back up, without needing to normalise or ill-convert to another codepage.
Serve your pages with <meta charset="utf-8"> in the source code, and an http content header to indicate UTF8 encoding, and things should go a lot smoother. (note that for the now defunct HTML 4.01 or XHTML 1/1.1 you will need to use the older meta tag syntax. Using those flavours for new projects is, however, very much not recommended)

How should I deal with character encodings when storing crawled web content for a search engine into a MySQL database?

I have a crawler that downloads webpages, scrapes specific content and then stores that content into a MySQL database. Later that content is displayed on a webpage when it's searched for ( standard search engine type setup ).
The content is generally of two different encoding types... UTF-8 or ISO-8859-1 or it is not specified. My database tables use cp1252 west european ( latin1 ) encoding. Up until now, I've simply filtered all characters that are not alphanumeric, spaces or punctuation using a regular expression before storing the content to MySQL. For the most part, this has eliminated all character encoding problems, and content is displayed properly when recalled and outputted to HTML. Here is the code I use:
function clean_string( $string )
{
$string = trim( $string );
$string = preg_replace( '/[^a-zA-Z0-9\s\p{P}]/', '', $string );
$string = $mysqli->real_escape_string( $string );
return $string;
}
I now need to start capturing "special" characters like trademark, copyright, and registered symbols, and am having trouble. No matter what I try, I end up with weird characters when I redisplay the content in HTML.
From what I've read, it sounds like I should use UTF-8 for my database encoding. How do I ensure all my data is converted properly before storing it to the database? Remember that my original content comes from all over the web in various encoding formats. Are there other steps I'm overlooking that may be giving me problems?
You should convert your database encoding to UTF-8.
About the content: for every page you crawl, fetch the page's encoding (from HTTP header/
meta charset) and use that encoding to convert to utf-8 like this:
$string = iconv("UTF-8", "THIS STRING'S ENCODING", $string);
Where THIS STRING'S ENCODING is the one you just grabbed as described above.
PHP manual on iconv: http://be2.php.net/manual/en/function.iconv.php
UTF-8 encompasses just about everything. It would definitely be my choice.
As far as storing the data, just ensure the connection to your database is using the proper charset. See the manual.
To deal with the ISO encoding, simply use utf8_encode when you store it, and utf8_decode when you retrieve it.
Try doing the encoding/decoding even when it's supposedly UTF-8 and see if that works for you. I've often seen people say something is UTF-8 when it isn't.
You'll also need to change your database to UTF-8.
Below worked for me when I am scraping and presenting the data on html page.
While scraping the data from external website do an utf8_encode:utf8_encode(trim(str_replace(array("\t","\n\r","\n","\r"),"",trim($th->plaintext))));
Before writing to the HTML page set the charset to utf-8 : <meta charset="UTF-8">
While writing of echoing out on html do an utf8_decode.echo "Menu Item:". utf8_decode ($value['item'])
This helped me to solve problem with my html scraping issues. Hope someone else finds it useful.

XML charactor encoding issues with accents

I have had the problem a few times now while working on projects and I would like to know if there's an elegant solution.
Problem
I am pulling tweets via XML from twitter and uploading them to my DB however when I output them to screen I get these characters:
"moved to dusseldorf.�"
OR
también
and if I have Russian characters then I get lots of ugly boxes in place.
What I would like is the correct native accents to show under one encoding. I thought was possible with UTF-8.
What I am using
PHP, MYSQL
After reading in the XML file I am doing the following to cleanse the data:
$data = trim($data);
$data = htmlentities($data);
$data = mysql_real_escape_string($data);
My Database Collation is: utf8_general_ci
Web page character set is: charset=UTF-8
I think it could have something to do with HTML entities but I really appreciate a solution that works across the board on projects.
Thanks in advance.
Replace this line:
$data = htmlentities($data);
With this:
$data = htmlentities($data, null, "UTF-8");
That way, htmlentities() will leave valid UTF-8 characters alone. For more information see the documentation for htmlentities().
You need to change your connection's encoding to UTF-8 (it's usually iso-8859-1). See here: How can I store the '€' symbol in MySQL using PHP?
Calling htmlentities() is unnecessary when you get the encodings right. I would remove it completely. You'll just have to be careful to use htmlspecialchars() when outputting the data a in HTML context.
Make sure that you set your php internal encoding ot UTF8 using iconv_set_encoding, and that you call htmlentities with the encoding information as EdoDodo said. Also make sure that you're database stores with UTF8-encoding, though you say that's already the case.
You can't use htmlentities() in it's default state for XML data, because this function produces HTML entities, not XML entities.
The difference is that the HTML DTD defines a bunch of entity codes which web browsers are programmed to interpret. But most XML DTDs don't define them (if the XML even has a DTD).
The only entitity codes that are available by default to XML are >, < and &. All other entities need to be presented using their numeric entity.
PHP doesn't have an xmlentities() function, but if you read the manual page for htmlentities(), you'll see in the comments that that plenty of people have had this same issue and have posted their solutions. After a quick browse through it, I'd suggest looking at the one named philsXMLClean().
Hope that helps.

PHP problem character set

I have a problem where users upload zipped text files. After I extract text contents I import them in mysql database. But later when I display the text in browser some characters are garbled. I tried to encode them but I am unable to detect the encoding of the text files with PHP and convert to UTF-8 with iconv or mbstring.
Mysql database charset is UTF-8.
header('Content-type: text/html; charset=utf-8');
is added.
Tried with
iconv('UTF-8', 'UTF-8//IGNORE', $text_file_contents)
But it simply removes the garbled chars: � which should be either ' or " when I checked manually with Firefox browser. Firefox showed that is ISO-8859-1 but I can not check for every article they send (articles may be in different character set).
How to convert this characters to UTF-8 ?
EDIT:
This is a modified function I found on
http://php.net/manual/en/function.mb-detect-encoding.php
origanlly written by prgss at bk dot ru .
function myutf8_detect_encoding($string, $default = 'UTF-8', $encode = 0, $encode_to = 'UTF-8') {
static $list = array('UTF-8', 'ISO-8859-1', 'ASCII', 'windows-1250', 'windows-1251', 'latin1', 'windows-1252', 'windows-1253', 'windows-1254', 'windows-1255', 'windows-1256', 'windows-1257', 'windows-1258', 'ISO-8859-2', 'ISO-8859-3', 'GBK', 'GB2312', 'GB18030', 'MACROMAN', 'ISO-8859-4', 'ISO-8859-5', 'ISO-8859-6', 'ISO-8859-7', 'ISO-8859-8', 'ISO-8859-9', 'ISO-8859-10', 'ISO-8859-11', 'ISO-8859-12', 'ISO-8859-13', 'ISO-8859-14', 'ISO-8859-15', 'ISO-8859-16');
foreach ($list as $item) {
$sample = iconv($item, $item, $string);
if (md5($sample) == md5($string)) {
if ($encode == 1)
return iconv($item, $encode_to, $string);
else
return $item;
}
}
if ($encode == 1)
return iconv($encode_to, $encode_to . '//IGNORE', $string);
else
return $default;
}
and in my code I use:
myutf8_detect_encoding(trim($description), 'UTF-8', 1)
but it still returns garbled characters of this text “old is gold’’ .
This is indeed tricky.
Detecting an arbitrary string's encoding using detect_encoding... is known to be not very reliable (although it should be able to distinguish between UTF-8 and ISO-8859-1 for example - make sure you give it a try first.)
If the auto-detection doesn't work out, there is the option of displaying the content to the user before it gets submitted, along with a drop-down menu to switch between the most used encodings. Then show a message like
Please check your submission. If you are seeing incorrect or garbled characters, please change the encoding in the drop-down menu until the content is correct.
Whenever the user changes the drop-down value, your script will pull the content again, use iconv() to convert it from the specified encoding to UTF-8, and output the result, until it looks good.
This needs some finesse in designing the User Interface to be understandable for the end user, but it would often be the best option. Especially if you are dealing with users from many different regions or continents with a lot of different encodings.
Having had the same problem of encoding detection, I made a php function that outputs different information about the string and should make it relatively easy to identify the encoding used.
http://php.net/manual/en/function.ord.php (function hex_chars by "manixrock(hat)gmail(doink)com").
It shows the values of the characters inside the string, as well as the values of each individual byte. You look at the output and see which of your suspected encodings matches the bytes. You should first familiarize yourself with the various popular encodings like UTF-8, UTF-16, ISO-8859-X (understand their byte storage). Also make sure you test the string as unaltered as possible (take care how the encoding might change between what PHP outputs and what the browser receives, how the browser displays, or if you get the string from another source like MySQL or a file how that may change the encoding).
This helped me detect that a text had undergone the conversions: (UTF-8 to byte[]) then (ISO-8859-1 to UTF-8). That function helped a lot. Hope it helps you.
Use mb_detect_encoding to find out what encoding is used, then iconv to convert.
Try to insert right after the mysql connection:
mysql_query("SET NAMES utf8");

PHP output showing little black diamonds with a question mark

I'm writing a php program that pulls from a database source. Some of the varchars have quotes that are displaying as black diamonds with a question mark in them (�, REPLACEMENT CHARACTER, I assume from Microsoft Word text).
How can I use php to strip these characters out?
If you see that character (� U+FFFD "REPLACEMENT CHARACTER") it usually means that the text itself is encoded in some form of single byte encoding but interpreted in one of the unicode encodings (UTF8 or UTF16).
If it were the other way around it would (usually) look something like this: ä.
Probably the original encoding is ISO-8859-1, also known as Latin-1. You can check this without having to change your script: Browsers give you the option to re-interpret a page in a different encoding -- in Firefox use "View" -> "Character Encoding".
To make the browser use the correct encoding, add an HTTP header like this:
header("Content-Type: text/html; charset=ISO-8859-1");
or put the encoding in a meta tag:
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
Alternatively you could try to read from the database in another encoding (UTF-8, preferably) or convert the text with iconv().
I also faced this � issue. Meanwhile I ran into three cases where it happened:
substr()
I was using substr() on a UTF8 string which cut UTF8 characters, thus the cut chars could not be displayed correctly. Use mb_substr($utfstring, 0, 10, 'utf-8'); instead. Credits
htmlspecialchars()
Another problem was using htmlspecialchars() on a UTF8 string. The fix is to use: htmlspecialchars($utfstring, ENT_QUOTES, 'UTF-8');
preg_replace()
Lastly I found out that preg_replace() can lead to problems with UTF. The code $string = preg_replace('/[^A-Za-z0-9ÄäÜüÖöß]/', ' ', $string); for example transformed the UTF string "F(×)=2×-3" into "F � 2� ". The fix is to use mb_ereg_replace() instead.
I hope this additional information will help to get rid of such problems.
This is a charset issue. As such, it can have gone wrong on many different levels, but most likely, the strings in your database are utf-8 encoded, and you are presenting them as iso-8859-1. Or the other way around.
The proper way to fix this problem, is to get your character-sets straight. The simplest strategy, since you're using PHP, is to use iso-8859-1 throughout your application. To do this, you must ensure that:
All PHP source-files are saved as iso-8859-1 (Not to be confused with cp-1252).
Your web-server is configured to serve files with charset=iso-8859-1
Alternatively, you can override the webservers settings from within the PHP-document, using header.
In addition, you may insert a meta-tag in you HTML, that specifies the same thing, but this isn't strictly needed.
You may also specify the accept-charset attribute on your <form> elements.
Database tables are defined with encoding as latin1
The database connection between PHP to and database is set to latin1
If you already have data in your database, you should be aware that they are probably messed up already. If you are not already in production phase, just wipe it all and start over. Otherwise you'll have to do some data cleanup.
A note on meta-tags, since everybody misunderstands what they are:
When a web-server serves a file (A HTML-document), it sends some information, that isn't presented directly in the browser. This is known as HTTP-headers. One such header, is the Content-Type header, which specifies the mimetype of the file (Eg. text/html) as well as the encoding (aka charset).
While most webservers will send a Content-Type header with charset info, it's optional. If it isn't present, the browser will instead interpret any meta-tags with http-equiv="Content-Type". It's important to realise that the meta-tag is only interpreted if the webserver doesn't send the header. In practice this means that it's only used if the page is saved to disk and then opened from there.
This page has a very good explanation of these things.
As mentioned in earlier answers, it is happening because your text has been written to the database in iso-8859-1 encoding, or any other format.
So you just need to convert the data to utf8 before outputting it.
$text = “string from database”;
$text = utf8_encode($text);
echo $text;
To make sure your MYSQL connection is set to UTF-8 (or latin1, depending on what you're using), you can do this to:
$con = mysql_connect("localhost","username","password");
mysql_set_charset('utf8',$con);
or use this to check what charset you are using:
$con = mysql_connect("localhost","username","password");
$charset = mysql_client_encoding($con);
echo "The current character set is: $charset\n";
More info here: http://php.net/manual/en/function.mysql-set-charset.php
I chose to strip these characters out of the string by doing this -
ini_set('mbstring.substitute_character', "none");
$text= mb_convert_encoding($text, 'UTF-8', 'UTF-8');
Just Paste This Code In Starting to The Top of Page.
<?php
header("Content-Type: text/html; charset=ISO-8859-1");
?>
Based on your description of the problem, the data in your database is almost certainly encoded as Windows-1252, and your page is almost certainly being served as ISO-8859-1. These two character sets are equivalent except that Windows-1252 has 16 extra characters which are not present in ISO-8859-1, including left and right curly quotes.
Assuming my analysis is correct, the simplest solution is to serve your page as Windows-1252. This will work because all characters that are in ISO-8859-1 are also in Windows-1252. In PHP you can change the encoding as follows:
header('Content-Type: text/html; charset=Windows-1252');
However, you really should check what character encoding you are using in your HTML files and the contents of your database, and take care to be consistent, or convert properly where this is not possible.
Add this function to your variables
utf8_encode($your variable);
Try This Please
mb_substr($description, 0, 490, "UTF-8");
This will help you. Put this inside <head> tag
<meta charset="iso-8859-1">
That can be caused by unicode or other charset mismatch. Try changing charset in your browser, in of the settings the text will look OK. Then it's question of how to convert your database contents to charset you use for displaying. (Which can actually be just adding utf-8 charset statement to your output.)
what I ended up doing in the end after I fixed my tables was to back it up and change back the settings to utf-8 then I altered my dump file so that DEFAULT CHARACTER SET utf8 COLLATE utf8_general_ci are my character set entries
now I don't have characterset issues anymore because the database and browser are utf8.
I figured out what caused it. It was the web page+browser effects on the DB. On the terminals that are linux (ubuntu+firefox) it was encoding the database in latin1 which is what the tabes are set. But on the windows 10+edge terminals, the entries were force coded into utf8. Also I noticed the windows 10 has issues staying with latin1 so I decided to bend with the wind and convert all to utf8.
I figured it was a windows 10 issue because we started using win 10 terminals.
so yet again microsoft bugs causes issues. I still don't know why the encoding changes on the forms because the browser in windows 10 shows the latin1 characterset but when it goes in its utf8 encoded and I get the data anomaly. but in linux+firefox it doesn't do that.
This happened to work in my case:
$text = utf8_decode($text)
I turns the black diamond character into a question mark so you can:
$text = str_replace('?', '', utf8_decode($text));
Just add these lines before headers.
Accurate format of .doc/docx files will be retrieved:
if(ini_get('zlib.output_compression'))
ini_set('zlib.output_compression', 'Off');
ob_clean();
When you extract data from anywhere you should use functions with the prefix md_FUNC_NAME.
Had the same problem it helped me out.
Or you can find the code of this symbol and use regexp to delete these symbols.
You can also change the caracter set in your browser. Just for debug reasons.
Using the same charset (as suggested here) in both the database and the HTML has not worked for me... So remembering that the code is generated as HTML, I chose to use the "(HTML code) or the " (ISO Latin-1 code) in my database text where quotes were used. This solved the problem while providing me a quotation mark. It is odd to note that prior to this solution, only some of the quotation marks and apostrophes did not display correctly while others did, however, the special code did work in all instances.
I ran the "detect encoding" code after my collation change in phpmyadmin and now it comes up as Latin_1.
but here is something I came across looking a different data anomaly in my application and how I fixed it:
I just imported a table that has mixed encoding (with diamond question marks in some lines, and all were in the same column.) so here is my fix code. I used utf8_decode process that takes the undefined placeholder and assigns a plain question mark in the place of the "diamond question mark " then I used str_replace to replace the question mark with a space between quotes.
here is the
[code]
include 'dbconnectfile.php';
//// the variable $db comes from my db connect file
/// inx is my auto increment column
/// broke_column is the column I need to fix
$qwy = "select inx,broke_column from Table ";
$res = $db->query($qwy);
while ($data = $res->fetch_row()) {
for ($m=0; $m<$res->field_count; $m++) {
if ($m==0){
$id=0;
$id=$data[$m];
echo $id;
}else if ($m==1){
$fix=0;
$fix=$data[$m];
$fix = utf8_decode($fix);
$fixx =str_replace("?"," ",$fix);
echo $fixx;
////I echoed the data to the screen because I like to see something as I execute it :)
}
}
$insert= "UPDATE Table SET broke_column='".$fixx."' where inx='".$id."'";
$insresult= $db->query($insert);
echo"<br>";
}
?>
For global purposes.
Instead of converting, codifying, decodifying each text I prefer to let them as they are and instead change the server php settings.
So,
Let the diamonds
From the browser, on the view menu select
"text encoding" and find the one which let's you see your text
correctly.
Edit your php.ini and add:
default_charset = "ISO-8859-1"
or instead of ISO-8859 the one which fits your text encoding.
Go to your phpmyadmin and select your database and just increase the length/value of that table's field to 500 or 1000 it will solve your problem.

Categories