I am building a search engine and web crawler in PHP, and I would like to detect the language of each website. How can I detect the language of a page by:
Checking the URL for a language parameter, e.g. https://twitter.com/?lang=jap
If that is not set, I would like to:
Check the top-level domain of the URL, e.g. https://www.google.co.jp/
If I still can't find anything, I would like to default to English.
The code I have so far for scraping pages is:
function crawl($url){
// $weblinks is shared with the caller; declare it before first use
global $weblinks;
$html = file_get_html($url);
if($html && is_object($html) && isset($html->nodes)){
$weblinks[]=$url;
foreach($html->find('a') as $element) {
$link = $element->href;
$base_url = parse_url($url, PHP_URL_HOST);
if(substr($link,0,7)=="http://"){
$link = $link;
}else if(substr($link,0,8)=="https://"){
$link = $link;
}else if(substr($link,0,2)=="//"){
$link = substr($link, 2);
}else if(substr($link,0,1)=="#"){
$link = $url; // a fragment-only link points back at the current page
}else if(substr($link,0,7)=="mailto:"){
$link = "";
}else if(substr($link,0,11)=="javascript:"){
$link = "";
}else{
if(substr($link, 0, 1) != "/"){
$link = $base_url."/".$link;
}else{
$link = $base_url . $link;
}
}
if(substr($link, 0, 7) != "http://" && substr($link, 0, 8) != "https://" && $link != ""){
if(substr($url, 0, 8) == "https://"){
$link = "https://".$link;
}else{
$link = "http://".$link;
}
}
if(!in_array($link, $weblinks)){
$weblinks[]=$link;
}
}
$html->clear();
}else{
}
}
function info($weblinks){
foreach($weblinks as $link) {
$linkhtml = file_get_html("$link");
if($linkhtml && is_object($linkhtml) && isset($linkhtml->nodes)){
$titleraw = $linkhtml->find('title',0);
$title = $titleraw->innertext;
$des = $linkhtml->find("meta[name='description']",0)->content;
//detect language here
echo "<tr><td>".$title."</td><td>".$link."</td><td>".$des."</td></tr>";
$sql = mysql_query("INSERT into web once");
$title = "";
$des = "";
$linkhtml->clear();
}
}
}
To get the language from ?lang=:
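$url = 'http://www.domain.org/?lang=IT';
// parse_url() needs a scheme to split the host from the query string
$query = parse_url($url, PHP_URL_QUERY); // "lang=IT"
// parse_str() fills $params by reference; it does not return a value
parse_str((string) $query, $params);
$lang = isset($params['lang']) ? strtoupper($params['lang']) : FALSE;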
You should then validate this with a switch/case statement and a list of languages that you support, like this:
switch ($lang) {
case 'EN':
//language is English
break;
case 'IT':
//language is Italian
break;
case 'FR':
//language is French
break;
default:
//?lang query was empty, or contained an unsupported language
$lang = FALSE;
} //end switch
After that, you can use this logic to determine whether you need to check the URL for the language:
if ($lang == FALSE) {
//code to determine language from TLD
}
Hopefully this will help get you started, although this is a big can of worms you're opening up. There are other things you need to check in order to be certain of the language of a website in addition to what you've mentioned. One of them is the language meta tag, which is like this: <meta name="language" content="english"> and goes in the head of the webpage, though not all websites use it.
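As a rough sketch of that meta tag check, assuming the page source is already in a string: the function name `languageFromMarkup` is made up for this example, and I've also included the `lang` attribute on the `<html>` element, which isn't mentioned above but which many pages set. The regular expressions are only a quick approximation; a real HTML parser would be more reliable.

```php
<?php
// Hypothetical helper: returns the language a page declares for itself, or null.
function languageFromMarkup(string $html): ?string
{
    // 1. <html lang="ja"> (quotes around the value are optional here)
    if (preg_match('/<html\b[^>]*\blang\s*=\s*["\']?([a-zA-Z-]+)/i', $html, $m)) {
        return strtolower($m[1]);
    }
    // 2. <meta name="language" content="english">
    //    (simplification: only matches when name comes before content)
    if (preg_match('/<meta\b[^>]*\bname\s*=\s*["\']language["\'][^>]*\bcontent\s*=\s*["\']([^"\']+)["\']/i', $html, $m)) {
        return strtolower(trim($m[1]));
    }
    return null;
}

var_dump(languageFromMarkup('<html lang="ja"><head></head></html>'));
var_dump(languageFromMarkup('<html><head><meta name="language" content="english"></head></html>'));
var_dump(languageFromMarkup('<html><head></head></html>'));
```

Note that the two sources return different kinds of values (a code like "ja" versus a free-text name like "english"), so you would still need to normalise them before storing.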
Some multilingual websites, like mine, use a subdomain like http://it.website.com or http://fr.website.com
Others use query strings that are different from ?lang=. So you'll need to do a significant amount of research to cover all your bases.
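To tie the URL checks together, here is a minimal sketch of the order the question describes: query string first, then subdomain, then country-code TLD, then English. The function name `guessLanguageFromUrl`, the list of parameter names and the TLD map are all assumptions for illustration, and the result is not validated against any list of real language codes.

```php
<?php
// Hypothetical helper: guess a language code from a URL, defaulting to English.
function guessLanguageFromUrl(string $url, string $default = 'en'): string
{
    $parts = parse_url($url);
    if ($parts === false || empty($parts['host'])) {
        return $default;
    }

    // 1. Query string, e.g. ?lang=it (the parameter names are assumptions)
    parse_str($parts['query'] ?? '', $query);
    foreach (['lang', 'hl', 'language'] as $name) {
        if (!empty($query[$name]) && is_string($query[$name])) {
            return strtolower($query[$name]);
        }
    }

    $labels = explode('.', strtolower($parts['host']));

    // 2. Two-letter subdomain, e.g. it.website.com (a heuristic that can misfire)
    if (count($labels) > 2 && strlen($labels[0]) === 2) {
        return $labels[0];
    }

    // 3. Country-code TLD, e.g. google.co.jp (illustrative, incomplete map)
    $tldToLanguage = ['jp' => 'ja', 'it' => 'it', 'fr' => 'fr', 'de' => 'de', 'es' => 'es'];
    $tld = end($labels);

    return $tldToLanguage[$tld] ?? $default;
}

echo guessLanguageFromUrl('https://twitter.com/?lang=jap'), "\n"; // jap
echo guessLanguageFromUrl('http://it.website.com/'), "\n";        // it
echo guessLanguageFromUrl('https://www.google.co.jp/'), "\n";     // ja
echo guessLanguageFromUrl('https://example.com/'), "\n";          // en
```

A country-code TLD only hints at a language (a .ch site could be German, French or Italian), so treat step 3 as a last resort.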
I use the following PHP script as the index for my website.
The script should include a specific page depending on the browser's language (detected automatically).
It does not work well with all browsers, however: it always includes index_en.php, whatever language is detected (most probably because some part of the Accept-Language header is not being considered).
Could you please suggest a more robust solution?
<?php
// Open session var
session_start();
// views: 1 = first visit; >1 = second visit
// Detect language from user agent browser
function lixlpixel_get_env_var($Var)
{
if(empty($GLOBALS[$Var]))
{
$GLOBALS[$Var]=(!empty($GLOBALS['_SERVER'][$Var]))?
$GLOBALS['_SERVER'][$Var] : (!empty($GLOBALS['HTTP_SERVER_VARS'][$Var])) ? $GLOBALS['HTTP_SERVER_VARS'][$Var]:'';
}
}
function lixlpixel_detect_lang()
{
// Detect HTTP_ACCEPT_LANGUAGE & HTTP_USER_AGENT.
lixlpixel_get_env_var('HTTP_ACCEPT_LANGUAGE');
lixlpixel_get_env_var('HTTP_USER_AGENT');
$_AL=strtolower($GLOBALS['HTTP_ACCEPT_LANGUAGE']);
$_UA=strtolower($GLOBALS['HTTP_USER_AGENT']);
// Try to detect Primary language if several languages are accepted.
foreach($GLOBALS['_LANG'] as $K)
{
if(strpos($_AL, $K)===0)
return $K;
}
// Try to detect any language if not yet detected.
foreach($GLOBALS['_LANG'] as $K)
{
if(strpos($_AL, $K)!==false)
return $K;
}
foreach($GLOBALS['_LANG'] as $K)
{
//if(preg_match("/[[( ]{$K}[;,_-)]/",$_UA)) // matching other letters (create an error for seo spyder)
return $K;
}
// Return default language if language is not yet detected.
return $GLOBALS['_DLANG'];
}
// Define default language.
$GLOBALS['_DLANG']='en';
// Define all available languages.
// WARNING: uncomment all available languages
$GLOBALS['_LANG'] = array(
'af', // afrikaans.
'ar', // arabic.
'bg', // bulgarian.
'ca', // catalan.
'cs', // czech.
'da', // danish.
'de', // german.
'el', // greek.
'en', // english.
'es', // spanish.
'et', // estonian.
'fi', // finnish.
'fr', // french.
'gl', // galician.
'he', // hebrew.
'hi', // hindi.
'hr', // croatian.
'hu', // hungarian.
'id', // indonesian.
'it', // italian.
'ja', // japanese.
'ko', // korean.
'ka', // georgian.
'lt', // lithuanian.
'lv', // latvian.
'ms', // malay.
'nl', // dutch.
'no', // norwegian.
'pl', // polish.
'pt', // portuguese.
'ro', // romanian.
'ru', // russian.
'sk', // slovak.
'sl', // slovenian.
'sq', // albanian.
'sr', // serbian.
'sv', // swedish.
'th', // thai.
'tr', // turkish.
'uk', // ukrainian.
'zh' // chinese.
);
// Redirect to the correct location.
// Example Implementation aff var lang to name file
/*
echo 'The Language detected is: '.lixlpixel_detect_lang(); // For Demonstration
echo "<br />";
*/
$lang_var = lixlpixel_detect_lang(); //insert lang var system in a new var for conditional statement
/*
echo "<br />";
echo $lang_var; // print var for trace
echo "<br />";
*/
// Insert the right page according to the language in the browser
switch ($lang_var){
case "fr":
//echo "PAGE FR";
include("index_fr.php");//include check session FR
break;
case "it":
//echo "PAGE IT";
include("index_it.php");
break;
case "en":
//echo "PAGE EN";
include("index_en.php");
break;
default:
//echo "PAGE EN - Setting Default";
include("index_en.php");//include EN in all other cases of different lang detection
break;
}
?>
Why don't you keep it simple and clean?
<?php
// default to an empty string when the header is missing
$lang = strtolower(substr($_SERVER['HTTP_ACCEPT_LANGUAGE'] ?? '', 0, 2));
$acceptLang = ['fr', 'it', 'en'];
$lang = in_array($lang, $acceptLang) ? $lang : 'en';
require_once "index_{$lang}.php";
?>
Accept-Language is a list of weighted values (see q parameter). That means just looking at the first language does not mean it’s also the most preferred; in fact, a q value of 0 means not acceptable at all.
So instead of just looking at the first language, parse the list of accepted languages and available languages and find the best match:
// parse list of comma separated language tags and sort it by the quality value
function parseLanguageList($languageList) {
if (is_null($languageList)) {
if (!isset($_SERVER['HTTP_ACCEPT_LANGUAGE'])) {
return array();
}
$languageList = $_SERVER['HTTP_ACCEPT_LANGUAGE'];
}
$languages = array();
$languageRanges = explode(',', trim($languageList));
foreach ($languageRanges as $languageRange) {
if (preg_match('/(\*|[a-zA-Z0-9]{1,8}(?:-[a-zA-Z0-9]{1,8})*)(?:\s*;\s*q\s*=\s*(0(?:\.\d{0,3})?|1(?:\.0{0,3})?))?/', trim($languageRange), $match)) {
// the fractional part is optional, so "q=0" and "q=1" are parsed too;
// normalise the value so that "1", "1.0" and a missing q share one key
$match[2] = isset($match[2]) ? (string) floatval($match[2]) : '1';
if (!isset($languages[$match[2]])) {
$languages[$match[2]] = array();
}
$languages[$match[2]][] = strtolower($match[1]);
}
}
krsort($languages);
return $languages;
}
// compare two parsed arrays of language tags and find the matches
function findMatches($accepted, $available) {
$matches = array();
$any = false;
foreach ($accepted as $acceptedQuality => $acceptedValues) {
$acceptedQuality = floatval($acceptedQuality);
if ($acceptedQuality === 0.0) continue;
foreach ($available as $availableQuality => $availableValues) {
$availableQuality = floatval($availableQuality);
if ($availableQuality === 0.0) continue;
foreach ($acceptedValues as $acceptedValue) {
if ($acceptedValue === '*') {
$any = true;
}
foreach ($availableValues as $availableValue) {
$matchingGrade = matchLanguage($acceptedValue, $availableValue);
if ($matchingGrade > 0) {
$q = (string) ($acceptedQuality * $availableQuality * $matchingGrade);
if (!isset($matches[$q])) {
$matches[$q] = array();
}
if (!in_array($availableValue, $matches[$q])) {
$matches[$q][] = $availableValue;
}
}
}
}
}
}
if (count($matches) === 0 && $any) {
$matches = $available;
}
krsort($matches);
return $matches;
}
// compare two language tags and distinguish the degree of matching
function matchLanguage($a, $b) {
$a = explode('-', $a);
$b = explode('-', $b);
for ($i=0, $n=min(count($a), count($b)); $i<$n; $i++) {
if ($a[$i] !== $b[$i]) break;
}
return $i === 0 ? 0 : (float) $i / count($a);
}
$accepted = parseLanguageList($_SERVER['HTTP_ACCEPT_LANGUAGE'] ?? null);
var_dump($accepted);
$available = parseLanguageList('en, fr, it');
var_dump($available);
$matches = findMatches($accepted, $available);
var_dump($matches);
If findMatches returns an empty array, no match was found and you can fall back on the default language.
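That fallback step could look something like the sketch below. `pickBestLanguage` is a made-up helper name; it only assumes the array shape that findMatches returns (quality value as key, list of language tags as value, already sorted best-first).

```php
<?php
// Hypothetical helper: take the best tag from a findMatches()-style result,
// or the default language when nothing matched.
function pickBestLanguage(array $matches, string $default = 'en'): string
{
    foreach ($matches as $languages) {
        if (!empty($languages)) {
            // first tag of the highest-quality group
            return reset($languages);
        }
    }
    return $default;
}

echo pickBestLanguage(['0.8' => ['fr'], '0.5' => ['en', 'it']]), "\n"; // fr
echo pickBestLanguage([]), "\n";                                        // en
```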
The existing answers are a little too verbose so I created this smaller, auto-matching version.
function prefered_language(array $available_languages, $http_accept_language) {
$available_languages = array_flip($available_languages);
$langs = array(); // initialise so arsort() always receives an array
preg_match_all('~([\w-]+)(?:[^,\d]+([\d.]+))?~', strtolower($http_accept_language), $matches, PREG_SET_ORDER);
foreach($matches as $match) {
list($a, $b) = explode('-', $match[1]) + array('', '');
$value = isset($match[2]) ? (float) $match[2] : 1.0;
if(isset($available_languages[$match[1]])) {
$langs[$match[1]] = $value;
continue;
}
if(isset($available_languages[$a])) {
$langs[$a] = $value - 0.1;
}
}
arsort($langs);
return $langs;
}
And the sample usage:
//$_SERVER["HTTP_ACCEPT_LANGUAGE"] = 'en-us,en;q=0.8,es-cl;q=0.5,zh-cn;q=0.3';
// Languages we support
$available_languages = array("en", "zh-cn", "es");
$langs = prefered_language($available_languages, $_SERVER["HTTP_ACCEPT_LANGUAGE"]);
/* Result
Array
(
[en] => 0.8
[es] => 0.4
[zh-cn] => 0.3
)*/
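One well-tested way to handle this is the PECL HTTP extension. Unlike some answers here, it correctly handles the language priorities (q-values) and partial language matches, and it returns the closest match or, when there are no matches, falls back to the first language in your array. Note that http_negotiate_language() is the procedural API of the 1.x releases of the extension; as far as I know, later major versions expose language negotiation through the http\Env class instead, so check which version you have installed.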
PECL HTTP:
http://pecl.php.net/package/pecl_http
How to use:
http://php.net/manual/fa/function.http-negotiate-language.php
$supportedLanguages = [
'en-US', // first one is the default/fallback
'fr',
'fr-FR',
'de',
'de-DE',
'de-AT',
'de-CH',
];
// Returns the negotiated language
// or the default language (i.e. first array entry) if none match.
$language = http_negotiate_language($supportedLanguages, $result);
The problem with the selected answer above is that the user's first choice may be a language that's not in the case structure, while one of their other choices is supported. You should loop until you find a match.
This is a super simple solution that works better. Browsers normally list the languages in order of preference (the q-values only make that order explicit), so that simplifies the problem. While the language designator can be more than two characters (e.g. "EN-US"), typically the first two are sufficient. In the following code example I'm looking for a match in a list of known languages my program is aware of.
$known_langs = array('en','fr','de','es');
$user_pref_langs = explode(',', $_SERVER['HTTP_ACCEPT_LANGUAGE'] ?? '');
foreach($user_pref_langs as $idx => $lang) {
// entries may carry leading whitespace or upper-case letters
$lang = strtolower(substr(trim($lang), 0, 2));
if (in_array($lang, $known_langs)) {
echo "Preferred language is $lang";
break;
}
}
I hope you find this a quick and simple solution that you can easily use in your code. I've been using this in production for quite a while.
Try this one:
#########################################################
# Copyright © 2008 Darrin Yeager #
# https://www.dyeager.org/ #
# Licensed under BSD license. #
# https://www.dyeager.org/downloads/license-bsd.txt #
#########################################################
function getDefaultLanguage() {
if (isset($_SERVER["HTTP_ACCEPT_LANGUAGE"]))
return parseDefaultLanguage($_SERVER["HTTP_ACCEPT_LANGUAGE"]);
else
return parseDefaultLanguage(NULL);
}
function parseDefaultLanguage($http_accept, $deflang = "en") {
if(isset($http_accept) && strlen($http_accept) > 1) {
# Split possible languages into array
$x = explode(",",$http_accept);
$lang = array();
foreach ($x as $val) {
# entries after the first usually start with a space
$val = trim($val);
#check for q-value and create associative array. No q-value means 1 by rule
if(preg_match("/(.*);\s*q=([0-1]{0,1}\.?\d{0,4})/i",$val,$matches))
$lang[$matches[1]] = (float)$matches[2];
else
$lang[$val] = 1.0;
}
#return default language (highest q-value)
$qval = 0.0;
foreach ($lang as $key => $value) {
if ($value > $qval) {
$qval = (float)$value;
$deflang = $key;
}
}
}
return strtolower($deflang);
}
https://www.dyeager.org/blog/2008/10/getting-browser-default-language-php.html
Unfortunately, none of the answers to this question takes into account some HTTP_ACCEPT_LANGUAGE values that clients really send, such as:
q=0.8,en-US;q=0.5,en;q=0.3: the q priority value comes first.
ZH-CN: old browsers that (wrongly) capitalise the whole language code.
*: which basically says "serve whatever language you have".
After a comprehensive test with thousands of different Accept-Languages that reached my server, this is my language detection method:
define('SUPPORTED_LANGUAGES', ['en', 'es']);
function detect_language($fallback='en') {
foreach (preg_split('/[;,]/', $_SERVER['HTTP_ACCEPT_LANGUAGE'] ?? '') as $sub) {
// normalise first, so " en" and "ZH-CN" compare correctly and are returned in lower case
$sub = strtolower(trim($sub));
if (substr($sub, 0, 2) == 'q=') continue;
if (strpos($sub, '-') !== false) $sub = explode('-', $sub)[0];
if (in_array($sub, SUPPORTED_LANGUAGES)) return $sub;
}
return $fallback;
}
The following script is a modified version of Xeoncross's code (thank you for that, Xeoncross). It starts from a default language and, if any of the browser's languages is supported, replaces the default with the supported language that has the highest priority. If none match, the default stays in place.
In this scenario the user's browser is set, in order of priority, to Spanish, Dutch, US English and English. The application supports only English and Dutch, with no regional variations, and English is the default language. Because the priorities are compared explicitly, the order of the values in the "HTTP_ACCEPT_LANGUAGE" string does not matter, even if for some reason the browser does not order them correctly.
$supported_languages = array("en","nl");
$supported_languages = array_flip($supported_languages);
var_dump($supported_languages); // array(2) { ["en"]=> int(0) ["nl"]=> int(1) }
$http_accept_language = $_SERVER["HTTP_ACCEPT_LANGUAGE"]; // es,nl;q=0.8,en-us;q=0.5,en;q=0.3
preg_match_all('~([\w-]+)(?:[^,\d]+([\d.]+))?~', strtolower($http_accept_language), $matches, PREG_SET_ORDER);
$available_languages = array();
foreach ($matches as $match)
{
list($language_code,$language_region) = explode('-', $match[1]) + array('', '');
$priority = isset($match[2]) ? (float) $match[2] : 1.0;
$available_languages[][$language_code] = $priority;
}
var_dump($available_languages);
/*
array(4) {
[0]=>
array(1) {
["es"]=>
float(1)
}
[1]=>
array(1) {
["nl"]=>
float(0.8)
}
[2]=>
array(1) {
["en"]=>
float(0.5)
}
[3]=>
array(1) {
["en"]=>
float(0.3)
}
}
*/
$default_priority = (float) 0;
$default_language_code = 'en';
foreach ($available_languages as $key => $value)
{
$language_code = key($value);
$priority = $value[$language_code];
if ($priority > $default_priority && array_key_exists($language_code,$supported_languages))
{
$default_priority = $priority;
$default_language_code = $language_code;
var_dump($default_priority); // float(0.8)
var_dump($default_language_code); // string(2) "nl"
}
}
var_dump($default_language_code); // string(2) "nl"
Quick and simple:
$language = trim(substr( strtok(strtok($_SERVER['HTTP_ACCEPT_LANGUAGE'], ','), ';'), 0, 5));
NOTE:
The first language code is normally the browser's primary (most preferred) language; the rest are other languages the user has set up in the browser.
Some languages have a region code, e.g. en-GB; others just have the language code, e.g. sk.
If you just want the language and not the region (e.g. en, fr, es, etc.), you can use:
$language = substr($_SERVER['HTTP_ACCEPT_LANGUAGE'], 0, 2);
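If you need both parts, here is a small sketch that splits the first tag into language and region. `firstLanguageTag` is a hypothetical name, and treating the second subtag as the region is a simplification (tags such as zh-Hant-TW put a script there).

```php
<?php
// Hypothetical helper: split the first tag of an Accept-Language value
// into a lowercase language and an uppercase region (null if absent).
function firstLanguageTag(string $header): array
{
    // Keep only the first entry and drop any ";q=..." suffix.
    $tag = trim(explode(';', explode(',', $header)[0])[0]);
    $pieces = explode('-', str_replace('_', '-', $tag));

    return [
        'language' => strtolower($pieces[0]),
        'region'   => isset($pieces[1]) ? strtoupper($pieces[1]) : null,
    ];
}

print_r(firstLanguageTag('en-GB,en;q=0.9,sk;q=0.8')); // language: en, region: GB
print_r(firstLanguageTag('sk'));                      // language: sk, region: null
```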
I think the cleanest way is this!
<?php
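$lang = strtolower(substr($_SERVER['HTTP_ACCEPT_LANGUAGE'] ?? '', 0, 2));
// browsers send the language code "el" for Greek; "gr" is the country code and would never match
$supportedLanguages=['en','fr','el'];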
if(!in_array($lang,$supportedLanguages)){
$lang='en';
}
require("index_".$lang.".php");
There is a method in php-intl extension:
locale_accept_from_http($_SERVER['HTTP_ACCEPT_LANGUAGE'])
For Laravel users, here's a single expression that returns a very clean collection (or array) of preferred languages:
$langs = Str::of($_SERVER['HTTP_ACCEPT_LANGUAGE'])
->explode(',')
->transform(fn($lang) => Str::substr(trim($lang), 0, 2))
->unique();
All of the above with fallback to 'en':
$lang = substr(explode(',',$_SERVER['HTTP_ACCEPT_LANGUAGE'])[0],0,2)?:'en';
...or with default language fallback and known language array:
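function lang( $l = ['en'], $u = null ){
// $u must be optional, otherwise lang(['de']) fails on current PHP versions
$u = $u ?: ( $_SERVER['HTTP_ACCEPT_LANGUAGE'] ?? '' );
$c = strtolower(
substr(
trim( explode( ',', $u )[0] ),
0,
2
)
);
// first known lang is the default
return in_array( $c, $l, true ) ? $c : $l[0];
}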
One Line:
function lang($l=['en'],$u=null){$c=strtolower(substr(trim(explode(',',$u?:($_SERVER['HTTP_ACCEPT_LANGUAGE']??''))[0]),0,2));return in_array($c,$l,true)?$c:$l[0];}
Examples:
// first known lang is always default
$_SERVER['HTTP_ACCEPT_LANGUAGE'] = 'en-us';
lang(['de']); // 'de'
lang(['de','en']); // 'en'
// manual set accept-language
lang(['de'],'en-us'); // 'de'
lang(['de'],'de-de, en-us'); // 'de'
lang(['en','fr'],'de-de, en-us'); // 'en'
lang(['en','fr'],'fr-fr, en-us'); // 'fr'
lang(['de','en'],'fr-fr, en-us'); // 'de'
Try,
$lang = substr($_SERVER['HTTP_ACCEPT_LANGUAGE'], 0,2);
if ($lang == 'tr') {
include_once('include/language/tr.php');
}elseif ($lang == 'en') {
include_once('include/language/en.php');
}elseif ($lang == 'de') {
include_once('include/language/de.php');
}elseif ($lang == 'fr') {
include_once('include/language/fr.php');
}else{
include_once('include/language/tr.php');
}
Since PHP 5.3.0 there is a Locale class bundled with the php-intl extension which has a method for this:
echo Locale::acceptFromHttp($_SERVER['HTTP_ACCEPT_LANGUAGE']);
or procedural style:
locale_accept_from_http($_SERVER['HTTP_ACCEPT_LANGUAGE']);
https://www.php.net/manual/en/locale.acceptfromhttp.php
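The method returns a locale such as "de_DE" rather than one of your own supported languages, so you still need a mapping step. A minimal sketch, assuming a made-up helper `detectSupportedLanguage` and a plain first-tag fallback for servers where intl is not installed:

```php
<?php
// Hypothetical helper: map the Accept-Language header to one of the
// languages the site supports, using intl when it is available.
function detectSupportedLanguage(string $header, array $supported, string $default = 'en'): string
{
    if (trim($header) === '') {
        return $default;
    }

    if (class_exists('Locale')) {
        // Returns a locale such as "de_DE", or false on failure.
        $locale = Locale::acceptFromHttp($header);
    } else {
        // Fallback without intl: just take the first tag in the list.
        $locale = trim(explode(';', explode(',', $header)[0])[0]);
    }

    if (!is_string($locale) || $locale === '') {
        return $default;
    }

    // Reduce "de_DE" or "de-DE" to the primary language "de".
    $language = strtolower(preg_split('/[-_]/', $locale)[0]);

    return in_array($language, $supported, true) ? $language : $default;
}

echo detectSupportedLanguage('de-DE,de;q=0.9,en;q=0.5', ['en', 'de']), "\n";
echo detectSupportedLanguage('', ['en', 'de']), "\n";
```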
I've got this one, which sets a cookie. As you can see, it first checks whether the user has posted a language, because the browser language does not always reflect what the user actually wants.
<?php
$lang = getenv("HTTP_ACCEPT_LANGUAGE");
$set_lang = explode(',', (string) $lang);
if (isset($_POST['lang']))
{
$taal = $_POST['lang'];
setcookie("lang", $taal);
}
else
{
// nothing may be echoed before header(), or the redirect fails
setcookie("lang", $set_lang[0]);
}
header('Location: /p/');
exit;
?>
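The order of precedence behind this idea can be sketched as a pure function: an explicit choice from the form wins over a stored cookie, which wins over the browser header. `resolveLanguage` and its parameters are hypothetical names for illustration; in real code you would pass in $_POST, $_COOKIE and the Accept-Language header.

```php
<?php
// Hypothetical helper: explicit choice (form post) wins over a stored
// cookie, which wins over the browser header; unsupported values are ignored.
function resolveLanguage(array $post, array $cookie, string $header, array $supported, string $default = 'en'): string
{
    $candidates = [
        $post['lang'] ?? null,
        $cookie['lang'] ?? null,
        explode(',', $header)[0],
    ];

    foreach ($candidates as $candidate) {
        if (!is_string($candidate)) {
            continue;
        }
        // Reduce e.g. "en-US" to "en" before checking the whitelist.
        $language = strtolower(substr(trim($candidate), 0, 2));
        if (in_array($language, $supported, true)) {
            return $language;
        }
    }
    return $default;
}

$supported = ['en', 'nl', 'fr'];
echo resolveLanguage(['lang' => 'nl'], ['lang' => 'fr'], 'en-US,en;q=0.9', $supported), "\n"; // nl
echo resolveLanguage([], ['lang' => 'fr'], 'en-US,en;q=0.9', $supported), "\n";               // fr
echo resolveLanguage([], [], 'de-DE,de;q=0.9', $supported), "\n";                             // en
```

Checking every candidate against the supported list also means an arbitrary posted value never ends up in the cookie unvalidated.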