I am building a search engine and web crawler in PHP, and I would like to detect the language of each website. How can I detect the language of a page by:
Checking the URL for a query parameter, as in https://twitter.com/?lang=jap
If that is not set, then I would like to:
Check the URL's top-level domain, as in https://www.google.co.jp/
If I still can't find anything, I would like to default to English.
The code I have so far for scraping pages is:
function crawl($url){
$html = file_get_html($url);
if($html && is_object($html) && isset($html->nodes)){
$weblinks[]=$url;
foreach($html->find('a') as $element) {
global $weblinks;
$link = $element->href;
$base_url = parse_url($url, PHP_URL_HOST);
if(substr($link,0,7)=="http://"){
$link = $link;
}else if(substr($link,0,8)=="https://"){
$link = $link;
}else if(substr($link,0,2)=="//"){
$link = substr($link, 2);
}else if(substr($link,0,1)=="#"){
$link = $html;
}else if(substr($link,0,7)=="mailto:"){
$link = "";
}else if(substr($link,0,11)=="javascript:"){
$link = "";
}else{
if(substr($link, 0, 1) != "/"){
$link = $base_url."/".$link;
}else{
$link = $base_url . $link;
}
}
if(substr($link, 0, 7) != "http://" && substr($link, 0, 8) != "https://" && $link != ""){
if(substr($url, 0, 8) == "https://"){
$link = "https://".$link;
}else{
$link = "http://".$link;
}
}
if(!in_array($link, $weblinks)){
$weblinks[]=$link;
}
}
$html->clear();
}else{
}
}
function info($weblinks){
foreach($weblinks as $link) {
$linkhtml = file_get_html("$link");
if($linkhtml && is_object($linkhtml) && isset($linkhtml->nodes)){
$titleraw = $linkhtml->find('title',0);
$title = $titleraw->innertext;
$des = $linkhtml->find("meta[name='description']",0)->content;
//detect language here
echo "<tr><td>".$title."</td><td>".$link."</td><td>".$des."</td></tr>";
$sql = mysql_query("INSERT into web once");
$title = "";
$des = "";
$linkhtml->clear();
}
}
}
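To get the language from ?lang=:
$url = 'www.domain.org?lang=IT';
// PHP_URL_QUERY returns just the query string, here "lang=IT"
$query = parse_url($url, PHP_URL_QUERY);
// parse_str() returns nothing; it fills the array passed as its second argument
parse_str((string) $query, $params);
$lang = isset($params['lang']) ? strtoupper($params['lang']) : FALSE;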
You should then validate this with a switch/case statement and a list of languages that you support, like this:
switch ($lang) {
case 'EN':
//language is English
break;
case 'IT':
//language is Italian
break;
case 'FR':
//language is French
break;
default:
//?lang query was empty, or contained an unsupported language
$lang = FALSE;
} //end switch
After that, you can use this logic to determine whether you need to check the URL for the language:
if ($lang == FALSE) {
//code to determine language from TLD
}
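The fallback chain described above (query parameter, then the host name, then English) could be sketched roughly as follows. The function name detectLanguageFromUrl, the supported-language list and the TLD map are all assumptions made for illustration; a country-code TLD is only a rough hint of the language, so treat this as a starting point rather than a finished implementation.

```php
<?php
// Hypothetical helper: the lists below are illustrative, not complete.
function detectLanguageFromUrl($url, $default = 'en')
{
    // Languages the crawler distinguishes (assumed).
    $supported = array('en', 'ja', 'it', 'fr', 'de', 'ru');
    // Country-code TLD => assumed main language (a heuristic only).
    $tldMap = array('jp' => 'ja', 'it' => 'it', 'fr' => 'fr', 'de' => 'de', 'ru' => 'ru');

    $parts = parse_url($url);
    if ($parts === false || empty($parts['host'])) {
        return $default;
    }

    // 1. An explicit ?lang= parameter wins if it names a supported language.
    if (!empty($parts['query'])) {
        parse_str($parts['query'], $params);
        if (isset($params['lang']) && is_string($params['lang'])) {
            $lang = strtolower($params['lang']);
            if (in_array($lang, $supported, true)) {
                return $lang;
            }
        }
    }

    $labels = explode('.', strtolower($parts['host']));

    // 2. A language subdomain such as it.website.com.
    if (count($labels) > 2 && in_array($labels[0], $supported, true)) {
        return $labels[0];
    }

    // 3. A country-code TLD such as google.co.jp.
    $tld = end($labels);
    if (isset($tldMap[$tld])) {
        return $tldMap[$tld];
    }

    // 4. Nothing matched: fall back to the default.
    return $default;
}
```

Note that a value such as lang=jap from the question is not a standard language code, so this sketch ignores it and falls through to the host name checks.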
Hopefully this will help get you started, although this is a big can of worms you're opening up. There are other things you need to check in order to be certain of the language of a website in addition to what you've mentioned. One of them is the language meta tag, which is like this: <meta name="language" content="english"> and goes in the head of the webpage, though not all websites use it.
Some multilingual websites, like mine, use a subdomain like http://it.website.com or http://fr.website.com
Others use query strings that are different from ?lang=. So you'll need to do a significant amount of research to cover all your bases.
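When the URL gives no hint, the page itself may declare a language. The sketch below reads the lang attribute of the html element, a content-language meta tag, and the language meta tag mentioned above. It assumes PHP's DOM extension is available; the function name detectDeclaredLanguage and the small name-to-code map are hypothetical.

```php
<?php
// Returns a lowercase two-letter code such as "ja", or null if the page declares nothing usable.
function detectDeclaredLanguage($html)
{
    if (!is_string($html) || trim($html) === '') {
        return null;
    }

    // Illustrative map for pages that spell the language out in words.
    $names = array('english' => 'en', 'italian' => 'it', 'french' => 'fr', 'japanese' => 'ja');

    $doc = new DOMDocument();
    // Real-world markup is often invalid; keep libxml warnings out of the output.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $candidates = array();
    // <html lang="ja-JP">
    foreach ($doc->getElementsByTagName('html') as $node) {
        $candidates[] = $node->getAttribute('lang');
    }
    // <meta http-equiv="content-language" content="ja"> and <meta name="language" content="english">
    foreach ($doc->getElementsByTagName('meta') as $meta) {
        if (strtolower($meta->getAttribute('http-equiv')) === 'content-language'
            || strtolower($meta->getAttribute('name')) === 'language') {
            $candidates[] = $meta->getAttribute('content');
        }
    }

    foreach ($candidates as $value) {
        $value = strtolower(trim($value));
        if (isset($names[$value])) {
            return $names[$value];
        }
        // Accept codes such as "ja", "ja-JP" or "en, fr" and keep the primary two letters.
        if (preg_match('/^([a-z]{2})\b/', $value, $m)) {
            return $m[1];
        }
    }
    return null;
}
```

In the crawler from the question, this could be called on the fetched markup at the "detect language here" comment, with the URL-based checks used only when it returns null.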
I would like to scrape the Google search results up to page 2, but my website either shows a blank page or times out.
for($j=0; $j<$acount; $j++){
sleep(60);
for($sp = 0; $sp <= 10; $sp+=10){
$url = 'http://www.google.'.$lang.'/search?q='.$in.'&start='.$sp;
if($sp == 10){
$datenbank = "proxy_work.php";
$datei = fopen($datenbank,"a+");
fwrite($datei, $data);
fwrite ($datei,"\r\n");
fclose($datei);
} else {
$datenbank = "proxy_work.php";
$datei = fopen($datenbank,"w+");
fwrite($datei, $data);
fwrite ($datei,"\r\n");
fclose($datei);
}
}
$html = file_get_html("proxy_work.php");
foreach($html->find('a') as $e){
// $title = $h3->innertext;
$link = $e->href;
if(in_array($endomain, $approveurl)){
}
// if it is not a direct link but url reference found inside it, then extract
if (!preg_match('/^https?/', $link) && preg_match('/q=(.+)&sa=/U', $link, $matches) && preg_match('/^https?/', $matches[1])) {
$link = $matches[1];
} else if (!preg_match('/^https?/', $link)) { // skip if it is not a valid link
continue;
}
}
}
Google search result pages (SERPs) are not like an ordinary website with static HTML. Google protects its data against web scraping. Treat it like a business directory and consider the following tips for scraping business directories:
IP-proxying.
Imitating human behaviour by using some browser automation tools (Selenium, iMacros and others).
I have a website where the user can choose a language. When a user visits the site for the first time, I want PHP to read their system language and write it to a cookie, so that they get the same language by default on every visit. When the user wants to change the site language, they press a button for the chosen language (for example Russian); the site language is then set to Russian, and it stays Russian when they visit again.
So far I have this code, but it's really confusing and it doesn't work properly.
HTML:
<a href="index.php?language=en">
<a href="index.php?language=ru">
PHP:
<?php
ini_set('display_errors',1);
error_reporting(E_ALL);
$language = substr($_SERVER['HTTP_ACCEPT_LANGUAGE'], 0, 2);
if (empty($_COOKIE['language'])){
setcookie('language', $language);
}
if ( !empty($_GET['language']) ) {
$_COOKIE['language'] = $_GET['language'] === 'en' ? 'en' : 'ru';
} else {
switch ($language){
case "ru":
$language = 'ru';
break;
case "en":
$language = 'en';
break;
default:
$language = 'en';
break;
}
}
if ( $_COOKIE['language'] == "en") {
$language = 'en';
} else {
$language = 'ru';
}
$xml = simplexml_load_file("language.xml") or die("Equestria forgot languages");
$s_nav_main = $xml->s_nav_main->$language;
$s_nav_more = $xml->s_nav_more->$language;
$s_nav_bot = $xml->s_nav_bot->$language;
$s_nav_partners = $xml->s_nav_partners->$language;
$s_nav_developer = $xml->s_nav_developer->$language;
$s_aboutus = $xml->s_aboutus->$language;
$s_title = $xml->s_title->$language;
$s_head_title = $xml->s_head_title->$language;
$s_head_info = $xml->s_head_info->$language;
$s_statistics_people = $xml->s_statistics_people->$language;
$s_statistics_online = $xml->s_statistics_online->$language;
$s_statistics_messages = $xml->s_statistics_messages->$language;
$s_why_we_best = $xml->s_why_we_best->$language;
$s_why_we_best_content_title = $xml->s_why_we_best_content_title->$language;
$s_why_we_best_content_info = $xml->s_why_we_best_content_info->$language;
$s_why_we_best_adm_title = $xml->s_why_we_best_adm_title->$language;
$s_why_we_best_adm_info = $xml->s_why_we_best_adm_info->$language;
$s_why_we_best_comfort_title = $xml->s_why_we_best_comfort_title->$language;
$s_why_we_best_comfort_info = $xml->s_why_we_best_comfort_info->$language;
$s_why_we_best_wtf_title = $xml->s_why_we_best_wtf_title->$language;
$s_why_we_best_wtf_info = $xml->s_why_we_best_wtf_info->$language;
$s_trusted_title = $xml->s_trusted_title->$language;
$s_trusted_info = $xml->s_trusted_info->$language;
$s_people_celestia = $xml->s_people_celestia->$language;
$s_people_celestia_comment = $xml->s_people_celestia_comment->$language;
$s_people_luna = $xml->s_people_luna->$language;
$s_people_luna_comment = $xml->s_people_luna_comment->$language;
$s_people_twilight = $xml->s_people_twilight->$language;
$s_people_twilight_comment = $xml->s_people_twilight_comment->$language;
$s_botinfo_info = $xml->s_botinfo_info->$language;
$s_botinfo_more = $xml->s_botinfo_more->$language;
?>
The first place you should look for the user's preferred language is the Accept-Language header. Geo-IP lookups are a dangerous and expensive waste of time (at least for determining language). Beyond that, you can set a cookie to override the choices presented by the browser, but there are legal implications around this for websites in Europe.
$avail_lang=array(
'en'=>1,
'fr'=>1,
'de'=>1,
'ru'=>1
);
define("DEFAULT_LANG", 'en');
...
if (!empty($_COOKIE['language']) && isset($avail_lang[$_COOKIE['language']])) {
$use_lang=$_COOKIE['language'];
}
// override with GET if provided
if (!empty($_GET['language']) && isset($avail_lang[$_GET['language']])) {
$use_lang=$_GET['language'];
}
// no language? check browser
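if (empty($use_lang)) {
$pref_lang=array();
$accept=isset($_SERVER['HTTP_ACCEPT_LANGUAGE']) ? $_SERVER['HTTP_ACCEPT_LANGUAGE'] : '';
foreach(explode(",", $accept) as $i) {
// each entry looks like "en-US;q=0.8"; a missing q value means 1.0
$parts=explode(";", trim($i));
list($lang)=explode("-", strtolower(trim($parts[0])));
$pref=(isset($parts[1]) && preg_match('/q=([0-9.]+)/', $parts[1], $m)) ? (float)$m[1] : 1.0;
if ($lang!=='' && (!isset($pref_lang[$lang]) || $pref>$pref_lang[$lang])) {
$pref_lang[$lang]=$pref;
}
}
// arsort() sorts by preference, highest first, and keeps the language keys
arsort($pref_lang);
$supported=array_intersect_key($pref_lang, $avail_lang);
$use_lang=$supported ? key($supported) : DEFAULT_LANG;
}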
if (user_accepts_cookies() && (!isset($_COOKIE['language']) || $use_lang!=$_COOKIE['language'])) {
set_lang_cookie($use_lang);
}
A simple approach can be adopted here:
When a user lands on your website, record their IP address; you can easily look up their country from that IP and then serve the matching language.
I found a way to do this:
$lang = substr($_SERVER['HTTP_ACCEPT_LANGUAGE'], 0, 2);
if ( !empty($_GET['language']) ) {
$_COOKIE['language'] = $_GET['language'] === 'en' ? 'en' : 'ru';
} elseif (empty($_COOKIE['language'])) {
$_COOKIE['language'] = $lang;
}
setcookie('language', $_COOKIE['language']);
if ( $_COOKIE['language'] == "en") {
$language = 'en';
} else {
$language = 'ru';
}
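The snippet above treats every value other than 'en' as Russian. A slightly more general version of the same priority order (explicit GET choice, then cookie, then browser header, then a default) might look like the sketch below. The function name pickLanguage and its parameters are hypothetical; keeping it free of superglobals is only a choice that makes it easier to test.

```php
<?php
// Picks a language in priority order: explicit choice, stored cookie, browser header, default.
function pickLanguage(array $get, array $cookie, $acceptHeader, array $supported, $default)
{
    // 1. and 2. An explicit ?language= value wins over a stored cookie; both must be supported.
    foreach (array($get, $cookie) as $source) {
        if (isset($source['language']) && is_string($source['language'])
            && in_array($source['language'], $supported, true)) {
            return $source['language'];
        }
    }

    // 3. Browser preference: first two letters of the header, as in the snippet above.
    $browser = strtolower(substr((string) $acceptHeader, 0, 2));
    if (in_array($browser, $supported, true)) {
        return $browser;
    }

    // 4. Nothing usable: fall back to the site default.
    return $default;
}

// Possible usage in a page, before any output is sent:
// $header   = isset($_SERVER['HTTP_ACCEPT_LANGUAGE']) ? $_SERVER['HTTP_ACCEPT_LANGUAGE'] : '';
// $language = pickLanguage($_GET, $_COOKIE, $header, array('en', 'ru'), 'en');
// setcookie('language', $language, time() + 60 * 60 * 24 * 365, '/');
```

Passing an expiry time to setcookie() keeps the choice across browser sessions; without it the cookie disappears when the browser is closed.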
I'm trying to get the title of a website that is entered by the user.
Text input: website link, entered by user is sent to the server via AJAX.
The user can input anything: an actual existing link, or just single word, or something weird like 'po392#*#8'
Here is a part of my PHP script:
// Make sure the url is on another host
if(substr($url, 0, 7) !== "http://" AND substr($url, 0, 8) !== "https://") {
$url = "http://".$url;
}
// Extra confirmation for security
if (filter_var($url, FILTER_VALIDATE_URL, FILTER_FLAG_HOST_REQUIRED)) {
$urlIsValid = "1";
} else {
$urlIsValid = "0";
}
// Make sure there is a dot in the url
if (strpos($url, '.') !== false) {
$urlIsValid = "1";
} else {
$urlIsValid = "0";
}
// Retrieve title if no title is entered
if($title == "" AND $urlIsValid == "1") {
function get_http_response_code($theURL) {
$headers = get_headers($theURL);
if($headers) {
return substr($headers[0], 9, 3);
} else {
return 'error';
}
}
if(get_http_response_code($url) != "200") {
$urlIsValid = "0";
} else {
$file = file_get_contents($url);
$res = preg_match("/<title>(.*)<\/title>/siU", $file, $title_matches);
if($res === 1) {
$title = preg_replace('/\s+/', ' ', $title_matches[1]);
$title = trim($title);
$title = addslashes($title);
}
// If title is still empty, make title the url
if($title == "") {
$title = $url;
}
}
}
However, there are still errors occurring in this script.
It works perfectly if an existing URL such as 'https://www.youtube.com/watch?v=eB1HfI-nIRg' is entered, and when a non-existing page such as 'https://www.youtube.com/watch?v=NON-EXISTING' is entered, but it doesn't work when the user enters something like 'twitter.com' (without http) or something like 'yikes'.
I tried literally everything: cURL, DOMDocument...
The problem is that when an invalid link is entered, the AJAX call never completes (it keeps loading), while it should set $urlIsValid = "0" whenever an error occurs.
I hope someone can help me. It's appreciated.
Nathan
You have a relatively simple problem but your solution is too complex and also buggy.
These are the problems that I've identified with your code:
// Make sure the url is on another host
if(substr($url, 0, 7) !== "http://" AND substr($url, 0, 8) !== "https://") {
$url = "http://".$url;
}
This does not make sure that the URL is on another host (it could still be localhost); it only prepends a scheme. You should remove this code.
// Make sure there is a dot in the url
if (strpos($url, '.') !== false) {
$urlIsValid = "1";
} else {
$urlIsValid = "0";
}
This code overwrites the code above it, where you validate that the string is indeed a valid URL, so remove it.
The definition of the additional function get_http_response_code is pointless. You could use only file_get_contents to get the HTML of the remote page and check it against false to detect the error.
Also, from your code I conclude that, if the (external to context) variable $title is empty then you won't execute any external fetch so why not check it first?
To sum it up, your code should look something like this:
if('' === $title && filter_var($url, FILTER_VALIDATE_URL))
{
//@ means we suppress warnings as we won't need them
//this could be done with error_reporting(0) or similar side-effect method
$html = getContentsFromUrl($url);
if(false !== $html && preg_match("/<title>(.*)<\/title>/siU", $html, $title_matches))
{
$title = preg_replace('/\s+/', ' ', $title_matches[1]);
$title = trim($title);
$title = addslashes($title);
}
// If title is still empty, make title the url
if($title == "") {
$title = $url;
}
}
function getContentsFromUrl($url)
{
//if not full/complete url
if(!preg_match('#^https?://#ims', $url))
{
$completeUrl = 'http://' . $url;
$result = @file_get_contents($completeUrl);
if(false !== $result)
{
return $result;
}
//we try with https://
$url = 'https://' . $url;
}
return @file_get_contents($url);
}
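One thing the code above does not address is the request that never completes. A minimal sketch of one way to handle that, assuming a read timeout on the HTTP stream context is acceptable, is shown below. The helper names normalizeUrl, extractTitle and fetchTitle are hypothetical, and requiring a dot in the host is an assumption carried over from the question.

```php
<?php
// Adds a scheme when missing and rejects input that cannot be a public host name.
function normalizeUrl($input)
{
    $input = trim((string) $input);
    if ($input === '') {
        return false;
    }
    if (!preg_match('#^https?://#i', $input)) {
        $input = 'http://' . $input;
    }
    if (filter_var($input, FILTER_VALIDATE_URL) === false) {
        return false;
    }
    // Require a dot in the host so bare words such as "yikes" are rejected.
    $host = parse_url($input, PHP_URL_HOST);
    if (!is_string($host) || strpos($host, '.') === false) {
        return false;
    }
    return $input;
}

// Pulls the <title> text out of an HTML string and collapses whitespace.
function extractTitle($html)
{
    if (!is_string($html) || !preg_match('/<title[^>]*>(.*?)<\/title>/is', $html, $m)) {
        return '';
    }
    $title = html_entity_decode($m[1], ENT_QUOTES, 'UTF-8');
    return trim(preg_replace('/\s+/', ' ', $title));
}

// Returns the page title, the URL itself if the page has no title, or false on any failure.
function fetchTitle($input, $timeoutSeconds = 5)
{
    $url = normalizeUrl($input);
    if ($url === false) {
        return false;
    }
    // The timeout keeps a slow or unreachable host from hanging the AJAX request.
    $context = stream_context_create(array('http' => array('timeout' => $timeoutSeconds)));
    $html = @file_get_contents($url, false, $context);
    if ($html === false) {
        return false;
    }
    $title = extractTitle($html);
    return $title !== '' ? $title : $url;
}
```

With this shape, the AJAX handler can answer with $urlIsValid = "0" as soon as fetchTitle() returns false, whether the input was malformed or the host did not respond in time.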
I am currently developing a web crawler in PHP. It is still a simple one, but what I want to know is how I can make my crawler crawl pages in the background without using my bandwidth. Do I have to use cron jobs? I also want it to store the data in a database automatically.
Here is what I have done:
<?php
$conn = mysqli_connect("localhost","root","","crawler") or die(mysqli_error());
ini_set('max_execution_time', 4000);
$to_crawl = "http://hootpile.com";
$c = array();
function get_links($url){
global $c;
$input = file_get_contents($url);
$regexp = "<a\s[^>]*href=(\"??)([^\" >]*?)\\1[^>]*>(.*)<\/a>";
preg_match_all("/$regexp/siU", $input, $matches);
$base_url = parse_url($url, PHP_URL_HOST);
$l = $matches[2];
foreach($l as $link) {
if(strpos($link, "#")) {
$link = substr($link,0, strpos($link, "#"));
}
if(substr($link,0,1) == ".") {
$link = substr($link, 1);
}
if(substr($link,0,7)=="http://") {
$link = $link;
}
else if(substr($link,0,8) =="https://") {
$link = $link;
}
else if(substr($link,0,2) =="//") {
$link = substr($link, 2);
}
else if(substr($link,0,2) =="#") {
$link = $url;
}
else if(substr($link,0,2) =="mailto:") {
$link = "[".$link."]";
}
else {
if(substr($link,0,1) != "/") {
$link = $base_url."/".$link;
}
else {
$link = $base_url.$link;
}
}
if(substr($link, 0, 7)=="http://" && substr($link, 0, 8)!="https://" && substr($link, 0, 1)=="[") {
if(substr($url, 0, 8) == "https://") {
$link = "https://".$link;
}
else {
$link = "http://".$link;
}
}
//echo $link."<br />";
if(!in_array($link,$c)) {
array_push($c,$link);
}
}
}
get_links($to_crawl);
foreach ($c as $page) {
get_links($page);
}
foreach ($c as $page) {
$query = mysqli_query($conn,"INSERT INTO LINKS VALUES('','$page')");
echo $page."<br />";
}
?>
You can use Simple HTML DOM, but crawling and scraping depend on the structure of each web page. Decide how much data you want to store; you may not find the same data or structure on different websites, in which case you should write some common code to extract what you need from the scraped data.
You can use ReactPHP, since it allows you to easily spawn a process that keeps running.
You can also write a hashbang at the very beginning of the file:
#!/usr/bin/php
give execution permissions to the file:
chmod a+x your_script_path.php
and execute it with cron or with nohup. If you want to daemonize it, then there is a little more work.
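If the script is started from cron, one common concern is a new run starting while the previous one is still crawling. A minimal sketch of a guard against that, using an advisory file lock, is shown below. The function name runExclusive, the lock file path and the crontab line in the comment are assumptions for illustration only.

```php
<?php
// Example crontab entry (paths are placeholders), starting the crawler every 10 minutes:
//   */10 * * * * /usr/bin/php /path/to/crawler.php

// Runs $job only if no other process currently holds the lock file.
// Returns true if the job ran, false if it was skipped.
function runExclusive($lockFile, callable $job)
{
    // Mode 'c' creates the file if needed without truncating it.
    $handle = fopen($lockFile, 'c');
    if ($handle === false) {
        return false;
    }
    // LOCK_NB makes flock() return immediately instead of waiting for the lock.
    if (!flock($handle, LOCK_EX | LOCK_NB)) {
        fclose($handle);
        return false;
    }
    try {
        $job();
    } finally {
        // Always release the lock, even if the job throws.
        flock($handle, LOCK_UN);
        fclose($handle);
    }
    return true;
}

// Possible usage at the bottom of the crawler script:
// runExclusive(sys_get_temp_dir() . '/crawler.lock', function () {
//     // crawl a bounded batch of pages and store the results in the database
// });
```

Crawling a bounded batch per run, rather than everything at once, also keeps each cron invocation short and its memory use predictable.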
I think you should not use PHP for a crawler/scraper because it's not designed for long-running tasks and can run into memory usage problems. Use Python instead (I use Python + BeautifulSoup + urllib for scraping).
Besides, you should use crontab and nohup to schedule background jobs.
I am unable to set $_SESSION['next'] inside a switch/case statement, while $_SESSION['user_id'] is set correctly before it. The script runs into each case of the switch and redirects without setting $_SESSION['next']. Is there a specific reason why it fails? How can I solve this?
require_once ('../src/facebook.php');
require_once ('../src/fbconfig.php');
//Facebook Authentication part
$user_id = $facebook->getUser();
if ($user_id <> '0' && $user_id <> '') {
session_start();
$_SESSION['user_id'] = $user_id;
switch((isset($_GET['page']) ? $_GET['page'] : '')){
case 'abc';{
$_SESSION['next'] = 'AAA';
echo "<script>top.location.href = 'https://www.example.com/xxx/'</script>";
exit;}
case 'def';{
$_SESSION['next'] = 'BBB';
echo "<script>top.location.href = 'https://www.example.com/xxx/'</script>";
exit;}
case 'ghi';{
$_SESSION['next'] = 'CCC';
echo "<script>top.location.href = 'https://www.example.com/xxx/'</script>";
exit;}
default;{
echo "<script>top.location.href = 'https://www.example.com/xxx/'</script>";
exit;}
}
} else {
echo "<script>top.location.href = 'https://www.example.com/xxx/'</script>";
exit;
}
Your switch is all wrong. Read the manual and try this:
<?php
switch ((isset($_GET['page']) ? $_GET['page'] : '')){
case 'abc':
$_SESSION['next'] = 'AAA';
echo "<script>top.location.href = 'https://www.example.com/xxx/'</script>";
break;
case 'def':
$_SESSION['next'] = 'BBB';
echo "<script>top.location.href = 'https://www.example.com/xxx/'</script>";
break;
case 'ghi':
$_SESSION['next'] = 'CCC';
echo "<script>top.location.href = 'https://www.example.com/xxx/'</script>";
break;
default:
echo "<script>top.location.href = 'https://www.example.com/xxx/'</script>";
break;
}
You're using exit in your switch, which ends the whole script at that point. Unless that is what you want, use the break keyword instead.
You also use semicolons and curly braces for each case.
case 'ghi';{ ... }
PHP accepts a semicolon after case, but the conventional and more readable form is
case 'ghi':
.
.
.
break;
Update: I just noticed you use this line:
if ($user_id <> '0' && $user_id <> '') { ... }
<> is a valid PHP operator and behaves exactly like !=, but != is by far the more common spelling for "not equals", so prefer it for readability.
Second update: You never set $_SESSION['next'] in your default case. It's very likely that your switch is always going to the default case. This would cause the behavior you're experiencing.
I suggest:
if (($user_id != '0') && ($user_id != ''))
(parentheses, and the != operator)
and also a DRYer switch:
$page = array_key_exists('page', $_GET) ? $_GET['page'] : '';
switch ($page) {
case 'abc':
$next = 'AAA';
$loc = 'https://www.example.com/xxx/';
break;
case 'def':
$next = 'BBB';
$loc = 'https://www.example.com/yyy/';
break;
... // and so on
}
if (isset($next)) {
$_SESSION['next'] = $next;
// If it does not work, you have problems with your session ID, maybe?
}
// I find this syntax easier
print <<<SCRIPT
<script type="text/javascript">
top.location.href = '$loc';
</script>
SCRIPT;
exit();
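As a variation on the switch above, the page-to-target mapping can also live in a plain array, which keeps the redirect logic in one place. This is only a sketch: the function name resolveRedirect is hypothetical, and the values are the placeholders from the question.

```php
<?php
// Maps the ?page= value to the session value and redirect target.
function resolveRedirect($page)
{
    $defaultLoc = 'https://www.example.com/xxx/';
    $routes = array(
        'abc' => array('next' => 'AAA', 'loc' => $defaultLoc),
        'def' => array('next' => 'BBB', 'loc' => $defaultLoc),
        'ghi' => array('next' => 'CCC', 'loc' => $defaultLoc),
    );

    // $_GET values can be arrays, so only look up plain strings.
    if (is_string($page) && isset($routes[$page])) {
        return $routes[$page];
    }
    // Unknown page: redirect to the default without a 'next' value.
    return array('next' => null, 'loc' => $defaultLoc);
}

// Possible usage, after session_start() has been called:
// $target = resolveRedirect(isset($_GET['page']) ? $_GET['page'] : '');
// if ($target['next'] !== null) {
//     $_SESSION['next'] = $target['next'];
// }
// echo "<script>top.location.href = '" . $target['loc'] . "'</script>";
// exit;
```

If $_SESSION['next'] is still missing on the next page with this approach, the lookup is not the cause, and the session ID or cookie handling is the more likely place to look.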