So most of us have a lot of content on our sites in one language or another. Since we are web professionals, we spent all the time we could have spent learning human languages learning computer languages instead. So we need some way to translate our content.
Google provides a translation service (among others), and given their massive empire I am confident that they have (or shortly will have) the best translation service around. With that in mind, what is the best way to use it? We could just be lazy and use the little widget they provide, but we would lose all the content and SEO juice because Google rewrites the links to point to "translate.googleusercontent.com?translate=...".
So my question is: how can we use this service while keeping the translated content on our own site?
One method would be to use the Google AJAX API to load the content inline when the user wants it. But since that is powered by JavaScript (like jQuery), search engines won't benefit from it.
Another method would be to use a server-side language (like PHP) to scrape the content from the Google Translate page. But I'm not sure this is 100% legal.
Finally, I was wondering about using mod_rewrite to redirect to the translated page. But again, I don't think this would benefit our site.
RewriteRule ^(.*)-fr$ http://www.google.com/translate_c?hl=fr&sl=en&u=http://site.com/$1 [R,NC]
RewriteRule ^(.*)-de$ http://www.google.com/translate_c?hl=de&sl=en&u=http://site.com/$1 [R,NC]
RewriteRule ^(.*)-es$ http://www.google.com/translate_c?hl=es&sl=en&u=http://site.com/$1 [R,NC]
RewriteRule ^(.*)-it$ http://www.google.com/translate_c?hl=it&sl=en&u=http://site.com/$1 [R,NC]
All you would need to do is add a couple of links on your pages with "-fr" (or another language code) appended to the end of whatever URL is in the link, and you're set.
//View file
<a href="/page-de">View Page in German</a>
Does anyone have any thoughts on this?
:EDIT:
After reading Google's Terms of Service, it seems that:

"You will not, and will not permit your end users or other third parties to: incorporate Google Results as the primary content on your Property or any page on your Property; submit any request exceeding 5000 characters in length;"
This sounds to me like you can't use the Google Translate URL to translate the main content - whether with PHP or AJAX - if that content is the main post of the page. Now how does this work? Why would you build a translation API and then not allow it to be used on a page's main content?
Well, you should read the EULA; maybe Google doesn't want you to use their service in that way.
Not to mention that Google Translate may be fine across Indo-European languages, but right now translations to other language families really suck and generate comical, meaningless text (e.g. my own language, Hungarian, is a nightmare for Google). I don't think it will advance to even a usable level in the near future.
I think the most SEO-friendly way to decide which language to display is to look at the Accept-Language request header, although language flag icons wouldn't be a bad idea either, in case someone using an en-us browser feels more comfortable reading French, for example.
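If it helps, here is a minimal sketch of how that header could be parsed in PHP, including the q-values that a naive first-two-characters grab would ignore; the supported-language list is made up for the example:

<?php
// Pick the best supported language from Accept-Language,
// e.g. "fr-FR,fr;q=0.9,en;q=0.8". The $supported list is illustrative.
function negotiate_language(array $supported, $default = 'en') {
    $header = isset($_SERVER['HTTP_ACCEPT_LANGUAGE'])
        ? $_SERVER['HTTP_ACCEPT_LANGUAGE'] : '';
    $best = $default;
    $bestQ = 0.0;
    foreach (explode(',', $header) as $part) {
        $pieces = explode(';q=', trim($part));
        $lang = strtolower(substr($pieces[0], 0, 2)); // "fr-FR" -> "fr"
        $q = isset($pieces[1]) ? (float) $pieces[1] : 1.0;
        if ($q > $bestQ && in_array($lang, $supported, true)) {
            $best = $lang;
            $bestQ = $q;
        }
    }
    return $best;
}

echo negotiate_language(array('en', 'fr', 'de', 'es', 'it')); // e.g. "fr"
?>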
It looks like there is an unofficial PHP API for Google Translate. Since it's hosted on Google Code, if it were something Google didn't want, it would probably be gone by now.
You should make sure to cache the translated pages though.
http://code.google.com/p/gtranslate-api-php/
To have a real multilingual site, automated translation is not and will not be a good enough solution. On my site, I've added an interface allowing easy human translation, and Google Translate (as well as Babel Fish) is used to suggest translations before a real human does the actual translation. Check out the project at http://transposh.org/ if your site is on WordPress.
The quality of machine translation for SEO purposes is still questionable. Given that it is based on statistical translation, it will improve in the long run, but today it's outright dangerous. I would not use it for my site. As I pointed out in one of my recent posts on my blog about the impact of the new Google algorithm on website translation, the latest Google Panda update penalizes spelling and grammar errors, so machine translation might ultimately penalize you.
After more research, it appears that Google does expose a JSON URL for direct requests, so using a server-side language does seem to be an option (as long as the results are cached). However, once you get that content you still need to figure out how to let users access it in the flow of your current app. Perhaps something like the mod_rewrite method mentioned above?
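As an illustration of that fetch-and-cache pattern, here is a rough PHP sketch. The endpoint shown is the old Google AJAX Language API, which has long since been shut down, so treat this as a sketch of the caching approach rather than a working integration:

<?php
// Sketch: fetch a translation server-side and cache it on disk.
// The endpoint below is the old, now-defunct AJAX Language API;
// substitute whatever translation service you actually use.
function translate_cached($text, $from, $to, $cacheDir = '/tmp/trans') {
    $file = $cacheDir . '/' . md5($from . $to . $text) . '.txt';
    if (is_file($file)) {
        return file_get_contents($file); // serve the cached copy
    }
    $url = 'http://ajax.googleapis.com/ajax/services/language/translate'
         . '?v=1.0&q=' . urlencode($text)
         . '&langpair=' . urlencode($from . '|' . $to);
    $json = json_decode(file_get_contents($url), true);
    $translated = isset($json['responseData']['translatedText'])
        ? $json['responseData']['translatedText']
        : $text; // fall back to the source text on failure
    if (!is_dir($cacheDir)) {
        mkdir($cacheDir, 0777, true);
    }
    file_put_contents($file, $translated);
    return $translated;
}

echo translate_cached('Hello world', 'en', 'fr');
?>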
You can translate text through the Google Language API's REST interface.
Here is a PHP library that does it:
http://code.google.com/p/php-language-api/
A simple example is on the project page.
I need to crawl a website and detect how many ads are on a page. I've crawled it using PHPCrawl and saved the content to a database. How can I detect whether a web page has ads above the fold?
Well, simply put: you can't really. At least not in a simple way. There are many things to consider here, and all of them depend heavily on the web page you are crawling, the device used, etc. I'll try to explain some of the main issues you would need to work around.
Dynamic Content
The first big problem is that you only have the HTML structure, which by itself gives no direct visual information. It would, if this were 1990, but modern websites use CSS and JS to enhance their pages' core structure. What you see in your browser is not just HTML rendered as-is. It's subject to CSS styling and even JS-injected code fragments, which can alter the page significantly. For example: any page with a so-called AJAX loader will show up in your crawler as a very simple HTML block, while the real page is loaded AFTER that is rendered (by JS).
Viewport
What you described as "above the fold" is an arbitrary term that can't be defined globally. A smartphone has a very different viewport than a desktop PC, and most modern websites use very different structures for mobile, tablet and desktop devices. But let's say you just want to do it for desktop devices. You could define an average viewport based on the most common screen resolutions (which you can find on the internet). We will define it as 1366x768 for now (based on a quick Google search). However, you still only have a PHP script and an HTML string, which brings us to the next problem.
Rendering
What you see in your browser is actually the result of a complex system that incorporates the HTML and all of its linked resources to render a visual representation of the code you have crawled. Beyond the core structure of the HTML string you got, any linked resource can (and will) change how the content looks. They can even add more content based on a variety of conditions. So to get the actual visual information, you need a so-called "headless browser". Only that can give you valid information about what is actually visible inside the desired viewport. If you want to dig into this topic, a good starting point would be a tool like PhantomJS. But don't assume this is an easy task. You still only have bits of data, with no context whatsoever.
Context, or "What is an ad?"
Let's assume you have tackled all these problems and written a script that can actually interpret everything your crawler got. You still need to know: what is an ad? And that's a huge problem. For you as a human it's easy to distinguish between what is part of the website and what is an ad, but translating that into code is more of an AI task than a basic script. For example: ads are often loaded into a predefined container after the actual page load. That container may only have a cryptic ID distinguishing it from the rest of the (actually valid) page content. If you are lucky, it has a class containing a string like "advertisement", but you can't count on that. Ads are targeted by all sorts of ad blockers, so they have a long history of trying to disguise themselves as well as possible. You will have a REALLY hard time figuring out what is an ad and what is valid page content.
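To make that last point concrete, here is a rough heuristic sketch in PHP that scans crawled HTML for a few well-known ad-network fingerprints. The pattern list is purely illustrative, and this will miss any ad that disguises itself or is injected by JS after load:

<?php
// Rough heuristic: count elements whose src/id/class contain a few
// well-known ad-network fingerprints. Static HTML only - JS-injected
// ads will not be visible to this at all.
function count_probable_ads($html) {
    $patterns = array('doubleclick', 'googlesyndication', 'adsbygoogle',
                      'ad-slot', 'advert');
    $doc = new DOMDocument();
    @$doc->loadHTML($html); // suppress warnings from messy real-world markup
    $count = 0;
    foreach (array('iframe', 'div', 'script', 'ins') as $tag) {
        foreach ($doc->getElementsByTagName($tag) as $el) {
            $haystack = strtolower($el->getAttribute('src') . ' '
                . $el->getAttribute('id') . ' ' . $el->getAttribute('class'));
            foreach ($patterns as $p) {
                if (strpos($haystack, $p) !== false) {
                    $count++;
                    break; // count each element once
                }
            }
        }
    }
    return $count;
}
?>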
So, while I've only touched on some of the problems you are going to run into, I want to say that it's not impossible. But you have to understand that you are at the most basic entry point, and if you want to build a system that actually works, you'll have to spend a LOT of time on fine-tuning, and maybe even on research in the AI field.
And to come back to your question: there is no simple answer to "How do I detect if a page has ads?", because it is far more complex than you might think. Hope this helps.
There are already some questions about this on Stack Overflow, but none is really clear about the "best practice".
For structuring the content, what are the options, and which is best?
Some options I know of are folders (site.com/en/ and site.com/fr/) or a query string (site.com/index.php?language=en).
An even easier practice is using a separate URL per language: en.site.com and fr.site.com.
But what if I want to keep site.com/index.php and nothing more? What are my options for that?
For example, if you change the language on LinkedIn, nothing changes in the URL. How does that work?
Update: in my case the website is a platform on a LAMP stack. Technical advice is also welcome (like how to store/link all the different language files).
You have some options, and each has its own advantages.
If you have a web app, which should not be indexed by search engines, then you are free to do whatever you want. You can keep a language setting in your Session and show strings in the chosen language. This will simplify URLs and management.
However, if you have a standard website which should be indexed by Google, then your options are restricted. If you use the former approach, Google will be confused and will index only one language, or worse, make an ugly mix of the languages. Google does not keep sessions when indexing your pages, so if you have two versions of the same page in two different languages, they need different URLs. And passing a language as a GET parameter each time is ugly, error-prone, and not user-friendly.
So you should either have languages as folders (e.g. site.com/en/), which is the best option, or use subdomains. The latter can be a problem, however, because each subdomain is indexed as if it were a separate website, so things like PageRank and site reputation are split between the two.
Here are my recommendations, based on experience with multi-language websites:
You can determine the browser's language in JavaScript, which is a good start for picking a language that likely fits the user's preference.
Use UTF-8 encoding everywhere (browser headers, programming, database). The only drawback of UTF-8 I know of is that in some cases it takes more bytes than an encoding that matches a given language more closely (UTF-8 is open to any language). The big advantage is that it covers every language, and its ASCII part (bytes < 128) is identical to a Western encoding (taking only one byte).
Store the user preferences (not in the URL), in one of two ways (a sketch follows this list):
either in a cookie (+: the user does not have to be logged in on their computer to keep their prefs; -: when the user accesses the page from another computer, the cookie is not present and they have to reselect their prefs),
or in a session (requires the user to log in so you can determine who is currently using the site).
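A minimal sketch of both storage options in PHP (the supported-language list and cookie name are made up for the example):

<?php
session_start();
$supported = array('en', 'fr', 'jp');
$lang = 'en'; // assumed default

if (isset($_GET['lang']) && in_array($_GET['lang'], $supported, true)) {
    $lang = $_GET['lang'];
    setcookie('lang', $lang, time() + 365 * 24 * 3600, '/'); // cookie option
    $_SESSION['lang'] = $lang;                               // session option
} elseif (isset($_COOKIE['lang']) && in_array($_COOKIE['lang'], $supported, true)) {
    $lang = $_COOKIE['lang'];
} elseif (isset($_SESSION['lang'])) {
    $lang = $_SESSION['lang'];
}
?>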
Regarding the site structure, if you don't store the user preferences, the common /us /fr /jp... solution has advantages: search engine robots will find all languages from the root (the page displayed doesn't depend on a user choice). Alternatively, you could load a language-dependent JavaScript file that displays the page immediately (after the JS download) in the chosen language, without needing a page reload or redirect.
You can tell Google what the parameters in a URL mean through Google's Webmaster Tools. So you can use a standard convention like index.php for the main page and a parameter like ?lang=fr for, say, the French content, and it will still work for SEO as long as it's a real translation. Likewise, when Google.fr crawls the site, you would present its users with the French version of the page, based on the default language setting in their browser. This will increase stickiness on the site, which in turn increases rank for the selected search terms in French. You can check a visitor's default language in PHP, keeping things light on the user's end:
<?php
// Grab the first two letters of the Accept-Language header,
// e.g. "fr" from "fr-FR,fr;q=0.9,en;q=0.8".
$lang = isset($_SERVER['HTTP_ACCEPT_LANGUAGE'])
    ? substr($_SERVER['HTTP_ACCEPT_LANGUAGE'], 0, 2)
    : 'en';
echo $lang;
?>
Then you would simply append the selected language (as an override) to the end of the query string (if it's not the default), bypassing the need to store a cookie. This should also help from a user-experience standpoint. It's not as "pretty", but if people aren't typing it in, it's much more user-friendly, in that:
People can switch languages by changing the variable
People can send a link to a user in another country, and that user can intuitively be presented with the page in their native language when PHP checks their browser language ("Do you want to see this content in French?").
The native language could in theory be translated on the fly with something like the Google Translate Widget or Google Chrome's built-in functionality.
Content is served at only one URI per page, meaning you don't have to maintain redirects between versions when content goes stale (since you're unlikely to translate into both languages at exactly the same time).
From Wikipedia on Chrome:
"As of February 2013, according to StatCounter, Google Chrome has a 37% worldwide usage share of web browsers, making it the most widely used web browser in the world."
As for storing the files, you can make a separate table in the database for each language, enter the content in the native language, then duplicate that content into the other language's table. You could then translate that content live on screen if desired, or give someone access to your CMS to translate it. Both records would have the same ID in the database, so the right page is served when the language is looked up.
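A sketch of how that lookup could work with PDO, assuming hypothetical content_en / content_fr tables that share IDs (whitelisting the language keeps the interpolated table name safe):

<?php
// Hypothetical schema: content_en and content_fr have identical columns
// and share primary keys, so one ID serves every language.
$supported = array('en', 'fr');
$lang = (isset($_GET['lang']) && in_array($_GET['lang'], $supported, true))
    ? $_GET['lang'] : 'en';
$id = isset($_GET['id']) ? (int) $_GET['id'] : 0;

$pdo = new PDO('mysql:host=localhost;dbname=site;charset=utf8', 'user', 'pass');
$stmt = $pdo->prepare("SELECT title, body FROM content_{$lang} WHERE id = ?");
$stmt->execute(array($id));
$page = $stmt->fetch(PDO::FETCH_ASSOC);
?>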
Large corporate sites handle the URLs in a couple of ways: they show French content at http://www.some_website.fr/ and English content at http://www.some_website.com/. Sites like Wikipedia use a subdomain, so http://fr.wikipedia.org/ and http://en.wikipedia.org/.
HTML5 also supports the lang="fr" attribute for declaring the language of on-screen content.
I don't know your exact situation, but IMHO you shouldn't use new URLs; it would only be a waste of resources (all kinds of resources). If your website's client side is JavaScript-based, you could use libraries like i18next, which gives you great localization support. That's a valid alternative if you are fine with leaving localization on your application's client side.
For server-side localization in PHP, I wouldn't be of any help.
Let's say I have a plain HTML website. More than 80% of my visitors are usually from search engines like Google, Yahoo, etc. What I want to do is to make my whole website in Flash.
However, search engines can't read information from Flash or JavaScript. That means my web page would lose more than half of the visitors.
So how do I show HTML pages instead of Flash to the search engines?
Note: you can reach a specific page/category/etc. in Flash by using a PHP GET parameter; for example, you can surf through all the web pages from the homepage and link to a specific web page by typing page?id=1234.
Short answer: don't make your whole site in Flash.
Longer answer: If you show humans one view and Googlebot another, you are potentially guilty of "cloaking". If the Google Gods find you guilty, you will be banished to the Supplemental Index, never to be heard from again.
Also, doing an entire site in Flash breaks the basic contract of the web, namely that you can link to specific content from other sites or in emails. If your site has just one URL and everything else is handled inside of Flash ... well, I don't know what you have, but it isn't a website anymore. Adobe may like you, but many people will not. Oh, and Flash is very unfriendly to people with handicaps.
I recommend using Flash where it is needed (videos, animations, etc.), but make it part of an honest-to-God website.
"What I want to do is to make my whole website in Flash."

"So how do I show HTML pages instead of Flash?"
These two seem a bit contradictory.
It is important to understand the reasoning behind choosing Flash to build your entire website.
"More than 80 percent of my visitors are usually from search engines."
You did some analysis, but did you look at how many visitors access your website via a mobile device? Because apart from SEO, Flash won't work on the majority of those devices.
Have you considered HTML5 as an alternative for anything you want to do with Flash?
Facebook requires you to build applications in Flash (among other formats) rather than plain HTML. Why? I do not know, but that is their policy and there has to be a reason.
I have recently been developing simple social applications in Flash (*.swf), and my latest app is a Flash website that displays in a tab of my company's Facebook page; at the same time, I also want to use that website as a regular webpage on the internet for my company. The only way I could find to display HTML text within a Flash file is to change the text properties, wherever I can under CHARACTER, to "Render text as HTML" (look for the "<>" symbol). I think that way the search engines will be able to read your content and process your website accordingly. Good luck.
You say you can reach a specific Flash page via a GET variable such as a page ID, which is good. I hope you will add Flash to each HTML page. Besides this, you can include all the other HTML content in hidden form, so crawlers can reach the content while visitors see the Flash version.
Since no one actually gave you a straight answer (probably because your question is absolutely face-palm-esque), I'll try:
Consider using the web-development approach called progressive enhancement. Now, it's fair to say it probably wasn't intended for the Flashification of a website, but you can make use of its principles.
Start with your standard HTML version of your website
Introduce swfobject to dynamically (important bit) swap out the HTML content for its Flash equivalent
Introduce swfaddress to allow for deep linking into your Flash movies (pseudo-URLs)
Granted, steps 2 and 3 are a little more advanced than how I've described them, and your site's size/structure/design may not suit this approach, but at least it's an answer.
All that being said, I agree with the other answers/comments questioning the need to display your entire site in Flash - there are very, very few reasons anyone would do that, and there are more reasons than already given not to (iOS devices, etc.).
How can I make my database content searchable by search engines? Basically, how do I make a website more SEO-friendly when the data is not static but comes from a database?
It doesn't matter whether the content is loaded from a database or a static file, as long as it's rendered server-side (i.e. by PHP) rather than client-side (i.e. by JavaScript). Crawlers see no difference, so the same guidelines apply.
FRKT is correct that the search engines don't know where content is coming from.
Meta tags, while still somewhat important, don't have the same effect they used to. Include them, but don't consider them the be-all, end-all of how to get higher in SEO.
Start by making sure that the pages you generate are W3C-compliant. Once a page is working, put it into the W3C validator at http://validator.w3.org/ and make it 100% correct. A search engine can't properly parse poorly structured code.
Now comes the tough part: the other stuff. Nobody REALLY knows everything the Googles of the world look for, but we've all got pretty good ideas. For example, you'll rank higher if your domain has "aged", i.e. been out on the web for a while; that makes sense, as you're not a fly-by-night operation if your URL has been in operation for months. Keep your content fresh, use proper markup (such as titles in h1 tags and content in p tags), and make sure you're not "hiding" your content in images without descriptive text or burying important text in Flash.
Google and Bing provide "webmaster tools" that you can hook into your site to analyze it and take some of the guesswork out of what the crawler sees. See https://www.google.com/webmasters/tools/ and http://www.bing.com/webmaster. Don't miss this free opportunity to make things better.
Good luck. Building a strong SEO site with a CMS is not difficult at all if you take your time and think through your actions.
You need to provide the correct meta tags on your web pages, such as the keywords tag, so that search engine crawlers can determine what your pages are relevant to.
If your content comes from the database and you cannot change it, then perhaps you could write a small routine to determine the most popular words in your content and present them automatically in the keywords meta tag.
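A rough sketch of that idea in PHP; the stop-word list is abbreviated and the weighting deliberately naive:

<?php
// Derive a keywords meta tag from the most frequent words in the
// content. The stop-word list is shortened for the example.
function keywords_meta($content, $max = 10) {
    $stop = array('the', 'and', 'for', 'that', 'with', 'this', 'are', 'from');
    $words = str_word_count(strtolower(strip_tags($content)), 1);
    $freq = array_count_values(array_diff($words, $stop));
    arsort($freq); // most frequent first
    $top = array_slice(array_keys($freq), 0, $max);
    return '<meta name="keywords" content="'
        . htmlspecialchars(implode(', ', $top)) . '">';
}
?>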
Provide links to it.
You can't rely on a form in which end users specify what they want to retrieve, because a search engine won't fill in the form (and therefore won't retrieve the data).
Instead you need to serve a page which includes hyperlinks to the various data.
Most search engines provide a way to submit a sitemap, which essentially tells them how to reach pages that can't be found through normal crawling - for example, pages accessed through JavaScript or through form submissions that generate a URL (method=GET).
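For example, a small PHP script could emit a sitemap straight from the database; the table and column names here are hypothetical:

<?php
// sitemap.php - emit an XML sitemap from the database so crawlers can
// find pages that normal link-following would miss.
header('Content-Type: application/xml; charset=utf-8');
$pdo = new PDO('mysql:host=localhost;dbname=site;charset=utf8', 'user', 'pass');

echo '<?xml version="1.0" encoding="UTF-8"?>' . "\n";
echo '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">' . "\n";
foreach ($pdo->query('SELECT id, updated_at FROM pages') as $row) {
    echo '  <url>'
        . '<loc>http://site.com/page?id=' . (int) $row['id'] . '</loc>'
        . '<lastmod>' . date('Y-m-d', strtotime($row['updated_at'])) . '</lastmod>'
        . "</url>\n";
}
echo '</urlset>';
?>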
Search engines index pages, not databases. Your pages can be dynamic; crawlers come back often enough to update the indexed content and pick up anything new. You don't have to provide a URL for every page, just the first page in a series: the search engine will find and follow any pagination links and index the subsequent pages.
In addition to the other comments, use search engine friendly URLs. This will require you to rewrite your URLs.
Some links:
http://www.seoconsultants.com/articles/1000/urls
http://articles.sitepoint.com/article/search-engine-friendly-urls
http://www.evolt.org/article/Search_Engine_Friendly_URLs_with_PHP_and_Apache/17/15049/index.html
The basic idea is that a search engine can do more with a URL in the format:
http://mysite.com/cars/toyota/tacoma
Than it can with a URL in the format:
http://mysite.com/item.php?mid=123&modid=456
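One common way to bridge the two is a front controller: a catch-all rewrite (something like RewriteRule ^(.*)$ item.php?path=$1 [L,QSA]) sends every request to one script, which parses the path itself. A sketch, with the lookup step left hypothetical:

<?php
// item.php - resolve a friendly URL like /cars/toyota/tacoma back to
// whatever IDs the application actually uses.
$path = isset($_GET['path']) ? $_GET['path'] : '';
$segments = array_values(array_filter(explode('/', $path)));
// $segments is now e.g. array('cars', 'toyota', 'tacoma')
list($section, $make, $model) = array_pad($segments, 3, null);
// ...look up $section/$make/$model in the database as before...
?>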
I want to build an in-site search engine with PHP. Users must log in to see the information, so I can't use the Google or Yahoo search engine code.
For now, I want the engine to search the text of the pages themselves rather than the tables in the MySQL database.
Has anyone ever done this? Could you give me some pointers to help me get started?
You'll need a spider that harvests pages from your site (in a cron job, for example), strips the HTML and saves them in a database.
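A bare-bones version of that spider might look like the following; the URL list and table name are assumptions, and a real spider would discover links instead of hard-coding them:

<?php
// Bare-bones spider: fetch each page, strip markup, store plain text.
// Run it from cron. Assumes a unique key on pages.url.
$pdo = new PDO('mysql:host=localhost;dbname=search;charset=utf8', 'user', 'pass');
$insert = $pdo->prepare('REPLACE INTO pages (url, title, body) VALUES (?, ?, ?)');

foreach (array('http://site.com/', 'http://site.com/about') as $url) {
    $html = file_get_contents($url);
    if ($html === false) {
        continue; // skip unreachable pages
    }
    $title = preg_match('#<title>(.*?)</title>#si', $html, $m) ? trim($m[1]) : $url;
    $body = trim(preg_replace('/\s+/', ' ', strip_tags($html)));
    $insert->execute(array($url, $title, $body));
}
?>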
You might want to have a look at Sphinx (http://sphinxsearch.com/); it is a search engine that can easily be accessed from PHP scripts.
You can cheat a little, the way the much-hated Experts-Exchange website does. It is a for-profit programmers' Q&A site, much like Stack Overflow. To see answers you have to pay, but sometimes the answers come up in Google search results; it is fairly clear that E-E presents one page to web crawlers and a different one to humans. You could use the same trick, then add Google Custom Search to your site. Users who are logged in would see the results; otherwise they'd be bounced to the login screen.
Do you have control over your server? Then I would recommend installing Solr/Lucene for indexing and SolPHP for interacting with it from PHP. That way you can have facets and other nice full-text search features.
I would not spider the actual rendered pages; instead I would spider versions of the pages without navigation and other non-content elements.
Note that Solr requires Java on the server.
In the end I used Sphider, which is a free tool, and it works well with PHP.
Thanks all.
If the content and titles of your pages are already managed by a database, you just need to write your search engine in PHP. There are plenty of solutions for querying your database, for example:
http://www.webreference.com/programming/php/search/
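For instance, a minimal sketch using MySQL full-text search; it assumes you have added a FULLTEXT index, e.g. ALTER TABLE pages ADD FULLTEXT(title, body):

<?php
// Minimal site search over a pages table with a FULLTEXT index.
$pdo = new PDO('mysql:host=localhost;dbname=search;charset=utf8', 'user', 'pass');
$stmt = $pdo->prepare(
    'SELECT url, title, MATCH(title, body) AGAINST (?) AS score
       FROM pages
      WHERE MATCH(title, body) AGAINST (?)
      ORDER BY score DESC
      LIMIT 20'
);
$q = isset($_GET['q']) ? $_GET['q'] : '';
$stmt->execute(array($q, $q));
foreach ($stmt as $row) {
    echo '<a href="' . htmlspecialchars($row['url']) . '">'
        . htmlspecialchars($row['title']) . '</a><br>';
}
?>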
If the content is contained only in HTML files and not in the DB, you might want to write a spider.
You may also be interested in caching the results to improve performance.
I would say that everything depends on the size and complexity of your website/web application.