I have a page on my site that fetches and displays news items from the database of another (legacy) site on the same server. Some of the items contain relative links that should be fixed so that they direct to the external site instead of causing 404 errors on the main site.
I first considered using the <base> tag on the fetched news items, but this changes the base URL of the whole page, breaking the relative links in the main navigation - and it feels pretty hackish too.
I'm currently thinking of creating a regex to find the relative URLs (they all start with /index.php?) and prepending them with the desired base URL. Are there any more elegant solutions to this? The site is built on Symfony 2 and uses jQuery.
Here is how I would tackle the problem:
function prepend_url ($prefix, $path) {
// Prepend $prefix to $path if $path is not a full URL
$parts = parse_url($path);
return empty($parts['scheme']) ? rtrim($prefix, '/').'/'.ltrim($path, '/') : $path;
}
// The URL scheme and domain name of the other site
$otherDomain = 'http://othersite.tld';
// Create a DOM object
$dom = new DOMDocument('1.0');
$dom->loadHTML($inHtml); // $inHtml is an HTML string obtained from the database
// Create an XPath object
$xpath = new DOMXPath($dom);
// Find candidate nodes
$nodesToInspect = $xpath->query('//*[@src or @href]');
// Loop candidate nodes and update attributes
foreach ($nodesToInspect as $node) {
if ($node->hasAttribute('src')) {
$node->setAttribute('src', prepend_url($otherDomain, $node->getAttribute('src')));
}
if ($node->hasAttribute('href')) {
$node->setAttribute('href', prepend_url($otherDomain, $node->getAttribute('href')));
}
}
// Find all nodes to export
$nodesToExport = $xpath->query('/html/body/*');
// Iterate and stringify them
$outHtml = '';
foreach ($nodesToExport as $node) {
$outHtml .= $node->C14N();
}
// $outHtml now contains the "fixed" HTML as a string
You can override the base tag by putting http:// in front of the link. That is, give a full URL, not a relative URL.
Well, not actually a solution, but mostly a tip...
You could start playing around with ExceptionController.
There, for example, you could look for a 404 error and check the query string appended to the request:
$request = $this->container->get('request');
....
if (404 === $exception->getStatusCode()) {
$query = $request->server->get('QUERY_STRING');
//...handle your logic
}
The other solution would be to define a special route, with its own controller, that catches requests to index.php and handles the redirects and so on. Just define index.php in the route's requirements and move this route to the top of your routing.
Not the clearest answer ever, but at least I hope I gave you a direction...
Cheers ;)
I am new to drupal coding and still fairly new to PHP. I have gotten myself to a certain point, and am now stuck! The documentation has helped me a lot up to this point, but I find myself struggling to make it over this hurdle.
My Code:
<?php
//Pulls the refering page url
$prev_page = $_SERVER['HTTP_REFERER'];
//breaks the referer into an associated array
$delimit = '/';
$splode = explode($delimit,$prev_page);
$chunked = array_slice($splode, 3, NULL);
//iterates through the array to output the address as a string
foreach($chunked as $k=>$v){
$path .= $v."/";
}
//find the node id from the alias
$node = drupal_get_normal_path($path);
echo $node;
?>
So I have gotten the referring page address down to just the path (i.e. about-us/tim rather than http://www.google.com/about-us/tim) to pass into drupal_get_normal_path.
When I put the actual URI into drupal_get_normal_path as a literal, I receive the node information that I expected, but when I use the variable as shown in the code block above, it returns the text stored in the variable instead of finding the node source.
Any help y'all can give is greatly appreciated!
Think this is fairly similar to this question here.
What you're doing wrong is assuming that the function returns a node. It doesn't; it just returns the internal path to that node. So you still have to get the object (the node) referenced by that path, and then you actually have the node ID.
Basically, you can achieve it using (this code is slightly more efficient and compact than what you have - plus it works!):
$url = $_SERVER['HTTP_REFERER'];
$path = preg_replace('/\//','',parse_url($url,PHP_URL_PATH),1);
$org_path = drupal_lookup_path("source", $path);
$node = menu_get_object("node", 1, $org_path);
$nid=$node->nid;
print $nid;
If you actually want to load the node, you just do node_load($nid) after all this.
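As a hedged aside, the path-extraction step can be checked on its own, without Drupal. The helper name referer_to_alias is hypothetical, and the sketch assumes the alias is simply the URL path with its surrounding slashes removed (the snippet above only strips the leading one; a trailing slash would likely fail to match a stored alias).

```php
<?php
// Hypothetical helper (not part of Drupal): turn a full referer URL into
// an alias-style path such as "about-us/tim".
function referer_to_alias($url) {
    $path = parse_url($url, PHP_URL_PATH);
    // parse_url() returns null when there is no path and false on a malformed URL.
    if (!is_string($path)) {
        return '';
    }
    // Assumption: aliases are stored without leading or trailing slashes.
    return trim($path, '/');
}

echo referer_to_alias('http://www.google.com/about-us/tim'), "\n"; // prints "about-us/tim"
```

The result is what you would then hand to drupal_lookup_path() in the snippet above.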
Hope this helps!
URL : http://www.sayuri.co.jp/used-cars
Example : http://www.sayuri.co.jp/used-cars/B37753-Toyota-Wish-japanese-used-cars
Hey guys, I need some help with one of my personal projects. I've already written the code to fetch data from each single car URL (see the example) and post it on my site.
Now I need to go through the main URL, sayuri.co.jp/used-cars, and:
1) Make an array / list / nodes of all the URLs for all the single cars in it, then run my internal code for each one to fetch data, then move on to the next one.
I already have the code to save each URL into a log file when completed (I don't think it will be necessary if it goes link by link without starting from the top, but it will ensure no repetition).
2) When all links are done for the page, it should move to the next page and do the same thing until the end (there are 5-6 pages max).
I've been stuck on this part since last night and would really appreciate any help. Thanks.
My code to get data from the main url :
$content = file_get_contents('http://www.sayuri.co.jp/used-cars/');
// echo $content;
and
$dom = new DOMDocument;
$dom->loadHTML($content);
//echo $dom;
I'm guessing you already know this since you say you've gotten data from the car entries themselves, but a good point to start is by dissecting the page's DOM and seeing if there are any elements you can use to jump around quickly. Most browsers have page inspection tools to help with this.
In this case, <div id="content"> serves nicely. You'll note it contains a collection of tables with the required links and a <div> that contains the text telling us how many pages there are.
Disclaimer: it's been years since I've done PHP and I have not tested this, so it is probably neither correct nor optimal, but it should get you started. You'll need to tie the functions together (what's the fun in me doing it?) to achieve what you want, but these should grab the data required.
You'll be working with the DOM on each page, so a convenience to grab the DOMDocument:
function get_page_document($index) {
$content = file_get_contents("http://www.sayuri.co.jp/used-cars/page:{$index}");
$document = new DOMDocument;
$document->loadHTML($content);
return $document;
}
You need to know how many pages there are in total in order to iterate over them, so grab it:
function get_page_count($document) {
$content = $document->getElementById('content');
$count_div = $content->childNodes->item($content->childNodes->length - 4);
$count_text = $count_div->firstChild->textContent;
if (preg_match('/Page \d+ of (\d+)/', $count_text, $matches) === 1) {
return (int) $matches[1]; // cast so callers get an integer, matching the -1 fallback
}
return -1;
}
It's a bit ugly, but the links are available inside each <table> in the contents container. Rip 'em out and push them in an array. If you use the link itself as the key, there is no concern for duplicates as they'll just rewrite over the same key-value.
function get_page_links($document) {
$content = $document->getElementById('content');
$tables = $content->getElementsByTagName('table');
$links = array();
foreach ($tables as $table) {
if ($table->getAttribute('class') === 'itemlist-table') {
// table > tbody > tr > td > a
$link = $table->firstChild->firstChild->firstChild->firstChild->getAttribute('href');
// No duplicates because they just overwrite the same entry.
$links[$link] = "http://www.sayuri.co.jp{$link}";
}
}
return $links;
}
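For what it's worth, here is one untested-against-the-real-site way the pieces might be tied together. The function name collect_all_links is hypothetical, and the three callables are stand-ins for get_page_document(), get_page_count() and get_page_links() above; it also assumes the first page is reachable as page 1.

```php
<?php
// Hypothetical glue: walk pages 1..N and merge the links found on each page.
function collect_all_links($getDocument, $getCount, $getLinks) {
    $first = call_user_func($getDocument, 1);
    $pages = (int) call_user_func($getCount, $first);
    if ($pages < 1) {
        $pages = 1; // page count not found: fall back to the first page only
    }
    $links = call_user_func($getLinks, $first);
    for ($i = 2; $i <= $pages; $i++) {
        $document = call_user_func($getDocument, $i);
        // Links are keyed by their href, so the array union skips duplicates.
        $links += call_user_func($getLinks, $document);
    }
    return $links;
}

// Assumed usage with the functions from this answer:
// $all = collect_all_links('get_page_document', 'get_page_count', 'get_page_links');
```

Passing the functions in as callables keeps the loop independent of the site, so it can be tried with dummy data before hitting the real pages.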
Perhaps also obvious, but these will break if this site changes their formatting. You'd be better off asking if they have a REST API or some such available for long term use, though I'm guessing you don't care as much if it's just a personal project for tinkering.
Hope it helps prod you in the right direction.
I'm struggling to make an AJAX-based website SEO-friendly. As recommended in tutorials on the web, I've added "pretty" href attributes to links (for example, a "contact" link) and, in a div where content is loaded with AJAX by default, a PHP script for crawlers:
$files = glob('./pages/*.php');
foreach ($files as &$file) {
$file = substr($file, 8, -4);
}
if (isset($_GET['site'])) {
if (in_array($_GET['site'], $files)) {
include ("./pages/".$_GET['site'].".php");
}
}
I have a feeling that I first need to cut the _escaped_fragment_= part from (...)/index.php?_escaped_fragment_=site=about, because otherwise the script won't be able to GET the site value from the URL. Am I right?
More generally, how do I know that the crawler transforms pretty links (those with #!) into ugly links (containing ?_escaped_fragment_=)? I've been told that it happens automatically and I don't need to provide this mapping, but Fetch as Googlebot doesn't give me any information about what happens to the URL.
Googlebot will automatically query for ?_escaped_fragment_= URLs.
So from www.example.com/index.php#!site=about
Googlebot will query: www.example.com/index.php?_escaped_fragment_=site=about
On the PHP side you will get it as $_GET['_escaped_fragment_'] = "site=about"
If you want to get the value of "site", you need to do something like this:
if(isset($_GET['_escaped_fragment_'])){
$escaped = explode("=", $_GET['_escaped_fragment_']);
if(isset($escaped[1]) && in_array($escaped[1], $files)){
include ("./pages/".$escaped[1].".php");
}
}
Take a look at the documentation:
https://developers.google.com/webmasters/ajax-crawling/docs/specification
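If the fragment ever carries more than one key=value pair, splitting on "=" stops being enough. A minimal sketch, assuming the fragment is formatted like a query string (for example site=about&lang=en); the helper name fragment_param is hypothetical:

```php
<?php
// Hypothetical helper: read one parameter out of the _escaped_fragment_ value.
function fragment_param($fragment, $key) {
    $params = array();
    // parse_str() decodes "a=1&b=2" style strings into an associative array.
    parse_str($fragment, $params);
    return isset($params[$key]) && is_string($params[$key]) ? $params[$key] : null;
}

echo fragment_param('site=about&lang=en', 'site'), "\n"; // prints "about"
```

The returned value should still be checked against the $files whitelist before it is passed to include, exactly as in the snippet above.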
I am creating a website using the MVC structure. Below is a code I have used to use clean URLS and load the appropriate files. However it only works for the first level.
Say I wanted to visit mywebsite.com/admin it would work, however mywebsite.com/admin/dashboard would not. The problem is in the arrays, how could I get the array to load content after the 2nd level along with the second level.
Would it be best to create an array like this?
Array
- controller
- view
- dashboard
Any help here would be great. Also, as a side question: what would be the best way to set up "custom" URLs? If I were to put in mywebsite.com/announcement, it would check whether it has a controller; failing that, it would check whether it has custom content (maybe a file of the same name in a "customs" folder); and if there's nothing, it would execute the 404 page-not-found handling. This isn't a priority question, but it's loosely associated with how the code works, so I thought it best to add.
function hook() {
$params = parse_params();
$url = $_SERVER['REQUEST_URI'];
$url = str_replace('?'.$_SERVER['QUERY_STRING'], '', $url);
$urlArray = array();
$urlArray = explode("/",$url);
var_dump($urlArray);
if (isset($urlArray[2]) && !empty($urlArray[2])) {
$route['controller'] = $urlArray[2];
} else {
$route['controller'] = 'front'; // Default Action
}
if (isset($urlArray[3]) && !empty($urlArray[3])) {
$route['view'] = $urlArray[3];
} else {
$route['view'] = 'index'; // Default Action
}
include(CONTROLLER_PATH.$route['controller'].'.php');
include(VIEW_PATH.$route['controller'].DS.$route['view'].'.php');
var_dump($route['controller']);
var_dump($route['view']);
var_dump($urlArray);
var_dump($params);
// reseting messages
$_SESSION['flash']['notice'] = '';
$_SESSION['flash']['warning'] = '';
}
// Return form array
function parse_params() {
$params = array();
if(!empty($_POST)) {
$params = array_merge($params, $_POST);
}
if(!empty($_GET)) {
$params = array_merge($params, $_GET);
}
return $params;
}
Can you clarify this: "The problem is in the arrays, how could I get the array to load content after the 2nd level along with the second level."
I don't understand how you want this thing to work. I checked your code and it works. Maybe you just need to put $urlArray[1] instead of $urlArray[2] and 2 instead of 3? First element in the array is at index 0.
Usually it's done like this:
Url format:
/controller/action/param1/param2/...
-controller- should be a class. That class has a method/function called -action-.
ex. /shoes/show/121/ --> this will load controller shoes
and execute the method/function show(121)
that will show the shoes that have the id 121 in the
database.
ex. /shoes/list/sport --> this will load controller shoes
and execute function list('sport') that will list all
shoes in the sport category.
As you can see, you only load one controller and from that controller you run only one function and that function will get the rest of the path and use it as parameters.
If you want to have multiple controllers for one URL, then the rest of the controllers will have to be loaded from the main controller. Most MVCs (like CodeIgniter) load only one controller per URL.
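To make the /controller/action/param1/param2 convention concrete, here is a minimal sketch of the splitting step. The function name parse_route is hypothetical, and the defaults 'front' and 'index' are borrowed from the question's code; real frameworks do considerably more (validation, dispatching to a class method).

```php
<?php
// Minimal sketch: split "/controller/action/param1/..." into its parts.
function parse_route($uri) {
    // Drop any query string and keep only the path component.
    $path = parse_url($uri, PHP_URL_PATH);
    $path = is_string($path) ? trim($path, '/') : '';
    $segments = $path === '' ? array() : explode('/', $path);
    return array(
        'controller' => isset($segments[0]) ? $segments[0] : 'front',
        'action'     => isset($segments[1]) ? $segments[1] : 'index',
        // Everything after the action is passed along as parameters.
        'params'     => array_slice($segments, 2),
    );
}

$route = parse_route('/shoes/show/121/');
// controller: 'shoes', action: 'show', params: array('121')
```

With this shape, /admin/dashboard resolves to controller 'admin' and action 'dashboard', and any deeper levels simply become parameters.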
Second question:
Best way for pretty urls would be to save them in the db. This means you can have URLs like this:
/I-can-write-anything-here-No-need-to-add-ids-or-controller-names
Then you take this URL and search it in db and get the -controller- and -action- that you need for this URL.
But I have yet to see a popular MVC framework do this. I guess the reason is that the db will get a lot of queries for text matches and that will slow things down.
Popular MVC frameworks use:
/controller/action/param1/param2
This has the benefit that you can directly find the controller/action from the url.
The downside is that you will get urls like:
/shoes/list/sport
//when what you really want is
/shoes/sport
//or just
/sport //if the website only sells shoes
This can be fixed by redirecting /shoes/sport to /shoes/list/sport
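One hedged way to picture that mapping is a small alias table consulted before routing. Note this sketch does an internal lookup rather than an HTTP redirect, and the names resolve_alias and $aliases are hypothetical:

```php
<?php
// Hypothetical alias table lookup: short public URL => full controller/action path.
function resolve_alias($path, $aliases) {
    // Normalize to a single leading slash and no trailing slash.
    $key = '/' . trim($path, '/');
    // Unknown paths are returned unchanged so normal routing still applies.
    return isset($aliases[$key]) ? $aliases[$key] : $key;
}

$aliases = array(
    '/shoes/sport' => '/shoes/list/sport',
    '/sport'       => '/shoes/list/sport',
);

echo resolve_alias('/shoes/sport/', $aliases), "\n"; // prints "/shoes/list/sport"
```

The same table could just as well live in the database, which is the approach described in the second answer above, at the cost of a lookup per request.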
If you make your own MVC then you should use OOP, because if not, things will get ugly quickly: all actions/functions end up in the same namespace.
Personally I would recommend that you use one of the many PHP frameworks that exist as that will take care of the routing for you and let you concentrate on writing your application. CakePHP is one that I've used for a while and it makes my life so much easier.
What I do:
I create a .htaccess file that rewrites a URL like www.example.com/url/path/or/something to www.example.com/index.php?url=url/path/or/something, so it is pretty easy to do an explode on your $_GET['url'].
Second, it's better because everything a user requests is routed to your index.php, so you have FULL control over EVERYTHING.
If you want I can PM you the url to my mvc (bitbucket) so you can have a look on how I do this ;)
(Sorry for the others, but I don't like to put url's to my site in public)
edit:
To be more precise to your particular question; It will solve your problem, because everything goes to index.php and you have full control over the requested url.
I am developing a new module and in my hook_menu_alter() I need to detect the node currently being viewed.
Instead of using arg(1) to fetch the node ID from the URL, I discovered I can use menu_get_object().
The following code works in my hook_init() but does not in hook_menu_alter():
$node = menu_get_object();
dpm($node);
Can anyone offer some insight into why that does not work and how to get the current node information in hook_menu_alter()?
Thanks.
The output from hook_menu, hook_menu_alter etc. is cached, so those functions will only be called when the caches are cleared, not on every page load. If you think about it, if the menus were rebuilt on every page load, the performance of the site would suffer considerably.
As such, when hook_menu_alter is called (which won't be from a node page), there's no node for menu_get_object() to give you. The way to handle these things is in the page/access callback for the menu item:
function mymodule_menu_alter(&$items) {
$items['some/path']['page callback'] = 'mymodule_page_callback';
}
function mymodule_page_callback() {
// This is a live page so menu_get_object() is now available
$node = menu_get_object();
}
From your comment I think you're trying to deny access to particular nodes based on some criteria. For this you'll want to implement your own access callback for the already existing node/% menu path. Something like this:
function mymodule_menu_alter(&$items) {
$items['node/%node']['access callback'] = 'mymodule_access_callback';
}
function mymodule_access_callback($node) {
if ($node->type == 'group') {
if (some_function_that_determines_access($node)) {
return TRUE;
}
return FALSE;
}
return node_access('view', $node);
}