I need to display the title of the referring page, and here is the code I'm using to achieve that:
<?php
if (isset($_SERVER['HTTP_REFERER'])) {
$url_to_load = $_SERVER['HTTP_REFERER'];
$f = file_get_contents($url_to_load);
$p1 = strpos($f, "<title>");//position start
$qe = substr($f, $p1);//string from start position
$p2 = strpos($qe, "</title>");//position end
$query = substr($qe, 7, $p2-2);//cuts from start position +7 (<title>) until end position -2...
echo $query;}
else{
$ref_url = 'No Referring URL'; // show failure message
}//end else no referer set
echo "$ref_url";
?>
When I visit the page containing this code from a URL whose page contains the following:
<title>Title Of Referrer</title>
the code works, but a piece of the closing tag is still there. When I check the source code, this is what I get:
Title Of Referrer</tit
What do I need to change to remove the closing tag completely?
$query = substr($qe, 7, $p2-7);//cuts from start position +7 (<title>) with length end position -7
You add 7 at the start to skip <title>, but you only subtract 2 at the end. The third argument of substr() is a length, not an end position, so you have to subtract the same 7 there.
Try the code above and see if that works.
EDIT:
Another solution is to do it like this:
$query = strip_tags(substr($qe, 0, $p2));
This keeps the opening <title> tag in the substring and then removes it with strip_tags().
EDIT2:
There are some other things in the code I would suggest.
$f = file_get_contents($url_to_load);
$start = strpos($f, "<title>");
$query = strip_tags(substr($f, $start, strpos($f, "</title>") - $start));
This brings it down to three lines of code and uses fewer variables. The third argument of substr() is a length, so the start offset has to be subtracted from the position of </title>. You can also get rid of $f, but it may be useful for something else and it's only one variable.
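The offset arithmetic discussed above can be wrapped in a small function that also handles a missing tag. This is only a sketch: the name extract_title() is made up for illustration, and it assumes the page uses a plain <title> tag without attributes.

```php
<?php
// Hypothetical helper (not from the original code): returns the text between
// <title> and </title>, or null when either tag is missing.
function extract_title(string $html): ?string
{
    $start = stripos($html, '<title>');
    if ($start === false) {
        return null;
    }
    $start += strlen('<title>');                 // skip past the opening tag
    $end = stripos($html, '</title>', $start);   // search only after the opening tag
    if ($end === false) {
        return null;
    }
    // substr() takes a length, so subtract the start offset from the end offset
    return trim(substr($html, $start, $end - $start));
}

echo extract_title('<html><head><title>Title Of Referrer</title></head></html>');
```

Checking for false explicitly matters here, because strpos() and stripos() can legitimately return 0 when the tag is at the very beginning of the string.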
How can I read words from the URL in PHP?
Example 1
http://myurl.com/keyword-one
I want to display "Keyword One" as the page title in PHP
Example 2
http://myurl.com/keyword/keyword-two
I want to display "Keyword Two" as the page title in PHP
Thank you! :-)
The simplest solution in those cases would be the following (untested code, but should work):
<?php
//get path
$urlPath = $_SERVER["REQUEST_URI"];
//get last element
$end = array_slice(explode('/', rtrim($urlPath, '/')), -1)[0];
//replace dashes with spaces and display
echo ucwords(str_replace('-',' ',$end));
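Note that REQUEST_URI also contains the query string, if there is one. A possible variant, assuming you want to ignore anything after "?", is sketched below; title_from_uri() is a hypothetical name chosen for this example.

```php
<?php
// Hypothetical helper: turns the last path segment of a request URI into a title.
function title_from_uri(string $requestUri): string
{
    // Keep only the path part, dropping any query string
    $path = parse_url($requestUri, PHP_URL_PATH);
    $path = is_string($path) ? $path : '';
    // Last path segment, ignoring a trailing slash
    $slug = basename(rtrim($path, '/'));
    // Replace dashes with spaces and capitalise each word
    return ucwords(str_replace('-', ' ', $slug));
}

echo title_from_uri('/keyword/keyword-two?ref=1');
```

In a real page you would pass $_SERVER["REQUEST_URI"] instead of the literal string.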
I want to scrape a few web pages. I am using PHP and Simple HTML DOM Parser.
For instance, I am trying to scrape this site: https://www.autotrader.co.uk/motorhomes/motorhome-dealers/bc-motorhomes-ayr-dpp-10004733?channel=motorhomes&page=5
I use this to load the URL:
$html = new simple_html_dom();
$html->load_file($url);
This loads the correct page. Then I find the next page link, here it will be:
https://www.autotrader.co.uk/motorhomes/motorhome-dealers/bc-motorhomes-ayr-dpp-10004733?channel=motorhomes&page=6
Just the page value is changed from 5 to 6. The code snippet to get the next link is:
function getNextLink($_htmlTemp)
{
//Getting the next page links
$aNext = $_htmlTemp->find('a.next', 0);
$nextLink = $aNext->href;
return $nextLink;
}
The above method returns the correct link with page value being 6.
Now when I try to load this next link, it fetches the default first page, as if the page parameter were absent from the URL.
//After loop we will have details of all the listing in this page -- so get next page link
$nxtLink = getNextLink($originalHtml); //Returns string url
if(!empty($nxtLink))
{
//Yay, we have the next link -- load the next link
print 'Next Url: '.$nxtLink.'<br>'; //$nxtLink has correct value
$originalHtml->load_file($nxtLink); //This line fetches default page
}
The whole flow is something like this:
$html->load_file($url);
//Whole thing in a do-while loop
$originalHtml = $html;
$shouldLoop = true;
//Main Array
$value = array();
do{
$listings = $originalHtml->find('div.searchResult');
foreach($listings as $item)
{
//Some logic here
}
//After loop we will have details of all the listing in this page -- so get next page link
$nxtLink = getNextLink($originalHtml); //Returns string url
if(!empty($nxtLink))
{
//Yay, we have the next link -- load the next link
print 'Next Url: '.$nxtLink.'<br>';
$originalHtml->load_file($nxtLink);
}
else
{
//No next link -- stop the loop as we have covered all the pages
$shouldLoop = false;
}
} while($shouldLoop);
I have tried encoding the whole URL and encoding only the query parameters, but the result is the same. I also tried creating new instances of simple_html_dom and then loading the file, with no luck. Please help.
You need to run html_entity_decode() on those links; I can see that they are getting mangled by simple-html-dom.
$url = 'https://www.autotrader.co.uk/motorhomes/motorhome-dealers/bc-motorhomes-ayr-dpp-10004733?channel=motorhomes';
$html = str_get_html(file_get_contents($url));
while($a = $html->find('a.next', 0)){
$url = html_entity_decode($a->href);
echo $url . "\n";
$html = str_get_html(file_get_contents($url));
}
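To see why the decoding step matters, here is a small illustration. The $rawHref value is an assumed example of what an href attribute may look like in the page source when "&" is written as an HTML entity; it is not copied from the actual site.

```php
<?php
// Assumed example of an href as it may appear in the HTML source, with "&" written as "&amp;"
$rawHref = '/motorhomes/motorhome-dealers/bc-motorhomes-ayr-dpp-10004733?channel=motorhomes&amp;page=6';

// Turn HTML entities back into plain characters
$decoded = html_entity_decode($rawHref);

// Parse both query strings to compare the parameters a server would see
parse_str((string) parse_url($rawHref, PHP_URL_QUERY), $before);
parse_str((string) parse_url($decoded, PHP_URL_QUERY), $after);

var_dump(isset($before['page'])); // the raw link has no parameter named "page"
var_dump(isset($after['page']));  // the decoded link does
```

Without decoding, the server never receives a parameter called page, which would explain why it falls back to the first page.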
I am trying to make a "manner friendly" website. We use different declensions depending on gender and other factors. For example:
You did = robili
It did = robilo
She did = robila
Linguistically this is a very simplified (and unlucky) example! I would like to change the HTML text in a PHP file where appropriate. For example:
<?php
something
?>
html text of the page and somewhere is the word "robil"
<div>we tried to robil^i|o|a^</div>
<?php something ?>
Now I would like to find all occurrences of tokens of the form ^characters|characters|characters^ and replace each one with one of its internal values according to "gender".
It is easy in JavaScript on the client side, but then the visitor sees all this weird "tokenizing" before JavaScript replaces it.
I do not know an elegant solution for doing this on the server.
Or do you have a better idea?
Thanks for advice.
You can add these scripts before and after the HTML:
<?php
// start output buffering
ob_start();
?>
<html>
<body>
html text of the page and somewhere is the word "robil"
<div>we tried to robil^i|o|a^, but also vital^si|sa|ste^, borko^mal|mala|malo^ </div>
</body>
</html>
<?php
$use = 1; // indicate which declension to use (0, 1 or 2)
// get buffered html
$html = ob_get_contents();
ob_end_clean();
// match anything between '^' that's not a control chr or '^', min 3 and max 20 chrs.
if (preg_match_all('/\^[^[:cntrl:]\^]{3,20}\^/',$html,$matches))
{
// replace all
foreach (array_unique($matches[0]) as $match)
{
$choices = explode('|',trim($match,'^'));
$html = str_replace($match,$choices[$use],$html);
}
}
echo $html;
This returns:
html text of the page and somewhere is the word "robil" we tried to
robilo, but also vitalsa, borkomala
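The same replacement can also be written as a single pass with preg_replace_callback(). This is a sketch of an alternative, not the code from the answer above: apply_declension() is a hypothetical name, the pattern is stricter in that it requires at least one "|" inside the token, and falling back to the first form when the index is out of range is an assumption.

```php
<?php
// Hypothetical helper: replaces every ^a|b|c^ token with the form at index $use.
function apply_declension(string $html, int $use): string
{
    $result = preg_replace_callback(
        // '^', then two or more '|'-separated parts without '^', '|' or control chars, then '^'
        '/\^([^\^|[:cntrl:]]+(?:\|[^\^|[:cntrl:]]+)+)\^/',
        function (array $m) use ($use): string {
            $choices = explode('|', $m[1]);
            // Assumption: use the first form when the requested index does not exist
            return $choices[$use] ?? $choices[0];
        },
        $html
    );
    // preg_replace_callback() returns null on error; keep the original text in that case
    return $result ?? $html;
}

echo apply_declension('we tried to robil^i|o|a^, but also vital^si|sa|ste^', 1);
```

It can be combined with the output buffering shown above by passing the buffered HTML through the function before echoing it.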
I am writing a simple PHP crawler that gets data from a website and inserts it into my database. I start with a predefined URL. Then I go through the contents of the page (from PHP's file_get_contents) and eventually use file_get_contents on the links of that page. The URLs I am getting from the links are fine when I echo them and then open them in my browser on their own. However, when I use file_get_contents and then echo the result, the page does not appear correctly because of errors related to dynamically created server-side data from the site. The echoed page contents do not include the listed data from the server that I need, because the page cannot find the resources it needs.
It appears relative paths in the echo'd webpage are not allowing the desired content to be generated.
Can anyone point me in the right direction here?
Any help is appreciated!
Here is some of my code so far:
function crawl_all($url)
{
$main_page = file_get_contents($url);
while(strpos($main_page, '"fl"') > 0)
{
$subj_start = strpos($main_page, '"fl"'); // get start of subject row
$main_page = substr($main_page, $subj_start); // cut off everything before subject row
$link_start = strpos($main_page, 'href') + 6; // get the start of the subject link
$main_page = substr($main_page, $link_start); // cut off everything before subject link
$link_end = strpos($main_page, '">') - 1; // get the end of the subject link
$link_length = $link_end + 1;
$link = substr($main_page, 0, $link_length); // get the subject link
crawl_courses('https://whatever.com' . $link);
}
}
/* Crawls all the courses for a subject. */
function crawl_courses($url)
{
$subj_page = file_get_contents($url);
echo $url; // website looks fine when opened in a browser
echo $subj_page; // when echoed, the page does not contain most of the server-side generated data I need
while(strpos($subj_page, '<td><a href') > 0)
{
$course_start = strpos($subj_page, '<td><a href');
$subj_page = substr($subj_page, $course_start);
$link_start = strpos($subj_page, 'href') + 6;
$subj_page = substr($subj_page, $link_start);
$link_end = strpos($subj_page, '">') - 1;
$link_length = $link_end + 1;
$link = substr($subj_page, 0, $link_length);
//crawl_professors('https://whatever.com' . $link);
}
}
Try Advanced HTML DOM Parser. It is available here:
http://sourceforge.net/projects/advancedhtmldom/
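For comparison, the link extraction itself can also be done with a DOM parser instead of strpos()/substr(). The sketch below uses PHP's built-in DOMDocument rather than the library linked above, so it is a swapped-in technique; extract_links() is a hypothetical name, and it assumes the links are root-relative ("/path") as in the question's code.

```php
<?php
// Hypothetical helper: collects the href of every <a> element in an HTML string.
function extract_links(string $html, string $baseUrl): array
{
    if ($html === '') {
        return [];
    }
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid; collect parser warnings instead of printing them
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = [];
    foreach ($doc->getElementsByTagName('a') as $a) {
        $href = $a->getAttribute('href');
        if ($href === '') {
            continue;
        }
        // Assumption: only root-relative links ("/path") need the base prepended
        $links[] = (strpos($href, '/') === 0) ? rtrim($baseUrl, '/') . $href : $href;
    }
    return $links;
}

print_r(extract_links(
    '<table><tr><td><a href="/courses/math">Math</a></td></tr></table>',
    'https://whatever.com'
));
```

This only replaces the string slicing. If the missing data is generated by JavaScript in the browser, it will not be present in the HTML returned by file_get_contents() either way.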
I have the following script, which is not working. What do I need to do to add the link?
jno = "97856483";
dispTitle = "new book";
dispAuthor = "authorname";
document.getElementById('popups').innerHTML = '';
//Add link to add this book:
var url = encodeURIComponent(jno) + "&tt=" + encodeURIComponent(dispTitle) + "&at=" + encodeURIComponent(dispAuthor);
//document.writeln(url);
document.getElementById("addLink").innerHTML = "<a href='memaccountentry.php?isbn='+ url>Add book</a>" ; //This one just appends the word url.
//window.location.href = 'memaccountentry.php?isbn=' +jno +'&tt=' +dispTitle+'&at=' +dispAuthor; //I know this is working, but not a right way to do.
//I need to put a href link to go to the next page.
//ajax.open('GET', 'memaccountentry.php?isbn=' +jno +'&tt=' +dispTitle+'&at=' +dispAuthor', true);
You need to open and close your quotes properly, so that url is concatenated as a variable instead of being part of the string.
Try this:
document.getElementById("addLink").innerHTML = "<a href='memaccountentry.php?isbn="+ url +"'>Add book</a>" ; //url is now concatenated into the href
It looks like you didn't format your string correctly.
If this is not what you wanted, then you have me completely confused.
document.getElementById("addLink").innerHTML = "Add book";