How to crawl a site with server-generated content? - php

I am writing a simple PHP crawler that gets data from a website and inserts it into my database. I start with a predefined URL, go through the contents of the page (fetched with PHP's file_get_contents), and then call file_get_contents on the links found on that page. The URLs I extract from the links are fine when I echo them and open them in my browser on their own. However, when I fetch them with file_get_contents and echo the result, the page does not render correctly: the server-generated data I need is missing, apparently because the page cannot find resources it depends on.
It appears that relative paths in the echoed page prevent the desired content from being generated.
Can anyone point me in the right direction here?
Any help is appreciated!
Here is some of my code so far:
function crawl_all($url)
{
    $main_page = file_get_contents($url);
    while (strpos($main_page, '"fl"') !== false)
    {
        $subj_start = strpos($main_page, '"fl"'); // get start of subject row
        $main_page = substr($main_page, $subj_start); // cut off everything before subject row
        $link_start = strpos($main_page, 'href') + 6; // get the start of the subject link
        $main_page = substr($main_page, $link_start); // cut off everything before subject link
        $link_end = strpos($main_page, '">') - 1; // get the end of the subject link
        $link_length = $link_end + 1;
        $link = substr($main_page, 0, $link_length); // get the subject link
        crawl_courses('https://whatever.com' . $link);
    }
}
/* Crawls all the courses for a subject. */
function crawl_courses($url)
{
    $subj_page = file_get_contents($url);
    echo $url; // website looks fine when opened in a browser
    echo $subj_page; // when echoed, the page does not contain most of the server-side generated data I need
    while (strpos($subj_page, '<td><a href') !== false)
    {
        $course_start = strpos($subj_page, '<td><a href');
        $subj_page = substr($subj_page, $course_start);
        $link_start = strpos($subj_page, 'href') + 6;
        $subj_page = substr($subj_page, $link_start);
        $link_end = strpos($subj_page, '">') - 1;
        $link_length = $link_end + 1;
        $link = substr($subj_page, 0, $link_length);
        //crawl_professors('https://whatever.com' . $link);
    }
}

Try the Advanced HTML DOM parser. It is available here:
http://sourceforge.net/projects/advancedhtmldom/
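A DOM parser replaces the manual strpos/substr bookkeeping in the question. As a minimal sketch, here is the same kind of link extraction using PHP's built-in DOMDocument and DOMXPath rather than the library linked above. The function name extract_cell_links and the sample markup are assumptions made for illustration, not part of the original code:

```php
<?php
// Hypothetical helper: collect the href of every link that sits directly
// inside a table cell, prefixed with the site's base URL.
function extract_cell_links(string $html, string $base): array
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // tolerate malformed real-world HTML
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $links = [];
    foreach ($xpath->query('//td/a[@href]') as $a) {
        // Join base and relative path with exactly one slash between them.
        $links[] = rtrim($base, '/') . '/' . ltrim($a->getAttribute('href'), '/');
    }
    return $links;
}

// Sample markup standing in for a fetched subject page (assumed structure).
$sample = '<table><tr><td><a href="/courses/cs101">CS 101</a></td></tr></table>';
print_r(extract_cell_links($sample, 'https://whatever.com'));
```

For the sample above the returned array holds the single URL https://whatever.com/courses/cs101. Note that this only helps with parsing; content that the site adds with JavaScript after the page loads will still be missing from what file_get_contents returns.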

Related

Simple html dom always loading the default first page and not the specified url

I want to scrape a few web pages. I am using PHP and the Simple HTML DOM parser.
For instance, I am trying to scrape this site: https://www.autotrader.co.uk/motorhomes/motorhome-dealers/bc-motorhomes-ayr-dpp-10004733?channel=motorhomes&page=5
I use this to load the URL:
$html = new simple_html_dom();
$html->load_file($url);
This loads the correct page. Then I find the next page link, here it will be:
https://www.autotrader.co.uk/motorhomes/motorhome-dealers/bc-motorhomes-ayr-dpp-10004733?channel=motorhomes&page=6
Just the page value is changed from 5 to 6. The code snippet to get the next link is:
function getNextLink($_htmlTemp)
{
//Getting the next page links
$aNext = $_htmlTemp->find('a.next', 0);
$nextLink = $aNext->href;
return $nextLink;
}
The above method returns the correct link with page value being 6.
Now when I try to load this next link, it fetches the first default page with page query absent from the url.
//After loop we will have details of all the listing in this page -- so get next page link
$nxtLink = getNextLink($originalHtml); //Returns string url
if(!empty($nxtLink))
{
//Yay, we have the next link -- load the next link
print 'Next Url: '.$nxtLink.'<br>'; //$nxtLink has correct value
$originalHtml->load_file($nxtLink); //This line fetches default page
}
The whole flow is something like this:
$html->load_file($url);
//Whole thing in a do-while loop
$originalHtml = $html;
$shouldLoop = true;
//Main Array
$value = array();
do{
$listings = $originalHtml->find('div.searchResult');
foreach($listings as $item)
{
//Some logic here
}
//After loop we will have details of all the listing in this page -- so get next page link
$nxtLink = getNextLink($originalHtml); //Returns string url
if(!empty($nxtLink))
{
//Yay, we have the next link -- load the next link
print 'Next Url: '.$nxtLink.'<br>';
$originalHtml->load_file($nxtLink);
}
else
{
//No next link -- stop the loop as we have covered all the pages
$shouldLoop = false;
}
} while($shouldLoop);
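I have tried encoding the whole URL and encoding only the query parameters, but the result is the same. I also tried creating new instances of simple_html_dom and then loading the file, with no luck. Please help.
You need to run html_entity_decode on those links. The href values that simple-html-dom returns are still HTML-encoded (for example, & appears as &amp;), which would explain why the page parameter is ignored when the URL is requested.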
$url = 'https://www.autotrader.co.uk/motorhomes/motorhome-dealers/bc-motorhomes-ayr-dpp-10004733?channel=motorhomes';
$html = str_get_html(file_get_contents($url));
while($a = $html->find('a.next', 0)){
$url = html_entity_decode($a->href);
echo $url . "\n";
$html = str_get_html(file_get_contents($url));
}
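To see what the decoding step changes, here is a small self-contained sketch. The URL is a made-up example standing in for the href value read from the page, not the real link:

```php
<?php
// Hypothetical href value as it might appear in the page source,
// with the ampersand written as an HTML entity.
$rawHref = 'https://example.com/dealers?channel=motorhomes&amp;page=6';

// Turn &amp; back into a plain & so the query string is valid again.
$decoded = html_entity_decode($rawHref);
echo $decoded, "\n";

// Parse the decoded query string to confirm that "page" is now a real parameter.
parse_str(parse_url($decoded, PHP_URL_QUERY), $params);
print_r($params);
```

After decoding, the query string splits into the two parameters channel and page, which is what the server needs in order to return the requested page.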

Get and return media url (m3u8) using PHP

I have a website that hosts videos for a client. On the website the files load externally via an m3u8 link.
The client would now like to have those videos on a Roku channel.
If I simply use the m3u8 link from the site, it gives an error, because the generated URL is tied to a cookie, so a client must click the link to generate a new code.
What I would like, if possible (and I have not seen this here), is to scrape the HTML page and return just the link to the Roku via a PHP script on the website.
I know how to get titles and such using pure PHP, but I am having problems returning the m3u8 link.
I do have code, to show that I am not looking for handouts and am actually trying.
This is what I have used to get the title, for example.
Note: I would also like to know whether it is possible to have one PHP script that takes the HTML page URL as input, so I do not have to use a different PHP file for each video with the URL pretyped in.
<?php
$html = file_get_contents('http://example.com'); //get the html returned from the following url
$movie_doc = new DOMDocument();
libxml_use_internal_errors(TRUE); //disable libxml errors
if(!empty($html)){ //if any html is actually returned
$movie_doc->loadHTML($html);
libxml_clear_errors(); //remove errors for yucky html
$movie_xpath = new DOMXPath($movie_doc);
//get all the titles
$movie_row = $movie_xpath->query('//title');
if($movie_row->length > 0){
foreach($movie_row as $row){
echo $row->nodeValue . "<br/>";
}
}
}
?>
There is a simple approach for this, which involves using a regex.
In this example, let's say the page containing the m3u8 link is located at: http://example.com/theVideoPage
You would point the video URL source in your XML to your PHP file:
http://thisPhpFileLocation.com
<?php
$html = file_get_contents("http://example.com/theVideoPage");
preg_match_all(
    '/(http.*m3u8)/',
    $html,
    $posts, // will contain every m3u8 URL found on the page
    PREG_SET_ORDER // one array entry per match
);
foreach ($posts as $post) {
    $link = $post[0]; // full text of the match
    header("Location: $link"); // redirect the player to the stream
    exit; // stop after the first match so only one redirect is sent
}
?>
If you want a single script that takes the page name as a parameter, it could look something like this. For a video page named theVideoPage, you would use an address such as:
http://thisPhpFileLocation.com?id=theVideoPage
<?php
$id = $_GET['id']; // page name from the query string; validate it before real use
$html = file_get_contents("http://example.com/" . $id);
preg_match_all(
    '/(http.*m3u8)/',
    $html,
    $things, // will contain every m3u8 URL found on the page
    PREG_SET_ORDER // one array entry per match
);
foreach ($things as $thing) {
    $link = $thing[1]; // first capture group: the m3u8 URL
    // clear out the output buffer so the header can still be sent
    while (ob_get_level() > 0)
    {
        ob_end_clean();
    }
    // redirect to the stream and stop after the first match
    header("Location: $link");
    exit;
}
?>
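The pattern /(http.*m3u8)/ is greedy, so on a line that contains several URLs it can capture more than one link at once. Below is a minimal sketch of a stricter match, assuming the link appears in the page as an absolute http(s) URL ending in .m3u8, optionally followed by a query string. The helper name extract_m3u8_url and the sample markup are hypothetical:

```php
<?php
// Hypothetical helper: return the first .m3u8 URL found in a page, or null.
function extract_m3u8_url(string $html): ?string
{
    // Match a URL made of non-space, non-quote, non-bracket characters that
    // ends in ".m3u8", plus an optional "?query" part (for tokens and the like).
    $pattern = '~https?://[^\s"\'<>]+?\.m3u8(?:\?[^\s"\'<>]*)?~i';
    if (preg_match($pattern, $html, $match) === 1) {
        // Links inside HTML attributes may carry entities such as &amp;.
        return html_entity_decode($match[0]);
    }
    return null;
}

// Sample markup (assumed): two stream links on the same line.
$sample = '<video src="http://example.com/a/stream.m3u8?token=abc"></video>'
        . '<a href="http://example.com/b/other.m3u8">other</a>';
echo extract_m3u8_url($sample), "\n";
```

Because the character class excludes quotes and whitespace, the match stops at the end of the attribute value instead of running on to the last m3u8 on the line.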

PHP redirect but in parent frame

I have a small-time URL shortener at http://thetpg.tk using a simple PHP script and MySQL.
It takes the id, looks it up in the SQL database, and redirects to the link stored there using header().
But if I have a frameset whose source is something like http://thetpg.tk, the redirected link is loaded inside the frame instead of the parent window.
For example, look at the page source of
(1) http://thetpgmusic.tk, which has the frame source
(2) http://thetpg.tk/b, which further redirects to
(3) http://thepirategamer.tk/music.php.
I want (1) to load (3) as the parent, but just by making changes in the functions in (2).
So is there a function like
header(Location:http://thepirategamer.tk/music.php, '_parent');
in php, or is there any other way to implement it?
NOTE: I can't change anything in (2).
Thanks in advance ! :)
There are three solutions that can help you do this:
First solution:
This solution may involve PHP if you're using echo to generate your HTML code. When you need to output an a tag, make sure to add the attribute target='_parent':
<?php
echo '<a href="..." target="_parent">Click here</a>';
?>
Problem:
The problem with this solution is that it doesn't work if you need to redirect in the parent window from a page that you don't own (inside the iframe). The second solution addresses this problem.
Second solution:
This second solution is entirely client-side, which means you need to use some JavaScript. Define a JavaScript function that adds target='_parent' to every a tag inside the iframe. Note that browsers only allow the parent page to access the iframe's document this way when both are served from the same origin:
function init()
{
    // all links inside the iframe (the iframe element is assumed to have id 'iframe')
    var tagNames = document.getElementById('iframe').contentWindow.document.getElementsByTagName('a');
    for (var x = 0; x < tagNames.length; x++)
        tagNames[x].onclick = function()
        {
            this.setAttribute('target', '_parent');
        };
}
Now all you need to do is call this function when the body is loaded, like this:
<body onload="init();"> ... </body>
Problem:
The problem with this solution is that if you have a link that contains an anchor, like href="#", it will change the parent window to the child window. To solve this problem, use the third solution.
Third solution:
This solution is also client-side and uses JavaScript. It is like the second solution, except that you test whether the link is a URL to an external page or to an anchor before you redirect. So you need to define a function that returns true if it's a link to an external page and false if it's a simple anchor, and then use this function like this:
function init()
{
    var tagNames = document.getElementById('iframe').contentWindow.document.getElementsByTagName('a');
    for (var x = 0; x < tagNames.length; x++)
        tagNames[x].onclick = function()
        {
            if (is_external_url(this.href))
            {
                document.location = this.href; // load the link in the parent window
                return false; // cancel the default navigation inside the iframe
            }
        };
}
You also need to call this function when the body is loaded:
<body onload="init();"> ... </body>
Don't forget to define is_external_url().
Update:
Here is a solution to get the URL of the last child. It's a simple function that looks for frames and iframes inside the pages and follows their URLs:
function get_last_url($url)
{
    $code = file_get_contents($url);
    // look for a frameset first, then fall back to an iframe
    $start = strpos($code, '<frameset');
    $end = strpos($code, '</frameset>');
    if ($start === false || $end === false)
    {
        $start = strpos($code, '<iframe');
        $end = strpos($code, '</iframe>');
        if ($start === false || $end === false)
            return $url; // no frames left: this is the final page
    }
    $sub = substr($code, $start, $end - $start);
    $sub = substr($sub, strpos($sub, 'src="') + 5); // skip past src="
    $url = explode('"', $sub)[0]; // keep everything up to the closing quote
    return get_last_url($url); // recurse into the framed page
}
$url = get_last_url("http://thetpgmusic.tk/");
header('Location: ' . $url);
exit();
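The string slicing above assumes that src is written with double quotes and sits inside a frameset or iframe block. As a minimal sketch of a more tolerant lookup, the single step of finding the frame source could be done with one regex. The helper name find_frame_src is hypothetical, and the sketch covers only the extraction, not the fetching or the recursion:

```php
<?php
// Hypothetical helper: return the src of the first <frame> or <iframe>
// in a page, or null when the page contains no frames.
function find_frame_src(string $html): ?string
{
    // "<i?frame\b" matches <frame and <iframe but not <frameset;
    // the src value may use single or double quotes.
    $pattern = '~<i?frame\b[^>]*\bsrc\s*=\s*(["\'])(.*?)\1~is';
    if (preg_match($pattern, $html, $match) === 1) {
        return html_entity_decode($match[2]);
    }
    return null;
}

// Sample pages (assumed markup) for the two cases the function handles.
$framesetPage = '<frameset rows="100%"><frame src="http://thetpg.tk/b"></frameset>';
$iframePage   = "<body><iframe width='100%' src='http://example.com/inner'></iframe></body>";

echo find_frame_src($framesetPage), "\n";
echo find_frame_src($iframePage), "\n";
```

A caller could keep fetching pages and applying this function until it returns null, ideally with a limit on the number of hops so that two pages framing each other cannot loop forever.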

best way to store html in sql using php - display in htmlText in as3

I am having many issues getting HTML to display correctly in a dynamic TextField in AS3. I am having two main issues:
1. Either the <br> and <p> tags have no effect, leaving no line breaks, or the formatting is wrong, with no spaces between some words, etc.
2. Or there is an error loading, because I have it stored in the database wrong, or I am sending the variable to Flash wrong, so Flash can't read or display it correctly.
I understand there are a few levels of complexity to this, and the problem could be in one or all of these three pieces of code:
1. Storing it in the database in the best format.
2. Pulling it from the database and converting it to a string that can be sent to AS3.
3. Getting AS3 to display the HTML correctly.
I am not worried about CSS styles or any other HTML. I just want it to display with the correct formatting, with the right line breaks, etc.
One important note: I am pulling the HTML from an external page, so I don't have control over all the extra HTML code in it, such as styles.
In MySQL I am storing the HTML in a TEXT column as opposed to VARCHAR.
Here is the php code that is storing the html text.
<?
// I am only showing what I believe to be the relevant code.
//I am mainly concerned about the Content var.
$content = trim(strip_tags(addslashes($content)));
// Also tried the line below without the strip tags
$content = trim(addslashes($content));
mysqli_query($con,"INSERT INTO myDataBase (Content, Title) VALUES ('$content','$t')");
?>
Here is the PHP code that gets the content from the database.
<?
while($row = mysqli_fetch_array($result))
{
$Content[] = stripslashes($row['Content']);
}
/// sending to as3
echo "&Content1=$Content[1]";
?>
Here is the as3 code that is displaying the html. (Not working well)
public function popeContent()
{
var formatT = new TextFormat( );
formatT.bold = false;
formatT.color = 0x454544;
formatT.font = "TradeGothic";
formatT.size = 40;
formatT.leading = 10;
formatT.font = "Arial";
theContent.multiline = true;
theContent.wordWrap = true;
theContent.htmlText = Content1;
theContent.width = 1075;
theContent.y = 0;
theContent.x = 0;
theContent.autoSize = TextFieldAutoSize.LEFT;
theContent.setTextFormat(formatT);
addChild(theContent);
}
Any help to send me in the right direction would be appreciated. Thanks so much.

Adding a link from javascript to php script

I have the following script, which is not working. What do I do to add the link?
jno = "97856483";
dispTitle = "new book";
dispAuthor = "authorname";
document.getElementById('popups').innerHTML = '';
//Add link to add this book:
var url = encodeURIComponent(jno) + "&tt=" + encodeURIComponent(dispTitle) + "&at=" + encodeURIComponent(dispAuthor);
//document.writeln(url);
document.getElementById("addLink").innerHTML = "<a href='memaccountentry.php?isbn='+ url>Add book</a>" ; //This one just appends the word url.
//window.location.href = 'memaccountentry.php?isbn=' +jno +'&tt=' +dispTitle+'&at=' +dispAuthor; //I know this is working, but not a right way to do.
//I need to put a href link to go to the next page.
//ajax.open('GET', 'memaccountentry.php?isbn=' +jno +'&tt=' +dispTitle+'&at=' +dispAuthor', true);
You need to open and close your quotes properly.
Try this:
document.getElementById("addLink").innerHTML = "<a href='memaccountentry.php?isbn=" + url + "'>Add book</a>"; // the variable url is now concatenated into the href
It looks like you didn't format your string correctly.
If this is not what you wanted, then you have me completely confused.
document.getElementById("addLink").innerHTML = "<a href='memaccountentry.php?isbn=" + url + "'>Add book</a>";
