I created a program in PHP using cURL that can fetch the data of any site and display it in the browser. Another part of the program saves that data to a file using file handling; after saving, I want to find all the HTTP links within the body tag of the saved file. My code shows every site I fetch in the browser, but I cannot get the HTTP links, and some unwanted output also appears, as in this image, even though I don't want it to.
https://www.screencast.com/t/Nwaz93oU
PHP Code:
<!DOCTYPE html>
<html>
<?php
function get_all_links($uc_url){
    $html = file_get_contents($uc_url);
    $dom = new DOMDocument();
    @$dom->loadHTML($html);
    $xpath = new DOMXPath($dom);
    $hrefs = $xpath->evaluate("/html/body//a");
    for ($i = 0; $i < $hrefs->length; $i++) {
        $href = $hrefs->item($i);
        $url = $href->getAttribute('href');
        echo $url.'<br />';
    }
}
function get_site_data($uc_url){
    $get_uc = curl_init();
    curl_setopt($get_uc, CURLOPT_URL, $uc_url);
    curl_setopt($get_uc, CURLOPT_RETURNTRANSFER, true);
    $output = curl_exec($get_uc);
    curl_close($get_uc);
    $fp = fopen("mohit.txt", "w");
    fputs($fp, $output);
    fclose($fp);
    return $output;
}
?>
<body>
<div>
<?php
$site_content = get_site_data("http://www.ucertify.com");
echo $site_content;
?>
</div>
<div>
<?php
get_all_links("http://www.ucertify.com");
?>
</div>
</body>
</html>
In the get_all_links method, validate whether the $url variable is a valid URL, since some pages may have onclick handlers or javascript: links in the href. To validate a URL you can use a regex with PHP's preg_match. Also have a look at "What is a good regular expression to match a URL?" for a suitable pattern.
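For example, a minimal sketch of such a check (the helper name is my own, and PHP's built-in filter_var() is used alongside the regex as a stricter structural test):
function is_http_url($url) {
    // Skip javascript: handlers, mailto:, fragments, relative paths, etc.
    if (!preg_match('#^https?://#i', $url)) {
        return false;
    }
    // Stricter structural check using PHP's built-in validator
    return filter_var($url, FILTER_VALIDATE_URL) !== false;
}

// Inside the loop in get_all_links():
// if (is_http_url($url)) { echo $url . '<br />'; }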
Related
I have written the following code but it just returns empty data:
$code="CS225";
$url="https://cs.illinois.edu/courses/profile/{$code}";
echo $url;
$html = file_get_contents($url);
$pokemon_doc = new DOMDocument();
libxml_use_internal_errors(TRUE); //disable libxml errors
if(!empty($html)){ //if any html is actually returned
$pokemon_doc->loadHTML($html);
libxml_clear_errors();
$pokemon_xpath = new DOMXPath($pokemon_doc);
$pokemon_row = $pokemon_xpath->query("//div[#id='extCoursesDescription']");
if($pokemon_row->length > 0){
foreach($pokemon_row as $row){
echo $row->nodeValue . "<br/>";
}
}
}
The website that I am trying to scrape is https://cs.illinois.edu/courses/profile/CS225
The course content seems to be loaded into the page by a script when the page loads. But if you go through the source that is loaded, you get to ...
<script type='text/javascript' src='//ws.engr.illinois.edu/courses/item.asp?n=3&course=CS225'></script>
From this you can track through to the URL http://ws.engr.illinois.edu/courses/item.asp?n=3&course=CS225, and this gives you the actual content you're after. So rather than the original URL, use this new one and you should be able to extract the information from there.
Note, though, that this content is all wrapped in document.write() calls.
Update:
To remove the document.write() wrappers, a simple way is to just process the content...
$html = file_get_contents($url);
$html = str_replace(["document.write('","');"], "", $html);
$html = str_replace('\"', '"', $html);
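After that cleanup, the same DOMDocument/XPath approach from the question should work against the new URL. A rough sketch (assuming the endpoint still returns the document.write()-wrapped markup and that $html holds the cleaned-up string from above):
$doc = new DOMDocument();
libxml_use_internal_errors(true); // the fragment is not a complete HTML document
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
foreach ($xpath->query("//div[@id='extCoursesDescription']") as $row) {
    echo $row->nodeValue . "<br/>";
}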
I'm trying to proxy a request to a different domain from my own and make some changes to the code before outputting the HTML. Everything works well except that my CSS file doesn't seem to take effect.
<?php
if (isset($_GET['url']))
{
    $url  = $_GET['url'];
    $html = file_get_contents($url);
    $dom  = new DOMDocument();
    @$dom->loadHTML($html);
    $a = array();
    foreach ($dom->getElementsByTagName('link') as $href)
    {
        $a[] = $href->getAttribute('href');
    }
    echo str_replace($a[0], $url."/".$a[0], $html);
}
?>
The result is an HTML document, but without CSS styling. If I check the source code in my browser, the link to the CSS file looks okay and clicking it takes me to that CSS file, yet it's not taking effect in styling the output.
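One way to approach this (a sketch, not a drop-in fix): the stylesheet href is relative, so the browser resolves it against your proxy's domain rather than the original site. Rewriting every <link> href to an absolute URL before output, and echoing the modified DOM instead of the original string, should make the stylesheet load. The ?url= parameter below is the same one used in the snippet above:
$src  = $_GET['url'];
$base = rtrim($src, '/');
$html = file_get_contents($src);

$dom = new DOMDocument();
@$dom->loadHTML($html);

foreach ($dom->getElementsByTagName('link') as $link) {
    $href = $link->getAttribute('href');
    // Leave absolute and protocol-relative URLs untouched
    if ($href !== '' && !preg_match('#^(https?:)?//#i', $href)) {
        $link->setAttribute('href', $base . '/' . ltrim($href, '/'));
    }
}

echo $dom->saveHTML(); // output the modified document, not the original string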
An API returns me some HTML (only part of the body, not a full document), and I want to replace all of the images' src attributes with other paths.
I get and set the attributes, and if I echo them inside the foreach loop I see the old and the new values; but when I save with saveHTML and then dump the full HTML block returned from the API, I don't see the replaced paths.
$page = json_decode($page);
$page = (array) $page->rows;
$page = ($page[0]->_->content);
$dom = new \DOMDocument();
$dom->loadHTML($page);
$tag = $dom->getElementsByTagName('img');
foreach ($tag as $t)
{
    echo $t->getAttribute('src').'<br>'; // showing old src
    $t->setAttribute('src', 'bla');
    echo $t->getAttribute('src').'<br>'; // showing new src
}
$dom->saveHTML();
var_dump($page); // nothing is changed
My friend, this is not how it works.
Your edited HTML is in the return value of saveHTML(), so:
$editedHtml = $dom->saveHTML();
var_dump($editedHtml);
Now you should see your changed HTML.
The explanation is that $page is a completely separate variable that has nothing to do with the $dom object.
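Putting it all together, a minimal sketch of the corrected flow (keeping the placeholder 'bla' src from the question):
$dom = new \DOMDocument();
libxml_use_internal_errors(true); // the API returns a fragment, not a full document
$dom->loadHTML($page);
libxml_clear_errors();

foreach ($dom->getElementsByTagName('img') as $img) {
    $img->setAttribute('src', 'bla'); // your replacement path here
}

$editedHtml = $dom->saveHTML(); // saveHTML() returns the modified markup; $page itself never changes
var_dump($editedHtml);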
Cheers!
I'm attempting to make a script that echoes only the div that encloses the image on Google's homepage.
$url = "http://www.google.com/";
$page = file($url);
foreach ($page as $theArray) {
    echo $theArray;
}
The problem is this echos the whole page.
I want to echo only the part between the <div id="lga"> and the next closest </div>
Note: I tried using ifs, but it wasn't working, so I deleted them.
Thanks
Use the built-in DOM methods:
<?php
$page = file_get_contents("http://www.google.com");
$domd = new DOMDocument();
libxml_use_internal_errors(true);
$domd->loadHTML($page);
libxml_use_internal_errors(false);
$domx = new DOMXPath($domd);
$lga = $domx->query("//*[@id='lga']")->item(0);
$domd2 = new DOMDocument();
$domd2->appendChild($domd2->importNode($lga, true));
echo $domd2->saveHTML();
In order to do this you need to parse the DOM and then get the ID you are looking for. Check out a parsing library like this http://simplehtmldom.sourceforge.net/manual.htm
After feeding your html document into the parser you could call something like:
$html = str_get_html($page);
$element = $html->find('div[id=lga]', 0); // the second argument returns the first match instead of an array
echo $element->plaintext;
That, I think, would be your quickest and easiest solution.
I am "attempting" to scrape a web page that has the following structures within the page:
<p class="row">
<span>stuff here</span>
Descriptive Link Text
<div>Link Description Here</div>
</p>
I am scraping the webpage using curl:
<?php
$handle = curl_init();
curl_setopt($handle, CURLOPT_URL, "http://www.host.tld/");
curl_setopt($handle, CURLOPT_RETURNTRANSFER, true);
$html = curl_exec($handle);
curl_close($handle);
?>
I have done some research and found that I should not use a regex to parse the HTML returned by cURL, and that I should use PHP DOM instead. This is how I have done it:
$newDom = new DOMDocument();
$newDom->preserveWhiteSpace = false;
$newDom->loadHTML($html);
$sections = $newDom->getElementsByTagName('p');
$nodeNo = $sections->length;
for ($i = 0; $i < $nodeNo; $i++) {
    $printString = $sections->item($i)->nodeValue;
    echo $printString . "<br>";
}
Now, I am not pretending that I completely understand this, but I get the gist, and I do get the sections I want. The only issue is that what I get is only the text of the HTML page, as if I had copied it out of my browser window. What I want is the actual HTML, because I want to extract the links and use them too, like so:
for ($i = 0; $i < $nodeNo; $i++) {
    $printString = $sections->item($i)->nodeValue;
    echo "LINK " . $printString . "<br>";
}
As you can see, I cannot get the link because I am only getting the text of the webpage and not the source, like I want. I know the "curl_exec" is pulling the HTML because I have tried just that, so I believe that the DOM is somehow stripping the HTML that I want.
According to comments on the PHP manual on DOM, you should use the following inside your loop:
$tmp_dom = new DOMDocument();
$tmp_dom->appendChild($tmp_dom->importNode($sections->item($i), true));
$innerHTML = trim($tmp_dom->saveHTML());
This will set $innerHTML to be the HTML content of the node.
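In context, that looks something like this (a sketch built on the question's existing $sections loop):
for ($i = 0; $i < $nodeNo; $i++) {
    $tmp_dom = new DOMDocument();
    $tmp_dom->appendChild($tmp_dom->importNode($sections->item($i), true));
    $innerHTML = trim($tmp_dom->saveHTML());
    echo $innerHTML . "<br>"; // the full markup of each <p class="row">, tags included
}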
But I think what you really want is to get the 'a' nodes under the 'p' node, so do this:
$sections = $newDom->getElementsByTagName('p');
$nodeNo = $sections->length;
for ($i = 0; $i < $nodeNo; $i++) {
    $sec = $sections->item($i);
    $links = $sec->getElementsByTagName('a');
    $linkNo = $links->length;
    for ($j = 0; $j < $linkNo; $j++) {
        $printString = $links->item($j)->nodeValue;
        echo $printString . "<br>";
    }
}
This will just print the body of each link.
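Since the goal is to reuse the links and not just their text, you will probably also want the href attribute; the same inner loop can pull it out with getAttribute, for example:
for ($j = 0; $j < $linkNo; $j++) {
    $a    = $links->item($j);
    $href = $a->getAttribute('href');
    echo '<a href="' . htmlspecialchars($href) . '">' . $a->nodeValue . '</a><br>';
}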
You can pass a node to DOMDocument::saveXML(). Try this:
$printString = $newDom->saveXML($sections->item($i));
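In the question's loop that would look like this (note that saveXML() with a node argument serializes just that node, markup included):
for ($i = 0; $i < $nodeNo; $i++) {
    $printString = $newDom->saveXML($sections->item($i));
    echo $printString . "<br>";
}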
You might want to take a look at phpQuery for doing server-side HTML parsing things. A basic example:
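A rough sketch of what that might look like with phpQuery (assuming the library is installed and using its documented pq() / newDocumentHTML() helpers; treat the exact include path as a placeholder):
require_once 'phpQuery/phpQuery.php';

phpQuery::newDocumentHTML($html); // $html fetched with cURL as above
foreach (pq('p.row a') as $a) {
    echo pq($a)->attr('href') . ' - ' . pq($a)->text() . '<br>';
}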