I'm looking to turn
Some page to
Some page
using PHP. I'll have the HTML code of a random website so it's not as simple as using str_replace()
I've tried the approach from "Replacing anchor href value with regex", but that seems to erase my entire page and I get a blank, white screen. Can anyone offer any help?
My code:
$html = file_get_contents(htmlentities($_GET['q'])); // Takes contents of website entered by user
$arr = array(); // Defines array
$html2 = ""; // Defines variable to write to later
$dom = new DOMDocument();
$dom->loadHTML($html); // Loads the HTML code displayed earlier
$domcss = $dom->getElementsByTagName('link');
foreach($domcss as $links) {
if( strtolower($links->getAttribute('rel')) == "stylesheet" ) {
$x = $links->getAttribute('href');
$html2 .= '<link rel="stylesheet" type="text/css" href="'.htmlentities($_GET['q']) . "/" . $x.'">';
}
} // This replaces all stylesheets from "./style.css", to "http://example.com/style.css"
echo $html2 . $html // Echos the entire webpage, with stylesheet links edited
To manipulate this with DOM, find the <a> tags and then if there is a href attribute, add the prefix in. The end of this code just echos out the resultant HTML...
$dom = new DOMDocument();
$dom->loadHTML($html); // Loads the HTML code displayed earlier
$aTags = $dom->getElementsByTagName('a');
$prefix = "http://example.com?q=";
foreach($aTags as $links) {
$href = $links->getAttribute('href');
if( !empty($href)) {
$links->setAttribute("href", $prefix.$href);
}
}
echo $dom->saveHTML();
$prefix contains the bit you want to add to the URL.
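One thing to watch with the prefix approach: if the original href has its own query string (say /about?x=1&y=2), appending it raw to http://example.com?q= lets the &y=2 part be read as a separate parameter of the proxy script. Below is a minimal sketch that URL-encodes the href first. The helper name prefixLinks() and the sample markup are my own assumptions for illustration, not something from the question.

```php
<?php
// Sketch: prefix every <a href> in $html with $prefix.
// prefixLinks() is a hypothetical helper name, not part of the original answer.
function prefixLinks(string $html, string $prefix): string
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true);   // real-world HTML is rarely valid
    $dom->loadHTML($html);
    libxml_clear_errors();

    foreach ($dom->getElementsByTagName('a') as $link) {
        $href = $link->getAttribute('href');
        if ($href !== '') {
            // urlencode() keeps "?" and "&" in the original href from being
            // read as part of the proxy's own query string.
            $link->setAttribute('href', $prefix . urlencode($href));
        }
    }

    return $dom->saveHTML();
}

// Made-up sample input; the real page would come from file_get_contents().
$result = prefixLinks(
    '<p><a href="/about?x=1&amp;y=2">About</a></p>',
    'http://example.com?q='
);
echo $result;
```

On the receiving side, PHP has already decoded $_GET['q'] once, so the script gets the original href back. Relative hrefs would still need resolving against the site's base URL, which this sketch does not attempt.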
I have written the following code, but it just returns empty data:
$code="CS225";
$url="https://cs.illinois.edu/courses/profile/{$code}";
echo $url;
$html = file_get_contents($url);
$pokemon_doc = new DOMDocument();
libxml_use_internal_errors(TRUE); //disable libxml errors
if(!empty($html)){ //if any html is actually returned
$pokemon_doc->loadHTML($html);
libxml_clear_errors();
$pokemon_xpath = new DOMXPath($pokemon_doc);
$pokemon_row = $pokemon_xpath->query("//div[@id='extCoursesDescription']");
if($pokemon_row->length > 0){
foreach($pokemon_row as $row){
echo $row->nodeValue . "<br/>";
}
}
}
The website that I am trying to scrape is: https://cs.illinois.edu/courses/profile/CS225
The course content does not seem to be part of the initial HTML; the page pulls it in with a script when it loads. If you go through the page source, you get to ...
<script type='text/javascript' src='//ws.engr.illinois.edu/courses/item.asp?n=3&course=CS225'></script>
From this you can track through to the URL http://ws.engr.illinois.edu/courses/item.asp?n=3&course=CS225, and this gives you the actual content you're after. So rather than the original URL, use this new one and you should be able to extract the information from there.
Although this content is all wrapped in document.write()'s.
Update:
To remove the document.write() bits, a simple way is to just process the content...
$html = file_get_contents($url);
$html = str_replace(["document.write('","');"], "", $html);
$html = str_replace('\"', '"', $html);
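To show the whole chain in one place, here is a small sketch that unwraps a document.write() call and then runs the XPath query from the question against the result. The sample string in $js is invented for illustration (I'm assuming the endpoint output looks roughly like this), and the helper names unwrapDocumentWrite() and courseDescriptions() are hypothetical.

```php
<?php
// Hypothetical sample of the JavaScript the endpoint might return.
// In a single-quoted PHP string, \" stays as a literal backslash + quote.
$js = 'document.write(\'<div id=\"extCoursesDescription\">Data Structures</div>\');';

// Strip the document.write('...'); wrapper and unescape the quotes.
function unwrapDocumentWrite(string $js): string
{
    $html = str_replace(["document.write('", "');"], '', $js);
    return str_replace('\"', '"', $html);
}

// Return the text of every div with id="extCoursesDescription".
function courseDescriptions(string $html): array
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $found = [];
    foreach ($xpath->query("//div[@id='extCoursesDescription']") as $div) {
        $found[] = trim($div->nodeValue);
    }
    return $found;
}

$descriptions = courseDescriptions(unwrapDocumentWrite($js));
print_r($descriptions);
```

The string replacement is deliberately crude: it would also strip a literal '); that happened to appear inside the content, so treat it as a starting point rather than a robust parser.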
An API returns a couple of chunks of HTML (only part of the body, not a full HTML document) and I want to replace the src of every image with another value.
I get and set the attributes, and if I echo them in the foreach loop I see the old and new values. But when I call saveHTML() and then dump the HTML block that the API returned, I don't see the replaced paths.
$page = json_decode($page);
$page = (array) $page->rows;
$page = ($page[0]->_->content);
$dom = new \DOMDocument();
$dom->loadHTML($page);
$tag = $dom->getElementsByTagName('img');
foreach($tag as $t)
{
echo $t->getAttribute('src').'<br>'; //showing old src
$t->setAttribute('src', 'bla');
echo $t->getAttribute('src').'<br>'; //showing new src
}
$dom->saveHTML();
var_dump($page); //nothing is changed
My friend, this is not how it works.
You should have your edited HTML in the result of saveHTML() so:
$editedHtml = $dom->saveHTML();
var_dump($editedHtml);
Now you should see your changed HTML.
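The explanation is that $page is just the original string; it has nothing to do with the $dom object. loadHTML() parses a copy of it, and saveHTML() returns the modified markup as a new string instead of writing it back into $page.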
Cheers!
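Since the API only returns part of a body, note that loadHTML() wraps the fragment in html and body elements, so a plain saveHTML() gives back a whole document. Here is a rough sketch that serializes only the children of body to get a fragment back. replaceImageSources() is a hypothetical helper name and the sample markup is made up.

```php
<?php
// Sketch: replace the src of every <img> in an HTML fragment and return
// the edited fragment. replaceImageSources() is a hypothetical helper.
function replaceImageSources(string $fragment, string $newSrc): string
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true);
    $dom->loadHTML($fragment);
    libxml_clear_errors();

    foreach ($dom->getElementsByTagName('img') as $img) {
        $img->setAttribute('src', $newSrc);
    }

    // loadHTML() wraps a fragment in <html><body>; serialize only the
    // children of <body> so the caller gets a fragment back.
    $body = $dom->getElementsByTagName('body')->item(0);
    $out = '';
    if ($body !== null) {
        foreach ($body->childNodes as $child) {
            $out .= $dom->saveHTML($child);
        }
    }
    return $out;
}

$page = '<p>Hello</p><img src="old.png">';   // made-up sample fragment
$edited = replaceImageSources($page, 'bla');

echo $edited;   // the edited markup
echo $page;     // still the original string
```

The key point is the same as above: the edited markup lives in the return value, and $page itself is never touched.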
I know there are similar questions, but while studying PHP I ran into this problem and I want to understand why it occurs.
<?php
$url = 'http://aice.anie.it/quotazione-lme-rame/';
echo "hello!\r\n";
$html = new DOMDocument();
@$html->loadHTML($url);
$xpath = new DOMXPath($html);
$nodelist = $xpath->query(".//*[@id='table33']/tbody/tr[2]/td[3]/b");
foreach ($nodelist as $n) {
echo $n->nodeValue . "\n";
}
?>
This prints just "hello!". I want to print the value extracted with the XPath, but the last echo doesn't do anything.
You have some errors in your code:
You try to get the table from the URL http://aice.anie.it/quotazione-lme-rame/, but it's actually in an iframe located at http://www.aiceweb.it/it/frame_rame.asp, so fetch the iframe URL directly.
You use the function loadHTML(), which parses a string of HTML. Passing it a URL makes it parse the URL text itself as markup. What you need is the loadHTMLFile() function, which takes the path or URL of an HTML document as a parameter (see http://www.php.net/manual/fr/domdocument.loadhtmlfile.php).
You assume there is a tbody element on the page, but there isn't one. So remove that from your query.
Working code :
$url = 'http://www.aiceweb.it/it/frame_rame.asp';
echo "hello!\r\n";
$html = new DOMDocument();
@$html->loadHTMLFile($url);
$xpath = new DOMXPath($html);
$nodelist = $xpath->query(".//*[@id='table33']/tr[2]/td[3]/b");
foreach ($nodelist as $n) {
echo $n->nodeValue . "\n";
}
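The tbody point is easy to check offline. Browser dev tools show a tbody because the browser inserts one, but as far as I know DOMDocument's parser does not, so an XPath copied from the browser can silently match nothing. The sketch below uses an invented table (the real page's markup may differ) to compare both queries.

```php
<?php
// Made-up table with no <tbody>, standing in for the real page.
$markup = '<table id="table33">'
        . '<tr><td>Date</td><td>Unit</td><td>Price</td></tr>'
        . '<tr><td>today</td><td>EUR</td><td><b>42</b></td></tr>'
        . '</table>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($markup);        // loadHTML(): the argument is markup
// $doc->loadHTMLFile($url);    // loadHTMLFile(): the argument is a path or URL
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// Path as a browser would report it, with tbody.
$withTbody = $xpath->query("//*[@id='table33']/tbody/tr[2]/td[3]/b");

// Path matching the markup as written, without tbody.
$withoutTbody = $xpath->query("//*[@id='table33']/tr[2]/td[3]/b");

echo "with tbody: ", $withTbody->length, " match(es)\n";
echo "without tbody: ", $withoutTbody->length, " match(es)\n";
foreach ($withoutTbody as $n) {
    echo $n->nodeValue . "\n";
}
```

If you want a query that works either way, //*[@id='table33']//tr[2]/td[3]/b sidesteps the question, at the cost of also matching rows in nested tables.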
I have the following scenario and I've already spent hours trying to handle it: I'm developing a WordPress theme (hence PHP) and I want to check whether the content of a post (which is HTML) contains a tag with a certain id/class. If so, I want to extract it from the content and place it somewhere else.
Example: Let's say the text content of the Wordpress post is
<?php
/* $content actually comes from WP function get_the_content() */
$content = '<p>some text and so forth that I don\'t care about...</p> <div class="the-wanted-element"><p>I WANT THIS DIV!!!</p></div>';
?>
So how can I extract that div with the class (could also live with giving it an ID), output it (with tags and all that) in one place of the template, and output the rest (without the extracted tag, of course) in another place of the template?
I've already tried with the DOMDocument class, p.i.t.a. to me, maybe I'm too stupid.
Try:
$content = '<p>some text and so forth that I don\'t care about...</p> <div class="the-wanted-element"><p>I WANT THIS DIV!!!</p></div>';
$dom = new DomDocument;
$dom->loadHtml($content);
$xpath = new DomXpath($dom);
$contents = '';
foreach ($xpath->query('//div[@class="the-wanted-element"]') as $node) {
$contents = $dom->saveXml($node);
break;
}
echo $contents;
How to get the remaining xml/html:
$content = '<p>some text and so forth that I don\'t care about...</p> <div class="the-wanted-element"><p>I WANT THIS DIV!!!</p></div>';
$dom = new DomDocument;
$dom->loadHtml($content);
$xpath = new DomXpath($dom);
foreach ($xpath->query('//div[@class="the-wanted-element"]') as $node) {
$node->parentNode->removeChild($node);
break;
}
$contents = '';
foreach ($xpath->query('//body/*') as $node) {
$contents .= $dom->saveXml($node);
}
echo $contents;
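A caveat on the query above: @class="the-wanted-element" only matches when that is the whole class attribute, so a div with class="box the-wanted-element" would be missed. Here is a sketch that does both steps in one pass and matches the class as one token among several. splitContent() is a hypothetical helper name, and I use saveHTML($node) instead of saveXml($node) so the output is HTML-style markup.

```php
<?php
// Sketch: pull the first div carrying $class out of $content and return
// [extracted markup, remaining markup]. splitContent() is a hypothetical
// helper. $class is assumed to be a plain class name without quotes.
function splitContent(string $content, string $class): array
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true);
    $dom->loadHTML($content);
    libxml_clear_errors();
    $xpath = new DOMXPath($dom);

    // Match $class as a whole token inside a space-separated class list.
    $query = sprintf(
        '//div[contains(concat(" ", normalize-space(@class), " "), " %s ")]',
        $class
    );

    $extracted = '';
    $node = $xpath->query($query)->item(0);
    if ($node !== null) {
        $extracted = $dom->saveHTML($node);
        $node->parentNode->removeChild($node);
    }

    // Everything still left inside <body>, text nodes included.
    $rest = '';
    foreach ($xpath->query('//body/node()') as $child) {
        $rest .= $dom->saveHTML($child);
    }

    return [$extracted, $rest];
}

[$wanted, $rest] = splitContent(
    '<p>some text</p> <div class="box the-wanted-element"><p>I WANT THIS DIV!!!</p></div>',
    'the-wanted-element'
);
echo $wanted, "\n";
echo $rest, "\n";
```

In a theme you would pass in the post content, echo $wanted in one place of the template and $rest in another.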
I'm having trouble trying to write an if statement for the DOM that will check if $html is blank. However, whenever the HTML page does end up blank, it just removes everything that would be below DOM (including what I had to check if it was blank).
$html = file_get_contents("http://example.com/");
$dom = new DOMDocument;
@$dom->loadHTML($html);
$links = $dom->getElementById('dividhere')->getElementsByTagName('img');
foreach ($links as $link)
{
echo $link->getAttribute('src');
}
All this does is grab an image URL in the specified div, which works perfectly until the page is a blank HTML page.
I've tried using SimpleHTMLDOM, which didn't work either (it didn't even fetch the image on working pages). Did I happen to miss something with this one or am I just missing something in both?
include_once('simple_html_dom.php')
$html = file_get_html("http://example.com/");
foreach($html->find('div[id="dividhere"]') as $div)
{
if(empty($div->src))
{
continue;
}
echo $div->src;
}
Get rid of the $html variable and just load the file into $dom by doing @$dom->loadHTMLFile("http://example.com/");, then have an if statement below that to check if $dom is empty.
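For what it's worth, the output probably stops because getElementById() returns null when the div is missing, and calling getElementsByTagName() on null is a fatal error that ends the script. A defensive sketch, assuming you already have the markup as a string: imageSources() is a hypothetical helper name, and the sample strings stand in for the fetched page.

```php
<?php
// Sketch: collect the src of every <img> inside the element with the given
// id, returning an empty array when the page is blank or the element is
// missing. imageSources() is a hypothetical helper name.
function imageSources(string $html, string $id): array
{
    if (trim($html) === '') {
        return [];                      // blank page: nothing to parse
    }

    $dom = new DOMDocument();
    libxml_use_internal_errors(true);   // collect parse warnings quietly
    $dom->loadHTML($html);
    libxml_clear_errors();

    $container = $dom->getElementById($id);
    if ($container === null) {
        return [];                      // the div is not on this page
    }

    $sources = [];
    foreach ($container->getElementsByTagName('img') as $img) {
        $sources[] = $img->getAttribute('src');
    }
    return $sources;
}

// Made-up inputs standing in for the fetched page.
$blank  = imageSources('', 'dividhere');
$absent = imageSources('<div id="other"></div>', 'dividhere');
$found  = imageSources('<div id="dividhere"><img src="a.png"></div>', 'dividhere');

print_r($found);
```

With a live page you could call it as imageSources((string) @file_get_contents("http://example.com/"), 'dividhere'); the cast turns a failed fetch (false) into an empty string, which the first check handles.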