I want to extract the body content from a full HTML document.
I use the script below.
<?php
function getbody($filename) {
$file = file_get_contents($filename);
$bodystartpattern = ".*<body>";
$bodyendpattern = "</body>.*";
$noheader = eregi_replace($bodystartpattern, "", $file);
$noheader = eregi_replace($bodyendpattern, "", $noheader);
return $noheader;
}
$bodycontent = getbody($_GET['url']);
?>
But in some cases the <body> tag doesn't appear literally; it may be <body style="margin:0;"> or something similar. How can I write a regular expression for $bodystartpattern that matches the opening body tag up to its closing ">"?
@1nflktd I have tried the code below.
<?php
header('Content-Type:text/html; charset=UTF-8');
function getbody($filename) {
$file = file_get_contents($filename);
$dom = new DOMDocument;
$dom->loadHTML($file);
$bodies = $dom->getElementsByTagName('body');
assert($bodies->length === 1);
$body = $bodies->item(0);
for ($i = 0; $i < $body->children->length; $i++) {
$body->remove($body->children->item($i));
}
$stringbody = $dom->saveHTML($body);
return $stringbody;
}
$url = "http://www.barcelona.com/";
$bodycontent = getbody($url);
?>
<html>
<head></head>
<body>
<?php
echo "BODY ripped from: ".$url."<br/>";
echo "<textarea rows='40' cols='200' >".$bodycontent."</textarea>";
?>
</body>
</html>
Why don't you use an HTML parser?
function getbody($filename) {
    $file = file_get_contents($filename);
    $dom = new DOMDocument();
    // Collect parser warnings about malformed HTML instead of printing them.
    libxml_use_internal_errors(true);
    $dom->loadHTML($file);
    libxml_use_internal_errors(false);
    $bodies = $dom->getElementsByTagName('body');
    assert($bodies->length === 1);
    $body = $bodies->item(0);
    // Serialize each child of <body> so the result is the body's inner HTML.
    // (DOMNode exposes childNodes; it has no "children" property.)
    $stringbody = '';
    foreach ($body->childNodes as $child) {
        $stringbody .= $dom->saveHTML($child);
    }
    return $stringbody;
}
DOM loadHTML reference
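If a regular expression is still preferred, as the question asks, a minimal sketch is to let the pattern accept any attributes between "<body" and ">". The function name extractBodyWithRegex is hypothetical, and preg_match is used here instead of eregi_replace, which was removed in PHP 7. Regular expressions remain fragile on malformed HTML, so treat this as an illustration rather than a replacement for the parser approach.

```php
<?php
// Hypothetical helper: returns the markup between the opening and closing
// body tags, or null when no body element is found.
function extractBodyWithRegex(string $html): ?string
{
    // <body\b[^>]*> matches "<body>" as well as "<body style=...>".
    // Flags: i = case-insensitive, s = "." also matches newlines.
    if (preg_match('/<body\b[^>]*>(.*)<\/body\s*>/is', $html, $matches)) {
        return $matches[1];
    }
    return null;
}

$sample = '<html><head></head><body style="margin:0;"><p>Hi</p></body></html>';
echo extractBodyWithRegex($sample);
```

The character class [^>]* is what answers the original question: it consumes everything up to the closing ">" of the opening tag.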
Related
I am trying to grab a URL with DOMDocument but am stuck at getNamedItem.
How can I solve this problem? What am I missing here? Any ideas are welcome!
$url = 'https://www.31sumai.com/search/area/kansai/result/?area=16,17,18';
$html = file_get_contents($url);
libxml_use_internal_errors(true);
$DOMParser = new \DOMDocument();
$DOMParser->loadHTML($html);
$mainlink = null;
$allPTags = $DOMParser->getElementsByTagName('p');
foreach ($allPTags as $ptag) {
$class = $ptag->attributes->getNamedItem("class");
if ($class && $class->nodeValue == 'c-name') {
$main = $ptag->attributes->getNamedItem("href");
if ($main) {
$mainlink = $main->nodeValue;
}
}
}
var_dump($mainlink);
It is returning null, but I have already checked the website and there is a URL in that tag.
$url = 'https://lions-mansion.jp/area/kansai/';
$html = file_get_contents($url);
libxml_use_internal_errors(true);
$DOMParser = new \DOMDocument();
$DOMParser->loadHTML($html);
$mainlink = null;
$allPTags = $DOMParser->getElementsByTagName('p');
foreach ($allPTags as $ptag) {
$class = $ptag->attributes->getNamedItem("class");
if ($class && $class->nodeValue == 'areapageDetailList_item_btn_hp') {
$links = $ptag->getElementsByTagName('a');
foreach ($links as $link) {
$hrefAttr = $link->attributes->getNamedItem("href");
if ($hrefAttr) {
$mainlink = $hrefAttr->nodeValue;
}
}
}
}
echo $mainlink;
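A likely reason the first snippet returns null is that href is an attribute of an <a> element, not of the <p> that carries the class, so the link has to be looked up on an anchor nested inside the paragraph. The same lookup can be sketched with a single XPath query. The inline markup and the example.com URL below are made up for illustration and are not taken from either site.

```php
<?php
// Hypothetical markup: a <p> with the class of interest wrapping an <a>.
$html = '<div><p class="c-name"><a href="https://example.com/detail/1">Name</a></p></div>';

libxml_use_internal_errors(true);   // collect warnings about imperfect HTML
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// Select every <a> with an href that sits inside a <p class="c-name">.
$hrefs = [];
foreach ($xpath->query("//p[@class='c-name']//a[@href]") as $anchor) {
    $hrefs[] = $anchor->getAttribute('href');
}

print_r($hrefs);
```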
In $var2 I want the HTML content. The paragraph contains some <br> tags, which are not included in the output, and I want them to be included.
<?php
$url = "http://sms.hindijokes.co";
$html = file_get_contents($url);
$doc = new DOMDocument;
$doc->strictErrorChecking = false;
$doc->recover = true;
@$doc->loadHTML("<html><body>" . $html . "</body></html>");
$xpath = new DOMXPath($doc);
$query1 = "//h2[@class='entry-title']/a";
$query2 = "//div[@class='entry-content'][1]/p";
$entries1 = $xpath->query($query1);
$entries2 = $xpath->query($query2);
$var1 = $entries1->item(0)->textContent;
$var2 = $entries2->item(0)->textContent;
echo "$var1";
echo "<br>";
$f = $entries2->length;
for($i = 0; $i < $f; $i++){
echo $entries2->item($i)->textContent."\n";
}
?>
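One possible way to keep the <br> tags: textContent returns only the text of a node, so any markup inside it is dropped. Serializing the node's children with DOMDocument::saveHTML() keeps the markup. The sketch below uses a made-up inline HTML string rather than the live page, so the class name and content are assumptions.

```php
<?php
// Hypothetical stand-in for the downloaded page.
$html = '<div class="entry-content"><p>Line one<br>Line two</p></div>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$paragraph = $xpath->query("//div[@class='entry-content'][1]/p")->item(0);

// Build the inner HTML by serializing each child node (text and elements).
$inner = '';
if ($paragraph !== null) {
    foreach ($paragraph->childNodes as $child) {
        $inner .= $doc->saveHTML($child);
    }
}

echo $inner;
```

Passing the paragraph itself to saveHTML() would also work if the surrounding <p> tag should be part of the result.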
I am trying to extract the news headlines and the link (href) of each headline using the code below, but the link extraction is not working. It only gets the headline. Please help me find out what's wrong with the code.
Link to page from which I want to get the headline and link from:
http://web.tmxmoney.com/news.php?qm_symbol=BCM
<?php
$data = file_get_contents('http://web.tmxmoney.com/news.php?qm_symbol=BCM');
$dom = new domDocument;
@$dom->loadHTML($data);
$dom->preserveWhiteSpace = true;
$xpath = new DOMXPath($dom);
$rows = $xpath->query('//div');
foreach ($rows as $row) {
$cols = $row->getElementsByTagName('span');
$newstitle = $cols->item(0)->nodeValue;
$link = $cols->item(0)->nodeType === HTML_ELEMENT_NODE ? $cols->item(0)->getElementsByTagName('a')->item(0)->getAttribute('href') : '';
echo $newstitle . '<br>';
echo $link . '<br><br>';
}
?>
Thanks in advance for your help!
Try to do this:
<?php
$data= file_get_contents('http://web.tmxmoney.com/news.php?qm_symbol=BCM');
$dom = new DOMDocument();
@$dom->loadHTML($data);
$xpath = new DOMXPath($dom);
$hrefs= $xpath->query('/html/body//a');
for($i = 0; $i < $hrefs->length; $i++){
$href = $hrefs->item($i);
$url = $href->getAttribute('href');
$url = filter_var($url, FILTER_SANITIZE_URL);
if(!filter_var($url, FILTER_VALIDATE_URL) === false){
echo ''.$url.'<br />';
}
}
?>
I have found the solution. Here it goes:
<?php
$data = file_get_contents('http://web.tmxmoney.com/news.php?qm_symbol=BCM');
$dom = new domDocument;
@$dom->loadHTML($data);
$dom->preserveWhiteSpace = true;
$xpath = new DOMXPath($dom);
$rows = $xpath->query('//div');
foreach ($rows as $row) {
$cols1 = $row->getElementsByTagName('a');
$link = $cols1->item(0)->nodeType === XML_ELEMENT_NODE ? $cols1->item(0)->getAttribute('href') : '';
$cols2 = $row->getElementsByTagName('span');
$title = $cols2->item(0)->nodeValue;
$source = $cols2->item(1)->nodeValue;
echo $title . '<br>';
echo $source . '<br>';
echo $link . '<br><br>';
}
?>
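The loop above assumes that every <div> contains an <a> and two <span> elements; item(0) returns null when one is missing, and calling a method on null is a fatal error. A minimal sketch with null checks follows. The inline markup is invented for illustration and does not reflect the real structure of the news page.

```php
<?php
// Hypothetical markup: one row with a headline link, one row without.
$html = '<div><span><a href="/news/1">Headline one</a></span><span>Source A</span></div>'
      . '<div><span>No link here</span></div>';

$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();

$items = [];
foreach ($dom->getElementsByTagName('div') as $row) {
    $anchor = $row->getElementsByTagName('a')->item(0);
    $span   = $row->getElementsByTagName('span')->item(0);

    // Skip rows that lack either the link or the headline span.
    if ($anchor === null || $span === null) {
        continue;
    }

    $items[] = [
        'title' => trim($span->nodeValue),
        'link'  => $anchor->getAttribute('href'),
    ];
}

print_r($items);
```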
Is it possible to convert just a selection of an HTML document with multiple tables to JSON?
I have this table:
<div class="mon_title">2.11.2015 Montag</div>
<table class="info" >
<tr class="info"><th class="info" align="center" colspan="2">Nachrichten zum Tag</th></tr>
<tr class='info'><td class='info' colspan="2"><b><u></u> </b>
...
</table>
<p>
<table class="mon_list" >
...
</table>
And this PHP code to convert it into JSON:
function save_table_to_json ( $in_file, $out_file ) {
$html = file_get_contents( $in_file );
file_put_contents( $out_file, convert_table_to_json( $html ) );
}
function convert_table_to_json ( $html ) {
$document = new DOMDocument();
$document->loadHTML( $html );
$obj = [];
$jsonObj = [];
$th = $document->getElementsByTagName('th');
$td = $document->getElementsByTagName('td');
$thNum = $th->length;
$arrLength = $td->length;
$rowIx = 0;
for ( $i = 0 ; $i < $arrLength ; $i++){
$head = $th->item( $i%$thNum )->textContent;
$content = $td->item( $i )->textContent;
$obj[ $head ] = $content;
if( ($i+1) % $thNum === 0){
$jsonObj[++$rowIx] = $obj;
$obj = [];
}
}
return json_encode( $jsonObj );
}
save_table_to_json( 'heute_S.htm', 'heute_S.json' );
It takes both the table with class=info and the table with class=mon_list and converts them to JSON.
Is there any way to make it take only the table with class=mon_list?
You can use XPath to search for the class, and then create a new DOM document that only contains the results of the XPath query. This is untested, but should get you on the right track.
It's also worth mentioning that you can use foreach to iterate over the node list.
$document = new DOMDocument();
$document->loadHTML( $html );
$xpath = new DomXPath($document);
$tables = $xpath->query("//*[contains(@class, 'mon_list')]");
$tableDom = new DomDocument();
$tableDom->appendChild($tableDom->importNode($tables->item(0), true));
$obj = [];
$jsonObj = [];
$th = $tableDom->getElementsByTagName('th');
$td = $tableDom->getElementsByTagName('td');
$thNum = $th->length;
$arrLength = $td->length;
$rowIx = 0;
for ( $i = 0 ; $i < $arrLength ; $i++){
$head = $th->item( $i%$thNum )->textContent;
$content = $td->item( $i )->textContent;
$obj[ $head ] = $content;
if( ($i+1) % $thNum === 0){
$jsonObj[++$rowIx] = $obj;
$obj = [];
}
}
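Putting the XPath idea into a complete function might look like the sketch below. The function name convertTableToJson, its $class parameter and the sample column names are hypothetical. Note that contains(@class, 'mon_list') also matches class names such as "mon_list_old"; the sketch uses the common concat/normalize-space idiom to match a whole class token instead. It also assumes the first header row lists one th per column.

```php
<?php
// Hypothetical helper: converts the first table carrying $class to JSON.
// $class is interpolated into the XPath, so it is assumed to be trusted.
function convertTableToJson(string $html, string $class): string
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $query = "//table[contains(concat(' ', normalize-space(@class), ' '), ' $class ')]";
    $table = $xpath->query($query)->item(0);
    if ($table === null) {
        return '[]';
    }

    // Column names come from the <th> cells of the selected table only.
    $headers = [];
    foreach ($xpath->query('.//th', $table) as $th) {
        $headers[] = trim($th->textContent);
    }

    // One associative array per data row (rows that contain <td> cells).
    $rows = [];
    foreach ($xpath->query('.//tr[td]', $table) as $tr) {
        $row = [];
        $i = 0;
        foreach ($xpath->query('./td', $tr) as $td) {
            $row[$headers[$i] ?? "col$i"] = trim($td->textContent);
            $i++;
        }
        $rows[] = $row;
    }

    return json_encode($rows);
}

$sample = '<table class="info"><tr><th>X</th></tr><tr><td>skip</td></tr></table>'
        . '<table class="mon_list"><tr><th>Klasse</th><th>Fach</th></tr>'
        . '<tr><td>5a</td><td>Mathe</td></tr></table>';
echo convertTableToJson($sample, 'mon_list');
```

Querying relative to the selected table (the second argument of query()) avoids the need to import the node into a second document.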
Another, unrelated approach is to use getAttribute() to check the class name. Someone has written a function for this in a different answer:
function getElementsByClass(&$parentNode, $tagName, $className) {
$nodes=array();
$childNodeList = $parentNode->getElementsByTagName($tagName);
for ($i = 0; $i < $childNodeList->length; $i++) {
$temp = $childNodeList->item($i);
if (stripos($temp->getAttribute('class'), $className) !== false) {
$nodes[]=$temp;
}
}
return $nodes;
}
How can I get all the attributes of an element? In my example below I can only get one at a time; I want to pull out all of the anchor tag's attributes.
$dom = new DOMDocument();
@$dom->loadHTML(file_get_contents('http://www.example.com'));
$a = $dom->getElementsByTagName("a")->item(0);
echo $a->getAttribute('href');
Thanks!
$length = $a->attributes->length;
$attrs = array();
for ($i = 0; $i < $length; ++$i) {
$name = $a->attributes->item($i)->name;
$value = $a->getAttribute($name);
$attrs[$name] = $value;
}
print_r($attrs);
"Inspired" by Simon's answer. I think you can cut out the getAttribute call, so here's a solution without it:
$attrs = array();
for ($i = 0; $i < $a->attributes->length; ++$i) {
$node = $a->attributes->item($i);
$attrs[$node->nodeName] = $node->nodeValue;
}
var_dump($attrs);
$a = $dom->getElementsByTagName("a");
foreach($a as $element)
{
echo $element->getAttribute('href');
}
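The answers above can be combined: iterate over every anchor, and for each one iterate over its attributes collection, which can be used directly in foreach. The sketch below uses a made-up inline HTML string, and the variable name $attrsPerAnchor is hypothetical.

```php
<?php
$dom = new DOMDocument();
libxml_use_internal_errors(true);
// Hypothetical markup standing in for a downloaded page.
$dom->loadHTML('<a href="/home" class="nav" title="Home">Home</a>');
libxml_clear_errors();

$attrsPerAnchor = [];
foreach ($dom->getElementsByTagName('a') as $anchor) {
    $attrs = [];
    // $anchor->attributes is a DOMNamedNodeMap of DOMAttr nodes.
    foreach ($anchor->attributes as $attr) {
        $attrs[$attr->nodeName] = $attr->nodeValue;
    }
    $attrsPerAnchor[] = $attrs;
}

print_r($attrsPerAnchor);
```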
$html = $data['html'];
$class = ''; // initialize before appending to avoid an undefined-variable notice
if (!empty($html)) {
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    $datadom = $doc->getElementsByTagName("input");
    foreach ($datadom as $element) {
        // Collect the class attribute of every <input> element.
        $class .= " " . $element->getAttribute('class');
    }
}