PHP to extract data from a website


I want to get all the <p> elements of the first joke, so I wrote this script:
<?php
$url = "http://sms.hindijokes.co";
$html = file_get_contents($url);
$doc = new DOMDocument;
$doc->strictErrorChecking = false;
$doc->recover = true;
@$doc->loadHTML("<html><body>" . $html . "</body></html>");
$xpath = new DOMXPath($doc);
$query1 = "//h2[@class='entry-title']/a";
$query2 = "//div[@class='entry-content']/p";
$entries1 = $xpath->query($query1);
$entries2 = $xpath->query($query2);
$var1 = $entries1->item(0)->textContent;
$var2 = $entries2->item(0)->textContent;
echo "$var1";
echo "<br>";
$f = 5;
for($i = 0; $i < $f; $i++){
echo $entries2->item($i)->textContent."\n";
}
?>
This time I knew that the first joke contains five <p> elements. If the script is to run automatically, though, a joke will sometimes have more or fewer than five, and the hard-coded count will break the output.

You need only the <p> elements of the first div, so the query becomes:
$entries2 = $xpath->query("(//div[@class='entry-content'])[1]/p");
Now you can iterate over all the <p> elements with a foreach loop (here collecting each element's inner HTML):
$innerHtml = '';
foreach ($entries2 as $entry) {
$children = $entry->childNodes;
foreach ($children as $child) {
$innerHtml .= $child->ownerDocument->saveXML($child);
}
}
$innerHtml = str_replace(["\r\n", "\r", "\n", "\t"], '', $innerHtml);
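The grouped query can be tried without touching the network. The following is a minimal sketch that assumes a hypothetical inline page with two entry-content blocks; the markup and the $lines variable are illustrative and not taken from the real site.

```php
<?php
// Hypothetical markup standing in for the fetched page.
$html = <<<HTML
<html><body>
<div class="entry-content"><p>First line</p><p>Second line</p></div>
<div class="entry-content"><p>Another joke</p></div>
</body></html>
HTML;

libxml_use_internal_errors(true); // collect parse warnings instead of printing them
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
// The parentheses make [1] pick the first div in the whole document,
// not the first div under each parent.
$paragraphs = $xpath->query("(//div[@class='entry-content'])[1]/p");

$lines = [];
foreach ($paragraphs as $p) {
    $lines[] = trim($p->textContent);
}
echo implode("\n", $lines), "\n";
```

Only the two paragraphs of the first block are collected, however many paragraphs that block happens to contain.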

DOMXPath::query() returns a DOMNodeList object, so use its length property instead of a hard-coded count:
$f = $entries2->length;

Try it this way: the loop runs until item() returns null. Some jokes contain several <p> tags, though, so it is better to select them by a specific class or id.
$i = 0;
while (($entry = $entries2->item($i)) !== null) {
echo "<br>";
echo $i." ".$entry->textContent;
$i++;
}
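Since DOMNodeList is traversable, neither a counter nor a null check is strictly required. This sketch, which assumes a hypothetical three-paragraph document, iterates the list directly and reads length only to report the count.

```php
<?php
$doc = new DOMDocument();
$doc->loadHTML('<html><body><p>one</p><p>two</p><p>three</p></body></html>');
$nodes = (new DOMXPath($doc))->query('//p');

$texts = [];
// The keys of a DOMNodeList are the zero-based node positions.
foreach ($nodes as $index => $node) {
    $texts[] = $index . ' ' . $node->textContent;
}
echo implode("\n", $texts), "\n";
echo 'count: ', $nodes->length, "\n";
```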

Related

Transpose first column of an html table in header

I'm trying to scrape a table from the Borsa Italiana website.
I use this code:
<?php
$url = "https://www.borsaitaliana.it/borsa/azioni/global-equity-market/dati-completi.html?isin=IT0001477402";
$html = file_get_contents($url);
libxml_use_internal_errors(true);
$doc = new \DOMDocument();
if($doc->loadHTML($html))
{
$result = new \DOMDocument();
$result->formatOutput = true;
$table = $result->appendChild($result->createElement("table"));
$tbody = $table->appendChild($result->createElement("tbody"));
$xpath = new \DOMXPath($doc);
foreach($xpath->query("//table[@class=\"m-table -clear-m\"]/tbody/tr") as $row)
{
$newRow = $tbody->appendChild($result->createElement("tr"));
foreach($xpath->query("./td[position()>0 and position()<3]", $row) as $cell)
{
$newRow->appendChild($result->createElement("td", trim($cell->nodeValue)));
}
}
}
echo $result->saveHTML($result->documentElement);
?>
The result is a table with two columns and several rows. I would like to transpose the first column into the header row, so that I can save the result in my database for personal use.
Can anyone help me?
Thank you

Try it:
<?php
$url = "https://www.borsaitaliana.it/borsa/azioni/global-equity-market/dati-completi.html?isin=IT0001477402";
$html = file_get_contents($url);
libxml_use_internal_errors(true);
$doc = new \DOMDocument();
if ($doc->loadHTML($html)) {
$result = new \DOMDocument();
$result->formatOutput = true;
$xpath = new \DOMXPath($doc);
// collects data in $arr -->
$arr = [];
foreach ($xpath->query("//table[@class=\"m-table -clear-m\"]/tbody/tr") as $row) {
$itm = [];
foreach ($xpath->query("./td[position()>0 and position()<3]", $row) as $cell) {
$itm[] = trim($cell->nodeValue);
}
$arr[] = $itm;
}
// <--
$table = $result->appendChild($result->createElement("table"));
// outputs head -->
$thead = $table->appendChild($result->createElement("thead"));
$newRow = $thead->appendChild($result->createElement("tr"));
foreach (array_column($arr, 0) as $th) {
$newRow->appendChild($result->createElement("th", $th));
}
// <--
// outputs data -->
$tbody = $table->appendChild($result->createElement("tbody"));
$newRow = $tbody->appendChild($result->createElement("tr"));
foreach ($arr as $row) {
$newRow->appendChild($result->createElement("td", isset($row[1])? $row[1]: ""));
}
// <--
}
echo $result->saveHTML($result->documentElement);
But I agree with @tim: you should use an API for that.
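If the goal is a database row rather than an HTML table, the transposition itself can be done on plain arrays. This is a rough sketch using hypothetical label/value pairs shaped like $arr above; the labels, values, and the $record name are made up for illustration.

```php
<?php
// Hypothetical label/value rows, shaped like the scraped $arr.
$arr = [
    ['Label A', 'value A'],
    ['Label B', 'value B'],
    ['Label C'], // a row whose second cell is missing
];

// The first column becomes the keys, the second column the values.
$header = array_column($arr, 0);
$values = array_map(fn(array $row) => $row[1] ?? '', $arr);
$record = array_combine($header, $values);

print_r($record);
```

The resulting associative array maps each former first-column label to its value, which is usually easier to bind to an INSERT statement than a rendered table.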

foreach ends before getting through all elements in SimpleXMLElement

I loop over a big XML document and need to remove some nodes. Unfortunately, my foreach stops after the first removal. Why is that?
$ids = [1, 2];
$data=<<<DNS_TXT
<feed xmlns:g="http://base.google.com/ns/1.0">
<entry><g:id>1</g:id><description>Desc 1</description></entry>
<entry><g:id>2</g:id><description>Desc 2</description></entry>
<entry><g:id>3</g:id><description>Desc 3</description></entry>
<entry><g:id>4</g:id><description>Desc 4</description></entry>
<entry><g:id>5</g:id><description>Desc 5</description></entry>
</feed>
DNS_TXT;
$doc = new SimpleXMLElement($data);
$i = 0;
foreach($doc->entry as $entry)
{
$i++;
$dom = $entry->children('http://base.google.com/ns/1.0');
if(!in_array($dom->id, $ids)) {
$dom = dom_import_simplexml($entry);
$dom->parentNode->removeChild($dom);
}
}
echo $i;
The result is 3 instead of 5.
Of course, I can do this instead:
/.../
$toRemove = array();
foreach($doc->entry as $entry)
{
$dom = $entry->children('http://base.google.com/ns/1.0');
if(!in_array($dom->id, $ids)) {
$dom = dom_import_simplexml($entry);
$toRemove[] = $dom;
}
}
foreach ($toRemove as $dom) {
$dom->parentNode->removeChild($dom);
}
/.../
But why does the foreach end early in the first case?

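The foreach walks the live document: once the current entry is removed, it is unlinked from its siblings and the iterator has nothing left to advance to. In cases like this it is safer to loop from the highest index down to the lowest. This works:
$entries = $doc->entry;
for ($i = count($entries) - 1; $i >= 0; $i--)
{
    $entry = $entries[$i];
    $dom = $entry->children('http://base.google.com/ns/1.0');
    if (!in_array($dom->id, $ids)) {
        unset($doc->entry[$i]);
    }
}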
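Another option is to iterate over a snapshot instead of the live element list. The sketch below assumes a shortened, hypothetical feed and uses SimpleXMLElement::xpath(), which returns a plain PHP array, so removing nodes cannot cut the loop short; the $seen counter exists only to show that every entry is visited.

```php
<?php
$ids = [1, 2];
$xml = <<<XML
<feed xmlns:g="http://base.google.com/ns/1.0">
<entry><g:id>1</g:id></entry>
<entry><g:id>2</g:id></entry>
<entry><g:id>3</g:id></entry>
<entry><g:id>4</g:id></entry>
</feed>
XML;
$doc = new SimpleXMLElement($xml);

$seen = 0;
// xpath() returns an array, i.e. a snapshot that later removals do not affect.
foreach ($doc->xpath('/feed/entry') as $entry) {
    $seen++;
    $id = (int) $entry->children('http://base.google.com/ns/1.0')->id;
    if (!in_array($id, $ids, true)) {
        $node = dom_import_simplexml($entry);
        $node->parentNode->removeChild($node);
    }
}
echo $seen, ' visited, ', count($doc->entry), " kept\n";
```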

Convert a selected HTML Table to JSON

Is it possible to convert just one table from an HTML page that contains several tables to JSON?
I have this markup:
<div class="mon_title">2.11.2015 Montag</div>
<table class="info" >
<tr class="info"><th class="info" align="center" colspan="2">Nachrichten zum Tag</th></tr>
<tr class='info'><td class='info' colspan="2"><b><u></u> </b>
...
</table>
<p>
<table class="mon_list" >
...
</table>
And this PHP code to convert it into JSON:
function save_table_to_json ( $in_file, $out_file ) {
$html = file_get_contents( $in_file );
file_put_contents( $out_file, convert_table_to_json( $html ) );
}
function convert_table_to_json ( $html ) {
$document = new DOMDocument();
$document->loadHTML( $html );
$obj = [];
$jsonObj = [];
$th = $document->getElementsByTagName('th');
$td = $document->getElementsByTagName('td');
$thNum = $th->length;
$arrLength = $td->length;
$rowIx = 0;
for ( $i = 0 ; $i < $arrLength ; $i++){
$head = $th->item( $i%$thNum )->textContent;
$content = $td->item( $i )->textContent;
$obj[ $head ] = $content;
if( ($i+1) % $thNum === 0){
$jsonObj[++$rowIx] = $obj;
$obj = [];
}
}
return json_encode( $jsonObj );
}
save_table_to_json( 'heute_S.htm', 'heute_S.json' );
It takes both the table with class=info and the table with class=mon_list and converts them to JSON.
Is there any way to make it take only the table with class=mon_list?

You can use XPath to search for the class, and then create a new DOM document that only contains the results of the XPath query. This is untested, but should get you on the right track.
It's also worth mentioning that you can use foreach to iterate over the node list.
$document = new DOMDocument();
$document->loadHTML( $html );
$xpath = new DOMXPath($document);
$tables = $xpath->query("//*[contains(@class, 'mon_list')]");
$tableDom = new DOMDocument();
$tableDom->appendChild($tableDom->importNode($tables->item(0), true));
$obj = [];
$jsonObj = [];
$th = $tableDom->getElementsByTagName('th');
$td = $tableDom->getElementsByTagName('td');
$thNum = $th->length;
$arrLength = $td->length;
$rowIx = 0;
for ( $i = 0 ; $i < $arrLength ; $i++){
$head = $th->item( $i%$thNum )->textContent;
$content = $td->item( $i )->textContent;
$obj[ $head ] = $content;
if( ($i+1) % $thNum === 0){
$jsonObj[++$rowIx] = $obj;
$obj = [];
}
}
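Copying the table into a second document is not the only route: XPath queries accept a context node, so the header and cell lookups can be scoped to the selected table directly. The following sketch assumes a hypothetical two-table page and a table whose rows all have as many cells as there are headers; the column names and cell values are invented for the example.

```php
<?php
// Hypothetical page: only the second table has class "mon_list".
$html = '<html><body>'
    . '<table class="info"><tr><th>Nachrichten</th></tr><tr><td>ignored</td></tr></table>'
    . '<table class="mon_list"><tr><th>Klasse</th><th>Fach</th></tr>'
    . '<tr><td>5a</td><td>Mathe</td></tr><tr><td>6b</td><td>Deutsch</td></tr></table>'
    . '</body></html>';

$document = new DOMDocument();
$document->loadHTML($html);
$xpath = new DOMXPath($document);

$table = $xpath->query("//table[contains(@class, 'mon_list')]")->item(0);

// Passing $table as the context node keeps each query inside that table.
$headers = [];
foreach ($xpath->query('.//th', $table) as $th) {
    $headers[] = trim($th->textContent);
}

$rows = [];
foreach ($xpath->query('.//tr[td]', $table) as $tr) {
    $cells = [];
    foreach ($xpath->query('./td', $tr) as $td) {
        $cells[] = trim($td->textContent);
    }
    // Assumes every data row has exactly one cell per header.
    $rows[] = array_combine($headers, $cells);
}

$json = json_encode($rows);
echo $json, "\n";
```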

Another, unrelated approach is to use getAttribute() to check the class name. Someone wrote a function for this in a different answer:
function getElementsByClass(&$parentNode, $tagName, $className) {
$nodes=array();
$childNodeList = $parentNode->getElementsByTagName($tagName);
for ($i = 0; $i < $childNodeList->length; $i++) {
$temp = $childNodeList->item($i);
if (stripos($temp->getAttribute('class'), $className) !== false) {
$nodes[]=$temp;
}
}
return $nodes;
}
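Both the contains() query and the stripos() check match substrings, so a class such as mon_list_old would also be picked up. A common XPath 1.0 idiom pads the attribute with spaces and looks for the padded token instead. This sketch compares the two on a hypothetical set of three tables.

```php
<?php
$document = new DOMDocument();
$document->loadHTML(
    '<html><body>'
    . '<table class="mon_list"></table>'
    . '<table class="big mon_list"></table>'
    . '<table class="mon_list_old"></table>'
    . '</body></html>'
);
$xpath = new DOMXPath($document);

// Substring match: also hits "mon_list_old".
$loose = $xpath->query("//table[contains(@class, 'mon_list')]");

// Whole-token match: pad the class list with spaces, then search for " mon_list ".
$exact = $xpath->query(
    "//table[contains(concat(' ', normalize-space(@class), ' '), ' mon_list ')]"
);

echo $loose->length, ' loose, ', $exact->length, " exact\n";
```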

Getting 'DomNode' type object in php

I am using PHP's DOMDocument class to load an HTML file and then empty its contents. The problem is that removeChild() gives me a 'Not Found Error'. Here's my code:
$doc=new DOMDocument();
$doc->loadHTMLFile("a.html");
$body= $doc->getElementsByTagName('body')->item(0);
foreach($body->childNodes as $child)
{
$body->removeChild($child);
}
$child is of type DOMText. Maybe that is because removeChild() expects a DOMNode and not a DOMText? If so, how can I iterate over childNodes so that $child is of type DOMNode?

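DOMText is a subclass of DOMNode, so the node type is not the problem. childNodes is a live list, and removing children while foreach walks it disturbs the iteration. Use a for loop instead of a foreach loop, running from the last child to the first:
$doc = new DOMDocument();
$doc->preserveWhiteSpace = true;
$doc->loadHTMLFile("c.html");
$body = $doc->getElementsByTagName('body')->item(0);
$children = $body->childNodes;
// Walk backwards so a removal never shifts the nodes still to be visited.
for ($i = $children->length - 1; $i >= 0; $i--) {
    $body->removeChild($children->item($i));
}
$html = $doc->saveHTML();
echo $html;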
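An index-free alternative is to keep removing the first child until none is left. This is a small sketch on a hypothetical inline document rather than a file on disk.

```php
<?php
$doc = new DOMDocument();
$doc->loadHTML('<html><body><p>one</p> text <div>two</div></body></html>');
$body = $doc->getElementsByTagName('body')->item(0);

// firstChild is re-read on every pass, so the live list cannot get out of sync.
while ($body->firstChild !== null) {
    $body->removeChild($body->firstChild);
}

echo $doc->saveHTML($body), "\n";
```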

PHP: DomElement->getAttribute

How can I get all the attributes of an element? In my example below I can only get one at a time, but I want to pull out all of the anchor tag's attributes.
$dom = new DOMDocument();
@$dom->loadHTMLFile("http://www.example.com");
$a = $dom->getElementsByTagName("a")->item(0);
echo $a->getAttribute('href');
thanks!

$length = $a->attributes->length;
$attrs = array();
for ($i = 0; $i < $length; ++$i) {
$name = $a->attributes->item($i)->name;
$value = $a->getAttribute($name);
$attrs[$name] = $value;
}
print_r($attrs);

"Inspired" by Simon's answer. I think you can cut out the getAttribute call, so here's a solution without it:
$attrs = array();
for ($i = 0; $i < $a->attributes->length; ++$i) {
$node = $a->attributes->item($i);
$attrs[$node->nodeName] = $node->nodeValue;
}
var_dump($attrs);
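The attributes collection is traversable as well, so the index loop can be replaced with foreach. The sketch below assumes a hypothetical one-link document and gathers the attributes of every anchor rather than just one; $all is an illustrative name.

```php
<?php
$dom = new DOMDocument();
$dom->loadHTML('<html><body><a href="/home" title="Home" class="nav">Home</a></body></html>');

$all = [];
foreach ($dom->getElementsByTagName('a') as $anchor) {
    $attrs = [];
    // $anchor->attributes is a DOMNamedNodeMap of DOMAttr nodes.
    foreach ($anchor->attributes as $attr) {
        $attrs[$attr->nodeName] = $attr->nodeValue;
    }
    $all[] = $attrs;
}
print_r($all);
```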

$a = $dom->getElementsByTagName("a");
foreach($a as $element)
{
echo $element->getAttribute('href');
}

$html = $data['html'];
$class = ''; // collects the class attribute of every input element
if(!empty($html)){
$doc = new DOMDocument();
$doc->loadHTML($html);
$datadom = $doc->getElementsByTagName("input");
foreach($datadom as $element)
{
$class .= " ".$element->getAttribute('class');
}
}

We Keep Coding
