I’m trying to scrape a table on Borsa Italiana.
I use this code:
<?php
$url = "https://www.borsaitaliana.it/borsa/azioni/global-equity-market/dati-completi.html?isin=IT0001477402";
$html = file_get_contents($url);
libxml_use_internal_errors(true);
$doc = new \DOMDocument();
if($doc->loadHTML($html))
{
$result = new \DOMDocument();
$result->formatOutput = true;
$table = $result->appendChild($result->createElement("table"));
$tbody = $table->appendChild($result->createElement("tbody"));
$xpath = new \DOMXPath($doc);
foreach($xpath->query("//table[@class=\"m-table -clear-m\"]/tbody/tr") as $row)
{
$newRow = $tbody->appendChild($result->createElement("tr"));
foreach($xpath->query("./td[position()>0 and position()<3]", $row) as $cell)
{
$newRow->appendChild($result->createElement("td", trim($cell->nodeValue)));
}
}
}
echo $result->saveHTML($result->documentElement);
?>
The result is a table with two columns and several rows. I would like to transpose the first column into a header row, so that I can save the result in my database for personal use.
Can anyone help me?
Thank you
Try this:
<?php
$url = "https://www.borsaitaliana.it/borsa/azioni/global-equity-market/dati-completi.html?isin=IT0001477402";
$html = file_get_contents($url);
libxml_use_internal_errors(true);
$doc = new \DOMDocument();
if ($doc->loadHTML($html)) {
$result = new \DOMDocument();
$result->formatOutput = true;
$xpath = new \DOMXPath($doc);
// collects data in $arr -->
$arr = [];
foreach ($xpath->query("//table[@class=\"m-table -clear-m\"]/tbody/tr") as $row) {
$itm = [];
foreach ($xpath->query("./td[position()>0 and position()<3]", $row) as $cell) {
$itm[] = trim($cell->nodeValue);
}
$arr[] = $itm;
}
// <--
$table = $result->appendChild($result->createElement("table"));
// outputs head -->
$thead = $table->appendChild($result->createElement("thead"));
$newRow = $thead->appendChild($result->createElement("tr"));
foreach (array_column($arr, 0) as $th) {
$newRow->appendChild($result->createElement("th", $th));
}
// <--
// outputs data -->
$tbody = $table->appendChild($result->createElement("tbody"));
$newRow = $tbody->appendChild($result->createElement("tr"));
foreach ($arr as $row) {
$newRow->appendChild($result->createElement("td", isset($row[1])? $row[1]: ""));
}
// <--
}
echo $result->saveHTML($result->documentElement);
But I agree with @tim: you should use an API for that.
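Since the goal is to save the values in a database, it may be simpler to skip the HTML table and turn the collected rows into one label => value record. This is only a sketch: the sample rows are hypothetical placeholders for what $arr would hold after scraping, and the arrow function assumes PHP 7.4 or later.

```php
<?php
// Hypothetical sample of $arr: one [label, value] pair per scraped row.
$arr = [
    ['Label A', 'Value A'],
    ['Label B', 'Value B'],
    ['Label C'],            // a row whose second cell is missing
];

// Pad short rows so every row has a value cell.
$padded = array_map(fn(array $row) => array_pad($row, 2, ''), $arr);

// Use column 0 as keys and column 1 as values: the "transposed" record.
$record = array_column($padded, 1, 0);

print_r($record);
```

The keys of $record can then serve as column names and the values as one database row.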
I am trying to grab a URL with DOMDocument, but I am stuck at getNamedItem.
How can I solve this problem? What am I missing here? Any ideas are welcome!
$url = 'https://www.31sumai.com/search/area/kansai/result/?area=16,17,18';
$html = file_get_contents($url);
libxml_use_internal_errors(true);
$DOMParser = new \DOMDocument();
$DOMParser->loadHTML($html);
$mainlink = null;
$allPTags = $DOMParser->getElementsByTagName('p');
foreach ($allPTags as $ptag) {
$class = $ptag->attributes->getNamedItem("class");
if ($class && $class->nodeValue == 'c-name') {
$main = $ptag->attributes->getNamedItem("href");
if ($main) {
$mainlink = $main->nodeValue;
}
}
}
var_dump($mainlink);
It's returning null, but I already checked the website and there is a URL in that tag.
$url = 'https://lions-mansion.jp/area/kansai/';
$html = file_get_contents($url);
libxml_use_internal_errors(true);
$DOMParser = new \DOMDocument();
$DOMParser->loadHTML($html);
$mainlink = null;
$allPTags = $DOMParser->getElementsByTagName('p');
foreach ($allPTags as $ptag) {
$class = $ptag->attributes->getNamedItem("class");
if ($class && $class->nodeValue == 'areapageDetailList_item_btn_hp') {
$links = $ptag->getElementsByTagName('a');
foreach ($links as $link) {
$hrefAttr = $link->attributes->getNamedItem("href");
if ($hrefAttr) {
$mainlink = $hrefAttr->nodeValue;
}
}
}
}
echo $mainlink;
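If the href sits on an <a> inside the <p> rather than on the <p> itself, as the second snippet assumes, an XPath query can pick up the attribute directly. A minimal sketch on hypothetical inline markup (the real page structure is an assumption):

```php
<?php
// Hypothetical markup standing in for the downloaded page.
$html = '<div><p class="c-name"><a href="/detail/1">First</a></p>'
      . '<p class="other"><a href="/detail/2">Second</a></p></div>';

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);

$xpath = new DOMXPath($doc);
$links = [];
// Select the href attribute of every <a> below a <p class="c-name">.
foreach ($xpath->query('//p[@class="c-name"]//a/@href') as $attr) {
    $links[] = $attr->nodeValue;
}

var_dump($links);
```

Note that @class="c-name" only matches when the class attribute is exactly that string; an element with several classes would need contains() instead.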
I want to get all <p> elements from the first joke, so I wrote this script:
<?php
$url = "http://sms.hindijokes.co";
$html = file_get_contents($url);
$doc = new DOMDocument;
$doc->strictErrorChecking = false;
$doc->recover = true;
@$doc->loadHTML("<html><body>".$html."</body></html>");
$xpath = new DOMXPath($doc);
$query1 = "//h2[@class='entry-title']/a";
$query2 = "//div[@class='entry-content']/p";
$entries1 = $xpath->query($query1);
$entries2 = $xpath->query($query2);
$var1 = $entries1->item(0)->textContent;
$var2 = $entries2->item(0)->textContent;
echo "$var1";
echo "<br>";
$f = 5;
for($i = 0; $i < $f; $i++){
echo $entries2->item($i)->textContent."\n";
}
?>
This time I knew that there are five <p> elements in the first joke, but if I want the script to be automated, there will sometimes be more or fewer than five <p> elements, and a hard-coded count would break the output.
You need only the first div's p elements, so your query would be:
$entries2 = $xpath->query("(//div[@class='entry-content'])[1]/p");
Now you can iterate over all p elements with a foreach loop (extracting their HTML contents):
$innerHtml = '';
foreach ($entries2 as $entry) {
$children = $entry->childNodes;
foreach ($children as $child) {
$innerHtml .= $child->ownerDocument->saveXML($child);
}
}
$innerHtml = str_replace(["\r\n", "\r", "\n", "\t"], '', $innerHtml);
DOMXPath::query returns a DOMNodeList object. Use the DOMNodeList::length property:
$f = $entries2->length;
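To show how length and foreach behave on a node list, here is a small sketch on hypothetical inline markup (two entry-content blocks, the first holding three paragraphs); the markup is an assumption and not the real page:

```php
<?php
// Hypothetical markup: the first "joke" has three paragraphs.
$html = '<div class="entry-content"><p>one</p><p>two</p><p>three</p></div>'
      . '<div class="entry-content"><p>other</p></div>';

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Paragraphs of the first entry-content block only.
$entries = $xpath->query("(//div[@class='entry-content'])[1]/p");

$count = $entries->length;   // number of matched nodes, no hard-coded 5
$texts = [];
foreach ($entries as $p) {   // DOMNodeList is directly iterable
    $texts[] = $p->textContent;
}

echo $count, "\n", implode("\n", $texts), "\n";
```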
Try it this way: the loop runs until item() returns null. Some jokes have multiple p tags, though, so it is better to locate them by a custom class or id.
$i = 0;
while (($entry = $entries2->item($i)) !== null) {
echo "<br>";
echo $i." ".$entry->textContent;
$i++;
}
I am trying to extract the news headlines and the link (href) of each headline using the code below, but the link extraction is not working; it only gets the headline. Please help me find out what's wrong with the code.
Link to page from which I want to get the headline and link from:
http://web.tmxmoney.com/news.php?qm_symbol=BCM
<?php
$data = file_get_contents('http://web.tmxmoney.com/news.php?qm_symbol=BCM');
$dom = new domDocument;
@$dom->loadHTML($data);
$dom->preserveWhiteSpace = true;
$xpath = new DOMXPath($dom);
$rows = $xpath->query('//div');
foreach ($rows as $row) {
$cols = $row->getElementsByTagName('span');
$newstitle = $cols->item(0)->nodeValue;
$link = $cols->item(0)->nodeType === HTML_ELEMENT_NODE ? $cols->item(0)->getElementsByTagName('a')->item(0)->getAttribute('href') : '';
echo $newstitle . '<br>';
echo $link . '<br><br>';
}
?>
Thanks in advance for your help!
Try to do this:
<?php
$data= file_get_contents('http://web.tmxmoney.com/news.php?qm_symbol=BCM');
$dom = new DOMDocument();
@$dom->loadHTML($data);
$xpath = new DOMXPath($dom);
$hrefs= $xpath->query('/html/body//a');
for($i = 0; $i < $hrefs->length; $i++){
$href = $hrefs->item($i);
$url = $href->getAttribute('href');
$url = filter_var($url, FILTER_SANITIZE_URL);
if(!filter_var($url, FILTER_VALIDATE_URL) === false){
echo ''.$url.'<br />';
}
}
?>
I have found the solution. Here it goes:
<?php
$data = file_get_contents('http://web.tmxmoney.com/news.php?qm_symbol=BCM');
$dom = new domDocument;
@$dom->loadHTML($data);
$dom->preserveWhiteSpace = true;
$xpath = new DOMXPath($dom);
$rows = $xpath->query('//div');
foreach ($rows as $row) {
$cols1 = $row->getElementsByTagName('a');
$link = $cols1->item(0)->nodeType === XML_ELEMENT_NODE ? $cols1->item(0)->getAttribute('href') : '';
$cols2 = $row->getElementsByTagName('span');
$title = $cols2->item(0)->nodeValue;
$source = $cols2->item(1)->nodeValue;
echo $title . '<br>';
echo $source . '<br>';
echo $link . '<br><br>';
}
?>
I am using Plesk 11.0.9. I set up a cron job to update the database; the cron file runs once a day. Each run has to update a large amount of data (around 20,000) in the database, but the cron file runs for only 5 minutes and then times out, so the database is not updated correctly. I use the following code inside the cron file:
<?php
$query=mysql_query("SELECT id,detail_url,region,region_id FROM c_url_details_crone");
while($ress=mysql_fetch_array($query))
{
echo '---'. $city=$ress['region']; echo "<br>";
$url=$ress['detail_url'];
$did=$ress['id'];
require_once 'simplehtmldom_1_51/simple_html_dom.php';
$html = file_get_html($url);
foreach($html->find('h3[class=h3 z-address]') as $link)
$data['address']= $link;
foreach($html->find('h2[class=z-price brown]') as $pr)
$data['priceH']= $pr;
foreach($html->find('div[class=z-feature]') as $rooms)
$data['room']=$rooms;
foreach($html->find('div[class=z-description]') as $description)
$data['description']=$description;
$property1='';
foreach($html->find('table[class=table-style]') as $property)
{
$property1=$property1. ','.$property;
}
$data['property']=$property1 ;
foreach($html->find('table[class=table-style z-rooms]') as $extras)
{
$data['extras']=$extras;
}
foreach($html->find('li[class=z-mls]') as $listcode)
{
$data['listcode']=$listcode;
}
foreach($html->find('div[class=z-listing-by]') as $propertybrkr)
{
$data['propertybrkr']=$propertybrkr;
}
foreach($html->find('div[class=small-12 columns flat-columns z-block]') as $propertyimages)
$data['propertyimages']=$propertyimages;
if(isset($propertyimages)){ $file_contents= $propertyimages;
$img1='';
foreach(@$file_contents->find('img') as $element)
$img1=$img1. ','.$element->src;
$imgeslidr=explode(',',$img1);
// print_r($imgeslidr);
@$st0=implode(",",$imgeslidr);
}
$detls= explode(',',$property1);
@$dom = new DOMDocument;
@$dom->loadHTML(@$detls[1]);
@$cells = $dom->getElementsByTagName('td');
@$contents = array();
@$aa="";
foreach($cells as $cell)
{
@$contents[] = $cell->nodeValue;
$aa.=$cell->nodeValue.",";
}
@$dom = new DOMDocument;
@$dom->loadHTML(@$detls[1]);
@$cells = $dom->getElementsByTagName('th');
@$contents = array();
$bb='';
foreach($cells as $cell)
{
@$contents1[] = $cell->nodeValue;
$bb.=$cell->nodeValue.",";
}
@$dom = new DOMDocument;
@$dom->loadHTML(@$detls[3]);
@$cells = $dom->getElementsByTagName('th');
@$contents = array();
@$cc="";
foreach($cells as $cell)
{
@$contents2[] = $cell->nodeValue;
$cc.=$cell->nodeValue.",";
}
@$dom = new DOMDocument;
@$dom->loadHTML(@$detls[3]);
@$cells = $dom->getElementsByTagName('td');
@$contents = array();
@$dd="";
foreach($cells as $cell)
{
@$contents3[] = $cell->nodeValue;
$dd.=$cell->nodeValue.",";
}
@$dom = new DOMDocument;
@$dom->loadHTML(@$detls[4]);
@$cells = $dom->getElementsByTagName('th');
@$contents = array();
@$ff="";
foreach($cells as $cell)
{
@$contents5[] = $cell->nodeValue;
$ff.=$cell->nodeValue.",";
}
@$dom = new DOMDocument;
@$dom->loadHTML(@$detls[3]);
@$cells = $dom->getElementsByTagName('thead');
@$contents = array();
@$gg="";
foreach($cells as $cell)
{
@$contents6[] = $cell->nodeValue;
$gg.=$cell->nodeValue.",";
}
@$dom = new DOMDocument;
@$dom->loadHTML(@$detls[4]);
@$cells = $dom->getElementsByTagName('thead');
@$contents = array();
@$hh="";
foreach($cells as $cell)
{
@$contents7[] = $cell->nodeValue;
$hh.=$cell->nodeValue.",";
}
@$dom = new DOMDocument;
@$dom->loadHTML(@$detls[2]);
@$cells = $dom->getElementsByTagName('thead');
@$contents = array();
@$uu="";
foreach($cells as $cell)
{
@$contents8[] = $cell->nodeValue;
$uu.=$cell->nodeValue.",";
}
@$dom = new DOMDocument;
@$dom->loadHTML(@$detls[2]);
@$cells = $dom->getElementsByTagName('th');
@$contents = array();
@$jj="";
foreach($cells as $cell)
{
@$contents9[] = $cell->nodeValue;
$jj.=$cell->nodeValue.",";
}
@$dom = new DOMDocument;
@$dom->loadHTML(@$detls[2]);
@$cells = $dom->getElementsByTagName('td');
@$contents = array();
@$kk="";
foreach($cells as $cell)
{
@$contents10[] = $cell->nodeValue;
$kk.=$cell->nodeValue.",";
}
@$dom = new DOMDocument;
@$dom->loadHTML(@$extras);
@$cells = $dom->getElementsByTagName('td');
@$contents = array();
@$contss="";
foreach($cells as $cell)
{
@$contentsss[] = $cell->nodeValue;
$contss.=$cell->nodeValue."**#**";
}
@$images1=$st0;
@$a1=$aa;
@$b1=$bb;
@$c1=mysql_real_escape_string($cc);
@$d1=mysql_real_escape_string($dd);
@$f1=mysql_real_escape_string($ff);
@$g1=mysql_real_escape_string($gg);
@$h1=mysql_real_escape_string($hh);
@$i1=mysql_real_escape_string($uu);
@$j1=mysql_real_escape_string($jj);
@$k1=mysql_real_escape_string($kk);
@$l1=mysql_real_escape_string($contss);
@$decs=$description;
$c++;
mysql_query("INSERT INTO house_moredetails_crone(city,durl,a,b,c,d,f,g,h,i,j,k,l,description,images)
VALUES('".$city."','".$url."','".$a1."','".$b1."','".$c1."','".$d1."','".$f1."','".$g1."','".$h1."','".$i1."','".$j1."','".$k1."','".$l1."','".$decs."','".$images1."')") ;
mysql_query("DELETE FROM c_url_details_crone WHERE id='$did'");
}
?>
Is there a better option than cron for updating the database? Or, if I keep using cron, is there a way to update the database without the timeout? Waiting for your reply!
I tried to concatenate the inner HTML of a div into a string variable:
games variable:
$games = '';
DOMinnerHTML function:
function DOMinnerHTML($element)
{
$innerHTML = "";
$children = $element->childNodes;
foreach ($children as $child)
{
$tmp_dom = new DOMDocument();
$tmp_dom->appendChild($tmp_dom->importNode($child, true));
$innerHTML.=trim($tmp_dom->saveHTML());
}
return $innerHTML;
}
ExtractFromType function:
function ExtractFromType($type)
{
$html = file_get_contents('www.site.com/' .$type);
$dom = new domDocument;
@$dom->loadHTML($html);
$dom->preserveWhiteSpace = false;
$divs = $dom->getElementsByTagName('div');
foreach ($divs as $div) {
if (strpos($div->getAttribute('style'),'MyString') !== false) {
//////
$games = $games.DOMinnerHTML($div);
//////
}
}
}
code:
ExtractFromType('MyType');
echo $games; // = Nothing.
This code returns nothing.
$games is defined in the global scope, so it is not available inside ExtractFromType. Define it inside the function, then return the value:
function ExtractFromType($type) {
$html = file_get_contents('www.site.com/' .$type);
$dom = new domDocument;
@$dom->loadHTML($html);
$dom->preserveWhiteSpace = false;
$divs = $dom->getElementsByTagName('div');
$games = '';
foreach ($divs as $div) {
if (strpos($div->getAttribute('style'),'MyString') !== false) {
$games = $games.DOMinnerHTML($div);
}
}
// Hand the accumulated HTML back to the caller
return $games;
}
echo ExtractFromType('MyType');
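To try the scoping fix without hitting a real site, here is a self-contained sketch that takes the HTML as a parameter. The function names innerHtml and extractGames and the sample markup are hypothetical, chosen only for illustration:

```php
<?php
// Serialise the children of a node (its "inner HTML").
function innerHtml(DOMNode $element): string
{
    $out = '';
    foreach ($element->childNodes as $child) {
        $out .= $element->ownerDocument->saveHTML($child);
    }
    return $out;
}

// Collect the inner HTML of every div whose style contains $needle.
function extractGames(string $html, string $needle): string
{
    libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);

    $games = '';   // local accumulator, returned at the end
    foreach ($dom->getElementsByTagName('div') as $div) {
        if (strpos($div->getAttribute('style'), $needle) !== false) {
            $games .= innerHtml($div);
        }
    }
    return $games;
}

// Hypothetical sample markup.
$sample = '<div style="color:MyString"><b>A</b></div>'
        . '<div style="x">skip</div>'
        . '<div style="MyString"><i>B</i></div>';

$games = extractGames($sample, 'MyString');
echo $games;
```

Returning the string keeps the function free of global state; the caller decides which variable receives the result.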