HTML Table parsing using DOM - php

I have an HTML table.
My parsing code is:
$src = new DOMDocument('1.0', 'utf-8');
$src->formatOutput = true;
$src->preserveWhiteSpace = false;
@$src->loadHTML($result);
$xpath = new DOMXPath($src);
$data=$xpath->query('//td[contains(@class, "bodytext1")]');
foreach($data as $datas)
{
echo $datas->nodeValue."<br />";
}
$values=$xpath->query('//tr[contains(@bgcolor, "f3fafe")]');
foreach($values as $value)
{
echo $value->nodeValue."<br />";
}
$values1=$xpath->query('//tr[contains(@bgcolor, "def0fa")]');
foreach($values1 as $value1)
{
echo $value1->nodeValue."<br />";
}
to be printed, and I want them to be repeated along with the other lines, as shown above in the output I need.
I also want the whole thing in an array so that I can insert it into the database.
Can anyone guide me or give me a hint on how to do this?

This should get you started.
$src = new DOMDocument('1.0', 'utf-8');
$src->formatOutput = true;
$src->preserveWhiteSpace = false;
$src->loadHTML($result);
$xpath = new DOMXPath($src);
// get header data
$data=$xpath->query('//table[1]//td');
$htno = trim(explode(":",$data->item(0)->nodeValue)[1]);
$name = trim(explode(":",$data->item(1)->nodeValue)[1]);
$fatherName=trim(explode(":",$data->item(2)->nodeValue)[1]);
// rows from 2nd table
$values1=$xpath->query('//table[2]//tr');
$header = true; // flag to track whether we've read the header row.
foreach($values1 as $value1)
{
if (!$header) {
$rowdata = str_replace("\r\n"," ",$value1->nodeValue);
echo $htno," ",$name," ",$fatherName," ",$rowdata,"\n";
}
$header = false;
}
Notes:
The $header flag is a quick fix. A better XPath query might eliminate the need for it.
The str_replace near the bottom is ugly but expedient. You might want to play with the XPath query to see if you can improve on it.
The output is not formatted for HTML; lines are delimited by \n.
I got a warning on one line because it contained &, so I changed it to AND. You might have to preprocess your tables to eliminate those somehow.
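The header flag and the str_replace can both be avoided by letting XPath skip the first row and reading each cell separately. The following is only a minimal sketch, assuming the same two-table layout as the code above; the sample markup, the function name parseResultTables and the column order are made up for illustration and are not taken from the original page.

```php
<?php
// Hypothetical sample markup standing in for $result; the real page layout may differ.
$result = <<<HTML
<html><body>
<table><tr><td>Ht No : 123</td><td>Name : Jane Doe</td><td>Father Name : John Doe</td></tr></table>
<table>
<tr><td>Code</td><td>Subject</td><td>Marks</td></tr>
<tr><td>A1</td><td>Maths</td><td>70</td></tr>
<tr><td>A2</td><td>Physics</td><td>65</td></tr>
</table>
</body></html>
HTML;

function parseResultTables(string $html): array
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
    $doc->loadHTML($html);
    libxml_clear_errors();
    $xpath = new DOMXPath($doc);

    // Header values: the text after the first ":" in each cell of the first table.
    $header = [];
    foreach ($xpath->query('(//table)[1]//td') as $td) {
        $parts = explode(':', $td->nodeValue, 2);
        $header[] = trim($parts[1] ?? $parts[0]);
    }

    // position() > 1 skips the heading row, so no flag is needed.
    $rows = [];
    foreach ($xpath->query('(//table)[2]//tr[position() > 1]') as $tr) {
        $cells = [];
        foreach ($xpath->query('./td', $tr) as $td) {
            $cells[] = trim($td->nodeValue);
        }
        // One flat array per data row, header values repeated in front.
        $rows[] = array_merge($header, $cells);
    }
    return $rows;
}

$rows = parseResultTables($result);
print_r($rows);
```

Each element of $rows is one array per data row with the header values repeated, which is a convenient shape for a prepared INSERT statement.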

You could use a third-party DLL such as "Html Agility Pack", a tool built for converting HTML into XML.

Related

PHP scrape data from website

I am new to programming, so I chose to build a web page using WordPress. I am trying to gather weather data from other sites, but I could not find a suitable plugin for scraping the data, so I decided to try putting something together myself.
My limited understanding of programming is causing me problems, though. With a little inspiration from the web I have put this together:
$html = file_get_contents('http://www.frederikshavnhavn.dk/scripts/weatherwindow.php?langid=2'); //get the html returned from the following url
$poke_doc = new DOMDocument();
libxml_use_internal_errors(true); //disable libxml errors
if(!empty($html)){ //if any html is actually returned
$poke_doc->loadHTML($html);
libxml_clear_errors(); //remove errors for yucky html
$poke_xpath = new DOMXPath($poke_doc);
//get all the spans with the two weather classes
$poke_type = $poke_xpath->query("//span[@class='weathstattype']");
$poke_text = $poke_xpath->query("//span[@class='weathstattext']");
foreach($poke_text as $text){
foreach($poke_type as $type){
echo $type->nodeValue;
echo $text->nodeValue . "<br>";
continue 2;
}
break;
}
}
All of this is new to me, and I am really trying to get it to work, hoping for a better understanding of the code behind it.
What I am trying to achieve is a formatted list with the data:
1. value $type $text
2. value $type $text
Right now it is giving me a lot of trouble.
When I use continue 2, the value of $type does not change, but when I use a plain continue statement, $type changes and $text does not. How can I make both values change on each iteration?
Thanks for your help.
Try adding this function:
function get_inner_html( $node ) {
$innerHTML= '';
$children = $node->childNodes;
foreach ($children as $child) {
$innerHTML .= $child->ownerDocument->saveXML( $child );
}
return $innerHTML;
}
Then replace the foreach loops with this:
foreach($poke_text as $text){
//echo $type ->nodeValue . "</n>";
echo get_inner_html($text ).'<br>';
}
foreach($poke_type as $type){
//echo $text ->nodeValue;
echo get_inner_html($type ).'<br>';
}
produces this:
197° (Syd)
5.7 °C Stigende
4.8 m/s Stigende
5.4 m/s Stigende
-6 cm Faldende
1004 hPa
Vindretning
Lufttemperatur
Middel vindhastighed
Max vindhastighed
Vandstand
Lufttryk
In the foreach loops at the end of your code you use $type where $text belongs and $text where $type belongs. I ran the code with the variables changed to what they should be, and it works fine:
$html = file_get_contents('http://www.frederikshavnhavn.dk/scripts/weatherwindow.php?langid=2'); //get the html returned from the following url
$poke_doc = new DOMDocument();
libxml_use_internal_errors(true); //disable libxml errors
if(!empty($html)){ //if any html is actually returned
$poke_doc->loadHTML($html);
libxml_clear_errors(); //remove errors for yucky html
$poke_xpath = new DOMXPath($poke_doc);
//get all the spans with the two weather classes
$poke_type = $poke_xpath->query("//span[@class='weathstattype']");
$poke_text = $poke_xpath->query("//span[@class='weathstattext']");
foreach($poke_text as $text){
echo $text->nodeValue;
}
foreach($poke_type as $type){
echo $type->nodeValue;
}
}
And this is the output that I got from your code (after changing the variables in the loops):
196° (Syd) 5.6 °C 4.1 m/s 5 m/s -6 cm 1004 hPa Vindretning Lufttemperatur Middel vindhastighed Max vindhastighed Vandstand Lufttryk
Now that you have your data, I think you can manage how to sort it out.
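To get the numbered "type text" list the question asks for, the two node lists can be walked by index instead of nesting the loops. This is only a sketch under assumptions: the sample markup below imitates the class names used in the question (the real page may be structured differently), and pairReadings is a made-up helper name.

```php
<?php
// Hypothetical markup imitating the weather page; the real HTML may differ.
$html = <<<HTML
<html><body>
<span class="weathstattype">Vindretning</span><span class="weathstattext">197 (Syd)</span>
<span class="weathstattype">Lufttemperatur</span><span class="weathstattext">5.7 C</span>
</body></html>
HTML;

function pairReadings(string $html): array
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);   // keep warnings about sloppy HTML quiet
    $doc->loadHTML($html);
    libxml_clear_errors();
    $xpath = new DOMXPath($doc);

    $types = $xpath->query("//span[@class='weathstattype']");
    $texts = $xpath->query("//span[@class='weathstattext']");

    // Pair the n-th label with the n-th value; stop at the shorter list.
    $pairs = [];
    $count = min($types->length, $texts->length);
    for ($i = 0; $i < $count; $i++) {
        $pairs[] = [
            trim($types->item($i)->nodeValue),
            trim($texts->item($i)->nodeValue),
        ];
    }
    return $pairs;
}

foreach (pairReadings($html) as $n => [$type, $text]) {
    echo ($n + 1) . ". $type $text<br>\n";
}
```

Pairing by position assumes the page lists labels and values in the same order, which matches the output shown above but is not guaranteed by the markup.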

Get text from script output

Hi everyone, I've been using this code for quite a long time:
<?php
$url = 'http://www.smn.gov.ar/mensajes/index.php?observacion=metar&operacion=consultar&87582=on&87641=on&87750=on&87765=on&87222=on&87761=on&87860=on&87395=on&87344=on&87166=on&87904=on&87571=on&87347=on&87803=on&87576=on&87162=on&87532=on&87497=on&87097=on&87046=on&87548=on&87217=on&87506=on&87692=on&87418=on&87574=on&87715=on&87374=on&87289=on&87852=on&87178=on&87896=on&87823=on&87270=on&87155=on&87453=on&87925=on&87934=on&87480=on&87047=on&87553=on&87311=on&87909=on&87436=on&87509=on&87912=on&87623=on&87444=on&87129=on&87371=on&87645=on&87022=on&87127=on&87828=on&87121=on&87938=on&87791=on&87448=on';
$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTMLFile($url);
libxml_clear_errors();
$xpath = new DOMXpath($dom);
// search for td's containing METAR
$metars = $xpath->query('//td[contains(text(), "METAR SA")]');
if($metars->length <= 0) {
echo 'no metars found';
exit;
}
$data = array();
foreach($metars as $metar) {
$data[] = $metar->nodeValue;
}
echo '<pre>';
print_r($data);
This was working fine until the program that reads the output was updated; now it needs clean output.
At the moment I'm getting this:
http://ar.ivao.aero/weather/metar.php
But the program needs it like this:
SABE 161600Z 02006KT 9999 FEW030 24/18 Q1009 =
SAZA 161600Z 18011KT CAVOK 24/08 Q1010 =
SAZB 161700Z 27012KT CAVOK 21/09 Q1011 =
I thought another function such as file_get_contents() might be useful, but it would again show information I don't want.
I also tried replacing print_r() with var_dump(), but the result is the same.
Any ideas?
Is there any way to get this information into a simple txt file?
Regards,
You need to filter out some data. Try to find out what the lines you need have in common. For instance, all the required info in your raw print_r data seems to begin with METAR. So:
echo '<pre>';
foreach($metars as $metar) {
if(substr($metar->nodeValue, 0, 5) === "METAR") {
echo str_replace("METAR ", "", $metar->nodeValue) . PHP_EOL;
}
}
That removes any lines like Aeropuerto FORMOSA from the output.
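The same filter can be wrapped in a function and the result sent as plain text, which avoids the pre tag altogether. A minimal sketch, assuming the cell values look like the sample strings below; cleanMetars and the sample data are illustrative names and values, not part of the original script.

```php
<?php
/**
 * Keep only the report lines and drop the leading "METAR " marker.
 * $values stands in for the nodeValue strings collected from the page.
 */
function cleanMetars(array $values): array
{
    $lines = [];
    foreach ($values as $value) {
        $value = trim($value);
        if (strpos($value, 'METAR ') === 0) {
            $lines[] = substr($value, strlen('METAR '));
        }
    }
    return $lines;
}

// Hypothetical sample values; the real page content may differ.
$sample = [
    'Aeropuerto FORMOSA',
    'METAR SABE 161600Z 02006KT 9999 FEW030 24/18 Q1009 =',
    'METAR SAZA 161600Z 18011KT CAVOK 24/08 Q1010 =',
];
$text = implode(PHP_EOL, cleanMetars($sample)) . PHP_EOL;

// Plain text needs no <pre>; in a web context header() must run before any output.
if (PHP_SAPI !== 'cli') {
    header('Content-Type: text/plain; charset=utf-8');
}
echo $text;
```

If a file is preferred, the same $text string could be written with file_put_contents() to a path of your choice.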

How to Enhance This? Get a Part of a Web Page in Another Domain

I have made this:
<html>
<head>
<script src="//ajax.googleapis.com/ajax/libs/jquery/1.9.1/jquery.min.js"></script>
<script>
$(document).ready(
function()
{
$("body").html($("#HomePageTabs_cont_3").html());
}
);
</script>
</head>
<body>
<?php
echo file_get_contents("http://www.bankasya.com.tr/index.jsp");
?>
</body>
</html>
When I check my page with Firebug, it reports countless "missing file" errors (images, CSS files, JS files, etc.). I want just a part of the page, not all of it. This code does what I want, but I am wondering if there is a better way.
EDIT:
The page does what I need. I do not need all the contents. So iframe is useless to me. I just want the raw data of the div #HomePageTabs_cont_3.
Your best bet is PHP server-side parsing. I have written a small snippet to show you how to do this using DOMDocument (and possibly tidy, if your server has it, to barf out all the malformed XHTML foos).
Caveat: outputs UTF-8. You can change this in the constructor of DOMDocument.
Caveat 2: WILL barf out if its input is neither utf-8 nor iso-8859-9. The current page's charset is iso-8859-9 and I see no reason why they would change this.
header("content-type: text/html; charset=utf-8");
$data = file_get_contents("http://www.bankasya.com.tr/index.jsp");
// Clean it up
if (class_exists("tidy")) {
$dataTidy = new tidy();
$dataTidy->parseString($data,
array(
"input-encoding" => "iso-8859-9",
"output-encoding" => "iso-8859-9",
"clean" => 1,
"input-xml" => true,
"output-xml" => true,
"wrap" => 0,
"anchor-as-name" => false
)
);
$dataTidy->cleanRepair();
$data = (string)$dataTidy;
}
else {
$do = true;
while ($do) {
$start = stripos($data,'<script');
$stop = stripos($data,'</script>');
if ((is_numeric($start))&&(is_numeric($stop))) {
$s = substr($data,$start,$stop-$start);
$data = substr($data,0,$start).substr($data,($stop+strlen('</script>')));
} else {
$do = false;
}
}
// nbsp breaks it?
$data = str_replace("&nbsp;"," ",$data);
// Fixes for any element that requires a self-closing tag
if (preg_match_all("/<(link|img)([^>]+)>/is",$data,$mt,PREG_SET_ORDER)) {
foreach ($mt as $v) {
if (substr($v[2],-1) != "/") {
$data = str_replace($v[0],"<".$v[1].$v[2]."/>",$data);
}
}
}
// Barf out the inline JS
$data = preg_replace("/javascript:[^;]+/is","#",$data);
// Barf out the noscripts
$data = preg_replace("#<noscript>(.+?)</noscript>#is","",$data);
// Muppets. Malformed comment = one more regexp when they could just learn to write proper HTML...
$data = preg_replace("#<!--(.*?)--!?>#is","",$data);
}
$DOM = new \DOMDocument("1.0","utf-8");
$DOM->recover = true;
function error_callback_xmlfunction($errno, $errstr) { throw new Exception($errstr); }
$old = set_error_handler("error_callback_xmlfunction");
// Throw out all the XML namespaces (if any)
$data = preg_replace("#xmlns=[\"\']?([^\"\']+)[\"\']?#is","",(string)$data);
try {
$DOM->loadXML(((substr($data, 0, 5) !== "<?xml") ? '<?xml version="1.0" encoding="utf-8"?>' : "").$data);
} catch (Exception $e) {
$DOM->loadXML(((substr($data, 0, 5) !== "<?xml") ? '<?xml version="1.0" encoding="iso-8859-9"?>' : "").$data);
}
restore_error_handler();
error_reporting(E_ALL);
$DOM->substituteEntities = true;
$xpath = new \DOMXPath($DOM);
echo $DOM->saveXML($xpath->query("//div[@id=\"HomePageTabs_cont_3\"]")->item(0));
In order of appearance:
1. Fetch the data
2. If we have tidy, sanitize the HTML with it
3. Create a new DOMDocument and load our document ((string)$dataTidy is a shorthand tidy getter)
4. Create an XPath request path
5. Use XPath to request all divs with the id we want, get the first item of the collection (->item(0), which will be a DOMElement) and ask the DOM to output its XML content (including the tag itself)
Hope it is what you're looking for... Though you might want to wrap it in a function.
Edit
Forgot to mention: http://rescrape.it/rs.php for the actual script output!
Edit 2
Correction, that site is not W3C-valid, and therefore, you'll either need to tidy it up or apply a set of regular expressions to the input before processing. I'm going to see if I can formulate a set to barf out the inconsistencies.
Edit 3
Added a fix for all those of us who do not have tidy.
Edit 4
Couldn't resist. If you'd actually like the values rather than the table, use this instead of the echo:
$d = new stdClass();
$rows = $xpath->query("//div[@id=\"HomePageTabs_cont_3\"]//tr");
$rc = $rows->length;
for ($i = 1; $i < $rc-1; $i++) {
$cols = $xpath->query($rows->item($i)->getNodePath()."/td");
$d->{$cols->item(0)->textContent} = array(
((float)$cols->item(1)->textContent),
((float)$cols->item(2)->textContent)
);
}
I don't know about you, but for me, data works better than malformed tables.
(Welp, that one took a while to write)
I'd get in touch with the remote site's owner and ask if there was a data feed I could use that would just return the content I wanted.
Sébastien's answer is the best solution, but if you want to use jQuery you can add a base tag to the head section of your page to avoid the not-found errors for images.
<base href="http://www.bankasya.com.tr/">
You will also need to change your sources to absolute paths.
But use DOMDocument.
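For comparison, here is a shorter DOMDocument variant that relies on loadHTML() and its error recovery instead of cleaning the markup and calling loadXML(). It is a sketch only: whether loadHTML() copes with the real page has not been checked here, the sample markup is invented, and extractById is a hypothetical helper name.

```php
<?php
// Hypothetical stand-in for the downloaded page; the real markup is much larger.
$page = <<<HTML
<html><body>
<div id="other">ignore me</div>
<div id="HomePageTabs_cont_3"><table><tr><td>USD</td><td>1.80</td></tr></table></div>
</body></html>
HTML;

function extractById(string $html, string $id): ?string
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);   // tolerate malformed HTML quietly
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    // $id is placed inside the query string, so it must not contain a double quote.
    $node = $xpath->query('//div[@id="' . $id . '"]')->item(0);

    // Serialize only the matching element (tag included), or null if absent.
    return $node === null ? null : $doc->saveHTML($node);
}

echo extractById($page, 'HomePageTabs_cont_3');
```

Only the selected div is sent to the browser, so none of the remote page's images, stylesheets or scripts are requested.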

Dom Document - extract a document id & save

I am trying to extract a specific clump of HTML using DOMDocument.
My code is as follows:
$domd = new DOMDocument('1.0', 'utf-8');
$domd->loadHTML($string);
$this->hook = 'content';
if($this->hook !== '') {
$main = $domd->getElementById($this->hook);
$newstr = "";
foreach($main->childNodes as $node) {
$newstr .= $domd->saveXML($node, LIBXML_NOEMPTYTAG);
}
$domd->loadHTML($newstr);
}
//MORE PARSING USING THE DOMD OBJECT
It works great BUT the foreach is quite slow, and I was wondering if there's a more intelligent way of doing this. I am re-loading the HTML into the $domd so I can keep editing. In the back of my mind I feel I should be saving a fragment, not re-loading the saved $newstr into the object.
Can this be made more elegant or faster?
Thanks!
I'm assuming you want to mutate your existing $domd document, replacing it completely with those child nodes you're grabbing from that content node:
UPDATE: Just realized that since you were reloading using loadHTML, you probably wanted to preserve the html/body nodes that it creates. Code below has been adjusted to empty body and append the fragment there:
$domd = new DOMDocument('1.0', 'utf-8');
$domd->loadHTML($string);
$this->hook = 'content';
if($this->hook !== '') {
$main = $domd->getElementById($this->hook);
$fragment = $domd->createDocumentFragment();
while($main->hasChildNodes()) {
$fragment->appendChild($main->firstChild);
}
$body = $domd->getElementsByTagName("body")->item(0);
while($body->hasChildNodes()) {
$body->removeChild($body->firstChild);
}
$body->appendChild($fragment);
}
//MORE PARSING USING THE DOMD OBJECT

How to validate plain text (link text) in a hyperlink using php?

I am using Simple HTML DOM to fetch data from other websites. It fetches hyperlinks both with and without plain text. I want to remove the hyperlinks that have no plain text (link text) while fetching the data.
I have tried the code below:
if($title==""){ echo "No text";}
and
if(ctype_space($title)) { echo "No text";}
where $title is the plain text fetched from the website,
but neither method worked. Can anyone help?
Thanks in advance for your help.
Until you give us more information on what the value is, my best guess would be to try something like this:
if(empty($title))
{
echo "No Text";
}
Does it really need to be "plain text validation"?
Reading your question it seems you just want to remove links with empty values.
If the latter is true, you can do something like this:
$html = <<<EOL
Text
More Text
EOL;
$dom = new DOMDocument;
$dom->loadHTML($html);
$links = $dom->getElementsByTagName('a');
foreach ($links as $link) {
if (strlen(trim($link->nodeValue)) == 0) {
$link->parentNode->removeChild($link);
}
}
var_dump($dom->saveHTML());
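One caveat with the loop above: getElementsByTagName() returns a live list, so removing nodes while iterating over it can skip elements, for example when two empty links sit next to each other. A minimal sketch of a safer variant follows; the sample markup and its href values are invented for illustration, and removeEmptyLinks is a hypothetical helper name.

```php
<?php
// Hypothetical sample markup; the hrefs are placeholders.
$html = <<<HTML
<html><body>
<p><a href="/a">Text</a> <a href="/b"></a> <a href="/c">   </a> <a href="/d">More Text</a></p>
</body></html>
HTML;

function removeEmptyLinks(string $html): DOMDocument
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();

    // Copy the live node list into a plain array before removing anything.
    $links = iterator_to_array($dom->getElementsByTagName('a'));
    foreach ($links as $link) {
        // Links whose text is empty or whitespace-only are dropped.
        if (trim($link->textContent) === '') {
            $link->parentNode->removeChild($link);
        }
    }
    return $dom;
}

$dom = removeEmptyLinks($html);
echo $dom->saveHTML();
```

Note that a link containing only an image also has empty text and would be removed by this check, so an extra test for child elements may be needed.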
$dom = new DOMDocument;
$dom->loadHTML($html);
$xPath = new DOMXPath($dom); // DOMXPath takes the DOMDocument, not the HTML string
$links_array = $xPath->query("//a"); // select all a tags
$totalLinks = $links_array->length; // how many links there are.
for($i = 0; $i < $totalLinks; $i++) // process each link one by one
{
$title = trim($links_array->item($i)->nodeValue); // get link text, ignoring surrounding whitespace
if($title == '') // if no link text
{
$url = $links_array->item($i)->getAttribute('href');
// do here what you want
}
}
You need to use preg_match, with a regular expression, to extract the link text. For example
if (preg_match("/<a.*?>(.*?)</",$title,$matches))
{
echo $matches[1];
}
