DOM object - special characters are not shown correctly - PHP

I have the following php code:
$html = '<table>
<tr>
<td data-label="Date">übermittelt</td>
<td data-label="Location">xxx</td>
</tr>
<tr>
<td data-label="Date">xD2</td>
<td data-label="Location">xxx</td>
</tr>
</table>';
$dom = new DOMDocument();
$dom->loadHTML($html);
echo $html; // NO PROBLEM WITH SPECIAL CHARACTERS
$nodes = $dom->getElementsByTagName('td');
echo $nodes->item(0)->nodeValue; // PROBLEM WITH SPECIAL CHARACTERS
My problem is that the last echo shows the result like this:
Ã¼bermittelt
echo $html shows the result correctly, like this:
übermittelt
What can I do to solve this issue?

Thanks for your support.
The solution was to declare the charset when loading the HTML, like this:
$dom->loadHTML('<meta http-equiv="Content-Type" content="text/html; charset=utf-8">' . $html);
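The meta tag helps because the HTML parser behind loadHTML() commonly falls back to ISO-8859-1 when the markup declares no charset, so UTF-8 bytes get decoded as two separate characters. A minimal sketch of the fix follows, assuming the script file itself is saved as UTF-8; the shortened one-cell table is for illustration only.

```php
<?php
// Assumes this file is saved as UTF-8.
$html = '<table><tr><td data-label="Date">übermittelt</td></tr></table>';

libxml_use_internal_errors(true); // keep parser warnings out of the output

$dom = new DOMDocument();
// Declare the charset before the fragment so the parser decodes it as UTF-8.
$dom->loadHTML('<meta http-equiv="Content-Type" content="text/html; charset=utf-8">' . $html);

$value = $dom->getElementsByTagName('td')->item(0)->nodeValue;
echo $value, "\n";
```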

Related

PHP: Simple HTML DOM - parsing problem with missing </TR> tag

I am parsing a local file that is extracted from a system as a .htm file, so I am using Simple HTML DOM.
The file has only a single table and I basically want to capture each row in the table and save it as regular .csv file.
It would all work wonderfully except for the fact that the html file has a missing </TR> tag at the end of the first row (in every case). This means that my code captures the first $tr as the whole table instead of just the col name headers.
There are some constraints on fixing this:
The extracted .htm file cannot be manually edited in any way.
The first row cannot be counted in any way as columns may change (in order and number).
The first cell of the second row will be a 0 a lot of the time, but not always.
Here is the html (as a subset; original extract is 30,000+ rows)
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML><HEAD><META content="IE=5.0000" http-equiv="X-UA-Compatible">
<META http-equiv="Content-Type" content="text/html; charset=windows-1252">
<META name="GENERATOR" content="MSHTML 11.00.10570.1001"></HEAD>
<BODY>
<H1>Monthly Report</H1><BR><BR><BR>
<P> Reporting Level : Ledger<BR> Reporting Context :
2466<BR> Company Name : topcage<BR> Set of Books Currency :
2466<BR> Register Type : All<BR> Summary Level :
Transaction Distribution Level<BR> Product : All<BR>
<P>
<TABLE border="1">
<TBODY>
<TR>
<TD><B>Tax Amt</B></TD>
<TD><B>Tax Amt Funcl Curr</B></TD>
<TD><B>Taxable Amt</B></TD>
<TD><B>Taxable Amt Funcl Curr</B></TD>
<TD><B>Total Entered Amount</B></TD>
<TD><B>Trx Line Class</B></TD>
<TR>
<TD>0</TD>
<TD>0</TD>
<TD>179</TD>
<TD>179</TD>
<TD>179</TD>
<TD>INVOICE</TD></TR>
<TR>
<TD>0</TD>
<TD>0</TD>
<TD>177</TD>
<TD>177</TD>
<TD>177</TD>
<TD>INVOICE</TD></TR>
<TR>
<TD>0</TD>
<TD>0</TD>
<TD>262.5</TD>
<TD>262.5</TD>
<TD>262.5</TD>
<TD>INVOICE</TD></TR>
<TR>
<TD align="LEFT" colspan="6"><B>Report Count</B></TD></TR>
<TR>
<TD></TD>
<TD></TD>
<TD></TD>
<TD></TD>
<TD></TD>
<TD>3</TD></TR></TBODY></TABLE><BR>*** End of Report *** </P></BODY></HTML>
Here is my code:
$html = file_get_html('file.htm');
$myfile = fopen("newfile.txt", "w");
foreach($html->find('tr') as $tr)
{
$row = array();
foreach($tr->find('td') as $td)
{
$row[] = $td->innertext;
}
fwrite($myfile, implode(",", $row) . "\n"); // separator first: the reversed argument order was removed in PHP 8
}
fclose($myfile);
Here is the content of the file that is generated:
<b>Tax Amt</b>,<b>Tax Amt Funcl Curr</b>,<b>Taxable Amt</b>,<b>Taxable Amt Funcl Curr</b>,<b>Total Entered Amount</b>,<b>Trx Line Class</b>,0,0,179,179,179,INVOICE,0,0,177,177,177,INVOICE,0,0,262.5,262.5,262.5,INVOICE,<b>Report Count</b>,,,,,,3
0,0,179,179,179,INVOICE
0,0,177,177,177,INVOICE
0,0,262.5,262.5,262.5,INVOICE
<b>Report Count</b>
,,,,,3
Use this code:
$html = file_get_contents('file.htm');
$pattern = '/<\/TD>(\s*)<TR>/i';
$replacement = '</TD></TR><TR>';
$html = preg_replace($pattern, $replacement, $html);
$html = str_get_html($html);
instead of:
$html = file_get_html('file.htm');
This way you read the file contents as a string and insert the missing </TR> tags before Simple HTML DOM parses it.
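The effect of the replacement can be checked on a small string without Simple HTML DOM. The sketch below uses a made-up two-row sample with the first </TR> missing; the variable names $fixed and $closedRows are illustrative only.

```php
<?php
// Two-row sample with the first </TR> missing, as in the exported file.
$html = "<TABLE><TR>\n<TD>Tax Amt</TD>\n<TD>Trx Line Class</TD>\n<TR>\n<TD>0</TD>\n<TD>INVOICE</TD></TR></TABLE>";

// A </TD> followed directly by <TR> means the previous row was never closed.
$fixed = preg_replace('/<\/TD>\s*<TR>/i', '</TD></TR><TR>', $html);

// Count the closing row tags after the repair.
$closedRows = substr_count($fixed, '</TR>');
echo $fixed, "\n";
```

Rows that are already closed are left alone, because in them </TD> is followed by </TR> rather than by <TR>.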

DomDocument php extract info and images

Hello, I am having a problem with DOMDocument. I need to write a script that extracts all the information from the tables with a certain id.
So I did:
$link = "WEBSITE URL";
$html = file_get_contents($link);
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
$context_nodes = $xpath->query('//table[@id="news"]/tr[position()>0]/td');
So I get all the <td>s and information, but the problem is that the <img> tags haven't been extracted by the script. How can I extract all the information of the tables either text or image html tags?
The html code from which I want to extract the info is:
<table id="news" width="100%" border="0" cellspacing="0" cellpadding="0">
<tr>
<td width="539" height="35"><span><strong>Info to Extract</strong></span></td>
</tr>
<tr>
<td height="35" class="texto10">Martes, 02 de Octubre de 2012 | Autor: Trovert" rel="author"></a></td>
</tr>
<tr>
<td height="35" class="texto12Gris"><p><strong>Info To extract</strong></p>
<p><strong> </strong></p>
<p><strong>Casa de Gobierno: (a 9 cuadras del hostel)</strong></p>
<img title="title" src="../images/theimage.jpg" width="400" height="266" />
</td>
</tr>
</table>
This is how I am iterating the extracted elements:
foreach ($context_nodes as $node) {
echo $node->nodeValue . '<br/>';
}
Thanks
If you need more than text, nodeValue/textContent is not enough. You have to walk through the DOM branch of each target node:
function walkNode($node)
{
$str="";
if($node->nodeType==XML_TEXT_NODE)
{
$str.=$node->nodeValue;
}
elseif(strtolower($node->nodeName)=="img")
{
/* This is just a demonstration;
* You'll have to extract the info in the way you want
* */
$str.='<img src="'.$node->attributes->getNamedItem("src")->nodeValue.'" />';
}
if($node->firstChild) $str.=walkNode($node->firstChild);
if($node->nextSibling) $str.=walkNode($node->nextSibling);
return $str;
}
This is a simple, straightforward recursive function. So now you can do this:
$dom=new DOMDocument();
$dom->loadHTML($html);
$xpath=new DOMXPath($dom);
$tds=$xpath->query('//table[@id="news"]//tr[position()>0]/td');
foreach($tds as $td)
{
echo walkNode($td->firstChild);
echo "\n";
}
Online demo
(Please note that I "fixed" your HTML a little, as it doesn't seem valid, and also re-indented it.)
This outputs something like this:
Info to Extract
Martes, 02 de Octubre de 2012 | Autor: Trovert
Info To extract
Casa de Gobierno: (a 9 cuadras del hostel)
<img src="../images/theimage.jpg" />
Try this:
foreach ($context_nodes as $node) {
echo $doc->saveHTML($node) . '<br/>';
}
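Passing a node to saveHTML() serializes that node together with its child elements, so the <img> tags survive. A self-contained sketch follows, using a shortened version of the table from the question; the $cells array is introduced here only to collect the output.

```php
<?php
// Shortened version of the table from the question.
$html = '<table id="news"><tr><td><p><strong>Info</strong></p>'
      . '<img title="title" src="../images/theimage.jpg" width="400" height="266" /></td></tr></table>';

libxml_use_internal_errors(true); // keep parser warnings out of the output
$doc = new DOMDocument();
$doc->loadHTML($html);

$xpath = new DOMXPath($doc);
$cells = [];
foreach ($xpath->query('//table[@id="news"]//td') as $td) {
    // saveHTML($node) returns the markup of the node, including nested tags.
    $cells[] = $doc->saveHTML($td);
}
echo implode("\n", $cells), "\n";
```

Unlike the recursive walk, this returns the cell's markup as-is, so any filtering of tags or attributes has to happen afterwards.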

How to extract hyperlink using php

I have searched online and thought this would work, but it doesn't for some reason. I'm trying to extract the URL of a hyperlink from an HTML document. I only want the URL within the td align="center". Here is a sample of the HTML doc I'm trying to extract from:
<td>
Aug 17
</td>
<td>
FT
</td>
<td align="right">
Arsenal ruby
</td>
**<td align="center">**
1-3
</td>
<td>Aston Villa</td>
<td style="text-align:right;">60,003</td>
And here is my PHP code to extract it from the td align="center":
<?php
//$searchURL = "site";
include 'simple_html_dom.php';
$site = 'website';
$html = file_get_html($site);
$tabledata = array();
// Find all TD tags with "align=center"
foreach($html->find('td[align=center]') as $e)
echo $e->href . '<br>';
?>
I know the code works because it extracts everything if the selector is just the td within the brackets.
So you have identified the <td> elements themselves, but you did not go down to the next nesting level to grab the href from the <a> elements. You might do that like this:
foreach($html->find('td[align=center]') as $e)
echo $e->children(0)->href . '<br>';
Use the DOM and XPath:
Select all td elements in the document
//td
Only if the align attribute equals "center"
//td[@align="center"]
Get the a descendant elements
//td[@align="center"]//a
Get the href attribute nodes of those a elements
//td[@align="center"]//a/@href
Source example:
$html = <<<'HTML'
<td>
FT
</td>
<td align="right">
Arsenal ruby
</td>
**<td align="center">**
1-3
</td>
<td>Aston Villa</td>
<td style="text-align:right;">60,003</td>
HTML;
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXpath($dom);
$nodes = $xpath->evaluate('//td[@align="center"]//a/@href');
foreach ($nodes as $node) {
var_dump($node->value);
}
You selected the td element. The anchor element is the child of the td element.
// Find all TD tags with "align=center"
foreach($html->find('td[align=center]') as $e)
echo $e->firstChild()->getAttribute('href') . '<br>';
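The sample HTML in this thread no longer contains the anchor element, so the XPath approach is easier to follow with one in place. The sketch below assumes a made-up <a href="/match/1"> inside the centered cell; that element and its href value are hypothetical.

```php
<?php
// The <a> element and its href value are made up for illustration.
$html = '<table><tr><td align="right">Arsenal</td>'
      . '<td align="center"><a href="/match/1">1-3</a></td>'
      . '<td>Aston Villa</td></tr></table>';

libxml_use_internal_errors(true); // keep parser warnings out of the output
$dom = new DOMDocument();
$dom->loadHTML($html);

$xpath = new DOMXPath($dom);
$hrefs = [];
foreach ($xpath->evaluate('//td[@align="center"]//a/@href') as $attr) {
    $hrefs[] = $attr->value; // DOMAttr exposes the attribute text as ->value
}
print_r($hrefs);
```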

How can I parse a website to get the links out of a table?

I am trying to figure out how to parse a website to get the links out of a table. In my particular case there are two tables, but I only want the links from the second table (Link5 & Link6). Here is the HTML I am trying to parse.
<html>
<head>
</head>
<body>
Link1<br>
<br>
<table>
<tbody>
<tr>
<td>Link2</td>
<td>dog</td>
<td>fish</td>
</tr>
<tr>
<td>Link3</td>
<td>cat</td>
<td>bird</td>
</tr>
</tbody>
</table>
<br>
Link4<br>
<br>
<table>
<tbody>
<tr>
<td>Link5</td>
<td>cow</td>
</tr>
<tr>
<td>Link6</td>
<td>horse</td>
</tr>
</tbody>
</table>
<br>
Link7<br>
</body>
</html>
I have read that DOM is a good way to parse data from the web, so here is the code I have been working on.
<?php
$link = array();
//new dom object
$dom = new DOMDocument();
//load the html
$html = $dom->loadHTMLFile('http://www.example.com');
//discard white space
$dom->preserveWhiteSpace = false;
//get the table by its tag name
$tables = $dom->getElementsByTagName('table');
$rows = $tables->item(1)->getElementsByTagName('tr');
$i = 0;
//loop over the table rows
foreach ($rows as $row)
{
$links = $row->getElementsByTagName('a');
//put node value into an array
$link[] = $links->item(0)->nodeValue;
// echo the values
echo $link[$i] . '<br />';
$i++;
}
?>
This code gives the following output:
Link5
Link6
But what I would like to achieve is:
http://www.example.com/link5.html
http://www.example.com/link6.html
Any help would be greatly appreciated.
I guess the problem is that you want the href, not the node's value, so you should use getAttribute:
$link[] = $links->item(0)->getAttribute("href");
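Put together, the loop might look like the sketch below. It parses an inline string instead of a remote page; the link2.html URL is a placeholder, and the null check for rows without an anchor is an addition that was not in the original code.

```php
<?php
// Inline stand-in for the remote page; the link2.html URL is a placeholder.
$html = '<html><body>'
      . '<table><tr><td><a href="http://www.example.com/link2.html">Link2</a></td><td>dog</td></tr></table>'
      . '<table><tr><td><a href="http://www.example.com/link5.html">Link5</a></td><td>cow</td></tr>'
      . '<tr><td><a href="http://www.example.com/link6.html">Link6</a></td><td>horse</td></tr></table>'
      . '</body></html>';

libxml_use_internal_errors(true); // keep parser warnings out of the output
$dom = new DOMDocument();
$dom->loadHTML($html);

$link = [];
// item(1) is the second table in document order.
$rows = $dom->getElementsByTagName('table')->item(1)->getElementsByTagName('tr');
foreach ($rows as $row) {
    $anchor = $row->getElementsByTagName('a')->item(0);
    if ($anchor !== null) { // skip rows that contain no link
        $link[] = $anchor->getAttribute('href');
    }
}
echo implode("\n", $link), "\n";
```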

How to get data between <td> elements with Regex and Php

How can I get the "85 mph" from this HTML code with PHP and a regex?
I couldn't come up with the right regex.
This is the code:
http://pastebin.com/ffRH9K9Q
<td align="left">Los Angeles</td>
</tr>
<tr>
<td align="left">Wind Speed:</td>
<td align="left">85 mph</td>
</tr>
<tr>
<td align="left">Snow Load:</td>
<td align="left">0 psf</td>
(simplified example)
You've heard already about not using regex for the job, so I won't talk about that.
Let's try something here. Perhaps not the ideal solution, but could work for you.
<?php
$data = 'your table';
// [^<]* keeps the match inside one cell; (.*) with the s flag can start in an earlier cell
preg_match('|<td align="left">([^<]*)mph</td>|i', $data, $result);
print_r($result); // Your result should be in $result[1]
You may need to trim the result or account for whitespace in the regex.
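The trimming and whitespace handling mentioned above could look like the following sketch. It anchors on the "Wind Speed:" label cell rather than on the unit; the extra spaces in the sample data and the $speed variable are assumptions for illustration.

```php
<?php
// Sample rows; the padding around "85 mph" is added to show the trimming.
$data = '<tr>
<td align="left">Wind Speed:</td>
<td align="left"> 85 mph </td>
</tr>';

// Match the label cell, allow whitespace between cells, capture the next cell's text.
$pattern = '~<td[^>]*>\s*Wind Speed:\s*</td>\s*<td[^>]*>([^<]*)</td>~i';

$speed = null;
if (preg_match($pattern, $data, $result)) {
    $speed = trim($result[1]); // strip the padding around the value
}
echo $speed, "\n";
```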
The first comment that links to the post about NOT PARSING HTML WITH REGEX is important. That said, try something like DOMDocument::loadHTML instead. That should get you started traversing the DOM with PHP.
To expand on DorkRawk's suggestion (in the hope of providing a relatively succinct answer that isn't overwhelming for a beginner), try this:
<?php
$yourhtml = '<td align="left">Los Angeles</td>
</tr>
<tr>
<td align="left">Wind Speed:</td>
<td align="left">85 mph</td>
</tr>
<tr>
<td align="left">Snow Load:</td>
<td align="left">0 psf</td>';
$dom = new DOMDocument();
$dom->loadHTML($yourhtml);
$xpath = new DOMXPath($dom);
$matches = $xpath->query('//td[.="Wind Speed:"]/following-sibling::td');
foreach($matches as $match) {
echo $match->nodeValue."\n\n";
}
