How to parse an XML/HTML server response?

This is my first time here.
I got these lines as a response from the server and saved them in a file. They look like XML, right? My task is to read the content of those td tags and put it into another structured file (Excel). The problem is that I don't know how to do that.
At the moment, I think I will strip the first and last lines of the file and then parse the rest as XML. Do you know of other ways? Thanks.
<CallbackContent><![CDATA[
<table cellspacing="0" border="0" cellpadding="0" width="100%">
<tr class="rowcolor2">
<td align="left" style="padding:5px;">22/02/2010</td>
<td align="right" style="padding:5px;">510,02</td>
</tr>
</table>
]]></CallbackContent>
Btw, I'm using PHP.

Use an XML parser such as SimpleXML. It will allow you to extract the CDATA safely.
Then if the HTML is XML-compliant (in other words, it's XHTML) you can use SimpleXML to extract data from it. For example:
$xml='<CallbackContent><![CDATA[
<table cellspacing="0" border="0" cellpadding="0" width="100%">
<tr class="rowcolor2">
<td align="left" style="padding:5px;">22/02/2010</td>
<td align="right" style="padding:5px;">510,02</td>
</tr>
</table>
]]></CallbackContent>';
$CallbackContent = simplexml_load_string($xml);
$html = (string) $CallbackContent;
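// If the HTML is well-formed XHTML, this is enough:
$table = simplexml_load_string($html);
// Otherwise, parse it as HTML and import the result (this overwrites $table):
$dom = new DOMDocument;
$dom->loadHTML($html);
$table = simplexml_import_dom($dom)->body->table;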
foreach ($table->tr as $tr)
{
echo 'tr class=', $tr['class'], "\n";
foreach ($tr->td as $td)
{
echo 'td align=', $td['align'], ' - value: ', (string) $td, "\n";
}
}

You cannot read the table directly with an XML parser, because it is delivered inside a CDATA block, which is equivalent to a string literal.

First, read the whole thing using an XML parser so that you can pull out the contents of the CDATA section. Then take that string and run it through an HTML parser.
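
The two steps described above can be sketched as follows. This is only a minimal sketch, assuming the response has exactly the shape shown in the question. The function name callbackRows is made up for illustration, and DOMDocument is used for both passes.

```php
<?php
// Hypothetical helper (not from the original posts): returns the text of
// every <td>, grouped by <tr>.
function callbackRows(string $xml): array
{
    // Step 1: the XML parser unwraps the CDATA section into a plain string.
    $outer = new DOMDocument();
    $outer->loadXML($xml);
    $html = $outer->documentElement->textContent;

    // Step 2: the HTML parser handles the table markup inside that string.
    $previous = libxml_use_internal_errors(true); // keep markup warnings quiet
    $inner = new DOMDocument();
    $inner->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $rows = [];
    foreach ($inner->getElementsByTagName('tr') as $tr) {
        $cells = [];
        foreach ($tr->getElementsByTagName('td') as $td) {
            $cells[] = trim($td->textContent);
        }
        $rows[] = $cells;
    }
    return $rows;
}

$xml = '<CallbackContent><![CDATA[
<table cellspacing="0" border="0" cellpadding="0" width="100%">
<tr class="rowcolor2">
<td align="left" style="padding:5px;">22/02/2010</td>
<td align="right" style="padding:5px;">510,02</td>
</tr>
</table>
]]></CallbackContent>';

$rows = callbackRows($xml);
foreach ($rows as $row) {
    echo implode("\t", $row), "\n";
}
```

Each entry of $rows is a plain array of cell values, so it could be handed to fputcsv to produce a file that Excel opens.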

Related

PHP DOMDocument - Using glob to get a specific element and class in every file of a directory

I'm using the glob function to get every .htm file, and then I'm trying to get the text from a specific table row where class = 'DataGrid_Item', using PHP's DOM extension with the following HTML (same structure in every file) and the following code:
1. HTML
<div>
<table rules="all" id="GridViewAfiliacion" style="border-collapse:collapse;" border="1" cellspacing="0">
<tbody>
<tr class="DataGrid_Header" style="background-color:#98B676;">
<th scope="col">ESTADO</th>
<th scope="col">ENTIDAD</th>
<th scope="col">REGIMEN</th>
<th scope="col">FECHA DE AFILIACION ENTIDAD</th>
<th scope="col">TIPO DE AFILIADO</th>
</tr>
<tr class="DataGrid_Item" align="center">
<td>ACTIVO</td>
<td>NUEVA EPS S.A.</td>
<td>CONTRIBUTIVO</td>
<td>01/06/2016</td>
<td>COTIZANTE</td>
</tr>
</tbody>
</table>
2. PHP
// Directory of Files
$directory = "../fosyga/archivoshtml/";
$array_filename = glob($directory . "*.htm");
foreach($array_filename as $filename)
{
$dom = new DOMDocument('1.0', 'utf-8');
$dom->loadHTML($filename);
$content_node = $dom->getElementById("GridViewAfiliacion");
// Get the HTML as a string
$string = $content_node->C14N();
}
Is it possible to extract the class="DataGrid_Item" info into a string?
P.S.: I think the glob function does not work properly in this case, or I'm not using it correctly.
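
One possible way to approach this is sketched below; it is not taken from an answer in the thread. It assumes the files look like the HTML shown above. Note that DOMDocument::loadHTML expects markup rather than a file name, so the sketch reads each file first. The function name dataGridItems and the " | " separator are made up for illustration.

```php
<?php
// Hypothetical helper: returns one string per DataGrid_Item row.
function dataGridItems(string $html): array
{
    $previous = libxml_use_internal_errors(true); // tolerate sloppy markup
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $query = '//table[@id="GridViewAfiliacion"]//tr[@class="DataGrid_Item"]';

    $rows = [];
    foreach ($xpath->query($query) as $tr) {
        $cells = [];
        foreach ($xpath->query('./td', $tr) as $td) {
            $cells[] = trim($td->textContent);
        }
        $rows[] = implode(' | ', $cells);
    }
    return $rows;
}

// Directory of files, as in the question.
$directory = '../fosyga/archivoshtml/';
$filenames = glob($directory . '*.htm') ?: [];
foreach ($filenames as $filename) {
    $html = file_get_contents($filename);
    if ($html === false || $html === '') {
        continue; // unreadable or empty file
    }
    print_r(dataGridItems($html));
}
```

Under these assumptions glob itself is fine; the loop just needs to hand the file contents (or use loadHTMLFile) instead of the path.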

Getting PHP str_replace to work with Joomla

As you may know, Joomla components enable you to override their output by copying their template files into your site template. Joomla components generally use helper files which cannot be overridden.
I have a helper.php file that includes the string:
$specific_fields_text = '<tr><td class="key">'.$specific_field_title.': </td><td class="kr_sidecol_subaddress">'.$specific_fields[$i]->text.' '.$specific_fields[$i]->description.'</td></tr>';
In my template override is the code:
<table border="0" cellpadding="2" cellspacing="0">
<?php echo koparentHTML::getHTMLSpecificFields($this->specific_fields); ?>
</table>
The output is as follows:
<table border="0" cellpadding="2" cellspacing="0">
<tr>
<td class="key">title</td>
<td class="kr_sidecol_subaddress">value</td>
</tr>
<tr>
<td class="key">title</td>
<td class="kr_sidecol_subaddress">value</td>
</tr>
//.....etc......//
</table>
Basically I want to get rid of the table and turn it into a definition list, but I cannot modify the helper.php file. I think the answer has to do with str_replace.
I have tried using:
<dl>
<?php
$spec_fields = koparentHTML::getHTMLSpecificFields($this->specific_fields);
$spec_fields_dl = str_replace("<tr><td class='key'>'.$specific_field_title.': </td><td class='kr_sidecol_subaddress'>'.$specific_fields[$i]->text.' '.$specific_fields[$i]->description.'</td></tr>'", "<dt class='key'>'.$specific_field_title.': </dt><dd class='kr_sidecol_subaddress'>'.$specific_fields[$i]->text.' '.$specific_fields[$i]->description.'</dd>'", $spec_fields);
echo $spec_fields_dl;
?>
</dl>
This returns all of the text but with no html tags (no tr, td, dt, etc).

You can easily parse table data with PHP, like in this example:
$doc = new DOMDocument();
$doc->loadHTML(koparentHTML::getHTMLSpecificFields($this->specific_fields));
$rows = $doc->getElementsByTagName('tr');
$data = array();
for ($i = 0; $i < $rows->length; $i++) {
    $cols = $rows->item($i)->getElementsByTagName('td');
    // Use the first cell as the key and the second cell as the value
    $data[$cols->item(0)->nodeValue] = $cols->item(1)->nodeValue;
}
var_dump($data);
This should convert your table into an associative array ('title' => 'value').
I hope it helps.

I have figured this out. The PHP bits such as '.$specific_field_title.' were stopping the str_replace from working, presumably because the rendered output contains the actual values rather than those expressions. To get around this I searched only for the HTML elements and passed them as arrays, like so:
echo str_replace(array('<tr><td class="key">', '</td><td class="kr_sidecol_subaddress">', '</td></tr>'),
array('<dt class="key">', '</dt><dd class="kr_sidecol_subaddress">', '</dd>'),
koparentHTML::getHTMLSpecificFields($this->specific_fields));
And now this works perfectly. Thank you to everyone who contributed.
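
To illustrate the difference, here is a small self-contained sketch. The sample row and its values are invented, since the real output of the component is not shown in full. It uses the optional fourth argument of str_replace, which receives the number of replacements made.

```php
<?php
// Invented sample of what the helper might render for one field.
$rendered = '<tr><td class="key">Colour: </td>'
    . '<td class="kr_sidecol_subaddress">Red dark</td></tr>';

// Searching for the PHP source text finds nothing, because the rendered
// HTML holds the field values, not the expressions that produced them.
$sourceText = '<tr><td class="key">\'.$specific_field_title.\': </td>';
str_replace($sourceText, '', $rendered, $misses);

// Searching only for the fixed markup works for every row.
$definitionList = str_replace(
    ['<tr><td class="key">', '</td><td class="kr_sidecol_subaddress">', '</td></tr>'],
    ['<dt class="key">', '</dt><dd class="kr_sidecol_subaddress">', '</dd>'],
    $rendered,
    $hits
);

echo $misses, "\n";          // replacements made by the first call
echo $hits, "\n";            // replacements made by the second call
echo $definitionList, "\n";
```

This only works as long as the helper keeps emitting exactly that markup; any change in attribute order or whitespace would make the search strings miss.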

get values from an external page?

This code is at an external URL: www.example.com.
</head><body><div id="cotizaciones"><h1>Cotizaciones</h1><table cellpadding="3" cellspacing="0" class="tablamonedas">
<tr style="height:19px"><td class="1"><img src="../mvd/usa.png" width="24" height="24" /></td>
<td class="2">19.50</td>
<td class="3">20.20</td>
<td class="4"><img src="../mvd/Bra.png" width="24" height="24" /></td>
<td class="5">9.00</td>
<td class="6">10.50</td>
</tr><tr style="height:16px" valign="bottom"><td class="15"><img src="../mvd/Arg.png" width="24" height="24" /></td>
<td class="2">2.70</td>
<td class="3">3.70</td>
<td class="4"><img src="../mvd/Eur.png" width="24" height="24" /></td>
<td class="5">24.40</td>
<td class="6">26.10</td>
</tr></table>
I want to get the values of the td elements. Any suggestions? PHP, jQuery, etc.

You won't be able to do this with JavaScript in the browser: the same-origin policy only lets a script read data from your own site, unless the remote server explicitly allows it.
You will have to pull the content with php (using something as simple as file_get_contents) and then parse it.
For the parsing, take a read through this comprehensive post:
How do you parse and process HTML/XML in PHP?
DOM is likely going to be your best bet.
Try playing around with this:
$html = file_get_contents('/path/to/remote/page/');
$dom = new DOMDocument;
$dom->loadHTML($html);
foreach ($dom->getElementsByTagName('td') as $node) {
echo "Full TD html: " . $dom->saveHtml($node) . "\n";
echo "TD contents: " . $node->nodeValue . "\n\n";
}

It's not possible to do this with jQuery; however, you can easily do it with PHP.
Use file_get_contents to read the entire source code of the page into a string.
Then parse (tokenise) that string in order to grab all the td values.
<?php
$srccode = file_get_contents('http://www.example.com/');
/* $srccode is a string that contains the source code of the web page http://www.example.com/ */
/* Now the only thing you have to do is write a function, say "parser", that tokenises or parses the string in order to grab all the td values */
$output=parser($srccode);
echo $output;
?>
You have to be very careful while parsing the string to get the desired output. For parsing you can either use regular expressions or create your own lookup table. You can also use an HTML DOM parser written in PHP 5 that lets you manipulate HTML in a very easy way. A lot of such free parsers are available.
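
The "parser" function mentioned above is left undefined in the answer. One way it could look is sketched here with the built-in DOM extension instead of a hand-written tokeniser. This is an assumption, not the answerer's code: it relies on the table having class "tablamonedas" as in the question, and it returns an array of rows rather than a string.

```php
<?php
// Hypothetical implementation of the "parser" function named above.
function parser(string $srccode): array
{
    $previous = libxml_use_internal_errors(true); // real pages are rarely valid
    $dom = new DOMDocument();
    $dom->loadHTML($srccode);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $rows = [];
    foreach ($xpath->query('//table[@class="tablamonedas"]//tr') as $tr) {
        $values = [];
        // td[not(img)] skips the cells that only hold a flag image.
        foreach ($xpath->query('./td[not(img)]', $tr) as $td) {
            $values[] = trim($td->textContent);
        }
        $rows[] = $values;
    }
    return $rows;
}

// Shortened sample of the markup from the question.
$sample = '<html><body><div id="cotizaciones"><table class="tablamonedas">'
    . '<tr><td class="1"><img src="../mvd/usa.png" /></td>'
    . '<td class="2">19.50</td><td class="3">20.20</td></tr>'
    . '<tr><td class="15"><img src="../mvd/Arg.png" /></td>'
    . '<td class="2">2.70</td><td class="3">3.70</td></tr>'
    . '</table></div></body></html>';

$output = parser($sample);
print_r($output);
```

For the live page, the sample string would be replaced by the result of file_get_contents('http://www.example.com/').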

How to remove table, tr, td tag in html with php

I have this HTML code:
<table id="table1" border="0" cellspacing="0" cellpadding="3" width="1" align="center">
<tr>
<td>
<img src="http://vnexpress.net/Files/Subject/3b/bd/ac/f9/cuongbibat.jpg" width="330" height="441" border="1" alt="Cường">
</td>
</tr>
<tr>
<td class="Image">Everything
</td>
</tr>
</table>
<table id="table2" border="0" cellspacing="0" cellpadding="3" width="1" align="center">
<tr>
<td>
Someone
</td>
</tr>
<tr>
<td class="Image">Everything
</td>
</tr>
</table>
I have 2 tables. I want to remove all table, tr and td tags if the table contains an img tag (table 1).
I need to get a result like:
<img src="http://vnexpress.net/Files/Subject/3b/bd/ac/f9/cuongbibat.jpg" width="330" height="441" border="1" alt="Cường">
Everything
<table id="table2" border="0" cellspacing="0" cellpadding="3" width="1" align="center">
<tr>
<td>
Someone
</td>
</tr>
<tr>
<td class="text">Everything
</td>
</tr>
</table>
Please help me. Thank you.

HTML Purifier can be used to strip either all tags or a certain set of tags from a document. It's the go-to solution for basically any HTML tag stripping in PHP - don't ever use regexes for this or the sun will burn out and we will all freeze to death in the suffocating darkness.
Try something like:
$config = HTMLPurifier_Config::createDefault();
$config->set('HTML.Allowed', 'img[src|alt|width|height]');
$purifier = new HTMLPurifier($config);
$output = $purifier->purify($YOUR_HTML);
You'll need to list every tag (and attribute) you don't want scrubbed away in that HTML.Allowed value, separated by commas, but it's a price worth paying for the continued lifegiving warmth of the day-star. And also not leaving your site open to XSS attacks and content-eating glitches, I guess.

Check out:
http://simplehtmldom.sourceforge.net/
Lets you find tags on an HTML page with selectors just like jQuery and extract content from HTML in a single line.

In theory, it's possible to do this with a single highly complex regexp. But it's usually easier to do the search-and-replace in separate steps: search for the outer container first, then work on what it contains.
<?php
header("Content-type: text/plain");
$html = '<table id="table1" border="0" cellspacing="0" cellpadding="3" width="1" align="center">
<tr>
<td>
<img src="http://vnexpress.net/Files/Subject/3b/bd/ac/f9/cuongbibat.jpg" width="330" height="441" border="1" alt="Cường">
</td>
</tr>
<tr>
<td class="Image">Everything
</td>
</tr>
</table>
<table id="table2" border="0" cellspacing="0" cellpadding="3" width="1" align="center">
<tr>
<td>
Someone
</td>
</tr>
<tr>
<td class="Image">Everything
</td>
</tr>
</table> ';
$html = preg_replace_callback('/<table\b[^>]*>.*?<\/table>/si', 'removeTableIfImg', $html);
function removeTableIfImg($matches) {
$table = $matches[0];
return preg_match('/<img\b[^>]*>/i', $table, $img)
? preg_replace('/<\/?(?:table|td|tr)\b[^>]*>\s*/i', '', $table)
: $table;
}
echo $html;
?>
The first pattern finds the tables. The second pattern (in the callback) checks if there's an image tag. The third removes the table, td, and tr tags.

I needed something like this.
Here is my solution:
(<\/?tr.*?>)|(<\/?td.*?>)|(<\/?table.*?>)
This regex selects all tr, td and table tags, non-greedily.
You can see it in action here:
http://regexr.com/3fslh
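
Applied from PHP, the pattern above could be used with preg_replace roughly like this. It is a minimal sketch with an invented one-row table; the ~ delimiters and the i flag are additions, not part of the original pattern.

```php
<?php
// The pattern from the post, wrapped in ~ delimiters, case-insensitive.
$pattern = '~(<\/?tr.*?>)|(<\/?td.*?>)|(<\/?table.*?>)~i';

// Invented sample input.
$html = '<table id="table1"><tr><td class="Image">Everything</td></tr></table>';

// Replace every matched tag with an empty string, keeping the cell text.
$stripped = preg_replace($pattern, '', $html);

echo $stripped, "\n";
```

Because the pattern has no word boundary after the tag name, it would also match other tags whose names start with "tr", such as track. It also strips the tags from every table, not only from tables that contain an image.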

As sudowned said, do not use regex for this; it will drive you crazy. Searching for a library usually takes about as much time as writing your own small parser for this. I have done this several times in different languages. You learn a lot, and you can often reuse the code :-)
Since you are not interested in attributes, this should be quite easy: loop over the input character by character. Check out this Java code; it's one of my earlier, smaller attempts at sanitizing HTML:
public static String sanatize(String body, String[] whiteList, String tagSeperator, String seperate) {
StringBuilder out = new StringBuilder();
StringBuilder tag = new StringBuilder();
boolean quoteOpen = false;
boolean tagOpen = false;
for(int i=0;i<body.length();i++) {
char c = body.charAt(i);
if(i<body.length()-1 && c == '<' && !quoteOpen && body.charAt(i+1) != '!') {
tagOpen = true;
tag.append(c);
} else if(c == '>' && !quoteOpen && tagOpen) {
tag.append(c);
for (String tagName : whiteList) {
String stag = tag.toString().toLowerCase();
if (stag.startsWith("</"+tagName+" ") || stag.startsWith("</"+tagName+">") || stag.startsWith("<"+tagName+" ") || stag.startsWith("<"+tagName+">")) {
out.append(tag);
} else if (stag.startsWith("</") && tagSeperator != null) {
if (seperate.length()>2) {
if (seperate.contains("," + stag.replaceAll("[</]+(\\w+)[\\s>].*", "$1") + ",")) {
out.append(tagSeperator);
}
} else {
if (!out.toString().endsWith(tagSeperator)) {
out.append(tagSeperator);
}
}
}
}
tag = new StringBuilder();
tagOpen = false;
} else if (c == '"' && !quoteOpen) {
quoteOpen = true;
if (tagOpen)
tag.append(c);
else
out.append(c);
} else if (i>1 && c == '"' && quoteOpen && body.charAt(i-1) != '\\' ) {
quoteOpen = false;
if (tagOpen)
tag.append(c);
else
out.append(c);
} else {
if (tagOpen)
tag.append(c);
else
out.append(c);
}
}
return out.toString();
}
You can ignore tagSeperator and seperate; I used them to sanitise tags and convert to CSV.

How to get data between <td> elements with Regex and Php

How can I get the "85 mph" from this HTML code with PHP + regex?
I couldn't come up with the right regex.
This is the code:
http://pastebin.com/ffRH9K9Q
<td align="left">Los Angeles</td>
</tr>
<tr>
<td align="left">Wind Speed:</td>
<td align="left">85 mph</td>
</tr>
<tr>
<td align="left">Snow Load:</td>
<td align="left">0 psf</td>
(simplified example)

You've heard already about not using regex for the job, so I won't talk about that.
Let's try something here. Perhaps not the ideal solution, but could work for you.
<?php
$data = 'your table';
preg_match ('|<td align="left">(.*)mph</td>|Usi', $data, $result);
print_r($result); // Your result should be in here
You may need some trimming, or to take whitespace into account in the regex.
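
As a variation on the idea above, the pattern can be anchored on the "Wind Speed:" label so that the capture group holds only the neighbouring cell. This is a sketch that assumes the markup looks exactly like the simplified example in the question; the ~ delimiters are a choice made here.

```php
<?php
$data = '<td align="left">Los Angeles</td>
</tr>
<tr>
<td align="left">Wind Speed:</td>
<td align="left">85 mph</td>
</tr>
<tr>
<td align="left">Snow Load:</td>
<td align="left">0 psf</td>';

// Match the label cell, any whitespace, then capture the next cell's text.
// [^<]* stops at the first tag, so the capture cannot run across cells.
$pattern = '~<td align="left">Wind Speed:</td>\s*<td align="left">([^<]*)</td>~i';

$windSpeed = null;
if (preg_match($pattern, $data, $result)) {
    $windSpeed = trim($result[1]);
}

echo $windSpeed, "\n";
```

It stays fragile in the usual way: a changed attribute or extra markup inside the cells would stop it from matching.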

The first comment that links to the post about NOT PARSING HTML WITH REGEX is important. That said, try something like DOMDocument::loadHTML instead. That should get you started traversing the DOM with PHP.

To expand on DorkRawk's suggestion (in the hope of providing a relatively succinct answer that isn't overwhelming for a beginner), try this:
<?php
$yourhtml = '<td align="left">Los Angeles</td>
</tr>
<tr>
<td align="left">Wind Speed:</td>
<td align="left">85 mph</td>
</tr>
<tr>
<td align="left">Snow Load:</td>
<td align="left">0 psf</td>';
$dom = new DOMDocument();
$dom->loadHTML($yourhtml);
$xpath = new DOMXPath($dom);
$matches = $xpath->query('//td[.="Wind Speed:"]/following-sibling::td');
foreach($matches as $match) {
echo $match->nodeValue."\n\n";
}
