Inconsistent elements causing problems using simple html dom parser

I am scraping the following source using simple_html_dom.php:
http://www.forexfactory.com/calendar.php
I am scraping the table elements td.event and td.actual.
The problem is that, if you view the source, every td.event contains a span element, which I strip out like so:
$events = array();
foreach ($html->find('td.event') as $event) {
    foreach ($event->find('span') as $e) {
        $events[] = $e->innertext;
    }
}
So
<td class="event"><span>Spanish Unemployment Change</span></td>
nicely gives me
Spanish Unemployment Change
However, the td.actual elements are inconsistent: some contain span elements, some do not.
So the question is: given this inconsistency, how do I retrieve the text inside the span for some cells and the bare text for the others?
Eg
<td class="actual">46.9</td>
vs
<td class="actual"> <span class="better">54.0</span> </td>
<td class="actual"> <span class="worse">-64.4K</span> </td>

You can just use the plaintext property, which returns the cell's text with or without a nested span:
$actuals = array();
foreach ($html->find('td.actual') as $actual) {
    $actuals[] = $actual->plaintext;
}
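For comparison, here is a minimal sketch of the same idea using PHP's built-in DOM extension instead of Simple HTML DOM. It relies on textContent, which returns the text of a node and all of its descendants, so it does not matter whether the value sits in a span. The table wrapper and the variable names are assumptions made for illustration; the three cells are the ones from the question.

```php
<?php
// Sample cells copied from the question; a real page would be fetched first.
$html = <<<'HTML'
<table>
<tr><td class="actual">46.9</td></tr>
<tr><td class="actual"> <span class="better">54.0</span> </td></tr>
<tr><td class="actual"> <span class="worse">-64.4K</span> </td></tr>
</table>
HTML;

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$actuals = [];
foreach ($xpath->query('//td[@class="actual"]') as $td) {
    // textContent covers the cell's own text and any nested span
    $actuals[] = trim($td->textContent);
}
print_r($actuals);
```

The trim() call removes the spaces that surround the span in the second and third cells.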

Related

Getting DOM elements of html from file_get_contents [duplicate]

This question already has answers here:
How do you parse and process HTML/XML in PHP?
(31 answers)
Closed 6 years ago.
I am fetching html from a website with file_get_contents. I have a table (with a class name) inside html, and I want to get the data inside html tags.
This is how I fetch the html data from url:
$url = 'http://example.com';
$content = file_get_contents($url);
The html looks like:
<table class="space">
<thead></thead>
<tbody>
<tr>
<td class="marsia">1</td>
<td class="mars">
<div>Mars</div>
</td>
</tr>
<tr>
<td class="earthia">2</td>
<td class="earth">
<div>Earth</div>
</td>
</tr>
</tbody>
</table>
Is there a way to search DOM elements in PHP like we do in jQuery, so that I can access the values 1 and 2 (the first td) and the div's value inside the second td?
Something like
a) search the html for table with class name space
b) inside that table, inside tbody, return each tr's 'first td's value' and 'div's value inside second td'
So I get; 1 and Mars, 2 and Earth.

Use the DOM extension, for example. Its DOMXPath class is particularly useful for this kind of task.
You can easily express the listed conditions with an XPath expression like this:
//table[@class="space"]//tr[count(td) = 2]/td
where
- //table[@class="space"] selects all table elements in the document whose class attribute value equals the string "space";
- //tr[count(td) = 2] selects all tr elements having exactly two td child elements;
- /td represents the td elements.
Sample implementation:
$html = <<<'HTML'
<table class="space">
<thead></thead>
<tbody>
<tr>
<td class="marsia">1</td>
<td class="mars">
<div>Mars</div>
</td>
</tr>
<tr>
<td class="earthia">2</td>
<td class="earth">
<div>Earth</div>
</td>
</tr>
<tr>
<td class="earthia">3</td>
</tr>
</tbody>
</table>
HTML;
$doc = new DOMDocument;
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
$cells = $xpath->query('//table[@class="space"]//tr[count(td) = 2]/td');
$i = 0;
foreach ($cells as $td) {
    if (++$i % 2) {
        $number = $td->nodeValue;
    } else {
        $planet = trim($td->textContent);
        printf("%d: %s\n", $number, $planet);
    }
}
Output
1: Mars
2: Earth
The code above should be treated as a sample rather than a recipe for practical use, as it is not very scalable. The logic depends on the XPath expression selecting exactly two cells for each row. In practice, you may want to select the rows, iterate over them, and put the extra conditions into the loop, e.g.:
$rows = $xpath->query('//table[@class="space"]//tr');
foreach ($rows as $tr) {
    $cells = $xpath->query('.//td', $tr);
    if ($cells->length < 2) {
        continue;
    }
    $number = $cells[0]->nodeValue;
    $planet = trim($cells[1]->textContent);
    printf("%d: %s\n", $number, $planet);
}
Here DOMXPath::query() is called with an XPath expression relative to the current row ($tr), and the loop then checks that the returned DOMNodeList contains at least two cells. The rest of the code is trivial.
You can also use the SimpleXML extension, which supports XPath as well, but it is much less flexible than the DOM extension.
For huge documents, use a stream-based parser such as XMLReader.
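As a rough sketch of the XMLReader approach: the reader walks the document node by node, and only the current row is expanded into memory. XMLReader parses XML rather than tag-soup HTML, so this assumes the markup is well-formed; the inline sample string and the $rows variable are assumptions made for illustration.

```php
<?php
// Well-formed XML version of the sample table (an assumption: real HTML is often not valid XML).
$xml = <<<'XML'
<table class="space">
<tbody>
<tr><td class="marsia">1</td><td class="mars"><div>Mars</div></td></tr>
<tr><td class="earthia">2</td><td class="earth"><div>Earth</div></td></tr>
<tr><td class="earthia">3</td></tr>
</tbody>
</table>
XML;

$reader = new XMLReader();
$reader->XML($xml); // for a large file, $reader->open($path) would be used instead

$rows = [];
while ($reader->read()) {
    // Only one <tr> subtree is turned into an object at a time
    if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === 'tr') {
        $tr = new SimpleXMLElement($reader->readOuterXml());
        if (count($tr->td) >= 2) {
            $rows[] = [(int) $tr->td[0], trim((string) $tr->td[1]->div)];
        }
    }
}
$reader->close();
print_r($rows);
```

The row with a single cell is skipped, mirroring the length check in the DOMXPath loop.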

Extracting data from HTML using Simple HTML DOM Parser

For a college project, I am creating a website with some back-end algorithms, and to test these in a demo environment I need a lot of fake data. To get this data I intend to scrape some sites. One of these sites is freelancer.com. To extract the data I am using the Simple HTML DOM Parser, but so far I have been unsuccessful in getting the data I need.
Here is an example of the HTML layout of the page I intend to scrape. The red boxes mark the required data.
Here is the code I have written so far after following some tutorials.
<?php
include "simple_html_dom.php";
// Create DOM from URL
$html = file_get_html('http://www.freelancer.com/jobs/Website-Design/1/');
// Get every <tr> inside <table id="project_table">
foreach ($html->find('table[id=project_table] tr') as $tr) {
    foreach ($tr->find('td[class=title-col]') as $t) {
        // Get the outer HTML (the td element itself plus its contents)
        $data = $t->outertext;
        echo $data;
    }
}
?>
Hopefully someone can point me in the right direction as to how I can get this working.
Thanks.

The raw source code is different; that's why you're not getting the expected results.
You can check the raw source with Ctrl+U. The data is in table[id=project_table_static], and the td cells have no attributes, so here is working code that gets all the URLs from the table:
$url = 'http://www.freelancer.com/jobs/Website-Design/1/';
// Create DOM from URL
$html = file_get_html($url);
// Get every <tr> inside <table id="project_table_static">
foreach ($html->find('table#project_table_static tbody tr') as $i => $tr) {
    // Skip the first empty element
    if ($i == 0) {
        continue;
    }
    echo "<br/>\$i=" . $i;
    // Get the first anchor; find() returns null when the row has none
    $anchor = $tr->find('a', 0);
    if ($anchor !== null) {
        echo " => " . $anchor->href;
    }
}
// Clear the DOM object to free memory
$html->clear();
unset($html);

strip tags placing a delimiter or store to an array using PHP

I've stripped the tag data from a URL like this:
$url='http://abcd.com';
$d=stripslashes(file_get_contents($url));
echo strip_tags($d);
Unfortunately, all the tag values run together, like user14036100 9.00user23034003 11.33user32028000 14.00, where user1, user2 and user3 each have their attribute values stored. It is hard to analyse the attribute values because strip_tags() joins them all together.
Can someone help me strip each tag and either store the values in an array or place a delimiter at the end of each stripped tag's data?
Thanks in advance :)

You cannot achieve this with strip_tags(), since it just removes the tags. You want to replace them with, e.g., a whitespace character (newline, space, ...).
You should probably do this with a regex call, which just replaces all tags.
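A minimal sketch of the regex approach, assuming the page is a simple table: every tag is replaced with a delimiter, and the result is split into an array. The markup in $d is hypothetical, as is the choice of | as the delimiter (it must not occur in the data). A pattern this simple also breaks on a > inside attribute values and keeps the contents of script and style blocks.

```php
<?php
// Hypothetical markup standing in for the fetched page.
$d = '<tr><td>user1</td><td>40</td><td>36100</td><td>9.00</td></tr>'
   . '<tr><td>user2</td><td>30</td><td>34003</td><td>11.33</td></tr>';

// Replace every tag with a delimiter instead of deleting it
$delimited = preg_replace('/<[^>]+>/', '|', $d);

// Split on runs of the delimiter and drop the empty pieces
$values = preg_split('/\|+/', $delimited, -1, PREG_SPLIT_NO_EMPTY);
print_r($values);
```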
A better way would be to parse the fetched page with DOMDocument, so that you can derive the structure directly from the HTML structure.
Example of usage of DOMDocument
You have the following example html page:
<!DOCTYPE html>
<html>
<head>
<title>This is my title</title>
</head>
<body>
<table id="someDataHere">
<tr>
<th>Country</th>
<th>Population</th>
</tr>
<tr>
<td>Germany</td>
<td>81,779,600</td>
</tr>
<tr>
<td>Belgium</td>
<td>11,007,020</td>
</tr>
<tr>
<td>Netherlands</td>
<td>16,847,007</td>
</tr>
</table>
</body>
</html>
You can use DOMDocument to fetch the entries in the table:
$url = "...";
$dom = new DOMDocument("1.0", "UTF-8");
$dom->loadHTML(file_get_contents($url));
$preparedData = array();
$table = $dom->getElementById("someDataHere");
$tableRows = $table->getElementsByTagName('tr');
foreach ($tableRows as $tableRow)
{
    $columns = $tableRow->getElementsByTagName('td');
    // skip the header row of the table - it has no <td>, just <th>
    if (0 == $columns->length)
    {
        continue;
    }
    $preparedData[ $columns->item(0)->nodeValue ] = $columns->item(1)->nodeValue;
}
$preparedData will now hold the following data:
Array
(
[Germany] => 81,779,600
[Belgium] => 11,007,020
[Netherlands] => 16,847,007
)
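Real pages are rarely as clean as this example, and loadHTML() can emit PHP warnings for markup that libxml objects to. The sketch below shows one way to keep those warnings out of the output with libxml_use_internal_errors(), and adds a guard for a missing table. The inline $page string (with its stray foo tag) is a hypothetical stand-in for the fetched page.

```php
<?php
// Hypothetical page fragment; the stray <foo> stands in for messy real-world markup.
$page = '<table id="someDataHere">'
      . '<tr><th>Country</th><th>Population</th></tr>'
      . '<tr><td>Germany</td><td>81,779,600</td></tr>'
      . '</table><foo>';

// Collect libxml parse errors internally instead of emitting PHP warnings
$previous = libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($page);
libxml_clear_errors();
libxml_use_internal_errors($previous);

$preparedData = [];
$table = $dom->getElementById('someDataHere');
if ($table !== null) { // the table may be missing if the site changes its template
    foreach ($table->getElementsByTagName('tr') as $tableRow) {
        $columns = $tableRow->getElementsByTagName('td');
        if ($columns->length >= 2) {
            $preparedData[$columns->item(0)->nodeValue] = $columns->item(1)->nodeValue;
        }
    }
}
print_r($preparedData);
```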
Some notes
Since you are developing a crawler (spider), you are highly dependent on the HTML structure of the target webpage. You may have to adjust your crawler every time the site changes something in its templates.
This is just a simple example, but it should make clear how you can build on it to produce more advanced results.
Since DOMDocument implements the DOM methods, you have to work your way through the HTML structure with the facilities they provide.
For very large HTML pages, DOMDocument can become quite expensive in terms of memory.

I want php code to find href title and some other infos from html table

This is the code I have so far:
<?php
$url=" SOME HTML URL ";
$html = file_get_contents($url);
$doc = new DOMDocument();
@$doc->loadHTML($html);
$tags = $doc->getElementsByTagName('a');
foreach ($tags as $tag) {
    echo $tag->getAttribute('href');
}
?>
I have HTML pages with tables, and I want the link, the title and the date. Example of the HTML code:
<TR>
<TD align="center" vAlign="top" bgColor="#ffffff" class="smalltext">3</TD>
<TD class="plaintext" >THIS IS THE TITLE </TD>
<TD align="center" class="plaintext" >THIS IS DATE</TD>
</TR>
It works fine for the link, but I don't know how to get the others.
Tnx.

Where you are doing this:
$tags = $doc->getElementsByTagName('a');
You are getting back all the A tags. There only happens to be one.
If you want to get the text "THIS IS DATE", you aren't going to get it by looking in A tags, because the text is not inside an A tag; it is in a TD tag.
$tds = $doc->getElementsByTagName('td');
... would work to get all the TD elements, or you could assign an ID to the element you want to target and use getElementById instead.
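A short sketch of how that could look when walking the table row by row, so that each link stays paired with its title and date. The sample row is modelled on the question, but the a element and its href are assumptions, since the posted snippet does not show where the link sits.

```php
<?php
// Hypothetical row modelled on the question; the <a> element is an assumption.
$html = <<<'HTML'
<table>
<TR>
<TD align="center" class="smalltext">3</TD>
<TD class="plaintext"><a href="/item/3">THIS IS THE TITLE</a></TD>
<TD align="center" class="plaintext">THIS IS DATE</TD>
</TR>
</table>
HTML;

$doc = new DOMDocument();
$doc->loadHTML($html);

$items = [];
foreach ($doc->getElementsByTagName('tr') as $tr) {
    $tds = $tr->getElementsByTagName('td');
    $link = $tr->getElementsByTagName('a')->item(0);
    if ($tds->length < 3 || $link === null) {
        continue; // not a data row
    }
    $items[] = [
        'href'  => $link->getAttribute('href'),
        'title' => trim($tds->item(1)->textContent),
        'date'  => trim($tds->item(2)->textContent),
    ];
}
print_r($items);
```

The HTML parser lowercases element names, so the uppercase TR and TD tags still match 'tr' and 'td'.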
Basically, though, this information is all in the documentation, which you absolutely should read before asking questions. Happy reading!
Once again, that's: http://php.net/manual/en/class.domdocument.php

php: parsing table structure with SimpleXML

I'm trying to read in an xml file that for some reason has been modeled in a table structure like so:
<tr id="1">
<td name="Date">10/01/2009</td>
<td name="PromoName">Sample Promo Name</td>
<td name="PromoCode">Sample Promo Code</td>
<td name="PromoLevel" />
</tr>
This is just one sample row, the file has multiple <tr> blocks and it's all surrounded by <table>.
How can I read in the values, given that every element is a <td> distinguished only by its name attribute?

You could use simpleXML with an XPath expression.
$xml = simplexml_load_file('myFile.xml');
$values = $xml->xpath('//td[@name]');
foreach ($values as $v) {
    echo "Found $v<br />";
}
This would give you all the TD node values that have a name attribute, e.g.
Found 10/01/2009
Found Sample Promo Name
Found Sample Promo Code
Found <nothing cuz PromoLevel is empty>
Edit: To iterate over all the table rows, you could do something like this:
$rows = $xml->xpath('//tr');
foreach ($rows as $row) {
    echo $row['id'];
    foreach ($row->td as $td) {
        if ($td['name']) {
            echo $td['name'], ':', $td, '<br/>', PHP_EOL;
        }
    }
}
You might also want to have a look at this article.
Edit Fixed the XPath expression, as Josh suggested.
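If the values are needed for further processing rather than for printing, one option is to collect each row into an array keyed by the name attribute. This is only a sketch: the inline XML string replaces myFile.xml so that it runs on its own, and the $rows layout is an assumption.

```php
<?php
// Inline copy of the sample row from the question, wrapped in <table>.
$source = <<<'XML'
<table>
<tr id="1">
<td name="Date">10/01/2009</td>
<td name="PromoName">Sample Promo Name</td>
<td name="PromoCode">Sample Promo Code</td>
<td name="PromoLevel" />
</tr>
</table>
XML;

$xml = simplexml_load_string($source);

$rows = [];
foreach ($xml->xpath('//tr') as $tr) {
    $row = [];
    foreach ($tr->td as $td) {
        // Cast to string: attributes and node values are SimpleXMLElement objects
        $row[(string) $td['name']] = (string) $td;
    }
    $rows[(string) $tr['id']] = $row;
}
print_r($rows);
```

The empty PromoLevel element ends up as an empty string.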
