Scraping using php - preg_match_all

Scraping using php - preg_match_all - php

Trying to get the value of Internet Data Volume Balance - the script should echo 146.30mb
New to all these, having a look at all the tutorials.
How can this be done?
<tr >
<td bgcolor="#F8F8F8"><div align="left"><B><FONT class="tplus_text">Account Status</FONT></B></div></td>
<td bgcolor="#FFFFFF"><div align="left"><FONT class="tplus_text">You exceeded your allowed credit.</FONT></div></td>
</tr>
<tr >
<td bgcolor="#F8F8F8"><div align="left"><B><FONT class="tplus_text">Period Free Time Remaining</FONT></B></div></td>
<td bgcolor="#FFFFFF"><div align="left"><FONT class="tplus_text">0:00:00 hours</FONT></div></td>
</tr>
<tr >
<td bgcolor="#F8F8F8"><div align="left"><B><FONT class="tplus_text">Internet Data Volume Balance</FONT></B></div></td>
<td bgcolor="#FFFFFF"><div align="left"><FONT class="tplus_text" style="text-transform:none;">146.30 MB</FONT></div></td>
</tr>

If you were willing to or have already installed phpQuery, you can use that.
phpQuery::newDocumentFileHTML('htmlpage.html');
echo pq('td:eq(6)')->text();

PHP can interact with the DOM just like JavaScript can. This is vastly superior to parsing the markup, as most people will tell you is the wrong approach anyway:
Loading from an HTML File
// Start by creating a new document
$doc = new DOMDocument();
// I've loaded the table into an external file, and am loading it into the $doc
$doc->loadHTMLFile( 'htmlpage.html' );
// Since you have six table cells, I'm calling up all of them
$cells = $doc->getElementsByTagName("td");
// I'm grabbing the sixth cell's textContent property
echo $cells->item(5)->textContent;
This code will output "146.30 MB" to the screen.
Loading from a String
If you have the HTML stored within a string, you can load that into your document as well. We'll change the method used to load the file, into the method used to load from a string:
$str = "<table><tr><td>Foo</td></tr>...</table>";
$doc->loadHTML( $str );
We would then proceed with the same code as above to select the cells, and show their textContent in the output.
Check out the DOMDocument Class.

Related

Getting DOM elements of html from file_get_contents [duplicate]

This question already has answers here:
How do you parse and process HTML/XML in PHP?
(31 answers)
Closed 6 years ago.
I am fetching html from a website with file_get_contents. I have a table (with a class name) inside html, and I want to get the data inside html tags.
This is how I fetch the html data from url:
$url = 'http://example.com';
$content = file_get_contents($url);
The html looks like:
<table class="space">
<thead></thead>
<tbody>
<tr>
<td class="marsia">1</td>
<td class="mars">
<div>Mars</div>
</td>
</tr>
<tr>
<td class="earthia">2</td>
<td class="earth">
<div>Earth</div>
</td>
</tr>
</body>
</table>
Is there a way to searh DOM elements in php like we do in jQuery? So that I can access the values 1, 2 (first td) and div's value inside second td.
Something like
a) search the html for table with class name space
b) inside that table, inside tbody, return each tr's 'first td's value' and 'div's value inside second td'
So I get; 1 and Mars, 2 and Earth.

Use the DOM extension, for example. Its DOMXPath class is particularly useful for such kind of tasks.
You can easily set the listed conditions with an XPath expression like this:
//table[#class="space"]//tr[count(td) = 2]/td
where
- //table[#class="space"] selects all table elements from the document having class attribute value equal to "space" string;
- //tr[count(td) = 2] selects all tr elements having exactly two td child elements;
- /td represents the td elements.
Sample implementation:
$html = <<<'HTML'
<table class="space">
<thead></thead>
<tbody>
<tr>
<td class="marsia">1</td>
<td class="mars">
<div>Mars</div>
</td>
</tr>
<tr>
<td class="earthia">2</td>
<td class="earth">
<div>Earth</div>
</td>
</tr>
<tr>
<td class="earthia">3</td>
</tr>
</tbody>
</table>
HTML;
$doc = new DOMDocument;
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
$cells = $xpath->query('//table[#class="space"]//tr[count(td) = 2]/td');
$i = 0;
foreach ($cells as $td) {
if (++$i % 2) {
$number = $td->nodeValue;
} else {
$planet = trim($td->textContent);
printf("%d: %s\n", $number, $planet);
}
}
Output
1: Mars
2: Earth
The code above is supposed to be considered as a sample rather than an instruction for practical use, as it is not very scalable. The logic is bound to the fact that the XPath expression selects exactly two cells for each row. In practice, you may want to select the rows, iterate them, and put the extra conditions into the loop, e.g.:
$rows = $xpath->query('//table[#class="space"]//tr');
foreach ($rows as $tr) {
$cells = $xpath->query('.//td', $tr);
if ($cells->length < 2) {
continue;
}
$number = $cells[0]->nodeValue;
$planet = trim($cells[1]->textContent);
printf("%d: %s\n", $number, $planet);
}
DOMXPath::query() is called with an XPath expression relative to the current row ($tr), then checks if the returned DOMNodeList contains at least two cells. The rest of the code is trivial.
You can also use SimpleXML extension, which also supports XPath. But the extension is much less flexible as compared to the DOM extension.
For huge documents, use extensions based on SAX-based parsers such as XMLReader.

How to extracting Data from HTML table using php

I keep trying different methods of extracting the data from the HTML table such as using xpath. The table(s) do not contain any classes so I am not sure how to use xpath without classes or Id. This data is being retrieved from an rss xml file. I am currently using DOM. After I extract the data, I will try to sort, the tables by Job Title
Here is my php code
$html='';
$xml= simplexml_load_file($url) or die("ERROR: Cannot connect to url\n check if report still exist in the Gradleaders system");
/*What we do here in this loop is retrieve all content inside the encoded content,
*which includes the CDATA information. This is where the HTML and styling is included.
*/
foreach($xml->channel->item as $cont){
$html=''.$cont->children('content',true)->encoded.'<br>'; //actual tag name is encoded
}
$htmlParser= new DOMDocument(); //to parse html using DOMDocument
libxml_use_internal_errors(true); // your HTML gives parser warnings, keep them internal
$htmlParser->loadHTML($html); //Loaded the html string we took from simple xml
$htmlParser->preserveWhiteSpace = false;
$tables= $htmlParser->getElementsByTagName('table');
$rows= $tables->item(0)->getElementsByTagName('tr');
foreach($rows as $row){
$cols = $row->getElementsByTagName('td');
echo $cols;
}
This is the HTML I am extracting info from
<table cellpadding='1' cellspacing='2'>
<tr>
<td><b>Job Title:</b></td>
<td>Job Example </td>
</tr>
<tr>
<td><b>Job ID:</b></td>
<td>23992</td>
</tr>
<tr>
<td><b>Job Description:</b></td>
<td>Just a job example </td>
</tr>
<tr>
<td><b>Job Category:</b></td>
<td>Work-study Position</td>
</tr>
<tr>
<td><b>Position Type:</b></td>
<td>Work-study</td>
</tr>
<tr>
<td><b>Applicant Type:</b></td>
<td>Work-study</td>
</tr>
<tr>
<td><b>Status:</b></td>
<td>Active</td>
</tr>
<tr>
<td colspan='2'><b><a href='https://www.myjobs.com/tuemp/job_view.aspx?token=I1iBwstbTs2pau+SjrYfWA%3d%3d'>Click to View More</a></b></td>
</tr>
</table>

You can use xpath to query('//td') and retrieve the td html using C14N(), something like:
$dom = new DOMDocument();
$dom->loadHtml($html);
$x = new DOMXpath($dom);
foreach($x->query('//td') as $td){
echo $td->C14N();
//if just need the text use:
//echo $td->textContent;
}
Output:
<td><b>Job Title:</b></td>
<td>Job Example </td>
<td><b>Job ID:</b></td>
...
C14N();
Returns canonicalized nodes as a string or FALSE on failure
Update:
Another question, how can I grab individual Table Data? For example,
just grab, Job ID
Use XPath contains, i.e.:
foreach($x->query('//td[contains(., "Job ID:")]') as $td){
echo $td->textContent;
}
Update V2:
How can I get the next Table Data after that (to actually get the Job
Id) ?
Use following-sibling::*[1], i.e:
echo $x->query('//td[contains(*, "Job ID:")]/following-sibling::*[1]')->item(0)->textContent;
//23992

$xpathParser = new DOMXPath($htmlParser);
$tableDataNodes = $xpathParser->evaluate("//table/tr/td")
for ($x=0;$x<$tableDataNodes.length;$x++) {
echo $tableDataNodes[$x];
}

Digging deeper into DOMElement

I've used Zend_Dom_Query to extract some <tr> elements and I want to now loop through them and do some more. Each <tr> looks like this, so how can I print the title Title 1 and the id of the second td id=categ-113?
<tr class="sometr">
<th><a class="title">Title1</a></th>
<td class="category" id="categ-113"></td>
<td class="somename">Title 1 name</td>
</tr>

You should just play around with the results. I've never worked with it, but this is how far i got (and im kinda new to Zend myself):
$dom = new ZEnd_Dom_Query($html);
$res = $dom->query('.sometr');
foreach($res as $dom) {
$a = $obj->getElementsByTagName('a');
echo $a->item(0)->textContent; // the title
}
And with this i think you're set to go. For further information and functions to be used of the result look up DOMElement ( http://php.net/manual/de/class.domelement.php ). With this information you should be able to grab all that. But my question is:
Why doing this so complicated, i don't really see a use-case for doing this. As the title and everything else should be something coming from the database? And if it's an XML there's better solutions than relying on Dom_Query.
Anyways, if this was helpful to you please accept and/or vote the answer.

How to pull data from xml and break it in pages (pagination)

Hey everyone, I am using simplexml to pull data from an external xml source. I have got values even for limiting the number of results to display. I thought I could paginate with a simple query within the URL, something like "&page=2" but it is not possible as far as documentation shows.
I downloaded a pagination class intended to use within a MYSQL query an tried to used the vars output from the xml. But the output is loading the whole results of the xml and not the specified within the URL vars.
I think what I might do is to count the results first and then paginate, which is what I am trying to do. Do you see anything in this code that can be improved? Sorry If it isn´t clear, but maybe discussing with some coders fellas I can see a bit of light at the end of the tunnel and exaplin a bit better.
So here is the code:
<?
$url ="http://www.somedomain.com/cgi/xml/engine/get_data.php?ref=$ref&checkin=$checkin&checkout=$checkout&rval=$rval&pval=$pval&country=$country&city=$city&lg=$lg&orderby=$orderby&ordertype=$ordertype&maxrows=$maxrows";
// see I am already defining the max num of rows within the url. Which means that the proper way to sort this out is to start counting from the # aheads?
$all = new SimpleXMLElement($url, null, true);
$all->items_total = $hotels->id;
//
require_once 'paginator.class.php';
//calling the paginator class
foreach($all as $hotel) // loop through our hotels
{
$pages = new Paginator;
//creating a new paginator
$pages->mid_range = 7;
$pages->items_total = $hotel->id;
//extracting the var out from the XML
$rest = substr($hotel->description, 0, -150); // returns "abcde"
//echo <<<EOF
<table width="100%" border=0>
<tr>
<td colspan="2">{$hotel->name}<span class="stars" widht="{$hotel->rating}">{$hotel->rating}</span></h2></a><p><b>Direccion:</b> <i>{$hotel->address}</i> - {$hotel->province}</p>
<td colspan="2"><div align="center">PRECIO: {$hotel->currencyCode} {$hotel->minCostOfStay</a>
</div></a></a>
</td>
</tr>
<tr>
<td colspan="2"> $rest...<strong>ampliar información</strong></td>
<td valign="middle"><div align="center"><a href="{$hotel->rooms->room->bookUrl}"><img src="{$hotel->photoUrl}"></div></td>
</tr>
<tr>
<td colspan="2"><div align="center"><strong>VER TODO SOBRE ESTE </strong></div></td>
<td colspan="2"><div align="center">$text</a></div></td>
</a></div></td>
</tr>
//EOF;
echo '</table>';
$pages->paginate();
}
echo $pages->display_pages();
?>

You're clobbering your $all variable:
$all = new SimpleXMLElement($url, null, true); // used by the loop
$all = new Paginator; // reset within the loop

PHP Using domdocument to extract data from html

I have a table with the following structure. I cannot seem to get the data I want.
<table class="gsborder" cellspacing="0" cellpadding="2" rules="cols" border="1" id="d00">
<tr class="gridItem">
<td>Code</td><td>0adf</td>
</tr><tr class="AltItem">
<td>CompanyName</td><td>Some Company</td>
</tr><tr class="Item">
<td>Owner</td><td>Jim Jim</td>
</tr><tr class="AltItem">
<td>DivisionName</td><td> </td>
</tr><tr class="Item">
<td>AddressLine1</td><td>9314 W. SPRING ST.</td>
</tr>
</table>
This table is of course nested within another table within the page. How can I use DomDocument for example to refer to "Code" and "0adf" as a key value pair? They actually don't need to be in a key value pair but I should be able to call them each separately.
EDIT:
Using PHP Simple HTML, I was able to extract the data I needed using this:
$foo = $html->getElementById("d00")->childNodes(1)->childNodes(1);
The problem with this though is that I am getting the two <td></td> tags with my data. Is there a way to only grab the raw data without the tags?
Also, is this the right way to get my data out of this table?

If you're not dead set on using DOMDocument, try using the PHP Simple HTML DOM Parser. This has the benefit of allowing you to parse HTML which is not valid XML as well as providing a nicer interface to the parsed document.
You could write something like:
$html = str_get_html(...);
foreach($html->find('tr') as $tr)
{
print 'First td: ' . $tr->find('td', 0)->plaintext;
print 'Second td: ' . $tr->find('td', 1)->plaintext;
}

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.

Scraping using php - preg_match_all - php

If you were willing to or have already installed phpQuery, you can use that. phpQuery::newDocumentFileHTML('htmlpage.html'); echo pq('td:eq(6)')->text();

Related

Getting DOM elements of html from file_get_contents [duplicate]

How to extracting Data from HTML table using php

Digging deeper into DOMElement

How to pull data from xml and break it in pages (pagination)

PHP Using domdocument to extract data from html

Categories

Resources