Digging deeper into DOMElement - php

I've used Zend_Dom_Query to extract some <tr> elements, and now I want to loop through them and pull out more data. Each <tr> looks like this. How can I print the title (Title1) and the id of the category cell (categ-113)?
<tr class="sometr">
<th><a class="title">Title1</a></th>
<td class="category" id="categ-113"></td>
<td class="somename">Title 1 name</td>
</tr>

You should just play around with the results. I've never worked with it (and I'm fairly new to Zend myself), but this is how far I got:
$dom = new Zend_Dom_Query($html);
$res = $dom->query('.sometr');
foreach ($res as $row) {
    // each result is a DOMElement for one <tr>
    $a = $row->getElementsByTagName('a');
    echo $a->item(0)->textContent; // the title
    $td = $row->getElementsByTagName('td');
    echo $td->item(0)->getAttribute('id'); // the id, e.g. categ-113
}
With this I think you're set. For the other properties and methods available on each result, look up DOMElement ( http://php.net/manual/de/class.domelement.php ). That should be enough to grab everything you need. But my question is:
Why do it in such a complicated way? I don't really see the use case. The title and everything else should come from the database, shouldn't it? And if it's XML, there are better solutions than relying on Dom_Query.
Anyway, if this was helpful to you, please accept and/or vote for the answer.
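For comparison, here is a minimal sketch of the same extraction using only PHP's built-in DOMDocument and DOMXPath instead of Zend_Dom_Query. The sample row is copied from the question and wrapped in a <table> (an assumption, so the parser keeps the <tr>); the variable names are illustrative only.

```php
<?php
// Sample row from the question, wrapped in <table> (assumption).
$html = '<table><tr class="sometr">
<th><a class="title">Title1</a></th>
<td class="category" id="categ-113"></td>
<td class="somename">Title 1 name</td>
</tr></table>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // keep parser warnings out of the output
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$rows = [];

// Match every <tr> whose class attribute contains the token "sometr".
$rowQuery = '//tr[contains(concat(" ", normalize-space(@class), " "), " sometr ")]';
foreach ($xpath->query($rowQuery) as $tr) {
    // Both lookups are relative to the current row.
    $title = $xpath->query('.//a[@class="title"]', $tr)->item(0);
    $cell  = $xpath->query('.//td[@class="category"]', $tr)->item(0);
    $rows[] = [
        'title' => $title ? trim($title->textContent) : null,
        'id'    => $cell ? $cell->getAttribute('id') : null,
    ];
}

foreach ($rows as $row) {
    echo $row['title'], ' ', $row['id'], "\n";
}
```

Collecting the values into $rows first keeps the extraction separate from the output, which makes it easier to reuse the data later.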

Related

Show certain part of a different webpage into mine

I want to be able to show the top 10 players on my server, as listed on gametracker.com, on my own webpage.
I looked at the source code of the gametracker.com page that shows the top 10 players, and the relevant part looks like this:
<div class="blocknew blocknew666">
<div class="blocknewhdr">
TOP 10 PLAYERS <span class="item_text_12">(Online & Offline)</span>
</div>
<table class="table_lst table_lst_stp">
<tr>
<td class="col_h c01">
Rank
</td>
<td class="col_h c02">
Name
</td>
<td class="col_h c03">
Score
</td>
<td class="col_h c04">
Time Played
</td>
</tr>
.
.
.
.
</table>
<div class="item_h10">
</div>
<a class="fbutton" href="/server_info/*.*.*.*:27015/top_players/">
View All Players & Stats
</a>
</div>
As you can see, the content I want is inside the element with class="blocknew blocknew666". I could have pulled it out easily if it had an id, but I don't know how to handle content that is only identified by a class. I looked around on the internet and came across this:
// Create DOM from URL or file
$html = file_get_html('http://www.google.com/');
// Find all images
foreach($html->find('img') as $element)
echo $element->src . '<br>';
// Find all links
foreach($html->find('a') as $element)
echo $element->href . '<br>';
Is it possible to use this code to do what I want? If so, please show the line of code I would need, or suggest how to tackle this.
I'm only going to post a partial answer, because I believe that doing this might violate the terms of use of the GameTracker service: what you are asking for is basically a way to take proprietary content from another website. You SHOULD most definitely GET PERMISSION from GameTracker before you do this.
To do this I would use strstr. http://php.net/manual/en/function.strstr.php
$html = file_get_contents('http://www.gametracker.com/server_info/someip/');
$topten = strstr($html, 'TOP 10 PLAYERS');
echo $topten; // prints everything from 'TOP 10 PLAYERS' to the end of the page
Now I will leave it up to you to figure out how to chop off the unneeded content that comes after the top ten AND to get permission from GameTracker to use this.
Based on tremor's suggestion, this is the working code for the above problem:
<?php
// Return the part of $haystack that comes before the first occurrence of $needle.
function rstrstr($haystack, $needle)
{
    return substr($haystack, 0, strpos($haystack, $needle));
}
$html = file_get_contents('http://www.gametracker.com/server_info/*.*.*.*:27015/');
$topten = strstr($html, 'TOP 10 PLAYERS'); // keep everything from the heading onward
$topten = strstr($topten, '<table class="table_lst table_lst_stp">'); // skip ahead to the table
$topten = rstrstr($topten, '<div class="item_h10">'); // trim everything after the table
echo $topten;
?>
The code you provided is part of the PHP Simple HTML DOM Parser library:
http://simplehtmldom.sourceforge.net/
You need to download and include the library for the code to work.
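If you would rather select by class than cut strings, one possible approach is the built-in DOMDocument with an XPath class test. This is only a sketch: the HTML is a shortened stand-in for the GameTracker markup, the "PlayerOne" row is invented for illustration, and classPredicate() is a hypothetical helper, not part of any library.

```php
<?php
// Shortened stand-in for the page markup; the player row is made up.
$html = '<div id="page">
<div class="blocknew blocknew666">
<div class="blocknewhdr">TOP 10 PLAYERS</div>
<table class="table_lst table_lst_stp">
<tr><td class="col_h c01">Rank</td><td class="col_h c02">Name</td></tr>
<tr><td>1.</td><td>PlayerOne</td></tr>
</table>
<div class="item_h10"></div>
</div>
</div>';

// Hypothetical helper: XPath test for "class attribute contains this token".
function classPredicate(string $class): string
{
    return sprintf('contains(concat(" ", normalize-space(@class), " "), " %s ")', $class);
}

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // real-world pages often trigger parser warnings
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$query = '//div[' . classPredicate('blocknew666') . ']'
       . '//table[' . classPredicate('table_lst') . ']';
$table = $xpath->query($query)->item(0);

// Serialize only the matched <table> node back to HTML.
$tableHtml = $table ? $doc->saveHTML($table) : '';
echo $tableHtml;
```

The concat/normalize-space test matches a single class token even when the attribute lists several classes, which is the situation in the question. The permission caveat from the answer above applies equally here.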

Scraping using php - preg_match_all

I'm trying to get the value of Internet Data Volume Balance; the script should echo 146.30 MB.
I'm new to all this and have been going through the tutorials.
How can this be done?
<tr >
<td bgcolor="#F8F8F8"><div align="left"><B><FONT class="tplus_text">Account Status</FONT></B></div></td>
<td bgcolor="#FFFFFF"><div align="left"><FONT class="tplus_text">You exceeded your allowed credit.</FONT></div></td>
</tr>
<tr >
<td bgcolor="#F8F8F8"><div align="left"><B><FONT class="tplus_text">Period Free Time Remaining</FONT></B></div></td>
<td bgcolor="#FFFFFF"><div align="left"><FONT class="tplus_text">0:00:00 hours</FONT></div></td>
</tr>
<tr >
<td bgcolor="#F8F8F8"><div align="left"><B><FONT class="tplus_text">Internet Data Volume Balance</FONT></B></div></td>
<td bgcolor="#FFFFFF"><div align="left"><FONT class="tplus_text" style="text-transform:none;">146.30 MB</FONT></div></td>
</tr>
If you have already installed phpQuery, or are willing to, you can use that:
phpQuery::newDocumentFileHTML('htmlpage.html');
echo pq('td:eq(5)')->text(); // :eq() is zero-based, so 5 is the sixth <td> in the snippet above
PHP can interact with the DOM just like JavaScript can. This is vastly superior to parsing the markup with regular expressions, which most people will tell you is the wrong approach anyway:
Loading from an HTML File
// Start by creating a new document
$doc = new DOMDocument();
// I've loaded the table into an external file, and am loading it into the $doc
$doc->loadHTMLFile( 'htmlpage.html' );
// Since you have six table cells, I'm calling up all of them
$cells = $doc->getElementsByTagName("td");
// I'm grabbing the sixth cell's textContent property
echo $cells->item(5)->textContent;
This code will output "146.30 MB" to the screen.
Loading from a String
If you have the HTML stored in a string, you can load that into your document as well. We just swap the method that loads from a file for the one that loads from a string:
$str = "<table><tr><td>Foo</td></tr>...</table>";
$doc->loadHTML( $str );
We would then proceed with the same code as above to select the cells, and show their textContent in the output.
Check out the DOMDocument Class.
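Selecting the cell by position breaks as soon as a row is added or removed. As a hedged alternative, the value can be looked up by its label with DOMXPath. The markup below is a trimmed copy of the rows from the question; the variable names are illustrative.

```php
<?php
// Trimmed copy of two rows from the question.
$html = '<table>
<tr>
<td><b><font class="tplus_text">Period Free Time Remaining</font></b></td>
<td><font class="tplus_text">0:00:00 hours</font></td>
</tr>
<tr>
<td><b><font class="tplus_text">Internet Data Volume Balance</font></b></td>
<td><font class="tplus_text">146.30 MB</font></td>
</tr>
</table>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$label = 'Internet Data Volume Balance';

// Find the <td> whose text equals the label, then take the next <td> in the same row.
$query = '//td[normalize-space(.) = "' . $label . '"]/following-sibling::td[1]';
$cell = $xpath->query($query)->item(0);

$balance = $cell ? trim($cell->textContent) : null;
echo $balance;
```

Because the lookup keys on the label text rather than the cell index, it keeps working if the provider reorders the rows, as long as the label itself stays the same.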

strip tags placing a delimiter or store to an array using PHP

I've stripped the tags from the data at a URL like this:
$url='http://abcd.com';
$d=stripslashes(file_get_contents($url));
echo strip_tags($d);
Unfortunately, all the values are run together, like user14036100 9.00user23034003 11.33user32028000 14.00, where user1, user2 and user3 are the stored attributes. It is hard to analyse the values because strip_tags() joins them all together.
Can someone help me strip each tag and either store the values in an array or place a delimiter after each stripped value?
Thanks in advance :)
You cannot achieve this with strip_tags(), since it just removes the tags. You want to replace them with something, e.g. a whitespace character (newline, space, ...).
You could do this with a regex call that replaces every tag.
A better way would be to parse the fetched page with DOMDocument, so that you can derive the data directly from the HTML structure.
Example of usage of DOMDocument
You have the following example html page:
<!DOCTYPE html>
<html>
<head>
<title>This is my title</title>
</head>
<body>
<table id="someDataHere">
<tr>
<th>Country</th>
<th>Population</th>
</tr>
<tr>
<td>Germany</td>
<td>81,779,600</td>
</tr>
<tr>
<td>Belgium</td>
<td>11,007,020</td>
</tr>
<tr>
<td>Netherlands</td>
<td>16,847,007</td>
</tr>
</table>
</body>
</html>
You can use DOMDocument to fetch the entries in the table:
$url = "...";
$dom = new DOMDocument("1.0", "UTF-8");
$dom->loadHTML(file_get_contents($url));
$preparedData = array();
$table = $dom->getElementById("someDataHere");
$tableRows = $table->getElementsByTagName('tr');
foreach ($tableRows as $tableRow)
{
    $columns = $tableRow->getElementsByTagName('td');
    // skip the header row of the table - it has no <td>, just <th>
    if (0 == $columns->length)
    {
        continue;
    }
    $preparedData[ $columns->item(0)->nodeValue ] = $columns->item(1)->nodeValue;
}
$preparedData will now hold the following data:
Array
(
[Germany] => 81,779,600
[Belgium] => 11,007,020
[Netherlands] => 16,847,007
)
Some notes
Since you are developing a crawler (spider), you are highly dependent on the HTML structure of the target webpage. You may have to adjust your crawler every time they change something in their templates.
This is just a simple example, but it should make clear how you can build on it to produce more advanced results.
Since DOMDocument implements the DOM methods, you have to work your way through the HTML structure with the methods they provide.
For very large HTML pages, DOMDocument can become quite expensive in terms of memory.
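The regex alternative mentioned at the start of this answer can be sketched as follows. The sample markup is an assumption (the question never shows the real page); only the values user1, 4036100, 9.00 and so on are taken from the question.

```php
<?php
// Assumed markup; the question only shows the joined output, not the HTML.
$html = '<table><tr><td>user1</td><td>4036100</td><td>9.00</td></tr>'
      . '<tr><td>user2</td><td>3034003</td><td>11.33</td></tr></table>';

// Replace every tag with a delimiter instead of deleting it.
$delimited = preg_replace('/<[^>]*>/', '|', $html);

// Split on runs of the delimiter and drop the empty pieces between adjacent tags.
$values = preg_split('/\|+/', $delimited, -1, PREG_SPLIT_NO_EMPTY);

print_r($values);
```

This only works if the delimiter never occurs in the data itself, and it loses the row structure, which is why the DOMDocument approach above is usually the safer choice.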

How to pull data from xml and break it in pages (pagination)

Hey everyone, I am using SimpleXML to pull data from an external XML source. The source accepts parameters, including one that limits the number of results to display. I thought I could paginate with a simple parameter in the URL, something like "&page=2", but as far as the documentation shows, that is not possible.
I downloaded a pagination class intended for use with a MySQL query and tried to feed it the values from the XML. But the output shows the whole result set from the XML, not the page specified in the URL.
I think what I need to do is count the results first and then paginate, which is what I am trying to do. Do you see anything in this code that can be improved? Sorry if it isn't clear; maybe by discussing it with fellow coders I can see a bit of light at the end of the tunnel and explain it a bit better.
So here is the code:
<?
$url ="http://www.somedomain.com/cgi/xml/engine/get_data.php?ref=$ref&checkin=$checkin&checkout=$checkout&rval=$rval&pval=$pval&country=$country&city=$city&lg=$lg&orderby=$orderby&ordertype=$ordertype&maxrows=$maxrows";
// see I am already defining the max num of rows within the url. Which means that the proper way to sort this out is to start counting from the # aheads?
$all = new SimpleXMLElement($url, null, true);
$all->items_total = $hotels->id;
//
require_once 'paginator.class.php';
//calling the paginator class
foreach($all as $hotel) // loop through our hotels
{
$pages = new Paginator;
//creating a new paginator
$pages->mid_range = 7;
$pages->items_total = $hotel->id;
//extracting the var out from the XML
$rest = substr($hotel->description, 0, -150); // returns "abcde"
//echo <<<EOF
<table width="100%" border=0>
<tr>
<td colspan="2">{$hotel->name}<span class="stars" widht="{$hotel->rating}">{$hotel->rating}</span></h2></a><p><b>Direccion:</b> <i>{$hotel->address}</i> - {$hotel->province}</p>
<td colspan="2"><div align="center">PRECIO: {$hotel->currencyCode} {$hotel->minCostOfStay</a>
</div></a></a>
</td>
</tr>
<tr>
<td colspan="2"> $rest...<strong>ampliar información</strong></td>
<td valign="middle"><div align="center"><a href="{$hotel->rooms->room->bookUrl}"><img src="{$hotel->photoUrl}"></div></td>
</tr>
<tr>
<td colspan="2"><div align="center"><strong>VER TODO SOBRE ESTE </strong></div></td>
<td colspan="2"><div align="center">$text</a></div></td>
</a></div></td>
</tr>
//EOF;
echo '</table>';
$pages->paginate();
}
echo $pages->display_pages();
?>
You're clobbering your $all variable:
$all = new SimpleXMLElement($url, null, true); // used by the loop
$all = new Paginator; // reset within the loop
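The "count first, then paginate" idea from the question can be sketched without any paginator class. This is only an illustration under stated assumptions: the XML layout (a <hotels> root with <hotel> children) and the page size are invented, and in a real page $page would come from a request parameter such as $_GET['page'].

```php
<?php
// Invented sample feed; the real XML structure is not shown in the question.
$xml = '<hotels>
<hotel><id>1</id><name>Hotel A</name></hotel>
<hotel><id>2</id><name>Hotel B</name></hotel>
<hotel><id>3</id><name>Hotel C</name></hotel>
<hotel><id>4</id><name>Hotel D</name></hotel>
<hotel><id>5</id><name>Hotel E</name></hotel>
</hotels>';

$all = new SimpleXMLElement($xml);

// xpath() returns a plain PHP array, which can be counted and sliced.
$hotels = $all->xpath('/hotels/hotel');

$perPage = 2;   // assumed page size
$page    = 2;   // assumed current page (e.g. from $_GET['page'])

// Count once, outside the loop, then clamp the page number to a valid range.
$totalPages = (int) ceil(count($hotels) / $perPage);
$page = max(1, min($page, $totalPages));

// Take only the items that belong to the requested page.
$pageItems = array_slice($hotels, ($page - 1) * $perPage, $perPage);

$names = [];
foreach ($pageItems as $hotel) {
    $names[] = (string) $hotel->name;
}

echo implode(', ', $names), " (page $page of $totalPages)\n";
```

The key difference from the code in the question is that the total is computed once from the whole result set, and the loop only ever sees the slice for the current page.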

PHP Using domdocument to extract data from html

I have a table with the following structure. I cannot seem to get the data I want.
<table class="gsborder" cellspacing="0" cellpadding="2" rules="cols" border="1" id="d00">
<tr class="gridItem">
<td>Code</td><td>0adf</td>
</tr><tr class="AltItem">
<td>CompanyName</td><td>Some Company</td>
</tr><tr class="Item">
<td>Owner</td><td>Jim Jim</td>
</tr><tr class="AltItem">
<td>DivisionName</td><td> </td>
</tr><tr class="Item">
<td>AddressLine1</td><td>9314 W. SPRING ST.</td>
</tr>
</table>
This table is of course nested within another table within the page. How can I use DomDocument for example to refer to "Code" and "0adf" as a key value pair? They actually don't need to be in a key value pair but I should be able to call them each separately.
EDIT:
Using PHP Simple HTML DOM, I was able to extract the data I needed using this:
$foo = $html->getElementById("d00")->childNodes(1)->childNodes(1);
The problem with this, though, is that I get the two <td></td> tags along with my data. Is there a way to grab only the raw data without the tags?
Also, is this the right way to get my data out of this table?
If you're not dead set on using DOMDocument, try using the PHP Simple HTML DOM Parser. This has the benefit of allowing you to parse HTML which is not valid XML as well as providing a nicer interface to the parsed document.
You could write something like:
$html = str_get_html(...);
foreach ($html->find('tr') as $tr)
{
    // plaintext returns the cell content without the surrounding tags
    print 'First td: ' . $tr->find('td', 0)->plaintext . PHP_EOL;
    print 'Second td: ' . $tr->find('td', 1)->plaintext . PHP_EOL;
}
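Since the question asked about DOMDocument specifically, here is a rough equivalent using only the built-in DOM classes. The table is a shortened copy of the one in the question; $data and the other variable names are illustrative.

```php
<?php
// Shortened copy of the table from the question.
$html = '<table class="gsborder" id="d00">
<tr class="gridItem"><td>Code</td><td>0adf</td></tr>
<tr class="AltItem"><td>CompanyName</td><td>Some Company</td></tr>
<tr class="Item"><td>Owner</td><td>Jim Jim</td></tr>
</table>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // tolerate markup that is not valid XML
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$data = [];

// Restrict the search to rows of the table with id="d00".
foreach ($xpath->query('//table[@id="d00"]//tr') as $tr) {
    $cells = $tr->getElementsByTagName('td');
    if ($cells->length < 2) {
        continue;   // skip rows that do not hold a key/value pair
    }
    // textContent yields the text only, without the <td> tags.
    $data[trim($cells->item(0)->textContent)] = trim($cells->item(1)->textContent);
}

echo $data['Code'], "\n";
echo $data['CompanyName'], "\n";
```

Selecting by the table's id keeps the outer table mentioned in the question out of the result, and the array gives both key/value access and separate access to each value.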
