How can I extract data from an HTML table in PHP? [duplicate] - php

This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
How to parse and process HTML with PHP?
Let's say I want to extract a certain number/text from a table from here: http://www.fifa.com/associations/association=chn/ranking/gender=m/index.html
I want to get the first number on the right table td under FIFA Ranking position. That would be 88 right now. Upon inspection, it is <td class="c">88</td>.
How would I use PHP to extract the info from said webpage?
Edit: I am told that jQuery/JavaScript is better suited for this.

This could probably be prettier, but it'd go something like:
<?php
$page = file_get_contents("http://www.fifa.com/associations/association=chn/ranking/gender=m/index.html");
// Capture the digits inside every <td class="c"> cell; the "/" in </td> must be escaped.
preg_match_all('/<td class="c">([0-9]+)<\/td>/', $page, $matches);
foreach($matches[1] as $match){
echo $match, "\n";
}
?>
I've never done anything like this before with PHP, so it may not work.
If you can work your magic after page load, you can use JavaScript/JQuery
<script type='text/javascript'>
var arr = [];
// Collect the inner HTML of every td.c cell into the array.
jQuery('table td.c').each(function () {
arr.push(jQuery(this).html());
});
console.log(arr);
</script>
Also, sorry for deleting my comment. You weren't specific as to what needed to be done, so I initially thought jQuery would better fit your needs, but then I thought "Maybe you want to get the page content before an HTML page is loaded".

Try http://simplehtmldom.sourceforge.net/,
$html = file_get_html('http://www.google.com/');
echo $html->find('div.rankings', 0)->find('table', 0)->find('tr',0)->find('td.c',0)->plaintext;
This is untested, just looking at the source. I'm sure you could target it faster.
In fact,
echo $html->find('div.rankings', 0)->find('td.c',0)->plaintext;
should work.

Using DOMDocument, which should be pre-loaded with your PHP installation:
$dom = new DOMDocument();
$dom->loadHTML(file_get_contents("http://www.example.com/file.html"));
$xpath = new DOMXPath($dom);
$cell = $xpath->query("//td[@class='c']")->item(0);
if( $cell) {
$number = intval(trim($cell->textContent));
// do stuff
}
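As a minimal sketch of the same idea, here it is run against an inline HTML fragment rather than a live page. The sample markup and the helper name extractFirstCell are assumptions made for illustration; the real page may be structured differently.

```php
<?php
// Hypothetical helper: returns the integer in the first <td class="c">, or null if none exists.
function extractFirstCell(string $html): ?int
{
    $dom = new DOMDocument();
    // Collect parser warnings internally instead of printing them;
    // real-world HTML is rarely valid.
    libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    $cell = $xpath->query("//td[@class='c']")->item(0);

    return $cell ? intval(trim($cell->textContent)) : null;
}

// Made-up sample markup standing in for the downloaded page.
$sample = '<table><tr><td class="l">China PR</td><td class="c">88</td></tr></table>';
var_dump(extractFirstCell($sample));
```

Returning null when no cell matches lets the caller tell "no such cell" apart from a cell that contains 0.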

Related

How do I cut this string in PHP?

I'm currently working on a script to archive an imageboard.
I'm kinda stuck on making links reference correctly, so I could use some help.
I receive this string:
<a href="10028949#p10028949">&gt;&gt;10028949</a><br><br>who that guy???
In said string, I need to alter this part:
<a href="10028949#p10028949"
to become this:
<a href="#p10028949"
using PHP.
This part may appear more than once in the string, or might not appear at all.
I'd really appreciate it if you had a code snippet I could use for this purpose.
Thanks in advance!
Kenny
Disclaimer: as it'll be said in the comments, using a DOM parser is better to parse HTML.
That being said:
"/(<a[^>]*?href=")\d+(#[^"]+")/"
replaced by $1$2
So...
$myString = preg_replace("/(<a[^>]*?href=\")\d+(#[^\"]+\")/", "$1$2", $myString);
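A quick way to check the replacement is to run it on a made-up string containing two such links. The sample input below is an assumption for illustration, not the asker's real data.

```php
<?php
// Made-up input with two links whose href has a numeric prefix before the "#".
$myString = '<a href="10028949#p10028949">x</a> and <a href="123#p456">y</a>';

// Group 1 keeps everything up to and including href=", \d+ consumes the
// numeric prefix, and group 2 keeps the fragment with its closing quote.
$result = preg_replace('/(<a[^>]*?href=")\d+(#[^"]+")/', '$1$2', $myString);

echo $result, PHP_EOL;
```

Because preg_replace replaces every match by default, the same call covers strings with several links or none at all.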
try this
>>10028949<br><br>who that guy???
Although you have the question already answered I invite you to see what would (approximately xD) be the correct approach, parsing it with DOM:
$string = '<a href="10028949#p10028949">&gt;&gt;10028949</a><br><br>who that guy???';
$dom = new DOMDocument();
$dom->loadHTML($string);
$links = $dom->getElementsByTagName('a'); // This stores all the links in an array (actually a nodeList Object)
foreach($links as $link){
$href = $link->getAttribute('href'); //getting the href
$cut = strpos($href, '#');
$new_href = substr($href, $cut); //cutting the string by the #
$link->setAttribute('href', $new_href); //setting the good href
}
$body = $dom->getElementsByTagName('body')->item(0); //selecting everything
$output = $dom->saveHTML($body); //passing it into a string
echo $output;
The advantages of doing it this way are:
More organized / Cleaner
Easier to read by others
You could, for example, have mixed links and only want to modify some of them. Using DOM, you can select only the links with certain classes
You can change other attributes as well, or the selected tag's siblings, parents, children, etc...
Of course you could achieve the last 2 points with regex as well but it would be a complete mess...
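To illustrate the point about modifying only some links, here is a small sketch that uses DOMXPath to touch only anchors carrying a particular class. The class name quotelink and the sample markup are assumptions made for this example.

```php
<?php
// Made-up input: two links, only one carrying the (assumed) "quotelink" class.
$string = '<a class="quotelink" href="10028949#p10028949">quote</a> '
        . '<a href="10028949#top">other</a>';

$dom = new DOMDocument();
libxml_use_internal_errors(true); // keep parser warnings out of the output
$dom->loadHTML($string);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
// Only anchors whose class attribute is exactly "quotelink" are rewritten.
foreach ($xpath->query("//a[@class='quotelink']") as $link) {
    $href = $link->getAttribute('href');
    $cut = strpos($href, '#');
    if ($cut !== false) {
        $link->setAttribute('href', substr($href, $cut));
    }
}

// Collect the resulting hrefs in document order to inspect the outcome.
$hrefs = [];
foreach ($dom->getElementsByTagName('a') as $link) {
    $hrefs[] = $link->getAttribute('href');
}
print_r($hrefs);
```

The strpos check guards against links without a "#", which the plain substr call would otherwise mangle.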

PHP XPath query returns nothing

I've recently been playing with DOMXPath in PHP and have had success with it. To get more experience, I've been grabbing certain elements from different sites. I am having trouble getting the weather marker from this page: http://www.theweathernetwork.com/weather/cape0005
Specifically I want
//*[@id='theTemperature']
Here is what I have
$url = file_get_contents('http://www.theweathernetwork.com/weather/cape0005');
$dom = new DOMDocument();
@$dom->loadHTML($url);
$xpath = new DOMXPath($dom);
$tags = $xpath->query("//*[@id='theTemperature']");
foreach ($tags as $tag){
echo $tag->nodeValue;
}
Is there something I am doing wrong here? I am able to produce actual results on other tags on the page but specifically not this one.
Thanks in advance.
You might want to improve your DOMDocument debugging skills; here are some hints (Demo):
<?php
header('Content-Type: text/plain;');
$url = file_get_contents('http://www.theweathernetwork.com/weather/cape0005');
$dom = new DOMDocument();
@$dom->loadHTML($url);
$xpath = new DOMXPath($dom);
$tags = $xpath->query("//*[@id='theTemperature']");
foreach ($tags as $i => $tag){
echo $i, ': ', var_dump($tag->nodeValue), ' HTML: ', $dom->saveHTML($tag), "\n";
}
Output the number of the found node, I do it here with $i in the foreach.
var_dump the ->nodeValue, it helps to show what exactly it is.
Output the HTML by making use of the saveHTML function which shows a better picture.
The actual output:
0: string(0) ""
HTML: <p id="theTemperature"></p>
You can easily spot that the element is empty, so the temperature must go in from somewhere else, e.g. via javascript. Check the Network tools of your browser.
What happens is straightforward: the page contains an empty id="theTemperature" element, which is a placeholder to be populated with JavaScript. file_get_contents() just downloads the page without executing JavaScript, so the element remains empty. Try loading the page in the browser with JavaScript disabled to see it for yourself.
The element you're trying to select is indeed empty. The page loads the temperature into that id through ajax. Specifically this script:
http://www.theweathernetwork.com/common/js/master/citypage_ajax.js?cb=201301231338
but when you do a file_get_contents those scripts obviously don't get resolved. I'd go with guido's solution of using the RSS

Retrieve first 10 comments from .html file

I was googling for hours and just cannot find an answer. Please suggest:
Having a .html file that contains only user comments in paragraphs like:
<p>12/02/2012 4:32pm Mark</p>
<p>Hi! it's a nice demo! Really thankful</p>
<hr>
<p>11/02/2012 11:03am Miron</p>
<p>How to change the font size from CFD again?</p>
<hr>
<!-- AND LOADS OF OTHER <P><P> COMMENTS DELIMITED BY <HR> ... -->
There's 1000's of comments structured like this,
I'd like to grab somehow the newest 10 (not by date, just the first 'ten' comments). And I don't know how.
I know I can use jQuery's .load('comments.html') and then remove all the elements but the first 10 comments, or even include the whole file with PHP and then do the .hide() with jQuery... but is it a good idea to load the whole file for just 10 comments?
How to split that file and get inside an <div id="latest_10_comments"></div> the first 10 comments from the comments.html file?
I know you wanted a JavaScript solution but you could do this in PHP by using the explode function.
Something like this:
$comments = explode("<hr>", file_get_contents("/comments.html"));
for($i = 0; $i < 10; $i++) {
print($comments[$i]);
}
This creates an array called $comments in which each element is one comment from comments.html, split on the <hr> tag.
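A slightly more defensive sketch of the same explode approach, which copes with files holding fewer than ten comments and with a trailing <hr>. The helper name firstComments and the inline sample are assumptions for illustration; in practice the input would come from file_get_contents on the comments file.

```php
<?php
// Hypothetical helper: returns at most $limit comment blocks from $html.
function firstComments(string $html, int $limit = 10): array
{
    // Split on <hr>, trim whitespace, and drop empty chunks
    // (for example the one left after a trailing <hr>).
    $chunks = array_map('trim', explode('<hr>', $html));
    $chunks = array_values(array_filter(
        $chunks,
        static function (string $chunk): bool {
            return $chunk !== '';
        }
    ));

    return array_slice($chunks, 0, $limit);
}

// Made-up sample in the same shape as the question's file.
$sample = "<p>12/02/2012 4:32pm Mark</p>\n<p>Hi!</p>\n<hr>\n"
        . "<p>11/02/2012 11:03am Miron</p>\n<p>How?</p>\n<hr>\n";

foreach (firstComments($sample, 10) as $comment) {
    echo $comment, PHP_EOL;
}
```

Using array_slice instead of a fixed loop to index 9 avoids undefined-offset notices when the file is short.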
First, I'd suggest reconsidering your approach to this problem entirely. Why are you storing everything in an HTML file this way? You should either store it as an XML file or store it in your database if you want to dynamically load certain comments on demand.
However, to answer your question you're going to need to use an X/HTML parser like PHP's DomDocument if you want to do this in PHP. Here's a working example...
EDIT (changed to reflect the OP's desired behavior):
$dom = new DomDocument;
$dom->loadHTMLFile("comments.html");
// Get all the P tag elements in the DOM
$comments = $dom->getElementsByTagName('p');
// Get only the first 10
$amount = 10; // number of comments you want
foreach ($comments as $num => $comment_nodes) {
if ($num + 1 > $amount)
break;
echo $comment_nodes->nodeValue, PHP_EOL;
}
Solution 1. You can use a RegEx pattern to match 2 p tags followed by hr and repeat the pattern for 10 times.
Solution 2.
The idea is from the other answer (Chris), but since that code has an error in PHP, I am suggesting this:
$comments = explode("<hr>", file_get_contents("/comments.html"));
for($i = 0; $i < 10; $i++) {
print($comments[$i]);
}
$("p").each(function(index, value)
{
//Do what you want here
});
This will cycle through all your <p>. If you know the order of the elements then you can do what you want with them based on index.

I want to load a specific div from another website in PHP

I have a problem to load specific div element and show on my page using PHP. My code right now is as follows:
<?php
$page = file_get_contents("http://www.bbc.co.uk/sport/football/results");
preg_match('/<div id="results-data" class="fixtures-table full-table-medium">(.*)<\/div>/is', $page, $matches);
var_dump($matches);
?>
I want it to load id="results-data" and show it on my page.
You won't be able to manipulate the URL to get only a portion of the page. So what you'll want to do is grab the page contents via the server-side language of your choice and then parse the HTML. From there you can grab the specific DIV you are looking for and then print that out to your screen. You could also use to remove unwanted content.
With PHP you could use file_get_contents() to read the file you want to parse and then use DOMDocument to parse it and grab the DIV you want.
Here's the basic idea. This is untested but should point you in the right direction:
$page = file_get_contents('http://www.bbc.co.uk/sport/football/results');
$doc = new DOMDocument();
$doc->loadHTML($page);
$divs = $doc->getElementsByTagName('div');
foreach($divs as $div) {
// Loop through the DIVs looking for one with an id of "content"
// Then echo out its contents (pardon the pun)
if ($div->getAttribute('id') === 'content') {
echo $div->nodeValue;
}
}
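Instead of looping over every div, the element can also be picked out directly by id with DOMXPath. The following is a sketch against a made-up fragment; the sample markup is an assumption, and results-data is the id from the question.

```php
<?php
// Made-up fragment standing in for the downloaded page.
$page = '<div id="header">Top</div>'
      . '<div id="results-data"><p>Team A 1-0 Team B</p></div>';

$doc = new DOMDocument();
libxml_use_internal_errors(true); // keep parser warnings out of the output
$doc->loadHTML($page);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$div = $xpath->query("//div[@id='results-data']")->item(0);

// saveHTML($node) serializes just that node with its markup,
// whereas nodeValue would return only the text content.
$fragment = $div ? $doc->saveHTML($div) : '';
echo $fragment, PHP_EOL;
```

Serializing the node keeps the inner tags, which matters if the div is meant to be shown on another page with its table structure intact.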
You should use some html parser. Take a look at PHPQuery, here is how you can do it:
require_once('phpQuery/phpQuery.php');
$html = file_get_contents('http://www.bbc.co.uk/sport/football/results');
phpQuery::newDocumentHTML($html);
$resultData = pq('div#results-data');
echo $resultData;
Check it out here:
http://code.google.com/p/phpquery
Also see their selectors' documentation.

How can I parse a very simple Table using PHP

Good day dear community!
I need to build a function which parses the content of a very simple Table
(with some labels and values); see the URL below. I have used various ways to parse HTML sources, but this one is a bit tricky! See the target I want to parse; it has some invalid markup:
The target: http://www.schulministerium.nrw.de/BP/SchuleSuchen?action=644.0013008534253&SchulAdresseMapDO=194190
Well i tried it with this one
<?php
require_once('config.php'); // call config.php for db connection
$filename = "url.txt"; // Include the txt file which have urls
$each_line = file($filename);
foreach($each_line as $line_num => $line)
{
$line = trim($line);
$content = file_get_contents($line);
//echo ($content)."<br>";
$pattern = '/<td>(.*?)<\/td>/si';
preg_match_all($pattern,$content,$matches);
foreach ($matches[1] as $match) {
$match = strip_tags($match);
$match = trim($match);
//var_dump($match);
$sql = mysqli_query("insert into tablename(contents) values ('$match')");
//echo $match;
}
}
?>
Well - see the regex in line 7-11: it does not match!
Conclusion: I have to rework the parser part of this script. I need to parse in a different way, since the parser code does not match what it is aimed at. The aim is to get back the contents of the table.
Can anybody help me here to get a better regex - or a better way to parse this site ...
Any and all help will be greatly appreciated.
regards
zero
You could tear the table apart using
preg_split('/<td width="73%"> /', $str, -1); (note; i did not bother escaping characters)
You'll want to drop the first entry. Now you can use stripos and substr to cut away everything after the .
This is a basic setup! You will have to fine-tune it quite a bit, but I hope this gives you an idea of what would be my approach.
Regex does not always give a perfect result. Using an HTML parser is a good idea. There are many HTML parsers, as described in Gordon's Answer.
I have used Simple HTML DOM Parser in the past and it worked for me.
For Example:
// Create DOM from URL or file
$html = file_get_html('http://www.example.com/');
// Find all <td> in <table> which class=hello
$es = $html->find('table.hello td');
// Find all td tags with attribute align=center in table tags
$es = $html->find('table td[align=center]');
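For comparison, here is a sketch of reading a simple label/value table with PHP's built-in DOMDocument instead of a regex. The sample rows are made up, and the assumption that each row holds one label cell followed by one value cell may not match the real page.

```php
<?php
// Made-up two-column table: label in the first cell, value in the second.
$html = '<table>'
      . '<tr><td>Schulform</td><td> Gymnasium </td></tr>'
      . '<tr><td>Ort</td><td>Bonn</td></tr>'
      . '</table>';

$dom = new DOMDocument();
// Collect parser warnings internally; pages with invalid markup
// would otherwise emit a warning per problem.
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();

$rows = [];
foreach ($dom->getElementsByTagName('tr') as $tr) {
    $cells = $tr->getElementsByTagName('td');
    // Skip rows that do not have both a label and a value cell.
    if ($cells->length >= 2) {
        $label = trim($cells->item(0)->textContent);
        $rows[$label] = trim($cells->item(1)->textContent);
    }
}
print_r($rows);
```

textContent already excludes nested tags, so no strip_tags call is needed before storing the values.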
