Extracting data from a website table

The Goal.com page linked below contains a table. I want to know how to store the strings in its Player Name column in a variable or a database.
I need this because my code has a variable called $player. A custom function stores a different string in it every 24 hours, and that string is printed on my site.
If $player is equal to any string in the Player Name column on Goal.com, I want to re-run the function so that a different string is stored in the variable and printed on my website.
TABLE: http://www.goal.com/en/scores/transfer-zone?ICID=TZ_DD1_VA

PHP Simple HTML DOM Parser can do the job for you: http://simplehtmldom.sourceforge.net/
Download simple_html_dom.php here: http://sourceforge.net/projects/simplehtmldom/files/simple_html_dom.php/download
Here is a full example. Note that it queries the page with PHP's built-in DOMDocument and DOMXPath classes.
<?php
include("simple_html_dom.php");
// Collect HTML parse warnings internally instead of printing them.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTMLFile("http://www.goal.com/en/scores/transfer-zone?ICID=TZ_DD1_VA");
$xpath = new DOMXPath($doc);
// Select every cell in the Player Name column.
$player_names = $xpath->query("//td[@class='player_name_col']");
foreach ($player_names as $player_name) {
    echo $player_name->nodeValue . "<br />";
}
?>
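The second half of the question (re-running the function when $player matches a scraped name) could be sketched as below. This is only an illustration under assumptions: the inline HTML is a stand-in for the Goal.com table, whose real markup may differ, and pickPlayer() and $pool are hypothetical placeholders for the asker's custom function and its data.

```php
<?php
// Inline stand-in for the downloaded page; the real markup may differ.
$html = '<table>
  <tr><td class="player_name_col">Player A</td><td>Club X</td></tr>
  <tr><td class="player_name_col">Player B</td><td>Club Y</td></tr>
</table>';

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Collect the trimmed cell text into a plain array.
$names = [];
foreach ($xpath->query("//td[@class='player_name_col']") as $cell) {
    $names[] = trim($cell->nodeValue);
}

// Hypothetical replacement for the asker's custom function.
function pickPlayer(array $pool): string
{
    return $pool[array_rand($pool)];
}

$pool = ['Player A', 'Player C', 'Player D'];
$player = 'Player A';

// Re-run the picker while the result appears in the scraped list
// (bounded so a pool that only contains scraped names cannot loop forever).
for ($tries = 0; $tries < 100 && in_array($player, $names, true); $tries++) {
    $player = pickPlayer($pool);
}
echo $player . "\n";
```

The scraped names could just as well be written to a database table and compared with a query; the in-memory array is simply the shortest way to show the check.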

Related

Simple html dom parser get tr from table

I am trying to scrape http://spys.one/free-proxy-list/ but I only want the Proxy ip:port column.
I checked the website and there are 3 tables.
Can anyone help me out?
<?php
require "scrapper/simple_html_dom.php";
$html=file_get_html("http://spys.one/free-proxy-list/");
$html=new simple_html_dom($html);
$rows = array();
$table = $html->find('table',3);
var_dump($table);

Try the below script. It should fetch you only the required items and nothing else:
<?php
include 'simple_html_dom.php';
$url = "http://spys.one/free-proxy-list/";
$html = file_get_html($url);
foreach ($html->find("table[width='65%'] tr[onmouseover]") as $file) {
    $data = $file->find('td', 0)->plaintext;
    echo $data . "<br/>";
}
?>
It produces output like:
176.94.2.84
178.150.141.93
124.16.84.208
196.53.99.7
31.146.161.238

I really don't know what your Simple HTML DOM library does. In any case, PHP nowadays ships with everything you need to parse specific DOM elements: just use its built-in DOMXPath class to query them.
Here's a short example that gets the first column of a table.
$dom = new \DOMDocument();
// loadHTMLFile() fetches a file or URL; loadHTML() expects the markup itself
$dom->loadHTMLFile('https://your.url.goes.here');
$xpath = new \DOMXPath($dom);
// query the first cell with class "value" in each row of the table with class "attributes"
$elements = $xpath->query('//table[@class="attributes"]//tr/td[@class="value"][1]');
// iterate through all found td elements
foreach ($elements as $element) {
    echo $element->nodeValue;
}
This is only one possible example. It does not solve your exact problem with http://spys.one/free-proxy-list/, but it shows how easily you can get the first column of a specific table. What remains is to find the right query for the table you want in the DOM of that site. Because the site uses a fairly complex, old-fashioned table layout and the table you want to parse has no unique id or similar hook, you will have to work that query out yourself.
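To make the first-column idea concrete without depending on a live site, here is a small self-contained sketch. The inline table, its class names, and the addresses in it are invented for illustration; the real page will need its own query.

```php
<?php
// Invented sample markup standing in for a downloaded page.
$html = '<table class="attributes">
  <tr><td class="value">10.0.0.1</td><td class="value">8080</td></tr>
  <tr><td class="value">10.0.0.2</td><td class="value">3128</td></tr>
</table>';

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// td[1] selects the first cell of each row, i.e. the first column.
$firstColumn = [];
foreach ($xpath->query('//table[@class="attributes"]//tr/td[1]') as $cell) {
    $firstColumn[] = trim($cell->nodeValue);
}
print_r($firstColumn);
```

The positional predicate is applied per row, which is why the query returns one cell for every tr rather than a single cell for the whole table.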

How can I parse a MediaWiki table of contents (sommaire) and get its HTML code with PHP?

Example with a MediaWiki link: https://www.visionduweb.eu/wiki/index.php?title=Utiliser_PHP
View the source code and locate the table of contents (sommaire) of this MediaWiki page.
I am looking for a way to parse the source code and extract the HTML code of this table of contents.
I tried $domExemple = $xpath->query("//ul/li"); but I get too many results, and they are poorly formatted.
I tried $domExemple = $xpath->query("//ul/li[@class='toclevel-1 tocsection-1']"); which gives me the result, but how can I get every toclevel and tocsection without having to specify the number 1, 2, 3, ... for each one?
In this example I only get the text content, not the HTML content.
I would prefer to retrieve the HTML content.

I believe you can simplify your xpath expression using the syntax defined here:
How can I match on an attribute that contains a certain string?
Try something like this:
$results = $xpath->query('//ul/li[contains(@class, "toclevel-") and contains(@class, "tocsection-")]');
foreach ($results as $li) {
    // to get html of $li, import it into a fresh DOMDocument and run saveHTML
    $newdoc = new DOMDocument();
    $cloned = $li->cloneNode(true);
    $newdoc->appendChild($newdoc->importNode($cloned, true));
    echo $newdoc->saveHTML();
}
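As a shorter alternative for getting the markup of each match, DOMDocument::saveHTML() also accepts a node argument (PHP 5.3.6 and later) and returns that node's outer HTML. The sketch below assumes a simplified table of contents; the real MediaWiki markup has more nesting and attributes.

```php
<?php
// Simplified stand-in for a MediaWiki table of contents.
$html = '<ul>
  <li class="toclevel-1 tocsection-1"><a href="#Intro">Intro</a></li>
  <li class="toclevel-1 tocsection-2"><a href="#Usage">Usage</a></li>
  <li class="other">Not part of the TOC</li>
</ul>';

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Match any li whose class mentions both a toclevel and a tocsection.
$query = '//ul/li[contains(@class, "toclevel-") and contains(@class, "tocsection-")]';

$items = [];
foreach ($xpath->query($query) as $li) {
    // Passing the node to saveHTML() serializes just that element.
    $items[] = $dom->saveHTML($li);
}
echo implode("\n", $items), "\n";
```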

Getting table cell TD value using XPath and DOM in PHP

I need to access table cell values via DOM / PHP. The web page is loaded into $myHTML. I have identified the XPath as :
//*[@id="main-content-inner"]/div[2]/div[1]/div/div/table/tbody/tr/td[1]
I want to get the text of the value in the cell as follows:
$dom = new DOMDocument();
$dom->loadHTML($myHTML);
$xpath = new DOMXPath($dom);
$myValue = $xpath->query('//*[@id="main-content-inner"]/div[2]/div[1]/div/div/table/tbody/tr/td[1]');
echo $myValue->nodeValue;
But I am getting an "Undefined property: DOMNodeList::$nodeValue" error. How do I retrieve the value of this table cell? I have tried various techniques from Stack Overflow with no luck.

DOMXPath::query() returns a DOMNodeList, even if there's only one match.
If you know for sure you have a match there, you can use
echo $myValue->item(0)->nodeValue;
But if you want to be bulletproof, you had better check the length first, e.g.
if ($myValue->length > 0) {
    echo $myValue->item(0)->nodeValue;
} else {
    // No such cell. What now?
}

PHP XPath query returns nothing

I've recently been playing with DOMXPath in PHP and have had success with it. To get more experience, I've been grabbing certain elements from different sites. I am having trouble getting the weather marker from this website: http://www.theweathernetwork.com/weather/cape0005
Specifically I want
//*[@id='theTemperature']
Here is what I have
$url = file_get_contents('http://www.theweathernetwork.com/weather/cape0005');
$dom = new DOMDocument();
@$dom->loadHTML($url);
$xpath = new DOMXPath($dom);
$tags = $xpath->query("//*[@id='theTemperature']");
foreach ($tags as $tag) {
    echo $tag->nodeValue;
}
Is there something I am doing wrong here? I am able to produce actual results on other tags on the page but specifically not this one.
Thanks in advance.

You might want to improve your DOMDocument debugging skills. Here are some hints (Demo):
<?php
header('Content-Type: text/plain;');
$url = file_get_contents('http://www.theweathernetwork.com/weather/cape0005');
$dom = new DOMDocument();
@$dom->loadHTML($url);
$xpath = new DOMXPath($dom);
$tags = $xpath->query("//*[@id='theTemperature']");
foreach ($tags as $i => $tag) {
    echo $i, ': ', var_dump($tag->nodeValue), ' HTML: ', $dom->saveHTML($tag), "\n";
}
Output the number of the found node, I do it here with $i in the foreach.
var_dump the ->nodeValue, it helps to show what exactly it is.
Output the HTML by making use of the saveHTML function which shows a better picture.
The actual output:
0: string(0) ""
HTML: <p id="theTemperature"></p>
You can easily spot that the element is empty, so the temperature must go in from somewhere else, e.g. via javascript. Check the Network tools of your browser.

What happens is straightforward: the page contains an empty id="theTemperature" element, which is a placeholder to be populated by JavaScript. file_get_contents() just downloads the page without executing any JavaScript, so the element remains empty. Load the page in a browser with JavaScript disabled to see it for yourself.

The element you're trying to select is indeed empty. The page loads the temperature into that id through ajax. Specifically this script:
http://www.theweathernetwork.com/common/js/master/citypage_ajax.js?cb=201301231338
But when you call file_get_contents(), those scripts are never executed. I'd go with guido's solution of using the RSS feed.

PHP/HTML - Multiple page screen scrape, export to .txt with commas between dates and values

I am attempting to scrape the web page (see code), as well as those pages going back in time (you can see the date '20110509' in the page itself), for simple numerical strings. I can't seem to figure out through much trial and error (I'm new to programming) how to parse the specific data in the table that I want. I have been trying to use simple PHP/HTML without curl or other such things. Is this possible? I think my main issue is using the delimiters that are necessary to get the data from the source code.
What I'd like is for the program to start at the very first page it can, say for example '20050101', and scan through each page till the current date, grabbing the specific data for example, the "latest close" (column), "closing arm" (row), and have that value for the corresponding date exported to a single .txt file, with the date being separated from the value with a comma. Each time the program is run, the date/value should be appended to the existing text file.
I am aware many lines of the code below are junk, it's part of my learning process.
<html>
<title>HTML with PHP</title>
<body>
<?php
$rawdata = file_get_contents('http://online.wsj.com/mdc/public/page/2_3021-tradingdiary2-20110509.html?mod=mdc_pastcalendar');
//$data = substr(' ', $data);
//$begindate = '20050101';
//$newlines = array("\t","\n","\r","\x20\x20","\0","\x0B");
//if (preg_match(' <td class="text"> ' , $data , $content)) {
//$content = str_replace($newlines
echo $rawdata;
///file_put_contents( 'NYSETRIN.html' , $content , FILE_APPEND);
?>
<b>some more html</b>
<?php
?>
</body>
</html>

All right so let's do this. We're going to first load the data into an HTML parser, then create an XPath parser out of it. XPath will help us navigate around the HTML easily. So:
$date = "20110509";
$data = file_get_contents("http://online.wsj.com/mdc/public/page/2_3021-tradingdiary2-{$date}.html?mod=mdc_pastcalendar");
$doc = new DOMDocument();
@$doc->loadHTML($data);
$xpath = new DOMXpath($doc);
Now then we need to grab some data. First off let's get all the data tables. Looking at the source, these tables are indicated by a class of mdcTable:
$result = $xpath->query("//table[@class='mdcTable']");
echo "Tables found: {$result->length}\n";
So far:
$ php test.php
Tables found: 5
Okay so we have the tables. Now we need to get specific column. So let's use the latest close column you mentioned:
$result = $xpath->query("//table[@class='mdcTable']/*/td[contains(.,'Latest close')]");
foreach ($result as $td) {
    echo "Column contains: {$td->nodeValue}\n";
}
The result so far:
$ php test.php
Column contains: Latest close
Column contains: Latest close
Column contains: Latest close
... etc ...
Now we need the column index for getting the specific column for the specific row. We do this by counting all of the previous sibling elements, then adding one. This is because element index selectors are 1 indexed, not 0 indexed:
$result = $xpath->query("//table[@class='mdcTable']/*/td[contains(.,'Latest close')]");
$column_position = $xpath->query('preceding-sibling::*', $result->item(0))->length + 1;
echo "Position is: $column_position\n";
Result is:
$ php test.php
Position is: 2
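The sibling-counting step can be tried in isolation on a tiny table. This is a sketch with invented sample markup (the real WSJ page is much larger), and it lets XPath do the counting through DOMXPath::evaluate() rather than counting a node list in PHP.

```php
<?php
// Invented two-row sample in the shape the walkthrough describes.
$html = '<table class="mdcTable">
  <tr><td>Issues</td><td>Latest close</td><td>Previous close</td></tr>
  <tr><td>Closing Arms</td><td>1.26</td><td>0.98</td></tr>
</table>';

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Locate the header cell of the wanted column.
$header = $xpath->query("//table[@class='mdcTable']//td[contains(., 'Latest close')]")->item(0);

// evaluate() returns the XPath count as a float; one preceding sibling
// means the cell sits in position 2 because XPath positions start at 1.
$position = (int) $xpath->evaluate('count(preceding-sibling::*)', $header) + 1;

// Use that position to read the matching cell from the data row.
$value = $xpath->query("//td[starts-with(., 'Closing Arms')]/../*[$position]")->item(0)->nodeValue;

echo "Position is: $position, value is: $value\n";
```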
Now we need to get our specific row:
$data_row = $xpath->query("//table[@class='mdcTable']/*/td[starts-with(.,'Closing Arms')]");
echo "Returned {$data_row->length} row(s)\n";
Here we use starts-with, since the row label has a utf-8 symbol in it. This makes it easier. Result so far:
$ php test.php
Returned 4 row(s)
Now we need to use the column index to get the data we want:
$data_row = $xpath->query("//table[@class='mdcTable']/*/td[starts-with(.,'Closing Arms')]/../*[$column_position]");
foreach ($data_row as $row) {
    echo "{$date},{$row->nodeValue}\n";
}
Result is:
$ php test.php
20110509,1.26
20110509,1.40
20110509,0.32
20110509,1.01
Which can now be written to a file. Now, we don't have the markets these apply to, so let's go ahead and grab those:
$headings = array();
$market_headings = $xpath->query("//table[@class='mdcTable']/*/td[@class='colhead'][1]");
foreach ($market_headings as $market_heading) {
    $headings[] = $market_heading->nodeValue;
}
Now we can use a counter to reference which market we're on:
$data_row = $xpath->query("//table[@class='mdcTable']/*/td[starts-with(.,'Closing Arms')]/../*[$column_position]");
$i = 0;
foreach ($data_row as $row) {
    echo "{$date},{$headings[$i]},{$row->nodeValue}\n";
    $i++;
}
The output being:
$ php test.php
20110509,NYSE,1.26
20110509,Nasdaq,1.40
20110509,NYSE Amex,0.32
20110509,NYSE Arca,1.01
Now for your part:
This can be made into a function that takes a date
You'll need code to write out the file. Check out the filesystem functions for hints
This can be made extendible to use different columns and different rows
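One way those three points might fit together is sketched below. It is not the only possible design: closingArmsLines() is a hypothetical helper name, the sample markup is invented, and the same sample is reused for both dates only to keep the example offline. In real use each date's page would be fetched with file_get_contents() as shown earlier.

```php
<?php
// Hypothetical helper: returns "date,value" lines for the "Closing Arms" rows
// of the given column. $html is the already-downloaded page for $date.
function closingArmsLines(string $html, string $date, string $column = 'Latest close'): array
{
    libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    $xpath = new DOMXPath($doc);

    $header = $xpath->query("//table[@class='mdcTable']//td[contains(., '$column')]")->item(0);
    if ($header === null) {
        return []; // column not present on this page
    }
    $position = (int) $xpath->evaluate('count(preceding-sibling::*)', $header) + 1;

    $lines = [];
    $cells = $xpath->query("//table[@class='mdcTable']//td[starts-with(., 'Closing Arms')]/../*[$position]");
    foreach ($cells as $cell) {
        $lines[] = $date . ',' . trim($cell->nodeValue);
    }
    return $lines;
}

// Invented sample page, reused for every date in this offline sketch.
$sample = '<table class="mdcTable">
  <tr><td>Issues</td><td>Latest close</td><td>Previous close</td></tr>
  <tr><td>Closing Arms</td><td>1.26</td><td>0.98</td></tr>
</table>';

$file = tempnam(sys_get_temp_dir(), 'trin');
foreach (['20110509', '20110510'] as $date) {
    $lines = closingArmsLines($sample, $date);
    if ($lines) {
        // FILE_APPEND adds to the end of the file instead of overwriting it.
        file_put_contents($file, implode("\n", $lines) . "\n", FILE_APPEND);
    }
}
echo file_get_contents($file);
```

Passing the column label as a parameter is what makes the helper reusable for other columns; a row label parameter could be added the same way.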

I'd recommend using the HTML Agility Pack. It's an HTML parser that is very handy for finding particular content within an HTML document.
