simplehtmldom parsing script to break the data coming in plain text

Here is my script, which fetches three items: Medicine Name, Generic Name, and Class Name. I can fetch the Medicine Name on its own, but the Generic Name and Class Name come back together as a single string. If you run the script you will get a better idea of what I mean. I want to store the Generic Name and Class Name in separate columns of a table.
Script
<?php
error_reporting(0);
//simple html dom file
require('simple_html_dom.php');
//target url
$html = file_get_html('http://www.drugs.com/condition/atrial-flutter.html?rest=1');
//crawl td columns
foreach($html->find('td') as $element)
{
//get drug name
$drug_name = $element->find('b');
foreach($drug_name as $drug_name)
{
echo "Drug Name:-".$drug_name;
foreach($element->find('span[class=small] a',2) as $t)
{
//get the inner HTML
$data = $t->plaintext;
echo $data;
}
echo "<br/>";
}
}
?>
Thanks in advance

Your current code is some way from what you need, but you can use CSS selectors to get those elements more easily.
Example:
$data = array();
$html = file_get_html('http://www.drugs.com/condition/atrial-flutter.html?rest=1');
foreach($html->find('tr td[1]') as $td) { // you do not need to loop each td!
// target the first td of the row
$drug_name = $td->find('a b', 0)->innertext; // get the drug name bold tag inside anchor
$other_info = $td->find('span.small[2]', 0); // get the other info
$generic_name = $other_info->find('a[1]', 0)->innertext; // get the first anchor, generic name
$children_count = count($other_info->children()); // count all of the children
$classes = array();
for($i = 1; $i < $children_count; $i++) { // since you already got the first, (in position zero) iterate all children starting from 1
$classes[] = $other_info->find('a', $i)->innertext; // push it inside another container
}
$data[] = array(
'drug_name' => $drug_name,
'generic_name' => $generic_name,
'classes' => $classes,
);
}
echo '<pre>';
print_r($data);

Related

Simple html dom parser table to array (extended)

There is this website
http://www.oxybet.com/france-vs-iceland/e/5209778/
What I want is to scrape not the full table but PARTS of this table.
For example, I want to display only the rows for sportingbet, stoiximan and mybet, and only the 1, X and 2 columns rather than all of them. The numbers shown in red must also be scraped as they are, with the red box, or at least shown with an asterisk next to them. Can this be done directly, or do I need to scrape the whole table into a database first and then query the database?
What I have now is this code, borrowed from another similar question on this forum:
<?php
require('simple_html_dom.php');
$html = file_get_html('http://www.oxybet.com/france-vs-iceland/e/5209778/');
$table = $html->find('table', 0);
$rowData = array();
foreach($table->find('tr') as $row) {
// initialize array to store the cell data from each row
$flight = array();
foreach($row->find('td') as $cell) {
// push the cell's text to the array
$flight[] = $cell->plaintext;
}
$rowData[] = $flight;
}
echo '<table>';
foreach ($rowData as $row => $tr) {
echo '<tr>';
foreach ($tr as $td)
echo '<td>' . $td .'</td>';
echo '</tr>';
}
echo '</table>';
?>
This returns the full table. Mainly, I want to detect the numbers highlighted with the red box (in the 1 X 2 columns) and display an asterisk next to them in my scrape. Secondly, I want to know whether it is possible to scrape specific columns and rows instead of everything. Do I need to use XPath?
I would be grateful if someone could point me in the right direction. I have spent hours on this, and the manual doesn't explain much: http://simplehtmldom.sourceforge.net/manual.htm

The link is dead. However, you can do this with XPath and reference the cells you want by their colour and position, among many other ways.
This snippet gives the general idea; it is taken from a project I'm working on at the moment:
function __construct($URL)
{
// make new DOM for nodes
$this->dom = new DOMDocument();
// set error level
libxml_use_internal_errors(true);
// Grab and set HTML Source
$this->HTMLSource = file_get_contents($URL);
// Load HTML into the dom
$this->dom->loadHTML($this->HTMLSource);
// Make xPath queryable
$this->xpath = new DOMXPath($this->dom);
}
function xPathQuery($query){
return $this->xpath->query($query);
}
Then simply pass a query to your DOMXPath, like //tr[1]
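As a rough illustration of picking specific rows and columns with DOMXPath, here is a minimal sketch. Since the original page is no longer available, the markup is invented: the column order (bookmaker, 1, X, 2, extra) and the class name "red" on highlighted cells are assumptions, not the real site's structure.

```php
<?php
// Hypothetical markup: column order and the "red" class are assumptions.
$html = <<<HTML
<table>
<tr><td>sportingbet</td><td class="red">1.50</td><td>3.80</td><td>6.00</td><td>x</td></tr>
<tr><td>otherbook</td><td>1.45</td><td>4.00</td><td>6.50</td><td>x</td></tr>
<tr><td>mybet</td><td>1.48</td><td class="red">3.90</td><td>6.20</td><td>x</td></tr>
</table>
HTML;

$wanted = array('sportingbet', 'stoiximan', 'mybet');

$dom = new DOMDocument();
libxml_use_internal_errors(true); // tolerate sloppy real-world HTML
$dom->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($dom);

$rows = array();
foreach ($xpath->query('//table//tr') as $tr) {
    $cells = $xpath->query('td', $tr); // direct td children of this row
    if ($cells->length < 4) {
        continue; // header or malformed row
    }
    $bookmaker = trim($cells->item(0)->textContent);
    if (!in_array($bookmaker, $wanted, true)) {
        continue; // skip bookmakers we are not interested in
    }
    $odds = array();
    for ($i = 1; $i <= 3; $i++) { // keep only the 1, X and 2 columns
        $cell = $cells->item($i);
        $text = trim($cell->textContent);
        if ($cell->getAttribute('class') === 'red') {
            $text .= '*'; // mark highlighted cells with an asterisk
        }
        $odds[] = $text;
    }
    $rows[$bookmaker] = $odds;
}
print_r($rows);
```

On a real page you would need to inspect the markup to find how the red box is actually expressed (a class, an inline style, a wrapping element) and adjust the check accordingly.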

Display first 4 columns of external table

I am using Windows software to organize a tourpool. This program creates (among other things) HTML pages with rankings of participants. But these HTML pages are quite hideous, so I am building a site around it.
To show the top 10 ranking I need to select the first 10 out of about 1000 participants of the generated HTML file and put it on my own site.
To do this, I used:
// get top 10 ranks of p_rnk.htm
$file_contents = file_get_contents('p_rnk.htm');
$start = strpos($file_contents, '<tr class="header">');
// get end
$i = 11;
while (strpos($file_contents, '<tr><td class="position">'. $i .'</td>', $start) === false){
$i++;
}
$end = strpos($file_contents, '<td class="position">'. $i .'</td>', $start);
$code = substr($file_contents, $start, $end - $start); // third argument is a length, not an end offset
echo $code;
This works, except that the last 3 columns (previous position, up or down, and details) are useless information. So I want to delete these columns, or find a way to select and display only the first 4.
How do I manage this?
EDIT
I adjusted my code and at the end I only echo the adjusted table.
<?php
$DOM = new DOMDocument;
$DOM->loadHTMLFile("p_rnk.htm");
$table = $DOM->getElementsByTagName('table')->item(0);
$rows = $table->getElementsByTagName('tr');
$cut_rows_after = 10;
$cut_colomns_after = 3;
$row_index = $rows->length-1;
while($row = $rows->item($row_index)) {
if($row_index+1 > $cut_rows_after)
$table->removeChild($row);
else {
$tds = $row->getElementsByTagName('td');
$colomn_index = $tds->length-1;
while($td = $tds->item($colomn_index)) {
if($colomn_index+1 > $cut_colomns_after)
$row->removeChild($td);
$colomn_index--;
}
}
$row_index--;
}
echo $DOM->saveHTML($table);
?>

I'd say the best way to deal with this is to parse the HTML document (see, for instance, the first answer here) and then manipulate the object that describes the DOM. This way you can easily extract the table itself using various selectors, get your first 10 records more simply, and remove the unnecessary child (td) nodes from each row (using removeChild). When you're done modifying, dump the resulting HTML using saveHTML.
Update:
OK, here's tested code. I removed the need to hardcode the numbers of columns and rows and moved the desired numbers into a couple of variables (so that you can adjust them if needed). Take a closer look at the code: you'll notice some details that were missing in your code. The index runs 0..999, not 1..1000, which is why all those -1s and +1s appear. It's better to decrease the index than to increase it, because then you don't have to care about numbering shifts when removing nodes. I've also used while instead of for, so that the case $rows->item($row_index) == null needs no separate handling:
<?php
$DOM = new DOMDocument;
$DOM->loadHTMLFile("./table.html");
$table = $DOM->getElementsByTagName('tbody')->item(0);
$rows = $table->getElementsByTagName('tr');
$cut_rows_after = 10;
$cut_colomns_after = 4;
$row_index = $rows->length-1;
while($row = $rows->item($row_index)) {
if($row_index+1 > $cut_rows_after)
$table->removeChild($row);
else {
$tds = $row->getElementsByTagName('td');
$colomn_index = $tds->length-1;
while($td = $tds->item($colomn_index)) {
if($colomn_index+1 > $cut_colomns_after)
$row->removeChild($td);
$colomn_index--;
}
}
$row_index--;
}
echo $DOM->saveHTML();
?>
Update 2:
If the page doesn't contain tbody, use the container which is present. For instance, if tr elements are inside a table element, use $DOM->getElementsByTagName('table') instead of $DOM->getElementsByTagName('tbody').

Parsing HTML Table based on nearby header tag using DOMDocument and DOMXPath

I have a simple PHP app that parses HTML content and extracts data from the td cells that match a certain query.
HTML Code:
<html>
<h3>HELLO WORLD</h3>
<table>
<tr><td>A</td><td>A2</td></tr>
<tr><td>B</td><td>B2</td></tr>
...
...
</table>
<h3>HELLO AMERICA</h3>
<table>
<tr><td>A</td><td>A3</td></tr>
<tr><td>C</td><td>C2</td></tr>
...
...
</table>
<h3>HELLO TEXAS</h3>
<table>
<tr><td>D</td><td>D2</td></tr>
<tr><td>E</td><td>E2</td></tr>
...
...
</table>
</html>
PHP Code to parse the table
$content = file_get_contents($html_string);
$dom = new DOMDocument();
@$dom->loadHTML($content);
$xpath = new DOMXPath($dom);
$query = "//tr/td[position()=1 and normalize-space(text()) = '".$q."']";
$entries = $xpath->query($query);
$entryCount = $entries->length;
if ($entryCount==1){
$entry = $entries->item(0);
$tr = $entry->parentNode;
foreach ($tr->getElementsByTagName("td") as $td) {
$fieldnames[] = $td->textContent;
}
//Return data set
$data[] = $fieldnames;
return $data;
}
else {
$data = array();
for ($i=0;$i<$entryCount;$i++){
$fieldnames = [];
$entry = $entries->item($i);
$tr = $entry->parentNode;
foreach ($tr->getElementsByTagName("td") as $td) {
$fieldnames[] = $td->textContent;
}
$data[] = $fieldnames;
}
return $data;
}
Basically this goes through all 3 tables. Let's say I send a query ($q = A); it will return:
$data[0][0] => A, $data[0][1] => A2
$data[1][0] => A, $data[1][1] => A3
However, I only want the data from the first table (A and A2). The tables are 'naked': no ID, no class, nor any other identification. The only thing that identifies them is the h3 tag. Let's say I provide a query that specifies the h3 ($q2 = HELLO WORLD); is it possible to extract data from only the first table?

You want to use the preceding-sibling axis and the [1] positional predicate (or whatever it’s formally called), and look at the text content of the h3 elements to find whichever h3 element is the one right before the table you want; so, I think, this:
//table[preceding-sibling::h3[1][. = "HELLO WORLD"]]
Or, to get the specific stuff within that which the code in your example is looking for,
//table[preceding-sibling::h3[1][. = "HELLO WORLD"]]/tr/td[position()=1 and normalize-space(text()) = '".$q."']
And if you did later happen to want to get any of the other tables, just swap out the text in that expression; for example, the following will get just the last one in your example.
//table[preceding-sibling::h3[1][. = "HELLO TEXAS"]]
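To see the expression in context, here is a minimal sketch that runs it against markup modeled on the question's sample (the elided rows are left out). The helper name rowsUnderHeading is made up for illustration, and the sketch assumes that neither the heading nor the search value contains a double quote, since both are pasted straight into the XPath string.

```php
<?php
// Sample markup modeled on the question's HTML (elided rows left out).
$html = <<<HTML
<html><body>
<h3>HELLO WORLD</h3>
<table>
<tr><td>A</td><td>A2</td></tr>
<tr><td>B</td><td>B2</td></tr>
</table>
<h3>HELLO AMERICA</h3>
<table>
<tr><td>A</td><td>A3</td></tr>
<tr><td>C</td><td>C2</td></tr>
</table>
</body></html>
HTML;

// Hypothetical helper: returns the cells of every row whose first cell
// equals $q, looking only at the table that follows the heading $heading.
// Assumes neither argument contains a double quote.
function rowsUnderHeading($html, $heading, $q)
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true); // suppress warnings for loose HTML
    $dom->loadHTML($html);
    libxml_clear_errors();
    $xpath = new DOMXPath($dom);

    $query = '//table[preceding-sibling::h3[1][. = "' . $heading . '"]]'
           . '//tr/td[position()=1 and normalize-space(text()) = "' . $q . '"]';

    $data = array();
    foreach ($xpath->query($query) as $td) {
        $fields = array();
        // Collect every cell of the row that the matching td belongs to.
        foreach ($td->parentNode->getElementsByTagName('td') as $cell) {
            $fields[] = $cell->textContent;
        }
        $data[] = $fields;
    }
    return $data;
}

$data = rowsUnderHeading($html, 'HELLO WORLD', 'A');
print_r($data);
```

With this sample, only the row A / A2 from the first table is returned; passing 'HELLO AMERICA' instead yields A / A3.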

Inserting numerical ID's in paragraphs (PHP/MySQL DB Query)

I have a pretty ordinary query that displays articles stored in a database table (field = 'Article')...
while ($row = $stm->fetch())
{
$Content = $row['Article'];
}
echo $Content;
I'd like to know how I can modify the display so that every paragraph has a numerical ID. For example, the first paragraph would be [p id="1"], the second one [p id="2"] and so on. However, it would be even better if the last paragraph displayed as [p id="Last"].
(Sorry, I forgot how to post inline code, so I replaced the tags (e.g. <) with brackets.)
My goal is to simply get more control over my content. For example, there are certain items that I want to include after the first paragraph on some pages, and I might want to include a certain feature before paragraph#4 on one special page.
ON EDIT... Neither of the methods suggested below worked for me, but it's probably because I simply didn't implement them correctly; the code in both examples isn't familiar to me. At any rate, I'm bookmarking this page so I can learn more about those scripts.
In the meantime, I finally found a regex solution. (I think preg_replace is another word for regex, right?)
This inserts a numerical ID in each paragraph tag:
$c = 1;
$r = preg_replace('/(<p( [^>]+)?>)/ie', '"<p\2 id=\"" . $c++ . "\">"', $Article);
$Article = $r;
This changes the ID in the last paragraph tag to "Last"...
$c = 1;
$r = preg_replace('/(<p( [^>]+)?>)/ie', '"<p\2 id=\"" . $c++ . "\">"', $Article);
$r = preg_replace('/(<p.*?)id="'.($c-1).'"(>)/i', '\1id="Last"\2', $r);
$Article = $r;
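Note that the /e modifier used above was deprecated in PHP 5.5 and removed in PHP 7.0, so those snippets no longer run on current PHP. A minimal sketch of the same idea with preg_replace_callback follows; the sample $Article string is invented for illustration, and the pattern is slightly adapted from the one above.

```php
<?php
// Sample input; in the question this would come from $row['Article'].
$Article = '<p>First</p><p class="intro">Second</p><p>Third</p>';

// Matches an opening <p> tag with optional attributes (but not <pre>).
$pattern = '/<p(\s[^>]*)?>/i';

// Count the paragraphs first so the last one can be labelled "Last".
$total = preg_match_all($pattern, $Article);

$c = 0;
$r = preg_replace_callback(
    $pattern,
    function ($m) use (&$c, $total) {
        $c++;
        $id = ($c === $total) ? 'Last' : (string) $c;
        $attrs = isset($m[1]) ? $m[1] : ''; // existing attributes, if any
        return '<p' . $attrs . ' id="' . $id . '">';
    },
    $Article
);
$Article = $r;
echo $Article;
```

Counting the matches up front avoids the second replacement pass; a paragraph that already carries an id attribute would end up with two, which this sketch does not handle.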

Assuming your HTML is well-formed, you could use the SimpleXMLElement class to do so:
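$sxe = new SimpleXMLElement($row['Article']);
$i = 1;
foreach ($sxe->children() as $p) {
$p->addAttribute('id', (string) $i++); // number each paragraph, starting at 1
}
$p['id'] = 'Last'; // overwrite the id attribute of the last paragraph
echo $sxe->asXML(); // asXML() serializes the markup; __toString() returns only text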
If it isn't well-formed, you could use the DOMDocument class instead:
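$dom = new DOMDocument;
$dom->loadHTML($row['Article']);
$i = 1;
foreach ($dom->getElementsByTagName('p') as $p) {
$p->setAttribute('id', (string) $i++); // number each paragraph, starting at 1
}
$p->setAttribute('id', 'Last'); // relabel the last paragraph
echo $dom->saveHTML();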

Order an XML scoreboard and return it with PHP

I'm creating a game where when a timer ends the user can enter their name. I pass the name and score to a PHP file with AJAX. That PHP file adds it to an XML file. I loop through it to create a table holding the scores and it's then returned to AJAX and I then output it on the screen with jQuery. I have all of this working fine right now.
What I want to accomplish is this:
1. After the score is added to the XML file I want to order the nodes according to score, in descending order
2. I then want to populate the table with the values in order. I'd also like to limit it to only the top 10 scores.
Basically where I'm running into problems coming up with a solution is the ordering. Once the XML is ordered populating the table and limiting it to 10 should be pretty straight forward. Any suggestions on how I should do this?
XML : http://people.rit.edu/lxl1500/Prog4/Project%202/scoreboard.xml
jQuery Ajax call:
function addScore(score,name){
var url = 'scoreboard.php';
$.post(url,{Score:score, Name:name},function(data){
$('#score').html(data).show();
});
}
scoreboard.php:
<?php
$score = $_POST['Score'];
$name = $_POST['Name'];
if($name == ''){
$name = 'Player'.rand(0,5000);
}
$scoreboard = new domDocument;
$scoreboard->load('scoreboard.xml');
$root=$scoreboard->documentElement;
$entry = $scoreboard->createElement('entry');
$userScore = $scoreboard->createElement('score',$score);
$userName = $scoreboard->createElement('name',$name);
$entry->appendChild($userName);
$entry->appendChild($userScore);
$root->appendChild($entry);
$scoreboard->save('scoreboard.xml');
$scores = $scoreboard->getElementsByTagName('entry');
$string = '<table id="score-table" cellspacing="10"><tbody><tr><th align="left">User</th><th align="left">Score</th></tr>';
foreach($scores as $score){
$getScore = $score->getElementsByTagName('score')->item(0)->nodeValue;
$getPlayer = $score->getElementsByTagName('name')->item(0)->nodeValue;
$string.="<tr><td>$getPlayer</td><td>$getScore</td></tr>";
}
$string.='</tbody></table>';
echo $string;
?>
Any help would be greatly appreciated! Thanks.

You can keep the XML file sorted, that is, add each new node at its sorted position, something like:
$entries = $root->getElementsByTagName('entry');
$added = false;
foreach ($entries as $item) {
if ($score <= $item->getElementsByTagName('score')->item(0)->nodeValue) continue;
$root->insertBefore($entry, $item);
$added = true;
break;
}
// if not yet added, add it
if (!$added) {
$root->appendChild($entry);
}
For this to work the file has to be sorted (or empty).
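If the file cannot be assumed to be sorted already, an alternative is to sort at display time and keep only the top 10. Below is a minimal sketch of that approach. The inline XML stands in for scoreboard.xml; the entry, name and score element names follow the question, while the root element name and the sample players are assumptions.

```php
<?php
// Sample data standing in for scoreboard.xml (root element name assumed).
$xml = '<scoreboard>'
     . '<entry><name>Ann</name><score>40</score></entry>'
     . '<entry><name>Bob</name><score>95</score></entry>'
     . '<entry><name>Cy</name><score>70</score></entry>'
     . '</scoreboard>';

$scoreboard = new DOMDocument();
$scoreboard->loadXML($xml);

// Copy the entries into a plain array so they can be sorted.
$entries = array();
foreach ($scoreboard->getElementsByTagName('entry') as $entry) {
    $entries[] = array(
        'name'  => $entry->getElementsByTagName('name')->item(0)->nodeValue,
        'score' => (int) $entry->getElementsByTagName('score')->item(0)->nodeValue,
    );
}

// Highest score first, then keep at most the top 10.
usort($entries, function ($a, $b) {
    return $b['score'] - $a['score'];
});
$top = array_slice($entries, 0, 10);

// Build the table from the sorted, truncated list.
$string = '<table id="score-table"><tbody><tr><th>User</th><th>Score</th></tr>';
foreach ($top as $row) {
    $string .= '<tr><td>' . htmlspecialchars($row['name']) . '</td><td>'
             . $row['score'] . '</td></tr>';
}
$string .= '</tbody></table>';
echo $string;
```

This leaves the XML file in insertion order and sorts on every request, which is simple but does more work per request than keeping the file sorted.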
