Beautiful Soup [Python] and the extracting of text in a table - php

I am new to Python and to Beautiful Soup. I have heard that Beautiful Soup is a great tool for parsing and extracting content, so here I am.
I want to take the content of the first td of a table in an HTML
document. For example, I have this table:
<table class="bp_ergebnis_tab_info">
<tr>
<td>
This is a sample text
</td>
<td>
This is the second sample text
</td>
</tr>
</table>
How can I use Beautiful Soup to take the text "This is a sample text"?
I use soup.findAll('table', attrs={'class':'bp_ergebnis_tab_info'}) to get
the whole table.
Thanks. Or should I try to get the whole thing with Perl, which I am not so familiar with? Another solution would be a regex in PHP.
See the target: http://www.schulministerium.nrw.de/BP/SchuleSuchen?action=799.601437941842&SchulAdresseMapDO=142323
Note: since the HTML is a bit invalid, I think we have to do some cleaning first. That can take a lot of PHP code, since we want to solve the job in PHP. Perl would be a good solution too.
Many thanks for any hints and ideas for a starting point.
zero

First find the table (as you are doing). Using find rather than findAll returns the first match directly, rather than a list of all matches (with findAll we would have to add an extra [0] to take the first element of the list):
table = soup.find('table', attrs={'class':'bp_ergebnis_tab_info'})
Then use find again to find the first td:
first_td = table.find('td')
Then use renderContents() to extract the contents of the cell (for a cell that holds only text, that is the text itself):
text = first_td.renderContents()
... and the job is done, though you will probably also want to use strip() to remove leading and trailing whitespace, since the cell text is surrounded by line breaks:
trimmed_text = text.strip()
This should give:
>>> print trimmed_text
This is a sample text
>>>
as desired.
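For readers on Python 3 with the current bs4 package, the same steps might look like the sketch below. It assumes the beautifulsoup4 package is installed, uses get_text() in place of renderContents() (a legacy method that in bs4 returns encoded bytes by default), and uses an inline html string as a stand-in for the downloaded page.

```python
from bs4 import BeautifulSoup

# Stand-in for the downloaded page (assumption: same structure as in the question).
html = """
<table class="bp_ergebnis_tab_info">
<tr>
<td>
This is a sample text
</td>
<td>
This is the second sample text
</td>
</tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# find() returns the first match (or None), so no [0] is needed.
table = soup.find("table", attrs={"class": "bp_ergebnis_tab_info"})
first_td = table.find("td")

# get_text(strip=True) returns the cell text without surrounding whitespace.
trimmed_text = first_td.get_text(strip=True)
print(trimmed_text)
```

On a real page, find() returns None when nothing matches, so it is worth checking table and first_td before calling methods on them.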

Use .text to get the text inside each td.
1) First locate the table in the DOM by tag or id:
soup = BeautifulSoup(self.driver.page_source, "html.parser")
htnm_migration_table = soup.find("table", {'id':'htnm_migration_table'})
2) Read the tbody:
tbody = htnm_migration_table.find('tbody')
3) Read all tr elements from the tbody:
trs = tbody.find_all('tr')
4) Get all td elements from each tr:
for tr in trs:
    tds = tr.find_all('td')
    for td in tds:
        print(td.text)

I find Beautiful Soup a very efficient tool, so keep learning it :-) It can parse pages with invalid markup, so it should be able to handle the page you refer to. You may want to use BeautifulSoup(html).prettify() if you want a reformatted page source with valid markup.
As for your question, soup.findAll(...) returns a list of the matching tags, and each of them can be searched again. Since you only need the first table, use find(), which returns that tag directly, and run a second search on it, like this:
table_soup = soup.find('table', attrs={'class':'bp_ergebnis_tab_info'})
your_sample_text = table_soup.find("td").renderContents().strip()
print your_sample_text

Related

Simple html dom parser table

I'm using Simple HTML DOM to parse data into my own PHP script. I need the text inside a td, only one of the several td elements in the table. The website I am trying to parse is https://rub.currencyrate.today/. Specifically, I need the first USD td.
The result must be
$ 0.0137
Source php:
<?php
include('../simple_html_dom.php');
$html = file_get_html('https://rub.currencyrate.today/');
foreach($html->find('table') as $e){
    foreach($e->find('td',0) as $f){
        echo strip_tags($f->innertext) . '<br>';
    }
}
?>
This code displays result
₽ 1 $ 0.0137 € 0.0115 £ 0.00988 ¥ 0.0884 Ƀ 0.00000040
I've tried several ways to get that, but I've failed with every one of them. Can someone give me a hand?
You're looking for the second <td> in the first <table>.
Therefore there is no need to iterate (foreach) over all tables, and iterating over the first <td> is simply wrong: find('td', 0) returns a single element, not a list (if you check the error log, it will already tell you that).
Let's take the first table and the second td; the indexes in find() are zero-based:
$dollar = $html->find('table', 0)->find('td', 1)->innertext();
For your output, take care to encode it properly as HTML. strip_tags is not of much use there; what you want is the special characters properly encoded with htmlspecialchars (something strip_tags is not capable of):
echo htmlspecialchars($dollar, ENT_QUOTES | ENT_HTML5), '<br>';
$ 0.0137
A few further notes:
This was run with simplehtmldom 2.0-RC2; the version you use might have bugs. I could not fully reproduce your output with that version (but the traversal was wrong anyway).
Allow yourself the "luxury" of seeing errors more prominently on your development box.
Take care to encode HTML output properly.
The closing ?> PHP tag is not necessary at the end of a file; leave it out before it causes problems.
Last but not least, if you allow me the remark: simplehtmldom is really old. At some point you may want to use the DOMDocument class from the dom PHP extension instead, together with the other XML PHP extensions (simplexml, xmlreader etc.).
Example in full:
<?php declare(strict_types=1);
include __DIR__ . '/../simple_html_dom.php';
$html = file_get_html('https://rub.currencyrate.today/');
$dollar = $html->find('table', 0)->find('td', 1)->innertext();
echo htmlspecialchars($dollar, ENT_QUOTES | ENT_HTML5), '<br>';

Getting Specific Data in PHP

How can I extract specific data using PHP? In this case I want the text between <span class="s"> and the first <b> tag. Assuming the HTML source is:
Once there was a king <span class="s"> May 3 2009 <b> ABC Some Text </b> Some photo or video</span> but they have...
So here I want to get the filtered data into a variable, like: $fdata = "May 3 2009"; because May 3 2009 is the text between <span class="s"> and the first <b> tag.
I will use it with Simple PHP HTML DOM parsing. Is there any idea or example for filtering that text into a variable? Any idea would be a great help. *If you find a similar question here, this is not a duplicate: it is more specific.
Use Simple HTML DOM
http://simplehtmldom.sourceforge.net/
Or http://php.net/manual/en/domdocument.loadhtml.php
Or you can use any other library also.
If you're using the Simple HTML DOM parser, you'd grab the elements you're targeting like this:
$ret = $html->find('span[class=s]');
This is just a basic sample, but it should get you going in the right direction.
If you need to find a very specific instance, you can use something such as:
$ret = $html->find("#div1", 0)->children(1)->children(1)->children(2)->id;

Access child on a table using Xpath

I am trying to access a specific element of the Dom using XPath
Here is an example
<table>
<tbody>
<tr>
<td>
<b>1</b> data<br>
<b>2</b> data<br>
<b>3</b> data<br>
</td>
</tr>
</tbody>
</table>
I want to target "table td" so my query in Xpath is something like
$finder->query('//table/td');
but this doesn't return the td, because it is not a direct child of the table; direct access would be done using
$finder->query('//tr/td');
Is there a better way to write the query which would allow me to use something like the first example ignoring the elements in-between and return the TD?
"Is there a better way to write the query which would allow me to use something like the first example ignoring the elements in-between and return the TD?"
You can write:
//table//td
However, is this really "better"?
In many cases the evaluation of the XPath pseudo-operator // can result in significant inefficiency as it causes the whole subtree rooted in the context-node to be traversed.
Whenever the path to the wanted nodes is statically known, it may be more efficient to replace any // with the specific, known path, thus avoiding the complete subtree traversal.
For the provided XML document, such an expression is:
/*/*/tr/td
If there is more than one table element, each a child of the top element, and we want to select only the tds of the first table, a good, specific expression is:
/*/table[1]/*/tr/td
If we want to select only the first td of the first table in the same document, a good way to do this would be:
(/*/table[1]/*/tr//td)[1]
Or if we want to select the first td in the XML document (not knowing its structure in advance), then we could specify this:
(//td)[1]
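The difference between / and // can be checked quickly outside PHP. The sketch below uses Python's standard-library xml.etree.ElementTree, which implements only a subset of XPath (relative paths, no parenthesised indexing such as (//td)[1]). As assumptions for the example, the question's table is wrapped in <html><body> so that the table is not the root, and <br> is written as <br/> because ElementTree needs well-formed XML.

```python
import xml.etree.ElementTree as ET

# The table from the question, wrapped in an assumed <html><body>.
doc = """
<html><body>
<table>
<tbody>
<tr>
<td>
<b>1</b> data<br/>
<b>2</b> data<br/>
<b>3</b> data<br/>
</td>
</tr>
</tbody>
</table>
</body></html>
"""

root = ET.fromstring(doc.strip())

# table/td matches only td elements that are direct children of table.
direct = root.findall(".//table/td")
# table//td matches td elements at any depth below table.
anywhere = root.findall(".//table//td")
# A fully spelled-out path avoids the descendant search altogether.
explicit = root.findall("./body/table/tbody/tr/td")

print(len(direct), len(anywhere), len(explicit))  # prints: 0 1 1
```

The first query finds nothing because tbody and tr sit between table and td, which is the behaviour described in the question.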
What you are looking for is:
$finder->query('//table//td');
Oh boy oh boy, there's something not seen often.
As for your first XPath query, you can get what you want by putting a double slash (//) before each tag name.
But, I don't see why you don't just want to get the td's by tagname...
You can write this way too:-
$finder->query('//td');

Selenium, xpath to match a table row containing multiple elements

I'm trying to find a Selenium/PHP XPath for matching a table row that contains multiple elements (text, and form elements).
Example:
<table class="foo">
<tr>
<td>Car</td><td>123</td><td><input type="submit" name="s1" value="go"></td>
</tr>
</table>
This works for a single text element:
$this->isElementPresent( "//table/tbody/tr/td[contains(text(), 'Car')]" );
while this does NOT (omitting the /td locator):
$this->isElementPresent( "//table/tbody/tr[contains(text(), 'Car')]" );
and thus, this obviously won't work either for multiple elements:
$this->isElementPresent( "//table/tbody/tr[contains(text(), 'Car')][contains(text(), '123')]" );
Another way to do this would be with getTable("xpath=//table[@class='foo'].x.y") for each and every row x, column y. Cumbersome, but it worked... mostly. It does NOT return the <input> tag! It returns an empty string for that cell :(
Any ideas?
This XPath expression:
/html/body/table[descendant::td[contains(.,'Car')]]
Note: If you know your schema, don't use a starting // operator. Use string value instead of text node (this way you get the concatenation of all descendant text nodes).
Several paths can be combined with | separator.
Tweak this:
//tr/td[contains(text(), 'Car')]/text() | //tr/td/input[@value="s1"]/@name
You might want to use
//td[contains(., 'Car')]/ancestor::tr[td[contains(., '123')]]
That will select the tr that contains td elements matching both contains() arguments.
Try to use View Xpath Plugin in firefox, very useful plugin.
Learn more about Axes in Xpath: http://www.w3schools.com/xpath/xpath_axes.asp
Thanks to knb for some syntax hints.
This is slightly off-topic, but relevant to the search that led me here...
I had a table with [ name | value ] cells. I needed to get value from the row with 'name' preceding it.
(fake example, but every link I was looking for had the same text and no IDs - the point is that the context information was in a neighboring cell)
<table id="options"><tbody>
<tr>
<td>other</td>
<td>edit</td>
</tr>
<tr>
<td>this label</td>
<td>edit</td> <!-- I want this button -->
</tr>
<tr>
<td>other</td>
<td>edit</td>
</tr>
</tbody></table>
I could retrieve the button I wanted like this, using nested [[]] conditions:
//table[@id='options']/tbody/tr[td[contains(text(), 'this label')]]/td[2]/a
"get the "a" that is in a row that contains another cell with the text I'm looking for"
I think this sort of task might be a common case, so I'm posting it here FYI
In my problem, I had a list of products where it was identified by a unique SKU/catalog combination. If I wanted to add that product to a cart, I chose it by SKU and catalog.
Using foob.ar's example:
//table[@class='foo']/tr[td[contains(text(), 'Car')] and td[contains(., '123')]]
You can combine it with dman's solution for choosing a specific element/column within that row:
//table[@class='foo']/tr[td[contains(text(), 'Car')] and td[contains(., '123')]]//input[@name='s1']
Edit:
The solution above works if you are looking for those two values in any of the columns. To match each value against a specific column, I had to modify it a bit:
//table[@class='foo']/tr[td[position()=1 and contains(text(), 'Car')] and td[position()=2 and contains(text(), '123')]]//input[@name='s1']

Using PHP PCRE to fetch div content

I'm trying to fetch data from a div (based on its id), using PHP's PCRE. The goal is to fetch the div's contents based on its id, using recursion / depth to get everything inside it. The main problem is getting the other divs nested inside the "main div", because the regex stops at the first </div> it finds after the initial <div id="test">.
I've tried many different approaches, and none of them worked. The best solution, in my opinion, is to use the R parameter (recursion), but I never got it to work properly.
Any ideas?
Thanks in advance :D
You'd be much better off using some form of DOM parser - regex really isn't suited to this problem. If all you want is basic HTML dom parsing, something like simplehtmldom would be right up your alley. It's trivial to install (just include a single PHP file) and trivial to use (2-3 lines will do what you need).
include('simple-html-dom.php');
$dom = str_get_html($bunchofhtmlcode);
$testdiv = $dom->find('div#test',0); // 0 for the first occurrence
$testdiv_contents = $testdiv->innertext;
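To illustrate why a parser copes with nested divs where a plain regex stops at the first </div>, here is a minimal sketch in Python using only the standard-library html.parser. It is not how simplehtmldom works internally; the class name DivExtractor and the sample markup are hypothetical, and the point is only the depth counter that tracks nested divs.

```python
from html.parser import HTMLParser


class DivExtractor(HTMLParser):
    """Collect the inner HTML of the first <div> with a given id."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0      # 0 means we are outside the target div
        self.done = False   # True once the target div has been closed
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            # Inside the target: keep the tag, count nested divs.
            self.parts.append(self.get_starttag_text())
            if tag == "div":
                self.depth += 1
        elif not self.done and tag == "div" and dict(attrs).get("id") == self.target_id:
            self.depth = 1

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> do not change the depth.
        if self.depth:
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if not self.depth:
            return
        if tag == "div":
            self.depth -= 1
            if self.depth == 0:
                # This </div> closes the target itself; stop collecting.
                self.done = True
                return
        self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        if self.depth:
            self.parts.append(data)


# Hypothetical sample markup with a div nested inside div#test.
html = '<p>before</p><div id="test">outer <div class="inner">nested</div> tail</div><div>after</div>'

parser = DivExtractor("test")
parser.feed(html)
parser.close()
inner = "".join(parser.parts)
print(inner)
```

The counter is what lets the extractor skip the inner </div> and stop only at the one that closes div#test; a DOM library such as the one suggested above does this bookkeeping for you.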
