What is the way to get specific data using PHP? In this case I want to get the text that sits between <span class="s"> and the first <b> HTML tag. Assume the HTML source is:
Once there was a king <span class="s"> May 3 2009 <b> ABC Some Text </b> Some photo or video</span> but they have...
So here I want to get the filtered data into a variable, like: $fdata = "May 3 2009"; because "May 3 2009" sits between <span class="s"> and the first <b> HTML tag.
I will use it with Simple PHP HTML DOM Parser. Is there an example of how to filter this text and get it into a variable? Any idea will be a great help. *If you find a similar question here, this is not a duplicate; it is more specific.
Use Simple HTML DOM
http://simplehtmldom.sourceforge.net/
Or http://php.net/manual/en/domdocument.loadhtml.php
Or you can use any other library also.
If you're using Simple HTML DOM Parser, you'd grab the elements you're targeting like this:
$ret = $html->find('span[class=s]');
This is just a basic sample, but it should get you going in the right direction.
If you need to find a very specific instance, you can use something such as:
$ret = $html->find("#div1", 0)->children(1)->children(1)->children(2)->id;
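For the exact case in the question (the text between <span class="s"> and its first <b>), here is a minimal sketch using PHP's built-in DOMDocument and DOMXPath instead of Simple HTML DOM. The variable name $source is an assumption made for illustration; $fdata comes from the question.

```php
<?php
// Sample markup from the question; $source is a name chosen for this sketch.
$source = 'Once there was a king <span class="s"> May 3 2009 <b> ABC Some Text </b> Some photo or video</span> but they have...';

// Parse the fragment; libxml wraps it in <html><body> automatically.
libxml_use_internal_errors(true);   // keep parser warnings out of the output
$dom = new DOMDocument();
$dom->loadHTML($source);
libxml_clear_errors();

// Select every text node that precedes the first <b> inside the span.
$xpath = new DOMXPath($dom);
$nodes = $xpath->query('//span[@class="s"]/b[1]/preceding-sibling::text()');

$fdata = '';
foreach ($nodes as $node) {
    $fdata .= $node->nodeValue;
}
$fdata = trim($fdata);

echo $fdata; // May 3 2009
```

The preceding-sibling axis keeps the query independent of what follows the <b> tag, so the trailing "Some photo or video" text is never selected.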
I want to extract only the text from a PHP string.
This PHP string contains HTML code such as tags.
So I only need the plain text from this string.
This is the actual string:
<div class="devblog-index-content battlelog-wordpress">
<p><strong>The celebration of the Recon class in our second </strong>BF4 Class Week<strong> continues with a sneaky stroll down memory lane. Learn more about how the Recon has changed in appearance, name and weaponry over the years…</strong></p>
<p> </p>
<p style="text-align:center"><img alt="bf4-history-of-recon-1" class="aligncenter" src="http://eaassets-a.akamaihd.net/battlelog/prod/954660ddbe53df808c23a0ba948e7971/en_US/blog/wp-content/uploads/2014/10/bf4-history-of-recon-1.jpg?v=1412871863.37" style="width:619px" /></p>
I want to show this from the string:
The celebration of the Recon class in our second BF4 Class Week continues with a sneaky stroll down memory lane. Learn more about how the Recon has changed in appearance, name and weaponry over the years…
This text will be placed in a meta description tag, so I don't want any HTML in it.
How can I do this? Any ideas or thoughts about this technique?
You may try:
echo(strip_tags($your_string));
More info here: http://php.net/manual/en/function.strip-tags.php
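Since the text is meant for a meta description, a minimal sketch might also decode entities, collapse whitespace, and escape the result for use in an attribute. The sample string is shortened from the question, and the names $text and $meta are assumptions made for illustration.

```php
<?php
// Shortened sample of the string from the question.
$your_string = '<p><strong>The celebration of the Recon class in our second </strong>'
             . 'BF4 Class Week<strong> continues.</strong></p>' . "\n"
             . '<p style="text-align:center"><img alt="bf4" src="bf4.jpg" /></p>';

// 1. Drop all tags.
$text = strip_tags($your_string);

// 2. Turn entities such as &amp; back into characters.
$text = html_entity_decode($text, ENT_QUOTES, 'UTF-8');

// 3. Collapse runs of whitespace (including newlines) and trim the ends.
$text = trim(preg_replace('/\s+/', ' ', $text));

// 4. Escape again so the text is safe inside an HTML attribute.
$meta = '<meta name="description" content="'
      . htmlspecialchars($text, ENT_QUOTES, 'UTF-8') . '">';

echo $meta;
```

Note that strip_tags() only removes markup; it does not insert spaces between adjacent block elements, which is one reason a dedicated library can give cleaner results on complicated HTML.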
Another option is to use Html2Text. It will do a much better job than strip_tags, especially if you want to parse complicated HTML code.
Extracting text from HTML is tricky, so your best bet is to use a library built for this purpose.
https://github.com/mtibben/html2text
Install using composer:
composer require html2text/html2text
Basic usage:
$html = new \Html2Text\Html2Text('Hello, "<b>world</b>"');
echo $html->getText(); // Hello, "WORLD"
Adding another option for anyone else who may need this: the Stringizer library might work for you; see its Strip Tags feature.
Full disclosure: I'm the owner of the project.
I want to get all the text after the <span class="general2">, including the <h2> tags.
I have the HTML content as follows:
<span class="general2" itemprop="articleBody"> I WANT THIS TEXT I WANT THIS TEXTI WANT THIS TEXT<br />
<h2>I WANT THIS TEXT AND ALSO PRESERVE THE TAG</h2><br />
I WANT THIS TEXT</span>
I tried the query
//span[contains(@class,'general2')]
but it gives me all the text as plain text. I want something like
//span[contains(@class,'general2')]/*[text() or local-name()='h3']
As you want quite distinct elements, it is probably best to use the union operator | to join the different selections together. You can first get all the text nodes that are children of <span/>, then the text nodes of <a/>, and last but not least the <h2/> element. This should work:
//span[contains(@class,'general2')]/text() | //span[contains(@class,'general2')]/h2 | //span[contains(@class,'general2')]/a/text()
Using XPath 3.0 this can be written more elegantly, as it allows expressions as steps:
//span[contains(@class,'general2')]/(text() | h2 | a/text())
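PHP's DOMXPath implements XPath 1.0, so the longer union form is the one that applies there. A minimal sketch, assuming the markup from the question is well-formed enough to be parsed with loadXML() (real-world HTML would normally go through loadHTML() instead); $markup and $parts are names chosen for illustration:

```php
<?php
// Markup from the question, shortened slightly.
$markup = '<span class="general2" itemprop="articleBody"> I WANT THIS TEXT<br />'
        . '<h2>I WANT THIS TEXT AND ALSO PRESERVE THE TAG</h2><br />'
        . 'I WANT THIS TEXT</span>';

$dom = new DOMDocument();
$dom->loadXML($markup);
$xpath = new DOMXPath($dom);

// Union of the span's direct text nodes and its <h2> children.
$query = "//span[contains(@class,'general2')]/text()"
       . " | //span[contains(@class,'general2')]/h2";

$parts = [];
foreach ($xpath->query($query) as $node) {
    // saveXML() on a node serializes it, so the <h2> keeps its tags.
    $parts[] = $dom->saveXML($node);
}

echo implode("\n", $parts);
```

The union returns nodes in document order, so the <h2> appears between the two text nodes just as in the source.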
That is the task of your host programming language. XPath's job is only to select the relevant element; you then need to find a way in PHP to get the inner HTML markup of the selected element. Maybe something like this (I'm not a PHP guy by any means):
$span = $xpath->query("//span[contains(@class,'general2')]");
echo $dom->saveXML($span->item(0));
PHP references used for the snippet above: Get inner HTML of parent element with php and xpath, How to get innerHTML of DOMNode?
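Passing the span itself to saveXML() returns the outer markup, including the <span> tags. A minimal sketch of collecting only the inner markup by serializing each child node; as an assumption for this sketch the fragment is loaded with loadXML() because it is well-formed, and $markup and $inner are illustrative names:

```php
<?php
$markup = '<span class="general2" itemprop="articleBody"> I WANT THIS TEXT<br />'
        . '<h2>I WANT THIS TEXT AND ALSO PRESERVE THE TAG</h2><br />'
        . 'I WANT THIS TEXT</span>';

$dom = new DOMDocument();
$dom->loadXML($markup);
$xpath = new DOMXPath($dom);

// query() returns a DOMNodeList; item(0) is the first matching span.
$span = $xpath->query("//span[contains(@class,'general2')]")->item(0);

// Serialize every child (text nodes and elements) to rebuild the inner markup.
$inner = '';
foreach ($span->childNodes as $child) {
    $inner .= $dom->saveXML($child);
}

echo $inner;
```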
Can the current PHPExcel format HTML tags inside an Excel cell?
This is similar to the question here.
I have a database table with a field that contains a string with HTML tags,
e.g. <b>hello</b>
and I want it to appear in Excel not as plain text but as formatted text, that is, "hello" in bold.
Is there any PHP-to-Excel library that can do this? Any ideas? Thanks in advance.
No! PHPExcel doesn't have any built-in logic to do this; and nor does any other library that I'm aware of.
You'd need to write some code yourself to handle the conversion from HTML to a Rich Text Run.... there's some logic inside the HTML Reader that you might be able to use as the basis for this.
EDIT
Since this answer was written, a helper class has been added to the PHPExcel library that will take a basic block of simple HTML markup and convert it to a rich text object that can be set as a cell value. This is the PHPExcel_Helper_HTML class, with its toRichTextObject() method, which takes a block of HTML as its argument and returns a Rich Text Object. There are examples demonstrating its use in Examples/42richText.php
$html = '<font color="#0000ff">
<h1 align="center">My very first example of rich text<br />generated from html markup</h1>
<p>
<font size="14" COLOR="rgb(0,255,128)">
<b>This block</b> contains an <i>italicized</i> word;
while this block uses an <u>underline</u>.
</font>
</p>
<p align="right"><font size="9" color="red">
I want to eat <ins><del>healthy food</del> <strong>pizza</strong></ins>.
</font>
';
$wizard = new PHPExcel_Helper_HTML;
$richText = $wizard->toRichTextObject($html);
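To put the resulting object into a cell, something along these lines should work. This is a sketch under assumptions: the include path and the output file name are hypothetical and depend on where PHPExcel is installed; the shortened $html string is only for illustration.

```php
<?php
// Assumption: PHPExcel is installed at this (hypothetical) path.
require_once 'PHPExcel/Classes/PHPExcel.php';

// Short illustrative markup; any simple inline markup can be used here.
$html = '<b>This block</b> contains an <i>italicized</i> word.';

// Convert the markup into a rich text object.
$wizard = new PHPExcel_Helper_HTML;
$richText = $wizard->toRichTextObject($html);

// Set the rich text object as the value of cell A1.
$objPHPExcel = new PHPExcel();
$objPHPExcel->getActiveSheet()->setCellValue('A1', $richText);

// Write the workbook; the file name is an assumption for this sketch.
$writer = PHPExcel_IOFactory::createWriter($objPHPExcel, 'Excel2007');
$writer->save('richtext-example.xlsx');
```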
While not all markup is supported, and it doesn't use stylesheets, and only a limited set of inline style elements, it works well enough with basic markup elements.
I have spent a lot of time on this, but I have not reached my goal. There is no way to write it like this and get the data the way we want. I appreciate Mark's answer; try it with your own script.
Let's say I have this block of code:
<div id="id1">
This is some text
<div class="class1"><p>lala</p> Some markup</div>
</div>
What I want is only the text "This is some text", without the contents of the child element .class1. I can do it in jQuery using $('#id1').contents().eq(0).text(); how can I do this in phpQuery?
Thanks.
My bad, I was doing
pq('#id1.contents().eq(0).text()')
instead of
pq('#id1')->contents()->eq(0)->text()
If compatibility is what you are after, and you want to traverse/manipulate elements as DOM objects, then perhaps the PHP DOM XML library is what you are after: http://www.php.net/manual/en/book.domxml.php
Your code would look something like this:
$xml = xmldoc('<div id="id1">This is some text<div class="class1"><p>lala</p> Some markup</div></div>');
$node = $xml->get_element_by_id("id1");
$content = $node->get_content();
I'm sorry, I don't have time to run a test of this right now, but hopefully it sets you in the right direction, and forms the basis for a decent revision... There is a good list of DOM traversal functions in the PHP documentation though :)
References: http://www.php.net/manual/en/book.domxml.php, http://www.php.net/manual/en/function.domdocument-get-element-by-id.php, http://www.php.net/manual/en/function.domnode-get-content.php
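The DOM XML functions above belong to the PHP 4 extension. On PHP 5 and later, a comparable sketch with the bundled DOM extension might look like the following; it selects only the first text node of #id1, so the contents of .class1 are left out. The names $markup and $content are chosen for illustration.

```php
<?php
$markup = '<div id="id1">
This is some text
<div class="class1"><p>lala</p> Some markup</div>
</div>';

libxml_use_internal_errors(true);   // keep parser warnings out of the output
$dom = new DOMDocument();
$dom->loadHTML($markup);
libxml_clear_errors();

// text()[1] is the first text node that is a direct child of the div,
// which mirrors jQuery's .contents().eq(0) for this markup.
$xpath = new DOMXPath($dom);
$node = $xpath->query('//div[@id="id1"]/text()[1]')->item(0);

$content = trim($node->nodeValue);
echo $content; // This is some text
```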
I am new to Python and to Beautiful Soup as well! I have heard that BS is a great tool for parsing and extracting content. So here I am:
I want to take the content of the first td of a table in an HTML document. For example, I have this table:
<table class="bp_ergebnis_tab_info">
<tr>
<td>
This is a sample text
</td>
<td>
This is the second sample text
</td>
</tr>
</table>
How can I use Beautiful Soup to take the text "This is a sample text"?
I use soup.findAll('table' ,attrs={'class':'bp_ergebnis_tab_info'}) to get the whole table.
Thanks... or should I try to get the whole thing with Perl, which I am not so familiar with? Another solution would be a regex in PHP.
See the target [1]: http://www.schulministerium.nrw.de/BP/SchuleSuchen?action=799.601437941842&SchulAdresseMapDO=142323
Note: since the HTML is a bit invalid, I think we have to do some cleaning. That can take a lot of PHP code if we want to solve the job in PHP. Perl would be a good solution too.
Many thanks for any hints and ideas for a starting point.
First find the table (as you are doing). Using find rather than findAll returns the first match (rather than a list of all matches, in which case we'd have to add an extra [0] to take the first element of the list):
table = soup.find('table' ,attrs={'class':'bp_ergebnis_tab_info'})
Then use find again to find the first td:
first_td = table.find('td')
Then use renderContents() to extract the textual contents:
text = first_td.renderContents()
... and the job is done (though you may also want to use strip() to remove leading and trailing spaces):
trimmed_text = text.strip()
This should give:
>>> print trimmed_text
This is a sample text
>>>
as desired.
Use "text" to get text between "td"
1) First read table DOM using tag or ID
soup = BeautifulSoup(self.driver.page_source, "html.parser")
htnm_migration_table = soup.find("table", {'id':'htnm_migration_table'})
2) Read tbody
tbody = htnm_migration_table.find('tbody')
3) Read all tr from tbody tag
trs = tbody.find_all('tr')
4) Get all tds from each tr
for tr in trs:
    tds = tr.find_all('td')
    for td in tds:
        print(td.text)
I find Beautiful Soup a very efficient tool, so keep learning it :-) It is able to parse a page with invalid markup, so it should be able to handle the page you refer to. You may want to use BeautifulSoup(html).prettify() if you want to get a reformatted page source with valid markup.
As for your question, soup.findAll(...) returns a list of matches, and each match is itself a Beautiful Soup object that you can search again. Use find to get only the first table and then make a second search in it, like this:
table_soup = soup.find('table' ,attrs={'class':'bp_ergebnis_tab_info'})
your_sample_text = table_soup.find("td").renderContents().strip()
print your_sample_text