PHP simplehtmldom read only viewable text - php

I have the following html format
<p>This is viewable <span style="display:none">This is not viewable</span></p>
I want to use php simplehtmldom to extract only the "This is viewable" part.
Is there any way to do it directly?

Sure you can, just remove that text:
$str = '<p>This is viewable <span style="display:none">This is not viewable</span></p>';
$html = str_get_html($str);
foreach ($html->find('[style*=display:none]') as $el) {
    $el->innertext = '';
}
echo $html->find('p', 0)->text();
// This is viewable

No, SimpleHTMLDOM is merely a DOM parser, it does not process the attributes in any meaningful way, let alone process inline styles. To properly do what you intend to achieve, it would also need to be able to process extended inline styles, like style="anyother:'attribute';display:none" and alternative ways of hiding content, like visibility:hidden and opacity:0, or brilliant stuff like -webkit-transform:rotateY(90deg).
In a nutshell, there is no remotely easy way to achieve the intended result.
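That said, if the hidden content is only ever marked with a plain inline display:none or visibility:hidden declaration, a rough approximation is possible. The sketch below uses PHP's bundled DOM extension rather than SimpleHTMLDOM. The visibleText() helper name is made up for illustration, and it deliberately ignores stylesheets, classes, opacity, transforms and every other way of hiding content mentioned above.

```php
<?php
// Hypothetical helper: returns the text of $html after removing elements that
// are hidden by an inline display:none / visibility:hidden declaration.
function visibleText(string $html): string
{
    libxml_use_internal_errors(true);   // tolerate imperfect markup
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    // Lower-case the style attribute and strip spaces so that
    // "Display: None" and "display:none" are treated alike.
    $style = "translate(@style, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ ', 'abcdefghijklmnopqrstuvwxyz')";
    $hidden = $xpath->query(
        "//*[contains($style, 'display:none') or contains($style, 'visibility:hidden')]"
    );

    // Copy the node list before removing anything from the tree.
    foreach (iterator_to_array($hidden) as $node) {
        if ($node->parentNode !== null) {
            $node->parentNode->removeChild($node);
        }
    }

    $body = $doc->getElementsByTagName('body')->item(0);
    return $body === null ? '' : trim($body->textContent);
}

echo visibleText('<p>This is viewable <span style="display:none">This is not viewable</span></p>');
```

Treat this as a heuristic only: anything hidden through a CSS class or an external stylesheet will still show up in the result.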

Related

Adding a class to all English text in HTML?

The requirement is to add an englishText class around all English words on a page. The problem is similar to this, but the JavaScript solutions won't work for me. I require a PHP example to solve this problem. For example, if you have this:
<p>Hello, 你好</p>
<div>It is me, 你好</div>
<strong>你好, how are you</strong>
Afterwards I need to end with:
<p><span class="englishText">Hello</span>, 你好</p>
<div><span class="englishText">It is me</span>, 你好</div>
<strong>你好, <span class="englishText">how are you</span></strong>
There are more complicated cases, such as:
<strong>你好, TEXT?</strong>
<div>It is me, 你好</div>
This should become:
<strong>你好, <span class="englishText">TEXT?</span></strong>
<div><span class="englishText">It is me</span>, 你好</div>
But I think I can sort out these edge cases once I know how to actually iterate over the document correctly.
I can't use javascript to solve this because:
This needs to work on browsers that don't support javascript
I would prefer to have the classes in place on page load so there isn't any delay in rendering the text in the correct font.
I figured the best way to iterate over the document would be using PHP Simple HTML DOM Parser.
But the problem is that if I try this:
foreach ($html->find('div') as $element)
{
// make changes here
}
My concern is that the following case will cause chaos:
<div>
Hello , 你好
<div>Hello, 你好</div>
</div>
As you can see, it's going to go into the first div and then if I process that node, I will be processing the node within that too.
Any ideas how to get around this and only select the nodes for processing once?
UPDATE
I realise now that what I effectively need is a recursive way to iterate over HTML elements with the ability to change them as I iterate over them.
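You should walk through the siblings; that way you won't get into trouble in such cases.
Something like this:
<?php
foreach ($html->find('div') as $element)
{
    // next_sibling() returns a single node (or null), so follow the chain in a loop
    for ($sibling = $element->next_sibling(); $sibling !== null; $sibling = $sibling->next_sibling()) {
        echo $sibling->plaintext . "\n"; // plaintext is a property, not a method
    }
}
?>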
Or a much easier way, in my opinion:
1. Change every <*> to "\n"."<*>" with preg_replace().
2. Make an array of lines: $lines = explode("\n", $html_string);
3. Strip the tags from each line:
foreach ($lines as $line) {
    $text = strip_tags($line);
    echo $text;
}
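For the recursive iteration the question's update asks about, one option is to stop iterating over elements and iterate over text nodes instead. The sketch below uses PHP's bundled DOM extension, not Simple HTML DOM. The helper names and the regular expression for an "English run" are assumptions made for illustration, and text inside script or style elements is not excluded.

```php
<?php
// Load an HTML fragment as UTF-8. The XML prolog prefix is a common workaround
// for DOMDocument::loadHTML() otherwise assuming ISO-8859-1.
function loadUtf8Html(string $html): DOMDocument
{
    libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML('<?xml encoding="UTF-8">' . $html);
    libxml_clear_errors();
    return $doc;
}

// Hypothetical helper: wraps each run of ASCII letters (plus inner spaces and
// apostrophes, and a trailing ? or !) in <span class="englishText">.
function wrapEnglishText(DOMDocument $doc): void
{
    $pattern = "/([A-Za-z](?:[A-Za-z' ]*[A-Za-z?!])?)/";
    $xpath = new DOMXPath($doc);

    // //text() yields every text node exactly once, however deeply the
    // elements are nested. Snapshot the list because the loop edits the tree.
    foreach (iterator_to_array($xpath->query('//body//text()')) as $text) {
        $parts = preg_split($pattern, $text->nodeValue, -1, PREG_SPLIT_DELIM_CAPTURE);
        if (count($parts) < 2) {
            continue; // no English run in this text node
        }
        $fragment = $doc->createDocumentFragment();
        foreach ($parts as $i => $part) {
            if ($part === '') {
                continue;
            }
            if ($i % 2 === 1) { // odd indexes hold the captured English runs
                $span = $doc->createElement('span');
                $span->setAttribute('class', 'englishText');
                $span->appendChild($doc->createTextNode($part));
                $fragment->appendChild($span);
            } else {
                $fragment->appendChild($doc->createTextNode($part));
            }
        }
        $text->parentNode->replaceChild($fragment, $text);
    }
}

$doc = loadUtf8Html('<div>Hello , 你好<div>Hello, 你好</div></div>');
wrapEnglishText($doc);
echo $doc->saveHTML($doc->getElementsByTagName('body')->item(0));
```

Because each text node belongs to exactly one parent, the nested div case is handled without processing anything twice.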

Use PHP to echo whats inside div tags

I don't know what to research or where to start here.
What I'm trying to do is use PHP to read an HTML page and pull out the raw text contained inside a div.
the div is this
<div class="thingy">
test
</div>
When the php is executed, I want it to echo
Test
Is there an easy snippet for this, or can someone post a small script?
Edit: the html page with the Div is on another webpage.
What you're looking to do is parse HTML. Use the DOM module that comes with PHP to do this: http://php.net/manual/en/book.dom.php
You do NOT want to try to do this with regular expressions.
If you want to remove ALL the HTML tags from a document, use the PHP strip_tags() function: http://us3.php.net/strip_tags
While this could possibly be done using regex, I would recommend using a DOM parser. My recommendation goes to Simple HTML DOM Parser. Using it, here's how you would do what you want:
$string = "<div class=\"thingy\">test</div>";
$html = str_get_html($string); // create the DOM object
$div = $html->find('div[class=thingy]', 0); // find the first div with a class of 'thingy'
echo $div->plaintext; // echo the text contents (plaintext is a property, not a method)
If you want to parse your html you can use it like
<?php
$str = '<div class="thingy">test</div>';
echo strip_tags($str);//OUTPUT : test
?>
Since your HTML is on another web page, start output buffering, include that file in your main PHP script, and then do all the manipulation on the buffered content.
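To make the "use the DOM module" suggestion above concrete, here is a minimal sketch with DOMDocument and DOMXPath. The divTextByClass() name is hypothetical, and it assumes the class name contains no quote characters. For the remote page, the HTML string could come from file_get_contents($url), provided allow_url_fopen is enabled; the sketch uses a literal string so it runs offline.

```php
<?php
// Hypothetical helper: returns the trimmed text of every <div> carrying $class.
function divTextByClass(string $html, string $class): array
{
    libxml_use_internal_errors(true);   // tolerate imperfect markup
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    // Match the class as a whole word, so "thingy" does not match "thingy2".
    $query = sprintf(
        '//div[contains(concat(" ", normalize-space(@class), " "), " %s ")]',
        $class
    );

    $texts = [];
    foreach ($xpath->query($query) as $div) {
        $texts[] = trim($div->textContent);
    }
    return $texts;
}

$html = "<div class=\"thingy\">\ntest\n</div>";
print_r(divTextByClass($html, 'thingy'));
```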

Select first DOM Element of type text using phpQuery

Let's say i have this block of code,
<div id="id1">
This is some text
<div class="class1"><p>lala</p> Some markup</div>
</div>
What I would want is only the text "This is some text" without the child element's .class1 contents. I can do it in jquery using $('#id1').contents().eq(0).text(), how can i do this in phpQuery?
Thanks.
my bad, i was doing
pq('#id1.contents().eq(0).text()')
instead of
pq('#id1')->contents()->eq(0)->text()
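If compatibility is what you are after, and you want to traverse/manipulate elements as DOM objects, then perhaps the PHP DOM XML library is what you are after: http://www.php.net/manual/en/book.domxml.php (note that DOM XML is the PHP 4 extension; PHP 5 and later ship the DOM extension, i.e. DOMDocument, instead).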
Your code would look something like this:
$xml = xmldoc('<div id="id1">This is some text<div class="class1"><p>lala</p> Some markup</div></div>');
$node = $xml->get_element_by_id("id1");
$content = $node->get_content();
I'm sorry, I don't have time to run a test of this right now, but hopefully it sets you in the right direction, and forms the basis for a decent revision... There is a good list of DOM traversal functions in the PHP documentation though :)
References: http://www.php.net/manual/en/book.domxml.php, http://www.php.net/manual/en/function.domdocument-get-element-by-id.php, http://www.php.net/manual/en/function.domnode-get-content.php
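For anyone without phpQuery, roughly the same result can be sketched with the DOM extension that ships with PHP 5 and later. The firstOwnText() helper is a made-up name. Unlike jQuery's contents().eq(0), it skips whitespace-only text nodes and returns the first direct text child that has content.

```php
<?php
// Hypothetical helper: text of the first non-blank text node that is a direct
// child of the element with the given id, or null if there is none.
function firstOwnText(string $html, string $id): ?string
{
    libxml_use_internal_errors(true);   // tolerate imperfect markup
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    // text() selects only direct text children, so nested elements are ignored.
    $nodes = $xpath->query(sprintf('//*[@id="%s"]/text()[normalize-space()]', $id));
    return $nodes->length > 0 ? trim($nodes->item(0)->nodeValue) : null;
}

$html = "<div id=\"id1\">\nThis is some text\n<div class=\"class1\"><p>lala</p> Some markup</div>\n</div>";
var_dump(firstOwnText($html, 'id1'));
```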

PHP get external page content

I get the HTML from another site with file_get_contents. My question is: how can I get a specific tag's value?
let's say i have:
<div id="global"><p class="paragraph">1800</p></div>
how can i get paragraph's value? thanks
If the example is really that trivial you could just use a regular expression. For generic HTML parsing though, PHP has DOM support:
$dom = new domDocument();
$dom->loadHTML("<div id=\"global\"><p class=\"paragraph\">1800</p></div>");
echo $dom->getElementsByTagName('p')->item(0)->nodeValue;
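If the real page contains more than one paragraph, picking the first <p> may grab the wrong one. A slightly more targeted sketch, assuming the markup really looks like the snippet in the question, selects by id and class with DOMXPath. The paragraphValue() name is only illustrative.

```php
<?php
// Hypothetical helper: returns the text of <p class="paragraph"> inside the
// element with id "global", or null when the page has no such node.
function paragraphValue(string $html): ?string
{
    libxml_use_internal_errors(true);   // tolerate imperfect markup
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    $nodes = $xpath->query('//div[@id="global"]/p[@class="paragraph"]');
    return $nodes->length > 0 ? trim($nodes->item(0)->nodeValue) : null;
}

$html = '<p>intro</p><div id="global"><p class="paragraph">1800</p></div>';
var_dump(paragraphValue($html));
```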
You need to parse the HTML. There are several ways to do this, including using PHP's XML parsing functions.
However, if it is just a simple value (as you asked above) I would use the following simple code:
// your content
$contents='<div id="global"><p class="paragraph">1800</p></div>';
// define start and end position
$start='<div id="global"><p class="paragraph">';
$end='</p></div>';
// find the stuff
$contents=substr($contents,strpos($contents,$start)+strlen($start));
$contents=substr($contents,0,strpos($contents,$end));
// write output
echo $contents;
Best of luck!
Christian Sciberras
(tested and works)
$input = '<div id="global"><p class="paragraph">1800</p></div>';
$output = strip_tags($input);
preg_match_all('#paragraph">(.*?)<#is', $input, $output);
print_r($output);
Untested.

Extract data from website via PHP

I am trying to create a simple alert app for some friends.
Basically I want to be able to extract the data "price" and "stock availability" from a webpage like the following two:
http://www.sparkfun.com/commerce/product_info.php?products_id=5
http://www.sparkfun.com/commerce/product_info.php?products_id=9279
I have made the alert via e-mail and sms part but now i want to be able to get the quantity and price out of the webpages (those 2 or any other ones) so that i can compare the price and quantity available and alert us to make an order if a product is between some thresholds.
I have tried some regex (found in some tutorials, but I am way too n00b for this) but haven't managed to get this working. Any good tips or examples?
$content = file_get_contents('http://www.sparkfun.com/commerce/product_info.php?products_id=9279');
preg_match('#<tr><th>(.*)</th> <td><b>price</b></td></tr>#', $content, $match);
$price = $match[1];
preg_match('#<input type="hidden" name="quantity_on_hand" value="(.*?)">#', $content, $match);
$in_stock = $match[1];
echo "Price: $price - Availability: $in_stock\n";
It's called screen scraping, in case you need to google for it.
I would suggest that you use a dom parser and xpath expressions instead. Feed the HTML through HtmlTidy first, to ensure that it's valid markup.
For example:
$html = file_get_contents("http://www.example.com");
$html = tidy_repair_string($html);
$doc = new DomDocument();
$doc->loadHtml($html);
$xpath = new DomXPath($doc);
// Now query the document:
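// Attributes are addressed with @ in XPath; //th finds header cells at any depth in the table
foreach ($xpath->query('//table[@class="pricing"]//th') as $node) {
    echo $node->nodeValue, "\n"; // a DOM node cannot be echoed directly
}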
Whatever you do, don't use regular expressions to parse HTML, or bad things will happen. Use a parser instead.
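Applied to the price and stock fields from the question, the parser approach might look roughly like the sketch below. The sample markup is an assumption reconstructed from the regexes above, not the real page, and the helper name and the threshold of 50 are invented for illustration.

```php
<?php
// Hypothetical helper: pulls the price and quantity_on_hand out of a product page.
function extractPriceAndStock(string $html): array
{
    libxml_use_internal_errors(true);   // tolerate imperfect markup
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    // The header cell of the row whose <b> label reads "price".
    $price = $xpath->evaluate('string(//tr[td/b[normalize-space()="price"]]/th)');
    // The value attribute of the hidden quantity_on_hand input.
    $stock = $xpath->evaluate('string(//input[@name="quantity_on_hand"]/@value)');

    return [
        'price' => $price === '' ? null : (float) preg_replace('/[^0-9.]/', '', $price),
        'stock' => $stock === '' ? null : (int) $stock,
    ];
}

// Assumed sample markup; a real run would fetch the page first.
$sample = '<table><tr><th>$14.95</th> <td><b>price</b></td></tr></table>'
        . '<form><input type="hidden" name="quantity_on_hand" value="42"></form>';

$info = extractPriceAndStock($sample);
if ($info['stock'] !== null && $info['stock'] < 50) {
    echo "Low stock: {$info['stock']} left at {$info['price']}\n";
}
```

The XPath expressions will need adjusting to whatever the page actually contains, and they will break if the layout changes, just like the regex version.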
First, this question goes too far into specifics. Second, extracting data from a website might not be legitimate. However, here are some hints:
Use Firebug or the Chrome/Safari Inspector to explore the HTML content and find the pattern of the interesting information.
Test your regex to see if it matches. You may need to do this many times (multi-pass parsing/extraction).
Write a client with cURL or, even simpler, use file_get_contents (note that some hosts disable loading URLs with file_get_contents).
Personally, I'd rather use Tidy to convert to valid XHTML and then use XPath to extract the data, instead of regex. Why? Because XHTML is not regular and XPath is very flexible. You can also learn XSLT to transform it.
Good luck!
You are probably best off loading the HTML code into a DOM parser like this one and searching for the "pricing" table. However, any kind of scraping you do can break whenever they change their page layout, and is probably illegal without their consent.
The best way, though, would be to talk to the people who run the site, and see whether they have alternative, more reliable forms of data delivery (Web services, RSS, or database exports come to mind).
Here is the simplest method to extract data from a website. I found that all my data is contained within <h3> tags only, so I prepared this:
<?php
include('simple_html_dom.php');
// Create DOM from URL, paste your destined web url in $page
$page = 'http://facebook4free.com/category/facebookstatus/amazing-facebook-status/';
$html = new simple_html_dom();
//Within $html your webpage will be loaded for further operation
$html->load_file($page);
// Find all links
$links = array();
//Within find() function, I have written h3 so it will simply fetch the content from <h3> tag only. Change as per your requirement.
foreach($html->find('h3') as $element)
{
$links[] = $element;
}
reset($links);
//$out will be having each of HTML element content you searching for, within that web page
foreach ($links as $out)
{
echo $out;
}
?>
