simple_html_dom find all elements that ONLY contain certain text - php

I have:
<span>something or other</span>
<b>blarg</b>
<b>blarg and stuff</b>
<span>blarg</span>
<em>wakka wakka</em>
<em>wakka blarg</em>
<em>blarg</em>
and I just want to get the elements that ONLY contain "blarg" and no other text, so:
<b>blarg</b>
<span>blarg</span>
<em>blarg</em>
The important issue here is that I'm trying to check if blarg exists within one element alone on the page or not. I've had some general luck with regex but I'd rather do it with simple_html_dom so that I can look at child and sibling elements as well.
Does anyone know what is the simplest way to do this with simple_html_dom?

One way to do it is to walk every element and test whether its text is exactly 'blarg'.
Here's a working example:
$text = '<span>something or other</span>
<b>blarg</b>
<b>blarg and stuff</b>
<span>blarg</span>
<em>wakka wakka</em>
<em>wakka blarg</em>
<em>blarg</em>';
echo "<div>Original Text: <xmp>$text</xmp></div>";
$html = str_get_html($text);
// Find all elements
$tags = $html->find('*');
foreach ($tags as $key => $tag) {
// If the tag's trimmed text is exactly 'blarg'
if (trim($tag->plaintext) === 'blarg') {
echo "<div> 'blarg' found in \$tags[$key]: <xmp>".$tag->outertext."</xmp></div>";
}
}
I don't know what you want to do with the matches, but this may be a start :)
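For comparison, here is a minimal sketch of the same check using PHP's built-in DOMDocument and DOMXPath rather than simple_html_dom. It assumes the fragment can be wrapped in a div of my own (not part of the original markup) to give it a single root. It also requires that a match has no child elements, so a container holding only <b>blarg</b> would not be reported alongside the <b> itself.

```php
<?php
$text = '<span>something or other</span>
<b>blarg</b>
<b>blarg and stuff</b>
<span>blarg</span>
<em>wakka wakka</em>
<em>wakka blarg</em>
<em>blarg</em>';

$dom = new DOMDocument;
// The wrapper div is an assumption of this sketch; it only gives the fragment one root.
$dom->loadHTML('<div>' . $text . '</div>');
$xpath = new DOMXPath($dom);

// not(*)             : the element has no child elements
// normalize-space(.) : its text content with surrounding whitespace trimmed
$matches = [];
foreach ($xpath->query("//*[not(*) and normalize-space(.) = 'blarg']") as $node) {
    $matches[] = $dom->saveHTML($node);
}

print_r($matches);
```

Because each match is a DOMElement, its parentNode, previousSibling and nextSibling are available for looking at surrounding elements.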

Related

Remove all except inside tag

How to remove all from page except text inside <p> tag?
Page:
This is text.
<div class="text">This is text in 'div' tag</div>
<p>This is text in 'p' tag</p>
Expected result:
This is text in 'p' tag
Greetings.
Basically, you'll have to parse the markup. PHP comes with a good parser in the form of the DOMDocument class, so that's really quite easy:
$dom = new DOMDocument;
$dom->loadHTML($htmlString);
Next, get all p tags:
$paragraphs = $dom->getElementsByTagName('p');
This method returns a DOMNodeList object, which implements the Traversable interface, so you can loop over it with foreach as you would over an array of DOMNode instances (DOMElement in this case):
$first = $paragraphs->item(0);//or $paragraphs[0] even
foreach ($paragraphs as $p) {
echo $p->textContent;//echo the inner text
}
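If you only want the paragraph elements that do not contain child elements, then you can easily check that. Note that hasChildNodes() also counts text nodes, so count descendant elements instead:
foreach ($paragraphs as $p) {
// A paragraph with plain text still has a text node child, so look for elements only
if ($p->getElementsByTagName('*')->length === 0) {
echo $p->textContent; // or $p->nodeValue
}
}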
A closely related answer with some more links/info: How to split an HTML string into chunks in PHP?
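Put together, the steps above might look like the following sketch for the page in the question. The wrapper div is my own addition (an assumption, not part of the original page), so that the loose text at the top sits inside an element before parsing.

```php
<?php
$htmlString = "This is text.
<div class=\"text\">This is text in 'div' tag</div>
<p>This is text in 'p' tag</p>";

$dom = new DOMDocument;
// Wrapper div added for this sketch so the bare text has a parent element.
$dom->loadHTML('<div>' . $htmlString . '</div>');

// Collect the text of every <p>, ignoring everything else on the page.
$texts = [];
foreach ($dom->getElementsByTagName('p') as $p) {
    $texts[] = trim($p->textContent);
}

echo implode("\n", $texts), "\n";
```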
You can easily do this with the native PHP strip_tags function, like so:
strip_tags("<p>This is text in 'p' tag</p>");
This returns "This is text in 'p' tag", as expected. NOTE: strip_tags only removes the tags and keeps all of the text, so it is only useful here once you have an outer container and use a little bit of dirty RegExp to strip not only the p but also the other elements (e.g. the div tag) together with their contents. The function takes one required argument, the string to strip the tags from, and an optional second argument listing, as a string, the allowable tags that will not be removed. For more information, see the PHP documentation for strip_tags. I hope you got the idea :)
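To make the two arguments concrete, here is a small sketch. The $markup string is taken from the question, and the variable names are only illustrative.

```php
<?php
$markup = "<div class=\"text\">This is text in 'div' tag</div>\n<p>This is text in 'p' tag</p>";

// One argument: every tag is removed, but all of the text stays.
$noTags = strip_tags($markup);

// Two arguments: tags listed in the second argument are left in place.
$keepP = strip_tags($markup, '<p>');

echo $noTags, "\n---\n", $keepP, "\n";
```

Note that the div's text survives in both cases, which is why strip_tags alone does not give the expected result for the full page.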

PHP Simple Html Dom get the plain text of div,but avoiding all other tags

I use PHP Simple HTML DOM to fetch some HTML, and I now have a DOM like the code below. I need to fetch the plain text inside the div while skipping the p tags and their content (so it should return only 111111). Can anyone help? Thanks in advance!
<div>
<p>00000000</p>
111111
<p>22222222</p>
</div>
It depends on what you mean by "avoiding the p tags".
If you just want to remove the tags, then just running strip_tags() on it should work for what you want.
If you actually want to return just "111111" (i.e. strip the tags and their contents), then strip_tags() isn't a viable solution. For that, something like this may work:
$myDiv = $html->find('div', 0); // the second argument returns a single element rather than an array; adjust the selector to your div
$children = $myDiv->children; // get an array of children
foreach ($children AS $child) {
$child->outertext = ''; // This removes the element, but MAY NOT remove it from the original $myDiv
}
echo $myDiv->innertext;
If your text is always at the same position, try this:
$html->find('text', 2)->plaintext; // should return 111111
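If the position of the text is not fixed, another option is to keep only the text nodes that are direct children of the div. The sketch below uses PHP's built-in DOMDocument rather than Simple HTML DOM, which is my substitution and not what the question asked for. The variable names are illustrative.

```php
<?php
$markup = "<div>
<p>00000000</p>
111111
<p>22222222</p>
</div>";

$dom = new DOMDocument;
$dom->loadHTML($markup);
$div = $dom->getElementsByTagName('div')->item(0);

// Concatenate only the text nodes sitting directly inside the div;
// text inside the <p> children belongs to those elements and is skipped.
$direct = '';
foreach ($div->childNodes as $child) {
    if ($child->nodeType === XML_TEXT_NODE) {
        $direct .= $child->nodeValue;
    }
}
$direct = trim($direct);

echo $direct, "\n";
```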
Here is my solution.
I wanted to get only the primary text part:
$title_obj = $article->find(".ofr-descptxt",0); //Store the Original Tree ie) h3 tag
$title_obj->children(0)->outertext = ""; //Unset <br/>
$title_obj->children(1)->outertext = ""; //Unset the last Span
echo $title_obj; //It has only first element
Edited:
If you get PHP errors, wrap each call in an if/else, or try my lazy code:
($title_obj->children(0))?$title_obj->children(0)->outertext="":"";
($title_obj->children(1))?$title_obj->children(1)->outertext = "":"";
Official Documentation
$wordlist = array("<p>", "</p>");
foreach($wordlist as $word)
$string = str_replace($word, "", $string);

Remove tags with Simple HTML DOM parser [duplicate]

I would like to use Simple HTML DOM to remove all images in an article so I can easily create a small snippet of text for a news ticker but I haven't figured out how to remove elements with it.
Basically I would do
Get content as HTML string
Remove all image tags from content
Limit content to x words
Output.
Any help?
There is no dedicated method for removing elements. You just find all the img elements and then do
$e->outertext = '';
When you only clear the outer text, you delete the HTML content itself, but if you perform another find for the same elements, they will still appear in the result. The reason is that the Simple HTML DOM object still has its internal structure for the element, only without its actual content. To really delete the element, simply reload the HTML as a string into the same variable. That way the Simple HTML DOM object is rebuilt without the deleted content.
Here is an example function:
public function removeNode($selector)
{
foreach ($this->find($selector) as $node)
{
$node->outertext = '';
}
$this->load($this->save());
}
Put this function inside the simple_html_dom class and you're good.
I think you are having difficulties because you forgot to save (that is, dump the internal DOM tree back into a string).
Try this:
$html = file_get_html("http://example.com");
foreach($html ->find('img') as $item) {
$item->outertext = '';
}
$html->save();
echo $html;
I could not figure out where to put the function so I just put the following directly in my code:
$html->load($html->save());
It basically locks changes made in the for loop back into the html per above.
The supposed solutions are quite expensive and practically unusable in a big loop or other kind of repetition.
I prefer to use "soft deletes":
foreach ($html->find('somecondition') as $item) {
if ($somecheck) $item->setAttribute('softDelete', true); // set a marker to check in later code
$item->outertext = '';
foreach ($foo as $bar) {
if (!$bar->getAttribute('softDelete')) {
// do something
}
}
}
This is working for me:
foreach($html->find('element') as $element){
$element = NULL;
}
Adding new answer since removeNode is definitely a better way of removing it:
$html->removeNode('img');
This method probably was not available when accepted answer was marked. You do not need to loop the html to find each one, this will remove them.
Use outerhtml instead of outertext
<div id='your_div'>the contents of your div</div>
$your_div->outertext = '';
echo $your_div; // echoes <div id='your_div'></div>
$your_div->outerhtml = '';
echo $your_div; // echoes nothing
Try this:
$dom = new Dom();
$dom->loadStr($text);
foreach ($dom->find('element') as $element) {
$element->delete();
}
This works now:
$element->remove();
You can see the documentation for the method here.
Below I remove the HEADER and all SCRIPT nodes of the incoming URL, using the find() function in two different ways. Omit the second parameter to get an array of all matching nodes, then just loop through the nodes.
$clean_html = file_get_html($url);
// Find and remove 1st instance of node.
$node = $clean_html->find('header', 0);
$node->remove();
// Find and remove all instances of node.
$nodes = $clean_html->find('script');
foreach($nodes as $node) {
$node->remove();
}
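For the full pipeline in the question (get the content, remove the images, limit to x words, output), here is a minimal sketch using PHP's built-in DOMDocument instead of Simple HTML DOM. The sample $content, the $limit of 5 words and the variable names are assumptions made for illustration.

```php
<?php
$content = '<div><p>Breaking <img src="a.png" alt=""> news: the quick brown fox jumps over the lazy dog.</p></div>';
$limit = 5; // number of words to keep in the snippet

$dom = new DOMDocument;
$dom->loadHTML($content);

// getElementsByTagName() returns a live list, so copy it to an array
// before removing nodes; otherwise the list shifts during iteration.
foreach (iterator_to_array($dom->getElementsByTagName('img')) as $img) {
    $img->parentNode->removeChild($img);
}

// Split the remaining text on whitespace and keep the first $limit words.
$words = preg_split('/\s+/', trim($dom->documentElement->textContent), -1, PREG_SPLIT_NO_EMPTY);
$snippet = implode(' ', array_slice($words, 0, $limit));

echo $snippet, "\n";
```

Here the element is detached from the tree itself, so a later lookup for img elements finds nothing and no reload step is needed.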

PHP - Extracting two values from a line

I'm a beginner with regular expressions and am working on a server where I cannot install anything (does using DOM methods require installing anything?).
I have a problem that I cannot solve with my current knowledge.
I would like to extract from the line below the album id and image url.
There are more lines and other url elements in the string (file), but the album ids and image urls I need are all in strings similar to the one below:
<img alt="/" src="http://img255.imageshack.us/img00/000/000001.png" height="133" width="113">
So in this case I would like to get '774' and 'http://img255.imageshack.us/img00/000/000001.png'
I've seen multiple examples of extracting just the url or one other element from a string, but I really need to keep these both together and store these in one record of the database.
Any help is really appreciated!
Since you are new to this, I'll explain that you can use PHP's HTML parser, known as DOMDocument, to extract what you need. You should not use a regular expression, as regular expressions are inherently error-prone when it comes to parsing HTML and can easily result in many false positives.
To start, let's say you have your HTML:
$html = '<img alt="/" src="http://img255.imageshack.us/img00/000/000001.png" height="133" width="113">';
And now, we load that into DOMDocument:
$doc = new DOMDocument;
$doc->loadHTML( $html);
Now that the HTML is loaded, it's time to find the elements we need. Let's assume that you can encounter other <a> tags within your document, so we want only those <a> tags that have an <img> tag as a direct child. Once we know we have the correct nodes, we extract the information from them. So, let's have at it:
$results = array();
// Loop over all of the <a> tags in the document
foreach( $doc->getElementsByTagName( 'a') as $a) {
// If there are no children, continue on
if( !$a->hasChildNodes()) continue;
// Find the child <img> tag, if it exists
foreach( $a->childNodes as $child) {
if( $child->nodeType == XML_ELEMENT_NODE && $child->tagName == 'img') {
// Now we have the <a> tag in $a and the <img> tag in $child
// Get the information we need:
parse_str( parse_url( $a->getAttribute('href'), PHP_URL_QUERY), $a_params);
$results[] = array( $a_params['album'], $child->getAttribute('src'));
}
}
}
A print_r( $results); now leaves us with:
Array
(
[0] => Array
(
[0] => 774
[1] => http://img255.imageshack.us/img00/000/000001.png
)
)
Note that this omits basic error checking. One thing you can add is in the inner foreach loop, you can check to make sure you successfully parsed an album parameter from the <a>'s href attribute, like so:
if( isset( $a_params['album'])) {
$results[] = array( $a_params['album'], $child->getAttribute('src'));
}
Every function I've used in this can be found in the PHP documentation.
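The parse_url() and parse_str() step can be tried on its own. In this sketch the $href value is a hypothetical URL made up for illustration, since the link's actual href is not shown in the question.

```php
<?php
// Hypothetical href; only the album query parameter matters here.
$href = 'http://example.com/gallery.php?album=774&page=2';

// parse_url() with PHP_URL_QUERY returns just the query string ("album=774&page=2"),
// and parse_str() turns that string into an associative array.
parse_str((string) parse_url($href, PHP_URL_QUERY), $params);

// Fall back to null when the link carries no album parameter.
$album = $params['album'] ?? null;

var_dump($album);
```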
If you've already narrowed it down to this line, then you can use a regex like the following:
$matches = array();
preg_match('#.+album=(\d+).+src="([^"]+)#', $yourHtmlLineHere, $matches);
Now if you run:
echo $matches[1];
echo " ";
echo $matches[2];
You'll get the following:
774 http://img255.imageshack.us/img00/000/000001.png
