PHP Simple HTML Dom - tag attrib - php

I'm trying to grab the plain text from within <font size="3" color="blue"> ... It's not picking up the font tag, although it does work if I do "font", 3. But there are a lot of font tags in the site and I'd like to make the search a bit more specific. Is it possible to match on multiple attributes of a tag?
<?php
include('simple_html_dom.php');
$html = new simple_html_dom();
$html = file_get_html('http://cwheel.domain.com/');
##### <font size="3" color="blue">Certified Genuine</font>
$element = $html->find("font[size=3][color=blue]", 0);
echo $element-> plaintext . '<br>';
$html->clear();
?>

I don't know Simple_html_dom, but the query you are trying to pass looks like an XPath query. In that case you need to prefix attributes with @. You also need to prefix the whole query with // so that it matches font tags at any depth. The final query should look something like this:
//font[@size="3"][@color="blue"]
Using DOMDocument and DOMXPath it works pretty well. Here $html must be the page's HTML as a string (for example from file_get_contents()), not a simple_html_dom object.
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
$fonts = $xpath->query('//font[@size="3"][@color="blue"]');
foreach ($fonts as $font) {
echo $font->textContent . "\n";
}
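For anyone who wants to try the idea without fetching a page, here is a minimal self-contained sketch. The sample markup and the $markup and $texts names are made up for illustration; only the second font tag carries both attribute values, so it should be the only match.

```php
<?php
// Hypothetical sample markup: only the second <font> has size="3" AND color="blue".
$markup = '<html><body>'
    . '<font size="3" color="red">Not this one</font>'
    . '<font size="3" color="blue">Certified Genuine</font>'
    . '<font size="2" color="blue">Nor this one</font>'
    . '</body></html>';

$doc = new DOMDocument();
$doc->loadHTML($markup);

$xpath = new DOMXPath($doc);
// Each [...] predicate narrows the match further, so both attributes must match.
$fonts = $xpath->query('//font[@size="3"][@color="blue"]');

$texts = [];
foreach ($fonts as $font) {
    $texts[] = $font->textContent;
}
echo implode("\n", $texts), "\n";
```

Chaining predicates like this behaves as a logical AND; writing [@size="3" and @color="blue"] in a single predicate should be equivalent.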

Related

Remove HTML Tag using DOMDocument

I'd like to remove <font> tags from my html and am trying to use replaceChild to do so, but it doesn't seem to work properly. Can anyone catch what might be wrong?
$html = '<html><body><br><font class="heading2">Limited Size and Resources</font><p><br><strong>Q: When can a member use the limited size and resources exception?</strong></p></body></html>';
$dom = new DOMDocument();
$dom->loadHTML($html);
$font_tags = $dom->GetElementsByTagName('font');
foreach($font_tags as $font_tag) {
foreach($font_tag as $child) {
$child->replaceChild($child->nodeValue, $font_tag);
}
}
echo $dom->saveHTML();
From what I understand, $font_tags is a DOMNodeList, so I need to iterate through it twice in order to use the DOMNode::replaceChild function. I then want to replace the current value with just the content inside of the tags. However, when I output the $html nothing changes. Any ideas what could be wrong?
Here is a PHP Sandbox to test the code.
I'll put my remarks inline
$html = '<html><body><br><font class="heading2">Limited Size and Resources</font><p><br><strong>Q: When can a member use the limited size and resources exception?</strong></p></body></html>';
$dom = new DOMDocument();
$dom->loadHTML($html);
$font_tags = $dom->GetElementsByTagName('font');
/* You only need one loop, as it is iterating your collection
You would only need a second loop if each font tag had children of their own
*/
foreach($font_tags as $font_tag) {
/* replaceChild replaces children of the node being called
So, to replace the font tag, call the function on its parent
$prent will be that reference
*/
$prent = $font_tag->parentNode;
/* You can't insert arbitrary text, you have to create a textNode
That textNode must also be a member of your document
*/
$prent->replaceChild($dom->createTextNode($font_tag->nodeValue), $font_tag);
}
echo $dom->saveHTML();
Updated Sandbox: Hopefully I understood your requirements correctly
First, let us find out what wasn't working in your code.
1. foreach($font_tag as $child) wasn't even iterating once, as $font_tag is a single 'font' element from the $font_tags list, and not a list itself.
2. $child->replaceChild($child->nodeValue, $font_tag); - A child node can't replace its parent ($font_tag), but the reverse is possible, because replaceChild is a method of the parent node that replaces one of its children. For more details check the PHP DOMNode::replaceChild documentation, or point 2 below my code.
3. echo $html will output the original $html string, not the updated $dom object that we are modifying.
This would work -
$html = '<html><body><br><font class="heading2">Limited Size and Resources</font><p><br><strong>Q: When can a member use the limited size and resources exception?</strong></p></body></html>';
$dom = new DOMDocument();
$dom->loadHTML($html);
$font_tags = $dom->GetElementsByTagName('font');
foreach($font_tags as $font_tag)
{
$new_node = $dom->createTextNode($font_tag->nodeValue);
$font_tag->parentNode->replaceChild($new_node, $font_tag);
}
echo $dom->saveHTML();
1. I am creating $new_node through $dom (createTextNode), so the node belongs to the DOMDocument instead of being a plain string.
2. To replace the child $font_tag, we first have to go to its parent node using the parentNode property.
3. Finally, we print the modified $dom using the saveHTML method, which converts the DOMDocument into an HTML string.
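One caveat when the document holds more than one font tag: getElementsByTagName returns a live list, so replacing nodes while looping over it directly can skip elements. A minimal sketch of one way around that, assuming made-up markup with two font tags, is to copy the list into an array first:

```php
<?php
// Hypothetical markup with two <font> tags, to show the multi-element case.
$markup = '<html><body><font>One</font><font>Two</font><p>Three</p></body></html>';

$dom = new DOMDocument();
$dom->loadHTML($markup);

// Snapshot the live node list into a plain array before modifying the tree,
// so that removing a <font> does not shift the remaining items mid-loop.
$fontTags = iterator_to_array($dom->getElementsByTagName('font'));

foreach ($fontTags as $fontTag) {
    $text = $dom->createTextNode($fontTag->nodeValue);
    $fontTag->parentNode->replaceChild($text, $fontTag);
}

// Count what is left; with the snapshot approach this should be zero.
$remaining = $dom->getElementsByTagName('font')->length;
echo $dom->saveHTML();
```

Like the answers above, this keeps only the text of each font tag; any markup nested inside the tag is flattened by nodeValue.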
Remove a specific span tag from HTML while preserving/keeping the inside content using PHP and DOMDocument
<?php
$content = '<span style="font-family: helvetica; font-size: 12pt;"><div>asdf</div><span>TWO</span>Business owners are fearful of leading. They would rather follow the leader than embrace a bold move that challenges their confidence. </span>';
$dom = new DOMDocument();
// Use LIBXML for preventing output of doctype, <html>, and <body> tags
$dom->loadHTML($content, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$xpath = new DOMXPath($dom);
foreach ($xpath->query('//span[@style="font-family: helvetica; font-size: 12pt;"]') as $span) {
// Move all span tag content to its parent node just before it.
while ($span->hasChildNodes()) {
$child = $span->removeChild($span->firstChild);
$span->parentNode->insertBefore($child, $span);
}
// Remove the span tag.
$span->parentNode->removeChild($span);
}
// Get the final HTML with span tags stripped
$output = $dom->saveHTML();
print_r($output);

PHP - SimpleHTMLDom - How to access table elements?

I'm trying to grab the artists for every album release based on metacritic using simplehtmldom - http://www.metacritic.com/browse/albums/release-date/coming-soon/date?view=detailed
The artist names are contained within separate td elements which have the class name artistName.
What I've managed to figure out so far is
$html = file_get_html('http://www.metacritic.com/browse/albums/release-date/coming-soon/date?view=detailed');
$es = $html->find('table.musicTable td');
Where do I go from here? I'm finding examples and the documentation a bit confusing. Any help will be really appreciated.
I suggest using the PHP DOM extension (see the DOM manual), which is a very powerful tool for parsing and manipulating XML or HTML documents.
For your case you can do something like this:
<?php
$html = file_get_contents('http://www.metacritic.com/browse/albums/release-date/coming-soon/date?view=detailed');
$doc = new DOMDocument();
$doc->loadHTML($html);
$tables = $doc->getElementsByTagName("table");
foreach( $tables as $table )
{
//do your things here
}
?>
Alternatively, you can use XPath to query the document; see the XPath usage documentation.
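As a rough sketch of the XPath route, the following selects the anchors inside td cells whose class list contains artistName. The markup is a made-up excerpt (the real page may be structured differently), and the album titles are placeholders.

```php
<?php
// Hypothetical excerpt standing in for the real page; the actual markup may differ.
$markup = '<table class="musicTable">'
    . '<tr><td class="artistName"><a href="#">Peter Gabriel</a></td>'
    . '<td class="albumTitle">Album A</td></tr>'
    . '<tr><td class="artistName"><a href="#">TOY</a></td>'
    . '<td class="albumTitle">Album B</td></tr>'
    . '</table>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // real-world pages often trigger parser warnings
$doc->loadHTML($markup);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
// Match "artistName" as a whole word inside the class attribute,
// so a cell with several classes is still found.
$anchors = $xpath->query(
    '//td[contains(concat(" ", normalize-space(@class), " "), " artistName ")]//a'
);

$artists = [];
foreach ($anchors as $anchor) {
    $artists[] = trim($anchor->textContent);
}
print_r($artists);
```

If the class attribute is known to hold exactly one class, the simpler predicate [@class="artistName"] would do.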
Every name is contained in an anchor inside a <td class="artistName">; that is all you need to know in this case to write the following code:
$url = "http://www.metacritic.com/browse/albums/release-date/coming-soon/date?view=detailed";
//Create a DOM object
$html = new simple_html_dom();
// Load HTML from a URL
$html->load_file($url);
// Find the anchor containing the name inside all "td.artistName" elements
$anchors = $html->find('td.artistName a');
// loop through all found anchors and print the content
foreach($anchors as $anchor) {
$name = $anchor->plaintext;
echo $name . "<br>";
}
// Clear DOM object
$html->clear();
unset($html);
OUTPUT
Peter Gabriel
Stephen Malkmus & The Jicks
TOY
Black Knights
Broken Bells
Bruce Springsteen
David Broza
Eskimo Callboy
...
Working DEMO
Please read the MANUAL for more examples and details

Read page source using PHP with primes "

I am trying to read the source code of a page. I just want to read some text that is within a certain division element with the id "wrapper_left".
My problem is that if a prime " is used in the first argument of the explode function, it does not work. I tried escaping the string, although I figured this wouldn't do anything.
$source_code = htmlspecialchars(file_get_contents('http://mydomain.com'));
$source_code = explode('<div id="wrapper_left">', $source_code);
echo $source_code[1];
Thanks tons in advance.
Don't bother trying to get this done with explode(), string manipulation, or a regular expression. You need an HTML parser, like DOMDocument:
$doc = new DOMDocument;
$doc->loadHTMLFile( 'http://mydomain.com');
$xpath = new DOMXPath( $doc);
$div = $xpath->query( '//div[@id="wrapper_left"]')->item(0);
echo $div->textContent;
You can see it working in this demo, which, when fed this HTML:
<div id="wrapper_left">Some text</div>
It produces:
Some text
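For what it's worth, the original explode() most likely failed because of htmlspecialchars(), not because of the quote character itself: escaping turns < > and " into entities, so the delimiter no longer appears in the string. A small sketch with an inline sample page (the $page, $foundInEscaped and $text names are made up here) shows both the failure and the parser-based lookup:

```php
<?php
// Inline sample standing in for the fetched page.
$page = '<div id="wrapper_left">Some text</div>';

// htmlspecialchars() rewrites < > and " as entities, so the literal
// '<div id="wrapper_left">' no longer occurs in the escaped string.
$escaped = htmlspecialchars($page);
$parts = explode('<div id="wrapper_left">', $escaped);
$foundInEscaped = count($parts) > 1;

// Parsing the unescaped HTML and querying by id avoids string matching entirely.
$doc = new DOMDocument();
$doc->loadHTML($page);
$xpath = new DOMXPath($doc);
$div = $xpath->query('//div[@id="wrapper_left"]')->item(0);

// item(0) returns null when nothing matches, so guard before reading the text.
$text = $div !== null ? $div->textContent : '';
echo $text, "\n";
```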

Keep new line, when the HTML is on 1 line and new line layout is done with <div>

I need to get content from a site
I need to get
/html/body/div/div[2]/table/tbody/tr/td/div/div[2]/form/fieldset[2]/table[2]
or
<table class='properties'>
For which the code is visible here: http://paste.pocoo.org/show/347881/
I need its contents, with each piece of text on its own line.
I don't care about paddings, and other formatting, I just want to keep the new lines.
For example a proper output would be
tájékoztató
az eljárás eredményéről
A Közbeszerzések Tanácsa (Szerkesztőbizottsága) tölti ki
A hirdetmény kézhezvételének dátuma____________________
KÉ nyilvántartási szám_________________________________
I. SZAKASZ: AJÁNLATKÉRŐ
I.1) Név, cím és kapcsolattartási pont(ok)
The problem I face is that the new lines are introduced by the divs, and I cannot reproduce them in the output.
Update
This will be executed by a PHP cron job, so there is no access to JS.
There is a library called phpQuery: http://code.google.com/p/phpquery/
You can walk through DOM object like with jQuery:
phpQuery::newDocument($htmlCode)->find('table.properties');
Call strip_tags on the matched element's content and you will get the plain text of that table.
The trick is to fetch the inner divs in an xpath expression, then use their textContent property:
<?php
$domd = new DOMDocument();
libxml_use_internal_errors(true);
$domd->loadHTML(file_get_contents("..."));
libxml_use_internal_errors(false);
$domx = new DOMXPath($domd);
$items = $domx->query("/html/body/div/div[2]/table/tr/td/div/div[2]/form/fieldset[2]/table[2]/tr/td/div//div/div[@style='padding-left: 0px;']");
$output = "";
foreach ($items as $item) {
$output .= $item->textContent . "\n";
}
echo $output;
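If the exact path or style attribute is not stable, a looser variant is to take every innermost div under the table and treat each as one line. This is only a sketch under that assumption; the markup below is invented, and the real page may need a different selector.

```php
<?php
// Hypothetical single-line markup: each innermost <div> is one visual line.
$markup = '<table class="properties"><tr><td>'
    . '<div><div>First line</div><div>Second line</div></div>'
    . '<div>Third line</div>'
    . '</td></tr></table>';

$domd = new DOMDocument();
libxml_use_internal_errors(true);
$domd->loadHTML($markup);
libxml_clear_errors();

$domx = new DOMXPath($domd);
// div[not(.//div)] keeps only divs that contain no further div,
// so wrapper divs do not repeat the text of their children.
$items = $domx->query('//table[@class="properties"]//div[not(.//div)]');

$lines = [];
foreach ($items as $item) {
    $lines[] = trim($item->textContent);
}
$output = implode("\n", $lines) . "\n";
echo $output;
```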

How do I assemble pieces of HTML into a DOMDocument?

It appears that loadHTML and loadHTMLFile, when given files representing sections of an HTML document, fill in html and body tags for each section, as revealed when I output with the following:
$doc = new DOMDocument();
$doc->loadHTMLFile($file);
$elements = $doc->getElementsByTagName('*');
if( !is_null($elements) ) {
foreach( $elements as $element ) {
echo "<br/>". $element->nodeName. ": ";
$nodes = $element->childNodes;
foreach( $nodes as $node ) {
echo $node->nodeValue. "\n";
}
}
}
Since I plan to assemble these parts into the larger document within my own code, and I've been instructed to use DOMDocument to do it, what can I do to prevent this behavior?
This is part of several modifications the HTML parser module of libxml makes to the document in order to work with broken HTML. It only occurs when using loadHTML and loadHTMLFile on partial markup. If you know the partial is valid X(HT)ML, use load and loadXML instead.
You could use
$doc->saveXml($doc->getElementsByTagName('body')->item(0));
to dump the outerHTML of the body element, e.g. <body>anything else</body> and strip the body element with str_replace or extract the inner html with substr.
$html = '<p>I am a fragment</p>';
$dom = new DOMDocument;
$dom->loadHTML($html); // added html and body tags
echo substr(
$dom->saveXml(
$dom->getElementsByTagName('body')->item(0)
),
6, -7
);
// <p>I am a fragment</p>
Note that this will use XHTML compliant markup, so <br> would become <br/>. As of PHP 5.3.5, there is no way to pass a node to saveHTML(). A bug request has been filed.
The closest you can get is to use the DOMDocumentFragment.
Then you can do:
$doc = new DOMDocument();
...
$f = $doc->createDocumentFragment();
$f->appendXML("<foo>text</foo><bar>text2</bar>");
$someElement->appendChild($f);
However, this expects XML, not HTML.
In any case, I think you're creating an artificial problem. Since you know the behavior is to create the html and body tags, you can just extract the elements in the file from within the body tag and then import them into the DOMDocument where you're assembling the final file. See DOMDocument::importNode.
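A minimal sketch of that importNode approach follows. The fragment strings stand in for the separate section files, and the variable names are made up for illustration. It also relies on saveHTML() accepting a node argument, which is available on PHP 5.3.6 and later.

```php
<?php
// Hypothetical fragments standing in for the separate section files.
$fragments = ['<p>I am a fragment</p>', '<div>Another <b>section</b></div>'];

// Build the target document by hand so its structure is fully under our control.
$target = new DOMDocument();
$html = $target->createElement('html');
$body = $target->createElement('body');
$html->appendChild($body);
$target->appendChild($html);

foreach ($fragments as $fragment) {
    $part = new DOMDocument();
    $part->loadHTML($fragment);   // libxml wraps the fragment in <html><body>
    $partBody = $part->getElementsByTagName('body')->item(0);
    foreach ($partBody->childNodes as $child) {
        // importNode() makes a deep copy owned by $target; the source is untouched.
        $body->appendChild($target->importNode($child, true));
    }
}

// Serialize only the assembled children, without the html/body wrapper.
$assembled = '';
foreach ($body->childNodes as $child) {
    $assembled .= $target->saveHTML($child);
}
echo $assembled, "\n";
```

Because importNode copies rather than moves, looping over the source childNodes directly is safe here.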
