How do I use a DOM parser to extract the content of an HTML element held in a variable?
More precisely:
I have a form where the user enters HTML in a text area. I want to extract the content of the first paragraph.
I know there are many tutorials on this, but I could not find any on extracting from a variable rather than a file (page).
Thanks
If you're taking HTML as user input, I recommend using simplehtmldom. It has a loose parser that tolerates buggy HTML and lets you use CSS selectors to pull elements and their content out of the DOM.
I didn't test this, but it should work:
$html = str_get_html($_POST['input']);
print $html->find('p', 0)->plaintext; // first paragraph (index 0)
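If installing a third-party library is not an option, the same idea can be sketched with PHP's built-in DOMDocument. This is only a minimal sketch: the helper name `firstParagraphText` is hypothetical, and it assumes the input is ASCII or otherwise compatible with the parser's default encoding handling.

```php
<?php
// Hypothetical helper: returns the text of the first <p> in an HTML string,
// or null when the string contains no paragraph.
function firstParagraphText(string $html): ?string
{
    libxml_use_internal_errors(true);   // user input is rarely valid HTML
    $doc = new DOMDocument();
    // Wrapping in a div keeps the parser happy even for empty input.
    $doc->loadHTML('<div>' . $html . '</div>');
    libxml_clear_errors();

    $p = $doc->getElementsByTagName('p')->item(0);
    return $p === null ? null : trim($p->textContent);
}

$first = firstParagraphText('<p>Hello <b>there</b></p><p>Second</p>');
echo $first;
```

In a form handler the argument would be the submitted field, for example `$_POST['input']` as in the snippet above.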
I am trying to extract the content of a <div> nested inside a <code> tag with PHP Simple HTML DOM Parser, but I always get the error "Trying to get property of non-object in...", as if the parser were finding nothing inside my <div>.
The code I'm using is
include_once('simplehtmldom_1_5/simple_html_dom.php');
// Create a DOM object
$html = new simple_html_dom();
// Load HTML
$html->load('<code><div>hello</div></code>');
// Extract div content
echo $html->find('div',0)->innertext;
But if, instead of <code><div>hello</div></code>, I use <span><div>hello</div></span> as my sample code, it works. It seems I only have problems when looking inside the code tag.
What's wrong with what I'm doing?
Hope you guys can point me in the right direction, thank you very much for your support!
simplehtmldom, among others, strips out preformatted tags.
If you want the code tag to be recognized, delete or comment out line 1076 in simple_html_dom.php.
According to the source code for Simple HTML DOM, it automagically removes code tags when it loads the HTML into the parser.
If you need the functionality, you'll need to remove the reference to remove_noise() in the load() function within simple_html_dom.php.
This should produce the results you expect, but it may well introduce other issues, depending on the author's reasoning for removing the tags in the first place.
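An alternative that avoids patching the library is PHP's built-in DOMDocument, which does not discard the contents of code tags. The following is a minimal sketch using the sample markup from the question; it is not the Simple HTML DOM API.

```php
<?php
libxml_use_internal_errors(true);   // suppress warnings for loose markup
$doc = new DOMDocument();
$doc->loadHTML('<code><div>hello</div></code>');
libxml_clear_errors();

// Look the div up by tag name; item(0) is null when nothing matches.
$div = $doc->getElementsByTagName('div')->item(0);
$content = $div === null ? null : $div->textContent;
echo $content;
```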
I'm very new to regular expressions. I want to preg_match all elements in an HTML DOM that have a data-editable attribute. All other attributes of those elements should be matched as well, so I can reuse them later:
<div class="teaser" id="teaser" data-editable><p>Content</p></div>
After matching, I want the elements with the data-editable attribute to get specific CSS classes and another element added inside. So only block-level parents should be matched.
<div class="teaser editable" id="teaser"><button>edit</button><p>Content</p></div>
Here's what I've got:
<(div|p).*(data-editable).[^>]+>(.*?)<\/\1>
I know I'm totally wrong with that: it also matches elements that do not have the data-editable attribute set, because of the .* inside. But how do I match the different attributes without losing their values?
You shouldn't go through HTML with regex (as shown here). Instead, use an HTML parsing library, such as the PHP Simple HTML DOM Parser, to process your HTML pages.
According to their documentation, you can do what you want through this: $html->find("div[data-editable]", 0)->outertext
Since HTML isn't a regular language, you're better off using a DOM parser. It's much easier, too.
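For illustration, here is a minimal sketch of the DOM-parser approach using PHP's built-in DOMDocument and DOMXPath rather than Simple HTML DOM. The helper name `markEditable` and the wrapper id are hypothetical; the class name and button come from the desired output in the question.

```php
<?php
// Hypothetical helper: adds the "editable" class and an edit button to every
// div or p that carries a data-editable attribute, keeping other attributes.
function markEditable(string $html): string
{
    libxml_use_internal_errors(true);   // tolerate imperfect markup
    $doc = new DOMDocument();
    // Wrap the fragment so it can be serialized back without <html>/<body>.
    $doc->loadHTML('<div id="wrapper">' . $html . '</div>');
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $query = '//*[(self::div or self::p) and @data-editable]';
    foreach ($xpath->query($query) as $el) {
        $el->removeAttribute('data-editable');
        $el->setAttribute('class', trim($el->getAttribute('class') . ' editable'));
        // Insert the button as the first child of the matched element.
        $el->insertBefore($doc->createElement('button', 'edit'), $el->firstChild);
    }

    // Serialize only the children of the wrapper, i.e. the original fragment.
    $wrapper = $xpath->query('//div[@id="wrapper"]')->item(0);
    $out = '';
    foreach ($wrapper->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

$result = markEditable('<div class="teaser" id="teaser" data-editable><p>Content</p></div>');
echo $result;
```

Because the attributes live on the parsed element, nothing has to be captured and re-emitted by hand, which is the part the regex struggled with.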
Preface: I cannot rename the source tags or edit their IDs. Any changes to the tags must happen after they have been fetched.
What I'm doing: using file_get_contents in PHP, I am requesting data from a remote site. This data is just two <p> tags. I need to hide or rename the second of the two <p> tags.
Is this possible with PHP or jQuery?
What I'm working with:
<p>Hello my name is test</p><p>I like studying geology.</p>
If you need to hide the second paragraph, you can do this with jQuery:
$('p:eq(1)').hide();
You could try a PHP string replace:
$new_string = str_replace('</p><p>','',file_get_contents('somecontent'));
If you need to do it before the HTML is rendered, you need to parse the contents, remove or replace the second p tag, and build the new content.
Here is a DOM parser: Simple HTML DOM Parser
Find similar questions below
How to match second <a> tag in this string
How to add attribute to first P tag using PHP regular expression?
Or you can do it after rendering HTML as rNix suggested.
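The parse-and-remove route can be sketched with PHP's built-in DOMDocument as well. This is a minimal sketch under the assumption that the fetched string contains only the two paragraphs shown in the question; the helper name `dropSecondParagraph` is hypothetical.

```php
<?php
// Hypothetical helper: removes the second <p> from an HTML fragment.
function dropSecondParagraph(string $html): string
{
    libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    // Wrap the fragment so only its own markup is serialized afterwards.
    $doc->loadHTML('<div>' . $html . '</div>');
    libxml_clear_errors();

    $paragraphs = $doc->getElementsByTagName('p');
    if ($paragraphs->length > 1) {
        $second = $paragraphs->item(1);
        $second->parentNode->removeChild($second);
    }

    $wrapper = $doc->getElementsByTagName('div')->item(0);
    $out = '';
    foreach ($wrapper->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

$fetched = '<p>Hello my name is test</p><p>I like studying geology.</p>';
$cleaned = dropSecondParagraph($fetched);
echo $cleaned;
```

In the real script, `$fetched` would be the return value of file_get_contents for the remote URL.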
I am working on fetching content from a URL.
If I want to fetch ONLY the text content from this site (only text):
http://en.wikipedia.org/wiki/Asia
how is that possible? I can already fetch the URL title and URL using PHP.
I got the URL title using the code below:
$url = getenv('HTTP_REFERER');
// Fetch the whole page as one string
$file = file_get_contents($url);
// Capture the text between <title> and </title>, case-insensitively
if (preg_match("/<title>(.+)<\/title>/i", $file, $m)) {
    $get_title = $m[1];
    echo $get_title;
}
Could you please help me get the content?
Using file_get_contents I could only get the HTML code. Are there any other possibilities?
Thanks -
Haan
If you just want to get a textual version of an HTML page, then you will have to process it yourself. Fetch the HTML (as you seem to already know how to do) and then process it into plain text with PHP.
There are several approaches to doing this. The first is htmlspecialchars(), which will escape all the HTML special characters. I don't imagine this is what you actually want, but I thought I'd mention it for completeness.
The second approach is strip_tags(). This will remove all HTML tags from an HTML document. However, it doesn't validate the input it's working with; it just does a fairly simple text replace. This means you will end up with stuff in the textual representation that you might not want (such as the contents of the head section, or the innards of embedded JavaScript and stylesheets).
The other approach is to parse the downloaded HTML with DOMDocument. I've not written code for you (don't have time), but the general procedure would be as follows:
Load the HTML into a DOMDocument object
Get the document's body element and iterate over its children.
For each child, if the child in question is a text node, append it to an output string. If it isn't a text node, then iterate over its children as well to check whether any of them are text nodes (and if not, iterate over those child elements too, and so on). You might also want to check the type of the node further. For example, if you don't want JavaScript or CSS embedded in the output, you can check whether the tag is STYLE or SCRIPT and just ignore it if it is.
The above description is most easily implemented as a recursive function (one that calls itself).
The end result should be a string that contains only the textual content of the downloaded page, with no markup.
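The procedure described above could be sketched roughly as follows. This is a minimal sketch, not a complete text extractor: the function name `extractText` is hypothetical, and whitespace handling and character encoding are left out.

```php
<?php
// Hypothetical recursive helper: collects the text beneath a node,
// skipping the contents of <script> and <style> elements.
function extractText(DOMNode $node): string
{
    if ($node->nodeType === XML_TEXT_NODE) {
        return $node->nodeValue;
    }
    if ($node->nodeType !== XML_ELEMENT_NODE) {
        return '';   // comments, CDATA sections, and so on
    }
    if (in_array(strtolower($node->nodeName), ['script', 'style'], true)) {
        return '';
    }
    $text = '';
    foreach ($node->childNodes as $child) {
        $text .= extractText($child);   // recurse into child nodes
    }
    return $text;
}

$page = '<html><head><title>T</title><style>p{}</style></head>'
      . '<body><p>Hello <b>world</b></p><script>var x = 1;</script></body></html>';

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($page);
libxml_clear_errors();

$body = $doc->getElementsByTagName('body')->item(0);
$text = $body === null ? '' : extractText($body);
echo $text;
```

Starting at the body element is what keeps the title and other head content out of the result.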
EDIT: Forgot about strip_tags! I updated my answer to mention that as well. I left my DOMDocument approach in my answer though, because as the documentation for strip_tags states, it does no validation of the markup it's processing, whereas DOMDocument attempts to parse it (and can potentially be more robust if a DOMDocument-based text extraction is implemented well).
Use file_get_contents to get the HTML content and then strip_tags to remove the HTML tags, thus leaving only the text.
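As a small illustration of that approach and of its caveat, the sketch below uses a literal string in place of the file_get_contents result so it runs without network access. The variable names are only for illustration.

```php
<?php
// Stand-in for: $html = file_get_contents($url);
$html = '<script>var x = 1;</script><p>Hello <b>world</b></p>';

// strip_tags removes the tags themselves but keeps everything between them,
// so the JavaScript source ends up in the output too.
$plain = strip_tags($html);
echo $plain;
```

The script source surviving in `$plain` is the reason a DOM-based extraction can be preferable for full pages.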
Let's say the HTML contains 15 table tags, and before each table there is a div tag with some text inside. I need to get the text from the div tag that is directly before the 10th table tag in the HTML markup. How would I do that?
The only way I can think of is to use explode('<table', $html) to split the HTML into parts and then get the last div tag from the 9th value of the exploded array with regular expression. Is there a better way?
I'm reading through the PHP DOM documentation but I cannot see any method that would help me with this task there.
You load your HTML into a DOMDocument and query it with this XPath expression:
//table[10]/preceding-sibling::div[1]
This would work for the following layout:
<div>Some text.</div>
<table><!-- #1 --></table>
<!-- ...eight more... -->
<div>Some other text.</div> <!-- this would be selected -->
<table><!-- #10 --></table>
<!-- ...five more... -->
XPath is capable of doing really complex node lookups with ease. If the above expression does not yet work for you, probably very little is required to make it do what you want.
HTML is structured data represented as a string, which is substantially different from being a plain string. Don't give in to the temptation of doing stuff like this with string handling functions like explode(), or even regex.
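A minimal sketch of running such a query from PHP with DOMDocument and DOMXPath follows; the sample markup is generated only for illustration. Note that `//table[10]` selects a table that is the tenth table child of its parent, which fits the flat layout above. The sketch uses `(//table)[10]`, which picks the tenth table in the whole document regardless of nesting.

```php
<?php
// Build sample markup: 15 tables, each preceded by a div with some text.
$html = '';
for ($i = 1; $i <= 15; $i++) {
    $html .= "<div>Text $i</div><table><tr><td>Table $i</td></tr></table>";
}

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
// Nearest preceding sibling div of the tenth table in the document.
$nodes = $xpath->query('(//table)[10]/preceding-sibling::div[1]');
$divText = $nodes->length > 0 ? trim($nodes->item(0)->textContent) : null;
echo $divText;
```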
If you don't feel like learning XPath, you can use the same old-school DOM walking techniques you would use with JavaScript in the browser:
document.getElementsByTagName('table')[9]
Then crawl your way up the .previousSibling values until you find one that isn't a text node and is a div.
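The same walk can be done in PHP, since DOMDocument exposes getElementsByTagName and previousSibling too. A minimal sketch, with a hypothetical helper name and generated sample markup:

```php
<?php
// Hypothetical helper: finds the closest div that precedes the n-th table
// (1-based) among that table's siblings, or null if there is none.
function divBeforeNthTable(DOMDocument $doc, int $n): ?DOMNode
{
    $table = $doc->getElementsByTagName('table')->item($n - 1);
    if ($table === null) {
        return null;
    }
    // Walk backwards over siblings, skipping text nodes and other elements.
    for ($node = $table->previousSibling; $node !== null; $node = $node->previousSibling) {
        if ($node->nodeType === XML_ELEMENT_NODE && strtolower($node->nodeName) === 'div') {
            return $node;
        }
    }
    return null;
}

$html = '';
for ($i = 1; $i <= 15; $i++) {
    $html .= "<div>Text $i</div>\n<table><tr><td>Table $i</td></tr></table>\n";
}

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$div = divBeforeNthTable($doc, 10);
$found = $div === null ? null : trim($div->textContent);
echo $found;
```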
I've found that PHP's DOMDocument works pretty well with non-perfect HTML. Once you have the DOM, I think you can even pass it into a SimpleXML object and work with it XML-style, even though the original HTML/XHTML structure wasn't perfect.
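PHP does provide simplexml_import_dom() for that hand-off. A minimal sketch, using deliberately sloppy sample markup (unclosed paragraphs) chosen only for illustration:

```php
<?php
libxml_use_internal_errors(true);
$doc = new DOMDocument();
// The paragraphs are never closed; the HTML parser repairs this on load.
$doc->loadHTML('<p>First<p>Second');
libxml_clear_errors();

// Convert the repaired DOM tree into a SimpleXMLElement.
$xml = simplexml_import_dom($doc);
$paragraphs = $xml->xpath('//p');
$firstText = (string) $paragraphs[0];
echo $firstText;
```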