Get text with PHP Simple HTML DOM Parser - php

I'm using PHP Simple HTML DOM Parser to get text from a webpage.
The page I need to manipulate is something like:
<html>
<head>
<title>title</title>
</head>
<body>
<div id="content">
<h1>HELLO</h1>
Hello, world!
</div>
</body>
</html>
I need to get the h1 element and the text that is not wrapped in any tag.
To get the h1 I use this code:
$html = file_get_html("remote_page.html");
foreach($html->find('#content') as $text){
echo "H1: ".$text->find('h1', 0)->plaintext;
}
But what about the other text?
I also tried this inside the foreach:
$text->plaintext;
but it returns the full text, including the content of the h1 tag.

It looks like $text->find('text', 2); gets what you're looking for; however, I'm not sure how well that will work when the number of text nodes is unknown. I'll keep looking.
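When the number of text nodes is unknown, one option is to select only the text nodes that are direct children of the div. The following is a minimal sketch that swaps in PHP's built-in DOMDocument and DOMXPath instead of Simple HTML DOM; the inline $page string stands in for the remote page and is an assumption made for illustration.

```php
<?php
// Sketch using PHP's built-in DOM extension (not Simple HTML DOM).
// $page is a stand-in for the remote page from the question.
$page = '<html><head><title>title</title></head><body>'
      . '<div id="content"><h1>HELLO</h1>Hello, world!</div>'
      . '</body></html>';

libxml_use_internal_errors(true); // keep parser warnings out of the output
$doc = new DOMDocument();
$doc->loadHTML($page);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// Text of the h1 inside #content.
$heading = $xpath->evaluate('string(//div[@id="content"]/h1)');

// Only the text nodes that are direct children of #content,
// so the h1 text is not included.
$loose = '';
foreach ($xpath->query('//div[@id="content"]/text()') as $node) {
    $loose .= $node->nodeValue;
}
$loose = trim($loose);

echo "H1: $heading\n";
echo "Text: $loose\n";
```

Because the XPath expression selects every direct text child, it does not depend on knowing the index of a particular text node.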

You can simply strip the HTML tags using strip_tags:
<?php
echo strip_tags($input, '<br>');
?>

Use strip_tags, as @Peachy pointed out. However, passing <br> as the second argument tells strip_tags to keep <br> tags, which is unnecessary here. In your case,
<?php
echo strip_tags($text);
?>
would work, given that you are only selecting content in the content id. Note that strip_tags removes only the tags themselves, so the text inside the h1 is still part of the result.
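To make the effect of the second argument concrete, here is a small sketch. The $fragment string is a made-up sample, not the asker's page.

```php
<?php
// Hypothetical sample fragment for illustration.
$fragment = '<h1>HELLO</h1>Hello,<br>world!';

// Without a second argument, every tag is removed.
// The text that was inside the h1 remains.
$plain = strip_tags($fragment);

// With '<br>' as the allowed-tags argument, <br> is kept.
$withBr = strip_tags($fragment, '<br>');

echo $plain, "\n";   // HELLOHello,world!
echo $withBr, "\n";  // HELLOHello,<br>world!
```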

Try this:
echo "H1: ".$text->find('h1', 0)->innertext;

Related

Get div by id from string using PHP

I have a whole DOM document as a string inside a variable. I want to grab only the content of one div by its id. I can't use a PHP DOM parser; is it possible using a regex?
<html>
....
<body>
....
<div id="something">
// content
</div>
<div id="other_divs">
// content
</div>
</body>
</html>
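preg_match('%<div[^>]+id="something"[^>]*>(.*?)</div>%si', $string, $matches);
This works as long as there is no additional div nested in the content itself; the content is then in $matches[1].
You can also try this regexp, which matches every div that has an id. I hope it helps.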
preg_match_all('/<div\s*id="[^>]+">(.*?)<\/div>/s', $html, $match);
print_r($match[0]);
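A short sketch of the first pattern in use. The $string value is an invented sample document, and the approach still assumes there is no div nested inside the target div.

```php
<?php
// Invented sample document for illustration.
$string = '<html><body>'
        . '<div id="something">first</div>'
        . '<div id="other_divs">second</div>'
        . '</body></html>';

$content = null;
// The third argument receives the matches; index 1 is the captured group.
if (preg_match('%<div[^>]+id="something"[^>]*>(.*?)</div>%si', $string, $matches)) {
    $content = $matches[1];
}

echo $content, "\n"; // first
```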

How to retrieve code from database with PHP?

I want to store HTML code in a MySQL database. If I store the following code in the database
<html>
<body>
<p> this is a paragraph </p>
</body>
</html>
it is stored as is. But when I retrieve it and echo it with PHP, the tags vanish. I want to echo the code exactly as it appears above. I also want to store and show not only HTML but other code (C, Java, PHP) as well. Does anyone have any idea?
You can use the htmlentities() PHP function to echo HTML code:
$str = "
<html>
<body>
<p> this is a paragraph </p>
</body>
</html>
";
echo htmlentities($str);
You can also use htmlspecialchars().
You can use htmlentities($str) for that. Another nice thing to use is <pre></pre>.
Putting those tags around the code will preserve newlines, tabs and spaces, in case you want to showcase it.
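Combining the two suggestions, a minimal sketch could look like the following. The $stored variable is a placeholder for whatever string comes back from the database.

```php
<?php
// $stored is a placeholder for the value read from the database.
$stored = "<html>\n<body>\n<p> this is a paragraph </p>\n</body>\n</html>";

// Escape the markup so the browser shows it instead of rendering it.
$escaped = htmlspecialchars($stored, ENT_QUOTES, 'UTF-8');

// Wrapping the escaped text in <pre> preserves newlines and indentation.
echo '<pre>' . $escaped . '</pre>';
```

The same escaping applies to C, Java or PHP source, since only the characters that are special in HTML are converted.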

Replace <pre> tags with <code>

What I need to do is replace all pre tags with code tags.
Example
<pre lang="php">
echo "test";
</pre>
Becomes
<code>
echo "test";
</code>
<pre lang="html4strict">
<div id="test">Hello</div>
</pre>
Becomes
<code>
<div id="test">Hello</div>
</code>
And so on..
PHP's default DOM functions have a lot of problems because of the Greek text inside.
I think Simple HTML DOM Parser is what I need, but I can't figure out how to do what I want. Any ideas?
UPDATE
I'm moving to a new CMS, which is why I'm writing a script to convert all posts to the correct format before inserting them into the DB. I can't use pre tags in the new CMS.
Why not KISS (Keep It Simple, Stupid):
echo str_replace(
array('<pre>', '</pre>'),
array('<code>', '</code>'),
$your_html_with_pre_tags
);
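Note that a literal '<pre>' search string does not match an opening tag that carries attributes, such as <pre lang="php">. A possible variant, sketched here with preg_replace, matches the opening tag regardless of its attributes; $post is an invented sample.

```php
<?php
// Invented sample post for illustration.
$post = "<pre lang=\"php\">\necho \"test\";\n</pre>";

// First pattern: an opening <pre> tag with any attributes.
// Second pattern: the closing </pre> tag.
$converted = preg_replace(
    ['/<pre\b[^>]*>/i', '/<\/pre>/i'],
    ['<code>', '</code>'],
    $post
);

echo $converted, "\n";
```

This drops the lang attribute entirely, which matches the expected output in the question.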
Look at the manual. Changing <pre> tags to <code> should be as simple as:
$str = '<pre lang="php">
echo "test";
</pre>
<pre lang="html4strict">
<div id="test">Hello</div>
</pre>';
require_once("simplehtmldom/simple_html_dom.php");
$html = str_get_html($str);
foreach($html->find("pre") as $pre) {
$pre->tag = "code";
$pre->lang = null; // remove lang attribute?
}
echo $html->outertext;
// <code>
//     echo "test";
// </code>
// <code>
//     <div id="test">Hello</div>
// </code>
PS: you should encode the ", < and > characters in your input.
Just replacing pre tags with code tags changes the meaning and the rendering essentially and makes the markup invalid, if there are any block-level elements like div inside the element. So you need to revise your goal. Check out whether you can actually keep using pre. If not, use <div class=pre> instead, together with a style sheet that makes it behave like pre in rendering. When you just replace pre tags with div tags, you won’t create syntax errors (the content model of div allows anything that pre allows, and more).
Regarding the lang attribute, lang="php" is incorrect (by HTML specs, the lang attribute specifies the human language of the content, using standard language codes), but the idea of coding information about the computer language is good. It may help styling and scripting later. HTML5 drafts mention that such information can be coded using a class name that starts with language-, e.g. class="language-php" (or, when combined with another class name, class="language-php pre").

PHP Simple HTML DOM Parser: How to remove <font> tags from script output?

I'm using PHP Simple HTML DOM Parser to extract a list of URLs from a page as follows:
<?php
include('simple_html_dom.php');
$url = 'http://www.domain.com/';
$html = file_get_html($url);
foreach($html->find('table[width=370]') as $table)
{
foreach($table->find('a') as $item)
echo $item->outertext . '<br><hr>';
}
$html->clear();
?>
It works just fine insofar as it extracts the required information, however, some of the a tags (on domain.com) are formatted like this:
<a href="..."><font size="2">Anchor text</font></a>
Whereas, in others, the font size is defined in the p tag that contains each a tag, meaning the a tag is displayed as:
<a href="...">Anchor text</a>
Is there any way to strip out the font tag from those a tags that have it? It's probably very simple, but I've been 'running around in rings' for ages trying to do it :(
Thanks for any ideas or suggestions you might have.
Tom.
strip_tags() maybe?
If you only want to allow the a tag, just use:
echo strip_tags($item->outertext, '<a>');
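As a quick check of what this does to one of the affected links, here is a small sketch. The $outer string and its href are made up for illustration.

```php
<?php
// Made-up link markup; the href value is hypothetical.
$outer = '<a href="page.html"><font size="2">Anchor text</font></a>';

// Keep <a> tags (with their attributes) and strip everything else.
$clean = strip_tags($outer, '<a>');

echo $clean, "\n"; // <a href="page.html">Anchor text</a>
```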

PHP-code blocks in templates loaded in DOMDocument

I need to parse HTML-template with DOMDocument. But HTML code may contain PHP-code blocks, for example:
<div id="test" data="<?php echo $somevar?>"> </div>
When I load this HTML I get the error "Unescaped '<' not allowed in attributes values...". The parser thinks that the attribute "data" has no closing quote and that <?php is a new tag. How can I tell it to ignore the <?php tag, or something like that?
Your HTML code:
<div id="test" data="<?php echo $somevar?>"> </div>
Is not XML code. It is invalid as XML, but fine as HTML. To load HTML code with DOMDocument, you can use the DOMDocument::loadHTML function.
It will load your template without any error.
Example / Demo:
$html = '<div id="test" data="<?php echo $somevar?>"> </div>';
$doc = new DOMDocument();
$doc->loadHTML($html);
Related: Can PHP include work for only a specified portion of a file?
If you try to parse a document with PHP tags in it, you should remove those, or capture the output of the file first, and then parse it.
You can capture the output of the file with ob_start() and ob_get_clean().
You can remove the PHP tags with regex:
$cleaned = preg_replace("/<\?php.*?\?>/is", "", $input); // the s modifier lets . match newlines in multi-line PHP blocks
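A minimal sketch of the removal approach, followed by loading the result. The template string is the one from the question, held in a single-quoted string so that nothing is interpolated.

```php
<?php
// Template from the question, kept as a plain string.
$input = '<div id="test" data="<?php echo $somevar?>"> </div>';

// Remove PHP blocks; "s" lets "." span newlines, "i" ignores case.
$cleaned = preg_replace('/<\?php.*?\?>/is', '', $input);

$doc = new DOMDocument();
$doc->loadHTML($cleaned);

// The attribute is still present, but now empty.
$div = $doc->getElementsByTagName('div')->item(0);
$data = $div->getAttribute('data');

echo $cleaned, "\n";
```

The trade-off is that the PHP code is lost, so this only fits cases where the parsed structure matters and the dynamic values do not.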
This feels hacky, but...
$doc->loadHTML(str_replace('<?php', '&lt;?php', file_get_contents($file)));
Try:
<div id="test" data="<?= htmlentities($somevar) ?>"> </div>
You can also try htmlspecialchars(), which is a "lighter" version of htmlentities().
