I am using PHP's DOM object to create HTML pages for my website. This works great for my head; however, since I will be entering a lot of HTML into the body (not via DOM), I thought I would need to use DOM->createElement($bodyHTML) to add the HTML from my site to the DOM object.
However, DOM->createElement seems to escape all the HTML, so my end result displayed the HTML source on the page and not the actual rendered HTML.
I am currently using a hack to get this to work,
$body = $this->DOM
->createComment('DOM Glitch--><body>'.$bodyHTML."</body><!--Woot");
Which puts all my site code in a comment, which I bypass by closing the comment early and manually adding the <body> tags.
Currently this method works, but I believe there should be a more proper way of doing this. Ideally something like DOM->createElement() that will not parse any of the string.
I also tried using DOM->createDocumentFragment(). However, it does not like some of the string, so it errors and does not work (and it takes up extra CPU power to re-parse the body's HTML).
So, my question is, is there a better way of doing this other than using DOM->createComment()?
You use the DOMDocumentFragment object to insert arbitrary HTML chunks into another document.
$dom = new DOMDocument();
@$dom->loadHTML($some_html_document); // @ to suppress a bajillion parse errors
$frag = $dom->createDocumentFragment(); // create fragment
$frag->appendXML($some_other_html_snippet); // insert arbitrary html into the fragment
$node = // some operations to find whatever node you want to insert the fragment into
$node->appendChild($frag); // stuff the fragment into the original tree
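If appendXML() chokes because the snippet is HTML rather than well-formed XML (unclosed tags, entities such as &copy;), one possible workaround is to let loadHTML() parse the snippet in a scratch document and import the resulting nodes. This is only a sketch: appendHtml() is a made-up helper name, the wrapper <div> and the encoding hint are assumptions of this example, and $parent is assumed to be an element inside a DOMDocument, not the document itself.

```php
<?php
// Hypothetical helper (not part of the DOM extension): parse an HTML snippet
// in a scratch document and import the resulting nodes under $parent.
function appendHtml(DOMNode $parent, string $html): void
{
    $scratch = new DOMDocument();

    // Collect libxml warnings for sloppy markup instead of printing them,
    // then restore whatever setting was active before.
    $previous = libxml_use_internal_errors(true);
    // The <div> wrapper gives us a single known container to read back from;
    // the leading encoding hint is an assumption that the input is UTF-8.
    $scratch->loadHTML('<?xml encoding="UTF-8"><div>' . $html . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // The first <div> in document order is our wrapper.
    $wrapper = $scratch->getElementsByTagName('div')->item(0);

    // Copy the child list first so the loop iterates over a fixed array.
    foreach (iterator_to_array($wrapper->childNodes) as $child) {
        // importNode() makes a deep copy owned by the target document.
        $parent->appendChild($parent->ownerDocument->importNode($child, true));
    }
}
```

Usage would look like appendHtml($body, $bodyHTML), where $body is the <body> element of the page being built. The snippet still gets parsed once, so this does not avoid the parsing cost mentioned in the question.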
I FOUND A SOLUTION. It's not a pure PHP solution, but it works very well. A little hack for everybody who lost countless hours, like me, trying to fix this:
$dom = new DomDocument;
// main object
$object = $dom->createElement('div');
// html attribute
$attr = $dom->createAttribute('value');
// ugly html string
$attr->value = "<div> this is a really html string ©</div><i></i> with all the © that XML hates!";
$object->appendChild($attr);
// jquery fix (or javascript as well)
$('div[value]').each(function () {
    $(this).html($(this).attr('value')); // and it works!
    $(this).removeAttr('value'); // to clean-up
});
loadHTML works just fine.
<?php
$dom = new DOMDocument();
$dom->loadHTML("<font color='red'>Hey there mrlanrat!</font>");
echo $dom->saveHTML();
?>
which outputs Hey there mrlanrat! in red.
or
<?php
$dom = new DOMDocument();
$bodyHTML = "here is the body, a nice body I might add";
$dom->loadHTML("<body> " . $bodyHTML . " </body>");
// this would even work as well.
// $bodyHTML = "<body>here is the body, a nice body I might add</body>";
// $dom->loadHTML($bodyHTML);
echo $dom->saveHTML();
?>
Which outputs:
here is the body, a nice body I might add and inside of your HTML source code, it's wrapped inside body tags.
I spent a lot of time working on Anthony Forloney's answer, but I cannot seem to get the HTML to append to the body without it erroring.
@Mark B: I have tried doing that, but as I said in the comments, it errored on my HTML.
I forgot to add the below, my solution:
I decided to make my HTML object much simpler by not using DOM at all and just using strings.
Related
I am trying to get the plain text from some given HTML, but I have not managed it.
What I have done so far:
My HTML is in the $content variable.
Now I am passing the $content variable to PHP's DOMDocument:
$d = new DOMDocument();
@$d->loadHTML($content);
What's my next step to get the plain text from the loaded HTML?
Please help me in this. Thanks in advance!
I don't fully understand your question, but if you want the text of the HTML as a string, then try this:
$d = new DOMDocument();
$d->loadHTML($content);
$plainText = $d->textContent;
echo $plainText;
The DOM itself does not have such functionality. You may use the strip_tags() function though. Like this:
$d = new DOMDocument();
$d->loadHTML($content);
$plainText = strip_tags($d->textContent);
echo $plainText;
// which is probably equivalent to:
$plainText = strip_tags($content);
Note: using DOMDocument is useful to check that $content is correct or if you want to get a specific tag ($main = $d->getElementsByTagName('main'); $plainText = strip_tags($main->item(0)->textContent)); otherwise directly using strip_tags() is enough.
There are some problems as the strip_tags() function has no clue about the type of tag being removed. This means a sequence such as:
... word</p><p>more ...
will concatenate those two words:
... wordmore ...
This is a difficult problem since some tags are expected to be removed that way and others not. For example, if the user had some form of emphasis, removing the tag without adding a space is the right thing to do:
che<u>val</u> -> cheval
che<u>vaux</u> -> chevaux
(Singular and plural of "horse" in French)
A browser has no clue either; the CSS is what tells it whether a tag is a block (<div>) or inline (<u>).
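One way to work around the concatenation problem is to decide up front which tags count as block-level and add a space after each of them before reading the text. The sketch below assumes a hand-picked tag list and a made-up function name, htmlToText(); it is not something the DOM extension provides, and the list would need adjusting for real content. Note that it also returns the contents of <title>, <script> and <style> if they are present.

```php
<?php
// Hypothetical helper: extract text from HTML, separating block-level
// elements with a space while leaving inline elements untouched.
function htmlToText(string $html): string
{
    // Assumption: these are the tags we want to treat as "blocks".
    $blockTags = ['p', 'div', 'br', 'li', 'tr', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6'];

    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    foreach ($blockTags as $tag) {
        // XPath results are a snapshot, so inserting nodes while looping is safe.
        foreach ($xpath->query('//' . $tag) as $element) {
            // insertBefore() with a null reference node appends at the end.
            $element->parentNode->insertBefore(
                $dom->createTextNode(' '),
                $element->nextSibling
            );
        }
    }

    // Collapse runs of whitespace left over from markup and inserted spaces.
    return trim(preg_replace('/\s+/', ' ', $dom->documentElement->textContent));
}
```

With this approach the two paragraphs from the example above stay separated, while the underlined part of a word is still joined to the rest of it.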
I am trying to split something in PHP, and I can't get it to work.. Been trying for a while now, so thought I would ask here.
So let's say that I have multiple <script> ... </script> blocks in my source code; what can I do to split these into strings? I'm trying with explode, but it's not working out as planned.
This is what I've tried so far:
$script = explode('<script>',$data,1);
echo htmlspecialchars($script[1]);
Tried that but it doesn't get any specific <script>.
Example script:
<script>
script here...
</script>
<script>
second script here...
</script>
So how will I go about getting the second script?
Sorry, I'm not the best at regex or parsing in PHP yet, and merry christmas to all of you! :)
Do not parse HTML with string functions. Or regex, for that matter. The <center> cannot hold regexes and HTML. But that's a different story. Instead, use an HTML parser, like Simple HTML DOM (which, for some reason, is blocked by my high school's stupid firewall). Please correct me if I'm wrong, since I can't access the docs for it.
include("simple_html_dom.php");
$html=str_get_html($text);
$scripts=$html->find("script");
foreach($scripts as $script){
echo(htmlspecialchars($script));
}
Use loadHTML():
$doc = new DOMDocument();
// load the HTML string we want to strip
$doc->loadHTML($html);
// get all the script tags
$script_tags = $doc->getElementsByTagName('script');
Instead of string functions, I'd use a DOM Parser such as PHP's DOMDocument to extract the required data. Here's how you can do it:
$text = <<<TEXT
<script>
script here...
</script>
<script>
second script here...
</script>
TEXT;
$dom = new DOMDocument;
$dom->loadHTML($text);
echo $dom->getElementsByTagName('script')->item(1)->nodeValue;
Some explanation:
The text is loaded using loadHTML() method and then you use getElementsByTagName() method to get all the script tags. Now we use item(1) to specifically target the second <script> tag and then echo the nodeValue of that node.
Output:
second script here...
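If you want every script rather than just the second one, the same approach can be wrapped in a small function that returns all of them as an array. This is a sketch; scriptBodies() is a name made up for this example, and trimming the surrounding whitespace is a choice of mine rather than something the question asked for.

```php
<?php
// Hypothetical helper: return the contents of every <script> element.
function scriptBodies(string $html): array
{
    $dom = new DOMDocument();

    // Keep libxml warnings about imperfect markup out of the output.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $bodies = [];
    foreach ($dom->getElementsByTagName('script') as $script) {
        // textContent is the raw script source between the tags.
        $bodies[] = trim($script->textContent);
    }
    return $bodies;
}
```

The array is zero-based like item(), so the second script is at index 1.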
I am Expanding/Condensing a Blog post using substrings, where the second substring is within a div tag that activates when a button is pressed (hence concatenating both substrings)
The code looks like as below:
<?php echo substr($f2, 0, 50);?>
<div id="<?php echo $f4; ?>" class = "hidden">
<?php echo substr($f2, 0, 5000);?></div>
My problem, however, is that if the blog post contains HTML tags (e.g. </li>, </p>) and the initial substring ends before the closing tag of that set, then obviously it causes major formatting problems.
Is there a way around this using my current method, or am I going to need to use something like an XML stylesheet (in which case please guide me through it)
EDIT:
I have semi-completed my request using DOMDocument.
$second = substr($f2, 50, 5000);
$dom= new DOMDocument();
$dom->loadHTML($second);
$xpath = new DOMXPath($dom);
$body = $xpath->query('/html/body');
$secondoutput = ($dom->saveXml($body->item(0)));
$first = substr($f2, 0, 50);
$dom= new DOMDocument();
$dom->loadHTML($first);
$xpath = new DOMXPath($dom);
$body = $xpath->query('/html/body');
$firstoutput = ($dom->saveXml($body->item(0)));
This works, except that when the second substring is called it no longer has the previous formatting, as it has been purified.
Is there any way to reattach the previous HTML tag when the second substring is called?
There are different solutions to this problem, but substr is not particularly suitable (as you mentioned).
You could use regular expressions, or an HTML parser.
Go ahead and copy solutions from this question.
You may want to use Tidy to fix the truncated HTML.
You may want to parse whole HTML code with DOMDocument or SimpleHTMLDOM and then remove last elements until the post is short enough.
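The last suggestion could look roughly like the sketch below: parse the post, walk the tree while counting text characters, cut the text node where the limit is reached, and drop every node after it. truncateHtml() and trimNode() are made-up names, the wrapper <div> and the UTF-8 hint are assumptions of this example, and whitespace between tags counts toward the limit.

```php
<?php
// Hypothetical helper: keep roughly the first $limit characters of text
// while leaving the surrounding tags properly closed.
function truncateHtml(string $html, int $limit): string
{
    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    // Wrap the post in a <div> so there is one known container to read back.
    $dom->loadHTML('<?xml encoding="UTF-8"><div>' . $html . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $wrapper = $dom->getElementsByTagName('div')->item(0);
    $remaining = $limit;
    trimNode($wrapper, $remaining);

    // Serialize only the wrapper's children, not the html/body scaffolding.
    $out = '';
    foreach ($wrapper->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }
    return $out;
}

// Walk the children in document order, spending the character budget.
function trimNode(DOMNode $node, int &$remaining): void
{
    foreach (iterator_to_array($node->childNodes) as $child) {
        if ($remaining <= 0) {
            // Budget used up: everything from here on is removed.
            $node->removeChild($child);
        } elseif ($child instanceof DOMText) {
            $length = $child->length;
            if ($length > $remaining) {
                // Cut this text node at the point where the budget runs out.
                $child->deleteData($remaining, $length - $remaining);
                $length = $remaining;
            }
            $remaining -= $length;
        } else {
            trimNode($child, $remaining);
        }
    }
}
```

The short version would be truncateHtml($f2, 50) and the full post is simply $f2, so the expanded view keeps all of its original formatting.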
What I've been trying to do recently is to extract listing information from a given HTML file.
For example, I have an HTML page that has a list of many companies, with their phone number, address, etc.
Each company is in its own table, and every table starts like this: <table border="0">
I tried to use PHP to get all of the information, and use it later, like put it in a txt file, or just import into a database.
I assume that the way to achieve my goal is by using regex, which is one of the things that I really have problems with in php,
I would appreciate if you guys could help me here.
(I only need to know what to look for, or at least something that could help me a little, not complete code or anything like that)
Thanks in advance!!
I recommend taking a look at the PHP DOMDocument and parsing the file using an actual HTML parser, not regex.
There are some very straight-forward ways of getting tables, such as the GetElementsByTagName method.
<?php
$htmlCode = ''; /* html code here */
// create a new HTML parser
// http://php.net/manual/en/class.domdocument.php
$dom = new DOMDocument();
// Load the HTML in to the parser
// http://www.php.net/manual/en/domdocument.loadhtml.php
$dom->LoadHTML($htmlCode);
// Locate all the tables within the document
// http://www.php.net/manual/en/domdocument.getelementsbytagname.php
$tables = $dom->GetElementsByTagName('table');
// iterate over all the tables
$t = 0;
while ($table = $tables->item($t++))
{
// you can now work with $table and find children within, check for
// specific classes applied--look for anything that would flag this
// as the type of table you'd like to parse and work with--then begin
// grabbing information from within it and treating it as a DOMElement
// http://www.php.net/manual/en/class.domelement.php
}
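To make the loop above a bit more concrete, here is a rough sketch that turns every <table border="0"> into an array of rows, each row being an array of cell texts. extractTables() is a name invented for this example, and the assumption that each company table consists of plain <tr>/<td> rows comes from the question, not from the actual page. Nested tables would have their rows counted for the outer table too.

```php
<?php
// Hypothetical helper: collect the cell text of every <table border="0">.
function extractTables(string $html): array
{
    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $tables = [];

    // Only tables that start the way the question describes.
    foreach ($xpath->query('//table[@border="0"]') as $table) {
        $rows = [];
        // The second argument makes the query relative to this table.
        foreach ($xpath->query('.//tr', $table) as $tr) {
            $cells = [];
            foreach ($xpath->query('td | th', $tr) as $cell) {
                $cells[] = trim($cell->textContent);
            }
            $rows[] = $cells;
        }
        $tables[] = $rows;
    }
    return $tables;
}
```

From there each table is an ordinary PHP array, so writing it to a text file or inserting it into a database is a separate, simpler step.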
If you're familiar with jQuery (and even if you're not, as its commands are simple enough) I recommend this PHP counterpart: http://code.google.com/p/phpquery/
If your HTML is valid XML, as in XHTML, then you could parse it using SimpleXML
I am trying to create a simple alert app for some friends.
Basically I want to be able to extract the data "price" and "stock availability" from a webpage like the following two:
http://www.sparkfun.com/commerce/product_info.php?products_id=5
http://www.sparkfun.com/commerce/product_info.php?products_id=9279
I have made the alert via e-mail and SMS part, but now I want to be able to get the quantity and price out of the webpages (those 2 or any other ones) so that I can compare the price and quantity available and alert us to make an order if a product is between some thresholds.
I have tried some regex (found in some tutorials, but I am way too n00b for this) but haven't managed to get this working. Any good tips or examples?
$content = file_get_contents('http://www.sparkfun.com/commerce/product_info.php?products_id=9279');
preg_match('#<tr><th>(.*)</th> <td><b>price</b></td></tr>#', $content, $match);
$price = $match[1];
preg_match('#<input type="hidden" name="quantity_on_hand" value="(.*?)">#', $content, $match);
$in_stock = $match[1];
echo "Price: $price - Availability: $in_stock\n";
It's called screen scraping, in case you need to google for it.
I would suggest that you use a dom parser and xpath expressions instead. Feed the HTML through HtmlTidy first, to ensure that it's valid markup.
For example:
$html = file_get_contents("http://www.example.com");
$html = tidy_repair_string($html);
$doc = new DomDocument();
$doc->loadHtml($html);
$xpath = new DomXPath($doc);
// Now query the document:
foreach ($xpath->query('//table[@class="pricing"]//th') as $node) {
    echo $node->nodeValue, "\n";
}
Whatever you do: don't use regular expressions to parse HTML or bad things will happen. Use a parser instead.
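As an illustration of the parser route, the hidden quantity_on_hand input that the regex snippet earlier in this thread looks for can be read with a single XPath expression. This is a sketch under assumptions: quantityOnHand() is a made-up name, the input name is taken from that regex, and the real page may be structured differently or may have changed.

```php
<?php
// Hypothetical helper: read the value of the hidden quantity_on_hand input.
// Returns null when the page has no such input.
function quantityOnHand(string $html): ?string
{
    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    // Select the value attribute of the input, wherever it sits in the page.
    $nodes = $xpath->query('//input[@name="quantity_on_hand"]/@value');

    return $nodes->length > 0 ? $nodes->item(0)->nodeValue : null;
}
```

Unlike the regex, this does not care about attribute order, quoting style, or extra whitespace inside the tag.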
First, this question goes too far into details. Second, extracting data from a website might not be legitimate. However, I have some hints:
Use Firebug or Chrome/Safari Inspector to explore the HTML content and the pattern of the interesting information
Test your RegEx to see if it matches. You may need to do it many times (multi-pass parsing/extraction)
Write a client via cURL or, even simpler, use file_get_contents (NOTE that some hosts disable loading URLs with file_get_contents)
For me, I'd rather use Tidy to convert to valid XHTML and then use XPath to extract data, instead of RegEx. Why? Because XHTML is not regular and XPath is very flexible. You can also learn XSLT to transform it.
Good luck!
You are probably best off loading the HTML code into a DOM parser like this one and searching for the "pricing" table. However, any kind of scraping you do can break whenever they change their page layout, and is probably illegal without their consent.
The best way, though, would be to talk to the people who run the site, and see whether they have alternative, more reliable forms of data delivery (Web services, RSS, or database exports come to mind).
The simplest method to extract data from a website: I found that all my data is contained within <h3> tags only, so I prepared this:
<?php
include('simple_html_dom.php');
// Create DOM from URL, paste your destined web url in $page
$page = 'http://facebook4free.com/category/facebookstatus/amazing-facebook-status/';
$html = new simple_html_dom();
//Within $html your webpage will be loaded for further operation
$html->load_file($page);
// Find all links
$links = array();
//Within find() function, I have written h3 so it will simply fetch the content from <h3> tag only. Change as per your requirement.
foreach($html->find('h3') as $element)
{
$links[] = $element;
}
reset($links);
//$out will be having each of HTML element content you searching for, within that web page
foreach ($links as $out)
{
echo $out;
}
?>