trim/delete Everything after DIV with ID - php

In the testing environment, $html is 20 to 30 or more lines of HTML produced by a cURL (scrape) request to another page/site, but for simplicity I have reduced it to the simple example below.
I need to echo the DIV with ID "keepthis" and all its content with the HTML structure intact, but delete everything before it and after it. The DIV with ID "deletethis" will always have that ID. I have looked at multiple posts involving substr / explode / trim, but I cannot find, or get to work, a method that deletes everything TO THE RIGHT in $html starting from position 0 of <div id="deletethis">.
That div is not located at a fixed number of characters into the code. I am able to get the delete-everything-before-DIV(keepthis) part to work, just not the other side. Any help would be appreciated.
$html = '<h1>hello world</h1><div id="keepthis"> Sample content</div><div id="deletethis">a bunch of other dynamic html here</div>';
$x = substr($html, strpos($html, '<div id="keepthis">')); //cleans up the BEFORE code
echo $x;

So, based on the link, try this:
$html = '<h1>hello world</h1><div id="keepthis"> Sample content</div><div id="deletethis">a bunch of other dynamic html here</div>';
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXpath($dom);
$result = $xpath->query('//div[@id="keepthis"]');
if ($result->length > 0) {
var_dump($result->item(0)->nodeValue);
}
Warning: nodeValue will not output the tags, but you can iterate through the child nodes of $result->item(0) to get them.
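To get the div back with its tags intact, one option is to pass the matched node to DOMDocument::saveHTML(), which serializes just that node. A minimal sketch, reusing the sample $html from the question (the variable name $kept is only illustrative):

```php
<?php
$html = '<h1>hello world</h1><div id="keepthis"> Sample content</div><div id="deletethis">a bunch of other dynamic html here</div>';

// Scraped markup is often invalid; collect parser warnings instead of printing them
libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$result = $xpath->query('//div[@id="keepthis"]');

$kept = '';
if ($result->length > 0) {
    // saveHTML($node) serializes only that node, tags included;
    // trim() drops any trailing newline the serializer may add
    $kept = trim($dom->saveHTML($result->item(0)));
}
echo $kept;
```

Everything before and after the div is dropped in one step, so no separate "delete the right side" pass is needed.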

string rtrim ( string $str [, string $character_mask ] )
This function returns a string with whitespace stripped from the end of str.
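Without the second parameter, rtrim() will strip these characters: " " (space), "\t" (tab), "\n" (line feed), "\r" (carriage return), "\0" (NUL byte) and "\x0B" (vertical tab).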
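Note that rtrim() only strips a set of characters from the end of the string, so on its own it cannot cut at a marker such as the "deletethis" div. A sketch of the string-based approach the question was after, assuming the marker text appears exactly as in the sample (the $startMarker and $endMarker names are illustrative):

```php
<?php
$html = '<h1>hello world</h1><div id="keepthis"> Sample content</div><div id="deletethis">a bunch of other dynamic html here</div>';

$startMarker = '<div id="keepthis">';
$endMarker   = '<div id="deletethis"';

// Drop everything before the div we want to keep
$start = strpos($html, $startMarker);
$x = ($start === false) ? $html : substr($html, $start);

// Drop everything from the start of the unwanted div onwards
$end = strpos($x, $endMarker);
if ($end !== false) {
    $x = substr($x, 0, $end);
}

echo $x;
```

This breaks as soon as the scraped page changes attribute order, quoting or spacing in those tags, which is why the DOM-based answer is usually the safer choice.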

Related

PHP Simple Html Dom get the plain text of div,but avoiding all other tags

I use PHP Simple HTML DOM to get some HTML. I now have a DOM like the following code, and I need to fetch the plain text directly inside the div while skipping the p tags and their content (so it should only return 111111). Can anyone help? Thanks in advance!
<div>
<p>00000000</p>
111111
<p>22222222</p>
</div>
It depends on what you mean by "avoiding the p tags".
If you just want to remove the tags, then just running strip_tags() on it should work for what you want.
If you actually want to just return "111111" (i.e. strip the tags and their contents), then this isn't a viable solution. For that, something like this may work:
$myDiv = $html->find('div', 0); // index 0 returns a single element rather than an array; adjust the selector to match your div
$children = $myDiv->children; // get an array of children
foreach ($children as $child) {
$child->outertext = ''; // This removes the element, but MAY NOT remove it from the original $myDiv
}
echo $myDiv->innertext;
If your text is always at the same position, try this:
$html->find('text', 2)->plaintext; // should return 111111
Here is my solution
I want to get the Primary Text part only.
$title_obj = $article->find(".ofr-descptxt",0); //Store the Original Tree ie) h3 tag
$title_obj->children(0)->outertext = ""; //Unset <br/>
$title_obj->children(1)->outertext = ""; //Unset the last Span
echo $title_obj; //It has only first element
Edited:
If you get PHP errors, try wrapping the calls in an if/else, or try my lazy code:
($title_obj->children(0))?$title_obj->children(0)->outertext="":"";
($title_obj->children(1))?$title_obj->children(1)->outertext = "":"";
Official Documentation
$wordlist = array("<p>", "</p>");
foreach ($wordlist as $word) {
$string = str_replace($word, "", $string);
}
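For comparison, the same result can be had without Simple HTML DOM. This is a sketch using PHP's built-in DOMDocument (a swapped-in technique, not what the answers above use) that keeps only the text nodes sitting directly inside the div, so the p elements and their contents are skipped:

```php
<?php
$html = "<div>\n<p>00000000</p>\n111111\n<p>22222222</p>\n</div>";

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$div = $dom->getElementsByTagName('div')->item(0);

$text = '';
foreach ($div->childNodes as $child) {
    // Only direct text children count; element children such as <p> are ignored
    if ($child->nodeType === XML_TEXT_NODE) {
        $text .= $child->nodeValue;
    }
}

$plain = trim($text);
echo $plain;
```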

Remove HTML Entity if Incomplete

I have an issue where I have displayed up to 400 characters of a string that is pulled from the database, however, this string is required to contain HTML Entities.
By chance, the client has created the string to have the 400th character to sit right in the middle of a closing P tag, thus killing the tag, resulting in other errors for code after it.
I would prefer this closing P tag to be removed entirely as I have a "...read more" link attached to the end which would look cleaner if attached to the existing paragraph.
What would be the best approach for this to cover all HTML Entity issues? Is there a PHP function that will automatically close off/remove any erroneous HTML tags? I don't need a coded answer, just a direction will help greatly.
Thanks.
Here's a simple way you can do it with DOMDocument. It's not perfect, but it may be of interest:
<?php
function html_tidy($src){
libxml_use_internal_errors(true);
$x = new DOMDocument;
$x->loadHTML('<meta http-equiv="Content-Type" content="text/html;charset=utf-8" />'.$src);
$x->formatOutput = true;
$ret = preg_replace('~<(?:!DOCTYPE|/?(?:html|body|head))[^>]*>\s*~i', '', $x->saveHTML());
return trim(str_replace('<meta http-equiv="Content-Type" content="text/html;charset=utf-8">','',$ret));
}
$brokenHTML[] = "<p><span>This is some broken html</spa";
$brokenHTML[] = "<poken html</spa";
$brokenHTML[] = "<p><span>This is some broken html</spa</p>";
/*
<p><span>This is some broken html</span></p>
<poken html></poken>
<p><span>This is some broken html</span></p>
*/
foreach($brokenHTML as $test){
echo html_tidy($test);
}
?>
Though take note of Mike 'Pomax' Kamermans's comment.
Why not find the last word in the extracted content and remove it? Whether that word is complete or not, you remove it, and you can be sure the content stays clean. Here is an example of what the code would look like:
while ($row = $req->fetch(PDO::FETCH_OBJ)) {
//extract 400 first characters from the content you need to show
$extraction = substr($row->text, 0, 400);
// find the last space in this extraction
$last_space = strrpos($extraction, ' ');
//take content from the first character to the last space and add (...)
echo substr($extraction, 0, $last_space) . ' ...';
}
Just remove the last broken tag and then run strip_tags:
$str = "<p>this is how we do</p";
$str = substr($str, 0, strrpos($str, "<"));
$str = strip_tags($str);
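If the remaining tags should be kept rather than stripped, the same idea can be applied only when the cut actually lands inside a tag. A rough sketch, where the function name truncate_without_broken_tag is made up for illustration and the check is simply "is the last < after the last >":

```php
<?php
// Hypothetical helper: cut $html to $limit bytes, then drop a trailing
// tag fragment if the cut landed in the middle of a tag.
function truncate_without_broken_tag(string $html, int $limit): string
{
    $cut = substr($html, 0, $limit);

    $lastOpen  = strrpos($cut, '<');
    $lastClose = strrpos($cut, '>');

    // A "<" with no ">" after it means the final tag is incomplete
    if ($lastOpen !== false && ($lastClose === false || $lastClose < $lastOpen)) {
        $cut = substr($cut, 0, $lastOpen);
    }

    return $cut;
}

$short = truncate_without_broken_tag('<p>this is how we do</p>', 23);
echo $short . ' ...read more';
```

This only removes the broken fragment. It does not close tags that are still open, so the DOMDocument approach above is still needed if the result must be well-formed.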

PHP text to array and with key

I don't know RegExp well, and I have not succeeded in splitting a string into an array.
I have string like:
<h5>some text in header</h5>
some other content, that belongs to header <p> or <a> or <img> inside.. not important...
<h5>Second text header</h5>
So What I am trying to do is to split text string into array where KEY would be text from header and CONTENT would be all the rest content till the next header like:
array("some text in header" => "some other content, that belongs to header...", ...)
I would suggest looking at the PHP DOM http://php.net/manual/en/book.dom.php. You can read / create DOM from a document.
I've used this one and enjoyed it:
http://simplehtmldom.sourceforge.net/
You could do it with a regex as well, something like this:
/<h5>(.*)<\/h5>(.*)<h5>/s
But this only finds the first occurrence; you'll have to cut the string to get the next one.
Any way you cut it, I don't see a one-liner for you. Sorry.
Here's a crude version using explode:
$res = array();
$chunks = explode("<h5>", $html);
foreach ($chunks as $chunk) {
    // skip anything before the first header (it has no closing </h5>)
    if (strpos($chunk, "</h5>") === false) continue;
    list($key, $val) = explode("</h5>", $chunk, 2);
    $res[$key] = $val;
}
Don't parse HTML with preg_match; use PHP's DOMDocument class instead.
Example:
<?php
$html= "<h5>some text in header</h5>
some other content, that belongs to header <p> or <a> or <img> inside.. not important...
<h5>Second text header</h5>";
// a new DOM object
$dom = new DOMDocument('1.0', 'utf-8');
// load the HTML into the object
$dom->loadHTML($html);
// discard white space
$dom->preserveWhiteSpace = false;
$hFive = $dom->getElementsByTagName('h5');
echo $hFive->item(0)->nodeValue; // you can get every h5 by changing the index
?>
Reference
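To go from there to the array the question asks for, one possibility is to walk the children of body and start a new key at every h5. A sketch under the assumption that the headers are direct children of body and the content between them is wrapped in elements (the sample markup and the $sections name are illustrative):

```php
<?php
$html = '<h5>First header</h5><p>alpha</p><h5>Second header</h5><p>beta</p>';

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$body = $dom->getElementsByTagName('body')->item(0);

$sections = array();
$current  = null;
foreach ($body->childNodes as $node) {
    if ($node->nodeName === 'h5') {
        // A header starts a new section keyed by its text
        $current = trim($node->textContent);
        $sections[$current] = '';
    } elseif ($current !== null) {
        // Everything else is appended, as HTML, to the current section
        $sections[$current] .= trim($dom->saveHTML($node));
    }
}

print_r($sections);
```

Two headers with identical text would overwrite each other here, so a list of key/content pairs may be safer if duplicates are possible.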

Extract description in site with no meta tag description?

I need a PHP function that extracts a description for a site URL that doesn't have a meta description tag. Any ideas?
I have tried this function, but it doesn't work:
$content = file_get_contents($url);
function getExcerpt($content) {
$text = html_entity_decode($content);
$excerpt = array();
//match all tags
preg_match_all("|<[^>]+>(.*)]+>|", $text, $p, PREG_PATTERN_ORDER);
for ($x = 0; $x < sizeof($p[0]); $x++) {
if (preg_match('< p >i', $p[0][$x])) {
$strip = strip_tags($p[0][$x]);
if (preg_match("/\./", $strip))
$excerpt[] = $strip;
}
if (isset($excerpt[0])){
preg_match("/([^.]+.)/", $strip,$matches);
return $matches[1];
}
}
return false;
}
$excerpt = getExcerpt($content);
Parsing HTML with RegEx is almost always a bad idea. Thankfully PHP has libraries that can do the work for you. The following code uses DOMDocument to extract either the meta description or if one does not exist, the first 1000 characters in the page.
<?php
function getExcerpt($html) {
$dom = new DOMDocument();
// Parse the inputted HTML into a DOM
$dom->loadHTML($html);
$metaTags = $dom->getElementsByTagName('meta');
// Check for a meta description and return it if it exists
foreach ($metaTags as $metaTag) {
if ($metaTag->getAttribute('name') === "description") {
return $metaTag->getAttribute('content');
}
}
// No meta description, extract an excerpt from the body
// Get the body node
$body = $dom->getElementsByTagName('body');
$body = $body->item(0);
// extract the contents
$bodyText = $body->textContent;
// collapse any line breaks
$bodyText = preg_replace('/\s*\n\s*/', "\n", $bodyText);
// collapse any leftover runs of spaces or tabs to single spaces
$bodyText = preg_replace('/[ \t]+/', ' ', $bodyText);
// return the first 1000 chars
return trim(substr($bodyText, 0, 1000));
}
$html = file_get_contents('test.html');
echo nl2br(getExcerpt($html));
You'll probably want to add a little more logic to it, some DOM traversal to try to find the content, or just some snippet near the middle of the text. As it is, this code will probably grab a bunch of unwanted stuff like the top of the page navigation etc.
You should first check whether a meta description is available. If so, display that; otherwise search for the <p> tags and display that data as the description (you might want to put a limit on the length of a paragraph, e.g. if its length is less than 30, search for the next paragraph). If there is no <p> tag, then simply display the title as the description (that's how Facebook and Digg work).
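That fallback chain could look roughly like this with DOMDocument. It is only a sketch: the function name describePage and the default minimum length of 30 are assumptions taken from the suggestion above, not an established API.

```php
<?php
// Hypothetical helper: meta description, else first long-enough <p>, else <title>.
function describePage(string $html, int $minLength = 30): ?string
{
    libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();

    // 1. Prefer a non-empty meta description
    foreach ($dom->getElementsByTagName('meta') as $meta) {
        if (strtolower($meta->getAttribute('name')) === 'description') {
            $content = trim($meta->getAttribute('content'));
            if ($content !== '') {
                return $content;
            }
        }
    }

    // 2. Otherwise take the first paragraph that is long enough
    foreach ($dom->getElementsByTagName('p') as $p) {
        $text = trim($p->textContent);
        if (strlen($text) >= $minLength) {
            return $text;
        }
    }

    // 3. Fall back to the page title, or null if there is none
    $title = $dom->getElementsByTagName('title')->item(0);
    return $title ? trim($title->textContent) : null;
}

$page = '<html><head><title>My page</title></head><body><p>Hi</p>'
      . '<p>This paragraph is long enough to serve as a description.</p></body></html>';
echo describePage($page);
```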

Regular Expressions, avoiding HTML tags in PHP

I have actually seen this question quite a bit here, but none of them are exactly what I want... Let's say I have the following phrase:
Line 1 - This is a TEST phrase.
Line 2 - This is a <img src="TEST" /> image.
Line 3 - This is a <a href="TEST">TEST</a> link.
Okay, simple right? I am trying the following code:
$linkPin = '#(\b)TEST(\b)(?![^<]*>)#i';
$linkRpl = '$1<a href="newurl">TEST</a>$2';
$html = preg_replace($linkPin, $linkRpl, $html);
As you can see, it takes the word TEST and replaces it with a link to test. The regular expression I am using right now works well to avoid replacing the TEST in line 2, and it also avoids replacing the TEST in the href of line 3. However, it still replaces the text encapsulated within the <a> tag on line 3, and I end up with:
Line 1 - This is a <a href="newurl">TEST</a> phrase.
Line 2 - This is a <img src="TEST" /> image.
Line 3 - This is a <a href="TEST"><a href="newurl">TEST</a></a> link.
This I do not want, as it creates bad code in line 3. I want to ignore matches not only inside of a tag, but also those encapsulated by tags. (Remember to keep note of the /> in line 2.)
Honestly, I'd do this with DomDocument and Xpath:
//First, create a simple html string around the text.
$html = '<html><body><div id="#content">'.$text.'</div></body></html>';
$dom = new DomDocument();
$dom->loadHtml($html);
$xpath = new DomXpath($dom);
$query = '//*[not(name() = "a") and contains(., "TEST")]';
$nodes = $xpath->query($query);
//Force it to an array to break the reference so iterating works properly
$nodes = iterator_to_array($nodes);
$replaceNode = function ($node) {
    $text = $node->wholeText;
    // wrap each occurrence in a link ("newurl" stands in for the real target)
    $text = str_replace('TEST', '<a href="newurl">TEST</a>', $text);
    $fragment = $node->ownerDocument->createDocumentFragment();
    $fragment->appendXML($text);
    $node->parentNode->replaceChild($fragment, $node);
};
foreach ($nodes as $node) {
    if ($node instanceof DomText) {
        $replaceNode($node);
    } else {
        // copy the live child list first, since replacing nodes while iterating it skips entries
        foreach (iterator_to_array($node->childNodes) as $child) {
            if ($child instanceof DomText) {
                $replaceNode($child);
            }
        }
    }
}
This should work for you, since it ignores all text inside of <a> elements and only replaces the text directly inside of the matching tags.
Okay... I think I came up with a better solution...
$noMatch = '(</a>|</h\d+>)';
$linkUrl = 'http://www.test.com/test/'.$link['page_slug'];
$linkPin = '#(?!(?:[^<]+>|[^>]+'.$noMatch.'))\b'.preg_quote($link['page_name'], '#').'\b#i';
$linkRpl = '<a href="'.$linkUrl.'">'.$link['page_name'].'</a>';
$page['HTML'] = preg_replace($linkPin, $linkRpl, $page['HTML']);
With this code, it won't process any text within <a> tags and <h#> tags. I figure, any new exclusions I want to add, simply need to be added to $noMatch.
Am I wrong in this method?
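Another regex route, offered only as a sketch, is to let the pattern consume whole <a>...</a> blocks and whole tags first and throw them away with PCRE's (*SKIP)(*FAIL) verbs, so that only keywords in plain text are left to match. The function name linkKeyword and the "newurl" target are assumptions for illustration:

```php
<?php
// Hypothetical helper: link $keyword wherever it appears outside tags and outside existing <a> elements.
function linkKeyword(string $html, string $keyword, string $url): string
{
    $pattern = '#<a\b[^>]*>.*?</a>(*SKIP)(*FAIL)'   // skip existing links entirely
             . '|<[^>]*>(*SKIP)(*FAIL)'             // skip the inside of any other tag
             . '|\b' . preg_quote($keyword, '#') . '\b#is';

    return preg_replace_callback($pattern, function ($m) use ($url) {
        // $m[0] keeps the original casing of the matched word
        return '<a href="' . htmlspecialchars($url) . '">' . $m[0] . '</a>';
    }, $html);
}

$html = "Line 1 - This is a TEST phrase.\n"
      . "Line 2 - This is a <img src=\"TEST\" /> image.\n"
      . "Line 3 - This is a <a href=\"TEST\">TEST</a> link.";

$linked = linkKeyword($html, 'TEST', 'newurl');
echo $linked;
```

New exclusions such as headings can be added as further alternatives before the keyword. It is still a regex over HTML, though, so nested or malformed markup can defeat it where the DOM approach would not.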
