Parsing keywords having space in-between the words

I'm fetching data from a website, and the script below works fine when I parse single words like "math,chemistry,science". However, if I try to parse a keyword that contains a space, such as "business math", the browser just loads forever and nothing comes back. Please guide me.
<?php
include("simple_html_dom.php");
$keywords = "business math,chemistry,science";
$keywords = explode(',', $keywords);
foreach($keywords as $keyword) {
echo '<br><b><font color="red">Keyword: </font><font color="blue">'.$keyword.'</font></b><br>';
$html = file_get_html('http://www.tutorvista.com/search/'.$keyword);
$i = 1;
foreach($html->find('div[style=padding:20px; border-top:thin solid #DDDDDD; border-bottom:none;]') as $element) {
foreach($element->find('div[class=entry-abstract]') as $div) {
$title[$i] = $div->plaintext.'<br><br>';
}
$i++;
}
print_r($title);
}
?>

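The problem is in this line:
$html = file_get_html('http://www.tutorvista.com/search/'.$keyword);
That function internally uses file_get_contents(), which does not accept raw spaces in a URL, so the keyword has to be encoded first. Encode only the keyword: running urlencode() over the whole URL would also escape the "://" and the slashes, and the result would no longer be a valid URL. For a path segment, rawurlencode() is the better fit because it turns a space into %20 rather than +.
Try this out:
$html = file_get_html('http://www.tutorvista.com/search/' . rawurlencode($keyword));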
Ref:
http://sourceforge.net/p/simplehtmldom/code/208/tree/trunk/simple_html_dom.php#l76
http://php.net/manual/en/function.file-get-contents.php
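To make the encoding step concrete, here is a minimal sketch that encodes only the keyword and leaves the rest of the URL alone. The helper name build_search_url() is made up for this illustration; the base URL is the one from the question.

```php
<?php
// Hypothetical helper (not part of simple_html_dom): builds the search URL
// for one keyword, percent-encoding only the keyword itself.
function build_search_url($keyword)
{
    // rawurlencode() turns a space into %20, which is valid inside a URL path;
    // trim() drops stray spaces left over from explode(',', ...).
    return 'http://www.tutorvista.com/search/' . rawurlencode(trim($keyword));
}

foreach (explode(',', 'business math,chemistry,science') as $keyword) {
    echo build_search_url($keyword), "\n";
}
```

The resulting string can then be passed to file_get_html() in place of the hand-concatenated URL.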

Related

Regular Expression not matching content in PHP

I am trying to scrape an ebay page such as this one: http://www.ebay.co.uk/sch/Cars-/9801/i.html?_nkw=vw+golf
Everything works great except that one of my regular expressions isn't matching the content, so the matches aren't being pushed to $linksArray. I have output the contents to make sure that what I am trying to match is in fact there, and it is. I then call print_r($linksArray), where all the matches should be, but they are not: it is an empty multi-dimensional array. You can see my live example here: http://www.mycommunity.co.za/marcksack/index.php
Here is my PHP code:
<?php
echo '<form method="POST">
<input type="text" id="url" name="url" size="120" value="' . (isset($_REQUEST["url"]) && !empty($_REQUEST["url"]) ? $_REQUEST["url"] : "") . '"/>
<input type="submit" value="Submit" />
</form>';
flush();
if (isset($_REQUEST["url"]) && !empty($_REQUEST["url"])) {
$url = $_REQUEST["url"];
$phones = array();
for ($page = 1; $page <= 1; $page++) {
// get page contents
$contents = file_get_contents($url . "&_pgn=" . $page);
echo(htmlentities($contents));
// find all links patterns
// HERE IS THE PROBLEM
$pattern = '/class="lvtitle"><a href="(.*)" class="vip"/';
$linksArray = array();
preg_match_all($pattern, $contents, $linksArray);
print_r($linksArray);
$links = $linksArray[0];
foreach($links as $link) {
$pureLink = str_replace("class=\"lvtitle\"><a href=\"", "", $link);
$pureLink = str_replace("\" class=\"vip\"", "", $pureLink);
// getting sub page contents
$subContents = file_get_contents($pureLink);
// find all links patterns
$subContents = str_replace(" ", "", $subContents);
$phonePattern = '/07[0-9]{9}/';
$phonesArray = array();
preg_match_all($phonePattern, $subContents, $phonesArray);
foreach($phonesArray[0] as $element) {
// check if phone not added previousely to the phones array
if (!in_array($element, $phones)) {
// add it to the phones array
array_push($phones, $element);
echo $element . "<br />";
flush();
}
}
}
}
// print results
foreach($phones as $phone){
echo $phone."<br/>";
}
}
?>
So obviously my question is what am I doing wrong? Why are the matches not being pushed to my $linksArray variable. I really appreciate your help!

This regex works:
"/ class=\"lvtitle\"><a href=\"([^\"]*)\"  class=\"vip\"/"
A few issues with yours:
You were trying to capture the URL using (.*), which is greedy and will run on to the end of the line instead of stopping at the closing quote of the href value.
Even so, it was not matching at all, because eBay puts two spaces between the href and class attributes, while your pattern allows only one.
Also, as has already been mentioned, you should use the API or DOMDocument for this. But in case you are curious, this is why it wasn't working. I hope that helps!
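The two points above can be checked on a small sample. This is only a sketch: the markup below is made up to mimic the situation the answer describes (two spaces before class="vip"), and the \s+ in the second pattern is my own substitution so that any amount of whitespace is accepted.

```php
<?php
// Made-up sample: two results on one line, two spaces before class="vip".
$contents = '<h3 class="lvtitle"><a href="http://example.com/itm/1"  class="vip">A</a></h3>'
          . '<h3 class="lvtitle"><a href="http://example.com/itm/2"  class="vip">B</a></h3>';

// Pattern from the question: expects exactly one space before class="vip".
preg_match_all('/class="lvtitle"><a href="(.*)" class="vip"/', $contents, $strict);

// Tolerant pattern: [^"]* cannot run past the closing quote of the href value,
// and \s+ accepts one or more whitespace characters between the attributes.
preg_match_all('/class="lvtitle"><a href="([^"]*)"\s+class="vip"/', $contents, $tolerant);

print_r($strict[1]);   // no captures
print_r($tolerant[1]); // the two URLs
```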

Remove all attributes from PHP string but keep basic markdown tags [duplicate]

How can I use php to strip all/any attributes from a tag, say a paragraph tag?
<p class="one" otherrandomattribute="two"> to <p>

Although there are better ways, you could actually strip arguments from html tags with a regular expression:
<?php
function stripArgumentFromTags( $htmlString ) {
$regEx = '/([^<]*<\s*[a-z](?:[0-9]|[a-z]{0,9}))(?:(?:\s*[a-z\-]{2,14}\s*=\s*(?:"[^"]*"|\'[^\']*\'))*)(\s*\/?>[^<]*)/i'; // match any start tag
$chunks = preg_split($regEx, $htmlString, -1, PREG_SPLIT_DELIM_CAPTURE);
$chunkCount = count($chunks);
$strippedString = '';
for ($n = 1; $n < $chunkCount; $n++) {
$strippedString .= $chunks[$n];
}
return $strippedString;
}
?>
The above could probably be written in fewer characters, but it does the job (quick and dirty).

Strip attributes using SimpleXML (Standard in PHP5)
<?php
// define allowable tags
$allowable_tags = '<p><a><img><ul><ol><li><table><thead><tbody><tr><th><td>';
// define allowable attributes
$allowable_atts = array('href','src','alt');
// strip collector
$strip_arr = array();
// load XHTML with SimpleXML
$data_sxml = simplexml_load_string('<root>'. $data_str .'</root>', 'SimpleXMLElement', LIBXML_NOERROR | LIBXML_NOXMLDECL);
if ($data_sxml ) {
// loop all elements with an attribute
foreach ($data_sxml->xpath('descendant::*[@*]') as $tag) {
// loop attributes
foreach ($tag->attributes() as $name=>$value) {
// check for allowable attributes
if (!in_array($name, $allowable_atts)) {
// set attribute value to empty string
$tag->attributes()->$name = '';
// collect attribute patterns to be stripped
$strip_arr[$name] = '/ '. $name .'=""/';
}
}
}
}
// strip unallowed attributes and root tag
$data_str = strip_tags(preg_replace($strip_arr,array(''),$data_sxml->asXML()), $allowable_tags);
?>
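As an alternative to SimpleXML, the same whitelist idea can be sketched with DOMDocument, which copes with HTML that is not well-formed XML. The function name strip_attributes_dom() and the wrapper <div> are assumptions made for this illustration, not part of the answer above.

```php
<?php
// Hypothetical helper: removes every attribute whose name is not in $allowed.
function strip_attributes_dom($html, array $allowed = array())
{
    $dom = new DOMDocument();
    // libxml warns about markup it has to repair; collect those warnings quietly.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML('<div>' . $html . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    // //*[@*] selects every element that has at least one attribute.
    foreach ($xpath->query('//*[@*]') as $element) {
        // Copy the names first: removing from the live attribute map while
        // iterating over it could skip entries.
        $names = array();
        foreach ($element->attributes as $attribute) {
            $names[] = $attribute->nodeName;
        }
        foreach ($names as $name) {
            if (!in_array(strtolower($name), $allowed, true)) {
                $element->removeAttribute($name);
            }
        }
    }

    // Serialize only the children of the wrapper <div>.
    $wrapper = $dom->getElementsByTagName('div')->item(0);
    $out = '';
    foreach ($wrapper->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }
    return $out;
}

echo strip_attributes_dom('<p class="one" otherrandomattribute="two">Hello</p>');
```

Passing array('class') as the second argument would keep the class attribute and drop the rest.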

Here is one function that will let you strip all attributes except ones you want:
function stripAttributes($s, $allowedattr = array()) {
if (preg_match_all("/<[^>]*\\s([^>]*)\\/*>/msiU", $s, $res, PREG_SET_ORDER)) {
foreach ($res as $r) {
$tag = $r[0];
$attrs = array();
preg_match_all("/\\s.*=(['\"]).*\\1/msiU", " " . $r[1], $split, PREG_SET_ORDER);
foreach ($split as $spl) {
$attrs[] = $spl[0];
}
$newattrs = array();
foreach ($attrs as $a) {
$tmp = explode("=", $a);
if (trim($a) != "" && (!isset($tmp[1]) || (trim($tmp[0]) != "" && !in_array(strtolower(trim($tmp[0])), $allowedattr)))) {
} else {
$newattrs[] = $a;
}
}
$attrs = implode(" ", $newattrs);
$rpl = str_replace($r[1], $attrs, $tag);
$s = str_replace($tag, $rpl, $s);
}
}
return $s;
}
For example:
echo stripAttributes('<p class="one" otherrandomattribute="two">');
or, if you want to keep the "class" attribute:
echo stripAttributes('<p class="one" otherrandomattribute="two">', array('class'));
Or
Suppose you are sending a message to an inbox and the message was composed with CKEditor. You can pass the posted content through the function and assign the result to the $message variable before sending. Note that stripAttributes() strips off all the HTML attributes that are unnecessary. I tried it and it works fine: I only saw the formatting I had added, such as bold.
$message = stripAttributes($_POST['message']);
or
you can echo $message; for a preview.

I honestly think that the only sane way to do this is to use a tag and attribute whitelist with the HTML Purifier library. Example script here:
<html><body>
<?php
require_once '../includes/htmlpurifier-4.5.0-lite/library/HTMLPurifier/Bootstrap.php';
spl_autoload_register(array('HTMLPurifier_Bootstrap', 'autoload'));
$config = HTMLPurifier_Config::createDefault();
$config->set('HTML.Allowed', 'p,b,a[href],i,br,img[src]');
$config->set('URI.Base', 'http://www.example.com');
$config->set('URI.MakeAbsolute', true);
$purifier = new HTMLPurifier($config);
$dirty_html = "
<a href=\"http://www.google.de\">broken a href link</a
fnord
<x>y</z>
<b>c</p>
<script>alert(\"foo!\");</script>
Anzahl besuchter Seiten
<img src=\"www.example.com/bla.gif\" />
<a href=\"http://www.google.de\">missing end tag
ende
";
$clean_html = $purifier->purify($dirty_html);
print "<h1>dirty</h1>";
print "<pre>" . htmlentities($dirty_html) . "</pre>";
print "<h1>clean</h1>";
print "<pre>" . htmlentities($clean_html) . "</pre>";
?>
</body></html>
This yields the following clean, standards-conforming HTML fragment:
broken a href linkfnord
y
<b>c
<a>Anzahl besuchter Seiten</a>
<img src="http://www.example.com/www.example.com/bla.gif" alt="bla.gif" /><a href="http://www.google.de">missing end tag
ende
</a></b>
In your case the whitelist would be:
$config->set('HTML.Allowed', 'p');

HTML Purifier is one of the better tools for sanitizing HTML with PHP.

You might also look into HTML Purifier. True, it's quite bloated and might not fit your needs if this specific example is all you care about, but it offers more or less 'bulletproof' purification of possibly hostile HTML. You can also choose to allow or disallow certain attributes (it's highly configurable).
http://htmlpurifier.org/

PHP: How to insert a string into matched regex pattern (adding rel="no-follow" to anchor links)

I am writing a commenting system for my website, using PHP.
I want to do the following:
Detect all external links (i.e. anchor tags with source NOT containing the string mywebsite.com) in a comment
Add the string 'rel="no-follow"' to anchor tags identified in step 1 above.
I have an idea for such a function, but I will need some help from more experienced PHP developers so that I'm sure I'm doing things the right way. This is what my first attempt looks like
<?php
function process_comment($comment)
{
$external_url_pattern = "href=[^mywebsite.com]"; //this regex is probably wrong (Help!)
//are there any matches
$matches = array();
preg_match_all($external_url_pattern, $comment, $matches);
foreach($matches as $match)
{
// how do we insert the 'rel="no-follow" string ?
}
}
?>
Would appreciate any comments, pointers and tips in helping me complete this function. Thanks.

Don't know if this will be appropriate, but instead of regex you could do it with DOMDocument as well:
$dom = new DOMDocument();
$dom->loadHTML($html);
//Evaluate Anchor tag in HTML
$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body//a");
for ($i = 0; $i < $hrefs->length; $i++) {
$href = $hrefs->item($i);
$url = $href->getAttribute('href');
// external link: it has a host, and the host does not contain mywebsite.com
$host = parse_url($url, PHP_URL_HOST);
if ($host && stripos($host, 'mywebsite.com') === false) {
// the standard rel value is "nofollow" (no hyphen)
$href->setAttribute("rel", "nofollow");
}
}
// save html
$html=$dom->saveHTML();
echo $html;
Hope it helps

This is a bit tricky but will do the job.
function process_comment($str, $mysite)
{
//parses href attribute values into $match
if(preg_match_all('/href="([^"]*)"/',$str,$match))
{
//array_unique: a URL that appears twice must only be rewritten once
foreach(array_unique($match[1]) as $v)
{
//check matched value contains your site as host name
//if not
//adds rel="nofollow" (the standard value has no hyphen) after the href attribute
if(!preg_match('#^(?:https?://)?(w+\.)?'.preg_quote($mysite, '#').'(.*)?#i',$v))
{
$str = str_replace('href="'.$v.'"','href="'.$v.'" rel="nofollow"',$str);
}
}
}
return $str;
}
echo process_comment($comment, 'mysite.com');
You could simply use strstr() instead of the second preg_match(). I used a regex because some external URLs may contain your host name somewhere else, for example "http://www.external.com/url.php?v=www.mysite.com"
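The host check described above can also be done without a second regex by comparing the host that parse_url() extracts. This is a sketch under my own assumptions: the helper name is_external_link() is hypothetical, and relative URLs (which have no host) are treated as internal.

```php
<?php
// Hypothetical helper: true when $url points to a host other than $mysite
// or one of its subdomains. URLs without a host count as internal.
function is_external_link($url, $mysite)
{
    $host = parse_url($url, PHP_URL_HOST);
    if (!is_string($host) || $host === '') {
        return false;
    }
    $host = strtolower($host);
    $mysite = strtolower($mysite);
    // Internal when the host equals the site or ends with ".<site>".
    $suffix = '.' . $mysite;
    return !($host === $mysite || substr($host, -strlen($suffix)) === $suffix);
}

var_dump(is_external_link('http://www.mysite.com/page', 'mysite.com'));
var_dump(is_external_link('http://www.external.com/url.php?v=www.mysite.com', 'mysite.com'));
```

Because only the host is compared, a query string that merely mentions the site name does not make an external link look internal.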

remove script tag from HTML content

I am using HTML Purifier (http://htmlpurifier.org/)
I just want to remove <script> tags only.
I don't want to remove inline formatting or any other things.
How can I achieve this?
One more thing: is there any other way to remove script tags from HTML?

Because this question is tagged with regex, I'm going to answer with the poor man's solution for this situation:
$html = preg_replace('#<script(.*?)>(.*?)</script>#is', '', $html);
However, regular expressions are not meant for parsing HTML/XML. Even if you write the perfect expression, it will break eventually, so it's not worth it. In some cases it is useful for quickly fixing some markup, but as with all quick fixes, forget about security. Use regex only on content/markup you trust.
Remember, anything the user inputs should be considered unsafe.
A better solution here is to use DOMDocument, which is designed for this.
Here is a snippet that demonstrates how easy, clean (compared to regex), (almost) reliable and (nearly) safe it is to do the same:
<?php
$html = <<<HTML
...
HTML;
$dom = new DOMDocument();
$dom->loadHTML($html);
$script = $dom->getElementsByTagName('script');
$remove = [];
foreach($script as $item)
{
$remove[] = $item;
}
foreach ($remove as $item)
{
$item->parentNode->removeChild($item);
}
$html = $dom->saveHTML();
I have removed the HTML intentionally because even this can bork.

Use the PHP DOMDocument parser.
$doc = new DOMDocument();
// load the HTML string we want to strip
$doc->loadHTML($html);
// get all the script tags
$script_tags = $doc->getElementsByTagName('script');
// for each tag, remove it from the DOM
// the node list is live and shrinks as nodes are removed, so walk it backwards
for ($i = $script_tags->length - 1; $i >= 0; $i--) {
$node = $script_tags->item($i);
$node->parentNode->removeChild($node);
}
// get the HTML string back
$no_script_html_string = $doc->saveHTML();
This worked for me using the following HTML document:
<!doctype html>
<html>
<head>
<meta charset="utf-8">
<title>
hey
</title>
<script>
alert("hello");
</script>
</head>
<body>
hey
</body>
</html>
Just bear in mind that the DOMDocument parser requires PHP 5 or greater.

$html = <<<HTML
...
HTML;
$dom = new DOMDocument();
$dom->loadHTML($html);
$tags_to_remove = array('script','style','iframe','link');
foreach($tags_to_remove as $tag){
$element = $dom->getElementsByTagName($tag);
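// copy the live node list first, otherwise removing nodes makes the loop skip elements
foreach(iterator_to_array($element) as $item){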
$item->parentNode->removeChild($item);
}
}
$html = $dom->saveHTML();

A simple way by manipulating string.
function stripStr($str, $ini, $fin)
{
while (($pos = mb_stripos($str, $ini)) !== false) {
$aux = mb_substr($str, $pos + mb_strlen($ini));
$str = mb_substr($str, 0, $pos);
if (($pos2 = mb_stripos($aux, $fin)) !== false) {
$str .= mb_substr($aux, $pos2 + mb_strlen($fin));
}
}
return $str;
}

Shorter:
$html = preg_replace("/<script.*?\/script>/s", "", $html);
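When doing regex, things might go wrong (preg_replace() returns null on error), so it's safer to do it like this (PHP 7+):
$html = preg_replace("/<script.*?\/script>/s", "", $html) ?? $html;
So when the "accident" happens, we get the original $html instead of null. Unlike the short ternary ?:, the ?? operator does not fall back when the result is a legitimately empty string.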
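A small sketch of the fallback behaviour, assuming PHP 7 or later for the ?? operator. The invalid pattern is only there to force preg_replace() into its error path; the variable names are made up for the example.

```php
<?php
$html = '<p>kept</p><script>alert(1)</script>';

// Normal case: the script block is removed.
$clean = preg_replace('/<script.*?\/script>/s', '', $html) ?? $html;

// Error case: an invalid pattern (unbalanced parenthesis) makes preg_replace()
// return null; @ hides the warning and ?? falls back to the original string.
$fallback = @preg_replace('/(/', '', $html) ?? $html;

// Edge case: when the whole input is a script, the correct result is ''.
// The short ternary ?: treats '' as false and hands back the script instead.
$onlyScript = '<script>alert(1)</script>';
$withCoalesce = preg_replace('/<script.*?\/script>/s', '', $onlyScript) ?? $onlyScript;
$withTernary  = preg_replace('/<script.*?\/script>/s', '', $onlyScript) ?: $onlyScript;

var_dump($clean, $fallback, $withCoalesce, $withTernary);
```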

This is a merge of the answers by ClandestineCoder and Binh WPO.
The problem with the angle brackets of the script tag is that they can appear in more than one variant,
e.g. (< = &#60; = &lt;) and (> = &#62; = &gt;),
so instead of creating a pattern array with a bazillion variants,
IMHO a better solution would be
return preg_replace('/script.*?\/script/ius', '', $text)
? preg_replace('/script.*?\/script/ius', '', $text)
: $text;
This will remove anything that looks like script.../script regardless of how the brackets are encoded, and you can test it here: https://regex101.com/r/lK6vS8/1

Try this complete and flexible solution. It is based in part on some previous answers, but contains additional validation checks and gets rid of the extra implied HTML that the loadHTML(...) function adds. It is divided into two separate functions (the second depends on the first, so don't reorder them) so you can use it to remove several kinds of HTML tags at once (i.e. not just 'script' tags). For example, the removeAllInstancesOfTag(...) function accepts an array of tag names, or optionally just one as a string. So, without further ado, here is the code:
/* Remove all instances of a particular HTML tag (e.g. <script>...</script>) from a variable containing raw HTML data. [BEGIN] */
/* Usage Example: $scriptless_html = removeAllInstancesOfTag($html, 'script'); */
if (!function_exists('removeAllInstancesOfTag'))
{
function removeAllInstancesOfTag($html, $tag_nm)
{
if (!empty($html))
{
$html = mb_convert_encoding($html, 'HTML-ENTITIES', 'UTF-8'); /* For UTF-8 Compatibility. */
$doc = new DOMDocument();
$doc->loadHTML($html,LIBXML_HTML_NOIMPLIED|LIBXML_HTML_NODEFDTD|LIBXML_NOWARNING);
if (!empty($tag_nm))
{
if (is_array($tag_nm))
{
$tag_nms = $tag_nm;
unset($tag_nm);
foreach ($tag_nms as $tag_nm)
{
$rmvbl_itms = $doc->getElementsByTagName(strval($tag_nm));
$rmvbl_itms_arr = [];
foreach ($rmvbl_itms as $itm)
{
$rmvbl_itms_arr[] = $itm;
}
foreach ($rmvbl_itms_arr as $itm)
{
$itm->parentNode->removeChild($itm);
}
}
}
else if (is_string($tag_nm))
{
$rmvbl_itms = $doc->getElementsByTagName($tag_nm);
$rmvbl_itms_arr = [];
foreach ($rmvbl_itms as $itm)
{
$rmvbl_itms_arr[] = $itm;
}
foreach ($rmvbl_itms_arr as $itm)
{
$itm->parentNode->removeChild($itm);
}
}
}
return $doc->saveHTML();
}
else
{
return '';
}
}
}
/* Remove all instances of a particular HTML tag (e.g. <script>...</script>) from a variable containing raw HTML data. [END] */
/* Remove all instances of dangerous and pesky <script> tags from a variable containing raw user-input HTML data. [BEGIN] */
/* Prerequisites: 'removeAllInstancesOfTag(...)' */
if (!function_exists('removeAllScriptTags'))
{
function removeAllScriptTags($html)
{
return removeAllInstancesOfTag($html, 'script');
}
}
/* Remove all instances of dangerous and pesky <script> tags from a variable containing raw user-input HTML data. [END] */
And here is a test usage example:
$html = 'This is a JavaScript retention test.<br><br><span id="chk_frst_scrpt">Congratulations! The first \'script\' tag was successfully removed!</span><br><br><span id="chk_secd_scrpt">Congratulations! The second \'script\' tag was successfully removed!</span><script>document.getElementById("chk_frst_scrpt").innerHTML = "Oops! The first \'script\' tag was NOT removed!";</script><script>document.getElementById("chk_secd_scrpt").innerHTML = "Oops! The second \'script\' tag was NOT removed!";</script>';
echo removeAllScriptTags($html);
I hope my answer really helps someone. Enjoy!

An example modifying ctf0's answer. This should only do the preg_replace once, but it also checks for errors and blocks the character codes for the forward slash.
$str = '<script> var a - 1; </script>';
$pattern = '/(script.*?(?:\/|&#47;|&sol;)script)/ius';
$replace = preg_replace($pattern, '', $str);
return ($replace !== null)? $replace : $str;
If you are using PHP 7 you can use the null coalescing operator to simplify it even more.
$pattern = '/(script.*?(?:\/|&#47;|&sol;)script)/ius';
return (preg_replace($pattern, '', $str) ?? $str);

function remove_script_tags($html){
$dom = new DOMDocument();
$dom->loadHTML($html);
$script = $dom->getElementsByTagName('script');
$remove = [];
foreach($script as $item){
$remove[] = $item;
}
foreach ($remove as $item){
$item->parentNode->removeChild($item);
}
$html = $dom->saveHTML();
$html = preg_replace('/<!DOCTYPE.*?<html>.*?<body><p>/ims', '', $html);
$html = str_replace('</p></body></html>', '', $html);
return $html;
}
Dejan's answer was good, but saveHTML() adds unnecessary doctype and body tags; this should get rid of them. See https://3v4l.org/82FNP
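Another way to avoid the wrapper, sketched here under the assumption that the input is a fragment that ends up inside <body>, is to serialize only the children of <body> rather than cutting the wrapper off with a regex. The function name remove_script_tags_fragment() is made up for this example.

```php
<?php
// Hypothetical variant: removes <script> elements and returns only the
// markup inside <body>, so no doctype/html/body wrapper has to be cut off.
function remove_script_tags_fragment($html)
{
    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // keep parser warnings quiet
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Copy the live node list before removing anything from it.
    foreach (iterator_to_array($dom->getElementsByTagName('script')) as $script) {
        $script->parentNode->removeChild($script);
    }

    $body = $dom->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return '';
    }
    $out = '';
    foreach ($body->childNodes as $child) {
        // saveHTML() with a node argument serializes just that node.
        $out .= $dom->saveHTML($child);
    }
    return $out;
}

$result = remove_script_tags_fragment('<p>Hello</p><script>alert("x");</script><p>World</p>');
echo $result;
```

This should print the two paragraphs without the script and without a surrounding doctype or body element.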

I would use BeautifulSoup if it's available. Makes this sort of thing very easy.
Don't try to do it with regexps. That way lies madness.

I had been struggling with this question. I discovered you only really need one function. explode('>', $html); The single common denominator to any tag is < and >. Then after that it's usually quotation marks ( " ). You can extract information so easily once you find the common denominator. This is what I came up with:
$html = file_get_contents('http://some_page.html');
$h = explode('>', $html);
foreach($h as $k => $v){
$v = trim($v);//clean it up a bit
if(preg_match('/^(<script[.*]*)/ius', $v)){//my regex here might be questionable
$counter = $k;//match opening tag and start counter for backtrace
}elseif(preg_match('/([.*]*<\/script$)/ius', $v)){//but it gets the job done
$script_length = $k - $counter;
$counter = 0;
for($i = $script_length; $i >= 0; $i--){
$h[$k-$i] = '';//backtrace and clear everything in between
}
}
}
$ht = array();
for($i = 0; $i < count($h); $i++){
if($h[$i] != ''){
$ht[$i] = $h[$i];//clean out the blanks so when we implode it works right.
}
}
$html = implode('>', $ht);//all scripts stripped.
echo $html;
I see this really only working for script tags because you will never have nested script tags. Of course, you can easily add more code that does the same check and gather nested tags.
I call it accordion coding. implode();explode(); are the easiest ways to get your logic flowing if you have a common denominator.

This is a simplified variant of Dejan Marjanovic's answer:
function removeTags($html, $tag) {
$dom = new DOMDocument();
$dom->loadHTML($html);
foreach (iterator_to_array($dom->getElementsByTagName($tag)) as $item) {
$item->parentNode->removeChild($item);
}
return $dom->saveHTML();
}
Can be used to remove any kind of tag, including <script>:
$scriptlessHtml = removeTags($html, 'script');

Use the str_replace function to replace them with an empty string:
$query = '<script>console.log("I should be banned")</script>';
$badChar = array('<script>','</script>');
$query = str_replace($badChar, '', $query);
echo $query;
//this echoes console.log("I should be banned")
?>

PHP: Display the first 500 characters of HTML

I have a large chunk of HTML in a PHP variable, like:
$html_code = '<div class="contianer" style="text-align:center;">The Sameple text.</div><br><span>Another sample text.</span>....';
I want to display only the first 500 characters of it. The character count should cover only the text inside the HTML tags and exclude the tags and attributes themselves.
Trimming must not break the DOM structure of the HTML.
Is there any tutorial or working example available?

If it's the text you want, you can also do this with the following:
substr(strip_tags($html_code),0,500);

Ooohh... I know this. I can't get it exactly off the top of my head, but you want to load the text you've got as a DOMDocument
http://www.php.net/manual/en/class.domdocument.php
then grab the text from the entire document node (as a DOMNode http://www.php.net/manual/en/class.domnode.php)
This won't be exactly right, but hopefully this will steer you onto the right track.
Try something like:
$html_code = '<div class="contianer" style="text-align:center;">The Sameple text.</div><br><span>Another sample text.</span>....';
$dom = new DOMDocument();
$dom->loadHTML($html_code);
$text_to_strip = $dom->textContent;
$stripped = mb_substr($text_to_strip,0,500);
echo "$stripped"; // The Sameple text.Another sample text.....
edit ok... that should work. just tested locally
edit2
Now that I understand you want to keep the tags, but limit the text, lets see. You're going to want to loop the content until you get to 500 characters. This is probably going to take a few edits and passes for me to get right, but hopefully I can help. (sorry I can't give undivided attention)
First case is when the text is less than 500 characters. Nothing to worry about. Starting with the above code we can do the following.
if (mb_strlen($text_to_strip) > 500) {
// this is where we do our work.
$characters_so_far = 0;
// loadHTML() wraps the fragment in <html><body>, so walk the children of <body>
$body = $dom->getElementsByTagName('body')->item(0);
// copy the live node list first so removing nodes does not disturb the loop
foreach (iterator_to_array($body->childNodes) as $ChildNode) {
// should check if $ChildNode->hasChildNodes();
// probably put some of this stuff into a function
$characters_in_next_node = mb_strlen($ChildNode->textContent);
if ($characters_so_far + $characters_in_next_node > 500) {
// remove the node that would push the total past the limit
$ChildNode->parentNode->removeChild($ChildNode);
}
$characters_so_far += $characters_in_next_node;
}
$final_out = $dom->saveHTML();
} else {
$final_out = $html_code;
}
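The loop above drops whole top-level nodes. A finer-grained version, sketched here as one possible approach rather than a finished solution, walks the tree recursively, cuts the text node in which the limit is reached, and removes everything after it. The function names truncate_node() and truncate_html() are made up, and multibyte input may additionally need an encoding hint for loadHTML().

```php
<?php
// Hypothetical helper: trims the text below $node to $remaining characters
// ($remaining is passed by reference) and removes everything after the cut.
function truncate_node(DOMNode $node, &$remaining)
{
    if (!$node->hasChildNodes()) {
        return;
    }
    // Copy the live child list because children may be removed in the loop.
    foreach (iterator_to_array($node->childNodes) as $child) {
        if ($remaining <= 0) {
            // Limit already reached: drop this node and its whole subtree.
            $node->removeChild($child);
        } elseif ($child instanceof DOMText) {
            $length = mb_strlen($child->data, 'UTF-8');
            if ($length > $remaining) {
                $child->data = mb_substr($child->data, 0, $remaining, 'UTF-8');
                $length = $remaining;
            }
            $remaining -= $length;
        } else {
            truncate_node($child, $remaining);
        }
    }
}

function truncate_html($html, $limit)
{
    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // keep parser warnings quiet
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $body = $dom->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return '';
    }
    truncate_node($body, $limit);

    // Serialize only the children of <body>, not the implied wrapper.
    $out = '';
    foreach ($body->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }
    return $out;
}

$sample = '<div class="a">The sample text.</div><br><span>Another sample text.</span>';
echo truncate_html($sample, 20);
```

With a limit of 20, the first 16 characters come from the div and the remaining 4 from the span, so the span is kept but shortened and every tag stays properly closed.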

I'm pasting below a PHP class I wrote a long time ago, but I know it works. It's not exactly what you're after, as it deals with words instead of a character count, but I figure it's pretty close and someone might find it useful.
class HtmlWordManipulator
{
var $stack = array();
function truncate($text, $num=50)
{
if (preg_match_all('/\s+/', $text, $junk) <= $num) return $text;
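// the callback is a method of this class, so it must be passed as array($this, ...)
$text = preg_replace_callback('/(<\/?[^>]+\s+[^>]*>)/', array($this, '_truncateProtect'), $text);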
$words = 0;
$out = array();
$text = str_replace('<',' <',str_replace('>','> ',$text));
$toks = preg_split('/\s+/', $text);
foreach ($toks as $tok)
{
if (preg_match_all('/<(\/?[^\x01>]+)([^>]*)>/',$tok,$matches,PREG_SET_ORDER))
foreach ($matches as $tag) $this->_recordTag($tag[1], $tag[2]);
$out[] = trim($tok);
if (! preg_match('/^(<[^>]+>)+$/', $tok))
{
if (!strpos($tok,'=') && !strpos($tok,'<') && strlen(trim(strip_tags($tok))) > 0)
{
++$words;
}
else
{
/*
echo '<hr />';
echo htmlentities('failed: '.$tok).'<br /)>';
echo htmlentities('has equals: '.strpos($tok,'=')).'<br />';
echo htmlentities('has greater than: '.strpos($tok,'<')).'<br />';
echo htmlentities('strip tags: '.strip_tags($tok)).'<br />';
echo str_word_count($text);
*/
}
}
if ($words > $num) break;
}
$truncate = $this->_truncateRestore(implode(' ', $out));
return $truncate;
}
function restoreTags($text)
{
foreach ($this->stack as $tag) $text .= "</$tag>";
return $text;
}
private function _truncateProtect($match)
{
return preg_replace('/\s/', "\x01", $match[0]);
}
private function _truncateRestore($strings)
{
return preg_replace('/\x01/', ' ', $strings);
}
private function _recordTag($tag, $args)
{
// XHTML
if (strlen($args) and $args[strlen($args) - 1] == '/') return;
else if ($tag[0] == '/')
{
$tag = substr($tag, 1);
for ($i=count($this->stack) -1; $i >= 0; $i--) {
if ($this->stack[$i] == $tag) {
array_splice($this->stack, $i, 1);
return;
}
}
return;
}
else if (in_array($tag, array('p', 'li', 'ul', 'ol', 'div', 'span', 'a')))
$this->stack[] = $tag;
else return;
}
}
truncate() is what you want: you pass it the HTML and the number of words you want it trimmed down to. It ignores HTML while counting words, but then rewraps everything in HTML, even closing the tags left open by the truncation.
Please don't judge me on the complete lack of OOP principles. I was young and stupid.
Edit:
So it turns out the usage is more like this:
$content = $manipulator->restoreTags($manipulator->truncate($myHtml,$numOfWords));
Stupid design decision. It allowed me to inject HTML inside the unclosed tags, though.

I'm not up to coding a full solution, but if someone wants to, here's where I'd start (collecting the text, node by node, with DOMDocument):
$html_code = '<div class="contianer" style="text-align:center;">The Sameple text.</div><br><span>Another sample text.</span>....';
$aggregate = '';
$document = new DOMDocument();
$document->loadHTML($html_code);
$xpath = new DOMXPath($document);
// every text node in document order; each one holds only the text directly
// inside its parent tag, not the text of nested children
foreach ($xpath->query('//text()') as $textNode) {
$aggregate .= $textNode->nodeValue; // This is the text, not HTML.
}
