Best way to automatically remove comments from PHP code - php

What’s the best way to remove comments from a PHP file?
I want to do something similar to php_strip_whitespace(), but it shouldn't remove the line breaks as well.
For example,
I want this:
<?PHP
// something
if ($whatsit) {
do_something(); # we do something here
echo '<html>Some embedded HTML</html>';
}
/* another long
comment
*/
some_more_code();
?>
to become:
<?PHP
if ($whatsit) {
do_something();
echo '<html>Some embedded HTML</html>';
}
some_more_code();
?>
(Although if the empty lines remain where comments are removed, that would be OK.)
It may not be possible, because of the requirement to preserve embedded HTML - that’s what’s tripped up the things that have come up on Google.

I'd use tokenizer. Here's my solution. It should work on both PHP 4 and 5:
$fileStr = file_get_contents('path/to/file');
$newStr = '';
$commentTokens = array(T_COMMENT);
if (defined('T_DOC_COMMENT')) {
$commentTokens[] = T_DOC_COMMENT; // PHP 5
}
if (defined('T_ML_COMMENT')) {
$commentTokens[] = T_ML_COMMENT; // PHP 4
}
$tokens = token_get_all($fileStr);
foreach ($tokens as $token) {
if (is_array($token)) {
if (in_array($token[0], $commentTokens)) {
continue;
}
$token = $token[1];
}
$newStr .= $token;
}
echo $newStr;

Use php -w <sourcefile> to generate a file stripped of comments and whitespace, and then use a beautifier like PHP_Beautifier to reformat for readability.

$fileStr = file_get_contents('file.php');
foreach (token_get_all($fileStr) as $token ) {
if ($token[0] != T_COMMENT) {
continue;
}
$fileStr = str_replace($token[1], '', $fileStr);
}
echo $fileStr;

Here's the function posted above, modified to recursively remove all comments from all PHP files within a directory and all its subdirectories:
function rmcomments($id) {
if (file_exists($id)) {
if (is_dir($id)) {
$handle = opendir($id);
while (false !== ($file = readdir($handle))) { // strict check so an entry named "0" does not end the loop
if (($file != ".") && ($file != "..")) {
rmcomments($id . "/" . $file); }}
closedir($handle); }
else if ((is_file($id)) && (pathinfo($id, PATHINFO_EXTENSION) == "php")) {
if (!is_writable($id)) { chmod($id, 0777); }
if (is_writable($id)) {
$fileStr = file_get_contents($id);
$newStr = '';
$commentTokens = array(T_COMMENT);
if (defined('T_DOC_COMMENT')) { $commentTokens[] = T_DOC_COMMENT; }
if (defined('T_ML_COMMENT')) { $commentTokens[] = T_ML_COMMENT; }
$tokens = token_get_all($fileStr);
foreach ($tokens as $token) {
if (is_array($token)) {
if (in_array($token[0], $commentTokens)) { continue; }
$token = $token[1]; }
$newStr .= $token; }
if (!file_put_contents($id, $newStr)) {
$open = fopen($id, "w");
fwrite($open, $newStr);
fclose($open);
}
}
}
}
}
rmcomments("path/to/directory");

A more powerful version: remove all comments in the folder
<?php
$di = new RecursiveDirectoryIterator(__DIR__, RecursiveDirectoryIterator::SKIP_DOTS);
$it = new RecursiveIteratorIterator($di);
$fileArr = [];
foreach($it as $file) {
    if(pathinfo($file, PATHINFO_EXTENSION) == "php") {
        $fileArr[] = (string) $file; // SplFileInfo casts to its path
    }
}
$arr = [T_COMMENT, T_DOC_COMMENT];
$count = count($fileArr);
for($i=0; $i < $count; $i++) { // start at 0 so the first file is not skipped
    $fileStr = file_get_contents($fileArr[$i]);
    foreach(token_get_all($fileStr) as $token) {
        // only array tokens carry an id; single characters like ';' are plain strings
        if(is_array($token) && in_array($token[0], $arr)) {
            $fileStr = str_replace($token[1], '', $fileStr);
        }
    }
    file_put_contents($fileArr[$i], $fileStr);
}

/*
* T_ML_COMMENT does not exist in PHP 5.
* The following three lines define it in order to
* preserve backwards compatibility.
*
* The next two lines define the PHP 5 only T_DOC_COMMENT,
* which we will mask as T_ML_COMMENT for PHP 4.
*/
if (! defined('T_ML_COMMENT')) {
define('T_ML_COMMENT', T_COMMENT);
} else {
define('T_DOC_COMMENT', T_ML_COMMENT);
}
/*
* Remove all comment in $file
*/
function remove_comment($file) {
$comment_token = array(T_COMMENT, T_ML_COMMENT, T_DOC_COMMENT);
$input = file_get_contents($file);
$tokens = token_get_all($input);
$output = '';
foreach ($tokens as $token) {
if (is_string($token)) {
$output .= $token;
} else {
list($id, $text) = $token;
if (!in_array($id, $comment_token)) { // keep everything that is not a comment
$output .= $text;
}
}
}
file_put_contents($file, $output);
}
/*
* Glob recursive
* @return ['dir/filename', ...]
*/
function glob_recursive($pattern, $flags = 0) {
$file_list = glob($pattern, $flags);
$sub_dir = glob(dirname($pattern) . '/*', GLOB_ONLYDIR);
// If sub directory exist
if (count($sub_dir) > 0) {
$file_list = array_merge(
glob_recursive(dirname($pattern) . '/*/' . basename($pattern), $flags),
$file_list
);
}
return $file_list;
}
// Remove all comment of '*.php', include sub directory
foreach (glob_recursive('*.php') as $file) {
remove_comment($file);
}

If you already use an editor like UltraEdit, you can open one or multiple PHP file(s) and then use a simple Find&Replace (Ctrl + R) with the following Perl regular expression:
(?s)/\*.*?\*/
Beware the above regular expression also removes comments inside a string, i.e., in echo "hello/*babe*/"; the /*babe*/ would be removed too. Hence, it could be a solution if you have few files to remove comments from. In order to be absolutely sure it does not wrongly replace something that is not a comment, you would have to run the Find&Replace command and approve each time what is getting replaced.

Bash solution: If you want to recursively remove comments from all PHP files starting from the current directory, you can run this one-liner in the terminal. (It uses a temp1 file to store the PHP content for processing.)
Note that this strips whitespace along with the comments.
find . -type f -name '*.php' | while IFS= read -r VAR; do php -wq "$VAR" > temp1; cat temp1 > "$VAR"; done
Then remove the temp1 file afterwards.
If PHP_Beautifier is installed, you can get nicely formatted code without comments with
find . -type f -name '*.php' | while IFS= read -r VAR; do php -wq "$VAR" > temp1; php_beautifier temp1 > temp2; cat temp2 > "$VAR"; done
Then remove the two files (temp1 and temp2).

Following upon the accepted answer, I needed to preserve the line numbers of the file too, so here is a variation of the accepted answer:
/**
* Removes the php comments from the given valid php string, and returns the result.
*
* Note: a valid php string must start with <?php.
*
* If the preserveWhiteSpace option is true, it will replace the comments with some whitespaces, so that
* the line numbers are preserved.
*
*
* @param string $str
* @param bool $preserveWhiteSpace
* @return string
*/
function removePhpComments(string $str, bool $preserveWhiteSpace = true): string
{
$commentTokens = [
\T_COMMENT,
\T_DOC_COMMENT,
];
$tokens = token_get_all($str);
if (true === $preserveWhiteSpace) {
$lines = explode(PHP_EOL, $str);
}
$s = '';
foreach ($tokens as $token) {
if (is_array($token)) {
if (in_array($token[0], $commentTokens)) {
if (true === $preserveWhiteSpace) {
$comment = $token[1];
$lineNb = $token[2];
$firstLine = $lines[$lineNb - 1];
$p = explode(PHP_EOL, $comment);
$nbLineComments = count($p);
if ($nbLineComments < 1) {
$nbLineComments = 1;
}
$firstCommentLine = array_shift($p);
$isStandAlone = (trim($firstLine) === trim($firstCommentLine));
if (false === $isStandAlone) {
if (2 === $nbLineComments) {
$s .= PHP_EOL;
}
continue; // Just remove inline comments
}
// Stand-alone case
$s .= str_repeat(PHP_EOL, $nbLineComments - 1);
}
continue;
}
$token = $token[1];
}
$s .= $token;
}
return $s;
}
Note: this is for PHP 7+ (I didn't care about backward compatibility with older PHP versions).
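A simpler variation on the same idea, as a minimal sketch: replace each comment with exactly as many newlines as it contains, so every remaining statement stays on its original line. The function name blank_out_comments and the sample source are hypothetical, and the sketch assumes "\n" line endings. It relies on the fact that the comment's trailing newline is either part of the comment token (PHP 7) or part of the following whitespace token (PHP 8), so the count comes out the same either way.

```php
<?php
// Hypothetical helper: removes comments but keeps every line break they contained.
function blank_out_comments(string $source): string
{
    $out = '';
    foreach (token_get_all($source) as $token) {
        if (is_array($token) && in_array($token[0], [T_COMMENT, T_DOC_COMMENT], true)) {
            // Emit one "\n" per newline inside the comment text, nothing else.
            $out .= str_repeat("\n", substr_count($token[1], "\n"));
            continue;
        }
        // Array tokens are [id, text, line]; string tokens are single characters.
        $out .= is_array($token) ? $token[1] : $token;
    }
    return $out;
}

// Made-up sample input.
$source = <<<'PHP'
<?php
// one
$a = 1; # two
/* three
   four */
$b = 2;
PHP;

$result = blank_out_comments($source);
echo $result, "\n";
```

Unlike the function above, this does not try to tell stand-alone comments from inline ones; it only guarantees that the number of line breaks is unchanged.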

For Ajax and JSON responses, I use the following PHP code, to remove comments from HTML/JavaScript code, so it would be smaller (about 15% gain for my code).
// Replace doubled spaces with single ones (ignored in HTML anyway)
$html = preg_replace('#(\s){2,}#', '\1', $html);
// Remove single and multiline comments, tabs and newline chars
$html = preg_replace(
'#(/\*([^*]|[\r\n]|(\*+([^*/]|[\r\n])))*\*+/)|((?<!:)//.*)|[\t\r\n]#i',
'',
$html
);
It is short and effective, but it can produce unexpected results, if your code has bad syntax.

Run the command php --strip file.php in a command prompt (for example, cmd.exe), and then browse to WriteCodeOnline.
Here, file.php is your own file.

In 2019 it could work like this:
<?php
/* hi there !!!
here are the comments */
//another try
echo removecomments('index.php');
/* hi there !!!
here are the comments */
//another try
function removecomments($f){
$w=Array(';','{','}');
$ts = token_get_all(php_strip_whitespace($f));
$s='';
foreach($ts as $t){
if(is_array($t)){
$s .=$t[1];
}else{
$s .=$t;
if( in_array($t,$w) ) $s.=chr(13).chr(10);
}
}
return $s;
}
?>
To see the result, run it (for example in XAMPP). The page looks blank in the browser, but if you right-click and choose View Source, you get your PHP script: it loads itself and removes all comments and tabs.
I also prefer this solution because I use it to speed up my framework's single-file engine, "m.php". After php_strip_whitespace alone, without this script, the source was the slowest in my tests: I ran 10 benchmarks and took the arithmetic mean. (I think PHP 7 either restores the missing CR/LFs when parsing, or parsing takes longer when they are missing.)

php -w or php_strip_whitespace($filename);
documentation
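As a rough illustration of what php_strip_whitespace() does, here is a minimal sketch. The temporary file and its contents are made up for the example; the function takes a filename, not a source string, which is why the sample is written to disk first.

```php
<?php
// Write a small sample script to a temporary file.
$tmp = tempnam(sys_get_temp_dir(), 'strip');
file_put_contents($tmp, "<?php\n// a comment\necho 'hi';   /* another */\necho 'there';\n");

// Returns the source with comments and surplus whitespace removed.
$stripped = php_strip_whitespace($tmp);
unlink($tmp);

echo $stripped, "\n";
```

Because whitespace is stripped along with the comments, the original line layout is generally not preserved, which is the behaviour the question wanted to avoid.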

The catch is that a less robust matching algorithm (simple regex, for instance) will start stripping here when it clearly shouldn't:
if (preg_match('#^/*' . $this->index . '#', $this->permalink_structure)) {
It might not affect your code, but eventually someone will get bit by your script. So you will have to use a utility that understands more of the language than you might otherwise expect.
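To make the difference concrete, here is a small sketch that runs a naive regex and the tokenizer over a line modelled on the one above. The variable names inside the sample source ($index, $permalink) are simplified stand-ins, and the naive pattern is only one example of a "simple regex".

```php
<?php
// Sample source: the string literal contains "/*", which is not a comment.
$source = <<<'PHP'
<?php
if (preg_match('#^/*' . $index . '#', $permalink)) { /* real comment */
    run();
}
PHP;

// Naive approach: treat everything between "/*" and "*/" as a comment.
$naive = preg_replace('#/\*.*?\*/#s', '', $source);

// Tokenizer approach: drop only what the PHP lexer classifies as a comment.
$clean = '';
foreach (token_get_all($source) as $token) {
    if (is_array($token)) {
        if ($token[0] === T_COMMENT || $token[0] === T_DOC_COMMENT) {
            continue;
        }
        $clean .= $token[1];
    } else {
        $clean .= $token; // single-character token such as '(' or ';'
    }
}

echo $naive, "\n---\n", $clean, "\n";
```

The regex starts matching inside the string literal and eats the rest of the condition, while the tokenizer version keeps the pattern string intact and removes only the real comment.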

Related

How to search/filter node content inside of xml file with SimpleXMLElement - php

I need to filter/search all links (png, jpg, mp3) from an XML file, but I got stuck. For example, I managed to get all the mp3 links, but only because I knew where they were; if I load another file where the path is different, it won't detect them.
foreach($xml->BODY->GENERAL->SOUNDS->SOUND as $a){
echo ''.$a->PATH.'<br>';
}
Example XML
You could get the extension of each file and compare it to an array of "accepted extensions". Then use continue to skip writing the link:
$accepted_exts = ['png','jpg','mp3'];
foreach($xml->BODY->GENERAL->SOUNDS->SOUND as $a) {
$path = $a->PATH;
$ext = strtolower(substr($path, strrpos($path, '.') + 1));
if (!in_array($ext, $accepted_exts)) continue ; // continue to next iteration
echo ''.$path.'<br>'; // write the link
}
To get other links:
$accepted_exts = ['png','jpg','mp3'];
$links = [] ;
foreach($xml->HEAD as $items) {
foreach ($items as $item) {
$path = (string)$item;
if (!in_array(get_ext($path), $accepted_exts)) continue ; // continue to next iteration
$links[] = $path ;
}
}
foreach($xml->BODY->GENERAL->SOUNDS->SOUND as $a) {
$path = $a->PATH;
if (!in_array(get_ext($path), $accepted_exts)) continue ; // continue to next iteration
$links[] = $path ;
}
foreach ($links as $path) {
echo ''.$path.'<br>'; // write the link
}
function get_ext($path) {
return strtolower(substr($path, strrpos($path, '.') + 1));
}
This outputs:
http://player.glifing.com/img/Player/blue.png<br>
http://player.glifing.com/img/Player/blue_intro.png<br>
http://player.glifing.com/upload/fondoinstrucciones2.jpg<br>
http://player.glifing.com/upload/stopbet2.png<br>
http://player.glifing.com/upload/goglif2.png<br>
http://player.glifing.com/img/Player/Glif 3 OK.png<br>
http://player.glifing.com/img/Player/BetPensant.png<br>
http://player.glifing.com/audio/Player/si.mp3<br>
http://player.glifing.com/audio/Player/no.mp3<br>
To save having to know which individual tags may contain a URL, you can use XPath to search for any text content that starts with "http://" or "https://". Then process each part to check the extension.
$xml = simplexml_load_file("data.xml");
$extensions = ['png', 'jpg', 'mp3'];
$links = $xml->xpath('//text()[starts-with(normalize-space(), "http://")
or starts-with(normalize-space(), "https://")]');
foreach ( $links as $link ) {
$link = trim(trim($link),"_");
$path = parse_url($link, PHP_URL_PATH);
$extension = strtolower(pathinfo($path, PATHINFO_EXTENSION));
if ( in_array($extension, $extensions)) {
// Do something
echo $link.PHP_EOL;
}
else {
echo "Rejected:".$link.PHP_EOL;
}
}
I found that using trim() helped clean up URLs which had blank lines after them (or at least some extra whitespace). The extension is converted to lower case to make checking easier.
You may not need the rejected bit, but I put it in to test my code.
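For a quick way to try the idea without a data file, here is a self-contained sketch. It is a variation, not the code above: it selects leaf elements rather than text nodes, and the XML sample and its element names are invented for illustration.

```php
<?php
// Hypothetical sample document; the element names are made up.
$xml = simplexml_load_string(
    '<ROOT>'
    . '<HEAD><IMG>http://example.com/a.png</IMG><NOTE>hello</NOTE></HEAD>'
    . '<BODY><SOUND><PATH> http://example.com/b.MP3 </PATH></SOUND>'
    . '<DOC>http://example.com/c.pdf</DOC></BODY>'
    . '</ROOT>'
);
$accepted = ['png', 'jpg', 'mp3'];
$links = [];

// "//*[not(*)]" selects every element that has no child elements (the leaves).
foreach ($xml->xpath('//*[not(*)]') as $leaf) {
    $text = trim((string) $leaf);
    $path = parse_url($text, PHP_URL_PATH);
    if (!is_string($path)) {
        continue; // not parseable, or no path component
    }
    $ext = strtolower(pathinfo($path, PATHINFO_EXTENSION));
    if (in_array($ext, $accepted, true)) {
        $links[] = $text;
    }
}

print_r($links);
```

With this sample, only the .png and .MP3 links end up in $links; the plain text and the .pdf link are skipped.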
You would have to repeat the above

PHP code to create a negative word dictionary and search if a post has negative words

I'm trying to develop a PHP application that takes comments from users and matches the string against a word list to check whether the comment is positive or negative. I have a list of negative words in a negative.txt file. If a word from the list is matched, I want a simple integer counter to increment by 1. I tried some links and wrote code to check whether the comment is negative or positive, but it only matches the last word of the file. Here's the code I have so far.
<?php
function teststringforbadwords($comment)
{
$file="BadWords.txt";
$fopen = fopen($file, "r");
$fread = fread($fopen,filesize("$file"));
fclose($fopen);
$newline_ele = "\n";
$data_split = explode($newline_ele, $fread);
$new_tab = "\t";
$outoutArr = array();
//process uploaded file data and push in output array
foreach ($data_split as $string)
{
$row = explode($new_tab, $string);
if(isset($row['0']) && $row['0'] != ""){
$outoutArr[] = trim($row['0']," ");
}
}
//---------------------------------------------------------------
foreach($outoutArr as $word) {
if(stristr($comment,$word)){
return false;
}
}
return true;
}
if(isset($_REQUEST["submit"]))
{
$comments = $_REQUEST["comments"];
if (teststringforbadwords($comments))
{
echo 'string is clean';
}
else
{
echo 'string contains banned words';
}
}
?>
Link Tried : Check a string for bad words?
I added the strtolower function around both your $comments and your input from the file. That way, if someone spells STUPID instead of stupid, the code will still detect the bad word.
I also added trim to remove unnecessary and disruptive whitespace (like newlines).
Finally, I changed how you check the words. I used preg_split to split the comment on whitespace, so we check only full words and don't accidentally ban incorrect strings.
<?php
function teststringforbadwords($comment)
{
$comment = strtolower($comment);
$file="BadWords.txt";
$fopen = fopen($file, "r");
$fread = strtolower(fread($fopen,filesize("$file")));
fclose($fopen);
$data_split = explode("\n", $fread);
// compare every banned word with every whitespace-separated word of the comment
foreach ($data_split as $bannedWord)
{
foreach (preg_split('/\s+/',$comment) as $commentWord) {
if (trim($bannedWord) === trim($commentWord)) {
return false;
}
}
}
return true;
}
1) You're storing only $row['0'], not the words at the other indexes, so you're ignoring some of the words in the text file.
Some suggestions:
1) Put the words in the text file one per line, like this, so you can split on newlines and avoid multiple explodes and loops.
Example: sss.txt
...
bad
stupid
...
...
2) Apply trim and a lowercase function to both the comment and the bad word.
Hope it works as expected.
function teststringforbadwords($comment)
{
$file="sss.txt";
$fopen = fopen($file, "r");
$fread = fread($fopen,filesize("$file"));
fclose($fopen);
foreach(explode("\n",$fread) as $word)
{
if(stristr(strtolower(trim($comment)),strtolower(trim($word))))
{
return false;
}
}
return true;
}
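The question also asks for a counter rather than a yes/no answer. A minimal sketch of that, assuming the word list has already been read into an array (the function name count_negative_words and the sample data are hypothetical):

```php
<?php
// Hypothetical helper: counts how many words of $comment appear in $negativeWords.
function count_negative_words(string $comment, array $negativeWords): int
{
    // Normalise the list: trim stray whitespace such as "\r", lowercase, drop blanks.
    $lookup = [];
    foreach ($negativeWords as $word) {
        $word = strtolower(trim($word));
        if ($word !== '') {
            $lookup[$word] = true;
        }
    }

    $count = 0;
    // Split on runs of non-word characters so punctuation does not stick to words.
    foreach (preg_split('/\W+/', strtolower($comment), -1, PREG_SPLIT_NO_EMPTY) as $word) {
        if (isset($lookup[$word])) {
            $count++;
        }
    }
    return $count;
}

// In practice the list could come from file('negative.txt', FILE_IGNORE_NEW_LINES).
$words = ["bad\r", "stupid", ""];
echo count_negative_words('You are STUPID, really stupid and bad!', $words), "\n"; // prints 3
```

Matching whole words through a lookup table avoids the substring problem of stristr, where a short entry would also match inside longer, harmless words.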

Php simple algorithm for autoloader

Here's my "simple" algorithm:
if the class is named like 'AaaBbbCccDddEeeFff' loop like this:
include/aaa/bbb/ccc/ddd/eee/fff.php
include/aaa/bbb/ccc/ddd/eee_fff.php
include/aaa/bbb/ccc/ddd_eee_fff.php
include/aaa/bbb/ccc_ddd_eee_fff.php
include/aaa/bbb_ccc_ddd_eee_fff.php
include/aaa_bbb_ccc_ddd_eee_fff.php
if still nothing found, try to look if those files exist:
include/aaa/bbb/ccc/ddd/eee/fff/base.php
include/aaa/bbb/ccc/ddd/eee/base.php
include/aaa/bbb/ccc/ddd/base.php
include/aaa/bbb/ccc/base.php
include/aaa/bbb/base.php
include/aaa/base.php
include/base.php
If still not found then error.
I'm looking for a fast and easy way to convert this:
'AaaBbbCccDddEeeFff'
to this:
include/aaa/bbb/ccc/ddd/eee/fff.php
and then an easy way to remove the last folder (I guess I should look at explode()).
Any idea how to do this? (I'm not asking for the whole code, I'm not lazy).
Since you specifically asked not to have the whole code, here is some code to get you started. This takes the input and divides it into chunks delineated by changes in case. The rest you can work out as an exercise.
<?php
$input = "AaaBbbCccDddEeeFff";
$str_so_far = "";
$last_was_upper = 0;
$chunks = array();
while($next_letter = substr($input,0,1)) {
$is_upper = (strtoupper($next_letter)==$next_letter);
if($str_so_far && $is_upper && !$last_was_upper) {
$chunks[] = $str_so_far;
$str_so_far = "";
}
if($str_so_far && !$is_upper && $last_was_upper) {
$chunks[] = $str_so_far;
$str_so_far = "";
}
$str_so_far .= $next_letter;
$input = substr($input,1);
$last_was_upper = $is_upper;
}
var_dump($chunks);
?>
I think a regular expression would work. Something like preg_match_all('/[A-Z][a-z][a-z]/', $string, $matches); might work - that would match a capital letter, followed by a lowercase letter, and another lowercase letter.
As the other answers are regex, here's a non-regex way for completeness:
function transform($str){
$arr = array();
$part = '';
for($i=0; $i<strlen($str); $i++){
$char = substr($str, $i, 1);
if(ctype_upper($char) && $i > 0){
$arr[] = $part;
$part = '';
}
$part .= $char;
}
$arr[] = $part;
return 'include/' . strtolower(implode('/', $arr)) . '.php';
}
echo transform('AaaBbbCccDddEeeFff');
// include/aaa/bbb/ccc/ddd/eee/fff.php
This builds an array of the folders, so you can manipulate it as needed, for example remove a folder by unsetting the desired index, before it gets imploded.
Here is the first part of your algorithm:
AaaBbbCccDddEeeFff -> include/aaa/bbb/ccc/ddd/eee/fff.php
include/aaa/bbb/ccc/ddd/eee_fff.php
include/aaa/bbb/ccc/ddd_eee_fff.php
include/aaa/bbb/ccc_ddd_eee_fff.php
include/aaa/bbb_ccc_ddd_eee_fff.php
include/aaa_bbb_ccc_ddd_eee_fff.php
I think you can do last part independently based on my answer.
<?php
function convertClassToPath($class) {
return strtolower(preg_replace('/([a-z])([A-Z])/', '$1' . DIRECTORY_SEPARATOR . '$2', $class)) . '.php';
}
function autoload($path) {
$base_dir = 'include' . DIRECTORY_SEPARATOR;
$real_path = $base_dir . $path;
var_dump('Checking: ' . $real_path);
if (file_exists($real_path) === true) {
var_dump('Status: Success');
include $real_path;
} else {
var_dump('Status: Fail');
$last_separator_pos = strrpos($path, DIRECTORY_SEPARATOR);
if ($last_separator_pos === false) {
return;
} else {
$path = substr_replace($path, '_', $last_separator_pos, 1);
autoload($path);
}
}
}
$class = 'AaaBbbCccDddEeeFff';
var_dump(autoload(convertClassToPath($class)));
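Putting both halves of the algorithm from the question together, the full lookup order can be generated up front as a list. This is a sketch only: the function name candidate_paths and the $baseDir parameter are hypothetical, and it always uses '/' as the separator.

```php
<?php
// Hypothetical helper: lists every path the described algorithm would try, in order.
function candidate_paths(string $class, string $baseDir = 'include'): array
{
    // Split before each capital letter: "AaaBbbCcc" -> ["aaa", "bbb", "ccc"].
    $parts = array_map('strtolower', preg_split('/(?=[A-Z])/', $class, -1, PREG_SPLIT_NO_EMPTY));
    $n = count($parts);
    $paths = [];

    // First pass: move the directory/file boundary one segment to the left each time.
    for ($k = $n - 1; $k >= 0; $k--) {
        $dirs = array_slice($parts, 0, $k);
        $file = implode('_', array_slice($parts, $k));
        $paths[] = implode('/', array_merge([$baseDir], $dirs, [$file])) . '.php';
    }

    // Second pass: look for base.php, dropping the last folder each time.
    for ($k = $n; $k >= 0; $k--) {
        $dirs = array_slice($parts, 0, $k);
        $paths[] = implode('/', array_merge([$baseDir], $dirs, ['base'])) . '.php';
    }

    return $paths;
}

print_r(candidate_paths('AaaBbbCcc'));
```

An autoloader could then walk this list and include the first entry for which file_exists() returns true, raising an error if none matches.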

PHP: Display the first 500 characters of HTML

I have a large chunk of HTML in a PHP variable, like:
$html_code = '<div class="contianer" style="text-align:center;">The Sameple text.</div><br><span>Another sample text.</span>....';
I want to display only the first 500 characters of this code. The character count must consider only the text inside the HTML tags and exclude the HTML tags and attributes themselves when measuring the length.
But trimming the code should not affect the DOM structure of the HTML.
Is there any tutorial or working example available?
If it's the text you want, you can also do this with the following:
substr(strip_tags($html_code),0,500);
Ooohh... I know this. I can't get it exactly off the top of my head, but you want to load the text you've got as a DOMDocument
http://www.php.net/manual/en/class.domdocument.php
then grab the text from the entire document node (as a DOMNode http://www.php.net/manual/en/class.domnode.php)
This won't be exactly right, but hopefully this will steer you onto the right track.
Try something like:
$html_code = '<div class="contianer" style="text-align:center;">The Sameple text.</div><br><span>Another sample text.</span>....';
$dom = new DOMDocument();
$dom->loadHTML($html_code);
$text_to_strip = $dom->textContent;
$stripped = mb_substr($text_to_strip,0,500);
echo "$stripped"; // The Sameple text.Another sample text.....
edit ok... that should work. just tested locally
edit2
Now that I understand you want to keep the tags but limit the text, let's see. You're going to want to loop over the content until you get to 500 characters. This is probably going to take a few edits and passes for me to get right, but hopefully I can help. (Sorry I can't give undivided attention.)
First case is when the text is less than 500 characters. Nothing to worry about. Starting with the above code we can do the following.
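if (mb_strlen($text_to_strip) > 500) { // test the full text; $stripped is already cut to 500
    // this is where we do our work.
    $characters_so_far = 0;
    // loadHTML() wraps the fragment in <html><body>, so walk the body's children
    $body = $dom->getElementsByTagName('body')->item(0);
    // copy the live node list first, because nodes are removed inside the loop
    foreach (iterator_to_array($body->childNodes) as $ChildNode) {
        // should check if $ChildNode->hasChildNodes();
        // probably put some of this stuff into a function
        $characters_in_next_node = mb_strlen($ChildNode->textContent);
        if ($characters_so_far + $characters_in_next_node > 500) {
            // remove the node that would take us past the limit
            $ChildNode->parentNode->removeChild($ChildNode);
        }
        $characters_so_far += $characters_in_next_node;
    }
    $final_out = $dom->saveHTML();
} else {
    $final_out = $html_code;
}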
i'm pasting below a php class i wrote a long time ago, but i know it works. its not exactly what you're after, as it deals with words instead of a character count, but i figure its pretty close and someone might find it useful.
class HtmlWordManipulator
{
var $stack = array();
function truncate($text, $num=50)
{
if (preg_match_all('/\s+/', $text, $junk) <= $num) return $text;
$text = preg_replace_callback('/(<\/?[^>]+\s+[^>]*>)/', array($this, '_truncateProtect'), $text);
$words = 0;
$out = array();
$text = str_replace('<',' <',str_replace('>','> ',$text));
$toks = preg_split('/\s+/', $text);
foreach ($toks as $tok)
{
if (preg_match_all('/<(\/?[^\x01>]+)([^>]*)>/',$tok,$matches,PREG_SET_ORDER))
foreach ($matches as $tag) $this->_recordTag($tag[1], $tag[2]);
$out[] = trim($tok);
if (! preg_match('/^(<[^>]+>)+$/', $tok))
{
if (!strpos($tok,'=') && !strpos($tok,'<') && strlen(trim(strip_tags($tok))) > 0)
{
++$words;
}
else
{
/*
echo '<hr />';
echo htmlentities('failed: '.$tok).'<br /)>';
echo htmlentities('has equals: '.strpos($tok,'=')).'<br />';
echo htmlentities('has greater than: '.strpos($tok,'<')).'<br />';
echo htmlentities('strip tags: '.strip_tags($tok)).'<br />';
echo str_word_count($text);
*/
}
}
if ($words > $num) break;
}
$truncate = $this->_truncateRestore(implode(' ', $out));
return $truncate;
}
function restoreTags($text)
{
foreach ($this->stack as $tag) $text .= "</$tag>";
return $text;
}
private function _truncateProtect($match)
{
return preg_replace('/\s/', "\x01", $match[0]);
}
private function _truncateRestore($strings)
{
return preg_replace('/\x01/', ' ', $strings);
}
private function _recordTag($tag, $args)
{
// XHTML
if (strlen($args) and $args[strlen($args) - 1] == '/') return;
else if ($tag[0] == '/')
{
$tag = substr($tag, 1);
for ($i=count($this->stack) -1; $i >= 0; $i--) {
if ($this->stack[$i] == $tag) {
array_splice($this->stack, $i, 1);
return;
}
}
return;
}
else if (in_array($tag, array('p', 'li', 'ul', 'ol', 'div', 'span', 'a')))
$this->stack[] = $tag;
else return;
}
}
truncate is what you want, and you pass it the html and the number of words you want it trimmed down to. it ignores html while counting words, but then rewraps everything in html, even closing trailing tags due to the truncation.
please don't judge me on the complete lack of oop principles. i was young and stupid.
edit:
so it turns out the usage is more like this:
$content = $manipulator->restoreTags($manipulator->truncate($myHtml,$numOfWords));
stupid design decision. allowed me to inject html inside the unclosed tags though.
I'm not up to coding a real solution, but if someone wants to, here's what I'd do (in pseudo-PHP):
$html_code = '<div class="contianer" style="text-align:center;">The Sameple text.</div><br><span>Another sample text.</span>....';
$aggregate = '';
$document = XMLParser($html_code);
foreach ($document->getElementsByTagName('*') as $element) {
$aggregate .= $element->text(); // This is the text, not HTML. It doesn't
// include the children, only the text
// directly in the tag.
}
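For a character-based cut that keeps the markup balanced, one possible sketch without a DOM parser is below. It is a simplification under stated assumptions, not a general HTML parser: the function name truncate_html is hypothetical, the input is assumed to be well formed, only a handful of void elements are recognised, and length is counted in bytes (mb_strlen/mb_substr would be needed for multibyte text).

```php
<?php
// Hypothetical helper: keeps the first $limit text characters and closes open tags.
function truncate_html(string $html, int $limit): string
{
    $void = ['br', 'hr', 'img', 'input', 'meta', 'link']; // elements with no closing tag
    // Split into tags and text runs, keeping the tags as separate pieces.
    $pieces = preg_split('/(<[^>]+>)/', $html, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
    $out = '';
    $open = [];   // stack of currently open tag names
    $count = 0;   // text characters emitted so far

    foreach ($pieces as $piece) {
        if ($piece[0] === '<') {
            if (preg_match('/^<\s*\/\s*([a-z0-9]+)/i', $piece)) {
                array_pop($open); // closing tag: assume it matches the innermost open tag
            } elseif (preg_match('/^<\s*([a-z0-9]+)/i', $piece, $m)
                && !in_array(strtolower($m[1]), $void, true)
                && substr($piece, -2) !== '/>') {
                $open[] = $m[1];  // opening tag that will need closing later
            }
            $out .= $piece;       // tags never count towards the limit
            continue;
        }
        $remaining = $limit - $count;
        if (strlen($piece) >= $remaining) {
            // The limit falls inside (or at the end of) this text run: cut and close.
            $out .= substr($piece, 0, $remaining);
            while ($open) {
                $out .= '</' . array_pop($open) . '>';
            }
            return $out;
        }
        $out .= $piece;
        $count += strlen($piece);
    }
    return $out; // the whole document was shorter than the limit
}

$html = '<div class="c">The sample text.</div><br><span>Another sample text.</span>';
echo truncate_html($html, 20), "\n";
```

With a limit of 20, the first text run (16 characters) is kept whole and the second is cut after 4, with the span closed again.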

PHP Remove URL from string

If I have a string that contains a URL (for example's sake, we'll call it $url) such as:
$url = "Here is a funny site http://www.tunyurl.com/34934";
How do I remove the URL from the string?
The difficulty is that URLs might also show up without the http://, such as:
$url = "Here is another funny site www.tinyurl.com/55555";
There is no HTML present. How would I search for http or www and, if found, remove the text/numbers/symbols up to the first space?
I re-read the question, here is a function that would work as intended:
function cleaner($url) {
$U = explode(' ',$url);
$W =array();
foreach ($U as $k => $u) {
if (stristr($u,'http') || (count(explode('.',$u)) > 1)) {
unset($U[$k]);
return cleaner( implode(' ',$U));
}
}
return implode(' ',$U);
}
$url = "Here is another funny site www.tinyurl.com/55555 and http://www.tinyurl.com/55555 and img.hostingsite.com/badpic.jpg";
echo "Cleaned: " . cleaner($url);
Edit #2/#3 (I must be bored). Here is a version that verifies there is a TLD within the URL:
function containsTLD($string) {
preg_match(
"/(AC($|\/)|\.AD($|\/)|\.AE($|\/)|\.AERO($|\/)|\.AF($|\/)|\.AG($|\/)|\.AI($|\/)|\.AL($|\/)|\.AM($|\/)|\.AN($|\/)|\.AO($|\/)|\.AQ($|\/)|\.AR($|\/)|\.ARPA($|\/)|\.AS($|\/)|\.ASIA($|\/)|\.AT($|\/)|\.AU($|\/)|\.AW($|\/)|\.AX($|\/)|\.AZ($|\/)|\.BA($|\/)|\.BB($|\/)|\.BD($|\/)|\.BE($|\/)|\.BF($|\/)|\.BG($|\/)|\.BH($|\/)|\.BI($|\/)|\.BIZ($|\/)|\.BJ($|\/)|\.BM($|\/)|\.BN($|\/)|\.BO($|\/)|\.BR($|\/)|\.BS($|\/)|\.BT($|\/)|\.BV($|\/)|\.BW($|\/)|\.BY($|\/)|\.BZ($|\/)|\.CA($|\/)|\.CAT($|\/)|\.CC($|\/)|\.CD($|\/)|\.CF($|\/)|\.CG($|\/)|\.CH($|\/)|\.CI($|\/)|\.CK($|\/)|\.CL($|\/)|\.CM($|\/)|\.CN($|\/)|\.CO($|\/)|\.COM($|\/)|\.COOP($|\/)|\.CR($|\/)|\.CU($|\/)|\.CV($|\/)|\.CX($|\/)|\.CY($|\/)|\.CZ($|\/)|\.DE($|\/)|\.DJ($|\/)|\.DK($|\/)|\.DM($|\/)|\.DO($|\/)|\.DZ($|\/)|\.EC($|\/)|\.EDU($|\/)|\.EE($|\/)|\.EG($|\/)|\.ER($|\/)|\.ES($|\/)|\.ET($|\/)|\.EU($|\/)|\.FI($|\/)|\.FJ($|\/)|\.FK($|\/)|\.FM($|\/)|\.FO($|\/)|\.FR($|\/)|\.GA($|\/)|\.GB($|\/)|\.GD($|\/)|\.GE($|\/)|\.GF($|\/)|\.GG($|\/)|\.GH($|\/)|\.GI($|\/)|\.GL($|\/)|\.GM($|\/)|\.GN($|\/)|\.GOV($|\/)|\.GP($|\/)|\.GQ($|\/)|\.GR($|\/)|\.GS($|\/)|\.GT($|\/)|\.GU($|\/)|\.GW($|\/)|\.GY($|\/)|\.HK($|\/)|\.HM($|\/)|\.HN($|\/)|\.HR($|\/)|\.HT($|\/)|\.HU($|\/)|\.ID($|\/)|\.IE($|\/)|\.IL($|\/)|\.IM($|\/)|\.IN($|\/)|\.INFO($|\/)|\.INT($|\/)|\.IO($|\/)|\.IQ($|\/)|\.IR($|\/)|\.IS($|\/)|\.IT($|\/)|\.JE($|\/)|\.JM($|\/)|\.JO($|\/)|\.JOBS($|\/)|\.JP($|\/)|\.KE($|\/)|\.KG($|\/)|\.KH($|\/)|\.KI($|\/)|\.KM($|\/)|\.KN($|\/)|\.KP($|\/)|\.KR($|\/)|\.KW($|\/)|\.KY($|\/)|\.KZ($|\/)|\.LA($|\/)|\.LB($|\/)|\.LC($|\/)|\.LI($|\/)|\.LK($|\/)|\.LR($|\/)|\.LS($|\/)|\.LT($|\/)|\.LU($|\/)|\.LV($|\/)|\.LY($|\/)|\.MA($|\/)|\.MC($|\/)|\.MD($|\/)|\.ME($|\/)|\.MG($|\/)|\.MH($|\/)|\.MIL($|\/)|\.MK($|\/)|\.ML($|\/)|\.MM($|\/)|\.MN($|\/)|\.MO($|\/)|\.MOBI($|\/)|\.MP($|\/)|\.MQ($|\/)|\.MR($|\/)|\.MS($|\/)|\.MT($|\/)|\.MU($|\/)|\.MUSEUM($|\/)|\.MV($|\/)|\.MW($|\/)|\.MX($|\/)|\.MY($|\/)|\.MZ($|\/)|\.NA($|\/)|\.NAME($|\/)|\.NC($|\/)|\.NE($|\/)|\.NET($|\/)|\.NF($|\/)|\.NG($|\/)|\.
NI($|\/)|\.NL($|\/)|\.NO($|\/)|\.NP($|\/)|\.NR($|\/)|\.NU($|\/)|\.NZ($|\/)|\.OM($|\/)|\.ORG($|\/)|\.PA($|\/)|\.PE($|\/)|\.PF($|\/)|\.PG($|\/)|\.PH($|\/)|\.PK($|\/)|\.PL($|\/)|\.PM($|\/)|\.PN($|\/)|\.PR($|\/)|\.PRO($|\/)|\.PS($|\/)|\.PT($|\/)|\.PW($|\/)|\.PY($|\/)|\.QA($|\/)|\.RE($|\/)|\.RO($|\/)|\.RS($|\/)|\.RU($|\/)|\.RW($|\/)|\.SA($|\/)|\.SB($|\/)|\.SC($|\/)|\.SD($|\/)|\.SE($|\/)|\.SG($|\/)|\.SH($|\/)|\.SI($|\/)|\.SJ($|\/)|\.SK($|\/)|\.SL($|\/)|\.SM($|\/)|\.SN($|\/)|\.SO($|\/)|\.SR($|\/)|\.ST($|\/)|\.SU($|\/)|\.SV($|\/)|\.SY($|\/)|\.SZ($|\/)|\.TC($|\/)|\.TD($|\/)|\.TEL($|\/)|\.TF($|\/)|\.TG($|\/)|\.TH($|\/)|\.TJ($|\/)|\.TK($|\/)|\.TL($|\/)|\.TM($|\/)|\.TN($|\/)|\.TO($|\/)|\.TP($|\/)|\.TR($|\/)|\.TRAVEL($|\/)|\.TT($|\/)|\.TV($|\/)|\.TW($|\/)|\.TZ($|\/)|\.UA($|\/)|\.UG($|\/)|\.UK($|\/)|\.US($|\/)|\.UY($|\/)|\.UZ($|\/)|\.VA($|\/)|\.VC($|\/)|\.VE($|\/)|\.VG($|\/)|\.VI($|\/)|\.VN($|\/)|\.VU($|\/)|\.WF($|\/)|\.WS($|\/)|\.XN--0ZWM56D($|\/)|\.XN--11B5BS3A9AJ6G($|\/)|\.XN--80AKHBYKNJ4F($|\/)|\.XN--9T4B11YI5A($|\/)|\.XN--DEBA0AD($|\/)|\.XN--G6W251D($|\/)|\.XN--HGBK6AJ7F53BBA($|\/)|\.XN--HLCJ6AYA9ESC7A($|\/)|\.XN--JXALPDLP($|\/)|\.XN--KGBECHTV($|\/)|\.XN--ZCKZAH($|\/)|\.YE($|\/)|\.YT($|\/)|\.YU($|\/)|\.ZA($|\/)|\.ZM($|\/)|\.ZW)/i",
$string,
$M);
$has_tld = (count($M) > 0) ? true : false;
return $has_tld;
}
function cleaner($url) {
$U = explode(' ',$url);
$W =array();
foreach ($U as $k => $u) {
if (stristr($u,".")) { //only preg_match if there is a dot
if (containsTLD($u) === true) {
unset($U[$k]);
return cleaner( implode(' ',$U));
}
}
}
return implode(' ',$U);
}
$url = "Here is another funny site badurl.badone somesite.ca/worse.jpg but this badsite.com www.tinyurl.com/55555 and http://www.tinyurl.com/55555 and img.hostingsite.com/badpic.jpg";
echo "Cleaned: " . cleaner($url);
returns:
Cleaned: Here is another funny site badurl.badone but this and and
$string = preg_replace('/\b(https?|ftp|file):\/\/[-A-Z0-9+&@#\/%?=~_|$!:,.;]*[A-Z0-9+&@#\/%=~_|$]/i', '', $string);
Parsing text for URLs is hard and looking for pre-existing, heavily tested code that already does this for you would be better than writing your own code and missing edge cases. For example, I would take a look at the process in Django's urlize, which wraps URLs in anchors. You could port it over to PHP, and--instead of wrapping URLs in an anchor--just delete them from the text.
Thanks Mike,
I updated it a bit, since it returned a notice error:
'/\b(https?|ftp|file):\/\/[-A-Z0-9+&@#\/%?=~_|$!:,.;]*[A-Z0-9+&@#\/%=~_|$]/i'
$string = preg_replace('/\b(https?|ftp|file):\/\/[-A-Z0-9+&@#\/%?=~_|$!:,.;]*[A-Z0-9+&@#\/%=~_|$]/i', '', $string);
$url = "Here is a funny site http://www.tunyurl.com/34934";
$replace = 'http www .com .org .net';
$with = '';
$clean_url = clean($url,$replace,$with);
echo $clean_url;
function clean($url,$replace,$with) {
$replace = explode(" ",$replace);
$new_string = '';
$check = explode(" ",$url);
foreach($check AS $key => $value) {
foreach($replace AS $key2 => $value2 ) {
if (false !== strpos( strtolower($value), strtolower($value2) ) ) {
$value = $with;
break;
}
}
$new_string .= " ".$value;
}
return $new_string;
}
You would need to write a regular expression to extract out the urls.
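One possible expression, following the rule stated in the question (start at http:// or www and remove everything up to the next space), is sketched below. The function name strip_urls is hypothetical, and the pattern deliberately ignores bare domains such as img.hostingsite.com that have neither prefix.

```php
<?php
// Hypothetical helper: removes anything that starts with http://, https:// or www.
// and runs up to the next whitespace character, plus the whitespace before it.
function strip_urls(string $text): string
{
    return trim(preg_replace('~\s*\b(?:https?://|www\.)\S+~i', '', $text));
}

echo strip_urls('Here is a funny site http://www.tunyurl.com/34934'), "\n";
// prints "Here is a funny site"
echo strip_urls('Here is another funny site www.tinyurl.com/55555 and more'), "\n";
// prints "Here is another funny site and more"
```

Consuming the whitespace in front of the URL avoids leaving a double space where the URL used to be.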
