Windows-1251 file inside UTF-8 site? - php

Hello everyone, Masters of Web Development :)
I have a piece of PHP script that fetches the last 10 played songs from my Winamp. The script lives in a file (let's call it "lastplayed.php") that is included in my site with PHP's include inside a "div".
My site uses UTF-8 encoding. The problem is that some song titles are in Windows-1251 encoding, and on my site they are displayed like "������"...
Is there any known way to tell this div, with "lastplayed.php" included in it, to use Windows-1251 encoding?
Or any other suggestions?
P.S.: The file with the fetching script, a.k.a. "lastplayed.php", is saved as UTF-8, but if it is saved as ANSI the result is the same. I also tried putting a meta tag with windows-1251 inside the head tag, but again nothing changed.
P.P.S: Script that fetches the Winamp's data (lastplayed.php):
<?php
/******
* You may use and/or modify this script as long as you:
* 1. Keep my name & webpage mentioned
* 2. Don't use it for commercial purposes
*
* If you want to use this script without complying to the rules above, please contact me first at: marty@excudo.net
*
* Author: Martijn Korse
* Website: http://devshed.excudo.net
*
* Date: 08-05-2006
***/
/**
* version 2.0
*/
class Radio
{
var $fields = array();
var $fieldsDefaults = array("Server Status", "Stream Status", "Listener Peak", "Average Listen Time", "Stream Title", "Content Type", "Stream Genre", "Stream URL", "Current Song");
var $very_first_str;
var $domain, $port, $path;
var $errno, $errstr;
var $trackLists = array();
var $isShoutcast;
var $nonShoutcastData = array(
"Server Status" => "n/a",
"Stream Status" => "n/a",
"Listener Peak" => "n/a",
"Average Listen Time" => "n/a",
"Stream Title" => "n/a",
"Content Type" => "n/a",
"Stream Genre" => "n/a",
"Stream URL" => "n/a",
"Stream AIM" => "n/a",
"Stream IRC" => "n/a",
"Current Song" => "n/a"
);
var $altServer = False;
function Radio($url)
{
$parsed_url = parse_url($url);
$this->domain = isset($parsed_url['host']) ? $parsed_url['host'] : "";
$this->port = !isset($parsed_url['port']) || empty($parsed_url['port']) ? "80" : $parsed_url['port'];
$this->path = empty($parsed_url['path']) ? "/" : $parsed_url['path'];
if (empty($this->domain))
{
$this->domain = $this->path;
$this->path = "";
}
$this->setOffset("Current Stream Information");
$this->setFields(); // setting default fields
$this->setTableStart("<table border=0 cellpadding=2 cellspacing=2>");
$this->setTableEnd("</table>");
}
function setFields($array=False)
{
if (!$array)
$this->fields = $this->fieldsDefaults;
else
$this->fields = $array;
}
function setOffset($string)
{
$this->very_first_str = $string;
}
function setTableStart($string)
{
$this->tableStart = $string;
}
function setTableEnd($string)
{
$this->tableEnd = $string;
}
function getHTML($page=False)
{
if (!$page)
$page = $this->path;
$contents = "";
$domain = (substr($this->domain, 0, 7) == "http://") ? substr($this->domain, 7) : $this->domain;
if (@$fp = fsockopen($domain, $this->port, $this->errno, $this->errstr, 2))
{
fputs($fp, "GET ".$page." HTTP/1.1\r\n".
"User-Agent: Mozilla/4.0 (compatible; MSIE 5.5; Windows 98)\r\n".
"Accept: */*\r\n".
"Host: ".$domain."\r\n\r\n");
$c = 0;
while (!feof($fp) && $c <= 20)
{
$contents .= fgets($fp, 4096);
$c++;
}
fclose ($fp);
preg_match("/(Content-Type:)(.*)/i", $contents, $matches);
if (count($matches) > 0)
{
$contentType = trim($matches[2]);
if ($contentType == "text/html")
{
$this->isShoutcast = True;
return $contents;
}
else
{
$this->isShoutcast = False;
$htmlContent = substr($contents, 0, strpos($contents, "\r\n\r\n"));
$dataStr = str_replace("\r", "\n", str_replace("\r\n", "\n", $contents));
$lines = explode("\n", $dataStr);
foreach ($lines AS $line)
{
if ($dp = strpos($line, ":"))
{
$key = substr($line, 0, $dp);
$value = trim(substr($line, ($dp+1)));
if (preg_match("/genre/i", $key))
$this->nonShoutcastData['Stream Genre'] = $value;
if (preg_match("/name/i", $key))
$this->nonShoutcastData['Stream Title'] = $value;
if (preg_match("/url/i", $key))
$this->nonShoutcastData['Stream URL'] = $value;
if (preg_match("/content-type/i", $key))
$this->nonShoutcastData['Content Type'] = $value;
if (preg_match("/icy-br/i", $key))
$this->nonShoutcastData['Stream Status'] = "Stream is up at ".$value."kbps";
if (preg_match("/icy-notice2/i", $key))
{
$this->nonShoutcastData['Server Status'] = "This is <span style=\"color: red;\">not</span> a Shoutcast server!";
if (preg_match("/ultravox/i", $value))
$this->nonShoutcastData['Server Status'] .= " But an Ultravox Server";
$this->altServer = $value;
}
}
}
return nl2br($htmlContent);
}
}
else
return $contents;
}
else
{
return False;
}
}
function getServerInfo($display_array=null, $very_first_str=null)
{
if (!isset($display_array))
$display_array = $this->fields;
if (!isset($very_first_str))
$very_first_str = $this->very_first_str;
if ($html = $this->getHTML())
{
// parsing the contents
$data = array();
foreach ($display_array AS $key => $item)
{
if ($this->isShoutcast)
{
$very_first_pos = stripos($html, $very_first_str);
$first_pos = stripos($html, $item, $very_first_pos);
$line_start = strpos($html, "<td>", $first_pos);
$line_end = strpos($html, "</td>", $line_start) + 4;
$difference = $line_end - $line_start;
$line = substr($html, $line_start, $difference);
$data[$key] = strip_tags($line);
}
else
{
$data[$key] = $this->nonShoutcastData[$item];
}
}
return $data;
}
else
{
return $this->errstr." (".$this->errno.")";
}
}
function createHistoryArray($page)
{
if (!in_array($page, $this->trackLists))
{
$this->trackLists[] = $page;
if ($html = $this->getHTML($page))
{
$fromPos = stripos($html, $this->tableStart);
$toPos = stripos($html, $this->tableEnd, $fromPos);
$tableData = substr($html, $fromPos, ($toPos-$fromPos));
$lines = explode("</tr><tr>", $tableData);
$tracks = array();
$c = 0;
foreach ($lines AS $line)
{
$info = explode ("</td><td>", $line);
$time = trim(strip_tags($info[0]));
if (substr($time, 0, 9) != "Copyright" && !preg_match("/Tag Loomis, Tom Pepper and Justin Frankel/i", $info[1]))
{
$this->tracks[$c]['time'] = $time;
$this->tracks[$c++]['track'] = trim(strip_tags($info[1]));
}
}
if (count($this->tracks) > 0)
{
unset($this->tracks[0]);
if (isset($this->tracks[1]))
$this->tracks[1]['track'] = str_replace("Current Song", "", $this->tracks[1]['track']);
}
}
else
{
$this->tracks[0] = array("time"=>$this->errno, "track"=>$this->errstr);
}
}
}
function getHistoryArray($page="/played.html")
{
if (!in_array($page, $this->trackLists))
$this->createHistoryArray($page);
return $this->tracks;
}
function getHistoryTable($page="/played.html", $trackColText=False, $class=False)
{
$title_utf8 = mb_convert_encoding($trackArr ,"utf-8" ,"auto");
if (!in_array($page, $this->trackLists))
$this->createHistoryArray($page);
if ($trackColText)
$output .= "
<div class='lastplayed_top'></div>
<div".($class ? " class=\"".$class."\"" : "").">";
foreach ($this->tracks AS $title_utf8)
$output .= "<div style='padding:2px 0;'>".$title_utf8['track']."</div>";
$output .= "</div><div class='lastplayed_bottom'></div>
<div class='lastplayed_title'>".$trackColText."</div>
\n";
return $output;
}
}
// this is needed for those with a php version < 5
// the function is copied from the user comments @ php.net (http://nl3.php.net/stripos)
if (!function_exists("stripos"))
{
function stripos($haystack, $needle, $offset=0)
{
return strpos(strtoupper($haystack), strtoupper($needle), $offset);
}
}
?>
And the calling script outside the lastplayed.php:
include "lastplayed.php";
$radio = new Radio($ip.":".$port);
echo $radio->getHistoryTable("/played.html", "<b>Last played:</b>", "lastplayed_content");

If all of your source data is in windows-1251, you can use something like:
$title_utf8=mb_convert_encoding($title,"utf-8","Windows-1251");
and put that converted data in your HTML stream.
Since I'm only looking at docs, I'm not 100% sure that the source encoding alias is correct; you may want to try CP1251 if Windows-1251 doesn't work.
If your source data isn't reliably in 1251, you'll have to come up with a heuristic to guess, and use the same conversion method. mb_detect_encoding may help you.
You cannot change the encoding of just part of an HTML document, but you can certainly convert everything to UTF-8 easily enough.
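The heuristic mentioned above can be as simple as checking whether a title is already valid UTF-8 and converting from Windows-1251 only when it is not. This is a minimal sketch, assuming UTF-8 and Windows-1251 are the only two encodings in play; the function name to_utf8 is made up for illustration.

```php
<?php
// Hypothetical helper: returns $title as UTF-8.
// Assumption: anything that is not valid UTF-8 is Windows-1251.
function to_utf8($title)
{
    if (mb_check_encoding($title, 'UTF-8')) {
        return $title; // already valid UTF-8 (pure ASCII also lands here)
    }
    return mb_convert_encoding($title, 'UTF-8', 'Windows-1251');
}

// "Привет" encoded as Windows-1251 bytes
echo to_utf8("\xCF\xF0\xE8\xE2\xE5\xF2"), "\n";
```

A Windows-1251 string whose bytes happen to form valid UTF-8 would be left unconverted, so this is a guess that works for typical Cyrillic titles rather than a guarantee.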
The newer ID3 implementations have an encoding marker in their text frames:
$00 – ISO-8859-1 (Latin-1, a superset of ASCII)
$01 – UCS-2 in ID3v2.2 and ID3v2.3, UTF-16 encoded Unicode with BOM.
$02 – UTF-16BE encoded Unicode without BOM in ID3v2.4 only.
$03 – UTF-8 encoded Unicode in ID3v2.4 only.
Is it possible that your content is in UTF16?
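If the titles come straight from ID3v2 text frames, the marker byte listed above tells you which source encoding to convert from. The following is only a sketch: the function name id3_text_to_utf8 is hypothetical, it assumes the frame body is not empty, that a $01 frame really starts with a BOM, and it does not strip trailing null terminators.

```php
<?php
// Hypothetical helper: $frameBody is the raw body of an ID3v2 text frame,
// i.e. one encoding byte followed by the text bytes.
function id3_text_to_utf8($frameBody)
{
    $marker = ord($frameBody[0]);
    $text = substr($frameBody, 1);
    switch ($marker) {
        case 0: // ISO-8859-1
            return mb_convert_encoding($text, 'UTF-8', 'ISO-8859-1');
        case 1: // UTF-16 with BOM: the BOM gives the byte order
            $bom = substr($text, 0, 2);
            $text = substr($text, 2);
            $from = ($bom === "\xFF\xFE") ? 'UTF-16LE' : 'UTF-16BE';
            return mb_convert_encoding($text, 'UTF-8', $from);
        case 2: // UTF-16BE without BOM
            return mb_convert_encoding($text, 'UTF-8', 'UTF-16BE');
        default: // 3: already UTF-8
            return $text;
    }
}
```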
Based on the code you've posted, it's not clear how $trackArr is defined, as it's not referenced anywhere else. It looks like you have several problems.
$title_utf8 = mb_convert_encoding($trackArr ,"utf-8" ,"auto")
"auto" expands to a list of encodings that does not include Windows-1251, so I'm not sure why you've used it. You really should use "Windows-1251". I have tried using "Windows-1251,utf-16" on a Mac with PHP installed, but autodetection fails to find a suitable encoding for a relatively short string, so it looks like you're going to have to be the one to guess.
But that code doesn't look like it has any reason to exist anyway, as you overwrite the values with your iteration:
foreach ($this->tracks AS $title_utf8)
$output .= "<div style='padding:2px 0;'>".$title_utf8['track']."</div>";
In each iteration, the variable $title_utf8 is assigned to the current track. What you probably want is something more like:
foreach ($this->tracks AS $current_track)
$output .= "<div style='padding:2px 0;'>".mb_convert_encoding($current_track['track'], "utf-8", "Windows-1251")."</div>";
mb_convert_encoding takes a string as the first argument, not an array or object, so you need to apply this encoding on each string that is not utf-8.

Just to let you know that the latest version supports character encoding/decoding :-)

Related

How to fix url contains arabic characters

I want to get the URL of each file in a certain directory.
I tried string concatenation (like: domain.folder1.folder2.file.mp3), but some folders and files have Arabic characters in their names, which causes errors when the URL is used.
Example:
This is my code's output:
String A:
https://linkimage2url.com/apps/quran full/محمد صديق المنشاوي/تسجيلات الإذاعة المصرية/009 - At-Taubah (The Repentance) سورة التوبة.mp3
This URL does not work on some Android devices,
but the next one works on all devices.
String B:
https://linkimage2url.com/apps/quran%20full/%D9%85%D8%AD%D9%85%D8%AF%20%D8%B5%D8%AF%D9%8A%D9%82%20%D8%A7%D9%84%D9%85%D9%86%D8%B4%D8%A7%D9%88%D9%8A/%D8%AA%D8%B3%D8%AC%D9%8A%D9%84%D8%A7%D8%AA%20%D8%A7%D9%84%D8%A5%D8%B0%D8%A7%D8%B9%D8%A9%20%D8%A7%D9%84%D9%85%D8%B5%D8%B1%D9%8A%D8%A9/009%20-%20At-Taubah%20(The%20Repentance)%20%D8%B3%D9%88%D8%B1%D8%A9%20%D8%A7%D9%84%D8%AA%D9%88%D8%A8%D8%A9.mp3
Note: I got String B from Internet Download Manager, which converted it automatically when I tried to use String A.
My question is:
How do I convert String A to String B in PHP?
And is there a better way to use readdir and get the URL of each file?
My code is:
if(is_dir($parent)){
if($dh = opendir($parent)){
// strict comparison, so that a file named "0" does not end the loop
while(($file = readdir($dh)) !== false){
if($file == "." or $file == ".."){
//...
} else { //create object with two fields
$fileName = pathinfo($file)['filename'];
if(is_dir($parent."/".$file)){
$data[] = array('name'=> $fileName, 'subname'=> basename($path), 'url'=> $path."/".$file, "directory"=> true);
} else {
$res = "https://linkimage2url.com".$path."/".$file;
$data[] = array('name'=> $fileName, 'subname'=> basename($path), 'url'=> $res , "directory"=> false);
}
}
}
closedir($dh);
}
}
Try this one, I've used once for a legacy project:
function encode_fullurl($url) {
$output = '';
$valid = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_.~!*\'();:@&=+$,/?#[]%';
$length = strlen($url);
for ($i = 0; $i < $length; $i++) {
$character = $url[$i];
$output .= (strpos($valid, $character) === false ? rawurlencode($character) : $character);
}
return $output;
}
$url ="https://linkimage2url.com/apps/quran full/محمد صديق المنشاوي/تسجيلات الإذاعة المصرية/009 - At-Taubah (The Repentance) سورة التوبة.mp3";
echo encode_fullurl($url);
output:
https://linkimage2url.com/apps/quran%20full/%D9%85%D8%AD%D9%85%D8%AF%20%D8%B5%D8%AF%D9%8A%D9%82%20%D8%A7%D9%84%D9%85%D9%86%D8%B4%D8%A7%D9%88%D9%8A/%D8%AA%D8%B3%D8%AC%D9%8A%D9%84%D8%A7%D8%AA%20%D8%A7%D9%84%D8%A5%D8%B0%D8%A7%D8%B9%D8%A9%20%D8%A7%D9%84%D9%85%D8%B5%D8%B1%D9%8A%D8%A9/009%20-%20At-Taubah%20(The%20Repentance)%20%D8%B3%D9%88%D8%B1%D8%A9%20%D8%A7%D9%84%D8%AA%D9%88%D8%A8%D8%A9.mp3
It is not very performant, but it should do what you need.
This code worked for me.
Thanks, everyone.
$res1 = "https://linkimage2url.com" .$path."/".$file;
$query = flash_encode ($res1);
$url = htmlentities($query);
function flash_encode($string)
{
$string = rawurlencode($string);
$string = str_replace("%2F", "/", $string);
$string = str_replace("%3A", ":", $string);
return $string;
}
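Another way to get from String A to String B is to percent-encode each path segment on its own, so the "/" separators never need to be restored afterwards. This is a minimal sketch; encode_path is a hypothetical name, and it is meant for the path part only, not for the scheme and host.

```php
<?php
// Hypothetical helper: percent-encodes every segment of a path
// but keeps the "/" separators between segments.
function encode_path($path)
{
    return implode('/', array_map('rawurlencode', explode('/', $path)));
}

$path = '/apps/quran full/محمد';
echo 'https://linkimage2url.com' . encode_path($path), "\n";
```

Note that rawurlencode() also encodes parentheses as %28 and %29, so the result can differ slightly from String B while still being a valid URL.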

Does anyone know this "language"?

I need to parse this language in PHP, but I don't know what language it is or how to parse it.
Does anyone know what language it is?
And if it's not a language, can someone explain to me how to parse it?
Thank you very much.
include "folder/file1.conf"
include "folder/file2.conf"
auth-mocked {
welcome = "Welcome"
login = "Login to continue:"
placeholder = "login"
button = "Login"
error = "Error:"
}
auth {
sso {
validation {
expected-uuid = "You need an UID"
}
session-not-found = "session was not found"
}
}
header {
company-name = "Company name"
help-popup {
title = "Need help?"
paragraph = "If you have any issue, you can contact your dedicated interlocutor:"
}
language-popup {
title = "Change language"
}
language = "Change language"
profile = "My profile"
terms-of-use = "Terms of use"
ao-documents = "Documents"
logout = "Logout"
user = "User"
}
black-panel {
common {
form = "You are currently filling the form:"
btn-i-understand = "Ok, thanks"
btn-link-view = "View"
}
}
I have finally created my own parser to get the label for each key.
function parseFile($file){
$title = "";
$key = "";
$value = "";
$str = "";
$array = array();
$results = array();
$lines = file('./generated_json/'.$file);
foreach($lines as $line){
// strpos() returns 0 for a match at the start of the line, so compare with false
if(strpos($line, " {\n") !== false){
$title = str_replace(" {", "", $line);
array_push($array, $title);
$str = implode(".", $array);
}
if(strpos($line, "=") !== false){
// limit to 2 parts so that an "=" inside the value is kept
$keyEx = explode("=", $line, 2);
$key = $keyEx[0];
$value = $keyEx[1];
$parsed = $str.".".$key;
$parsed = preg_replace('/\s+/', '', $parsed);
$parsed = str_replace("=", "", $parsed);
array_push($results, $parsed." = ".$value);
}
if(strpos($line, "}\n") !== false){
array_pop($array);
$str = implode(".", $array);
}
}
return $results;
}
It may be a homemade file format, but here is a list of common file formats used for translation:
http://docs.translatehouse.org/projects/translate-toolkit/en/latest/formats/
If you don't find your file format in there, you could probably write a parser for it.
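If you do end up writing your own parser, a stack of section names is enough for the sample shown in the question. The sketch below works on a string instead of a file and returns an array keyed by dotted paths. The name flatten_conf is made up, and it assumes one statement per line, double-quoted values, and that include lines can be skipped.

```php
<?php
// Hypothetical helper: flattens nested "name { ... }" blocks into dotted keys.
function flatten_conf($text)
{
    $stack = array();   // names of the currently open sections
    $result = array();
    foreach (preg_split('/\R/', $text) as $line) {
        $line = trim($line);
        if ($line === '' || strpos($line, 'include ') === 0) {
            continue; // blank lines and includes are ignored in this sketch
        }
        if (substr($line, -1) === '{') {
            $stack[] = trim(substr($line, 0, -1)); // section opens
            continue;
        }
        if ($line === '}') {
            array_pop($stack); // section closes
            continue;
        }
        if (preg_match('/^([^=\s]+)\s*=\s*"(.*)"$/', $line, $m)) {
            $result[implode('.', array_merge($stack, array($m[1])))] = $m[2];
        }
    }
    return $result;
}

print_r(flatten_conf("auth {\n  sso {\n    session-not-found = \"session was not found\"\n  }\n}\n"));
```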

Check if the STRING contains terms (word / words) from a txt file - PHP

I would like to know whether a STRING contains a specific word or words from a text list (TXT file).
The function should go through the TXT file line by line and check whether the whole line appears in the STRING itself.
<?php
function hebstrrev($string, $revInt = false, $encoding = 'UTF-8'){
$mb_strrev = function($str) use ($encoding){return mb_convert_encoding(strrev(mb_convert_encoding($str, 'UTF-16BE', $encoding)), $encoding, 'UTF-16LE');};
if(!$revInt){
$s = '';
foreach(array_reverse(preg_split('/(?<=\D)(?=\d)|\d+\K/', $string)) as $val){
$s .= ctype_digit($val) ? $val : $mb_strrev($val);
}
return $s;
} else {
return $mb_strrev($string);
}
}
function is_rtl( $string ) {
$rtl_chars_pattern = '/[\x{0590}-\x{05ff}\x{0600}-\x{06ff}]/u';
return preg_match($rtl_chars_pattern, $string);
}
header('Content-Type: text/html; charset=UTF-8');
$productFile = file_get_contents('vlist.txt');
mb_convert_encoding($productFile, 'UTF-16LE', 'UTF-8');
$products = str_word_count($productFile, 1);
$text = 'השבוע נתניה - מגזין';
$text = iconv(mb_detect_encoding($text, mb_detect_order(), true), "UTF-8", $text);
$found = false;
foreach ($products as $product)
{
if(is_rtl($product)) $product = hebstrrev($product, true);
$product = iconv(mb_detect_encoding($product, mb_detect_order(), true), "UTF-8", $product);
if (mb_strpos($text,$product) !== false) {
$found = true;
break;
}
}
if ($found) {
echo 'the status contains a product';
}
else {
echo 'The status doesnt contain a product';
}
The problem is that the function checks the STRING word by word, instead of checking whether each whole line of the file appears in the STRING.
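One way to compare whole lines rather than single words is to read the file into an array of lines and search for each line in the text. This is a minimal sketch, assuming both the text and the file are UTF-8 and stored in the same (logical) character order, so no reversal is needed; contains_any_line is a hypothetical name.

```php
<?php
// Hypothetical helper: true if any non-empty line of $lines occurs in $text as a whole.
function contains_any_line($text, array $lines)
{
    foreach ($lines as $line) {
        $line = trim($line);
        if ($line !== '' && mb_strpos($text, $line, 0, 'UTF-8') !== false) {
            return true;
        }
    }
    return false;
}

// With the file from the question, the lines could be loaded like this:
// $lines = file('vlist.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
$lines = array('נתניה - מגזין', 'something else');
echo contains_any_line('השבוע נתניה - מגזין', $lines) ? 'found' : 'not found', "\n";
```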

How to add rel="nofollow" to links with preg_replace()

The function below is designed to apply rel="nofollow" attributes to all external links and no internal links unless the path matches a predefined root URL defined as $my_folder below.
So given the variables...
$my_folder = 'http://localhost/mytest/go/';
$blog_url = 'http://localhost/mytest';
And the content...
internal
internal cloaked link
external
The end result, after replacement should be...
internal
internal cloaked link
external
Notice that the first link is not altered, since it's an internal link.
The link on the second line is also an internal link, but since it matches our $my_folder string, it gets the nofollow too.
The third link is the easiest: since it does not match the blog_url, it's obviously an external link.
However, in the script below, ALL of my links are getting nofollow. How can I fix the script to do what I want?
function save_rseo_nofollow($content) {
$my_folder = $rseo['nofollow_folder'];
$blog_url = get_bloginfo('url');
preg_match_all('~<a.*>~isU',$content["post_content"],$matches);
for ( $i = 0; $i <= sizeof($matches[0]); $i++){
if ( !preg_match( '~nofollow~is',$matches[0][$i])
&& (preg_match('~' . $my_folder . '~', $matches[0][$i])
|| !preg_match( '~'.$blog_url.'~',$matches[0][$i]))){
$result = trim($matches[0][$i],">");
$result .= ' rel="nofollow">';
$content["post_content"] = str_replace($matches[0][$i], $result, $content["post_content"]);
}
}
return $content;
}
Here is the DOMDocument solution...
$str = 'internal
internal cloaked link
external
external
external
external
';
$dom = new DOMDocument();
$dom->preserveWhiteSpace = FALSE;
$dom->loadHTML($str);
$a = $dom->getElementsByTagName('a');
$host = strtok($_SERVER['HTTP_HOST'], ':');
foreach($a as $anchor) {
$href = $anchor->attributes->getNamedItem('href')->nodeValue;
if (preg_match('/^https?:\/\/' . preg_quote($host, '/') . '/', $href)) {
continue;
}
$noFollowRel = 'nofollow';
$oldRelAtt = $anchor->attributes->getNamedItem('rel');
if ($oldRelAtt == NULL) {
$newRel = $noFollowRel;
} else {
$oldRel = $oldRelAtt->nodeValue;
$oldRel = explode(' ', $oldRel);
if (in_array($noFollowRel, $oldRel)) {
continue;
}
$oldRel[] = $noFollowRel;
$newRel = implode(' ', $oldRel);
}
$newRelAtt = $dom->createAttribute('rel');
$noFollowNode = $dom->createTextNode($newRel);
$newRelAtt->appendChild($noFollowNode);
$anchor->appendChild($newRelAtt);
}
var_dump($dom->saveHTML());
Output
string(509) "<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body>
internal
internal cloaked link
external
external
external
external
</body></html>
"
Try to make it more readable first, and only afterwards make your if rules more complex:
function save_rseo_nofollow($content) {
$content["post_content"] =
preg_replace_callback('~<(a\s[^>]+)>~isU', "cb2", $content["post_content"]);
return $content;
}
function cb2($match) {
list($original, $tag) = $match; // regex match groups
$my_folder = "/hostgator"; // re-add quirky config here
$blog_url = "http://localhost/";
if (strpos($tag, "nofollow")) {
return $original;
}
elseif (strpos($tag, $blog_url) && (!$my_folder || !strpos($tag, $my_folder))) {
return $original;
}
else {
return "<$tag rel='nofollow'>";
}
}
Gives following output:
[post_content] =>
internal
<a href="http://localhost/mytest/go/hostgator" rel=nofollow>internal cloaked link</a>
<a href="http://cnn.com" rel=nofollow>external</a>
The problem in your original code might have been $rseo which wasn't declared anywhere.
Try this one (PHP 5.3+). It lets you:
skip a selected address
keep a manually set rel parameter
And the code:
function nofollow($html, $skip = null) {
return preg_replace_callback(
"#(<a[^>]+?)>#is", function ($mach) use ($skip) {
return (
!($skip && strpos($mach[1], $skip) !== false) &&
strpos($mach[1], 'rel=') === false
) ? $mach[1] . ' rel="nofollow">' : $mach[0];
},
$html
);
}
Examples:
echo nofollow('something');
// stays the same because it already contains a rel parameter
echo nofollow('something');
// adds the rel="nofollow" parameter to the anchor
echo nofollow('something', 'localhost');
// skips this link as an internal link
Using regular expressions to do this job properly would be quite complicated. It would be easier to use an actual parser, such as the one from the DOM extension. DOM isn't very beginner-friendly, so what you can do is load the HTML with DOM then run the modifications with SimpleXML. They're backed by the same library, so it's easy to use one with the other.
Here's what it can look like:
$my_folder = 'http://localhost/mytest/go/';
$blog_url = 'http://localhost/mytest';
$html = '<html><body>
internal
internal cloaked link
external
</body></html>';
$dom = new DOMDocument;
$dom->loadHTML($html);
$sxe = simplexml_import_dom($dom);
// grab all <a> nodes with an href attribute
foreach ($sxe->xpath('//a[@href]') as $a)
{
if (substr($a['href'], 0, strlen($blog_url)) === $blog_url
&& substr($a['href'], 0, strlen($my_folder)) !== $my_folder)
{
// skip all links that start with the URL in $blog_url, as long as they
// don't start with the URL from $my_folder;
continue;
}
if (empty($a['rel']))
{
$a['rel'] = 'nofollow';
}
else
{
$a['rel'] .= ' nofollow';
}
}
$new_html = $dom->saveHTML();
echo $new_html;
As you can see, it's really short and simple. Depending on your needs, you may want to use preg_match() in place of the substr() comparisons, for example:
// change the regexp to your own rules, here we match everything under
// "http://localhost/mytest/" as long as it's not followed by "go"
if (preg_match('#^http://localhost/mytest/(?!go)#', $a['href']))
{
continue;
}
Note
I missed the last code block in the OP when I first read the question. The code I posted (and basically any solution based on DOM) is better suited at processing a whole page rather than a HTML block. Otherwise, DOM will attempt to "fix" your HTML and may add a <body> tag, a DOCTYPE, etc...
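For an HTML block rather than a whole page, the extra tags can usually be avoided by wrapping the fragment in a single element and asking libxml not to add the implied html/body elements or a DOCTYPE. This is a sketch under assumptions: nofollow_fragment is a made-up name, the input is ASCII (other encodings need the usual DOM care), and the installed libxml honours the LIBXML_HTML_NOIMPLIED and LIBXML_HTML_NODEFDTD flags.

```php
<?php
// Hypothetical helper: adds rel="nofollow" to links in an HTML fragment.
// Links under $blogUrl are skipped unless they are also under $myFolder.
function nofollow_fragment($html, $blogUrl, $myFolder)
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true); // keep parser warnings out of the output
    // The wrapper <div> gives the fragment a single root element.
    $dom->loadHTML('<div>' . $html . '</div>', LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
    libxml_clear_errors();
    foreach ($dom->getElementsByTagName('a') as $a) {
        $href = $a->getAttribute('href');
        if (strpos($href, $blogUrl) === 0 && strpos($href, $myFolder) !== 0) {
            continue; // plain internal link
        }
        $rel = preg_split('/\s+/', $a->getAttribute('rel'), -1, PREG_SPLIT_NO_EMPTY);
        if (!in_array('nofollow', $rel, true)) {
            $rel[] = 'nofollow';
            $a->setAttribute('rel', implode(' ', $rel));
        }
    }
    // Serialize only the children of the wrapper <div>.
    $out = '';
    foreach ($dom->documentElement->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }
    return $out;
}
```

Existing rel values are kept, and nofollow is appended only when it is not already present.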
Thanks @alex for your nice solution. However, I had a problem with Japanese text, which I fixed as follows. This code can also skip multiple domains via the $whiteList array.
public function addRelNoFollow($html, $whiteList = [])
{
$dom = new \DOMDocument();
$dom->preserveWhiteSpace = false;
$dom->loadHTML(mb_convert_encoding($html, 'HTML-ENTITIES', 'UTF-8'));
$a = $dom->getElementsByTagName('a');
/** @var \DOMElement $anchor */
foreach ($a as $anchor) {
$href = $anchor->attributes->getNamedItem('href')->nodeValue;
$domain = parse_url($href, PHP_URL_HOST);
// Skip whiteList domains
if (in_array($domain, $whiteList, true)) {
continue;
}
// Check & get existing rel attribute values
$noFollow = 'nofollow';
$rel = $anchor->attributes->getNamedItem('rel');
if ($rel) {
$values = explode(' ', $rel->nodeValue);
if (in_array($noFollow, $values, true)) {
continue;
}
$values[] = $noFollow;
$newValue = implode(' ', $values);
} else {
$newValue = $noFollow;
}
// Create new rel attribute
$rel = $dom->createAttribute('rel');
$node = $dom->createTextNode($newValue);
$rel->appendChild($node);
$anchor->appendChild($rel);
}
// There is a problem with saveHTML() and saveXML(), both of them do not work correctly in Unix.
// They do not save UTF-8 characters correctly when used in Unix, but they work in Windows.
// So we need to do as follows. @see https://stackoverflow.com/a/20675396/1710782
return $dom->saveHTML($dom->documentElement);
}
<?
$str='internal
internal cloaked link
external';
function test($x){
if (preg_match('#localhost/mytest/(?!go/)#i',$x[0])>0) return $x[0];
return 'rel="nofollow" '.$x[0];
}
echo preg_replace_callback('/href=[\'"][^\'"]+/i', 'test', $str);
?>
Here is another solution, which has a whitelist option and adds a target="_blank" attribute.
It also checks whether a rel attribute already exists before adding a new one.
function Add_Nofollow_Attr($Content, $Whitelist = [], $Add_Target_Blank = true)
{
$Whitelist[] = $_SERVER['HTTP_HOST'];
foreach ($Whitelist as $Key => $Link)
{
$Host = preg_replace('#^https?://#', '', $Link);
$Host = "https?://". preg_quote($Host, '/');
$Whitelist[$Key] = $Host;
}
if(preg_match_all("/<a .*?>/", $Content, $matches, PREG_SET_ORDER))
{
foreach ($matches as $Anchor_Tag)
{
$IS_Rel_Exist = $IS_Follow_Exist = $IS_Target_Blank_Exist = $Is_Valid_Tag = false;
if(preg_match_all("/(\w+)\s*=\s*['|\"](.*?)['|\"]/",$Anchor_Tag[0],$All_matches2))
{
foreach ($All_matches2[1] as $Key => $Attr_Name)
{
if($Attr_Name == 'href')
{
$Is_Valid_Tag = true;
$Url = $All_matches2[2][$Key];
// bypass #.. or internal links like "/"
if(preg_match('/^\s*[#|\/].*/', $Url))
{
continue 2;
}
foreach ($Whitelist as $Link)
{
if (preg_match("#$Link#", $Url)) {
continue 3;
}
}
}
else if($Attr_Name == 'rel')
{
$IS_Rel_Exist = true;
$Rel = $All_matches2[2][$Key];
preg_match("/[n|d]ofollow/", $Rel, $match, PREG_OFFSET_CAPTURE);
if( count($match) > 0 )
{
$IS_Follow_Exist = true;
}
else
{
$New_Rel = 'rel="'. $Rel . ' nofollow"';
}
}
else if($Attr_Name == 'target')
{
$IS_Target_Blank_Exist = true;
}
}
}
$New_Anchor_Tag = $Anchor_Tag;
if(!$IS_Rel_Exist)
{
$New_Anchor_Tag = str_replace(">",' rel="nofollow">',$Anchor_Tag);
}
else if(!$IS_Follow_Exist)
{
$New_Anchor_Tag = preg_replace("/rel=[\"|'].*?[\"|']/",$New_Rel,$Anchor_Tag);
}
if($Add_Target_Blank && !$IS_Target_Blank_Exist)
{
$New_Anchor_Tag = str_replace(">",' target="_blank">',$New_Anchor_Tag);
}
$Content = str_replace($Anchor_Tag,$New_Anchor_Tag,$Content);
}
}
return $Content;
}
To use it:
$Page_Content = 'internal
internal
google
example
stackoverflow';
$Whitelist = ["http://yoursite.com","http://localhost"];
echo Add_Nofollow_Attr($Page_Content,$Whitelist,true);
WordPress solution:
function replace__method($match) {
list($original, $tag) = $match; // regex match groups
$my_folder = "/articles"; // re-add quirky config here
$blog_url = 'https://'.$_SERVER['SERVER_NAME'];
if (strpos($tag, "nofollow")) {
return $original;
}
elseif (strpos($tag, $blog_url) && (!$my_folder || !strpos($tag, $my_folder))) {
return $original;
}
else {
return "<$tag rel='nofollow'>";
}
}
add_filter( 'the_content', 'add_nofollow_to_external_links', 1 );
function add_nofollow_to_external_links( $content ) {
$content = preg_replace_callback('~<(a\s[^>]+)>~isU', "replace__method", $content);
return $content;
}
A good script that adds nofollow automatically and keeps the other attributes:
function nofollow(string $html, ?string $baseUrl = null) {
return preg_replace_callback(
'#<a([^>]*)>(.+)</a>#isU', function ($mach) use ($baseUrl) {
list ($a, $attr, $text) = $mach;
if (preg_match('#href=["\']([^"\']*)["\']#', $attr, $url)) {
$url = $url[1];
if (is_null($baseUrl) || !str_starts_with($url, $baseUrl)) {
$relAttr = null; // the full rel="..." attribute, if one is present
$rel = ''; // the current value of the rel attribute
if (preg_match('#rel=["\']([^"\']*)["\']#', $attr, $m)) {
$relAttr = $m[0];
$rel = $m[1];
}
// strpos() can return 0, so compare against false explicitly
$rel = 'rel="' . ($rel !== '' ? (strpos($rel, 'nofollow') !== false ? $rel : $rel . ' nofollow') : 'nofollow') . '"';
$attr = isset($relAttr) ? str_replace($relAttr, $rel, $attr) : $attr . ' ' . $rel;
$a = '<a ' . $attr . '>' . $text . '</a>';
}
}
return $a;
},
$html
);
}

Best way to automatically remove comments from PHP code

What’s the best way to remove comments from a PHP file?
I want to do something similar to php_strip_whitespace(), but it shouldn't remove the line breaks as well.
For example,
I want this:
<?PHP
// something
if ($whatsit) {
do_something(); # we do something here
echo '<html>Some embedded HTML</html>';
}
/* another long
comment
*/
some_more_code();
?>
to become:
<?PHP
if ($whatsit) {
do_something();
echo '<html>Some embedded HTML</html>';
}
some_more_code();
?>
(Although if the empty lines remain where comments are removed, that wouldn't be OK.)
It may not be possible, because of the requirement to preserve embedded HTML - that’s what’s tripped up the things that have come up on Google.
I'd use tokenizer. Here's my solution. It should work on both PHP 4 and 5:
$fileStr = file_get_contents('path/to/file');
$newStr = '';
$commentTokens = array(T_COMMENT);
if (defined('T_DOC_COMMENT')) {
$commentTokens[] = T_DOC_COMMENT; // PHP 5
}
if (defined('T_ML_COMMENT')) {
$commentTokens[] = T_ML_COMMENT; // PHP 4
}
$tokens = token_get_all($fileStr);
foreach ($tokens as $token) {
if (is_array($token)) {
if (in_array($token[0], $commentTokens)) {
continue;
}
$token = $token[1];
}
$newStr .= $token;
}
echo $newStr;
Use php -w <sourcefile> to generate a file stripped of comments and whitespace, and then use a beautifier like PHP_Beautifier to reformat for readability.
$fileStr = file_get_contents('file.php');
foreach (token_get_all($fileStr) as $token ) {
if ($token[0] != T_COMMENT) {
continue;
}
$fileStr = str_replace($token[1], '', $fileStr);
}
echo $fileStr;
Here's the function posted above, modified to recursively remove all comments from all PHP files within a directory and all its subdirectories:
function rmcomments($id) {
if (file_exists($id)) {
if (is_dir($id)) {
$handle = opendir($id);
while(($file = readdir($handle)) !== false) {
if (($file != ".") && ($file != "..")) {
rmcomments($id . "/" . $file); }}
closedir($handle); }
else if ((is_file($id)) && (pathinfo($id, PATHINFO_EXTENSION) == "php")) {
if (!is_writable($id)) { chmod($id, 0777); }
if (is_writable($id)) {
$fileStr = file_get_contents($id);
$newStr = '';
$commentTokens = array(T_COMMENT);
if (defined('T_DOC_COMMENT')) { $commentTokens[] = T_DOC_COMMENT; }
if (defined('T_ML_COMMENT')) { $commentTokens[] = T_ML_COMMENT; }
$tokens = token_get_all($fileStr);
foreach ($tokens as $token) {
if (is_array($token)) {
if (in_array($token[0], $commentTokens)) { continue; }
$token = $token[1]; }
$newStr .= $token; }
if (!file_put_contents($id, $newStr)) {
$open = fopen($id, "w");
fwrite($open, $newStr);
fclose($open);
}
}
}
}
}
rmcomments("path/to/directory");
A more powerful version: remove all comments in the folder
<?php
$di = new RecursiveDirectoryIterator(__DIR__, RecursiveDirectoryIterator::SKIP_DOTS);
$it = new RecursiveIteratorIterator($di);
$fileArr = [];
foreach($it as $file) {
if(pathinfo($file, PATHINFO_EXTENSION) == "php") {
// SplFileInfo casts to its path name
$fileArr[] = (string) $file;
}
}
$arr = [T_COMMENT, T_DOC_COMMENT];
// process every collected file, including the first one
foreach($fileArr as $path) {
$fileStr = file_get_contents($path);
foreach(token_get_all($fileStr) as $token) {
// single-character tokens are plain strings, not arrays
if(is_array($token) && in_array($token[0], $arr)) {
$fileStr = str_replace($token[1], '', $fileStr);
}
}
file_put_contents($path, $fileStr);
}
/*
* T_ML_COMMENT does not exist in PHP 5.
* The following three lines define it in order to
* preserve backwards compatibility.
*
* The next two lines define the PHP 5 only T_DOC_COMMENT,
* which we will mask as T_ML_COMMENT for PHP 4.
*/
if (! defined('T_ML_COMMENT')) {
define('T_ML_COMMENT', T_COMMENT);
} else {
define('T_DOC_COMMENT', T_ML_COMMENT);
}
/*
* Remove all comment in $file
*/
function remove_comment($file) {
$comment_token = array(T_COMMENT, T_ML_COMMENT, T_DOC_COMMENT);
$input = file_get_contents($file);
$tokens = token_get_all($input);
$output = '';
foreach ($tokens as $token) {
if (is_string($token)) {
$output .= $token;
} else {
list($id, $text) = $token;
if (!in_array($id, $comment_token)) { // keep everything that is not a comment
$output .= $text;
}
}
}
file_put_contents($file, $output);
}
/*
* Glob recursive
* @return ['dir/filename', ...]
*/
function glob_recursive($pattern, $flags = 0) {
$file_list = glob($pattern, $flags);
$sub_dir = glob(dirname($pattern) . '/*', GLOB_ONLYDIR);
// If sub directory exist
if (count($sub_dir) > 0) {
$file_list = array_merge(
glob_recursive(dirname($pattern) . '/*/' . basename($pattern), $flags),
$file_list
);
}
return $file_list;
}
// Remove all comment of '*.php', include sub directory
foreach (glob_recursive('*.php') as $file) {
remove_comment($file);
}
If you already use an editor like UltraEdit, you can open one or multiple PHP file(s) and then use a simple Find&Replace (Ctrl + R) with the following Perl regular expression:
(?s)/\*.*?\*/
Beware: the regular expression above also removes comment-like text inside strings. For example, in echo "hello/*babe*/"; the /*babe*/ would be removed too. It is therefore only a reasonable solution if you have a few files to process. To be absolutely sure it does not replace something that is not a comment, run the Find&Replace command interactively and approve each replacement.
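To make the caveat concrete, here is a minimal sketch that compares a regex replacement with a tokenizer-based one on the same input. The sample source and the function name stripCommentsWithTokenizer are hypothetical, chosen only for illustration; token_get_all, T_COMMENT and T_DOC_COMMENT are PHP's own.

```php
<?php
// Hypothetical sample: a string literal that merely looks like it contains a comment.
$source = '<?php echo "hello/*babe*/"; /* a real comment */ echo 1;';

// Regex approach (a lazy form of the pattern): it cannot tell strings from comments.
$byRegex = preg_replace('#/\*.*?\*/#s', '', $source);

// Tokenizer approach: only tokens that PHP itself classifies as comments are dropped.
function stripCommentsWithTokenizer(string $code): string
{
    $out = '';
    foreach (token_get_all($code) as $token) {
        if (is_array($token)) {
            if (in_array($token[0], [T_COMMENT, T_DOC_COMMENT], true)) {
                continue; // skip the comment token
            }
            $out .= $token[1];
        } else {
            $out .= $token; // single-character tokens such as ";" arrive as plain strings
        }
    }
    return $out;
}

$byTokenizer = stripCommentsWithTokenizer($source);

echo $byRegex, PHP_EOL;     // the string literal has lost its /*babe*/ part
echo $byTokenizer, PHP_EOL; // the string literal is intact, the real comment is gone
```

The tokenizer version costs a few more lines, but it needs no manual approval step, because string contents never reach the comment check.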
Bash solution: If you want to remove comments recursively from all PHP files starting from the current directory, you can run this one-liner in the terminal. (It uses a temp1 file to store the PHP content during processing.)
Note that this strips whitespace along with the comments.
find . -type f -name '*.php' | while IFS= read -r VAR; do php -wq "$VAR" > temp1; cat temp1 > "$VAR"; done
Remove the temp1 file afterwards.
If PHP_Beautifier is installed, you can get nicely formatted code without comments with
find . -type f -name '*.php' | while IFS= read -r VAR; do php -wq "$VAR" > temp1; php_beautifier temp1 > temp2; cat temp2 > "$VAR"; done
Then remove the two temporary files (temp1 and temp2).
Following upon the accepted answer, I needed to preserve the line numbers of the file too, so here is a variation of the accepted answer:
/**
* Removes the php comments from the given valid php string, and returns the result.
*
* Note: a valid php string must start with <?php.
*
* If the preserveWhiteSpace option is true, it will replace the comments with some whitespaces, so that
* the line numbers are preserved.
*
*
* @param string $str
* @param bool $preserveWhiteSpace
* @return string
*/
function removePhpComments(string $str, bool $preserveWhiteSpace = true): string
{
$commentTokens = [
\T_COMMENT,
\T_DOC_COMMENT,
];
$tokens = token_get_all($str);
if (true === $preserveWhiteSpace) {
$lines = explode(PHP_EOL, $str);
}
$s = '';
foreach ($tokens as $token) {
if (is_array($token)) {
if (in_array($token[0], $commentTokens)) {
if (true === $preserveWhiteSpace) {
$comment = $token[1];
$lineNb = $token[2];
$firstLine = $lines[$lineNb - 1];
$p = explode(PHP_EOL, $comment);
$nbLineComments = count($p);
if ($nbLineComments < 1) {
$nbLineComments = 1;
}
$firstCommentLine = array_shift($p);
$isStandAlone = (trim($firstLine) === trim($firstCommentLine));
if (false === $isStandAlone) {
if (2 === $nbLineComments) {
$s .= PHP_EOL;
}
continue; // Just remove inline comments
}
// Stand-alone case
$s .= str_repeat(PHP_EOL, $nbLineComments - 1);
}
continue;
}
$token = $token[1];
}
$s .= $token;
}
return $s;
}
Note: this is for PHP 7+ (I didn't care about backward compatibility with older PHP versions).
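For readers who only need the line numbers to stay stable, the idea can be reduced to a much smaller sketch. This is not the function above but a simplified variant with a hypothetical name (blankOutComments): every comment is replaced by exactly the newlines it contained, and inline and stand-alone comments are treated the same way.

```php
<?php
// Simplified sketch: replace each comment with the newlines it contained,
// so that all code after it keeps its original line number.
function blankOutComments(string $code): string
{
    $out = '';
    foreach (token_get_all($code) as $token) {
        if (!is_array($token)) {
            $out .= $token; // single-character token such as ";" or "{"
            continue;
        }
        if ($token[0] === T_COMMENT || $token[0] === T_DOC_COMMENT) {
            // Keep only the line breaks of the comment, drop its text.
            $out .= str_repeat("\n", substr_count($token[1], "\n"));
            continue;
        }
        $out .= $token[1];
    }
    return $out;
}

// Hypothetical input: a doc block followed by code and a line comment.
$source = "<?php\n/**\n * Doc block\n */\nfunction f() {}\n// trailing note\necho 1;\n";
$result = blankOutComments($source);
$resultLines = explode("\n", $result);

echo $result;
```

Counting "\n" inside the comment text keeps the sketch independent of whether the tokenizer includes the trailing newline in a // comment token, which differs between PHP versions.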
For Ajax and JSON responses, I use the following PHP code to remove comments from HTML/JavaScript code and make the response smaller (about a 15% gain for my code).
// Collapse runs of whitespace into a single character (extra whitespace is ignored in HTML anyway)
$html = preg_replace('#(\s){2,}#', '\1', $html);
// Remove single and multiline comments, tabs and newline chars
$html = preg_replace(
'#(/\*([^*]|[\r\n]|(\*+([^*/]|[\r\n])))*\*+/)|((?<!:)//.*)|[\t\r\n]#i',
'',
$html
);
It is short and effective, but it can produce unexpected results if your code has bad syntax.
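As an illustration of such an unexpected result, the sketch below runs the same two preg_replace calls on a small JavaScript snippet. The snippet is a hypothetical input: the URL survives thanks to the (?<!:) lookbehind, but a string that happens to contain // is cut off, since the regex has no notion of string literals.

```php
<?php
// Hypothetical JavaScript input: one URL, one real comment, one string containing "//".
$html = "var url = \"http://example.com\"; // note\nvar path = \"a//b\";\n";

// Same two steps as in the answer above.
$html = preg_replace('#(\s){2,}#', '\1', $html);
$html = preg_replace(
    '#(/\*([^*]|[\r\n]|(\*+([^*/]|[\r\n])))*\*+/)|((?<!:)//.*)|[\t\r\n]#i',
    '',
    $html
);

// The URL is kept and the comment is removed, but everything after "a
// in the second string is gone, which leaves broken JavaScript.
echo $html, PHP_EOL;
```

So it is worth checking the input for // or /* inside string literals before relying on this approach.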
Run the command php --strip file.php in a command prompt (for example, cmd.exe), and then browse to WriteCodeOnline.
Here, file.php is your own file.
In 2019 it could work like this:
<?php
/* hi there !!!
here are the comments */
//another try
echo removecomments('index.php');
/* hi there !!!
here are the comments */
//another try
function removecomments($f){
$w=Array(';','{','}');
$ts = token_get_all(php_strip_whitespace($f));
$s='';
foreach($ts as $t){
if(is_array($t)){
$s .=$t[1];
}else{
$s .=$t;
if( in_array($t,$w) ) $s.=chr(13).chr(10);
}
}
return $s;
}
?>
To see the result, run it under a local server such as XAMPP. The page looks blank, but if you right-click and choose View Source, you see your PHP script: it loads itself and removes all comments as well as tabs.
I prefer this solution because I use it to speed up the single-file engine of my framework, "m.php". In my tests, the source produced by php_strip_whitespace alone, without this script putting the line breaks back, was the slowest: I ran 10 benchmarks and took the arithmetic mean. (My guess is that PHP 7 either restores the missing CR/LFs while parsing or takes longer when they are missing.)
Use php -w on the command line, or php_strip_whitespace($filename) from within PHP. Both are described in the PHP documentation.
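A minimal usage sketch for the function form, assuming a throwaway temporary file (php_strip_whitespace takes a file name, not a source string). The file contents are hypothetical.

```php
<?php
// Write a small sample script to a temporary file.
$tmp = tempnam(sys_get_temp_dir(), 'strip');
file_put_contents($tmp, "<?php\n// line comment\n/* block comment */\necho   'kept';\n");

// Returns the source with comments and redundant whitespace removed.
$stripped = php_strip_whitespace($tmp);
unlink($tmp); // clean up the temporary file

echo $stripped, PHP_EOL;
```

Unlike the earlier variations, this does not preserve line breaks, so it suits deployment builds better than code you still need to debug by line number.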
The catch is that a less robust matching algorithm (simple regex, for instance) will start stripping here when it clearly shouldn't:
if (preg_match('#^/*' . $this->index . '#', $this->permalink_structure)) {
It might not affect your code, but eventually someone will get bitten by your script. So you have to use a utility that understands more of the language than you might otherwise expect.
