Detect remote charset in php - php

I would like to determine a remote page's encoding through detection of the Content-Type header tag
<meta http-equiv="Content-Type" content="text/html; charset=XXXXX" />
if present.
I retrieve the remote page and try to do a regex to find the required setting if present.
I am still learning hence the problem below...
Here is what I have:
$EncStart = 'charset=';
$EncEnd = '" \/\>';
preg_match( "/$EncStart(.*)$EncEnd/s", $RemoteContent, $RemoteEncoding );
echo $RemoteEncoding[ 1 ];
The above does indeed echo the name of the encoding but it does not know where to stop so it prints out the rest of the line then most of the rest of the remote page in my test.
Example: When testing a remote Russian page it printed:
windows-1251" />
rest of page ....
Which means that $EncStart was okay, but the $EncEnd part of the regex failed to stop the matching. After the name of the encoding, this meta tag usually ends in one of three ways:
"> | "/> | " />
I do not know whether this can be used to mark the end of the match and, if so, how to escape it. I tried several ways of doing it but none worked.
Thank you in advance for lending a hand.

Add a question mark to your pattern to make it non-greedy (there is also no need for the 's' modifier). In the http-equiv form there is no quote directly after charset=, so the pattern only has to stop at the closing quote:
preg_match( "/charset=(.+?)\"/", $RemoteContent, $RemoteEncoding );
echo $RemoteEncoding[ 1 ];
Note that this won't handle charset = "..." or charset='...' and many other combinations.
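A slightly more tolerant pattern could look like the sketch below. It accepts optional whitespace around the equals sign, single or double quotes, and the HTML5 <meta charset="..."> form. The function name find_meta_charset is made up for illustration, and a regex will still miss unusual markup.

```php
<?php
// Hypothetical helper: returns the charset named in a <meta> tag, or null.
function find_meta_charset(string $html): ?string
{
    // Inside a <meta ...> tag, look for "charset", optional spaces around "=",
    // an optional opening quote, then the encoding name itself.
    $pattern = '/<meta[^>]*?charset\s*=\s*["\']?\s*([a-z0-9_:.-]+)/i';
    if (preg_match($pattern, $html, $m)) {
        return strtolower($m[1]);
    }
    return null;
}

echo find_meta_charset('<meta http-equiv="Content-Type" content="text/html; charset=windows-1251" />'), "\n";
```

The character class stops at the first quote, space, or slash, so the match cannot run on into the rest of the page.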

Take a look at Simple HTML Dom Parser. With it, you can easily find the charset from the head without resorting to cumbersome regexes. But as David already commented, you should also examine the headers for the same information and prioritize it if found.
Tested example:
require_once 'simple_html_dom.php';
$source = file_get_contents('http://www.google.com');
$dom = str_get_html($source);
$meta = $dom->find('meta[http-equiv=content-type]', 0);
$src_charset = substr($meta ->content, stripos($meta ->content, 'charset=') + 8);
foreach ($http_response_header as $header) {
@list($name, $value) = explode(':', $header, 2);
if (strtolower($name) == 'content-type') {
$hdr_charset = substr($value, stripos($value, 'charset=') + 8);
break;
}
}
var_dump(
$hdr_charset,
$src_charset
);
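The header lookup can also be written as a small function that is easy to test without a network call. This is only a sketch: find_header_charset is a hypothetical name, and the array it takes is assumed to have the same shape as $http_response_header (one raw header line per element).

```php
<?php
// Hypothetical helper: scan raw response header lines for a Content-Type charset.
function find_header_charset(array $headers): ?string
{
    foreach ($headers as $header) {
        // Skip the status line and anything else without a "Name: value" shape.
        if (strpos($header, ':') === false) {
            continue;
        }
        list($name, $value) = explode(':', $header, 2);
        if (strtolower(trim($name)) !== 'content-type') {
            continue;
        }
        // The charset parameter may be quoted and may be followed by other parameters.
        if (preg_match('/charset\s*=\s*"?([^";\s]+)/i', $value, $m)) {
            return strtolower($m[1]);
        }
    }
    return null;
}

$headers = ['HTTP/1.1 200 OK', 'Content-Type: text/html; charset=ISO-8859-1'];
echo find_header_charset($headers), "\n";
```

If this returns a value, it would normally take priority over whatever the meta tag says.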

Related

file_get_contents( - Fix relative urls

I am trying to display a website to a user, having downloaded it using php.
This is the script I am using:
<?php
$url = 'http://stackoverflow.com/pagecalledjohn.php';
//Download page
$site = file_get_contents($url);
//Fix relative URLs
$site = str_replace('src="','src="' . $url,$site);
$site = str_replace('url(','url(' . $url,$site);
//Display to user
echo $site;
?>
So far this script works a treat except for a few major problems with the str_replace function. The problem comes with relative URLs. Suppose our made-up pagecalledjohn.php uses an image of a cat. It is a PNG and, as I see it, it can be placed on the page using 6 different URLs:
1. src="//www.stackoverflow.com/cat.png"
2. src="http://www.stackoverflow.com/cat.png"
3. src="https://www.stackoverflow.com/cat.png"
4. src="somedirectory/cat.png"
4 is not applicable in this case but added anyway!
5. src="/cat.png"
6. src="cat.png"
Is there a way, using php, I can search for src=" and replace it with the url (filename removed) of the page being downloaded, but without sticking url in there if it is options 1,2 or 3 and change procedure slightly for 4,5 and 6?
Rather than trying to change every path reference in the source code, why don't you simply inject a <base> tag in your header to specifically indicate the base URL against which all relative URLs should be resolved?
https://developer.mozilla.org/en-US/docs/Web/HTML/Element/base
This can be achieved using your DOM manipulation tool of choice. The example below would show how to do this using DOMDocument and related classes.
$target_domain = 'http://stackoverflow.com/';
$url = $target_domain . 'pagecalledjohn.php';
//Download page
$site = file_get_contents($url);
// loadHTML() is an instance method; calling it statically fails on PHP 8
$dom = new DOMDocument();
if(@$dom->loadHTML($site) === false) {
// something went wrong in loading HTML to DOM Document
// provide error messaging and exit
}
// find <head> tag
$head_tag_list = $dom->getElementsByTagName('head');
// there should only be one <head> tag
if($head_tag_list->length !== 1) {
throw new Exception('Wow! The HTML is malformed without single head tag.');
}
$head_tag = $head_tag_list->item(0);
// find first child of head tag to later use in insertion
$head_has_children = $head_tag->hasChildNodes();
if($head_has_children) {
$head_tag_first_child = $head_tag->firstChild;
}
// create new <base> tag
$base_element = $dom->createElement('base');
$base_element->setAttribute('href', $target_domain);
// insert new base tag as first child to head tag
if($head_has_children) {
$base_node = $head_tag->insertBefore($base_element, $head_tag_first_child);
} else {
$base_node = $head_tag->appendChild($base_element);
}
echo $dom->saveHTML();
At the very minimum, if you truly want to modify all path references in the source code, I would HIGHLY recommend doing so with DOM manipulation tools (DOMDocument, DOMXPath, etc.) rather than regex. I think you will find it a much more stable solution.
I don't know if I get your question completely right, if you want to deal with all text-sequences enclosed in src=" and ", the following pattern could make it:
~(\ssrc=")([^"]+)(")~
It has three capturing groups of which the second one contains the data you're interested in. The first and last are useful to change the whole match.
Now you can replace all instances with a callback function that is changing the places. I've created a simple string with all the 6 cases you've got:
$site = <<<BUFFER
1. src="//www.stackoverflow.com/cat.png"
2. src="http://www.stackoverflow.com/cat.png"
3. src="https://www.stackoverflow.com/cat.png"
4. src="somedirectory/cat.png"
5. src="/cat.png"
6. src="cat.png"
BUFFER;
Let's ignore for a moment that there are no surrounding HTML tags, you're not parsing HTML anyway I'm sure as you haven't asked for a HTML parser but for a regular expression. In the following example, the match in the middle (the URL) will be enclosed so that it's clear it matched:
So now to replace each of the links let's start lightly by just highlighting them in the string.
$pattern = '~(\ssrc=")([^"]+)(")~';
echo preg_replace_callback($pattern, function ($matches) {
return $matches[1] . ">>>" . $matches[2] . "<<<" . $matches[3];
}, $site);
The output for the example given then is:
1. src=">>>//www.stackoverflow.com/cat.png<<<"
2. src=">>>http://www.stackoverflow.com/cat.png<<<"
3. src=">>>https://www.stackoverflow.com/cat.png<<<"
4. src=">>>somedirectory/cat.png<<<"
5. src=">>>/cat.png<<<"
6. src=">>>cat.png<<<"
As the way of replacing the string is to be changed, it can be extracted, so it is easier to change:
$callback = function($method) {
return function ($matches) use ($method) {
return $matches[1] . $method($matches[2]) . $matches[3];
};
};
This function creates the replace callback based on a method of replacing you pass as parameter.
Such a replacement function could be:
$highlight = function($string) {
return ">>>$string<<<";
};
And it's called like the following:
$pattern = '~(\ssrc=")([^"]+)(")~';
echo preg_replace_callback($pattern, $callback($highlight), $site);
The output remains the same, this was just to illustrate how the extraction worked:
1. src=">>>//www.stackoverflow.com/cat.png<<<"
2. src=">>>http://www.stackoverflow.com/cat.png<<<"
3. src=">>>https://www.stackoverflow.com/cat.png<<<"
4. src=">>>somedirectory/cat.png<<<"
5. src=">>>/cat.png<<<"
6. src=">>>cat.png<<<"
The benefit of this is that for the replacement function, you only need to deal with the URL match as single string, not with regular expression matches array for the different groups.
Now to your second half of your question: How to replace this with the specific URL handling like removing the filename. This can be done by parsing the URL itself and remove the filename (basename) from the path component. Thanks to the extraction, you can put this into a simple function:
$removeFilename = function ($url) {
$url = new Net_URL2($url);
$base = basename($path = $url->getPath());
$url->setPath(substr($path, 0, -strlen($base)));
return $url;
};
This code makes use of Pear's Net_URL2 URL component (also available via Packagist and Github, your OS packages might have it, too). It can parse and modify URLs easily, so is nice to have for the job.
So now the replacement done with the new URL filename replacement function:
$pattern = '~(\ssrc=")([^"]+)(")~';
echo preg_replace_callback($pattern, $callback($removeFilename), $site);
And the result then is:
1. src="//www.stackoverflow.com/"
2. src="http://www.stackoverflow.com/"
3. src="https://www.stackoverflow.com/"
4. src="somedirectory/"
5. src="/"
6. src=""
Please note that this is only an example. It shows how you can do it with regular expressions. You can, however, do it just as well with an HTML parser. Let's make this an actual HTML fragment:
1. <img src="//www.stackoverflow.com/cat.png"/>
2. <img src="http://www.stackoverflow.com/cat.png"/>
3. <img src="https://www.stackoverflow.com/cat.png"/>
4. <img src="somedirectory/cat.png"/>
5. <img src="/cat.png"/>
6. <img src="cat.png"/>
And then process all <img> "src" attributes with the created replacement filter function:
$doc = new DOMDocument();
$saved = libxml_use_internal_errors(true);
$doc->loadHTML($site, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
libxml_use_internal_errors($saved);
$srcs = (new DOMXPath($doc))->query('//img/@src') ?: [];
foreach ($srcs as $src) {
$src->nodeValue = $removeFilename($src->nodeValue);
}
echo $doc->saveHTML();
The result then again is:
1. <img src="//www.stackoverflow.com/cat.png">
2. <img src="http://www.stackoverflow.com/cat.png">
3. <img src="https://www.stackoverflow.com/cat.png">
4. <img src="somedirectory/cat.png">
5. <img src="/cat.png">
6. <img src="cat.png">
Just a different way of parsing has been used - the replacement still is the same. Just to offer two different ways that are also the same in part.
I suggest doing it in more steps.
In order to not complicate the solution, let's assume that any src value is always an image (it could as well be something else, e.g. a script).
Also, let's assume that there are no spaces, between equals sign and quotes (this can be fixed easily if there are). Finally, let's assume that the file name does not contain any escaped quotes (if it did, regexp would be more complicated).
So you'd use the following regexp to find all image references:
src="([^"]*)". (Also, this does not cover the case, where src is enclosed into single quotes. But it is easy to create a similar regexp for that.)
However, the processing logic could be done with preg_replace_callback function, instead of str_replace. You can provide a callback to this function, where each url can be processed, based on its contents.
So you could do something like this (not tested!):
$site = preg_replace_callback(
// PCRE patterns need delimiters; ~ avoids escaping the slashes
'~src="([^"]*)"~',
function ($src) {
$url = $src[1];
$ret = "";
if (preg_match('~^//~', $url)) {
// case 1.
$ret = 'src="' . $url . '"';
}
else if (preg_match('~^https?://~', $url)) {
// case 2. and 3.
$ret = 'src="' . $url . '"';
}
else {
// case 4., 5., 6. (leading slash dropped to avoid a double slash)
$ret = 'src="http://your.site.com/' . ltrim($url, '/') . '"';
}
return $ret;
},
$site
);
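The six cases from the question can also be handled by one small resolver that is then called from any of the callbacks above. This is a minimal sketch, not a full RFC 3986 implementation: absolutize_src is a hypothetical name, and it ignores ports, query strings, and "../" segments.

```php
<?php
// Hypothetical helper: turn a src value into an absolute URL relative to $pageUrl.
function absolutize_src(string $src, string $pageUrl): string
{
    $parts  = parse_url($pageUrl);
    $scheme = $parts['scheme'] ?? 'http';
    $host   = $parts['host'] ?? '';
    $origin = $scheme . '://' . $host;

    if (preg_match('~^https?://~i', $src)) {
        return $src;                   // cases 2 and 3: already absolute
    }
    if (strpos($src, '//') === 0) {
        return $scheme . ':' . $src;   // case 1: protocol-relative
    }
    if (strpos($src, '/') === 0) {
        return $origin . $src;         // case 5: relative to the site root
    }
    // cases 4 and 6: relative to the directory of the downloaded page
    $path = $parts['path'] ?? '/';
    $dir  = substr($path, 0, strrpos($path, '/') + 1);
    return $origin . $dir . $src;
}

echo absolutize_src('cat.png', 'http://stackoverflow.com/pagecalledjohn.php'), "\n";
```

Inside a preg_replace_callback callback it would be used as 'src="' . absolutize_src($src[1], $url) . '"'.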

Remove HTML Entity if Incomplete

I have an issue where I have displayed up to 400 characters of a string that is pulled from the database, however, this string is required to contain HTML Entities.
By chance, the client has created the string to have the 400th character to sit right in the middle of a closing P tag, thus killing the tag, resulting in other errors for code after it.
I would prefer this closing P tag to be removed entirely as I have a "...read more" link attached to the end which would look cleaner if attached to the existing paragraph.
What would be the best approach for this to cover all HTML Entity issues? Is there a PHP function that will automatically close off/remove any erroneous HTML tags? I don't need a coded answer, just a direction will help greatly.
Thanks.
Here's a simple way you can do it with DOMDocument; it's not perfect but it may be of interest:
<?php
function html_tidy($src){
libxml_use_internal_errors(true);
$x = new DOMDocument;
$x->loadHTML('<meta http-equiv="Content-Type" content="text/html;charset=utf-8" />'.$src);
$x->formatOutput = true;
$ret = preg_replace('~<(?:!DOCTYPE|/?(?:html|body|head))[^>]*>\s*~i', '', $x->saveHTML());
return trim(str_replace('<meta http-equiv="Content-Type" content="text/html;charset=utf-8">','',$ret));
}
$brokenHTML[] = "<p><span>This is some broken html</spa";
$brokenHTML[] = "<poken html</spa";
$brokenHTML[] = "<p><span>This is some broken html</spa</p>";
/*
<p><span>This is some broken html</span></p>
<poken html></poken>
<p><span>This is some broken html</span></p>
*/
foreach($brokenHTML as $test){
echo html_tidy($test);
}
?>
Though take note of Mike 'Pomax' Kamermans's comment.
Why not take the last word of the excerpt and remove it? Whether that word is complete or not, you remove it, and you can be sure the content stays clean. Here is an example of what the code would look like:
while ($row = $req->fetch(PDO::FETCH_OBJ)) {
//extract 400 first characters from the content you need to show
$extraction = substr($row->text, 0, 400);
// find the last space in this extraction
$last_space = strrpos($extraction, ' ');
//take content from the first character to the last space and add (...)
echo substr($extraction, 0, $last_space) . ' ...';
}
Just remove the last broken tag and then call strip_tags:
$str = "<p>this is how we do</p";
$str = substr($str, 0, strrpos($str, "<"));
$str = strip_tags($str);
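The two ideas above (strip the markup, then cut at a word boundary) can be combined so that neither a tag nor an entity is ever cut in half. This is a rough sketch that assumes the mbstring extension is available and that a plain-text excerpt is acceptable; the function name excerpt is made up for illustration.

```php
<?php
// Hypothetical helper: plain-text excerpt of at most $limit characters.
function excerpt(string $html, int $limit = 400): string
{
    // Drop the markup and decode entities first, so nothing can be cut in half.
    $text = trim(html_entity_decode(strip_tags($html), ENT_QUOTES, 'UTF-8'));
    if (mb_strlen($text, 'UTF-8') <= $limit) {
        return $text;
    }
    $cut = mb_substr($text, 0, $limit, 'UTF-8');
    // Back up to the last space so the excerpt ends on a whole word.
    $lastSpace = mb_strrpos($cut, ' ', 0, 'UTF-8');
    if ($lastSpace !== false) {
        $cut = mb_substr($cut, 0, $lastSpace, 'UTF-8');
    }
    return $cut . ' ...';
}

// Re-escape on output and wrap in a fresh paragraph.
echo '<p>' . htmlspecialchars(excerpt('<p>Hello <b>brave new</b> world</p>', 14), ENT_QUOTES, 'UTF-8') . '</p>';
```

The "...read more" link can then be appended inside the same paragraph.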

Detect Encoding and Convert Everything to UTF-8 with PHP

I want to extract various data from URLs and convert it to UTF-8 no matter what encoding the original page uses (or at least have it work for most source encodings).
So, after reading through many discussions and answers, I finally came up with the following code, in which I parse the HTML data twice (once to detect the encoding and a second time to get the actual data). This works at least on all the URLs I checked. But I think that the code is poorly written.
Can anyone let me know if there are any better alternatives to do the same or if I need any improvements on the code?
<?php
header('Content-Type: text/html; charset=utf-8');
require_once 'curl.php';
require_once 'curl_response.php';
$curl = new Curl;
$url = "http://" . $_GET['domain'];
$curl_response = $curl->get($url);
$header_content_type = $curl_response->headers['Content-Type'];
$dom_doc = new DOMDocument();
libxml_use_internal_errors(TRUE);
$dom_doc->loadHTML('<?xml encoding="utf-8" ?>' . $curl_response);
libxml_use_internal_errors(FALSE);
$metas = $dom_doc->getElementsByTagName('meta');
foreach ($metas as $meta) {
if (strtolower($meta->getAttribute('http-equiv')) == 'content-type') {
$meta_content_type = $meta->getAttribute('content');
}
if ($meta->getAttribute('charset') != '') {
$html5_charset = $meta->getAttribute('charset');
}
}
if (preg_match('/charset=(.+)/', $header_content_type, $m)) {
$charset = $m[1];
} elseif (preg_match('/charset=(.+)/', $meta_content_type, $m)) {
$charset = $m[1];
} elseif (!empty($html5_charset)) {
$charset = $html5_charset;
} elseif (preg_match('/encoding=(.+)/', $curl_response, $m)) {
$charset = $m[1];
} else {
// browser default charset
// $charset = 'ISO-8859-1';
}
if (!empty($charset) && $charset != "utf-8") {
$tmp = iconv($charset,'utf-8', $curl_response);
libxml_use_internal_errors(TRUE);
$dom_doc->loadHTML('<?xml encoding="utf-8" ?>' . $tmp);
libxml_use_internal_errors(FALSE);
}
$page_title = $dom_doc->getElementsByTagName('title')->item(0)->nodeValue;
$metas = $dom_doc->getElementsByTagName('meta');
foreach ($metas as $meta) {
if (strtolower($meta->getAttribute('name')) == 'description') {
$meta_description = $meta->getAttribute('content');
}
if (strtolower($meta->getAttribute('name')) == 'keywords') {
$meta_tags = $meta->getAttribute('content');
}
}
print $charset;
print "<hr>";
print $page_title;
print "<hr>";
print $meta_description;
print "<hr>";
print $meta_tags;
print "<hr>";
print "Memory Peak Usages: " . memory_get_peak_usage()/1024/1024 . " MB";
?>
Your question is too open-ended, and I've voted to close it. However, I will still provide a stub of an answer that will, hopefully, point you in the right direction.
At the moment, you are checking user-defined input for the charset. This is a very, very, very bad move, for various reasons:
Most webmasters on small sites will just header("Content-type: text/html; charset=utf-8") because they've heard it is good practice, without actually encoding their output as UTF-8. Not taking this into account will lead to mangled UTF-8 output.
Some webmasters do the opposite: they do not set a header, and their webserver outputs ISO-8859-1 headers despite an UTF-8 encoding. Visibly on a page, this does not matter - it matters for DOMDocument (I've had this issue recently)
iconv double utf-8 encoding is never fun.
I'd strongly advise using a utility to decode UTF-8 until there are no more entities within the UTF-8 extended range of characters and then encoding once rather than relying on iconv or multibyte encoding. The reason is simple: these can get it wrong. You can also set an error handler to parse DOMDocument errors in order to catch and redirect the loadXML "failed due to malformed XML" errors, which will not be related to your character encoding at all. Basically, the key to your problem is to not blindly do stuff.
If you'd like good targets where you need to worry about UTF-8, parse the home page of Google Play. They send out malformed replies (which is what initially forced me to go through the UTF-8-decode-until-nothing-is-in-the-range approach). It will also show you that DOMDocument can fail due to a wide variety of reasons - not just charset - and that you need to follow the errors to deal with them.
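One way to avoid converting blindly is to check whether the bytes are already valid UTF-8 before trusting the declared charset. The sketch below is not the decode-until-clean utility described above; it is a simpler check built on the mbstring extension. The name to_utf8 and the ISO-8859-1 fallback are assumptions, and some non-UTF-8 byte sequences will pass the check by accident.

```php
<?php
// Hypothetical helper: convert $bytes to UTF-8, treating $declared as a hint only.
function to_utf8(string $bytes, ?string $declared = null): string
{
    // Already valid UTF-8: leave it alone, whatever the header claims.
    if (mb_check_encoding($bytes, 'UTF-8')) {
        return $bytes;
    }
    // Otherwise use the declared charset if mbstring knows it by that name,
    // and fall back to ISO-8859-1 (an assumption, not a rule).
    $source = 'ISO-8859-1';
    if ($declared !== null) {
        foreach (mb_list_encodings() as $known) {
            if (strcasecmp($known, $declared) === 0) {
                $source = $known;
                break;
            }
        }
    }
    return mb_convert_encoding($bytes, 'UTF-8', $source);
}

// "Привет" encoded as Windows-1251
$cp1251 = "\xCF\xF0\xE8\xE2\xE5\xF2";
echo to_utf8($cp1251, 'windows-1251'), "\n";
```

Because valid UTF-8 is returned untouched, a page that is UTF-8 but mislabelled as ISO-8859-1 is not double-encoded.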
Other performance pointers outside of that big encoding snafu include:
Fragmenting your code into functions that return results. You've got a lot of repetition in there - learn to use functions to stop having to explicitly write the same core logic multiple times.
This:
if (preg_match('/charset=(.+)/', $header_content_type, $m)) {
$charset = $m[1];
} elseif (preg_match('/charset=(.+)/', $meta_content_type, $m)) {
is horrible. You can easily replace it with a strpos call, which will speed this particular set of ifs by about 5-10x.
* $metas = $dom_doc->getElementsByTagName('meta'); - you're aware that DOMDocument will go through your entire DOM when you use this method, right? Consider restricting the XPath query to just the head tag (which is always the first child of html, which is the document element. XPath: /html/head[1], since XPath positions start at 1)
In regard to performance you should be using unset(); when you're done with variables or values even if you're going to reset their values, but not if you need the value further down your script. PHP cannot reclaim memory and will reuse the preallocated memory released from the unset command for future use.
Another thing you could do is take huge chunks of that code and split it into functions that return resultant values. Remember that function variables and memory are automatically released after execution unless you're working with global variables.
Those will help performance and memory utilization.

strip_tags disallow some tags

Based on the strip_tags documentation, the second parameter takes the allowable tags. However, in my case, I want to do the reverse. Say I want to accept the tags that strip_tags would normally (by default) accept, but strip only the <script> tag. Is there any possible way to do this?
I don't mean somebody to code it for me, but rather an input of possible ways on how to achieve this (if possible) is greatly appreciated.
EDIT
To use the HTML Purifier HTML.ForbiddenElements config directive, it seems you would do something like:
require_once '/path/to/HTMLPurifier.auto.php';
$config = HTMLPurifier_Config::createDefault();
$config->set('HTML.ForbiddenElements', array('script','style','applet'));
$purifier = new HTMLPurifier($config);
$clean_html = $purifier->purify($dirty_html);
http://htmlpurifier.org/docs
HTML.ForbiddenElements should be set to an array. What I don't know is what form the array members should take:
array('script','style','applet')
Or:
array('<script>','<style>','<applet>')
Or... Something else?
I think it's the first form, without delimiters; HTML.AllowedElements uses a form of configuration string somewhat common to TinyMCE's valid elements syntax:
tinyMCE.init({
...
valid_elements : "a[href|target=_blank],strong/b,div[align],br",
...
});
So my guess is it's just the term, and no attributes should be provided (since you're banning the element... although there is a HTML.ForbiddenAttributes, too). But that's a guess.
I'll add this note from the HTML.ForbiddenAttributes docs, as well:
Warning: This directive complements %HTML.ForbiddenElements,
accordingly, check out that directive for a discussion of why you
should think twice before using this directive.
Blacklisting is just not as "robust" as whitelisting, but you may have your reasons. Just beware and be careful.
Without testing, I'm not sure what to tell you. I'll keep looking for an answer, but I will likely go to bed first. It is very late. :)
Although I think you really should use HTML Purifier and utilize its HTML.ForbiddenElements configuration directive, I think a reasonable alternative if you really, really want to use strip_tags() is to derive a whitelist from the blacklist. In other words, remove what you don't want and then use what's left.
For instance:
function blacklistElements($blacklisted = '', &$errors = array()) {
if ((string)$blacklisted == '') {
$errors[] = 'Empty string.';
return array();
}
$html5 = array(
"<menu>","<command>","<summary>","<details>","<meter>","<progress>",
"<output>","<keygen>","<textarea>","<option>","<optgroup>","<datalist>",
"<select>","<button>","<input>","<label>","<legend>","<fieldset>","<form>",
"<th>","<td>","<tr>","<tfoot>","<thead>","<tbody>","<col>","<colgroup>",
"<caption>","<table>","<math>","<svg>","<area>","<map>","<canvas>","<track>",
"<source>","<audio>","<video>","<param>","<object>","<embed>","<iframe>",
"<img>","<del>","<ins>","<wbr>","<br>","<span>","<bdo>","<bdi>","<rp>","<rt>",
"<ruby>","<mark>","<u>","<b>","<i>","<sup>","<sub>","<kbd>","<samp>","<var>",
"<code>","<time>","<data>","<abbr>","<dfn>","<q>","<cite>","<s>","<small>",
"<strong>","<em>","<a>","<div>","<figcaption>","<figure>","<dd>","<dt>",
"<dl>","<li>","<ul>","<ol>","<blockquote>","<pre>","<hr>","<p>","<address>",
"<footer>","<header>","<hgroup>","<aside>","<article>","<nav>","<section>",
"<body>","<noscript>","<script>","<style>","<meta>","<link>","<base>",
"<title>","<head>","<html>"
);
$list = trim(strtolower($blacklisted));
$list = preg_replace('/[^a-z ]/i', '', $list);
$list = '<' . str_replace(' ', '> <', $list) . '>';
$list = array_map('trim', explode(' ', $list));
return array_diff($html5, $list);
}
Then run it:
$blacklisted = '<html> <bogus> <EM> em li ol';
// $errors is filled by reference, so it has to be passed in
$errors = array();
$whitelist = blacklistElements($blacklisted, $errors);
if (count($errors)) {
echo "There were errors.\n";
print_r($errors);
echo "\n";
} else {
// Do strip_tags() ...
}
http://codepad.org/LV8ckRjd
So if you pass in what you don't want to allow, it will give you back the HTML5 element list in an array form that you can then feed into strip_tags() after joining it into a string:
$stripped = strip_tags($html, implode('', $whitelist));
Caveat Emptor
Now, I've kind of hacked this together and I know there are some issues I haven't thought out yet. For instance, from the strip_tags() man page for the $allowable_tags argument:
Note:
This parameter should not contain whitespace. strip_tags() sees a tag
as a case-insensitive string between < and the first whitespace or >.
It means that strip_tags("<br/>", "<br>") returns an empty string.
It's late and for some reason I can't quite figure out what this means for this approach. So I'll have to think about that tomorrow. I also compiled the HTML element list in the function's $html5 array from this MDN documentation page. Sharp-eyed readers might notice all of the tags are in this form:
<tagName>
I'm not sure how this will affect the outcome, or whether I need to take into account variations such as the short tag <tagName/> and some of the, ahem, odder variations. And, of course, there are more tags out there.
So it's probably not production ready. But you get the idea.
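One thing worth checking before relying on any strip_tags() whitelist is what happens to the content of a stripped element. The short sketch below (sample input made up for illustration) shows that strip_tags() removes the <script> tags themselves but keeps the text between them.

```php
<?php
$html = '<p>Hello <em>world</em><script>alert(1)</script></p>';

// Whitelist with <script> left out: the tags go, the script body stays as text.
$allowed = '<p><em>';
echo strip_tags($html, $allowed), "\n";
// prints: <p>Hello <em>world</em>alert(1)</p>
```

So if the goal is to get rid of the script code as well, the element has to be removed by other means, for example with a DOM parser or HTML Purifier.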
First, see what others have said on this topic:
Strip <script> tags and everything in between with PHP?
and
remove script tag from HTML content
It seems you have 2 choices, one is a Regex solution, both the links above give them. The second is to use HTML Purifier.
If you are stripping the script tag for some other reason than sanitation of user content, the Regex could be a good solution. However, as everyone has warned, it is a good idea to use HTML Purifier if you are sanitizing input.
PHP(5 or greater) solution:
If you want to remove <script> tags (or any other), and also you want to remove the content inside tags, you should use:
OPTION 1 (simplest):
preg_replace('#<script(.*?)>(.*?)</script>#is', '', $text);
OPTION 2 (more versatile):
<?php
$html = "<p>Your HTML code</p><script>With malicious code</script>";
$dom = new DOMDocument();
$dom->loadHTML($html);
$script = $dom->getElementsByTagName('script');
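// the node list is live, so collect the nodes first and remove them afterwards
$remove = [];
foreach($script as $item)
{
$remove[] = $item;
}
foreach($remove as $item)
{
$item->parentNode->removeChild($item);
}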
$html = $dom->saveHTML();
Then $html will contain (inside the DOCTYPE, <html> and <body> wrapper that loadHTML() adds):
"<p>Your HTML code</p>"
This is what I use to strip out a list of forbidden tags, can do both removing of tags wrapping content and tags including content, Plus trim off leftover white space.
$description = trim(preg_replace([
# Strip tags around content
'/\<(.*)doctype(.*)\>/i',
'/\<(.*)html(.*)\>/i',
'/\<(.*)head(.*)\>/i',
'/\<(.*)body(.*)\>/i',
# Strip tags and content inside
'/\<(.*)script(.*)\>(.*)<\/script>/i',
], '', $description));
Input example:
$description = '<html>
<head>
</head>
<body>
<p>This distinctive Mini Chopper with Desire styling has a powerful wattage and high capacity which makes it a very versatile kitchen accessory. It also comes equipped with a durable glass bowl and lid for easy storage.</p>
<script type="application/javascript">alert(\'Hello world\');</script>
</body>
</html>';
Output result:
<p>This distinctive Mini Chopper with Desire styling has a powerful wattage and high capacity which makes it a very versatile kitchen accessory. It also comes equipped with a durable glass bowl and lid for easy storage.</p>
I use the following:
function strip_tags_with_forbidden_tags($input, $forbidden_tags)
{
foreach (explode(',', $forbidden_tags) as $tag) {
$tag = preg_replace(array('/^</', '/>$/'), array('', ''), $tag);
$input = preg_replace(sprintf('/<%s[^>]*>([^<]+)<\/%s>/', $tag, $tag), '$1', $input);
}
return $input;
}
Then you can do:
echo strip_tags_with_forbidden_tags('<cancel>abc</cancel>xpto<p>def></p><g>xyz</g><t>xpto</t>', 'cancel,g');
Output: 'abcxpto<p>def></p>xyz<t>xpto</t>'
echo strip_tags_with_forbidden_tags('<cancel>abc</cancel> xpto <p>def></p> <g>xyz</g> <t>xpto</t>', 'cancel,g');
Outputs: 'abc xpto <p>def></p> xyz <t>xpto</t>'

Json to xml with greek characters

I am using curl to get a JSON file which can be found here (it's way too long to copy and paste): http://www.opap.gr/web/services/rs/betting/availableBetGames/sport/program/4100/0/sport-1.json?localeId=el_GR
After that I use json_decode to get the associative array. Up to here everything seems OK. When I use var_dump, the characters inside the array are in Greek. After that I use the following code:
$JsonClass = new ArrayToXML();
$mydata=$JsonClass->toXml($json);
class ArrayToXML
{
public static function toXML( $data, $rootNodeName = 'ResultSet', &$xml=null ) {
// turn off compatibility mode as simple xml throws a wobbly if you don't.
// if ( ini_get('zend.ze1_compatibility_mode') == 1 ) ini_set ( 'zend.ze1_compatibility_mode', 0 );
if ( is_null( $xml ) ) //$xml = simplexml_load_string( "" );
$xml = simplexml_load_string("<?xml version='1.0' encoding='UTF-8'?><$rootNodeName />");
// loop through the data passed in.
foreach( $data as $key => $value ) {
$numeric = false;
// no numeric keys in our xml please!
if ( is_numeric( $key ) ) {
$numeric = 1;
$key = $rootNodeName;
}
// delete any char not allowed in XML element names
$key = preg_replace('/[^a-z0-9\-\_\.\:]/i', '', $key);
// if there is another array found recursively call this function
if ( is_array( $value ) ) {
$node = ArrayToXML::isAssoc( $value ) || $numeric ? $xml->addChild( $key ) : $xml;
// recursive call.
if ( $numeric ) $key = 'anon';
ArrayToXML::toXml( $value, $key, $node );
} else {
// add single node.
$value = htmlentities( $value );
$xml->addChild( $key, $value );
}
}
// pass back as XML
return $xml->asXML();
}
public static function isAssoc( $array ) {
return (is_array($array) && 0 !== count(array_diff_key($array, array_keys(array_keys($array)))));
}
}
And here comes the problem. All the Greek characters in the result come out as strange characters, Î?Î?Î¥Î?Î?ΡΩΣÎ?Î? for example. I really don't know what I am doing wrong. I am really bad with encoding/decoding things :(.
And to make this a bit more clear:
Here is how the associative array (one of the parts that I have the problem with) looks:
{ ["resources"]=> array(4) { ["team-4833"]=> string(24) "ΛΕΥΚΟΡΩΣΙΑ U21" ["t-429"]=> string(72) "ΠΡΟΚΡΙΜΑΤΙΚΑ ΕΥΡΩΠΑΪΚΟΥ ΠΡΩΤΑΘΛΗΜΑΤΟΣ" ["t-429-short"]=> string(6) "ΠΕΠ" ["team-15387"]=> string(16) "ΕΛΛΑΔΑ U21" } ["locale"]=> string(5) "el_GR" } ["relatedNum"]=> NULL }
And here is what I get after using SimpleXML:
<resources><team-4833>Î?Î?Î¥Î?Î?ΡΩΣÎ?Î? U21</team-4833><t-429>ΠΡÎ?Î?ΡÎ?Î?Î?ΤÎ?Î?Î? Î?ΥΡΩΠÎ?ΪÎ?Î?Î¥ ΠΡΩΤÎ?Î?Î?Î?Î?Î?ΤÎ?Σ</t-429><t-429-short>Î Î?Î </t-429-short><team-15387>Î?Î?Î?Î?Î?Î? U21</team-15387></resources><locale>el_GR</locale></lexicon><relatedNum></relatedNum></betGames>
Thanks in advance for your replies.
PS: I also have <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /> in the page where I display the result, but it doesn't help.
I still didn't find a solution for that, so I used a different approach, something like what Yannis suggested. I saved the XML in a file using the class I found here: http://www.phpclasses.org/package/1826-PHP-Store-associative-array-data-on-file-in-XML.html
After that I load the XML with simplexml_load_file and use XSLT to access the data in all nodes and store it in my database. It worked fine that way. If anyone still wants to try to explain to me why it doesn't work the way I first tried, feel free (just for learning purposes :p). Thanks for your replies :).
There is no need - The current json is given in an xml format as well here apparently:
http://www.opap.gr/web/services/rs/betting/availableBetGames/sport/program/4100/0/sport-1.xml?localeId=el_GR
Just had to play with the url parameters a bit :)
This worked for me in Chrome using PHP version 5.3.6:
$json = file_get_contents('http://www.opap.gr/web/services/rs/betting/availableBetGames/sport/program/4100/0/sport-1.json?localeId=el_GR');
$json = json_decode($json, true);
$xml = new SimpleXMLElement('<ResultSet/>');
array_walk_recursive($json, array ($xml, 'addChild'));
print $xml->asXML();
exit();
Clearly your bug is that you are manipulating UTF‑8–encoded Unicode as though those bytes were ISO‐8859‑1.
I cannot see where this is happening; probably in your call to htmlentities, whatever that is.
It may need to use some sort of “multibyte” hack, perhaps including such things as this sort of pattern:
/([^\x00-\x7F])/u
with an explicit /u so it works on logical code points instead of 8‑bit code units (read: bytes). It might do this to grab one non-ASCII code point so it can replace it with a numeric entity. Without the easily forgotten /u, it would work on bytes not code points, which matches what your description shows happening.
It could be this sort of thing, or it might be that you have to swap over to some of the mb_*() functions instead of the normal ones. This is to work around the fundamental underlying PHP bug that there is no real Unicode support in the language, just a few band-aids here and there that seem to like to fall off from time to time for no good reason.
If you could use a clean language with not just proper Unicode support but also a clear separation between physical bytes and abstract characters, this sort of thing would not be happening. But I bet it’s a common problem that others must be having too, so I would be really surprised if it were a library bug instead of a (perfectly understandable!) oversight somewhere in your code.
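The htmlentities() suspicion can be checked directly. Before PHP 5.4 the function assumed ISO-8859-1 when no charset was passed, so each two-byte Greek letter was treated as two Latin-1 characters. The sketch below passes the charset explicitly to reproduce that on any PHP version, and uses htmlspecialchars() with UTF-8 as one possible replacement; the sample string is made up, and the source file is assumed to be saved as UTF-8.

```php
<?php
$value = 'ΛΕΥΚΟ & U21'; // UTF-8 bytes, as json_decode() returns them

// Bytes misread as ISO-8859-1: every Greek letter is split into two "characters".
$wrong = htmlentities($value, ENT_QUOTES, 'ISO-8859-1');

// Only the XML-special characters are escaped; the UTF-8 bytes pass through.
$right = htmlspecialchars($value, ENT_QUOTES, 'UTF-8');

echo $wrong, "\n", $right, "\n";
```

The first result starts with &Icirc;, which is the Î seen in the broken output above; the second keeps the Greek text and only turns & into &amp;.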
An answer to your question, from Greece:
The word "ΛΕΥΚΟ" has the single-byte character codes 203-197-213-202-207.
When you read it back, however, a 206 is added in front of each letter, so the number of bytes doubles,
and the codes also change as follows: 206-(203-48=155), 206-(197-48=149), 206-(213-48=165), 206-(202-48=154), 206-(207-48=159).
Consequently, the solution is to check each character: if you find a 206, ignore it, then add 48 to the code of the next character to get the original character.
Because I also work on decoding OPAP data, any new knowledge is welcome
by mail: bluegt03#in.gr
