unicode chars with wikipedia search in PHP


I pass a PHP string to the Wikipedia search page in order to retrieve part of the definition.
Everything works fine, except for Unicode characters, which appear in the \u... form. Here is an example to explain myself better. As you can see, the phonetic transcription of the name is not readable:
Henrik Ibsen, Henrik Ibsen \u02c8h\u025bn\u027eik \u02c8ips\u0259n
(Skien, 20 marzo 1828 - Oslo, 23 maggio 1906) è stato uno scrittore,
drammaturgo, poeta e regista teatrale norvegese.
The code I use to get the snippet from Wikipedia is this:
$word = $_GET["word"];
$html = file_get_contents('https://it.wikipedia.org/w/api.php?action=opensearch&search='.$word);
$utf8html = html_entity_decode(preg_replace("/U\+([0-9A-F]{4})/", "&#x\\1;", $html), ENT_NOQUOTES, 'UTF-8');
The last line of my code does not solve the problem.
Do you know how to get a clean text which is entirely readable?

The output of the Wikipedia search API is JSON. Don't try to scrape bits out of it and parse string literal escapes yourself, that way madness lies. Just use a readily available JSON parser.
Also, you need to URL-escape the word when you add it into a query string, otherwise any searches for words with URL-special characters in will fail.
In summary:
$word = $_GET['word'];
$url = 'https://it.wikipedia.org/w/api.php?action=opensearch&search='.urlencode($word);
$response = json_decode(file_get_contents($url));
$matching_titles_array = $response[1];
$matching_summaries_array = $response[2];
$matching_urls = $response[3];
...etc...
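To see why the JSON parser is enough, here is a minimal sketch. The $raw string is a shortened, hypothetical sample in the shape of an opensearch response, not real API output. The \uXXXX sequences are JSON string escapes, so json_decode() turns them into UTF-8 characters without any regex work:

```php
<?php
// Hypothetical, shortened sample in the shape of an opensearch response.
// Single quotes keep the \uXXXX escapes as literal text, as they arrive over HTTP.
$raw = '["Ibsen",["Henrik Ibsen"],["Henrik Ibsen \u02c8h\u025bn\u027eik \u02c8ips\u0259n"]]';

// The JSON parser converts every \uXXXX escape into the UTF-8 character.
$response = json_decode($raw);
$summary  = $response[2][0];

echo $summary, "\n"; // Henrik Ibsen ˈhɛnɾik ˈipsən
```

The page that prints the result still has to be served as UTF-8 for the phonetic characters to display correctly.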

Your regex string contains some errors. Try:
<?php
$str = "Henrik Ibsen, Henrik Ibsen \u02c8h\u025bn\u027eik \u02c8ips\u0259n(Skien, 20 marzo 1828 - Oslo, 23 maggio 1906) è stato uno scrittore, drammaturgo, poeta e regista teatrale norvegese.";
$utf8html = preg_replace('#\\\U([0-9A-F]{4})#i', "&#x\\1;", $str); // the trailing ; terminates the entity
echo $utf8html;

Well, the answer posted by bobince is certainly more effective than my previous procedure, which aimed at scraping and pruning bit by bit what I needed. Just to show you how I was doing it, here is my previous code:
$html = file_get_contents('https://it.wikipedia.org/w/api.php?action=opensearch&search='.$s);
$decoded = preg_replace('#\\\U([0-9A-F]{4})#i', "&#x\\1", $html);
$par = array("[", "]");
$def_no_par = str_replace($par, "", $decoded);
$def_no_vir = str_replace("\"\",", "", $def_no_par);
$def_cap = str_replace("\",", "\",<br>", $def_no_vir);
$def_pulita = str_replace("\"", "", $def_cap);
$def_clean = str_replace(".,", ".", $def_pulita);
$definizione = str_replace("$s,", "", $def_clean);
$out = str_replace("\\", "\"", $definizione);
As you can see, removing parts of the output to make it more readable was quite tiresome (and not completely successful).
Using the JSON approach makes everything more linear. Here is my new workaround:
$search = 'https://it.wikipedia.org/w/api.php?action=opensearch&search='.urlencode($s);
$response = json_decode(file_get_contents($search));
$matching_titles_array = $response[1];
$matching_summaries_array = $response[2];
$matching_urls = $response[3];
echo '<h3><div align="center"><font color=" #A3A375">'.$titolo.'</font></div></h3><br><br>';
foreach($response[1] as $t) {
echo '<font color="#5C85D6"><b>'.$t.'</b></font><br><br>';
}
foreach($response[2] as $s) {
echo $s.'<br><br>';
}
foreach($response[3] as $l) {
$link = preg_replace('!(((f|ht)tp(s)?://)[-a-zA-Zа-яА-Я()0-9#:%_+.~#?&;//=]+)!i', '$1', $l);
echo $link.'<br><br>';
}
The advantage is that now I can manipulate the arrays as I wish.

Related

PHP and Simple DOM HTML Parser - Replace identical text string

I'm using the Simple DOM html parser php script in what seems to be a simple way, here's my code:
include('simple_html_dom.php');
$html = file_get_html($_SERVER['DOCUMENT_ROOT']."/wp-content/themes/genesis-sample-develop/cache-reports/atudem.html");
$snow_depth_min = $html->find('td', 115);
$snow_depth_max = $html->find('td', 116);
$snow_type = $html->find('td', 117);
The problem is with $snow_type. Sometimes the parsed text string is 'polvo' and sometimes it is 'polvo-dura'. I'm trying to replace 'polvo' with 'powder', and 'polvo-dura' with 'powder/packed'. If I do something like
if ($snow_type->innertext=='polvo-dura') {
$snow_type->innertext('powder');
}
or
$snow_type = str_replace("polvo", "powder", $snow_type);
$snow_type = str_replace("polvo-dura", "powder/packed", $snow_type);
it ends up with results like 'powder-dura' and weird things like that.
Obviously I'm new to PHP, so have some patience with me ;) I would also like to understand why this happens and why a possible solution would work.
Thanks in advance

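// Compare against the whole string and assign the translation back to the element
if ($snow_type->innertext == 'polvo-dura') {
$snow_type->innertext = 'powder/packed';
} else if ($snow_type->innertext == 'polvo') {
$snow_type->innertext = 'powder';
}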

Provisional solution, using indexed arrays with preg_replace() :
$patterns = array();
$patterns[0] = '/-/';
$patterns[1] = '/polvo/';
$patterns[2] = '/dura/';
$replacements = array();
$replacements[0] = '/';
$replacements[1] = 'powder';
$replacements[2] = 'packed';
$snow_type_spanish_english = preg_replace($patterns, $replacements, $snow_type);
I have serious concerns about how it would work in real-world long complex texts, but for short-type data such as 'snow type' with values like 'a', 'b', 'a/b' or 'b/a', this can be just fine.
It would be great if someone comes up with a better solution. I've been searching all over the Internet for days and haven't found any specific solutions for text values that start with the same word, like 'powder' and 'powder-packed' for example.
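As for why the two str_replace() calls in the question give 'powder-dura': the first call already turns 'polvo-dura' into 'powder-dura', so the second call no longer finds 'polvo-dura'. One possible alternative is strtr() with an array, which tries the longest keys first and never re-scans text it has already replaced. This is only a sketch; translateSnowType() is a hypothetical helper name and the map holds just the two values from the question:

```php
<?php
// Hypothetical helper (not from the original thread): translate a snow-type label.
function translateSnowType(string $spanish): string
{
    // strtr() with an array tries the longest keys first, so 'polvo-dura'
    // wins over 'polvo' regardless of the order of the entries.
    $map = [
        'polvo'      => 'powder',
        'polvo-dura' => 'powder/packed',
    ];
    return strtr($spanish, $map);
}

echo translateSnowType('polvo-dura'), "\n"; // powder/packed
echo translateSnowType('polvo'), "\n";      // powder
```

With Simple HTML DOM the result would then be assigned back, for example $snow_type->innertext = translateSnowType($snow_type->innertext);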

remove HTML from displaying in PHP

I have this text : http://pastebin.com/2Zgbs7hi
And I want to be able to remove the HTML code from it and just display the plain text, but I want to keep at least one line break where there are currently several line breaks.
I have tried:
$ticket["summary"] = 'pastebin example';
$TicketSummaryDisplay = nl2br($ticket["summary"]);
$TicketSummaryDisplay = stripslashes($TicketSummaryDisplay);
$TicketSummaryDisplay = trim(strip_tags($TicketSummaryDisplay));
$TicketSummaryDisplay = preg_replace('/\n\s+$/m', '', $TicketSummaryDisplay);
echo $TicketSummaryDisplay;
That displays as plain text, but it shows everything as one big block of text with no line breaks at all.

Maybe this will save you some time.
<?php
libxml_use_internal_errors(true); // suppress warnings about the non-standard "o" tags
$html = file_get_contents('http://pastebin.com/raw.php?i=2Zgbs7hi');
$dom = new DOMDocument;
$dom->loadHTML($html);
$result='';
foreach ($dom->getElementsByTagName('p') as $node) {
if (strstr($node->nodeValue, 'Legal Disclaimer:')){
break;
}
$result .= $node->nodeValue;
}
echo $result;

This example should successfully store text from html into an array of strings.
After stripping all the tags, you can use preg_split with the \R escape sequence (it matches any newline sequence) to convert the string into an array. That array will have several blank values, and there will also be some HTML non-breaking space entities, so we check the array for empty values with the array_filter() function (it removes all items that do not satisfy the filter condition, in our case items with an empty value). There is a problem with the &nbsp; entity: &nbsp; and the space character are not the same, they have different character codes, so trim() will not remove it. Here are two possible solutions: the first, uncommented part only replaces &nbsp; and checks for whitespace characters, while the second, commented one decodes all HTML entities and also checks for spaces.
PHP:
$text = file_get_contents( 'http://pastebin.com/raw.php?i=2Zgbs7hi' );
$text = strip_tags( $text );
$array = array_filter(
preg_split( '/\R/', $text ),
function( &$item ) {
$item = str_replace( '&nbsp;', ' ', $item );
return trim( $item );
// $item = html_entity_decode( $item );
// return trim( str_replace( "\xC2\xA0", ' ', $item ) );
}
);
foreach( $array as $value ) {
echo $value . '<br />';
}
Array output:
Array
(
[8] => Hi,
[11] => Ashley has explained that I need to ask for another line and broadband for the wifi to work, please can you arrange this.
[13] => Regards
[23] => Legal Disclaimer:
[24] => This email and its attachments are confidential. If you received it by mistake, please don’t share it. Let us know and then delete it. Its content does not necessarily represent the views of The Dragon Enterprise
[25] => Centre and we cannot guarantee the information it contains is complete. All emails are monitored and may be seen by another member of The Dragon Enterprise Centre's staff for internal use
)
Now you should have a clean array containing only items with a value. By the way, newlines in HTML are expressed through <br />, not through \n; your example, viewed as a response in a web browser, still has them, but they are only visible in the page source code. I hope I did not miss the point of the question.
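The steps above can also be folded into one function that returns the text with exactly one newline between the non-empty lines. This is a sketch under assumptions: htmlToPlainLines() is a hypothetical name, the input is assumed to be UTF-8, and $sample is a made-up fragment rather than the pastebin text:

```php
<?php
// Hypothetical helper: strip the markup and keep one line break between non-empty lines.
function htmlToPlainLines(string $html): string
{
    $text = strip_tags($html);
    // Decode entities so &nbsp; becomes U+00A0, then turn that into a normal space.
    $text = html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');
    $text = str_replace("\u{00A0}", ' ', $text);
    // Split on any newline sequence, trim each line, drop the empty ones.
    $lines = array_map('trim', preg_split('/\R/u', $text));
    $lines = array_filter($lines, function ($line) {
        return $line !== '';
    });
    return implode("\n", $lines);
}

$sample = "<p>Hi,</p>\n\n\n<p>&nbsp;</p>\n<p>Regards</p>";
// Escape the plain text, then turn the remaining newlines into <br /> for the browser.
echo nl2br(htmlspecialchars(htmlToPlainLines($sample)));
```

Because the newlines are converted to <br /> only at the very end, the same function can also feed a plain-text email or a log file.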

Try this to get text output with line breaks:
<?php
$ticket["summary"] = file_get_contents('http://pastebin.com/raw.php?i=2Zgbs7hi');
$TicketSummaryDisplay = nl2br($ticket["summary"]);
echo strip_tags($TicketSummaryDisplay,'<br>');
?>

You are asking how to add line breaks to your "one big block of text with no line breaks at all".
Short answer
After you have stripped the HTML tags, apply wordwrap with the desired text-block length:
$text = wordwrap($text, 90, "<br />\n");
I really wonder why nobody suggested that function before.
There is also chunk_split, which doesn't take words into account and just splits after a certain number of characters, breaking words, but that's not what you want, I guess.
PHP
<?php
$text = file_get_contents('http://pastebin.com/raw.php?i=2Zgbs7hi');
/**
* Returns the string without HTML tags; also
* takes control chars, spaces and "&nbsp;" into account.
*/
function dropHtmlTags($string) {
// remove html tags
//$string = preg_replace ('/<[^>]*>/', ' ', $string);
$string = strip_tags($string);
// control characters and "&nbsp;"
$string = str_replace("\r", '', $string); // remove
$string = str_replace("\n", ' ', $string); // replace with space
$string = str_replace("\t", ' ', $string); // replace with space
$string = str_replace("&nbsp;", ' ', $string);
// remove multiple spaces
$string = preg_replace('/ {2,}/', ' ', $string);
$string = trim($string);
return $string;
}
$text = dropHtmlTags($text);
// The Answer: insert line breaks after 95 chars,
// to get rid of the "one big block of text with no line breaks at all"
$text = wordwrap($text, 95, "<br />\n");
// if you want to insert line-breaks before the legal disclaimer,
// uncomment the next line
//$text = str_replace("Regards Legal Disclaimer", "<br /><br />Regards Legal Disclaimer", $text);
echo $text;
?>
Result
first section shows your text block
second section shows the text with wordwrap applied (code from above)

Hello it can be done as follows:
$abc= file_get_contents('http://pastebin.com/raw.php?i=2Zgbs7hi');
$abc = strip_tags($abc); // the string to clean is the first argument
echo $abc;
Please, let me know whether it works

you may use
<?php
$a= file_get_contents('a.txt');
echo nl2br(htmlspecialchars($a));
?>

<?php
$handle = @fopen("pastebin.html", "r");
if ($handle) {
while (!feof($handle)) {
// Note: fgetss() was deprecated in PHP 7.3 and removed in PHP 8.0
$buffer = fgetss($handle, 4096);
echo $buffer;
}
fclose($handle);
}
?>
output is
Hi,
Ashley has explained that I need to ask for another line and broadband for the wifi to work, please can you arrange this.
Regards
Legal Disclaimer:
This email and its attachments are confidential. If you received it by mistake, please don’t share it. Let us know and then delete it. Its content does not necessarily represent the views of The Dragon Enterprise
Centre and we cannot guarantee the information it contains is complete. All emails are monitored and may be seen by another member of The Dragon Enterprise Centre's staff for internal use
You can probably write additional code to convert &nbsp; to spaces etc.

I'm not sure I understood everything correctly, but this seems to be your expected result:
$txt = file_get_contents('http://pastebin.com/raw.php?i=2Zgbs7hi');
var_dump(preg_replace("/(\&nbsp\;(\s{1,})?)+/", "\n", trim(strip_tags(preg_replace("/(\s){1,}/", " ", $txt)))));
//more readable
$txt = preg_replace("/(\s){1,}/", " ", $txt);
$txt = trim(strip_tags($txt));
$txt = preg_replace("/(\&nbsp\;(\s{1,})?)+/", "\n", $txt);

The strip_tags() function strips HTML and PHP tags from a string, if that is what you are trying to accomplish.
Examples from the docs:
<?php
$text = '<p>Test paragraph.</p><!-- Comment --> Other text';
echo strip_tags($text);
echo "\n";
// Allow <p> and <a>
echo strip_tags($text, '<p><a>');
?>
The above example will output:
Test paragraph. Other text
<p>Test paragraph.</p> Other text

Accents become question marks in PHP when parsing HTML

I'm getting PT-BR text automatically by downloading an HTML page, and the accented characters become question marks when I use utf8_decode. This is my function:
function pegaMsg($string)
{
$bot_url = "http://website.com";
//&rnd=&msg="
$rand_msg = rand(0,100);
$url = $bot_url . $rand_msg . "&msg=" . $string;
$url = str_replace(" ", "%20", $url);
//echo "\n" . $url;
$download = http_get($url, $referer="");
$download['FILE'] = utf8_decode($download['FILE']);
$download['FILE'] = str_replace("var resp = ", "", $download['FILE']);
$download['FILE'] = str_replace("\\r\\n", "", $download['FILE']);
$download['FILE'] = str_replace(";", "", $download['FILE']);
$download['FILE'] = str_replace("\'", "", $download['FILE']);
$download['FILE'] = trim($download['FILE']);
return $download['FILE'];
}
this is the output expected:
VOCÊ TINHA DUAS ESCOLHAS:
and this is what I get:
'VOC? TINHA DUAS ESCOLHAS:
What can I do? I want the ^ displayed! Thanks, and sorry for the bad English.

utf8_decode replaces invalid code unit sequences with ?. The reason you're getting a ? is likely because the text you're passing to utf8_decode was not in UTF-8 to begin with.
In fact, it's possible it was already in ISO-8859-1, which is the encoding of the string returned by utf8_decode. In that case, your solution would be to just omit the call to utf8_decode.
If the original text was neither in UTF-8 nor in ISO-8859-1 (which is what I'm assuming you want, since you're calling utf8_decode), you have to use iconv or mb_convert_encoding.
A final possibility is that whatever is interpreting the script output assumes an encoding different from the one the output actually uses, and it also converts invalid code unit sequences to ?.
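The first two cases can be told apart in code. The sketch below needs the mbstring extension and rests on a loud assumption: anything that is not valid UTF-8 is treated as ISO-8859-1. toUtf8() is a hypothetical helper name, not part of the original function:

```php
<?php
// Hypothetical helper: return $raw as UTF-8.
// Assumption: input that is not valid UTF-8 is ISO-8859-1.
function toUtf8(string $raw): string
{
    if (mb_check_encoding($raw, 'UTF-8')) {
        return $raw; // already valid UTF-8, leave it alone
    }
    return mb_convert_encoding($raw, 'UTF-8', 'ISO-8859-1');
}

// "Ê" is the single byte 0xCA in ISO-8859-1.
$latin1 = "VOC\xCA TINHA DUAS ESCOLHAS:";
echo toUtf8($latin1), "\n"; // VOCÊ TINHA DUAS ESCOLHAS: (when the page or terminal is UTF-8)
```

If the page is meant to be served as ISO-8859-1 instead, the conversion goes the other way and the Content-Type header has to announce that charset.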

Try utf8_encode instead:
$download['FILE'] = utf8_encode($download['FILE']);

Replacing \r\n (newline characters) after running json_encode

So when I run json_encode, it grabs the \r\n from MySQL as well. I have tried rewriting strings in the database to no avail. I have tried changing the encoding in MySQL from the default latin1_swedish_ci to ascii_bin and utf8_bin. I have done tons of str_replace and chr(10), chr(13) stuff. I don't know what else to say or do, so I'm just going to leave this here...
$json = json_encode($new);
if(isset($_GET['pretty'])) {
echo str_replace("\/", "/", jsonReadable(parse($json)));
} else {
$json = str_replace("\/", "/", $json);
echo parse($json);
}
The jsonReadable function is from here and the parse function is from here. The str_replace calls that are already in there are because I am getting weirdly formatted HTML tags like <\/h1>. Finally, $new is an array which is built above. Full code upon request.
Help me StackOverflow. You're my only hope

Does the string contain "\r\n" (as in 0x0D 0x0A) or the literal string '\r\n'? If it's the former, this should remove any newlines.
$json = preg_replace("!\r?\n!", "", $json);
Optionally, replace the second parameter "" with "<br />" if you'd like to replace the newlines with a br tag. For the latter case, try the following:
$json = preg_replace('!(?:\\\\r)?\\\\n!', "", $json); // matches a literal backslash followed by r and/or n

Don't replace it in the JSON, replace it in the source before you encode it.
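A minimal sketch of that advice, assuming $new is a (possibly nested) array of strings as in the question. stripNewlines() is a hypothetical helper name, and replacing each line break with a single space is an assumption (use '' to drop them entirely). JSON_UNESCAPED_SLASHES is a standard json_encode() flag that also makes the str_replace("\/", "/") step unnecessary:

```php
<?php
// Hypothetical helper: remove line breaks from every string in a nested array.
function stripNewlines(array $data): array
{
    array_walk_recursive($data, function (&$value) {
        if (is_string($value)) {
            // "\r\n" is listed first so a Windows line break becomes one space, not two.
            $value = str_replace(["\r\n", "\r", "\n"], ' ', $value);
        }
    });
    return $data;
}

// Made-up sample standing in for the $new array built from the database rows.
$new  = ['title' => "Line one\r\nLine two", 'tags' => ["a\nb"]];
$json = json_encode(stripNewlines($new), JSON_UNESCAPED_SLASHES);

echo $json, "\n"; // {"title":"Line one Line two","tags":["a b"]}
```

Cleaning the data first keeps the JSON valid by construction, whereas editing the encoded string risks touching escape sequences that belong to other values.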

I had a similar issue; I used:
$p_num = trim($this->recp);
$p_num = str_replace("\n", "", $p_num);
$p_num = str_replace("\r", ",", $p_num);
$p_num = str_replace("\n",',', $p_num);
$p_num = rtrim($p_num, "\x00..\x1F");
Not sure if this will help with your requirements.

removing strange characters from php string

This is what I have right now.
I'm pulling an RSS feed into PHP; the raw XML from the RSS feed reads:
Paul’s Confidence
The PHP that I have so far is this:
$newtitle = $item->title;
$newtitle = utf8_decode($newtitle);
The above returns:
Paul?s Confidence
If I remove the utf8_decode, I get this:
Paulâ€™s Confidence
When I try a str_replace:
$newtitle = str_replace("”", "", $newtitle);
It doesn't work; I get:
Paulâ€™s Confidence
Any thoughts?

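This is the function I always use. It works byte by byte and keeps only printable ASCII (plus byte 163, which is £ in ISO-8859-1), so it removes the stray characters regardless of encoding, at the cost of also dropping legitimate accented characters: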
function RemoveBS($Str) {
$StrArr = str_split($Str); $NewStr = '';
foreach ($StrArr as $Char) {
$CharNo = ord($Char);
if ($CharNo == 163) { $NewStr .= $Char; continue; } // keep £
if ($CharNo > 31 && $CharNo < 127) {
$NewStr .= $Char;
}
}
return $NewStr;
}
How it works:
echo RemoveBS('Hello õhowå åare youÆ?'); // Hello how are you?

Try this:
$newtitle = html_entity_decode($newtitle, ENT_QUOTES, "UTF-8");
If this is not the solution, see this page: http://us2.php.net/manual/en/function.html-entity-decode.php

This will remove all non-ascii characters / special characters from a string.
// Remove from a single-line string
$output = "Likening â€˜not-criticalâ€™ with";
// Parentheses inside a character class are literal, so they are not needed here
$output = preg_replace('/[^\x20-\x7F]+/', '', $output);
echo $output;
// Remove from a multi-line string (keeps \n and \r)
$output = "Likening â€˜not-criticalâ€™ with \n Likening â€˜not-criticalâ€™ with \r Likening â€˜not-criticalâ€™ with. ' ! -.";
$output = preg_replace('/[^\x20-\x7F\x0A\x0D]+/', '', $output);
echo $output;

I solved the problem. Seems to be a short fix rather than the larger issue, but it works.
$newtitle = str_replace('â€™', "'", $newtitle);
I also found this useful snippet that may help others with the same problem:
<?php
// Order matters: the bare 'â€' must come last, otherwise it would also
// replace the first two characters of every longer sequence.
$find[] = 'â€œ'; // left side double smart quote
$find[] = 'â€˜'; // left side single smart quote
$find[] = 'â€™'; // right side single smart quote
$find[] = 'â€¦'; // ellipsis
$find[] = 'â€”'; // em dash
$find[] = 'â€“'; // en dash
$find[] = 'â€'; // right side double smart quote
$replace[] = '"';
$replace[] = "'";
$replace[] = "'";
$replace[] = "...";
$replace[] = "-";
$replace[] = "-";
$replace[] = '"';
$text = str_replace($find, $replace, $text);
?>
Thanks everyone for your time and consideration.

Yeah this is not working for me. What is the workaround for this? – vaichidrewar Mar 12 at 22:29
Add this to the HTML head (or modify if already there):
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
This tells the browser to read the page as UTF-8, so sequences such as "â€œ" are displayed as the characters they were meant to be.
Or you can do this:
ini_set('default_charset', 'utf-8');

Is the character encoding setting for your PHP server something other than UTF-8? If so, is there a reason or could it be changed to UTF-8? Though we don't store data in UTF-8 in our database, I've found that setting the webserver's character set to UTF-8 seems to help resolve character set issues.
I'd be interested in hearing others' opinions about this... whether I'm setting myself up for problems by setting webserver to UTF-8 while storing submitted data in Latin1 in our mysql database. I know there was a reason I chose Latin1 for the database but can't recall what it was. Interestingly, our current setup seems to allow for non-UTF-8 character entry and subsequent rendering... it seems that storing in Latin1 doesn't prevent subsequent decoding and display of all UTF-8 characters?

Use the PHP code below to remove them:
html_entity_decode(mb_convert_encoding(stripslashes($name), "HTML-ENTITIES", 'UTF-8'))

Read up on http://us.php.net/manual/en/function.html-entity-decode.php
That & symbol is a html code so you can easily decode it.

A super simple solution is to have the characters decoded when the page is loaded.
Simply copy and paste the following at the beginning of the script:
header('Content-Type: text/html; charset=UTF-8');
mb_internal_encoding('UTF-8');
mb_http_output('UTF-8');
mb_http_input('UTF-8');
mb_regex_encoding('UTF-8');
Reference: http://php.net/manual/en/function.mb-internal-encoding.php
comment left by webfav at web dot de

Many strange characters can be removed by calling
mysqli_set_charset($con,"utf8");
right after the MySQL connection code.
But in some circumstances, to remove strange characters like â€
we need to use:
$title = 'â€…Stefen Suraj';
$newtitle = preg_replace('/[^(\x20-\x7F)]*/','', $title);
echo $newtitle;
Output will be: Stefen Suraj

It does not work
You need to use
$arr1 = str_split($str)
then foreach and
echo($arr1[$k])
This will show you exactly which characters are written into the string.

Please Try this.
$find[] = '/â/' //'â€œ'; // left side double smart quote
$find[] = '/â/' //'â€'; // right side double smart quote
$find[] = '/â/' //'â€˜'; // left side single smart quote
$find[] = '/â/' //'â€™'; // right side single smart quote
$find[] = '/â&#133/' //'â€¦'; // elipsis
$find[] = '/â/' //'â€”'; // em dash
$find[] = '/â/' //'â€“'; // en dash
$replace[] = '“' // '"';
$replace[] = '”' // '"';
$replace[] = '‘' // "'";
$replace[] = '’' // "'";
$replace[] = '⋯' // "...";
$replace[] = '—' // "-";
$replace[] = '–' // "-";
$text = str_replace($find, $replace, $text);

1. The order of the strings in the $find array is significant.
2. This string "â€˜" should contain a tilde and look like three characters. If I save the .php file with my Genie editor it gets changed to just two characters "â€".
3. This is a useful reference: https://www.i18nqa.com/debug/utf8-debug.html
<?php
$text = "â€˜â€™â€œâ€1â€˜ 2â€™ 3â€â€œâ€™â€˜ 4â€™ 5 6 7â€™ â€˜, â€™, â€œ, â€â€˜";
echo($text . "<br>");
$find = array("â€˜", "â€™", "â€œ", "â€");
$replace = array("‘", "’", "“", "”");
$text = str_replace($find, $replace, $text);
echo($text);
?>
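The ordering pitfall from point 1 can be sidestepped with strtr(), which tries the longest keys first. The sketch below is an illustration under assumptions: $text is valid UTF-8, and the sequences are built from code points ("\u{...}" needs PHP 7 or later) so that an editor cannot silently shorten them, which is the problem described in point 2. The map covers only the four quote sequences:

```php
<?php
// "â€" built from code points, so saving the file cannot alter it.
$prefix = "\u{00E2}\u{20AC}";
$map = [
    $prefix              => '"', // bare prefix: right double quote whose last character was lost
    $prefix . "\u{0153}" => '"', // â€œ left double quote
    $prefix . "\u{02DC}" => "'", // â€˜ left single quote
    $prefix . "\u{2122}" => "'", // â€™ right single quote
];

$text = 'Paul' . $prefix . "\u{2122}" . 's Confidence';

// strtr() picks the longest matching key at each position, so the bare
// prefix listed first does not break the three-character sequences.
echo strtr($text, $map), "\n"; // Paul's Confidence
```

With str_replace() and the same ordering, the bare prefix would be replaced first and leave the trailing "™" behind, which is why the order matters there.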

Just one simple solution.
If your string contains these kinds of strange characters,
suppose $text contains some of them, then just do as shown below:
$mytext = mb_convert_encoding($text, "HTML-ENTITIES", 'UTF-8');
and it will work.
