How to get x amount of characters from text file using php? - php

I'm trying to get about 200 letters/chars (including spaces) from an external text file. I have code that displays the whole text, which I'll include below, but I have no idea how to get only a certain number of characters. To be clear, I'm not talking about lines; I really mean characters.
<?php
$file = "Nieuws/NieuwsTest.txt";
echo file_get_contents($file) . '<br /><br />';
?>

Use the fifth parameter of file_get_contents:
$s = file_get_contents('file', false, null, 0, 200);
The length is counted in bytes, so this works as expected only with single-byte (256-character) sets. It will not work correctly with multi-byte characters, since PHP strings are plain byte sequences and PHP does not offer native Unicode support, unfortunately.
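To make the byte-versus-character difference concrete, here is a minimal sketch. It assumes the mbstring extension is loaded and the file is UTF-8 encoded; the temporary file and its contents are made up for illustration.

```php
<?php
// Hypothetical sample: "héllo" is 5 characters but 6 bytes in UTF-8.
$tmp = tempnam(sys_get_temp_dir(), 'demo');
file_put_contents($tmp, "h\u{00E9}llo");

// The fifth argument is a byte count, so 2 bytes = "h" plus half of "é".
$bytes = file_get_contents($tmp, false, null, 0, 2);
$byteCutIsValid = mb_check_encoding($bytes, 'UTF-8'); // false: broken sequence

// Reading everything and cutting by characters keeps "é" intact.
$chars = mb_substr(file_get_contents($tmp), 0, 2, 'UTF-8');

unlink($tmp);
var_dump($byteCutIsValid, $chars);
```

Reading the whole file first is fine for small files; for large ones, limiting the read as described in the rest of this answer avoids loading everything into memory.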
Unicode
In order to read specific number of Unicode characters, you will need to implement your own function using PHP extensions such as intl and mbstring. For example, a version of fread accepting the maximum number of UTF-8 characters can be implemented as follows:
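function utf8_fread($handle, $length = null) {
    if ($length > 0) {
        // Read enough bytes for $length UTF-8 characters (at most 4 bytes each)
        $string = fread($handle, $length * 4);
        // Compare against false explicitly so that a string such as "0" is not treated as a failure
        return $string !== false ? mb_substr($string, 0, $length, 'UTF-8') : false;
    }
    // fread() requires a length argument, so read the rest of the stream this way
    return stream_get_contents($handle);
}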
If $length is positive, the function reads the maximum number of bytes that a UTF-8 string of that number of characters can take (a UTF-8 character is represented as 1 to 4 8-bit bytes), and extracts the first $length multi-byte characters using mb_substr. Otherwise, the function reads the entire file.
A UTF-8 version of file_get_contents can be implemented in similar manner:
function utf8_file_get_contents(...$args) {
    if (!empty($args[4])) {
        $maxlen = $args[4];
        // Read up to 4 bytes per requested character
        $args[4] *= 4;
        $string = file_get_contents(...$args);
        // Explicit false check: a string such as "0" is valid content
        return $string !== false ? mb_substr($string, 0, $maxlen, 'UTF-8') : false;
    }
    return file_get_contents(...$args);
}
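One caveat of reading four bytes per character: on an open stream, the file pointer ends up past the characters actually returned. If the position matters, a possible alternative is to decode the UTF-8 lead byte and read one character at a time. The sketch below is only an illustration; `fread_utf8_chars` is a hypothetical name, and the code assumes the input is valid UTF-8.

```php
<?php
// Hypothetical helper: reads up to $count UTF-8 characters and leaves the
// stream positioned right after the last one. Assumes valid UTF-8 input.
function fread_utf8_chars($handle, int $count): string
{
    $out = '';
    for ($i = 0; $i < $count; $i++) {
        $lead = fread($handle, 1);
        if ($lead === false || $lead === '') {
            break; // end of stream
        }
        $byte = ord($lead);
        // The lead byte tells how many continuation bytes follow.
        if ($byte >= 0xF0) {
            $extra = 3;
        } elseif ($byte >= 0xE0) {
            $extra = 2;
        } elseif ($byte >= 0xC0) {
            $extra = 1;
        } else {
            $extra = 0;
        }
        $out .= $lead;
        if ($extra > 0) {
            $out .= (string) fread($handle, $extra);
        }
    }
    return $out;
}

// Usage with an in-memory stream standing in for a file.
$handle = fopen('php://memory', 'w+');
fwrite($handle, "a\u{00E9}\u{20AC}\u{1F600}b"); // 1-, 2-, 3- and 4-byte characters, then "b"
rewind($handle);
$first = fread_utf8_chars($handle, 4);
$rest = stream_get_contents($handle);
fclose($handle);
var_dump($first, $rest);
```

Reading byte by byte is slower than one bulk read, so this trade-off only makes sense when the remaining stream content is needed afterwards.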

You could use the substr() function.
But I recommend the multibyte-safe mb_substr() instead:
$text = mb_substr( file_get_contents($file), 0, 200 ) . '<br /><br />';
With substr() you will run into trouble if the text contains accented characters and the like. Those problems do not occur with mb_substr().

use this:
<?php
$file = "Nieuws/NieuwsTest.txt";
echo substr( file_get_contents($file), 0, 200 ) . '<br /><br />';
?>

Related

Cut an arabic string

I have a string in the arabic language like:
على احمد يوسف
Now I need to cut this string and output it like:
...على احمد يو
I tried this function:
function short_name($str, $limit) {
if ($limit < 3) {
$limit = 3;
}
if (strlen($str) > $limit) {
if (preg_match('/\p{Arabic}/u', $str)) {
return substr($str, 0, $limit - 3) . '...';
}
else {
return '...'.substr($str, 0, $limit - 3);
}
}
else {
return $str;
}
}
The problem is that sometimes it displays a symbol like this at the end of the string:
...�على احمد يو
Why does this happen?
The symbol displayed after the cut is the result of substr() cutting in the middle of a character, resulting in an invalid character.
You need to use the Multibyte String Functions to handle Arabic strings, such as mb_strlen() and mb_substr().
You also need to make sure the internal encoding for those functions is set to UTF-8. You can set this globally at the top of your script:
mb_internal_encoding('UTF-8');
Which leads to this:
strlen('على احمد يوسف') returns 24, the size in octets
mb_strlen('على احمد يوسف') returns 13, the size in characters
Note that mb_strlen('على احمد يوسف') would also return 24 if the internal encoding were still set to ISO-8859-1, which was the default in older PHP versions.
Answer:
return '...'.mb_substr($str, 0, $limit - 3, "UTF-8"); // UTF-8 is optional
Background:
Arabic is not part of ISO 8859-1 and does not fit in a single 8-bit character. substr() works on bytes, not on characters. Characters with code points above 255 (Arabic, Cyrillic, Korean, etc.) need more bits, so in UTF-8 each of them takes two to four bytes. Subtracting 3 from the byte count can therefore land in the middle of a character and leave an incomplete sequence that cannot be displayed. Especially if you're going to use a lot of multibyte strings, make sure you use the correct string functions, such as mb_strlen().
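As a small illustration of the point above, this sketch compares a byte cut, a character cut, and mb_strcut(), which cuts by bytes but backs off to a character boundary. It assumes the mbstring extension is available and that the source file is saved as UTF-8.

```php
<?php
$name = 'على احمد يوسف'; // 13 characters, 24 bytes in UTF-8

// substr() counts bytes: 5 bytes is two letters plus half of the third.
$byteCut = substr($name, 0, 5);
var_dump(mb_check_encoding($byteCut, 'UTF-8')); // bool(false)

// mb_substr() counts characters.
$charCut = mb_substr($name, 0, 5, 'UTF-8');
var_dump(mb_strlen($charCut, 'UTF-8')); // int(5)

// mb_strcut() takes a byte limit but never splits a character.
$safeByteCut = mb_strcut($name, 0, 5, 'UTF-8');
var_dump(strlen($safeByteCut)); // int(4)
```

mb_strcut() is mainly useful when the limit really is a byte budget, for example a fixed-size database column; for display purposes mb_substr() is usually the better fit.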
Try this function:
// Method of a helper class (class declaration omitted)
public static function shorten_arabic_text($text, $length)
{
    mb_internal_encoding('UTF-8');
    $out = mb_strlen($text) > $length ? mb_substr($text, 0, $length) . " ..." : $text;
    return $out;
}

PHP: UTF8_decode needed with filter for ASCII values 126-160; proposed solution

I previously began exploring this problem here. Here is the true problem, and a proposed solution:
Filenames containing characters with byte values between 32 and 255 pose a problem for utf8_encode(). Specifically, it doesn't handle the character values between 126 and 160 (inclusive) correctly. While filenames with those characters may be written to a database, passing those filenames to a function in PHP code will produce error messages stating the file cannot be found, etc.
I discovered this when trying to pass a filename with the offending characters to getimagesize().
What is needed for utf8_encode is a filter to EXCLUDE the conversion of the values between 126 and 160 (inclusive), while INCLUDING the conversion of all other characters (or any character, characters, or character ranges of the user's desire; mine is for the ranges stated, for the reason provided).
The solution I devised requires two functions, listed below, and their application that follows:
// With thanks to Mark Baker for this function, posted elsewhere on StackOverflow
function _unichr($o) {
if (function_exists('mb_convert_encoding')) {
return mb_convert_encoding('&#'.intval($o).';', 'UTF-8', 'HTML-ENTITIES');
} else {
return chr(intval($o));
}
}
// For each character where value is inclusively between 126 and 160,
// write out the _unichr of the character, else write out the UTF8_encode of the character
function smart_utf8_encode($source) {
$text_array = str_split($source, 1);
$new_string = '';
foreach ($text_array as $character) {
$value = ord($character);
if ((126 <= $value) && ($value <= 160)) {
$new_string .= _unichr($value); // _unichr() expects the numeric code, not the character itself
} else {
$new_string .= utf8_encode($character);
}
}
return $new_string;
}
$file_name = "abcdefghijklmnopqrstuvxyz~€‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”–—˜™š›œžŸ¡¢£¤¥¦§¨©ª«¬­®¯°±²³´µ¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ.jpg";
// This MUST be done first:
$file_name = iconv('UTF-8', 'WINDOWS-1252', $file_name);
// Next, smart_utf8_encode the variable (from encoding.inc.php):
$file_name = smart_utf8_encode($file_name);
// Now the file name may be passed to getimagesize(), etc.
$getimagesize = getimagesize($file_name);
If only PHP7 (6 is being skipped in the numbering, yes?) would include a filter on utf8_encode() to exclude certain character values, none of this would be necessary.
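The range described above roughly matches the bytes where Windows-1252 and ISO-8859-1 differ (0x80 to 0x9F), and utf8_encode() always treats its input as ISO-8859-1. Assuming that mismatch is the cause here, which is not confirmed by the question, a sketch of converting directly from Windows-1252 could look like this (mbstring assumed to be loaded):

```php
<?php
// Byte 0x80 is the euro sign in Windows-1252 but a C1 control character in ISO-8859-1.
$raw = "\x80";

// Same interpretation utf8_encode() uses: treat the input as ISO-8859-1.
$asLatin1 = mb_convert_encoding($raw, 'UTF-8', 'ISO-8859-1');

// Treat the input as Windows-1252 instead.
$asCp1252 = mb_convert_encoding($raw, 'UTF-8', 'Windows-1252');

var_dump(bin2hex($asLatin1)); // string(4) "c280"   (U+0080)
var_dump(bin2hex($asCp1252)); // string(6) "e282ac" (U+20AC, the euro sign)
```

Whether getimagesize() then finds the file still depends on the encoding the filesystem expects for file names, which this sketch does not address.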

Character replacement encoding php

I have a string in which I want to replace all 'a' characters with the Greek 'α' character. I don't want to convert the html elements inside the string ie text.
The function:
function grstrletter($string){
$skip = false;
$str_length = strlen($string);
for ($i=0; $i < $str_length; $i++){
if($string[$i] == '<'){
$skip = true;
}
if($string[$i] == '>'){
$skip = false;
}
if ($string[$i]=='a' && !$skip){
$string[$i] = 'α';
}
}
return $string;
}
Another function I have made works perfectly, but it doesn't take the HTML elements into account.
function grstrletter_no_html($string){
return strtr($string, array('a' => 'α'));
}
I also tried a lot of encoding functions that php offers with no luck.
When I echo the Greek letter, the browser outputs it without a problem. When I return the string, the browser outputs the classic strange question mark inside a triangle wherever a replacement occurred.
My header has <meta http-equiv="content-type" content="text/html; charset=UTF-8"> and I also tried it with php header('Content-Type: text/html; charset=utf-8'); but again with no luck.
The string comes from a database in UTF-8 and the site is in WordPress, so I just use the WordPress functions to get the content I want. I don't think it is a DB problem, because when I use my function grstrletter_no_html() everything works fine.
The problem seems to happen when I iterate the string character by character.
The file is saved as UTF-8 without BOM (notepad++). I tried also to change the encoding of the file with no luck again.
I also tried to replace the Greek letter with the corresponding HTML entities &alpha; and &#945;, but again with the same results.
I haven't tried yet any regex.
I would appreciate any help and thanks in advance.
Tried: Greek characters encoding works in HTML but not in PHP
EDIT
The solution, based on deceze's brilliant answer:
function grstrletter($string){
$skip = false;
// The string grows by one byte per replacement, so re-check its length on every pass
for ($i = 0; $i < strlen($string); $i++) {
if($string[$i] == '<'){
$skip = true;
}
if($string[$i] == '>'){
$skip = false;
}
if ($string[$i]=='a' && !$skip){
$part1 = substr($string, 0, $i);
$part1 = $part1 . 'α';
$string = $part1 . substr($string, $i+1);
}
}
return $string;
}
The problem is that you're setting only a single byte of your string. Example:
$str = "\x00\x00\x00";
var_dump(bin2hex($str));
$str[1] = "\xff\xff";
var_dump(bin2hex($str));
Output:
string(6) "000000"
string(6) "00ff00"
You're setting a two-byte character, but only one byte of it is actually pushed into the string. The second result here would have to be 00ffff for your code to work.
What you need is to cut the string from 0 to $i - 1, concatenate the 'α' into it, then concatenate the rest of the string $i + 1 to end onto it if you want to insert a multibyte character. That, or work with characters instead of bytes using the mbstring functions.
For more background information, see What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text.
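An alternative that avoids byte offsets entirely is to split the markup into tags and text and run an ordinary string replacement on the text parts only. The following is a rough sketch, not a full HTML parser: `replace_outside_tags` is a hypothetical helper name, and it assumes tags contain no literal `>` inside attribute values and ignores comments and script blocks.

```php
<?php
// Hypothetical helper: replaces $search with $replace everywhere except inside <...> tags.
function replace_outside_tags(string $html, string $search, string $replace): string
{
    // With PREG_SPLIT_DELIM_CAPTURE the result alternates: text, tag, text, tag, ...
    $parts = preg_split('/(<[^>]*>)/', $html, -1, PREG_SPLIT_DELIM_CAPTURE);
    foreach ($parts as $i => $part) {
        if ($i % 2 === 0) { // even index = text between tags
            $parts[$i] = str_replace($search, $replace, $part);
        }
    }
    return implode('', $parts);
}

$result = replace_outside_tags('<a href="a.html">banana</a>', 'a', 'α');
echo $result; // <a href="a.html">bαnαnα</a>
```

Because str_replace() works on whole byte sequences rather than single offsets, the two-byte 'α' is inserted intact.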

php multibyte string acessing via key [$i]

There is a string $string = "öşğüçı"; note that the last character is ı, not i.
When I try to print the first character with echo $string[0], it prints nothing. I know these are multibyte characters. Printing the first character can be accomplished with
echo $string[0].$string[1], but that is not what I want. The question is:
how can I handle the issue mentioned above so that I can write the loop below
for($i = 0; $i < sizeof($string); $i++)
echo $string[$i] . " ";
and have it print the following
ö ş ğ ü ç ı
PHP masters, please help.
To split a string into characters:
$string = "öşğüçı";
preg_match_all('/./u', $string, $m);
$chars = $m[0];
note the "u" flag in the regular expression
<?php
// inform the browser you are sending text encoded with utf-8
header("Content-type: text/plain; charset=utf-8");
// if you're using a literal string make sure the file
// is saved using utf-8 as encoding
// or if you're getting it from another source make sure
// you get it in utf-8
$string = "öşğüçı";
// if you do not have your string in utf-8
// you need to find out the actual encoding
// and use "iconv" to convert it to utf-8
// process the string using the mb_* functions
// knowing that it is encoded in utf-8 at this point
$encoding = "UTF-8";
for($i = 0; $i < mb_strlen($string, $encoding); $i++) {
echo mb_substr($string, $i, 1, $encoding);
}
Of course if you prefer another encoding (but I wouldn't see why; maybe just utf-16) you can substitute each instance of "utf-8" from above with your desired encoding and read and use accordingly.
Example for UTF-16 output (file/input is encoded in UTF-8)
<?php
header("Content-type: text/plain; charset=utf-16");
$string = "öşğüçı";
$string = iconv("UTF-8", "UTF-16", $string);
$encoding = "UTF-16";
for($i = 0; $i < mb_strlen($string, $encoding); $i++) {
echo mb_substr($string, $i, 1, $encoding);
}
You cannot handle multi-byte strings in this way in PHP. If it's a fixed-length encoding, where every character takes up, say, two bytes, you can simply take two bytes at a time. If it's a variable-length encoding like UTF-8 though, you will need to use mb_substr and mb_strlen.
May I recommend What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text, which explains this in more detail.
Use iconv_substr or mb_substr to get a character, and iconv_strlen or mb_strlen to get the length of the string in characters.
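For the loop from the question, one option is to split the string into an array of characters first and then index that array. A minimal sketch, assuming the string is valid UTF-8 (on PHP 7.4 and later, mb_str_split() would do the same job):

```php
<?php
$string = "öşğüçı";

// An empty pattern with the "u" flag splits between every UTF-8 character.
$chars = preg_split('//u', $string, -1, PREG_SPLIT_NO_EMPTY);

for ($i = 0; $i < count($chars); $i++) {
    echo $chars[$i] . " ";
}
```

After the split, $chars[$i] behaves the way $string[$i] was expected to in the question.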

php true multi-byte string shuffle function?

I have a unique problem with multibyte character strings and need to be able to shuffle, with some fair degree of randomness, a long UTF-8 encoded multibyte string in PHP without dropping or losing or repeating any of the characters.
In the PHP manual under str_shuffle there is a multi-byte function (the first user submitted one) that doesn't work: If I use a string with for example all the Japanese hiragana and katakana of string length (ex) 120 chars, I am returned a string that's 119 chars or 118 chars. Sometimes I've seen duplicate chars even though the original string doesn't have them. So that's not functional.
To make this more complex, I also need to include if possible Japanese UTF-8 newlines and line feeds and punctuation.
Can anyone with experience dealing in multiple languages with UTF-8 mb strings help? Does PHP have any built in functions to do this? str_shuffle is EXACTLY what I want. I just need it to also work on multibyte chars.
Thanks very much!
Try splitting the string using mb_strlen and mb_substr to create an array, then using shuffle before joining it back together again. (Edit: As also demonstrated in @Frosty Z's answer.)
An example from the PHP interactive prompt:
php > $string = "Pretend I'm multibyte!";
php > $len = mb_strlen($string);
php > $sploded = array();
php > while($len-- > 0) { $sploded[] = mb_substr($string, $len, 1); }
php > shuffle($sploded);
php > echo join('', $sploded);
rmedt tmu nIb'lyi!eteP
You'll want to be sure to specify the encoding, where appropriate.
This should do the trick, too. I hope.
// Renamed from "String": that name is reserved for class names as of PHP 7
class MbString
{
    public function mbStrShuffle($string)
    {
        $chars = $this->mbGetChars($string);
        shuffle($chars);
        return implode('', $chars);
    }
    public function mbGetChars($string)
    {
        $chars = [];
        // Use the same encoding for counting and for extracting characters
        for ($i = 0, $length = mb_strlen($string, 'UTF-8'); $i < $length; ++$i)
        {
            $chars[] = mb_substr($string, $i, 1, 'UTF-8');
        }
        return $chars;
    }
}
I like to use this function:
function mb_str_shuffle($multibyte_string = "abcčćdđefghijklmnopqrsštuvwxyzžß,.-+'*?=)(/&%$#!~ˇ^˘°˛`˙´˝") {
$characters_array = mb_str_split($multibyte_string);
shuffle($characters_array);
return implode('', $characters_array); // or join('', $characters_array); if you have a death wish (JK)
}
1. Split the string into an array of multibyte characters
2. Shuffle the good guy array who doesn't care about his residents being multibyte
3. Join the shuffled array together into a string
Of course I normally wouldn't have a default value for function's parameter.
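Since the question is specifically about characters being dropped or duplicated, a quick way to check any shuffle implementation is to compare the sorted characters before and after. The sketch below uses a hypothetical `shuffle_utf8` helper and a made-up sample string that includes Japanese punctuation and a newline; it assumes valid UTF-8 input.

```php
<?php
// Hypothetical helper: split into UTF-8 characters, shuffle, join.
function shuffle_utf8(string $text): string
{
    $chars = preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY);
    shuffle($chars);
    return implode('', $chars);
}

$original = "あいうえお、カキクケコ。\n";
$shuffled = shuffle_utf8($original);

// Sorting both character lists makes them comparable regardless of order.
$before = preg_split('//u', $original, -1, PREG_SPLIT_NO_EMPTY);
$after  = preg_split('//u', $shuffled, -1, PREG_SPLIT_NO_EMPTY);
sort($before);
sort($after);

var_dump($before === $after); // bool(true): nothing lost, nothing duplicated
```

shuffle() is not cryptographically secure, so this is suitable for display purposes rather than anything security-sensitive.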
