How to get text from a Word file using PHP accurately? - php

I am using the doc2txt.class.php class to extract the text from a Word file in PHP, with the code below:
require("doc2txt.class.php");
$docObj = new Doc2Txt("test.docx");
$txt = $docObj->convertToText();
My Word file contains the text below:
MWONGOZO WA MAOMBI MAALUMU (MAOMBI YA HATARI).
Huu ni Mfano Tu, Jinsi Ya Kuomba Na Maeneo Ya Kuombea! Unatakiwa pamoja na KUWA NA BIDII, KUMTEGEMEA SANA ROHO MTAKATIFU NI MUHIMU SANA!
MAOMBI MAALUMU YA JINSI YA KUPAMBANA KATIKA VITA VYA KIROHO
Jinsi Ya Kuomba Maombi Haya
But the output I get is slightly different:
MWONGOZO WA MAOMBI MAALUMU (MAOMBI YA HATARI).Huu ni Mfano Tu, Jinsi Ya Kuomba Na Maeneo Ya Kuombea! Unatakiwa pamoja na KUWA NA BIDII, KUMTEGEMEA SANA ROHO MTAKATIFU NI MUHIMU SANA! MAOMBI MAALUMU YA JINSI YA KUPAMBANA KATIKA VITA VYA KIROHOJinsi Ya Kuomba Maombi Haya
As you can see, the output joins KIROHO and Jinsi into a single word, KIROHOJinsi, so a word count gives 45 words when there are actually 46.
Is there any way to resolve this issue?

I have tested this code on a .txt file and it works correctly. It may help you:
$myfile = file_get_contents("test.txt");
$lines = explode("\n", $myfile);
$count = 0; // running total of words
foreach ($lines as $line) {
    // split the line on spaces and drop the empty strings left by repeated spaces
    $words = array_filter(explode(" ", trim($line)), 'strlen');
    $count += count($words);
}
echo $count;
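The count above works on plain text, but the fused words come from the extraction step, where the paragraph boundary is dropped. A minimal sketch of one way around it, assuming the file is a standard .docx (a zip archive whose body lives in word/document.xml, with each paragraph closed by </w:p>) and that the zip extension is available. The function names documentXmlToText, docxToText and countWords are made up for this example and are not part of doc2txt.class.php.

```php
<?php
// Hypothetical helper: convert WordprocessingML to plain text, one line per paragraph.
function documentXmlToText(string $xml): string
{
    // Every paragraph ends with </w:p>; add a newline there so words do not fuse.
    $xml = str_replace('</w:p>', "</w:p>\n", $xml);
    // Tabs and explicit line breaks are empty elements; turn them into spaces.
    $xml = preg_replace('/<w:(?:tab|br)\b[^>]*\/>/', ' ', $xml);
    // Drop the remaining tags and decode XML entities such as &amp;.
    $text = html_entity_decode(strip_tags($xml), ENT_QUOTES | ENT_XML1, 'UTF-8');
    return trim($text);
}

// Hypothetical helper: read word/document.xml out of the .docx archive.
function docxToText(string $path): string
{
    $zip = new ZipArchive();
    if ($zip->open($path) !== true) {
        throw new RuntimeException("Cannot open $path");
    }
    $xml = $zip->getFromName('word/document.xml');
    $zip->close();
    if ($xml === false) {
        throw new RuntimeException("No word/document.xml in $path");
    }
    return documentXmlToText($xml);
}

// Count words by splitting on any run of whitespace, newlines included.
function countWords(string $text): int
{
    return count(preg_split('/\s+/u', $text, -1, PREG_SPLIT_NO_EMPTY));
}
```

Usage would be something like echo countWords(docxToText("test.docx")); since the newline between paragraphs counts as whitespace, KIROHO and Jinsi are counted separately.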

Related

Script to replace comma "," by "->" as multiple value separator on category field of csv (only on this field of csv)

I need a PHP script that replaces the comma "," with "->" as the multi-value separator in the category field of a CSV file.
In the attached CSV sample, the field value in the first row is
;ALIMENTACIÓN,GRANEL,Cereales legumbres y frutos secos,Desayuno y entre horas,Varios;
I need it to be replaced with:
;ALIMENTACIÓN->GRANEL->Cereales legumbres y frutos secos->Desayuno y entre horas->Varios;
I tried this code in my PHP script:
file_put_contents("result.csv",str_replace(",","->",file_get_contents("origin.csv")));
It works, but it replaces the commas in every field. I need to change them only in the category (CAT) field; the commas in the description field and in the other fields must stay as they are.
Thank you in advance.
Sample of my CSV file (header and 3 rows; I truncated the description field):
id;SKU;DEFINICION;AMPLIACION;DISPONIBLE;IVA;REC_EQ;PVD;PVD_IVA;PVD_IVA_REC;PVP;PESO;EAN;HAY_FOTO;IMAGEN;FECHA_IMAGEN;CAT;MARCA;FRIO;CONGELADO;BIO;APTO_DIABETICO;GLUTEN;HUEVO;LACTOSA;APTO_VEGANO;UNIDAD_MEDIDA;CANTIDAD_MEDIDA;
1003;"01003";"COPOS DE AVENA 1000GR";"Los copos son granos de cereales que han sido aplastados para facilitar su digestion, manteniendo integras las propiedades del grano.<br>
La avena contiene proteínas en abundancia, así como hidratos de carbono, grasas saludables...";59;2;1.40;2.20;2.42;2.45;3.14;1;"8423266500305";1;"https://distribudiet.net/webstore/images/01003.jpg";"04/03/2020 0:00:00";ALIMENTACIÓN,GRANEL,Cereales legumbres y frutos secos,Desayuno y entre horas,Varios;GRANOVITA;0;0;0;0;1;0;0;1;kilo;1
1018;"01018";"MUESLI 10 FRUTAS 1000GR";"Receta de muesli de cereales, diez tipos diferentes de deliciosas frutas desecadas, frutos secos, semillas de girasol, lino y sesamo.<br>
A finales del ...";63;2;1.40;4.66;5.13;5.19;6.65;1;"8423266500060";1;"https://distribudiet.net/webstore/images/01018.jpg";"04/03/2020 0:00:00";ALIMENTACIÓN,GRANEL,Desayuno y entre horas;GRANOVITA;0;0;0;0;;0;0;1;kilo;1
1037;"01037";"AZUCAR CAÑA INTEGRAL 1000GR";"Azúcar moreno de caña integral sin gluten para endulzar todo tipo de postres, batidos o tus recetas favoritas de repostería. 100% natural, obtenido sin procesamiento quimico por ...";17;2;1.40;3.43;3.77;3.82;4.90;1;"8423266500121";1;"https://distribudiet.net/webstore/images/01037.jpg";"04/03/2020 0:00:00";ALIMENTACIÓN,GRANEL,Endulzantes;GRANOVITA;0;0;0;0;0;0;0;1;kilo;1
<?php
$input = 'PRESTA.csv';
$output = 'OUTPUT.csv';
$file = str_replace("<br>\n", "<br>", file_get_contents($input)); // Remove newlines inside the description
$lines = explode("\r\n", $file); // Split the file into lines (assumes records end in CRLF)
$fp = fopen($output, 'w'); // Open output file for writing
for ($i = 0; $i < count($lines); ++$i) {
    $extract = str_getcsv($lines[$i], ';'); // Split using the ; delimiter
    if ($i > 0 && isset($extract[16])) { // Skip the header; index 16 is the "CAT" column
        $extract[16] = str_replace(',', '->', $extract[16]);
    } elseif ($i > 0) {
        var_dump($extract); // Some lines do not have a CAT field
    }
    fputcsv($fp, $extract, ';'); // Write the line to the file using the ; delimiter
}
fclose($fp);
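An alternative sketch, in case the line endings are not what the script above expects: fgetcsv reads one record at a time and keeps quoted fields together even when they contain line breaks, so the description does not need to be patched first. This assumes the multi-line descriptions are always enclosed in double quotes, as in the sample, and it looks the column up by its header name. The function name rewriteCategoryColumn is made up for this example.

```php
<?php
// Hypothetical helper: rewrite one named column of a ';'-delimited CSV.
// $in and $out are already-open stream resources.
function rewriteCategoryColumn($in, $out, string $column = 'CAT'): void
{
    $header = fgetcsv($in, 0, ';', '"', '\\');
    if ($header === false || $header === [null]) {
        return; // empty input, nothing to do
    }
    // Position of the column to rewrite, or false if the header lacks it.
    $index = array_search($column, $header, true);
    fputcsv($out, $header, ';', '"', '\\');

    while (($row = fgetcsv($in, 0, ';', '"', '\\')) !== false) {
        if ($row === [null]) {
            continue; // fgetcsv reports a blank line as [null]
        }
        if ($index !== false && isset($row[$index])) {
            $row[$index] = str_replace(',', '->', $row[$index]);
        }
        fputcsv($out, $row, ';', '"', '\\');
    }
}
```

It could be called with rewriteCategoryColumn(fopen('PRESTA.csv', 'r'), fopen('OUTPUT.csv', 'w')); rows that are too short to have a CAT value are copied through unchanged.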

Php split echo by counting characters echoed, add variable content and then keep echoing

I need my PHP code to count the characters of a text as it is echoed. When the count reaches 64, it should echo "$something" and then keep echoing from where it stopped.
Ideally, the code should not cut words in half.
For example
-- This:
echo 'This is a huge string that i mean to crop acording to it\'s character\'s count. For every 64 characters including spaces i need it to echo some other thing in the middle';
-- Would end up like this:
echo 'This is a huge string that i mean to crop acording to it\'s ' . $something . 'character\'s count. For every 64 characters including spaces ' . $something . 'i need it to echo some other thing in the middle';
For better understanding... I need this code to solve the fact that SVG text can't be wrapped and justified.
Would you use mb_strimwidth? If so, how?
Thanks in advance!
--- UPDATE 1 - I've unsuccessfully tried
echo mb_strimwidth($row['resumen'], 0, 84, "$something");
echo mb_strimwidth($row['resumen'], 64, 64, "$something");
echo mb_strimwidth($row['resumen'], 128, 64, "$something");
--- UPDATE 2 - PARTIAL SUCCESS!
$uno = substr($row['resumen'], 0, 64);
$dos = substr($row['resumen'], 64, 64);
$tres = substr($row['resumen'], 128, 64);
$suma = $uno . "</text><text>" . $dos . "</text><text>" . $tres;
echo "$suma";
BUT THIS JUST echoes the first line of my text.
Finally got to this solution:
$n=0;
$var="Texto largo mayor a 64 caracteres que complica mi utilizacion de una infografia en svg, ya que este lenguaje no acepta wrappers para el texto. Era muy lindo para ser verdad.";
$ts= mb_strwidth($var);
// Now define a variable that tracks how much text is left to print.
$aimprimir=mb_strwidth($var);
if ($ts>64){
while ($aimprimir>64):
// while the text left to print is longer than 64...
echo mb_strimwidth($var,$n,70,"<br/>");
$aimprimir=$aimprimir-64;
$n=$n+65;
endwhile;
// if what is left to print is 64 or shorter, then print it...
echo mb_strimwidth($var,$n,$aimprimir,"<br/>");
}
else {
echo "$var";
}
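The loop above cuts at fixed offsets, so it can split a word in two. A minimal sketch of a word-aware alternative, assuming words are separated by whitespace and that a single word longer than the limit may simply get a line of its own. The function name wrapWords is made up for this example; mb_strlen is used so that accented characters count as one character each.

```php
<?php
// Hypothetical helper: break $text into lines of at most $width characters,
// splitting only between words.
function wrapWords(string $text, int $width = 64): array
{
    $words = preg_split('/\s+/u', trim($text), -1, PREG_SPLIT_NO_EMPTY);
    $lines = [];
    $line = '';
    foreach ($words as $word) {
        $candidate = $line === '' ? $word : $line . ' ' . $word;
        if (mb_strlen($candidate, 'UTF-8') <= $width) {
            $line = $candidate; // the word still fits on the current line
        } else {
            if ($line !== '') {
                $lines[] = $line; // close the current line
            }
            $line = $word; // start a new line with this word
        }
    }
    if ($line !== '') {
        $lines[] = $line;
    }
    return $lines;
}
```

For the SVG case, the lines can then be joined with whatever goes in between, for example echo implode("</text><text>", wrapWords($row['resumen'], 64));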

Regex does not work in large string with html content [PHP]

I am trying to match values such as R$XX,XX (where each X is a digit) using a regular expression, but I cannot get it to work.
Below is my code:
$str = 'Indicada para 21 velocidades, corente indexadaCAPACETE MTB MANTUA MUSIC R$140,00PEDIVELA SHIMANO DEORE R$380,00PEDIVELA SHIMANO TX-71 R$99,00CORRENTE SHIMANO HG 40 R$55,00ROLO PARA TREINAMENTO TRANZ-X R$545,00CAPACETE MTB HIGH ONE (PROMOÇÃO) R$85,00BOMBA DE PÉ HIGH ONE COM MANÔMETRO (NYLON) R$89,90CAPA SELIM GEL (PRÓ-SPIN) R$45,00SUPORTE DE PAREDE VERTICAL R$20,00SUPORTE DE PAREDE HORIZONTAL R$35,00SUPORTE DE PAREDE VERTICAL PRETO R$28,00ESPUMA PARA GUIDÃO R$11,00BOMBA DE PÉ BETO NYLON R$55,00
Bomba pé nylon, acompanha adaptadores: valvula,bola e infláveisALAVANCA SHIMANO XT DUAL CONTROL EFM 761 R$500,00
Alavanca (par) 27 velocidades com manetes para freios mecânicos, com tecnologia "Dual Control" que chega muito próximo do sistema "STI" das bikes de corrida.
SAPATILHA SHIMANO MTB M 064 R$285,00
Pele sintética e malha flexível, resistentes ao esticar.
Entressola de poliamida reforçada com fibra de de vidro.
Pamilha estruturalmente flexível de acordo com uma ampla variedade de formatos de pé.
Volume + forma para melhor acomodação dos dedos dos pés.
Proteção em borracha oferece excelante tração e conforto para o caminhar.
Indicada para o pedal PD-M530, PD-M520.
Acompanha a base interna da sapatilha.ALAVANCA SHIMANO EF 51 R$130,00
Alavanca shimano 21 vel, ez-fire c/ maneteCAMPAINHA "I LOVE MY BIKE" R$14,00
Em alumínio, nas cores: polido, preto, azul e vermelho.
Fácil fixação no guidão.CAPACETE INFANTIL R$57,00CESTA ALUMÍNIO E NYLON
';
$regex = "/R\$[0-9]{1,},[0-9]{1,}/";
$result = preg_match_all($regex, $content, $rs);
var_dump($rs);
What's going on?
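Two things go wrong. First, the string is stored in $str but the code searches $content. Second, inside a double-quoted PHP string, \$ is PHP's own escape for a dollar sign, so the regex engine receives a bare $, which is the end-of-subject anchor, and the pattern can never match. Put the pattern in single quotes (or write \\\$ in double quotes) so that the backslash reaches the regex engine:
$content = 'R$13,57 more text R$123,456';
$regex = '/R\$[0-9]+,[0-9]+/';
$result = preg_match_all($regex, $content, $rs);
var_dump($rs);
No capturing group is needed; $rs[0] holds the full matches.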

Can't do anything with text file contents (file_get_contents)

FIXED!
File encoding is UTF-16LE, changed to UTF-8 in PhpStorm and it behaves.
===========================================================
I'm reading a text file in PHP and want to manipulate its contents, but as soon as I touch the contents in any way it 'breaks'.
If I read the file and then echo it, the text is displayed, but any other operation will not work.
$contents = file_get_contents($file);
echo $contents; // works
$contents .= 'a longer test' . $contents;
echo $contents;
My ultimate goal is to run some regex’s on the contents before dumping it into a database but I need to be able to work with it first.
If it makes any difference I am using Laravel. I tried File::get($file) but have the same outcome.
EDIT to show output - Unicode issue?
//// first echo
POUR L ’É T U DE DE L ’H IST O IR E ET DE LA LANGUE DU PAYS, LA CONSERVATION DES A N TIQ U ITÉS DE L ’IL E , ET LA PUBLICATION DE DOCUMENTS HISTORIQUES, ETC., ETC. FONDÉE LE 28 JANVIER, 1873. DIXIÈME BULLETIN ANNUEL. : C. LE F E U VRE, IM PR IM E U R -É D IT EU R D E LA SOCIÉTÉ, BERESFORD LIBRARY , ST. -H ÉLIE R . 1885. = Page 1 =
// Second echo
POUR L ’É T U DE DE L ’H IST O IR E ET DE LA LANGUE DU PAYS, LA CONSERVATION DES A N TIQ U ITÉS DE L ’IL E , ET LA PUBLICATION DE DOCUMENTS HISTORIQUES, ETC., ETC. FONDÉE LE 28 JANVIER, 1873. DIXIÈME BULLETIN ANNUEL. : C. LE F E U VRE, IM PR IM E U R -É D IT EU R D E LA SOCIÉTÉ, BERESFORD LIBRARY , ST. -H ÉLIE R . 1885. = Page 1 =⁡潬杮牥琠獥エ෾਀ഀ਀匀伀䌀䤀䔀吀䔀  䨀䔀刀匀䤀䄀䤀匀䔀ഀ਀倀伀唀刀  䰀 ᤀ줠 吀 唀 䐀䔀  䐀䔀  䰀 ᤀ䠠 䤀匀吀 伀 䤀刀 䔀  䔀吀  䐀䔀  䰀䄀  䰀䄀一䜀唀䔀  䐀唀  倀䄀夀匀Ⰰഀ਀䰀䄀  䌀伀一匀䔀刀嘀䄀吀䤀伀一  䐀䔀匀  䄀 一 吀䤀儀 唀 䤀吀준匀  䐀䔀  䰀 ᤀ䤠䰀 䔀 Ⰰ  䔀吀  䰀䄀  倀唀䈀䰀䤀䌀䄀吀䤀伀一 ഀ਀䐀䔀  䐀伀䌀唀䴀䔀一吀匀  䠀䤀匀吀伀刀䤀儀唀䔀匀Ⰰ  䔀吀䌀⸀Ⰰ  䔀吀䌀⸀ഀ਀䘀伀一䐀준䔀  䰀䔀  ㈀㠀  䨀䄀一嘀䤀䔀刀Ⰰ  ㄀㠀㜀㌀⸀ഀ਀䐀䤀堀䤀저䴀䔀  䈀唀䰀䰀䔀吀䤀一  䄀一一唀䔀䰀⸀ഀ਀㨀ഀ਀䌀⸀  䰀䔀   䘀 䔀 唀 嘀刀䔀Ⰰ  䤀䴀 倀刀 䤀䴀 䔀 唀 刀 ⴀ준 䐀 䤀吀 䔀唀 刀   䐀 䔀  䰀䄀  匀伀䌀䤀준吀준Ⰰഀ਀䈀䔀刀䔀匀䘀伀刀䐀  䰀䤀䈀刀䄀刀夀 Ⰰ  匀吀⸀ ⴀ䠀 준䰀䤀䔀 刀 ⸀ഀ਀㄀㠀㠀㔀⸀਀ഀ 㴀 倀愀最攀 ㄀ 㴀
If I put the first string into a HEREDOC all works fine, so it might be something with the txt file? It's text extracted by OCR from an old PDF.
Full code
public function import()
{
// get all the files
$files = File::files('../import');
foreach ($files as $file) {
// load text file contents
$contents = file_get_contents($file);
echo $contents; // as expected
$contents .= 'a longer test' . $contents;
echo $contents; // weird stuff
// test txt file contents inline
$contents2 = <<<EOD
SOCIETE JERSIAISE
POUR L ’É T U DE DE L ’H IST O IR E ET DE LA LANGUE DU PAYS,
LA CONSERVATION DES A N TIQ U ITÉS DE L ’IL E , ET LA PUBLICATION
DE DOCUMENTS HISTORIQUES, ETC., ETC.
FONDÉE LE 28 JANVIER, 1873.
DIXIÈME BULLETIN ANNUEL.
:
C. LE F E U VRE, IM PR IM E U R -É D IT EU R D E LA SOCIÉTÉ,
BERESFORD LIBRARY , ST. -H ÉLIE R .
1885.
= Page 1 =
EOD;
echo $contents2; // works
$contents2 .= 'a longer test' . $contents2;
echo $contents2; // prints as expected
}
}
FIXED!
File encoding is UTF-16LE, changed to UTF-8 in PhpStorm and it behaves.
Or in code:
foreach ($files as $file) {
// load text file contents
$contents = file_get_contents($file);
// fix encoding
$contents = mb_convert_encoding($contents, 'UTF-8', 'UTF-16');
echo $contents;
.....
$data_to_write = 'test';
$file_handle = fopen($file, 'a');
fwrite($file_handle, $data_to_write);
fclose($file_handle);
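Why the concatenation broke, as far as I can tell: UTF-16 stores each of these characters in two bytes, and 'a longer test' adds 13 single bytes, an odd number, so every two-byte unit after it is shifted by one byte and decodes as unrelated characters. A small sketch that converts only when a byte order mark says the data is UTF-16, assuming the file starts with one. The function name toUtf8 is made up for this example, and it does not try to detect UTF-32 or BOM-less files.

```php
<?php
// Hypothetical helper: return UTF-8 text, using the byte order mark (BOM)
// at the start of $bytes to decide whether a conversion is needed.
function toUtf8(string $bytes): string
{
    if (strncmp($bytes, "\xEF\xBB\xBF", 3) === 0) {
        return substr($bytes, 3); // already UTF-8, just drop the BOM
    }
    if (strncmp($bytes, "\xFF\xFE", 2) === 0) {
        // little-endian UTF-16: strip the BOM, then convert
        return mb_convert_encoding(substr($bytes, 2), 'UTF-8', 'UTF-16LE');
    }
    if (strncmp($bytes, "\xFE\xFF", 2) === 0) {
        // big-endian UTF-16: strip the BOM, then convert
        return mb_convert_encoding(substr($bytes, 2), 'UTF-8', 'UTF-16BE');
    }
    return $bytes; // no BOM: assume the data is already UTF-8
}
```

With this, $contents = toUtf8(file_get_contents($file)); can be followed by string concatenation or regular expressions as usual.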

read file in descending php [closed]

Closed 11 years ago.
Hi, I'm trying to read a file in descending order.
I want to echo the last 10 words from the file.
Expected result:
brian tracy, brian tracy, der reiche
sack, der reiche sack, der reiche
sack, electrical machines by charles s
siskind second e, test de politica
fiscal, gigantomastia,gigantomastia,,
a,
File I want to read:
find a doctor, Find a Doctor,technique with fingers of right hand over left ven, la empresa adaptable, la empresa adaptable en la era de la informaci n, la pobre mia, probabilidad estadistica, crack beam, dwarf rabbit, probabilidad estadistica, kamsutra bangla, power of the dog, power of the dog, prinsip kerja uji ninhidrin, letramania 3, gre, gre, prinsip kerja uji ninhidrin, prinsip kerja uji ninhidrin, artificial intelligence a modern approach, configuring sap erp financials and controlling, gas spring, imperio carolingio, blue collar man, caligrafia, wonderlic, women and weight loss tamasha, women and the weight loss tamasha, vivir amar y aprender leo buscaglia, vivir amar y aprender leo buscaglia, wonderlic, plan de manejo ambiental, calibra o de manometros, curso de carpinteria, secreto industrial, secreto industrial, deneme, elementos secundarios de un triangulo, imperio carolingio, caligrafia, construir en lo construido, plan de manejo ambiental, lisboa, lisboa secreta, modelo de contrato secreto industrial, el conde de montecristo, metode titrasi formol, metode titrasi formol, probabilidad estadistica, probabilidad estadistica, history of islam akbar shah najeebabadi, caligrafia, caligrafia, conversacion en la catedral, brian tracy, brian tracy, der reiche sack, der reiche sack, der reiche sack, electrical machines by charles s siskind second e, test de politica fiscal, gigantomastia,gigantomastia, Find a Doctor, Find a Doctor,technique with fingers of right hand over left ven, la empresa adaptable, la empresa adaptable en la era de la informaci n, la pobre mia, probabilidad estadistica, crack beam, dwarf rabbit, probabilidad estadistica, kamsutra bangla, power of the dog, power of the dog, prinsip kerja uji ninhidrin, letramania 3, gre, gre, prinsip kerja uji ninhidrin, prinsip kerja uji ninhidrin, artificial intelligence a modern approach, configuring sap erp financials and controlling, gas spring, imperio carolingio, blue collar man, caligrafia, 
wonderlic, women and weight loss tamasha, women and the weight loss tamasha, vivir amar y aprender leo buscaglia, vivir amar y aprender leo buscaglia, wonderlic, plan de manejo ambiental, calibra o de manometros, curso de carpinteria, secreto industrial, secreto industrial, deneme, elementos secundarios de un triangulo, imperio carolingio, caligrafia, construir en lo construido, plan de manejo ambiental, lisboa, lisboa secreta, modelo de contrato secreto industrial, el conde de montecristo, metode titrasi formol, metode titrasi formol, probabilidad estadistica, probabilidad estadistica, history of islam akbar shah najeebabadi, caligrafia, caligrafia, conversacion en la catedral, brian tracy, brian tracy, der reiche sack, der reiche sack, der reiche sack, electrical machines by charles s siskind second e, test de politica fiscal, gigantomastia,gigantomastia,, a,
If the file will not be too big, you can simply read it all and then remove the data you don't need :
$content = file_get_contents($filename); // $filename is the file to read
$chunks = explode($delimiter, $content); // $delimiter is your word separator
$chunks = array_slice($chunks, -$n); // $n is the number of words to keep from the end of the file
// NOTE : -$n !
If the file will grow beyond a reasonable size to be loaded into memory, you can read it in chunks, starting from the end. Something like this sketch:
function getLastTokens($filename, $n, $delimiter) {
    $offset = filesize($filename);
    $chunksize = 4096; // 4K chunk
    if ($offset <= $chunksize * 2) {
        // the file is small enough: use the one-liner
        $tokens = explode($delimiter, file_get_contents($filename));
    } else {
        $tokens = array();
        $carry = ''; // partial token from the start of the previously read chunk
        $fp = fopen($filename, 'r');
        while (count($tokens) < $n && $offset > 0) {
            $length = min($chunksize, $offset); // can't seek before the first byte
            $offset -= $length; // step back one chunk
            fseek($fp, $offset);
            $data = fread($fp, $length) . $carry; // this chunk plus the carried partial token
            $parts = explode($delimiter, $data);
            // unless we reached the start of the file, the first part may be incomplete
            $carry = $offset > 0 ? array_shift($parts) : '';
            $tokens = array_merge($parts, $tokens); // prepend, keeping file order
        }
        fclose($fp);
    }
    return array_slice($tokens, -$n);
}
$file = "File contents"; // file_get_contents() or anything else here
$array = explode(",", $file);
$array = array_slice($array, -10, 10); // starting from the 10th element from the end, take ten elements
$string = implode(", ", $array);
echo $string;
Edit:
Changed the implementation to remove the loop and the count etc.
$text = file_get_contents($file); //get contents of file
$words = explode(',', $text); //split into array
$length = count($words);
if ($length < 10) {
$lastWords = $words; //shorter than 10 so return all
} else {
$lastWords = array();
for ($i = $length - 10; $i < $length; $i++) { //loop through last 10 words
$lastWords[] = $words[$i]; //add to array
}
}
$str = implode(',', $lastWords); //change array back into a string
echo $str;
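One thing the snippets above do not handle: the sample file ends with ",, a," so a plain explode produces empty entries and entries with leading spaces, and those would count towards the ten. A minimal sketch that trims each entry and drops the empty ones before slicing, assuming that empty entries should not count as words. The function name lastTokens is made up for this example.

```php
<?php
// Hypothetical helper: return the last $n non-empty, trimmed tokens of $text.
function lastTokens(string $text, int $n = 10, string $delimiter = ','): array
{
    // Split on the delimiter and strip surrounding whitespace from each token.
    $tokens = array_map('trim', explode($delimiter, $text));
    // Drop the empty tokens produced by ",," or a trailing delimiter.
    $tokens = array_filter($tokens, function ($token) {
        return $token !== '';
    });
    // array_slice re-indexes the result from 0.
    return array_slice($tokens, -$n);
}
```

For example, echo implode(', ', lastTokens(file_get_contents($file), 10)); prints the last ten entries separated by a comma and a space.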
