Opening an encoded file with PHP

I am opening a file on the server with PHP. The file seems ordinary: it opens in Notepad and TextEdit on a PC, and PHP can display it in a web browser without any issue when I echo it out.
But when I search it with strpos(), it can't find anything except single characters. If I search for a string of two or more characters, it finds nothing.
I have tried converting it to UTF-8, and the encoding is detected as ASCII, so everything seems right there.
I have also isolated the part of the file that I am trying to read down to only 250 characters. They all look fine on the screen.
But strpos() can't find the string. I've tested every part of my code and I believe the code itself is fine. I think the problem is that the characters I see on the screen do not exactly match the bytes that are really in the file.
My last resort is to write a function that converts each character into an integer array (if that's even possible) and then converts all of that back to a string. That way I would know for certain that the characters I see are real.
I'm hoping somebody has a better approach, or an idea about something I missed.
I'll post the code below:
$content = file_get_contents($file->getPathname()); // get the file contents
$content = substr($content, 30, 300); // reduce the large file to just the first few lines
$content = htmlspecialchars($content); // try to remove any special characters from the file
$content = iconv('ASCII', 'UTF-8//IGNORE', $content); // encode to a friendly format
$string = "JobName"; // this is the string i'm searching for
if (strpos($content, $string) !== false) {
    echo "bingo";
}
else {
    echo " not found ";
}
Just to be clear, the file I'm opening is generated by a PC program that stores its data in .DAT format. As I said, I can see and read the content very easily using any program, including PHP. But when I try to search it, it's as if it doesn't recognize the content at all.
I am not aware of how to upload a file on StackOverflow, but if someone can tell me how to do it then I will gladly post the file itself.

Thank you very much for your help, ARKASCHA. I found an online hex editor, and when I looked at the bytes, it turned out there is a NUL character between every pair of characters in this file. That's probably why I couldn't see the problem in a regular view. I just had to run an additional function to remove the NUL characters from the file, and now it works as it's supposed to. Thanks again.
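A minimal sketch of the fix described above. The sample bytes are hypothetical, and the guess that the file is UTF-16LE (which would explain a NUL after every ASCII character) is an assumption, not something the original poster confirmed. The first option simply strips NUL bytes, as the poster did; the second uses iconv() to convert the encoding properly.

```php
<?php
// Hypothetical sample: "JobName" with a NUL byte after every character,
// which is what ASCII text looks like when stored as UTF-16LE.
$raw = "J\0o\0b\0N\0a\0m\0e\0";

// The multi-character search fails because the bytes are not adjacent,
// while a single character such as "J" is still found.
var_dump(strpos($raw, "JobName")); // bool(false)
var_dump(strpos($raw, "J"));       // int(0)

// Option 1: remove the NUL bytes (only safe if the text is plain ASCII).
$stripped = str_replace("\0", "", $raw);

// Option 2: convert the encoding, assuming the file really is UTF-16LE.
$converted = iconv("UTF-16LE", "UTF-8", $raw);

var_dump(strpos($stripped, "JobName") !== false);  // bool(true)
var_dump(strpos($converted, "JobName") !== false); // bool(true)
```

Running bin2hex() on the first few bytes of the file is a quick way to see such hidden bytes without a hex editor.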

Related

Argument error when running mkvextract with PHP

No matter how hard I try, mkvextract doesn't work properly. I know there is a problem with the file path, but I have tried hundreds of times and still could not get it to work. How can I run this correctly?
shell_exec("mkvextract tracks /home/movies/R-12/X-1 ÇĞŞZ.mkv");
or
$filename = "/home/movies/R-12/X-1 ÇĞŞZ.mkv";
echo shell_exec("mkvextract tracks \"$filename\"");
I am aware that the file path cannot be accessed because of the special characters.

There may be several issues:
A file read permission issue: the file exists, but PHP (and the mkvextract it runs) don't have permission to open it. In the rest of my answer I assume this is not happening, because you haven't added any error message containing the word permission or access to your question.
A shell argument escaping issue: correctly passing a command argument containing whitespace and/or shell metacharacters (e.g. ", \, $). I address this with escapeshellarg below.
A filename encoding issue: correctly specifying non-ASCII characters in filenames. I address this with mb_convert_encoding below.
For testing purposes, make a copy of the input file to /home/movies/t.mkv, and then try echo shell_exec("mkvextract tracks /home/movies/t.mkv").
If that works, then rename the copy to /home/movies/t t.mkv, and then try echo shell_exec("mkvextract tracks " . escapeshellarg("/home/movies/t t.mkv")). Without the escapeshellarg call, it wouldn't work, because the filename contains a space.
If that works, then the problem is with non-ASCII characters in the filename. To investigate it further, examine the output of var_dump(scandir("/home/movies/R-12")), and see how the letters with accents appear there. Pass it the same way to shell_exec. Don't forget about escapeshellarg.
If that works, use encoding conversion (with mb_convert_encoding) for the remaining filenames. You may want to ask a separate question about that, specifying the output of var_dump(scandir("/home/movies/R-12")) and var_dump("X-1 ÇĞŞZ.mkv") in your question.
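The escaping step above can be sketched as follows. The helper name build_mkvextract_command is hypothetical, the mkvextract arguments are copied from the question as-is, and the locale name "en_US.UTF-8" is an assumption that must match a locale installed on the server. The setlocale() call is there because escapeshellarg() can drop multibyte characters when LC_CTYPE is not a UTF-8 locale.

```php
<?php
// Hypothetical helper: builds the command line without running it.
function build_mkvextract_command(string $path): string
{
    // Assumption: this locale exists on the system. If it does not,
    // setlocale() returns false and the current locale stays in effect.
    setlocale(LC_CTYPE, "en_US.UTF-8");

    // escapeshellarg() quotes the path so spaces and shell
    // metacharacters reach mkvextract as a single argument.
    // "2>&1" captures error messages in the output as well.
    return "mkvextract tracks " . escapeshellarg($path) . " 2>&1";
}

$cmd = build_mkvextract_command("/home/movies/t t.mkv");
echo $cmd, PHP_EOL;

// Uncomment on a machine where mkvextract is installed:
// echo shell_exec($cmd);
```

Printing the command before running it makes it easy to paste the exact same line into a terminal and compare the behaviour.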

$filename = "/home/movies/R-12/X-1 ÇĞŞZ.mkv";
echo shell_exec("sudo mkvextract tracks \"$filename\"");
I guess the whole problem was not adding sudo :)

PHP preg_match on own computer doesn't work

I have this code:
$success = preg_match('/(.+(駅前)?駅) (\(([^線]+線)\) )?((([^線 ]+) )?(\d+[分時])?)/u', $m, $matches);
Example input text is
大正駅 (JR大阪環状線) ﾊﾞｽ 20分
This regex works on https://regex101.com/ and the code works on http://sandbox.onlinephpfunctions.com/. However, when I run the PHP code on my own computer, it never gives me a match. $matches is an empty array, and $success is 0. Yes, the exact same code. I have verified that the regex is correct (using first link) and that the code itself works (using second link). However, it still refuses to work on my own PC.
OS is Arch Linux, running PHP 7.3.11, system locale is ja_JP.UTF-8 (which I don't think matters, but just in case)
Does anyone see anything wrong with the code?

So I was able to find the problem.
First, I tried just the one-liner commented by Nick (3v4l.org/o4ADM) on my PC, and it works. (Of course it should. PHP can't be broken.)
So I figured that the data I'm feeding to preg_match must be what's broken.
Normal prints and echos were in vain--$m is always how it should be. Then I considered AD7six's comment,
Check that the bytes for 駅 etc. are actually the same
so I looked carefully to check that the characters are all Japanese and no Chinese variants are there. And it's all Japanese, it's fine.
So what could it be?
I used PHP's file_put_contents to dump the variable to a file, then typed the same text manually with my Japanese keyboard and saved it to another file. I opened Meld (a diff tool), compared the two texts, and voila: the spaces in the dumped text use a different code point than the usual half-width space (0x20). They use 0xA0 instead, which is apparently a "no-break space". What the heck.
Fortunately, a simple $m = str_replace("\u{00A0}", " ", $m) did the trick.
Thanks to everyone for leading me to the right answer!
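A small sketch of the diagnosis and fix described above. The function name normalize_spaces and the sample string are illustrative assumptions; the replacement itself is the one from the answer. It requires PHP 7 or later for the "\u{...}" escape.

```php
<?php
// Hypothetical helper wrapping the str_replace from the answer:
// turns U+00A0 (no-break space) into an ordinary space (0x20).
function normalize_spaces(string $text): string
{
    return str_replace("\u{00A0}", " ", $text);
}

// Illustrative input containing a no-break space instead of a normal one.
$m = "大正駅\u{00A0}(JR大阪環状線)";

// In UTF-8 the no-break space is the two bytes c2 a0, so it shows up
// in a hex dump even though it looks like a space on screen.
echo bin2hex("\u{00A0}"), PHP_EOL; // c2a0

// Quick check for the character before and after normalizing.
var_dump(strpos($m, "\u{00A0}") !== false);                   // bool(true)
var_dump(strpos(normalize_spaces($m), "\u{00A0}") !== false); // bool(false)
```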

Google translate API - too many characters being sent - how to debug?

I am using this script (https://github.com/viniciusgava/google-translate-php-client) to pass translation of text to Google. The text is generated through a script I coded myself.
The problem is that Google says I'm passing 1-3k more characters than I actually am. I have gone through the code and made sure the loops are tight, and commented out any sections that could possibly be leaking, but no matter what, too many characters are being passed.
Also, I tested doing this:
$text = (strlen($string) > 2000) ? substr($string, 0, 2000) . '...' : $string;
Even so, Google is saying I sent 3,300 characters.
My question is, how can I debug what's actually being sent to Google so I can identify the "leak"? How can I retrieve a list of everything that has been sent?
Note: The script I use for character counting (http://www.javascriptkit.com/script/script2/charcount.shtml) already includes whitespaces, html and punctuation when counting the amount of characters.
Thank you for the help.
Update:
I ran a test completely separate from my script: I input 550 words in raw $string = "words"; format, ran the translation, and Google is reporting 1000! So the same thing is happening, and I'm wondering if it has something to do with the script itself.
Update 2:
I ran a test using the official GTranslate API script, with 60 characters, Google recorded it as 100 characters. So something is happening that's doubling my characters, even on the official script.

Settings that could influence PHP str_replace behaviour

I am currently working on a replacement tool that will dynamically replace certain strings (including html) in a website using a smarty outputfilter.
For the replacement to take place, I am using PHP's str_ireplace function. The code to be replaced and the replacement code are read from a database, and the result is passed to the smarty output (using an output filter), similar to the line below.
$tpl_source = str_ireplace($replacements['sourceHTML'], $replacements['replacementHTML'], $tpl_source);
The problem is that although it works great on my dev server, once uploaded to the live server the replacements occasionally fail. The same replacements work just fine on my dev version. After some examination and googling I could not find out much about this issue. So my question is: what could influence str_replace's behaviour?
Thanks
Edit with replacement example:
$htmlsource = file_get_contents('somefile.html');
$newstr = str_replace('Some text', 'sometext', $htmlsource); // the text to be replaced does exist in the html source
fails to replace. After some checking, it looks like the combination of "> creates a problem. But just the combination of it. If I try to change only (") it works, if I try to change only (>) it works.

It might be that special characters like umlauts are not handled correctly on the live server, so str_replace() would fail if the string you want to replace contains special characters.

Is the input string identical on both systems? Have you verified this? Are you sure?
Things to check:
Are the HTML attributes in the same order?
Are the attribute values using the same kind of quote marks? (eg <a href='#'> vs <a href="#">)
Is there any other stray HTML code getting in there?
Is the entity encoding the same? (eg &nbsp; vs &#160; - same character; different HTML)
Is the character-set the same? (eg utf-8 vs ISO 8859-1: Accented characters will be encoded differently)
Any of these things will affect the result and produce the failures you're describing.
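One way to run these checks is to compare the two strings byte by byte. The helper below is a hypothetical sketch, not part of the original answer: it returns the offset of the first differing byte, which can then be inspected with bin2hex().

```php
<?php
// Hypothetical helper: offset of the first byte where $a and $b differ,
// or null when the two strings are identical.
function first_difference(string $a, string $b): ?int
{
    $len = min(strlen($a), strlen($b));
    for ($i = 0; $i < $len; $i++) {
        if ($a[$i] !== $b[$i]) {
            return $i;
        }
    }
    // One string is a prefix of the other, or they are equal.
    return strlen($a) === strlen($b) ? null : $len;
}

// Illustrative inputs taken from the quote-mark example above.
$expected = "<a href='#'>";
$actual   = '<a href="#">';

$offset = first_difference($expected, $actual);
if ($offset !== null) {
    // Show a few bytes around the mismatch in hex for both strings.
    echo "differs at byte $offset", PHP_EOL;
    echo bin2hex(substr($expected, $offset, 8)), PHP_EOL;
    echo bin2hex(substr($actual, $offset, 8)), PHP_EOL;
}
```

Running this on the search string and on the relevant slice of the real page source from each server shows whether the inputs truly match.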

This was a tricky problem, and it ended up having nothing to do with the str_replace function itself.
We are using smarty as a templating system. The str_replace function was used by a smarty output filter to change the HTML on some occasions, just before it was delivered to the user.
Here is the Smarty outputfilter code:
function smarty_outputfilter_replace($tpl_source, &$smarty)
{
    $replacements = Content::getReplacementsForPage();
    if (is_array($replacements))
    {
        foreach ($replacements as $replacementData)
        {
            $tpl_source = str_replace($replacementData['sourcecode'], $replacementData['replacementcode'], $tpl_source);
        }
    }
    return ($tpl_source);
}
So this code failed now and then for no apparent reason... until I realized that the HTML code in the smarty template was being manipulated by an Apache filter.
As a result, the source code in the browser (which we were using as the code to be replaced by something else) was not identical to the template code (which smarty was trying to modify). Result? str_replace failed :)

Why does PHP trim not really remove all whitespace and line breaks?

I am grabbing input from a file with the following code
$jap= str_replace("\n","",addslashes(strtolower(trim(fgets($fh), " \t\n\r"))));
I had also previously tried these while troubleshooting:
$jap= str_replace("\n","",addslashes(strtolower(trim(fgets($fh)))));
$jap= addslashes(strtolower(trim(fgets($fh), " \t\n\r")));
If I echo $jap it looks fine. Later in the code, without any other alterations, $jap is inserted into the DB. However, I noticed that a comparison test that checks whether this jap is already in the DB returned false, even though I can plainly see a seemingly identical entry in the DB. When I copy the jap entry that was inserted, either from phpMyAdmin or from my site where the jap is displayed, and paste it into Notepad, it pastes like this (this is an exact paste into the quotes below):
"
バスにのって、うみへ行きました"
Obviously I need it without that whitespace and line break, or whatever it is.
As far as I can tell, trim is not doing what it says it will do, or I'm missing something here. If so, what is it?
UPDATE:
Regarding Jack's answer: the preg_replace did not help, but here is what I did. I used bin2hex() to determine that the part I don't want is efbbbf.
I did this by passing $jap through str_replace to remove the Japanese text I expected to find, and feeding what was left into bin2hex(). The result was the "efbbbf" above.
echo bin2hex(str_replace("どちらがあなたの本ですか","",$jap));
The output of the above was efbbbf.
But what is it? Can I use str_replace to remove it somehow?

The trim function doesn't know about Unicode white spaces. You could try this:
preg_replace('/^\p{Z}+|\p{Z}+$/u', '', $str);
As taken from: Trim unicode whitespace in PHP 5.2
Otherwise, you can do a bin2hex() to find out what characters are being added at the front.
Update
Your file contains a UTF8 BOM; to remove it:
$f = fopen("file.txt", "r");
$s = fread($f, 3);
if ($s !== "\xef\xbb\xbf") {
    // bom not found, rewind file
    fseek($f, 0, SEEK_SET);
}
// continue reading here
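For the case in the question, where the line is already in a string, the same idea can be applied directly to the string. This is a minimal sketch; the function name strip_utf8_bom is hypothetical and the sample line is illustrative. It assumes PHP 7 or later.

```php
<?php
// Hypothetical helper: removes a leading UTF-8 BOM (bytes EF BB BF)
// from a string and leaves any other string unchanged.
function strip_utf8_bom(string $s): string
{
    return strncmp($s, "\xEF\xBB\xBF", 3) === 0 ? substr($s, 3) : $s;
}

// Illustrative first line of a file that was saved with a BOM.
$line = "\xEF\xBB\xBFどちらがあなたの本ですか\n";

// trim() does not treat the BOM as whitespace, so strip it first.
$jap = trim(strip_utf8_bom($line));

echo bin2hex(substr($line, 0, 3)), PHP_EOL; // efbbbf
var_dump($jap === "どちらがあなたの本ですか"); // bool(true)
```

Only the first line read from the file can start with a BOM, so the helper is a no-op on every later line.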
