Google translate API - too many characters being sent - how to debug? - php

I am using this script (https://github.com/viniciusgava/google-translate-php-client) to pass translation of text to Google. The text is generated through a script I coded myself.
The problem is that, Google is saying I'm passing 1-3k more characters than I actually am. I have gone through and made sure the loops are tight, commented out any sections that could possibly be leaking to test, but no matter what, it's passing too many characters.
Also, I tested doing this:
$text = (strlen($string) > 2000) ? substr($string, 0, 2000) . '...' : $string;
Even so, Google is saying I sent 3,300 characters.
My question is, how can I debug what's actually being sent to Google so I can identify the "leak"? How can I retrieve a list of everything that has been sent?
Note: The script I use for character counting (http://www.javascriptkit.com/script/script2/charcount.shtml) already includes whitespaces, html and punctuation when counting the amount of characters.
Thank you for the help.
Update:
I ran a test completely separate from my script, I input 550 words in raw $string = "words"; format... ran the translation, and Google is reporting 1000! So the same thing is happening... I'm wondering if it's something to do with the script itself.
Update 2:
I ran a test using the official GTranslate API script, with 60 characters, Google recorded it as 100 characters. So something is happening that's doubling my characters, even on the official script.

Related

PHP preg_match on own computer doesn't work

I have this code:
$success = preg_match('/(.+(駅前)?駅) (\(([^線]+線)\) )?((([^線 ]+) )?(\d+[分時])?)/u', $m, $matches);
Example input text is
大正駅 (JR大阪環状線) バス 20分
This regex works on https://regex101.com/ and the code works on http://sandbox.onlinephpfunctions.com/. However, when I run the PHP code on my own computer, it never gives me a match. $matches is an empty array, and $success is 0. Yes, the exact same code. I have verified that the regex is correct (using first link) and that the code itself works (using second link). However, it still refuses to work on my own PC.
OS is Arch Linux, running PHP 7.3.11, system locale is ja_JP.UTF-8 (which I don't think matters, but just in case)
Does anyone see anything wrong with the code?
So I was able to find the problem.
First, I tried just the one-liner commented by Nick (3v4l.org/o4ADM) on my PC, and it works. (Of course it should. PHP can't be broken.)
So I figured out that it's the data I'm feeding preg_match that should be broken.
Normal prints and echos were in vain--$m is always how it should be. Then I considered AD7six's comment,
Check that the bytes for 駅 etc. are actually the same
so I looked carefully to check that the characters are all Japanese and no Chinese variants are there. And it's all Japanese, it's fine.
So what could it be?
I tried using PHP's file_put_contents to dump the variable to a file, and then typing the same text with my Japanese keyboard manually and saving them to another file. I opened Meld (a diff tool) and compared the two text and voila--the spaces on the text use a different codepoint than the usual half-width space (0x20). It uses 0xA0 instead, which is a "no-break space", apparently. What the heck.
Fortunately, a simple $m = str_replace("\u{00A0}", " ", $m) did the trick.
Thanks to everyone for leading me to the right answer!

Opening an encoded file with PHP

I am opening a file on the server with PHP. The file seems ordinary. It opens in Notepad and Textedit on a PC. Even PHP can display it without any issue in a web browser when we echo out.
But when I try searching it with strpos() it can’t find anything except single characters. if i search for a string with 2 or more characters, it doesn’t find anything.
I have tried encoding it to UTF-8, and it detects it as ASCII. so everything seems right there.
I have also isolated the part of the file that I am trying to read down to only 250 characters. They all look fine on the screen.
But strpos can’t find it. I’ve run tests on every part of my code and I believe everything is fine with my code. The problem I believe derives from that the characters I see on the screen are not exactly matching what those characters really are.
My last resort is to write a function which converts each character into an integer array (if that’s even possible), and then convert all that back to a string. This way, we’ll know 100% that the characters we see are real.
Hoping that somebody has a better approach or perhaps an idea for something I missed?
I'll post the code below:
$content = file_get_contents($file->getPathname()); // get the file contents
$content = substr($content, 30, 300); // reduce the large file to just the first few lines
$content = htmlspecialchars($content); // try to remove any special characters from the file
$content = iconv('ASCII', 'UTF-8//IGNORE', $content); // encode to a friendly format
$string = "JobName"; // this is the string i'm searching for
if (strpos($content, $string) !== false) {
echo "bingo";
}
else {
echo " not found ";
}
Just to be clear, the file I'm opening is generated from a PC program that stores its data in .DAT format. Like I said, I can see and read the content very easily using any program, including PHP. but when I try to search, its as if it doesn't recognize the content at all.
I am not aware of how to upload a file on StackOverflow, but if someone can tell me how to do it then I will gladly post the file itself.
Thank you very much for your help ARKASCHA. I was able to find an online HexEditor and when I saw the characters, it seems there is a NUL character between every single character in this file. that's probably why I couldn't see it with a regular view. I just had to run an additional function to remove NUL characters from the file, and then it works as its supposed. Thanks again.

PHP preg_match match consecutive newline chars

I am trying to prevent certain kinds of posts on my site, which are mostly meant to make it look like they contain some content but are just spam. Specifically, the posts are a few random words, some newline characters, and a random character.
So, I know some legit users might have use for using two newline chars (to create a blank line between paragraphs), but I figure 3+ can be marked as spam.
I tested this regex on regex101 and it works fine, but is never triggered when I test on my site, any ideas as to why? When I uncomment the echo line, it will show me the number 4 for my test data, so I know it sees the newlines.. is my regex formed incorrectly?!
Test data:
This is a potential
spam post
Code:
//echo substr_count($lowercaseBody, "\n");
if (preg_match('/\n{3,}./', $lowercaseBody)){
error("Stop Spamming my chan you .");
}
The data likely contains CRLF's, not just LF's.
The substr_count test does not care about the interleaving CR's, but your regex patterns does.
Use (\r?\n) instead of the \n to allow both CRLF's and LF's (different browsers/OS's, may use different new-lines):
if (preg_match('/(\r?\n){3,}./', $lowercaseBody)){
error("Stop Spamming my chan you .");
}

Can't understand why Zend_Mail::addHeader() strips newlines

(Since this is my first SO question, let me just say I hope it's not too Zend-specific. As far as I can tell this shouldn't be a problem. Although I could have posted it in a Zend-specific forum, I feel like I'm at least as likely to get a good answer here, especially since the answer might involve MIME-related issues that transcend Zend Framework. I'm basically trying to understand whether the issue I'm facing should be considered a ZF bug, or if I'm misunderstanding something or misusing it.)
I've been using Zend_Mail to build up a MIME message that gets sent through SendGrid, an email distribution service. Their platform allows you to send emails through their SMTP server, but gives added features when you use a special header (X-SMTPAPI) whose value is a JSON-encoded string of proprietary parameters, which can get quite long.
Eventually, the header I was passing got too long (I think >1000 chars), and I got errors. I was confused because I knew that it was getting passed through PHP's native wordwrap() function before I passed the value to Zend_Mail::addHeader(), so I thought line length should never be a problem.
It turns out that addHeader() strips newlines very deliberately, and with no particular explanation by way of comments.
// In Zend_Mail::addHeader()
$value = $this->_filterOther($value);
// In Zend_Mail::_filterOther()
$rule = array("\r" => '',
"\n" => '',
"\t" => '',
);
return strtr($data, $rule);
Ok, this seemed reasonable at first -- maybe ZF wants full control of formatting and line-wrapping. The next method called in Zend_Mail::addHeader() is
$value = $this->_encodeHeader($value);
This method encodes the value (either quoted-printable or base64 as appropriate) and chunks it into lines of appropriate length, but only if it contains "non-printable characters", as determined by Zend_Mime::isPrintable($value).
Looking into that method, newlines (\n) are indeed considered non-printable characters! So if only they hadn't been stripped out of the string in the previous method call, the long header would get encoded as QP and chunked into 72-char lines, and everything would work fine. In fact, I did a test where I commented out the call to _filterOther(), and the long header gets encoded and goes through with no problem. But now I've just made a careless hack to ZF without really understanding the purpose behind the line I removed, so this can't be a long-term solution.
My medium-term solution has been to extend Zend_Mail and create a new method, addHeaderForceEncode(), which will always encode the value of the header, and thus always chunk it into short lines. But I'm still not satisfied because I don't understand why that _filterOther() call was necessary in the first place -- maybe I shouldn't be working around it at all.
Can anyone explain to me why this behaviour exists of stripping newlines? It seems to inevitably lead to situations where a header can get too long if it doesn't contain any "non-printable characters" other than newlines.
I've done a bunch of different searches on this subject and looked through some ZF bug reports, but haven't seen anyone talking about this. Surprisingly it seems to be a really obscure issue. FYI I'm working with ZF 1.11.11.
Update: In case anyone wants to follow the ZF issue I opened about this, here it is: Zend_Mail::addHeader() UNfolds long headers, then throws exception
You're probably running into a few things. Per RFC 2821, text lines in SMTP can't exceed 1000 characters:
text line
The maximum total length of a text line including the is
1000 characters (not counting the leading dot duplicated for
transparency). This number may be increased by the use of SMTP
Service Extensions.
A header can't contain newlines, so that's probably why Zend is stripping them. For long headers, it's common to insert a line break (CRLF in SMTP) and a tab to "wrap" them.
Excerpt from RFC 822:
Each header field can be viewed as a single, logical line of
ASCII characters, comprising a field-name and a field-body.
For convenience, the field-body portion of this conceptual
entity can be split into a multiple-line representation; this
is called "folding". The general rule is that wherever there
may be linear-white-space (NOT simply LWSP-chars), a CRLF
immediately followed by AT LEAST one LWSP-char may instead be
inserted.
I would say that the _encodeHeader() function should possibly look at line length, and if the header is longer than some magic value, do the "wrap and tab" to have it span multiple lines.

My jQuery and PHP give different results on the same thing?

Annoying brain numbing problem.
I have two functions to check the length of a string (primarily, the js one truncates as well) heres the one in Javascript:
$('textarea#itemdescription').keyup(function() {
var charLength = $(this).val().length;
// Displays count
$('span#charCount').css({'color':'#666'});
$('span#charCount').html(255 - charLength);
if($(this).val().length >= 240){
$('span#charCount').css({'color':'#FF0000'});
}
// Alerts when 250 characters is reached
if($(this).val().length >= 255){
$('span#charCount').css({'color':'#FF0000'});
$('span#charCount').html('<strong>0</strong>');
var text = $('textarea#itemdescription').val().substring(0,255)
$('textarea#itemdescription').val(text);
}
});
And here is my PHP to double check:
if(strlen($_POST["description"])>255){
echo "Description must be less than ".strlen($_POST["description"])." characters";
exit();
}
I'm using jQuery Ajax to post the values from the textarea. However my php validation says the strlen() is longer than my js is essentially saying. So for example if i type a solid string and it says 0 or 3 chars left till 255. I then click save and the php gives me the length as being 261.
Any ideas?
Is it to do with special characters, bit sizes that js reads differently or misses out? Or is it to do with something else? Maybe its ill today!... :P
Update:
I added var_dump($_POST['description'])
to see what was passed and it was returning escape slashes e.g. what\'s going on? I have tried adding stripslashes(); to no avail... where are they coming from?
UPDATE 2 - PROBLEM SOLVED:
Basically I think I just realised my server has magic quotes turned on... grr
So I have stripped slashes before processing now. Bit annoying but it will have to do!!
Thanks for your help!
Thanks,
Stefan
The easiest way to debug this is simply from your PHP script, by using:
var_dump($_POST['description']
I suggest you also use view source in your browser to see any escape code, special char codes, etc...
It would help if you posted more of your front-end code, especially where you are doing the actual POST. That said, are you sure that keyup is called every time? If the user just pastes text into the box have you verified it is still called?
Also keep in mind that JavaScript is not good enough to guarantee that a string will be less than a given length. A user could disable JavaScript, and a savvy "user" can send their own POST request with more than 255 chars.
I suspect that few characters are line breaks (you say you use textarea) that are ignored while you validate using javascript.
I see 2 things that might be causing your problem.
firstly substring(0,255) returns 256 characters
secondly magic_quotes might be turned on in php.ini, PHP tries to give you escaped strings but doesn't do it right all the time
edit
doh didnt re-read the substring definition, ignore the first one but magic_quotes might be on check that one
If you use UTF-8 encoding, PHP strlen() is counting the bytes, not the characters. If you have anything non-ASCII, this will happen. Use mb_strlen(). Magic quotes can add a few characters also.

Categories