PHP preg_match on own computer doesn't work


I have this code:
$success = preg_match('/(.+(駅前)?駅) (\(([^線]+線)\) )?((([^線 ]+) )?(\d+[分時])?)/u', $m, $matches);
Example input text is
大正駅 (JR大阪環状線) ﾊﾞｽ 20分
This regex works on https://regex101.com/ and the code works on http://sandbox.onlinephpfunctions.com/. However, when I run the PHP code on my own computer, it never gives me a match. $matches is an empty array, and $success is 0. Yes, the exact same code. I have verified that the regex is correct (using first link) and that the code itself works (using second link). However, it still refuses to work on my own PC.
OS is Arch Linux, running PHP 7.3.11, system locale is ja_JP.UTF-8 (which I don't think matters, but just in case)
Does anyone see anything wrong with the code?

So I was able to find the problem.
First, I tried just the one-liner commented by Nick (3v4l.org/o4ADM) on my PC, and it works. (Of course it should. PHP can't be broken.)
So I figured out that it's the data I'm feeding preg_match that should be broken.
Normal prints and echos were in vain--$m is always how it should be. Then I considered AD7six's comment,
Check that the bytes for 駅 etc. are actually the same
so I looked carefully to check that the characters are all Japanese and no Chinese variants are there. And it's all Japanese, it's fine.
So what could it be?
I tried using PHP's file_put_contents to dump the variable to a file, and then typed the same text manually with my Japanese keyboard and saved it to another file. I opened Meld (a diff tool) and compared the two files, and voila: the spaces in the dumped text use a different code point than the usual half-width space (0x20). They are 0xA0 (U+00A0) instead, which is a "no-break space", apparently. What the heck.
Fortunately, a simple $m = str_replace("\u{00A0}", " ", $m) did the trick.
Thanks to everyone for leading me to the right answer!
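For anyone hitting the same thing, here is a minimal sketch of the check and the fix. It assumes the script file itself is saved as UTF-8 and PHP 7.0+ (for the "\u{...}" escape). The sample string is the example input from the question with the spaces swapped for U+00A0 by hand, which is my reconstruction of the bad data, not the original dump.

```php
<?php
// Example input from the question, but with U+00A0 (no-break space)
// where the regex expects a plain 0x20 space.
$m = "大正駅\u{00A0}(JR大阪環状線)\u{00A0}ﾊﾞｽ\u{00A0}20分";
$pattern = '/(.+(駅前)?駅) (\(([^線]+線)\) )?((([^線 ]+) )?(\d+[分時])?)/u';

// The two kinds of space look identical when echoed; the bytes do not.
// In UTF-8, U+00A0 is encoded as the two bytes C2 A0.
$hasNbsp = strpos($m, "\xC2\xA0") !== false;

// No match, because the pattern requires a literal 0x20 after 駅.
$before = preg_match($pattern, $m, $matches);

// Normalize the no-break spaces, then match again.
$clean = str_replace("\u{00A0}", " ", $m);
$after = preg_match($pattern, $clean, $matches);

var_dump($hasNbsp, $before, $after);
echo bin2hex("\u{00A0}"), "\n"; // hex dump of a single no-break space
```

Dumping suspicious input with bin2hex() is usually quicker than diffing files, since echo and print will never show the difference.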

Related

Google translate API - too many characters being sent - how to debug?

I am using this script (https://github.com/viniciusgava/google-translate-php-client) to pass translation of text to Google. The text is generated through a script I coded myself.
The problem is that Google says I'm passing 1-3k more characters than I actually am. I have gone through and made sure the loops are tight, and commented out any sections that could possibly be leaking in order to test, but no matter what, it's passing too many characters.
Also, I tested doing this:
$text = (strlen($string) > 2000) ? substr($string, 0, 2000) . '...' : $string;
Even so, Google is saying I sent 3,300 characters.
My question is, how can I debug what's actually being sent to Google so I can identify the "leak"? How can I retrieve a list of everything that has been sent?
Note: The script I use for character counting (http://www.javascriptkit.com/script/script2/charcount.shtml) already includes whitespaces, html and punctuation when counting the amount of characters.
Thank you for the help.
Update:
I ran a test completely separate from my script, I input 550 words in raw $string = "words"; format... ran the translation, and Google is reporting 1000! So the same thing is happening... I'm wondering if it's something to do with the script itself.
Update 2:
I ran a test using the official GTranslate API script, with 60 characters, Google recorded it as 100 characters. So something is happening that's doubling my characters, even on the official script.

Opening an encoded file with PHP

I am opening a file on the server with PHP. The file seems ordinary. It opens in Notepad and Textedit on a PC. Even PHP can display it without any issue in a web browser when we echo out.
But when I try searching it with strpos() it can’t find anything except single characters. If I search for a string with 2 or more characters, it doesn’t find anything.
I have tried encoding it to UTF-8, and it detects it as ASCII. so everything seems right there.
I have also isolated the part of the file that I am trying to read down to only 250 characters. They all look fine on the screen.
But strpos can’t find it. I’ve run tests on every part of my code and I believe everything is fine with my code. The problem I believe derives from that the characters I see on the screen are not exactly matching what those characters really are.
My last resort is to write a function which converts each character into an integer array (if that’s even possible), and then convert all that back to a string. This way, we’ll know 100% that the characters we see are real.
Hoping that somebody has a better approach or perhaps an idea for something I missed?
I'll post the code below:
$content = file_get_contents($file->getPathname()); // get the file contents
$content = substr($content, 30, 300); // reduce the large file to just the first few lines
$content = htmlspecialchars($content); // try to remove any special characters from the file
$content = iconv('ASCII', 'UTF-8//IGNORE', $content); // encode to a friendly format
$string = "JobName"; // this is the string i'm searching for
if (strpos($content, $string) !== false) {
    echo "bingo";
}
else {
    echo " not found ";
}
Just to be clear, the file I'm opening is generated by a PC program that stores its data in .DAT format. Like I said, I can see and read the content very easily using any program, including PHP. But when I try to search, it's as if it doesn't recognize the content at all.
I am not aware of how to upload a file on StackOverflow, but if someone can tell me how to do it then I will gladly post the file itself.

Thank you very much for your help ARKASCHA. I was able to find an online hex editor, and when I looked at the characters, it turned out there is a NUL character between every single character in this file. That's probably why I couldn't see it in a regular view. I just had to run an additional function to remove the NUL characters from the file, and then it works as it's supposed to. Thanks again.
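A NUL byte after every character is what plain ASCII text looks like when stored as UTF-16LE, so that is my guess at the file's real encoding; the post does not confirm it. Under that assumption, a rough sketch of both options (stripping the NULs, or converting the encoding, which needs the mbstring extension) looks like this. The sample content is made up.

```php
<?php
// Hypothetical stand-in for the .DAT content: ASCII text with a NUL
// byte after every character (i.e. UTF-16LE without a BOM).
$content = implode("\0", str_split("JobName=Demo")) . "\0";

// Single characters are found, longer strings are not, as in the question.
$single = strpos($content, "J");
$multi  = strpos($content, "JobName");

// Option 1: strip the NUL bytes. Fine as long as the text is pure ASCII.
$stripped = str_replace("\0", "", $content);

// Option 2: convert properly. Also handles non-ASCII characters,
// but only if the file really is UTF-16LE.
$converted = mb_convert_encoding($content, 'UTF-8', 'UTF-16LE');

var_dump($single, $multi, strpos($stripped, "JobName"), $converted);
echo bin2hex(substr($content, 0, 6)), "\n"; // quick hex check without a hex editor
```

Note that iconv('ASCII', 'UTF-8//IGNORE', ...) in the question does not help here, because NUL is a valid ASCII character and passes straight through.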

preg_replace doesn't work on local machine, works everywhere else

I have the following code
$cities = preg_replace(
    '/^\\d*\\n(.*)\\n([^\\d].*|)/m',
    '\\item \\textbf{$1} -- $2',
    $_POST['cities']
);
$_POST['cities'] has this value.
$cities is identical to $_POST['cities'] on my local machine and has had no replacements done.
I'm running PHP 5.5.9 through Xampp.
I've tested the code and regex through the following services, all telling me it should work:
PHP Live Regex
Regex101
Functions online (no direct link)
$count is 0, so clearly it doesn't match, however above sources should be enough proof that it should.
EDIT: The code doesn't work on a much, much smaller string either (consisting of two matches).

It seems the issue was that my test server was running Windows and I was used to regex on Linux. Windows ends lines with \r\n instead of just \n, so I changed all \n to \r?\n for portability, which solved it.
EDIT: I am now using \R, as suggested by #je-suis-charlie.
I will accept this answer in two days.
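A small sketch of the difference, using made-up input in the shape the pattern seems to expect (a number line, a name line, a description line); the real $_POST['cities'] value is not shown in the post. This is a variation on the fix above rather than the exact change: it uses \R once to normalize every line break to \n and then runs the original pattern untouched, so the captured groups cannot end up with a stray \r on the end.

```php
<?php
$pattern     = '/^\\d*\\n(.*)\\n([^\\d].*|)/m';
$replacement = '\\item \\textbf{$1} -- $2';

// Hypothetical input with Windows-style \r\n line endings.
$input = "1\r\nParis\r\nFrance\r\n2\r\nRome\r\nItaly\r\n";

// The \n-only pattern finds nothing, because every \n is preceded by \r.
$unchanged = preg_replace($pattern, $replacement, $input, -1, $count);

// \R matches \r\n, \n or \r, so this normalizes all line endings to \n.
$normalized = preg_replace('/\R/', "\n", $input);
$fixed = preg_replace($pattern, $replacement, $normalized, -1, $fixedCount);

var_dump($count, $fixedCount);
echo $fixed;
```

Passing the fifth argument ($count) to preg_replace is a cheap way to see whether anything matched at all.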

Settings that could influence PHP str_replace behaviour

I am currently working on a replacement tool that will dynamically replace certain strings (including html) in a website using a smarty outputfilter.
For the replacement to take place, I am using PHP's str_ireplace method, which reads the code that is supposed to be replaced and the replacement code from a database, and then pass the result to the smarty output (using an output filter), in a similar way as the below.
$tpl_source = str_ireplace($replacements['sourceHTML'], $replacements['replacementHTML'], $tpl_source);
The problem is that although it works great on my dev server, once uploaded to the live server the replacements occasionally fail. The same replacements work just fine on my dev version, though. After some examination and googling, I could not find much about this issue. So my question is, what could influence str_replace's behaviour?
Thanks
Edit with replacement example:
$htmlsource = file_get_contents('somefile.html');
$newstr = str_replace('Some text', 'sometext', $htmlsource); // the text to be replaced does exist in the html source
This fails to replace. After some checking, it looks like the combination of "> creates a problem. But just the combination of it. If I try to change only (") it works, if I try to change only (>) it works.

It might be that special chars like umlauts do not display on the live server correctly and so str_replace() would fail, if there are specialchars inside the string you want to replace.

Is the input string identical on both systems? Have you verified this? Are you sure?
Things to check:
Are the HTML attributes in the same order?
Are the attribute values using the same kind of quote marks? (eg <a href='#'> vs <a href="#">)
Is there any other stray HTML code getting in there?
Is the entity encoding the same? (eg vs   - same character; different HTML)
Is the character-set the same? (eg utf-8 vs ISO 8859-1: Accented characters will be encoded differently)
Any of these things will affect the result and produce the failures you're describing.
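To illustrate the character-set point from the list above, here is a small sketch. The strings are invented for the example, the bytes are written as escapes so the result does not depend on how this snippet is saved, and mb_convert_encoding needs the mbstring extension.

```php
<?php
// "München" as UTF-8 bytes (ü = C3 BC): e.g. what the database hands back.
$needle = "M\xC3\xBCnchen";

// The same word as ISO 8859-1 bytes (ü = FC): e.g. what the page contains.
$haystack = "<p>M\xFCnchen</p>";

// Looks identical on screen, but the bytes differ, so nothing is replaced.
$result = str_replace($needle, 'Munich', $haystack, $count);

// Bring both sides to the same encoding first, then replace.
$haystackUtf8 = mb_convert_encoding($haystack, 'UTF-8', 'ISO-8859-1');
$fixed = str_replace($needle, 'Munich', $haystackUtf8, $fixedCount);

var_dump($count, $fixedCount, $fixed);
echo bin2hex($needle), "\n", bin2hex($haystack), "\n"; // compare the raw bytes
```

str_replace() compares bytes and nothing else, so whenever a replacement "randomly" fails, comparing bin2hex() of the needle and of the relevant part of the haystack on both servers usually shows why.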

This was a tricky problem, and it ended up having nothing to do with the str_replace method itself.
We are using Smarty as a templating system. The str_replace method was used by a Smarty output filter in order to change the HTML on some occasions, just before it was delivered to the user.
Here is the Smarty outputfilter Code:
function smarty_outputfilter_replace($tpl_source, &$smarty)
{
    $replacements = Content::getReplacementsForPage();
    if (is_array($replacements))
    {
        foreach ($replacements as $replacementData)
        {
            $tpl_source = str_replace($replacementData['sourcecode'], $replacementData['replacementcode'], $tpl_source);
        }
    }
    return ($tpl_source);
}
So this code failed now and then for no apparent reason... until I realized that the HTML code in the Smarty template was being manipulated by an Apache filter.
As a result, the source code in the browser (which we were using as the code to be replaced by something else) was not identical to the template code (which Smarty was trying to modify). Result? str_replace failed :)

Scrape a price off a website

I'm trying to scrape a price from a web page using PHP and Regexes. The price will be in the format £123.12 or $123.12 (i.e., pounds or dollars).
I'm loading up the contents using libcurl. The output of which is then going into preg_match_all. So it looks a bit like this:
$contents = curl_exec($curl);
preg_match_all('/(?:\$|£)[0-9]+(?:\.[0-9]{2})?/', $contents, $matches);
So far so simple. The problem is, PHP isn't matching anything at all - even when there are prices on the page. I've narrowed it down to there being a problem with the '£' character - PHP doesn't seem to like it.
I think this might be a charset issue. But whatever I do, I can't seem to get PHP to match it! Anyone have any ideas?
(Edit: I should note if I try using the Regex Test Tool using the same regex and page content, it works fine)

Have you tried using \ in front of £?
preg_match_all('/(\$|\£)[0-9]+(\.[0-9]{2})/', $contents, $matches);
I have tried this expression in .NET with \£ and it works. I just edited it and removed some "?:".
Read my comment about the possibility of Curl giving you bad encoding (comment of this post).

Maybe the pound sign has been replaced by its HTML entity? I think you should try your regexp with some sort of coaching program (i.e. match it against fixed text locally).
I'd change my regexp like this: '/(?:\$|£)\d+(?:\.\d{2})?/'

This should work for simple values.
'#(?:\$|\£|\€)(\d+(?:\.\d+)?)#'
This will not work with thousands separators, as in 234,343 and 34,454.45.
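If it is a charset issue, as the question suspects, a sketch like the following shows the mechanism. It assumes the script is saved as UTF-8 (so a literal £ in the pattern is the two bytes C2 A3) while the fetched page is ISO 8859-1 (where £ is the single byte A3); neither is confirmed in the question, and the page content here is invented. mb_convert_encoding needs the mbstring extension.

```php
<?php
// What a literal £ becomes in the pattern when the script is UTF-8.
$poundUtf8 = "\xC2\xA3";
$pattern = '/(?:\$|' . $poundUtf8 . ')[0-9]+(?:\.[0-9]{2})?/';

// Hypothetical page content as returned by curl_exec(), in ISO 8859-1.
$contents = "Price: " . "\xA3" . "123.12";

// The byte C2 never appears in the page, so the £ branch cannot match.
$before = preg_match_all($pattern, $contents, $m1);

// Convert the page to the script's encoding, then match again.
$contentsUtf8 = mb_convert_encoding($contents, 'UTF-8', 'ISO-8859-1');
$after = preg_match_all($pattern, $contentsUtf8, $m2);

var_dump($before, $after, bin2hex($m2[0][0]));
```

The real charset should come from the Content-Type response header or the page's meta tag rather than being hard-coded. If the page uses an entity such as &pound; instead, running html_entity_decode() on the content before matching is another thing to try.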
