Removing non-utf characters using php5 - php

I am trying to display data on a webpage that contains non-UTF-8 characters. A user uploads a tab-separated file from a FileMaker database into our Oracle database via a web form. I then display the data through another user webpage.
One record has a character I've not seen before. It is the letters 'VT' in a square, sort of like [VT]. I'm pasting it here, but it probably won't show in this setting, so I'm attaching an image. If it does not show, the line of text with the character in it looks like this:
'Blue is a color[VT]Blue web page to find more information.[VT]Blue is a nice color'
![vt character](http://www.johndcowan.com/graphics/vt-special-chars.png)
I've tried these PHP functions with no luck:
$answer = iconv("utf-8", "utf-8//ignore", $answer);
$answer = htmlspecialchars($answer);
$answer = strip_tags($answer);
Does anyone have an idea how to get rid of (86) this character?
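A boxed 'VT' is how many fonts and editors draw the vertical tab control character (U+000B, byte 0x0B), and FileMaker is generally reported to export line breaks inside a field as vertical tabs. The character is valid UTF-8, which would explain why `iconv` with `//ignore` leaves it in place. A minimal sketch, assuming that is what the character is; `replace_vertical_tabs` is a hypothetical helper name, not something from the question:

```php
<?php
// Hypothetical helper: turn vertical tabs (0x0B) into a replacement string,
// then drop any other ASCII control characters except tab, LF and CR.
function replace_vertical_tabs($text, $replacement = "\n")
{
    $text = str_replace("\x0B", $replacement, $text);
    // Bytes below 0x20 never occur inside multi-byte UTF-8 sequences,
    // so this byte-wise pattern is safe on UTF-8 input.
    return preg_replace('/[\x00-\x08\x0C\x0E-\x1F\x7F]/', '', $text);
}

$answer = "Blue is a color\x0BBlue web page to find more information.\x0BBlue is a nice color";
// Escape for HTML first, then turn the new line breaks into <br> tags.
echo nl2br(htmlspecialchars(replace_vertical_tabs($answer)));
```

Passing `' '` as the second argument would join the pieces with spaces instead of line breaks.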

Related

Squares appearing in text, want to search & replace

I have a site where a client has inserted text. Probably by copying it from a PDF or Word document.
There is a strange character that appears at random places in the inserted text.
The site is full of them, so I would like to get rid of them programmatically, either by a search and replace in the database or by filtering via PHP with str_replace() when the value is printed out.
The character looks like this, and it is not recognized when I search for it in the database, or even when I search for it with the browser's built-in search function.
It looks the same in the database as it does when rendered. In some cases it's like an empty square, in some cases with a question mark inside it.
I have tried this without luck:
$value = str_replace( '', '', $value );
(The square in the first argument seems to be removed when I save.)
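Since the character does not survive pasting, one option is to find out which bytes it consists of and then name those bytes with escapes instead of pasting the character. A rough sketch of that idea; `find_odd_bytes` is a hypothetical helper, and the no-break space used as sample input is only an assumption about what the square might be. Note that legitimate non-ASCII letters would show up in the list too.

```php
<?php
// Hypothetical helper: return the distinct runs of bytes outside printable
// ASCII (tab, LF and CR are allowed), hex-encoded so they survive copy/paste.
function find_odd_bytes($value)
{
    preg_match_all('/[^\x20-\x7E\r\n\t]+/', $value, $matches);
    return array_values(array_unique(array_map('bin2hex', $matches[0])));
}

// Example input: a UTF-8 no-break space (bytes C2 A0) hidden in ordinary text.
$value = "Some\xC2\xA0pasted text";
print_r(find_odd_bytes($value));

// Once the bytes are known, write them as escapes in a double-quoted string.
$value = str_replace("\xC2\xA0", ' ', $value);
```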

Display the emoji codes in a PHP variable

I have text in a php variable $sentence. I want to display the actual emoji codes found in $sentence on my web page. Not the emoji graphical icons but the actual emoji codes themselves which are embedded in the sentence.
I have been trying all day to do this by reverse engineering code that is usually used to "remove emojis". No success.
Looking to display the actual codes to emojis embedded in text. Example ...
[twitter.com/ProfitTradeRoom/status/880439582063566848][1]
As you can see the fire emoji is in this tweet. The code for fire is 1F525. I want to see 1F525 on my screen when i echo out the php variable. Basically "1F525 HOT 1F525 stocks today" on my screen instead of the graphical emojis.
String taken straight from my MySQL database (copied and pasted) ...
🔥HOT🔥stocks today
I found the answer! The issue was my mysql connection. I was pulling from mysql into php. None of my regular expressions were matching any emoji unicodes until I did this ...
mysql_set_charset('utf8mb4');
I put mysql_set_charset immediately after I defined the connection string. Now when I pull text with emojis it will match a preg_match / preg_replace. For example detecting the fire emoji in a sentence ...
$theFireEmojiCode = '/[\x{1F525}]/u';
preg_match($theFireEmojiCode, $sentence, $matches);
Or I can do a preg_replace like this ...
$theFireEmojiCode = '/[\x{1F525}]/u';
$sentence = preg_replace($theFireEmojiCode, "1F525", $sentence); // preg_replace returns the new string
Hope this helps.
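The same idea can be generalized so that each emoji is replaced by its own code point rather than hard-coding 1F525. This is only a sketch: `emoji_to_codes` is a hypothetical helper, and the character ranges are a rough approximation of "emoji", not a complete list. It assumes the string is already valid UTF-8, which is what the utf8mb4 connection setting above provides, and that the iconv extension is available.

```php
<?php
// Hypothetical helper: replace each character in a rough emoji range with its
// hexadecimal code point. The ranges are an approximation, not a full emoji list.
function emoji_to_codes($sentence)
{
    return preg_replace_callback(
        '/[\x{2600}-\x{27BF}\x{1F000}-\x{1FAFF}]/u',
        function ($m) {
            // Convert the single UTF-8 character to UTF-32BE and read it
            // as one big-endian 32-bit integer.
            $parts = unpack('N', iconv('UTF-8', 'UTF-32BE', $m[0]));
            return sprintf('%X', $parts[1]);
        },
        $sentence
    );
}

// The fire emoji (U+1F525) written as its UTF-8 bytes.
echo emoji_to_codes("\xF0\x9F\x94\xA5HOT\xF0\x9F\x94\xA5stocks today");
// prints: 1F525HOT1F525stocks today
```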

Opening an encoded file with PHP

I am opening a file on the server with PHP. The file seems ordinary. It opens in Notepad and Textedit on a PC. Even PHP can display it without any issue in a web browser when we echo out.
But when I try searching it with strpos(), it can't find anything except single characters. If I search for a string with 2 or more characters, it doesn't find anything.
I have tried encoding it to UTF-8, and it is detected as ASCII, so everything seems right there.
I have also isolated the part of the file that I am trying to read down to only 250 characters. They all look fine on the screen.
But strpos can’t find it. I’ve run tests on every part of my code and I believe everything is fine with my code. The problem I believe derives from that the characters I see on the screen are not exactly matching what those characters really are.
My last resort is to write a function which converts each character into an integer array (if that’s even possible), and then convert all that back to a string. This way, we’ll know 100% that the characters we see are real.
Hoping that somebody has a better approach or perhaps an idea for something I missed?
I'll post the code below:
$content = file_get_contents($file->getPathname()); // get the file contents
$content = substr($content, 30, 300); // reduce the large file to just the first few lines
$content = htmlspecialchars($content); // try to remove any special characters from the file
$content = iconv('ASCII', 'UTF-8//IGNORE', $content); // encode to a friendly format
$string = "JobName"; // this is the string i'm searching for
if (strpos($content, $string) !== false) {
echo "bingo";
}
else {
echo " not found ";
}
Just to be clear, the file I'm opening is generated by a PC program that stores its data in .DAT format. Like I said, I can see and read the content very easily using any program, including PHP, but when I try to search, it's as if it doesn't recognize the content at all.
I am not aware of how to upload a file on StackOverflow, but if someone can tell me how to do it then I will gladly post the file itself.
Thank you very much for your help, ARKASCHA. I was able to find an online hex editor, and when I looked at the characters, it turned out there is a NUL character between every single character in this file. That's probably why I couldn't see it in a regular view. I just had to run an additional function to remove the NUL characters from the file, and then it works as it's supposed to. Thanks again.
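The NUL-stripping step described above might look like the sketch below; `strip_nul_bytes` is a hypothetical name, since the actual function was not posted. A NUL after every character often means the file is UTF-16LE, in which case converting with `iconv` is an alternative, but that is an assumption about this particular .DAT file.

```php
<?php
// Hypothetical helper: remove every NUL byte so that strpos() can match
// multi-character ASCII strings again.
function strip_nul_bytes($content)
{
    return str_replace("\0", '', $content);
}

// Sample data shaped like the file: a NUL byte after every character.
$raw = "J\0o\0b\0N\0a\0m\0e\0";

var_dump(strpos($raw, 'JobName'));                  // bool(false)
var_dump(strpos(strip_nul_bytes($raw), 'JobName')); // int(0)

// Alternative, assuming the data really is UTF-16LE text:
$utf8 = iconv('UTF-16LE', 'UTF-8', $raw);
var_dump($utf8 === 'JobName');                      // bool(true)
```

If the conversion route is used, it is safer to convert the whole file before calling substr(), because cutting at an odd byte offset would split a two-byte character.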

My PHP function does what I want but output converts characters to ASCII

I am using the following function (tried both in my WordPress child theme's functions file and in a plugin; it works in both cases) to remove hashtags from post titles. The function does what I want, but now all titles (post body content is OK, just titles) show HTML numbers for characters (i.e., ' = &8217; and & = &038; and - = &8211;).
So this title
That's #testing the & and apostrophe #tagstitle #cats #cat #instagramcats
becomes
That&8217;s testing the &038; and apostrophe
Which removes hashtags as desired but creates the character issue.
function remove_hashtags($string){
return preg_replace('/#(?=[\w-]+)/', '',
preg_replace('/(?:#[\w-]+\s*)+$/', '', $string));
}
add_filter('the_title', 'remove_hashtags');
I've tried adding additional code:
html_entity_decode('the_title', ENT_QUOTES | ENT_XML1, 'UTF-8');
to the function after reading up on PHP (I'm just learning) but it doesn't seem to work and I'm not sure how to use
html_entities($string)
(question update adding more information)
I basically took the code from here - that had exactly what I needed. I just added the last lines for filtering the WordPress Post Titles.
I don't want to make this too complicated a question, but ideally I would like to remove the hashtags from the text and actually create post tags from them. I have found several answers for each part; I just don't know how to put it all together. Forget that though... I really just want to find out why all of a sudden the ASCII numbers are replacing the original punctuation.
If you only want to remove hashtags you could use str_replace("#","",$string);
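One possible explanation for the `&8217;` output: by the time the filter runs, WordPress has probably already converted the punctuation in the title to numeric entities such as `&#8217;` and `&#038;`, and both the regex and a plain `str_replace` then remove the `#` inside those entities. A sketch that skips any `#` directly preceded by `&`; `remove_hashtags_keep_entities` is a hypothetical name, and the filter ordering is an assumption rather than something confirmed in the question:

```php
<?php
// Hypothetical variant of remove_hashtags(): the (?<!&) lookbehind leaves the
// '#' of numeric entities such as &#8217; untouched.
function remove_hashtags_keep_entities($string)
{
    // Drop a trailing run of hashtags entirely, e.g. "#cats #cat" at the end.
    $string = preg_replace('/(?:(?<!&)#[\w-]+\s*)+$/', '', $string);
    // For hashtags inside the sentence, remove only the '#' sign.
    $string = preg_replace('/(?<!&)#(?=[\w-]+)/', '', $string);
    return rtrim($string);
}

echo remove_hashtags_keep_entities(
    "That&#8217;s #testing the &#038; and apostrophe #tagstitle #cats"
);
// prints: That&#8217;s testing the &#038; and apostrophe
```

In WordPress it would be registered the same way as the original, with `add_filter('the_title', 'remove_hashtags_keep_entities');`.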

Actual input contents are not preserved in most browsers [FF, MSIE7/8, etc.]

I'm working on an application (using PHP and JavaScript). Below is a short description of my problem.
There are two forms available in my application: SourceFrm and targetFrm.
I take input on the first form, SourceFrm, and do the processing on targetFrm.
Below is the input I take from SourceFrm:
1) Enter your data (the id of this input box is 'inputdata'):
2) Enter id (the id of this input box is 'id'):
I post the data entered by the user to targetFrm for further processing.
On targetFrm:
I simply assign the inputdata value to a PHP variable.
The spaces between words are getting lost (more than one space is converted to a single space).
e.g.
The user entered the data below in the input box and submitted it:
inputdata:
This is my     test.
Note that the user added 5 spaces between the words 'my' and 'test'.
After assigning this input data to a PHP variable, I printed the value.
This is the content I get:
Output:
This is my test.
More than one space is converted to a single space. I checked this behaviour in all browsers: FF, MSIE7/8, Opera, Safari, Chrome.
If I use '<pre>' before printing the PHP variable, i.e.:
print "<pre>";
print $inputdata;
then the spaces are not lost (I get the exact content).
My question is how to preserve the exact contents without using '<pre>'.
I use encoding/decoding functions (htmlentitiesencode() and decode()) in my further data processing, so replacing spaces with &nbsp; may cause a conflict.
If anyone has any ideas or suggestions, please share them.
-Thanks
When you output your variables to HTML, they are parsed as HTML. Any run of additional white space is collapsed to one space.
A simple fix would be to replace all spaces with the HTML entity &nbsp; to force browsers to display each space.
I wouldn't store the string with all the &nbsp; entities in the database, but when you show it, the &nbsp; would ensure that each space is seen.
EDIT
I mean only replace spaces on render...like:
print str_replace(' ', '&nbsp;', $inputdata);
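The render-time replacement can be wrapped in a small function so the stored value stays untouched. A minimal sketch; `render_preserving_spaces` is a hypothetical name, and escaping with htmlspecialchars() first is an added assumption so that the `&` of each `&nbsp;` is not escaped afterwards:

```php
<?php
// Hypothetical helper: escape the text for HTML, then make every space
// non-breaking so the browser cannot collapse runs of spaces.
function render_preserving_spaces($text)
{
    $escaped = htmlspecialchars($text, ENT_QUOTES, 'UTF-8');
    return str_replace(' ', '&nbsp;', $escaped);
}

$inputdata = "This is my     test.";
print render_preserving_spaces($inputdata);
```

A CSS alternative that needs no replacement at all is to print the escaped text inside an element styled with `white-space: pre-wrap`.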
HTML rendering collapses consecutive spaces into a single space. I'm not really sure why, but if you check the source code of the rendered webpage containing your string, you'll see that it contains all the spaces; the browser just doesn't show them.
The same goes for other whitespace characters, such as tabs.
The way to deal with it depends on the type of your content. You can either replace spaces with &nbsp;, leave it as it is, or do something completely different, e.g. strip runs of more than one space down to one space.
It really depends on the nature of your data. The only real situation that comes to my mind where you would need more than one space is if you're trying to indent things with spaces, which actually isn't a great idea.
Edit: older resource:
http://www.sightspecific.com/~mosh/WWW_FAQ/nbsp.html
