How to remove junk characters coming in Gmail attachments in PHP?

I have marked the junk characters in the image and I want the code to remove it and start reading the data after it.

That ugly-looking text is not junk; it is the formatting data that makes a *.doc file a DOC file. You can't simply echo that file using PHP.
You can display it with a PHP DOC viewer library, or convert it with an online API that turns DOC into TXT.
You can also let the user download it. Use file_put_contents() to store the attachment in a .doc file, like below:
if (file_put_contents("attachment.doc", $email['attachment']) !== false) {
    header("Location: attachment.doc");
    exit;
}
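As an alternative to redirecting, the saved file can be streamed back with download headers. The following is only a sketch: the helper names saveAttachment() and downloadHeaders() are made up for illustration, and the MIME type assumes a classic .doc file.

```php
<?php
// Hypothetical helper: write the raw attachment bytes to disk.
function saveAttachment(string $bytes, string $path): bool
{
    // file_put_contents() returns the number of bytes written, or false on failure.
    return file_put_contents($path, $bytes) !== false;
}

// Hypothetical helper: build the headers that make a browser download the file.
function downloadHeaders(string $path): array
{
    return [
        'Content-Type: application/msword',
        'Content-Disposition: attachment; filename="' . basename($path) . '"',
        'Content-Length: ' . filesize($path),
    ];
}

// Usage inside a web request (commented out so the sketch has no side effects):
// if (saveAttachment($email['attachment'], 'attachment.doc')) {
//     foreach (downloadHeaders('attachment.doc') as $line) {
//         header($line);
//     }
//     readfile('attachment.doc');
//     exit;
// }
```

Sending the bytes with readfile() avoids exposing the storage directory through a public URL, which the Location redirect requires.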

The binary data represents a *.doc file. If you really want to extract plain text from it, you could apply some fuzzy logic and drop every line that contains a character with a low ASCII code (except for CR and LF).
Assuming your data structure is in $data, you could do this:
foreach ($data as &$element) { // by reference, so the change is kept in $data
    // Remove each line containing a control character other than LF (\x0A) or CR (\x0D).
    $element["attachment"] = preg_replace(
        '/^.*?[\x01-\x09\x0B\x0C\x0E-\x1F].*?$\R?/m',
        "", $element["attachment"]);
}
unset($element); // break the reference left over from the loop
Again, this is just "fuzzy" logic, so some meaningless text may still get through.

Related

Opening an encoded file with PHP

I am opening a file on the server with PHP. The file seems ordinary. It opens in Notepad and Textedit on a PC. Even PHP can display it without any issue in a web browser when we echo out.
But when I try searching it with strpos(), it can't find anything except single characters. If I search for a string with 2 or more characters, it doesn't find anything.
I have tried encoding it to UTF-8, and it is detected as ASCII, so everything seems right there.
I have also isolated the part of the file that I am trying to read down to only 250 characters. They all look fine on the screen.
But strpos can’t find it. I’ve run tests on every part of my code and I believe everything is fine with my code. The problem I believe derives from that the characters I see on the screen are not exactly matching what those characters really are.
My last resort is to write a function which converts each character into an integer array (if that’s even possible), and then convert all that back to a string. This way, we’ll know 100% that the characters we see are real.
Hoping that somebody has a better approach or perhaps an idea for something I missed?
I'll post the code below:
$content = file_get_contents($file->getPathname()); // get the file contents
$content = substr($content, 30, 300); // reduce the large file to just the first few lines
$content = htmlspecialchars($content); // try to remove any special characters from the file
$content = iconv('ASCII', 'UTF-8//IGNORE', $content); // encode to a friendly format
$string = "JobName"; // this is the string i'm searching for
if (strpos($content, $string) !== false) {
echo "bingo";
}
else {
echo " not found ";
}
Just to be clear, the file I'm opening is generated by a PC program that stores its data in .DAT format. Like I said, I can see and read the content very easily using any program, including PHP. But when I try to search, it's as if it doesn't recognize the content at all.
I am not aware of how to upload a file on StackOverflow, but if someone can tell me how to do it then I will gladly post the file itself.
Thank you very much for your help, ARKASCHA. I found an online hex editor, and when I looked at the characters, it turned out there is a NUL character between every single character in this file. That's probably why I couldn't see it in a regular view. I just had to run an additional function to remove the NUL characters from the file, and then it works as it's supposed to. Thanks again.
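The fix described above can be sketched as follows. The sample string and the helper name stripNulBytes() are made up for illustration; the real file's contents were never posted.

```php
<?php
// Hypothetical sample: "JobName" stored with a NUL byte after every character.
$content = "J\0o\0b\0N\0a\0m\0e\0";

// A hex dump makes the hidden bytes visible without a hex editor.
echo bin2hex(substr($content, 0, 8)), "\n"; // prints 4a006f0062004e00

// Hypothetical helper: drop every NUL byte from the string.
function stripNulBytes(string $content): string
{
    return str_replace("\0", '', $content);
}

$clean = stripNulBytes($content);

// strpos() fails on the raw data but succeeds once the NUL bytes are gone.
var_dump(strpos($content, 'JobName')); // bool(false)
var_dump(strpos($clean, 'JobName'));   // int(0)
```

A NUL after every character often means the file is UTF-16LE. If that is the case here (an assumption), converting it with mb_convert_encoding($content, 'UTF-8', 'UTF-16LE') would preserve non-ASCII characters, which plain NUL stripping does not.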

Special characters in file name - download vs delete

I am trying to create a file management system on my website. The problem is with downloading files whose names contain special characters (others work correctly).
If I use file_exists($mypath) the result is true therefore file exists.
When deleting this file with unlink($mypath) it also works fine.
The only thing that doesn't work is downloading the file.
The download is done via an href link where I echo the path, but the characters somehow get converted so the link doesn't work. The solution is probably some conversion, but I have had no success yet.
I suspect that php is converting the special characters into html entities.
You should use the 'rawurlencode' php method to keep the special characters.
The following link discusses your issue (special characters appearing in a file name that has to go into a link):
http://www.dxsdata.com/2015/03/html-php-mailto-content-link-with-special-characters/
Their solution uses rawurlencode. The following was copied from the link above in case it goes dead:
Snip start...
Scenario
On your website, you want to offer a link which opens a mail client like Outlook with a mail and content suggestion. The content contains a link to a file with special characters in its name, which causes Outlook to break the link, e.g. if it contains spaces or German umlauts.
Solution
Using PHP, write
<?php
$fullPath = $yourAbsoluteHttpPath . "/" . rawurlencode(rawurlencode($filename));
$mailto = "mailto:a@b.com?subject=File for you&body=Your file: ".$fullPath;
?>
Generate Mail
Note the double usage of “rawurlencode” function. The first one is needed for HTML output, the second one will be needed for the part when Outlook takes the link code into its mail window.
Snip end ;-)
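For the plain download-link case from the question, a single rawurlencode() on the file name is usually enough. This is a sketch only: buildDownloadHref() and the /files/ base path are hypothetical, and it assumes the file name is a single path segment.

```php
<?php
// Hypothetical helper: build an href for a file whose name may contain
// spaces, ampersands, umlauts and other special characters.
function buildDownloadHref(string $baseUrl, string $filename): string
{
    // rawurlencode() percent-encodes the name so it survives as one URL path segment.
    $href = rtrim($baseUrl, '/') . '/' . rawurlencode($filename);

    // htmlspecialchars() makes the result safe inside an HTML attribute.
    return htmlspecialchars($href, ENT_QUOTES, 'UTF-8');
}

// '/files/' is an assumed public directory, not taken from the question.
echo '<a href="' . buildDownloadHref('/files/', 'a b&c.txt') . '">Download</a>', "\n";
```

The encoding is applied only to the file name, not to the whole path, because rawurlencode() would also encode the slashes that separate directories.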

How to change encoding from plain text to Unicode so that I can read special characters from a HTML?

Below is my code :
<?php
// example of how to use basic selector to retrieve HTML contents
include('/Library/WebServer/Documents/simple_html_dom.php'); //this is the api for the simplehtmldom
// get DOM from URL or file
$html = file_get_html('http://www.google.hk');
// extract text from table
echo $html->find('td[align="top"]', 1)->innertext.'<br><hr>';
// extract text from HTML
echo $html->innertext;
?>
I am using the Simple HTML DOM library (simple_html_dom.php). When I run the script on my local server, I get many unrecognized characters, apparently because the text is not being interpreted in the right encoding. Can someone tell me what I need to change about innertext so that all the characters show up? I also tried plaintext without any luck, and textContent seems broken to me. Perhaps I need to try a different element first? Thanks.
echo utf8_encode($html->innertext);
Or
echo utf8_decode($html->innertext);
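Which one you need depends on the original encoding, so you may want to try both. Note that these functions only convert between ISO-8859-1 and UTF-8, and both are deprecated as of PHP 8.2.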
Note:
If you're viewing the output in a browser, make sure the browser's text encoding is set to Unicode, or put the following code at the top of your script.
header('Content-Type: text/html; charset=utf-8');
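When the source encoding is known, an explicit conversion avoids the trial and error. The sketch below assumes the mbstring extension is loaded; toUtf8() is a hypothetical helper, and ISO-8859-1 is only an example source charset.

```php
<?php
// Hypothetical helper: return $text as UTF-8, converting from $from only
// when the bytes are not already valid UTF-8.
function toUtf8(string $text, string $from): string
{
    if (mb_check_encoding($text, 'UTF-8')) {
        return $text; // already valid UTF-8, leave untouched
    }
    return mb_convert_encoding($text, 'UTF-8', $from);
}

// "\xE9" is the single-byte ISO-8859-1 encoding of "é".
$latin1 = "caf\xE9";
$utf8 = toUtf8($latin1, 'ISO-8859-1');

echo bin2hex($utf8), "\n"; // prints 636166c3a9
```

With Simple HTML DOM, this would be applied to the extracted string, for example toUtf8($html->innertext, 'ISO-8859-1'), with the real charset taken from the page's Content-Type header or meta tag.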

Using PHP to search a text file

I'm trying to create a code to download mp3 files embedded in a page. It starts out as a submit form. You input the URL and submit it, and it writes the HTML source of that page to a text file. I also set the script to search the source to see if there is an audio file embedded. I suppose I should include that it's not in the format of filename.mp3. The format is:
embed type="application/x-shockwave-flash" src="http://diaryofthedead.tumblr.com/swf/audio_player_black.swf?audio_file=http://www.tumblr.com/audio_file/1435664895/tumblr_lb2ybulZkt1qb5hrc&color=FFFFFF" height="27" width="207" quality="best"
So here's the thing, there's just a certain string you have to add to the end of the file, for it to redirect to the mp3 file. I know the string. What I want to do is extract, for example "http://www.tumblr.com/audio_file/1435664895/tumblr_lb3ybulZkt1q5hrc" from the middle of this. I know how to read from files but I have no idea how to extract certain parts from it without knowing the exact filename already. So is there any way I can have it search the source for "audio_file" and if it finds the string, extract the audio file?
If your program is just a parser for extracting MP3 files embedded in a webpage you don't even need to save the contents of the webpage onto a file, you can work with the page source inside just your server's memory.
If you want to detect MP3 paths inside Flash embeds and you know a regular expression that matches them, you are done.
If you don't know much about regular expressions, you should look into them.
If you don't want as much power as a regular expression can give to you, you can always find strings by position, like:
$pos = strpos($haystack, $needle);
Beware: strpos() will find the first (strrpos will find the last) occurrence of a string. So you need to make it as explicit as you can, or you might end up capturing something unwanted.
Take a look at http://www.regular-expressions.info/quickstart.html or something similar.
I can't post more links because I don't have enough reputation yet
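The position-based approach can be sketched like this. extractBetween() is a hypothetical helper, and the sample embed string is shortened and made up for illustration.

```php
<?php
// Hypothetical helper: return the text between the first occurrence of
// $start and the next occurrence of $end, or null if either is missing.
function extractBetween(string $haystack, string $start, string $end): ?string
{
    $from = strpos($haystack, $start);
    if ($from === false) {
        return null;
    }
    $from += strlen($start); // skip past the start marker itself

    $to = strpos($haystack, $end, $from); // search only after the marker
    if ($to === false) {
        return null;
    }
    return substr($haystack, $from, $to - $from);
}

// Made-up sample in the same shape as the embed tag from the question.
$source = 'src="http://example.com/player.swf?audio_file=http://example.com/audio_file/1/abc&color=FFFFFF"';
echo extractBetween($source, 'audio_file=', '&'), "\n";
```

Using "audio_file=" (with the equals sign) as the marker keeps the search from stopping at the "audio_file/" segment inside the URL itself.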
You can try using preg_match (http://php.net/manual/en/function.preg-match.php) to get the contents between "audio_file=" and "&".
Or you can also use a string between function to get the contents between those two strings:
http://www.php.net/manual/en/function.substr.php#89493
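A preg_match version might look like the following sketch. extractAudioFileUrl() is a hypothetical name, the sample string is made up, and the pattern assumes the URL ends at the next ampersand, quote, or whitespace.

```php
<?php
// Hypothetical helper: pull the value of the audio_file parameter out of page source.
function extractAudioFileUrl(string $source): ?string
{
    // Capture everything after "audio_file=" up to the next &, quote or whitespace.
    if (preg_match('/audio_file=([^&"\s]+)/', $source, $matches) === 1) {
        return $matches[1];
    }
    return null;
}

// Made-up sample in the same shape as the embed tag from the question.
$source = 'embed src="http://example.com/player.swf?audio_file=http://example.com/audio_file/1/abc&color=FFFFFF" height="27"';
$url = extractAudioFileUrl($source);

if ($url !== null) {
    echo $url, "\n";
}
```

If a page can contain several players, preg_match_all() with the same pattern would collect every match instead of only the first.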

XML parser error: entity not defined

I have searched stackoverflow on this problem and did find a few topics, but I feel like there isn't really a solid answer for me on this.
I have a form that users submit and the field's value is stored in a XML file. The XML is set to be encoded with UTF-8.
Every now and then a user will copy/paste text from somewhere and that's when I get the "entity not defined error".
I realize XML only supports a select few entities and anything beyond that is not recognized - hence the parser error.
From what I gather, there's a few options I've seen:
1. I can find and replace all &nbsp; and swap them out with &#160; or an actual space.
2. I can place the code in question within a CDATA section.
3. I can include these entities within the XML file.
What I'm doing with the XML file is that the user can enter content into a form, it gets stored in a XML file, and that content then gets displayed as XHTML on a Web page (parsed with SimpleXML).
Of the three options, or any other option(s) I'm not aware of, what's really the best way to deal with these entities?
Thanks,
Ryan
UPDATE
I want to thank everyone for the great feedback. I actually determined what caused my entity errors. All the suggestions made me look into it more deeply!
Some textboxes were plain old textboxes, but my textareas were enhanced with TinyMCE. On closer inspection, it turned out that the PHP warnings always referenced data from the TinyMCE-enhanced textareas. Later I noticed that on a PC all the characters were taken out (because it couldn't read them), but on a Mac you could see little square boxes showing the Unicode number of each character. The reason they showed up as squares on a Mac in the first place is that I used utf8_encode to encode data that wasn't in UTF-8 to prevent other parsing errors (which is somehow also related to TinyMCE).
The solution to all this was quite simple:
I added this line entity_encoding : "utf-8" in my tinyMCE.init. Now, all the characters show up the way they are supposed to.
I guess the only thing I don't understand is why the characters still show up when placed in textboxes, because nothing converts them to UTF, but with TinyMCE it was a problem.
I agree that it is purely an encoding issue. In PHP, this is how I solved this problem:
Before passing the html-fragment to SimpleXMLElement constructor I decoded it by using html_entity_decode.
Then further encoded it using utf8_encode().
$headerDoc = '<temp>' . utf8_encode(html_entity_decode($headerFragment)) . '</temp>';
$xmlHeader = new SimpleXMLElement($headerDoc);
Now the above code does not throw any undefined entity errors.
You could HTML-parse the text and have it re-escaped with the respective numeric entities only (like: &nbsp; → &#160;). In any case, simply using unsanitized user input is a bad idea.
All of the numeric entities are allowed in XML; only the named ones known from HTML do not work (with the exception of &amp;, &quot;, &lt;, &gt;, &apos;).
Most of the time, though, you can just write the actual character (&ouml; → ö) to the XML file, so there is no need to use an entity reference at all. If you are using a DOM API to manipulate your XML (and you should!), this is your safest bet.
Finally (this is the lazy developer solution) you could build a broken XML file (i.e. not well-formed, with entity errors) and just pass it through tidy for the necessary fix-ups. This may work or may fail depending on just how broken the whole thing is. In my experience, tidy is pretty smart, though, and lets you get away with a lot.
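The "write the actual character through a DOM API" route can be sketched as below. It assumes the dom and simplexml extensions are available; storeAsXml() and the <content> element name are hypothetical.

```php
<?php
// Hypothetical helper: wrap user input in a <content> element as well-formed XML.
function storeAsXml(string $userInput): string
{
    // Turn named HTML entities such as &nbsp; or &ouml; into real UTF-8 characters.
    $decoded = html_entity_decode($userInput, ENT_QUOTES | ENT_HTML5, 'UTF-8');

    $doc = new DOMDocument('1.0', 'UTF-8');
    $root = $doc->createElement('content');

    // A text node is escaped on serialization, so stray & and < cannot break the XML.
    $root->appendChild($doc->createTextNode($decoded));
    $doc->appendChild($root);

    return $doc->saveXML();
}

$xml = storeAsXml('a&nbsp;b &amp; c &ouml;');

// The result parses without "entity not defined" warnings.
$parsed = simplexml_load_string($xml);
var_dump($parsed !== false); // bool(true)
```

Because the named entities are resolved before the document is built, the stored XML never contains a reference the parser does not know.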
1. I can find and replace all [&nbsp;] and swap them out with [&#160;] or an actual space.
This is a robust method, but it requires you to have a table of all the HTML entities (I assume the pasted input is coming from HTML) and to parse the pasted text for entity references.
2. I can place the code in question within a CDATA section.
In other words disable parsing for the whole section? Then you would have to parse it some other way. Could work.
3. I can include these entities within the XML file.
You mean include the entity definitions? I think this is an easy and robust way, if you don't mind making the XML file quite a bit bigger. You could have an "included" file (find one on the web) which is an external entity, which you reference from the top of your main XML file.
One downside is that the XML parser you use has to be one that processes external entities (which not all parsers are required to do). And it must correctly resolve the (possibly relative) URL of the external entity to something accessible. This is not too bad but it may increase constraints on your processing tools.
4. You could forbid non-XML in the pasted content. Among other things, this would disallow entity references that are not predefined in XML (the 5 that Tomalak mentioned) or defined in the content itself. However this may violate the requirements of the application, if users need to be able to paste HTML in there.
5. You could parse the pasted content as HTML into a DOM tree by setting someDiv.innerHTML = thePastedContent;
In other words, create a div somewhere (probably display=none, except for debugging). Say you then have a javascript variable myDiv that holds this div element, and another variable myField that holds the element that is your input text field. Then in javascript you do
myDiv.innerHTML = myField.value;
which takes the unparsed text from myField, parses it into an HTML DOM tree, and sticks it into myDiv as HTML content.
Then you would use some browser-based method for serializing (= "de-parsing") the DOM tree back into XML. See for example this question. Then you send the result to the server as XML.
Whether you want to do this fix in the browser or on the server (as @Hannes suggested) will depend on the size of the data, how quick the response has to be, how beefy your server is, and whether you care about hackers sending not-well-formed XML on purpose.
Use "htmlentities()" with flag "ENT_XML1": htmlentities($value, ENT_XML1);
If you use "SimpleXMLElement" class:
$SimpleXMLElement->addChild($name, htmlentities($value, ENT_XML1));
If you want to convert all characters, this may help you (I wrote it a while back) :
http://www.lautr.com/convert-all-applicable-characters-to-numeric-entities-for-use-in-xml
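// Assumes UTF-8 input and the mbstring extension (mb_ord() needs PHP 7.2+).
function _convertAlphaEntitysToNumericEntitys($entity) {
    // Decode the named entity to its UTF-8 character.
    $char = html_entity_decode($entity[0], ENT_QUOTES | ENT_HTML401, 'UTF-8');
    // Unknown entities come back unchanged; leave them alone.
    if ($char === $entity[0]) {
        return $entity[0];
    }
    // mb_ord() returns the full code point; ord() would only see the first byte.
    return '&#' . mb_ord($char, 'UTF-8') . ';';
}
$content = preg_replace_callback(
    '/&([\w\d]+);/i',
    '_convertAlphaEntitysToNumericEntitys',
    $content);
function _convertAsciOver127toNumericEntitys($entity) {
    // The pattern below only matches non-ASCII characters, so always encode.
    return '&#' . mb_ord($entity[0], 'UTF-8') . ';';
}
$content = preg_replace_callback(
    '/[^\x00-\x7F]/u', // any character above code point 127
    '_convertAsciOver127toNumericEntitys', $content);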
This question is a general problem for any language that parses XML or JSON (so, basically, every language).
The above answers are for PHP, but a Perl solution would be as easy as...
my $excluderegex =
'^\n\x20-\x20' . # Don't Encode Spaces
'\x30-\x39' . # Don't Encode Numbers
'\x41-\x5a' . # Don't Encode Capitalized Letters
'\x61-\x7a' ; # Don't Encode Lowercase Letters
# in case anything is already encoded
$value = HTML::Entities::decode_entities($value);
# encode properly to numeric
$value = HTML::Entities::encode_numeric($value, $excluderegex);
