Removing special keyboard characters/shapes with regex or? - php

I am using YQL to scrape some data, and then parsing it into Amazon's SimpleDB. I am getting some errors when attempting to insert certain titles into the DB, because some titles from the XML file that I am parsing contain characters like the ones below.
◆ ▒ ♠ ✖ ¸ . ´ ¨
I am sure that's not all the possible special characters. It's just the ones I've noticed so far that are causing the errors.
These are not standard keyboard characters. Is there a simple way to remove/disallow these types of characters (regex, etc..) without finding every one of them and including them in a regex?
Thanks

$text = preg_replace('/[^a-zA-Z0-9_ -]/s', '', $text);
This will strip your text down so it only contains ASCII letters, digits, spaces, underscores and hyphens.
Reference http://www.phpfreaks.com/forums/index.php?topic=223131.0
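The ASCII-only whitelist above also removes accented letters such as é. A minimal sketch of a Unicode-aware variant follows, assuming the input is valid UTF-8 and PHP 7.0 or later (for the "\u{...}" escapes). The helper name strip_symbols is hypothetical and not from the original answer. It keeps letters, combining marks, digits, punctuation and spaces from any script, and drops symbols and control characters (note that this also drops tabs, newlines, and symbols such as $ and +).

```php
<?php
// Hypothetical helper: keep letters, combining marks, digits, punctuation
// and plain spaces from any script; drop everything else (symbols, controls).
function strip_symbols(string $text): string
{
    // The /u modifier makes PCRE treat the pattern and subject as UTF-8.
    $clean = preg_replace('/[^\p{L}\p{M}\p{N}\p{P}\p{Zs}]/u', '', $text);

    // preg_replace() returns null if $text is not valid UTF-8;
    // this sketch falls back to the unchanged input in that case.
    return $clean ?? $text;
}

// "\u{25C6}" is the black diamond and "\u{2660}" the spade from the question.
$title = strip_symbols("Caf\u{00E9} \u{25C6}Title\u{2660}");
```

Whether punctuation should survive depends on what the target database accepts, so the character class is the part to adjust.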

Related

Remove certain special HTML characters from string in PHP

I am scraping information from a website and I was wondering how I could ignore or replace some special HTML characters such as "á", "á", "’" and "&amp". These characters cannot be scraped into a database. I have already replaced "&nbsp;" using this:
$nbsp = utf8_decode('á');
$mystring = str_replace($nbsp, '', $mystring);
But I cannot seem to do the same with these other characters. I am scraping from the website using XPath. This returns the exact content that I am looking for but keeps the HTML characters that I do not want as they don't seem to be allowed into a database.
Thanks for any help with this.
It sounds like you've got an encoding or collation issue. I suggest ensuring that your database uses a UTF-8 collation (for example utf8_general_ci), and that your web page's content encoding is also set to UTF-8. This may well solve your problem.
The best way to strip all special characters is to run the string through htmlentities() (htmlspecialchars() only encodes &, <, > and quotes), then do a case-insensitive regex find and replace using the following pattern:
&([a-z]{2,8}+|#[0-9]{2,5}|#x[0-9a-f]{2,4});
This should match named HTML entities (e.g. &Omega; or &nbsp;) as well as decimal (e.g. &#01234;) and hex-based (e.g. &#x0BEE;) entities. The regex will strip them out completely.
Alternatively, just store the output of htmlentities() with the weird characters kept as entities. Not ideal, but it works.
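As a rough sketch of the encode-then-strip approach, assuming UTF-8 input: the helper name strip_html_entities is hypothetical, and the pattern is the one quoted above with the i modifier added. Two limitations are worth knowing. Characters that have no named entity in the HTML 4.01 table pass through unchanged, and entity names containing digits (such as &frac12;) are not matched by [a-z]{2,8}.

```php
<?php
// Hypothetical helper: encode to entities, then delete the entities.
function strip_html_entities(string $text): string
{
    // htmlentities() encodes every character that has a named entity,
    // not just & < > and quotes. The charset is assumed to be UTF-8.
    $encoded = htmlentities($text, ENT_QUOTES | ENT_HTML401, 'UTF-8');

    // Pattern from the answer, applied case-insensitively.
    $pattern = '/&([a-z]{2,8}+|#[0-9]{2,5}|#x[0-9a-f]{2,4});/i';

    return preg_replace($pattern, '', $encoded);
}

// "\u{00E9}" is e-acute and "\u{2019}" is a right single quotation mark.
$clean = strip_html_entities("Caf\u{00E9} \u{2019}s");
```

Because plain & and < are also encoded first, they are removed along with everything else, which may or may not be what you want.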

PHP regex not matching utf-8 decoded string

I am having trouble with a regex statement. I'm not sure why it is doing this, but I think it may have something to do with character encoding.
I am using curl to fetch the page content from a website. Then I use a DOMXPath query to get a certain element, take that element's content, and run a regex against that content. However, the regex is not matching and I don't know why.
This is what I receive from the element:
X: asdasdfgdgdrrY: dfgdfgfgZ: ukuykyukjghj
a B 7dd.
Now when I try to match it with this code:
/X: (?P<x>.*)Y: (?P<y>.*)Z: (?P<z>.*)\s*(?P<a>[a-zA-Z]+) (?P<b>[a-zA-Z]+) (?P<c>[0-9]+)dd/
I have tested this in Dreamweaver and it matches, so I have no idea why it wouldn't match online.
Also, the page I am receiving is encoded as UTF-8.
I attempt to convert the content to remove the UTF-8 characters by using
iconv('utf-8', 'ISO-8859-1//IGNORE', $td->item(0)->nodeValue);
If I don't remove the UTF-8 characters, there are weird Á symbols after the 'a', 'b' and 'c' variable values.
OK, I figured it out.
All I had to do to get rid of these invisible invalid characters was:
$value = preg_replace("/[^a-zA-Z0-9 %():\$.\/-]/",' ',$value);
Pretty much, just replace any character that wasn't valid with a space or an empty string. In my case I used a space because it appeared some of the spaces were invalid.
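One possible explanation, offered as an assumption rather than a diagnosis, is that the "invalid spaces" were non-breaking spaces (U+00A0), which often show up as stray accented characters when UTF-8 is read as ISO-8859-1. If that is the case, a narrower fix is to turn only those into plain spaces and leave the rest of the text alone. The helper name normalize_nbsp is hypothetical.

```php
<?php
// Hypothetical helper: replace UTF-8 encoded no-break spaces with plain spaces.
function normalize_nbsp(string $value): string
{
    // "\xC2\xA0" is the UTF-8 byte sequence for U+00A0 (NO-BREAK SPACE).
    return str_replace("\xC2\xA0", ' ', $value);
}

// Made-up sample input with a no-break space between "a" and "B".
$value = normalize_nbsp("a\xC2\xA0B 7dd.");
```

This has to run on the UTF-8 string, before any iconv() conversion to ISO-8859-1.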

Unknown character � after importing excel to MySQL, how to avoid it? [duplicate]

Possible Duplicate:
Problem in utf-8 encoding PHP + MySQL
I've imported about 1000 records into MySQL from an Excel file. But now I'm seeing � in some of the text. It seems these were double quotes.
How can I avoid this while importing data?
Can I use str_replace() function to handle this issue while printing data in web page?
Use preg_replace to do a regex replacement of all unrecognized characters.
Example:
$data = preg_replace("/[^a-zA-Z0-9]/", "", $data);
This example will remove all non-alphanumeric characters (anything that is not a-z, A-Z, 0-9), which includes spaces and punctuation.
http://php.net/manual/en/function.preg-replace.php
If your database is simple enough (no serialised values and not gigabytes in size), you could export it entirely (e.g. using phpMyAdmin), open it in a text editor, do a search-and-replace and import it back.
$original_string = str_replace('“', '"', $original_string);
Word does this with a few characters, so you will probably also want to do:
$original_string = str_replace("‘", "'", $original_string);
If you see other characters causing the same issue, you can open the document in Word, copy and paste the offending character into your editor, and do a similar replacement.
Since you are most likely looking to replace the character with an equivalent version, you probably do not want to use a regex as suggested in another answer. str_replace is faster than preg_replace for this type of use.
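The individual replacements can be collected into one lookup table. This is a minimal sketch, assuming the stored text is valid UTF-8 and PHP 7.0 or later (for the "\u{...}" escapes). The helper name straighten_quotes and the choice of four characters are illustrative; extend the map with whatever other characters turn up.

```php
<?php
// Hypothetical helper: map typographic ("smart") quotes to ASCII quotes.
function straighten_quotes(string $text): string
{
    $map = [
        "\u{201C}" => '"', // left double quotation mark
        "\u{201D}" => '"', // right double quotation mark
        "\u{2018}" => "'", // left single quotation mark
        "\u{2019}" => "'", // right single quotation mark
    ];

    // strtr() with an array performs all replacements in a single pass.
    return strtr($text, $map);
}

$fixed = straighten_quotes("\u{201C}It\u{2019}s fine\u{201D}");
```

If the database already shows �, the original bytes may have been lost during import, in which case the replacement has to happen on the source data before importing.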

XML Non Breaking White Space

I think the cause of my woes at present is the non-breaking white space.
It appears some nasty characters have found their way into our MySQL database from our back office systems. I'm trying to produce XML output using PHP's XMLWriter, but loads of these silly characters are getting into the field.
They're displayed in nano as ^K, in gedit as a weird square box, and when you delete them manually in MySQL they don't take up any physical space, even though you know you've deleted something.
Please help me get rid of them!
Here is the line that is the nightmare at present (I've left out the rest of the XMLWriter buildup).
$writer->writeElement("description",$myitem->description);
After you have identified which character specifically you want to remove (and its byte sequence), you can just remove it, for example with str_replace:
$binSequence = "..."; // the binary representation of the character in question
$descriptionFiltered = str_replace($binSequence, '', $myitem->description);
$writer->writeElement("description", $descriptionFiltered);
You have not yet said which specific character you mean, so I can't give the byte sequence yet. If you're dealing with a group of characters, the filtering might differ a bit.
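To identify the byte sequence in the first place, one option is to list the control bytes a string contains as hex. This is only a sketch; the helper name find_control_chars is hypothetical, and it looks only for ASCII control characters other than tab, line feed and carriage return.

```php
<?php
// Hypothetical helper: return the distinct control bytes in $text as hex strings.
function find_control_chars(string $text): array
{
    // Byte-level match (no /u), so it also works on text of unknown encoding.
    preg_match_all('/[\x00-\x08\x0B\x0C\x0E-\x1F\x7F]/', $text, $matches);

    return array_values(array_unique(array_map('bin2hex', $matches[0])));
}

// Made-up sample: a description containing two vertical tabs (byte 0x0B).
$found = find_control_chars("foo\x0B bar\x0B baz");
```

Each hex value in the result can then be turned back into a replacement target, e.g. "0b" corresponds to "\x0B".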
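It seems they are vertical tabs, ASCII 0x0B. You should be able to REPLACE them in MySQL (my_table is a placeholder for your table name):
SELECT REPLACE(`value`, CHAR(11), '') FROM `my_table` WHERE `key` = 'foo';
Note that MySQL string literals have no \v escape; an unrecognised sequence such as '\v' is read as a plain v, so CHAR(11) is used instead. Alternatively, you can remove the character afterwards in PHP with a simple str_replace (the "\v" escape is supported since PHP 5.2.5):
str_replace("\v", '', $result);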
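If more than one kind of control character has crept in, a single pass can remove every control character that XML 1.0 does not allow (everything below 0x20 except tab, line feed and carriage return). This is a sketch under that assumption; the helper name strip_xml_control_chars is hypothetical.

```php
<?php
// Hypothetical helper: drop control characters that are not legal in XML 1.0.
function strip_xml_control_chars(string $text): string
{
    // Byte-level match is safe for UTF-8: these byte values never occur
    // inside a multi-byte sequence.
    return preg_replace('/[\x00-\x08\x0B\x0C\x0E-\x1F]/', '', $text);
}

// Made-up sample standing in for $myitem->description.
$descriptionFiltered = strip_xml_control_chars("foo\x0B bar");
```

The filtered value can then be passed to $writer->writeElement() in place of the raw description.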

remove invalid chars from html document

I have a bunch of files which are supposed to be HTML documents for the most part. However, sometimes the editor(s) copied and pasted text from other sources into them, so now I come across some weird chars every now and then, for example a non-encoded copyright sign, weird things that look like a dash or minus but are something else (ascii #146?), or a single char that looks like "...".
I had a look at get_html_translation_table(), but this will only replace the "usual" special chars like &, euro signs etc. It seems I need a regex that specifies only the allowed chars and discards all the unknown ones. I tried this, but it didn't work at all:
function fixNpChars($string)
{
    // Characters in the hexadecimal ranges 00–08, 0B–0C, 0E–1F, 7F, and 80–9F cannot be used in an HTML document, not even by reference.
    $pattern = '/[\x{0000}-\x{0008}][\x{000B}-\x{000C}][\x{000E}-\x{001F}][\x{0080}-\x{009F}][x{007F}]/u';
    $replacement = '';
    return preg_replace($pattern, $replacement, $string);
}
Any idea what's wrong here?
EDIT:
The database where I store my imported files and the PHP side are all set to UTF-8 (content type utf-8, db table charset utf8/utf8_general_ci, and mysql_set_charset('utf8',$this->mHandle); executed after the db connection is established). Most of the imported files are either UTF-8 or ISO-8859-1.
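Your regex syntax looks a little problematic: the ranges are written as five consecutive character classes, so the pattern only matches five such characters in a row, and the last class is missing its backslash. Maybe this, with everything in a single class?:
$pattern = '/[\x00-\x08\x0B\x0C\x0E-\x1F\x7F\x80-\x9F]/u';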
Don't think of removing the invalid characters as the best option; this problem can also be solved using the htmlentities and html_entity_decode functions.
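For anyone who does want the removal approach from the question, here is a sketch with all the ranges in one character class. It assumes PHP 7.0 or later and UTF-8 input; the function name strip_invalid_html_chars is hypothetical. Since some of the imported files are ISO-8859-1, note that preg_replace() with the u modifier returns null on input that is not valid UTF-8, so such files would need converting first.

```php
<?php
// Hypothetical helper: remove characters in the ranges 00-08, 0B-0C, 0E-1F
// and 7F-9F, which the question lists as not allowed in an HTML document.
function strip_invalid_html_chars(string $string): string
{
    // One character class holding every range; 7F and 80-9F are merged.
    $pattern = '/[\x00-\x08\x0B\x0C\x0E-\x1F\x7F-\x{9F}]/u';
    $result = preg_replace($pattern, '', $string);

    // null means $string was not valid UTF-8; this sketch returns it unchanged.
    return $result ?? $string;
}

// "\u{0092}" is the C1 control that Windows-1252 text uses for a curly quote.
$html = strip_invalid_html_chars("a\u{0092}b\x07!");
```

Returning the input unchanged on failure is a design choice for the sketch; throwing an exception or converting with iconv() first may suit an import script better.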
