regex to remove complete HTML entity - php

We have a requirement to remove special characters from text strings. For example, we may get a string that looks like this, where &#174; is the entity for the registered trademark symbol (®):
PEPSI&#174; Bottle 20 oz<br><br>
I'm not great with regex, and can't figure out how to edit the existing code to strip the entity without damaging the rest of the string.
Here's what we currently have:
$ui = "PEPSI&#174; Bottle 20 oz<br><br>";
$ui = preg_replace('/[^A-Za-z0-9\.\' -]/', '', $ui);
This results in PEPSI174 Bottle 20 ozbrbr.
Our desired result is PEPSI Bottle 20 oz<br><br>.
How can I edit the regex to make sure that:
- it doesn't remove valid HTML tags like <br>, and
- if it does find a special character entity, it removes not only the special characters (the & and #), but also the numbers and semicolon?
We don't want to have it remove all the numbers, as obviously the string can contain numbers; it's only numbers that are part of the entity code that we need to remove.

You could use this, though I can't guarantee it covers all possible HTML entities:
$res = preg_replace('/&[A-Za-z0-9#]+;/', '', $ui);
That says: replace any substring that
- starts with &
- is followed by one or more alphanumeric characters or #, in any order
- is followed by ;.
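For illustration, here is a minimal sketch of the same idea applied to the string from the question. The helper name strip_entities is hypothetical, and the slightly stricter pattern (named, decimal, or hexadecimal references only) is an assumption on my part, not something the answer above prescribes. Tags such as <br> are left alone because they never start with &.

```php
<?php
// Hypothetical helper: removes anything shaped like an HTML entity.
function strip_entities(string $text): string
{
    // Named (&reg;), decimal (&#174;) or hexadecimal (&#xAE;) references.
    $pattern = '/&(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);/';
    return preg_replace($pattern, '', $text);
}

$ui = "PEPSI&#174; Bottle 20 oz<br><br>";
echo strip_entities($ui), "\n"; // PEPSI Bottle 20 oz<br><br>
```

Note that this deletes the entity outright; if you would rather keep the character it stands for, html_entity_decode() is the usual alternative.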

Related

PHP - preg_match_all - a little advanced

I need to find a specific part of the text in a string.
That text needs to have:
12 characters (letters and numbers only)
the whole string must contain at least 3 digits
3*4 characters with spaces (e.g. K9X6 6GM6 LM11)
every block from the example above must contain at least 1 number
words like this, line, spod shouldn't be recognized
So I ended up with this code:
preg_match_all("/(?<!\S)(?i:[a-z\d]{4}|[a-z\d]{12})(?!\S)/", $input_lines, $output_array);
But it doesn't work for all of the requirements. Of course I can use preg_replace or str_replace to remove all (!,?,#) and count the digits in a loop to check whether there are 4 or more, but I wonder if it is possible to do it with preg_match_all...
Here is a string to search in:
?K9X6 6GM6 LM11 // not recognized - but it should be
!K9X6 6GM6 LM11 // not recognized - but it should be
K0X6 0GM7 LM12! // not recognized - but it should be
K1X6 1GM8 LM13# // not recognized - but it should be
K2X6 2GM9 LM14? // not recognized - but it should be
K3X6 3GM0 LM15# // not recognized - but it should be
K4X6 4GM1 LM16* // not recognized - but it should be
K5X65GM2LM17
bla bla bla
this shouldn't be visible
spod also shouldn't be visible
but line below should be!!
K9X66GM6LM11! (see that "!" at the end? Help me with this)
A correct preg_match_all should return this:
K9X6
6GM6
LM11
K9X6
6GM6
LM11
K0X6
0GM7
LM12
K1X6
1GM8
LM13
K2X6
2GM9
LM14
K3X6
3GM0
LM15
K4X6
4GM1
LM16
K5X65GM2LM17
K9X66GM6LM11
working example: http://www.phpliveregex.com/p/bHX

The following should do the trick:
\b(?:(?=.{0,3}?\d)[A-Za-z\d]{4}\s??){3}\b
[A-Za-z\d]{4} matches 4 letters/digits
(?=.{0,3}?\d) checks there's a digit in these 4 characters
\s?? matches a whitespace character, but tries not to match it if possible
\b makes sure the match isn't part of a larger word
Note that this will also allow strings like K2X6 2GM9LM14; I'm not sure whether you want these to match or not.
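A minimal sketch of how the pattern above could be used from PHP. One thing to be aware of: the pattern returns each code as a single match (for example "K9X6 6GM6 LM11"), so to get the block-by-block output the question asks for, this sketch splits every match on whitespace afterwards. The function name find_code_blocks and that post-processing step are my assumptions.

```php
<?php
// Hypothetical helper: finds 12-character codes and returns their blocks.
function find_code_blocks(string $text): array
{
    // Three groups of 4 letters/digits, each containing at least one digit,
    // optionally separated by a single whitespace character.
    $pattern = '/\b(?:(?=.{0,3}?\d)[A-Za-z\d]{4}\s??){3}\b/';
    preg_match_all($pattern, $text, $matches);

    $blocks = [];
    foreach ($matches[0] as $code) {
        // "K9X6 6GM6 LM11" becomes three blocks, "K5X65GM2LM17" stays whole.
        foreach (preg_split('/\s+/', $code, -1, PREG_SPLIT_NO_EMPTY) as $block) {
            $blocks[] = $block;
        }
    }
    return $blocks;
}

$input = "?K9X6 6GM6 LM11\nK0X6 0GM7 LM12!\nK5X65GM2LM17\nbla bla bla\nspod also\nK9X66GM6LM11!";
print_r(find_code_blocks($input));
```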

Regex starts with x or x prefixed or suffixed

I'm trying to write a pattern that matches strings like the following, so that I can convert every matching line into a list item <li>:
-Give on result
&Second new text
-The third text
Another paragraph without list.
-New list here
In natural language: match every string that starts with - and ends with the newline character \n.
I tried the following pattern, which works fine:
/^([-|-]\w+\s*.*)?\n*$/gum
Of course we can write it simply without the square brackets, ^(-\w+\s*.*)?\n*$, but for debugging I used it as described.
In the example above, when I replace the second - with & to get ^([-|&]\w+\s*.*)?\n*$, it works fine too and it matches the second line of the sample string. However, I was not able to make it match a - prefixed or suffixed with white space.
I changed the sample string to:
- Give on result
&Second new text
-The third text
Another paragraph without list.
-New list here
and I tried the following pattern:
/^([-|\- |&| -]\w+\s*.*)?\n*$/gum
However, it failed to match a - suffixed or prefixed with white space.

As I understand it, you want to match a line that starts with an element e (e being & or -), where e may be preceded and/or followed by space(s).
^\s*[&-]\s*(.*)$
If you do not want multiline matching, simply do not use the m modifier.

^(\h*(?:-|&)\h*\w+\s*.*)\n*$
You can try this. | inside [] has no special meaning. See demo.
https://regex101.com/r/nS2lT4/3
A line may start with whitespace, then it must have either - or &, which may be followed by spaces. Then it must have at least one alphanumeric character, which may be followed by whitespace. Then it can have anything or nothing. At the end, it consumes all the trailing newlines, or none if there aren't any.
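As a rough sketch, the conversion to <li> items could look like this in PHP. It combines ideas from both answers: the [-&] marker with optional surrounding whitespace, and \h (horizontal whitespace) so that the optional spaces can never run across a line break. The function name lines_to_list_items and the sample text with a leading space are my own assumptions.

```php
<?php
// Hypothetical helper: wraps every line that starts with - or & in <li>.
function lines_to_list_items(string $text): string
{
    // m modifier: ^ and $ match at every line start/end.
    // \h* allows spaces or tabs before and after the marker.
    return preg_replace('/^\h*[-&]\h*(.+)$/m', '<li>$1</li>', $text);
}

$text = "- Give on result\n&Second new text\n -The third text\n"
      . "Another paragraph without list.\n-New list here";
echo lines_to_list_items($text), "\n";
```

Lines without a leading marker, such as the paragraph in the middle, are passed through unchanged.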

Stripping down Phonenumber (mobile)

Is there a function or an easy way to strip phone numbers down to a specific format?
The input can be a number (mobile, different country codes), maybe
+4917112345678
+49171/12345678
0049171 12345678
or maybe from another country
004312345678
+44...
I'm doing a
$mobile_new = preg_replace("/[^0-9]/","",$mobile);
to remove everything except digits, because I need it in the format 49171 (without + or 00 at the beginning). But I also need to handle the case where 00 is entered first, or someone uses +49(0)171, or inputs 0171 (which needs to become 49171).
So the first digits ALWAYS need to be the country code, without +/00 and without any (0) in between.
Can someone give me advice on how to solve this?

You can use
(?:^(?:00|\+|\+\d{2}))|\/|\s|\(\d\)
to match most of your cases and simply replace them with nothing. For example:
$mobile = "+4917112345678";
$mobile_new = preg_replace("/(?:^(?:00|\+|\+\d{2}))|\/|\s|\(\d\)/","",$mobile);
echo $mobile_new;
//output: 4917112345678
Explanation:
I'm making use of OR (alternation) here, matching each of your cases one by one:
(?:^(?:00|\+|\+\d{2})) matches 00 or + at the beginning of your string (the third alternative, \+\d{2}, is never reached because the plain \+ before it already matches, which is why the country code is kept)
\/ matches a / anywhere in the string
\s matches a whitespace character anywhere in the string (it matches the newline in the regex101 demo, but I suppose you match each number on its own)
\(\d\) matches a digit enclosed in parentheses anywhere in the string
The only case not covered by this regex is the input format 01712345678, as you can only guess what the country-specific prefix should be. If you want it to be 49 by default, then simply replace the single leading 0 with 49:
$mobile = "01712345678";
$mobile_new = preg_replace("/^0/","49",$mobile);
echo $mobile_new;
//output: 491712345678
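The two steps above can be combined into one small function. This is only a sketch under a few assumptions: the function name normalize_mobile and the $defaultCountry parameter are hypothetical, the default of 49 is taken from the question, and the final \D cleanup is an extra safety net that the answer does not include.

```php
<?php
// Hypothetical helper: reduces a phone number to "<country code><number>".
function normalize_mobile(string $input, string $defaultCountry = '49'): string
{
    $input = trim($input);
    // Drop a leading 00 or +, and any slash, whitespace or "(0)"-style group.
    $digits = preg_replace('/^(?:00|\+)|\/|\s|\(\d\)/', '', $input);
    // A single leading 0 means a national number: assume the default country.
    $digits = preg_replace('/^0/', $defaultCountry, $digits);
    // Remove anything that is still not a digit (dashes, dots, ...).
    return preg_replace('/\D/', '', $digits);
}

foreach (['+4917112345678', '+49171/12345678', '0049171 12345678',
          '+49(0)171 12345678', '0171 12345678', '004312345678'] as $mobile) {
    echo $mobile, ' => ', normalize_mobile($mobile), "\n";
}
```

The order matters: the international prefix is removed first, so the second replacement only ever sees a genuinely national leading 0.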

The pattern (49)\(?([0-9]{3})[\)\s\/]?([0-9]{8}) splits the number into three groups:
49 - country code
3 digits - area code
8 digits - number
After a match you can construct the clean number by concatenating them as \1\2\3.
Demo: https://regex101.com/r/tE5iY3/1
If this doesn't suit you, then please explain more precisely what you want, with test input and expected output.
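A short sketch of that approach in PHP, using preg_match and joining the three captured groups. The function name extract_german_mobile is hypothetical, and returning null for non-matching input is my assumption about how a caller might want to handle failures.

```php
<?php
// Hypothetical helper: returns country code + area code + number, or null.
function extract_german_mobile(string $input): ?string
{
    $pattern = '/(49)\(?([0-9]{3})[\)\s\/]?([0-9]{8})/';
    if (preg_match($pattern, $input, $m) !== 1) {
        return null; // input does not fit the 49 + 3 + 8 digit layout
    }
    // $m[1] = country code, $m[2] = area code, $m[3] = subscriber number
    return $m[1] . $m[2] . $m[3];
}

var_dump(extract_german_mobile('+49171/12345678'));
var_dump(extract_german_mobile('0049171 12345678'));
var_dump(extract_german_mobile('+44 20 1234'));
```

Because the pattern hard-codes 49 and fixed group lengths, it only fits numbers with exactly that layout.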

I recommend taking a look at LibPhoneNumber by Google and its port for PHP.
It has support for many formats and countries and is well-maintained. Better not to figure this out yourself.
https://github.com/giggsey/libphonenumber-for-php
$phoneUtil = \libphonenumber\PhoneNumberUtil::getInstance();
$usNumberProto = $phoneUtil->parse("+1 650 253 0000", "US");

PHP Regex for full names in a specific format

I'm trying to write a function to verify names in PHP using regex. I want the names to be able to contain any number of spaces, ' and -, allowing only capital letters after spaces but both capital and non-capital letters after - and '. Also, the total length should be at most 50 characters and the name should end with a lowercase letter. Note that the uppercase letters are A to Z plus these characters:
ÙÒÌÈÀÁÉÍßÓÚÝÂÊÎÔÛÃÑÕÄÅÆŒÇÐØËÏÖÜŸ
and the lowercase letters are a to z plus these characters:
éçàèàèìòùáéíóúýâêîôûãñõäëïöüÿåæœçðøß
Each word (the text between a space, ' or - and the next one) should contain at least 2 characters. The name should also start with an uppercase letter and finish with a lowercase letter, and within a word no uppercase letter is allowed except at the beginning.
Examples of acceptable names are :
Adam Klsld
Adam'odskdl
Adam'Ddlsl
Ùdam-ddkkdk
Addssd-Ddsdsd
I've tried a lot, but here's my last attempt, which I still keep in my PHP file; the others I deleted in the chaos of unsuccessful attempts (I'm using the mb_ereg function to match, so this is a POSIX ERE):
([A-ZÙÒÌÈÀÁÉÍßÓÚÝÂÊÎÔÛÃÑÕÄÅÆŒÇÐØËÏÖÜŸ][a-zéçàèàèìòùáéíóúýâêîôûãñõäëïöüÿåæœçðøß]+){1}((^[\'\-\s])[A-ZÙÒÌÈÀÁÉÍßÓÚÝÂÊÎÔÛÃÑÕÄÅÆŒÇÐØËÏÖÜŸ][a-zéçàèàèìòùáéíóúýâêîôûãñõäëïöüÿåæœçðøß]+)*
(this is not necessarily my best attempt, but I thought it might help and give an idea of how much of a dork I am)

I wouldn't exactly suggest you use this... but I think this does what you want?
^([A-ZÙÒÌÈÀÁÉÍßÓÚÝÂÊÎÔÛÃÑÕÄÅÆŒÇÐØËÏÖÜŸ][a-zéçàèàèìòùáéíóúýâêîôûãñõäëïöüÿåæœçðøß]+){1}((([\s])[A-ZÙÒÌÈÀÁÉÍßÓÚÝÂÊÎÔÛÃÑÕÄÅÆŒÇÐØËÏÖÜŸ][a-zéçàèàèìòùáéíóúýâêîôûãñõäëïöüÿåæœçðøß]+)|((['\-])([A-ZÙÒÌÈÀÁÉÍßÓÚÝÂÊÎÔÛÃÑÕÄÅÆŒÇÐØËÏÖÜŸ]|[a-zéçàèàèìòùáéíóúýâêîôûãñõäëïöüÿåæœçðøß])[a-zéçàèàèìòùáéíóúýâêîôûãñõäëïöüÿåæœçðøß]+))*$
Here it is in a non-code block so you can see how insane it is... think it strips some characters here though:
^([A-ZÙÒÌÈÀÁÉÍßÓÚÝÂÊÎÔÛÃÑÕÄÅÆŒÇÐØËÏÖÜŸ][a-zéçàèàèìòùáéíóúýâêîôûãñõäëïöüÿåæœçðøß]+){1}((([\s])[A-ZÙÒÌÈÀÁÉÍßÓÚÝÂÊÎÔÛÃÑÕÄÅÆŒÇÐØËÏÖÜŸ][a-zéçàèàèìòùáéíóúýâêîôûãñõäëïöüÿåæœçðøß]+)|((['-])([A-ZÙÒÌÈÀÁÉÍßÓÚÝÂÊÎÔÛÃÑÕÄÅÆŒÇÐØËÏÖÜŸ]|[a-zéçàèàèìòùáéíóúýâêîôûãñõäëïöüÿåæœçðøß])[a-zéçàèàèìòùáéíóúýâêîôûãñõäëïöüÿåæœçðøß]+))*$

Does this regex check what you need?
(You'll have to add the weird characters inside each bracket, of course.)

You can use Unicode character classes to avoid the accented-character issue:
$pattern = "~^[\p{Lu}ß]\p{Ll}*+(?>(?> [\p{Lu}ß]|['-]\p{L})\p{Ll}*+)*$~u";
if(preg_match($pattern, $name)) { ...
Or for a more specific set of characters:
$pattern = "~(?(DEFINE)(?<Up>[A-ZÙÒÌÈÀÁÉÍßÓÚÝÂÊÎÔÛÃÑÕÄÅÆŒÇÐØËÏÖÜŸ]))
(?(DEFINE)(?<Lo>[a-zéçàèàèìòùáéíóúýâêîôûãñõäëïöüÿåæœçðøß]))
^\g<Up>\g<Lo>*+(?>(?>\h\g<Up>|['-]\g<Up>?+\g<Lo>)\g<Lo>*+)*+$~ux";
if (preg_match($pattern, $name, $matches)) { ...
or the same in a shorter way:
$pattern = "~(?(DEFINE)(?<Up>[A-ZÀ-ÖØ-ÝŸßŒ]))
(?(DEFINE)(?<Lo>[a-zà-öø-ýÿßœ]))
^\g<Up>\g<Lo>*+(?>(?>\h\g<Up>|['-]\g<Up>?+\g<Lo>)\g<Lo>*+)*+$~ux";
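To make the first variant concrete, here is a small validation sketch. It departs from the pattern above in three places, all of which are my assumptions based on the question's requirements: \p{Ll}+ instead of \p{Ll}* so that every word has at least 2 characters, a lookahead that limits the total length to 50 characters, and \z instead of $ so that a trailing newline is rejected. The function name is_valid_name is hypothetical, and the source file is assumed to be saved as UTF-8.

```php
<?php
// Hypothetical helper: checks a full name against the rules in the question.
function is_valid_name(string $name): bool
{
    $pattern = '~^(?=.{2,50}\z)'                 // at most 50 characters in total
             . '[\p{Lu}ß]\p{Ll}+'                // first word: Upper + lowercase
             . '(?:(?: [\p{Lu}ß]|[\'-]\p{L})'    // space+Upper, or \'/- + any letter
             . '\p{Ll}+)*\z~u';                  // rest of each word in lowercase
    return preg_match($pattern, $name) === 1;
}

foreach (["Adam Klsld", "Adam'odskdl", "Ùdam-ddkkdk", "Adam klsld", "AdaM"] as $name) {
    echo $name, ': ', is_valid_name($name) ? 'ok' : 'rejected', "\n";
}
```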

Splitting string containing letters and numbers not separated by any particular delimiter in PHP

Currently I am developing a web application that fetches a Twitter stream, and I am trying to build natural language processing on my own.
Since my data is from Twitter (limited to 140 characters), many words are shortened or, in this case, have their spaces omitted.
For example:
"Hi, my name is Bob. I m 19yo and 170cm tall"
Should be tokenized to:
- hi
- my
- name
- bob
- i
- 19
- yo
- 170
- cm
- tall
Notice that 19 and yo in 19yo have no space between them. I use it mostly for extracting numbers with their units.
Simply put, what I need is a way to 'explode' each token that has a number in it into chunks of digits or letters, without a delimiter.
'123abc' will be ['123', 'abc']
'abc123' will be ['abc', '123']
'abc123xyz' will be ['abc', '123', 'xyz']
and so on.
What is the best way to achieve it in PHP?
I found something close to it, but it's C# and specifically for day/month splitting: How do I split a string in C# based on letters and numbers

You can use preg_split
$string = "Hi, my name is Bob. I m 19yo and 170cm tall";
$parts = preg_split("/(,?\s+)|((?<=[a-z])(?=\d))|((?<=\d)(?=[a-z]))/i", $string);
var_dump ($parts);
When matching against the digit-letter boundary, the regular expression match must be zero-width. The characters themselves must not be included in the match. For this the zero-width lookarounds are useful.
http://codepad.org/i4Y6r6VS
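Building on the zero-width lookarounds described above, a tokenizer could be sketched like this. The function name tokenize is hypothetical, and two details are my assumptions rather than part of the answer: the text is lowercased first, and any run of non-alphanumeric characters (not just commas and spaces) acts as a separator, so "Bob." yields "bob". Unlike the list in the question, this sketch does not drop words such as "is" or "and".

```php
<?php
// Hypothetical helper: splits text into lowercase word and number tokens.
function tokenize(string $text): array
{
    // Split on punctuation/whitespace, or at a letter|digit boundary.
    $pattern = '/[^a-z\d]+|(?<=[a-z])(?=\d)|(?<=\d)(?=[a-z])/i';
    return preg_split($pattern, strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
}

print_r(tokenize('abc123xyz'));  // abc, 123, xyz
print_r(tokenize('Hi, my name is Bob. I m 19yo and 170cm tall'));
```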

How about this:
You extract the numbers from the string using regexps, store them in an array, and replace the numbers in the string with some kind of special placeholder that will 'hold' their position. After parsing the string, which now consists only of your placeholders and normal characters, you feed the numbers from the array back into their reserved places.
Just an idea, but IMHO it might work for you.
EDIT:
Try running this short code; hopefully you will see my point in the output. (This code doesn't work on codepad, I don't know why.)
<?php
$str = "Hi, my name is Bob. I m 19yo and 170cm tall";
preg_match_all("#\d+#", $str, $matches);
$str = preg_replace("!\d+!", "#SPEC#", $str);
print_r($matches[0]);
print $str;
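The snippet above covers the extraction and masking half of the idea. As a sketch of the missing half, the numbers could be put back with preg_replace_callback. The function names mask_numbers and restore_numbers are hypothetical, and the code assumes the placeholder #SPEC# does not already occur in the text.

```php
<?php
// Hypothetical helpers for the placeholder idea described above.
function mask_numbers(string $text, string $placeholder = '#SPEC#'): array
{
    preg_match_all('/\d+/', $text, $matches);
    $masked = preg_replace('/\d+/', $placeholder, $text);
    return [$masked, $matches[0]]; // masked text plus the numbers in order
}

function restore_numbers(string $masked, array $numbers, string $placeholder = '#SPEC#'): string
{
    $i = 0;
    return preg_replace_callback(
        '/' . preg_quote($placeholder, '/') . '/',
        function (array $m) use ($numbers, &$i): string {
            // Take the next stored number; keep the placeholder if none is left.
            return $numbers[$i++] ?? $m[0];
        },
        $masked
    );
}

[$masked, $numbers] = mask_numbers('I m 19yo and 170cm tall');
echo $masked, "\n";                             // I m #SPEC#yo and #SPEC#cm tall
echo restore_numbers($masked, $numbers), "\n";  // I m 19yo and 170cm tall
```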
