Splitting large strings into words in PHP

I have a long string in PHP consisting of several paragraphs, each with several sentences (it is pretty much a small document). I want to split the whole thing into words by removing any symbols or characters that are not relevant: commas, spaces, new lines, full stops, exclamation marks and anything else, so as to end up with only words.
Is there an easy way of doing this in one go, for example with a regular expression and the preg_split function, or do I have to use the explode function a number of times: first get all the sentences (by removing '.', '!' etc.), then get the words by removing ',' and spaces, and so on?
I would rather not use explode on every possible irrelevant character, since that is time-consuming and I may accidentally omit some of them.
I would like to find a more automatic way. I think a well-defined regular expression might do the work, but again I would need to specify all the possible characters, and I have no idea how to write regular expressions in PHP.
So what can you suggest?

Do you want to remove punctuation characters, etc and then split the words into an array? Or just strip it so there are only letters and spaces? Not exactly sure what you're trying to achieve, but the following might help:
<?php
$string = "This is a sentence! It has *lots* of #$#king random non-word characters. Wouldn't you like to strip them?";
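$words = preg_replace("/[^\w\s]+/", '', $string); // strip punctuation but keep whitespace, so words separated only by a new line are not glued together
$words = preg_split("/\s+/", $words, -1, PREG_SPLIT_NO_EMPTY); // split on the remaining whitespace (spaces, tabs, new lines)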
var_dump($words);
Either way, it gives you the general idea of using regular expressions to manipulate text as needed. My example has two parts; this way words like "wouldn't" aren't split into two words as other answers have suggested (the apostrophe is simply dropped, giving "Wouldnt").

To be Unicode compatible, you should use this one:
preg_split('/\PL+/u', $string, -1, PREG_SPLIT_NO_EMPTY);
which splits on characters that are not letters.
Have a look here to see the Unicode character properties.
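A minimal sketch of how that call behaves, assuming a made-up sample string that mixes Cyrillic and Latin text (the sample is an illustration, not from the question). Note that because apostrophes and digits are not letters, a word like "don't" would come out as "don" and "t" with this pattern.

```php
<?php
// Hypothetical sample text, used only to exercise the /u modifier.
$string = "Привет, мир!\nGrüße, world.";

// \PL matches any character that is NOT a Unicode letter; the /u modifier
// makes PCRE treat both pattern and subject as UTF-8.
// PREG_SPLIT_NO_EMPTY drops the empty pieces left at the edges.
$words = preg_split('/\PL+/u', $string, -1, PREG_SPLIT_NO_EMPTY);

print_r($words);
```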

Just use preg_replace(): define a regular expression that matches the characters you wish to replace, and provide a replacement for them.
http://php.net/manual/en/function.preg-replace.php
The patterns you wish to search for can also be defined in a PHP array, as shown in the PHP manual.
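Something along these lines might work as a starting point. The sample string and the choice of two patterns are my own assumptions; preg_replace() applies the patterns in order, pairing each one with the replacement at the same index.

```php
<?php
// Hypothetical input.
$string = "Hello, world!\nHow are you?";

$patterns = [
    '/[[:punct:]]+/', // runs of ASCII punctuation
    '/\s+/',          // runs of whitespace, including new lines
];
$replacements = [
    '',  // punctuation is dropped
    ' ', // whitespace runs collapse to a single space
];

$clean = preg_replace($patterns, $replacements, $string);
$words = explode(' ', trim($clean));

print_r($words);
```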

Your answer is in the domain of regular expressions and would probably be very difficult to get right. You could get something that works well in almost all cases but there would be exceptions.
This might help:
http://www.regular-expressions.info/wordboundaries.html
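As a rough illustration of the word-boundary idea, here is a sketch that collects matches instead of stripping characters. Treating an apostrophe as part of a word is an assumption on my part, and it is exactly the kind of rule that will have exceptions.

```php
<?php
// Hypothetical sample text.
$text = "Wouldn't you like 'quoted' words?";

// \b is a word boundary; the class allows word characters and apostrophes,
// so "Wouldn't" stays whole while surrounding quotes are left out.
preg_match_all("/\b[\w']+\b/", $text, $matches);

print_r($matches[0]);
```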

Related

Way to match *all* newline-type character sequences in a text file

I'm looking for a way to obtain all sequences of newline-like characters found in a string. I'm trying to use preg_match() as follows:
preg_match('/^[^\r\n]*(?:([\r\n]+)|[^\r\n]*)+$/', $input_text, $matches);
But I only appear to be getting the last such match. I feel like the solution probably involves the use of \G, but when I attempt to introduce it, the match fails entirely. I don't think I'm understanding how to use it correctly or where it should go.
I realize my pattern will match multiple newlines in sequence (i.e. blank lines lead to multiple newlines in a single match). This is what I want.
For example, for the string:
"ABC\nDEF\r\n\rGHI\n\n\r\n",
I would like to get:
[ "\n", "\r\n\r", "\n\n\r\n" ]
Thanks for any assistance.
Use
preg_match_all('/\R+/u', "ABC\nDEF\r\n\rGHI\n\n\r\n", $matches);
It returns all line ending sequences because \R+ matches one or more line ending sequences.
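For what it's worth, the original preg_match() attempt only kept the last sequence because a capturing group inside a repetition holds just its final iteration. A small sketch with the string from the question, using json_encode() only to make the invisible characters readable:

```php
<?php
$input = "ABC\nDEF\r\n\rGHI\n\n\r\n";

// \R matches any single line ending (\r\n counts as one unit); + groups
// consecutive ones, and preg_match_all() collects every occurrence.
preg_match_all('/\R+/u', $input, $matches);

echo json_encode($matches[0]), PHP_EOL; // ["\n","\r\n\r","\n\n\r\n"]
```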

Regex Expression for strings which look like javascript objects in PHP

We have strings in php like the following two examples:
{{'LANGUAGE_ID','String inclusive special chars (,/)'}}
{{'LANGUAGE_ID','String inclusive special chars (,/)','Another string inclusive special chars (,/)'}}
The strings are always surrounded by {{ and }}. Inside we have multiple elements separated by commas and surrounded by single quotes. The first element is always a word (\w). After that we have an unknown number of elements, each of which can be a word or a sentence including special characters. What we want to get is the content (the text between single quotes) of each element.
We have a solution as long as we know how many elements the string contains.
Solution for 1. example: {{'([\w]+)','([^\n\r']+)'}}
Solution for 2. example: {{'([\w]+)','([^\n\r']+)','([^\n\r']+)'}}
We are looking for a solution which works for both examples, or even for an example with three or more elements.
We have a shared regex to play around with here:
http://regexr.com/3c58c
You can use this regex using \G:
preg_match_all("/(?:{{|\G,)'([^']+)'(?=.*?}})/", $text, $matches); // double-quoted so the single quotes inside the pattern don't end the PHP string
print_r($matches);
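A small sketch of the \G approach run against both sample strings from the question. The braces are escaped here purely as a precaution, and the $samples and $results names are just for illustration. \G anchors each new match at the point where the previous one ended, which is what lets a single pattern walk through any number of elements.

```php
<?php
// The two sample strings from the question.
$samples = [
    "{{'LANGUAGE_ID','String inclusive special chars (,/)'}}",
    "{{'LANGUAGE_ID','String inclusive special chars (,/)','Another string inclusive special chars (,/)'}}",
];

// Start at "{{" or directly after the previous match followed by a comma,
// capture the text between single quotes, and require "}}" further on.
$pattern = "/(?:\{\{|\G,)'([^']+)'(?=.*?\}\})/";

$results = [];
foreach ($samples as $text) {
    preg_match_all($pattern, $text, $matches);
    $results[] = $matches[1]; // group 1 holds the element contents
}

print_r($results);
```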
How about this one:
{{'([\w]+)',('([^\n\r']+)',*)*}}

pcre regex to match first two words, numbers

I need a regular expression to match only the first two words in a string (they may contain letters, numbers, commas and other punctuation, but not white space, tabs or new lines).
My solution is ([^\s]+\s+){2}, but although it matches the first two words in a string like '123 word, hello', it doesn't work on a string with just two words and no white space after them.
What is the right regex for this task?
You have it almost right:
(\S+\s+\S+)
Assuming you don't need stronger control on what characters to use.
If you need to match either two words or only one word, you may use one of these:
(\S+\s+\S+|\S+)
(\S+(?:\s+\S+)?)
Instead of trying to match the words, you could split the string on whitespace with preg_split().
If you really only want to allow numbers and letters, [^\s] is not restrictive enough. Use this:
/[a-z0-9]+(\s+[a-z0-9]+)?/i
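To compare the suggestions, here is a short sketch. firstTwoWords() is a hypothetical helper name; it uses the optional-second-word pattern, and the last lines show the preg_split() alternative.

```php
<?php
// Hypothetical helper: returns the first one or two whitespace-separated words.
function firstTwoWords(string $s): string
{
    // \S+ is a run of non-whitespace; the second word is optional.
    return preg_match('/\S+(?:\s+\S+)?/', $s, $m) ? $m[0] : '';
}

echo firstTwoWords('123 word, hello'), PHP_EOL; // 123 word,
echo firstTwoWords('hello world'), PHP_EOL;     // hello world
echo firstTwoWords('single'), PHP_EOL;          // single

// Alternative: split on whitespace and keep at most the first two pieces.
$parts = array_slice(
    preg_split('/\s+/', '123 word, hello', -1, PREG_SPLIT_NO_EMPTY),
    0,
    2
);
print_r($parts);
```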

Multiple regular expressions in PHP

Is there a better way to handle running multiple regular expressions in PHP (or in general)?
I have the code below to replace non alphanumeric characters with dashes, once the replacement has happened I want to remove the instances of multiple dashes.
$slug = preg_replace('/[^A-Za-z0-9]/i', '-', $slug);
$slug = preg_replace('/\-{2,}/i', '-', $slug);
Is there a neater way to do this, i.e. set up the regex to replace one pattern and then the other?
(I'm like a kid with a fork in a socket when it comes to regex)
You can eliminate the second preg_replace by saying what you really mean in the first one:
$slug = preg_replace('/[^a-z0-9]+/i', '-', $slug);
What you really mean is "replace all sequences of one or more non-alphanumeric characters with a single hyphen" and that's what /[^a-z0-9]+/i does. Also, you don't need to include the upper and lower case letters when you specify a case insensitive regex so I took that out.
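A quick sketch of the single-call version wrapped in a helper. The slugify() name, the trimming of leading and trailing hyphens, and the lowercasing are my own additions, not part of the answer above.

```php
<?php
// Hypothetical helper name.
function slugify(string $title): string
{
    // One or more non-alphanumeric characters become a single hyphen.
    $slug = preg_replace('/[^a-z0-9]+/i', '-', $title);

    // Assumption: hyphens at either end are unwanted and slugs are lowercase.
    return strtolower(trim($slug, '-'));
}

echo slugify('Hello,   World!! PHP & Regex'), PHP_EOL; // hello-world-php-regex
echo slugify('  --Already--dashed--  '), PHP_EOL;      // already-dashed
```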
No. What you have is the appropriate way to deal with this problem.
Consider it from this angle: regular expressions are meant to find a pattern (a single pattern) and deal with it somehow. As such, by trying to deal with more than one pattern at a time, you're only giving yourself headaches. It's best as is, for everyone involved.
If $slug doesn't already contain multiple hyphens, you can avoid the second preg_replace call by writing the first one like this:
$slug = preg_replace('/[^a-z0-9]+-?/i', '-', $slug);
The above code finds one or more non-alphanumeric characters, optionally followed by a hyphen, and replaces the matched text with a single hyphen, hence there is no need for a second preg_replace call.

RegEx string "preg_replace"

I need to do a "find and replace" on about 45k lines of a CSV file and then put this into a database.
I figured I should be able to do this with PHP and preg_replace but can't seem to figure out the expression...
The lines consist of one field and are all in the following format:
"./1/024/9780310320241/SPSTANDARD.9780310320241.jpg" or "./t/fla/8204909_flat/SPSTANDARD.8204909_flat.jpg"
The first part will always be a period, the second part will always be one alphanumeric character, the third will always be three alphanumeric characters and the fourth should always be between 1 and 13 alphanumeric characters.
I came up with the following which seems to be right however I will openly profess to not knowing very much at all about regular expressions, it's a little new to me! I'm probably making a whole load of silly mistakes here...
$pattern = "/^(\.\/[0-9a-zA-Z]{1}\/[0-9a-zA-Z]{3}\/[0-9a-zA-Z]{1,13}\/)$/";
$new = preg_replace($pattern, " ", $i);
Anyway any and all help appreciated!
Thanks,
Phil
The only mistake I can see is the end-of-string anchor $, which should be removed. Your expression is also missing the _ character:
/^(\.\/[0-9a-zA-Z]{1}\/[0-9a-zA-Z]{3}\/[0-9a-zA-Z_]{1,13}\/)/
A more general pattern would be to just exclude the /:
/^(\.\/[^\/]{1}\/[^\/]{3}\/[^\/]{1,13}\/)/
You should use PHP's built-in CSV parsing (for example fgetcsv() or str_getcsv()) to extract the values from the CSV before matching any patterns.
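A minimal sketch of reading the fields with fgetcsv() first. The in-memory stream stands in for the real file, which is an assumption for the sake of a self-contained example; in practice you would fopen() the CSV itself.

```php
<?php
// Hypothetical in-memory "file" standing in for the real CSV.
$handle = fopen('php://memory', 'r+');
fwrite($handle, "\"./1/024/9780310320241/SPSTANDARD.9780310320241.jpg\"\n");
fwrite($handle, "\"./t/fla/8204909_flat/SPSTANDARD.8204909_flat.jpg\"\n");
rewind($handle);

$paths = [];
// Length 0 means "no line length limit"; separator, enclosure and escape
// characters are spelled out explicitly.
while (($row = fgetcsv($handle, 0, ',', '"', '\\')) !== false) {
    if ($row === [null]) {
        continue; // fgetcsv() reports a blank line as a single null field
    }
    $paths[] = $row[0]; // each line holds exactly one field
}
fclose($handle);

print_r($paths);
```

Each entry in $paths is then the unquoted field, ready for whichever pattern you settle on.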
I'm not sure I understand what you're asking. Do you mean every line in the file looks like that, and you want to process all of them? If so, this regex would do the trick:
'#^.*/#'
That simply matches everything up to and including the last slash, which is what your regex would do if it weren't for that rogue '$' everyone's talking about. If there are other lines in other formats that you want to leave alone, this regex will probably suit your needs:
'#^\./\w/\w{3}/\w{1,13}/#'
Notice how I changed the regex delimiter from '/' to '#' so I don't have to escape the slashes inside. You can use almost any punctuation character for the delimiters (but of course they both have to be the same).
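Here is a short sketch comparing the two '#'-delimited patterns on the sample lines from the question, plus one made-up line in a different format. It replaces the prefix with an empty string rather than the space used in the question, which is my assumption about the intended result.

```php
<?php
$lines = [
    './1/024/9780310320241/SPSTANDARD.9780310320241.jpg',
    './t/fla/8204909_flat/SPSTANDARD.8204909_flat.jpg',
    'some/other/format.jpg', // hypothetical line in another format
];

$greedy = [];
$strict = [];
foreach ($lines as $line) {
    // Everything up to and including the last slash.
    $greedy[] = preg_replace('#^.*/#', '', $line);
    // Only the "./x/xxx/up-to-13-word-chars/" prefix; other lines are left alone.
    $strict[] = preg_replace('#^\./\w/\w{3}/\w{1,13}/#', '', $line);
}

print_r($greedy);
print_r($strict);
```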
The $ means the end of the string. So your pattern would match ./1/024/9780310320241/ and ./t/fla/8204909_flat/ if they were alone on their line. Remove the $ and it will match the first four parts of your string, replacing them with a space.
$pattern = "/(\.\/[0-9a-z]{1}\/[0-9a-z]{3}\/[0-9a-z\_]+\.(jpg|bmp|jpeg|png))\n/is";
I just saw that your example string doesn't end with /, so maybe you should remove it from the end of your pattern. Also, the underscore is used in the filename and should be in the character class.
