I have a file which has several spaces between the words at some points. I need to clean the file and replace each multi-space sequence with a single space. I have written the following statement, which does not work at all, and it seems I'm making a big mistake.
$s = preg_replace("/( *)/", " ", $x);
My file is very simple. Here is a part of it:
Hjhajhashsh dwddd dddd sss ddd wdd ddcdsefe xsddd scdc yyy5ty ewewdwdewde wwwe ddr3r dce eggrg vgrg fbjb nnn bh jfvddffv mnmb weer ffer3ef f4r4 34t4 rt4t4t 4t4t4t4t ffrr rrr ww w w ee3e iioi hj hmm mmmmm mmjm lk ;’’ kjmm ,,,, jjj hhh lmmmlm m mmmm lklmm jlmm m
Your regex replaces any number of spaces (including zero) with a space. You should only replace two or more (after all, replacing a single space with itself is pointless):
$s = preg_replace("/ {2,}/", " ", $x);
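As a quick check of the {2,} quantifier, here is a minimal sketch. The function name collapseSpaces and the sample string (with runs of two and three spaces) are assumptions made for illustration, not part of the original question.

```php
<?php
// Hypothetical helper: collapse every run of two or more spaces into one.
function collapseSpaces(string $x): string
{
    // " {2,}" matches a space repeated at least twice; single spaces are left alone.
    return preg_replace('/ {2,}/', ' ', $x);
}

// The sample input contains a run of two spaces and a run of three spaces.
echo collapseSpaces("wwwe  ddr3r   dce eggrg"), PHP_EOL;
```

Single spaces never match the pattern, so text that is already clean passes through unchanged.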
What I usually do to clean up multiple spaces is:
while (strpos($x, '  ') !== false) { // needle: two spaces
    $x = str_replace('  ', ' ', $x); // two spaces -> one space
}
Conditions/hypotheses:
1. strings with multiple spaces are rare
2. two spaces are by far more common than three or more
3. preg_replace is expensive in terms of CPU
4. copying characters to a new string should be avoided when possible
Of course, if condition #1 is not met, this approach does not make sense, but it usually is.
If #1 is met, but any of the others is not (this may depend on the data, the software (PHP version) or even the hardware), then the following may be faster:
if (strpos($x, '  ') !== false) { // needle: two spaces
    $x = preg_replace('/  +/', ' ', $x); // i.e.: '/␣␣+/'
}
Anyway, if multiple spaces appear only in, say, 2% of your strings, the important thing is the preventive check with strpos, and you probably don't care much about optimizing the remaining 2% of cases.
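To make the two variants easier to compare, here is a sketch that wraps each one in a function. The names collapseLoop and collapseRegex are hypothetical, and no timing claims are made here; which one is faster depends on your data and PHP version, as noted above.

```php
<?php
// Variant 1 (hypothetical name): repeated str_replace, cheap when double spaces are rare.
function collapseLoop(string $x): string
{
    while (strpos($x, '  ') !== false) {      // needle: two spaces
        $x = str_replace('  ', ' ', $x);      // each pass halves the runs
    }
    return $x;
}

// Variant 2 (hypothetical name): one preg_replace, guarded by a cheap strpos check.
function collapseRegex(string $x): string
{
    if (strpos($x, '  ') !== false) {         // skip the regex engine for clean strings
        $x = preg_replace('/  +/', ' ', $x);  // two or more spaces -> one
    }
    return $x;
}

$sample = "a     b  c"; // five spaces, then two
echo collapseLoop($sample), PHP_EOL;
echo collapseRegex($sample), PHP_EOL;
```

Both functions return the input untouched when it contains no double space, which is the common case the strpos check is meant to make cheap.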
// Your input
$str = "Hjhajhashsh dwddd dddd sss ddd wdd ddcdsefe xsddd scdc yyy5ty ewewdwdewde wwwe ddr3r dce eggrg vgrg fbjb nnn bh jfvddffv mnmb weer ffer3ef f4r4 34t4 rt4t4t 4t4t4t4t ffrr rrr ww w w ee3e iioi hj hmm mmmmm mmjm lk ;’’ kjmm ,,,, jjj hhh lmmmlm m mmmm lklmm jlmm m";
echo $str.'<br>';
$output = preg_replace('!\s+!', ' ', $str); // Replace any run of whitespace (spaces, tabs, newlines) with a single space.
echo $output;
I have a field which contains 20 characters (the string is padded on the right with spaces), like below:
VINEYARD HAVEN MA
BOLIVAR TN
,
BOLIVAR, TN
NORTH TONAWANDA, NY
How can I use a regular expression to parse this and get the data? The result I want looks like this:
[1] VINEYARD HAVEN [2] MA
[1] BOLIVAR [2] TN
[1] , or empty [2] , or empty
[1] BOLIVAR, or BOLIVAR [2] TN or ,TN
[1] NORTH TONAWANDA, or NORTH TONAWANDA [2] NY or ,NY
Currently I use this regex:
^(\D*)(?=[ ]\w{2}[ ]*)([ ]\w{2}[ ]*)
But it could not match the line:
,
Please help me adjust my regex so that it matches all the data above.
What about this regex: ^(.*)[ ,](\w*)$ ? You can see it working here: http://regexr.com/3cno7.
Example usage:
<?php
$string = 'VINEYARD HAVEN MA
BOLIVAR TN
,
BOLIVAR, TN
NORTH TONAWANDA, NY';
$lines = array_map('trim', explode("\n", $string));
$pattern = '/^(.*)[ ,](\w*)$/';
foreach ($lines as $line) {
    // Only print when the pattern matched, so $matched is always populated.
    if (preg_match($pattern, $line, $matched)) {
        print 'first: "' . $matched[1] . '", second: "' . $matched[2] . '"' . PHP_EOL;
    }
}
It's probably possible to implement this in a single regular expression (try /(.*)\b([A-Z][A-Z])$/), but if you don't know how to write the regular expression you'll never be able to debug it. Yes, it's worth working out as a learning exercise, but since we're talking about PHP here (which does have a mechanism for storing compiled REs and isn't often used for bulk data operations), I would use something like the following if I needed to solve the problem quickly and in maintainable code:
$str=trim($str);
if (preg_match("/\b[A-Z][A-Z]$/i", $str, $match)) {
$state=$match[0];
$town=trim(substr($str,0,-2), " ,\t\n\r\0\x0B"); // strip the state, then trailing commas/whitespace
}
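For reference, the same idea can be wrapped in a function and run over the sample lines from the question. This is only a sketch: the function name splitTownState and the choice to return empty strings for the lone "," line are assumptions.

```php
<?php
// Hypothetical helper: returns [town, state] for one fixed-width field.
function splitTownState(string $line): array
{
    $line  = trim($line);   // drop the right-hand padding
    $state = '';
    $town  = $line;
    // A two-letter word at the very end of the line is taken to be the state.
    if (preg_match('/\b[A-Z]{2}$/i', $line, $match)) {
        $state = $match[0];
        $town  = substr($line, 0, -2);
    }
    // Strip leftover commas and whitespace around the town name.
    return [trim($town, " ,\t\n\r\0\x0B"), $state];
}

$lines = ['VINEYARD HAVEN MA', 'BOLIVAR TN', ',', 'BOLIVAR, TN', 'NORTH TONAWANDA, NY'];
foreach ($lines as $line) {
    list($town, $state) = splitTownState($line);
    echo '[1] ' . $town . ' [2] ' . $state . PHP_EOL;
}
```

The \b anchor keeps a single-word line such as BOLIVAR from being split into "BOLIV" and "AR", because there is no word boundary in the middle of a word.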
In my script I replace every comma (",") with a quotation mark plus a space.
But when it comes to numbers like 3,456,778, it also converts their commas to quote+space. Is there any way to make the command ignore big numbers like that, so it doesn't convert them to:
3" 456" 778"
Alternatively: if there is a quotation mark + space + any digit, convert the quotation mark + space back to a comma. I know how to do this with str_replace, but I don't know how to match any digit 0-9.
Any help? I want to convert it back to:
3,456,778
I think I need to elaborate. I need to convert this text:
Value=3,456,778,id=777
To:
Value=3,456,778" id=777"
But the problem is that it also converts the commas between the digits.
So it would be good even if I could change my str_replace command to something like
"If the comma is not between two numbers, then convert it to quotation mark + space". Is that possible?
What about this?
preg_replace("/,([^0-9]|$)/", "\"$1", $text);
This matches every comma that is not followed by a digit (including a comma at the very end of the string).
For instance, this:
$text = "123,23 adas , asdsa d, asdasd sa 1234,234324,asdas 324324 234,";
echo $text; echo "<br/>";
echo preg_replace("/,([^0-9]|$)/", "\"$1", $text);
Will echo this:
123,23 adas , asdsa d, asdasd sa 1234,234324,asdas 324324 234,
123,23 adas " asdsa d" asdasd sa 1234,234324"asdas 324324 234"
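The question literally asks to leave a comma alone only when it sits between two digits. One possible way to express that, sketched here with look-around assertions, is shown below. The function name quoteSeparatorCommas is hypothetical, and the replacement is quote+space as described in the question; the trailing quote in the asker's expected output is not produced by this sketch.

```php
<?php
// Hypothetical helper: turn commas into '" ' unless they sit between two digits.
function quoteSeparatorCommas(string $text): string
{
    // First branch: comma NOT preceded by a digit.
    // Second branch: comma NOT followed by a digit.
    // A comma with digits on both sides matches neither branch and is kept.
    return preg_replace('/(?<![0-9]),|,(?![0-9])/', '" ', $text);
}

echo quoteSeparatorCommas('Value=3,456,778,id=777'), PHP_EOL;
```

Unlike the capture-group version, the look-arounds consume only the comma itself, so nothing has to be put back in the replacement string.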
It is not really clear from your description what you actually want to do.
This might be a step in the right direction, however:
preg_replace('/([0-9]+)" /', '\\1,', '3" 456" 778"');
Maybe not the best solution, but you can give it a try.
$copy_date = '3" 456" 778"';
$copy_date = preg_replace("(\"\s{1})", ",", $copy_date);
$copy_date1 = preg_replace("(\")", "", $copy_date);
print $copy_date1;
Output: 3,456,778
I have lots of data that I need to search through for certain patterns.
The problem is that when looking for these patterns I have no reference for what I'm looking for.
In other words, I have two paragraphs, each on a similar topic. I need to be able to compare both paragraphs and find patterns: phrases that occur in both paragraphs, and how many times each occurs.
I can't seem to find a solution, because with preg_match and similar functions you're required to supply the thing you're looking for.
Example paragraphs
Paragraph 1:
Bee Pollen is made by honeybees, and is the food of the young bee. It
is considered one of nature's most completely nourishing foods as it
contains nearly all nutrients required by humans. Bee-gathered pollens
are rich in proteins (approximately 40% protein), free amino acids,
vitamins, including B-complex, and folic acid.
Paragraph 2:
Bee Pollen is made by honeybees. It is required for the fertilization
of the plant. The tiny particles consist of 50/1,000-millimeter
corpuscles, formed at the free end of the stamen in the heart of the
blossom, nature's most completely nourishing foods. Every variety of
flower in the universe puts forth a dusting of pollen. Many orchard
fruits and agricultural food crops do, too.
So from those examples these patterns:
Bee Pollen is made by honeybees
and:
nature's most completely nourishing foods
Both phrases are found in both paragraphs.
This is potentially a complex question depending on whether you're looking for similar phrases or phrases that match word for word.
Finding exact word-for-word matches is quite simple: all you need to do is split on common breaks like punctuation marks (e.g. .,;:) and perhaps on conjunctions as well (e.g. and, or). The problem comes with, for example, adjectives: two phrases might be exactly the same but differ in one word, like so:
The world is spinning around its axis at a tremendous speed.
The world is spinning around its axis at a magnificent speed.
This won't match because tremendous and magnificent are used in place of one another. Potentially you could work around this, however, that would be a more complex question.
Answer
If we stick to the simple side of things we can achieve phrase matching with just a few lines of code (4 in this example; not including the formatting for comments/readability).
$wordSplits = 'and or on of as'; //List of words to split on
preg_match_all('/(?<m1>.*?)([.,;:\-]| '.str_replace(' ', ' | ', trim($wordSplits)).' )/i', $para1, $matches1);
preg_match_all('/(?<m2>.*?)([.,;:\-]| '.str_replace(' ', ' | ', trim($wordSplits)).' )/i', $para2, $matches2);
$commonPhrases = array_filter( //Removes blank $key=>$value pairs
array_intersect( //Finds matching patterns
array_map(function($item){
return(strtolower(trim($item))); //Cleans array for $para1 values - removes leading and following spaces
}, $matches1['m1']),
array_map(function($item){
return(strtolower(trim($item))); //Cleans array for $para2 values - removes leading and following spaces
}, $matches2['m2'])
)
);
var_dump($commonPhrases);
/**
OUTPUT:
array(2) {
[0]=>
string(31) "bee pollen is made by honeybees"
[5]=>
string(41) "nature's most completely nourishing foods"
}
*/
The above code splits both paragraphs on punctuation (the characters listed in [...] in the preg_match_all pattern) and on the words in the word list (a word only counts as a break when it has a space before and after it), then keeps the phrases that appear in both.
Wordlist
You can change the word list to include any breaks you like, editing the list until you get the phrases you are after, examples:
$wordSplits = 'and or';
$wordSplits = 'and but if or';
$wordSplits = 'a an as and by but because if in is it of off on or';
Punctuation
You can add any punctuation marks you like into the list (between [ and ]), however remember that some characters do have special meanings and may need to be escaped (or placed appropriately): - and ^ should become \- and \^ or be placed where their special meaning doesn't come into play.
You may consider changing:
([.,;:\-]|
To:
([.,;:\-] | //Adding a space before the pipe
So that you only split punctuation marks which are followed by a space. For example: this would mean that items like 50,000 won't be split.
Spaces and breaks
You may also consider changing the spaces to \s so that tabs and newlines etc are included and not just spaces. Like so:
'/(?<m1>.*?)([.,;:\-]|\s'.str_replace(' ', '\s|\s', trim($wordSplits)).'\s)/i'
This would also apply to:
([.,;:\-]\s|
If you decide to go down that route.
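Putting those two suggestions together (punctuation only when followed by whitespace, and \s around the split words), one possible variant is sketched below. Note that it swaps preg_match_all for preg_split, which is a substitution made here for brevity rather than the code from the answer; the function name commonPhrases and the shortened sample paragraphs are also assumptions.

```php
<?php
// Hypothetical helper: phrases (lowercased) that occur in both texts.
function commonPhrases(string $a, string $b, string $wordSplits = 'and or on of as'): array
{
    // Escape each split word and join them into an alternation.
    $words = array_map(function ($w) {
        return preg_quote($w, '/');
    }, explode(' ', trim($wordSplits)));

    // Break on punctuation followed by whitespace, or on a split word
    // surrounded by whitespace.
    $pattern = '/[.,;:\-]\s|\s(?:' . implode('|', $words) . ')\s/i';

    $clean = function (string $text) use ($pattern): array {
        $parts = array_map(function ($p) {
            return strtolower(trim($p, " .,;:\t\n\r"));
        }, preg_split($pattern, $text));
        return array_filter($parts, 'strlen'); // drop empty pieces
    };

    return array_values(array_intersect($clean($a), $clean($b)));
}

$p1 = "Bee Pollen is made by honeybees, and is the food of the young bee. It is considered one of nature's most completely nourishing foods as it contains nearly all nutrients required by humans.";
$p2 = "Bee Pollen is made by honeybees. It is required for the fertilization of the plant, nature's most completely nourishing foods.";
print_r(commonPhrases($p1, $p2));
```

Because punctuation must be followed by whitespace to count as a break, a token such as 50/1,000-millimeter stays in one piece.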
I've been working on this code, don't know if it suits your needs... Feel free to expand it!
$p1 = "Bee Pollen is made by honeybees, and is the food of the young bee. It is considered one of nature's most completely nourishing foods as it contains nearly all nutrients required by humans. Bee-gathered pollens are rich in proteins (approximately 40% protein), free amino acids, vitamins, including B-complex, and folic acid.";
$p2 = "Bee Pollen is made by honeybees. It is required for the fertilization of the plant. The tiny particles consist of 50/1,000-millimeter corpuscles, formed at the free end of the stamen in the heart of the blossom, nature's most completely nourishing foods. Every variety of flower in the universe puts forth a dusting of pollen. Many orchard fruits and agricultural food crops do, too.";
// Strip strings of periods etc.
$p1 = strtolower(str_replace(array('.', ',', '(', ')'), '', $p1));
$p2 = strtolower(str_replace(array('.', ',', '(', ')'), '', $p2));
// Extract words from first paragraph
$w1 = explode(" ", $p1);
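// Build search string
$search = '';
$found = array();
// Initialise the state used inside the loop so the first miss does not
// read undefined variables.
$add = FALSE;
$old_search = '';
$num_occured = 0;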
foreach ($w1 as $word) {
//echo 'Word: ' . $word . "<br />";
$search .= ' ' . $word;
$search = trim($search);
//echo '. . Search string: '. $search . "<br /><br />";
if (substr_count($p2, $search)) {
$old_search = $search;
$num_occured = substr_count($p2, $search);
//echo " . . . found!" . "<br /><br /><br />";
$add = TRUE;
} else {
//echo " . . . not found! Generating new search string: " . $word . '<br />';
if ($add) {
$found[] = array('pattern' => $old_search, 'occurences' => $num_occured);
$add = FALSE;
}
$old_search = '';
$search = $word;
}
}
print_r($found);
The above code finds occurrences of patterns from the first string in the second one.
I'm sure it can be written better, but since it's past midnight (local time), I'm not as "fresh" as I'd like to be...
This is the string
(code)
Pivot: 96.75<br />Our preference: Long positions above 96.75 with targets # 97.8 & 98.25 in extension.<br />Alternative scenario: Below 96.75 look for further downside with 96.35 & 95.9 as targets.<br />Comment the pair has broken above its resistance and should post further advance.<br />
(text)
"Pivot: 96.75Our preference: Long positions above 96.75 with targets # 97.8 & 98.25 in extension.Alternative scenario: Below 96.75 look for further downside with 96.35 & 95.9 as targets.Comment the pair has broken above its resistance and should post further advance."
the result should be
(code)
<b>Pivot</b>: 96.75<br /><b>Our preference</b>: Long positions above 96.75 with targets # 97.8 & 98.25 in extension.<br /><b>Alternative scenario</b>: Below 96.75 look for further downside with 96.35 & 95.9 as targets.<br />Comment the pair has broken above its resistance and should post further advance.<br />
(text)
Pivot: 96.75Our preference: Long positions above 96.75 with targets # 97.8 & 98.25 in extension.Alternative scenario: Below 96.75 look for further downside with 96.35 & 95.9 as targets.Comment the pair has broken above its resistance and should post further advance.
The purpose:
Wrap all the words before the : sign in <b> tags.
I've tried this regex: ((\A )|(<br />))(?P<G>[^:]*):, but it only works in a Python environment. I need this in PHP:
$pattern = '/((\A)|(<br\s\/>))(?P<G>[^:]*):/';
$description = preg_replace($pattern, '<b>$1</b>', $description);
Thanks.
This preg_replace should do the trick:
preg_replace('#(^|<br ?/>)([^:]+):#m','$1<b>$2</b>:',$input)
I should start by saying that HTML operations are better done with a proper parser such as DOMDocument. This particular problem is straightforward, so regular expressions may work without too much hocus pocus, but be warned :)
You can use look-around assertions; this frees you from having to restore the neighbouring strings during the replacement:
echo preg_replace('/(?<=^|<br \/>)[^:]+(?=:)/m', '<b>$0</b>', $str);
First, the look-behind assertion matches either the start of each line or a preceding <br />. Then, any characters except the colon are matched; the look-ahead assertion makes sure it's followed by a colon.
The /m modifier is used to make ^ match the start of each line as opposed to \A which always matches the start of the subject string.
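To see the look-around version in action, here is a small sketch that applies it to a shortened form of the string from the question. The function name boldLabels and the abbreviated sample text are assumptions made for illustration.

```php
<?php
// Hypothetical helper: wrap the label before each ':' in <b> tags.
function boldLabels(string $html): string
{
    // Look-behind: start of a line or a preceding "<br />".
    // Look-ahead: the matched text must be followed by a colon.
    return preg_replace('/(?<=^|<br \/>)[^:]+(?=:)/m', '<b>$0</b>', $html);
}

$str = 'Pivot: 96.75<br />Our preference: Long positions above 96.75.<br />'
     . 'Comment the pair has broken above its resistance.<br />';
echo boldLabels($str), PHP_EOL;
```

The final segment has no colon, so the look-ahead fails there and it is left unwrapped. Be aware that [^:]+ is free to run across a <br /> tag, so a colon-less segment followed by a labelled one would be wrapped together with the following label.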
The most "general" and least regex-expensive way to do this that I could come up with was this:
$parts = explode('<br', $str);//don't include space and `/`, as tags may vary
$formatted = '';
foreach($parts as $part)
{
$formatted .= preg_replace('/^\s*[\/>]{0,2}\s*([^:]+:)/', '<b>$1</b>',$part).'<br/>';
}
echo $formatted;
Or:
$formatted = array();
foreach($parts as $part)
{
$formatted[] = preg_replace('/^\s*[\/>]{0,2}\s*([^:]+:)/', '<b>$1</b>',$part);
}
echo implode('<br/>', $formatted);
Tested, and got this as output:
Pivot: 96.75Our preference: Long positions above 96.75 with targets # 97.8 & 98.25 in extension.Alternative scenario: Below 96.75 look for further downside with 96.35 & 95.9 as targets.Comment the pair has broken above its resistance and should post further advance.
That being said, I do find this bit of data weird, and, if I were you, I'd consider str_replace or preg_replace-ing all breaks with PHP_EOL:
$str = preg_replace('/\<\s*br\s*\/?\s*\>/i', PHP_EOL, $str);//allow for any form of break tag
And then, your string looks exactly like the data I had to parse, and got the regex for that here:
$str = preg_replace(...);
$formatted = preg_replace('/^([^:\n\\]++)\s{0,}:((\n(?![^\n:\\]++\s{0,}:)|.)*+)/','<b>$1:</b>$2<br/>', $str);
Provinces is a group_concat of all the individual records that contain province, some of which are blank.
So, when I encode:
$provinces = ($row['provinces']);
echo "<td>".wordwrap($provinces, 35, "<br />")."</td>";
This is what the result looks like:
Minas Gerais,,,Rio Grande do
Sul,Santa Catarina,Paraná,São Paulo
However, when I try to preg_replace out some of the nulls, and add some spaces with this expression:
$provinces = preg_replace($patterns, $replaces, ($row['provinces']));
echo "<td>".wordwrap($provinces, 35, "<br />")."</td>";
This is what I get!!! :(
Minas Gerais, Rio Grande do
Sul, Santa
Catarina, Paraná, São Paulo
The output is very unnatural looking.
BTW: Here are the search and replace arrays:
$patterns[0] = '/,,([,]+)?/'; $replaces[0] = ', ';
$patterns[1] = '/^,/'; $replaces[1] = '';
$patterns[2] = '/,$/'; $replaces[2] = '';
$patterns[3] = '/\b,\b/'; $replaces[3] = ', ';
$patterns[4] = '/\s,/'; $replaces[4] = ', ';
UPDATE: I even tried to change Paraná to Parana
Minas Gerais, Rio Grande do
Sul, Santa
Catarina, Parana, São
Paulo
Don't use &nbsp; as the replacement. wordwrap() counts it as 6 characters because it doesn't interpret the HTML entity. That's why your lines are breaking funny. If you want non-breaking spaces, replace the spaces after you call wordwrap().
Also, your first pattern should be:
// match one or more commas together
$patterns[0] = '/,+/';
Is the wordwrap() really necessary? It sounds like you are rendering this content into a table cell of some fixed width and you don't want individual entries to split across lines.
If this inference is correct - and if none of your entries is actually so long that forcing it onto a single line will break your layout - then how about this: explode() on commas into an array, remove the whitespace-only entries, replace normal spaces in each array entry with &nbsp;, and implode() back on ', ' (a comma followed by a space). Then let the rendering browser break lines wherever it needs.
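A minimal sketch of those steps follows. The function name formatProvinces is hypothetical, and the input is the group_concat value shown in the question.

```php
<?php
// Hypothetical helper: clean a comma-separated province list for HTML output.
function formatProvinces(string $csv): string
{
    // Split on commas and trim each entry.
    $entries = array_map('trim', explode(',', $csv));
    // Drop the blank entries left behind by consecutive commas.
    $entries = array_filter($entries, 'strlen');
    // Keep each province on one line by making its inner spaces non-breaking.
    $entries = array_map(function ($entry) {
        return str_replace(' ', '&nbsp;', $entry);
    }, $entries);
    // Ordinary spaces after the commas are the only places the browser may wrap.
    return implode(', ', $entries);
}

echo '<td>' . formatProvinces('Minas Gerais,,,Rio Grande do Sul,Santa Catarina,Paraná,São Paulo') . '</td>';
```

With this approach neither wordwrap() nor the pattern/replacement arrays are needed; the table cell width decides where the lines break.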