Can anyone help me with a quick regex problem?
I have the following HTML:
555 Some Street Name<BR />
New Providence VA 22901-1311<BR />
United States<BR />
The first row is always the Street
Second row is City (which can have spaces) space State Abbv. space Zip hyphen 4 digit zip
Third row is the Country.
I need to break the HTML into separate variables, one per field. Can anyone provide a quick regex?
Edit: Maybe I wasn't clear. I need the following:
Street address, City, State, Zip, 4Digit Zip, Country as individual variables.
This doesn't even require regular expressions. You can split the different lines using explode("<BR />", ...). The first line is the street and the last line is the country. The middle line can be split using substr(), because the segment lengths, counted from the end of the line, are constant: the last 4 characters are the 4-digit ZIP extension, the 6 characters before them are the ZIP followed by a hyphen, and the 3 characters before those are the state followed by a space.
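As a rough sketch of the fixed-offset idea described above (the variable names and the sample $address string are assumptions for illustration, and it only holds if the middle line always ends in a 2-letter state and a ZIP+4):

```php
<?php
// Sample input copied from the question; $address is an assumed variable name.
$address = "555 Some Street Name<BR />\nNew Providence VA 22901-1311<BR />\nUnited States<BR />";

// Split on the tag, trim whitespace/newlines, and drop the empty trailing piece.
$lines = array_values(array_filter(array_map('trim', explode('<BR />', $address)), 'strlen'));
list($street, $middle, $country) = $lines;

// Offsets are counted from the end of the middle line.
$zip4  = substr($middle, -4);             // last 4 characters
$zip   = substr($middle, -10, 5);         // 5-digit ZIP before the hyphen
$state = substr($middle, -13, 2);         // 2-letter state before the space
$city  = rtrim(substr($middle, 0, -13));  // whatever is left is the city
```

Because the city is simply "everything before the last 13 characters", city names containing spaces need no special handling.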
555 Some Street Name<BR />
New Providence VA 22901-1311<BR />
United States<BR />
OK, first let's split the lines:
$array = explode('<BR />', $address);
Now the second line needs to be parsed as well:
// $array[1] now holds "New Providence VA 22901-1311"
$tmp = explode(' ', trim($array[1]));
All that's left is to assign everything to the right variable names:
$fullZip = array_pop($tmp);
$zipArray = explode('-', $fullZip);
$zip = $zipArray[0];
$Digitzip = $zipArray[1];
$state = array_pop($tmp);
$providence = implode(' ', $tmp); // the city; rejoin its words with spaces
$country = trim($array[2]);
$street = trim($array[0]);
$array = explode('<BR />', $address);
This is the easiest way: just split the string on the <br /> tags. If you can avoid regular expressions, you should, because they are not as performant as simple string operations like explode().
No need for a regex.
$htmlStr = '555 Some Street Name<BR />New Providence VA 22901-1311<BR />United States<BR />';
Note, however, that for more complicated HTML parsing, regexes are not the tool for the job.
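If a regex is still wanted for the middle line only, a minimal sketch might look like the following. The pattern and the group names are my own assumptions, not something from the answers above, and it assumes a 2-letter uppercase state and a ZIP+4:

```php
<?php
// Middle line from the question, after splitting on <BR /> and trimming.
$middle = 'New Providence VA 22901-1311';

// Named groups (hypothetical names): the greedy city part backtracks until
// the state, ZIP and 4-digit extension can match at the end of the line.
$pattern = '/^(?<city>.+)\s(?<state>[A-Z]{2})\s(?<zip>\d{5})-(?<zip4>\d{4})$/';

if (preg_match($pattern, $middle, $m)) {
    $city  = $m['city'];
    $state = $m['state'];
    $zip   = $m['zip'];
    $zip4  = $m['zip4'];
}
```

The street and country lines still come from a plain explode(); the regex is only used on plain text, not on the HTML itself.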
I have a single location field where people can enter whatever they want, but generally they will enter something in the format of "Town, Initials". So for example, these entries...
New york, Ny
columbia, sc
charleston
washington, DC
BISMARCK, ND
would ideally become...
New York, NY
Columbia, SC
Charleston
Washington, DC
Bismarck, ND
Obviously I can use ucfirst() on the string to handle the first character, but these are things I'm not sure how to do (if they can be done at all)...
Capitalizing everything after a comma
Lowercasing everything before the comma (aside from the first character)
Is this easily doable or do I need to use some sort of regex function?
You could simply chop it up and fix it.
<?php
$geo = 'New york, Ny
columbia, sc
charleston
washington, DC
BISMARCK, ND';
$geo = explode(PHP_EOL, $geo);
foreach ($geo as $str) {
// chop
$str = explode(',', $str);
// fix
echo
(!empty($str[0]) ? ucwords(strtolower(trim($str[0]))) : null).
(!empty($str[1]) ? ', '.strtoupper(trim($str[1])) : null).PHP_EOL;
}
https://3v4l.org/ojl2M
Though you should not trust the user to enter the correct format. Instead find a huge list of all states and auto complete them in. Perhaps something like https://gist.github.com/maxrice/2776900 - then validate against it.
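To illustrate the validation idea, here is a small sketch. The function name normalizeLocation() and the tiny $validStates list are hypothetical; a real list would contain every abbreviation you want to accept:

```php
<?php
// Hypothetical, deliberately tiny subset of valid abbreviations.
$validStates = ['NY', 'SC', 'DC', 'ND'];

// Returns "Town, ST", just "Town" when no state was given,
// or null when the town is empty or the state is not in the list.
function normalizeLocation(string $input, array $validStates): ?string
{
    $parts = array_map('trim', explode(',', $input, 2));
    $town  = ucwords(strtolower($parts[0]));
    if ($town === '') {
        return null;
    }
    if (!isset($parts[1]) || $parts[1] === '') {
        return $town;
    }
    $state = strtoupper($parts[1]);
    return in_array($state, $validStates, true) ? "$town, $state" : null;
}
```

A null result can then be used to ask the user to correct the entry instead of storing it as typed.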
Let's say this is my address: 537 Great North Road Grey Lynn Auckland City Auckland.
I want to put a comma (,) after Great North Road, Grey Lynn and Auckland City.
Then the address will be: 537 Great North Road, Grey Lynn, Auckland City, Auckland
How can I do this in PHP when the length is not fixed?
This is not a perfect solution but you can get an idea how you deal with it.
By using PHP :
$t = "537 Great North Road Grey Lynn Auckland City Auckland";
$t = str_replace(
["Road", "Lynn", "City"], // needle
["Road,", "Lynn,", "City,"], // replace
$t
);
echo $t;
I would suggest you look at Regular Expressions (RegEx) to achieve this.
In that way you could loop through each address and use the regex pattern to replace where a comma is required.
However, I believe due to the format of the data it might be very hard to actually achieve this. The only thing you have to detect where a comma needs to go is a space, and that isn't reliable as you can have spaces between road names etc where you don't want commas to be placed!
If you can I would suggest splitting the data up, so rather than having the address in one string you have it split in separate columns / variables, for "house number", "street", "town" etc.. That way you could then use a simple string concatenation to place the commas where they should go.
E.g.:
$houseNumber . " " . $street . ", " . $town . ",";
I hope that helps!
Try this. You can put a comma before and after each variable.
<?php
$GreyLynn = "Grey Lynn";
$AucklandCity = "Auckland City";
echo ' , '.$GreyLynn.' , '.$AucklandCity;
?>
$separate = "537 Great North Road Grey Lynn Auckland City Auckland";
$replace = str_replace(" Grey Lynn", ", Grey Lynn,", $separate);
$location = str_replace("Auckland City", "Auckland City,", $replace);
echo $location;
Result:
537 Great North Road, Grey Lynn, Auckland City, Auckland
I have a string like this:
$str = '<div class="content"><br />
<strong>0730</strong> – Check in direct to Compass at Marlin Wharf Berth 18</p> <p><strong>0800 </strong> – Depart Marlin Marina Cairns to the Great Barrier Reef</p>
<p><strong>1015</strong> – Arrive at your first Great Barrier Reef Location</p> <p><strong>1230</strong> – BBQ Lunch with fresh salads</p>
<p><strong>1300</strong> Cruise to 2nd Reef Location 1530 – Depart the Great Barrier Reef</p>
<p><strong>1730</strong> – Approximately Arrival time at Cairns Marina<br /> </div>';
I want to use preg_replace function to remove first </p> and last <p> tag because they are redundant, I have used this pattern but it didn't work.
$patterns = array(
'#^\s*</p>#',
'#<p>\s*$#',
);
$str = preg_replace( $patterns, '', $str );
You could do this with one expression:
$str = preg_replace("#(.*?)</p>(.*)<p>(.*)#s", "$1$2$3", $str, 1 );
This will do a non-greedy capture of text before the first </p>, then capture text greedily until <p> (which will be the last one because of the greediness). And finally the remaining text is also captured. The three captured groups are maintained, the 2 tags are not.
The s modifier is needed to allow the dot to also match new line characters.
Note that this does not check whether the removal is actually needed. It just does it, so if the HTML was already OK, you will get an undesirable result.
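One way to add such a check, sketched here with plain string functions rather than the regex above: only remove a </p> that appears before any <p>, and only remove a <p> that appears after the last </p>. The function name stripUnbalancedP() is made up for this example, and it assumes the tags are written exactly as <p> and </p> without attributes:

```php
<?php
// Removes a stray leading </p> and a stray trailing <p>, but only if present.
function stripUnbalancedP(string $html): string
{
    $firstOpen  = stripos($html, '<p>');
    $firstClose = stripos($html, '</p>');
    // A close tag before any open tag has nothing to close.
    if ($firstClose !== false && ($firstOpen === false || $firstClose < $firstOpen)) {
        $html = substr_replace($html, '', $firstClose, strlen('</p>'));
    }

    $lastOpen  = strripos($html, '<p>');
    $lastClose = strripos($html, '</p>');
    // An open tag after the last close tag is never closed.
    if ($lastOpen !== false && ($lastClose === false || $lastOpen > $lastClose)) {
        $html = substr_replace($html, '', $lastOpen, strlen('<p>'));
    }

    return $html;
}
```

With this guard, markup that is already balanced passes through unchanged.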
This should do what you need
$str = '<div class="content"><br />
<strong>0730</strong> – Check in direct to Compass at Marlin Wharf Berth 18</p> <p><strong>0800 </strong> – Depart Marlin Marina Cairns to the Great Barrier Reef</p>
<p><strong>1015</strong> – Arrive at your first Great Barrier Reef Location</p> <p><strong>1230</strong> – BBQ Lunch with fresh salads</p>
<p><strong>1300</strong> Cruise to 2nd Reef Location 1530 – Depart the Great Barrier Reef</p>
<p><strong>1730</strong> – Approximately Arrival time at Cairns Marina<br /> </div>';
//Replace the first one, easy enough
$str = preg_replace('/<\/p>/', "", $str, 1);
$stringReplace = "<p>";
$stringLen = strlen($stringReplace);
//Get the position of the last one, with strrpos (reverse check)
$pos = strrpos($str, $stringReplace);
//Make sure there is one
if($pos !== false){
//If so, replace it with nothing
$str = substr_replace($str, "", $pos, $stringLen);
}
You can use something like this. I haven't tested it myself, so let me and others know whether it works.
$text = "Quick \"brown fox jumps \"over\" the lazy\" dog";
$resault = Regex.Replace(text, "(?<=^[^\"]*)\"|\"(?=[^\"]*$)", "\"\"\"");
I have lots of data that I need to search through for certain patterns.
Problem is when looking for said patterns I have no reference to what I'm looking for.
Or in other words, I have two paragraphs. Each on similar topics. I need to be able to compare both paragraphs and find patterns. Phrases said in both paragraphs and how many times both were said.
I can't seem to find a solution, because preg_match and similar functions require you to supply what you're looking for.
Example paragraphs
Paragraph 1:
Bee Pollen is made by honeybees, and is the food of the young bee. It
is considered one of nature's most completely nourishing foods as it
contains nearly all nutrients required by humans. Bee-gathered pollens
are rich in proteins (approximately 40% protein), free amino acids,
vitamins, including B-complex, and folic acid.
Paragraph 2:
Bee Pollen is made by honeybees. It is required for the fertilization
of the plant. The tiny particles consist of 50/1,000-millimeter
corpuscles, formed at the free end of the stamen in the heart of the
blossom, nature's most completely nourishing foods. Every variety of
flower in the universe puts forth a dusting of pollen. Many orchard
fruits and agricultural food crops do, too.
So from those examples these patterns:
Bee Pollen is made by honeybees
and:
nature's most completely nourishing foods
Both phrases are found in both paragraphs.
This is potentially a complex question depending on whether you're looking for similar phrases or phrases that match word for word.
Finding exact word-for-word matches is quite simple: all you need to do is split on common breaks like punctuation marks (e.g. .,;:) and perhaps on conjunctions as well (e.g. and, or). The problem comes with, for example, adjectives: two phrases might be exactly the same except for one word, like so:
The world is spinning around its axis at a tremendous speed.
The world is spinning around its axis at a magnificent speed.
This won't match because tremendous and magnificent are used in place of one another. Potentially you could work around this, however, that would be a more complex question.
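As a very rough illustration of one possible workaround (not part of the answer below), PHP's built-in similar_text() can score how alike two phrases are. The 80% threshold is an arbitrary assumption and would need tuning on real data:

```php
<?php
$a = 'The world is spinning around its axis at a tremendous speed.';
$b = 'The world is spinning around its axis at a magnificent speed.';

// similar_text() returns the number of matching characters and writes
// the similarity as a percentage into its third argument.
similar_text(strtolower($a), strtolower($b), $percent);

// Arbitrary cut-off chosen for this sketch only.
$isSimilar = $percent > 80;
```

This compares characters, not meaning, so it only catches phrases that differ by a word or two.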
Answer
If we stick to the simple side of things we can achieve phrase matching with just a few lines of code (4 in this example; not including the formatting for comments/readability).
$wordSplits = 'and or on of as'; //List of words to split on
preg_match_all('/(?<m1>.*?)([.,;:\-]| '.str_replace(' ', ' | ', trim($wordSplits)).' )/i', $para1, $matches1);
preg_match_all('/(?<m2>.*?)([.,;:\-]| '.str_replace(' ', ' | ', trim($wordSplits)).' )/i', $para2, $matches2);
$commonPhrases = array_filter( //Removes blank $key=>$value pairs
array_intersect( //Finds matching patterns
array_map(function($item){
return(strtolower(trim($item))); //Cleans array for $para1 values - removes leading and following spaces
}, $matches1['m1']),
array_map(function($item){
return(strtolower(trim($item))); //Cleans array for $para2 values - removes leading and following spaces
}, $matches2['m2'])
)
);
var_dump($commonPhrases);
/**
OUTPUT:
array(2) {
[0]=>
string(31) "bee pollen is made by honeybees"
[5]=>
string(41) "nature's most completely nourishing foods"
}
*/
The above code splits both paragraphs on punctuation (the characters defined in [...] in the preg_match_all pattern) and on the words in the word list, which are joined into the pattern so that each word only matches when it has a space before and after it.
Wordlist
You can change the word list to include any breaks you like, editing the list until you get the phrases you are after, examples:
$wordSplits = 'and or';
$wordSplits = 'and but if or';
$wordSplits = 'a an as and by but because if in is it of off on or';
Punctuation
You can add any punctuation marks you like into the list (between [ and ]), however remember that some characters do have special meanings and may need to be escaped (or placed appropriately): - and ^ should become \- and \^ or be placed where their special meaning doesn't come into play.
You may consider changing:
([.,;:\-]|
To:
([.,;:\-] | //Adding a space before the pipe
So that you only split punctuation marks which are followed by a space. For example: this would mean that items like 50,000 won't be split.
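A small sketch of the difference, using preg_split() on a made-up sentence (the sample text and variable names are assumptions for illustration):

```php
<?php
$text = 'It costs 50,000 dollars, which is a lot; really.';

// Punctuation alone: the comma inside 50,000 is treated as a break.
$loose = preg_split('/[.,;:\-]/', $text);

// Punctuation followed by whitespace: 50,000 stays in one piece.
$strict = preg_split('/[.,;:\-]\s/', $text);
```

Note that the final full stop is not followed by a space, so in the strict version it stays attached to the last piece.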
Spaces and breaks
You may also consider changing the spaces to \s so that tabs and newlines etc are included and not just spaces. Like so:
'/(?<m1>.*?)([.,;:\-]|\s'.str_replace(' ', '\s|\s', trim($wordSplits)).'\s)/i'
This would also apply to:
([.,;:\-]\s|
If you decide to go down that route.
I've been working on this code, don't know if it suits your needs... Feel free to expand it!
$p1 = "Bee Pollen is made by honeybees, and is the food of the young bee. It is considered one of nature's most completely nourishing foods as it contains nearly all nutrients required by humans. Bee-gathered pollens are rich in proteins (approximately 40% protein), free amino acids, vitamins, including B-complex, and folic acid.";
$p2 = "Bee Pollen is made by honeybees. It is required for the fertilization of the plant. The tiny particles consist of 50/1,000-millimeter corpuscles, formed at the free end of the stamen in the heart of the blossom, nature's most completely nourishing foods. Every variety of flower in the universe puts forth a dusting of pollen. Many orchard fruits and agricultural food crops do, too.";
// Strip strings of periods etc.
$p1 = strtolower(str_replace(array('.', ',', '(', ')'), '', $p1));
$p2 = strtolower(str_replace(array('.', ',', '(', ')'), '', $p2));
// Extract words from first paragraph
$w1 = explode(" ", $p1);
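// Build search string
$search = '';
$found = array();
// Initialise the loop state so nothing is read before it is set
$add = FALSE;
$old_search = '';
$num_occured = 0;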
foreach ($w1 as $word) {
//echo 'Word: ' . $word . "<br />";
$search .= ' ' . $word;
$search = trim($search);
//echo '. . Search string: '. $search . "<br /><br />";
if (substr_count($p2, $search)) {
$old_search = $search;
$num_occured = substr_count($p2, $search);
//echo " . . . found!" . "<br /><br /><br />";
$add = TRUE;
} else {
//echo " . . . not found! Generating new search string: " . $word . '<br />';
if ($add) {
$found[] = array('pattern' => $old_search, 'occurences' => $num_occured);
$add = FALSE;
}
$old_search = '';
$search = $word;
}
}
print_r($found);
The above code finds occurrences of patterns from the first string in the second one.
I'm sure it can be written better, but since it's past midnight (local time), I'm not as "fresh" as I'd like to be...
Provinces is a group_concat of all the individual records that contain province, some of which are blank.
So, when I encode:
$provinces = ($row['provinces']);
echo "<td>".wordwrap($provinces, 35, "<br />")."</td>";
This is what the result looks like:
Minas Gerais,,,Rio Grande do
Sul,Santa Catarina,Paraná,São Paulo
However, when I try to preg_replace out some of the nulls, and add some spaces with this expression:
$provinces = preg_replace($patterns,
$replaces, ($row['provinces']));
echo "<td>".wordwrap($provinces, 35, "<br />")."</td>";
This is what I get!!! :(
Minas Gerais, Rio Grande do
Sul, Santa
Catarina, Paraná, São Paulo
The output is very unnatural looking.
BTW: Here are the search and replace arrays:
$patterns[0] = '/,,([,]+)?/'; $replaces[0] = ', ';
$patterns[1] = '/^,/'; $replaces[1] = '';
$patterns[2] = '/,$/'; $replaces[2] = '';
$patterns[3] = '/\b,\b/'; $replaces[3] = ', ';
$patterns[4] = '/\s,/'; $replaces[4] = ', ';
UPDATE: I even tried to change Paraná to Parana
Minas Gerais, Rio Grande do
Sul, Santa
Catarina, Parana, São
Paulo
Don't use &nbsp; as the replacement. wordwrap() counts it as 6 characters because it doesn't interpret the HTML entity. That's why your lines are breaking oddly. If you want non-breaking spaces, replace the spaces after you call wordwrap().
Also, your first pattern should be:
// match one or more commas together
$patterns[0] = '/,+/';
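A quick sketch of what that single pattern does. The sample string, including its leading and trailing commas, is made up for the example:

```php
<?php
// Hypothetical group_concat result with blank entries at both ends and in the middle.
$raw = ',Minas Gerais,,,Rio Grande do Sul,Santa Catarina,';

// Strip commas at the ends first, then collapse every run of commas
// into a single comma followed by a space.
$clean = preg_replace('/,+/', ', ', trim($raw, ','));
```

This assumes the blank entries are truly empty; entries consisting only of spaces would still need trimming.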
Is the wordwrap() really necessary? It sounds like you are rendering this content into a table cell of some fixed width and you don't want individual entries to split across lines.
If this inference is correct, and if none of your entries is so long that forcing it onto a single line will break your layout, then how about this: explode() on commas into an array, remove the whitespace-only entries, replace normal spaces in each array entry with &nbsp;, and implode() back on ', ' (a comma followed by a space). Then let the rendering browser break lines wherever it needs to.
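Those steps could be sketched roughly as follows. The sample string is taken from the question's output, and the variable names are assumptions:

```php
<?php
$provinces = 'Minas Gerais,,,Rio Grande do Sul,Santa Catarina,Paraná,São Paulo';

// 1. Split on commas and trim each entry.
$entries = array_map('trim', explode(',', $provinces));

// 2. Drop the empty (or whitespace-only) entries.
$entries = array_filter($entries, 'strlen');

// 3. Make the spaces inside each entry non-breaking.
$entries = array_map(function ($entry) {
    return str_replace(' ', '&nbsp;', $entry);
}, $entries);

// 4. Join with a comma and a normal space, so the browser may wrap there.
$cell = implode(', ', $entries);

echo "<td>" . $cell . "</td>";
```

Since the only normal spaces left are the ones after the commas, the browser can only break a line between provinces.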