Address extraction from string

I've been looking around for some help, but haven't found a solution yet. The problem is a lot like this one: Extract address from string
But I cannot seem to rewrite the PHP code to solve my problem.
I'm using Magento and I have only one address field, which combines street name and house number. For my CSV export I'm using an XSLT extension, which can also work with PHP. For retrieving the address and importing it, I need the street name and house number as two separate strings.
At this moment I'm using:
preg_replace('/[0-9,.]/','',$address);
for retrieving the street,
and:
preg_replace('/[^0-9,.]/','',$address);
for retrieving the house number. This doesn't work in a lot of situations, because sometimes a number is included in the street name (like 2nd street) or a letter is included in the house number (like 36-B).
The only two things we always know are "a house number always includes a number" and "a house number (sometimes including letters) is always at the end of the string".
I've made an image of a few examples.

You will have to search for the last number in the string, then find the first space before it, and split the string at that position.
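The steps above can be sketched roughly as follows. This is a minimal sketch, not a tested Magento integration: the function name splitAddress and the sample addresses are made up for illustration, and it assumes the house number contains no space after its last digit group begins.

```php
<?php
// Hypothetical helper: split "street + house number" at the last space
// that precedes the last digit in the string.
function splitAddress(string $address): array
{
    $address = trim($address);

    // Locate the last digit: a digit followed only by non-digits up to the end.
    if (!preg_match('/\d\D*$/', $address, $m, PREG_OFFSET_CAPTURE)) {
        return [$address, '']; // no digit at all, so no house number
    }
    $lastDigitPos = $m[0][1];

    // Find the last space before that digit.
    $spacePos = strrpos(substr($address, 0, $lastDigitPos), ' ');
    if ($spacePos === false) {
        return ['', $address]; // nothing before the number
    }

    return [
        rtrim(substr($address, 0, $spacePos)), // street name
        substr($address, $spacePos + 1),       // house number
    ];
}

print_r(splitAddress('2nd street 36-B'));
```

For '2nd street 36-B' this yields '2nd street' and '36-B'. The number inside the street name is left alone because only the last digit in the string is used as the anchor.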

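I found the following code to work almost perfectly.
// Both methods belong inside a class (hence "static").
// Street: reverse the string, drop everything up to and including the
// first run of digits (the house number), then reverse back.
static function addressFix($address){
    $r = strrev($address);
    $str1 = preg_replace('/^(.*?\d+)(.*?)$/', '$2', $r);
    return trim(strrev($str1)); // trim the space left before the house number
}
// House number: same idea, but keep the part up to the first run of digits.
static function houseNumberFix($address){
    $r = strrev($address);
    $str2 = preg_replace('/^(.*?\d+)(.*?)$/', '$1', $r);
    return strrev($str2);
}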

Related

PHP - Searching strings from a certain format

I've seen other examples using regex, but I'm having trouble formatting mine correctly. I'll have a list of numbers like this:
1.0.0.1ACS
1.0.0.2ADS
1.0.1.8AAB
However, I only want the actual four numbers, and each number may have up to three digits (for example 122.222.222.222). How would I go about doing this in PHP? I presume I would have to put them all into an array first and then loop over it, but I am confused about how to remove the extra letters.
Thanks in advance!

Your code is:
<?php
$input = <<<TEST
1.0.0.1ACS
1.0.0.2ADS
1.0.1.8AAB
TEST;
// The dots are escaped so they match a literal "." rather than any character.
if (!preg_match_all('#\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}#', $input, $matches)) {
    print("NOT FOUND!");
} else {
    var_dump($matches[0]);
}
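The array-and-loop approach described in the question can also be sketched directly. This is only an illustration: the variable names are arbitrary and the fourth sample line is made up to show the three-digit case.

```php
<?php
// Sample input; the last entry is a hypothetical three-digit example.
$lines = ['1.0.0.1ACS', '1.0.0.2ADS', '1.0.1.8AAB', '122.222.222.222XYZ'];

$versions = [];
foreach ($lines as $line) {
    // Four groups of 1-3 digits separated by literal dots, anchored at the
    // start of the line; anything after them (the letters) is ignored.
    if (preg_match('/^\d{1,3}(?:\.\d{1,3}){3}/', $line, $m)) {
        $versions[] = $m[0];
    }
}

print_r($versions);
```

Each entry of $versions then holds only the dotted number, for example '1.0.0.1' for the first line.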

Is there a typo in this str_replace code? / Am I reading it correctly?

Here is a line of code from a PHP file, specifically zstore.php, which is included as part of the "Zazzle Store Builder" toolset from Zazzle.com.
The set of files allows someone like me, who has products for sale on Zazzle, to massage that data into a nicer "storefront" that I can set up my way, instead of being confined by the CMS structure of Zazzle.com, where they understandably want to keep the monkeys (uhmmm... users like myself) from causing too much mayhem.
So... here is the code:
$keywords = str_replace(" ",",",str_replace(",","",$keywords));
Two questions:
Am I understanding what it does and
Is there an extra single or double quote in the string that does not need to be there?
Here is what I think the line of code is saying:
Take the string of characters that the user inputs (dance diva) and assign it to the variable called
$keywords
then run the following function on that character string
= str_replace
(" ","," <<< look for spaces. If you find a space, replace it with a comma
,str_replace(",","" <<< this is the bit I don't understand or which may have a typo
I THINK that it is saying "if you find commas, leave them alone", but I'm not certain.
,$keywords)); <<< then put the edited string of characters back into the variable called $keywords.
What led me to look at this was that I was inputting the following:
dance,diva which is what I THOUGHT the script was wanting from me based on the commented text in the README.txt file:
// Search terms. Comma separated keywords you can use to select products for your store
So..
Am I understanding what this line of code is supposed to do?
which, assuming I am correct, and I'm pretty sure that the first half is supposed to work as I've described, now brings me to my second question:
Why isn't the second bit working? Is there a typo?
To review:
dance diva produces results
dance,diva does not
Both SHOULD work.
Thanks in advance for your help. I have a lot of HTML experience and computer experience but PHP is new to me.

$keywords = str_replace(" ",",",str_replace(",","",$keywords));
You can split it into:
$temp = str_replace(",","",$keywords);
$keywords = str_replace(" ",",",$temp);
First it replaces all commas with an empty string, that is, it removes all commas. Then it replaces all spaces with commas.
For "dance diva" there are no commas, so the first call does nothing; the second call then replaces the space and the result is "dance,diva".
For "dance,diva" the first call removes the comma and you get "dancediva"; there is no space left for the second call to replace, so "dancediva" is your result.
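The behaviour described above can be checked with a short script. The second function is only a sketch of one way to accept both spaces and commas as separators; it is not part of the Zazzle Store Builder, and both function names are made up for this example.

```php
<?php
// The original line from zstore.php, wrapped in a function for testing.
function originalKeywords(string $keywords): string
{
    return str_replace(" ", ",", str_replace(",", "", $keywords));
}

// Hypothetical alternative: treat any run of spaces and/or commas
// as a single separator, then join the words with commas.
function normalizeKeywords(string $keywords): string
{
    $parts = preg_split('/[\s,]+/', trim($keywords), -1, PREG_SPLIT_NO_EMPTY);
    return implode(',', $parts);
}

echo originalKeywords('dance diva'), "\n";  // dance,diva
echo originalKeywords('dance,diva'), "\n";  // dancediva
echo normalizeKeywords('dance diva'), "\n"; // dance,diva
echo normalizeKeywords('dance,diva'), "\n"; // dance,diva
```

Whether the rest of the store builder expects exactly this comma-separated form is an assumption based on the README comment quoted in the question.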

Regex PHP code to scrape street address that has a line break

Been searching for two days now with Google, and a lot on SO here, but I can't solve this regex preg_match problem. I want to simply scrape a street address, and normally I can do this easily, but because some street addresses have a line break in the middle of them with around 25 characters of whitespace, my code displays an empty array or just NULL.
Below I have included the source code to show an example of what I'm trying to scrape, and also the failed code I have so far. Any help from someone with more experience than I have would be greatly appreciated this Sunday morning.
Sample of source code here:
<span style="font-size:14px;">736
E 17th St</span><br />
My attempt so far:
$new_data = file_get_contents('someURLaddress');
$street_address_regex = '~14px\;\"\>(.*?)\<\/span\>\<br\s\/\>\s~s';
preg_match($street_address_regex,$new_data,$extracted_street_address);
var_dump ($extracted_street_address);

I'm only answering because using a dot here is horrible practice. A giveaway that you're doing something wrong in a regular expression is needing the single-line option (the s modifier). That's a huge waste of resources and bound to break at some point.
This is 99.9% positively what you need to use:
$street_address_regex = '~14px;">([^<]*)~i';
Or, if you are (for some reason) expecting a < as a legitimate character, either meaning Less-than or formatting tags like bold or italics, then you can do this:
$street_address_regex = '~14px;">([^<]*<)*?\/span~i';
And if it bothers you enough that you don't want to have to format out the last < character you'll get in your string, you can do this:
$street_address_regex = '~14px;">((?:[^<]*(?(?!<\/span)<))*)~i';
But honestly, you shouldn't even be using Regex. Find the stripos of <span style="font-size:14px;"> and add its length (to get the Address Starting Point)... Then find the stripos of </span> and input the offset point of the previously found Index (to get the Address Ending Point). Subtract them to get the length. Then pull the substr using the OriginalString, StartIndex, And Length.
Sounds like a lot, but make that a small function that you use instead of Regex, and just input the OriginalString, StartString, and EndString... then return the contents between StartString and EndString using the method I just said. Make the function re-usable.
With that function, that portion of your code will literally run 10 times faster, at least. Regex is powerful as hell for patterns, but you don't have a pattern, you have two static strings from which you want the contents between them. Regex is slow as hell for static string manipulation... Especially using the Dot with Single-Line ~Shiver~
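$Input = '<span style="font-size:14px;">736 E 17th St</span><br />';
echo GetBetween($Input, '14px;">', '</span');
// Returns the text between $StartStr and $EndStr (case-insensitive),
// or null if either marker is not found.
function GetBetween($OrigStr, $StartStr, $EndStr) {
    $StartPos = stripos($OrigStr, $StartStr);
    if ($StartPos === false) return null;
    $StartPos += strlen($StartStr); // first character after the start marker
    $EndPos = stripos($OrigStr, $EndStr, $StartPos);
    if ($EndPos === false) return null;
    return substr($OrigStr, $StartPos, $EndPos - $StartPos);
}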
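Since the question is about addresses that contain a line break followed by a run of whitespace, here is a minimal sketch that combines the negated character class from this answer with a whitespace clean-up step. The function name extractStreetAddress and the amount of padding in the sample are assumptions for illustration.

```php
<?php
// Sample markup with a line break and padding inside the address.
$html = "<span style=\"font-size:14px;\">736\n                         E 17th St</span><br />";

// Hypothetical helper: capture everything up to the next "<", then
// collapse any run of whitespace (including the newline) to one space.
function extractStreetAddress(string $html): ?string
{
    if (!preg_match('~14px;">([^<]*)~i', $html, $m)) {
        return null; // marker not found
    }
    return trim(preg_replace('/\s+/', ' ', $m[1]));
}

echo extractStreetAddress($html); // 736 E 17th St
```

The class [^<] matches newlines as well, so no s modifier is needed to get across the line break.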

Regex to conditionally replace Twitter hashtags with hyperlinks

I'm writing a small PHP script to grab the latest half dozen Twitter status updates from a user feed and format them for display on a webpage. As part of this I need a regex replace to rewrite hashtags as hyperlinks to search.twitter.com. Initially I tried to use:
<?php
$strTweet = preg_replace('/(^|\s)#(\w+)/', '\1#\2', $strTweet);
?>
(taken from https://gist.github.com/445729)
In the course of testing I discovered that #test is converted into a link on the Twitter website, however #123 is not. After a bit of checking on the internet and playing around with various tags I came to the conclusion that a hashtag must contain alphabetic characters or an underscore in it somewhere to constitute a link; tags with only numeric characters are ignored (presumably to stop things like "Good presentation Bob, slide #3 was my favourite!" from being linked). This makes the above code incorrect, as it will happily convert #123 into a link.
I've not done much regex in a while, so in my rustiness I came up with the following PHP solution:
<?php
$test = 'This is a test tweet to see if #123 and #4 are not encoded but #test, #l33t and #8oo8s are.';
// Get all hashtags out into an array
if (preg_match_all('/(^|\s)(#\w+)/', $test, $arrHashtags) > 0) {
foreach ($arrHashtags[2] as $strHashtag) {
// Check each tag to see if there are letters or an underscore in there somewhere
if (preg_match('/#\d*[a-z_]+/i', $strHashtag)) {
$test = str_replace($strHashtag, ''.$strHashtag.'', $test);
}
}
}
echo $test;
?>
It works; but it seems fairly long-winded for what it does. My question is, is there a single preg_replace similar to the one I got from gist.github that will conditionally rewrite hashtags into hyperlinks ONLY if they DO NOT contain just numbers?

(^|\s)#(\w*[a-zA-Z_]+\w*)
PHP
$strTweet = preg_replace('/(^|\s)#(\w*[a-zA-Z_]+\w*)/', '\1#\2', $strTweet);
This regular expression matches a # followed by 0 or more word characters [a-zA-Z0-9_], followed by 1 or more alphabetic characters or underscores, followed by 0 or more word characters. A tag made up only of digits never satisfies the middle part, so it is not matched.
http://rubular.com/r/opNX6qC4sG <- test it here.
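As a rough check, the pattern can be run against the test tweet from the question. The link target used below is an assumption based on the question's mention of search.twitter.com, and the function name linkHashtags is made up for this sketch.

```php
<?php
// Hypothetical helper: link hashtags that contain at least one letter
// or underscore; digit-only tags such as #123 are left untouched.
function linkHashtags(string $tweet): string
{
    return preg_replace(
        '/(^|\s)#(\w*[a-zA-Z_]+\w*)/',
        // Assumed URL format, for illustration only.
        '$1<a href="https://search.twitter.com/search?q=%23${2}">#${2}</a>',
        $tweet
    );
}

$test = 'This is a test tweet to see if #123 and #4 are not encoded but #test, #l33t and #8oo8s are.';
echo linkHashtags($test);
```

Here #test, #l33t and #8oo8s are wrapped in links, while #123 and #4 stay as plain text. Without the u modifier, \w is normally limited to ASCII letters, digits and underscore, so tags with accented characters would need a different character class.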

It's actually better to search for characters that aren't allowed in a hashtag, otherwise tags like "#Trentemøller" won't work.
The following works well for me...
preg_match('/([ ,.]+)/', $string, $matches);

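I have devised this: /(^|\s)#([[:alnum:]])+/i (PHP's preg functions have no g modifier; preg_replace and preg_match_all already process every match).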

I found Gazler's answer to work, although the regex added a blank space at the beginning of the hashtag, so I removed the first part:
(^|\s)
This works perfectly for me now:
#(\w*[a-zA-Z_0-9]+\w*)
Example here: http://rubular.com/r/dS2QYZP45n

removing phone number from a document

I've got a challenge that I am hoping the SO community can help me with.
I am trying to parse a lot of HTML documents in my PHP application to remove personal details such as names, addresses and phone numbers. I can remove most of these details without too much trouble; however, the phone number is a real problem for me.
My idea is to take the text from these documents and then use a regex to identify the phone numbers and replace them with another value such as 'xxxx'.
I've got two regexes that I am using: one for UK landline numbers and one for UK cell/mobile numbers.
However, when I try to run them against the text it just returns an empty string.
I am using the following preg_replace code:
$patterns = array(
    '/^(((\+44\s?\d{4}|\(?0\d{4}\)?)\s?\d{3}\s?\d{3})|((\+44\s?\d{3}|\(?0\d{3}\)?)\s?\d{3}\s?\d{4})|((\+44\s?\d{2}|\(?0\d{2}\)?)\s?\d{4}\s?\d{4}))(\s?\#(\d{4}|\d{3}))?$/',
    '/^(\+44\s?7\d{3}|\(?07\d{3}\)?)\s?\d{3}\s?\d{3}$/'
);
$replace = array('xxxxx', 'xxxxx');
// Do the search for the numbers.
$updatedContents = preg_replace($patterns, $replace, $htmlContents);
At the moment this is causing me a lot of head scratching as I thought that I had this nailed, but at the moment I can't see what's wrong??
I am sure that it is something really simple.
Thanks,
Grant

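You probably don't want to anchor your regular expressions. Remove the ^ from the beginning and the $ from the end of each pattern. With those anchors in place, a pattern only matches when the whole subject string is a phone number, not when a number appears somewhere inside the document.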
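A minimal sketch of the difference, using the mobile pattern from the question with and without the anchors. The sample text, the phone numbers in it, and the function name maskUkMobileNumbers are made up for illustration.

```php
<?php
// Hypothetical helper: the question's mobile pattern without ^ and $.
function maskUkMobileNumbers(string $text): string
{
    $pattern = '/(\+44\s?7\d{3}|\(?07\d{3}\)?)\s?\d{3}\s?\d{3}/';
    return preg_replace($pattern, 'xxxxx', $text);
}

$html = '<p>Call 07123 456 789 or +44 7123 456789 today.</p>';

// Anchored version: the subject as a whole is not a phone number,
// so nothing is replaced and the text comes back unchanged.
$anchored = preg_replace(
    '/^(\+44\s?7\d{3}|\(?07\d{3}\)?)\s?\d{3}\s?\d{3}$/',
    'xxxxx',
    $html
);
var_dump($anchored === $html); // bool(true)

// Unanchored version: both numbers inside the text are masked.
echo maskUkMobileNumbers($html), "\n";
// <p>Call xxxxx or xxxxx today.</p>
```

An unanchored pattern can also match digits inside a longer number, so adding guards such as lookarounds for neighbouring digits may be worth considering; that is left out of this sketch.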
