I need to extract an address from a string:
$string = "some text 9 th pizza tower 78 main Chennai 600001. and other information may be phone number etc";
From $string I want to extract only "9 th pizza tower 78 main Chennai 600001".
The address format is not constant; it can take two different forms.
One is the $string above, and the other looks like this:
$string1 = "some text 9 th pizza tower main Chennai 600001. and other information may be phone number etc";
From this one I need to extract "9 th pizza tower main Chennai 600001".
I don't think this is possible. Extracting text from a plain text file is like asking for a tree when you're in the woods: "Which one?"
If the file is always in the same format, like:
Company Name 73
1st Cross Street, Hotel Chennai
-600000
someadditionalstuff
Then you've got a chance, and the same goes if the fields are always separated by a special character (, . ; etc.). If it is always the same format (the one you showed above), then something like this might work:
([a-zA-Z0-9 ]*),([a-zA-Z0-9 ]*) XXX ([a-zA-Z0-9 ]*) (-[0-9]{6})
Group 1: Company Name
Group 2: Address
Group 3: City
Group 4: Zip-Code
Bobby
Sorry, this is not possible. It may work for one website but not for others, as there is no standard format for displaying a company address (or any address) on a web page.
Not an easy question, and there isn't a magic AI code that can figure it out.
You must make some assumptions, and look at a lot of data to find out whether they are good ones.
For a start, if you assume every address ends with a ZIP code, you can search the string for 5 (or 6) digits and cut it after that.
Finding the start of the address is beyond my skills. Maybe look for the first number.
You need to check a lot of examples to figure out which pattern matches most of them.
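The heuristic above (start at the first number, stop after the postal code) can be sketched as follows. This is only a rough sketch in Python, not PHP; the name `extract_address` is made up for illustration, and it assumes the address begins at the first digit in the text and ends with a standalone six-digit code. The same pattern should carry over to preg_match.

```python
import re

# Assumption: the address begins at the first digit in the text and ends
# with a standalone six-digit postal code.
ADDRESS_RE = re.compile(r"\d.*?\b\d{6}\b")


def extract_address(text):
    """Return the span from the first digit up to a 6-digit code, or None."""
    match = ADDRESS_RE.search(text)
    return match.group(0) if match else None


samples = [
    "some text 9 th pizza tower 78 main Chennai 600001. and other information may be phone number etc",
    "some text 9 th pizza tower main Chennai 600001. and other information may be phone number etc",
]
for sample in samples:
    print(extract_address(sample))
```

It will misfire on text that contains a digit before the address starts, which is why checking it against a lot of real examples matters.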
Yes, it's possible using Google's natural language processing service, which is paid, or Apache OpenNLP, which is open source. The documentation for OpenNLP is not great, though.
The best place to start is this URL:
https://opennlp.apache.org/
I am trying to display data extracted from a PDF document. Here is a sample of the raw data I get from the PDF: 55.0 450.0 320.0 GA350C CARDS 4 21 90.0 4 1 DIGCLR. This is one row, where every space separates one column from the next. I can extract each column with PHP's substr() function, but I am not sure how to display the data when there are three or five rows, because whether it is one row or five, the data comes out on a single line.
I can only count rows by the number of spaces. The only fixed thing is the number of columns, so I need to loop over the data efficiently.
If anyone has a better idea, please let me know.
Here is the string I extracted from the PDF document with the help of PdfParser.
5284 25/10/16 DATE JOB REC'D: DATE DUE: 26/10/16 JOB NUMBER: The Print Group CUSTOMER NAME: 30 days CONTACT: Tanya Bulley PHONE: (07) 3395 7248 FAX: (07) 3395 9462 ORDER NUMBER: 234456/277458 ADDRESS: The Print Group 88 Webster Road Geebung Qld 4034 Australia 5,289 QUOTE NO: PREVIOUS JOB NO: 0 2,000 Business Cards - Shed Company 2 KINDS JOB: DESCRIPTION: PRE-PRESS: Supplied Print Ready Files/ No Proof Required SIZE: BC 90 x 55mm PRINTED: CMYK 2/sides STOCK: 350gsm Gloss Art FINISH:Trim to size QTY: 2000 (1,000 each name) PACK: Carton Pack DELIVERY: 1 Point ACT [1]SPECIAL INSTRUCTIONS: Artwork Received SPECIAL INSTRUCTIONS: Out on Proof Approved Stock TYPE/ART CUTTING Proofing Pre Press Proofing 0.50 TRIMMING CARDS TRIM MAKE READY CARDS TRIM 90 x 55 STOCK 96.00 CARDS Sovereign Gloss 450x320/350 FINISHING PACK/DELIVERY PACK A4 Cartons 305x215/280 Standard Local Delivery (by we INK/CHEMICALS OUTSIDE WORK Delivery: The Print Group 88 Webster Road Geebung Qld 4034 Press Sheet Press Code Stock Code No. of Work & Turn No Up No. of Colours Front Back Description Ink Code Front Back Trim Size Depth Width Ink Notes 55.0 450.0 320.0 GA350C CARDS 4 21 90.0 4 1 DIGCLR
This is basically a job order for printers, and the last line holds the job details. Right now there is only one row of job details, but in some orders there can be up to 10 rows, so it is hard to save them in a database under the proper column names. To grab words or details I used:
// Returns the text between the first $start and the next $end, or '' if $start is absent.
function GetBetween($content, $start, $end)
{
    $r = explode($start, $content);
    if (isset($r[1])) {
        $r = explode($end, $r[1]);
        return $r[0];
    }
    return '';
}
I used it like $cust_name = GetBetween($a,'JOB NUMBER:','CUSTOMER NAME:'); I also used PHP's substr() function to get some details, and with these I have everything apart from the main data at the end of the string (mentioned above). I hope this explanation helps you understand the whole situation.
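Since the only fixed thing is the number of columns, one way to handle the trailing job-detail rows is to split them on whitespace and cut the tokens into groups of that size. A rough sketch in Python (the same idea works in PHP with preg_split and array_chunk); `chunk_rows` is a made-up name, the column count of 11 is taken from the sample row, and it assumes no cell contains a space.

```python
def chunk_rows(tail, columns):
    """Split a whitespace-separated string into rows of `columns` cells."""
    tokens = tail.split()
    if len(tokens) % columns != 0:
        # Some cell probably contains a space, or a row is incomplete.
        raise ValueError("token count is not a multiple of the column count")
    return [tokens[i:i + columns] for i in range(0, len(tokens), columns)]


# Sample row from the question, repeated to simulate a two-row order.
one_row = "55.0 450.0 320.0 GA350C CARDS 4 21 90.0 4 1 DIGCLR"
rows = chunk_rows(one_row + " " + one_row, 11)
for row in rows:
    print(row)
```

Each inner list can then be mapped onto the database column names in order.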
Sorry, I tried to explain this with a lot of code and a long description, but Stack Overflow would not let me post it, which is frustrating after spending 2 hours writing it in Notepad.
So here are some simple pointers for doing this.
Avoid the <table> tag and use <div> where you can (only ABBYY converts <table> near perfectly). This requirement is optional.
Convert the PDF to a DOM tree. I recommend converting to HTML, and this must be automated via PHP.
Paid software: ABBYY FineReader or ABBYY Transformer (lite version)
Free software: pdftohtml from Poppler
From around 5 years of experience doing this, I recommend ABBYY. I am fairly sure that all the Indonesian companies that provide digital newspaper clipping use it. If you cannot afford it, you will have to find your own way to get it (I can't say more here).
Grab the HTML DOM with regular expressions (regex) or http://simplehtmldom.sourceforge.net/
Another pointer:
If you have trouble grabbing content using regex or an HTML DOM parser:
1. Try to get rid of the DOM you don't need. You can use preg_replace:
[trash]
[YOUR_TABLE]
[trash]
then grab your content from the remaining snippet.
2. If you can edit the PDF creation process, try adding a unique word or string around your content:
[trash]
<div>this is title</div>
[YOUR TABLE]
<div>this is footer</div>
[trash]
so you can search for your content between the words "this is title" and "this is footer".
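The marker idea can be sketched like this. It is a rough Python sketch (preg_match with the s modifier would be the PHP equivalent); `between_markers` and the sample HTML are made up for illustration, and it assumes both markers appear exactly as written in the converted HTML.

```python
import re


def between_markers(html, start_marker, end_marker):
    """Return the text between two literal markers, or None if either is missing."""
    pattern = re.escape(start_marker) + r"(.*?)" + re.escape(end_marker)
    match = re.search(pattern, html, flags=re.DOTALL)  # DOTALL: span line breaks
    return match.group(1).strip() if match else None


html = """<p>trash</p>
<div>this is title</div>
<table><tr><td>55.0</td><td>450.0</td></tr></table>
<div>this is footer</div>
<p>trash</p>"""

print(between_markers(html, "<div>this is title</div>", "<div>this is footer</div>"))
```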
I'm using the Royal Mail Postcode database to make sure I get the correct address details on my form. It pulls the address through into the relevant input boxes on my form. This is great, but the first input box may hold something like
123 The Street or Apartment 29 or even HouseName The Street. I post these values with my form.
So the posted value is something like $_POST["line1"];
How can I split the line into sections so that I can take the HouseNo or House Name from the address, and show it on my next page as HouseNo/Name & Street Name?
Using the examples above, the output would be echo $houseno; - 123, echo $street; - The Street
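Try splitting it into two parts, alphabetic and numeric:
$example = $_POST["line1"];
// Matches a leading run of letters/spaces followed by a number, e.g. "Apartment 29";
// number-first input such as "123 The Street" needs a different format string.
list($alpha, $num) = sscanf($example, "%[A-Z a-z ]%d");
echo $alpha; // alphabetic part, including spaces
echo $num; // house number
For more, see: PHP: Best way to split string into alphabetic and numeric components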
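For the number-first case, a regex is one option. This is a rough sketch in Python (the pattern itself should work unchanged in preg_match); `split_line1` is a made-up name, and it assumes a house number is a leading run of digits with at most one letter suffix. Anything else is returned whole as a name.

```python
import re

# Assumption: a house number is a leading run of digits with an optional
# single letter suffix (e.g. "123", "1A"), followed by the street.
HOUSE_NO_RE = re.compile(r"^\s*(\d+[A-Za-z]?)\s+(.+)$")


def split_line1(line1):
    """Return (house_no, rest); house_no is None when no leading number is found."""
    match = HOUSE_NO_RE.match(line1)
    if match:
        return match.group(1), match.group(2).strip()
    return None, line1.strip()


for line in ["123 The Street", "1A The Street", "Apartment 29", "The Orchard"]:
    print(split_line1(line))
```

Lines like "Apartment 29" or "The Orchard" fall through untouched, so they still need a rule of their own.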
You could use regular expressions to try and split up characters. Admittedly you'd then have two problems.
In general you're not going to find this very easy to do. You've posted some testcases:
123 The Street
Apartment 29
"The Orchard"
Here's a couple more to think about:
1A The Street
Flat B
Rather than attempting to process a free-form field with (literally) thousands of possible variations, why not provide an interface on the Royal Mail database (which you say you have)? Many sites give this option now:
You enter your postcode and (usually) house number
A box shows a list of matching addresses to choose from
You still get addresses which exactly match the Royal Mail database, but you now get the customer to do the tricky address-matching part of the process.
Of course, you should probably still let people enter freeform address data because the address database is sometimes a few steps behind reality...
Can anyone please help me validate whether an entered field value is a phone number, using PHP?
I have a variable $phone, stored as a varchar(10) column in an SQL database.
Now I want to validate that users enter only numbers in that field.
Use preg_match:
if (preg_match('/^(NA|[0-9+-]+)$/', $str)) {
    // valid: "NA", or only digits, "+" and "-"
} else {
    // invalid: handle the error here
}
One way to do this is with regular expressions. When validating phone numbers, it's easier on users if you accept accompanying characters such as - + . ( ) and spaces, and filter them out yourself.
http://www.php.net/manual/en/function.preg-replace.php
$phone = preg_replace('/[-+.() ]/', '', $phone);
Once you've done that, checking for a match of 10 digits (assuming U.S. numbers with area code) can be done like so:
if (preg_match('/^\d{10}$/', $phone)) {
    // Good match
}
http://www.php.net/manual/en/function.preg-match.php
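The same two steps, strip the punctuation and then check for exactly 10 digits, look like this as a rough Python sketch. The function names are made up for illustration, and the 10-digit rule is the U.S. assumption from above.

```python
import re


def normalise_phone(raw):
    """Drop the punctuation users commonly type: - + . ( ) and spaces."""
    return re.sub(r"[-+.() ]", "", raw)


def is_ten_digit_number(raw):
    """True when exactly 10 ASCII digits remain after normalisation."""
    return re.fullmatch(r"[0-9]{10}", normalise_phone(raw)) is not None


for raw in ["(555) 123-4567", "555-1234", "+447970122467"]:
    print(raw, is_ten_digit_number(raw))
```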
Does is_numeric solve your problem?
Edit:
I wasn't aiming to solve the OP's problem, merely hoping to give them pointers. However, reading the question more closely makes me think the OP isn't aware of internationalisation issues. Their field is 10 characters long, so a number like +447970122467 (a valid British mobile number) would cause a failure. I'm going to assume they are in North America, and so that all numbers follow the North American Numbering Plan. The description below is taken from that page:
| Component | Name | Number ranges | Notes |
|---|---|---|---|
| +1 | ITU country calling code | | "1" is also the usual trunk code for accessing long-distance service between NANP numbers. In an intra-NANP context, numbers are usually written without the leading "+" |
| NPA | Numbering Plan Area Code | Allowed ranges: [2–9] for the first digit, and [0–9] for both the second and third digits. | Covers Canada, the United States, parts of the Caribbean Sea, and some Atlantic and Pacific islands. The area code is often enclosed in parentheses. |
| NXX | Central Office (exchange) code | Allowed ranges: [2–9] for the first digit, and [0–9] for both the second and third digits. | Often considered part of a subscriber number. The three-digit Central Office codes are assigned to a specific CO serving its customers, but may be physically dispersed by redirection, or forwarding to mobile operators and other services. |
| xxxx | Subscriber Number | [0–9] for each of the four digits. | This unique four-digit number is the subscriber number or station code. |
That ought to be enough to get OP started on solving their problem. Sorry for being curt in my initial response.
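As a starting point, the ranges in that description translate into a pattern like the one below. It is a rough Python sketch, not a full validator; `parse_nanp` is a made-up name, and it assumes the number arrives with an optional "+1" or "1" prefix and the usual punctuation.

```python
import re

# NPA and NXX: first digit 2-9, then two digits 0-9; subscriber: four digits.
NANP_RE = re.compile(r"^(?:\+?1)?([2-9][0-9]{2})([2-9][0-9]{2})([0-9]{4})$")


def parse_nanp(raw):
    """Return (area code, exchange, subscriber) or None if the number does not fit."""
    digits = re.sub(r"[-.() ]", "", raw)  # keep a leading "+" for the prefix check
    match = NANP_RE.match(digits)
    return match.groups() if match else None


for raw in ["+1 (212) 555-0123", "212-555-0123", "(123) 555-0123", "+447970122467"]:
    print(raw, parse_nanp(raw))
```

Storing the number with its country code would also need a column wider than 10 characters.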
I'm trying to process a CSV file that has, in each row, a text field with the organization name and the individual's position within that organization as unstructured text. This field is usually a mess of text like this:
Assoc. Research Professor Dept. Psychology Univ. California Santa Barbara
I need to pull out the position and the organization name. For the position, I use preg_match for a series of about 60 different regular expressions for the different professions, and I think it works pretty well (my guess is that it catches about 80%). But, I'm having trouble catching the organization name. I have a MySQL table with roughly 16,000 organization names that I can perform a simple preg_match for, but due to common misspellings and abbreviations, it's only catching about 30% of the organizations. For example, my database has
University of California Santa Barbara
But the CSV file might have any of the options:
Univ Cal Santa Barbara
University Cal-Santa Barbara
University California-Santa Barbara
Cal University, Santa Barbara
I need to process several hundred thousand records, and I can't spend the time to correct 70% of the records that are currently not being processed correctly or painstakingly create multiple aliases for each organization. What I would like to be able to do is to catch small differences (such as the small misspellings, hyphens versus spaces, and common abbreviations), and, if still no matches are found, to ideally recognize an organizational name and create a new record for it.
What libraries or tools in Python or PHP would allow me to perform a similarity match that has a broader reach?
Would NLTK in Python catch misspellings?
Is it possible to use AlchemyAPI to catch misspelled organizations? So far I've only been able to use it to catch correctly spelled organizations.
Since I'm comparing a short string (the organization name) to a longer string (that includes the name plus extraneous information) is there any hope in using PHP's similar_text function?
Any help or insight would be appreciated.
This is within the domain of fuzzy logic. See if these are of any help:
http://www.phpclasses.org/blog/post/119-Neural-Networks-in-PHP.html
http://ann.thwien.de/index.php/Installation
You may be able to use difflib to calculate the similarity ratio between the CSV input and the canonical spelling, and consider it a match if it's above a certain threshold (say, 0.65).
For example:
import difflib

exact = 'University of California Santa Barbara'
inputs = ['Univ Cal Santa Barbara',
          'University Cal-Santa Barbara',
          'University California-Santa Barbara',
          'Cal University, Santa Barbara',
          'Canterbury University']
# The canonical spelling stays fixed as seq1; each candidate is swapped in as seq2.
sm = difflib.SequenceMatcher(None, exact)
ratios = []
for name in inputs:
    sm.set_seq2(name)
    ratios.append(sm.ratio())
print([round(r, 3) for r in ratios])
gives:
[0.733, 0.818, 0.932, 0.716, 0.339]
Note how 'Canterbury University' has a much lower match ratio() than the inputs you gave.
Then again, SequenceMatcher.ratio() may be too slow computed over 16,000 values.
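One way to stretch the same idea is to normalise both sides first (case, hyphens, common abbreviations) and let difflib.get_close_matches pick the best candidate above the threshold. This is a rough sketch: the abbreviation table, the two-entry canonical list, and the names `normalise` and `best_match` are made up for illustration, and the 0.65 cutoff is the guess from above.

```python
import difflib
import re

# Hypothetical abbreviation table; extend it from the real data.
ABBREVIATIONS = {"univ": "university", "cal": "california", "dept": "department"}


def normalise(name):
    """Lowercase, turn punctuation into spaces, and expand known abbreviations."""
    words = re.sub(r"[^a-z0-9]+", " ", name.lower()).split()
    return " ".join(ABBREVIATIONS.get(word, word) for word in words)


def best_match(raw, lookup, cutoff=0.65):
    """Return the canonical name closest to `raw`, or None if nothing passes the cutoff."""
    hits = difflib.get_close_matches(normalise(raw), list(lookup), n=1, cutoff=cutoff)
    return lookup[hits[0]] if hits else None


canonical = ["University of California Santa Barbara", "University of Canterbury"]
# Normalise the canonical names once, not once per CSV row.
lookup = {normalise(name): name for name in canonical}

for raw in ["Univ Cal Santa Barbara", "University Cal-Santa Barbara", "Acme Plumbing Ltd"]:
    print(raw, "->", best_match(raw, lookup))
```

get_close_matches applies cheap upper-bound checks before computing the full ratio, which may help with speed, but it still scans every candidate for every row.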
I need to parse a file with the following format.
0000000 ...ISBN.. ..Author.. ..Title.. ..Edit.. ..Year.. ..Pub.. ..Comments.. NrtlExt Nrtl Next Navg NQoH UrtlExt Urtl Uext Uavg UQoH ABS NEB MBS FOL
ABE0001 0-679-73378-7 ABE WOMAN IN THE DUNES (INT'L ED) 1st 64 RANDOM 0.00 13.90 0.00 10.43 0 21.00 10.50 6.44 3.22 2 2.00 0.50 2.00 2.00 ABS
The ID and ISBN are not a problem; the title is. There is no set length for these fields, and there are no solid delimiters: the space is used as the separator for most of the file, but it also appears inside fields.
Another issue is that there is not always an entry in the comments field. When there is, there are spaces within the content.
So I can get the first two fields, and the last fourteen. I need some help figuring out how to parse the middle six fields.
This file was generated by an older program that I cannot change. I am using php to parse this file.
I would also ask myself 'How good does this have to be' and 'How many records are there'?
If, for example, you are parsing this list to put up a catalog of books to sell on a website, you probably want to be as good as you can, but expect that you will miss some titles and build in a feedback mechanism so your users can help you fix the issues (and make it easy for you to fix them in your new format).
On the other hand, if you absolutely have to get it right because you will lose lots of money for each wrong parse, and there are only a few thousand books, you should plan on getting close, and then doing a human review of the entire file.
(In my first job, we spent six weeks on a data conversion project to convert 150 records, which was not a good use of time.)
Find the title and publisher of the book by ISBN (in some on-line database) and parse only the rest :)
BTW, are you sure that what looks like a space actually is a space? There are other "invisible" characters (like the non-breaking space). I know, not a good idea, but apparently the author of that format was pretty creative...
You need to analyze your data by hand and find out what year, edition and publisher look like. For example, if you find that the year is always two digits and the publisher always comes from some limited list, this is something you can start with.
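Checking for invisible characters is quick to do. A rough sketch in Python (the helper names are made up): it lists every whitespace character in a line that is not a plain space, and can replace them with plain spaces.

```python
import unicodedata


def odd_whitespace(line):
    """List (index, Unicode name) for whitespace that is not a plain space."""
    return [(i, unicodedata.name(ch, "UNKNOWN"))
            for i, ch in enumerate(line)
            if ch.isspace() and ch != " "]


def normalise_spaces(line):
    """Replace every whitespace character with a plain space."""
    return "".join(" " if ch.isspace() else ch for ch in line)


sample = "ABE\u00a0WOMAN IN THE DUNES"  # contains a non-breaking space
print(odd_whitespace(sample))
print(normalise_spaces(sample))
```

If the odd characters only ever appear between fields, they may turn out to be the delimiter the format seems to lack.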
While I don't see any way other than guessing a bit, I'd go about it something like this:
I'd strip off what I know I can parse out reliably, leaving you with ABE WOMAN IN THE DUNES (INT'L ED) 1st 64 RANDOM
From there I'd try to locate the Edition, store and remove it, and split the string in two at that position, leaving you with ABE WOMAN IN THE DUNES (INT'L ED) & 64 RANDOM. Another option is to try with the year, but of course titles such as 1984 might present a problem. (Guessing the edition of course assumes it looks like 7th, 51st etc. for all editions.)
Finally, I'd assume I could somewhat reliably guess the year 64 at the start of the second string, which further narrows down the Publisher(/Comment) part.
The rest is pure guesswork unless you have a list of authors/publishers somewhere to match against, as I'd assume there are not only comments with spaces but also publishers with spaces in their names. But at least you should be down to 2 strings, containing Author/Title in one and Publisher(/Comments) in the other.
All in all it should limit the manual part a bit.
Once done, I'd also save it in a better format somewhere so I don't have to go about parsing it again ;)
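Those steps can be sketched roughly as below. This is Python, not PHP, though the pattern should carry over to preg_match with named groups. The name `split_record` is made up, and it assumes the edition always looks like 1st/2nd/3rd/7th, the year is always two digits, and the numeric columns start at the first value with two decimals.

```python
import re

RECORD_RE = re.compile(
    r"^(?P<id>\S+) (?P<isbn>\S+) "
    r"(?P<author_title>.+) "               # greedy: the last edition-like token wins
    r"(?P<edition>\d+(?:st|nd|rd|th)) "
    r"(?P<year>\d{2}) "
    r"(?P<publisher_comments>.*?) "        # lazy: stops at the first decimal number
    r"(?P<numbers>\d+\.\d{2} .*)$"
)


def split_record(line):
    """Return the coarse fields of one record as a dict, or None if it does not fit."""
    match = RECORD_RE.match(line)
    return match.groupdict() if match else None


line = ("ABE0001 0-679-73378-7 ABE WOMAN IN THE DUNES (INT'L ED) 1st 64 RANDOM "
        "0.00 13.90 0.00 10.43 0 21.00 10.50 6.44 3.22 2 2.00 0.50 2.00 2.00 ABS")
fields = split_record(line)
for key, value in fields.items():
    print(key, "=", value)
```

Author and title still sit together in one string, and publisher and comments in another, so those two splits remain the manual or lookup-based part.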
I don't know whether the PCRE engine can return multiple captures from a single repeated group, so I've written each group out explicitly:
([A-Z0-9]{7})\ (\d-\d{3}-\d{5}-\d)\
(.+)\ (\d+(?:st|nd|rd|th))\ (\d{2})\
([^\d.]+)\ (\d+\.\d{2})\ (\d+\.\d{2})\
(\d+\.\d{2})\ (\d+\.\d{2})\ (\d)\
(\d+\.\d{2})\ (\d+\.\d{2})\ (\d+\.\d{2})\
(\d+\.\d{2})\ (\d)\ (\d+\.\d{2})\
(\d+\.\d{2})\ (\d+\.\d{2})\ (\d+\.\d{2})\
(\w{3})
It does look quite ugly and doesn't fix your author-title problem, but it matches the rest quite well.
As for that problem, I don't see any solution other than a lookup table for authors, or using another service to look up title and author via the ISBN.
That's assuming that, unlike in your example above, the authors are not just represented by their first name.
Also double-check all the exceptions that might occur with the above regex, as titles may contain 1st or similar.