I am trying to display data extracted from a PDF document. Here is sample data that I got in raw format from the PDF: 55.0 450.0 320.0 GA350C CARDS 4 21 90.0 4 1 DIGCLR. This is one row, where every space separates one column. I can extract each column with the substr() function in PHP, but I am not sure how to display the data when there are three or five rows, because whether there is one row or five, the data comes out on a single line.
I can only count rows by the number of spaces. The only fixed thing is the number of columns, so I need to loop over the values efficiently.
If anyone has a better idea, please let me know.
Here is the string I extracted from the PDF document with the help of PdfParser.
5284 25/10/16 DATE JOB REC'D: DATE DUE: 26/10/16 JOB NUMBER: The Print Group CUSTOMER NAME: 30 days CONTACT: Tanya Bulley PHONE: (07) 3395 7248 FAX: (07) 3395 9462 ORDER NUMBER: 234456/277458 ADDRESS: The Print Group 88 Webster Road Geebung Qld 4034 Australia 5,289 QUOTE NO: PREVIOUS JOB NO: 0 2,000 Business Cards - Shed Company 2 KINDS JOB: DESCRIPTION: PRE-PRESS: Supplied Print Ready Files/ No Proof Required SIZE: BC 90 x 55mm PRINTED: CMYK 2/sides STOCK: 350gsm Gloss Art FINISH:Trim to size QTY: 2000 (1,000 each name) PACK: Carton Pack DELIVERY: 1 Point ACT [1]SPECIAL INSTRUCTIONS: Artwork Received SPECIAL INSTRUCTIONS: Out on Proof Approved Stock TYPE/ART CUTTING Proofing Pre Press Proofing 0.50 TRIMMING CARDS TRIM MAKE READY CARDS TRIM 90 x 55 STOCK 96.00 CARDS Sovereign Gloss 450x320/350 FINISHING PACK/DELIVERY PACK A4 Cartons 305x215/280 Standard Local Delivery (by we INK/CHEMICALS OUTSIDE WORK Delivery: The Print Group 88 Webster Road Geebung Qld 4034 Press Sheet Press Code Stock Code No. of Work & Turn No Up No. of Colours Front Back Description Ink Code Front Back Trim Size Depth Width Ink Notes 55.0 450.0 320.0 GA350C CARDS 4 21 90.0 4 1 DIGCLR
This is basically a job order for printers, and the last line holds the job details. Right now there is only one row of actual job details, but in some orders there can be up to 10 rows, so it is hard to save them in a database under the proper column names. To grab words or details I used:
// Return the text found between $start and $end in $content, or '' if $start is missing.
function GetBetween($content, $start, $end)
{
    $r = explode($start, $content);
    if (isset($r[1])) {
        $r = explode($end, $r[1]);
        return $r[0];
    }
    return '';
}
I called this function like $cust_name = GetBetween($a,'JOB NUMBER:','CUSTOMER NAME:'); I also used the substr() PHP function to get some details, and with these I have everything apart from the main data at the end of the string (mentioned above). I hope this explanation helps you understand the whole situation.
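The fixed-column idea from the question can be sketched as follows. This is only a minimal illustration, assuming that every job-detail row has exactly 11 space-separated values (the count in the sample row) and that no value contains a space. The function name splitRows and the duplicated second row are hypothetical and used purely for demonstration.

```php
<?php
// Assumption: each job-detail row has exactly 11 values and no value contains a space.
function splitRows(string $raw, int $columnsPerRow = 11): array
{
    // Split on any run of whitespace and drop empty tokens.
    $tokens = preg_split('/\s+/', trim($raw), -1, PREG_SPLIT_NO_EMPTY);
    // Group the flat token list into rows of fixed width.
    return array_chunk($tokens, $columnsPerRow);
}

// Hypothetical input: the sample row repeated twice to simulate a two-row order.
$raw = '55.0 450.0 320.0 GA350C CARDS 4 21 90.0 4 1 DIGCLR '
     . '55.0 450.0 320.0 GA350C CARDS 4 21 90.0 4 1 DIGCLR';

foreach (splitRows($raw) as $i => $row) {
    echo 'Row ' . ($i + 1) . ': ' . implode(' | ', $row) . PHP_EOL;
}
```

Each inner array can then be mapped to database column names by position. If a description or ink note can contain spaces, this approach breaks and a structured conversion is needed instead.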
Sorry, I tried to explain this with a lot of code and a long description, but Stack Overflow would not let me post it. I'm frustrated because I spent two hours writing it in my notepad.
So now I will just give you some simple clues for doing this.
Avoid the <table> tag and try to use <div> instead (only ABBYY can convert <table> near perfectly). This requirement is optional.
Convert the PDF to a DOM tree. I recommend converting to HTML, and this must be automated via PHP.
Paid software: ABBYY FineReader or ABBYY Transformer (lite version)
Free software: pdftohtml from Poppler
From around five years of experience doing this, I recommend ABBYY.
All of the Indonesian corporations that provide digital newspaper
clipping use this software (I'm pretty sure about this). If you don't
have the money, you must know how to get that (I can't say it
here).
Grab the HTML DOM with regular expressions (regex) or http://simplehtmldom.sourceforge.net/
Another clue:
If you have problems grabbing the content with regex or an HTML DOM parser:
1. Try to get rid of the parts of the DOM that you don't need. You can use preg_replace.
[trash]
[YOUR_TABLE]
[trash]
Then start grabbing your content from this snippet.
2. If you can edit the PDF creation process, try to add a unique word or string around your content.
[trash]
<div>this is title</div>
[YOUR TABLE]
<div>this is footer</div>
[trash]
That way you can search for your content between the strings "this is title" and "this is footer".
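The marker idea above can be sketched like this. It is a rough illustration, not a definitive implementation: the function name extractBetweenMarkers and the sample $html are hypothetical, and it assumes each marker appears exactly once in the converted HTML.

```php
<?php
// Hypothetical converted HTML with unique marker divs around the table.
$html = '<p>header junk</p><div>this is title</div>'
      . '<table><tr><td>55.0</td></tr></table>'
      . '<div>this is footer</div><p>footer junk</p>';

// Return whatever sits between the two markers, or null if they are not found.
function extractBetweenMarkers(string $html, string $startMarker, string $endMarker): ?string
{
    // preg_quote() escapes the markers so they are matched literally;
    // the "s" modifier lets "." span line breaks in multi-line HTML.
    $pattern = '/' . preg_quote($startMarker, '/') . '(.*?)' . preg_quote($endMarker, '/') . '/s';
    return preg_match($pattern, $html, $m) === 1 ? trim($m[1]) : null;
}

echo extractBetweenMarkers($html, '<div>this is title</div>', '<div>this is footer</div>');
```

Once the surrounding trash is cut away, the remaining fragment is small enough to hand to a DOM parser or a narrower regex.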
I have some problems with my search function. When a user types a sentence into the search field, I want to get the result from the keywords inside that sentence. For example, I have a database table like this:
ID | Keywords | Answer
-----------------------------------------------------------------------------
1 | price, room | The price room is $150 / night
2 | credit card | Yes, you could pay with credit card
3 | location | The Hotel location is in the Los Angeles
4 | how to, way to, book | You could pay with credit card or wire transfer
5 | room, size | The room size is 50sqm
And these are examples of sentences that a user might enter:
What is the room price ?
From that sentence the system should find the keywords inside it; in this case the keywords are room and price.
From those keywords the system should show the answer: The price room is $150 / night
Can I pay with credit card ?
From that sentence the system should find the keywords inside it; in this case the keyword is credit card.
From that keyword the system should show the answer: Yes, you could pay with credit card
What is the room size ?
From that sentence the system should find the keywords inside it; in this case the keywords are room and size.
From those keywords the system should show the answer: The room size is 50sqm
Examples 1 and 3 both have room in the sentence. I also want the system to tell apart the keyword combinations room price and room size.
How can I find the keywords in the sentence that the user entered?
How do I get the answer from the database with those keywords?
I want to know how to do this with PHP and MySQL, or whether there is some other way to build it. If anybody knows how to do this, please help me. Thanks in advance.
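I would suggest not storing the keywords comma-separated in a single row; insert them in separate rows instead. When you search for text in the keywords column, the comparison is made against the whole value, such as credit card or price, room. It will not treat price and room as separate words; it treats the whole value as one string.
For your question, try the following code:
$que = 'What is the room price';
$words = preg_split('/\s+/', trim($que), -1, PREG_SPLIT_NO_EMPTY);
$placeholders = implode(',', array_fill(0, count($words), '?'));
// Assumes one keyword per row; bind $words when executing the prepared statement.
$sql = 'SELECT answer FROM your_table WHERE keywords IN (' . $placeholders . ')';
Or you can try FIND_IN_SET() to search the comma-separated keywords. Note that FIND_IN_SET() does not ignore the spaces after the commas, so values like price, room would need to be stored without them.
It may work.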
My approach would be to use the concept of stop words: remove all stop words from the user query.
Then search only for the keywords that remain in the user query.
Data entry needs to strip out most of the user's input to be robust. What if they try to break your system by inserting code?
Stop words include 'the', 'a', and 'of'.
The idea is to remove as much rubbish as you can and then to be very picky about the other words.
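A minimal sketch of the stop-word idea, under several assumptions: the table from the question is loaded into a hypothetical in-memory array $rows with one keyword per array entry, the stop-word list is a tiny illustrative sample, and the function name findAnswer is made up. The row with the most matching keywords wins, which is what separates room price from room size.

```php
<?php
// Hypothetical in-memory copy of the table from the question.
$rows = [
    ['keywords' => ['price', 'room'], 'answer' => 'The price room is $150 / night'],
    ['keywords' => ['credit card'], 'answer' => 'Yes, you could pay with credit card'],
    ['keywords' => ['location'], 'answer' => 'The Hotel location is in the Los Angeles'],
    ['keywords' => ['how to', 'way to', 'book'], 'answer' => 'You could pay with credit card or wire transfer'],
    ['keywords' => ['room', 'size'], 'answer' => 'The room size is 50sqm'],
];

// Illustrative stop-word list only; a real list would be much longer.
$stopWords = ['the', 'a', 'of', 'is', 'what', 'can', 'i', 'with'];

function findAnswer(string $query, array $rows, array $stopWords): ?string
{
    // Lowercase, replace punctuation with spaces, then drop stop words.
    $clean = strtolower(preg_replace('/[^a-zA-Z0-9 ]+/', ' ', $query));
    $words = array_diff(preg_split('/\s+/', $clean, -1, PREG_SPLIT_NO_EMPTY), $stopWords);
    // Pad with spaces so keywords only match on whole-word boundaries.
    $text = ' ' . implode(' ', $words) . ' ';

    $best = null;
    $bestScore = 0;
    foreach ($rows as $row) {
        // Count how many of this row's keywords occur in the cleaned query.
        $score = 0;
        foreach ($row['keywords'] as $keyword) {
            if (strpos($text, ' ' . $keyword . ' ') !== false) {
                $score++;
            }
        }
        if ($score > $bestScore) {
            $bestScore = $score;
            $best = $row['answer'];
        }
    }
    return $best;
}

echo findAnswer('What is the room price ?', $rows, $stopWords), PHP_EOL;
```

Because the query is reduced to plain lowercase words before matching, nothing the user types is passed on verbatim. With a real database the same scoring could be done in SQL, but the cleaned words should still be bound as parameters rather than concatenated into the query.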
Log the query data for inspection in case of failure.
Log the access data that you think you are processing,
and then set a timeout on the response time.
For example, if you know that the query should only ever take
X ms, then anything that takes longer than that is suspect; it could have gotten past your protective layer. Do make sure you log the IP address and timestamp in the log files, preferably right at the start of the log entry.
Then write scripts for handling a slice.
A slice is a nice way to help the system administrators,
who may have to send you a portion of the log files.
The slice can be complicated: it may run from one day (YYYYMMDDmm.s) to another day, and an overnight compression job may have been running, so your script needs to read both normal and compressed log files. Sometimes the files are split up by system failures, i.e. the system died for some reason.
Your slice info can be packaged up into an email etc. and sent to you for analysis.
Good luck.
I imported products from my business software into my shop. Everything worked OK, except that only part of the description was imported when the description contained characters such as ® and other unusual characters.
For example:
Data exported to Shop: X-Ring 10,82 x 1,78 mm BS013 Viton® 75 +/- 5 Shore A schwarz/black
Data imported into Shop: X-Ring 10,82 x 1,78 mm BS013 Viton
Everything after the word Viton was deleted. I suppose the character ® was the problem, because we are in Europe and the import program did not take this into account.
My question is: how can I search for "Viton" and replace it with "Viton® 75 +/- 5 Shore A schwarz/black"?
I cannot change the program, nor am I a programmer, but I do know how to run a few commands in MySQL. I am looking for a command to replace "&Viton&" with "Viton® 75 +/- 5 Shore A schwarz/black".
Thank you very much for your assistance.
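You may want to use the HTML entity &reg; or, as icecub suggests, &#174;, which will display as the circled R on a web page. Anyway, a search and replace on a single column looks like this:
Step 1. MAKE A BACKUP OF THE TABLE!
Step 2. Execute this query. The WHERE clause limits the change to rows that end with the truncated word Viton, so descriptions that are already complete are left alone.
UPDATE `TABLE`
SET `COLUMN` = REPLACE(`COLUMN`, 'Viton', 'Viton® 75 +/- 5 Shore A schwarz/black')
WHERE `COLUMN` LIKE '%Viton';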
I have data in the object section_data.title and I am trying to use str.sup(), where str = section_data.title; and str holds the following data:
str="Not less than 30 net ft2 (2.8 net m2) per patient in a
hospital or nursing home, or not less than 15 net ft2 (1.4 net
m2) per resident in a limited care facility, shall be provided within the aggregated area of corridors, patient rooms, treatment
rooms, lounge or dining areas, and other similar areas on each side of
the horizontal exit. On stories not housing bed or litterborne
patients, not less than 6 net ft2 (0.56 net m2) per occupant
shall be provided on each side of the horizontal exit for the total
number of occupants in adjoining compartments."
Now I want to turn the 2 in the unit names above (e.g. ft2, m2) into a superscript. How can I do this using str.sup(), or is there any alternative method or trick to do so in JavaScript?
Strings in JavaScript carry no formatting. You can only apply formatting when you output to HTML. So basically you must write it like this:
var str = "Not less than 30 net ft<sup>2</sup> (2.8 net m<sup>2</sup>)";
document.write(str);
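You can do a find and replace on every ft2 and m2 in the string, turning them into ft<sup>2</sup> and m<sup>2</sup>:
str = str.replace(/\b(ft|m)2\b/g, "$1<sup>2</sup>"); // replace() returns a new string, so assign it back. Still not safe if ft2 or m2 can appear with another meaning...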
I need to parse a file with the following format.
0000000 ...ISBN.. ..Author.. ..Title.. ..Edit.. ..Year.. ..Pub.. ..Comments.. NrtlExt Nrtl Next Navg NQoH UrtlExt Urtl Uext Uavg UQoH ABS NEB MBS FOL
ABE0001 0-679-73378-7 ABE WOMAN IN THE DUNES (INT'L ED) 1st 64 RANDOM 0.00 13.90 0.00 10.43 0 21.00 10.50 6.44 3.22 2 2.00 0.50 2.00 2.00 ABS
The ID and ISBN are not a problem; the title is. There is no set length for these fields, and there are no solid delimiters: the space is used as the separator for most of the file.
Another issue is that there is not always an entry in the comments field. When there is, there are spaces within the content.
So I can get the first two, and the last fourteen. I need some help figuring out how to parse the middle six fields.
This file was generated by an older program that I cannot change. I am using PHP to parse this file.
I would also ask myself 'How good does this have to be?' and 'How many records are there?'
If, for example, you are parsing this list to put up a catalog of books to sell on a website, you probably want to be as good as you can, but expect that you will miss some titles and build in a feedback mechanism so your users can help you fix the issue (and make it easy for you to fix it in your new format).
On the other hand, if you absolutely have to get it right because you will lose lots of money for each wrong parse, and there are only a few thousand books, you should plan on getting close and then doing a human review of the entire file.
(In my first job, we spent six weeks on a data conversion project to convert 150 records. Not a good use of time.)
Find the title and publisher of the book by ISBN (in some online database) and parse only the rest :)
BTW, are you sure that what looks like a space actually is a space? There are more "invisible" characters (like the non-breaking space). I know, not a good idea, but apparently the author of that format was pretty creative...
You need to analyze your data by hand and find out what year, edition and publisher look like. For example, if you find that the year is always two digits and the publisher always comes from some limited list, this is something you can start with.
While I don't see any way other than guessing a bit, I'd go about it something like this:
I'd peel off what I know I can parse out reliably, leaving you with ABE WOMAN IN THE DUNES (INT'L ED) 1st 64 RANDOM
From there I'd try to locate the edition, store and remove it, and split the string in two at that position, leaving you with ABE WOMAN IN THE DUNES (INT'L ED) and 64 RANDOM. Another option is to try with the year, but of course titles such as 1984 might present a problem. (Guessing the edition of course assumes it looks like 7th, 51st etc. for all editions.)
Finally, I'd assume I could somewhat reliably guess the year 64 at the start of the second string and further narrow down the Publisher(/Comment) part.
The rest is pure guesswork unless you have a list of authors/publishers somewhere to match against, as I'd assume there are not only comments with spaces but also publishers with spaces in their names. But at least you should be down to two strings, containing Author/Title in one and Publisher(/Comments) in the other.
All in all it should limit the manual part a bit.
Once done, I'd also save it in a better format somewhere so I don't have to go about parsing it again ;)
I don't know whether the PCRE engine can capture a repeated group multiple times, so the numeric columns are written out one by one (the pattern is split across lines for readability, so it needs the x modifier):
([A-Z0-9]{7})\s+(\d-\d{3}-\d{5}-\d)\s+
(.+)\s+(\d+(?:st|nd|rd|th))\s+(\d{2})\s+
([^\d.]+)\s+(\d+\.\d{2})\s+(\d+\.\d{2})\s+
(\d+\.\d{2})\s+(\d+\.\d{2})\s+(\d)\s+
(\d+\.\d{2})\s+(\d+\.\d{2})\s+(\d+\.\d{2})\s+
(\d+\.\d{2})\s+(\d)\s+(\d+\.\d{2})\s+
(\d+\.\d{2})\s+(\d+\.\d{2})\s+(\d+\.\d{2})\s+
(\w{3})
It does look quite ugly and doesn't fix your author-title problem, but it matches the rest of the line quite well.
Concerning that problem, I don't see any solution other than having a lookup table for authors, or using other services to look up title and author via the ISBN.
That is, unless the authors are always represented by just their first name, as in your example above.
Also double-check all the exceptions that might occur with the above regex, as titles may contain 1st or the like.
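As a rough sketch, a pattern along these lines can be applied from PHP with preg_match(). The function name parseBookLine and the array keys are hypothetical, and the sketch assumes every line has the same shape as the sample row: an edition such as 1st, a two-digit year, and fourteen numeric columns before the trailing three-letter code. Author and title stay together in one field, as do publisher and comments.

```php
<?php
$line = "ABE0001 0-679-73378-7 ABE WOMAN IN THE DUNES (INT'L ED) 1st 64 RANDOM "
      . "0.00 13.90 0.00 10.43 0 21.00 10.50 6.44 3.22 2 2.00 0.50 2.00 2.00 ABS";

function parseBookLine(string $line): ?array
{
    $dec = '(\d+\.\d{2})';   // a decimal column such as 13.90
    $int = '(\d+)';          // an integer column such as the quantity on hand
    // Fourteen numeric columns, in the order they appear in the sample row.
    $numeric = implode('\s+', [
        $dec, $dec, $dec, $dec, $int,
        $dec, $dec, $dec, $dec, $int,
        $dec, $dec, $dec, $dec,
    ]);
    $pattern = '/^([A-Z0-9]{7})\s+(\d-\d{3}-\d{5}-\d)\s+(.+)\s+'
             . '(\d+(?:st|nd|rd|th))\s+(\d{2})\s+([^\d.]+?)\s+'
             . $numeric . '\s+(\w{3})\s*$/';

    if (preg_match($pattern, $line, $m) !== 1) {
        return null; // the line does not have the assumed shape
    }
    return [
        'id' => $m[1],
        'isbn' => $m[2],
        'author_title' => $m[3],
        'edition' => $m[4],
        'year' => $m[5],
        'publisher_comments' => $m[6],
        'numbers' => array_slice($m, 7, 14),
        'trailing_code' => $m[21],
    ];
}

print_r(parseBookLine($line));
```

Lines for which the function returns null are the ones worth collecting for manual review.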
I need to extract an address from a string
$string ="some text 9 th pizza tower 78 main Chennai 600001. and other information may be phone number etc";
From $string I want to extract only "9 th pizza tower 78 main Chennai 600001"
The address format is not constant; it may come in two different forms.
One is the $string variable above, and the other looks like this:
$string1= "some text 9 th pizza tower main Chennai 600001. and other information may be phone number etc";
From here I need to extract "9 th pizza tower main Chennai 600001"
I don't think this is possible. Extracting an address from plain text is like asking for a tree when you're in the woods: "Which one?"
If the text is always in the same format, like:
Company Name 73
1st Cross Street, Hotel Chennai
-600000
someadditionalstuff
Then you've got a chance, or if it is always separated with a special character (, . ; etc.). If it is always the same format (the one you showed above), then something like this might work:
([a-zA-Z0-9 ]*),([a-zA-Z0-9 ]*) XXX ([a-zA-Z0-9 ]*) (-[0-9]{6})
Group 1: Company Name
Group 2: Address
Group 3: City
Group 4: Zip-Code
Bobby
Sorry, this is not possible. It may work for one website but not for others, as there is no standard format for displaying a company address (or any address) on a web page.
This is not an easy question, and there isn't magic AI code that can figure it out.
You must make some assumptions and look at a lot of data to find out whether they are good ones.
For a start, if you assume every address ends with a ZIP code, you can search the string for 5 (or 6) digits and cut it after that.
Finding the start of the address is beyond my skills. Maybe look for the first number.
You need to check a lot of examples to figure out the best pattern that matches most of them.
Yes, it's possible by using Google's natural language processing service, which is paid, or you can use OpenNLP, which is open source. But for OpenNLP the available documentation is not very good.
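The two assumptions above (the address starts at the first number and ends with a 6-digit code) can be sketched as a single regular expression. This is a heuristic only, the function name extractAddress is hypothetical, and it will give wrong results whenever the text before the address contains a digit.

```php
<?php
// Heuristic: start at the first digit, stop at the first standalone 6-digit number.
function extractAddress(string $text): ?string
{
    // \d       the first digit found in the text (assumed start of the address)
    // .*?      as little as possible up to ...
    // \b\d{6}\b  the first standalone 6-digit number (assumed postal code)
    return preg_match('/\d.*?\b\d{6}\b/', $text, $m) === 1 ? $m[0] : null;
}

$string  = "some text 9 th pizza tower 78 main Chennai 600001. and other information may be phone number etc";
$string1 = "some text 9 th pizza tower main Chennai 600001. and other information may be phone number etc";

echo extractAddress($string), PHP_EOL;
echo extractAddress($string1), PHP_EOL;
```

Running this against many real samples is the only way to find out how often the assumptions hold.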
Better refer from this URL :
https://opennlp.apache.org/