I can separate data from the plain text below with Regex.
Plain text:
190.A 42-year-old male patient has been delivered to a hospital in a grave condition with dyspnea, cough with expectoration of purulent
sputum, fever up to 39,5 oC.The ?rst symptoms appeared 3 weeks ago.
Two weeks ago, a local therapist diagnosed him wi- th acute
right-sided pneumonia. Over the last 3 days, the patient’s condition
deteriorated: there was a progress of dyspnea, weakness, lack of
appetite. Chest radiography con?rms a rounded shadow in the lower lobe
of the right lung with a horizontal?uid level, the right si- nus is
not clearly visualized. What is the most likely diagnosis? A.Abscess
of the right lung B.Acute pleuropneumonia C.Right pulmonary empyema
D.Atelectasis of the right lung E.Pleural effusion 191.An 11-year-old
boy complains of general weakness, fever up to 38,2 oC, pain and
swelli- ng of the knee joints, feeling of irregular heartbeat. 3 weeks
ago, the child had quinsy. Knee joints are swollen, the overlying skin
and skin of the knee region is reddened, local temperature is
increased, movements are li- mited. Heart sounds are muf?ed,
extrasystole is present, auscultation reveals apical systolic murmur
that is not conducted to the left ingui- nal region. ESR is 38 mm/h.
CRP is 2+, anti- streptolysin O titre - 40 0. What is the most likely
diagnosis? A.Acute rheumatic fever B.Vegetative dysfunction
C.Non-rheumatic carditis D.Juvenile rheumatoid arthritis E.Reactive
arthritis 192.A 28-year-old male patient complains of sour
regurgitation, cough and heartburn that occurs every day after having
meals, when bending forward or lying down. These problems have been
observed for 4 years. Objective status and laboratory values are
normal. FEGDS revealed endoesophagitis. What is the leading factor in
the development of this disease? A.Failure of the lower esophageal
sphincter B.Hypersecretion of hydrochloric acid C.Duodeno-gastric
re?ux D.Hypergastrinemia E.Helicobacter pylori infection 193.On
admission a 35-year-old female reports acute abdominal pain, fever up
to 38,8 oC, mucopurulent discharges. The pati- ent is nulliparous, has
a history of 2 arti?cial abortions. The patient is unmarried, has
sexual Krok 2 Medicine 20 14 24 contacts. Gynecological examination
reveals no uterus changes. Appendages are enlarged, bilaterally
painful. There is profuse purulent vaginal discharge. What study is
required to con?rm the diagnosis? A.Bacteriologic and bacteriascopic
studies B.Hysteroscopy C.Curettage of uterine cavity D.Vaginoscopy
E.Laparoscopy
What did I do for this?
For the question section:
/(\d+)\.\s*([A-Z].*?)\s+([A-Z]\..*?)(?=\d+\.\s*[A-Z]|$)/s
For the options of question section:
/\s+(?=[A-Z0-9][,.:])/
PHP:
$soruAlimPattern = [
'q&a' => '/(\d+)\.\s*([A-Z].*?)\s+([A-Z]\..*?)(?=\d+\.\s*[A-Z]|$)/s',
'answers' => '/\s+(?=[A-Z0-9][,.:])/'
];
$res = [];
if (preg_match_all($soruAlimPattern['q&a'], $temizSoruCikisi, $out, PREG_SET_ORDER) > 0) {
foreach ($out AS $k => $v) {
// remove the full match ($0)
$res[$k] = array_slice($v, 1, 3);
// split the answers
$res[$k][2] = preg_split($soruAlimPattern['answers'], $res[$k][2]);
}
}
$sorularJsonKodlaniyor = json_encode($res);
[...]
I can separate each question from its options, but is it possible to do this with a single regex instead of two different ones?
I don't know how good the PHP code is, but it works.
My problem:
1. Sometimes letters in the question cannot be identified, and these unrecognized characters are shown as a question mark. For example: `fever up to 39,5 oC.The ?rst symptoms` or `..39,5 oC.The ?rst symptoms..`
2. Because of numerical values in the question, the regex splits the question in two. For example: `... anti- streptolysin O titre - 40 0. What is the most likely diagnosis? ` Here the split happens because the number "0" followed by a period looks like the start of a new question.
Expected JSON Format:
[
{
"question": "190.A 42-year-old male patient has been delivered to a hospital in a grave condition with dyspnea, cough with expectoration of purulent sputum, fever up to 39,5 oC.The ?rst symptoms appeared 3 weeks ago. Two weeks ago, a local therapist diagnosed him wi- th acute right-sided pneumonia. Over the last 3 days, the patient’s condition deteriorated: there was a progress of dyspnea, weakness, lack of appetite. Chest radiography con?rms a rounded shadow in the lower lobe of the right lung with a horizontal?uid level, the right si- nus is not clearly visualized. What is the most likely diagnosis? ",
"answers": [
"A.Abscess of the right lung ",
"B.Acute pleuropneumonia ",
"C.Right pulmonary empyema ",
"D.Atelectasis of the right lung ",
"E.Pleural effusion 1"
]
},
{
"question": "191.An 11-year-old boy complains of general weakness, fever up to 38,2 oC, pain and swelli- ng of the knee joints, feeling of irregular heartbeat. 3 weeks ago, the child had quinsy. Knee joints are swollen, the overlying skin and skin of the knee region is reddened, local temperature is increased, movements are li- mited. Heart sounds are muf?ed, extrasystole is present, auscultation reveals apical systolic murmur that is not conducted to the left ingui- nal region. ESR is 38 mm/h. CRP is 2+, anti- streptolysin O titre - 40 0. What is the most likely diagnosis? ",
"answers": [
"A.Acute rheumatic fever ",
"B.Vegetative dysfunction ",
"C.Non-rheumatic carditis ",
"D.Juvenile rheumatoid arthritis ",
"E.Reactive arthritis 1"
]
},
{
"question": "192.A 28-year-old male patient complains of sour regurgitation, cough and heartburn that occurs every day after having meals, when bending forward or lying down. These problems have been observed for 4 years. Objective status and laboratory values are normal. FEGDS revealed endoesophagitis. What is the leading factor in the development of this disease? ",
"answers": [
"A.Failure of the lower esophageal sphincter ",
"B.Hypersecretion of hydrochloric acid ",
"C.Duodeno-gastric re?ux ",
"D.Hypergastrinemia ",
"E.Helicobacter pylori infection 1"
]
},
{
"question": "193.On admission a 35-year-old female reports acute abdominal pain, fever up to 38,8 oC, mucopurulent discharges. The pati- ent is nulliparous, has a history of 2 arti?cial abortions. The patient is unmarried, has sexual Krok 2 Medicine 20 14 24 contacts. Gynecological examination reveals no uterus changes. Appendages are enlarged, bilaterally painful. There is profuse purulent vaginal discharge. What study is required to con?rm the diagnosis? ",
"answers": [
"A.Bacteriologic and bacteriascopic studies ",
"B.Hysteroscopy ",
"C.Curettage of uterine cavity ",
"D.Vaginoscopy ",
"E.Laparoscopy 1"
]
}
]
How can I overcome these problems?
You could use preg_split to cut the text into strings that each start with a question number or an option letter, such as 190.A or A.
\b(?=(?:\d+|[A-Z])\.[A-Z])
\b Word boundary
(?= Positive lookahead, assert what is on the right is
(?:\d+|[A-Z]) Match either 1+ digits or a single char A-Z
\.[A-Z] Match . and a single char A-Z
) Close positive lookahead
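A quick check of the lookahead on a shortened sample, as a sketch only. The sample string is cut down from the text in the question, and `$sample` and `$parts` are names made up for this check. The point is that "40 0. What" does not trigger a split, because the period is followed by a space and not by a capital letter.

```php
<?php
// Shortened sample that contains the troublesome "40 0. What" fragment.
$sample = 'CRP is 2+, anti- streptolysin O titre - 40 0. What is the most likely '
        . 'diagnosis? A.Acute rheumatic fever B.Vegetative dysfunction 192.A 28-year-old male';

// Split before every "digits.Capital" or "Letter.Capital" token.
$parts = preg_split('/\b(?=(?:\d+|[A-Z])\.[A-Z])/', $sample, -1, PREG_SPLIT_NO_EMPTY);

print_r($parts);
```

The first element keeps the whole question stem, including "40 0. What is the most likely diagnosis? ", and the remaining elements start with "A.", "B." and "192.".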
Once you have all those entries in an array, you could for example use array_reduce to get the array structure that you need for the JSON output.
$pattern = "/\b(?=(?:\d+|[A-Z])\.[A-Z])/";
$result = preg_split($pattern, $data, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
$result = array_reduce($result, function($carry, $item){
// If the string starts with a digit
if (ctype_digit(substr($item, 0, 1))) {
// Create the questions key
$carry[] = ["question" => $item];
return $carry;
}
// Get reference to the last added array in $carry
end($carry);
$last = &$carry[key($carry)];
// Create the answers key
array_key_exists("answers", $last) ? $last["answers"][] = $item : $last["answers"] = [$item];
return $carry;
}, []);
print_r(json_encode($result));
Output
[
{
"question": "190.A 42-year-old male patient has been delivered to a hospital in a grave condition with dyspnea, cough with expectoration of purulent sputum, fever up to 39,5 oC.The ?rst symptoms appeared 3 weeks ago. Two weeks ago, a local therapist diagnosed him wi- th acute right-sided pneumonia. Over the last 3 days, the patient\u2019s condition deteriorated: there was a progress of dyspnea, weakness, lack of appetite. Chest radiography con?rms a rounded shadow in the lower lobe of the right lung with a horizontal?uid level, the right si- nus is not clearly visualized. What is the most likely diagnosis? ",
"answers": [
"A.Abscess of the right lung ",
"B.Acute pleuropneumonia ",
"C.Right pulmonary empyema ",
"D.Atelectasis of the right lung ",
"E.Pleural effusion "
]
},
{
"question": "191.An 11-year-old boy complains of general weakness, fever up to 38,2 oC, pain and swelli- ng of the knee joints, feeling of irregular heartbeat. 3 weeks ago, the child had quinsy. Knee joints are swollen, the overlying skin and skin of the knee region is reddened, local temperature is increased, movements are li- mited. Heart sounds are muf?ed, extrasystole is present, auscultation reveals apical systolic murmur that is not conducted to the left ingui- nal region. ESR is 38 mm\/h. CRP is 2+, anti- streptolysin O titre - 40 0. What is the most likely diagnosis? ",
"answers": [
"A.Acute rheumatic fever ",
"B.Vegetative dysfunction ",
"C.Non-rheumatic carditis ",
"D.Juvenile rheumatoid arthritis ",
"E.Reactive arthritis "
]
},
{
"question": "192.A 28-year-old male patient complains of sour regurgitation, cough and heartburn that occurs every day after having meals, when bending forward or lying down. These problems have been observed for 4 years. Objective status and laboratory values are normal. FEGDS revealed endoesophagitis. What is the leading factor in the development of this disease? ",
"answers": [
"A.Failure of the lower esophageal sphincter ",
"B.Hypersecretion of hydrochloric acid ",
"C.Duodeno-gastric re?ux ",
"D.Hypergastrinemia ",
"E.Helicobacter pylori infection "
]
},
{
"question": "193.On admission a 35-year-old female reports acute abdominal pain, fever up to 38,8 oC, mucopurulent discharges. The pati- ent is nulliparous, has a history of 2 arti?cial abortions. The patient is unmarried, has sexual Krok 2 Medicine 20 14 24 contacts. Gynecological examination reveals no uterus changes. Appendages are enlarged, bilaterally painful. There is profuse purulent vaginal discharge. What study is required to con?rm the diagnosis? ",
"answers": [
"A.Bacteriologic and bacteriascopic studies ",
"B.Hysteroscopy ",
"C.Curettage of uterine cavity ",
"D.Vaginoscopy ",
"E.Laparoscopy"
]
}
]
I have to find a php code to solve a math problem.
This is the problem description:
Players A and B are playing a new game of stones. There are N stones
placed on the ground, forming a sequence. The stones are labeled from
1 to N. Players A and B take turns, each removing exactly two consecutive stones
from the ground, until there are no consecutive stones left on the ground.
That is, each player can take stone i and stone i+1, where 1≤i≤N−1. If
the number of stones left is odd, A wins. Otherwise, B wins. Assume
both A and B play optimally and A plays first, do you know who the
winner is?
The line has N stones, indexed from 1 to N (1 ≤ N ≤ 10 000 000).
If the number of stone left is odd, A wins. Otherwise, B wins.
This is my code. It does work, but it is not correct.
<?php
$nStones = rand(1, 10000000);
$string = ("i");
$start = rand(1, 10000000);
$length = 2;
while($nStones > 0) {
substr( $nStones , $start [, $length ]): string;
}
if ($nStones % 2 == 1) {
echo "A";
} else {
echo "B";
}
?>
I think I am missing the alternating removal of two consecutive stones by A and B while $nStones > 0. Furthermore, the problem description mentions optimal play until there is only one stone left. Therefore I guess the stones move together towards their closest neighbours (the gaps disappear and are filled by the closest stones).
I've made a start here:
<?php
class GameOfStones
{
const STONE_PAIR = 'OO';
const GAP_PAIR = '__';
public $line;
public function __construct($length)
{
$this->line = str_pad('', $length, self::STONE_PAIR);
}
// Removes a pair of stones from the line at nth location.
public function remove($n)
{
if(substr($this->line, $n-1, 2) == self::STONE_PAIR)
$this->line =
substr_replace($this->line, self::GAP_PAIR , $n-1, 2);
else
throw new Exception('Invalid move.');
}
// Check if there are no further possible moves.
public function is_finished()
{
return strpos($this->line, self::STONE_PAIR) === false;
}
// Representation of line.
public function __toString()
{
return implode('.', str_split($this->line)) ."\n";
}
};
$game = new GameOfStones(6);
echo $game;
var_dump($game->is_finished());
$game->remove(5);
echo $game;
var_dump($game->is_finished());
$game->remove(2);
echo $game;
var_dump($game->is_finished());
Output:
O.O.O.O.O.O
bool(false)
O.O.O.O._._
bool(false)
O._._.O._._
bool(true)
Currently this class starts by making a line which is a string of 'O' characters.
So if the length was 5, the line would be a string like this:
OOOOO
The remove method takes an index. If that index was 1, first the line is checked at the string's 0 index (your n-1) for two consecutive O's. In other words 'are there stones to remove at a given position?'. If there are stones, we do a string replacement at that position, and swap the two Os for two _s.
The is_finished method checks the line for the first occurrence of two Os. In other words, if there are two consecutive stones, there is still a move to play on the line.
The magic method __toString, is the string representation of a GameOfStones object. That's used as a way to visualise the state of the game.
O.O.O.O._._
The above shows four stones and two gaps (I'm not sure if the dot separators are necessary - the underscores can bleed into each other that's why I've used them).
I have added example use of the code, where (two) pairs of stones are removed from a line of six stones. After each removal we check if there is another possible move, or rather if the game has ended.
There is no player attribution currently, that's left to you.
Your last rule:
'If the number of stone left is odd, A wins. Otherwise, B wins.'
I am struggling with. See these examples:
i) Line of length 3:
OOO
O__ A (2)
End: one (odd) stone left.
ii) Line of length 4:
OOOO
OO__ A (3)
____ B (1)
End: zero (even) stones left.
iii) Line of length 7:
OOOOOOO
O__OOOO A (2)
O__O__O B (5)
End: three (odd) stones left.
I'd say that the person who removes the pair that leaves the next player without a move is the winner. In game ii) above, if A had played at position 2 (O__O), they would have prevented B from playing.
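Taking the quoted rule literally, one observation settles it: every move removes exactly two stones, so the number of stones left always has the same parity as N, whichever moves are played. Under that reading A wins exactly when N is odd. The sketch below states this and cross-checks it by brute force on short lines. `winnerByParity` and `leftoverCounts` are names made up for this example, and it covers only the rule as quoted, not the "last player to move wins" reading suggested above.

```php
<?php
// Winner under the rule exactly as quoted: A wins when an odd number of stones is left.
// Each move removes two stones, so the parity of the leftover equals the parity of N.
function winnerByParity(int $n): string
{
    return $n % 2 === 1 ? 'A' : 'B';
}

// Brute force for small lines: every leftover stone count reachable by any sequence of moves.
// The line is a string of 'O' (stone) and '_' (gap); the result is a set keyed by stone count.
function leftoverCounts(string $line, array &$memo = []): array
{
    if (isset($memo[$line])) {
        return $memo[$line];
    }
    $counts = [];
    if (strpos($line, 'OO') === false) {
        // No move left: record how many stones remain.
        $counts[substr_count($line, 'O')] = true;
    }
    for ($i = 0; $i < strlen($line) - 1; $i++) {
        if ($line[$i] === 'O' && $line[$i + 1] === 'O') {
            $next = substr_replace($line, '__', $i, 2);
            $counts += leftoverCounts($next, $memo);
        }
    }
    return $memo[$line] = $counts;
}

for ($n = 1; $n <= 10; $n++) {
    $leftovers = array_keys(leftoverCounts(str_repeat('O', $n)));
    sort($leftovers);
    printf("N=%2d leftovers: %-8s winner: %s\n", $n, implode(',', $leftovers), winnerByParity($n));
}
```

The brute force is exponential and only meant as a sanity check; for N up to 10 000 000 the parity test alone is enough.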
I have the following PDF file (Marsheet PDF) and I'm trying to extract the data shown in the example below. I have tried PDFParse, PDFtoText, etc., but they do not work properly. Is there any solution or example?
<?php
// Output something like this, or suggest a better option if you have one
$data_array = array(
array( "name" => "Mr Andrew Smee",
"medicine_name" => "FLUOXETINE 20MG CAPS",
"description" => "TAKE ONE ONCE DAILY FOR LOW MOOD. CAUTION:YOUR DRIVING REACTIONS MAY BE IMPAIRED",
"Dose" => '0900',
"StartDate" => '28/09/15',
"period" => '28',
"Quantity" => '28'
),
array( "name" => "Mr Andrew Smee",
"medicine_name" => "SINEMET PLUS 125MG TAB",
"description" => "TAKE ONE TABLET FIVE TIMES A DAY FOR PD
(8am,11am,2pm,5pm,8pm)
THIS MEDICINE MAY COLOUR THE URINE. THIS IS
HARMLESS. CAUTION:REACTIONS MAY BE IMPAIRED
WHILST DRIVING OR USING TOOLS OR MACHINES.",
"Dose" => '0800,1100,1400,1700,2000',
"StartDate" => '28/09/15',
"period" => '28',
"Quantity" => '140'
), etc...
);
?>
TL;DR You are almost certainly not going to do this with a library alone.
Update: a working solution (not a perfect solution!) is coded below, see 'in practice'. It requires:
defining the areas where the text is;
the possibility of installing and running a command line tool, pdf2json.
Why it is not easy
PDF files contain typesetting primitives, not extractable text. Sometimes the difference is slight enough that you can get by, but a document that contains only easily extractable text usually looks "slightly wrong" aesthetically, and therefore the generators that create the "best" PDFs for text extraction are also the least used.
Some generators embed both the typesetting layer and an invisible text layer, allowing you to see the beautiful text and still have the good text. At the expense, you guessed it, of the PDF size.
In your example, you only have the beautiful text inside the file, and the existence of a grid means that the text needs to be properly typeset.
So, inside, what there actually is to be read is this. Notice the letters inside round parentheses:
/R8 12 Tf
0.99941 0 0 1 66 765.2 Tm
[(M)2.51003(r)2.805( )-2.16558(A)-3.39556(n)
-4.33056(d)-4.33056(r)2.805(e)-4.33056(w)11.5803
( )-2.16558(S)-3.39556(m)-7.49588(e)-4.33117(e)556]TJ
ET
and if you assemble the (s)(i)(n)(g)(l)(e) letters inside, you do get "Mr Andrew Smee", but then you need to know where these letters are in relation to the page and the data grid. You also need to beware of spaces. Above, there is one explicit space character, parenthesized, between "Mr" and "Andrew"; but if you removed such spaces and fixed the offsets of all the following letters, you would still read "Mr Andrew Smee" and save two characters. Some PDF "optimizers" will try to do just that, and if you do not consider offsets, the "text" string of that entity will just be "MrAndrewSmee".
And that is why most text extraction libraries, which can't easily manage character offsets (they use "text lines", and by and large they don't care about grids) will give you something like
Mr Andrew Smee 505738 12/04/54 (61
or, in the case of "optimized" texts,
MrAndrewSmee50573812/04/54(61
(which still gives the dangerous illusion of being parsable with a regex -- sometimes it is, sometimes it isn't, most of the times it works 95% of the time, so that the remaining 5% turns into a maintenance nightmare from Hell), but, more importantly, they will not be able to get you the content of the medication details timetable divided by cell.
Any information which is space-correlated (e.g. a name has different meanings if it's written in the left "From" or in the right "To" box) will be either lost, or variably difficult to reconstruct.
There are PDF "protection" schemes that exploit the capability of offsetting the text, and will scramble the strings. With offsets, you can write:
9 l 10 d 4 l 5 1 H 2 e 3 l o 6 W 7 o 8 r
and the PDF viewer will show you "Hello World"; but read the text directly, and you get "ldlHeloWor", or worse. You could add malicious text and place it outside the page, or write it in transparent color, to prank whoever succeeds in removing the easily removed optional copy-paste protection of PDF files. Most libraries would blithely suck up the prank text together with the good text.
Trying with most libraries, and why it might work (but probably not)
Libraries such as XPDF (and its wrappers phpxpdf, pdf2html, etc.) will give you a simple call such as this
// open PDF
$pdfToText->open('PDF-book.pdf');
// PDF text is now in the $text variable
$text = $pdfToText->getText();
$pdfToText->close();
and your "text" will contain everything, and be something like:
...
START DATE START DAY
WEEK 1 WEEK 2 WEEK 3 WEEK 4
DATE 28 29 30 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
19/10/15
Medication Details
Commencing
D.O.B
Doctor
Hour:Dose 1 2 3 4 5 6 7 1 2 3 4 5 6 7 1 2 3 4 5 6 7 1 2 3 4 5 6 7
Patient
Number
Period
MEDICATION ADMINISTRATION RECORD SHEETS Pharmacy No.
Document No.
02392 731680
28
0900 1
TAKE ONE ONCE DAILY FOR LOW MOOD.
CAUTION:YOUR DRIVING REACTIONS MAY BE IMPAIRED.
28
FLUOXETINE 20MG CAPS
Received Quantity returned quant. by destroyed quant. by
So, reading above, ask yourself - what is that second 28? Can you tell whether it is the received quantity, the returned quantity, the destroyed quantity without looking at the PDF? Sure, if there's only one number, chances are that it will be the received quantity. It becomes a bet.
And is 02392 731680 the document number? It looks like it is (it is not).
Notice also that in the PDF, the medicine name is before the notes. In the extracted text, it is after. By looking at the offsets inside the PDF, you understand why, and it's even a good decision -- but looking at the extracted text, it's not so easy.
So, automatic analysis looks enticingly like it can be done, but as I said, it is a very risky business. It is brittle: someone entering the wrong (for you) text somewhere in the document, sometimes even filling the fields not in sequential order, will result in a PDF which is visually correct and, at the same time, unexplainably unparseable. What are you going to tell your users?
Sometimes, a subset of the available information is stable enough for you to get the work done. In that case, XPDF or PDF2HTML, a bunch of regex, and you're home free in half a day. Yay you! Just keep in mind that any "little" addition to the project might then be impossible. Two numbers are added that are well separated in the PDF; are they 128 and 361, or 12 and 8361, or 1283 and 61? All you get in $text is 128361.
So if you go that way, document it clearly and avoid expectations which might be difficult to maintain. Your initial project might work so well, so fast, in so little time, that an addition is accepted unbeknownst to you - and you're then required to do the impossible. Explaining why the first 95% was easy and the subsequent 5% very hard might be more than your job is worth.
One difficult way to do it, which worked for me
But can you do the same thing "by hand"? After all, by looking at the PDF, you know what you are seeing. Can the same thing be done by a machine? (this still applies). Sure, in this - after all - clearly delimited problem of computer vision, you very probably can. It just won't be quick and easy. You need:
a very low level library (or reading the PDF yourself; you just need to uncompress it first, and there are tools for that, e.g. pdftk). You need to recover the text with coordinates. "C" for "hospitalized" is worth nothing. "C, 495.2, 882.7" plus the coordinates of your grid tells you of a hospitalization on October 13th, 2015 -- and that is the information you are after!
patience (or a tool) to input the coordinates of the text zones. You need to tell the system which area is October 13th, 2015... as well as all the other days. For example:
// Cell name X1 Y1 X2 Y2 Text
[ 'PatientName', 60, 760, 300, 790, '' ],
[ 'PatientNumber', 310, 760, 470, 790, '' ],
...
[ 'Grid01Y01X01', 90, 1020, 110, 1040, '' ],
...
Note that very many of those values you can calculate programmatically: once you have the top left corner and know one cell's size, the others are more or less calculable with a very slight error. You needn't input yourself six grids of four weeks with six rows each, seven days per week.
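As a rough sketch of that calculation, the rows for one grid can be generated from its top left corner and the cell size. `gridCells` is a hypothetical helper, the 20 x 20 cell size is simply read off the sample row above (90..110, 1020..1040), and reading the cell name as Y = row, X = column is my assumption.

```php
<?php
// Builds template rows [name, x1, y1, x2, y2, text] for a regular grid of equal cells.
function gridCells(string $prefix, int $left, int $top, int $cellW, int $cellH, int $rows, int $cols): array
{
    $cells = [];
    for ($r = 0; $r < $rows; $r++) {
        for ($c = 0; $c < $cols; $c++) {
            $x1 = $left + $c * $cellW;
            $y1 = $top + $r * $cellH;
            // Name pattern follows the 'Grid01Y01X01' example: Y = row, X = column (assumed).
            $name = sprintf('%sY%02dX%02d', $prefix, $r + 1, $c + 1);
            $cells[] = [$name, $x1, $y1, $x1 + $cellW, $y1 + $cellH, ''];
        }
    }
    return $cells;
}

// One grid: six rows, four weeks of seven days each.
$grid = gridCells('Grid01', 90, 1020, 20, 20, 6, 28);
print_r($grid[0]);
```

Real cells are rarely perfectly regular, so the generated coordinates are a starting point to be checked visually, not a replacement for that check.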
You can use the same structure to create a PNG with red areas to indicate which cells you've got covered. That will be useful to visually check you did not forget anything.
At that point you parse the PDF, and every time you find a text at coordinates (x1,y1) you scan all of your cells and determine where the text should be (there are faster ways to do that using XY binary search trees). If you find 'Mr Andrew S' at 66, 765.2 you add it to PatientName. Then you find 'mee' at 109.2, 765.2 and you also add it to PatientName. Which now reads 'Mr Andrew Smee'.
If the horizontal distance is above a certain threshold, you add a space (or more than one).
(For very small text there's a slight risk of the letters being output out of order by the PDF driver and corrected through kerning, but usually that's not a problem).
At the end of the whole cycle you will be left with
[ 'PatientName', 60, 760, 300, 790, 'Mr Andrew Smee' ],
[ 'PatientNumber', 310, 760, 470, 790, '505738' ],
and so on.
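The lookup-and-append step can be sketched like this. It is a minimal illustration, not the code I used back then: the fragment widths, the coordinates of the patient number and the gap threshold are invented, and the fragments are assumed to arrive in reading order.

```php
<?php
// Cells use the same layout as the rows above: [name, x1, y1, x2, y2, text].
$cells = [
    ['PatientName',    60, 760, 300, 790, ''],
    ['PatientNumber', 310, 760, 470, 790, ''],
];

// Hypothetical fragments as a low-level parser might report them.
// The name coordinates come from the example above; widths and the number's position are invented.
$fragments = [
    ['text' => 'Mr Andrew S', 'x' => 66.0,  'y' => 765.2, 'width' => 43.2],
    ['text' => 'mee',         'x' => 109.2, 'y' => 765.2, 'width' => 20.0],
    ['text' => '505738',      'x' => 320.0, 'y' => 765.2, 'width' => 40.0],
];

$gapThreshold = 3.0; // assumed: a horizontal gap wider than this becomes a space
$lastRight = [];     // right edge of the last fragment added to each cell

foreach ($fragments as $f) {
    foreach ($cells as $i => $cell) {
        [$name, $x1, $y1, $x2, $y2] = $cell;
        if ($f['x'] < $x1 || $f['x'] > $x2 || $f['y'] < $y1 || $f['y'] > $y2) {
            continue; // fragment origin lies outside this cell
        }
        if (isset($lastRight[$i]) && $f['x'] - $lastRight[$i] > $gapThreshold) {
            $cells[$i][5] .= ' ';
        }
        $cells[$i][5] .= $f['text'];
        $lastRight[$i] = $f['x'] + $f['width'];
        break; // each fragment goes into at most one cell
    }
}

print_r($cells);
```

This linear scan over the cells is the slow but simple variant; a spatial index only pays off with many cells and many fragments.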
I did this kind of work for a large PDF import project some years back and it worked like a charm. Nowadays, I think most of the heavy lifting could be done with TcLibPDF.
The painful part is recording by hand, the first time, the information for the grid; possibly there might be tools for that, or one could whip up a HTML5/AJAX editor using canvases.
In practice
Most of the work has already been done by the excellent pdf2json tool, which, given the 'Andrew Smee' PDF, outputs something like:
[
{
"height" : 1263,
"width" : 892,
"number" : 1,
"pages" : 1,
"fonts" : [
{
"color" : "#000000",
"family" : "Times",
"fontspec" : "0",
"size" : "15"
},
...
],
"text" : [
{ "data" : "12/04/54",
"font" : 0,
"height" : 17,
"left" : 628,
"top" : 103,
"width" : 70
},
{ "data" : "28/09/15",
"font" : 0,
"height" : 17,
"left" : 105,
"top" : 206,
"width" : 70
},
{ "data" : "AQUARIUS",
"font" : 0,
"height" : 17,
"left" : 99,
"top" : 170,
"width" : 94
},
{ "data" : " ",
"font" : 0,
"height" : 17,
"left" : 193,
"top" : 170,
"width" : 5
},
{ "data" : "NURSING",
"font" : 0,
"height" : 17,
"left" : 198,
"top" : 170,
"width" : 83
},
...
In order to make things simple, I convert the Andrew Smee PDF to a PNG and resample it to 892 x 1263 pixels (any size will do, as long as you keep track of it; below, the size is saved in 'width' and 'height'). This way I can read pixel coordinates straight off my old PaintShop Pro's status bar :-).
The "Address" field is from 73,161 to 837,193.
My sample "template", with only three fields, is therefore as follows, in PHP 5.4+ (with short array syntax, [ ] instead of array() )
<?php
function template() {
$template = [
'Address' => [ 'x1' => 73, 'y1' => 161, 'x2' => 837, 'y2' => 193 ],
'Medicine1' => [ 'x1' => 1, 'y1' => 283, 'x2' => 251, 'y2' => 299 ],
'Details1' => [ 'x1' => 1, 'y1' => 302, 'x2' => 251, 'y2' => 403 ],
];
foreach ($template as $fieldName => $candidate) {
$template[$fieldName]['elements'] = [ ];
}
return $template;
}
// shell_exec('/usr/local/bin/pdf2json "Andrew-Smee.pdf" andrew-smee.json');
$parsed = json_decode(file_get_contents('ann-underdown.json'), true);
$paged = [ ];
foreach ($parsed as $page) {
$template = template();
foreach ($page['text'] as $text) {
// Will it blend?
foreach ($template as $fieldName => $candidate) {
if ($text['top'] > $candidate['y2']) {
continue; // Too low.
}
if (($text['top']+$text['height']) < $candidate['y1']) {
continue; // Too high.
}
if ($text['left'] > $candidate['x2']) {
continue;
}
if (($text['left']+$text['width']) < $candidate['x1']) {
continue;
}
$template[$fieldName]['elements'][] = $text;
}
}
// Now I must reassemble all my fields
foreach ($template as $fieldName => $data) {
$list = $data['elements'];
usort($list, function($txt1, $txt2) {
for ($r = 8; $r >= 1; $r /= 2) {
if (($txt1['top']/$r) < ($txt2['top']/$r)) {
return -1;
}
if (($txt1['top']/$r) > ($txt2['top']/$r)) {
return 1;
}
if (($txt1['left']/$r) < ($txt2['left']/$r)) {
return -1;
}
if (($txt1['left']/$r) > ($txt2['left']/$r)) {
return 1;
}
}
return 0;
});
$text = '';
$starty = false;
foreach ($list as $data) {
if ($data['top'] > $starty + 5) {
if ($starty > 0) {
$text .= "\n";
}
} else {
// Add space
// $text .= ' ';
}
$starty = $data['top'];
// Add text to current line
$text .= $data['data'];
}
// Remove extra spaces
$text = preg_replace('# +#', ' ', $text);
$template[$fieldName] = $text;
}
$paged[] = $template;
}
print_r($paged);
And the result (on a multipage PDF)
Array
(
[0] => Array
(
[Address] => AQUARIUS NURSING HOME 4-6 SPENCER ROAD, SOUTHSEA PO4 9RN
[Medicine1] => ATORVASTATIN 40MG TABS
[Details1] => take ONE tablet at NIGHT
)
[1] => Array
(
[Address] => AQUARIUS NURSING HOME 4-6 SPENCER ROAD, SOUTHSEA PO4 9RN
[Medicine1] => SOTALOL 80MG TABS
[Details1] => take ONE tablet TWICE each day
DO NOT STOP TAKING UNLESS YOUR DOCTOR TELLS
YOU TO STOP.
)
[2] => Array
(
[Address] => AQUARIUS NURSING HOME 4-6 SPENCER ROAD, SOUTHSEA PO4 9RN
[Medicine1] => LAXIDO ORANGE SF 13.8G SACHETS
[Details1] => ONE to TWO when required
DISSOLVE OR MIX WITH WATER BEFORE TAKING.
NOT IN CASSETTE
)
)
Sometimes it is hard to extract PDFs into the required format directly using libraries or tools. I ran into the same problem recently: I had 1600+ PDFs and needed to extract their data and store it in a database. I tried almost all the libraries and tools, and none of them helped. So I put in some manual effort to find a pattern and process the files using PHP. For this I used the PHP library PDF TO HTML.
Install PDF TO HTML library
composer require gufy/pdftohtml-php:~2
This converts your PDF into HTML, with each <div> tag representing a page and each <p> tag representing a title or its value. If you can identify a common pattern in the <p> tags, it is not hard to put that into logic that processes all the PDFs and converts them into CSV/XLS or anything else. In my case the pattern repeated after every 11 <p> tags, so I used this:
$pdf = new Gufy\PdfToHtml\Pdf('<PDF_FILE_PATH>');
// get total no pages
$total_pages = $pdf->getPages();
// Iterate through each page and extract the p tags
for($i = 1; $i <= $total_pages; $i++){
// This will convert pdf to html
$html = $pdf->html($i);
// Create a dom document
$domOb = new DOMDocument();
// load html code in the dom document
$domOb->loadHTML(mb_convert_encoding($html, 'HTML-ENTITIES', 'UTF-8'));
// Get SimpleXMLElement from Dom Node
$sxml = simplexml_import_dom($domOb);
// here you have the p tags
foreach ($sxml->body->div->p as $pTag) {
// your logic
}
}
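What goes into "your logic" depends on the document, but for a pattern that repeats every fixed number of <p> tags, grouping could look roughly like this. The sample texts are invented and the group size is 3 instead of 11 only to keep the demo short.

```php
<?php
// Hypothetical flat list of <p> texts collected from one page, in document order.
$pTexts = ['Patient', 'Medicine', 'Quantity', 'Patient', 'Medicine', 'Quantity', 'Patient'];
$tagsPerRecord = 3; // 11 in the case described above

// Cut the flat list into consecutive groups of $tagsPerRecord items.
$records = array_chunk($pTexts, $tagsPerRecord);

// A shorter last group means the repeating pattern broke on this page; keep only full records.
$complete = array_filter($records, function ($record) use ($tagsPerRecord) {
    return count($record) === $tagsPerRecord;
});

print_r($complete);
```

Each complete group can then be written out as one CSV row or database record.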
Hope this helps you as it helped me a lot.
Suppose I have 7 bags with different weights. A PHP array actually contains this data.
Bag A 60 Kg
Bag B 80 Kg
Bag C 20 Kg
Bag D 10 Kg
Bag E 80 Kg
Bag F 100 Kg
Bag G 90 Kg
In php it will look like this
Array
(
[30] => 60
[31] => 120
[32] => 120
[33] => 60
[35] => 180
)
Now I have to divide all 7 bags among 4 containers as evenly as possible by balancing their weight.
But I cannot split a bag to manage the weight. How can I do this? How can I build a formula or PHP function that distributes all bags while balancing their weight?
There is no limit on container capacity, and the containers do not need to have exactly equal weight after distribution. I just need load balancing.
Thanks in advance.
Calculate the total weight of your bags, then divide it by the number of containers to get a target weight per container. Then use a bin packing algorithm to distribute the bags among the containers. E.g. take one bag at a time from your array and put it in the first container where the weight of the container plus the weight of your bag is less than that target weight.
http://en.wikipedia.org/wiki/Bin_packing_problem
Update:
Here is an example written in Ruby. It should not be too hard to rewrite it in PHP. It distributes the bags to the containers relatively evenly (there might be a more accurate solution).
# A list of bags with different weights
list_of_bags = [11, 41, 31, 15, 15, 66, 67, 34, 20, 42, 22, 25]
# total weight of all bags
weight_of_bags = list_of_bags.inject(0) {|sum, i| sum + i}
# how many containers do we have at our disposal?
number_of_containers = 4
# How much should one container weigh?
weight_per_container = weight_of_bags / number_of_containers
# We make an array containing an empty array for each container
containers = Array.new(number_of_containers){ |i| [] }
# For each bag
list_of_bags.each do |bag|
# we try to find the first container
containers.each do |container|
# where the weight of the container plus the weight of the bag is
# less than the maximum allowed (weight_per_container)
if container.inject(0) {|sum, i| sum + i} + bag < weight_per_container
# if the current container has space for it we add the bag
# and go to the next one
container.push(bag)
break
end
end
end
# output all containers with the number of items and total weight
containers.each_with_index do |container, index|
puts "container #{index} has #{container.length} items and weigths: #{container.inject(0) {|sum, i| sum + i}}"
end
example result:
container 0 has 3 items and weigths: 83
container 1 has 3 items and weigths: 96
container 2 has 2 items and weigths: 87
container 3 has 2 items and weigths: 76
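Since the question asks for PHP, here is a minimal sketch of a slightly different greedy strategy than the first-fit loop above: sort the bags from heaviest to lightest and always put the next bag into the container that currently weighs the least. The function name `distributeBags` and the layout of its return value are assumptions made for illustration; the sample weights are the ones from the question's array. It is a heuristic and will not always find the best possible balance.

```php
<?php
// Hypothetical helper, not taken from the answers above.
// Greedy balancing: heaviest bag first, always into the lightest container.
function distributeBags(array $bags, $numberOfContainers) {
    // per container: bag id => weight
    $containers = array_fill(0, $numberOfContainers, array());
    // running total weight per container
    $loads = array_fill(0, $numberOfContainers, 0);

    arsort($bags); // heaviest first; the keys (bag ids) are preserved

    foreach ($bags as $id => $weight) {
        // index of the first container with the smallest current load
        $target = array_search(min($loads), $loads, true);
        $containers[$target][$id] = $weight;
        $loads[$target] += $weight;
    }
    return array('containers' => $containers, 'loads' => $loads);
}

// bag id => weight, as in the question's array
$bags = array(30 => 60, 31 => 120, 32 => 120, 33 => 60, 35 => 180);
$result = distributeBags($bags, 4);

foreach ($result['loads'] as $index => $load) {
    $count = count($result['containers'][$index]);
    echo "container $index: $count bag(s), total weight $load\n";
}
```

For these five weights the container totals come out as 180, 120, 120 and 120. The single 180 bag cannot be split, so no distribution can bring the heaviest container below that.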
Create a function that takes a bag weight and returns a container number: the container with the least free space that is still enough to fit the bag. Put the bag in that container. Repeat until done.
$bags = array(60, 80, 20, 10, 80, 100, 90);
$containers = array(1 => 100, 2 => 100, 3 => 100, 4 => 100); // number -> free space
$placement = array();
rsort($bags); // biggest first - usually it's better

function bestContainerFor($weight) {
    global $containers;
    $rest = null; // smallest leftover space found so far
    $out = 0;     // stays 0 if $weight fits nowhere
    foreach ($containers as $nr => $space) {
        if ($space < $weight) continue; // not enough space
        if ($rest !== null && $space - $weight >= $rest) continue; // we already have a tighter fit
        $rest = $space - $weight;
        $out = $nr;
    }
    if ($out) $containers[$out] -= $weight; // occupy the space
    return $out;
}

foreach ($bags as $nr => $w) {
    $p = bestContainerFor($w);
    $placement[$nr] = $p; // for later use; in this example it's not needed
    if ($p)  print "Bag $nr fits in $p<br>";
    if (!$p) print "Bag $nr fits nowhere<br>";
}
It's not tested. If you give me some details of your code I'll try to adapt it. This just shows the principle.
Note that:
it works with variable container sizes,
it gives you the placement of each bag, not the total weight per container,
it's not optimal for equal distribution; it just gives a good case.
Good morning -
I'm interested in seeing an efficient way of parsing the values of a hierarchical text file (i.e., one that has a Title => Multiple Headings => Multiple Subheadings => Multiple Keys => Multiple Values) into a simple XML document. For the sake of simplicity, the answer would be written using:
Regex (preferably in PHP)
or PHP code (e.g., if looping were more efficient)
Here's an example of an Inventory file I'm working with. Note that Header = FOODS, Sub-Header = Type (A, B...), Keys = PRODUCT (or CODE, etc.) and Values may have one or more lines.
**FOODS - TYPE A**
___________________________________
**PRODUCT**
1) Mi Pueblito Queso Fresco Authentic Mexican Style Fresh Cheese;
2) La Fe String Cheese
**CODE**
Sell by date going back to February 1, 2009
**MANUFACTURER**
Quesos Mi Pueblito, LLC, Passaic, NJ.
**VOLUME OF UNITS**
11,000 boxes
**DISTRIBUTION**
NJ, NY, DE, MD, CT, VA
___________________________________
**PRODUCT**
1) Peanut Brittle No Sugar Added;
2) Peanut Brittle Small Grind;
3) Homestyle Peanut Brittle Nuggets/Coconut Oil Coating
**CODE**
1) Lots 7109 - 8350 inclusive;
2) Lots 8198 - 8330 inclusive;
3) Lots 7075 - 9012 inclusive;
4) Lots 7100 - 8057 inclusive;
5) Lots 7152 - 8364 inclusive
**MANUFACTURER**
Star Kay White, Inc., Congers, NY.
**VOLUME OF UNITS**
5,749 units
**DISTRIBUTION**
NY, NJ, MA, PA, OH, FL, TX, UT, CA, IA, NV, MO and IN
**FOODS - TYPE B**
___________________________________
**PRODUCT**
Cool River Bebidas Naturales - West Indian Cherry Fruit Acerola 16% Juice;
**CODE**
990-10/2 10/5
**MANUFACTURER**
San Mar Manufacturing Corp., Catano, PR.
**VOLUME OF UNITS**
384
**DISTRIBUTION**
PR
And here's the desired output (please excuse any XML syntactical errors):
<foods>
  <food type="A">
    <product>Mi Pueblito Queso Fresco Authentic Mexican Style Fresh Cheese</product>
    <product>La Fe String Cheese</product>
    <code>Sell by date going back to February 1, 2009</code>
    <manufacturer>Quesos Mi Pueblito, LLC, Passaic, NJ.</manufacturer>
    <volume>11,000 boxes</volume>
    <distribution>NJ, NY, DE, MD, CT, VA</distribution>
  </food>
  <food type="A">
    <product>Peanut Brittle No Sugar Added</product>
    <product>Peanut Brittle Small Grind</product>
    <product>Homestyle Peanut Brittle Nuggets/Coconut Oil Coating</product>
    <code>Lots 7109 - 8350 inclusive</code>
    <code>Lots 8198 - 8330 inclusive</code>
    <code>Lots 7075 - 9012 inclusive</code>
    <code>Lots 7100 - 8057 inclusive</code>
    <code>Lots 7152 - 8364 inclusive</code>
    <manufacturer>Star Kay White, Inc., Congers, NY.</manufacturer>
    <volume>5,749 units</volume>
    <distribution>NY, NJ, MA, PA, OH, FL, TX, UT, CA, IA, NV, MO and IN</distribution>
  </food>
  <food type="B">
    <product>Cool River Bebidas Naturales - West Indian Cherry Fruit Acerola 16% Juice</product>
    <code>990-10/2 10/5</code>
    <manufacturer>San Mar Manufacturing Corp., Catano, PR</manufacturer>
    <volume>384</volume>
    <distribution>PR</distribution>
  </food>
</foods>
<!-- and so forth -->
So far, my approach (which might be quite inefficient with a huge text file) would be one of the following:
Loops and multiple Select/Case statements, where the file is loaded into a string buffer, and while looping through each line, see if it matches one of the header/subheader/key lines, append the appropriate xml tag to a xml string variable, and then add the child nodes to the xml based on IF statements regarding which key name is most recent (which seems time-consuming and error-prone, esp. if the text changes even slightly) -- OR
Use REGEX (Regular Expressions) to find and replace key fields with appropriate xml tags, clean it up with an xml library, and export the xml file. Problem is, I barely use regular expressions, so I'd need some example-based help.
Any help or advice would be appreciated.
Thanks.
An example you can use as a starting point. At least I hope it gives you an idea...
<?php
define('TYPE_HEADER', 1);
define('TYPE_KEY', 2);
define('TYPE_DELIMETER', 3);
define('TYPE_VALUE', 4);

$datafile = 'data.txt';
$fp = fopen($datafile, 'rb') or die('!fopen');

// stores (the first) {header} in 'name' and the root simplexmlelement in 'element'
$container = array('name'=>null, 'element'=>null);
// stores the name for each item element, the value for the type attribute for subsequent item elements and the simplexmlelement of the current item element
$item = array('name'=>null, 'type'=>null, 'current_element'=>null);
// the last **key** encountered, used to create new child elements in the current item element when a value is encountered
$key = null;

while ( false!==($t=getstruct($fp)) ) {
    switch( $t[0] ) {
        case TYPE_HEADER:
            if ( is_null($container['element']) ) {
                // this is the first time we hit **header - subheader**
                $container['name'] = $t[1][0];
                // ugly hack, < . name . />
                $container['element'] = new SimpleXMLElement('<'.$container['name'].'/>');
                // each subsequent new item gets the new subheader as type attribute
                $item['type'] = $t[1][1];
                // dummy implementation: "deducing" the item names from header/container[name]
                $item['name'] = substr($t[1][0], 0, -1);
            }
            else {
                // hitting **header - subheader** the (second, third, nth) time
                /*
                header must be the same as the first time (stored in container['name']).
                Otherwise you need another container element since
                xml documents can only have one root element
                */
                if ( $container['name'] !== $t[1][0] ) {
                    echo $container['name'], "!==", $t[1][0], "\n";
                    die('format error');
                }
                else {
                    // subheader may have changed, store it for future item elements
                    $item['type'] = $t[1][1];
                }
            }
            break;
        case TYPE_DELIMETER:
            assert( !is_null($container['element']) );
            assert( !is_null($item['name']) );
            assert( !is_null($item['type']) );
            /* that's maybe not a wise choice.
            You might want to check the complete item before appending it to the document.
            But the example is a hack anyway ...so create a new item element and append it to the container right away
            */
            $item['current_element'] = $container['element']->addChild($item['name']);
            // set the type-attribute according to the last **header - subheader** encountered
            $item['current_element']['type'] = $item['type'];
            break;
        case TYPE_KEY:
            $key = $t[1][0];
            break;
        case TYPE_VALUE:
            assert( !is_null($item['current_element']) );
            assert( !is_null($key) );
            // this is a value belonging to the "last" key encountered
            // create a new "key" element with the value as content
            // and add it to the current item element
            $tmp = $item['current_element']->addChild($key, $t[1][0]);
            break;
        default:
            die('unknown token');
    }
}

if ( !is_null($container['element']) ) {
    $doc = dom_import_simplexml($container['element']);
    $doc = $doc->ownerDocument;
    $doc->formatOutput = true;
    echo $doc->saveXML();
}
die;

/*
Take a look at gettoken() at http://www.tuxradar.com/practicalphp/21/5/6
It breaks the stream into much simpler pieces.
In the next step the parser would "combine" or structure the simple tokens into more complex things.
This function does both....
@return array(id, array(parameter))
*/
function getstruct($fp) {
    if ( feof($fp) ) {
        return false;
    }
    // shortcut: all we care about "happens" on one line
    // so let php read one line in a single step and then do the pattern matching
    $line = trim(fgets($fp));

    // this matches **key** and **header - subheader**
    if ( preg_match('#^\*\*([^-]+)(?:-(.*))?\*\*$#', $line, $m) ) {
        // only for **header - subheader** $m[2] is set.
        if ( isset($m[2]) ) {
            return array(TYPE_HEADER, array(trim($m[1]), trim($m[2])));
        }
        else {
            return array(TYPE_KEY, array($m[1]));
        }
    }
    // this matches _____________ and means "new item"
    else if ( preg_match('#^_+$#', $line, $m) ) {
        return array(TYPE_DELIMETER, array());
    }
    // any other non-empty line is a single value
    else if ( preg_match('#\S#', $line) ) {
        // you might want to filter the 1),2),3) part out here
        // could also be two different token types
        return array(TYPE_VALUE, array($line));
    }
    else {
        // skip empty lines, would be nicer with tail-recursion...
        return getstruct($fp);
    }
}
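prints the following. Note that <VOLUME OF UNITS> is not well-formed XML, because element names cannot contain spaces; you would want to replace the spaces in the key (e.g. with underscores) before passing it to addChild().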
<?xml version="1.0"?>
<FOODS>
<FOOD type="TYPE A">
<PRODUCT>1) Mi Pueblito Queso Fresco Authentic Mexican Style Fresh Cheese;</PRODUCT>
<PRODUCT>2) La Fe String Cheese</PRODUCT>
<CODE>Sell by date going back to February 1, 2009</CODE>
<MANUFACTURER>Quesos Mi Pueblito, LLC, Passaic, NJ.</MANUFACTURER>
<VOLUME OF UNITS>11,000 boxes</VOLUME OF UNITS>
<DISTRIBUTION>NJ, NY, DE, MD, CT, VA</DISTRIBUTION>
</FOOD>
<FOOD type="TYPE A">
<PRODUCT>1) Peanut Brittle No Sugar Added;</PRODUCT>
<PRODUCT>2) Peanut Brittle Small Grind;</PRODUCT>
<PRODUCT>3) Homestyle Peanut Brittle Nuggets/Coconut Oil Coating</PRODUCT>
<CODE>1) Lots 7109 - 8350 inclusive;</CODE>
<CODE>2) Lots 8198 - 8330 inclusive;</CODE>
<CODE>3) Lots 7075 - 9012 inclusive;</CODE>
<CODE>4) Lots 7100 - 8057 inclusive;</CODE>
<CODE>5) Lots 7152 - 8364 inclusive</CODE>
<MANUFACTURER>Star Kay White, Inc., Congers, NY.</MANUFACTURER>
<VOLUME OF UNITS>5,749 units</VOLUME OF UNITS>
<DISTRIBUTION>NY, NJ, MA, PA, OH, FL, TX, UT, CA, IA, NV, MO and IN</DISTRIBUTION>
</FOOD>
<FOOD type="TYPE B">
<PRODUCT>Cool River Bebidas Naturales - West Indian Cherry Fruit Acerola 16% Juice;</PRODUCT>
<CODE>990-10/2 10/5</CODE>
<MANUFACTURER>San Mar Manufacturing Corp., Catano, PR.</MANUFACTURER>
<VOLUME OF UNITS>384</VOLUME OF UNITS>
<DISTRIBUTION>PR</DISTRIBUTION>
</FOOD>
</FOODS>
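The output above still carries the `1)` style numbering and the trailing semicolons in the values, and the key `VOLUME OF UNITS` contains spaces. A minimal sketch of two clean-up helpers follows; the names `cleanValue` and `toElementName` are hypothetical and not part of the code above, and the sketch assumes every key starts with a letter (a name starting with a digit would still be invalid as an XML element name).

```php
<?php
// Hypothetical helpers, not part of the parser shown above.

// Strip a leading "1) " style counter and a trailing ";" from a value line.
function cleanValue($line) {
    $line = preg_replace('#^\d+\)\s*#', '', trim($line));
    return rtrim($line, "; \t");
}

// Turn a key such as "VOLUME OF UNITS" into a lower-case name without spaces.
function toElementName($key) {
    $name = strtolower(trim($key));
    // collapse every run of characters other than a-z, 0-9 and _ into one underscore
    $name = preg_replace('#[^a-z0-9_]+#', '_', $name);
    return trim($name, '_');
}

echo cleanValue('1) Peanut Brittle No Sugar Added;'), "\n";
echo toElementName('VOLUME OF UNITS'), "\n";
```

In the parser, the value helper would be applied where the value token is created and the name helper where the key is passed to `addChild()`.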
Unfortunately the status of the php module for ANTLR currently is "Runtime is in alpha status." but it might be worth a try anyway...
See: http://www.tuxradar.com/practicalphp/21/5/6
This tells you how to parse a text file into tokens using PHP. Once parsed you can place it into anything you want.
You need to search for specific tokens in the file based on your criteria:
for example:
PRODUCT
This gives you the XML Tag
Then 1) can have special meaning
1) Peanut Brittle...
This tells you what to put in the XML tag.
I do not know if this is the most efficient way to accomplish your task, but it is the way a compiler would parse a file, and it has the potential to be very accurate.
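For comparison, the key/value pairs of a single record can also be pulled out with one regular expression instead of a line-by-line tokenizer. This is only a sketch under assumptions: the function name `parseRecord` is hypothetical, keys are assumed to consist of upper-case letters and spaces between `**` markers, and the input is assumed to be one record (the text between two underscore lines).

```php
<?php
// Hypothetical helper: extracts "**KEY**" blocks from one record.
function parseRecord($record) {
    // ^\*\*([A-Z ]+)\*\*   a key line such as **PRODUCT**
    // \R(.*?)              the lines that follow, matched lazily
    // (?=...)              stop before the next **...** line, an underscore line, or the end
    $pattern = '#^\*\*([A-Z ]+)\*\*\R(.*?)(?=^\*\*|^_+$|\z)#ms';
    preg_match_all($pattern, $record, $matches, PREG_SET_ORDER);

    $fields = array();
    foreach ($matches as $m) {
        // one array entry per value line
        $fields[$m[1]] = preg_split('#\R#', trim($m[2]));
    }
    return $fields;
}

$record = "**PRODUCT**\n"
        . "1) Peanut Brittle No Sugar Added;\n"
        . "2) Peanut Brittle Small Grind\n"
        . "**CODE**\n"
        . "Lots 7109 - 8350 inclusive\n";

print_r(parseRecord($record));
```

Header lines such as `**FOODS - TYPE A**` are not matched as keys because of the hyphen, so they would need their own pattern.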
Instead of Regex or PHP use the XSLT 2.0 unparsed-text() function to read the file (see http://www.biglist.com/lists/xsl-list/archives/200508/msg00085.html)
Another Hint for an XSLT 1.0 Solution is here: http://bytes.com/topic/net/answers/808619-read-plain-file-xslt-1-0-a