I wrote this PHP code to implement the Flesch-Kincaid Readability Score as a function:
function readability($text) {
$total_sentences = 1; // one full stop = two sentences => start with 1
$punctuation_marks = array('.', '?', '!', ':');
foreach ($punctuation_marks as $punctuation_mark) {
$total_sentences += substr_count($text, $punctuation_mark);
}
$total_words = str_word_count($text);
$total_syllables = 3; // assuming this value since I don't know how to count them
$score = 206.835 - (1.015 * $total_words / $total_sentences) - (84.6 * $total_syllables / $total_words);
return $score;
}
Do you have suggestions how to improve the code? Is it correct? Will it work?
I hope you can help me. Thanks in advance!
The code looks fine as far as a heuristic goes. Here are some points to consider that make the items you need to calculate considerably more difficult for a machine:
What is a sentence?
Seriously, what is a sentence? We have periods, but they can also be used in Ph.D., e.g., i.e., Y.M.C.A., and other non-sentence-final positions. When you consider exclamation points, question marks, and ellipses, you're really doing yourself a disservice by assuming a period will do the trick. I've looked at this problem before, and if you really want a more reliable count of sentences in real text, you'll need to parse the text. This can be computationally intensive, time-consuming, and hard to find free resources for. In the end, you still have to worry about the error rate of the particular parser implementation. However, only full parsing will tell you what's a sentence and what's just one of the period's many other uses. Furthermore, if you're using text 'in the wild' -- such as, say, HTML -- you're also going to have to worry about sentences ending not with punctuation but with closing tags. For instance, many sites don't add punctuation to h1 and h2 tags, but they're clearly distinct sentences or phrases.
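To make the abbreviation problem concrete, here's a minimal sketch (my own toy example, not code from the question) showing how far off a plain period count can be:

```php
<?php
// Hypothetical demo: a plain period count vs. the real sentence count.
// Abbreviations like "Ph.D." and "Y.M.C.A." each contribute phantom "sentences".
$text = "Dr. Smith holds a Ph.D. in linguistics. She teaches at the Y.M.C.A.";

$periodCount = substr_count($text, '.');
echo "Periods found: $periodCount\n"; // 8 periods, but only 2 real sentences
```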
Syllables aren't something we should be approximating
This is a major hallmark of this readability heuristic, and it's the part that makes it the most difficult to implement. Computational analysis of the syllable count in a work requires the assumption that the assumed reader speaks the same dialect as whatever your syllable count generator was trained on. How sounds fall around a syllable is actually a major part of what makes accents accents. If you don't believe me, try visiting Jamaica sometime. What this means is that even if a human were to do the calculations by hand, the result would still be a dialect-specific score.
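For illustration only, here is the kind of naive vowel-group heuristic implementations usually fall back on; even before dialect enters the picture, it misfires on ordinary words (the rules below are my own rough guesses, not a vetted algorithm):

```php
<?php
// Naive syllable estimate: count runs of vowels, subtract a silent final 'e'.
// A rough sketch only -- not a linguistically sound counter.
function naiveSyllables($word) {
    $word = strtolower(trim($word));
    $count = preg_match_all('/[aeiouy]+/', $word);     // vowel groups
    if (strlen($word) > 2 && substr($word, -1) === 'e'
            && !preg_match('/[aeiouy]e$/', $word)) {
        $count--;                                       // assume silent final 'e'
    }
    return max(1, $count);
}

echo naiveSyllables('readability'); // 5 -- happens to be right
echo "\n";
echo naiveSyllables('people');      // 1 -- wrong (it's 2): the heuristic breaks
```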
What is a word?
Not to wax psycholinguistic in the slightest, but you will find that space-separated words and what speakers conceptualize as words are quite different things. This makes the concept of a computable readability score somewhat questionable.
So, in the end, I can answer your question of 'will it work?'. If you're looking to take a piece of text and display this readability score among other metrics to offer some conceivable added value, few users will bring up all of these questions. If you are trying to do something scientific, or even something pedagogical (as this score and those like it were ultimately intended), I wouldn't really bother. In fact, if you're going to use this to make any kind of suggestions to a user about content they have generated, I would be extremely hesitant.
A better way to measure reading difficulty of a text would more likely be something having to do with the ratio of low-frequency words to high-frequency words along with the number of hapax legomena in the text. But I wouldn't pursue actually coming up with a heuristic like this, because it would be very difficult to empirically test anything like it.
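Just to make the ingredients concrete (a toy computation of my own, not the validated heuristic I said would be hard to build), the hapax share of a text is at least easy to compute:

```php
<?php
// Toy frequency-based difficulty signal: the share of words that occur
// exactly once (hapax legomena). Not a validated readability metric.
function hapaxRatio($text) {
    $words = str_word_count(strtolower($text), 1); // 1 = return array of words
    if (count($words) === 0) {
        return 0.0;
    }
    $freq = array_count_values($words);
    $hapax = count(array_filter($freq, function ($n) { return $n === 1; }));
    return $hapax / count($words);
}

printf("%.2f\n", hapaxRatio('the cat sat on the mat')); // 4 of 6 words occur once
```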
Take a look at the PHP Text Statistics class on GitHub.
Please have a look at the following two classes and their usage information. They should help you.
Readability Syllable Count Pattern Library Class:
<?php class ReadabilitySyllableCheckPattern {
public $probWords = [
'abalone' => 4,
'abare' => 3,
'abed' => 2,
'abruzzese' => 4,
'abbruzzese' => 4,
'aborigine' => 5,
'acreage' => 3,
'adame' => 3,
'adieu' => 2,
'adobe' => 3,
'anemone' => 4,
'apache' => 3,
'aphrodite' => 4,
'apostrophe' => 4,
'ariadne' => 4,
'cafe' => 2,
'calliope' => 4,
'catastrophe' => 4,
'chile' => 2,
'chloe' => 2,
'circe' => 2,
'coyote' => 3,
'epitome' => 4,
'forever' => 3,
'gethsemane' => 4,
'guacamole' => 4,
'hyperbole' => 4,
'jesse' => 2,
'jukebox' => 2,
'karate' => 3,
'machete' => 3,
'maybe' => 2,
'people' => 2,
'recipe' => 3,
'sesame' => 3,
'shoreline' => 2,
'simile' => 3,
'syncope' => 3,
'tamale' => 3,
'yosemite' => 4,
'daphne' => 2,
'eurydice' => 4,
'euterpe' => 3,
'hermione' => 4,
'penelope' => 4,
'persephone' => 4,
'phoebe' => 2,
'zoe' => 2
];
public $addSyllablePatterns = [
"([^s]|^)ia",
"iu",
"io",
"eo($|[b-df-hj-np-tv-z])",
"ii",
"[ou]a$",
"[aeiouym]bl$",
"[aeiou]{3}",
"[aeiou]y[aeiou]",
"^mc",
"ism$",
"asm$",
"thm$",
"([^aeiouy])\1l$",
"[^l]lien",
"^coa[dglx].",
"[^gq]ua[^auieo]",
"dnt$",
"uity$",
"[^aeiouy]ie(r|st|t)$",
"eings?$",
"[aeiouy]sh?e[rsd]$",
"iell",
"dea$",
"real",
"[^aeiou]y[ae]",
"gean$",
"riet",
"dien",
"uen"
];
public $prefixSuffixPatterns = [
"^un",
"^fore",
"^ware",
"^none?",
"^out",
"^post",
"^sub",
"^pre",
"^pro",
"^dis",
"^side",
"ly$",
"less$",
"some$",
"ful$",
"ers?$",
"ness$",
"cians?$",
"ments?$",
"ettes?$",
"villes?$",
"ships?$",
"sides?$",
"ports?$",
"shires?$",
"tion(ed)?$"
];
public $subSyllablePatterns = [
"cia(l|$)",
"tia",
"cius",
"cious",
"[^aeiou]giu",
"[aeiouy][^aeiouy]ion",
"iou",
"sia$",
"eous$",
"[oa]gue$",
".[^aeiuoycgltdb]{2,}ed$",
".ely$",
"^jua",
"uai",
"eau",
"[aeiouy](b|c|ch|d|dg|f|g|gh|gn|k|l|ll|lv|m|mm|n|nc|ng|nn|p|r|rc|rn|rs|rv|s|sc|sk|sl|squ|ss|st|t|th|v|y|z)e$",
"[aeiouy](b|c|ch|dg|f|g|gh|gn|k|l|lch|ll|lv|m|mm|n|nc|ng|nch|nn|p|r|rc|rn|rs|rv|s|sc|sk|sl|squ|ss|th|v|y|z)ed$",
"[aeiouy](b|ch|d|f|gh|gn|k|l|lch|ll|lv|m|mm|n|nch|nn|p|r|rn|rs|rv|s|sc|sk|sl|squ|ss|st|t|th|v|y)es$",
"^busi$"
]; } ?>
The second class is the readability algorithm class, which has two methods to calculate the score:
<?php class ReadabilityAlgorithm {
function countSyllable($strWord) {
$pattern = new ReadabilitySyllableCheckPattern();
$strWord = trim($strWord);
// Check for problem words
if (isset($pattern->{'probWords'}[$strWord])) {
return $pattern->{'probWords'}[$strWord];
}
// Check prefix, suffix (these are regex patterns, so preg_replace is required, not str_replace)
$strWord = preg_replace(array_map(function ($p) { return '`' . $p . '`'; }, $pattern->{'prefixSuffixPatterns'}), '', $strWord, -1, $tmpPrefixSuffixCount);
// Split the word on non-vowel characters, leaving the vowel groups
$arrWordParts = preg_split('`[^aeiouy]+`', $strWord);
$wordPartCount = 0;
foreach ($arrWordParts as $strWordPart) {
if ($strWordPart !== '') {
$wordPartCount++;
}
}
$intSyllableCount = $wordPartCount + $tmpPrefixSuffixCount;
// Check syllable patterns
foreach ($pattern->{'subSyllablePatterns'} as $strSyllable) {
$intSyllableCount -= preg_match('`' . $strSyllable . '`', $strWord);
}
foreach ($pattern->{'addSyllablePatterns'} as $strSyllable) {
$intSyllableCount += preg_match('`' . $strSyllable . '`', $strWord);
}
$intSyllableCount = ($intSyllableCount == 0) ? 1 : $intSyllableCount;
return $intSyllableCount;
}
function calculateReadabilityScore($stringText) {
# Calculate score
$totalSentences = 1;
$punctuationMarks = array('.', '!', ':', ';');
foreach ($punctuationMarks as $punctuationMark) {
$totalSentences += substr_count($stringText, $punctuationMark);
}
// get ASL value
$totalWords = str_word_count($stringText);
$ASL = $totalWords / $totalSentences;
// find syllables value
$syllableCount = 0;
$arrWords = explode(' ', $stringText);
// Note: explode() on spaces can count differently from str_word_count() above
$intWordCount = count($arrWords);
for ($i = 0; $i < $intWordCount; $i++) {
$syllableCount += $this->countSyllable($arrWords[$i]);
}
// get ASW value
$ASW = $syllableCount / $totalWords;
// Count the readability score
$score = 206.835 - (1.015 * $ASL) - (84.6 * $ASW);
return $score;
} } ?>
// Example: how to use
<?php // Create object to count readability score
$readObj = new ReadabilityAlgorithm();
echo $readObj->calculateReadabilityScore("Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into: electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently; with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum!");
?>
I actually don't see any problems with that code. Of course, it could be optimized a bit if you really wanted to, by replacing all the different functions with a single counting loop. However, I'd argue that this isn't necessary and would even be counterproductive. Your current code is very readable and easy to understand, and optimizing it would probably make things worse in that respect. Use it as it is, and don't try to optimize it unless it actually turns out to be a performance bottleneck.
Related
I need the binary code representation of an integer (unsigned byte). I found this solution, where in my case $n = 8:
function _decBinDig($x, $n)
{
return substr(decbin(pow(2, $n) + $x), 1);
}
which surprisingly takes about 74 ms, while my first try - which I thought was too slow:
function getBinary(int $x)
{
return str_pad(base_convert($x, 10, 2), 8, '0', STR_PAD_LEFT);
}
only takes about 38 ms.
Is there a faster solution?
Benchmarked the following five functions:
// For a baseline, returns unpadded binary
function decBinPlain(int $x) {
return decbin($x);
}
// Alas fancier than necessary:
function decBinDig(int $x) {
return substr(decbin(pow(2, 8) + $x), 1);
}
// OP's initial test
function getBinary(int $x) {
return str_pad(base_convert($x, 10, 2), 8, '0', STR_PAD_LEFT);
}
// OP's function using decbin()
function getDecBin(int $x) {
return str_pad(decbin($x), 8, '0', STR_PAD_LEFT);
}
// TimBrownlaw's method
function intToBin(int $x) {
return sprintf( "%08d", decbin($x));
}
At 500,000 iterations each (run as 10 rounds of 50,000 for each of the 5 functions), here are the stats:
[average] => [
[decBinPlain] => 0.0912
[getDecBin] => 0.1355
[getBinary] => 0.1444
[intToBin] => 0.1493
[decBinDig] => 0.1687
]
[relative] => [
[decBinPlain] => 100
[getDecBin] => 148.57
[getBinary] => 158.33
[intToBin] => 163.71
[decBinDig] => 184.98
]
[ops_per_sec] => [
[decBinPlain] => 548355
[getDecBin] => 369077
[getBinary] => 346330
[intToBin] => 334963
[decBinDig] => 296443
]
The positions are consistent. OP's function, changed to use decbin in place of base_convert, is the fastest function that returns the complete result, by a very thin margin. I'd opt for decbin simply because the meaning is crystal clear. For adding in the left-padding, str_pad is less complex than sprintf. Running PHP 7.4.4 on W10 & i5-8250U, total runtime 7.11 sec.
For a baseline, calling an empty dummy function averages 0.0542 sec. Then: if you need to run this enough times to worry about minute per-op performance gains, it's more economical to inline the code and avoid the function call. Here, the overhead from the function call is greater than the difference between the slowest and the fastest options above!
For future reference: if you're benchmarking several options, I'd recommend testing them within a single script call and over several consecutive loops of each function. That'll help even out any "lag noise" from background programs, CPU throttling (power to max if on battery!), etc. Then, call the script a couple of times and check that the numbers are stable. You'll want to do much more than 1000 iterations to get reliable numbers: try 10K upwards for more complex functions, and 100K upwards for simpler ones. Give the benchmark a long enough burn-in if you want numbers you can trust!
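As a sketch of that methodology (the function name, iteration counts, and the input fed to the candidate are my own choices), a harness might look like this:

```php
<?php
// Minimal benchmark harness: several consecutive rounds per function so
// background noise averages out; returns the mean seconds per round.
function bench($fn, $iterations = 50000, $rounds = 10) {
    $times = [];
    for ($r = 0; $r < $rounds; $r++) {
        $start = hrtime(true);                 // PHP 7.3+ monotonic clock
        for ($i = 0; $i < $iterations; $i++) {
            $fn($i & 0xFF);                    // feed it a byte-sized value
        }
        $times[] = (hrtime(true) - $start) / 1e9;
    }
    return array_sum($times) / count($times);
}

$avg = bench(function ($x) { return str_pad(decbin($x), 8, '0', STR_PAD_LEFT); });
printf("avg per round: %.4f sec\n", $avg);
```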
There is a "nicer" method you can try out.
function intToBin(int $x)
{
return sprintf( "%08d", decbin($x));
}
or just call the sprintf inline.
I have the following PDF file (a Marsheet PDF) and I'm trying to extract the data shown in the example below. I have tried PDFParse, PDFtoText, etc., but none of them work properly. Is there any solution or example?
<?php
// Output something like this, or suggest a better option if you have one
$data_array = array(
array( "name" => "Mr Andrew Smee",
"medicine_name" => "FLUOXETINE 20MG CAPS",
"description" => "TAKE ONE ONCE DAILY FOR LOW MOOD. CAUTION:YOUR DRIVING REACTIONS MAY BE IMPAIRED",
"Dose" => '9000',
"StartDate" => '28/09/15',
"period" => '28',
"Quantity" => '28'
),
array( "name" => "Mr Andrew Smee",
"medicine_name" => "SINEMET PLUS 125MG TAB",
"description" => "TAKE ONE TABLET FIVE TIMES A DAY FOR PD
(8am,11am,2pm,5pm,8pm)
THIS MEDICINE MAY COLOUR THE URINE. THIS IS
HARMLESS. CAUTION:REACTIONS MAY BE IMPAIRED
WHILST DRIVING OR USING TOOLS OR MACHINES.",
"Dose" => '0800,1100,1400,1700,2000',
"StartDate" => '28/09/15',
"period" => '28',
"Quantity" => '140'
), etc...
);
?>
TL;DR You are almost certainly not going to do this with a library alone.
Update: a working solution (not a perfect solution!) is coded below, see 'in practice'. It requires:
defining the areas where the text is;
the possibility of installing and running a command line tool, pdf2json.
Why it is not easy
PDF files contain typesetting primitives, not extractable text; sometimes the difference is slight enough that you can get by, but usually having only extractable text, in an easily accessible format, means that the document looks "slightly wrong" aesthetically, and therefore the generators that create the "best" PDFs for text extraction are also the least used.
Some generators exist that embed both the typesetting layer and an invisible text layer, allowing you to see the beautiful text and to extract the good text. At the expense, you guessed it, of the PDF size.
In your example, you only have the beautiful text inside the file, and the existence of a grid means that the text needs to be properly typeset.
So, inside, what there actually is to be read is this. Notice the letters inside round parentheses:
/R8 12 Tf
0.99941 0 0 1 66 765.2 Tm
[(M)2.51003(r)2.805( )-2.16558(A)-3.39556(n)
-4.33056(d)-4.33056(r)2.805(e)-4.33056(w)11.5803
( )-2.16558(S)-3.39556(m)-7.49588(e)-4.33117(e)556]TJ
ET
and if you assemble the (s)(i)(n)(g)(l)(e) letters inside, you do get "Mr Andrew Smee", but then you need to know where these letters are related to the page, and the data grid. Also you need to beware of spaces. Above, there is one explicit space character, parenthesized, between "Mr" and "Andrew"; but if you removed such spaces and fixed the offsets of all the following letters, you would still read "Mr Andrew Smee" and save two characters. Some PDF "optimizers" will try and do just that, and not considering offsets, the "text" string of that entity will just be "MrAndrewSmee".
And that is why most text extraction libraries, which can't easily manage character offsets (they use "text lines", and by and large they don't care about grids) will give you something like
Mr Andrew Smee 505738 12/04/54 (61
or, in the case of "optimized" texts,
MrAndrewSmee50573812/04/54(61
(which still gives the dangerous illusion of being parsable with a regex -- sometimes it is, sometimes it isn't, most of the times it works 95% of the time, so that the remaining 5% turns into a maintenance nightmare from Hell), but, more importantly, they will not be able to get you the content of the medication details timetable divided by cell.
Any information which is space-correlated (e.g. a name has different meanings if it's written in the left "From" or in the right "To" box) will be either lost, or variably difficult to reconstruct.
There are PDF "protection" schemes that exploit the capability of offsetting the text, and will scramble the strings. With offsets, you can write:
9 l 10 d 4 l 5 1 H 2 e 3 l o 6 W 7 o 8 r
and the PDF viewer will show you "Hello World"; but read the text directly, and you get "ldlHeloWor", or worse. You could add malicious text and place it outside the page, or write it in transparent color, to prank whoever succeeds in removing the easily removed optional copy-paste protection of PDF files. Most libraries would blithely suck up the prank text together with the good text.
Trying with most libraries, and why it might work (but probably not)
Libraries such as XPDF (and its wrappers phpxpdf, pdf2html, etc.) will give you a simple call such as this
// open PDF
$pdfToText->open('PDF-book.pdf');
// PDF text is now in the $text variable
$text = $pdfToText->getText();
$pdfToText->close();
and your "text" will contain everything, and be something like:
...
START DATE START DAY
WEEK 1 WEEK 2 WEEK 3 WEEK 4
DATE 28 29 30 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
19/10/15
Medication Details
Commencing
D.O.B
Doctor
Hour:Dose 1 2 3 4 5 6 7 1 2 3 4 5 6 7 1 2 3 4 5 6 7 1 2 3 4 5 6 7
Patient
Number
Period
MEDICATION ADMINISTRATION RECORD SHEETS Pharmacy No.
Document No.
02392 731680
28
0900 1
TAKE ONE ONCE DAILY FOR LOW MOOD.
CAUTION:YOUR DRIVING REACTIONS MAY BE IMPAIRED.
28
FLUOXETINE 20MG CAPS
Received Quantity returned quant. by destroyed quant. by
So, reading the above, ask yourself: what is that second 28? Can you tell whether it is the received quantity, the returned quantity, or the destroyed quantity without looking at the PDF? Sure, if there's only one number, chances are that it will be the received quantity. It becomes a bet.
And is 02392 731680 the document number? It looks like it is (it is not).
Notice also that in the PDF, the medicine name is before the notes. In the extracted text, it is after. By looking at the offsets inside the PDF, you understand why, and it's even a good decision -- but looking at the extracted text, it's not so easy.
So, automatic analysis looks enticingly like it can be done, but as I said, it is a very risky business. It is brittle: someone entering the wrong (for you) text somewhere in the document, sometimes even filling the fields not in sequential order, will result in a PDF which is visually correct and, at the same time, unexplainably unparseable. What are you going to tell your users?
Sometimes, a subset of the available information is stable enough for you to get the work done. In that case, XPDF or PDF2HTML, a bunch of regex, and you're home free in half a day. Yay you! Just keep in mind that any "little" addition to the project might then be impossible. Two numbers are added that are well separated in the PDF; are they 128 and 361, or 12 and 8361, or 1283 and 61? All you get in $text is 128361.
So if you go that way, document it clearly and avoid expectations which might be difficult to maintain. Your initial project might work so well, so fast, with so little effort, that an addition is accepted unbeknownst to you -- and you're then required to do the impossible. Explaining why the first 95% was easy and the subsequent 5% very hard might be more than your job is worth.
One difficult way to do it, which worked for me
But can you do the same thing "by hand"? After all, by looking at the PDF, you know what you are seeing. Can the same thing be done by a machine? (this still applies). Sure, in this - after all - clearly delimited problem of computer vision, you very probably can. It just won't be quick and easy. You need:
a very low level library (or reading the PDF yourself; you just need to uncompress it first, and there are tools for that, e.g. pdftk). You need to recover the text with coordinates. "C" for "hospitalized" is worth nothing. "C, 495.2, 882.7" plus the coordinates of your grid tells you of a hospitalization on October 13th, 2015 -- and that is the information you are after!
patience (or a tool) to input the coordinates of the text zones. You need to tell the system which area is October 13th, 2015... as well as all the other days. For example:
// Cell name X1 Y1 X2 Y2 Text
[ 'PatientName', 60, 760, 300, 790, '' ],
[ 'PatientNumber', 310, 760, 470, 790, '' ],
...
[ 'Grid01Y01X01', 90, 1020, 110, 1040, '' ],
...
Note that very many of those values you can calculate programmatically: once you have the top left corner and know one cell's size, the others are more or less calculable with a very slight error. You needn't input yourself six grids of four weeks with six rows each, seven days per week.
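A sketch of that calculation (the origin and cell size are made-up values; the cell naming follows the hand-entered example above):

```php
<?php
// Generate cell rectangles from one top-left corner and a fixed cell size,
// instead of typing hundreds of coordinates by hand. Sizes are illustrative.
function buildGrid($x0, $y0, $cellW, $cellH, $rows, $cols) {
    $cells = [];
    for ($row = 1; $row <= $rows; $row++) {
        for ($col = 1; $col <= $cols; $col++) {
            $x1 = $x0 + ($col - 1) * $cellW;
            $y1 = $y0 + ($row - 1) * $cellH;
            $cells[] = [
                sprintf('Grid01Y%02dX%02d', $row, $col),
                $x1, $y1, $x1 + $cellW, $y1 + $cellH, ''
            ];
        }
    }
    return $cells;
}

$cells = buildGrid(90, 1020, 20, 20, 6, 7); // six rows of seven days
print_r($cells[0]); // [ 'Grid01Y01X01', 90, 1020, 110, 1040, '' ]
```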
You can use the same structure to create a PNG with red areas to indicate which cells you've got covered. That will be useful to visually check you did not forget anything.
At that point you parse the PDF, and every time you find a text at coordinates (x1,y1) you scan all of your cells and determine where the text should be (there are faster ways to do that using XY binary search trees). If you find 'Mr Andrew S' at 66, 765.2 you add it to PatientName. Then you find 'mee' at 109.2, 765.2 and you also add it to PatientName. Which now reads 'Mr Andrew Smee'.
If the horizontal distance is above a certain threshold, you add a space (or more than one).
(For very small text there's a slight risk of the letters being output out of order by the PDF driver and corrected through kerning, but usually that's not a problem).
At the end of the whole cycle you will be left with
[ 'PatientName', 60, 760, 300, 790, 'Mr Andrew Smee' ],
[ 'PatientNumber', 310, 760, 470, 790, '505738' ],
and so on.
I did this kind of work for a large PDF import project some years back and it worked like a charm. Nowadays, I think most of the heavy lifting could be done with TcLibPDF.
The painful part is recording by hand, the first time, the information for the grid; possibly there are tools for that, or one could whip up an HTML5/AJAX editor using canvases.
In practice
Most of the work has already been done by the excellent pdf2json tool, which, consuming the 'Andrew Smee' PDF, outputs something like:
[
{
"height" : 1263,
"width" : 892,
"number" : 1,
"pages" : 1,
"fonts" : [
{
"color" : "#000000",
"family" : "Times",
"fontspec" : "0",
"size" : "15"
},
...
],
"text" : [
{ "data" : "12/04/54",
"font" : 0,
"height" : 17,
"left" : 628,
"top" : 103,
"width" : 70
},
{ "data" : "28/09/15",
"font" : 0,
"height" : 17,
"left" : 105,
"top" : 206,
"width" : 70
},
{ "data" : "AQUARIUS",
"font" : 0,
"height" : 17,
"left" : 99,
"top" : 170,
"width" : 94
},
{ "data" : " ",
"font" : 0,
"height" : 17,
"left" : 193,
"top" : 170,
"width" : 5
},
{ "data" : "NURSING",
"font" : 0,
"height" : 17,
"left" : 198,
"top" : 170,
"width" : 83
},
...
In order to make things simple, I converted the Andrew Smee PDF to a PNG and resampled it to 892 x 1263 pixels (any size will do, as long as you keep track of it; below, it is saved in 'width' and 'height'). This way I can read pixel coordinates straight off my old PaintShop Pro's status bar :-).
The "Address" field is from 73,161 to 837,193.
My sample "template", with only three fields, is therefore in PHP 5.4+ (with short array syntax, [ ] instead of Array() )
<?php
function template() {
$template = [
'Address' => [ 'x1' => 73, 'y1' => 161, 'x2' => 837, 'y2' => 193 ],
'Medicine1' => [ 'x1' => 1, 'y1' => 283, 'x2' => 251, 'y2' => 299 ],
'Details1' => [ 'x1' => 1, 'y1' => 302, 'x2' => 251, 'y2' => 403 ],
];
foreach ($template as $fieldName => $candidate) {
$template[$fieldName]['elements'] = [ ];
}
return $template;
}
// shell_exec('/usr/local/bin/pdf2json "Andrew-Smee.pdf" andrew-smee.json');
$parsed = json_decode(file_get_contents('andrew-smee.json'), true);
$paged = [ ];
foreach ($parsed as $page) {
$template = template();
foreach ($page['text'] as $text) {
// Will it blend?
foreach ($template as $fieldName => $candidate) {
if ($text['top'] > $candidate['y2']) {
continue; // Too low.
}
if (($text['top']+$text['height']) < $candidate['y1']) {
continue; // Too high.
}
if ($text['left'] > $candidate['x2']) {
continue;
}
if (($text['left']+$text['width']) < $candidate['x1']) {
continue;
}
$template[$fieldName]['elements'][] = $text;
}
}
// Now I must reassemble all my fields
foreach ($template as $fieldName => $data) {
$list = $data['elements'];
usort($list, function($txt1, $txt2) {
for ($r = 8; $r >= 1; $r /= 2) {
if (($txt1['top']/$r) < ($txt2['top']/$r)) {
return -1;
}
if (($txt1['top']/$r) > ($txt2['top']/$r)) {
return 1;
}
if (($txt1['left']/$r) < ($txt2['left']/$r)) {
return -1;
}
if (($txt1['left']/$r) > ($txt2['left']/$r)) {
return 1;
}
}
return 0;
});
$text = '';
$starty = false;
foreach ($list as $data) {
if ($data['top'] > $starty + 5) {
if ($starty > 0) {
$text .= "\n";
}
} else {
// Add space
// $text .= ' ';
}
$starty = $data['top'];
// Add text to current line
$text .= $data['data'];
}
// Remove extra spaces
$text = preg_replace('# +#', ' ', $text);
$template[$fieldName] = $text;
}
$paged[] = $template;
}
print_r($paged);
And the result (on a multipage PDF)
Array
(
[0] => Array
(
[Address] => AQUARIUS NURSING HOME 4-6 SPENCER ROAD, SOUTHSEA PO4 9RN
[Medicine1] => ATORVASTATIN 40MG TABS
[Details1] => take ONE tablet at NIGHT
)
[1] => Array
(
[Address] => AQUARIUS NURSING HOME 4-6 SPENCER ROAD, SOUTHSEA PO4 9RN
[Medicine1] => SOTALOL 80MG TABS
[Details1] => take ONE tablet TWICE each day
DO NOT STOP TAKING UNLESS YOUR DOCTOR TELLS
YOU TO STOP.
)
[2] => Array
(
[Address] => AQUARIUS NURSING HOME 4-6 SPENCER ROAD, SOUTHSEA PO4 9RN
[Medicine1] => LAXIDO ORANGE SF 13.8G SACHETS
[Details1] => ONE to TWO when required
DISSOLVE OR MIX WITH WATER BEFORE TAKING.
NOT IN CASSETTE
)
)
Sometimes it's hard to extract PDFs into the required format/output directly using libraries or tools. The same problem occurred for me recently, when I had 1600+ PDFs and needed to extract the data and store it in a database. I tried almost all the libraries and tools, and none of them helped me. So I put in some manual effort to find a pattern and processed them using PHP. For this I used the PHP library PDF TO HTML.
Install PDF TO HTML library
composer require gufy/pdftohtml-php:~2
This will convert your PDF into HTML code, with each <div> tag representing a page and each <p> tag representing the titles and their values. Now, using the p tags, you can identify the common pattern, and it is not hard to put that into the logic to process all the PDFs and convert them into CSV/XLS or anything else. Since in my case the pattern repeated after every 11 <p> tags, I used this:
$pdf = new Gufy\PdfToHtml\Pdf('<PDF_FILE_PATH>');
// get total no pages
$total_pages = $pdf->getPages();
// Iterate through each page and extract the p tags
for($i = 1; $i <= $total_pages; $i++){
// This will convert pdf to html
$html = $pdf->html($i);
// Create a dom document
$domOb = new DOMDocument();
// load html code in the dom document
$domOb->loadHTML(mb_convert_encoding($html, 'HTML-ENTITIES', 'UTF-8'));
// Get SimpleXMLElement from Dom Node
$sxml = simplexml_import_dom($domOb);
// here you have the p tags
foreach ($sxml->body->div->p as $pTag) {
// your logic
}
}
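For what it's worth, the 11-tag repetition can be handled by collecting the tag texts and splitting them into fixed-size records. A sketch with stand-in data, since the real field layout depends on your PDFs:

```php
<?php
// Collect <p> texts into a flat array, then split into fixed-size records.
// Eleven fields per record matched my PDFs; adjust the count for yours.
function chunkRecords(array $pTexts, $perRecord = 11) {
    return array_chunk($pTexts, $perRecord);
}

$demo = range(1, 22);               // stand-in for 22 extracted <p> strings
$records = chunkRecords($demo, 11); // two records of eleven fields each
echo count($records);               // 2
```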
Hope this helps you as it helped me a lot.
I've been tasked with standardizing some address information. Toward that goal, I'm breaking the address string into granular values (our address schema is very similar to Google's format).
Progress so far:
I'm using PHP, and am currently breaking out Bldg, Suite, Room#, etc... info.
It was all going great until I encountered Floors.
For the most part, the floor info is represented as "Floor 10" or "Floor 86". Nice & easy.
For everything to that point, I can simply break the string on a string ("room", "floor", etc..)
The problem:
But then I noticed something in my test dataset. There are some cases where the floor is represented more like "2nd Floor".
This made me realize that I need to prepare for a whole slew of variations for the FLOOR info.
There are options like "3rd Floor", "22nd floor", and "1ST FLOOR". Then what about spelled out variants such as "Twelfth Floor"?
Man!! This can become a mess pretty quickly.
My Goal:
I'm hoping someone knows of a library or something that already solves this problem.
In reality, though, I'd be more than happy with some good suggestions/guidance on how one might elegantly handle splitting the strings on such diverse criteria (taking care to avoid false positives such as "3rd St").
First of all, you need an exhaustive list of all possible formats of the input, and you need to decide how deep you'd like to go.
If you consider spelled-out variants an invalid case, you can apply simple regular expressions to capture the number and detect the token (room, floor, ...).
I would start by reading up on regex in PHP. For example:
$floorarray = preg_split("/\sfloor\s/i", $floorstring);
Other useful functions are preg_grep, preg_match, etc.
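As a sketch of that approach (the patterns are my guesses at the formats you listed, and they deliberately refuse street names like "3rd St"):

```php
<?php
// Match both "Floor 10" and "2nd Floor" forms; return NULL for anything
// else, including street-name false positives such as "3rd St".
function extractFloor($s) {
    if (preg_match('/\bfloor\s+(\d+)\b/i', $s, $m)) {                // "Floor 10"
        return (int)$m[1];
    }
    if (preg_match('/\b(\d+)(?:st|nd|rd|th)\s+floor\b/i', $s, $m)) { // "2nd Floor"
        return (int)$m[1];
    }
    return null;
}

var_dump(extractFloor('Floor 86'));   // int(86)
var_dump(extractFloor('22nd FLOOR')); // int(22)
var_dump(extractFloor('3rd St'));     // NULL
```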
Edit: added a more complete solution.
This solution takes as an input a string describing the floor. It can be of various formats such as:
Floor 102
Floor One-hundred two
Floor One hundred and two
One-hundred second floor
102nd floor
102ND FLOOR
etc
Until I can look at an example input file, I am just guessing from your post that this will be adequate.
<?php
$errorLog = 'error-log.txt'; // a file to catalog bad entries with bad floors
// These are a few example inputs
$addressArray = array('Fifty-second Floor', 'somefloor', '54th floor', '52qd floor',
'forty forty second floor', 'five nineteen hundredth floor', 'floor fifty-sixth second ninth');
foreach ($addressArray as $id => $address) {
$floor = parseFloor($id, $address);
if ( empty($floor) ) {
error_log('Entry '.$id.' is invalid: '.$address."\n", 3, $errorLog);
} else {
echo 'Entry '.$id.' is on floor '.$floor."\n";
}
}
function parseFloor($id, $address)
{
$floorString = implode(preg_split('/(^|\s)floor($|\s)/i', $address));
if ( preg_match('/(^|^\s)(\d+)(st|nd|rd|th)*($|\s$)/i', $floorString, $matchArray) ) {
// floorString contained a valid numerical floor
$floor = $matchArray[2];
} elseif ( ($floor = word2num($floorString)) != FALSE ) { // note assignment op not comparison
// floorString contained a valid english ordinal for a floor
; // No need to do anything
} else {
// floorString did not contain a properly formed floor
$floor = FALSE;
}
return $floor;
}
function word2num( $inputString )
{
$cards = array('zero',
'one', 'two', 'three', 'four', 'five', 'six', 'seven', 'eight', 'nine', 'ten',
'eleven', 'twelve', 'thirteen', 'fourteen', 'fifteen', 'sixteen', 'seventeen', 'eighteen', 'nineteen', 'twenty');
$cards[30] = 'thirty'; $cards[40] = 'forty'; $cards[50] = 'fifty'; $cards[60] = 'sixty';
$cards[70] = 'seventy'; $cards[80] = 'eighty'; $cards[90] = 'ninety'; $cards[100] = 'hundred';
$ords = array('zeroth',
'first', 'second', 'third', 'fourth', 'fifth', 'sixth', 'seventh', 'eighth', 'ninth', 'tenth',
'eleventh', 'twelfth', 'thirteenth', 'fourteenth', 'fifteenth', 'sixteenth', 'seventeenth', 'eighteenth', 'nineteenth', 'twentieth');
$ords[30] = 'thirtieth'; $ords[40] = 'fortieth'; $ords[50] = 'fiftieth'; $ords[60] = 'sixtieth';
$ords[70] = 'seventieth'; $ords[80] = 'eightieth'; $ords[90] = 'ninetieth'; $ords[100] = 'hundredth';
// break the string at any whitespace, dash, comma, or the word 'and'
$words = preg_split( '/([\s,-](?!and\s)|\sand\s)/i', $inputString );
$sum = 0;
foreach ($words as $word) {
$word = strtolower($word);
$value = array_search($word, $ords); // try the ordinal words
if ($value === FALSE) { $value = array_search($word, $cards); } // try the cardinal words
if ($value === FALSE) {
// if $value is still false, it's not a known number word; fail and exit
return FALSE;
}
if ($value == 100) { $sum *= 100; }
else { $sum += $value; }
}
return $sum;
}
?>
In the general case, parsing words into numbers is not easy. The best thread that I could find that discusses this is here. It is not nearly as easy as the inverse problem of converting numbers into words. My solution only works for numbers <2000, and it liberally interprets poorly formed constructs rather than tossing an error. Also, it is not resilient against spelling mistakes at all. For example:
forty forty second = 82
five nineteen hundredth = 2400
fifty-sixth second ninth = 67
If you have a lot of inputs and most of them are well formed, throwing errors for spelling mistakes is not really a big deal because you can manually correct the short list of problem entries. Silently accepting bad input, however, could be a real problem depending on your application. Just something to think about when deciding if it is worth it to make the conversion code more robust.
I'm trying to print the possible words that can be formed from a phone number in php. My general strategy is to map each digit to an array of possible characters. I then iterate through each number, recursively calling the function to iterate over each possible character.
Here's what my code looks like so far, but it's not working out just yet. Any syntax corrections I can make to get it to work?
$pad = array(
array('0'), array('1'), array('abc'), array('def'), array('ghi'),
array('jkl'), array('mno'), array('pqr'), array('stuv'), array('wxyz')
);
function convertNumberToAlpha($number, $next, $alpha){
global $pad;
for($i =0; $i<count($pad[$number[$next]][0]); $i++){
$alpha[$next] = $pad[$next][0][$i];
if($i<strlen($number) -1){
convertNumberToAlpha($number, $next++, $alpha);
}else{
print_r($alpha);
}
}
}
$alpha = array();
convertNumberToAlpha('22', 0, $alpha);
How is this going to be used? This is not a job for a simple recursive algorithm such as what you have suggested, nor even an iterative approach. An average 10-digit number will yield 59,049 (3^10) possibilities, each of which will have to be evaluated against a dictionary if you want to determine actual words.
Many times, the best approach to this is to pre-compile a dictionary which maps 10-digit numbers to various words. Then, your look-up is a constant O(1) algorithm, just selecting by a 10 digit number which is mapped to an array of possible words.
In fact, pre-compiled dictionaries were the way that T9 worked, mapping dictionaries to trees with logarithmic look-up functions.
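The pre-compiled dictionary idea could be sketched like this (the helper name, keypad mapping, and tiny word list below are my own illustrative assumptions; 'gdkkn' is a fake word included only to show two words colliding on the same digits):

```php
<?php
// Sketch: map each dictionary word to its digit string once, then
// answer "which words match this number?" with a single array lookup.
function wordToDigits($word) {
    $keys = array('a'=>'2','b'=>'2','c'=>'2','d'=>'3','e'=>'3','f'=>'3',
                  'g'=>'4','h'=>'4','i'=>'4','j'=>'5','k'=>'5','l'=>'5',
                  'm'=>'6','n'=>'6','o'=>'6','p'=>'7','q'=>'7','r'=>'7','s'=>'7',
                  't'=>'8','u'=>'8','v'=>'8','w'=>'9','x'=>'9','y'=>'9','z'=>'9');
    $digits = '';
    foreach (str_split(strtolower($word)) as $ch) {
        if (!isset($keys[$ch])) { return null; } // skip words with non-letters
        $digits .= $keys[$ch];
    }
    return $digits;
}

$dictionary = array('hello', 'gdkkn', 'world'); // stand-in word list
$index = array();
foreach ($dictionary as $word) {
    $index[wordToDigits($word)][] = $word; // build the number => words map once
}
print_r($index['43556']); // both 'hello' and 'gdkkn' spell 4-3-5-5-6
```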
The following code should do it. It's fairly straightforward: it uses recursion, each level processes one character of the input, a copy of the current combination is built and passed at each recursive call, and the recursion stops at the level where the last character of the input is processed.
function alphaGenerator($input, &$output, $current = "") {
static $lookup = array(
1 => "1", 2 => "abc", 3 => "def",
4 => "ghi", 5 => "jkl", 6 => "mno",
7 => "pqrs", 8 => "tuv", 9 => "wxyz",
0 => "0"
);
$digit = substr($input, 0, 1); // e.g. "4"
$other = substr($input, 1); // e.g. "3556"
$chars = str_split($lookup[$digit], 1); // e.g. array('g', 'h', 'i')
foreach ($chars as $char) { // e.g. g, h, i
if ($other === false || $other === "") { // base case: no digits left (substr() past the end returns "" on PHP 8+, false on older versions)
$output[] = $current . $char;
} else { // recursive case
alphaGenerator($other, $output, $current . $char);
}
}
}
$output = array();
alphaGenerator("43556", $output);
var_dump($output);
Output:
array(243) {
[0]=>string(5) "gdjjm"
[1]=>string(5) "gdjjn"
...
[133]=>string(5) "helln"
[134]=>string(5) "hello"
[135]=>string(5) "hfjjm"
...
[241]=>string(5) "iflln"
[242]=>string(5) "ifllo"
}
You should read Norvig's article on writing a spell checker in Python: http://norvig.com/spell-correct.html . Although it's a spell checker, and in Python rather than PHP, it's built around the same concept of finding words with possible variations, so it might give you some good ideas.
I'm trying to generate an even distribution of random numbers based on User IDs. That is, I want a random number for each user that remains the same any time that user requests the random number (but the user doesn't need to store the number). My current algorithm (in PHP) to count distribution, for a given large array of userIDs $arr is:
$range = 100;
$results = array_fill(0, $range, 0);
foreach ($arr as $userID) {
$hash = sha1($userID,TRUE);
$data = unpack('L*', $hash);
$seed = 0;
foreach ($data as $integer) {
$seed ^= $integer;
}
srand($seed);
++$results[rand(0, $range-1)];
}
One would hope that this generates an approximately even distribution. But it doesn't! I've checked to make sure that each value in $arr is unique, but one entry in the list always gets much more activity than all the others. Is there a better method of generating a hash of a string that will give an approximately even distribution? Apparently SHA is not up to the job. I've also tried MD5 and a simple crc32, all with the same results!?
Am I crazy? Is the only explanation that I have not, in fact, verified that each entry in $arr is unique?
The sha1 hash numbers are quite uniformly distributed. After executing this:
<?php
$n = '';
$salt = 'this is the salt';
for ($i=0; $i<100000; $i++) {
$n .= implode('', unpack('L*', sha1($i . $salt)));
}
$count = count_chars($n, 1);
$sum = array_sum($count);
foreach ($count as $k => $v) {
echo chr($k)." => ".($v/$sum)."\n";
}
?>
You get this result, the probability of each decimal digit:
0 => 0.083696057956298
1 => 0.12138983759522
2 => 0.094558704004335
3 => 0.07301783188663
4 => 0.092124978934097
5 => 0.088623772577848
6 => 0.11390989553446
7 => 0.092570936094051
8 => 0.12348330833868
9 => 0.11662467707838
You could use the sha1 as a simple random number generator based on the user's id.
In hexadecimal, the distribution is nearly perfect:
// $n .= sha1($i . $salt, false);
0 => 0.06245515
1 => 0.06245665
2 => 0.06258855
3 => 0.0624244
4 => 0.06247255
5 => 0.0625422
6 => 0.0625246
7 => 0.0624716
8 => 0.06257355
9 => 0.0625005
a => 0.0625068
b => 0.0625086
c => 0.0624463
d => 0.06250535
e => 0.06250895
f => 0.06251425
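As a sketch of that suggestion (the function name and the 32-bit truncation are my own assumptions, and this assumes a 64-bit PHP build so hexdec() of 8 hex digits fits in an int), you can derive a stable per-user number straight from the hex digest without ever touching srand():

```php
<?php
// Hypothetical sketch: map a user ID to a stable number in [0, $range)
// directly from the sha1 hex digest; no srand()/rand() involved.
function stableUserNumber($userID, $range = 100) {
    $hex = substr(sha1($userID), 0, 8); // first 32 bits of the digest
    return hexdec($hex) % $range;       // same input, same slot, every time
}

$n = stableUserNumber('user42');
assert($n === stableUserNumber('user42')); // deterministic per user
assert($n >= 0 && $n < 100);
```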
mt_rand() should have a very even distribution over the requested range. When users are created, create a random seed for that user using mt_rand(), then always call mt_srand() with that seed for that user.
To get an even distribution from 0 to 99, as in your example, just call mt_rand(0, $range-1). Doing tricks with sha1, md5, or some other hashing algorithm won't really give you a more even distribution than straight random.
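A minimal sketch of that per-user seeding idea; note that the crc32() seed derivation here is my own stand-in for the stored per-user seed the answer suggests:

```php
<?php
// Sketch: seed the Mersenne Twister deterministically per user so the
// same user always draws the same number; crc32() is just one way to
// turn a user ID string into an integer seed.
function userRandom($userID, $range = 100) {
    mt_srand(crc32($userID));
    return mt_rand(0, $range - 1);
}

assert(userRandom('alice') === userRandom('alice')); // stable per user
```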
It would be helpful if you posted the results that led you to conclude you're not getting an appropriate distribution, but it's likely one of three things is going on here:
You're simply looking at too small a sample, and/or you're misinterpreting your data. As others have commented, it's completely reasonable for a uniform distribution to not have perfectly uniform output.
You'd see better results if you used mt_rand instead of rand.
(Personally, I think this is most likely.) You're over-optimizing your seed generation, and losing data / pigeonholing / otherwise hurting your ability to generate random numbers. Reading your code, I think you're doing the following:
Generating a uniform random hash of an unknown value
Splitting the hash into longs and bitwise-XORing them together
Setting rand's seed and generating a random number from that seed
But why are you doing step 2? What benefit do you think you're getting from it? Try taking that step out, just use the first value you extract from the hash as your seed, and see if that doesn't give you better results. A good rule of thumb with randomness: don't try to outsmart the people who implemented the algorithms, it can't be done :)
While all of the answers here are good, I will provide the answer that was correct for me, which is that I was, indeed, crazy. Apparently the uniq command does not work the way I expected (the data needs to be sorted first), so the explanation was indeed that the values in $arr were not unique.