I know that there's a Steam API allowing me to use data from Steam Community.
My question is, does anyone know if there's a Steam Market API?
For example, I want to get the current price of an item in the Steam Market.
I've googled and haven't found anything yet.
I'd be glad to have your help.
I could not find any documentation, but I use:
http://steamcommunity.com/market/priceoverview/?appid=730&currency=3&market_hash_name=StatTrak%E2%84%A2 M4A1-S | Hyper Beast (Minimal Wear)
to return a JSON.
At time of writing, it returns:
{"success":true,"lowest_price":"261,35€ ","volume":"11","median_price":"269,52€ "}
You can change the currency: 1 is USD, 3 is EUR, and there are others (see the list further down).
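For example, a minimal PHP sketch that fetches and decodes that JSON (this assumes plain file_get_contents is allowed to make HTTPS requests; the item and currency are the ones from the URL above):
<?php
// Build the priceoverview URL for CS:GO (appid 730), prices in EUR (currency 3).
$params = http_build_query([
    'appid'            => 730,
    'currency'         => 3,
    'market_hash_name' => 'StatTrak™ M4A1-S | Hyper Beast (Minimal Wear)',
]);
$data = json_decode(file_get_contents('https://steamcommunity.com/market/priceoverview/?' . $params), true);
if ($data !== null && !empty($data['success'])) {
    echo $data['lowest_price'] . ' / ' . $data['median_price'] . ' (volume ' . $data['volume'] . ')' . PHP_EOL;
}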
A better search API that can give you all the results for a game. Example using PUBG, which only has 272 items; if your game has more, try changing the count parameter at the end:
https://steamcommunity.com/market/search/render/?search_descriptions=0&sort_column=default&sort_dir=desc&appid=578080&norender=1&count=500
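A hedged PHP sketch for reading that search endpoint (the total_count, results, hash_name and sell_price_text field names are assumptions about the returned JSON, so print_r the decoded array and adjust as needed):
<?php
// Fetch one page of PUBG (appid 578080) market listings as JSON.
$url = 'https://steamcommunity.com/market/search/render/'
     . '?search_descriptions=0&sort_column=default&sort_dir=desc'
     . '&appid=578080&norender=1&count=100&start=0';
$page = json_decode(file_get_contents($url), true);
echo 'Total items: ' . ($page['total_count'] ?? '?') . PHP_EOL;
foreach ($page['results'] ?? [] as $item) {
    // Assumed per-item fields; inspect one entry to confirm them.
    echo $item['hash_name'] . ' => ' . $item['sell_price_text'] . PHP_EOL;
}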
I indexed the available currencies Steam uses for the currency argument (e.g. currency=3) as:
1 : $63.83
2 : £46.85
3 : 52,--€
4 : CHF 56.41
5 : 4721,76 pуб.
6 : 235,09zł
7 : R$ 340,80
8 : ¥ 6,627.08
9 : 534,70 kr
10 : Rp 898 383.24
11 : RM257.74
12 : P3,072.66
13 : S$84.47
14 : ฿1,921.93
15 : 1.474.136,93₫
16 : ₩ 69,717.79
17 : 468,47 TL
18 : 2 214,94₴
19 : Mex$ 1,557.75
20 : CDN$ 99.09
21 : A$ 100.40
22 : NZ$ 107.55
23 : ¥ 505.96
24 : ₹ 5,733.04
25 : CLP$ 55.695,47
26 : S/.283.03
27 : COL$ 271.637,06
28 : R 1 193.49
29 : HK$ 606.83
30 : NT$ 2,189.42
31 : 293.64 SR
32 : 287.51 AED
Python dictionary with currency abbreviations and their codes:
currencies = {
"USD": 1, # United States dollar
"GBP": 2, # British pound sterling
"EUR": 3, # The euro
"CHF": 4, # Swiss franc
"RUB": 5, # Russian ruble
"PLN": 6, # Polish złoty
"BRL": 7, # Brazilian real
"JPY": 8, # Japanese yen
"SEK": 9, # Swedish krona
"IDR": 10, # Indonesian rupiah
"MYR": 11, # Malaysian ringgit
"BWP": 12, # Botswana pula
"SGD": 13, # Singapore dollar
"THB": 14, # Thai baht
"VND": 15, # Vietnamese dong
"KRW": 16, # South Korean won
"TRY": 17, # Turkish lira
"UAH": 18, # Ukrainian hryvnia
"MXN": 19, # Mexican Peso
"CAD": 20, # Canadian dollar
"AUD": 21, # Australian dollar
"NZD": 22, # New Zealand dollar
"CNY": 23, # Chinese yuan
"INR": 24, # Indian rupee
"CLP": 25, # Chilean peso
"PEN": 26, # Peruvian sol
"COP": 27, # Colombian peso
"ZAR": 28, # South African rand
"HKD": 29, # Hong Kong dollar
"TWD": 30, # New Taiwan dollar
"SAR": 31, # Saudi riyal
"AED": 32 # United Arab Emirates dirham
}
To add to what the other people have said, the temporary ban on the JSON site happens if you try to request 20 items within a minute's time from the server. If you're creating a script to request those links, add a three-second delay between each request.
Also, the ban only lasts for the remaining server-side minute (which may not be 60 seconds).
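If you do script this, a tiny sketch of a throttled loop (the URL list here is just a hypothetical placeholder, built as in the earlier answer):
<?php
// Hypothetical list of priceoverview URLs to poll.
$urls = [
    'https://steamcommunity.com/market/priceoverview/?appid=730&currency=1&market_hash_name=AK-47%20%7C%20Redline%20%28Field-Tested%29',
];
$prices = [];
foreach ($urls as $url) {
    $prices[] = json_decode(file_get_contents($url), true);
    sleep(3); // a three-second delay keeps you well under ~20 requests per minute
}
print_r($prices);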
You can use SteamApis.com to acquire Steam market prices and item information. The data is returned in JSON. The service is not free but also not that expensive.
The documentation is available to view here. It has detailed information on what endpoints are available and what data is returned.
There is no such API for now. But this link may help you:
Get the price of an item on Steam Community Market with PHP and Regex
It's basically what you want, with pure PHP DOM parsing instead of an API. The main drawback is that you may have to change your code if Steam updates their HTML markup.
A scraper script that maps the search results from https://steamcommunity.com/market/search?q= to an array of objects:
Array.from(document.querySelectorAll('a.market_listing_row_link')).map(item => {
    const itemInfo = item.children[0]
    return {
        isStatTrek: itemInfo.getAttribute('data-hash-name').startsWith('StatTrak™'),
        condition: itemInfo.getAttribute('data-hash-name').match(/.*\((.*)\)/)[1],
        priceUSD: Number(itemInfo.querySelector('.normal_price[data-price]').getAttribute('data-price')/100)
    }
})
It can be used inside an iframe together with the "weapon | skin name (condition)" search template.
I can separate data from the plain text below with Regex.
Plain text:
190.A 42-year-old male patient has been delivered to a hospital in a grave condition with dyspnea, cough with expectoration of purulent
sputum, fever up to 39,5 oC.The ?rst symptoms appeared 3 weeks ago.
Two weeks ago, a local therapist diagnosed him wi- th acute
right-sided pneumonia. Over the last 3 days, the patient’s condition
deteriorated: there was a progress of dyspnea, weakness, lack of
appetite. Chest radiography con?rms a rounded shadow in the lower lobe
of the right lung with a horizontal?uid level, the right si- nus is
not clearly visualized. What is the most likely diagnosis? A.Abscess
of the right lung B.Acute pleuropneumonia C.Right pulmonary empyema
D.Atelectasis of the right lung E.Pleural effusion 191.An 11-year-old
boy complains of general weakness, fever up to 38,2 oC, pain and
swelli- ng of the knee joints, feeling of irregular heartbeat. 3 weeks
ago, the child had quinsy. Knee joints are swollen, the overlying skin
and skin of the knee region is reddened, local temperature is
increased, movements are li- mited. Heart sounds are muf?ed,
extrasystole is present, auscultation reveals apical systolic murmur
that is not conducted to the left ingui- nal region. ESR is 38 mm/h.
CRP is 2+, anti- streptolysin O titre - 40 0. What is the most likely
diagnosis? A.Acute rheumatic fever B.Vegetative dysfunction
C.Non-rheumatic carditis D.Juvenile rheumatoid arthritis E.Reactive
arthritis 192.A 28-year-old male patient complains of sour
regurgitation, cough and heartburn that occurs every day after having
meals, when bending forward or lying down. These problems have been
observed for 4 years. Objective status and laboratory values are
normal. FEGDS revealed endoesophagitis. What is the leading factor in
the development of this disease? A.Failure of the lower esophageal
sphincter B.Hypersecretion of hydrochloric acid C.Duodeno-gastric
re?ux D.Hypergastrinemia E.Helicobacter pylori infection 193.On
admission a 35-year-old female reports acute abdominal pain, fever up
to 38,8 oC, mucopurulent discharges. The pati- ent is nulliparous, has
a history of 2 arti?cial abortions. The patient is unmarried, has
sexual Krok 2 Medicine 20 14 24 contacts. Gynecological examination
reveals no uterus changes. Appendages are enlarged, bilaterally
painful. There is profuse purulent vaginal discharge. What study is
required to con?rm the diagnosis? A.Bacteriologic and bacteriascopic
studies B.Hysteroscopy C.Curettage of uterine cavity D.Vaginoscopy
E.Laparoscopy
What did I do for this?
For the question section:
/(\d+)\.\s*([A-Z].*?)\s+([A-Z]\..*?)(?=\d+\.\s*[A-Z]|$)/s
For the options of question section:
/\s+(?=[A-Z0-9][,.:])/
PHP:
$soruAlimPattern = [
'q&a' => '/(\d+)\.\s*([A-Z].*?)\s+([A-Z]\..*?)(?=\d+\.\s*[A-Z]|$)/s',
'answers' => '/\s+(?=[A-Z0-9][,.:])/'
];
$res = [];
if (preg_match_all($soruAlimPattern['q&a'], $temizSoruCikisi, $out, PREG_SET_ORDER) > 0) {
foreach ($out AS $k => $v) {
// remove the full match ($0)
$res[$k] = array_slice($v, 1, 3);
// split the answers
$res[$k][2] = preg_split($soruAlimPattern['answers'], $res[$k][2]);
}
}
$sorularJsonKodlaniyor = json_encode($res);
[...]
I can distinguish between the question and the question options, but is it possible to use a single regex instead of two different regexes?
I don't know how good the quality of the PHP code is, but it works.
My problem:
1. Sometimes there are unidentifiable letters in the question and these
undefined characters are indicated with a question mark. For
example: `fever up to 39,5 oC.The ?rst symptoms` or `..39,5 oC.The ?rst symptoms..`
2. Due to numerical values in the question, the regex splits the question in half. For example: `... anti- streptolysin O titre - 40 0. What is the most likely diagnosis? ` Here the regex splits the question because of the number "0".
Expected JSON Format:
[
{
"question": "190.A 42-year-old male patient has been delivered to a hospital in a grave condition with dyspnea, cough with expectoration of purulent sputum, fever up to 39,5 oC.The ?rst symptoms appeared 3 weeks ago. Two weeks ago, a local therapist diagnosed him wi- th acute right-sided pneumonia. Over the last 3 days, the patient’s condition deteriorated: there was a progress of dyspnea, weakness, lack of appetite. Chest radiography con?rms a rounded shadow in the lower lobe of the right lung with a horizontal?uid level, the right si- nus is not clearly visualized. What is the most likely diagnosis? ",
"answers": [
"A.Abscess of the right lung ",
"B.Acute pleuropneumonia ",
"C.Right pulmonary empyema ",
"D.Atelectasis of the right lung ",
"E.Pleural effusion 1"
]
},
{
"question": "191.An 11-year-old boy complains of general weakness, fever up to 38,2 oC, pain and swelli- ng of the knee joints, feeling of irregular heartbeat. 3 weeks ago, the child had quinsy. Knee joints are swollen, the overlying skin and skin of the knee region is reddened, local temperature is increased, movements are li- mited. Heart sounds are muf?ed, extrasystole is present, auscultation reveals apical systolic murmur that is not conducted to the left ingui- nal region. ESR is 38 mm/h. CRP is 2+, anti- streptolysin O titre - 40 0. What is the most likely diagnosis? ",
"answers": [
"A.Acute rheumatic fever ",
"B.Vegetative dysfunction ",
"C.Non-rheumatic carditis ",
"D.Juvenile rheumatoid arthritis ",
"E.Reactive arthritis 1"
]
},
{
"question": "192.A 28-year-old male patient complains of sour regurgitation, cough and heartburn that occurs every day after having meals, when bending forward or lying down. These problems have been observed for 4 years. Objective status and laboratory values are normal. FEGDS revealed endoesophagitis. What is the leading factor in the development of this disease? ",
"answers": [
"A.Failure of the lower esophageal sphincter ",
"B.Hypersecretion of hydrochloric acid ",
"C.Duodeno-gastric re?ux ",
"D.Hypergastrinemia ",
"E.Helicobacter pylori infection 1"
]
},
{
"question": "193.On admission a 35-year-old female reports acute abdominal pain, fever up to 38,8 oC, mucopurulent discharges. The pati- ent is nulliparous, has a history of 2 arti?cial abortions. The patient is unmarried, has sexual Krok 2 Medicine 20 14 24 contacts. Gynecological examination reveals no uterus changes. Appendages are enlarged, bilaterally painful. There is profuse purulent vaginal discharge. What study is required to con?rm the diagnosis? ",
"answers": [
"A.Bacteriologic and bacteriascopic studies ",
"B.Hysteroscopy ",
"C.Curettage of uterine cavity ",
"D.Vaginoscopy ",
"E.Laparoscopy 1"
]
}
]
How can I overcome these problems?
What you might do is use preg_split to get all the strings with the right characters at the start like 190.A or A.
\b(?=(?:\d+|[A-Z])\.[A-Z])
\b Word boundary
(?= Positive lookahead, assert what is on the right is
(?:\d+|[A-Z]) Match either 1+ digits or a single char A-Z
\.[A-Z] Match . and a single char A-Z
) Close positive lookahead
Regex demo | Php demo
If you have all those entries in an array, you could for example use array_reduce to get the array structure that you need for the json output.
$pattern = "/\b(?=(?:\d+|[A-Z])\.[A-Z])/";
$result = preg_split($pattern, $data, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
$result = array_reduce($result, function($carry, $item){
// If the string starts with a digit
if (ctype_digit(substr($item, 0, 1))) {
// Create the questions key
$carry[] = ["question" => $item];
return $carry;
}
// Get reference to the last added array in $carry
end($carry);
$last = &$carry[key($carry)];
// Create the answers key
array_key_exists("answers", $last) ? $last["answers"][] = $item : $last["answers"] = [$item];
return $carry;
}, []);
print_r(json_encode($result));
Output
[
{
"question": "190.A 42-year-old male patient has been delivered to a hospital in a grave condition with dyspnea, cough with expectoration of purulent sputum, fever up to 39,5 oC.The ?rst symptoms appeared 3 weeks ago. Two weeks ago, a local therapist diagnosed him wi- th acute right-sided pneumonia. Over the last 3 days, the patient\u2019s condition deteriorated: there was a progress of dyspnea, weakness, lack of appetite. Chest radiography con?rms a rounded shadow in the lower lobe of the right lung with a horizontal?uid level, the right si- nus is not clearly visualized. What is the most likely diagnosis? ",
"answers": [
"A.Abscess of the right lung ",
"B.Acute pleuropneumonia ",
"C.Right pulmonary empyema ",
"D.Atelectasis of the right lung ",
"E.Pleural effusion "
]
},
{
"question": "191.An 11-year-old boy complains of general weakness, fever up to 38,2 oC, pain and swelli- ng of the knee joints, feeling of irregular heartbeat. 3 weeks ago, the child had quinsy. Knee joints are swollen, the overlying skin and skin of the knee region is reddened, local temperature is increased, movements are li- mited. Heart sounds are muf?ed, extrasystole is present, auscultation reveals apical systolic murmur that is not conducted to the left ingui- nal region. ESR is 38 mm\/h. CRP is 2+, anti- streptolysin O titre - 40 0. What is the most likely diagnosis? ",
"answers": [
"A.Acute rheumatic fever ",
"B.Vegetative dysfunction ",
"C.Non-rheumatic carditis ",
"D.Juvenile rheumatoid arthritis ",
"E.Reactive arthritis "
]
},
{
"question": "192.A 28-year-old male patient complains of sour regurgitation, cough and heartburn that occurs every day after having meals, when bending forward or lying down. These problems have been observed for 4 years. Objective status and laboratory values are normal. FEGDS revealed endoesophagitis. What is the leading factor in the development of this disease? ",
"answers": [
"A.Failure of the lower esophageal sphincter ",
"B.Hypersecretion of hydrochloric acid ",
"C.Duodeno-gastric re?ux ",
"D.Hypergastrinemia ",
"E.Helicobacter pylori infection "
]
},
{
"question": "193.On admission a 35-year-old female reports acute abdominal pain, fever up to 38,8 oC, mucopurulent discharges. The pati- ent is nulliparous, has a history of 2 arti?cial abortions. The patient is unmarried, has sexual Krok 2 Medicine 20 14 24 contacts. Gynecological examination reveals no uterus changes. Appendages are enlarged, bilaterally painful. There is profuse purulent vaginal discharge. What study is required to con?rm the diagnosis? ",
"answers": [
"A.Bacteriologic and bacteriascopic studies ",
"B.Hysteroscopy ",
"C.Curettage of uterine cavity ",
"D.Vaginoscopy ",
"E.Laparoscopy"
]
}
]
I would like to gather the historical price data shown on this page and format the output as a csv file.
require_once('simple_html_dom.php');
$html = file_get_html("https://coinmarketcap.com/currencies/binance-coin/historical-data/?start=20180101&end=20180519"); // you don't need to use curl
$yourDesiredContent = $html->find('#historical-data .table', 0)->plaintext;
The problem is this gathers everything into one "string", but I need to separate each td value in the table with a comma or whatever is appropriate for CSV format.
Output:
Date Open High Low Close Volume Market Cap May 18, 2018 12.36 16.22 12.05 15.14 243,261,000 1,409,710,000 May 17, 2018 12.32 12.97 12.27 12.46 54,251,500 1,404,490,000 May 16, 2018 12.56 12.57 11.95 12.26 35,233,200 1,432,140,000 May 15, 2018 12.85 13.27 12.51 12.57 45,696,200 1,465,880,000 May 14, 2018 13.16 13.17 12.36 12.87 49,317,700 1,500,590,000 May 13, 2018 13.01 13.55 12.74 13.12 70,816,500 1,483,450,000 May 12, 2018 12.97 13.15 12.25 12.94 44,001,000 1,479,430,000 May 11, 2018 13.88 14.09 12.59 12.99 57,027,500 1,582,950,000 May 10, 2018 14.68 15.05 13.82 13.82 68,250,000
Is it possible to do this?
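One way to keep the cell boundaries, sketched with the same simple_html_dom API (the th/td structure of that table and the comma-separated selector are assumptions, so check them against the actual markup):
<?php
require_once('simple_html_dom.php');
$html  = file_get_html("https://coinmarketcap.com/currencies/binance-coin/historical-data/?start=20180101&end=20180519");
$table = $html->find('#historical-data .table', 0);
$out = fopen('history.csv', 'w');
foreach ($table->find('tr') as $row) {
    $cells = [];
    foreach ($row->find('th, td') as $cell) { // header row uses th, data rows use td
        $cells[] = trim($cell->plaintext);
    }
    fputcsv($out, $cells); // one CSV line per table row
}
fclose($out);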
I have the following PDF file (Marsheet PDF) and I'm trying to extract the data shown in the example below. I have tried PDFParse, PDFtoText, etc., but they don't work properly. Is there any solution or example?
<?php
//Output something like this, or suggest a better option if you have one
$data_array = array(
array( "name" => "Mr Andrew Smee",
"medicine_name" => "FLUOXETINE 20MG CAPS",
"description" => "TAKE ONE ONCE DAILY FOR LOW MOOD. CAUTION:YOUR DRIVING REACTIONS MAY BE IMPAIRED",
"Dose" => '9000',
"StartDate" => '28/09/15',
"period" => '28',
"Quantity" => '28'
),
array( "name" => "Mr Andrew Smee",
"medicine_name" => "SINEMET PLUS 125MG TAB",
"description" => "TAKE ONE TABLET FIVE TIMES A DAY FOR PD
(8am,11am,2pm,5pm,8pm)
THIS MEDICINE MAY COLOUR THE URINE. THIS IS
HARMLESS. CAUTION:REACTIONS MAY BE IMPAIRED
WHILST DRIVING OR USING TOOLS OR MACHINES.",
"Dose" => '0800,1100,1400,1700,2000',
"StartDate" => '28/09/15',
"period" => '28',
"Quantity" => '140'
), etc...
);
?>
TL;DR You are almost certainly not going to do this with a library alone.
Update: a working solution (not a perfect solution!) is coded below, see 'in practice'. It requires:
defining the areas where the text is;
the possibility of installing and running a command line tool, pdf2json.
Why it is not easy
PDF files contain typesetting primitives, not extractable text; sometimes the difference is slight enough that you can get by, but usually having only extractable text, in an easily accessible format, means that the document looks "slightly wrong" aesthetically, and therefore the generators that create the "best" PDFs for text extraction are also the least used.
Some generators exist that embed both the typesetting layer and an invisible text layer, allowing you to see the beautiful text and still extract the good text. At the expense, you guessed it, of the PDF size.
In your example, you only have the beautiful text inside the file, and the existence of a grid means that the text needs to be properly typeset.
So, inside, what there actually is to be read is this. Notice the letters inside round parentheses:
/R8 12 Tf
0.99941 0 0 1 66 765.2 Tm
[(M)2.51003(r)2.805( )-2.16558(A)-3.39556(n)
-4.33056(d)-4.33056(r)2.805(e)-4.33056(w)11.5803
( )-2.16558(S)-3.39556(m)-7.49588(e)-4.33117(e)556]TJ
ET
and if you assemble the (s)(i)(n)(g)(l)(e) letters inside, you do get "Mr Andrew Smee", but then you need to know where these letters are in relation to the page, and to the data grid. Also you need to beware of spaces. Above, there is one explicit space character, parenthesized, between "Mr" and "Andrew"; but if you removed such spaces and fixed the offsets of all the following letters, you would still read "Mr Andrew Smee" and save two characters. Some PDF "optimizers" will try to do just that, and without considering offsets, the "text" string of that entity will just be "MrAndrewSmee".
And that is why most text extraction libraries, which can't easily manage character offsets (they use "text lines", and by and large they don't care about grids) will give you something like
Mr Andrew Smee 505738 12/04/54 (61
or, in the case of "optimized" texts,
MrAndrewSmee50573812/04/54(61
(which still gives the dangerous illusion of being parsable with a regex -- sometimes it is, sometimes it isn't; usually it works 95% of the time, so the remaining 5% turns into a maintenance nightmare from Hell), but, more importantly, they will not be able to get you the content of the medication details timetable divided by cell.
Any information which is space-correlated (e.g. a name has different meanings if it's written in the left "From" or in the right "To" box) will be either lost, or variably difficult to reconstruct.
There are PDF "protection" schemes that exploit the capability of offsetting the text, and will scramble the strings. With offsets, you can write:
9 l 10 d 4 l 5 1 H 2 e 3 l o 6 W 7 o 8 r
and the PDF viewer will show you "Hello World"; but read the text directly, and you get "ldlHeloWor", or worse. You could add malicious text and place it outside the page, or write it in transparent color, to prank whoever succeeds in removing the easily removed optional copy-paste protection of PDF files. Most libraries would blithely suck up the prank text together with the good text.
Trying with most libraries, and why it might work (but probably not)
Libraries such as XPDF (and its wrappers phpxpdf, pdf2html, etc.) will give you a simple call such as this
// open PDF
$pdfToText->open('PDF-book.pdf');
// PDF text is now in the $text variable
$text = $pdfToText->getText();
$pdfToText->close();
and your "text" will contain everything, and be something like:
...
START DATE START DAY
WEEK 1 WEEK 2 WEEK 3 WEEK 4
DATE 28 29 30 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
19/10/15
Medication Details
Commencing
D.O.B
Doctor
Hour:Dose 1 2 3 4 5 6 7 1 2 3 4 5 6 7 1 2 3 4 5 6 7 1 2 3 4 5 6 7
Patient
Number
Period
MEDICATION ADMINISTRATION RECORD SHEETS Pharmacy No.
Document No.
02392 731680
28
0900 1
TAKE ONE ONCE DAILY FOR LOW MOOD.
CAUTION:YOUR DRIVING REACTIONS MAY BE IMPAIRED.
28
FLUOXETINE 20MG CAPS
Received Quantity returned quant. by destroyed quant. by
So, reading above, ask yourself - what is that second 28? Can you tell whether it is the received quantity, the returned quantity, the destroyed quantity without looking at the PDF? Sure, if there's only one number, chances are that it will be the received quantity. It becomes a bet.
And is 02392 731680 the document number? It looks like it is (it is not).
Notice also that in the PDF, the medicine name is before the notes. In the extracted text, it is after. By looking at the offsets inside the PDF, you understand why, and it's even a good decision -- but looking at the extracted text, it's not so easy.
So, automatic analysis looks enticingly like it can be done, but as I said, it is a very risky business. It is brittle: someone entering the wrong (for you) text somewhere in the document, sometimes even filling the fields not in sequential order, will result in a PDF which is visually correct and, at the same time, unexplainably unparseable. What are you going to tell your users?
Sometimes, a subset of the available information is stable enough for you to get the work done. In that case, XPDF or PDF2HTML, a bunch of regex, and you're home free in half a day. Yay you! Just keep in mind that any "little" addition to the project might then be impossible. Two numbers are added that are well separated in the PDF; are they 128 and 361, or 12 and 8361, or 1283 and 61? All you get in $text is 128361.
So if you go that way, document it clearly and avoid expectations which might be difficult to maintain. Your initial project might work so well, so fast, and with so little effort, that an addition is accepted unbeknownst to you - and you're then required to do the impossible. Explaining why the first 95% was easy and the subsequent 5% very hard might be more than your job is worth.
One difficult way to do it, which worked for me
But can you do the same thing "by hand"? After all, by looking at the PDF, you know what you are seeing. Can the same thing be done by a machine? (this still applies). Sure, in this - after all - clearly delimited problem of computer vision, you very probably can. It just won't be quick and easy. You need:
a very low level library (or reading the PDF yourself; you just need to uncompress it first, and there are tools for that, e.g. pdftk). You need to recover the text with coordinates. "C" for "hospitalized" is worth nothing. "C, 495.2, 882.7" plus the coordinates of your grid tells you of a hospitalization on October 13th, 2015 -- and that is the information you are after!
patience (or a tool) to input the coordinates of the text zones. You need to tell the system which area is October 13th, 2015... as well as all the other days. For example:
// Cell name X1 Y1 X2 Y2 Text
[ 'PatientName', 60, 760, 300, 790, '' ],
[ 'PatientNumber', 310, 760, 470, 790, '' ],
...
[ 'Grid01Y01X01', 90, 1020, 110, 1040, '' ],
...
Note that very many of those values you can calculate programmatically: once you have the top left corner and know one cell's size, the others are more or less calculable with a very slight error, as sketched below. You needn't input by hand six grids of four weeks with six rows each, seven days per week.
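A rough PHP sketch of that calculation (every number below is a made-up placeholder; only the top-left cell and the cell size need to be measured by hand):
<?php
// Generate [name, x1, y1, x2, y2, text] rectangles for the day grid,
// starting from one measured top-left corner and a fixed cell size.
function gridCells($x0, $y0, $cellW, $cellH, $weeks = 4, $days = 7, $rows = 6) {
    $cells = [];
    for ($row = 1; $row <= $rows; $row++) {
        for ($col = 1; $col <= $weeks * $days; $col++) {
            $x1 = $x0 + ($col - 1) * $cellW;
            $y1 = $y0 + ($row - 1) * $cellH;
            $cells[] = [ sprintf('Grid01Y%02dX%02d', $row, $col), $x1, $y1, $x1 + $cellW, $y1 + $cellH, '' ];
        }
    }
    return $cells;
}
// 6 rows of 28 day-columns, starting at (90, 1020) with 20 x 20 cells as in the table above
$cellTable = gridCells(90, 1020, 20, 20);
print_r($cellTable[0]); // [ 'Grid01Y01X01', 90, 1020, 110, 1040, '' ]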
You can use the same structure to create a PNG with red areas to indicate which cells you've got covered. That will be useful to visually check you did not forget anything.
At that point you parse the PDF, and every time you find a text at coordinates (x1,y1) you scan all of your cells and determine where the text should be (there are faster ways to do that using XY binary search trees). If you find 'Mr Andrew S' at 66, 765.2 you add it to PatientName. Then you find 'mee' at 109.2, 765.2 and you also add it to PatientName. Which now reads 'Mr Andrew Smee'.
If the horizontal distance is above a certain threshold, you add a space (or more than one).
(For very small text there's a slight risk of the letters being output out of order by the PDF driver and corrected through kerning, but usually that's not a problem).
At the end of the whole cycle you will be left with
[ 'PatientName', 60, 760, 300, 790, 'Mr Andrew Smee' ],
[ 'PatientNumber', 310, 760, 470, 790, '505738' ],
and so on.
I did this kind of work for a large PDF import project some years back and it worked like a charm. Nowadays, I think most of the heavy lifting could be done with TcLibPDF.
The painful part is recording by hand, the first time, the information for the grid; possibly there might be tools for that, or one could whip up an HTML5/AJAX editor using canvases.
In practice
Most of the work has already been done by the excellent pdf2json tool, which consuming the 'Andrew Smee' PDF, outputs something like:
[
{
"height" : 1263,
"width" : 892
"number" : 1,
"pages" : 1,
"fonts" : [
{
"color" : "#000000",
"family" : "Times",
"fontspec" : "0",
"size" : "15"
},
...
],
"text" : [
{ "data" : "12/04/54",
"font" : 0,
"height" : 17,
"left" : 628,
"top" : 103,
"width" : 70
},
{ "data" : "28/09/15",
"font" : 0,
"height" : 17,
"left" : 105,
"top" : 206,
"width" : 70
},
{ "data" : "AQUARIUS",
"font" : 0,
"height" : 17,
"left" : 99,
"top" : 170,
"width" : 94
},
{ "data" : " ",
"font" : 0,
"height" : 17,
"left" : 193,
"top" : 170,
"width" : 5
},
{ "data" : "NURSING",
"font" : 0,
"height" : 17,
"left" : 198,
"top" : 170,
"width" : 83
},
...
In order to make things simple, I convert the Andrew Smee PDF to a PNG and resample it to 892 x 1263 pixel (any size will do, as long as you keep track of the size. Below, they are saved in 'width' and 'height'). This way I can read pixel coordinates straight off my old PaintShop Pro's status bar :-).
The "Address" field is from 73,161 to 837,193.
My sample "template", with only three fields, is therefore in PHP (5.4+, with short array syntax, [ ] instead of Array())
<?php
function template() {
$template = [
'Address' => [ 'x1' => 73, 'y1' => 161, 'x2' => 837, 'y2' => 193 ],
'Medicine1' => [ 'x1' => 1, 'y1' => 283, 'x2' => 251, 'y2' => 299 ],
'Details1' => [ 'x1' => 1, 'y1' => 302, 'x2' => 251, 'y2' => 403 ],
];
foreach ($template as $fieldName => $candidate) {
$template[$fieldName]['elements'] = [ ];
}
return $template;
}
// shell_exec('/usr/local/bin/pdf2json "Andrew-Smee.pdf" andrew-smee.json');
$parsed = json_decode(file_get_contents('andrew-smee.json'), true);
$paged = [ ];
foreach ($parsed as $page) {
$template = template();
foreach ($page['text'] as $text) {
// Will it blend?
foreach ($template as $fieldName => $candidate) {
if ($text['top'] > $candidate['y2']) {
continue; // Too low.
}
if (($text['top']+$text['height']) < $candidate['y1']) {
continue; // Too high.
}
if ($text['left'] > $candidate['x2']) {
continue;
}
if (($text['left']+$text['width']) < $candidate['x1']) {
continue;
}
$template[$fieldName]['elements'][] = $text;
}
}
// Now I must reassemble all my fields
foreach ($template as $fieldName => $data) {
$list = $data['elements'];
usort($list, function($txt1, $txt2) {
for ($r = 8; $r >= 1; $r /= 2) {
if (($txt1['top']/$r) < ($txt2['top']/$r)) {
return -1;
}
if (($txt1['top']/$r) > ($txt2['top']/$r)) {
return 1;
}
if (($txt1['left']/$r) < ($txt2['left']/$r)) {
return -1;
}
if (($txt1['left']/$r) > ($txt2['left']/$r)) {
return 1;
}
}
return 0;
});
$text = '';
$starty = false;
foreach ($list as $data) {
if ($data['top'] > $starty + 5) {
if ($starty > 0) {
$text .= "\n";
}
} else {
// Add space
// $text .= ' ';
}
$starty = $data['top'];
// Add text to current line
$text .= $data['data'];
}
// Remove extra spaces
$text = preg_replace('# +#', ' ', $text);
$template[$fieldName] = $text;
}
$paged[] = $template;
}
print_r($paged);
And the result (on a multipage PDF)
Array
(
[0] => Array
(
[Address] => AQUARIUS NURSING HOME 4-6 SPENCER ROAD, SOUTHSEA PO4 9RN
[Medicine1] => ATORVASTATIN 40MG TABS
[Details1] => take ONE tablet at NIGHT
)
[1] => Array
(
[Address] => AQUARIUS NURSING HOME 4-6 SPENCER ROAD, SOUTHSEA PO4 9RN
[Medicine1] => SOTALOL 80MG TABS
[Details1] => take ONE tablet TWICE each day
DO NOT STOP TAKING UNLESS YOUR DOCTOR TELLS
YOU TO STOP.
)
[2] => Array
(
[Address] => AQUARIUS NURSING HOME 4-6 SPENCER ROAD, SOUTHSEA PO4 9RN
[Medicine1] => LAXIDO ORANGE SF 13.8G SACHETS
[Details1] => ONE to TWO when required
DISSOLVE OR MIX WITH WATER BEFORE TAKING.
NOT IN CASSETTE
)
)
Sometimes it's hard to extract PDFs into the required format/output directly using libraries or tools. The same problem occurred for me recently, when I had 1600+ PDFs and needed to extract the data and store it in a DB. I tried almost all the libraries and tools, and none of them helped me. So I put in some manual effort to find a pattern and process them using PHP. For this I used the PHP library PDF TO HTML.
Install PDF TO HTML library
composer require gufy/pdftohtml-php:~2
This will convert your PDF into HTML code, with each <div> tag representing a page and each <p> tag representing the titles and their values. Now, using the <p> tags, if you can identify a common pattern it is not hard to put that into the logic that processes all the PDFs and converts them into CSV/XLS or anything else. Since in my case the pattern repeated after every 11 <p> tags, I used this:
$pdf = new Gufy\PdfToHtml\Pdf('<PDF_FILE_PATH>');
// get total no pages
$total_pages = $pdf->getPages();
// Iterate through each page and extract the p tags
for($i = 1; $i <= $total_pages; $i++){
// This will convert pdf to html
$html = $pdf->html($i);
// Create a dom document
$domOb = new DOMDocument();
// load html code in the dom document
$domOb->loadHTML(mb_convert_encoding($html, 'HTML-ENTITIES', 'UTF-8'));
// Get SimpleXMLElement from Dom Node
$sxml = simplexml_import_dom($domOb);
// here you have the p tags
foreach ($sxml->body->div->p as $pTag) {
// your logic
}
}
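For instance, if every record really does span 11 <p> tags, a hedged sketch of that chunking step (continuing from the $sxml variable above; the field names and positions are invented, so map them to your own titles and values):
// Collect the plain text of every <p> tag on the page first...
$pTags = [];
foreach ($sxml->body->div->p as $pTag) {
    $pTags[] = trim((string) $pTag);
}
// ...then cut the flat list into records of 11 values each.
$rows = [];
foreach (array_chunk($pTags, 11) as $record) {
    $rows[] = [
        'title' => $record[0] ?? '', // hypothetical position of the title
        'value' => $record[1] ?? '', // hypothetical position of its value
    ];
}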
Hope this helps you as it helped me a lot.
My raw output of socket_recvfrom is:
ID IP PING IDENTIFIERNUMBER USERNAME
0 127.0.0.1:1234 0 ID123456789 Moritz
1 127.0.0.1:1234 46 ID123456789 August Jones
2 127.0.0.1:1234 46 ID123456789 Miller
It is a single string that contains all of this information at once, with only whitespace between the fields. Any of the fields can be longer or shorter.
My problem:
When I preg_split("/\s+/") it, I get a good array with usable data, but when the username contains spaces it creates a second index for it. That's not good: all the data that comes after it just gets destroyed.
I sort the array like this: ID, USERNAME, PING, IDENTIFIERNUMBER, IP
Example of the sorted output when a username has one space in it:
ID: 0, USERNAME: Moritz, PING: 0, IDENTIFIERNUMBER: ID123456789, IP: 127.0.0.1:1234
ID: 1, USERNAME: August, PING: Jones, IDENTIFIERNUMBER: 46, IP: ID123456789
ID: 127.0.0.1:1234, USERNAME: 2, PING: Miller, IDENTIFIERNUMBER: 46, IP: ID123456789
How do I get the information correctly out of the string?
I just forgot to say:
The string begins with a run of dashes (---------------------------------) of variable length, so it can be something like 10 characters or 12.
The string ends with:
(8 users in total)
The regex method looks good. I only need to filter out the other characters.
--------------------------------- 0 127.0.0.1:1234 0 ID123456789(OK) Moritz 1 127.0.0.1:1234 46 ID123456789(OK) August Jones 2 127.0.0.1:1234 46 ID123456789(OK) Miller (7 users in total)
Last problem:
https://www.regex101.com/r/wP8cW1/1
You may use this regex:
(?P<ID>\d+)\s+(?P<IP>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}:\d+)\s(?P<PINGR>\d+)\s(?P<IDENTIFIERNUMBER>ID\d+)(\(OK\))?(?P<USERNAME>(\s[A-z]\w+)+)
MATCH 1
ID [0-1] `0`
IP [2-16] `127.0.0.1:1234`
PINGR [17-18] `0`
IDENTIFIERNUMBER [19-30] `ID123456789`
USERNAME [31-37] `Moritz`
MATCH 2
ID [39-40] `1`
IP [41-55] `127.0.0.1:1234`
PINGR [56-58] `46`
IDENTIFIERNUMBER [59-70] `ID123456789`
USERNAME [71-83] `August Jones`
MATCH 3
ID [85-86] `2`
IP [87-101] `127.0.0.1:1234`
PINGR [102-104] `46`
IDENTIFIERNUMBER [105-116] `ID123456789`
USERNAME [117-123] `Miller`
Demo and explanation
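A minimal PHP sketch applying that pattern to the raw string from the question (the leading dashes and the trailing "(7 users in total)" part are simply left in, since the pattern skips over them):
<?php
$raw = '--------------------------------- 0 127.0.0.1:1234 0 ID123456789(OK) Moritz '
     . '1 127.0.0.1:1234 46 ID123456789(OK) August Jones '
     . '2 127.0.0.1:1234 46 ID123456789(OK) Miller (7 users in total)';
$pattern = '/(?P<ID>\d+)\s+(?P<IP>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}:\d+)\s(?P<PINGR>\d+)\s'
         . '(?P<IDENTIFIERNUMBER>ID\d+)(\(OK\))?(?P<USERNAME>(\s[A-z]\w+)+)/';
preg_match_all($pattern, $raw, $matches, PREG_SET_ORDER);
foreach ($matches as $m) {
    // Named groups are available as keys of each match set.
    echo "ID: {$m['ID']}, USERNAME: " . trim($m['USERNAME'])
       . ", PING: {$m['PINGR']}, IDENTIFIERNUMBER: {$m['IDENTIFIERNUMBER']}, IP: {$m['IP']}" . PHP_EOL;
}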
Did you already try exploding the string by newlines (\n)?
Test this code:
$str = '0 127.0.0.1:1234 0 ID123456789 Moritz
1 127.0.0.1:1234 46 ID123456789 August Jones
2 127.0.0.1:1234 46 ID123456789 Miller';
$lines = array_filter(explode("\n", $str));
foreach ($lines as $value) {
$t[] = preg_split("/\s+/", trim($value));
}
Now the variable $t holds usable data.
I have a sample graph like the one below, which I plotted with a set of (x,y) values from an array X.
http://bubblebird.com/images/t.png
As you can see, the image has dense peak values between 4000 and 5100.
My exact question is: can I programmatically find the range where the graph is most dense?
That is, with array X, how can I find the range within which this graph is dense?
For this array it would be 4000 - 5100.
Assume that the array has only one dense region for simplicity.
I'd be thankful if you could suggest pseudocode/code.
You can use the variance of the signal on a moving window.
Here is an example (see the graph attached, where the test signal is red, the windowed variance is green, and the filtered signal is blue):
test signal generation :
import numpy as np
X = np.arange(200) - 100.
Y = (np.exp(-(X/10)**2) + np.exp(-((np.abs(X)-50.)/2)**2)/3.) * np.cos(X * 10.)
compute moving window variance :
window_length = 30  # number of points in the window
half = window_length // 2  # integer half-width, since slice indices must be integers
variance = np.array([np.var(Y[max(0, i - half): i + half]) for i in range(200)])
get the indices where the variance is high (here I choose the criterion that the variance is greater than half of the maximum variance... you can adapt it to your case) :
idx = np.where(variance > 0.5 * np.max(variance))
X_min = np.min(X[idx])
# -14.0
X_max = np.max(X[idx])
# 15.0
or filter the signal (set to zero the points with low variance)
Y_modified = np.where(variance > 0.5 * np.max(variance), Y, 0)
You may calculate the absolute difference between adjacent values, then maybe smooth things a little with a sliding window, and then find the regions where the smoothed absolute difference values are above 50% of the maximum value.
Using Python (you have Python in the tags), this would look like this:
a = ( 10, 11, 9, 10, 18, 5, 20, 6, 15, 10, 9, 11 )
diffs = [abs(i[0]-i[1]) for i in zip(a,a[1:])]
# [1, 2, 1, 8, 13, 15, 14, 9, 5, 1, 2]
maximum = max(diffs)
# 15
result = [i>maximum/2 for i in diffs]
# [False, False, False, True, True, True, True, True, False, False, False]
You could use a clustering algorithm (for example k-means) to split the data into clusters and find the most heavily weighted cluster.