I have data in the object section_data.title and I am trying to use str.sup(), where str = section_data.title. str holds the following data:
str="Not less than 30 net ft2 (2.8 net m2) per patient in a
hospital or nursing home, or not less than 15 net ft2 (1.4 net
m2) per resident in a limited care facility, shall be provided within the aggregated area of corridors, patient rooms, treatment
rooms, lounge or dining areas, and other similar areas on each side of
the horizontal exit. On stories not housing bed or litterborne
patients, not less than 6 net ft2 (0.56 net m2) per occupant
shall be provided on each side of the horizontal exit for the total
number of occupants in adjoining compartments."
Now I want to add a superscript to the bold words indicated above (e.g. ft2). How can I do this using str.sup(), or is there an alternative method in JavaScript? Any other tricks?
Strings in JavaScript carry no formatting. You can only apply formatting when you output to HTML. So basically you must write it like this:
var str = "Not less than 30 net ft<sup>2</sup> (2.8 net m<sup>2</sup>)";
document.write(str);
You can do a find and replace on every string containing ft2 and m2 and turn them into ft<sup>2</sup> and m<sup>2</sup>:
str = str.replace(/\bft2\b/g, "ft<sup>2</sup>"); // note: replace() returns a new string, and this is still not entirely safe...
I have a PHP API that returns data to my Android app. Some of the data in the database is in HTML format. Sometimes the API returns everything, but for some HTML data nothing is returned from the API at all. This is the API:
//Importing database script
require_once('dbConnect.php');

//Creating SQL query
$sql = "SELECT * FROM `blogs`";

//Getting the result
$r = mysqli_query($con, $sql);

//Creating a blank array
$result = array();

//Looping through all the records fetched
while ($row = mysqli_fetch_array($r)) {
    //Pushing each record into the array
    array_push($result, array(
        "id"      => $row['id'],
        "title"   => $row['title'],
        "details" => $row['details'],
        "image"   => $row['photo'],
        "source"  => $row['source'],
        "views"   => $row['views'],
        "date"    => $row['created_at']
    ));
}

//Displaying the array in JSON format
echo json_encode(array('result' => $result));
mysqli_close($con);
NOTE: the title and details are HTML. The title is returned, but the details are not. This is a sample of the data that does not work:
<div align="justify">The recording starts with the patter of a summer squall. Later, a drifting tone like that of a not-quite-tuned-in radio station rises and for a while drowns out the patter. These are the sounds encountered by NASA’s Cassini spacecraft as it dove the gap between Saturn and its innermost ring on April 26, the first of 22 such encounters before it will plunge into atmosphere in September. What Cassini did not detect were many of the collisions of dust particles hitting the spacecraft it passed through the plane of the ringsen the charged particles oscillate in unison.<br><br></div><h3 align="justify">How its Works ?</h3>
<p align="justify">MIAMI — For decades, South Florida schoolchildren and adults fascinated by far-off galaxies, earthly ecosystems, the properties of light and sound and other wonders of science had only a quaint, antiquated museum here in which to explore their interests. Now, with the long-delayed opening of a vast new science museum downtown set for Monday, visitors will be able to stand underneath a suspended, 500,000-gallon aquarium tank and gaze at hammerhead and tiger sharks, mahi mahi, devil rays and other creatures through a 60,000-pound oculus. <br></p>
<p align="justify">Lens that will give the impression of seeing the fish from the bottom of a huge cocktail glass. And that’s just one of many attractions and exhibits. Officials at the $305 million Phillip and Patricia Frost Museum of Science promise that it will be a vivid expression of modern scientific inquiry and exposition. Its opening follows a series of setbacks and lawsuits and a scramble to finish the 250,000-square-foot structure. At one point, the project ran precariously short of money. The museum high-profile opening is especially significant in a state s <br></p><p align="justify"><br></p><h3 align="justify">Top 5 reason to choose us</h3>
<p align="justify">Mauna Loa, the biggest volcano on Earth — and one of the most active — covers half the Island of Hawaii. Just 35 miles to the northeast, Mauna Kea, known to native Hawaiians as Mauna a Wakea, rises nearly 14,000 feet above sea level. To them it represents a spiritual connection between our planet and the heavens above. These volcanoes, which have beguiled millions of tourists visiting the Hawaiian islands, have also plagued scientists with a long-running mystery: If they are so close together, how did they develop in two parallel tracks along the Hawaiian-Emperor chain formed over the same hot spot in the Pacific Ocean — and why are their chemical compositions so different? "We knew this was related to something much deeper, but we couldn’t see what,” said Tim Jones.</p>
Make sure to properly encode any HTML data before putting it into the JSON array, otherwise it can break the output.
Use htmlentities($html).
E.g. "source" => htmlentities($row['source'])
I am trying to display data extracted from a PDF document. Here is the sample data I've got in raw format from the PDF: 55.0 450.0 320.0 GA350C CARDS 4 21 90.0 4 1 DIGCLR. This is one row, where every space represents one column. I can extract each column with the substr() function in PHP, but I am not sure how to display the data when there are three or five rows of data in there, because whether it is one row or five, the data comes out on a single line.
I can only count rows by the number of spaces; the only thing that is fixed here is the number of columns, so I need to iterate the loop efficiently.
If anyone has a better idea, please let me know.
Here is the string I extracted from the PDF document with the help of PdfParser:
5284 25/10/16 DATE JOB REC'D: DATE DUE: 26/10/16 JOB NUMBER: The Print Group CUSTOMER NAME: 30 days CONTACT: Tanya Bulley PHONE: (07) 3395 7248 FAX: (07) 3395 9462 ORDER NUMBER: 234456/277458 ADDRESS: The Print Group 88 Webster Road Geebung Qld 4034 Australia 5,289 QUOTE NO: PREVIOUS JOB NO: 0 2,000 Business Cards - Shed Company 2 KINDS JOB: DESCRIPTION: PRE-PRESS: Supplied Print Ready Files/ No Proof Required SIZE: BC 90 x 55mm PRINTED: CMYK 2/sides STOCK: 350gsm Gloss Art FINISH:Trim to size QTY: 2000 (1,000 each name) PACK: Carton Pack DELIVERY: 1 Point ACT [1]SPECIAL INSTRUCTIONS: Artwork Received SPECIAL INSTRUCTIONS: Out on Proof Approved Stock TYPE/ART CUTTING Proofing Pre Press Proofing 0.50 TRIMMING CARDS TRIM MAKE READY CARDS TRIM 90 x 55 STOCK 96.00 CARDS Sovereign Gloss 450x320/350 FINISHING PACK/DELIVERY PACK A4 Cartons 305x215/280 Standard Local Delivery (by we INK/CHEMICALS OUTSIDE WORK Delivery: The Print Group 88 Webster Road Geebung Qld 4034 Press Sheet Press Code Stock Code No. of Work & Turn No Up No. of Colours Front Back Description Ink Code Front Back Trim Size Depth Width Ink Notes 55.0 450.0 320.0 GA350C CARDS 4 21 90.0 4 1 DIGCLR
This is basically a job order for printers, and the last line holds the job details. Right now there is only one row of actual job details, but in some orders it can go up to 10 rows, so it is hard to save it in the database with proper column names. To grab words or details I used the following function:
function GetBetween($content, $start, $end)
{
    $r = explode($start, $content);
    if (isset($r[1])) {
        $r = explode($end, $r[1]);
        return $r[0];
    }
    return '';
}
I used it like $cust_name = GetBetween($a, 'JOB NUMBER:', 'CUSTOMER NAME:');. I also used the substr() PHP function to get some details, and with these I've got everything apart from the main data at the end of the string (mentioned above). I hope this explanation helps you to figure out the whole situation.
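Since the number of columns in each detail row is fixed, one minimal sketch (assuming $details holds only the trailing job-detail block, e.g. grabbed with GetBetween() using a marker such as 'Ink Notes' as the start, and that no single value contains a space) is to split on whitespace and chunk the tokens back into rows:
$columnsPerRow = 11; // taken from the sample row: 55.0 450.0 320.0 GA350C CARDS 4 21 90.0 4 1 DIGCLR
$tokens = preg_split('/\s+/', trim($details), -1, PREG_SPLIT_NO_EMPTY);
$rows = array_chunk($tokens, $columnsPerRow);

foreach ($rows as $i => $row) {
    // $row is one job-detail line as an indexed array of its 11 column values
    echo 'Row ' . ($i + 1) . ': ' . implode(' | ', $row) . PHP_EOL;
}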
Sorry, I tried to explain with a bulk of code and a long description, but Stack Overflow did not allow me to post that. I'm frustrated because I spent two hours writing it up in my notepad.
Now I will give you a simple clue for doing this:
Avoid using the <table> tag; try to use <div> instead (only ABBYY can convert <table> nearly perfectly). This is an optional requirement.
Convert the PDF to a DOM tree. I recommend converting to HTML, and this must be automated via PHP.
Paid software: ABBYY FineReader or ABBYY PDF Transformer (lite version).
Free software: pdftohtml from Poppler.
From my experience of around 5 years doing this, I recommend you use ABBYY. All of the Indonesian corporations that provide digital newspaper clippings use this software (I'm pretty sure about this). If you don't have the money, you must know how to get it (I can't say it here).
Grab the HTML DOM with regular expressions (regex) or http://simplehtmldom.sourceforge.net/
Another clue:
If you have problems grabbing content using regex/htmldom:
1. Try to get rid of the DOM you don't need. You can use preg_replace to reduce
[trash]
[YOUR_TABLE]
[trash]
down to just [YOUR_TABLE], then start grabbing your content from that snippet.
2. If you can edit the PDF creation process, try to add a unique word/string around your content:
[trash]
<div>this is title</div>
[YOUR TABLE]
<div>this is footer</div>
[trash]
so you can search for your content between the words 'this is title' and 'this is footer'.
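For example, a minimal sketch of grabbing whatever sits between those two marker strings with preg_match (the marker text and variable names are just the placeholders from the example above):
// $html is the full HTML produced by the PDF-to-HTML conversion
if (preg_match('/<div>this is title<\/div>(.*?)<div>this is footer<\/div>/s', $html, $m)) {
    $table = $m[1]; // the content (your table markup) between the two markers
} else {
    $table = '';    // markers not found
}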
I've managed over the past few months to teach myself PHP, PDO and SQL, and have built a basic dynamic website with user registration, email activation, and login/logout functionality, following PHP/SQL best practices. Now I'm stuck on the next task...
I've created a huge dataset of squares/polygons (3 million+), each 1 minute of latitude & longitude in size, stored in a PHP array with a single set of coordinates (the top left corner). To extrapolate a square-like shape, I simply add 0.016 degrees (~1 minute) to each direction and generate the other 3 coordinates.
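That extrapolation is simple enough; a minimal PHP sketch, assuming the stored point is the top-left corner and coordinates are decimal degrees (function and variable names are just illustrative):
$step = 0.016; // roughly 1 minute of arc, as described above

function squareCorners($lat, $lng, $step) {
    // Returns the four corners, starting from the stored top-left point.
    return array(
        array($lat,         $lng),          // top-left (stored)
        array($lat,         $lng + $step),  // top-right
        array($lat - $step, $lng + $step),  // bottom-right
        array($lat - $step, $lng),          // bottom-left
    );
}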
I now need to check that each polygon in said array is over at least some portion of land in the United States, i.e. if one were to produce a graphical output of my completed data set and take a look at the San Francisco coastline, they'd see something like this.
It's similar to the point-in-polygon problem, except it's dealing with another polygon instead of a point, the other polygon is a country border, and I'm not just looking at intersections. I want to check if:
A polygon/square intersects with the polygon. (Think coastline/border).
A polygon/square is inside the polygon. (Think continental U.S.).
A polygon/square contains part of the polygon. (Think small island).
This is illustrated with my crudely drawn image:
If it matches any of these three conditions, I want to keep the square. If it does not interact with the big polygon in any way (i.e. it's over water), discard it.
I was thinking the big polygon would be a shapefile of the U.S., that or a KML file which I could strip the coordinates out of to create a very complex polygon from.
Then, I thought I'd pass these matching squares and square IDs over to a CSV file for integration into a MySQL table containing the set of coordinates of each square (in fact, I'm not even sure of the best practices for handling tables of that size in MySQL, but I'll come to that when need be). The eventual goal would then be to develop a map using the Google Maps API via JavaScript to display these squares over a map on the website I'm coding (obviously only showing squares within the viewport to make sure I don't tax my db to death). I'm pretty sure I'd have to pass such information through PHP first, too. But all of that seems relatively easy compared to the task of actually making said data set.
This is obviously something that cannot be done by hand, so it needs automating. I know a bit of Python, so would that be of help? Any other tips on where to start? Someone willing to write some of the code for me?
Here is a solution that will be efficient, and as simple as possible to implement. Note that I do not say simple, but as simple as possible. This is a tricky problem, as it turns out.
1) Get U.S. polygon data using shapefiles or KML, which will yield a set of polygon shapes (land masses), each defined by a list of vertices.
2) Create a set of axis aligned bounding box (AABB) rectangles for the United States: one for Alaska and each Alaskan island, one for each Hawaiian island, one for the Continental United States, and one for each little island off the coast of the Continental U.S. (e.g., Bald Head Island in N.C., Catalina off the coast of California). Each bounding box is defined as a rectangle with the corners which are the minimum and maximum latitude and longitude for the shape. My guess is that there will be a few hundred of these. For example, for Hawaii's big island, the latitude runs 18°55′N to 28°27′N, and the longitude runs 154°48′W to 178°22′W. Most of your global lat/long pairs get thrown out at this step, as they are not in any of those few hundred bounding boxes. For example, your bounding box at 10°20'W, 30°40'N (a spot in the Atlantic Ocean near Las Palmas, Africa) does not overlap Hawaii, because 10°20'W is less than 154°48′W. This bit would be easy to code in Python.
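A minimal PHP sketch of that bounding-box rejection (the question is PHP-based, but the same few lines port directly to Python; the box layout and field names are assumptions, and longitudes are assumed to use one consistent sign convention):
// Each AABB: array('minLat' => ..., 'maxLat' => ..., 'minLng' => ..., 'maxLng' => ...)
function overlapsAABB(array $square, array $box) {
    // Two axis-aligned rectangles overlap unless one lies entirely past the other on some axis.
    return !($square['maxLat'] < $box['minLat'] || $square['minLat'] > $box['maxLat']
          || $square['maxLng'] < $box['minLng'] || $square['minLng'] > $box['maxLng']);
}

function firstOverlappingAABB(array $square, array $boxes) {
    foreach ($boxes as $box) {
        if (overlapsAABB($square, $box)) {
            return $box; // keep the square for the exact polygon test inside this box
        }
    }
    return null; // coarse test says the square is over open water
}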
3) If the lat/long pair DOES overlap one of the several hundred AABB rectangles, you then need to test it against the single polygon within the AABB rectangle. To do this it is strongly recommended to use the Minkowski Difference (MD). Please thoroughly review this website first:
http://www.wildbunny.co.uk/blog/2011/04/20/collision-detection-for-dummies/
In particular, look at the "poly versus poly" demo halfway down the page, and play with it a little. When you do, you will see that when you take the MD of the 2 shapes, if that MD contains the origin, then the two shapes are overlapping. So, all you need to do then is take the Minkowski Difference of the 2 polygons, which itself results in a new polygon (B - A, in the demo), and then see if that polygon contains the origin.
4) There are many papers online regarding algorithms to implement MD, but I don't know if you'll have the ability to read the paper and translate that into code. Since it is tricky vector math to take the MD of the two polygons (the lat/long rectangle you're testing, and the polygon contained in the bounding box which overlapped the lat/long rectangle), and you have told us that your experience level is not high yet, I would suggest using a library that already implements MD, or even better, implements collision detection.
For example:
http://physics2d.com/content/gjk-algorithm
Here, you can see the relevant pseudo-code, which you could port into Python:
if aO cross ac > 0                      // if O is to the right of ac
    if aO dot ac > 0                    // if O is ahead of the point a on the line ac
        simplex = [a, c]
        d = -((ac.unit() dot aO) * ac + a)
    else                                // O is behind a on the line ac
        simplex = [a]
        d = aO
else if ab cross aO > 0                 // if O is to the left of ab
    if ab dot aO > 0                    // if O is ahead of the point a on the line ab
        simplex = [a, b]
        d = -((ab.unit() dot aO) * ab + a)
    else                                // O is behind a on the line ab
        simplex = [a]
        d = aO
else                                    // O is both to the right of ac and to the left of ab
    return true                         // we intersect!
If you are unable to port this yourself, perhaps you could contact either of the authors of the 2 links I've included here--they both implemented the MD algorithm in Flash, perhaps you could license the source code.
5) Finally, assuming that you've handled the collision detection, you can simply store in the database a boolean as to whether the lat/long pair is part of the United States. Once that's done, I have no doubt you will be able to do as you'd like with your Google Maps piece.
So, to sum up, the only difficult piece here is to either 1) implement the collision detection GJK algorithm, or alternatively, 2) write an algorithm that will first calculate the Minkowski Difference between your lat/long pair and the land polygon contained within your AABB and then secondly see if that MD polygon contains the origin. If you use that approach, Ray Casting (typical point-in-a-polygon solution) would do the trick with the second part.
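For that last step, a minimal ray-casting point-in-polygon sketch in PHP; it tests whether a point (here, the origin) falls inside a polygon given as an array of array(x, y) vertices, and the MD polygon itself is assumed to come from whichever library or port you end up using:
// Standard even-odd ray casting: cast a ray from the point along +x and count
// how many polygon edges it crosses. An odd count means the point is inside.
function containsPoint(array $polygon, $px, $py) {
    $inside = false;
    $n = count($polygon);
    for ($i = 0, $j = $n - 1; $i < $n; $j = $i++) {
        list($xi, $yi) = $polygon[$i];
        list($xj, $yj) = $polygon[$j];
        if ((($yi > $py) != ($yj > $py))
            && ($px < ($xj - $xi) * ($py - $yi) / ($yj - $yi) + $xi)) {
            $inside = !$inside;
        }
    }
    return $inside;
}

// The two shapes overlap if the Minkowski Difference polygon contains the origin:
// $overlaps = containsPoint($mdPolygon, 0.0, 0.0);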
I hope this gives you a start in the right direction!
I think this other question answers a good portion of what you are trying to do:
How do I determine if two convex polygons intersect?
The other portion is that, if you are using a database, I would load in all polygons near your viewport from both sets (the set of map polygons and the set of squares you generated), then run the above algorithm on this smaller set of polygons, and you can generate a list of all the polygons in your set that should be overlaid on the map.
I need to parse a file with the following format.
0000000 ...ISBN.. ..Author.. ..Title.. ..Edit.. ..Year.. ..Pub.. ..Comments.. NrtlExt Nrtl Next Navg NQoH UrtlExt Urtl Uext Uavg UQoH ABS NEB MBS FOL
ABE0001 0-679-73378-7 ABE WOMAN IN THE DUNES (INT'L ED) 1st 64 RANDOM 0.00 13.90 0.00 10.43 0 21.00 10.50 6.44 3.22 2 2.00 0.50 2.00 2.00 ABS
The ID and ISBN are not a problem; the title is. There is no set length for these fields, and there are no solid delimiters; a space is the separator used for most of the file.
Another issue is that there is not always an entry in the comments field. When there is, there are spaces within the content.
So I can get the first two fields and the last fourteen. I need some help figuring out how to parse the middle six fields.
This file was generated by an older program that I cannot change. I am using PHP to parse this file.
I would also ask myself 'How good does this have to be?' and 'How many records are there?'
If, for example, you are parsing this list to put up a catalog of books to sell on a website, you probably want to be as good as you can, but expect that you will miss some titles; build in a feedback mechanism so your users can help you fix the issues (and make it easy for you to fix them in your new format).
On the other hand, if you absolutely have to get it right because you will lose lots of money for each wrong parse, and there are only a few thousand books, you should plan on getting close and then doing a human review of the entire file.
(In my first job, we spent six weeks on a data conversion project to convert 150 records - not a good use of time.)
Find the title and publisher of the book by ISBN (in some on-line database) and parse only the rest :)
BTW, are you sure that what looks like a space actually is a space? There are more "invisible" characters (like the non-breaking space). I know, not a good idea, but apparently the author of that format was pretty creative...
You need to analyze your data by hand and find out what the year, edition and publisher look like. For example, if you find that the year is always two digits and the publisher always comes from some limited list, this is something you can start with.
While I don't see any way other than guessing a bit, I'd go about it something like this:
I'd strip off what I know I can parse out reliably, leaving you with ABE WOMAN IN THE DUNES (INT'L ED) 1st 64 RANDOM.
From there I'd try to locate the Edition and split the string in two at that position, after storing and removing the Edition, leaving you with ABE WOMAN IN THE DUNES (INT'L ED) and 64 RANDOM. Another option is to try with the year, but of course titles such as 1984 might present a problem. (Guessing the edition of course assumes it's 7th, 51st etc. for all editions.)
Finally I'd assume I could somewhat reliably guess the year 64 at the start of the second string and further limit the Publisher(/Comment) part.
The rest is pure guesswork unless you have a list of authors/publishers somewhere to match against, as I'd assume there are not only comments with spaces but also publishers with spaces in their names. But at least you should be down to two strings, containing Author/Title in one and Publisher(/Comments) in the other.
All in all, it should limit the manual part a bit.
Once done, I'd also save it in a better format somewhere so I don't have to go about parsing it again ;)
I don't know if the PCRE engine allows multiple groups from within a selection, therefore:
([A-Z0-9]{7})\ (\d-\d{3}-\d{5}-\d)\
(.+)\ (\d+(?:st|nd|rd|th))\ \d{2}\
([^\d.]+)\ (\d+\.\d{2})\ (\d+\.\d{2})\
(\d+\.\d{2})\ (\d+\.\d{2})\ (\d{1})\
(\d+\.\d{2})\ (\d+\.\d{2})\ (\d+\.\d{2})\
(\d+\.\d{2})\ (\d)\ (\d+\.\d{2})\
(\d+\.\d{2})\ (\d+\.\d{2})\ (\d+\.\d{2})\
(\w{3})
It does look quite ugly and doesn't fix your author/title problem, but it matches quite well for the rest of it.
Concerning that problem, I don't see any solution other than having a lookup table for authors or using other services to look up the title and author via the ISBN.
That's assuming that, unlike in your example above, the authors are not just represented by their first name.
Also double-check all exceptions that might occur with the above regex, as titles may contain 1st or the like.
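As a minimal usage sketch, the pattern above (joined into a single line, with plain spaces instead of the escaped ones) can be applied per line with preg_match; the group numbers in the comments follow the capture order of the regex, and the file name books.txt is a placeholder:
$pattern = '/^([A-Z0-9]{7}) (\d-\d{3}-\d{5}-\d) (.+) (\d+(?:st|nd|rd|th)) \d{2} '
         . '([^\d.]+) (\d+\.\d{2}) (\d+\.\d{2}) (\d+\.\d{2}) (\d+\.\d{2}) (\d) '
         . '(\d+\.\d{2}) (\d+\.\d{2}) (\d+\.\d{2}) (\d+\.\d{2}) (\d) (\d+\.\d{2}) '
         . '(\d+\.\d{2}) (\d+\.\d{2}) (\d+\.\d{2}) (\w{3})$/';

foreach (file('books.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES) as $line) {
    if (preg_match($pattern, $line, $m)) {
        // $m[1] = ID, $m[2] = ISBN, $m[3] = author + title (still combined),
        // $m[4] = edition, $m[5] = publisher (plus any comments),
        // $m[6]..$m[19] = the numeric columns, $m[20] = trailing flag (e.g. ABS)
        $records[] = $m;
    } else {
        $unparsed[] = $line; // keep failures for manual review, as suggested above
    }
}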
I would like to implement Latent Semantic Analysis (LSA) in PHP in order to find out topics/tags for texts.
Here is what I think I have to do. Is this correct? How can I code it in PHP? How do I determine which words to choose?
I don't want to use any external libraries. I've already an implementation for the Singular Value Decomposition (SVD).
Extract all words from the given text.
Weight the words/phrases, e.g. with tf–idf. If weighting is too complex, just take the number of occurrences.
Build up a matrix: The columns are some documents from the database (the more the better?), the rows are all unique words, the values are the numbers of occurrences or the weight.
Do the Singular Value Decomposition (SVD).
Use the values in the matrix S (SVD) to do the dimension reduction (how?).
I hope you can help me. Thank you very much in advance!
LSA links:
Landauer (co-creator) article on LSA
the R-project lsa user guide
Here is the complete algorithm. If you have SVD, you are most of the way there. The papers above explain it better than I do.
Assumptions:
your SVD function will give the singular values and singular vectors in descending order. If not, you have to do more acrobatics.
M: corpus matrix, w (words) by d (documents) (w rows, d columns). These can be raw counts, or tfidf or whatever. Stopwords may or may not be eliminated, and stemming may happen (Landauer says keep stopwords and don't stem, but yes to tfidf).
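As a minimal sketch of building M in PHP with raw counts (whitespace-ish tokenisation, no stemming; $documents is assumed to be an array of plain-text strings):
// Builds a words-by-documents count matrix plus the vocabulary (row labels).
function buildTermDocumentMatrix(array $documents) {
    $counts = array(); // $counts[$word][$docIndex] = occurrences of $word in that document
    foreach ($documents as $d => $text) {
        $words = preg_split('/\W+/u', mb_strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
        foreach ($words as $word) {
            if (!isset($counts[$word][$d])) {
                $counts[$word][$d] = 0;
            }
            $counts[$word][$d]++;
        }
    }
    $vocabulary = array_keys($counts);
    $matrix = array(); // dense w x d matrix: rows = words, columns = documents
    foreach ($vocabulary as $i => $word) {
        for ($d = 0; $d < count($documents); $d++) {
            $matrix[$i][$d] = isset($counts[$word][$d]) ? $counts[$word][$d] : 0;
        }
    }
    return array($vocabulary, $matrix);
}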
U,Sigma,V = singular_value_decomposition(M)
U: w x w
Sigma: min(w,d) length vector, or w * d matrix with diagonal filled in the first min(w,d) spots with the singular values
V: d x d matrix
Thus U * Sigma * V = M
# you might have to do some transposes depending on how your SVD code
# returns U and V. verify this so that you don't go crazy :)
Then the dimensionality reduction... the actual LSA paper suggests that a good approximation for the basis is to keep enough vectors such that their singular values are more than 50% of the total of the singular values.
More succinctly (pseudocode):
s1 = sum(Sigma)
total = 0
for ii in range(len(Sigma)):
    val = Sigma[ii]
    total += val
    if total > .5 * s1:
        return ii
This will return the rank of the new basis, which was min(d,w) before and which we'll now approximate with ii dimensions.
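The same heuristic in PHP, assuming $sigma is an array of singular values already sorted in descending order (per the assumption at the top of this answer):
// Number of singular values needed to cover more than half of the total.
function reducedRank(array $sigma) {
    $threshold = 0.5 * array_sum($sigma);
    $total = 0;
    foreach (array_values($sigma) as $ii => $value) {
        $total += $value;
        if ($total > $threshold) {
            return $ii + 1; // how many dimensions to keep
        }
    }
    return count($sigma); // degenerate case: keep everything
}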
(here, ' -> prime, not transpose)
We create new matrices: U',Sigma', V', with sizes w x ii, ii x ii, and ii x d.
That's the essence of the LSA algorithm.
This resultant matrix U' * Sigma' * V' can be used for 'improved' cosine similarity searching, or you can pick the top 3 words for each document in it, for example. Whether this yields more than a simple tf-idf is a matter of some debate.
To me, LSA performs poorly in real-world data sets because of polysemy, and on data sets with too many topics. Its mathematical/probabilistic basis is unsound (it assumes normal-ish (Gaussian) distributions, which don't make sense for word counts).
Your mileage will definitely vary.
Tagging using LSA (one method!)
Construct the U' Sigma' V' dimensionally reduced matrices using SVD and a reduction heuristic
By hand, look over the U' matrix and come up with terms that describe each "topic". For example, if the biggest parts of that vector were "Bronx, Yankees, Manhattan," then "New York City" might be a good term for it. Keep these in an associative array, or list. This step should be reasonable since the number of vectors will be finite.
Assuming you have a vector (v1) of words for a document, then v1 * t(U') will give the strongest 'topics' for that document. Select the 3 highest, then give their "topics" as computed in the previous step.
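A minimal sketch of that projection in PHP: $v1 is the document's word-count vector (length w, in the same word order as the rows of U'), and $Uprime is U' stored as a w x k array of row arrays; both layouts are assumptions about how you hold the matrices:
// Projects a document's word vector onto the k reduced dimensions and
// returns the indices of the 3 strongest topics with their scores.
function topicScores(array $v1, array $Uprime) {
    $k = count($Uprime[0]);
    $scores = array_fill(0, $k, 0.0);
    foreach ($Uprime as $i => $row) {
        for ($j = 0; $j < $k; $j++) {
            $scores[$j] += $v1[$i] * $row[$j]; // score[j] = sum over words i of v1[i] * U'[i][j]
        }
    }
    arsort($scores);                          // strongest topics first, keys preserved
    return array_slice($scores, 0, 3, true);  // topic index => score
}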
This answer isn't directed at the poster's question, but at the meta question of how to autotag news items. The OP mentions Named Entity Recognition, but I believe they mean something more along the lines of autotagging. If they really mean NER, then this response is hogwash :)
Given these constraints (600 items / day, 100-200 characters / item) with divergent sources, here are some tagging options:
By hand. An analyst could easily do 600 of these per day, probably in a couple of hours. Something like Amazon's Mechanical Turk, or making users do it, might also be feasible. Having some number of hand-tagged items, even if it's only 50 or 100, will be a good basis for comparing whatever the autogenerated methods below get you.
Dimensionality reduction, using LSA, topic models (Latent Dirichlet Allocation), and the like... I've had really poor luck with LSA on real-world data sets and I'm unsatisfied with its statistical basis. I find LDA much better, and it has an incredible mailing list that has the best thinking on how to assign topics to texts.
Simple heuristics... if you have actual news items, then exploit the structure of the news item. Focus on the first sentence, toss out all the common words (stop words) and select the best 3 nouns from the first two sentences. Or heck, take all the nouns in the first sentence, and see where that gets you. If the texts are all in English, then do part-of-speech analysis on the whole shebang and see what that gets you. With structured items like news reports, LSA and other order-independent methods (tf-idf) throw out a lot of information. (A rough sketch of the stop-word part follows below.)
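A rough PHP sketch of that stop-word heuristic (the stop-word list is deliberately tiny, and it just keeps the 3 most frequent remaining words instead of doing real part-of-speech tagging):
function quickTags($text, $howMany = 3) {
    $stopWords = array('the', 'a', 'an', 'of', 'and', 'or', 'to', 'in', 'on', 'for',
                       'is', 'are', 'was', 'were', 'with', 'that', 'this', 'it', 'as', 'by');
    $words = preg_split('/\W+/u', mb_strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    $counts = array_count_values(array_diff($words, $stopWords));
    arsort($counts);
    return array_slice(array_keys($counts), 0, $howMany);
}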
Good luck!
(if you like this answer, maybe retag the question to fit it)
That all looks right, up to the last step. The usual notation for SVD is that it returns three matrices A = USV*. S is a diagonal matrix (meaning all zero off the diagonal) that, in this case, basically gives a measure of how much each dimension captures of the original data. The numbers ("singular values") will go down, and you can look for a drop-off for how many dimensions are useful. Otherwise, you'll want to just choose an arbitrary number N for how many dimensions to take.
Here I get a little fuzzy. The coordinates of the terms (words) in the reduced-dimension space are either in U or V, I think depending on whether they are in the rows or columns of the input matrix. Offhand, I think the coordinates for the words will be the rows of U, i.e. the first row of U corresponds to the first row of the input matrix, i.e. the first word. Then you just take the first N columns of that row as the word's coordinate in the reduced space.
HTH
Update:
This process so far doesn't tell you exactly how to pick out tags. I've never heard of anyone using LSI to choose tags (a machine learning algorithm might be more suited to the task, like, say, decision trees). LSI tells you whether two words are similar. That's a long way from assigning tags.
There are two tasks: (a) what is the set of tags to use? (b) how do you choose the best three tags? I don't have much of a sense of how LSI is going to help you answer (a). You can choose the set of tags by hand. But, if you're using LSI, the tags probably should be words that occur in the documents. Then for (b), you want to pick out the tags that are closest to words found in the document. You could experiment with a few ways of implementing that. Choose the three tags that are closest to any word in the document, where closeness is measured by the cosine similarity (see Wikipedia) between the tag's coordinate (its row in U) and the word's coordinate (its row in U).
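A minimal cosine similarity sketch in PHP, where $a and $b are two coordinate vectors of equal length (e.g. two rows of U):
function cosineSimilarity(array $a, array $b) {
    $dot = 0.0; $normA = 0.0; $normB = 0.0;
    foreach (array_values($a) as $i => $value) {
        $dot   += $value * $b[$i];
        $normA += $value * $value;
        $normB += $b[$i] * $b[$i];
    }
    if ($normA == 0.0 || $normB == 0.0) {
        return 0.0; // one of the vectors is all zeros
    }
    return $dot / (sqrt($normA) * sqrt($normB));
}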
There is an additional SO thread on the perils of doing this all in PHP at link text.
Specifically, there is a link there to this paper on Latent Semantic Mapping, which describes how to get the resultant "topics" for a text.