How to get html in json format? - php

I have a PHP API that returns data to my Android app.
Some of the data in the database is in HTML format.
Most of the time the API returns everything,
but when certain HTML data is included, the API returns nothing at all.
This is the API:
//Importing Database Script
require_once('dbConnect.php');
//Creating sql query
$sql="SELECT * FROM `blogs`";
//getting result
$r = mysqli_query($con,$sql);
//creating a blank array
$result = array();
//looping through all the records fetched
while($row = mysqli_fetch_array($r)){
//Pushing name and id in the blank array created
array_push($result,array(
"id"=>$row['id'],
"title"=>$row['title'],
"details"=>$row['details'],
"image"=>$row['photo'],
"source"=>$row['source'],
"views"=>$row['views'],
"date"=>$row['created_at']
));
}
//Displaying the array in json format
echo json_encode(array('result'=>$result));
mysqli_close($con);
Note: the title and details are HTML.
The title is returned, but the details are not.
This is a sample of data that does not work:
<div align="justify">The recording starts with the patter of a summer squall. Later, a
drifting tone like that of a not-quite-tuned-in radio station
rises and for a while drowns out
the patter. These are the sounds encountered by NASA’s Cassini
spacecraft as it dove
the gap between Saturn and its
innermost ring on April 26, the first of 22 such encounters before it
will plunge into
atmosphere in September. What
Cassini did not detect were many of the collisions of dust particles
hitting the spacecraft
it passed through the plane of
the ringsen the charged particles oscillate in unison.<br><br></div><h3 align="justify">How its Works ?</h3>
<p align="justify">
MIAMI — For decades, South
Florida schoolchildren and adults fascinated by far-off galaxies,
earthly ecosystems, the proper
ties of light and sound and
other wonders of science had only a quaint, antiquated museum here in
which to explore their
interests. Now, with the
long-delayed opening of a vast new science museum downtown set for
Monday, visitors will be able
to stand underneath a suspended,
500,000-gallon aquarium tank and gaze at hammerhead and tiger sharks,
mahi mahi, devil
rays and other creatures through
a 60,000-pound oculus. <br></p><p align="justify">Lens that will give the impression of seeing the fish from the bottom of
a huge cocktail glass. And that’s just one of many
attractions and exhibits.
Officials at the $305 million Phillip and Patricia Frost Museum of
Science promise that it will be a
vivid expression of modern
scientific inquiry and exposition. Its opening follows a series of
setbacks and lawsuits and a
scramble to finish the
250,000-square-foot structure. At one point, the project ran
precariously short of money. The museum
high-profile opening is
especially significant in a state s <br></p><p align="justify"><br></p><h3 align="justify">Top 5 reason to choose us</h3>
<p align="justify">
Mauna Loa, the biggest volcano
on Earth — and one of the most active — covers half the Island of
Hawaii. Just 35 miles to the
northeast, Mauna Kea, known to
native Hawaiians as Mauna a Wakea, rises nearly 14,000 feet above sea
level. To them it repre
sents a spiritual connection
between our planet and the heavens above. These volcanoes, which have
beguiled millions of
tourists visiting the Hawaiian
islands, have also plagued scientists with a long-running mystery: If
they are so close together,
how did they develop in two
parallel tracks along the Hawaiian-Emperor chain formed over the same
hot spot in the Pacific
Ocean — and why are their
chemical compositions so different? "We knew this was related to
something much deeper,
but we couldn’t see what,” said
Tim Jones.
</p>

Make sure to properly encode any HTML data before putting it into the array you pass to json_encode(), otherwise it could break the output.
Use htmlentities($html) on each HTML field.
E.g. "source"=>htmlentities($row['source'])
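It may also help to check why json_encode() produces nothing. A minimal sketch, assuming (the question does not confirm this) that the details column contains bytes that are not valid UTF-8, such as a Windows-1252 typographic apostrophe; json_encode() returns false in that case. The sample row and the Windows-1252 source encoding are hypothetical. If the database connection charset is the cause, calling mysqli_set_charset($con, 'utf8mb4') after connecting is another option.

```php
<?php
// Hypothetical row: "details" contains the byte 0x92, the Windows-1252
// right single quote, which is not valid UTF-8 on its own.
$row = array(
    'title'   => '<h3>Valid title</h3>',
    'details' => "<div>NASA\x92s Cassini spacecraft</div>",
);

// Plain json_encode() fails on invalid UTF-8 and returns false.
$plain = json_encode(array('result' => array($row)));
if ($plain === false) {
    echo 'json_encode failed: ' . json_last_error_msg() . PHP_EOL;
}

// Option 1 (PHP 7.2+): replace invalid bytes with U+FFFD instead of failing.
$substituted = json_encode(
    array('result' => array($row)),
    JSON_INVALID_UTF8_SUBSTITUTE
);

// Option 2: convert from the assumed source encoding to UTF-8 first.
$row['details'] = mb_convert_encoding($row['details'], 'UTF-8', 'Windows-1252');
$converted = json_encode(array('result' => array($row)));
echo $converted . PHP_EOL;
```

HTML tags themselves are not a problem for json_encode(); it escapes them as ordinary string content.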

Related

Convert text in specific format into real PHP code assignments

I'm having trouble converting text in a specific format into real, working PHP code.
My text file:
#T1:The German sociologist Max Weber once proposed
#S:Jos Bleau
#C:jos.bleau#domain.com
#L:"He used to be so conservative," she says, throwing up her hands in mock exasperation. "We used to have the worst arguments right here at this table. I was part of the first group of public city school teachers that struck to form a union, and Richard was very angry with me. He saw unions as corrupt. He was also very opposed to social security. He thought people could make much more money investing it on their own. Who knew that within 10 years he would become so idealistic
#R:At first, <#Ri>Stallman viewed these notices<#$p> with alarm. Rare was the software program that didn't borrow source code from past programs, and yet, with a single stroke of the president's pen, Congress had given programmers and companies the power to assert individual authorship over communally built programs. It also injected a dose of formality into what had otherwise been an informal system.
The AI Lab of the 1970s was by all accounts a special place. Cutting-edge projects and top-flight researchers gave it an esteemed position in the world of computer science. The internal hacker culture and its anarchic policies lent a rebellious mystique as well. Only later, when many of the lab's scientists and software superstars had departed, would hackers fully realize the unique and ephemeral world they had once inhabited.
As a single parent for nearly a decade-she and Richard's father, Daniel Stallman, were married in 1948, divorced in 1958, and split custody of their son afterwards-Lippman can attest to her son's aversion to authority. She can also attest to her son's lust for knowledge. It was during the times when the two forces intertwined, Lippman says, that she and her son experienced their biggest battles.
#ST:Fusions
#R:Such mythological descriptions, while extreme, underline an important fact. The ninth floor of 545 Tech Square was more than a workplace for many. For hackers such as Stallman, it was home.
The belief in individual freedom over arbitrary authority extended to school as well. Two years ahead of his classmates by age 11, Stallman endured all the usual frustrations of a gifted public-school student. It wasn't long after the puzzle incident that his mother attended the first in what would become a long string of parent-teacher conferences.
#ST:Fusions
#R:The belief in individual freedom over arbitrary authority extended to school as well. Two years ahead of his classmates by age 11, Stallman endured all the usual frustrations of a gifted public-school student. It wasn't long after the puzzle incident that his mother attended the first in what would become a long string of parent-teacher conferences.
#BV:Thirty years later, Breidbart remembers
#CP:(Picture: Credit – Jos Bleau) or #CP:(Picture: Thanks)
The expected output I need (Half pseudo code; Unescaped quotes):
<?php
$title1 = 'The German sociologist Max Weber once proposed';
$signature = 'Jos Bleau';
$email = 'jos.bleau#domain.com';
$lead = '"He used to be so conservative," she says, throwing up her hands in mock exasperation. "We used to have the worst arguments right here at this table. I was part of the first group of public city school teachers that struck to form a union, and Richard was very angry with me. He saw unions as corrupt. He was also very opposed to social security. He thought people could make much more money investing it on their own. Who knew that within 10 years he would become so idealistic';
$text[] = 'At first, <#Ri>Stallman viewed these notices<#$p> with alarm. Rare was the software program that didn't borrow source code from past programs, and yet, with a single stroke of the president's pen, Congress had given programmers and companies the power to assert individual authorship over communally built programs. It also injected a dose of formality into what had otherwise been an informal system.
The AI Lab of the 1970s was by all accounts a special place. Cutting-edge projects and top-flight researchers gave it an esteemed position in the world of computer science. The internal hacker culture and its anarchic policies lent a rebellious mystique as well. Only later, when many of the lab's scientists and software superstars had departed, would hackers fully realize the unique and ephemeral world they had once inhabited.
As a single parent for nearly a decade-she and Richard's father, Daniel Stallman, were married in 1948, divorced in 1958, and split custody of their son afterwards-Lippman can attest to her son's aversion to authority. She can also attest to her son's lust for knowledge. It was during the times when the two forces intertwined, Lippman says, that she and her son experienced their biggest battles.';
$subtitle[] = 'Fusions';
//etc...
?>
Note:
The names like $title1 and #T1 are completely unrelated to each other and $title1 is just used as example. It could also be $xy or something else
If #XY appears more than once in the file then the values should be added as array element, else as simple assignment
I don't know whether preg_split() is the right direction and whether I can do this with it. Or do I have to use other functions to accomplish this?
Explanation
First we read the text file into a variable with file_get_contents() and initialize the $output array, in which each element is one line of the output, with the opening PHP tag <?php.
You can also modify $lookup with shortcut => variable name elements, where you can define which #XY: gets replaced with which variable name. If not defined the shortcut will be used as variable name.
Now that we have prepared some stuff we match each #XY: with the corresponding data with preg_match_all().
Regular Expression
/#(\w+):(.*?)(?=#\w+:)/s
\w+ matches one or more word characters ([a-zA-Z0-9_]), which is the XY part of #XY:; we keep it with a capturing group
+ is a quantifier and says that \w should match 1 or more times
(.*?) matches any characters, but as few as needed (the ? makes the quantifier lazy)
With the s flag, . also matches newlines
(?=#\w+:) makes sure (.*?) matches everything up to the next #XY: and no more. ?= is a positive lookahead: it checks whether the pattern in the parentheses (#\w+:) can be matched at that position, without consuming it
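The pattern can be tried on a small input. The following is only a sketch on a hypothetical three-entry sample; it shows that the lookahead requires another #XY: to follow, so the very last entry in a file is not matched. The second pattern, which also accepts the end of input (\z), is an assumed variant and not part of the original answer.

```php
<?php
// Hypothetical three-entry sample in the question's #XY: format.
$text = "#T1:First title\n#R:Line one\nline two\n#S:Jos Bleau";

// Pattern from the answer: the last entry is skipped because
// no further "#XY:" follows it.
preg_match_all('/#(\w+):(.*?)(?=#\w+:)/s', $text, $m);
print_r($m[1]); // shortcuts found: T1, R

// Assumed variant: the lookahead also succeeds at the end of the input.
preg_match_all('/#(\w+):(.*?)(?=#\w+:|\z)/s', $text, $all);
print_r($all[1]); // shortcuts found: T1, R, S

// Thanks to the s flag, the value of #R spans two lines.
echo trim($all[2][1]), PHP_EOL;
```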
We also preemptively save the amount each shortcut appears in the data with array_count_values().
Now that we have matched all data which we want we can loop through all shortcuts, which are saved in $m[1]. In the foreach loop we simply check if you have defined a lookup variable name or if we use the shortcut as variable name.
Then we simply add each assignment as new element to the output array. Where you have to note three things:
Complex (curly) syntax is used, so that you don't get problems with invalid variable names, see: How can I access a property with an invalid name?
Depending on how many times a shortcut appears in the data, we decide whether it should be added as an array element or as a normal assignment. If the shortcut appears more than once in the data, the value is added as an array element, otherwise as a simple string assignment
We use trim() to remove spaces, new lines, ... from the start and end of the string. And we use addslashes(), so we don't get problems with quotes
And now we are done. Depending on how you want to output the result, you can save it to a file with file_put_contents() or just print out the array.
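One caveat with the addslashes() step: inside a single-quoted PHP string only \' and \\ are escape sequences, so the \" that addslashes() produces stays in the generated value as a literal backslash followed by a quote. A possible alternative, shown here only as a sketch with a hypothetical value, is var_export(), which emits a valid single-quoted PHP literal.

```php
<?php
// Hypothetical value containing both quote types and a backslash.
$value = 'He said "it\'s fine" \\ done';

// addslashes() escapes ', " and \. In a single-quoted literal the
// sequence \" is not an escape, so the backslash would remain.
$withAddslashes = "'" . addslashes($value) . "'";

// var_export() escapes only ' and \, which is exactly what a
// single-quoted PHP literal needs.
$withVarExport = var_export($value, true);

echo $withAddslashes, PHP_EOL; // 'He said \"it\'s fine\" \\ done'
echo $withVarExport, PHP_EOL;  // 'He said "it\'s fine" \\ done'
```

In the answer's code this would mean replacing "'" . addslashes(trim(...)) . "'" with var_export(trim(...), true).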
Code
<?php
$text = file_get_contents("test.txt");
$output = ["<?php"];
$lookup = []; //Example: ["ST" => "subtitle"]
preg_match_all("/#(\w+):(.*?)(?=#\w+:)/s", $text, $m);
$variableShortcutCount = array_count_values($m[1]);
foreach($m[1] as $key => $variableShortcut){
if(isset($lookup[$variableShortcut])){
$output[] = '${"' . $lookup[$variableShortcut] . ($variableShortcutCount[$variableShortcut] > 1 ? '"}[]' : '"}') . " = '". addslashes(trim($m[2][$key])) . "';" ;
} else {
$output[] = '${"' . $variableShortcut . ($variableShortcutCount[$variableShortcut] > 1 ? '"}[]' : '"}') . " = '". addslashes(trim($m[2][$key])) . "';" ;
}
}
//Output to file
//file_put_contents("output.txt", implode(PHP_EOL, $output));
//Output to browser
echo "<pre><code>";
highlight_string(implode(PHP_EOL, $output));
?>
output:
<?php
${"T1"} = 'The German sociologist Max Weber once proposed';
${"S"} = 'Jos Bleau';
${"C"} = 'jos.bleau#domain.com';
${"L"} = '\"He used to be so conservative,\" she says, throwing up her hands in mock exasperation. \"We used to have the worst arguments right here at this table. I was part of the first group of public city school teachers that struck to form a union, and Richard was very angry with me. He saw unions as corrupt. He was also very opposed to social security. He thought people could make much more money investing it on their own. Who knew that within 10 years he would become so idealistic';
${"R"}[] = 'At first, <#Ri>Stallman viewed these notices<#$p> with alarm. Rare was the software program that didn\'t borrow source code from past programs, and yet, with a single stroke of the president\'s pen, Congress had given programmers and companies the power to assert individual authorship over communally built programs. It also injected a dose of formality into what had otherwise been an informal system.
The AI Lab of the 1970s was by all accounts a special place. Cutting-edge projects and top-flight researchers gave it an esteemed position in the world of computer science. The internal hacker culture and its anarchic policies lent a rebellious mystique as well. Only later, when many of the lab\'s scientists and software superstars had departed, would hackers fully realize the unique and ephemeral world they had once inhabited.
As a single parent for nearly a decade-she and Richard\'s father, Daniel Stallman, were married in 1948, divorced in 1958, and split custody of their son afterwards-Lippman can attest to her son\'s aversion to authority. She can also attest to her son\'s lust for knowledge. It was during the times when the two forces intertwined, Lippman says, that she and her son experienced their biggest battles.';
${"subtitle"}[] = 'Fusions';
${"R"}[] = 'Such mythological descriptions, while extreme, underline an important fact. The ninth floor of 545 Tech Square was more than a workplace for many. For hackers such as Stallman, it was home.
The belief in individual freedom over arbitrary authority extended to school as well. Two years ahead of his classmates by age 11, Stallman endured all the usual frustrations of a gifted public-school student. It wasn\'t long after the puzzle incident that his mother attended the first in what would become a long string of parent-teacher conferences.';
${"subtitle"}[] = 'Fusions';
${"R"}[] = 'The belief in individual freedom over arbitrary authority extended to school as well. Two years ahead of his classmates by age 11, Stallman endured all the usual frustrations of a gifted public-school student. It wasn\'t long after the puzzle incident that his mother attended the first in what would become a long string of parent-teacher conferences.';
${"BV"} = 'Thirty years later, Breidbart remembers';
${"CP"} = '(Picture: Credit – Jos Bleau) or';

How do I get the data from a particular row of JSON data using PHP?

I have a JSON data file that looks like this:
{
"aaData": [
["1","An Act Appropriating Funds for the Operation of the Government of the Commonwealth of the Philippines Beginning July First, Nineteen Hundred and Forty-Six Until the General Appropriations Act for the Fiscal Year Nineteen Hundred and Forty-Seven is Approved","1946-07-15"],
["2","An Act Appropriating Fifty Thousand Pesos to Defray the Expenses of a State Funeral for Manuel L. Quezon and for the Erection of a Mausoleum to Contain His Remains","1946-07-19"],
["3","An Act to Continue in Force and Effect the Act of the Congress of the United States, Approved on August 5, 1909, Entitled “An Act to Raise Revenue for the Philippine Islands, and for Other Purposes,” Otherwise Known as “The Philippine Tariff Law of 1909,” as Amended","1946-07-19"],
["4","An Act to Amend Section Twenty-Six Hundred and Ninety-Two of the Revised Administrative Code, and to Exempt from Responsibility Those Who Should Surrender Firearms Under Certain Conditions, and for Other Purposes","1946-07-19"],
["5","An Act to Amend Sections Two and Five of Commonwealth Act Numbered Five Hundred Eighteen, Entitled “An Act to Establish the National Coconut Corporation, and to Appropriate Additional Operating Capital for Said Corporation”","1946-08-01"],
["6","An Act to Provide That as of the Date of the Proclamation of the Republic of the Philippines the Present Congress of the Philippines Shall be Known as the First Congress of the Republic of the Philippines, and for Other Purposes","1946-08-05"],
["7","An Act to Establish the Foreign Funds Control Office, and for Other Purposes","1946-08-09"],
["8","An Act to Authorize the President of the Philippines to Enter Into Such Contracts or Undertakings as May be Necessary to Effectuate the Transfer to the Republic of the Philippines Under the Philippine Property Act of Nineteen Hundred and Forty-Six of Any Property or Property Rights or the Proceeds Thereof Authorized to be Transferred Under Said Act; Providing for the Administration and Disposition of Such Properties Once Received; and Appropriating the Necessary Funds Therefore","1946-08-09"],
["9","An Act to Authorize the President of the Philippines to Enter Into an Agreement or Agreements with the Government of the United States Pursuant to United States Public Act Numbered Four Hundred and Fifty-Four, Commonly Called the “Republic of the Philippines Military Assistance Act,” and to Issue the Necessary Rules and Regulations to Implement Said Act, and Providing Penalties for Violations Thereof","1946-09-02"],
["10","An Act Penalizing Usurpation of Public Authority","1946-09-02"],
["11","An Act to Prohibit the Slaughtering of Male and Female Carabaos, Horses, Mares, and Cows","1946-09-02"],
["12","An Act Amending Articles One Hundred Forty-Six, Two Hundred Ninety-Five, Two Hundred Ninety-Six and Three Hundred Six of the Revised Penal Code","1946-09-05"],
["13","An Act to Amend Sections Five and Six of Commonwealth Act Numbered Six Hundred and Seventy-Two, Entitled “An Act to Rehabilitate the Philippine National Bank”","1946-09-05"]
]
}
I'm trying to create a standard way to get data from the file just by specifying a particular row and mapping the row data into particular variables. I imagine there's a way to do this by converting the JSON into an array, but I'm finding it difficult to understand how to select a particular row and then map the row's data into variables.
Ultimately, I want to call this function from another PHP file via includes and echoing/printing the result. I think my code would look like this:
<?php echo '<a href="' . $link . '" ' . 'title="' . $title . '">' ?>
and my data would be mapped as follows:
col1 => row specifier
col2 => link
col3 => title
I hope I've explained my question properly. I'm not particularly well-versed in the proper vocabulary to explain this problem. Thanks in advance! :)
You first need to json_decode() the JSON data into a native PHP object. The data contains an object ({...}) with a member aaData that contains an array ([...]), so you need to access the data as $data->aaData[$row].
I notice that your link data contains the entire <a> element and not just the link. If you want to add a title attribute to the <a> element, you would have to extract the link from it and then reassemble the <a> element with the title added. A much easier way would be to wrap the <a> element inside a <span> with the appropriate title attribute:
<?php
$json_data = '
{
"aaData": [
["1","An Act Appropriating Funds for the Operation of the Government of the Commonwealth of the Philippines Beginning July First, Nineteen Hundred and Forty-Six Until the General Appropriations Act for the Fiscal Year Nineteen Hundred and Forty-Seven is Approved","1946-07-15"],
["2","An Act Appropriating Fifty Thousand Pesos to Defray the Expenses of a State Funeral for Manuel L. Quezon and for the Erection of a Mausoleum to Contain His Remains","1946-07-19"],
["3","An Act to Continue in Force and Effect the Act of the Congress of the United States, Approved on August 5, 1909, Entitled “An Act to Raise Revenue for the Philippine Islands, and for Other Purposes,” Otherwise Known as “The Philippine Tariff Law of 1909,” as Amended","1946-07-19"],
["4","An Act to Amend Section Twenty-Six Hundred and Ninety-Two of the Revised Administrative Code, and to Exempt from Responsibility Those Who Should Surrender Firearms Under Certain Conditions, and for Other Purposes","1946-07-19"],
["5","An Act to Amend Sections Two and Five of Commonwealth Act Numbered Five Hundred Eighteen, Entitled “An Act to Establish the National Coconut Corporation, and to Appropriate Additional Operating Capital for Said Corporation”","1946-08-01"],
["6","An Act to Provide That as of the Date of the Proclamation of the Republic of the Philippines the Present Congress of the Philippines Shall be Known as the First Congress of the Republic of the Philippines, and for Other Purposes","1946-08-05"],
["7","An Act to Establish the Foreign Funds Control Office, and for Other Purposes","1946-08-09"],
["8","An Act to Authorize the President of the Philippines to Enter Into Such Contracts or Undertakings as May be Necessary to Effectuate the Transfer to the Republic of the Philippines Under the Philippine Property Act of Nineteen Hundred and Forty-Six of Any Property or Property Rights or the Proceeds Thereof Authorized to be Transferred Under Said Act; Providing for the Administration and Disposition of Such Properties Once Received; and Appropriating the Necessary Funds Therefore","1946-08-09"],
["9","An Act to Authorize the President of the Philippines to Enter Into an Agreement or Agreements with the Government of the United States Pursuant to United States Public Act Numbered Four Hundred and Fifty-Four, Commonly Called the “Republic of the Philippines Military Assistance Act,” and to Issue the Necessary Rules and Regulations to Implement Said Act, and Providing Penalties for Violations Thereof","1946-09-02"],
["10","An Act Penalizing Usurpation of Public Authority","1946-09-02"],
["11","An Act to Prohibit the Slaughtering of Male and Female Carabaos, Horses, Mares, and Cows","1946-09-02"],
["12","An Act Amending Articles One Hundred Forty-Six, Two Hundred Ninety-Five, Two Hundred Ninety-Six and Three Hundred Six of the Revised Penal Code","1946-09-05"],
["13","An Act to Amend Sections Five and Six of Commonwealth Act Numbered Six Hundred and Seventy-Two, Entitled “An Act to Rehabilitate the Philippine National Bank”","1946-09-05"]
]
}
';
$data = json_decode ($json_data);
if (!$data) {
die ("Failed to decode JSON data");
}
/* Remap data so that $links is indexed by row specifier */
foreach ($data->aaData as $row) {
$links[$row[0]] = array ($row[1], $row[2]);
}
function get_row ($row)
{
global $links;
if (isset ($links[$row])) {
return $links[$row];
} else {
return NULL;
}
}
list ($link, $title) = get_row (5);
if (isset ($link)) {
echo "<span title=\"$title\">$link</span>\n";
} else {
echo "Row not found.\n";
}
I'm assuming here that the href attribute and the title data have been properly encoded using htmlspecialchars() or similar.
Use json_decode() with true as the second argument to convert the JSON to a PHP array:
$dataArray = json_decode($jsonString, true);
http://php.net/manual/en/function.json-decode.php
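With the array form, selecting a row by its first column could look like the following sketch. The shortened data, the find_row() helper and the variable names are hypothetical; the column meaning follows the mapping given in the question (row specifier, link, title).

```php
<?php
// Shortened, hypothetical version of the question's data.
$json = '{"aaData":[["1","link-1","Title 1"],["2","link-2","Title 2"]]}';

// true => nested PHP arrays instead of stdClass objects.
$data = json_decode($json, true);
if (!is_array($data) || !isset($data['aaData'])) {
    die('Failed to decode JSON data');
}

// Hypothetical helper: return the row whose first column equals $id,
// or null when no such row exists.
function find_row(array $rows, string $id): ?array
{
    foreach ($rows as $row) {
        if ($row[0] === $id) {
            return $row;
        }
    }
    return null;
}

$row = find_row($data['aaData'], '2');
if ($row !== null) {
    // Map the columns to variables as described in the question.
    list(, $link, $title) = $row;
    echo '<a href="' . htmlspecialchars($link) . '" title="'
        . htmlspecialchars($title) . '">' . PHP_EOL;
} else {
    echo 'Row not found.' . PHP_EOL;
}
```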

Subscripting a string in Javascript

I have data in section_data.title, and I am trying to use str.sup(), where str = section_data.title and str holds the following data:
str="Not less than 30 net ft2 (2.8 net m2) per patient in a
hospital or nursing home, or not less than 15 net ft2 (1.4 net
m2) per resident in a limited care facility, shall be provided within the aggregated area of corridors, patient rooms, treatment
rooms, lounge or dining areas, and other similar areas on each side of
the horizontal exit. On stories not housing bed or litterborne
patients, not less than 6 net ft2 (0.56 net m2) per occupant
shall be provided on each side of the horizontal exit for the total
number of occupants in adjoining compartments."
Now I want to add superscript to the words indicated above (e.g. ft2). How can I do this using str.sup(), or is there an alternative method or trick to do so in JavaScript?
A string in JavaScript carries no formatting; you can only apply formatting when you output it as HTML. So basically you must write it like this:
var str = "Not less than 30 net ft<sup>2</sup> (2.8 net m<sup>2</sup>)";
document.write(str);
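You can also do a find-and-replace that turns every ft2 and m2 in the string into ft<sup>2</sup> and m<sup>2</sup>. Note that replace() returns a new string, so assign the result:
str = str.replace(/\b(ft|m)2\b/g, "$1<sup>2</sup>"); // still a heuristic: every standalone "ft2" or "m2" is rewritten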

how to format html tags fetching from xml

I need to render HTML tags that come from XML.
Example:
<p><b>Location. </b> <br />Located in central Chennai, The Raintree Hotel,
Anna Salai is connected to the airport and close to U. S. Consulate,
Valluvar Kottam, and Anna University.<p><b>Hotel features</b>
Other points of interest near this luxury hotel include SDAT Tennis Stadium and Kapalishvara Temple. </p><p><b>Hotel Features. </b><br />Dining options at The Raintree Hotel, Anna Salai include 2 restaurants. A swim-up bar and a bar/lounge are open for drinks. <p>
When I fetch this data from the XML using SimpleXML in PHP, it shows the raw data as above. I don't want this raw data; I need data like this:
Location
Located in central Chennai, The Raintree Hotel, Anna Salai is connected to the airport and close to U. S. Consulate, Valluvar Kottam, and Anna University. Other points of interest near this luxury hotel include SDAT Tennis Stadium and Kapalishvara Temple.
Hotel Features
Dining options at The Raintree Hotel, Anna Salai include 2 restaurants. A swim-up bar and a bar/lounge are open for drinks. Room service is available 24 hours a day. The hotel serves a complimentary breakfast. Recreational amenities include an outdoor pool, a
I need the formatted HTML shown in the browser, not the HTML tags.
thanks in advance
You could use html_entity_decode() if you control how the HTML is rendered. If not, you expose yourself to cross-site scripting attacks.
You can work around this either by changing the XML format so that it does not require HTML tags (i.e. include a <location> and a <description> tag or similar) or, if you get the HTML from an external party, by writing a parser yourself to safely extract just the parts you need.
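A rough sketch of the html_entity_decode() route, assuming the HTML arrives entity-encoded inside the XML. The sample string and the list of allowed tags are hypothetical. strip_tags() with an allow-list is used here as a simple stand-in for a real sanitizer: it removes other tags but keeps their inner text, and it does not remove attributes on the allowed tags, so it is not a complete defence against cross-site scripting.

```php
<?php
// Hypothetical value as SimpleXML might return it: HTML stored
// entity-encoded inside the XML document.
$fromXml = '&lt;p&gt;&lt;b&gt;Location. &lt;/b&gt;&lt;br /&gt;'
    . 'Located in central Chennai.&lt;/p&gt;'
    . '&lt;script&gt;alert(1)&lt;/script&gt;';

// Turn the entities back into real markup.
$html = html_entity_decode($fromXml, ENT_QUOTES, 'UTF-8');

// Keep only a small set of formatting tags; everything else is stripped.
$safe = strip_tags($html, '<p><b><br>');

echo $safe, PHP_EOL;
```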
Simply assign the value from the XML to the innerHTML property of your target element. That's it. Hope this helps.

Named entity recognition with preset list of names for Python / PHP

I'm trying to process a CSV file that has in each row a text field with the name of an organization and the position of an individual within that organization, as unstructured text. This field is usually a mess of text like this:
Assoc. Research Professor Dept. Psychology Univ. California Santa Barbara
I need to pull out the position and the organization name. For the position, I use preg_match for a series of about 60 different regular expressions for the different professions, and I think it works pretty well (my guess is that it catches about 80%). But, I'm having trouble catching the organization name. I have a MySQL table with roughly 16,000 organization names that I can perform a simple preg_match for, but due to common misspellings and abbreviations, it's only catching about 30% of the organizations. For example, my database has
University of California Santa Barbara
But the CSV file might have any of the options:
Univ Cal Santa Barbara
University Cal-Santa Barbara
University California-Santa Barbara
Cal University, Santa Barbara
I need to process several hundred thousand records, and I can't spend the time to correct 70% of the records that are currently not being processed correctly or painstakingly create multiple aliases for each organization. What I would like to be able to do is to catch small differences (such as the small misspellings, hyphens versus spaces, and common abbreviations), and, if still no matches are found, to ideally recognize an organizational name and create a new record for it.
What libraries or tools in Python or PHP would allow me to perform a similarity match that would have a broader reach?
Would NLTK in Python catch misspellings?
Is it possible to use AlchemyAPI to catch misspelled organizations? So far I've only been able to use it to catch correctly spelled organizations.
Since I'm comparing a short string (the organization name) to a longer string (that includes the name plus extraneous information) is there any hope in using PHP's similar_text function?
Any help or insight would be appreciated.
This is within the domain of fuzzy logic. See if these are of any help:
http://www.phpclasses.org/blog/post/119-Neural-Networks-in-PHP.html
http://ann.thwien.de/index.php/Installation
You may be able to use difflib to calculate the similarity ratio between the CSV input and the canonical spelling, and consider it a match if it's above a certain threshold (say, 0.65).
For example:
import difflib
exact = 'University of California Santa Barbara'
inputs = ['Univ Cal Santa Barbara',
          'University Cal-Santa Barbara',
          'University California-Santa Barbara',
          'Cal University, Santa Barbara',
          'Canterbury University']
# Fix the canonical spelling as the first sequence and compare each input to it
sm = difflib.SequenceMatcher(None, exact)
ratios = []
for candidate in inputs:
    sm.set_seq2(candidate)
    ratios.append(sm.ratio())
print(ratios)
gives:
[0.73333333333333328, 0.81818181818181823, 0.93150684931506844,
0.71641791044776115, 0.33898305084745761]
Note how 'Canterbury University' has a much lower match ratio() than the inputs you gave.
Then again, SequenceMatcher.ratio() may be too slow when computed against 16,000 values for every record.
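Since the question also asks about PHP's similar_text(), a rough PHP counterpart of the same idea could look like this. It is only a sketch: normalize_name() and best_match() are hypothetical helpers, the two-entry canonical list stands in for the MySQL table, and any acceptance threshold would have to be tuned on real data. Like the difflib approach, it compares each input against every canonical name, so the same performance concern applies.

```php
<?php
// Hypothetical helper: lowercase the name and turn punctuation,
// hyphens and repeated whitespace into single spaces.
function normalize_name(string $name): string
{
    $name = strtolower($name);
    $name = preg_replace('/[^a-z0-9]+/', ' ', $name);
    return trim($name);
}

// Hypothetical helper: return [best canonical name, similarity in percent].
function best_match(string $input, array $canonical): array
{
    $best = null;
    $bestPercent = -1.0;
    foreach ($canonical as $candidate) {
        // similar_text() writes the similarity percentage into $percent.
        similar_text(normalize_name($input), normalize_name($candidate), $percent);
        if ($percent > $bestPercent) {
            $best = $candidate;
            $bestPercent = $percent;
        }
    }
    return array($best, $bestPercent);
}

// Stand-in for the table of canonical organization names.
$canonical = array(
    'University of California Santa Barbara',
    'Canterbury University',
);

foreach (array('Univ Cal Santa Barbara', 'University California-Santa Barbara') as $input) {
    list($name, $percent) = best_match($input, $canonical);
    printf("%s => %s (%.1f%%)\n", $input, $name, $percent);
}
```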
