I'm working with PHP and Elasticsearch 6.5.2, testing with Postman. When I apply highlighting with two or more fragments, only a trimmed piece of the content around the matched keyword is returned. I don't want the content trimmed at the Elasticsearch level, so I set number_of_fragments to 0 and got the expected output from Elasticsearch: the whole content. But when I check the output in the PHP application, the whole content is rendered in bold whenever the matched keyword exists in the url field.
Index:
PUT test/_doc/1
{
"title":"Apply For the admissions graduate and undergraduate",
"url":"https://someurl.com/admissions",
"content": "Engineers play an important role in almost every aspect of modern life. As an engineer in the 21st century, you’ll work in teams to develop ingenious ways to transform the world in which we live. Industrial engineers are in high demand in nearly every industry. Astounding innovations in semiconductor microelectronic engineering will continue to drive productivity and the economy by playing a key role in a wide range of technologies – information, communication, nanotechnology, defense, medicine, and energy.Admission into the microelectronic engineering program is competitive, but our admission process is a personal one. Each application is reviewed holistically for strength of academic preparation, performance on standardized tests, counselor recommendations, and your personal career interests. We seek applicants from a variety of geographical, social, cultural, economic, and ethnic backgrounds."
}
Query:
{
"query":{
"query_string":{
"fields":[
"content"
],
"query":"admissions"
}
},
"highlight":{
"fields":{
"title":{
"pre_tags":[
"<strong>"
],
"post_tags":[
"</strong>"
],
"number_of_fragments":3 //changed to 0 earlier
},
"content":{
"pre_tags":[
"<strong>"
],
"post_tags":[
"</strong>"
],
"fragment_size":150,
"number_of_fragments":3 //changed to 0 earlier
}
}
}
}
Result:
"highlight": {
"content": [
"information, communication, nanotechnology, defense, medicine, and energy.Admission into the microelectronic engineering program is competitive, but our <strong>admission</strong>"
]
}
After looking at the code you shared in the comments, the issue is not in the Elasticsearch response.
The following line is causing it:
<?php echo substr($r['highlight']['content'][0],0,300); ?>
Because you take only the first 300 characters, one of the results is missing its closing </b> tag.
If you look at your HTML, one of the highlighted fragments contains the following:
Tourism Management Curriculum Capstone/Exam/Thesis Options <b>Admissi</div>
The closing </b> tag is missing after <b>Admissi, so everything after it is rendered in bold.
One solution is to not use substr() on the highlighted HTML.
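If the output still has to be shortened, one option is to count only the visible text and close any highlight tag that is open at the cut. The sketch below is a minimal illustration, not part of the original code: truncateHighlight() is a hypothetical helper, and it assumes the fragment is well-formed and contains only simple, non-nested tags such as <b> or <strong>.

```php
<?php
// Hypothetical helper: truncate a highlighted fragment to $limit visible
// bytes without cutting a tag in half or leaving a tag unclosed.
// Assumes well-formed input with simple, non-nested highlight tags.
// Uses strlen()/substr(), so multibyte text would need the mb_* variants.
function truncateHighlight(string $html, int $limit): string
{
    // Split into tags and text runs, keeping the tags as separate tokens.
    $tokens = preg_split(
        '/(<[^>]+>)/',
        $html,
        -1,
        PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY
    );

    $out = '';
    $visible = 0;     // number of text bytes emitted so far
    $openTag = null;  // name of the currently open tag, if any

    foreach ($tokens as $token) {
        if (preg_match('/^<(\/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*>$/', $token, $m)) {
            // Tags are copied through and do not count towards the limit.
            $openTag = ($m[1] === '/') ? null : $m[2];
            $out .= $token;
            continue;
        }

        $remaining = $limit - $visible;
        if (strlen($token) >= $remaining) {
            // The limit falls inside this text run: cut here and stop.
            $out .= substr($token, 0, $remaining);
            break;
        }

        $out .= $token;
        $visible += strlen($token);
    }

    // Close the highlight tag if the cut happened inside it.
    if ($openTag !== null) {
        $out .= '</' . $openTag . '>';
    }

    return $out;
}

echo truncateHighlight('Options <b>Admission</b> and more', 15), PHP_EOL;
```

In the template, the call would replace the substr() call, for example truncateHighlight($r['highlight']['content'][0], 300).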
I have started a new plugin that is supposed to show a random quote after each post.
It does not give any error, but it does not show any text after my posts.
I tried it this way, but it seems I went wrong somewhere.
After activation it should show a random text, but nothing appears; I could not make it work.
After looking at my code I tried modifying it,
and there are no errors, but the code still does not show the text.
First, thank you #mangenta for the help.
But I still have the same problem as in my first question:
I want to pass the quote variable $quote to the signature, but it does not seem to work.
The text is not shown after my post as I need, and I have not yet found a way to pass
the quote variable to the signature code.
This is my code:
<?php
/*
Plugin Name: Random Text Quotes
Version: 1.0
Description: Random Text Quotes Albert Einstein.
Plugin URI: https://mediaads.eu/
Author: Helder Ventura
Author URI: https://mediaads.eu
Version: (standalone)
Usage: install activate and done
*/
$bgcolor = '#FFFFCC';
$textcolor = 'black';
$textsize = '2';
// Array Structure: "Quote","Author"
$allqts = array
("*His aim was to substitute for a petrified and barren
system of ideas the unbiased and strenuous quest
for a deeper and more consistent comprehension of
physical and astronomical facts.",
"Albert Einstein",
"The discovery and use of scientific reasoning by
Galileo was one of the most important achievements
in the history of human thought.<br>" ,
"Albert Einstein",
"I admire Gandhi greatly but I believe there are two
weaknesses in his program.",
"Albert Einstein",
"I believe that Gandhi’s views were the most enlightened
among all of the political men of our time.",
"Albert Einstein",
"Liberty, when it begins to take root, is a plant of rapid growth.",
"George Washington",
"Gandhi, the greatest political genius of our time,
indicated the path to be taken.",
"Albert Einstein",
"Gandhi’s development resulted from extraordinary
intellectual and moral forces in combination with
political ingenuity and a unique situation.",
"Albert Einstein",
"Anyone who has never made a mistake has never tried anything new",
"Albert Einstein",
"Progress doesn't come from early risers, progress is made<br>by lazy men
looking for easier ways to do things.",
"Lazarus Long <font size=-2>(Time Enough for Love by Robert A. Heinlein)
</font>",
"On Johann Wolfgang von Goethe (1749–1832)
I feel in him a certain condescending attitude toward
the reader, and miss the humility that is comforting,
especially when it comes from great men.",
"Albert Einstein",
"*This was the first time I’ve ever heard of such
an important man who speaks at least briefly with
his mother every day.",
"Albert Einstein",
"On Werner Heisenberg (1901–1976)
Professor Heisenberg was here, a German.",
"Albert Einstein",
"I am very happy here and enjoying the American
summer as well as the news about Hitler’s mad
deed of desperation.",
"Albert Einstein",
"Hitler appeared, a man with limited intellectual
abilities and unfit for any useful work, bursting with
envy and bitterness against all whom circumstance
and nature had favored over him. . . .",
"Albert Einstein",
"*I haven’t forgotten that the Swiss authorities didn’t
stand by me in any way when Hitler stole all of my
savings, even those designated for my children.",
"Albert Einstein",
"On Immanuel Kant (1724–1804)
*Kant’s much-praised view on Time reminds me of
Andersen’s tale of the emperor’s new clothes, only
that instead of the emperor’s new clothes we have
the form of intuition.",
"Albert Einstein",
"*Kant is sort of a highway with lots and lots of milestones.
Then all the little dogs come and each deposits
his contribution at the milestones.",
"Albert Einstein",
"What seems to me the most important thing in
Kant’s philosophy is that it speaks of a priori concepts
for the construction of science.",
"Albert Einstein",
"Kant, thoroughly convinced of the indispensability
of certain concepts, took them—just as they are selected—to
be the necessary premises for every kind
of thinking and differentiated them from concepts
of empirical origin.",
"Albert Einstein",
"On George Kennan (1904–2005)
Princeton University Press sent me George Kennan’s
new book [Realities of American Foreign Policy] and I
read it right away.",
"Albert Einstein",
"[Kepler] belonged to those few who cannot do otherwise
than openly acknowledge their convictions
on every subject. . . .",
"Albert Einstein",
"There we meet a finely sensitive person, passionately
dedicated to the search for a deeper insight into the
essence of natural events, who, despite internal and
external difficulties, reached his loftily placed goal.",
"Albert Einstein",
"Share Your Love With The World.",
"Helder Ventura"
);
// Gets the Total number of Items in the array
// Divides by 2 because there is a Quote followed by an Author
$totalqts = (count($allqts)/2);
// Subtracted 1 from the total because '0' is not accounted for otherwise
$nmbr = (rand(0,($totalqts-1)));
$nmbr = $nmbr*2;
//$nmbr = 18;
$quote = $allqts[$nmbr];
$nmbr = $nmbr+1;
$author = $allqts[$nmbr];
// You can delete this section
// it is only so Search engines can find it
if ($_SERVER['PHP_SELF'] == "/quotes.php") {
echo "<Title>Random Text Quote</title>";
echo "<meta name=\"Description\" content=\"Random Text Quote\">";
echo "<meta name=\"keywords\" content=\"Random Text Quote\">";
}
/// End Delete
$space = "<font color=$bgcolor>.....................................</font>";
$comments = "<br><center><font size='-2'><i><a href='quotes.php'>Random Text
Quote</a></i></font></center>";
echo "<center>";
echo "<br>";
echo "<br>";
echo "<br>";
echo "<br>";
echo "<br>";
echo "<br>";
echo "<br>";
echo "<br>";
echo "<br>";
echo "<Font color=$textcolor size='$textsize'><i>";
echo "$quote<br>";
echo "</i></font>";
echo "$space $author";
echo "$comments";
echo "</center>";
// You can delete this section as well - it's my shameless plug:
// it is only so Search engines can find it
if ($_SERVER['PHP_SELF'] == "/quotes.php") {
echo "<br/><br/>If you <i>really</i> like it, I do accept donations via
PayPal: <a href='http://zonadelike.publiadds.org.pt//donate'>Donations</a>";
echo "<br/><br/>";
}
/// End Delete
IF ($_SERVER['PHP_SELF'] == "/quotes.php") {
show_source("quotes.php");
}
// Add Signature Image after single post
add_filter('the_content','add_signature', 1);
function add_signature($text) {
global $post;
if(($post->post_type == 'post'))
$text .= '<div class="signature"><a href="https://mediaads.eu/"
target="_blank" title=$text>'.$quote.'</a></div>';
return $text;
}
?>
I tested the following code from your original post.
<?php
/*
Plugin Name: Random Text Quotes
Version: 1.0
Description: Random Text Quotes Albert Einstein.
Plugin URI: https://mediaads.eu/
Author: Helder Ventura
Author URI: https://mediaads.eu
Version: (standalone)
Usage: install activate and done
*/
function ab_arq_generate() {
$quotes = array(
'I am happy to be in Boston. I have heard of Boston
as one of the most famous cities in the world and the
center of education. I am happy to be here and expect
to enjoy my visit to this city and to Harvard.
On his visit to the city with Chaim Weizmann. New York
Times, May 17, 1921. Contributed by A. J. Kox in response
to the many quotations about Princeton in this book (see
later in this section).',
'*America is interesting, with all its hustle and bustle.
It is easier to feel enthusiasm for it than for other
countries I’ve unsettled with my presence. I had to
consent to being shown around like a prize ox to address
innumerable small and large gatherings. . . .
It’s a wonder I survived it all.
To Michele Besso, ca. May 21–30, 1921. CPAE, Vol. 12,
Doc. 141',
'*It is the women . . . who dominate all of American
life. The men are interested in nothing at all; they
work, work as I haven’t seen anyone work anywhere
else. For the rest, they are toy dogs for their
wives, who spend the money in the most excessive
fashion and who shroud themselves in a veil of
extravagance.
From an interview in the Nieuwe Rotterdamsche Courant,
July 4, 1921. Einstein insisted he was wrongly quoted and
wrote a rebuttal in the Vossische Zeitung six days later,
claiming he was shocked when he read the account. ',
'Even if Americans are less scholarly than Germans,
they do have more enthusiasm and energy, causing
a wider dissemination of new ideas among the
people.
Quoted in the New York Times, July 12, 1921
',
'A firm approach is indispensable everywhere in
America; otherwise one receives no payment and
little esteem.
To Maurice Solovine, January 14, 1922. Published in Letters
to Solovine, 49. Einstein Archives 21-157'
);
return $quotes[rand(0, count($quotes)-1)];
}
function ab_arq_change_bloginfo( $text, $show ) {
if( 'description' == $show ) {
$text = ab_arq_generate();
}
return $text;
}
add_filter( 'bloginfo', 'ab_arq_change_bloginfo', 10, 2 );
// Add Signature Image after single post
add_filter('the_content','add_signature', 1);
function add_signature($text) {
global $post;
if(($post->post_type == 'post'))
$text .= '<div class="signature"><a href="https://mediaads.eu/"
target="_blank" title=$text>AAAAA</a></div>';
return $text;
}
?>
The only change I made was I added the text 'AAAAA' in your signature A element since it had no text content.
This code works on my site using the 2016 theme, so there is nothing wrong with the code itself. I would try a different theme to see whether the theme you are using is the source of the problem; I would try 2016, as it is known to work in my environment. For your code to run, the theme must call certain filters ('bloginfo' and 'the_content'), and the 'bloginfo' filter must be called with the parameter 'description'. Some themes may not do this. The 'the_content' filter was probably called but did not produce any visible content, as your A element had no text content.
UPDATED CODE - to move quote from tagline location to signature location
<?php
/*
Plugin Name: Random Text Quotes
Version: 1.0
Description: Random Text Quotes Albert Einstein.
Plugin URI: https://mediaads.eu/
Author: Helder Ventura
Author URI: https://mediaads.eu
Version: (standalone)
Usage: install activate and done
*/
function ab_arq_generate() {
$quotes = array(
'I am happy to be in Boston. I have heard of Boston
as one of the most famous cities in the world and the
center of education. I am happy to be here and expect
to enjoy my visit to this city and to Harvard.
On his visit to the city with Chaim Weizmann. New York
Times, May 17, 1921. Contributed by A. J. Kox in response
to the many quotations about Princeton in this book (see
later in this section).',
'*America is interesting, with all its hustle and bustle.
It is easier to feel enthusiasm for it than for other
countries I’ve unsettled with my presence. I had to
consent to being shown around like a prize ox to address
innumerable small and large gatherings. . . .
It’s a wonder I survived it all.
To Michele Besso, ca. May 21–30, 1921. CPAE, Vol. 12,
Doc. 141',
'*It is the women . . . who dominate all of American
life. The men are interested in nothing at all; they
work, work as I haven’t seen anyone work anywhere
else. For the rest, they are toy dogs for their
wives, who spend the money in the most excessive
fashion and who shroud themselves in a veil of
extravagance.
From an interview in the Nieuwe Rotterdamsche Courant,
July 4, 1921. Einstein insisted he was wrongly quoted and
wrote a rebuttal in the Vossische Zeitung six days later,
claiming he was shocked when he read the account. ',
'Even if Americans are less scholarly than Germans,
they do have more enthusiasm and energy, causing
a wider dissemination of new ideas among the
people.
Quoted in the New York Times, July 12, 1921
',
'A firm approach is indispensable everywhere in
America; otherwise one receives no payment and
little esteem.
To Maurice Solovine, January 14, 1922. Published in Letters
to Solovine, 49. Einstein Archives 21-157'
);
return $quotes[rand(0, count($quotes)-1)];
}
// Add Signature Image after single post
add_filter('the_content','add_signature', 1);
function add_signature($text) {
global $post;
if(($post->post_type == 'post'))
$text .= '<div class="signature"><a href="https://mediaads.eu/"
target="_blank" title=$text>' . ab_arq_generate() . '</a></div>';
return $text;
}
?>
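As background on why $quote from the question's plugin never reached the signature: in PHP, a variable defined at file level is not visible inside a function unless it is imported with global or passed in. The updated code above sidesteps this by calling ab_arq_generate() inside the callback. The sketch below shows the scope rule in plain PHP, without WordPress; signatureWithoutScope() and makeSignatureFilter() are hypothetical names used only for illustration.

```php
<?php
// File-level variable, like $quote in the question's plugin.
$quote = 'Share Your Love With The World.';

// Mirrors the question's callback: inside the function, $quote is a
// separate, undefined local variable, so nothing is appended.
function signatureWithoutScope(string $text): string
{
    $local = isset($quote) ? $quote : '(no quote in scope)';
    return $text . '<div class="signature">' . $local . '</div>';
}

// One way around it: capture the value in a closure with "use".
function makeSignatureFilter(string $quote): callable
{
    return function (string $text) use ($quote): string {
        return $text . '<div class="signature">' . $quote . '</div>';
    };
}

$filter = makeSignatureFilter($quote);

echo signatureWithoutScope('Post body'), PHP_EOL;
echo $filter('Post body'), PHP_EOL;
```

Since add_filter() accepts any callable, a closure like $filter could presumably be registered for 'the_content' in the same way as a named function.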
I tried to put HTML content inside JSON, and it broke.
My invalid JSON: http://i.imgur.com/8wfEikY.png
{
"item": {
"title": "Japanese investors back Lookup, a messaging app for local shopping in India",
"desc": "An infusion of US$116,000 from Japan's social games company DeNA and Teruhide Sato, founder of BEENOS, takes the three-month-old startup\u2019s seed funding to US$382,000.",
"link": "https:\/\/www.techinasia.com\/dena-teruhide-sato-beenos-fund-lookup\/",
"content": "<p><img src="https: \/\/www-techinasia.netdna-ssl.com\/wp-content\/uploads\/2015\/01\/lookup-app-main-720x289.jpg" alt="lookupappmain" width="720" height="289" class="aligncentersize-largewp-image-213938" \/><\/p>\n<p>Bangalore-based instant messaging app <a href="https: \/\/www.techinasia.com\/tag\/lookup\/">Lookup<\/a> – a Craiglist cum WhatsApp for local businesses – just got its third dose of seed funding. Japan’s leading social games company <a href="https: \/\/www.techinasia.com\/tag\/dena\/">DeNA<\/a> and Teruhide Sato, founder of BEENOS group, a global conglomerate with ecommerce holdings and a business incubator, invested US$116,000 into this three-month-old startup founded by Deepak Ravindran, a young serial entrepreneur.<\/p>\n<p>“Both our recent investors have strong footholds in the mobile space and have successfully led innovations in Japan,” says Ravindran, suggesting that the investors would be giving Lookup more than just funding.<\/p>\n<p><a href="http: \/\/www.lookup.to">Lookup<\/a> lists businesses, restaurants, and even police stations for users to connect with. Unlike Craigslist or JustDial which would give you a number to dial, Lookup lets you shoot off a message to the local businesses without leaving the app. You can find prices and availability of products or services at local businesses, book appointments at salons, or make reservations at restaurants with this app. Any store or restaurant using Lookup can then respond instantly.<\/p>\n<p>Lookup has a call center tracking the messages to ensure that its users receive responses immediately, even if a store is not using the app. “Our guarantee is that you get answers within five minutes. We do this by employing dedicated people for handling your request. Lookup’s call center fields your responses, calls up stores, and types answers back to you in real-time. 
No calling, no waiting,” Ravindran told <em>Tech in Asia<\/em>.<\/p>\n<p>To celebrate the latest funding from Japanese investors, Lookup is gifting free sushi for a week to new users from Bangalore who download the app. For this, it has tied up with two Japanese restaurants Shiro and Ginseng.<\/p>\n<p>With this latest infusion, Lookup’s seed round of venture capital funding closed at US$382,000. It had earlier bagged US$166,000 from tech billionaire Kris Gopalakrishnan, co-founder of Indian IT bellwether Infosys, and US$100,000 from MKS Switzerland SA, a precious metals and financial services group of companies.<\/p>\n<p><center><strong>See: <a href="https: \/\/www.techinasia.com\/college-dropout-turned-mit-top-innovator-rolls-craigslist-whatsapp-app-local-shopping-india\/">College dropout turned MIT top innovator rolls Craigslist and WhatsApp into one app for local shopping in India<\/a><\/strong><\/center><\/p>\n<p>This post <a href="https: \/\/www.techinasia.com\/dena-teruhide-sato-beenos-fund-lookup\/" title="JapaneseinvestorsbackLookup,
amessagingappforlocalshoppinginIndia">Japanese investors back Lookup, a messaging app for local shopping in India<\/a> appeared first on Tech in Asia.<\/p>"
}
}
What I did in PHP
$arr = array();
$arr["item"]["content"] = $content; // $content is dynamic, scrapped from somewhere
echo json_encode($arr, true);
I tried htmlentities and addcslashes($item_content,'"'), but none of that worked.
It's because of the " sign in the image tag. You could use the HTML entities function to encode it and the decode function to decode it.
A neater way to do it is to save the image url in a different property of your item.
You haven't escaped the quotation marks ("") in your content - this means that your content string is only "<p><img src=" and then PHP is confused as to what the rest of this stuff is.
You need to change it to be like this:
"content": "<p><img src=\"https: \/\/www-techinasia.netdna-ssl.com\/wp-content\/uploads\/2015\/01\/lookup-app-main-720x289.jpg\" alt=\"loo...More content..."
(I've added \ before the quotation marks that don't end the string - in future - look for the syntax highlighting - if things change colour without you expecting the end of a variable - then something has gone wrong)
If you'd like to do this with PHP - you can use the HTML entities function (http://php.net/htmlentities) or simply the addslashes function (http://php.net/manual/en/function.addslashes.php)
E.g.
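<?php
$str = "A 'quote' is <b>bold</b>";
// Outputs: A 'quote' is &lt;b&gt;bold&lt;/b&gt;
// (before PHP 8.1; since 8.1 the default flags also convert single quotes)
echo htmlentities($str);
// Outputs: A &#039;quote&#039; is &lt;b&gt;bold&lt;/b&gt;
echo htmlentities($str, ENT_QUOTES);
?>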
[Cite: PHP Manual]
<?php $str = "Is your name O'Reilly?";
// Outputs: Is your name O\'Reilly?
echo addslashes($str); ?>
[Cite: PHP Manual]
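It is also worth checking what json_encode() does by itself, since it escapes double quotes inside string values without any help. The sketch below is a minimal illustration; the $content value is a made-up sample standing in for the scraped HTML.

```php
<?php
// Made-up sample with double quotes, standing in for the scraped $content.
$content = '<p><img src="https://example.com/a.jpg" alt="demo" /></p>';

$arr = array();
$arr['item']['content'] = $content;

// json_encode() escapes the inner double quotes (and, by default, the
// forward slashes), so the result is valid JSON.
$json = json_encode($arr);
echo $json, PHP_EOL;

// Decoding gives back the original HTML unchanged.
$decoded = json_decode($json, true);
echo $decoded['item']['content'], PHP_EOL;
```

Note that the second argument of json_encode() is a bitmask of JSON_* flags rather than a boolean, so passing true there is probably not what was intended. If the output is still broken, the cause may lie in how the JSON is printed or post-processed rather than in the encoding itself.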
I have some strings from our accounting system I need to process. The accounting system only gives the option to put in the postal code and city in one input field. The data is later exported through xml and imported in a php system.
I'm looking for a way to extract the postal code from the city, however these come in various formats so a simple substr(); is not working
Some examples of the values I need to process are:
1234 ZC ALPHEN AAN DEN RIJN
1234SG UTRECHT
33602 BIELEFELD
W7 3QB LONDON
How do I split the postal code from city for each of these? I already contacted the manufacturer of the accounting system, and they understood my problem and will look into splitting the values in 2 for future calls, but that will take some time.
It's not in keeping with Google's Terms and Conditions unless you're storing this data to be displayed on a Google map, but it is awfully tempting to harness their power because they're just so good at this stuff.
The Geocoding API will be able to handle pretty much any address/postcode combination and variation you can throw at it - with or without spaces, postcode first or last, etc. etc., including different place names ("London", "Londres").
A request to
http://maps.googleapis.com/maps/api/geocode/json?address=2408%20ZC%20ALPHEN%20AAN%20DEN%20RIJN&sensor=false
returns a JSON stream containing, among other things:
"address_components" : [
{
"long_name" : "2408 ZB",
"short_name" : "2408 ZB",
"types" : [ "postal_code" ]
},
{
"long_name" : "Alphen aan den Rijn",
"short_name" : "Alphen aan den Rijn",
"types" : [ "locality", "political" ]
},
...
This page outlines the requirements and limitations for using the service.
Note that the Google API will guess stuff if the data is slightly wrong. Your initial example of 1234 ZC isn't correct and the API will interpolate in an attempt to give you something you can work with. Make sure you explore how the API reacts to incorrect data, and be careful not to shoot yourself in the foot with the results.
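Once the response is decoded, picking out the postal code and city is a matter of looking at each component's types. The sketch below works on a trimmed sample built from the excerpt above and makes no HTTP request; findAddressComponent() is a hypothetical helper, and the outer "results" wrapper reflects how the Geocoding API nests address_components.

```php
<?php
// Trimmed sample response based on the excerpt above; a real response
// carries more fields.
$json = <<<'JSON'
{
  "results": [
    {
      "address_components": [
        { "long_name": "2408 ZB", "short_name": "2408 ZB", "types": ["postal_code"] },
        { "long_name": "Alphen aan den Rijn", "short_name": "Alphen aan den Rijn", "types": ["locality", "political"] }
      ]
    }
  ]
}
JSON;

// Hypothetical helper: return the long_name of the first component that
// carries the given type, or null when there is none.
function findAddressComponent(array $components, string $type): ?string
{
    foreach ($components as $component) {
        if (in_array($type, $component['types'], true)) {
            return $component['long_name'];
        }
    }
    return null;
}

$data = json_decode($json, true);
$components = $data['results'][0]['address_components'];

$postalCode = findAddressComponent($components, 'postal_code');
$city = findAddressComponent($components, 'locality');

echo $postalCode, ' / ', $city, PHP_EOL;
```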
If you know the country at the time you are attempting to split the postal code off from the city you could use that to look up a regular expression (or similar piece of data) that corresponds to the correct way to parse out the postal code.
For example, you might map countries to regexes in an array (these regular expressions are just samples, not rigorously tested):
$regexMap = array(
'US' => '/(\d{5}|\d{5}-\d{4}|\d{9})\s+(.*)/',
'UK' => '/([\d\w]{2,4}\s+\d\w{2})\s+(.*)/',
...
);
$regularExpression = $regexMap[$country];
preg_match($regularExpression, $incomingPostalCodeAndCity, $postalData);
// $postalData[0] holds the whole match; the subpatterns start at index 1.
$postalCode = $postalData[1];
$city = $postalData[2];
While you probably can combine regular expressions for some (many?) countries, postal codes vary enough that you'll probably still need a fairly lengthy list of regexes.
Each regex should be designed to return the postal code as the first subpattern and the city as the second subpattern.
There is some related information in the answers to this question: What is the ultimate postal code and zip regex? (including some lists of postal code regular expressions for various countries).
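Applied to the four sample values from the question, the per-country approach could look like the sketch below. The patterns are assumptions written for these examples only and are not complete national postal code rules; splitPostalCode() and $postalPatterns are hypothetical names.

```php
<?php
// Sample patterns for the formats shown in the question. Each one returns
// the postal code as subpattern 1 and the city as subpattern 2.
$postalPatterns = array(
    'NL' => '/^(\d{4}\s?[A-Z]{2})\s+(.+)$/',                      // 1234 ZC or 1234SG
    'DE' => '/^(\d{5})\s+(.+)$/',                                 // 33602
    'UK' => '/^([A-Z]{1,2}\d[A-Z\d]?\s*\d[A-Z]{2})\s+(.+)$/',     // W7 3QB
);

// Hypothetical helper: split "postal code + city" for a known country.
// Returns null when the country is unknown or the value does not match.
function splitPostalCode(string $value, string $country, array $patterns): ?array
{
    if (!isset($patterns[$country])) {
        return null;
    }
    if (preg_match($patterns[$country], trim($value), $m) !== 1) {
        return null;
    }
    return array('postal_code' => $m[1], 'city' => $m[2]);
}

$samples = array(
    array('1234 ZC ALPHEN AAN DEN RIJN', 'NL'),
    array('1234SG UTRECHT', 'NL'),
    array('33602 BIELEFELD', 'DE'),
    array('W7 3QB LONDON', 'UK'),
);

foreach ($samples as $sample) {
    $parts = splitPostalCode($sample[0], $sample[1], $postalPatterns);
    echo $parts['postal_code'], ' | ', $parts['city'], PHP_EOL;
}
```

Returning null for values that do not match makes it easy to collect the records that need manual review instead of guessing at a split.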
I want to create coupon codes that users can remember easily. My idea is something like:
squirrel45
nantucket23
That is, a real word chosen randomly from a long dictionary list (preferably compiled for this purpose) combined with 2 random digits. My questions are:
Where can I find such a dictionary list?
Do you see any problems with the system? (security is not ultra important here, just something reasonable is fine)
Can you suggest any good improvements or alternatives?
Fwiw I am not crazy about the Markov word generators because I think their idiosyncrasies would be too hard to remember. I'd like a client to be able to keep the code in his head, and tell it to the merchant when he arrives to redeem it.
Thanks,
Jonah
Word lists are easy to find. Make sure you sanity filter them for foul words ;)
Here's a huge word list that can be easily scrubbed:
http://www.scrabble-assoc.com/boards/dictionary/10-15-20030401.txt
From there you can easily load in words into your database and create your coupon code like so:
$coupon_code = $rand_word . rand(20,99);
After you do this, simply store your coupon code in the database and whenever you make a new code, check it against existing codes before you apply it. Even slim odds are possible odds.
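The generate-and-check loop described above can be sketched as follows. This is a minimal illustration under stated assumptions: generateCouponCode() is a hypothetical helper, $words stands in for the filtered word list, and $existing stands in for the codes already stored in the database.

```php
<?php
// Hypothetical helper: build "word + two digits" codes and retry until
// one is found that is not already taken. Returns null if no free code
// turns up within $maxTries attempts.
function generateCouponCode(array $words, array $existing, int $maxTries = 100): ?string
{
    for ($i = 0; $i < $maxTries; $i++) {
        $word = $words[random_int(0, count($words) - 1)];
        // 20-99 as in the snippet above, so the suffix is always two digits.
        $code = strtolower($word) . random_int(20, 99);
        if (!in_array($code, $existing, true)) {
            return $code;
        }
    }
    return null;
}

$words = array('POACH', 'PIVOT', 'PLUME');
$existing = array('poach72');

$code = generateCouponCode($words, $existing);
echo $code, PHP_EOL;
```

With a real database, a unique index on the code column would still be a sensible backstop, since two requests could pick the same code at the same moment.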
More word lists in various formats:
http://scrabble.wonderhowto.com/blog/ultimate-scrabble-word-list-resource-0115617/
5-letter words:
http://homepage.ntlworld.com/adam.bozon/Scrabble5.htm
6-letter words:
http://homepage.ntlworld.com/adam.bozon/Scrabble6.htm
7-letter words:
http://homepage.ntlworld.com/adam.bozon/Scrabble7.htm
8-letter words:
http://homepage.ntlworld.com/adam.bozon/Scrabble8.htm
Sample:
PIKES PIKIS PILAF PILAR PILAU PILAW PILEA PILED PILEI PILES PILIS
PILLS PILOT PILUS PIMAS PIMPS PINAS PINCH PINED PINES PINEY PINGO
PINGS PINKO PINKS PINKY PINNA PINNY PINON PINOT PINTA PINTO PINTS
PINUP PIONS PIOUS PIPAL PIPED PIPER PIPES PIPET PIPIT PIQUE PIRNS
PIROG PISCO PISOS PISTE PITAS PITCH PITHS PITHY PITON PIVOT PIXEL
PIXES PIXIE PIZZA PLACE PLACK PLAGE PLAID PLAIN PLAIT PLANE PLANK
PLANS PLANT PLASH PLASM PLATE PLATS PLATY PLAYA PLAYS PLAZA PLEAD
PLEAS PLEAT PLEBE PLEBS PLENA PLEWS PLICA PLIED PLIER PLIES PLINK
PLODS PLONK PLOPS PLOTS PLOTZ PLOWS PLOYS PLUCK PLUGS PLUMB PLUME
PLUMP PLUMS PLUMY PLUNK PLUSH PLYER POACH POCKS POCKY PODGY PODIA
POEMS POESY POETS POGEY POILU POIND POINT POISE POKED POKER POKES
With that you could generate a coupon code POACH72
Concatenating 2 words will increase the security posture of your system.
e.g. squirrel.nantucket.123
The Diceware page has a couple of long word lists, American and International. It also has a useful description of how to meet various levels of security.
I'm looking for a good example of using Regular Expressions in PHP to "reverse engineer" a form letter (with a known format, of course) that has been pasted into a multiline textbox and sent to a script for processing.
So, for example, let's assume this is the original plain-text input (taken from a USDA press release):
WASHINGTON, April 5, 2010 - North
American Bison Co-Op, a New Rockford,
N.D., establishment is recalling
approximately 25,000 pounds of whole
beef heads containing tongues that may
not have had the tonsils completely
removed, which is not compliant with
regulations that require the removal
of tonsils from cattle of all ages,
the U.S. Department of Agriculture's
Food Safety and Inspection Service
(FSIS) announced today.
For clarity, the fields that are variables are highlighted below:
[pr_city=]WASHINGTON, [pr_date=]April 5, 2010 - [corp_name=]North
American Bison Co-Op, a [corp_city=]New Rockford,
[corp_state=]N.D., establishment is recalling
approximately [amount=]25,000 pounds of [product=]whole
beef heads containing tongues that may
not have had the tonsils completely
removed, which is not compliant with
regulations that require [reason=]the removal
of tonsils from cattle of all ages,
the U.S. Department of Agriculture's
Food Safety and Inspection Service
(FSIS) announced today.
How could I efficiently extract the contents of the
pr_city
pr_date
corp_name
corp_city
corp_state
amount
product
reason
fields from my example?
Any help would be appreciated, thanks.
Well, a regex that works on your example could look like this (line breaks introduced to keep this beast legible, need to be removed prior to use):
/^(?P<pr_city>[^,]+), (?P<pr_date>[^-]+) - (?P<corp_name>.*?), a
(?P<corp_city>[^,]+), (?P<corp_state>[^,]+), establishment is
recalling approximately (?P<amount>.*?) of (?P<product>.*?),
which is not compliant with regulations that require (?P<reason>.*?),
the U\.S\. Department of Agriculture\'s Food Safety and Inspection
Service \(FSIS\) announced today\.$/
So, in PHP you could do
if (preg_match('/^(?P<pr_city>[^,]+), (?P<pr_date>[^-]+) - (?P<corp_name>.*?), a (?P<corp_city>[^,]+), (?P<corp_state>[^,]+), establishment is recalling approximately (?P<amount>.*?) of (?P<product>.*?), which is not compliant with regulations that require (?P<reason>.*?), the U\.S\. Department of Agriculture\'s Food Safety and Inspection Service \(FSIS\) announced today\.$/', $subject, $regs)) {
$prcity = $regs['pr_city'];
$prdate = $regs['pr_date'];
... etc.
} else {
$result = "";
}
This assumes a couple of things, for instance that there are no line breaks, and that the input is the entire string (and not a larger string from which this part has to be extracted from). I've tried to make assumptions about legal values that make some sense, but there is the very real chance that other inputs could break this. So some more test cases are probably needed.
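Since the pasted text will usually contain line breaks, one way to satisfy the first assumption is to collapse all whitespace runs into single spaces before matching. The sketch below does that and then applies the regex from this answer to the sample from the question; parseRecallNotice() is a hypothetical helper name.

```php
<?php
// Hypothetical helper: normalise whitespace, then apply the named-group
// regex from the answer. Returns the named fields, or null on no match.
function parseRecallNotice(string $text): ?array
{
    // Collapse line breaks and repeated spaces so the pattern, which
    // expects single spaces, can match text pasted from a textarea.
    $subject = trim(preg_replace('/\s+/', ' ', $text));

    $pattern = '/^(?P<pr_city>[^,]+), (?P<pr_date>[^-]+) - (?P<corp_name>.*?), a '
        . '(?P<corp_city>[^,]+), (?P<corp_state>[^,]+), establishment is '
        . 'recalling approximately (?P<amount>.*?) of (?P<product>.*?), '
        . 'which is not compliant with regulations that require (?P<reason>.*?), '
        . 'the U\.S\. Department of Agriculture\'s Food Safety and Inspection '
        . 'Service \(FSIS\) announced today\.$/';

    if (preg_match($pattern, $subject, $regs) !== 1) {
        return null;
    }

    // Keep only the named groups; drop the numeric duplicates.
    return array_filter($regs, 'is_string', ARRAY_FILTER_USE_KEY);
}

$input = <<<'TEXT'
WASHINGTON, April 5, 2010 - North
American Bison Co-Op, a New Rockford,
N.D., establishment is recalling
approximately 25,000 pounds of whole
beef heads containing tongues that may
not have had the tonsils completely
removed, which is not compliant with
regulations that require the removal
of tonsils from cattle of all ages,
the U.S. Department of Agriculture's
Food Safety and Inspection Service
(FSIS) announced today.
TEXT;

$fields = parseRecallNotice($input);
foreach ($fields as $name => $value) {
    echo $name, ': ', $value, PHP_EOL;
}
```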
If the surrounding text is constant, then something like this partial regex could do the trick:
preg_match('/^(.*?), (.*?) - (.*?), a (.*?), (.*?), establishment is recalling approximately (.*?), which is not compliant with regulations that require (.*?), the U\.S\. Department of Agriculture\'s Food Safety and Inspection Service \(FSIS\) announced today\./', $text, $matches);
// $matches[1] is 'WASHINGTON'
// $matches[2] is 'April 5, 2010'
// $matches[3] is ... etc...
If the surrounding text changes, then you're going to end up with a ton of false matches, no matches, etc... Essentially you'd need an AI to parse/understand PR releases.
Edit: Please disregard this crazy answer, as the other two are better. I should probably delete it, but I'm keeping it up for reference.
I have a crazy idea that just might work: build an XML string from the input by adding markups, then parse it. It might look something like this (completely untested) code:
preg_replace('([^,]*), ([^-]*)- ...etc...', '<pr_city>\1</pr_city><pr_date>\2</pr_date> ...etc...');
Parsing the XML afterwards is a needlessly complicated process that is best left to the PHP documentation: http://www.php.net/manual/en/function.xml-parse.php .
You could also consider converting it to JSON with this method, then using json_decode() to parse it. In any case, you have to think about what happens when " marks and > symbols appear in the input.
It might be easier to just match and remove one piece of the text at a time.