I'm new to PHP and am doing this project to teach myself a bit.
I'm importing XML data from a Reuters RSS feed and would like to sort the content of all the responses alphabetically. I've had no problem loading the information I want onto the page using a foreach loop. However, the sorting code I'm using alphabetizes the words in each XML title individually, rather than sorting the words of all the titles together as one string.
How can I group or save all the responses together in order to sort them as a whole once they've been collected by the foreach loop?
Here's what I have so far:
<?php
function getFeed($feed_url) {
$content = file_get_contents($feed_url);
$x = new SimpleXmlElement($content);
$string = $x->channel->item ;
echo "<p>";
foreach($x->channel->item as $entry) {
$string = $entry->title;
$split=explode(" ", $string);
sort($split); // sorts the elements
echo implode(" ", $split); //combine and print the elements
}
echo "</p>";
}?>
You want to collect the words from every title first and sort them once at the end, after the loop.
<?php
function getFeed($feed_url) {
$content = file_get_contents($feed_url);
$x = new SimpleXmlElement($content);
$titles = "";
foreach($x->channel->item as $entry) {
$titles .= " $entry->title";
}
$split = explode(" ", $titles);
sort($split, SORT_FLAG_CASE | SORT_NATURAL);
return trim(implode(" ", $split));
}
$words = getFeed("http://feeds.reuters.com/Reuters/PoliticsNews");
echo "<p>$words</p>";
I didn't remove non-word characters, so things like quotes will mess with the sorting.
Output:
<p>'sanctuary' abuse after announcement, areas as battle California, cards, case crises cut detention drill Egyptian Egyptian-American Exxon financial Florida for freed from funding future give green greets healthcare Homeland hope in in in in Indian looms not of officials orders orders other permission plan possible prevent probe probes racial reboot reform resigns revamped review review review rule rules Russia Security see seeks senator sets slur state summons tax tax testify threatens to to to to Top Trump Trump Trump Trump Trump Trump-Russia Twitter U.S. U.S. U.S. U.S. Uphill visa-holders Waiting Washington will</p>
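To deal with the punctuation caveat mentioned above, one option is to trim quote and punctuation characters from each word before sorting. This is only a sketch: the helper name sortTitleWords() and the set of trimmed characters are assumptions, not part of the answer above, and it takes an array of title strings rather than fetching the feed itself.

```php
<?php
// Hypothetical helper (not from the answer above): collects the words of all
// titles, strips surrounding punctuation, and sorts them case-insensitively.
function sortTitleWords(array $titles): array
{
    $words = [];
    foreach ($titles as $title) {
        // Split on any run of whitespace and drop empty pieces.
        foreach (preg_split('/\s+/', (string) $title, -1, PREG_SPLIT_NO_EMPTY) as $word) {
            // Assumed punctuation set; periods are kept so "U.S." survives.
            $word = trim($word, "'\",:;!?()");
            if ($word !== '') {
                $words[] = $word;
            }
        }
    }
    sort($words, SORT_FLAG_CASE | SORT_STRING);
    return $words;
}

echo implode(' ', sortTitleWords(["'sanctuary' cities, Trump", 'Exxon probe']));
```

With the two sample titles this prints cities Exxon probe sanctuary Trump. In getFeed() the titles could be gathered with (string) $entry->title inside the foreach loop and passed to the helper afterwards.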
Consider saving the titles to an array, sorting them, and then iterating over the sorted array to echo the output:
$content = file_get_contents("http://feeds.reuters.com/Reuters/PoliticsNews");
$x = new SimpleXmlElement($content);
$titles = [];
foreach($x->channel->item as $entry) {
$titles[] = (string) $entry->title; // cast so the array holds plain strings, not SimpleXMLElement objects
}
sort($titles, SORT_NATURAL | SORT_FLAG_CASE); # CASE INSENSITIVE SORT
foreach($titles as $t) {
echo "<p>". $t ."</p>";
}
# <p>Ex-Illinois Governor Blagojevich's 14-year prison term upheld</p>
# <p>Exxon probe is unconstitutional, Republican prosecutors say</p>
# <p>Group sues Trump for repealing U.S. wildlife rule in rare legal challenge</p>
# <p>Indian techies, IT firms fret as Trump orders U.S. visa review</p>
# <p>Trump, Republicans face tricky task of averting U.S. government shutdown</p>
# <p>Trump administration may change rules that allow terror victims to immigrate to U.S.</p>
# <p>U.S. House committee sets more hearings in Trump-Russia probe</p>
# <p>U.S. judicial panel finds Texas hurt Latino vote with redrawn boundaries</p>
# <p>U.S. retailers bet on Congress over Bolivia to thwart Trump border tax issue</p>
# <p>U.S. Treasury's Mnuchin: Trump to order reviews of financial rules</p>
I'm trying to create a basic concordance script that will print the ten words before and after the value found inside an array. I did this by splitting the text into an array, identifying the position of the value, and then printing -10 and +10 with the searched value in the middle. However, this only presents the first such occurrence. I know I can find the others by using array_keys (found in positions 52, 78, 80), but I'm not quite sure how to cycle through the matches, since array_keys also results in an array. Thus, using $matches (with array_keys) in place of $location below doesn't work, since you cannot use the same operands on an array as an integer. Any suggestions? Thank you!!
<?php
$text = <<<EOD
The spread of a deadly new virus is accelerating, Chinese President Xi Jinping warned, after holding a special government meeting on the Lunar New Year public holiday.
The country is facing a "grave situation" Mr Xi told senior officials.
The coronavirus has killed at least 42 people and infected some 1,400 since its discovery in the city of Wuhan.
Meanwhile, UK-based researchers have warned of a real possibility that China will not be able to contain the virus.
Travel restrictions have come in place in several affected cities. From Sunday, private vehicles will be banned from central districts of Wuhan, the source of the outbreak.
EOD;
$new = explode(" ", $text);
$location = array_search("in", $new, FALSE);
$concordance = 10;
$top_range = $location + $concordance;
$bottom_range = $location - $concordance;
while($bottom_range <= $top_range) {
echo $new[$bottom_range] . " ";
$bottom_range++;
}
?>
You can just iterate over the values returned by array_keys, using array_slice to extract the $concordance words either side of the location and implode to put the sentence back together again:
$words = explode(' ', $text);
$concordance = 10;
$results = array();
foreach (array_keys($words, 'in') as $idx) {
$results[] = implode(' ', array_slice($words, max($idx - $concordance, 0), $concordance * 2 + 1));
}
print_r($results);
Output:
Array
(
[0] => least 42 people and infected some 1,400 since its discovery in the city of Wuhan.
Meanwhile, UK-based researchers have warned of a
[1] => not be able to contain the virus.
Travel restrictions have come in place in several affected cities. From Sunday, private vehicles will
[2] => able to contain the virus.
Travel restrictions have come in place in several affected cities. From Sunday, private vehicles will be banned
)
If you want to avoid generating similar phrases where a word occurs twice within $concordance words (e.g. indexes 1 and 2 in the above array), you can maintain a position for the end of the last match, and skip occurrences that occur in that match:
$words = explode(' ', $text);
$concordance = 10;
$results = array();
$last = 0;
foreach (array_keys($words, 'in') as $idx) {
if ($idx < $last) continue;
$results[] = implode(' ', array_slice($words, max($idx - $concordance, 0), $concordance * 2 + 1));
$last = $idx + $concordance;
}
print_r($results);
Output:
Array
(
[0] => least 42 people and infected some 1,400 since its discovery in the city of Wuhan.
Meanwhile, UK-based researchers have warned of a
[1] => not be able to contain the virus.
Travel restrictions have come in place in several affected cities. From Sunday, private vehicles will
)
Demo on 3v4l.org
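Because the text is split with explode(' ', ...), line breaks stay attached to the neighbouring words (visible in the output above as "Wuhan." followed by "Meanwhile," on a new line), and only the exact token 'in' is matched. A possible variation, sketched below, splits on any whitespace and compares case-insensitively while ignoring surrounding punctuation. The function name concordance(), its parameters, and the punctuation set are assumptions made for illustration.

```php
<?php
// Hypothetical helper: returns one context string per occurrence of $term,
// with up to $width words on either side.
function concordance(string $text, string $term, int $width = 10): array
{
    // Split on any whitespace (spaces, tabs, newlines).
    $words = preg_split('/\s+/', trim($text));
    $results = [];
    foreach ($words as $idx => $word) {
        // Ignore surrounding punctuation and letter case when comparing.
        if (strcasecmp(trim($word, '.,;:!?"\'()'), $term) !== 0) {
            continue;
        }
        $start = max($idx - $width, 0);
        // Keep $width words after the match even when the start was clamped to 0;
        // array_slice() simply stops at the end of the array.
        $length = $idx + $width - $start + 1;
        $results[] = implode(' ', array_slice($words, $start, $length));
    }
    return $results;
}

print_r(concordance("a b in c d In. e", 'in', 1));
```

Each entry keeps the original words untouched; only the comparison is normalised.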
Try this:
<?php
$text = <<<EOD
The spread of a deadly new virus is accelerating, Chinese President Xi Jinping warned, after holding a special government meeting on the Lunar New Year public holiday.
The country is facing a "grave situation" Mr Xi told senior officials.
The coronavirus has killed at least 42 people and infected some 1,400 since its discovery in the city of Wuhan.
Meanwhile, UK-based researchers have warned of a real possibility that China will not be able to contain the virus.
Travel restrictions have come in place in several affected cities. From Sunday, private vehicles will be banned from central districts of Wuhan, the source of the outbreak.
EOD;
$words = explode(" ", $text);
$concordance = 10; // range -+
$result = []; // result array
$index = 0;
if (count($words) === 0) // be sure there is no empty array
exit;
do {
$location = array_search("in", $words, false);
if ($location === false) // stop when "in" is no longer found (index 0 is a valid match, so compare against false)
break;
$count = count($words);
// clamp the range to valid array indexes
$minRange = max($location - $concordance, 0); // an index can't be less than 0
$maxRange = min($location + $concordance, $count - 1); // an index can't be higher than count - 1
for ($range = $minRange; $range <= $maxRange; $range++) { // <= so the last word of the window is included
$result[$index][] = $words[$range]; // group words by index
}
unset($words[$location]); // delete the matched "in" so the next search finds the following occurrence
$words = array_values($words); // reindex array
$index++;
} while ($location !== false); // repeat while a match was found
print_r($result); // <--- here's your results
?>
I'm not sure if this is possible. I have a list with specific words on new lines, and I need to select the words between those lines. For example my source is:
Word_1
Word_2
Location
Variable_1
Variable_2
Variable_3
Section
Word_9
I need regex to find the line after Location, which I am doing using (.*(?<=\bLocation\s)(\w+).*)|.* and replacing with $1. However this only gives me Variable_1 and I need it to give me Variable_1,Variable_2,Variable_3. And here's the catch, sometimes there is one Variable, sometimes 2, sometimes 3, sometimes 4. BUT, the following word will always be Section. So I'm thinking I need basically some way to tell Regex to select every line after Location but before Section.
Any ideas?
Real world example:
Category
Business
Dates
StatusOpen
Closing Information
Location
National
South-East Asia
New South Wales
Victoria
Sections
General
Difficulty Rating
Administrator
Output would be National,South-East Asia,New South Wales,Victoria
In Python you can use the re.DOTALL flag so that . also matches newlines:
import re
print(re.findall("Location(.*)Section", string, re.DOTALL)[0])  # string holds the source text
Output:
Variable_1
Variable_2
Variable_3
For PHP, you can try the pattern below.
'/\w*Location(.*)Section/s'
You can check the output in this link for your example.
Regex101 link
Output match:
National
South-East Asia
New South Wales
Victoria
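To get from that match to the comma-separated result the question asks for, the captured block can be split into lines and joined again. The sketch below is one way to do it; extractBetween() is a hypothetical helper name, and it assumes the start marker sits on a line of its own, that the end marker begins a later line, and that at least one line lies between them.

```php
<?php
// Hypothetical helper: returns the lines between a line starting with $start
// and the next line starting with $end, joined with commas.
function extractBetween(string $text, string $start, string $end): string
{
    // m: ^ matches at line starts; s: . also matches newlines; \R: any line break.
    $pattern = '/^' . preg_quote($start, '/') . '\R(.*?)\R' . preg_quote($end, '/') . '/ms';
    if (!preg_match($pattern, $text, $m)) {
        return '';
    }
    $lines = preg_split('/\R/', trim($m[1]));
    return implode(',', array_map('trim', $lines));
}

$source = "Closing Information\nLocation\nNational\nSouth-East Asia\nNew South Wales\nVictoria\nSections\nGeneral";
echo extractBetween($source, 'Location', 'Section');
```

For the sample above this prints National,South-East Asia,New South Wales,Victoria. The lazy (.*?) stops at the first "Section" line, so a later occurrence of the word does not extend the match.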
A solution with PHP would be:
<?php
$string =
"Category
Business
Dates
StatusOpen
Closing Information
Location
National
South-East Asia
New South Wales
Victoria
Sections
General
Difficulty Rating
Administrator";
$initialValue = false;
$lastValue = false;
$arResult = [];
$arValue = explode("\n", $string);
foreach($arValue as $value) {
$value = trim($value);
if ($value == "Location") {
$initialValue = true;
} else if ($value == "Sections") {
$lastValue = true;
} else if ($initialValue == true && $lastValue == false) {
$arResult[] = $value;
}
}
echo implode(",",$arResult); // National,South-East Asia,New South Wales,Victoria
I have a JSON string, but when I run it through json_decode() the result is blank.
$str = '[{"actcode":"Auck4","actname":"Sky Tower","date":"","time":"","timeduration":"","adult":"0","adultprice":"28","child":"0","childprice":"0","description":"Discover the best of Auckland in half a day. Soak up spectacular sights on this scenic tour, from heritage-listed buildings on Queen Street to the stunning Viaduct Harbour and panoramic vistas from the Sky Tower observation deck.
Start your tour with a hotel pick-up and travel through Auckland?s dynamic Central Business District. Travel across the iconic Auckland Harbour Bridge and admire stunning city views. Then, return to the city centre and visit the vibrant precinct of Wynyard Quarter. Here, wander among the sculptures and enjoy the happenings on the water of Viaduct Harbour.
Continue to Queen Street, also known as the ?Golden Mile? of Aucklands business and shopping district. Marvel at historic buildings like the Ferry Terminal building before visiting the Auckland Museum. Here, explore fascinating exhibits paying tribute to New Zealands natural, Maori and European histories. Afterwards, travel along Aucklands most expensive residential streets with fantastic views of the Waitemata Harbour and its islands.
Your tour ends at Sky Tower, the tallest free-standing structure in the Southern Hemisphere. Take in breathtaking 360-degree views of the city and its surroundings. In the afternoon, continue your own exploration of Auckland."}]';
I tried the code below:
$array = json_decode($str,true);
echo print_r($array);
and this one too:
$str1 = trim($str);
$array = json_decode($str1,true);
echo print_r($array);
but the result is still blank.
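Try this one. Raw line breaks are not allowed inside a JSON string, so replace them before decoding (mysql_real_escape_string() is not needed for this, and it no longer exists as of PHP 7):
$findsym = array("\r", "\n"); // double quotes, so these are the real CR and LF characters
$strdone = str_replace($findsym, " ", $str); // a space keeps the neighbouring words apart
$jsonarray = json_decode($strdone, true);
echo "<pre>"; print_r($jsonarray);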
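If the line breaks should be preserved rather than dropped, they can be escaped instead, and json_last_error_msg() can report why decoding failed. The following is a minimal sketch: decodeLenient() is a hypothetical name, and the repair step assumes the only raw line breaks are inside string values, as in the question.

```php
<?php
// Hypothetical helper: decodes JSON, retrying once with raw line breaks
// escaped if the first attempt fails.
function decodeLenient(string $json)
{
    $data = json_decode($json, true);
    if (json_last_error() === JSON_ERROR_NONE) {
        return $data; // already valid JSON, nothing to repair
    }
    // Turn real CR/LF characters into the two-character JSON escape \n.
    // Assumption: the line breaks occur only inside string values.
    $clean = str_replace(["\r\n", "\r", "\n"], '\n', $json);
    $data = json_decode($clean, true);
    if (json_last_error() !== JSON_ERROR_NONE) {
        throw new RuntimeException('json_decode failed: ' . json_last_error_msg());
    }
    return $data;
}

$decoded = decodeLenient("[{\"description\":\"line one\nline two\"}]");
echo $decoded[0]['description'];
```

The first attempt is kept so that valid, pretty-printed JSON (which has line breaks between tokens) is never touched by the replacement.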
I have the following output, from which I would like to extract the substring that starts at the word "industry" (in any letter case) and runs up to the next delimiter, typically "|". I get $output from an API, so its contents are always different, but the general shape is something like: blah blah blah |industry = industry info| blah blah blah. If the word industry exists in the output, I would just like to get industry = industry info. Is there a generic regex which can do this? The specific output I have returned is:
<?php
$output = '{{other uses|UBS (disambiguation)}} {{Use dmy dates|date=April
2015}} {{Infobox company |name = UBS Group AG |logo = [[File:UBS
Logo.svg|200px|UBS Group AG Logo]] |type = [[Aktiengesellschaft]]
([[Aktiengesellschaft|AG]])
[[Public company]] |traded_as = {{SWX|UBSG}} {{SWX|UBSN}}
{{nyse|UBS}} |foundation=1854 |predecessor = [[Union Bank of
Switzerland]] and [[Swiss Bank Corporation]] merged in 1998;
[[PaineWebber]] merged in 2000 |location = [[Zürich]]
[[Basel]] |key_people = [[Axel A. Weber]] (Chairman){{br}}[[Sergio
Ermotti]] (CEO) {{br}} |area_served = Worldwide |industry =[[Banking]],
[[Financial services]] |products = [[Investment Banking]]
[[Investment Management]] [[Wealth Management]] [[Private Banking]]
[[Commercial Bank|Corporate Banking]]
[[Private Equity]]
[[Finance and Insurance]]
[[Retail Banking|Consumer Banking]]
[[Mortgage loans|Mortgages]]
[[Credit Cards]] |revenue = {{Increase}} [[Swiss franc|CHF]]28.027
billion (2014) |operating_income = {{Decrease}} CHF2.461 billion (2014)
{{cite web|title=UBS Annual Report
2014|url=http://www.ubs.com/global/en/about_ubs/
investor_relations/annualreporting/2014/_jcr_content/par/
columncontrol_0/col1/linklist/link.1899571414.file/
bGluay9wYXRoPS9jb250ZW50L2RhbS9zdGF0aWMvZ2xvYmFsL2ludmV
zdG9yX3JlbGF0aW9ucy9hbm51YWwyMDE0L2FubnVhbC1yZXBv
cnQtZ3JvdXAtMjAxNC1lbi5wZGY=/annual-report-group-2014-
en.pdf|publisher=UBS.com|accessdate=May 3, 2015}}
|assets = {{Increase}} CHF1.062 trillion (2014) |equity = {{Increase}}
CHF54.368 billion (2014) |num_employees = {{Decrease}} 60,155 (2014)
|caption=We Will Not Rest |homepage = [https://www.ubs.com/ UBS.com] }}
'''UBS AG''' is a Swiss global [[financial services]] company,
incorporated in the [[Canton of Zurich]],{{cite web|title=Trade Register:
UBS AG|url=http://www.moneyhouse.ch/en/u/ubs_ag_CH-270.3.004.646-4.htm}}
and co-headquartered in [[Zürich]] and [[Basel]].{{cite
web|url=https://www.ubs.com/global/en/about_ubs/
investor_relations/faq/about.html|title=Corporate information - UBS
Global topics|work=ubs.com|accessdate=March 29, 2015}} The company
provides [[investment banking]], [[asset management]], and [[wealth
management]] services for private, corporate, and institutional clients
worldwide, and for retail clients in Switzerland as well.{{cite
web|url=https://www.ubs.com/global/en/about_ubs/
investor_relations/our_businesses.html|title=Our clients & businesses -
UBS Global topics|work=ubs.com|
accessdate=March 29, 2015}} The name ''UBS'' was originally an
abbreviation for the [[Union Bank of Switzerland]], but it ceased to be a
representational abbreviation after the bank's merger with [[Swiss Bank
Corporation]] in 1998. The company traces its origins to 1856, when the
earliest of its predecessor banks was founded.{{cite web|title=150 years
of banking tradition|url=https://www.ubs.com/global/en/about_ubs/
about_us/history/_jcr_content/rightpar/
teaser_0/linklist/link.651908116.file/
bGluay9wYXRoPS9jb250ZW50L2RhbS91YnMvZ2xvY
mFsL2Fib3V0X3Vicy9hYm91dF91cy9oaXN0b3J5X29mX3
Vicy8xNTBfeWVhcnNfb2ZfYmFua2luZ19FTkcucGRm/
150_years_of_banking_ENG.pdf|work=ubs.com|
accessdate=March 29, 2015}} UBS is the biggest
bank in Switzerland, operating in more than 50
countries with about 60,000 employees around the world, as of 2014.{{cite
web|title=About us: UBS in a few
words|url=https://www.ubs.com/global/en/
about_ubs/about_us/ourprofile.html|work=ubs.com}} It is considered the
world's largest manager of private wealth assets, with over [[Swiss
franc|CHF]]2.2 trillion in invested assets,J.P.Morgan Cazenove Europe';
?>
[^|]*\bindustry\b[^|]*
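Try this with the i (case-insensitive) flag. See the demo:
https://regex101.com/r/uF4oY4/79
This matches the stretch of text between two | delimiters that contains the word industry: it starts right after the preceding | and runs up to the next |.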
$re = "/[^|]*\\bindustry\\b[^|]*/i";
$str = "{{other uses|UBS (disambiguation)}} {{Use dmy dates|date=April \n2015}} {{Infobox company |name = UBS Group AG |logo = [[File:UBS \nLogo.svg|200px|UBS Group AG Logo]] |type = [[Aktiengesellschaft]] \n([[Aktiengesellschaft|AG]])\n[[Public company]] |traded_as = {{SWX|UBSG}} {{SWX|UBSN}}\n{{nyse|UBS}} |foundation=1854 |predecessor = [[Union Bank of \nSwitzerland]] and [[Swiss Bank Corporation]] merged in 1998; \n[[PaineWebber]] merged in 2000 |location = [[Zürich]]\n[[Basel]] |key_people = [[Axel A. Weber]] (Chairman){{br}}[[Sergio \nErmotti]] (CEO) {{br}} |area_served = Worldwide |industry =[[Banking]], \n[[Financial services]] |products = [[Investment Banking]]\n[[Investment Management]] [[Wealth Management]] [[Private Banking]]\n[[Commercial Bank|Corporate Banking]]\n[[Private Equity]]\n[[Finance and Insurance]]\n[[Retail Banking|Consumer Banking]]\n[[Mortgage loans|Mortgages]]\n[[Credit Cards]] |revenue = {{Increase}} [[Swiss franc|CHF]]28.027 \nbillion (2014) |operating_income = {{Decrease}} CHF2.461 billion (2014) \n{{cite web|title=UBS Annual Report \n2014|url=http://www.ubs.com/global/en/about_ubs/\ninvestor_relations/annualreporting/2014/_jcr_content/par/\ncolumncontrol_0/col1/linklist/link.1899571414.file/\nbGluay9wYXRoPS9jb250ZW50L2RhbS9zdGF0aWMvZ2xvYmFsL2ludmV\nzdG9yX3JlbGF0aW9ucy9hbm51YWwyMDE0L2FubnVhbC1yZXBv\ncnQtZ3JvdXAtMjAxNC1lbi5wZGY=/annual-report-group-2014- \nen.pdf|publisher=UBS.com|accessdate=May 3, 2015}} \n|assets = {{Increase}} CHF1.062 trillion (2014) |equity = {{Increase}} \nCHF54.368 billion (2014) |num_employees = {{Decrease}} 60,155 (2014) \n|caption=We Will Not Rest |homepage = [https://www.ubs.com/ UBS.com] }} \n'''UBS AG''' is a Swiss global [[financial services]] company, \nincorporated in the [[Canton of Zurich]],{{cite web|title=Trade Register: \nUBS AG|url=http://www.moneyhouse.ch/en/u/ubs_ag_CH-270.3.004.646-4.htm}} \nand co-headquartered in [[Zürich]] and [[Basel]].{{cite 
\nweb|url=https://www.ubs.com/global/en/about_ubs/\ninvestor_relations/faq/about.html|title=Corporate information - UBS \nGlobal topics|work=ubs.com|accessdate=March 29, 2015}} The company \nprovides [[investment banking]], [[asset management]], and [[wealth \nmanagement]] services for private, corporate, and institutional clients \nworldwide, and for retail clients in Switzerland as well.{{cite \nweb|url=https://www.ubs.com/global/en/about_ubs/\ninvestor_relations/our_businesses.html|title=Our clients & businesses - \nUBS Global topics|work=ubs.com|\naccessdate=March 29, 2015}} The name ''UBS'' was originally an \nabbreviation for the [[Union Bank of Switzerland]], but it ceased to be a \nrepresentational abbreviation after the bank's merger with [[Swiss Bank \nCorporation]] in 1998. The company traces its origins to 1856, when the \nearliest of its predecessor banks was founded.{{cite web|title=150 years \nof banking tradition|url=https://www.ubs.com/global/en/about_ubs/\nabout_us/history/_jcr_content/rightpar/\nteaser_0/linklist/link.651908116.file/\nbGluay9wYXRoPS9jb250ZW50L2RhbS91YnMvZ2xvY\nmFsL2Fib3V0X3Vicy9hYm91dF91cy9oaXN0b3J5X29mX3\nVicy8xNTBfeWVhcnNfb2ZfYmFua2luZ19FTkcucGRm/\n150_years_of_banking_ENG.pdf|work=ubs.com|\naccessdate=March 29, 2015}} UBS is the biggest\nbank in Switzerland, operating in more than 50 \ncountries with about 60,000 employees around the world, as of 2014.{{cite \nweb|title=About us: UBS in a few \nwords|url=https://www.ubs.com/global/en/\nabout_ubs/about_us/ourprofile.html|work=ubs.com}} It is considered the \nworld's largest manager of private wealth assets, with over [[Swiss \nfranc|CHF]]2.2 trillion in invested assets,J.P.Morgan Cazenove Europ";
preg_match_all($re, $str, $matches);
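To see the pattern at work on something shorter, here is a small sketch using the generic sample from the question. The wrapper extractField() and its parameters are assumptions added for illustration; only the pattern itself comes from the answer above.

```php
<?php
// Hypothetical wrapper around the pattern above: returns the |-delimited
// segment containing $field as a whole word, or null if there is none.
function extractField(string $text, string $field): ?string
{
    $pattern = '/[^|]*\b' . preg_quote($field, '/') . '\b[^|]*/i';
    if (preg_match($pattern, $text, $m)) {
        return trim($m[0]); // drop the whitespace around the segment
    }
    return null;
}

echo extractField('blah blah blah |industry = industry info| blah blah blah', 'industry');
```

This prints industry = industry info. preg_match() stops at the first matching segment, whereas preg_match_all() would collect every segment that mentions the word.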
I would not apply a regex on a large input string like yours. As you can see in the regex debugger, vks' regex makes about 340,000 steps to finally fetch you a result.
I suggest splitting the string with | first, and then grepping out the info you need.
$chks = explode("|", $output);
foreach ($chks as $chk) {
if (stripos($chk,'industry =') !== false) { // stripos() ignores case, as the question asks
echo $chk;
}
}
Result:
industry =[[Banking]],
[[Financial services]]
See IDEONE demo
Good morning -
I'm interested in seeing an efficient way of parsing the values of a hierarchical text file (i.e., one that has a Title => Multiple Headings => Multiple Subheadings => Multiple Keys => Multiple Values) into a simple XML document. For the sake of simplicity, the answer would be written using:
Regex (preferably in PHP)
or PHP code (e.g., if looping were more efficient)
Here's an example of an Inventory file I'm working with. Note that Header = FOODS, Sub-Header = Type (A, B...), Keys = PRODUCT (or CODE, etc.) and Values may have one or more lines.
**FOODS - TYPE A**
___________________________________
**PRODUCT**
1) Mi Pueblito Queso Fresco Authentic Mexican Style Fresh Cheese;
2) La Fe String Cheese
**CODE**
Sell by date going back to February 1, 2009
**MANUFACTURER**
Quesos Mi Pueblito, LLC, Passaic, NJ.
**VOLUME OF UNITS**
11,000 boxes
**DISTRIBUTION**
NJ, NY, DE, MD, CT, VA
___________________________________
**PRODUCT**
1) Peanut Brittle No Sugar Added;
2) Peanut Brittle Small Grind;
3) Homestyle Peanut Brittle Nuggets/Coconut Oil Coating
**CODE**
1) Lots 7109 - 8350 inclusive;
2) Lots 8198 - 8330 inclusive;
3) Lots 7075 - 9012 inclusive;
4) Lots 7100 - 8057 inclusive;
5) Lots 7152 - 8364 inclusive
**MANUFACTURER**
Star Kay White, Inc., Congers, NY.
**VOLUME OF UNITS**
5,749 units
**DISTRIBUTION**
NY, NJ, MA, PA, OH, FL, TX, UT, CA, IA, NV, MO and IN
**FOODS - TYPE B**
___________________________________
**PRODUCT**
Cool River Bebidas Naturales - West Indian Cherry Fruit Acerola 16% Juice;
**CODE**
990-10/2 10/5
**MANUFACTURER**
San Mar Manufacturing Corp., Catano, PR.
**VOLUME OF UNITS**
384
**DISTRIBUTION**
PR
And here's the desired output (please excuse any XML syntactical errors):
<foods>
<food type = "A" >
<product>Mi Pueblito Queso Fresco Authentic Mexican Style Fresh Cheese</product>
<product>La Fe String Cheese</product>
<code>Sell by date going back to February 1, 2009</code>
<manufacturer>Quesos Mi Pueblito, LLC, Passaic, NJ.</manufacturer>
<volume>11,000 boxes</volume>
<distribution>NJ, NY, DE, MD, CT, VA</distribution>
</food>
<food type = "A" >
<product>Peanut Brittle No Sugar Added</product>
<product>Peanut Brittle Small Grind</product>
<product>Homestyle Peanut Brittle Nuggets/Coconut Oil Coating</product>
<code>Lots 7109 - 8350 inclusive</code>
<code>Lots 8198 - 8330 inclusive</code>
<code>Lots 7075 - 9012 inclusive</code>
<code>Lots 7100 - 8057 inclusive</code>
<code>Lots 7152 - 8364 inclusive</code>
<manufacturer>Star Kay White, Inc., Congers, NY.</manufacturer>
<volume>5,749 units</volume>
<distribution>NY, NJ, MA, PA, OH, FL, TX, UT, CA, IA, NV, MO and IN</distribution>
</food>
<food type = "B" >
<product>Cool River Bebidas Naturales - West Indian Cherry Fruit Acerola 16% Juice</product>
<code>990-10/2 10/5</code>
<manufacturer>San Mar Manufacturing Corp., Catano, PR</manufacturer>
<volume>384</volume>
<distribution>PR</distribution>
</food>
</foods>
<!-- and so forth -->
So far, my approach (which might be quite inefficient with a huge text file) would be one of the following:
Loops and multiple Select/Case statements, where the file is loaded into a string buffer, and while looping through each line, see if it matches one of the header/subheader/key lines, append the appropriate xml tag to a xml string variable, and then add the child nodes to the xml based on IF statements regarding which key name is most recent (which seems time-consuming and error-prone, esp. if the text changes even slightly) -- OR
Use REGEX (Regular Expressions) to find and replace key fields with appropriate xml tags, clean it up with an xml library, and export the xml file. Problem is, I barely use regular expressions, so I'd need some example-based help.
Any help or advice would be appreciated.
Thanks.
An example you can use as a starting point. At least I hope it gives you an idea...
<?php
define('TYPE_HEADER', 1);
define('TYPE_KEY', 2);
define('TYPE_DELIMETER', 3);
define('TYPE_VALUE', 4);
$datafile = 'data.txt';
$fp = fopen($datafile, 'rb') or die('!fopen');
// stores (the first) {header} in 'name' and the root simplexmlelement in 'element'
$container = array('name'=>null, 'element'=>null);
// stores the name for each item element, the value for the type attribute for subsequent item elements and the simplexmlelement of the current item element
$item = array('name'=>null, 'type'=>null, 'current_element'=>null);
// the last **key** encountered, used to create new child elements in the current item element when a value is encountered
$key = null;
while ( false!==($t=getstruct($fp)) ) {
switch( $t[0] ) {
case TYPE_HEADER:
if ( is_null($container['element']) ) {
// this is the first time we hit **header - subheader**
$container['name'] = $t[1][0];
// ugly hack, < . name . />
$container['element'] = new SimpleXMLElement('<'.$container['name'].'/>');
// each subsequent new item gets the new subheader as type attribute
$item['type'] = $t[1][1];
// dummy implementation: "deducting" the item names from header/container[name]
$item['name'] = substr($t[1][0], 0, -1);
}
else {
// hitting **header - subheader** the (second, third, nth) time
/*
header must be the same as the first time (stored in container['name']).
Otherwise you need another container element since
xml documents can only have one root element
*/
if ( $container['name'] !== $t[1][0] ) {
echo $container['name'], "!==", $t[1][0], "\n";
die('format error');
}
else {
// subheader may have changed, store it for future item elements
$item['type'] = $t[1][1];
}
}
break;
case TYPE_DELIMETER:
assert( !is_null($container['element']) );
assert( !is_null($item['name']) );
assert( !is_null($item['type']) );
/* that's maybe not a wise choice.
You might want to check the complete item before appending it to the document.
But the example is a hack anyway ...so create a new item element and append it to the container right away
*/
$item['current_element'] = $container['element']->addChild($item['name']);
// set the type-attribute according to the last **header - subheader** encountered
$item['current_element']['type'] = $item['type'];
break;
case TYPE_KEY:
$key = $t[1][0];
break;
case TYPE_VALUE:
assert( !is_null($item['current_element']) );
assert( !is_null($key) );
// this is a value belonging to the "last" key encountered
// create a new "key" element with the value as content
// and add it to the current item element
$tmp = $item['current_element']->addChild($key, $t[1][0]);
break;
default:
die('unknown token');
}
}
if ( !is_null($container['element']) ) {
$doc = dom_import_simplexml($container['element']);
$doc = $doc->ownerDocument;
$doc->formatOutput = true;
echo $doc->saveXML();
}
die;
/*
Take a look at gettoken() at http://www.tuxradar.com/practicalphp/21/5/6
It breaks the stream into much simpler pieces.
In the next step the parser would "combine" or structure the simple tokens into more complex things.
This function does both....
@return array(id, array(parameter))
*/
function getstruct($fp) {
if ( feof($fp) ) {
return false;
}
// shortcut: all we care about "happens" on one line
// so let php read one line in a single step and then do the pattern matching
$line = trim(fgets($fp));
// this matches **key** and **header - subheader**
if ( preg_match('#^\*\*([^-]+)(?:-(.*))?\*\*$#', $line, $m) ) {
// only for **header - subheader** $m[2] is set.
if ( isset($m[2]) ) {
return array(TYPE_HEADER, array(trim($m[1]), trim($m[2])));
}
else {
return array(TYPE_KEY, array($m[1]));
}
}
// this matches _____________ and means "new item"
else if ( preg_match('#^_+$#', $line, $m) ) {
return array(TYPE_DELIMETER, array());
}
// any other non-empty line is a single value
else if ( preg_match('#\S#', $line) ) {
// you might want to filter the 1),2),3) part out here
// could also be two different token types
return array(TYPE_VALUE, array($line));
}
else {
// skip empty lines, would be nicer with tail-recursion...
return getstruct($fp);
}
}
prints
<?xml version="1.0"?>
<FOODS>
<FOOD type="TYPE A">
<PRODUCT>1) Mi Pueblito Queso Fresco Authentic Mexican Style Fresh Cheese;</PRODUCT>
<PRODUCT>2) La Fe String Cheese</PRODUCT>
<CODE>Sell by date going back to February 1, 2009</CODE>
<MANUFACTURER>Quesos Mi Pueblito, LLC, Passaic, NJ.</MANUFACTURER>
<VOLUME OF UNITS>11,000 boxes</VOLUME OF UNITS>
<DISTRIBUTION>NJ, NY, DE, MD, CT, VA</DISTRIBUTION>
</FOOD>
<FOOD type="TYPE A">
<PRODUCT>1) Peanut Brittle No Sugar Added;</PRODUCT>
<PRODUCT>2) Peanut Brittle Small Grind;</PRODUCT>
<PRODUCT>3) Homestyle Peanut Brittle Nuggets/Coconut Oil Coating</PRODUCT>
<CODE>1) Lots 7109 - 8350 inclusive;</CODE>
<CODE>2) Lots 8198 - 8330 inclusive;</CODE>
<CODE>3) Lots 7075 - 9012 inclusive;</CODE>
<CODE>4) Lots 7100 - 8057 inclusive;</CODE>
<CODE>5) Lots 7152 - 8364 inclusive</CODE>
<MANUFACTURER>Star Kay White, Inc., Congers, NY.</MANUFACTURER>
<VOLUME OF UNITS>5,749 units</VOLUME OF UNITS>
<DISTRIBUTION>NY, NJ, MA, PA, OH, FL, TX, UT, CA, IA, NV, MO and IN</DISTRIBUTION>
</FOOD>
<FOOD type="TYPE B">
<PRODUCT>Cool River Bebidas Naturales - West Indian Cherry Fruit Acerola 16% Juice;</PRODUCT>
<CODE>990-10/2 10/5</CODE>
<MANUFACTURER>San Mar Manufacturing Corp., Catano, PR.</MANUFACTURER>
<VOLUME OF UNITS>384</VOLUME OF UNITS>
<DISTRIBUTION>PR</DISTRIBUTION>
</FOOD>
</FOODS>
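Two things stand out in that output: keys such as VOLUME OF UNITS contain spaces, which are not allowed in XML element names, and the values still carry the "1) " numbering and trailing semicolons mentioned in the TYPE_VALUE comment. A possible clean-up step is sketched below. Both helper names are hypothetical, and the naming rule (lowercase, underscores) is just one assumption about what the element names should look like.

```php
<?php
// Hypothetical helper: turns a key such as "VOLUME OF UNITS" into a string
// that is usable as an XML element name.
function toElementName(string $key): string
{
    $name = strtolower(trim($key));
    // Replace every run of characters other than a-z and 0-9 with an underscore.
    $name = trim(preg_replace('/[^a-z0-9]+/', '_', $name), '_');
    // XML names must not be empty or start with a digit.
    if ($name === '' || ctype_digit($name[0])) {
        $name = '_' . $name;
    }
    return $name;
}

// Hypothetical helper: strips a leading "1) " style counter and a trailing semicolon.
function cleanValue(string $line): string
{
    $line = preg_replace('/^\d+\)\s*/', '', trim($line));
    return rtrim($line, "; \t");
}

echo toElementName('VOLUME OF UNITS'), "\n";
echo cleanValue('1) Peanut Brittle No Sugar Added;'), "\n";
```

In the parser above, these could be applied where the TYPE_KEY and TYPE_VALUE tokens are returned from getstruct(), so that addChild() only ever receives cleaned names and values.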
Unfortunately the status of the php module for ANTLR currently is "Runtime is in alpha status." but it might be worth a try anyway...
See: http://www.tuxradar.com/practicalphp/21/5/6
This tells you how to parse a text file into tokens using PHP. Once parsed you can place it into anything you want.
You need to search for specific tokens in the file based on your criteria:
for example:
PRODUCT
This gives you the XML Tag
Then 1) can have special meaning
1) Peanut Brittle...
This tells you what to put in the XML tag.
I do not know if this is the most efficient way to accomplish your task, but it is the way a compiler would parse a file, and it has the potential to be very accurate.
Instead of Regex or PHP use the XSLT 2.0 unparsed-text() function to read the file (see http://www.biglist.com/lists/xsl-list/archives/200508/msg00085.html)
Another Hint for an XSLT 1.0 Solution is here: http://bytes.com/topic/net/answers/808619-read-plain-file-xslt-1-0-a