How would I decode this JSON data to get the Location link of the event? NOTE: When I say Location I don't mean the field "location" in the json data, I am referring to the field which is in "customFields", then has a "value" which is a link to Google Maps, it also has the "type" = 9.
Problem: I am currently stuck with a page that looks like the image below. The "Notice: Undefined offset: # in...." error repeats for 200 lines because the JSON file contains 200 events; the JSON included here contains only the first event.
Desired result: the link to the Google Maps page echoed on its own line for every event. I think the solution is simple: changing my source code (included) so that it reads the JSON file correctly.
JSON dataset:
[{"eventID":152913573,"template":"Brisbane City Council","title":"Clock Tower Tour","description":"The Clock Tour Tower is a ‘must-do’ for anyone and everyone in Brisbane!<br /> <br /> For many years, City Hall’s Clock Tower made the building the tallest in Brisbane, offering visitors a magnificent 360 degree view of the city around them. Whilst the view has changed significantly over the last 90 years, the time-honoured tradition of “taking a trip up the tower” happily continues at Museum of Brisbane.<br /> <br /> The Clock Tower Tour includes a ride in one of Brisbane’s oldest working cage lifts, a look behind Australia’s largest analogue clock faces and time to explore the observation platform that shares a unique perspective of the city. See if you can catch a glimpse of the bells!<br /> <br /> <strong>Location</strong>: Tour begins from Museum of Brisbane reception on Level 3 of City Hall.","location":"Museum of Brisbane, Brisbane City","webLink":"","startDateTime":"2021-06-13T00:00:00","endDateTime":"2021-06-14T00:00:00","dateTimeFormatted":"Sunday, June 13, 2021","allDay":true,"startTimeZoneOffset":"+1000","endTimeZoneOffset":"+1000","canceled":false,"openSignUp":false,"reservationFull":false,"pastDeadline":false,"requiresPayment":false,"refundsAllowed":false,"waitingListAvailable":false,"signUpUrl":"https://eventactions.com/eareg.aspx?ea=Rsvp&invite=0tva7etjn38te1bve2yj59425pupt7wvscmr1z6depcj9ctnrh7r","repeatingRegistration":0,"repeats":"Every Sunday, Tuesday, Wednesday, Thursday, Friday and Saturday through June 30, 2021","seriesID":152913560,"eventImage":{"url":"https://www.trumba.com/i/DgDhxtvzZEBEz%2AjAEUDofPUE.jpeg","size":{"width":1290,"height":775}},"detailImage":{"url":"https://www.trumba.com/i/DgDhxtvzZEBEz%2AjAEUDofPUE.jpeg","size":{"width":1290,"height":775}},"customFields":[{"fieldID":22503,"label":"Venue","value":"Museum of Brisbane, Brisbane City","type":17},{"fieldID":22505,"label":"Venue address","value":"Museum of Brisbane, Brisbane City 
Hall, 64 Adelaide Street, Brisbane City","type":9},{"fieldID":21859,"label":"Event type","value":"Family events, Free","type":17},{"fieldID":22177,"label":"Cost","value":"Free","type":0},{"fieldID":23562,"label":"Age","value":"Suitable for all ages","type":0},{"fieldID":22732,"label":"Bookings","value":"Book via the Museum of Brisbane website.","type":1},{"fieldID":51540,"label":"Bookings required","value":"Yes","type":3}],"permaLinkUrl":"https://www.brisbane.qld.gov.au/trumba?trumbaEmbed=view%3devent%26eventid%3d152913573","eventActionUrl":"https://eventactions.com/eventactions/brisbane-city-council#/actions/cvuzsak1g2d45mndcjwkp24nfw","categoryCalendar":"Brisbane's calendar|Museum of Brisbane","registrationTransferTargetCount":0,"regAllowChanges":true}]
Code so far:
<?php
$output = file_get_contents("Events.json");
$decode = json_decode($output, true);
for($i = 0; $i < count($decode); $i++) {
if($decode[$i]['customFields'][$i]['type'] == 9){
echo $decode[$i]['customFields'][$i]['label'][$i]['value'];
}
echo "<br>";
}
?>
You're using the $i loop counter twice in the same expression: once as the event index and again as the customFields index. The second use points at elements that don't exist for most values of $i, which is where the "Undefined offset" notices come from. The snippet below 1) treats JSON objects as objects (I find it less confusing when matching code to data), and 2) uses foreach to iterate over the arrays.
I've also extracted the latitude and longitude for you into $latlong
Try this:
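// $json holds the JSON text, e.g. the result of file_get_contents("Events.json").
// Without a second argument, JSON objects are decoded as PHP objects.
$decode = json_decode($json);
foreach ($decode as $event) {
    foreach ($event->customFields as $field) {
        // type 9 is the "Venue address" field that carries the Google Maps link
        if ($field->type == 9) {
            echo $field->value."\n";
            // First pull the URL out of the href attribute, then the "q=lat,long" pair out of the URL
            if (preg_match('/href="(.*?)"/', $field->value, $matches)
                && preg_match('/q=([\-\.0-9]*),([\-\.0-9]*)/', $matches[1], $latlong)) {
                array_shift($latlong); // drop the full match, keep [lat, long]
                var_dump($latlong);
            }
            break; // one address field per event is enough
        }
    }
}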
Output
Museum of Brisbane, Brisbane City Hall, 64 Adelaide Street, Brisbane City
array(2) {
[0]=>
string(11) "-27.4693454"
[1]=>
string(11) "153.0216909"
}
Demo: https://3v4l.org/AkRvI
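If you would rather keep the associative arrays from the question's json_decode($output, true) call, the same nested foreach idea can be sketched as below. This is only an illustration: the shortened $json sample and the $addresses result array are made up for the example, and the only assumption about the real data is that each event has an eventID and a customFields list as in the dataset above.

```php
<?php
// Shortened, hypothetical sample with the same shape as the dataset in the question.
$json = '[{"eventID":1,"customFields":['
      . '{"label":"Venue","value":"A","type":17},'
      . '{"label":"Venue address","value":"B","type":9}]}]';

// true => JSON objects become associative arrays.
$events = json_decode($json, true);

$addresses = [];
foreach ($events as $event) {
    // One loop per level: events outside, custom fields inside.
    foreach ($event['customFields'] ?? [] as $field) {
        if (($field['type'] ?? null) === 9) {
            $addresses[$event['eventID']] = $field['value'];
            break; // stop at the first type 9 field of this event
        }
    }
}

print_r($addresses);
```

Each level of the data gets its own loop variable, so no counter is reused across levels.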
I have the following output, from which I would like to extract the substring that starts at the word "industry" (in any letter case) and runs up to the next delimiter, typically "|". I get $output from an API, so its contents are always different, but the general shape is something like: blah blah blah |industry = industry info| blah blah blah. If the word industry exists in the output, I would like to get just industry = industry info. Is there a generic regex which can do this? The specific output I have returned is:
<?php
$output = '{{other uses|UBS (disambiguation)}} {{Use dmy dates|date=April
2015}} {{Infobox company |name = UBS Group AG |logo = [[File:UBS
Logo.svg|200px|UBS Group AG Logo]] |type = [[Aktiengesellschaft]]
([[Aktiengesellschaft|AG]])
[[Public company]] |traded_as = {{SWX|UBSG}} {{SWX|UBSN}}
{{nyse|UBS}} |foundation=1854 |predecessor = [[Union Bank of
Switzerland]] and [[Swiss Bank Corporation]] merged in 1998;
[[PaineWebber]] merged in 2000 |location = [[Zürich]]
[[Basel]] |key_people = [[Axel A. Weber]] (Chairman){{br}}[[Sergio
Ermotti]] (CEO) {{br}} |area_served = Worldwide |industry =[[Banking]],
[[Financial services]] |products = [[Investment Banking]]
[[Investment Management]] [[Wealth Management]] [[Private Banking]]
[[Commercial Bank|Corporate Banking]]
[[Private Equity]]
[[Finance and Insurance]]
[[Retail Banking|Consumer Banking]]
[[Mortgage loans|Mortgages]]
[[Credit Cards]] |revenue = {{Increase}} [[Swiss franc|CHF]]28.027
billion (2014) |operating_income = {{Decrease}} CHF2.461 billion (2014)
{{cite web|title=UBS Annual Report
2014|url=http://www.ubs.com/global/en/about_ubs/
investor_relations/annualreporting/2014/_jcr_content/par/
columncontrol_0/col1/linklist/link.1899571414.file/
bGluay9wYXRoPS9jb250ZW50L2RhbS9zdGF0aWMvZ2xvYmFsL2ludmV
zdG9yX3JlbGF0aW9ucy9hbm51YWwyMDE0L2FubnVhbC1yZXBv
cnQtZ3JvdXAtMjAxNC1lbi5wZGY=/annual-report-group-2014-
en.pdf|publisher=UBS.com|accessdate=May 3, 2015}}
|assets = {{Increase}} CHF1.062 trillion (2014) |equity = {{Increase}}
CHF54.368 billion (2014) |num_employees = {{Decrease}} 60,155 (2014)
|caption=We Will Not Rest |homepage = [https://www.ubs.com/ UBS.com] }}
'''UBS AG''' is a Swiss global [[financial services]] company,
incorporated in the [[Canton of Zurich]],{{cite web|title=Trade Register:
UBS AG|url=http://www.moneyhouse.ch/en/u/ubs_ag_CH-270.3.004.646-4.htm}}
and co-headquartered in [[Zürich]] and [[Basel]].{{cite
web|url=https://www.ubs.com/global/en/about_ubs/
investor_relations/faq/about.html|title=Corporate information - UBS
Global topics|work=ubs.com|accessdate=March 29, 2015}} The company
provides [[investment banking]], [[asset management]], and [[wealth
management]] services for private, corporate, and institutional clients
worldwide, and for retail clients in Switzerland as well.{{cite
web|url=https://www.ubs.com/global/en/about_ubs/
investor_relations/our_businesses.html|title=Our clients & businesses -
UBS Global topics|work=ubs.com|
accessdate=March 29, 2015}} The name ''UBS'' was originally an
abbreviation for the [[Union Bank of Switzerland]], but it ceased to be a
representational abbreviation after the bank's merger with [[Swiss Bank
Corporation]] in 1998. The company traces its origins to 1856, when the
earliest of its predecessor banks was founded.{{cite web|title=150 years
of banking tradition|url=https://www.ubs.com/global/en/about_ubs/
about_us/history/_jcr_content/rightpar/
teaser_0/linklist/link.651908116.file/
bGluay9wYXRoPS9jb250ZW50L2RhbS91YnMvZ2xvY
mFsL2Fib3V0X3Vicy9hYm91dF91cy9oaXN0b3J5X29mX3
Vicy8xNTBfeWVhcnNfb2ZfYmFua2luZ19FTkcucGRm/
150_years_of_banking_ENG.pdf|work=ubs.com|
accessdate=March 29, 2015}} UBS is the biggest
bank in Switzerland, operating in more than 50
countries with about 60,000 employees around the world, as of 2014.{{cite
web|title=About us: UBS in a few
words|url=https://www.ubs.com/global/en/
about_ubs/about_us/ourprofile.html|work=ubs.com}} It is considered the
world's largest manager of private wealth assets, with over [[Swiss
franc|CHF]]2.2 trillion in invested assets,J.P.Morgan Cazenove Europe';
?>
[^|]*\bindustry\b[^|]*
Try this. See the demo. Use the i flag.
https://regex101.com/r/uF4oY4/79
This matches the run of non-| characters that contains the word industry, i.e. everything from just after the preceding | up to the next |.
$re = "/[^|]*\\bindustry\\b[^|]*/i";
$str = "{{other uses|UBS (disambiguation)}} {{Use dmy dates|date=April \n2015}} {{Infobox company |name = UBS Group AG |logo = [[File:UBS \nLogo.svg|200px|UBS Group AG Logo]] |type = [[Aktiengesellschaft]] \n([[Aktiengesellschaft|AG]])\n[[Public company]] |traded_as = {{SWX|UBSG}} {{SWX|UBSN}}\n{{nyse|UBS}} |foundation=1854 |predecessor = [[Union Bank of \nSwitzerland]] and [[Swiss Bank Corporation]] merged in 1998; \n[[PaineWebber]] merged in 2000 |location = [[Zürich]]\n[[Basel]] |key_people = [[Axel A. Weber]] (Chairman){{br}}[[Sergio \nErmotti]] (CEO) {{br}} |area_served = Worldwide |industry =[[Banking]], \n[[Financial services]] |products = [[Investment Banking]]\n[[Investment Management]] [[Wealth Management]] [[Private Banking]]\n[[Commercial Bank|Corporate Banking]]\n[[Private Equity]]\n[[Finance and Insurance]]\n[[Retail Banking|Consumer Banking]]\n[[Mortgage loans|Mortgages]]\n[[Credit Cards]] |revenue = {{Increase}} [[Swiss franc|CHF]]28.027 \nbillion (2014) |operating_income = {{Decrease}} CHF2.461 billion (2014) \n{{cite web|title=UBS Annual Report \n2014|url=http://www.ubs.com/global/en/about_ubs/\ninvestor_relations/annualreporting/2014/_jcr_content/par/\ncolumncontrol_0/col1/linklist/link.1899571414.file/\nbGluay9wYXRoPS9jb250ZW50L2RhbS9zdGF0aWMvZ2xvYmFsL2ludmV\nzdG9yX3JlbGF0aW9ucy9hbm51YWwyMDE0L2FubnVhbC1yZXBv\ncnQtZ3JvdXAtMjAxNC1lbi5wZGY=/annual-report-group-2014- \nen.pdf|publisher=UBS.com|accessdate=May 3, 2015}} \n|assets = {{Increase}} CHF1.062 trillion (2014) |equity = {{Increase}} \nCHF54.368 billion (2014) |num_employees = {{Decrease}} 60,155 (2014) \n|caption=We Will Not Rest |homepage = [https://www.ubs.com/ UBS.com] }} \n'''UBS AG''' is a Swiss global [[financial services]] company, \nincorporated in the [[Canton of Zurich]],{{cite web|title=Trade Register: \nUBS AG|url=http://www.moneyhouse.ch/en/u/ubs_ag_CH-270.3.004.646-4.htm}} \nand co-headquartered in [[Zürich]] and [[Basel]].{{cite 
\nweb|url=https://www.ubs.com/global/en/about_ubs/\ninvestor_relations/faq/about.html|title=Corporate information - UBS \nGlobal topics|work=ubs.com|accessdate=March 29, 2015}} The company \nprovides [[investment banking]], [[asset management]], and [[wealth \nmanagement]] services for private, corporate, and institutional clients \nworldwide, and for retail clients in Switzerland as well.{{cite \nweb|url=https://www.ubs.com/global/en/about_ubs/\ninvestor_relations/our_businesses.html|title=Our clients & businesses - \nUBS Global topics|work=ubs.com|\naccessdate=March 29, 2015}} The name ''UBS'' was originally an \nabbreviation for the [[Union Bank of Switzerland]], but it ceased to be a \nrepresentational abbreviation after the bank's merger with [[Swiss Bank \nCorporation]] in 1998. The company traces its origins to 1856, when the \nearliest of its predecessor banks was founded.{{cite web|title=150 years \nof banking tradition|url=https://www.ubs.com/global/en/about_ubs/\nabout_us/history/_jcr_content/rightpar/\nteaser_0/linklist/link.651908116.file/\nbGluay9wYXRoPS9jb250ZW50L2RhbS91YnMvZ2xvY\nmFsL2Fib3V0X3Vicy9hYm91dF91cy9oaXN0b3J5X29mX3\nVicy8xNTBfeWVhcnNfb2ZfYmFua2luZ19FTkcucGRm/\n150_years_of_banking_ENG.pdf|work=ubs.com|\naccessdate=March 29, 2015}} UBS is the biggest\nbank in Switzerland, operating in more than 50 \ncountries with about 60,000 employees around the world, as of 2014.{{cite \nweb|title=About us: UBS in a few \nwords|url=https://www.ubs.com/global/en/\nabout_ubs/about_us/ourprofile.html|work=ubs.com}} It is considered the \nworld's largest manager of private wealth assets, with over [[Swiss \nfranc|CHF]]2.2 trillion in invested assets,J.P.Morgan Cazenove Europ";
preg_match_all($re, $str, $matches);
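As a smaller check, here is the same pattern applied to the generic string from the question. This is a sketch only; $sample and $found are names chosen for the example, and trim() is added because the match keeps any whitespace next to the delimiters.

```php
<?php
// Generic input shape taken from the question.
$sample = 'blah blah blah |industry = industry info| blah blah blah';

// Same pattern as above: non-| characters around the whole word "industry".
$found = null;
if (preg_match('/[^|]*\bindustry\b[^|]*/i', $sample, $m)) {
    $found = trim($m[0]);
}

var_dump($found);
```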
I would not apply a regex on a large input string like yours. As you can see in the regex debugger, vks' regex makes about 340,000 steps to finally fetch you a result.
I suggest splitting the string with | first, and then grepping out the info you need.
$chks = explode("|", $output);
foreach ($chks as $chk) {
    if (strpos($chk,'industry =') !== false) {
        echo $chk;
    }
}
Result:
industry =[[Banking]],
[[Financial services]]
See IDEONE demo
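Note that strpos() is case-sensitive and expects exactly one space before the equals sign, while the question asks for "industry" in any case. A possible variant of the split-then-search idea is sketched below; findField() is a hypothetical helper name, and it assumes the field name sits at the start of its |-delimited chunk, as it does in the infobox sample.

```php
<?php
// Hypothetical helper: returns the first "name = value" chunk, or null if none is found.
function findField(string $text, string $name): ?string
{
    foreach (explode('|', $text) as $chunk) {
        // Case-insensitive match of "name" followed by optional whitespace and "=".
        if (preg_match('/^\s*' . preg_quote($name, '/') . '\s*=/i', $chunk)) {
            return trim($chunk);
        }
    }
    return null;
}

$output = 'blah blah blah |Industry = industry info| blah blah blah';
var_dump(findField($output, 'industry'));
var_dump(findField($output, 'revenue'));
```

The regex here only ever runs on one short chunk at a time, which keeps the cost per match small.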
I've been noticing some odd behavior while experimenting with benchmarking SplFixedArrays. Take this little snippet of code, for instance...
<?php
$splFixedArray = new \SplFixedArray( 100000 );
echo number_format( memory_get_usage() ) . PHP_EOL;
$variable = 'Truffaut single-origin coffee wayfarers, church-key asymmetrical 90\'s trust fund hashtag before they sold out thundercats photo booth. Godard sustainable roof party keffiyeh, Odd Future chillwave mlkshk kogi VHS leggings hoodie art party next level dreamcatcher yr. Blog american apparel aesthetic tattooed farm-to-table, stumptown viral whatever mixtape raw denim Williamsburg skateboard flexitarian actually tofu. Echo Park lomo disrupt PBR, jean shorts irony fingerstache blog kale chips. Street art iPhone PBR fingerstache Bushwick Cosby sweater. McSweeney\'s mumblecore semiotics, twee quinoa tofu +1 fingerstache pop-up. Echo Park bitters disrupt irony. Truffaut single-origin coffee wayfarers, church-key asymmetrical 90\'s trust fund hashtag before they sold out thundercats photo booth. Godard sustainable roof party keffiyeh, Odd Future chillwave mlkshk kogi VHS leggings hoodie art party next level dreamcatcher yr. Blog american apparel aesthetic tattooed farm-to-table, stumptown viral whatever mixtape raw denim Williamsburg skateboard flexitarian actually tofu. Echo Park lomo disrupt PBR, jean shorts irony fingerstache blog kale chips. Street art iPhone PBR fingerstache Bushwick Cosby sweater.';
var_dump( $variable );
for( $i = 0; $i < 100000; $i++ )
{
$splFixedArray[ $i ] = $variable;
}
echo number_format( memory_get_usage() );
Which outputs...
1,032,080
string(1209) "Truffaut single-origin coffee wayfarers, church-key asymmetrical 90's trust fund hashtag before they sold out thundercats photo booth. Godard sustainable roof party keffiyeh, Odd Future chillwave mlkshk kogi VHS leggings hoodie art party next level dreamcatcher yr. Blog american apparel aesthetic tattooed farm-to-table, stumptown viral whatever mixtape raw denim Williamsburg skateboard flexitarian actually tofu. Echo Park lomo disrupt PBR, jean shorts irony fingerstache blog kale chips. Street art iPhone PB"...
1,032,384
Now, let's add a simple random integer onto the end while in the for loop...
<?php
$splFixedArray = new \SplFixedArray( 100000 );
echo number_format( memory_get_usage() ) . PHP_EOL;
$variable = 'Truffaut single-origin coffee wayfarers, church-key asymmetrical 90\'s trust fund hashtag before they sold out thundercats photo booth. Godard sustainable roof party keffiyeh, Odd Future chillwave mlkshk kogi VHS leggings hoodie art party next level dreamcatcher yr. Blog american apparel aesthetic tattooed farm-to-table, stumptown viral whatever mixtape raw denim Williamsburg skateboard flexitarian actually tofu. Echo Park lomo disrupt PBR, jean shorts irony fingerstache blog kale chips. Street art iPhone PBR fingerstache Bushwick Cosby sweater. McSweeney\'s mumblecore semiotics, twee quinoa tofu +1 fingerstache pop-up. Echo Park bitters disrupt irony. Truffaut single-origin coffee wayfarers, church-key asymmetrical 90\'s trust fund hashtag before they sold out thundercats photo booth. Godard sustainable roof party keffiyeh, Odd Future chillwave mlkshk kogi VHS leggings hoodie art party next level dreamcatcher yr. Blog american apparel aesthetic tattooed farm-to-table, stumptown viral whatever mixtape raw denim Williamsburg skateboard flexitarian actually tofu. Echo Park lomo disrupt PBR, jean shorts irony fingerstache blog kale chips. Street art iPhone PBR fingerstache Bushwick Cosby sweater.';
var_dump( $variable );
for( $i = 0; $i < 100000; $i++ )
{
$splFixedArray[ $i ] = $variable . rand();
}
echo number_format( memory_get_usage() );
Which results in this...
1,034,320
string(1209) "Truffaut single-origin coffee wayfarers, church-key asymmetrical 90's trust fund hashtag before they sold out thundercats photo booth. Godard sustainable roof party keffiyeh, Odd Future chillwave mlkshk kogi VHS leggings hoodie art party next level dreamcatcher yr. Blog american apparel aesthetic tattooed farm-to-table, stumptown viral whatever mixtape raw denim Williamsburg skateboard flexitarian actually tofu. Echo Park lomo disrupt PBR, jean shorts irony fingerstache blog kale chips. Street art iPhone PB"...
129,834,272
What I'm curious about is why function calls are resulting in stacked memory usage. Is it normal that memory would not be freed up after the iteration?
This behavior is expected.
In the first case, you are storing the same value multiple times, which can be implemented as a single instance of the value and a batch of references to it, with a copy-on-write semantic for cases when the value at a given array index is changed.
In the second case, you are storing many different values, which can't be handled the same way; memory must be allocated for the full contents of each value, which results in the difference you see in memory consumption between the two cases.
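A scaled-down way to see the difference for yourself is sketched below. bytesUsedBy() is a hypothetical helper written for this illustration, and the exact numbers will vary with the PHP version and build; only the relative difference matters.

```php
<?php
// Hypothetical helper: bytes allocated while filling a 10,000-slot SplFixedArray.
function bytesUsedBy(callable $valueFor): int
{
    $before = memory_get_usage();
    $array = new SplFixedArray(10000);
    for ($i = 0; $i < 10000; $i++) {
        $array[$i] = $valueFor($i);
    }
    // Measured while $array is still alive.
    return memory_get_usage() - $before;
}

$text = str_repeat('x', 1000);

// The same string in every slot: the slots can all refer to one copy.
$shared = bytesUsedBy(function (int $i) use ($text): string {
    return $text;
});

// A different string per slot: each result needs its own allocation.
$distinct = bytesUsedBy(function (int $i) use ($text): string {
    return $text . $i;
});

echo 'shared:   ', number_format($shared), PHP_EOL;
echo 'distinct: ', number_format($distinct), PHP_EOL;
```

The memory in the second case is not leaked. It stays allocated because the array still holds every one of those distinct strings.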
php: sort and count instances of words in a given string
From this article I know how to count the instances of each word in a given string and sort them by frequency. Now I want to go a step further: match the resulting words against another array ($keywords) and keep only the top 5 matches. I do not know how to do that, so I am asking here. Thanks.
$txt = <<<EOT
The 2013 Monaco Grand Prix (formally known as the Grand Prix de Monaco 2013) was a Formula One motor race that took place on 26 May 2013 at the Circuit de Monaco, a street circuit that runs through the principality of Monaco. The race was won by Nico Rosberg for Mercedes AMG Petronas, repeating the feat of his father Keke Rosberg in the 1983 race. The race was the sixth round of the 2013 season, and marked the seventy-second time the Monaco Grand Prix has been held. Rosberg had started the race from pole.
Background
Mercedes protest
Just before the race, Red Bull and Ferrari filed an official protest against Mercedes, having learned on the night before the race of a three-day tyre test undertaken by Pirelli at the venue of the last grand prix using Mercedes' car driven by both Hamilton and Rosberg. They claimed this violated the rule against in-season testing and gave Mercedes a competitive advantage in both the Monaco race and the next race, which would both be using the tyre that was tested (with Pirelli having been criticised following some tyre failures earlier in the season, the tests had been conducted on an improved design planned to be introduced two races after Monaco). Mercedes stated the FIA had approved the test. Pirelli cited their contract with the FIA which allows limited testing, but Red Bull and Ferrari argued this must only be with a car at least two years old. It was the second test conducted by Pirelli in the season, the first having been between race 4 and 5, but using a 2011 Ferrari car.[4]
Tyres
Tyre supplier Pirelli brought its yellow-banded soft compound tyre as the harder "prime" tyre and the red-banded super-soft compound tyre as the softer "option" tyre, just as they did the previous two years. It was the second time in the season that the super-soft compound was used at a race weekend, as was the case with the soft tyre compound.
EOT;
$words = array_count_values(str_word_count($txt, 1));
arsort($words);
var_dump($words);
$keywords = array("Monaco","Prix","2013","season","Formula","race","motor","street","Ferrari","Mercedes","Hamilton","Rosberg","Tyre");
// Goal: keep only the entries of $words that also appear in $keywords, then take the top 5.
You already have $words as an associative array, indexed by the word and with the count as the value, so we use array_flip() to make your $keywords array an associative array indexed by word as well. Then we can use array_intersect_key() to return only those entries from $words that have a matching index entry in our flipped $keywords array.
This gives a resulting $matchWords array, still keyed by the word, but containing only those entries from the original $words array that match $keywords; and still sorted by frequency.
We then simply use array_slice() to extract the first 5 entries from that array.
$matchWords = array_intersect_key(
$words,
array_flip($keywords)
);
$matchWords = array_slice($matchWords, 0, 5);
var_dump($matchWords);
gives
array(5) {
'race' =>
int(11)
'Monaco' =>
int(7)
'Mercedes' =>
int(5)
'Rosberg' =>
int(4)
'season' =>
int(4)
}
Caveat: You could have problems with case-sensitivity. "Race" !== "race", so the $words = array_count_values(str_word_count($txt, 1)); line will treat these as two different words.
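One way around the case problem is to lower-case both the text and the keywords before counting and matching. The sketch below uses a short made-up $txt rather than the article text, so the counts are not those of the question. Also note that str_word_count() does not treat digits as word characters by default, so a keyword like "2013" will not appear in $words unless you pass a character list as its third argument.

```php
<?php
// Short hypothetical sample text, not the text from the question.
$txt = 'Race day: the race at Monaco. Monaco hosts the Race; Mercedes won the race.';
$keywords = ['Monaco', 'race', 'Mercedes', 'Ferrari'];

// Count words case-insensitively by normalising to lower case first.
$words = array_count_values(str_word_count(strtolower($txt), 1));
arsort($words);

// Normalise the keywords the same way before flipping them into keys.
$lookup = array_flip(array_map('strtolower', $keywords));

$top = array_slice(array_intersect_key($words, $lookup), 0, 5, true);
print_r($top);
```

The trade-off is that the result keys are lower-cased, so "Monaco" is reported as "monaco".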
Good morning -
I'm interested in seeing an efficient way of parsing the values of a hierarchical text file (i.e., one that has a Title => Multiple Headings => Multiple Subheadings => Multiple Keys => Multiple Values) into a simple XML document. For the sake of simplicity, the answer would be written using:
Regex (preferably in PHP)
or, PHP code (e.g., if looping were more efficient)
Here's an example of an Inventory file I'm working with. Note that Header = FOODS, Sub-Header = Type (A, B...), Keys = PRODUCT (or CODE, etc.) and Values may have one or more lines.
**FOODS - TYPE A**
___________________________________
**PRODUCT**
1) Mi Pueblito Queso Fresco Authentic Mexican Style Fresh Cheese;
2) La Fe String Cheese
**CODE**
Sell by date going back to February 1, 2009
**MANUFACTURER**
Quesos Mi Pueblito, LLC, Passaic, NJ.
**VOLUME OF UNITS**
11,000 boxes
**DISTRIBUTION**
NJ, NY, DE, MD, CT, VA
___________________________________
**PRODUCT**
1) Peanut Brittle No Sugar Added;
2) Peanut Brittle Small Grind;
3) Homestyle Peanut Brittle Nuggets/Coconut Oil Coating
**CODE**
1) Lots 7109 - 8350 inclusive;
2) Lots 8198 - 8330 inclusive;
3) Lots 7075 - 9012 inclusive;
4) Lots 7100 - 8057 inclusive;
5) Lots 7152 - 8364 inclusive
**MANUFACTURER**
Star Kay White, Inc., Congers, NY.
**VOLUME OF UNITS**
5,749 units
**DISTRIBUTION**
NY, NJ, MA, PA, OH, FL, TX, UT, CA, IA, NV, MO and IN
**FOODS - TYPE B**
___________________________________
**PRODUCT**
Cool River Bebidas Naturales - West Indian Cherry Fruit Acerola 16% Juice;
**CODE**
990-10/2 10/5
**MANUFACTURER**
San Mar Manufacturing Corp., Catano, PR.
**VOLUME OF UNITS**
384
**DISTRIBUTION**
PR
And here's the desired output (please excuse any XML syntactical errors):
<foods>
<food type = "A" >
<product>Mi Pueblito Queso Fresco Authentic Mexican Style Fresh Cheese</product>
<product>La Fe String Cheese</product>
<code>Sell by date going back to February 1, 2009</code>
<manufacturer>Quesos Mi Pueblito, LLC, Passaic, NJ.</manufacturer>
<volume>11,000 boxes</volume>
<distribution>NJ, NY, DE, MD, CT, VA</distribution>
</food>
<food type = "A" >
<product>Peanut Brittle No Sugar Added</product>
<product>Peanut Brittle Small Grind</product>
<product>Homestyle Peanut Brittle Nuggets/Coconut Oil Coating</product>
<code>Lots 7109 - 8350 inclusive</code>
<code>Lots 8198 - 8330 inclusive</code>
<code>Lots 7075 - 9012 inclusive</code>
<code>Lots 7100 - 8057 inclusive</code>
<code>Lots 7152 - 8364 inclusive</code>
<manufacturer>Star Kay White, Inc., Congers, NY.</manufacturer>
<volume>5,749 units</volume>
<distribution>NY, NJ, MA, PA, OH, FL, TX, UT, CA, IA, NV, MO and IN</distribution>
</food>
<food type = "B" >
<product>Cool River Bebidas Naturales - West Indian Cherry Fruit Acerola 16% Juice</product>
<code>990-10/2 10/5</code>
<manufacturer>San Mar Manufacturing Corp., Catano, PR</manufacturer>
<volume>384</volume>
<distribution>PR</distribution>
</food>
</foods>
<!-- and so forth -->
So far, my approach (which might be quite inefficient with a huge text file) would be one of the following:
Loops and multiple Select/Case statements, where the file is loaded into a string buffer, and while looping through each line, see if it matches one of the header/subheader/key lines, append the appropriate xml tag to a xml string variable, and then add the child nodes to the xml based on IF statements regarding which key name is most recent (which seems time-consuming and error-prone, esp. if the text changes even slightly) -- OR
Use REGEX (Regular Expressions) to find and replace key fields with appropriate xml tags, clean it up with an xml library, and export the xml file. Problem is, I barely use regular expressions, so I'd need some example-based help.
Any help or advice would be appreciated.
Thanks.
An example you can use as a starting point. At least I hope it gives you an idea...
<?php
define('TYPE_HEADER', 1);
define('TYPE_KEY', 2);
define('TYPE_DELIMETER', 3);
define('TYPE_VALUE', 4);
$datafile = 'data.txt';
$fp = fopen($datafile, 'rb') or die('!fopen');
// stores (the first) {header} in 'name' and the root simplexmlelement in 'element'
$container = array('name'=>null, 'element'=>null);
// stores the name for each item element, the value for the type attribute for subsequent item elements and the simplexmlelement of the current item element
$item = array('name'=>null, 'type'=>null, 'current_element'=>null);
// the last **key** encountered, used to create new child elements in the current item element when a value is encountered
$key = null;
while ( false!==($t=getstruct($fp)) ) {
switch( $t[0] ) {
case TYPE_HEADER:
if ( is_null($container['element']) ) {
// this is the first time we hit **header - subheader**
$container['name'] = $t[1][0];
// ugly hack, < . name . />
$container['element'] = new SimpleXMLElement('<'.$container['name'].'/>');
// each subsequent new item gets the new subheader as type attribute
$item['type'] = $t[1][1];
// dummy implementation: "deducting" the item names from header/container[name]
$item['name'] = substr($t[1][0], 0, -1);
}
else {
// hitting **header - subheader** the (second, third, nth) time
/*
header must be the same as the first time (stored in container['name']).
Otherwise you need another container element since
xml documents can only have one root element
*/
if ( $container['name'] !== $t[1][0] ) {
echo $container['name'], "!==", $t[1][0], "\n";
die('format error');
}
else {
// subheader may have changed, store it for future item elements
$item['type'] = $t[1][1];
}
}
break;
case TYPE_DELIMETER:
assert( !is_null($container['element']) );
assert( !is_null($item['name']) );
assert( !is_null($item['type']) );
/* that's maybe not a wise choice.
You might want to check the complete item before appending it to the document.
But the example is a hack anyway ...so create a new item element and append it to the container right away
*/
$item['current_element'] = $container['element']->addChild($item['name']);
// set the type-attribute according to the last **header - subheader** encountered
$item['current_element']['type'] = $item['type'];
break;
case TYPE_KEY:
$key = $t[1][0];
break;
case TYPE_VALUE:
assert( !is_null($item['current_element']) );
assert( !is_null($key) );
// this is a value belonging to the "last" key encountered
// create a new "key" element with the value as content
// and add it to the current item element
$tmp = $item['current_element']->addChild($key, $t[1][0]);
break;
default:
die('unknown token');
}
}
if ( !is_null($container['element']) ) {
$doc = dom_import_simplexml($container['element']);
$doc = $doc->ownerDocument;
$doc->formatOutput = true;
echo $doc->saveXML();
}
die;
/*
Take a look at gettoken() at http://www.tuxradar.com/practicalphp/21/5/6
It breaks the stream into much simpler pieces.
In the next step the parser would "combine" or structure the simple tokens into more complex things.
This function does both....
#return array(id, array(parameter)
*/
function getstruct($fp) {
if ( feof($fp) ) {
return false;
}
// shortcut: all we care about "happens" on one line
// so let php read one line in a single step and then do the pattern matching
$line = trim(fgets($fp));
// this matches **key** and **header - subheader**
if ( preg_match('#^\*\*([^-]+)(?:-(.*))?\*\*$#', $line, $m) ) {
// only for **header - subheader** $m[2] is set.
if ( isset($m[2]) ) {
return array(TYPE_HEADER, array(trim($m[1]), trim($m[2])));
}
else {
return array(TYPE_KEY, array($m[1]));
}
}
// this matches _____________ and means "new item"
else if ( preg_match('#^_+$#', $line, $m) ) {
return array(TYPE_DELIMETER, array());
}
// any other non-empty line is a single value
else if ( preg_match('#\S#', $line) ) {
// you might want to filter the 1),2),3) part out here
// could also be two different token types
return array(TYPE_VALUE, array($line));
}
else {
// skip empty lines, would be nicer with tail-recursion...
return getstruct($fp);
}
}
prints
<?xml version="1.0"?>
<FOODS>
<FOOD type="TYPE A">
<PRODUCT>1) Mi Pueblito Queso Fresco Authentic Mexican Style Fresh Cheese;</PRODUCT>
<PRODUCT>2) La Fe String Cheese</PRODUCT>
<CODE>Sell by date going back to February 1, 2009</CODE>
<MANUFACTURER>Quesos Mi Pueblito, LLC, Passaic, NJ.</MANUFACTURER>
<VOLUME OF UNITS>11,000 boxes</VOLUME OF UNITS>
<DISTRIBUTION>NJ, NY, DE, MD, CT, VA</DISTRIBUTION>
</FOOD>
<FOOD type="TYPE A">
<PRODUCT>1) Peanut Brittle No Sugar Added;</PRODUCT>
<PRODUCT>2) Peanut Brittle Small Grind;</PRODUCT>
<PRODUCT>3) Homestyle Peanut Brittle Nuggets/Coconut Oil Coating</PRODUCT>
<CODE>1) Lots 7109 - 8350 inclusive;</CODE>
<CODE>2) Lots 8198 - 8330 inclusive;</CODE>
<CODE>3) Lots 7075 - 9012 inclusive;</CODE>
<CODE>4) Lots 7100 - 8057 inclusive;</CODE>
<CODE>5) Lots 7152 - 8364 inclusive</CODE>
<MANUFACTURER>Star Kay White, Inc., Congers, NY.</MANUFACTURER>
<VOLUME OF UNITS>5,749 units</VOLUME OF UNITS>
<DISTRIBUTION>NY, NJ, MA, PA, OH, FL, TX, UT, CA, IA, NV, MO and IN</DISTRIBUTION>
</FOOD>
<FOOD type="TYPE B">
<PRODUCT>Cool River Bebidas Naturales - West Indian Cherry Fruit Acerola 16% Juice;</PRODUCT>
<CODE>990-10/2 10/5</CODE>
<MANUFACTURER>San Mar Manufacturing Corp., Catano, PR.</MANUFACTURER>
<VOLUME OF UNITS>384</VOLUME OF UNITS>
<DISTRIBUTION>PR</DISTRIBUTION>
</FOOD>
</FOODS>
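The printed document still has two rough edges: element names with spaces such as <VOLUME OF UNITS>, which are not well-formed XML, and values that keep their 1) counters and trailing semicolons. A rough sketch of two helpers that could be applied to the key and the value before they are passed to addChild() follows. xmlName() and cleanValue() are hypothetical names and not part of the code above, and the underscore naming is just one possible convention.

```php
<?php
// Hypothetical helper: turn a key such as "VOLUME OF UNITS" into a usable element name.
function xmlName(string $key): string
{
    // Lower-case, then replace every run of other characters with one underscore.
    $name = preg_replace('/[^A-Za-z0-9_]+/', '_', strtolower(trim($key)));
    // Element names must not start with a digit (or be empty).
    return preg_match('/^[a-z_]/', $name) ? $name : '_' . $name;
}

// Hypothetical helper: drop a leading "1) " style counter and a trailing semicolon.
function cleanValue(string $line): string
{
    $line = preg_replace('/^\d+\)\s*/', '', trim($line));
    return rtrim($line, "; \t");
}

echo xmlName('VOLUME OF UNITS'), "\n";
echo cleanValue('2) Peanut Brittle Small Grind;'), "\n";
echo cleanValue('990-10/2 10/5'), "\n";
```

Values that merely start with digits, like the last one, are left alone because the pattern requires a closing parenthesis right after the number.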
Unfortunately the status of the php module for ANTLR currently is "Runtime is in alpha status." but it might be worth a try anyway...
See: http://www.tuxradar.com/practicalphp/21/5/6
This tells you how to parse a text file into tokens using PHP. Once parsed you can place it into anything you want.
You need to search for specific tokens in the file based on your criteria:
for example:
PRODUCT
This gives you the XML Tag
Then 1) can have special meaning
1) Peanut Brittle...
This tells you what to put in the XML tag.
I do not know if this is the most efficient way to accomplish your task, but it is the way a compiler would parse a file and it has the potential to be very accurate.
Instead of regex or PHP, use the XSLT 2.0 unparsed-text() function to read the file (see http://www.biglist.com/lists/xsl-list/archives/200508/msg00085.html).
Another Hint for an XSLT 1.0 Solution is here: http://bytes.com/topic/net/answers/808619-read-plain-file-xslt-1-0-a