Regex to isolate specific word in body of text until delimiter - php

I have the following output, from which I would like to isolate the substring that starts with the word "industry" (in any case) and runs until the next delimiter, typically "|". I get $output from an API, so its contents are always different, but generically it may look something like: blah blah blah |industry = industry info| blah blah blah. If the word industry exists in the output, I would just like to get industry = industry info. Is there a generic regex which can do this? The specific output I have returned is:
<?php
$output = '{{other uses|UBS (disambiguation)}} {{Use dmy dates|date=April
2015}} {{Infobox company |name = UBS Group AG |logo = [[File:UBS
Logo.svg|200px|UBS Group AG Logo]] |type = [[Aktiengesellschaft]]
([[Aktiengesellschaft|AG]])
[[Public company]] |traded_as = {{SWX|UBSG}} {{SWX|UBSN}}
{{nyse|UBS}} |foundation=1854 |predecessor = [[Union Bank of
Switzerland]] and [[Swiss Bank Corporation]] merged in 1998;
[[PaineWebber]] merged in 2000 |location = [[Zürich]]
[[Basel]] |key_people = [[Axel A. Weber]] (Chairman){{br}}[[Sergio
Ermotti]] (CEO) {{br}} |area_served = Worldwide |industry =[[Banking]],
[[Financial services]] |products = [[Investment Banking]]
[[Investment Management]] [[Wealth Management]] [[Private Banking]]
[[Commercial Bank|Corporate Banking]]
[[Private Equity]]
[[Finance and Insurance]]
[[Retail Banking|Consumer Banking]]
[[Mortgage loans|Mortgages]]
[[Credit Cards]] |revenue = {{Increase}} [[Swiss franc|CHF]]28.027
billion (2014) |operating_income = {{Decrease}} CHF2.461 billion (2014)
{{cite web|title=UBS Annual Report
2014|url=http://www.ubs.com/global/en/about_ubs/
investor_relations/annualreporting/2014/_jcr_content/par/
columncontrol_0/col1/linklist/link.1899571414.file/
bGluay9wYXRoPS9jb250ZW50L2RhbS9zdGF0aWMvZ2xvYmFsL2ludmV
zdG9yX3JlbGF0aW9ucy9hbm51YWwyMDE0L2FubnVhbC1yZXBv
cnQtZ3JvdXAtMjAxNC1lbi5wZGY=/annual-report-group-2014-
en.pdf|publisher=UBS.com|accessdate=May 3, 2015}}
|assets = {{Increase}} CHF1.062 trillion (2014) |equity = {{Increase}}
CHF54.368 billion (2014) |num_employees = {{Decrease}} 60,155 (2014)
|caption=We Will Not Rest |homepage = [https://www.ubs.com/ UBS.com] }}
'''UBS AG''' is a Swiss global [[financial services]] company,
incorporated in the [[Canton of Zurich]],{{cite web|title=Trade Register:
UBS AG|url=http://www.moneyhouse.ch/en/u/ubs_ag_CH-270.3.004.646-4.htm}}
and co-headquartered in [[Zürich]] and [[Basel]].{{cite
web|url=https://www.ubs.com/global/en/about_ubs/
investor_relations/faq/about.html|title=Corporate information - UBS
Global topics|work=ubs.com|accessdate=March 29, 2015}} The company
provides [[investment banking]], [[asset management]], and [[wealth
management]] services for private, corporate, and institutional clients
worldwide, and for retail clients in Switzerland as well.{{cite
web|url=https://www.ubs.com/global/en/about_ubs/
investor_relations/our_businesses.html|title=Our clients & businesses -
UBS Global topics|work=ubs.com|
accessdate=March 29, 2015}} The name ''UBS'' was originally an
abbreviation for the [[Union Bank of Switzerland]], but it ceased to be a
representational abbreviation after the bank's merger with [[Swiss Bank
Corporation]] in 1998. The company traces its origins to 1856, when the
earliest of its predecessor banks was founded.{{cite web|title=150 years
of banking tradition|url=https://www.ubs.com/global/en/about_ubs/
about_us/history/_jcr_content/rightpar/
teaser_0/linklist/link.651908116.file/
bGluay9wYXRoPS9jb250ZW50L2RhbS91YnMvZ2xvY
mFsL2Fib3V0X3Vicy9hYm91dF91cy9oaXN0b3J5X29mX3
Vicy8xNTBfeWVhcnNfb2ZfYmFua2luZ19FTkcucGRm/
150_years_of_banking_ENG.pdf|work=ubs.com|
accessdate=March 29, 2015}} UBS is the biggest
bank in Switzerland, operating in more than 50
countries with about 60,000 employees around the world, as of 2014.{{cite
web|title=About us: UBS in a few
words|url=https://www.ubs.com/global/en/
about_ubs/about_us/ourprofile.html|work=ubs.com}} It is considered the
world's largest manager of private wealth assets, with over [[Swiss
franc|CHF]]2.2 trillion in invested assets,J.P.Morgan Cazenove Europe';
?>

[^|]*\bindustry\b[^|]*
Try this. See the demo. Use the i flag.
https://regex101.com/r/uF4oY4/79
This will match the text between one | and the next | that contains the word industry.
$re = "/[^|]*\\bindustry\\b[^|]*/i";
$str = "{{other uses|UBS (disambiguation)}} {{Use dmy dates|date=April \n2015}} {{Infobox company |name = UBS Group AG |logo = [[File:UBS \nLogo.svg|200px|UBS Group AG Logo]] |type = [[Aktiengesellschaft]] \n([[Aktiengesellschaft|AG]])\n[[Public company]] |traded_as = {{SWX|UBSG}} {{SWX|UBSN}}\n{{nyse|UBS}} |foundation=1854 |predecessor = [[Union Bank of \nSwitzerland]] and [[Swiss Bank Corporation]] merged in 1998; \n[[PaineWebber]] merged in 2000 |location = [[Zürich]]\n[[Basel]] |key_people = [[Axel A. Weber]] (Chairman){{br}}[[Sergio \nErmotti]] (CEO) {{br}} |area_served = Worldwide |industry =[[Banking]], \n[[Financial services]] |products = [[Investment Banking]]\n[[Investment Management]] [[Wealth Management]] [[Private Banking]]\n[[Commercial Bank|Corporate Banking]]\n[[Private Equity]]\n[[Finance and Insurance]]\n[[Retail Banking|Consumer Banking]]\n[[Mortgage loans|Mortgages]]\n[[Credit Cards]] |revenue = {{Increase}} [[Swiss franc|CHF]]28.027 \nbillion (2014) |operating_income = {{Decrease}} CHF2.461 billion (2014) \n{{cite web|title=UBS Annual Report \n2014|url=http://www.ubs.com/global/en/about_ubs/\ninvestor_relations/annualreporting/2014/_jcr_content/par/\ncolumncontrol_0/col1/linklist/link.1899571414.file/\nbGluay9wYXRoPS9jb250ZW50L2RhbS9zdGF0aWMvZ2xvYmFsL2ludmV\nzdG9yX3JlbGF0aW9ucy9hbm51YWwyMDE0L2FubnVhbC1yZXBv\ncnQtZ3JvdXAtMjAxNC1lbi5wZGY=/annual-report-group-2014- \nen.pdf|publisher=UBS.com|accessdate=May 3, 2015}} \n|assets = {{Increase}} CHF1.062 trillion (2014) |equity = {{Increase}} \nCHF54.368 billion (2014) |num_employees = {{Decrease}} 60,155 (2014) \n|caption=We Will Not Rest |homepage = [https://www.ubs.com/ UBS.com] }} \n'''UBS AG''' is a Swiss global [[financial services]] company, \nincorporated in the [[Canton of Zurich]],{{cite web|title=Trade Register: \nUBS AG|url=http://www.moneyhouse.ch/en/u/ubs_ag_CH-270.3.004.646-4.htm}} \nand co-headquartered in [[Zürich]] and [[Basel]].{{cite 
\nweb|url=https://www.ubs.com/global/en/about_ubs/\ninvestor_relations/faq/about.html|title=Corporate information - UBS \nGlobal topics|work=ubs.com|accessdate=March 29, 2015}} The company \nprovides [[investment banking]], [[asset management]], and [[wealth \nmanagement]] services for private, corporate, and institutional clients \nworldwide, and for retail clients in Switzerland as well.{{cite \nweb|url=https://www.ubs.com/global/en/about_ubs/\ninvestor_relations/our_businesses.html|title=Our clients & businesses - \nUBS Global topics|work=ubs.com|\naccessdate=March 29, 2015}} The name ''UBS'' was originally an \nabbreviation for the [[Union Bank of Switzerland]], but it ceased to be a \nrepresentational abbreviation after the bank's merger with [[Swiss Bank \nCorporation]] in 1998. The company traces its origins to 1856, when the \nearliest of its predecessor banks was founded.{{cite web|title=150 years \nof banking tradition|url=https://www.ubs.com/global/en/about_ubs/\nabout_us/history/_jcr_content/rightpar/\nteaser_0/linklist/link.651908116.file/\nbGluay9wYXRoPS9jb250ZW50L2RhbS91YnMvZ2xvY\nmFsL2Fib3V0X3Vicy9hYm91dF91cy9oaXN0b3J5X29mX3\nVicy8xNTBfeWVhcnNfb2ZfYmFua2luZ19FTkcucGRm/\n150_years_of_banking_ENG.pdf|work=ubs.com|\naccessdate=March 29, 2015}} UBS is the biggest\nbank in Switzerland, operating in more than 50 \ncountries with about 60,000 employees around the world, as of 2014.{{cite \nweb|title=About us: UBS in a few \nwords|url=https://www.ubs.com/global/en/\nabout_ubs/about_us/ourprofile.html|work=ubs.com}} It is considered the \nworld's largest manager of private wealth assets, with over [[Swiss \nfranc|CHF]]2.2 trillion in invested assets,J.P.Morgan Cazenove Europ";
preg_match_all($re, $str, $matches);
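If only the field itself is wanted, the pattern can also be anchored on the field name. The following is a minimal sketch, not the answerer's code: extractField() is a hypothetical helper name, and it assumes the value itself contains no | (a wiki link such as [[Commercial Bank|Corporate Banking]] inside the value would cut the match short).

```php
<?php
// Hypothetical helper: return "name = value" for the first "|name = ..." field, or null.
function extractField(string $text, string $name): ?string
{
    // (?:^|\|) : the field starts at the beginning of the text or right after a "|"
    // [^|]*    : the value runs up to (not including) the next "|"
    // i flag   : match the field name in any case
    $pattern = '/(?:^|\|)\s*(' . preg_quote($name, '/') . '\s*=[^|]*)/i';
    if (preg_match($pattern, $text, $m)) {
        return trim($m[1]);
    }
    return null;
}

echo extractField('blah blah blah |industry = industry info| blah blah blah', 'industry');
```

For the generic example from the question this prints industry = industry info; when the field is missing the function returns null.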

I would not apply a regex on a large input string like yours. As you can see in the regex debugger, vks' regex makes about 340,000 steps to finally fetch you a result.
I suggest splitting the string with | first, and then grepping out the info you need.
$chks = explode("|", $output);
foreach ($chks as $chk) {
    if (strpos($chk,'industry =') !== false) {
        echo $chk;
    }
}
Result:
industry =[[Banking]],
[[Financial services]]
See IDEONE demo
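The loop above matches only the exact, lowercase text 'industry ='. Since the question asks for any case, a variant using stripos() is sketched below. findChunk() is a hypothetical name, and the check assumes the field name sits at the very start of its chunk.

```php
<?php
// Hypothetical helper: first "|"-separated chunk that starts with $key (any case), or null.
function findChunk(string $output, string $key): ?string
{
    foreach (explode('|', $output) as $chunk) {
        $chunk = trim($chunk);
        // stripos() === 0 means the chunk begins with $key, ignoring case
        if (stripos($chunk, $key) === 0) {
            return $chunk;
        }
    }
    return null;
}

echo findChunk('blah |Industry =[[Banking]], [[Financial services]] |products = x', 'industry');
```

Because it tests only the start of each chunk, a chunk that merely mentions the word industry somewhere in its value is not returned.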

Related

Save/Alphabetize XML Responses in a Foreach Array

I'm new to php and am doing this project to teach myself a bit.
I'm importing XML data from a Reuters RSS feed and would like to sort the content of all the responses alphabetically. I've had no problem loading the information I want to the page using a foreach loop, however the sorting system I'm using alphabetizes the words in each xml title individually, as opposed to together as one string.
How can I group or save all the responses together in order to sort them as a whole once they've been collected by the foreach loop?
Here's what I have so far:
<?php
function getFeed($feed_url) {
    $content = file_get_contents($feed_url);
    $x = new SimpleXmlElement($content);
    $string = $x->channel->item ;
    echo "<p>";
    foreach($x->channel->item as $entry) {
        $string = $entry->title;
        $split=explode(" ", $string);
        sort($split); // sorts the elements
        echo implode(" ", $split); //combine and print the elements
    }
    echo "</p>";
}?>
What you are wanting to do is build an array with all the words and then sort it at the end.
<?php
function getFeed($feed_url) {
    $content = file_get_contents($feed_url);
    $x = new SimpleXmlElement($content);
    $titles = "";
    foreach($x->channel->item as $entry) {
        $titles .= " $entry->title";
    }
    $split = explode(" ", $titles);
    sort($split, SORT_FLAG_CASE | SORT_NATURAL);
    return trim(implode(" ", $split));
}
$words = getFeed("http://feeds.reuters.com/Reuters/PoliticsNews");
echo "<p>$words</p>";
I didn't remove non-word characters, so things like quotes will mess with the sorting.
Output:
<p>'sanctuary' abuse after announcement, areas as battle California, cards, case crises cut detention drill Egyptian Egyptian-American Exxon financial Florida for freed from funding future give green greets healthcare Homeland hope in in in in Indian looms not of officials orders orders other permission plan possible prevent probe probes racial reboot reform resigns revamped review review review rule rules Russia Security see seeks senator sets slur state summons tax tax testify threatens to to to to Top Trump Trump Trump Trump Trump Trump-Russia Twitter U.S. U.S. U.S. U.S. Uphill visa-holders Waiting Washington will</p>
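The quote problem mentioned above can be reduced by trimming punctuation from each word before sorting. This is a rough sketch under assumptions: sortTitleWords() is a made-up name, the titles are passed in as plain strings, and the list of punctuation characters to trim is a guess rather than a complete one.

```php
<?php
// Hypothetical helper: collect the words of all titles, trim punctuation, sort as a whole.
function sortTitleWords(array $titles): string
{
    $words = [];
    foreach ($titles as $title) {
        // split on any run of whitespace and skip empty pieces
        foreach (preg_split('/\s+/', (string) $title, -1, PREG_SPLIT_NO_EMPTY) as $word) {
            // strip leading/trailing quotes and punctuation; inner characters are kept
            $word = trim($word, "'\",.:;!?()");
            if ($word !== '') {
                $words[] = $word;
            }
        }
    }
    sort($words, SORT_STRING | SORT_FLAG_CASE); // case-insensitive sort
    return implode(' ', $words);
}

echo sortTitleWords(["'sanctuary' cities", 'Trump orders review']);
```

With the two sample titles this prints cities orders review sanctuary Trump. Note that trimming also removes the final dot of abbreviations such as U.S., which may or may not be wanted.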
Consider saving the titles to an array, sorting the values, and then iterating over them for the echo output:
$content = file_get_contents("http://feeds.reuters.com/Reuters/PoliticsNews");
$x = new SimpleXmlElement($content);
$titles = [];
foreach($x->channel->item as $entry) {
    // cast to string so the array holds plain text rather than SimpleXMLElement objects
    $titles[] = (string) $entry->title;
}
sort($titles, SORT_NATURAL | SORT_FLAG_CASE); # CASE INSENSITIVE SORT
foreach($titles as $t) {
    echo "<p>". $t ."</p>";
}
# <p>Ex-Illinois Governor Blagojevich's 14-year prison term upheld</p>
# <p>Exxon probe is unconstitutional, Republican prosecutors say</p>
# <p>Group sues Trump for repealing U.S. wildlife rule in rare legal challenge</p>
# <p>Indian techies, IT firms fret as Trump orders U.S. visa review</p>
# <p>Trump, Republicans face tricky task of averting U.S. government shutdown</p>
# <p>Trump administration may change rules that allow terror victims to immigrate to U.S.</p>
# <p>U.S. House committee sets more hearings in Trump-Russia probe</p>
# <p>U.S. judicial panel finds Texas hurt Latino vote with redrawn boundaries</p>
# <p>U.S. retailers bet on Congress over Bolivia to thwart Trump border tax issue</p>
# <p>U.S. Treasury's Mnuchin: Trump to order reviews of financial rules</p>

JSON string shows blank: why is it not getting decoded?

I have a JSON string, but when I pass it to json_decode() the result is blank.
$str = '[{"actcode":"Auck4","actname":"Sky Tower","date":"","time":"","timeduration":"","adult":"0","adultprice":"28","child":"0","childprice":"0","description":"Discover the best of Auckland in half a day. Soak up spectacular sights on this scenic tour, from heritage-listed buildings on Queen Street to the stunning Viaduct Harbour and panoramic vistas from the Sky Tower observation deck.
Start your tour with a hotel pick-up and travel through Auckland?s dynamic Central Business District. Travel across the iconic Auckland Harbour Bridge and admire stunning city views. Then, return to the city centre and visit the vibrant precinct of Wynyard Quarter. Here, wander among the sculptures and enjoy the happenings on the water of Viaduct Harbour.
Continue to Queen Street, also known as the ?Golden Mile? of Aucklands business and shopping district. Marvel at historic buildings like the Ferry Terminal building before visiting the Auckland Museum. Here, explore fascinating exhibits paying tribute to New Zealands natural, Maori and European histories. Afterwards, travel along Aucklands most expensive residential streets with fantastic views of the Waitemata Harbour and its islands.
Your tour ends at Sky Tower, the tallest free-standing structure in the Southern Hemisphere. Take in breathtaking 360-degree views of the city and its surroundings. In the afternoon, continue your own exploration of Auckland."}]';
I tried the code below:
$array = json_decode($str,true);
echo print_r($array);
and this one too:
$str1 = trim($str);
$array = json_decode($str1,true);
echo print_r($array);
but the result is still blank.
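Try this one. Raw line breaks are not allowed inside a JSON string value, so remove them before decoding. (mysql_real_escape_string() is not needed for this; it requires a database connection and was removed in PHP 7.)
$findsym = array("\r", "\n");
$removesym = array("", "");
$strdone = str_replace($findsym, $removesym, strip_tags($str));
$jsonarray = json_decode($strdone, true);
echo "<pre>"; print_r($jsonarray);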
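To see why the decode fails, json_last_error_msg() can be checked right after json_decode(). The sketch below also shows an alternative to deleting the line breaks: turning each raw line break into the JSON escape \n, so the description keeps its paragraphs. decodeWithLineBreaks() is a hypothetical name, the sample string is a shortened stand-in for the one in the question, and the approach assumes the raw line breaks occur only inside string values.

```php
<?php
// Hypothetical helper: escape raw line breaks, then decode.
function decodeWithLineBreaks(string $json)
{
    // '\\n' is a backslash followed by "n", i.e. the JSON escape for a line feed
    $escaped = str_replace(["\r\n", "\r", "\n"], '\\n', $json);
    return json_decode($escaped, true);
}

// shortened stand-in for the string in the question, with one raw line break
$str = "[{\"actname\":\"Sky Tower\",\"description\":\"First paragraph.\nSecond paragraph.\"}]";

var_dump(json_decode($str, true));   // NULL, because of the raw line break
echo json_last_error_msg(), "\n";    // reports why the decode failed
print_r(decodeWithLineBreaks($str));
```

If the JSON also had line breaks between tokens (pretty-printed JSON), this replacement would break it, so it only fits input shaped like the question's.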

How to import data from a text file without any delimiter or separator?

I have to import data into MySQL from a text file using PHP code. It sounds easy, and I have already done it before with CSV files, Excel files, and text files where the data is separated by a delimiter. But in my current case there is no delimiter, just two spaces and a fixed field length. For example:
table field
|------|-----------|-----------|-------------|
| id(8)| name(50) | state(15) | category(10)|
|------|-----------|-----------|-------------|
| | | | |
sample data of upload.txt file-
::format::
ID        NAME                                                ADDRESS          CATEGORY
10719922  Union Bank of India                                 delhi            normal
10719956  State Bank of India                                 mumbai           normal
10719522  HDFC Bank                                           gujrat           high
10759924  ICICI Bank                                          goa              normal
10759924 ICICI Bank goa normal
Now you can understand the data format of the text file: field length + two spaces, field length + two spaces, and so on. The problem is that if a value is shorter than the field size, it is padded with spaces up to the full field length, which is why two spaces cannot be used as a delimiter. Take the first row: the id has 8 digits, followed by two spaces; the name field has length 50 but the value has only 19 characters, so 31 spaces complete the length of 50, then come two spaces and the next field. So I have no delimiter or syntax (other than length + 2 spaces) to identify a single field's data. I am very confused about how to import this data into MySQL using a PHP script. Does anyone think it can be done? I need some idea or PHP code to handle this situation. Thank you.
It shouldn't be more difficult than this:
<?php
$input = <<<END
10719922  Union Bank of India                                 delhi            normal
10719956  State Bank of India                                 mumbai           normal
10719522  HDFC Bank                                           gujrat           high
10759924  ICICI Bank                                          goa              normal
END;
$def = array(
    "id" => 8,
    "name" => 50,
    "state" => 15,
    "category" => 10
);
foreach (explode(PHP_EOL, $input) as $line) {
    foreach ($def as $field => $length) {
        $value = substr($line, 0, $length + 2);
        $line = substr($line, $length + 2);
        print $field.' = '.trim($value).PHP_EOL;
    }
    print '----------------------------------------'.PHP_EOL;
}
?>
The basic idea is to create a format definition in the $def hash, and then process every line according to that definition.
Executing this code will yield the output below. Change the actual implementation to fit your needs.
id = 10719922
name = Union Bank of India
state = delhi
category = normal
----------------------------------------
id = 10719956
name = State Bank of India
state = mumbai
category = normal
----------------------------------------
id = 10719522
name = HDFC Bank
state = gujrat
category = high
----------------------------------------
id = 10759924
name = ICICI Bank
state = goa
category = normal
----------------------------------------
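For loading into MySQL it is usually handier to get each line back as an array than to print it. A minimal sketch of that variation follows; parseFixedWidthLine() and its $gap parameter are names made up here, and the sample line is built with str_pad() only to make the padding visible.

```php
<?php
// Hypothetical helper: cut one fixed-width line into an associative array.
// $def maps field name => field length; $gap is the number of filler spaces between fields.
function parseFixedWidthLine(string $line, array $def, int $gap = 2): array
{
    $row = [];
    $offset = 0;
    foreach ($def as $field => $length) {
        // the cast covers PHP 7, where substr() returns false past the end of the string
        $row[$field] = trim((string) substr($line, $offset, $length));
        $offset += $length + $gap;
    }
    return $row;
}

$def = ['id' => 8, 'name' => 50, 'state' => 15, 'category' => 10];
$line = str_pad('10719522', 8) . '  ' . str_pad('HDFC Bank', 50) . '  '
      . str_pad('gujrat', 15) . '  ' . 'high';

print_r(parseFixedWidthLine($line, $def));
```

Each returned row could then be bound to a prepared INSERT statement, one parameter per field.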
You could use the preg_split() function and split each line on runs of two or more spaces:
$line = '10719922  Union Bank of India                                 delhi            normal';
$m = preg_split('~(\h{2,})~', $line);
print_r($m);
demo

How to create csv file using raw text data

I am quite confused about creating the CSV file and inserting the data into the database.
Suppose I have the text data below. The full set is 45,000 records; I am posting a few of them.
Winged Wheels in France, by Michael Myers Shoemaker 45790
A Battle Fought on Snow Shoes, by Mary Cochrane Rogers 45789
The German Classics of the Nineteenth and Twentieth Centuries, 45788
Volume 11, by Friedrich Spielhagen, Theodor Storm,
Wilhelm Raabe, Marion D. Learned and Ewald Eiserhardt
[Subtitle: Masterpieces of German Literature
Translated Into English]
Zofloya ou le Maure, Tomes 1-4, by Charlotte Dacre 45787
[Subtitle: Histoire du XVe siècle]
[Language: French]
Their Majesties as I Knew Them, by Xavier Paoli 45786
[Subtitle: Personal Reminiscences of the
Kings and Queens of Europe]
[Translator: Alexander Teixeira de Mattos]
New York Times Current History: The European War, Vol. 8, 45785
Pt. 2, No. 1, July 1918, by Various
Gallery of Comicalities, by Robert Cruikshank, 45784
George Cruikshank and Robert Seymour
[Subtitle: Embracing Humorous Sketches]
Katri, by Emil Nervander 45783
[Subtitle: Kertomus 17 vuosi-sadasta]
[Language: Finnish]
The Little Brown Jug at Kildare, by Meredith Nicholson 45782
[Illustrator: James Montgomery Flagg]
Beaumont & Fletcher's Works (6 through 10), by Francis Beaumont 45781
and John Fletcher
[Subtitle: The Queen of Corinth; Bonduca; The Knight of the
Burning Pestle; Loves Pilgrimage; The Double Marriage]
Beaumont & Fletcher's Works (1 through 5), by Francis Beaumont 45780
and John Fletcher
[Subtitle: A Wife for a Month; The Lovers Progress;
The Pilgrim; The Captain; The Prophetess]
The Washington Historical Quarterly, Volume V, 1914, by Various 45779
[Editor: Edmond S. Meany]
Minstrelsy of the Scottish Border Volume III of 3, by Walter Scott 45778
[Subtitle: Consisting of Historical and Romantic Ballads,
Collected In the Southern Counties of Scotland; With
a Few Of Modern Date, Founded Upon Local Tradition.
In Three Volumes. Vol. III]
What I want is simply to insert Winged Wheels in France, by Michael Myers Shoemaker in one column and 45790 in the other column of the CSV. Then I will be able to add them to my database.
Moreover, for example,
The German Classics of the Nineteenth and Twentieth Centuries,
Volume 11, by Friedrich Spielhagen, Theodor Storm,
Wilhelm Raabe, Marion D. Learned and Ewald Eiserhardt
[Subtitle: Masterpieces of German Literature
Translated Into English]
I want to insert the above text in this way:
The German Classics of the Nineteenth and Twentieth Centuries,
Volume 11, by Friedrich Spielhagen, Theodor Storm,
Wilhelm Raabe, Marion D. Learned and Ewald Eiserhardt
that is, without this portion:
[Subtitle: Masterpieces of German Literature
Translated Into English]
The ", by" should also be omitted, so my new data would look like this. So I actually need three columns in the CSV.
1 | Winged Wheels in France | Michael Myers Shoemaker | 45790
2 | The German Classics of the Nineteenth and Twentieth Centuries,
Volume 11 | Friedrich Spielhagen, Theodor Storm,
Wilhelm Raabe, Marion D. Learned and Ewald Eiserhardt | 45788
Please help me get this into an Excel file and create a CSV from it.
Thank you all.
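Try something like this
(not tested):
$f = file_get_contents('yourtextfile.txt');
// remove the bracketed [Subtitle: ...] / [Language: ...] notes, including multi-line ones
$f = preg_replace('/\[(.*?)\]/s','',$f);
$f = str_replace(array("\n", "\r"), '', $f);
file_put_contents('temp.txt',$f);
$file = file('temp.txt');
// prepared statement, so quotes in a title cannot break the query
$stmt = mysqli_prepare($connection, "insert into table1(column1) values(?)");
foreach($file as $key => $line){
    // skip empty lines
    if(trim($line) !== '')
    {
        mysqli_stmt_bind_param($stmt, 's', $line);
        mysqli_stmt_execute($stmt);
    }
}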
edit
$f = file_get_contents('tt.txt');
$f = preg_replace('/\[(.*?)\]/s','',$f);
$keywords = preg_split("/[ ]{15}/", $f);
print_r(array_filter($keywords));
I'm still not clear whether those numbers are actually in your file or whether you only added them here.
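One way to get the columns asked for (title, author, number) is sketched below. It is not the answer's approach. It assumes, based on the sample only, that the first line of every record ends in the number, that continuation lines do not end in a number, and that the author follows the last ", by ". parseListing() is a hypothetical name, and the rows are written to php://output here; a file path could be used instead.

```php
<?php
// Hypothetical helper: turn the raw listing into rows of [title, author, number].
function parseListing(string $text): array
{
    // drop [Subtitle: ...], [Language: ...] and similar notes, even multi-line ones
    $text = preg_replace('/\[[^\]]*\]/', '', $text);

    $records = [];
    $current = null;
    foreach (preg_split('/\R/', $text) as $line) {
        $line = trim($line);
        if ($line === '') {
            continue;
        }
        if (preg_match('/^(.*\S)\s+(\d+)$/', $line, $m)) {
            // a line ending in a number starts a new record
            if ($current !== null) {
                $records[] = $current;
            }
            $current = ['text' => $m[1], 'number' => $m[2]];
        } elseif ($current !== null) {
            // continuation line: append to the record in progress
            $current['text'] .= ' ' . $line;
        }
    }
    if ($current !== null) {
        $records[] = $current;
    }

    $rows = [];
    foreach ($records as $r) {
        // split title and author at the last ", by " (5 characters long)
        $pos = strrpos($r['text'], ', by ');
        if ($pos === false) {
            $rows[] = [$r['text'], '', $r['number']];
        } else {
            $rows[] = [substr($r['text'], 0, $pos), substr($r['text'], $pos + 5), $r['number']];
        }
    }
    return $rows;
}

$sample = <<<TXT
Winged Wheels in France, by Michael Myers Shoemaker 45790
The German Classics of the Nineteenth and Twentieth Centuries, 45788
Volume 11, by Friedrich Spielhagen, Theodor Storm,
Wilhelm Raabe, Marion D. Learned and Ewald Eiserhardt
[Subtitle: Masterpieces of German Literature
Translated Into English]
TXT;

$out = fopen('php://output', 'w');
foreach (parseListing($sample) as $row) {
    // separator, enclosure and escape are passed explicitly
    fputcsv($out, $row, ',', '"', '\\');
}
fclose($out);
```

fputcsv() quotes fields that contain commas, so author lists such as "Friedrich Spielhagen, Theodor Storm, ..." stay in a single column.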

How best to use Regular Expressions to convert a Hierarchical Text File into XML?

Good morning -
I'm interested in seeing an efficient way of parsing the values of a hierarchical text file (i.e., one that has a Title => Multiple Headings => Multiple Subheadings => Multiple Keys => Multiple Values) into a simple XML document. For the sake of simplicity, the answer would be written using:
Regex (preferably in PHP)
or, PHP code (e.g., if looping were more efficient)
Here's an example of an Inventory file I'm working with. Note that Header = FOODS, Sub-Header = Type (A, B...), Keys = PRODUCT (or CODE, etc.), and Values may span one or more lines.
**FOODS - TYPE A**
___________________________________
**PRODUCT**
1) Mi Pueblito Queso Fresco Authentic Mexican Style Fresh Cheese;
2) La Fe String Cheese
**CODE**
Sell by date going back to February 1, 2009
**MANUFACTURER**
Quesos Mi Pueblito, LLC, Passaic, NJ.
**VOLUME OF UNITS**
11,000 boxes
**DISTRIBUTION**
NJ, NY, DE, MD, CT, VA
___________________________________
**PRODUCT**
1) Peanut Brittle No Sugar Added;
2) Peanut Brittle Small Grind;
3) Homestyle Peanut Brittle Nuggets/Coconut Oil Coating
**CODE**
1) Lots 7109 - 8350 inclusive;
2) Lots 8198 - 8330 inclusive;
3) Lots 7075 - 9012 inclusive;
4) Lots 7100 - 8057 inclusive;
5) Lots 7152 - 8364 inclusive
**MANUFACTURER**
Star Kay White, Inc., Congers, NY.
**VOLUME OF UNITS**
5,749 units
**DISTRIBUTION**
NY, NJ, MA, PA, OH, FL, TX, UT, CA, IA, NV, MO and IN
**FOODS - TYPE B**
___________________________________
**PRODUCT**
Cool River Bebidas Naturales - West Indian Cherry Fruit Acerola 16% Juice;
**CODE**
990-10/2 10/5
**MANUFACTURER**
San Mar Manufacturing Corp., Catano, PR.
**VOLUME OF UNITS**
384
**DISTRIBUTION**
PR
And here's the desired output (please excuse any XML syntactical errors):
<foods>
  <food type="A">
    <product>Mi Pueblito Queso Fresco Authentic Mexican Style Fresh Cheese</product>
    <product>La Fe String Cheese</product>
    <code>Sell by date going back to February 1, 2009</code>
    <manufacturer>Quesos Mi Pueblito, LLC, Passaic, NJ.</manufacturer>
    <volume>11,000 boxes</volume>
    <distribution>NJ, NY, DE, MD, CT, VA</distribution>
  </food>
  <food type="A">
    <product>Peanut Brittle No Sugar Added</product>
    <product>Peanut Brittle Small Grind</product>
    <product>Homestyle Peanut Brittle Nuggets/Coconut Oil Coating</product>
    <code>Lots 7109 - 8350 inclusive</code>
    <code>Lots 8198 - 8330 inclusive</code>
    <code>Lots 7075 - 9012 inclusive</code>
    <code>Lots 7100 - 8057 inclusive</code>
    <code>Lots 7152 - 8364 inclusive</code>
    <manufacturer>Star Kay White, Inc., Congers, NY.</manufacturer>
    <volume>5,749 units</volume>
    <distribution>NY, NJ, MA, PA, OH, FL, TX, UT, CA, IA, NV, MO and IN</distribution>
  </food>
  <food type="B">
    <product>Cool River Bebidas Naturales - West Indian Cherry Fruit Acerola 16% Juice</product>
    <code>990-10/2 10/5</code>
    <manufacturer>San Mar Manufacturing Corp., Catano, PR</manufacturer>
    <volume>384</volume>
    <distribution>PR</distribution>
  </food>
</foods>
<!-- and so forth -->
So far, my approach (which might be quite inefficient with a huge text file) would be one of the following:
Loops and multiple Select/Case statements, where the file is loaded into a string buffer, and while looping through each line, see if it matches one of the header/subheader/key lines, append the appropriate xml tag to a xml string variable, and then add the child nodes to the xml based on IF statements regarding which key name is most recent (which seems time-consuming and error-prone, esp. if the text changes even slightly) -- OR
Use REGEX (Regular Expressions) to find and replace key fields with appropriate xml tags, clean it up with an xml library, and export the xml file. Problem is, I barely use regular expressions, so I'd need some example-based help.
Any help or advice would be appreciated.
Thanks.
An example you can use as a starting point. At least I hope it gives you an idea...
<?php
define('TYPE_HEADER', 1);
define('TYPE_KEY', 2);
define('TYPE_DELIMETER', 3);
define('TYPE_VALUE', 4);
$datafile = 'data.txt';
$fp = fopen($datafile, 'rb') or die('!fopen');
// stores (the first) {header} in 'name' and the root simplexmlelement in 'element'
$container = array('name'=>null, 'element'=>null);
// stores the name for each item element, the value for the type attribute for subsequent item elements and the simplexmlelement of the current item element
$item = array('name'=>null, 'type'=>null, 'current_element'=>null);
// the last **key** encountered, used to create new child elements in the current item element when a value is encountered
$key = null;
while ( false!==($t=getstruct($fp)) ) {
    switch( $t[0] ) {
        case TYPE_HEADER:
            if ( is_null($container['element']) ) {
                // this is the first time we hit **header - subheader**
                $container['name'] = $t[1][0];
                // ugly hack, < . name . />
                $container['element'] = new SimpleXMLElement('<'.$container['name'].'/>');
                // each subsequent new item gets the new subheader as type attribute
                $item['type'] = $t[1][1];
                // dummy implementation: "deducting" the item names from header/container[name]
                $item['name'] = substr($t[1][0], 0, -1);
            }
            else {
                // hitting **header - subheader** the (second, third, nth) time
                /*
                header must be the same as the first time (stored in container['name']).
                Otherwise you need another container element since
                xml documents can only have one root element
                */
                if ( $container['name'] !== $t[1][0] ) {
                    echo $container['name'], "!==", $t[1][0], "\n";
                    die('format error');
                }
                else {
                    // subheader may have changed, store it for future item elements
                    $item['type'] = $t[1][1];
                }
            }
            break;
        case TYPE_DELIMETER:
            assert( !is_null($container['element']) );
            assert( !is_null($item['name']) );
            assert( !is_null($item['type']) );
            /* that's maybe not a wise choice.
            You might want to check the complete item before appending it to the document.
            But the example is a hack anyway ...so create a new item element and append it to the container right away
            */
            $item['current_element'] = $container['element']->addChild($item['name']);
            // set the type-attribute according to the last **header - subheader** encountered
            $item['current_element']['type'] = $item['type'];
            break;
        case TYPE_KEY:
            $key = $t[1][0];
            break;
        case TYPE_VALUE:
            assert( !is_null($item['current_element']) );
            assert( !is_null($key) );
            // this is a value belonging to the "last" key encountered
            // create a new "key" element with the value as content
            // and add it to the current item element
            $tmp = $item['current_element']->addChild($key, $t[1][0]);
            break;
        default:
            die('unknown token');
    }
}
if ( !is_null($container['element']) ) {
    $doc = dom_import_simplexml($container['element']);
    $doc = $doc->ownerDocument;
    $doc->formatOutput = true;
    echo $doc->saveXML();
}
die;
/*
Take a look at gettoken() at http://www.tuxradar.com/practicalphp/21/5/6
It breaks the stream into much simpler pieces.
In the next step the parser would "combine" or structure the simple tokens into more complex things.
This function does both....
@return array(id, array(parameter))
*/
function getstruct($fp) {
    if ( feof($fp) ) {
        return false;
    }
    // shortcut: all we care about "happens" on one line
    // so let php read one line in a single step and then do the pattern matching
    $line = trim(fgets($fp));
    // this matches **key** and **header - subheader**
    if ( preg_match('#^\*\*([^-]+)(?:-(.*))?\*\*$#', $line, $m) ) {
        // only for **header - subheader** $m[2] is set.
        if ( isset($m[2]) ) {
            return array(TYPE_HEADER, array(trim($m[1]), trim($m[2])));
        }
        else {
            return array(TYPE_KEY, array($m[1]));
        }
    }
    // this matches _____________ and means "new item"
    else if ( preg_match('#^_+$#', $line, $m) ) {
        return array(TYPE_DELIMETER, array());
    }
    // any other non-empty line is a single value
    else if ( preg_match('#\S#', $line) ) {
        // you might want to filter the 1),2),3) part out here
        // could also be two different token types
        return array(TYPE_VALUE, array($line));
    }
    else {
        // skip empty lines, would be nicer with tail-recursion...
        return getstruct($fp);
    }
}
prints
<?xml version="1.0"?>
<FOODS>
<FOOD type="TYPE A">
<PRODUCT>1) Mi Pueblito Queso Fresco Authentic Mexican Style Fresh Cheese;</PRODUCT>
<PRODUCT>2) La Fe String Cheese</PRODUCT>
<CODE>Sell by date going back to February 1, 2009</CODE>
<MANUFACTURER>Quesos Mi Pueblito, LLC, Passaic, NJ.</MANUFACTURER>
<VOLUME OF UNITS>11,000 boxes</VOLUME OF UNITS>
<DISTRIBUTION>NJ, NY, DE, MD, CT, VA</DISTRIBUTION>
</FOOD>
<FOOD type="TYPE A">
<PRODUCT>1) Peanut Brittle No Sugar Added;</PRODUCT>
<PRODUCT>2) Peanut Brittle Small Grind;</PRODUCT>
<PRODUCT>3) Homestyle Peanut Brittle Nuggets/Coconut Oil Coating</PRODUCT>
<CODE>1) Lots 7109 - 8350 inclusive;</CODE>
<CODE>2) Lots 8198 - 8330 inclusive;</CODE>
<CODE>3) Lots 7075 - 9012 inclusive;</CODE>
<CODE>4) Lots 7100 - 8057 inclusive;</CODE>
<CODE>5) Lots 7152 - 8364 inclusive</CODE>
<MANUFACTURER>Star Kay White, Inc., Congers, NY.</MANUFACTURER>
<VOLUME OF UNITS>5,749 units</VOLUME OF UNITS>
<DISTRIBUTION>NY, NJ, MA, PA, OH, FL, TX, UT, CA, IA, NV, MO and IN</DISTRIBUTION>
</FOOD>
<FOOD type="TYPE B">
<PRODUCT>Cool River Bebidas Naturales - West Indian Cherry Fruit Acerola 16% Juice;</PRODUCT>
<CODE>990-10/2 10/5</CODE>
<MANUFACTURER>San Mar Manufacturing Corp., Catano, PR.</MANUFACTURER>
<VOLUME OF UNITS>384</VOLUME OF UNITS>
<DISTRIBUTION>PR</DISTRIBUTION>
</FOOD>
</FOODS>
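The printed document still carries the 1), 2) counters, and a name such as VOLUME OF UNITS is not a legal XML element name because of the spaces. Two small helpers that could be applied before addChild() are sketched below. toElementName() and cleanValue() are hypothetical names, not part of the answer, and the lowercase-with-underscores naming is just one possible convention.

```php
<?php
// Hypothetical helper: turn a key such as "VOLUME OF UNITS" into a usable element name.
function toElementName(string $key): string
{
    // lowercase, then collapse every run of other characters into one underscore
    $name = trim(preg_replace('/[^a-z0-9]+/', '_', strtolower(trim($key))), '_');
    // element names may not be empty or start with a digit, so prefix those with "_"
    return preg_match('/^[a-z]/', $name) ? $name : '_' . $name;
}

// Hypothetical helper: drop a leading "1) " style counter and a trailing ";".
function cleanValue(string $value): string
{
    $value = preg_replace('/^\d+\)\s*/', '', trim($value));
    return rtrim($value, "; \t");
}

echo toElementName('VOLUME OF UNITS'), "\n";
echo cleanValue('1) Peanut Brittle No Sugar Added;'), "\n";
```

In the parser above, these would be called in the TYPE_KEY and TYPE_VALUE branches, for example $key = toElementName($t[1][0]) and addChild($key, cleanValue($t[1][0])).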
Unfortunately the status of the php module for ANTLR currently is "Runtime is in alpha status." but it might be worth a try anyway...
See: http://www.tuxradar.com/practicalphp/21/5/6
This tells you how to parse a text file into tokens using PHP. Once parsed you can place it into anything you want.
You need to search for specific tokens in the file based on your criteria:
for example:
PRODUCT
This gives you the XML Tag
Then 1) can have special meaning
1) Peanut Brittle...
This tells you what to put in the XML tag.
I do not know if this is the most efficient way to accomplish your task, but it is the way a compiler would parse a file, and it has the potential to be very accurate.
Instead of Regex or PHP use the XSLT 2.0 unparsed-text() function to read the file (see http://www.biglist.com/lists/xsl-list/archives/200508/msg00085.html)
Another Hint for an XSLT 1.0 Solution is here: http://bytes.com/topic/net/answers/808619-read-plain-file-xslt-1-0-a
