I have a pseudo-XML string that I want to convert to XML in PHP. I then want to grab the IMAware element, which contains a date string in varying formats, and convert it to a proper date.
Pseudo string:
<IMAware>
09-03-2016 05:28
</IMAware>
<NextUpdate>
</NextUpdate>
<ETR>
</ETR>
<SMS>
text
</SMS>
<Summary>
text
</Summary>
<Impact>
text
</Impact>
<Plan>
</Plan>
<Resolution>
text
</Resolution>
<Complete>
text
</Complete>
<Timeline>
text
</Timeline>
<Crisis>
</Crisis>
So far I have the following
for ($i = 0; $i < count($dbData); $i++) {
try {
print_r(trim($dbData[$i]['INCIDENT_NOTES']));
$xml = simplexml_load_string (trim($dbData[$i]['INCIDENT_NOTES']));
print_r($xml);
} catch (Exception $e) {
echo $e;
}
/*$items = $xml->xpath(item);
foreach($items as $item) {
echo $item['title'], ': ', $item['description'], "\n";
}*/
}
This fails with
Warning: simplexml_load_string(): Entity: line 5: parser error : Extra content at the end of the document in /var/SP/oiadm/docroot/dev/maskella/common/api/Api.class.php on line 1444
Do I need to enclose it with <?xml version='1.0'?> <document></document>?
The date formats I have are:
21-02-2016 20:14
Date/Time: 09/02 - 15:40
Date: 08/02 - 11:50
Yes. The XML declaration itself is optional, but the document does need a single root element which encapsulates the elements from your pseudo example; without one the parser reports "Extra content at the end of the document".
See What is the correct format to use for Date/Time in an XML file for appropriate date formats in XML.
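As a minimal sketch of the wrapping and date handling discussed above: the `<document>` root name, the `parseIncidentDate()` helper and the `$defaultYear` parameter are assumptions made for illustration, not part of the original code. The two short formats carry no year, so one has to be supplied from elsewhere (for example from the incident record).

```php
<?php
// Hypothetical sample of what INCIDENT_NOTES might contain (shortened).
$notes = "<IMAware>\n09-03-2016 05:28\n</IMAware>\n<NextUpdate>\n</NextUpdate>\n<SMS>\ntext\n</SMS>";

// Wrap the fragment in one root element so it becomes well-formed XML.
libxml_use_internal_errors(true); // collect parser errors instead of emitting warnings
$xml = simplexml_load_string('<document>' . trim($notes) . '</document>');
if ($xml === false) {
    foreach (libxml_get_errors() as $error) {
        echo trim($error->message), "\n";
    }
    exit(1);
}

/**
 * Parse the three formats listed in the question.
 * Returns null when the value matches none of them.
 */
function parseIncidentDate(string $raw, int $defaultYear): ?DateTimeImmutable
{
    // Drop an optional "Date:" or "Date/Time:" label.
    $raw = trim(preg_replace('~^Date(?:/Time)?:\s*~i', '', trim($raw)));

    // Full format: 21-02-2016 20:14
    $dt = DateTimeImmutable::createFromFormat('!d-m-Y H:i', $raw);
    if ($dt !== false) {
        return $dt;
    }

    // Short format without a year: 09/02 - 15:40
    if (preg_match('~^(\d{2})/(\d{2})\s*-\s*(\d{2}):(\d{2})$~', $raw, $m)) {
        $full = sprintf('%s/%s/%d %s:%s', $m[1], $m[2], $defaultYear, $m[3], $m[4]);
        $dt = DateTimeImmutable::createFromFormat('!d/m/Y H:i', $full);
        return $dt !== false ? $dt : null;
    }

    return null;
}

$aware = parseIncidentDate((string) $xml->IMAware, 2016);
echo $aware !== null ? $aware->format('Y-m-d H:i') : 'unparsed', "\n";
```

Values that match none of the patterns come back as `null`, so they can be logged and reviewed rather than silently converted to a wrong date.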
Does anyone know of an alternative way to convert a SimpleXmlElement into a string?
The standard string casting is very slow:
$myString = (string)$simpleXmlElement->someNode;
I needed to know which one is faster for finding an element with a specific text-value: XPath or walking the nodes... So I wrote a simple script which would measure the duration of 1000 iterations for both ways.
The first results were that XPath was much slower, but then I found out that I forgot the string cast in the node-walking part. When I fixed that, the node-walking was much much slower.
So, only the cast-to-string flipped the entire outcome.
Please review the following code to understand the issue at hand:
<pre>
<?php
//---------------------------------------------------------------------------
date_default_timezone_set('Europe/Amsterdam');
error_reporting(E_ALL);
//---------------------------------------------------------------------------
$data = <<<'EOD'
<?xml version="1.0" encoding="UTF-8" ?>
<root>
<children>
<child><test>ads</test></child>
<child><test>sdf</test></child>
<child><test>dfg</test></child>
<child><test>fgh</test></child>
<child><test>ghj</test></child>
<child><test>hjk</test></child>
<child><test>jkl</test></child>
<child><test>ads</test></child>
<child><test>sdf</test></child>
<child><test>dfg</test></child>
<child><test>fgh</test></child>
<child><test>ghj</test></child>
<child><test>hjk</test></child>
<child><test>jkl</test></child>
<child><test>123</test></child>
<child><test>234</test></child>
<child><test>345</test></child>
<child><test>456</test></child>
<child><test>567</test></child>
<child><test>678</test></child>
<child><test>789</test></child>
<child><test>890</test></child>
<child><test>90-</test></child>
<child><test>0-=</test></child>
<child><test>qwe</test></child>
</children>
</root>
EOD;
$xml = new SimpleXMLElement($data);
$values = array('123', '234', '345', '456', '567', '678', '789', '890', '90-', '0-=', 'qwe');
$valCount = count($values);
$tries = 1000;
//---------------------------------------------------------------------------
echo("Running XPath... ");
$startTime = microtime(true);
for ($idx=0; $idx<$tries; $idx++)
$xml->xpath('/root/children/child[test="'.$values[($idx % $valCount)].'"]');
$duration = microtime(true) - $startTime;
echo("Finished in: $duration\r\n");
//---------------------------------------------------------------------------
echo("Running NodeWalk... ");
$startTime = microtime(true);
for ($idx=0; $idx<$tries; $idx++)
{
$nodes = $xml->children->child;
foreach ($nodes as $node)
if ((string)$node->test == $values[($idx % $valCount)])
break;
}
$duration = microtime(true) - $startTime;
echo("Finished in: $duration\r\n");
//---------------------------------------------------------------------------
?>
</pre>
When altering the line:
if ((string)$node->test == $values[($idx % $valCount)])
to:
if ($node->test == $values[($idx % $valCount)])
The code even looks at more nodes, but it's still a lot faster. So, it looks to me that the string cast here is very slow.
Does anyone have a faster alternative for the string cast?
Amazingly, as @Gordon pointed out, there is actually not much difference in performance... on Linux.
As the original tests were done only on Windows, I retested on Linux, and what do you know... The difference between the XPath way and the node-walking way is gone.
For me that's enough, because I'm actually writing this for Linux (the production platform). But it is still something to keep in mind when building for Windows.
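For anyone who wants to compare alternatives to the cast on their own platform, here is a small sketch of ways to read a node's text that should give the same value. The sample document is made up for illustration, and no claim is made about which variant is faster; that would need measuring with a loop like the one above.

```php
<?php
// Hypothetical one-node document, just to compare the conversions.
$xml  = new SimpleXMLElement('<root><child><test>123</test></child></root>');
$node = $xml->child->test;

$viaCast     = (string) $node;                          // the standard cast
$viaMethod   = $node->__toString();                     // explicit call to the method the cast relies on
$viaStrval   = strval($node);                           // function form of the cast
$viaDomText  = dom_import_simplexml($node)->textContent; // through the DOM extension

var_dump($viaCast === $viaMethod && $viaMethod === $viaStrval && $viaStrval === $viaDomText);
```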
I have searched for a solution to this problem, but none of the ones I found fixes it.
The answers suggest using isset() to check the array before working on it. I will explain below why that doesn't work for me.
Pre-req:
I've got a huge XML file from a tour & travel web service, which I parse and convert to a PHP array so that I can do some operations on it later (mostly filtering tours).
My Approach:
I'm using SimpleXML to load the xml and convert it to PHP array like so:
$xml = file_get_contents(APPPATH."tour.xml", true);
$xmlString = htmlentity_to_xml($xml); //custom method to clean XML
$Str = simplexml_load_string($xmlString, 'SimpleXMLElement', LIBXML_NOCDATA);
//converting to array
$json = json_encode($Str);
$array = json_decode($json,TRUE);
Then I send this array to a fitlerTours($searchParams, $tourArray) method, along with the search parameters (cityName & dates).
Then, using foreach(), I go through each tour looking for the cityName and raise a flag if it is found.
The Problem
When I filter the tours (the ones which contain cityName) for dates, I'm getting this.
Severity: Warning
Message: Illegal string offset 'year'
Filename: controllers/tourFilter.php
Line Number: 78
The same warning is shown for the offsets 'month' and 'day'.
Here's my PHP for date filter: (Line 78 is 4th line)
if($flag == 1){
if(!empty($fromDate)){
foreach($tour['departureDates']['date'] AS $date){
$dateDep = strtotime($date['year'] . "-" . (($date['month']) < 10 ? "0".$date['month'] : $date['month']) . "-" . (($date['day']) < 10 ? "0".$date['day'] : $date['day']));
if(strtotime($fromDate) <= $dateDep && $dateDep <= strtotime($fromDate . "+".$range." days")){
if($date['departureStatus'] != "SoldOut"){
$dateFlag = 1;
}
}
}
}
else{
$dateFlag = 1;
}
$flag = 0;
}
if($dateFlag == 1){//Collect tours which contain the keyword & dates to $response array
$responseArray[] = $tour;
$dateFlag = false; //Reset Flag
}
Here's the snippet of XML:
...
<departureDates>
<date>
<day>7</day>
<month>1</month>
<year>2016</year>
<singlesPrice>12761</singlesPrice>
<doublesPrice>9990</doublesPrice>
<triplesPrice>0</triplesPrice>
<quadsPrice>0</quadsPrice>
<shipName/>
<departureStatus>Available</departureStatus>
</date>
<date>
<day>8</day>
<month>1</month>
<year>2016</year>
<singlesPrice>12761</singlesPrice>
<doublesPrice>9990</doublesPrice>
<triplesPrice>0</triplesPrice>
<quadsPrice>0</quadsPrice>
<shipName/>
<departureStatus>SoldOut</departureStatus>
</date>
</departureDates>
...
The solution I found by searching around is to check with isset() that the array is set properly. But then isset() does not return true, line 78 is not executed, and the data is lost. I need that data.
This happens only for certain keywords I search for.
Any help is appreciated.
The error says that the $date variable is a string at some point...
Characters within strings may be accessed and modified by specifying
the zero-based offset of the desired character after the string using
square array brackets, as in $str[42]. Think of a string as an array
of characters for this purpose.
See here
So try this:
if(is_array($date)){
$dateDep = strtotime($date['year'] . "-" . (($date['month']) < 10 ? "0".$date['month'] : $date['month']) . "-" . (($date['day']) < 10 ? "0".$date['day'] : $date['day']));
if(strtotime($fromDate) <= $dateDep && $dateDep <= strtotime($fromDate . "+".$range." days")){
if($date['departureStatus'] != "SoldOut"){
$dateFlag = 1;
}
}
}
else {
//If this is not an array what is it then?
var_dump($date);
}
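One likely cause, offered as an assumption since the full feed isn't shown: after the json_encode()/json_decode() round trip, a tour with a single `<date>` element decodes to one associative array rather than a list of them, so foreach walks over its fields (strings such as "7" and "2016"). The sketch below normalises both shapes; `asList()` is a hypothetical helper name and the XML is a cut-down version of the snippet in the question.

```php
<?php
/**
 * Return a list of entries whether the decoded value holds one entry or many.
 * A list of <date> entries has key 0; a single entry only has string keys.
 */
function asList($value): array
{
    if ($value === null || $value === []) {
        return [];
    }
    if (is_array($value) && array_key_exists(0, $value)) {
        return $value; // already a list of entries
    }
    return [$value];   // single entry: wrap it
}

// Hypothetical tours: one with a single departure date, one with two.
$one = '<tour><departureDates><date><day>7</day><month>1</month><year>2016</year></date></departureDates></tour>';
$two = '<tour><departureDates>'
     . '<date><day>7</day><month>1</month><year>2016</year></date>'
     . '<date><day>8</day><month>1</month><year>2016</year></date>'
     . '</departureDates></tour>';

foreach ([$one, $two] as $source) {
    $tour = json_decode(json_encode(simplexml_load_string($source)), true);
    foreach (asList($tour['departureDates']['date'] ?? null) as $date) {
        // $date is always an array here, so the string-offset warning cannot occur.
        printf("%04d-%02d-%02d\n", $date['year'], $date['month'], $date['day']);
    }
}
```

With this in place the single-date tours are kept instead of being skipped by an isset() or is_array() guard.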
I found this web service which provides the date time of a timezone. http://www.earthtools.org/timezone-1.1/24.0167/89.8667
I want to call it & get the values like isotime with php.
So I tried
$contents = simplexml_load_file("http://www.earthtools.org/timezone-1.1/24.0167/89.8667");
$xml = new DOMDocument();
$xml->loadXML( $contents );
AND also with
file_get_contents
With file_get_contents I only get a string of values, not the XML markup. Something like this:
1.0 24.0167 89.8667 6 F 20 Feb 2014 13:50:12 2014-02-20 13:50:12 +0600 2014-02-20 07:50:12 Unknown
Nothing worked. Can anyone please tell me how I can get the isotime or the other values from that link using PHP?
This works:
$url = 'http://www.earthtools.org/timezone-1.1/24.0167/89.8667';
$nodes = array('localtime', 'isotime', 'utctime');
$cont = file_get_contents($url);
$node_values = array();
if ($cont && ($xml = simplexml_load_string($cont))) {
foreach ($nodes as $node) {
if ($xml->$node) $node_values[$node] = (string)$xml->$node;
}
}
print_r($node_values);
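A possible explanation for the "string of numbers", stated as an assumption about how the output was viewed: file_get_contents() does return the XML, but when it is echoed into a browser the unknown tags are hidden and only the text remains. The sketch below uses a hypothetical response body, modelled on the values quoted in the question, to show the raw markup and to turn isotime into a date object.

```php
<?php
// Hypothetical response body; the real service output may contain more elements.
$cont = <<<'XML'
<?xml version="1.0" encoding="UTF-8"?>
<timezone>
  <version>1.0</version>
  <localtime>20 Feb 2014 13:50:12</localtime>
  <isotime>2014-02-20 13:50:12 +0600</isotime>
  <utctime>2014-02-20 07:50:12</utctime>
</timezone>
XML;

// Escaping the markup makes the tags visible when the page is viewed in a browser.
echo htmlspecialchars($cont), "\n";

$xml = simplexml_load_string($cont);
if ($xml !== false) {
    $iso  = (string) $xml->isotime;       // "2014-02-20 13:50:12 +0600"
    $when = new DateTimeImmutable($iso);  // the UTC offset in the string is honoured
    echo $when->format(DATE_ATOM), "\n";
}
```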
I'm wanting to build a scraper that parses through transcripts from the Leveson Inquiry, which are in the following format as plaintext:
1 Thursday, 2 February 2012
2 (10.00 am)
3 LORD JUSTICE LEVESON: Good morning.
4 MR BARR: Good morning, sir. We're going to start today
5 with witnesses from the mobile phone companies,
6 Mr Blendis from Everything Everywhere, Mr Hughes from
7 Vodafone and Mr Gorham from Telefonica.
8 LORD JUSTICE LEVESON: Very good.
9 MR BARR: We're going to listen to them all together, sir.
10 Can I ask that the gentlemen are sworn in, please.
11 MR JAMES BLENDIS (affirmed)
12 MR ADRIAN GORHAM (sworn)
13 MR MARK HUGHES (sworn)
14 Questions by MR BARR
15 MR BARR: Can I start, please, Mr Hughes, with you. Could
16 you tell us the position that you hold and a little bit
17 about your professional background, please?
18 MR HUGHES: Yes, sure. I'm currently head of fraud risk and
19 security for Vodafone UK. I have been in that position
20 since August 2011 and I've worked in the fraud risk and
21 security department in Vodafone since October 2006.
22 Q. Mr Gorham, if I could ask you the same question, please.
23 MR GORHAM: I'm the head of fraud and security for
24 Telefonica O2, I've been in that role for ten years and
25 have been in the industry for 13.
1
(Full example)
Ultimately I want to build an XML file structured as follows:
<hearing date="2012-02-02" time="10:00">
<quote speaker="Lord Justice Leveson" page="1" line="3">Good morning.</quote>
<quote speaker="Mr Barr" page="1" line="4">Good morning, sir. We're going to start today with witnesses from the mobile phone companies, Mr Blendis from Everything Everywhere, Mr Hughes from Vodafone and Mr Gorham from Telefonica.</quote>
<quote speaker="Lord Justice Leveson" page="1" line="8">Very good.</quote>
[... and on ...]
</hearing>
...Any help?
(Also note, that "MR BARR:" changes into simply "Q." at a certain point.)
Many thanks!
This is generally a very hard problem, and is way out of scope for StackOverflow. That said, if I had to do this I'd take an iterative approach:
1. Identify regularities in the text layout and devise a candidate grammar.
2. Write a parser using the grammar; the parse would be quite strict and discard (with error messages) anything that didn't match.
3. Run it on the entire text.
4. Examine the output and mismatches, revise the grammar, identify special cases.
5. Go back to step 3.
As to the details of those steps, only you can decide if you're getting out what you want. Also, any solution is going to require manual intervention, either beforehand or afterwards, to clean up low-frequency inconsistencies.
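The iterative approach could be sketched roughly as follows. The rule names and regular expressions are a first-guess candidate grammar based only on the sample lines in the question, and `parseStrict()` is a hypothetical helper: it keeps what matches a rule and reports everything else, so the rejected lines drive the next revision of the grammar.

```php
<?php
// Candidate grammar: rule name => pattern. The first matching rule wins.
$grammar = [
    'date'    => '/^\s*(?P<line>\d+)\s+(?P<date>[A-Z][a-z]+, \d{1,2} [A-Z][a-z]+ \d{4})\s*$/',
    'time'    => '/^\s*(?P<line>\d+)\s+\((?P<time>\d{1,2}[.:]\d{2}\s*[ap]m)\)\s*$/i',
    'speaker' => '/^\s*(?P<line>\d+)\s+(?P<speaker>[A-Z]+(?: [A-Z]+)*)[:.]\s+(?P<text>.+?)\s*$/',
    'text'    => '/^\s*(?P<line>\d+)\s+(?P<text>\S.*?)\s*$/',
];

/**
 * Strict parse: returns [parsed rows, rejected lines].
 * Rejected lines are kept with their 1-based position for later inspection.
 */
function parseStrict(array $lines, array $grammar): array
{
    $parsed   = [];
    $rejected = [];
    foreach ($lines as $index => $line) {
        if (trim($line) === '') {
            continue; // blank lines carry no information
        }
        foreach ($grammar as $rule => $pattern) {
            if (preg_match($pattern, $line, $m)) {
                // Keep only the named groups, not the numeric duplicates.
                $named    = array_filter($m, 'is_string', ARRAY_FILTER_USE_KEY);
                $parsed[] = ['rule' => $rule] + $named;
                continue 2;
            }
        }
        $rejected[] = [$index + 1, $line];
    }
    return [$parsed, $rejected];
}

$sample = [
    ' 1 Thursday, 2 February 2012',
    ' 2 (10.00 am)',
    ' 3 LORD JUSTICE LEVESON: Good morning.',
    ' 4 MR BARR: Good morning, sir. We\'re going to start today',
    ' 5 with witnesses from the mobile phone companies,',
    '22 Q. Mr Gorham, if I could ask you the same question, please.',
    '1', // bare page number: no rule covers it yet, so it is reported
];

[$parsed, $rejected] = parseStrict($sample, $grammar);
foreach ($rejected as [$no, $line]) {
    echo "No rule matched input line {$no}: \"{$line}\"\n";
}
```

Each round, the reported lines either get a new rule (page numbers, "(sworn)" notes and so on) or are put on a list for manual clean-up.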
Let me start by saying this is not a foolproof script; there might well be something I forgot or overlooked,
but it is a proof of concept for you to improve and expand upon, or just to get an idea from.
There are enough regularities in the text layout for us to work with. What the script does is split the
transcript into an array of lines and match those lines against a few patterns, in an attempt to identify the
regularities and determine the role of the data.
Example Script:
<?php
/*
Proof of Concept : Transcript to XML by Robjong
? :
- action on date change (what to do when the date changes?)
- what to do with lines like "MR MARK HUGHES (sworn)" (make it a note?!)
- what to do with lines like "Questions by MR BARR" (make it a note?!)
- detect events/notes in quotes better? (e.g: MR BLENDIS: (Nods head).)
Notes :
- desperately needs error checking/handling!!!! (for now it just got in the way)
- it might well be that the configuration of PHP will cause file_get_contents to fail,
try curl or download it manually and read the local file
- if you are running PHP < 5.2.4, change the \h in the pattern to \s or [\t ]
*/
# basic usage
// get the transcript as plain text
$txt = file_get_contents( 'http://www.levesoninquiry.org.uk/wp-content/uploads/2012/02/Transcript-of-Morning-Hearing-2-February-2012.txt' );
// convert transcript to XML
$xml = transcriptToXML_beta( $txt );
// we have the transcript as XML, now what?
file_put_contents( 'transcript.xml', $xml ); // let's write it to a file
echo $xml;
function transcriptToXML_beta( $string ) { // beta is just to emphasize lack of thorough testing
$lines = explode( "\n", $string ); // split text into an array of lines
if( !is_array( $lines ) ) { // the provided string was not multiline
return false;
}
// these vars will hold the data we need to build our XML
$date = ''; // the date of the transcript
$time = ''; // transcript time
$page = 1; // this will hold the current page number
$linenr = ''; // this will hold the line nr
$speaker = ''; // the name of the speaker
$text = ''; // transcribed text attributed to the speaker
$new = false; // will be true if a new item has been matched
$event = ''; // this will hold notes that are in a quote but need to be stored separately (events)
$xml = ''; // this will be the XML string
$i = 0; // count the lines to display actual line number for debugging
foreach( $lines as $line ) { // loop over lines
$i++;
if( !preg_match( "/[[:graph:]]/", $line ) ) { // line is empty, does not contain printable characters....
continue; // ....so we skip to the next one
}
if( preg_match( "/^\h*\d+\h+(?P<date>[a-z]+,\h+\d+\h+[a-z]+\h\d{4})\s*$/i", $line, $match ) ) { # it looks like a date
$date = $match['date']; // set date
$speaker = ''; // reset vars
$text = '';
continue;// no need to handle this line any further
} elseif( preg_match( "/^\h*\d+\h+([A-Z]+(?:\s+[A-Z]+){0,4}\h+\(.*?\)|(?i:questions\h+by)[A-Z\h]+)\s*$/", $line, $match ) ) { # (queued) event, uppercase text followed by text between parentheses
$event .= " <event page=\"{$page}\" line=\"{$linenr}\">{$match[1]}</event>\n"; // add entry to queue, to be added after current quote
continue;// no need to handle this line any further
} elseif( preg_match( "/^\h*(\d*)\h*\(\h*(?P<time>\d{1,2}[:.]\d{1,2}\h*[ap]m)\)\s*$/i", $line, $match ) ) { # seems we have a time entry
$time = $match['time']; // set time
$xml .= " <time page=\"{$page}\" line=\"{$match[1]}\">" . strtoupper( str_replace( '.', ':', $match['time'] ) ) . "</time>\n"; // add time as entry
$speaker = ''; // reset vars
$text = '';
continue;// no need to handle this line any further
} elseif( preg_match( "/^\h*(\d+)\s*$/", $line, $match ) ) { # line has just one or more digits, we assume its a pagenr
if( $match[1] <= $page ) { // if the number is not higher than the current page number ignore it, this avoids triggering errors for 'empty lines' that only have a line number
continue;
}
$page = (int) $match[1] + 1; // set pagenr, add one because the nr is at the bottom of the page
continue;// no need to handle this line any further
} elseif( preg_match( "/^\h*\d+\s+\(([[:print:]]+)\)\s*$/", $line, $match ) && !$speaker ) { # note, text is between parentheses
$xml .= " <event page=\"{$page}\" line=\"{$linenr}\">{$match[1]}</event>\n"; // add entry as note
continue;// no need to handle this line any further
} elseif( preg_match( "/^\h*\d+\h+([A-Z\h]+\(.*?\))\s*$/", $line, $match ) && !$speaker ) { # note, uppercase text followed by text between parentheses, only if not in quote (capture group added so that $match[1] is defined)
$xml .= " <event type=\"note\" speaker=\"\" page=\"{$page}\" line=\"{$linenr}\">{$match[1]}</event>\n"; // add entry as note
continue;// no need to handle this line any further
} elseif( preg_match("/^\h*(?P<linenr>\d+)\h+(?P<speaker>(?:\h[A-Z]+(?:\h[A-Z]+){0,4}))[:.]\h*(?P<text>[[:print:]]+?)\s*$/", $line, $match ) ) { # new speaker entry
if( $new && $speaker ) { // if we have one open we need to add it first
$xml .= " <entry type=\"quote\" speaker=\"{$speaker}\" page=\"{$page}\" line=\"{$linenr}\">$text</entry>\n"; // add quote
$new = false; // reset
if( $event ) { // if we have a queued note we need to add that too
$xml .= $event; // add entry to XML string
$event = ''; // clear queue
}
}
$speaker = trim( $match['speaker'] ); // assign match to speaker var
$linenr = (int) $match['linenr']; // assign line number
$text = trim( $match['text'] ); // assign text
$new = true; // set new match bool
} elseif( preg_match( "/^\h*(?P<linenr>\d+)\h+(?P<text>[[:print:]]+?)\s*$/", $line, $match ) ) { # follow up text
$text .= ' ' . trim( $match['text'] ); // append text
} else { # unknown line (add check for linenr only lines or remove $match[1] >= $page from the pagenr match conditional)
// not sure what kind of line this is... add it as a separate 'type'?!
trigger_error( "Unable to parse line {$i} \"{$line}\"" ); // throw exception / trigger error
continue; // no need to handle this line any further
}
if( !$new && $speaker ) {
$xml .= " <entry type=\"quote\" speaker=\"{$speaker}\" page=\"{$page}\" line=\"{$linenr}\">$text</entry>\n";
$speaker = ''; // reset vars
$text = '';
$new = false;
if( $event ) { // if we have a queued note we need to add that too
$xml .= $event; // add entry to XML string
$event = ''; // clear queue
}
}
}
// if we have a match open we need to handle it, this might happen because we do not assign the match in the same iteration as we matched it
if( $new ) {
$xml .= " <entry type=\"quote\" speaker=\"{$speaker}\" page=\"{$page}\" line=\"{$linenr}\">$text</entry>\n";
}
if( !trim( $xml ) ) { // no text found so $xml is still an empty string
return false;
}
$date = new DateTime( $date ); // instantiate datetime with the time from the transcript
$date = date( 'Y-m-d', $date->getTimestamp() ); // format date
// now we need to wrap the nodes in a root node
$xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<hearing date=\"{$date}\">\n{$xml}</hearing>\n";
return $xml; // return the XML
}
?>
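One thing the proof of concept leaves open is escaping: speaker names and quote text are placed into the XML string as-is, so an ampersand or angle bracket in the transcript would make the output ill-formed. A minimal sketch of one way to handle that, using a hypothetical `xmlEntry()` helper that is not part of the script above:

```php
<?php
/**
 * Build one <entry> element with attribute values and text escaped for XML.
 * Hypothetical helper for illustration only.
 */
function xmlEntry(string $speaker, int $page, int $line, string $text): string
{
    $flags = ENT_XML1 | ENT_QUOTES; // escape &, <, > and both quote characters
    return sprintf(
        '<entry type="quote" speaker="%s" page="%d" line="%d">%s</entry>',
        htmlspecialchars($speaker, $flags, 'UTF-8'),
        $page,
        $line,
        htmlspecialchars($text, $flags, 'UTF-8')
    );
}

// Made-up text containing characters that need escaping.
$entry = xmlEntry('MR BARR', 1, 4, 'Fish & chips <allegedly>');
echo $entry, "\n";

// Round trip: the parser gives back the original text.
$element = simplexml_load_string($entry);
echo (string) $element, "\n";
```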
I will update the comments and script later today
Output Sample:
<hearing date="2012-02-02">
<time page="1" line="2">10:00 AM</time>
<entry type="quote" speaker="LORD JUSTICE LEVESON" page="1" line="3">Good morning.</entry>
<entry type="quote" speaker="MR BARR" page="1" line="4">Good morning, sir. We're going to start today with witnesses from the mobile phone companies, Mr Blendis from Everything Everywhere, Mr Hughes from Vodafone and Mr Gorham from Telefonica.</entry>
<entry type="quote" speaker="LORD JUSTICE LEVESON" page="1" line="8">Very good.</entry>
<entry type="quote" speaker="MR BARR" page="1" line="9">We're going to listen to them all together, sir. Can I ask that the gentlemen are sworn in, please.</entry>
<event page="1" line="9">MR JAMES BLENDIS (affirmed)</event>
<event page="1" line="9">MR ADRIAN GORHAM (sworn)</event>
<event page="1" line="9">MR MARK HUGHES (sworn)</event>
<event page="1" line="9">Questions by MR BARR</event>
b.t.w. just out of curiosity, what is it you need this for?