I have a couple of XML feeds which I need to convert to JSON. My service provider uploads the XML files to our server 2-3 times a day. At the moment I use codebeautify.org to convert the files to JSON and then upload them back to our server. Is there a way to have this conversion done automatically, by way of a PHP script or similar? I'd appreciate advice on how I should tackle it. Thanks in advance.
Here you go:
function removeNamespaceFromXML( $xml )
{
// Because I know all of the namespaces that can possibly appear in
// the XML string, I can just hard-code them and check for
// them to remove them
$toRemove = ['rap', 'turss', 'crim', 'cred', 'j', 'rap-code', 'evic'];
// This is part of a regex I will use to remove the namespace declaration from string
$nameSpaceDefRegEx = '(\S+)=["\']?((?:.(?!["\']?\s+(?:\S+)=|[>"\']))+.)["\']?';
// Cycle through each namespace and remove it from the XML string
foreach( $toRemove as $remove ) {
// First remove the namespace from the opening of the tag
$xml = str_replace('<' . $remove . ':', '<', $xml);
// Now remove the namespace from the closing of the tag
$xml = str_replace('</' . $remove . ':', '</', $xml);
// This XML uses the name space with CommentText, so remove that too
$xml = str_replace($remove . ':commentText', 'commentText', $xml);
// Complete the pattern for RegEx to remove this namespace declaration
$pattern = "/xmlns:{$remove}{$nameSpaceDefRegEx}/";
// Remove the actual namespace declaration using the Pattern
$xml = preg_replace($pattern, '', $xml, 1);
}
// Return sanitized and cleaned up XML with no namespaces
return $xml;
}
function namespacedXMLToArray($xml)
{
// One function to both clean the XML string and return an array
return json_decode(json_encode(simplexml_load_string(removeNamespaceFromXML($xml))), true);
}
print_r(namespacedXMLToArray($xml));
Source: https://laracasts.com/discuss/channels/general-discussion/converting-xml-to-jsonarray
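To automate the conversion itself, one option is a small PHP script run from cron after each upload. The following is a minimal sketch, not a definitive implementation: the function name xmlToJson and the feeds/ directory in the usage comment are assumptions, and it presumes the feed has no namespace prefixes (or that they were stripped first, as above).

```php
<?php
// Hypothetical helper: converts an XML string to a JSON string.
function xmlToJson(string $xml): string
{
    libxml_use_internal_errors(true); // collect parse errors instead of emitting warnings
    $element = simplexml_load_string($xml);
    if ($element === false) {
        throw new InvalidArgumentException('Could not parse XML');
    }
    $json = json_encode($element);
    if ($json === false) {
        throw new RuntimeException('Could not encode JSON: ' . json_last_error_msg());
    }
    return $json;
}

// Usage sketch (assumed directory layout): convert every uploaded feed.
// foreach (glob(__DIR__ . '/feeds/*.xml') as $file) {
//     $target = substr($file, 0, -4) . '.json';
//     file_put_contents($target, xmlToJson(file_get_contents($file)));
// }
```

Scheduling that loop with cron a few times a day would replace the manual round trip through the online converter.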
Related
I use the PHP zip:// stream wrapper to parse large XML files line by line. For example:
$stream_uri = 'zip://' . __DIR__ . '/archive.zip#foo.xml';
$reader = new XMLReader();
$reader->open( $stream_uri, null );
$reader->read();
while ( true ) {
echo( $reader->readInnerXml() . PHP_EOL );
if ( ! $reader->next() ) {
break;
}
}
Quite often an XML file will include dodgy UTF control characters XMLReader doesn't like. So I'd like to implement a custom stream wrapper I can pass the output of the zip:// stream to, which will run a preg_replace on each line to remove those characters.
My dream is to be able to do this:
stream_wrapper_register( 'xmlchars', 'XML_Chars' );
$stream_uri = 'xmlchars://zip://' . __DIR__ . '/archive.zip#foo.xml';
and have XMLReader happily read the tidied-up nodes. I've figured out a way to reconstruct the zip stream URI based on the path passed to my wrapper:
class XML_Chars {
protected $stream_uri = '';
protected $handle;
function stream_open( $path, $mode, $options, &$opened_path ) {
$parsed_url = parse_url( $path );
$this->stream_uri = 'zip:' . $parsed_url['path'] . '#' . $parsed_url['fragment'];
return true;
}
}
But I'm puzzled about the best way to open the zip:// stream so I can modify its output and pass the result through to the XMLReader. Can anyone give me any pointers about how to implement that?
In case useful to anybody else, I've found a different way to solve the problem: a stream filter. You define it like this:
class UTF_Character_Filter extends php_user_filter {
public function filter( $in, $out, &$consumed, $closing ) {
while ( $bucket = stream_bucket_make_writeable( $in ) ) {
$consumed += $bucket->datalen;
// Remove characters in the hex range 0 - 8, B and C, E to 1F
// i.e. all control characters except newline, tab and return
$bucket->data = preg_replace( '|[\x0-\x8\xB-\xC\xE-\x1F]|ms', '', $bucket->data );
stream_bucket_append( $out, $bucket );
}
return PSFS_PASS_ON;
}
}
stream_filter_register( 'utf_character_filter', 'UTF_Character_Filter' );
And use it like this:
php://filter/read=utf_character_filter/resource=zip://archive.zip#import.xml
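The same kind of filter can also be attached to a stream that is already open, using stream_filter_append(). A self-contained sketch: the class, filter and function names below are stand-ins for the ones above, and php://memory stands in for the zip:// stream.

```php
<?php
// Stand-in for the filter above, repeated so this example runs on its own.
class Control_Char_Filter extends php_user_filter {
    public function filter( $in, $out, &$consumed, $closing ): int {
        while ( $bucket = stream_bucket_make_writeable( $in ) ) {
            $consumed += $bucket->datalen;
            // Drop control characters except tab, newline and carriage return
            $bucket->data = preg_replace( '/[\x00-\x08\x0B\x0C\x0E-\x1F]/', '', $bucket->data );
            stream_bucket_append( $out, $bucket );
        }
        return PSFS_PASS_ON;
    }
}
stream_filter_register( 'control_char_filter', 'Control_Char_Filter' );

// Hypothetical helper: pushes $raw through the filter and returns the cleaned text.
function read_without_control_chars( string $raw ): string {
    $stream = fopen( 'php://memory', 'w+' );
    fwrite( $stream, $raw );
    rewind( $stream );
    // Apply the filter to everything read from this stream
    stream_filter_append( $stream, 'control_char_filter', STREAM_FILTER_READ );
    $clean = stream_get_contents( $stream );
    fclose( $stream );
    return $clean;
}
```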
I'd still be interested to know if anyone's figured out how to make a stream wrapper that can accept the input of another stream wrapper though, as it could be a handy tool.
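On that open question, one possible approach is to open the inner URI with fopen() inside stream_open() and clean each chunk in stream_read(). This is only a minimal read-only sketch. The class name Cleaning_Wrapper is made up, the xmlchars:// prefix is hard-coded, and whether XMLReader is satisfied with just these methods is an assumption that has not been verified here.

```php
<?php
// Hypothetical read-only wrapper: "xmlchars://<inner URI>" reads <inner URI> and strips control characters.
class Cleaning_Wrapper {
    public $context;     // populated by PHP when the wrapper is instantiated
    protected $handle;   // the inner stream

    public function stream_open( $path, $mode, $options, &$opened_path ) {
        // Everything after the "xmlchars://" prefix is treated as the inner URI
        $inner        = substr( $path, strlen( 'xmlchars://' ) );
        $this->handle = @fopen( $inner, 'rb' );
        return is_resource( $this->handle );
    }

    public function stream_read( $count ) {
        $chunk = fread( $this->handle, $count );
        if ( $chunk === false ) {
            return '';
        }
        // Caveat: a chunk made up only of control characters yields an empty read
        return preg_replace( '/[\x00-\x08\x0B\x0C\x0E-\x1F]/', '', $chunk );
    }

    public function stream_eof() {
        return feof( $this->handle );
    }

    public function stream_stat() {
        return fstat( $this->handle );
    }

    public function stream_close() {
        fclose( $this->handle );
    }
}
stream_wrapper_register( 'xmlchars', 'Cleaning_Wrapper' );
```

With this registered, fopen( 'xmlchars://' . $inner_uri, 'rb' ) returns a stream whose reads are already cleaned. The stream filter remains the simpler tool for this job.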
Using DOMDocument(), I'm replacing links in a $message and adding some things, like [@MERGEID]. When I save the changes with $dom_document->saveHTML(), the links get "sort of" url-encoded. [@MERGEID] becomes %5B@MERGEID%5D.
Later in my code I need to replace [@MERGEID] with an ID. So I search for urlencode('[@MERGEID]') - however, urlencode() changes the commercial at symbol (@) to %40, while saveHTML() has left it alone. So there is no match - '%5B@MERGEID%5D' != '%5B%40MERGEID%5D'
Now, I know I can run str_replace('%40', '@', urlencode('[@MERGEID]')) to get what I need to locate the merge variable in $message.
My question is, what RFC spec is DOMDocument using, and why is it different from urlencode or even rawurlencode? Is there anything I can do about that to save a str_replace?
Demo code:
$message = '<a href="http://www.google.com?ref=abc" data-tag="thebottomlink">Google</a>';
$dom_document = new \DOMDocument();
libxml_use_internal_errors(true); // Suppress content errors
$dom_document->loadHTML(mb_convert_encoding($message, 'HTML-ENTITIES', 'UTF-8'));
$elements = $dom_document->getElementsByTagName('a');
foreach($elements as $element) {
$link = $element->getAttribute('href'); //http://www.google.com?ref=abc
$tag = $element->getAttribute('data-tag'); //thebottomlink
if ($link) {
$newlink = 'http://www.example.com/click/[@MERGEID]?url=' . $link;
if ($tag) {
$newlink .= '&tag=' . $tag;
}
$element->setAttribute('href', $newlink);
}
}
$message = $dom_document->saveHTML();
$urlencodedmerge = urlencode('[@MERGEID]');
die($message . ' and url encoded version: ' . $urlencodedmerge);
//<a data-tag="thebottomlink" href="http://www.example.com/click/%5B@MERGEID%5D?url=http://www.google.com?ref=abc&amp;tag=thebottomlink">Google</a> and url encoded version: %5B%40MERGEID%5D
I believe that those two encodings serve different purposes. urlencode() encodes "a string to be used in a query part of a URL", while the href set via $element->setAttribute('href', $newlink); is escaped on output as a complete URL, to be used as a URL.
For example:
urlencode('http://www.google.com'); // -> http%3A%2F%2Fwww.google.com
This is convenient for encoding the query part, but it cannot be used on <a href='...'>.
However:
$element->setAttribute('href', $newlink); // -> http://www.google.com
will properly encode the string so that it is still usable in href. The reason it does not encode @ is that it cannot tell whether @ is part of the query, part of the userinfo, or part of an email URL (for example: mailto:invisal@google.com or invisal@127.0.0.1).
Solution
Instead of using [@MERGEID], you can use @@MERGEID@@. Then you replace that with your ID later. This solution does not even require urlencode.
If you insist on using urlencode, you can just use %40 instead of @. So your code will look like this: $newlink = 'http://www.example.com/click/[%40MERGEID]?url=' . $link;
You can also do something like $newlink = 'http://www.example.com/click/' . urlencode('[@MERGEID]') . '?url=' . $link;
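If the placeholder has to stay as it is, the later replacement can simply look for every form it may have taken. A minimal sketch: replaceMergeTag is a hypothetical helper, and the third variant assumes that saveHTML() leaves "@" unescaped, as observed in the question's output.

```php
<?php
// Hypothetical helper: replaces a merge tag whether or not it was percent-encoded.
function replaceMergeTag(string $html, string $tag, string $value): string
{
    $encoded  = urlencode($tag);
    $variants = [
        $tag,                                  // raw form, e.g. inside text nodes
        $encoded,                              // fully urlencoded form
        str_replace('%40', '@', $encoded),     // urlencoded, but with "@" left alone
    ];
    return str_replace($variants, $value, $html);
}

$html = '<a href="http://www.example.com/click/%5B@MERGEID%5D?url=x">Google</a>';
echo replaceMergeTag($html, '[@MERGEID]', '42');
```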
The urlencode and rawurlencode functions are mostly based on RFC 1738. However, since 2005 the current standard for URIs has been RFC 3986.
On the other hand, the DOM extension uses UTF-8 encoding, which is defined in RFC 3629. Use utf8_encode() and utf8_decode() (both deprecated as of PHP 8.2) to work with texts in ISO-8859-1 encoding, or iconv for other encodings.
The generic URI syntax mandates that new URI schemes that provide for
the representation of character data in a URI must, in effect,
represent characters from the unreserved set without translation, and
should convert all other characters to bytes according to UTF-8, and
then percent-encode those values.
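To make the difference between the two PHP encoders concrete, here is a small comparison on a made-up sample string. urlencode() follows the form-encoding convention for spaces, rawurlencode() uses percent-encoding throughout, and both escape "@".

```php
<?php
$sample = 'a b@c'; // illustrative input only

$formStyle = urlencode($sample);    // space becomes "+"
$rfcStyle  = rawurlencode($sample); // space becomes "%20"

echo $formStyle, "\n"; // a+b%40c
echo $rfcStyle, "\n";  // a%20b%40c
```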
Here is a function to decode URLs according to RFC 3986.
<?php
function myUrlEncode($string) {
$entities = array('%21', '%2A', '%27', '%28', '%29', '%3B', '%3A', '%40', '%26', '%3D', '%2B', '%24', '%2C', '%2F', '%3F', '%25', '%23', '%5B', '%5D');
$replacements = array('!', '*', "'", "(", ")", ";", ":", "@", "&", "=", "+", "$", ",", "/", "?", "%", "#", "[", "]");
return str_replace($entities, $replacements, urldecode($string));
}
?>
PHP Fiddle.
Update:
Since UTF8 has been used to encode $message:
$dom_document->loadHTML(mb_convert_encoding($message, 'HTML-ENTITIES', 'UTF-8'))
Use urldecode($message) when returning the URL without percents.
die(urldecode($message) . ' and url encoded version: ' . $urlencodedmerge);
The root cause of your problem has been very well explained from a technical point of view.
In my opinion, however, there is a conceptual flaw in your approach, and it created the situation that you are now trying to fix.
By processing your input $message through a DOMDocument object, you have moved to a higher level of abstraction. It is wrong to manipulate as a single plain string something that has been "promoted" to an HTML stream.
Instead of trying to reproduce DOMDocument's behaviour, use the library itself to locate, extract and replace the values of interest:
$token = 'blah blah [@MERGEID]';
$message = '<a id="' . $token . '" href="' . $token . '"></a>';
$dom = new DOMDocument();
$dom->loadHTML($message);
echo $dom->saveHTML(); // now we have an abstract HTML document
// extract a raw value
$rawstring = $dom->getElementsByTagName('a')->item(0)->getAttribute('href');
// do the low-level fiddling
$newstring = str_replace($token, 'replaced', $rawstring);
// push the new value back into the abstract black box.
$dom->getElementsByTagName('a')->item(0)->setAttribute('href', $newstring);
// less code written, but works all the time
$rawstring = $dom->getElementsByTagName('a')->item(0)->getAttribute('id');
$newstring = str_replace($token, 'replaced', $rawstring);
$dom->getElementsByTagName('a')->item(0)->setAttribute('id', $newstring);
echo $dom->saveHTML();
As illustrated above, today we are trying to fix the problem when your token is inside a href, but one day we may want to search and replace the tag elsewhere in the document. To account for this case, do not bother making your low-level code HTML-aware.
(an alternative option would be not loading a DomDocument until all low-level replacements are done, but I am guessing this is not practical)
Complete proof of concept:
function searchAndReplace(DOMNode $node, $search, $replace) {
if($node->hasAttributes()) {
foreach ($node->attributes as $attribute) {
$input = $attribute->nodeValue;
$output = str_replace($search, $replace, $input);
$attribute->nodeValue = $output;
}
}
if(!$node instanceof DOMElement) { // this test needs double-checking
$input = $node->nodeValue;
$output = str_replace($search, $replace, $input);
$node->nodeValue = $output;
}
if($node->hasChildNodes()) {
foreach ($node->childNodes as $child) {
searchAndReplace($child, $search, $replace);
}
}
}
$token = '<>&;[@MERGEID]';
$message = '<a/>';
$dom = new DOMDocument();
$dom->loadHTML($message);
$dom->getElementsByTagName('a')->item(0)->setAttribute('id', "foo$token");
$dom->getElementsByTagName('a')->item(0)->setAttribute('href', "http://foo#$token");
$textNode = new DOMText("foo$token");
$dom->getElementsByTagName('a')->item(0)->appendChild($textNode);
echo $dom->saveHTML();
searchAndReplace($dom, $token, '*replaced*');
echo $dom->saveHTML();
If you use saveXML() it won't mess with the encoding the way saveHTML() does:
PHP
//your code...
$message = $dom_document->saveXML();
EDIT: also remove the XML tag:
//this will add an xml tag, so just remove it
$message=preg_replace("/\<\?xml(.*?)\?\>/","",$message);
echo $message;
Output
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
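<html><body><a data-tag="thebottomlink" href="http://www.example.com/click/[@MERGEID]?url=http://www.google.com?ref=abc&amp;tag=thebottomlink">Google</a></body></html>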
Notice that both still correctly convert & to &amp;
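As a variation on the above, passing a node to saveXML() serializes just that node, so no XML declaration is emitted and the preg_replace step is not needed. This is a sketch under the assumption that dropping the doctype as well is acceptable; the sample markup is made up.

```php
<?php
libxml_use_internal_errors(true); // keep parser warnings out of the output

$dom = new DOMDocument();
$dom->loadHTML('<a href="http://www.example.com/click/[@MERGEID]?a=1&amp;b=2">Google</a>');

// Serializing only the root element skips the XML declaration and the doctype
$message = $dom->saveXML($dom->documentElement);
echo $message;
```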
Would it not make sense to just urlencode the original [@MERGEID] when saving it in the first place as well? Your search should then match without the need for the str_replace.
$newlink = 'http://www.example.com/click/'.urlencode('[@MERGEID]').'?url=' . $link;
I know this does not answer the first post of the question, but you cannot post code in comments as far as I can tell.
I am trying to read a website's content, but I have a problem: I want to get images and links as the elements themselves, not just the element content. For instance, for a link I want to get the entire element, tag and attributes included.
How can I do this?
<?php
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "http://www.link.com");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$output = curl_exec($ch);
$dom = new DOMDocument;
@$dom->loadHTML($output);
$items = $dom->getElementsByTagName('a');
for($i = 0; $i < $items->length; $i++) {
echo $items->item($i)->nodeValue . "<br />";
}
curl_close($ch);
?>
You appear to be asking for the serialized HTML of a DOMElement? E.g. you want a string containing <a href="...">link text</a>? (Please make your question clearer.)
$url = 'http://example.com';
$dom = new DOMDocument();
$dom->loadHTMLFile($url);
$anchors = $dom->getElementsByTagName('a');
foreach ($anchors as $a) {
// Best solution, but only works with PHP >= 5.3.6
$htmlstring = $dom->saveHTML($a);
// Otherwise you need to serialize to XML and then fix the self-closing elements
$htmlstring = saveHTMLFragment($a);
echo $htmlstring, "\n";
}
function saveHTMLFragment(DOMElement $e) {
$selfclosingelements = array('></area>', '></base>', '></basefont>',
'></br>', '></col>', '></frame>', '></hr>', '></img>', '></input>',
'></isindex>', '></link>', '></meta>', '></param>', '></source>',
);
// This is not 100% reliable because it may output namespace declarations.
// But otherwise it is extra-paranoid to work down to at least PHP 5.1
$html = $e->ownerDocument->saveXML($e, LIBXML_NOEMPTYTAG);
// in case any empty elements are expanded, collapse them again:
$html = str_ireplace($selfclosingelements, '>', $html);
return $html;
}
However, note that what you are doing is dangerous because it could potentially mix encodings. It is better to have your output as another DOMDocument and use importNode() to copy the nodes you want. Alternatively, use an XSL stylesheet.
I'm assuming you just copy-pasted some example code and didn't bother trying to learn how it actually works...
Anyway, the ->nodeValue part takes the element and returns the text content (because the element has a single text node child - if it had anything else, I don't know what nodeValue would give).
So, just remove the ->nodeValue and you have your element.
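To turn such an element back into markup, it can be passed to saveHTML(). A minimal sketch on an inline HTML string instead of a fetched page; outerHtmlOf is a hypothetical helper name.

```php
<?php
// Hypothetical helper: returns the full markup of every <$tagName> element in $html.
function outerHtmlOf(string $html, string $tagName): array
{
    libxml_use_internal_errors(true); // real-world pages are rarely valid HTML
    $dom = new DOMDocument();
    $dom->loadHTML($html);

    $result = [];
    foreach ($dom->getElementsByTagName($tagName) as $node) {
        // Passing the node serializes the element itself, not just its text (PHP >= 5.3.6)
        $result[] = $dom->saveHTML($node);
    }
    return $result;
}

print_r(outerHtmlOf('<p><a href="/x">link <b>text</b></a></p>', 'a'));
```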
I had an interesting task today and couldn't find much on the subject. I wanted to share this, and ask for any suggestions on how this could have been done more elegantly. I consider myself a mediocre programmer who really wants to improve so any feedback is highly appreciated. There is also a strange bug I can't figure out. So here goes..and hopefully this helps someone who ever has to do something similar.
A client was redoing a site, moving content around, and had a couple thousand redirects that needed to be made. Marketing sent me an XLS with old URLs in one column, new URLs in the next. These were the actions I took:
1. Saved the XLS as CSV
2. Wrote a script which:
   - Formatted the list as valid 301 redirects
   - Exported the list to a text file
3. I then copy/pasted all the new directives into my .htaccess file.
Then, I wrote another script that checked to make sure each of the new links was valid (no 404s). The first script worked exactly as expected. For some reason, I can get the second script to print out all the 404 errors (there were several), but the script doesn't die when it finishes traversing the loop, and it doesn't write to the file, it just hangs in command line. No errors get reported. Any idea what's going on? Here is the code for both scripts:
Formatting 301s:
<?php
$source = "301.csv";
$output = "301.txt";
//grab the contents of the source file as an array, prepare the output file for writing
$sourceArray = file($source);
$handleOutput = fopen($output, "w");
//Set the strings we want to replace in an array. The first array are the original lines and the second are the strings to be replaced
$originalLines = array(
'http://hipaasecurityassessment.com',
','
);
$replacementStrings = array(
'',
' '
);
//Split each item from the array into two strings, one which occurs before the comma and the other which occurs after
function setContent($sourceArray, $originalLines = array(), $replacementStrings = array()){
$outputArray = array();
$text = 'redirect 301 ';
foreach ($sourceArray as $number => $item){
$pattern = '/[,]/';
$item = preg_split($pattern, $item);
$item = array(
$item[0],
preg_replace('#"#', '', $item[1])
);
$item = implode(' ', $item);
$item = str_replace($originalLines, $replacementStrings, $item);
array_push($outputArray,$text,$item);
}
$outputString = implode('', $outputArray);
return $outputString;
}
//Invoke the set content function
$outputString = setContent($sourceArray, $originalLines, $replacementStrings);
//Finally, write to the text file!
fwrite($handleOutput, $outputString);
Checking for 404s:
<?php
$source = "301.txt";
$output = "print404.txt";
//grab the contents of the source file as an array, prepare the output file for writing
$sourceArray = file($source);
$handleOutput = fopen($output, "w");
//Split each item from the array into two strings, one which occurs before the space and the other which occurs after
function getUrls($sourceArray = array()){
$outputArray = array();
foreach ($sourceArray as $number => $item){
$item = str_replace('redirect 301', '', $item);
$pattern = '#[ ]+#';
$item = preg_split($pattern, $item);
$item = array(
$item[0],
$item[1],
$item[2]
);
array_push($outputArray, $item[2]);
}
return $outputArray;
}
//Check each URL for a 404 error via a curl request
function check404($url = array(), $handleOutput){
$handle = curl_init($url);
curl_setopt($handle, CURLOPT_RETURNTRANSFER, TRUE);
$content = curl_exec( $handle );
$response = curl_getinfo( $handle );
$httpCode = curl_getinfo($handle, CURLINFO_HTTP_CODE);
if($httpCode == 404) {
//fwrite($handleOutput, $url);
print $url;
}
};
$outputArray = getUrls($sourceArray);
foreach ($outputArray as $url)
{
$errors = check404($url, $handleOutput);
}
You should have used fgetcsv() for generating the original URL list. It splits each CSV line into an array, which simplifies the transformation.
I can't say anything about the 404s or the cause of the hang. But using the wacky curl functions is almost always a bad sign. For testing purposes I would have used a command-line tool like wget instead, so the results can be checked manually.
But maybe you could try PHP's own get_headers() instead. It returns the raw response headers. Note that by default it follows redirects and returns the headers of every response in the chain.
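Both suggestions can be sketched roughly as follows. The helper names and the example.com hosts are assumptions for illustration. The status helper works on the array that get_headers() returns, so it can be tried without network access.

```php
<?php
// Hypothetical helper: turns "old URL,new URL" CSV rows into redirect directives.
function csvToRedirects($handle, string $oldHost): array
{
    $lines = [];
    // fgetcsv() takes care of splitting the row and removing the quotes
    while (($row = fgetcsv($handle, 0, ',', '"', '\\')) !== false) {
        if (count($row) < 2) {
            continue; // blank or incomplete row
        }
        $oldPath = str_replace($oldHost, '', trim($row[0]));
        $lines[] = 'redirect 301 ' . $oldPath . ' ' . trim($row[1]);
    }
    return $lines;
}

// Hypothetical helper: status code of the last response in a get_headers() result.
function lastStatusCode(array $headers): ?int
{
    $code = null;
    foreach ($headers as $header) {
        if (preg_match('#^HTTP/\S+\s+(\d{3})#', $header, $m)) {
            $code = (int) $m[1]; // later status lines overwrite earlier ones
        }
    }
    return $code;
}

// Usage sketch (needs network access):
// $headers = @get_headers('http://www.example.com/new-page');
// if ($headers !== false && lastStatusCode($headers) === 404) { print "404\n"; }
```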
I'm trying to write a parser for an XML postback listener, but can't seem to get it to dump the XML for a sample. The API support guy told me to use 'DOMDocument', or maybe 'SimpleXML'? Anyway, here's the code (thanks!):
<?php
$xml_document = file_get_contents('php://input');
$doc = new DOMDocument();
$doc->loadXML($xml_document);
$doc->save("test2/".time().".sample.xml").".xml");
?>
How about using this to create an XML file?
/**
* Will output in a similar form to print_r, but the nodes are xml so can be collapsed in browsers
*
* @param mixed $mixed
*/
function print_r_xml($mixed)
{
// capture the output of print_r
$out = print_r($mixed, true);
// Replace the root item with a struct
// MATCH : '<start>element<newline> ('
$root_pattern = '/[ \t]*([a-z0-9 \t_]+)\n[ \t]*\(/i';
$root_replace_pattern = '<struct name="root" type="\\1">';
$out = preg_replace($root_pattern, $root_replace_pattern, $out, 1);
// Replace array and object items structs
// MATCH : '[element] => <newline> ('
$struct_pattern = '/[ \t]*\[([^\]]+)\][ \t]*\=\>[ \t]*([a-z0-9 \t_]+)\n[ \t]*\(/miU';
$struct_replace_pattern = '<struct name="\\1" type="\\2">';
$out = preg_replace($struct_pattern, $struct_replace_pattern, $out);
// Replace ')' on its own on a new line (surrounded by whitespace is ok) with '</struct>'
$out = preg_replace('/^\s*\)\s*$/m', '</struct>', $out);
// Replace simple key=>values with vars
// MATCH : '[element] => value<newline>'
$var_pattern = '/[ \t]*\[([^\]]+)\][ \t]*\=\>[ \t]*([a-z0-9 \t_\S]+)/i';
$var_replace_pattern = '<var name="\\1">\\2</var>';
$out = preg_replace($var_pattern, $var_replace_pattern, $out);
$out = trim($out);
$out='<?xml version="1.0"?><data>'.$out.'</data>';
return $out;
}
In my application I passed all of the $_POST variables to it:
$handle = fopen("data.xml", "w+");
$content = print_r_xml($_POST);
fwrite($handle,$content);
fclose($handle);
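Going back to the listener in the question: the save() call there ends in an extra .".xml") which is a syntax error, so nothing gets written. A minimal sketch of the DOMDocument route follows. savePostedXml is a hypothetical helper, and the target directory is assumed to exist and be writable.

```php
<?php
// Hypothetical helper: validates $xml and saves it under $dir.
// Returns the path of the written file, or null if the payload is empty or not well-formed.
function savePostedXml(string $xml, string $dir): ?string
{
    if ($xml === '') {
        return null;
    }
    libxml_use_internal_errors(true); // collect parse errors instead of emitting warnings
    $doc = new DOMDocument();
    if (!$doc->loadXML($xml)) {
        return null;
    }
    $path = $dir . '/' . time() . '.sample.xml';
    return $doc->save($path) !== false ? $path : null;
}

// In the listener:
// savePostedXml(file_get_contents('php://input'), __DIR__ . '/test2');
```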