Magento Controller displaying big XML file causes odd rendering in Browser - php

I am working on a relatively extensive but not huge XML file that is being delivered from a custom Module in Magento using a controller. Everything has been going well and I have been able to get it to work and add nodes with no issues. The browser (Chrome in this case) has been rendering the XML document fine and stylizing it as expected. In order to display the XML headers properly via Magento I am using the following code:
/* Set display to render output as an XML document */
$this->loadLayout(false);
$this->getResponse()->setHeader('Content-Type','text/xml');
echo "<Magento >";
.
.
.
echo "</Magento>\n";
$this->renderLayout();
When the output reaches 140 nodes, it stops rendering as formatted XML and just shows the data as if it were being rendered as HTML without knowing the node definitions.
If I comment out a node so there are 139, it renders properly. If I save the full 140+ node XML page as a file and drop that file into Chrome, it also renders properly.
If I run the 140+ node XML file through a validator, it comes back as valid XML. I also get the same broken rendering if I view the page in Firefox instead.
The question: is there a limit on how big an XML file can be when it is delivered via a Magento controller? If so, can I raise this limit to more than 139 nodes so the page renders properly? Conversely, should I not worry about whether the browser renders the page properly, since it is going to be consumed by a different program and not a browser?

Check your header with curl. My guess is it's not being set.
curl -I http://example.com
Try setting your header directly with
header('Content-Type: text/xml');
Setting a header on the response object will only impact content that's delivered via the response object. Since you're echoing code directly, the response object never gets the chance to send its headers.
Both Chrome and Firefox have a set of heuristics that attempt to guess how a document should be rendered. Even with a text/html header, a short XML document may be detected as XML, triggering XML rendering. Once it reaches a certain length, the browsers guess that it's an HTML document, and the HTML rendering engine takes over.
Ensuring your header is set correctly should solve the problem.
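For reference, here is a minimal sketch of what that could look like in a Magento 1 controller action, building the XML as a string and handing it to the response object instead of echoing it (the node contents are placeholders, not from the original question):
public function xmlAction()
{
    // Build the document as a string rather than echoing it directly,
    // so the response object sends its headers together with the body.
    $xml  = '<?xml version="1.0" encoding="UTF-8"?>';
    $xml .= '<Magento>';
    // ... append your nodes here ...
    $xml .= '</Magento>';

    $this->getResponse()
        ->setHeader('Content-Type', 'text/xml', true)
        ->setBody($xml);
}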

Usually this problem happens when a tag is broken (not closed) or unsupported characters are inserted. I'm not aware of any size limit for XML; perhaps the browser's validation is failing.

Related

How to set PDF page title from Symfony Controller?

Here is my method in a Controller:
/**
 * @Route("/bilgi/agr", name="user_agreement")
 */
public function agr(): Response
{
    $response = new BinaryFileResponse(__DIR__ . '/../../public/docs/User_Agreement.pdf');
    $response->setContentDisposition(ResponseHeaderBag::DISPOSITION_INLINE, $response->getFile()->getFileName());

    return $response;
}
I'm expecting to see the page title as User_Agreement.pdf, but instead it is agr, which is not appropriate. I can't change the route because it is used by several other classes/files.
Is there any way I can set a custom title, or at least the file name? When I save the file, the file name is User_Agreement.pdf, so the file name is correct.
If that is not possible, is there a workaround to show it in Twig/HTML?
You can't. <title> is an HTML element, not a PDF one.
If a browser renders a PDF directly (as many/most do nowadays), the only thing it could use for a "title" is the URL of the request: agr in your case.
It's basically something the server side has no control of. It just sends the response, and the browser decides how to show it, and what to show in the space usually reserved for the <title> element.
With the Content-Disposition header you can hint the browser about what name they should suggest for the file to the end-user, but that's all.
If you absolutely need this, yes, you could send a regular HTML response and somehow show the PDF inlined/embedded on the page.
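As a rough sketch of that workaround (the extra route name and template path below are assumptions for illustration), you could add a wrapper action that renders an HTML page with the desired <title> and embeds the PDF inline:
/**
 * @Route("/bilgi/agr-view", name="user_agreement_view")
 */
public function agrView(): Response
{
    // The template would contain a <title> element and an <embed> or
    // <iframe> pointing at the existing user_agreement route.
    return $this->render('docs/pdf_viewer.html.twig', [
        'title'   => 'User_Agreement.pdf',
        'pdf_url' => $this->generateUrl('user_agreement'),
    ]);
}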

Get text from url.jsonp with PHP

I'm trying to get the plain text from this webpage: https://html2-f.scribdassets.com/55ssxtbbb45pk2eg/pages/319-42c28ee981.jsonp
which upon inspection is a callback function that inserts HTML. I'm trying to scrape the page and reformat the text so it is comprehensible, actually rendering the HTML instead of treating it as plain text.
PHP:
echo file_get_contents("https://html2-f.scribdassets.com/55ssxtbbb45pk2eg/pages/319-42c28ee981.jsonp");
The returned text is a complete mess:
����X321-5db7e88872.jsonp�Y]n�6���E�ıH�;��E�#���b�PM��%�f#K�H��}�;�z���:�eG"e��:#�E����j��XޖdJ���$�&$~����>a�8#��p�ӥy��X��8�r��(#kZ���85�j�A�%��������Ȇ�...
Whereas it should look like this:
"<div class=\"newpage\" id=\"page319\" style=\"width: 902px; height:1167px\">\n<div class=text_layer style=\"z-index:2\"><div class=ie_fix>\n \n<div class=\"ff81\" style=\"font-size:114px\">\n<span class=a style=\"left:331px;top:75px;color:#ffffff\">1<span class=w9></span>3</span></div>...
Although I could manually copy/paste the text from the webpage into a text editor for future usage, I would like to eliminate this step as I'll need to do this for 320 pages.
Is there some workaround for .jsonp URLs? Or is the data encrypted by the server? (I just don't know.)
The response is gzip'd. You can see it in the response headers:
Content-Encoding: gzip
So, you need to unzip it. You can do this either by changing your whole approach and using cURL, or by using the stream wrapper compress.zlib://. Just prepend that to the URL:
echo file_get_contents("compress.zlib://https://html2-f.scribdassets.com/55ssxtbbb45pk2eg/pages/319-42c28ee981.jsonp");
That will get you the correct response. Notice that this is still a JSONP response, so it's in form of a callback. You need to decide what to do with it.
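If you prefer cURL instead, a minimal sketch could look like this: setting CURLOPT_ENCODING to an empty string makes cURL advertise the encodings it supports and transparently decompress the gzipped body.
$ch = curl_init("https://html2-f.scribdassets.com/55ssxtbbb45pk2eg/pages/319-42c28ee981.jsonp");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_ENCODING, "");   // let cURL handle gzip/deflate automatically
$body = curl_exec($ch);
curl_close($ch);

echo $body; // still a JSONP callback; strip the wrapper before using the HTML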

file_get_contents not actually grabbing file? Blank?

Try this sample code I threw together to illustrate a point:
<?php
$url = "http://www.amazon.com/gp/offer-listing/B003WSNV4E/";
$html = file_get_contents($url);
echo($html);
?>
The amazon homepage works fine using this method (it is echoed in the browser), but this page just doesn't output anything. Is there a reason for this, and how can I fix it?
I think your problem is that you're misunderstanding your own code.
You made this comment on the question (emphasis mine):
I've never used those utilities before, so maybe I'm doing it wrong but it only seems to be downloading this page: https://www.amazon.com/gp/offer-listing/B003WSNV4E/ref=dp_olp_new?ie=UTF8&condition=new
This implies to me that an Amazon page is appearing in your browser when you run this code. This is entirely expected.
When you try to download https://rads.stackoverflow.com/amzn/click/B003WSNV4E, you're being redirected to https://www.amazon.com/gp/offer-listing/B003WSNV4E/ref=dp_olp_new?ie=UTF8&condition=new which is the intent of StackOverflow's RADS system.
What happens from there is your code is loading the raw HTML into your $html variable and dumping it straight to the browser. Because you're passing raw HTML to the browser, the browser is interpreting it as such, and it tries (and succeeds) in rendering the page.
If you just want to see the code, but not render it, then you need to convert it into html entities first:
echo htmlentities($html);
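Putting it together, a small sketch of the original snippet that shows the page source instead of rendering it:
<?php
$url  = "http://www.amazon.com/gp/offer-listing/B003WSNV4E/";
$html = file_get_contents($url);

// Escape the markup and wrap it in <pre> so the browser displays
// the raw HTML instead of rendering the Amazon page.
echo '<pre>' . htmlentities($html) . '</pre>';
?>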

Get all content with file_get_contents()

I'm trying to retrieve a webpage that has XML data using file_get_contents().
$get_url_report = 'https://...'; // GET URL
$str = file_get_contents($get_url_report);
The problem is that file_get_contents() gets only the secure content of the page and returns only some strings without the XML. In Windows IE, if I type in $get_url_report, it warns me and asks whether I want to display everything. If I click yes, it shows me the XML, which is what I want to store in $str. Any ideas on how to retrieve the XML data into a string from the webpage at $get_url_report?
You should already be getting the pure XML if the URL is correct. If you're having trouble, perhaps the URL is expecting you to be logged in or something similar. Use a var_dump($str) and then view source on that page to see what you get back.
Either way, there is no magic way to get any linked content from the XML. All you would get is the XML itself and would need further PHP code to process and get any links/images/data from it.
Verify that openssl is enabled in your PHP. A good example of how to do it:
How to get file_get_contents() to work with HTTPS?
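As a quick sanity check (a minimal sketch, not taken from the linked answer), you can verify the extension before fetching:
<?php
// file_get_contents() needs the openssl extension for https:// URLs.
if (!extension_loaded('openssl')) {
    die('openssl is not enabled; enable extension=openssl in php.ini');
}

$get_url_report = 'https://...'; // GET URL from the question
$str = file_get_contents($get_url_report);
var_dump($str); // inspect what actually came back
?>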

How to detect if a page is an RSS or ATOM feed

I'm currently building a new online feed reader in PHP. One of the features I'm working on is feed auto-discovery. If a user enters a website URL, the script will detect that it's not a feed and look for the real feed URL by parsing the HTML for the proper <link> tag.
The problem is that the way I'm currently detecting whether the URL is a feed or a website only works part of the time, and I know it can't be the best solution. Right now I'm taking the cURL response and running it through simplexml_load_string(); if it can't be parsed, I treat it as a website. Here is the code.
$xml = @simplexml_load_string( $site_found['content'] );
if( !$xml ) // this is a website, not a feed
{
    // handle website
}
else
{
    // parse feed
}
Obviously, this isn't ideal. Also, when it runs into an HTML website that it can parse, it thinks it's a feed.
Any suggestions on a good way of detecting the difference between a feed or non-feed in PHP?
I would sniff for the various unique identifiers those formats have:
Atom: Source
<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
RSS 0.90: Source
<rdf:RDF
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns="http://my.netscape.com/rdf/simple/0.9/">
Netscape RSS 0.91
<rss version="0.91">
etc. etc. (See the 2nd source link for a full overview).
As far as I can see, separating Atom and RSS should be pretty easy by looking for <feed> and <rss> tags, respectively. Plus you won't find those in a valid HTML document.
You could make an initial check to tell HTML and feeds apart by looking for <html> and <body> elements first. To avoid problems with invalid input, this may be a case where using regular expressions (over a parser) is finally justified for once :)
If it doesn't match the HTML test, run the Atom / RSS tests on it. If it is not recognized as a feed, or the XML parser chokes on invalid input, fall back to HTML again.
What that looks like in the wild - whether feed providers always conform to those rules - is a different question, but you should already be able to recognize a lot this way.
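A minimal sketch of that sniffing approach (the function name and the exact patterns are illustrative, not from the original answer; it checks for HTML first, then for the feed root elements):
function detect_content_type($content)
{
    // HTML documents: look for a doctype, <html>, or <body> element.
    if (preg_match('/<(!DOCTYPE\s+html|html|body)[\s>]/i', $content)) {
        return 'html';
    }
    // Atom feeds use a <feed> root element; RSS uses <rss> or <rdf:RDF>.
    if (preg_match('/<feed[\s>]/i', $content)) {
        return 'atom';
    }
    if (preg_match('/<(rss|rdf:RDF)[\s>]/i', $content)) {
        return 'rss';
    }
    return 'unknown';
}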
I think your best choice is getting the Content-Type header, as I assume that's the way Firefox (or any other browser) does it. Besides, if you think about it, the Content-Type header is indeed the way the server tells user agents how to process the response content. Almost any decent HTTP server sends a correct Content-Type header.
Nevertheless, you could try to identify RSS/Atom in the content as a second choice if the first one "fails" (that criterion is up to you).
An additional benefit is that you only need to request the header instead of the entire document, thus saving you bandwidth, time, etc. You can do this with curl like this:
<?php
$ch = curl_init("http://sample.com/feed");
curl_setopt($ch, CURLOPT_NOBODY, true); // sets the HTTP request method to HEAD instead of GET (the default), so the server sends only headers, no content
curl_exec($ch);
$conType = curl_getinfo($ch, CURLINFO_CONTENT_TYPE);

if (is_rss($conType)) {        // You need to implement the is_rss($conType) function
    // TODO
} elseif (is_html($conType)) { // You need to implement the is_html($conType) function
    // Search for a feed inside the HTML
} else {
    // Error: the page has no RSS/Atom feed
}
?>
Why not try to parse your data with a component built specifically to parse RSS/Atom feeds, like Zend_Feed_Reader?
With that, if the parsing succeeds, you'll be pretty sure that the URL you used is indeed a valid RSS/ATOM feed.
And I should add that you could use such a component to parse feeds and extract their information, too: no need to re-invent the wheel, parse the XML "by hand", and deal with special cases yourself.
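A rough sketch of that approach with Zend Framework 1 (assuming the ZF1 library is on your include path; the feed URL is a placeholder):
require_once 'Zend/Feed/Reader.php';

try {
    // import() fetches and parses the URL; it throws an exception if the
    // content is not a recognizable RSS/Atom feed.
    $feed = Zend_Feed_Reader::import('http://example.com/feed');
    foreach ($feed as $entry) {
        echo $entry->getTitle(), "\n";
    }
} catch (Exception $e) {
    // Not a feed (or not reachable): fall back to HTML auto-discovery.
}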
Use the Content-Type HTTP response header to dispatch to the right handler.
