Error in XML sitemap - PHP

I have a problem with my website: I can't generate a sitemap from the site https://www.xml-sitemaps.com/. Every time I generate one, I get an empty sitemap with this comment:
"This XML file does not appear to have any style information associated with it. The document tree is shown below."
I really don't know how to solve this, so could anyone please help me with it?

That's not an error; that's exactly as expected. Sitemaps aren't for you to look at; they're designed for a machine to process.
Reference

XML and HTML are like cousins. They share a lot of their syntax, but HTML has standardized semantics, while the semantics of XML are entirely up to you, the creator of the XML document.
So what happens here: your browser is reading XML, but a tag is not considered something to be output; it is considered to describe something, which can be rendered in a certain way if styling information is available.
So what you see is the browser (probably IE; get a proper browser) telling you it doesn't know how to present the XML file properly and deciding not to show it at all. But if you look at the source (Ctrl+U or Cmd+U), your sitemap will be there.
Although, in contrast to what #ShakirKhan says, XML is designed to be readable by both humans and computers. So the message is understandable, but it should not happen; you should just see the XML file (so, get a proper browser).
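For reference, here is a minimal sketch of serving a sitemap from PHP (the URLs are placeholders; a real generator would pull them from your database or routes). Served with the right Content-Type, this is a perfectly valid sitemap even though the browser shows that "no style information" message:

<?php
// sitemap.php - emit a minimal, valid sitemap (the URLs are hypothetical).
header('Content-Type: application/xml; charset=utf-8');

$urls = ['https://www.example.com/', 'https://www.example.com/about'];

echo '<?xml version="1.0" encoding="UTF-8"?>' . PHP_EOL;
echo '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">' . PHP_EOL;
foreach ($urls as $url) {
    // htmlspecialchars() escapes &, < and > so the XML stays well-formed.
    echo '  <url><loc>' . htmlspecialchars($url) . '</loc></url>' . PHP_EOL;
}
echo '</urlset>' . PHP_EOL;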

Related

Trying to figure out how to "hide" images and source code

I'm in the process of (slowly) learning how to make my websites more secure. I was checking out D&D Beyond and noticed a few things I've never seen before and would like to learn more about.
Portions of the source code don't show up when you View the Source.
It's hard to explain. I tried to explain it in a different post, and I got a ton of snarky remarks. I'm telling you, I know what I saw. I would like to know how this is possible and how I can replicate it.
I typically write in PHP/jQuery, so I'd primarily like to learn more using those languages.
Example:
You can create a Character using their Character Builder, then view your Character Sheet. The main portion of your character's stats are enclosed in a very large parent div: ".character_sheet"
If you MANUALLY save your Character Sheet to your Desktop, you can see the HTML for this section. If you inspect this section in Firefox, you can also see the data. However, if you try Ctrl+U while in the browser, the HTML in this section does not appear. It also will not appear if you try curl/fopen/file_get_contents.
Additionally, images are not visible by normal means.
For example: I am aware of how to disable right-clicking on a website, but if someone wanted to take my images, all they'd have to do is open my source code, look at the image URL, and save it from there.
On the D&D Beyond site, I can bring up Firefox's web inspector where an image SHOULD be, take a look at the CSS, and... nothing. No link to an image where one should be. I don't know how they're getting images to appear without CSS/HTML. I'd be very interested to know how this is done.
If anyone has any insight/guesses/etc and can point me in the right direction to learn some more, I'd really appreciate it!
Server-side code such as PHP is always hidden from visitors (unless you have a security vulnerability of some sort).
Client-side code such as HTML, JavaScript and CSS is always visible to the visitor. Even if you can't see it immediately in the DOM, it will be hiding there somewhere.
The most likely scenario is that it is hidden within an embedded .js or .css file, which would look similar to the following:
<script src="scripts.js"></script>
<link rel="stylesheet" type="text/css" href="theme.css">
HTML can be written to the page through JavaScript, in which case it will not show up in the raw page source (Ctrl+U), though it will still appear in the live DOM inspector (unlike HTML produced by a PHP echo, which appears in both). HTML can also be 'hidden' through use of <iframe> tags and HTML imports.
JavaScript has a wide array of ways in which it can be obfuscated or mangled, so it can be hard to track down. You may see some strange, 'unreadable' code in the DOM / .js files, which in turn could be outputting the HTML itself.
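To see that behaviour yourself, here is a minimal sketch (the file name and element id are hypothetical); request the page and compare View Source with the inspector:

<?php
// demo.php (hypothetical name) - the PHP block itself is never sent to
// the visitor. View Source (Ctrl+U) shows only the static markup below;
// the paragraph the script injects appears only in the live DOM inspector.
?>
<div id="sheet"></div>
<script>
  // Built at runtime, after the source the server sent was already fixed.
  document.getElementById('sheet').innerHTML = '<p>Injected client-side</p>';
</script>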
Please consider the points below:
All client-side resources are viewable, although you can make them less readable with JavaScript; it's better to keep most of your code on the server side.
You need to know what search engines favor if your app is a public website that will be indexed by them, as some search engines don't crawl pages that are rendered only by JavaScript.
You can display images without <img> tags using the CSS background-image property (see the sketch after these points).
There are some useful tools to make your code harder to read, such as the Closure Compiler Service, JSFuck, and JS packers, although it's better to understand those techniques yourself and just add them to your knowledge; note that they will make your code larger.
In any case, there is no such thing as an empty page source; it should contain at least a <script> tag. If you see a genuinely blank page, the server may be refusing to serve the page as a top-level window, and it may work when embedded in an iframe, or when specific request headers are sent, or whatever else.
You can make your server and client sides cooperate :) to get a better, more secure result.
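As an illustration of the background-image point, a minimal sketch (the class name and image path are hypothetical); note there is no <img> tag anywhere in the markup:

<style>
  /* The image is attached purely via CSS, so nothing appears where an
     <img> tag would normally be in the markup. */
  .portrait {
    width: 200px;
    height: 200px;
    background-image: url('/images/portrait.png');
    background-size: cover;
  }
</style>
<div class="portrait"></div>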

How do I find out which PHP function in WordPress is generating blocks of HTML, using the browser inspector?

I'm developing a WordPress site, which I'm fairly new to. I'm not sure if this is a stupid question or not, but I haven't been able to find any decent Google results on this. Anyway, is there a way to find out which PHP function is generating a piece of HTML code using a browser code inspector like Chrome's? Thanks!
No.
Once the data arrives at the browser, all the PHP code has been processed, and you can't know which part of the PHP generated which part of the HTML.
No, not without modifying the PHP code to enable some kind of debugging. Chrome can only give you information about the HTML document received on the client side (you). PHP code gets executed server-side.
You kind of can:
Download a copy of the theme and plugins folder
Open the page on your site that you want to find the function for.
Find a div/class that is specific to the section, e.g. <article>
Open a text editor like Notepad++ (one that will allow you to search through multiple files at once)
Use the find feature of your chosen text editor and search for that div/class
The results will show you a list of files where that term appears.
Look through those files for the function you are looking for (it might take a few goes)
The above is a bit of a roundabout way of doing it, but other than looking through each file separately, I think it is your next best option. A scripted version of the search is sketched below.
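If you would rather script the search than use an editor, here is a minimal PHP sketch of the same multi-file lookup (the theme path and the class name are placeholders):

<?php
// Recursively search a theme folder for a class name (both hypothetical).
$needle = 'character_sheet';
$dir = new RecursiveDirectoryIterator('/path/to/wp-content/themes/my-theme');

foreach (new RecursiveIteratorIterator($dir) as $file) {
    if ($file->isFile() && $file->getExtension() === 'php') {
        foreach (file($file->getPathname()) as $lineNo => $line) {
            if (strpos($line, $needle) !== false) {
                // Print file:line so you know where to look.
                echo $file->getPathname() . ':' . ($lineNo + 1) . ' ' . trim($line) . PHP_EOL;
            }
        }
    }
}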

Formatting HTML correctly from cURL requests

I am working on an applet that allows the user to input a URL to a news article or other webpage (in Japanese) and view the contents of that page within an iFrame in my page. The idea is that once the content is loaded into the page, the user can highlight words using their cursor, which stores the selected text in an array (for translating/adding to a personal dictionary of terms) and surrounds the text in a red box (div) according to a stylesheet defined on my domain. To do this, I use cURL to retrieve the HTML of the external page and dump it into the source of the iFrame.
However, I keep running into major formatting problems with the retrieved HTML. The big problem is preserving style sheets, and to fix this, I've used DOMDocument to add the appropriate tags to the head section of the retrieved HTML. This works for some pages/URLs, but there are still lots of style problems in the output HTML for many others. For example, div layers crash into each other, alignments are off, and backgrounds are missing. This is made a bit more problematic because I need to embed the output HTML in a new <div> to make the onClick JavaScript function for passing text selections in the embedded content work, which means the resulting source ends up looking like this:
<div onclick="parent.selectionFunction()" id ="studyContentn">
<!-- HTML of output from cURL, including doctype declarations and <html>,<head> tags -->
</div>
It seems like most of the formatting issues I keep running into are largely arbitrary. I've tried using PHP Tidy to clean the retrieved HTML, but that also only works for some pages and not many others. I have a slight suspicion it may have to do with CDATA declarations that get parsed oddly when working with DOMDocument, but I am not certain.
Is there a way I can guarantee that HTML output from cURL will be rendered correctly and faithfully in all instances? Or is there perhaps a better way of going about doing this? I've tried a bunch of different ways of approaching this issue, and each gets closer to a solution but brings its own new problems as well.
Thanks -- let me know if I can clarify anything.
If I understand correctly, you are trying to pull the HTML of a complete web page and display it under your domain, in your HTML. This is always going to be tricky: a lot of JavaScript will break, relative URLs will be wrong, and, as you mentioned, styles as well. You're probably also changing the dimensions the page is displayed in. These can all be worked around, but you're going to be fighting an uphill battle with each new site, or whenever a current site changes its design.
I'd probably take a different approach to the problem. You might want to write a browser plugin as the interface to the external website instead. Then your applet can sit on top of the functional and tested (hopefully) site, and you can focus on what you need to do for your applet rather than a never-ending list of fiddly HTML issues.
I am trying to do a similar thing. It is very difficult to preserve the formatting, and the JS scripts in the page complicate things further. I finally gave up on the idea of displaying the original format completely and settled for a workaround:
Select only the headers, links, lists, and paragraphs you are interested in.
Add the domain path of your own site to the links.
Wrap the headers, links, and other items in your own classes.
Display it.
In your case you want to select text and store it, which is another topic. What I did was parse the HTML at two levels, after which the selection is easy to do. Keep in mind that IE and Firefox/Chrome need to be handled separately. A rough sketch of the fetch-and-rewrite step follows.
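For the fetch-and-rewrite step both answers describe, here is a minimal PHP sketch (the URL is a placeholder, and it only handles root-relative links; a real version would also rewrite src attributes and protocol-relative URLs):

<?php
// Fetch the external page and rewrite root-relative links so they keep
// pointing at the original site (the URL is a placeholder).
$base = 'https://example.com';
$ch = curl_init($base . '/article');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$html = curl_exec($ch);
curl_close($ch);

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // real-world HTML is rarely well-formed
$doc->loadHTML($html);
libxml_clear_errors();

foreach ($doc->getElementsByTagName('a') as $a) {
    $href = $a->getAttribute('href');
    if ($href !== '' && $href[0] === '/') {   // relative to the site root
        $a->setAttribute('href', $base . $href);
    }
}
echo $doc->saveHTML();   // embed this output in your wrapper <div>/iframe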

How do I allow video embed HTML safely on a site?

I have a PHP application in which we allow every user to have a "public page" which shows their linked video. We have an input textbox where they can specify the embedded video's HTML code. The problem we're running into is that if we take that input and display it on the page as-is, all sorts of scripts can be inserted, leading to a very insecure system.
We want to allow embed code from all sites, but since they differ in how they're structured, it becomes difficult to keep tabs on how each one is structured.
What are the approaches folks have taken to tackle this scenario? Are there third-party scripts that do this for you?
Consider using some sort of pseudo-template which takes advantage of oEmbed. oEmbed is a safe way to link to a video (as the content authority, you're not allowing direct embed, but rather references to embeddable content).
For example, you might write a parser that searches for something like:
[embed]http://oembed.link/goes/here[/embed]
You could then use one of the many PHP oEmbed libraries to request the resource from the provided link and replace the pseudo-embed code with the real embed code.
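As a rough sketch of that flow (this calls YouTube's public oEmbed endpoint directly instead of a library, handles only one provider, and skips the caching and error handling you would want in production):

<?php
// Replace [embed]URL[/embed] pseudo-tags with provider-supplied embed HTML.
// Requires allow_url_fopen; a real version would use cURL and cache results.
function expandEmbeds(string $text): string {
    return preg_replace_callback('#\[embed\](.+?)\[/embed\]#i', function ($m) {
        $endpoint = 'https://www.youtube.com/oembed?format=json&url=' . urlencode($m[1]);
        $json = @file_get_contents($endpoint);
        $data = $json ? json_decode($json, true) : null;
        // If the provider rejects the URL, output nothing rather than raw input.
        return $data['html'] ?? '';
    }, $text);
}

echo expandEmbeds('[embed]https://www.youtube.com/watch?v=dQw4w9WgXcQ[/embed]');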
Hope this helps.
I would have the users input the URL to the video. From there you can insert the proper code yourself. It's easier for them and safer for you.
If you encounter an unknown URL, just log it, and add the code needed to support it.
The best approach would be to have a whitelist of allowed tags and remove everything else. It would also be necessary to filter the attributes of those tags to remove the "onsomething" event attributes.
In order to do proper parsing, you need to use an XML parser. XMLReader and XMLWriter would work nicely for this. You read the data from the XMLReader and, if the tag is in the whitelist, you write it to the XMLWriter. At the end of the process, you have your parsed data in the XMLWriter.
A code example of this would be this script. Its whitelist contains the tags test and video. If you give it the following input:
<z><test attr="test"></test><img />random text<video onclick="evilJavascript"><test></test></video></z>
It will output this:
<div><test attr="test"></test>random text<video><test></test></video></div>
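Since the linked script may no longer be available, here is a minimal sketch of that XMLReader/XMLWriter approach; it follows the example above, including rewriting the root element to <div>, and a real filter would also need to handle comments, CDATA, and entities:

<?php
$whitelist = ['test', 'video'];
$input = '<z><test attr="test"></test><img />random text'
       . '<video onclick="evilJavascript"><test></test></video></z>';

$reader = new XMLReader();
$reader->XML($input);

$writer = new XMLWriter();
$writer->openMemory();

while ($reader->read()) {
    if ($reader->nodeType === XMLReader::ELEMENT) {
        $allowed = in_array($reader->name, $whitelist, true);
        $isEmpty = $reader->isEmptyElement;
        if ($reader->depth === 0) {
            $writer->startElement('div');          // rewrite the root element
        } elseif ($allowed) {
            $writer->startElement($reader->name);
            while ($reader->moveToNextAttribute()) {
                if (stripos($reader->name, 'on') !== 0) {  // drop on* handlers
                    $writer->writeAttribute($reader->name, $reader->value);
                }
            }
            $reader->moveToElement();
            if ($isEmpty) {
                $writer->fullEndElement();
            }
        }
        // Disallowed, non-root tags (like <img />) are simply not written.
    } elseif ($reader->nodeType === XMLReader::END_ELEMENT) {
        if ($reader->depth === 0 || in_array($reader->name, $whitelist, true)) {
            $writer->fullEndElement();
        }
    } elseif ($reader->nodeType === XMLReader::TEXT) {
        $writer->text($reader->value);
    }
}

echo $writer->outputMemory();
// <div><test attr="test"></test>random text<video><test></test></video></div>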

How to know if the website being scraped has changed?

I'm using PHP to scrape a website and collect some data. It's all done without using regex; I'm using PHP's explode() function to find particular HTML tags instead.
It is possible that if the structure of the website changes (CSS, HTML), the scraper may collect wrong data. So the question is: how do I know if the HTML structure has changed? How can I identify this before storing any data to my database, to avoid storing wrong data?
I don't think you have any clean solutions if you are scraping a page whose content changes.
I have developed several Python scrapers, and I know how frustrating it can be when a site makes just a subtle change to its layout.
You could try a solution à la mechanize (I don't know the PHP counterpart), and if you are lucky you could isolate the content you need to extract (links?).
Another possible approach would be to code some constraints and check them before storing to the DB (a sketch follows below).
For example, if you are scraping URLs, you will need to verify that what the scraper has parsed is formally a valid URL; the same goes for integer IDs or anything else you scrape that can be recognized as valid.
If you are scraping plain text, that will be more difficult to check.
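A minimal sketch of that constraint idea (the field names are hypothetical):

<?php
// Validate scraped values before they reach the database.
$scraped = ['url' => 'https://example.com/a', 'id' => '42'];

$validUrl = filter_var($scraped['url'], FILTER_VALIDATE_URL) !== false;
$validId  = filter_var($scraped['id'], FILTER_VALIDATE_INT) !== false;

if (!$validUrl || !$validId) {
    // The layout probably changed; skip the insert and raise an alert.
    error_log('Scraper constraint failed - check the target site layout.');
}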
It depends on the site, but you could count the number of page elements in the scraped page, like div, class, and style tags, then compare these totals against those of later scrapes to detect whether the page structure has changed.
A similar process could be used for the CSS file, where the name of each class or id could be extracted using a simple regex, stored, and checked as needed. If this list gains new additions, then the page structure has almost certainly changed somewhere on the site being scraped.
Speaking out of my ass here, but it's possible you might want to look at some Document Object Model PHP methods.
http://php.net/manual/en/book.dom.php
If my very, very limited understanding of the DOM is correct, a change in the HTML site structure would change the Document Object Model, but a simple content change within a fixed structure wouldn't. So, if you could capture the DOM state and then compare it at each scrape, couldn't you, in theory, determine that such a change has been made?
(By the way, the way I did this when I was trying to get an email notification when the bar exam results were posted on a particular page was just compare file_get_contents() values. Surprisingly, worked flawlessly: No false positives, and emailed me as soon as the site posted the content.)
If you want to detect changes with respect to structure, I think the best way is to store the DOM structure of your first page and then compare it with the new one.
There are a lot of ways you can do it:
a SAX parser
a DOM parser, etc.
I have a small blog post which gives some pointers to what I mean:
http://let-them-c.blogspot.com/2009/04/xml-as-objects-in-oops.html
Or you can use SAX (http://en.wikipedia.org/wiki/Simple_API_for_XML) or a DOM utility parser.
First, in some cases you may want to compare hashes of the original HTML to the new HTML. MD5 and SHA1 are two popular hashes. This may or may not be valid in all circumstances, but it is something you should be familiar with. It will tell you if anything has changed: content, tags, or anything else.
To understand whether the structure has changed, you would need to capture a histogram of the tag occurrences and then compare those. If you care about tags being out of order, then you would have to capture a tree of the tags and compare it to see whether the tags occur in the same order. This is going to be very specific to what you want to achieve. A sketch of both checks follows.
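A minimal sketch of both checks (the URL is a placeholder; the stored values from the previous scrape are assumed to live wherever you keep state):

<?php
// Detect changes in a scraped page.
$html = file_get_contents('https://example.com/page');

// 1. Any change at all: compare this hash with the one stored last time.
$hash = sha1($html);

// 2. Structural change: count how often each tag occurs.
$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$histogram = [];
foreach ($doc->getElementsByTagName('*') as $el) {
    $histogram[$el->tagName] = ($histogram[$el->tagName] ?? 0) + 1;
}
// Compare $hash and $histogram against the values saved from the last scrape;
// a changed histogram means tags were added or removed somewhere.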
PHP Simple HTML DOM Parser is a tool which will help you parse the HTML.
explode() is not an HTML parser, and you want to know about changes in the HTML structure; that's going to be tricky. Try using an HTML parser. Nothing else will be able to do this properly.
