XML/PHP Complex Elements Parsing while preserving original markup - php

I am pretty new to coding and am still learning the ropes. I got an XML feed that I'm processing where, amongst other common data that is easy to process, I can get complex elements such as :
<para>Text before the Link more text after the link.</para>.
I get the data out of the XML with simplexml_load_string($xml); perhaps this is where I'm going wrong. What is the best way to process an XML file where there may or may not be 'inline child' elements, so that the output looks as it should? I can get the data out, but what I end up with is two entries: "Text before the Link more text after the link." and the link itself.
This seems simple and I'm sure others have run into it, but I'm not sure how to preserve the original markup. I have been trying to find an answer with good old Google, but I must not be using the right terms. Any help or pointers to a resource would be much appreciated. I have gone through a few tutorials online, but I'm unsure what I'm doing wrong.
Thanks.

It seems that SimpleXML only works well on simple XML ;)
Here is a possible solution for you: use DOMDocument, which keeps text and inline child elements in their original order.
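A minimal sketch of what that could look like, assuming the inline element is something like <link url="..."> (the question does not show the actual element, so the sample input and the helper name innerXml are made up for illustration). It serializes each child node of <para> in document order, so text and inline markup stay interleaved:

```php
<?php
// Hypothetical sample input: the <link> element and its url attribute are assumptions.
$xml = '<doc><para>Text before <link url="http://example.com/">the Link</link> more text after the link.</para></doc>';

$doc = new DOMDocument();
$doc->loadXML($xml);

// Serialize every child node (text and elements alike) in document order.
function innerXml(DOMNode $node): string
{
    $out = '';
    foreach ($node->childNodes as $child) {
        $out .= $node->ownerDocument->saveXML($child);
    }
    return $out;
}

$para = $doc->getElementsByTagName('para')->item(0);
$markup = innerXml($para);
echo $markup, "\n";
```

If only the plain text is wanted, $para->textContent returns it with the link text in the right position.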

Related

How to properly format text retrieved from a website?

I'm building an application for a company that, unfortunately, has a very poorly designed website. Most of the HTML tags are wrongly and sometimes randomly placed, there is excessive use of no-break-spaces, p tags are randomly assigned, they don't follow any rule and so on...
I'm retrieving data from their website with a crawler and then feeding the resulting strings to my application through my own web service. The problem is that once displayed in the Android TextView, the text is formatted all wrong: spread out, uneven and very disorderly.
Also, worth mentioning that I can not suggest to the company for various reasons to modify their website...
I've tried
String text = Html.fromHtml(myString).toString();
and other variations, I've even tried formatting it manually but it's been a pain.
My question is:
Is there an easy, elegant way to re-format all this text, either with PHP on my web-service or with Java, directly in my Android application?
Thanks to anyone who will take the time to answer...
You can use Tidy with PHP to clean up the code if you're keeping it in place. Otherwise stripping the HTML would probably make working with it a lot easier.
I would say: no, there is no easy, elegant way. HTML combines data and visual representation; they are inherently linked. To understand the data you must look at the tags. Tags like <h1> and <a> carry meaning.
If the HTML is structured enough to break it down into meaningful blocks (header, body and unrelated/unimportant stuff), then you could apply restyling principles to those. A simpler solution is to just strip all the tags, get only the text nodes and stitch them together. If the HTML is exceptionally poorly formatted you might get sentences that are out of order, but if the HTML isn't too contrived I expect this approach should work.
To give you an indication of the complexity involved: you could have <span>s that have styling applied to them, for instance display: block. This changes the way the span is displayed, from inline to block, so it behaves more like a <div> would. This means that each <span> will likely be on its own line; it will seem to force a line break. Detecting these situations isn't impossible but it is quite complex. Who knows what happens when you've got list elements, tables or even floating elements; they might be completely out of order.
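The simpler "strip the tags and stitch the text nodes together" approach could be sketched on the PHP side roughly as follows. The function name extractText and the sample markup are assumptions for illustration, and the sketch ignores CSS entirely, so it will not detect the display: block case described above:

```php
<?php
// Collect all text nodes in document order and join them with single spaces.
function extractText(string $html): string
{
    $doc = new DOMDocument();
    // Collect parser warnings internally instead of printing them for bad markup.
    libxml_use_internal_errors(true);
    // Note: non-ASCII input may need an explicit encoding hint before loading.
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $parts = [];
    foreach ($xpath->query('//text()') as $textNode) {
        // Treat no-break spaces like ordinary whitespace and collapse runs of it.
        $text = trim(preg_replace('/[\s\x{00A0}]+/u', ' ', $textNode->nodeValue));
        if ($text !== '') {
            $parts[] = $text;
        }
    }
    return implode(' ', $parts);
}

// Hypothetical messy input.
$messy = '<div><p>First&nbsp;&nbsp;sentence.</p><span>Second   sentence.</span><p></p></div>';
$clean = extractText($messy);
echo $clean, "\n";
```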
Probably not the most elegant solution, but I managed to get the best results by stripping some tags according to what I needed with php (that was really easy to do) and then displaying the retrieved strings into formatted WebViews.
As I said, probably not the most elegant solution but in this case it worked best for me.
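As a rough sketch of that kind of selective stripping: PHP's strip_tags() accepts a list of tags to keep. Which tags to keep is not stated above, so the choice of <p> and <br> here, the helper name and the sample string are all assumptions:

```php
<?php
// Keep only a small whitelist of tags and turn &nbsp; entities into plain spaces.
function cleanForWebView(string $html): string
{
    $kept = strip_tags($html, '<p><br>');
    return str_replace('&nbsp;', ' ', $kept);
}

// Hypothetical input resembling the messy markup described in the question.
$raw = '<div class="x"><p>Opening <font color="red">hours</font>:&nbsp;9 to 5</p><br><span></span></div>';
$cleaned = cleanForWebView($raw);
echo $cleaned, "\n";
```

The resulting string still contains simple markup, which is why a WebView rather than a TextView is a natural place to display it.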

Serializing html and css, and storing in mysql

I am in the process of developing an html template generator. I would like to store the templates in a mysql database (or text files if that would be more efficient). I am looking for the best way to serialize the html and css, and then reproduce the original efficiently.
Javascript and php will be used to create/edit/and remove elements from both the browser and the database for later reproduction.
Elements can be added such as div, p, a, etc. and nesting should not be an issue. The main problem I am having is if a div is nested within a div and a paragraph element is nested within the second div at a later time, how would all of this be stored in a database? Elements will be deleted and new elements will be added in random order.
I hope this was somewhat clear. I am not really looking for code, but suggestions on how all of this would work together. Any help would be greatly appreciated. Thanks in advance
Store the entire thing as a blob in a single field in the database. Unless you're doing something quite novel, there's no reason the database layer needs to have an in-depth understanding of DOM structure.
I want to first point out that what you need is a parser, not a serialising method. You can simply store the template; then, when you read it out, parse each individual element and build the editing form, sort of like xtgem does. There is no need to serialise, unless you have already parsed the data into an array.
If each user has his own template, then the chance of multiple simultaneous reads is very low, almost nil. Flat files will do very well in this situation: no need to sanitise, and better performance. Do not forget file locking though.
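A minimal sketch of flat-file storage with locking, using PHP's LOCK_EX flag for writes and a shared flock() for reads. The function names and the temporary file path are made up for illustration:

```php
<?php
// Write the whole template while holding an exclusive lock.
function saveTemplate(string $path, string $html): bool
{
    return file_put_contents($path, $html, LOCK_EX) !== false;
}

// Read the template back under a shared lock; returns null if it cannot be read.
function loadTemplate(string $path): ?string
{
    if (!is_readable($path)) {
        return null;
    }
    $handle = fopen($path, 'rb');
    if ($handle === false) {
        return null;
    }
    flock($handle, LOCK_SH);
    $html = stream_get_contents($handle);
    flock($handle, LOCK_UN);
    fclose($handle);
    return $html === false ? null : $html;
}

// Usage with a throwaway file; nested elements are stored exactly as written.
$path = tempnam(sys_get_temp_dir(), 'tpl');
$original = '<div><div><p>Nested paragraph</p></div></div>';
saveTemplate($path, $original);
$roundTrip = loadTemplate($path);
unlink($path);
```

Because the markup is stored verbatim, nesting depth and the order in which elements were added or removed do not matter to the storage layer.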
If you are returning the data inside text input or textarea elements, escape it with htmlspecialchars() when you output it; with that in place, XSS is not a problem there.
I agree with @Rex. When I was developing a forum, I found that storing the original was better. It allowed for easy editing.
It would be best to separate the files and parse them separately. Keep presentation separate from markup.
Parsers have never been my forte however. So I can't help in that aspect.
You can take a look at pcltemplate from http://phpconcept.net to see how the parser works. It may give you some ideas.

Short snippet summarizing a webpage?

Is there a clean way of grabbing the first few lines of a given link that summarizes that link? I have seen this being done in some online bookmarking applications but have no clue on how they were implemented. For instance, if I give this link, I should be able to get a summary which is roughly like:
I'll admit it, I was intimidated by
MapReduce. I'd tried to read
explanations of it, but even the
wonderful Joel Spolsky left me
scratching my head. So I plowed ahead
trying to build decent pipelines to
process massive amounts of data
Nothing complex at first sight but grabbing these is the challenging part. Just the first few lines of the actual post should be fine. Should I just use a raw approach of grabbing the entire html and parsing the meta tags or something fancy like that (which obviously and unfortunately is not generalizable to every link out there) or is there a smarter way to achieve this? Any suggestions?
Update:
I just found InstaPaper do this but am not sure if it is getting the information from RSS feeds or some other way.
First of all, I would suggest using PHP with a DOM parser class such as PHP Simple HTML DOM Parser; this makes it a lot easier to get the tag contents you need.
// Load the PHP Simple HTML DOM Parser library, which provides file_get_html()
include 'simple_html_dom.php';
// Get HTML from URL or file
$html = file_get_html('http://www.google.com/');
// Find all paragraphs
$paragraphs = $html->find('p');
// Echo the first paragraph
echo $paragraphs[0];
The problem is that a lot of sites have poorly structured HTML, and some are built on tables. The key to getting around this is to decide which tags you will treat as the website description. I would try to get the meta description tag and, if it does not exist, look for the first paragraph.
Your best bet is to pull from the meta description tag. Most blog platforms will put the user- or system-provided excerpt of the post in there, as will a lot of CMS platforms. If that meta tag isn't present, I would just fall back to the title or pick a paragraph of appropriate depth.
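The "meta description first, then first paragraph" fallback could be sketched like this with PHP's built-in DOM extension (swapped in here for the Simple HTML DOM library so the example has no external dependency). The function name summarize and the sample pages are hypothetical:

```php
<?php
// Return the meta description if present, otherwise the first non-empty paragraph.
function summarize(string $html): ?string
{
    $doc = new DOMDocument();
    // Real-world pages are often malformed; collect parser warnings silently.
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    $xpath = new DOMXPath($doc);

    $meta = $xpath->query('//meta[@name="description"]/@content');
    if ($meta->length > 0 && trim($meta->item(0)->nodeValue) !== '') {
        return trim($meta->item(0)->nodeValue);
    }

    // Fall back to the first paragraph that contains any text.
    $paragraphs = $xpath->query('//p[normalize-space()]');
    if ($paragraphs->length > 0) {
        return trim($paragraphs->item(0)->textContent);
    }
    return null;
}

// Hypothetical pages: one with a meta description, one without.
$withMeta = '<html><head><meta name="description" content="A short excerpt."></head><body><p>Body text.</p></body></html>';
$withoutMeta = '<html><body><p> </p><p>First real paragraph.</p></body></html>';

$summaryA = summarize($withMeta);
$summaryB = summarize($withoutMeta);
```

Truncating the result to a fixed length for display is left out of this sketch.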

How to know if the website being scraped has changed?

I'm using PHP to scrape a website and collect some data. It's all done without using regex. I'm using php's explode() method to find particular HTML tags instead.
It is possible that if the structure of the website changes (CSS, HTML), then wrong data may be collected by the scraper. So the question is - how do I know if the HTML structure has changed? How to identify this before storing any data to my database to avoid wrong data being stored.
I don't think there is a clean solution if you are scraping a page whose content changes.
I have developed several Python scrapers and I know how frustrating it can be when a site makes a subtle change to its layout.
You could try a solution a la mechanize (I don't know the PHP counterpart) and, if you are lucky, you could isolate the content you need to extract (links?).
Another possible approach would be to code some constraints and check them before storing anything to the db.
For example, if you are scraping URLs, you will need to verify that what the scraper has parsed is formally a valid URL; the same goes for an integer ID or anything else you scrape that can be recognized as valid.
If you are scraping plain text, it will be more difficult to check.
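A small sketch of such constraint checks using PHP's filter_var(). The row layout (a 'url' field and a positive integer 'id' field) and the function name are assumptions chosen for illustration:

```php
<?php
// Accept a scraped row only if its fields look like what we expect to store.
function isValidScrapedRow(array $row): bool
{
    $urlOk = filter_var($row['url'] ?? '', FILTER_VALIDATE_URL) !== false;
    $idOk = filter_var(
        $row['id'] ?? '',
        FILTER_VALIDATE_INT,
        ['options' => ['min_range' => 1]]
    ) !== false;
    return $urlOk && $idOk;
}

// A plausible row, and one where the scraper picked up markup instead of data.
$good = ['url' => 'http://example.com/item/42', 'id' => '42'];
$bad  = ['url' => '<div class="price">', 'id' => 'N/A'];

$goodOk = isValidScrapedRow($good);
$badOk  = isValidScrapedRow($bad);
```

A row that fails these checks is a hint that the page layout may have changed and that the scrape should be reviewed before anything is written to the database.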
It depends on the site, but you could count the number of page elements in the scraped page, such as div tags and class and style attributes, and then compare these totals against those of later scrapes to detect whether the page structure has changed.
A similar process could be used for the CSS file, where the name of each class or id could be extracted using a simple regex, stored and checked as needed. If this list has new additions, then the page structure has almost certainly changed somewhere on the site being scraped.
Speaking out of my ass here, but it's possible you might want to look at some Document Object Model PHP methods.
http://php.net/manual/en/book.dom.php
If my very, very limited understanding of DOM is correct, a change in HTML site structure would change the Document Object Model, but a simple content change within a fixed structure wouldn't. So, if you could capture the DOM state, and then compare it at each scrape, couldn't you in theory determine that such a change has been made?
(By the way, the way I did this when I was trying to get an email notification when the bar exam results were posted on a particular page was just compare file_get_contents() values. Surprisingly, worked flawlessly: No false positives, and emailed me as soon as the site posted the content.)
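The idea of capturing the DOM state and comparing it between scrapes could be sketched as a "structure signature": a hash over the nested tag names only, ignoring text. The function names and sample pages are hypothetical, and the sketch ignores attributes such as class and id:

```php
<?php
// Build a string of nested tag names, e.g. html(body(div(p()))), ignoring text.
function skeleton(DOMElement $element): string
{
    $out = $element->nodeName . '(';
    foreach ($element->childNodes as $child) {
        if ($child instanceof DOMElement) {
            $out .= skeleton($child);
        }
    }
    return $out . ')';
}

// Hash the skeleton so it is cheap to store and compare between scrapes.
function structureSignature(string $html): string
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    $root = $doc->documentElement;
    return sha1($root === null ? '' : skeleton($root));
}

// Hypothetical scrapes: same layout with new text, then a changed layout.
$first      = '<html><body><div><p>Old text</p></div></body></html>';
$newContent = '<html><body><div><p>New text entirely</p></div></body></html>';
$newLayout  = '<html><body><div><span>Old text</span></div></body></html>';

$sameStructure    = structureSignature($first) === structureSignature($newContent);
$changedStructure = structureSignature($first) !== structureSignature($newLayout);
```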
If you want to detect changes with respect to structure, I think the best way is to store the DOM structure of your first page and then compare it with the new one.
There are a lot of ways you can do it:
a SAX parser
a DOM parser, etc.
I have a small blog post which gives some pointers on what I mean:
http://let-them-c.blogspot.com/2009/04/xml-as-objects-in-oops.html
Or you can use http://en.wikipedia.org/wiki/Simple_API_for_XML or a DOM utility parser.
First, in some cases you may want to compare hashes of the original to the new html. MD5 and SHA1 are two popular hashes. This may or may not be valid in all circumstances but is something you should be familiar with. This will tell you if something has changed - content, tags, or anything.
To understand if the structure has changed you would need to capture a histogram of the tag occurrences and then compare those. If you care about tags being out of order then you would have to capture a tree of the tags and do a comparison to see if the tags occur in the same order. This is going to be very specific to what you want to achieve.
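Both checks can be sketched in a few lines: a hash tells you whether anything at all changed, and a tag histogram tells you whether the mix of tags changed. The function name tagHistogram and the sample markup are assumptions for illustration; PHP's DOM extension is used for parsing:

```php
<?php
// Count how often each tag name occurs in the document.
function tagHistogram(string $html): array
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();

    $counts = [];
    foreach ($doc->getElementsByTagName('*') as $element) {
        $name = $element->nodeName;
        $counts[$name] = ($counts[$name] ?? 0) + 1;
    }
    ksort($counts); // stable key order makes stored histograms easy to compare
    return $counts;
}

// Hypothetical scrapes: a content-only change, then a redesign.
$before   = '<div><p>Price: 10</p><p>Stock: 3</p></div>';
$after    = '<div><p>Price: 12</p><p>Stock: 3</p></div>';
$redesign = '<ul><li>Price: 12</li><li>Stock: 3</li></ul>';

// The hash changes on any edit, including pure content changes.
$anythingChanged = sha1($before) !== sha1($after);
// The histogram only changes when the tags themselves change.
$sameTags      = tagHistogram($before) === tagHistogram($after);
$differentTags = tagHistogram($before) !== tagHistogram($redesign);
```

A histogram does not notice tags that were merely reordered; for that, a comparison of the tag tree is needed, as noted above.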
PHP Simple HTML DOM Parser is a tool which will help you parse the HTML.
Explode() is not an HTML parser, but you want to know about changes in the HTML structure. That's going to be tricky. Try using an HTML parser. Nothing else will be able to do this properly.

Simplehtmldom - curl, loops, arrays?

Please forgive what is most likely a stupid question. I've successfully managed to follow the simplehtmldom examples and get the data that I want off one webpage.
I want to be able to set the function to go through all the HTML pages in a directory and extract the data. I've googled and googled, but now I'm confused: in my ignorant state I had thought I could (in some way) use PHP to build an array of the filenames in the directory, but I'm struggling with this.
Also, it seems that a lot of the examples I've seen use curl. Please can someone tell me how it should be done? There are a significant number of files. I've tried concatenating them, but this only works when I do it through an HTML editor; using cat doesn't work.
You probably want to use glob('some/directory/*.html'); (manual page) to get a list of all the files as an array. Then iterate over that and use the DOM stuff for each filename.
You only need curl if you're pulling the HTML from another web server; if the files are stored on your own web server, you want glob().
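A rough sketch of that loop, using PHP's built-in DOMDocument in place of simplehtmldom so the example is self-contained. The function name, the choice of extracting <title>, and the temporary demo directory are all assumptions:

```php
<?php
// Parse every .html file in a directory and collect each page's <title>.
function extractTitles(string $directory): array
{
    $titles = [];
    // glob() returns the matching paths as an array (false on error).
    foreach (glob($directory . '/*.html') ?: [] as $file) {
        $doc = new DOMDocument();
        libxml_use_internal_errors(true);
        $doc->loadHTMLFile($file);
        libxml_clear_errors();

        $nodes = $doc->getElementsByTagName('title');
        $titles[basename($file)] = $nodes->length > 0
            ? trim($nodes->item(0)->textContent)
            : null;
    }
    return $titles;
}

// Demo with two throwaway files in a temporary directory.
$dir = sys_get_temp_dir() . '/html_demo_' . uniqid();
mkdir($dir);
file_put_contents("$dir/one.html", '<html><head><title>First page</title></head><body></body></html>');
file_put_contents("$dir/two.html", '<html><head><title>Second page</title></head><body></body></html>');

$titles = extractTitles($dir);

// Clean up the demo files.
array_map('unlink', glob("$dir/*.html"));
rmdir($dir);
```

Each file is parsed on its own, so there is no need to concatenate the files first.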
Assuming the parser you talk about is working ok, you should build a simple www-spider. Look at all the links in a webpage and build a list of "links-to-scan". And scan each of those pages...
You should take care of circular references though.
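A minimal sketch of such a spider with protection against circular references: it keeps a set of visited URLs and never scans the same one twice. The crawl function, the fetch callback and the in-memory pages are hypothetical stand-ins for real file or HTTP access, and relative links are not resolved:

```php
<?php
// Breadth-first crawl that records visited URLs so link cycles terminate.
function crawl(string $start, callable $fetch): array
{
    $queue = [$start];
    $visited = [];
    while ($queue) {
        $url = array_shift($queue);
        if (isset($visited[$url])) {
            continue; // already scanned: this is what breaks circular references
        }
        $visited[$url] = true;

        $html = $fetch($url);
        if ($html === null) {
            continue; // page could not be fetched
        }
        $doc = new DOMDocument();
        libxml_use_internal_errors(true);
        $doc->loadHTML($html);
        libxml_clear_errors();

        // Add every not-yet-visited link to the list of links to scan.
        foreach ($doc->getElementsByTagName('a') as $anchor) {
            $href = $anchor->getAttribute('href');
            if ($href !== '' && !isset($visited[$href])) {
                $queue[] = $href;
            }
        }
    }
    return array_keys($visited);
}

// Hypothetical site where a.html and b.html link to each other.
$pages = [
    'a.html' => '<a href="b.html">B</a>',
    'b.html' => '<a href="a.html">A</a><a href="c.html">C</a>',
    'c.html' => '<p>No links</p>',
];
$order = crawl('a.html', fn(string $url) => $pages[$url] ?? null);
```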
