Which is best for performance usage/usability, "file_get_content" or "DOM LoadHtmlFile" ?
As they both get the html content, what is the optimal situation to use one or the other?
This Also go with the "file_put_content" and "DOM SaveHtmlFile"
I was a newbie back when I posted this question. Here is my conclusion at the time I am writing this,
they both do different things completely. "DOM LoadHtmlFile" is a part
of a DOMDocument class which is very powerful to manipulate html
formatted content, and file_get_content is used to get file/http
content as a string and manipulate as your wish. On a case scenario if
the target content are in html format, depending on needs, if requires
heavily change then I would personally use DomDocument class. If
otherwise, I would just manipulate as a string.
Related
I'm trying to set up a translation tool to translate websites. What I want to do is import html-code and get all translatable texts from that site.
One idea would be to use strip_tags, but it would ignore strings that could be translated such as alt-texts, title-texts and probably others that I don't have on my mind yet. Is there a clean way to do this?
In this case you need to parse HTML and extract text yourself. As you, probably, already know, parsing HTML with regular expressions is A Bad Idea (tm). SO, the only right solution is to parse DOM of the document. On this step you are free to use any tools including standard DOMDocument class.
If you are looking for some libraries or scripts to help, i would suggest to look on html2text which could be used commercially. As i see, it doesn't support attributes for <img> tags, but it's very easy to fix (use <a> tag as example).
If you are looking for some automated text extraction, then you should definitely look on something like Bolierpipe.
I would personally use the DOM Crowler component from Symfony2, which is a nice wrapper around php DOM functions and start from there.
I have some random HTML layouts that contain important text I would like to extract. I cannot just strip_tags() as that will leave a bunch of extra junk from the sidebar/footer/header/etc.
I found a method built in Python and I was wondering if there is anything like this in PHP.
The concept is rather simple: use
information about the density of text
vs. HTML code to work out if a line of
text is worth outputting. (This isn’t
a novel idea, but it works!) The basic
process works as follows:
Parse the HTML code and keep track of the number of bytes processed.
Store the text output on a per-line, or per-paragraph basis.
Associate with each text line the number of bytes of HTML required to
describe it.
Compute the text density of each line by calculating the ratio of text
t> o bytes.
Then decide if the line is part of the content by using a neural network.
You can get pretty good results just
by checking if the line’s density is
above a fixed threshold (or the
average), but the system makes fewer
mistakes if you use machine learning -
not to mention that it’s easier to
implement!
Update: I started a bounty for an answer that could pull main content from a random HTML template. Since I can't share the documents I will be using - just pick any random blog sites and try to extract the body text from the layout. Remember that the header, sidebar(s), and footer may contain text also. See the link above for ideas.
phpQuery is a server-side, chainable, CSS3 selector driven Document Object Model (DOM) API based on jQuery JavaScript Library.
UPDATE 2
DEMO: http://so.lucafilosofi.com/find-important-text-in-arbitrary-html-using-php/
tested on a casual blogs list taken from Technorati Top 100 and Best Blogs of 2010
many blogs make use of CMS;
blogs html structure is the same almost the time.
avoid common selectors like #sidebar, #header, #footer, #comments, etc..
avoid any widget by tag name script, iframe
clear well know content like:
/\d+\scomment(?:[s])/im
/(read the rest|read more).*/im
/(?:.*(?:by|post|submitt?)(?:ed)?.*\s(at|am|pm))/im
/[^a-z0-9]+/im
search for well know classes and ids:
typepad.com .entry-content
wordpress.org .post-entry .entry .post
movabletype.com .post
blogger.com .post-body .entry-content
drupal.com .content
tumblr.com .post
squarespace.com .journal-entry-text
expressionengine.com .entry
gawker.com .post-body
Ref: The blog platforms of choice among the top 100 blogs
$selectors = array('.post-body','.post','.journal-entry-text','.entry-content','.content');
$doc = phpQuery::newDocumentFile('http://blog.com')->find($selectors)->children('p,div');
search based on common html structure that look like this:
<div>
<h1|h2|h3|h4|a />
<p|div />
</div>
$doc = phpQuery::newDocumentFile('http://blog.com')->find('h1,h2,h3,h4')->parent()->children('p,div');
Domdocument can be used to parse html documents, which can then be queried through PHP.
Edit: wikied
I worked on a similar project a while back. It's not as complex as the Python script but it will do a good job. Check out the Simple HTML PHP Parser
http://simplehtmldom.sourceforge.net/
Depending on your HTML structure and if you have id's or classes in place you can get a little complicated and use preg_match() to specifically get any information between a certain start and end tag. This means that you should know how to write regular expressions.
You can also look into a browser emulation PHP class. I've done this for page scraping and it works well enough depending on how well formatted the DOM is. I personally like SimpleBrowser
http://www.simpletest.org/api/SimpleTest/WebTester/SimpleBrowser.html
I have developed a HTML parser and filter PHP package that can be used for that purpose.
It consists of a set of classes that can be chained together to perform a series of parsing, filtering and transformation operations in HTML/XML code.
It was meant to deal with real world pages, so it can deal with malformed tag and data structures, so it can preserve as much as the original document as possible.
One of the filter classes it comes with can do DTD validation. Another can discard insecure HTML tags and CSS to prevent XSS attacks. Another can simply extract all document links.
All those filter classes are optional. You can chain them together the way you want, if you need any at all.
So, to solve your problem, I do not think there is already a specific solution for that in PHP anywhere, but a special filter class could be developed for it. Take a look at the package. It is thoroughly documented.
If you need help, just check my profile and mail me and I may even develop the filter that does exactly what you need, eventually inspired in any solutions that exist for other languages.
I have gone through the Stack Overflow post "Best XML Parser for PHP".
For the same question. It is mentioned that, if I need to manipulate XML files then go for DOM XML. My requirements are:
I have saved navigation in database. It is an HTML string.
I want to remove some pages or say li tags wrapping pages that user don't want to exist in his/her page. After removing the unwanted li's, I want to save the whole string back to the database.
The same navigation will be used on another page. But, the HTML will be different. It will be similar, with the ul and li, but I need to add some more divs and spans to it.
The navigation will be edited on this page and on each change (e.g. Changing page title, deleting a node/page, moving under another page as child.) a Ajax call will save the changes to another table in database.
Using the new structure, again build the navigation, which will be updated in the first navigation table.
Which will be best option?
For your use case, using XMLParser does not make much sense since you want to save back the whole file after modifying it.
SimpleXML lets you do that much easier (saveXML() method) - with XMLReader you would have to generate the XML on your own during parsing.
I'd recommend SimpleXML.
My personal preference is XML Parser but I think that the library you choose probably will not affect the solution that much. The usages of any XML manipulation library is probably largely the same.
For larger files you will want to use XML Parser because Simple XML will load the entire XML into memory to parse it whereas XML parser is stream based.
I was wondering if there's a way to use PHP (or any other server-side or even client-side [if possible] language) to obtain certain pieces of information from a different website (NOT a local file like the include 'nav.php'.
What I mean is that...Say I have a blog at www.blog.com and I have another website at www.mysite.com
Is there a way to gather ALL of the h2 links from www.blog.com and put them in a div in www.mysite.com?
Also, is there a way I could grab the entire information inside a DIV (with an ID of-course) from blog.com and insert it in mysite.com?
Thanks,
Amit
First of all, if you want to retrieve content from a blog, check if the blog generator (ie, Blogger, WordPress) does not have a API thanks to which you won't have to reinvent the wheel. Usually, good APis come with good documentations (meaning that probably 5% out of all APIs are good APIs) and these documentations should come with code examples for top languages such as PHP, JavaScript, Java, etc... Once again, if it is to retrieve content from a blog, there should be tons of frameworks that are here for you
Check out the PHP Simple HTML DOM library
Can be as easy as:
// Create DOM from URL or file
$html = file_get_html('http://www.otherwebsite.com/');
// Find all images
foreach($html->find('h2') as $element)
echo $element->src;
This can be done by opening the remote website as a file, then taking the HTML and using the DOM parser to manipulate it.
$site_html = file_get_contents('http://www.example.com/');
$document = new DOMDocument();
$document->loadHTML($site_html);
$all_of_the_h2_tags = $document->getElementsByTagName('h2');
Read more about PHP's DOM functions for what to do from here, such as grabbing other tags, creating new HTML out of bits and pieces of the DOM, and displaying that on your own site.
Your first step would be to use CURL to do a request on the other site, and bring down the HTML from the page you want to access. Then comes the part of parsing the HTML to find all the content you're looking for. One could use a bunch of regular expressions, and you could probably get the job done, but the Stackoverflow crew might frown at you. You could also take the resulting HTML and use the domDocument object, and loadHTML to parse the HTML and load the content you want.
Also, if you control both sites, you can set up a special page on the first site (www.blog.com) with exactly the information you need, properly formatted either in HTML you can output directly, or XML that you can manipulate more easily from www.mysite.com.
I'm using PHP to scrape a website and collect some data. It's all done without using regex. I'm using php's explode() method to find particular HTML tags instead.
It is possible that if the structure of the website changes (CSS, HTML), then wrong data may be collected by the scraper. So the question is - how do I know if the HTML structure has changed? How to identify this before storing any data to my database to avoid wrong data being stored.
I think you don't have any clean solutions if you are scraping a page where content changes.
I have developed several python scrapers and I know how can be frustrating when site just makes a subtle change on its layout.
You could try a solution a la mechanize (don't know the php counterpart) and if you are lucky you could isolate the content you need to extract (links?).
Another possibile approach would be to code some constraints and check them before store to db.
For example, if you are scraping Urls, you will need to verify that what scraper has parsed is formally a valid Url; same for integer ID or whatever you want to scrape that can be recognized as valid.
If you are scraping plain text, it will be more difficult to check.
Depends on the site but you could count the number of page elements in the scraped page like div, class & style tags then by comparing these totals against those of later scrapes detect if the page structure has been changed.
A similiar process could be used for the CSS file where the names of each each class or id could be extracted using simple regex, stored and checked as needed. If this list has new additions then the page structure has almost certainly changed somewhere on the site being scraped.
Speaking out of my ass here, but its possible you might want to look at some Document Object Model PHP methods.
http://php.net/manual/en/book.dom.php
If my very, very limited understanding of DOM is correct, a change in HTML site structure would change the Document Object Model, but a simple content change within a fixed structure wouldn't. So, if you could capture the DOM state, and then compare it at each scrape, couldn't you in theory determine that such a change has been made?
(By the way, the way I did this when I was trying to get an email notification when the bar exam results were posted on a particular page was just compare file_get_contents() values. Surprisingly, worked flawlessly: No false positives, and emailed me as soon as the site posted the content.)
If you want to know changes with respect to structure, I think the best way is to store the DOM structure of your first page and then compare it with new one.
There are lot of way you can do it:-
SaxParser
DOmParser etc
I have a small blog which will give some pointers to what I mean
http://let-them-c.blogspot.com/2009/04/xml-as-objects-in-oops.html
or you can use http://en.wikipedia.org/wiki/Simple_API_for_XML or DOm Utility parser.
First, in some cases you may want to compare hashes of the original to the new html. MD5 and SHA1 are two popular hashes. This may or may not be valid in all circumstances but is something you should be familiar with. This will tell you if something has changed - content, tags, or anything.
To understand if the structure has changed you would need to capture a histogram of the tag occurrences and then compare those. If you care about tags being out of order then you would have to capture a tree of the tags and do a comparison to see if the tags occur in the same order. This is going to be very specific to what you want to achieve.
PHP Simple HTML DOM Parser is a tool which will help you parse the HTML.
Explode() is not an HTML parser, but you want to know about changes in the HTML structure. That's going to be tricky. Try using an HTML parser. Nothing else will be able to do this properly.