Parse HTML and replace content in DIV - php

I want to know how I can find a DIV tag in an HTML page, because I want to replace the links inside that DIV with different links. I don't know exactly what code this requires.

First, note that PHP runs server side and cannot do anything in the client's browser, but you probably already know that.
You should use file_get_contents to read the web page into a string (or whatever loading mechanism an HTML parsing library provides).
There is already a question that explains how to parse HTML: Robust and Mature HTML Parser for PHP
If that doesn't fit your needs, try searching Google for "php html parsing"; I found several libraries that way.
For example, this library lets you find all tags: http://simplehtmldom.sourceforge.net/
Note that this is not a great approach, and I suggest you turn your HTML page into a PHP page and generate the A tags with code instead. That will make everything easier.
Finally, if the HTML page is static (it doesn't change), you can simply count lines: output the contents from line X to line Y, insert your customized A tags, and then output from line J to the end of the file.
Good luck anyway.
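As a concrete starting point, here is a minimal sketch using PHP's built-in DOMDocument (no external library needed). The URL, the id `menu`, and the replacement link are placeholders for illustration:

```php
<?php
// Minimal sketch: load a page, find a DIV by id, and rewrite its links.
$html = file_get_contents('http://example.com/page.html');

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // tolerate real-world malformed HTML
$doc->loadHTML($html);
libxml_clear_errors();

$div = $doc->getElementById('menu');
if ($div !== null) {
    // Replace the href of every <a> inside that DIV
    foreach ($div->getElementsByTagName('a') as $link) {
        $link->setAttribute('href', 'http://example.com/new-target');
    }
}

echo $doc->saveHTML();
```

If the DIV has no id, DOMXPath can locate it by class or position instead.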

Related

How to "read" a HTML document in PHP?

I've been facing a problem for quite a long time. Unfortunately I was not able to find the solution on my own, so I have to post my question here.
I am writing a little PHP script that creates a PDF file from a dynamically created HTML file.
Now I want to "parse" the HTML file and perform an action depending on which tag comes next in the HTML.
E.g.
<div><p>Test</p></div>
My script should recognize:
First tag is a div: do function for div
Second tag is a p: do function for p
I don't know for what I should search. Regular expressions? HTML parser?
Thanks for a hint!
Try an XML parser. In PHP, SimpleXML is probably what you are looking for.
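A minimal sketch of that idea with SimpleXML. Note that SimpleXML only accepts well-formed markup, so real-world HTML often needs DOMDocument with loadHTML() instead:

```php
<?php
// Walk a small well-formed fragment and dispatch by tag name.
$fragment = '<div><p>Test</p></div>';
$root = new SimpleXMLElement($fragment);

function handleNode(SimpleXMLElement $node) {
    // "do function for div", "do function for p", etc.
    echo "Tag is a {$node->getName()}: do function for {$node->getName()}\n";
    foreach ($node->children() as $child) {
        handleNode($child);
    }
}

handleNode($root);   // visits div first, then p
```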
I've used phpQuery several times. It's a nice solution, although it's quite big and seems to be no longer supported (last commit > 10 months ago).
What you need to do is read the HTML file into a PHP variable/object
http://www.php-mysql-tutorial.com/wikis/php-tutorial/read-html-files-using-php.aspx
And then use RegEx to parse the HTML Tags and Attributes
http://www.codeproject.com/Articles/297056/Most-Important-Regular-Expression-for-parsing-HTML
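For completeness, a quick sketch of the regex approach, with the usual caveat that regular expressions are fragile for HTML (comments, CDATA, attributes containing '>') and a real parser is safer:

```php
<?php
// Pull out opening tag names with a regular expression.
$html = '<div><p>Test</p></div>';

preg_match_all('/<([a-z][a-z0-9]*)\b[^>]*>/i', $html, $matches);

foreach ($matches[1] as $tag) {
    echo "Found opening tag: {$tag}\n";   // div, then p
}
```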

Find important text in arbitrary HTML using PHP?

I have some random HTML layouts that contain important text I would like to extract. I cannot just strip_tags() as that will leave a bunch of extra junk from the sidebar/footer/header/etc.
I found a method built in Python and I was wondering if there is anything like this in PHP.
The concept is rather simple: use information about the density of text vs. HTML code to work out if a line of text is worth outputting. (This isn't a novel idea, but it works!) The basic process works as follows:
Parse the HTML code and keep track of the number of bytes processed.
Store the text output on a per-line, or per-paragraph basis.
Associate with each text line the number of bytes of HTML required to describe it.
Compute the text density of each line by calculating the ratio of text to bytes.
Then decide if the line is part of the content by using a neural network.
You can get pretty good results just by checking if the line's density is above a fixed threshold (or the average), but the system makes fewer mistakes if you use machine learning - not to mention that it's easier to implement!
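A crude per-line version of that heuristic in PHP might look like this. The URL and the 0.5 threshold are placeholders, and the original approach works per parsed text block rather than per raw source line:

```php
<?php
// Sketch of the text-density heuristic: for each line of the source,
// compare the length of its visible text to the length of its raw HTML.
$html = file_get_contents('http://example.com/article.html');

foreach (explode("\n", $html) as $line) {
    $raw  = strlen($line);
    $text = strlen(trim(strip_tags($line)));
    if ($raw === 0) {
        continue;
    }
    $density = $text / $raw;          // ratio of text bytes to total bytes
    if ($text > 0 && $density > 0.5) {
        echo trim(strip_tags($line)) . "\n";   // likely real content
    }
}
```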
Update: I started a bounty for an answer that could pull the main content from a random HTML template. Since I can't share the documents I will be using, just pick any random blog site and try to extract the body text from the layout. Remember that the header, sidebar(s), and footer may contain text too. See the link above for ideas.
phpQuery is a server-side, chainable, CSS3 selector driven Document Object Model (DOM) API based on jQuery JavaScript Library.
UPDATE 2
DEMO: http://so.lucafilosofi.com/find-important-text-in-arbitrary-html-using-php/
tested on a casual list of blogs taken from the Technorati Top 100 and Best Blogs of 2010
many blogs make use of a CMS;
blogs' HTML structure is almost always the same.
avoid common selectors like #sidebar, #header, #footer, #comments, etc.
avoid any widget by tag name: script, iframe
clear well-known boilerplate like:
/\d+\scomment(?:[s])/im
/(read the rest|read more).*/im
/(?:.*(?:by|post|submitt?)(?:ed)?.*\s(at|am|pm))/im
/[^a-z0-9]+/im
search for well-known classes and ids:
typepad.com .entry-content
wordpress.org .post-entry .entry .post
movabletype.com .post
blogger.com .post-body .entry-content
drupal.com .content
tumblr.com .post
squarespace.com .journal-entry-text
expressionengine.com .entry
gawker.com .post-body
Ref: The blog platforms of choice among the top 100 blogs
$selectors = array('.post-body','.post','.journal-entry-text','.entry-content','.content');
$doc = phpQuery::newDocumentFile('http://blog.com')->find($selectors)->children('p,div');
search based on a common HTML structure that looks like this:
<div>
<h1|h2|h3|h4|a />
<p|div />
</div>
$doc = phpQuery::newDocumentFile('http://blog.com')->find('h1,h2,h3,h4')->parent()->children('p,div');
DOMDocument can be used to parse HTML documents, which can then be queried from PHP (for example with DOMXPath).
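A short sketch of that combination. The URL and the class name `entry-content` are assumptions for illustration:

```php
<?php
// Query a page with DOMDocument + DOMXPath.
$doc = new DOMDocument();
libxml_use_internal_errors(true);          // real pages are rarely valid
$doc->loadHTMLFile('http://blog.example.com/');
libxml_clear_errors();

$xpath = new DOMXPath($doc);
// Every <p> inside a div whose class contains "entry-content"
$nodes = $xpath->query('//div[contains(@class, "entry-content")]//p');

foreach ($nodes as $p) {
    echo trim($p->textContent), "\n";
}
```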
I worked on a similar project a while back. It's not as complex as the Python script but it will do a good job. Check out the Simple HTML PHP Parser
http://simplehtmldom.sourceforge.net/
Depending on your HTML structure, and if you have ids or classes in place, you can get a little more specific and use preg_match() to grab the information between a certain start and end tag. This means you need to know how to write regular expressions.
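A small sketch of that preg_match() approach. It is only reliable when the markup is predictable and the target element does not nest elements of the same name:

```php
<?php
// Grab the content between a specific pair of tags with preg_match().
$html = '<div id="main">Hello <b>world</b></div>';

if (preg_match('/<div id="main">(.*?)<\/div>/s', $html, $m)) {
    echo $m[1];   // Hello <b>world</b>
}
```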
You can also look into a browser emulation PHP class. I've done this for page scraping and it works well enough depending on how well formatted the DOM is. I personally like SimpleBrowser
http://www.simpletest.org/api/SimpleTest/WebTester/SimpleBrowser.html
I have developed an HTML parser and filter PHP package that can be used for that purpose.
It consists of a set of classes that can be chained together to perform a series of parsing, filtering and transformation operations on HTML/XML code.
It was meant to deal with real-world pages, so it can handle malformed tags and data structures while preserving as much of the original document as possible.
One of the filter classes it comes with can do DTD validation. Another can discard insecure HTML tags and CSS to prevent XSS attacks. Another can simply extract all document links.
All those filter classes are optional. You can chain them together the way you want, if you need any at all.
So, to solve your problem: I do not think there is already a specific solution for this in PHP, but a special filter class could be developed for it. Take a look at the package; it is thoroughly documented.
If you need help, just check my profile and mail me, and I may even develop the filter that does exactly what you need, perhaps inspired by solutions that exist for other languages.

Which is the best option, SimpleXml or XML Parser in PHP?

I have gone through the Stack Overflow post "Best XML Parser for PHP" on the same question. It mentions that if I need to manipulate XML files, I should go for DOM XML. My requirements are:
I have saved navigation in database. It is an HTML string.
I want to remove some pages (that is, the li tags wrapping pages the user doesn't want to exist on his/her page). After removing the unwanted li's, I want to save the whole string back to the database.
The same navigation will be used on another page. But, the HTML will be different. It will be similar, with the ul and li, but I need to add some more divs and spans to it.
The navigation will be edited on this page, and on each change (e.g. changing a page title, deleting a node/page, moving it under another page as a child) an Ajax call will save the changes to another table in the database.
Using the new structure, the navigation is built again, and the first navigation table is updated with it.
Which will be best option?
For your use case, using XML Parser does not make much sense, since you want to save the whole file back after modifying it.
SimpleXML lets you do that much more easily (the saveXML() method); with XML Parser you would have to generate the XML on your own during parsing.
I'd recommend SimpleXML.
My personal preference is XML Parser, but I think the library you choose probably will not affect the solution that much; the usage of any XML manipulation library is largely the same.
For larger files you will want to use XML Parser, because SimpleXML loads the entire XML document into memory to parse it, whereas XML Parser is stream based.
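For the navigation use case, a minimal SimpleXML sketch. The sample markup and the id `page-3` are placeholders, and note that saveXML()/asXML() prepends an XML declaration you may want to strip before saving:

```php
<?php
// Load the stored navigation, drop unwanted <li> items, save the string back.
$navigation = '<ul><li id="page-1">Home</li><li id="page-3">Old</li></ul>';

$nav = new SimpleXMLElement($navigation);

foreach ($nav->xpath('//li[@id="page-3"]') as $li) {
    // Removal goes through the underlying DOM node
    $dom = dom_import_simplexml($li);
    $dom->parentNode->removeChild($dom);
}

$updated = $nav->saveXML();   // string ready to be written back to the database
echo $updated;
```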

edit html page and redisplay PHP

So I have been using a method to retrieve images from a website, but I thought it might be easier to simply show the page without some details I don't want displayed. The website in particular knows we are doing this, so there shouldn't be any legal complications. So would it be possible to open the HTML page within PHP, search for a specific element that is the same in each page, remove it, and then redisplay the page in the browser with the new edits?
You can use the Tidy or HTML Purifier libraries to clean up and navigate the document tree, find the elements you are looking for, and remove them. I can't find comprehensive docs for Tidy, but the examples on php.net should be enough to help you get started.
Yes, this is possible. You'd need to use file_get_contents("http://url"); to load the page into a string, then preg_replace with a regex to clean the string.
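A sketch of that idea. The URL and the div id `ads` are placeholders, and regex-based removal only works reliably when the target element contains no nested elements of the same name:

```php
<?php
// Fetch a page, strip one unwanted element, and re-emit the result.
$page = file_get_contents('http://example.com/gallery.html');

// Remove a non-nested <div id="ads">…</div> block
$clean = preg_replace('/<div id="ads">.*?<\/div>/s', '', $page);

echo $clean;   // redisplay the edited page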

Using PHP to retrieve information from a different site

I was wondering if there's a way to use PHP (or any other server-side or even client-side [if possible] language) to obtain certain pieces of information from a different website (NOT a local file like include 'nav.php').
What I mean is that...Say I have a blog at www.blog.com and I have another website at www.mysite.com
Is there a way to gather ALL of the h2 links from www.blog.com and put them in a div in www.mysite.com?
Also, is there a way I could grab the entire information inside a DIV (with an ID of-course) from blog.com and insert it in mysite.com?
Thanks,
Amit
First of all, if you want to retrieve content from a blog, check whether the blog platform (i.e., Blogger, WordPress) has an API, so you won't have to reinvent the wheel. Usually, good APIs come with good documentation (which probably means only 5% of all APIs are good APIs), and that documentation should include code examples for popular languages such as PHP, JavaScript, Java, etc. Once again, if it is to retrieve content from a blog, there should be plenty of frameworks out there for you.
Check out the PHP Simple HTML DOM library
Can be as easy as:
// Create DOM from URL or file
$html = file_get_html('http://www.otherwebsite.com/');
// Find all h2 elements and print their text
foreach($html->find('h2') as $element)
    echo $element->plaintext;
This can be done by opening the remote website as a file, then taking the HTML and using the DOM parser to manipulate it.
// Fetch the remote page as a string
$site_html = file_get_contents('http://www.example.com/');
// Parse it into a DOM tree
$document = new DOMDocument();
$document->loadHTML($site_html);
// DOMNodeList containing every <h2> element on the page
$all_of_the_h2_tags = $document->getElementsByTagName('h2');
Read more about PHP's DOM functions for what to do from here, such as grabbing other tags, creating new HTML out of bits and pieces of the DOM, and displaying that on your own site.
Your first step would be to use cURL to request the other site and bring down the HTML of the page you want to access. Then comes the part of parsing the HTML to find the content you're looking for. You could use a bunch of regular expressions and probably get the job done, but the Stack Overflow crowd might frown at you. You could also take the resulting HTML, use the DOMDocument object, and loadHTML() to parse it and pull out the content you want.
Also, if you control both sites, you can set up a special page on the first site (www.blog.com) with exactly the information you need, properly formatted either in HTML you can output directly, or XML that you can manipulate more easily from www.mysite.com.
