Parsing HTML with PHP

I can't get the data between the tags into the arrays:
// Load the HTML string from file and create a SimpleXMLElement
$html_string = file_get_contents("data/csr.html"); /*the string really is in $html_string*/
$root = new SimpleXMLElement($html_string);
The problem starts when I try to get the values between the div, h2 and span tags into arrays:
// Fetch all div, h2 and span values
$divArray = $hdlsArray = $dtlsArray = array();
foreach ($root->div as $div) {
    $divArray[] = $div;
    echo "".$div."<br />";
}
foreach ($root->h2 as $h2) {
    $hdlsArray[] = $h2;
    echo "".$h2."<br />";
}
foreach ($root->span as $span) {
    $dtlsArray[] = $span;
    echo "".$span."<br />";
}
The result is a blank page instead of the actual tag data.

As an alternative to SimpleXMLElement, I suggest Simple HTML DOM (see its online manual). I've used it before and was very satisfied with the results. It supports jQuery-like selectors, so fetching all div, h2 and span values is fairly simple.

One write-up on SimpleXML says "the only problem with it is that it'll only load valid XML", though it may provide a workaround for HTML.
The related questions on Stack Overflow include one on this topic, but it describes HTML inside valid XML tags.
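If you would rather stay with PHP's bundled extensions, here is a minimal sketch using DOMDocument, whose loadHTML() accepts markup that is not valid XML. Note also that $root->div in SimpleXML only addresses direct children of the root element, which in an HTML page is usually <html>, so the loops in the question would likely find nothing even if the file did load. The helper name collectTagText and the sample markup are made up for illustration.

```php
<?php
/**
 * Hypothetical helper: collect the text of every element with the given
 * tag names, at any depth in the document.
 * Returns an array such as ['div' => [...], 'h2' => [...], 'span' => [...]].
 */
function collectTagText(string $html, array $tags = ['div', 'h2', 'span']): array
{
    // Collect parser warnings internally instead of printing them
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $result = [];
    foreach ($tags as $tag) {
        $result[$tag] = [];
        // getElementsByTagName() matches descendants, not only direct children
        foreach ($doc->getElementsByTagName($tag) as $node) {
            $result[$tag][] = trim($node->textContent);
        }
    }
    return $result;
}

// Made-up sample standing in for the contents of data/csr.html
$sample = '<html><body><div><h2>Headline</h2><span>Detail</span></div></body></html>';
print_r(collectTagText($sample));
```

textContent includes the text of nested elements, so the div entry here also contains the h2 and span text.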

Related

Get all HTML list element using Simple HTML Dom

I am currently working on a project that requires me to parse some data from another website, and I'm having some issues (note that I am very new to PHP).
Here is the code I am using and a description of what it returns.
$dl = $html2->find('ol.tracklist',0);
print $dl = $dl->outertext;
The code above returns the data we're trying to get, although the output is extremely messy.
However, when I put this in a foreach, it only returns one of the a href attributes at a time.
foreach($html2->find('ol.tracklist') as $li)
{
    $title = $li->find('a',0);
    print $title;
}
What can I do so that it returns all of the a href elements from the example code above?
NOTE: I am using simple_html_dom.php for this.

Based on the markup, target the list items directly, then point to each item's anchor:
foreach ($html2->find('ol.tracklist li') as $li) {
    $anchor = $li->find('ul li a', 0);
    echo $anchor->href; // and other attributes
}
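The loop in the question runs once because find('ol.tracklist') matches the single list, and find('a', 0) then takes only the first anchor inside it. For comparison, here is a sketch of the same extraction with PHP's built-in DOMDocument and DOMXPath rather than Simple HTML DOM. The function name tracklistLinks and the sample markup are assumptions, since the real page markup is not shown.

```php
<?php
/**
 * Hypothetical helper: return href and text of every link inside
 * an <ol> whose class list contains "tracklist".
 */
function tracklistLinks(string $html): array
{
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // Match "tracklist" as a whole class token, then every <a href> below it
    $query = '//ol[contains(concat(" ", normalize-space(@class), " "), " tracklist ")]//a[@href]';

    $links = [];
    foreach ($xpath->query($query) as $a) {
        $links[] = [
            'href' => $a->getAttribute('href'),
            'text' => trim($a->textContent),
        ];
    }
    return $links;
}

// Made-up sample markup
$sample = '<ol class="tracklist">'
    . '<li><ul><li><a href="/track/1">One</a></li></ul></li>'
    . '<li><ul><li><a href="/track/2">Two</a></li></ul></li>'
    . '</ol>';

foreach (tracklistLinks($sample) as $link) {
    echo $link['href'], ' ', $link['text'], "\n";
}
```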

Display all the same elements Simple HTML DOM Parser

I have a problem: I'm parsing an IMDb web page using Simple HTML DOM Parser, and my code is as follows:
<?php
require('../simple_html_dom.php');
$url = 'http://www.imdb.com/search/name?gender=female';
$html = file_get_html($url);
foreach ($html->find('table.results tbody') as $div) {
    $extractname = $div->find('tr.detailed td.name a', 0);
    $extractimg = $div->find('tr.detailed td.image', 0);
    $name = $extractname->innertext;
    $img = $extractimg->innertext;
    echo $img, $name;
}
?>
My problem is that the script returns only one element rather than all of them, and I don't know why.
Thanks!

You are getting one element because there is only one <tbody> on that page.
You probably want a result for each tr row:
foreach ($html->find('table.results tbody tr') as $div) {}

I usually use XPath for things like this, so forgive me if I am wrong.
It looks to me as though find() returns an array when called without an index argument, so you should loop over $extractname as an array of elements, just as you do with the $html find for tbody tags, and the same with $extractimg.
In other words: (a) find all the tbody tags and loop over them; (b) inside each tbody, look for the other elements, which become their own arrays.
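Since XPath came up, here is a rough sketch of the per-row approach with PHP's built-in DOMXPath instead of Simple HTML DOM. The class names mirror the selectors in the question, but the sample markup, the names in it and the function name extractRows are invented, so treat this as an illustration rather than something tested against the real page.

```php
<?php
/**
 * Hypothetical helper: return one ['name' => ..., 'img' => ...] entry
 * per result row.
 */
function extractRows(string $html): array
{
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $rows = [];
    // One iteration per row, not per <tbody>.
    // contains() is a loose substring match on the class attribute.
    $rowQuery = '//table[contains(@class, "results")]//tr[contains(@class, "detailed")]';
    foreach ($xpath->query($rowQuery) as $tr) {
        // A leading "." makes the query relative to the current row
        $nameNode = $xpath->query('.//td[contains(@class, "name")]//a', $tr)->item(0);
        $imgNode  = $xpath->query('.//td[contains(@class, "image")]//img', $tr)->item(0);
        if ($nameNode === null) {
            continue; // skip rows without a name link
        }
        $rows[] = [
            'name' => trim($nameNode->textContent),
            'img'  => $imgNode !== null ? $imgNode->getAttribute('src') : null,
        ];
    }
    return $rows;
}

// Made-up sample markup
$sample = '<table class="results"><tbody>'
    . '<tr class="detailed"><td class="image"><img src="a.jpg"></td>'
    . '<td class="name"><a href="/name/1">Alice Example</a></td></tr>'
    . '<tr class="detailed"><td class="image"><img src="b.jpg"></td>'
    . '<td class="name"><a href="/name/2">Beth Sample</a></td></tr>'
    . '</tbody></table>';

foreach (extractRows($sample) as $row) {
    echo $row['img'], ' ', $row['name'], "\n";
}
```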

Remove tags with Simple HTML DOM parser [duplicate]

I would like to use Simple HTML DOM to remove all images in an article so I can easily create a small snippet of text for a news ticker but I haven't figured out how to remove elements with it.
Basically I would:
1. Get content as HTML string
2. Remove all image tags from content
3. Limit content to x words
4. Output
Any help?

There is no dedicated method for removing elements. Find all the img elements and then do:
$e->outertext = '';

When you only clear the outer text, you delete the HTML content itself, but if you perform another find on the same elements they will still appear in the result. The reason is that the Simple HTML DOM object still holds its internal structure for the element, only without its actual content. To really delete the element, reload the HTML as a string into the same variable. The object is then rebuilt without the deleted content.
Here is an example function:
public function removeNode($selector)
{
    foreach ($this->find($selector) as $node)
    {
        $node->outertext = '';
    }
    $this->load($this->save());
}
Put this function inside the simple_html_dom class and you're good.

I think you are having difficulties because you forgot to save (dump the internal DOM tree back into a string).
Try this:
$html = file_get_html("http://example.com");
foreach ($html->find('img') as $item) {
    $item->outertext = '';
}
$html->save();
echo $html;

I could not figure out where to put the function so I just put the following directly in my code:
$html->load($html->save());
It basically locks the changes made in the for loop back into the HTML, as described above.

The proposed solutions are quite expensive and practically unusable in a big loop or other kind of repetition.
I prefer to use "soft deletes":
foreach ($html->find('somecondition') as $item) {
    // somecheck is a placeholder condition; set a marker to check in later code
    if (somecheck) $item->setAttribute('softDelete', true);
    $item->outertext = '';
    foreach ($foo as $bar) {
        if (!$bar->getAttribute('softDelete')) {
            // do something
        }
    }
}

This is working for me:
foreach($html->find('element') as $element){
$element = NULL;
}

Adding a new answer since removeNode is definitely a better way of removing elements:
$html->removeNode('img');
This method was probably not available when the accepted answer was marked. You do not need to loop over the HTML to find each element; this removes them all.

Use outerhtml instead of outertext:
<div id='your_div'>the contents of your div</div>
$your_div->outertext = '';
echo $your_div; // echoes <div id='your_div'></div>
$your_div->outerhtml = '';
echo $your_div; // echoes nothing

Try this:
$dom = new Dom();
$dom->loadStr($text);
foreach ($dom->find('element') as $element) {
$element->delete();
}

This works now:
$element->remove();
See the library's documentation for details of the method.

Below I remove the HEADER node and all SCRIPT nodes from the incoming URL using two different forms of find(). Omit the second parameter to get an array of all matching nodes, then loop through them.
$clean_html = file_get_html($url);
// Find and remove the first instance of the node.
$node = $clean_html->find('header', 0);
if ($node) { // find() returns null when there is no match
    $node->remove();
}
// Find and remove all instances of the node.
$nodes = $clean_html->find('script');
foreach ($nodes as $node) {
    $node->remove();
}
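For comparison, the workflow from the question (get the HTML string, remove the images, limit to x words, output) can be sketched with PHP's built-in DOM extension instead of Simple HTML DOM. There, removeChild() takes the node out of the tree itself, so no reload step is needed. The function names removeImages and firstWords and the sample article are assumptions made for this example.

```php
<?php
/**
 * Hypothetical helper: return the body markup of $html with all <img> removed.
 */
function removeImages(string $html): string
{
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Copy the live node list to an array first: removing nodes while
    // iterating a live DOMNodeList can skip elements.
    foreach (iterator_to_array($doc->getElementsByTagName('img')) as $img) {
        $img->parentNode->removeChild($img);
    }

    // Serialize only the children of <body>, not the wrapper loadHTML() adds
    $body = $doc->getElementsByTagName('body')->item(0);
    $out = '';
    if ($body !== null) {
        foreach ($body->childNodes as $child) {
            $out .= $doc->saveHTML($child);
        }
    }
    return $out;
}

/**
 * Hypothetical helper: keep at most $limit whitespace-separated words.
 */
function firstWords(string $text, int $limit): string
{
    $words = preg_split('/\s+/', trim($text), -1, PREG_SPLIT_NO_EMPTY);
    return implode(' ', array_slice($words, 0, $limit));
}

// Made-up sample article
$article = '<p>Breaking <img src="a.png" alt="x"> news from the <b>city</b> centre today</p>';
$clean = removeImages($article);
echo firstWords(strip_tags($clean), 4), "\n";
```

Splitting the work into two functions keeps the cleaned markup available in case the ticker should show HTML rather than plain text.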

I want to load a specific div from another website in PHP

I have a problem loading a specific div element and showing it on my page using PHP. My code right now is as follows:
<?php
$page = file_get_contents("http://www.bbc.co.uk/sport/football/results");
preg_match('/<div id="results-data" class="fixtures-table full-table-medium">(.*)<\/div>/is', $page, $matches);
var_dump($matches);
?>
I want it to load the element with id="results-data" and show it on my page.

You won't be able to manipulate the URL to get only a portion of the page. Instead, fetch the page contents with the server-side language of your choice and then parse the HTML. From there you can grab the specific DIV you are looking for and print it to your screen. You could also remove unwanted content at that point.
With PHP you could use file_get_contents() to read the page you want to parse and then use DOMDocument to parse it and grab the DIV you want.
Here's the basic idea. This is untested but should point you in the right direction:
$page = file_get_contents('http://www.bbc.co.uk/sport/football/results');
libxml_use_internal_errors(true); // real-world HTML usually triggers parser warnings
$doc = new DOMDocument();
$doc->loadHTML($page);
$divs = $doc->getElementsByTagName('div');
foreach ($divs as $div) {
    // Loop through the DIVs looking for the one with an id of "results-data"
    if ($div->getAttribute('id') === 'results-data') {
        // nodeValue is the text only; use $doc->saveHTML($div) to keep the markup
        echo $div->nodeValue;
    }
}
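A note on why the regex in the question is fragile: (.*) with the s modifier is greedy, so it runs to the last </div> on the page, and the pattern also depends on the exact attribute order and spacing. Below is a small sketch of the parser-based lookup that returns the element's markup rather than only its text. The function name extractById and the sample page are made up, and $id is assumed to be a plain identifier without quote characters.

```php
<?php
/**
 * Hypothetical helper: return the outer HTML of the element with the
 * given id, or null if there is no such element.
 */
function extractById(string $html, string $id): ?string
{
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // Assumption: $id contains no double quotes, so it is safe to inline
    $node = $xpath->query('//*[@id="' . $id . '"]')->item(0);

    // saveHTML($node) serializes just that element and its descendants
    return $node === null ? null : $doc->saveHTML($node);
}

// Made-up sample page; in practice $page would come from file_get_contents()
$page = '<div id="page">'
    . '<div id="results-data" class="fixtures-table"><p>Home 2-1 Away</p></div>'
    . '<div id="footer">Footer</div>'
    . '</div>';

echo extractById($page, 'results-data'), "\n";
```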

You should use an HTML parser. Take a look at phpQuery; here is how you can do it:
require_once('phpQuery/phpQuery.php');
$html = file_get_contents('http://www.bbc.co.uk/sport/football/results');
phpQuery::newDocumentHTML($html);
$resultData = pq('div#results-data');
echo $resultData;
Check it out here:
http://code.google.com/p/phpquery
Also see their selectors' documentation.
