Currently I am working on a project which requires me to parse some data from an alternative website, and I'm having some issues (note I am very new to PHP coding.)
Here's the code I am using below + the content it returns.
$dl = $html2->find('ol.tracklist',0);
print $dl = $dl->outertext;
The above code returns the data for what we're trying to get, it's below but extremely messy provided you would like to see click here.
However, when I put this in a foreach, it only returns one of the a href attributes at a time.
foreach($html2->find('ol.tracklist') as $li)
{
$title = $li->find('a',0);
print $title;
}
What can I do so that it returns all of the a href elements from the example code above?
NOTE: I am using simple_html_dom.php for this.
Based on the markup, just point directly to it, just get it list then point to its anchor:
foreach ($html2->find('ol.tracklist li') as $li) {
$anchor = $li->find('ul li a', 0);
echo $anchor->href; // and other attributes
}
Related
Exemple with a mediawiki link : https://www.visionduweb.eu/wiki/index.php?title=Utiliser_PHP
Show the source code and identify the sommaire from this Mediawiki page.
I search how i can parse the source code and found the HTML code for this sommaire.
#
I tried with $domExemple = $xpath->query(« //ul/li »); but I have too many answers and poorly formatted.
I tried with $domExemple = $xpath->query(« //ul/li[#class=’toclevel-1 tocsection-1′] »); which gives me the result, but, how to get all toclevel and tocsection, without having to specify the number 1, or 2, or 3, ... toclevel or tocsection.
In this example, I do not get the HTML content, only the text content.
I would have preferred to retrieve the HTML content.
I believe you can simplify your xpath expression using the syntax defined here:
How can I match on an attribute that contains a certain string?
Try something like this:
$results = $xpath->query('//ul/li[contains(#class, "toclevel-") and contains(#class, "tocsection-"]');
foreach ($results as $li) {
// to get html of $li, import it into a fresh DOMDocument and run saveHTML
$newdoc = new DOMDocument();
$cloned = $li->cloneNode(true);
$newdoc->appendChild($newdoc->importNode($cloned, true));
echo $newdoc->saveHTML();
}
I would like to use Simple HTML DOM to remove all images in an article so I can easily create a small snippet of text for a news ticker but I haven't figured out how to remove elements with it.
Basically I would do
Get content as HTML string
Remove all image tags from content
Limit content to x words
Output.
Any help?
There is no dedicated methods for removing elements. You just find all the img elements and then do
$e->outertext = '';
when you only delete the outer text you delete the HTML content itself, but if you perform another find on the same elements it will appear in the result. the reason is that the simple HTML DOM object still has it's internal structure of the element, only without its actual content. what you need to do in order to really delete the element is simply reload the HTML as string to the same variable. this way the object will be recreated without the deleted content, and the simple HTML DOM object will be built without it.
here is an example function:
public function removeNode($selector)
{
foreach ($this->find($selector) as $node)
{
$node->outertext = '';
}
$this->load($this->save());
}
put this function inside the simple_html_dom class and you're good.
I think you have some difficulties because you forgot to save(dump the internal DOM tree back into string).
Try this:
$html = file_get_html("http://example.com");
foreach($html ->find('img') as $item) {
$item->outertext = '';
}
$html->save();
echo $html;
I could not figure out where to put the function so I just put the following directly in my code:
$html->load($html->save());
It basically locks changes made in the for loop back into the html per above.
The supposed solutions are quite expensive and practically unusable in a big loop or other kind of repetition.
I prefer to use "soft deletes":
foreach($html->find('somecondition'),$item){
if (somecheck) $item->setAttribute('softDelete', true); //<= set marker to check in further code
$item->outertext='';
foreach($foo as $bar){
if(!baz->getAttribute('softDelete'){
//do something
}
}
}
This is working for me:
foreach($html->find('element') as $element){
$element = NULL;
}
Adding new answer since removeNode is definitely a better way of removing it:
$html->removeNode('img');
This method probably was not available when accepted answer was marked. You do not need to loop the html to find each one, this will remove them.
Use outerhtml instead of outertext
<div id='your_div'>the contents of your div</div>
$your_div->outertext = '';
echo $your_div // echoes <div id='your_div'></div>
$your_div->outerhtml= '';
echo $your_div // echoes nothing
Try this:
$dom = new Dom();
$dom->loadStr($text);
foreach ($dom->find('element') as $element) {
$element->delete();
}
This works now:
$element->remove();
You can see the documentation for the method here.
Below I remove the HEADER and all SCRIPT nodes of the incoming url by using 2 different methods of the FIND() function. Remove the 2nd parameter to return an array of all matching nodes then just loop through the nodes.
$clean_html = file_get_html($url);
// Find and remove 1st instance of node.
$node = $clean_html->find('header', 0);
$node->remove();
// Find and remove all instances of Nde.
$nodes = $clean_html->find('script');
foreach($nodes as $node) {
$node->remove();
}
Sort of a two part question but maybe one answers the other. I'm trying to get a piece of information out of an
<div id="foo">
<div class="bar"><a data1="xxxx" data2="xxxx" href="http://foo.bar">Inner text"</a>
<div class="bar2"><a data3="xxxx" data4="xxxx" href="http://foo.bar">more text"</a>
Here is what I'm using now.
$articles = array();
$html=file_get_html('http://foo.bar');
foreach($html->find('div[class=bar] a') as $a){
$articles[] = array($a->href,$a->innertext);
}
This works perfectly to grab the href and the inner text from the first div class. I tried adding a $a->data1 to the foreach but that didn't work.
How do I grab those inner data tags at the same time I grab the href and innertext.
Also is there a good way to get both classes with one statement? I assume I could build the find off of the id and grab all the div information.
Thanks
To grab all those attributes, you should before investigate the parsed element, like this:
foreach($html->find('div[class=bar] a') as $a){
var_dump($a->attr);
}
...and see if those attributes exist. They don't seem to be valid HTML, so maybe the parser discards them.
If they exist, you can read them like this:
foreach($html->find('div[class=bar] a') as $a){
$article = array($a->href, $a->innertext);
if (isset($a->attr['data1'])) {
$article['data1'] = $a->attr['data1'];
}
if (isset($a->attr['data2'])) {
$article['data2'] = $a->attr['data2'];
}
//...
$articles[] = $article;
}
To get both classes you can use a multiple selector, separated by a comma:
foreach($html->find('div[class=bar] a, div[class=bar2] a') as $a){
...
I know this question is old, but the OP asked how they could get all the attributes in one statement. I just did this for a project I'm working on.
You can get all the attributes for an element with the getAllAttributes() method. The results are automatically stored in an array property called attr.
In the example below I am grabbing all links but you can use this with whatever you want. NOTE: This also works with data- attributes. So if there is an attribute called data-url it will be accessible with $e->attr['data-url'] after you run the getAllAttributes method.
In your case the attributes your looking for will be $e->attr['data1'] and $e->attr['data2']. Hope this helps someone if not the OP.
Get all Attributes
$html = file_get_html('somefile.html');
foreach ($html->find('a') as $e) { //used a tag here, but use whatever you want
$e->getAllAttributes();
//testing that it worked
print_r($e->attr);
}
Check this code
<?php
$html = file_get_html('somefile.html');
foreach ($html->find('a') as $e) {
$filter = $e->getAttribute('data-filter-string');
}
?>
$data1 = $html->find('.bar > a', 0)->attr['data1'];
$data2 = $html->find('.bar > a', 0)->attr['data2'];
I would like to use Simple HTML DOM to remove all images in an article so I can easily create a small snippet of text for a news ticker but I haven't figured out how to remove elements with it.
Basically I would do
Get content as HTML string
Remove all image tags from content
Limit content to x words
Output.
Any help?
There is no dedicated methods for removing elements. You just find all the img elements and then do
$e->outertext = '';
when you only delete the outer text you delete the HTML content itself, but if you perform another find on the same elements it will appear in the result. the reason is that the simple HTML DOM object still has it's internal structure of the element, only without its actual content. what you need to do in order to really delete the element is simply reload the HTML as string to the same variable. this way the object will be recreated without the deleted content, and the simple HTML DOM object will be built without it.
here is an example function:
public function removeNode($selector)
{
foreach ($this->find($selector) as $node)
{
$node->outertext = '';
}
$this->load($this->save());
}
put this function inside the simple_html_dom class and you're good.
I think you have some difficulties because you forgot to save(dump the internal DOM tree back into string).
Try this:
$html = file_get_html("http://example.com");
foreach($html ->find('img') as $item) {
$item->outertext = '';
}
$html->save();
echo $html;
I could not figure out where to put the function so I just put the following directly in my code:
$html->load($html->save());
It basically locks changes made in the for loop back into the html per above.
The supposed solutions are quite expensive and practically unusable in a big loop or other kind of repetition.
I prefer to use "soft deletes":
foreach($html->find('somecondition'),$item){
if (somecheck) $item->setAttribute('softDelete', true); //<= set marker to check in further code
$item->outertext='';
foreach($foo as $bar){
if(!baz->getAttribute('softDelete'){
//do something
}
}
}
This is working for me:
foreach($html->find('element') as $element){
$element = NULL;
}
Adding new answer since removeNode is definitely a better way of removing it:
$html->removeNode('img');
This method probably was not available when accepted answer was marked. You do not need to loop the html to find each one, this will remove them.
Use outerhtml instead of outertext
<div id='your_div'>the contents of your div</div>
$your_div->outertext = '';
echo $your_div // echoes <div id='your_div'></div>
$your_div->outerhtml= '';
echo $your_div // echoes nothing
Try this:
$dom = new Dom();
$dom->loadStr($text);
foreach ($dom->find('element') as $element) {
$element->delete();
}
This works now:
$element->remove();
You can see the documentation for the method here.
Below I remove the HEADER and all SCRIPT nodes of the incoming url by using 2 different methods of the FIND() function. Remove the 2nd parameter to return an array of all matching nodes then just loop through the nodes.
$clean_html = file_get_html($url);
// Find and remove 1st instance of node.
$node = $clean_html->find('header', 0);
$node->remove();
// Find and remove all instances of Nde.
$nodes = $clean_html->find('script');
foreach($nodes as $node) {
$node->remove();
}
I cant get the data between the tags into the arrays:
// Load the HTML string from file and create a SimpleXMLElement
$html_string = file_get_contents("data/csr.html"); /*the string really is in $html_string*/
$root = new SimpleXMLElement($html_string);
Problem starts here when I try to get that the value between the tags: div, h2 and span into an array
// Fetch all div, h2 and span values
$divArray = $hdlsArray = $dtlsArray = array();
foreach ($root->div as $div) {
$divArray[] = $div;
echo "".$div."<br />";
}
foreach ($root->h2 as $h2) {
$hdlsArray[] = $h2;
echo "".$h2."<br />";
}
foreach ($root->span as $span) {
$dtlsArray[] = $span;
echo "".$span."<br />";
}
The result of this is a blank page instead of printing the actual tag data
As an alternate to SimpleXMLElement, I suggest Simple HTML DOM (online manual). I've used it before and very much satisfied with the results. It allows you to use jQuery like selectors so fetching all div, h2 and span values is fairly simple.
This page says (about SimpleXML) "the only problem with it is that it'll only load valid XML" but may provide a workaround for HTML.
The 'Related Questions' on StackOverflow include this one, but it describes HTML inside valid XML tags.