Get all titles of area's using html dom parser - php

Trying to get all the title="" vales of an area's but its only picking the the main title of the page.
Sample area html:
<area shape="poly" title="Orkney" href="orkney" coords="" />
The page has lots of these and i'm trying to get all the title names
My PHP:
function getTextBetweenTags($string, $tagname) {
// Create DOM from string
$html = str_get_html($string);
$titles = array();
// Find all tags
foreach($html->find($tagname) as $element) {
$titles[] = $element->plaintext;
}
return $titles;
}
print_r(getTextBetweenTags($url, 'title'));
All i get is the title of the page, is this the correct way, or can it not be done, thanks for any help :)

call the method like this:
print_r(getTextBetweenTags($url, 'area'));
since you're searching for area tags. The title is an argument. See this documentation for an idea how to do it.

You need to look for <area> elements and return their title attribute
function getTextBetweenTags($string, $tagname) {
// Create DOM from string
$html = str_get_html($string);
$titles = array();
// Find all tags
foreach($html->find($tagname) as $element) {
$titles[] = $element->title;
}
return $titles;
}
print_r(getTextBetweenTags($url, 'area'));
And if you using the $url for the paramenter, shouldn't you be using file_get_html() instead of str_get_html()?

Related

Parsing HTML Table Data from XML with PHP

I am somewhat new with PHP, but can't really wrap my head around what I am doing wrong here given my situation.
Problem: I am trying to get the href of a certain HTML element within a string of characters inside an XML object/element via Reddit (if you visit this page, it would be the actual link of the video - not the reddit link but the external youtube link or whatever - nothing else).
Here is my code so far (code updated):
Update: Loop-mania! Got all of the hrefs, but am now trying to store them inside a global array to access a random one outside of this function.
function getXMLFeed() {
echo "<h2>Reddit Items</h2><hr><br><br>";
//$feedURL = file_get_contents('https://www.reddit.com/r/videos/.xml?limit=200');
$feedURL = 'https://www.reddit.com/r/videos/.xml?limit=200';
$xml = simplexml_load_file($feedURL);
//define each xml entry from reddit as an item
foreach ($xml -> entry as $item ) {
foreach ($item -> content as $content) {
$newContent = (string)$content;
$html = str_get_html($newContent);
foreach($html->find('table') as $table) {
$links = $table->find('span', '0');
//echo $links;
foreach($links->find('a') as $link) {
echo $link->href;
}
}
}
}
}
XML Code:
http://pasted.co/0bcf49e8
I've also included JSON if it can be done this way; I just preferred XML:
http://pasted.co/f02180db
That is pretty much all of the code. Though, here is another piece I tried to use with DOMDocument (scrapped it).
foreach ($item -> content as $content) {
$dom = new DOMDocument();
$dom -> loadHTML($content);
$xpath = new DOMXPath($dom);
$classname = "/html/body/table[1]/tbody/tr/td[2]/span[1]/a";
foreach ($dom->getElementsByTagName('table') as $node) {
echo $dom->saveHtml($node), PHP_EOL;
//$originalURL = $node->getAttribute('href');
}
//$html = $dom->saveHTML();
}
I can parse the table fine, but when it comes to getting certain element's values (nothing has an ID or class), I can only seem to get ALL anchor tags or ALL table rows, etc.
Can anyone point me in the right direction? Let me know if there is anything else I can add here. Thanks!
Added HTML:
I am specifically trying to extract <span>[link]</span> from each table/item.
http://pastebin.com/QXa2i6qz
The following code can extract you all the youtube links from each content.
function extract_youtube_link($xml) {
$entries = $xml['entry'];
$videos = [];
foreach($entries as $entry) {
$content = html_entity_decode($entry['content']);
preg_match_all('/<span><a href="(.*)">\[link\]/', $content, $matches);
if(!empty($matches[1][0])) {
$videos[] = array(
'entry_title' => $entry['title'],
'author' => preg_replace('/\/(.*)\//', '', $entry['author']['name']),
'author_reddit_url' => $entry['author']['uri'],
'video_url' => $matches[1][0]
);
}
}
return $videos;
}
$xml = simplexml_load_file('reddit.xml');
$xml = json_decode(json_encode($xml), true);
$videos = extract_youtube_link($xml);
foreach($videos as $video) {
echo "<p>Entry Title: {$video['entry_title']}</p>";
echo "<p>Author: {$video['author']}</p>";
echo "<p>Author URL: {$video['author_reddit_url']}</p>";
echo "<p>Video URL: {$video['video_url']}</p>";
echo "<br><br>";
}
The code outputs in the multidimensional format of array with the elements inside are entry_title, author, author_reddit_url and video_url. Hope it helps you!
If you're looking for a specific element you don't need to parse the whole thing. One way of doing it could be to use the DOMXPath class and query directly the xml. The documentation should guide you through.
http://php.net/manual/es/class.domxpath.php .

Get all value of h1 tag

I want to get all value of h1 tag and I found this article: getting all values from h1 tags using php
Then I tried to using:
<?php
include "functions/simple_html_dom.php";
function getTextBetweenTags($string, $tagname) {
// Create DOM from string
$html = str_get_html($string);
$titles = array();
// Find all tags
foreach($html->find($tagname) as $element) {
$titles[] = $element->plaintext;
}
return $titles;
}
echo getTextBetweenTags("http://mydomain.com/","h1");
?>
But it's not running and I get:
Notice: Array to string conversion in C:\xampp\htdocs\checker\abc.php
on line 14 Array
Plz help me fix it. I want to get all h1 tag of a website with input data is URL of that website. Thanks so much!
You're trying to echo an array, which will result into an error. And the function is little bit off. Example:
include 'functions/simple_html_dom.php';
function getTextBetweenTags($url, $tagname) {
$values = array();
$html = file_get_html($url);
foreach($html->find($tagname) as $tag) {
$values[] = trim($tag->innertext);
}
return $values;
}
$output = getTextBetweenTags('http://www.stackoverflow.com/', 'h1');
echo '<pre>';
print_r($output);

Extracting multiple strong tags using PHP Simple HTML DOM Parser

I have over 500 pages (static) containing content structures this way,
<section>
Some text
<strong>Dynamic Title (Different on each page)</strong>
<strong>Author name (Different on each page)</strong>
<strong>Category</strong>
(<b>Content</b> <b>MORE TEXT HERE)</b>
</section>
And I need to extract the data as formatted below, using PHP Simple HTML DOM Parser
$title = <strong>Dynamic Title (Different on each page)</strong>
$authot = <strong>Author name (Different on each page)</strong>
$category = <strong>Category</strong>
$content = (<b>Content</b> <b>MORE TEXT HERE</b>)
I have failed so far and can't get my head around it, appreciate any advice or code snippet to help me going on.
EDIT 1,
I have now solved the part with strong tags using,
$html = file_get_html($url);
$links = array();
foreach($html->find('strong') as $a) {
$content[] = $a->innertext;
}
$title= $content[0];
$author= $content[1];
the only remaining issue is --> How to extract content within parentheses? using similar method?
OK first you want to get all of the tags
Then you want to search through those again for the tags and tags
Something like this:
// Create DOM from URL or file
$html = file_get_html('http://www.example.com/');
$strong = array();
// Find all <sections>
foreach($html->find('section') as $element) {
$section = $element->src;
// get <strong> tags from <section>
foreach($section->find('strong') as $strong) {
$strong[] = $strong->src;
}
$title = $strong[0];
$authot = $strong[1];
$category = $strong[2];
}
To get the parts in parentheses - just get the b tag text and then add the () brackets.
Or if you're asking how to get parts in between the brackets - use explode then remove the closing bracket:
$pieces = explode("(", $title);
$different_on_each_page = str_replace(")","",$pieces[1]);
$html_code = 'html';
$dom = new \DOMDocument();
$dom->LoadHTML($html_code);
$xpath = new \DOMXPath($this->dom);
$nodelist = $xpath->query("//strong");
for($i = 0; $i < $nodelist->length; $i++){
$nodelist->item($i)->nodeValue; //gives you the text inside
}
My final code that works now looks like this.
$html = file_get_html($url);
$links = array();
foreach($html->find('strong') as $a) {
$content[] = $a->innertext;
}
$title= $content[0];
$author= $content[1];
$category = $content[2];
$details = file_get_html($url)->plaintext;
$input = $details;
preg_match_all("/\(.*?\)/", $input, $matches);
print_r($matches[0]);

Remove HTML element from parsed HTML document on a condition

I've parsed a HTML document using Simple PHP HTML DOM Parser. In the parsed document there's a ul-tag with some li-tags in it. One of these li-tags contains one of those dreaded "Add This" buttons which I want to remove.
To make this worse, the list item has no class or id, and it is not always in the same position in the list. So there is no easy way (correct me if I'm wrong) to remove it with the parser.
What I want to do is to search for the string 'addthis.com' in all li-elements and remove any element that contains that string.
<ul>
<li>Foobar</li>
<li>addthis.com</li><!-- How do I remove this? -->
<li>Foobar</li>
</ul>
FYI: This is purley a hobby project in my quest to learn PHP and not a case of content theft for profit.
All suggestions are welcome!
Couldn't find a method to remove nodes explicitly, but can remove with setting outertext to empty.
$html = new simple_html_dom();
$html->load(file_get_contents("test.html"), false, false); // preserve formatting
foreach($html->find('ul li') as $element) {
if (count($element->find('a.addthis_button')) > 0) {
$element->outertext="";
}
}
echo $html;
Well what you can do is use jQuery after the parsing. Something like this:
$('li').each(function(i) {
if($(this).html() == "addthis.com"){
$(this).remove();
}
});
This solution uses DOMDocument class and domnode.removechild method:
$str="<ul><li>Foobar</li><li>addthis.com</li><li>Foobar</li></ul>";
$remove='addthis.com';
$doc = new DOMDocument();
$doc->loadHTML($str);
$elements = $doc->getElementsByTagName('li');
$domElemsToRemove = array();
foreach ($elements as $element) {
$pos = strpos($element->textContent, $remove); // or similar $element->nodeValue
if ($pos !== false) {
$domElemsToRemove[] = $element;
}
}
foreach( $domElemsToRemove as $domElement ){
$domElement->parentNode->removeChild($domElement);
}
$str = $doc->saveHTML(); // <ul><li>Foobar</li><li>Foobar</li></ul>

PHP DOMDocument stripping HTML tags

I'm working on a small templating engine, and I'm using DOMDocument to parse the pages. My test page so far looks like this:
<block name="content">
<?php echo 'this is some rendered PHP! <br />' ?>
<p>Main column of <span>content</span></p>
</block>
And part of my class looks like this:
private function parse($tag, $attr = 'name')
{
$strict = 0;
/*** the array to return ***/
$out = array();
if($this->totalBlocks() > 0)
{
/*** a new dom object ***/
$dom = new domDocument;
/*** discard white space ***/
$dom->preserveWhiteSpace = false;
/*** load the html into the object ***/
if($strict==1)
{
$dom->loadXML($this->file_contents);
}
else
{
$dom->loadHTML($this->file_contents);
}
/*** the tag by its tag name ***/
$content = $dom->getElementsByTagname($tag);
$i = 0;
foreach ($content as $item)
{
/*** add node value to the out array ***/
$out[$i]['name'] = $item->getAttribute($attr);
$out[$i]['value'] = $item->nodeValue;
$i++;
}
}
return $out;
}
I have it working the way I want in that it grabs each <block> on the page and injects it's contents into my template, however, it is stripping the HTML tags within the <block>, thus returning the following without the <p> or <span> tags:
this is some rendered PHP! Main column of content
What am I doing wrong here? :) Thanks
Nothing: nodeValue is the concatenation of the value portion of the tree, and will never have tags.
What I would do to make an HTML fragment of the tree under $node is this:
$doc = new DOMDocument();
foreach($node->childNodes as $child) {
$doc->appendChild($doc->importNode($child, true));
}
return $doc->saveHTML();
HTML "fragments" are actually more problematic than you'd think at first, because they tend to lack things like doctypes and character sets, which makes it hard to deterministically go back and forth between portions of a DOM tree and HTML fragments.

Categories