Get text contained within a specific HTML element using PHP

I need to get all of the text contained between a specific div. In the following example I want to get everything between the div with class name "st" :
<div class="title">This is a title</div>
<div class="st">Some example <em>text</em> here.</div>
<div class="footer">Footer text</div>
So the result would be
Some example <em>text</em> here.
or even just
Some example text here.
Does anyone know how to accomplish this?

Server-side in PHP
A very basic way would be something like this:
$data = ''; // your HTML data from the question
preg_match( '/<div class="st">(.*?)<\/div>/', $data, $match );
Then read the captured text from $match[1]. However, this could return bad data if your .st DIV has another DIV inside it.
A more proper way would be:
function getData( $data )
{
$dom = new DOMDocument;
$dom -> loadHTML( $data );
$divs = $dom -> getElementsByTagName('div');
foreach ( $divs as $div )
{
if ( $div -> hasAttribute('class') && strpos( $div -> getAttribute('class'), 'st' ) !== false )
{
return $div -> nodeValue;
}
}
}
Client-side
If you're using jQuery, it would be easy like this:
$('.st').text();
or
$('.st').html();
If you're using plain JavaScript, it is a little more complicated, because you'll have to check all DIV elements until you find the one with your desired CSS class:
function foo()
{
var divs = document.getElementsByTagName('div'), i;
for (i = 0; i < divs.length; i++)
{
if (divs[i].className.indexOf('st') > -1)
{
return divs[i].innerHTML;
}
}
}

Use DOM. Example:
$html_str = "<html><body><div class='st'>Some example <em>text</em> here.</div></body></html>";
$dom = new DOMDocument('1.0', 'iso-8859-1');
$dom->loadHTML($html_str); // just one method of loading html.
// Alternatively, load from a file or URL: $dom->loadHTMLFile("some_url_to_html_file");
$divs = getElementsByClassName($dom,"st");
$div = $divs[0];
$str = '';
foreach ($div->childNodes as $node) {
$str .= $dom->saveHTML($node);
}
print_r($str);
The below function is not mine, but this user's. If you find this function useful, go to the previously linked answer and vote it up.
function getElementsByClassName(DOMDocument $domNode, $className) {
$elements = $domNode->getElementsByTagName('*');
$matches = array();
foreach($elements as $element) {
if (!$element->hasAttribute('class')) {
continue;
}
$classes = preg_split('/\s+/', $element->getAttribute('class'));
if (!in_array($className, $classes)) {
continue;
}
$matches[] = $element;
}
return $matches;
}
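For what it's worth, the class lookup can also be done with DOMXPath instead of a hand-written loop. The following is only a minimal sketch: the function name extractByClass is made up for illustration, and it assumes the class name is a simple, trusted token (it is pasted straight into the XPath expression). It returns both variants asked for in the question, the inner HTML and the plain text.

```php
<?php
// Hypothetical helper (not from the answers above): returns the inner HTML
// and plain text of the first element carrying the given class token.
function extractByClass(string $html, string $class): ?array
{
    $dom = new DOMDocument();
    // Collect libxml warnings internally instead of printing them.
    libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    // Match the class as a whole token, so "post" or "first" do not match "st".
    $query = sprintf(
        "//*[contains(concat(' ', normalize-space(@class), ' '), ' %s ')]",
        $class
    );
    $node = $xpath->query($query)->item(0);
    if ($node === null) {
        return null; // no element with that class
    }

    // Serialize each child to get the inner HTML without the wrapping tag.
    $inner = '';
    foreach ($node->childNodes as $child) {
        $inner .= $dom->saveHTML($child);
    }
    return ['html' => $inner, 'text' => $node->textContent];
}

$page = '<div class="title">This is a title</div>'
      . '<div class="st">Some example <em>text</em> here.</div>'
      . '<div class="footer">Footer text</div>';
$result = extractByClass($page, 'st');
echo $result['html'], "\n";
echo $result['text'], "\n";
```

Matching on a whole class token avoids the false positives a plain substring check on the class attribute can give.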

PHP is a server-side language; to do this you should use a client-side language like JavaScript (and possibly a library like jQuery for easy and fast cross-browser coding). Then use JavaScript to send the data you need to the backend for processing (Ajax).
jQuery example:
var myText = jQuery(".st").text();
jQuery.ajax({
type: 'POST',
url: 'myBackendUrl',
data: { myTextParam: myText },
success: function(){
alert('done!');
},
});
Then, in php:
<?php
$text = $_POST['myTextParam'];
// do something with text

Using an XML parser:
$htmlDom = simplexml_load_string($htmlSource);
$results = $htmlDom->xpath("//div[@class='st']/text()");
foreach ($results as $node) {
echo $node, "\n";
}

use jquery/ajax
then do something like:
<script>
$(document).ready(function() {
$.ajax({
type: "POST",
url: "urltothepageyouneed the info",
data: { ajax: "ajax", divcontent:$(".st").html()}
})
});
</script>
Basically
$(".st").html()
will return the HTML
and
$(".st").text()
Will return the text
Hope that helps

Related

PHP: Remove javascript events from html

Is there any way to remove js events like 'onload', 'onclick',... from html elements in PHP?
For example if <a (onclick)="alert('hi')">Link</a> is given, the desired output should be <a>Link</a>.
I did it this way:
$dom = new DOMDocument;
$dom->loadHTML($request->request->get('description'));
$nodes = $dom->getElementsByTagName('*');
foreach($nodes as $node)
{
if ($node->hasAttribute('onload'))
{
$node->removeAttribute('onload');
}
if ($node->hasAttribute('onclick'))
{
$node->removeAttribute('onclick');
}
}
$dom->saveHTML();
However, I'm not sure this is a safe way to do that: if a new JS event is introduced later, there is a real chance I'll forget to blacklist it.
function filterText($value)
{
if(!$value) return $value;
return escapeJsEvent(removeScriptTag($value));
}
function escapeJsEvent($value){
return preg_replace('/(<.+?)(?<=\s)on[a-z]+\s*=\s*(?:([\'"])(?!\2).+?\2|(?:\S+?\(.*?\)(?=[\s>])))(.*?>)/i', "$1 $3", $value);
}
function removeScriptTag($text)
{
$search = array("'<script[^>]*?>.*?</script>'si",
"'<iframe[^>]*?>.*?</iframe>'si");
$replace = array('','');
$text = preg_replace($search, $replace, $text);
return preg_replace_callback("'&#(\d+);'", function ($m) {
return chr($m[1]);
}, $text);
}
echo filterText('<img src=1 href=1 onerror="javascript:alert(1)"></img>');
You should build a JavaScript function that does this for you and run it after the body loads, because PHP code executes only when the page is generated; it can't check later whether other events have been added to the document until the page loads again.
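On the worry about forgetting to blacklist a new event: one option is to drop every attribute whose name starts with "on" rather than listing events one by one. Below is a rough sketch under that assumption. The function name stripEventAttributes and the wrapper div are made up for illustration, and this is not a complete sanitizer (it does nothing about script tags or javascript: URLs).

```php
<?php
// Hypothetical helper: removes every attribute whose name starts with "on"
// instead of checking a fixed list of event names.
function stripEventAttributes(string $html): string
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true);
    // Wrap the fragment so the result can be read back from one known element.
    $dom->loadHTML('<div>' . $html . '</div>');
    libxml_clear_errors();

    foreach ($dom->getElementsByTagName('*') as $element) {
        // Copy the attribute list first; it is live and changes on removal.
        foreach (iterator_to_array($element->attributes) as $attr) {
            if (stripos($attr->nodeName, 'on') === 0) {
                $element->removeAttributeNode($attr);
            }
        }
    }

    // The first div in document order is the wrapper added above.
    $wrapper = $dom->getElementsByTagName('div')->item(0);
    $out = '';
    foreach ($wrapper->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }
    return $out;
}

$clean = stripEventAttributes('<a onclick="alert(1)" href="page.html">Link</a>');
echo $clean, "\n";
```

The trade-off is that any legitimate attribute beginning with "on" would be removed too, which is usually acceptable for user-submitted descriptions.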

Parsing HTML Table Data from XML with PHP

I am somewhat new with PHP, but can't really wrap my head around what I am doing wrong here given my situation.
Problem: I am trying to get the href of a certain HTML element within a string of characters inside an XML object/element via Reddit (if you visit this page, it would be the actual link of the video - not the reddit link but the external youtube link or whatever - nothing else).
Here is my code so far (code updated):
Update: Loop-mania! Got all of the hrefs, but am now trying to store them inside a global array to access a random one outside of this function.
function getXMLFeed() {
echo "<h2>Reddit Items</h2><hr><br><br>";
//$feedURL = file_get_contents('https://www.reddit.com/r/videos/.xml?limit=200');
$feedURL = 'https://www.reddit.com/r/videos/.xml?limit=200';
$xml = simplexml_load_file($feedURL);
//define each xml entry from reddit as an item
foreach ($xml -> entry as $item ) {
foreach ($item -> content as $content) {
$newContent = (string)$content;
$html = str_get_html($newContent);
foreach($html->find('table') as $table) {
$links = $table->find('span', '0');
//echo $links;
foreach($links->find('a') as $link) {
echo $link->href;
}
}
}
}
}
XML Code:
http://pasted.co/0bcf49e8
I've also included JSON if it can be done this way; I just preferred XML:
http://pasted.co/f02180db
That is pretty much all of the code. Though, here is another piece I tried to use with DOMDocument (scrapped it).
foreach ($item -> content as $content) {
$dom = new DOMDocument();
$dom -> loadHTML($content);
$xpath = new DOMXPath($dom);
$classname = "/html/body/table[1]/tbody/tr/td[2]/span[1]/a";
foreach ($dom->getElementsByTagName('table') as $node) {
echo $dom->saveHtml($node), PHP_EOL;
//$originalURL = $node->getAttribute('href');
}
//$html = $dom->saveHTML();
}
I can parse the table fine, but when it comes to getting certain element's values (nothing has an ID or class), I can only seem to get ALL anchor tags or ALL table rows, etc.
Can anyone point me in the right direction? Let me know if there is anything else I can add here. Thanks!
Added HTML:
I am specifically trying to extract <span>[link]</span> from each table/item.
http://pastebin.com/QXa2i6qz
The following code extracts all the YouTube links from each entry's content.
function extract_youtube_link($xml) {
$entries = $xml['entry'];
$videos = [];
foreach($entries as $entry) {
$content = html_entity_decode($entry['content']);
preg_match_all('/<span><a href="(.*)">\[link\]/', $content, $matches);
if(!empty($matches[1][0])) {
$videos[] = array(
'entry_title' => $entry['title'],
'author' => preg_replace('/\/(.*)\//', '', $entry['author']['name']),
'author_reddit_url' => $entry['author']['uri'],
'video_url' => $matches[1][0]
);
}
}
return $videos;
}
$xml = simplexml_load_file('reddit.xml');
$xml = json_decode(json_encode($xml), true);
$videos = extract_youtube_link($xml);
foreach($videos as $video) {
echo "<p>Entry Title: {$video['entry_title']}</p>";
echo "<p>Author: {$video['author']}</p>";
echo "<p>Author URL: {$video['author_reddit_url']}</p>";
echo "<p>Video URL: {$video['video_url']}</p>";
echo "<br><br>";
}
The code returns a multidimensional array in which each element contains entry_title, author, author_reddit_url and video_url. Hope it helps you!
If you're looking for a specific element you don't need to parse the whole thing. One way of doing it could be to use the DOMXPath class and query directly the xml. The documentation should guide you through.
http://php.net/manual/es/class.domxpath.php .
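To make the DOMXPath suggestion a bit more concrete, here is a small sketch that pulls only the href of the anchor labelled [link] out of one entry's HTML. The function name extractLinkHrefs and the sample markup are assumptions for illustration; the sample is shaped after the span/a structure described in the question, not copied from the real feed.

```php
<?php
// Hypothetical helper: returns the href of every <a> inside a <span>
// whose text is exactly "[link]".
function extractLinkHrefs(string $contentHtml): array
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true);
    $dom->loadHTML($contentHtml);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    $hrefs = [];
    foreach ($xpath->query('//span/a[normalize-space(.) = "[link]"]') as $a) {
        $hrefs[] = $a->getAttribute('href');
    }
    return $hrefs;
}

// Made-up sample shaped like the markup described in the question.
$sample = '<table><tr><td><a href="https://www.reddit.com/r/videos/x">thumb</a></td>'
        . '<td><span><a href="https://www.youtube.com/watch?v=abc">[link]</a></span> '
        . '<span><a href="https://www.reddit.com/r/videos/x">[comments]</a></span></td></tr></table>';
$links = extractLinkHrefs($sample);
print_r($links);
```

Calling this once per entry and merging the returned arrays would give a single list of video URLs to pick from.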

Transferring DOMDocument from PHP to Javascript using function

I have a PHP function creating a DOMDocument XML file, i need to get the DOMDocument into Javascript, i have thought about using
The function in PHP returns the DOMDocument, this is the PHP function
function coursexml($cc, $type){
$xmlfile = new DOMDocument();
if (@$xmlfile->load("books.xml") === false || $cc == "" || $type == "") {
header('location:/assignment/errors/500');
exit;
}
$string = "";
$xpath = new DOMXPath($xmlfile);
$nodes = $xpath->query("/bookcollection/items/item[courses/course='$cc']");
$x = 0;
foreach( $nodes as $n ) {
$id[$x] = $n->getAttribute("id");
$titles = $n->getElementsByTagName( "title" );
$title[$x] = $titles->item(0)->nodeValue;
$title[$x] = str_replace(" /", "", $title[$x]);
$title[$x] = str_replace(".", "", $title[$x]);
$isbns = $n->getElementsByTagName( "isbn" );
$isbn[$x] = $isbns->item(0)->nodeValue;
$bcs = $n->getElementsByTagName( "borrowedcount" );
$borrowedcount[$x] = $bcs->item(0)->nodeValue;
if ($string != "") $string = $string . ", ";
$string = $string . $x . "=>" . $borrowedcount[$x];
$x++;
}
if ($x == 0) header('location:/assignment/errors/501');
$my_array = eval("return array({$string});");
asort($my_array);
$coursexml = new DOMDocument('1.0', 'utf-8');
$coursexml->formatOutput = true;
$node = $coursexml->createElement('result');
$coursexml->appendChild($node);
$root = $coursexml->getElementsByTagName("result");
foreach ($root as $r) {
$node = $coursexml->createElement('course', "$cc");
$r->appendChild($node);
$node = $coursexml->createElement('books');
$r->appendChild($node);
$books = $coursexml->getElementsByTagName("books");
foreach ($books as $b) {
foreach ($my_array as $counter => $bc) {
$bnode = $coursexml->createElement('book');
$bnode = $b->appendChild($bnode);
$bnode->setAttribute('id', "$id[$counter]");
$bnode->setAttribute('title', "$title[$counter]");
$bnode->setAttribute('isbn', "$isbn[$counter]");
$bnode->setAttribute('borrowedcount', "$borrowedcount[$counter]");
}
}
}
return $coursexml;
}
So what i want to do is call the function in Javascript, and returns the DOMDocument.
Try the following
<?php include('coursexml.php'); ?>
<script>
var xml = <?php $xml = coursexml("CC140", "xmlfile");
echo json_encode($xml->saveXML()); ?>;
document.write("output" + xml);
var xmlDoc = (new DOMParser()).parseFromString(xml, 'text/xml');
</script>
You can simply expose this function at a URL (e.g. in a standalone file; that's up to you) and call it from the client side via AJAX. For details on making such a call, see "How to make an AJAX call without jQuery?".
Edit:
Try creating a simple PHP file that includes and calls the function you have. From what you've described so far, it will probably look like
<?php
include("functions.php");
print coursexml($cc, $type)->saveXML();
Assuming this file is called xml.php, when you access it in your browser at http://mydomain.com/xml.php you should see the XML document (nothing related to JavaScript so far).
Now, in your main document, you include a piece of Javascript that will call upon this URL to load the XML. An example would be (assuming you are using jQuery, for a simple Javascript function reference the above link) :
$.ajax({
url: "xml.php",
success: function(data){
// data will contain your XML and can be used in JavaScript here
}
});

Reading an attribute of one HTML tag and writing it to an attribute of another HTML tag

I have an HTML <img> with an "alt" attribute. My <img> is wrapped in an <a>. The <a> has a "title" attribute. For example:
<a title="" href ="page.html"><img src="image.jpg" alt="text"></a>
I need to read the value of the "alt" attribute of the <img> and write it to the "title" attribute value of the <a>. Is there a way to do this in PHP?
You can do this in PHP:
$url="http://example.com";
$html = file_get_contents($url);
$doc = new DOMDocument();
@$doc->loadHTML($html);
$tags = $doc->getElementsByTagName('img');
foreach ($tags as $tag) {
echo $tag->getAttribute('alt');
}
As started by NullPointer,
$url="http://example.com";
$html = file_get_contents($url);
$doc = new DOMDocument();
@$doc->loadHTML($html);
$tags = $doc->getElementsByTagName('img');
foreach ($tags as $tag) {
$parent = $tag->parentNode;
if($parent->nodeName == 'a') {
$parent->setAttribute('title', $tag->getAttribute('alt'));
}
}
Hope it helps
You can try with JQUERY,
<script>
var altValue = $('img').attr('alt');
$('a').attr('title', altValue);
</script>
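Use the following version when you need to run this from an onclick handler and the page has several img and a tags. Pass in the element that contains them (this cannot be used as a parameter name, so it is called el here):
<script>
function changeTitle(el) {
var altValue = $(el).find('img').attr('alt');
$(el).find('a').attr('title', altValue);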
}
</script>
While the question is tagged php, I thought I'd offer a simple, plain-JavaScript, means of doing the same thing client-side:
function attributeToParent(tag, from, to, within) {
if (!tag || !from || !to) {
return false;
}
else {
within = within && within.nodeType == 1 ? within : document;
var all = within.getElementsByTagName(tag);
for (var i = 0, len = all.length; i < len; i++) {
if (all[i].getAttribute(from) && all[i].parentNode.tagName.toLowerCase() !== 'body') {
all[i].parentNode.setAttribute(to, all[i].getAttribute(from));
}
}
}
}
attributeToParent('img', 'alt', 'title');
JS Fiddle demo.
This could be tidied up somewhat, but I think it's relatively clear as it is (albeit a little messier than I'd like).
References:
getAttribute().
getElementsByTagName().
setAttribute().
String.toLowerCase().
element.tagName.
PHP is a server-side language, so once the output is rendered, you cannot change it anymore. (Unless you download the content with PHP and then output the changed data, but that seems only useful if you cannot access the original source.) If you are creating the output in PHP you could use:
$alt = 'text';
echo '<a title="'.$alt.'" href ="page.html"><img src="image.jpg" alt="'.$alt.'"></a>';
If you already have the output, you could use jquery ( http://jquery.com/ )
<script type='text/javascript'>
//perform after page is done
$(function() {
//each image in an a tag
$('a img').each(function() {
var $el = $(this);
var alt = $el.attr('alt');
$el.parent('a').attr('title', alt);
});
});
</script>
Update
If it's pure PHP string modification, you could also use a regular expression to change it, instead of the DOM manipulation:
$string = '<a title="" href ="page.html"><img src="image.jpg" alt="text"></a>';
$pattern = '/<a(.*?)title="(.*?)"(.*?)<img(.*?)alt="(.*?)"(.*?)<\/a>/i';
$replacement = '<a${1}title="$5"${3}<img${4}alt="${5}"${6}</a>';
echo preg_replace($pattern, $replacement, $string);

Remove HTML element from parsed HTML document on a condition

I've parsed a HTML document using Simple PHP HTML DOM Parser. In the parsed document there's a ul-tag with some li-tags in it. One of these li-tags contains one of those dreaded "Add This" buttons which I want to remove.
To make this worse, the list item has no class or id, and it is not always in the same position in the list. So there is no easy way (correct me if I'm wrong) to remove it with the parser.
What I want to do is to search for the string 'addthis.com' in all li-elements and remove any element that contains that string.
<ul>
<li>Foobar</li>
<li>addthis.com</li><!-- How do I remove this? -->
<li>Foobar</li>
</ul>
FYI: This is purely a hobby project in my quest to learn PHP and not a case of content theft for profit.
All suggestions are welcome!
Couldn't find a method to remove nodes explicitly, but can remove with setting outertext to empty.
$html = new simple_html_dom();
$html->load(file_get_contents("test.html"), false, false); // preserve formatting
foreach($html->find('ul li') as $element) {
if (count($element->find('a.addthis_button')) > 0) {
$element->outertext="";
}
}
echo $html;
Well what you can do is use jQuery after the parsing. Something like this:
$('li').each(function(i) {
if($(this).html() == "addthis.com"){
$(this).remove();
}
});
This solution uses DOMDocument class and domnode.removechild method:
$str="<ul><li>Foobar</li><li>addthis.com</li><li>Foobar</li></ul>";
$remove='addthis.com';
$doc = new DOMDocument();
$doc->loadHTML($str);
$elements = $doc->getElementsByTagName('li');
$domElemsToRemove = array();
foreach ($elements as $element) {
$pos = strpos($element->textContent, $remove); // or similar $element->nodeValue
if ($pos !== false) {
$domElemsToRemove[] = $element;
}
}
foreach( $domElemsToRemove as $domElement ){
$domElement->parentNode->removeChild($domElement);
}
$str = $doc->saveHTML(); // <ul><li>Foobar</li><li>Foobar</li></ul>
