I have some problems with PHP Simple HTML DOM and don't know how to get some specific data. I read the manual and tried it myself, but it looks like I'm missing something, so I hope somebody can help me.
1st problem:
HTML:
<div>
<h4>Režie:</h4>
<span data-truncate="60">
Ridley Scott
</span>
</div>
<div>
<h4>Scénář:</h4>
<span data-truncate="60">
William Monahan
</span>
</div>
<div>
<h4>Kamera:</h4>
<span data-truncate="60">
John Mathieson
</span>
</div>
<div>
<h4>Hudba:</h4>
<span data-truncate="60">
Harry Gregson-Williams
</span>
</div>
My PHP code:
$ret = $html->find('span[data-truncate*="60"]'); // director
foreach ($ret as $rezia) {
echo "$rezia <br/>";
}
But this code prints the name and the a href for every one of these names, and what I need is just the names under "Režie" (Ridley Scott) and "Scénář" (William Monahan).
2nd problem:
HTML:
<div id="rating">
<h2 class="average">71%</h2>
<p class="charts">
PHP code:
$percenta = $html->find('h2[class*="average"]'); // rating percentage
foreach ($percenta as $hodnotenie) {
echo "$hodnotenie";
}
What I get from this is 71% together with the surrounding HTML, and I want just the number, not the HTML around it. Is that possible?
3rd problem (the last one :P):
HTML:
<table>
<tr>
<th>
V kinech ČR
od:
</th>
<td class="date">
06.05.2005
</td>
</tr>
<tr>
<th>
V kinech SR
od:
</th>
<td class="date">
05.05.2005
</td>
</tr>
<tr class="separator">
<th>
Na DVD
od:
</th>
<td class="date">
01.10.2005 Bonton
</td>
</tr>
PHP code:
$ret = $html->find('td[class="date"]');
$kino = array();
foreach ($ret as $kino) {
$datum[] = $datum->innertext;
}
echo "$datum[0]";
I get no output from this and I have no idea what's wrong with my code. I just want to get those dates (so it should be 06.05.2005, 05.05.2005, 01.10.2005).
You didn't load the HTML. Look at this:
$html = str_get_html('Some text bla bla bla bla<br /><b>Date</b>: 2012-12-13<br /><br /><b>Name</b>: Peter Novak<br /><b>Hobby</b>: books,cinema,facebook');
foreach($html->find('text') as $t){
if(substr($t, 0, 1)==':')
{
// do whatever you want
echo substr($t, 1).'<br />';
}
}
Output will be
2012-12-13
Peter Novak
books,cinema,facebook
Also, check this one to load a remote site's content
$html = file_get_html('http://heera.it');
// Find all article blocks
foreach($html->find('div.post-entry') as $article) {
echo $article->find('div.post-entry-content h2 a', 0) . '<br />';
echo $article->find('div.post-entry-content p', 0)->plaintext. '<br />';
echo "<hr />";
}
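For the three problems in the question itself, here is a minimal sketch. Note that it swaps in PHP's built-in DOMDocument and DOMXPath instead of simple_html_dom, so it runs without the library. The sample markup is shortened from the question, and credit_for() is a helper name made up for illustration.

```php
<?php
// Shortened sample markup from the question.
$markup = <<<HTML
<div><h4>Režie:</h4><span data-truncate="60"> Ridley Scott </span></div>
<div><h4>Scénář:</h4><span data-truncate="60"> William Monahan </span></div>
<div id="rating"><h2 class="average">71%</h2></div>
<table>
<tr><th>V kinech ČR od:</th><td class="date"> 06.05.2005 </td></tr>
<tr><th>V kinech SR od:</th><td class="date"> 05.05.2005 </td></tr>
<tr class="separator"><th>Na DVD od:</th><td class="date"> 01.10.2005 Bonton </td></tr>
</table>
HTML;

$doc = new DOMDocument();
// The meta tag tells libxml the input is UTF-8; @ hides warnings about sloppy markup.
@$doc->loadHTML('<meta http-equiv="Content-Type" content="text/html; charset=utf-8">' . $markup);
$xpath = new DOMXPath($doc);

// Problem 1: the <span> that follows the <h4> with a given label.
// credit_for() is a hypothetical helper, not part of any library.
function credit_for(DOMXPath $xpath, string $label): ?string
{
    $spans = $xpath->query("//h4[normalize-space()='$label']/following-sibling::span");
    return $spans->length > 0 ? trim($spans->item(0)->textContent) : null;
}
$director = credit_for($xpath, 'Režie:');
$writer = credit_for($xpath, 'Scénář:');

// Problem 2: read the text of the element, then keep only the leading number.
$rating = (int) $xpath->query("//h2[contains(@class, 'average')]")->item(0)->textContent;

// Problem 3: collect into a separate array and keep only the dd.mm.yyyy part.
$dates = array();
foreach ($xpath->query("//td[@class='date']") as $cell) {
    if (preg_match('/\d{2}\.\d{2}\.\d{4}/', $cell->textContent, $match)) {
        $dates[] = $match[0];
    }
}

echo $director, ', ', $writer, "\n";
echo $rating, "\n";
echo implode(', ', $dates), "\n";
```

If you stay with simple_html_dom, the same ideas should carry over: read ->plaintext instead of echoing the element for problem 2, and in problem 3 note that the loop variable is $kino, so it is $kino->innertext that has to be pushed into $datum.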
I'm just learning simple_html_dom.php. I'm trying to get only the content of all the p elements inside the entry-content class and join it into one paragraph or one sentence.
Here is the raw HTML from the website that I want to get the content from:
<div class="entry-content">
<p><img class="alignnone" src="xxxxxxxxxxx" width="800" height="450" /></p>
<p>data1<span id="more-287848"></span></p>
<p>data2</p>
<p>data3</p>
<p>data4</p>
<p>......</p>
<p>......</p>
<p>dataN</p>
<div class="wpa wpmrec">
<a class="wpa-about" href="https://wordpress.com/about-these-ads/" rel="nofollow"></a>
<div class="u">
<script type='text/javascript'>
(function(g){g.__ATA.initAd({sectionId:34789711, width:300, height:250});})(window);
</script>
</div>
</div>
</div>
Here is my code to get it:
<?php
require_once __DIR__.'/simple_html_dom.php';
$html = new simple_html_dom();
$html->load_file('https://xxxxxxxxx');
$isi = $html->find('div[class="entry-content"]',0)->innertext;
?>
<table border="1">
<thead>
<tr>
<td><?php echo $isi; ?></td>
</tr>
</thead>
</table>
How can I do it? Thank you, guys.
You should be able to iterate over all of the <p> elements and append their text to a variable. I have not tried this, but something like this:
$complete = "";
foreach($html->find('div.entry-content p') as $p)
{
$complete .= $p->plaintext;
echo $p->plaintext;
}
echo $complete;
There's a lot of information in the documentation here:
http://simplehtmldom.sourceforge.net/manual.htm
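A sketch of the same idea that runs without the library, using PHP's built-in DOMDocument in place of simple_html_dom (the sample markup is a cut-down version of the one in the question). It skips paragraphs that have no text, such as the one holding only the image, and joins the rest with spaces:

```php
<?php
$markup = <<<HTML
<div class="entry-content">
<p><img class="alignnone" src="xxxxxxxxxxx" width="800" height="450" /></p>
<p>data1<span id="more-287848"></span></p>
<p>data2</p>
<p>data3</p>
<div class="wpa wpmrec"><a class="wpa-about" href="https://wordpress.com/about-these-ads/" rel="nofollow"></a></div>
</div>
HTML;

$doc = new DOMDocument();
@$doc->loadHTML($markup); // @ hides warnings about imperfect markup
$xpath = new DOMXPath($doc);

$parts = array();
// Only <p> elements that are direct children of the entry-content div.
foreach ($xpath->query("//div[@class='entry-content']/p") as $p) {
    $text = trim($p->textContent);
    if ($text !== '') {
        $parts[] = $text;
    }
}

$complete = implode(' ', $parts);
echo $complete; // data1 data2 data3
```

Selecting only direct children keeps the ad block's markup out of the result.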
I am working on a web scraper. I search a webpage for my product's title, and if the same product exists on the page, I want to extract the price of that product.
For this I am using XPath.
Here is the HTML from which I need to extract the price:
<div class="products_list_table">
<table id="products_list_table_table" cellspacing="6" cellpadding="0" border="0">
<tbody>
<tr>
<td valign="top" align="center">
<span class="product_title">Malik Candy FC Composite Hockey Stick</span>
<div class="list_price_bar all-cnrs">
<span class="list_price_title">Price Now:</span>
<span class="list_sale_price">£40.00</span>
</div>
</td>
</tr>
<tr>
<td valign="top" align="center">
<span class="product_title">Malik TC Stylish Hockey Stick</span>
<div class="list_price_bar all-cnrs">
<span class="list_price_title">Price Now:</span>
<span class="list_sale_price">£70.00</span>
</div>
</td>
</tr>
...
</tbody>
</table>
</div>
There are many tr tags, one for each product. I search for a product title, and if it is found I want to extract the price of that product.
Here is my PHP code in the file test.php:
<?php
set_time_limit(0);
if(isset($_POST['title']) && $_POST['title']!= ''){
$product_title = mysql_real_escape_string($_POST['title']);
$url = 'http://www.example.com';
$html = file_get_contents($url);
$doc = new DOMDocument();
@$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
$found = $xpath->evaluate("boolean(//span[contains(text(), '". $product_title ."' )])");
if($found == false){
echo "Not Found";
}
else {
$elements = $xpath->evaluate("//span[@class='list_sale_price']");
if (!is_null($elements)) {
foreach ($elements as $element) {
$nodes = $element->childNodes;
foreach ($nodes as $node) {
echo $node->nodeValue.'<br>';
}
}
}
}
}
?>
Here I am using a form in test.php to search for a product:
<html>
<head>
<title></title>
</head>
<body>
<form action="" method="post">
<label>Enter product title to search</label><br /><br />
<input type="text" name="title" size="50" /><br /><br />
<input type="submit" value="Search" onclick="msg()"/>
</form>
</body>
</html>
After finding the product, I want to extract the price of that product, but it displays all the prices on the page. Where did I make a mistake? I need an XPath expression that extracts the price of the matched product.
You don't need multiple expressions. You can extract the price with one XPath expression by selecting the div that follows your matched span and, in that context, extracting its child span that has the class list_sale_price:
//span[contains(text(), 'Malik Candy' )]/following-sibling::div/span[@class='list_sale_price']
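Put together in PHP, it could look like the sketch below. The markup is trimmed from the question; find_price() and xpath_literal() are names I made up for this example. The title goes through xpath_literal() because XPath 1.0 has no escape character for quotes, so a title containing an apostrophe would otherwise break the expression.

```php
<?php
$markup = <<<HTML
<table id="products_list_table_table"><tbody>
<tr><td>
<span class="product_title">Malik Candy FC Composite Hockey Stick</span>
<div class="list_price_bar all-cnrs">
<span class="list_price_title">Price Now:</span>
<span class="list_sale_price">£40.00</span>
</div>
</td></tr>
<tr><td>
<span class="product_title">Malik TC Stylish Hockey Stick</span>
<div class="list_price_bar all-cnrs">
<span class="list_price_title">Price Now:</span>
<span class="list_sale_price">£70.00</span>
</div>
</td></tr>
</tbody></table>
HTML;

// Hypothetical helper: quote a PHP string as an XPath 1.0 string literal.
function xpath_literal(string $value): string
{
    if (strpos($value, "'") === false) {
        return "'" . $value . "'";
    }
    if (strpos($value, '"') === false) {
        return '"' . $value . '"';
    }
    // Both quote types present: glue the pieces together with concat().
    return "concat('" . str_replace("'", "',\"'\",'", $value) . "')";
}

// Hypothetical helper: price of the first product whose title contains $title.
function find_price(DOMXPath $xpath, string $title): ?string
{
    $query = '//span[@class="product_title"][contains(., ' . xpath_literal($title) . ')]'
           . '/following-sibling::div/span[@class="list_sale_price"]';
    $prices = $xpath->query($query);
    return $prices->length > 0 ? trim($prices->item(0)->textContent) : null;
}

$doc = new DOMDocument();
// The meta tag tells libxml the input is UTF-8; @ hides warnings about sloppy markup.
@$doc->loadHTML('<meta http-equiv="Content-Type" content="text/html; charset=utf-8">' . $markup);
$xpath = new DOMXPath($doc);

echo find_price($xpath, 'Malik Candy'), "\n";
var_dump(find_price($xpath, 'No such stick')); // NULL when nothing matches
```

Because a null result already means "not found", the separate boolean() check from the question is no longer needed.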
I have a website that is built with PHP and MySQL.
It is a podcast site that I created.
The home page has a list of podcasts, and once one is clicked it brings up episode.php?id= followed by the ID that is stored in MySQL for that podcast.
At the bottom of the episode page I have added a comment box,
and I display the comments saved in MySQL using:
<?php class feedback {
public function fetch_all(){
global $pdo;
$query = $pdo->prepare("SELECT * FROM comments");
$query->execute();
return $query->fetchAll();
} }
$feedback = new feedback;
$articles = $feedback->fetch_all();
?>
<html>
<body>
<?php foreach ($articles as $feedback) { ?>
<div class="comment" align="center">Name: <font size="3" color="grey"><?php echo $feedback['name']; ?></font> Email: <font size="3" color="grey">Hidden</font>
<br />
<font size="5" color="red"><div align="left"><?php echo $feedback['post']; ?></font></div></div>
<br><div class="divider2"> </div><br>
<?php } ?>
</body>
</html>
This displays all the comments that are stored in the comments table in MySQL.
Each comment has a "cast" field which holds the ID of the podcast.
How can I get this to reflect the page being viewed?
For example:
If I'm viewing episode.php?id=1, then I want the comments with a "cast" value of "1" to be displayed and not those with a "cast" value of "2". The same goes for episode.php?id=2, and so on.
Can someone please guide me on how to do this?
Thank you.
Kev
So I had a little play around with this after trying Timo Dörsching's suggestion, which did not work.
I hit undo to get back to my original code and changed this:
<?php foreach ($articles as $feedback) { ?>
<div class="comment" align="center">Name: <font size="3" color="grey"><?php echo $feedback['name']; ?></font> Email: <font size="3" color="grey">Hidden</font>
<br />
<font size="5" color="red"><div align="left"><?php echo $feedback['post']; ?></font></div></div>
<br><div class="divider2"> </div><br>
<?php } ?>
to this:
<?php foreach ($articles as $feedback) {
if ($feedback['cast'] === $_GET['id']) { ?>
<div class="comment" align="center">Name: <font size="3" color="grey"><?php echo $feedback['name']; ?></font> Email: <font size="3" color="grey">Hidden</font>
<br />
<font size="5" color="red"><div align="left"><?php echo $feedback['post']; ?></font></div></div>
<br><div class="divider2"> </div><br>
<?php } } ?>
This does the job perfectly.
Perhaps you have to filter $_GET['id'] and $cast, but it is easy to google how to do this.
Have a look at: How can I prevent SQL injection in PHP?
$cast = $_GET['id'];
class feedback {
public function fetch_all($cast){
global $pdo;
// Bind the value instead of concatenating it, so it cannot inject SQL.
// `cast` is the column named in the question; use id or whatever yours is called.
$query = $pdo->prepare("SELECT * FROM comments WHERE `cast` = ?");
$query->execute(array($cast));
return $query->fetchAll();
} }
$feedback = new feedback;
$articles = $feedback->fetch_all($cast);
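On the filtering point: since the id is expected to be a positive whole number, one option is to validate it before it gets anywhere near the query or the comparison. A small sketch, where podcast_id() is a made-up helper name:

```php
<?php
// Hypothetical helper: returns the id as an int, or null if it is not a positive integer.
function podcast_id($raw): ?int
{
    $id = filter_var($raw, FILTER_VALIDATE_INT, array('options' => array('min_range' => 1)));
    return $id === false ? null : $id;
}

var_dump(podcast_id('2'));        // int(2)
var_dump(podcast_id('2 OR 1=1')); // NULL
var_dump(podcast_id(null));       // NULL
```

On episode.php you could call it as podcast_id(isset($_GET['id']) ? $_GET['id'] : null) and show no comments at all when it returns null.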
I've got a new online store written in PHP and MySQL.
<div class="content-area">
<div class="page-heading">
<h1>Store</h1>
</div>
<p style="padding-top: 5px;"><strong>You are here:</strong> Home » Store</p>
<table border="0" cellpadding="0" cellspacing="0" width="500">
<?php
$categories=mysql_query("SELECT * FROM categories WHERE parent='0' ORDER by owner ASC, title ASC");
while($categoriesRow=mysql_fetch_array($categories)) {
$categoriesSub=mysql_query("SELECT * FROM categories WHERE parent='$categoriesRow[id]'");
?>
<tr>
<td valign="top">
<div class="product_list">
<div class="image_product">
<img alt="<?php echo $categoriesRow['title']; ?>" src="<?php echo $cls->truska(true); ?>/theme_section_image.gif" border="0" style="vertical-align: middle;" />
</div>
<div>
<h3 class="product"><?php echo $categoriesRow['title']; ?> <?php if(mysql_num_rows($categoriesSub) > 0) { ?>(<?php while($categoriesSubRow=mysql_fetch_array($categoriesSub)) { }?>)<?php } ?></h3>
</div>
</div>
</td>
</tr>
<tr>
<td class="dotted_line_blue" colspan="1">
<img src="<?php echo $cls->truska(true); ?>/theme_shim.gif" height="1" width="1" alt=" " />
</td>
</tr>
<?php
}
?>
</table>
</div>
Where my second while loop is, I need to work out which result is the last one, so I can omit the comma after it.
At the moment it shows as 'Lego (Lego City, Lego Starwars,)' but I want it to show as 'Lego (Lego City, Lego Starwars)'.
How can I tell whether the current result is the last?
You can fix this by coming at it from the other direction.
Instead of appending a comma after each result except the last, try pre-pending a comma on every result except the first.
Set up a variable called $first outside your loop, and set it to 1. Inside the loop:
if ($first == 0) {
echo ",";
} else {
$first = 0;
}
Don't add the comma if it is the first result, and add it before each of the following ones.
Just build your array of results and implode it. This takes care of any counting automatically:
$comma_separated = implode(",", $array);
You shouldn't use plain mysql access, look at PDO.
Answering your question, try something like this:
$items = array();
while ($row = mysql_fetch_assoc($result)) {
$items[] = $row['foo'];
}
echo implode(', ', $items);
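Applied to the category list from the question, the implode() approach might look like this. The nested array stands in for the two mysql_query() result sets and category_label() is a made-up helper, so treat it as a sketch:

```php
<?php
// Stand-in data: parent title => titles of its sub-categories.
$categories = array(
    'Lego'    => array('Lego City', 'Lego Starwars'),
    'Puzzles' => array(),
);

// Hypothetical helper: "Title (Sub 1, Sub 2)", or just "Title" when there are no subs.
function category_label(string $title, array $subTitles): string
{
    if (count($subTitles) === 0) {
        return $title;
    }
    return $title . ' (' . implode(', ', $subTitles) . ')';
}

foreach ($categories as $title => $subTitles) {
    echo category_label($title, $subTitles), "\n";
}
// Lego (Lego City, Lego Starwars)
// Puzzles
```

Collecting the sub-category titles first also makes the mysql_num_rows() check unnecessary, because an empty array simply produces no brackets.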
I have a WYSIWYG editor on a site. The problem is that users are copying and pasting a lot of data into it, leaving a lot of unclosed and improperly formatted div tags that are breaking the site layout.
Is there an easy way to strip all occurrences of <div> and </div>?
str_replace won't work because some of the divs have styling and other attributes in them, so it would need to account for <div style="some styling">, <div align="center">, etc.
I'm guessing this could be done with a regular expression, but I am a total beginner when it comes to those.
It is better to use a DOM parser for HTML, but if you have no choice but to use a regex, you can do it like this:
$patterns = array();
$patterns[0] = '/<div[^>]*>/';
$patterns[1] = '/<\/div>/';
$replacements = array();
$replacements[0] = '';
$replacements[1] = '';
echo preg_replace($patterns, $replacements, $html);
No. You do NOT ever parse/manipulate HTML with regexes.
Regexes cannot be bargained with. They can't be reasoned with. They don't understand HTML, they don't grok XML. And they absolutely will NOT stop until your DOM tree is dead.
You use htmlpurifier and/or DOM to manipulate the tree.
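A minimal sketch of the DOM route, using only PHP's bundled DOMDocument (HTML Purifier is not used here). unwrap_divs() is a name I made up; it lets the parser repair unclosed tags, then removes every div while keeping whatever was inside it:

```php
<?php
// Hypothetical helper: strip all <div> tags from an HTML fragment, keeping their contents.
function unwrap_divs(string $fragment): string
{
    $doc = new DOMDocument();
    // The meta tag tells libxml the input is UTF-8; @ hides warnings about broken markup.
    @$doc->loadHTML('<meta http-equiv="Content-Type" content="text/html; charset=utf-8">' . $fragment);

    $body = $doc->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return '';
    }

    $xpath = new DOMXPath($doc);
    // The XPath result is a snapshot, so the tree can be changed while looping over it.
    foreach ($xpath->query('//div') as $div) {
        // Move each child out in front of the div, then drop the emptied div.
        while ($div->firstChild) {
            $div->parentNode->insertBefore($div->firstChild, $div);
        }
        $div->parentNode->removeChild($div);
    }

    // Serialize only what ended up inside <body>, not the html/head wrapper.
    $out = '';
    foreach ($body->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

// Two divs with attributes, neither of them closed.
echo unwrap_divs('<div style="color:red"><p>one</p><div align="center">two');
```

The output is re-serialized by the parser, so apart from the missing divs it can differ slightly from the input in whitespace and in how other sloppy tags were repaired.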
Here's a simplified example of how you could do it with PHP:
<?php
/**
* Removes the divs because why not
*/
function strip_divs(&$text, $id = 'html') {
$replacements = array();
worker($text, $replacements, $id);
foreach ($replacements as $key => $val) {
$text = mb_str_replace($key, $val, $text);
}
return $text;
}
function worker(&$body, &$replacements, $id) {
static $call_count;
if (empty($call_count)) {
$call_count = array();
}
if (empty($call_count[$id])) {
$call_count[$id] = 0;
}
if (mb_strpos($body, '</div>') !== FALSE) {
$body = mb_str_replace('</div>', '', $body);
}
if (mb_strpos($body, '<di') !== FALSE) {
$call_count[$id] ++;
// Gets the important junk
$rm = '<di' . xml_get($body, '<di', '>') . '>';
// Builds the replacements HTML
$replacement_html = '';
$next_id = count($replacements);
$replacement_id = "[[div-$next_id]]";
$replacements[$replacement_id] = $replacement_html;
$body = mb_str_replace($rm, $replacement_id, $body);
if (mb_strpos($body, '<di') !== FALSE && $call_count[$id] < 200) {
worker($body, $replacements, $id);
}
}
}
/**
* Returns text by specifying a start and end point
*
* @param str $str
* The text to search
* @param str $start
* The beginning identifier
* @param str $end
* The ending identifier
*/
function xml_get($str, $start, $end) {
$str = "|" . $str . "|";
$len = mb_strlen($start);
if (mb_strpos($str, $start) > 0) {
$int_start = mb_strpos($str, $start) + $len;
$temp = right($str, (mb_strlen($str) - $int_start));
$int_end = mb_strpos($temp, $end);
$return = trim(left($temp, $int_end));
return $return;
}
else {
return FALSE;
}
}
function right($str, $count) {
return mb_substr($str, ($count * -1));
}
function left($str, $count) {
return mb_substr($str, 0, $count);
}
/**
* Multibyte str replace
*/
if (!function_exists('mb_str_replace')) {
function mb_str_replace($search, $replace, $subject, &$count = 0) {
if (!is_array($subject)) {
$searches = is_array($search) ? array_values($search) : array($search);
$replacements = is_array($replace) ? array_values($replace) : array($replace);
$replacements = array_pad($replacements, count($searches), '');
foreach ($searches as $key => $search) {
$parts = mb_split(preg_quote($search), $subject);
$count += count($parts) - 1;
$subject = implode($replacements[$key], $parts);
}
}
else {
foreach ($subject as $key => $value) {
$subject[$key] = mb_str_replace($search, $replace, $value, $count);
}
}
return $subject;
}
}
$html = <<<HTML
<table>
<tbody>
<tr>
<td class="votecell">
<div class="vote">
<input type="hidden" name="_id_" value="9607101">
<a class="vote-up-off" title="This question shows research effort; it is useful and clear">up vote</a>
<span itemprop="upvoteCount" class="vote-count-post ">0</span>
<a class="vote-down-off" title="This question does not show any research effort; it is unclear or not useful">down vote</a>
<a class="star-off" href="#">favorite</a>
<div class="favoritecount"><b></b></div>
</div>
</td>
<td class="postcell">
<div>
<div class="post-text" itemprop="text">
<p>I have a wysiwyg on a site. The problem is that the users are copy pasting a lot of data in to it leaving a lot of unclosed and improperly formatted div tags that are breaking the site layout. </p>
<p>Is there an easy an easy way to strip all occurrences of <code><div></code> and <code></div></code>?</p>
<p>str_replace won't work because some of the divs have styling and other things in them so it would need to account for <code><div style="some styling"> <div align="center"></code> etc</p>
<p>I'm guessing this could be done with a regular expression but I am total a total beginner when it comes to those. </p>
<p>Thanks a lot,
Martin
</p>
</div>
<div class="post-taglist">
php regex replace str-replace strip-tags
</div>
<table class="fw">
<tbody>
<tr>
<td class="vt">
<div class="post-menu">share<span class="lsep">|</span>improve this question</div>
</td>
<td align="right" class="post-signature">
<div class="user-info ">
<div class="user-action-time">
edited <span title="2012-03-07 18:32:29Z" class="relativetime">Mar 7 '12 at 18:32</span>
</div>
<div class="user-gravatar32">
</div>
<div class="user-details">
<div class="-flair">
</div>
</div>
</div>
</td>
<td class="post-signature owner">
<div class="user-info ">
<div class="user-action-time">
asked <span title="2012-03-07 18:31:11Z" class="relativetime">Mar 7 '12 at 18:31</span>
</div>
<div class="user-gravatar32">
<a href="/users/702826/martin-hunt">
<div class="gravatar-wrapper-32"><img src="https://www.gravatar.com/avatar/a578c3eae229c86dbe46d4b1603e071b?s=32&d=identicon&r=PG" alt="" width="32" height="32"></div>
</a>
</div>
<div class="user-details">
Martin Hunt
<div class="-flair">
<span class="reputation-score" title="reputation score " dir="ltr">313</span><span title="7 silver badges"><span class="badge2"></span><span class="badgecount">7</span></span><span title="20 bronze badges"><span class="badge3"></span><span class="badgecount">20</span></span>
</div>
</div>
</div>
</td>
</tr>
</tbody>
</table>
</div>
</td>
</tr>
<tr>
<td class="votecell"></td>
<td>
<div id="comments-9607101" class="comments ">
<table>
<tbody data-remaining-comments-count="0" data-canpost="false" data-cansee="true" data-comments-unavailable="false" data-addlink-disabled="true">
<tr id="comment-12187969" class="comment ">
<td class="comment-actions">
<table>
<tbody>
<tr>
<td class=" comment-score">
<span title="number of 'useful comment' votes received" class="cool">1</span>
</td>
<td>
</td>
</tr>
</tbody>
</table>
</td>
<td class="comment-text">
<div style="display: block;" class="comment-body">
<span class="comment-copy">So you need to remove all the div tags but not the content between the div. Am I right?</span>
– Siva Charan
<span class="comment-date" dir="ltr"><a class="comment-link" href="#comment12187969_9607101"><span title="2012-03-07 18:34:11Z" class="relativetime-clean">Mar 7 '12 at 18:34</span></a></span>
</div>
</td>
</tr>
<tr id="comment-12189778" class="comment ">
<td>
<table>
<tbody>
<tr>
<td class=" comment-score">
</td>
<td>
</td>
</tr>
</tbody>
</table>
</td>
<td class="comment-text">
<div style="display: block;" class="comment-body">
<span class="comment-copy">Replace the XPath with <code>//div[not(@*)]</code> to remove all div elements (incl. content) without attributes.</span>
– Gordon
<span class="comment-date" dir="ltr"><a class="comment-link" href="#comment12189778_9607101"><span title="2012-03-07 19:58:21Z" class="relativetime-clean">Mar 7 '12 at 19:58</span></a></span>
<span class="edited-yes" title="this comment was edited 2 times"></span>
</div>
</td>
</tr>
</tbody>
</table>
</div>
<div id="comments-link-9607101" data-rep="50" data-anon="true">
<a class="js-add-link comments-link disabled-link " title="Use comments to ask for more information or suggest improvements. Avoid answering questions in comments.">add a comment</a><span class="js-link-separator dno"> | </span>
<a class="js-show-link comments-link dno" title="expand to show all comments on this post" href="#" onclick=""></a>
</div>
</td>
</tr>
</tbody>
</table>
HTML;
echo strip_divs($html);