PHP parsing HTML with DOM

I am trying to get the code below to work so that I understand what I am doing on a larger project. For the most part, I'm just trying to make sure I can grab elements by their tag. The code seems to break at "$html = new simple_html_dom();": if I comment that line out, I get the two printouts, but if it's not commented out, nothing shows up on the screen at all.
<?php
# create and load the HTML
include('simple_html_dom.php');
print "hello ";
$html = new simple_html_dom();
print "world";
#$html->load("<html><body><p>Hello World!</p><p>We're here</p></body></html>");
# get an element representing the second paragraph
#$element = $html->find("p");
# modify it
#$element[1]->innertext .= " and we're here to stay.";
# output it!
#echo $html->save();
?>

Does the simple_html_dom.php file actually exist, and is it being loaded? You might want to write the include like this:
<?php
if (!file_exists('./simple_html_dom.php'))
{
echo 'can not include: simple_html_dom.php';
die();
}
else
{
include('./simple_html_dom.php');
}
?>
That way the script itself tells you if the file is missing.
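A blank page like the one in the question often means a fatal error occurred while error display was switched off. The following is a minimal sketch, assuming the library sits next to the script; `load_library` is a hypothetical helper name and not part of Simple HTML DOM.

```php
<?php
// Show errors during development so a fatal error is not just a blank page.
error_reporting(E_ALL);
ini_set('display_errors', '1');

// Hypothetical helper: include a file only if it is present and readable.
function load_library(string $path): bool
{
    if (!is_readable($path)) {
        return false;
    }
    require_once $path;
    return true;
}

if (!load_library(__DIR__ . '/simple_html_dom.php')) {
    echo 'cannot include: simple_html_dom.php';
}
```

With errors displayed, a failure inside the library itself would be reported with a file name and line number instead of an empty screen.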

Related

extracting h2 header from website using simplehtmldom

I'm studying Simple HTML DOM.
As mentioned in its documentation, if we want to retrieve headers from a website, we would proceed as follows:
<?php
include('simple_html_dom.php');
$html = file_get_html('https://www.w3schools.com/');
// find the h2 headers on a webpage
$headlines = array();
foreach($html->find('h2') as $header) {
$headlines[] = $header->plaintext;
}
print_r($headlines);
?>
When I test this sample on my local server, it prints only:
array ()
If I understood correctly, it should print:
Python PHP Java etc., that is, everything inside the <h2> tags.
Am I missing something?
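No answer is recorded for this question. One way to narrow the problem down is to separate fetching from parsing by running the extraction on a known string first. The sketch below uses PHP's built-in DOMDocument rather than Simple HTML DOM; the sample markup and the function name `extract_headings` are assumptions for illustration, and the code does not diagnose why the live page returned nothing.

```php
<?php
// Sample markup standing in for a fetched page (an assumption for illustration).
$page = '<html><body><h1>Title</h1><h2>Python</h2><h2>PHP</h2></body></html>';

// Collect the trimmed text of every element with the given tag name.
function extract_headings(string $markup, string $tag): array
{
    $dom = new DOMDocument();
    // Collect parser warnings internally; real-world HTML tends to trigger them.
    libxml_use_internal_errors(true);
    $dom->loadHTML($markup);
    libxml_clear_errors();

    $headings = array();
    foreach ($dom->getElementsByTagName($tag) as $node) {
        $headings[] = trim($node->textContent);
    }
    return $headings;
}

print_r(extract_headings($page, 'h2'));
```

If this works on the sample but not on the downloaded page, dumping the downloaded markup and searching it for "<h2" would show whether the headers are present in what the server actually returned.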

Is it possible to change original html text in php?

I am trying to make a "manner friendly" website. We use different declensions depending on gender and other factors. For example:
You did = robili
It did = robilo
She did = robila
Linguistically this is a very simplified (and unlucky) example! I would like to change the HTML text in a PHP file where appropriate. For example:
<?php
something
?>
html text of the page and somewhere is the word "robil"
<div>we tried to robil^i|o|a^</div>
<?php something ?>
Now I would like to find all occurrences of tokens of the form ^characters|characters|characters^ and replace each one with one of its internal values according to "gender".
It is easy in JavaScript on the client side, but then the visitor sees all this weird "tokenizing" before JavaScript replaces it.
I do not know an elegant solution here.
Or do you have a better idea?
Thanks for any advice.
You can add these scripts before and after the HTML:
<?php
// start output buffering
ob_start();
?>
<html>
<body>
html text of the page and somewhere is the word "robil"
<div>we tried to robil^i|o|a^, but also vital^si|sa|ste^, borko^mal|mala|malo^ </div>
</body>
</html>
<?php
$use = 1; // indicate which declension to use (0, 1 or 2)
// get buffered html
$html = ob_get_contents();
ob_end_clean();
// match any run of 3 to 20 characters between '^' delimiters that contains no control character or '^'
if (preg_match_all('/\^[^[:cntrl:]\^]{3,20}\^/',$html,$matches))
{
// replace all
foreach (array_unique($matches[0]) as $match)
{
$choices = explode('|',trim($match,'^'));
$html = str_replace($match,$choices[$use],$html);
}
}
echo $html;
This returns:
html text of the page and somewhere is the word "robil" we tried to
robilo, but also vitalsa, borkomala
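The same replacement can also be done in a single pass with preg_replace_callback(), which avoids the separate match and str_replace() steps. This is a sketch rather than the answer's method: the function name `apply_declension` is made up, and the pattern assumes each token holds at least two non-empty alternatives.

```php
<?php
// Replace every ^a|b|c^ token with the alternative at index $use.
function apply_declension(string $html, int $use): string
{
    // Match ^a|b|c^: two or more non-empty alternatives separated by '|'.
    return preg_replace_callback(
        '/\^([^\^|]+(?:\|[^\^|]+)+)\^/',
        function (array $m) use ($use) {
            $choices = explode('|', $m[1]);
            // Leave the token untouched if the requested index is missing.
            return isset($choices[$use]) ? $choices[$use] : $m[0];
        },
        $html
    );
}

echo apply_declension('we tried to robil^i|o|a^, but also vital^si|sa|ste^', 1);
```

It can be combined with the output buffering shown above by passing the result of ob_get_contents() as the first argument.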

parsing html page using php to find out text on which link is assigned

Say I have HTML code like this:
$html = "This is some stuff right here. OH MY GOSH";
I am trying to get the href value and also the anchor text, i.e. the "check this out" text. I am able to get the href value with the following code:
foreach ($displaybody->find('a') as $element) {
echo $element;
}
It works for me, but how do I get the "check this out" text? I searched but could not find an answer. Thanks in advance.
My actual HTML looks like this:
» Download MP4 « - <b>144p (Video Only)</b> - <span> 19.1</span> MB<br />
The code above returns "download mp4", and I want "download mp4 144p (video only) 19.1 mb". How do I do that?
If you are using Simple HTML DOM, then ->innertext works on the anchor elements you have found:
include 'simple_html_dom.php';
$html = "This is some stuff right here. OH MY GOSH";
$displaybody = str_get_html($html);
foreach($displaybody->find('a ') as $element) {
echo $element->innertext . '<br/>';
}
If you are referring to PHP's DOMDocument, there is no find() method. Use ->getElementsByTagName() to target the anchor elements, then read ->nodeValue on each one:
$html = "This is some stuff right here. OH MY GOSH";
$dom = new DOMDocument();
$dom->loadHTML($html);
foreach($dom->getElementsByTagName('a') as $element) {
echo $element->nodeValue . '<br/>';
}
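To read both the href and the link text with DOMDocument, getAttribute() and textContent can be combined. In the sketch below the markup is hypothetical, because the original anchor tag was not preserved in the question; the URL is a placeholder and `list_links` is a made-up function name.

```php
<?php
// Hypothetical markup: the original anchor tag was not preserved, so the
// href below is only a placeholder.
$html = 'This is some stuff. <a href="http://example.com/page">check this out</a> OH MY GOSH';

// Return one array per anchor with its href attribute and visible text.
function list_links(string $markup): array
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true);
    $dom->loadHTML($markup);
    libxml_clear_errors();

    $links = array();
    foreach ($dom->getElementsByTagName('a') as $a) {
        $links[] = array(
            'href' => $a->getAttribute('href'),
            'text' => trim($a->textContent),
        );
    }
    return $links;
}

print_r(list_links($html));
```

Text that follows the anchor as sibling nodes (such as the size and quality labels in the question) is not part of the anchor, so it would have to be read from the parent element instead.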

Extracting data from HTML using Simple HTML DOM Parser

For a college project, I am creating a website with some back-end algorithms, and to test these in a demo environment I require a lot of fake data. To get this data I intend to scrape some sites. One of these sites is freelancer.com. To extract the data I am using the Simple HTML DOM Parser, but so far I have been unsuccessful in my efforts to actually get the data I need.
Here is an example of the HTML layout of the page I intend to scrape. The red boxes mark the required data.
Here is the code I have written so far after following some tutorials.
<?php
include "simple_html_dom.php";
// Create DOM from URL
$html = file_get_html('http://www.freelancer.com/jobs/Website-Design/1/');
//Get all data inside the <tr> of <table id="project_table">
foreach($html->find('table[id=project_table] tr') as $tr) {
foreach($tr->find('td[class=title-col]') as $t) {
//get the outer HTML
$data = $t->outertext;
echo $data;
}
}
?>
Hopefully someone can point me in the right direction as to how I can get this working.
Thanks.
The raw source code is different from what you expect, which is why you're not getting the expected results.
You can check the raw source with Ctrl+U: the data is in table[id=project_table_static], and the td cells have no attributes. Here is working code to get all the URLs from the table:
$url = 'http://www.freelancer.com/jobs/Website-Design/1/';
// Create DOM from URL
$html = file_get_html($url);
//Get all <tr> rows of <table id="project_table_static">
foreach($html->find('table#project_table_static tbody tr') as $i=>$tr) {
// Skip the first empty element
if ($i==0) {
continue;
}
echo "<br/>\$i=".$i;
// get the first anchor
$anchor = $tr->find('a', 0);
echo " => ".$anchor->href;
}
// Clear dom object
$html->clear();
unset($html);
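The same row-by-row extraction can be sketched with PHP's built-in DOMDocument and DOMXPath. The sample table below is made up and the real page markup may differ; `first_links` is a hypothetical function name, and the table id is interpolated into the query on the assumption that it is a trusted constant.

```php
<?php
// Made-up sample table; the real page markup may differ.
$sample = '<table id="project_table_static"><tbody>'
    . '<tr><td></td></tr>'
    . '<tr><td><a href="/projects/one">One</a></td></tr>'
    . '<tr><td><a href="/projects/two">Two</a></td></tr>'
    . '</tbody></table>';

// Return the href of the first anchor in each row of the given table.
function first_links(string $markup, string $tableId): array
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true);
    $dom->loadHTML($markup);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    $hrefs = array();
    // Rows of the table with the given id, wherever the table sits.
    foreach ($xpath->query("//table[@id='$tableId']//tr") as $tr) {
        // First anchor inside this row, if any.
        $anchor = $xpath->query('.//a', $tr)->item(0);
        if ($anchor !== null) {
            $hrefs[] = $anchor->getAttribute('href');
        }
    }
    return $hrefs;
}

print_r(first_links($sample, 'project_table_static'));
```

Checking the anchor for null means rows without a link are skipped rather than relying on the position of the empty row.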

How to code PHP function that displays a specific div from external file if div called by getElementById has no value?

Thank you for answering my question so quickly. I did some more digging and ultimately found a solution for grabbing data from a specific div in an external file and posting it into another document using PHP DOMDocument. Now I'm looking to improve the code by adding an if condition that will grab data from a different div if the one initially requested by getElementById has no data. Here is the code I have so far.
External html as source.
<div id="tab1_header" class="cushycms"><h2>Meeting - 12:00pm to 3:00pm</h2></div>
My PHP file calling from source looks like this.
<?php
$source = "user_data.htm";
$dom = new DOMDocument();
$dom->loadHTMLFile($source);
$dom->preserveWhiteSpace = false;
$tab1_header = $dom->getElementById('tab1_header');
?>
<html>
<head>
<title></title>
</head>
<body>
<div><h2><?php echo $tab1_header->nodeValue; ?></h2></div>
</body>
</html>
The following check will output a message if a div id can't be found, but...
if (!$tab1_header)
{
die("Element not found");
}
I would like to use a different div if the one initially requested has no data. That is, if <div id="tab1_header"></div> is empty, then grab <div id="alternate"><img src="filler.png" /></div>. Can someone help me modify the check above to achieve this result?
Thanks.
Either split up master.php so that div 1 and div 2 are each in their own file, or assign each to a variable, then include master.php and use the appropriate variable.
master.php
$d1='<div id="description1">Some Text</div>';
$d2='<div id="description2">Some Text</div>';
description1.php
include 'master.php';
echo $d1;
You can't do this solely with PHP includes unless you put the divs into separate files. Look into PHP templating; it's probably the best solution for this. Or, since you're new to the language, try using variables:
master.php
$description1 = '<div id="description1">Some Text</div>';
$description2 = '<div id="description2">Some Text</div>';
board1.php
include 'master.php';
echo $description1;
board2.php
include 'master.php';
echo $description2;
Alternatively, you could use JavaScript, but that might get a little messy.
Short answer: although it's possible, taking this approach is probably a very bad idea.
Longer answer: the solution may turn out to be too complicated. If your master.php file contains only HTML markup, you could read the content of that file with the file_get_contents() function and then parse it (e.g. with the DOMDocument library functions). You would have to look for a div with the given id.
<?php
$file = 'master.php'; // file containing the HTML markup
$doc = new DOMDocument();
$doc->loadHTMLFile($file);
$divs = $doc->getElementsByTagName('div');
foreach ($divs as $div)
{
if( $div->getAttribute('id') == 'description1' )
{
echo $div->nodeValue."\n";
}
}
?>
If your master.php file also has some dynamic content, you could use the following trick:
<?php
ob_start();
include('master.php');
$sMasterPhpContent = ob_get_clean();
// same as above - parse HTML
?>
Edit:
$tab_header = $dom->getElementById('tab1_header') ? $dom->getElementById('tab1_header') : $dom->getElementById('tab2_header');
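The ternary above falls back only when the first element is not found at all. To also fall back when the element exists but is empty, its content has to be inspected. The following is a sketch under that reading of the question; `inner_html` and `content_or_fallback` are hypothetical helper names, and the ids are the ones used in the question.

```php
<?php
// Return the inner HTML of a node by serializing each of its children.
function inner_html(DOMNode $node): string
{
    $out = '';
    foreach ($node->childNodes as $child) {
        $out .= $node->ownerDocument->saveHTML($child);
    }
    return $out;
}

// Hypothetical helper: use $primaryId unless it is missing or has no content.
function content_or_fallback(DOMDocument $dom, string $primaryId, string $fallbackId): string
{
    $primary = $dom->getElementById($primaryId);
    if ($primary !== null && trim(inner_html($primary)) !== '') {
        return inner_html($primary);
    }
    $fallback = $dom->getElementById($fallbackId);
    return $fallback !== null ? inner_html($fallback) : '';
}

$dom = new DOMDocument();
$dom->loadHTML('<div id="tab1_header"></div><div id="alternate"><img src="filler.png"></div>');
echo content_or_fallback($dom, 'tab1_header', 'alternate');
```

Returning inner HTML rather than nodeValue matters for the fallback case, because an img element has no text content of its own.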
