I am working on a project and have run into a problem.
The web page I want to scrape takes about 5-10 seconds to load because it contains a large amount of data.
When I try to scrape it with PHP Simple HTML DOM Parser, I get no results.
The screen is blank. The selectors I use are correct, because when I load another page from the same website, whose first few lines of code are exactly the same, everything works.
Is there any way to wait for the website to finish loading and then scrape it?
Thanks
My code is:
<!DOCTYPE html>
<html>
<head>
<title>Test</title>
</head>
<body>
<?php
error_reporting(0);
include_once('../../simple_html_dom.php');
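function scraping_slashdot() {
    $ret = array();
    // create HTML DOM; file_get_html() returns false when the page cannot be loaded
    $html = file_get_html('http://www.examplepage.com/');
    if (!$html) {
        return $ret;
    }
    // get article block
    foreach($html->find('div[id="rightBlock"]') as $article) {
        // get title1 (skip blocks that have no matching element)
        $title = $article->find('div[class="inputHead"]', 0);
        if ($title) {
            $item['title1'] = $title->plaintext;
            $ret[] = $item;
        }
    }
    // clean up memory
    $html->clear();
    unset($html);
    return $ret;
}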
// -----------------------------------------------------------------------------
//output
$ret = scraping_slashdot();
foreach($ret as $v) {
echo $v['title1'];
}
?>
</body>
</html>
Have you tried using jQuery? You can run a function once the page has loaded by adding:
$(document).ready()
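Keep in mind that $(document).ready() runs in a browser, while the scraper runs on the server and never executes the page's JavaScript. If the delay is only a slow or large response, a sketch along these lines may help, assuming the data is present in the raw HTML. The helper names fetch_html and extract_titles and the 30-second timeout are illustrative assumptions, and PHP's built-in DOMDocument is swapped in for Simple HTML DOM so the example has no dependencies:

```php
<?php
// Hypothetical helper: download a slow page with an explicit timeout.
// Returns null when the request fails or times out.
function fetch_html(string $url, int $timeoutSeconds = 30): ?string
{
    $context = stream_context_create([
        'http' => ['timeout' => $timeoutSeconds],
    ]);
    $body = @file_get_contents($url, false, $context);
    return $body === false ? null : $body;
}

// Hypothetical helper: text of the first div.inputHead inside each div#rightBlock.
function extract_titles(string $html): array
{
    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // real-world HTML is rarely valid
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $titles = [];
    foreach ($xpath->query('//div[@id="rightBlock"]') as $block) {
        $first = $xpath->query('.//div[@class="inputHead"]', $block)->item(0);
        if ($first !== null) {
            $titles[] = trim($first->textContent);
        }
    }
    return $titles;
}
```

Typical use would be to call fetch_html() first, report an error when it returns null, and only then pass the body to extract_titles(). It may also be worth checking whether your copy of Simple HTML DOM rejects large documents: some versions make file_get_html() return false when the page exceeds the library's MAX_FILE_SIZE constant.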
I'm studying Simple HTML DOM.
As mentioned in its documentation, if we want to retrieve headers from a website, we would proceed as follows:
<?php
include('simple_html_dom.php');
$html = file_get_html('https://www.w3schools.com/');
// find the h2 headers on a web page
$headlines = array();
foreach($html->find('h2') as $header) {
$headlines[] = $header->plaintext;
}
print_r($headlines);
?>
When I test this sample on my local server, it prints only:
array ()
If I have understood correctly, it should print:
Python, PHP, Java, etc.: everything that is inside an <h2> tag.
Am I missing something?
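An empty array can mean either that the download failed or that the raw HTML contains no matching elements. One way to tell the two apart is to test the parsing step on a known string first. This is a minimal sketch using PHP's built-in DOMDocument rather than Simple HTML DOM; the function name list_headings and the sample markup are assumptions made up for illustration:

```php
<?php
// Hypothetical helper: collect the text of every element with the given tag name.
function list_headings(string $html, string $tagName = 'h2'): array
{
    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // silence warnings about invalid markup
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $headings = [];
    foreach ($dom->getElementsByTagName($tagName) as $node) {
        $headings[] = trim($node->textContent);
    }
    return $headings;
}

// Made-up sample markup; a real page would be downloaded first.
$sample = '<html><body><h1>Tutorials</h1><h2>Python</h2><h2>PHP</h2></body></html>';
$found = list_headings($sample);
```

If the parser finds the headings in the sample but not in the downloaded page, the next thing to inspect is what the download actually returned.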
I have the following PHP code that produces HTML
(a simplified function; the real one follows the same idea but is longer, with loops and so on):
<?php
function show_doc_html() {
$text_to_title = "some text from db";
?>
<html>
<head>
<title>
<?php echo $text_to_title ?>
</title>
</head>
<body>
</body>
</html>
<?php
}
I would like to return a PDF to the user without changing too much of this code. We are working under Drupal, so we have a function that takes an HTML string and converts it to PDF, but the function above doesn't return anything; it only prints to stdout. Is it possible? Or should I rebuild the old function to return a string?
Is that what you need?
I have used it in my project. You can just write your markup and inline css in node template using view_mode = 'PDF'
The easiest way was to wrap my function call in "ob_start()", get all the text using "ob_get_contents()", and then convert it to PDF.
Something like:
function show_doc_pdf() {
    ob_start();
    show_doc_html();
    $html_var = ob_get_contents();
    ob_end_clean();
    // use the wkhtmltopdf API on $html_var to return a PDF
}
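The same output-buffering idea can be wrapped in a small reusable helper. This is only a sketch: capture_output and show_doc_html_demo are hypothetical names, and the demo function stands in for the real show_doc_html():

```php
<?php
// Hypothetical helper: run a callable that prints and return what it printed.
function capture_output(callable $printer): string
{
    ob_start();
    try {
        $printer();
        return ob_get_contents();
    } finally {
        // Always discard the buffer, even if $printer throws.
        ob_end_clean();
    }
}

// Stand-in for a function that echoes HTML instead of returning it.
function show_doc_html_demo(): void
{
    echo '<html><body>some text from db</body></html>';
}

$html_var = capture_output('show_doc_html_demo');
```

The try/finally keeps the buffer from leaking if the wrapped function throws, which the shorter version above does not guard against.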
For a college project, I am creating a website with some back-end algorithms, and to test these in a demo environment I need a lot of fake data. To get this data I intend to scrape some sites. One of these sites is freelancer.com. To extract the data I am using the Simple HTML DOM Parser, but so far I have been unsuccessful in my efforts to actually get the data I need.
Here is an example of the HTML layout of the page I intend to scrape. The red boxes mark the required data.
Here is the code I have written so far after following some tutorials.
<?php
include "simple_html_dom.php";
// Create DOM from URL
$html = file_get_html('http://www.freelancer.com/jobs/Website-Design/1/');
//Get all data inside the <tr> of <table id="project_table">
foreach($html->find('table[id=project_table] tr') as $tr) {
    foreach($tr->find('td[class=title-col]') as $t) {
        // get the outer HTML of the cell
        $data = $t->outertext;
        echo $data;
    }
}
?>
Hopefully someone can point me in the right direction as to how I can get this working.
Thanks.
The raw source code is different from what you see in the browser; that's why you're not getting the expected results.
You can check the raw source code using Ctrl+U. The data is in table[id=project_table_static], and the td cells have no attributes, so here is working code to get all the URLs from the table:
$url = 'http://www.freelancer.com/jobs/Website-Design/1/';
// Create DOM from URL
$html = file_get_html($url);
// Get every <tr> of <table id="project_table_static">
foreach($html->find('table#project_table_static tbody tr') as $i=>$tr) {
    // Skip the first empty element
    if ($i==0) {
        continue;
    }
    echo "<br/>\$i=".$i;
    // get the first anchor
    $anchor = $tr->find('a', 0);
    echo " => ".$anchor->href;
}
// Clear dom object
$html->clear();
unset($html);
Thank you for answering my question so quickly. I did some more digging and ultimately found a solution for grabbing data from a specific div in an external file and inserting it into another document using PHP DOMDocument. Now I'm looking to improve the code by adding an if condition that grabs data from a different div if the one initially requested by getElementById has no data. Here is the code I have so far.
External html as source.
<div id="tab1_header" class="cushycms"><h2>Meeting - 12:00pm to 3:00pm</h2></div>
My PHP file calling from source looks like this.
<?php
$source = "user_data.htm";
$dom = new DOMDocument();
$dom->loadHTMLFile($source);
$dom->preserveWhiteSpace = false;
$tab1_header = $dom->getElementById('tab1_header');
?>
<html>
<head>
<title></title>
</head>
<body>
<div><h2><?php echo $tab1_header->nodeValue; ?></h2></div>
</body>
</html>
The following check will output a message if the div id can't be found, but...
if (!$tab1_header)
{
    die("Element not found");
}
I would like to fetch a different div if the one requested initially has no data. That is, if the source contains <div id="tab1_header"></div>, then grab <div id="alternate"><img src="filler.png" /></div> instead. Can someone help me modify the check above to achieve this result?
Thanks.
Either split up master.php so that div 1 and div 2 are each in their own file, or assign each to a variable, then include master.php and use the appropriate variable:
master.php
$d1='<div id="description1">Some Text</div>';
$d2='<div id="description2">Some Text</div>';
description1.php
include 'master.php';
echo $d1;
You can't do this solely with PHP includes unless you put the divs into separate files. Look into PHP templating; it's probably the best solution for this. Or, since you're new to the language, try using variables:
master.php
$description1 = '<div id="description1">Some Text</div>';
$description2 = '<div id="description2">Some Text</div>';
board1.php
include 'master.php';
echo $description1;
board2.php
include 'master.php';
echo $description2;
Alternatively, you could use JavaScript, but that might get a little messy.
Short answer: although it's possible, taking this approach is probably a very bad idea.
Longer answer: the solution may turn out to be too complicated. If your master.php file contains only HTML markup, you could read the content of that file with the file_get_contents() function and then parse it (e.g. with the DOMDocument library functions). You would have to look for a div with the given id.
<?php
$file = 'master.php'; // path to the file that holds the divs
$doc = new DOMDocument();
$doc->loadHTMLFile($file);
$divs = $doc->getElementsByTagName('div');
foreach ($divs as $div)
{
    if( $div->getAttribute('id') == 'description1' )
    {
        echo $div->nodeValue."\n";
    }
}
?>
If your master.php file also has some dynamic content, you could use the following trick:
<?php
ob_start();
include('master.php');
$sMasterPhpContent = ob_get_clean();
// same as above - parse HTML
?>
Edit:
$tab_header = $dom->getElementById('tab1_header') ? $dom->getElementById('tab1_header') : $dom->getElementById('tab2_header');
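Note that a ternary on getElementById() only falls back when the element is missing, not when it exists but is empty. A minimal sketch that also treats an empty div as "no data" might look like the following; the function name first_filled_element and the inline sample markup are assumptions for illustration:

```php
<?php
// Hypothetical helper: return the first element from $ids that exists
// and contains either non-whitespace text or at least one child element.
function first_filled_element(DOMDocument $dom, array $ids): ?DOMElement
{
    foreach ($ids as $id) {
        $element = $dom->getElementById($id);
        if ($element === null) {
            continue; // id not present in the document
        }
        $hasText = trim($element->textContent) !== '';
        $hasChildElement = $element->getElementsByTagName('*')->length > 0;
        if ($hasText || $hasChildElement) {
            return $element;
        }
    }
    return null;
}

// Sample markup mirroring the question: the first div is empty.
$dom = new DOMDocument();
$dom->loadHTML('<div id="tab1_header"></div><div id="alternate"><img src="filler.png"></div>');

$chosen = first_filled_element($dom, ['tab1_header', 'alternate']);
$markup = $chosen === null ? '' : $dom->saveHTML($chosen);
```

Checking for child elements as well as text matters here, because the alternate div holds only an image and its text content is empty.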
I used @Alex's approach here to remove script tags from an HTML document using the built-in DOMDocument. The problem is that if I have a script tag with JavaScript content and then another script tag that links to an external JavaScript source file, not all script tags are removed from the HTML.
$result = '
<!doctype html>
<html>
<head>
<meta charset="utf-8">
<title>
hey
</title>
<script type="text/javascript" src="http://ajax.googleapis.com/ajax/libs/jquery/1.9.1/jquery.min.js"></script>
<script>
alert("hello");
</script>
</head>
<body>hey</body>
</html>
';
$dom = new DOMDocument();
if($dom->loadHTML($result))
{
$script_tags = $dom->getElementsByTagName('script');
$length = $script_tags->length;
for ($i = 0; $i < $length; $i++) {
if(is_object($script_tags->item($i)->parentNode)) {
$script_tags->item($i)->parentNode->removeChild($script_tags->item($i));
}
}
echo $dom->saveHTML();
}
The above code outputs:
<html>
<head>
<meta charset="utf-8">
<title>hey</title>
<script>
alert("hello");
</script>
</head>
<body>
hey
</body>
</html>
As you can see from the output, only the external script tag was removed. Is there anything I can do to ensure all script tags are removed?
Your error is actually trivial. The DOMNodeList returned by getElementsByTagName() is live: it is automatically updated when the document changes, most notably when matching elements are removed. This is mentioned in a couple of lines in the PHP docs, but is mostly swept under the carpet.
If you loop up to a DOMNodeList's length and remove elements as you go, you'll notice that the length property actually changes and the remaining items shift down, so some elements are skipped! I had to write my own library to counteract this and a few other quirks.
The solution:
if($dom->loadHTML($result))
{
    while (($r = $dom->getElementsByTagName("script")) && $r->length) {
        $r->item(0)->parentNode->removeChild($r->item(0));
    }
    echo $dom->saveHTML();
}
I'm not actually looping - just popping the first element one at a time. The result: http://sebrenauld.co.uk/domremovescript.php
To avoid the surprises of a live node list, which gets shorter as you delete nodes, you could work on a copy made into an array with iterator_to_array:
foreach(iterator_to_array($dom->getElementsByTagName($tag)) as $node) {
    $node->parentNode->removeChild($node);
}
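Put together as a self-contained function, the array-copy approach could be sketched like this. The function name strip_elements and the shortened sample document are assumptions for illustration:

```php
<?php
// Hypothetical helper: remove every element with the given tag name.
function strip_elements(string $html, string $tagName): string
{
    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // tolerate imperfect markup
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Copy the live list into a plain array so removals cannot shift it.
    $nodes = iterator_to_array($dom->getElementsByTagName($tagName));
    foreach ($nodes as $node) {
        $node->parentNode->removeChild($node);
    }
    return $dom->saveHTML();
}

$page = '<html><head><title>hey</title>'
      . '<script src="jquery.min.js"></script>'
      . '<script>alert("hello");</script>'
      . '</head><body>hey</body></html>';
$clean = strip_elements($page, 'script');
```

Because the loop walks a snapshot rather than the live list, both the external and the inline script element are removed in a single pass.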