For a college project, I am building a website with some back-end algorithms, and to test these in a demo environment I need a lot of fake data. To get this data I intend to scrape some sites, one of which is freelancer.com. To extract the data I am using the Simple HTML DOM Parser, but so far I have been unable to get the data I need.
Here is an example of the HTML layout of the page I intend to scrape. The red boxes mark the required data.
Here is the code I have written so far after following some tutorials.
<?php
include "simple_html_dom.php";
// Create DOM from URL
$html = file_get_html('http://www.freelancer.com/jobs/Website-Design/1/');
// Loop over every <tr> of <table id="project_table">
foreach ($html->find('table[id=project_table] tr') as $tr) {
    foreach ($tr->find('td[class=title-col]') as $t) {
        // Get the outer HTML of the cell
        $data = $t->outertext;
        echo $data;
    }
}
?>
Hopefully someone can point me in the right direction as to how I can get this working.
Thanks.
The raw source code is different from the layout you inspected, which is why you are not getting the expected results.
You can check the raw source with Ctrl+U. The data is in table[id=project_table_static], and the td cells have no attributes, so here is working code that gets all the URLs from the table:
$url = 'http://www.freelancer.com/jobs/Website-Design/1/';
// Create DOM from URL
$html = file_get_html($url);
// Loop over every <tr> of <table id="project_table_static">
foreach ($html->find('table#project_table_static tbody tr') as $i => $tr) {
    // Skip the first empty element
    if ($i == 0) {
        continue;
    }
    echo "<br/>\$i=" . $i;
    // Get the first anchor in the row, if there is one
    $anchor = $tr->find('a', 0);
    if ($anchor) {
        echo " => " . $anchor->href;
    }
}
// Clear DOM object
$html->clear();
unset($html);
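For comparison, here is a minimal sketch of the same row-by-row extraction using PHP's built-in DOMDocument and DOMXPath instead of Simple HTML DOM, so it runs without the extra library. The inline markup is a hypothetical stand-in for the page source (the real table may differ), and the variable names are illustrative only.

```php
<?php
// Hypothetical stand-in for the raw page source; the real markup may differ.
$source = <<<HTML
<table id="project_table_static">
  <tbody>
    <tr><td></td></tr>
    <tr><td><a href="/projects/one.html">One</a></td><td>5</td></tr>
    <tr><td><a href="/projects/two.html">Two</a></td><td>3</td></tr>
  </tbody>
</table>
HTML;

$doc = new DOMDocument();
// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);
$doc->loadHTML($source);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$hrefs = [];
foreach ($xpath->query('//table[@id="project_table_static"]//tr') as $row) {
    // First anchor with an href inside this row, or null if the row has none.
    $anchor = $xpath->query('.//a[@href]', $row)->item(0);
    if ($anchor === null) {
        continue; // covers the empty first row without relying on its index
    }
    $hrefs[] = $anchor->getAttribute('href');
}
print_r($hrefs);
```

Checking for a missing anchor, rather than skipping index 0, keeps the loop from failing if the table ever gains another row without a link.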
I'm studying Simple HTML DOM.
As mentioned in its documentation, to retrieve the headers from a website we would proceed as follows:
<?php
include('simple_html_dom.php');
$html = file_get_html('https://www.w3schools.com/');
// Find the h2 headers on a webpage
$headlines = array();
foreach($html->find('h2') as $header) {
$headlines[] = $header->plaintext;
}
print_r($headlines);
?>
When I test this sample on my local server, it prints only:
array ()
If I have understood correctly, it should print:
Python, PHP, Java, etc., that is, everything inside the <h2> tags.
Am I missing something?
I am trying to get a data field using PHP Simple HTML DOM Parser. I can pull the links, images, etc., but cannot get a certain data attribute.
Example HTML -
<div id="used">
<div id="srpVehicle-1C3CCCEG2FN601809" class="vehicle" data-vin="1C3CCCEG2FN601809">
<div id="srpVehicle-1C3CCCEG2FN601810" class="vehicle" data-vin="1f2CfCEG2FN266778">
</div>
I would like to get all the "data-vin" fields on a site.
Here is my go at it -
$html = file_get_html($url);
foreach($html->find("div[data-vin]", 0) as $vin){
echo $vin."<br>";
}
But it returns the whole page when I echo $vin. How can I access that data-vin field?
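$html->find("div[data-vin]", 0)
returns a single element (the first match) rather than an array, so the foreach is not looping over the tags. Drop the index to get every tag that has a data-vin attribute, and read the attribute instead of echoing the whole tag: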
foreach($html->find("[data-vin]") as $tag){
echo $tag->getAttribute('data-vin')."<br>";
}
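The same idea can be sketched with PHP's built-in DOMDocument and DOMXPath, which need no extra library. This is only an illustration: the markup is adapted from the question (with the divs closed), and the variable names are made up.

```php
<?php
// Markup adapted from the question; closing tags added so it is well-formed.
$source = <<<HTML
<div id="used">
  <div class="vehicle" data-vin="1C3CCCEG2FN601809"></div>
  <div class="vehicle" data-vin="1f2CfCEG2FN266778"></div>
</div>
HTML;

$doc = new DOMDocument();
libxml_use_internal_errors(true); // keep parser warnings out of the output
$doc->loadHTML($source);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$vins = [];
// Match any element that carries a data-vin attribute, whatever its tag name.
foreach ($xpath->query('//*[@data-vin]') as $tag) {
    $vins[] = $tag->getAttribute('data-vin');
}
echo implode("\n", $vins), "\n";
```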
I was recently asked to take part in a project that, for now, aims to parse a chunk of HTML with PHP. Using a website that I have been assigned, I went through Inspect Element to fill in the missing parts of my code. The aim is simply to print (using echo) certain data on localhost, without storing it in a database or anywhere else. Attached are the HTML and PHP code in a few screenshots (I couldn't upload the raw code, I don't know why). Thanks in advance!
Php code:
<?php
include_once('simple_html_dom.php');
$html = new simple_html_dom();
// Website link to scrape
$website = 'https://www1.gsis.gr/webtax3/etak/faces/main.jspx?_adf.ctrl-state=16kjeyshcz_4&_afrLoop=70130840737831';
// Create DOM from URL or file
$html = file_get_html($website, false, null, 0);
//$html = str_get_html('<html><body><div id="pt1:r1:0:t3::db">Hello</div><div class="xx8">Goodbye</div></body></html>');
//$ret = $html->find('.xx8', 0)->plaintext;
if (is_array($html)) {
foreach($html->find('div[class=xx8]')->outertext as $data) {
echo $data->outertext;
}
}
?>
HTML code (via Inspect Element, where "Δε βρέθηκαν γήπεδα", meaning "No plots of land were found", is the custom text of the page I mentioned):
<div id="pt1:r1:1:t3::db" class="xx8"
style="position:relative;width:100%;overflow:hidden" _afrcolcount="30"><table
class="xxb xy3" style="table-layout:fixed;position:relative;width:2097px;"
cellspacing="0" _totalwidth="2097" _selstate="{}" _rowcount="0" _startrow="0">
<colgroup span="30"><col style="width:80px;"><col style="width:110px;"><col
style="width:105px;"><col style="width:105px;"><col style="width:105px;"><col
style="width:75px;"><col style="width:35px;"><col style="width:50px;"><col
style="width:55px;"><col style="width:80px;"><col style="width:65px;"><col
style="width:65px;"><col style="width:55px;"><col style="width:95px;"><col
style="width:65px;"><col style="width:55px;"><col style="width:75px;"><col
style="width:75px;"><col style="width:60px;"><col style="width:60px;"><col
style="width:60px;"><col style="width:60px;"><col style="width:50px;"><col
style="width:50px;"><col style="width:60px;"><col style="width:50px;"><col
style="width:62px;"><col style="width:125px;"><col style="width:55px;"><col
style="width:55px;"></colgroup></table>Δε βρέθηκαν γήπεδα.</div>
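One possible reason the loop above prints nothing, offered as a sketch rather than a confirmed diagnosis: file_get_html() returns a parser object (or false on failure), not an array, so the is_array() check never passes, and find() without an index returns an array, which has no outertext property. The example below uses PHP's built-in DOMDocument and DOMXPath instead of Simple HTML DOM so that it runs without the library, and it works on the small test string from the commented-out str_get_html() line. The variable names are illustrative. Whether the live page serves this markup without JavaScript is a separate question that this does not address.

```php
<?php
// Test string taken from the commented-out str_get_html() line in the question.
$source = '<html><body><div id="pt1:r1:0:t3::db">Hello</div>'
        . '<div class="xx8">Goodbye</div></body></html>';

$doc = new DOMDocument();
libxml_use_internal_errors(true); // collect parser warnings silently
$doc->loadHTML($source);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$texts = [];
// Match divs whose class list contains the whole word "xx8".
$query = '//div[contains(concat(" ", normalize-space(@class), " "), " xx8 ")]';
foreach ($xpath->query($query) as $div) {
    $texts[] = trim($div->textContent);
}
print_r($texts);
```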
I'm totally new to php, and I'm having a hard time changing the src attribute of img tags.
I have a website that pulls a part of a page using Simple Html Dom php, here is the code:
<?php
include_once('simple_html_dom.php');
$html = file_get_html('http://www.tabuademares.com/br/bahia/morro-de-sao-paulo');
foreach($html ->find('img') as $item) {
$item->outertext = '';
}
$html->save();
$elem = $html->find('table[id=tabla_mareas]', 0);
echo $elem;
?>
This code correctly returns the part of the page I want. But when I do this, the img tags come with the src of the original page: /assets/svg/icon_name.svg
What I want to do is change the original src so that it looks like this: http://www.mywebsite.com/wp-content/themes/mytheme/assets/svg/icon_name.svg
That is, I want to put the URL of my site in front of /assets/svg/icon_name.svg
I have already tried some tutorials, but I could not make any of them work.
Could someone please help a noob in PHP?
I got it to work. So if someone has the same question, here is how I managed to get the code working.
<?php
// Note: you must download simple_html_dom.php from
// https://sourceforge.net/projects/simplehtmldom/files/
// and then include it
include_once('simple_html_dom.php');
// Target the website
$html = file_get_html('http://the_target_website.com');
// Loop through all images in the HTML DOM
foreach ($html->find('img') as $item) {
    // Get an attribute (for a non-value attribute such as checked or selected, this returns true or false)
    $value = $item->src;
    // Set an attribute
    $item->src = 'http://yourwebsite.com/' . $value;
}
// Save the DOM
$html->save();
// Find the div whose content you want
$elem = $html->find('div[id=container]', 0);
// Output it using echo
echo $elem;
?>
That's it!
Did you read the documentation on reading and modifying attributes?
According to it:
// Get an attribute (for a non-value attribute such as checked or selected, this returns true or false)
$value = $e->href;
// Set an attribute
$e->href = 'ursitename'.$value;
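As a rough illustration of the same read-then-set pattern without Simple HTML DOM, here is a sketch using PHP's built-in DOMDocument. The base URL and the inline markup are assumptions taken from the question's description, not from the real page, and the check for site-relative paths is an extra step that neither answer includes.

```php
<?php
// Assumed base URL and markup, for illustration only.
$base   = 'http://www.mywebsite.com/wp-content/themes/mytheme';
$source = '<table id="tabla_mareas"><tr><td>'
        . '<img src="/assets/svg/icon_name.svg">'
        . '</td></tr></table>';

$doc = new DOMDocument();
libxml_use_internal_errors(true); // collect parser warnings silently
$doc->loadHTML($source);
libxml_clear_errors();

foreach ($doc->getElementsByTagName('img') as $img) {
    $src = $img->getAttribute('src');
    // Rewrite only site-relative paths ("/x"), not "//host/x" or absolute URLs.
    if (strpos($src, '/') === 0 && strpos($src, '//') !== 0) {
        $img->setAttribute('src', $base . $src);
    }
}

// Print just the table, with the rewritten src values.
$xpath = new DOMXPath($doc);
$table = $xpath->query('//table[@id="tabla_mareas"]')->item(0);
echo $doc->saveHTML($table), "\n";
```

Leaving the trailing slash off the base URL avoids a doubled slash when the original src already starts with one.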
What I'm currently doing: I have a textarea where the user can paste HTML code.
I want to get a certain element from that HTML.
In a live HTML page, this can be done with a jQuery selector,
but I think it's a whole different thing when the HTML is held in a variable as a string.
How can I locate a certain element in that case?
The code is:
function searchHtml() {
$html = $_POST; // text area input contains html code
$selector = "#rso > div > div > div:nth-child(1) > div > h3 > a"; //example - the a element with hello world
$getValue = getValueBySelector($selector); //will return hello world
}
function getValueBySelector($selector) {
//what will i do here?
}
searchHtml();
You can look at the Simple HTML DOM Parser (manual at http://simplehtmldom.sourceforge.net/manual.htm). It is a powerful tool for parsing HTML code to find and extract various elements and their attributes.
For your particular case, you can use
// Create a DOM object from the input string
$htmlDom = str_get_html($html);
// Find the required elements (find() returns an array of matches;
// pass an index such as 0 as the second argument to get a single element)
$e = $htmlDom->find($selector);
Oh, and you have to pass the provided input value to the getValueBySelector() function :-)
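If you would rather not depend on an external library, a comparable helper can be sketched with PHP's built-in DOMDocument and DOMXPath. Note the assumptions: DOMXPath takes XPath expressions rather than CSS selectors, so the expression below is a hand-written approximation of the selector in the question, and getValueByXPath() and the sample markup are hypothetical names and data for illustration.

```php
<?php
/**
 * Return the trimmed text of the first node matching an XPath expression,
 * or null if the input is empty, the expression is invalid, or nothing matches.
 */
function getValueByXPath(string $html, string $expression): ?string
{
    if (trim($html) === '') {
        return null;
    }
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // silence parser warnings
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $nodes = (new DOMXPath($doc))->query($expression);
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }
    return trim($nodes->item(0)->textContent);
}

// Hypothetical input, standing in for the pasted textarea content.
$html = '<div id="rso"><div><h3><a href="#">hello world</a></h3></div></div>';
echo getValueByXPath($html, '//*[@id="rso"]//h3/a'), "\n";
```

Passing the HTML string in as a parameter, as the answer above suggests, keeps the helper independent of $_POST.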