Extracting data from HTML using Simple HTML DOM Parser

For a college project, I am building a website with some back-end algorithms, and to test them in a demo environment I need a lot of fake data. To get this data I intend to scrape some sites, one of which is freelancer.com. To extract the data I am using the Simple HTML DOM Parser, but so far I have been unable to get the data I need.
Here is an example of the HTML layout of the page I intend to scrape. The red boxes mark the required data.
Here is the code I have written so far after following some tutorials.
<?php
include "simple_html_dom.php";
// Create DOM from URL
$html = file_get_html('http://www.freelancer.com/jobs/Website-Design/1/');
//Get all data inside the <tr> of <table id="project_table">
foreach($html->find('table[id=project_table] tr') as $tr) {
foreach($tr->find('td[class=title-col]') as $t) {
//get the inner HTML
$data = $t->outertext;
echo $data;
}
}
?>
Hopefully someone can point me in the right direction as to how I can get this working.
Thanks.

The raw source code differs from what the browser's inspector shows, which is why you're not getting the expected results.
You can check the raw source with Ctrl+U. The data is in table[id=project_table_static], and the td cells have no attributes, so here is working code that gets all the URLs from the table:
$url = 'http://www.freelancer.com/jobs/Website-Design/1/';
// Create DOM from URL
$html = file_get_html($url);
// Get every <tr> inside <table id="project_table_static">
foreach($html->find('table#project_table_static tbody tr') as $i=>$tr) {
    // Skip the first empty element
    if ($i==0) {
        continue;
    }
    echo "<br/>\$i=".$i;
    // Get the first anchor; skip rows that have none
    $anchor = $tr->find('a', 0);
    if (!$anchor) {
        continue;
    }
    echo " => ".$anchor->href;
}
// Clear dom object
$html->clear();
unset($html);
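The same check can be sketched without the library, using PHP's built-in DOMDocument and DOMXPath. This is only an illustration under stated assumptions: the HTML string below is a made-up stand-in for the raw page source (the real markup may differ), and the link paths are hypothetical.

```php
<?php
// Hypothetical stand-in for the raw page source (what Ctrl+U would show).
$rawHtml = <<<HTML
<table id="project_table_static">
  <tbody>
    <tr><td></td></tr>
    <tr><td><a href="/projects/one.html">One</a></td></tr>
    <tr><td><a href="/projects/two.html">Two</a></td></tr>
  </tbody>
</table>
HTML;

libxml_use_internal_errors(true); // keep parser warnings out of the output
$doc = new DOMDocument();
$doc->loadHTML($rawHtml);
$xpath = new DOMXPath($doc);

$links = [];
foreach ($xpath->query('//table[@id="project_table_static"]//tr') as $row) {
    // First anchor inside this row, if any
    $anchor = $xpath->query('.//a', $row)->item(0);
    if ($anchor === null) {
        continue; // rows without a link, such as the empty first row
    }
    $links[] = $anchor->getAttribute('href');
}
print_r($links);
```

Checking for a missing anchor, rather than skipping row 0 by index, keeps the loop from failing if the table layout changes.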

Related

extracting h2 header from website using simplehtmldom

I'm studying Simple HTML DOM.
As mentioned in its documentation, to retrieve headers from a website we would proceed as follows:
<?php
include('simple_html_dom.php');
$html = file_get_html('https://www.w3schools.com/');
// find the h2 headers on a webpage
$headlines = array();
foreach($html->find('h2') as $header) {
$headlines[] = $header->plaintext;
}
print_r($headlines);
?>
When I test this sample on my local server, it prints only:
array ()
If I have understood correctly, it should print:
Python Php Java etc.... everything that is inside an <h2> tag.
Am I missing something?

PHP Web Scrape with PHP Simple HTML DOM Parser

I am trying to get a data field using PHP Simple HTML DOM Parser. I can pull the links, images etc but cannot get a certain data attribute.
Example HTML -
<div id="used">
<div id="srpVehicle-1C3CCCEG2FN601809" class="vehicle" data-vin="1C3CCCEG2FN601809">
<div id="srpVehicle-1C3CCCEG2FN601810" class="vehicle" data-vin="1f2CfCEG2FN266778">
</div>
I would like to get all the "data-vin" fields on a site.
Here is my go at it -
$html = file_get_html($url);
foreach($html->find("div[data-vin]", 0) as $vin){
echo $vin."<br>";
}
But it returns the whole page when I echo $vin. How can I access that data-vin field?

$html->find("data-vin", 0)
is looking for tags named data-vin, when you really want tags with the attribute data-vin.
foreach($html->find("[data-vin]") as $tag){
echo $tag->getAttribute('data-vin')."<br>";
}
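For comparison, here is a minimal sketch of the same attribute lookup with PHP's built-in DOMDocument and DOMXPath instead of Simple HTML DOM. The sample markup is adapted from the question, with the closing div tags added as an assumption so that it parses cleanly.

```php
<?php
// Sample markup adapted from the question (closing tags added).
$sample = <<<HTML
<div id="used">
  <div id="srpVehicle-1C3CCCEG2FN601809" class="vehicle" data-vin="1C3CCCEG2FN601809"></div>
  <div id="srpVehicle-1C3CCCEG2FN601810" class="vehicle" data-vin="1f2CfCEG2FN266778"></div>
</div>
HTML;

libxml_use_internal_errors(true); // keep parser warnings out of the output
$doc = new DOMDocument();
$doc->loadHTML($sample);
$xpath = new DOMXPath($doc);

// Select every div that carries a data-vin attribute, then read its value.
$vins = [];
foreach ($xpath->query('//div[@data-vin]') as $div) {
    $vins[] = $div->getAttribute('data-vin');
}
echo implode("<br>", $vins);
```

Either way, the point is the same: select the elements by attribute, then read the attribute value rather than echoing the element itself.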

php echo to html parsing issues

I was recently asked to take part in a project that, for now, aims to parse a chunk of HTML with PHP. Using the website I was assigned, I went through Inspect Element to fill in the missing parts of my code. The goal is simply to print (using echo) certain data on localhost, without storing it in a database or anything similar. The PHP and HTML code are below. Thanks in advance!
Php code:
<?php
include_once('simple_html_dom.php');
$html = new simple_html_dom();
// Website link to scrap
$website = 'https://www1.gsis.gr/webtax3/etak/faces/main.jspx?_adf.ctrl-state=16kjeyshcz_4&_afrLoop=70130840737831';
// Create DOM from URL or file
$html = file_get_html($website, false, null, 0);
//$html = str_get_html('<html><body><div id="pt1:r1:0:t3::db">Hello</div><div class="xx8">Goodbye</div></body></html>');
//$ret = $html->find('.xx8', 0)->plaintext;
if (is_array($html)) {
foreach($html->find('div[class=xx8]')->outertext as $data) {
echo $data->outertext;
}
}
?>
HTML code (via Inspect Element, where Δε βρέθηκαν γήπεδα is the custom text of the page I mentioned):
<div id="pt1:r1:1:t3::db" class="xx8"
style="position:relative;width:100%;overflow:hidden" _afrcolcount="30"><table
class="xxb xy3" style="table-layout:fixed;position:relative;width:2097px;"
cellspacing="0" _totalwidth="2097" _selstate="{}" _rowcount="0" _startrow="0">
<colgroup span="30"><col style="width:80px;"><col style="width:110px;"><col
style="width:105px;"><col style="width:105px;"><col style="width:105px;"><col
style="width:75px;"><col style="width:35px;"><col style="width:50px;"><col
style="width:55px;"><col style="width:80px;"><col style="width:65px;"><col
style="width:65px;"><col style="width:55px;"><col style="width:95px;"><col
style="width:65px;"><col style="width:55px;"><col style="width:75px;"><col
style="width:75px;"><col style="width:60px;"><col style="width:60px;"><col
style="width:60px;"><col style="width:60px;"><col style="width:50px;"><col
style="width:50px;"><col style="width:60px;"><col style="width:50px;"><col
style="width:62px;"><col style="width:125px;"><col style="width:55px;"><col
style="width:55px;"></colgroup></table>Δε βρέθηκαν γήπεδα.</div>

Change src attribute from img, using Simple HTML Dom php library

I'm totally new to php, and I'm having a hard time changing the src attribute of img tags.
I have a website that pulls a part of a page using Simple Html Dom php, here is the code:
<?php
include_once('simple_html_dom.php');
$html = file_get_html('http://www.tabuademares.com/br/bahia/morro-de-sao-paulo');
foreach($html ->find('img') as $item) {
$item->outertext = '';
}
$html->save();
$elem = $html->find('table[id=tabla_mareas]', 0);
echo $elem;
?>
This code correctly returns the part of the page I want. But when I do this, the img tags come with the src of the original page: /assets/svg/icon_name.svg
What I want to do is change the original src so that it looks like this: http://www.mywebsite.com/wp-content/themes/mytheme/assets/svg/icon_name.svg
That is, I want to put the URL of my site in front of /assets/svg/icon_name.svg
I have already tried some tutorials, but I could not get any of them to work.
Could someone please help a PHP beginner?

I managed to make it work. In case someone has the same question, here is how I got the code working.
<?php
// Note: first download simple_html_dom.php from
// https://sourceforge.net/projects/simplehtmldom/files/
// and then include it
include_once('simple_html_dom.php');
// Target the website
$html = file_get_html('http://the_target_website.com');
// Loop through all images in the HTML DOM
foreach($html->find('img') as $item) {
    // Get an attribute (for a non-value attribute such as checked or selected, this returns true or false)
    $value = $item->src;
    // Set an attribute; ltrim avoids a double slash when the original src starts with "/"
    $item->src = 'http://yourwebsite.com/'.ltrim($value, '/');
}
// Serialize the DOM (the return value is not used here)
$html->save();
// Find the div whose content you want
$elem = $html->find('div[id=container]', 0);
// Output it using echo
echo $elem;
?>
That's it!

Did you read the documentation on reading and modifying attributes?
According to it:
// Get an attribute (for a non-value attribute such as checked or selected, this returns true or false)
$value = $e->href;
// Set a attribute
$e->href = 'ursitename'.$value;
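As a library-free illustration of the same get-then-set pattern, the sketch below uses PHP's built-in DOMDocument. The one-row table is a made-up stand-in for the scraped page, and the base URL is the one from the question; only root-relative paths (those starting with a single "/") are rewritten, which is an assumption about what the page contains.

```php
<?php
// Made-up stand-in for the scraped table.
$page = '<table id="tabla_mareas"><tr><td><img src="/assets/svg/icon_name.svg"></td></tr></table>';
$base = 'http://www.mywebsite.com/wp-content/themes/mytheme';

libxml_use_internal_errors(true); // keep parser warnings out of the output
$doc = new DOMDocument();
$doc->loadHTML($page);

foreach ($doc->getElementsByTagName('img') as $img) {
    $src = $img->getAttribute('src');
    // Rewrite only root-relative paths; leave absolute and protocol-relative URLs alone.
    if (strpos($src, '/') === 0 && strpos($src, '//') !== 0) {
        $img->setAttribute('src', rtrim($base, '/') . $src);
    }
}

// Output just the table, with the rewritten src values.
$xpath = new DOMXPath($doc);
$table = $xpath->query('//table[@id="tabla_mareas"]')->item(0);
$rewritten = $doc->getElementsByTagName('img')->item(0)->getAttribute('src');
echo $doc->saveHTML($table);
```

Trimming the trailing slash from the base before concatenating avoids a doubled slash in the resulting URL.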

use selector search on html code(string) on PHP variable or ways alike

I currently have a text area where the user can paste HTML code.
I want to get a certain element from that HTML.
In plain HTML this can be done with a jQuery selector,
but I think it is a different matter when the HTML code is held in a variable as a string.
How can I locate a certain element in that case?
My code is:
function searchHtml() {
$html = $_POST; // text area input contains html code
$selector = "#rso > div > div > div:nth-child(1) > div > h3 > a"; //example - the a element with hello world
$getValue = getValueBySelector($selector); //will return hello world
}
function getValueBySelector($selector) {
//what will i do here?
}
searchHtml();

You can look at the Simple HTML DOM Parser (manual at http://simplehtmldom.sourceforge.net/manual.htm). It parses HTML code so that you can find and extract elements and their attributes.
For your particular case, you can use:
// Create a DOM object from the input string
$htmlDom = str_get_html($html);
// Find the required element
$e = $htmlDom->find($selector);
Note that find() without an index returns an array of matching elements. Also, you have to pass the input value to the getValueBySelector() function :-)
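If the library is not available, a rough equivalent can be sketched with PHP's built-in DOMDocument and DOMXPath. Note the swap: DOMXPath takes XPath expressions, not CSS selectors, so the selector from the question is replaced here by a hand-written XPath. The helper name getValueByXPath and the sample markup are hypothetical.

```php
<?php
// Hypothetical helper: returns the text of the first node matching an XPath
// expression, or null when the input is empty or nothing matches.
function getValueByXPath(string $html, string $expression): ?string
{
    if (trim($html) === '') {
        return null; // nothing to parse
    }
    libxml_use_internal_errors(true); // keep parser warnings out of the output
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    $xpath = new DOMXPath($doc);
    $nodes = $xpath->query($expression);
    if ($nodes === false || $nodes->length === 0) {
        return null; // invalid expression or no match
    }
    return trim($nodes->item(0)->textContent);
}

// Made-up input standing in for the pasted text-area content.
$html = '<div id="rso"><div><div><div><div><h3><a href="#">hello world</a></h3></div></div></div></div></div>';
echo getValueByXPath($html, '//*[@id="rso"]//h3/a');
```

The HTML string is passed in as an argument, which is the same point the answer makes about getValueBySelector().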
