PHP Web Scrape with PHP Simple HTML DOM Parser

I am trying to get a data field using PHP Simple HTML DOM Parser. I can pull the links, images etc but cannot get a certain data attribute.
Example HTML -
<div id="used">
<div id="srpVehicle-1C3CCCEG2FN601809" class="vehicle" data-vin="1C3CCCEG2FN601809">
<div id="srpVehicle-1C3CCCEG2FN601810" class="vehicle" data-vin="1f2CfCEG2FN266778">
</div>
I would like to get all the "data-vin" fields on a site.
Here is my go at it -
$html = file_get_html($url);
foreach($html->find("div[data-vin]", 0) as $vin){
echo $vin."<br>";
}
But it returns the whole page when I echo $vin. How can I access that data-vin field?

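A selector like
$html->find("data-vin", 0)
looks for tags named data-vin, when you really want tags with the attribute data-vin. Passing 0 as the second argument also makes find() return a single element rather than an array, so leave it out when looping over every match: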
foreach($html->find("[data-vin]") as $tag){
echo $tag->getAttribute('data-vin')."<br>";
}
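If you would rather not depend on Simple HTML DOM, the same attribute lookup can be sketched with PHP's bundled DOM extension. This is only a minimal sketch and not the method used in the answer above: the helper name extractVins and the inline sample markup are assumptions made for illustration.

```php
<?php
// Hypothetical helper: returns every data-vin value found in an HTML string.
function extractVins(string $html): array
{
    $doc = new DOMDocument();
    // Collect parser warnings internally instead of printing them.
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $vins = [];
    // Match any element that carries a data-vin attribute.
    foreach ($xpath->query('//*[@data-vin]') as $node) {
        $vins[] = $node->getAttribute('data-vin');
    }
    return $vins;
}

// Sample markup modelled on the question (assumed, with closing tags added).
$sample = '<div id="used">'
    . '<div class="vehicle" data-vin="1C3CCCEG2FN601809"></div>'
    . '<div class="vehicle" data-vin="1f2CfCEG2FN266778"></div>'
    . '</div>';

foreach (extractVins($sample) as $vin) {
    echo $vin, "\n";
}
```

The XPath expression //*[@data-vin] plays the same role as the [data-vin] selector: it filters on the presence of the attribute, not on the tag name.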

Related

PHP Simple HTML DOM - How to get the text inside a tag

I use PHP Simple HTML DOM to fetch some HTML. The DOM looks like the code below. I need to fetch the plain text inside the a tag, but I am getting the whole link (Kiwi Fruit Basket) instead.
HTML Code
<div class="name" style="height: 34px;">
Kiwi Fruit Basket
</div>
Php Code
// Create DOM from URL or file
$html = file_get_html('http://floristchennai.com/');
// Find all links text
foreach($html->find('.name a') as $element)
{
echo "<br>a tag text value=" . $element;
}
Doing it this way I don't get the text I want to get.
Thanks in advance!
Try the innertext property, which reads or writes the inner HTML of an element (plaintext returns the text with the tags stripped):
foreach($html->find('.name a') as $element)
{
echo "<br>a tag text value=" . $element->innertext;
}
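The difference between an element's inner HTML and its plain text can be illustrated with PHP's built-in DOM extension. This is a rough sketch that does not use Simple HTML DOM: the helper name innerAndPlain and the sample markup, including the nested b tag, are assumptions for illustration.

```php
<?php
// Hypothetical helper: returns [inner HTML, plain text] of the first <a> in a fragment.
function innerAndPlain(string $html): array
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();

    $a = $doc->getElementsByTagName('a')->item(0);
    if ($a === null) {
        return ['', ''];
    }

    // Inner HTML: serialize each child node of the element in turn.
    $inner = '';
    foreach ($a->childNodes as $child) {
        $inner .= $doc->saveHTML($child);
    }

    // Plain text: textContent concatenates the text and drops all tags.
    return [$inner, $a->textContent];
}

[$inner, $plain] = innerAndPlain(
    '<div class="name"><a href="#">Kiwi <b>Fruit</b> Basket</a></div>'
);
echo $inner, "\n";
echo $plain, "\n";
```

The first value keeps the nested markup, while the second contains only the text.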

Get contents of element from external page PHP

I'd like to get the content (CSS, children, etc.) of an element and display it on an HTML page, but this element is on an external page. When I use:
$page = new DOMDocument();
$page->loadHTMLFile('about.php');
$text = $page->getElementById('text');
echo $text->nodeValue;
I only get the text, but #text also has an image as a child and some CSS. Can I get (and echo) those too, a bit like an iframe but for a single element? If so, how?
Thanks a lot.
Maybe what you're looking for is DOMDocument::saveHTML().
If you pass the optional node argument, it outputs only that node and its children.
$elm = $page->getElementById('text');
echo $elm->ownerDocument->saveHTML($elm);
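As a self-contained illustration of passing a node to saveHTML(), here is a small sketch that loads markup from a string instead of a file. The helper name outerHtmlById and the sample page are assumptions, and the element is looked up through XPath here rather than getElementById().

```php
<?php
// Hypothetical helper: returns the outer HTML of the element with the given id,
// or null when no such element exists.
function outerHtmlById(string $html, string $id): ?string
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    // Assumes $id contains no double quotes.
    $node = $xpath->query('//*[@id="' . $id . '"]')->item(0);

    // Passing the node limits the serialized output to that subtree.
    return $node === null ? null : $doc->saveHTML($node);
}

// Assumed sample page for illustration.
$page = '<html><body>'
    . '<div id="text"><img src="img/dummy.png" alt="dummy"><p>text</p></div>'
    . '<p>other</p>'
    . '</body></html>';

echo outerHtmlById($page, 'text'), "\n";
```

Only the markup of the element and its children is returned. Styles defined in external or page-level CSS are not part of the node, so they are not included.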
I have found a solution, although it doesn't retrieve the CSS, but if you only need the element and its children, this is my best bet.
Use simple_html_dom.php to do all the hard stuff.
My external page:
<div id='text'>
<img src='img/dummy.png' align='left' alt='Image not available. Our apologies.'/>
<span>text</span><br/>
<p>
text
</p>
<p>
text
</p>
<p>
text
</p>
</div>
Now, my page that I'd like to show the contents of my external page:
<?php include('../includes/simple_html_dom.php'); ?>
....
<?php
$html = file_get_html('about.php');
$ret = $html->find('div#text', 0);
echo $ret;
?>
This echoes the element together with its children, but unfortunately without the CSS.

How to parse multiple elements in portions for html via Simple Html Dom

I am attempting to get various elements inside of an li as shown below. I am pretty new to this so I may not be using the most efficient methods but this is where I have started...
EXAMPLE CODE SIMPLIFIED....
<li id='entry_0' title='09879879'>
<div ....>
<h2> The title text would go here </h2>
<span class='entrySize' ....> 20oz </span>
<span class='entryPrice' ....> $32.09 </span>
<span class='anotherEntry' ....> More Data I need To Grab </span>
.......
</div>
</li>
<li> .... With same structure as above .... 100's of entries like this </li>
I know how to pull individual parts separately, but I am having trouble grasping how to do it grouped within a portion of the HTML.
$filename = "directory/file.html";
$html = file_get_html($filename);
for($i=0; $i<=count(entryNumber);$i++)
{
$li_id = "entry_".$i;
foreach($html->find('li[id='.$li_id.']') as $li) {
echo $li->innertext;
}
}
So this gets me the content of the list item tag with the id number as the unique attribute. I would like to grab the h2 text, entrySize, entryPrice etc. as I iterate through the list item tags. What I don't understand is, once I have the list item content, how to parse the inner tags and attributes of that list item. There may be other parts of the full HTML document that have tags with the same id or class as these, so I am breaking this down into portions and then looking to parse one section at a time.
I would also like to pull the title attribute out of the li tag.
I hope my explanation makes sense.
You should probably use a DOM parser. PHP comes bundled with one, and there are many others you could use.
http://php.net/dom
PHP Simple HTML DOM Parser
<?php
$html = file_get_contents($page);
$doc = new DOMDocument();
$doc->loadHTML($html);
// now find what you need
$items = $doc->getElementsByTagName('li');
foreach ($items as $item) {
    $id = $item->getAttribute('id');
    // the ids in the question look like entry_0, entry_1, ...
    if (strpos($id, 'entry_') === 0) {
        // found matching li, grab its children
    }
}
Use this as a baseline, we can't write all the code for you. Check out the PHP docs to finish this :) From what I have so far, you need to follow the docs to make it grab the child values, and handle them.
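One way the baseline could be completed is with DOMXPath queries scoped to each li. The following is a sketch under stated assumptions, not a definitive implementation: the helper name parseEntries, the array keys, and the inline sample (adapted from the question) are made up for illustration.

```php
<?php
// Hypothetical helper: extracts the title attribute and selected child values
// from every <li> whose id starts with "entry_".
function parseEntries(string $html): array
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $entries = [];
    foreach ($xpath->query('//li[starts-with(@id, "entry_")]') as $li) {
        // Passing $li as the context node scopes each query to the current <li>,
        // so same-named classes elsewhere in the document are ignored.
        $entries[] = [
            'id'    => $li->getAttribute('id'),
            'title' => $li->getAttribute('title'),
            'name'  => trim($xpath->evaluate('string(.//h2)', $li)),
            'size'  => trim($xpath->evaluate('string(.//span[@class="entrySize"])', $li)),
            'price' => trim($xpath->evaluate('string(.//span[@class="entryPrice"])', $li)),
        ];
    }
    return $entries;
}

// Sample markup adapted from the question (assumed).
$sample = "<ul><li id='entry_0' title='09879879'><div>"
    . "<h2> The title text would go here </h2>"
    . "<span class='entrySize'> 20oz </span>"
    . "<span class='entryPrice'> \$32.09 </span>"
    . "</div></li></ul>";

print_r(parseEntries($sample));
```

The relative paths starting with .// are what keep each lookup inside the current list item. An absolute //span would search the whole document again.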

Extracting data from HTML using Simple HTML DOM Parser

For a college project, I am creating a website with some back-end algorithms, and to test these in a demo environment I need a lot of fake data. To get this data I intend to scrape some sites. One of these sites is freelancer.com. To extract the data I am using the Simple HTML DOM Parser, but so far I have been unsuccessful in getting the data I need.
Here is an example of the HTML layout of the page I intend to scrape. The red boxes mark the required data.
Here is the code I have written so far after following some tutorials.
<?php
include "simple_html_dom.php";
// Create DOM from URL
$html = file_get_html('http://www.freelancer.com/jobs/Website-Design/1/');
//Get all data inside the <tr> of <table id="project_table">
foreach($html->find('table[id=project_table] tr') as $tr) {
foreach($tr->find('td[class=title-col]') as $t) {
//get the outer HTML of the cell
$data = $t->outertext;
echo $data;
}
}
?>
Hopefully someone can point me in the right direction as to how I can get this working.
Thanks.
The raw source code of the page is different, which is why you're not getting the expected results.
You can check the raw source with Ctrl+U: the data is in table[id=project_table_static], and the td cells have no attributes. Here is working code that gets all the URLs from the table:
$url = 'http://www.freelancer.com/jobs/Website-Design/1/';
// Create DOM from URL
$html = file_get_html($url);
//Get all data inside the <tr> of <table id="project_table_static">
foreach($html->find('table#project_table_static tbody tr') as $i=>$tr) {
// Skip the first empty element
if ($i==0) {
continue;
}
echo "<br/>\$i=".$i;
// get the first anchor
$anchor = $tr->find('a', 0);
echo " => ".$anchor->href;
}
// Clear dom object
$html->clear();
unset($html);

Select Content of div using php

I have a div named "main" in my page. At the end of the page I have code that converts HTML into a PDF using PHP. I want to select the content of that div (it contains paragraphs, charts, tables etc.).
How?
Below code will show you how to get DIV tag's content using PHP code.
PHP Code:
<?php
$file = "test.html";
$source = new DOMDocument();
$source->loadHTMLFile($file);
$xpath = new DOMXPath($source);
// Select every div whose id attribute is "test"
$nodes = $xpath->query("//div[@id='test']");
if ($nodes !== false) {
    foreach ($nodes as $node) {
        print "The Type of the element is: " . $node->nodeName . "\n<b><pre><code>";
        // Print the text of each child node, then close the wrapper tags once
        foreach ($node->childNodes as $child) {
            print $child->nodeValue;
        }
        print "</code></pre></b>";
    }
}
?>
This selects the DIV tag with the ID "test". You can replace it with the ID you need.
test.html
<div id="test">This is my content</div>
Output:
The Type of the element is: div
This is my content
You should put the php code into a separate file from the html and use something like DOMDocument to get the content from the div.
$dom = new DOMDocument();
$dom->loadHTMLFile('yourfile.html');
...
You cannot directly interact with the HTML DOM via PHP.
What you could do is use a form with an input containing your content. When the form is submitted, you can access the data via PHP.
But maybe you want to use Javascript for that task?
Nevertheless, a quick'n'dirty PHP example:
<form action="" method="post">
<textarea name="content">hello world</textarea>
</form>
<?php
if (isset($_POST['content'])) {
echo htmlspecialchars($_POST['content']);
}
?>
