HTML DOM Parser in PHP


I am using PHP Simple HTML DOM Parser but am unable to get images to display.
I am not a coder and am trying to pull articles and images from a website. The articles are fine, but the images are not displaying. Instead, part of the path is printed as text, e.g.
> //ssl.gstatic.com/ui/v1/button/search-white.png
> //ssl.gstatic.com/ui/v1/button/search-white.png
> //ssl.gstatic.com/ui/v1/icons/common/settings.png
Using Google as an example, here's the code I am using:
<?php
$html = file_get_html('https://news.google.com/nwshp?hl=en&tab=in');
foreach($html->find('h2') as $e)
echo $e->innertext . '<br><br>';
foreach($html->find('div.jsdisplay') as $e)
echo $e->innertext . '<br>';
foreach($html->find('img') as $element)
echo $element->src . '<br>';
?>
Thanks for any help

You should replace
foreach($html->find('img') as $element)
echo $element->src . '<br>';
with
foreach ( $html->find('img') as $element ) {
$img = str_replace(array("//ssl"), array("http://ssl"), $element->src);
for($i = 0; $i < 5; $i ++) {
$img = str_replace("//nt$i", "http://nt$i",$img);
}
echo "<img src=\"$img\" /> <br>";
}

I have updated my answer after your last comment, using your original site URL 'http://frielatvsales.com/QuadAttachments.htm'.
Try the code below.
include_once "simplehtmldom/simple_html_dom.php";
$url = "http://frielatvsales.com/QuadAttachments.htm";
$html = file_get_html($url);
preg_match('#^(?:http://)?([^/]+)#i', $url, $matches);
$host = $matches[1];
foreach($html->find('h2') as $e) {
echo $e->innertext . '<br><br>';
}
foreach($html->find('div.jsdisplay') as $e) {
echo $e->innertext . '<br>';
}
foreach($html->find('img') as $element) {
echo '<img src="http://' . $host . '/' . ltrim($element->src, '/') . '" /><br>';
}

//ssl.gstatic.com/ui/v1/button/search-white.png is a relative URI (the scheme is not specified, so it will use the same scheme (e.g. http: or https:) as the page it appears in).
Resolve it as you would any other relative URI.
My question is how to get the images to display using the code in my original post.
You have to output an <img> tag instead of the URI as plain text.
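The relative-URI resolution described above can be sketched as a small helper. This is only an illustration under stated assumptions: absolutize_url() is a hypothetical function name, not part of Simple HTML DOM, and it handles only the common cases (absolute, protocol-relative, root-relative, and simple document-relative paths). It does not normalize "../" segments or carry over port numbers.

```php
<?php
// Hypothetical helper (not part of Simple HTML DOM): turn the value of a
// src attribute into an absolute URL, given the URL of the page it came from.
function absolutize_url(string $src, string $pageUrl): string
{
    $base   = parse_url($pageUrl);
    $scheme = $base['scheme'] ?? 'http';
    $host   = $base['host'] ?? '';

    // Already has a scheme ("http:", "https:", "data:", ...): leave it alone.
    if (preg_match('#^[a-z][a-z0-9+.\-]*:#i', $src)) {
        return $src;
    }
    // Protocol-relative ("//host/path"): inherit the page's scheme.
    if (strpos($src, '//') === 0) {
        return $scheme . ':' . $src;
    }
    // Root-relative ("/path"): resolve against the page's host.
    if (strpos($src, '/') === 0) {
        return $scheme . '://' . $host . $src;
    }
    // Document-relative ("img/a.png"): resolve against the page's directory.
    $path = $base['path'] ?? '/';
    $dir  = substr($path, 0, strrpos($path, '/') + 1);
    return $scheme . '://' . $host . $dir . $src;
}

$page = 'https://news.google.com/nwshp?hl=en&tab=in';
$src  = '//ssl.gstatic.com/ui/v1/button/search-white.png';

// Output an <img> tag rather than the bare URI so the browser renders the image.
echo '<img src="' . htmlspecialchars(absolutize_url($src, $page)) . '"><br>';
```

Inside the original loop, the same call would wrap $element->src before it is echoed in the <img> tag.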

Related

SimpleHTMLDom: Call to a member function find() on array

I want to loop through specific TDs in a LARGE HTML page, and I'm using simplehtmldom to do that. The problem is that I can't make it work without putting every step in a foreach.
Here is my php
include('../inc/simple_html_dom.php');
$html = file_get_html("http://www.bjork-family.com/f43-london-stories");
// I just put the dom of pagebody into TEST
$test = $html->find('#page-body');
foreach($test->find('img') as $element)
{
echo "<img src='" . $element->src . "'/>" . '<br>';
}
I get this error:
Fatal error: Call to a member function find() on array in /mywebsite.php on line 39
Line 39 is this one:
foreach($test->find('img') as $element)
I tried a lot of different things. If I keep it really simple, like this:
// Create DOM from URL or file
$html = file_get_html('http://www.bjork-family.com/f43-london-stories');
foreach($html->find('img') as $element)
echo $element->src . '<br>';
then it works!
It also seems to work when I do it like this:
foreach($html->find('div') as $element)
{
if($element->id == 'page-body')
{
echo $element->id;
echo "EXIST <br>";
}
// echo "<img src='" . $element->src . "'/>" . '<br>';
}
But I don't want to search my HTML using only foreach. Is there another way to get to my position and then loop from there? (I have to loop through the tr elements in a table.)

$test = $html->find('#page-body');
Now $test is an array
$test = $html->find('#page-body', 0);
Now $test is an element
foreach($test->find('img') as $element)
{
echo "<img src='" . $element->src . "'/>" . '<br>';
}
This will work now. You can also simplify it by searching from $html with a descendant selector:
foreach($html->find('#page-body img') as $element)
{
echo "<img src='" . $element->src . "'/>" . '<br>';
}

PHPQuery - get all links of contains specific url page

I am trying to get all links on a given page whose URL contains a specific path, using PHPQuery. I am using the PHP-supported syntax of PHPQuery.
include_once 'phpQuery.php';
$url = 'http://www.phonearena.com/phones/manufacturer/';
$doc = phpQuery::newDocumentFile($url);
$urls = $doc['a'];
foreach ($urls as $url) {
echo pq($url)->attr('href') . '<br>';
}
The code above works, but it shows all the links.
I want to show only those containing "/phones/manufacturer/".
I tried this but it shows nothing:
include_once 'phpQuery.php';
$url = 'http://www.phonearena.com/phones/manufacturer/';
$doc = phpQuery::newDocumentFile($url);
$urls = $doc['a'];
foreach ($urls as $url) {
echo pq($url)->attr('href:contains("/phones/manufacturer/")') . '<br>';
}

Use the code below to get all URLs from that site:
$doc = new DOMDocument();
// @ suppresses the warnings libxml emits for invalid real-world HTML
@$doc->loadHTML(file_get_contents('http://www.phonearena.com/phones/manufacturer/'));
$ahreftags = $doc->getElementsByTagName('a');
foreach ($ahreftags as $tag) {
echo "<br/>";
echo $tag->getAttribute('href');
echo "<br/>";
}
exit;

Try the attribute-contains selector a[href*="..."], which PHPQuery borrows from jQuery (see the jQuery documentation on selectors):
include_once 'phpQuery.php';
$url = 'http://www.phonearena.com/phones/manufacturer/';
$doc = phpQuery::newDocumentFile($url);
$urls = $doc['a[href*="/phones/manufacturer/"]'];
foreach ($urls as $url) {
echo pq($url)->attr('href') . '<br>';
}
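For comparison, the same filter can be expressed without PHPQuery by combining the DOMDocument approach from the other answer with an XPath contains() test. This is a minimal sketch, not the method of either answer: the inline $html string is an assumption standing in for the downloaded page so the example runs on its own.

```php
<?php
// Inline sample markup stands in for the downloaded page (an assumption).
$html = '<html><body>'
      . '<a href="/phones/manufacturer/apple">Apple</a>'
      . '<a href="/news/today">News</a>'
      . '<a href="http://www.phonearena.com/phones/manufacturer/nokia">Nokia</a>'
      . '</body></html>';

$doc = new DOMDocument();
// Real-world HTML is rarely valid; collect parse warnings instead of printing them.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

// Select only the <a> elements whose href contains the wanted path.
$xpath = new DOMXPath($doc);
$links = [];
foreach ($xpath->query('//a[contains(@href, "/phones/manufacturer/")]') as $a) {
    $links[] = $a->getAttribute('href');
}

foreach ($links as $href) {
    echo $href . "<br>\n";
}
```

To run it against the live site, $html would be replaced with the result of file_get_contents() on the page URL.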

php get images only within body tags

I need to do the equivalent of this:
$tags2 = $doc->getElementsByTagName('img');
$mybody = $doc->getElementsByTagName('body');
//if there's a body tag
foreach ($mybody as $bod){
//loop through each img element
foreach ($tags2 as $tag) {
echo '<img src=' . $tag->getAttribute('src') . '/>';
echo "<br/>" . $tag->getAttribute('href') ;
}
}
Here's the context:
$str = file_get_contents('http://somewebsite.html');
$doc = new DOMDocument();
@$doc->loadHTML('<?xml encoding="UTF-8">' . $str);
$tidy = new tidy();
$tidy->parseFile($str);
$tidy->cleanRepair();
if(!empty($tidy->errorBuffer)) {
echo "The following errors or warnings occured:\n";
echo $tidy->errorBuffer;
}
else {
$str = $tidy;
}
$tags2 = $doc->getElementsByTagName('img');
$mybody = $doc->getElementsByTagName('body');
foreach ($mybody as $bod){
foreach ($tags2 as $tag) {
echo '<img src=' . $tag->getAttribute('src') . '/>';
echo "<br/>" . $tag->getAttribute('href') ;
}
}
^ outputs all the images on the page, in the header, on sidebars, etc. as well as the image in the body. I just want the image in the body. I tried a few other examples I saw on here using recursion but they were to get the styles or paragraph tags and I couldn't get them to retrieve image tag and image src attribute properly.
How can I do an inner loop for any images within the body once I have the body tag?
Thank you.

You just need to reverse two lines and rewrite a smidgen.
$mybody = $doc->getElementsByTagName('body')->item(0);
$tags2 = $mybody->getElementsByTagName('img');
This works because the node returned by item(0) is a DOMElement, and DOMElement has its own getElementsByTagName() method, which searches only that element's descendants.
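A self-contained sketch of that scoping idea follows. The inline markup, the div ids, and the image_sources() helper are assumptions made up for illustration; the point is only that calling getElementsByTagName() on an element limits the search to that element's subtree.

```php
<?php
// Sample markup (an assumption): one image in a header block, one in the content block.
$html = '<html><body>'
      . '<div id="header"><img src="logo.png"></div>'
      . '<div id="content"><img src="photo.png"></div>'
      . '</body></html>';

$doc = new DOMDocument();
libxml_use_internal_errors(true); // keep libxml parse warnings out of the output
$doc->loadHTML($html);
libxml_clear_errors();

// Hypothetical helper: collect the src of every <img> below the given element.
function image_sources(DOMElement $scope): array
{
    $srcs = [];
    foreach ($scope->getElementsByTagName('img') as $img) {
        $srcs[] = $img->getAttribute('src');
    }
    return $srcs;
}

$body    = $doc->getElementsByTagName('body')->item(0);
$content = $doc->getElementsByTagName('div')->item(1); // second <div> in document order

// item() returns null when nothing matched, so check before using the result.
if ($body !== null) {
    print_r(image_sources($body));    // every image inside <body>
}
if ($content !== null) {
    print_r(image_sources($content)); // only images inside the second <div>
}
```

If "body" in the question means the article area rather than the <body> tag, scoping to the container that wraps the article, as done with $content here, is what excludes header and sidebar images.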

Change attribute using php & querypath

I want to use PHP & QueryPath to find all images in a document, then modify its src like this:
I want to change
http://test.com/test/name.jpg
to
http://example.com/xxx/name.jpg
I can find the specific class name using
$qp2 = $qp->find('body');
Now when I want to find all img on it to change the src:
foreach ($qp2->find('img') as $i) {
//here change the src
}
But when I execute
echo $qp2->html();
I see only last image. Where is the problem?

Like this?
foreach($qp2->find('img') as $key => $img) {
echo $img->html();
}
Sometimes you have to use top() or end() when you are re-using the qp object. Something like:
$qp = htmlqp($lpurl);
foreach ($qp->find('img') as $key => $img){
print_r($img->attr('src'));
$url = parse_url ($img->attr('src'));
print_r($url);
echo '<br/>';
if (!isset($url['scheme']) && !isset($url['host']) && !empty($url['path'])){
$newimg = $htmlpath . '/' . $url['path'];
$img->end()->attr('src', $newimg);
echo $img->html();
}
}
foreach ($qp->top()->find('script') as $key => $script){
print_r($script->attr('src'));
$url = parse_url ($script->attr('src'));
print_r($url);
if (!isset($url['scheme']) && !isset($url['host']) && !empty($url['path'])){
$newjs = $htmlpath . '/' . $url['path'];
echo '<br/>';
echo 'this is the modified ' . $newjs;
}
}
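The rewrite the question asks for (swapping one URL prefix for another on every image) can also be sketched with PHP's built-in DOMDocument instead of QueryPath. This is a swapped-in technique, not the QueryPath API, and the inline markup and the $old and $new variable names are assumptions for illustration.

```php
<?php
// Sample markup (an assumption) with two images to rewrite and one to leave alone.
$html = '<html><body>'
      . '<img src="http://test.com/test/name.jpg">'
      . '<img src="http://test.com/test/other.jpg">'
      . '<img src="http://elsewhere.org/keep.jpg">'
      . '</body></html>';

$doc = new DOMDocument();
libxml_use_internal_errors(true); // keep libxml parse warnings out of the output
$doc->loadHTML($html);
libxml_clear_errors();

$old = 'http://test.com/test/';
$new = 'http://example.com/xxx/';

foreach ($doc->getElementsByTagName('img') as $img) {
    $src = $img->getAttribute('src');
    // Only touch sources that start with the old prefix.
    if (strpos($src, $old) === 0) {
        $img->setAttribute('src', $new . substr($src, strlen($old)));
    }
}

// Serialize the whole document once, after all images have been updated.
$result = $doc->saveHTML();
echo $result;
```

Serializing the document after the loop, rather than an individual match inside it, is what keeps every image in the output.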

Simple HTML Dom

Thanks for taking the time to read my post... I'm trying to extract some information from my website using Simple HTML Dom...
I have it reading from the HTML source ok, now I'm just trying to extract the information that I need. I have a feeling I'm going about this in the wrong way... Here's my script...
<?php
include_once('simple_html_dom.php');
// create doctype
$dom = new DOMDocument("1.0");
// display document in browser as plain text
// for readability purposes
//header("Content-Type: text/plain");
// create root element
$xmlProducts = $dom->createElement("products");
$dom->appendChild($xmlProducts);
$html = file_get_html('http://myshop.com/small_houses.html');
$html .= file_get_html('http://myshop.com/medium_houses.html');
$html .= file_get_html('http://myshop.com/large_houses.html');
//Define my variable for later
$product['image'] = '';
$product['title'] = '';
$product['description'] = '';
foreach($html->find('img') as $src){
if (strpos($src->src,"http://myshop.com") === false) {
$src->src = "http://myshop.com/$src->src";
}
$product['image'] = $src->src;
}
foreach($html->find('p[class*=imAlign_left]') as $description){
$product['description'] = $description->innertext;
}
foreach($html->find('span[class*=fc3]') as $title){
$product['title'] = $title->innertext;
}
echo $product['img'];
echo $product['description'];
echo $product['title'];
?>
I put echo's on the end for sake of testing...but I'm not getting anything... Any pointers would be a great HELP!
Thanks
Charles

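file_get_html() returns a simple_html_dom object, and objects cannot be meaningfully concatenated. The class does define a __toString() method, so .= turns each page into a plain string rather than a combined DOM, and the result is most likely corrupt and no longer has a find() method. Load and process each page separately instead. Try the following: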
<?php
include_once('simple_html_dom.php');
// create doctype
$dom = new DOMDocument("1.0");
// display document in browser as plain text
// for readability purposes
//header("Content-Type: text/plain");
// create root element
$xmlProducts = $dom->createElement("products");
$dom->appendChild($xmlProducts);
$pages = array(
'http://myshop.com/small_houses.html',
'http://myshop.com/medium_houses.html',
'http://myshop.com/large_houses.html'
);
foreach($pages as $page)
{
$product = array();
$source = file_get_html($page);
foreach($source->find('img') as $src)
{
if (strpos($src->src,"http://myshop.com") === false)
{
$product['image'] = "http://myshop.com/$src->src";
}
}
foreach($source->find('p[class*=imAlign_left]') as $description)
{
$product['description'] = $description->innertext;
}
foreach($source->find('span[class*=fc3]') as $title)
{
$product['title'] = $title->innertext;
}
// For debugging purposes
echo "Current Page: " . $page . "\n";
print_r($product);
echo "\n\n\n"; // Clear separator between pages
}
?>
