Parsing with SimpleHTMLParser

Folks,
I am using Simple HTML DOM Parser (simple_html_dom.php).
I cannot get it to parse the HTML. When I var_dump the HTML document, it only shows the DOM object structure and none of the HTML content.
$produrl = 'http://wap.ebay.com/Pages/ViewItem.aspx?aid=160586179890&sv=160586179890/';
var_dump(file_get_html($produrl));
$html = file_get_html($produrl);
var_dump($html->find('div[id=Teaser_Item] img[src]', 0));
What I actually want to extract is the img src, which is:
http://wap.ebay.com/Pages/RbHttpHandler.ashx?width=51&height=240&fsize=999000&format=jpg&url=http%3A%2F%2Fi.ebayimg.com%2F00%2F%24%28KGrHqN%2C!jEE2n%28iTLozBNwBPG0bUg~~0_1.JPG%3Fset_id%3D8800005007
Can someone help me debug this, please?
Cheers
Natasha Thomas

<?php
require_once('simple_html_dom.php');
$produrl = 'http://wap.ebay.com/Pages/ViewItem.aspx?aid=160586179890&sv=160586179890/';
// Grab the document
$html = file_get_html($produrl);
// Find the first img tag inside the Teaser_Item div
$a = $html->find('div[id=Teaser_Item] img', 0);
// Display the src (find() returns null when nothing matches)
if ($a) {
    echo $a->src;
}
?>
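
If Simple HTML DOM is not available, the same lookup can be sketched with PHP's built-in DOMDocument and DOMXPath. This is only a sketch under assumptions: the $sample markup and the first_img_src() helper are made up for illustration and are not the real eBay page or part of any library.

```php
<?php
// Hypothetical markup standing in for the fetched page; the real page may differ.
$sample = '<html><body><div id="Teaser_Item">'
    . '<img src="/Pages/RbHttpHandler.ashx?width=51" alt="">'
    . '</div></body></html>';

// Hypothetical helper: return the src of the first img inside the div
// with the given id, or null when no such img exists.
function first_img_src(string $html, string $divId): ?string
{
    $doc = new DOMDocument();
    // Collect libxml warnings internally instead of printing them,
    // since real-world HTML is rarely strictly valid.
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $nodes = $xpath->query('//div[@id="' . $divId . '"]//img[@src]');
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }
    return $nodes->item(0)->getAttribute('src');
}

echo first_img_src($sample, 'Teaser_Item'), "\n";
```

Returning null for "not found" keeps the caller from calling a method on a missing node, which is the same thing the if ($a) check guards against with Simple HTML DOM.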

Related

Change src attribute of img, using the Simple HTML DOM PHP library

I'm totally new to PHP, and I'm having a hard time changing the src attribute of img tags.
I have a website that pulls part of a page using Simple HTML DOM. Here is the code:
<?php
include_once('simple_html_dom.php');
$html = file_get_html('http://www.tabuademares.com/br/bahia/morro-de-sao-paulo');
foreach($html ->find('img') as $item) {
$item->outertext = '';
}
$html->save();
$elem = $html->find('table[id=tabla_mareas]', 0);
echo $elem;
?>
This code correctly returns the part of the page I want. But the img tags come with the src of the original page: /assets/svg/icon_name.svg
What I want to do is change the original src so that it looks like this: http://www.mywebsite.com/wp-content/themes/mytheme/assets/svg/icon_name.svg
That is, I want to put the URL of my site in front of /assets/svg/icon_name.svg
I already tried some tutorials, but I could not make any of them work.
Could someone please help a noob in PHP?

I got it working. So if someone has the same question, here is how I managed it.
<?php
// Note: you must download simple_html_dom.php from
// this link: https://sourceforge.net/projects/simplehtmldom/files/
// and then include it
include_once('simple_html_dom.php');
// Target the website
$html = file_get_html('http://the_target_website.com');
// Loop through all images in the HTML DOM
foreach ($html->find('img') as $item) {
    // Get an attribute (for an attribute without a value, e.g. checked or selected, this returns true or false)
    $value = $item->src;
    // Set the attribute; ltrim() avoids a double slash when the src starts with "/"
    $item->src = 'http://yourwebsite.com/' . ltrim($value, '/');
}
// Serialize the modified DOM
$html->save();
// Find the div whose content you want
$elem = $html->find('div[id=container]', 0);
// Output it using echo
echo $elem;
?>
That's it!
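
One thing to watch when prefixing: some src values may already be absolute, and the base URL and the path can both carry a slash. A minimal sketch of a guard for that, assuming a made-up helper name prefix_src() (it is not part of Simple HTML DOM):

```php
<?php
// Hypothetical helper: put a site URL in front of a relative src
// without doubling slashes and without touching absolute URLs.
function prefix_src(string $base, string $src): string
{
    // Leave values with a scheme (http:, https:, data:, ...) and
    // protocol-relative URLs (//host/path) unchanged.
    if (preg_match('#^(?:[a-z][a-z0-9+.-]*:|//)#i', $src)) {
        return $src;
    }
    // Join with exactly one slash between base and path.
    return rtrim($base, '/') . '/' . ltrim($src, '/');
}

echo prefix_src(
    'http://www.mywebsite.com/wp-content/themes/mytheme',
    '/assets/svg/icon_name.svg'
), "\n";
```

Inside the loop above this would be used as $item->src = prefix_src('http://yourwebsite.com', $item->src);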

Did you read the documentation on reading and modifying attributes?
According to it:
// Get an attribute (for an attribute without a value, e.g. checked or selected, this returns true or false)
$value = $e->href;
// Set an attribute
$e->href = 'yoursitename'.$value;

PHP extract specific DIV from target URL

I'm using Simple HTML DOM to try to extract a div and all of its contents from a target URL. Here is my code:
<?php
require 'simple_html_dom.php';
$html = file_get_html('http://mozilla.org');
foreach($html->find('.accordion') as $element)
echo $element . '<br>';
?>
The problem I have is that the above code only extracts the plain text of the div. There are also images in the div that I need to extract. If I use the following code, all images are extracted, but so is everything else on the page.
<?php
require 'simple_html_dom.php';
$html = file_get_html('http://mozilla.org');
echo $html;
?>
So my question is, how can I use the first bit of code to extract the contents + images from .accordion?
Thanks

You could always try:
$imgs = array();
foreach($html->find('.accordion',0)->find('img') as $img){
$imgs[] = $img->src;
}
print_r($imgs);
This should populate the $imgs variable with all of the image links from the .accordion div.
:)

SimpleHtmlDom extract content inside a div not from its child

I want to extract the content of a div, but not the content of its children. I'm using the Simple HTML DOM parser and the following code:
//html code
<div id="frame">
Needed this content
Not needed
</div>
//php code
$elem = file_get_html($url);
$content = $elem->find('div#frame')->plaintext;
echo $content;
but this code outputs:
Needed this contentNot needed
I want the result to be:
Needed this content
How do I change the code to get that output? Thanks in advance.

The only way I can think of is to delete all of the div's children, then print the remaining content. Here's how:
// includes Simple HTML DOM Parser
include "simple_html_dom.php";
$text = '<div id="frame">
<span><b>Not needed</b></span>
Needed this content
Not needed
</div>';
//Create a DOM object
$html = new simple_html_dom();
// Load HTML from a string
$html->load($text);
$content = $html->find('div#frame',0);
// Before
echo $content->innertext;
// Delete all unwanted children
foreach( $content->children() as $i => $unwantedTags ) {
echo "<br/>$i => ".$unwantedTags->tag;
$unwantedTags->outertext = '';
}
// After
echo "<br/>".$content->innertext;
// Clear dom object
$html->clear();
unset($html);
See this working DEMO
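
Another option, if switching parsers is acceptable, is to read only the text nodes that are direct children of the div. A rough sketch with PHP's built-in DOMDocument and DOMXPath; the own_text() helper and the $sample string are made up for illustration:

```php
<?php
// Sample markup: one direct text node plus a child element to skip.
$sample = '<div id="frame">Needed this content<span><b>Not needed</b></span></div>';

// Hypothetical helper: return the text that belongs directly to the div
// with the given id, ignoring text inside child elements.
function own_text(string $html, string $id): string
{
    $doc = new DOMDocument();
    // Keep libxml warnings about non-strict HTML out of the output.
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $parts = [];
    // "/text()" selects only text nodes that are direct children of the div.
    foreach ($xpath->query('//div[@id="' . $id . '"]/text()') as $textNode) {
        $parts[] = trim($textNode->nodeValue);
    }
    // Drop whitespace-only pieces and join the rest with single spaces.
    return trim(implode(' ', array_filter($parts, 'strlen')));
}

echo own_text($sample, 'frame'), "\n";
```

This leaves the document itself unmodified, which may matter if the children are still needed later.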

How to remove all 'alt' attribute from all the <img> tags from HTML file in PHP? [duplicate]

Possible Duplicate:
Remove style attribute from HTML tags
The current image tag looks like
<img src="images/sample.jpg" alt="xyz"/>
Now I want to remove all such alt attributes from all the img tags in an HTML file; the PHP code itself should remove every occurrence of the alt attribute.
The output should be only
<img src="images/sample.jpg" />
How can this be done with PHP?
Thanks in advance

Use DOMDocument for HTML parsing/manipulation. The example below reads an HTML file, removes the alt attribute from all img tags, then prints out the HTML.
$dom = new DOMDocument();
$dom->loadHTMLFile('file.html');
foreach($dom->getElementsByTagName('img') as $image)
{
$image->removeAttribute('alt');
}
echo $dom->saveHTML(); // print the modified HTML

Read your file. You can use file_get_contents() to read it:
$fileContent = file_get_contents('filename.html');
// [^"]* stops at the closing quote, so the attributes after alt are left intact
$fileContent = preg_replace('/\s+alt="[^"]*"/i', '', $fileContent);
file_put_contents('filename.html', $fileContent);
Make sure your file is writable.

First, you need to get hold of the document source you want to modify. It's not clear whether you want to edit some HTML files on your server, edit the HTML output generated by a request, or something else.
In this answer I'm going to skip over how you get the HTML. It could be a file_get_contents('filename.html'); or some magic with output buffering.
Since you don't want to parse HTML with regular expressions, you need to use a parser.
Since the alt attribute is required for the HTML to be valid, if you want to "remove" it you have to set it to an empty string.
This should work:
$doc = new DOMDocument();
$doc->loadHTML($myhtml);
$images = $doc->getElementsByTagName('img');
foreach ($images as $img) {
    $img->setAttribute('alt', '');
}
$myhtml = $doc->saveHTML();

For valid XHTML, an img should have the alt attribute.
Something like this would work:
$xml = new SimpleXMLElement($doc); // $doc is the HTML document as a string; it must be well-formed XML
foreach ($xml->xpath('//img') as $img_tag) {
if (isset($img_tag->attributes()->alt)) {
unset($img_tag->attributes()->alt);
}
}
$new_doc = $xml->asXML();

Get SRC from div contents

I have code that gets a div's contents:
include_once('simple_html_dom.php');
$html = file_get_html("link");
$ret = $html->find('div');
echo $ret[0];
preg_match_all('/(src)=("[^"]*")/i',$ret[0], $link);
echo $link[0];
It returns the full div contents, including all the CSS. However, I just want it to echo the value after src=, that is, only the image link and nothing else. I've tried to use preg_match with no success.
Any ideas?

Your HTML parser will help you there. A div has no src of its own, so find the img inside the div and read its src property:
echo $ret[0]->find('img', 0)->src;

You don't need a regexp for that, since you already use a DOM parser. Loop over the img elements inside the div:
foreach ($ret[0]->find('img') as $element)
    echo $element->src, '<br/>';
