Detect node from html string and extract image inside in PHP - php

I'm locked on something that make me crazy, if you can help me, would be cool !
I have a string containing a valid HTML code. In this PHP code I would like to detect a specific image via the class of his parent, and extract the url of the image + update it with another url.
Here is an example :
$html = '....<div class="header">...<img src="theimage.png" />...</div>...';
I would like to parse the $html string, extract the url "theimage.png" and replace it by "theimage2.png" (after some internal working)
I tried to use a REGEX, but I'm not sure it's the best solution, because I need to be sure that only the image in .header will be returned, and I need to execute some functions to get the name of the future link.
Maybe the solution is to parse the HTML with DOM Node, but it didn't work too.
Can you help me please ? Thanks !!!

You can achieve this using SimplePHPDom
PHP
$html = '<body><div class="header"><p>some html</p><img src="theimage.png" /></div></body>';
$html_dom = str_get_html($html);
foreach($html_dom->find('.header img') as $element) {
echo $element->src . '<br>'; // output src of img inside .header div
}

Related

how to parse this to get title in php

I want to parse HTML code present in $raw to get the title and save it mysql. I have tried to do it with php dom and Ganon HTML parser but when I run it, shows me an error 500. it would be great if you solve this problem with Ganon.
function store($raw)
{
include_once('ganon.php');
$html = file_get_dom($raw);
echo $html('title', 0)->parent->getPlainText();
}
store ('<html> all html code </html>');
There are a few problems with your code.
Firstly you use file_get_dom() which is expecting to be passed in a file name, so usestr_get_dom() instead.
Secondly, the example HTML doesn't contain a title, so this won't work.
Then when you find the title, you go to the parent element and output from there. You just need to use that nodes content.
include_once('ganon.php');
function store($raw)
{
$html = str_get_dom($raw);
echo $html('title', 0)->getPlainText();
}
store ('<html><title>Title</title> all html code </html>');
outputs...
Title of page

how do I find a tag with simple_html_DOM

Im trying to use simple_html_dom with php to parse a webpage with this tag:
<div class=" row result" id="p_a8a968e2788dad48" data-jk="a8a968e2788dad48" itemscope itemtype="http://schema.org/JobPosting" data-tn-component="organicJob">
where data-tn-component="organicJob" is the identifier I want to parse based on, I cant seem to specify the text in a way that simple_html_dom recognizes.
Ive tried a few things along this line:
<?PHP
include 'simple_html_dom.php';
$f="http://www.indeed.com/jobs?q=Electrician&l=maine";
$html->load_file($f);
foreach($html->find('div[data-tn-component="organicJob"]') as $div)
{
echo $div->innertext ;
}
?>
but the parser doesn't find any of the results, even though i know they are there. Probably I'm not making specifying the thing I find correctly.
I'm looking at the API, but I still don't understand how to format the find string.
what am I doing wrong?
Your selector is correct but i see other problems in your code
1) you are missing .php in your include include 'simple_html_dom'; it should be
include '/absolute_path/simple_html_dom.php';
2) to load content through url use file_get_html function instead $html->load_file($f); which is wrong as php don't know that $html is simple_html_dom object
$html = file_get_html('http://www.google.com/');
// then only call
$html->find( ...
3) in your provided link: http://www.indeed.com/jobs?q=Electrician+Helper&l=maine there is no present element with data-tn-component attribute
so final code should be
include '/absolute_path/simple_html_dom.php';
$html = file_get_html('http://www.indeed.com/jobs?q=Electrician&l=maine');
$html->load_file($f);
foreach($html->find('div[data-tn-component="organicJob"]') as $div)
{
echo $div->innertext ;
}

How to write this crawler in php?

I need to create a php script.
The idea is very simple:
When I send a link of a blogpost to this php script, then the webpage is crawled and the first image with the title page are saved on my server.
What PHP function I have to use for this crawler ?
Use PHP Simple HTML DOM Parser
// Create DOM from URL
$html = file_get_html('http://www.example.com/');
// Find all images
$images = array();
foreach($html->find('img') as $element) {
$images[] = $element->src;
}
Now $images array have images links of given webpage. Now you can store your desired image in database.
HTML Parser: HTMLSQL
Features: you can get external html file, http or ftp link and parse content.
Well, you'll have to use quite a few functions :)
But I'm going to assume that you're asking specifically about finding the image, and say that you should use a DOM parser like Simple HTML DOM Parser, then curl to grab the src of the first img element.
I would user file_get_contents() and a regular expression to extract the first image tags src attribute.
CURL or a HTML Parser seem overkill in this case, but you are welcome to check it out.

Problem getting contents of class using php simple dom parser

I can't work out how to get the contents of the second span 'cantfindme' using php simple html dom parser(http://simplehtmldom.sourceforge.net/manual.htm). Using the code below I can get the contents of the first span 'dontneedme'. I cant seem to get anything from second span at all.
$html = str_get_html('<html><body><table><tr><td class="bar">bar</td><td><div class="foo"><span class="dontneedme">Hello</span></div></td></tr><tr><td class="bar">bar</td><td><div class="foo"><span class="cantfindme">Goodbye</span></div></td></tr></body></html>');
foreach($html->find('.foo', 0) as $article)
{
echo "++</br>";
echo $article->plaintext;
echo "--</br>";
}
Can anyone see where I'm going wrong?
Try using this selector.
$html->find('div.foo .cantfindme');
Check out the documentation for more examples.

DOM Manipulation with PHP

I would like to make a simple but non trivial manipulation of DOM Elements with PHP but I am lost.
Assume a page like Wikipedia where you have paragraphs and titles (<p>, <h2>). They are siblings. I would like to take both elements, in sequential order.
I have tried GetElementbyName but then you have no possibility to organize information.
I have tried DOMXPath->query() but I found it really confusing.
Just parsing something like:
<html>
<head></head>
<body>
<h2>Title1</h2>
<p>Paragraph1</p>
<p>Paragraph2</p>
<h2>Title2</h2>
<p>Paragraph3</p>
</body>
</html>
into:
Title1
Paragraph1
Paragraph2
Title2
Paragraph3
With a few bits of HTML code I do not need between all.
Thank you. I hope question does not look like homework.
I think DOMXPath->query() is the right approach. This XPath expression will return all nodes that are either a <h2> or a <p> on the same level (since you said they were siblings).
/html/body/*[name() = 'p' or name() = 'h2']
The nodes will be returned as a node list in the right order (document order). You can then construct a foreach loop over the result.
I have uased a few times simple html dom by S.C.Chen.
Perfect class for access dom elements.
Example:
// Create DOM from URL or file
$html = file_get_html('http://www.google.com/');
// Find all images
foreach($html->find('img') as $element)
echo $element->src . '<br>';
// Find all links
foreach($html->find('a') as $element)
echo $element->href . '<br>';
Check it out here. simplehtmldom
May help with future projects
Try having a look at this library and corresponding project:
Simple HTML DOM
This allows you to open up an online webpage or a html page from filesystem and access its items via class names, tag names and IDs. If you are familiar with jQuery and its syntax you need no time in getting used to this library.

Categories