This is my code :
<form method="POST">
<input name="link">
<button type="submit">></button>
</form>
<title>GET IMAGE URL</title>
<?php
if (!isset($_POST['link'])) exit();
$link = $_POST['link'];
$parse = explode('.html', $link);
echo '<div id="pin" style="float:center"><textarea class="text" cols="110" rows="50">';
for ($i = 1; $i <=5; $i++)
{
if ($i > 1)
$link = "$parse[0]-$i.html";
$get = file_get_contents($link);
if (preg_match_all('/src="(.*?)"/', $get, $matches))
{
foreach ($matches[1] as $content)
echo $content."\r\n";
}
}
echo '</textarea>';
The page I'm trying to get the img src has 10 to 15 page,so I want my code to get all the img url until the end of the page. How can I do that without the loop?
If I use:
for ($i = 1; $i <=5; $i++)
this will get only 5 page img urls, but I want to make it get until the end. Then I don't need to edit the loop everytime I submit another URL with a different number of pages.
From this
this will get only 5 page img urls, but I want to make it get until the end. Then I don't need to edit the loop everytime I submit another URL with a different number of pages.
I could understand that your problem is with dynamic number of pages.Your urls have a next page link at the bottom
下一页
Identify it and get your images in while loop
<?php
// Link given in form
$link = "http://www.xiumm.org/photos/XiuRen-17305.html";
$parse = explode('.html', $link);
$i=1;
// Intialize a boolean
$nextPageFound = true;
while($nextPageFound) {
// Construct URL Every time when nextPageFound
if ($i == 1) {
$url = "$parse[0].html";
echo "First Page<br><br>";
} else {
$url = "$parse[0]-$i.html";
}
// Getting URL Contents
$get = file_get_contents($url);
if (preg_match_all('/src="(.*?)"/', $get, $matches))
{
// echoing contents
foreach ($matches[1] as $content)
echo $content."<br>";
}
// check nextPageBtn if available
if (strpos($get, '"nextPageBtn"') !== false) {
$nextPageFound = true;
// increment +1
$i++;
echo "<br>Page $i<br><br>";
} else {
$nextPageFound = false;
echo "THE END";
}
}
?>
You should use an HTML/XML parser, like DOMDocument, in combination with DOMXPath (xpath is query language to query (X)HTML data structures):
// create DOMDocument
$doc = new DOMDocument();
// load remote HTML file
$doc->loadHTMLFile( $link );
// create DOMXPath
$xpath = new DOMXPath( $doc );
// fetch all IMG elements that have a src attribute
$nodes = $xpath->query( '//img[#src]' );
// loop trough found IMG elements and echo their src attribute values
for( $i = 0; $i < $nodes->length; $i++ ) {
echo $nodes->item( $i )->getAttribute( 'src' ) . PHP_EOL;
}
Regarding the xpath query //div[contains(#class,'pic_box')]//#src, mentioned by #Enuma, in the comments:
The resulting DOMNodeList of that query will not contain DOMElement objects, but DOMAttr objects, because the query directly asks for attributes, not elements. Since DOMAttr represents an attribute and not an element, the method getAttribute() does not exist. To get the value of the attribute you have to use the property DOMAttr->value.
So, we have to slightly alter the relevant part of our example code from above to:
// loop trough found src attributes and echo their value
for( $i = 0; $i < $nodes->length; $i++ ) {
echo $nodes->item( $i )->value . PHP_EOL;
}
Putting it all together, our example code then becomes:
// create DOMDocument
$doc = new DOMDocument();
// load remote HTML file
$doc->loadHTMLFile( $link );
// create DOMXPath
$xpath = new DOMXPath( $doc );
// fetch all src attributes that are descendants of div.pic_box
$nodes = $xpath->query( '//div[contains(#class,'pic_box')]//#src' );
// loop trough found src attributes and echo their value
for( $i = 0; $i < $nodes->length; $i++ ) {
echo $nodes->item( $i )->value . PHP_EOL;
}
PS.: In order for DOMDocument to be able to load remote files, I believe some php config setting may be required to be set, which I don't know off the top of my head, right now. But since it already appeared to be working for #Enuma, it's not actually relevant now. Perhaps I'll look them up later.
Related
I need to process a DOM and remove all hyperlinks to a particular site while retaining the underlying text. Thus, something ling text changes into text. Taking cue from this thread, I wrote this:
$as = $dom->getElementsByTagName('a');
for ($i = 0; $i < $as->length; $i++) {
$node = $as->item($i);
$link_href = $node->getAttribute('href');
if (strpos($link_href,'offendinglink.com') !== false) {
$cl = $node->getAttribute('class');
$text = new DomText($node->nodeValue);
$node->parentNode->insertBefore($text, $node);
$node->parentNode->removeChild($node);
$i--;
}
}
This works fine except that I also need to retain the class attributed to the offending <a> tag and maybe turn it into a <div> or a <span>. Thus, I need this:
text
to turn into this:
<div class="nice">text</div>
How do I access the new element after it's been added (like in my code snippet)?
quote "How do I access the new element after it's been added (like in my code snippet)?" - your element is in $text i think.. anyway, i think this should work, if you need to save the class and the textContent, but nothing else
foreach($dom->getElementsByTagName('a') as $url){
if(parse_url($url->getAttribute("href"),PHP_URL_HOST)!=='badsite.com') {
continue;
}
$ele = $dom->createElement("div");
$ele->textContent = $url->textContent;
$ele->setAttribute("class",$url->getAttribute("class"));
$url->parentNode->insertBefore($ele,$url);
$url->parentNode->removeChild($url);
}
Tested solution:
<?php
$str = "<b>Dummy</b> <a href='http://google.com' target='_blank' class='nice' id='nicer'>Google.com</a> <a href='http://yandex.ru' target='_blank' class='nice' id='nicer'>Yandex.ru</a>";
$doc = new DOMDocument();
$doc->loadHTML($str);
$anchors = $doc->getElementsByTagName('a');
$l = $anchors->length;
for ($i = 0; $i < $l; $i++) {
$anchor = $anchors->item(0);
$link = $doc->createElement('div', $anchor->nodeValue);
$link->setAttribute('class', $anchor->getAttribute('class'));
$anchor->parentNode->replaceChild($link, $anchor);
}
echo preg_replace(['/^\<\!DOCTYPE.*?<html><body>/si', '!</body></html>$!si'], '', $doc->saveHTML());
Or see runnable.
I need to read an XML file, i watched some tutorials and tried different sollutions, but for some reason I can't figure out why it doenst work.
The XML file that I want to read: http://www.voetbalzone.nl/rss/rss.xml
This is the code that im using:
$xml= "http://www.voetbalzone.nl/rss/rss.xml"
for ($i = 0; $i < 10; $i++)
{
$title = $xml->rss->channel->item[$i]->title;
}
The error I get: Premature end of data in tag
It works for me like this:
<?php
$xml = simplexml_load_file("http://www.voetbalzone.nl/rss/rss.xml");
for ($i = 0; $i < 10; $i++)
{
$title = $xml->channel->item[$i]->title;
}
?>
Note that you are overwriting the variable $title each time, so that you will have the title of the 10. element in it after the loop finished [I assume that is not what you want?]
To get all 'item'-Elements inside 'channel' as an Array to iterate through you can use xpath like this:
<?php
$xml = simplexml_load_file("http://www.voetbalzone.nl/rss/rss.xml");
$item_array = $xml->xpath("//rss/channel/item");
foreach($item_array as $item) {
echo $item->title . "\n";
}
?>
I would suggest to read about php's SimpleXML here: http://php.net/manual/en/book.simplexml.php
I wrote the following:
<?php
$str = 'http://stackoverflow.com';
$DOM = new DOMDocument;
$DOM->loadHTML($str);
//get all H1
$items = $DOM->getElementsByTagName('h1');
//display all H1 text
for ($i = 0; $i < $items->length; $i++)
{
echo $items->item($i)->nodeValue . "<br/>";
}
?>
And just wanted to simply retrieve all the H1 elements of stackoverflow, but can't get it working. Whenever I try filling in the variable $str manually (for example: <h1>hello</h1><div><h1>hello2</h1></div>) it is working. But whenever I try to parse content from another webpage it is not doing anything at all...
Help would be appericiated!
$str = 'http://stackoverflow.com';
$DOM = new DOMDocument;
$DOM->loadHTMLFile($str); // get html
echo $DOM->saveHTML(); echo html
$DOM->saveHTMLFile(FILE_NAME); save html to file
I've following strings with me in which the file name is present in between anchor tags:
$test1 = test<div class="comment_attach_file">
<a class="comment_attach_file_link" href="http://52.1.47.143/feed/download/year_2015/month_04/file_3b701923a804ed6f28c61c4cdc0ebcb2.txt" >phase2 screen.txt</a><br>
<a class="comment_attach_file_link_dwl" href="http://52.1.47.143/feed/download/year_2015/month_04/file_3b701923a804ed6f28c61c4cdc0ebcb2.txt" >Download</a>
</div>;
$test2 = This is a holiday list.<div class="comment_attach_file">
<a class="comment_attach_file_link" href="http://52.1.47.143/feed/download/year_2015/month_04/file_2c96b997f03eefab317811e368731bb6.pdf" >Holiday List-2013.pdf</a><br>
<a class="comment_attach_file_link_dwl" href="http://52.1.47.143/feed/download/year_2015/month_04/file_2c96b997f03eefab317811e368731bb6.pdf" >Download</a>
</div>;
$test3 = <div class="comment_attach_file">
<a class="comment_attach_file_link" href="http://52.1.47.143/feed/download/year_2015/month_04/file_8479c0b60867fdce35ae94a668dfbba9.docx" >sample2.docx</a><br>
</div>;
From the first string I want text(i.e. file name) "phase2 screen.txt"
From the second string I want text(i.e. file name) "Holiday List-2013.pdf"
From the third string I want text(i.e. file name) "sample2.docx"
How should I do in PHP using $dom = new DOMDocument;?
Please someone help me.
Thanks.
You can use DOMxpath to target that link that contains the text you want, use its class to point to it:
$dom = new DOMDocument;
for($i = 1; $i <= 3; $i++) {
#$dom->loadHTML(${"test{$i}"});
$xpath = new DOMXpath($dom);
$file_name = $xpath->evaluate('string(//a[#class="comment_attach_file_link"])');
echo $file_name , '<br/>';
}
Or if you don't want to use xpath, you can get the anchor elements and check for its class, if its that one, get the ->nodeValue:
$dom = new DOMDocument;
for($i = 1; $i <= 3; $i++) {
#$dom->loadHTML(${"test{$i}"});
foreach($dom->getElementsByTagName('a') as $anchor) {
if($anchor->getAttribute('class') === 'comment_attach_file_link') {
echo $anchor->nodeValue, '<br/>';
break;
}
}
}
Sample Output
If you want to get the value after the hash mark or anchor as shown in a user's browser: This isn't possible with "standard" HTTP as this value is never sent to the server (hence it won't be available in $_SERVER["REQUEST_URI"] or similar predefined variables). You would need some sort of JavaScript magic on the client side, e.g. to include this value as a POST parameter.
In dom you can use some function like this to get the links and change it according to your need
function findAnchors($html)
{
$links = array();
$doc = new DOMDocument();
$doc->loadHTML($html);
$navbars = $doc->getElementsByTagName('div');
foreach ($navbars as $navbar) {
$id = $navbar->getAttribute('id');
if ($id === "anchors") {
$anchors = $navbar->getElementsByTagName('a');
foreach ($anchors as $a) {
$links[] = $doc->saveHTML($a);
}
}
}
return $links;
}
I am trying to fetch the name,address and location from crawling of a website . Its a single page and dont want any other thing other than this. I am using the below code.
<?php
include 'simple_html_dom.php';
$html = "http://www.phunwa.com/phone/0191/2604233";
$dom = new DomDocument();
$dom->loadHtml($html);
$xpath = new DomXpath($dom);
$div = $xpath->query('//*[#class="address-tags"]')->item(0);
for($i=0; $i < $div->length; $i++ )
{
print "nodename=".$div->item( $i )->nodeName;
print "\t";
print "nodevalue : ".$div->item( $i )->nodeValue;
print "\r\n";
echo $link->getElementsByTagName("<p>");
}
?>
The website html source code is
<div class="address-tags">
<p><strong>Name:</strong> RAJ GOPAL SINGH</p>
<p><strong>Address:</strong> R/O BARNAI NETARKOTHIAN, P.O.MUTHI TEH.& DISTT.JAMMU,X, 181206</p>
<p><strong>Location:</strong> JAMMU, Jammu & Kashmir, India</p>
<p><strong>Other Numbers:</strong> 01912604233 | +911912604233 | +91-191-2604233</p>
Can somone please help me get the three attributes as output. Nothing is echop on the page as of now.
Thanks alot .
you need $dom->load($html); instead of $dom->loadHtml($html);. After doing this you wil; find your html is not well formed, so $xpath stay empty.
Maybe try something like:
$html = file_get_contents('http://www.phunwa.com/phone/0191/2604233');
$name = preg_replace('/(.*)(<p><strong>Name:<\/strong> )([^<]+)(<\/p>)(.*)/mis','$3',$html);
$address = preg_replace('/(.*)(<p><strong>Address:<\/strong> )([^<]+)(<\/p>)(.*)/mis','$3',$html);
$location = preg_replace('/(.*)(<p><strong>Location:<\/strong> )([^<]+)(<\/p>)(.*)/mis','$3',$html);
$othernumbers = preg_replace('/(.*)(<p><strong>Other Numbers:<\/strong> )(.*)/mis','$3',$html);
list($othernumbers,$trash)= preg_split('/<\/p>/mis',$othernumbers,0);
echo 'name: '.$name.'<br>address: '.$address.'<br>location: '.$location.'<br>other numbers: '.$othernumbers;
exit;
You should use the following for your XPath query:
//*[#class='address-tags']/p
so you're retrieving the actual paragraph nodes that are children of the 'address-tags' parent. Then you can use a loop on them:
$nodes = $xpath->query('//*[#class="address-tags"]/p');
for ($i = 0; $i < $nodes->length; $i++) {
echo $nodes->item($i)->nodeValue;
}
// or just
foreach($nodes as $node) {
echo $node->nodeValue;
}
Right now your code is properly fetching the first div that's found, but then you continue treating that div as if it was a DOMNodeList returned from an xpath query, which is incorrect. ->item() returns a DOMNode object, which does NOT have an ->item() method.