PHP - Multi Curl - scraping for data/content

I started building a single-cURL scraper with curl, DOM and XPath, and it worked great.
I am now building a scraper based on curl_multi to take data from multiple sites in one flow. The script echoes the single phrase I put in, but it does not pick up the variables.
do {
    $n = curl_multi_exec($mh, $active);
} while ($active);

foreach ($urls as $i => $url) {
    $res[$i] = curl_multi_getcontent($conn[$i]);
    echo '<br />success';
}
So this does echo the success text as many times as there are URLs, but that is not really what I want. I want to break up the HTML the way I could with the single cURL session.
What I did in the single cURL session:
// parse the html into a DOMDocument
$dom = new DOMDocument();
@$dom->loadHTML($res);

// grab all the links on the page
$xpath = new DOMXPath($dom);
$product_img = $xpath->query("//div[@id='MAIN']//a");

for ($i = 0; $i < $product_img->length; $i++) {
    $href = $product_img->item($i);
    $url  = $href->getAttribute('href');
    echo "<br />Link : $url";
}
This DOM parsing / XPath works for the single-session curl, but not when I run the multi-curl version.
With multi-curl I can call curl_multi_getcontent() per URL in the session, but that is not what I want:
I would like to get the same content I picked up with DOM / XPath in the single session.
What can I do?
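Roughly, what I am after is to run the same DOM / XPath step over each result that comes back from curl_multi_getcontent(). A sketch of that (assuming $conn[$i] holds the individual handles that were added to the multi handle, and the same //div[@id='MAIN']//a structure as above):
foreach ($urls as $i => $url) {
    $res[$i] = curl_multi_getcontent($conn[$i]);

    // parse this result exactly like the single-session version
    $dom = new DOMDocument();
    @$dom->loadHTML($res[$i]);

    $xpath = new DOMXPath($dom);
    $product_img = $xpath->query("//div[@id='MAIN']//a");

    for ($j = 0; $j < $product_img->length; $j++) {
        echo "<br />Link : " . $product_img->item($j)->getAttribute('href');
    }
}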
EDIT
It seems I am having problems with getAttribute(). It is the link for an image that I am having trouble grabbing. The link shows up when scraping, but then it throws an error:
Fatal error: Call to a member function getAttribute() on a non-object in
The queries:
// grab all the product images on the page
$xpath = new DOMXPath($dom);
$product_img  = $xpath->query("//img[@class='product']");
$product_name = $xpath->query("//img[@class='product']");
This is working:
for ($i = 0; $i < $product_name->length; $i++) {
    $prod_name = $product_name->item($i);
    $name = $prod_name->getAttribute('alt');
    echo "<br />Link stored: $name";
}
This is not working:
for ($i = 0; $i < $product_img->length; $i++) {
    $href = $product_img->item($i);
    $pic_link = $href->getAttribute('src');
    echo "<br />Link stored: $pic_link";
}
Any idea what I am doing wrong?
Thanks in advance.

For some odd reason, it is only that one src that won't work right.
This question can be considered "solved".
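For anyone who runs into the same fatal error, a defensive version of the failing loop would look roughly like this (just a sketch; the instanceof guard is only a guess at why item($i) was not an element):
for ($i = 0; $i < $product_img->length; $i++) {
    $node = $product_img->item($i);
    if (!($node instanceof DOMElement)) {
        continue; // skip anything that is not an element node
    }
    echo "<br />Link stored: " . $node->getAttribute('src');
}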


Duplicate PHP Code Block

I get images from a specific URL. With this script I am able to display them on my website without any problems. The website I get the images from has more than one page (about 200) that I need the images from.
I don't want to copy the block of PHP code manually and fill in the page number every time from 1 to 200. Is it possible to do it in one block?
Like: $html = file_get_html('http://example.com/page/1...to...200');
<?php
require_once('simple_html_dom.php');

$html = file_get_html('http://example.com/page/1');
foreach ($html->find('img') as $element) {
    echo '<img src="'.$element->src.'"/>';
}

$html = file_get_html('http://example.com/page/2');
foreach ($html->find('img') as $element) {
    echo '<img src="'.$element->src.'"/>';
}

$html = file_get_html('http://example.com/page/3');
foreach ($html->find('img') as $element) {
    echo '<img src="'.$element->src.'"/>';
}
?>
You can use a for loop like so:
require_once('simple_html_dom.php');

for ($i = 1; $i <= 200; $i++) {
    $html = file_get_html('http://example.com/page/'.$i);
    foreach ($html->find('img') as $element) {
        echo '<img src="'.$element->src.'"/>';
    }
}
So now you have one block of code that will execute 200 times.
It changes the page number by appending the value of $i to the URL, and every time the loop completes a round, the value of $i becomes $i + 1.
If you wish to start on a higher page number, just change $i = 1 to $i = 2 (or any other number), and you can change the 200 to whatever the maximum is in your case.
There are many good solutions; one of them is to make a loop from 1 to 200:
for ($i = 1; $i <= 200; $i++) {
    $html = file_get_html('http://example.com/page/'.$i);
    foreach ($html->find('img') as $element) {
        echo '<img src="'.$element->src.'"/>';
    }
}
<?php
require_once('simple_html_dom.php'); // needed for file_get_html()

function SendHtml($httpline) {
    $html = file_get_html($httpline);
    foreach ($html->find('img') as $element) {
        echo '<img src="'.$element->src.'"/>';
    }
}

for ($x = 1; $x <= 200; $x++) {
    $httpline = "http://example.com/page/";
    $httpline .= $x;
    SendHtml($httpline);
}
?>
Just loop: create a sending function and loop over the calls.
I recommend reading through the PHP documentation, for example at https://www.w3schools.com/php/default.asp
First, store them in a database. You can (and probably should) download the images to your own server, or store just the URI of each image. You can use code like FMashiro's for that, or something similar, but opening 200 pages and parsing their HTML takes forever, on every page view.
Then you simply use the LIMIT functionality in your queries to create the pages yourself.
I recommend this method anyway, as it will be MUCH faster than parsing HTML every time someone opens the page, and you get sorting options and the other pros a database gives you.
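For example, once the image URLs are in a table, the paging itself becomes a simple LIMIT/OFFSET query. A rough sketch, assuming a table named images with a src column and an existing PDO connection in $pdo:
// Page through stored image URLs instead of re-parsing 200 pages of HTML.
$perPage = 20;
$page    = isset($_GET['p']) ? max(1, (int)$_GET['p']) : 1;
$offset  = ($page - 1) * $perPage;

// $perPage and $offset are plain integers, so interpolating them here is safe
$stmt = $pdo->query("SELECT src FROM images ORDER BY id LIMIT $perPage OFFSET $offset");

foreach ($stmt->fetchAll(PDO::FETCH_COLUMN) as $src) {
    echo '<img src="' . htmlspecialchars($src) . '"/>';
}
The scraping itself can then run once (for example from a cron job), and the page only ever hits the database.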

Fetching image from google using dom

I want to fetch images from Google using PHP, so I looked for help on the net and found a script that does what I need, but it shows this fatal error:
Fatal error: Call to a member function find() on a non-object in C:\wamp\www\nq\qimages.php on line 7
Here is my script:
<?php
include "simple_html_dom.php";

$search_query = "car";
$search_query = urlencode($search_query);

$html = file_get_html("https://www.google.com/search?q=$search_query&tbm=isch");

$image_container = $html->find('div#rcnt', 0);
$images = $image_container->find('img');

$image_count = 10; // Enter the amount of images to be shown
$i = 0;
foreach ($images as $image) {
    if ($i == $image_count) break;
    $i++;
    // Do with the image whatever you want here (the image element is $image):
    echo $image;
}
?>
I am also using Simple html dom.
Look at my example; it works and gets the first image from the Google results:
<?php
$url = "https://www.google.hr/search?q=aaaa&biw=1517&bih=714&source=lnms&tbm=isch&sa=X&ved=0CAYQ_AUoAWoVChMIyKnjyrjQyAIVylwaCh06nAIE&dpr=0.9";
$content = file_get_contents($url);

libxml_use_internal_errors(true);
$dom = new DOMDocument;
$dom->loadHTML($content);

$images_dom = $dom->getElementsByTagName('img');
foreach ($images_dom as $img) {
    if ($img->hasAttribute('src')) {
        $image_url = $img->getAttribute('src');
    }
    break; // only look at the first <img> on the page
}

// this is the first image on the url
echo $image_url;
This error usually means that $html isn't an object.
It's odd that you say this seems to work. What happens if you output $html? I'd imagine that the URL isn't reachable and that $html is null.
Edit: Looks like this may be an error in the parser. Someone has submitted a bug and added a check in their code as a workaround.
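As a quick check (a sketch; the exact failure mode is an assumption), verify that file_get_html() actually returned an object before calling find() on it:
$html = file_get_html("https://www.google.com/search?q=$search_query&tbm=isch");
if (!$html) {
    // false/null here usually means the page could not be fetched, or was too large for the parser
    die('Could not load the search page.');
}
$image_container = $html->find('div#rcnt', 0);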

Get <div> content with PHP

I am trying to fetch the content inside a <div> via file_get_contents. What I want to do is fetch the content of the div resultStats on google.com. My problem is (afaik) printing it.
A bit of code:
$data = file_get_contents("https://www.google.com/?gws_rd=cr&#q=" . $_GET['keyword'] . "&gws_rd=ssl");
preg_match("#<div id='resultStats'>(.*?)<\/div>#i", $data, $matches);
Simply using
print_r($matches);
only returns Array(), but I want to preg_match the number. Any help is appreciated!
Edit: thanks for showing me the right direction! I got rid of the preg_ call and went for DOM instead. I am pretty new to PHP, though, and this is giving me a headache; I found this code here on Stack Overflow and am trying to edit it to get it to work. So far I only get a blank page, and I don't know what I am doing wrong.
$str = file_get_contents("https://www.google.com/search?source=hp&q=" . $_GET['keyword'] . "&gws_rd=ssl");
$DOM = new DOMDocument;
#$dom->loadHTML($str);

// get
$items = $DOM->getElementsByTagName('resultStats');

// print
for ($i = 0; $i < $items->length; $i++)
    echo $items->item($i)->nodeValue . "<br/>";
} else { exit("No keyword!"); }
Posted on behalf of the OP.
I decided to use the PHP Simple HTML DOM Parser and ended up with something like this:
include_once('/simple_html_dom.php');

$setDomain = "https://www.google.com/search?source=hp&q=" . $_GET['keyword'] . "&gws_rd=ssl";
$html = file_get_html($setDomain);

echo $html->find('div div[id=resultStats]', 0)->innertext . '<br>';
Problem solved!
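If you would rather skip the extra library, the same lookup can be done with the built-in DOM classes. A sketch, assuming the result-count div still carries id="resultStats":
$str = file_get_contents("https://www.google.com/search?source=hp&q=" . urlencode($_GET['keyword']) . "&gws_rd=ssl");

$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($str);
libxml_use_internal_errors(false);

// an id lookup needs XPath (or getElementById), not getElementsByTagName
$xpath = new DOMXPath($dom);
$node  = $xpath->query("//div[@id='resultStats']")->item(0);
echo $node ? $node->nodeValue : 'resultStats not found';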

Echoing only a div with php

I'm attempting to make a script that only echoes the div that encloses the image on Google.
$url = "http://www.google.com/";
$page = file($url);
foreach($page as $theArray) {
echo $theArray;
}
The problem is that this echoes the whole page.
I want to echo only the part between <div id="lga"> and the next closing </div>.
Note: I have tried using ifs, but it wasn't working, so I deleted them.
Thanks
Use the built-in DOM methods:
<?php
$page = file_get_contents("http://www.google.com");
$domd = new DOMDocument();
libxml_use_internal_errors(true);
$domd->loadHTML($page);
libxml_use_internal_errors(false);
$domx = new DOMXPath($domd);
$lga = $domx->query("//*[#id='lga']")->item(0);
$domd2 = new DOMDocument();
$domd2->appendChild($domd2->importNode($lga, true));
echo $domd2->saveHTML();
In order to do this you need to parse the DOM and then get the ID you are looking for. Check out a parsing library like this one: http://simplehtmldom.sourceforge.net/manual.htm
After feeding your HTML document into the parser, you could call something like:
$html = str_get_html($page);
$element = $html->find('div[id=lga]', 0); // second argument: return the first match instead of an array
echo $element->plaintext;
That, I think, would be your quickest and easiest solution.

How to parse actual HTML from page using CURL?

I am "attempting" to scrape a web page that has the following structures within the page:
<p class="row">
<span>stuff here</span>
Descriptive Link Text
<div>Link Description Here</div>
</p>
I am scraping the webpage using curl:
<?php
$handle = curl_init();
curl_setopt($handle, CURLOPT_URL, "http://www.host.tld/");
curl_setopt($handle, CURLOPT_RETURNTRANSFER, true);
$html = curl_exec($handle);
curl_close($handle);
?>
I have done some research and found that I should not use a regex to parse the HTML returned from curl, and that I should use PHP DOM. This is how I have done it:
$newDom = new DOMDocument();
$newDom->preserveWhiteSpace = false; // must be set before loadHTML() to have any effect
$newDom->loadHTML($html);

$sections = $newDom->getElementsByTagName('p');
$nodeNo = $sections->length;
for ($i = 0; $i < $nodeNo; $i++) {
    $printString = $sections->item($i)->nodeValue;
    echo $printString . "<br>";
}
Now I am not pretending that I completely understand this, but I get the gist, and I do get the sections I am after. The only issue is that what I get is only the text of the HTML page, as if I had copied it out of my browser window. What I want is the actual HTML, because I want to extract the links and use them too, like so:
for ($i = 0; $i < $nodeNo; $i++) {
    $printString = $sections->item($i)->nodeValue;
    echo "LINK " . $printString . "<br>";
}
As you can see, I cannot get the link, because I am only getting the text of the web page and not the source like I want. I know curl_exec() is pulling the HTML, because I have tried just that, so I believe the DOM is somehow stripping out the HTML I want.
According to comments on the PHP manual on DOM, you should use the following inside your loop:
$tmp_dom = new DOMDocument();
$tmp_dom->appendChild($tmp_dom->importNode($sections->item($i), true));
$innerHTML = trim($tmp_dom->saveHTML());
This will set $innerHTML to be the HTML content of the node.
But I think what you really want is to get the 'a' nodes under the 'p' node, so do this:
$sections = $newDom->getElementsByTagName('p');
$nodeNo = $sections->length;
for ($i = 0; $i < $nodeNo; $i++) {
    $sec = $sections->item($i);
    $links = $sec->getElementsByTagName('a');
    $linkNo = $links->length;
    for ($j = 0; $j < $linkNo; $j++) {
        $printString = $links->item($j)->nodeValue;
        echo $printString . "<br>";
    }
}
This will just print the body of each link.
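If you also want the URL each link points to, and not just its text, the inner loop can read the href attribute as well, something like:
for ($j = 0; $j < $linkNo; $j++) {
    $link = $links->item($j);
    // nodeValue is the visible link text; getAttribute('href') is the target URL
    echo $link->getAttribute('href') . " : " . $link->nodeValue . "<br>";
}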
You can pass a node to DOMDocument::saveXML(). Try this:
$printString = $newDom->saveXML($sections->item($i));
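Dropped into the earlier loop, that would look something like this:
for ($i = 0; $i < $nodeNo; $i++) {
    // saveXML() with a node argument returns that node's markup, child tags included
    echo $newDom->saveXML($sections->item($i)) . "<br>";
}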
You might also want to take a look at phpQuery for doing server-side HTML parsing; its documentation includes a basic example.
