I am trying to create a program that will open a text file containing URLs separated by |. It should take the first URL in the file, crawl it, and then remove it from the file. Each URL is scraped by a basic crawler. I know the crawler part works, because if I hard-code one of the URLs in quotes instead of using the variable from the text file, it works. I am at the point where nothing is returned, because the URL simply is not accepted.
This is a basic version of my code, because I had to break it down a lot to isolate the problem.
$urlarray = explode("|", $contents = file_get_contents('urls.txt'));
$url = $urlarray[0];
$dom = new DOMDocument('1.0');
@$dom->loadHTMLFile($url);
$anchors = $dom->getElementsByTagName('a');
foreach($anchors as $element)
{
$title = $element->getAttribute('title');
$class = $element->getAttribute('class');
if($class == 'result_link')
{
$title = str_replace('Synonyms of ', '', $title);
echo $title . "<br />";
}
}
The code below works like a champ, tested with your example data:
<?php
$urlarray = explode("|", $contents = file_get_contents('urls.txt'));
$url = $urlarray[0];
$userAgent = 'Googlebot/2.1 (http://www.googlebot.com/bot.html)';
$ch = curl_init();
curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
curl_setopt($ch, CURLOPT_URL,$url);
curl_setopt($ch, CURLOPT_FAILONERROR, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER,true);
curl_setopt($ch, CURLOPT_TIMEOUT, 10);
$html = curl_exec($ch);
$dom = new DOMDocument();
@$dom->loadHTML($html);
$anchors = $dom->getElementsByTagName('a');
foreach($anchors as $element)
{
$title = $element->getAttribute('title');
$class = $element->getAttribute('class');
if($class == 'result_link')
{
$title = str_replace('Synonyms of ', '', $title);
echo $title . "<br />";
}
}
?>
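A side note on the @ in front of loadHTML(): it only silences the libxml warnings that real-world (invalid) HTML triggers. If you would rather collect those warnings than hide them, a small sketch:
libxml_use_internal_errors(true);   // route libxml warnings to an internal buffer
$dom = new DOMDocument();
$dom->loadHTML($html);              // no @ needed now
libxml_clear_errors();              // discard (or inspect via libxml_get_errors())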
ALMOST FORGOT: LET'S NOW PUT IT IN A LOOP TO LOOP THROUGH ALL URLS:
<?php
$urlarray = explode("|", $contents = file_get_contents('urls.txt'));
foreach($urlarray as $url) {
if(!empty($url)) {
$userAgent = 'Googlebot/2.1 (http://www.googlebot.com/bot.html)';
$ch = curl_init();
curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
curl_setopt($ch, CURLOPT_URL,trim($url));
curl_setopt($ch, CURLOPT_FAILONERROR, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER,true);
curl_setopt($ch, CURLOPT_TIMEOUT, 10);
$html = curl_exec($ch);
$dom = new DOMDocument();
@$dom->loadHTML($html);
$anchors = $dom->getElementsByTagName('a');
foreach($anchors as $element)
{
$title = $element->getAttribute('title');
$class = $element->getAttribute('class');
if($class == 'result_link')
{
$title = str_replace('Synonyms of ', '', $title);
echo $title . "<br />";
}
}
echo '<hr />';
}
}
?>
So if you put in a URL manually, $url = 'http://www.mywebsite.com';, everything works as expected?
If so, there is a problem here:
$urlarray = explode("|", $contents = file_get_contents('urls.txt'));
Are you sure urls.txt is loading? Are you sure it contains http://a.com|http://b.com etc.?
I would var_dump() $contents = file_get_contents('urls.txt') before the explode statement to see if it is loading at all.
If yes, I would explode it into $urlarray and var_dump() $urlarray[0].
If that looks right, I would trim it before sending it to DOM with trim($urlarray[0]).
I might even go as far as validating that these URLs are in fact URLs before sending them to DOM, as in the sketch below.
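A rough sketch of those checks, using filter_var() instead of a regex to validate the URL:
$contents = file_get_contents('urls.txt');
var_dump($contents);                      // is the file loading at all?
$urlarray = explode("|", $contents);
var_dump($urlarray[0]);                   // does the first entry look like a URL?
$url = trim($urlarray[0]);                // strip stray whitespace/newlines
if (filter_var($url, FILTER_VALIDATE_URL) === false) {
    die('Not a valid URL: ' . htmlspecialchars($url));
}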
Let me know the results and I will try to help further, or post all the sample code, including urls.txt, and I will run it locally.
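One more thing the question asked for that none of the snippets above show: removing the crawled URL from the text file. A minimal sketch, assuming urls.txt stays pipe-delimited:
$urlarray = explode("|", file_get_contents('urls.txt'));
$url = trim(array_shift($urlarray));      // take the first URL off the list
// ... crawl $url here ...
file_put_contents('urls.txt', implode("|", $urlarray)); // write the rest back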
I finally managed to make a script in PHP for scraping basic elements from other websites. It is super simple. This example shows how to get the title and URL.
ini_set('display_errors', 1);
$url = 'http://test123cxqwq12.000webhostapp.com/mainpage.php';
$ch = curl_init();
curl_setopt($ch, CURLOPT_AUTOREFERER, TRUE);
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);
$data = curl_exec($ch);
curl_close($ch);
$dom = new DOMDocument();
@$dom->loadHTML($data);
$xpath = new DOMXPath($dom);
$title = $xpath->query('/html/body/a/h1');
$source = $xpath->query('/html/body/a/@href');
for ($i = 0; $i < $source->length; $i++) {
    $new = $source->item($i)->nodeValue;   // the link's href value
    $text = $title->item($i)->nodeValue;   // the h1 text inside the link
    echo "<a href='" . $new . "'>" . $text . "</a>" . "<br />";
}
Page with results: http://test123cxqwq12.000webhostapp.com/scrap.php
Page with the content to scrape: http://test123cxqwq12.000webhostapp.com/mainpage.php
Subpage: http://test123cxqwq12.000webhostapp.com/subpage.php
Now I would like to go a step further and take data from the subpage. So instead of taking the source from the main page as I do now, I would like to follow that source and take another source (in this example, the google.com link) from the subpage. I'm out of ideas. Is it possible to do this with XPath, in a similar way to what I am doing now? I would appreciate some tips.
I think a solution could be to store the URLs in a database, then apply your cURL and XPath functions to them:
<?php
function curlGet($url) {
$ch = curl_init();
curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);
curl_setopt($ch, CURLOPT_URL, $url);
$results = curl_exec($ch);
curl_close($ch);
return $results;
}
function returnXPathObject($item) {
$xmlPageDom = new DomDocument();
@$xmlPageDom->loadHTML($item);
$xmlPageXPath = new DOMXPath($xmlPageDom);
return $xmlPageXPath;
}
// $cxn is assumed to be an existing PDO connection
$allUrl = $cxn->query("SELECT url FROM yourDatabaseUrl"); // assuming one URL per row in a `url` column
$allUrl = $allUrl->fetchAll(PDO::FETCH_COLUMN);           // flat array of URL strings
for ($i = 0; $i < count($allUrl); $i++) {
    $url = $allUrl[$i];
    $getDom = curlGet($url);
    $getDomXpath = returnXPathObject($getDom);
    $title = $getDomXpath->query('/html/body/a/h1');
    $source = $getDomXpath->query('/html/body/a/@href');
}
I'm not sure about this answer; it's just a suggestion.
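To answer the subpage part of the question directly, without a database: feed each @href from the main page back into curlGet() and run a second XPath query on the result. A rough sketch reusing the two helper functions above (the subpage XPath is a guess based on the example pages):
$mainXpath = returnXPathObject(curlGet('http://test123cxqwq12.000webhostapp.com/mainpage.php'));
$links = $mainXpath->query('/html/body/a/@href');
foreach ($links as $link) {
    // fetch the subpage behind each link and query it the same way
    $subXpath = returnXPathObject(curlGet($link->nodeValue));
    $subLinks = $subXpath->query('/html/body/a/@href'); // hypothetical path on the subpage
    foreach ($subLinks as $subLink) {
        echo $subLink->nodeValue . '<br />';
    }
}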
I want to know Groupon's active deals, so I wrote a scraper like this:
libxml_use_internal_errors(true);
$dom = new DOMDocument();
@$dom->loadHTMLFile('https://www.groupon.com/browse/new-york?category=food-and-drink&minPrice=1&maxPrice=999');
$xpath = new DOMXPath($dom);
$entries = $xpath->query("//li[@class='slot']//a/@href");
foreach($entries as $e) {
echo $e->textContent . '<br />';
}
but when I run this, the browser just keeps loading the whole time; it never shows anything, not even an error.
How can I fix it? It is not just the case with Groupon - I tried other websites too, and they don't work either. Why?
What about using cURL to load the page data?
"Not just the case with Groupon - I tried other websites too, and they don't work either."
I think this code will help you, but you should expect unexpected situations on each website you want to scrape.
<?php
$dom = new DOMDocument();
$data = get_url_content('https://www.groupon.com', true);
@$dom->loadHTML($data);
$xpath = new DOMXPath($dom);
$entries = $xpath->query("//label");
foreach($entries as $e) {
echo $e->textContent . '<br />';
}
function get_url_content($url = null, $justBody = true)
{
/* Init CURL */
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_HEADER, 1);
curl_setopt($ch, CURLOPT_HTTP_VERSION, CURL_HTTP_VERSION_1_1);
curl_setopt($ch, CURLOPT_USERAGENT, $_SERVER['HTTP_USER_AGENT']);
curl_setopt($ch, CURLOPT_HTTPHEADER, []);
$data = curl_exec($ch);
if ($justBody) {
    $data = @(explode("\r\n\r\n", $data, 2))[1]; // keep only the body after the header block
}
var_dump($data);
return $data;
}
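About the page that "keeps loading": without time limits, cURL will wait on a slow or bot-blocking site indefinitely. Adding these inside get_url_content() would at least turn the endless spinner into a visible failure (the values are only suggestions):
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 10);   // give up connecting after 10 seconds
curl_setopt($ch, CURLOPT_TIMEOUT, 30);          // abort the whole request after 30 seconds
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects such as http -> https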
I am a beginner in PHP, and I want to use a regex to remove everything in a string after the first white space I find.
My code is like this:
<?php
$barcode = $_POST['barcode'];
$adress = "https://...";
$timeout = 40;
$ch = curl_init($adress);
curl_setopt($ch, CURLOPT_FRESH_CONNECT, true);
curl_setopt($ch, CURLOPT_TIMEOUT, $timeout);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
if (strpos($adress, 'https://') === 0) {
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, FALSE);
}
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$page_content = curl_exec($ch);
$dom = new \DOMDocument('1.0', 'UTF-8');
$internalErrors = libxml_use_internal_errors(true);
$node = $dom->loadHTML($page_content);
$listeImages = $dom->getElementsByTagName('img');
foreach ($listeImages as $image)
{
$link = $dom->createElement("a");
$href = $dom->createAttribute('srcset');
$href->value = $image->getAttribute('srcset');
$value_img = $href->nodeValue;
print($value_img . '<br />');
$image->parentNode->replaceChild($link, $image);
$link->appendChild($href);
    /* what I tried to do:
    $exclusion_space_url = array(' ', ' 1x');
    $url_cleaned = $value_img;
    foreach ($exclusion_space_url as $key => $value) {
        $url_cleaned = str_replace($value, " ", $url_cleaned);
    }
    */
}
curl_close($ch);
?>
And the result of this is an HTML page with 3-4 lines of URLs separated by spaces.
Thanks for your help.
$yourString = "Your String With Space";
$str = strtok($yourString, ' '); // everything before the first space
echo $str; // "Your"
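Since the question asked for a regex, the same result with preg_replace(), dropping everything from the first whitespace character onward:
$str = preg_replace('/\s.*/s', '', $yourString); // "Your"
echo $str;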
I want to get the title of a webpage with file_get_contents().
I tried:
$get = get_meta_tags("http://example.com");
echo $get["title"];
but it doesn't match. What is wrong with it?
The title tag is not a meta tag, so it is not part of what get_meta_tags() returns.
Try this:
$get=file_get_contents("http://example.com");
preg_match("#<title>(.*?)</title>#i,$get,$matches);
print_r($matches);
Regex #<title>(.*?)</title>#i matches the title string.
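The title text itself lands in the first capture group:
echo isset($matches[1]) ? $matches[1] : 'no title found';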
Use the code snippet below to get the webpage title:
<?php
function curl_file_get_contents($url)
{
$ch = curl_init();
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
$data = curl_exec($ch);
curl_close($ch);
return $data;
}
$targetUrl = "http://google.com/";
$html = curl_file_get_contents($targetUrl);
$doc = new DOMDocument();
@$doc->loadHTML($html);
$nodes = $doc->getElementsByTagName('title');
$page_title = $nodes->item(0)->nodeValue;
echo "Title: $page_title". '<br/><br/>';
?>
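One caveat with the snippet above: if the fetched page has no <title> tag (or cURL returned false), $nodes->item(0) is null and calling ->nodeValue on it is a fatal error. A small guard, as a suggestion:
$node = $doc->getElementsByTagName('title')->item(0);
$page_title = ($node !== null) ? $node->nodeValue : '(no title)';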
How can I parse images on this site with cURL?
With this code I can show the whole site's HTML, but I need only the images:
$ch = curl_init('http://www.lamoda.ru/shoes/sapogi/?sitelink=leftmenu&sf=16&rdr565=1#sf=16');
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, '1');
$text = curl_exec($ch);
curl_close($ch);
if (!preg_match('/src="https?:\/\/"/', $text))
$text = preg_replace('/src="(.*)"/', "src=\"$MY_BASE_URL\\1\"", $text);
echo $text;
Thank you!
I tried this:
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, '1');
$text = curl_exec($ch);
curl_close($ch);
$doc = new DOMDocument();
@$doc->loadHTML($text); // curl_exec() returned a string, not an object
$imgs = $doc->getElementsByTagName('img');
foreach ($imgs as $img)
{
$imgarray[] = $img -> getAttribute('src');
}
return $imgarray;
BUT: on this site the images are loaded via JS, so it doesn't show the images at all =((
You can use a DOM parser to achieve this:
$ch = curl_init('URL_GOES_HERE');
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, '1');
$text = curl_exec($ch);
curl_close($ch);
$dom = new DOMDocument;
$dom->loadHTML($text);
foreach ($dom->getElementsByTagName('img') as $img) {
echo $img->getAttribute('src');
}
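If the src values come back relative (the regex in the question suggests they might), they can be prefixed with a base URL before use; a small sketch, assuming the $MY_BASE_URL variable from the question:
foreach ($dom->getElementsByTagName('img') as $img) {
    $src = $img->getAttribute('src');
    if (!preg_match('#^https?://#', $src)) {
        $src = rtrim($MY_BASE_URL, '/') . '/' . ltrim($src, '/'); // make it absolute
    }
    echo $src . '<br />';
}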
You can use the simple_html_dom HTML parser; file_get_html() comes from that library, so include simple_html_dom.php first:
http://simplehtmldom.sourceforge.net/manual.htm
// Create DOM from URL or file
$url = 'http://www.lamoda.ru/shoes/sapogi/?sitelink=leftmenu&sf=16&rdr565=1#sf=16';
$html = file_get_html($url);
// Find all images
foreach ($html->find('img') as $element) {
    echo $element->src;
}