I'm trying to use file_get_html on a web page to find images (and their URLs) on the page.
<?php
include('simplehtmldom_1_7/simple_html_dom.php');

$html = file_get_html('https://www.mrporter.com/en-us/mens/givenchy/jaw-neoprene--suede--leather-and-mesh-sneakers/1093998');

// Collect the src of every <img> on the page
$img_url_array = array();
foreach ($html->find('img') as $e) {
    $img_url_array[] = $e->src;
}

$array_size = sizeof($img_url_array);
$x = 0;
while ($x < $array_size) {
    echo "image url is " . $img_url_array[$x] . '<br>';
    $x = $x + 1;
}
?>
The script keeps loading and never finishes. Is there a way to set a timeout or have it throw an exception?
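One common approach (a sketch, not something built into simple_html_dom itself) is to fetch the page yourself with a stream-context timeout and hand the resulting string to str_get_html(), so a hung request fails instead of blocking forever:

include('simplehtmldom_1_7/simple_html_dom.php');

// Give up on slow or hung connections after 10 seconds
$context = stream_context_create([
    'http' => ['timeout' => 10],
]);

$raw = @file_get_contents(
    'https://www.mrporter.com/en-us/mens/givenchy/jaw-neoprene--suede--leather-and-mesh-sneakers/1093998',
    false,
    $context
);

if ($raw === false) {
    throw new Exception('Request failed or timed out');
}

$html = str_get_html($raw);

Setting ini_set('default_socket_timeout', 10) before the call is a cruder alternative that also caps how long file_get_html() itself will wait.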
I am attempting to index all of our items for a sitemap, using the Spatie Sitemap Generator package. Currently we have over 65,000 items, which is problematic because a sitemap can only contain 50,000 links. No matter, though: the package has a solution, a Sitemap Index that points to multiple sitemaps. Because the items are not accessible from any other pages on the site, I cannot just use the crawling feature and must add the links manually. So that is what I am doing below. However, I have two problems:
I get segmentation faults half the time
I get this error when trying to write the sitemaps to a file:
Call to undefined method Spatie\Sitemap\Sitemap::getType() (View: /Users/noahgary/Projects/han-user-portal/vendor/spatie/laravel-sitemap/resources/views/sitemapIndex/index.blade.php)
I'm not exactly sure why I am getting this error or how to ensure there is a type to get when this code is executed... But I know the problem code is here in the package:
<?= '<'.'?'.'xml version="1.0" encoding="UTF-8"?>'."\n" ?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
#foreach($tags as $tag)
#include('laravel-sitemap::sitemapIndex/' . $tag->getType())
#endforeach
</sitemapindex>
Here is my code:
$this->info('Generate Sitemap');
$this->info(config('app.url'));

$sitemapPath = public_path('sitemap.xml');
$siteMapIndex = SitemapIndex::create();

$page = 1;
$size = 1000;
$max_result_window = 10000;

$data = json_decode('{"sort":"updated_at","order":"DESC","page":'.$page.',"size":'.$size.',"active":1,"search_criteria":[{"field":"active","value":1,"operator":"eq","nested":false}], "source": "sitemap"}', true);

// Get first result
$response = $this->itemService->getByCriteria($data);

try {
    $total_results = 0;
    $this->info("Max result window: " . $max_result_window);

    // TODO: get total number of items for this loop
    for ($i = 0; $i < 6; $i++) {
        $sitemap = Sitemap::create(config('app.url'));
        $this->info("Indexing items " . ($i * $max_result_window + 1) . " to " . ($i + 1) * $max_result_window);

        // While we have less than the maximum number of results allowed by Elastic...
        while ($total_results < ($max_result_window - $size)) {
            // Add a link for each item to the sitemap
            foreach ($response->hits->hits as $item) {
                $sitemap->add(Url::create("/shop/" . $item->_id));
            }

            // Some stats
            $total_results = ($page * $size - ($size - sizeof($response->hits->hits)));
            $this->info("Items indexed: " . ($max_result_window * $i + $total_results));

            // Get next page
            $page++;
            $data = json_decode('{"sort":"updated_at","order":"DESC","from":' . ($i * $max_result_window) . ',"page":' . $page . ',"size":' . $size . ',"active":1,"search_criteria":[{"field":"active","value":1,"operator":"eq","nested":false}], "source": "sitemap"}', true);
            $response = $this->itemService->getByCriteria($data);
        }

        // Reset. We are moving the result window.
        $page = 1;
        $total_results = 0;

        // Write sitemap to sitemap index and move on to create a new one.
        $siteMapIndex->add($sitemap);
        $sitemap = null;
        unset($sitemap);
    }
} catch (\Exception $e) {
    var_dump($response);
}

$siteMapIndex->writeToFile($sitemapPath);
Okay, so the reason I got this error is that I was trying to add a Sitemap object to the index. I should have added it like so:
$sitemap->writeToFile($sitemapPath);
$siteMapIndex->add($sitemapPath);
The documentation gives an example of adding a Sitemap object, but ONLY in the case where you want to change the lastModificationDate, as in the example below from the documentation:
SitemapIndex::create()
    ->add('/pages_sitemap.xml')
    ->add(Sitemap::create('/posts_sitemap.xml')
        ->setLastModificationDate(Carbon::yesterday()))
    ->writeToFile($sitemapIndexPath);
So, the correct implementation is:

1. Create the sitemap
2. Add links
3. Write the sitemap to its own file
4. Add the sitemap, by its path, to the sitemap index (see the sketch below)
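Put together, a minimal sketch of that flow (the per-chunk file names are hypothetical and the item-fetching loop from above is elided):

$siteMapIndex = SitemapIndex::create();

for ($i = 0; $i < 6; $i++) {
    $sitemap = Sitemap::create(config('app.url'));

    // ... add up to 50,000 Url::create() links for this chunk here ...

    // Steps 1-3: write this chunk's sitemap to its own file
    $chunkPath = public_path("sitemap_items_{$i}.xml"); // hypothetical name
    $sitemap->writeToFile($chunkPath);

    // Step 4: add the sitemap to the index by its path, not as an object
    $siteMapIndex->add("/sitemap_items_{$i}.xml");
}

$siteMapIndex->writeToFile(public_path('sitemap.xml'));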
I am trying this code to get all the image src attributes from the link (https://www.vfmii.com/exc/aspquery?command=invoke&ipid=HL26423&ids=42337&RM=N)
But it shows nothing. Can you suggest a better way?
<?php
include_once 'simple_html_dom.php';

$html = file_get_html('https://www.vfmii.com/exc/aspquery?command=invoke&ipid=HL26423&ids=42337&RM=N');

// Find all images
foreach ($html->find('img') as $element) {
    echo $element->src . "<br>";
}
?>
The content is loaded via XHR, but you can fetch the JSON directly:
$js = file_get_contents('https://www.vfmii.com/exc/aspquery?command=invoke&ipid=HL26423&ids=42337&RM=N&out=json&lang=en');

// The response wraps the JSON payload, so strip the surrounding characters before decoding
$json = substr($js, 8, -2);
$data = json_decode($json, true);
// print_r(array_keys($data));

// Example: print the last encoding URL of each rcoData entry
foreach ($data['rcoData'] as $rcoData) {
    if (isset($rcoData['encodings'])) {
        $last = end($rcoData['encodings'])['url'];
        echo $last;
    }
}
The website you're trying to scrape loads its content through JavaScript after the initial page load. "PHP Simple HTML DOM Parser" can only see content that is present in the static HTML.
I get images from a specific URL, and with this script I'm able to display them on my website without any problems. The website I get the images from has about 200 pages that I need the images from.
I don't want to copy the PHP code block manually and fill in the page number every time from 1 to 200. Is it possible to do it in one block?
Like: $html = file_get_html('http://example.com/page/1...to...200');
<?php
require_once('simple_html_dom.php');

$html = file_get_html('http://example.com/page/1');
foreach ($html->find('img') as $element) {
    echo '<img src="'.$element->src.'"/>';
}

$html = file_get_html('http://example.com/page/2');
foreach ($html->find('img') as $element) {
    echo '<img src="'.$element->src.'"/>';
}

$html = file_get_html('http://example.com/page/3');
foreach ($html->find('img') as $element) {
    echo '<img src="'.$element->src.'"/>';
}
?>
You can use a for loop like so:
require_once('simple_html_dom.php');

for ($i = 1; $i <= 200; $i++) {
    $html = file_get_html('http://example.com/page/'.$i);
    foreach ($html->find('img') as $element) {
        echo '<img src="'.$element->src.'"/>';
    }
}
So now you have one block of code that will execute 200 times.
It changes the page number by appending the value of $i to the URL, and every time the loop completes a pass, $i increases by 1.
If you wish to start on a higher page number, just change $i = 1 to $i = 2 or any other number, and you can change the 200 to whatever the maximum is for your case.
There are many good solutions; one of them is to make a loop from 1 to 200:
for ($i = 1; $i <= 200; $i++) {
    $html = file_get_html('http://example.com/page/'.$i);
    foreach ($html->find('img') as $element) {
        echo '<img src="'.$element->src.'"/>';
    }
}
<?php
require_once('simple_html_dom.php');

function SendHtml($httpline) {
    $html = file_get_html($httpline);
    foreach ($html->find('img') as $element) {
        echo '<img src="'.$element->src.'"/>';
    }
}

for ($x = 1; $x <= 200; $x++) {
    $httpline = "http://example.com/page/";
    $httpline .= $x;
    SendHtml($httpline);
}
?>
Just loop: create a sending function and loop over the calls.
I recommend reading the PHP documentation at https://www.w3schools.com/php/default.asp
First, store them in a database. You can (and probably should) download the images to your own server, or at least store the URI of each image. You can use code like FMashiro's for that, or something similar, but opening 200 pages and parsing their HTML takes forever on every page view.
Then you simply use the LIMIT functionality in queries to create the pages yourself.
I recommend this method anyway, as it will be much faster than parsing HTML every time someone opens the page, and a database gives you sorting options and other advantages.
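A minimal sketch of that idea, assuming a hypothetical images table with a src column and PDO for database access (all names are illustrative, not from the original post):

// After a one-time (or scheduled) crawl has filled the `images` table,
// render page $page of your own site from the database instead of re-parsing HTML.
$pdo = new PDO('mysql:host=localhost;dbname=mydb', 'user', 'pass');

$perPage = 20;
$offset  = ((int) $page - 1) * $perPage; // cast keeps the interpolation safe

$result = $pdo->query("SELECT src FROM images ORDER BY id LIMIT $perPage OFFSET $offset");

foreach ($result->fetchAll(PDO::FETCH_COLUMN) as $src) {
    echo '<img src="' . htmlspecialchars($src) . '"/>';
}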
I started by building a single cURL session with curl, DOM, and XPath, and it worked great.
I am now building a scraper based on curl_multi to take data from multiple sites in one flow, and the script echoes the single phrase I put in, but it does not pick up the variables.
do {
    $n = curl_multi_exec($mh, $active);
} while ($active);

foreach ($urls as $i => $url) {
    $res[$i] = curl_multi_getcontent($conn[$i]);
    echo ('<br />success');
}
So this does echo the success text as many times as there are URLs, but that is not really what I want. I want to break up the HTML like I could with the single cURL session.
What I did in the single cURL session:
// Parse the HTML into a DOMDocument
$dom = new DOMDocument();
@$dom->loadHTML($res);

// Grab all the links inside the MAIN div on the page
$xpath = new DOMXPath($dom);
$product_img = $xpath->query("//div[@id='MAIN']//a");

for ($i = 0; $i < $product_img->length; $i++) {
    $href = $product_img->item($i);
    $url = $href->getAttribute('href');
    echo "<br />Link : $url";
}
This DOM parsing / XPath works for the single-session cURL, but not when I run curl_multi.
With curl_multi I can call curl_multi_getcontent for each URL in the session, but that is not what I want.
I would like to get the same content that I picked up with DOM / XPath in the single session.
What can I do?
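One way to get there (a minimal sketch that reuses the $urls, $conn, and $res variables from the snippets above) is to run the same DOMDocument / DOMXPath parsing on each response inside the loop that collects the curl_multi results:

foreach ($urls as $i => $url) {
    $res[$i] = curl_multi_getcontent($conn[$i]);

    // Parse this response exactly like the single-session version
    $dom = new DOMDocument();
    @$dom->loadHTML($res[$i]); // suppress warnings from imperfect HTML

    $xpath = new DOMXPath($dom);
    $links = $xpath->query("//div[@id='MAIN']//a");

    foreach ($links as $node) {
        echo "<br />Link from $url : " . $node->getAttribute('href');
    }
}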
EDIT
It seems I am having problems with getAttribute. It is the link for an image that I am having trouble grabbing. The link shows up when scraping, but then it throws an error:
Fatal error: Call to a member function getAttribute() on a non-object in
The query:
// Grab all the product images on the page
$xpath = new DOMXPath($dom);
$product_img = $xpath->query("//img[@class='product']");
$product_name = $xpath->query("//img[@class='product']");
This is working:
for ($i = 0; $i < $product_name->length; $i++) {
    $prod_name = $product_name->item($i);
    $name = $prod_name->getAttribute('alt');
    echo "<br />Link stored: $name";
}
This is not working:
for ($i = 0; $i < $product_img->length; $i++) {
    $href = $product_img->item($i);
    $pic_link = $href->getAttribute('src');
    echo "<br />Link stored: $pic_link";
}
Any idea what I am doing wrong?
Thanks in advance.
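As a defensive sketch (not a diagnosis of this particular page): the "non-object" fatal error means item($i) returned null for that index, so checking the node before calling getAttribute() avoids the crash:

for ($i = 0; $i < $product_img->length; $i++) {
    $href = $product_img->item($i);
    if ($href instanceof DOMElement) {
        echo "<br />Link stored: " . $href->getAttribute('src');
    }
}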
For some odd reason, it is only that one src that won't work right.
This question can be considered "solved".
I am trying to build a simple PHP crawler. For this purpose I am getting the contents of a web page using http://simplehtmldom.sourceforge.net/. After getting the page data, I process it as below:
include('simplehtmldom/simple_html_dom.php');

$html = file_get_html('http://www.mypage.com');
foreach ($html->find('a') as $e) {
    echo $e->href . '<br>';
}
This works perfectly and prints all the links on that page.
I only want to get URLs like:
/view.php?view=open&id=
I have written a function for this purpose:
function starts_text_with($s, $prefix) {
    return strpos($s, $prefix) === 0;
}
and I use this function as:
include('simplehtmldom/simple_html_dom.php');

$html = file_get_html('http://www.mypage.com');
foreach ($html->find('a') as $e) {
    if (starts_text_with($e->href, "/view.php?view=open&id="))
        echo $e->href . '<br>';
}
But nothing is returned. I hope you understand what I need: I want to print only the URLs that match that criteria.
Thanks
include('simplehtmldom/simple_html_dom.php');

$html = file_get_html('http://www.mypage.com');
foreach ($html->find('a') as $e) {
    if (preg_match('~view\.php\?view=open&id=~', $e->href))
        echo $e->href . '<br>';
}
Try this once.
Refer to preg_match for the pattern syntax.
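If a regular expression feels heavyweight, a plain substring check does the same job (a sketch; strpos() !== false also matches when the site emits absolute URLs rather than paths that start with /):

include('simplehtmldom/simple_html_dom.php');

$html = file_get_html('http://www.mypage.com');
foreach ($html->find('a') as $e) {
    // Keep only links whose href contains the view.php query string anywhere
    if (strpos($e->href, 'view.php?view=open&id=') !== false) {
        echo $e->href . '<br>';
    }
}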