Image scraping: pointing each URL to a directory using PHP

I have working code for image scraping, but there are a few things I am trying to fix:
- Replace the hard-coded $the_site = "url"; with an <input type="text">, so instead of putting the URL in the code I can type it into the input.
- Handle multiple folders and links: instead of repeating the same code five times on the page, I want to point each URL to its own directory.
- The code downloads images from pages and saves them to a folder, and I want to keep everything inside one pair of PHP tags.
Here's my code:
<?php
$the_site  = "url";
$the_tag   = "div";
$the_class = "slides";

$html = file_get_contents($the_site);

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// find every <img> directly inside the target tag whose class contains $the_class
foreach ($xpath->query('//' . $the_tag . '[contains(@class, "' . $the_class . '")]/img') as $item) {
    $img_src = $item->getAttribute('src');

    // end() needs a real variable, so split into a temporary first
    $parts    = explode("/", $img_src);
    $img_name = end($parts);

    $img_content = file_get_contents($img_src);
    $fp = fopen("folder/" . $img_name, "w"); // no stray leading space in the path
    fwrite($fp, $img_content);
    fclose($fp);
    echo $img_name . ' has downloaded<br />';
}
?>
I have been pasting this code about five times on the page, each time opening new PHP tags, but I get this error and execution won't complete:
Fatal error: Maximum execution time of 30 seconds exceeded in
C:\xampp\htdocs\grabIMG\index.php on line 106
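Since this question has no answer in the thread, here is a minimal sketch of one approach (editorial, not from the original post): move the scraping code into a function, take the URL from a text input, and map each URL to its own directory. The site_url field, the grab_images() helper, and the folder names are all hypothetical, and set_time_limit(0) lifts the 30-second limit behind the fatal error.

<?php
set_time_limit(0); // scraping several pages easily exceeds the default 30s limit

// hypothetical helper wrapping the scraping code from the question
function grab_images($site, $tag, $class, $dir)
{
    if (!is_dir($dir)) {
        mkdir($dir, 0777, true); // create the target directory if it is missing
    }

    $html = file_get_contents($site);
    libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    $xpath = new DOMXPath($dom);

    foreach ($xpath->query('//' . $tag . '[contains(@class, "' . $class . '")]/img') as $item) {
        $img_src  = $item->getAttribute('src');
        $parts    = explode("/", $img_src);
        $img_name = end($parts);
        file_put_contents($dir . '/' . $img_name, file_get_contents($img_src));
        echo $img_name . ' has downloaded<br />';
    }
}

// the URL now comes from the text input instead of being hard-coded
if (!empty($_POST['site_url'])) {
    grab_images($_POST['site_url'], 'div', 'slides', 'folder1');
}
?>
<form method="post">
    <input type="text" name="site_url" placeholder="http://example.com/page">
    <input type="submit" value="Grab images">
</form>

To point several URLs at separate directories, call grab_images() once per URL/directory pair rather than repeating the whole block five times.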

Related

How to get specific values from an XML file using PHP?

I'm creating a simple web app using HTML and PHP. The app reads an XML file provided by a user on a webpage and then prints some of the values contained in it.
The files are sent to the server by POST and then read by a PHP script.
This is a fragment of my xml test file:
<cfdi:Impuestos totalImpuestosTrasladados="72.07">
<cfdi:Traslados>
<cfdi:Traslado importe="50.00" impuesto="IEPS" tasa="16.00"/>
<cfdi:Traslado importe="22.07" impuesto="IVA" tasa="16.00"/>
</cfdi:Traslados>
</cfdi:Impuestos>
And this is my PHP code:
<?php
foreach ($_FILES['archivos']['name'] as $f => $name) {
    $tempPath = $_FILES["archivos"]["tmp_name"][$f];

    // load the XML file directly from the temp path
    $xml = simplexml_load_file($tempPath);
    $namespaces = $xml->getDocNamespaces();

    // creating and loading a DOM object (note: $dom is never used below)
    $dom = new DOMDocument('1.0', 'utf-8');
    $dom->load($tempPath);

    // getting the IVA values
    foreach ($xml->xpath('//cfdi:Comprobante//cfdi:Impuestos//cfdi:Traslados//cfdi:Traslado') as $impuestos) {
        $IVA = $impuestos['importe'];
        echo("IVA = " . $IVA . "<br>");
    }
}
?>
That prints:
IVA = 50.00
IVA = 22.07
In this case I want to print only the IVA value, but it picks up the IEPS value too, because IEPS is on the same path as IVA. These values appear only in this part of the file, and I don't know if there's a way to separate them.
I just want to select the IVA value. Would you help me please? :(
Try filtering on the impuesto attribute in the XPath predicate, so only the IVA nodes are matched:
foreach ($xml->xpath("//cfdi:Comprobante//cfdi:Impuestos//cfdi:Traslados//cfdi:Traslado[@impuesto='IVA']") as $impuestos) {
    echo "IVA = " . $impuestos['importe'] . "<br>";
}
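SimpleXML only recognizes namespace prefixes that are declared in the document itself; if the cfdi prefix is not picked up automatically, it can be registered on the object before querying. A small sketch, assuming the usual CFDI 3.x namespace URI (check it against what getDocNamespaces() reports for the actual file):

// the URI below is an assumption; use the one reported by getDocNamespaces()
$xml->registerXPathNamespace('cfdi', 'http://www.sat.gob.mx/cfd/3');
foreach ($xml->xpath("//cfdi:Traslado[@impuesto='IVA']") as $impuestos) {
    echo "IVA = " . $impuestos['importe'] . "<br>";
}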

cache folder not functioning, strange file names being saved (php)

I have a project where I'm scraping a few sites for data, then outputting it all onto one site. To help with load times, I'm trying to rig it so that once every 10 minutes my main website does a full data scrape and stores it all in a cache folder called "cache" in the root folder. Then, any time I refresh the main site within those 10 minutes, it pulls from the cache, making load times quite fast.
The trouble is, load times haven't changed, which they really should with this method, so I'm doing something wrong. I'd appreciate any help. I can confirm the data IS being stored in the cache, because I see the files automatically appearing there. So the issue has to be in the part of the code that is supposed to read the data back from the cache: after the data is stored, it's not being grabbed.
Part of me wonders if the issue is with how the filenames are saved in the cache; right now they seem to be random values. For example, one is named f32dd7f0b85eb4c1be0bb9a417cc29ea553d898e.html.
I'd think it needs to be saved under a specific file name, though I'm not sure how to achieve that. The code at the end of my PHP reference files seems to specify this, so I'm not sure what the issue is. The code that is supposed to be doing this is at the bottom of the post.
I'm really new to PHP and honestly have only gotten this far thanks to some very nice and helpful people. I'm close, but not quite there yet with this cache framework.
global.php in root folder:
<?php
$_cache_time = 600;       // 10 minutes
$_cache_dir  = "./cache"; // cache dir

function deleteBlankInArray($var) {
    return !ctype_space($var) && !empty($var);
}

// serve the cached copy if it is younger than $_cache_time and start
// output buffering either way; returns true when the cache was used
function cache_start($filename)
{
    global $_cache_dir, $_cache_time;
    $cachefile = $_cache_dir . '/' . sha1($filename) . '.html';
    ob_start();
    if (file_exists($cachefile) && (time() - $_cache_time < filemtime($cachefile))) {
        include($cachefile);
        ob_flush();
        return true;
    }
    return false;
}

// write whatever was buffered since cache_start() into the cache file
function cache_end($filename)
{
    global $_cache_dir, $_cache_time;
    $cachefile = $_cache_dir . '/' . sha1($filename) . '.html';
    $fp = fopen($cachefile, 'w');
    fwrite($fp, ob_get_contents());
    fclose($fp);
    ob_flush();
}
My main website is an XHTML site. It references these PHP pages like this:
<?php include 's&pcurrent.php';?>
<?php include 'news.php';?>
It references/outputs multiple PHP files, which is why load times are slow when not pulling from the cache.
And lastly, this is an example of one of the PHP files being included. This one is called litecoinchange.php:
<?php
error_reporting(E_ALL ^ E_NOTICE ^ E_WARNING);
include_once "global.php";

// filename of the file
if (!cache_start("litecoinchange.php")) {
    $doc = new DOMDocument;
    // we don't want to bother with white space
    $doc->preserveWhiteSpace = false;
    $doc->strictErrorChecking = false;
    $doc->recover = true;
    $doc->loadHTMLFile('https://coinmarketcap.com/');

    $xpath = new DOMXPath($doc);
    $query = "//tr[@id='id-litecoin']";
    $entries = $xpath->query($query);

    foreach ($entries as $entry) {
        $result = trim($entry->textContent);
        $ret_ = explode(' ', $result);
        // make sure no element in the array starts or ends with a blank
        foreach ($ret_ as $key => $val) {
            $ret_[$key] = trim($val);
        }
        // delete empty elements and elements that are only "\n" "\r" "\t"
        $ret_ = array_values(array_filter($ret_, 'deleteBlankInArray'));
        // echo the last element
        echo $ret_[7];
        // filename of the file
        cache_end("litecoinchange");
    }
}
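An editorial observation, since the thread carries no answer: the long hex filenames are expected, because sha1($filename) produces them by design. The likelier problem is that the two cache calls use different keys: cache_start("litecoinchange.php") and cache_end("litecoinchange") hash to different files, so the file cache_start() looks for is never the one cache_end() writes, and the cache never gets a hit. cache_end() also runs inside the foreach loop instead of once at the end. A minimal corrected sketch, keeping the same helpers:

<?php
include_once "global.php";

$cache_key = "litecoinchange.php"; // one key, shared by both calls

if (!cache_start($cache_key)) {
    // ... scrape and echo exactly as in the original code ...

    cache_end($cache_key); // once, after the loop, with the same key
}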

PHP foreach loop ends unexpectedly when reading content from a remote site using simple_html_dom

I have this code:
$target_url = "http://mysiteexmpl.com/";
$html = new simple_html_dom();
$html->load_file($target_url);

// here I find the specific links and start looping over each
foreach ($html->find('a.link') as $link) {
    $newtarget_url = $link->href;

    // here I open each URL that I find
    $newhtml = new simple_html_dom();
    $newhtml->load_file($newtarget_url);

    // getting the price (strip everything except digits and the decimal point)
    foreach ($newhtml->find('div.pprice') as $price) {
        $price = preg_replace("/[^0-9.]/", "", $price->plaintext);
        echo '<br>';
    }
    // getting other info and so on
    foreach ($newhtml->find('div.prohead > h1') as $title) {
        $title = $title->innertext;
        echo '<br>';
    }
    // here I execute several queries and copy images from the remote site to mine
}
The problem is that there are 21 target URLs: if I just echo $newtarget_url and don't execute the queries, all 21 appear. But when I execute the full code with the queries, the loop stops on the 7th URL and doesn't loop over all 21 URLs it is supposed to.
Is this a memory leak problem or something else? How can I debug it? How can the code above be optimized?
Thank you in advance for your time.
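A note in lieu of an answer: simple_html_dom is known to hold a lot of memory per parsed document, so loops that load one document per iteration often die silently on the memory limit (or hit the execution-time limit) partway through. The usual first step is to free each document with its clear() method and raise the limits; a sketch under those assumptions:

set_time_limit(0);               // the default 30s limit is easy to hit while fetching 21 pages
ini_set('memory_limit', '256M'); // assumption: memory, not time, is what cuts the loop short

foreach ($html->find('a.link') as $link) {
    $newhtml = new simple_html_dom();
    $newhtml->load_file($link->href);

    // ... extract the price and title, run the queries ...

    $newhtml->clear(); // release the parsed DOM before the next iteration
    unset($newhtml);
}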

PHPQuery... trying to get all links of all images from page

I am trying to get all links of all images on a given page using PHPQuery. I am using the PHP support syntax of PHPQuery.
This is the code I have so far:
include('phpQuery-onefile.php');
$all = phpQuery::newDocumentFileHTML("http://www.mysite.com", $charset = 'utf-8');
// in theory this gives me all image sources
$images = $all->find('img')->attr('src');
// but if I do `echo $images;` what I get is the src to the first image
Out of curiosity I have tried
$images = $all->find('img:first')->attr('src');
and
$images = $all->find('img:last')->attr('src');
and they correctly print the first and the last image's address, respectively, but how in hell can I get an array of all the links?
Within your foreach loop, you need to wrap $img with pq().
For example:
$all = phpQuery::newDocumentFileHTML("http://www.mysite.com", $charset = 'utf-8');
$imgs = $all['img'];
foreach ($imgs as $img) {
// Note: $img must be used like "pq($img)"
echo pq($img)->attr('src');
}
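To build the array the question asks for, collect each src inside the loop instead of echoing it; a small extension of the snippet above:

$srcs = array();
foreach ($all['img'] as $img) {
    $srcs[] = pq($img)->attr('src'); // one src per image element
}
print_r($srcs); // the full list of image links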

Can't get the dom node value extracted

I have code that connects to another site, grabs its data, and returns the string to a variable. I'm wondering why this isn't working, however.
<?php
$file = $DOCUMENT_ROOT . "http://www.sc2brasd.net";
$doc = new DOMDocument();
@$doc->loadHTMLFile($file);
$elements = $doc->getElementsByTagName('h1');
// item() is zero-based, so seven tags are items 0 through 6
for ($i = 0; $i < 7; $i++)
{
    echo trim($elements->item($i)->nodeValue);
}
?>
There are seven h1 tags that I would like to grab, but nothing comes back to echo out. An example of the string would be "Here is the test string i am trying to pull out".
This will not work because the path doesn't exist; it points to a file on your server:
$file = $DOCUMENT_ROOT . "http://www.sc2brasd.net";
I'm not sure if loadHTMLFile() can handle URLs at all. You may need to fetch the document with file_get_contents() and load it with DOMDocument::loadHTML().
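A minimal sketch of that suggestion (loadHTMLFile() can in fact fetch URLs when allow_url_fopen is enabled, but fetching the string explicitly keeps the two steps easy to debug; the URL is the one from the question):

<?php
libxml_use_internal_errors(true); // real-world HTML is rarely well-formed

$html = file_get_contents("http://www.sc2brasd.net");
$doc  = new DOMDocument();
$doc->loadHTML($html);

// print every h1 on the page
foreach ($doc->getElementsByTagName('h1') as $el) {
    echo trim($el->nodeValue), "\n";
}
?>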
