I have been trying to grab the product price and shipping cost
on aliexpress.com.
The price is fixed, so that part is easy.
However, the shipping cost is loaded after the site determines
which country you are from.
I viewed the source, and it has a hidden input field which is populated (probably) after checking my location or IP.
How can I use cURL to "fool" the site and get the shipping cost to my country, i.e. scrape it using PHP?
The cURL code I have so far:
$html = curl_download($producturl, $browserAgent);
$dom = new DOMDocument();
$dom->validateOnParse = true;
@$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
libxml_clear_errors();
// get and clean product price
$price = $dom->getElementById('product-price');
$price = $price->nodeValue;
$clnprice = currency_string_remover($price);
$clnprice = explode(' ', $clnprice);
$clnprice = array_filter(array_map('trim',$clnprice),'strlen');
$clnprice = array_values($clnprice)[0];
$currency = currency_string_extractor($price);
// get and clean shipping price
// >> this is empty until page determines location! PROBLEM
$shipprice = $dom->getElementById('shipping-cost');
$shipprice = $shipprice->nodeValue;
echo '<pre>SPRICE';
print_r($shipprice);
echo '</pre>';
$shipprice = explode('-', $shipprice);
$shipprice = $shipprice[0];
$shipprice = currency_string_remover($shipprice);
echo '<div id="sitename">aliexpress</div>';
echo '<div id="price">'.$clnprice.'</div>';
echo '<div id="shipprice">'.$shipprice.'</div>';
echo '<div id="currency">'.$currency.'</div>';
Does anyone have any ideas? Pointers? Helping links?
I've checked the site. It works for several languages and countries. With Russia (my case), the main price on a product page includes the shipping cost, so this DOM element remains empty: <span id="shipping-cost"></span>
By the way, it's not in a form (in my case).
If you suspect it's populated by AJAX (JavaScript), check all the JS files for the keyword shipping-cost. I did that with Chrome DevTools and, in my case, found no occurrence of it in any JS file (including the source HTML file). So most likely it is not updated by JavaScript (AJAX); rather, the field is generated on the server and may be served empty.
Your browser accesses the site from one country, while the server running your PHP code (the cURL scraper) accesses it from a totally different country (IP). AliExpress will therefore respond with different page content. For debugging, I recommend the free proxy service hola.org to change or rotate the country (IP) through a proxy, so you can check the site from IPs in different countries and see this field.
You might need to check another field (product-info-shipping) to see the shipping conditions. http://joxi.ru/xAe8Wy1hGDgq2y
If you really want to request the web page with shipping-cost populated as for a certain country (IP), then you need to route your cURL requests through a proxy service.
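As a minimal sketch of proxying a cURL request in PHP: the helper names, the proxy address and the timeout below are illustrative assumptions, not anything specific to AliExpress. CURLOPT_PROXY is the standard cURL option for this; you would supply the address of a proxy located in the country you want to appear from.

```php
<?php
// Hypothetical helper: builds the cURL options for a request routed through a proxy.
function build_proxy_curl_options(string $url, string $proxy, string $userAgent): array
{
    return [
        CURLOPT_URL            => $url,
        CURLOPT_PROXY          => $proxy,     // "host:port" of a proxy in the target country
        CURLOPT_USERAGENT      => $userAgent,
        CURLOPT_RETURNTRANSFER => true,       // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,       // follow redirects
        CURLOPT_TIMEOUT        => 30,         // assumed timeout in seconds
    ];
}

// Hypothetical helper: fetches a page through the proxy, or returns null on failure.
function fetch_via_proxy(string $url, string $proxy, string $userAgent): ?string
{
    $ch = curl_init();
    curl_setopt_array($ch, build_proxy_curl_options($url, $proxy, $userAgent));
    $html = curl_exec($ch);
    return $html === false ? null : $html;
}
```

The returned HTML can then be passed to DOMDocument exactly as in the question's code. The proxy address is something you have to obtain yourself; 203.0.113.10:8080 in any test is only a placeholder.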
Related
I am currently trying to add a custom message to my homepage based on the location of the visitor. If the visitor has already been to the checkout and chosen a shipping country, then the message will be based on that country. Otherwise we will use the WooCommerce geolocation function.
function get_user_country(){
    // $woocommerce is a global; without this line isset() is always false here
    global $woocommerce;
    if( isset($woocommerce->customer) ){
        $country = $woocommerce->customer->get_shipping_country();
    }
    else{
        $geo = new WC_Geolocation();
        $user_ip = $geo->get_ip_address();
        $user_geo = $geo->geolocate_ip( $user_ip );
        $country = $user_geo['country'];
    }
    if( $country == 'CA' ){
        echo 'We offer free shipping to Canada';
    }
}
My issue is that the result seems to be stored in the cache.
When I refresh my homepage, the message is not updated.
And I don't want to exclude my homepage from caching.
I have read that one way to get the country dynamically is to use Ajax instead of PHP.
But I am a beginner in web development, and I am afraid this is beyond my knowledge for now...
Is there any other way to resolve my issue? Thanks.
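For reference, the Ajax route mentioned above could look roughly like the sketch below. It assumes WordPress and WooCommerce are loaded; the action name my_country_message and both function names are made up for illustration. The idea is that the cached homepage stays static and a small script requests the message from admin-ajax.php, which is not served from the page cache.

```php
<?php
// Pure helper: maps a country code to the homepage message (empty string if none).
function shipping_message_for_country(string $country): string
{
    return strtoupper($country) === 'CA' ? 'We offer free shipping to Canada' : '';
}

// Hypothetical Ajax handler; only meaningful inside WordPress with WooCommerce active.
function my_ajax_country_message(): void
{
    $customer = WC()->customer;
    $country  = $customer ? $customer->get_shipping_country() : '';
    if ($country === '') {
        // Falls back to geolocating the visitor's IP address.
        $geo     = WC_Geolocation::geolocate_ip();
        $country = $geo['country'];
    }
    wp_send_json(['message' => shipping_message_for_country($country)]);
}

// Register the handler for logged-in and anonymous visitors (skipped outside WordPress).
if (function_exists('add_action')) {
    add_action('wp_ajax_my_country_message', 'my_ajax_country_message');
    add_action('wp_ajax_nopriv_my_country_message', 'my_ajax_country_message');
}
```

The homepage would then request admin-ajax.php?action=my_country_message from the browser and insert the returned message into the page.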
My requirement is to restrict a content element to IPs from a specific country (e.g. Austria). That is, the content element should be visible to people visiting the website from Austrian IPs and hidden for all other users. I am using a GeoIP solution to check the country. I wrote a user function to implement this feature: a small extension that sets the hidden flag to 0 or 1 based on the country. But because of TYPO3 caching, I have to clear the cache every time for the changes to show up in the frontend. I included the extension as USER_INT, so the extension is non-cacheable, but unfortunately that does not work. The functionality itself works, but because of caching the changes are not reflected in real time.
$uid = 175; // uid of the content element needs to be hidden
$geoplugin = new \geoPlugin();
$geoplugin->locate();
$countryCode = $geoplugin->countryCode;
if( $countryCode == 'AT' ){
$GLOBALS['TYPO3_DB']->exec_UPDATEquery('tt_content', 'uid IN ('.$uid.')', array('hidden' => 0));
}else{
$GLOBALS['TYPO3_DB']->exec_UPDATEquery('tt_content', 'uid IN ('.$uid.')', array('hidden' => 1));
}
Is there any method available in TYPO3 to restrict a content element to specific IPs or countries? Or can you suggest a solution to fix this, please?
The solution of Jost is much less dirty than hiding the element in the database depending on the visitor's country. With your approach, the database is probably changed on every user visit.
Just create a micro extension.
I'm trying to use cURL to get all the product prices from this site, but I don't really know how to scrape the prices for every product on http://www.bikestore.ie/.
Can someone please give me some tips?
Right now I'm just testing getting the price for one product, and that is no problem, but can I get the prices for all products?
My code right now is:
public function Scrape(){
    $curl = curl_init('http://www.bikestore.ie/scott-speedster-30-bike-2015.html');
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, TRUE);
    $page = curl_exec($curl);
    // curl_exec() returns false on failure, so test the response rather than the handle
    if(!empty($page)){
        $doc = new DOMDocument;
        $doc->loadHTML($page);
        $xpath = new DOMXPath($doc);
        $rupees = $xpath->evaluate('string(//div[@class="product-shop"]//div[@class="price-box"]//span[@class="price"])');
        echo $rupees;
    }
    else {
        print "Not found";
    }
}
It's not an easy task.
The site is structured, but each product is identified only by its URL,
e.g. http://www.bikestore.ie/scott-speedster-30-bike-2015.html
When you add it to the cart, the unique product identifier is visible:
Steps
1. Crawl the whole site with cURL (find all the <a> links to products). See the post on a simple Python crawler; you just do something similar in PHP.
2. Store them in a DB (e.g. MySQL).
3. For each link, run your Scrape() procedure to fetch the price and product ID. Once you have the price of a product, mark its link as 'checked' in the DB so that you do not run it again.
Notes: For the sake of parallel processing, you might run the processes for points 1 and 2 and the process for point 3 in parallel. Use cron for this.
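The first two steps above can be sketched as follows. This is only an illustration: the function name is hypothetical, the filter assumes product pages end in .html (as the example URL does), and a plain array with a 'checked' flag stands in for the MySQL table.

```php
<?php
// Hypothetical helper: extracts absolute product links from a page's HTML.
// Assumption: product pages end in ".html", like the example URL.
function extract_product_links(string $html, string $baseUrl): array
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);   // tolerate real-world markup
    $doc->loadHTML($html);
    libxml_clear_errors();

    $links = [];
    foreach ($doc->getElementsByTagName('a') as $a) {
        $href = $a->getAttribute('href');
        if ($href === '' || substr($href, -5) !== '.html') {
            continue;                   // skip non-product links
        }
        // Resolve site-relative links against the base URL.
        if (strpos($href, 'http') !== 0) {
            $href = rtrim($baseUrl, '/') . '/' . ltrim($href, '/');
        }
        $links[$href] = true;           // array keys de-duplicate
    }
    return array_keys($links);
}

// Sample markup standing in for a crawled page.
$sample = '<a href="/scott-speedster-30-bike-2015.html">Bike</a>'
        . '<a href="http://www.bikestore.ie/scott-speedster-30-bike-2015.html">Bike again</a>'
        . '<a href="/contact">Contact</a>';

// In-memory stand-in for the DB table: URL => row with a 'checked' flag.
$queue = [];
foreach (extract_product_links($sample, 'http://www.bikestore.ie/') as $url) {
    $queue[$url] = ['checked' => false];
}
```

Step 3 would then loop over the rows where 'checked' is false, call Scrape() for each URL and set the flag afterwards.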
I'm working on a script in PHP that collects all the product URLs from an e-commerce site. For now I'm just using the function file_get_contents() and then searching with preg_match_all() for the keyword that precedes each item URL. My question is: is there a more direct and efficient way to collect all these links from a website and store them in my database?
I've recently created a crawler system for a client's project. Basically, these are the steps I followed:
1. The project is based on PHP and supports multiple document types, such as XML, JSON and HTML.
2. I created a base product object with the properties I needed (title, image, price, link, category, source site).
3. For each website, I use a parser, which generally uses the PHP DOM library with DOMXPath.
4. Basically, I find the product listing tag, loop through the records and create a new product list object with product objects inside it (see step 2).
5. When parsing of the website finishes, I send the whole list to my base action, which checks whether a product with that URL already exists; if not, it adds it to the DB.
6. Also, on my server I run a cron job that checks all the product links; if the response is 404 or 500, it sets a flag of 1 on the product. I also run another cron job that rechecks the links flagged with 1. If a link still responds with an error code, it removes the content from my database.
This is a sample of the parser code that I use. I hope it will help you through the process:
$content = file_get_contents($url);
libxml_use_internal_errors(true);
$oDom = new DomDocument;
$oDom->validateOnParse = false;
$res = $oDom->loadHTML($content);
libxml_clear_errors();
$oDom->preserveWhiteSpace = false;
$oXpath = new DOMXPath($oDom);
// each product sits in its own listing container
$productNode = $oXpath->query('//div[@class="ulist span4"]');
if($productNode){
    $productsList = array();
    foreach($productNode as $p){
        $this->oProduct = new Products();
        // the remaining queries are relative to the current product node $p
        $productURL = $oXpath->query('div[@class="ures"]/div[@class="ures-cell"]/a', $p)->item(0);
        $this->oProduct->url = $this->base.'/'.$productURL->getAttribute('href');
        $this->oProduct->category = $categoryID;
        $this->oProduct->productPeek = $peek;
        $titleNode = $oXpath->query('div[@class="ubilgialan span12"]/div[@class="span12 uadi"]/a/span', $p)->item(0);
        $this->oProduct->title = trim($titleNode->nodeValue);
        $priceNode = $oXpath->query('div[@class="ubilgialan span12"]/div[@class="span8 ufytalan"]/div[@class="ufyt"]/span[@class="divdiscountprice"]/span', $p)->item(0);
        $this->oProduct->price = trim($priceNode->nodeValue);
        $imageNode = $oXpath->query('div[@class="ures"]/div[@class="ures-cell"]/a/img', $p)->item(0);
        $this->oProduct->image = $this->base."/".$imageNode->getAttribute('src');
        $productsList[] = $this->oProduct;
    }
    if(count($productsList) > 0){
        return $productsList;
    }
}
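The two link-checking cron jobs described in the steps above can be reduced to one small decision function. This is a sketch under assumptions: the function names are hypothetical, and what to do with a flagged link that works again (here: keep it) is not stated in the description.

```php
<?php
// Hypothetical helper: returns the HTTP status code of a URL (0 if the request failed).
function fetch_http_status(string $url): int
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_NOBODY, true);          // status only, no response body
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_exec($ch);
    return (int) curl_getinfo($ch, CURLINFO_HTTP_CODE);
}

// Decides what the cron job should do with a product link.
// $flag is 1 if an earlier run already saw an error for this link, otherwise 0.
function next_link_action(int $httpStatus, int $flag): string
{
    $failed = in_array($httpStatus, [404, 500], true);
    if (!$failed) {
        return 'keep';                               // assumption: healthy links are left alone
    }
    // First failure sets the flag; a repeated failure removes the product.
    return $flag === 1 ? 'delete' : 'flag';
}
```

A cron script would call next_link_action(fetch_http_status($url), $flag) for each stored product and update or delete the row accordingly.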
Here is my code so far.
$dom_currys = new DOMDocument;
libxml_use_internal_errors(TRUE);
$dom_currys->loadHTMLFile('http://www.currys.co.uk/gbuk/apple-new-ipod-touch-8gb-4th-generation-07677427-pdt.html');
libxml_clear_errors();
$xpath_currys = new DOMXpath($dom_currys);
$nodes_currys = $xpath_currys->query(
'/html/body/div/div/div[2]/div/div/div[2]/div/ul[2]/li/span'
);
$currys_stock_data = $nodes_currys->item(0)->nodeValue; // "Available for home delivery"
echo $currys_stock_data;
When echoed, it comes back with
 Available for home delivery
 Available to reserve & collect
I only require the "Available for home delivery" part. Each is in a separate "li" element, but it still brings back both. If I wanted the second one, the XPath would be
/html/body/div/div/div[2]/div/div/div[2]/div/ul[2]/li[2]/span
I suspect it's to do with selecting the correct item, but I'm not sure if that's right.
I also need the result to be checked by an IF statement. What I have so far:
if (strpos($currys_stock_data, 'Available for home') !== false) {
$currys_stockyesno = "Yes";
} else {
$currys_stockyesno = "No";
}
echo $currys_stockyesno;
I thought it would be best to check whether it contains "Available for home delivery" rather than do a straight match, because the website can sometimes say it's available for home delivery in 2 days, or something along those lines. As long as the text contains that string, it should return true/yes. But it's saying no...
I looked at the site you are scraping and found that the li is actually what contains the text. The span has a class on it for the icon. Since the check mark icon changes, we need to check for this too. However, it doesn't seem like you actually need the text; you need to check whether the item allows home delivery.
$xpath = "//li[contains(., 'Available for home delivery')]/span[@class='icon icon-check']";
Then, just check the length:
if( $nodes_currys->length === 1 ) // true if available for home.
I should also note that this method will not work on their search/browse pages since they use images there.....very confusing and why I hate scraping :P
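Put together, the check could look like the sketch below. The sample markup is an assumption modelled on the description above (text in the li, icon class on the span), not a copy of the live page.

```php
<?php
// Assumed sample markup: the text lives in the <li>, the icon class on the <span>.
$html = '<ul>'
      . '<li><span class="icon icon-check"></span> Available for home delivery</li>'
      . '<li><span class="icon icon-cross"></span> Available to reserve &amp; collect</li>'
      . '</ul>';

$dom_currys = new DOMDocument();
libxml_use_internal_errors(true);
$dom_currys->loadHTML($html);
libxml_clear_errors();

$xpath_currys = new DOMXPath($dom_currys);
// Matches only when the home-delivery item carries the check-mark icon.
$nodes_currys = $xpath_currys->query(
    "//li[contains(., 'Available for home delivery')]/span[@class='icon icon-check']"
);

$currys_stockyesno = ($nodes_currys->length === 1) ? 'Yes' : 'No';
echo $currys_stockyesno;
```

Because contains() is used on the li, a longer text such as "Available for home delivery in 2 days" would still match.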
I have examined the HTML source and confirmed that
/html/body/div/div/div[2]/div/div/div[2]/div/ul[2]/li
selects two elements.
If you want to select only the first of the two text nodes, use:
/html/body
/div/div/div[2]
/div/div/div[2]
/div/ul[2]/li[1]
/span/following-sibling::text()
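A small sketch of the following-sibling::text() step, run against assumed sample markup rather than the live page, so the path is shortened to //ul/li[1]:

```php
<?php
// Assumed sample markup: each <li> holds an icon <span> followed by a text node.
$sampleHtml = '<ul>'
            . '<li><span class="icon icon-check"></span> Available for home delivery</li>'
            . '<li><span class="icon icon-check"></span> Available to reserve &amp; collect</li>'
            . '</ul>';

$sampleDom = new DOMDocument();
libxml_use_internal_errors(true);
$sampleDom->loadHTML($sampleHtml);
libxml_clear_errors();

$sampleXpath = new DOMXPath($sampleDom);
// string() returns the value of the first matching text node.
$firstText = trim($sampleXpath->evaluate(
    'string(//ul/li[1]/span/following-sibling::text())'
));
echo $firstText;
```

On the real page, the full /html/body/... path from the answer would replace //ul/li[1].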