Simple HTML DOM parser can't parse the whole page - php

I need to get information from the center column of that site
(specifically, the phone numbers).
I'm using Simple HTML DOM parser and also tried a cURL approach, but I always get the HTML source without that central column.
I found this out using this code:
$html = file_get_html('http://vashmagazin.ua/cat/catalog/?rub=100&subrub=1');
$str = $html->Save();
echo $str;
I need to say today whether I can do this or not, or I will lose this order.
Sorry for my bad English, thanks.

Pay attention to the request headers, and use iconv for the charset conversion.
If you don't convert the string from windows-1251 to UTF-8, preg_match_all with the /u modifier will fail.
After the conversion I used a simple regular expression to extract the phone numbers from the whole page.
<?php
$url = 'http://vashmagazin.ua/cat/catalog/?rub=100&subrub=1';
$ch = curl_init();
$request_headers = array
(
"Accept" => "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
"Accept-Charset" => "windows-1251,utf-8;q=0.7,*;q=0.3",
);
$header = array();
foreach ($request_headers as $key => $value)
$header[] = "{$key}: {$value}";
curl_setopt($ch, CURLOPT_HTTPGET, true);
curl_setopt($ch, CURLOPT_HTTPHEADER, $header);
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.63 Safari/535.7');
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_HEADER, false);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$html = curl_exec($ch);
curl_close($ch);
$html = iconv("windows-1251", "UTF-8", $html);
$matches = array();
$pattern = '/\([0-9]{3}\)[0-9]{3,}\-[0-9]+/us';
if (preg_match_all($pattern, $html, $matches))
{
var_dump($matches);
}
?>
The source code above is fully tested and working.
If you can't install the curl library, try replacing the curl block with file_get_contents($url).
To install curl on your operating system, search for instructions online; on Ubuntu, use sudo apt-get install curl libcurl3 php5-curl and restart Apache.
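To see why the conversion matters, here is a minimal offline sketch. The sample string and the helper name extractPhones are made up for illustration; the pattern is the one from the answer above. With the /u modifier, preg_match_all() returns false when the subject is not valid UTF-8, which is the case for Cyrillic text still encoded as windows-1251.

```php
<?php
// Hypothetical helper (not part of the original answer): convert a page from
// its source charset to UTF-8, then extract phone numbers with the same pattern.
function extractPhones(string $html, string $sourceCharset): array
{
    $utf8 = iconv($sourceCharset, 'UTF-8', $html);
    if ($utf8 === false) {
        return array(); // conversion failed, nothing to search
    }
    $pattern = '/\([0-9]{3}\)[0-9]{3,}\-[0-9]+/us';
    if (!preg_match_all($pattern, $utf8, $matches)) {
        return array();
    }
    return $matches[0];
}

// Made-up sample text, encoded as windows-1251 (this file itself is assumed to be UTF-8).
$sample = iconv('UTF-8', 'windows-1251', 'Телефон: (044)123-4567');

// Without conversion, the /u pattern rejects the subject as invalid UTF-8.
$withoutConversion = preg_match_all('/\([0-9]{3}\)[0-9]{3,}\-[0-9]+/us', $sample, $ignored);
var_dump($withoutConversion);

// With conversion, the number is found.
print_r(extractPhones($sample, 'windows-1251'));
```

Keeping the conversion and the match in one function makes it harder to forget the iconv step when the same pattern is reused on other pages of the site.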

Related

How to select specific text from a string generated by a PHP script?

I've been trying to scrape an HLS file from Twitch using several PHP scripts. The first one runs a cURL request to get the HLS URL through a Python script that returns said URL, and converts the generated string to plain text. The second (which is the one that isn't working) is supposed to extract the M3U8 file and make it playable.
First script (extract.php)
<?php
header('Content-Type: text/plain; charset=utf-8');
$url = "https://pwn.sh/tools/streamapi.py?url=twitch.tv/cgtn_live_russian&quality=1080p60";
$curl = curl_init($url);
curl_setopt($curl, CURLOPT_URL, $url);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
//for debug only!
curl_setopt($curl, CURLOPT_SSL_VERIFYHOST, false);
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, false);
$resp = curl_exec($curl);
curl_close($curl);
var_dump($resp);
$undesirable = array("}");
$cleanurl = str_replace($undesirable,"");
echo substr($cleanurl, 39, 898);
?>
This script (let's call it extract.php) works, and it returns (in plain text) the same information the Python script would return, which is this:
string(904) "{"success": true, "urls": {"1080p60": "https://video-weaver.fra05.hls.ttvnw.net/v1/playlist/[token].m3u8"}}"
Second script (play.php)
<?php
$opts = array(
'http'=>array(
'method'=>"GET",
'header'=>"Referer:https://myserver.com/" .
"User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:51.0) Gecko/20100101 Firefox/51.0"
));
$html = file_get_contents("extract.php");
preg_match_all(
'/(http.*?\.m3u8[^&">]+)/',
$html,
$posts, // will contain the article data
PREG_SET_ORDER // formats data into an array of posts
);
foreach ($posts as $post) {
$link = $post[0];
header("Location: $link");
}
?>
This second script (let's call it play.php) should theoretically return the M3U8 file (without string(904) "{"success": true, "urls": {"1080p60":) and make it able to be played in a media player, such as VLC, but it doesn't return anything.
Can someone tell me what's wrong? Did I make a syntax or regex error when making these PHP files or is the second file not working because of the other elements of the string?
Thanks in advance.
I think you can rely on the regex to get the URL out instead of trying to clean the string manually. The other way would be to use json_decode().
Anyways the idea is to define a variable in extract.php, in this case it is $resp. Doing it via echo as you are now will not make it available in the parent script.
You can then reference that variable in play.php once extract.php has been included.
<?php
//extract.php
$resp = '';
$url = "https://pwn.sh/tools/streamapi.py?url=twitch.tv/cgtn_live_russian&quality=1080p60";
$curl = curl_init($url);
curl_setopt($curl, CURLOPT_URL, $url);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
//for debug only!
curl_setopt($curl, CURLOPT_SSL_VERIFYHOST, false);
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, false);
$resp = curl_exec($curl);
curl_close($curl);
//play.php
include('./extract.php');
//$resp is set in extract.php
preg_match_all(
'/(http.*?\.m3u8)/',
$resp,
$posts, // will contain the article data
PREG_SET_ORDER // formats data into an array of posts
);
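$link = null;
foreach ($posts as $post) {
$link = $post[0]; // keep the last matched URL
}
// Only redirect when a URL was actually found
if ($link !== null) {
header("Location: $link");
}
die();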
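The json_decode() route mentioned above could look roughly like the sketch below. The helper name firstStreamUrl and the placeholder URL are assumptions for illustration; the JSON shape is taken from the output shown in the question. Note that it must be fed the raw $resp, not the var_dump() output, which wraps the JSON in string(904) "...".

```php
<?php
// Hypothetical helper: return the first stream URL from the API's JSON reply,
// or null when the reply is not usable.
function firstStreamUrl(string $json): ?string
{
    $data = json_decode($json, true);
    if (!is_array($data) || empty($data['success'])) {
        return null;
    }
    if (!isset($data['urls']) || !is_array($data['urls'])) {
        return null;
    }
    foreach ($data['urls'] as $quality => $url) {
        if (is_string($url) && $url !== '') {
            return $url;
        }
    }
    return null;
}

// Placeholder reply in the shape shown in the question (the URL is made up).
$resp = '{"success": true, "urls": {"1080p60": "https://example.com/v1/playlist/abc.m3u8"}}';
$link = firstStreamUrl($resp);
if ($link !== null) {
    // In play.php this would be: header("Location: $link");
    echo $link, "\n";
}
```

Decoding the JSON avoids depending on character offsets or on how the URL happens to end, which is where the substr() and regex approaches are fragile.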

cURL returns null array

I have made a simple web Crawler with PHP cURL that should grab all the images of a particular page from Amazon where the keyword samsung has been searched.
Here is the code:
$curl = curl_init(); // $curl is going to be data type curl resource
$search_string = "samsung";
$url = "https://www.amazon.com/s?k$search_string";
curl_setopt($curl, CURLOPT_URL, $url);
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, false); // ssl
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true); // storing in variable
$result = curl_exec($curl);
preg_match_all("!https://m.media-amazon.com/images/I/[^\s]*?._AC_UL320_.jpg!", $result, $matches);
print_r($matches);
curl_close($curl);
But now I get Null array:
Array ( [0] => Array ( ) )
I don't know why it is showing that, so if you know what is going wrong or how I can handle this, please let me know. I would really appreciate any ideas.
Thanks in advance.
Note that I have specified [^\s]*? regular expression instead of image name to load all the available images on web page.
UPDATE #1:
Results of curl --head https://www.amazon.com/s?k=samsung
HTTP/1.1 503 Service Unavailable
Content-Type: text/html
Content-Length: 2671
Connection: keep-alive
Server: Server
Date: Tue, 15 Jun 2021 20:59:38 GMT
x-amz-rid: 9BVX8KQMWJ4QDJ75ETYV
Vary: Content-Type,Accept-Encoding,X-Amzn-CDN-Cache,X-Amzn-AX-Treatment,User-Agent
Last-Modified: Fri, 14 May 2021 19:08:48 GMT
ETag: "a6f-5c24ef9383000"
Accept-Ranges: bytes
Strict-Transport-Security: max-age=47474747; includeSubDomains; preload
Permissions-Policy: interest-cohort=()
X-Cache: Error from cloudfront
Via: 1.1 5345148f0ba8ae3c67b69d035acdbfc5.cloudfront.net (CloudFront)
X-Amz-Cf-Pop: AMS50-C1
X-Amz-Cf-Id: AHdq2-QLEtCE4WvXZIEh_P75D8hCrHP09EAkNqBer5VBS-pI-blj1w==
First issue: Your code:
$url = "https://www.amazon.com/s?k$search_string";
should be (note the "=")
$url = "https://www.amazon.com/s?k=$search_string";
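Second issue: Amazon is smart; they are not going to let you scrape at will. What comes back is the body of an error page rather than the search results (the headers in the update above show a 503 Service Unavailable).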
You can see this with:
$result = curl_exec($curl);
var_dump($result);
Third issue: the regex is not working. You can test a regex at https://www.phpliveregex.com/#tab-preg-match-all
(Right-click > View Source, then copy and paste the page content.)
From what I got, your regex did not return any results, but this did: https://m.media-amazon.com/images/I/[^\s]*?.jpg
It may be that the ._AC_UL320_ part of the string is also an Amazon anti-scraping thing... :(
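The pattern can also be checked offline against a small sample instead of the live page. The sketch below is only an illustration: the helper name productImageUrls is made up, the two image URLs are copied from the output listed further down this page, and the dots are escaped so that they match literal dots rather than any character.

```php
<?php
// Hypothetical helper: collect .jpg product image URLs from an HTML string.
function productImageUrls(string $html): array
{
    // Dots are escaped; the character class stops at whitespace and quotes.
    $pattern = '~https://m\.media-amazon\.com/images/I/[^\s"\']+?\.jpg~';
    preg_match_all($pattern, $html, $matches);
    return $matches[0];
}

// Small sample standing in for the downloaded page.
$sample = '<img src="https://m.media-amazon.com/images/I/81HdcaHSq4L._AC_UY218_.jpg">'
        . '<img src="https://m.media-amazon.com/images/I/216-OX9rBaL._SS72_.png">';

print_r(productImageUrls($sample));
```

Because the size suffix (here _AC_UY218_) varies between responses, the sketch does not hard-code it.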
It's not https://www.amazon.com/s?k$search_string; it's supposed to be 'https://www.amazon.com/s?k='.urlencode($search_string);
Also, Amazon.com requires you to send an Accept-Encoding header; otherwise you risk getting gzip-compressed responses with nothing to decompress them, which means you need CURLOPT_ENCODING.
Amazon will also block you if you don't supply a User-Agent header, so you must set CURLOPT_USERAGENT.
Finally, Amazon will block you without a browser-like Accept header, so you need CURLOPT_HTTPHEADER => array('accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng')
Also, do not parse HTML with regex. Regular expressions are a tool that is insufficiently sophisticated to understand the constructs employed by HTML. HTML is not a regular language and hence cannot be parsed by regular expressions. Regex queries are not equipped to break down HTML into its meaningful parts.
Instead, use an HTML parser like DOMDocument.
This code
<?php
$curl = curl_init(); // $curl is going to be data type curl resource
$search_string = "samsung";
$url = "https://www.amazon.com/s?k=".urlencode($search_string);
curl_setopt($curl, CURLOPT_URL, $url);
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, false); // ssl
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true); // storing in variable
curl_setopt_array($curl,array(
CURLOPT_ENCODING =>'',
CURLOPT_USERAGENT=>'libcurl',
CURLOPT_HTTPHEADER=>array(
'accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
)
));
$html=curl_exec($curl);
$domd = new DOMDocument();
@$domd->loadHTML($html);
foreach($domd->getElementsByTagName("img") as $img){
echo $img->getAttribute("src"),"\n";
}
outputs
//fls-na.amazon.com/1/batch/1/OP/ATVPDKIKX0DER:136-7756522-9160852:777GSTVR1XJ9MBF1N0KN$uedata=s:%2Frd%2Fuedata%3Fstaticb%26id%3D777GSTVR1XJ9MBF1N0KN:0
https://images-na.ssl-images-amazon.com/images/G/01/gno/sprites/nav-sprite-global-1x-hm-dsk-reorg._CB405937547_.png
https://m.media-amazon.com/images/I/81HdcaHSq4L._AC_UY218_.jpg
https://m.media-amazon.com/images/I/91eAcgt9fSL._AC_UY218_.jpg
https://m.media-amazon.com/images/I/81afsli5ctL._AC_UY218_.jpg
https://m.media-amazon.com/images/I/61m1Dot5KCL._AC_UY218_.jpg
https://m.media-amazon.com/images/I/61HFJwSDQ4L._AC_UY218_.jpg
https://m.media-amazon.com/images/I/216-OX9rBaL._SS72_.png
https://m.media-amazon.com/images/I/21OXy0oJ8VL._SS160_.png
https://m.media-amazon.com/images/I/61jfI8GyQgL._AC_UY218_.jpg
https://m.media-amazon.com/images/I/61LUNEgB6iL._AC_UY218_.jpg
https://m.media-amazon.com/images/I/813dec-cszS._AC_UY218_.jpg
https://m.media-amazon.com/images/I/81AT+Flc+EL._AC_UY218_.jpg
https://m.media-amazon.com/images/I/216-OX9rBaL._SS72_.png
https://m.media-amazon.com/images/I/21OXy0oJ8VL._SS160_.png
https://m.media-amazon.com/images/I/61a5ejk6K2L._AC_UY218_.jpg
https://m.media-amazon.com/images/I/81+3SWSAhDL._AC_UY218_.jpg
https://m.media-amazon.com/images/I/61pwE8H34zL._AC_UY218_.jpg
https://m.media-amazon.com/images/I/71ejkOW4y2L._AC_UY218_.jpg
https://m.media-amazon.com/images/I/71G6eW8H8hL._AC_UY218_.jpg
https://m.media-amazon.com/images/I/91dFUw5MUTS._AC_UY218_.jpg
https://m.media-amazon.com/images/I/81P4RzFnw6L._AC_UY218_.jpg
https://m.media-amazon.com/images/I/712iry8nIYL._AC_UY218_.jpg
https://m.media-amazon.com/images/I/61VgW9ZZXiL._AC_UY218_.jpg
https://m.media-amazon.com/images/I/61ft-L7HnUL._AC_UY218_.jpg
https://m.media-amazon.com/images/I/51icdppvRVL._AC_UY218_.jpg
https://m.media-amazon.com/images/I/6164p9jY2jS._AC_UY218_.jpg
https://m.media-amazon.com/images/I/51skvShlcsL._AC_UY218_.jpg
https://images-na.ssl-images-amazon.com/images/G/01/x-locale/common/grey-pixel.gif
https://m.media-amazon.com/images/S/mms-media-storage-prod/final/BrandPosts/brandPosts/68995c82-c645-4ec0-9168-20f77b8ae24d/625e2c3f-01d9-401e-b4a4-bb865ad9e525/media._SL60_.jpeg
https://images-na.ssl-images-amazon.com/images/G/01/x-locale/common/grey-pixel.gif
https://m.media-amazon.com/images/S/mms-media-storage-prod/final/BrandPosts/brandPosts/93913ead-ae42-4933-8fc4-e9f88b0396c9/1635f47b-1fa9-40ca-8d85-47f529c1ba8b/media._SL480_.jpeg
https://images-na.ssl-images-amazon.com/images/G/01/x-locale/common/grey-pixel.gif
https://m.media-amazon.com/images/S/mms-media-storage-prod/final/BrandPosts/brandPosts/68995c82-c645-4ec0-9168-20f77b8ae24d/625e2c3f-01d9-401e-b4a4-bb865ad9e525/media._SL60_.jpeg
https://images-na.ssl-images-amazon.com/images/G/01/x-locale/common/grey-pixel.gif
https://m.media-amazon.com/images/S/mms-media-storage-prod/final/BrandPosts/brandPosts/6aa489c6-af9d-48d0-94c8-cce1a4f50fc7/ff2a7805-3166-41b9-9881-d00901ca9dfd/media._SL480_.jpeg
https://images-na.ssl-images-amazon.com/images/G/01/x-locale/common/grey-pixel.gif
https://m.media-amazon.com/images/S/mms-media-storage-prod/final/BrandPosts/brandPosts/68995c82-c645-4ec0-9168-20f77b8ae24d/625e2c3f-01d9-401e-b4a4-bb865ad9e525/media._SL60_.jpeg
https://images-na.ssl-images-amazon.com/images/G/01/x-locale/common/grey-pixel.gif
https://m.media-amazon.com/images/S/mms-media-storage-prod/final/BrandPosts/brandPosts/73b89b9f-ee28-446f-8535-beacd328c95a/8caa5478-3583-49f9-9dcb-6e5b0a254fa6/media._SL480_.jpeg
https://images-na.ssl-images-amazon.com/images/G/01/x-locale/common/grey-pixel.gif
https://m.media-amazon.com/images/S/mms-media-storage-prod/final/BrandPosts/brandPosts/68995c82-c645-4ec0-9168-20f77b8ae24d/625e2c3f-01d9-401e-b4a4-bb865ad9e525/media._SL60_.jpeg
https://images-na.ssl-images-amazon.com/images/G/01/x-locale/common/grey-pixel.gif
https://m.media-amazon.com/images/S/mms-media-storage-prod/final/BrandPosts/brandPosts/457fd8ad-f566-4682-bb66-fd865954aec0/fb2cdc76-7ed6-4b86-9196-d40c3ead2914/media._SL480_.jpeg
https://images-na.ssl-images-amazon.com/images/G/01/x-locale/common/grey-pixel.gif
https://m.media-amazon.com/images/S/mms-media-storage-prod/final/BrandPosts/brandPosts/68995c82-c645-4ec0-9168-20f77b8ae24d/625e2c3f-01d9-401e-b4a4-bb865ad9e525/media._SL60_.jpeg
https://images-na.ssl-images-amazon.com/images/G/01/x-locale/common/grey-pixel.gif
https://m.media-amazon.com/images/S/mms-media-storage-prod/final/BrandPosts/brandPosts/5c60fcd5-17c1-4389-8423-2252436f21c8/0125e72d-9178-4048-bea3-9d268a406a05/media._SL480_.jpeg
https://images-na.ssl-images-amazon.com/images/G/01/x-locale/common/grey-pixel.gif
https://m.media-amazon.com/images/S/mms-media-storage-prod/final/BrandPosts/brandPosts/68995c82-c645-4ec0-9168-20f77b8ae24d/625e2c3f-01d9-401e-b4a4-bb865ad9e525/media._SL60_.jpeg
https://images-na.ssl-images-amazon.com/images/G/01/x-locale/common/grey-pixel.gif
https://m.media-amazon.com/images/S/mms-media-storage-prod/final/BrandPosts/brandPosts/f852e5ab-0fa9-4f91-b195-b0facc4d0d70/30b0ec08-79b2-428d-98df-aadffd2c00eb/media._SL480_.jpeg
https://images-na.ssl-images-amazon.com/images/G/01/x-locale/common/grey-pixel.gif
https://m.media-amazon.com/images/S/mms-media-storage-prod/final/BrandPosts/brandPosts/68995c82-c645-4ec0-9168-20f77b8ae24d/625e2c3f-01d9-401e-b4a4-bb865ad9e525/media._SL60_.jpeg
https://images-na.ssl-images-amazon.com/images/G/01/x-locale/common/grey-pixel.gif
https://m.media-amazon.com/images/S/mms-media-storage-prod/final/BrandPosts/brandPosts/d173de56-5162-463f-be97-d256c1895024/7974c773-0c53-43a1-bfb4-91d7cc3ce801/media._SL480_.jpeg
https://images-na.ssl-images-amazon.com/images/G/01/x-locale/common/grey-pixel.gif
https://m.media-amazon.com/images/S/mms-media-storage-prod/final/BrandPosts/brandPosts/68995c82-c645-4ec0-9168-20f77b8ae24d/625e2c3f-01d9-401e-b4a4-bb865ad9e525/media._SL60_.jpeg
https://images-na.ssl-images-amazon.com/images/G/01/x-locale/common/grey-pixel.gif
https://m.media-amazon.com/images/S/mms-media-storage-prod/final/BrandPosts/brandPosts/2cfe5e10-6a7e-43f4-80c7-d87f212b8007/43e8a030-58c5-491a-9854-cd4d8824a873/media._SL480_.jpeg
https://images-na.ssl-images-amazon.com/images/G/01/personalization/ybh/loading-4x-gray._CB485916920_.gif
https://assoc-na.associates-amazon.com/abid/um?s=136-7756522-9160852&m=ATVPDKIKX0DER
//fls-na.amazon.com/1/batch/1/OP/ATVPDKIKX0DER:136-7756522-9160852:777GSTVR1XJ9MBF1N0KN$uedata=s:%2Frd%2Fuedata%3Fnoscript%26id%3D777GSTVR1XJ9MBF1N0KN:0
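The DOMDocument approach can be tried without any network access by feeding it a string. The following is a minimal sketch under that assumption; the helper name imageSources and the sample markup are made up, and libxml_use_internal_errors() is used here in place of the @ operator to silence parser warnings on sloppy HTML.

```php
<?php
// Hypothetical helper: return the src attribute of every <img> in an HTML string.
function imageSources(string $html): array
{
    $dom = new DOMDocument();
    // Collect parser warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $sources = array();
    foreach ($dom->getElementsByTagName('img') as $img) {
        $src = $img->getAttribute('src');
        if ($src !== '') {
            $sources[] = $src; // skip <img> tags without a src
        }
    }
    return $sources;
}

// Made-up sample markup.
$sample = '<html><body><img src="a.jpg"><p>text</p>'
        . '<img src="https://example.com/b.png"><img></body></html>';

print_r(imageSources($sample));
```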
$url = "https://www.amazon.com/s?k$search_string";
Yes, your URL is wrong.
The actual URL is below; you can try it:
$url = "https://www.amazon.com/s?k=$search_string";
Firstly, there is a typo.
Change
$url = "https://www.amazon.com/s?k".$search_string;
to
$url = "https://www.amazon.com/s?k=".$search_string;
Amazon expects certain header values to be present when content is requested; see the following cURL request:
curl_setopt($curl, CURLOPT_URL, $url);
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
curl_setopt($curl, CURLOPT_USERAGENT, 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36');
curl_setopt($curl, CURLOPT_HTTPHEADER, array(
'accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9'
));
curl_setopt($curl, CURLOPT_ENCODING, '');
$result=curl_exec($curl);
Lastly, change your preg_match_all call from
preg_match_all("!https://m.media-amazon.com/images/I/[^\s]*?._AC_UL320_.jpg!", $result, $matches);
To
preg_match_all('/(https?:\/\/\S+\.(?:jpg|png|gif))\s+/', $result, $matches);
Complete Code :
<?php
$curl = curl_init();
$search_string = "samsung";
$url = "https://www.amazon.com/s?k=".$search_string;
// Set headers to match what a browser sends to Amazon. You can inspect them with any browser's developer tools.
curl_setopt($curl, CURLOPT_URL, $url);
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
curl_setopt($curl, CURLOPT_USERAGENT, 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36');
curl_setopt($curl, CURLOPT_HTTPHEADER, array(
'accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9'
));
curl_setopt($curl, CURLOPT_ENCODING, '');
$result=curl_exec($curl);
preg_match_all('/(https?:\/\/\S+\.(?:jpg|png|gif))\s+/', $result, $matches);
print_r($matches);

simple_html_dom: 403 Access denied

I implemented this function in order to parse HTML pages using two different "methods".
As you can see both are using the very handy class called simple_html_dom.
The difference is that the first method also uses curl to load the HTML, while the second does not.
Both methods are working fine on a lot of pages but I'm struggling with this specific call:
searchThroughDOM('https://fr.shopping.rakuten.com/offer/buy/3458931181/new-york-1997-4k-ultra-hd-blu-ray-blu-ray-bonus-edition-boitier-steelbook.html', 'simple_html_dom');
In both cases, I end up with a 403 access denied response.
Did I do something wrong?
Or is there another method in order to avoid this type of denial?
function searchThroughDOM ($url, $method)
{
echo '$url = '.$url.'<br>'.'$method = '.$method.'<br><br>';
$time_start = microtime(true);
switch ($method) {
case 'curl':
$curl = curl_init();
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($curl, CURLOPT_HEADER, false);
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($curl, CURLOPT_URL, $url);
curl_setopt($curl, CURLOPT_REFERER, $url);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
curl_setopt($curl, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36');
$str = curl_exec($curl);
curl_close($curl);
// Create a DOM object
$html = new simple_html_dom();
// Load HTML from a string
$html->load($str);
break;
case 'simple_html_dom':
$html = new simple_html_dom();
$html->load_file($url);
break;
}
$collection = $html->find('h1');
foreach($collection as $x => $x_value) {
echo 'x = '.$x.' => value = '.$x_value.'<br>';
}
$html->save('result.htm');
$html->clear();
$time_end = microtime(true);
echo 'Elapsed Time (DOM) = '.($time_end - $time_start).'<br><br>';
}
From my point of view, there is nothing wrong with simple_html_dom.
You can remove the simple_html_dom part of the code and keep only the cURL part,
which I assume is the source of the problem.
There are many reasons why cURL might not work on a page.
First of all, I can see you added
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, false);
You should also try adding CURLOPT_SSL_VERIFYHOST set to false.
Secondly, check your cURL version to see whether it is too old.
Thirdly, if none of the above works, you may want to enable cookies; with cookies disabled, the website may detect that a machine, not a real person, is sending the request.
Lastly, if all of the above fails, try another library or even file_get_contents().
cURL is not your only option, although it is the most powerful one.
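Enabling cookies, as suggested above, comes down to two cURL options. The sketch below only builds the option array so it can be inspected without a network connection; the helper name cookieAwareOptions, the example URL, and the cookie file location are assumptions, and there is no guarantee that this is enough to get past the 403 on that particular site.

```php
<?php
// Hypothetical helper: cURL options that store and resend cookies,
// in addition to the options already used in the question.
function cookieAwareOptions(string $url, string $cookieFile): array
{
    return array(
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_ENCODING       => '',          // accept any encoding cURL supports
        CURLOPT_COOKIEJAR      => $cookieFile, // cookies are written here when the handle closes
        CURLOPT_COOKIEFILE     => $cookieFile, // cookies are read from here and sent back
        CURLOPT_USERAGENT      => 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36',
    );
}

$options = cookieAwareOptions('https://example.com/', sys_get_temp_dir() . '/cookies.txt');

// Usage (needs network access, so it is left commented out here):
// $curl = curl_init();
// curl_setopt_array($curl, $options);
// $str = curl_exec($curl);
// $status = curl_getinfo($curl, CURLINFO_HTTP_CODE); // check for 403 before parsing
// curl_close($curl);
```

Checking the status code before handing the body to simple_html_dom makes it clear whether the parser or the request is at fault.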

Failed to load external entity on simplexml_load_file at Openstreetmap

I recently checked one of our websites and realized that the search for postal code wasn't working anymore.
I get the following error:
'Failed to load external entity'
If instead I use simplexml_load_string() I receive
'Start tag expected, '<' not found'.
This is the code I'm using:
libxml_use_internal_errors(true);
$xml = simplexml_load_file('https://nominatim.openstreetmap.org/search?postalcode=28217&country=DE&format=xml&polygon=1&addressdetails=1&boundary=postalcode');
if (false === $xml) {
$errors = libxml_get_errors();
var_dump($errors);
}
I read somewhere that it might actually have something to do with HTTP headers, but I did not find any useful info on this.
OSM Nominatim's usage policy states that you need to provide a User-Agent or HTTP-Referer request header to identify the application. As such, using a user agent that masquerades as an end-user browser is really not great etiquette.
You can find the usage policy here. It also says that the default values used by HTTP libraries (like the one simplexml_load_file() uses) are not acceptable.
You say you are using simplexml_load_string(), but you do not say how you are getting the XML into that function. The most likely scenario is that, whichever method you use to fetch the XML, you are also neglecting to pass the mandatory headers.
As such, I'd create a request using php-curl, provide one of these headers to identify your app, and parse the resulting XML string with simplexml_load_string().
E.g.:
// setup variables
$nominatim_url = 'https://nominatim.openstreetmap.org/search?postalcode=28217&country=DE&format=xml&polygon=1&addressdetails=1&boundary=postalcode';
$user_agent = 'ID_Identifying_Your_App v100';
$http_referer = 'http://www.urltoyourapplication.com';
$timeout = 10;
// curl initialization
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $nominatim_url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
// these are the bits you are missing
// Setting curl's user-agent
curl_setopt($ch, CURLOPT_USERAGENT, $user_agent);
// you can also use this one (http-referer), it's up to you. Either one or both.
curl_setopt($ch, CURLOPT_REFERER, $http_referer);
// get the XML
$data = curl_exec($ch);
curl_close($ch);
// load it in simplexml
$xml = simplexml_load_string($data);
// This was your code, left as it was
if (false === $xml) {
$errors = libxml_get_errors();
var_dump($errors);
}
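The 'Start tag expected' error from the question is what libxml reports when the body it receives is not XML at all, for example a plain-text refusal. The offline sketch below illustrates this; the helper name parseXmlBody and both sample bodies are made up, and the element and attribute names in the second sample are assumptions, not the documented response format.

```php
<?php
// Hypothetical helper: parse a response body and return the parsed document
// (or null) together with any libxml error messages.
function parseXmlBody(string $body): array
{
    $previous = libxml_use_internal_errors(true);
    $xml = simplexml_load_string($body);
    $messages = array();
    foreach (libxml_get_errors() as $error) {
        $messages[] = trim($error->message);
    }
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    // Compare strictly with false: an empty SimpleXMLElement is falsy too.
    return array($xml === false ? null : $xml, $messages);
}

// A non-XML body cannot be parsed and yields at least one error message.
list($failed, $errors) = parseXmlBody('Access denied');
var_dump($failed === null, count($errors) > 0);

// Made-up XML sample (element and attribute names are assumed).
list($xml, $noErrors) = parseXmlBody('<searchresults><place display_name="Bremen"/></searchresults>');
echo (string) $xml->place['display_name'], "\n";
```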
You can use curl and add custom headers. I hope this code is useful for you:
<?php
$request_url='https://nominatim.openstreetmap.org/search?postalcode=28217&country=DE&format=xml&polygon=1&addressdetails=1&boundary=postalcode';
$ch = curl_init();
$timeout = 5;
curl_setopt($ch, CURLOPT_HTTPHEADER, array(
'Accept-Language: en-US,en;q=0.9,fa;q=0.8,und;q=0.7',
'User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.79 Safari/537.36'));
curl_setopt($ch, CURLOPT_URL, $request_url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
$data = curl_exec($ch);
curl_close($ch);
echo($data);

get information from html table using Curl

I need to get some information about some plants and put it into a MySQL table.
My knowledge of cURL and the DOM is close to nil, but I've come up with this:
set_time_limit(0);
include('simple_html_dom.php');
$ch = curl_init ("http://davesgarden.com/guides/pf/go/1501/");
curl_setopt($ch, CURLOPT_USERAGENT,"Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.1) Gecko/2008070208 Firefox/3.0.1");
curl_setopt($ch, CURLOPT_HTTPHEADER, array("Accept-Language: es-es,en"));
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch, CURLOPT_BINARYTRANSFER,1);
curl_setopt($ch, CURLOPT_TIMEOUT,0);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
$data = curl_exec ($ch);
curl_close ($ch);
$html= str_get_html($data);
$e = $html->find("table", 8);
echo $e->innertext;
Now I'm really lost about how to move on from this point. Can you please guide me?
Thanks!
This is a mess.
But at least it's a (somewhat) consistent mess.
If this is a one time extraction and not a rolling project, personally I'd use quick and dirty regex on this instead of simple_html_dom. You'll be there all day twiddling with the tags otherwise.
For example, this regex pulls out the majority of title/data pairs:
$pattern = "#<b>(.*?)</b>\s*<br>(.*?)</?(td|p)>#si";
You'll need to do some pre and post cleaning before it will get them all though.
I don't envy you having this task...
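As a rough illustration of the title/data extraction, the sketch below runs the pattern against a made-up fragment. The helper name titleDataPairs and the sample markup are assumptions (the real page is messier), and # is used as the regex delimiter so that the slashes in the closing tags need no escaping.

```php
<?php
// Hypothetical helper: map each bold title to the text that follows its <br>.
function titleDataPairs(string $html): array
{
    $pattern = '#<b>(.*?)</b>\s*<br>(.*?)</?(td|p)>#si';
    preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);

    $pairs = array();
    foreach ($matches as $match) {
        // Post-cleaning: drop nested tags, surrounding whitespace and the trailing colon.
        $title = rtrim(trim(strip_tags($match[1])), ':');
        $pairs[$title] = trim(strip_tags($match[2]));
    }
    return $pairs;
}

// Made-up fragment imitating the title/data layout.
$sample = '<td><b>Family:</b><br>Rosaceae</td>'
        . '<td><b>Genus:</b> <br> Rosa <p>';

print_r(titleDataPairs($sample));
```

The resulting array maps column names to values, which is a convenient shape for building the MySQL insert mentioned in the question.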
Your best bet will be to wrap this in PHP ;)
Yes, this is an ugly hack for ugly HTML code.
<?php
ob_start();
system("
/usr/bin/env links -dump 'http://davesgarden.com/guides/pf/go/1501/' |
/usr/bin/env perl -lne 'm/((Family|Genus|Species):\s+\w+\s+\([\w-]+\))/ && \
print $1'
");
$out = ob_get_contents();
ob_end_clean();
print $out;
?>
Use Simple HTML DOM and you will be able to access any element, or any element's content, you wish. Its API is very straightforward.
You can try something like this.
<?php
$ch = curl_init ("http://www.digionline.ir/Allprovince/CategoryProducts/cat=10301");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$page = curl_exec($ch);
$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($page);
libxml_clear_errors();
$xpath = new DOMXpath($dom);
$data = array();
// get all table rows and rows which are not headers
$table_rows = $xpath->query('//table[@id="tbl-all-product-view"]/tr[@class!="rowH"]');
foreach($table_rows as $row => $tr) {
foreach($tr->childNodes as $td) {
$data[$row][] = preg_replace('~[\r\n]+~', '', trim($td->nodeValue));
}
$data[$row] = array_values(array_filter($data[$row]));
}
echo '<pre>';
print_r($data);
?>
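The XPath approach can be exercised on a small string before pointing it at a live page. This is a sketch under assumptions: the helper name tableRows and the sample table are made up, and the query differs slightly from the one above in that it uses //tr (to tolerate a tbody wrapper) and also keeps rows that have no class attribute at all.

```php
<?php
// Hypothetical helper: return the cell texts of every non-header row of a table.
// $tableId and $headerClass are assumed not to contain double quotes.
function tableRows(string $html, string $tableId, string $headerClass): array
{
    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $query = sprintf(
        '//table[@id="%s"]//tr[not(@class) or @class!="%s"]',
        $tableId,
        $headerClass
    );

    $rows = array();
    foreach ($xpath->query($query) as $tr) {
        $cells = array();
        // Query only the <td> children, so whitespace text nodes are skipped.
        foreach ($xpath->query('./td', $tr) as $td) {
            $cells[] = trim($td->nodeValue);
        }
        if ($cells) {
            $rows[] = $cells;
        }
    }
    return $rows;
}

// Made-up sample table.
$sample = '<table id="tbl-all-product-view">'
        . '<tr class="rowH"><td>Name</td><td>Price</td></tr>'
        . '<tr class="row"><td>Phone A</td><td> 100 </td></tr>'
        . '<tr><td>Phone B</td><td>200</td></tr>'
        . '</table>';

print_r(tableRows($sample, 'tbl-all-product-view', 'rowH'));
```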
