DOMDocument sometimes returns a tangle of characters - PHP

I need to extract the DOM from an external website in PHP. I tried a test URL, but sometimes it shows many, many Chinese-looking letters :) (more precisely, Unicode characters, I think).
What's strange is that a different link works fine, but with the link below, if I run the PHP script, say, 3 times, it stops working after the 3rd try (on the 1st and 2nd runs it shows the normal DOM structure).
URL: https://www.csfd.cz/film/300902-bohemian-rhapsody/prehled/
DOM after the ~3rd run: https://i.stack.imgur.com/lnM1I.png
Code:
$doc = new \DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTMLFile("https://www.csfd.cz/film/300902-bohemian-rhapsody/prehled/");
dd($doc->saveHTML());
Does anybody know what to do?

I guess it is because of the site's compression; you can extract the data using good old cURL:
<?php
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, 'https://www.csfd.cz/film/300902-bohemian-rhapsody/prehled/');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CUSTOMREQUEST, 'GET');
curl_setopt($ch, CURLOPT_ENCODING, 'gzip, deflate');
$headers = array();
$headers[] = 'Connection: keep-alive';
$headers[] = 'Cache-Control: max-age=0';
$headers[] = 'Save-Data: on';
$headers[] = 'Upgrade-Insecure-Requests: 1';
$headers[] = 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.119 Safari/537.36';
$headers[] = 'Dnt: 1';
$headers[] = 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8';
$headers[] = 'Accept-Encoding: gzip, deflate, br';
$headers[] = 'Accept-Language: en-US;q=0.8,en;q=0.7,uk;q=0.6';
$headers[] = 'Cookie: nette-samesite=1; developers-ad=1;';
curl_setopt($ch, CURLOPT_HTTPHEADER, $headers);
$result = curl_exec($ch);
if (curl_errno($ch)) {
    echo 'Error:' . curl_error($ch);
}
curl_close($ch);
$doc = new \DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($result);
dd($doc->saveHTML());
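A guess as to what those characters actually are: gzip-compressed bytes rendered as if they were text. A minimal local demo, no network involved (the sample HTML string is made up):

```php
<?php
// Compress a snippet of HTML the way a gzipping server would,
// then decompress it the way curl does when CURLOPT_ENCODING is set.
$html = '<html><body>Bohemian Rhapsody</body></html>';

$compressed = gzencode($html);       // raw response body from a gzipping server
$restored   = gzdecode($compressed); // what a decompressing client recovers

// gzip payloads always start with the magic bytes 0x1f 0x8b;
// printed as text, the rest looks like random CJK/Unicode glyphs.
echo bin2hex(substr($compressed, 0, 2)), "\n"; // 1f8b
var_dump($restored === $html);                 // bool(true)
```

In the cURL snippet above, CURLOPT_ENCODING is what tells libcurl to decompress the body itself; DOMDocument::loadHTMLFile has no such hook, which matches the garbled output.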

Related

Why is my GraphQL Query using curlopt_postfields returning a json error?

I'm trying to get some code of mine to work, but I keep getting the following error. Any thoughts on what's going wrong here? I think I have all the quotations escaped correctly:
{"errors":[{"message":"json body could not be decoded: invalid
character 'L' after object key:value pair"}],"data":null}
I know my query is correct, as I can run it in the GraphQL playground and get the data.
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, 'https://xxxxxxxxxxxx.com/api/v4/endpoint');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_POST, 1);
curl_setopt($ch, CURLOPT_POSTFIELDS, "{\"query\":\"{ search(q: \"LM123\") { results { part { mpn manufacturer { name }}}}\"}");
curl_setopt($ch, CURLOPT_ENCODING, 'gzip, deflate');
$headers = array();
$headers[] = 'Accept-Encoding: gzip, deflate';
$headers[] = 'Content-Type: application/json';
$headers[] = 'Accept: application/json';
$headers[] = 'Connection: keep-alive';
$headers[] = 'Dnt: 1';
$headers[] = 'Origin: https://xxxxxxxxxxxxx.com';
$headers[] = 'Token: xxxxxxxxxxxxxxxxxxxxxxxx';
curl_setopt($ch, CURLOPT_HTTPHEADER, $headers);
$result = curl_exec($ch);
if (curl_errno($ch)) {
    echo 'Error:' . curl_error($ch);
}
curl_close($ch);
echo $result;
If I run a simple query that doesn't search for a term, it works perfectly, like:
curl_setopt($ch, CURLOPT_POSTFIELDS, "{\"query\":\"{ categories { name }}\"}");
The problem is the double quotes around \"LM123\". When the JSON is parsed, the parser expects that this \" ends the value and that , \"other_key\": \"...\" follows, but instead it finds LM123....
You can try something like this:
curl_setopt($ch, CURLOPT_POSTFIELDS, '{"query":"{ search(q: \"LM123\") { results { part { mpn manufacturer { name }}}}"}');
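A way to avoid the hand-escaping entirely is to let json_encode build the body; it escapes the inner quotes for you. A minimal sketch (the query string is copied from the question):

```php
<?php
// Build the GraphQL request body with json_encode instead of
// hand-escaping quotes inside a string literal.
$query = '{ search(q: "LM123") { results { part { mpn manufacturer { name }}}}';

$payload = json_encode(['query' => $query]);

// json_encode escapes the inner double quotes, producing valid JSON:
echo $payload, "\n";
// {"query":"{ search(q: \"LM123\") { results { part { mpn manufacturer { name }}}}"}

// Round-trip to confirm the body decodes cleanly:
$decoded = json_decode($payload, true);
var_dump($decoded['query'] === $query); // bool(true)
```

The resulting $payload can then be passed straight to `curl_setopt($ch, CURLOPT_POSTFIELDS, $payload);`.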

PHP Curl - Is it possible to evade Cloudflare block?

We have recently encountered Cloudflare blocking on some sites we have had no trouble scraping in the past.
It is not an IP block (we tried from several IPs), and it is not tied to an account or any other kind of authentication. The site does not show captchas to users.
We created a PHP cURL request using the exact same GET request with all headers, but we receive a 403 Forbidden error and it displays:
www.******.com used CloudFlare to restrict access
Forgive my ignorance, but how does Cloudflare detect this? There are no cookies involved (as it's the initial site request), and the user agent and everything else are identical.
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, 'https://www.******.com/');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CUSTOMREQUEST, 'GET');
curl_setopt($ch, CURLOPT_VERBOSE, TRUE);
$headers = array();
$headers[] = 'Host: www.******.com';
$headers[] = 'Connection: keep-alive';
$headers[] = 'Sec-Ch-Ua: "Google Chrome";v="89", "Chromium";v="89", ";Not A Brand";v="99"';
$headers[] = 'Sec-Ch-Ua-Mobile: ?0';
$headers[] = 'Upgrade-Insecure-Requests: 1';
$headers[] = 'User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.114 Safari/537.36';
$headers[] = 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9';
$headers[] = 'Sec-Fetch-Site: none';
$headers[] = 'Sec-Fetch-Mode: navigate';
$headers[] = 'Sec-Fetch-User: ?1';
$headers[] = 'Sec-Fetch-Dest: document';
$headers[] = 'Accept-Encoding: gzip, deflate, br';
$headers[] = 'Accept-Language: en-US,en;q=0.9';
curl_setopt($ch, CURLOPT_HTTPHEADER, $headers);
$result = curl_exec($ch);
Result:
HTTP/1.1 403 Forbidden
Any possible workaround?
Thank you
I think only Cloudflare developers can give you the correct answer, but they will not reveal their commercial secrets so easily.
Everything else is theories and speculation. As far as I know, in response to the first request to a site, Cloudflare serves a page with some JavaScript that profiles the browser, to prove it is a stock browser used by a human, not cURL.
If the browser is a stock one, then Cloudflare allows requests to be served by the backend.
I think you can try Selenium; sometimes it doesn't trigger the Cloudflare blocker page with the captcha.

php curl post not processing, acting like get with no data

I am trying to use PHP's cURL to submit a site's form for me and then extract the result, but it is not working. Instead it shows a blank form, as if I had just done a basic GET request to the page.
<?php
$domains = [
'expireddomains.net',
'stackoverflow.com',
'toastup.com'
];
$ccd = '';
foreach ($domains as $domain) {
    $ccd .= $domain . '\r\n';
}
// set post fields
$post = [
'removedoubles' => '1',
'removeemptylines' => '1',
'showallwordmatches' => '1',
'wordlist' => 'en-v1',
'camelcasedomains' => $ccd,
'button_submit' => 'Camel+Case+Domains'
];
$ch = curl_init('https://www.expireddomains.net/tools/camel-case-domain-names/');
curl_setopt($ch, CURLOPT_POST, 1);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, $post);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, 0);
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 0);
$headers = [
'Referer: https://www.expireddomains.net/tools/camel-case-domain-names/',
'Content-Type: application/x-www-form-urlencoded',
'Origin: https://www.expireddomains.net'
];
curl_setopt($ch, CURLOPT_HTTPHEADER, $headers);
// execute!
$response = curl_exec($ch);
// close the connection, release resources used
curl_close($ch);
I confirmed the post data formatting and field names using the network tab in the browser's dev tools to check the request.
Originally I wasn't sending any headers; then I thought maybe the site validated the Origin or Referer, but even adding those didn't work.
I checked: the form doesn't include any hidden fields for something like a CSRF token.
Any ideas?
For application/x-www-form-urlencoded, use http_build_query and let it encode the values (the +'s etc.); also, the separator between domains is |, not newlines.
<?php
$domains = [
'expireddomains.net',
'stackoverflow.com',
'toastup.com'
];
// set post fields
$post = [
'removedoubles' => 1,
'removeemptylines' => 1,
'showallwordmatches' => 1,
'wordlist' => 'en-v1',
'camelcasedomains' => implode(' | ', $domains),
'button_submit' => 'Camel Case Domains'
];
$ch = curl_init('https://www.expireddomains.net/tools/camel-case-domain-names/');
$headers = array();
$headers[] = 'authority: www.expireddomains.net';
$headers[] = 'pragma: no-cache';
$headers[] = 'cache-control: no-cache';
$headers[] = 'origin: https://www.expireddomains.net';
$headers[] = 'upgrade-insecure-requests: 1';
$headers[] = 'dnt: 1';
$headers[] = 'content-Type: application/x-www-form-urlencoded';
$headers[] = 'user-agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.125 Safari/537.36';
$headers[] = 'accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9';
$headers[] = 'sec-fetch-site: same-origin';
$headers[] = 'sec-fetch-mode: navigate';
$headers[] = 'sec-fetch-user: ?1';
$headers[] = 'sec-fetch-dest: document';
$headers[] = 'referer: https://www.expireddomains.net/tools/camel-case-domain-names/';
$headers[] = 'referrer-policy: same-origin';
$headers[] = 'accept-language: en-GB,en-US;q=0.9,en;q=0.8';
curl_setopt($ch, CURLOPT_HTTPHEADER, $headers);
curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query($post));
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, 0);
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 0);
curl_setopt($ch, CURLOPT_ENCODING, 'gzip, deflate');
// execute!
$response = curl_exec($ch);
// close the connection, release resources used
curl_close($ch);
// parse whats in textarea
$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($response);
libxml_clear_errors();
$result = [];
foreach ($dom->getElementsByTagName('textarea') as $textarea) {
    if ($textarea->getAttribute('name') === "camelcasedomains") {
        $result = explode(' | ', $textarea->nodeValue);
    }
}
print_r($result);
Result:
Array
(
    [0] => ExpiredDomains.net
    [1] => ExpiredDoMains.net
)
You could probably remove most of the headers if they're not needed; I just added them all to exactly match the browser's request. The fix ended up being the aforementioned encoding.
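To see exactly what http_build_query does to the values, here is a small standalone check (field values taken from the form above):

```php
<?php
// http_build_query percent-encodes each value for an
// application/x-www-form-urlencoded body: spaces become '+',
// '|' becomes '%7C' -- no manual escaping needed.
$post = [
    'wordlist'         => 'en-v1',
    'camelcasedomains' => implode(' | ', ['expireddomains.net', 'stackoverflow.com']),
    'button_submit'    => 'Camel Case Domains',
];

echo http_build_query($post), "\n";
// wordlist=en-v1&camelcasedomains=expireddomains.net+%7C+stackoverflow.com&button_submit=Camel+Case+Domains
```

This is why the original code, which sent 'Camel+Case+Domains' pre-encoded, double-encoded the plus signs once http_build_query was applied; start from the plain values and encode once.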

curl doesn't return response or http code after several times it does

I want to send many requests to a website and find the ID of the last post that exists.
As my host hits that website's request limit after several requests, I expect the cURL request to return an error so that I can save the last post's ID in my database and continue crawling later.
But after about 200 successful requests, cURL doesn't return any response or HTTP code.
To be specific, I want to get the posts of a Telegram channel from a given ID to the end.
Here is the function that I have written for this purpose:
function get_post_html_content($channel_username, $message_id){
    try {
        error_log($message_id."\n");
        $ch = curl_init();
        curl_setopt($ch, CURLOPT_URL,
            "https://t.me/".$channel_username."/".$message_id."?embed=1");
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
        curl_setopt($ch, CURLOPT_CUSTOMREQUEST, 'GET');
        curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5);
        curl_setopt($ch, CURLOPT_TIMEOUT, 60);
        $headers = array();
        $headers[] = 'Pragma: no-cache';
        $headers[] = 'Sec-Fetch-Site: same-origin';
        $headers[] = 'Origin: https://t.me';
        $headers[] = 'Accept-Encoding: gzip,deflate';
        $headers[] = 'Accept-Language: en-US,en;q=0.9';
        $headers[] = 'Sec-Fetch-Mode: cors';
        $headers[] = 'Content-Type: application/x-www-form-urlencoded';
        $headers[] = 'Accept: */*';
        $headers[] = 'Cache-Control: no-cache';
        $headers[] = 'Referrer Policy: no-referrer-when-downgrade';
        $headers[] = 'Connection: keep-alive';
        curl_setopt($ch, CURLOPT_HTTPHEADER, $headers);
        $content = curl_exec($ch);
        if (!$content) {
            $errno = curl_errno($ch);
            $error = curl_error($ch);
            error_log("Curl returned error $errno: $error\n");
            curl_close($ch);
            return false;
        }
        $http_code = intval(curl_getinfo($ch, CURLINFO_HTTP_CODE));
        error_log("http code: ".$http_code."\n");
    } catch (Exception $e) {
        error_log($e->getMessage());
    }
    $content = gzdecode($content);
    curl_close($ch);
    return $content;
}
The problem: after HTTP code 200 has been printed to the error log several times and the function has returned content, suddenly no HTTP code is printed to the error log, and the function doesn't even return false so that I could save the last post's ID in the database.
So how can I change this function to return false in this situation?
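One thing worth checking in the function above: with CURLOPT_RETURNTRANSFER, curl_exec() returns false on a transport error (timeout, DNS failure) but an empty string '' for an empty body, and !$content conflates the two. A minimal sketch of the distinction, forcing a failure with the unroutable TEST-NET address 192.0.2.1 (assumed unreachable, per RFC 5737):

```php
<?php
// Force a transport-level failure to show how curl reports it.
$ch = curl_init('http://192.0.2.1/'); // TEST-NET-1: never routed
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 2);
curl_setopt($ch, CURLOPT_TIMEOUT, 3);

$content = curl_exec($ch);
$errno   = curl_errno($ch); // read before closing the handle
$error   = curl_error($ch);
curl_close($ch);

// On transport errors the return value is strictly false,
// and curl_errno() is non-zero (e.g. 7 couldn't connect, 28 timed out).
var_dump($content === false); // bool(true)
var_dump($errno > 0);         // bool(true)
```

Checking $content === false (and logging $errno) instead of !$content separates "the request failed" from "the server answered with nothing", which may be the case where you want to save the last post ID.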

PHP scrapy cURL request not working

I am implementing a scrapy spider to crawl a website that contains real estate offers. The site shows a telephone number for the real estate agent, which can be retrieved with an AJAX POST request.
To get a phone number I have to take the ID from the URL, then get the csrfToken from the page source, and send it with a POST to a special URL containing the ID. This method was working well, but since yesterday it is not.
My code:
$urlSite = "https://www.otodom.pl/mazowieckie/oferta/piekne-mieszkanie-na-mokotowie-do-wynajecia-ID3ezHA.html";
$ch = curl_init();
curl_setopt($ch, CURLOPT_CUSTOMREQUEST, "GET");
curl_setopt($ch, CURLOPT_URL, $urlSite);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$result = curl_exec($ch);
curl_close($ch);
preg_match("/csrfToken = '(.+?)'/", $result, $output_array);
preg_match("/ID(.+?).html/", $urlSite, $output_array_id);
$token = $output_array[1];
$id = $output_array_id[1];
$url = "https://www.otodom.pl/ajax/mazowieckie/misc/contact/phone/" . $id . "/";
$headers = [
'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Encoding: gzip, deflate, br',
'Accept-Language: pl,en-US;q=0.8,en;q=0.6,ru;q=0.4',
'Cache-Control: no-cache',
'Content-Type: application/x-www-form-urlencoded; charset=UTF-8',
'Content-Length: 74',
'Host: www.otodom.pl',
'Referer: ' . $urlSite,
'User-Agent: Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.116 Safari/537.36'
];
$data = array(
'CSRFToken' => $token
);
$data_string = http_build_query($data);
$ch = curl_init();
curl_setopt($ch, CURLOPT_CUSTOMREQUEST, "POST");
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_POST, 1);
curl_setopt($ch, CURLOPT_POSTFIELDS, $data_string);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_HTTPHEADER, $headers);
$phone = utf8_decode(curl_exec($ch));
curl_close($ch);
echo $phone;
Please help me, I have been working on this for a few hours with nothing to show. All I get is:
{"status":"error","message":"Spróbuj wykonać operację ponownie. Jeśli
to nie pomoże, sprawdź czy masz włączoną obsługę JavaScript w
przeglądarce."}
(Translation: "Try the operation again. If that does not help, check whether JavaScript support is enabled in your browser.")
As I mentioned in my comment, you need JavaScript in order to get the phone number. One way to achieve this is using Selenium; here's a Python example:
import time
from selenium import webdriver

geckodriver = 'C:/path_to/geckodriver.exe'
driver = webdriver.Firefox(executable_path=geckodriver)
driver.get("https://www.otodom.pl/mazowieckie/oferta/piekne-mieszkanie-na-mokotowie-do-wynajecia-ID3ezHA.html")
driver.find_element_by_class_name("phone-spoiler").click()
time.sleep(2)
print(driver.find_element_by_class_name("phone-number").text)
# 515 174 616
Notes:
1 - Install Selenium:
pip install selenium
2 - Download the geckodriver
3 - Replace C:/path_to with the path where you saved geckodriver.exe.
4 - Add C:/path_to to your PATH environment variable.
5 - Restart your system.
6 - Run python name_of_script.py and the phone number will be displayed.
The steps above assume that you're using a Windows machine.