Select HTML content using PHP

I want to get the paragraphs under a specific tag.
Here is what I tried:
<?php
$doc = new DOMDocument();
$doc->loadHTMLFile("https://sabq.org/xMQjz2");
$elements = $doc->getElementsByTagName('p');
if (!is_null($elements)) {
    foreach ($elements as $element) {
        $nodes = $element->childNodes;
        foreach ($nodes as $node) {
            echo $node->textContent . "\n";
        }
    }
}
?>
And I got the paragraphs I wanted along with unwanted ones, and they were duplicated.
EDIT:
I changed the URL, hope it works
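As a hedged sketch of one possible fix, assuming the duplicates are literal repeats of the same paragraph text elsewhere in the page: print each <p>'s textContent once and skip text already seen.
<?php
// Sketch: print each paragraph once and skip already-seen text. Assumes the
// duplicates are literal repeats of the same paragraph elsewhere in the page.
libxml_use_internal_errors(true); // real-world HTML rarely validates cleanly

$doc = new DOMDocument();
$doc->loadHTMLFile("https://sabq.org/xMQjz2");

$seen = array();
foreach ($doc->getElementsByTagName('p') as $element) {
    $text = trim($element->textContent);
    if ($text !== '' && !in_array($text, $seen, true)) {
        $seen[] = $text;
        echo $text . "\n";
    }
}
?>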

The link that you provided throws an error when accessed directly, so I found a function that fetches the contents of the webpage using cURL instead of the DOMDocument class you were using.
I then used preg_match() with a regex to extract the specific element you were looking for.
Here's the code:
<?php
// Open the URL (cURL is used because DOMDocument threw "failed to open stream").
$content = get_fcontent("https://sabq.org/%D8%B4%D8%A7%D9%87%D8%AF-%D8%A3%D9%84%D9%81-%D8%B5%D9%81%D8%AD%D8%A9-%D8%AA%D8%B1%D9%88%D9%8A-%D9%82%D8%B5%D8%B5-%D8%A7%D9%84%D8%AD%D8%B1%D9%85%D9%8A%D9%86-%D9%85%D9%86%D8%B0-%D8%A7%D9%86%D8%B7%D9%84%D8%A7%D9%82-%D8%A7%D9%84%D8%B9%D9%87%D8%AF-%D8%A7%D9%84%D8%B3%D8%B9%D9%88%D8%AF%D9%8A");

// Extract the specific HTML tag and its innerHTML.
preg_match('/<p .*? ng\-bind\-html\=\"getContent\(material\.content\)\" .*?>.*?<\/p>/m', $content[0], $matches);

// Display the wanted element.
echo $matches[0];

// Fetch a URL's contents with cURL, following HTTP, meta and JavaScript redirects.
function get_fcontent( $url, $javascript_loop = 0, $timeout = 5 ) {
    $url    = str_replace( "&amp;", "&", urldecode(trim($url)) );
    $cookie = tempnam("/tmp", "CURLCOOKIE");

    $ch = curl_init();
    curl_setopt( $ch, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows; U; Windows NT 5.1; rv:1.7.3) Gecko/20041001 Firefox/0.10.1" );
    curl_setopt( $ch, CURLOPT_URL, $url );
    curl_setopt( $ch, CURLOPT_COOKIEJAR, $cookie );
    curl_setopt( $ch, CURLOPT_FOLLOWLOCATION, true );
    curl_setopt( $ch, CURLOPT_ENCODING, "" );
    curl_setopt( $ch, CURLOPT_RETURNTRANSFER, true );
    curl_setopt( $ch, CURLOPT_AUTOREFERER, true );
    curl_setopt( $ch, CURLOPT_SSL_VERIFYPEER, false ); # required for https urls
    curl_setopt( $ch, CURLOPT_CONNECTTIMEOUT, $timeout );
    curl_setopt( $ch, CURLOPT_TIMEOUT, $timeout );
    curl_setopt( $ch, CURLOPT_MAXREDIRS, 10 );

    $content  = curl_exec( $ch );
    $response = curl_getinfo( $ch );
    curl_close( $ch );

    // Follow a redirect reported only in the response headers.
    if ($response['http_code'] == 301 || $response['http_code'] == 302) {
        ini_set("user_agent", "Mozilla/5.0 (Windows; U; Windows NT 5.1; rv:1.7.3) Gecko/20041001 Firefox/0.10.1");
        if ( $headers = get_headers($response['url']) ) {
            foreach ( $headers as $value ) {
                if ( substr( strtolower($value), 0, 9 ) == "location:" )
                    return get_fcontent( trim( substr( $value, 9, strlen($value) ) ) );
            }
        }
    }

    // Follow a JavaScript window.location redirect, up to five levels deep.
    if ( ( preg_match("/>[[:space:]]+window\.location\.replace\('(.*)'\)/i", $content, $value) || preg_match("/>[[:space:]]+window\.location\=\"(.*)\"/i", $content, $value) ) && $javascript_loop < 5 ) {
        return get_fcontent( $value[1], $javascript_loop + 1 );
    } else {
        return array( $content, $response );
    }
}
?>
For testing, I created a local file called test.html:
<!DOCTYPE html>
<html>
<head>
<title></title>
</head>
<body>
<p>This should not be showing.</p>
<p ng-bind-html="getContent(material.content)" id="dev-content" class="details-text">This is a test.</p>
</body>
</html>
I used the local URL http://localhost/example/test.html instead of the link you provided, and got the following result:
<p ng-bind-html="getContent(material.content)" id="dev-content" class="details-text">This is a test.</p>
Here's the result that I got from the original URL:
<p ng-bind-html="getContent(material.content)" id="dev-content" class="details-text"></p>
I hope this helps!
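As a follow-up, the same element can be selected without a regex once the HTML is in hand. A minimal sketch using DOMXPath and the id="dev-content" attribute from the test file above, reusing the get_fcontent() helper:
<?php
// Sketch: DOMXPath alternative to the regex, keyed on the id attribute seen
// in test.html. get_fcontent() is the helper defined in the answer above.
list($html, $info) = get_fcontent("http://localhost/example/test.html");

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);

$xpath = new DOMXPath($doc);
foreach ($xpath->query('//p[@id="dev-content"]') as $p) {
    echo $doc->saveHTML($p) . "\n"; // prints the element with its markup
}
?>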


PHP cURL and Simple HTML Dom

I'm sorry, but I speak a little English only.
I use this:
<?php
function file_get_contents_curl( $url ) {
    $ch = curl_init();
    curl_setopt( $ch, CURLOPT_AUTOREFERER, TRUE );
    curl_setopt( $ch, CURLOPT_HEADER, 0 );
    curl_setopt( $ch, CURLOPT_RETURNTRANSFER, 1 );
    curl_setopt( $ch, CURLOPT_URL, $url );
    curl_setopt( $ch, CURLOPT_FOLLOWLOCATION, TRUE );
    curl_setopt( $ch, CURLOPT_SSL_VERIFYPEER, 0 );
    curl_setopt( $ch, CURLOPT_SSL_VERIFYHOST, 0 );
    curl_setopt( $ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 10.0; rv:71.0) Gecko/20100101 Firefox/71.0' ); // spoof
    $data = curl_exec( $ch );
    curl_close( $ch );
    return $data;
}

include( __DIR__ . '/simplehtmldom_1_9_1/simple_html_dom.php' );

// 1. OK:     $url = 'https://www.p***hub.com/model/ashley-porner';
// 2. OK:     $url = 'https://www.p***hub.com/model/ashley-diamond-and-diamond-king';
// 3. NOT OK: $url = 'https://www.p***hub.com/model/ambercashh';
// 4. NOT OK: $url = 'https://www.p***hub.com/model/autumn-raine';

$html = file_get_contents_curl( $url );
$html = str_get_html( $html );
var_dump( $html ); // bool(false) if NOT OK
?>
URLs 1 and 2 are OK, but URLs 3 and 4 are not: nothing is shown, and the return value is false.
I tried changing MAX_FILE_SIZE from 600000 to 6000000 in ~/simplehtmldom_1_9_1/simple_html_dom.php, but with the new value the page takes much longer to load and then crashes my website:
// OLD: defined('MAX_FILE_SIZE') || define('MAX_FILE_SIZE', 600000);
defined('MAX_FILE_SIZE') || define('MAX_FILE_SIZE', 6000000); // NEW
What is the problem?
Thanks.
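One diagnostic worth running before raising the limit further: simple_html_dom's str_get_html() returns false when the input exceeds MAX_FILE_SIZE, so checking the size of what cURL actually fetched may show whether pages 3 and 4 are simply larger than the limit. A sketch reusing the helper from the question:
<?php
// Sketch: compare the fetched page size against simple_html_dom's limit.
// Assumes simple_html_dom.php is already included, as in the question.
$url  = 'https://www.p***hub.com/model/ambercashh'; // one of the failing URLs
$html = file_get_contents_curl($url);

printf("Fetched %d bytes; MAX_FILE_SIZE is %d\n", strlen($html), MAX_FILE_SIZE);

if (strlen($html) > MAX_FILE_SIZE) {
    echo "Page exceeds MAX_FILE_SIZE, so str_get_html() will return false.\n";
}
?>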
As a test you can run the following (the URLs will obviously need editing), but it shows reasonable performance; whatever made you run out of memory must therefore lie in code that was not included.
<?php
function file_get_contents_curl( $url ) {
    $ch = curl_init();
    curl_setopt( $ch, CURLOPT_AUTOREFERER, TRUE );
    curl_setopt( $ch, CURLOPT_HEADER, 0 );
    curl_setopt( $ch, CURLOPT_RETURNTRANSFER, 1 );
    curl_setopt( $ch, CURLOPT_URL, $url );
    curl_setopt( $ch, CURLOPT_FOLLOWLOCATION, TRUE );
    curl_setopt( $ch, CURLOPT_SSL_VERIFYPEER, 0 );
    curl_setopt( $ch, CURLOPT_SSL_VERIFYHOST, 0 );
    curl_setopt( $ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 10.0; rv:71.0) Gecko/20100101 Firefox/71.0' ); // spoof
    $data = curl_exec( $ch );
    curl_close( $ch );
    return $data;
}

$start    = time();
$memstart = memory_get_usage();

$baseurl = 'https://www.*******.com/model/';
$models  = [ 'ashley-porner', 'ashley-diamond-and-diamond-king', 'ambercashh', 'autumn-raine' ];

libxml_use_internal_errors( true );

$dom = new DOMDocument;
$dom->validateOnParse     = false;
$dom->recover             = true;
$dom->strictErrorChecking = false;

/* do some expensive DOM operations to test performance */
$query = '//section[ @class="topProfileHeader" ]/div/div/div[ @class="content-columns" ]/div[ @class="infoPiece" ]';

foreach ( $models as $model ) {
    $url = $baseurl . $model;
    $res = file_get_contents_curl( $url );
    $dom->loadHTML( $res );
    $xp = new DOMXPath( $dom );
    libxml_clear_errors();
    $col = $xp->query( $query );
    if ( $col->length > 0 ) {
        foreach ( $col as $node ) {
            echo str_repeat( '.', strlen( $node->nodeValue ) ) . '<br />';
        }
    }
}

$memory = memory_get_usage() - $memstart;
printf(
    '<div style="padding:1rem; border:1px solid red;">Script took approx: %ss - consumed: %sMb, Peak memory consumption: %sMb</div>',
    ( time() - $start ),
    round( $memory / pow( 1024, 2 ), 2 ),
    round( memory_get_peak_usage() / pow( 1024, 2 ), 2 )
);
?>

PHP's fopen() can't get url, but curl can

A while back, I wrote a little utility function that takes inPath and outPath, opens both, and copies from one to the other using fread() and fwrite(). allow_url_fopen is enabled.
Now I've got a URL that I'm trying to get the contents of, and fopen() doesn't get any data, but if I use cURL to do the same thing, it works.
The URL in question is: http://www.deltagroup.com/Feeds/images.php?lid=116582497&id=1
fopen version:
$in  = @fopen( $inPath, "rb" );
$out = @fopen( $outPath, "wb" );

if ( !$in || !$out ) {
    echo 0;
}

while ( $chunk = fread( $in, 8192 ) ) {
    fwrite( $out, $chunk, 8192 );
}

fclose( $in );
fclose( $out );

if ( file_exists( $outPath ) ) {
    echo 1;
} else {
    echo 0;
}
curl version:
$opt = "curl -o " . $outPath . " " . $inPath;
$res = `$opt`;

if ( file_exists( $outPath ) ) {
    echo 1;
} else {
    echo 0;
}
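As an aside, shelling out like this breaks if either path contains spaces or shell metacharacters; escapeshellarg() guards against that. A small sketch:
// Sketch: quote both paths before interpolating them into the shell command.
$opt = "curl -o " . escapeshellarg( $outPath ) . " " . escapeshellarg( $inPath );
$res = `$opt`;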
Any idea why this would happen?
Even using PHP's cURL, I was unable to download the file, until I added a CURLOPT_USERAGENT string. Nothing in the response indicated that it was required (no errors, nothing other than an HTTP 200).
Final code:
$out = @fopen( $outPath, "wb" );
if ( !$out ) {
    echo 0;
}

$ch = curl_init();
curl_setopt( $ch, CURLOPT_URL, $inPath );
curl_setopt( $ch, CURLOPT_RETURNTRANSFER, true );
curl_setopt( $ch, CURLOPT_FILE, $out ); // write the response straight to $out
curl_setopt( $ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en; rv:1.9.2.13) Gecko/20101203 Firefox/3.6.13' );
curl_setopt( $ch, CURLOPT_FOLLOWLOCATION, true );
curl_setopt( $ch, CURLOPT_CONNECTTIMEOUT, 15 );
curl_setopt( $ch, CURLOPT_TIMEOUT, 18000 );
$data = curl_exec( $ch );
curl_close( $ch );
fclose( $out );

if ( file_exists( $outPath ) ) {
    echo 1;
} else {
    echo 0;
}
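For completeness, the http stream wrapper behind fopen() can also send a User-Agent via a stream context, so the original fread()/fwrite() utility might have worked too. A hedged sketch, using the same variables as above:
<?php
// Sketch: send a User-Agent through the http stream wrapper instead of cURL.
// $inPath and $outPath are the same variables as in the utility above.
$context = stream_context_create(array(
    'http' => array(
        'user_agent'      => 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en; rv:1.9.2.13) Gecko/20101203 Firefox/3.6.13',
        'follow_location' => 1,
        'timeout'         => 15,
    ),
));

$in  = @fopen( $inPath, "rb", false, $context );
$out = @fopen( $outPath, "wb" );

while ( $in && $out && ( $chunk = fread( $in, 8192 ) ) ) {
    fwrite( $out, $chunk );
}
?>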

How to download a copy of the html from a website?

How do I download a copy of the HTML from a website that has language detection (e.g. Google, YouTube) and redirection? I have tried file_get_contents, but it is too limited.
I am trying to use cURL in PHP to get the HTML from www.google.com, but it detects that I am from the UK and sends me a 302 redirect to www.google.co.uk.
I have tried many different things with no joy. Is this even possible? Websites like www.markosweb.com do it.
My code:
$url = "http://www.google.com/";
$ch  = curl_init( $url );

// $userAgent = "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0)";
// $userAgent = 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.1.4322)';
$userAgent = 'Googlebot/2.1 (http://www.googlebot.com/bot.html)';

$header = array(
    "Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5",
    "Accept-Language: en-US,us;q=0.7,en-us;q=0.5,en;q=0.3",
    "Accept-Charset: windows-1251,utf-8;q=0.7,*;q=0.7",
    "Keep-Alive: 300");

curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);   // return the transfer as a string instead of outputting it directly
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5);      // seconds to wait while trying to connect
curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);  // contents of the "User-Agent:" header
curl_setopt($ch, CURLOPT_FAILONERROR, TRUE);      // fail silently if the HTTP code returned is >= 400
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);      // follow any "Location:" header the server sends
curl_setopt($ch, CURLOPT_AUTOREFERER, TRUE);      // automatically set the Referer: field when following a redirect
curl_setopt($ch, CURLOPT_TIMEOUT, 10);            // maximum number of seconds cURL functions may run
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, FALSE);
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, FALSE);
curl_setopt($ch, CURLOPT_REFERER, $url);
curl_setopt($ch, CURLOPT_HTTPHEADER, $header);    // send the request headers defined above

$content = curl_exec( $ch );
$err     = curl_errno( $ch );
$errmsg  = curl_error( $ch );
$header  = curl_getinfo( $ch );
curl_close( $ch );

$header['errno']   = $err;
$header['errmsg']  = $errmsg;
$header['content'] = $content;
return $header;
I have tried changing the user agent to lots of things, and tried with and without the header details. I managed to get something if I used the header "Accept-Language: ru-ru,ru;q=0.7,en-us;q=0.5,en;q=0.3", but the result was in Russian.
Thanks for your help.
Carl
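As a first diagnostic, a stripped-down request that pins Accept-Language to English and reports where the redirect actually lands may help; CURLINFO_EFFECTIVE_URL gives the final URL after CURLOPT_FOLLOWLOCATION. Whether Google honours the header over IP geolocation is not guaranteed. A sketch:
<?php
// Sketch: minimal request with an English Accept-Language header; prints the
// URL cURL ended up at after following redirects.
$ch = curl_init("http://www.google.com/");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_HTTPHEADER, array("Accept-Language: en-US,en;q=0.5"));
$content = curl_exec($ch);
echo curl_getinfo($ch, CURLINFO_EFFECTIVE_URL) . "\n"; // e.g. www.google.co.uk if still redirected
curl_close($ch);
?>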
Try this proxy script:
// Change these configuration options if needed, see above descriptions for info.
$enable_jsonp    = false;
$enable_native   = false;
$valid_url_regex = '/.*/';

// ############################################################################

$url = $_GET['url'];

if ( !$url ) {
    // Passed url not specified.
    $contents = 'ERROR: url not specified';
    $status   = array( 'http_code' => 'ERROR' );
} else if ( !preg_match( $valid_url_regex, $url ) ) {
    // Passed url doesn't match $valid_url_regex.
    $contents = 'ERROR: invalid url';
    $status   = array( 'http_code' => 'ERROR' );
} else {
    $ch = curl_init( $url );

    if ( strtolower($_SERVER['REQUEST_METHOD']) == 'post' ) {
        curl_setopt( $ch, CURLOPT_POST, true );
        curl_setopt( $ch, CURLOPT_POSTFIELDS, $_POST );
    }

    if ( $_GET['send_cookies'] ) {
        $cookie = array();
        foreach ( $_COOKIE as $key => $value ) {
            $cookie[] = $key . '=' . $value;
        }
        if ( $_GET['send_session'] ) {
            $cookie[] = SID;
        }
        $cookie = implode( '; ', $cookie );
        curl_setopt( $ch, CURLOPT_COOKIE, $cookie );
    }

    curl_setopt( $ch, CURLOPT_FOLLOWLOCATION, true );
    curl_setopt( $ch, CURLOPT_HEADER, true );
    curl_setopt( $ch, CURLOPT_RETURNTRANSFER, true );
    curl_setopt( $ch, CURLOPT_USERAGENT, $_GET['user_agent'] ? $_GET['user_agent'] : $_SERVER['HTTP_USER_AGENT'] );

    list( $header, $contents ) = preg_split( '/([\r\n][\r\n])\\1/', curl_exec( $ch ), 2 );

    $status = curl_getinfo( $ch );
    curl_close( $ch );
}

// Split header text into an array.
$header_text = preg_split( '/[\r\n]+/', $header );

if ( $_GET['mode'] == 'native' ) {
    if ( !$enable_native ) {
        $contents = 'ERROR: invalid mode';
        $status   = array( 'http_code' => 'ERROR' );
    }

    // Propagate headers to response.
    foreach ( $header_text as $header ) {
        if ( preg_match( '/^(?:Content-Type|Content-Language|Set-Cookie):/i', $header ) ) {
            header( $header );
        }
    }

    print $contents;
} else {
    // $data will be serialized into JSON data.
    $data = array();

    // Propagate all HTTP headers into the JSON data object.
    if ( $_GET['full_headers'] ) {
        $data['headers'] = array();
        foreach ( $header_text as $header ) {
            preg_match( '/^(.+?):\s+(.*)$/', $header, $matches );
            if ( $matches ) {
                $data['headers'][ $matches[1] ] = $matches[2];
            }
        }
    }

    // Propagate all cURL request / response info to the JSON data object.
    if ( $_GET['full_status'] ) {
        $data['status'] = $status;
    } else {
        $data['status'] = array();
        $data['status']['http_code'] = $status['http_code'];
    }

    // Set the JSON data object contents, decoding it from JSON if possible.
    $decoded_json = json_decode( $contents );
    $data['contents'] = $decoded_json ? $decoded_json : $contents;

    // Generate appropriate content-type header.
    $is_xhr = strtolower($_SERVER['HTTP_X_REQUESTED_WITH']) == 'xmlhttprequest';
    header( 'Content-type: application/' . ( $is_xhr ? 'json' : 'x-javascript' ) );

    // Get JSONP callback.
    $jsonp_callback = $enable_jsonp && isset($_GET['callback']) ? $_GET['callback'] : null;

    // Generate JSON/JSONP string.
    $json = json_encode( $data );

    print $jsonp_callback ? "$jsonp_callback($json)" : $json;
}
Make sure to perform a request like this:
http://example.com/script?url=http://whateverurl.com/
This PHP script returns the result as JSON, which you can then parse using jQuery.
For example, I use this jQuery code:
<script type="text/javascript">
$(document).ready(function(){
    var url = '+++++URL WHICH THE PHP PROXY SCRIPT IS IN++++++';
    $(window).load(function(){
        $.getJSON(url, function(json){
            $("#resu").append("" + json.contents + "");
        });
    });
});
</script>
Edit: This script is not a true proxy in the sense that it does not fake an IP address. Sorry for the confusion.

php download file from remote server, from download url with headers set

I'm trying to download a file with PHP that is located on a remote server, and having no luck. I've tried using fopen and file_get_contents, but nothing has worked.
I am passing in a download URL, which isn't the exact file location; it is the "download URL" of the file, which forces the browser to download it.
So I'm thinking that is why fopen and file_get_contents are failing. Can someone tell me what I have to do to download a file from a URL with headers set to force a file download?
Any help greatly appreciated!
While not technically a duplicate, this has been asked on SO before: How to get redirecting url link with php from bit.ly
Your problem is that file_get_contents does not follow redirects. See the linked answer for a solution.
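For reference, the http wrapper does accept redirect-related context options, so a hedged sketch of the stream-context route (with $downloadUrl as a placeholder for the forced-download URL) would be:
<?php
// Sketch: pass redirect options to file_get_contents via a stream context.
// $downloadUrl is a placeholder for the download URL from the question.
$context = stream_context_create(array(
    'http' => array(
        'follow_location' => 1,  // follow Location: headers
        'max_redirects'   => 10,
    ),
));
$data = file_get_contents($downloadUrl, false, $context);
?>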
Although you didn't form your question clearly, I think I know what you mean.
Try this function, taken from the php.net comments.
I didn't test it, but it looks good and seems to follow HTTP header redirects as well as meta and JavaScript redirects to the file.
<?php
/*==================================
  Get url content and response headers (given a url, follows all
  redirections on it and returns content and response headers of final url)
  #return array[0] content
          array[1] array of response headers
==================================*/
function get_url( $url, $javascript_loop = 0, $timeout = 5 )
{
    $url    = str_replace( "&amp;", "&", urldecode(trim($url)) );
    $cookie = tempnam("/tmp", "CURLCOOKIE");

    $ch = curl_init();
    curl_setopt( $ch, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows; U; Windows NT 5.1; rv:1.7.3) Gecko/20041001 Firefox/0.10.1" );
    curl_setopt( $ch, CURLOPT_URL, $url );
    curl_setopt( $ch, CURLOPT_COOKIEJAR, $cookie );
    curl_setopt( $ch, CURLOPT_FOLLOWLOCATION, true );
    curl_setopt( $ch, CURLOPT_ENCODING, "" );
    curl_setopt( $ch, CURLOPT_RETURNTRANSFER, true );
    curl_setopt( $ch, CURLOPT_AUTOREFERER, true );
    curl_setopt( $ch, CURLOPT_SSL_VERIFYPEER, false ); # required for https urls
    curl_setopt( $ch, CURLOPT_CONNECTTIMEOUT, $timeout );
    curl_setopt( $ch, CURLOPT_TIMEOUT, $timeout );
    curl_setopt( $ch, CURLOPT_MAXREDIRS, 10 );

    $content  = curl_exec( $ch );
    $response = curl_getinfo( $ch );
    curl_close( $ch );

    if ($response['http_code'] == 301 || $response['http_code'] == 302)
    {
        ini_set("user_agent", "Mozilla/5.0 (Windows; U; Windows NT 5.1; rv:1.7.3) Gecko/20041001 Firefox/0.10.1");

        if ( $headers = get_headers($response['url']) )
        {
            foreach( $headers as $value )
            {
                if ( substr( strtolower($value), 0, 9 ) == "location:" )
                    return get_url( trim( substr( $value, 9, strlen($value) ) ) );
            }
        }
    }

    if ( ( preg_match("/>[[:space:]]+window\.location\.replace\('(.*)'\)/i", $content, $value) ||
           preg_match("/>[[:space:]]+window\.location\=\"(.*)\"/i", $content, $value) ) &&
         $javascript_loop < 5 )
    {
        return get_url( $value[1], $javascript_loop+1 );
    }
    else
    {
        return array( $content, $response );
    }
}
?>
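A short usage sketch: get_url() returns array($content, $responseInfo), so saving the download (the URL and output path are placeholders) might look like this:
<?php
// Sketch: fetch through get_url() and write the body to disk on success.
list($content, $info) = get_url("https://example.com/download.php?id=123");

if ($info['http_code'] == 200) {
    file_put_contents("/tmp/downloaded-file", $content);
} else {
    echo "Download failed with HTTP " . $info['http_code'] . "\n";
}
?>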

Error on line 14, php curl dom

<?php
$url = 'http://edition.cnn.com/?fbid=4OofUbASN5k';
$var = fread_url($url); // function call to get the page via cURL

$search = array('#<script[^>]*?>.*?</script>#si'); // strip out javascript
$var    = preg_replace($search, "\n", html_entity_decode($var));

$linklabel = array();
$link      = array();

$dom = new DOMDocument($var);
@$dom->loadHTML($var);
$xpath = new DOMXPath($dom); // grab the DOM nodes

foreach($xpath->find('a') as $element) {
    array_push($linklabel, $element->innerText);
    print $linklabel;
    array_push($link, $element->href);
    print $link.'<br>';
}

function fread_url($url) {
    if(function_exists("curl_init")) {
        $ch = curl_init();
        $user_agent = "Mozilla/4.0 (compatible; MSIE 5.01; ".
                      "Windows NT 5.0)";
        $ch = curl_init();
        curl_setopt($ch, CURLOPT_USERAGENT, $user_agent);
        curl_setopt( $ch, CURLOPT_HTTPGET, 1 );
        curl_setopt( $ch, CURLOPT_RETURNTRANSFER, 1 );
        curl_setopt( $ch, CURLOPT_FOLLOWLOCATION, 1 );
        curl_setopt( $ch, CURLOPT_FOLLOWLOCATION, 1 );
        curl_setopt( $ch, CURLOPT_URL, $url );
        curl_setopt( $ch, CURLOPT_COOKIEJAR, 'cookie.txt' );
        $html = curl_exec($ch);
        //print $html; // printing the web page
        curl_close($ch);
    }
    else {
        $hfile = fopen($url, "r");
        if($hfile) {
            while(!feof($hfile)) {
                $html .= fgets($hfile, 1024);
            }
        }
    }
    return $html;
}
I need to separate links and link labels into two separate arrays. I followed several forums and put this code together, but I am getting an error. I don't know about the find function used in the code.
Several problems, mainly calls to nonexistent functions and references to nonexistent properties. Corrected version:
<?php
$var = <<<EOD
<html>
sdfd
</html>
EOD;

$dom = new DOMDocument();
@$dom->loadHTML($var);
$xpath = new DOMXPath($dom);

foreach ($xpath->query('//a') as $element) {
    $linklabel[] = $element->textContent;
    $link[]      = $element->getAttribute("href");
}

var_dump($linklabel);
var_dump($link);
