CURLOPT_RETURNTRANSFER returns HTML in string

I'm trying to parse HTML using cURL together with DOMDocument or XPath, but with CURLOPT_RETURNTRANSFER set, the URL's HTML always comes back as a string, which seems to make it invalid HTML for parsing.
Returned output:
string(102736) "<!DOCTYPE html>
<html itemscope itemtype="http://schema.org/QAPage" class="html__responsive">
<head>
<title>html - PHP outputting text WITHOUT echo/print? - Stack Overflow</title>
<link rel="shortcut icon" href="https://cdn.sstatic.net/Sites/stackoverflow/img/favicon.ico?v=4f32ecc8f43d">
<link rel="apple-touch-icon image_src" href="https://cdn.sstatic.net/Sites/stackoverflow/img/apple-touch-icon.png?v=c78bd457575a">
<link rel="search" type="application/opensearchdescription+xml" title="Stack Overflow" href="/opensearch.xml">
<meta name="viewport" content="width=device-width, height=device-height, initial-scale=1.0, minimum-scale=1.0">"
PHP snippet used to see the output:
$cc = $http->get($url);
var_dump($cc);
CURL library used: https://github.com/seikan/HTTP/blob/master/class.HTTP.php
When I remove CURLOPT_RETURNTRANSFER, I see the HTML without the string(102736) prefix, but the page is echoed even though I didn't ask for that (reference: curl_exec printing results when I don't want to).
Here is the PHP snippet I used to parse the HTML:
$cc = $http->get($url);
$doc = new \DOMDocument();
$doc->loadHTML($cc);
// all links in document
$links = [];
$arr = $doc->getElementsByTagName("a"); // DOMNodeList Object
foreach($arr as $item) { // DOMElement Object
$href = $item->getAttribute("href");
$text = trim(preg_replace("/[\r\n]+/", " ", $item->nodeValue));
$links[] = [
'href' => $href,
'text' => $text
];
}
Any idea?

Check the return value:
print_r($cc);
You will probably find that the output is an array (if the code ran successfully). From the library source, the return value of get() is...
return [
'header' => $headers,
'body' => substr($response, $size),
];
So you will need to change the load line to be...
$doc->loadHTML($cc['body']);
Update:
as an example of the above and using this question as the page to work with...
$cc = $http->get("https://stackoverflow.com/questions/51319473/curlopt-returntransfer-returns-html-in-string/51319585?noredirect=1#comment89619183_51319585");
$doc = new \DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($cc['body']);
// all links in document
$links = [];
$arr = $doc->getElementsByTagName("a"); // DOMNodeList Object
foreach($arr as $item) { // DOMElement Object
$href = $item->getAttribute("href");
$text = trim(preg_replace("/[\r\n]+/", " ", $item->nodeValue));
$links[] = [
'href' => $href,
'text' => $text
];
}
print_r($links);
Outputs...
Array
(
[0] => Array
(
[href] => #
[text] =>
)
[1] => Array
(
[href] => https://stackoverflow.com
[text] => Stack Overflow
)
[2] => Array
(
[href] => #
[text] =>
)
[3] => Array
(
[href] => https://stackexchange.com/users/?tab=inbox
...
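
The point of the answer can be tried offline. The sketch below is only an illustration: instead of calling the library, it hard-codes an array with the same 'header' and 'body' keys quoted above (the header contents and the tiny HTML page are made-up placeholders). It also notes that the string(102736) prefix in the question is just how var_dump() labels a string and is not part of the HTML.

```php
<?php
// Assumed stand-in for $http->get($url): same 'header'/'body' keys as the
// library source quoted above. The values are placeholders for illustration.
$cc = [
    'header' => ['Content-Type' => 'text/html; charset=UTF-8'],
    'body'   => '<!DOCTYPE html><html><head><title>Demo</title></head>'
              . '<body><a href="https://stackoverflow.com">Stack Overflow</a>'
              . '<a href="#">  </a></body></html>',
];

// var_dump($cc['body']) would print string(N) "..." where N is the byte
// length; that label comes from var_dump(), the string itself is plain HTML.
$body = is_array($cc) ? $cc['body'] : $cc;

$doc = new DOMDocument();
libxml_use_internal_errors(true); // keep parser warnings out of the output
$doc->loadHTML($body);
libxml_clear_errors();

// Collect href and whitespace-normalised text for every link
$links = [];
foreach ($doc->getElementsByTagName('a') as $item) {
    $links[] = [
        'href' => $item->getAttribute('href'),
        'text' => trim(preg_replace("/[\r\n]+/", ' ', $item->nodeValue)),
    ];
}
print_r($links);
```

With a real response the only change should be replacing the hard-coded array with the library call.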

Related

PHP - Read three lines of remote html

I need to read three lines of a remote page using PHP. I'm using code from Jose Vega found here to read the title:
<?php
function get_title($url){
$str = file_get_contents($url);
if(strlen($str)>0){
$str = trim(preg_replace('/\s+/', ' ', $str)); // supports line breaks inside <title>
preg_match("/\<title\>(.*)\<\/title\>/i",$str,$title); // ignore case
return $title[1];
}
}
//Example:
echo get_title("http://www.washingtontimes.com/");
?>
When I plug in a URL, I want to extract the following information:
<title>TITLE HERE</title>
<meta property="end_date" content="Tue Aug 28 2018 03:59:59 GMT+0000 (UTC)" />
<meta property="start_date" content="Mon Aug 06 2018 04:00:00 GMT+0000 (UTC)" />
Outputs: $title, $start, $end
Displayed as a title with a link to URL, followed by Starts: ____, Ends: ____, preferably converted to simple dates
Bonus Question: How can I efficiently parse dozens of sites using this script? The sites are all ascending numerically. index.php?id=103 index.php?id=104 index.php?id=105
Displaying:
ID Title Start End
#103 TitleWithLink StartDate EndDate
#104 TitleWithLink StartDate EndDate
#105 TitleWithLink StartDate EndDate

Well, you could solve your issue with the DOMDocument class.
$doc = new \DOMDocument();
$title = $start = $end = '';
if ($doc->loadHTMLFile($url)) {
    // Get the title
    $titles = $doc->getElementsByTagName('title');
    if ($titles->length > 0) {
        $title = $titles->item(0)->nodeValue;
    }
    // Get the meta elements
    $xpath = new \DOMXPath($doc);
    $ends = $xpath->query('//meta[@property="end_date"]');
    if ($ends->length > 0) {
        $end = $ends->item(0)->getAttribute('content');
    }
    $starts = $xpath->query('//meta[@property="start_date"]');
    if ($starts->length > 0) {
        $start = $starts->item(0)->getAttribute('content');
    }
    var_dump($title, $start, $end);
}
With the getElementsByTagName method of the DOMDocument class you can find the title element in the whole HTML of a given URL. With the DOMXPath class you can retrieve the specific meta data you want. You don't need much code to find specific information in an HTML string.
The code shown above is not tested.
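
The question also asks for simple dates. Here is a minimal sketch of one way to do that, assuming the date strings always look like the two samples in the question. The function names extract_event() and simple_date() are made up for this example, and the HTML is passed in as a string so the sketch runs without a network connection.

```php
<?php
// Hypothetical helper: convert "Tue Aug 28 2018 03:59:59 GMT+0000 (UTC)"
// to "2018-08-28". Falls back to the raw string if the format differs.
function simple_date(string $raw): string
{
    // Drop the trailing "(UTC)" style comment before parsing
    $clean = preg_replace('/\s*\([^)]*\)\s*$/', '', $raw);
    $date = DateTime::createFromFormat('D M d Y H:i:s \G\M\TO', $clean);
    return $date ? $date->format('Y-m-d') : $raw;
}

// Hypothetical helper: pull the title and both date metas out of an HTML string.
function extract_event(string $html): array
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // real pages are rarely valid HTML
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $meta = function (string $property) use ($xpath): string {
        $nodes = $xpath->query('//meta[@property="' . $property . '"]');
        return $nodes->length > 0 ? $nodes->item(0)->getAttribute('content') : '';
    };
    $titles = $doc->getElementsByTagName('title');

    return [
        'title' => $titles->length > 0 ? trim($titles->item(0)->textContent) : '',
        'start' => simple_date($meta('start_date')),
        'end'   => simple_date($meta('end_date')),
    ];
}

$html = '<html><head><title>TITLE HERE</title>'
      . '<meta property="end_date" content="Tue Aug 28 2018 03:59:59 GMT+0000 (UTC)" />'
      . '<meta property="start_date" content="Mon Aug 06 2018 04:00:00 GMT+0000 (UTC)" />'
      . '</head><body></body></html>';

$event = extract_event($html);
printf("%s | Starts: %s, Ends: %s\n", $event['title'], $event['start'], $event['end']);
```

For the bonus question, the same function could be called in a loop over the numeric ids, fetching each page (for example with file_get_contents) and passing the result to extract_event().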

Parsing html using php to an array

I have the below html
<p>text1</p>
<ul>
<li>list-a1</li>
<li>list-a2</li>
<li>list-a3</li>
</ul>
<p>text2</p>
<ul>
<li>list-b1</li>
<li>list-b2</li>
<li>list-b3</li>
</ul>
<p>text3</p>
Does anyone have an idea how to parse this HTML with PHP to get the output below as a nested array?
The first level is for the "p" tags
and the second for the "ul" tags, because every "p" tag is followed by a "ul" tag.
Array
(
[0] => Array
(
[value] => text1
(
[il] => list-a1
[il] => list-a2
[il] => list-a3
)
)
[1] => Array
(
[value] => text2
(
[il] => list-b1
[il] => list-b2
[il] => list-b3
)
)
)
I can't simply replace or strip all the tags, because I use
foreach ($doc->getElementsByTagName('p') as $link)
{
$dont = $link->textContent;
if (strpos($dont, 'document.') === false) {
$links2[] = array(
'value' => $link->textContent, );
}
$er=0;
foreach ($doc->getElementsByTagName('ul') as $link)
{
$dont2 = $link->nodeValue;
//echo $dont2;
if (strpos($dont2, 'favorisContribuer') === false) {
$links3[]= array(
'il' => $link->nodeValue, );
}

You could use the DOMDocument class (http://php.net/manual/en/class.domdocument.php)
You can see an example below.
<?php
$html = '
<p>text1</p>
<ul>
<li>list-a1</li>
<li>list-a2</li>
<li>list-a3</li>
</ul>
<p>text2</p>
<ul>
<li>list-b1</li>
<li>list-b2</li>
<li>list-b3</li>
</ul>
<p>text3</p>
';
$doc = new DOMDocument();
$doc->loadHTML($html);
$textContent = $doc->textContent;
$textContent = trim(preg_replace('/\t+/', '<br>', $textContent));
echo '
<!DOCTYPE html>
<html>
<head>
<title></title>
</head>
<body>
' . $textContent . '
</body>
</html>
';
?>
However, I would suggest using javascript to find the content and send it to php instead.
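
To get the nested array the question asks for, DOMXPath can pair each paragraph with the list that directly follows it. This is a minimal sketch under the assumption that every "ul" belongs to the "p" immediately before it; the key name 'li' is chosen here for illustration (the question spells it 'il').

```php
<?php
$html = '
<p>text1</p>
<ul>
<li>list-a1</li>
<li>list-a2</li>
<li>list-a3</li>
</ul>
<p>text2</p>
<ul>
<li>list-b1</li>
<li>list-b2</li>
<li>list-b3</li>
</ul>
<p>text3</p>
';

$doc = new DOMDocument();
libxml_use_internal_errors(true); // silence warnings about the fragment
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$result = [];
foreach ($xpath->query('//p') as $p) {
    $items = [];
    // First element sibling after the <p>, but only if it is a <ul>
    foreach ($xpath->query('following-sibling::*[1][self::ul]/li', $p) as $li) {
        $items[] = trim($li->textContent);
    }
    $result[] = ['value' => trim($p->textContent), 'li' => $items];
}
print_r($result);
```

A paragraph without a following list (text3 here) ends up with an empty 'li' array, so it can be filtered out afterwards if it is not wanted.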

PHP DOM - How to iterate xpath into an array with parent/children/child?

I am new to using DOM with PHP and need some help figuring out how to iterate XPath results into an array. The examples I found online provided very little help.
This is the string content from my XML file:
<x:xmpmeta xmlns:x="adobe:ns:meta/" x:xmptk="Adobe XMP Core 5.2-c004 1.136881, 2010/06/10-18:11:35 ">
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<rdf:Description
rdf:about=""
xmlns:photoshop="http://ns.adobe.com/photoshop/1.0/"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:tiff="http://ns.adobe.com/tiff/1.0/"
xmlns:exif="http://ns.adobe.com/exif/1.0/"
xmlns:xmp="http://ns.adobe.com/xap/1.0/"
xmlns:aux="http://ns.adobe.com/exif/1.0/aux/"
xmlns:crs="http://ns.adobe.com/camera-raw-settings/1.0/"
xmlns:Iptc4xmpCore="http://iptc.org/std/Iptc4xmpCore/1.0/xmlns/"
xmlns:xmpRights="http://ns.adobe.com/xap/1.0/rights/"
photoshop:LegacyIPTCDigest="B0D1E9B9CFC1C774E7277517B04970DC"
photoshop:ColorMode="3"
photoshop:ICCProfile="sRGB IEC61966-2.1"
photoshop:AuthorsPosition="Tester"
photoshop:Headline="Big City Landscape"
photoshop:CaptionWriter="Freelancer"
photoshop:DateCreated="2016-08-05T02:16Z"
photoshop:City="NA"
photoshop:State="NA"
photoshop:Country="NA"
photoshop:TransmissionReference="2323"
photoshop:Instructions="set to landscape"
photoshop:Credit="Photographor: FirstName lastname"
photoshop:Source="Smart Phone Photo"
tiff:Make="Motorola"
tiff:Model="MB865"
tiff:Orientation="1"
tiff:ImageWidth="3264"
tiff:ImageLength="1840"
tiff:PhotometricInterpretation="2"
tiff:SamplesPerPixel="3"
tiff:XResolution="72/1"
tiff:YResolution="72/1"
tiff:ResolutionUnit="2"
exif:ExifVersion="0220"
exif:ExposureTime="1/11"
exif:ShutterSpeedValue="3459432/1000000"
exif:FNumber="24/10"
exif:ApertureValue="2526069/1000000"
exif:ExposureProgram="0"
exif:BrightnessValue="0/1"
exif:ExposureBiasValue="0/10"
exif:MaxApertureValue="3/1"
exif:SubjectDistance="0/1"
exif:MeteringMode="1"
exif:LightSource="4"
exif:FocalLength="460/100"
exif:SceneType="1"
exif:CustomRendered="1"
exif:ExposureMode="0"
exif:WhiteBalance="0"
exif:SceneCaptureType="0"
exif:GainControl="256"
exif:Contrast="0"
exif:Saturation="0"
exif:Sharpness="0"
exif:SubjectDistanceRange="0"
exif:DigitalZoomRatio="65536/65535"
exif:PixelXDimension="3264"
exif:PixelYDimension="1840"
exif:ColorSpace="1"
xmp:ModifyDate="2016-02-22T09:22:39-05:00"
xmp:MetadataDate="2016-08-05T02:21:35-04:00"
aux:ApproximateFocusDistance="0/1"
crs:AlreadyApplied="True"
Iptc4xmpCore:IntellectualGenre="NA"
Iptc4xmpCore:Location="NA"
Iptc4xmpCore:CountryCode="NA">
<dc:rights>
<rdf:Alt>
<rdf:li xml:lang="x-default">Copyright FirstName lastname</rdf:li>
</rdf:Alt>
</dc:rights>
<dc:creator>
<rdf:Seq>
<rdf:li>FirstName lastname</rdf:li>
</rdf:Seq>
</dc:creator>
<dc:description>
<rdf:Alt>
<rdf:li xml:lang="x-default">Jurks on the move</rdf:li>
</rdf:Alt>
</dc:description>
<dc:subject>
<rdf:Bag>
<rdf:li>New Jurks in Town</rdf:li>
</rdf:Bag>
</dc:subject>
<dc:title>
<rdf:Alt>
<rdf:li xml:lang="x-default">Big City Jurks</rdf:li>
</rdf:Alt>
</dc:title>
<tiff:BitsPerSample>
<rdf:Seq>
<rdf:li>8</rdf:li>
<rdf:li>8</rdf:li>
<rdf:li>8</rdf:li>
</rdf:Seq>
</tiff:BitsPerSample>
<exif:ISOSpeedRatings>
<rdf:Seq>
<rdf:li>107</rdf:li>
</rdf:Seq>
</exif:ISOSpeedRatings>
<exif:Flash exif:Fired="True" exif:Return="0" exif:Mode="1" exif:Function="False" exif:RedEyeMode="False"/>
<Iptc4xmpCore:CreatorContactInfo
Iptc4xmpCore:CiAdrExtadr=""
Iptc4xmpCore:CiAdrCity=""
Iptc4xmpCore:CiAdrRegion="NY"
Iptc4xmpCore:CiAdrPcode=""
Iptc4xmpCore:CiAdrCtry="USA"
Iptc4xmpCore:CiTelWork=""
Iptc4xmpCore:CiEmailWork="you@yourwebsite.com"
Iptc4xmpCore:CiUrlWork="www.yourwebsite.com"/>
<Iptc4xmpCore:SubjectCode>
<rdf:Bag>
<rdf:li>Jurks</rdf:li>
</rdf:Bag>
</Iptc4xmpCore:SubjectCode>
<Iptc4xmpCore:Scene>
<rdf:Bag>
<rdf:li>Big City</rdf:li>
</rdf:Bag>
</Iptc4xmpCore:Scene>
<xmpRights:UsageTerms>
<rdf:Alt>
<rdf:li xml:lang="x-default">Free to use</rdf:li>
</rdf:Alt>
</xmpRights:UsageTerms>
</rdf:Description>
</rdf:RDF>
</x:xmpmeta>
This is how I approach the issue.
$__data = "xmp-cache-test.xml";
$content = file_get_contents('xmp-cache-test.xml');
if(preg_match("/(\<x\:xmpmeta.*?\>.*?\<\/x\:xmpmeta\>)/s", $content, $matches))
$data = "<?xml version='1.0'?>\n" . $matches[1];
$myXmlString = $data ;
$myXmlFilename = $__data;
$doc = new DOMDocument();
$doc->loadXML($myXmlString);
$doc->documentURI = $myXmlFilename;
$xpath = new DOMXpath($doc);
$xpath->registerNamespace('x', 'adobe:ns:meta/');
$xpath->registerNamespace('xmp', 'http://ns.adobe.com/xap/1.0/');
$xpath->registerNamespace("Iptc4xmpCore", "http://iptc.org/std/Iptc4xmpCore/1.0/xmlns/");
$xpath->registerNamespace('rdf', 'http://www.w3.org/1999/02/22-rdf-syntax-ns#');
$elements = $xpath->evaluate('//rdf:RDF/rdf:Description');
$arr_xmp = iterator_to_array($elements);
print_r($arr_xmp);
// The print result:
Array (
[0] => DOMElement Object (
[tagName] => rdf:Description
[schemaTypeInfo] =>
[nodeName] => rdf:Description
[nodeValue] => Copyright FirstName lastname FirstName lastname Jurks on the move
New Jurks in Town Big City Jurks 8 8 8 107 Jurks Big City Free to use
[nodeType] => 1
[parentNode] => (object value omitted)
[childNodes] => (object value omitted)
[firstChild] => (object value omitted)
[lastChild] => (object value omitted)
[previousSibling] => (object value omitted)
[nextSibling] => (object value omitted)
[attributes] => (object value omitted)
[ownerDocument] => (object value omitted)
[namespaceURI] => http://www.w3.org/1999/02/22-rdf-syntax-ns#
[prefix] => rdf
[localName] => Description
[baseURI] => xmp-cache-test.xml
[textContent] => Copyright FirstName lastname FirstName lastname Jurks on the move
New Jurks in Town Big City Jurks 8 8 8 107 Jurks Big City Free to use
) )
The above result is not what I expected.
I would rather have an array that looks more like the example below,
along with a few other options:
Array (
[rdf:about] =>
[xmlns:photoshop] => http://ns.adobe.com/photoshop/1.0/
[xmlns:dc] => http://purl.org/dc/elements/1.1/
[xmlns:tiff] => http://ns.adobe.com/tiff/1.0/
[xmlns:exif] => http://ns.adobe.com/exif/1.0/
[xmlns:xmp] => http://ns.adobe.com/xap/1.0/
[xmlns:aux] => http://ns.adobe.com/exif/1.0/aux/
[xmlns:crs] => http://ns.adobe.com/camera-raw-settings/1.0/
[xmlns:Iptc4xmpCore] => http://iptc.org/std/Iptc4xmpCore/1.0/xmlns/
[xmlns:xmpRights] => http://ns.adobe.com/xap/1.0/rights/
[photoshop:LegacyIPTCDigest] => B0D1E9B9CFC1C774E7277517B04970DC
[photoshop:ColorMode] => 3
[photoshop:ICCProfile] => sRGB IEC61966-2.1
[photoshop:AuthorsPosition] => Tester
[photoshop:Headline] => Big City Landscape
[photoshop:CaptionWriter] => Freelancer
[photoshop:DateCreated] => 2016-08-05T02:16Z
[photoshop:City] => NA
[photoshop:City] => NA
[photoshop:State] => NA
[photoshop:Country] => NA
[photoshop:TransmissionReference] => 2323
[photoshop:Instructions] => set to landscape
[photoshop:Credit] => Photographor: FirstName lastname
[photoshop:Source] => Smart Phone Photo
[tiff:Make] => Motorola
[tiff:Model] => MB865
[tiff:Orientation] => 1
------------ // continue
)
Options (an example for each would be helpful):
How should I approach creating the array using DOM?
If I need to remove, say, "tiff" and "exif" from the array, what should the approach be?
How do I use DOM to update, say, the "photoshop:Credit" value?
How do I use DOM to turn the array back into an XML string?

=============EDIT===================
The xml to array part, almost the same question here: What is the best php DOM 2 Array function?
I played with the code a bit and this is the result:
function xml_to_array($root) {
$result = array();
if ($root->hasAttributes()) {
$attrs = $root->attributes;
foreach ($attrs as $attr) {
$result['#attributes'][$attr->name] = $attr->value;
}
}
if ($root->hasChildNodes()) {
$children = $root->childNodes;
if ($children->length == 1) {
$child = $children->item(0);
if ($child->nodeType == XML_TEXT_NODE) {
$result['_value'] = $child->nodeValue;
return count($result) == 1
? $result['_value']
: $result;
}
}
$groups = array();
foreach ($children as $child) {
if($child->nodeType == XML_TEXT_NODE && empty(trim($child->nodeValue))) continue;
if (!isset($result[$child->nodeName])) {
$result[$child->nodeName] = xml_to_array($child);
} else {
if (!isset($groups[$child->nodeName])) {
$result[$child->nodeName] = array($result[$child->nodeName]);
$groups[$child->nodeName] = 1;
}
$result[$child->nodeName][] = xml_to_array($child);
}
}
}
return $result;
}
// $content = your xml raw source
if(preg_match("/(\<x\:xmpmeta.*?\>.*?\<\/x\:xmpmeta\>)/s", $content, $matches))
$data = "<?xml version='1.0'?>\n" . $matches[1];
$myXmlString = $data ;
//$myXmlFilename = $__data;
$doc = new DOMDocument();
$doc->loadXML($myXmlString);
$array = xml_to_array($doc);
print_r($array);
Very neat function someone wrote there: it iterates through the XML, collecting attributes and node values, and largely sidesteps the pain associated with namespaces.
If you need to remove an item from the array, just use unset, as in:
unset($array['x:xmpmeta']['rdf:RDF']['rdf:Description']['tiff:BitsPerSample']);
As for how to update an attribute value, the exact same question is answered here: Change tag attribute value with PHP DOMDocument
$dom = new DOMDocument();
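$dom->loadHTML('<a href="http://example.com/">Click here</a>'); // the anchor markup was lost in the page rendering; this href is a placeholder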
foreach ($dom->getElementsByTagName('a') as $item) {
$item->setAttribute('href', 'http://google.com/');
echo $dom->saveHTML();
exit;
}
And finally, how to go back from the array to DOM: there is no easy way; you would have to create a DOM object manually and add the nodes and attributes one by one.
Once it is populated, you would call DOMDocument::saveXML() (http://php.net/manual/en/domdocument.savexml.php) to get the XML code.
<?php
$doc = new DOMDocument('1.0');
// we want a nice output
$doc->formatOutput = true;
$root = $doc->createElement('book');
$root = $doc->appendChild($root);
$title = $doc->createElement('title');
$title = $root->appendChild($title);
$text = $doc->createTextNode('This is the title');
$text = $title->appendChild($text);
echo "Saving all the document:\n";
echo $doc->saveXML() . "\n";
echo "Saving only the title part:\n";
echo $doc->saveXML($title);
?>
Hope this helps,
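
An alternative to round-tripping through an array is to work on the DOM directly. The sketch below is a rough illustration on a shortened, made-up XMP sample (only a few attributes are kept): it lists the attributes of rdf:Description, removes the tiff and exif ones, updates photoshop:Credit, and serialises the result. As far as I know, the legacy DOM extension does not expose the xmlns:* declarations through the attributes map, so they would not show up in the array.

```php
<?php
// Shortened sample for illustration; the real file has many more attributes.
$xml = <<<'XML'
<x:xmpmeta xmlns:x="adobe:ns:meta/">
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<rdf:Description rdf:about=""
xmlns:photoshop="http://ns.adobe.com/photoshop/1.0/"
xmlns:tiff="http://ns.adobe.com/tiff/1.0/"
photoshop:Credit="Old credit"
photoshop:City="NA"
tiff:Make="Motorola"/>
</rdf:RDF>
</x:xmpmeta>
XML;

$photoshopNs = 'http://ns.adobe.com/photoshop/1.0/';

$doc = new DOMDocument();
$doc->loadXML($xml);
$xpath = new DOMXPath($doc);
$xpath->registerNamespace('rdf', 'http://www.w3.org/1999/02/22-rdf-syntax-ns#');
$desc = $xpath->query('//rdf:RDF/rdf:Description')->item(0);

// 1. Attributes as a flat array keyed by prefixed name
$attrs = [];
foreach ($desc->attributes as $attr) {
    $attrs[$attr->nodeName] = $attr->value;
}

// 2. Remove every tiff:* and exif:* attribute (copy the list first,
//    because the map changes while attributes are being removed)
foreach (iterator_to_array($desc->attributes, false) as $attr) {
    if (in_array($attr->prefix, ['tiff', 'exif'], true)) {
        $desc->removeAttributeNode($attr);
    }
}

// 3. Update one attribute value
$desc->setAttributeNS($photoshopNs, 'photoshop:Credit', 'New credit');

// 4. Back to an XML string
$out = $doc->saveXML();
print_r($attrs);
echo $out;
```

Working on the DOM keeps the namespaces intact, which is the part that is hard to rebuild from a plain array.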

How to use PHP to find all elements in HTML and get all the positions?

I'm trying to find all the elements of a tag in HTML and get the starting and ending point.
Here's my sample HTML
some content <iframe></iframe> <iframe></iframe> another content
Here's what I have got so far for code.
$dom = HtmlDomParser::str_get_html( $this->content );
$iframes = array();
foreach( $dom->find( 'iframe' ) as $iframe) {
$iframes[] = $iframe;
}
return array(
'hasIFrame' => count( $iframes ) > 0
);
Getting the number of elements is easy but I'm not sure if HTMLDomParser can get the starting and ending position?
What I want is
array(
'hasIFrame' => true,
'numberOfElements' => 2,
array (
0 => array (
'start' => $firstStartingElement,
'end' => $firstEndingElement
),
1 => array (
'start' => $secondStartingElement,
'end' => $secondEndingElement
)
)
)

If you have a look at the official doc (http://simplehtmldom.sourceforge.net/), you can easily find out how many elements of a given type there are in your DOM:
// Find all images
foreach($html->find('img') as $element) {
echo $element->src . '<br>';
}
All you have to do is retrieve $html->find('iframe') and check its size to know whether there is at least one.

You can do something like this:
$html = "some content <iframe></iframe> <iframe></iframe> another content";
// Byte offsets of every opening tag (attributes allowed) and every closing tag
preg_match_all('/<iframe\b[^>]*>/i', $html, $iframesStartPositions, PREG_OFFSET_CAPTURE);
preg_match_all('/<\/iframe>/i', $html, $iframesEndPositions, PREG_OFFSET_CAPTURE);
$iframesPositions = array();
foreach ($iframesStartPositions[0] as $key => $start) {
    if (!isset($iframesEndPositions[0][$key])) {
        break; // unbalanced markup: no closing tag for this iframe
    }
    $iframesPositions[] = array(
        'start' => $start[1],
        'end' => $iframesEndPositions[0][$key][1] + 9 // 9 is the length of the closing tag </iframe>
    );
}
return array(
    'hasIFrame' => count($iframesPositions) > 0,
    'numberOfElements' => count($iframesPositions),
    'positions' => $iframesPositions
);
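
The same idea can be wrapped in a function that matches each whole element with a single pattern. This is a sketch only: find_iframe_positions() is a made-up name, the offsets are byte offsets (not characters), 'end' is the offset just past the closing tag, and a regex like this will not cope with nested or malformed markup.

```php
<?php
// Hypothetical helper returning the structure asked for in the question.
function find_iframe_positions(string $html): array
{
    // One match per <iframe ...>...</iframe>, each with its byte offset
    preg_match_all('/<iframe\b[^>]*>.*?<\/iframe>/is', $html, $matches, PREG_OFFSET_CAPTURE);

    $positions = [];
    foreach ($matches[0] as [$text, $offset]) {
        $positions[] = [
            'start' => $offset,
            'end'   => $offset + strlen($text), // first byte after </iframe>
        ];
    }

    return [
        'hasIFrame'        => count($positions) > 0,
        'numberOfElements' => count($positions),
        'positions'        => $positions,
    ];
}

$result = find_iframe_positions(
    'some content <iframe></iframe> <iframe></iframe> another content'
);
print_r($result);
```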

Add existing array to 3d array

I'm trying to build a 3d php array, that ultimately gets outputted as xml... This is the code I'm trying to use to prove the concept...
$test = array('apple','orange');
$results = Array(
'success' => '1',
'error_number' => '',
'error_message' => '',
'results' => Array (
'number_of_reports' => mysql_num_rows($run),
'item' => $test
)
);
I want the resulting array to look like
<success>1</success>
<error_number/>
<error_message/>
<results>
<number_of_reports>18</number_of_reports>
<item>
<0>apple</0>
<1>orange</1>
</item>
</results>
In reality the apple and orange array would be a 3d one in itself... If you've ever used the ebay api... you'll have an idea of what I'm trying to do (I think)

Try it:
Code:
<?php
$test = array('apple','orange');
$results = Array(
'success' => '1',
'error_number' => '',
'error_message' => '',
'results' => Array (
'number_of_reports' => 1,
'item' => $test
)
);
print_r($results);
function addChild1($xml, $item, $clave)
{
if(is_array($item)){
$tempNode = $xml->addChild($clave,'');
foreach ($item as $a => $b)
{
addChild1($tempNode, $b, $a);
}
} else {
$xml->addChild("$clave", "$item");
}
}
$xml = new SimpleXMLElement('<root/>');
addChild1($xml, $results,'data');
$ret = $xml->asXML();
print $ret;
Output:
<?xml version="1.0"?>
<root><data><success>1</success><error_number></error_number><error_message></error_message><results><number_of_reports>1</number_of_reports><item><0>apple</0><1>orange</1></item></results></data></root>

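See the question below. I think it will be very useful to you:
How to convert array to SimpleXML
Or try this:
$xml = new SimpleXMLElement('<root/>');
// Note: the callback receives (value, key), so each array value becomes the element name and its key the content; nested levels are flattened
array_walk_recursive($test_array, array ($xml, 'addChild'));
print $xml->asXML();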
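
One caveat with numeric keys: names such as <0> are not valid XML element names, so a parser would be expected to reject that output. A minimal sketch of one workaround, using DOMDocument and mapping numeric keys to a fixed element name, is shown below. The function name array_to_xml(), the element name "entry" and the "index" attribute are all choices made for this example, and the report count is hard-coded.

```php
<?php
// Hypothetical helper: append $data to $parent, recursing into nested arrays.
function array_to_xml(array $data, DOMNode $parent, DOMDocument $doc): void
{
    foreach ($data as $key => $value) {
        // Numeric keys cannot be element names, so use a fixed name instead
        $child = $doc->createElement(is_int($key) ? 'entry' : $key);
        if (is_int($key)) {
            $child->setAttribute('index', (string) $key);
        }
        $parent->appendChild($child);

        if (is_array($value)) {
            array_to_xml($value, $child, $doc);
        } elseif ((string) $value !== '') {
            // createTextNode escapes special characters such as & and <
            $child->appendChild($doc->createTextNode((string) $value));
        }
    }
}

$results = [
    'success' => '1',
    'error_number' => '',
    'error_message' => '',
    'results' => [
        'number_of_reports' => 18,
        'item' => ['apple', 'orange'],
    ],
];

$doc = new DOMDocument('1.0');
$root = $doc->appendChild($doc->createElement('root'));
array_to_xml($results, $root, $doc);
$xmlString = $doc->saveXML($root);
echo $xmlString, "\n";
```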
