Get specific text from webpage - php

I have a page, Test1, and on another page, test, I have this PHP code running to get some data from Test1.
<?php
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTMLFile("http://inviatapenet.gethost.ro/sop/test1.php");
$xpath = new DOMXpath($doc);
$elements = $xpath->query("//*[@type='button']/@onclick");
if (!is_null($elements)) {
foreach ($elements as $element) {
$nodes = $element->childNodes;
foreach ($nodes as $node) {
echo $node->nodeValue. "\n";
}
}
}
?>
The result is this
OnPlay('sop://broker.sopcast.com:3912/120704 cod ', ' eu - Nr.1 in tv ! ')
OnPlay('sop://broker.sopcast.com:3912/140601 cod ', ' eu - Nr.1 in tv ! ')
OnPlay('sop://broker.sopcast.com:3912/124589 cod ', ' eu - Nr.1 tv')
OnPlay('sop://broker.sopcast.com:3912/589994 cod ', ' eu - tv ')
OnPlay('sop://broker.sopcast.com:3912/ cod ', ' eu - tv ')
But from all of that I need only this part: sop://broker.sopcast.com:3912/140601
I need it for all of them.
How do I get rid of the extra text, or how do I get just the URLs (sop://broker.sopcast.com:3912/140601, sop://broker.sopcast.com:3912/120704)?

If the string is always formatted like this, you can simply use explode to get the sop:// URL.
<?php
header('Content-Type: text/plain; charset=UTF-8');
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTMLFile("http://inviatapenet.gethost.ro/sop/test1.php");
$xpath = new DOMXpath($doc);
$elements = $xpath->query("//*[@type='button']/@onclick");
if (!is_null($elements)) {
foreach ($elements as $element) {
$nodes = $element->childNodes;
foreach ($nodes as $node) {
echo $node->nodeValue. "\n";
$content = $node->nodeValue;
$content = explode("'", $content, 3);
$content = explode(" ", $content[1], 2);
$sop = $content[0];
unset($content);
var_dump($sop);
}
}
}
?>
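If the format is less predictable than that, a regular expression may be more forgiving than chained explode calls. This is only a sketch: extractSopUrl is a hypothetical helper name, and it assumes the URL never contains whitespace or a single quote.

```php
<?php
// Hypothetical helper (not part of the original answer): pull the sop:// URL
// out of an onclick string such as "OnPlay('sop://host:3912/120704 cod ', '...')".
function extractSopUrl(string $onclick): ?string
{
    // Match "sop://" followed by a run of characters that are neither
    // whitespace nor a single quote.
    if (preg_match('~sop://[^\s\']+~', $onclick, $m)) {
        return $m[0];
    }
    return null; // no sop:// URL found
}

$sample = "OnPlay('sop://broker.sopcast.com:3912/120704 cod ', ' eu - Nr.1 in tv ! ')";
echo extractSopUrl($sample), "\n";
```

For the sample string above this yields sop://broker.sopcast.com:3912/120704; a string with no sop:// URL returns null, so the caller can skip it.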

I think you might need to do some string manipulation on the resulting onclick event handler text.
<?php
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTMLFile("http://inviatapenet.gethost.ro/sop/test1.php");
$xpath = new DOMXpath($doc);
$elements = $xpath->query("//*[@type='button']/@onclick");
$value_text = array();
$index = 0;
if (!is_null($elements)) {
foreach ($elements as $element) {
$nodes = $element->childNodes;
foreach ($nodes as $node) {
$value_text[$index++] = getReuiredValue($node->nodeValue);
}
}
//value_text will contain all required values as array
print_r($value_text);
}
function getReuiredValue($on_play)
{
$pos = strpos($on_play, 'cod ');
//following call will parse the OnPlay string and get the required value out of string
$updated_on_play = substr($on_play, 8, $pos - 8); // skip the 8 characters of "OnPlay('" and stop before 'cod '
$updated_on_play = trim($updated_on_play);
return $updated_on_play;
}
?>

Related

Parsing HTML to extract array of DIV content by class

$html = file_get_contents("https://www.wireclub.com/chat/room/music");
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$result = array();
foreach($xpath->evaluate('//div[@class="message clearfix"]/node()') as $childNode) {
$result[] = $dom->saveHtml($childNode);
}
echo '<pre>'; var_dump($result);
I would like the content of each individual DIV in an array to be processed individually.
This code is clumping every DIV together.
You could retrieve all the div elements and read each one's nodeValue:
$dom = new DOMDocument();
$dom->loadHTML($html);
$myDivs = $dom->getElementsByTagName('div');
foreach($myDivs as $key => $value) {
$result[] = $value->nodeValue;
}
var_dump($result);
To select by class, you could use XPath instead:
$xpath = new DOMXPath($dom);
$myElem = $xpath->query("//*[contains(@class, '$classname')]");
foreach($myElem as $key => $value) {
$result[] = $value->nodeValue;
}
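If the inner HTML of each div is wanted rather than its text, one possible approach is to select the div elements themselves and serialize each one's children into its own array entry. A minimal sketch, where the sample markup is made up for illustration:

```php
<?php
// Sample markup is an assumption standing in for the fetched page.
$html = '<div class="message clearfix"><b>a</b> one</div>'
      . '<div class="message clearfix">two</div>'
      . '<div class="other">skip</div>';

libxml_use_internal_errors(true);   // keep parser warnings out of the output
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$result = array();
// Select the div elements, not their child nodes, so each div stays separate.
foreach ($xpath->query('//div[@class="message clearfix"]') as $div) {
    $inner = '';
    foreach ($div->childNodes as $child) {
        $inner .= $dom->saveHTML($child);   // serialize one child node
    }
    $result[] = $inner;                     // one array entry per div
}
print_r($result);
```

Here $result ends up with one entry per matching div, so each can be processed individually.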

How to get a table by ID from a URL?

I am attempting to get a table from a specific URL by its ID. My method is getting the raw HTML from the URL, converting it into a readable DOM for PHP, and then finding the table via a query.
With the code below, $elements is always empty (length of 0).
<?php
$c = curl_init('http://www.urlhere.com/');
curl_setopt($c, CURLOPT_RETURNTRANSFER, true);
$html = curl_exec($c);
if (curl_error($c))
die(curl_error($c));
curl_close($c);
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXpath($dom);
$elements = $xpath->query("*/table[@id=anyid]");
if (!is_null($elements)) {
foreach ($elements as $element) {
echo "<br/>[". $element->nodeName. "]";
$nodes = $element->childNodes;
foreach ($nodes as $node) {
echo $node->nodeValue. "\n";
}
}
}
?>
How can I render this table successfully on my page?
EDIT:
A snippet of the HTML I am trying to get, taken directly from the $html variable:
<div></div><table class=sortable id=anyid></table>
To follow up on the comments, you could first suppress those errors with:
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();
This discussion is thoroughly tackled here.
Then to apply it, just add it in your code:
$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXpath($dom);
$elements = $xpath->query("//table[@id='anyid']");
if (!is_null($elements)) {
foreach ($elements as $element) {
echo "<br/>[". $element->nodeName. "]";
$nodes = $element->childNodes;
foreach ($nodes as $node) {
echo $node->nodeValue. "\n";
}
}
}
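If the goal is to render the table itself rather than print its text, one option is to serialize the matched node with saveHTML(). A minimal sketch, assuming markup like the snippet in the question (the table row is made up for illustration):

```php
<?php
// Stand-in for the HTML fetched with cURL; the <tr> content is an assumption.
$html = '<div></div><table class=sortable id=anyid><tr><td>1</td></tr></table>';

$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$tables = $xpath->query("//table[@id='anyid']");

$tableHtml = '';
if ($tables !== false && $tables->length > 0) {
    // Passing a node to saveHTML() serializes just that node and its subtree.
    $tableHtml = $dom->saveHTML($tables->item(0));
}
echo $tableHtml;
```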

DOMDocument removing html elements

Here is my code:
$text = '<div class="cgus_post"><div class="imgbox"><img src="/cgmedia/default.gif"></div>
<h2 id="post-15055">
Willie Nelson Celebrates 80th Birthday Stoned and Auditioning for Gandalf</h2>
<p>This video pretty much sums up why Willie Nelson is fucking awesome. Willie decided to celebrate his 80th birthday by recording an ‘audition’ for Peter Jackson. Willie wants to take the reigns from Ian McKellan in The Hobbit 2, and decided to show off his acting skills and give some of his own wizardly advice. The result is hilarious. Watch …</p>
<br class="clear">
</div>';
$dom = new DomDocument();
$dom->loadHTML($text);
$classname = 'cgus_post';
$finder = new DomXPath($dom);
$nodes = $finder->query("//*[contains(concat(' ', normalize-space(@class), ' '), ' $classname ')]");
foreach($nodes as $node){
echo $node->nodeValue;
}
The problem I am having is that I am querying for the div with the class cgus_post and it's returning just the text. How do I have it return the HTML elements as well?
Here's my innerHTML function that I always use:
function innerHTML(DOMNode $node, $trim = true, $decode = true) {
$innerHTML = '';
foreach ($node->childNodes as $inner_node) {
$temp_container = new DOMDocument();
$temp_container->appendChild($temp_container->importNode($inner_node, true));
$innerHTML .= ($trim ? trim($temp_container->saveHTML()) : $temp_container->saveHTML());
}
return ($decode ? html_entity_decode($innerHTML) : $innerHTML);
}
So then you do:
$dom = new DOMDocument();
$dom->loadHTML($html);
echo htmlentities(innerHTML($dom->documentElement->childNodes->item(0)->firstChild));

Xpath for extracting links

I am creating a scraper for an automotive site. First I want to get all manufacturers, and after that all model links for each manufacturer, but with the code below I get only the first model on the list. Why?
<?php
$dom = new DOMDocument();
@$dom->loadHTMLFile('http://www.auto-types.com');
$xpath = new DOMXPath($dom);
$entries = $xpath->query("//li[@class='clearfix_center']/a/@href");
$output = array();
foreach($entries as $e) {
$dom2 = new DOMDocument();
@$dom2->loadHTMLFile('http://www.auto-types.com' . $e->textContent);
$xpath2 = new DOMXPath($dom2);
$data = array();
$data['newLinks'] = trim($xpath2->query("//div[@class='modelImage']/a/@href")->item(0)->textContent);
$output[] = $data;
}
echo '<pre>' . print_r($output, true) . '</pre>';
?>
So I need to get mercedes/100, mercedes/200, mercedes/300, but with my script I get only the first link, mercedes/100.
Please help.
You need to iterate through the results instead of just taking the first item:
$items = $xpath2->query("//div[@class='modelImage']/a/@href");
$links = array();
foreach($items as $item) {
$links[] = $item->textContent;
}
$data['newLinks'] = implode(', ', $links);

Extract all urls Href php

How do I replace these links with their SHA-1 hashes and then get back the HTML with the hashed links applied?
$dom = new DOMDocument;
$dom->loadHTML($html);
$links = $dom->getElementsByTagName('a');
foreach ($links as $link) {
if (preg_match("/globo.com/i", $link->getAttribute('href'))) {
$v = $link->getAttribute('href');
$str = str_replace($v,'http://www.globo.com/?id='.sha1($v),$v);
$str2 = str_replace($v,$str,$html);
echo $str2."";
}
}
You can just put the href back into the element:
$dom = new DOMDocument;
$dom->loadHTML($html);
$links = $dom->getElementsByTagName('a');
foreach ($links as $link) {
$href = $link->getAttribute('href');
if (preg_match("/globo.com/i", $href)) {
$newHref = 'http://www.globo.com/?id=' . sha1($href);
$link->setAttribute('href', $newHref);
}
}
And then export the finished HTML using saveHTML().
echo $dom->saveHTML();
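A self-contained version of the same idea is sketched below. The sample markup is made up, and checking the parsed host with parse_url() instead of matching the whole href is a swapped-in assumption, intended to avoid rewriting links that merely mention globo.com somewhere in their path or query string.

```php
<?php
// Sample markup is an assumption standing in for $html.
$html = '<p><a href="http://www.globo.com/news">News</a> '
      . '<a href="http://example.org/">Other</a></p>';

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

foreach ($dom->getElementsByTagName('a') as $link) {
    $href = $link->getAttribute('href');
    $host = parse_url($href, PHP_URL_HOST);
    // Rewrite only links whose host is globo.com or one of its subdomains.
    if (is_string($host) && preg_match('/(^|\.)globo\.com$/i', $host)) {
        $link->setAttribute('href', 'http://www.globo.com/?id=' . sha1($href));
    }
}

$out = $dom->saveHTML();   // full document with the rewritten links
echo $out;
```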
