How can I get specific data between div tags with file_get_contents - PHP

Suppose I echo out
$url = "http://www.mydomain.com";
echo file_get_contents($url);
and http://www.mydomain.com has a div, i.e.
<title>sitename</title>
</head><body>
Lorem Ipsum.......
<div id="divname">and here is div content</div>
Copyright bla bla bla
Now I want to fetch only the content of the div with id="divname". How can I do that?

$url = "http://www.mydomain.com";
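// SimpleXMLElement only parses well-formed XML (e.g. XHTML); it fails on typical HTML
$html = new SimpleXMLElement($url, 0, true);
// XPath addresses attributes with @, so match on @id
$content = $html->xpath("//div[@id='divname']");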
Of course you could still use file_get_contents or curl if you want to introduce error checking on the fetch of the document.
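A minimal sketch of that idea, assuming the page is ordinary HTML rather than well-formed XML: fetch the source yourself, check the result, and hand it to DOMDocument and DOMXPath. The helper name extractDivById and the inline sample page are made up for illustration, and the helper assumes the id contains no single quote.

```php
<?php
// Hypothetical helper: return the text of the first div with the given id, or null.
// Assumes $id contains no single quote (it is placed inside an XPath string literal).
function extractDivById(string $html, string $id): ?string
{
    if ($html === '') {
        return null; // loadHTML() rejects an empty string
    }
    $previous = libxml_use_internal_errors(true); // collect parse warnings instead of printing them
    $dom = new DOMDocument();
    $loaded = $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    if (!$loaded) {
        return null;
    }
    $xpath = new DOMXPath($dom);
    $nodes = $xpath->query("//div[@id='" . $id . "']");
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }
    return $nodes->item(0)->textContent;
}

// Inline sample modelled on the page in the question.
$sample = '<html><head><title>sitename</title></head><body>Lorem Ipsum'
        . '<div id="divname">and here is div content</div>Copyright</body></html>';
echo extractDivById($sample, 'divname'), "\n";

// With a real fetch, check the return value before parsing:
// $html = @file_get_contents('http://www.mydomain.com');
// if ($html === false) { die('Could not fetch the page'); }
// echo extractDivById($html, 'divname');
```

Keeping the fetch separate from the parsing means a network failure and a missing div can be reported differently.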

With Simple HTML DOM Parser
$url = "http://www.mydomain.com";
$html = file_get_html($url);
$ret = $html->find('div[id=divname]');

$html = file_get_html('http://www.mydomain.com');
foreach($html->find('div#divname') as $e)
echo $e->innertext;
Here "divname" is an id, so the selector uses #. To select by class instead, use a dot (.), e.g. div.classname.

Related

Replace all links in the body of html page using PHP

I have used the following code to replace all the links on HTML page.
$output = file_get_contents($turl);
$newOutput = str_replace('href="http', 'target="_parent" href="http://localhost/e/site.php?turl=http', $output);
$newOutput = str_replace('href="www.', 'target="_parent" href="http://localhost/e/site.php?turl=www.', $newOutput);
$newOutput = str_replace('href="/', 'target="_parent" href="http://localhost/e/site.php?turl='.$turl.'/', $newOutput);
echo $newOutput;
I want to modify this code to replace only links inside the body and not in the head.
You can use DOMDocument to parse and manipulate the source. It's always a better idea to use a dedicated parser for a task like this instead of using string operations.
// Parse the HTML into a document ($html holds the page source, e.g. $output in the question)
$dom = new \DOMDocument();
$dom->loadHTML($html);
// Loop over all links within the `<body>` element
foreach ($dom->getElementsByTagName('body')->item(0)->getElementsByTagName('a') as $link) {
    // Save the existing link
    $oldLink = $link->getAttribute('href');
    // Set the new target attribute
    $link->setAttribute('target', "_parent");
    // Prefix the link with the new URL
    $link->setAttribute('href', "http://localhost/e/site.php?turl=" . urlencode($oldLink));
}
// Output the result
echo $dom->saveHTML();
See https://eval.in/843484
You can "decapitate" the page: find the body tag and split the source into two variables, one holding the head and one holding the body. Then run the replacements on the body only.
//$output = file_get_contents($turl);
$output = "<head> blablabla
Bla bla
</head>
<body>
Foobar
</body>";
// Decapitation: find the <body> tag and split the page into head and body
$head = substr($output, 0, strpos($output, "<body>"));
$body = substr($output, strpos($output, "<body>"));
// Rewrite links in the body only
$newOutput = str_replace('href="http', 'target="_parent" href="http://localhost/e/site.php?turl=http', $body);
$newOutput = str_replace('href="www.', 'target="_parent" href="http://localhost/e/site.php?turl=www.', $newOutput);
$newOutput = str_replace('href="/', 'target="_parent" href="http://localhost/e/site.php?turl='.$turl.'/', $newOutput);
echo $head . $newOutput;
https://3v4l.org/WYcYP
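One caveat with searching for the literal string "<body>": it misses tags written as <BODY> or carrying attributes. A possible variation, sketched here with a hypothetical helper splitAtBody and a made-up sample page, locates the tag with a case-insensitive regular expression instead. It needs PHP 7.1+ for the array destructuring.

```php
<?php
// Hypothetical helper: split a page into [everything before <body ...>, the rest].
function splitAtBody(string $html): array
{
    // Matches <body>, <BODY> and <body class="..."> and records the byte offset.
    if (!preg_match('/<body\b[^>]*>/i', $html, $m, PREG_OFFSET_CAPTURE)) {
        return [$html, '']; // no body tag found: treat everything as head
    }
    $pos = $m[0][1];
    return [substr($html, 0, $pos), substr($html, $pos)];
}

$turl = 'http://www.mydomain.com'; // assumed value, as in the first question
$page = '<head><link href="/style.css"></head>'
      . '<BODY class="home"><a href="/about">About</a></BODY>';

[$head, $body] = splitAtBody($page);
// Only the body part goes through the replacement; the head is left alone.
$body = str_replace('href="/', 'target="_parent" href="http://localhost/e/site.php?turl=' . $turl . '/', $body);
echo $head . $body, "\n";
```

This is still string surgery, so a commented-out body tag or one inside a script would fool it; the DOMDocument approach avoids that.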

Find specific domain name and append url in string PHP

Let's say I have the following string:
<?php
$str = 'To subscribe go to <a href="http://foo.com/subscribe">Here</a>';
?>
What I'm trying to do is find the URLS within the string that have a specific domain name, "foo.com" for this example, then append the url.
What I want to accomplish:
<?php
$str = 'To subscribe go to <a href="http://foo.com/subscribe?package=2">Here</a>';
?>
If the domain name in the urls isn't foo.com, I don't want them to be appended.
You can use the parse_url() function and PHP's DOMDocument class to manipulate the URLs, like this:
$str = 'To subscribe go to <a href="http://foo.com/subscribe">Here</a>';
$dom = new DOMDocument();
$dom->loadHTML($str);
$urls = $dom->getElementsByTagName('a');
foreach ($urls as $url) {
    $href = $url->getAttribute('href');
    $components = parse_url($href);
    // Relative or malformed URLs have no host component, so skip them
    if (isset($components['host']) && $components['host'] == "foo.com") {
        $path = (isset($components['path']) ? $components['path'] : '/') . "?package=2";
        $url->setAttribute('href', $components['scheme'] . "://" . $components['host'] . $path);
    }
}
// Serialize once, after all links have been processed
$str = $dom->saveHTML();
echo $str;
Output:
To subscribe go to [Here]
^ href="http://foo.com/subscribe?package=2"
Here are the references:
The DOMDocument class
parse_url()
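Appending "?package=2" to the path works for the URL in the question, but it would drop or break an existing query string. A hedged sketch of a more general version, using parse_str() and http_build_query() to merge the parameter in; the helper name addQueryParam is made up, and the sketch ignores user/password parts of a URL:

```php
<?php
// Hypothetical helper: add or overwrite one query parameter, but only for URLs on $host.
function addQueryParam(string $href, string $host, string $key, string $value): string
{
    $parts = parse_url($href);
    if (!is_array($parts) || !isset($parts['scheme'], $parts['host'])
        || strcasecmp($parts['host'], $host) !== 0) {
        return $href; // other domain, relative or malformed URL: leave untouched
    }
    $query = [];
    parse_str($parts['query'] ?? '', $query); // existing parameters, if any
    $query[$key] = $value;

    $url = $parts['scheme'] . '://' . $parts['host'];
    if (isset($parts['port'])) {
        $url .= ':' . $parts['port'];
    }
    $url .= ($parts['path'] ?? '/') . '?' . http_build_query($query);
    if (isset($parts['fragment'])) {
        $url .= '#' . $parts['fragment'];
    }
    return $url;
}

echo addQueryParam('http://foo.com/subscribe', 'foo.com', 'package', '2'), "\n";
echo addQueryParam('http://foo.com/subscribe?ref=mail#top', 'foo.com', 'package', '2'), "\n";
echo addQueryParam('http://bar.com/subscribe', 'foo.com', 'package', '2'), "\n";
```

Inside the DOMDocument loop, the result would be passed to setAttribute('href', ...) in place of the hand-built string.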

How do I get the value of a <pre> tag with no ID?

I have the following code set up from an example:
<?php
$url = 'http://somedomain/something';
$content = file_get_contents($url);
$first_step = explode( '<div id="somediv">' , $content );
$second_step = explode("</div>" , $first_step[1] );
echo $second_step[0];
?>
The problem here is that the website from which I'm trying to fetch the value of the pre tag has no ID:
<pre>some content</pre>
I've also tried this but no success so far:
<?php
$url = 'http://somedomain/something';
$content = file_get_contents($url);
$first_step = explode( '<script>document.getElementsByTagName("pre")' , $content );
$second_step = explode("</script>" , $first_step[1] );
echo $second_step[0];
?>
Basically, I'm trying to fetch a value from a domain which is wrapped by a pre tag with no additional identifiers. Any help appreciated!
PHP ships with a pretty decent document parser:
$dom = new DOMDocument;
$dom->loadHTMLFile('http://somedomain/something');
foreach ($dom->getElementsByTagName('pre') as $node) {
// do stuff with $node
echo $node->nodeValue, "\n";
}
See also: DOMDocument
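If only the first pre element is wanted, a small variation with DOMXPath may help. This is a sketch on an inline sample string (made up for illustration) rather than the real URL; it also silences the parser warnings that real-world HTML tends to trigger.

```php
<?php
// Sample markup standing in for the fetched page.
$html = '<html><body><p>intro</p><pre>some content</pre><pre>second block</pre></body></html>';

$previous = libxml_use_internal_errors(true); // keep parse warnings out of the output
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();
libxml_use_internal_errors($previous);

$xpath = new DOMXPath($dom);
// item(0) is the first <pre> in document order, or null if the page has none.
$first = $xpath->query('//pre')->item(0);
$firstPre = $first !== null ? $first->textContent : null;

echo $firstPre, "\n";
```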
There are many ways to parse HTML DOM elements.
For the PHP Simple HTML DOM Parser, see http://simplehtmldom.sourceforge.net/
For Yahoo YQL, see https://developer.yahoo.com/yql/
JavaScript and jQuery also offer many methods for parsing HTML.
Use whichever is most convenient for you.

Namespace in MRSS feed using simplexml PHP Script

Tried to research what I'm doing wrong here, but no luck so far. I want to pull the links and URLs in this MRSS feed using this script, but it's not working. Thought all I needed to do was use namespaces to get the child elements out, but no luck:
<?php
$html = "";
$url = "http://feeds.nascar.com/feeds/video?command=search_videos&media_delivery=http&custom_fields=adtitle%2cfranchise&page_size=100&sort_by=PUBLISH_DATE:DESC&token=217e0d96-bd4a-4451-88ec-404debfaf425&any=franchise:%20Preview%20Show&any=franchise:%20Weekend%20Top%205&any=franchise:Up%20to%20Speed&any=franchise:Press%20Pass&any=franchise:Sprint%20Cup%20Practice%20Clips&any=franchise:Sprint%20Cup%20Highlights&any=franchise:Sprint%20Cup%20Final%20Laps&any=franchise:Sprint%20Cup%20Victory%20Lane&any=franchise:Sprint%20Cup%20Post%20Race%20Reactions&any=franchise:All%20Access&any=franchise:Nationwide%20Series%20Qualifying%20Clips&any=franchise:Nationwide%20Series%20Highlights&any=franchise:Nationwide%20Series%20Final%20Laps&any=franchise:Nationwide%20Series%20Victory%20Lane&any=franchise:Nationwide%20Series%20Post%20Race%20Reactions&any=franchise:Truck%20Series%20Qualifying%20Clips&any=franchise:Truck%20Series%20Highlights&any=franchise:Truck%20Series%20Final%20Laps&any=franchise:Truck%20Series%20Victory%20Lane&any=franchise:Truck%20Series%20Post%20Race%20Reactions&output=mrss";
$xml = simplexml_load_file($url);
$namespaces = $xml->getNamespaces(true); // get namespaces
for($i = 0; $i < 50; $i++){ // will return the 50 most recent videos
$title = $xml->channel->item[$i]->title;
$link = $xml->channel->item[$i]->link;
$pubDate = $xml->channel->item[$i]->pubDate;
$description = $xml->channel->item[$i]->description;
$titleid = $xml->channel->item[$i]->children($namespaces['bc'])->titleid;
$url = $xml->channel->item[$i]->children($namespaces['media'])->url;
$html .= "<h3>$title</h3>$description<p>$pubDate<p>$link<p>Video ID: $titleid<p>
<iframe width='480' height='270' src='http://link.brightcove.com/services/player/bcpid3742068445001?bckey=//my API token goes here &bctid=$titleid&autoStart=false' frameborder='0'></iframe><hr/>";/* this embed code is from the youtube iframe embed code format but is actually using the embedded Ooyala player embedded on the Campus Insiders page. I replaced any specific guid (aka video ID) numbers with the "$guid" variable while keeping the Campus Insider Ooyala publisher ID, "eb3......fad" */
}
echo $html;
?>
I take it this isn't the right approach:
$url = $xml->channel->item[$i]->children($namespaces['media'])->url;
What am I doing wrong here?
Thanks for any and all help!
MD
SimpleXML is deceptively named as it is more difficult to use than DOMDocument or the other PHP XML extensions. To get the URL, you'll need to access the url attribute of the media:content node:
<media:content duration="95" medium="video" type="video/mp4"
url="http://brightcove.meta.nascar.com.edgesuite.net/vod/etc"/>
Target the first <media:content> node using
$xml->channel->item[$i]->children($namespaces['media'])->content[0]
and get its attributes:
$m_attrs =
$xml->channel->item[$i]->children($namespaces['media'])->content[0]->attributes();
You can then access the url attribute:
echo "URL: " . $m_attrs["url"] . "\n";
Your code should thus be:
$titleid = $xml->channel->item[$i]->children($namespaces['bc'])->titleid;
$m_attrs = $xml->channel->item[$i]->children($namespaces['media'])->content[0]->attributes();
$url = $m_attrs["url"];
$html .= "<h3>$title</h3>$description<p>$pubDate<p>$link<p>Video ID: $titleid<p> (etc.)";
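To try this without hitting the live feed, here is a self-contained sketch on a tiny inline feed. The sample item, its values and the bc namespace URI are assumptions made up for illustration; the code itself only relies on getNamespaces(), children() and attributes() as used above.

```php
<?php
// Minimal stand-in for the MRSS feed (sample data, not the real feed).
$feed = <<<XML
<rss version="2.0" xmlns:media="http://search.yahoo.com/mrss/" xmlns:bc="http://www.brightcove.tv/link">
  <channel>
    <item>
      <title>Sample clip</title>
      <bc:titleid>12345</bc:titleid>
      <media:content duration="95" medium="video" type="video/mp4" url="http://example.com/video.mp4"/>
    </item>
  </channel>
</rss>
XML;

$xml = simplexml_load_string($feed);
$namespaces = $xml->getNamespaces(true); // prefix => URI for every namespace in the document

$results = [];
// foreach stops at the last item, unlike a fixed loop to 50 on a shorter feed.
foreach ($xml->channel->item as $item) {
    $media = $item->children($namespaces['media']);
    $bc    = $item->children($namespaces['bc']);
    // url is an attribute of <media:content>, not a child element.
    $attrs = $media->content[0]->attributes();
    $results[] = [
        'title'   => (string) $item->title,
        'titleid' => (string) $bc->titleid,
        'url'     => (string) $attrs['url'],
    ];
}
print_r($results);
```

Casting to string matters here: SimpleXML returns objects, and storing them uncast can cause surprises when the values are compared or serialized later.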

PHP DOMDocument how to get element?

I am trying to read a website's content, but I have a problem. I want to get elements such as images and links, and I want the elements themselves, not just their content. For instance, for a link I want the entire element, tag and attributes included, not only its text.
How can I do this?
<?php
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "http://www.link.com");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$output = curl_exec($ch);
$dom = new DOMDocument;
@$dom->loadHTML($output);
$items = $dom->getElementsByTagName('a');
for($i = 0; $i < $items->length; $i++) {
echo $items->item($i)->nodeValue . "<br />";
}
curl_close($ch);
?>
You appear to be asking for the serialized HTML of a DOMElement? E.g. you want a string containing the whole anchor markup (opening tag, attributes, link text and closing tag) rather than just the link text? (Please make your question clearer.)
$url = 'http://example.com';
$dom = new DOMDocument();
$dom->loadHTMLFile($url);
$anchors = $dom->getElementsByTagName('a');
foreach ($anchors as $a) {
// Best solution, but only works with PHP >= 5.3.6
$htmlstring = $dom->saveHTML($a);
// On older PHP versions, serialize to XML and fix the self-closing elements instead:
// $htmlstring = saveHTMLFragment($a);
echo $htmlstring, "\n";
}
function saveHTMLFragment(DOMElement $e) {
$selfclosingelements = array('></area>', '></base>', '></basefont>',
'></br>', '></col>', '></frame>', '></hr>', '></img>', '></input>',
'></isindex>', '></link>', '></meta>', '></param>', '></source>',
);
// This is not 100% reliable because it may output namespace declarations.
// But otherwise it is extra-paranoid to work down to at least PHP 5.1
$html = $e->ownerDocument->saveXML($e, LIBXML_NOEMPTYTAG);
// in case any empty elements are expanded, collapse them again:
$html = str_ireplace($selfclosingelements, '>', $html);
return $html;
}
However, note that what you are doing is dangerous because it could potentially mix encodings. It is better to have your output as another DOMDocument and use importNode() to copy the nodes you want. Alternatively, use an XSL stylesheet.
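A rough sketch of the importNode() route, under the assumption that the copied links should end up inside a wrapper div in a fresh UTF-8 document. The sample markup and the wrapper element are made up for illustration.

```php
<?php
// Source document: stands in for the fetched page.
$source = new DOMDocument();
$previous = libxml_use_internal_errors(true);
$source->loadHTML('<html><body><p>Text <a href="http://example.com/a">first</a> and '
    . '<a href="http://example.com/b"><img src="b.png" alt="b"></a></p></body></html>');
libxml_clear_errors();
libxml_use_internal_errors($previous);

// Target document with an explicit encoding; the <div> wrapper is an arbitrary choice.
$target = new DOMDocument('1.0', 'UTF-8');
$wrapper = $target->appendChild($target->createElement('div'));

foreach ($source->getElementsByTagName('a') as $a) {
    // importNode() copies the node into $target; true makes it a deep copy,
    // so child nodes such as <img> come along.
    $wrapper->appendChild($target->importNode($a, true));
}

$out = $target->saveHTML($wrapper);
echo $out, "\n";
```

Because the nodes are copied rather than moved, the source document stays unchanged and can be queried again.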
I'm assuming you just copy-pasted some example code and didn't bother trying to learn how it actually works...
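Anyway, the ->nodeValue part takes the element and returns its text content (for an element, that is the concatenated text of all its descendants, with the markup stripped).
So, just remove the ->nodeValue and you have your element. Note that the element is a DOMElement object, so you cannot echo it directly; read its attributes with getAttribute() or serialize it with saveHTML().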
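To make the difference concrete, here is a small sketch on a made-up one-link document showing three things you can take from the same element: its text, one attribute, and its full markup.

```php
<?php
$dom = new DOMDocument();
$dom->loadHTML('<html><body><a href="http://example.com/page" title="t">link text</a></body></html>');

// item(0) is the DOMElement itself, not a string.
$a = $dom->getElementsByTagName('a')->item(0);

$text   = $a->nodeValue;             // text content only
$href   = $a->getAttribute('href');  // a single attribute
$markup = $dom->saveHTML($a);        // the whole element as an HTML string (PHP >= 5.3.6)

echo $text, "\n", $href, "\n", $markup, "\n";
```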
