Replace all links in the body of html page using PHP

I have used the following code to replace all the links on an HTML page.
$output = file_get_contents($turl);
$newOutput = str_replace('href="http', 'target="_parent" href="http://localhost/e/site.php?turl=http', $output);
$newOutput = str_replace('href="www.', 'target="_parent" href="http://localhost/e/site.php?turl=www.', $newOutput);
$newOutput = str_replace('href="/', 'target="_parent" href="http://localhost/e/site.php?turl='.$turl.'/', $newOutput);
echo $newOutput;
I want to modify this code so that it replaces only the links inside the body, not those in the head.

You can use DOMDocument to parse and manipulate the source. A dedicated parser is more reliable for a task like this than string operations.
// Parse the HTML into a document ($html holds the page source)
$dom = new \DOMDocument();
$dom->loadHTML($html);
// Loop over all links within the `<body>` element
foreach ($dom->getElementsByTagName('body')->item(0)->getElementsByTagName('a') as $link) {
    // Save the existing link
    $oldLink = $link->getAttribute('href');
    // Set the new target attribute
    $link->setAttribute('target', "_parent");
    // Prefix the link with the new URL
    $link->setAttribute('href', "http://localhost/e/site.php?turl=" . urlencode($oldLink));
}
// Output the result
echo $dom->saveHTML();
See https://eval.in/843484
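As a variation on the same idea, here is a minimal sketch that selects the links with DOMXPath instead of nested getElementsByTagName() calls. The function name rewriteBodyLinks() and the sample markup are made up for illustration; the proxy prefix is the one from the question. It also silences libxml warnings, on the assumption that real pages are often not valid HTML.

```php
<?php
// Hypothetical helper: rewrite only the <a href> links found inside <body>.
// $proxyBase is the prefix from the question's example URL.
function rewriteBodyLinks(string $html, string $proxyBase): string
{
    $dom = new DOMDocument();
    // Collect parser warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    // Only anchors that are descendants of <body> and actually carry an href.
    foreach ($xpath->query('//body//a[@href]') as $link) {
        $link->setAttribute('target', '_parent');
        $link->setAttribute('href', $proxyBase . urlencode($link->getAttribute('href')));
    }
    return $dom->saveHTML();
}

// Sample page: the <link> in the head must stay untouched.
$sample = '<html><head><link rel="stylesheet" href="/style.css"></head>'
        . '<body><a href="http://example.com/page">Page</a></body></html>';
$result = rewriteBodyLinks($sample, 'http://localhost/e/site.php?turl=');
echo $result;
```

The XPath expression keeps the "body only" rule in one place, so it is easy to widen later (for example to also cover area elements) without touching the loop.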

You can decapitate the code: find the body tag and split the head and the body into two separate variables.
//$output = file_get_contents($turl);
$output = "<head> blablabla
Bla bla
</head>
<body>
Foobar
</body>";
// Decapitation: split the markup at the <body> tag
$head = substr($output, 0, strpos($output, "<body>"));
$body = substr($output, strpos($output, "<body>"));
// Rewrite the links in the body only, then put the head back in front
$newOutput = str_replace('href="http', 'target="_parent" href="http://localhost/e/site.php?turl=http', $body);
$newOutput = str_replace('href="www.', 'target="_parent" href="http://localhost/e/site.php?turl=www.', $newOutput);
$newOutput = str_replace('href="/', 'target="_parent" href="http://localhost/e/site.php?turl='.$turl.'/', $newOutput);
echo $head . $newOutput;
https://3v4l.org/WYcYP

Related

Find specific domain name and append url in string PHP

Let's say I have the following string:
<?php
$str = 'To subscribe go to <a href="http://foo.com/subscribe">Here</a>';
?>
What I'm trying to do is find the URLs within the string that have a specific domain name, "foo.com" in this example, and then append something to the URL.
What I want to accomplish:
<?php
$str = 'To subscribe go to <a href="http://foo.com/subscribe?package=2">Here</a>';
?>
If the domain name in a URL isn't foo.com, I don't want anything appended to it.

You can use the parse_url() function and PHP's DOMDocument class to manipulate the URLs, like this:
$str = 'To subscribe go to <a href="http://foo.com/subscribe">Here</a>';
$dom = new DOMDocument();
$dom->loadHTML($str);
$urls = $dom->getElementsByTagName('a');
foreach ($urls as $url) {
    $href = $url->getAttribute('href');
    $components = parse_url($href);
    // Skip relative or malformed URLs, which have no host component
    if (isset($components['host']) && $components['host'] == "foo.com") {
        $components['path'] .= "?package=2";
        $url->setAttribute('href', $components['scheme'] . "://" . $components['host'] . $components['path']);
    }
}
// Serialize once, after all links have been processed
$str = $dom->saveHTML();
echo $str;
Output:
To subscribe go to [Here]
^ href="http://foo.com/subscribe?package=2"
Here are the references:
The DOMDocument class
parse_url()
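One thing the snippet above does not handle is a link that already has a query string, since only scheme, host and path are put back together. A rough sketch of one way to cover that case follows. The helper name appendParamForHost() is invented for this example, and it assumes that dropping any user:password part of the URL and defaulting a missing scheme to http are acceptable.

```php
<?php
// Hypothetical helper (not part of the answer above): add or overwrite one
// query parameter on a URL, but only when the host matches $host.
function appendParamForHost(string $url, string $host, string $key, string $value): string
{
    $parts = parse_url($url);
    // Relative or malformed URLs have no host component; leave them untouched.
    if ($parts === false || !isset($parts['host']) || strcasecmp($parts['host'], $host) !== 0) {
        return $url;
    }

    // Merge the new parameter into whatever query string is already present.
    parse_str($parts['query'] ?? '', $query);
    $query[$key] = $value;

    // Assumption: a missing scheme defaults to http, a missing path to "/".
    $rebuilt = ($parts['scheme'] ?? 'http') . '://' . $parts['host'];
    if (isset($parts['port'])) {
        $rebuilt .= ':' . $parts['port'];
    }
    $rebuilt .= ($parts['path'] ?? '/') . '?' . http_build_query($query);
    if (isset($parts['fragment'])) {
        $rebuilt .= '#' . $parts['fragment'];
    }
    return $rebuilt;
}

echo appendParamForHost('http://foo.com/subscribe', 'foo.com', 'package', '2'), "\n";
echo appendParamForHost('http://foo.com/subscribe?ref=mail', 'foo.com', 'package', '2'), "\n";
echo appendParamForHost('http://bar.com/subscribe', 'foo.com', 'package', '2'), "\n";
```

Inside the DOMDocument loop you would then call something like $url->setAttribute('href', appendParamForHost($href, 'foo.com', 'package', '2')) instead of rebuilding the URL by hand.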

How do I get the value of a <pre> tag with no ID?

I have the following code set up from an example:
<?php
$url = 'http://somedomain/something';
$content = file_get_contents($url);
$first_step = explode( '<div id="somediv">' , $content );
$second_step = explode("</div>" , $first_step[1] );
echo $second_step[0];
?>
The problem here is that the website from which I'm trying to fetch the value of the pre tag has no ID:
<pre>some content</pre>
I've also tried this but no success so far:
<?php
$url = 'http://somedomain/something';
$content = file_get_contents($url);
$first_step = explode( '<script>document.getElementsByTagName("pre")' , $content );
$second_step = explode("</script>" , $first_step[1] );
echo $second_step[0];
?>
Basically, I'm trying to fetch a value from a domain which is wrapped by a pre tag with no additional identifiers. Any help appreciated!

PHP ships with a pretty decent document parser:
$dom = new DOMDocument;
$dom->loadHTMLFile('http://somedomain/something');
foreach ($dom->getElementsByTagName('pre') as $node) {
// do stuff with $node
echo $node->nodeValue, "\n";
}
See also: DOMDocument
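If the page is fetched separately (for example with file_get_contents) and its markup is not perfectly valid, loadHTML() tends to emit a lot of warnings. A small sketch of the same loop that works on a string and keeps those warnings quiet; extractPreTexts() and the sample markup are just illustrative names and data, not anything from the question's site.

```php
<?php
// Hypothetical helper: return the text of every <pre> element in an HTML string.
function extractPreTexts(string $html): array
{
    $dom = new DOMDocument();
    // Collect parser warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $texts = [];
    foreach ($dom->getElementsByTagName('pre') as $node) {
        $texts[] = $node->textContent;
    }
    return $texts;
}

// Inline sample instead of a remote page, so the sketch runs offline.
$sample = '<html><body><p>intro</p><pre>some content</pre><pre>more</pre></body></html>';
print_r(extractPreTexts($sample));
```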

There are many ways to parse HTML DOM elements.
For the PHP Simple HTML DOM Parser, see http://simplehtmldom.sourceforge.net/
For Yahoo YQL, see https://developer.yahoo.com/yql/
JavaScript and jQuery also offer many ways to parse HTML.
Use whichever is most convenient for you.

Interpret string as HTML in PHP

I am working with PHP and have only just begun, so I would like to ask for advice on something I can't seem to find online.
I have a PHP file that receives two strings: a name and a large chunk of HTML text.
$name = $_POST['name'];
$content = $_POST['content'];
Now I want to use those two to create a new HTML file and save it. So far I've managed to save a new HTML file that uses the name as the <title> tag. What I want to do now is replace the (currently empty) body with my HTML string, interpreted as HTML.
This builds my body tag:
$body = $doc->createElement('body');
$body = $root->appendChild($body);
I found two ways to do this, and neither works:
Solution 1:
$content = $doc->createTextNode($content);
$content = $body->appendChild($content);
This inserts my HTML into my body, but it shows up literally as '<div id="lala">content</div>'. So this isn't what I want.
Solution 2:
$content = $doc->loadHTML($content);
This does load the string as HTML, but it replaces my entire document, so things like the CSS and JS I add to the head now end up UNDER the new body instead of in the head. For example:
$link_1 = $doc->createElement('link');
$link_1->setAttribute('rel','stylesheet');
$link_1->setAttribute('href','css/mystylesheet.min.css');
$link_1 = $head->appendChild($link_1);
So basically I want to both interpret my string as HTML AND have it load in the correct place. I've tried changing it to $body->loadHTML($content), but this gives me an internal error.
Does anyone know the PHP method I'm looking for? Thanks!

This will work. Let me know if you need explanation :)
<?php
ini_set( 'display_errors', 1 );
ini_set( 'error_reporting', E_ALL );
$dom = new DOMDocument( '1.0' );
$dom->formatOutput = true;
$dom->preserveWhiteSpace = true;
// test values
$name = 'Test';
$content = '<div onclick="alert(\'Hello!\');">Test div</div>';
// create <html>, <head>, <title> and <body> tags
$html = $dom->createElement( 'html' );
$head = $dom->createElement( 'head' );
$title = $dom->createElement( 'title' );
$body = $dom->createElement( 'body' );
// title text
$titleText = $dom->createTextNode( $name );
// parse the HTML string in a separate document
$dom1 = new DOMDocument( '1.0' );
$dom1->formatOutput = true;
$dom1->preserveWhiteSpace = true;
$dom1->loadHTML( $content );
// loadHTML() wraps the fragment in <html><body>; grab that <body>
$bodyText = $dom1->getElementsByTagName('body')->item(0);
// add them to the dom
$html = $dom->appendChild( $html );
$html->appendChild( $head );
$head->appendChild( $title );
$title->appendChild( $titleText );
$html->appendChild( $body );
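// import the children of the parsed <body>, not the <body> itself,
// so the result does not contain a <body> nested inside a <body>
foreach ( $bodyText->childNodes as $child ) {
    $body->appendChild( $dom->importNode( $child, true ) );
}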
echo $dom->saveHTML();
?>
Hope this helps.

Of course, the text in createTextNode stands for plain text (vs. HTML).
Fatal error: Call to undefined method DOMElement::loadHTML()
Correct, loadHTML() belongs to DOMDocument, not DOMElement. In your case, the document is apparently $doc (not $body). So you need to call:
$doc->loadHTML()
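Since calling loadHTML() on $doc replaces the whole document, one way around that (a sketch only, not the only option) is to parse the string in a scratch DOMDocument and copy the resulting nodes into the existing body with importNode(). The helper name appendHtml() is invented here; $doc, $head and $body mirror the variable names from the question.

```php
<?php
// Hypothetical helper: parse an HTML fragment in a scratch document and copy
// the resulting nodes into $parent, which belongs to a different document.
function appendHtml(DOMNode $parent, string $fragment): void
{
    $scratch = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    // Wrapping in <body> ensures the parsed nodes end up under one known element.
    $scratch->loadHTML('<html><body>' . $fragment . '</body></html>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $scratchBody = $scratch->getElementsByTagName('body')->item(0);
    foreach ($scratchBody->childNodes as $child) {
        // importNode() creates a deep copy owned by the target document.
        $parent->appendChild($parent->ownerDocument->importNode($child, true));
    }
}

$doc  = new DOMDocument();
$root = $doc->appendChild($doc->createElement('html'));
$head = $root->appendChild($doc->createElement('head'));
$body = $root->appendChild($doc->createElement('body'));

appendHtml($body, '<div id="lala">content</div>');
// The head can still be extended afterwards and stays above the body.
$head->appendChild($doc->createElement('title', 'Test'));

echo $doc->saveHTML();
```

Because the original document is never reloaded, anything appended to $head later still lands in the head.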

PHP DOMDocument how to get element?

I am trying to read a website's content, but I have a problem: I want to get images and links, and I want the elements themselves, not just their content. For a link, for instance, I want to get the entire element.
How can I do this?
<?php
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "http://www.link.com");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$output = curl_exec($ch);
$dom = new DOMDocument;
@$dom->loadHTML($output);
$items = $dom->getElementsByTagName('a');
for($i = 0; $i < $items->length; $i++) {
echo $items->item($i)->nodeValue . "<br />";
}
curl_close($ch);
?>

You appear to be asking for the serialized HTML of a DOMElement, i.e. a string containing the element's full markup rather than just its link text? (Please make your question clearer.)
$url = 'http://example.com';
$dom = new DOMDocument();
$dom->loadHTMLFile($url);
$anchors = $dom->getElementsByTagName('a');
foreach ($anchors as $a) {
// Best solution, but only works with PHP >= 5.3.6
$htmlstring = $dom->saveHTML($a);
// Otherwise you need to serialize to XML and then fix the self-closing elements
$htmlstring = saveHTMLFragment($a);
echo $htmlstring, "\n";
}
function saveHTMLFragment(DOMElement $e) {
$selfclosingelements = array('></area>', '></base>', '></basefont>',
'></br>', '></col>', '></frame>', '></hr>', '></img>', '></input>',
'></isindex>', '></link>', '></meta>', '></param>', '></source>',
);
// This is not 100% reliable because it may output namespace declarations.
// But otherwise it is extra-paranoid to work down to at least PHP 5.1
$html = $e->ownerDocument->saveXML($e, LIBXML_NOEMPTYTAG);
// in case any empty elements are expanded, collapse them again:
$html = str_ireplace($selfclosingelements, '>', $html);
return $html;
}
However, note that what you are doing is dangerous because it could potentially mix encodings. It is better to have your output as another DOMDocument and use importNode() to copy the nodes you want. Alternatively, use an XSL stylesheet.

I'm assuming you just copy-pasted some example code and didn't bother trying to learn how it actually works...
Anyway, the ->nodeValue part takes the element and returns the text content (because the element has a single text node child - if it had anything else, I don't know what nodeValue would give).
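So, just remove the ->nodeValue and you have your element. Note that this gives you a DOMElement object, which cannot be echoed directly; pass it to $dom->saveHTML() if you need its markup as a string.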

how can i get specific data between div with file_get_content

Suppose I echo out
$url = "http://www.mydomain.com";
echo file_get_contents($url);
and http://www.mydomain.com has a div, i.e.
<title>sitename</title>
</head><body>
Lorem Ipsum.......
<div id="divname">and here is div content</div>
Copyright bla bla bla
Now I want to fetch only the content of the div with id="divname". How can I do that?

$url = "http://www.mydomain.com";
$html = new SimpleXmlElement($url, null, true);
$content = $html->xpath("//div[@id='divname']");
Of course you could still use file_get_contents or curl if you want to introduce error checking on the fetch of the document.
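Note that SimpleXMLElement expects well-formed XML, which many HTML pages are not. As a hedged alternative, the same XPath query can be run through DOMDocument and DOMXPath, which tolerate ordinary HTML. The sample markup below is adapted from the question and stands in for the fetched page.

```php
<?php
// Sample markup standing in for the result of file_get_contents($url).
$html = '<html><head><title>sitename</title></head><body>Lorem Ipsum'
      . '<div id="divname">and here is div content</div>Copyright</body></html>';

$dom = new DOMDocument();
// Collect parser warnings internally instead of printing them.
$previous = libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();
libxml_use_internal_errors($previous);

$xpath = new DOMXPath($dom);
$nodes = $xpath->query('//div[@id="divname"]');
// null when the div is missing, otherwise its text content.
$content = $nodes->length > 0 ? $nodes->item(0)->textContent : null;
echo $content, "\n";
```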

With Simple HTML DOM Parser
$url = "http://www.mydomain.com";
$html = file_get_html($url);
$ret = $html->find('div[id=divname]');

$html = file_get_html('http://www.mydomain.com');
foreach ($html->find('div#divname') as $e)
    echo $e->innertext;
Here "divname" is an id, so the selector uses #. To match a class instead, use . (dot).
