How can I get a specific div from website? [duplicate] - php

I am trying to get a specific div element (i.e. with attribute id="vung_doc") from a website, but I get almost every element. Do you have any idea what's wrong?
$doc = new DOMDocument;
// We don't want to bother with white spaces
$doc->preserveWhiteSpace = false;
// Most HTML Developers are chimps and produce invalid markup...
$doc->strictErrorChecking = false;
$doc->recover = true;
$doc->loadHTMLFile('http://lightnovelgate.com/chapter/epoch_of_twilight/chapter_300');
$xpath = new DOMXPath($doc);
$query = "//*[@class='vung_doc']";
$entries = $xpath->query($query);
var_dump($entries->item(0)->textContent);

Actually, it appears that the one element which has both its id and class attributes set to vung_doc contains many paragraphs in its text content. Perhaps you were expecting each paragraph to be in its own div element.
<div id="vung_doc" class="vung_doc" style="font-size: 18px;">
<p></p>
"Mayor song..."
In the screenshot at the bottom of this post, I added an outline style to that element to show just how many paragraphs are inside it.
If you wanted to separate the paragraphs, you could use preg_split() to split on any new line characters:
$entries = $xpath->query($query);
foreach ($entries as $entry) {
    $paragraphs = preg_split("/[\r\n]+/", $entry->textContent);
    foreach ($paragraphs as $paragraph) {
        if (trim($paragraph)) {
            echo '<b>paragraph:</b> ' . $paragraph . '<br>';
        }
    }
}
See a demonstration of this in this playground example. Note that before loading the HTML file, libxml_use_internal_errors() is called, to suppress the XML errors:
libxml_use_internal_errors(true);
Screenshot of the target div element with outline added:

Change
$query = "//*[@class='vung_doc']";
to
$query = "//*[@id='vung_doc']";

Related

How do I extract all URL links from an RSS feed? [duplicate]

I need to extract all the links to news articles from the NY Times RSS feed to a MySQL database periodically. How do I go about doing this? Can I use some regular expression (in PHP) to match the links? Or is there some other alternative way? Thanks in advance.
UPDATE 2: I tested the code below and had to modify the line
$links = $dom->getElementsByTagName('a');
and change it to:
$links = $dom->getElementsByTagName('link');
It successfully outputted the links. Good Luck
UPDATE: It looks like there is a complete answer here: How do you parse and process HTML/XML in PHP.
I developed a solution so that I could recursively follow all the links on my website. I've removed the code that verified the domain was the same on each recursion (since the question didn't ask for this), but you can easily add it back in if you need it.
Using DOMDocument, you can parse an HTML or XML document to read the links. It is better than using a regex. Try something like this:
<?php
// 300 seconds = 5 minutes - or however long you need so PHP won't time out
ini_set('max_execution_time', 300);

// Using a global to store the links in case there is recursion, it makes it easy.
// You could of course pass the array by reference for cleaner code.
$alinks = array();

// set the link to whatever you are reading
$link = "http://rss.nytimes.com/services/xml/rss/nyt/HomePage.xml";

// do the search
linksearch($link, $alinks);

// show results
var_dump($alinks);

function linksearch($url, &$alinks) {
    // use $queue if you want this fn to be recursive
    $queue = array();
    echo "<br>Searching: $url";
    $href = array();
    // Load the HTML page
    $html = file_get_contents($url);
    // Create a new DOM document
    $dom = new DOMDocument;
    // Parse the HTML. The @ is used to suppress any parsing errors
    // that will be thrown if the $html string isn't valid XHTML.
    @$dom->loadHTML($html);
    // Get all links. You could also use any other tag name here,
    // like 'img' or 'table', to extract other tags.
    $links = $dom->getElementsByTagName('link');
    // Iterate over the extracted links and collect their URLs
    foreach ($links as $link) {
        // Extract the "href" attribute.
        $href[] = $link->getAttribute('href');
    }
    foreach (array_unique($href) as $link) {
        // add to list of links found
        $queue[] = $link;
    }
    // remove duplicates
    $queue = array_unique($queue);
    // get links that haven't yet been processed
    $queue = array_diff($queue, $alinks);
    // update array passed by reference with new links found
    $alinks = array_merge($alinks, $queue);
    if (count($queue) > 0) {
        foreach ($queue as $link) {
            // recursive search - uncomment if you use this
            // remember to check that the domain is the same as the one you started from
            // linksearch($link, $alinks);
        }
    }
}
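The answer mentions re-adding a same-domain check before recursing; a minimal sketch of such a guard (my own illustration using parse_url(), not part of the original answer) could look like this:
// Hypothetical helper: only recurse when the candidate link is on the same
// host as the URL we started from.
function sameDomain($startUrl, $candidateUrl) {
    $startHost = parse_url($startUrl, PHP_URL_HOST);
    $candidateHost = parse_url($candidateUrl, PHP_URL_HOST);
    return is_string($startHost)
        && is_string($candidateHost)
        && strcasecmp($startHost, $candidateHost) === 0;
}

// inside the recursion loop above:
// if (sameDomain($url, $link)) {
//     linksearch($link, $alinks);
// }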
DOM+Xpath allows you to fetch nodes using expressions.
RSS Item Links
To fetch the RSS link elements (the link for each item):
$xml = file_get_contents($url);
$document = new DOMDocument();
$document->loadXml($xml);
$xpath = new DOMXPath($document);
$expression = '//channel/item/link';
foreach ($xpath->evaluate($expression) as $link) {
    var_dump($link->textContent);
}
Atom Links
The atom:link elements have different semantics: they are part of the Atom namespace and are used to describe relations. The NYT uses the standout relation to mark featured stories. To fetch the Atom links you need to register a prefix for the namespace. Attributes are nodes, too, so you can fetch them directly:
$xml = file_get_contents($url);

$document = new DOMDocument();
$document->loadXml($xml);
$xpath = new DOMXPath($document);
$xpath->registerNamespace('a', 'http://www.w3.org/2005/Atom');

$expression = '//channel/item/a:link[@rel="standout"]/@href';
foreach ($xpath->evaluate($expression) as $link) {
    var_dump($link->value);
}
There are other relations like prev and next.
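If you want to see which relations a feed actually uses, a small variation of the snippet above (same setup, same assumptions about the feed structure) can dump every atom:link together with its rel attribute:
$xml = file_get_contents($url);

$document = new DOMDocument();
$document->loadXml($xml);
$xpath = new DOMXPath($document);
$xpath->registerNamespace('a', 'http://www.w3.org/2005/Atom');

// list every Atom link with its relation (standout, prev, next, ...)
foreach ($xpath->evaluate('//channel/item/a:link') as $link) {
    var_dump($link->getAttribute('rel'), $link->getAttribute('href'));
}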
HTML Links (a elements)
The description elements contain HTML fragments. To extract the links from them you have to load the HTML into a separate DOM document.
$xml = file_get_contents($url);

$document = new DOMDocument();
$document->loadXml($xml);
$xpath = new DOMXPath($document);
$xpath->registerNamespace('a', 'http://www.w3.org/2005/Atom');

$expression = '//channel/item/description';
foreach ($xpath->evaluate($expression) as $description) {
    $fragment = new DOMDocument();
    $fragment->loadHtml($description->textContent);
    $fragmentXpath = new DOMXpath($fragment);
    foreach ($fragmentXpath->evaluate('//a[@href]/@href') as $link) {
        var_dump($link->value);
    }
}

PHP DOMDocument: remove some HTML tags but keep inner tags and text

I need to remove some tags (e.g. <div></div>) in an HTML document and keep the inner tags and text.
I managed to do that with Simple HTML DOM Parser, but it can't process big files due to huge memory requirements.
I would prefer to use native PHP tools like DOMDocument, because I read that it is more optimized and quicker at processing HTML documents.
But I am stuck at the first stage: how to remove some tags while keeping the inner text and tags.
Source HTML sample is:
<html><body><div>00000</div>aaaaa<div>bbbbbb<div>ccc<a>link</a>ccc</div>dddddd</div>eeeee<div>1111</div></body></html>
I tried this code:
$htmltext = '<html><body><div>00000</div>aaaaa<div>bbbbbb<div>ccc<a>link</a>ccc</div>dddddd</div>eeeee<div>1111</div></body></html>';

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($htmltext);

$oldnodes = $doc->getElementsByTagName('div');
foreach ($oldnodes as $node) {
    $fragment = $doc->createDocumentFragment();
    while ($node->childNodes->length > 0) {
        $fragment->appendChild($node->childNodes->item(0));
    }
    $node->parentNode->replaceChild($fragment, $node);
}

echo $doc->saveHTML();
It produces the output:
<html><body>00000aaaaa<div>bbbbbbccc<a>link</a>cccdddddd</div>eeeee<div>1111</div></body></html>
I need the following:
<html><body>00000aaaaabbbbbbccc<a>link</a>cccddddddeeeee1111</body></html>
Could someone please help me with proper code for the task?
You can use the strip_tags() function in PHP.
$htmltext = '<html><body><div>00000</div>aaaaa<div>bbbbbb<div>ccc<a>link</a>ccc</div>dddddd</div>eeeee<div>1111</div></body></html>';
echo strip_tags($htmltext, '<html><body><a>');
This removes all tags except html, body and a, and the output is:
<html><body>00000aaaaabbbbbbccc<a>link</a>cccddddddeeeee1111</body></html>
EDIT:
If the input comes from a user, it is better for security reasons to whitelist allowed tags rather than blacklist forbidden ones.
If your code only contains simple HTML tags without any attributes, you can keep it simple, like this:
$value = '<html><body><div>00000</div>aaaaa<div>bbbbbb<div>ccc<a>link</a>ccc</div>dddddd</div>eeeee<div>1111</div></body></html>';
$pattern = '/<[\/]*(div|h1)>/';
$removedTags = preg_replace($pattern, '', $value);
Since you wrote in your comment that there are more tags than just div that you want to remove, I added an h1 tag to the pattern in case you also want to remove h1 tags.
This code snippet only works for simple markup, but it fits your HTML input and output example.
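For example, running it against the sample input from the question (assuming only div tags need to be removed here) should produce the desired output:
$value = '<html><body><div>00000</div>aaaaa<div>bbbbbb<div>ccc<a>link</a>ccc</div>dddddd</div>eeeee<div>1111</div></body></html>';
// strip every <div> and </div> while leaving other tags and text alone
echo preg_replace('/<[\/]*(div|h1)>/', '', $value);
// <html><body>00000aaaaabbbbbbccc<a>link</a>cccddddddeeeee1111</body></html>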
Try this: just replace the foreach loop with the code below.
foreach ($oldnodes as $node) {
    $children = $node->childNodes;
    $string = "";
    foreach ($children as $child) {
        $childString = $doc->saveXML($child);
        $string .= $childString;
    }
    $fragment = $doc->createDocumentFragment();
    $fragment->appendXML($string);
    $node->parentNode->insertBefore($fragment, $node);
    $node->parentNode->removeChild($node);
}
I found a way to make it work.
The reason the code in the question does not work is that manipulating the nodes ruins the node list they came from: the foreach loop only went through 2 of the 4 items in the list, and the remaining 2 were left untouched.
So I had to deal with only the first element of the list each time and then rebuild the list, repeating while there are still items left in it.
The code is:
$htmltext = '<html><body><div>00000</div>aaaaa<div>bbbbbb<div>ccc<a>link</a>ccc</div>dddddd</div>eeeee<div>1111</div></body></html>';
echo "<!--\n" . $htmltext . "\n-->\n";

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($htmltext);

$oldnodes = $doc->getElementsByTagName('div');
while ($oldnodes->length > 0) {
    $node = $oldnodes->item(0);
    $fragment = $doc->createDocumentFragment();
    while ($node->childNodes->length > 0) {
        $fragment->appendChild($node->childNodes->item(0));
    }
    $node->parentNode->replaceChild($fragment, $node);
    $oldnodes = $doc->getElementsByTagName('div');
}

echo $doc->saveHTML();
I hope this will be helpful for anyone who runs into the same difficulties.
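For reference, another way around the live node list problem (a sketch along the same lines, not one of the original answers) is to copy the list into a plain PHP array before modifying the tree, so the iteration is not disturbed:
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($htmltext); // same $htmltext as above

// snapshot the live NodeList into a plain array, then it is safe to modify the tree
$divs = iterator_to_array($doc->getElementsByTagName('div'));
foreach ($divs as $node) {
    $fragment = $doc->createDocumentFragment();
    while ($node->childNodes->length > 0) {
        $fragment->appendChild($node->childNodes->item(0));
    }
    $node->parentNode->replaceChild($fragment, $node);
}
echo $doc->saveHTML();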

How to remove consecutive links from a webpage?

I wish to remove consecutive links on a webpage
Here is a sample
<div style="font-family: Arial;">
<br>
<a>AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA</a>
<a>BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB</a>
Google is a search <a>engine</a>
</div>
In the above HTML I want to remove the first two <a> tags but not the third one (my script should only remove consecutive tags).
Don't use a regex for this. Regular expressions are extremely powerful, but not well suited to finding this kind of "consecutive" tag.
I suggest you use DOM. Then you can browse the HTML as a tree.
Here is an example (not tested):
$doc = new DOMDocument();
// avoid blank nodes when parsing
$doc->preserveWhiteSpace = false;
// reads HTML from a string, loadHTMLFile() also exists
$doc->loadHTML($html);

// find all "a" tags
$links = $doc->getElementsByTagName('a');

// remove the first link
$parent = $links->item(0)->parentNode;
$parent->removeChild($links->item(0));

// test the node following the second link
if ($links->item(1)->nextSibling->nodeType != XML_TEXT_NODE) {
    // delete this node ...
}

// print the modified HTML
// See DOMDocument's attributes if you want to format the output
echo $doc->saveHTML();
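Building on that idea, here is one possible (untested) sketch of my own that removes both links of every consecutive pair, treating links as consecutive when nothing but whitespace separates them:
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->preserveWhiteSpace = false;
$doc->loadHTML($html); // $html holds the markup from the question

$toRemove = array();
foreach ($doc->getElementsByTagName('a') as $link) {
    // find the next sibling that is not a whitespace-only text node
    $next = $link->nextSibling;
    while ($next !== null
        && $next->nodeType === XML_TEXT_NODE
        && trim($next->nodeValue) === '') {
        $next = $next->nextSibling;
    }
    // if the next "real" sibling is also a link, schedule both for removal
    if ($next !== null && $next->nodeName === 'a') {
        $toRemove[spl_object_hash($link)] = $link;
        $toRemove[spl_object_hash($next)] = $next;
    }
}
// remove after iterating, so the live node list is not disturbed
foreach ($toRemove as $node) {
    $node->parentNode->removeChild($node);
}
echo $doc->saveHTML();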

How to get the html inside a $node rather than just the $nodeValue [duplicate]

Description of the current situation:
I have a folder full of pages (pages-folder), each page inside that folder has (among other things) a div with id="short-info".
I have code that pulls all the <div id="short-info">...</div> elements from that folder and displays the text inside them by using textContent (which is, for this purpose, the same as nodeValue).
The code that loads the divs:
<?php
$filename = glob("pages-folder/*.php");
sort($filename);
foreach ($filename as $filenamein) {
    $doc = new DOMDocument();
    $doc->loadHTMLFile($filenamein);
    $xpath = new DOMXpath($doc);
    $elements = $xpath->query("*//div[@id='short-info']");
    foreach ($elements as $element) {
        $nodes = $element->childNodes;
        foreach ($nodes as $node) {
            echo $node->textContent;
        }
    }
}
?>
Now the problem is that if the page I am loading has a child, like an image: <div id="short-info"> <img src="picture.jpg"> Hello world </div>, the output will only be Hello world rather than the image and then Hello world.
Question:
How do I make the code display the full html inside the div id="short-info" including for instance that image rather than just the text?
You have to make an undocumented call on the node:
$node->c14n()
will give you the HTML contained in $node. Crazy, right? I lost some hair over that one.
http://php.net/manual/en/class.domnode.php#88441
Update
c14n() will modify the markup to conform to canonical XML. It is better to use
$html = $node->ownerDocument->saveHTML($node);
instead.
You'd want what amounts to innerHTML, which PHP's DOM doesn't directly support. One workaround for it is here in the PHP docs.
Another option is to take the $node you've found, insert it as the top-level element of a new DOM document, and then call saveHTML() on that new document.
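Using the saveHTML($node) form mentioned above, the inner loop from the question could be adapted like this (a sketch, assuming the same $elements XPath result as in the question):
foreach ($elements as $element) {
    $innerHtml = '';
    foreach ($element->childNodes as $child) {
        // serialize each child node (elements, text, images, ...) as HTML
        $innerHtml .= $element->ownerDocument->saveHTML($child);
    }
    echo $innerHtml;
}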

Get all image URLs from a string [duplicate]

Possible Duplicate:
How to extract img src, title and alt from html using php?
Hi,
I have found a solution to get the first image from a string:
preg_match('~<img[^>]*src\s?=\s?[\'"]([^\'"]*)~i', $string, $matches);
But I can't manage to get all the images from the string.
One more thing: if an image contains alternative text (an alt attribute), how do I get that too and save it to another variable?
Thanks in advance,
Ilija
Don't do this with regular expressions. Instead, parse the HTML. Take a look at Parse HTML With PHP And DOM. This is a standard feature in PHP 5.2.x (and probably earlier). Basically the logic for getting images is roughly:
$dom = new domDocument;
$dom->preserveWhiteSpace = false;
$dom->loadHTML($html);
$images = $dom->getElementsByTagName('img');
foreach ($images as $image) {
    echo $image->getAttribute('src');
}
This should be trivial to adapt to other tags and attributes (such as alt).
This is what I tried, but I can't get it to print the value of src:
$dom = new domDocument;
/*** load the html into the object ***/
$dom->loadHTML($html);
/*** discard white space ***/
$dom->preserveWhiteSpace = false;
/*** get the img elements by tag name ***/
$images = $dom->getElementsByTagName('img');
/*** loop over the images ***/
foreach ($images as $img)
{
    /*** try to get the src (this is the part that fails) ***/
    $url = $img->getElementsByTagName('src');
    /*** echo the values ***/
    echo $url->nodeValue;
    echo '<hr />';
}
EDIT: I solved the problem:
$dom = new domDocument;
/*** load the html into the object ***/
$dom->loadHTML($string);
/*** discard white space ***/
$dom->preserveWhiteSpace = false;
$images = $dom->getElementsByTagName('img');
foreach ($images as $img)
{
    $url = $img->getAttribute('src');
    $alt = $img->getAttribute('alt');
    echo "Title: $alt<br>$url<br>";
}
Note that regular expressions are a bad approach for parsing anything that involves matching nested tags.
You'd be better off using the DOMDocument class.
You assume that you can parse HTML using regular expressions. That may work for some sites, but not all. Since you are limiting yourself to only a subset of all web pages, it would be interesting to know how you limit yourself; maybe you can parse that HTML in a fairly easy way from PHP.
Look at preg_match_all() to get all matches.
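For completeness, here is a rough illustration of the preg_match_all() route, reusing the pattern from the question (the DOM version above is still the more robust choice):
// collect every src value matched by the question's pattern
preg_match_all('~<img[^>]*src\s?=\s?[\'"]([^\'"]*)~i', $string, $matches);
var_dump($matches[1]); // all src values found in $string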
