I have searched through some other threads here but have not found a solution that fits.
Suppose I have the following layout:
<html>
<div> <---- This one
<div> text </div>
<div> text </div>
</div>
<p> text </p>
<div> <---- This one
<div> text </div>
<div> text </div>
</div>
<p> text </p>
</html>
How would I go about getting only the divs on the top level? (Note: the two inner divs are only an example; there could be just one, or five or six.)
Note: The rest of the code that this ties into uses Simple HTML DOM, so I need this to work with that.
One possibility is to use XPath. It would look something like the following:
<?php
$doc = new DOMDocument();
// one of the few cases where you may use error suppression
@$doc->loadHTML($yourHtml);
$xPath = new DOMXPath($doc);
// loadHTML() wraps the markup in an implied <body>, so the top-level divs sit under /html/body
$nodes = $xPath->query('/html/body/div');
Disclaimer: I haven't tested this, but it should at least get you close
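If it helps, here is a minimal sketch of a variation on the XPath idea, run against a shortened copy of the layout from the question. Instead of spelling out the path from the root, it selects every div that has no div ancestor, so it does not matter whether the parser adds an implied body element. The $html string and the variable names are made up for illustration, and it uses the built-in DOM extension rather than Simple HTML DOM.

```php
<?php
// Hypothetical sample markup mirroring the layout in the question.
$html = <<<'HTML'
<html>
<div><div>text</div><div>text</div></div>
<p>text</p>
<div><div>text</div></div>
<p>text</p>
</html>
HTML;

$doc = new DOMDocument();
// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
// A div is "top level" here if none of its ancestors is a div.
$topLevelDivs = $xpath->query('//div[not(ancestor::div)]');

foreach ($topLevelDivs as $div) {
    echo $doc->saveHTML($div), "\n";
}
```

The nested divs are skipped however many of them there are, because each of them has a div ancestor.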
I was trying to produce the markup below.
The PHP code I wrote:
<?php $i = 1; ?>
<div class="sampleNumberIndex">
<p>
<h2> <?php echo $i; ?></h2>
</p>
</div>
The output I expected:
<div class="sampleNumberIndex">
<p>
<h2> 1 </h2>
</p>
</div>
But this is what the PHP code produced:
<div class="sampleNumberIndex">
<p> </p>
<h2> 1 </h2>
<p> </p>
</div>
I'm using Chrome. What made it come out like that?
That has nothing to do with PHP. HTML simply does not allow h2 tags inside p tags. You should use divs instead, regardless of what you are trying to accomplish here.
You are probably looking at the Inspect Element panel, which shows the DOM after the browser has repaired the invalid HTML. The real source is available under View page source; try Ctrl + U.
It's not very clear from your question, but I suppose you're seeing the unexpected tags in the browser's DOM inspector. "Show Page Source" should show you the actual raw HTML your browser received.
Edit: what @AliN11 said
HTML infers tags when missing in content. In your case, the HTML browser knows that h2 can't appear in the content of a p element, hence it adds the </p> end-element tag before h2. Then after the h2 element, it encounters the </p> end-element tag, and inserts a <p> start-element tag before it, because none is open at the context position.
The first insertion - that for the </p> end-element tag - is part of the regular parsing rules for HTML; omitting </p> is allowed according to the HTML specification. But the second insertion - that for the <p> start-element tag - is not, and is an effect of HTML recovery (repair) kicking in.
I've explained HTML/SGML tag insertion in detail on the project page of my SGML software and the linked slides of a talk I gave about it last year.
First of all, I know that there are many questions leading towards including HTML.
The thing is, when I include one HTML (1) file into another (2), using <?php include("1.html") ?>, and both files consist of something like this:
<html>
<body>
<div id="specific div">
<span id="span1">1</span>
</div>
</body>
</html>
Having two different spans in the same specific div - once I include one file into the other one, it would look like this:
<html>
<body>
<div id="specific div">
<span id="span1">1</span>
</div>
<div id="specific div">
<span id="span2">2</span>
</div>
</body>
</html>
while I want the contents of the specific div merged into one of them, instead of ending up with two divs with the same id:
<html>
<body>
<div id="specific div">
<span id="span1">1</span>
<span id="span2">2</span>
</div>
</body>
</html>
How do I achieve that?
EDIT: I found a different and less complicated solution for my specific situation. Therefore I can't really select the correct answer now, so I might select one if it gets enough upvotes.
You could use PHP's DOMDocument::loadHTMLFile() method. With it you can load both of your files and merge them however you like.
If your file looks like you said, something like this:
<html>
<body>
<div id="specific div1">
<span id="span">bla bla bla</span>
</div>
</body>
</html>
You can use the DomDocument:
$dom1 = new DOMDocument();
$dom1->loadHTMLFile("file_1.html", LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$dom2 = new DOMDocument();
$dom2->loadHTMLFile("file_2.html", LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$target = $dom1->getElementById("specific div1");
// Nodes belong to the document that created them, so import each child into $dom1 before appending it.
foreach ($dom2->getElementById("specific div2")->childNodes as $child) {
    $target->appendChild($dom1->importNode($child, true));
}
$merged_html = $dom1->saveHTML();
So this would merge the contents of div[id="specific div2"] into div[id="specific div1"].
DOMDocument also supports XPath, if you prefer that to walking the nodes manually or selecting by id.
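For what it's worth, here is a rough sketch of the XPath route. To keep it self-contained it loads two inline strings instead of files. The id "specific-div", the helper loadFragment() and the variable names are placeholders of mine, not anything from the question.

```php
<?php
// Hypothetical inline stand-ins for the two files.
$html1 = '<div id="specific-div"><span id="span1">1</span></div>';
$html2 = '<div id="specific-div"><span id="span2">2</span></div>';

// Illustrative helper: parse a fragment without the implied <html>/<body> wrapper or a doctype.
function loadFragment(string $html): DOMDocument
{
    $dom = new DOMDocument();
    $dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
    return $dom;
}

$dom1 = loadFragment($html1);
$dom2 = loadFragment($html2);

// Locate the div with the shared id in each document.
$query  = '//div[@id="specific-div"]';
$target = (new DOMXPath($dom1))->query($query)->item(0);
$source = (new DOMXPath($dom2))->query($query)->item(0);

// Copy every child of the second div into the first one.
// importNode() is required because the nodes come from a different document.
foreach ($source->childNodes as $child) {
    $target->appendChild($dom1->importNode($child, true));
}

$merged = trim($dom1->saveHTML());
echo $merged, "\n";
```

importNode() copies the nodes rather than moving them, so the second document is left untouched and the loop can iterate over its children safely.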
If you want to include one HTML file in another, you can either use an iframe or convert the files to PHP and use include.
For example, to include index2.html in index1.html, you can put the iframe code below in index1.html:
index1.html
<iframe src="index2.html"></iframe>
Or you can convert your files to PHP and then simply include one file in the other, like below:
index1.php
<?php include('index2.php'); ?>
You can also use a Server Side Include, like below, to include one HTML file in another:
index1.html
<html>
<body>
<!--#include virtual="/index2.html" -->
</body>
</html>
I'm scraping (using PHP Simple HTML DOM) a number of different (news) sites with the aim of getting the main content/body of text on the page.
The best way I could figure out to do this was to find the main header/headline (H1) and get the text contained within the same div as this header tag.
How would I go about getting the contents of the whole (parent?) div in each of the examples below?
<div> <----- need to get contents of this whole div (containing the h1 and likely the main body of text)
<h1></h1>
main body of text here
</div>
The div may be further up the tree.
<div> <----- need to get contents of this whole div
<div>
<h1></h1>
</div>
<div>
main body of text here
</div>
</div>
The div may be even further up the tree.
<div> <----- need to get contents of this whole div
<div>
<div>
<h1></h1>
</div>
<div>
main body of text here
</div>
</div>
</div>
Then I could compare the size of each and determine the main div.
You can use parent to get the parent element of the h1:
# assuming that the <h1> element is the first <h1> on the page:
$div = $html->find('h1', 0)->parent();
Assuming $e contains the H1 element that you selected, you can call $e->parent() to grab the parent element.
Look under "How to traverse the DOM tree?" on the "Traverse the DOM tree" tab. http://simplehtmldom.sourceforge.net/manual.htm
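To cover the cases where the div is further up the tree, you can keep walking up from the h1 and record every div ancestor together with the size of its text, then pick among them. Below is a minimal sketch of that idea. Note that it uses PHP's built-in DOMDocument rather than Simple HTML DOM (with Simple HTML DOM you would call parent() repeatedly in the same way), and the sample markup and variable names are made up.

```php
<?php
// Hypothetical sample: the headline sits two divs below the outermost container.
$html = '<html><body><div id="outer"><div><div><h1>Headline</h1></div>'
      . '<div>main body of text here</div></div></div></body></html>';

$dom = new DOMDocument();
// Real-world pages are rarely valid, so collect parser warnings quietly.
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();

// Assumes the first <h1> on the page is the main headline.
$h1 = $dom->getElementsByTagName('h1')->item(0);

// Walk up from the headline and record every <div> ancestor, nearest first.
$candidates = [];
for ($node = $h1->parentNode; $node instanceof DOMElement; $node = $node->parentNode) {
    if ($node->nodeName === 'div') {
        $candidates[] = [
            'node'   => $node,
            'length' => strlen(trim($node->textContent)),
        ];
    }
}

foreach ($candidates as $candidate) {
    echo $candidate['length'], "\n";
}
```

How you then choose the "main" div is a judgment call. One option is to take the nearest ancestor whose text is much longer than the headline itself.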
I'm just getting to grips with QueryPath after using HTML Simple Dom for quite some time and am finding that the QP documentation doesn't seem to offer much in the way of examples for all of its functions.
At the moment I'm trying to retrieve some text from an HTML doc that doesn't make much use of IDs or classes, so I'm a little outside of my comfort zone.
Here's the HTML:
<div class="blue-box">
<div class="top">
<h2><img src="pic.gif" alt="Advertise"></h2>
<p>Some uninteresting stuff</p>
<p>More stuff</p>
</div>
</div>
<div class="blue-box">
<div class="top">
<h2><img src="pic2.gif" alt="Location"></h2>
**I NEED THIS TEXT**
<div style="margin:stuff">
<img src="img3.gif">
</div>
</div>
</div>
I was thinking about selecting the class 'blue-box' as the starting point and then descending from there. The issue is that there could be any number of blue-box classes in the HTML doc.
Therefore I was thinking that maybe I should try to select the image with alt="Location" and then use ->next()->text() or something along those lines?
I've tried about 15 variations so far and none are getting the text I need.
Assistance most appreciated!
Can you have a look at this example: http://jsfiddle.net/Pedro3M/mujtk/
I did as you suggested and used the alt attribute; this works provided it is always unique.
$("img[alt='Location']").parent().parent().text();
How about:
$doc->find('div.top:has(img[alt="Location"])')->text();
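If you only want the loose text and not the text of the whole div, the same idea can be narrowed down with XPath. The sketch below uses PHP's built-in DOMDocument and DOMXPath instead of QueryPath, so treat it as an illustration of the selector logic rather than QueryPath code. The variable names are mine.

```php
<?php
// The markup from the question, as a nowdoc string.
$html = <<<'HTML'
<div class="blue-box">
<div class="top">
<h2><img src="pic.gif" alt="Advertise"></h2>
<p>Some uninteresting stuff</p>
<p>More stuff</p>
</div>
</div>
<div class="blue-box">
<div class="top">
<h2><img src="pic2.gif" alt="Location"></h2>
**I NEED THIS TEXT**
<div style="margin:stuff">
<img src="img3.gif">
</div>
</div>
</div>
HTML;

$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
// Start at the image, go up to its <h2>, then take the first
// non-blank text node that follows that <h2> at the same level.
$nodes = $xpath->query(
    '//img[@alt="Location"]/ancestor::h2/following-sibling::text()[normalize-space()][1]'
);

$text = $nodes->length > 0 ? trim($nodes->item(0)->nodeValue) : null;
echo $text, "\n";
```

This relies on the alt text being unique on the page, as discussed above.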
Within my HTML, I have a php script that includes a file. At that point, the code is indented 2 tabs. What I would like to do is make the php script add two tabs to each line. Here's an example:
Main page:
<body>
    <div>
        <?php include("test.inc"); ?>
    </div>
</body>
And "test.inc":
<p>This is a test</p>
<div>
    <p>This is a nested test</p>
    <div>
        <p>This is an more nested test</p>
    </div>
</div>
What I get:
<body>
    <div>
        <p>This is a test</p>
<div>
    <p>This is a nested test</p>
    <div>
        <p>This is an more nested test</p>
    </div>
</div>
    </div>
</body>
What I want:
<body>
    <div>
        <p>This is a test</p>
        <div>
            <p>This is a nested test</p>
            <div>
                <p>This is an more nested test</p>
            </div>
        </div>
    </div>
</body>
I realise I could just add leading tabs to the include file. However, VS keeps removing those when I format the document.
In your test.inc file, you can use output buffering to capture all the output of the PHP script before it is sent to the browser. You can then post-process it to add the tabs you want, and send it on. At the top of the file, add
<?php
ob_start();
?>
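At the end, add
<?php
$result = ob_get_contents();
ob_end_clean();
// str_replace() takes (search, replace, subject); the first line is already indented by the include tag itself
print str_replace("\n", "\n\t\t", $result);
?>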
I don't necessarily subscribe to this solution - it can be memory intensive, depending on your output, and will prevent your include file from sending partial results to the client as it works. You might be better off reformatting the output, or using some form of custom "print" wrapper that tabs things (and use printing of heredocs for constant HTML output).
Edit: Use str_replace, as suggested by comment
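Here is a small self-contained sketch of the buffering idea, in case it is easier to follow in one piece. The function name indentBlock() and the echoed sample markup are made up for illustration; in the real page the echo would be replaced by the include. Unlike the plain str_replace() call, it indents every non-empty line, including the first, so the include tag itself would have to start at column 0.

```php
<?php
// Illustrative helper: prefix every non-empty line of $text with $indent.
function indentBlock(string $text, string $indent = "\t\t"): string
{
    $lines = explode("\n", rtrim($text, "\n"));
    foreach ($lines as &$line) {
        if ($line !== '') {
            $line = $indent . $line;
        }
    }
    unset($line); // break the reference left over from the loop

    return implode("\n", $lines) . "\n";
}

// Capture the output that the included file would normally send.
ob_start();
echo "<p>This is a test</p>\n<div>\n\t<p>This is a nested test</p>\n</div>\n";
$indented = indentBlock(ob_get_clean());

echo $indented;
```

The same memory caveat applies: the whole block is held in a string before anything is sent to the client.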
I don't think your solution can be done easily. You might consider using HTML Tidy to clean your source code before presenting it to a client. There are good tutorials for it on the internet.
The easiest solution is to add leading tabs to the include file, but instead of using literal tabs, use the \t escape sequence.