I am working on a project involving tens of thousands of files that I downloaded from the internet. The source of the pages (MO government) didn't program the pages too well.
I am pulling certain elements from the pages and putting them into another page so they are easier to reference on my website. Here is a working example:
<div id="intsect">
<strong>Common law in force--effect on statutes.</strong>
</div>
// PHP CODE
// Load Document
$doc = new DOMDocument();
// Load File Values
@$doc->loadHTMLFile("stathtml/" . $file);
// Load the <div id="intsect"></div> value
$genAssem = $doc->getElementById("intsect");
// Appropriate value
$genAssem = " <b>Statute Name: </b>" . $genAssem->textContent . "<br><br>";
# VALUE (example)
Statute Name: Common law in force--effect on statutes.
Here is the part that is killing me:
<div id="intsect">
<strong>Common law in force--effect on statutes.</strong>
</div>
<!-- THIS PART -->
<p> 1.035. Whenever the word "voter" is used in the laws of this state it shall mean registered voter, or legal voter.
The programmers didn't give it an ID or a Class. I need to extract the paragraph tag that follows #intsect. Is there a PHP selector that can select the <p></p> tags after the #intsect one?
You can use xpath to target that <p> tag which has a preceding sibling of div that has an ID of intsect:
$doc = new DOMDocument;
@$doc->loadHTMLFile("stathtml/" . $file);
$xpath = new DOMXpath($doc);
$p = $xpath->query('//p[preceding-sibling::div[@id="intsect"]]');
if($p->length > 0) {
echo $p->item(0)->nodeValue;
}
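For a self-contained check, here is a minimal sketch of a variant that starts from the div and walks forward instead. It assumes the markup from the question is pasted in as a string (the real files would be loaded with loadHTMLFile), and it uses libxml_use_internal_errors() in place of the @ operator to keep the parser warnings quiet.

```php
<?php
// Sample markup copied from the question; the <p> is left unclosed as in the source.
$html = '<div id="intsect"><strong>Common law in force--effect on statutes.</strong></div>'
      . '<p> 1.035. Whenever the word "voter" is used in the laws of this state it shall mean registered voter, or legal voter.';

libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
// [1] keeps only the nearest <p> that follows the div as a sibling
$p = $xpath->query('//div[@id="intsect"]/following-sibling::p[1]');
$text = $p->length > 0 ? trim($p->item(0)->textContent) : '';
echo $text, "\n";
```

The [1] predicate matters if the page has several paragraphs after the div and only the first one is wanted.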
Using PHP, I want to retrieve a specific element in an external website.
The external website is https://mcnmedia.tv/iframe/2684 The specific element I want to retrieve is the first link in the 'Recordings' tab.
For example, the first link contains the following html;
<div class="small-12 medium-6 me column recording-item">
<div class="recording-item-inner">
<a class="small-12 column recording-name" href="/recordings/2435">
<div class="info">
<b>Mass</b><br>
<small>26 Mar 2020</small>
</div><i class="fa fa-play"></i></a>
</div>
</div>
I want to retrieve the href and display a direct link on my website like;
View Latest Recording - https://mcnmedia.tv/recordings/2435.
I have the following PHP, but it isn't working as I'd like. Currently it outputs the text only (Mass 26 Mar 2020), and I'm not sure how to get the actual href link address.
<?php
$page = file_get_contents('https://mcnmedia.tv/iframe/2684');
$doc = new DOMDocument();
@$doc->loadHTML($page);
$xpath = new DomXPath($doc);
$nodeList = $xpath->query("//div[@class='recording-item-inner']");
$node = $nodeList->item(0);
// To check the result:
echo "<p>" . $node->nodeValue . "</p>";
?>
How can I achieve this?
You aren't going quite far enough with your XPath to fetch the href. You can add /a/@href to select the href attribute of the <a> tag:
$nodeList = $xpath->evaluate("//div[@class='recording-item-inner']/a/@href");
You can simplify this further: use evaluate() to fetch a specific value, and wrap the XPath in string() so it returns the attribute as a string instead of a node:
$href = $xpath->evaluate("string(//div[@class='recording-item-inner']/a/@href)");
echo "<p>" . $href . "</p>";
I am attempting to get various elements inside of an li as shown below. I am pretty new to this so I may not be using the most efficient methods but this is where I have started...
EXAMPLE CODE SIMPLIFIED....
<li id='entry_0' title='09879879'>
<div ....>
<h2> The title text would go here </h2>
<span class='entrySize' ....> 20oz </span>
<span class='entryPrice' ....> $32.09 </span>
<span class='anotherEntry' ....> More Data I need To Grab </span>
.......
</div>
</li>
<li> .... With same structure as above .... 100's of entries like this </li>
I know how to pull individual parts separately but having trouble grasping how to do it grouped within a portion of the html.
$filename = "directory/file.html";
$html = file_get_html($filename);
for($i=0; $i<=count(entryNumber);$i++)
{
$li_id = "entry_".$i;
foreach($html->find('li[id='.$li_id.']') as $li) {
echo $li->innertext;
}
}
So this gets me the content of the list item tag, with the id number as the unique attribute. I would like to grab the h2 text, entrySize, entryPrice etc. as I iterate through the list item tags. What I don't understand is how, once I have the list item content, I can parse its inner tags and attributes. There may be other parts of the full HTML document that have tags with the same id or class as these, so I am breaking this down into portions and then looking to parse one section at a time.
I would also like to pull the title attribute out of the li tag.
I hope my explanation makes sense.
You should probably use a DOM parser. PHP comes bundled with one, and there are many others you could use.
http://php.net/dom
PHP Simple HTML DOM Parser
<?php
$html = file_get_contents($page);
$doc = new DOMDocument();
$doc->loadHTML($html);
// now find what you need
$items = $doc->getElementsByTagName('li');
foreach ($items as $item) {
$id = $item->getAttribute('id');
if (strpos($id, 'entry_') !== false) {
// found matching li, grab its children
}
}
Use this as a baseline, we can't write all the code for you. Check out the PHP docs to finish this :) From what I have so far, you need to follow the docs to make it grab the child values, and handle them.
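To give an idea of the "grab its children" step, here is a rough sketch using DOMXPath with each li as the context node. The sample markup is a hypothetical stand-in modelled on the simplified example in the question, and the array keys are names I made up for illustration.

```php
<?php
// Hypothetical sample markup modelled on the question's simplified example.
$html = <<<'HTML'
<ul>
<li id="entry_0" title="09879879">
<div>
<h2> The title text would go here </h2>
<span class="entrySize"> 20oz </span>
<span class="entryPrice"> $32.09 </span>
</div>
</li>
</ul>
HTML;

libxml_use_internal_errors(true);   // keep parser warnings out of the output
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($doc);

$entries = [];
foreach ($xpath->query('//li[starts-with(@id, "entry_")]') as $li) {
    // Passing $li as the context node keeps each lookup inside this <li>.
    $entries[] = [
        'id'    => $li->getAttribute('id'),
        'title' => $li->getAttribute('title'),
        'name'  => trim($xpath->evaluate('string(.//h2)', $li)),
        'size'  => trim($xpath->evaluate('string(.//span[@class="entrySize"])', $li)),
        'price' => trim($xpath->evaluate('string(.//span[@class="entryPrice"])', $li)),
    ];
}
print_r($entries);
```

Because every inner query starts with a dot and is given the li as context, elements elsewhere in the document that share the same class names are not picked up.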
I have many articles, divided into sections, stored in a database. Each section consists of a section tag, followed by a header (h2) and a primary div. Some also have subheaders (h3). The raw display looks something like this:
<section id="ecology">
<h2 class="Article">Ecology</h2>
<div class="Article">
<h3 class="Article">Animals</h3>
I'm using the following DOM script to add some classes, IDs and glyphicons:
$i = 1; // initialize counter
// initialize DOMDocument
$dom = new DOMDocument;
@$dom->loadHTML($Content); // load the markup
$sections = $dom->getElementsByTagName('section'); // get all section tags
if($sections->length > 0) { // if there are indeed section tags inside
// work on each section
foreach($sections as $section) { // for each section tag
$section->setAttribute('data-target', '#b' . $i); // point the section at its main div
// get the h2 inside each section
foreach($section->getElementsByTagName('h2') as $h2) {
if($h2->getAttribute('class') == 'Article') { // if this h2 has class Article
$h2->setAttribute('id', 'a' . $i); // set id for h2 tag
}
}
foreach($section->getElementsByTagName('div') as $div) {
if($div->getAttribute('class') == 'Article') { // if this div has class Article
$div->setAttribute('id', 'b' . $i); // set id for div tag
}
}
$i++; // increment counter
}
}
// back to string again, get all contents inside body
$Content = '';
foreach($dom->getElementsByTagName('body')->item(0)->childNodes as $child) {
$Content .= $dom->saveHTML($child); // convert to string and append to the container
}
I'd like to modify the above code so that it places certain examples of "inner text" between tags.
For example, consider these headings:
<h3 class="Article">Animals</h3>
<h3 class="Article">Plants</h3>
I would like the DOM to change them to this:
<h3 class="Article"><span class="label label-default">Animals</span></h3>
<h3 class="Article"><span class="label label-default">Plants</span></h3>
I want to do something similar with the h2 tags. I don't yet know the DOM terminology well enough to search for good tutorials - not to mention confusion with DOM programs and jQuery. ;)
I think these are the basic functions I need to focus on, but I don't know how to plug them in:
$text = $data->textContent;
elementNode.textContent=string
Two Notes: 1) I understand I can do this with jQuery (perhaps a lot easier), but I think PHP might be better, as they say some users can have JavaScript disabled. 2) I'm using the class "Article" largely to distinguish elements I want to be styled by PHP DOM. A header with a different class, or no class at all, should not be affected by the DOM script.
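One possible way to do the wrapping, as a rough sketch rather than a drop-in answer: create the span with createElement(), move the heading's existing children into it, then append the span to the heading. The sample string below is a hypothetical fragment modelled on the markup above; the "Untouched" heading is only there to show that headings without the Article class are skipped.

```php
<?php
// Hypothetical fragment modelled on the question's markup.
$Content = '<section id="ecology"><h2 class="Article">Ecology</h2>'
         . '<div class="Article"><h3 class="Article">Animals</h3>'
         . '<h3>Untouched</h3></div></section>';

libxml_use_internal_errors(true);   // <section> triggers warnings in the HTML parser
$dom = new DOMDocument();
$dom->loadHTML($Content);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
// Only h3 elements whose class attribute is exactly "Article" are touched.
foreach ($xpath->query('//h3[@class="Article"]') as $h3) {
    $span = $dom->createElement('span');
    $span->setAttribute('class', 'label label-default');
    // Move every existing child (text or markup) into the new span.
    while ($h3->firstChild) {
        $span->appendChild($h3->firstChild);
    }
    $h3->appendChild($span);
}

// Back to a string, same approach as the script above.
$out = '';
foreach ($dom->getElementsByTagName('body')->item(0)->childNodes as $child) {
    $out .= $dom->saveHTML($child);
}
echo $out, "\n";
```

The same loop should work for the h2 tags by changing the XPath to //h2[@class="Article"].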
I know this topic has been posted everywhere, but the existing questions are not what I want. I want to insert some HTML before the page is loaded without touching the original code in the page.
Suppose my body is rendered by a function called render_body():
function render_body() {
return "<body>
<div class='container'>
<div class='a'>A</div>
<div class='b'>B</div>
</div>
</body>";
}
Now I want to insert HTML using PHP without editing render_body(). I want a function that inserts some divs into the container div.
render_body();
<?php // Insert <div class="c"> inside div container ?>
Just as an alternative using XPath - this should load in the output from render_body() to an XML (DOMDocument) object and create an XPath object to query your HTML so you can easily work out where you want to insert the new HTML.
This will probably only work if you're using XML well formed HTML though.
//read in the document
$xml = new DOMDocument();
$xml->loadHTML(render_body());
//create an XPath query object
$xpath = new DOMXpath($xml);
//create the HTML nodes you want to insert
// using $xml->createElement() ...
//find the node to which you want to attach the new content
$xmlDivContainer = $xpath->query('//body/div[@class="container"]')->item(0);
$xmlDivContainer->appendChild( /* the HTML nodes you've previously created */ );
//output
echo $xml->saveHTML();
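With the placeholders filled in, a complete version might look something like the sketch below. It targets the container div, since that is where the question wants the new element to go. The text "C" and the variable names are just placeholders of mine.

```php
<?php
// render_body() is copied from the question.
function render_body() {
    return "<body>
<div class='container'>
<div class='a'>A</div>
<div class='b'>B</div>
</div>
</body>";
}

$xml = new DOMDocument();
$xml->loadHTML(render_body());
$xpath = new DOMXPath($xml);

// Build the new node; the text "C" is a placeholder.
$newDiv = $xml->createElement('div', 'C');
$newDiv->setAttribute('class', 'c');

// Find the container and append the new div as its last child.
$container = $xpath->query('//div[@class="container"]')->item(0);
if ($container !== null) {
    $container->appendChild($newDiv);
}

$out = $xml->saveHTML();
echo $out;
```

Note that saveHTML() on the whole document also emits a doctype and html wrapper, so you may want to serialize only the body node if you need the fragment.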
Took a little while as I had to refer to the documentation ... too much JQuery lately it's ruining my ability to manipulate the DOM without looking things up :\
The only thing I can think of is to turn on output buffering and then use the DOMDocument class to read in the entire buffer and then make changes to it. It is worth doing some reading of the documentation (http://www.php.net/manual/en/book.dom.php) provided in the script...
ie.:
<?php
function render_body() {
return "<body>
<div class='container'>
<div class='a'>A</div>
<div class='b'>B</div>
</div>
</body>";
}
$dom = new DOMDocument();
$dom->loadHTML(render_body());
// get body tag
$body = $dom->getElementsByTagName('body')->item(0);
// add a new element at the end of the body
$element = $dom->createElement('div', 'My new element at the end!');
$body->appendChild($element);
echo $dom->saveHTML(); // echo what is in the dom
?>
EDIT:
As per CD001's suggestions, I have tested this code and it works.
I'd like an easy way for content contributors with limited coding experience to designate the expiration date for selected content on existing HTML (PHP) pages on our site. I'd prefer to remove the content server-side so it isn't still available in the source code.
Illustration of a potential solution I am mulling over:
<div class="story"> ... </div>
Let's say I'd like the above div and its contents to disappear starting on June 1, 2011. So I would add a value to the class attribute:
<div class="story disappears-20110601"> ... </div>
Then I would have to write some code (xpath?) to locate all elements that have a class value matching a pattern like class="... disappears-YYYYMMDD". If the date reference is valid, and that date is today or earlier, the code would remove the entire div and its contents from the DOM, and then serve the page without the expired div.
Before I try to set this up, what do you think of the concept? Is it feasible? If implemented sitewide, would it be a horrible resource hog?
A much better way is to store the content in a database table and assign the expiration date via a DATETIME field. Using a css class for this is a little square-peg.
Here is code for you:
<?php
$content = <<<EOF
<div>
some text 1
<div class="story disappears-20110101"> 20110101 </div>
<div class="story disappears-20110601"> 20110601 </div>
some text 2
</div>
EOF;
$doc = new DOMDocument();
$doc->loadHTML($content);
$xpath = new DOMXPath($doc);
$expired = $xpath->query('//*[contains(@class, \'disappears-\')]');
$remove = array();
foreach ($expired as $n)
if (preg_match('~disappears-(\d{4})(\d{2})(\d{2})~', $n->getAttribute('class'), $m))
if (time() > mktime(0, 0, 0, $m[2], $m[3], $m[1]))
$remove[] = $n;
foreach ($remove as $n)
$n->parentNode->removeChild($n);
echo $doc->saveHTML();
Have a nice day.
Well, you could do that using regular expressions, but I assume it /could/ (doesn't mean it will) get a little bit messy. I suggest using a database if you have access to one. If not, store the stories in separate files (in one directory) and then load/delete/edit them by file name.
Doing that through HTML would be unpleasant, in my opinion.
EDIT:
<?php
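$html_content = file_get_contents('...');
// collect every date suffix, not just the first one
preg_match_all('/class="story disappears-(\d{8})"/i', $html_content, $match_array);
$new_html_content = $html_content;
foreach (array_unique($match_array[1]) as $val) {
// remove the story once its date is today or earlier
if (intval($val) <= intval(date('Ymd'))) {
$new_html_content = preg_replace('/<div class="story disappears-' . $val . '">.*?<\/div>/s', '', $new_html_content);
}
}
echo $new_html_content;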
?>
Just a side note: you should try debugging this first. I might have made a mistake, since I haven't used PHP in a while. If you stumble upon any errors, let me know in the comments so I can update the code.
I mean there has to be a database available. Even if the pages are hand-coded, you can have them upload each page to your server, and have the back-end create a database entry that corresponds to the uploaded page. This database entry could store info about the uploaded page, such as the expiration date. This also makes it easier to organize and serve the page.
This XPath 1.0 will select the desired elements:
//*[
20110601
>=
substring-before(
substring-after(
concat(
' ',
normalize-space(@class),
' '
),
' disappears-'
),
' '
)
]
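To plug this into PHP, something along these lines should work. It is only a sketch: the hard-coded 20110601 is swapped for today's date via date('Ymd'), and the sample markup is a hypothetical fragment in which the second date is deliberately set far in the future so that one story stays.

```php
<?php
// Hypothetical sample: one expired story, one that expires far in the future.
$content = '<div>some text 1'
         . '<div class="story disappears-20110101"> 20110101 </div>'
         . '<div class="story disappears-29991231"> 29991231 </div>'
         . 'some text 2</div>';

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($content);
libxml_clear_errors();

$today = date('Ymd');   // same YYYYMMDD shape as the class suffix
// Same expression as above, with the fixed date replaced by $today.
$query = "//*[$today >= substring-before(substring-after("
       . "concat(' ', normalize-space(@class), ' '), ' disappears-'), ' ')]";

$xpath = new DOMXPath($doc);
// Copy the matches to an array first, then detach each one from its parent.
foreach (iterator_to_array($xpath->query($query)) as $node) {
    $node->parentNode->removeChild($node);
}

$out = $doc->saveHTML();
echo $out;
```

Elements without a disappears- class produce an empty string, which XPath converts to NaN, so the comparison is false and they are left alone.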