Extract text from HTML <p> with a particular title - php

I have a huge file with lots of entries. They have one thing in common: the first line. I want to extract all of the text from each paragraph whose first line is:
Type of document: Contract Notice
The HTML code I am working on is here:
<!-- other HTML -->
<p>
<b>Type of document:</b>
" Contract Notice" <br>
<b>Country</b> <br>
... rest of text ...
</p>
<!-- other HTML -->
I have put the HTML into a DOM like this:
$dom = new DOMDocument;
$dom->loadHTML($content);
I need to return all of the text in each paragraph node whose first line is 'Type of document: Contract Notice'. I am sure there is a simple way of doing this using DOM methods or XPath. Please advise!

Speaking of XPath, try the following expression, which selects <p> elements:
whose first <b> child element has the string value "Type of document:"
and whose first following-sibling text node contains the text "Contract Notice"
//p[
b[1][.="Type of document:"]
/following-sibling::text()[1][contains(., "Contract Notice")]
]
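A minimal sketch of how that expression could be run from PHP. The sample markup and the $paragraphs variable are made up for illustration; only the XPath expression comes from the answer above.

```php
<?php
// Sample markup modelled on the question; the second paragraph is an invented non-match.
$content = <<<HTML
<html><body>
<p><b>Type of document:</b> Contract Notice <br><b>Country</b> <br>France</p>
<p><b>Type of document:</b> Award Notice <br><b>Country</b> <br>Spain</p>
</body></html>
HTML;

$dom = new DOMDocument();
$dom->loadHTML($content);
$xpath = new DOMXPath($dom);

$expr = '//p[b[1][.="Type of document:"]'
      . '/following-sibling::text()[1][contains(., "Contract Notice")]]';

$paragraphs = [];
foreach ($xpath->query($expr) as $p) {
    // textContent concatenates every descendant text node of the <p>.
    // Runs of whitespace are collapsed so the result is easier to read.
    $paragraphs[] = trim(preg_replace('/\s+/', ' ', $p->textContent));
}

print_r($paragraphs);
```

Only the first paragraph should end up in $paragraphs, because the text node after the first <b> of the second paragraph does not contain "Contract Notice".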

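This XPath expression selects the text inside the child elements of the p element:
//b[text()="Type of document:"]/parent::p/*/text()
Note that /*/text() skips text nodes that sit directly inside the p (such as " Contract Notice"); ending the expression with //text() instead picks those up too.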

I prefer not to use DOMDocument unless I need to parse a document heavily, but if you want to go that way, it could look something like this:
//Using DomDocument
$doc = new DOMDocument();
$doc->loadHTML($content);
$xpath = new DOMXpath($doc);
// Select every text node below the matching <p>
$matchedDoms = $xpath->query('//b[text()="Type of document:"]/parent::p//text()');
$data = '';
foreach ($matchedDoms as $domMatch) {
    $data .= $domMatch->data . ' ';
}
var_dump($data);
I would prefer a single regular expression, since you are looking for just one piece of the document:
//Using a Regular Expression
//The lazy quantifier keeps the match inside a single <p>...</p>
if (preg_match('/<p>\s*<b>Type of document:<\/b>[^<]*Contract Notice(?<data>.*?)<\/p>/si', $content, $matches)) {
    var_dump($matches['data']); //If you want everything in there
    var_dump(strip_tags($matches['data'])); //If you just want the text
}

Related

Prepend HTML text using DOMDocument without parent container

Let's say I have <p>Text</p>
I'd like to create a function using DOMDocument to be able to insert text, eg:
insertText('<p>Text</p>', '<strong>1.</strong> ')
So that the result is <p><strong>1.</strong> Text</p>
I'm already accessing this paragraph tag, so I think I'm almost there. I just cannot figure out how to insert a string that will be read as HTML.
$dom = new DOMDocument();
$question_paragraphs = array();
$dom->loadHTML($str);
$par = $dom->getElementsByTagName('p');
if ($par->length == 1) {
$par->item(0)->setAttribute("class", "first last");
###
### How do I do this here?
###
}
Is it possible to inject text this way?
You can use the insertBefore method as follows:
Create your strong element
Insert this node before the text node
$span = $dom->createElement('strong', '1.');
$par->item(0)->insertBefore($span, $par->item(0)->firstChild);
Please note that the second parameter of insertBefore is the existing child before which the new node is inserted. In this case you can use firstChild, as your <p> contains only the text.
This will finally output
<p class="first last"><strong>1.</strong>Text</p>
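If the prefix has to be passed in as an HTML string, as in the question's insertText() call, one possible approach is a document fragment. This is only a sketch: insertText() is the asker's hypothetical function name, and it assumes the prefix is well-formed XML, because DOMDocumentFragment::appendXML() parses it as XML.

```php
<?php
// Hypothetical helper: prepend a markup string inside the first <p> of $html.
function insertText(string $html, string $prefix): string
{
    $dom = new DOMDocument();
    // These flags stop libxml from adding <html><body> and a DOCTYPE around the snippet.
    $dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

    $par = $dom->getElementsByTagName('p')->item(0);
    if ($par === null) {
        return $html; // no paragraph to prepend to
    }

    // appendXML() parses the string as XML, so $prefix must be well-formed.
    $fragment = $dom->createDocumentFragment();
    if (!$fragment->appendXML($prefix)) {
        return $html; // prefix could not be parsed; leave the input untouched
    }
    $par->insertBefore($fragment, $par->firstChild);

    return trim($dom->saveHTML());
}

echo insertText('<p>Text</p>', '<strong>1.</strong> '), PHP_EOL;
```

The fragment keeps the nodes of the prefix together, so a mix of elements and text can be inserted in one insertBefore() call.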

How do I form this replacement with a regular expression

I need to remove a particular string using PHP. The string starts with <div class='event'>, followed by any text that contains $myVariable, and ends with </div>. How do I remove all of this using preg_replace()? I have worked out that it might be something like this:
preg_replace("<div class='event'>(.*)" . $myVariable . "(.*)</div>", "", $content);
But I can't get it to work.
Update:
I need to remove a div and everything inside it. The div contains an event name and a date, but I can only identify the div by the event's name, so the date has to match practically any string.
Let's imagine you have a <div> and inside it, there is some text node with a specific word you define with $myVariable.
The task is:
Read the document in
Initialize DOM
Collect the <div> tags with .nodeValue containing $myVariable text
Remove those tags from the DOM
Return updated DOM
The code for that algorithm is below (the DOM is initialized with an HTML string in the demo):
$html = "<<YOUR_HTML_STRING>>";
$dom = new DOMDocument; // Declaring the DOM
$dom->loadHTML($html); // Initializing the DOM with an HTML string
$myVariable = "2015-09-12"; // Your dynamic variable
$xpath = new DOMXPath($dom); // Initializing the DOMXpath
// Collecting the event DIVs that contain $myVariable. Without the class test,
// any wrapper <div> around a matching event would be selected as well.
$divs = $xpath->query("//div[@class='event'][contains(., '$myVariable')]");
foreach ($divs as $div) {
    $div->parentNode->removeChild($div); // Removing the DIVs
}
echo $dom->saveHTML(); // Getting the updated DOM
Note that you can stop DOMDocument from adding a DOCTYPE and the implied <html>/<body> wrapper by declaring and initializing the DOM like this:
$dom = new DOMDocument('1.0', 'UTF-8');
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
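For comparison, here is a hedged sketch of what a working preg_replace() call might look like. The attempt in the question has no explicit delimiters, so PHP is likely to treat the leading < and the first > as the delimiter pair and reject the rest. The sample $content and the value of $myVariable below are invented, and the pattern assumes event divs contain no nested <div>.

```php
<?php
// Hypothetical input: two event blocks, one of which names the event to remove.
$content = "<div class='event'>Concert 2015-09-12</div>"
         . "<div class='event'>Lecture 2015-10-01</div>";
$myVariable = 'Concert';

// (?:(?!</div>).)*? advances one character at a time and refuses to step over
// a closing </div>, so a match cannot spill into a neighbouring block.
$noClose = '(?:(?!</div>).)*?';

// preg_quote() escapes regex metacharacters in the variable; '~' is the delimiter.
$pattern = "~<div class='event'>" . $noClose . preg_quote($myVariable, '~')
         . $noClose . '</div>~s';

$content = preg_replace($pattern, '', $content);
echo $content, PHP_EOL; // <div class='event'>Lecture 2015-10-01</div>
```

The DOM version above remains the more robust choice once the markup can contain nested elements or varying attribute quoting.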

How to remove consecutive links from a webpage?

I wish to remove consecutive links on a webpage
Here is a sample
<div style="font-family: Arial;">
<br>
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
Google is a search
engine
In the above HTML I want to remove the first two <a> tags but not the third one (my script should only remove consecutive tags).
Don't use a regex for this. Regular expressions are extremely powerful, but they are not suited to finding this kind of "consecutive" tags.
I suggest you use DOM. Then you can browse the HTML as a tree.
Here is an example (not tested):
$doc = new DOMDocument();
// avoid blank nodes when parsing
$doc->preserveWhiteSpace = false;
// reads HTML from a string; loadHTMLFile() also exists
$doc->loadHTML($html);
// find all "a" tags (this node list is live: it shrinks as links are removed)
$links = $doc->getElementsByTagName('a');
// fetch both links before removing anything, so the indexes stay meaningful
$first = $links->item(0);
$second = $links->item(1);
// remove the first link
$first->parentNode->removeChild($first);
// test the node following the second link
if ($second->nextSibling !== null && $second->nextSibling->nodeType != XML_TEXT_NODE) {
    // delete this node ...
}
// print the modified HTML
// See DOMDocument's attributes if you want to format the output
echo $doc->saveHTML();
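The idea above can be taken a step further. The following is a sketch only, under the assumption that "consecutive" means two <a> elements that are direct neighbours with nothing but whitespace between them; the sample markup and the helper names adjacentNode() and isLink() are invented for illustration.

```php
<?php
// Hypothetical sample: two adjacent links, then a link embedded in running text.
$html = '<div><a href="#a">AAAA</a> <a href="#b">BBBB</a>'
      . '<br><a href="#g">Google</a> is a search engine</div>';

// Return the nearest sibling in the given direction ('previousSibling' or
// 'nextSibling'), skipping whitespace-only text nodes.
function adjacentNode(DOMNode $node, string $direction): ?DOMNode
{
    $sibling = $node->$direction;
    while ($sibling !== null
        && $sibling->nodeType === XML_TEXT_NODE
        && trim($sibling->nodeValue) === '') {
        $sibling = $sibling->$direction;
    }
    return $sibling;
}

function isLink(?DOMNode $node): bool
{
    return $node !== null
        && $node->nodeType === XML_ELEMENT_NODE
        && $node->nodeName === 'a';
}

$doc = new DOMDocument();
$doc->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

// Decide first, remove afterwards: getElementsByTagName() returns a live list
// that shifts as soon as nodes are deleted.
$toRemove = [];
foreach ($doc->getElementsByTagName('a') as $link) {
    if (isLink(adjacentNode($link, 'previousSibling'))
        || isLink(adjacentNode($link, 'nextSibling'))) {
        $toRemove[] = $link;
    }
}
foreach ($toRemove as $link) {
    $link->parentNode->removeChild($link);
}

$result = $doc->saveHTML();
echo $result;
```

In this sample the first two links are neighbours and get removed, while the Google link survives because a <br> precedes it and text follows it.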

PHP DOMXPath problem

$xpath = new DOMXpath($doc);
$res = $xpath->query(".//*[@id='post2679883']/tr[2]/td[2]/div[2]");
foreach ($res as $obj) {
    var_dump($obj->nodeValue);
}
I need to select all elements whose id contains the word "post".
Example:
<div id="post2242424">trarata</div>
<div id="post114525">trarata</div>
<div id="post8568686">trarata</div>
Question number two:
I need to get these elements with their HTML tags, but $obj->nodeValue returns the text without HTML tags.
You could use the XPath function starts-with to filter the nodes if all the ids you want start with "post". For example:
$xpath->query(".//*[starts-with(@id, 'post')]/tr[2]/td[2]/div[2]");
The second part, I think, has been answered already: PHP DOMDocument stripping HTML tags
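Both parts can be sketched together. The markup below is invented (including the non-matching "other1" div), and the path is shortened to the id test. Passing a node to DOMDocument::saveHTML() serialises that node with its tags, which is one way to keep the markup that nodeValue drops.

```php
<?php
// Hypothetical markup: two "post..." elements and one element that should be skipped.
$html = '<div id="post2242424"><b>trarata</b></div>'
      . '<div id="other1">skip me</div>'
      . '<div id="post114525">trarata</div>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$fragments = [];
foreach ($xpath->query("//*[starts-with(@id, 'post')]") as $node) {
    // saveHTML($node) serialises just this node, tags included,
    // whereas $node->nodeValue would return only the text.
    $fragments[] = $doc->saveHTML($node);
}

print_r($fragments);
```

If the ids can contain "post" anywhere rather than only at the start, contains(@id, 'post') would be the corresponding XPath test.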

Add an attribute to an HTML element

I can't quite figure it out, I'm looking for some code that will add an attribute to an HTML element.
For example, let's say I have a string with an <a> in it, and that <a> needs an attribute added to it, so that style="xxxx:yyyy;" gets added to the <a>. How would you go about doing this?
Ideally it would add any attribute to any tag.
It's been said a million times: don't use regexes for HTML parsing.
$dom = new DOMDocument();
// @ suppresses the warnings libxml raises for imperfect markup
@$dom->loadHTML($html);
$x = new DOMXPath($dom);
foreach ($x->query("//a") as $node)
{
    $node->setAttribute("style", "xxxx:yyyy;");
}
$newHtml = $dom->saveHTML();
Here is a version using a regex:
$result = preg_replace('/(<a\b[^><]*)>/i', '$1 style="xxxx:yyyy;">', $str);
but a regex cannot reliably handle malformed HTML documents.
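Since the question asks for something that adds any attribute to any tag, the DOM approach could be wrapped in a small function. This is a sketch under assumptions: addAttribute() and its parameters are hypothetical names, and the temporary <div> wrapper is only there to give the fragment a single root element.

```php
<?php
// Hypothetical helper: set $name="$value" on every <$tag> element in an HTML fragment.
function addAttribute(string $html, string $tag, string $name, string $value): string
{
    $dom = new DOMDocument();
    // The wrapper <div> gives the fragment one root, which LIBXML_HTML_NOIMPLIED expects.
    // @ suppresses libxml warnings about imperfect markup.
    @$dom->loadHTML(
        '<div>' . $html . '</div>',
        LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD
    );

    foreach ($dom->getElementsByTagName($tag) as $node) {
        $node->setAttribute($name, $value);
    }

    // Serialise the wrapper's children one by one so the wrapper itself is dropped.
    $out = '';
    foreach ($dom->documentElement->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }
    return $out;
}

echo addAttribute('Go to <a href="/home">home</a> now', 'a', 'style', 'color:red;'), PHP_EOL;
```

setAttribute() overwrites an existing attribute of the same name, so merging with an existing style value would need extra handling.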
