How can I replace <p><span class="headline"> with <p class="headline"><span>? What is the easiest way to do it in PHP?
$data = file_get_contents("http://www.ihr-apotheker.de/cs1.html");
$clean1 = strstr($data, '<p>');
$str = preg_replace('#(<a.*>).*?(</a>)#', '$1$2', $clean1);
$ausgabe = strip_tags($str, '<p>');
echo $ausgabe;
Before I alter the HTML from the site, I want to move the class declaration from the <span> to the <p> tag.
Don't parse HTML with regex!
this class should provide what you need
http://simplehtmldom.sourceforge.net/
The reason not to parse HTML with regex is if you can't guarantee the format. If you already know the format of the string, you don't have to worry about having a complete parser.
In your case, if you know that's the format, you can use str_replace
$data = str_replace('<p><span class="headline">', '<p class="headline"><span>', $data);
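As a minimal sketch of the str_replace approach above (the sample markup in $data is made up for illustration, the real page may look different): str_replace() returns the changed string instead of modifying its argument, and its optional fourth parameter reports how many replacements were made, which is a cheap way to notice when the assumed format no longer matches.

```php
<?php
// Hypothetical sample input; the real page markup may differ.
$data = '<p><span class="headline">News</span> Some text.</p>';

// str_replace() returns the modified string, so the result has to be assigned.
// $count receives the number of replacements that were performed.
$result = str_replace(
    '<p><span class="headline">',
    '<p class="headline"><span>',
    $data,
    $count
);

echo $result, "\n";
echo $count, "\n";
```

If $count comes back as 0, the markup did not have the expected shape and a DOM-based approach is probably the safer choice.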
Well, answer was accepted already, but anyway, here is how to do it with native DOM:
$dom = new DOMDocument;
$dom->loadHTMLFile("http://www.ihr-apotheker.de/cs1.html");
$xPath = new DOMXpath($dom);
// remove links but keep link text
foreach($xPath->query('//a') as $link) {
$link->parentNode->replaceChild(
$dom->createTextNode($link->nodeValue), $link);
}
// switch classes
foreach($xPath->query('//p/span[@class="headline"]') as $node) {
$node->removeAttribute('class');
$node->parentNode->setAttribute('class', 'headline');
}
echo $dom->saveHTML();
On a side note, HTML has elements for headings, so why not use an <h*> element instead of the semantically superfluous "headline" class?
Have you tried using str_replace?
If the placement of the <p> and <span> tags is consistent, you can simply replace one with the other. Note the argument order: the part to replace comes first, then the replacement.
$string = str_replace("part to replace", "replacement", $string);
Related
Say I have the following string:
<a name="anchor" title="anchor title">
Currently I can extract name and title with strpos and substr, but I want to do it right. How can I do this with regex? And what if I wanted to extract from many of these tags within a block of text?
I've tried this regex:
/name="([A-Z,a-z])\w+/g
But it gets the name=" part as well, I just want the value.
The regex (\S+)=["']?((?:.(?!["']?\s+(?:\S+)=|[>"']))+.)["']? can be used to extract all attributes
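The pattern in the question returns the name=" part because only the full match is being read. A capture group gives you just the value. Here is a rough sketch using a much simpler pattern than the one above; it assumes double-quoted values and only looks at name and title, and the sample text is invented:

```php
<?php
// Hypothetical sample text containing several anchor tags.
$text = '<a name="anchor" title="anchor title"> and <a name="top" title="Back to top">';

// Group 1 is the attribute name, group 2 is the value without the quotes.
// PREG_SET_ORDER yields one array per match: [full match, group 1, group 2].
preg_match_all('/\b(name|title)="([^"]*)"/', $text, $matches, PREG_SET_ORDER);

foreach ($matches as $m) {
    echo $m[1], ' => ', $m[2], "\n";
}
```

This is fine for a known, tidy input format. For arbitrary HTML the DOM route is more robust.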
DOMDocument example:
<?php
$titles = array();
$doc = new DOMDocument();
$doc->loadHTML('<html><body>Test<br><a name="anchor" title="anchor title"></body></html>');
$links = $doc->getElementsByTagName('a');
if ($links->length!=0) {
foreach ($links as $a) {
$titles[] = $a->getAttribute('title');
}
}
?>
You commented: "I'm actually parsing the data before the page is rendered so DOM is not possible, right?"
We're working with the scraped HTML, so we construct a DOM with these functions and parse like XML.
Good examples in the comments here: http://php.net/manual/en/domdocument.getelementsbytagname.php
I need to be able to parse some text and find all the instances where an <a> tag has target="_blank", and for each match add (for example) "This link opens in a new window" before the closing </a> tag.
For example:
Before:
Go here now
After:
Go here now<span>(This link opens in a new window)</span>
This is for a PHP site, so I assume preg_replace() will be the method; I just don't have the skills to write the regex properly.
Thanks in advance for any help anyone can offer.
You should never use a regex to parse HTML, except maybe in extremely well-defined and controlled circumstances.
Instead, try a built-in parser:
$dom = new DOMDocument();
$dom->loadHTML($your_html_source);
$xpath = new DOMXPath($dom);
$links = $xpath->query("//a[@target='_blank']");
foreach($links as $link) {
$link->appendChild($dom->createTextNode(" (This link opens in a new window)"));
}
$output = $dom->saveHTML();
Alternatively, if this is being output to the browser, you can just use CSS:
a[target='_blank']:after {
content: ' (This link opens in a new window)';
}
This will work for anchor tag replacement....
$string = str_replace('<a ','<a target="_blank" ',$string);
Well, @Kolink is right, but here's my RegExp version.
$string = '<p>mess</p>Google<p>mess</p>';
// No U modifier here: it inverts greediness, which would make every .*? greedy again.
echo preg_replace('/(<a\b[^>]*\btarget="_blank"[^>]*>)(.*?)(<\/a>)/is', '$1$2(This link opens in a new window)$3', $string);
This does the job:
$newText = '<span>(This link opens in a new window)</span>';
$pattern = '~<a\s[^>]*?\btarget\s*=(?:\s*([\'"])_blank\1|_blank\b)[^>]*>[^<]*(?:<(?!/a>)[^<]*)*\K~i';
echo preg_replace($pattern, $newText, $html);
However, this direct string approach may also alter commented-out HTML, strings or comments in CSS or JavaScript code, and even JavaScript regex literals, which is unneeded at best and harmful at worst. That's why you should use a DOM approach if you want to avoid these pitfalls. All you have to do is append a new node to each link that has the desired attribute:
$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHTML($html);
$xp = new DOMXPath($dom);
$nodeList = $xp->query('//a[@target="_blank"]');
foreach($nodeList as $node) {
$newNode = $dom->createElement('span', '(This link opens in a new window)');
$node->appendChild($newNode);
}
$html = $dom->saveHTML();
Finally, a last alternative is to leave the HTML unchanged and do it with CSS:
a[target="_blank"]::after {
content: " (This link opens in a new window)";
font-style: italic;
color: red;
}
You won't be able to write a regex that will evaluate an infinitely long string. I suggest:
$h = explode('>', $html);
This will give you the chance to traverse it like any other array and then do:
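foreach ($h as $i => $piece) {
    // Each piece ends where a '>' used to be, so an opening <a ...> tag sits at the end of its piece.
    if (!preg_match('/<a\s[^<]*target="_blank"[^<]*$/i', $piece) || !isset($h[$i + 1])) {
        continue;
    }
    // The next piece holds the link text followed by '</a'; put the note in front of that.
    $h[$i + 1] = str_replace('</a', '(open in new window)</a', $h[$i + 1]);
}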
$html = implode('>', $h);
This is how I would approach such a problem. Of course, I just threw this out off the top of my head and it is not guaranteed to work as is, but with a few tweaks to fit your exact logic you will have what you need.
If I have a document like this:
<!-- in doc.xml -->
<a>
<b>
greetings?
<c>hello</c>
<d>goodbye</d>
</b>
</a>
Is there any way to use simplexml (or any php builtin really) to get a string containing:
greetings?
<c>hello</c>
<d>goodbye</d>
Whitespace and such doesn't matter.
Thanks!
I must admit this wasn't as simple as one would think. This is what I came up with:
$xml = new DOMDocument;
$xml->load('doc.xml');
// find just the <b> node(s)
$xpath = new DOMXPath($xml);
$results = $xpath->query('/a/b');
// get entire <b> node as text
$node = $results->item(0);
$text = $xml->saveXML($node);
// remove encapsulating <b></b> tags
$text = preg_replace('#^<b>#', '', $text);
$text = preg_replace('#</b>$#', '', $text);
echo $text;
Regarding the XPath query
The query returns all matching nodes, so if there are multiple matching <b> tags, you can loop through $results to get them all.
My query for '/a/b' assumes that <a> is the root and <b> is its child/immediate descendant. You could alter it for different scenarios. Here's an XPath reference. Some adjustments might include:
'//a/b' –– <b> is a child of <a>, but <a> can be anywhere, not just at the root
'a//b' –– <b> is a descendant of <a> no matter how deep, not just a direct child
'//b' –– all <b> nodes anywhere in the document
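To see how these variants differ, here is a small sketch that runs them against an invented document with a nested <a>. The document and the exact query spellings (with the leading slashes written out) are my own assumptions for illustration:

```php
<?php
// Hypothetical document with a nested <a> to make the differences visible.
$xml = new DOMDocument;
$xml->loadXML('<a><b>one</b><x><a><b>two</b></a><b>three</b></x></a>');
$xpath = new DOMXPath($xml);

$results = [];
foreach (['/a/b', '//a/b', '/a//b', '//b'] as $query) {
    $found = [];
    foreach ($xpath->query($query) as $node) {
        $found[] = $node->textContent;
    }
    // Remember what each query matched, then print it.
    $results[$query] = $found;
    echo $query, ' => ', implode(', ', $found), "\n";
}
```

Only '/a/b' is limited to direct children of the root <a>; the other queries also reach into the nested elements.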
Regarding method of obtaining string contents
I tried using $node->nodeValue or $node->textContent, but both of them strip out the <c> and <d> tags, leaving just the text contents of those. I also tried casting it as a DOMText object, but that didn't directly work and was more trouble than it was worth.
Regarding the use of regular expressions
It could be done without regex, but I found it easiest to use them. I wanted to make sure that I only stripped the <b> and </b> at the very beginning and end of the string, just in case there were other <b> nodes within the contents.
How about this? Since you already know the XML format:
<?php
$xml = simplexml_load_file('doc.xml');
$str = $xml->b;
$str .= "<c>".$xml->b->c."</c>";
$str .= "<d>".$xml->b->d."</d>";
echo $str;
?>
Here's an alternative using DOM (to balance the SimpleXML answers!) that outputs all of the contents of the first <b> element.
$doc = new DOMDocument;
$doc->load('doc.xml');
$bee = $doc->getElementsByTagName('b')->item(0);
$innerxml = '';
foreach ($bee->childNodes as $node) {
$innerxml .= $doc->saveXML($node);
}
echo $innerxml;
I'm a bit stumped on how to make a string uppercase in PHP without making the markup uppercase.
So for example:
<p>Chicken & cheese</p>
Will become
<p>CHICKEN & CHEESE</p>
Any advice appreciated, thanks!
The following will replace all DOMText node data in the BODY with uppercase data:
$html = <<< HTML
<p>Chicken &amp; cheese</p>
HTML;
$dom = new DOMDocument;
$dom->loadHTML($html);
$xPath = new DOMXPath($dom);
foreach($xPath->query('/html/body//text()') as $text) {
$text->data = strtoupper($text->data);
}
echo $dom->saveXML($dom->documentElement);
gives:
<html><body><p>CHICKEN &amp; CHEESE</p></body></html>
Also see
(related) Best Methods to parse HTML
Well you could use the DOM class and transform all text with it.
EDIT: or you could use this css:
.text{
text-transform: uppercase;
}
as GUMBO suggested
Parse it, then capitalize as you like.
I would be tempted to make the whole string uppercase...
$str = strtoupper('<p>Chicken & cheese</p>');
...And then use a preg_replace_callback() call to go back over the HTML tags (presuming the HTML is valid) and lowercase the tags and their attributes.
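A rough sketch of that idea, using preg_replace_callback() to lowercase everything between < and > after the whole string has been uppercased. The sample markup is made up, and the approach has known limits: mixed-case attribute values end up lowercase, and entities such as &amp; would be uppercased and broken, so it only suits simple, controlled input.

```php
<?php
// Hypothetical sample input without entities or mixed-case attribute values.
$str = strtoupper('<p class="intro">Chicken and cheese</p>');

// Lowercase every tag (including its attributes) again; text between tags stays uppercase.
$str = preg_replace_callback(
    '/<[^>]+>/',
    function ($m) {
        return strtolower($m[0]);
    },
    $str
);

echo $str, "\n";
```

The DOM-based answer avoids these problems because it only touches text nodes.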
I can't quite figure it out, I'm looking for some code that will add an attribute to an HTML element.
For example, let's say I have a string with an <a> in it, and that <a> needs an attribute added to it, so the <a> gets style="xxxx:yyyy;". How would you go about doing this?
Ideally it would add any attribute to any tag.
It's been said a million times: don't use regexes for HTML parsing.
$dom = new DOMDocument();
@$dom->loadHTML($html);
$x = new DOMXPath($dom);
foreach($x->query("//a") as $node)
{
$node->setAttribute("style","xxxx:yyyy;");
}
$newHtml = $dom->saveHTML();
Here is using regex:
$result = preg_replace('/(<a\b[^><]*)>/i', '$1 style="xxxx:yyyy;">', $str);
Keep in mind, though, that a regex cannot reliably parse malformed HTML documents.
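For the "any attribute on any tag" part of the question, the DOM approach generalizes fairly easily. Below is a minimal sketch; the function name addAttribute() and its signature are hypothetical, not an existing API. It wraps the fragment in a <div> so that only the fragment is serialized, without the <html> and <body> elements that loadHTML() adds.

```php
<?php
// Hypothetical helper (name and signature are assumptions, not an existing API).
function addAttribute(string $html, string $tag, string $name, string $value): string
{
    $dom = new DOMDocument;
    libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
    $dom->loadHTML('<div>' . $html . '</div>');
    libxml_clear_errors();

    // Set the attribute on every element with the given tag name.
    foreach ($dom->getElementsByTagName($tag) as $node) {
        $node->setAttribute($name, $value);
    }

    // Serialize only the children of the wrapper <div>.
    $wrapper = $dom->getElementsByTagName('div')->item(0);
    $out = '';
    foreach ($wrapper->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }
    return $out;
}

echo addAttribute('Go <a href="page.html">here</a> now', 'a', 'style', 'xxxx:yyyy;'), "\n";
```

An existing attribute with the same name is overwritten by setAttribute(), so merging style values would need extra handling.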