Preg_replace, point after and before selection - php

<div style="display:none">250</div>.<div style="display:none">145</div>
I'd want:
<div style="display:none">250</div>#.#<div style="display:none">145</div>
or like this:
<div style="display:none">111</div>125<div style="display:none">110</div>
where I'd want
<div style="display:none">111</div>#125#<div style="display:none">110</div>
I'd like a preg_replace to put those hash signs around the value, so I assume the regex would look something like this:
"<\/div>[.]|<\/div>\d{1,3}"
The value is either a number (1-3 digits) or a dot.
Anyhow, I don't know how to preg_replace around the value:
"<\/div>[.]|<\/div>\d{1,3}" replace: $0#
That only inserts it after the value.
EDIT
I cannot use an HTML parser, because I cannot find one that does not treat styles / classes as plain text, and I need the values attached to determine whether the element is visible or not :(
And yes, it is driving me insane, but I am almost done :)

You really should not be trying to parse HTML with regex. There are only a couple of people I know who can do it, and even if you were one of them, regex still would not be the right tool for the job. Use PHP's DOMDocument, optionally with DOMXPath.
With xpath:
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$textNode = $xpath->query('//text()')->item(1);
$textNode->parentNode->replaceChild($dom->createTextNode('#' . $textNode->textContent . '#'), $textNode);
echo htmlspecialchars($dom->saveHTML());
http://codepad.viper-7.com/KLTLDA
With childnodes:
$dom = new DOMDocument();
$dom->loadHTML($html);
$body = $dom->getElementsByTagName('body')->item(0);
$textNode = $body->childNodes->item(1);
$textNode->parentNode->replaceChild($dom->createTextNode('#' . $textNode->textContent . '#'), $textNode);
echo htmlspecialchars($dom->saveHTML());
http://codepad.viper-7.com/Ii4vPb
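Both snippets above pick the text node by a fixed index. As a rough sketch only, the same idea can be applied to every text node that sits directly inside the body, so the position does not have to be known in advance. The function name wrapBodyText is made up for this illustration, and it assumes the input is a non-empty HTML fragment like the ones in the question:

```php
<?php
// Hypothetical helper (not from the answer): wraps each text node that is a
// direct child of <body> in '#', leaving text inside the divs untouched.
function wrapBodyText(string $html): string
{
    $previous = libxml_use_internal_errors(true); // collect parser warnings quietly
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    foreach ($xpath->query('//body/text()') as $textNode) {
        $textNode->nodeValue = '#' . $textNode->nodeValue . '#';
    }

    // Serialize only the children of <body>, skipping the doctype/html wrapper.
    $out = '';
    foreach ($dom->getElementsByTagName('body')->item(0)->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }
    return $out;
}

echo wrapBodyText('<div style="display:none">250</div>.<div style="display:none">145</div>');
```

Because the divs stay real DOM nodes, their style attribute is still available through getAttribute('style') if visibility has to be checked afterwards.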

In your case,
preg_replace("~</div\s*>(\.|\d{1,3})<div~i", '</div>#$1#<div', $string);
That's assuming no spaces between the divs and the content, and nothing otherwise weird is between.
Note that regex is very brittle, and would fail silently on even the slightest change in HTML.
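For reference, here is a small sketch of that replacement run against the two samples from the question. The wrapper function hashWrap is only an illustrative name, and the same assumption applies: nothing but the dot or the 1-3 digit number sits between the closing and the opening div.

```php
<?php
// Hypothetical wrapper around the preg_replace call shown above.
function hashWrap(string $html): string
{
    // $1 is the captured dot or 1-3 digit number between the two divs.
    $result = preg_replace('~</div\s*>(\.|\d{1,3})<div~i', '</div>#$1#<div', $html);
    return $result ?? $html; // preg_replace returns null on error
}

echo hashWrap('<div style="display:none">250</div>.<div style="display:none">145</div>'), "\n";
// prints: <div style="display:none">250</div>#.#<div style="display:none">145</div>
echo hashWrap('<div style="display:none">111</div>125<div style="display:none">110</div>'), "\n";
// prints: <div style="display:none">111</div>#125#<div style="display:none">110</div>
```

A value with a space or a line break next to it would simply not match, and the string would come back unchanged.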

Related

Read page source using PHP with primes "

I am trying to read the source code of a page. I just want to read some text that is within a certain division element with the id "wrapper_left".
My problem is that if a double quote " is used in the first argument of the explode function, it does not work. I tried escaping the string, although I figured this wouldn't do anything.
$source_code = htmlspecialchars(file_get_contents('http://mydomain.com'));
$source_code = explode('<div id="wrapper_left">', $source_code);
echo $source_code[1];
Thanks tons in advance.
Don't bother trying to get this done with explode(), string manipulation, or a regular expression; you need an HTML parser, like DOMDocument:
$doc = new DOMDocument;
$doc->loadHTMLFile( 'http://mydomain.com');
$xpath = new DOMXPath( $doc);
$div = $xpath->query( '//div[@id="wrapper_left"]')->item(0);
echo $div->textContent;
You can see it working in this demo. When fed this HTML:
<div id="wrapper_left">Some text</div>
it produces:
Some text
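As a side note on why the explode() attempt finds nothing: htmlspecialchars() has already turned < and " into &lt; and &quot; by the time explode() runs, so the delimiter never occurs in the string. Below is a minimal sketch of the parser approach that works on an HTML string instead of a live URL, so it can be tried offline. The function name wrapperLeftText is invented for this example:

```php
<?php
// Hypothetical helper: returns the text of the first <div id="wrapper_left">,
// or null when the document contains no such element.
function wrapperLeftText(string $html): ?string
{
    $previous = libxml_use_internal_errors(true); // real pages often trigger warnings
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $div = $xpath->query('//div[@id="wrapper_left"]')->item(0);

    return $div === null ? null : $div->textContent;
}

echo wrapperLeftText('<div id="wrapper_left">Some text</div>'); // prints: Some text
```

For a remote page, the string could come from file_get_contents(), without passing it through htmlspecialchars() first.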

How to turn off converting special characters to entities in DOMDocument?

I'm using the code below to get the wanted content from HTML with DOMDocument:
$subject = 'some html code';
$doc = new DOMDocument('1.0');
$doc->loadHTML($subject);
$xpath = new DOMXpath($doc);
$result = $xpath->query("//div");
$docSave = new DOMDocument('1.0');
foreach ( $result as $node ) {
    $domNode = $docSave->importNode($node, true);
    $docSave->appendChild($domNode);
}
echo $docSave->saveHTML();
The problem is that if there is a special character in the HTML $subject, like a space or a new line, it is converted to an HTML entity. The input HTML is far from being well formed, and some special characters also appear within paths in tags, for instance:
$subject = "<div><a href='http://www.site.com/test.php?a=1&b=2, 3,
4'></a></div>";
will produce:
<div><a href='http://www.site.com/test.php?a=1&b=2,%203,%0A%204'></a></div>
instead of:
<div><a href='http://www.site.com/test.php?a=1&b=2, 3,
4'></a></div>
What can one do to avoid the conversion of special characters to their entities in order to keep the invalid HTML?
I tried setting the substituteEntities flag to false but saw no improvement; maybe I used it wrong? Some code examples would be very helpful.
You can't use a parser and still keep the bad HTML as it is. A parser cleans up the HTML in order to parse it.
If you absolutely must keep the bad HTML, use regexes, but be aware that there is an extreme risk of head injury from banging your head against the desk too much.

How can I remove <br/> if no text comes before or after it? DOMxpath or regex?

How can I remove <br/> if no text comes before or after it?
For instance,
<p><br/>hello</p>
<p>hello<br/></p>
they should be rewritten like this,
<p>hello</p>
<p>hello</p>
Should I use DOMXPath, or would regex be better?
(Note: I posted earlier about removing <p><br/></p> with DOMXPath, and then I came across this issue!)
EDIT:
If I have this in the input,
$content = '<p><br/>hello<br/>hello<br/></p>';
then it should be
<p>hello<br/>hello</p>
To select the mentioned br you can use:
"//p[node()[1][self::br]]/br[1] | //p[node()[last()][self::br]]/br[last()]"
or, (maybe) faster:
"//p[br]/node()[self::br and (position()=1 or position()=last())]"
This just gets the br when it is the first (or last) node of p.
This will select br such as:
<p><br/>hello</p>
<p>hello<br/></p>
and first and last br like in:
<p><br/>hello<br/>hello<br/></p>
not middle br like in:
<p>hello<br/>hello</p>
PS: if needed, to get the first br in a pair like this <br/><br/>:
"//br[following::node()[1][self::br]]"
In case you want some code, I got it working like this (Demo). It is a very slight modification of @empo's XPath and shows the removal of the matches as well as some more test cases:
$html = <<<EOD
<p><br/>hello</p>
<p>hello<br/></p>
<p>hello<br/>Chello</p>
<p>hello <i>molly</i><br/></p>
<p>okidoki</p>
EOD;
$doc = new DomDocument;
$doc->loadHTML($html);
$xpath = new DomXPath($doc);
$nodes = $xpath->query('//p[node()[1][self::br] or node()[last()][self::br]]/br');
foreach($nodes as $node) {
    $node->parentNode->removeChild($node);
}
var_dump($doc->saveHTML());
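One thing to watch with the query above: it selects every br child of a matching p, so for the input from the EDIT (a br at the start, in the middle and at the end) the middle br would be removed as well. The sketch below uses a union expression that only picks a br when it is the first or the last child node. The function name removeEdgeBreaks is hypothetical, and it does a single pass, so a doubled leading <br/><br/> would only lose one br:

```php
<?php
// Hypothetical helper: removes a <br> only when it is the first or the last
// child node of a <p>; breaks in the middle are kept.
function removeEdgeBreaks(string $html): string
{
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $query = '//p/node()[1][self::br] | //p/node()[last()][self::br]';
    // XPath results are a static list, so removing nodes while looping is safe.
    foreach ($xpath->query($query) as $br) {
        $br->parentNode->removeChild($br);
    }

    // Serialize only the children of <body>.
    $out = '';
    foreach ($doc->getElementsByTagName('body')->item(0)->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

echo removeEdgeBreaks('<p><br/>hello<br/>hello<br/></p>');
```

Note that saveHTML() writes the remaining break as <br> rather than <br/>.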

Add an attribute to an HTML element

I can't quite figure it out, I'm looking for some code that will add an attribute to an HTML element.
For example, let's say I have a string with an <a> in it, and that <a> needs an attribute added to it, so <a> gets style="xxxx:yyyy;" added. How would you go about doing this?
Ideally it would add any attribute to any tag.
It's been said a million times. Don't use regexes for HTML parsing.
$dom = new DOMDocument();
@$dom->loadHTML($html); // @ suppresses warnings about malformed markup
$x = new DOMXPath($dom);
foreach($x->query("//a") as $node)
{
    $node->setAttribute("style","xxxx");
}
$newHtml = $dom->saveHtml();
Here it is using regex:
$result = preg_replace('/(<a\b[^><]*)>/i', '$1 style="xxxx:yyyy;">', $str);
But regex cannot parse malformed HTML documents.
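Since the question asks for "any attribute on any tag", here is a sketch of how the DOM approach might be generalized. The function name addAttribute and its parameters are assumptions made for this example, not part of either answer:

```php
<?php
// Hypothetical helper: sets $name="$value" on every <$tag> element in $html
// and returns the contents of <body>.
function addAttribute(string $html, string $tag, string $name, string $value): string
{
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // No nodes are added or removed, so iterating the live list is fine here.
    foreach ($dom->getElementsByTagName($tag) as $element) {
        $element->setAttribute($name, $value);
    }

    $out = '';
    foreach ($dom->getElementsByTagName('body')->item(0)->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }
    return $out;
}

echo addAttribute('<p>See <a href="/x">this</a></p>', 'a', 'style', 'color:red;');
```

setAttribute() overwrites an attribute of the same name if one already exists, which may or may not be what is wanted for style.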

PHP - How to replace a phrase with another?

What is the easiest way in PHP to replace this <p><span class="headline"> with this <p class="headline"><span>?
$data = file_get_contents("http://www.ihr-apotheker.de/cs1.html");
$clean1 = strstr($data, '<p>');
$str = preg_replace('#(<a.*>).*?(</a>)#', '$1$2', $clean1);
$ausgabe = strip_tags($str, '<p>');
echo $ausgabe;
Before I alter the HTML from the site, I want to move the class declaration from the span to the <p> tag.
Don't parse HTML with regex!
This class should provide what you need:
http://simplehtmldom.sourceforge.net/
The reason not to parse HTML with regex is if you can't guarantee the format. If you already know the format of the string, you don't have to worry about having a complete parser.
In your case, if you know that's the format, you can use str_replace
str_replace('<p><span class="headline">', '<p class="headline"><span>', $data);
Well, answer was accepted already, but anyway, here is how to do it with native DOM:
$dom = new DOMDocument;
$dom->loadHTMLFile("http://www.ihr-apotheker.de/cs1.html");
$xPath = new DOMXpath($dom);
// remove links but keep link text
foreach($xPath->query('//a') as $link) {
    $link->parentNode->replaceChild(
        $dom->createTextNode($link->nodeValue), $link);
}
// switch classes
foreach($xPath->query('//p/span[@class="headline"]') as $node) {
    $node->removeAttribute('class');
    $node->parentNode->setAttribute('class', 'headline');
}
echo $dom->saveHTML();
On a side note, HTML has elements for headings, so why not use an <h*> element instead of the semantically superfluous "headline" class?
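For trying the class switch without fetching the remote page, a cut-down sketch that works on a string could look like this. The function name moveHeadlineClass is made up for the example, and it only covers the class move, not the link removal:

```php
<?php
// Hypothetical helper: moves class="headline" from a <span> to its parent <p>.
function moveHeadlineClass(string $html): string
{
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    foreach ($xpath->query('//p/span[@class="headline"]') as $span) {
        $span->removeAttribute('class');
        $span->parentNode->setAttribute('class', 'headline');
    }

    // Serialize only the children of <body>.
    $out = '';
    foreach ($dom->getElementsByTagName('body')->item(0)->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }
    return $out;
}

echo moveHeadlineClass('<p><span class="headline">Title</span></p>');
```

The predicate matches only when the class attribute is exactly "headline"; a span with several classes would need a different test.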
Have you tried using str_replace?
If the placement of the <p> and <span> tags is consistent, you can simply replace one with the other:
str_replace("part to replace", "replacement", $string);
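The argument order of str_replace() is easy to mix up: the search string comes first, then the replacement, then the subject. A minimal sketch with the strings from the question, where the sample markup in $data is invented for illustration:

```php
<?php
// Sample input made up for this example; the real $data would come from the page.
$data = '<p><span class="headline">News</span></p><p>Body text</p>';

// str_replace(search, replace, subject)
$fixed = str_replace('<p><span class="headline">', '<p class="headline"><span>', $data);

echo $fixed;
// prints: <p class="headline"><span>News</span></p><p>Body text</p>
```

This only works while the markup matches character for character; an extra space or a different attribute order leaves the string untouched.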
