Remove all occurrences of <br> before text - php

I'm trying to remove all <br> before my text.
So I have this:
<p>
<br/><br/>When the battle is on between contestants in a talent show, it gets really competitive when down to the last four. X-FactorUSAcontestant Marcus Canty knows this all too well as this is the stage he was voted off of the show earlier this year. <br/><br/>
</p>
I want to get rid of the first two <br/>, and I'd also want to get rid of them if there were more than two.
I would prefer to use XPath as I'm already using it. At the moment I have this:
foreach($xpath->query('//br[not(preceding::text())]') as $node) {
$node->parentNode->removeChild($node);
}
For some reason on this particular page it doesn't seem to be working.
UPDATE
Originally the question was why there were <br> tags at the start of the document when my XPath should be getting rid of them (see below). I applied a regex to see if that worked, which revealed the doctype you see now. I thought the doctype was somehow causing my original problem, but it just wasn't being shown until now. This content was imported from Blogger and I am currently manipulating it to fit a new blog.
link to example page
!DOCTYPE html PUBLIC “-//W3C//DTD HTML 4.0 Transitional//EN” “http://www.w3.org/TR/REC-html40/loose.dtd”><br><br>
Here's my code:
global $post;
$postTime = $post->post_date;
$postTime = strtotime($postTime);
$startDate = "2014/01/16";
if ($postTime < strtotime($startDate)) {
    $html = mb_convert_encoding($content, 'HTML-ENTITIES', "UTF-8");
    $doc = new DOMDocument();
    @$doc->loadHTML($html);
    $xpath = new DOMXPath($doc);
    foreach ($xpath->query('//br[not(preceding::text())]') as $node) {
        $node->parentNode->removeChild($node);
    }
    $nodes = $xpath->query('//a[string-length(.) = 0]');
    foreach ($nodes as $node) {
        $node->parentNode->removeChild($node);
    }
    $nodes = $xpath->query('//*[not(text() or node() or self::br)]');
    foreach ($nodes as $node) {
        $node->parentNode->removeChild($node);
    }
    remove_filter('the_content', 'wpautop');
    $content = $doc->saveHTML();
    $content = ltrim($content, '<br>');
    $content = strip_tags($content, '<br> <a> <iframe>');
    $content = preg_replace(array('/(<br\s*\/?>\s*){1,}/'), array('<br/><br/>'), $content);
    $content = str_replace(' ', ' ', $content);
    $content = "<p>".implode("</p>\n\n<p>", preg_split('/\n(?:\s*\n)+/', $content))."</p>";
    return $content;
}
Help appreciated.

What about ltrim?
$string = ltrim($string, '<br/>');
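One caveat with this suggestion: the second argument of ltrim() is a list of characters to strip, not a substring, so it can also eat the start of the text itself. The sketch below shows the effect and one possible alternative that uses an anchored regex; the function name strip_leading_br is made up for illustration.

```php
<?php
// ltrim() treats its second argument as a set of characters, not a substring,
// so the leading "br" of "break" is stripped along with the tag.
$mangled = ltrim('<br/>break time', '<br/>');

// Hypothetical helper: strip only leading <br>, <br/> or <br /> tags
// (and the whitespace around them), leaving later tags alone.
function strip_leading_br(string $html): string
{
    return preg_replace('~^(?:\s*<br\s*/?>)+\s*~i', '', $html);
}

$clean = strip_leading_br("<br/><br />break time<br/>");

echo $mangled, "\n"; // eak time
echo $clean, "\n";   // break time<br/>
```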

You could try using a regex
s/!DOCTYPE html PUBLIC “-\/\/W3C\/\/DTD HTML 4.0 Transitional\/\/EN” “http:\/\/www.w3.org\/TR\/REC-html40\/loose.dtd”>((<br[^>]*>)+)(.*)/\3/
or in PHP:
$pattern = '/^((<br\b[^>]*>)+)(.*)/is';
$replacement = '$3';
$content = preg_replace($pattern, $replacement, $content);
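Since the question asks for XPath, here is a minimal sketch of that route. One possible reason the original query misses is that whitespace-only text nodes (such as the newline after <p>) count as preceding text, so this version ignores them with normalize-space(). The sample markup is a shortened stand-in for the post content, and the LIBXML_HTML_NOIMPLIED / LIBXML_HTML_NODEFDTD flags assume a reasonably recent libxml.

```php
<?php
// Shortened stand-in for the imported post content.
$html = "<p>\n<br/><br/>When the battle is on.<br/><br/>\n</p>";

$doc = new DOMDocument();
$doc->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$xpath = new DOMXPath($doc);

// Select every <br> that has no non-blank text node before it in the document.
$query = '//br[not(preceding::text()[normalize-space()])]';

// Copy the result into an array first so removal cannot disturb iteration.
foreach (iterator_to_array($xpath->query($query)) as $node) {
    $node->parentNode->removeChild($node);
}

$result = $doc->saveHTML();
echo $result;
```

The trailing <br> tags are left in place because real text precedes them.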

Related

How to create a simple screen scraper in PHP

I am trying to create a simple screen scraper that gets me the price of a specific item. Here is an example of a product I want to get the price from:
https://www.flanco.ro/telefon-mobil-apple-iphone-14-5g-128gb-purple.html
This is the portion of the html code I am interested in:
[Image: screenshot of the relevant HTML]
I want to get the '4699' thing.
Here is what I have been trying to do but it does not seem to work:
$html = file_get_contents("https://www.flanco.ro/telefon-mobil-apple-iphone-14-5g-128gb-purple.html");
$doc = new DomDocument();
$doc->loadHtml($html);
$xpath = new DomXPath($doc);
//Now query the document:
foreach ($xpath->query('/<span class="price">[0-9]*\\.[0-9]+/i') as $node) {
echo $node, "\n";
}
You could just use standard PHP string functions to get the price out of the $html:
$url = "https://www.flanco.ro/telefon-mobil-apple-iphone-14-5g-128gb-purple.html";
$html = file_get_contents($url);
$seek = '<span class="special-price"><span class="price">';
$end = strpos($html, $seek) + strlen($seek);
$price = substr($html, $end, strpos($html, ',', $end) - $end);
Or something similar. This is all the code you need. This code returns:
4.699
My point is: In this particular case you don't need to parse the DOM and use a regular expression to get that single price.
Since there are a few price classes on the page, I would specifically target the pricesPrp class.
Also, in your foreach you are trying to convert a DOMElement object into a string, which won't work.
Update your XPath query as follows:
$query = $xpath->query('//div[@class="pricesPrp"]//span[@class="special-price"]//span[@class="price"]');
If you want to see the different nodes:
echo '<pre>';
foreach ($query as $node) {
var_dump($node);
}
And if you want to get that specific price :
$price = $query->item(0)->nodeValue;
echo $price;
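For reference, here is a self-contained sketch of this approach that runs without fetching the page. The markup and the price strings are hypothetical, modelled only on the class names mentioned above, so the real page may be structured differently.

```php
<?php
// Hypothetical markup using the class names from the answer above.
$html = '<div class="pricesPrp"><span class="special-price">'
      . '<span class="price">4.699,00 lei</span></span></div>'
      . '<div class="other"><span class="price">5.199,00 lei</span></div>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$nodes = $xpath->query(
    '//div[@class="pricesPrp"]//span[@class="special-price"]//span[@class="price"]'
);

// nodeValue is a string, unlike the DOMElement itself, so it can be echoed.
$price = $nodes->length > 0 ? trim($nodes->item(0)->nodeValue) : null;
echo $price, "\n";
```

Checking $nodes->length first avoids calling nodeValue on null when the page layout changes.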
$html = file_get_contents('PASTE_URL');
$doc = new DOMDocument();
@$doc->loadHTML(mb_convert_encoding($html, 'HTML-ENTITIES', "UTF-8"));
$selector = new DOMXPath($doc);
$result = $selector->query('//span[@class="price"]');
foreach($result as $node) {
echo $node->nodeValue;
}

Changing code from regex

I have the following 2 sets of code (Wordpress) using regex, but I was told that it's a bad practice.
I am using it in 2 ways:
To take out the blockquote and images from the post and just display the text.
To essentially do the opposite and display just the images.
I am looking to rewrite it in a more proper, more widely accepted form.
html (display text):
<?php
$content = preg_replace('/<blockquote>(.*?)<\/blockquote>/', '', get_the_content());
$content = preg_replace('/(<img [^>]*>)/', '', $content);
$content = wpautop($content); // Add paragraph-tags
$content = str_replace('<p></p>', '', $content); // remove empty paragraphs
echo $content;
?>
html (display images):
<?php
preg_match_all('/(<img [^>]*>)/', get_the_content(), $images);
for( $i=0; isset($images[1]) && $i < count($images[1]); $i++ ) {
if ($i == end(array_keys($images[1]))) {
echo sprintf('<div id="last-img">%s</div>', $images[1][$i]);
continue;
}
echo $images[1][$i];
}
?>
You can use the answer from here: Strip Tags and everything in between
The point is to use a parser, rather than roll-your-own regex that might be buggy.
$content = get_the_content();
$content = wpautop($content);
$doc = new DOMDocument();
$doc->loadHTML($content, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$xpath = new DOMXPath($doc);
foreach ($xpath->query('//blockquote') as $node) {
$node->parentNode->removeChild($node);
}
foreach ($xpath->query('//img') as $node) {
$node->parentNode->removeChild($node);
}
foreach( $xpath->query('//p[not(node())]') as $node ) {
$node->parentNode->removeChild($node);
}
$content = $doc->saveHTML($doc);
You may find that php DOMDocument has wrapped your html fragment in <html> tags, in which case look at How to saveHTML of DOMDocument without HTML wrapper?
The part that removes empty p tags is from Remove empty tags from a XML with PHP
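The second case in the question (showing only the images, with the last one wrapped in a div) can be handled with the same parser. This is a rough sketch, assuming a sample string in place of get_the_content() and a temporary wrapper <div> so the fragment has a single root.

```php
<?php
// Sample content standing in for get_the_content().
$content = '<p>Intro</p><blockquote>Quote</blockquote>'
         . '<img src="a.jpg"><p>More</p><img src="b.jpg">';

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // keep parser warnings quiet
$doc->loadHTML('<div>' . $content . '</div>', LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
libxml_clear_errors();

$images = $doc->getElementsByTagName('img');
$last   = $images->length - 1;
$out    = '';

for ($i = 0; $i <= $last; $i++) {
    // saveHTML() with a node argument serializes just that node.
    $tag = $doc->saveHTML($images->item($i));
    $out .= ($i === $last)
        ? sprintf('<div id="last-img">%s</div>', $tag)
        : $tag;
}

echo $out;
```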

PHP Preg Replace - Match String with Space - Wordpress

I'm trying to scan my wordpress content for:
<p><span class="embed-youtube">some iframed video</span></p>
and then change it into:
<p class="img_wrap"><span class="embed-youtube">some iframed video</span></p>
using the following code in the functions.php file of my theme:
$classes = 'class="img_wrap"';
$youtube_match = preg_match('/(<p.*?)(.*?><span class="embed-youtube")/', $content, $youtube_array);
if(!empty($youtube_match))
{
$content = preg_replace('/(<p.*?)(.*?><span class=\"embed-youtube\")/', '$1 ' . $classes . '$2', $content);
}
but for some reason I am not getting a match on my regex nor is the replace working. I don't understand why there isn't a match because the span with class embed-youtube exists.
UPDATE - HERE IS THE FULL FUNCTION
function give_attachments_class($content){
$classes = 'class="img_wrap"';
$img_match = preg_match("/(<p.*?)(.*?><img)/", $content, $img_array);
$youtube_match = preg_match('/(<p.*?)(.*?><span class="embed-youtube")/', $content, $youtube_array);
// $doc = new DOMDocument;
// @$doc->loadHTML($content); // load the HTML data
// $xpath = new DOMXPath($doc);
// $nodes = $xpath->query('//p/span[@class="embed-youtube"]');
// foreach ($nodes as $node) {
// $node->parentNode->setAttribute('class', 'img_wrap');
// }
// $content = $doc->saveHTML();
if(!empty($img_match))
{
$content = preg_replace('/(<p.*?)(.*?><img)/', '$1 ' . $classes . '$2', $content);
}
else if(!empty($youtube_match))
{
$content = preg_replace('/(<p.*?)(.*?><span class=\"embed-youtube\")/', '$1 ' . $classes . '$2', $content);
}
$content = preg_replace("/<img(.*?)src=('|\")(.*?).(bmp|gif|jpeg|jpg|png)(|\")(.*?)>/", '<img$1 data-original=$3.$4 $6>' , $content);
return $content;
}
add_filter('the_content','give_attachments_class');
Instead of using regex, make effective use of DOM and XPath to do this for you.
$doc = new DOMDocument;
@$doc->loadHTML($html); // load the HTML data
$xpath = new DOMXPath($doc);
$nodes = $xpath->query('//p/span[@class="embed-youtube"]');
foreach ($nodes as $node) {
$node->parentNode->setAttribute('class', 'img_wrap');
}
echo $doc->saveHTML();
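Inside a the_content filter you would normally return a fragment rather than a whole document. The sketch below is one way to do that, assuming a sample $content string and a temporary wrapper <div> whose children are serialized back into a string; the exact whitespace in the output may vary with the libxml version.

```php
<?php
// Sample post content; in the filter this would be the $content argument.
$content = '<p><span class="embed-youtube">some iframed video</span></p><p>Plain text</p>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // keep parser warnings quiet
$doc->loadHTML('<div>' . $content . '</div>', LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// Select each <p> that directly contains the YouTube span.
foreach ($xpath->query('//p[span[@class="embed-youtube"]]') as $p) {
    $p->setAttribute('class', 'img_wrap');
}

// Serialize only the children of the wrapper <div>.
$out = '';
foreach ($doc->documentElement->childNodes as $child) {
    $out .= $doc->saveHTML($child);
}

echo $out;
```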
Here is a quick and dirty REGEX I did for you. It finds the entire string starting with p tag, ending p tag, span also included etc. I also wrote it to include single or double quotes for you since you never know and also to include spaces in various places. Let me know how it works out for you, thanks.
(<p )+(class=)['"]+img_wrap+['"](><span)+[ ]+(class=)+['"]embed-youtube+['"]>[A-Za-z0-9='" ]+(</span></p>)
I have tested it on your code and a few other variations and it works for me.

Extracting multiple strong tags using PHP Simple HTML DOM Parser

I have over 500 static pages with content structured this way:
<section>
Some text
<strong>Dynamic Title (Different on each page)</strong>
<strong>Author name (Different on each page)</strong>
<strong>Category</strong>
(<b>Content</b> <b>MORE TEXT HERE)</b>
</section>
And I need to extract the data as formatted below, using PHP Simple HTML DOM Parser
$title = <strong>Dynamic Title (Different on each page)</strong>
$author = <strong>Author name (Different on each page)</strong>
$category = <strong>Category</strong>
$content = (<b>Content</b> <b>MORE TEXT HERE</b>)
I have failed so far and can't get my head around it, appreciate any advice or code snippet to help me going on.
EDIT 1,
I have now solved the part with strong tags using,
$html = file_get_html($url);
$links = array();
foreach($html->find('strong') as $a) {
$content[] = $a->innertext;
}
$title= $content[0];
$author= $content[1];
The only remaining issue is how to extract the content within the parentheses. Can it be done with a similar method?
OK, first you want to get all of the <section> tags.
Then you want to search through those again for the <strong> tags and <b> tags.
Something like this:
// Create DOM from URL or file
$html = file_get_html('http://www.example.com/');
// Find all <section> elements
foreach ($html->find('section') as $section) {
    $strong = array();
    // Collect the contents of each <strong> tag inside this <section>
    foreach ($section->find('strong') as $tag) {
        $strong[] = $tag->innertext;
    }
    $title = $strong[0];
    $author = $strong[1];
    $category = $strong[2];
}
To get the parts in parentheses - just get the b tag text and then add the () brackets.
Or if you're asking how to get parts in between the brackets - use explode then remove the closing bracket:
$pieces = explode("(", $title);
$different_on_each_page = str_replace(")","",$pieces[1]);
$html_code = 'html'; // placeholder for your HTML string
$dom = new \DOMDocument();
$dom->loadHTML($html_code);
$xpath = new \DOMXPath($dom);
$nodelist = $xpath->query("//strong");
for ($i = 0; $i < $nodelist->length; $i++) {
    $text = $nodelist->item($i)->nodeValue; // gives you the text inside
}
My final code that works now looks like this.
$html = file_get_html($url);
$links = array();
foreach($html->find('strong') as $a) {
$content[] = $a->innertext;
}
$title= $content[0];
$author= $content[1];
$category = $content[2];
$details = file_get_html($url)->plaintext;
$input = $details;
preg_match_all("/\(.*?\)/", $input, $matches);
print_r($matches[0]);
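The same result can also be sketched with PHP's built-in DOM classes instead of Simple HTML DOM. The example below assumes a sample page string that follows the structure in the question, and it skips the <strong> elements when looking for the parenthesised part, so parentheses inside the title or author do not get picked up by mistake.

```php
<?php
// Sample page following the structure described in the question.
$page = '<section>Some text '
      . '<strong>Dynamic Title (Different on each page)</strong> '
      . '<strong>Author name (Different on each page)</strong> '
      . '<strong>Category</strong> '
      . '(<b>Content</b> <b>MORE TEXT HERE</b>)</section>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // <section> triggers a warning in libxml's HTML4 parser
$doc->loadHTML($page);
libxml_clear_errors();

$xpath   = new DOMXPath($doc);
$section = $xpath->query('//section')->item(0);

$strong = [];
$rest   = '';
foreach ($section->childNodes as $child) {
    if ($child->nodeName === 'strong') {
        $strong[] = trim($child->textContent);
    } else {
        $rest .= $child->textContent;   // text and <b> content outside <strong>
    }
}

list($title, $author, $category) = $strong;

// First parenthesised group in the remaining text.
$details = preg_match('/\((.*?)\)/s', $rest, $m) ? trim($m[1]) : null;

echo $title, "\n", $author, "\n", $category, "\n", $details, "\n";
```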

Modify html attribute with php

I have a html string that contains exactly one a-element in it. Example:
test
In php I have to test if rel contains external and if yes, then modify href and save the string.
I have looked for DOM nodes and objects. But they seem to be too much for only one A-element, as I have to iterate to get html nodes and I am not sure how to test if rel exists and contains external.
$html = new DOMDocument();
$html->loadHtml($txt);
$a = $html->getElementsByTagName('a');
$attr = $a->item(0)->attributes();
...
At this point I get a DOMNamedNodeMap, which seems like overhead. Is there any simpler way to do this, or should I do it with DOM?
Is there any simpler way to do this, or should I do it with DOM?
Do it with DOM.
Here's an example:
<?php
$html = 'test';
$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$nodes = $xpath->query("//a[contains(concat(' ', normalize-space(@rel), ' '), ' external ')]");
foreach($nodes as $node) {
$node->setAttribute('href', 'http://example.org');
}
echo $dom->saveHTML();
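For a single <a> element the DOM route can stay quite short. Here is a minimal sketch, assuming a hypothetical input string (the original example markup is not shown above) and placeholder URLs; it splits rel on whitespace so that only the whole token "external" matches.

```php
<?php
// Hypothetical input: one <a> element whose rel contains "external".
$txt = '<a href="http://old.example.com/page" rel="nofollow external">test</a>';

$dom = new DOMDocument();
// These flags keep libxml from adding <html>/<body> wrappers and a doctype.
$dom->loadHTML($txt, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

foreach ($dom->getElementsByTagName('a') as $a) {
    // getAttribute() returns '' when rel is missing, so no existence check is needed.
    $rel = preg_split('/\s+/', trim($a->getAttribute('rel')), -1, PREG_SPLIT_NO_EMPTY);
    if (in_array('external', $rel, true)) {
        $a->setAttribute('href', 'http://example.org');
    }
}

$txt = trim($dom->saveHTML());
echo $txt, "\n";
```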
I went ahead and did the modification with DOM. This is what I ended up with:
$html = new DOMDocument();
$html->loadHtml('<?xml encoding="utf-8" ?>' . $txt);
$nodes = $html->getElementsByTagName('a');
foreach ($nodes as $node) {
foreach ($node->attributes as $att) {
if ($att->name == 'rel') {
if (strpos($att->value, 'external') !== false) {
$node->setAttribute('href','modified_url_goes_here');
}
}
}
}
$txt = $html->saveHTML();
I did not want to load any other library for just this one string.
The best way is to use a HTML parser/DOM, but here's a regex solution:
$html = 'test<br>
<p> Some text</p>
test2<br>
<a rel="external">test3</a> <-- This won\'t work since there is no href in it.
';
$new = preg_replace_callback('/<a.+?rel\s*=\s*"([^"]*)"[^>]*>/i', function($m){
if(strpos($m[1], 'external') !== false){
$m[0] = preg_replace('/href\s*=\s*(("[^"]*")|(\'[^\']*\'))/i', 'href="http://example.com"', $m[0]);
}
return $m[0];
}, $html);
echo $new;
You could use a regular expression. If the tag matches
/\s+rel\s*=\s*".*external.*"/
then do a regex replace like
/(<a.*href\s*=\s*")([^"]*)("[^>]*>)/\1[your new href here]\3/
Though using a library that can do this kind of thing for you is much easier (like jQuery for JavaScript).
