WordPress: Check how many elements are in a post - php

Is there any way to check how many elements like headlines (h2,h3,...), paragraphs and links are in a WordPress post?
At the moment I'm using this code:
<?php
$content = get_the_content();
$count_h2s = explode('<h2>', $content);
$h2 = 0;
foreach ($count_h2s as $count_h2) {
$h2++;
}
echo $h2;
?>
It seems to work for the headlines. But if I use it to count <p> tags, I only get a count of 1, even if there are more. I could imagine this is because these tags are not in the editor, but headlines are?!
And maybe there is a more elegant way to count the elements than my code ;)

A loop is not necessary; use the PHP function substr_count():
$query = get_post(get_the_ID());
$content = apply_filters('the_content', $query->post_content);
$p_count = substr_count($content, '<p>');
echo $p_count ;
// Be aware, if there is a more-tag inside the post, this `<p>`-tag wouldn't count!
Should be easy to use it for other tags, such as ...
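One caveat: substr_count($content, '<p>') only finds tags written exactly as <p>, so a tag with attributes such as <p class="intro"> is not counted. A minimal sketch that counts by tag name with PHP's DOM extension instead. countElements is a hypothetical helper name and the sample HTML is made up; in a template you would pass in the filtered content from the answer above.

```php
<?php
// Hypothetical helper: count how often each tag name occurs in an HTML fragment.
function countElements(string $html, array $tagNames): array
{
    $counts = array_fill_keys($tagNames, 0);
    if (trim($html) === '') {
        return $counts; // nothing to parse
    }

    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // keep parser warnings quiet
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($tagNames as $tag) {
        $counts[$tag] = $dom->getElementsByTagName($tag)->length;
    }
    return $counts;
}

// Made-up sample; the second paragraph carries an attribute.
$sample = '<h2>Title</h2><p>One <a href="#">link</a></p><p class="intro">Two</p><h3>Sub</h3>';
print_r(countElements($sample, ['h2', 'h3', 'p', 'a']));
```

For this sample the DOM count finds both paragraphs, while substr_count($sample, '<p>') would only find the one without attributes.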

It looks like you're getting thrown off by the filter WordPress uses to automatically convert line breaks into <p> tags. The <p> tags aren't in your editor because they are added by that filter after the fact.
https://codex.wordpress.org/Function_Reference/wpautop
So, while you're seeing the <p> tags in the HTML source of your page, you want to search your get_the_content() for line breaks, instead, as these are what's being converted to <p> tags.
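A rough sketch of that idea, assuming paragraphs in the raw content are separated by blank lines. countRawBlocks is a hypothetical name, and the result is only an estimate: anything else that stands between blank lines (a heading, for instance) is counted as a block too.

```php
<?php
// Hypothetical helper: estimate how many blank-line-separated blocks the raw content has.
function countRawBlocks(string $rawContent): int
{
    $rawContent = trim($rawContent);
    if ($rawContent === '') {
        return 0;
    }
    // A block boundary is a line break, optional whitespace, and another line break.
    $blocks = preg_split('/\R\s*\R/', $rawContent, -1, PREG_SPLIT_NO_EMPTY);
    return count($blocks);
}

// Made-up raw content: a single line break stays inside the same block.
$raw = "First paragraph.\n\nSecond paragraph,\nstill the second.\r\n\r\nThird paragraph.";
echo countRawBlocks($raw), "\n";
```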

Related

How to remove multiple HTML tags containing certain text strings from a Wordpress posts using the DOMDocument with get_the_content()

Ok, so what I have is a WordPress site with a lot of posts containing many paragraphs I don't need. Using a SQL query to remove these from the database would be ideal, however I doubt it can be done, so I'm focusing on getting the post content with get_the_content() and filtering out what I don't need using DOMDocument. To make things more complicated, these elements cannot be identified by ids or classes.
Example: I have a WP article/post containing this sentence:
<p>How is everything going today?</p>
I want to search for "How is" (case insensitive, if possible) and remove the entire P element.
This is what I have got so far:
<?php
// the_content();
error_reporting(0);
$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHTML(strip_tags(mb_convert_encoding(get_the_content(), 'HTML-ENTITIES', 'UTF-8'), '<a>,<iframe>,<figure>,<figcaption>,<video>,<img>,<p>,<br>,<div>,<table>,<thead>,<tbody>,<tfoot>,<tr>,<th>,<td>,<ul>,<ol>,<li>,<h2>,<h3>,<h4>,<h5>,<h6>'), LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD | LIBXML_NOWARNING | LIBXML_NONET | LIBXML_NOERROR);
libxml_clear_errors();
$xpath = new DOMXPath($dom);
//this should keep the BR tags
foreach ($xpath->query('//br') as $br) {
$br->parentNode->replaceChild($dom->createTextNode("\n"), $br);
}
//this should remove empty html tags
while (($node_list = $xpath->query('//*[not(*) and not(@*) and not(text()[normalize-space()])]')) && $node_list->length) {
foreach ($node_list as $node) {
$node->parentNode->removeChild($node);
}
}
//this should remove the &nbsp;-only tags
$nbsp = $xpath->query("//*[text()=\" \xC2\xA0\" or text()=\"\xC2\xA0\"]"); # this should remove only elements such as <p>&nbsp;</p> or <p> &nbsp;</p>
foreach($nbsp as $e) {
$e->parentNode->removeChild($e);
}
//this should remove tags containing certain text strings - THIS IS THE ISSUE
foreach ($xpath->query("//*[text()[contains(.,'This is a') //I want this to remove <p>This is a sentence I do not want.</p>
or contains(.,'This is another') //I want this to remove <span>This is another sentence I do not want.</span>
or contains(.,'height:') //I want this to remove <span>height:50px;width:40px</span>
or contains(.,'înapoi') //I want this to remove <i>Înainte, nu inapoi</i>
]]") as $attr) {
$attr->parentNode->removeChild($attr);
}
echo wpautop($dom->saveHTML(), true);
?>
This is working, but not always, and I cannot understand why. Sometimes all the text from the post is removed. Other times, even if the sentence is in a p paragraph at the end of the post, it gets removed along with the 2-3 paragraphs before it. It seems to happen randomly: for some posts it works, for others it doesn't.
I should mention that there are almost 150 sentences/strings that I need removed, so I have almost 150 "or contains" lines. Maybe that is too many and the site cannot handle them?!
So, is there anything wrong with my code, or do you have a better idea about how to remove elements (p, div, span, basically any) containing certain text strings?
If it matters, I'm on an Ubuntu 20.04 laptop, running WordPress on nginx and PHP 7.4.
EDIT:
I will try again to give a more precise answer.
You want to
1. keep the br tags
2. remove empty tags
3. remove &nbsp;-tags
4. replace some strings
Point 3: There are no &nbsp;-tags. You only see them in the WordPress editor, not in the frontend. In the database, I think there is always a line break between two paragraphs, then &nbsp; (a space), and again a line break. WordPress then turns this into P-tags in the output. However, this can also be turned off completely (see the end of the code).
Points 1, 2 and 4:
functions.php
// 1. br tags will stay
// 4. replace strings
function replaceTextinTheContent($the_content){
$replace = array(
// 'WORD TO REPLACE' => 'REPLACE WORD WITH THIS'
'<span>This is another sentence I do not want.</span>' => '',
'<i>Înainte, nu inapoi</i>' => 'bar',
'<span>height:50px;width:40px</span>' => '',
'This is a sentence I do not want.' => '', // P-tags are not saved in the Database, so we could only search the text
);
$the_content = str_replace(array_keys($replace), $replace, $the_content);
return $the_content;
}
add_filter('the_content', 'replaceTextinTheContent',998); //998, to execute the script as second to last
// 2. remove empty tags
function removeEmptyParagraphs($content) {
    // $pattern = '/<p[^>]*><\/p[^>]*>/'; // use this pattern to delete only empty P-tags
    // Any opening tag, optional whitespace, then any closing tag
    $pattern = '/<[^\/>]*>\s*<\/[^>]*>/';
    $content = preg_replace($pattern, '', $content);
    return $content;
}
add_filter('the_content', 'removeEmptyParagraphs', 999); //999, to execute the script as last
// bonus: don't load P-tags
function disableWpAutoP( $content ) {
remove_filter( 'the_content', 'wpautop' );
remove_filter( 'the_excerpt', 'wpautop' );
return $content;
}
add_filter( 'the_content', 'disableWpAutoP', 0 );
I haven't tested this, but I think it should work. :D
Screenshot of the post_content in the wp_posts table:
OLD:
I think you should try this in your functions.php.
I don't get the point of your method. Maybe I'm missing what your goal is.
function replaceTextinTheContent($the_content){
$replace = array(
// 'WORD TO REPLACE' => 'REPLACE WORD WITH THIS'
'foo' => 'bar',
);
$the_content = str_replace(array_keys($replace), $replace, $the_content);
return $the_content;
}
add_filter('the_content', 'replaceTextinTheContent',99);
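If the goal is to drop the whole element around a matching string, case-insensitively, rather than replace exact strings, a DOM-based sketch could look like the one below. Assumptions: the helper name removeElementsContaining is made up, the mbstring extension is available, and the HTML is a fragment with balanced div tags. Instead of one long XPath expression with 150 contains() clauses, it walks the text nodes and compares them in PHP.

```php
<?php
// Hypothetical helper: remove every element that directly contains one of the needles.
function removeElementsContaining(string $html, array $needles): string
{
    if (trim($html) === '') {
        return $html;
    }

    // Turn non-ASCII characters into numeric entities so libxml need not guess the charset.
    $encoded = mb_encode_numericentity($html, [0x80, 0x10FFFF, 0, 0x1FFFFF], 'UTF-8');

    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML('<div>' . $encoded . '</div>', LIBXML_NONET); // wrapper div = known root
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $wrapper = $xpath->query('/html/body/div')->item(0);

    // Collect first, remove afterwards, so the node list is not modified while iterating.
    $doomed = [];
    foreach ($xpath->query('.//text()', $wrapper) as $text) {
        foreach ($needles as $needle) {
            if (mb_stripos($text->nodeValue, $needle, 0, 'UTF-8') !== false) {
                if ($text->parentNode !== $wrapper) { // bare top-level text is left alone
                    $doomed[] = $text->parentNode;
                }
                break;
            }
        }
    }
    foreach ($doomed as $element) {
        if ($element->parentNode !== null) {
            $element->parentNode->removeChild($element);
        }
    }

    // Serialize only the children of the wrapper, not the html/body scaffolding.
    $out = '';
    foreach ($wrapper->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }
    return $out;
}

$post = '<p>Keep me.</p><p>How is everything going today?</p><p>Also <span>HOW IS it</span> here</p>';
echo removeElementsContaining($post, ['how is']);
```

Only the element that directly holds the matching text is removed, so a match inside a span takes out the span but not the paragraph around it. It could be hooked into the_content with add_filter() in the same way as the functions above.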

How to Split the_content() and store into array in wordpress where <!--nextpage-->

I want to split the output of the_content() in WordPress at <!--nextpage--> and store the parts in an array. I am trying to achieve this without using a plugin.
Rather than use the_content() (that would print the content directly on the page), you can use get_the_content() and then split it from there:
$parts = explode('<!--nextpage-->', get_the_content());
Just use the function get_the_content() (the_content() echoes the content, the get_ functions do not) and explode it like this:
$content = get_the_content();
list($page, $nextpage) = explode('<!--nextpage-->', $content);
// or when $content has more than one "<!--nextpage-->"
$pages = explode('<!--nextpage-->', $content);
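A small sketch that also trims the whitespace around each marker and drops empty pages. splitIntoPages is a hypothetical name and the sample string is made up. As far as I know, WordPress already splits a paginated post internally and get_the_content() may return only the current page, so the raw post_content (for example from get_post()) is probably the safer input; treat that as an assumption to verify.

```php
<?php
// Hypothetical helper: split raw post content into pages at the <!--nextpage--> marker.
function splitIntoPages(string $content): array
{
    // Swallow the whitespace around the marker so pages come back trimmed.
    $pages = preg_split('/\s*<!--nextpage-->\s*/', $content);
    // Drop empty pages and reindex from 0.
    return array_values(array_filter($pages, fn($page) => $page !== ''));
}

// Made-up sample; in WordPress this might be get_post()->post_content (assumption).
$content = "Page one.\n<!--nextpage-->\nPage two.\n<!--nextpage-->\nPage three.";
print_r(splitIntoPages($content));
```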

Additional elements to URLS?

I'm not sure what the terminology is, but basically I have a site that uses the "tag-it" system, currently you can click on the tags and it takes the user to
topics.php?tags=example
My question is what sort of scripting or coding would be required to be able to add additional links?
topics.php?tags=example&tags=example2
or
topics.php?tags=example+example2
Here is the code in how my site is linked to tags.
header("Location: topics.php?tags={$t}");
or
<?php echo strtolower($fetch_name->tags);?>
Thanks for any hints or tips.
You cannot really pass tags twice as a GET parameter, although you can pass it as an array:
topics.php?tags[]=example&tags[]=example2
Assuming this is what you want, try:
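$string = "topics.php?";
foreach ($tags as $t) {
    // Same parameter name as in the URL above; encode each value
    $string .= "tags[]=" . urlencode($t) . "&";
}
// Remove the trailing "&"
$string = substr($string, 0, -1);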
We iterate through the array concatenating value to our $string. The last line removes an extra & symbol that will appear after the last iteration
There is also another option that looks a bit more dirty but might be better depending on your needs
$string = "topics.php?tags[]=" . implode("&tags[]=", $tags);
Note: just make sure the tags array is not empty.
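As an alternative sketch, PHP's built-in http_build_query() can build the array-style query string, including the encoding, and it simply returns an empty string for an empty array. buildTopicsUrl is a hypothetical helper name.

```php
<?php
// Hypothetical helper: build the topics.php URL for a list of tags.
function buildTopicsUrl(array $tags): string
{
    // Produces tags%5B0%5D=...&tags%5B1%5D=... (the encoded form of tags[0]=...&tags[1]=...)
    $query = http_build_query(['tags' => array_values($tags)]);
    return 'topics.php' . ($query !== '' ? '?' . $query : '');
}

$url = buildTopicsUrl(['example', 'example2']);
echo $url, "\n";

// Simulate the receiving side: PHP turns the bracketed keys back into an array.
parse_str((string) parse_url($url, PHP_URL_QUERY), $received);
print_r($received['tags']);
```

On topics.php itself, $_GET['tags'] would then already be an array.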
topics.php?tags=example&tags=example2
will break in the back end: PHP keeps only the last value of a repeated parameter.
You have to put the data into one variable:
topics.php?tags=example+example2
This looks good. You can access it in the back end and explode it by the + sign:
// topics.php
<?php
...
// PHP decodes "+" to a space in $_GET; urlencode() turns the spaces back into "+"
$tags = urlencode($_GET['tags'] ?? '');
$tags_arr = explode('+', $tags); // array of all tags
$current_tags = ""; // make this accessible in the view
if ($tags) {
    $current_tags = $tags . "+";
}
// show your data
?>
Edit:
You can create the front-end tags like this:
<a href="topics.php?tags=<?php echo $current_tags ;?>horror">
horror
</a>
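A sketch of the same "+"-separated idea without the urlencode() round trip. parseTagParam and tagLink are hypothetical names; the only thing assumed about PHP is that a "+" in the query string arrives in $_GET as a space.

```php
<?php
// Hypothetical helper: turn the raw parameter value into a list of tags.
function parseTagParam(string $raw): array
{
    // By the time the value is in $_GET, "+" has already been decoded to a space.
    return preg_split('/\s+/', trim($raw), -1, PREG_SPLIT_NO_EMPTY);
}

// Hypothetical helper: link that adds one more tag to the current selection.
function tagLink(array $currentTags, string $newTag): string
{
    $tags = array_unique(array_merge($currentTags, [strtolower($newTag)]));
    return 'topics.php?tags=' . implode('+', array_map('rawurlencode', $tags));
}

// Simulate what $_GET holds for topics.php?tags=example+example2
parse_str('tags=example+example2', $get);
$current = parseTagParam($get['tags'] ?? '');
echo tagLink($current, 'horror'), "\n";
```

When the link is written into an href attribute, it should still go through htmlspecialchars().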

Inserting multiple links into text, ignoring matches that happen to be inserted

The site I'm working on has a database table filled with glossary terms. I am building a function that will take some HTML and replace the first instances of the glossary terms with tooltip links.
I am running into a problem though. Since it's not just one replace, the function is replacing text that has been inserted in previous iterations, so the HTML is getting mucked up.
I guess the bottom line is, I need to ignore text if it:
Appears within the < and > of any HTML tag, or
Appears within the text of an <a></a> tag.
Here's what I have so far. I was hoping someone out there would have a clever solution.
function insertGlossaryLinks($html)
{
// Get glossary terms from database, once per request
static $terms;
if (is_null($terms)) {
$query = Doctrine_Query::create()
->select('gt.title, gt.alternate_spellings, gt.description')
->from('GlossaryTerm gt');
$glossaryTerms = $query->rows();
// Create whole list in $terms, including alternate spellings
$terms = array();
foreach ($glossaryTerms as $glossaryTerm) {
// Initialize with title
$term = array(
'wordsHtml' => array(
h(trim($glossaryTerm['title']))
),
'descriptionHtml' => h($glossaryTerm['description'])
);
// Add alternate spellings
foreach (explode(',', $glossaryTerm['alternate_spellings']) as $alternateSpelling) {
$alternateSpelling = h(trim($alternateSpelling));
if (empty($alternateSpelling)) {
continue;
}
$term['wordsHtml'][] = $alternateSpelling;
}
$terms[] = $term;
}
}
// Do replacements on this HTML
$newHtml = $html;
foreach ($terms as $term) {
$callback = create_function('$m', 'return \'<span>\'.$m[0].\'</span>\';');
$term['wordsHtmlPreg'] = array_map('preg_quote', $term['wordsHtml']);
$pattern = '/\b('.implode('|', $term['wordsHtmlPreg']).')\b/i';
$newHtml = preg_replace_callback($pattern, $callback, $newHtml, 1);
}
return $newHtml;
}
Using Regexes to process HTML is always risky business. You will spend a long time fiddling with the greediness and laziness of your Regexes to only capture text that is not in a tag, and not in a tag name itself. My recommendation would be to ditch the method you are currently using and parse your HTML with an HTML parser, like this one: http://simplehtmldom.sourceforge.net/. I have used it before and have recommended it to others. It is a much simpler way of dealing with complex HTML.
I ended up using preg_replace_callback to replace all existing links with placeholders. Then I inserted the new glossary term links. Then I put back the links that I had replaced.
It's working great!
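I don't have the original placeholder code to hand, so here is a closely related sketch rather than that exact solution: instead of swapping links for placeholders, it splits the HTML into protected pieces (tags and existing links) and plain text, and only touches the plain text. All terms are matched in a single pass, so text inserted for one term is never rescanned for another. The function name, the glossary CSS class, and the term => description array are assumptions; terms are assumed to be plain ASCII words.

```php
<?php
// Hypothetical helper: link the first occurrence of each glossary term outside tags and links.
function insertGlossaryLinksSketch(string $html, array $terms): string
{
    if ($terms === []) {
        return $html;
    }

    $lookup = array_change_key_case($terms, CASE_LOWER);
    $words = array_map('strval', array_keys($lookup));
    // Longest term first, so "cat food" wins over "cat".
    usort($words, fn($a, $b) => strlen($b) <=> strlen($a));
    $quoted = array_map(fn($w) => preg_quote($w, '/'), $words);
    $pattern = '/\b(?:' . implode('|', $quoted) . ')\b/i';

    // Odd indexes hold the protected pieces (whole <a>...</a> links or single tags),
    // even indexes hold the plain text in between.
    $parts = preg_split('~(<a\b[^>]*>.*?</a>|<[^>]+>)~is', $html, -1, PREG_SPLIT_DELIM_CAPTURE);

    $used = [];
    foreach ($parts as $i => $part) {
        if ($i % 2 === 1) {
            continue; // never touch tags or existing links
        }
        $parts[$i] = preg_replace_callback($pattern, function ($m) use ($lookup, &$used) {
            $key = strtolower($m[0]);
            if (isset($used[$key]) || !isset($lookup[$key])) {
                return $m[0]; // only the first occurrence of each term gets a link
            }
            $used[$key] = true;
            $title = htmlspecialchars($lookup[$key], ENT_QUOTES);
            return '<a class="glossary" title="' . $title . '">' . $m[0] . '</a>';
        }, $part);
    }
    return implode('', $parts);
}

// Made-up sample: "cat" also appears in an attribute, in a link, and in a description.
$html = '<p title="cat">A cat and a <a href="/x">cat</a>; another Cat. Dogs? dog.</p>';
echo insertGlossaryLinksSketch($html, ['cat' => 'A feline', 'dog' => 'Barks at "cat"']);
```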

PHP - Strings - Remove a HTML tag with a specific class, including its contents

I have a string like this:
<div class="container">
<h3 class="hdr"> Text </h3>
<div class="main">
text
<h3> text... </h3>
....
</div>
</div>
how do I remove the H3 tag with the .hdr class using as little code as possible ?
Using as little code as possible? Shortest code isn't necessarily best. However, if your HTML h3 tag always looks like that, this should suffice:
$html = preg_replace('#<h3 class="hdr">(.*?)</h3>#', '', $html);
Generally speaking, using regex for parsing HTML isn't a particularly good idea though.
Something like this is what you're looking for...
$output = preg_replace("#<h3 class=\"hdr\">(.*?)</h3>#is", "", $input);
The "is" at the end of the regex sets two modifiers: "i" makes the match case-insensitive, and "s" lets the dot also match line breaks, so the pattern still works when the h3 spans several lines.
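A quick way to see what the "s" modifier changes, using a made-up input in which the h3 spans several lines:

```php
<?php
// Made-up input: the h3 with class "hdr" is spread over three lines.
$input = "<div class=\"container\">\n<h3 class=\"hdr\">\n  Text\n</h3>\n<div class=\"main\">text</div>\n</div>";

$withoutS = preg_replace('#<h3 class="hdr">(.*?)</h3>#', '', $input);
$withS    = preg_replace('#<h3 class="hdr">(.*?)</h3>#is', '', $input);

// Without "s" the dot stops at the first line break, so nothing is removed.
var_dump($withoutS === $input);
// With "s" the whole element is gone.
var_dump(strpos($withS, '<h3') === false);
```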
Stumbled upon this via Google - for anyone else feeling dirty using regex to parse HTML, here's a DOMDocument solution I feel much safer with going:
function removeTagByClass(string $html, string $className) {
    $dom = new \DOMDocument();
    $dom->loadHTML($html);
    $finder = new \DOMXPath($dom);
    // Match the class as a whole word within the class attribute
    $nodes = $finder->query("//*[contains(concat(' ', normalize-space(@class), ' '), ' {$className} ')]");
    foreach ($nodes as $node) {
        $node->parentNode->removeChild($node);
    }
    // Note: returns a full document, including doctype, html and body tags
    return $dom->saveHTML();
}
Thanks to this other answer for the XPath query.
try a preg_match, then a preg_replace on the following pattern:
/(<h3
[\s]+
[^>]*?
class=[\"\'][^\"\']*?hdr[^\"\']*?[\"\']
[^>]*?>
[\s\S\d\D\w\W]*?
<\/h3>)/i
It's messy, and it should work fine only if the h3 tag doesn't have inline javascript which might contain sequences that this regular expression will react to. It is far from perfect, but in simple cases where h3 tag is used it should work.
Haven't tried it though, might need adjustments.
Another way would be to copy that function, use your copy, without the h3, if it's possible.
This may help someone if the solutions above don't work. It removes an iframe and content that has '-webkit-overflow-scrolling: touch;', like I had :)
A regular expression (regex) describes the text you would like to remove, and the PHP function preg_replace() replaces every match with something else. In the examples below, $incoming_data is where you put all your content before removing elements, and $result is the final product. Basically, we are telling the code to find all divs with class="myclass" and replace them with ' ' (a single space).
How to remove a div and its contents by class in PHP
Just change “myclass” to whatever class your div has.
$result = preg_replace('#<div class="myclass">(.*?)</div>#', ' ',
$incoming_data);
How to remove a div and its contents by ID in PHP
Just change “myid” to whatever ID your div has.
$result = preg_replace('#<div id="myid">(.*?)</div>#', ' ', $incoming_data);
What if your div has multiple classes?
Match only the start of the opening tag. Just change "myid" to whatever ID your div has, like this:
$result = preg_replace('#<div id="myid(.*?)</div>#', ' ', $incoming_data);
Or, if the div doesn't have an ID, filter on the first class of the div like this:
$result = preg_replace('#<div class="myclass(.*?)</div>#', ' ', $incoming_data);
How to remove all headings in PHP
This is how to remove h1 headings; change the tag name for other heading levels.
$result = preg_replace('#<h1>(.*?)</h1>#', ' ', $incoming_data);
And if the heading has a class, do something like this:
$result = preg_replace('#<h1 class="myclass">(.*?)</h1>#', ' ', $incoming_data);
Source: http://www.lets-develop.com/html5-html-css-css3-php-wordpress-jquery-javascript-photoshop-illustrator-flash-tutorial/php-programming/remove-div-by-class-php-remove-div-contents/
$content = preg_replace('~(.*?)~', '', $content);
The code above only works if the opening and closing div tags are on the same line. What if they aren't?
$content = preg_replace('~[^|]*?~', '', $content);
This works even if there is a line break in between, but fails if the rarely used | symbol appears in between. Does anyone know a better way?
