Getting the number of words on a webpage issue php - php

I am trying to get the number of words that a domain web page has, but I get way larger numbers than expected. For example on Google.com with my function I get 180 words, and counted manually there are about 30. I noticed that it also includes the words from style tags and javascript tags, that is a bit odd. I also checked this http://www.seoreviewtools.com/bulk-web-page-word-count-checker/ and it's counting only 6.
Where am I mistaking?
function get_page_stats($domain) {
$str = file_get_contents($domain);
$str = strip_tags(strtolower($str));
$words = str_word_count($str, 1);
$words = array_count_values($words); // added as per Avinash Babu answer
var_dump($words);
}
get_page_stats('http://google.com');

You can use array_count_values() for this.
A simple example
<?php
$str = '<h1>Hello</h1> this will show word count of all word used this time... hello!';
print_r(array_count_values(str_word_count(strip_tags(strtolower($str)), 1)));

I managed to filter very well by removing the style tags and the script tags from the entire web page.
function get_page_stats($domain) {
$str = file_get_contents($domain);
$str = preg_replace('/<style\\b[^>]*>(.*?)<\\/style>/s', '', $str);
// remove everything between the style tags
$str = preg_replace('/<script\\b[^>]*>(.*?)<\\/script>/s', '', $str);
// remove everything between the script tags
$str = strip_tags(strtolower($str));
// remove html tags
$words = str_word_count($str, 1);
$words = array_count_values($words);
// count the words
var_dump($words);
}

Related

Php find and replace words in a text from database tables

I'm working on a project. It searches and replaces words from the database in a given text. In the database a have 10k words and replacements. So ı want to search for each word and replace this word's replacement word.
By the way, I'm using laravel. I need only replacement ideas.
I have tried some ways but it replaces only one word.
My database table structure like below;
id word replacement
1 test testing
etc
The text is coming from the input and after the replacement, I wanna show which words are replaced in a different bg color in the result page.
I tried below codes working fine but it only replaces one word.
$article = trim(strip_tags($request->article));
$clean = preg_split('/[\s]+/', $article);
$word_count = count($clean);
$words_from_database_for_search = Words::all();
foreach($words_from_database_for_search as $word){
$content = str_replace($word['word'],
"<span class=\"badge badge-success\">$word[replacement]
</span>",
$article);
}
$new_content = $content ;
$new_content_clean = preg_split('/[\s]+/', $new_content);
$new_content_word_count= count($new_content_clean);
Edit,
Im using preg_replace instead of str_replace. I get it worked but this time i wanna show how many words changed so i tried to find the number of changed words from the text after replacement. It counts wrong.
Example if there is 6 changes it show 3 or 4
It can be done via preg_replace_callback but i didnt use it before so i dont know how to figure out;
My working codes are below;
$old_article = trim(strip_tags($request->article));
$old_article_word_count = count($old_article );
$words_from_database_array= Words::all();
$article_will_replace = trim(strip_tags($request->article));
$count_the_replaced_words = 0;
foreach($words_from_database_array as $word){
$article_will_replace = preg_replace('/[^a-zA-
ZğüşıöçĞÜŞİÖÇ]\b'.$word['word'].'\b\s/u',
" <b>".$word['spin']."</b> ",
$article_will_replace );
$count_the_replaced_words = preg_match_all('/[^a-zA-
ZğüşıöçĞÜŞİÖÇ]\b'.strip_tags($word['spin']).'\b\s/u',$article_will_replace
);
if($count_the_replaced_words ){
$count_the_replaced_words ++;
}
}
As others have suggested in comments, it seems your value of $content is being overwritten on each run of the foreach loop, with older iterations being ignored. That's because the third argument of your str_replace is $article, the original, unmodified text. Only the last word, therefore, is showing up on the result.
The simplest way to fix this would be to declare $content before the foreach loop, and then make $content the third argument of the foreach loop so that it is continually replaced with a new word on each iteration, like so:
$content = $article;
foreach($words_from_database_for_search as $word){
$content = str_replace($word['word'],
"<span class=\"badge badge-success\">$word[replacement]</span>",
$content);
}
Im confused, don't you need the <?php ... ?> around the $word[replacement] ?
foreach($words_from_database_for_search as $word){
$content = str_replace($word['word'],
"<span class=\"badge badge-success\"><?PHP $word[replacement] ?>
</span>",
$article);
}
and then move this in to the for-loop and add a dot before the equal-sign:
$new_content .= $content ;

Word counter: Doesn't seem to give the output I need (PHP)

here's the line of code that I came up with:
function Count($text)
{
$WordCount = str_word_count($text);
$TextToArray = explode(" ", $text);
$TextToArray2 = explode(" ", $text);
for($i=0; $i<$WordCount; $i++)
{
$count = substr_count($TextToArray2[$i], $text);
}
echo "Number of {$TextToArray2[$i]} is {$count}";
}
So, what's gonna happen here is that, the user will be entering a text, sentence or paragraph. By using substr_count, I would like to know the number of occurrences of the word inside the array. Unfortunately, the output the is not what I really need. Any suggestions?
I assume that you want an array with the word frequencies.
First off, convert the string to lowercase and remove all punctuation from the text. This way you won't get entries for "But", "but", and "but," but rather just "but" with 3 or more uses.
Second, use str_word_count with a second argument of 2 as Mark Baker says to get a list of words in the text. This will probably be more efficient than my suggestion of preg_split.
Then walk the array and increment the value of the word by one.
foreach($words as $word)
$output[$word] = isset($output[$word]) ? $output[$word] + 1 : 1;
If I had understood your question correctly this should also solve your problem
function Count($text) {
$TextToArray = explode(" ", $text); // get all space separated words
foreach($TextToArray as $needle) {
$count = substr_count($text, $needle); // Get count of a word in the whole text
echo "$needle has occured $count times in the text";
}
}
$WordCounts = array_count_values(str_word_count(strtolower($text),2));
var_dump($WordCounts);

echo partial text

I want to display just two lines of the paragraph.
How do I do this ?
<p><?php if($display){ echo $crow->content;} ?></p>
Depending on the textual content you are referring to, you might be able to get away with this :
// `nl2br` is a function that converts new lines into the '<br/>' element.
$newContent = nl2br($crow->content);
// `explode` will then split the content at each appearance of '<br/>'.
$splitContent = explode("<br/>",$newContent);
// Here we simply extract the first and second items in our array.
$firstLine = $splitContent[0];
$secondLine = $splitContent[1];
NOTE - This will destroy all the line breaks you have in your text! You'll have to insert them again if you still want to preserve the text in its original formatting.
If you mean sentences you are able to do this by exploding the paragraph and selecting the first two parts of the array:
$array = explode('.', $paragraph);
$2lines = $array[0].$array[1];
Otherwise you will have to count the number of characters across two lines and use a substr() function. For example if the length of two lines is 100 characters you would do:
$2lines = substr($paragraph, 0, 200);
However due to the fact that not all font characters are the same width it may be difficult to do this accurately. I would suggest taking the widest character, such as a 'W' and echo as many of these in one line. Then count the maximum number of the largest character that can be displayed across two lines. From this you will have the optimum number. Although this will not give you a compact two lines, it will ensure that it can not go over two lines.
This is could, however, cause a word to be cut in two. To solve this we are able to use the explode function to find the last word in the extracted characters.
$array = explode(' ', $2lines);
We can then find the last word and remove the correct number of characters from the final output.
$numwords = count($array);
$lastword = $array[$numwords];
$numchars = strlen($lastword);
$2lines = substr($2lines, 0, (0-$numchars));
function getLines($text, $lines)
{
$text = explode("\n", $text, $lines + 1); //The last entrie will be all lines you dont want.
array_pop($text); //Remove the lines you didn't want.
return implode("<br>", $text); //Implode with "<br>" to a string. (This is for a HTML page, right?)
}
echo getLines($crow->content, 2); //The first two lines of $crow->content
Try this:
$lines = preg_split("/[\r\n]+/", $crow->content, 3);
echo $lines[0] . '<br />' . $lines[1];
and for variable number of lines, use:
$num_of_lines = 2;
$lines = preg_split("/[\r\n]+/", $crow->content, $num_of_lines+1);
array_pop($lines);
echo implode('<br />', $lines);
Cheers!
This is a more general answer - you can get any amount of lines using this:
function getLines($paragraph, $lines){
$lineArr = explode("\n",$paragraph);
$newParagraph = null;
if(count($lineArr) > 0){
for($i = 0; $i < $lines; $i++){
if(isset($lines[$i]))
$newParagraph .= $lines[$i];
else
break;
}
}
return $newParagraph;
}
you could use echo getLines($crow->content,2); to do what you want.

php preg_replace words last part

$str='<p>http://domain.com/1.html?u=1234576</p><p>http://domain.com/2.html?u=2345678</p><p>http://domain.com/3.html?u=3456789</p><p>http://domain.com/4.html?u=56789</p>';
$str = preg_replace('/.html\?(.*?)/','.html',$str);
echo $str;
I need get
<p>http://domain.com/1.html</p>
<p>http://domain.com/2.html</p>
<p>http://domain.com/3.html</p>
<p>http://domain.com/4.html</p>
remove ?u=*number* from every words last part. thanks.
Change this line:
$str = preg_replace('/.html\?(.*?)/','.html',$str);
into this:
$str = preg_replace('/.html\?(.*?)</','.html<',$str);
An alternative to the other answers:
preg_replace("/<p>([^?]*)\?[^<]*<\/p>/", "<p>$1</p>", $input);
This will match all types of urls with url variables, not only the ones with html-files in them.
For example, you can also extract these types of values:
<p>http://domain.com/1.php?u=1234576</p>
<p>http://domain.com?u=1234576</p>
<p>http://domain.com</p>
<p>http://domain.com/pages/users?uid=123</p>
With an output of:
<p>http://domain.com/1.php</p>
<p>http://domain.com</p>
<p>http://domain.com</p>
<p>http://domain.com/pages/users</p>
This code will load the url's into an array so they can be handled on the fly:
$str = '<p>http://domain.com/1.html?u=1234576</p><p>http://domain.com/2.html?u=2345678</p><p>http://domain.com/3.html?u=3456789</p><p>http://domain.com/4.html?u=56789</p>';
$str = str_replace("<p>","",$str);
$links = preg_split('`\?.*?</p>`', $str,-1,PREG_SPLIT_NO_EMPTY);
foreach($links as $v) {
echo "<p>".$v."</p>";
}

preg_replace suddenly stops making distinctions

Confounded. I've been using the below IF PREG_MATCH to distinguish between words which entire words and words which are parts of other words. It has suddenly ceased to function in this script, and any other script I use, which depend on this command.
The result is it finds parts of words, although you can see it is explicitly told to find only entire words.
$word = preg_replace("/[^a-zA-Z 0-9]+/", " ", $word);
if (preg_match('#\b'.$word.'\b#',$goodfile) && (trim($word) != "")) {
$fate = strpos($goodfile,$word);
print $word ." ";
print $fate ."</br>";
If you only want to read the first word of a line of a text file, like your title suggests, try another method:
// Get the file as an array, each element being a line
$lines = file("/path/to/file");
// Break up the first line by spaces
$words = explode(" ", $lines[0]);
// Get the first word
$firstWord = $words[0];
This would be faster and cleaner than explode and you won't be making any array
$first_word = stristr($lines, ' ', true);

Categories