How to reduce a text based on word frequency in PHP?
For example, if I have this text:
house house house house book book book
it should be reduced to something like this (or any similar form):
house house book
so that house is still the most frequent word (2 occurrences), followed by book (1).
The question is actually interesting. As I understand it, it is not about compression but about word frequency, and that is the field of natural language processing.
My first thought was to recommend NLTK (and learning Python if required), since there is no real PHP equivalent to it (the closest library is probably NlpTools). However, it turns out that Dan Cardin, an early NlpTools contributor, created a separate library that deals with your very problem: yooper/php-text-analysis
PHP Text Analysis is a library for performing Information Retrieval
(IR) and Natural Language Processing (NLP) tasks using the PHP
language
Add PHP Text Analysis to your project
composer require yooper/php-text-analysis
Here is an example of how to use it:
<?php
require_once('vendor/autoload.php');
$book = file_get_contents('pg74.txt'); // tom sawyer from the gutenberg project http://www.gutenberg.org/cache/epub/74/pg74.txt
// Create a tokenizer object to parse the book into a set of tokens
$tokenizer = new \TextAnalysis\Tokenizers\GeneralTokenizer();
$tokens = $tokenizer->tokenize($book);
$freqDist = new \TextAnalysis\Analysis\FreqDist($tokens);
//Get the top 10 most used words in Tom Sawyer
// array_slice() is used because array_splice() expects a variable passed by reference
$top10 = array_slice($freqDist->getKeyValuesByFrequency(), 0, 10, true);
$freqDist is now a FreqDist instance.
You can then calculate the weights of words yourself (freq/numberOfAllTokens) or use the getKeyValuesByWeight() method.
reset($top10) / $freqDist->getTotalTokens(); // weight of the most frequent word
$weights = $freqDist->getKeyValuesByWeight();
... or normalize the frequency of your selected words by the occurrence of your least frequent top word, e.g.
$relWeight = array();
$minFreq = end($top10); // frequency of the least frequent of the top words
foreach ($top10 as $word => $freq) {
    $relWeight[$word] = $freq / $minFreq;
}
Depending on your input, you will find that your most frequent words are a, the, that, etc. This is why you want to remove stopwords. And we have only started..
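For the exact reduction asked for in the question, no library is strictly needed. Below is a minimal sketch, assuming that "any similar form" means keeping the ranking of the words while shrinking the counts: each distinct frequency is replaced by its rank (1 for the rarest). The function name reduce_by_frequency is made up for this example, and the rank-based rule is my assumption, not something the question specifies.

```php
<?php
// Hypothetical helper: shrink a text so that word ranking by frequency is preserved.
function reduce_by_frequency($text)
{
    $words = preg_split('/\s+/', trim($text), -1, PREG_SPLIT_NO_EMPTY);
    $counts = array_count_values($words);

    // Distinct frequencies in ascending order; the position of a frequency
    // in this list (plus one) becomes the reduced count.
    $levels = array_values(array_unique($counts));
    sort($levels);
    $rank = array_flip($levels); // frequency => zero-based position

    arsort($counts); // most frequent words first
    $out = array();
    foreach ($counts as $word => $count) {
        $out = array_merge($out, array_fill(0, $rank[$count] + 1, $word));
    }
    return implode(' ', $out);
}

echo reduce_by_frequency('house house house house book book book'); // house house book
```

Words with the same frequency end up with the same reduced count, so ties stay ties.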
Here are some more samples.
Compress & Uncompress a string in PHP: gzcompress,gzuncompress
Example:
$text = "house house house house book book book";
echo "Original text length: " . strlen($text) . "<br>";
$compressed = gzcompress($text, 9);
echo "Compressed text: " . $compressed . "<br>";
echo "Compressed text length: " . strlen($compressed);
echo "<br>";
$uncompressed = gzuncompress($compressed);
echo "Uncompressed text: " . $uncompressed;
Output:
Original text length: 38
Compressed text: x���/-NU�� ����R
Compressed text length: 22
Uncompressed text : house house house house book book book
There are already some questions on this topic, but I have not found a satisfying answer yet. Common .csv reader software (e.g. LibreOffice) can read the file without any problems, but PHP struggles with it.
Sample part of the file:
86010800,class,Table Games,,"(10005183, 10005184, 10005185)"
86011100,class,Toy Vehicles – Non-ride,,"(10005192, 10005190, 10005191, 10005193, 10005194, 10005195, 10005196)"
86011000,class,Toys – Ride-on,,"(10005187, 10005188, 10005441, 10005189)"
86010900,class,Toys/Games Variety Packs,,(10005186)
10001686,brick,Airbrushes (Powered),"Definition:
Includes any products that can be described/observed as a powered
machine that scatters paint using air pressure and which is used for
design illustrations, fine art and delicate fine spray applications. The
machine comprises of a compression or propulsion device to scatter the
paint.Specifically excludes Spray Paint and Aerosols.Excludes Airbrushing Replacement Parts and Accessories such as Airbrush Control Valves and Airbrush Hoses.",()
10001688,brick,Airbrushing Equipment – Replacement Parts/Accessories,Definition: Includes any products that can be described/observed as a replacement part/accessory for airbrushing equipment.Includes products such as Airbrush Control Valves and Airbrush Hoses.Excludes products such as complete Airbrushes.,(20001349)
10001690,brick,Airbrushing Supplies Other,"Definition:
Includes any products that may be described/observed as Airbrushing
Supplies products, where the user of the schema is not able to classify
the products in existing bricks within the schema.Excludes all currently classified Airbrushing Supplies.",()
As you can see, columns 4 and 5 are quoted in some rows because they may contain line breaks; simple values are not quoted.
What I want is a 2-dimensional array where a row is level 1 and the five columns of the row are level 2.
I have tried fgetcsv(), which fails to parse the multi-line fields, and str_getcsv(), which fails to parse in two dimensions, both with the correct parameters. What is the best way to parse this .csv file correctly with PHP?
SplFileObject::fgetcsv() should help you parse the CSV file. The manual page linked below already contains an example, so try it and check whether the rows are read correctly.
http://php.net/manual/en/splfileobject.fgetcsv.php
Edit: Full code sample
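$file = new SplFileObject('../../temp/file.csv');
// READ_CSV makes iteration yield parsed rows; READ_AHEAD and SKIP_EMPTY drop the empty row at EOF
$file->setFlags(SplFileObject::READ_CSV | SplFileObject::READ_AHEAD | SplFileObject::SKIP_EMPTY);
$ar = array();
foreach ($file as $row) {
    $ar[] = $row;
}
echo '<pre>';
print_r($ar);
echo '</pre>';
die;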
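As a cross-check: as far as I know, plain fgetcsv() on a file handle also copes with quoted fields that span several lines; trouble usually starts when the file is split into lines before parsing. A minimal sketch, using an in-memory stream and two shortened rows modelled on the sample above (the helper name read_csv_rows is made up for this example):

```php
<?php
// Hypothetical helper: read every CSV row from an open handle into a 2-dimensional array.
function read_csv_rows($handle)
{
    $rows = array();
    // Delimiter, enclosure and escape are passed explicitly to avoid relying on defaults.
    while (($row = fgetcsv($handle, 0, ',', '"', '\\')) !== false) {
        if ($row === array(null)) {
            continue; // fgetcsv() reports a blank line as array(null)
        }
        $rows[] = $row;
    }
    return $rows;
}

// Shortened sample data; the quoted fourth column contains a line break.
$csv = "86010900,class,Toys/Games Variety Packs,,(10005186)\n"
     . "10001686,brick,Airbrushes (Powered),\"Definition:\nIncludes any products\",()\n";

$handle = fopen('php://temp', 'r+');
fwrite($handle, $csv);
rewind($handle);
$rows = read_csv_rows($handle);
fclose($handle);

print_r($rows);
```

Each row comes back with five columns, and the line break stays inside the fourth column of the second row.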
My question is quite simple: I use gettext to translate URLs, so I only have the translated version of the URL string.
Is there an easy way to get the base string from the translated string?
What I had in mind was to automatically store the translated name in a database and alias it with the base string each time I use my _u($string) function.
What I have currently:
function _u($string)
{
if (empty($string))
return '';
else
return dgettext('Urls', $string);
}
What I was thinking about (pseudo-code):
function _u($string)
{
if (empty($string))
return '';
$translation = dgettext('Urls', $string);
MySQL REPLACE INTO ... base = $string, translation = $translation; (translation = primary key)
return $translation;
}
function url_base($translation)
{
$row = SELECT ... FROM ... translation = $translation;
return $base;
}
This doesn't seem to be the best way to do it, though: if I remove the REPLACE part in production, I might miss a link or two that I never visited beforehand.
EDIT: What I am mostly looking for is the parsing part of gettext. I cannot afford to miss any of the possible URLs, so any alternative solution would need to include a parser.
EDIT2: Another difficulty has just been added. We must find the URL in any translation and map it back to the "base" translation so that the system can parse the URL in the base language.
Actually, the most straightforward way I can think of would be to decode the .mo files used for the translation, through a call to the msgunfmt utility.
Once you have the plaintext database, you save it in any other kind of database, and will then be able to do reverse searches.
But perhaps better, you could create additional domain(s) ("ReverseUrlsIT") in which to store the translated URL as key, and the base as value (provided the mapping is fully two-way, that is!).
At that point you can use dgettext to recover the base string from the translated string, provided that you know the language of the translated string.
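The first suggestion above (dump the catalog with msgunfmt, then do reverse lookups) could look roughly like this. It is only a sketch: the function name reverse_catalog and the sample entries are made up, and it assumes every msgid/msgstr is a single-line string, so multi-line entries, plurals and contexts are ignored.

```php
<?php
// Hypothetical helper: build a translated => base map from PO text,
// e.g. the text printed by msgunfmt for one language.
function reverse_catalog($poText)
{
    $map = array();
    // Assumption: each entry is a single-line msgid directly followed by a single-line msgstr.
    preg_match_all('/^msgid "(.*)"\r?\nmsgstr "(.*)"\r?$/m', $poText, $matches, PREG_SET_ORDER);
    foreach ($matches as $m) {
        $base = stripcslashes($m[1]);
        $translated = stripcslashes($m[2]);
        // Skip the header entry (empty msgid) and untranslated strings.
        if ($base !== '' && $translated !== '') {
            $map[$translated] = $base;
        }
    }
    return $map;
}

// Made-up catalog content standing in for real msgunfmt output.
$po = <<<'PO'
msgid ""
msgstr ""

msgid "products"
msgstr "prodotti"

msgid "contact-us"
msgstr "contattaci"
PO;

$reverse = reverse_catalog($po);
echo $reverse['prodotti']; // products
```

The map is only reliable if no two base strings share the same translation, which is the same two-way condition mentioned above.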
Update
This is the main point of using gettext and I would drop it anytime if
I could find another parser/library/tool that could help with that
The gettext family of functions, after all is said and done, are little more than a keystore database system with (maybe) a parser which is a little more powerful than printf, to handle plurals and adjective/noun inversions (violin virtuoso in English becomes virtuoso di violino in Italian).
At the cost of adding to the database complexity (and load), you can build a keystore leveraging whatever persistency layer you've got handy (gettext is file based, after all):
TABLE LanguageDomain
{
PRIMARY KEY ldId;
varchar(?) ldValue;
}
# e.g.
# 39 it_IT
# 44 en_GB
# 01 en_US
TABLE Shorthand
{
PRIMARY KEY shId;
varchar(?) shValue;
}
# e.g.
# 1 CAMERA
# 2 BED
TABLE Translation
{
KEY t_ldId,
t_shId;
varchar(?) t_Value; // Or one value for singular form, one for plural...
}
# e.g.
# 44 1 Camera
# 39 1 Macchina fotografica
# 01 1 Camera
# 44 2 Bed
# 39 2 Letto
# 01 2 Bed
# 01 137 Behavior
# 44 137 Behaviour # "American and English have many things in common..."
# 01 979 Cookie
# 44 979 Biscuit # "...except of course the language" (O. Wilde)
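function translate($string, $arguments = array())
{
    // Assumption: $pdo (a PDO connection) and $languageDomain (e.g. 'it_IT') are set up elsewhere
    global $pdo, $languageDomain;
    // First recover main string
    $stmt = $pdo->prepare(
        'SELECT t.t_Value FROM Translation AS t
           JOIN LanguageDomain AS l ON (t.t_ldId = l.ldId AND l.ldValue = :LangDom)
           JOIN Shorthand AS s ON (t.t_shId = s.shId AND s.shValue = :String)'
    );
    $stmt->execute(array(':LangDom' => $languageDomain, ':String' => $string));
    $Result = $stmt->fetchColumn();
    if ($Result === false)
        return $string; // no translation found: fall back to the input, as gettext does
    if (empty($arguments))
        return $Result;
    // Now run replacement of arguments - if any
    $replacements = array();
    foreach ($arguments as $n => $argument)
        $replacements['$' . ($n + 1)] = translate($argument); // zero-based list, placeholders start at $1
    // Now replace '$1' with translation of first argument, etc.
    return str_replace(array_keys($replacements), array_values($replacements), $Result);
}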
This would allow you to easily add one more languageDomain, and even to run queries such as e.g. "What terms in English have not yet been translated into German?" (i.e., have a NULL value when LEFT JOINing the subset of Translation table with English domain Id with the subset with German domain Id).
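Such a "what is still untranslated?" query could be sketched as follows. Everything here is an assumption made for illustration: an in-memory SQLite database stands in for the real persistence layer, the column types are guesses, the ids and rows are made up, and the function name untranslated is hypothetical.

```php
<?php
// Assumption: in-memory SQLite stands in for whatever database is actually used.
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

$pdo->exec('CREATE TABLE LanguageDomain (ldId INTEGER PRIMARY KEY, ldValue TEXT)');
$pdo->exec('CREATE TABLE Shorthand (shId INTEGER PRIMARY KEY, shValue TEXT)');
$pdo->exec('CREATE TABLE Translation (t_ldId INTEGER, t_shId INTEGER, t_Value TEXT,
            PRIMARY KEY (t_ldId, t_shId))');

// Made-up sample rows: CAMERA has no German translation yet.
$pdo->exec("INSERT INTO LanguageDomain VALUES (1, 'en'), (2, 'de')");
$pdo->exec("INSERT INTO Shorthand VALUES (1, 'CAMERA'), (2, 'BED')");
$pdo->exec("INSERT INTO Translation VALUES (1, 1, 'Camera'), (1, 2, 'Bed'), (2, 2, 'Bett')");

// Hypothetical helper: shorthands translated in $from but missing in $to.
function untranslated(PDO $pdo, $from, $to)
{
    $stmt = $pdo->prepare(
        'SELECT s.shValue
           FROM Translation AS src
           JOIN LanguageDomain AS ls ON ls.ldId = src.t_ldId AND ls.ldValue = :from
           JOIN Shorthand AS s ON s.shId = src.t_shId
           LEFT JOIN LanguageDomain AS lt ON lt.ldValue = :to
           LEFT JOIN Translation AS dst ON dst.t_shId = src.t_shId AND dst.t_ldId = lt.ldId
          WHERE dst.t_Value IS NULL
          ORDER BY s.shValue'
    );
    $stmt->execute(array(':from' => $from, ':to' => $to));
    return $stmt->fetchAll(PDO::FETCH_COLUMN);
}

print_r(untranslated($pdo, 'en', 'de')); // lists CAMERA
```

The LEFT JOIN keeps every source-language row, and the IS NULL filter keeps only those without a counterpart in the target language.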
This system is interoperable with PO files, which is important if you need to outsource the translation to someone using the standard tools of the trade. But you can just as easily output a query directly to TMX format, eliminating duplicates (in some cases this might really cut down your translation costs: several services overcharge for input in "strange" formats such as Excel, and will either overcharge for "deduping" or charge for each duplicate as if it were an original).
<?xml version="1.0" ?>
<tmx version="1.4">
<header
creationtool="MySQLgetText"
creationtoolversion="0.1-20120827"
datatype="PlainText"
segtype="sentence"
adminlang="en-us"
srclang="EN"
o-tmf="ABCTransMem">
</header>
<body>
<tu tuid="BED" datatype="plaintext">
<tuv xml:lang="en">
<seg>bed</seg>
</tuv>
<tuv xml:lang="it">
<seg>letto</seg>
</tuv>
</tu>
<tu tuid="CAMERA" datatype="plaintext">
<tuv xml:lang="en">
<seg>camera</seg>
</tuv>
<tuv xml:lang="it">
<seg>macchina fotografica</seg>
</tuv>
</tu>
</body>
</tmx>
I have a .doc file containing English and Chinese text: product descriptions, separated in the document by numbers, i.e. 0001, 0002, 0003, 0004, 0005, etc.
For example:
0001
技术参数
电压:AC90V-120V/220V-240V 50-60HZ
功率:400W
光源:120PCS 1W/3W LEDS
(R:30pcs,G:30pcs,B:30psc,W:30pcs)
控制通道:12通道
运行模式:主从,自走,声控,DMX512
每颗LED的理论寿命为50000-100000时
光学透镜角度标准15度
水平扫描:540度,垂直扫描270度
可以调节扫描速度
无限的RGBW颜色混色系统
显示操作面板彩用LCD显示屏
产品尺寸:515*402*555mm
净重:19kg 毛重:21kg
TECHNICAL PARAMETER
Voltage: AC90V-120V or 200V-240V 50-60HZ
Power consumption:400W
Light source:120PCS 1W or 3W LED
(R:30pcs,G:30pcs,B:30psc,W:30pcs)
Control mode:12HS
Operation mode: master-slave, auto movement,
Sound control: DMX 512
Each led source has an expectancy over 50000 to 100000 hours in theory
Optical len angle:15 degrees
Level scanning:540 degrees Vertical scanning
270 degrees, speed adjustable
Indefinite RGBW color mixing system
LCD display adopted
Product size:512*402*555mm
N.W:19kg G.W:21kg
0002
技术参数
电压:AC100V-240V,50/60HZ
功率:360W
光源:108颗 1/3W LED
运行模式:主从,自走,声控,DMX512
控制通道:11通道
水平扫描:540度,垂直扫描270度
高度电子调光,频闪可达1-20次/秒
均匀的RGB混色系统和彩虹效果(可加白色)
光斑角度:15度
包装尺寸:420*330*550mm
净重:10kg 毛重:13kg
TECHNICAL PARAMETER
Voltage:AC100V-240V ,50/60HZ
Power consumption:306W
Light source:108pcs of 1/3W LED
Operation mode master-slave, sound control,
auto movement,DMX512
Control channel:11Hs
Level scanning angle:540 degrees
Vertical scanning angle:270degrees
Quick electronic dimmer, strobe from 1 to 20 times/second
Smooth RGB mixing system &
Rainbow effect(can add white)
Beam angle:15 degrees
Package size :420*330*550mm
N.W:10kg G.W:13kg
0003
技术参数
电压:AC90V-120V,200V-250V,50/60HZ
光束角:10度,15度,25度可选。
控制通道:11通道
预期使用寿命:50000小时
最低的能量消耗。
信号控制:12个标准DMX 12通道控制,独立的主从控制。
频闪:1-18次/秒
LED显示。
内置程序:内置的8个程序能被DMX控制激活。
尺寸:307*354*267mm
净重:8.7kg
符合GB7000.1-2007.GB7000.217-2008及CE标准
TECHNICAL PARAMETER
Power supply:AC100V-120V.200V-250V.50/60Hz
Angle of light beam:10。15。
25。 Are available for choice.
Control channel:11
Service life:50000 hours
The lowest power consumption
Control signal 12 Standard DMX controlling
Channels and ant channels combination
Can be sep up.
Independent master/slave control
Strobe:1-18 flash per second
Inside program: the 8 inside program can
be activated by DMX controller
Dimensions:307*354*267mm
N.W:8.7kg
Up to CE standard. UL standard and
GB 7000.15-2000standard
Any ideas on the best way to split it and put it into a database?
Thanks
Lee
1. mb_split, not preg_split, is meant for multibyte strings
Use mb_split(), which takes the pattern without delimiters:
mb_regex_encoding('UTF-8');
$descriptions = mb_split('\d{4}', $text);
2. Loop through the file
Another approach, which avoids running non-multibyte-safe PHP functions on the text and possibly mangling the Chinese portions:
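$file = file('/file/path');
$descriptions = array();
$description_counter = 0;
foreach ($file as $line) {
    $line = trim($line);
    if (preg_match('/^\d{4}$/', $line)) {
        // A line holding only a four-digit number starts a new description
        $description_counter++;
    }
    if (!isset($descriptions[$description_counter])) {
        $descriptions[$description_counter] = ''; // avoid an undefined-index notice on first append
    }
    $descriptions[$description_counter] .= $line . "\n";
}
print_r($descriptions);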
Copy the text into $text and use
$r = preg_split('/^\d{4}\R/m', $text, -1, PREG_SPLIT_NO_EMPTY);
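If the descriptions are going into a database, it helps to keep each number together with its text. A minimal sketch, assuming the text has already been extracted from the .doc file as UTF-8 and that a line consisting of exactly four digits always starts a new description (the function name split_descriptions and the sample text are made up):

```php
<?php
// Hypothetical helper: return an array of description text keyed by its four-digit number.
function split_descriptions($text)
{
    // Capturing the number makes preg_split() return it alongside the body that follows it.
    $parts = preg_split('/^(\d{4})\R/mu', $text, -1, PREG_SPLIT_DELIM_CAPTURE);
    array_shift($parts); // drop whatever precedes the first number (usually an empty string)

    $result = array();
    for ($i = 0; $i + 1 < count($parts); $i += 2) {
        $result[$parts[$i]] = trim($parts[$i + 1]);
    }
    return $result;
}

// Shortened sample in the same layout as the document above.
$text = "0001\n技术参数\nTECHNICAL PARAMETER\n0002\n功率:360W\nPower consumption:306W\n";
$descriptions = split_descriptions($text);
print_r($descriptions);
```

Each key/value pair can then be written to the database as one row, for example with a prepared INSERT statement.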
I have a text string, for example 'A vehicle travels from A to B, distance {$d} km at constant speed. While returning back to A on same path it {$variation} its speed by {$v} km/hr. The total time of journey is {$t} hours. Find the original speed of vehicle.'
The variables in the curly brackets are to be replaced by the appropriate LaTeX expressions. I'm using PHP's preg_replace to replace the variables with LaTeX commands. Unfortunately, the LaTeX commands appear verbatim in the page; MathJax does not process them.
For example, the text above becomes 'A vehicle travels from A to B, distance 1 km at constant speed. While returning back to A on same path it increased its speed by (\frac{3}{2}) km/hr. The total time of journey is 1 hours. Find the original speed of vehicle.' The \frac is shown as is.
What is wrong here? Please ask if you need any more info. Thanks
I'm guessing you aren't quoting the replacement text properly. Replacing just the first two variables, tested using spaweditor's regex tool:
<?php
$string = 'A vehicle travels from A to B, distance {$d} km at constant speed. While returning back to A on same path it {$variation} its speed by {$v} km/hr. The total time of journey is {$t} hours. Find the original speed of vehicle.';
$patns = array();
$patns[0] = '/\{\$d\}/';
$patns[1] = '/\{\$variation\}/';
$repns = array();
$repns[0] = '1'; // the template already contains " km" after {$d}
$repns[1] = '\\(\\frac{3}{2}\\)';
echo preg_replace($patns, $repns, $string);
?>
If this doesn't work, show the full example of how you are embedding the text in the page.
Postscript: the point is that the LaTeX delimiter for inline maths is \( ... \), and yours is missing the backslashes.
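Since the placeholders are fixed strings, the regex can be avoided altogether, which also removes one layer of escaping. A minimal sketch using strtr() with a shortened template; the template text and the variable names $template and $values are made up for this example:

```php
<?php
// Shortened template; single quotes keep {$d} etc. as literal text.
$template = 'Distance {$d} km; speed {$variation} by {$v} km/hr.';

// In single quotes, \\ yields one backslash, so MathJax receives \( ... \).
$values = array(
    '{$d}'         => '1',
    '{$variation}' => 'increased',
    '{$v}'         => '\\(\\frac{3}{2}\\)',
);

$result = strtr($template, $values);
echo $result; // Distance 1 km; speed increased by \(\frac{3}{2}\) km/hr.
```

strtr() replaces the keys literally, so neither the braces nor the dollar signs need escaping, and backslashes in the replacement are never interpreted as backreferences.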
I am working on a project where I have to find the keyword density of a page, given the URL of that page. I googled a lot but found no help and no scripts, only a paid tool: http://www.selfseo.com/store/_catalog/php_scripts/_keyword_density_checker_php_script
However, I don't actually know what "keyword density of a page" means. Also, how can we create a PHP script that fetches the keyword density of a web page?
Thanks
"Keyword density" is simply the frequency that the word occurs given as a percentage of the total number of words. The following PHP code will output the density of each word in a string, $str. It demonstrates that keyword density is not a complex calculation, it can be done in a few lines of PHP:
<?php
$str = "I am working on a project where I have to find out the keyword density of the page on the basis of URL of that page. But I am not aware actually what \"keyword Density of a page\" actually means? and also please tell me how can we create a PHP script which will fetch the keyword density of a web page.";
// str_word_count($str,1) - returns an array containing all the words found inside the string
$words = str_word_count(strtolower($str),1);
$numWords = count($words);
// array_count_values() returns an array using the values of the input array as keys and their frequency in input as values.
$word_count = (array_count_values($words));
arsort($word_count);
foreach ($word_count as $key=>$val) {
echo "$key = $val. Density: ".number_format(($val/$numWords)*100)."%<br/>\n";
}
?>
Example output:
of = 5. Density: 8%
a = 4. Density: 7%
density = 3. Density: 5%
page = 3. Density: 5%
...
To fetch the content of a webpage you can use file_get_contents (or cURL). As an example, the following PHP code lists all keywords above 1% density on this webpage:
<?php
$str = strip_tags(file_get_contents("http://stackoverflow.com/questions/819166"));
$words = str_word_count(strtolower($str),1);
$word_count = array_count_values($words);
foreach ($word_count as $key=>$val) {
$density = ($val/count($words))*100;
if ($density > 1)
echo "$key - COUNT: $val, DENSITY: ".number_format($density,2)."%<br/>\n";
}
?>
I hope this helps.
Or you can try this: http://code.eyecatch-up.de/?p=155
Update: Relocated the class to http://code.google.com/p/php-class-keyword-density-check/
<?php
include 'class/class.keywordDensity.php'; // Include class
$obj = new KD(); // New instance
$obj->domain = 'http://code.eyecatch-up.de'; // Define Domain
print_r ($obj->result());
?>
The above code returns:
Array
(
[0] => Array
(
[total words] => 231
)
[1] => Array
(
[keyword] => display
[count] => 14
[percent] => 6.06
)
and so on...
It works with local and remote files.
Keyword density is roughly:
(no. of times the keyword appears on the page) / (total no. of words on the page)
Keyword density just means the percentage that the keywords appear in the content versus rest of the text. In general, it's also a fairly useless metric for SEO. I wouldn't bother building a script for it as you'd be better off concentrating on other metrics. You might find this reference useful.
If the given keyword is "elephant walks", the keyword density would be how often the term "elephant walks" appears on any given web page in relation to other text. As VirtuosiMedia said, this is (broadly) useless information.
To measure it, you must strip all markup from the text, then count the words while keeping track of how often the keyword(s) appear.
At that point you will know: xx.xx % of all words in this text are keywords, and xx.xx % of the time the keywords are used next to each other; therefore the keyword density for "elephant walks" is xx.
Again, the only reason this is useful is to demonstrate pattern matching and string functions in php.
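The measuring steps described above can be sketched for a multi-word keyword as follows. This is one possible reading, not an established formula: it counts how often the words of the phrase appear next to each other and reports the share of all words that belong to such an occurrence. The function name phrase_density and the sample markup are made up.

```php
<?php
// Hypothetical helper: density (in percent) of a multi-word phrase within an HTML string.
function phrase_density($html, $phrase)
{
    // Strip the markup, then split both the text and the phrase into lowercase words.
    $words  = str_word_count(strtolower(strip_tags($html)), 1);
    $needle = str_word_count(strtolower($phrase), 1);
    $n = count($needle);
    $total = count($words);
    if ($n === 0 || $total < $n) {
        return 0.0;
    }

    // Slide over the word list and count positions where the whole phrase matches.
    // Note: overlapping matches (possible for phrases like "a a") are each counted.
    $hits = 0;
    for ($i = 0; $i <= $total - $n; $i++) {
        if (array_slice($words, $i, $n) === $needle) {
            $hits++;
        }
    }

    // Share of all words that are part of an occurrence of the phrase.
    return 100.0 * $hits * $n / $total;
}

$html = '<p>The elephant walks. An elephant walks slowly.</p>';
echo number_format(phrase_density($html, 'elephant walks'), 2) . '%';
```

For the sample markup there are 7 words and 2 occurrences of the two-word phrase, so 4 of the 7 words belong to the keyword.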