How to reduce a text based on word frequency in PHP?
For example, if I have this text:
house house house house book book book
it should be reduced to something like this (or any similar form):
house house book
so that house is still the most used word (2 occurrences), followed by book (1).
The question is actually interesting. As I understand it, it is not about compression but about word frequency, and this, my friend, is the field of natural language processing.
My first thought was to recommend NLTK (and learning Python if required), since there is no real PHP equivalent to it (the closest library is probably NlpTools). However, it turns out that Dan Cardin, an early NlpTools contributor, created a separate library that deals with your very problem: yooper/php-text-analysis
PHP Text Analysis is a library for performing Information Retrieval
(IR) and Natural Language Processing (NLP) tasks using the PHP
language
Add PHP Text Analysis to your project
composer require yooper/php-text-analysis
Here is an example of how to use it:
<?php
require_once('vendor/autoload.php');
$book = file_get_contents('pg74.txt'); // tom sawyer from the gutenberg project http://www.gutenberg.org/cache/epub/74/pg74.txt
// Create a tokenizer object to parse the book into a set of tokens
$tokenizer = new \TextAnalysis\Tokenizers\GeneralTokenizer();
$tokens = $tokenizer->tokenize($book);
$freqDist = new \TextAnalysis\Analysis\FreqDist($tokens);
//Get the top 10 most used words in Tom Sawyer
// array_slice() is used because array_splice() needs a variable it can modify by reference
$top10 = array_slice($freqDist->getKeyValuesByFrequency(), 0, 10, true);
$freqDist now holds a FreqDist instance.
You can then calculate the weights of words yourself (freq/numberOfAllTokens) or use the getKeyValuesByWeight() method.
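// $top10 is keyed by word, so reset() is used to read the first (highest) frequency
$topWeight = reset($top10) / $freqDist->getTotalTokens();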
$weights = $freqDist->getKeyValuesByWeight();
... or normalize the frequency of your selected words by the occurrence of your least frequent top word, e.g.
$relWeight = [];
$least = end($top10); // frequency of the least frequent of the top words
foreach ($top10 as $word => $freq) {
    $relWeight[$word] = $freq / $least;
}
Depending on your input, you will find that your most frequent words are a, the, that, etc. This is why you want to remove stopwords. And we have only just started.
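If all you need is the reduction from the question and not a full NLP library, a minimal sketch in plain PHP could look like the following. The function name reduceByFrequency and the $factor parameter are made up for illustration. It divides each word count by the factor using integer division and keeps at least one occurrence of every word, so very rare words end up slightly over-represented.

```php
<?php
// Hypothetical helper: shrink a text while keeping the frequency ranking.
function reduceByFrequency(string $text, int $factor): string
{
    // Split on whitespace and count how often each word occurs.
    $words = preg_split('/\s+/', trim($text), -1, PREG_SPLIT_NO_EMPTY);
    $counts = array_count_values($words);

    // Most frequent words first.
    arsort($counts);

    $out = [];
    foreach ($counts as $word => $count) {
        // Scale the count down, but never drop a word completely.
        $keep = max(1, intdiv($count, $factor));
        $out = array_merge($out, array_fill(0, $keep, $word));
    }

    return implode(' ', $out);
}

$reduced = reduceByFrequency('house house house house book book book', 2);
echo $reduced, "\n"; // house house book
```

A larger factor shrinks the text more aggressively; whether to round, floor, or drop words below a threshold is a design choice that depends on what the reduced text is for.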
Here are some more samples.
Compress & Uncompress a string in PHP: gzcompress,gzuncompress
Example:
$text = "house house house house book book book";
echo "Original text length: " . strlen($text) . "<br>";
$compressed = gzcompress($text, 9);
echo "Compressed text: " . $compressed . "<br>";
echo "Compressed text length: " . strlen($compressed);
echo "<br>";
$uncompressed = gzuncompress($compressed);
echo "Uncompressed text: " . $uncompressed;
Output:
Original text length: 38
Compressed text: (binary data, not printable)
Compressed text length: 22
Uncompressed text: house house house house book book book
Scenario:
I have a PHP file that is used by a zip code lookup form. It contains arrays of five-digit zip codes, each holding anywhere from 500 to 1400 zip codes. So far it works, but I get PHP sniffer warnings in my code editor (Brackets) that I'm exceeding the 120 character line limit.
Question:
Will this stop my PHP from running in certain browsers?
Do I have to go to every 120 characters and do a return just to keep the line length in compliance?
It appears I need to place these long strings into a database and load them into the array rather than hang them all inside the PHP.
I am a front-end designer, so I have a lot to learn.
<?php
$zip = $_GET['zip'] ?? ''; // use $_POST['zip'] instead if the form's method is post
// Region 01 - PersonOne Name Zips
$loc01 = array (59001,59002,59003,59004,59006);
// Region 02 - PersonTwo Name Zips
$loc02 = array ("00001","00002","00003","00004","00006");
// Above numeric strings could include 2000 zips
// Region 01 - PersonOne Name Zips
if (in_array($zip, $loc01)) {
    header("Location: https://company.com/personone");
    exit; // stop here so the redirect is the only response
}
// Region 02 - PersonTwo Name Zips
if (in_array($zip, $loc02)) {
    header("Location: https://company.com/persontwo");
    exit;
}
Question: Will this stop my PHP from running in certain browsers?
No, PHP runs entirely on the server. Browsers have nothing to do with PHP -- browsers are clients. Languages like HTML, CSS and (most) JavaScript are browser languages, but PHP is only server-side.
Do I have to go to every 120 characters and do a return just to keep the line length in compliance?
No, but I would highly suggest using a database to store tons of records like this. It's exactly what databases are for. Alternatively you could put them in a file and simply read the file in with PHP's file_get_contents function.
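As a rough sketch of the file-based option, assuming one zip code per line in a plain-text file per region: file() returns the lines as an array, which avoids both the long source lines and the string splitting you would need after file_get_contents. The temporary file below only stands in for a real file (a hypothetical region01.txt), and the zip values are illustrative.

```php
<?php
// Assumption: one zip code per line. A temp file stands in for a real
// per-region file such as a hypothetical region01.txt.
$path = tempnam(sys_get_temp_dir(), 'zips');
file_put_contents($path, "00501\n59001\n59002\n");

// file() gives one array element per line; the flags strip the newlines
// and skip blank lines.
$region01 = file($path, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);

$zip = '00501'; // would come from $_GET['zip'] or $_POST['zip'] in the real script

// Strict comparison on strings keeps leading zeros significant.
$found = in_array($zip, $region01, true);
var_dump($found); // bool(true)

unlink($path);
```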
I will try to:
Add each array to a MySQL database record.
Create a PHP script that fetches each array and applies it to the
respective location.
This will eliminate the bloated lines of array numbers in PHP.
BTW, I also need to define these as 5-digit numeric strings, as many of the zips start with one or two zeros, which are otherwise ignored when matching against the POSTed value.
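For the leading-zero issue, a small sketch of what I mean, with illustrative zip values only. Keeping zips as strings and comparing strictly avoids the problem; str_pad can restore zeros if a value was already stored as a number. As a side note, a bare integer literal written with a leading zero is read as octal in PHP, so zips should never be written unquoted.

```php
<?php
// A zip such as 00501 stored as a number loses its leading zeros.
$asNumber = 501;

// Pad back to five characters before comparing (assumes US five-digit zips).
$asString = str_pad((string) $asNumber, 5, '0', STR_PAD_LEFT);

$zips = ['00501', '00544', '59001']; // illustrative values only

var_dump($asString);                        // string(5) "00501"
var_dump(in_array('501', $zips, true));     // bool(false): a strict match needs the zeros
var_dump(in_array($asString, $zips, true)); // bool(true)
```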
Thanks everyone for the input.
Right now my system fetches data from a MySQL database. Now I want to get data from 3 different txt files and show it in random order. Each text file has more than one line of data. I want to fetch all the data from the files randomly using PHP, and then paginate the combined data.
Can anyone help me?
You could read text file lines into an array and pick a line at random:
<?php
$txt1 =<<<TXT
“If you tell the truth, you don't have to remember anything.” ― Mark Twain
“Good friends, good books, and a sleepy conscience: this is the ideal life.” ― Mark Twain
“Whenever you find yourself on the side of the majority, it is time to reform (or pause and reflect).” ― Mark Twain
TXT;
$txt2 =<<<TXT
“Facts do not cease to exist because they are ignored.” ― Aldous Huxley
“Words can be like X-rays if you use them properly -- they’ll go through anything. You read and you’re pierced.” ― Aldous Huxley
“After silence, that which comes nearest to expressing the inexpressible is music.” ― Aldous Huxley
TXT;
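// A heredoc has no trailing newline, so join the two blocks with one;
// otherwise the last Twain quote and the first Huxley quote become one line.
$lines = explode("\n", $txt1 . "\n" . $txt2);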
$random = array_rand($lines);
echo $lines[$random];
You could instead have:
$txt1 = file_get_contents('/path/to/text/file');
And so on.
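For the pagination part of the question, a minimal sketch, assuming the lines from all files have already been merged into one array (for example with array_merge over several file() calls). The paginate function and the sample lines are made up for illustration.

```php
<?php
// Hypothetical helper: return one page (1-based) of an array of lines.
function paginate(array $lines, int $page, int $perPage): array
{
    return array_slice($lines, ($page - 1) * $perPage, $perPage);
}

// Stand-in data; in real use these would be the merged lines of the text files.
$lines = ['quote 1', 'quote 2', 'quote 3', 'quote 4', 'quote 5'];

// Put the lines in random order once, then cut pages from that order.
shuffle($lines);

$pageTwo = paginate($lines, 2, 2);
print_r($pageTwo);
```

Keep in mind that shuffling on every request gives each page view a different order, so the same line can show up on several pages. Storing the shuffled order once (in the session, for instance) is one way around that.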
I've been given the task of converting a web page with a barcode to a one click label print. I've got jZebra up and running, but I have no idea where to get started as far as understanding how to write commands for a printer.
I've Googled just about everything I can think of regarding this.
Basically, I am trying to understand this code:
applet.append("^XA^CF,0,0,0^PR12^MD30^PW800^PON^CI13\n");
// Draws a line.
applet.append("^FO0,147^GB800,4,4^FS\n");
applet.append("^FO0,401^GB800,4,4^FS\n");
applet.append("^FO0,736^GB800,4,4^FS\n");
applet.append("^FO35,92^AdN,0,0^FWN^FH^FD^FS\n");
applet.append("^FO615,156^AdN,0,0^FWN^FH^FD(123) 456-7890^FS\n");
Does anyone have links to or information regarding what these characters / commands like "^FO0,401^GB800,4,4^FS" mean or do?
For Zebra printers, this simple guide will help you.
Take these Zebra EPL commands:
N
q609
Q203,26
B26,26,0,UA0,2,2,152,B,"777777"
A253,56,0,3,1,1,N,"JHON3:16"
A253,26,0,3,1,1,N,"JESUSLOVESYOU"
A253,86,0,3,1,1,N,"TEST TEST TEST"
A253,116,0,3,1,1,N,"ANOTHER TEST"
A253,146,0,3,1,1,N,"SOME LETTERS"
P1,1
The same label sent through jZebra:
var applet = document.jzebra;
if (applet != null) {
applet.append("N\n");
applet.append("q609\n");
applet.append("Q203,26\n");
applet.append("B26,26,0,UA0,2,2,152,B,\"777777\"\n");
applet.append("A253,56,0,3,1,1,N,\"JHON3:16\"\n");
applet.append("A253,26,0,3,1,1,N,\"JESUSLOVESYOU\"\n");
applet.append("A253,86,0,3,1,1,N,\"TEST TEST TEST\"\n");
applet.append("A253,116,0,3,1,1,N,\"ANOTHER TEST\"\n");
applet.append("A253,146,0,3,1,1,N,\"SOME LETTERS\"\n");
applet.append("P1,1\n");
}
With that cleared up:
EPL is one command per line. A command starts out with a command identifier, typically a letter, followed by a comma-separated list of parameters specific to that command. You can look up each of these commands in the EPL2 programming documentation. Here’s an English-language version of the commands in the above example.
Sending an initial newline guarantees that any previous borked
command is submitted.
[N] Clear the image buffer. This is an important step and
generally should be the first command in any EPL document;
who knows what state the previous job left the printer in.
[q] Set the label width to 609 dots (3 inch label x 203 dpi
= 609 dots wide).
[Q] Set the label height to 203 dots (1 inch label) with a 26
dot gap between the labels. (The printer will probably auto-
sense, but this doesn't hurt.)
[B] Draw a UPC-A barcode with value "777777" at
x = 26 dots (1/8 in), y = 26 dots (1/8 in) with a narrow bar
width of 2 dots and make it 152 dots (3/4 in) high. (The
origin of the label coordinate system is the top left corner
of the label.)
[A] Draw the text "JESUSLOVESYOU" at
x = 253 dots (about 1 1/4 in), y = 26 dots (1/8 in) in
printer font "3", normal horizontal and vertical scaling,
and no fancy white-on-black effect.
All the lines starting with A are similar.
[P] Print one copy of one label.
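If the label is generated server-side, the command list above can be assembled as a plain string. This is only a sketch: buildEplLabel is a made-up helper, the coordinates and the 30-dot line spacing are copied from the example, and no escaping of quotes inside the text is attempted.

```php
<?php
// Hypothetical helper that assembles the EPL commands from the example above.
function buildEplLabel(string $barcode, array $textLines): string
{
    // The leading empty entry produces the initial newline that flushes
    // any half-sent previous command.
    $cmds = ['', 'N', 'q609', 'Q203,26'];

    // Barcode at x=26, y=26, 152 dots high, same parameters as the example.
    $cmds[] = sprintf('B26,26,0,UA0,2,2,152,B,"%s"', $barcode);

    // One A command per text line, 30 dots apart, starting at y=26.
    $y = 26;
    foreach ($textLines as $text) {
        $cmds[] = sprintf('A253,%d,0,3,1,1,N,"%s"', $y, $text);
        $y += 30;
    }

    $cmds[] = 'P1,1'; // print one copy of one label

    return implode("\n", $cmds) . "\n";
}

$label = buildEplLabel('777777', ['JESUSLOVESYOU', 'TEST TEST TEST']);
echo $label;
```

The resulting string is what would be handed to the printer, for example by echoing it into the page and passing it to applet.append().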
After 9,000 hours in google:
Many card printers (such as Zebra or Eltron manufactured printers)
need special RAW printer commands sent to them in order to perform
certain functions (such as magnetic strip encoding or barcode
printing). These RAW commands are usually sent as text in a
proprietary syntax. This RAW syntax is specified by the printer manufacturer (usually in the form of a developer's manual). Syntax
will vary drastically between printer manufacturers and printer
models.
Emphasis is mine. You probably want to search for the developer's manual for your printer model.
Source: http://code.google.com/p/jzebra/wiki/OldSummaryDoNotUse
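For what it's worth, the ^-prefixed commands in the question are ZPL rather than EPL. As far as I know, ^XA starts a label format, ^FO sets the field origin (x,y in dots), ^GB draws a graphic box (width, height, border thickness), ^FD carries field data and ^FS ends a field. Zebra's ZPL programming guide is the place to confirm the rest. Under that reading, a line like the ones in the question could be generated with a small helper (zplHorizontalLine is a made-up name):

```php
<?php
// Hypothetical helper: a horizontal line expressed as a ZPL graphic box.
function zplHorizontalLine(int $x, int $y, int $width, int $thickness): string
{
    // ^FO = field origin (x,y); ^GB = graphic box (width,height,border thickness);
    // ^FS = field separator. Height equals thickness, so the box is a solid line.
    return sprintf('^FO%d,%d^GB%d,%d,%d^FS', $x, $y, $width, $thickness, $thickness);
}

$line = zplHorizontalLine(0, 401, 800, 4);
echo $line, "\n"; // ^FO0,401^GB800,4,4^FS
```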
I have a Pig script--currently running in local mode--that processes a huge file containing a list of categories:
/root/level1/level2/level3
/root/level1/level2/level3/level4
...
I need to insert each of these into an existing database by calling a stored procedure. Because I'm new to Pig and the UDF interface is a little daunting, I'm trying to get something done by streaming the file's content through a PHP script.
I'm finding that the PHP script only sees half of the category lines I'm passing through it, though. More precisely, I see a record returned for ceil( pig_categories/2 ). A limit of 15 will produce 8 entries after streaming through the PHP script--the last one will be empty.
-- Pig script snippet
ordered = ORDER mappable_categories BY category;
limited = LIMIT ordered 20;
categories = FOREACH limited GENERATE category;
DUMP categories; -- Displays all 20 categories
streamed = STREAM limited THROUGH `php -nF categorize.php`;
DUMP streamed; -- Displays 10 categories
# categorize.php
$category = fgets( STDIN );
echo $category;
Any thoughts on what I'm missing? I've pored over the Pig reference manual for a while now and there doesn't seem to be much information related to streaming through a PHP script. I've also tried the #hadoop channel on IRC to no avail. Any guidance would be much appreciated.
Thanks.
UPDATE
It's becoming evident that this is EOL-related. If I change the PHP script from using fgets() to stream_get_line(), then I get 10 items back, but the record that should be first is skipped and there's a trailing empty record that gets displayed.
(Arts/Animation)
(Arts/Animation/Anime)
(Arts/Animation/Anime/Characters)
(Arts/Animation/Anime/Clubs_and_Organizations)
(Arts/Animation/Anime/Collectibles)
(Arts/Animation/Anime/Collectibles/Cels)
(Arts/Animation/Anime/Collectibles/Models_and_Figures)
(Arts/Animation/Anime/Collectibles/Models_and_Figures/Action_Figures)
(Arts/Animation/Anime/Collectibles/Models_and_Figures/Action_Figures/Gundam)
()
In that result set, there should be a first item of (Arts). Closing in, but there's still some gap to close.
So it turns out that this is one of those instances where whitespace matters. I had an empty line in front of my opening <?php tag. Once I tightened all of that up, everything sailed through and produced as expected. /punitive headslap/
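For anyone landing here later, a sketch of a pass-through script that does not depend on the -F flag. My understanding of the PHP CLI docs is that -F runs the file once per input line and exposes that line as $argn, so an extra fgets(STDIN) inside the script would consume a further line; I have not verified that this is what happened here. The passThrough function is made up for illustration, and the in-memory streams only stand in for STDIN and STDOUT.

```php
<?php
// Hypothetical helper: copy every input line to the output, one record per line.
// Nothing precedes the opening tag, so no stray whitespace is emitted.
function passThrough($in, $out): int
{
    $count = 0;
    while (($line = fgets($in)) !== false) {
        // Normalise the line ending so each record ends in exactly one "\n".
        fwrite($out, rtrim($line, "\r\n") . "\n");
        $count++;
    }
    return $count;
}

// In the real script this would be: passThrough(STDIN, STDOUT);
// invoked as `php -n categorize.php` (no -F).
$in = fopen('php://memory', 'r+');
fwrite($in, "Arts\nArts/Animation\n");
rewind($in);
$out = fopen('php://memory', 'r+');

$count = passThrough($in, $out);

rewind($out);
$result = stream_get_contents($out);
echo $result;
```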