I'm working on mining tweets with PHP, using text-processing steps such as removing HTML tags, mentions/usernames, links, non-alphanumeric characters, repeated syllables and characters, and noise words.
When I fetch search results and run all of the processing steps above, my app does not give any result. It only shows results when I limit the processing to
removing HTML tags, mentions, links, non-alphanumeric characters, and repeated characters (not all of the text-processing steps).
Why does this problem occur? Does PHP impose specific limits on this kind of processing?
Is it possible in PHP to allow a longer processing time, so that I can use all of my processing steps?
Any help will be greatly appreciated. Thanks.
I have added all of the preprocessing code below. I hope you can spot any problem in it. Thanks.
function addspaces($value){
return " ".$value." "; }
function containsTLD($string) {
preg_match(
"/(AC($|\/)|\.AD($|\/)|\.AE($|\/)|\.AERO($|\/)|\.AF($|\/)|\.AG($|\/)|\.AI($|\/)|\.AL($|\/)|\.AM($|\/)|\.AN($|\/)|\.AO($|\/)|\.AQ($|\/)|\.AR($|\/)|\.ARPA($|\/)|\.AS($|\/)|\.ASIA($|\/)|\.AT($|\/)|\.AU($|\/)|\.AW($|\/)|\.AX($|\/)|\.AZ($|\/)|\.BA($|\/)|\.BB($|\/)|\.BD($|\/)|\.BE($|\/)|\.BF($|\/)|\.BG($|\/)|\.BH($|\/)|\.BI($|\/)|\.BIZ($|\/)|\.BJ($|\/)|\.BM($|\/)|\.BN($|\/)|\.BO($|\/)|\.BR($|\/)|\.BS($|\/)|\.BT($|\/)|\.BV($|\/)|\.BW($|\/)|\.BY($|\/)|\.BZ($|\/)|\.CA($|\/)|\.CAT($|\/)|\.CC($|\/)|\.CD($|\/)|\.CF($|\/)|\.CG($|\/)|\.CH($|\/)|\.CI($|\/)|\.CK($|\/)|\.CL($|\/)|\.CM($|\/)|\.CN($|\/)|\.CO($|\/)|\.COM($|\/)|\.COOP($|\/)|\.CR($|\/)|\.CU($|\/)|\.CV($|\/)|\.CX($|\/)|\.CY($|\/)|\.CZ($|\/)|\.DE($|\/)|\.DJ($|\/)|\.DK($|\/)|\.DM($|\/)|\.DO($|\/)|\.DZ($|\/)|\.EC($|\/)|\.EDU($|\/)|\.EE($|\/)|\.EG($|\/)|\.ER($|\/)|\.ES($|\/)|\.ET($|\/)|\.EU($|\/)|\.FI($|\/)|\.FJ($|\/)|\.FK($|\/)|\.FM($|\/)|\.FO($|\/)|\.FR($|\/)|\.GA($|\/)|\.GB($|\/)|\.GD($|\/)|\.GE($|\/)|\.GF($|\/)|\.GG($|\/)|\.GH($|\/)|\.GI($|\/)|\.GL($|\/)|\.GM($|\/)|\.GN($|\/)|\.GOV($|\/)|\.GP($|\/)|\.GQ($|\/)|\.GR($|\/)|\.GS($|\/)|\.GT($|\/)|\.GU($|\/)|\.GW($|\/)|\.GY($|\/)|\.HK($|\/)|\.HM($|\/)|\.HN($|\/)|\.HR($|\/)|\.HT($|\/)|\.HU($|\/)|\.ID($|\/)|\.IE($|\/)|\.IL($|\/)|\.IM($|\/)|\.IN($|\/)|\.INFO($|\/)|\.INT($|\/)|\.IO($|\/)|\.IQ($|\/)|\.IR($|\/)|\.IS($|\/)|\.IT($|\/)|\.JE($|\/)|\.JM($|\/)|\.JO($|\/)|\.JOBS($|\/)|\.JP($|\/)|\.KE($|\/)|\.KG($|\/)|\.KH($|\/)|\.KI($|\/)|\.KM($|\/)|\.KN($|\/)|\.KP($|\/)|\.KR($|\/)|\.KW($|\/)|\.KY($|\/)|\.KZ($|\/)|\.LA($|\/)|\.LB($|\/)|\.LC($|\/)|\.LI($|\/)|\.LK($|\/)|\.LR($|\/)|\.LS($|\/)|\.LT($|\/)|\.LU($|\/)|\.LV($|\/)|\.LY($|\/)|\.MA($|\/)|\.MC($|\/)|\.MD($|\/)|\.ME($|\/)|\.MG($|\/)|\.MH($|\/)|\.MIL($|\/)|\.MK($|\/)|\.ML($|\/)|\.MM($|\/)|\.MN($|\/)|\.MO($|\/)|\.MOBI($|\/)|\.MP($|\/)|\.MQ($|\/)|\.MR($|\/)|\.MS($|\/)|\.MT($|\/)|\.MU($|\/)|\.MUSEUM($|\/)|\.MV($|\/)|\.MW($|\/)|\.MX($|\/)|\.MY($|\/)|\.MZ($|\/)|\.NA($|\/)|\.NAME($|\/)|\.NC($|\/)|\.NE($|\/)|\.NET($|\/)|\.NF($|\/)|\.NG($|\/)|\.
NI($|\/)|\.NL($|\/)|\.NO($|\/)|\.NP($|\/)|\.NR($|\/)|\.NU($|\/)|\.NZ($|\/)|\.OM($|\/)|\.ORG($|\/)|\.PA($|\/)|\.PE($|\/)|\.PF($|\/)|\.PG($|\/)|\.PH($|\/)|\.PK($|\/)|\.PL($|\/)|\.PM($|\/)|\.PN($|\/)|\.PR($|\/)|\.PRO($|\/)|\.PS($|\/)|\.PT($|\/)|\.PW($|\/)|\.PY($|\/)|\.QA($|\/)|\.RE($|\/)|\.RO($|\/)|\.RS($|\/)|\.RU($|\/)|\.RW($|\/)|\.SA($|\/)|\.SB($|\/)|\.SC($|\/)|\.SD($|\/)|\.SE($|\/)|\.SG($|\/)|\.SH($|\/)|\.SI($|\/)|\.SJ($|\/)|\.SK($|\/)|\.SL($|\/)|\.SM($|\/)|\.SN($|\/)|\.SO($|\/)|\.SR($|\/)|\.ST($|\/)|\.SU($|\/)|\.SV($|\/)|\.SY($|\/)|\.SZ($|\/)|\.TC($|\/)|\.TD($|\/)|\.TEL($|\/)|\.TF($|\/)|\.TG($|\/)|\.TH($|\/)|\.TJ($|\/)|\.TK($|\/)|\.TL($|\/)|\.TM($|\/)|\.TN($|\/)|\.TO($|\/)|\.TP($|\/)|\.TR($|\/)|\.TRAVEL($|\/)|\.TT($|\/)|\.TV($|\/)|\.TW($|\/)|\.TZ($|\/)|\.UA($|\/)|\.UG($|\/)|\.UK($|\/)|\.US($|\/)|\.UY($|\/)|\.UZ($|\/)|\.VA($|\/)|\.VC($|\/)|\.VE($|\/)|\.VG($|\/)|\.VI($|\/)|\.VN($|\/)|\.VU($|\/)|\.WF($|\/)|\.WS($|\/)|\.XN--0ZWM56D($|\/)|\.XN--11B5BS3A9AJ6G($|\/)|\.XN--80AKHBYKNJ4F($|\/)|\.XN--9T4B11YI5A($|\/)|\.XN--DEBA0AD($|\/)|\.XN--G6W251D($|\/)|\.XN--HGBK6AJ7F53BBA($|\/)|\.XN--HLCJ6AYA9ESC7A($|\/)|\.XN--JXALPDLP($|\/)|\.XN--KGBECHTV($|\/)|\.XN--ZCKZAH($|\/)|\.YE($|\/)|\.YT($|\/)|\.YU($|\/)|\.ZA($|\/)|\.ZM($|\/)|\.ZW)/i",
$string,
$M);
$has_tld = (count($M) > 0) ? true : false;
return $has_tld;
}
function cleaner($url) {
$U = explode(' ',$url);
$W =array();
foreach ($U as $k => $u) {
if (stristr($u,".")) {
if (containsTLD($u) === true) {
unset($U[$k]);
return cleaner( implode(' ',$U));
}
}
}
return implode(' ',$U);
}
$regexmention = "/(?<=^|(?<=[^a-zA-Z0-9-_\.]))#([A-Za-z-_\.]+[A-Za-z0-9]+)/i";
$reject = strtolower(implode(" ",file("file.txt")));// build the list of words to reject, e.g.: smartfren ga iya gue gw
$rejectarray = array_map('addspaces',explode(" ", $reject) );
$removehtmltag=strip_tags($tweet,"");
$removemention=strtolower(preg_replace($regexmention, "",$removehtmltag));
$removeurl=cleaner($removemention);
$removenonalfa = preg_replace("/[^A-Za-z0-9 ]/", " ",$removeurl );
$removerepeatchar = preg_replace("/(.)\\1+/", "$1", $removerepeatchar);
$removedigit = preg_replace('/\d/', " ", preg_replace("/(.)\\1+/", "$1", $removerepeatchar));
$removerepeatword = implode(' ',array_unique(preg_split('/[\s?:;,.]+/', $removedigit, -1, PREG_SPLIT_NO_EMPTY)));
$removedoublesyllable=preg_replace("/(.*)(\\1+)/", "", $removerepeatword ); //wkwk hahaha
$removeminimumthreechar = preg_replace("/\b\w{1,3}\b/", " ", $removedoublesyllable);
$removenoiseword = addspaces( $removeminimumthreechar ); // " saya ... bb "
$final=trim( str_replace($rejectarray, " ",$removenoiseword) );
//say
echo $final;
It seems that you are exceeding the maximum execution time allowed by the web server.
Try using set_time_limit:
set_time_limit(0);
Or the script may be using too much memory; in that case, raise memory_limit in php.ini to something reasonable for your case, like
memory_limit = 128M
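Both limits can also be adjusted from inside the script, which is handy when php.ini cannot be edited. The following is a minimal sketch, assuming the host allows these settings to be changed at runtime; the values are examples only.

```php
<?php
// Sketch only: the limits below are example values, not recommendations.

// Remove the execution time limit for this script (0 = unlimited).
// A finite value such as set_time_limit(300) is usually safer on a shared server.
set_time_limit(0);

// Raise the memory limit for this request only, without editing php.ini.
ini_set('memory_limit', '128M');

// Print what is actually in effect.
echo 'max_execution_time: ', ini_get('max_execution_time'), "\n";
echo 'memory_limit: ', ini_get('memory_limit'), "\n";
```

If the script still produces no output after this, it is worth checking the PHP error log, since a limit is only one possible cause.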
Related
I'm trying to achieve the following with PHP:
sample#gmail.com => s*****#gmail.com
sa#yahoo.com => **#yahoo.com
sampleaddress#hotmail.com => samplead*****#hotmail.com
I want to hide the last five characters of the portion before the '#'.
I could write long code to do this by splitting and then replacing based on lengths, but I'm sure there must be an easy way to do this using PHP functions. Any help, please?
UPDATE:
I'm adding my code here. I'm sure it's not efficient, which is why I'm asking here.
$email = 'sampleuser#gmail.com';
$star_string = '';
$expl_set = explode('#',$email);
if(strlen ($expl_set[0]) > 5){$no_stars = 5; }else{$no_stars = strlen ($expl_set[0]); }
for($i=0;$i<$no_stars; $i++)
{
$star_string.='*';
}
$masked_email = substr($expl_set[0], 0, -5).$star_string.'#'.$expl_set[1];
You can wrap it in a function, making it easier to call multiple times.
Basically, split the address into the local part and the domain, then replace the last $masks characters of the local part (default 5) with *. If the local part is shorter than $masks, the whole local part is masked.
function mask_email($email, $masks = 5) {
$array = explode("#", $email);
$string_length = strlen($array[0]);
if ($string_length < $masks)
$masks = $string_length;
$result = substr($array[0], 0, -$masks) . str_repeat('*', $masks);
return $result."#".$array[1];
}
The above would be used like this
echo mask_email("test#test.com")."\n";
echo mask_email("longeremail#test.com");
which would output this
****#test.com
longer*****#test.com
You can also specify the number you want filtered by using the second parameter, which is optional.
echo mask_email("longeremail#test.com", 2); // Output: longerema**#test.com
I operate an archive of e-mail for a law firm that receives mail from Postfix and uses a PHP script to insert messages into a database. This works mostly fine but sometimes the regular expression I use to parse e-mail addresses from the From, To, and Cc headers does not capture e-mail addresses with 100% accuracy. I have tried the other solutions posited here on stackoverflow (using filter_var(), using imap_rfc822_parse_adrlist, using the regex in question 1028553) with actually less success than what I have.
I am looking to minimize system calls (I use way too many pregs right now) and increase accuracy. The current function takes an input of header text (the From, To, or Cc fields) and returns "clean" e-mail addresses stripped of brackets, quotes, comments, etc.
Any help anyone can provide would be appreciated, as I am stumped!
Wendy
My function:
function return_proper ($email_string) {
if (is_array($email_string)) {
$x = "";
foreach ($email_string as $val) {
$x .= "$val,";
}
$email_string = substr($x, 0, -1);
}
$email_string = strtolower(preg_replace('/.*?([A-Za-z0-9\_\+\.\'-]+#[A-Za-z0-9\.-]+).*?/', '$1,', $email_string));
$email_string = preg_replace('/\>/', "", $email_string);
$email_string = preg_replace('/,$/', "", $email_string);
$email_string = preg_replace('/^\'/', "", $email_string);
return $email_string;
}
function return_proper($email_string) {
if (is_array($email_string)) {
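// Deal with array: clean each element recursively
$results = array(); // initialise so an empty input array does not break implode()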
foreach ($email_string as $email_string_line) {
$results[] = return_proper($email_string_line);
}
$result = implode(',', $results);
} else {
preg_match_all('/[A-Za-z0-9\_\+\.\'-]+#[a-zA-Z0-9.-]+\.[a-zA-Z]+/', $email_string, $matches);
$result = implode(',', $matches[0]);
}
return strtolower($result);
}
I'm trying to get a "live" progress indicator working on my php CLI app. Rather than outputting as
1Done
2Done
3Done
I would rather it cleared and just showed the latest result. system("command \C CLS") doesn't work. Nor does ob_flush(), flush(), or anything else that I've found.
I'm running Windows 7 64-bit Ultimate. I noticed the command line outputs in real time, which was unexpected. Everyone warned me that it wouldn't... but it does... a 64-bit perk?
Cheers for the help!
I want to avoid echoing 24 new lines if I can.
Try outputting a line of text and terminating it with "\r" instead of "\n".
The "\n" character is a line-feed which goes to the next line, but "\r" is just a return that sends the cursor back to position 0 on the same line.
So you can:
echo "1Done\r";
echo "2Done\r";
echo "3Done\r";
etc.
Make sure to output some spaces before the "\r" to clear the previous contents of the line.
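The padding advice above can be sketched as follows. The helper name progress_line and the width of 40 columns are assumptions made for illustration; any width at least as long as the longest status line would do.

```php
<?php
// Hypothetical helper: pad the text to a fixed width so that leftovers
// from a longer previous line are overwritten, then return to column 0.
function progress_line(string $text, int $width = 40): string
{
    return str_pad($text, $width) . "\r";
}

foreach (['Processing 100 items', '50 left', 'Done'] as $status) {
    echo progress_line($status);
    usleep(200000); // 0.2 s pause so the change is visible
}
echo "\n"; // finish with a newline so the shell prompt starts on a fresh line
```

Without the padding, "Done" would be followed by the tail of the longer line that was printed before it.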
[Edit] Optional: Interested in some history & background? Wikipedia has good articles on "\n" (line feed) and "\r" (carriage return)
I came across this while searching for a multi-line solution to this problem. This is what I eventually came up with. You can use ANSI escape codes: http://www.inwap.com/pdp10/ansicode.txt
<?php
function replaceOut($str)
{
$numNewLines = substr_count($str, "\n");
echo chr(27) . "[0G"; // Set cursor to first column
echo $str;
echo chr(27) . "[" . $numNewLines ."A"; // Set cursor up x lines
}
while (true) {
replaceOut("First Ln\nTime: " . time() . "\nThird Ln");
sleep(1);
}
?>
I recently wrote a function that will also keep track of the number of lines it last output, so you can feed it arbitrary string lengths, with newlines, and it will replace the last output with the current one.
With an array of strings:
$lines = array(
'This is a pretty short line',
'This line is slightly longer because it has more characters (i suck at lorem)',
'This line is really long, but I an not going to type, I am just going to hit the keyboard... LJK gkjg gyu g uyguyg G jk GJHG jh gljg ljgLJg lgJLG ljgjlgLK Gljgljgljg lgLKJgkglkg lHGL KgglhG jh',
"This line has newline characters\nAnd because of that\nWill span multiple lines without being too long",
"one\nmore\nwith\nnewlines",
'This line is really long, but I an not going to type, I am just going to hit the keyboard... LJK gkjg gyu g uyguyg G jk GJHG jh gljg ljgLJg lgJLG ljgjlgLK Gljgljgljg lgLKJgkglkg lHGL KgglhG jh',
"This line has newline characters\nAnd because of that\nWill span multiple lines without being too long",
'This is a pretty short line',
);
One can use the following function:
function replaceable_echo($message, $force_clear_lines = NULL) {
static $last_lines = 0;
if(!is_null($force_clear_lines)) {
$last_lines = $force_clear_lines;
}
$term_width = exec('tput cols', $toss, $status);
if($status) {
$term_width = 64; // Arbitrary fall-back term width.
}
$line_count = 0;
foreach(explode("\n", $message) as $line) {
$line_count += count(str_split($line, $term_width));
}
// Erasure MAGIC: Clear as many lines as the last output had.
for($i = 0; $i < $last_lines; $i++) {
// Return to the beginning of the line
echo "\r";
// Erase to the end of the line
echo "\033[K";
// Move cursor Up a line
echo "\033[1A";
// Return to the beginning of the line
echo "\r";
// Erase to the end of the line
echo "\033[K";
// Return to the beginning of the line
echo "\r";
// Can be consolidated into
// echo "\r\033[K\033[1A\r\033[K\r";
}
$last_lines = $line_count;
echo $message."\n";
}
In a loop:
foreach($lines as $line) {
replaceable_echo($line);
sleep(1);
}
And all lines replace each other.
The name of the function could use some work, just whipped it up, but the idea is sound. Feed it an (int) as the second param and it will replace that many lines above instead. This would be useful if you were printing after other output, and you didn't want to replace the wrong number of lines (or any, give it 0).
Dunno, seemed like a good solution to me.
I make sure to echo the ending newline so that it allows the user to still use echo/print_r without killing the line (use the override to not delete such outputs), and the command prompt will come back in the correct place.
I know the question isn't strictly about how to clear a SINGLE LINE in PHP, but this is the top Google result for "clear line cli php", so here is how to clear a single line:
function clearLine()
{
echo "\033[2K\r";
}
function clearTerminal () {
DIRECTORY_SEPARATOR === '\\' ? popen('cls', 'w') : exec('clear');
}
Tested on Windows 7 with PHP 7. The Linux branch should work, according to other users' reports.
Something like this:
for ($i = 0; $i <= 100; $i++) {
echo "Loading... {$i}%\r";
usleep(10000);
}
Use this command to clear the CLI:
echo chr(27).chr(91).'H'.chr(27).chr(91).'J'; //^[H^[J
Console functions are platform dependent and as such PHP has no built-in functions to deal with this. system and other similar functions won't work in this case because PHP captures the output of these programs and prints/returns them. What PHP prints goes to standard output and not directly to the console, so "printing" the output of cls won't work.
<?php
error_reporting(E_ERROR | E_WARNING | E_PARSE);
function bufferout($newline, $buffer=null){
$count = strlen(rtrim($buffer));
$buffer = $newline;
if(($whilespace = $count-strlen($buffer))>=1){
$buffer .= str_repeat(" ", $whilespace);
}
return $buffer."\r";
};
$start = "abcdefghijklmnopqrstuvwxyz0123456789";
$i = strlen($start);
while ($i >= 0){
$new = substr($start, 0, $i);
if(!empty($old)){ // empty() avoids an undefined-variable warning on the first pass
echo $old = bufferout($new, $old);
}else{
echo $old = bufferout($new);
}
sleep(1);
$i--;
}
?>
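A simple implementation of @dkamins' answer. It works well. It's a bit hack-ish, but it does the job. It won't work across multiple lines.
// Assigned to a variable (the name is chosen here for illustration) so the closure can be called
$clearLines = function (int $count = 1) {
foreach (range(1,$count) as $value){
echo "\r\x1b[K"; // return to column 0 and erase the current line
echo "\033[1A\033[K"; // move the cursor up one line and erase it
}
};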
Unfortunately, PHP 8.0.2 does not have a built-in function for this. However, if you just want to clear the console, try print("\033[2J\033[;H"); or use popen('cls', 'w');
It works in PHP 8.0.2 on Windows 10. It is the same as calling system('cls') in C.
Tried some of solutions from answers:
<?php
...
$messages = [
'11111',
'2222',
'333',
'44',
'5',
];
$endlines = [
"\r",
"\033[2K\r",
"\r\033[K\033[1A\r\033[K\r",
chr(27).chr(91).'H'.chr(27).chr(91).'J',
];
foreach ($endlines as $i=>$end) {
foreach ($messages as $msg) {
output()->write("$i. ");
output()->write($msg);
sleep(1);
output()->write($end);
}
}
And \033[2K\r seems to work correctly.
I wrote a script that sends chunks of text off to Google to translate, but sometimes the text (which is HTML source code) ends up being split in the middle of an HTML tag, and Google returns the code incorrectly.
I already know how to split the string into an array, but is there a better way to do this while ensuring the output string does not exceed 5000 characters and does not split on a tag?
UPDATE: Thanks to the answer, this is the code I ended up using in my project, and it works great:
function handleTextHtmlSplit($text, $maxSize) {
//our collection array
$niceHtml[] = '';
// Splits on tags, but also includes each tag as an item in the result
$pieces = preg_split('/(<[^>]*>)/', $text, -1, PREG_SPLIT_DELIM_CAPTURE);
//the current position of the index
$currentPiece = 0;
//start assembling a group until it gets to max size
foreach ($pieces as $piece) {
//make sure string length of this piece will not exceed max size when inserted
if (strlen($niceHtml[$currentPiece] . $piece) > $maxSize) {
//advance current piece
//will put overflow into next group
$currentPiece += 1;
//create empty string as value for next piece in the index
$niceHtml[$currentPiece] = '';
}
//insert piece into our master array
$niceHtml[$currentPiece] .= $piece;
}
//return array of nicely handled html
return $niceHtml;
}
Note: haven't had a chance to test this (so there may be a minor bug or two), but it should give you an idea:
function get_groups_of_5000_or_less($input_string) {
// Splits on tags, but also includes each tag as an item in the result
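$pieces = preg_split('/(<[^>]*>)/', $input_string,
-1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
$groups[] = '';
$current_group = 0;
// Compare against null so that a piece such as "0" does not end the loop early
while (($cur_piece = array_shift($pieces)) !== null) {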
$piecelen = strlen($cur_piece);
if(strlen($groups[$current_group]) + $piecelen > 5000) {
// Adding the next piece whole would go over the limit,
// figure out what to do.
if($cur_piece[0] == '<') {
// Tag goes over the limit, just put it into a new group
$groups[++$current_group] = $cur_piece;
} else {
// Non-tag goes over the limit, split it and put the
// remainder back on the list of un-grabbed pieces
$grab_amount = 5000 - strlen($groups[$current_group]);
$groups[$current_group] .= substr($cur_piece, 0, $grab_amount);
$groups[++$current_group] = '';
array_unshift($pieces, substr($cur_piece, $grab_amount));
}
} else {
// Adding this piece doesn't go over the limit, so just add it
$groups[$current_group] .= $cur_piece;
}
}
return $groups;
}
Also note that this can split in the middle of regular words - if you don't want that, then modify the part that begins with // Non-tag goes over the limit to choose a better value for $grab_amount. I didn't bother coding that in since this is just supposed to be an example of how to get around splitting tags, not a drop-in solution.
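One way to pick a better value for $grab_amount is to back up to the last space inside the allowed range. The sketch below is an illustration only: word_boundary_cut is a hypothetical helper that is not part of the answer's code, and it treats a plain space as the only word separator.

```php
<?php
// Hypothetical helper: return how many bytes of $piece to take so that the
// cut is at most $limit bytes long and, where possible, ends at a space.
function word_boundary_cut(string $piece, int $limit): int
{
    if (strlen($piece) <= $limit) {
        return strlen($piece);          // the whole piece fits
    }
    if ($piece[$limit] === ' ') {
        return $limit;                  // the cut already falls between words
    }
    $pos = strrpos(substr($piece, 0, $limit), ' ');
    // No space in range: fall back to a hard cut at the limit.
    return $pos === false ? $limit : $pos + 1;
}

$text = 'hello world foo';
$cut  = word_boundary_cut($text, 8);
echo substr($text, 0, $cut), '|', substr($text, $cut), "\n"; // prints "hello |world foo"
```

In the function above, the result would replace the plain subtraction, for example by passing the remaining room in the current group as $limit.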
Why not strip the HTML tags from the string before sending it to Google? PHP has a strip_tags() function that can do this for you.
preg_split with a good regex would do it for you.
I have a question about str_replace in PHP. When I do:
$latdir = $latrichting.$Lat;
If (preg_match("/N /", $latdir)) {
$Latcoorl = str_replace(" N ", "+",$latdir);
}
else {
$Latcoorl = str_replace ("S ", "-",$latdir);
}
print_r($latdir);
print_r($Latcoorl);
print_r($latdir); gives :N52.2702777778
but print_r ($Latcoorl); gives :N52.270277777800000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
Yes, it adds a lot of zeros. Can someone explain this behavior, just for the fun of it?
print_r ($latrichting);
gives: N
print_r ($Lat);
This gives the weird long number.
So it's probably not the str_replace call, do you think?
$latmin2 = bcdiv($latsec, 60, 20);
$latmin_total = $latmin + $latmin2;
$lat = bcdiv($latmin_total, 60, 20);
$latdir = array("N" => 1, "S" => -1);
$latcoorl = $lat * $latdir[$latrichting];
Happy New Year.
Your string replace search string has a space before the 'N' while the dumped value looks like it's N:
Not sure what it has to do with all the zeros though.
On my system this code fragment:
<?php
$latdir = ':N52.2702777778';
If (preg_match("/N /", $latdir)) {
$Latcoorl = str_replace(" N ", "+",$latdir);
}
else {
$Latcoorl = str_replace ("S ", "-",$latdir);
}
print_r($latdir);
print_r($Latcoorl);
?>
gives the following result:
:N52.2702777778:N52.2702777778
My best guess is that something after this code prints out a series of 0s.
How I would do it; just a variation of Anthony's original answer that keeps everything as numeric and doesn't lapse into string mode.
$Latcoorl = ($latrichting == "N") ? ($Lat) : (-1 * $Lat);
The string operations you did won't generate any 0s.
The 0s have to come from $lat. What did you do with $lat? Any division by pi? PHP will try to store the most accurate possible float in $lat. That's not really a problem; it's correct behavior. Just truncate or round the number when displaying it.
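Rounding for display can be sketched like this. The value is the latitude quoted in the question, and the numbers of decimals are arbitrary choices for illustration.

```php
<?php
// Example value only; 52.2702777778 is the latitude quoted in the question.
$lat = 52.27027777780000;

echo round($lat, 4), "\n";                    // numeric rounding, the result is still a float
echo number_format($lat, 6, '.', ''), "\n";   // fixed 6 decimals, returned as a string
echo sprintf('%.10F', $lat), "\n";            // fixed 10 decimals, locale-independent
```

round() is the one to use when the value is needed for further arithmetic, while number_format() and sprintf() are meant for output only.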