Related
I pass a PHP string to wikipedia search page in order to retrieve part of the definition.
Everythin works fine, except unicode chars which appear in the \u... form. Here is an example to explain myself better. As you can see, the phonetic transcription of the name is not readable:
Henrik Ibsen, Henrik Ibsen \u02c8h\u025bn\u027eik \u02c8ips\u0259n
(Skien, 20 marzo 1828 - Oslo, 23 maggio 1906) è stato uno scrittore,
drammaturgo, poeta e regista teatrale norvegese.
The code I use to get the snippet from Wikipedia is this:
$word = $_GET["word"];
$html = file_get_contents('https://it.wikipedia.org/w/api.php?action=opensearch&search='.$word);
$utf8html = html_entity_decode(preg_replace("/U\+([0-9A-F]{4})/", "&#x\\1;", $html), ENT_NOQUOTES, 'UTF-8');
The last line of my code does not solve the problem.
Do you know how to get a clean text which is entirely readable?
The output of the Wikipedia search API is JSON. Don't try to scrape bits out of it and parse string literal escapes yourself, that way madness lies. Just use a readily available JSON parser.
Also, you need to URL-escape the word when you add it into a query string, otherwise any searches for words with URL-special characters in will fail.
In summary:
$word = $_GET['word'];
$url = 'https://it.wikipedia.org/w/api.php?action=opensearch&search='.urlencode($word);
$response = json_decode(file_get_contents($url));
$matching_titles_array = $response[1];
$matching_summaries_array = $response[2];
$matching_urls = $response[3];
...etc...
You got some errors in your regex string, try using:
<?php
$str = "Henrik Ibsen, Henrik Ibsen \u02c8h\u025bn\u027eik \u02c8ips\u0259n(Skien, 20 marzo 1828 - Oslo, 23 maggio 1906) è stato uno scrittore, drammaturgo, poeta e regista teatrale norvegese.";
$utf8html = preg_replace('#\\\U([0-9A-F]{4})#i', "&#x\\1", $str);
echo $utf8html;
Well, the answer posted by bobince is certainly more effective than my previous procedure, which aimed at scraping and pruning bit by bit what I needed. Just to show you how I was doing it, here is my previous code:
$html = file_get_contents('https://it.wikipedia.org/w/api.php?action=opensearch&search='.$s);
$decoded = preg_replace('#\\\U([0-9A-F]{4})#i', "&#x\\1", $html);
$par = array("[", "]");
$def_no_par = str_replace($par, "", $decoded);
$def_no_vir = str_replace("\"\",", "", $def_no_par);
$def_cap = str_replace("\",", "\",<br>", $def_no_vir);
$def_pulita = str_replace("\"", "", $def_cap);
$def_clean = str_replace(".,", ".", $def_pulita);
$definizione = str_replace("$s,", "", $def_clean);
$out = str_replace("\\", "\"", $definizione);
As you can see, removing parts of the output to make it more readable was quite tiresome (and not completely successful).
Using the JSON approach makes everything more linear. Here is my new workaround:
$search = 'https://it.wikipedia.org/w/api.php?action=opensearch&search='.urlencode($s);
$response = json_decode(file_get_contents($search));
$matching_titles_array = $response[1];
$matching_summaries_array = $response[2];
$matching_urls = $response[3];
echo '<h3><div align="center"><font color=" #A3A375">'.$titolo.'</font></div></h3><br><br>';
foreach($response[1] as $t) {
echo '<font color="#5C85D6"><b>'.$t.'</b></font><br><br>';
}
foreach($response[2] as $s) {
echo $s.'<br><br>';
}
foreach($response[3] as $l) {
$link = preg_replace('!(((f|ht)tp(s)?://)[-a-zA-Zа-яА-Я()0-9#:%_+.~#?&;//=]+)!i', '$1', $l);
echo $link.'<br><br>';
}
The advantage is that now I can manipulate the arrays as I wish.
You can see it in action here:
I'm counting words in an article and removing common words such as "and" or "the".
I"m removing them by use of preg_replace
after it is done I do a quick clean of extra white space by using.
$search_body = preg_replace('/\s+/',' ',$search_body);
However I've got some very stubborn white space that will not go away. I've tried
if($word == "" OR $word == " "){
//chop it's head off
}
But the if statement does not see $word as being just whitespace. I've also tried printing it to the screen to get the raw data type of it and it's still just showing up blank.
Here is the full regex that I'm using.
$pattern = array(
'/\"\;/',
'/[0-9]/',
'/\,/',
'/\./',
'/\!/',
'/\#/',
'/\#/',
'/\$/',
'/\%/',
'/\^/',
'/\&/',
'/\*/',
'/\(/',
'/\)/',
'/\_/',
'/\"/',
'/\'/',
'/\:/',
'/\;/',
'/\?/',
'/\`/',
'/\~/',
'/\[/',
'/\]/',
'/\{/',
'/\}/',
'/\|/',
'/\+/',
'/\=/',
'/\-/',
'/–/',
'/°/',
'/\bthe\b/',
'/\band\b/',
'/\bthat\b/',
'/\bhave\b/',
'/\bfor\b/',
'/\bnot\b/',
'/\bwith\b/',
'/\byou\b/',
'/\bthis\b/',
'/\bbut\b/',
'/\bhis\b/',
'/\bfrom\b/',
'/\bthey\b/',
'/\bsay\b/',
'/\bher\b/',
'/\bshe\b/',
'/\bwill\b/',
'/\bone\b/',
'/\ball\b/',
'/\bwould\b/',
'/\bthere\b/',
'/\btheir\b/',
'/\bwhat\b/',
'/\bout\b/',
'/\babout\b/',
'/\bwho\b/',
'/\bget\b/',
'/\bwhich\b/',
'/\bwhen\b/',
'/\bmake\b/',
'/\bcan\b/',
'/\blike\b/',
'/\btime\b/',
'/\bjust\b/',
'/\bhim\b/',
'/\bknow\b/',
'/\btake\b/',
'/\bpeople\b/',
'/\binto\b/',
'/\byear\b/',
'/\byour\b/',
'/\bgood\b/',
'/\bsome\b/',
'/\bcould\b/',
'/\bthem\b/',
'/\bsee\b/',
'/\bother\b/',
'/\bthan\b/',
'/\bthen\b/',
'/\bnow\b/',
'/\blook\b/',
'/\bonly\b/',
'/\bcome\b/',
'/\bits\b/', //it's?
'/\bover\b/',
'/\bthink\b/',
'/\balso\b/',
'/\bback\b/',
'/\bafter\b/',
'/\buse\b/',
'/\btwo\b/',
'/\bhow\b/',
'/\bour\b/',
'/\bwork\b/',
'/\bfirst\b/',
'/\bwell\b/',
'/\bway\b/',
'/\beven\b/',
'/\bnew\b/',
'/\bwant\b/',
'/\bbecause\b/',
'/\bany\b/',
'/\bthese\b/',
'/\bgive\b/',
'/\bday\b/',
'/\bmost\b/',
'/\bare\b/',
'/\bwas\b/',
'/\<\w+\>/', '/\<\/\w+\>/',
'/\b\w{1}\b/', //1 letter word
'/\b\w{2}\b/', //2 letter word
'/\//',
'/\</',
'/\>/'
);
$search_body = strip_tags($body);
$search_body = strtolower($search_body);
$search_body = preg_replace($pattern, ' ', $search_body);
$search_body = preg_replace('/\s+/',' ',$search_body);
$search_body = explode(" ", $search_body);
When exploded blank values show up left and right
Example text that I am using is too long to post here. But I copied and pasted
This article to give it a test and it showed 32 counts of white space, not including the white space in front of or behind of other words even after using trim().
Here's a js.fiddle of the raw data that is being handled by php.
htmlentities and htmlspecialchars also show nothing.
Here's the code counts all the values and puts them into one.
$inhere = array();
$body_hold = array();
foreach($search_body as $value){
$value = trim($value);
if(in_array($value, $inhere) && $value != ""){
$key = array_search($value, $inhere);
$body_hold[$key]['count'] = $body_hold[$key]['count']+1;
}elseif($value != ""){
$inhere[] = $value;
$body_hold[] = array(
'count' => 1,
'word' => $value
);
}
}
rsort($body_hold);
Basic foreach to see values.
foreach($body_hold as $value){
$count = $value['count'];
$word = trim($value['word']);
echo "Count: ".$count;
echo " Word: ".$word;
echo '<br>';
}
Here's a PHP example of what it's returning
Are you sure you put the exact same data you're processing in the js.fiddle? Or did you get it from a subsequent post-processed step?
It's obviously a Wikipedia article. I went to that article on Wikipedia and opened it in Edit mode, and saw that there are s in the raw wikitext. However, those nbsp's don't appear in your js.fiddle data.
TL;DR: Check for in your processing (and convert to spaces, etc.).
This character 160 looks like space but it's not, replacing all of them to the regular spaces (32) and then removing all the double spaces will fix your problem.
$search_body = str_replace(chr(160), chr(32), $search_body);
$search_body = trim(preg_replace('/\s+/', ' ', $search_body));
I have this text : http://pastebin.com/2Zgbs7hi
And i want to be able to remove the HTML code from it and just display the plain text but i want to keep at least one line break where there are currently a few line breaks
i have tried:
$ticket["summary"] = 'pastebin example';
$TicketSummaryDisplay = nl2br($ticket["summary"]);
$TicketSummaryDisplay = stripslashes($TicketSummaryDisplay);
$TicketSummaryDisplay = trim(strip_tags($TicketSummaryDisplay));
$TicketSummaryDisplay = preg_replace('/\n\s+$/m', '', $TicketSummaryDisplay);
echo $TicketSummaryDisplay;
that is displaying as plain text, but it shows it all as one big block of text with no line breaks at all
Maybe this will earn you some time.
<?php
libxml_use_internal_errors(true); //crazy o tags
$html = file_get_contents('http://pastebin.com/raw.php?i=2Zgbs7hi');
$dom = new DOMDocument;
$dom->loadHTML($html);
$result='';
foreach ($dom->getElementsByTagName('p') as $node) {
if (strstr($node->nodeValue, 'Legal Disclaimer:')){
break;
}
$result .= $node->nodeValue;
}
echo $result;
This example should successfully store text from html into an array of strings.
After stripping all the tags, you can use preg_split with \R special character ( matches any newline sequence ) to convert string into array. That array will now have several blank values, and there will be also some amount of html non-breaking space entities, so we will check the array for empty values with array_filter() function ( it will remove all items that do not satisfy the filter conditions, in our case, an empty value ). Here are a problem with entity, because and space characters are not the same, they have different ASCII code, so trim() function will not remove spaces. Here are two possible solutions, the first uncommented part will only replace   and check for white space characters, while the second commented one will decode all html entities and also check for spaces.
PHP:
$text = file_get_contents( 'http://pastebin.com/raw.php?i=2Zgbs7hi' );
$text = strip_tags( $text );
$array = array_filter(
preg_split( '/\R/', $text ),
function( &$item ) {
$item = str_replace( ' ', ' ', $item );
return trim( $item );
// $item = html_entity_decode( $item );
// return trim( str_replace( "\xC2\xA0", ' ', $item ) );
}
);
foreach( $array as $value ) {
echo $value . '<br />';
}
Array output:
Array
(
[8] => Hi,
[11] => Ashley has explained that I need to ask for another line and broadband for the wifi to work, please can you arrange this.
[13] => Regards
[23] => Legal Disclaimer:
[24] => This email and its attachments are confidential. If you received it by mistake, please don’t share it. Let us know and then delete it. Its content does not necessarily represent the views of The Dragon Enterprise
[25] => Centre and we cannot guarantee the information it contains is complete. All emails are monitored and may be seen by another member of The Dragon Enterprise Centre's staff for internal use
)
Now you should have clear array with only items with value in it. By the way, newlines in HTML are expressed through <br />, not through \n, your example as response in a web browser still has them, but they are only visible in page source code. I hope I did not missed the point of the question.
try this get text output with line brakes
<?php
$ticket["summary"] = file_get_contents('http://pastebin.com/raw.php?i=2Zgbs7hi');
$TicketSummaryDisplay = nl2br($ticket["summary"]);
echo strip_tags($TicketSummaryDisplay,'<br>');
?>
You are asking on how to add line-breaks to your "one big block of text with no line breaks at all".
Short answer
After you stripped the HTML tags, apply wordwrap with a desired text-block length
$text = wordwrap($text, 90, "<br />\n");
I really wonder, why nobody suggested that function before.
there is also chunk_split around, which doesn't take words into account and just splits after a certain number of chars. breaking words - but that's not what you want, i guess.
PHP
<?php
$text = file_get_contents('http://pastebin.com/raw.php?i=2Zgbs7hi');
/**
* Returns string without html tags, also
* removes takes control chars, spaces and " " into account.
*/
function dropHtmlTags($string) {
// remove html tags
//$string = preg_replace ('/<[^>]*>/', ' ', $string);
$string = strip_tags($string);
// control characters and " "
$string = str_replace("\r", '', $string); // remove
$string = str_replace("\n", ' ', $string); // replace with space
$string = str_replace("\t", ' ', $string); // replace with space
$string = str_replace(" ", ' ', $string);
// remove multiple spaces
$string = preg_replace('/ {2,}/', ' ', $string);
$string = trim($string);
return $string;
}
$text = dropHtmlTags($text);
// The Answer: insert line breaks after 95 chars,
// to get rid of the "one big block of text with no line breaks at all"
$text = wordwrap($text, 95, "<br />\n");
// if you want to insert line-breaks before the legal disclaimer,
// uncomment the next line
//$text = str_replace("Regards Legal Disclaimer", "<br /><br />Regards Legal Disclaimer", $text);
echo $text;
?>
Result
first section shows your text block
second section shows the text with wordwrap applied (code from above)
Hello it can be done as follows:
$abc= file_get_contents('http://pastebin.com/raw.php?i=2Zgbs7hi');
$abc = strip_tags("\n", $abc);
echo $abc;
Please, let me know whether it works
you may use
<?php
$a= file_get_contents('a.txt');
echo nl2br(htmlspecialchars($a));
?>
<?php
$handle = #fopen("pastebin.html", "r");
if ($handle) {
while (!feof($handle)) {
$buffer = fgetss($handle, 4096);
echo $buffer;
}
fclose($handle);
}
?>
output is
Hi,
Ashley has explained that I need to ask for another line and broadband for the wifi to work, please can you arrange this.
Regards
Legal Disclaimer:
This email and its attachments are confidential. If you received it by mistake, please don’t share it. Let us know and then delete it. Its content does not necessarily represent the views of The Dragon Enterprise
Centre and we cannot guarantee the information it contains is complete. All emails are monitored and may be seen by another member of The Dragon Enterprise Centre's staff for internal use
You can probably write additional code to convert to spaces etc.
I'm not sure I did understand everything correctly but this seems to be your expected result:
$txt = file_get_contents('http://pastebin.com/raw.php?i=2Zgbs7hi');
var_dump(preg_replace("/(\ \;(\s{1,})?)+/", "\n", trim(strip_tags(preg_replace("/(\s){1,}/", " ", $txt)))));
//more readable
$txt = preg_replace("/(\s){1,}/", " ", $txt);
$txt = trim(strip_tags($txt));
$txt = preg_replace("/(\ \;(\s{1,})?)+/", "\n", $txt);
The strip_tags() function strips HTML and PHP tags from a string, if that is what you are trying to accomplish.
Examples from the docs:
<?php
$text = '<p>Test paragraph.</p><!-- Comment --> Other text';
echo strip_tags($text);
echo "\n";
// Allow <p> and <a>
echo strip_tags($text, '<p><a>');
?>
The above example will output:
Test paragraph. Other text
<p>Test paragraph.</p> Other text
I have this code:
$string = 'The quick brown fox jumped over the lazy dog and lived to tell about it to his crazy moped.';
$text = explode("#", str_replace(" ", " #", $string)); //ugly trick to preserve space when exploding, but it works (faster than preg_split)
foreach ($text as $value) {
echo preg_replace_callback("/(.*p.*e.*d.*|.*a.*y.*)/", function ($matches) {
return " <strong>".$matches[0]."</strong> ";
}, $value);
}
The point of it is to be able to enter a sequence of characters (in the code above it's a fixed pattern), and it finds and highlights those characters in the matched word. The code I have now highlights the entire word. I'm looking for the most efficient way of highlighting the characters.
The result of the current code:
The quick brown fox jumped over the lazy dog and lived to tell about it to his crazy moped.
What I would like to have:
The quick brown fox jumped over the lazy dog and lived to tell about it to his crazy moped.
Did I take the wrong approach? It would be awesome if someone could point me in the right way, I've been searching for hours and didn't find what I was looking for.
EDIT 2:
Divaka's been a great help. Almost there.. I apologize if I haven't been clear enough on what my goal is. I will try to explain further.
- Part A -
One of the things I will be using this code for is a phone book. A simple example:
When following characters are entered:
Jan
I need it to match following examples:
Jan Verhoeven
Arjan Peters
Raj Naren
Jered Von Tran
The problem is that I will be iterating over the entire phone book, person-record per person-record. Each person also has email-addresses, a postal address, maybe a website, a extra note, ect.. This means that the text I'm actually search can contain anything from letters, numbers, special characters(&#()%_- etc..), newlines, and most importantly spaces. So an entire record (csv) might contain the following info:
Name;Address;Email address;Website;Note
Jan Verhoeven;Veldstraat 2a, 3209 Herkstad;jan#werk.be;www.janophetwerk.be,jan#telemet.be;Jan die ik ontmoet heb op de bouwbeurs.\n Zelfstandige vertegenwoordiger van bouwmaterialen.
Raj Naren;Kerklaan 334, 5873 Biep;raj#werk.be;;Rechtstreekse contactpersoon bij Werk.be (#654 intern)
The \n is meant to be an actual newline. So if I search for #werk.be, I'd like to see both these records as a result.
- Part B -
Something else I want to use this for is searching song-texts. When I'm looking for a song and I can only remember it had to do something with ducks or docks and a circle, I would enter dckcircle and get the following result:
... and the ducks were all dancing in a great big circle, around the great big bonfire ...
To be able to fine-tune the searching I'd like to be able to limit the number of spaces (or any other character), because I would imagine it finding a simple pattern like eve in every song while I'm only looking for a song that has the exact word eve in it.
- Conclusion -
If I summarize this in pseudo-regex, for a search pattern abc with a max of 3 spaces in-between it would be something like this: (I might be totally off here)
(a)(any character, max 3 spaces)(b)(any character, max 3 spaces)(c)
Or more generic:
(a)({any character}{these characters with a limit of 3})(b)({any character}{these characters with a limit of 3})(c)
This can even be extended to this fairly easily I'm guessing:
(a)({any character}{these characters with a limit of 3}{not these characters})(b)({any character}{these characters with a limit of 3}{not these characters})(c)
(I know the ´{}´ brackets are not to be used that way in a regular expression, but I don't know how else to put it without using a character that has a meaning in regular expressions.)
If anyone wonders, I know the sql like statement would be able to do 80% (I'm guessing, might even be more) of what I'm trying to do, but I'm trying to avoid using a database to make this as portable as possible.
When the correct answer has been found, I'll clean this question (and the code) up and post the resulting php-class here (maybe I'll even put it up on github if that would be useful), so anyone looking for the same will have a fully working class to work with :).
I've came up with this. Tell me if it's what you want!
//$string = "The quick brown fox jumped over the lazy dog and lived to tell about it to his crazy moped.";
$string = "abcdefo";
//$pattern_array1 = array(a,y);
//$pattern_array2 = array(p,e,d);
$pattern_array1 = array(e,f);
$pattern_array2 = array(o);
$pattern_array2 = array(a,f);
$number_of_patterns = 2;
$regexp1 = generate_regexp($pattern_array1, 1);
$regexp2 = generate_regexp($pattern_array2, 2);
$string = preg_replace($regexp1["pattern"], $regexp1["replacement"], $string);
$string = preg_replace($regexp2["pattern"], $regexp2["replacement"], $string);
$string = transform_multimatched_chars($string);
// transforming other chars after transforming the multimatched ones
for($i = 1; $i <= $number_of_patterns; $i++) {
$string = str_replace("#{$i}", "<strong>", $string);
$string = str_replace("#/{$i}", "</strong>", $string);
}
echo $string;
function generate_regexp($pattern_array, $pattern_num) {
$regexp["pattern"] = "/";
$regexp["replacement"] = "";
$i = 0;
foreach($pattern_array as $key => $char) {
$regexp["pattern"] .= "({$char})";
$regexp["replacement"] .= "#{$pattern_num}\$". ($key + $i+1) . "#/{$pattern_num}";
if($key < count($pattern_array) - 1) {
$regexp["pattern"] .= "(?s)((?:(?!{$pattern_array[$key + 1]})(?!\s).)*)";
$regexp["replacement"] .= "\$".($key + $i+2) . "";
}
$i = $key + 1;
}
$regexp["pattern"] .= "/";
return $regexp;
}
function transform_multimatched_chars($string)
{
preg_match_all("/((#[0-9]){2,})(.*)((#\/[0-9]){2,})/", $string, $matches);
// change this for your purposes
$start_replacement = '<span style="color:red;">';
$end_replacement = '</span>';
foreach($matches[1] as $key => $match)
{
$string = str_replace($match, $start_replacement, $string);
$string = str_replace($matches[4][$key], $end_replacement, $string);
}
return $string;
}
Provinces is a group_concat of all the individual records that contain province, some of which are blank.
So, when I encode:
$provinces = ($row['provinces']);
echo "<td>".wordwrap($provinces, 35, "<br />")."</td>";
This is what the result looks like:
Minas Gerais,,,Rio Grande do
Sul,Santa Catarina,Paraná,São Paulo
However, when I try to preg_replace out some of the nulls, and add some spaces with this expression:
$provinces = preg_replace($patterns,
$replaces, ($row['provinces']));
echo "<td>".wordwrap($provinces, 35, "<br />")."</td>";`
This is what I get!!! :(
Minas Gerais, Rio Grande do
Sul, Santa
Catarina, Paraná, São Paulo
The output is very unnatural looking.
BTW: Here are the search and replace arrays:
$patterns[0] = '/,,([,]+)?/'; $replaces[0] = ', ';
$patterns[1] = '/^,/'; $replaces[1] = '';
$patterns[2] = '/,$/'; $replaces[2] = '';
$patterns[3] = '/\b,\b/'; $replaces[3] = ', ';
$patterns[4] = '/\s,/'; $replaces[4] = ', ';
UPDATE: I even tried to change Paraná to Parana
Minas Gerais, Rio Grande do
Sul, Santa
Catarina, Parana, São
Paulo
Don't use as the replacement. wordwrap() considers that 6 characters. It doesn't interpret the HTML entity. That's why your lines are breaking funny. If you want replace spaces after you wordwrap()
Also, your first pattern should be:
// match one or more commas together
$patterns[0] = '/,+/';
Is the wordwrap() really necessary? It sounds like you are rendering this content into a table cell of some fixed width and you don't want individual entries to split across lines.
If this inference is correct - and if none of your entries is actually so long that forcing it to a single line will break your layout - then how about this: explode() on commas into an array, remove the whitespace-only entries, replace normal spaces in each array entry with , and implode() back on , (a comma followed by a space). Then let the rendering browser break lines wherever it needs.