Parse words in PHP


From a string of words, can I get only the words with a capitalized first letter? For example, I have this string:
Page and Brin originally nicknamed THEIR new search engine "BackRub",
because the system checked backlinks to estimate the importance of a
site.
I need to get: Page, Brin, THEIR, BackRub

A non-regex solution (based on Mark Baker's comment):
$result = array_filter(str_word_count($str, 1), function($item) {
return ctype_upper($item[0]);
});
print_r($result);
Output:
Array
(
[0] => Page
[2] => Brin
[5] => THEIR
[9] => BackRub
)

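You can match that with
preg_match('/\b[A-Z][a-zA-Z]*/', $searchText)
The \b keeps the match at the start of a word, and the class is written [a-zA-Z] because the range A-z would also match [, \, ], ^, _ and `.
See php.net for how preg_match can be applied:
http://ca1.php.net/preg_match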
EDIT, TO ADD EXAMPLE
Here's an example of how to get the array with full matches
$searchText = 'Page and Brin originally nicknamed THEIR new search engine "BackRub", because the system checked backlinks to estimate the importance of a site.';
preg_match_all('/\b[A-Z][a-zA-Z]*/', $searchText, $matches);
var_dump( $matches );
The output is:
array(1) {
[0]=>
array(4) {
[0]=>
string(4) "Page"
[1]=>
string(4) "Brin"
[2]=>
string(5) "THEIR"
[3]=>
string(7) "BackRub"
}
}

The way I would do it is to explode by space, ucfirst each piece, and check it against the original.
Here is what I mean:
$str = 'Page and Brin originally nicknamed THEIR new search engine "BackRub", because the system checked backlinks to estimate the importance of a site.';
$out = array();
foreach (explode(' ', $str) as $s) {
    // strip surrounding punctuation so that "BackRub", is compared as BackRub
    $s = trim($s, '",.');
    // keep the word when upper-casing its first character changes nothing
    if ($s !== '' && $s === ucfirst($s)) {
        $out[] = $s;
    }
}
var_dump($out);
http://codepad.org/QwrS4HpE

I would use the strtok function (http://pl1.php.net/strtok), which returns the words in the string one by one. You can specify the delimiters between words:
$string = 'Page and Brin originally nicknamed THEIR new search engine "BackRub", because the system checked backlinks to estimate the importance of a site.';
$delimiter = ' ,."'; // specify valid delimiters here (add others as needed)
$capitalized_words = array(); // array to hold the found words
$tok = strtok($string,$delimiter); // get first token
while ($tok !== false) {
$first_char = substr($tok,0,1);
if (strtoupper($first_char)===$first_char) {
// this word ($tok) is capitalized, store it
$capitalized_words[] = $tok;
}
$tok = strtok($delimiter); // get next token
}
var_dump($capitalized_words); // print the capitalized words found
This prints:
array(4) {
[0]=>
string(4) "Page"
[1]=>
string(4) "Brin"
[2]=>
string(5) "THEIR"
[3]=>
string(7) "BackRub"
}
Good luck!
Only drawback I can see is that it doesn't handle multibyte. If you have only English characters, then you're ok. If you have international characters, a modified/different solution may be needed.
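For the multibyte case mentioned above, one possible sketch uses PCRE Unicode properties instead of byte-wise string functions. It assumes the input is valid UTF-8, and the function name capitalizedWords is made up for illustration:

```php
<?php
// Hypothetical helper: returns the words whose first letter is uppercase.
// Assumes $text is valid UTF-8.
function capitalizedWords(string $text): array
{
    // (?<!\p{L}) : not preceded by a letter, so the match starts a word
    // \p{Lu}     : one uppercase letter in any script
    // \p{L}*     : any further letters
    preg_match_all('/(?<!\p{L})\p{Lu}\p{L}*/u', $text, $matches);
    return $matches[0] ?? array();
}

print_r(capitalizedWords('Élan and Über are "Zoë" words, naïve is not.'));
```

Here \p{Lu} plays the role of [A-Z] and \p{L} the role of [a-zA-Z], so accented capitals are recognized as well.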

You can do this using explode and loop through with regex:
$string = 'Page and Brin originally nicknamed THEIR new search engine "BackRub", because the system checked backlinks to estimate the importance of a site.';
$list = explode(' ',$string);
$matches = array();
foreach($list as $str) {
// anchor at the start (skipping a leading quote) so only the first letter is tested
if (preg_match('/^\W*[A-Z]/', $str)) $matches[] = $str;
}
print_r($matches);

Related

remove part of string after 4th slash in php

I have an array that contains links, and I'm trying to edit them so that each link is cut after the 4th slash.
[0]=>
string(97) "https://www.nowhere.com./downtoalley/?iad2=sumai-pickup&argument=CH4fRVnN&dmai=shimokita4040/outline"
[1]=>
string(105) "https://www.example.com./wowar-waseda/?iad2=sumai-pickup&argument=CH4fRVnN&dmai=shinjuku-w25861/outline"
[2]=>
string(91) "https://www.hey.com./gotoashbourn/?iad2=sumai-pickup&argument=CH4fRVnN&dmai=kinuta7429/outline"
expected output is like this:
[0]=>
string(97) "https://www.nowhere.com./downtoalley/"
[1]=>
string(105) "https://www.example.com./wowar-waseda/"
[2]=>
string(91) "https://www.hey.com./gotoashbourn/"
The lengths are different, so I can't use strtok. Are there any other options for this?

Try the following code:
<?php
$arr = array(
0 => "https://www.nowhere.com./downtoalley/?iad2=sumai-pickup&argument=CH4fRVnN&dmai=shimokita4040/outline",
1 => "https://www.example.com./wowar-waseda/?iad2=sumai-pickup&argument=CH4fRVnN&dmai=shinjuku-w25861/outline",
2 => "https://www.hey.com./gotoashbourn/?iad2=sumai-pickup&argument=CH4fRVnN&dmai=kinuta7429/outline");
$resultArray = array();
foreach($arr as $str) {
array_push($resultArray, current(explode("?",$str)));
}
print_r($resultArray);
?>

You can use preg_replace to replace everything in each string after the fourth / with nothing using this regex
^(([^/]*/){4}).*$
which looks for 4 sets of non-/ characters followed by a /, collecting that text in capture group 1; and then replacing with $1 which gives only the text up to the 4th /:
$strings = array("https://www.nowhere.com./downtoalley/?iad2=sumai-pickup&argument=CH4fRVnN&dmai=shimokita4040/outline",
"https://www.example.com./wowar-waseda/?iad2=sumai-pickup&argument=CH4fRVnN&dmai=shinjuku-w25861/outline",
"https://www.hey.com./gotoashbourn/?iad2=sumai-pickup&argument=CH4fRVnN&dmai=kinuta7429/outline");
print_r(array_map(function ($v) { return preg_replace('#^(([^/]*/){4}).*$#', '$1', $v); }, $strings));
Output:
Array (
[0] => https://www.nowhere.com./downtoalley/
[1] => https://www.example.com./wowar-waseda/
[2] => https://www.hey.com./gotoashbourn/
)

There is no built-in function for this, but you can do it with PHP code like the following:
$explodingLimit = 4;
$string = "https://www.nowhere.com./downtoalley/?iad2=sumai-pickup&argument=CH4fRVnN&dmai=shimokita4040/outline";
$stringArray = explode ("/", $string);
$neededElements = array_slice($stringArray, 0, $explodingLimit);
echo implode("/", $neededElements);
I wrote this for a single element; you can apply it to each element of your array. Note that the result has no trailing '/', so append one if you need it. Hope it helps.
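As a rough sketch, the same explode/array_slice idea could be wrapped in a function and mapped over the whole array, with the trailing slash restored. The name cutAfterNthSlash and its default of 4 are assumptions made for this example:

```php
<?php
// Hypothetical helper: keeps everything up to and including the $n-th slash.
function cutAfterNthSlash(string $url, int $n = 4): string
{
    $parts = explode('/', $url);
    if (count($parts) <= $n) {
        return $url; // fewer than $n slashes: nothing to cut
    }
    // the first $n pieces are the text before each of the first $n slashes
    return implode('/', array_slice($parts, 0, $n)) . '/';
}

$links = array(
    "https://www.nowhere.com./downtoalley/?iad2=sumai-pickup&argument=CH4fRVnN&dmai=shimokita4040/outline",
    "https://www.example.com./wowar-waseda/?iad2=sumai-pickup&argument=CH4fRVnN&dmai=shinjuku-w25861/outline",
    "https://www.hey.com./gotoashbourn/?iad2=sumai-pickup&argument=CH4fRVnN&dmai=kinuta7429/outline",
);
print_r(array_map('cutAfterNthSlash', $links));
```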

How can I match just the first line of occurrence?

I have this string:
$str = "11ff11
22mm22
33gg33
mm22mm
vv55vv
77ll77
55kk55
kk22kk
bb11bb";
There are two kinds of patterns:
{two numbers}{two letters}{two numbers}
{two letters}{two numbers}{two letters}
I'm trying to match the first line each time the pattern changes. So I want to match these:
11ff11 -- this
22mm22
33gg33
mm22mm -- this
vv55vv
77ll77 -- this
55kk55
kk22kk -- this
bb11bb
Here is my current pattern:
/(\d{2}[a-z]{2}\d{2})|([a-z]{2}\d{2}[a-z]{2})/
But it matches every line. How can I limit it to match just the first line of each run of the same pattern?

I could not do it with lookarounds because of the whitespace between the lines, but it is possible with a plain regex. It finds each run of lines with the same pattern and captures only the first line of the run:
(?:(\d{2}[a-z]{2}\d{2})\s+)(?:\d{2}[a-z]{2}\d{2}\s+)*|(?:([a-z]{2}\d{2}[a-z]{2})\s+)(?:[a-z]{2}\d{2}[a-z]{2}\s+)*
To show how it works, I made a simpler example with patterns of a single digit and a single letter:
(?:(\d)\s+)(?:\d\s+)*|(?:(a)\s+)(?:a\s+)*

Not sure if you can do this with only one expression, but you can iterate over the lines of your string and check when the pattern changes:
<?php
$str = "11ff11
22mm22
33gg33
mm22mm
vv55vv
77ll77
55kk55
kk22kk
bb11bb";
$exploded = explode(PHP_EOL, $str);
$patternA = '/(\d{2}[a-z]{2}\d{2})/';
$patternB = '/([a-z]{2}\d{2}[a-z]{2})/';
$result = [];
$currentPattern = '';
//get first and check what pattern is
if(preg_match($patternA, $exploded[0])){
$currentPattern = $patternA;
$result[] = $exploded[0];
} elseif(preg_match($patternB, $exploded[0])){
$currentPattern = $patternB;
$result[] = $exploded[0];
} else {
//.. no pattern on first element, should we continue?
}
//toggle
$currentPattern = $currentPattern == $patternA ? $patternB : $patternA;
foreach($exploded as $e) {
if(preg_match($currentPattern, $e)) {
//toggle
$currentPattern = $currentPattern == $patternA ? $patternB : $patternA;
$result[] = trim($e);
}
}
echo "<pre>";
var_dump($result);
echo "</pre>";
Output:
array(4) {
[0]=>
string(6) "11ff11"
[1]=>
string(6) "mm22mm"
[2]=>
string(6) "77ll77"
[3]=>
string(6) "kk22kk"
}

Here's my take. Never used lookbehinds before and well, my regex skills are not that good but this does seem to return what you want.
/^.*|(?<=[a-z]{2}\n)\d{2}[a-z]{2}\d{2}|(?<=\d{2}\n)[a-z]{2}\d{2}[a-z]{2}/
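A minimal sketch of how this pattern could be run with preg_match_all. It assumes the lines are separated by plain \n (not \r\n), since the lookbehinds test for exactly that character:

```php
<?php
// Sample data from the question, written with explicit \n line endings.
$str = "11ff11\n22mm22\n33gg33\nmm22mm\nvv55vv\n77ll77\n55kk55\nkk22kk\nbb11bb";

// ^.*              : the very first line
// (?<=[a-z]{2}\n)  : a digit-type line directly after a letter-type line
// (?<=\d{2}\n)     : a letter-type line directly after a digit-type line
$pattern = '/^.*|(?<=[a-z]{2}\n)\d{2}[a-z]{2}\d{2}|(?<=\d{2}\n)[a-z]{2}\d{2}[a-z]{2}/';

preg_match_all($pattern, $str, $matches);
print_r($matches[0]);
```

Each lookbehind only inspects the end of the previous line, so a line is matched only when the line before it belongs to the other pattern.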

Why preg_match fails to get the result?

I have the below text displayed on the browser and trying to get the URL from the string.
string 1 = voice-to-text from #switzerland: http://bit.ly/lnpDC12D
When I try to get the URL with preg_match, it fails:
$urlstr = "";
preg_match('/\b((?#protocol)https?|ftp):\/\/((?#domain)[-A-Z0-9.]+)((?#file)\/[-A-Z0-9+&@#\/%=~_|!:,.;]*)?((?#parameters)\?[A-Z0-9+&@#\/%=~_|!:,.;]*)?/i', $urlstr, $match);
echo $match[0];
I think #switzerland: contains one more http:// link. Could that be the problem?
The same code works perfectly for the string below:
voice-to-text: http://bit.ly/jDcXrZg

In this case I think parse_url is a better choice than regex-based code. Something like this may work (assuming your URL always starts with http):
$str = "voice-to-text from #switzerland: http://bit.ly/lnpDC12D";
$pos = strrpos($str, "http://");
if ($pos !== false) { // strrpos returns false when nothing is found, and false >= 0 is true
var_dump(parse_url(substr($str, $pos)));
}
OUTPUT
array(3) {
["scheme"]=>
string(4) "http"
["host"]=>
string(6) "bit.ly"
["path"]=>
string(9) "/lnpDC12D"
}
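If the URL is not guaranteed to be the last http:// in the string, a small regex can locate it first and parse_url can then split it. This is only a sketch: the helper name firstUrlParts is invented here, and the pattern simply treats everything up to the next whitespace as part of the URL.

```php
<?php
// Hypothetical helper: parse_url() output for the first http(s) URL, or null.
function firstUrlParts(string $text): ?array
{
    // take everything from the scheme up to the next whitespace
    if (!preg_match('~https?://\S+~i', $text, $m)) {
        return null;
    }
    $parts = parse_url($m[0]);
    return is_array($parts) ? $parts : null;
}

var_dump(firstUrlParts('voice-to-text from #switzerland: http://bit.ly/lnpDC12D'));
```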

As far as I understand your request, here is a way to do it:
$str = 'voice-to-text from <a href="search.twitter.com/…;: http://bit.ly/lnpDC12D';
preg_match('~(bit\.ly/\S+)~', $str, $m); // escape the dot so it matches a literal "."
print_r($m);
output:
Array
(
[0] => bit.ly/lnpDC12D
[1] => bit.ly/lnpDC12D
)

php - How do I extract bolded terms from a webpage and put them into an associative array?

I'm trying to grab all the bolded terms from a Google results page and put them into an associative array, but the results are erratic. It seems to extract only single-word terms, and sometimes (depending on the query) it grabs words that are not bolded. Does anyone know what I'm doing wrong? Thanks in advance.
$gurl = "http://www.google.com/search?q=marketingpro";
$data = file_get_contents($gurl);
// get bolded
preg_match_all('/<b>(\w+)<\/b>/', $data, $res, PREG_PATTERN_ORDER);
$H = $res[0];
foreach($H as $X){
$bold = strtolower($X);
$array[$bold] += 1;
}
print_r($array);

Try:
$doc = new DOMDocument();
@$doc->loadHTMLFile('http://www.google.com/search?q=marketingpro'); // @ suppresses warnings about invalid HTML
$xpath = new DOMXpath($doc);
$terms = array();
foreach ($xpath->query('//b') as $b)
{
$terms[$b->nodeValue] = true;
}
var_dump(array_keys($terms));
For me, I get:
array(15) {
[0]=>
string(3) "Web"
[1]=>
string(13) "marketing pro"
[2]=>
string(12) "marketingpro"
[3]=>
string(3) "..."
... snip ...
[14]=>
string(9) "marketing"
}

/<b>(\w+)<\/b>/ matches only when there is a single word inside the tag: any term containing a space or a character other than a letter, a digit or _ is skipped. I'd suggest /<b>([^<]+)<\/b>/ instead, or a DOM/XML parser (but since Google's HTML is invalid, those can fail).
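A small sketch of the [^<]+ variant, counting each bolded term. The HTML fragment is made up for illustration and stands in for the fetched page:

```php
<?php
// Made-up HTML fragment standing in for a fetched results page.
$html = '<p><b>Marketing Pro</b> tools for <b>marketing</b> teams. <b>marketing pro</b></p>';

// [^<]+ accepts spaces and punctuation, so multi-word terms are captured.
preg_match_all('/<b>([^<]+)<\/b>/i', $html, $res);

// $res[1] holds the captured text without the <b> tags;
// array_count_values builds the term => count array in one step.
$counts = array_count_values(array_map('strtolower', $res[1]));
print_r($counts);
```

Reading from $res[1] rather than $res[0] keeps the tags out of the array keys, and array_count_values avoids incrementing keys that do not exist yet.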

It extracts only single words, because that's what \w+ means. You could use a broader matching pattern like ([^<>]+) instead.
Or better yet, use QueryPath or phpQuery, which are easier on the eyes:
foreach (qp($html)->find("b") as $bold) {
$bold = strtolower($bold->text());
$array[$bold] += 1;
}

You may think about using a DOM parser. There's one here:
http://simplehtmldom.sourceforge.net/
Or, do something like this:
function getTextBetweenTags($string, $tagname)
{
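$pattern = '/<' . preg_quote($tagname, '/') . '>(.*?)<\/' . preg_quote($tagname, '/') . '>/';
// return the text of the first match, or null when the tag is not found
return preg_match($pattern, $string, $matches) ? $matches[1] : null;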
}
That will work as long as the $tagname tag doesn't have any attributes, which <b> tags shouldn't.

Regular expressions for Google operators

Using PHP, I'm trying to improve the search on my site by supporting Google-like operators, e.g.
keyword = natural/default
"keyword" or "search phrase" = exact match
keyword* = partial match
For this to work I need to split the string into two arrays: one for the exact words (without the double quotes) in $Array1, and everything else (natural and partial keywords) in $Array2.
What regular expressions would achieve this for the following string?
Example string ($string)
today i'm "trying" out a* "google search" "test"
Desired result
$Array1 = array(
[0]=>trying
[1]=>google search
[2]=>test
);
$Array2 = array(
[0]=>today
[1]=>i'm
[2]=>out
[3]=>a*
);
1) Exact: I've tried the following regexp for the exact matches, but it returns two arrays, one with and one without the double quotes. I could just use $result[1], but there could be a trick that I'm missing here.
preg_match_all(
'/"([^"]+)"/iu',
'today i\'m "trying" \'out\' a* "google search" "test"',
$result
);
2) Natural/Partial: The following rule returns the correct keywords, but along with several blank values. Is this regexp sloppy, or should I just run the array through array_filter()?
preg_split(
'/"([^"]+)"|(\s)/iu',
'today i\'m "trying" \'out\' a* "google search" "test"'
);

You can use strtok to tokenize the string.
See for example this tokenizeQuoted function derived from this tokenizedQuoted function in the comments on the strtok manual page:
// split a string into an array of space-delimited tokens, taking double-quoted and single-quoted strings into account
function tokenizeQuoted($string, $quotationMarks='"\'') {
$tokens = array(array(),array());
for ($nextToken=strtok($string, ' '); $nextToken!==false; $nextToken=strtok(' ')) {
if (strpos($quotationMarks, $nextToken[0]) !== false) {
if (strpos($quotationMarks, $nextToken[strlen($nextToken)-1]) !== false) {
$tokens[0][] = substr($nextToken, 1, -1);
} else {
$tokens[0][] = substr($nextToken, 1) . ' ' . strtok($nextToken[0]);
}
} else {
$tokens[1][] = $nextToken;
}
}
return $tokens;
}
Here’s an example of use:
$string = 'today i\'m "trying" out a* "google search" "test"';
var_dump(tokenizeQuoted($string));
The output:
array(2) {
[0]=>
array(3) {
[0]=>
string(6) "trying"
[1]=>
string(13) "google search"
[2]=>
string(4) "test"
}
[1]=>
array(4) {
[0]=>
string(5) "today"
[1]=>
string(3) "i'm"
[2]=>
string(3) "out"
[3]=>
string(2) "a*"
}
}
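For comparison, a regex-only sketch of the same split, assuming that only double quotes mark exact phrases and that quotes are always balanced. The variable names $exact and $other are placeholders for $Array1 and $Array2:

```php
<?php
$string = 'today i\'m "trying" out a* "google search" "test"';

// 1) Exact phrases: capture group 1 is the text between the double quotes.
preg_match_all('/"([^"]+)"/u', $string, $m);
$exact = $m[1];

// 2) Everything else: drop the quoted phrases, then split on whitespace.
//    PREG_SPLIT_NO_EMPTY removes the blank values, so array_filter is not needed.
$rest  = preg_replace('/"[^"]+"/u', ' ', $string);
$other = preg_split('/\s+/u', $rest, -1, PREG_SPLIT_NO_EMPTY);

var_dump($exact, $other);
```

preg_match_all always returns the full matches in index 0 and the capture group in index 1, so reading $m[1] is the usual way to get the phrases without their quotes.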
