How to extract citations from a text (PHP)? - php

Hello!
I would like to extract all citations from a text. Additionally, the name of the cited person should be extracted. DayLife does this very well.
Example:
“They think it’s ‘game over,’ ” one senior administration official said.
The phrase They think it's 'game over' and the cited person one senior administration official should be extracted.
Do you think that's possible? You can only distinguish between citations and words in quotes if you check whether there's a cited person mentioned.
Example:
“I think it is serious and it is deteriorating,” Admiral Mullen said Sunday on CNN’s “State of the Union” program.
The passage State of the Union is not a quotation. But how do you detect this? a) You check if there's a cited person mentioned. b) You count the blank spaces in the supposed quotation. If there are less than 3 blank spaces it won't be a quotation, right? I would prefer b) since there's not always a cited person named.
How to start?
I would first replace all types of quotes by a single type so that you'll have to check for only one quote mark later.
<?php
$text = '';
$quote_marks = array('“', '”', '„', '»', '«');
$text = str_replace($quote_marks, '"', $text);
?>
Then I would extract all phrases between quotation marks which contain more than 3 blank spaces:
<?php
function extract_quotations($text) {
$result = preg_match_all('/"([^"]+)"/', $text, $found_quotations);
if ($result == TRUE) {
return $found_quotations;
// check for count of blank spaces
}
return array();
}
?>
How could you improve this?
I hope you can help me. Thank you very much in advance!

As ceejayoz already pointed out, this won't fit into a single function. What you're describing in your question (detecting the grammatical function of a quote-escaped part of a sentence, i.e. “I think it is serious and it is deteriorating,” vs "State of the Union") would be best solved with a library that can break down natural language into tokens. I am not aware of any such library in PHP, but you can have a look at the project size of something you would use in Python: http://www.nltk.org/
I think the best you can do is define a set of syntax rules that you verify manually. What about something like this:
abstract class QuotationExtractor {
protected static $instances = array(); // start empty so the foreach below is safe before any extractor exists
public static function getAllPossibleQuotations($string) {
$possibleQuotations = array();
foreach (self::$instances as $instance) {
$possibleQuotations = array_merge(
$possibleQuotations,
$instance->extractQuotations($string)
);
}
return $possibleQuotations;
}
public function __construct() {
self::$instances[] = $this;
}
public abstract function extractQuotations($string);
}
class RegexExtractor extends QuotationExtractor {
protected $rules = array(); // start empty so extractQuotations() works before any rule is added
public function extractQuotations($string) {
$quotes = array();
foreach ($this->rules as $rule) {
preg_match_all($rule[0], $string, $matches, PREG_SET_ORDER);
foreach ($matches as $match) {
$quotes[] = array(
'quote' => trim($match[$rule[1]]),
'cited' => trim($match[$rule[2]])
);
}
}
return $quotes;
}
public function addRule($regex, $quoteIndex, $authorIndex) {
$this->rules[] = array($regex, $quoteIndex, $authorIndex);
}
}
$regexExtractor = new RegexExtractor();
$regexExtractor->addRule('/"(.*?)[,.]?\h*"\h*said\h*(.*?)\./', 1, 2);
$regexExtractor->addRule('/"(.*?)\h*"(.*)said/', 1, 2);
$regexExtractor->addRule('/\.\h*(.*)(once)?\h*said[\-]*"(.*?)"/', 3, 1);
class AnotherExtractor extends Quot...
If you have a structure like the above you can run the same text through any/all of them and list the possible quotations to select the correct ones. I've run the code with this thread as input for testing and the result was:
array(4) {
[0]=>
array(2) {
["quote"]=>
string(15) "Not necessarily"
["cited"]=>
string(8) "ceejayoz"
}
[1]=>
array(2) {
["quote"]=>
string(28) "They think it's `game over,'"
["cited"]=>
string(34) "one senior administration official"
}
[2]=>
array(2) {
["quote"]=>
string(46) "I think it is serious and it is deteriorating,"
["cited"]=>
string(14) "Admiral Mullen"
}
[3]=>
array(2) {
["quote"]=>
string(16) "Not necessarily,"
["cited"]=>
string(0) ""
}
}

If there are less than 3 blank spaces it won't be a quotation, right?
"Not necessarily," said ceejayoz.
The passage State of the Union is not a quotation. But how do you detect this? a) You check if there's a cited person mentioned. b) You count the blank spaces in the supposed quotation. If there are less than 3 blank spaces it won't be a quotation, right? I would prefer b) since there's not always a cited person named.
b) doesn't even work for this very example - there are 3 blank spaces in "State of the Union".
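To make the point concrete, here is a minimal sketch that counts the blank spaces in both candidates. The helper name count_blank_spaces is made up for illustration; it only wraps PHP's substr_count().

```php
<?php
// Hypothetical helper for illustration: counts the blank spaces in a candidate.
function count_blank_spaces($candidate) {
    return substr_count($candidate, ' ');
}

// A title that is not a quotation.
$title = count_blank_spaces('State of the Union');
// A real (short) quotation.
$quote = count_blank_spaces('Not necessarily,');
```

The title ends up with more blank spaces than the real quotation, so no threshold on the count can separate the two.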

A quotation will always have punctuation--either a comma at the end, to signify that the speaker's name or title is to follow, or the end of the sentence (.!?).
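A rough sketch of that punctuation rule, assuming the text is UTF-8 and that nested single quotes and trailing spaces should be ignored before the check. The function name extract_punctuated_quotes is hypothetical, and the rule is only a heuristic, not a complete quotation detector.

```php
<?php
// Hypothetical helper: keeps only quoted spans that end in , . ! or ?
function extract_punctuated_quotes($text) {
    // Normalize typographic double quotes to a plain ASCII quote.
    $text = str_replace(array('“', '”', '„', '»', '«'), '"', $text);
    preg_match_all('/"([^"]+)"/u', $text, $matches);

    $quotes = array();
    foreach ($matches[1] as $candidate) {
        // Drop trailing whitespace and nested single quotes before testing.
        $trimmed = preg_replace('/[\s\'‘’]+$/u', '', $candidate);
        // Keep the candidate only if it now ends with punctuation.
        if (preg_match('/[,.!?]$/u', $trimmed)) {
            $quotes[] = $trimmed;
        }
    }
    return $quotes;
}

$text = '“I think it is serious and it is deteriorating,” Admiral Mullen said Sunday on CNN’s “State of the Union” program.';
$found = extract_punctuated_quotes($text);
```

With the example sentence from the question, the spoken phrase is kept and "State of the Union" is dropped, because the latter does not end with punctuation.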

Related

How to correctly parse .ini file with PHP

I am parsing an .ini file which looks like this (the structure is the same, just much longer).
There are always 3 lines for one vehicle. Each line has a left side and a right side. While the left side is always the same, the right side changes.
00code42=52
00name42=Q7 03/06-05/12 (4L) [V] [S] [3D] [IRE] [52]
00type42=Car
00code43=5F
00name43=Q7 od 06/12 (4L) [V] [S] [3D] [5F]
00type43=Car
What I am doing with it is:
$ini = parse_ini_file('files/models.ini', false, INI_SCANNER_RAW);
foreach($ini as $code => $name)
{
//some code here
}
Each value for each car is important to me, and I can get to each of them, but only in a very specific way, so I need your help to find the correct logic.
What I need to get:
mCode (from first car it is 00)
code (from first car it is 52)
vehicle (from first car it is Q7 03/06-05/12 (4L))
values from [] (for the first car: V, S, 3D, IRE, 52)
vehicle type (for the first car it is "Car")
How I get the mCode from the left side:
$mcode = substr($code, 0, 2); //$code comes from foreach
echo "MCode:".$mcode;
How I get vehicle type:
echo $name; // $name from foreach
How I parse values like vehicle and values from brackets:
$arr = preg_split('/\h*[][]/', $name, -1, PREG_SPLIT_NO_EMPTY); // $name comes from foreach
array(6) { [0]=> string(19) "Q7 03/06-05/12 (4L)" [1]=> string(1) "V" [2]=> string(1) "S" [3]=> string(2) "3D" [4]=> string(3) "IRE" [5]=> string(2) "52" }
So basically I can get to each value I need; I am just not sure how to write the logic to work with them.
In general I can skip the first line of each car, because all values from it appear in the other lines as well.
I need just the 2nd and 3rd lines, but how can I skip lines like this? (I was thinking of doing something like
if($number % 3 == 0), but I don't know the number of lines.)
After I get all the data, I can't just echo it somewhere; I also need to store it in the DB, so how can I do this if
I would really appreciate your help in finding the correct way to get this data in the right loop and then call a function to insert it all into the DB.
EDIT:
I was thinking about something like:
http://pastebin.com/C97cx6s0
But this is just a structure, which is not working.
If your data is consistent, use array_chunk, array_keys, and array_values
foreach(array_chunk($ini, 3, true) as $data)
{
// $data is an array of just the 3 that are related
$mcode = substr(array_keys($data)[0], 0, 2);
$nameLine = array_values($data)[1];
$typeLine = array_values($data)[2];
//.. parse the name and type lines here.
//.. add to DB
}
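A fuller sketch of the same idea, combining array_chunk with the preg_split call from the question. The sample array stands in for the result of parse_ini_file, and the function name parse_vehicles and the record field names are assumptions for illustration. The database insert is left out.

```php
<?php
// Sample data shaped like the result of parse_ini_file(..., false, INI_SCANNER_RAW).
$ini = array(
    '00code42' => '52',
    '00name42' => 'Q7 03/06-05/12 (4L) [V] [S] [3D] [IRE] [52]',
    '00type42' => 'Car',
    '00code43' => '5F',
    '00name43' => 'Q7 od 06/12 (4L) [V] [S] [3D] [5F]',
    '00type43' => 'Car',
);

// Hypothetical helper: turns the flat key/value list into one record per vehicle.
function parse_vehicles(array $ini) {
    $vehicles = array();
    // Each chunk holds the 3 related lines; true preserves the original keys.
    foreach (array_chunk($ini, 3, true) as $chunk) {
        $keys = array_keys($chunk);
        $values = array_values($chunk);
        // Split the name line on brackets, as in the question.
        $parts = preg_split('/\h*[][]/', $values[1], -1, PREG_SPLIT_NO_EMPTY);
        // The first part is the vehicle name; the rest are the bracket values.
        $vehicle = array_shift($parts);
        $vehicles[] = array(
            'mCode'   => substr($keys[0], 0, 2),
            'code'    => $values[0],
            'vehicle' => $vehicle,
            'flags'   => $parts,
            'type'    => $values[2],
        );
    }
    return $vehicles;
}

$vehicles = parse_vehicles($ini);
```

Each record could then be passed to whatever insert function you use, for example inside a foreach over $vehicles.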

in_array not finding an element

I am using a PDO approach to get an array out of database:
$statement = $db->prepare("SELECT sname FROM list WHERE ongoing = 1;");
$statement->execute();
$snames = $statement->fetchAll(PDO::FETCH_COLUMN, 0);
var_dump($snames);
The dump output is (total 2500 results):
[69]=> string(13) "ah-my-goddess"
[70]=> string(17) "ahiru-no-oujisama"
[71]=> string(13) "ahiru-no-sora"
Then I check if array $snames contains the new element $sname:
$sname = current(explode(".", $href_word_array[count($href_word_array)-1]));
if (in_array($sname, $snames) == False)
{
echo "New '$sname'!<br>";
}
else
{
echo "$sname is already in the list. Excluding.<br>";
unset($snames[$sname]);
}
And the output is:
'ah-my-goddess' is already in the list. Excluding.
New 'ahiru-no-oujisama'!
'ahiru-no-sora' is already in the list. Excluding.
Why does it say that 'ahiru-no-oujisama' is a new name? We can see from the dump that the array contains this element.
I have compared the results a thousand times. Notepad finds both names. There are no spaces. Name in the database is the same as in variable..
For the record - I have around 2500 entities in $snames array and for 95% of records (+-) I am getting the "already exists" result. However, for some I am getting "new".
Is that perhaps some kind of encoding issue? For the table I have DEFAULT CHARSET=latin1. Could that be a problem?
Edit
It was suggested that I added a trim operation:
$snames = $statement->fetchAll(PDO::FETCH_COLUMN, 0);
for ($i=0; $i < Count($snames); $i+=1)
{
$snames[$i] = trim($snames[$i]);
}
and:
if (in_array(trim($sname), $snames) == False)
However I get the same problem.
Apparently, the problem was with the line:
unset($snames[$sname]);
For some entries I had names such as "70" and "111".
As a result, the command:
unset($snames[$sname]);
removed the elements at those positions, not the elements with those values! I.e. this is how the program understood it:
unset($snames[77]);
whereas I wanted to remove the element whose value is '77'. Note that PHP casts a numeric string key to an integer, so this form behaves exactly the same way:
unset($snames['77']);
So the line had to be changed to the following:
if(($key = array_search($sname, $snames)) !== false)
{
unset($snames[$key]);
}
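A small sketch reproducing the effect with made-up data. The strict flag passed to array_search is an addition, not part of the original fix.

```php
<?php
$snames = array(0 => 'ah-my-goddess', 1 => '2', 2 => 'ahiru-no-oujisama');

// Buggy: '2' is treated as the key 2, so the element at index 2 disappears.
$buggy = $snames;
unset($buggy['2']);

// Fixed: look the value up first, then unset by the key that was found.
$fixed = $snames;
if (($key = array_search('2', $fixed, true)) !== false) {
    unset($fixed[$key]);
}
```

In the buggy version 'ahiru-no-oujisama' is gone while the name '2' is still in the list, which is why a later in_array() check reports it as new.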

need a simple programming logic solution..

I am working on a CakePHP 2.x but right now my question has nothing to do with the syntax. I need a solution for this problem:
I have a table named messages in which there is a field named mobile numbers. The numbers are in this format:
12345678 and +9112345678 .. so they are both the same. The only difference is that in one number the country code is missing.
So I do this query in the database:
select the distinct numbers from messages tables..
Now it is taking these both numbers as distinct. But what I want is to take these both numbers as one.
How can I do this? In my DB there are several numbers with different country codes. Some have a country code and some don't, but I want to take both as one. The one in country code and the other without code. How can this be done?
Right now I have an array in which all the distinct numbers are stored. In this array there are numbers like this:
12345678 +9112345678
So now I don't know how to make a logic that I can take these numbers as one. If there is any solution for this then please share with some example code.
I don't think you can do this on the database level.
You would have to do something like this:
1. Create an array of all country codes (including the + sign).
2. Fetch all the numbers from the database.
3. Use array_map() and, in the callback, run strpos() against each element in the country code array; if a match is made, remove the country code from the number.
4. Finally, after step 3 is finished, run the number array through array_unique().
CODE:
$country_codes = array('+91', '+61');
$numbers_from_db = array('33445322453', '+913232', '3232', '+614343', '024343');
$sanitized_numbers = array_map(function($number) use ($country_codes){
if(substr($number, 0, 1) === "0") {
$number = substr($number, 1);
return $number;
}
foreach($country_codes as $country_code) {
if(strpos($number, $country_code) === 0) { // only match a code at the very start of the number
$number = substr($number, strlen($country_code)); // strip just that leading code
return $number;
}
}
return $number;
}, $numbers_from_db);
$distinct_sanitized_numbers = array_unique($sanitized_numbers);
Tested, and the output of var_dump($distinct_sanitized_numbers) is:
array(4) {
[0]=>
string(11) "33445322453"
[1]=>
string(4) "3232"
[3]=>
string(4) "4343"
[4]=>
string(5) "24343"
}

how to check for a linked string inside an array?

I have an array, let us say, $breadcrumb = array("home", "groups", "Create content", "some other element", "so on"); I want to check if it contains the string "Create content" and then unset it, but my problem is that "Create content" is a link (anchored) and not just a plain string. I tried in_array(), but without success. How do I look for it?
Here is my code:
<?php
function phptemplate_breadcrumb($breadcrumb) {
if (!empty($breadcrumb)) {
if(in_array("Create content",$breadcrumb)){
foreach($breadcrumb as $key => $value){
if("Create content" == strip_tags($value)){
unset($breadcrumb[$key]);
}
}
}
}
return '<div class="breadcrumb">'. implode(' › ', $breadcrumb) .'</div>';
}
Note: I know it can be done anyway if I omit the in_array() check, but I don't want to loop through the array unnecessarily if 'Create content' is not in the array.
Edit: actual array is:
array(
[0]=>home
[1]=> groups
[2]=> my group
[3]=> Create content
)
here 'Create content' may occupy any position.
Note: all elements are links (anchored).
If your real array is something like
array(
[0]=> home
[1]=> groups
[2]=> my group
[3]=> Create content
)
then you can try to use preg_grep so as to return all items that match your regexp pattern:
$content_links = preg_grep("/[YOUR REGEXP HERE]/", $breadcrumb);
// if you have matching items
if (0 < sizeof($content_links)) {
// do some stuff - do `foreach` loop or use `array_diff`
}
UPD:
Or you can even use PREG_GREP_INVERT as the third parameter and get all items that don't match the regexp pattern.
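A minimal sketch of the PREG_GREP_INVERT variant, assuming each breadcrumb is a single anchor tag whose link text is the label. The function name remove_breadcrumb and the href values are made up for illustration.

```php
<?php
// Hypothetical helper: removes any breadcrumb whose link text is $label.
function remove_breadcrumb(array $breadcrumb, $label) {
    // Match the label as the text between a tag's ">" and the next "<".
    $pattern = '/>\s*' . preg_quote($label, '/') . '\s*</';
    // PREG_GREP_INVERT returns the items that do NOT match, keys preserved.
    return preg_grep($pattern, $breadcrumb, PREG_GREP_INVERT);
}

$breadcrumb = array(
    '<a href="/">home</a>',
    '<a href="/groups">groups</a>',
    '<a href="/node/add">Create content</a>',
);
$filtered = remove_breadcrumb($breadcrumb, 'Create content');
```

This still scans the array once, but it does so inside preg_grep, so no explicit in_array() check or foreach loop is needed.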

How to efficiently combine two (or more) associative arrays with common keys

More generally, let's say we have two lists of different lengths with one common attribute:
list1: {
{"orderID":1234, "FirstName":"shaheeb", "LastName":"roshan"},
{"orderID":9183, "FirstName":"robert", "LastName":"gibbons"},
{"orderID":2321, "FirstName":"chester"},
}
list2: {
{"orderID":1234, "cell":"555-555-5555", "email":"roshan#fake.com"},
{"orderID":2321, "email":"chester#fake.com"},
}
I would like these combined into:
list3: {
{"orderID":1234, "FirstName":"shaheeb", "LastName":"roshan", "cell":"555-555-5555", "email":"roshan#fake.com"},
{"orderID":9183, "FirstName":"robert", "LastName":"gibbons"},
{"orderID":2321, "FirstName":"chester", "email":"chester#fake.com"},
}
I'm primarily a PHP developer, and I came up with the following:
function mergeArrays($a1, $a2) {
$larger = (count($a1) > count($a2)) ? $a1 : $a2;
$smaller = ($larger == $a1) ? $a2 : $a1;
$combinedArray = array();
foreach ($larger AS $key=>$largerSet) {
$combinedRow = array();
if (isset ($smaller[$key]) ) {
$combinedRow = $largerSet + $smaller[$key];
$combinedArray[$key] = $combinedRow;
}else {
$combinedArray[$key] = $largerSet;
}
}
return ($combinedArray);
}
If tested with the following:
$array1 = array("12345"=>array("OrderID"=>12345, "Apt"=>"blue"));
$array2 = array(
"12345"=>array("OrderID"=>12345, "AnotherCol"=>"Goons", "furtherColumns"=>"More Data"),
"13433"=>array("OrderID"=>32544, "Yellow"=>"Submarine")
);
The mergeArrays($array1, $array2) outputs the following:
array(2) {
[12345]=>
array(4) {
["OrderID"]=>
int(12345)
["AnotherCol"]=>
string(5) "Goons"
["furtherColumns"]=>
string(9) "More Data"
["Apt"]=>
string(4) "blue"
}
[13433]=>
array(2) {
["OrderID"]=>
int(32544)
["Yellow"]=>
string(9) "Submarine"
}
}
But I just don't feel like this is the most elegant solution. For example, I should be able to combine n number of arrays. Not really sure how I would accomplish that. Also, just looking at that bit of code, I'm fairly certain there are far more effective ways to accomplish this requirement.
As a learning point, I am curious whether Python experts would take this opportunity to show up us PHP folk :). For that matter, I am curious whether Excel/VBA can even handle this. That is where I started trying to solve this problem, with the thought that "surely Excel can handle lists!".
I am fully aware that there are many many variations of this question around SO. I have looked at several of these, and still felt that I should try my version out here.
Your thoughts are most appreciated.
Thank you!
SR
For a general solution in Python, for any number of lists:
from collections import defaultdict
# order_lists is the list of input lists, e.g. [list1, list2]
orders = defaultdict(dict)
for order_list in order_lists:
    for order in order_list:
        orders[order['orderID']].update(order)
See it working online: ideone
A generic solution that can merge any number of dicts (or a list of dicts - if you have more than one list, just add them together before calling the function):
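from collections import defaultdict
from functools import reduce  # reduce is no longer a builtin on Python 3
def merge_dicts_by_key(key, *dicts):
    # update() returns None, so "or acc" passes the accumulator on
    return reduce(lambda acc, val: acc[val[key]].update(val) or acc,
                  dicts,
                  defaultdict(dict))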
Call like so:
merge_dicts_by_key('orderID', dict1, dict2, dict3)
or, if you have lists of dicts:
merge_dicts_by_key('orderID', *list_of_dicts)
merge_dicts_by_key('orderID', *(list1 + list2))
Well, you could always replace your function with array_merge_recursive.
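For comparison, here is a PHP sketch of the same group-by-key idea as the Python answers above, for any number of lists. It assumes PHP 5.6+ for the variadic parameter, and the function name merge_rows_by_key is made up for illustration. The sample data is taken from the question.

```php
<?php
// Hypothetical helper: merges any number of row lists on a shared key column.
function merge_rows_by_key($key, array ...$lists) {
    $merged = array();
    foreach ($lists as $list) {
        foreach ($list as $row) {
            $id = $row[$key];
            // Later rows add new columns and overwrite columns that already exist.
            $merged[$id] = isset($merged[$id])
                ? array_replace($merged[$id], $row)
                : $row;
        }
    }
    return $merged;
}

$list1 = array(
    array('orderID' => 1234, 'FirstName' => 'shaheeb', 'LastName' => 'roshan'),
    array('orderID' => 9183, 'FirstName' => 'robert', 'LastName' => 'gibbons'),
    array('orderID' => 2321, 'FirstName' => 'chester'),
);
$list2 = array(
    array('orderID' => 1234, 'cell' => '555-555-5555', 'email' => 'roshan#fake.com'),
    array('orderID' => 2321, 'email' => 'chester#fake.com'),
);
$list3 = merge_rows_by_key('orderID', $list1, $list2);
```

The result is keyed by orderID, so the lists do not need to be the same length or in the same order. Call array_values($list3) if you want a plain list back.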
