Parsing a mixed-delimiter data set - php

I've got a source file that contains some data in a few formats that I need to parse. I'm writing an ETL process that will have to match other data.
Most of the data is in the format city, state (US standard, more or less). Some cities are grouped across heavier population areas with multiple cities combined.
Most of the data looks like this (call this 1):
Elkhart, IN
Some places have multiple cities, delimited by a dash (call this 2):
Hickory-Lenoir-Morganton, NC
It's still not too complicated when the cities are in different states (call this 3):
Steubenville, OH-Weirton, WV
This one threw me for a loop; it makes sense but it breaks the previous formats (call this 4):
Kingsport, TN-Johnson City, TN-Bristol, VA-TN
In that example, Bristol is in both VA and TN. Then there's this (call this 5):
Mayagüez/Aguadilla-Ponce, PR
I'm okay with replacing the slash with a dash and processing the same as a previous example. That contains a diacritic as well and the rest of my data are diacritic-free. I'm okay with stripping the diacritic off, that seems to be somewhat straightforward in PHP.
Then there's my final example (call this 6):
Scranton--Wilkes-Barre--Hazleton, PA
The city name contains a dash so the delimiter between city names is a double dash.
What I'd like to produce is, given any of the above examples and a few hundred other lines that follow the same format, an array of [[city, state],...] for each so I can turn them into SQL. For example, parsing 4 would yield:
[
['Kingsport', 'TN'],
['Johnson City', 'TN'],
['Bristol', 'VA'],
['Bristol', 'TN']
]
I'm using a standard PHP install, I've got preg_match and so on but no PECL libraries. Order is unimportant.
Any thoughts on a good way to do this without a big pile of if-then statements?

I would split the input on '-' and ',', then delete the empty elements from the array. str_replace followed by explode and array_diff($parts, array('')) should do the trick.
Then identify the states, either by searching a list or by working on the principle that cities don't tend to have names made of exactly two upper-case letters.
Now work through the array. If it's a city, save the name; if it's a state, apply it to the saved cities. Clear the list of cities when you get a city immediately following a state.
Note any exceptions and reformat them by hand into a different input.
Hope this helps.
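A minimal sketch of the steps above, assuming PHP 7 or later. The function name parseCityStates is hypothetical, and the double-dash check is an assumption borrowed from case 6 in the question rather than part of the steps themselves. Diacritics are left untouched here.

```php
<?php
// Hypothetical helper: turns one source line into [[city, state], ...].
function parseCityStates(string $line): array
{
    // Assumption: a double dash, when present, is the city delimiter (case 6).
    $delimiter = (strpos($line, '--') !== false) ? '--' : '-';

    // Turn every delimiter into a comma, split, trim, and drop empty tokens.
    $normalized = str_replace(array($delimiter, '/'), ',', $line);
    $tokens = array_filter(array_map('trim', explode(',', $normalized)), 'strlen');

    $pairs = array();
    $cities = array();
    $lastWasState = false;
    foreach ($tokens as $token) {
        if (preg_match('/^[A-Z]{2}$/', $token)) {
            // A state: apply it to every saved city.
            foreach ($cities as $city) {
                $pairs[] = array($city, $token);
            }
            $lastWasState = true;
        } else {
            // A city right after a state starts a new group.
            if ($lastWasState) {
                $cities = array();
            }
            $cities[] = $token;
            $lastWasState = false;
        }
    }
    return $pairs;
}

print_r(parseCityStates('Kingsport, TN-Johnson City, TN-Bristol, VA-TN'));
```

A city whose name happens to be two upper-case letters would be misread as a state, so a lookup list of state codes may be the safer test.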

For anyone who's interested, I took the answer from @mike and came up with this:
function SplitLine($line) {
    // This is over-simplified, just to cover the given case.
    $line = str_replace('ü', 'u', $line);
    // Cover case 6.
    $delimiter = '-';
    if (false !== strpos($line, '--'))
        $delimiter = '--';
    $line = str_replace('/', $delimiter, $line);
    // Case 5 looks like case 2 now.
    $parts = explode($delimiter, $line);
    $table = array_map(function($part) { return array_map('trim', explode(',', $part)); }, $parts);
    // At this point, table contains a grid with missing values.
    for ($i = 0; $i < count($table); $i++) {
        $row = $table[$i];
        // Trivial case (case 1 and 3), go on.
        if (2 == count($row))
            continue;
        if (preg_match('/^[A-Z]{2}$/', $row[0])) {
            // Missing city; seek backwards.
            $find = $i;
            while (2 != count($table[$find]))
                $find--;
            $table[$i] = [$table[$find][0], $row[0]];
        } else {
            // Missing state; seek forwards.
            $find = $i;
            while (2 != count($table[$find]))
                $find++;
            $table[$i][] = $table[$find][1];
        }
    }
    return $table;
}
It's not pretty and it's slow. It does cover all my cases and since I'm doing an ETL process the speed isn't paramount. There's also no error detection, which works in my particular case.


How to take input?

I have started to practice coding problems (hackerearth.com) in PHP to increase my problem-solving skill.
As I saw, most coding problems ask you to read input and then print the correct answer based on it.
E.g., Input:
The first line consists of two integers N and K, N being the number of elements in the array and K denoting the number of steps of rotation.
The next line consists of N space-separated integers, denoting the elements of the array A.
So far, I know:
fscanf(STDIN, "%d %d\n", $n, $k); //takes N and K
But I don't know how to take an array of size N.
Please show me how to read an array of size N so that I can code the rest. Otherwise I will stay stuck at taking input.
EDIT:
Can any experienced PHP coder help?
EDIT 2:
The problem on which I am still stuck is given below -
Coding challenge -
Monk and Rotation
Monk loves to perform different operations on arrays, and so being the principal of Hackerearth School, he assigned a task to his new student Mishki. Mishki will be provided with an integer array A of size N and an integer K, where she needs to rotate the array in the right direction by K steps and then print the resultant array. As she is new to the school, please help her to complete the task.
EDIT 3 -
Problem can be found here.
What I have tried so far to solve this problem:
fscanf(STDIN, "%s\n", $t);
fscanf(STDIN, "%s %s\n", $n, $k);
// taking 5 numbers separated by spaces
fscanf(STDIN, "%d %d %d %d %d\n", $item1, $item2, $item3, $item4, $item5);
$arr = [$item1, $item2, $item3, $item4, $item5];
for ($i = 0; $i < $k; $i++) {
    array_unshift($arr, array_pop($arr));
}
echo implode(' ', $arr);
You could use readline to read in the space-separated integers, then just split at the spaces to get an array. Note that the array elements will be of type string.
// input string
$string = readline();
// turn into an array
$array = explode(" ", $string);
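Putting this together for the rotation problem, here is a minimal sketch. The helper names parseInts and rotateRight are hypothetical, and the commented-out lines assume the input arrives on STDIN in the order described in the question.

```php
<?php
// Hypothetical helper: "1 2 3" -> [1, 2, 3], for any number of integers.
function parseInts(string $line): array
{
    return array_map('intval', preg_split('/\s+/', trim($line)));
}

// Hypothetical helper: rotate $arr to the right by $k steps.
function rotateRight(array $arr, int $k): array
{
    $n = count($arr);
    if ($n === 0) {
        return $arr;
    }
    $k %= $n; // rotating by a multiple of $n changes nothing
    return array_merge(array_slice($arr, $n - $k), array_slice($arr, 0, $n - $k));
}

// Reading from STDIN would look like this:
// list($n, $k) = parseInts(fgets(STDIN));
// $arr = parseInts(fgets(STDIN));
$arr = parseInts("1 2 3 4 5\n");
echo implode(' ', rotateRight($arr, 2)), "\n";
```

Slicing once avoids the loop of K pop/unshift operations, which matters when K is much larger than N.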
In HackerEarth, for PHP, you can do something like this:
function getMyInput($n)
{
    echo '<pre>';
    print_r($n);
}
// The first line holds the number of lines that follow
$t = intval(trim(fgets(STDIN)));
$n = array();
for ($t_itr = 0; $t_itr < $t; $t_itr++) {
    $n[] = trim(fgets(STDIN));
}
getMyInput($n);
Reference link
PHP standard input?
What does this code mean "ofstream fout(getenv("OUTPUT_PATH"));"
https://www.php.net/manual/en/function.fgets.php
https://www.php.net/manual/en/function.intval.php
Explanation
Use fgets to read a line from the input using STDIN (I/O reading).
Then iterate from zero to the count given on the first input line and store each line that is read in an array.
Create a custom function and pass the input data (the array variable) as an argument, so you can write your own problem-solving logic there.

Split strings into Dictionary words

I am looking for the most efficient algorithm in PHP to check if a string was made from dictionary words only or not.
Example:
thissentencewasmadefromenglishwords
thisonecontainsyxxxyxsomegarbagexaatoo
pure
thisisalsobadxyyyaazzz
Output:
thissentencewasmadefromenglishwords
pure
a.txt
contains the dictionary words
b.txt
contains the strings: one in every line, without spaces made from a..z chars only
Another way to do this is to employ the Aho-Corasick string matching algorithm. The basic idea is to read in your dictionary of words and from that create the Aho-Corasick tree structure. Then, you run each string you want to split into words through the search function.
The beauty of this approach is that creating the tree is a one time cost. You can then use it for all of the strings you're testing. The search function runs in O(n) (n being the length of the string), plus the number of matches found. It's really quite efficient.
Output from the search function will be a list of string matches, telling you which words match at what positions.
The Wikipedia article does not give a great explanation of the Aho-Corasick algorithm. I prefer the original paper, which is quite approachable. See Efficient String Matching: An Aid to Bibliographic Search.
So, for example, given your first string:
thissentencewasmadefromenglishwords
You would get (in part):
this, 0
his, 1
sent, 4
ten, 7
etc.
Now, sort the list of matches by position. It will be almost sorted when you get it from the string matcher, but not quite.
Once the list is sorted by position, the first thing you do is make sure that there is a match at position 0. If there is not, then the string fails the test. If there is (and there might be multiple matches at position 0), you take the length of the matched string and see if there's a string match at that position. Add that match's length and see if there's a match at the next position, etc.
If the strings you're testing aren't very long, then you can use a brute force algorithm like that. It would be more efficient, though, to construct a hash map of the matches, indexed by position. Of course, there could be multiple matches for a particular position, so you have to take that into account. But looking to see if there is a match at a position would be very fast.
It's some work, of course, to implement the Aho-Corasick algorithm. A quick Google search shows that there are PHP implementations available. How well they work, I don't know.
In the average case, this should be very quick. Again, it depends on how long your strings are. But you're helped by there being relatively few matches at any one position. You could probably construct strings that would exhibit pathologically poor runtimes, but you'd probably have to try real hard. And again, even a pathological case isn't going to take too terribly long if the string is short.
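The position-indexed check described above can be sketched as follows. This assumes the matcher's output has already been collected as [word, position] pairs; the function name coveredByMatches and that input format are hypothetical, and the Aho-Corasick search itself is not shown.

```php
<?php
// Hypothetical helper: true when the matches can be chained from position 0
// to the end of a string of the given length without gaps or overlaps.
function coveredByMatches(int $length, array $matches): bool
{
    // Index match lengths by start position.
    $byPosition = array();
    foreach ($matches as $match) {
        list($word, $position) = $match;
        $byPosition[$position][] = strlen($word);
    }

    // $reachable[$p] means the first $p characters split into matched words.
    $reachable = array(0 => true);
    for ($p = 0; $p < $length; $p++) {
        if (empty($reachable[$p]) || empty($byPosition[$p])) {
            continue;
        }
        foreach ($byPosition[$p] as $len) {
            $reachable[$p + $len] = true;
        }
    }
    return !empty($reachable[$length]);
}

$matches = array(array('this', 0), array('his', 1), array('is', 2), array('is', 4));
var_dump(coveredByMatches(strlen('thisis'), $matches));
```

Tracking every reachable position, rather than following only the first match at each position, handles the case of multiple matches at one position without backtracking.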
This is a problem that can be solved using dynamic programming, based on the following formulas:
f(0) = true
f(i) = OR { f(i-j) AND Dictionary.contains(s.substring(i-j,i)) } for each j=1,...,i
First, load your file into a dictionary, then use the DP solution for the above formula.
Pseudo code is something like: (Hope I have no "off by one" for indices..)
check(word):
  f = new boolean[word.length() + 1]
  f[0] = true
  for i from 1 to word.length():
    f[i] = false
    for j from 0 to i-1:
      if f[j] AND dictionary.contains(word.substring(j,i)):
        f[i] = true
  return f[word.length()]
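In PHP, the same recurrence might look like the sketch below. The function name madeOfWords is hypothetical, and the comment about loading a.txt assumes one word per line, which the question does not state.

```php
<?php
// Hypothetical helper: true when $s can be split entirely into dictionary words.
function madeOfWords(string $s, array $dictionary): bool
{
    // word => index, so that isset() gives constant-time lookups.
    $words = array_flip($dictionary);
    $n = strlen($s);

    // $f[$i] is true when the first $i characters split into dictionary words.
    $f = array_fill(0, $n + 1, false);
    $f[0] = true;
    for ($i = 1; $i <= $n; $i++) {
        for ($j = 0; $j < $i; $j++) {
            if ($f[$j] && isset($words[substr($s, $j, $i - $j)])) {
                $f[$i] = true;
                break;
            }
        }
    }
    return $f[$n];
}

// Assumption: a.txt holds one word per line, e.g.
// $dictionary = file('a.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
$dictionary = array('this', 'sentence', 'was', 'made', 'from', 'english', 'words');
var_dump(madeOfWords('thissentencewasmadefromenglishwords', $dictionary));
```

The inner loop could be capped at the length of the longest dictionary word, which keeps the work per string close to linear for long inputs.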
I recommend a recursive approach. Something like this:
<?php
$wordsToCheck = array(
    'otherword',
    'word1andother',
    'word1',
    'word1word2',
    'word1word3',
    'word1word2word3'
);
$wordList = array(
    'word1',
    'word2',
    'word3'
);
$results = array();
function onlyListedWords($word, $wordList) {
    if (in_array($word, $wordList)) {
        return true;
    }
    $length = strlen($word);
    $part = '';
    // Try every proper prefix that is a listed word, and recurse on the rest.
    // If the rest cannot be split, keep going and try a longer prefix.
    for ($i = 0; $i < $length - 1; $i++) {
        $part .= $word[$i];
        if (in_array($part, $wordList)
            && onlyListedWords(substr($word, $i + 1), $wordList)) {
            return true;
        }
    }
    // No split worked, so return an explicit false instead of falling through.
    return false;
}
foreach ($wordsToCheck as $word) {
    if (onlyListedWords($word, $wordList))
        $results[] = $word;
}
var_dump($results);
?>

Listing by alphabet, groups letters with few entries together (PHP or JS)

I am working on a Web Application that includes long listings of names. The client originally wanted to have the names split up into divs by letter so it is easy to jump to a particular name on the list.
Now, looking at the list, the client pointed out several letters that have only one or two names associated with them. He now wants to know if we can combine several consecutive letters if there are only a few names in each.
(Note that letters with no names are not displayed at all.)
What I do right now is have the database server return a sorted list, then keep a variable containing the current character. I run through the list of names, incrementing the character and printing the opening and closing div and ul tags as I get to each letter. I know how to adapt this code to combine some letters, however, the one thing I'm not sure about how to handle is whether a particular combination of letters is the best-possible one. In other words, say that I have:
A - 12 names
B - 2 names
C - 1 name
D - 1 name
E - 1 name
F - 23 names
I know how to end up with a group A-C and then have D by itself. What I'm looking for is an efficient way to realize that A should be by itself and then B-D should be together.
I am not really sure where to start looking at this.
If it makes any difference, this code will be used in a Kohana Framework module.
UPDATE 2012-04-04:
Here is a clarification of what I need:
Say the minimum number of items I want in a group is 30. Now say that letter A has 25 items, letters B, C, and D, have 10 items each, and letter E has 32 items. I want to leave A alone because it will be better to combine B+C+D. The simple way to combine them is A+B, C+D+E - which is not what I want.
In other words, I need the best fit that comes closest to the minimum per group.
If a letter contains more than 10 names, or whatever reasonable limit you set, do not combine it with the next one. However, if you start combining letters, you might have it run until 15 or so names are collected if you want, as long as no individual letter has more than 10. That's not a universal solution, but it's how I'd solve it.
I came up with this function using PHP.
It keeps adding letters to a group until the group holds at least $ammount names.
function split_by_initials($names, $ammount, $tollerance = 0) {
    // $names is expected to be sorted already
    $filtered = array();
    foreach ($names as $name) {
        $filtered[$name[0]][] = $name;
    }
    $result = array();
    $count = 0;
    $key = '';
    $temp = array();
    foreach ($filtered as $initial => $split) {
        $count += count($split);
        $temp = array_merge($split, $temp);
        $key .= $initial.'-';
        // With $tollerance >= 0 this also covers $count >= $ammount
        if ($count >= $ammount - $tollerance) {
            $result[$key] = $temp;
            $count = 0;
            $key = '';
            $temp = array();
        }
    }
    // Keep any trailing letters that never reached the threshold
    if ($temp) {
        $result[$key] = $temp;
    }
    return $result;
}
The third parameter is for when you want to let a group close early, even as a single letter, because it doesn't reach the $ammount specified but is close enough.
For example, say I want to split into groups of 30, but A has 25 names. If you set a $tollerance of 5, A will be left alone and the other letters will be grouped.
I forgot to mention: it returns a multidimensional array whose keys are the letters each group contains and whose values are the names in that group.
Something like:
Array
(
[A-B-C-] => Array
(
[0] => Bandice Bergen
[1] => Arey Lowell
[2] => Carmen Miranda
)
)
It is not exactly what you needed but i think it's close enough.
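To see how the tolerance plays out on the numbers from the question's update, here is a small sketch that works on per-letter counts instead of names. The function name groupLetterCounts and the counts-only input are assumptions made for illustration; this is the same greedy idea, not a guaranteed best fit.

```php
<?php
// Hypothetical helper: $counts maps letter => number of names, in sorted order.
// Returns group label => total names in that group.
function groupLetterCounts(array $counts, int $minimum, int $tolerance = 0): array
{
    $groups = array();
    $letters = array();
    $total = 0;
    foreach ($counts as $letter => $count) {
        $letters[] = $letter;
        $total += $count;
        // Close the group once it is within $tolerance of the minimum.
        if ($total >= $minimum - $tolerance) {
            $groups[implode('-', $letters)] = $total;
            $letters = array();
            $total = 0;
        }
    }
    // Keep any trailing letters that never reached the threshold.
    if ($letters) {
        $groups[implode('-', $letters)] = $total;
    }
    return $groups;
}

$counts = array('A' => 25, 'B' => 10, 'C' => 10, 'D' => 10, 'E' => 32);
print_r(groupLetterCounts($counts, 30, 5)); // tolerance 5: A stays alone
print_r(groupLetterCounts($counts, 30, 0)); // tolerance 0: A is merged with B
```

With a tolerance of 5 the groups are A, B-C-D and E; with no tolerance they are A-B and C-D-E, which is the split the question wanted to avoid.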
Using the jsfiddle that mrsherman put, I came up with something that could work: http://jsfiddle.net/F2Ahh/
Obviously that is to be used as a pseudocode, some techniques to make it more efficient could be applied. But that gets the job done.
JavaScript version: enhanced version with sorting and symbol grouping
function group_by_initials(names, ammount, tollerance) {
    tollerance = tollerance || 0;
    var filtered = {};
    var result = {};
    $.each(names, function (key, value) {
        var val = value.trim();
        var pattern = /[a-zA-Z0-9&_\.-]/;
        // Names starting with any other character are grouped under 'sym'
        var intial = val[0].match(pattern) ? val[0] : 'sym';
        if (!(intial in filtered))
            filtered[intial] = [];
        filtered[intial].push(val);
    });
    var count = 0;
    var key = '';
    var temp = [];
    $.each(Object.keys(filtered).sort(), function (ky, value) {
        count += filtered[value].length;
        temp = temp.concat(filtered[value]);
        key += value + '-';
        // With tollerance >= 0 this also covers count >= ammount
        if (count >= ammount - tollerance) {
            key = key.substring(0, key.length - 1);
            result[key] = temp;
            count = 0;
            key = '';
            temp = [];
        }
    });
    // Keep any trailing letters that never reached the threshold
    if (temp.length) {
        result[key.substring(0, key.length - 1)] = temp;
    }
    return result;
}

Counting plagiarism in PHP

Forgive me if this isn't a programming oriented question.
Let's say we have two sentences
[1]=This is a test idea
[2]=This is an experimental idea
If I jumble up [1]
[1]= a This idea test is
Would this count as plagiarism? What sort of logic do I have to apply to detect plagiarism?
I'm not making a complex plagiarism service, but rather a simple one that can catch obvious plagiarism.
My logic is somewhat like this
<?php
$str1= "This is a test idea.";
$str2= "This is an experimental idea.";
echo "$str1<br>$str2<br>";
$str1Array = explode(" ",$str1);
$str2Array = explode(" ",$str2);
if(count($str1Array) > count($str2Array))
    $max=count($str1Array);
else
    $max=count($str2Array);
$word_seq = array();
$word_seq_history = array();
$c=0;
$plag_count=0;
for ($i = 0; $i < $max; $i++) {
    $lev = levenshtein($str1Array[$i], $str2Array[$i]); // check for an exact match
    if ($lev == 0) {
        $c+=1;// (exact match)
        //echo "<br>$c";
        $word = $str1Array[$i];
        array_push($word_seq,$word);
    }
    else
    {
        if($lev != 0){
            if($c>=2)
                $plag_count+= count($word_seq);
            $current_seq = implode(" ", $word_seq);
            array_push($word_seq_history,$current_seq);
            echo $current_seq;
            $c=0;
            $word_seq= array();
        }
    }
}
echo "plag_count:";
echo $plag_count;
echo "max:";
echo $max;
echo "<br>" ;
echo ($plag_count/$max)*100;
?>
Output:
String 1: "This is a test idea."
String 2: "This is an experimental idea."
Words_Same:2 max:5
Plagiarism: 40%
Do I need to change it or is it fine the way it is?
What I would do to detect plagiarism in a very basic way is to first calibrate the system, i.e. start with a lot of comparisons between files that you are sure are not plagiarized.
1) Compare a bunch of files with each other and measure the plagiarism rate with your function. Pick out the most commonly used words (enough of them to bring your rate down to XX%; this is trial and error), put these words in your database and give them a weight of 0. Do this again without these words (you can filter them out with regular expressions) until the rate drops below XX% again, and give that next batch of words a weight of 1. And so on, until you reach a plagiarism rate of nearly zero.
2) Calculate the 'new' percentage as sum(weight of the words in your db that appear in the text) / (the total weight of all your words), giving words that are not yet in your database a weight of 10. That is your rate.
3) Test it with plagiarized material; if the result is not OK, change a few parameters (weights).
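One possible reading of step 2 is sketched below. The function name weightedOverlap, the tokenizing rule, and the sample weights are all assumptions; the weights would really come from the calibration in step 1.

```php
<?php
// Hypothetical helper: share of the suspect text's word weight that also
// appears in the source text. Word order is ignored.
function weightedOverlap(string $suspect, string $source, array $weights, int $defaultWeight = 10): float
{
    $tokenize = function (string $text): array {
        // Lower-case, then split on anything that is not a letter, digit or apostrophe.
        return preg_split("/[^a-z0-9']+/", strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    };

    $sourceWords = array_flip($tokenize($source));
    $matched = 0;
    $total = 0;
    foreach (array_unique($tokenize($suspect)) as $word) {
        // Words missing from the weight table get the default weight.
        $weight = isset($weights[$word]) ? $weights[$word] : $defaultWeight;
        $total += $weight;
        if (isset($sourceWords[$word])) {
            $matched += $weight;
        }
    }
    return $total > 0 ? $matched / $total : 0.0;
}

// Assumed calibration result: very common words weigh nothing.
$weights = array('this' => 0, 'is' => 0, 'a' => 0, 'an' => 0);
echo weightedOverlap('This is a test idea.', 'This is an experimental idea.', $weights), "\n";
echo weightedOverlap('a This idea test is', 'This is a test idea', $weights), "\n";
```

Because the comparison ignores word order, the jumbled sentence from the question scores the same as the original.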
I think this method, if used to check longer passages, will show a high level of correlation just because of common words, especially articles, prepositions, "be" verbs, and other common/overused words. If you're writing about a variety of subjects, be it code or Shakespeare, you're likely to run across jargon sets that are common to many genuinely unique papers. I think you may need to look at an alternate approach. Have you done any research into plagiarism and its detection?

How to parse a word/phrase with 2 words with dictionary database (in PHP)

I want to parse a sentence into words but some sentences have two words that can be combined into one and result in a different meaning.
For example:
Eminem is a hip hop star.
If I parse it by splitting the words by space I will get
Eminem
is
a
**hip**
**hop**
star
but I want something like this:
Eminem
is
a
**hip hop**
star
This is just an example; there might be some other word combinations listed as a word in a dictionary.
How can I parse this easily?
I have a dictionary in a MySQL database. Is there any API to do this?
No APIs that I know of. However, you could try the SQL LIKE clause.
$words = explode(' ', 'Eminem is a hip hop star');
$len = count($words);
$fixed = array();
for ($x = 0; $x < $len; $x++) {
    // LIKE 'hip %' will match hip hop
    $q = mysql_query("SELECT word FROM dict WHERE word LIKE '".$words[$x]." %'");
    // Combine current and next word (the last word has no next word)
    $combined = isset($words[$x + 1]) ? $words[$x].' '.$words[$x + 1] : null;
    $matched = false;
    while ($result = mysql_fetch_array($q)) {
        if ($result['word'] == $combined) { // Phrase is in dictionary
            $matched = true;
            break;
        }
    }
    // Append exactly once per word, whether or not any rows came back
    if ($matched) {
        $fixed[] = $combined;
        $x++; // skip the word that was merged in
    } else {
        $fixed[] = $words[$x];
    }
}
*Please excuse my lack of PDO. I'm lazy right now.
EDIT: I've done some thinking. While the code above isn't optimal, the optimized version I've come up with probably can't do very much better. The fact of the matter is regardless of how you approach the problem, you will need to compare every word in your input sentence to your dictionary and perform additional computations. I see two approaches you can take depending on hardware limits.
Both of these methods assume a dict table with (example) structure:
+--+-----+------+
|id|first|second|
+--+-----+------+
|01|hip |hop |
+--+-----+------+
|02|grade|school|
+--+-----+------+
Option 1: Your webserver has lots of available RAM (and a decent processor)
The idea here is to completely bypass the database layer by caching the dictionary in PHP's memory (with APC or memcache, the latter if you plan to run on several servers). This will place all the load on your webserver; however, it could be significantly faster, since accessing cached data in RAM is much faster than querying your DB.
(Again, I've left out PDO and Sanitization for simplicity's sake)
// Step One: Cache Dictionary..the entire dictionary
// This could be run on server start-up or before every user input
if (!apc_exists('words')) {
    $words = array();
    $q = mysql_query('SELECT first, second FROM dict');
    // mysql_fetch_row returns numeric keys only: [first, second]
    while ($res = mysql_fetch_row($q)) {
        $words[] = $res;
    }
    apc_store('words', $words); // APC serializes arrays itself; you could use memcache if you want
}
// Step Two: Compare cached dictionary to user input
$data = explode(' ', 'Eminem is a hip hop star');
$words = apc_fetch('words');
$count = count($data);
for ($x = 0; $x < $count - 1; $x++) { // Simpler to use a for loop
    foreach ($words as $word) { // Match against each word pair
        if ($data[$x] == $word[0] && $data[$x + 1] == $word[1]) {
            $data[$x] .= ' '.$word[1];
            array_splice($data, $x + 1, 1); // Remove the word that was merged in
            $count--;
            break;
        }
    }
}
Option 2: Fast SQL Server
The second option involves querying each of the words in the input text from the SQL server. For example, for the sentence "Eminem is hip hop" you would create a query that looked like SELECT * FROM dict WHERE (first = 'Eminem' && second = 'is') || (first = 'is' && second = 'hip') || (first = 'hip' && second = 'hop'). Then to fix the array of words you would simply loop through MySQL's results and fuse the appropriate words together. If you are willing to take this route, it might be more efficient to cache commonly used words and fix them before querying the database. This way you can eliminate conditions from your query.
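The fusing step that both options end with can be sketched without a database. The function name fusePhrases is hypothetical, and $pairs stands in for the [first, second] rows that either the cache or the query would supply.

```php
<?php
// Hypothetical helper: merges adjacent words that form a known two-word phrase.
// $pairs holds rows of [first, second], e.g. the rows of the dict table.
function fusePhrases(array $words, array $pairs): array
{
    // Build a "first second" => true lookup for constant-time checks.
    $lookup = array();
    foreach ($pairs as $pair) {
        $lookup[$pair[0] . ' ' . $pair[1]] = true;
    }

    $fixed = array();
    $count = count($words);
    for ($x = 0; $x < $count; $x++) {
        if ($x + 1 < $count && isset($lookup[$words[$x] . ' ' . $words[$x + 1]])) {
            $fixed[] = $words[$x] . ' ' . $words[$x + 1];
            $x++; // skip the word that was merged in
        } else {
            $fixed[] = $words[$x];
        }
    }
    return $fixed;
}

$words = explode(' ', 'Eminem is a hip hop star');
print_r(fusePhrases($words, array(array('hip', 'hop'), array('grade', 'school'))));
```

Keying the lookup by the joined phrase avoids scanning every dictionary row for every word of the sentence.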
