I have to convert a long string of data into values so that I can import them into my database. Unfortunately, the data is displayed as text and not XML, so I need a way to convert this into, ideally, a key->value array.
The data looks like this:
AU  - Author 1
AU  - Author 2
AU  - Author 3
LA  - ENG
PT  - ARTICLE
DEP - 235234
TA  - TA
JN  - Journal name
JID - 3456346
EDAT- 2011-11-03 06:00
MHDA- 2011-11-03 06:00
CRDT- 2011-11-03 06:00
TI  - multi-line text text text text text
      text text tex tex text
      text text tex tex text
After researching, it seems explode could be a viable way to do this, but I'm not sure how to implement it in this scenario, or whether there is a better method, especially since there can be random hyphens and line breaks in the middle of the string.
Any help much appreciated in advance!
Since values can contain dashes and be spread across multiple lines, I think the safest method for separating keys from values is using substr(), since the separating dashes always sit at the same character position in the string.
FIXED
<?php
// First, normalize line endings and split into lines
$lines = explode("\n", str_replace(array("\r\n", "\r"), "\n", $data));
// This will hold the parsed data
$result = array();
// This will track the current key for multi-line values
$thisKey = '';
// Loop the split data
foreach ($lines as $line) {
    if (substr($line, 4, 1) == '-') {
        // There is a separator, start a new key
        $thisKey = trim(substr($line, 0, 4));
        $value = trim(substr($line, 5));
        if (isset($result[$thisKey])) {
            // This is a duplicate key
            if (is_array($result[$thisKey])) {
                // Already an array
                $result[$thisKey][] = $value;
            } else {
                // Convert to array
                $result[$thisKey] = array($result[$thisKey], $value);
            }
        } else {
            // Not a duplicate key
            $result[$thisKey] = $value;
        }
    } elseif (isset($result[$thisKey])) {
        // There is no separator, append data to the last key
        // (trim() rather than substr($line, 5), so short lines are not cut off)
        if (is_array($result[$thisKey])) {
            $result[$thisKey][count($result[$thisKey]) - 1] .= PHP_EOL . trim($line);
        } else {
            $result[$thisKey] .= PHP_EOL . trim($line);
        }
    }
}
print_r($result);
?>
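For a database import it can be easier if every key maps to a list, so that single and repeated keys are handled the same way. Here is a minimal sketch of that variant, not the code above: parseTaggedRecord is a made-up name, it assumes the key occupies the first four columns with the dash in the fifth, and it joins continuation lines with a space rather than PHP_EOL.

```php
<?php
// Hypothetical helper: every key maps to a list of values.
function parseTaggedRecord($data) {
    // Normalize line endings, then split into lines
    $lines = explode("\n", str_replace(array("\r\n", "\r"), "\n", $data));
    $result = array();
    $key = null;
    foreach ($lines as $line) {
        if (strlen($line) > 4 && $line[4] === '-') {
            // Dash in the fifth column: a new key/value pair starts here
            $key = trim(substr($line, 0, 4));
            $result[$key][] = trim(substr($line, 5));
        } elseif ($key !== null && trim($line) !== '') {
            // Continuation line: append to the most recent value of the current key
            $last = count($result[$key]) - 1;
            $result[$key][$last] .= ' ' . trim($line);
        }
    }
    return $result;
}

$sample = "AU  - Author 1\nAU  - Author 2\nEDAT- 2011-11-03 06:00\n"
        . "TI  - multi-line text\n      more text";
print_r(parseTaggedRecord($sample));
```

With this shape, a key that appears once (EDAT) and a key that repeats (AU) can both be looped over in the same way when building the insert statements.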
I've got a source file that contains some data in a few formats that I need to parse. I'm writing an ETL process that will have to match other data.
Most of the data is in the format city, state (US standard, more or less). Some cities are grouped across heavier population areas with multiple cities combined.
Most of the data looks like this (call this 1):
Elkhart, IN
Some places have multiple cities, delimited by a dash (call this 2):
Hickory-Lenoir-Morganton, NC
It's still not too complicated when the cities are in different states (call this 3):
Steubenville, OH-Weirton, WV
This one threw me for a loop; it makes sense but it flushes the previous formats (call this 4):
Kingsport, TN-Johnson City, TN-Bristol, VA-TN
In that example, Bristol is in both VA and TN. Then there's this (call this 5):
Mayagüez/Aguadilla-Ponce, PR
I'm okay with replacing the slash with a dash and processing the same as a previous example. That contains a diacritic as well and the rest of my data are diacritic-free. I'm okay with stripping the diacritic off, that seems to be somewhat straightforward in PHP.
Then there's my final example (call this 6):
Scranton--Wilkes-Barre--Hazleton, PA
The city name contains a dash so the delimiter between city names is a double dash.
What I'd like to produce is, given any of the above examples and a few hundred other lines that follow the same format, an array of [[city, state],...] for each so I can turn them into SQL. For example, parsing 4 would yield:
[
['Kingsport', 'TN'],
['Johnson City', 'TN'],
['Bristol', 'VA'],
['Bristol', 'TN']
]
I'm using a standard PHP install, I've got preg_match and so on but no PECL libraries. Order is unimportant.
Any thoughts on a good way to do this without a big pile of if-then statements?
I would split the input on '-' and ',', then delete the empty elements from the array. str_replace followed by explode and array_diff($parts, array('')) should do the trick.
Then identify the states, either by searching a list or by working on the principle that cities don't tend to have names made of two upper-case letters.
Now work through the array. If an element is a city, save the name; if it's a state, apply it to the saved cities. Clear the list of cities when you get a city immediately following a state.
Note any exceptions and reformat by hand into a different input.
Hope this helps.
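The walk described above could be sketched roughly like this. It is only an illustration: splitCityTokens is a made-up name, and states are detected with the "two upper-case letters" rule rather than a lookup list. Because it splits on every dash, a name like Wilkes-Barre (format 6) comes out as two cities, so those lines would still need reformatting by hand.

```php
<?php
// Hypothetical helper: walks the tokens as described in the answer.
function splitCityTokens($line) {
    // Turn commas into dashes, split on dashes, trim, and drop empty elements
    $tokens = array_map('trim', explode('-', str_replace(',', '-', $line)));
    $tokens = array_values(array_filter($tokens, function ($t) {
        return $t !== '';
    }));

    $pairs = array();
    $cities = array();
    $lastWasState = false;
    foreach ($tokens as $token) {
        if (preg_match('/^[A-Z]{2}$/', $token)) {
            // A state: apply it to every saved city
            foreach ($cities as $city) {
                $pairs[] = array($city, $token);
            }
            $lastWasState = true;
        } else {
            // A city immediately after a state starts a new group
            if ($lastWasState) {
                $cities = array();
            }
            $cities[] = $token;
            $lastWasState = false;
        }
    }
    return $pairs;
}

print_r(splitCityTokens('Kingsport, TN-Johnson City, TN-Bristol, VA-TN'));
```

For format 4 this yields Kingsport/TN, Johnson City/TN, Bristol/VA and Bristol/TN, because the saved city list is only cleared when a new city follows a state.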
For anyone who's interested, I took the answer from @mike and came up with this:
function SplitLine($line) {
    // This is over-simplified, just to cover the given case.
    $line = str_replace('ü', 'u', $line);
    // Cover case 6.
    $delimiter = '-';
    if (false !== strpos($line, '--'))
        $delimiter = '--';
    $line = str_replace('/', $delimiter, $line);
    // Case 5 looks like case 2 now.
    $parts = explode($delimiter, $line);
    $table = array_map(function($part) { return array_map('trim', explode(',', $part)); }, $parts);
    // At this point, table contains a grid with missing values.
    for ($i = 0; $i < count($table); $i++) {
        $row = $table[$i];
        // Trivial case (case 1 and 3), go on.
        if (2 == count($row))
            continue;
        if (preg_match('/^[A-Z]{2}$/', $row[0])) {
            // Missing city; seek backwards.
            $find = $i;
            while (2 != count($table[$find]))
                $find--;
            $table[$i] = [$table[$find][0], $row[0]];
        } else {
            // Missing state; seek forwards.
            $find = $i;
            while (2 != count($table[$find]))
                $find++;
            $table[$i][] = $table[$find][1];
        }
    }
    return $table;
}
It's not pretty and it's slow. It does cover all my cases and since I'm doing an ETL process the speed isn't paramount. There's also no error detection, which works in my particular case.
I have a text area in my HTML form and collect its data using the POST method. I need to treat a blank line as a boundary, so that a function is repeated for each block of the form data.
For example, I am calculating the sum of the digits collected from this text area using the code below:
<?php
$devices = $_POST['devs'];
$count = array_sum(explode("\n", $devices));
echo "sum is $count";
?>
If I enter below digits
1
2
3
I will get output like:
sum is 6
What I need is this: if I enter the digits in two blocks separated by a blank line, like
1
2
3

4
5
6
I need output like
sum is 6
sum is 15
How can I do this?
If there isn't any extra whitespace in your data, this can be accomplished very similarly to your current approach, by adding an extra step to explode on two newlines, and then calling your current code on each part:
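$devices = $_POST['devs'];
// Browsers submit textarea line breaks as "\r\n" whatever the server OS is,
// so normalize them instead of relying on PHP_EOL (the server's line ending)
$devices = str_replace(array("\r\n", "\r"), "\n", $devices);
$repeats = explode("\n\n", $devices);
foreach ($repeats as $repeat)
{
    $count = array_sum(explode("\n", $repeat));
    echo "sum is $count" . PHP_EOL;
}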
Obviously, if there is extra whitespace, then you'll need to do a cleanup step first.
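One possible cleanup step, as a rough sketch: normalize the line endings, then treat any line that is empty or contains only spaces or tabs as the boundary. sumBlocks is a made-up name, and the use of intval() assumes every remaining line is a whole number.

```php
<?php
// Hypothetical helper: returns one sum per block of lines.
function sumBlocks($text) {
    // Normalize Windows and old Mac line endings to "\n"
    $text = str_replace(array("\r\n", "\r"), "\n", $text);
    // Split wherever a line is empty or holds only spaces/tabs;
    // the trailing \s* swallows any further blank lines
    $blocks = preg_split('/\n[ \t]*\n\s*/', trim($text));
    $sums = array();
    foreach ($blocks as $block) {
        // Trim each line so stray spaces do not get in the way
        $numbers = array_map('trim', explode("\n", $block));
        $sums[] = array_sum(array_map('intval', $numbers));
    }
    return $sums;
}

foreach (sumBlocks("1\r\n2\r\n3\r\n \r\n4\r\n5\r\n6") as $sum) {
    echo "sum is $sum" . PHP_EOL;
}
```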
I am working on a Web Application that includes long listings of names. The client originally wanted to have the names split up into divs by letter so it is easy to jump to a particular name on the list.
Now, looking at the list, the client pointed out several letters that have only one or two names associated with them. He now wants to know if we can combine several consecutive letters if there are only a few names in each.
(Note that letters with no names are not displayed at all.)
What I do right now is have the database server return a sorted list, then keep a variable containing the current character. I run through the list of names, incrementing the character and printing the opening and closing div and ul tags as I get to each letter. I know how to adapt this code to combine some letters, however, the one thing I'm not sure about how to handle is whether a particular combination of letters is the best-possible one. In other words, say that I have:
A - 12 names
B - 2 names
C - 1 name
D - 1 name
E - 1 name
F - 23 names
I know how to end up with a group A-C and then have D by itself. What I'm looking for is an efficient way to realize that A should be by itself and then B-D should be together.
I am not really sure where to start looking at this.
If it makes any difference, this code will be used in a Kohana Framework module.
UPDATE 2012-04-04:
Here is a clarification of what I need:
Say the minimum number of items I want in a group is 30. Now say that letter A has 25 items, letters B, C, and D, have 10 items each, and letter E has 32 items. I want to leave A alone because it will be better to combine B+C+D. The simple way to combine them is A+B, C+D+E - which is not what I want.
In other words, I need the best fit that comes closest to the minimum per group.
If a letter contains more than 10 names, or whatever reasonable limit you set, do not combine it with the next one. However, if you start combining letters, you might have it run until 15 or so names are collected if you want, as long as no individual letter has more than 10. That's not a universal solution, but it's how I'd solve it.
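A rough sketch of that rule, to make it concrete. groupLetters is a made-up name, the input is assumed to be an array of letter => name count in alphabetical order, and the limits of 10 and 15 are simply the numbers mentioned above.

```php
<?php
// Hypothetical helper: $counts maps letter => number of names, already sorted.
function groupLetters(array $counts, $bigLetter = 10, $target = 15) {
    $groups = array();
    $current = array();
    $size = 0;
    foreach ($counts as $letter => $n) {
        if ($n > $bigLetter) {
            // A big letter stands alone; close any open group first
            if ($current) {
                $groups[] = $current;
                $current = array();
                $size = 0;
            }
            $groups[] = array($letter);
            continue;
        }
        // Small letter: keep collecting until the target is reached
        $current[] = $letter;
        $size += $n;
        if ($size >= $target) {
            $groups[] = $current;
            $current = array();
            $size = 0;
        }
    }
    // Whatever is left over forms the final group
    if ($current) {
        $groups[] = $current;
    }
    return $groups;
}

$counts = array('A' => 12, 'B' => 2, 'C' => 1, 'D' => 1, 'E' => 1, 'F' => 23);
print_r(groupLetters($counts));
```

For the counts in the question this leaves A and F on their own and puts B through E together.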
I came up with this function using PHP.
It groups consecutive letters until together they contain at least $ammount names.
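function split_by_initials($names, $ammount, $tollerance = 0) {
    // Bucket the names by their first character
    $filtered = array();
    foreach ($names as $name) {
        $filtered[$name[0]][] = $name;
    }
    $result = array();
    $count = 0;
    $key = '';
    $temp = array();
    foreach ($filtered as $initial => $split) {
        $count += count($split);
        $temp = array_merge($split, $temp);
        $key .= $initial.'-';
        // Close the group once it reaches $ammount, or comes within $tollerance of it
        if ($count >= $ammount || $count >= $ammount - $tollerance) {
            $result[$key] = $temp;
            $count = 0;
            $key = '';
            $temp = array();
        }
    }
    // Keep any trailing letters that never reached the threshold
    if ($temp) {
        $result[$key] = $temp;
    }
    return $result;
}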
The third parameter is for when you want a group to close even though it hasn't reached the specified amount but is close enough. For example, if you want to split into groups of 30 but A has only 25 names, setting a tolerance of 5 leaves A on its own and the other letters are grouped together.
I forgot to mention: it returns a multidimensional array, keyed by the letters in each group, with the names in that group as the value.
Something like
Array
(
[A-B-C-] => Array
(
[0] => Bandice Bergen
[1] => Arey Lowell
[2] => Carmen Miranda
)
)
It is not exactly what you needed, but I think it's close enough.
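For the "best fit" requirement in the question's update, one option is a small dynamic-programming pass over the letters instead of a greedy loop. The sketch below is not the function above but a different technique: bestFitGroups is a made-up name, and the cost function (squared distance of each group's size from the target) is my own assumption about what "closest to the minimum" means, so it penalizes undersized and oversized groups alike.

```php
<?php
// Hypothetical helper: $counts maps letter => number of names, already sorted.
function bestFitGroups(array $counts, $target) {
    $letters = array_keys($counts);
    $sizes = array_values($counts);
    $n = count($sizes);
    // $best[$i] = lowest total cost for grouping the first $i letters
    // $cut[$i]  = where the last group starts in that best grouping
    $best = array(0 => 0);
    $cut = array();
    for ($i = 1; $i <= $n; $i++) {
        $best[$i] = INF;
        $sum = 0;
        for ($j = $i; $j >= 1; $j--) {
            // Letters $j..$i (1-based) form the last group
            $sum += $sizes[$j - 1];
            $cost = $best[$j - 1] + pow($sum - $target, 2);
            if ($cost < $best[$i]) {
                $best[$i] = $cost;
                $cut[$i] = $j - 1;
            }
        }
    }
    // Walk the cut points backwards to recover the groups
    $groups = array();
    for ($i = $n; $i > 0; $i = $cut[$i]) {
        $group = array_slice($letters, $cut[$i], $i - $cut[$i]);
        array_unshift($groups, implode('-', $group));
    }
    return $groups;
}

$counts = array('A' => 25, 'B' => 10, 'C' => 10, 'D' => 10, 'E' => 32);
print_r(bestFitGroups($counts, 30));
```

With the numbers from the update (target 30) this keeps A alone, combines B, C and D, and leaves E alone. It tries every contiguous split, which is fine for 26 letters.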
Using the jsfiddle that mrsherman put, I came up with something that could work: http://jsfiddle.net/F2Ahh/
Obviously that is meant as pseudocode; some techniques could be applied to make it more efficient. But it gets the job done.
JavaScript version: enhanced with sorting and grouping of symbols
function group_by_initials(names, ammount, tollerance) {
    // Requires jQuery for $.each
    tollerance = tollerance || 0;
    var filtered = {};
    var result = {};
    $.each(names, function (key, value) {
        var val = value.trim();
        var pattern = /[a-zA-Z0-9&_\.-]/;
        // Anything that doesn't start with an allowed character goes in the 'sym' bucket
        var intial = val[0].match(pattern) ? val[0] : 'sym';
        if (!(intial in filtered))
            filtered[intial] = [];
        filtered[intial].push(val);
    });
    var count = 0;
    var key = '';
    var temp = [];
    $.each(Object.keys(filtered).sort(), function (ky, value) {
        count += filtered[value].length;
        temp = temp.concat(filtered[value]);
        key += value + '-';
        if (count >= ammount || count >= ammount - tollerance) {
            key = key.substring(0, key.length - 1);
            result[key] = temp;
            count = 0;
            key = '';
            temp = [];
        }
    });
    // Keep any trailing letters that never reached the threshold
    if (temp.length)
        result[key.substring(0, key.length - 1)] = temp;
    return result;
}
I have a list of phrases and I want to know which two words occurred the most often in all of my phrases.
I tried playing with regex and other codes and I just cannot find the right way to do this.
Can anyone help?
eg:
I am purchasing a wallet
a wallet for 20$
purchasing a bag
I'd know that
a wallet occurred 2 times
purchasing a occurred 2 times
<?php
// Single quotes, so that "$" is never treated as the start of a variable
$string = 'I am purchasing a wallet a wallet for 20$ purchasing a bag';
// Split the string into words
$words = explode(' ', $string);
// Make chunks of two words, i.e. [0,1][2,3]...
$chunks = array_chunk($words, 2);
// Remove the first array element
unset($words[0]);
// Chunk again; since the first element is removed, the blocks are now [1,2][3,4]...
$alternateChunks = array_chunk($words, 2);
// Merge both sets of chunks
$totalChunks = array_merge($chunks, $alternateChunks);
$finalChunks = array();
foreach ($totalChunks as $t)
{
    // Skip the leftover single word at the end of a chunk list
    if (count($t) < 2) continue;
    // Join each chunk into a phrase using +
    // (+ can be replaced with a space if needed)
    $finalChunks[] = implode('+', $t);
}
// Count how often each phrase occurs
$result = array_count_values($finalChunks);
echo "<pre>";
print_r($result);
I hesitate to suggest this, as it's an extremely brute force way to go about it:
Take your string of words, split it with explode(" ", $string), then run it through a nested for loop that checks every two-word combination against every other two-word combination in the string.
$string = 'I am purchasing a wallet a wallet for 20$ purchasing a bag';
$words = explode(" ", $string);
$count = array();
// Pairs start at indexes 0 .. count($words) - 2
$last = count($words) - 1;
for ($t = 0; $t < $last; $t++)
{
    for ($i = 0; $i < $last; $i++)
    {
        if (($words[$t] . ' ' . $words[$t+1]) == ($words[$i] . ' ' . $words[$i+1])) {
            $pair = $words[$i] . ' ' . $words[$i+1];
            $count[$pair] = isset($count[$pair]) ? $count[$pair] + 1 : 1;
        }
    }
}
So the nested for loop grabs the first two words, compares them to every pair of consecutive words, then grabs the next two words and does it again. Every pair will have a count of at least 1 (it always matches itself), and a pair that occurs k times ends up with a count of k*k, so sorting the resulting array by count still puts the most repeated pairs first.
Note that this will run (n-1)*(n-1) iterations, which could get unwieldy FAST.
Place them all into an array, and access them by the current word index and next word index.
I think this should do the trick. It will grab pairs of words, unless you are at the end of the string, where you'll get only one word.
$str = "I purchased a wallet because I wanted a wallet a wallet a wallet";
$words = explode(" ", $str);
$array_results = array();
for ($i = 0; $i < count($words); $i++) {
    if ($i < count($words) - 1) {
        $pair = $words[$i] . " " . $words[$i+1];
        // Have to check if the key is in use yet to avoid a notice
        $array_results[$pair] = isset($array_results[$pair]) ? $array_results[$pair] + 1 : 1;
    }
    // At the end of the array, just use a single word
    else $array_results[$words[$i]] = isset($array_results[$words[$i]]) ? $array_results[$words[$i]] + 1 : 1;
}
// Sort the results
// use arsort() instead to get the highest first
asort($array_results);
print_r($array_results);
// Prints:
Array
(
[I wanted] => 1
[wanted a] => 1
[wallet] => 1
[because I] => 1
[wallet because] => 1
[I purchased] => 1
[purchased a] => 1
[wallet a] => 2
[a wallet] => 4
)
Update: changed ++ to +1 above since it wasn't working when tested...
Try splitting it into an array with explode and counting the values with array_count_values.
<?php
$text = "whatever";
$text_array = explode(' ', $text);
$double_words = array();
// Pair each word with the one before it
for ($c = 1; $c < count($text_array); $c++)
{
    $double_words[] = $text_array[$c - 1] . ' ' . $text_array[$c];
}
$result = array_count_values($double_words);
?>
I have now updated it to a two-word version. Does this work for you? For the example phrases from the question, var_dump($result) gives:
array(9) {
["I am"]=> int(1)
["am purchasing"]=> int(1)
["purchasing a"]=> int(2)
["a wallet"]=> int(2)
["wallet a"]=> int(1)
["wallet for"]=> int(1)
["for 20$"]=> int(1)
["20$ purchasing"]=> int(1)
["a bag"]=> int(1)
}
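Note that the output above includes pairs such as "wallet a" and "20$ purchasing", which only exist because the phrases were glued into one string. If the phrases are kept as a list, the pairs can be counted per phrase instead. This is a minimal sketch of that idea; countWordPairs is a made-up name, and it assumes the phrases arrive as an array of strings.

```php
<?php
// Hypothetical helper: counts two-word pairs without crossing phrase boundaries.
function countWordPairs(array $phrases) {
    $counts = array();
    foreach ($phrases as $phrase) {
        // Split on any run of whitespace and ignore empty pieces
        $words = preg_split('/\s+/', trim($phrase), -1, PREG_SPLIT_NO_EMPTY);
        for ($i = 0; $i + 1 < count($words); $i++) {
            $pair = $words[$i] . ' ' . $words[$i + 1];
            $counts[$pair] = isset($counts[$pair]) ? $counts[$pair] + 1 : 1;
        }
    }
    // Highest counts first
    arsort($counts);
    return $counts;
}

$phrases = array(
    'I am purchasing a wallet',
    'a wallet for 20$',
    'purchasing a bag',
);
print_r(countWordPairs($phrases));
```

For the three phrases from the question, "a wallet" and "purchasing a" each get a count of 2 and every other pair gets 1.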
Since you used the excel tag, I thought I'd give it a shot, and it's actually really easy.
1. Split the string using space as the delimiter. Data > Text to Columns... > Delimited > Delimiter: Space. Each word is now in its own cell.
2. Transpose the result (not strictly required but much easier to visualize). Copy, Edit > Paste Special... > Transpose.
3. Make cells containing consecutive word pairs. So if your words are in cells B5:B15, cell C5 should be =B5&" "&B6 (and drag down).
4. Count the occurrences of each word pair: In cell D5, =COUNTIF($C$5:$C$15,"="&C5), drag down.
5. Highlight the winner(s). Select C5:D15, Format > Conditional Formatting... > Formula Is =$D5=MAX($D$5:$D$15) and choose e.g. a yellow background.
Note that there is some inefficiency in step 4 because the count of each word pair will be calculated multiple times if that word pair occurs multiple times. If this is a concern, then you can first make a list of unique word pairs using Data > Filter > Advanced Filter... > Unique records only.
An automated VBA solution could easily be crafted by recording a macro of the above followed by some minor editing.
One way to go about it is to use SPLIT or a regex to split the sentences into words and store each into an array. Then take the array and create a dictionary object. When you add a term to the dictionary, if it's already there, add 1 to the .value to tally the count.
Here is some example code (far from perfect, as it's just to show the underlying concept) that will take all the strings in column A and generate a word frequency list in columns B and C. It's not exactly what you want, but I hope it gives you some ideas on how you can go about doing it:
Sub FrequencyList()
    Dim vArray As Variant
    Dim myDict As Variant
    Set myDict = CreateObject("Scripting.Dictionary")
    Dim i As Long
    Dim cell As Range
    With myDict
        For Each cell In Range("A1", Cells(Rows.Count, "A").End(xlUp))
            vArray = Split(cell.Value, " ")
            For i = LBound(vArray) To UBound(vArray)
                If Not .Exists(vArray(i)) Then
                    .Add vArray(i), 1
                Else
                    .Item(vArray(i)) = .Item(vArray(i)) + 1
                End If
            Next
        Next
        Range("B1").Resize(.Count).Value = Application.Transpose(.Keys)
        Range("C1").Resize(.Count).Value = Application.Transpose(.Items)
    End With
End Sub
This is kind of a follow on from this post: Regex for splitting params out using preg_match
I have this string 1 0 61 12345678 sierra007^7 0 0 123.123.123.123:524 26429 25000 and I need to get each element. It was suggested I use explode which was a great simple solution but now I need to allow spaces in one of the fields.
Someone else posted this regex:
/^([-0-9]+)\s+([-0-9]+)\s+([-0-9]+)\s+([-0-9]+)\s+(\S+)\s+([-0-9])\s+([-0-9]+)\s+([-0-9.:]+)\s+([-0-9.]+)\s+([-0-9.]+)/mx
That does everything else and I was wondering if it could be modified to allow spaces in field 5 (sierra007^7). The only advice I can offer is that the rest of the fields are always numeric (or a colon as you can see) before and after field 5. Is this possible with 1 regex statement or do I need to parse it in PHP and fudge it together?
EDIT: For example, field 5 could be sierra007^7 OR si erra007^7 or si er ra007^7. The parser would know it has reached field 5 because it's the only one that contains a-zA-Z characters, and it would know where field 5 ends because field 6 only contains 0-9 characters.
Thanks.
Why not use explode, as in the other thread, and count the number of items in the array? If there are more items than expected, you join item 5 and the surplus items that follow it back together with implode.
E.g. your normal row has 10 items. If the resulting explode has 15 items, field 5 is:
implode(" ", array_slice($array, 4, count($array) - 10 + 1));
If the number of fields never changes, and there's always a value for each field, you can do it using the code below:
$fields = explode(' ', $str);
$defaultNumFields = 10;
// Number of surplus pieces created by spaces inside field 5 (index 4)
$extra = count($fields) - $defaultNumFields;
for ($i = 5; $i < 5 + $extra; $i++) {
    $fields[4] .= ' '.$fields[$i];
    unset($fields[$i]);
}
// Re-index so the remaining fields are numbered 0-9 again
$fields = array_values($fields);
That should do it. I may have miscalculated an offset somewhere, so test it on a few strings and let me know.
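As for the original question of whether a single regex can do it: one way, sketched below, is to make field 5 a lazy "anything" group and anchor the pattern at the end of the line, so the five trailing fields pin down where field 5 stops. parseStatusLine is a made-up name, and the pattern assumes exactly ten fields with fields 1-4, 6-7 and 9-10 being integers, which is my reading of the sample line rather than a known spec.

```php
<?php
// Hypothetical helper: returns the 10 fields, or null if the line does not match.
function parseStatusLine($line) {
    $pattern = '/^(-?\d+)\s+(-?\d+)\s+(-?\d+)\s+(-?\d+)\s+' // fields 1-4: integers
             . '(.+?)\s+'                                    // field 5: may contain spaces
             . '(-?\d+)\s+(-?\d+)\s+([\d.:]+)\s+'            // fields 6-8
             . '(-?\d+)\s+(-?\d+)\s*$/';                     // fields 9-10, then end of line
    if (!preg_match($pattern, $line, $m)) {
        return null;
    }
    // Drop the full-match entry so index 0 is field 1
    return array_slice($m, 1);
}

$fields = parseStatusLine('1 0 61 12345678 si er ra007^7 0 0 123.123.123.123:524 26429 25000');
print_r($fields);
```

Because the pattern is anchored with $, the lazy group has to grow until exactly five fields remain after it, so digits inside the name do not end field 5 early.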