I am writing a class for the decoding of fax data encoded with modified Huffman code.
The data is coded line by line: each record describes one pixel row, and the records have variable length. The pixel bits are stored in the bits of the code words, least significant first.
Currently the code word list (182 elements) is defined as an array:
/**
* [0] code word
* [1] length of code word
* [2] run length of color bits
* [3] 0 = white / 1 = black
* [4] 1 = termination codes / 0 = make up codes
*/
const CODEWORDS = [
[0b00110101, 8, 0, 0, 1], // termination codes white
[0b000111, 6, 1, 0, 1],
[0b0111, 4, 2, 0, 1],
[0b1000, 4, 3, 0, 1],
[0b1011, 4, 4, 0, 1],
[0b1100, 4, 5, 0, 1],
[0b1110, 4, 6, 0, 1],
[0b1111, 4, 7, 0, 1],
[0b10011, 5, 8, 0, 1],
...
];
Before use, the array is sorted in descending order by the length of the code words.
In a first approach I'm able to find the correct code words by repeatedly iterating over this array with foreach - but (not surprisingly!) it's terribly slow.
It is clear to me that a real performance increase can only be achieved with a binary tree.
But even after looking at several explanations here and at solutions (libraries) on GitHub, I can't figure out
how to transfer the data from the array into a binary tree
how to traverse the tree to reach the correct leaf
If someone could help me there, I would be very grateful.
Once you have the correct codes (see my comments on your question), then you start by building one set of codes for white and one for black. For each, you start the tree with a branch for the first bit being zero, and another branch for one. Break up your set of codes into two sets, one set where all the codes start with zero, and the other where they all start with one. For each of those, make two branches. Break up each set based on the second bit. Once you get to a branch with one code and you just used the last bit of that code, you now have a leaf. In that leaf you store the symbol for the code, e.g. 63 for white code 00110100. If you get to a branch and there are no codes, then you again have a leaf, but this time it will result in a decoding error if it is reached.
To decode, take the first bit and go down that branch. Choose the second branch depending on the second bit. And so on until you get to a leaf. Then emit that symbol and start back again at the root with the subsequent bit. Or terminate if you end up at an error leaf.
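To make that concrete, here is a rough PHP sketch built on the CODEWORDS table from the question. Nested arrays serve as tree nodes, the function names are my own, and I'm assuming the bit reader delivers bits in the same order as they are written in the table entries (most significant bit of each code word first):
<?php
// Build one tree per color. A node is an array whose keys 0 and 1 are the
// branches; a leaf additionally carries the run length and the
// termination/make-up flag.
function buildTree(array $codewords, int $color): array
{
    $root = array();
    foreach ($codewords as [$code, $len, $run, $c, $terminating]) {
        if ($c !== $color) {
            continue;
        }
        $node = &$root;
        for ($i = $len - 1; $i >= 0; $i--) {   // walk the code word bit by bit
            $bit = ($code >> $i) & 1;
            if (!isset($node[$bit])) {
                $node[$bit] = array();         // create the branch if missing
            }
            $node = &$node[$bit];
        }
        $node['run'] = $run;                   // store the symbol in the leaf
        $node['terminating'] = (bool)$terminating;
        unset($node);
    }
    return $root;
}

// Decode one run length: follow the branches bit by bit until a leaf is hit.
// $nextBit is a callable returning the next bit (0 or 1) of the input stream.
function decodeRun(array $tree, callable $nextBit): array
{
    $node = $tree;
    while (!isset($node['run'])) {
        $bit = $nextBit();
        if (!isset($node[$bit])) {
            throw new RuntimeException('invalid code word');  // error leaf
        }
        $node = $node[$bit];
    }
    return array($node['run'], $node['terminating']);
}

$whiteTree = buildTree(CODEWORDS, 0);
$blackTree = buildTree(CODEWORDS, 1);
Decoding a line is then a loop of decodeRun() calls, adding make-up runs to the terminating run that follows them and switching between $whiteTree and $blackTree after each terminating code.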
I have an array that I need to pack to 16-bit values with Python.
I have been doing this with PHP without any issues.
The array is just a large set of numbers like this - [1, 2, 3, 4, 5, 700, 540...]
With php I do this process in one line:
$encoded_string = pack("s*", ...$array); // Done
I cannot for the love of god figure out how the same can be done in Python.
I have read the documentation, I have looked at examples, and I cannot get this done.
The best I have is below, and it does not work in any variation I have tried.
encoded_string = struct.pack('h', *array)
You have to call struct.pack on each member of your array:
import struct
nums = [1, 2, 3, 4, 5, 700, 540]
as_bytes = [struct.pack('h', i) for i in nums]
# Produces
[b'\x01\x00', b'\x02\x00', b'\x03\x00', b'\x04\x00', b'\x05\x00', b'\xbc\x02', b'\x1c\x02']
and then you can join those into a single byte string if you want:
>>> b''.join(as_bytes)
b'\x01\x00\x02\x00\x03\x00\x04\x00\x05\x00\xbc\x02\x1c\x02'
Note: you can also use the endianness modifiers to specify the byte order of the output bytes.
Edit: #Proper reminded me that struct.pack's format string also supports specifying the number of packed values, so this can be done more easily by including the length of the data in an f-string format specifier:
>>> struct.pack(f'{len(nums)}h', *nums)
b'\x01\x00\x02\x00\x03\x00\x04\x00\x05\x00\xbc\x02\x1c\x02'
Thank you for the reply, b_c. To be honest, I hate Python with a passion at this point. There was another problem that I had to fix: the array was created as str and not int after it was "exploded", so it had to be remapped to int.
Your code does work, thank you.
There is a way to do it with my initial code; however, you have to define the number of values you want to process. It is possible to simply count the number of values in the array and put that in to make it automated:
data_array = map(int, data)  # converts all values in the array to int
encoded_string = struct.pack('240h', *data_array)  # 240 is the number of values in the array
I had an interesting discussion with my developer friends. I wanted to create a random sequence of given array values, but with maximum fragmentation and without any detectable patterns. This so-called maximum randomness would practically always be identical for any unique input sequence.
Example input array:
array(1, 2, 3, 4, 5);
Example result of a standard rand() function:
array(2, 3, 1, 5, 4);
What I don't like in the output above are sequences of adjacent values like "2, 3" and "5, 4"; it's not fragmented enough.
Expecting result would/could be:
array(3, 5, 1, 4, 2);
So my question: is there any known formula to calculate the maximum randomness or, for a better choice of words, maximum fragmentation?
What you are talking about is not randomization, it is sorting. The result of randomization should not depend on the order of the initial data.
Fragmentation in this case should be understood as the difference between the array before and after sorting. But it must be evaluated differently depending on the task. For example, one can evaluate the difference between the positions of the elements, or their order.
Sorting example.
<?php
// this should really be a uksort() call with a sequence formula, but it is easier for me to do it like this
$array = array(1, 2, 3, 4, 5);
uk_sort($array);

function uk_sort(&$array) {
    $even = array();
    $odd = array();
    for ($i = 0; $i < count($array); $i++) {
        if ($i % 2 == 0) {
            $even[] = $array[$i];
        } else {
            $odd[] = $array[$i];
        }
    }
    $even[] = array_shift($even);
    rsort($odd);
    $array = array_merge($even, $odd);
}

print_r($array);
?>
Array
(
[0] => 3
[1] => 5
[2] => 1
[3] => 4
[4] => 2
)
You could split the list into two (or more) collections, shuffle those THEN mix them in order?
array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10);  // original list
array(1, 2, 3, 4, 5);                  // split into two collections
array(6, 7, 8, 9, 10);
array(2, 3, 1, 5, 4);                  // each collection shuffled
array(8, 7, 10, 9, 6);
array(2, 8, 3, 7, 1, 10, 5, 9, 4, 6)   // mixed back together in order
This would give you a fairly high fragmentation but not the maximum.
I suspect to get the maximum would require a LOT more work.
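A minimal sketch of that idea in PHP (the function name is made up):
<?php
// Split the list into two halves, shuffle each half, then interleave them.
function fragmentedShuffle(array $values): array
{
    if (count($values) < 2) {
        return $values;
    }
    $halves = array_chunk($values, (int)ceil(count($values) / 2));
    foreach ($halves as &$half) {
        shuffle($half);
    }
    unset($half);

    $result = array();
    for ($i = 0; $i < count($halves[0]); $i++) {
        foreach ($halves as $half) {
            if (isset($half[$i])) {     // guard for odd-length input
                $result[] = $half[$i];
            }
        }
    }
    return $result;
}

print_r(fragmentedShuffle(array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)));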
Assuming the fragmentation is defined as the sum of the absolute differences of successive values, the maximum fragmentation sequence is not unique -- the reverse sequence will always have the exact same fragmentation and there're many more options, e.g. all the following orderings will have a fragmentation of 11, which is maximal for this array: (3,1,5,2,4), (3,2,5,1,4), (2,5,1,4,3), (2,4,1,5,3), (4,1,5,2,3), (4,2,5,1,3), (3,5,1,4,2), (3,4,1,5,2). There're yet more symmetries if one incorporates the difference between the last and the first element, too.
If one seeks to identify a particular maximum fragmentation sequence, e.g. the one "without a noticeable pattern", the latter notion has to be formalized and a search performed, which, I suspect, would be costly from the computational point of view, unless the objective can be formalized so as to permit efficient decoding. I suspect that for all practical purposes a good heuristic would suffice, e.g. inserting elements into an array one by one (greedy fashion) so as to maximize the gain in fragmentation on each step.
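A possible shape for that greedy heuristic in PHP, assuming the fragmentation measure is the sum of absolute differences of successive values (both function names are my own):
<?php
// Fragmentation: sum of the absolute differences of successive values.
function fragmentation(array $seq): int
{
    $sum = 0;
    for ($i = 1; $i < count($seq); $i++) {
        $sum += abs($seq[$i] - $seq[$i - 1]);
    }
    return $sum;
}

// Greedy heuristic: insert each element at the position that currently
// yields the largest fragmentation. Fast, but not guaranteed to be optimal.
function greedyFragment(array $values): array
{
    $seq = array();
    foreach ($values as $v) {
        $bestPos = 0;
        $bestScore = -1;
        for ($pos = 0; $pos <= count($seq); $pos++) {
            $candidate = $seq;
            array_splice($candidate, $pos, 0, $v);   // try inserting here
            $score = fragmentation($candidate);
            if ($score > $bestScore) {
                $bestScore = $score;
                $bestPos = $pos;
            }
        }
        array_splice($seq, $bestPos, 0, $v);
    }
    return $seq;
}

print_r(greedyFragment(array(1, 2, 3, 4, 5)));
// e.g. (2, 5, 3, 1, 4) with fragmentation 10 -- close to, but not necessarily
// equal to, the maximum of 11 for this input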
If the elements of the array are not numbers but some entities with a defined distance for each pair, however, the problem does become equivalent to the traveling salesman problem, as user802500 pointed out.
I think this sounds like a traveling salesman type problem, with the "distance" being the difference between two chosen entries, except your goal would be to maximize the total distance rather than minimize it.
I don't actually know a ton about the topic, but here's what I think I know:
There are algorithms for the traveling salesman problem, but they can be quite slow in the limit (they're NP-hard). On the other hand, there are good approximations, and simple cases may be solvable, though it will still be a non-trivial algorithm.
Depending on how important it is to have maximum fragmentation, you could also try a naive method: given an element, choose the next element so that it's quite distant from the given element. Then choose a next element, and so on. The problem with this is that your early choices can back you into a corner. So this won't work if fragmentation is quite important to you.
[2,5,1,3,4] // the first three choices force us to not fragment the last two
I'm having trouble calculating roots of rather large numbers using bc_math, example:
- pow(2, 2) // 4, power correct
- pow(4, 0.5) // 2, square root correct
- bcpow(2, 2) // 4, power correct
- bcpow(4, 0.5) // 1, square root INCORRECT
Does anybody know how I can circumvent this? gmp_pow() also doesn't work.
I'm not a PHP programmer but looking at the manual it says you have to pass them in as strings i.e.
bcpow( '4', '0.5' )
Does that help?
Edit: The user-contributed notes on the manual page confirm that it doesn't support non-integer exponents.
I did come across this discussion of a PHP N-th root algorithm after a quick search so perhaps that's what you require.
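For square roots specifically there is bcsqrt(); for a general N-th root, a Newton-iteration sketch on top of BCMath could look like this (the function name is made up, and the convergence handling is deliberately simple):
<?php
// N-th root of $a via Newton's method: x_{k+1} = ((n-1)*x_k + a / x_k^(n-1)) / n
function bc_nthroot(string $a, int $n, int $scale = 20): string
{
    bcscale($scale + 5);                     // a few guard digits
    $x = bcdiv($a, (string)$n);              // crude initial guess
    if (bccomp($x, '0') === 0) {
        $x = '1';
    }
    for ($i = 0; $i < 500; $i++) {
        $prev = $x;
        $x = bcdiv(
            bcadd(bcmul((string)($n - 1), $x), bcdiv($a, bcpow($x, (string)($n - 1)))),
            (string)$n
        );
        if (bccomp($x, $prev) === 0) {       // no change at this scale: converged
            break;
        }
    }
    return bcdiv($x, '1', $scale);           // truncate to the requested scale
}

echo bc_nthroot('4', 2), "\n";   // 2.00000000000000000000
echo bc_nthroot('2', 10), "\n";  // ~1.0717734625...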
I'm working in a LAMP environment, so PHP is the language; at least I can use Python.
As the title says, I have two unordered integer arrays.
$array_A = array(13, 4, 59, 38, 9, 69, 72, 93, 1, 3, 5);
$array_B = array(29, 72, 21, 3, 6);
I want to know how many integers these arrays have in common; in the example, as you can see, the result is 2. I'm not interested in which integers they have in common, like (72, 3).
I need a faster method than taking every element of array B and checking whether it's in array A (O(n×m)).
The arrays can be sorted with asort or with SQL ordering (they come from an SQL result).
An idea that came to me is to create a 'vector' for every array where each integer is a position that gets the value 1, and integers not present get 0.
So, for array A (starting at pos 1)
(1, 0, 1, 1, 1, 0, 0, 0, 1, 0, ...)
Same for array B
(0, 0, 1, 0, 0, 1, ...)
And then compare these two vectors in one pass. The problem is that this way the vector length is about 400k.
Depending on your data (size) you might want to use array_intersect_key() instead of array_intersect(). Apparently the implementation of array_intersect (testing php 5.3) does not use any optimization/caching/whatsoever but loops through the array and compares the values one by one for each element in array A. The hashtable lookup is incredibly faster than that.
<?php
function timefn($fn) {
    static $timer = array();
    if ( is_null($fn) ) {
        return $timer;
    }
    $x = range(1, 120000);
    $y = range(2, 100000);
    foreach ($y as $k => $v) { if (0 === $k % 3) unset($y[$k]); }
    $s = microtime(true);
    $fn($x, $y);
    $e = microtime(true);
    @$timer[$fn] += $e - $s;   // @ suppresses the notice on the first accumulation
}

function fnIntersect($x, $y) {
    $z = count(array_intersect($x, $y));
}

function fnFlip($x, $y) {
    $x = array_flip($x);
    $y = array_flip($y);
    $z = count(array_intersect_key($x, $y));
}

for ($i = 0; $i < 3; $i++) {
    timefn( 'fnIntersect' );
    timefn( 'fnFlip' );
}

print_r(timefn(null));
prints
Array
(
    [fnIntersect] => 11.271192073822
    [fnFlip] => 0.54442691802979
)
which means the array_flip/intersect_key method is ~20 times faster on my notebook.
(as usual: this is an ad hoc test. If you spot an error, tell me ...I'm expecting that ;-) )
I don't know a great deal about PHP so you may get a more specific answer from others, but I'd like to present a more language-agnostic approach.
By checking every element in A against every element in B, it is indeed O(n²) [I'll assume the arrays are of identical length here to simplify the equations but the same reasoning will hold for arrays of differing lengths].
If you were to sort the data in both arrays, you could reduce the time complexity to O(n log n) or similar, depending on the algorithm chosen.
But you need to keep in mind that the complexity only really becomes important for larger data sets. If those two arrays you gave were typical of the size, I would say don't sort it, just use the "compare everything with everything" method - sorting won't give you enough of an advantage over that. Arrays of 50 elements would still only give you 2,500 iterations (whether that's acceptable to PHP, I don't know, it would certainly be water off a duck's back for C and other compiled languages).
And before anyone jumps in and states that you should plan for larger data sets just in case, that's YAGNI, as unnecessary as premature optimization. You may never need it in which case you've wasted time that would have been better spent elsewhere. The time to implement that would be when it became a problem (that's my opinion of course, others may disagree).
If the data sets really are large enough to make the O(n²) unworkable, I think sorting then walking through the arrays in parallel is probably your best bet.
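As a sketch, the sort-and-parallel-walk could look like this in PHP (the function name is my own):
<?php
// Count common elements by sorting both arrays and walking them in parallel,
// like the merge step of merge sort: O(n log n + m log m) overall.
function countCommon(array $a, array $b): int
{
    sort($a);
    sort($b);
    $i = $j = $count = 0;
    while ($i < count($a) && $j < count($b)) {
        if ($a[$i] < $b[$j]) {
            $i++;
        } elseif ($a[$i] > $b[$j]) {
            $j++;
        } else {            // equal: one common value found
            $count++;
            $i++;
            $j++;
        }
    }
    return $count;
}

$array_A = array(13, 4, 59, 38, 9, 69, 72, 93, 1, 3, 5);
$array_B = array(29, 72, 21, 3, 6);
echo countCommon($array_A, $array_B); // 2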
One other possibility is if the range of numbers is not too big - then your proposed solution of a vector of booleans is quite workable since that would be O(n), walking both arrays to populate the vector followed by comparisons of fixed locations within the two vectors. But I'm assuming your range is too large or you wouldn't have already mentioned the 400K requirement. But again, the size of the data sets will dictate whether or not that's worth doing.
The simplest way would be:
count(array_intersect($array_A, $array_B));
if I understand what you're after.
Should be fast.
If both arrays came from SQL, could you not write an SQL query with an inner join on the 2 sets of data to get your result?
You want the array_intersect() function. From there you can count the result. Don't worry about speed until you know you have a problem. The built-in functions execute much faster than anything you'll be able to write in PHP.
I have written a PHP extension that provides functions for efficient set operations like union, intersection, binary search, etc. Internal data layout is an ordinary int32_t array stored in a PHP string. Operations are based on merge algorithms.
Example:
// Create two intarrays
$a = intarray_create_from_array(array(1, 2, 3));
$b = intarray_create_from_array(array(3, 4, 5));
// Get a union of them
$u = intarray_union($a, $b);
// Dump to screen
intarray_dump($u);
It's available here: https://github.com/tuner/intarray