Is there a function in python that allows me to count the number of non-missing values in an array?
My data:
df.wealth1[df.wealth < 25000] = df.wealth
df.wealth2[(df.wealth < 50000) & (df.wealth > 25000)] = df.wealth
df.wealth3[(df.wealth < 75000) & (df.wealth > 50000)] = df.wealth
...
id, income, wealth, wealth1, wealth2, ..., wealth9
1, 100000, 20000, 20000, , ...,
2, 60000, 40000, , 40000, ...,
3, 70000, 23000, 23000, , ...,
4, 80000, 75000, , , ..., 75000
...
My current situation:
income_brackets = [(0, 25000), (25000, 50000), (50000, 100000)]
source = {'wealth1': [], 'wealth2': [], ..., 'wealth9': []}

for lower, upper in income_brackets:
    for key in source:
        source[key].append(len(df.query('income > {} and income < {}'.format(lower, upper))[np.logical_not(np.isnan(key))]))
But this does not work, because np.isnan('wealth1') is invalid: it only works with np.isnan(df.wealth1), and I cannot see how to incorporate that into my for loop. I am pretty new to Python, so perhaps (hopefully) I am missing something obvious.
Any suggestions or questions would be great. Thanks! Cheers
The best way to do this is with the count method of DataFrame objects:
In [18]: data = np.random.randn(1000, 3)
In [19]: data
Out[19]:
array([[ 0.1035, 0.9239, 0.3902],
[ 0.2022, -0.1755, -0.4633],
[ 0.0595, -1.3779, -1.1187],
...,
[ 1.3931, 0.4087, 2.348 ],
[ 1.2746, -0.6431, 0.0707],
[-1.1062, 1.3949, 0.3065]])
In [20]: data[np.random.rand(len(data)) > 0.5] = np.nan
In [21]: data
Out[21]:
array([[ 0.1035, 0.9239, 0.3902],
[ 0.2022, -0.1755, -0.4633],
[ nan, nan, nan],
...,
[ 1.3931, 0.4087, 2.348 ],
[ 1.2746, -0.6431, 0.0707],
[-1.1062, 1.3949, 0.3065]])
In [22]: df = pd.DataFrame(data, columns=list('abc'))
In [23]: df.head()
Out[23]:
a b c
0 0.1035 0.9239 0.3902
1 0.2022 -0.1755 -0.4633
2 NaN NaN NaN
3 NaN NaN NaN
4 NaN NaN NaN
[5 rows x 3 columns]
In [24]: df.count()
Out[24]:
a 498
b 498
c 498
dtype: int64
In [26]: df.notnull().sum()
Out[26]:
a 498
b 498
c 498
dtype: int64
Like many pandas methods, this also works on Series objects:
In [27]: df.a.count()
Out[27]: 498
Pandas allows you to access columns in the following way too:
np.isnan(df['wealth1'])
By the way, even if that were not the case, you could still do
np.isnan(getattr(df, 'wealth1'))
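Putting this together with the loop from the question, a minimal sketch (with toy data loosely based on the table above; the real df would have wealth1 through wealth9):
import numpy as np
import pandas as pd

# toy frame mirroring the question's layout, trimmed to three wealth columns
df = pd.DataFrame({
    'income':  [100000, 60000, 70000, 80000],
    'wealth1': [20000, np.nan, 23000, np.nan],
    'wealth2': [np.nan, 40000, np.nan, np.nan],
    'wealth3': [np.nan, np.nan, np.nan, 75000],
})
income_brackets = [(0, 25000), (25000, 50000), (50000, 100000)]
source = {col: [] for col in ['wealth1', 'wealth2', 'wealth3']}

for lower, upper in income_brackets:
    # rows whose income falls inside the bracket
    subset = df[(df.income > lower) & (df.income < upper)]
    for col in source:
        # count() ignores NaN, so this is the number of non-missing values;
        # subset[col].notnull().sum() would give the same result
        source[col].append(subset[col].count())

print(source)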
I am currently trying to shuffle an array and am running into some problems.
What I have:
my_array=array([nan, 1, 1, nan, nan, 2, nan, ..., nan, nan, nan])
What I want to do:
I want to shuffle the dataset while keeping the numbers (e.g. the 1,1 in the array) together.
What I did first is convert every nan into a unique negative number:
my_array = array([-1, 1, 1, -2, -3, 2, -4, ..., -2158, -2159, -2160])
Afterward I split everything up with pandas:
df = pd.DataFrame(my_array)
df.rename(columns={0: 'sampleID'}, inplace=True)
groups = [df.iloc[:, 0] for _, df in df.groupby('sampleID')]
If I now shuffle my dataset, every group will have an equal probability of appearing at a given place, but this neglects the number of elements in each group. If I have a group of several elements like [9,9,9,9,9,9], it should have a higher chance of appearing earlier than some random nan. Correct me on this one if I'm wrong.
One way to get around this problem is NumPy's choice method.
For this I have to create a probability array:
probability_array = np.zeros(len(groups))
for index, item in enumerate(groups):
    # weight each group by its size; dividing by the total number of elements makes the weights sum to 1
    probability_array[index] = len(item) / len(my_array)
All of this to finally call:
groups=np.array(groups,dtype=object)
rng = np.random.default_rng()
shuffled_indices = rng.choice(len(groups), len(groups), replace=False, p=probability_array)
shuffled_array = np.concatenate(groups[shuffled_indices]).ravel()
shuffled_array[shuffled_array < 1] = np.nan
All of this is quite cumbersome and not very fast. Besides the fact that you can certainly code it better, I feel like I am missing some very simple solution to my problem.
Can somebody point me in the right direction?
One approach:
import numpy as np
from itertools import groupby
# toy data
my_array = np.array([np.nan, 1, 1, np.nan, np.nan, 2, 2, 2, np.nan, 3, 3, 3, np.nan, 4, 4, np.nan, np.nan])
# find runs of equal values and their lengths (each nan is its own run, since nan != nan)
groups = np.array([[key, sum(1 for _ in group)] for key, group in groupby(my_array)])
# permute the runs
keys, repetitions = zip(*np.random.permutation(groups))
# recreate the array (the run lengths come back as floats from the float array, so cast to int for np.repeat)
res = np.repeat(keys, np.array(repetitions, dtype=int))
print(res)
Output (single run)
[ 3. 3. 3. nan nan nan nan 2. 2. 2. 1. 1. nan nan nan 4. 4.]
I have solved your problem under some restrictions:
Instead of NaN, I have used zeros as separators.
I assumed that your array ALWAYS starts with a sequence of non-zero integers and ends with another sequence of non-zero integers.
With these provisions, I have essentially shuffled a representation of the sequences of integers, and later stitched everything back in place.
In [102]: import numpy as np
...: from itertools import groupby
...: a = np.array([int(_) for _ in '1110022220003044440005500000600777'])
...: print(a)
...: n, z = [], []
...: for i,g in groupby(a):
...: if i:
...: n.append((i, sum(1 for _ in g)))
...: else:
...: z.append(sum(1 for _ in g))
...: np.random.shuffle(n)
...: nn = n[0]
...: b = [*[nn[0]]*nn[1]]
...: for zz, nn in zip(z, n[1:]):
...: b += [*[0]*zz, *[nn[0]]*nn[1]]
...: print(np.array(b))
[1 1 1 0 0 2 2 2 2 0 0 0 3 0 4 4 4 4 0 0 0 5 5 0 0 0 0 0 6 0 0 7 7 7]
[7 7 7 0 0 1 1 1 0 0 0 4 4 4 4 0 6 0 0 0 5 5 0 0 0 0 0 2 2 2 2 0 0 3]
Note
The lengths of the runs of separators in the shuffled array are exactly the same as in the original array, but shuffling the separators as well is easy. A more difficult problem would be to change the lengths arbitrarily while keeping the total array length unchanged.
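For reference, a minimal sketch of the easy variant, reusing the code above with np.random.shuffle(z) as the only addition, so the separator runs are permuted as well:
import numpy as np
from itertools import groupby

a = np.array([int(_) for _ in '1110022220003044440005500000600777'])
n, z = [], []
for i, g in groupby(a):
    if i:
        n.append((i, sum(1 for _ in g)))  # (value, run length) of each non-zero run
    else:
        z.append(sum(1 for _ in g))       # run lengths of the zero separators
np.random.shuffle(n)
np.random.shuffle(z)                      # the only addition: shuffle the separator runs too
nn = n[0]
b = [*[nn[0]] * nn[1]]
for zz, nn in zip(z, n[1:]):
    b += [*[0] * zz, *[nn[0]] * nn[1]]
print(np.array(b))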
I'm using pandas's example to do what I want to do:
>>> s = pd.Series([90, 91, 85])
>>> s
0 90
1 91
2 85
dtype: int64
then the pct_change() is applied to this series:
>>> s.pct_change()
0 NaN
1 0.011111
2 -0.065934
dtype: float64
Okay, fair enough, but Percentage Increase = [ (Final Value - Starting Value) / |Starting Value| ] × 100,
so the results should actually be [NaN, 1.11111%, -6.59341%].
How would I get the ×100 part that pct_change() didn't apply for me?
You can simply multiply the result by 100 to get what you want:
In [712]: s.pct_change().mul(100)
Out[712]:
0 NaN
1 1.111111
2 -6.593407
dtype: float64
If you want the result to be a list of these values, do this:
In [714]: l = s.pct_change().mul(100).tolist()
In [715]: l
Out[715]: [nan, 1.1111111111111072, -6.593406593406592]
Try chaining methods:
.pct_change().multiply(100)
following the desired DataFrame operation. You can chain more methods before or after.
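For instance, a minimal sketch using the Series s from the question, with an extra round(2) step just to illustrate chaining further:
import pandas as pd

s = pd.Series([90, 91, 85])
# chain the percentage change, the scaling by 100, and a rounding step
result = s.pct_change().multiply(100).round(2)
print(result)
# 0     NaN
# 1    1.11
# 2   -6.59
# dtype: float64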
I was wondering if there is any pandas equivalent to cumsum() or cummax() etc. for median: e.g. cummedian().
So that if I have, for example this dataframe:
a
1 5
2 7
3 6
4 4
what I want is something like:
df['a'].cummedian()
which should output:
5
6
6
5.5
You can use expanding.median -
df.a.expanding().median()
1 5.0
2 6.0
3 6.0
4 5.5
Name: a, dtype: float64
Timings
df = pd.DataFrame({'a' : np.arange(1000000)})
%timeit df['a'].apply(cummedian())  # cummedian() as defined in the answer further below
1 loop, best of 3: 1.69 s per loop
%timeit df.a.expanding().median()
1 loop, best of 3: 838 ms per loop
The winner is expanding().median(), by roughly a factor of two. Divakar's method is memory intensive and suffers a memory blowout at this size of input.
We could create NaN-filled subarrays as rows with a strides-based function, like so -
def nan_concat_sliding_windows(x):
    n = len(x)
    # pad with n-1 leading NaNs so that every element has a full-length trailing window
    add_arr = np.full(n-1, np.nan)
    x_ext = np.concatenate((add_arr, x))
    strided = np.lib.stride_tricks.as_strided
    nrows = len(x_ext) - n + 1
    s = x_ext.strides[0]
    # each row is a length-n window ending at successive positions of x
    return strided(x_ext, shape=(nrows, n), strides=(s, s))
Sample run -
In [56]: x
Out[56]: array([5, 6, 7, 4])
In [57]: nan_concat_sliding_windows(x)
Out[57]:
array([[ nan, nan, nan, 5.],
[ nan, nan, 5., 6.],
[ nan, 5., 6., 7.],
[ 5., 6., 7., 4.]])
Thus, to get the expanding (cumulative) median values for an array x, we would have a vectorized solution, like so -
np.nanmedian(nan_concat_sliding_windows(x), axis=1)
Hence, the final solution would be -
In [54]: df
Out[54]:
a
1 5
2 7
3 6
4 4
In [55]: pd.Series(np.nanmedian(nan_concat_sliding_windows(df.a.values), axis=1))
Out[55]:
0 5.0
1 6.0
2 6.0
3 5.5
dtype: float64
A faster solution for the specific cumulative median
In [1]: import timeit
In [2]: setup = """import bisect
...: import pandas as pd
...: def cummedian():
...: l = []
...: info = [0, True]
...: def inner(n):
...: bisect.insort(l, n)
...: info[0] += 1
...: info[1] = not info[1]
...: median = info[0] // 2
...: if info[1]:
...: return (l[median] + l[median - 1]) / 2
...: else:
...: return l[median]
...: return inner
...: df = pd.DataFrame({'a': range(20)})"""
In [3]: timeit.timeit("df['cummedian'] = df['a'].apply(cummedian())",setup=setup,number=100000)
Out[3]: 27.11604686321956
In [4]: timeit.timeit("df['expanding'] = df['a'].expanding().median()",setup=setup,number=100000)
Out[4]: 48.457676260100335
In [5]: 48.4576/27.116
Out[5]: 1.7870482372031273
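For reference, a standalone sketch of the same bisect-based helper applied to the small frame from the question, outside the timeit setup string; it reproduces the 5, 6, 6, 5.5 asked for:
import bisect
import pandas as pd

def cummedian():
    l = []                    # sorted list of all values seen so far
    info = [0, True]          # [running count, True when the count is even]
    def inner(n):
        bisect.insort(l, n)
        info[0] += 1
        info[1] = not info[1]
        median = info[0] // 2
        if info[1]:
            return (l[median] + l[median - 1]) / 2
        return l[median]
    return inner

df = pd.DataFrame({'a': [5, 7, 6, 4]})
print(df['a'].apply(cummedian()))   # 5.0, 6.0, 6.0, 5.5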
I have a pandas DataFrame and want to normalize one single column, here 'col3'.
This is what my data looks like:
test1['col3']
1 73.506
2 73.403
3 74.038
4 73.980
5 74.295
6 72.864
7 74.013
8 73.748
9 74.536
10 74.926
11 74.355
12 75.577
13 75.563
Name: col3, dtype: float64
When I use the normalizer function (I hope that I am just using it incorrectly), I get:
from sklearn import preprocessing
preprocessing.normalize(test1['col3'].values[:, np.newaxis], axis=0)
array([[ 0.27468327],
[ 0.27429837],
[ 0.27667129],
[ 0.27645455],
[ 0.27763167],
[ 0.27228419],
[ 0.27657787],
[ 0.27558759],
[ 0.27853226],
[ 0.27998964],
[ 0.27785588],
[ 0.28242235],
[ 0.28237003]])
But for normalization (not standardization), I would usually want to scale the values to a range 0 to 1, right? E.g., via the equation
$X' = \frac{X - X_{min}}{X_{max} - X_{min}}$
So, when I do it "manually", I get completely different results (but results I would expect)
(test1['col3'] - test1['col3'].min()) / (test1['col3'].max() - test1['col3'].min())
1 0.236638
2 0.198673
3 0.432731
4 0.411353
5 0.527460
6 0.000000
7 0.423516
8 0.325839
9 0.616292
10 0.760044
11 0.549576
12 1.000000
13 0.994840
Name: col3, dtype: float64
This is not at all what sklearn.preprocessing.normalize does. In fact, it scales its input vectors to unit L2 norm (or L1 norm if requested), i.e.
>>> from sklearn.preprocessing import normalize
>>> rng = np.random.RandomState(42)
>>> x = rng.randn(2, 5)
>>> x
array([[ 0.49671415, -0.1382643 , 0.64768854, 1.52302986, -0.23415337],
[-0.23413696, 1.57921282, 0.76743473, -0.46947439, 0.54256004]])
>>> normalize(x)
array([[ 0.28396232, -0.07904315, 0.37027159, 0.87068807, -0.13386116],
[-0.12251149, 0.82631858, 0.40155802, -0.24565113, 0.28389299]])
>>> x / np.linalg.norm(x, axis=1).reshape(-1, 1)
array([[ 0.28396232, -0.07904315, 0.37027159, 0.87068807, -0.13386116],
[-0.12251149, 0.82631858, 0.40155802, -0.24565113, 0.28389299]])
>>> np.linalg.norm(normalize(x), axis=1)
array([ 1., 1.])
(normalize uses a faster way of computing the norm than np.linalg and deals with zeros gracefully, but otherwise these two expressions are the same.)
What you were expecting is called min-max scaling in scikit-learn.
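A minimal sketch of that, assuming the test1 DataFrame from the question; MinMaxScaler applies the (X - X_min) / (X_max - X_min) formula column by column:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
# MinMaxScaler expects 2D input, so pass a one-column frame and flatten the result
test1['col3_scaled'] = scaler.fit_transform(test1[['col3']]).ravel()
print(test1['col3_scaled'])   # same values as the manual (x - min) / (max - min) computation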
I need to divide two Series element-wise.
The elements are of type float.
A = [10,20,30]
B = [2,5,5]
result = A/B
I expect
result = [5,4,6]
but get
result = [NaN, NaN, NaN]
This just works with pandas Series as expected:
In [3]: import pandas as pd
In [4]: A = pd.Series([10,20,30])
In [5]: B = pd.Series([2,5,5])
In [6]: A/B
Out[6]:
0    5.0
1    4.0
2    6.0
dtype: float64
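If A and B are already Series and the division still returns NaN, the usual culprit is index alignment: pandas divides values with matching index labels, so non-overlapping indexes produce NaN everywhere. A minimal sketch of that failure mode and two common fixes (the indexes here are just illustrative):
import pandas as pd

A = pd.Series([10, 20, 30], index=[0, 1, 2])
B = pd.Series([2, 5, 5], index=[3, 4, 5])   # indexes do not overlap with A's

print(A / B)                              # NaN for every label: nothing aligns
print(A / B.values)                       # 5.0, 4.0, 6.0 (ignore B's index, divide by position)
print(A.div(B.reset_index(drop=True)))    # 5.0, 4.0, 6.0 (realign B to 0, 1, 2 first)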