Given any integer n, convert it to a float 0.n.
#input
[11 22 5 1 68 17 5 4 558]
#output
[0.11 0.22 0.5 0.1 0.68 0.17 0.5 0.4 0.558]
Is there a way in numpy to do the following?
import numpy as np
int_ = np.array([11, 22, 5, 1, 68, 17, 5, 4, 558])
float_ = np.array([])
for i in range(len(int_)):
    float_ = np.append(float_, int_[i] / 10 ** len(str(int_[i])))
print(float_)
[0.11 0.22 0.5 0.1 0.68 0.17 0.5 0.4 0.558]
For now, the code I have is slow (it takes a lot of time for very large arrays).
One way using numpy.log10:
arr = np.array([11,22,5,1,68,17,5,4,558])
new_arr = arr/np.power(10, np.log10(arr).astype(int) + 1)
print(new_arr)
Output:
[0.11 0.22 0.5 0.1 0.68 0.17 0.5 0.4 0.558]
Explanation:
numpy.log10(arr).astype(int) + 1 will give you the number of digits
numpy.power(10, {above}) will give you the required denominator
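For the sample array, the intermediate steps look like this (just a quick check, reusing the same arr as above):

digits = np.log10(arr).astype(int) + 1
print(digits)                       # digit counts: 2 2 1 1 2 2 1 1 3
print(np.power(10, digits))         # denominators: 100 100 10 10 100 100 10 10 1000
print(arr / np.power(10, digits))   # [0.11 0.22 0.5 0.1 0.68 0.17 0.5 0.4 0.558]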
You can also try a vectorized version of your code using np.vectorize:
def chg_to_float(val):
    return val / 10 ** len(str(val))

v_chg_to_float = np.vectorize(chg_to_float)
v_chg_to_float(arr)
Since you're only inserting a 0. in front of each input integer, you can simply cast them to strings, add the 0., and then cast them to floats.
>>> input_list = [11, 22, 5, 1, 68, 17, 5, 4, 558]
>>> [float(f'0.{str(item)}') for item in input_list]
[0.11, 0.22, 0.5, 0.1, 0.68, 0.17, 0.5, 0.4, 0.558]
Performance could be enhanced by using a generator comprehension instead of a list comprehension.
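If you go the string route on a large input, one possible variant of that idea (just a sketch): feed a generator expression into np.fromiter so no intermediate Python list is built; the count argument only lets numpy pre-allocate the result.

import numpy as np

input_list = [11, 22, 5, 1, 68, 17, 5, 4, 558]
result = np.fromiter((float(f'0.{item}') for item in input_list),
                     dtype=float, count=len(input_list))
print(result)  # [0.11 0.22 0.5 0.1 0.68 0.17 0.5 0.4 0.558]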
I have a DataFrame where one column is a numpy array of numbers. For example,
import numpy as np
import pandas as pd
df = pd.DataFrame.from_dict({
    'id': [1, 1, 2, 2, 3, 3, 3, 4, 4],
    'data': [np.array([0.43, 0.32, 0.19]),
             np.array([0.41, 0.11, 0.21]),
             np.array([0.94, 0.35, 0.14]),
             np.array([0.78, 0.92, 0.45]),
             np.array([0.32, 0.63, 0.48]),
             np.array([0.17, 0.12, 0.15]),
             np.array([0.54, 0.12, 0.16]),
             np.array([0.48, 0.16, 0.19]),
             np.array([0.14, 0.47, 0.01])]
})
I want to group by the id column and aggregate by taking the element-wise average of the arrays. Splitting the arrays up first is not feasible, since they are of length 300 and I have 200,000+ rows. When I do df.groupby('id').mean(), I get the error "No numeric types to aggregate". I am able to get an element-wise mean of the lists using df['data'].mean(), so I think there should be a way to do a grouped mean. To clarify, I want the output to be an array for each value of id. Each element in the resulting array should be the mean of the values of the elements in the corresponding position within each group. In the example, the result should be:
pd.DataFrame.from_dict({
    'id': [1, 2, 3, 4],
    'data': [np.array([0.42, 0.215, 0.2]),
             np.array([0.86, 0.635, 0.29500000000000004]),
             np.array([0.3433333333333333, 0.29, 0.26333333333333336]),
             np.array([0.31, 0.315, 0.1])]
})
Could someone suggest how I might do this? Thanks!
Take the mean twice, once at the array level and once at the group level:
df['data'].map(np.mean).groupby(df['id']).mean().reset_index()
id data
0 1 0.278333
1 2 0.596667
2 3 0.298889
3 4 0.241667
Based on the comment, you can do:
pd.DataFrame(df['data'].tolist(), index=df['id']).mean(level=0).agg(np.array, axis=1)
id
1 [0.42, 0.215, 0.2]
2 [0.86, 0.635, 0.29500000000000004]
3 [0.3433333333333333, 0.29, 0.26333333333333336]
4 [0.31, 0.315, 0.1]
dtype: object
Or:
df.groupby("id")['data'].apply(np.mean)
First, splitting up the array is feasible, because your current storage requires storing a complex object (the full array) in each row of the DataFrame. This is going to take a lot more space than simply storing a flat 2D array.
# Your current memory usage
df.memory_usage(deep=True).sum()
1352
# Create a new DataFrame (really just overwrite `df` but keep separate for illustration)
df1 = pd.concat([df['id'], pd.DataFrame(df['data'].tolist())], axis=1)
# id 0 1 2
#0 1 0.43 0.32 0.19
#1 1 0.41 0.11 0.21
#2 2 0.94 0.35 0.14
#3 2 0.78 0.92 0.45
#4 3 0.32 0.63 0.48
#5 3 0.17 0.12 0.15
#6 3 0.54 0.12 0.16
#7 4 0.48 0.16 0.19
#8 4 0.14 0.47 0.01
Yes, this looks bigger, but in terms of memory it's actually smaller. The 3x factor here is a bit extreme; for larger DataFrames with long arrays the flat version will probably use something like 95% of the memory of the object version, but it still has to be less.
df1.memory_usage(deep=True).sum()
#416
And now your aggregation is a normal groupby + mean; the columns give the position within the array:
df1.groupby('id').mean()
# 0 1 2
#id
#1 0.420000 0.215 0.200000
#2 0.860000 0.635 0.295000
#3 0.343333 0.290 0.263333
#4 0.310000 0.315 0.100000
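If you still want one array per id at the end, a possible follow-up (just a sketch, reusing df1 from above):

means = df1.groupby('id').mean()
# wrap each row of the 2D result back into an object Series of arrays
out = pd.Series(list(means.to_numpy()), index=means.index, name='data')
print(out)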
Group by id and take the mean of the arrays, so the output is an array of element-wise means per group:
df['data'].map(np.array).groupby(df['id']).mean().reset_index()
Output:
id data
0 1 [0.42, 0.215, 0.2]
1 2 [0.86, 0.635, 0.29500000000000004]
2 3 [0.3433333333333333, 0.29, 0.26333333333333336]
3 4 [0.31, 0.315, 0.1]
You can always .apply the numpy mean.
df.groupby('id')['data'].apply(np.mean).apply(np.mean)
# returns:
id
1 0.278333
2 0.596667
3 0.298889
4 0.241667
Name: data, dtype: float64
I need to take a large number of data-frame slices and update the value of a column in each slice to the minimum of the existing value and a constant.
My current code looks like this:
for indices, value in list_of_slices:
    df.loc[indices, 'SCORE'] = df.loc[indices, 'SCORE'].clip(upper=value)
This is quite efficient and much faster than the apply method I used in the beginning; however, it is still somewhat too slow for a large list.
I expected to be able to write
df.loc[indices,'SCORE'].clip(upper=value, inplace=True)
to save on slicing twice, but that doesn't work.
Also saving the slice to a tmp variable seems to create a copy, thus not changing the original df.
Is there a better way to do this loop and/or set the value without slicing the data-frame twice?
You could generate a dictionary where the (key, value) pairs map each index to the value to clip with. For example, consider the following dataframe:
import pandas as pd
import numpy as np
d = {
    'categorical_identifier': [1, 2, 3, 1, 2, 3, 1, 2, 3],
    'SCORE': [0.02, 0.04, 0.67, 0.01, 0.45, 0.89, 0.39, 0.25, 0.47]
}
df = pd.DataFrame(d)
df
>>>
categorical_identifier SCORE
0 1 0.02
1 2 0.04
2 3 0.67
3 1 0.01
4 2 0.45
5 3 0.89
6 1 0.39
7 2 0.25
8 3 0.47
If I generate a dictionary mapping each index to the value to clip to, as follows:
indices_max_values = {
    0: 0.10,
    1: 0.3,
    2: 0.9,
    3: 0.10,
    4: 0.3,
    5: 0.9,
    6: 0.10,
    7: 0.3,
    8: 0.9,
}
Notice that if you have a set of slices you can generate this dictionary by filtering the True values of each condition.
from collections import ChainMap
list_of_slice = [
    df.categorical_identifier == 1,
    df.categorical_identifier == 2,
    df.categorical_identifier == 3
]
dict_of_slice = [{k:v for k, v in dict(s).items() if v} for s in list_of_slice]
dict_of_slice = dict(ChainMap(*dict_of_slice))
dict_of_slice
>>>
{2: True,
5: True,
8: True,
1: True,
4: True,
7: True,
0: True,
3: True,
6: True}
Just replace v with the value you want to clip to when creating dict_of_slice.
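For example, a sketch of that replacement, assuming each slice is paired with the value it should be clipped to (the 0.10 / 0.3 / 0.9 values here are just the illustrative ones from above):

slices_with_values = [
    (df.categorical_identifier == 1, 0.10),
    (df.categorical_identifier == 2, 0.3),
    (df.categorical_identifier == 3, 0.9),
]
# keep only the indices where each mask is True, mapped to that slice's clip value
indices_max_values = dict(ChainMap(*[
    {idx: clip_value for idx, hit in dict(mask).items() if hit}
    for mask, clip_value in slices_with_values
]))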
Then you can apply np.clip() to each element, looking up the value to clip to by the index:
df.reset_index(inplace=True)
df.rename(columns={'index':'Index'}, inplace=True)
existing_value = 0
df[['Index', 'SCORE']].transform(
    lambda x: np.clip(x, a_min=existing_value, a_max=indices_max_values[x.Index]),
    axis=1
)
>>>
Index SCORE
0 0.0 0.02
1 0.3 0.04
2 0.9 0.67
3 0.1 0.01
4 0.3 0.30
5 0.9 0.89
6 0.1 0.10
7 0.3 0.25
8 0.9 0.47
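As a side note (just a sketch of the same idea, not part of the transform above): once indices_max_values exists, the whole update can also be done in one vectorized step, mapping the clip values onto the Index column and taking an element-wise minimum:

# upper-clip each SCORE against its per-row maximum
df['SCORE'] = np.minimum(df['SCORE'], df['Index'].map(indices_max_values))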
I am using a numpy arange.
[In] test = np.arange(0.01, 0.2, 0.02)
[In] test
[Out] array([0.01, 0.03, 0.05, 0.07, 0.09, 0.11, 0.13, 0.15, 0.17, 0.19])
But then, if I iterate over this array, it iterates over slightly smaller values.
[In] for t in test:
.... print(t)
[Out]
0.01
0.03
0.049999999999999996
0.06999999999999999
0.08999999999999998
0.10999999999999997
0.12999999999999998
0.15
0.16999999999999998
0.18999999999999997
Why is this happening?
To avoid this problem, I have been rounding the values, but is this the best way to solve this problem?
for t in test:
    print(round(t, 2))
I think the nature of floating-point numbers, mentioned in the comments, is the issue.
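A minimal illustration of what is going on: the decimal literals passed to arange are stored as the nearest binary doubles, which are not exactly the decimal values, and arange then does float arithmetic on those approximations.

from decimal import Decimal

# the exact value of the double nearest to the literal 0.05
print(Decimal(0.05))
# 0.05000000000000000277555756156289135105907917022705078125

If rounding is acceptable for your use case, you can at least round the whole array at once with np.round(test, 2) instead of rounding inside the loop.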
If you're still uncomfortable leaving it that way, I suggest that you multiply your numbers by 100 and work with integers:
test = np.arange(1, 20, 2)
print(test)
for t in test:
    print(t / 100)
This gives me the following output:
[ 1 3 5 7 9 11 13 15 17 19]
0.01
0.03
0.05
0.07
0.09
0.11
0.13
0.15
0.17
0.19
Alternatively you can also try the following:
test = np.arange(1, 20, 2) / 100
Did you try:
test = np.arange(0.01, 0.2, 0.02, dtype=np.float32)
I have "reference population" (say, v=np.random.rand(100)) and I want to compute percentile ranks for a given set (say, np.array([0.3, 0.5, 0.7])).
It is easy to compute one by one:
def percentile_rank(x):
    return (v < x).sum() / len(v)
percentile_rank(0.4)
=> 0.4
(Actually, there is an out-of-the-box scipy.stats.percentileofscore, but it does not work on vectors.)
np.vectorize(percentile_rank)(np.array([0.3, 0.5, 0.7]))
=> [ 0.33 0.48 0.71]
This produces the expected results, but I have a feeling that there should be a built-in for this.
I can also cheat:
pd.concat([pd.Series([0.3, 0.5, 0.7]),pd.Series(v)],ignore_index=True).rank(pct=True).loc[0:2]
0 0.330097
1 0.485437
2 0.718447
This is bad on two counts:
I don't want the test data [0.3, 0.5, 0.7] to be a part of the ranking.
I don't want to waste time computing ranks for the reference population.
So, what is the idiomatic way to accomplish this?
Setup:
In [62]: v=np.random.rand(100)
In [63]: x=np.array([0.3, 0.4, 0.7])
Using Numpy broadcasting:
In [64]: (v<x[:,None]).mean(axis=1)
Out[64]: array([ 0.18, 0.28, 0.6 ])
Check:
In [67]: percentile_rank(0.3)
Out[67]: 0.17999999999999999
In [68]: percentile_rank(0.4)
Out[68]: 0.28000000000000003
In [69]: percentile_rank(0.7)
Out[69]: 0.59999999999999998
I think pd.cut can do that:
s=pd.Series([-np.inf,0.3, 0.5, 0.7])
pd.cut(v,s,right=False).value_counts().cumsum()/len(v)
Out[702]:
[-inf, 0.3) 0.37
[0.3, 0.5) 0.54
[0.5, 0.7) 0.71
dtype: float64
Result from your function
np.vectorize(percentile_rank)(np.array([0.3, 0.5, 0.7]))
Out[696]: array([0.37, 0.54, 0.71])
You can use quantile:
np.random.seed(123)
v=np.random.rand(100)
s = pd.Series(v)
arr = np.array([0.3,0.5,0.7])
s.quantile(arr)
Output:
0.3 0.352177
0.5 0.506130
0.7 0.644875
dtype: float64
I know I am a little late to the party, but wanted to add that pandas has another way to get what you are after with Series.rank. Just use the pct=True option.
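A sketch of that suggestion (note that, like the concat "cheat" in the question, this ranks the test points together with the reference population, so the results are not exactly the strict (v < x) proportion):

import numpy as np
import pandas as pd

v = np.random.rand(100)
x = [0.3, 0.5, 0.7]

# rank the combined values and read off the percentile ranks of the test points
combined = pd.Series(x + list(v))
print(combined.rank(pct=True).iloc[:len(x)])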
I have an input file that looks sort of like this:
0.1 0.3 0.4 0.3
0.2 02. 1.2 -0.2
0.1 -1.22 0.12 9.2 0.2 0.2
0.3 -1.42 0.2 6.2 0.9 0.88
0.3 -1.42 0.12 1.1 0.1 0.88 0.06 0.14
4
So it starts with some number of columns, and ends with n*2 columns (n is the last line).
I can get the number of rows, say # rows = i. I can also get n.
I want to read this file into a python 2d array (not a list), e.g. Array[i][n*2]. I realize I may need to fill the empty columns with zeros so that it can be read simply as
Array = numpy.loadtxt("data.txt")
But I don't know how to proceed.
Thanks
I don't think any of the built-in missing-value stuff is going to help here, because space-separated columns make it ambiguous which values are missing. (Not ambiguous in your context—you know all the missing columns are on the right—but a general-purpose parser won't.) Hopefully I'm wrong and someone else will provide a simpler answer, but otherwise…
One option is to extend the lines one by one on the fly and feed them into an array. If memory isn't an issue, you can do this with a list comprehension over the rows:
import numpy as np

def readrow(row, cols):
    a = np.fromstring(row, sep=' ')
    a.resize((cols,))
    return a

with open(file_path, 'rb') as f:
    a = np.array([readrow(row, 2*n) for row in f])
If you can't afford to waste the memory to create a temporary list of i 1D arrays, you may need to use something like fromiter to generate a 1D array, then reshape it:
import itertools

a = np.fromiter(itertools.chain.from_iterable(
        readrow(row, n*2) for row in f), dtype=float).reshape((-1, n*2))
(Although at this point, using numpy to parse the rows instead of csv or just str.split seems like it might be a bit silly.)
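For completeness, a sketch of that plain str.split route (assuming the layout from the question, with the last line of data.txt holding n):

import numpy as np

with open("data.txt") as f:
    lines = f.read().splitlines()

n = int(lines[-1])              # the trailing line gives n; rows get n*2 columns
rows = []
for line in lines[:-1]:
    vals = [float(tok) for tok in line.split()]
    rows.append(vals + [0.0] * (n * 2 - len(vals)))   # right-pad with zeros

arr = np.array(rows)            # shape (number of rows, n*2)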
If you want to pad the short lines with 0.0's, here is one way: pad with a full set of 0.0's, then slice only the leading significant part:
data = """0.1 0.3 0.4 0.3
0.2 02. 1.2 -0.2
0.1 -1.22 0.12 9.2 0.2 0.2
0.3 -1.42 0.2 6.2 0.9 0.88
0.3 -1.42 0.12 1.1 0.1 0.88 0.06 0.14
4""".splitlines()
maxcols = int(data[-1])*2
emptyvalue = 0.0
pad = [emptyvalue]*maxcols
for line in data[:-1]:
    # get the input data values, converted from strings to floats
    vals = map(float, line.split())
    # pad the input with default values, then only take the first maxcols values
    vals = (vals + pad)[:maxcols]
    # show our work in a nice table
    print "[" + ','.join("%s%.2f" % (' ' if v>=0 else '', v) for v in vals) + "]"
prints
[ 0.10, 0.30, 0.40, 0.30, 0.00, 0.00, 0.00, 0.00]
[ 0.20, 2.00, 1.20,-0.20, 0.00, 0.00, 0.00, 0.00]
[ 0.10,-1.22, 0.12, 9.20, 0.20, 0.20, 0.00, 0.00]
[ 0.30,-1.42, 0.20, 6.20, 0.90, 0.88, 0.00, 0.00]
[ 0.30,-1.42, 0.12, 1.10, 0.10, 0.88, 0.06, 0.14]