I would like to convert all the values in a pandas dataframe from strings to floats. My dataframe contains various NaN values (e.g. NaN, NA, None). For example,
import pandas as pd
import numpy as np
my_data = np.array([[0.5, 0.2, 0.1], ["NA", 0.45, 0.2], [0.9, 0.02, "N/A"]])
df = pd.DataFrame(my_data, dtype=str)
I have found here and here (among other places) that convert_objects might be the way to go. However, I get a message that it is deprecated (I am using Pandas 0.17.1) and should instead use to_numeric.
df2 = df.convert_objects(convert_numeric=True)
Output:
FutureWarning: convert_objects is deprecated. Use the data-type specific converters pd.to_datetime, pd.to_timedelta and pd.to_numeric.
But to_numeric doesn't seem to actually convert the strings.
df3 = pd.to_numeric(df, errors='coerce')
Output:
df2:
     0     1    2
0  0.5  0.20  0.1
1  NaN  0.45  0.2
2  0.9  0.02  NaN
df2 dtypes:
0    float64
1    float64
2    float64
dtype: object
df3:
     0     1    2
0  0.5   0.2  0.1
1   NA  0.45  0.2
2  0.9  0.02  N/A
df3 dtypes:
0    object
1    object
2    object
dtype: object
Should I use convert_objects and deal with the warning message, or is there a proper way to do what I want with to_numeric?
Strangely this works:
In [11]:
df.apply(lambda x: pd.to_numeric(x, errors='coerce'))
Out[11]:
     0     1    2
0  0.5  0.20  0.1
1  NaN  0.45  0.2
2  0.9  0.02  NaN
It seems that it's not able to coerce the entire df, which is a little surprising at first; pd.to_numeric accepts only 1-D inputs (a scalar, list, tuple, 1-d array, or Series), not a whole DataFrame, so it has to be applied one column at a time.
If you hate typing (thanks to @Zero), then you can just use:
df.apply(pd.to_numeric, errors='coerce')
You can try replace and astype:
import pandas as pd
import numpy as np
my_data = np.array([[0.5, 0.2, 0.1], ["NA", 0.45, 0.2], [0.9, 0.02, "N/A"]])
df = pd.DataFrame(my_data, dtype=str)
print(df.replace({r'N': np.nan}, regex=True).astype(float))
     0     1    2
0  0.5  0.20  0.1
1  NaN  0.45  0.2
2  0.9  0.02  NaN
Related
I have a simple dataframe as the following:
import pandas as pd

n_obs = 3
dd = pd.DataFrame({
    'WTL_exploded': [0, 1, 2] * n_obs,
    'hazard': [0.3, 0.4, 0.5, 0.2, 0.8, 0.9, 0.6, 0.6, 0.65],
}, index=[1, 1, 1, 2, 2, 2, 3, 3, 3])
dd
I want to group by the index and get the cumulative product of the hazard column. However, I want to multiply all but the last element of each group.
Desired output:
       hazard
index
1        0.30
1        0.12
2        0.20
2        0.16
3        0.60
3        0.36
How can I do that?
You can use:
out = dd.groupby(level=0, group_keys=False).apply(lambda x: x.cumprod().iloc[:-1])
Or:
out = dd.groupby(level=0).apply(lambda x: x.cumprod().iloc[:-1]).droplevel(1)
output:
   WTL_exploded  hazard
1             0    0.30
1             0    0.12
2             0    0.20
2             0    0.16
3             0    0.60
3             0    0.36
NB. you can also use lambda x: x.cumprod().head(-1).
The solution I found is a bit intricate but works for the test case.
First, get rid of the last row of each group:
ff = dd.groupby(lambda x: x, as_index=False).apply(lambda x: x.iloc[:-1])
ff
Then, restore the original index, group-by again and use pandas cumprod:
ff.reset_index().set_index('level_1').groupby(lambda x: x).cumprod()
Is there a more direct way?
I have a dataframe with 5 columns that have missing values.
How do I fill the missing values by taking the average of the previous two columns' values?
Here is the sample code for the same.
import pandas as pd

coh0 = [0.5, 0.3, 0.1, 0.2, 0.2]
coh1 = [0.4, 0.3, 0.6, 0.5]
coh2 = [0.2, 0.2, 0.3]
coh3 = [0.8, 0.8]
coh4 = [0.5]
df = pd.DataFrame({'coh0': pd.Series(coh0), 'coh1': pd.Series(coh1),
                   'coh2': pd.Series(coh2), 'coh3': pd.Series(coh3),
                   'coh4': pd.Series(coh4)})
df
Here is the sample output
   coh0  coh1  coh2  coh3  coh4
0   0.5   0.4   0.2   0.8   0.5
1   0.3   0.3   0.2   0.8   NaN
2   0.1   0.6   0.3   NaN   NaN
3   0.2   0.5   NaN   NaN   NaN
4   0.2   NaN   NaN   NaN   NaN
Here is the desired result I am looking for. The NaN value in each column should be replaced by the average of the previous two columns' values in the same row. However, for the first NaN value in the second column, it will take the last value of the first column by default (there is only one previous column at that point). The desired output matches the final result table shown in the answer below.
For the exception you named, the first NaN, you can do
df.iloc[1, -1] = df.iloc[0, -1]
though it doesn't make a difference in this case as the mean of .2 and .8 is .5, anyway.
Either way, the rest is something like a rolling window calculation, except it has to be computed incrementally. Normally, you want to vectorize your operations and avoid iterating over the dataframe, but IMHO this is one of the rarer cases where it's actually appropriate to loop over the columns (cf. this excellent post), i.e.,
compute the row-wise (axis=1) mean of up to two columns left of the current one (df.iloc[:, max(0, i-2):i]),
and fill its NaN values from the resulting series.
for i in range(1, df.shape[1]):
    mean_df = df.iloc[:, max(0, i-2):i].mean(axis=1)
    df.iloc[:, i] = df.iloc[:, i].fillna(mean_df)
which results in
   coh0  coh1  coh2   coh3    coh4
0   0.5   0.4  0.20  0.800  0.5000
1   0.3   0.3  0.20  0.800  0.5000
2   0.1   0.6  0.30  0.450  0.3750
3   0.2   0.5  0.35  0.425  0.3875
4   0.2   0.2  0.20  0.200  0.2000
I have a DataFrame where one column is a numpy array of numbers. For example,
import numpy as np
import pandas as pd
df = pd.DataFrame.from_dict({
    'id': [1, 1, 2, 2, 3, 3, 3, 4, 4],
    'data': [np.array([0.43, 0.32, 0.19]),
             np.array([0.41, 0.11, 0.21]),
             np.array([0.94, 0.35, 0.14]),
             np.array([0.78, 0.92, 0.45]),
             np.array([0.32, 0.63, 0.48]),
             np.array([0.17, 0.12, 0.15]),
             np.array([0.54, 0.12, 0.16]),
             np.array([0.48, 0.16, 0.19]),
             np.array([0.14, 0.47, 0.01])]
})
I want to group by the id column and aggregate by taking the element-wise average of the arrays. Splitting the arrays up first is not feasible, since they are length 300 and I have 200,000+ rows.

When I do df.groupby('id').mean(), I get the error "No numeric types to aggregate". I am able to get an element-wise mean of the lists using df['data'].mean(), so I think there should be a way to do a grouped mean.

To clarify, I want the output to be an array for each value of id. Each element in the resulting array should be the mean of the values of the elements in the corresponding position within each group. In the example, the result should be:
pd.DataFrame.from_dict({
    'id': [1, 2, 3, 4],
    'data': [np.array([0.42, 0.215, 0.2]),
             np.array([0.86, 0.635, 0.29500000000000004]),
             np.array([0.3433333333333333, 0.29, 0.26333333333333336]),
             np.array([0.31, 0.315, 0.1])]
})
Could someone suggest how I might do this? Thanks!
Mean it twice: once at the array level and once at the group level:
df['data'].map(np.mean).groupby(df['id']).mean().reset_index()
   id      data
0   1  0.278333
1   2  0.596667
2   3  0.298889
3   4  0.241667
Based on comment, you can do:
pd.DataFrame(df['data'].tolist(), index=df['id']).groupby(level=0).mean().agg(np.array, axis=1)
id
1 [0.42, 0.215, 0.2]
2 [0.86, 0.635, 0.29500000000000004]
3 [0.3433333333333333, 0.29, 0.26333333333333336]
4 [0.31, 0.315, 0.1]
dtype: object
Or:
df.groupby("id")['data'].apply(np.mean)
First, splitting up the array is feasible, because your current storage keeps a complex object (a full ndarray) in every cell of the DataFrame. That takes a lot more space than simply storing the values as a flat 2D array.
# Your current memory usage
df.memory_usage(deep=True).sum()
1352
# Create a new DataFrame (really just overwrite `df` but keep separate for illustration)
df1 = pd.concat([df['id'], pd.DataFrame(df['data'].tolist())], axis=1)
#    id     0     1     2
# 0   1  0.43  0.32  0.19
# 1   1  0.41  0.11  0.21
# 2   2  0.94  0.35  0.14
# 3   2  0.78  0.92  0.45
# 4   3  0.32  0.63  0.48
# 5   3  0.17  0.12  0.15
# 6   3  0.54  0.12  0.16
# 7   4  0.48  0.16  0.19
# 8   4  0.14  0.47  0.01
Yes, this looks bigger, but in terms of memory it's actually smaller. The 3x factor here is a bit extreme; for larger DataFrames with long arrays the ratio will be less dramatic, perhaps around 95% of the object-based storage, but it still has to be less.
df1.memory_usage(deep=True).sum()
#416
And now your aggregation is a normal groupby + mean; the column labels give the position in the array:
df1.groupby('id').mean()
#            0      1         2
# id
# 1   0.420000  0.215  0.200000
# 2   0.860000  0.635  0.295000
# 3   0.343333  0.290  0.263333
# 4   0.310000  0.315  0.100000
To group by mean for arrays, where the output is an array of element-wise mean values:
df['data'].map(np.array).groupby(df['id']).mean().reset_index()
Output:
id data
0 1 [0.42, 0.215, 0.2]
1 2 [0.86, 0.635, 0.29500000000000004]
2 3 [0.3433333333333333, 0.29, 0.26333333333333336]
3 4 [0.31, 0.315, 0.1]
You can always .apply the numpy mean.
df.groupby('id')['data'].apply(np.mean).apply(np.mean)
# returns:
id
1 0.278333
2 0.596667
3 0.298889
4 0.241667
Name: data, dtype: float64
I have this DF:
df = pd.DataFrame(data=[[-2.000000, -1.958010, 0.2],
                        [-1.958010, -1.916030, 0.4],
                        [-1.916030, -1.874040, 0.3],
                        [-1.874040, -1.832050, 0.6],
                        [-1.832050, -1.790070, 0.8],
                        [-1.790070, -1.748080, 0.2]],
                  columns=['egystart', 'egyend', 'fx'])
So I want to group by every two rows and get fx as the mean value of the two rows. egystart should be the egystart of the first row, and egyend should be the egyend of the second row.
In this case I should obtain:
-2.000000 -1.916030 0.3
-1.916030 -1.832050 0.45
-1.832050 -1.748080 0.5
So I have tried something like this:
df.groupby((df.egystart == df.egyend.shift(1)).cumsum()).agg({'egystart':min, 'egyend':max, 'fx':HERE_THE_MEAN_VALUE})
But it doesn't work.
You could try this to get the mean of fx every 2 rows:
result = df.groupby(np.arange(len(df))//2).mean()
print(result)
   egystart    egyend    fx
0 -1.979005 -1.937020  0.30
1 -1.895035 -1.853045  0.45
2 -1.811060 -1.769075  0.50
Consider the following MultiIndex pandas Series:
import pandas as pd
import numpy as np
val = np.array([ 0.4, -0.6, 0.6, 0.5, -0.4, 0.2, 0.6, 1.2, -0.4])
inds = [(-1000, 1921.6), (-1000, 1922.3), (-1000, 1923.0), (-500, 1921.6),
(-500, 1922.3), (-500, 1923.0), (-400, 1921.6), (-400, 1922.3),
(-400, 1923.0)]
names = ['pp_delay', 'wavenumber']
example = pd.Series(val)
example.index = pd.MultiIndex.from_tuples(inds, names=names)
example should now look like
pp_delay wavenumber
-1000 1921.6 0.4
1922.3 -0.6
1923.0 0.6
-500 1921.6 0.5
1922.3 -0.4
1923.0 0.2
-400 1921.6 0.6
1922.3 1.2
1923.0 -0.4
dtype: float64
I want to group example by pp_delay and select a range within each group using the wavenumber index and perform an operation on that subgroup. To clarify what I mean, I have a few examples.
Here is a position-based solution.
example.groupby(level="pp_delay").nth(list(range(1,3))).groupby(level="pp_delay").sum()
this gives
pp_delay
-1000 0.0
-500 -0.2
-400 0.8
dtype: float64
Now the last two elements of each pp_delay group have been summed.
An alternative, more straightforward solution is to loop over the groups:
delays = example.index.levels[0]
res = np.zeros(delays.shape)
roi = slice(1922, 1924)
for i in range(3):
    res[i] = example[delays[i]][roi].sum()
res
gives
array([ 0. , -0.2, 0.8])
Anyhow, I don't like it much either, because it doesn't fit well with the usual pandas style.
Now, what I would ideally want is something like:
example.groupby(level="pp_delay").loc[1922:1924].sum()
or maybe even something like
example[:, 1922:1924].sum()
But apparently pandas indexing doesn't work that way. Anybody got a better way?
Cheers
I'd skip the groupby
example.unstack(0).loc[1922:1924].sum()
pp_delay
-1000 0.0
-500 -0.2
-400 0.8
dtype: float64