Could anybody help me fill missing values with the most common value, but in grouped form? Here I want to fill the missing values of the cylinders column using rows with the same car model.
I tried this:
sh_cars['cylinders']=sh_cars['cylinders'].fillna(sh_cars.groupby('model')['cylinders'].agg(pd.Series.mode))
and other variants, but I got error messages every time.
Thanks in advance.
I think the problem is that some (or all) groups contain only NaNs, so mode() returns an empty Series and the error is raised.
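You can see the empty-mode behaviour directly (a minimal sketch):

import numpy as np
import pandas as pd

# mode() drops NaN by default, so an all-NaN group yields an empty
# Series, and positional access such as .iat[0] then raises IndexError
print(pd.Series([np.nan, np.nan]).mode())  # -> Series([], dtype: float64)

A possible solution is a custom function with GroupBy.transform, which returns a Series with the same size as the original DataFrame: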
import numpy as np
import pandas as pd

data = {'model': ['a', 'a', 'a', 'a', 'b', 'b', 'a'],
        'cylinders': [2, 9, 9, np.nan, np.nan, np.nan, np.nan]}
sh_cars = pd.DataFrame(data)

# per-group mode, or NaN when the whole group is NaN
f = lambda x: x.mode().iat[0] if x.notna().any() else np.nan
s = sh_cars.groupby('model')['cylinders'].transform(f)
sh_cars['new'] = sh_cars['cylinders'].fillna(s)
print(sh_cars)
model cylinders new
0 a 2.0 2.0
1 a 9.0 9.0
2 a 9.0 9.0
3 a NaN 9.0
4 b NaN NaN
5 b NaN NaN
6 a NaN 9.0
To replace the original column:
f = lambda x: x.mode().iat[0] if x.notna().any() else np.nan
s = sh_cars.groupby('model')['cylinders'].transform(f)
sh_cars['cylinders'] = sh_cars['cylinders'].fillna(s)
print(sh_cars)
model cylinders
0 a 2.0
1 a 9.0
2 a 9.0
3 a 9.0
4 b NaN
5 b NaN
6 a 9.0
Problem
I would like to have a more systematic way to aggregate the frequencies for multiple frequency intervals.
The following dataframe contains synthetic time-frequency data. Its column index has the following levels:
conditions
channels
frequencies
The code to generate the dataframe is as follows:
import numpy as np
import pandas as pd
pidx = pd.IndexSlice
D = np.zeros((32, 2, 2, 6))  # timepoints, conditions, channels, frequencies
for i in range(6):
    D[:, 0, 0, i] = np.arange(i, i + 32, 1)          # C0, ch01
    D[:, 0, 1, i] = np.arange(i + 1, i + 32 + 1, 1)  # C0, ch02
    D[:, 1, 0, i] = np.arange(i + 2, i + 32 + 2, 1)  # C1, ch01
    D[:, 1, 1, i] = np.arange(i + 3, i + 32 + 3, 1)  # C1, ch02

conditions = ['C0', 'C1']
channels = ["ch{:02}".format(i) for i in np.arange(1, 3)]
frequencies = np.arange(1, 7)

# columns multi-index
cidx = pd.MultiIndex.from_product([conditions, channels, frequencies])

# reshape to 2D
D = D.reshape((D.shape[0], -1))

# create DataFrame
df = pd.DataFrame(D, columns=cidx)
Current solution
Currently I do the following:
fbands = {
    'fb1': [pidx[1:3]],
    'fb2': [pidx[2:5]],
    'fb3': [pidx[4:6]]
}

def frequencyband_mean(df, fb):
    return df.loc(axis=1)[:, :, fb].groupby(axis=1, level=[0, 1]).mean()

dffbands = dict((k, frequencyband_mean(df, fbands[k])) for k in fbands)
df_result = pd.concat(dffbands, axis=1)
However, with this code the column-index levels are not maintained; more specifically, the first level of df_result contains the name of every frequency interval defined in fbands. I could solve this by swapping the column levels (see the sketch below), but that seems cumbersome.
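(For reference, the swap I have in mind would be a sketch like this, reusing df_result from the code above:)

# pd.concat inserts the interval names as the first column level;
# move that level to the end: conditions / channels / interval name
df_result = df_result.reorder_levels([1, 2, 0], axis=1).sort_index(axis=1)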
Question
I would like to know whether there is a more systematic way to apply an aggregation function to multiple frequency intervals in one go, while maintaining the column-index levels. Eventually the column index should have the following levels:
conditions
channels
frequency interval names (e.g. fb1, fb2, fb3)
If I understood you correctly, I'd do it like this:
fbands = {
    'fb1': [0, 3],
    'fb2': [2, 5],
    'fb3': [4, 6]
}

for co_i in df.columns.levels[0]:
    for cha_i in df.columns.levels[1]:
        for k, v in fbands.items():
            df[co_i, cha_i, k] = df[co_i, cha_i].T[v[0]:v[1]].mean()
Update: Note that the slice here is not label-based, hence you would actually need v[0]-1:v[1]; to make this clearer, I'd suggest you simplify your df:
D = np.zeros((32, 2, 2, 6))
for i in range(6):
    D[:, 0, 0, i] = np.arange(i, i + 32, 1)          # C0, ch01
    D[:, 0, 1, i] = np.arange(i + 1, i + 32 + 1, 1)  # C0, ch02
    D[:, 1, 0, i] = np.arange(i + 2, i + 32 + 2, 1)  # C1, ch01
    D[:, 1, 1, i] = np.arange(i + 3, i + 32 + 3, 1)  # C1, ch02
such that df.head(3) returns:
C0 C1
ch01 ch02 ch01 ch02
1 2 3 4 5 6 1 2 3 4 5 6 1 2 3 4 5 6 1 2 3 4 5 6
0 0.0 1.0 2.0 3.0 4.0 5.0 1.0 2.0 3.0 4.0 5.0 6.0 2.0 3.0 4.0 5.0 6.0 7.0 3.0 4.0 5.0 6.0 7.0 8.0
1 1.0 2.0 3.0 4.0 5.0 6.0 2.0 3.0 4.0 5.0 6.0 7.0 3.0 4.0 5.0 6.0 7.0 8.0 4.0 5.0 6.0 7.0 8.0 9.0
2 2.0 3.0 4.0 5.0 6.0 7.0 3.0 4.0 5.0 6.0 7.0 8.0 4.0 5.0 6.0 7.0 8.0 9.0 5.0 6.0 7.0 8.0 9.0 10.0
This way, we can actually verify our expectations! I am now using fbands as a list rather than a dict, so that the ordering stays nice (an OrderedDict from collections would also work).
fbands = [
    ['fb1', [1, 3]],
    ['fb2', [2, 5]],
    ['fb3', [4, 6]]
]

for co_i in df.columns.levels[0]:
    for cha_i in df.columns.levels[1]:
        for fi in range(len(fbands)):
            k = fbands[fi][0]
            v = fbands[fi][1]
            # the slice is positional, hence the v[0]-1 correction
            df[co_i, cha_i, k] = df[co_i, cha_i].T[v[0] - 1:v[1]].mean()

# drop the original single-frequency columns (labels 1..6)
for i in range(1, 7):
    df = df.drop(i, axis=1, level=2)

print(df.head(3))
returns:
C0 C1
ch01 ch02 ch01 ch02
fb1 fb2 fb3 fb1 fb2 fb3 fb1 fb2 fb3 fb1 fb2 fb3
0 1.0 2.5 4.0 2.0 3.5 5.0 3.0 4.5 6.0 4.0 5.5 7.0
1 2.0 3.5 5.0 3.0 4.5 6.0 4.0 5.5 7.0 5.0 6.5 8.0
2 3.0 4.5 6.0 4.0 5.5 7.0 5.0 6.5 8.0 6.0 7.5 9.0
Now the fb* columns actually reflect the means of frequencies fb1: [1,2,3], fb2: [2,3,4,5] and fb3: [4,5,6], as I hope you intended.
Update 2:
Note that if you set up your frequencies like this instead:
frequencies = ["f{0}".format(i) for i in np.arange(1,7)]
then you could e.g. create the means of frequencies 'f1','f2','f3' in ch01 within C0 like this:
df['C0','ch01','fb1'] = df.loc(axis=1)[pd.IndexSlice['C0','ch01',['f1','f2','f3']]].mean(axis=1)
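Looping that over all bands would then be something like this sketch (assuming df was rebuilt with the string frequencies above, and reusing the fbands names, now as label lists):

fbands = {
    'fb1': ['f1', 'f2', 'f3'],
    'fb2': ['f2', 'f3', 'f4', 'f5'],
    'fb3': ['f4', 'f5', 'f6']
}
for co_i in df.columns.levels[0]:
    for cha_i in df.columns.levels[1]:
        for k, fs in fbands.items():
            # label-based selection, so no off-by-one correction needed
            df[co_i, cha_i, k] = df.loc(axis=1)[pd.IndexSlice[co_i, cha_i, fs]].mean(axis=1)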
I'm guessing you are grouping the frequencies in groups of two. If so, try:
# it's more convenient to groupby over rows than over columns
data = df.T.reset_index()
data.rename(columns={'level_0': 'condition',
                     'level_1': 'channel',
                     'level_2': 'frequency'},
            inplace=True)

# groupby and compute the mean;
# review your frequency grouping here and change the
# mapping frequency -> frequency_band_group as needed
new_df = data.groupby(['condition', 'channel', (data.frequency - 1) // 2]).mean()
new_df.drop('frequency', axis=1, inplace=True)

# change the name of the frequency index level
new_df.index.rename('frequency_band', level=2, inplace=True)

# change the labels of the frequency bands
new_df.index.set_levels([conditions, channels, ['fb1', 'fb2', 'fb3']], inplace=True)

# transpose back to get the multi-level columns:
new_df.T
I have the following csv:
value value value value ...
id 1 1 1 2
indic 1 2 3 1
valuedate
05/01/1970 1.0 2.0 3.2 5.2
06/01/1970 4.1 ...
07/01/1970
08/01/1970
that I want to read into a pandas DataFrame, so I do the following:
df=pd.read_csv("mycsv.csv", skipinitialspace=True, tupleize_cols=True)
but get the following error:
IndexError: Too many levels: Index has only 1 level, not 2
I suspect there might be an error with the multi-indexing, but I don't understand how to use the parameters of read_csv to solve this.
(NB: valuedate is the name of the index column)
I want to get this data into a DataFrame with multi-level columns: several indic sub-columns under each id column.
file.csv:
value value value value
id 1 1 1 2
indic 1 2 3 1
valuedate
05/01/1970 1.0 2.0 3.2 5.2
Do:
import pandas as pd

df = pd.read_csv("file.csv", index_col=0, delim_whitespace=True)
print(df)
Output:
value value.1 value.2 value.3
id 1.0 1.0 1.0 2.0
indic 1.0 2.0 3.0 1.0
valuedate NaN NaN NaN NaN
05/01/1970 1.0 2.0 3.2 5.2
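To actually get multi-level columns, one option is to pass several header rows to read_csv. Below is a sketch, assuming the file can be rewritten in pandas' comma-separated round-trip format, i.e. with the level names id/indic in the first column of the header rows and the index name valuedate on its own line:

import io

import pandas as pd

# hypothetical comma-separated equivalent of the file above
csv = io.StringIO(
    ",value,value,value,value\n"
    "id,1,1,1,2\n"
    "indic,1,2,3,1\n"
    "valuedate,,,,\n"
    "05/01/1970,1.0,2.0,3.2,5.2\n"
)

# the three header rows build a three-level column MultiIndex;
# the 'valuedate' row is picked up as the name of the index
df = pd.read_csv(csv, header=[0, 1, 2], index_col=0)
print(df)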
I want to create a plot from a large Pandas dataframe. The data is in the following format
Type Number ...unimportant additional columns
Foo 13 ...
Foo 25 ...
Foo 56 ...
Foo 56 ...
Bar 10 ...
Bar 10 ...
Bar 11 ...
Bar 23 ...
I need to count the number of elements from column 'Number' in a sliding window from x to x+i to determine the number of values falling in each sliding window bucket.
For example, with a window size of i=10, starting at x=0 and incrementing x by 1 each step, a correct result for the example above would be:
Foo Bar
0 0 2 #(0-10)
1 0 3 #(1-11)
2 0 3 #(2-12)
3 1 3 #(3-13)
4 1 3 #(4-14)
.
.
.
20 1 1 #(13-23)
21 0 1 #(14-24)
22 1 1 #(15-25)
.
.
.
The result would have roughly df['Number'].max() - [window length] rows, and one column per unique value of 'Type'.
Toy code to generate a similar dataframe might be the following:
import pandas as pd
import numpy as np
str_arr = ['Foo','Bar','Python','PleaseHelp']
data1 = np.matrix(np.random.choice(str_arr, 100, p=[0.5, 0.1, 0.1, 0.3])).T
data2 = np.random.randint(100, size=(100,1))
merge = np.concatenate((data1,data2), axis=1)
df = pd.DataFrame(merge, index=range(100), columns=['Type','Number'])
df.sort_values(['Type','Number'], ascending=[True,True], inplace=True)
df = df.reset_index(drop=True)
How can I generate such a list efficiently?
Edit note: Thanks to FLab, who answered this earlier, before I clarified the question.
Here is my proposed solution.
For convenience, let's force 'Number' column to be an int.
df['Number'] = df['Number'].astype(int)
Define all possible ranges:
len_wdw = 10
all_ranges = [(i, i+len_wdw) for i in range(df['Number'].max()-len_wdw)]
And now check how many observations there are for "Number" in each of this ranges:
def get_mask(df, rg):
    # rg is a range, e.g. (10, 20)
    return (df['Number'] >= rg[0]) & (df['Number'] <= rg[1])

result = pd.concat({rg[0]: df[get_mask(df, rg)].groupby('Type').count()['Number']
                    for rg in all_ranges},
                   axis=1).fillna(0).T
For the randomly generated numbers, this gives:
Bar Foo PleaseHelp Python
0 1.0 4.0 3.0 1.0
1 1.0 5.0 2.0 1.0
2 1.0 5.0 3.0 1.0
3 1.0 4.0 3.0 0.0
4 1.0 3.0 3.0 1.0
.....
85 2.0 3.0 4.0 1.0
86 1.0 3.0 3.0 1.0
87 1.0 4.0 3.0 1.0
88 1.0 4.0 4.0 1.0
89 1.0 3.0 5.0 1.0
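A more vectorised alternative (a sketch, not part of the original answer; it reuses df and len_wdw from above and reproduces get_mask's inclusive bounds) builds per-value counts with crosstab and then takes a rolling sum:

import pandas as pd

# per-value counts: one row per observed 'Number', one column per 'Type'
counts = pd.crosstab(df['Number'], df['Type'])

# fill in the values that never occur, so the rolling window is contiguous
counts = counts.reindex(range(df['Number'].max() + 1), fill_value=0)

# rolling sum over len_wdw + 1 consecutive values (both bounds inclusive,
# as in get_mask), shifted so that row x covers the range [x, x + len_wdw]
result2 = (counts.rolling(len_wdw + 1).sum()
                 .shift(-len_wdw)
                 .dropna()
                 .astype(int))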
I am querying a database and populating a pandas dataframe. I am struggling to aggregate the data (via groupby) and then manipulate the dataframe index such that the dates in the table become the index.
Here is an example of what the data looks like before the groupby, and what I am ultimately looking for.
dataframe - populated data
firm | dates | received | Sent
-----------------------------------------
A 10/08/2016 2 8
A 12/08/2016 4 2
B 10/08/2016 1 0
B 11/08/2016 3 5
A 13/08/2016 5 1
C 14/08/2016 7 3
B 14/08/2016 2 5
First I want to group by "firm" and "dates" on "received"/"sent".
Then manipulate the DataFrame such that the dates become the column index rather than the row index.
Finally, add totals for each day.
Some of the firms have no 'activity' during some days, or at least no activity in either received or sent. However, since I want a view of the past X days, empty values aren't acceptable; I need to fill in a zero instead.
dates | 10/08/2016 | 11/08/2016| 12/08/2016| 13/08/2016| 14/08/2016
firm |
----------------------------------------------------------------------
A received 2 0 4 5 0
sent 8 0 2 1 0
B received 1 3 1 0 2
sent 0 5 0 0 5
C received 0 0 2 0 1
sent 0 0 1 2 0
Totals r. 3 3 7 5 3
Totals s. 8 0 3 3 5
I've tried the following code:
df = > mysql query result
n_received = df.groupby(["firm", "dates"
]).received.size()
n_sent = df.groupby(["firm", "dates"
]).sent.size()
tables = pd.DataFrame({ 'received': n_received, 'sent': n_sent,
},
columns=['received','sent'])
this = pd.melt(tables,
id_vars=['dates',
'firm',
'received', 'sent']
this = this.set_index(['dates',
'firm',
'received', 'sent'
'var'
])
this = this.unstack('dates').fillna(0)
this.columns = this.columns.droplevel()
this.columns.name = ''
this = this.transpose()
Basically, I am not getting to the result I want based on this code.
- How can I achieve this?
- Conceptually, is there a better way of achieving this result, say by aggregating in the SQL statement, or does the aggregation make more sense in Pandas, both from an optimisation point of view and logically?
You can use unstack (or stack) to reshape data from long to wide (or wide to long) format:
import pandas as pd

# calculate the total received and sent, grouped by dates
df1 = df.drop('firm', axis=1).groupby('dates').sum().reset_index()

# add the total category as the firm column
df1['firm'] = 'total'

# concatenate the summary data frame and the original data frame, then use
# stack and unstack so that dates appear as columns while received and sent
# are stacked as rows
pd.concat([df, df1]).set_index(['firm', 'dates']).stack().unstack(level=1).fillna(0)
# dates 10/08/2016 11/08/2016 12/08/2016 13/08/2016 14/08/2016
# firm
# A Sent 8.0 0.0 2.0 1.0 0.0
# received 2.0 0.0 4.0 5.0 0.0
# B Sent 0.0 5.0 0.0 0.0 5.0
# received 1.0 3.0 0.0 0.0 2.0
# C Sent 0.0 0.0 0.0 0.0 3.0
# received 0.0 0.0 0.0 0.0 7.0
# total Sent 8.0 5.0 2.0 1.0 8.0
# received 3.0 3.0 4.0 5.0 9.0
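For reference, a minimal reconstruction of the example table from the question, so the snippet above runs end to end (column names as in the question, including the capitalised 'Sent'):

import pandas as pd

df = pd.DataFrame({
    'firm': ['A', 'A', 'B', 'B', 'A', 'C', 'B'],
    'dates': ['10/08/2016', '12/08/2016', '10/08/2016', '11/08/2016',
              '13/08/2016', '14/08/2016', '14/08/2016'],
    'received': [2, 4, 1, 3, 5, 7, 2],
    'Sent': [8, 2, 0, 5, 1, 3, 5],
})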