I'm trying to wrap my head around Pandas groupby methods. I'd like to write a function that performs some aggregation and then returns a Pandas DataFrame. Here's a grossly simplified example using sum(); I know there are easier ways to do simple sums, but in real life my function is more complex:
import pandas as pd
df = pd.DataFrame({'col1': ['A', 'A', 'B', 'B'], 'col2':[1.0, 2, 3, 4]})
In [3]: df
Out[3]:
  col1  col2
0    A     1
1    A     2
2    B     3
3    B     4
def func2(df):
    dfout = pd.DataFrame({'col1': df['col1'].unique(),
                          'someData': sum(df['col2'])})
    return dfout
t = df.groupby('col1').apply(func2)
In [6]: t
Out[6]:
       col1  someData
col1
A    0    A         3
B    0    B         7
I did not expect to have col1 in there twice, nor did I expect that mystery index-looking thing. I really thought I would just get col1 & someData.
In my real life application I'm grouping by more than one column and really would like to get back a DataFrame and not a Series object.
Any ideas for a solution or an explanation on what Pandas is doing in my example above?
----- added info -----
I should have started with this example, I think:
In [13]: import pandas as pd
In [14]: df = pd.DataFrame({'col1':['A','A','A','B','B','B'], 'col2':['C','D','D','D','C','C'], 'col3':[.1,.2,.4,.6,.8,1]})
In [15]: df
Out[15]:
col1 col2 col3
0 A C 0.1
1 A D 0.2
2 A D 0.4
3 B D 0.6
4 B C 0.8
5 B C 1.0
In [16]: def func3(df):
   ....:     dfout = sum(df['col3']**2)
   ....:     return dfout
   ....:
In [17]: t = df.groupby(['col1', 'col2']).apply(func3)
In [18]: t
Out[18]:
col1  col2
A     C       0.01
      D       0.20
B     C       1.64
      D       0.36
In the above illustration the result of the apply() function is a Pandas Series, and it lacks the groupby columns from df.groupby. The essence of what I'm struggling with is: how do I create a function, applied to a groupby, which returns both the result of the function AND the columns on which it was grouped?
----- yet another update ------
It appears that if I then do this:
pd.DataFrame(t).reset_index()
I get back a dataframe which is really close to what I was after.
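For reference, reset_index on a Series also accepts a name for the values column, so an equivalent one-liner (assuming the func3 example above; 'sum_sq' is just a label I picked) is:

t = df.groupby(['col1', 'col2']).apply(func3)
out = t.reset_index(name='sum_sq')   # columns: col1, col2, sum_sq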
The reason you are seeing the columns with 0s is that the output of .unique() is an array.
The best way to understand how your apply is going to work is to inspect each action group-wise:
In [11]: g = df.groupby('col1')
In [12]: g.get_group('A')
Out[12]:
  col1  col2
0    A     1
1    A     2
In [13]: g.get_group('A')['col1'].unique()
Out[13]: array(['A'], dtype=object)
In [14]: sum(g.get_group('A')['col2'])
Out[14]: 3.0
The majority of the time you want this to be an aggregated value.
The output of grouped.apply will always have the group labels as an index (the unique values of 'col1'), so your example construction of col1 seems a little obtuse to me.
Note: to pop 'col1' (the index) back out to a column, you can call reset_index; in this case:
In [15]: g.sum().reset_index()
Out[15]:
col1 col2
0 A 3
1 B 7
Related
I am trying to find a better, more pythonic way of accomplishing the following:
I want to add a new column to business_df called 'dot_prod', which is the dot product of a fixed vector (fixed_vector) and a vector from another data frame (rating_df). The rows of both business_df and rating_df have the same index values (business_id).
I have this loop which appears to work; however, I know it's super clumsy (and takes forever). Essentially it loops through once for every row, calculates the dot product, then dumps it into the business_df dataframe.
n = 0
for i in range(business_df.shape[0]):
    dot_prod = np.dot(fixed_vector, rating_df.iloc[n])
    business_df['dot_prod'][n] = dot_prod
    n += 1
IIUC, you are looking for apply across axis=1 like:
business_df['dot_prod'] = rating_df.apply(lambda x: np.dot(fixed_vector, x), axis=1)
>>> fixed_vector = [1, 2, 3]
>>> df = pd.DataFrame({'col1' : [1,2], 'col2' : [3,4], 'col3' : [5,6]})
>>> df
col1 col2 col3
0 1 3 5
1 2 4 6
>>> df['col4'] = np.dot(fixed_vector, [df['col1'], df['col2'], df['col3']])
>>> df
col1 col2 col3 col4
0 1 3 5 22
1 2 4 6 28
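Since the dot product is fully vectorizable, here is a sketch of an alternative that skips apply entirely, assuming rating_df's columns line up with the entries of fixed_vector (the same assumption the apply version makes):

# one matrix-vector product instead of a Python-level call per row,
# which is typically much faster
business_df['dot_prod'] = rating_df.dot(fixed_vector)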
So I'm working with pandas in Python. I collect data indexed by timestamps in multiple ways.
This means one index can have two features available (and the others as NaN values, which is normal), or all features; it depends.
My problem arises when I add data with multiple values for the same index. See the example below.
Imagine this is the set to which we're adding new data:
Index col1 col2
1 a A
2 b B
3 c C
This is the data we will add:
Index new col
1 z
1 y
Then the result is this:
Index col1 col2 new col
1 a A NaN
1 NaN NaN z
1 NaN NaN y
2 b B NaN
3 c C NaN
So instead of that, I would like the result to be :
Index col1 col2 new col1 new col2
1 a A z y
2 b B NaN NaN
3 c C NaN NaN
Instead of having multiple index entries for one feature, I want one index entry carrying multiple features.
Another way to put it: I want the number of values per timestamp to equal the number of features, rather than the number of index entries.
This solution assumes the data that you need to add is a series.
Original df:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 3, size=(3, 3)), columns=list('ABC'), index=[1, 2, 3])
Data to add (series):
s = pd.Series(['x','y'],index = [1,1])
Solution:
df.join(s.to_frame()
         .assign(cc=lambda x: x.groupby(level=0).cumcount().add(1))
         .set_index('cc', append=True)[0]
         .unstack()
         .rename('New Col{}'.format, axis=1))
Output:
A B C New Col1 New Col2
1 1 2 2 x y
2 0 1 2 NaN NaN
3 2 2 0 NaN NaN
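For readability, here is the same chain unrolled step by step (a sketch; the intermediate names are mine):

tmp = s.to_frame()                            # Series -> one-column frame; the column is named 0
tmp = tmp.assign(cc=lambda x: x.groupby(level=0).cumcount().add(1))   # number duplicates per index: 1, 2, ...
tmp = tmp.set_index('cc', append=True)[0]     # Series with MultiIndex (index, cc)
tmp = tmp.unstack()                           # one column per cc value
tmp = tmp.rename('New Col{}'.format, axis=1)  # 1 -> 'New Col1', 2 -> 'New Col2'
out = df.join(tmp)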
Alternative answer (maybe more simplistic, probably less pythonic). I think you need to look at converting wide data to long data and back again in general (pivot and transpose are good things to look up for this), but I also think there are some possible problems in your question: you don't mention 'new col1' and 'new col2' in the declaration of the subsequent arrays.
Here's my declarations of your data frames:
d = {'index': [1, 2, 3],'col1': ['a', 'b', 'c'], 'col2': ['A', 'B', 'C']}
df = pd.DataFrame(data=d)
e1 = {'index': [1], 'new col1': ['z']}
dfe1 = pd.DataFrame(data=e1)
e2 = {'index': [1], 'new col2': ['y']}
dfe2 = pd.DataFrame(data=e2)
They look like this:
index new col1
1 z
and this:
index new col2
1 y
Notice that I declare your new columns as part of the data frames. Once they're declared like that, it's just a matter of merging:
dfr1 = pd.merge(df, dfe1, on='index', how="outer")
dfr2 = pd.merge(dfr1, dfe2, on='index', how="outer")
And the output looks like this:
index col1 col2 new col1 new col2
1 a A z y
2 b B NaN NaN
3 c C NaN NaN
I think one problem may arise in the way you first create your second data frame.
Actually, expanding the number of features based on the content is what makes this reshaping a bit annoying here (as you could see for yourself when writing two new column names on the bare assumption that this reflects the number of features observed at every timestamp).
Here is yet another solution, which tries to be a bit more explicit about the steps taken than rhug123's answer.
# Initial dataFrames
a = pd.DataFrame({'col1':['a', 'b', 'c'], 'col2':['A', 'B', 'C']}, index=range(1, 4))
b = pd.DataFrame({'new col':['z', 'y']}, index=[1, 1])
Now the only important step is basically transposing your second DataFrame, although here you also need to introduce two new column names.
We will do this by grouping the second DataFrame by its index and collecting its content (z, y, ...):
# group by index; all features per timestamp end up in one list
c = b.groupby(b.index)['new col'].apply(list)
# new column names, one per observed feature slot
cols = ['New col%d' % (k + 1) for k in range(b.value_counts().sum())]
# expand dataframe "c" into one column per feature
d = pd.DataFrame(c.to_list(), index=b.index.unique(), columns=cols)
# merge
a.join(d, how='outer')
Output:
col1 col2 New col1 New col2
1 a A z y
2 b B NaN NaN
3 c C NaN NaN
Finally, one problem with both my answer and the one from rhug123 is that, as written, they won't correctly handle another feature arriving at a different timestamp. I am not sure what the OP expects here.
For example if b is:
new col
1 z
1 y
2 x
The merged output will be:
col1 col2 New col1 New col2
1 a A z y
2 b B x None
3 c C NaN NaN
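A sketch of a variant that copes with that case, numbering features within each timestamp so the column count follows the maximum number of features per timestamp:

key = b.groupby(level=0).cumcount().add(1)   # 1, 2, ... within each timestamp
wide = b.set_index(key, append=True)['new col'].unstack()
wide.columns = ['New col%d' % k for k in wide.columns]
a.join(wide, how='outer')   # index 2 now gets 'x' in 'New col1' and NaN in 'New col2'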
I am trying to reassign multiple columns in DataFrame with modifications.
The below is a simplified example.
import pandas as pd
d = {'col1':[1,2], 'col2':[3,4]}
df = pd.DataFrame(d)
print(df)
col1 col2
0 1 3
1 2 4
I use assign() method to add 1 to both 'col1' and 'col2'.
However, the result is to add 1 only to 'col2' and copy the result to 'col1'.
df2 = df.assign(**{c: lambda x: x[c] + 1 for c in ['col1','col2']})
print(df2)
col1 col2
0 4 4
1 5 5
Can someone explain why this is happening, and also suggest a correct way to apply assign() to multiple columns?
The problem is late binding: each lambda captures the variable c itself, not its value, so when assign finally calls them, every lambda sees the last value, 'col2'. Since nothing here needs to be lazy, you can drop the lambdas and build the values directly:
df.assign(**{c: df[c] + 1 for c in ['col1','col2']})
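If you do want to keep the callable form (say, inside a method chain where the intermediate frame has no name), binding the loop variable as a default argument freezes its value per iteration. A sketch:

# c=c captures the current value of c; assign still calls each lambda
# with the DataFrame as its only positional argument
df2 = df.assign(**{c: lambda x, c=c: x[c] + 1 for c in ['col1', 'col2']})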
My question is about the pandas.DataFrame.filter command. It seems that pandas creates a copy of the data frame when writing any changes. How can I write to the data frame itself?
In other words:
d = {'col1': [1, 2], 'col2': [3, 4]}
df = pd.DataFrame(data=d)
df.filter(regex='col1').iloc[0]=10
Output:
col1 col2
0 1 3
1 2 4
Desired Output:
col1 col2
0 10 3
1 2 4
I think you need to extract the column names and then use the loc or iloc functions:
cols = df.filter(regex='col1').columns
df.loc[0, cols]=10
Or:
df.iloc[0, df.columns.get_indexer(cols)] = 10
print (df)
col1 col2
0 10 3
1 2 4
You cannot use the filter function for this, because it returns a subset whose data may be a copy rather than a view into the original, so writing to it does not write through to df. That's why SettingWithCopyWarning is possible there (or an error is raised, if you set the option).
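A one-liner variant of the same idea, indexing df directly with a boolean mask over the column names (str.contains does a regex search, much like filter(regex=...)):

# the assignment writes through because it happens on df itself, not on a filtered copy
df.loc[0, df.columns.str.contains('col1')] = 10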
I was wondering how to calculate the number of unique symbols that occur in a single column in a dataframe. For example:
df = pd.DataFrame({'col1': ['a', 'bbb', 'cc', ''], 'col2': ['ddd', 'eeeee', 'ff', 'ggggggg']})
df
  col1     col2
0    a      ddd
1  bbb    eeeee
2   cc       ff
3        ggggggg
It should calculate that col1 contains 3 unique symbols, and col2 contains 4 unique symbols.
My code so far (but this might be wrong):
unique_symbols = [0]*203
i = 0
for col in df.columns:
    observed_symbols = []
    df_temp = df[[col]]
    df_temp = df_temp.astype('str')
    # This part is where I am not so sure
    for index, row in df_temp.iterrows():
        for symbol in row[col]:
            if symbol not in observed_symbols:
                observed_symbols.append(symbol)
    unique_symbols[i] = len(observed_symbols)
    i += 1
Thanks in advance
Option 1
str.join + set inside a dict comprehension
For problems like this, I'd prefer falling back to python, because it's so much faster.
{c : len(set(''.join(df[c]))) for c in df.columns}
{'col1': 3, 'col2': 4}
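One caveat: ''.join assumes every cell is already a string; if a column can hold numbers or NaN, cast first. A sketch:

{c: len(set(''.join(df[c].astype(str)))) for c in df.columns}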
Option 2
agg
If you want to stay in pandas space.
df.agg(lambda x: set(''.join(x)), axis=0).str.len()
Or,
df.agg(lambda x: len(set(''.join(x))), axis=0)
col1 3
col2 4
dtype: int64
Here is one way:
df.apply(lambda x: len(set(''.join(x.astype(str)))))
col1 3
col2 4
Maybe
df.sum().apply(set).str.len()
Out[673]:
col1 3
col2 4
dtype: int64
One more option (note that this counts unique symbols per cell and sums the counts, so it agrees with the other answers only when no symbol appears in more than one cell of the same column):
In [38]: df.applymap(lambda x: len(set(x))).sum()
Out[38]:
col1 3
col2 4
dtype: int64