Extracting a column from every Frame in a Panel - python

I have a Panel containing some DataFrames, all of which have a column named 'N0'. I'd like to get an array containing the mean of N0 for every frame. I managed with this:
[np.mean(data.minor_xs('N0')[g]) for g in data]
But it seems too cumbersome. Isn't there a cleaner way to extract the N0 columns, like data['N0']?
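As an aside, a minimal sketch of a more direct route (assuming the same legacy Panel API the question already uses): minor_xs('N0') returns a DataFrame with one 'N0' column per item, so a single .mean() yields the per-item means.
n0_means = data.minor_xs('N0').mean()  # Series indexed by item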

You could use pd.Panel.apply (see docs), illustrated here with random sample data:
df1 = pd.DataFrame(np.random.randn(4, 2), columns=['A', 'B'])
df2 = pd.DataFrame(np.random.randn(4, 2), columns=['A', 'B'])
data = {'Item1': df1, 'Item2': df2}
df = pd.Panel(data)

>>> df
<class 'pandas.core.panel.Panel'>
Dimensions: 2 (items) x 4 (major_axis) x 2 (minor_axis)
Items axis: Item1 to Item2
Major_axis axis: 0 to 3
Minor_axis axis: A to B
which, converted to a DataFrame (e.g. via .to_frame()), looks as follows:
                Item1     Item2
major minor
0     A     -0.572396  0.515488
      B      0.796982  0.726253
1     A      0.345817 -0.330810
      B     -2.516973  1.833602
2     A     -2.140583 -1.050717
      B      1.302233 -1.391122
3     A     -0.088435 -0.041199
      B      0.521575  0.618990
Using .apply() as below gives the mean of each column per DataFrame; the sample shows how to select only B.
df.apply(np.mean, axis='major').loc['B']
Item1 0.025954
Item2 0.446931
Name: B, dtype: float64
Using a MultiIndex DataFrame instead might be simpler, since it is better documented and seems to be the more common use case.
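For illustration, a minimal sketch of the same per-item mean with a MultiIndex DataFrame (an assumed equivalent setup built with pd.concat, not part of the original answer):
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.random.randn(4, 2), columns=['A', 'B'])
df2 = pd.DataFrame(np.random.randn(4, 2), columns=['A', 'B'])

# Stack the frames into one DataFrame with an (item, row) MultiIndex
stacked = pd.concat({'Item1': df1, 'Item2': df2})

# Mean of column 'B' per item, matching the Panel route above
stacked['B'].groupby(level=0).mean()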

Related

Creating a New Column in a Pandas Dataframe in a more pythonic way

I am trying to find a better, more pythonic way of accomplishing the following:
I want to add a new column to business_df called 'dot_prod', which is the dot product of a fixed vector (fixed_vector) and a vector from another data frame (rating_df). The rows of both business_df and rating_df have the same index values (business_id).
I have this loop which appears to work, however I know it's super clumsy (and takes forever). Essentially it loops through once for every row, calculates the dot product, then dumps it into the business_df dataframe.
n = 0
for i in range(business_df.shape[0]):
    dot_prod = np.dot(fixed_vector, rating_df.iloc[n])
    business_df['dot_prod'][n] = dot_prod
    n += 1
IIUC, you are looking for apply across axis=1 like:
business_df['dot_prod'] = rating_df.apply(lambda x: np.dot(fixed_vector, x), axis=1)
>>> fixed_vector = [1, 2, 3]
>>> df = pd.DataFrame({'col1' : [1,2], 'col2' : [3,4], 'col3' : [5,6]})
>>> df
col1 col2 col3
0 1 3 5
1 2 4 6
>>> df['col4'] = np.dot(fixed_vector, [df['col1'], df['col2'], df['col3']])
>>> df
col1 col2 col3 col4
0 1 3 5 22
1 2 4 6 28
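As a fully vectorized variant (a sketch assuming rating_df's rows align with business_df's rows and fixed_vector has one entry per rating column), the whole loop collapses to a single matrix-vector product:
import numpy as np

# Computes every row's dot product at once, with no Python-level loop
business_df['dot_prod'] = rating_df.values @ np.asarray(fixed_vector)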

Reshaping DataFrame with pandas

So I'm working with pandas in Python. I collect data indexed by timestamps, in multiple ways.
This means one index can have 2 features available (with the others as NaN values, which is normal) or all features; it depends.
My problem arises when I add data with multiple values for the same indices; see the example below.
Imagine this is the set to which we're adding new data:
Index col1 col2
1 a A
2 b B
3 c C
This is the data we will add:
Index new col
1 z
1 y
Then the result is this :
Index col1 col2 new col
1 a A NaN
1 NaN NaN z
1 NaN NaN y
2 b B NaN
3 c C NaN
So instead of that, I would like the result to be :
Index col1 col2 new col1 new col2
1 a A z y
2 b B NaN NaN
3 c C NaN NaN
Instead of having multiple indices for one feature, I want one index holding multiple features.
I don't know if this is understandable. Another way to say it: I want the number of values per timestamp to equal the number of features, instead of the number of indices.
This solution assumes the data that you need to add is a series.
Original df:
df = pd.DataFrame(np.random.randint(0,3,size=(3,3)),columns = list('ABC'),index = [1,2,3])
Data to add (series):
s = pd.Series(['x','y'],index = [1,1])
Solution:
df.join(s.to_frame()
         .assign(cc=lambda x: x.groupby(level=0)
                               .cumcount().add(1))   # number the features within each index
         .set_index('cc', append=True)[0]
         .unstack()                                  # one column per feature number
         .rename('New Col{}'.format, axis=1))
Output:
A B C New Col1 New Col2
1 1 2 2 x y
2 0 1 2 NaN NaN
3 2 2 0 NaN NaN
Alternative answer (maybe more simplistic, probably less pythonic). I think you need to look at converting wide data to long data and back again in general (pivot and transpose are good things to look up for this), but I also think there are some possible problems in your question: you don't mention new col1 and new col2 in the declarations of the subsequent arrays.
Here's my declarations of your data frames:
d = {'index': [1, 2, 3],'col1': ['a', 'b', 'c'], 'col2': ['A', 'B', 'C']}
df = pd.DataFrame(data=d)
e1 = {'index': [1], 'new col1': ['z']}
dfe1 = pd.DataFrame(data=e1)
e2 = {'index': [1], 'new col2': ['y']}
dfe2 = pd.DataFrame(data=e2)
They look like this:
index new col1
1 z
and this:
index new col2
1 y
Notice that I declare your new columns as part of the data frames. Once they're declared like that, it's just a matter of merging:
dfr1 = pd.merge(df, dfe1, on='index', how="outer")
dfr2 = pd.merge(dfr1, dfe2, on='index', how="outer")
And the output looks like this:
index col1 col2 new col1 new col2
1 a A z y
2 b B NaN NaN
3 c C NaN NaN
I think one problem may arise in the way you first create your second data frame.
Actually, expanding the number of features depending on the content is what makes this reformatting a bit annoying here (as you could see for yourself when writing two new column names out of the bare assumption that this reflects the number of features observed at each timestamp).
Here is yet another solution; it tries to be a bit more explicit about the steps taken than rhug123's answer.
# Initial dataFrames
a = pd.DataFrame({'col1':['a', 'b', 'c'], 'col2':['A', 'B', 'C']}, index=range(1, 4))
b = pd.DataFrame({'new col':['z', 'y']}, index=[1, 1])
Now the only important step is basically transposing your second DataFrame, while also introducing the two new column names.
We do this by grouping the second dataframe by its index, so that all features of a timestamp end up in one list:
c = b.groupby(b.index)['new col'].apply(list)  # one row per timestamp, features grouped in a list

# New column names, one per feature in the largest group
cols = ['New col%d' % (k + 1) for k in range(max(len(v) for v in c))]

# Expand dataframe "c" into one column per feature
d = pd.DataFrame(c.to_list(), index=c.index, columns=cols)
# Merge
a.join(d, how='outer')
Output:
col1 col2 New col1 New col2
1 a A z y
2 b B NaN NaN
3 c C NaN NaN
Finally, one problem with both my answer and the one from rhug123 is that, as written, they won't deal correctly with another feature appearing at a different timestamp. Not sure what the OP expects here.
For example, if b is:
  new col
1       z
1       y
2       x
The merged output will be:
col1 col2 New col1 New col2
1 a A z y
2 b B x None
3 c C NaN NaN
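For reference, a minimal sketch (under the same assumed data, including the extra timestamp) that numbers the features within each timestamp and pivots, so a feature at any timestamp lands in the right column:
import pandas as pd

a = pd.DataFrame({'col1': ['a', 'b', 'c'], 'col2': ['A', 'B', 'C']}, index=range(1, 4))
b = pd.DataFrame({'new col': ['z', 'y', 'x']}, index=[1, 1, 2])

# Number each feature within its timestamp: 1, 2, 1, ...
b = b.assign(k=b.groupby(level=0).cumcount() + 1)

# One column per feature number, then join back onto the main frame
wide = b.pivot(columns='k', values='new col').add_prefix('New col')
a.join(wide)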

Element-wise average and standard deviation across multiple dataframes

Data:
Multiple dataframes of the same format (same columns, an equal number of rows, and no points missing).
How do I create a "summary" dataframe that contains an element-wise mean for every element? How about a dataframe that contains an element-wise standard deviation?
A B C
0 -1.624722 -1.160731 0.016726
1 -1.565694 0.989333 1.040820
2 -0.484945 0.718596 -0.180779
3 0.388798 -0.997036 1.211787
4 -0.249211 1.604280 -1.100980
5 0.062425 0.925813 -1.810696
6 0.793244 -1.860442 -1.196797
A B C
0 1.016386 1.766780 0.648333
1 -1.101329 -1.021171 0.830281
2 -1.133889 -2.793579 0.839298
3 1.134425 0.611480 -1.482724
4 -0.066601 -2.123353 1.136564
5 -0.167580 -0.991550 0.660508
6 0.528789 -0.483008 1.472787
You can create a panel of your DataFrames and then compute the mean and SD along the items axis:
df1 = pd.DataFrame(np.random.randn(10, 3), columns=['A', 'B', 'C'])
df2 = pd.DataFrame(np.random.randn(10, 3), columns=['A', 'B', 'C'])
df3 = pd.DataFrame(np.random.randn(10, 3), columns=['A', 'B', 'C'])
p = pd.Panel({n: df for n, df in enumerate([df1, df2, df3])})
>>> p.mean(axis=0)
A B C
0 -0.024284 -0.622337 0.581292
1 0.186271 0.596634 -0.498755
2 0.084591 -0.760567 -0.334429
3 -0.833688 0.403628 0.013497
4 0.402502 -0.017670 -0.369559
5 0.733305 -1.311827 0.463770
6 -0.941334 0.843020 -1.366963
7 0.134700 0.626846 0.994085
8 -0.783517 0.703030 -1.187082
9 -0.954325 0.514671 -0.370741
>>> p.std(axis=0)
A B C
0 0.196526 1.870115 0.503855
1 0.719534 0.264991 1.232129
2 0.315741 0.773699 1.328869
3 1.169213 1.488852 1.149105
4 1.416236 1.157386 0.414532
5 0.554604 1.022169 1.324711
6 0.178940 1.107710 0.885941
7 1.270448 1.023748 1.102772
8 0.957550 0.355523 1.284814
9 0.582288 0.997909 1.566383
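In pandas versions where Panel has been removed, the same element-wise statistics can be sketched with pd.concat and a grouped reduction (an assumed modern equivalent, not part of the original answer):
# Stack the frames with a (frame, row) MultiIndex
stacked = pd.concat([df1, df2, df3], keys=[0, 1, 2])

stacked.groupby(level=1).mean()  # element-wise mean across the frames
stacked.groupby(level=1).std()   # element-wise standard deviation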
One simple solution here is to concatenate the existing dataframes into a single dataframe, adding an ID variable to track the original source:
dfa = pd.DataFrame(np.random.randn(2, 2), columns=['a', 'b']).assign(id='a')
dfb = pd.DataFrame(np.random.randn(2, 2), columns=['a', 'b']).assign(id='b')
df = pd.concat([dfa, dfb])
a b id
0 -0.542652 1.609213 a
1 -0.192136 0.458564 a
0 -0.231949 -0.000573 b
1 0.245715 -0.083786 b
So now you have two 2x2 dataframes combined into a single 4x2 dataframe. The 'id' column identifies the source dataframe, so you haven't lost any generality, and you can select on 'id' to do anything you would do to a single dataframe, e.g. df[df['id'] == 'a'].
But now you can also group by the row index to apply any pandas method such as mean() or std() on an element-by-element basis:
df.groupby(df.index).mean()
a b
index
0 0.198164 -0.811475
1 0.639529 0.812810
The following solution worked for me.
average_data_frame = (dataframe1 + dataframe2 ) / 2
Or, if you have more than two dataframes, say n of them collected in a list dataframes, then
average_data_frame = dataframes[0]
for df in dataframes[1:]:
    average_data_frame = average_data_frame + df
average_data_frame = average_data_frame / n
Once you have the average, you can go for the standard deviation. If you are looking for a "true Pythonic" approach, you should follow other answers; but if you are looking for a working and quick solution, this is it.
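Following the same manual idea, a hedged sketch of the element-wise standard deviation (assuming dataframes is the list of frames and average_data_frame their element-wise mean; ddof=1 matches pandas' default .std()):
n = len(dataframes)
squared_dev = sum((df - average_data_frame) ** 2 for df in dataframes)
std_data_frame = (squared_dev / (n - 1)) ** 0.5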

Map List of Tuples to New Column

Suppose I have a pandas.DataFrame:
In [76]: df
Out[76]:
a b c
0 -0.685397 0.845976 w
1 0.065439 2.642052 x
2 -0.220823 -2.040816 y
3 -1.331632 -0.162705 z
Suppose I have a list of tuples:
In [78]: tp
Out[78]: [('z', 0.25), ('y', 0.33), ('x', 0.5), ('w', 0.75)]
I would like to map tp to df such that the second element of each tuple lands in a new column on the row matching the first element of the tuple.
The end result would look like this:
In [87]: df2
Out[87]:
a b c new
0 -0.685397 0.845976 w 0.75
1 0.065439 2.642052 x 0.50
2 -0.220823 -2.040816 y 0.33
3 -1.331632 -0.162705 z 0.25
I've tried using lambdas, pandas.applymap, pandas.map, etc., but cannot seem to crack this one. So, for those who will point out that I have not actually asked a question: how would I map tp to df such that the second element of each tuple lands in a new column on the row matching the first element of the tuple?
You need to turn your list of tuples into a dict, which is ridiculously easy to do in Python, then call map with it:
In [4]:
df['new'] = df['c'].map(dict(tp))
df
Out[4]:
a b c new
index
0 -0.685397 0.845976 w 0.75
1 0.065439 2.642052 x 0.50
2 -0.220823 -2.040816 y 0.33
3 -1.331632 -0.162705 z 0.25
The docs for map show that it takes a dict, Series, or function as its argument.
applymap takes a function as an arg but operates element-wise on the whole dataframe, which is not what you want in this case.
The online docs show how to apply an operation element-wise, as does the excellent book.
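An equivalent route, for reference (a sketch under the same assumed df and tp, not part of the original answer), turns the tuples into a small lookup frame and merges on 'c':
lookup = pd.DataFrame(tp, columns=['c', 'new'])

# A left merge keeps df's rows and order, attaching the matching value per 'c'
df2 = df.merge(lookup, on='c', how='left')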
Does this example from the pandas.DataFrame docstring help? (Here ts1, ts2, and index are placeholder objects from the docstring.)
class pandas.DataFrame(data=None, index=None, columns=None, dtype=None, copy=False)
>>> d = {'col1': ts1, 'col2': ts2}
>>> df = DataFrame(data=d, index=index)
>>> df2 = DataFrame(np.random.randn(10, 5))
>>> df3 = DataFrame(np.random.randn(10, 5),
... columns=['a', 'b', 'c', 'd', 'e'])

returning aggregated dataframe from pandas groupby

I'm trying to wrap my head around pandas groupby methods. I'd like to write a function that does some aggregation and then returns a pandas DataFrame. Here's a grossly simplified example using sum(). I know there are easier ways to do simple sums; in real life my function is more complex:
import pandas as pd
df = pd.DataFrame({'col1': ['A', 'A', 'B', 'B'], 'col2':[1.0, 2, 3, 4]})
In [3]: df
Out[3]:
col1 col2
0 A 1
1 A 2
2 B 3
3 B 4
def func2(df):
    dfout = pd.DataFrame({'col1': df['col1'].unique(),
                          'someData': sum(df['col2'])})
    return dfout
t = df.groupby('col1').apply(func2)
In [6]: t
Out[6]:
        col1  someData
col1
A    0     A         3
B    0     B         7
I did not expect to have col1 in there twice, nor did I expect that mystery index-looking thing. I really thought I would just get col1 & someData.
In my real life application I'm grouping by more than one column and really would like to get back a DataFrame and not a Series object.
Any ideas for a solution or an explanation on what Pandas is doing in my example above?
----- added info -----
I should have started with this example, I think:
In [13]: import pandas as pd
In [14]: df = pd.DataFrame({'col1':['A','A','A','B','B','B'], 'col2':['C','D','D','D','C','C'], 'col3':[.1,.2,.4,.6,.8,1]})
In [15]: df
Out[15]:
col1 col2 col3
0 A C 0.1
1 A D 0.2
2 A D 0.4
3 B D 0.6
4 B C 0.8
5 B C 1.0
In [16]: def func3(df):
   ....:     dfout = sum(df['col3']**2)
   ....:     return dfout
   ....:
In [17]: t = df.groupby(['col1', 'col2']).apply(func3)
In [18]: t
Out[18]:
col1 col2
A C 0.01
D 0.20
B C 1.64
D 0.36
In the above illustration the result of the apply() function is a pandas Series, and it lacks the grouping columns from df.groupby. The essence of what I'm struggling with is: how do I create a function to apply to a groupby which returns both the result of the function AND the columns on which it was grouped?
----- yet another update ------
It appears that if I then do this:
pd.DataFrame(t).reset_index()
I get back a dataframe which is really close to what I was after.
The reason you are seeing the columns with 0s is that the output of .unique() is an array.
The best way to understand how your apply is going to work is to inspect each action group-wise:
In [11]: g = df.groupby('col1')
In [12]: g.get_group('A')
Out[12]:
col1 col2
0 A 1
1 A 2
In [13]: g.get_group('A')['col1'].unique()
Out[13]: array(['A'], dtype=object)
In [14]: sum(g.get_group('A')['col2'])
Out[14]: 3.0
The majority of the time you want this to be an aggregated value.
The output of grouped.apply will always have the group labels as an index (the unique values of 'col1'), so your example's construction of col1 seems a little obtuse to me.
Note: to pop 'col1' (the index) back to a column you can call reset_index, so in this case:
In [15]: g.sum().reset_index()
Out[15]:
col1 col2
0 A 3
1 B 7
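For the multi-column case in the update, the same trick applies (a sketch assuming the df from the update): the grouped columns come back as a MultiIndex on the resulting Series, and reset_index turns them into regular columns:
t = df.groupby(['col1', 'col2'])['col3'].apply(lambda s: (s ** 2).sum())
t.rename('someData').reset_index()  # DataFrame with col1, col2 and someData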
