Reference many columns of pandas DataFrame at once - python

Say I have an n × p matrix of n samples of a single p-dimensional feature (for example a word2vec embedding, so that p is on the order of ~300). I can create each column programmatically, e.g. with features = ['f'+str(i) for i in range(p)], and then append them to an existing dataframe.
Since they represent a single feature, how can I reference all those columns later on? I can assign df.feature = df[features] which works, but it breaks when I slice the dataset: df[:x].feature results in an exception.
Example:
df = pre_existing_dataframe() # such that len(df) is n
n,p = 3,4
m = np.arange(n*p).reshape((n,p))
fs = ['f'+str(i) for i in range(p)]
df_m = pd.DataFrame(m)
df_m.columns = fs
df = pd.concat([df,df_m],axis=1) # m is now only a part of df
df.f = df[fs]
df.f # works: I can access the whole m at once
df[:1].f # crashes

I wouldn't use df.f = df[fs]. It may lead to undesired and surprising behaviour if you try to modify the data frame: assigning with df.f = ... only sets a plain Python attribute on that particular DataFrame object, not a column, so a sliced frame such as df[:1] won't carry it. Instead, I'd consider creating hierarchical columns, as in the example below.
Say, we already have a preexisting data frame df0 and another one with features:
df0 = pd.DataFrame(np.arange(4).reshape(2,2), columns=['A', 'B'])
df1 = pd.DataFrame(np.arange(10, 16).reshape(2,3), columns=['f0', 'f1', 'f2'])
Then, using the keys argument to concat, we create another level in columns:
df = pd.concat([df0, df1], keys=['pre', 'feat1'], axis=1)
df
Out[103]:
  pre    feat1
    A  B    f0  f1  f2
0   0  1    10  11  12
1   2  3    13  14  15
The subframe with features can be accessed as follows:
df['feat1']
Out[104]:
f0 f1 f2
0 10 11 12
1 13 14 15
df[('feat1', 'f0')]
Out[105]:
0 10
1 13
Name: (feat1, f0), dtype: int64
Slicing on rows is straightforward. Slicing on columns may be more complicated:
df.loc[:, pd.IndexSlice['feat1', :]]
Out[106]:
  feat1
     f0  f1  f2
0    10  11  12
1    13  14  15
df.loc[:, pd.IndexSlice['feat1', 'f0':'f1']]
Out[107]:
  feat1
     f0  f1
0    10  11
1    13  14
To modify values in the data frame, use .loc, for example df.loc[1:, ('feat1', 'f1')] = -1. (More on hierarchical indexing, slicing etc.)
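For instance, a quick sketch of that assignment applied to the frame built above (the -1 is just the example value from the sentence above):
# set ('feat1', 'f1') to -1 from row 1 onward
df.loc[1:, ('feat1', 'f1')] = -1
df['feat1']
#    f0  f1  f2
# 0  10  11  12
# 1  13  -1  15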
It's also possible to append another frame to df.
# another set of features
df2 = pd.DataFrame(np.arange(100, 108).reshape(2,4), columns=['f0', 'f1', 'f2', 'f3'])
# create a MultiIndex:
idx = pd.MultiIndex.from_product([['feat2'], df2.columns])
# append
df[idx] = df2
df
Out[117]:
  pre    feat1         feat2
    A  B    f0  f1  f2    f0   f1   f2   f3
0   0  1    10  11  12   100  101  102  103
1   2  3    13  -1  15   104  105  106  107
To keep a nice layout, it's important that idx have the same number of levels as df.columns.
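For completeness, a small sketch of pulling the feature blocks back out of the frame built above (plain MultiIndex column selection, nothing specific to this answer):
df[['feat1', 'feat2']]            # both feature blocks at once
df.drop(columns='pre', level=0)   # everything except the original columns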

Related

Python/Pandas DataFrame with leapfrog assigned columns

I would kindly like to ask for your help with my conundrum.
I am currently revisiting an old issue and it occurred to me to work on some improvements :))).
I am creating a DataFrame for future analysis from multiple Excel files.
Each file always contains multiple columns, which are transposed to rows and then appended to the DataFrame.
So far so good.
However, the trouble starts once I generate additional columns, with column names derived from the entries read from Excel.
I am having trouble finding an elegant solution to this surely trivial problem.
Example:
dat1 = {'A':[3, 1], 'B':[4, 1]}
df_1 = pd.DataFrame(data = dat1)
this is DataFrame df_1:
A B
0 a3 b4
1 a1 b1
dat2 = {'U':[9,9], 'Y':[2,2]}
df_2 = pd.DataFrame(data = dat2)
this is DataFrame df_2:
U Y
0 u9 y2
1 u9 y2
The desired output is to assign values to the DataFrame by column name for multiple entries (i.e. assign a complete DataFrame to another one):
dat3 = {'A':[], 'U':[], 'B':[], 'Y':[]}
df_3 = pd.DataFrame(data = dat3)
this is DataFrame df_3:
A U B Y
0 a3 u9 b4 y2
1 a1 u9 b1 y2
At the moment I am experimenting with the join/merge/concat functions, but none of them can do it by itself.
I can imagine creating a new DataFrame or assigning according to some index, but this seems like overkill for the problem. The main column-name list is built separately in another function.
Is there a simple way that I am missing?
Many thanks for your time, consideration and potential help in advance.
Best Regards
Jan
You should use the concat method to concatenate the data frames, as follows:
First, create the data frames:
import pandas as pd
dat1 = {'A':[3, 1], 'B':[4, 1]}
df_1 = pd.DataFrame(data = dat1)
df_1
output:
A B
0 3 4
1 1 1
dat2 = {'U':[9,9], 'Y':[2,2]}
df_2 = pd.DataFrame(data = dat2)
df_2
U Y
0 9 2
1 9 2
Then use concat method:
df_3 = pd.concat([df_1, df_2], axis=1)
df_3
output:
A B U Y
0 3 4 9 2
1 1 1 9 2
The last step is to rearrange the columns of df_3 to get an output similar to the one shown in your question:
df_3 = df_3[['A', 'U', 'B', 'Y']]
df_3
output:
A U B Y
0 3 9 4 2
1 1 9 1 2
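Since the question mentions that the target column order is built separately in another function, the reordering can be folded into the concat step; a minimal sketch, where col_order stands in for that separately built list:
col_order = ['A', 'U', 'B', 'Y']   # e.g. returned by the separate helper function
df_3 = pd.concat([df_1, df_2], axis=1)[col_order]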

Pandas aggregate grouping & summing

I have a pandas dataframe where I am trying to sum based on groupings, but I can't seem to get the order right. In the example below, I want to group by group2 then group1 and sum without double-counting the group1 values. This is part of a larger table with other things going on, so I don't want to filter out by unique group1-2 sets.
Using pandas 1.0.5
x, y = [(21643,21665,21640,21668,21713,21706), (30,28,84,2,32,-9)]
val = [11,27,31,15,50,35]
group1, group2 = [(1,1,3,4,1,4), (21660,21660,21660,21660,21700,21700)]
df = pd.DataFrame(list(zip(x, y, val, group1, group2)),
                  columns=['x', 'y', 'val', 'group1', 'group2'])
df.reset_index(drop=True, inplace=True)
df.sort_values(['group2', 'group1'],inplace=True)
df['group1_mean'] = df.groupby(['group2', 'group1'])['val'].transform('mean')
df['group2_sum'] = df.groupby(['group2', 'group1'])['group1_mean'].transform('sum')
display(df)
I would make a temporary df
dfsum = df.groupby(['group2', 'group1']).mean()
dfsum = dfsum.groupby('group2').sum()
Then merge df with this dfsum
df = df.merge(dfsum, on='group2')
The one-line trick:
df = df.merge(df.groupby(['group2', 'group1']).val.mean()
                .groupby('group2').sum().rename('result'),
              on='group2')
This does not bind the intermediate groupby results to a variable name, so they can be garbage-collected.
Output
       x   y  val  group1  group2  result
0  21643  30   11       1   21660      65
1  21665  28   27       1   21660      65
2  21640  84   31       3   21660      65
3  21668   2   15       4   21660      65
4  21713  32   50       1   21700      85
5  21706  -9   35       4   21700      85
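As a sanity check on the result column: for group2 == 21660 the per-group1 means of val are (11 + 27) / 2 = 19 for group1 1, 31 for group1 3 and 15 for group1 4, so the sum is 19 + 31 + 15 = 65; for group2 == 21700 it is 50 + 35 = 85, with no value counted twice.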

python pandas reshaping a dataframe

I have a dataframe with two columns: id1 and id2.
df = pd.DataFrame({'id1': list('ABCBAC'), 'id2': [123,13,12,11,13,132]})
print(df)
id1 id2
A 123
B 13
C 12
B 11
A 13
C 132
And I want to reshape it (using, groupby, or pivot maybe?) to obtain the following:
id1 id2-1 id2-2
A 123 13
B 13 11
C 12 132
Note that there are exactly two rows for each id1 but a great number of different values of id2 (so I'd rather not do one-hot vector encoding).
There is a preference if the output could be sorted by lexicographic order, to give this:
id1 id2-1 id2-2
A 13 123
B 11 13
C 12 132
i.e. for each row the values in id2-1 and id2-2 are sorted (see the row corresponding to id1 == 'B').
plan
we want to create an index that counts each successive time we see a value in 'id1'. For this we groupby('id1') and then use cumcount() to give us that new index.
We then set the index to be a pd.MultiIndex with set_index
with the pd.MultiIndex we are set up to unstack
finally, we rename the columns with some tricky mapping
d = df.set_index(['id1', df.groupby('id1').cumcount() + 1]).unstack()
d.columns = d.columns.to_series().map('{0[0]}-{0[1]}'.format)
print(d)
     id2-1  id2-2
id1
A      123     13
B       13     11
C       12    132
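The question also prefers the two values in each row to be sorted; a small sketch of the same approach with a sort_values first (so cumcount numbers the smaller value as 1), using the df defined above:
df_sorted = df.sort_values(['id1', 'id2'])
d = df_sorted.set_index(['id1', df_sorted.groupby('id1').cumcount() + 1])['id2'].unstack()
d.columns = ['id2-' + str(c) for c in d.columns]
print(d)
#      id2-1  id2-2
# id1
# A       13    123
# B       11     13
# C       12    132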
This should do it:
import pandas as pd
df = pd.DataFrame({'id1': list('ABCBAC'), 'id2': [123,13,12,11,13,132]})
df['id2'] = df['id2'].astype(str)
df = df.groupby(['id1']).agg(lambda x: '-'.join(x))
df['id2-1'] = df['id2'].apply(lambda x: x.split('-')[0]).astype(int)
df['id2-2'] = df['id2'].apply(lambda x: x.split('-')[1]).astype(int)
df = df.reset_index()[['id1', 'id2-1', 'id2-2']]

Pandas flatten hierarchical index on non overlapping columns

I have a dataframe, and I set the index to a column of the dataframe. This creates a hierarchical column index. I want to flatten the columns to a single level. This is similar to Python Pandas - How to flatten a hierarchical index in columns; however, the columns do not overlap (i.e. 'id' is not at level 0 of the hierarchical index, with the other columns at level 1).
df = pd.DataFrame([(101,3,'x'), (102,5,'y')], columns=['id', 'A', 'B'])
df.set_index('id', inplace=True)
A B
id
101 3 x
102 5 y
Desired output is flattened columns, like this:
id A B
101 3 x
102 5 y
You are misinterpreting what you are seeing.
A B
id
101 3 x
102 5 y
Is not showing you a hierarchical column index. id is the name of the row index. In order to show you the name of the index, pandas is putting that space there for you.
The answer to your question depends on what you really want or need.
As the df is, you can dump it to a csv just the way you want:
print(df.to_csv(sep='\t'))
id A B
101 3 x
102 5 y
print(df.to_csv())
id,A,B
101,3,x
102,5,y
Or you can alter the df so that it displays the way you'd like
print(df.rename_axis(None))
A B
101 3 x
102 5 y
Please don't actually do this! I'm only including it to demonstrate how to manipulate the index names.
I could also keep the index as it is but manipulate both the column and row index names to print the way you would like.
print(df.rename_axis(None).rename_axis('id', axis=1))
id A B
101 3 x
102 5 y
But this names the columns' index 'id', which makes no sense.
There will always be an index in your DataFrames. If you don't set 'id' as the index, it will be at the same level as the other columns and pandas will populate an increasing integer index for you, starting from 0.
df = pd.DataFrame([(101,3,'x'), (102,5,'y')], columns=['id', 'A', 'B'])
In[52]: df
Out[52]:
id A B
0 101 3 x
1 102 5 y
The index is there so you can slice the original dataframe, for example:
df.iloc[0]
Out[53]:
id 101
A 3
B x
Name: 0, dtype: object
So let's say you want 'id' both as the index and as a column, which is very redundant; you could do:
df = pd.DataFrame([(101,3,'x'), (102,5,'y')], columns=['id', 'A', 'B'])
df.set_index('id', inplace=True)
df['id'] = df.index
df
Out[55]:
A B id
id
101 3 x 101
102 5 y 102
With this you can slice by 'id', like so:
df.loc[101]
Out[57]:
A 3
B x
id 101
Name: 101, dtype: object
but it would give the same info as:
df = pd.DataFrame([(101,3,'x'), (102,5,'y')], columns=['id', 'A', 'B'])
df.set_index('id', inplace=True)
df.loc[101]
Out[58]:
A 3
B x
Name: 101, dtype: object
Given:
>>> df2=pd.DataFrame([(101,3,'x'), (102,5,'y')], columns=['id', 'A', 'B'])
>>> df2.set_index('id', inplace=True)
>>> df2
A B
id
101 3 x
102 5 y
For pretty printing, you can produce a copy of the DataFrame with the index reset and use .to_string:
>>> print(df2.reset_index().to_string(index=False))
id A B
101 3 x
102 5 y
Then play around with the formatting options so that the output suits your needs:
>>> fmts=[lambda s: u"{:^5}".format(str(s).strip())]*3
>>> print(df2.reset_index().to_string(index=False, formatters=fmts))
id A B
101 3 x
102 5 y

Element-wise average and standard deviation across multiple dataframes

Data:
Multiple dataframes of the same format (same columns, an equal number of rows, and no points missing).
How do I create a "summary" dataframe that contains an element-wise mean for every element? How about a dataframe that contains an element-wise standard deviation?
A B C
0 -1.624722 -1.160731 0.016726
1 -1.565694 0.989333 1.040820
2 -0.484945 0.718596 -0.180779
3 0.388798 -0.997036 1.211787
4 -0.249211 1.604280 -1.100980
5 0.062425 0.925813 -1.810696
6 0.793244 -1.860442 -1.196797
A B C
0 1.016386 1.766780 0.648333
1 -1.101329 -1.021171 0.830281
2 -1.133889 -2.793579 0.839298
3 1.134425 0.611480 -1.482724
4 -0.066601 -2.123353 1.136564
5 -0.167580 -0.991550 0.660508
6 0.528789 -0.483008 1.472787
You can create a panel of your DataFrames and then compute the mean and SD along the items axis:
df1 = pd.DataFrame(np.random.randn(10, 3), columns=['A', 'B', 'C'])
df2 = pd.DataFrame(np.random.randn(10, 3), columns=['A', 'B', 'C'])
df3 = pd.DataFrame(np.random.randn(10, 3), columns=['A', 'B', 'C'])
p = pd.Panel({n: df for n, df in enumerate([df1, df2, df3])})
>>> p.mean(axis=0)
A B C
0 -0.024284 -0.622337 0.581292
1 0.186271 0.596634 -0.498755
2 0.084591 -0.760567 -0.334429
3 -0.833688 0.403628 0.013497
4 0.402502 -0.017670 -0.369559
5 0.733305 -1.311827 0.463770
6 -0.941334 0.843020 -1.366963
7 0.134700 0.626846 0.994085
8 -0.783517 0.703030 -1.187082
9 -0.954325 0.514671 -0.370741
>>> p.std(axis=0)
A B C
0 0.196526 1.870115 0.503855
1 0.719534 0.264991 1.232129
2 0.315741 0.773699 1.328869
3 1.169213 1.488852 1.149105
4 1.416236 1.157386 0.414532
5 0.554604 1.022169 1.324711
6 0.178940 1.107710 0.885941
7 1.270448 1.023748 1.102772
8 0.957550 0.355523 1.284814
9 0.582288 0.997909 1.566383
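pd.Panel has since been deprecated and removed (pandas 0.25 dropped it), so on recent pandas a rough equivalent is to concatenate the frames and group on the row index; a minimal sketch, assuming the frames share the same index and columns as above:
stacked = pd.concat([df1, df2, df3])            # row labels 0..9 repeat once per frame
element_mean = stacked.groupby(level=0).mean()  # element-wise mean across the frames
element_std = stacked.groupby(level=0).std()    # element-wise (sample) standard deviation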
One simple solution here is to simply concatenate the existing dataframes into a single dataframe while adding an ID variable to track the original source:
dfa = pd.DataFrame( np.random.randn(2,2), columns=['a','b'] ).assign(id='a')
dfb = pd.DataFrame( np.random.randn(2,2), columns=['a','b'] ).assign(id='b')
df = pd.concat([dfa, dfb])
a b id
0 -0.542652 1.609213 a
1 -0.192136 0.458564 a
0 -0.231949 -0.000573 b
1 0.245715 -0.083786 b
So now you have two 2x2 dataframes combined into a single 4x2 dataframe. The 'id' column identifies the source dataframe, so you haven't lost any generality, and you can select on 'id' to do the same thing you would to any single dataframe, e.g. df[ df['id'] == 'a' ].
But now you can also use groupby on the row index to apply any pandas method such as mean() or std() on an element-by-element basis:
df.groupby(df.index).mean()

          a         b
0  0.198164 -0.811475
1  0.639529  0.812810
The following solution worked for me.
average_data_frame = (dataframe1 + dataframe2 ) / 2
Or, if you have more than two dataframes, say a list of n of them:
# assuming the frames are collected in a list, e.g. dataframes = [dataframe1, dataframe2, ..., dataframe_n]
average_data_frame = dataframes[0]
for frame in dataframes[1:]:
    average_data_frame = average_data_frame + frame
average_data_frame = average_data_frame / len(dataframes)
Once you have the average, you can go for the standard deviation. If you are looking for a "true Pythonic" approach, you should follow other answers. But if you are looking for a working and quick solution, this is it.
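Since the standard deviation is left as an exercise above, here is one possible sketch using NumPy, assuming the frames are collected in the same dataframes list as in the loop above:
import numpy as np

# element-wise sample standard deviation across the frames
stacked = np.stack([frame.to_numpy() for frame in dataframes])
std_frame = pd.DataFrame(stacked.std(axis=0, ddof=1),
                         index=dataframes[0].index,
                         columns=dataframes[0].columns)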
