I have a dataframe with two columns: id1 and id2.
df = pd.DataFrame({'id1': list('ABCBAC'), 'id2': [123, 13, 12, 11, 13, 132]})
print(df)

  id1  id2
0   A  123
1   B   13
2   C   12
3   B   11
4   A   13
5   C  132
And I want to reshape it (using groupby or pivot, maybe?) to obtain the following:

id1  id2-1  id2-2
A      123     13
B       13     11
C       12    132
Note that there are exactly two rows for each id1 but a great number of different values of id2 (so I'd rather not do one-hot vector encoding).
Preferably, the values within each row would also be sorted, to give this:

id1  id2-1  id2-2
A       13    123
B       11     13
C       12    132
i.e. for each row the values in id2-1 and id2-2 are sorted (see the row corresponding to id1 == 'B').
Plan

We want an index that counts each successive time we see a value in 'id1'. For this we groupby('id1') and use cumcount() to generate that new index.
We then set the index to be a pd.MultiIndex with set_index.
With the pd.MultiIndex in place we are set up to unstack.
Finally, we rename the columns with a small formatting map.
d = df.set_index(['id1', df.groupby('id1').cumcount() + 1]).unstack()
d.columns = d.columns.to_series().map('{0[0]}-{0[1]}'.format)
print(d)
     id2-1  id2-2
id1
A      123     13
B       13     11
C       12    132
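To also satisfy the sorting preference, a small tweak (a sketch, assuming ascending order within each pair is what's wanted): sort by id2 within each id1 group before building the cumcount index.

s = df.sort_values(['id1', 'id2'])
d = s.set_index(['id1', s.groupby('id1').cumcount() + 1])['id2'].unstack()
d.columns = ['id2-{}'.format(c) for c in d.columns]
print(d)

     id2-1  id2-2
id1
A       13    123
B       11     13
C       12    132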
This should do it:
import pandas as pd

df = pd.DataFrame({'id1': list('ABCBAC'), 'id2': [123, 13, 12, 11, 13, 132]})

# join the two id2 values per id1 into one delimited string, then split them back out
df['id2'] = df['id2'].astype(str)
df = df.groupby(['id1']).agg(lambda x: '-'.join(x))
df['id2-1'] = df['id2'].apply(lambda x: x.split('-')[0]).astype(int)
df['id2-2'] = df['id2'].apply(lambda x: x.split('-')[1]).astype(int)
df = df.reset_index()[['id1', 'id2-1', 'id2-2']]
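As with the answer above, this keeps the original encounter order; sorting df with df.sort_values(['id1', 'id2']) before the groupby would satisfy the sorted-output preference.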
Related
I have a dataframe named df which has two columns, id1 and id2.
I need to filter its values based on another dataframe named meta_df.
meta_df has three columns: id, name, text.
df

   id1  id2
0   12   34
1   99   42
meta_df

   id name         text
0  12   aa     lowerend
1  42   bb     upperend
2  99   cc  upper limit
3  34   dd    uppersome
I need the pairs whose text values cover both 'lower' and 'upper' in their strings, e.g. ids 12 and 34.
I am trying the code below and am stuck at getting the text column:
for row in df.itertuples():
    print(row.Index, row.id1, row.id2)
    print(meta_df[meta_df['id'] == row.id1])
    print(meta_df[meta_df['id'] == row.id2])
Expected output:

   id1  id2 flag
0   12   34  yes
1   99   42   no
Melt df and merge to meta_df, with a bit of reshaping before getting the final value:

# keep the index with ignore_index=False;
# it will be used when reshaping back to the original form
reshaped = (df.melt(value_name = 'id', ignore_index = False)
              .assign(ind=lambda df: df.index)
              .merge(meta_df, on='id', how = 'left')
              .assign(text = lambda df: df.text.str.contains('lower'))
              .drop(columns='name')
              .pivot(index='ind', columns='variable')
              .rename_axis(columns=[None, None], index=None)
           )
# a row with one 'lower' (1) and one 'upper' (0) sums to 1;
# otherwise it sums to 0, or 2 (unlikely with the sample data shared)
flag = reshaped.loc(axis=1)['text'].sum(axis=1)
reshaped.loc(axis=1)['id'].assign(flag = flag.map({1:'yes', 0:'no'}))
id1 id2 flag
0 12 34 yes
1 99 42 no
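For comparison, a simpler sketch (assuming each id appears exactly once in meta_df, so a plain lookup works): map both id columns to their text and flag the pairs where exactly one of the two texts contains 'lower'.

text = meta_df.set_index('id')['text']
lower = df[['id1', 'id2']].apply(lambda c: c.map(text).str.contains('lower'))
out = df.assign(flag=lower.sum(axis=1).eq(1).map({True: 'yes', False: 'no'}))
print(out)

   id1  id2 flag
0   12   34  yes
1   99   42   no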
I am currently trying to assign values to rows at certain indices based on the other rows within the same group.
Consider the following pandas data frame:
import pandas as pd
import numpy as np
index = pd.MultiIndex.from_product([list('abc'), ['aa', 'bb', 'cc']])
df = pd.DataFrame({'col1': np.arange(9),
'col2': np.arange(9, 18),
'col3': np.arange(18,27)},
index=index)
Output of df:
col1 col2 col3
a aa 0 9 18
bb 1 10 19
cc 2 11 20
b aa 3 12 21
bb 4 13 22
cc 5 14 23
c aa 6 15 24
bb 7 16 25
cc 8 17 26
I want to set the 'cc' rows equal to 'aa' plus 'bb' within each first-level index.
The following works fine, but I am wondering if there is a way to set values without having to reference the underlying NumPy array.
df.loc[pd.IndexSlice[:, 'cc'], :] = (df.xs('aa', level=1)
+ df.xs('bb', level=1)).values
Is there a way to set the 'cc' rows directly to the output below? I believe the issue with trying to set the below directly is due to an index mismatch. Can I get around this somehow?
df.xs('aa', level=1) + df.xs('bb', level=1)
Update
You can use pandas.DataFrame.iloc
df.iloc[df.index.get_level_values(1)=='cc'] = df.xs('aa', level=1) + df.xs('bb', level=1)
Old answer
You can do this:
df[df.index.get_level_values(1)=='cc'] = df.xs('aa', level=1) + df.xs('bb', level=1)
Disclaimer: this works in pandas 1.2.1 but does not work in pandas 1.2.3; I haven't tested other versions.
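An alternative that should be less version-sensitive (a sketch): rebuild the target MultiIndex on the computed sum, so that .loc can align on labels instead of going through the underlying array.

s = df.xs('aa', level=1) + df.xs('bb', level=1)
s.index = pd.MultiIndex.from_product([s.index, ['cc']])
df.loc[pd.IndexSlice[:, 'cc'], :] = s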
Suppose a dataframe like this one:
df = pd.DataFrame([[1,2,3,4],[5,6,7,8],[9,10,11,12]], columns = ['A', 'B', 'A1', 'B1'])
I would like to have a dataframe which looks like this:

    A   B  id
0   1   2   1
1   5   6   1
2   9  10   1
3   3   4   2
4   7   8   2
5  11  12   2
what does not work:
new_rows = int(df.shape[1]/2) * df.shape[0]
new_cols = 2
df.values.reshape(new_rows, new_cols, order='F')
Of course I could loop over the data and build a new list of lists, but there must be a better way. Any ideas?
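For reference, a plain-NumPy sketch of what the reshape was aiming for (the order='F' reshape above ends up pairing A with A1 rather than A with B): stack the two column pairs on top of each other.

import numpy as np
out = pd.DataFrame(np.vstack([df[['A', 'B']].to_numpy(),
                              df[['A1', 'B1']].to_numpy()]),
                   columns=['A', 'B'])
out['id'] = np.repeat([1, 2], len(df))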
The pd.wide_to_long function is built almost exactly for this situation, where you have many of the same variable prefixes that end in a different digit suffix. The only difference here is that your first set of variables doesn't have a suffix, so you will need to rename your columns first.
The only issue with pd.wide_to_long is that it must have an identification variable, i, unlike melt. reset_index is used to create this uniquely identifying column, which is dropped later. I think this might get corrected in the future.
df1 = df.rename(columns={'A':'A1', 'B':'B1', 'A1':'A2', 'B1':'B2'}).reset_index()
pd.wide_to_long(df1, stubnames=['A', 'B'], i='index', j='id')\
.reset_index()[['A', 'B', 'id']]
A B id
0 1 2 1
1 5 6 1
2 9 10 1
3 3 4 2
4 7 8 2
5 11 12 2
You can use lreshape, and for the id column, numpy.repeat:
a = [col for col in df.columns if 'A' in col]
b = [col for col in df.columns if 'B' in col]
df1 = pd.lreshape(df, {'A' : a, 'B' : b})
df1['id'] = np.repeat(np.arange(len(df.columns) // 2), len(df.index)) + 1
print (df1)
A B id
0 1 2 1
1 5 6 1
2 9 10 1
3 3 4 2
4 7 8 2
5 11 12 2
EDIT:
lreshape is currently undocumented, and it is possible it might be removed (along with pd.wide_to_long).
A possible solution is merging all three functions into one, maybe melt, but that is not implemented yet. Perhaps in some new version of pandas; then my answer will be updated.
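In the meantime, a melt-based sketch is already possible (assuming the same frame as above): melt each stub separately and line the pieces up by position.

a = df.melt(value_vars=['A', 'A1'])['value']
b = df.melt(value_vars=['B', 'B1'])['value']
out = pd.DataFrame({'A': a, 'B': b, 'id': np.repeat([1, 2], len(df))})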
I solved this in 3 steps:
Make a new dataframe df2 holding only the data you want to be added to the initial dataframe df.
Delete the data from df that was moved into df2.
Append df2 to df.
Like so:
# step 1: create a new dataframe with the columns to move
df2 = df[['A1', 'B1']]
df2.columns = ['A', 'B']

# step 2: drop those columns from the original
df = df.drop(['A1', 'B1'], axis=1)

# step 3: append (note: DataFrame.append was removed in pandas 2.0;
# use pd.concat([df, df2], ignore_index=True) there instead)
df = df.append(df2, ignore_index=True)
Note how when you do df.append() you need to specify ignore_index=True so the appended rows get a fresh index rather than keeping their old one.
Your end result should be your original dataframe with the data rearranged like you wanted:
In [16]: df
Out[16]:
A B
0 1 2
1 5 6
2 9 10
3 3 4
4 7 8
5 11 12
Use pd.concat() like so:
#Split into separate tables
df_1 = df[['A', 'B']]
df_2 = df[['A1', 'B1']]
df_2.columns = ['A', 'B'] # Make column names line up
# Add the ID column
df_1 = df_1.assign(id=1)
df_2 = df_2.assign(id=2)
# Concatenate
pd.concat([df_1, df_2])
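If a clean 0 to 5 index is wanted in the result, pass ignore_index=True to pd.concat.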
Say I have an n × p matrix of n samples of a single feature of dimension p (for example a word2vec element, so that p is on the order of ~300). I can create each column programmatically, e.g. with features = ['f' + str(i) for i in range(p)], and then append them to an existing dataframe.
Since they represent a single feature, how can I reference all those columns later on? I can assign df.feature = df[features], which works, but it breaks when I slice the dataset: df[:x].feature raises an exception.
Example:
df = pre_existing_dataframe()  # such that len(df) is n
n,p = 3,4
m = np.arange(n*p).reshape((n,p))
fs = ['f'+str(i) for i in range(p)]
df_m = pd.DataFrame(m)
df_m.columns = fs
df = pd.concat([df,df_m],axis=1) # m is now only a part of df
df.f = df[fs]
df.f # works: I can access the whole m at once
df[:1].f # crashes
I wouldn't use df.f = df[fs]. It may lead to undesired and surprising behaviour if you try to modify the data frame. Instead, I'd consider creating hierarchical columns as in the below example.
Say, we already have a preexisting data frame df0 and another one with features:
df0 = pd.DataFrame(np.arange(4).reshape(2,2), columns=['A', 'B'])
df1 = pd.DataFrame(np.arange(10, 16).reshape(2,3), columns=['f0', 'f1', 'f2'])
Then, using the keys argument to concat, we create another level in columns:
df = pd.concat([df0, df1], keys=['pre', 'feat1'], axis=1)
df
Out[103]:
pre feat1
A B f0 f1 f2
0 0 1 10 11 12
1 2 3 13 14 15
The subframe with features can be accessed as follows:
df['feat1']
Out[104]:
f0 f1 f2
0 10 11 12
1 13 14 15
df[('feat1', 'f0')]
Out[105]:
0 10
1 13
Name: (feat1, f0), dtype: int64
Slicing on rows is straightforward. Slicing on columns may be more complicated:
df.loc[:, pd.IndexSlice['feat1', :]]
Out[106]:
feat1
f0 f1 f2
0 10 11 12
1 13 14 15
df.loc[:, pd.IndexSlice['feat1', 'f0':'f1']]
Out[107]:
feat1
f0 f1
0 10 11
1 13 14
To modify values in the data frame, use .loc, for example df.loc[1:, ('feat1', 'f1')] = -1. (More on hierarchical indexing, slicing etc.)
It's also possible to append another frame to df.
# another set of features
df2 = pd.DataFrame(np.arange(100, 108).reshape(2,4), columns=['f0', 'f1', 'f2', 'f3'])
# create a MultiIndex:
idx = pd.MultiIndex.from_product([['feat2'], df2.columns])
# append
df[idx] = df2
df
Out[117]:
pre feat1 feat2
A B f0 f1 f2 f0 f1 f2 f3
0 0 1 10 11 12 100 101 102 103
1 2 3 13 -1 15 104 105 106 107
To keep a nice layout, it's important that idx have the same number of levels as df.columns.
Data:
Multiple dataframes of the same format (same columns, an equal number of rows, and no points missing).
How do I create a "summary" dataframe that contains an element-wise mean for every element? How about a dataframe that contains an element-wise standard deviation?
A B C
0 -1.624722 -1.160731 0.016726
1 -1.565694 0.989333 1.040820
2 -0.484945 0.718596 -0.180779
3 0.388798 -0.997036 1.211787
4 -0.249211 1.604280 -1.100980
5 0.062425 0.925813 -1.810696
6 0.793244 -1.860442 -1.196797
A B C
0 1.016386 1.766780 0.648333
1 -1.101329 -1.021171 0.830281
2 -1.133889 -2.793579 0.839298
3 1.134425 0.611480 -1.482724
4 -0.066601 -2.123353 1.136564
5 -0.167580 -0.991550 0.660508
6 0.528789 -0.483008 1.472787
You can create a Panel of your DataFrames and then compute the mean and SD along the items axis (note that pd.Panel was removed in pandas 1.0, so this only works on older versions):
df1 = pd.DataFrame(np.random.randn(10, 3), columns=['A', 'B', 'C'])
df2 = pd.DataFrame(np.random.randn(10, 3), columns=['A', 'B', 'C'])
df3 = pd.DataFrame(np.random.randn(10, 3), columns=['A', 'B', 'C'])
p = pd.Panel({n: df for n, df in enumerate([df1, df2, df3])})
>>> p.mean(axis=0)
A B C
0 -0.024284 -0.622337 0.581292
1 0.186271 0.596634 -0.498755
2 0.084591 -0.760567 -0.334429
3 -0.833688 0.403628 0.013497
4 0.402502 -0.017670 -0.369559
5 0.733305 -1.311827 0.463770
6 -0.941334 0.843020 -1.366963
7 0.134700 0.626846 0.994085
8 -0.783517 0.703030 -1.187082
9 -0.954325 0.514671 -0.370741
>>> p.std(axis=0)
A B C
0 0.196526 1.870115 0.503855
1 0.719534 0.264991 1.232129
2 0.315741 0.773699 1.328869
3 1.169213 1.488852 1.149105
4 1.416236 1.157386 0.414532
5 0.554604 1.022169 1.324711
6 0.178940 1.107710 0.885941
7 1.270448 1.023748 1.102772
8 0.957550 0.355523 1.284814
9 0.582288 0.997909 1.566383
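Since pd.Panel is gone in modern pandas, an equivalent sketch with a 3-D NumPy array (assuming, as stated in the question, that all frames share the same shape, index, and columns):

frames = [df1, df2, df3]
# stack into shape (n_frames, n_rows, n_cols) and reduce across the frames
arr = np.stack([f.to_numpy() for f in frames])
mean_df = pd.DataFrame(arr.mean(axis=0), index=df1.index, columns=df1.columns)
# ddof=1 matches the pandas sample-std default
std_df = pd.DataFrame(arr.std(axis=0, ddof=1), index=df1.index, columns=df1.columns)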
One simple solution here is to concatenate the existing dataframes into a single dataframe while adding an ID variable to track the original source:
dfa = pd.DataFrame( np.random.randn(2,2), columns=['a','b'] ).assign(id='a')
dfb = pd.DataFrame( np.random.randn(2,2), columns=['a','b'] ).assign(id='b')
df = pd.concat([dfa, dfb])
a b id
0 -0.542652 1.609213 a
1 -0.192136 0.458564 a
0 -0.231949 -0.000573 b
1 0.245715 -0.083786 b
So now you have two 2x2 dataframes combined into a single 4x2 dataframe. The 'id' column identifies the source dataframe, so you haven't lost any generality and can select on 'id' to do the same things you would to any single dataframe, e.g. df[ df['id'] == 'a' ].
But now you can also group by the row index to apply any pandas method such as mean() or std() on an element-by-element basis:

df.groupby(level=0).mean()

          a         b
0  0.198164 -0.811475
1  0.639529  0.812810
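The same pattern gives the element-wise standard deviation: df.groupby(level=0).std().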
The following solution worked for me.
average_data_frame = (dataframe1 + dataframe2) / 2

Or, if you have more than two dataframes, say n, then

# assuming `dataframes` is a list holding all n frames
average_data_frame = dataframes[0]
for frame in dataframes[1:]:
    average_data_frame = average_data_frame + frame
average_data_frame = average_data_frame / len(dataframes)
Once you have the average, you can go for the standard deviation. If you are looking for a "true Pythonic" approach, you should follow other answers. But if you are looking for a working and quick solution, this is it.
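For the standard deviation under the same constraints, a sketch (again assuming dataframes is the list holding all n frames and average_data_frame is the mean from above):

# sum of squared deviations from the element-wise mean, then sample std (ddof=1)
squared_dev = sum((frame - average_data_frame) ** 2 for frame in dataframes)
std_data_frame = (squared_dev / (len(dataframes) - 1)) ** 0.5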