Concatenation of pandas DataFrames including each row's source dataframe/category - python

I need to pd.concat(..., axis=0, ...) multiple DataFrames so that the resulting DataFrame has a new column holding information about which dataset each row belongs to, while dropping the implicit indices of the original DataFrames.
In this MWE, for example, we have the heights and weights of people from two countries (A and B, say) in DataFrames df1 and df2, respectively.
import pandas as pd
df1 = pd.DataFrame({'Weight': [5, 4, 6], 'Height': [170, 172, 180]})
df2 = pd.DataFrame({'Weight': [4, 4, 5], 'Height': [180, 181, 169]})
The concatenated DataFrame df needs to store the country for each row, but
df = pd.concat([df1, df2], keys=list('AB'), names=["Country"]).reset_index()
print df
> Nationality level_1 Height Weight
>0 A 0 170 5
>1 A 1 172 4
>2 A 2 180 6
>3 B 0 180 4
>4 B 1 181 4
>5 B 2 169 5
carries along the "old" implicit indices in an additional column (level_1), while
df = pd.concat([df1, df2], keys=list('AB'), names=["Nationality"], ignore_index=True)
print(df)
>    Height  Weight
> 0     170       5
> 1     172       4
> 2     180       6
> 3     180       4
> 4     181       4
> 5     169       5
drops the keys and names entirely instead of just ignoring the indices of the source DataFrames (which would make more sense in my opinion, at least when keys and names are provided).
I get the desired result with
df = pd.concat([df1, df2], keys=list('AB'), names=["Nationality"]).reset_index(0).reset_index(0, drop=True)
print(df)
>   Nationality  Height  Weight
> 0           A     170       5
> 1           A     172       4
> 2           A     180       6
> 3           B     180       4
> 4           B     181       4
> 5           B     169       5
which is a syntactic nightmare IMHO.
Hence my two questions:
Am I missing another way to do this properly?
Or is the behaviour of the ignore_index flag faulty or misleading, and should it be the subject of a bug report?

Using both ignore_index and keys is somewhat conflicting: the first says to toss away the index information, while the second says to use it to build a MultiIndex. That said, I think pandas could give a better message (maybe just raise a ValueError) if you pass both, so feel free to open an issue.
Here's an alternative way to accomplish what you want.
In [2]: keys = ['A', 'B']
In [3]: dfs = [df1, df2]
In [4]: df = pd.concat([df.assign(Nationality=key)
   ...:                 for key, df in zip(keys, dfs)])
In [5]: df
Out[5]:
   Height  Weight Nationality
0     170       5           A
1     172       4           A
2     180       6           A
0     180       4           B
1     181       4           B
2     169       5           B
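Note that this keeps the original indices (0, 1, 2 repeated per frame). If you also want a fresh 0..n-1 index, passing ignore_index=True is now safe, because the grouping information lives in the Nationality column rather than in keys:
df = pd.concat([df.assign(Nationality=key) for key, df in zip(keys, dfs)],
               ignore_index=True)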

Related

Pandas - fill new column with values from following day

In the following dataframe
import pandas as pd

# Create data
data = {'Day': [1, 1, 2, 2, 3, 3],
        'Where': ['A', 'B', 'A', 'B', 'B', 'B'],
        'What': ['x', 'y', 'x', 'x', 'x', 'y'],
        'Dollars': [100, 200, 100, 100, 100, 200]}
index = range(len(data['Day']))
columns = ['Day', 'Where', 'What', 'Dollars']
df = pd.DataFrame(data, index=index, columns=columns)
df
I would like to add a column with the future values. In this case, the first value should be 100, since on day 2 at place A, product x was sold for 100 dollars. The complete column should contain the values 100, None, None, 100, None, None.
I thought that I could index the cells in the following way
df2 = df
df2['Tomorrow_Dollars'] = df[df.Day == df2.Day+1,'Dollars']
but this throws the following error
TypeError: 'Series' objects are mutable, thus they cannot be hashed
Is there a solution to this or a smarter approach?
The idea is to create the missing combinations by reindex with MultiIndex.from_product, reshape by unstack so that the rows are the unique Days and a shift becomes possible, then reshape back and join the result as the new column:
df1 = df.set_index(['Day','Where','What'])
mux = pd.MultiIndex.from_product(df1.index.levels, names=df1.index.names)
s = df1.reindex(mux)['Dollars'].unstack([1,2]).shift(-1).unstack().rename('Tomorrow_Dollars')
df = df.join(s, on=['Where','What','Day'])
print(df)
   Day Where What  Dollars  Tomorrow_Dollars
0    1     A    x      100             100.0
1    1     B    y      200               NaN
2    2     A    x      100               NaN
3    2     B    x      100             100.0
4    3     B    x      100               NaN
5    3     B    y      200               NaN
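An alternative sketch (my addition, not from the answer above): relabel each row to the previous day in a copy and self-merge. This assumes there is at most one row per (Day, Where, What) combination:
base = df[['Day', 'Where', 'What', 'Dollars']]
# tomorrow's sales, relabelled to today's Day
nxt = base.assign(Day=base['Day'] - 1).rename(columns={'Dollars': 'Tomorrow_Dollars'})
out = base.merge(nxt, on=['Day', 'Where', 'What'], how='left')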

Loop through a single column in one dataframe, compare to a column in another dataframe, and create a new column in the first dataframe using pandas

Right now I have two dataframes; they look like:
c = pd.DataFrame({'my_goal': [3, 4, 5, 6, 7],
                  'low_number': [0, 100, 1000, 2000, 3000],
                  'high_number': [100, 1000, 2000, 3000, 4000]})
and
a = pd.DataFrame({'a': ['a', 'b', 'c', 'd', 'e'],
                  'Number': [50, 500, 1030, 2005, 3575]})
What I want to do is: if 'Number' falls between the low number and the high number, bring back the value in 'my_goal'. For example, if we look at row 'a', its Number is 50, which falls between 0 and 100, so I want it to bring back 3. I also want the output to be a dataframe that contains all the columns from dataframe a plus the 'my_goal' column from dataframe c.
I tried making my high and low numbers into a separate list and running a for loop from that, but all that gives me are 'my_goal' numbers:
low_number = [0, 100, 1000, 2000, 3000]
for i in a:
    if float(i) >= low_number:
        a = c['my_goal']
print(a)
You can use pd.cut; when I see ranges, I first think of pd.cut:
dfa = pd.DataFrame(a)
dfc = pd.DataFrame(c)
dfa['my_goal'] = pd.cut(dfa['Number'],
                        bins=[0] + dfc['high_number'].tolist(),
                        labels=dfc['my_goal'])
Output:
   a  Number my_goal
0  a      50       3
1  b     500       4
2  c    1030       5
3  d    2005       6
4  e    3575       7
I changed the fourth row slightly (2005 → 1995) to include a test case where the condition is not met. You can concat a with the rows of c where the condition is true.
a = pd.DataFrame({'a': ['a', 'b', 'c', 'd', 'e'],
                  'Number': [50, 500, 1030, 1995, 3575]})
cond = a.Number.between(c.low_number, c.high_number)
pd.concat([a, c.loc[cond, ['my_goal']]], axis=1, join='inner')
   Number  a  my_goal
0      50  a        3
1     500  b        4
2    1030  c        5
4    3575  e        7
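Note that this between-based approach relies on the rows of a and c lining up positionally. A position-independent sketch (my addition, assuming the ranges don't overlap) uses an IntervalIndex lookup; closed='left' avoids ambiguity at the shared boundaries (100, 1000, ...):
intervals = pd.IntervalIndex.from_arrays(c['low_number'], c['high_number'],
                                         closed='left')
lookup = c.set_index(intervals)['my_goal']
# .loc on an IntervalIndex selects by containment; it raises a KeyError
# for any Number that falls outside all of the ranges
a['my_goal'] = lookup.loc[a['Number']].to_numpy()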

Pandas flatten hierarchical index on non overlapping columns

I have a dataframe, and I set the index to a column of the dataframe. This creates a hierarchical column index. I want to flatten the columns to a single level. This is similar to this question - Python Pandas - How to flatten a hierarchical index in columns - however, the columns do not overlap (i.e. 'id' is not at level 0 of the hierarchical index with the other columns at level 1).
df = pd.DataFrame([(101, 3, 'x'), (102, 5, 'y')], columns=['id', 'A', 'B'])
df.set_index('id', inplace=True)

     A  B
id
101  3  x
102  5  y
Desired output is flattened columns, like this:
id   A  B
101  3  x
102  5  y
You are misinterpreting what you are seeing.
     A  B
id
101  3  x
102  5  y
This is not showing a hierarchical column index. id is the name of the row index; in order to show you that name, pandas prints it on its own line, which creates the space you see.
The answer to your question depends on what you really want or need.
As the df is, you can dump it to a csv just the way you want:
print(df.to_csv(sep='\t'))
id   A  B
101  3  x
102  5  y

print(df.to_csv())
id,A,B
101,3,x
102,5,y
Or you can alter the df so that it displays the way you'd like:
print(df.rename_axis(None))
     A  B
101  3  x
102  5  y
Please do not do the following! I'm including it only to demonstrate the manipulation.
I could also keep the index as it is but manipulate both the column and row index names to print how you would like:
print(df.rename_axis(None).rename_axis('id', axis=1))
id    A  B
101   3  x
102   5  y
But this has named the columns' index id which makes no sense.
There will always be an index in your dataframes. If you don't set 'id' as the index, it will be at the same level as the other columns, and pandas will populate the index with increasing integers starting from 0.
df = pd.DataFrame([(101, 3, 'x'), (102, 5, 'y')], columns=['id', 'A', 'B'])
In [52]: df
Out[52]:
    id  A  B
0  101  3  x
1  102  5  y
The index is there so you can slice the original dataframe, such as:
df.iloc[0]
Out[53]:
id    101
A       3
B       x
Name: 0, dtype: object
So let's say you want 'id' both as the index and as a column, which is very redundant; you could do:
df = pd.DataFrame([(101, 3, 'x'), (102, 5, 'y')], columns=['id', 'A', 'B'])
df.set_index('id', inplace=True)
df['id'] = df.index
df
Out[55]:
     A  B   id
id
101  3  x  101
102  5  y  102
With this you can slice by 'id', such as:
df.loc[101]
Out[57]:
A       3
B       x
id    101
Name: 101, dtype: object
but it would give the same info as:
df = pd.DataFrame([(101, 3, 'x'), (102, 5, 'y')], columns=['id', 'A', 'B'])
df.set_index('id', inplace=True)
df.loc[101]
Out[58]:
A    3
B    x
Name: 101, dtype: object
Given:
>>> df2 = pd.DataFrame([(101, 3, 'x'), (102, 5, 'y')], columns=['id', 'A', 'B'])
>>> df2.set_index('id', inplace=True)
>>> df2
     A  B
id
101  3  x
102  5  y
For printing purdy, you can produce a copy of the DataFrame with the index reset and use .to_string:
>>> print(df2.reset_index().to_string(index=False))
 id  A  B
101  3  x
102  5  y
Then play around with the formatting options so that the output suits your needs:
>>> fmts = [lambda s: u"{:^5}".format(str(s).strip())] * 3
>>> print(df2.reset_index().to_string(index=False, formatters=fmts))
  id    A    B
 101    3    x
 102    5    y

Reference many columns of pandas DataFrame at once

Say I have an n ⨉ p matrix of n samples of a single feature of dimension p (for example a word2vec element, so p is of the order of ~300). I can create each column programmatically, e.g. with features = ['f' + str(i) for i in range(p)], and then append them to an existing dataframe.
Since they represent a single feature, how can I reference all those columns later on? I can assign df.feature = df[features] which works, but it breaks when I slice the dataset: df[:x].feature results in an exception.
Example:
import numpy as np
import pandas as pd

df = pre_existing_dataframe()  # such that len(df) is n
n, p = 3, 4
m = np.arange(n * p).reshape((n, p))
fs = ['f' + str(i) for i in range(p)]
df_m = pd.DataFrame(m)
df_m.columns = fs
df = pd.concat([df, df_m], axis=1)  # m is now only a part of df
df.f = df[fs]
df.f       # works: I can access the whole m at once
df[:1].f   # crashes
I wouldn't use df.f = df[fs]: it may lead to undesired and surprising behaviour if you try to modify the data frame. Instead, I'd consider creating hierarchical columns, as in the example below.
Say, we already have a preexisting data frame df0 and another one with features:
df0 = pd.DataFrame(np.arange(4).reshape(2, 2), columns=['A', 'B'])
df1 = pd.DataFrame(np.arange(10, 16).reshape(2, 3), columns=['f0', 'f1', 'f2'])
Then, using the keys argument to concat, we create another level in columns:
df = pd.concat([df0, df1], keys=['pre', 'feat1'], axis=1)
df
Out[103]:
  pre    feat1
    A  B    f0  f1  f2
0   0  1    10  11  12
1   2  3    13  14  15
The subframe with features can be accessed as follows:
df['feat1']
Out[104]:
   f0  f1  f2
0  10  11  12
1  13  14  15

df[('feat1', 'f0')]
Out[105]:
0    10
1    13
Name: (feat1, f0), dtype: int64
Slicing on rows is straightforward. Slicing on columns may be more complicated:
df.loc[:, pd.IndexSlice['feat1', :]]
Out[106]:
  feat1
     f0  f1  f2
0    10  11  12
1    13  14  15

df.loc[:, pd.IndexSlice['feat1', 'f0':'f1']]
Out[107]:
  feat1
     f0  f1
0    10  11
1    13  14
To modify values in the data frame, use .loc, for example df.loc[1:, ('feat1', 'f1')] = -1. (More on hierarchical indexing, slicing etc.)
It's also possible to append another frame to df.
# another set of features
df2 = pd.DataFrame(np.arange(100, 108).reshape(2, 4),
                   columns=['f0', 'f1', 'f2', 'f3'])
# create a MultiIndex:
idx = pd.MultiIndex.from_product([['feat2'], df2.columns])
# append
df[idx] = df2
df
Out[117]:
  pre    feat1          feat2
    A  B    f0  f1  f2     f0   f1   f2   f3
0   0  1    10  11  12    100  101  102  103
1   2  3    13  -1  15    104  105  106  107
To keep a nice layout, it's important that idx has the same number of levels as df.columns.
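As a usage note (my addition): with this layout, a cross-section pulls one feature column out of every group at once:
df.xs('f0', axis=1, level=1)   # the 'f0' column under each top-level key;
                               # 'pre' has no 'f0', so it is dropped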

Element-wise average and standard deviation across multiple dataframes

Data:
Multiple dataframes of the same format (same columns, an equal number of rows, and no points missing).
How do I create a "summary" dataframe that contains an element-wise mean for every element? How about a dataframe that contains an element-wise standard deviation?
          A         B         C
0 -1.624722 -1.160731  0.016726
1 -1.565694  0.989333  1.040820
2 -0.484945  0.718596 -0.180779
3  0.388798 -0.997036  1.211787
4 -0.249211  1.604280 -1.100980
5  0.062425  0.925813 -1.810696
6  0.793244 -1.860442 -1.196797

          A         B         C
0  1.016386  1.766780  0.648333
1 -1.101329 -1.021171  0.830281
2 -1.133889 -2.793579  0.839298
3  1.134425  0.611480 -1.482724
4 -0.066601 -2.123353  1.136564
5 -0.167580 -0.991550  0.660508
6  0.528789 -0.483008  1.472787
You can create a Panel of your DataFrames and then compute the mean and standard deviation along the items axis. (Note that pd.Panel has since been removed from pandas; a concat-based equivalent is sketched after the output below.)
df1 = pd.DataFrame(np.random.randn(10, 3), columns=['A', 'B', 'C'])
df2 = pd.DataFrame(np.random.randn(10, 3), columns=['A', 'B', 'C'])
df3 = pd.DataFrame(np.random.randn(10, 3), columns=['A', 'B', 'C'])
p = pd.Panel({n: df for n, df in enumerate([df1, df2, df3])})
>>> p.mean(axis=0)
          A         B         C
0 -0.024284 -0.622337  0.581292
1  0.186271  0.596634 -0.498755
2  0.084591 -0.760567 -0.334429
3 -0.833688  0.403628  0.013497
4  0.402502 -0.017670 -0.369559
5  0.733305 -1.311827  0.463770
6 -0.941334  0.843020 -1.366963
7  0.134700  0.626846  0.994085
8 -0.783517  0.703030 -1.187082
9 -0.954325  0.514671 -0.370741

>>> p.std(axis=0)
          A         B         C
0  0.196526  1.870115  0.503855
1  0.719534  0.264991  1.232129
2  0.315741  0.773699  1.328869
3  1.169213  1.488852  1.149105
4  1.416236  1.157386  0.414532
5  0.554604  1.022169  1.324711
6  0.178940  1.107710  0.885941
7  1.270448  1.023748  1.102772
8  0.957550  0.355523  1.284814
9  0.582288  0.997909  1.566383
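On pandas versions without Panel, a hedged equivalent of the above (my sketch): stack the frames with concat under integer keys, then group by the original row index:
stacked = pd.concat([df1, df2, df3], keys=[0, 1, 2])
mean_df = stacked.groupby(level=1).mean()   # element-wise mean across frames
std_df = stacked.groupby(level=1).std()     # element-wise sample std (ddof=1)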
One simple solution here is to concatenate the existing dataframes into a single dataframe, while adding an ID variable to track the original source:
dfa = pd.DataFrame(np.random.randn(2, 2), columns=['a', 'b']).assign(id='a')
dfb = pd.DataFrame(np.random.randn(2, 2), columns=['a', 'b']).assign(id='b')
df = pd.concat([dfa, dfb])

          a         b id
0 -0.542652  1.609213  a
1 -0.192136  0.458564  a
0 -0.231949 -0.000573  b
1  0.245715 -0.083786  b
So now you have two 2x2 dataframes combined into a single 4x2 dataframe. The 'id' column identifies the source dataframe, so you haven't lost any generality, and you can select on 'id' to do the same thing you would do to any single dataframe, e.g. df[df['id'] == 'a'].
But now you can also use groupby to apply any pandas method such as mean() or std() on an element-by-element basis across the source frames, by grouping on the row index (which concat preserved); numeric_only=True excludes the string 'id' column:
df.groupby(level=0).mean(numeric_only=True)

          a         b
0  0.198164 -0.811475
1  0.639529  0.812810
The following solution worked for me.
average_data_frame = (dataframe1 + dataframe2 ) / 2
Or, if you have more than two dataframes, say a list frames of length n:
average_data_frame = frames[0].copy()
for other in frames[1:]:
    average_data_frame = average_data_frame + other
average_data_frame = average_data_frame / len(frames)
Once you have the average, you can go for the standard deviation. If you are looking for a "true Pythonic" approach, you should follow the other answers, but if you want a quick, working solution, this is it.
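For the standard deviation in the same manual spirit, a sketch via numpy (my addition; assumes frames is the list of aligned DataFrames from above):
import numpy as np
import pandas as pd

arr = np.stack([f.to_numpy() for f in frames])   # shape: (n, rows, cols)
std_df = pd.DataFrame(arr.std(axis=0, ddof=1),   # sample std across frames
                      index=frames[0].index,
                      columns=frames[0].columns)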
