Pandas flatten hierarchical index on non overlapping columns - python

I have a dataframe, and I set the index to one of its columns. This appears to create a hierarchical column index, and I want to flatten the columns to a single level. This is similar to the question "Python Pandas - How to flatten a hierarchical index in columns", except that the columns do not overlap (i.e. 'id' is not at level 0 of the hierarchical index, and the other columns are at level 1 of the index).
df = pd.DataFrame([(101,3,'x'), (102,5,'y')], columns=['id', 'A', 'B'])
df.set_index('id', inplace=True)
A B
id
101 3 x
102 5 y
Desired output is flattened columns, like this:
id A B
101 3 x
102 5 y

You are misinterpreting what you are seeing. This output:
A B
id
101 3 x
102 5 y
is not showing you a hierarchical column index. id is the name of the row index, and pandas prints that extra line so it has somewhere to show the index name.
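A quick way to confirm this, sketched with the same two-row frame from the question, is to inspect the column index and the row index name directly:

```python
import pandas as pd

df = pd.DataFrame([(101, 3, 'x'), (102, 5, 'y')], columns=['id', 'A', 'B'])
df = df.set_index('id')

# The columns are a plain single-level Index, not a MultiIndex.
print(df.columns.nlevels)   # 1
print(list(df.columns))     # ['A', 'B']

# 'id' is the name of the row index, which is why it is printed on its own line.
print(df.index.name)        # id
```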
The answer to your question depends on what you really want or need.
As the df is, you can dump it to a csv just the way you want:
print(df.to_csv(sep='\t'))
id A B
101 3 x
102 5 y
print(df.to_csv())
id,A,B
101,3,x
102,5,y
Or you can alter the df so that it displays the way you'd like
print(df.rename_axis(None))
A B
101 3 x
102 5 y
Please do not do this! I'm only including it to demonstrate how the axis names can be manipulated.
I could also keep the index as it is but manipulate both the column and row index names so that it prints the way you would like.
print(df.rename_axis(None).rename_axis('id', axis=1))
id A B
101 3 x
102 5 y
But this has named the columns' index id which makes no sense.

There will always be an index in your dataframes. If you don't set 'id' as the index, it stays at the same level as the other columns, and pandas generates an increasing integer index starting from 0.
df = pd.DataFrame([(101,3,'x'), (102,5,'y')], columns=['id', 'A', 'B'])
In[52]: df
Out[52]:
id A B
0 101 3 x
1 102 5 y
The index is there so you can slice the original dataframe, such as:
df.iloc[0]
Out[53]:
id 101
A 3
B x
Name: 0, dtype: object
So let's say you want id both as the index and as a column, which is very redundant. You could do:
df = pd.DataFrame([(101,3,'x'), (102,5,'y')], columns=['id', 'A', 'B'])
df.set_index('id', inplace=True)
df['id'] = df.index
df
Out[55]:
A B id
id
101 3 x 101
102 5 y 102
With this you can slice by 'id', such as:
df.loc[101]
Out[57]:
A 3
B x
id 101
Name: 101, dtype: object
But it would be the same info as:
df = pd.DataFrame([(101,3,'x'), (102,5,'y')], columns=['id', 'A', 'B'])
df.set_index('id', inplace=True)
df.loc[101]
Out[58]:
A 3
B x
Name: 101, dtype: object
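If the goal is simply to have id as an ordinary column again, reset_index() undoes set_index(). A minimal sketch using the same data (the names indexed and flat are only illustrative):

```python
import pandas as pd

df = pd.DataFrame([(101, 3, 'x'), (102, 5, 'y')], columns=['id', 'A', 'B'])
indexed = df.set_index('id')

# Label-based lookup uses the 'id' values once they are the index.
row = indexed.loc[101]

# reset_index() moves 'id' back to an ordinary column and restores 0..n-1.
flat = indexed.reset_index()
print(list(flat.columns))  # ['id', 'A', 'B']
```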

Given:
>>> df2=pd.DataFrame([(101,3,'x'), (102,5,'y')], columns=['id', 'A', 'B'])
>>> df2.set_index('id', inplace=True)
>>> df2
A B
id
101 3 x
102 5 y
For pretty printing, you can produce a copy of the DataFrame with the index reset and use .to_string:
>>> print(df2.reset_index().to_string(index=False))
id A B
101 3 x
102 5 y
Then play around with the formatting options so that the output suits your needs:
>>> fmts = [lambda s: "{:^5}".format(str(s).strip())] * 3
>>> print(df2.reset_index().to_string(index=False, formatters=fmts))
id A B
101 3 x
102 5 y

Related

Filter Columns from Pandas Dataframe with given list when list elements may or may not be present as column

I have a huge dataframe and I need to filter out the columns from the dataframe if the columns are present in a given list.
For example,
df = pd.DataFrame([[1,2,3,4,5],[6,7,8,9,10]], columns=list('ABCDE'))
This is the dataframe.
A B C D E
0 1 2 3 4 5
1 6 7 8 9 10
I have a list.
fil_lst = ['A', 'D', 'F']
The list may contain column names that are not present in the dataframe. I need only the columns that are present in the dataframe.
I need the resulting dataframe like,
A D
0 1 4
1 6 9
I know it can be done with the help of list comprehension like,
new_df = df[[col for col in fil_lst if col in df.columns]]
But as I have a huge dataframe, it is better if I don't use this computationally expensive process.
Is it possible to vectorize this in any way?
Use Index.isin to test membership in the columns and DataFrame.loc to filter by columns; the : selects all rows, and the boolean mask selects the columns:
fil_lst = ['A', 'D', 'F']
df = df.loc[:, df.columns.isin(fil_lst)]
print(df)
A D
0 1 4
1 6 9
Or use Index.intersection:
fil_lst = ['A', 'D', 'F']
df = df[df.columns.intersection(fil_lst)]
print(df)
A D
0 1 4
1 6 9
If you are dealing with large lists, and the focus is on performance more than order of columns, you can use set intersection:
In [2944]: fil_lst = ['A', 'D', 'F']
In [2945]: col_list = df.columns.tolist()
In [2947]: df = df[list(set(col_list) & set(fil_lst))]
In [2947]: df
Out[2947]:
D A
0 4 1
1 9 6
EDIT: If order of columns is important, then do this:
In [2953]: df = df[sorted(set(col_list) & set(fil_lst), key = col_list.index)]
In [2953]: df
Out[2953]:
A D
0 1 4
1 6 9
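Another option, not mentioned in the answers above, is DataFrame.filter with items=, which keeps only the labels that actually exist and follows the order of the list. A small sketch on the question's data (result is just an illustrative name):

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]], columns=list('ABCDE'))
fil_lst = ['A', 'D', 'F']

# filter(items=...) keeps only the labels that exist; 'F' is silently skipped.
result = df.filter(items=fil_lst)
print(list(result.columns))  # ['A', 'D']
```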

How to create a dataframe in pandas?

I have two lists. One is called 'Region' and another is called 'Products'. Region has 111 distinct values and Products has 1181 distinct values. I want to create a dataframe of each combination of products and region from these two lists.
For example, I want this type of dataframe made up of two list. Here, product has 2 values and region has 3 values.
Pdts Region
0 A X
1 B X
2 A Y
3 B Y
4 A Z
5 B Z
I want this type of dataframe but my 'Region' list has 111 distinct values and 'Products' list has 1181 distinct values. How can I achieve this?
That's a Cartesian product
import pandas as pd
df1 = pd.DataFrame({'dataframe1': ['A', 'B']})
df2 = pd.DataFrame({'dataframe2': ['X', 'Y', 'Z']})
product_df = (
    df1.assign(key=1)                       # add a constant helper key
    .merge(df2.assign(key=1), on='key')     # database-style join on that key
    .drop(columns='key')                    # drop the helper column again
)
print(product_df)
Output
dataframe1 dataframe2
0 A X
1 A Y
2 A Z
3 B X
4 B Y
5 B Z
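If the inputs are plain Python lists, as in the question, the same Cartesian product can also be built with itertools.product. This is a sketch under the assumption that the lists are called products and regions (short stand-ins here), and it reproduces the row order the question asked for:

```python
from itertools import product

import pandas as pd

products = ['A', 'B']          # stand-in for the 1181 product values
regions = ['X', 'Y', 'Z']      # stand-in for the 111 region values

# product(regions, products) varies the product fastest within each region.
pairs = list(product(regions, products))
combos = pd.DataFrame(pairs, columns=['Region', 'Pdts'])[['Pdts', 'Region']]
print(combos)
```

With the full lists this yields 111 * 1181 rows, one per combination.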
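You can do something like this (it needs both lists to have the same length, so repeat or tile them first if you want every combination):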
import pandas as pd
d = {'Region': first_list, 'Products': second_list}
df = pd.DataFrame(data=d)
Ya mean:
>>> df['Region'] = ['X', 'Y', 'Z'] * (len(df) // 3)
>>> df
Pdts Region
0 A X
1 B Y
2 A Z
3 B X
4 A Y
5 B Z
>>>

reshape a pandas dataframe

Suppose a dataframe like this one:
df = pd.DataFrame([[1,2,3,4],[5,6,7,8],[9,10,11,12]], columns = ['A', 'B', 'A1', 'B1'])
I would like to have a dataframe which looks like:
What does not work:
new_rows = int(df.shape[1]/2) * df.shape[0]
new_cols = 2
df.values.reshape(new_rows, new_cols, order='F')
Of course I could loop over the data and build a new list of lists, but there must be a better way. Any ideas?
The pd.wide_to_long function is built almost exactly for this situation, where you have many of the same variable prefixes that end in a different digit suffix. The only difference here is that your first set of variables don't have a suffix, so you will need to rename your columns first.
The only issue with pd.wide_to_long is that it must have an identification variable, i, unlike melt. reset_index is used to create this uniquely identifying column, which is dropped later. I think this might get corrected in the future.
df1 = df.rename(columns={'A':'A1', 'B':'B1', 'A1':'A2', 'B1':'B2'}).reset_index()
pd.wide_to_long(df1, stubnames=['A', 'B'], i='index', j='id')\
.reset_index()[['A', 'B', 'id']]
A B id
0 1 2 1
1 5 6 1
2 9 10 1
3 3 4 2
4 7 8 2
5 11 12 2
You can use lreshape, and numpy.repeat for the id column:
a = [col for col in df.columns if 'A' in col]
b = [col for col in df.columns if 'B' in col]
df1 = pd.lreshape(df, {'A' : a, 'B' : b})
df1['id'] = np.repeat(np.arange(len(df.columns) // 2), len (df.index)) + 1
print (df1)
A B id
0 1 2 1
1 5 6 1
2 9 10 1
3 3 4 2
4 7 8 2
5 11 12 2
EDIT:
lreshape is currently undocumented, and it is possible that it will be removed (along with pd.wide_to_long).
A possible solution is merging all 3 functions into one, maybe melt, but that is not implemented yet. Maybe it will be in some new version of pandas; then I will update my answer.
I solved this in 3 steps:
1. Make a new dataframe df2 holding only the data you want to add to the initial dataframe df.
2. Delete the data from df that will be added below (and that was used to make df2).
3. Append df2 to df.
Like so:
# step 1: create new dataframe
df2 = df[['A1', 'B1']]
df2.columns = ['A', 'B']
# step 2: delete that data from original
df = df.drop(["A1", "B1"], axis=1)
# step 3: append (DataFrame.append was removed in pandas 2.0; pd.concat does the same job)
df = pd.concat([df, df2], ignore_index=True)
Note that you need to specify ignore_index=True so the appended rows get new index labels rather than keeping their old ones.
Your end result should be your original dataframe with the data rearranged like you wanted:
In [16]: df
Out[16]:
A B
0 1 2
1 5 6
2 9 10
3 3 4
4 7 8
5 11 12
Use pd.concat() like so:
#Split into separate tables
df_1 = df[['A', 'B']]
df_2 = df[['A1', 'B1']]
df_2.columns = ['A', 'B'] # Make column names line up
# Add the ID column
df_1 = df_1.assign(id=1)
df_2 = df_2.assign(id=2)
# Concatenate
pd.concat([df_1, df_2])
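The reshape attempt in the question can also be made to work at the NumPy level. The sketch below assumes the columns come in consecutive (A, B) pairs, as they do here; long_df and the intermediate names are only illustrative:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]],
                  columns=['A', 'B', 'A1', 'B1'])

n_rows = len(df)
n_groups = df.shape[1] // 2  # assumes columns come in consecutive (A, B) pairs

# Shape (rows, groups, 2), then bring the group axis to the front so that
# all rows of the first pair come before all rows of the second pair.
stacked = df.to_numpy().reshape(n_rows, n_groups, 2).transpose(1, 0, 2)
long_df = pd.DataFrame(stacked.reshape(-1, 2), columns=['A', 'B'])
long_df['id'] = np.repeat(np.arange(1, n_groups + 1), n_rows)
print(long_df)
```

The plain reshape with order='F' fails because it interleaves values column by column; moving the group axis first is what keeps each (A, B) pair together.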

Concatenation of pandas DataFrames including rows' source-dataframe/-category

I need to pandas.concat(..., axis=0, ...) multiple DataFrames so that the resulting DataFrame has a new column holding information about which dataset a row belongs to, while dropping the implicit indices of the original DataFrames.
In this MWE, for example, DataFrames df1 and df2 hold the heights and weights of people from two countries (A and B, respectively).
import pandas as pd
df1 = pd.DataFrame({'Weight': [5, 4, 6], 'Height': [170, 172, 180]})
df2 = pd.DataFrame({'Weight': [4, 4, 5], 'Height': [180, 181, 169]})
The concatenated DataFrame df needs to store the country for each row, but
df = pd.concat([df1, df2], keys=list('AB'), names=["Nationality"]).reset_index()
print(df)
> Nationality level_1 Height Weight
>0 A 0 170 5
>1 A 1 172 4
>2 A 2 180 6
>3 B 0 180 4
>4 B 1 181 4
>5 B 2 169 5
carries along the "old" implicit indices in an additional column (level_1), while
df = pd.concat([df1, df2], keys=list('AB'), names=["Nationality"], ignore_index=True).reset_index()
print(df)
> Height Weight
>0 170 5
>1 172 4
>2 180 6
>3 180 4
>4 181 4
>5 169 5
ignores the columns marked as index of the new DataFrame instead of the indices of the source DataFrames (which would make more sense in my opinion, at least if keys and names are provided).
I get the desired result with
df = pd.concat([df1, df2], keys=list('AB'), names=["Nationality"]).reset_index(0).reset_index(0, drop=True)
print(df)
> Nationality Height Weight
>0 A 170 5
>1 A 172 4
>2 A 180 6
>3 B 180 4
>4 B 181 4
>5 B 169 5
which is a syntactic nightmare IMHO.
Hence my two questions:
Am I missing another way to do this properly?
Or is the behaviour of the ignore_index-flag faulty or misleading and should be subject to a bug-report?
Using both ignore_index and keys is somewhat conflicting - the first says to toss away the index information, and the second says to use it to make a MultiIndex. That said, I think pandas could give a better message (maybe just raise a ValueError) if you pass both, so you can make an issue.
Here's an alternative way to accomplish what you want.
In [2]: keys = ['A', 'B']
In [3]: dfs = [df1, df2]
In [4]: df = pd.concat([df.assign(Nationality=key) for key, df
...: in zip(keys, dfs)])
In [5]: df
Out[5]:
Height Weight Nationality
0 170 5 A
1 172 4 A
2 180 6 A
0 180 4 B
1 181 4 B
2 169 5 B
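The output above still repeats the per-frame labels 0, 1, 2. Since the question asked for those to be dropped, a small variation is to pass ignore_index=True to pd.concat, which no longer conflicts with anything because keys is not used. A sketch with the question's data (frames and combined are illustrative names):

```python
import pandas as pd

df1 = pd.DataFrame({'Weight': [5, 4, 6], 'Height': [170, 172, 180]})
df2 = pd.DataFrame({'Weight': [4, 4, 5], 'Height': [180, 181, 169]})

frames = {'A': df1, 'B': df2}

# ignore_index=True discards the per-frame 0..2 labels and renumbers the rows.
combined = pd.concat(
    [frame.assign(Nationality=key) for key, frame in frames.items()],
    ignore_index=True,
)
print(combined)
```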

Element-wise average and standard deviation across multiple dataframes

Data:
Multiple dataframes of the same format (same columns, an equal number of rows, and no points missing).
How do I create a "summary" dataframe that contains an element-wise mean for every element? How about a dataframe that contains an element-wise standard deviation?
A B C
0 -1.624722 -1.160731 0.016726
1 -1.565694 0.989333 1.040820
2 -0.484945 0.718596 -0.180779
3 0.388798 -0.997036 1.211787
4 -0.249211 1.604280 -1.100980
5 0.062425 0.925813 -1.810696
6 0.793244 -1.860442 -1.196797
A B C
0 1.016386 1.766780 0.648333
1 -1.101329 -1.021171 0.830281
2 -1.133889 -2.793579 0.839298
3 1.134425 0.611480 -1.482724
4 -0.066601 -2.123353 1.136564
5 -0.167580 -0.991550 0.660508
6 0.528789 -0.483008 1.472787
In older pandas versions that still provide pd.Panel (it was removed in pandas 0.25), you can create a panel of your DataFrames and then compute the mean and SD along the items axis:
df1 = pd.DataFrame(np.random.randn(10, 3), columns=['A', 'B', 'C'])
df2 = pd.DataFrame(np.random.randn(10, 3), columns=['A', 'B', 'C'])
df3 = pd.DataFrame(np.random.randn(10, 3), columns=['A', 'B', 'C'])
p = pd.Panel({n: df for n, df in enumerate([df1, df2, df3])})
>>> p.mean(axis=0)
A B C
0 -0.024284 -0.622337 0.581292
1 0.186271 0.596634 -0.498755
2 0.084591 -0.760567 -0.334429
3 -0.833688 0.403628 0.013497
4 0.402502 -0.017670 -0.369559
5 0.733305 -1.311827 0.463770
6 -0.941334 0.843020 -1.366963
7 0.134700 0.626846 0.994085
8 -0.783517 0.703030 -1.187082
9 -0.954325 0.514671 -0.370741
>>> p.std(axis=0)
A B C
0 0.196526 1.870115 0.503855
1 0.719534 0.264991 1.232129
2 0.315741 0.773699 1.328869
3 1.169213 1.488852 1.149105
4 1.416236 1.157386 0.414532
5 0.554604 1.022169 1.324711
6 0.178940 1.107710 0.885941
7 1.270448 1.023748 1.102772
8 0.957550 0.355523 1.284814
9 0.582288 0.997909 1.566383
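On a pandas version without pd.Panel, one way to get the same element-wise statistics is to concatenate the frames and group on the shared row index. This is a sketch with seeded random data rather than the values shown above, and it assumes all frames share the same index and columns:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
frames = [pd.DataFrame(rng.standard_normal((10, 3)), columns=['A', 'B', 'C'])
          for _ in range(3)]

# Stack the frames vertically; rows that share an index label are the
# corresponding elements across frames, so group on the index level.
stacked = pd.concat(frames)
elementwise_mean = stacked.groupby(level=0).mean()
elementwise_std = stacked.groupby(level=0).std()   # sample std (ddof=1)
```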
One simple solution here is to concatenate the existing dataframes into a single dataframe while adding an ID variable to track the original source:
dfa = pd.DataFrame(np.random.randn(2, 2), columns=['a', 'b']).assign(id='a')
dfb = pd.DataFrame(np.random.randn(2, 2), columns=['a', 'b']).assign(id='b')
df = pd.concat([dfa, dfb])
a b id
0 -0.542652 1.609213 a
1 -0.192136 0.458564 a
0 -0.231949 -0.000573 b
1 0.245715 -0.083786 b
So now you have two 2x2 dataframes combined into a single 4x2 dataframe (plus the 'id' column). The 'id' column identifies the source dataframe, so you haven't lost any generality and can select on 'id' to do the same thing you would do to any single dataframe, e.g. df[df['id'] == 'a'].
But now you can also group on the original row index to apply any pandas method, such as mean() or std(), on an element-by-element basis:
df.reset_index().groupby('index')[['a', 'b']].mean()
a b
index
0 0.198164 -0.811475
1 0.639529 0.812810
The following solution worked for me.
average_data_frame = (dataframe1 + dataframe2 ) / 2
Or, if you have more than two dataframes, say n of them in a list (called dataframes here), then
average_data_frame = dataframes[0]
for i in range(1, n):
    average_data_frame = average_data_frame + dataframes[i]
average_data_frame = average_data_frame / n
Once you have the average, you can go on to the standard deviation. If you are looking for a "true Pythonic" approach, you should follow the other answers. But if you are looking for a quick, working solution, this is it.
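The standard deviation step could look something like the following. This is only a sketch: the list name dataframes and its values are made up for illustration, and it computes the sample standard deviation (dividing by n - 1), which matches the pandas default.

```python
import numpy as np
import pandas as pd

# Three small frames with identical shape and labels (made-up values).
dataframes = [
    pd.DataFrame({'A': [1.0, 2.0], 'B': [3.0, 4.0]}),
    pd.DataFrame({'A': [3.0, 2.0], 'B': [5.0, 4.0]}),
    pd.DataFrame({'A': [5.0, 2.0], 'B': [7.0, 4.0]}),
]
n = len(dataframes)

mean_df = sum(dataframes) / n

# Sample variance: squared deviations from the mean, divided by n - 1.
squared_dev = sum((frame - mean_df) ** 2 for frame in dataframes)
std_df = np.sqrt(squared_dev / (n - 1))
```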
