Concatenate dataframes along columns in a pandas dataframe - python
I want to concatenate two dataframes along the columns. Both have the same number of rows.
df1
A B C
0 1 2 3
1 4 5 6
2 7 8 9
3 10 11 12
df2
D E F
0 13 14 15
1 16 17 18
2 19 20 21
3 22 23 24
Expected:
A B C D E F
0 1 2 3 13 14 15
1 4 5 6 16 17 18
2 7 8 9 19 20 21
3 10 11 12 22 23 24
I have done:
df_combined = pd.concat([df1,df2], axis=1)
But df_combined has new rows, with NaN values in some columns...
I can't find my error. So, what do I have to do? Thanks in advance!
In this case, merge() works.
pd.merge(df1, df2, left_index=True, right_index=True)
output
A B C D E F
0 1 2 3 13 14 15
1 4 5 6 16 17 18
2 7 8 9 19 20 21
3 10 11 12 22 23 24
This works only if both dataframes have the same indices.
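The NaN rows usually mean the two frames carry different index labels, so pd.concat aligns on them instead of stacking positionally. A minimal sketch (with a deliberately mismatched index to reproduce the symptom):

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 4], 'B': [2, 5], 'C': [3, 6]})           # index 0, 1
df2 = pd.DataFrame({'D': [13, 16], 'E': [14, 17], 'F': [15, 18]},
                   index=[10, 11])                                    # mismatched index

# Aligning on the raw index produces NaN-padded rows:
misaligned = pd.concat([df1, df2], axis=1)
print(misaligned.shape)  # (4, 6) -- four rows, half of them NaN

# Resetting both indices restores the expected side-by-side result:
combined = pd.concat([df1.reset_index(drop=True),
                      df2.reset_index(drop=True)], axis=1)
print(combined)
```

So merge on the index works, but resetting the indices before concatenating fixes the root cause.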
Related
Dynamically create columns in a dataframe
I have a Dataframe like the following:

   a  b  a1  b1
0  1  6  10  20
1  2  7  11  21
2  3  8  12  22
3  4  9  13  23
4  5  2  14  24

where a1 and b1 are dynamically created from a and b. Can we create percentage columns dynamically as well? The one thing that is constant is that the created columns will have 1 suffixed after the name.

Expected output:

   a  b  a1  b1   a%  b%
0  0  6  10  20    0  30
1  2  7  11  21   29  33
2  3  8  12  22   38  36
3  4  9  13  23   44  39
4  5  2  14  24  250   8
Create a new DataFrame by dividing both columns, rename the columns with DataFrame.add_suffix, and last append to the original by DataFrame.join:

cols = ['a','b']
new = [f'{x}1' for x in cols]
df = df.join(df[cols].div(df[new].to_numpy()).mul(100).add_suffix('%'))
print (df)

   a  b  a1  b1         a%         b%
0  1  6  10  20  10.000000  30.000000
1  2  7  11  21  18.181818  33.333333
2  3  8  12  22  25.000000  36.363636
3  4  9  13  23  30.769231  39.130435
4  5  2  14  24  35.714286   8.333333
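The same pipeline end to end, with the question's frame rebuilt inline (column values taken from the question's first table):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, 5],
                   'b': [6, 7, 8, 9, 2],
                   'a1': [10, 11, 12, 13, 14],
                   'b1': [20, 21, 22, 23, 24]})

cols = ['a', 'b']
new = [f'{x}1' for x in cols]   # the matching '<name>1' columns

# Divide each base column by its '1'-suffixed partner, scale to percent,
# tag the result columns with '%', and join back onto the original frame
pct = df[cols].div(df[new].to_numpy()).mul(100).add_suffix('%')
df = df.join(pct)
print(df)
```

Because the suffix rule is fixed, the same three lines work for any number of base columns.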
CSV & Pandas: Unnamed columns and multi-index
I have a set of data:

,,England,,,,,,,,,,,,France,,,,,,,,,,,
,,,,,,,,,,,,,,,,,,,,,,,,,
,,Store 1,,,,Store 2,,,,Store 3,,,,Store 1,,,,Store 2,,,,Store 3,,,
,,,,,,,,,,,,,,,,,,,,,,,,,
,,F,P,M,D,F,P,M,D,F,P,M,D,F,P,M,D,F,P,M,D,F,P,M,D
,,,,,,,,,,,,,,,,,,,,,,,,,
Year 1,M,0,5,7,9,2,18,5,10,4,9,6,2,4,14,18,11,10,19,18,20,3,17,19,13
,,,,,,,,,,,,,,,,,,,,,,,,,
,F,0,13,14,11,0,6,8,6,2,12,14,9,9,17,12,18,6,17,16,14,0,4,2,5
,,,,,,,,,,,,,,,,,,,,,,,,,
Year 2,M,5,10,6,6,1,20,5,18,4,9,6,2,10,13,15,19,2,18,16,13,1,19,5,12
,,,,,,,,,,,,,,,,,,,,,,,,,
,F,1,11,14,15,0,9,9,2,2,12,14,9,7,17,18,14,9,18,13,14,0,9,2,10
,,,,,,,,,,,,,,,,,,,,,,,,,
Evening,M,4,10,6,5,3,13,19,5,4,9,6,2,8,17,10,18,3,11,20,11,4,18,17,20
,,,,,,,,,,,,,,,,,,,,,,,,,
,F,4,12,12,13,0,9,3,8,2,12,14,9,0,18,11,18,1,13,13,10,0,6,2,8

The desired output I'm trying to achieve is:

I know that I can read the CSV and remove any NaN rows with:

df = pd.read_csv("Stores.csv", skipinitialspace=True)
df.dropna(how="all", inplace=True)

My 2 main issues are:

How do I group the unnamed columns so that they are just the countries "England" and "France"?
How do I set up an index so that each of the 3 stores falls under the relevant country?

I believe that I can use hierarchical indexing for the headings, but all the examples I've come across use nice, clean data frames, unlike my CSV. I'd be very grateful if someone could point me in the right direction, as I'm fairly new to pandas. Thank you.
You can try this:

from io import StringIO
import pandas as pd
import numpy as np

test = StringIO(""",,England,,,,,,,,,,,,France,,,,,,,,,,,
,,,,,,,,,,,,,,,,,,,,,,,,,
,,Store 1,,,,Store 2,,,,Store 3,,,,Store 1,,,,Store 2,,,,Store 3,,,
,,,,,,,,,,,,,,,,,,,,,,,,,
,,F,P,M,D,F,P,M,D,F,P,M,D,F,P,M,D,F,P,M,D,F,P,M,D
,,,,,,,,,,,,,,,,,,,,,,,,,
Year 1,M,0,5,7,9,2,18,5,10,4,9,6,2,4,14,18,11,10,19,18,20,3,17,19,13
,,,,,,,,,,,,,,,,,,,,,,,,,
,F,0,13,14,11,0,6,8,6,2,12,14,9,9,17,12,18,6,17,16,14,0,4,2,5
,,,,,,,,,,,,,,,,,,,,,,,,,
Year 2,M,5,10,6,6,1,20,5,18,4,9,6,2,10,13,15,19,2,18,16,13,1,19,5,12
,,,,,,,,,,,,,,,,,,,,,,,,,
,F,1,11,14,15,0,9,9,2,2,12,14,9,7,17,18,14,9,18,13,14,0,9,2,10
,,,,,,,,,,,,,,,,,,,,,,,,,
Evening,M,4,10,6,5,3,13,19,5,4,9,6,2,8,17,10,18,3,11,20,11,4,18,17,20
,,,,,,,,,,,,,,,,,,,,,,,,,
,F,4,12,12,13,0,9,3,8,2,12,14,9,0,18,11,18,1,13,13,10,0,6,2,8""")

df = pd.read_csv(test, index_col=[0, 1], header=[0, 1, 2],
                 skiprows=lambda x: x % 2 == 1)
df.columns = pd.MultiIndex.from_frame(
    df.columns.to_frame()
              .apply(lambda x: np.where(x.str.contains('Unnamed'), np.nan, x))
              .ffill())
df.index = pd.MultiIndex.from_frame(df.index.to_frame().ffill())
print(df)

Output:

0         England                              ...  France
1         Store 1        Store 2       Store 3 ... Store 1        Store 2       Store 3
2               F   P  M  D  F   P  M  D  F  P ...       M   D  F   P  M  D  F  P  M  D
0       1                                      ...
Year 1  M       0   5  7  9  2  18  5 10  4  9 ...      18  11 10  19 18 20  3 17 19 13
        F       0  13 14 11  0   6  8  6  2 12 ...      12  18  6  17 16 14  0  4  2  5
Year 2  M       5  10  6  6  1  20  5 18  4  9 ...      15  19  2  18 16 13  1 19  5 12
        F       1  11 14 15  0   9  9  2  2 12 ...      18  14  9  18 13 14  0  9  2 10
Evening M       4  10  6  5  3  13 19  5  4  9 ...      10  18  3  11 20 11  4 18 17 20
        F       4  12 12 13  0   9  3  8  2 12 ...      11  18  1  13 13 10  0  6  2  8

[6 rows x 24 columns]
You'll have to set the (multi) index and headers yourself:

df = pd.read_csv("Stores.csv", header=None)
df.dropna(how='all', inplace=True)
df.reset_index(inplace=True, drop=True)

# getting headers as a product of [England, France], [Store 1, Store 2, Store 3] and [F, P, M, D]
headers = pd.MultiIndex.from_product([df.iloc[0].dropna().unique(),
                                      df.iloc[1].dropna().unique(),
                                      df.iloc[2].dropna().unique()])

df.drop([0, 1, 2], inplace=True)    # removing header rows
df[0].ffill(inplace=True)           # filling nan values for first index col
df.set_index([0, 1], inplace=True)  # setting multiindex
df.columns = headers
print(df)

Output:

          England                                 ...  France
          Store 1        Store 2        Store 3   ... Store 1        Store 2       Store 3
                F   P  M   D  F   P  M   D  F   P ...       P   M   D  F   P  M  D  F  P  M  D
0       1                                         ...
Year 1  M       0   5  7   9  2  18  5  10  4   9 ...      14  18  11 10  19 18 20  3 17 19 13
        F       0  13 14  11  0   6  8   6  2  12 ...      17  12  18  6  17 16 14  0  4  2  5
Year 2  M       5  10  6   6  1  20  5  18  4   9 ...      13  15  19  2  18 16 13  1 19  5 12
        F       1  11 14  15  0   9  9   2  2  12 ...      17  18  14  9  18 13 14  0  9  2 10
Evening M       4  10  6   5  3  13 19   5  4   9 ...      17  10  18  3  11 20 11  4 18 17 20
        F       4  12 12  13  0   9  3   8  2  12 ...      18  11  18  1  13 13 10  0  6  2  8

[6 rows x 24 columns]
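The MultiIndex.from_product step in the second answer can be tried in isolation: it builds all 24 column labels as the Cartesian product of the three header rows (labels copied from the question's CSV):

```python
import pandas as pd

# Cartesian product: 2 countries x 3 stores x 4 measures = 24 column labels
headers = pd.MultiIndex.from_product([['England', 'France'],
                                      ['Store 1', 'Store 2', 'Store 3'],
                                      ['F', 'P', 'M', 'D']])
print(len(headers))   # 24
print(headers[0])     # ('England', 'Store 1', 'F')
```

This only works when the CSV's columns really are the full product in order; otherwise the ffill-on-headers approach from the first answer is safer.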
Several Layers of If Statements with String
I have a data frame:

df = pd.DataFrame([[3,2,1,5,'Stay',2],[4,5,6,10,'Leave',10],
                   [10,20,30,40,'Stay',11],[12,2,3,3,'Leave',15],
                   [31,23,31,45,'Stay',25],[12,21,17,6,'Stay',15],
                   [15,17,18,12,'Leave',10],[3,2,1,5,'Stay',3],
                   [12,2,3,3,'Leave',12]],
                  columns=['A','B','C','D','Status','E'])

    A   B   C   D Status   E
0   3   2   1   5   Stay   2
1   4   5   6  10  Leave  10
2  10  20  30  40   Stay  11
3  12   2   3   3  Leave  15
4  31  23  31  45   Stay  25
5  12  21  17   6   Stay  15
6  15  17  18  12  Leave  10
7   3   2   1   5   Stay   3
8  12   2   3   3  Leave  12

I want to run a condition: if Status is Stay and column E is smaller than column A, then shift the row one column to the right, so that the data in column D is replaced with the data in column C, column C with column B, column B with column A, and column A with column E. If Status is Leave and column E is larger than column A, apply the same shift.

So the result is:

    A   B   C   D Status   E
0   2   3   2   1   Stay   2
1  10   4   5   6  Leave  10
2  10  20  30  40   Stay  11
3  15  12   2   3  Leave  15
4  25  31  23  31   Stay  25
5  12  21  17   6   Stay  15
6  15  17  18  12  Leave  10
7   3   2   1   5   Stay   3
8  12   2   3   3  Leave  12

My attempt:

if df['Status'] == 'Stay':
    if df['E'] < df['A']:
        df['D'] = df['C']
        df['C'] = df['B']
        df['B'] = df['A']
        df['A'] = df['E']
elif df['Status'] == 'Leave':
    if df['E'] > df['A']:
        df['D'] = df['C']
        df['C'] = df['B']
        df['B'] = df['A']
        df['A'] = df['E']

This runs into several problems, including an error from comparing a whole Series with a string. Your help is kindly appreciated.
I think you want boolean indexing:

s1 = df.Status.eq('Stay') & df['E'].lt(df['A'])
s2 = df.Status.eq('Leave') & df['E'].gt(df['A'])
s = s1 | s2

df.loc[s, ['A','B','C','D']] = df.loc[s, ['E','A','B','C']].to_numpy()

Output:

    A   B   C   D Status   E
0   2   3   2   1   Stay   2
1  10   4   5   6  Leave  10
2  10  20  30  40   Stay  11
3  15  12   2   3  Leave  15
4  25  31  23  31   Stay  25
5  12  21  17   6   Stay  15
6  15  17  18  12  Leave  10
7   3   2   1   5   Stay   3
8  12   2   3   3  Leave  12
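A self-contained version of the boolean-indexing approach, rebuilt with just the first four rows of the question's data:

```python
import pandas as pd

df = pd.DataFrame([[3, 2, 1, 5, 'Stay', 2],
                   [4, 5, 6, 10, 'Leave', 10],
                   [10, 20, 30, 40, 'Stay', 11],
                   [12, 2, 3, 3, 'Leave', 15]],
                  columns=['A', 'B', 'C', 'D', 'Status', 'E'])

# Rows where the shift applies: 'Stay' with E < A, or 'Leave' with E > A
s = (df.Status.eq('Stay') & df['E'].lt(df['A'])) | \
    (df.Status.eq('Leave') & df['E'].gt(df['A']))

# One vectorised assignment: A<-E, B<-A, C<-B, D<-C on the selected rows only
df.loc[s, ['A', 'B', 'C', 'D']] = df.loc[s, ['E', 'A', 'B', 'C']].to_numpy()
print(df)
```

The .to_numpy() call matters: assigning a DataFrame would align on column labels and undo the shift, while the raw array is assigned positionally.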
Using np.roll with .loc:

shift = np.roll(df.select_dtypes(exclude='object'), 1, axis=1)[:, :-1]

m1 = df['Status'].eq('Stay') & (df['E'] < df['A'])
m2 = df['Status'].eq('Leave') & (df['E'] > df['A'])

df.loc[m1 | m2, ['A','B','C','D']] = shift[m1 | m2]

    A   B   C   D Status   E
0   2   3   2   1   Stay   2
1  10   4   5   6  Leave  10
2  10  20  30  40   Stay  11
3  15  12   2   3  Leave  15
4  25  31  23  31   Stay  25
5  12  21  17   6   Stay  15
6  15  17  18  12  Leave  10
7   3   2   1   5   Stay   3
8  12   2   3   3  Leave  12
Use DataFrame.mask + DataFrame.shift:

# Status as index, so shift only touches the value columns
new_df = df.set_index('Status')

# DataFrame to replace with
df_modify = new_df.shift(axis=1, fill_value=df['E'])

# Creating boolean masks
under_mask = df.Status.eq('Stay') & (df.E < df.A)
over_mask = df.Status.eq('Leave') & (df.E > df.A)

# Using DataFrame.mask
new_df = new_df.mask(under_mask | over_mask, df_modify).reset_index()
print(new_df)

Output:

  Status   A   B   C   D   E
0   Stay   2   3   2   1   5
1  Leave  10   4   5   6  10
2   Stay  10  20  30  40  11
3  Leave  15  12   2   3   3
4   Stay  25  31  23  31  45
5   Stay  12  21  17   6  15
6  Leave  15  17  18  12  10
7   Stay   3   2   1   5   3
8  Leave  12   2   3   3  12
It sounds like you want to do this for each row of the data, but your code is written to compare whole columns at once. You can iterate over the rows instead; note that plain "for row in df" yields the column labels, not the rows, so use DataFrame.iterrows:

for idx, row in df.iterrows():
    if row['Status'] == 'Stay':
        ... etc ...
Python Dataframe: Create columns based on another column
I have a dataframe with repeated values in one column (here column 'A') and I want to convert this dataframe so that new columns are formed based on the values of column 'A'.

Example:

df = pd.DataFrame({'A': list(range(4)) * 3, 'B': range(12), 'C': range(12, 24)})
df

    A   B   C
0   0   0  12
1   1   1  13
2   2   2  14
3   3   3  15
4   0   4  16
5   1   5  17
6   2   6  18
7   3   7  19
8   0   8  20
9   1   9  21
10  2  10  22
11  3  11  23

Note that the values of the "A" column are repeated 3 times. Now I want the simplest solution to convert it to another dataframe with this configuration (please ignore the naming of the columns, it is used for description purposes only; they could be anything):

    B              C
   A0 A1  A2  A3  A0  A1  A2  A3
0   0  1   2   3  12  13  14  15
1   4  5   6   7  16  17  18  19
2   8  9  10  11  20  21  22  23
This is a pivot problem, so use:

df.assign(idx=df.groupby('A').cumcount()).pivot(index='idx', columns='A', values=['B', 'C'])

     B             C
A    0  1   2   3   0   1   2   3
idx
0    0  1   2   3  12  13  14  15
1    4  5   6   7  16  17  18  19
2    8  9  10  11  20  21  22  23

If the headers are important, you can use MultiIndex.set_levels to fix them:

u = df.assign(idx=df.groupby('A').cumcount()).pivot(index='idx', columns='A', values=['B', 'C'])
u.columns = u.columns.set_levels(['A' + u.columns.levels[1].astype(str)], level=[1])
u

      B               C
A    A0 A1  A2  A3  A0  A1  A2  A3
idx
0     0  1   2   3  12  13  14  15
1     4  5   6   7  16  17  18  19
2     8  9  10  11  20  21  22  23
You may need to assign a group helper key by cumcount, then just do unstack:

yourdf = df.assign(D=df.groupby('A').cumcount(),
                   A='A' + df.A.astype(str)).set_index(['D', 'A']).unstack()

    B              C
A  A0 A1  A2  A3  A0  A1  A2  A3
D
0   0  1   2   3  12  13  14  15
1   4  5   6   7  16  17  18  19
2   8  9  10  11  20  21  22  23
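The cumcount-then-pivot idea from both answers, as a self-contained sketch; pivot is called with keyword arguments, which newer pandas versions require:

```python
import pandas as pd

df = pd.DataFrame({'A': list(range(4)) * 3,
                   'B': range(12),
                   'C': range(12, 24)})

# cumcount numbers each repeat of an 'A' value (0, 1, 2),
# giving the row index of the pivoted frame
out = (df.assign(idx=df.groupby('A').cumcount())
         .pivot(index='idx', columns='A', values=['B', 'C']))
print(out)
```

The result has a two-level column index, (value column, A value), which is exactly the layout the question sketches.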
Pandas reshape dataframe every N rows to columns
I have a dataframe as follows:

df1 = pd.DataFrame(np.arange(24).reshape(6, -1), columns=['a', 'b', 'c', 'd'])

and I want to take the rows in sets of 3 and convert them to columns, in the following order. Numpy reshape doesn't give the intended answer:

pd.DataFrame(np.reshape(df1.values, (3, -1)),
             columns=['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
In [258]: df = pd.DataFrame(np.hstack(np.split(df1, 2)))

In [259]: df
Out[259]:
   0  1   2   3   4   5   6   7
0  0  1   2   3  12  13  14  15
1  4  5   6   7  16  17  18  19
2  8  9  10  11  20  21  22  23

In [260]: import string

In [261]: df.columns = list(string.ascii_lowercase[:len(df.columns)])

In [262]: df
Out[262]:
   a  b   c   d   e   f   g   h
0  0  1   2   3  12  13  14  15
1  4  5   6   7  16  17  18  19
2  8  9  10  11  20  21  22  23
Create a 3d array by reshape:

a = np.hstack(np.reshape(df1.values, (-1, 3, len(df1.columns))))
df = pd.DataFrame(a, columns=['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
print(df)

   a  b   c   d   e   f   g   h
0  0  1   2   3  12  13  14  15
1  4  5   6   7  16  17  18  19
2  8  9  10  11  20  21  22  23
This uses the reshape/swapaxes/reshape idiom for rearranging sub-blocks of NumPy arrays:

In [26]: pd.DataFrame(df1.values.reshape(2, 3, 4).swapaxes(0, 1).reshape(3, -1),
    ...:              columns=['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
Out[26]:
   a  b   c   d   e   f   g   h
0  0  1   2   3  12  13  14  15
1  4  5   6   7  16  17  18  19
2  8  9  10  11  20  21  22  23
If you want a pure pandas solution:

df1.set_index([df1.index % 3, df1.index // 3])\
   .unstack()\
   .sort_index(level=1, axis=1)\
   .set_axis(list('abcdefgh'), axis=1)

Output:

   a  b   c   d   e   f   g   h
0  0  1   2   3  12  13  14  15
1  4  5   6   7  16  17  18  19
2  8  9  10  11  20  21  22  23
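The reshape/swapaxes idiom from the answers above, as a self-contained sketch of why plain reshape fails: reshape alone re-reads the 24 values row by row, while splitting into two 3-row blocks first and then swapping the block and row axes interleaves the blocks side by side as intended:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(24).reshape(6, -1), columns=list('abcd'))

# (6, 4) -> (2, 3, 4): two blocks of three rows
# swapaxes(0, 1) -> (3, 2, 4): row i now holds [block0 row i, block1 row i]
# reshape(3, -1) -> (3, 8): flatten each pair side by side
out = pd.DataFrame(df1.values.reshape(2, 3, 4).swapaxes(0, 1).reshape(3, -1),
                   columns=list('abcdefgh'))
print(out)
```

The same three-step recipe generalises to any block size: reshape to (n_blocks, rows_per_block, n_cols), swap the first two axes, then flatten the trailing axes.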