Concatenate dataframes along columns in a pandas dataframe - python
I want to concatenate two dataframes along the columns. Both have the same number of rows.
df1
A B C
0 1 2 3
1 4 5 6
2 7 8 9
3 10 11 12
df2
D E F
0 13 14 15
1 16 17 18
2 19 20 21
3 22 23 24
Expected:
A B C D E F
0 1 2 3 13 14 15
1 4 5 6 16 17 18
2 7 8 9 19 20 21
3 10 11 12 22 23 24
I have done:
df_combined = pd.concat([df1,df2], axis=1)
But df_combined has new rows, with NaN values in some columns...
I can't find my error. So, what do I have to do? Thanks in advance!
In this case, merge() works.
pd.merge(df1, df2, left_index=True, right_index=True)
output
A B C D E F
0 1 2 3 13 14 15
1 4 5 6 16 17 18
2 7 8 9 19 20 21
3 10 11 12 22 23 24
This works only if both dataframes have the same indices.
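The NaN rows usually mean the two frames carry different index labels, so pd.concat aligns on them instead of stacking positionally. A minimal sketch (with a deliberately mismatched index to reproduce the symptom):

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 4], 'B': [2, 5], 'C': [3, 6]})           # index 0, 1
df2 = pd.DataFrame({'D': [13, 16], 'E': [14, 17], 'F': [15, 18]},
                   index=[10, 11])                                    # mismatched index

# Aligning on the raw index produces NaN-padded rows:
misaligned = pd.concat([df1, df2], axis=1)
print(misaligned.shape)  # (4, 6) -- four rows, half of them NaN

# Resetting both indices restores the expected side-by-side result:
combined = pd.concat([df1.reset_index(drop=True),
                      df2.reset_index(drop=True)], axis=1)
print(combined)
```

So merge on the index works, but resetting the indices before concatenating fixes the root cause.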
Related
Dynamically create columns in a dataframe
I have a Dataframe like the following:

   a  b  a1  b1
0  1  6  10  20
1  2  7  11  21
2  3  8  12  22
3  4  9  13  23
4  5  2  14  24

where a1 and b1 are dynamically created from a and b. Can we create percentage columns dynamically as well? The one thing that is constant is that the created columns will have 1 suffixed after the name.

Expected output:

   a  b  a1  b1   a%  b%
0  0  6  10  20    0  30
1  2  7  11  21   29  33
2  3  8  12  22   38  36
3  4  9  13  23   44  39
4  5  2  14  24  250   8
Create a new DataFrame by dividing both columns, rename the columns with DataFrame.add_suffix, and last append to the original by DataFrame.join:

cols = ['a','b']
new = [f'{x}1' for x in cols]
df = df.join(df[cols].div(df[new].to_numpy()).mul(100).add_suffix('%'))
print (df)

   a  b  a1  b1         a%         b%
0  1  6  10  20  10.000000  30.000000
1  2  7  11  21  18.181818  33.333333
2  3  8  12  22  25.000000  36.363636
3  4  9  13  23  30.769231  39.130435
4  5  2  14  24  35.714286   8.333333
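The same pipeline end to end, with the question's frame rebuilt inline (column values taken from the question's first table):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, 5],
                   'b': [6, 7, 8, 9, 2],
                   'a1': [10, 11, 12, 13, 14],
                   'b1': [20, 21, 22, 23, 24]})

cols = ['a', 'b']
new = [f'{x}1' for x in cols]   # the matching '<name>1' columns

# Divide each base column by its '1'-suffixed partner, scale to percent,
# tag the result columns with '%', and join back onto the original frame
pct = df[cols].div(df[new].to_numpy()).mul(100).add_suffix('%')
df = df.join(pct)
print(df)
```

Because the suffix rule is fixed, the same three lines work for any number of base columns.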
CSV & Pandas: Unnamed columns and multi-index
I have a set of data:

,,England,,,,,,,,,,,,France,,,,,,,,,,,
,,,,,,,,,,,,,,,,,,,,,,,,,
,,Store 1,,,,Store 2,,,,Store 3,,,,Store 1,,,,Store 2,,,,Store 3,,,
,,,,,,,,,,,,,,,,,,,,,,,,,
,,F,P,M,D,F,P,M,D,F,P,M,D,F,P,M,D,F,P,M,D,F,P,M,D
,,,,,,,,,,,,,,,,,,,,,,,,,
Year 1,M,0,5,7,9,2,18,5,10,4,9,6,2,4,14,18,11,10,19,18,20,3,17,19,13
,,,,,,,,,,,,,,,,,,,,,,,,,
,F,0,13,14,11,0,6,8,6,2,12,14,9,9,17,12,18,6,17,16,14,0,4,2,5
,,,,,,,,,,,,,,,,,,,,,,,,,
Year 2,M,5,10,6,6,1,20,5,18,4,9,6,2,10,13,15,19,2,18,16,13,1,19,5,12
,,,,,,,,,,,,,,,,,,,,,,,,,
,F,1,11,14,15,0,9,9,2,2,12,14,9,7,17,18,14,9,18,13,14,0,9,2,10
,,,,,,,,,,,,,,,,,,,,,,,,,
Evening,M,4,10,6,5,3,13,19,5,4,9,6,2,8,17,10,18,3,11,20,11,4,18,17,20
,,,,,,,,,,,,,,,,,,,,,,,,,
,F,4,12,12,13,0,9,3,8,2,12,14,9,0,18,11,18,1,13,13,10,0,6,2,8

The desired output I'm trying to achieve is:

I know that I can read the CSV and remove any NaN rows with:

df = pd.read_csv("Stores.csv", skipinitialspace=True)
df.dropna(how="all", inplace=True)

My 2 main issues are:

How do I group the unnamed columns so that they are just the countries "England" and "France"?
How do I set up an index so that each of the 3 stores falls under the relevant country?

I believe that I can use hierarchical indexing for the headings, but all the examples I've come across use nice, clean data frames, unlike my CSV. I'd be very grateful if someone could point me in the right direction, as I'm fairly new to pandas. Thank you.
You can try this:

from io import StringIO
import pandas as pd
import numpy as np

test = StringIO(""",,England,,,,,,,,,,,,France,,,,,,,,,,,
,,,,,,,,,,,,,,,,,,,,,,,,,
,,Store 1,,,,Store 2,,,,Store 3,,,,Store 1,,,,Store 2,,,,Store 3,,,
,,,,,,,,,,,,,,,,,,,,,,,,,
,,F,P,M,D,F,P,M,D,F,P,M,D,F,P,M,D,F,P,M,D,F,P,M,D
,,,,,,,,,,,,,,,,,,,,,,,,,
Year 1,M,0,5,7,9,2,18,5,10,4,9,6,2,4,14,18,11,10,19,18,20,3,17,19,13
,,,,,,,,,,,,,,,,,,,,,,,,,
,F,0,13,14,11,0,6,8,6,2,12,14,9,9,17,12,18,6,17,16,14,0,4,2,5
,,,,,,,,,,,,,,,,,,,,,,,,,
Year 2,M,5,10,6,6,1,20,5,18,4,9,6,2,10,13,15,19,2,18,16,13,1,19,5,12
,,,,,,,,,,,,,,,,,,,,,,,,,
,F,1,11,14,15,0,9,9,2,2,12,14,9,7,17,18,14,9,18,13,14,0,9,2,10
,,,,,,,,,,,,,,,,,,,,,,,,,
Evening,M,4,10,6,5,3,13,19,5,4,9,6,2,8,17,10,18,3,11,20,11,4,18,17,20
,,,,,,,,,,,,,,,,,,,,,,,,,
,F,4,12,12,13,0,9,3,8,2,12,14,9,0,18,11,18,1,13,13,10,0,6,2,8""")

df = pd.read_csv(test, index_col=[0, 1], header=[0, 1, 2],
                 skiprows=lambda x: x % 2 == 1)
df.columns = pd.MultiIndex.from_frame(
    df.columns.to_frame()
              .apply(lambda x: np.where(x.str.contains('Unnamed'), np.nan, x))
              .ffill())
df.index = pd.MultiIndex.from_frame(df.index.to_frame().ffill())
print(df)

Output:

0         England                              ...  France
1         Store 1        Store 2       Store 3 ... Store 1        Store 2       Store 3
2               F   P  M  D  F   P  M  D  F  P ...       M   D  F   P  M  D  F  P  M  D
0       1                                      ...
Year 1  M       0   5  7  9  2  18  5 10  4  9 ...      18  11 10  19 18 20  3 17 19 13
        F       0  13 14 11  0   6  8  6  2 12 ...      12  18  6  17 16 14  0  4  2  5
Year 2  M       5  10  6  6  1  20  5 18  4  9 ...      15  19  2  18 16 13  1 19  5 12
        F       1  11 14 15  0   9  9  2  2 12 ...      18  14  9  18 13 14  0  9  2 10
Evening M       4  10  6  5  3  13 19  5  4  9 ...      10  18  3  11 20 11  4 18 17 20
        F       4  12 12 13  0   9  3  8  2 12 ...      11  18  1  13 13 10  0  6  2  8

[6 rows x 24 columns]
You'll have to set the (multi) index and headers yourself:

df = pd.read_csv("Stores.csv", header=None)
df.dropna(how='all', inplace=True)
df.reset_index(inplace=True, drop=True)

# getting headers as a product of [England, France], [Store 1, Store 2, Store 3] and [F, P, M, D]
headers = pd.MultiIndex.from_product([df.iloc[0].dropna().unique(),
                                      df.iloc[1].dropna().unique(),
                                      df.iloc[2].dropna().unique()])

df.drop([0, 1, 2], inplace=True)    # removing header rows
df[0].ffill(inplace=True)           # filling nan values for first index col
df.set_index([0, 1], inplace=True)  # setting multiindex
df.columns = headers
print(df)

Output:

          England                                 ...  France
          Store 1        Store 2        Store 3   ... Store 1        Store 2       Store 3
                F   P  M   D  F   P  M   D  F   P ...       P   M   D  F   P  M  D  F  P  M  D
0       1                                         ...
Year 1  M       0   5  7   9  2  18  5  10  4   9 ...      14  18  11 10  19 18 20  3 17 19 13
        F       0  13 14  11  0   6  8   6  2  12 ...      17  12  18  6  17 16 14  0  4  2  5
Year 2  M       5  10  6   6  1  20  5  18  4   9 ...      13  15  19  2  18 16 13  1 19  5 12
        F       1  11 14  15  0   9  9   2  2  12 ...      17  18  14  9  18 13 14  0  9  2 10
Evening M       4  10  6   5  3  13 19   5  4   9 ...      17  10  18  3  11 20 11  4 18 17 20
        F       4  12 12  13  0   9  3   8  2  12 ...      18  11  18  1  13 13 10  0  6  2  8

[6 rows x 24 columns]
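The MultiIndex.from_product step in the second answer can be tried in isolation: it builds all 24 column labels as the Cartesian product of the three header rows (labels copied from the question's CSV):

```python
import pandas as pd

# Cartesian product: 2 countries x 3 stores x 4 measures = 24 column labels
headers = pd.MultiIndex.from_product([['England', 'France'],
                                      ['Store 1', 'Store 2', 'Store 3'],
                                      ['F', 'P', 'M', 'D']])
print(len(headers))   # 24
print(headers[0])     # ('England', 'Store 1', 'F')
```

This only works when the CSV's columns really are the full product in order; otherwise the ffill-on-headers approach from the first answer is safer.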
Several Layers of If Statements with String
I have a data frame:

df = pd.DataFrame([[3,2,1,5,'Stay',2],[4,5,6,10,'Leave',10],
                   [10,20,30,40,'Stay',11],[12,2,3,3,'Leave',15],
                   [31,23,31,45,'Stay',25],[12,21,17,6,'Stay',15],
                   [15,17,18,12,'Leave',10],[3,2,1,5,'Stay',3],
                   [12,2,3,3,'Leave',12]],
                  columns=['A','B','C','D','Status','E'])

    A   B   C   D Status   E
0   3   2   1   5   Stay   2
1   4   5   6  10  Leave  10
2  10  20  30  40   Stay  11
3  12   2   3   3  Leave  15
4  31  23  31  45   Stay  25
5  12  21  17   6   Stay  15
6  15  17  18  12  Leave  10
7   3   2   1   5   Stay   3
8  12   2   3   3  Leave  12

I want to run a condition: if Status is Stay and column E is smaller than column A, then shift the row one column to the right, so that the data in column D is replaced with the data in column C, column C with column B, column B with column A, and column A with column E. If Status is Leave and column E is larger than column A, apply the same shift.

So the result is:

    A   B   C   D Status   E
0   2   3   2   1   Stay   2
1  10   4   5   6  Leave  10
2  10  20  30  40   Stay  11
3  15  12   2   3  Leave  15
4  25  31  23  31   Stay  25
5  12  21  17   6   Stay  15
6  15  17  18  12  Leave  10
7   3   2   1   5   Stay   3
8  12   2   3   3  Leave  12

My attempt:

if df['Status'] == 'Stay':
    if df['E'] < df['A']:
        df['D'] = df['C']
        df['C'] = df['B']
        df['B'] = df['A']
        df['A'] = df['E']
elif df['Status'] == 'Leave':
    if df['E'] > df['A']:
        df['D'] = df['C']
        df['C'] = df['B']
        df['B'] = df['A']
        df['A'] = df['E']

This runs into several problems, including an error from comparing a whole Series with a string. Your help is kindly appreciated.
I think you want boolean indexing:

s1 = df.Status.eq('Stay') & df['E'].lt(df['A'])
s2 = df.Status.eq('Leave') & df['E'].gt(df['A'])
s = s1 | s2

df.loc[s, ['A','B','C','D']] = df.loc[s, ['E','A','B','C']].to_numpy()

Output:

    A   B   C   D Status   E
0   2   3   2   1   Stay   2
1  10   4   5   6  Leave  10
2  10  20  30  40   Stay  11
3  15  12   2   3  Leave  15
4  25  31  23  31   Stay  25
5  12  21  17   6   Stay  15
6  15  17  18  12  Leave  10
7   3   2   1   5   Stay   3
8  12   2   3   3  Leave  12
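A self-contained version of the boolean-indexing approach, rebuilt with just the first four rows of the question's data:

```python
import pandas as pd

df = pd.DataFrame([[3, 2, 1, 5, 'Stay', 2],
                   [4, 5, 6, 10, 'Leave', 10],
                   [10, 20, 30, 40, 'Stay', 11],
                   [12, 2, 3, 3, 'Leave', 15]],
                  columns=['A', 'B', 'C', 'D', 'Status', 'E'])

# Rows where the shift applies: 'Stay' with E < A, or 'Leave' with E > A
s = (df.Status.eq('Stay') & df['E'].lt(df['A'])) | \
    (df.Status.eq('Leave') & df['E'].gt(df['A']))

# One vectorised assignment: A<-E, B<-A, C<-B, D<-C on the selected rows only
df.loc[s, ['A', 'B', 'C', 'D']] = df.loc[s, ['E', 'A', 'B', 'C']].to_numpy()
print(df)
```

The .to_numpy() call matters: assigning a DataFrame would align on column labels and undo the shift, while the raw array is assigned positionally.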
Using np.roll with .loc:

shift = np.roll(df.select_dtypes(exclude='object'), 1, axis=1)[:, :-1]

m1 = df['Status'].eq('Stay') & (df['E'] < df['A'])
m2 = df['Status'].eq('Leave') & (df['E'] > df['A'])

df.loc[m1 | m2, ['A','B','C','D']] = shift[m1 | m2]

    A   B   C   D Status   E
0   2   3   2   1   Stay   2
1  10   4   5   6  Leave  10
2  10  20  30  40   Stay  11
3  15  12   2   3  Leave  15
4  25  31  23  31   Stay  25
5  12  21  17   6   Stay  15
6  15  17  18  12  Leave  10
7   3   2   1   5   Stay   3
8  12   2   3   3  Leave  12
Use DataFrame.mask + DataFrame.shift:

# Status as index, so shift only touches the value columns
new_df = df.set_index('Status')

# DataFrame to replace with
df_modify = new_df.shift(axis=1, fill_value=df['E'])

# Creating boolean masks
under_mask = df.Status.eq('Stay') & (df.E < df.A)
over_mask = df.Status.eq('Leave') & (df.E > df.A)

# Using DataFrame.mask
new_df = new_df.mask(under_mask | over_mask, df_modify).reset_index()
print(new_df)

Output:

  Status   A   B   C   D   E
0   Stay   2   3   2   1   5
1  Leave  10   4   5   6  10
2   Stay  10  20  30  40  11
3  Leave  15  12   2   3   3
4   Stay  25  31  23  31  45
5   Stay  12  21  17   6  15
6  Leave  15  17  18  12  10
7   Stay   3   2   1   5   3
8  Leave  12   2   3   3  12
It sounds like you want to do this for each row of the data, but your code is written to compare whole columns at once. You can iterate over the rows instead; note that plain "for row in df" yields the column labels, not the rows, so use DataFrame.iterrows:

for idx, row in df.iterrows():
    if row['Status'] == 'Stay':
        ... etc ...
Python Dataframe: Create columns based on another column
I have a dataframe with repeated values in one column (here column 'A') and I want to convert this dataframe so that new columns are formed based on the values of column 'A'.

Example:

df = pd.DataFrame({'A': list(range(4)) * 3, 'B': range(12), 'C': range(12, 24)})
df

    A   B   C
0   0   0  12
1   1   1  13
2   2   2  14
3   3   3  15
4   0   4  16
5   1   5  17
6   2   6  18
7   3   7  19
8   0   8  20
9   1   9  21
10  2  10  22
11  3  11  23

Note that the values of the "A" column are repeated 3 times. Now I want the simplest solution to convert it to another dataframe with this configuration (please ignore the naming of the columns, it is used for description purposes only; they could be anything):

    B              C
   A0 A1  A2  A3  A0  A1  A2  A3
0   0  1   2   3  12  13  14  15
1   4  5   6   7  16  17  18  19
2   8  9  10  11  20  21  22  23
This is a pivot problem, so use:

df.assign(idx=df.groupby('A').cumcount()).pivot(index='idx', columns='A', values=['B', 'C'])

     B             C
A    0  1   2   3   0   1   2   3
idx
0    0  1   2   3  12  13  14  15
1    4  5   6   7  16  17  18  19
2    8  9  10  11  20  21  22  23

If the headers are important, you can use MultiIndex.set_levels to fix them:

u = df.assign(idx=df.groupby('A').cumcount()).pivot(index='idx', columns='A', values=['B', 'C'])
u.columns = u.columns.set_levels(['A' + u.columns.levels[1].astype(str)], level=[1])
u

      B               C
A    A0 A1  A2  A3  A0  A1  A2  A3
idx
0     0  1   2   3  12  13  14  15
1     4  5   6   7  16  17  18  19
2     8  9  10  11  20  21  22  23
You may need to assign a group helper key by cumcount, then just do unstack:

yourdf = df.assign(D=df.groupby('A').cumcount(),
                   A='A' + df.A.astype(str)).set_index(['D', 'A']).unstack()

    B              C
A  A0 A1  A2  A3  A0  A1  A2  A3
D
0   0  1   2   3  12  13  14  15
1   4  5   6   7  16  17  18  19
2   8  9  10  11  20  21  22  23
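The cumcount-then-pivot idea from both answers, as a self-contained sketch; pivot is called with keyword arguments, which newer pandas versions require:

```python
import pandas as pd

df = pd.DataFrame({'A': list(range(4)) * 3,
                   'B': range(12),
                   'C': range(12, 24)})

# cumcount numbers each repeat of an 'A' value (0, 1, 2),
# giving the row index of the pivoted frame
out = (df.assign(idx=df.groupby('A').cumcount())
         .pivot(index='idx', columns='A', values=['B', 'C']))
print(out)
```

The result has a two-level column index, (value column, A value), which is exactly the layout the question sketches.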
Pandas reshape dataframe every N rows to columns
I have a dataframe as follows:

df1 = pd.DataFrame(np.arange(24).reshape(6, -1), columns=['a', 'b', 'c', 'd'])

and I want to take the rows in sets of 3 and convert them to columns, in the following order. Numpy reshape doesn't give the intended answer:

pd.DataFrame(np.reshape(df1.values, (3, -1)),
             columns=['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
In [258]: df = pd.DataFrame(np.hstack(np.split(df1, 2)))

In [259]: df
Out[259]:
   0  1   2   3   4   5   6   7
0  0  1   2   3  12  13  14  15
1  4  5   6   7  16  17  18  19
2  8  9  10  11  20  21  22  23

In [260]: import string

In [261]: df.columns = list(string.ascii_lowercase[:len(df.columns)])

In [262]: df
Out[262]:
   a  b   c   d   e   f   g   h
0  0  1   2   3  12  13  14  15
1  4  5   6   7  16  17  18  19
2  8  9  10  11  20  21  22  23
Create a 3d array by reshape:

a = np.hstack(np.reshape(df1.values, (-1, 3, len(df1.columns))))
df = pd.DataFrame(a, columns=['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
print(df)

   a  b   c   d   e   f   g   h
0  0  1   2   3  12  13  14  15
1  4  5   6   7  16  17  18  19
2  8  9  10  11  20  21  22  23
This uses the reshape/swapaxes/reshape idiom for rearranging sub-blocks of NumPy arrays:

In [26]: pd.DataFrame(df1.values.reshape(2, 3, 4).swapaxes(0, 1).reshape(3, -1),
    ...:              columns=['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
Out[26]:
   a  b   c   d   e   f   g   h
0  0  1   2   3  12  13  14  15
1  4  5   6   7  16  17  18  19
2  8  9  10  11  20  21  22  23
If you want a pure pandas solution:

df1.set_index([df1.index % 3, df1.index // 3])\
   .unstack()\
   .sort_index(level=1, axis=1)\
   .set_axis(list('abcdefgh'), axis=1)

Output:

   a  b   c   d   e   f   g   h
0  0  1   2   3  12  13  14  15
1  4  5   6   7  16  17  18  19
2  8  9  10  11  20  21  22  23
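The reshape/swapaxes idiom from the answers above, as a self-contained sketch of why plain reshape fails: reshape alone re-reads the 24 values row by row, while splitting into two 3-row blocks first and then swapping the block and row axes interleaves the blocks side by side as intended:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(24).reshape(6, -1), columns=list('abcd'))

# (6, 4) -> (2, 3, 4): two blocks of three rows
# swapaxes(0, 1) -> (3, 2, 4): row i now holds [block0 row i, block1 row i]
# reshape(3, -1) -> (3, 8): flatten each pair side by side
out = pd.DataFrame(df1.values.reshape(2, 3, 4).swapaxes(0, 1).reshape(3, -1),
                   columns=list('abcdefgh'))
print(out)
```

The same three-step recipe generalises to any block size: reshape to (n_blocks, rows_per_block, n_cols), swap the first two axes, then flatten the trailing axes.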