Merging 3 DataFrames with the same column names and renaming the columns in Python

I have 3 DataFrames with 25 columns each. The column names are the same in all 3.
I want to merge the 3 DataFrames and add the suffix "_a" to the 25 columns of df1, "_b" to the 25 columns of df2, and "_c" to the 25 columns of df3.
I am using the code below:
pd.merge(pd.merge(df1,df2,'left',on='year',suffixes=['_a','_b']),df3,'left',on='year')
How can I use rename, or some other function, to change all 25 columns of df3 in the code above?
Thanks.

pd.merge(pd.merge(df1,df2,'left',on='year',suffixes=['_a','_b']),
df3,'left',on='year',suffixes=['','_c'])
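One caveat worth checking on your own data: merge applies suffixes only to column names that collide between the two sides. After the first merge, every non-key column already ends in _a or _b, so the columns of df3 no longer collide and can come through without "_c". A minimal sketch that renames each frame before merging, assuming small made-up frames with a year key (the value columns x and y and the helper suffix_except_key are hypothetical):

```python
import pandas as pd

# Hypothetical stand-ins for the three frames in the question
df1 = pd.DataFrame({"year": [2000, 2001], "x": [1, 2], "y": [3, 4]})
df2 = pd.DataFrame({"year": [2000, 2001], "x": [10, 20], "y": [30, 40]})
df3 = pd.DataFrame({"year": [2000, 2001], "x": [100, 200], "y": [300, 400]})

def suffix_except_key(df, suffix, key="year"):
    """Append `suffix` to every column name except the join key."""
    return df.rename(columns={c: c + suffix for c in df.columns if c != key})

# Rename first, then merge: no name collisions are left for merge to resolve
merged = (
    suffix_except_key(df1, "_a")
    .merge(suffix_except_key(df2, "_b"), how="left", on="year")
    .merge(suffix_except_key(df3, "_c"), how="left", on="year")
)
print(list(merged.columns))
```

Renaming up front keeps the suffix logic independent of the merge order, which may be easier to extend if more frames are added later.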
Another approach:
Source DFs:
In [68]: d1
Out[68]:
col1 col2 col3
0 1 2 3
1 4 5 6
In [69]: d2
Out[69]:
col1 col2 col3
0 11 12 13
1 14 15 16
In [70]: d3
Out[70]:
col1 col2 col3
0 21 22 23
1 24 25 26
Let's create a list of DFs:
In [71]: dfs = [d1,d2,d3]
and a list of suffixes:
In [73]: suffixes = ['_a','_b','_c']
Now we can merge them in one step as follows:
In [74]: pd.concat([df.add_suffix(suffixes[i]) for i,df in enumerate(dfs)], axis=1)
Out[74]:
col1_a col2_a col3_a col1_b col2_b col3_b col1_c col2_c col3_c
0 1 2 3 11 12 13 21 22 23
1 4 5 6 14 15 16 24 25 26
Short explanation: the list comprehension generates a list of DFs whose columns are already renamed:
In [75]: [suffixes[i] for i,df in enumerate(dfs)]
Out[75]: ['_a', '_b', '_c']
In [76]: [df.add_suffix(suffixes[i]) for i,df in enumerate(dfs)]
Out[76]:
[ col1_a col2_a col3_a
0 1 2 3
1 4 5 6, col1_b col2_b col3_b
0 11 12 13
1 14 15 16, col1_c col2_c col3_c
0 21 22 23
1 24 25 26]

Related

Python Rank with non numeric columns

I'm trying to find a way to do nested ranking (row numbers) in Python, equivalent to row_number() in T-SQL.
I have a table that looks like this:
data = {
'col1':[11,11,11,22,22,33,33],
'col2':['a','b','c','a','b','a','b']
}
df = pd.DataFrame(data)
# col1 col2
# 11 a
# 11 b
# 11 c
# 22 a
# 22 b
# 33 a
# 33 b
Looking for Python equivalent to:
SELECT
col1
,col2
,row_number() over(Partition by col1 order by col2) as rnk
FROM df
group by col1, col2
The output to be:
# col1 col2 rnk
# 11 a 1
# 11 b 2
# 11 c 3
# 22 a 1
# 22 b 2
# 33 a 1
# 33 b 2
I've tried to use rank() and groupby(), but I keep running into the error "No numeric types to aggregate". Is there a way to rank non-numeric columns and give them row numbers?
Use cumcount()
df['rnk']=df.groupby('col1')['col2'].cumcount()+1
col1 col2 rnk
0 11 a 1
1 11 b 2
2 11 c 3
3 22 a 1
4 22 b 2
5 33 a 1
6 33 b 2
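Note that cumcount numbers the rows in the order they already appear within each group, so it matches "order by col2" only when the frame is sorted that way. A small sketch, assuming a shuffled version of the question's data, that sorts before counting:

```python
import pandas as pd

# Same values as in the question, but deliberately out of order
df = pd.DataFrame({
    "col1": [22, 11, 33, 11, 22, 11, 33],
    "col2": ["b", "c", "a", "a", "a", "b", "b"],
})

# Sort first so the running count follows col2 within each col1 group;
# the assignment aligns on the index, so df keeps its original row order.
df["rnk"] = df.sort_values(["col1", "col2"]).groupby("col1").cumcount() + 1
print(df.sort_values(["col1", "col2"]))
```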

Split the data frame based on consecutive row values differences

I have a data frame like this,
df
col1 col2 col3
1 2 3
2 5 6
7 8 9
10 11 12
11 12 13
13 14 15
14 15 16
Now I want to create multiple data frames from the one above, splitting wherever the col1 difference between two consecutive rows is more than 1.
So the result data frames will look like,
df1
col1 col2 col3
1 2 3
2 5 6
df2
col1 col2 col3
7 8 9
df3
col1 col2 col3
10 11 12
11 12 13
df4
col1 col2 col3
13 14 15
14 15 16
I can do this with a for loop that stores the indices, but that will increase execution time. I am looking for a pandas shortcut or a Pythonic way to do this efficiently.
You could define a custom grouper by taking the diff, checking where it is greater than 1, and taking the cumsum of the resulting boolean series. Then group by the result and build a dictionary from the groupby object:
d = dict(tuple(df.groupby(df.col1.diff().gt(1).cumsum())))
print(d[0])
col1 col2 col3
0 1 2 3
1 2 5 6
print(d[1])
col1 col2 col3
2 7 8 9
A more detailed break-down:
df.assign(difference=(diff:=df.col1.diff()),
condition=(gt1:=diff.gt(1)),
grouper=gt1.cumsum())
col1 col2 col3 difference condition grouper
0 1 2 3 NaN False 0
1 2 5 6 1.0 False 0
2 7 8 9 5.0 True 1
3 10 11 12 3.0 True 2
4 11 12 13 1.0 False 2
5 13 14 15 2.0 True 3
6 14 15 16 1.0 False 3
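The same diff / gt / cumsum grouper can be wrapped in a small helper that returns the pieces as a list. This is only a sketch: the function name split_on_gaps and its max_step parameter are made up for illustration.

```python
import pandas as pd

def split_on_gaps(df, column="col1", max_step=1):
    """Split df wherever consecutive values of `column` differ by more than max_step."""
    # True marks the first row of a new block; cumsum turns the marks into block ids
    grouper = df[column].diff().gt(max_step).cumsum()
    return [part for _, part in df.groupby(grouper)]

df = pd.DataFrame({
    "col1": [1, 2, 7, 10, 11, 13, 14],
    "col2": [2, 5, 8, 11, 12, 14, 15],
    "col3": [3, 6, 9, 12, 13, 15, 16],
})
parts = split_on_gaps(df)
for part in parts:
    print(part, end="\n\n")
```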
You can also pull out the target column and work with it as a Series, instead of grouping the whole frame as in the answer above. That keeps the intermediate objects smaller. It runs faster on the example, but I don't know how the two approaches scale, which probably depends on how many splits there are.
import numpy as np
row_bool = df['col1'].diff()>1
split_inds, = np.where(row_bool)
# add position 0 and len(df) as the outer boundaries
split_inds = np.insert(arr=split_inds, obj=[0,len(split_inds)], values=[0,len(df)])
df_list = []  # a tuple has no append(), so collect the pieces in a list first
for n in range(0,len(split_inds)-1):
    tempdf = df.iloc[split_inds[n]:split_inds[n+1],:]
    df_list.append(tempdf)
df_tup = tuple(df_list)
(Just throwing it in a tuple of dataframes afterward, but the dictionary approach might be better?)

Pandas: Pack column into rows

I've been reading through pd.stack, pd.unstack and pd.pivot but I can't wrap my head around getting what I want done
Given a dataframe as follows
id1 id2 id3 vals vals1
0 1 a -1 10 20
1 1 a -2 11 21
2 1 a -3 12 22
3 1 a -4 13 23
4 1 b -1 14 24
5 1 b -2 15 25
6 1 b -3 16 26
7 1 b -4 17 27
I'd like to get the following result
id1 id2 -1_vals -2_vals ... -1_vals1 -2_vals1 -3_vals1 -4_vals1
0 1 a 10 11 ... 20 21 22 23
1 1 b 14 15 ... 24 25 26 27
It's a kind of groupby with a pivot: the values of id3 are spread across columns within one row per (id1, id2) group, and each new column name is the concatenation of the id3 value and the original column name.
EDIT: It is guaranteed that id3 is unique within each id1 + id2 group, but some groups will have different id3 values; in that case it is fine to put NaNs there.
Use DataFrame.set_index with DataFrame.unstack and DataFrame.sort_index for MultiIndex in columns and then flatten it by list comprehension with f-strings:
df1 = (df.set_index(['id1','id2','id3'])
         .unstack()
         .sort_index(level=[0,1], ascending=[True, False], axis=1))
# Python 3.6+
df1.columns = [f'{b}_{a}' for a, b in df1.columns]
# Python below 3.6
#df1.columns = ['{}_{}'.format(b, a) for a, b in df1.columns]
df1 = df1.reset_index()
print (df1)
id1 id2 -1_vals -2_vals -3_vals -4_vals -1_vals1 -2_vals1 -3_vals1 \
0 1 a 10 11 12 13 20 21 22
1 1 b 14 15 16 17 24 25 26
-4_vals1
0 23
1 27
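To see how the NaN case from the question's EDIT plays out, here is a small sketch with made-up data in which group "b" has no row for id3 == -3. It uses the same set_index / unstack steps, without the column sort:

```python
import pandas as pd

# Group "b" is missing id3 == -3 on purpose
df = pd.DataFrame({
    "id1": [1, 1, 1, 1, 1],
    "id2": ["a", "a", "a", "b", "b"],
    "id3": [-1, -2, -3, -1, -2],
    "vals": [10, 11, 12, 14, 15],
    "vals1": [20, 21, 22, 24, 25],
})

wide = df.set_index(["id1", "id2", "id3"]).unstack()
# Each column label is a (original column name, id3 value) pair
wide.columns = [f"{id3}_{name}" for name, id3 in wide.columns]
wide = wide.reset_index()
print(wide)
```

The missing combination shows up as NaN, which also turns the affected columns into floats.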

Perform column rename and slicing on multiple pandas dataframe

Example
import pandas as pd
d = {'col1': [1, "newcolumn1name", 5, 8, 15], 'col2': [5, "newcolumn2name", 10, 15, 20]}
df = pd.DataFrame(data=d)
df1=df
df2=df
df
Out[24]:
col1 col2
0 1 5
1 newcolumn1name newcolumn2name
2 5 10
3 8 15
4 15 20
What I would like to do with this example is to drop the first row and rename the columns with the string of the second row.
I can do this with the following code (complete python newcomer here):
df=df[1:]
new_header = df.iloc[0]
df=df[1:]
df.columns = new_header
df
Out[26]:
1 newcolumn1name newcolumn2name
2 5 10
3 8 15
4 15 20
Now I'd like to be able to do this over both df1 and df2, as defined in the example. I've tried lists, dictionaries, and map, but I ran into issues with all of them.
Can anyone think of the simplest way to do it? On my real data, I'll have six to ten data frames (~1000x8000) to run it on.
IIUC
l=[df1,df2]
[ d[1:].T.set_index(1).T for d in l]
Out[221]:
[1 newcolumn1name newcolumn2name
2 5 10
3 8 15
4 15 20, 1 newcolumn1name newcolumn2name
2 5 10
3 8 15
4 15 20]
Update
l=[df1,df2]
df1,df2=[ d[1:].T.set_index(1).T for d in l]
df1
Out[226]:
1 newcolumn1name newcolumn2name
2 5 10
3 8 15
4 15 20
df2
Out[227]:
1 newcolumn1name newcolumn2name
2 5 10
3 8 15
4 15 20
Update 2
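# writing to locals() only takes effect at module level or in an interactive session, not inside a function
variables = locals()
for x,d in enumerate(l):
    variables["df{0}".format(x+1)]=d[1:].T.set_index(1).T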
df1
Out[231]:
1 newcolumn1name newcolumn2name
2 5 10
3 8 15
4 15 20
df2
Out[232]:
1 newcolumn1name newcolumn2name
2 5 10
3 8 15
4 15 20
You can turn your logic into a function and use df.pipe. Something like the below could work (untested).
def formatter(df):
    df = df[1:]
    new_header = df.iloc[0]
    df = df[1:]
    df.columns = new_header
    return df
# rebinding the loop variable would discard the result, so collect the outputs in a list
formatted = [my_df.pipe(formatter) for my_df in [df1, df2, df3, df4, df5, df6]]
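Yet another solution (originally written for Pandas 0.21+; shown here in the form that pandas 1.0 and later accept):
In [21]: lst = [df1, df2]
In [22]: def renamer(df):
    ...:     # set_axis no longer takes inplace=, and rename_axis needs axis as a keyword
    ...:     return (df.iloc[2:]
    ...:               .set_axis(df.iloc[1], axis='columns')
    ...:               .rename_axis(None, axis=1))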
In [23]: new = list(map(renamer, lst))
In [24]: new[0]
Out[24]:
newcolumn1name newcolumn2name
2 5 10
3 8 15
4 15 20
In [25]: new[1]
Out[25]:
newcolumn1name newcolumn2name
2 5 10
3 8 15
4 15 20
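With six to ten frames, keeping them in a dictionary may be easier to manage than separate df1, df2, ... variables. A minimal sketch under that assumption (the helper promote_header and the dictionary keys are hypothetical names):

```python
import pandas as pd

def promote_header(df):
    """Drop row 0, use row 1 as the column names, and return the remaining rows."""
    out = df.iloc[2:].copy()
    out.columns = df.iloc[1].tolist()
    return out

raw = pd.DataFrame({
    "col1": [1, "newcolumn1name", 5, 8, 15],
    "col2": [5, "newcolumn2name", 10, 15, 20],
})

# One entry per frame; copies stand in for the separately loaded frames
frames = {"df1": raw.copy(), "df2": raw.copy()}
cleaned = {name: promote_header(frame) for name, frame in frames.items()}
print(cleaned["df1"])
```

Each cleaned frame is then available as cleaned["df1"], cleaned["df2"], and so on, and adding another frame only means adding another dictionary entry.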

Dynamically reshape the dataframe in pandas

I have a dataframe with 4 columns and 4 rows. I need to reshape it into 2 columns and 4 rows. The 2 new columns are the sums col1 + col3 and col2 + col4. I do not want to create any additional object in memory for this.
I am trying
df['A','B'] = df['A']+df['C'],df['B']+df['D']
Can it be achieved by using drop function only? Is there any other simpler method for this?
The dynamic way of summing two columns at a time is to use groupby:
df.groupby(np.arange(len(df.columns)) % 2, axis=1).sum()
Out[11]:
0 1
0 2 4
1 10 12
2 18 20
3 26 28
You can use rename afterwards if you want to change the column names, but that requires some logic to map the group labels back to names.
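Recent pandas releases have deprecated groupby(..., axis=1), so on a newer version the same idea may need to go through a transpose. A sketch under that assumption, which also restores the column names (n_out is a made-up parameter name for the number of output columns):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(16).reshape(4, 4), columns=list("ABCD"))

n_out = 2  # number of output columns; column i is summed with column i + n_out
keys = np.arange(df.shape[1]) % n_out

# Group the transposed frame by row position instead of grouping columns directly
summed = df.T.groupby(keys).sum().T
summed.columns = df.columns[:n_out]
print(summed)
```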
Consider the sample dataframe df
df = pd.DataFrame(np.arange(16).reshape(4, 4), columns=list('ABCD'))
df
A B C D
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
3 12 13 14 15
One line of code
pd.DataFrame(
df.values.reshape(4, 2, 2).transpose(0, 2, 1).sum(2),
columns=df.columns[:2]
)
A B
0 2 4
1 10 12
2 18 20
3 26 28
Another line of code
df.iloc[:, :2] + df.iloc[:, 2:4].values
A B
0 2 4
1 10 12
2 18 20
3 26 28
Yet another
df.assign(A=df.A + df.C, B=df.B + df.D).drop(['C', 'D'], axis=1)
A B
0 2 4
1 10 12
2 18 20
3 26 28
This works for me:
df['A'], df['B'] = df['A'] + df['C'], df['B'] + df['D']
df.drop(['C','D'], axis=1)
