Perform column rename and slicing on multiple pandas dataframe - python

Example
import pandas as pd
d = {'col1': [1,"newcolumn1name",5, 8,15 ], 'col2':[5,"newcolumn2name",10,15, 20]}
df = pd.DataFrame(data=d)
df1=df
df2=df
df
Out[24]:
col1 col2
0 1 5
1 newcolumn1name newcolumn2name
2 5 10
3 8 15
4 15 20
What I would like to do with this example is to drop the first row and rename the columns with the string of the second row.
I can do this with the following code (complete python newcomer here):
df=df[1:]
new_header = df.iloc[0]
df=df[1:]
df.columns = new_header
df
Out[26]:
1 newcolumn1name newcolumn2name
2 5 10
3 8 15
4 15 20
Now I'd like to be able to do this over both df1 and df2, as defined in the example. I've tried lists, dictionaries, and map, but I ran into issues with all of them.
Can anyone think of the simplest way to do it? On my real data, I'll have six to ten data frames (~1000x8000) to run it on.

IIUC
l=[df1,df2]
[ d[1:].T.set_index(1).T for d in l]
Out[221]:
[1 newcolumn1name newcolumn2name
2 5 10
3 8 15
4 15 20, 1 newcolumn1name newcolumn2name
2 5 10
3 8 15
4 15 20]
Update
l=[df1,df2]
df1,df2=[ d[1:].T.set_index(1).T for d in l]
df1
Out[226]:
1 newcolumn1name newcolumn2name
2 5 10
3 8 15
4 15 20
df2
Out[227]:
1 newcolumn1name newcolumn2name
2 5 10
3 8 15
4 15 20
Update 2
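variables = locals()  # at module level this is the global namespace; inside a function, writing to locals() is not reliable
for x,d in enumerate(l):
    variables["df{0}".format(x+1)]=d[1:].T.set_index(1).T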
df1
Out[231]:
1 newcolumn1name newcolumn2name
2 5 10
3 8 15
4 15 20
df2
Out[232]:
1 newcolumn1name newcolumn2name
2 5 10
3 8 15
4 15 20
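With six to ten frames, one alternative to writing into locals() is to keep the frames in a dict keyed by name. The following is only a sketch: the helper name promote_second_row and the dict frames are hypothetical, and the sample data is the one from the question.

```python
import pandas as pd

def promote_second_row(df):
    """Drop the first row and use the second row as the header (hypothetical helper)."""
    out = df.iloc[2:].copy()           # keep only the data rows
    out.columns = df.iloc[1].tolist()  # second row supplies the column names
    return out

raw = pd.DataFrame({"col1": [1, "newcolumn1name", 5, 8, 15],
                    "col2": [5, "newcolumn2name", 10, 15, 20]})

# Assumed layout: one entry per dataframe, keyed by a name of your choice.
frames = {"df1": raw.copy(), "df2": raw.copy()}
frames = {name: promote_second_row(d) for name, d in frames.items()}

print(frames["df1"])
```

Each cleaned frame is then available as frames["df1"], frames["df2"], and so on, without creating variables dynamically.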

You can turn your logic into a function and use df.pipe. Something like the below could work (untested).
def formatter(df):
    df = df[1:]
    new_header = df.iloc[0]
    df = df[1:]
    df.columns = new_header
    return df
# Assigning to the loop variable would not change df1, df2, ...; collect the results in a list instead.
dfs = [my_df.pipe(formatter) for my_df in [df1, df2, df3, df4, df5, df6]]

Yet another solution for Pandas 0.21+:
In [21]: lst = [df1, df2]
In [22]: def renamer(df):
    ...:     return (df.iloc[2:]
    ...:               .set_axis(df.iloc[1], axis='columns', inplace=False)  # omit inplace on pandas 2.0+, where it was removed
    ...:               .rename_axis(None, axis=1))
In [23]: new = list(map(renamer, lst))
In [24]: new[0]
Out[24]:
newcolumn1name newcolumn2name
2 5 10
3 8 15
4 15 20
In [25]: new[1]
Out[25]:
newcolumn1name newcolumn2name
2 5 10
3 8 15
4 15 20

Related

Comparing Python Data-frames

I have two data frames, e.g.,
df_1:
index A B C D
1 2 5 9 12
2 9 8 13 22
3 0 44 3 1
and
df_2:
index A C
1 12 40
2 9 13
3 16 1
4 0 21
I am looking for a way to compare these two dfs and the final product should be rows in df_1 such that the values in column A and C are present in df_2, e.g.
Final_df:
index A B C D
2 9 8 13 22
I have tried,
Final_df = pd.merge(df_1, df_2, on=['A','C'], how='left', indicator='Exist')
Final_df['Exist'] = np.where(Final_df.Exist == 'both', True, False)
Final_df = Final_df[Final_df['Exist']==True]
But it doesn't give the expected results. Your suggestion will be appreciated!
I think you just want to have an inner merge.
df_1.merge(df_2, on=['A', 'C'], how='inner')
A B C D
0 9 8 13 22
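Note that the inner merge returns a fresh 0-based index. If the original index of df_1 should be kept (2 in the expected output), one option is to filter df_1 with a boolean mask built from the (A, C) pairs. This is a minimal sketch using the sample data from the question; the names keys and mask are illustrative.

```python
import pandas as pd

df_1 = pd.DataFrame({"A": [2, 9, 0], "B": [5, 8, 44],
                     "C": [9, 13, 3], "D": [12, 22, 1]}, index=[1, 2, 3])
df_2 = pd.DataFrame({"A": [12, 9, 16, 0], "C": [40, 13, 1, 21]},
                    index=[1, 2, 3, 4])

# Inner merge: keeps the matching rows but resets the index.
merged = df_1.merge(df_2, on=["A", "C"], how="inner")

# Mask-based filter: keeps df_1's own index and columns.
keys = set(zip(df_2["A"], df_2["C"]))
mask = [pair in keys for pair in zip(df_1["A"], df_1["C"])]
final_df = df_1[mask]

print(final_df)
```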

Drop specific column and indexes in pandas DataFrame

DataFrame:
A B C
0 1 6 11
1 2 7 12
2 3 8 13
3 4 9 14
4 5 10 15
Is it possible to drop the values from index 2 to 4 in column B, or replace them with NaN?
In this case, values: [8, 9, 10] should be removed.
I tried this: df.drop(columns=['B'], index=[8, 9, 10]), but then column B is removed.
Dropping individual values does not make sense in a DataFrame. You can set the values to NaN instead, using .loc / .iloc to select by index and column:
>>> df
A B C
a 1 6 11
b 2 7 12
c 3 8 13
d 4 9 14
e 5 10 15
# By name:
df.loc['c':'e', 'B'] = np.nan
# By number:
df.iloc[2:5, 1] = np.nan
Read "Indexing and selecting data" in the pandas documentation carefully.
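As a self-contained sketch on the question's integer-indexed frame: a .loc label slice includes its end point, while an .iloc position slice excludes it. Casting B to float first is an assumption made here so the example does not rely on implicit upcasting when NaN is assigned; by_label and by_pos are illustrative names.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3, 4, 5],
                   "B": [6, 7, 8, 9, 10],
                   "C": [11, 12, 13, 14, 15]})
df["B"] = df["B"].astype(float)  # NaN needs a float column

by_label = df.copy()
by_label.loc[2:4, "B"] = np.nan  # labels 2..4, end label included

by_pos = df.copy()
col = by_pos.columns.get_loc("B")  # position of column B (1 here)
by_pos.iloc[2:5, col] = np.nan     # positions 2..4, end position excluded

print(by_label)
```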
import pandas as pd
data = [
['A','B','C'],
[1,6,11],
[2,7,12],
[3,8,13],
[4,9,14],
[5,10,15]
]
df = pd.DataFrame(data=data[1:], columns=data[0])
df['B'] = df['B'].shift(3)
>>>
A B C
0 1 NaN 11
1 2 NaN 12
2 3 NaN 13
3 4 6.0 14
4 5 7.0 15

Pandas - Duplicate rows on function application

I have a dataframe, and I'm trying to apply a single function to that dataframe, with multiple arguments. I want the results of the function application to be stored in a new column, with each row duplicated to match each column, but I can't figure out how to do this.
Simple example:
df= pd.DataFrame({"a" : [4 ,5], "b" : [7, 8]}, index = [1, 2])
a b
1 4 7
2 5 8
Now, I want to add both the numbers 10 and 11 to column 'a', and store the results in a new column, 'c'. Sorry if this is unclear, but this is the result I'm looking for:
a b c
1 4 7 14
2 4 7 15
3 5 8 15
4 5 8 16
Is there an easy way to do this?
Use Index.repeat with numpy.tile:
df= pd.DataFrame({"a" : [4 ,5], "b" : [7, 8]}, index = [1, 2])
a = [10,11]
df1 = (df.loc[df.index.repeat(len(a))]
         .assign(c = lambda x: x.a + np.tile(a, len(df)))
         .reset_index(drop=True)
         .rename(lambda x: x+1)
      )
Or:
df1 = df.loc[df.index.repeat(len(a))].reset_index(drop=True).rename(lambda x: x+1)
df1['c'] = df1.a + np.tile(a, len(df))
print (df1)
a b c
1 4 7 14
2 4 7 15
3 5 8 15
4 5 8 16
Another idea is to use a cross join:
a = [10,11]
df1 = df.assign(tmp=1).merge(pd.DataFrame({'c':a, 'tmp':1}), on='tmp').drop('tmp', axis=1)
df1['c'] += df1.a
print (df1)
a b c
0 4 7 14
1 4 7 15
2 5 8 15
3 5 8 16
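On pandas 1.2 or later, the temporary tmp key can be avoided, since merge accepts how='cross'. A minimal sketch under that version assumption; the names addends and add are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"a": [4, 5], "b": [7, 8]}, index=[1, 2])
addends = pd.DataFrame({"add": [10, 11]})

# Pair every row of df with every addend (requires pandas >= 1.2).
out = df.merge(addends, how="cross")
out["c"] = out["a"] + out["add"]
out = out.drop(columns="add")
out.index = out.index + 1  # restart the index at 1, as in the question

print(out)
```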
Using the explode method (pandas >= 0.25.0):
df1 = df.assign(c=df.apply(lambda row: [row.a+10, row.a+11], axis=1))
df1 = df1.explode('c')
print(df1)
a b c
1 4 7 14
1 4 7 15
2 5 8 15
2 5 8 16
Note that your code example doesn't do what you say (5+10 = 15, not 16).
The output from adding 10 and 11 is:
a b c
1 4 7 14
2 4 7 15
3 5 8 15
4 5 8 16
That said, here's some understandable code:
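def add_x_y_to_df_col(df, incol, outcol, x, y):
    # work on copies so the caller's dataframe is not modified
    df0 = df.copy()
    df1 = df.copy()
    df0[outcol] = df[incol] + x
    df1[outcol] = df[incol] + y
    # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
    return pd.concat([df0, df1], ignore_index=True)
df = add_x_y_to_df_col(df, 'a', 'c', 10, 11)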
Note this returns:
a b c
0 4 7 14
1 5 8 15
2 4 7 15
3 5 8 16
If you want to sort by column a and restart the index at 1:
df = df.sort_values(by='a').reset_index(drop=True)
df.index += 1
(You could of course add that code to the function.) This gives the desired result:
a b c
1 4 7 14
2 4 7 15
3 5 8 15
4 5 8 16

Is it possible to batch drop a dataframe's columns with something like slice selection?

For the following dataframe, I want to drop the columns c, d, e, f, g
a b c d e f g h i j
0 0 1 2 3 4 5 6 7 8 9
1 10 11 12 13 14 15 16 17 18 19
So I use the following code:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.arange(20).reshape(2, 10), columns=list('abcdefghij'))
df.drop(['c', 'd', 'e', 'f', 'g'], axis=1)
The problem is that my real dataframe may have many more columns, and I may need to drop a lot of consecutive columns. So my question is: is there any way, like 'c':'g', to quickly select the columns to drop?
Use DataFrame.loc to select consecutive column names:
df = df.drop(df.loc[:, 'c':'g'].columns, axis=1)
print (df)
a b h i j
0 0 1 7 8 9
1 10 11 17 18 19
Or use Index.isin:
c = df.loc[:, 'c':'g'].columns
df = df.loc[:, ~df.columns.isin(c)]
If there are multiple consecutive groups, use Index.union to join the values together, then Index.isin, Index.difference or Index.drop:
c1 = df.loc[:, 'c':'g'].columns
c2 = df.loc[:, 'i':'j'].columns
df = df.loc[:, ~df.columns.isin(c1.union(c2))]
print (df)
a b h
0 0 1 7
1 10 11 17
df = pd.DataFrame(np.arange(20).reshape(2, 10), columns=list('wbcdefghij'))
print (df)
w b c d e f g h i j
0 0 1 2 3 4 5 6 7 8 9
1 10 11 12 13 14 15 16 17 18 19
c1 = df.loc[:, 'c':'g'].columns
c2 = df.loc[:, 'i':'j'].columns
#order of columns may change, because Index.difference sorts the result
df1 = df[df.columns.difference(c1.union(c2))]
print (df1)
b h w
0 1 7 0
1 11 17 10
#ordering is not changed
df2 = df[df.columns.drop(c1.union(c2))]
print (df2)
w b h
0 0 1 7
1 10 11 17
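The pieces above can be wrapped in a small helper that takes several label ranges at once. This is only a sketch: drop_column_ranges is a hypothetical name, and it assumes each (start, stop) pair names existing columns with start to the left of stop. It keeps the original column order because it filters with a boolean mask.

```python
import numpy as np
import pandas as pd

def drop_column_ranges(df, ranges):
    """Drop every column inside the given inclusive (start, stop) label ranges."""
    to_drop = []
    for start, stop in ranges:
        # .loc label slices include both end points
        to_drop.extend(df.loc[:, start:stop].columns)
    keep = ~df.columns.isin(to_drop)
    return df.loc[:, keep]

df = pd.DataFrame(np.arange(20).reshape(2, 10), columns=list('wbcdefghij'))
result = drop_column_ranges(df, [('c', 'g'), ('i', 'j')])

print(result)
```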

Dynamically reshape the dataframe in pandas

I have a dataframe with 4 columns and 4 rows. I need to reshape it into 2 columns and 4 rows. The 2 new columns are the result of adding col1 + col3 and col2 + col4. I do not wish to create any other object in memory for it.
I am trying
df['A','B'] = df['A']+df['C'],df['B']+df['D']
Can it be achieved by using drop function only? Is there any other simpler method for this?
The dynamic way of summing two columns at a time is to use groupby:
df.groupby(np.arange(len(df.columns)) % 2, axis=1).sum()
Out[11]:
0 1
0 2 4
1 10 12
2 18 20
3 26 28
You can use rename afterwards if you want to change the column names, but that would require some extra logic.
Consider the sample dataframe df
df = pd.DataFrame(np.arange(16).reshape(4, 4), columns=list('ABCD'))
df
A B C D
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
3 12 13 14 15
One line of code
pd.DataFrame(
df.values.reshape(4, 2, 2).transpose(0, 2, 1).sum(2),
columns=df.columns[:2]
)
A B
0 2 4
1 10 12
2 18 20
3 26 28
Another line of code
df.iloc[:, :2] + df.iloc[:, 2:4].values
A B
0 2 4
1 10 12
2 18 20
3 26 28
Yet another
df.assign(A=df.A + df.C, B=df.B + df.D).drop(['C', 'D'], axis=1)
A B
0 2 4
1 10 12
2 18 20
3 26 28
This works for me:
df['A'], df['B'] = df['A'] + df['C'], df['B'] + df['D']
df.drop(['C','D'], axis=1)
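To avoid hardcoding the column names, the same result can be sketched by folding the right half of the columns onto the left half. This assumes an even number of columns; half, folded and by_group are illustrative names. The second variant groups the transposed frame, since groupby with axis=1 is deprecated in recent pandas releases.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(16).reshape(4, 4), columns=list('ABCD'))

# Variant 1: add the right half onto the left half by position.
half = df.shape[1] // 2
folded = df.iloc[:, :half] + df.iloc[:, half:].to_numpy()

# Variant 2: group columns by position modulo `half` on the transposed frame.
by_group = df.T.groupby(np.arange(df.shape[1]) % half).sum().T

print(folded)
```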
