I have a dataframe with 4 columns and 4 rows, and I need to reshape it into 2 columns and 4 rows. The two new columns are the result of adding col1 + col3 and col2 + col4. I do not wish to create any other in-memory object for it.
I am trying
df['A','B'] = df['A']+df['C'],df['B']+df['D']
Can it be achieved by using the drop function only? Is there any other, simpler method for this?
The dynamic way of summing pairs of columns is to use groupby:
df.groupby(np.arange(len(df.columns)) % 2, axis=1).sum()
Out[11]:
0 1
0 2 4
1 10 12
2 18 20
3 26 28
You can use rename afterwards if you want to change the column names, but that would require some logic.
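A self-contained sketch of the idea, transposing first so it also works on pandas versions where groupby(axis=1) is deprecated, and renaming the integer group keys back to the first two column labels:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(16).reshape(4, 4), columns=list('ABCD'))

# Pair columns by position modulo 2: A,C -> group 0 and B,D -> group 1,
# then sum each group; transposing avoids the deprecated groupby(axis=1).
out = df.T.groupby(np.arange(len(df.columns)) % 2).sum().T

# Relabel the integer group keys with the first two original names
out = out.set_axis(df.columns[:2], axis=1)
```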
Consider the sample dataframe df
df = pd.DataFrame(np.arange(16).reshape(4, 4), columns=list('ABCD'))
df
A B C D
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
3 12 13 14 15
One line of code
pd.DataFrame(
    df.values.reshape(4, 2, 2).transpose(0, 2, 1).sum(2),
    columns=df.columns[:2]
)
A B
0 2 4
1 10 12
2 18 20
3 26 28
Another line of code
df.iloc[:, :2] + df.iloc[:, 2:4].values
A B
0 2 4
1 10 12
2 18 20
3 26 28
Yet another
df.assign(A=df.A + df.C, B=df.B + df.D).drop(['C', 'D'], axis=1)
A B
0 2 4
1 10 12
2 18 20
3 26 28
This works for me:
df['A'], df['B'] = df['A'] + df['C'], df['B'] + df['D']
df = df.drop(['C', 'D'], axis=1)
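Since the question asks to avoid creating another object, a minimal in-place sketch (assuming the same A-D sample frame used above):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(16).reshape(4, 4), columns=list('ABCD'))

# Add C and D into A and B in place, then drop them in place
df['A'] += df['C']
df['B'] += df['D']
df.drop(columns=['C', 'D'], inplace=True)
```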
Related
I have a dataframe, and I'm trying to apply a single function to it with multiple arguments. I want the results of the function application to be stored in a new column, with each input row duplicated once per argument, but I can't figure out how to do this.
Simple example:
df= pd.DataFrame({"a" : [4 ,5], "b" : [7, 8]}, index = [1, 2])
a b
1 4 7
2 5 8
Now, I want to add both the numbers 10 and 11 to column 'a', and store the results in a new column, 'c'. Sorry if this is unclear, but this is the result I'm looking for:
a b c
1 4 7 14
2 4 7 15
3 5 8 15
4 5 8 16
Is there an easy way to do this?
Use Index.repeat with numpy.tile:
df= pd.DataFrame({"a" : [4 ,5], "b" : [7, 8]}, index = [1, 2])
a = [10,11]
df1 = (df.loc[df.index.repeat(len(a))]
         .assign(c=lambda x: x.a + np.tile(a, len(df)))
         .reset_index(drop=True)
         .rename(lambda x: x + 1))
Or:
df1 = df.loc[df.index.repeat(len(a))].reset_index(drop=True).rename(lambda x: x+1)
df1['c'] = df1.a + np.tile(a, len(df))
print (df1)
a b c
1 4 7 14
2 4 7 15
3 5 8 15
4 5 8 16
Another idea is use cross join:
a = [10,11]
df1 = df.assign(tmp=1).merge(pd.DataFrame({'c': a, 'tmp': 1}), on='tmp').drop('tmp', axis=1)
df1['c'] += df1.a
print (df1)
a b c
0 4 7 14
1 4 7 15
2 5 8 15
3 5 8 16
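On pandas >= 1.2 the tmp helper column is unnecessary, since merge supports how='cross' directly. A sketch under that assumption:

```python
import pandas as pd

df = pd.DataFrame({"a": [4, 5], "b": [7, 8]}, index=[1, 2])
a = [10, 11]

# Cartesian product of the rows with the values to add
df1 = df.merge(pd.DataFrame({'c': a}), how='cross')
df1['c'] += df1.a
```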
Using the explode method (pandas >= 0.25.0):
df1 = df.assign(c=df.apply(lambda row: [row.a+10, row.a+11], axis=1))
df1 = df1.explode('c')
print(df1)
a b c
1 4 7 14
1 4 7 15
2 5 8 15
2 5 8 16
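Note that explode repeats the original index and leaves c with object dtype; a small cleanup sketch:

```python
import pandas as pd

df = pd.DataFrame({"a": [4, 5], "b": [7, 8]}, index=[1, 2])

df1 = df.assign(c=df.apply(lambda row: [row.a + 10, row.a + 11], axis=1))
df1 = (df1.explode('c')
          .astype({'c': int})       # explode yields object dtype
          .reset_index(drop=True))  # drop the repeated index
```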
Note that your code example doesn't do what you say (5+10 = 15, not 16).
The output from adding 10 and 11 is:
a b c
1 4 7 14
2 4 7 15
3 5 8 15
4 5 8 16
That said, here's some understandable code:
def add_x_y_to_df_col(df, incol, outcol, x, y):
    df1 = df.copy()
    df[outcol] = df[incol] + x   # note: this also adds outcol to the caller's df
    df1[outcol] = df[incol] + y
    return pd.concat([df, df1], ignore_index=True)  # df.append was removed in pandas 2.0

df = add_x_y_to_df_col(df, 'a', 'c', 10, 11)
Note this returns:
a b c
0 4 7 14
1 5 8 15
2 4 7 15
3 5 8 16
If you want to sort by column a and restart the index at 1:
df = df.sort_values(by='a').reset_index(drop=True)
df.index += 1
(You could of course add that code to the function.) This gives the desired result:
a b c
1 4 7 14
2 4 7 15
3 5 8 15
4 5 8 16
Say we have a DataFrame df
df = pd.DataFrame({
    "Id": [1, 2],
    "Value": [2, 5]
})
df
Id Value
0 1 2
1 2 5
and some function f which takes an element of df and returns a DataFrame.
def f(value):
    return pd.DataFrame({"A": range(10, 10 + value), "B": range(20, 20 + value)})
f(2)
A B
0 10 20
1 11 21
We want to apply f to each element in df["Value"], and join the result to df, like so:
Id Value A B
0 1 2 10 20
1 1 2 11 21
2 2 5 10 20
3 2 5 11 21
4 2 5 12 22
5 2 5 13 23
6 2 5 14 24
In T-SQL, with a table df and table-valued function f, we would do this with a CROSS APPLY:
SELECT * FROM df
CROSS APPLY f(df.Value)
How can we do this in pandas?
You could apply the function to each element in Value in a list comprehension and use pd.concat to concatenate all the resulting dataframes. Also assign the corresponding Id so that it can later be used to merge both dataframes:
l = pd.concat([f(row.Value).assign(Id=row.Id) for _, row in df.iterrows()])
df.merge(l, on='Id')
Id Value A B
0 1 2 10 20
1 1 2 11 21
2 2 5 10 20
3 2 5 11 21
4 2 5 12 22
5 2 5 13 23
6 2 5 14 24
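An equivalent sketch that avoids iterrows by collecting the sub-frames keyed by Id with pd.concat, then merging the key back:

```python
import pandas as pd

df = pd.DataFrame({"Id": [1, 2], "Value": [2, 5]})

def f(value):
    return pd.DataFrame({"A": range(10, 10 + value), "B": range(20, 20 + value)})

# Build one sub-frame per row, keyed by Id in a MultiIndex level
parts = pd.concat({row.Id: f(row.Value) for row in df.itertuples()}, names=['Id'])

# Lift Id out of the index and join the original columns back on it
out = df.merge(parts.reset_index('Id'), on='Id')
```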
This is one of the few cases where I would use DataFrame.iterrows. We can iterate over each row, concat the cartesian product from your function with the original row, and at the same time fill the NaNs with bfill and ffill:
df = pd.concat([pd.concat([f(r['Value']), pd.DataFrame(r).T], axis=1).bfill().ffill()
                for _, r in df.iterrows()],
               ignore_index=True)
Which yields:
print(df)
A B Id Value
0 10 20 1.0 2.0
1 11 21 1.0 2.0
2 10 20 2.0 5.0
3 11 21 2.0 5.0
4 12 22 2.0 5.0
5 13 23 2.0 5.0
6 14 24 2.0 5.0
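The bfill/ffill alignment upcasts Id and Value to float, since NaNs appear before filling. Assuming you want the original integer dtypes back, a cast can be appended:

```python
import pandas as pd

df = pd.DataFrame({"Id": [1, 2], "Value": [2, 5]})

def f(value):
    return pd.DataFrame({"A": range(10, 10 + value), "B": range(20, 20 + value)})

df = pd.concat([pd.concat([f(r['Value']), pd.DataFrame(r).T], axis=1).bfill().ffill()
                for _, r in df.iterrows()], ignore_index=True)

# Cast the upcast columns back to int
df = df.astype({'Id': int, 'Value': int})
```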
For the next dataframe, I want to drop the columns c, d, e, f, g:
a b c d e f g h i j
0 0 1 2 3 4 5 6 7 8 9
1 10 11 12 13 14 15 16 17 18 19
So I use next code:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.arange(20).reshape(2, 10), columns=list('abcdefghij'))
df.drop(['c', 'd', 'e', 'f', 'g'], axis=1)
The problem is that my real dataframe may not have so few columns; I may need to drop a lot of consecutive columns. So my question is: is there any way, like 'c':'g', to quickly select the columns to drop?
Use DataFrame.loc to select consecutive column names:
df = df.drop(df.loc[:, 'c':'g'].columns, axis=1)
print (df)
a b h i j
0 0 1 7 8 9
1 10 11 17 18 19
Or use Index.isin:
c = df.loc[:, 'c':'g'].columns
df = df.loc[:, ~df.columns.isin(c)]
If there may be multiple consecutive groups, use Index.union to join the values together, then Index.isin, Index.difference, or Index.drop:
c1 = df.loc[:, 'c':'g'].columns
c2 = df.loc[:, 'i':'j'].columns
df = df.loc[:, ~df.columns.isin(c1.union(c2))]
print (df)
a b h
0 0 1 7
1 10 11 17
df = pd.DataFrame(np.arange(20).reshape(2, 10), columns=list('wbcdefghij'))
print (df)
w b c d e f g h i j
0 0 1 2 3 4 5 6 7 8 9
1 10 11 12 13 14 15 16 17 18 19
c1 = df.loc[:, 'c':'g'].columns
c2 = df.loc[:, 'i':'j'].columns
#column order may change, because Index.difference sorts the result
df1 = df[df.columns.difference(c1.union(c2))]
print (df1)
b h w
0 1 7 0
1 11 17 10
#ordering is not changed
df2 = df[df.columns.drop(c1.union(c2))]
print (df2)
w b h
0 0 1 7
1 10 11 17
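If you prefer positional logic, the same slice can be sketched with Index.get_loc (this assumes the column labels are unique):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(20).reshape(2, 10), columns=list('abcdefghij'))

# Translate the label range 'c':'g' into positions, then drop that slice
i, j = df.columns.get_loc('c'), df.columns.get_loc('g')
df = df.drop(columns=df.columns[i:j + 1])
```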
Example
import pandas as pd
d = {'col1': [1, "newcolumn1name", 5, 8, 15], 'col2': [5, "newcolumn2name", 10, 15, 20]}
df = pd.DataFrame(data=d)
df1 = df.copy()
df2 = df.copy()
df
Out[24]:
col1 col2
0 1 5
1 newcolumn1name newcolumn2name
2 5 10
3 8 15
4 15 20
What I would like to do with this example is to drop the first row and rename the columns with the string of the second row.
I can do this with the following code (complete python newcomer here):
df=df[1:]
new_header = df.iloc[0]
df=df[1:]
df.columns = new_header
df
Out[26]:
1 newcolumn1name newcolumn2name
2 5 10
3 8 15
4 15 20
Now I'd like to be able to do this over both df1 and df2, as defined in the example. I've tried lists, dictionaries, and map, but I ran into issues with all of them.
Can anyone think of the simplest way to do it? On my real data, I'll have six to ten data frames (~1000x8000) to run it on.
IIUC
l=[df1,df2]
[ d[1:].T.set_index(1).T for d in l]
Out[221]:
[1 newcolumn1name newcolumn2name
2 5 10
3 8 15
4 15 20, 1 newcolumn1name newcolumn2name
2 5 10
3 8 15
4 15 20]
Update
l=[df1,df2]
df1,df2=[ d[1:].T.set_index(1).T for d in l]
df1
Out[226]:
1 newcolumn1name newcolumn2name
2 5 10
3 8 15
4 15 20
df2
Out[227]:
1 newcolumn1name newcolumn2name
2 5 10
3 8 15
4 15 20
Update 2
variables = locals()
for x, d in enumerate(l):
    variables["df{0}".format(x+1)] = d[1:].T.set_index(1).T
df1
Out[231]:
1 newcolumn1name newcolumn2name
2 5 10
3 8 15
4 15 20
df2
Out[232]:
1 newcolumn1name newcolumn2name
2 5 10
3 8 15
4 15 20
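Mutating locals() is fragile (writes to it are not guaranteed to rebind names, especially inside functions), so a dict keyed by name is a safer sketch of the same idea:

```python
import pandas as pd

d = {'col1': [1, "newcolumn1name", 5, 8, 15],
     'col2': [5, "newcolumn2name", 10, 15, 20]}
df1 = pd.DataFrame(data=d)
df2 = pd.DataFrame(data=d)

# Keep the reheadered frames in a dict instead of mutating locals()
dfs = {"df{}".format(i + 1): frame[1:].T.set_index(1).T
       for i, frame in enumerate([df1, df2])}
```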
You can turn your logic into a function and use df.pipe. Something like the below could work (untested). Note that rebinding a loop variable (my_df = my_df.pipe(formatter)) would not update the originals, so collect the results instead:
def formatter(df):
    df = df[1:]
    new_header = df.iloc[0]
    df = df[1:]
    df.columns = new_header
    return df
df1, df2, df3, df4, df5, df6 = [d.pipe(formatter) for d in [df1, df2, df3, df4, df5, df6]]
Yet another solution for Pandas 0.21+:
In [21]: lst = [df1, df2]
In [22]: def renamer(df):
             return (df.iloc[2:]
                       .set_axis(df.iloc[1], axis='columns', inplace=False)
                       .rename_axis(None, axis=1))
In [23]: new = list(map(renamer, lst))
In [24]: new[0]
Out[24]:
newcolumn1name newcolumn2name
2 5 10
3 8 15
4 15 20
In [25]: new[1]
Out[25]:
newcolumn1name newcolumn2name
2 5 10
3 8 15
4 15 20
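On pandas >= 2.0 set_axis no longer accepts inplace (it returns a new object by default), so the same renamer can be sketched as:

```python
import pandas as pd

d = {'col1': [1, "newcolumn1name", 5, 8, 15],
     'col2': [5, "newcolumn2name", 10, 15, 20]}
df1 = pd.DataFrame(data=d)

def renamer(df):
    # Row 1 holds the new names; rows 2+ hold the data
    out = df.iloc[2:].set_axis(df.iloc[1], axis='columns')
    out.columns.name = None  # drop the leftover axis name from row 1
    return out

new = renamer(df1)
```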
Looking for ways to achieve the following updates on a dataframe:
dfb is the base dataframe that I want to update with dft transactions.
Any common index rows should be updated with values from dft.
Indexes only in dft should be appended to dfb.
Looking at the documentation, setting with enlargement looked perfect, but then I realized it only works with a single row. Is it possible to use setting with enlargement to do this update, or is there another method that could be recommended?
dfb = pd.DataFrame(data={'A': [11,22,33], 'B': [44,55,66]}, index=[1,2,3])
dfb
Out[70]:
A B
1 11 44
2 22 55
3 33 66
dft = pd.DataFrame(data={'A': [0,2,3], 'B': [4,5,6]}, index=[3,4,5])
dft
Out[71]:
A B
3 0 4
4 2 5
5 3 6
# Updated dfb should look like this:
dfb
Out[75]:
A B
1 11 44
2 22 55
3 0 4
4 2 5
5 3 6
You can use combine_first; the aligned union of indexes introduces NaN, so convert the float columns back to int with astype afterwards:
dft = dft.combine_first(dfb).astype(int)
print (dft)
A B
1 11 44
2 22 55
3 0 4
4 2 5
5 3 6
Another solution: find the indexes common to both DataFrames with Index.intersection, drop them from the first DataFrame dfb, and then use concat:
idx = dfb.index.intersection(dft.index)
print (idx)
Int64Index([3], dtype='int64')
dfb = dfb.drop(idx)
print (dfb)
A B
1 11 44
2 22 55
print (pd.concat([dfb, dft]))
A B
1 11 44
2 22 55
3 0 4
4 2 5
5 3 6
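A sketch of the same update using reindex plus DataFrame.update (update aligns on the index and overwrites in place):

```python
import pandas as pd

dfb = pd.DataFrame(data={'A': [11, 22, 33], 'B': [44, 55, 66]}, index=[1, 2, 3])
dft = pd.DataFrame(data={'A': [0, 2, 3], 'B': [4, 5, 6]}, index=[3, 4, 5])

# Grow dfb to the union of both indexes, then overwrite with dft's rows
dfb = dfb.reindex(dfb.index.union(dft.index))
dfb.update(dft)
dfb = dfb.astype(int)  # reindex/update introduce floats via NaN
```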