Summing columns to form a new dataframe - python

I have a DataFrame
A B C D
2015-07-18 4.534390e+05 2.990611e+05 5.706540e+05 4.554383e+05
2015-07-22 3.991351e+05 2.606576e+05 3.876394e+05 4.019723e+05
2015-08-07 1.085791e+05 8.215599e+04 1.356295e+05 1.096541e+05
2015-08-19 1.397305e+06 8.681048e+05 1.672141e+06 1.403100e+06
...
I simply want to sum all columns to get a new dataframe
A B C D
sum s s s s
with the column-wise sums, and then write it out with to_csv(). But when I use
print(df.sum(axis=0))
I get a Series instead of a DataFrame:
A 9.099377e+06
B 5.897003e+06
C 1.049932e+07
D 9.208681e+06
dtype: float64

You can convert df.sum() to a DataFrame and transpose it:
In [39]: df.sum().to_frame('sum').T
Out[39]:
A B C D
sum 2358458.2 1509979.49 2766063.9 2370164.7
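Since the question also asks to write the result out with to_csv(), the final step could look like this (the file name sums.csv is hypothetical):
df.sum().to_frame('sum').T.to_csv('sums.csv')  # one-row CSV of the column sums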

A slightly shorter version using the pd.DataFrame constructor (with credit to jezrael for the simplification):
In [120]: pd.DataFrame([df.sum()], index=['sum'])
Out[120]:
A B C D
sum 2358458.2 1509979.49 2766063.9 2370164.7

Use the DataFrame constructor:
df = pd.DataFrame(df.sum().values.reshape(-1, len(df.columns)),
                  columns=df.columns,
                  index=['sum'])
print(df)
A B C D
sum 2358458.2 1509979.49 2766063.9 2370164.7

I think the simplest is df.agg([sum])
df.agg([sum])
Out[40]:
A B C D
sum 2358458.2 1509979.49 2766063.9 2370164.7
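Note that [sum] here passes Python's built-in sum function; passing the string alias instead is equivalent and avoids shadowing the built-in:
df.agg(['sum'])  # same labelled 'sum' row, via the string alias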

Related

Use dataframe column containing "column name strings", to return values from dataframe based on column name and index without using .apply()

I have a dataframe as follows:
import numpy
import pandas

df = pandas.DataFrame()
df['A'] = numpy.random.random(10)
df['B'] = numpy.random.random(10)
df['C'] = numpy.random.random(10)
df['Col_name'] = numpy.random.choice(['A','B','C'], size=10)
I want to obtain an output that uses 'Col_name' and the respective index of the dataframe row to look up the value in the dataframe.
I can get the desired output with .apply() as follows:
df['output'] = df.apply(lambda x: x[x['Col_name']], axis=1)
.apply() is slow over a large dataframe, since it iterates row by row. Is there an obvious solution in pandas that is faster/vectorised?
You can also pick each column name (or give a list of possible names), apply it as a mask to filter your dataframe, then pick values from the desired column and assign them to all rows matching the mask. Then repeat this for the other columns.
for column_name in df:  # or: for column_name in ['A', 'B', 'C']
    df.loc[df['Col_name'] == column_name, 'output'] = df[column_name]
Rows that do not match any mask will keep NaN values.
PS. According to my test with 10,000,000 random rows, the .apply() method takes 2min 24s to finish while this method takes only 4.3s.
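For reference, a minimal sketch of how such a timing comparison could be reproduced (row count is reduced here; exact numbers are machine-dependent):
import time

import numpy as np
import pandas as pd

n = 1_000_000  # fewer rows than the 10,000,000 above, for a quicker run
df = pd.DataFrame(np.random.random((n, 3)), columns=['A', 'B', 'C'])
df['Col_name'] = np.random.choice(['A', 'B', 'C'], size=n)

t0 = time.perf_counter()
df['out_apply'] = df.apply(lambda x: x[x['Col_name']], axis=1)
t1 = time.perf_counter()
for column_name in ['A', 'B', 'C']:
    df.loc[df['Col_name'] == column_name, 'out_mask'] = df[column_name]
t2 = time.perf_counter()
print(f"apply: {t1 - t0:.1f}s  mask loop: {t2 - t1:.1f}s")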
Use melt to flatten your dataframe and keep rows where Col_name equals the variable column:
df['output'] = df.melt('Col_name', ignore_index=False).query('Col_name == variable')['value']
print(df)
# Output
A B C Col_name output
0 0.202197 0.430735 0.093551 B 0.430735
1 0.344753 0.979453 0.999160 C 0.999160
2 0.500904 0.778715 0.074786 A 0.500904
3 0.050951 0.317732 0.363027 B 0.317732
4 0.722624 0.026065 0.424639 C 0.424639
5 0.578185 0.626698 0.376692 C 0.376692
6 0.540849 0.805722 0.528886 A 0.540849
7 0.918618 0.869893 0.825991 C 0.825991
8 0.688967 0.203809 0.734467 B 0.203809
9 0.811571 0.010081 0.372657 B 0.010081
Transformation after melt:
>>> df.melt('Col_name', ignore_index=False)
Col_name variable value
0 B A 0.202197
1 C A 0.344753
2 A A 0.500904 # keep
3 B A 0.050951
4 C A 0.722624
5 C A 0.578185
6 A A 0.540849 # keep
7 C A 0.918618
8 B A 0.688967
9 B A 0.811571
0 B B 0.430735 # keep
1 C B 0.979453
2 A B 0.778715
3 B B 0.317732 # keep
4 C B 0.026065
5 C B 0.626698
6 A B 0.805722
7 C B 0.869893
8 B B 0.203809 # keep
9 B B 0.010081 # keep
0 B C 0.093551
1 C C 0.999160 # keep
2 A C 0.074786
3 B C 0.363027
4 C C 0.424639 # keep
5 C C 0.376692 # keep
6 A C 0.528886
7 C C 0.825991 # keep
8 B C 0.734467
9 B C 0.372657
Update
Alternative with set_index and stack for @Rabinzel:
df['output'] = (
    df.set_index('Col_name', append=True).stack()
      .loc[lambda x: x.index.get_level_values(1) == x.index.get_level_values(2)]
      .droplevel([1, 2])
)
print(df)
# Output
A B C Col_name output
0 0.209953 0.332294 0.812476 C 0.812476
1 0.284225 0.566939 0.087084 A 0.284225
2 0.815874 0.185154 0.155454 A 0.815874
3 0.017548 0.733474 0.766972 A 0.017548
4 0.494323 0.433719 0.979399 C 0.979399
5 0.875071 0.789891 0.319870 B 0.789891
6 0.475554 0.229837 0.338032 B 0.229837
7 0.123904 0.397463 0.288614 C 0.288614
8 0.288249 0.631578 0.393521 A 0.288249
9 0.107245 0.006969 0.367748 C 0.367748
import pandas as pd
import numpy as np
df=pd.DataFrame()
df['A'] = np.random.random(10)
df['B'] = np.random.random(10)
df['C'] = np.random.random(10)
df['Col_name'] = np.random.choice(['A','B','C'],size=10)
df["output"] = np.nan
Even though you do not like going row by row, I still routinely use loops to go through each row just to know where it breaks, when it breaks. Here are two loops, just to satisfy myself. The 'output' column is created ahead of time with NaN values because the loops need it to exist.
# each row by index; .loc avoids the chained-indexing SettingWithCopyWarning
for i in range(len(df)):
    df.loc[i, 'output'] = df.loc[i, df.loc[i, 'Col_name']]

# each row, but by column name: mask the rows naming that column, assign from it
for col in df['Col_name'].unique():
    mask = df['Col_name'] == col
    df.loc[mask, 'output'] = df.loc[mask, col]
Here are some "non-loop" ways to do so.
# DataFrame.lookup (note: deprecated in pandas 1.2 and removed in 2.0)
df["output"] = df.lookup(df.index, df.Col_name)
# an equivalent for current pandas versions, via factorize and positional indexing
idx, cols = pd.factorize(df['Col_name'])
df['output'] = df.reindex(cols, axis=1).to_numpy()[np.arange(len(df)), idx]

Is there a way to merge 2 rows of a df into 1?

I have a df that has plenty of row pairs that need to be condensed into 1. Column B identifies the pairs. All column values except one are identical. Is there a way to accomplish this in pandas?
Existing df:
A B C D E
x c v 2 w
x c v 2 r
Desired Output:
A B C D E
x c v 2 w,r
It's a little bit unintuitive to read but works:
df2 = (
    df.groupby('B', as_index=False)
      .agg({**dict.fromkeys(df.columns, 'first'), 'E': ','.join})
)
What we're doing here is grouping by column B and taking the first occurring value in every column for each group, but overriding the aggregation for column E so that the E values sharing the same B are joined with a comma.
Hence you get:
A B C D E
0 x c v 2 w,r
This doesn't make assumptions about data types and leaves non-string columns alone, but it will of course error out if your E column contains non-string values (or types that can't logically support joining).
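Since the question says all column values except one are identical within a pair, a minimal alternative sketch is to group by all the other columns directly, which avoids the 'first' overrides (assuming A, B, C, D are exactly those columns):
df2 = df.groupby(['A', 'B', 'C', 'D'], as_index=False)['E'].agg(','.join)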
Like this:
df = df.apply(lambda x: ','.join(x), axis=0)
To use specific columns:
df = df[['A','B']] ....

pandas dataframe multiplication by column, with index matched

import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.random.randn(6, 3), index=list("ABCDEF"), columns=list("XYZ"))
df2 = pd.DataFrame(np.random.randn(6, 1), index=list("ABCDEF"))
I want to multiply each column of df1 with df2, and match by index label. That means:
df1["X"]*df2
df1["Y"]*df2
df1["Z"]*df2
The output would have the index and columns of df1.
How can I do this? Tried several ways, and it still didn't work...
Use the mul function to multiply the DataFrame by the Series (the single column), selected by position with iloc:
print(df1.mul(df2.iloc[:,0], axis=0))
X Y Z
A -0.577748 0.299258 -0.021782
B -0.952604 0.024046 -0.276979
C 0.175287 2.507922 0.597935
D -0.002698 0.043514 -0.012256
E -1.598639 0.635508 1.532068
F 0.196783 -0.234017 -0.111166
Detail:
print(df2.iloc[:, 0])
A -2.875274
B 1.881634
C 1.369197
D 1.358094
E -0.024610
F 0.443865
Name: 0, dtype: float64
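Since df2 has only one column, squeeze is an equivalent way to pull that column out as a Series (a small variation, not from the original answer):
print(df1.mul(df2.squeeze(), axis=0))  # same result as selecting df2.iloc[:, 0]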
You can use apply to multiply each column in df1 by df2's single column.
df1.apply(lambda x: x * df2[0], axis=0)
X Y Z
A -0.437749 0.515611 -0.870987
B 0.105674 1.679020 -0.693983
C 0.055004 0.118673 -0.028035
D 0.704775 -1.786515 -0.982376
E 0.109218 -0.021522 -0.188369
F 1.491816 0.105558 -1.814437
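The two answers should agree element-wise; a quick sketch of the check (mul does the same alignment in a single vectorised call):
result_apply = df1.apply(lambda x: x * df2[0], axis=0)
result_mul = df1.mul(df2.iloc[:, 0], axis=0)
print(result_apply.equals(result_mul))  # should print True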

Converting multiple columns to categories in Pandas. apply?

Consider a DataFrame. I want to convert a set of columns, to_convert, to categories.
I can certainly do the following:
for col in to_convert:
df[col] = df[col].astype('category')
but I was surprised that the following does not return a dataframe:
df[to_convert].apply(lambda x: x.astype('category'), axis=0)
which of course makes the following not work:
df[to_convert] = df[to_convert].apply(lambda x: x.astype('category'), axis=0)
Why does apply (axis=0) return a Series even though it is supposed to act on the columns one by one?
This was just fixed in master, and so will be in 0.17.0 (see the issue on GitHub).
In [7]: df = DataFrame({'A' : list('aabbcd'), 'B' : list('ffghhe')})
In [8]: df
Out[8]:
A B
0 a f
1 a f
2 b g
3 b h
4 c h
5 d e
In [9]: df.dtypes
Out[9]:
A object
B object
dtype: object
In [10]: df.apply(lambda x: x.astype('category'))
Out[10]:
A B
0 a f
1 a f
2 b g
3 b h
4 c h
5 d e
In [11]: df.apply(lambda x: x.astype('category')).dtypes
Out[11]:
A category
B category
dtype: object
Note that since pandas 0.23.0 you no longer need apply to convert multiple columns to categorical data types. Now you can simply do df[to_convert].astype('category') instead (where to_convert is a set of columns as defined in the question).
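A minimal sketch of the modern idiom (to_convert here is a hypothetical list of column names, reusing the df from above):
to_convert = ['A', 'B']
df[to_convert] = df[to_convert].astype('category')
print(df.dtypes)  # A and B now show dtype 'category'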

Efficiently join two labels of a DataFrame index

I have a DataFrame with one column of integers and string index labels.
I want to join (as in sum up) the rows for two labels and replace them with a single new label.
My DataFrame is:
import numpy as np
import pandas as pd

df = pd.DataFrame(data=np.array([1, 2, 3, 4]), index=['a', 'b', 'c', 'd'], columns=['cost'])
cost
a 1
b 2
c 3
d 4
And I want to change it to:
cost
a 1
b 2
c and d 7
don't know if there is a cleaner way but this works:
In [157]:
df.append(pd.DataFrame(index=['c and d'], data={'cost':df.loc[df.cost.isin([3,4])].sum().values})).drop(['c','d'])
Out[157]:
cost
a 1
b 2
c and d 7
We construct a dataframe to append to your existing one: we set its index label to 'c and d' and its cost to the sum of the rows whose values are 3 and 4, then finally drop the original 'c' and 'd' rows.
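Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0; a sketch of the same idea using pd.concat:
new_row = pd.DataFrame({'cost': [df.loc[['c', 'd'], 'cost'].sum()]}, index=['c and d'])
df2 = pd.concat([df.drop(['c', 'd']), new_row])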
One option using df.reindex:
>>> df.loc['c and d'] = df.loc['c'] + df.loc['d']
>>> df = df.reindex(index=['a', 'b', 'c and d'])
>>> df
cost
a 1
b 2
c and d 7
[3 rows x 1 columns]
You could rename the index labels of the rows you want summed to the same name and use a groupby.
In [35]: df = df.rename(index={'d': 'c'})
In [36]: df.groupby(level=0).sum()
Out[36]:
cost
a 1
b 2
c 7
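To get the combined label 'c and d' as in the desired output, map both labels to the new name before grouping (a small variation on the same idea):
df.rename(index={'c': 'c and d', 'd': 'c and d'}).groupby(level=0).sum()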
