This should be easy:
I have a data frame with the following columns
a,b,min,w,w_min
All I want to do is sum up the columns min, w, and w_min and read that result into another data frame.
I've looked, but I cannot find a previously asked question that directly relates to this. Everything I've found seems much more complex than what I'm trying to do.
You can just pass a list of columns and select them to perform the summation on:
In [64]:
df = pd.DataFrame(columns=['a','b','min','w','w_min'], data = np.random.randn(10,5) )
df
Out[64]:
a b min w w_min
0 0.626671 0.850726 0.539850 -0.669130 -1.227742
1 0.856717 2.108739 -0.079023 -1.107422 -1.417046
2 -1.116149 -0.013082 0.871393 -1.681556 -0.170569
3 -0.944121 -2.394906 -0.454649 0.632995 1.661580
4 0.590963 0.751912 0.395514 0.580653 0.573801
5 -1.661095 -0.592036 -1.278102 -0.723079 0.051083
6 0.300866 -0.060604 0.606705 1.412149 0.916915
7 -1.640530 -0.398978 0.133140 -0.628777 -0.464620
8 0.734518 1.230869 -1.177326 -0.544876 0.244702
9 -1.300137 1.328613 -1.301202 0.951401 -0.693154
In [65]:
cols=['min','w','w_min']
df[cols].sum()
Out[65]:
min -1.743700
w -1.777642
w_min -0.525050
dtype: float64
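Since the question asks for the result in another data frame, the summed Series can then be wrapped into a one-row DataFrame, e.g. (a minimal sketch; `totals` is just an illustrative name):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(10, 5),
                  columns=['a', 'b', 'min', 'w', 'w_min'])

cols = ['min', 'w', 'w_min']
# df[cols].sum() is a Series; wrapping it in a list yields a one-row DataFrame
totals = pd.DataFrame([df[cols].sum()])
print(totals)
```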
Related
I am loading a txt file containing complex numbers. The data are formatted in this way:
How can I create two separate arrays, one for the real part and one for the imaginary part?
I tried to create a pandas dataframe using e-01 as a separator, but in this way I lose this info.
df = pd.read_fwf(r'c:\test\complex.txt', header=None)
df[['real','im']] = df[0].str.extract(r'\(([-.\de]+)([+-]\d\.[\de\-j]+)')
print(df)
0 real im
0 (9.486832980505137680e-01-3.162277660168379412... 9.486832980505137680e-01 -3.162277660168379412e-01j
1 (9.486832980505137680e-01+9.486832980505137680... 9.486832980505137680e-01 +9.486832980505137680e-01j
2 (-9.486832980505137680e-01+9.48683298050513768... -9.486832980505137680e-01 +9.486832980505137680e-01j
3 (-3.162277660168379412e-01+3.16227766016837941... -3.162277660168379412e-01 +3.162277660168379412e-01j
4 (-3.162277660168379412e-01+9.48683298050513768... -3.162277660168379412e-01 +9.486832980505137680e-01j
5 (9.486832980505137680e-01-3.162277660168379412... 9.486832980505137680e-01 -3.162277660168379412e-01j
6 (-3.162277660168379412e-01+3.16227766016837941... -3.162277660168379412e-01 +3.162277660168379412e-01j
7 (9.486832980505137680e-01-9.486832980505137680... 9.486832980505137680e-01 -9.486832980505137680e-01j
8 (9.486832980505137680e-01-9.486832980505137680... 9.486832980505137680e-01 -9.486832980505137680e-01j
9 (-3.162277660168379412e-01+3.16227766016837941... -3.162277660168379412e-01 +3.162277660168379412e-01j
10 (3.162277660168379412e-01-9.486832980505137680... 3.162277660168379412e-01 -9.486832980505137680e-01j
I never knew how annoyingly involved it is to read complex numbers with pandas. This is a slightly different solution from @Алексей's; I prefer to avoid regular expressions when not absolutely necessary.
# Read the file, pandas defaults to string type for contents
df = pd.read_csv('complex.txt', header=None, names=['string'])
# Convert string representation to complex.
# Use of `eval` is ugly but works.
df['complex'] = df['string'].map(eval)
# Alternatively...
#df['complex'] = df['string'].map(lambda c: complex(c.strip('()')))
# Separate real and imaginary parts
df['real'] = df['complex'].map(lambda c: c.real)
df['imag'] = df['complex'].map(lambda c: c.imag)
df
is...
string complex \
0 (9.486832980505137680e-01-3.162277660168379412... 0.948683-0.316228j
1 (9.486832980505137680e-01+9.486832980505137680... 0.948683+0.948683j
2 (-9.486832980505137680e-01+9.48683298050513768... -0.948683+0.000000j
3 (-3.162277660168379412e-01+3.16227766016837941... -0.316228+0.316228j
4 (-3.162277660168379412e-01+9.48683298050513768... -0.316228+0.948683j
5 (9.486832980505137680e-01-3.162277660168379412... 0.948683-0.316228j
6 (3.162277660168379412e-01+3.162277660168379412... 0.316228+0.316228j
7 (9.486832980505137680e-01-9.486832980505137680... 0.948683-0.948683j
real imag
0 0.948683 -3.162278e-01
1 0.948683 9.486833e-01
2 -0.948683 9.486833e-01
3 -0.316228 3.162278e-01
4 -0.316228 9.486833e-01
5 0.948683 -3.162278e-01
6 0.316228 3.162278e-01
7 0.948683 -9.486833e-01
df.dtypes
prints out..
string object
complex complex128
real float64
imag float64
dtype: object
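If the end goal is two NumPy arrays rather than extra columns, the complex Series can also be split in one vectorized step via .to_numpy() (a sketch with a small inline sample; the built-in complex() parses the parenthesised literal directly, so no regex or eval is needed):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'string': ['(9.486832980505137680e-01-3.162277660168379412e-01j)',
                              '(-3.162277660168379412e-01+9.486832980505137680e-01j)']})
# complex() accepts the parenthesised form as-is
values = df['string'].map(complex).to_numpy()

real = values.real   # float64 array of real parts
imag = values.imag   # float64 array of imaginary parts
```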
I have a dataframe like so:
time 0 1 2 3 4 5
0 3.477110 3.475698 3.475874 3.478345 3.476757 3.478169
1 3.422223 3.419752 3.417987 3.421341 3.418693 3.418340
2 3.474110 3.474816 3.477463 3.479757 3.479581 3.476757
3 3.504995 3.507112 3.504995 3.505877 3.507112 3.508171
4 3.426106 3.424870 3.422399 3.421517 3.419046 3.417105
6 3.364336 3.362571 3.360453 3.358335 3.357806 3.356924
7 3.364336 3.362571 3.360453 3.358335 3.357806 3.356924
8 3.364336 3.362571 3.360453 3.358335 3.357806 3.356924
but sktime requires the data to be in a format where each dataframe entry is a separate time series:
3.477110,3.475698,3.475874,3.478345,3.476757,3.478169
3.422223,3.419752,3.417987,3.421341,3.418693,3.418340
3.474110,3.474816,3.477463,3.479757,3.479581,3.476757
3.504995,3.507112,3.504995,3.505877,3.507112,3.508171
3.426106,3.424870,3.422399,3.421517,3.419046,3.417105
3.364336,3.362571,3.360453,3.358335,3.357806,3.356924
Essentially, as I have 6 columns of data, each row should become a separate series (of length 6), and the final shape should be (9, 1) (for this example) instead of the (9, 6) it is right now.
I have tried iterating over the rows and using various transform techniques, but to no avail. I am looking for something similar to the .squeeze() method but that works for multiple datapoints. How does one go about it?
I think you want something like this.
result = df.set_index('time').apply(np.array, axis=1)
print(result)
print(type(result))
print(result.shape)
time
0 [3.47711, 3.475698, 3.475874, 3.478345, 3.4767...
1 [3.422223, 3.419752, 3.417987, 3.421341, 3.418...
2 [3.47411, 3.474816, 3.477463, 3.479757, 3.4795...
3 [3.504995, 3.507112, 3.504995, 3.505877, 3.507...
4 [3.426106, 3.42487, 3.422399, 3.421517, 3.4190...
6 [3.364336, 3.362571, 3.360453, 3.358335, 3.357...
7 [3.364336, 3.362571, 3.360453, 3.358335, 3.357...
8 [3.364336, 3.362571, 3.360453, 3.358335, 3.357...
dtype: object
<class 'pandas.core.series.Series'>
(8,)
This is one pd.Series of length 8 (in your example data, index 5 is missing ;) ), and each value of the Series is a np.array. You can also go with list (in the apply statement) if you want.
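To get the (n, 1) shape rather than an (n,) Series, the result can be wrapped into a one-column DataFrame with to_frame() (a sketch on a shortened sample; 'dim_0' is an assumed column name, adjust it to whatever sktime expects):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'time': [0, 1, 2],
                   'c0': [3.477110, 3.422223, 3.474110],
                   'c1': [3.475698, 3.419752, 3.474816]})

result = df.set_index('time').apply(np.array, axis=1)
# to_frame turns the (n,) Series of arrays into an (n, 1) DataFrame
nested = result.to_frame('dim_0')
print(nested.shape)
```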
Convert all columns to str, because the join method only accepts strings.
Then join all columns with a "," delimiter:
df.astype(str).agg(','.join,axis=1)
df.astype(str).agg(','.join,axis=1).shape
(9,)
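A minimal runnable version of the join approach (note it yields comma-joined strings, not arrays, so it suits CSV-style output rather than sktime's nested format):

```python
import pandas as pd

df = pd.DataFrame({'a': [3.5, 2.25], 'b': [1.5, 0.75]})
# astype(str) converts every cell, then agg joins each row with ","
joined = df.astype(str).agg(','.join, axis=1)
print(joined.tolist())
```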
I have the following data frame:
import pandas as pd
data = pd.DataFrame()
data['Home'] = ['A','B','C','D','E','F']
data['HomePoint'] = [3,0,1,1,3,3]
data['Away'] = ['B','C','A','E','D','D']
data['AwayPoint'] = [0,3,1,1,0,0]
I want to combine the columns ['Home', 'Away'] into a single column named Team. Then I'd like to sum HomePoint and AwayPoint into a column named Points.
Team Points
A 4
B 0
C 4
D 1
E 4
F 3
How can I do it?
I was trying a different approach using the following post:
Link
But I was not able to get the format that I wanted.
Greatly appreciate your advice.
Thanks
Zep.
A simple way is to create two new Series indexed by the teams:
home = pd.Series(data.HomePoint.values, data.Home).groupby(level=0).sum()
away = pd.Series(data.AwayPoint.values, data.Away).groupby(level=0).sum()
Then, the result you want is:
home.add(away, fill_value=0).astype(int)
Note that home + away does not work, because team F never played away, so it would result in NaN for them; hence Series.add() with fill_value=0. The groupby(level=0).sum() collapses repeated team labels first (D appears twice in Away); aligning on a duplicated index would otherwise produce duplicate rows for D.
A complicated way is to use DataFrame.melt():
goo = data.melt(['HomePoint', 'AwayPoint'], var_name='At', value_name='Team')
goo.HomePoint.where(goo.At == 'Home', goo.AwayPoint).groupby(goo.Team).sum()
Or from the other perspective:
ooze = data.melt(['Home', 'Away'])
ooze.value.groupby(ooze.Home.where(ooze.variable == 'HomePoint', ooze.Away)).sum()
You can concatenate, pairwise, columns of your input dataframe. Then use groupby.sum.
# calculate number of pairs
n = len(data.columns) // 2
# create list of pairwise dataframes
df_lst = [data.iloc[:, 2*i:2*(i+1)].set_axis(['Team', 'Points'], axis=1)
          for i in range(n)]
# concatenate list of dataframes
df = pd.concat(df_lst, axis=0)
# perform groupby
res = df.groupby('Team', as_index=False)['Points'].sum()
print(res)
Team Points
0 A 4
1 B 0
2 C 4
3 D 1
4 E 4
5 F 3
I have a dataframe from which I want to know the highest value for each column. But I also want to know in what row it happened.
With my code I have to put the name of each column each time. Is there a better way to get all highest values from all columns?
df2.loc[df2['ALL'].idxmax()]
THE DATAFRAME
WHAT I GET WITH MY CODE
WHAT I WANT
You can stack your frame and then sort the values from largest to smallest and then take the first occurrence of your column names.
First I will create some fake data
df = pd.DataFrame(np.random.rand(10,5), columns=list('abcde'),
index=list('nopqrstuvw'))
df.columns.name = 'level_0'
df.index.name = 'level_1'
Output
level_0 a b c d e
level_1
n 0.417317 0.821350 0.443729 0.167315 0.281859
o 0.166944 0.223317 0.418765 0.226544 0.508055
p 0.881260 0.789210 0.289563 0.369656 0.610923
q 0.893197 0.494227 0.677377 0.065087 0.228854
r 0.394382 0.573298 0.875070 0.505148 0.334238
s 0.046179 0.039642 0.930811 0.326114 0.880804
t 0.143488 0.561449 0.832186 0.486752 0.323215
u 0.891823 0.616401 0.247078 0.497050 0.995108
v 0.888553 0.386260 0.816100 0.874761 0.769073
w 0.557239 0.601758 0.932839 0.274614 0.854063
Now stack, sort and drop all but the first column occurrence
df.stack()\
.sort_values(ascending=False)\
.reset_index()\
.drop_duplicates('level_0')\
.sort_values('level_0')[['level_0', 0, 'level_1']]
level_0 0 level_1
3 a 0.893197 q
12 b 0.821350 n
1 c 0.932839 w
9 d 0.874761 v
0 e 0.995108 u
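A more direct alternative to stacking is to combine DataFrame.max() with DataFrame.idxmax(), which give each column's maximum and the row label where it occurs:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((10, 5)), columns=list('abcde'),
                  index=list('nopqrstuvw'))

# one row per column: its maximum value and the index label of that maximum
result = pd.DataFrame({'max': df.max(), 'row': df.idxmax()})
print(result)
```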
I'm fairly new to programming and I have a question on using loops to recode variables in a pandas data frame that I was hoping I could get some help with.
I want to recode multiple columns in a pandas data frame from units of seconds to minutes. I've written a simple function in python and then can copy and repeat it on each column which works, but I wanted to automate this. I appreciate the help.
The ivf.secondsUntilCC.xxx column contains the number of seconds until something happens. I want the new column ivf.minsUntilCC.xxx to be the number of minutes. The data frame name is data.
def f(x,y):
return x[y]/60
data['ivf.minsUntilCC.500'] = f(data,'ivf.secondsUntilCC.500')
data['ivf.minsUntilCC.1000'] = f(data,'ivf.secondsUntilCC.1000')
data['ivf.minsUntilCC.2000'] = f(data,'ivf.secondsUntilCC.2000')
data['ivf.minsUntilCC.3000'] = f(data,'ivf.secondsUntilCC.3000')
data['ivf.minsUntilCC.4000'] = f(data,'ivf.secondsUntilCC.4000')
I would use a vectorized approach:
In [27]: df
Out[27]:
X ivf.minsUntilCC.500 ivf.minsUntilCC.1000 ivf.minsUntilCC.2000 ivf.minsUntilCC.3000 ivf.minsUntilCC.4000
0 191365 906395 854268 701859 979647 914942
1 288577 300394 577555 880370 924162 897984
2 66705 493545 232603 682509 794074 204429
3 747828 504930 379035 29230 410390 287327
4 926553 913360 657640 336139 210202 356649
In [28]: df.loc[:, df.columns.str.startswith('ivf.minsUntilCC.')] /= 60
In [29]: df
Out[29]:
X ivf.minsUntilCC.500 ivf.minsUntilCC.1000 ivf.minsUntilCC.2000 ivf.minsUntilCC.3000 ivf.minsUntilCC.4000
0 191365 15106.583333 14237.800000 11697.650000 16327.450000 15249.033333
1 288577 5006.566667 9625.916667 14672.833333 15402.700000 14966.400000
2 66705 8225.750000 3876.716667 11375.150000 13234.566667 3407.150000
3 747828 8415.500000 6317.250000 487.166667 6839.833333 4788.783333
4 926553 15222.666667 10960.666667 5602.316667 3503.366667 5944.150000
Setup:
df = pd.DataFrame(np.random.randint(0,10**6,(5,6)),
columns=['X','ivf.minsUntilCC.500', 'ivf.minsUntilCC.1000',
'ivf.minsUntilCC.2000', 'ivf.minsUntilCC.3000',
'ivf.minsUntilCC.4000'])
Explanation:
In [26]: df.loc[:, df.columns.str.startswith('ivf.minsUntilCC.')]
Out[26]:
ivf.minsUntilCC.500 ivf.minsUntilCC.1000 ivf.minsUntilCC.2000 ivf.minsUntilCC.3000 ivf.minsUntilCC.4000
0 906395 854268 701859 979647 914942
1 300394 577555 880370 924162 897984
2 493545 232603 682509 794074 204429
3 504930 379035 29230 410390 287327
4 913360 657640 336139 210202 356649
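Alternatively, the repeated assignments in the question can be collapsed into a plain loop over the matching column names (a sketch; it assumes the source columns share the ivf.secondsUntilCC. prefix, as in the question):

```python
import pandas as pd

data = pd.DataFrame({'ivf.secondsUntilCC.500': [120, 300],
                     'ivf.secondsUntilCC.1000': [60, 90]})

# snapshot the column names first, since the loop adds new columns
for col in list(data.columns):
    if col.startswith('ivf.secondsUntilCC.'):
        # derive the new column name and convert seconds to minutes
        data[col.replace('secondsUntilCC', 'minsUntilCC')] = data[col] / 60
```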