Binning a data set using Pandas - python

Given a csv file of...
How can one split the samples into overlapping bins of 0.25 m/z, where the first value of each tuple (Sample n,,) is an m/z value and the second is the weight?
To load the file into a Pandas DataFrame I currently do:
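import csv
import pandas as pd

def load_raw_data():
    raw_data = []
    # newline="" is what the csv module expects for files opened in text mode
    with open("negsmaller.csv", newline="") as rawfile:
        reader = csv.reader(rawfile, delimiter=",")
        for row in reader:
            raw_data.append(row)  # collect each CSV row
    raw_data = pd.DataFrame(raw_data)
    return raw_data.T  # transpose so that each CSV column becomes a row

if __name__ == '__main__':
    raw_data = load_raw_data()
    print(raw_data)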
Which returns
0 1 2 3 4 5 6 \
0 SAMPLE 1 50.0261 50.1057 50.1514 50.166 50.1831 50.2023
1 2.17E+02 2.65E+02 2.90E+02 3.85E+02 2.55E+02 3.45E+02
2 SAMPLE 2 50.0224 50.0435 50.0781 50.1245 50.1748 50.2161
3 3.31E+02 3.92E+02 3.88E+02 4.25E+02 3.71E+02 2.59E+02
4 SAMPLE 3 50.0007 50.0657 50.1115 50.1258 50.1411 50.1671
5 5.38E+02 5.52E+02 5.75E+02 5.11E+02 6.21E+02 5.56E+02
6 SAMPLE 4 50.0199 50.0465 50.0584 50.0765 50.1246 50.1866
7 2.39E+02 3.37E+02 2.58E+02 4.47E+02 1.43E+02 3.77E+02
7 8 9 ... 56 57 58 \
0 50.223 50.2631 50.2922 ... 52.2284 52.2666 52.2936
1 4.02E+02 1.89E+02 2.72E+02 ... 2.66E+02 5.32E+02 4.24E+02
2 50.2381 50.2826 50.3593 ... 51.9413 51.9616 51.9845
3 4.33E+02 4.63E+02 4.52E+02 ... 3.56E+02 2.19E+02 1.53E+02
4 50.1968 50.211 50.2279 ... 52.1412 52.1722 52.1821
5 6.31E+02 3.92E+02 5.92E+02 ... 6.20E+02 5.72E+02 5.18E+02
6 50.2276 50.2717 50.376 ... 51.8698 51.9084 51.937
7 3.41E+02 4.71E+02 3.09E+02 ... 1.74E+02 2.89E+02 1.69E+02
59 60 61 62 63 64 65
0 52.3256 52.3566 52.4192 52.4757 52.498 52.57 52.6178
1 3.69E+02 2.50E+02 3.08E+02 2.99E+02 2.37E+02 2.58E+02 4.29E+02
2 52.0051 52.0299 52.0603 52.0988 52.1176 52.1698 52.2352
3 3.53E+02 2.87E+02 3.15E+02 3.45E+02 3.63E+02 3.11E+02 3.96E+02
4 52.2473 52.3103 52.35 52.3807 52.4234 52.4451 52.4627
5 5.51E+02 4.12E+02 8.76E+02 6.95E+02 4.89E+02 4.54E+02 5.38E+02
6 51.9641 52.0292 52.0633 52.0797 52.1073 52.1546 52.2219
7 3.31E+02 2.63E+02 3.94E+02 2.88E+02 2.97E+02 3.41E+02 3.68E+02
[8 rows x 66 columns]
My desired output: group the m/z values into overlapping bins of width 0.25 and replace each bin with a single row holding the average of the weights in the column next to it. So,
0.01 3
0.10 4
0.24 2
would become
0.25 3
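One possible reading of this request can be sketched as follows. It is only a sketch under stated assumptions: the data has already been reshaped into a hypothetical long format with columns named mz and weight, and the bins are plain consecutive 0.25-wide intervals labelled by their upper edge (the question does not say how the bins should overlap).

```python
import numpy as np
import pandas as pd

# Hypothetical long-format data: one m/z value and one weight per row.
peaks = pd.DataFrame({
    "mz": [0.01, 0.10, 0.24, 0.30, 0.45],
    "weight": [3.0, 4.0, 2.0, 6.0, 8.0],
})

bin_width = 0.25

# Label every peak with the upper edge of the 0.25-wide bin it falls into,
# e.g. 0.01, 0.10 and 0.24 all map to 0.25.
peaks["bin"] = (np.floor(peaks["mz"] / bin_width) + 1) * bin_width

# Average the weights inside each bin.
binned = peaks.groupby("bin")["weight"].mean()
print(binned)
```

With the three values from the example above (weights 3, 4 and 2), the 0.25 bin ends up with the average weight 3.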


Converting time format to seconds in a pandas DataFrame

I have a df with time data and I would like to convert these values to seconds (see example below).
Compression_level Size (M) Real time (s) User time (s) Sys time (s)
0 0 265 0:19.938 0:24.649 0:3.062
1 1 76 0:17.910 0:25.929 0:3.098
2 2 74 1:02.619 0:27.724 0:3.014
3 3 73 0:20.607 0:27.937 0:3.193
4 4 67 0:19.598 0:28.853 0:2.925
5 5 67 0:21.032 0:30.119 0:3.206
6 6 66 0:27.013 0:31.462 0:3.106
7 7 65 0:27.337 0:36.226 0:3.060
8 8 64 0:37.651 0:47.246 0:2.933
9 9 64 0:59.241 1:8.333 0:3.027
This is the output I would like to obtain.
df["Real time (s)"]
0 19.938
1 17.910
2 62.619
I have some useful code, but I do not know how to apply it to every row of a data frame:
import datetime
import time
x = time.strptime("00:01:00", "%H:%M:%S")
datetime.timedelta(hours=x.tm_hour, minutes=x.tm_min, seconds=x.tm_sec).total_seconds()
Prepend 00: (zero hours) to each value with Series.radd, pass the result to to_timedelta, and then call Series.dt.total_seconds:
df["Real time (s)"] = pd.to_timedelta(df["Real time (s)"].radd('00:')).dt.total_seconds()
print(df)
Compression_level Size (M) Real time (s) User time (s) Sys time (s)
0 0 265 19.938 0:24.649 0:3.062
1 1 76 17.910 0:25.929 0:3.098
2 2 74 62.619 0:27.724 0:3.014
3 3 73 20.607 0:27.937 0:3.193
4 4 67 19.598 0:28.853 0:2.925
5 5 67 21.032 0:30.119 0:3.206
6 6 66 27.013 0:31.462 0:3.106
7 7 65 27.337 0:36.226 0:3.060
8 8 64 37.651 0:47.246 0:2.933
9 9 64 59.241 1:8.333 0:3.027
Solution for processing multiple columns:
def to_td(x):
    # x is one column (a Series) of "M:SS.fff" strings
    return pd.to_timedelta(x.radd('00:')).dt.total_seconds()

cols = ["Real time (s)", "User time (s)", "Sys time (s)"]
df[cols] = df[cols].apply(to_td)
print(df)
Compression_level Size (M) Real time (s) User time (s) Sys time (s)
0 0 265 19.938 24.649 3.062
1 1 76 17.910 25.929 3.098
2 2 74 62.619 27.724 3.014
3 3 73 20.607 27.937 3.193
4 4 67 19.598 28.853 2.925
5 5 67 21.032 30.119 3.206
6 6 66 27.013 31.462 3.106
7 7 65 27.337 36.226 3.060
8 8 64 37.651 47.246 2.933
9 9 64 59.241 68.333 3.027
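If the strings are always of the form minutes:seconds, the same conversion can also be sketched without to_timedelta, by splitting on the colon and doing the arithmetic directly. This is an alternative to the approach above, not part of it; the helper name mmss_to_seconds and the small sample frame are made up for illustration.

```python
import pandas as pd

def mmss_to_seconds(col):
    """Convert a Series of 'M:SS.fff' strings to float seconds."""
    # Split e.g. "1:02.619" into two numeric columns: minutes (0) and seconds (1).
    parts = col.str.split(":", expand=True).astype(float)
    return parts[0] * 60 + parts[1]

# Hypothetical sample using a few of the values shown above.
times = pd.DataFrame({
    "Real time (s)": ["0:19.938", "0:17.910", "1:02.619"],
    "User time (s)": ["0:24.649", "0:25.929", "1:8.333"],
})

# DataFrame.apply calls the helper once per column.
seconds = times.apply(mmss_to_seconds)
print(seconds)
```

This assumes there is never an hours field; a value such as 1:02:03.5 would need a third part in the arithmetic.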

How to calculate the expanding mean of all the columns across the DataFrame and add to DataFrame

I am trying to calculate the means of all previous rows for each column of the DataFrame and add the calculated mean column to the DataFrame.
I am using a data set of NBA games that contains 20+ features (columns) for which I want to calculate these means. An example of the data set is below. (Note: "...." represents the rest of the feature columns.)
Team TeamPoints OpponentPoints.... TeamPoints_mean OpponentPoints_mean
ATL 102 109 .... nan nan
ATL 102 92 .... 102 109
ATL 92 94 .... 102 100.5
BOS 119 122 .... 98.67 98.33
BOS 103 96 .... 103.75 104.25
Example for calculating two of the columns:
dataset = pd.read_csv('')
df = dataset
df['TeamPoints_mean'] = (df.groupby('Team')['TeamPoints'].apply(lambda x: x.shift().expanding().mean()))
df['OpponentPoints_mean'] = (df.groupby('Team')['OpponentPoints'].apply(lambda x: x.shift().expanding().mean()))
Again, this code calculates only one mean and adds only one column to the DataFrame at a time. Is there a way to compute the means for all columns and add them to the DataFrame in one step, perhaps with a for loop? An example of what I am looking for is below.
Team TeamPoints OpponentPoints.... TeamPoints_mean OpponentPoints_mean ...("..." = mean columns of rest of the feature columns)
ATL 102 109 .... nan nan
ATL 102 92 .... 102 109
ATL 92 94 .... 102 100.5
BOS 119 122 .... 98.67 98.33
BOS 103 96 .... 103.75 104.25
Try this one:
(0) sample input:
>>> df
col1 col2 col3
0 1.490977 1.784433 0.852842
1 3.726663 2.845369 7.766797
2 0.042541 1.196383 6.568839
3 4.784911 0.444671 8.019933
4 3.831556 0.902672 0.198920
5 3.672763 2.236639 1.528215
6 0.792616 2.604049 0.373296
7 2.281992 2.563639 1.500008
8 4.096861 0.598854 4.934116
9 3.632607 1.502801 0.241920
Then processing:
(1) Build a side table holding all of the means (I didn't find a cumulative mean function, so I went with cumsum + count):
>>> df_side=df.assign(col_temp=1).cumsum()
>>> df_side
col1 col2 col3 col_temp
0 1.490977 1.784433 0.852842 1.0
1 5.217640 4.629801 8.619638 2.0
2 5.260182 5.826184 15.188477 3.0
3 10.045093 6.270855 23.208410 4.0
4 13.876649 7.173527 23.407330 5.0
5 17.549412 9.410166 24.935545 6.0
6 18.342028 12.014215 25.308841 7.0
7 20.624021 14.577855 26.808849 8.0
8 24.720882 15.176708 31.742965 9.0
9 28.353489 16.679509 31.984885 10.0
>>> for el in df.columns:
...     df_side["{}_mean".format(el)] = df_side[el] / df_side.col_temp
...
>>> df_side = df_side.drop([el for el in df.columns] + ["col_temp"], axis=1)
>>> df_side
col1_mean col2_mean col3_mean
0 1.490977 1.784433 0.852842
1 2.608820 2.314901 4.309819
2 1.753394 1.942061 5.062826
3 2.511273 1.567714 5.802103
4 2.775330 1.434705 4.681466
5 2.924902 1.568361 4.155924
6 2.620290 1.716316 3.615549
7 2.578003 1.822232 3.351106
8 2.746765 1.686301 3.526996
9 2.835349 1.667951 3.198489
(2) joining back, on index:
>>> df_final=df.join(df_side)
>>> df_final
col1 col2 col3 col1_mean col2_mean col3_mean
0 1.490977 1.784433 0.852842 1.490977 1.784433 0.852842
1 3.726663 2.845369 7.766797 2.608820 2.314901 4.309819
2 0.042541 1.196383 6.568839 1.753394 1.942061 5.062826
3 4.784911 0.444671 8.019933 2.511273 1.567714 5.802103
4 3.831556 0.902672 0.198920 2.775330 1.434705 4.681466
5 3.672763 2.236639 1.528215 2.924902 1.568361 4.155924
6 0.792616 2.604049 0.373296 2.620290 1.716316 3.615549
7 2.281992 2.563639 1.500008 2.578003 1.822232 3.351106
8 4.096861 0.598854 4.934116 2.746765 1.686301 3.526996
9 3.632607 1.502801 0.241920 2.835349 1.667951 3.198489
"I am trying to calculate the means of all previous rows for each column of the DataFrame"
To get all of the columns, you can do:
df_means = df.join(df.cumsum() /
                   df.applymap(lambda x: 1).cumsum(),
                   rsuffix="_mean")
However, if Team is a column rather than the index, you'd want to drop it first:
df_data = df.drop('Team', axis=1)
df_means = df.join(df_data.cumsum() /
                   df_data.applymap(lambda x: 1).cumsum(),
                   rsuffix="_mean")
You could also do
import numpy as np
df_data = df[[col for col in df.columns
              if np.issubdtype(df[col].dtype, np.number)]]
Or manually define a list of columns that you want to take the mean of, cols_for_mean, and then do
df_data = df[cols_for_mean]
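For the per-team "mean of all previous rows" that the question asks for, one way to handle every feature column in a single step might look like the sketch below. It reuses the shift().expanding().mean() idea from the question inside groupby(...).transform; the small games frame and the feature_cols list are made-up stand-ins for the real data set.

```python
import pandas as pd

# Hypothetical miniature of the NBA data set from the question.
games = pd.DataFrame({
    "Team": ["ATL", "ATL", "ATL", "BOS", "BOS"],
    "TeamPoints": [102, 102, 92, 119, 103],
    "OpponentPoints": [109, 92, 94, 122, 96],
})

# Columns to average; with 20+ features this could also be built programmatically.
feature_cols = ["TeamPoints", "OpponentPoints"]

# For each team and each feature column: shift by one row so the current game
# is excluded, then take the expanding (running) mean of what came before.
prev_means = (
    games.groupby("Team")[feature_cols]
    .transform(lambda s: s.shift().expanding().mean())
    .add_suffix("_mean")
)

# Attach the new *_mean columns to the original rows (aligned on the index).
result = games.join(prev_means)
print(result)
```

The first game of each team has no previous rows, so its *_mean values are NaN.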

Unable to read the first row of a .dat file using pandas

I have a .dat file whose origin I am not sure of. I have to read this file in order to perform PCA. Assuming it to be a whitespace-delimited file, I was able to read its contents and drop the first column (as it is an index), but the very first row is missing from the result. Below is the code:
import numpy as np
import pandas as pd
from numpy import array
myarray = pd.read_csv('hand_postures.dat', delim_whitespace=True)
myarray = array(myarray)
myarray = np.delete(myarray,0,1)
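The file is shared at the link. Can someone help me point out my mistake?
By default, pd.read_csv treats the first row as the header, so that row never shows up in the data. You need an extra parameter when calling pd.read_csv: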
df = pd.read_csv('hand_postures.dat', header=None, delim_whitespace=True, index_col=[0])
1 2 3 4 5 6 7 8 \
0 -65.55560 0.172413 44.4944 22.2472 0.000000 50.6723 34.3434 17.1717
1 -65.55560 2.586210 43.8202 21.9101 0.277778 51.4286 34.3434 17.1717
2 -45.55560 5.000000 43.8202 21.9101 0.833333 56.7227 42.4242 21.2121
3 5.55556 -2.241380 46.5169 23.2584 1.111110 70.3361 85.8586 42.9293
4 67.77780 20.689700 59.3258 29.6629 2.222220 80.9244 93.9394 46.9697
9 10 11 12 13 14 15 16 \
0 -0.235294 54.6154 39.7849 19.8925 0.705883 37.2656 41.3043 20.6522
1 -0.235294 55.3846 38.7097 19.3548 0.705883 38.6719 41.3043 20.6522
2 0.000000 63.0769 47.3118 23.6559 0.000000 47.8125 54.3478 27.1739
3 -0.117647 83.8462 90.3226 45.1613 0.352941 73.1250 92.3913 46.1957
4 0.117647 93.8462 98.9247 49.4624 -0.352941 89.2969 100.0000 50.0000
17 18 19 20
0 15.0 34.6584 54.1270 27.0635
1 14.4 35.2174 55.8730 27.9365
2 14.4 43.6025 69.8413 34.9206
3 3.6 73.7888 94.2857 47.1429
4 -1.2 92.2360 106.5080 53.2540
header=None specifies that the first row is part of the data (and not the header)
index_col=[0] specifies that the first column is to be treated as the index
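The effect of these two parameters can be checked on a small in-memory stand-in for the file. This is only a sketch: the three lines of text below are made up to resemble hand_postures.dat, and sep=r"\s+" is used in place of delim_whitespace=True, which newer pandas releases deprecate.

```python
import io
import pandas as pd

# Hypothetical stand-in for the .dat file: an index column followed by data columns.
raw = io.StringIO(
    "0 -65.5556 0.172413\n"
    "1 -65.5556 2.58621\n"
    "2 -45.5556 5.0\n"
)

# header=None keeps the first line as data; index_col=0 uses the first column as the index.
df = pd.read_csv(raw, header=None, sep=r"\s+", index_col=0)

# Plain NumPy array of the data columns, e.g. as input for PCA.
data = df.to_numpy()
print(data.shape)
```

All three lines survive as rows, and the index column is no longer part of the array, so the separate np.delete step is not needed.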

How to add a repeated column using pandas

I am doing my homework and have run into a problem. I have a large matrix whose first column, Y002, is a nominal variable with 3 levels, encoded as 1, 2, 3. The other two columns, V96 and V97, are numeric.
Now I want to get the group means corresponding to the variable Y002. I wrote the code like this:
group = data2.groupby(by=["Y002"]).mean()
Then I index the result to get each group mean using
group1 = group["V96"]
group2 = group["V97"]
Now I want to append this group mean as a new column to the original dataframe, so that each row gets the mean matching its Y002 code (1, 2 or 3). I tried this code, but it only shows NaN:
data2["group1"] = pd.Series(group1, index=data2.index)
Hope someone can help me with this, many thanks :)
PS: Hope this makes sense. In R, we can do the same thing using
data2$group1 = with(data2, tapply(V97,Y002,mean))[data2$Y002]
But how can we implement this in Python and pandas?
You can use .transform(), which returns a result aligned with the original rows instead of one row per group:
import pandas as pd
import numpy as np
# your data
# ============================
df = pd.DataFrame({'Y002': np.random.randint(1,4,100), 'V96': np.random.randn(100), 'V97': np.random.randn(100)})
V96 V97 Y002
0 -0.6866 -0.1478 1
1 0.0149 1.6838 2
2 -0.3757 0.9718 1
3 -0.0382 1.6077 2
4 0.3680 -0.2571 2
5 -0.0447 1.8098 3
6 -0.3024 0.8923 1
7 -2.2244 -0.0966 3
8 0.7240 -0.3772 1
9 0.3590 -0.5053 1
.. ... ... ...
90 -0.6906 1.5567 2
91 -0.6815 -0.4189 3
92 -1.5122 -0.4097 1
93 2.1969 1.1164 2
94 1.0412 -0.2510 3
95 -0.0332 -0.4152 1
96 0.0656 -0.6391 3
97 0.2658 2.4978 1
98 1.1518 -3.0051 2
99 0.1380 -0.8740 3
# processing
# ===========================
df['V96_mean'] = df.groupby('Y002')['V96'].transform('mean')
df['V97_mean'] = df.groupby('Y002')['V97'].transform('mean')
V96 V97 Y002 V96_mean V97_mean
0 -0.6866 -0.1478 1 -0.1944 0.0837
1 0.0149 1.6838 2 0.0497 -0.0496
2 -0.3757 0.9718 1 -0.1944 0.0837
3 -0.0382 1.6077 2 0.0497 -0.0496
4 0.3680 -0.2571 2 0.0497 -0.0496
5 -0.0447 1.8098 3 0.0053 -0.0707
6 -0.3024 0.8923 1 -0.1944 0.0837
7 -2.2244 -0.0966 3 0.0053 -0.0707
8 0.7240 -0.3772 1 -0.1944 0.0837
9 0.3590 -0.5053 1 -0.1944 0.0837
.. ... ... ... ... ...
90 -0.6906 1.5567 2 0.0497 -0.0496
91 -0.6815 -0.4189 3 0.0053 -0.0707
92 -1.5122 -0.4097 1 -0.1944 0.0837
93 2.1969 1.1164 2 0.0497 -0.0496
94 1.0412 -0.2510 3 0.0053 -0.0707
95 -0.0332 -0.4152 1 -0.1944 0.0837
96 0.0656 -0.6391 3 0.0053 -0.0707
97 0.2658 2.4978 1 -0.1944 0.0837
98 1.1518 -3.0051 2 0.0497 -0.0496
99 0.1380 -0.8740 3 0.0053 -0.0707
[100 rows x 5 columns]
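A closer analogue of the R one-liner in the question is to compute the group means once and then look them up by each row's Y002 code with Series.map. The sketch below shows that alternative on a tiny made-up data2 frame; the values are illustrative only.

```python
import pandas as pd

# Hypothetical miniature of data2 from the question.
data2 = pd.DataFrame({
    "Y002": [1, 2, 1, 3, 2],
    "V96": [0.5, 1.5, 2.5, 3.5, 4.5],
    "V97": [1.0, 2.0, 3.0, 4.0, 6.0],
})

# One mean per Y002 level (the result is indexed by the codes 1, 2, 3).
group_means = data2.groupby("Y002")["V97"].mean()

# Look up each row's code in that table, like tapply(...)[data2$Y002] in R.
data2["V97_mean"] = data2["Y002"].map(group_means)
print(data2)
```

Both approaches give the same column; transform does it in one call, while map keeps the per-group table around if it is needed elsewhere.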

Pandas appending Series to DataFrame to write to a file

I have a list of DataFrames that I want to compute the mean of:
~ pieces[1].head()
301 manual 82.150833 7 69 ... 3.615 1.952 1.241
302 manual 82.150833 7 69 ... 3.615 1.952 1.241
303 manual 82.150833 7 69 ... 3.615 1.952 1.241
304 manual 82.150833 7 69 ... 3.615 1.952 1.241
305 manual 82.150833 7 69 ... 3.615 1.952 1.241
So I am looping through them:
pieces = np.array_split(df, size)
output = pd.DataFrame()
for piece in pieces:
    dp = piece.mean()
    output = output.append(dp, ignore_index=True)
Unfortunately the output is sorted (the column names are alphabetical in the output) and I want to keep the original column order (as seen up top).
~ output.head()
0 44.578937 66.183858 14.466816 14.113321 18.831117 6.677792
1 34.042593 66.231229 14.320409 14.113321 22.368983 6.677792
2 34.497194 66.309320 14.210066 14.113321 25.353414 6.677792
3 43.430931 66.376632 14.314854 14.113321 28.462130 6.677792
4 44.419204 66.516515 14.314653 14.113321 32.244107 6.677792
I have tried variations of concat etc. with no success. Is there a different way to think about this?
My recommendation would be to concatenate the list of data frames using pd.concat. This will allow you to use the standard group-by/apply. In this example, multi_df is a DataFrame with a MultiIndex; it behaves like a standard data frame, only the indexing and group-by are a little different:
x = []
for i in range(10):
    x.append(pd.DataFrame(dict(zip(list('abc'), [i + 1, i + 2, i + 3])), index=list('ind')))
Now x contains a list of data frames of the shape
a b c
i 2 3 4
n 2 3 4
d 2 3 4
And with
multi_df = pd.concat(x, keys = range(len(x)))
result = multi_df.groupby(level=0).mean()
we get a data frame that looks like
a b c
0 1 2 3
1 2 3 4
2 3 4 5
3 4 5 6
4 5 6 7
5 6 7 8
6 7 8 9
7 8 9 10
8 9 10 11
9 10 11 12
You can then just call result.to_csv('filepath') to write that out.
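If the only goal is one row of means per piece with the original column order preserved, a shorter route might be to collect the per-piece means in a list and reindex the columns explicitly. This is a sketch with a made-up two-column frame whose column names are deliberately not in alphabetical order; the pieces are built with iloc slices here instead of np.array_split.

```python
import pandas as pd

# Hypothetical frame; "zeta" comes before "alpha" on purpose.
df = pd.DataFrame({
    "zeta": [1.0, 2.0, 3.0, 4.0],
    "alpha": [10.0, 20.0, 30.0, 40.0],
})

# Stand-in for the list of pieces from the question.
pieces = [df.iloc[:2], df.iloc[2:]]

# One Series of column means per piece, stacked as rows of a new frame,
# then put back into the original column order.
output = pd.DataFrame([piece.mean() for piece in pieces]).reindex(columns=df.columns)
print(output)
```

output.to_csv('filepath') then writes the rows in that order.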

