I am using a pandas DataFrame to store data from a series of experiments so that I can easily make cuts across various parameter values for the next stage of analysis. I have a few questions about how to do this most effectively.
Currently I create my DataFrame from a dictionary of lists. There are typically a few thousand rows in the DataFrame. One of the columns is a device_id, which indicates which of the 20 devices the experimental data pertains to. Other columns include info about the experimental setup, like temperature, power, etc., and measurement results, like resonant_frequency, bandwidth, etc.
So far, I've been using this DataFrame rather "naively," that is, I use it sort of like a numpy record array, and so I don't think I'm fully taking advantage of the power of the DataFrame. The following are some examples of what I'm trying to achieve.
First I want to create a new column which is the maximum resonant_frequency measured for a given device over all experiments: call it max_freq. I do this like so:
df['max_freq'] = np.zeros(df.shape[0])  # create the new column
for index in np.unique(df.device_index):
    group = df[df.device_index == index]
    group_max = group.resonant_frequency.max()
    df.max_freq[df.device_index == index] = group_max
Second, one of my columns contains 1-D numpy arrays of a noise measurement. I want to compute a statistic on this 1-D array and put it into a new column. Currently I do this as:
noise_est = []
for vals, freq in zip(df.noise, df.resonant_freq):
    noise_est.append(vals.std()/(1e6*freq))
df['noise_est'] = noise_est
Third, related to the previous one: is it possible to iterate through rows of a DataFrame where the resulting object has attribute access to the columns? I.e. something like:
for row in df:
    row.noise_est = row.noise.std()/(1e6*row.resonant_freq)
I know that this instead iterates through columns. I also know there is an iterrows method, but this provides a Series which doesn't allow attribute access.
I think this should get me started for now, thanks for your time!
edited to add df.info(), df.head() as requested:
df.info() # df.head() looks the same, but 5 non-null values
<class 'pandas.core.frame.DataFrame'>
Int64Index: 9620 entries, 0 to 9619
Data columns (total 83 columns):
A_mag 9620 non-null values
A_mag_err 9620 non-null values
A_phase 9620 non-null values
A_phase_err 9620 non-null values
....
total_dac_atten 9600 non-null values
round_temp 9620 non-null values
dtypes: bool(1), complex128(4), float64(39), int64(12), object(27)
I trimmed this down because it's 83 columns, and I don't think this adds much to the example code snippets I shared, but I've posted this bit in case it's helpful.
Create data. Note that storing a numpy array INSIDE a frame is generally not a good idea as it's pretty inefficient.
In [84]: df = pd.DataFrame(dict(A = np.random.randn(20), B = np.random.randint(0,3,size=20), C = [ np.random.randn(5) for i in range(20) ]))
In [85]: df
Out[85]:
A B C
0 -0.493730 1 [-0.8790126045, -1.87366673214, 0.76227570837,...
1 -0.105616 2 [0.612075134682, -1.64452324091, 0.89799758012...
2 1.487656 1 [-0.379505426885, 1.17611806172, 0.88321152932...
3 0.351694 2 [0.132071242514, -1.54701609348, 1.29813626801...
4 -0.330538 2 [0.395383858214, 0.874419943107, 1.21124463921...
5 0.360041 0 [0.439133138619, -1.98615530266, 0.55971723554...
6 -0.505198 2 [-0.770830608002, 0.243255072359, -1.099514797...
7 0.631488 1 [0.676233200011, 0.622926691271, -0.1110029751...
8 1.292087 1 [1.77633938532, -0.141683361957, 0.46972952154...
9 0.641987 0 [1.24802709304, 0.477527098462, -0.08751885691...
10 0.732596 2 [0.475771915314, 1.24219702097, -0.54304296895...
11 0.987054 1 [-0.879620967644, 0.657193159735, -0.093519342...
12 -1.409455 1 [1.04404325784, -0.310849157425, 0.60610368623...
13 1.063830 1 [-0.760467872808, 1.33659372288, -0.9343171844...
14 0.533835 1 [0.985463451645, 1.76471927635, -0.59160181340...
15 0.062441 1 [-0.340170594584, 1.53196133354, 0.42397775978...
16 1.458491 2 [-1.79810090668, -1.82865815817, 1.08140831482...
17 -0.886119 2 [0.281341969073, -1.3516126536, 0.775326038501...
18 0.662076 1 [1.03992509625, 1.17661862104, -0.562683934951...
19 1.216878 2 [0.0746149754367, 0.156470450639, -0.477269150...
In [86]: df.dtypes
Out[86]:
A float64
B int64
C object
dtype: object
Apply an operation to each value of a series (addresses questions 2 and 3)
In [88]: df['C_std'] = df['C'].apply(np.std)
Get the max of each group and broadcast it back to each row (addresses question 1)
In [91]: df['A_max_by_group'] = df.groupby('B')['A'].transform(lambda x: x.max())
In [92]: df
Out[92]:
A B C A_max_by_group C_std
0 -0.493730 1 [-0.8790126045, -1.87366673214, 0.76227570837,... 1.487656 1.058323
1 -0.105616 2 [0.612075134682, -1.64452324091, 0.89799758012... 1.458491 0.987980
2 1.487656 1 [-0.379505426885, 1.17611806172, 0.88321152932... 1.487656 1.264522
3 0.351694 2 [0.132071242514, -1.54701609348, 1.29813626801... 1.458491 1.150026
4 -0.330538 2 [0.395383858214, 0.874419943107, 1.21124463921... 1.458491 1.045408
5 0.360041 0 [0.439133138619, -1.98615530266, 0.55971723554... 0.641987 1.355853
6 -0.505198 2 [-0.770830608002, 0.243255072359, -1.099514797... 1.458491 0.443872
7 0.631488 1 [0.676233200011, 0.622926691271, -0.1110029751... 1.487656 0.432342
8 1.292087 1 [1.77633938532, -0.141683361957, 0.46972952154... 1.487656 1.021847
9 0.641987 0 [1.24802709304, 0.477527098462, -0.08751885691... 0.641987 0.676835
10 0.732596 2 [0.475771915314, 1.24219702097, -0.54304296895... 1.458491 0.857441
11 0.987054 1 [-0.879620967644, 0.657193159735, -0.093519342... 1.487656 0.628655
12 -1.409455 1 [1.04404325784, -0.310849157425, 0.60610368623... 1.487656 0.835633
13 1.063830 1 [-0.760467872808, 1.33659372288, -0.9343171844... 1.487656 0.936746
14 0.533835 1 [0.985463451645, 1.76471927635, -0.59160181340... 1.487656 0.991327
15 0.062441 1 [-0.340170594584, 1.53196133354, 0.42397775978... 1.487656 0.700299
16 1.458491 2 [-1.79810090668, -1.82865815817, 1.08140831482... 1.458491 1.649771
17 -0.886119 2 [0.281341969073, -1.3516126536, 0.775326038501... 1.458491 0.910355
18 0.662076 1 [1.03992509625, 1.17661862104, -0.562683934951... 1.487656 0.666237
19 1.216878 2 [0.0746149754367, 0.156470450639, -0.477269150... 1.458491 0.275065
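As a further sketch (not from the original answer): the asker's noise_est statistic from question 2 needs two columns at once, and question 3 asks about attribute access while iterating rows. On the toy frame above, treating column C as the noise array and A as resonant_freq, something like the following should work; itertuples is only used here for reading values, not for writing them back.
# question 2: a row-wise apply works when the statistic needs more than one column
df['noise_est'] = df.apply(lambda row: row['C'].std() / (1e6 * row['A']), axis=1)
# question 3: itertuples() yields namedtuples, which do allow attribute access
for row in df.itertuples():
    print(row.Index, row.A, row.C_std)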
I've been on this all night, and just can't figure it out, even though I know it should be simple. So, my sincerest apologies for the following incantation from a sleep-deprived fellow:
So, I have four fields, Employee ID, Name, Station and Shift (ID is non-null integer, the rest are strings or null).
I have about 10 dataframes, all indexed by ID, and each containing only two columns: either (Name and Station) or (Name and Shift).
Now of course, I want to combine all of this into one dataframe, which has a unique row for each ID.
But I'm really frustrated by it at this point (especially because I can't find a way to directly check how many unique indices my final dataframe ends up with).
After messing around with some very ugly ways of using .merge(), I finally found .concat(). But it keeps making multiple rows per ID: when I check in Excel, the indices are like Table1/1234, Table2/1234, etc. One row has the shift, the other has the station, which is precisely what I'm trying to avoid.
How do I compile all my data into one dataframe, having exactly one row per ID? Possibly without using 9 different merge statements, as I have to scale up later.
If I understand your question correctly, this is what you want.
For example, with these 3 dataframes:
In [1]: df1
Out[1]:
0 1 2
0 3.588843 3.566220 6.518865
1 7.585399 4.269357 4.781765
2 9.242681 7.228869 5.680521
3 3.600121 3.931781 4.616634
4 9.830029 9.177663 9.842953
5 2.738782 3.767870 0.925619
6 0.084544 6.677092 1.983105
7 5.229042 4.729659 8.638492
8 8.575547 6.453765 6.055660
9 4.386650 5.547295 8.475186
In [2]: df2
Out[2]:
0 1
0 95.013170 90.382886
2 1.317641 29.600709
4 89.908139 21.391058
6 31.233153 3.902560
8 17.186079 94.768480
In [3]: df
Out[3]:
0 1 2
0 0.777689 0.357484 0.753773
1 0.271929 0.571058 0.229887
2 0.417618 0.310950 0.450400
3 0.682350 0.364849 0.933218
4 0.738438 0.086243 0.397642
5 0.237481 0.051303 0.083431
6 0.543061 0.644624 0.288698
7 0.118142 0.536156 0.098139
8 0.892830 0.080694 0.084702
9 0.073194 0.462129 0.015707
You can do
pd.concat([df,df1,df2], axis=1)
This produces
In [6]: pd.concat([df,df1,df2], axis=1)
Out[6]:
0 1 2 0 1 2 0 1
0 0.777689 0.357484 0.753773 3.588843 3.566220 6.518865 95.013170 90.382886
1 0.271929 0.571058 0.229887 7.585399 4.269357 4.781765 NaN NaN
2 0.417618 0.310950 0.450400 9.242681 7.228869 5.680521 1.317641 29.600709
3 0.682350 0.364849 0.933218 3.600121 3.931781 4.616634 NaN NaN
4 0.738438 0.086243 0.397642 9.830029 9.177663 9.842953 89.908139 21.391058
5 0.237481 0.051303 0.083431 2.738782 3.767870 0.925619 NaN NaN
6 0.543061 0.644624 0.288698 0.084544 6.677092 1.983105 31.233153 3.902560
7 0.118142 0.536156 0.098139 5.229042 4.729659 8.638492 NaN NaN
8 0.892830 0.080694 0.084702 8.575547 6.453765 6.055660 17.186079 94.768480
9 0.073194 0.462129 0.015707 4.386650 5.547295 8.475186 NaN NaN
For more details you might want to see pd.concat
Just a tip: putting simple illustrative data in your question always helps in getting an answer.
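Since your frames share a Name column, concatenating along axis=1 will duplicate it. A minimal sketch of one way to collapse everything to exactly one row per ID, using hypothetical frame contents (not taken from the question):
import functools
import pandas as pd

# Hypothetical employee frames, each indexed by Employee ID
df_station = pd.DataFrame({'Name': ['Ann', 'Bob'], 'Station': ['A1', 'B2']}, index=[1234, 5678])
df_shift = pd.DataFrame({'Name': ['Ann', 'Bob'], 'Shift': ['Day', 'Night']}, index=[1234, 5678])
frames = [df_station, df_shift]  # scales to any number of frames

# combine_first aligns on the index and keeps the first non-null value
# for overlapping columns (Name), so each ID ends up on exactly one row
combined = functools.reduce(lambda a, b: a.combine_first(b), frames)
print(combined)
print(combined.index.nunique())  # directly check how many unique IDs remain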
My DataFrame is:
model epochs loss
0 <keras.engine.sequential.Sequential object at ... 1 0.0286867
1 <keras.engine.sequential.Sequential object at ... 1 0.0210836
2 <keras.engine.sequential.Sequential object at ... 1 0.0250625
3 <keras.engine.sequential.Sequential object at ... 1 0.109146
4 <keras.engine.sequential.Sequential object at ... 1 0.253897
I want to get the row with the lowest loss.
I'm trying self.models['loss'].idxmin(), but that gives an error:
TypeError: reduction operation 'argmin' not allowed for this dtype
There are a number of ways to do exactly that:
Consider this example dataframe
df
level beta
0 0 0.338
1 1 0.294
2 2 0.308
3 3 0.257
4 4 0.295
5 5 0.289
6 6 0.269
7 7 0.259
8 8 0.288
9 9 0.302
1) Using pandas conditionals
df[df.beta == df.beta.min()] #returns pandas DataFrame object
level beta
3 3 0.257
2) Using sort_values and choosing the first (0th) row
df.sort_values(by="beta").iloc[0] #returns pandas Series object
level 3
beta 0.257
Name: 3, dtype: object
These are the most readable methods, I guess.
Edit: I made a graph to visualize the time taken by the above two methods over an increasing number of rows in the dataframe. Although it largely depends on the dataframe in question, sort_values is considerably faster than conditionals when the number of rows is greater than 1000 or so.
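A rough sketch of how such a timing comparison could be reproduced (the exact benchmark behind the graph is not shown here, and the numbers will vary by machine):
import numpy as np
import pandas as pd
from timeit import timeit

# Time both approaches on frames of increasing size
for n in (100, 1_000, 10_000, 100_000):
    df = pd.DataFrame({'level': range(n), 'beta': np.random.rand(n)})
    t_filter = timeit(lambda: df[df.beta == df.beta.min()], number=100)
    t_sort = timeit(lambda: df.sort_values(by='beta').iloc[0], number=100)
    print(f"{n:>7} rows: filter {t_filter:.4f}s, sort_values {t_sort:.4f}s")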
self.models[self.models['loss'] == self.models['loss'].min()]
will give you the row with the lowest loss (as long as self.models is your df). Add .index to get the index number.
Hope this works
import pandas as pd
df = pd.DataFrame({'epochs':[1,1,1,1,1],'loss':[0.0286867,0.0286867,0.0210836,0.0109146,0.0109146]})
out = df.loc[df['loss'].idxmin()]
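If idxmin still raises that TypeError, the likely cause is that the loss column has object dtype (for example because it was built alongside the column of model objects); coercing it to numeric first should fix it. A hedged sketch with hypothetical values:
import pandas as pd

models = pd.DataFrame({'loss': ['0.0286867', '0.0210836', '0.109146']})  # object dtype, hypothetical values
models['loss'] = pd.to_numeric(models['loss'], errors='coerce')  # now float64; unparseable entries become NaN
best_row = models.loc[models['loss'].idxmin()]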
Let's say I have a dataframe and a list of words, i.e.
toxic = ['bad','horrible','disguisting']
df = pd.DataFrame({'text':['You look horrible','You are good','you are bad and disguisting']})
main = pd.concat([df,pd.DataFrame(columns=toxic)]).fillna(0)
samp = main['text'].str.split().apply(lambda x : [i for i in toxic if i in x])
for i, j in enumerate(samp):
    for k in j:
        main.loc[i, k] = 1
This leads to :
bad disguisting horrible text
0 0 0 1 You look horrible
1 0 0 0 You are good
2 1 1 0 you are bad and disguisting
This is a bit faster than get_dummies, but for loops in pandas are not advisable when there is a huge amount of data.
I tried str.get_dummies, but this one-hot encodes every word in the series, which makes it a bit slower.
pd.concat([df,main['text'].str.get_dummies(' ')[toxic]],1)
text bad horrible disguisting
0 You look horrible 0 1 0
1 You are good 0 0 0
2 you are bad and disguisting 1 0 1
If I try the same with scikit-learn:
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
le.fit(toxic)
main['text'].str.split().apply(le.transform)
This leads to ValueError: y contains new labels. Is there a way to ignore the error in scikit-learn?
How can I improve the speed of achieving the same, is there any other fast way of doing the same?
Use sklearn.feature_extraction.text.CountVectorizer:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(vocabulary=toxic)
r = pd.SparseDataFrame(cv.fit_transform(df['text']),
                       df.index,
                       cv.get_feature_names(),
                       default_fill_value=0)
Result:
In [127]: r
Out[127]:
bad horrible disguisting
0 0 1 0
1 0 0 0
2 1 0 1
In [128]: type(r)
Out[128]: pandas.core.sparse.frame.SparseDataFrame
In [129]: r.info()
<class 'pandas.core.sparse.frame.SparseDataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
bad 3 non-null int64
horrible 3 non-null int64
disguisting 3 non-null int64
dtypes: int64(3)
memory usage: 104.0 bytes
In [130]: r.memory_usage()
Out[130]:
Index 80
bad 8 # <--- NOTE: it's using 8 bytes (1x int64) instead of 24 bytes for three values (3x8)
horrible 8
disguisting 8
dtype: int64
Joining the SparseDataFrame with the original DataFrame:
In [137]: r2 = df.join(r)
In [138]: r2
Out[138]:
text bad horrible disguisting
0 You look horrible 0 1 0
1 You are good 0 0 0
2 you are bad and disguisting 1 0 1
In [139]: r2.memory_usage()
Out[139]:
Index 80
text 24
bad 8
horrible 8
disguisting 8
dtype: int64
In [140]: type(r2)
Out[140]: pandas.core.frame.DataFrame
In [141]: type(r2['horrible'])
Out[141]: pandas.core.sparse.series.SparseSeries
In [142]: type(r2['text'])
Out[142]: pandas.core.series.Series
PS: in older pandas versions, sparse columns lost their sparsity (became dense) after joining a SparseDataFrame with a regular DataFrame; now we can have a mixture of regular Series (columns) and SparseSeries - a really nice feature!
The accepted answer is deprecated, see release notes:
SparseSeries and SparseDataFrame were removed in pandas 1.0.0. This migration guide is present to aid in migrating from previous versions.
Pandas 1.0.5 Solution:
r = pd.DataFrame.sparse.from_spmatrix(cv.fit_transform(df['text']),
                                      df.index,
                                      cv.get_feature_names())
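A sketch of joining that sparse result back to the original frame, assuming the same df, toxic list and cv as above (note that newer scikit-learn versions rename get_feature_names to get_feature_names_out):
result = df.join(r)  # regular 'text' column plus sparse indicator columns
print(result)
print(result.dtypes)  # the indicator columns should keep a sparse dtype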
I would like to have the two tables that I read in, stored in data frames.
I'm reading a h5 file into my code with:
with pd.HDFStore(directory_path) as store:
self.df = store['/raw_ti4404']
self.hr_df = store['/metric_heartrate']
self.df is being stored as a data frame, but self.hr_df is being stored as a series.
I am calling them both in the same manner and I don't understand why the one is a data frame and the other a series. It might be something to do with how the data is stored.
Any help on how to store the metric_heartrate as a data frame would be appreciated.
Most probably the metric_heartrate was stored as a Series.
Demo:
Generate sample DF:
In [123]: df = pd.DataFrame(np.random.rand(10, 3), columns=list('abc'))
In [124]: df
Out[124]:
a b c
0 0.404338 0.010642 0.686192
1 0.108319 0.962482 0.772487
2 0.564785 0.456916 0.496818
3 0.122507 0.653329 0.647296
4 0.348033 0.925427 0.937080
5 0.750008 0.301208 0.779692
6 0.833262 0.448925 0.553434
7 0.055830 0.267205 0.851582
8 0.189788 0.087814 0.902296
9 0.045610 0.738983 0.831780
In [125]: store = pd.HDFStore('d:/temp/test.h5')
Let's store a column as Series:
In [126]: store.append('ser', df['a'], format='t')
Let's store a DataFrame, containing only one column - a:
In [127]: store.append('df', df[['a']], format='t')
Reading data from HDFStore:
In [128]: store.select('ser')
Out[128]:
0 0.404338
1 0.108319
2 0.564785
3 0.122507
4 0.348033
5 0.750008
6 0.833262
7 0.055830
8 0.189788
9 0.045610
Name: a, dtype: float64
In [129]: store.select('df')
Out[129]:
a
0 0.404338
1 0.108319
2 0.564785
3 0.122507
4 0.348033
5 0.750008
6 0.833262
7 0.055830
8 0.189788
9 0.045610
Fix: read the Series and convert it to a DataFrame:
In [130]: store.select('ser').to_frame('a')
Out[130]:
a
0 0.404338
1 0.108319
2 0.564785
3 0.122507
4 0.348033
5 0.750008
6 0.833262
7 0.055830
8 0.189788
9 0.045610
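Applied to the original code, that would look something like the sketch below; the column name 'heartrate' is only a guess, since the stored Series' name is not shown in the question:
with pd.HDFStore(directory_path) as store:
    self.df = store['/raw_ti4404']
    # to_frame() uses the Series' own name if no name is passed;
    # 'heartrate' here is a hypothetical column label
    self.hr_df = store['/metric_heartrate'].to_frame('heartrate')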
I have a Dataframe like so:
p_rel y_BET sq_resid
1 0.069370 41.184996 0.292942
2 0.116405 43.101090 0.010953
3 0.173409 44.727748 0.036832
4 0.225629 46.681293 0.540616
5 0.250682 46.980616 0.128191
6 0.294650 47.446113 0.132367
7 0.322530 48.078038 0.235047
How do I get rid of the fourth row, because it has the max value of sq_resid? Note: the max will change from dataset to dataset, so just removing the 4th row isn't enough.
I have tried several things; for example, I can remove the max value (which leaves the dataframe as below), but I haven't been able to remove the whole row.
p_rel y_BET sq_resid
1 0.069370 41.184996 0.292942
2 0.116405 43.101090 0.010953
3 0.173409 44.727748 0.036832
4 0.225629 46.681293 NaN
5 0.250682 46.980616 0.128191
6 0.294650 47.446113 0.132367
7 0.322530 48.078038 0.235047
You could just filter the df like so:
In [255]:
df.loc[df['sq_resid']!=df['sq_resid'].max()]
Out[255]:
p_rel y_BET sq_resid
1 0.069370 41.184996 0.292942
2 0.116405 43.101090 0.010953
3 0.173409 44.727748 0.036832
5 0.250682 46.980616 0.128191
6 0.294650 47.446113 0.132367
7 0.322530 48.078038 0.235047
Or drop using idxmax, which will return the index label of the row with the max value:
In [257]:
df.drop(df['sq_resid'].idxmax())
Out[257]:
p_rel y_BET sq_resid
1 0.069370 41.184996 0.292942
2 0.116405 43.101090 0.010953
3 0.173409 44.727748 0.036832
5 0.250682 46.980616 0.128191
6 0.294650 47.446113 0.132367
7 0.322530 48.078038 0.235047
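One caveat worth adding: drop returns a new DataFrame and idxmax returns only the first label of the maximum, so assign the result back if you want the change to stick (a sketch using the same df as above):
df = df.drop(df['sq_resid'].idxmax())  # reassign; any tied max rows beyond the first are kept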