Suppose I have a dataframe like this:
Height Speed
0 4.0 39.0
1 7.8 24.0
2 8.9 80.5
3 4.2 60.0
Then, through some feature extraction, I get this:
0 39.0
1 24.0
2 80.5
3 60.0
However, I want it to be a dataframe where the column index is still there. How would you get the following?
Speed
0 39.0
1 24.0
2 80.5
3 60.0
I am looking for an answer that compares the original with the new column and determines that the new column must be named Speed. In other words, it shouldn't just rename the new column 'Speed'.
Here is the feature extraction: Let X be the original dataframe and X1 be the returned array that lacks a column name.
from sklearn.svm import SVC
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold

svc = SVC(kernel="linear")
rfecv = RFECV(estimator=svc, step=1, cv=StratifiedKFold(2),
              scoring='accuracy')
X1 = rfecv.fit_transform(X, y)
Thanks
EDIT:
For the comments I am receiving, I will clarify my ambiguity. I believe the feature extraction method above takes a dataframe or a series/array and returns an array. I am passing a dataframe into it; this dataframe contains the column labels and the data, but the method returns an array that lacks column names. Another caveat is that this has to work in the general case: I cannot name my columns explicitly because the columns will change in my program, and the extraction could return two arrays, four arrays, and so on. I am looking for a method that compares the original dataframe to the array(s) given after the feature extraction, recognizes that the new array is a "subset" of the original dataframe, and then marks it with the original column name(s). Let me know your thoughts on that! Sorry guys, and thank you for your help.
RFECV, after being fit, has an attribute called support_, which is a boolean mask of selected features. You can obtain the names of the chosen features by doing:
selected_cols = original_df.columns[rfecv.support_]
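From there you can rebuild a labelled frame around the transformed array, for example (a minimal sketch, where original_df is the DataFrame you passed to fit_transform and X1 is the array it returned):
import pandas as pd

# reattach the surviving column names (and the original row index) to the array
X1_df = pd.DataFrame(X1, columns=selected_cols, index=original_df.index)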
Easy peasy!
Thank you for taking the time to read through my question. I hope you can help.
I have a large DataFrame with loads of columns. One column is an ID with multiple classes on which I would like to calculate totals and other custom calculations based on the columns above it.
The DataFrame columns look something like this:
I would like to calculate the Total AREA for each ID for all the CLASSES. Then I need to calculate the custom totals for the VAR columns using the variables from the other columns. In the end I would like to have a series of grouped IDs that look like this:
I hope that this make sense. The current thinking I have applied is to use the following code:
df = pd.read_csv('data.csv')
df.groupby('ID').apply(lambda x: x['AREA'].sum())
This provides me with a list of all the summed areas, which I can store in a variable to append back to the original dataframe through the ID and CLASS column. However, I am unsure how I get the other calculations done, as shown above. On top of that, I am not sure how to get the final DataFrame to mimic the above table format.
I am just starting to understand Pandas and constantly having to teach myself and ask for help when it gets rough.
Some guidance would be greatly appreciated. I am open to providing more information and clarity on the problem if this question is insufficient. Thank you.
I am not sure if I understand your formulas correctly.
First, you can simplify your formula by using the built-in sum() method:
import pandas as pd

df = pd.DataFrame({'ID': [1.1, 1.1, 1.2, 1.2, 1.2, 1.3, 1.3], 'Class': [1, 2, 1, 2, 3, 1, 2], 'AREA': [350, 200, 15, 5000, 65, 280, 70],
                   'VAR1': [24, 35, 47, 12, 26, 12, 78], 'VAR2': [1.5, 1.2, 1.1, 1.4, 2.3, 4.5, 0.8], 'VAR3': [200, 300, 400, 500, 600, 700, 800]})
df.groupby(['ID']).sum()['AREA']
This will give the list you mentioned:
ID
1.1 550
1.2 5080
1.3 350
Name: AREA, dtype: int64
For the Area of Class 1 (or any other Class) you just have to add another key to the groupby() call:
df.groupby(['ID', 'Class']).sum()['AREA']
Resulting in:
ID Class
1.1 1 350
2 200
1.2 1 15
2 5000
3 65
1.3 1 280
2 70
Name: AREA, dtype: int64
Since you want to sum up the square of the sum over each Class, we can combine both approaches:
df.groupby(['ID', 'Class']).apply(lambda x: x['AREA'].sum()**2).groupby('ID').sum()
With the result
ID
1.1 162500
1.2 25004450
1.3 83300
dtype: int64
I recommend taking the command apart and trying to understand each step. If you need further assistance, just ask.
Dear pandas DataFrame experts,
I have been using pandas DataFrames to help with re-writing the charting code in an open source project (https://openrem.org/, https://bitbucket.org/openrem/openrem).
I've been grouping and aggregating data over fields such as study_name and x_ray_system_name.
An example dataframe might contain the following data:
study_name request_name total_dlp x_ray_system_name
head head 50.0 All systems
head head 100.0 All systems
head NaN 200.0 All systems
blank NaN 75.0 All systems
blank NaN 125.0 All systems
blank head 400.0 All systems
The following line calculates the count and mean of the total_dlp data grouped by x_ray_system_name and study_name:
df.groupby(["x_ray_system_name", "study_name"]).agg({"total_dlp": ["count", "mean"]})
with the following result:
total_dlp
count mean
x_ray_system_name study_name
All systems blank 3 200.000000
head 3 116.666667
I now have a need to be able to calculate the mean of the total_dlp data grouped over entries in study_name or request_name. So in the example above, I'd like the "head" mean to include the three study_name "head" entries, and also the single request_name "head" entry.
I would like the results to look something like this:
total_dlp
count mean
x_ray_system_name name
All systems blank 3 200.000000
head 4 187.500000
Does anyone know how I can carry out a groupby based on categories in one field or another?
Any help you can offer will be very much appreciated.
Kind regards,
David
Your (groupby) data is essentially the union of:
- the rows where study_name == request_name, kept once
- the rows where study_name != request_name, duplicated: one copy for study_name, one for request_name
We can duplicate the data with melt:
(pd.concat([df.query('study_name==request_name') # equal part
.drop('request_name', axis=1), # remove so `melt` doesn't duplicate this data
df.query('study_name!=request_name')]) # not equal part
.melt(['x_ray_system_name','total_dlp']) # melt to duplicate
.groupby(['x_ray_system_name','value'])
['total_dlp'].mean()
)
Update: editing the above code made me realize that we could simplify it to:
# mask `request_name` with `NaN` where it equals `study_name`
# so those entries are ignored when duplicating/averaging
(df.assign(request_name=df.request_name.mask(df.study_name==df.request_name))
.melt(['x_ray_system_name','total_dlp'])
.groupby(['x_ray_system_name','value'])
['total_dlp'].mean()
)
Output:
x_ray_system_name value
All systems blank 200.0
head 187.5
Name: total_dlp, dtype: float64
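If you also need the count column from your target output, the same groupby can produce both statistics at once (a sketch built on the simplified version above):
(df.assign(request_name=df.request_name.mask(df.study_name==df.request_name))
 .melt(['x_ray_system_name','total_dlp'])
 .groupby(['x_ray_system_name','value'])
 ['total_dlp'].agg(['count', 'mean'])
)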
I have a similar approach to that of @QuangHoang, but with a different order of operations.
Here I am using the original (range) index to choose which duplicated data to drop.
You can melt, drop_duplicates, dropna, and groupby:
(df.reset_index()
.melt(id_vars=['index', 'total_dlp', 'x_ray_system_name'])
.drop_duplicates(['index', 'value'])
.dropna(subset=['value'])
.groupby(["x_ray_system_name", 'value'])
.agg({"total_dlp": ["count", "mean"]})
)
output:
total_dlp
count mean
x_ray_system_name value
All systems blank 3 200.0
head 4 187.5
I have a pandas DataFrame
ID Unique_Countries
0 123 [Japan]
1 124 [nan]
2 125 [US,Brazil]
.
.
.
I got the Unique_Countries column by aggregating over unique countries from each ID group. There were many IDs with only 'NaN' values in the original country column. They are now displayed as what you see in row 1. I would like to filter on these but can't seem to. When I type
df.Unique_Countries[1]
I get
array([nan], dtype=object)
I have tried several methods, including isnull() and isnan(), but they get messed up because each cell holds a numpy array rather than a scalar.
If your cell can have NaN somewhere other than the 1st position, try explode together with groupby(...).all():
df[df.Unique_Countries.explode().notna().groupby(level=0).all()]
OR
df[df.Unique_Countries.explode().notna().all(level=0)]
Let's try
df.Unique_Countries.str[0].isna() #'nan' is True
df.Unique_Countries.str[0].notna() #'nan' is False
To pick only the non-NaN rows, just use the mask above:
df[df.Unique_Countries.str[0].notna()]
I believe that answers based on the string method contains would fail if a country name contains the substring nan.
In my opinion the solution should be this:
df.explode('Unique_Countries').dropna().groupby('ID', as_index=False).agg(list)
This code drops the NaNs from your dataframe and returns the data in its original form.
I am not sure from your question whether you want to drop the NaNs or to find the IDs of the records that have NaN in the Unique_Countries column; for the latter you can use something like this:
long_ss = df.set_index('ID').squeeze().explode()
long_ss[long_ss.isna()]
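If you then want to drop those records from the original frame, one possible follow-up is (a sketch building on the long_ss series above):
bad_ids = long_ss[long_ss.isna()].index   # IDs whose country list contains NaN
df_clean = df[~df['ID'].isin(bad_ids)]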
Problem and what I want
I have a data file that comprises time series read asynchronously from multiple sensors. Basically for every data element in my file, I have a sensor ID and time at which it was read, but I do not always have all sensors for every time, and read times may not be evenly spaced. Something like:
ID,time,data
0,0,1
1,0,2
2,0,3
0,1,4
2,1,5 # skip some sensors for some time steps
0,2,6
2,2,7
2,3,8
1,5,9 # skip some time steps
2,5,10
Important note: the actual time column is of datetime type.
What I want is to be able to zero-order hold (forward fill) values for every sensor at any time step where that sensor has no reading, and to either set to zero or back-fill any sensors that are not read at the earliest time steps. In other words, I want a dataframe that looks like it was read from:
ID,time,data
0,0,1
1,0,2
2,0,3
0,1,4
1,1,2 # ID 1 hold value from time step 0
2,1,5
0,2,6
1,2,2 # ID 1 still holding
2,2,7
0,3,6 # ID 0 holding
1,3,2 # ID 1 still holding
2,3,8
0,5,6 # ID 0 still holding, can skip totally missing time steps
1,5,9 # ID 1 finally updates
2,5,10
Pandas attempts so far
I initialize my dataframe and set my indices:
df = pd.read_csv(filename, dtype=np.int)
df.set_index(['ID', 'time'], inplace=True)
I try to mess with things like:
filled = df.reindex(method='ffill')
or the like with various values passed to the index keyword argument like df.index, ['time'], etc. This always either throws an error because I passed an invalid keyword argument, or does nothing visible to the dataframe. I think it is not recognizing that the data I am looking for is "missing".
I also tried:
df.update(df.groupby(level=0).ffill())
or level=1 based on Multi-Indexed fillna in Pandas, but I get no visible change to the dataframe again, I think because I don't have anything currently where I want my values to go.
Numpy attempt so far
I have had some luck with numpy and non-integer indexing using something like:
data = [np.array(df.loc[level].data) for level in df.index.levels[0]]
shapes = [arr.shape for arr in data]
print(shapes)
# [(3,), (2,), (5,)]
data = [np.array([arr[i] for i in np.linspace(0, arr.shape[0]-1, num=max(shapes)[0])]) for arr in data]
print([arr.shape for arr in data])
# [(5,), (5,), (5,)]
But this has two problems:
It takes me out of the pandas world, and I now have to manually maintain my sensor IDs, time index, etc. along with my feature vector (the actual data column is not just one column but a ton of values from a sensor suite).
Given the number of columns and the size of the actual dataset, this is going to be clunky and inelegant to implement on my real example. I would prefer a way of doing it in pandas.
The application
Ultimately this is just the data-cleaning step for training recurrent neural network, where for each time step I will need to feed a feature vector that always has the same structure (one set of measurements for each sensor ID for each time step).
Thank you for your help!
Here is one way, using reindex and category:
# make `time` categorical so that groupby also produces the missing time step 4
df.time = df.time.astype(pd.CategoricalDtype(categories=[0, 1, 2, 3, 4, 5]))
# within each time step, reindex on the full set of sensor IDs, then forward-fill per ID
new_df = df.groupby('time', as_index=False).apply(lambda x: x.set_index('ID').reindex([0, 1, 2])).reset_index()
new_df['data'] = new_df.groupby('ID')['data'].ffill()
new_df.drop('time', axis=1).rename(columns={'level_0': 'time'})
Out[311]:
time ID data
0 0 0 1.0
1 0 1 2.0
2 0 2 3.0
3 1 0 4.0
4 1 1 2.0
5 1 2 5.0
6 2 0 6.0
7 2 1 2.0
8 2 2 7.0
9 3 0 6.0
10 3 1 2.0
11 3 2 8.0
12 4 0 6.0
13 4 1 2.0
14 4 2 8.0
15 5 0 6.0
16 5 1 9.0
17 5 2 10.0
You can have a dictionary of last readings for each sensor. You'll have to pick some initial value; the most logical choice is probably to back-fill the earliest reading to earlier times. Once you've populated your last_reading dictionary, you can just sort all the readings by time, update the dictionary for each reading, and then fill in rows according to the dictionary. So after you have your last_reading dictionary initialized:
last_time = readings[0]['time']
for reading in readings:
    if reading['time'] > last_time:
        # a new time step has started: write the held values for every sensor
        for ID in ID_list:
            df.loc[last_time, ID] = last_reading[ID]
        last_time = reading['time']
    last_reading[reading['ID']] = reading['data']

# the above for loop doesn't write out the last time step,
# so you'll have to handle that separately
for ID in ID_list:
    df.loc[last_time, ID] = last_reading[ID]
This assumes that you have only one reading for each time/sensor pair, and that readings is a list of dictionaries (with 'time', 'ID' and 'data' keys) sorted by time. It also assumes that df has the different sensors as columns and the different times as index. Adjust the code as necessary if otherwise. You can also probably optimize it a bit more by updating a whole row at once instead of using a for loop, but I didn't want to deal with making sure I had the Pandas syntax right.
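For what it's worth, the whole-row update mentioned above could look something like this (a sketch, assuming the entries of ID_list match the DataFrame's column labels):
# write the held values for every sensor in a single assignment per time step
df.loc[last_time, ID_list] = [last_reading[ID] for ID in ID_list]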
Looking at the application, though, you might want each cell in the dataframe to hold not just a number but a tuple of the last value and the time it was read, i.e. replace last_reading[reading['ID']] = reading['data'] with
last_reading[reading['ID']] = [reading['data'], reading['time']]. Your neural net can then decide how to weight data based on how old it is.
I got this to work with the following, which I think is fairly general for any case like this where the time index you want to fill is the second level of a two-level MultiIndex:
# Remove duplicate time indices (they occur occasionally in the dataset and confuse pandas).
df = df[~df.index.duplicated(keep='first')]
# Unstack the dataframe and fill values per serial number forward, backward.
df = df.unstack(level=0)
df.update(df.ffill()) # first ZOH forward
df.update(df.bfill()) # now back fill values that are not seen at the beginning
# Restack the dataframe and re-order the indices.
df = df.stack(level=1)
df = df.swaplevel()
This gets me what I want, although I would love to be able to keep the duplicate time entries if anybody knows of a good way to do this.
You could also use df.update(df.fillna(0)) instead of backfilling if starting unseen values at zero is preferable for a particular application.
I put the above code block in a function called clean_df that takes the dataframe as argument and returns the cleaned dataframe.
I have a Pandas data frame with columns that are 'dynamic' (meaning that I don't know what the column names will be until I retrieve the data from the various databases).
The data frame is a single row and looks something like this:
Make Date Red Blue Green Black Yellow Pink Silver
89 BMW 2016-10-28 300.0 240.0 2.0 500.0 1.0 1.0 750.0
Note that '89' is the index of that particular row in the data frame.
I have the following code:
cars_bar_plot = df_cars.loc[(df_cars.Make == 'BMW') & (df_cars.Date == as_of_date)]
cars_bar_plot = cars_bar_plot.replace(0, value=np.nan)
cars_bar_plot = cars_bar_plot.dropna(axis=1, how='all')
This works fine in helping me to create the above-mentioned single-row data frame, BUT some of the values in each column are very small (e.g. 1.0 and 2.0) relative to the other values and they are distorting a horizontal bar chart that I'm creating with Matplotlib. I'd like to get rid of numbers that are smaller than some minimum threshold value (e.g. 3.0).
Any idea how I can do that?
Thanks!
UPDATE 1
The following line of code helps, but does not fully solve the problem.
cars_bar_plot = cars_bar_plot.loc[:, (cars_bar_plot >= 3.0).any(axis=0)]
The problem is that it's eliminating unintended columns. For example, referencing the original data frame, is it possible to modify this code such that it only removes columns with a value less than 3.0 to the right of the "Black" column (under the assumption that we actually want to retain the value of 2.0 in the "Green" column)?
Thanks!
Assuming you want to keep only the rows matching your criteria, you can filter your data like this:
df[df.apply(lambda x: x > 0.5).min(axis=1)]
i.e. simply check every value against your condition, and drop the row as soon as at least one value doesn't match.
Here is the answer to my question:
lower_threshold = 3.0
start_column = 5
df = df.loc[start_column:, (df >= lower_threshold).any(axis=0)]
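If the goal from the update is specifically to drop small-valued columns only to the right of the "Black" column, a variant along these lines might also work (a sketch using the example's column names; the helper variables are illustrative):
lower_threshold = 3.0
boundary = df.columns.get_loc('Black')          # keep everything up to and including 'Black'
protected = list(df.columns[:boundary + 1])
candidates = df.columns[boundary + 1:]
keep = [c for c in candidates if (df[c] >= lower_threshold).any()]
df_filtered = df[protected + keep]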