Subset Pandas Data Frame Based on Some Row Value - python

I have a Pandas data frame with columns that are 'dynamic' (meaning that I don't know what the column names will be until I retrieve the data from the various databases).
The data frame is a single row and looks something like this:
   Make        Date    Red   Blue  Green  Black  Yellow  Pink  Silver
89  BMW  2016-10-28  300.0  240.0    2.0  500.0     1.0   1.0   750.0
Note that '89' is that particular row in the data frame.
I have the following code:
cars_bar_plot = df_cars.loc[(df_cars.Make == 'BMW') & (df_cars.Date == as_of_date)]
cars_bar_plot = cars_bar_plot.replace(0, value=np.nan)
cars_bar_plot = cars_bar_plot.dropna(axis=1, how='all')
This works fine in helping me create the above-mentioned single-row data frame, but some of the values in each column are very small (e.g. 1.0 and 2.0) relative to the other values, and they distort a horizontal bar chart that I'm creating with Matplotlib. I'd like to get rid of numbers that are smaller than some minimum threshold value (e.g. 3.0).
Any idea how I can do that?
Thanks!
UPDATE 1
The following line of code helps, but does not fully solve the problem.
cars_bar_plot = cars_bar_plot.loc[:, (cars_bar_plot >= 3.0).any(axis=0)]
The problem is that it's eliminating unintended columns. For example, referencing the original data frame, is it possible to modify this code so that it only removes columns with a value less than 3.0 that lie to the right of the "Black" column (on the assumption that we actually want to retain the value of 2.0 in the "Green" column)?
Thanks!

Assuming you want to keep only the rows matching your criteria, you can filter your data like this:
df[df.apply(lambda x: x > 0.5).min(axis=1)]
i.e. simply check all values against your condition, and drop a row as soon as at least one of its values fails it.
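An equivalent spelling (my suggestion, same behaviour) that avoids taking min over booleans:
# Keep only rows where every value passes the threshold check
df[(df > 0.5).all(axis=1)]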

Here is the answer to my question:
lower_threshold = 3.0
start_column = 5  # Make, Date, Red, Blue, Green are always kept
keep = df.columns[:start_column].tolist() + [c for c in df.columns[start_column:]
                                             if (df[c] >= lower_threshold).any()]
df = df[keep]
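For reference, here is a self-contained sketch of that filter on a frame shaped like the sample row above (values copied straight from the question):
import pandas as pd

# Rebuild the single-row frame from the question
df = pd.DataFrame(
    [['BMW', '2016-10-28', 300.0, 240.0, 2.0, 500.0, 1.0, 1.0, 750.0]],
    columns=['Make', 'Date', 'Red', 'Blue', 'Green',
             'Black', 'Yellow', 'Pink', 'Silver'],
    index=[89])

lower_threshold = 3.0
start_column = 5
keep = df.columns[:start_column].tolist() + [c for c in df.columns[start_column:]
                                             if (df[c] >= lower_threshold).any()]
print(df[keep])  # Green (2.0) survives; Yellow and Pink (1.0) are dropped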

Related

How to relocate different data that is in a single column to their respective columns?

I have a dataframe whose data are strings and different information are mixed in a single column. Like this:
0                       Place: House
1       Date/Time: 01/02/03 at 09:30
2                       Color:Yellow
3                      Place: Street
4    Date/Time: 21/12/13 at 13:21:21
5                          Color:Red
df = pd.DataFrame(['Place: House','Date/Time: 01/02/03 at 09:30', 'Color:Yellow', 'Place: Street','Date/Time: 21/12/13 at 13:21:21', 'Color:Red'])
I need the dataframe like this:
    Place Date/Time   Color
0   House  01/02/03  Yellow
1  Street  21/12/13     Red
I started by converting the excel file to csv, and then I tried to open it as follows:
df = pd.read_csv(filename, sep=":")
I tried using ":" to separate the columns, but the time formatting also uses ":", so it didn't work. The time is not important information, so I even tried to delete it and keep just the date, but I couldn't find a way to do that without affecting the other information in the column.
Given the values in your data, you will need to limit the split to happen just once, which you can do with the n parameter of split. You can expand the split values into two columns and then pivot.
The trick here is to create a grouping by taking df.index // 3 as the index, so that every 3 lines land in a new group.
df = pd.DataFrame(['Place: House', 'Date/Time: 01/02/03 at 09:30', 'Color:Yellow',
                   'Place: Street', 'Date/Time: 21/12/13 at 13:21:21', 'Color:Red'])
df = df[0].str.split(':', n=1, expand=True)
df['idx'] = df.index // 3
df.pivot(index='idx', columns=0, values=1).reset_index().drop(columns='idx')[['Place', 'Date/Time', 'Color']]
Output
0   Place             Date/Time   Color
0   House     01/02/03 at 09:30  Yellow
1  Street  21/12/13 at 13:21:21     Red
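If you also want to match the target table exactly (date only, no stray whitespace from the split), a small follow-up step does it; this strip/split cleanup is my addition, not part of the original answer:
result = (df.pivot(index='idx', columns=0, values=1)
            .reset_index().drop(columns='idx')[['Place', 'Date/Time', 'Color']])
result = result.apply(lambda s: s.str.strip())                      # split leaves leading spaces
result['Date/Time'] = result['Date/Time'].str.split(' at ').str[0]  # drop the time part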
Your data is all strings; IMO you are likely to get better performance wrangling it in vanilla Python before bringing it back into Pandas. The only time you are likely to get better performance for strings in Pandas is if you are using the pyarrow string data type.
from collections import defaultdict

out = df.squeeze().tolist()  # this works since it is just one column
frame = defaultdict(list)
for entry in out:
    key, value = entry.split(':', maxsplit=1)
    if key == "Date/Time":
        value = value.split('at')[0]
    value = value.strip()
    key = key.strip()  # not really necessary
    frame[key].append(value)
pd.DataFrame(frame)
Place Date/Time Color
0 House 01/02/03 Yellow
1 Street 21/12/13 Red

Colouring one column of pandas dataframe: change in format

I would like to color a column in my dataframe (using the method proposed in a given answer, link below). So (taking only the 1st row of my dataframe) I used the following code to change the color:
data1.head(1).style.set_properties(**{'background-color': 'red'}, subset=['column10'])
However, it causes another problem: it changes the format of my dataframe (it adds more 0's after the decimal point). Is there any possibility to keep the old format of my dataframe and still be able to color a column? Thanks in advance
Old output (first row):
2021-01-02 32072 0.0 1831 1831 1.0 1 0 1 0.0 1.0
New output (first row):
2021-01-02 32072 0.000000 1831 1831 1.000000 1 0 1 0.000000 1.000000
Colouring one column of pandas dataframe
Edit, as I see you wanted rows, not columns:
Once you apply the style code, it changes from a pandas.core.frame.DataFrame object to a pandas.io.formats.style.Styler object. The styler object treats floats differently than the pandas dataframe object, which yields what you see in your output (more decimal places). You can change the format with style.format to get the results you want:
data = [{"col1": "2021-01-02", "col2": 32072, "col3": 0.0, "col4": 1831, "col5": 1831,
         "col6": 1.0, "col7": 1, "col8": 0, "col9": 1, "column10": 0.0, "col11": 1.0}]
data1 = pd.DataFrame(data)
data1 = data1.style.format(precision=1, subset=list(data1.columns)) \
             .set_properties(**{'background-color': 'red'}, subset=['column10'])
data1
Output:
Once you use style, it is no longer a pandas dataframe but a Styler object. So normal commands that work on pandas dataframes no longer work on your newly styled dataframe (e.g. just doing head(10) no longer works). But there are workarounds. If you want to look at only the first 10 rows of your Styler after you have applied the style, you can export the style that was used and then reapply it to just the top 10 rows:
data = [{"col1": "2021-01-02", "col2": 32072, "col3": 0.0, "col4": 1831, "col5": 1831,
         "col6": 1.0, "col7": 1, "col8": 0, "col9": 1, "column10": 0.0, "col11": 1.0}]
data1 = pd.DataFrame(data)
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
data1 = pd.concat([data1, pd.DataFrame(data * 20)]).reset_index(drop=True)
data1 = data1.style.format(precision=1, subset=list(data1.columns)) \
             .set_properties(**{'background-color': 'red'}, subset=['column10'])
Gives a large dataframe:
And then using this code afterwards will head (i.e. show) only 10 of the rows:
style = data1.export()
data1.data.head(10).style.use(style).format(precision=1, subset=list(data1.columns))
Now it is only showing the first 10 rows:

How can I keep NAN values while I am extracting a subset?

I have a dataframe and there is a column called 'budget'.
There are some NaN values in this column and I'd like to keep them.
However, when I try to extract a sub-frame in which the budget value is greater than (or equal to) 38750, I lose my NaN values in the new data frame!
df2 = df[df.budget >= 38750]
I would use a double condition, where the second condition checks whether the values in the budget column are missing:
inds = (df['budget'] >= 38750) | (df['budget'].isnull())
df2 = df.loc[inds, :]
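A quick runnable sketch to illustrate (the sample budgets are made up):
import numpy as np
import pandas as pd

df = pd.DataFrame({'budget': [40000.0, 10000.0, np.nan, 51000.0]})
inds = (df['budget'] >= 38750) | (df['budget'].isnull())
df2 = df.loc[inds, :]
print(df2)  # keeps rows 0 and 3, plus row 2 whose budget is NaN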

Rebuilding Column Names in Pandas Dataframe

Suppose I have a dataframe like this:
Height Speed
0 4.0 39.0
1 7.8 24.0
2 8.9 80.5
3 4.2 60.0
Then, through some feature extraction, I get this:
0 39.0
1 24.0
2 80.5
3 60.0
However, I want it to be a dataframe where the column index is still there. How would you get the following?
Speed
0 39.0
1 24.0
2 80.5
3 60.0
I am looking for an answer that compares the original with the new column and determines that the new column must be named Speed. In other words, it shouldn't just rename the new column 'Speed'.
Here is the feature extraction: Let X be the original dataframe and X1 be the returned array that lacks a column name.
from sklearn.svm import SVC
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold

svc = SVC(kernel="linear")
rfecv = RFECV(estimator=svc, step=1, cv=StratifiedKFold(2),
              scoring='accuracy')
X1 = rfecv.fit_transform(X, y)
Thanks
EDIT:
For the comments I am receiving, I will clarify my ambiguity. I believe that the feature extraction method above takes a dataframe or a series/array and returns an array. I am passing a dataframe into it. This dataframe contains the column labels and the data. However, it returns an array that lacks column names. Another caveat is that this must stay generic: I cannot explicitly name my columns because the columns will change in my program. It could return two arrays, four arrays, ... I am looking for a method that will compare the original dataframe to the array(s) given after the feature extraction, realize that the new array is a "subset" of the original dataframe, and then mark it with the original column name(s). Let me know your thoughts on that! Sorry guys and thank you for your help.
RFECV, after being fit, has an attribute called support_, which is a boolean mask of selected features. You can obtain the names of the chosen features by doing:
selected_cols = original_df.columns[rfecv.support_]
Easy peasy!
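To put the names back onto the transformed array, you can rebuild a labelled frame from that mask. A self-contained sketch (iris here is only stand-in data so the example runs end to end):
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.svm import SVC
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold

X, y = load_iris(return_X_y=True, as_frame=True)

svc = SVC(kernel="linear")
rfecv = RFECV(estimator=svc, step=1, cv=StratifiedKFold(2), scoring='accuracy')
X1 = rfecv.fit_transform(X, y)

# Recover the surviving column names and rebuild a labelled frame
selected_cols = X.columns[rfecv.support_]
X1_df = pd.DataFrame(X1, columns=selected_cols, index=X.index)
print(X1_df.head())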

Filling missing time values in a multi-indexed dataframe

Problem and what I want
I have a data file that comprises time series read asynchronously from multiple sensors. Basically for every data element in my file, I have a sensor ID and time at which it was read, but I do not always have all sensors for every time, and read times may not be evenly spaced. Something like:
ID,time,data
0,0,1
1,0,2
2,0,3
0,1,4
2,1,5 # skip some sensors for some time steps
0,2,6
2,2,7
2,3,8
1,5,9 # skip some time steps
2,5,10
Important note: the actual time column is of datetime type.
What I want is to be able to zero-order hold (forward fill) values for every sensor for any time steps where that sensor does not exist, and either set to zero or back fill any sensors that are not read at the earliest time steps. What I want is a dataframe that looks like it was read from:
ID,time,data
0,0,1
1,0,2
2,0,3
0,1,4
1,1,2 # ID 1 hold value from time step 0
2,1,5
0,2,6
1,2,2 # ID 1 still holding
2,2,7
0,3,6 # ID 0 holding
1,3,2 # ID 1 still holding
2,3,8
0,5,6 # ID 0 still holding, can skip totally missing time steps
1,5,9 # ID 1 finally updates
2,5,10
Pandas attempts so far
I initialize my dataframe and set my indices:
df = pd.read_csv(filename, dtype=int)  # np.int was removed in recent NumPy; plain int works
df.set_index(['ID', 'time'], inplace=True)
I try to mess with things like:
filled = df.reindex(method='ffill')
or the like with various values passed to the index keyword argument like df.index, ['time'], etc. This always either throws an error because I passed an invalid keyword argument, or does nothing visible to the dataframe. I think it is not recognizing that the data I am looking for is "missing".
I also tried:
df.update(df.groupby(level=0).ffill())
or level=1 based on Multi-Indexed fillna in Pandas, but I get no visible change to the dataframe again, I think because I don't have anything currently where I want my values to go.
Numpy attempt so far
I have had some luck with numpy and non-integer indexing using something like:
data = [np.array(df.loc[level].data) for level in df.index.levels[0]]
shapes = [arr.shape for arr in data]
print(shapes)
# [(3,), (2,), (5,)]
data = [np.array([arr[i] for i in np.linspace(0, arr.shape[0]-1, num=max(shapes)[0])]) for arr in data]
print([arr.shape for arr in data])
# [(5,), (5,), (5,)]
But this has two problems:
It takes me out of the pandas world, and I now have to manually maintain my sensor IDs, time index, etc. along with my feature vector (the actual data column is not just one column but a ton of values from a sensor suite).
Given the number of columns and the size of the actual dataset, this is going to be clunky and inelegant to implement on my real example. I would prefer a way of doing it in pandas.
The application
Ultimately this is just the data-cleaning step for training recurrent neural network, where for each time step I will need to feed a feature vector that always has the same structure (one set of measurements for each sensor ID for each time step).
Thank you for your help!
Here is one way, using reindex and a categorical time column:
df.time = df.time.astype(pd.CategoricalDtype(categories=[0, 1, 2, 3, 4, 5]))
new_df = (df.groupby('time', as_index=False)
            .apply(lambda x: x.set_index('ID').reindex([0, 1, 2]))
            .reset_index())
new_df['data'] = new_df.groupby('ID')['data'].ffill()
new_df.drop(columns='time').rename(columns={'level_0': 'time'})
Out[311]:
time ID data
0 0 0 1.0
1 0 1 2.0
2 0 2 3.0
3 1 0 4.0
4 1 1 2.0
5 1 2 5.0
6 2 0 6.0
7 2 1 2.0
8 2 2 7.0
9 3 0 6.0
10 3 1 2.0
11 3 2 8.0
12 4 0 6.0
13 4 1 2.0
14 4 2 8.0
15 5 0 6.0
16 5 1 9.0
17 5 2 10.0
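The same idea can also be written against the full time x ID product with a single reindex; this variant is my sketch (assuming df still has the ID, time, data columns from the question), not part of the answer above:
full = pd.MultiIndex.from_product(
    [df['time'].unique(), df['ID'].unique()], names=['time', 'ID'])
new_df = (df.set_index(['time', 'ID'])
            .reindex(full)                 # insert the missing time/ID pairs
            .groupby(level='ID').ffill()   # forward fill per sensor
            .reset_index())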
You can have a dictionary of last readings for each sensor. You'll have to pick some initial value; the most logical choice is probably to back-fill the earliest reading to earlier times. Once you've populated your last_reading dictionary, you can just sort all the readings by time, update the dictionary for each reading, and then fill in rows according to the dictionary. So after you have your last_reading dictionary initialized:
last_time = readings[0]['time']
for reading in readings:
    if reading['time'] > last_time:
        for ID in ID_list:
            df.loc[last_time, ID] = last_reading[ID]
        last_time = reading['time']
    last_reading[reading['ID']] = reading['data']
# The loop above never writes the final time step,
# so handle that separately:
for ID in ID_list:
    df.loc[last_time, ID] = last_reading[ID]
This assumes that you have only one reading for each time/sensor pair, and that readings is a list of dictionaries sorted by time. It also assumes that df has the different sensors as columns and different times as index. Adjust the code as necessary if otherwise. You can also probably optimize it a bit more by updating a whole row at once instead of using a for loop, but I didn't want to deal with making sure I had the Pandas syntax right.
Looking at the application, though, you might want each cell in the dataframe to hold not a number but a tuple of the last value and the time it was read, so replace last_reading[reading['ID']] = reading['data'] with
last_reading[reading['ID']] = [reading['data'], reading['time']]. Your neural net can then decide how to weight data based on how old it is.
I got this to work with the following, which I think is pretty general for any case like this where the time index for which you want to fill values is the second in a multi-index with two indices:
# Remove duplicate time indices (happens some in the dataset, pandas freaks out).
df = df[~df.index.duplicated(keep='first')]
# Unstack the dataframe and fill values per serial number forward, backward.
df = df.unstack(level=0)
df.update(df.ffill()) # first ZOH forward
df.update(df.bfill()) # now back fill values that are not seen at the beginning
# Restack the dataframe and re-order the indices.
df = df.stack(level=1)
df = df.swaplevel()
This gets me what I want, although I would love to be able to keep the duplicate time entries if anybody knows of a good way to do this.
You could also use df.update(df.fillna(0)) instead of backfilling if starting unseen values at zero is preferable for a particular application.
I put the above code block in a function called clean_df that takes the dataframe as argument and returns the cleaned dataframe.
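For completeness, a self-contained sketch of clean_df run on the sample data from the question (the io.StringIO wrapper simply stands in for the real file):
import io
import pandas as pd

raw = """ID,time,data
0,0,1
1,0,2
2,0,3
0,1,4
2,1,5
0,2,6
2,2,7
2,3,8
1,5,9
2,5,10"""

def clean_df(df):
    # Drop duplicate (ID, time) rows, fill per sensor, restore the index order
    df = df[~df.index.duplicated(keep='first')]
    df = df.unstack(level=0)
    df.update(df.ffill())  # zero-order hold forward
    df.update(df.bfill())  # back fill values unseen at the start
    return df.stack(level=1).swaplevel().sort_index()

df = pd.read_csv(io.StringIO(raw), dtype=int).set_index(['ID', 'time'])
print(clean_df(df))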
