I have a DataFrame and I want the difference between the maximum and the second maximum of each row appended to the DataFrame as a new column.
The DataFrame looks like this, for example (the real one is quite large):
gene_id Time_1 Time_2 Time_3
a 0.01489251 8.00246 8.164309
b 6.67943235 0.8832114 1.048761
So far I have tried the following, but it just takes the column headers,
largest = max(df)
second_largest = max(item for item in df if item < largest)
and returns a header value alone rather than the numeric difference I want.
You can define a function that takes the row values, sorts them in descending order, slices the top 2 values ([:2]), calculates the difference and returns the second element (the first element of the diff is NaN). Apply it with axis=1 so it runs row-wise:
In [195]:
def func(x):
    # sort descending, keep the top 2, negate the difference
    # (sort_values replaces the long-removed Series.sort)
    return -x.sort_values(ascending=False)[:2].diff().iloc[1]
df['diff'] = df.loc[:,'Time_1':].apply(func, axis=1)
df
Out[195]:
gene_id Time_1 Time_2 Time_3 diff
0 a 0.014893 8.002460 8.164309 0.161849
1 b 6.679432 0.883211 1.048761 5.630671
Here is my solution:
# Load data
import numpy as np
import pandas as pd

data = {'a': [0.01489251, 8.00246, 8.164309], 'b': [6.67943235, 0.8832114, 1.048761]}
df = pd.DataFrame.from_dict(data, 'index')
The trick is to use numpy.argpartition to find the positions of the top-2 values in linear time, without fully sorting. You then take the absolute difference of those two maximum values. The function is applied row-wise.
def f(x):
    ind = np.argpartition(x.values, -2)[-2:]
    return np.abs(x.iloc[ind[0]] - x.iloc[ind[1]])
df.apply(f, axis=1)
Here's an elegant solution that doesn't involve sorting or defining any functions. It's also fully vectorized, as it avoids the apply method.
maxes = df.max(axis=1)
less_than_max = df.where(df.lt(maxes, axis='rows'))
seconds = less_than_max.max(axis=1)
df['diff'] = maxes - seconds
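For comparison (not from the answers above), a minimal NumPy sketch of the same row-wise top-two difference, assuming numpy is imported as np and the numeric columns start at Time_1 as in the first answer:
vals = np.sort(df.loc[:, 'Time_1':].to_numpy(), axis=1)  # ascending sort of each row
df['diff'] = vals[:, -1] - vals[:, -2]                   # max minus second max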
Related
I regularly run into the problem of having time series data that I want to interpolate and resample at given times. I have a solution, but it feels too labor-intensive, i.e. I suspect there is a simpler way. You can see how I currently do it here: https://gist.github.com/cs224/012f393d5ced6931ae223e6ddc4fe6b2 (or the nicer version via nbviewer here: https://nbviewer.org/gist/cs224/012f393d5ced6931ae223e6ddc4fe6b2)
Perhaps a motivating example: I fill up my car about every two weeks and have the cost data of every refill. Now I would like to know the cumulative sum on a daily basis, where the day values are at midnight and interpolated.
Currently I create a new empty data frame that contains the time points at which I want to have my resampled values:
df_sampling = pd.DataFrame(index=pd.date_range(start, end, freq=freq))
And then either use pd.merge:
ldf = pd.merge(df_in, df_sampling, left_index=True, right_index=True, how='outer')
or pd.concat:
ldf = pd.concat([df_in, df_sampling], axis=1)
to create a combined time series that has the additional time points in the index. Based on that I can then use .interpolate() and sub-select all index values given by df_sampling. See the gist for details.
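In short, the current approach boils down to something like this sketch (simplified; the gist has the exact details and interpolation settings):
df_sampling = pd.DataFrame(index=pd.date_range(start, end, freq=freq))
ldf = pd.merge(df_in, df_sampling, left_index=True, right_index=True, how='outer')
ldf = ldf.interpolate(method='time')    # or whatever interpolation method fits the data
result = ldf.loc[df_sampling.index]     # keep only the requested sample times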
All this feels too cumbersome, and I suspect there is a better way to do it.
Instead of using either merge or concat inside your function generate_interpolated_time_series, I would rely on df.reindex. Something like this:
def f(df_in, freq='T', start=None):
    if start is None:
        start = df_in.index[0].floor('T')
        # refactored from: df_in.index[0].replace(second=0, microsecond=0, nanosecond=0)
    end = df_in.index[-1]
    idx = pd.date_range(start=start, end=end, freq=freq)
    ldf = df_in.reindex(df_in.index.union(idx)).interpolate().bfill()
    ldf = ldf[~ldf.index.isin(df_in.index.difference(idx))]
    return ldf
Test sample:
from pandas import Timestamp
d = {Timestamp('2022-10-07 11:06:09.957000'): 21.9,
Timestamp('2022-11-19 04:53:18.532000'): 47.5,
Timestamp('2022-11-19 16:30:04.564000'): 66.9,
Timestamp('2022-11-21 04:17:57.832000'): 96.9,
Timestamp('2022-12-05 22:26:48.354000'): 118.6}
df = pd.DataFrame.from_dict(d, orient='index', columns=['values'])
print(df)
values
2022-10-07 11:06:09.957 21.9
2022-11-19 04:53:18.532 47.5
2022-11-19 16:30:04.564 66.9
2022-11-21 04:17:57.832 96.9
2022-12-05 22:26:48.354 118.6
Check for equality:
merge = generate_interpolated_time_series(df, freq='D', method='merge')
concat = generate_interpolated_time_series(df, freq='D', method='concat')
reindex = f(df, freq='D')
print(all([merge.equals(concat),merge.equals(reindex)]))
# True
An added bonus is some performance gain. Comparing the three methods with %timeit for different frequencies (['D','H','T','S']), reindex was the fastest in every case.
Aside: in your function, raise Exception('Method unknown: ' + metnhod) contains a typo; should be method.
I have a dataframe df where one column is timestamp and one is A. Column A contains decimals.
I would like to add a new column B and fill it with the current value of A divided by the value of A one minute earlier. That is:
df['B'] = df['A']_current / df['A']_(current - 1 min)
NOTE: The data does not come in exactly every 1 minute so "the row one minute earlier" means the row whose timestamp is the closest to (current - 1 minute).
Here is how I do it:
First, I use the timestamp as the index so that I can use get_loc, and I create a new dataframe new_df that starts 1 minute after df. This way I am sure that, for every row, the data from 1 minute earlier exists.
new_df = df.loc[df['timestamp'] > df.timestamp[0] + delta]  # delta = 1 min timedelta
values = []
for index, row in new_df.iterrows():
    v = row.A / df.iloc[df.index.get_loc(row.timestamp - delta, method='nearest')]['A']
    values.append(v)
v_ser = pd.Series(values)
new_df['B'] = v_ser.values
I'm afraid this is not that great. It takes a long time for large dataframes. Also, I am not 100% sure the above is completely correct. Sometimes I get this message:
A value is trying to be set on a copy of a slice from a DataFrame. Try
using .loc[row_indexer,col_indexer] = value instead
What is the best / most efficient way to do the task above? Thank you.
PS. If someone can think of a better title please let me know. It took me longer to write the title than the post and I still don't like it.
You could try to use .asof() if the DataFrame has been indexed correctly by the timestamps (if not, use .set_index() first).
Simple example here
import pandas as pd
import numpy as np
n_vals = 50
# Create a DataFrame with random values and 'unusual times'
df = pd.DataFrame(data=np.random.randint(low=1, high=6, size=n_vals),
                  index=pd.date_range(start=pd.Timestamp.now(),
                                      freq='23s', periods=n_vals),
                  columns=['value'])
# Demonstrate how to use .asof() to get the value that was the 'state'
# 1 min before each index entry. Note the .values call
df['value_one_min_ago'] = df['value'].asof(df.index - pd.Timedelta('1m')).values
# Note that there will be some NaNs to deal with; consider .fillna()
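Mapping this back to your setup (a sketch; it assumes df has been given a sorted DatetimeIndex via set_index('timestamp'), and note that .asof() picks the last row at or before the target time rather than the strictly nearest one):
a_min_ago = df['A'].asof(df.index - pd.Timedelta('1m')).values
df['B'] = df['A'] / a_min_ago   # rows within the first minute come out as NaN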
Here I have a function that computes a percentile column based on 2 other columns in the dataframe:
for each row, the function recreates a mini DataFrame with only the last 20 rows, computes the comparison for each of them, and then assigns a percentile to the current row.
A respondent to a previous question suggested that I repost this more specific question about apply.
grid = np.random.rand(40, 1)
full = pd.DataFrame(grid, columns=['value'])

def percentile(x, df):
    if int(x.name) < 20:
        pass
    else:
        df_temp = df.loc[(int(x.name) - 20):int(x.name)]
        bucketted = [b for b in df_temp.value if b < df_temp.loc[int(x.name), 'value']]
        return len(bucketted) / 0.2

full['percentile'] = full.apply(percentile, axis=1, args=(full,))
I am asking out of intellectual curiosity - since this works - whether anyone has a neater/faster way to approach the problem.
This is the closest to what I'm looking for that I've found.
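For reference, a rolling-window sketch of the same calculation (assuming Rolling.apply with raw=True and the same definition of percentile as above, i.e. the share of the previous 20 values strictly below the current one):
full['percentile_alt'] = (
    full['value']
    .rolling(21)    # the current row plus the 20 rows before it
    .apply(lambda w: (w[:-1] < w[-1]).sum() / 0.2, raw=True)
)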
Let's say my dataframe looks something like this:
d = {'item_number': ['K208UL','AKD098008','DF900A','K208UL','AKD098008'],
     'Comp_ID': ['998798098','988797387','12398787','998798098','988797387'],
     'date': ['2016-11-12','2016-11-13','2016-11-17','2016-11-13','2016-11-14']}
df = pd.DataFrame(data=d)
I would like to count the number of times the same item_number and Comp_ID were observed on consecutive days.
I imagine this will look something along the lines of:
g = df.groupby(['Comp_ID','item_number'])
g.apply(lambda x: x.loc[x.iloc[i,'date'].shift(-1) - x.iloc[i,'date'] == 1].count())
However, I would need to extract the day from each date as an int before comparing, which I'm also having trouble with.
for i in df.index:
    wbc_seven.iloc[i, 'day_column'] = datetime.datetime.strptime(df.iloc[i, 'date'], '%Y-%m-%d').day
Apparently location based indexing only allows for integers? How could I solve this problem?
However, I would need to extract the day from each date as an int
before comparing, which I'm also having trouble with.
Why?
To fix your code, you need:
consecutive['date'] = pd.to_datetime(consecutive['date'])
g = consecutive.groupby(['Comp_ID','item_number'])
g['date'].apply(lambda x: sum(abs((x.shift(-1) - x)) == pd.to_timedelta(1, unit='D')))
Note the following:
The code above avoids repetitions. That is a basic programming principle: Don't Repeat Yourself
It converts 1 to timedelta for proper comparison.
It takes the absolute difference.
Tip: write a top-level function for your work instead of a lambda, as it affords better readability, brevity, and aesthetics:
def differencer(grp, day_dif):
    """Counts rows in grp separated by day_dif day(s)"""
    d = abs(grp.shift(-1) - grp)
    return sum(d == pd.to_timedelta(day_dif, unit='D'))

g['date'].apply(differencer, day_dif=1)
Explanation:
It is pretty straightforward. The dates are converted to Timestamp type, then subtracted. The difference results in a timedelta, which needs to be compared with a timedelta object, hence the conversion of 1 (or day_dif) to timedelta. The result of that comparison is a Boolean Series. Booleans are represented by 0 for False and 1 for True, so summing a Boolean Series returns the total number of True values in the Series.
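A tiny illustration of those steps on some made-up dates:
s = pd.to_datetime(pd.Series(['2016-11-12', '2016-11-13', '2016-11-17']))
d = abs(s.shift(-1) - s)                        # timedeltas: 1 day, 4 days, NaT
print(sum(d == pd.to_timedelta(1, unit='D')))   # -> 1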
One solution would be to use pivot tables to count the number of times a Comp_ID and an item_number were observed on consecutive days.
import pandas as pd
d = {'item_number':['K208UL','AKD098008','DF900A','K208UL','AKD098008'],'Comp_ID':['998798098','988797387','12398787','998798098','988797387'],'date':['2016-11-12','2016-11-13','2016-11-17','2016-11-13','2016-11-14']}
df = pd.DataFrame(data=d).sort_values(['item_number','Comp_ID'])
df['date'] = pd.to_datetime(df['date'])
df['delta'] = (df['date'] - df['date'].shift(1))
df = df[(df['delta'] == '1 days 00:00:00.000000000') &
        (df['Comp_ID'] == df['Comp_ID'].shift(1)) &
        (df['item_number'] == df['item_number'].shift(1))].pivot_table(
            index=['item_number', 'Comp_ID'], values=['date'],
            aggfunc='count').reset_index()
df.rename(columns={'date': 'consecutive_days'}, inplace=True)
Results in
item_number Comp_ID consecutive_days
0 AKD098008 988797387 1
1 K208UL 998798098 1
On a pandas dataframe, I know I can groupby on one or more columns and then filter values that occur more/less than a given number.
But I want to do this on every column on the dataframe. I want to remove values that are too infrequent (let's say that occur less than 5% of times) or too frequent. As an example, consider a dataframe with following columns: city of origin, city of destination, distance, type of transport (air/car/foot), time of day, price-interval.
import pandas as pd
import string
import numpy as np
vals = [(c, np.random.choice(list(string.ascii_lowercase), 100, replace=True)) for c in
        ['city of origin', 'city of destination', 'distance, type of transport (air/car/foot)',
         'time of day, price-interval']]
df = pd.DataFrame(dict(vals))
>> df.head()
city of destination city of origin distance, type of transport (air/car/foot) time of day, price-interval
0 f p a n
1 k b a f
2 q s n j
3 h c g u
4 w d m h
If this is a big dataframe, it makes sense to remove rows that have spurious items, for example, if time of day = night occurs only 3% of the time, or if foot mode of transport is rare, and so on.
I want to remove all such values from all columns (or a list of columns). One idea I have is to do a value_counts on every column, transform and add one column for each value_counts; then filter based on whether they are above or below a threshold. But I think there must be a better way to achieve this?
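For reference, a minimal sketch of that value_counts idea (frequencies computed once on the original frame; 0.05 is the 5% threshold mentioned above):
freq = df.apply(lambda c: c.map(c.value_counts(normalize=True)))  # frequency of each cell's value within its column
df_filtered = df[freq.ge(0.05).all(axis=1)]                       # keep rows where every value is frequent enough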
This procedure will go through each column of the DataFrame and eliminate rows where the given category is less than a given threshold percentage, shrinking the DataFrame on each loop.
This answer is similar to that provided by @Ami Tavory, but with a few subtle differences:
It normalizes the value counts so you can just use a percentile threshold.
It calculates counts just once per column instead of twice. This results in faster execution.
Code:
threshold = 0.03
for col in df:
    counts = df[col].value_counts(normalize=True)
    df = df.loc[df[col].isin(counts[counts > threshold].index), :]
Code timing:
df2 = pd.DataFrame(np.random.choice(list(string.ascii_lowercase), [int(1e6), 4], replace=True),
                   columns=list('ABCD'))

%%timeit df=df2.copy()
threshold = 0.03
for col in df:
    counts = df[col].value_counts(normalize=True)
    df = df.loc[df[col].isin(counts[counts > threshold].index), :]

1 loops, best of 3: 485 ms per loop

%%timeit df=df2.copy()
m = 0.03 * len(df)
for c in df:
    df = df[df[c].isin(df[c].value_counts()[df[c].value_counts() > m].index)]

1 loops, best of 3: 688 ms per loop
I would go with one of the following:
Option A
m = 0.03 * len(df)
df[np.all(
    df.apply(
        lambda c: c.isin(c.value_counts()[c.value_counts() > m].index).to_numpy()),
    axis=1)]
Explanation:
m = 0.03 * len(df) is the threshold (it's nice to take the constant out of the complicated expression)
df[np.all(..., axis=1)] retains the rows where the condition holds across all columns.
df.apply(...) applies a function to all columns; .to_numpy() (the modern replacement for .as_matrix()) turns each column's result into an array.
c.isin(...) checks, for each column item, whether it is in some set.
c.value_counts()[c.value_counts() > m].index is the set of all values in a column whose count is above m.
Option B
m = 0.03 * len(df)
for c in df.columns:
    df = df[df[c].isin(df[c].value_counts()[df[c].value_counts() > m].index)]
The explanation is similar to the one above.
Tradeoffs:
Personally, I find B more readable.
B creates a new DataFrame for each filtering of a column; for large DataFrames, it's probably more expensive.
I am new to Python and using Pandas. I came up with the following solution below. Maybe other people might have a better or more efficient approach.
Assuming your DataFrame is DF, you can use the following code below to filter out all infrequent values. Just be sure to update the col and bin_freq variable. DF_Filtered is your new filtered DataFrame.
# Column you want to filter
col = 'time of day'
# Set your frequency to filter out. Currently set to 5%
bin_freq = float(5)/float(100)
DF_Filtered = pd.DataFrame()
for i in DF[col].unique():
    counts = DF[DF[col] == i].count()[col]
    total_counts = DF[col].count()
    freq = float(counts) / float(total_counts)
    if freq > bin_freq:
        DF_Filtered = pd.concat([DF[DF[col] == i], DF_Filtered])

print(DF_Filtered)
DataFrames support clip_lower(threshold, axis=None) and clip_upper(threshold, axis=None), which replace all values below or above (respectively) a certain threshold with that threshold; recent pandas versions fold both into a single clip(lower=..., upper=...) call.
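A quick illustration with made-up numeric data (note that values get capped, not dropped):
df_num = pd.DataFrame({'distance': [0.2, 30.0, 900.0], 'price': [1.0, 5.0, 120.0]})
clipped = df_num.clip(lower=1.0, upper=100.0)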
We can also replace all the rare categories with one label, say "Rare", and remove it later if it doesn't add value to prediction.
# function that finds the labels occurring in more than a certain percentage/threshold of rows
def get_freq_labels(df, var, rare_perc):
    df = df.copy()
    tmp = df.groupby(var)[var].count() / len(df)
    return tmp[tmp > rare_perc].index
vars_cat = [val for val in data.columns if data[val].dtype=='O']
for var in vars_cat:
    # find the frequent categories
    frequent_cat = get_freq_labels(data, var, 0.05)
    # replace rare categories by the string "Rare"
    data[var] = np.where(data[var].isin(frequent_cat), data[var], 'Rare')
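A quick way to try this out (the toy frame below is a made-up stand-in for the data variable assumed above):
data = pd.DataFrame({'transport': ['car'] * 18 + ['foot', 'air']})
frequent_cat = get_freq_labels(data, 'transport', 0.05)
data['transport'] = np.where(data['transport'].isin(frequent_cat), data['transport'], 'Rare')
print(data['transport'].value_counts())   # car: 18, Rare: 2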