I have a large dataframe (around 35k entries) whose index is composed of dates (like 2014-02-12), and these dates are not unique. What I need to do is find the max value for each date and create a new dataframe with it. I created a solution that works (it is below), but it takes a lot of time to process. Does anyone know a faster way to do this? Thank you.
# Creates an empty dataframe with the same columns
dataset0514maxrec = pd.DataFrame(columns=dataset0514max.columns.values)
dataset0514maxrec.index.name = 'Date'
# Gets the unique dates, selects each group, recovers the max value and appends it
for i in dataset0514max.index.unique():
    tempDF1 = dataset0514max.loc[dataset0514max.index.isin([i])]
    tempDF2 = tempDF1[tempDF1['Data_Value'] == tempDF1['Data_Value'].max()]
    dataset0514maxrec = dataset0514maxrec.append(tempDF2.head(1))
print(dataset0514maxrec)
groupby with levels
df.groupby(level=0).Data_Value.max().reset_index()
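That returns only the maximum Data_Value per date. If you need the whole matching row (as in your loop), one possibility is a groupby transform; a minimal sketch, assuming the column is named Data_Value as in the question (ties will keep more than one row per date):
df[df['Data_Value'] == df.groupby(level=0)['Data_Value'].transform('max')]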
The next two options require the index to be a datetime index. If it
isn't, convert it:
df.index = pd.to_datetime(df.index)
resample
df.resample('D').max()
sort_values + duplicated
# sort so the largest Data_Value per date comes first, then keep the first occurrence of each date
df = df.sort_values('Data_Value', ascending=False)
m = ~df.index.duplicated()
df = df[m]
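As a rough check on a toy frame (made-up values; the column and index names follow the question):
import pandas as pd

df = pd.DataFrame(
    {'Data_Value': [3, 7, 5, 2]},
    index=pd.to_datetime(['2014-02-12', '2014-02-12', '2014-02-13', '2014-02-13']),
)
df.index.name = 'Date'

# One row per date with the maximum Data_Value (7 for 2014-02-12, 5 for 2014-02-13)
print(df.groupby(level=0).Data_Value.max().reset_index())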
Related
A lot of times (e.g. for time series) I need to use all the values in a column up to the current row.
For instance, if my dataframe has 100 rows, I want to create a new column where the value in each row is a (sum, average, product, [any other formula]) of all the previous rows, excluding the ones after it:
Row 20 = formula(all_values_until_row_20)
Row 21 = formula(all_values_until_row_21)
etc
I think the easiest way to ask this question would be: How to implement the cumsum() function for a new column in pandas without using that specific method?
One approach, if you cannot use cumsum, is to introduce a new column or index and then apply a lambda function that uses all rows whose value in the new column is less than or equal to the current row's.
import pandas as pd

df = pd.DataFrame({'x': range(20, 30), 'y': range(40, 50)}).set_index('x')
# Running row number used as an ordering key
df['Id'] = range(0, len(df.index))
# For each row, sum the 'y' values of all rows up to and including it
df['Sum'] = df.apply(lambda x: df[df['Id'] <= x['Id']]['y'].sum(), axis=1)
print(df)
Since there is no sample data, I'll go with an assumed dataframe that has at least one numeric column and no NaN values.
I would start like below for cumulative sums and averages.
cumulative sum:
df['cum_sum'] = df['existing_col'].cumsum()
cumulative average:
df['cum_avg'] = df['existing_col'].cumsum() / df['index_col']  # assumes 'index_col' is a running row count starting at 1
or
df['cum_avg'] = df['existing_col'].expanding().mean()
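For an arbitrary formula rather than a plain sum or mean, expanding().apply runs a function over all values up to and including the current row; a minimal sketch, reusing the assumed 'existing_col' column and using the product as a stand-in for any custom formula:
import numpy as np

# np.prod here is just a placeholder for whatever formula is needed
df['cum_custom'] = df['existing_col'].expanding().apply(np.prod, raw=True)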
If you can provide a sample DataFrame, you can get better help, I believe.
I have a dataframe with a pandas DatetimeIndex. I need to take many slices from it (for plotting a piecewise graph with matplotlib). In other words, I need a new DF which would be a subset of the first one.
More precisely I need to take all rows that are between 9 and 16 o'clock but only if they are within a date range. Fortunately I have only one date range and one time range to apply.
What would be a clean way to do that? Thanks.
The first step is to set the index of the dataframe to the column where you store the time. Once the index is based on time, you can subset the dataframe easily.
df['time'] = pd.to_datetime(df['time'], format='%H:%M:%S.%f')  # assuming you have a column called 'time'
df['time'] = df['time'].dt.strftime('%H:%M:%S.%f')             # normalise back to a consistent string format
df.set_index('time', inplace=True)
new_df = df[startTime:endTime]  # startTime and endTime are strings in the same format
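If the index is already a DatetimeIndex (as in the question), the hour-of-day filter and the date range can also be applied directly on the index; a minimal sketch with placeholder dates:
# Restrict to the date range first (placeholder dates), then to 09:00-16:00 each day
subset = df.loc['2014-02-01':'2014-02-28']
subset = subset.between_time('09:00', '16:00')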
I have to do something similar to what's below, where I iterate through a list of dates.
But that process is slow, because each time I do dfday = df[df.Date == d] I go through the whole data frame to do the boolean masking.
Is there a way to make it faster?
df = pd.read_csv('bt.csv', sep='\t')
dates = df.Date.unique()
for d in dates:
    dfday = df[df.Date == d]
    # do something
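One common way to avoid re-scanning the whole frame for every date is to let groupby split it once; a minimal sketch of that pattern, keeping the same 'Date' column name:
# groupby partitions the frame in a single pass; each iteration yields one date and its rows
for d, dfday in df.groupby('Date'):
    # do something with dfday
    pass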
My issue is very simple, but I just can't wrap my head around it:
I have two dataframes:
A time series dataframe with two columns: Timestamp and DataValue
A time interval dataframe with start and end timestamps and a label
What I want to do:
Add a third column to the timeseries that yields the labels according to the time interval dataframe.
Every timepoint needs to have an assigned label designated by the time interval dataframe.
This code works:
TimeSeries_labelled = TimeSeries.copy(deep=True)
TimeSeries_labelled["State"] = 0
for index in Timeintervals_States.index:
    for entry in TimeSeries_labelled.index:
        if Timeintervals_States.loc[index, "start"] <= TimeSeries_labelled.loc[entry, "Timestamp"] <= Timeintervals_States.loc[index, "end"]:
            TimeSeries_labelled.loc[entry, "State"] = Timeintervals_States.loc[index, "state"]
But it is really slow. I tried to make it shorter and faster with Python's built-in filter functions, but failed miserably.
Please help!
I don't really know about TimeSeries, but with a dataframe containing timestamps as datetime objects you could use something like the following:
import pandas as pd

# Create the third column in the target dataframe
df_timeseries['label'] = pd.Series('', index=df_timeseries.index)
# Loop over the dataframe containing start and end timestamps
for index, row in df_start_end.iterrows():
    # Create a boolean mask to filter data
    mask = (df_timeseries['timestamp'] > row['start']) & (df_timeseries['timestamp'] < row['end'])
    df_timeseries.loc[mask, 'label'] = row['label']
This makes the rows of your time series dataframe that match the mask take the label of that row, for each row of the dataframe containing the start and end timestamps.
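If the intervals never overlap, the per-interval loop can also be replaced by a single IntervalIndex lookup; a minimal sketch, assuming the same df_start_end / df_timeseries names and that 'start', 'end', 'timestamp' and 'label' are the relevant columns:
import pandas as pd

# Build an IntervalIndex from the start/end columns (closed on both ends to mimic <= ... <=)
intervals = pd.IntervalIndex.from_arrays(df_start_end['start'], df_start_end['end'], closed='both')
# Position of the interval each timestamp falls into; -1 means no interval matched
pos = intervals.get_indexer(df_timeseries['timestamp'])
df_timeseries['label'] = df_start_end['label'].to_numpy()[pos]
df_timeseries.loc[pos == -1, 'label'] = ''  # leave unmatched rows unlabelled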
I'd like to take a subset of rows of a Dask dataframe based on a set of index keys. (Specifically, I want to find rows of ddf1 whose index is not in the index of ddf2.)
Both cache.drop([overlap_list]) and diff = cache[should_keep_bool_array] either throw a NotImplementedError or otherwise don't work.
What is the best way to do this?
I'm not sure this is the "best" way, but here's how I ended up doing it:
Create a pandas DataFrame whose index is the set of index keys I want to keep (e.g., pd.DataFrame(index=overlap_list))
Inner join the Dask DataFrame with it
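A minimal sketch of that join (ddf1 and keep_keys are placeholder names, not from the question; ddf1 is the Dask dataframe and keep_keys are the index keys to keep):
import pandas as pd

# Pandas frame whose only content is the index keys to keep
keys_df = pd.DataFrame(index=pd.Index(keep_keys, name=ddf1.index.name))
# Inner join on the index keeps only the matching rows of the Dask dataframe
kept = ddf1.join(keys_df, how='inner')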
Another possibility is:
df_index = df.reset_index()
df_index = df_index.drop_duplicates()