I have to do something similar to the code below, where I iterate through a list of dates.
But the process is slow, because on each iteration dfday = df[df.Date == d] scans the whole dataframe to do the boolean masking.
Is there a way to make it faster?
df = pd.read_csv('bt.csv', sep='\t')
dates = df.Date.unique()
for d in dates:
    dfday = df[df.Date == d]
    # do something
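A faster alternative (my sketch, not part of the original question) is to let groupby do the partitioning: it splits the frame into per-date groups in one pass instead of re-scanning it with a mask for every date.

import pandas as pd

df = pd.read_csv('bt.csv', sep='\t')
# one pass over df; d is the date, dfday the matching sub-frame
for d, dfday in df.groupby('Date'):
    # do something
    pass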
I have a dataframe with a pandas DatetimeIndex. I need to take many slices from it (for printing a piecewise graph with matplotlib). In other words, I need a new DataFrame that is a subset of the first one.
More precisely, I need to take all rows that fall between 9 and 16 o'clock, but only if they are within a date range. Fortunately I have only one date range and one time range to apply.
What would be a clean way to do that? Thanks.
The first step is to set the index of the dataframe to the column where you store time. Once the index is based on time, you can subset the dataframe easily.
df['time'] = pd.to_datetime(df['time'], format='%H:%M:%S.%f')  # assuming you have a col called 'time'
df['time'] = df['time'].dt.strftime('%H:%M:%S.%f')  # back to zero-padded strings so label slicing works lexically
df.set_index('time', inplace=True)
df = df.sort_index()  # label slicing needs a sorted index
new_df = df[startTime:endTime]  # startTime and endTime are strings
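If the index is a proper (and sorted) DatetimeIndex instead of a string index, between_time handles the 9-to-16 window directly; a minimal sketch with illustrative dates:

# slice the date range first (partial-string indexing on a sorted DatetimeIndex)
in_range = df.loc['2016-03-01':'2016-03-31']
# then keep only rows between 9:00 and 16:00 on each day
new_df = in_range.between_time('09:00', '16:00')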
My issue is very simple, but I just can't wrap my head around it:
I have two dataframes:
A time series dataframe with two columns: Timestamp and DataValue
A time interval dataframe with start and end timestamps and a label
What I want to do:
Add a third column to the timeseries that yields the labels according to the time interval dataframe.
Every timepoint needs to have an assigned label designated by the time interval dataframe.
This code works:
TimeSeries_labelled = TimeSeries.copy(deep=True)
TimeSeries_labelled["State"] = 0
for index in Timeintervals_States.index:
    for entry in TimeSeries_labelled.index:
        if Timeintervals_States.loc[index, "start"] <= TimeSeries_labelled.loc[entry, "Timestamp"] <= Timeintervals_States.loc[index, "end"]:
            TimeSeries_labelled.loc[entry, "State"] = Timeintervals_States.loc[index, "state"]
But it is really slow. I tried to make it shorter and faster with Python's built-in filter functions, but failed miserably.
Please help!
I don't really know about TimeSeries, but with a dataframe containing timestamps as datetime objects you could use something like the following:
import pandas as pd

# Create the third column in the target dataframe
df_timeseries['label'] = pd.Series('', index=df_timeseries.index)

# Loop over the dataframe containing start and end timestamps
for index, row in df_start_end.iterrows():
    # Create a boolean mask to filter data
    mask = (df_timeseries['timestamp'] > row['start']) & (df_timeseries['timestamp'] < row['end'])
    df_timeseries.loc[mask, 'label'] = row['label']
For each row of the dataframe containing start and end timestamps, this assigns that row's label to every row of the timeseries dataframe that matches the mask.
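If the intervals never overlap, a vectorized alternative to the loop (my sketch, not from the original answer) is an IntervalIndex lookup, which labels every timestamp in a single call:

import pandas as pd

# one interval per row of df_start_end; get_indexer assumes they don't overlap
intervals = pd.IntervalIndex.from_arrays(df_start_end['start'], df_start_end['end'], closed='both')
pos = intervals.get_indexer(df_timeseries['timestamp'])  # -1 where no interval matches
matched = pos >= 0
df_timeseries.loc[matched, 'label'] = df_start_end['label'].to_numpy()[pos[matched]]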
I have a large dataframe (around 35k entries). The index of this dataframe is composed of dates (like 2014-02-12), and these dates are not unique. What I need to do is find the max value for each date and create a new dataframe with it. I created a solution that works (it is below), but it takes a lot of time to process. Does anyone know a faster way to do this? Thank you.
# Creates an empty dataframe
dataset0514maxrec = pd.DataFrame(columns=dataset0514max.columns.values)
dataset0514maxrec.index.name = 'Date'

# Gets the unique values, finds the groups, recovers the max value and appends it
for i in dataset0514max.index.unique():
    tempDF1 = dataset0514max.loc[dataset0514max.index.isin([i])]
    tempDF2 = tempDF1[tempDF1['Data_Value'] == tempDF1['Data_Value'].max()]
    dataset0514maxrec = dataset0514maxrec.append(tempDF2.head(1))

print(dataset0514maxrec)
groupby with levels
df.groupby(level=0).Data_Value.max().reset_index()
The next two options require the index to be a datetime index. If it
isn't, convert it:
df.index = pd.to_datetime(df.index)
resample
df.resample('D').max()
sort_values + duplicated
df = df.sort_values('Data_Value')
m = ~df.index.duplicated(keep='last')  # keep the last (largest) row per date
df = df[m]
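Note that the groupby option above returns only the Data_Value column. If you need every column of the winning row (as the original append loop produced), one possible sketch, assuming the index is named 'Date' so that reset_index exposes it as a column:

df2 = df.reset_index()  # unique positional index, so idxmax labels are unambiguous
out = df2.loc[df2.groupby('Date')['Data_Value'].idxmax()].set_index('Date')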
I have a dataframe created from a .csv document. Since one of the columns has dates, I have used pandas read_csv with parse_dates:
df = pd.read_csv('CSVdata.csv', encoding = "ISO-8859-1", parse_dates=['Dates_column'])
The dates range from 2012 to 2016. I want to create a sub-dataframe containing only the rows from 2014.
The only way I have managed to do this, is with two subsequent Boolean filters:
df_a = df[df.Dates_column>pd.Timestamp('2014')] # To create a dataframe from 01/Jan/2014 onwards.
df = df_a[df_a.Dates_column<pd.Timestamp('2015')] # To remove all the values from 01/Jan/2015 onwards
Is there a way of doing this in one step, more efficiently?
Many thanks!
You can use the dt accessor:
df = df[df.Dates_column.dt.year == 2014]
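A between-based variant of the same one-step filter (my sketch; both bounds are inclusive, so with intraday timestamps the dt.year form above is the safer choice):

df = df[df.Dates_column.between('2014-01-01', '2014-12-31')]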
I have a pandas dataframe called trg_data to collect data that I am producing in batches. Each batch is produced by a sub-routine as a smaller dataframe df with the same number of columns but fewer rows, and I want to insert the values from df into trg_data at a new row position each time.
However, when I use the following statement df is always inserted at the top. (i.e. rows 0 to len(df)).
trg_data.iloc[trg_pt:(trg_pt + len(df))] = df
I'm guessing, but I think the reason may be that even though the slice indicates the desired rows, pandas is using the index of df to decide where to put the data.
As a test I found that I can insert an ndarray at the right position no problem:
trg_data.iloc[trg_pt:(trg_pt + len(df))] = np.ones(df.shape)
How do I get it to ignore the index in df and insert the data where I want it? Or is there an entirely different way of achieving this? At the end of the day I just want to create the dataframe trg_data and then save to file at the end. I went down this route because there didn't seem to be a way of easily appending to an existing dataframe.
I've been working at this for over an hour and I can't figure out what to google to find the right answer!
I think I may have the answer (I thought I had already tried this but apparently not):
trg_data.iloc[trg_pt:(trg_pt + len(df))] = df.values
Still, I'm open to other suggestions. There's probably a better way to add data to a dataframe.
The way I would do this is to save all the intermediate dataframes in a list, and then concatenate them together:
import pandas as pd

dfs = []
# get all the intermediate dataframes somehow

# combine into one dataframe
trg_data = pd.concat(dfs)
Both
trg_data = pd.concat([df1, df2, ... dfn], ignore_index=True)
and
trg_data = pd.DataFrame()
for ...:  # loop that generates df
    trg_data = trg_data.append(df, ignore_index=True)  # you can reuse the name df
should work for you.
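One caveat worth flagging, although it postdates these answers: DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so on current versions only the concat pattern works. A minimal runnable sketch with stand-in data:

import pandas as pd

chunks = []
for i in range(3):  # stand-in for the real loop that generates df
    df = pd.DataFrame({'value': [i, i + 1]})  # stand-in batch
    chunks.append(df)
trg_data = pd.concat(chunks, ignore_index=True)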