Pandas groupby: How do I use shifted values - python

I have a dataset that represents reoccurring events at different locations.
df = [Datetime location time event]
Each location can have 8-10 events that repeat. What I'm trying to do is work out how long it has been between two consecutive events (they may not be the same event).
I am able to do this by splitting the df into sub-dfs and processing each location individually, but it would seem that groupby should be smarter than this. This approach also assumes that I know all the locations, which may vary from file to file.
df1 = df[(df['location'] == "Loc A")]
df1['delta'] = df1['time'] - df1['time'].shift(1)
df2 = df[(df['location'] == "Loc B")]
df2['delta'] = df2['time'] - df2['time'].shift(1)
...
...
What I would like to do is groupby based on location...
dfg = df.groupby(['location'])
Then for each grouped location
Add a delta column
Shift and subtract to get the delta time between events
Questions:
Does groupby maintain the order of events?
Would a for loop that runs over the DF be better? That doesn't seem very Pythonic.
Also, once you have a grouped df, is there a way to transform it back into a regular dataframe? I don't think I need to do this, but thought it may be helpful in the future.
Thank you for any support you can offer.

http://pandas.pydata.org/pandas-docs/dev/groupby.html looks like it provides what you need.
groups = df.groupby('location').groups
or
for name, group in df.groupby('location'):
    # do stuff here
Both will split the dataframe into groups of rows with matching values in the location column.
Then you can sort the groups based on the time value and iterate through to create the deltas.
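A minimal sketch of that loop (assuming time is a datetime-like or numeric column, so subtraction gives a sensible delta; the pd.concat at the end also answers the question about getting back an ordinary dataframe):
pieces = []
for name, group in df.groupby('location'):
    group = group.sort_values('time')        # keep events in chronological order
    group['delta'] = group['time'].diff()    # time since the previous event at this location
    pieces.append(group)
result = pd.concat(pieces)                   # one regular dataframe again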

It appears that when you group by and select a column to act on, the data for each group is returned as a Series, which a function can then be applied to.
deltaTime = lambda x: (x - x.shift(1))
df['delta'] = df.groupby('location')['time'].apply(deltaTime)
This groups by location and returns the time column for each group.
Each sub-series is then passed to the function deltaTime.
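As a side note, recent pandas versions expose this shift-and-subtract pattern directly as diff on the grouped column, so the same delta can be computed without the explicit lambda:
df['delta'] = df.groupby('location')['time'].diff()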

Related

Generate a pandas dataframe with for-loop

I have generated a dataframe (called 'sectors') that stores information from my brokerage account (sector/industry, sub sector, company name, current value, cost basis, etc).
I want to avoid hard coding a filter for each sector or sub sector to find specific data. I have achieved this with the following code (I know, not very pythonic, but I am new to coding):
for x in set(sectors_df['Sector']):
    x_filt = sectors_df['Sector'] == x
    # value_in_sect takes the sum of all current values in a given sector
    value_in_sect = round(sectors_df.loc[x_filt]['Current Value'].sum(), 2)
    # pct_in_sect is the % of the sector in the overall portfolio (total equals the total value of all sectors)
    pct_in_sect = round((value_in_sect/total)*100, 2)
    print(x, value_in_sect, pct_in_sect)

for sub in set(sectors_df['Sub Sector']):
    sub_filt = sectors_df['Sub Sector'] == sub
    value_of_subs = round(sectors_df.loc[sub_filt]['Current Value'].sum(), 2)
    pct_of_subs = round((value_of_subs/total)*100, 2)
    print(sub, value_of_subs, pct_of_subs)
My print statements produce most of the information I want, although I am still working through how to compute the % of a sub sector within its own sector. Anyway, I would now like to put this information (value_in_sect, pct_in_sect, etc.) into dataframes of their own. What would be the best, smartest, most Pythonic way to go about this? I am thinking a dictionary, and then creating a dataframe from the dictionary, but I'm not sure.
The split-apply-combine process in pandas, specifically aggregation, is the best way to go about this. First I'll explain how this process would work manually, and then I'll show how pandas can do it in one line.
Manual split-apply-combine
Split
First, divide the DataFrame into groups with the same Sector. This involves getting the list of Sectors and figuring out which rows belong to each one (just like the first two lines of your code). This code runs through the DataFrame and builds a dictionary whose keys are the Sectors and whose values are lists of indices of the rows in sectors_df that belong to each one.
sectors_index = {}
for ix, row in sectors_df.iterrows():
    if row['Sector'] not in sectors_index:
        sectors_index[row['Sector']] = [ix]
    else:
        sectors_index[row['Sector']].append(ix)
Apply
Run the same function, in this case summing of Current Value and calculating its percentage share, on each group. That is, for each sector, grab the corresponding rows from the DataFrame and run the calculations in the next lines of your code. I'll store the results as a dictionary of dictionaries: {'Sector1': {'value_in_sect': 1234.56, 'pct_in_sect': 11.11}, 'Sector2': ... } for reasons that will become obvious later:
sector_total_value = {}
total_value = sectors_df['Current Value'].sum()
for sector, row_indices in sectors_index.items():
    sector_df = sectors_df.loc[row_indices]
    current_value = sector_df['Current Value'].sum()
    sector_total_value[sector] = {'value_in_sect': round(current_value, 2),
                                  'pct_in_sect': round(current_value / total_value * 100, 2)}
(see footnote 1 for a note on rounding)
Combine
Finally, collect the function results into a new DataFrame, where the index is the Sector. pandas can easily convert this nested dictionary structure into a DataFrame:
sector_total_value_df = pd.DataFrame.from_dict(sector_total_value, orient='index')
split-apply-combine using groupby
pandas makes this process very simple using the groupby method.
Split
The groupby method splits a DataFrame into groups by a column or multiple columns (or even another Series):
grouped_by_sector = sectors_df.groupby('Sector')
grouped_by_sector is similar to the index we built earlier, but the groups can be manipulated much more easily, as we can see in the following steps.
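If you want to see the parallel with the manual index built above, the GroupBy object can be inspected and iterated directly (a quick sketch; 'Energy' is just a hypothetical sector name):
grouped_by_sector.groups                  # dict-like mapping of each Sector to its row labels
grouped_by_sector.get_group('Energy')     # the sub-DataFrame for one sector
for sector, sector_df in grouped_by_sector:
    print(sector, len(sector_df))         # each group comes back as its own DataFrame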
Apply
To calculate the total value in each group, select the column (or columns) to sum up, then use the agg (or aggregate) method with the function you want to apply:
sector_total_value_df = grouped_by_sector['Current Value'].agg(value_in_sect=sum)
Combine
It's already done! The apply step already creates a DataFrame where the index is the Sector (the groupby column) and the value in the value_in_sect column is the result of the sum operation.
I've left out the pct_in_sect part because a) it can be more easily done after the fact:
sector_total_value_df['pct_in_sect'] = round(sector_total_value_df['value_in_sect'] / total_value * 100, 2)
sector_total_value_df['value_in_sect'] = round(sector_total_value_df['value_in_sect'], 2)
and b) it's outside the scope of this answer.
Most of this can be done easily in one line (see footnote 2 for including the percentage, and rounding):
sector_total_value_df = sectors_df.groupby('Sector')['Current Value'].agg(value_in_sect=sum)
For subsectors, there's one additional consideration: grouping should be done by Sector and Sub Sector rather than just Sub Sector, so that, for example, rows from Utilities/Gas and Energy/Gas aren't combined.
subsector_total_value_df = sectors_df.groupby(['Sector', 'Sub Sector'])['Current Value'].agg(value_in_sect=sum)
This produces a DataFrame with a MultiIndex with levels 'Sector' and 'Sub Sector', and a column 'value_in_sect'. For a final piece of magic, the percentage in Sector can be calculated quite easily:
subsector_total_value_df['pct_within_sect'] = round(subsector_total_value_df['value_in_sect'] / sector_total_value_df['value_in_sect'] * 100, 2)
which works because the 'Sector' index level is matched during division.
Footnote 1. This deviates from your code slightly, because I've chosen to calculate the percentage using the unrounded total value, to minimize the error in the percentage. Ideally though, rounding is only done at display time.
Footnote 2. This one-liner generates the desired result, including percentage and rounding:
sector_total_value_df = sectors_df.groupby('Sector')['Current Value'].agg(
    value_in_sect=lambda c: round(sum(c), 2),
    pct_in_sect=lambda c: round(sum(c) / sectors_df['Current Value'].sum() * 100, 2),
)

Filter on dataframe based on multiple features of another dataframe

The situation:
I have two datasets:
df1: contains the sensor data and the machine ID, logged every minute
df2: contains the production unit IDs, the machine ID, and the start and end datetimes of the units
My task is to keep only the sensor data that falls within the production timeframes of the machines. This means that based on the production datetimes in df2 (the timeframes between Start and Stop), I need to filter out the relevant sensor data from df2 (sensor data is logged in df2 every minute, whether or not there is production).
The problem:
I was able to write code which filters for the time intervals in df2, but I can't figure out how to filter on the machine ID as well.
Here is my working code containing only the datetime filtering:
for index, row in df1.iterrows():
    mask = (df2.index >= row['Start']) & (df2.index <= row['Stop'])
    df2.loc[mask, 'Sarzs_no'] = row['Sarzs_no']
    df2.loc[mask, 'Output'] = row['Output']
Here is my attempt to add the "Unit"(=machine ID) filtering as well to the datetime filtering:
for index, row in df1.iterrows():
    mask = (df1.index >= row['Start']) & (df1.index <= row['Stop']) & (row['Unit'] == df1.Unit)
    df1.loc[mask, 'Sarzs_no'] = row['Sarzs_no']
    df1.loc[mask, 'Output'] = row['Output']
Unfortunately, the above code is not working.
Questions:
Could you please let me know what I am doing wrong?
Could you please let me know how I can add a filter on the machine ID (column "Unit") as well?
Thank you for your help in advance!
I wanted to post this as a comment, but I don't have enough reputation. Some initial hints:
1) Try checking your keys. Unit in your first df has a different pattern than in your second. You may need to transform one or the other.
e.g. before looping:
df1["Unit"] = df1["Unit"].apply(lambda x: x.split('_')[1]) # K2_110 -> 110
2) In your example you iterate through the first dataframe and apply the mask on the first dataframe as well:
df1.loc[mask, 'Sarzs_no'] = row['Sarzs_no']
df1.loc[mask, 'Output'] = row['Output']
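Assuming, as in your first (working) snippet, that df1 holds the production intervals (Start, Stop, Unit, Sarzs_no, Output) and df2 holds the minute-by-minute sensor data with a 'Unit' column and a datetime index, a sketch of the combined mask could look like this (after aligning the Unit patterns as in hint 1):
for index, row in df1.iterrows():
    mask = (df2.index >= row['Start']) & (df2.index <= row['Stop']) & (df2['Unit'] == row['Unit'])
    df2.loc[mask, 'Sarzs_no'] = row['Sarzs_no']
    df2.loc[mask, 'Output'] = row['Output']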

How to multi-thread large number of pandas dataframe selection calls on large dataset

df is an unsorted dataframe containing 12 million+ rows.
Each row has a GROUP ID.
The end goal is to randomly select 1 row per unique GROUP ID and populate a new column named SELECTED, where 1 means selected and 0 means not selected.
There may be 5000+ unique GROUP IDs.
I'm looking for a better and faster solution than the following, potentially a multi-threaded one:
for sec in df['GROUP'].unique():
    sz = df.loc[df.GROUP == sec, ['SELECTED']].size
    sel = [0]*sz
    sel[random.randint(0, sz-1)] = 1
    df.loc[df.GROUP == sec, ['SELECTED']] = sel
You could try a vectorized version, which will probably speed things up if you have many classes.
import numpy as np
import pandas as pd

# get fake data
df = pd.DataFrame(np.random.rand(10))
df['GROUP'] = df[0].astype(str).str[2]

# mark one element of each group as selected
df['selected'] = df.index.isin(      # Is the current index in the selected list?
    df.groupby('GROUP')              # Get a GroupBy object.
    .apply(pd.Series.sample)         # Select one row from each group.
    .index.levels[1]                 # The index is a (group, old_id) pair; take the old_id level.
).astype(int)                        # Convert booleans to ints.
Note that this may fail if duplicate indices are present.
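On pandas 1.1 or newer, the same idea can be written a little more directly with GroupBy.sample (a sketch using the SELECTED column from the question; the same caveat about duplicate indices applies):
df['SELECTED'] = 0
picked = df.groupby('GROUP').sample(n=1).index   # one randomly chosen row label per group
df.loc[picked, 'SELECTED'] = 1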
I do not know pandas' DataFrame well, but if you simply set SELECTED where it needs to be 1, and later assume that not having the attribute means not selected, you could avoid updating all elements.
You may also do something like this:
selected = []
for sec in df['GROUP'].unique():
    # pick one random row label from this group
    selected.append(random.choice(df.index[df['GROUP'] == sec]))
or with a list comprehension
selected = [random.choice(df.index[df['GROUP'] == sec]) for sec in df['GROUP'].unique()]
Maybe this can speed things up, because you will not need to allocate new memory and update all elements of your dataframe.
If you really want multithreading, have a look at concurrent.futures: https://docs.python.org/3/library/concurrent.futures.html
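If you do go the concurrent.futures route, a rough sketch could look like the following (illustrative only; for CPU-bound pandas work a ProcessPoolExecutor tends to help more than threads, and pickling each group to a worker has its own cost):
from concurrent.futures import ProcessPoolExecutor

def pick_one(group_df):
    # return the label of one randomly chosen row in this group
    return group_df.sample(n=1).index[0]

with ProcessPoolExecutor() as pool:
    chosen = list(pool.map(pick_one, (g for _, g in df.groupby('GROUP'))))

df['SELECTED'] = df.index.isin(chosen).astype(int)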

Get nth row after applying lambda on groupby in python

So I need to group a dataframe by its SessionId, then sort each group by the Created time, and afterwards retrieve only the nth row of each group.
But I found that after applying the lambda it becomes a dataframe instead of a GroupBy object, so I cannot use the .nth property:
grouped = df.groupby(['SessionId'])
sorted = grouped.apply(lambda x: x.sort_values(["Created"], ascending=True))
sorted.nth  # ---> error
Changing the order in which you are approaching the problem in this case will help. If you first sort and then use groupby, you will get the desired output and you can use the groupby.nth function.
Here is a code snippet to demonstrate the idea:
df = pd.DataFrame({'id': ['a', 'a', 'a', 'b', 'b', 'b'],
                   'var1': [3, 2, 1, 8, 7, 6],
                   'var2': ['g', 'h', 'i', 'j', 'k', 'l']})
n = 2  # replace with the required row from each group
df.sort_values(['id', 'var1']).groupby('id').nth(n).reset_index()
Assuming id is your SessionId and var1 is the timestamp, this sorts your dataframe by id and then var1, and then picks up the nth row from each of the sorted groups. The reset_index() is there just to avoid the resulting multi-index.
If you want to get the last n rows of each group, you can use .tail(n) instead of .nth(n).
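Translated to the column names in your question (SessionId and Created), the same one-liner would presumably be:
n = 2  # the required row from each group
df.sort_values(['SessionId', 'Created']).groupby('SessionId').nth(n).reset_index()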
I have created a small dataset -
n = 2
grouped = df.groupby('SessionId')
pd.concat([grouped.get_group(x).sort_values(by='SortVar').reset_index().loc[[n]]
           for x in grouped.groups], axis=0)
This will return the required row from each group.
Please note that in Python indexing starts from zero, so for n = 2 it will give you the 3rd row of the sorted data.

Pandas formatting column within DataFrame and adding timedelta Index error

I'm trying to use pandas to do some analysis on messaging data and am running into a few problems trying to prep the data. It comes from a database I don't have control of, so I need to do a little pruning and formatting before analyzing it.
Here is where I'm at so far:
#select all the messages in the database; be careful if you get the whole test database, it may have 5,000,000 messages
full_set_data = pd.read_sql("Select * from message",con=engine)
After I make this change to the timestamp and set it as the index, I'm no longer able to call to_csv.
#convert timestamp to a timedelta and set as index
#full_set_data[['timestamp']] = full_set_data[['timestamp']].astype(np.timedelta64)
indexed = full_set_data.set_index('timestamp')
indexed.to_csv('indexed.csv')
#extract the data columns I really care about since there are a bunch I don't need
datacolumns = indexed[['address','subaddress','rx_or_tx', 'wordcount'] + [col for col in indexed.columns if ('DATA' in col)]]
Here, where I need to format the DATA columns, I get a "SettingWithCopyWarning".
#now need to format the DATA columns to something useful by removing the upper 4 bytes
for col in datacolumns.columns:
    if 'DATA' in col:
        datacolumns[col] = datacolumns[col].apply(lambda x: int(x, 16) & 0x0000ffff)
datacolumns.to_csv('data_col.csv')
#now group the data by "interaction key"
groups = datacolumns.groupby(['address','subaddress','rx_or_tx'])
I need to figure out how to get all the messages from a given group. get_group() requires I know key values ahead of time.
key_group = groups.get_group((1,1,1))
#foreach group in groups:
#do analysis
I have tried everything I could think of to fix the problems I'm running into, but I can't seem to get around them. I'm sure it's from me misunderstanding/misusing pandas as I'm still figuring it out.
I'm looking to solve these issues:
1) I can't save to CSV after I convert the timestamp to timedelta64 and set it as the index.
2) How do I apply a function to a set of columns without the SettingWithCopyWarning when reformatting the DATA columns?
3) How do I grab the rows for each group without having to use get_group(), since I don't know the keys ahead of time?
Thanks for any insight and help so I can better understand how to properly use Pandas.
Firstly, you can set the index column(s) and parse dates while querying the DB:
indexed = pd.read_sql_query("Select * from message", con=engine,
                            parse_dates=['timestamp'], index_col='timestamp')
Note I've used pd.read_sql_query here rather than pd.read_sql, which is deprecated, I think.
The SettingWithCopy warning is due to the fact that datacolumns is a view of indexed, i.e. a subset of its rows/columns, not an object in its own right. Check out this part of the docs: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
One way to get around this is to define
datacolumns = indexed[<cols>].copy()
Another would be to do
indexed = indexed[<cols>]
which effectively removes the columns you don't want, if you're happy that you won't need them again. You can then manipulate indexed at your leisure.
As for the groupby, you could introduce a column of tuples which would be the group keys:
indexed['interaction_key'] = list(zip(indexed['address'], indexed['subaddress'], indexed['rx_or_tx']))
indexed.groupby('interaction_key').apply(
    lambda df: some_function(df.interaction_key, ...))
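Alternatively, if the goal is simply to visit every group without knowing the keys in advance (your question 3), the original three-column groupby can be iterated directly; a minimal sketch:
groups = datacolumns.groupby(['address', 'subaddress', 'rx_or_tx'])
for key, group_df in groups:
    # key is the (address, subaddress, rx_or_tx) tuple; group_df holds that group's rows
    print(key, len(group_df))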
I'm not sure if it's all exactly what you want but let me know and I can edit.
