Manipulate pandas.DataFrame with multiple criteria - python

For example I have a dataframe:
df = pd.DataFrame({'Value_Bucket': [5, 5, 5, 10, 10, 10],
                   'DayofWeek': [1, 1, 3, 2, 4, 2],
                   'Hour_Bucket': [1, 5, 7, 4, 3, 12],
                   'Values': [1, 1.5, 2, 3, 5, 3]})
The actual data set is rather large (5000+ rows). I'm looking to perform functions on 'Values' where "Value_Bucket" equals 5, for each possible combination of "DayofWeek" and "Hour_Bucket".
Essentially the data will be grouped into a table of 24 rows (Hour_Bucket) and 7 columns (DayofWeek), and each cell is filled with the result of a function (say the average, for example). I can use groupby for one criterion; can someone explain how I can group by two criteria and tabulate the result in a table?

Query to subset, then groupby, then unstack:
df.query('Value_Bucket == 5').groupby(
    ['Hour_Bucket', 'DayofWeek']).Values.mean().unstack()
DayofWeek      1    3
Hour_Bucket
1            1.0  NaN
5            1.5  NaN
7            NaN  2.0
If you want to have zeros instead of NaN:
df.query('Value_Bucket == 5').groupby(
    ['Hour_Bucket', 'DayofWeek']).Values.mean().unstack(fill_value=0)
DayofWeek      1    3
Hour_Bucket
1            1.0  0.0
5            1.5  0.0
7            0.0  2.0

Pivot tables seem more natural to me than groupby paired with unstack, though they do exactly the same thing.
pd.pivot_table(data=df.query('Value_Bucket == 5'),
               index='Hour_Bucket',
               columns='DayofWeek',
               values='Values',
               aggfunc='mean',
               fill_value=0)
Output
DayofWeek      1  3
Hour_Bucket
1            1.0  0
5            1.5  0
7            0.0  2
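Either form accepts aggregations other than the mean. As a rough sketch (any reducing function accepted by aggfunc should work the same way), counting observations per cell only requires changing one argument:
# Sketch: count of observations per (Hour_Bucket, DayofWeek) cell instead of the mean.
# Any reducing function ('sum', 'median', a custom callable, ...) can be swapped in.
pd.pivot_table(data=df.query('Value_Bucket == 5'),
               index='Hour_Bucket',
               columns='DayofWeek',
               values='Values',
               aggfunc='count',
               fill_value=0)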

Related

Compare a DataFrame to itself? pandas

I have a dataframe with week number as int, item name, and ranking.
For instance:
item_name ranking week_number
0 test 4 1
1 test 3 2
I'd like to add a new column with the ranking evolution since the last week.
The math is very simple:
df['ranking_evolution'] = ranking_previous_week - df['ranking']
It would only require exception handling for week 1.
But I'm not sure how to get the previous week's ranking.
I could do it by iterating over the rows but I'm wondering if there is a cleaner way so I can just declare a column?
The issue is that I'd have to compare the dataframe to itself.
I've naively tried:
df['ranking_evolution'] = df['ranking'].loc[(df['item_name'] == df['item_name']) & (df['week_number'] == df['week_number'] - 1)] - df['ranking']
But this returns NaN values.
Even using a copy returned NaN values.
I assume this is a simplified example; you probably have different products and maybe missing weeks?
A robust way would be to perform a self-merge with the week+1:
(df.merge(df.assign(week_number=df['week_number']+1),
          on=['item_name', 'week_number'],
          suffixes=(None, '_evolution'),
          how='left')
   .assign(ranking_evolution=lambda d: d['ranking_evolution'].sub(d['ranking']))
)
Output:
item_name ranking week_number ranking_evolution
0 test 4 1 NaN
1 test 3 2 1.0
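If every item has one row per consecutive week (a stronger assumption than the self-merge needs), a per-item shift is a shorter sketch of the same calculation:
# Sketch: assumes rows are sorted by week and no weeks are missing per item;
# prefer the self-merge above when weeks can be absent.
df = df.sort_values(['item_name', 'week_number'])
df['ranking_evolution'] = df.groupby('item_name')['ranking'].shift(1) - df['ranking']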
In short, try this code to see the trick:
import pandas as pd

data = {
    'item_name': ['test', 'test', 'test', 'test', 'test', 'test', 'test', 'test', 'test', 'test'],
    'ranking': [4, 3, 2, 1, 2, 3, 4, 5, 6, 7],
    'week_number': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
}
df = pd.DataFrame(data)
df['ranking_evolution'] = df['ranking'].diff(-1)  # this is the line that does the trick
print(df)
Results
item_name ranking week_number ranking_evolution
test 4 1 1.0
test 3 2 1.0
test 2 3 1.0
test 1 4 -1.0

Pandas: Alternative methods to using nested for-loops for cell-comparisons

I'm working on a problem of finding train meetings at stations, but I'm having a hard time doing the necessary comparisons without nested for-loops (which are way too slow; I have hundreds of thousands of data points).
My DataFrame rows contain the following useful data: an arrival time (datetime), a departure time (datetime), a unique train ID (string), the station the train is located at between arrival and departure (string), and a cell for the train IDs of the trains it meets (string, empty at start). I want to find all pairs of rows that meet, meaning pairs that fulfil both of the following:
The time interval from Row 1's arrival to departure overlaps with Row 2's arrival-to-departure interval.
The two rows are located at the same station.
Additionally, there are no meetings with more than two trains involved.
I tried the following (code below): I created Interval objects out of my arrival and departure times. Then I used nested for-loops to compare each row's interval with every other row's, and if they overlapped I checked whether the stations matched. If they did, I stored each train ID in the other's meeting cell.
# func builds an Interval object from each arrival/departure pair
df_dsp['interval'] = [func(x, y) for x, y in zip(df_dsp['arrival'], df_dsp['departure'])]
meetings = np.empty([])
for i in range(1, len(df.index)):
    for q in range(1, len(df.index)):
        if (i < q):  # Train meetings are symmetric.
            if df.iloc[i, df.columns.get_loc('interval')].overlaps(df.iloc[q, df.columns.get_loc('interval')]):
                if df.iloc[i, df.columns.get_loc('station')] == df.iloc[q, df.columns.get_loc('station')]:
                    df.iloc[i, df.columns.get_loc('train_id_meeting')] = df.iloc[q, df.columns.get_loc('train_id')]
                    df.iloc[q, df.columns.get_loc('train_id_meeting')] = df.iloc[i, df.columns.get_loc('train_id')]
I've taken a look at similar questions but have a hard time applying them to my dataset efficiently. My question is: How can I perform these comparisons faster?
Edit:
I can't give out the database (somewhat classified) but I made a representative dataset.
d = {'arrival': [pd.Timestamp(datetime.datetime(2012, 5, 1, 1)), pd.Timestamp(datetime.datetime(2012, 5, 1, 3)),
                 pd.Timestamp(datetime.datetime(2012, 5, 1, 6)), pd.Timestamp(datetime.datetime(2012, 5, 1, 4))],
     'departure': [pd.Timestamp(datetime.datetime(2012, 5, 1, 3)), pd.Timestamp(datetime.datetime(2012, 5, 1, 5)),
                   pd.Timestamp(datetime.datetime(2012, 5, 1, 7)), pd.Timestamp(datetime.datetime(2012, 5, 1, 6))],
     'station': ["a", "b", "a", "b"],
     'train_id': [1, 2, 3, 4],
     'meetings': [np.nan, np.nan, np.nan, np.nan]}
df = pd.DataFrame(data=d)
In this sample-data, Row 2 and 4 would represent trains that meet at station "b". If this can be done faster without use of the Interval-object, I'd be happy to use that.
Initialize the DataFrame.
d = {'arrival': [pd.Timestamp(datetime.datetime(2012, 5, 1, 1)), pd.Timestamp(datetime.datetime(2012, 5, 1, 3)),
                 pd.Timestamp(datetime.datetime(2012, 5, 1, 6)), pd.Timestamp(datetime.datetime(2012, 5, 1, 4))],
     'departure': [pd.Timestamp(datetime.datetime(2012, 5, 1, 3)), pd.Timestamp(datetime.datetime(2012, 5, 1, 5)),
                   pd.Timestamp(datetime.datetime(2012, 5, 1, 7)), pd.Timestamp(datetime.datetime(2012, 5, 1, 6))],
     'station': ["a", "b", "a", "b"],
     'train_id': [1, 2, 3, 4],
     'meetings': [np.nan, np.nan, np.nan, np.nan]}
df = pd.DataFrame(data=d)
Out:
arrival departure station train_id meetings
0 2012-05-01 01:00:00 2012-05-01 03:00:00 a 1 NaN
1 2012-05-01 03:00:00 2012-05-01 05:00:00 b 2 NaN
2 2012-05-01 06:00:00 2012-05-01 07:00:00 a 3 NaN
3 2012-05-01 04:00:00 2012-05-01 06:00:00 b 4 NaN
Convert datetimes to Unix timestamps:
df["arrival"] = df["arrival"].apply(lambda x: x.timestamp())
df["departure"] = df["departure"].apply(lambda x: x.timestamp())
Initialize two tables for merging:
df2 = df
Merge both tables on "station", so that every pair of rows at the same station ends up in one row:
merge = pd.merge(df, df2, on=['station'])
table = merge[merge["train_id_x"] != merge["train_id_y"]]
arrival_x departure_x station train_id_x meetings_x arrival_y departure_y train_id_y meetings_y
1 1.335834e+09 1.335841e+09 a 1 NaN 1.335852e+09 1.335856e+09 3 NaN
2 1.335852e+09 1.335856e+09 a 3 NaN 1.335834e+09 1.335841e+09 1 NaN
5 1.335841e+09 1.335848e+09 b 2 NaN 1.335845e+09 1.335852e+09 4 NaN
6 1.335845e+09 1.335852e+09 b 4 NaN 1.335841e+09 1.335848e+09 2 NaN
Now apply a vectorised overlap comparison:
table[((table["arrival_x"] > table["arrival_y"]) & (table["arrival_x"] < table["departure_y"]) |
       (table["arrival_y"] > table["arrival_x"]) & (table["arrival_y"] < table["departure_x"]))]
Result:
arrival_x departure_x station train_id_x meetings_x arrival_y departure_y train_id_y meetings_y
5 1.335841e+09 1.335848e+09 b 2 NaN 1.335845e+09 1.335852e+09 4 NaN
6 1.335845e+09 1.335852e+09 b 4 NaN 1.335841e+09 1.335848e+09 2 NaN
Disclaimer: this algorithm can be improved further, but I hope you get the idea. Instead of using for-loops, use pandas and NumPy functions, which are faster.
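Putting the same idea together, here is a condensed sketch that stays on the original Timestamps (no conversion to epoch seconds is needed for the comparison) and writes the partner's ID back into the frame, assuming at most one meeting per train as the question states:
import datetime

import numpy as np
import pandas as pd

# Sample data from the question.
d = {'arrival': [pd.Timestamp(datetime.datetime(2012, 5, 1, 1)), pd.Timestamp(datetime.datetime(2012, 5, 1, 3)),
                 pd.Timestamp(datetime.datetime(2012, 5, 1, 6)), pd.Timestamp(datetime.datetime(2012, 5, 1, 4))],
     'departure': [pd.Timestamp(datetime.datetime(2012, 5, 1, 3)), pd.Timestamp(datetime.datetime(2012, 5, 1, 5)),
                   pd.Timestamp(datetime.datetime(2012, 5, 1, 7)), pd.Timestamp(datetime.datetime(2012, 5, 1, 6))],
     'station': ["a", "b", "a", "b"],
     'train_id': [1, 2, 3, 4]}
df = pd.DataFrame(data=d)

# Self-merge on station so every pair of trains at the same station shares a row.
pairs = df.merge(df, on='station', suffixes=('_x', '_y'))
pairs = pairs[pairs['train_id_x'] < pairs['train_id_y']]  # keep each pair once

# Two intervals overlap when each one starts before the other ends.
overlap = (pairs['arrival_x'] < pairs['departure_y']) & (pairs['arrival_y'] < pairs['departure_x'])
meetings = pairs.loc[overlap, ['station', 'train_id_x', 'train_id_y']]

# Map each train to its partner (at most one meeting per train, per the question).
partner = pd.concat([meetings.set_index('train_id_x')['train_id_y'],
                     meetings.set_index('train_id_y')['train_id_x']])
df['meetings'] = df['train_id'].map(partner)
print(df)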
My head is going to explode. Sorry I can't comment, since I don't have enough reputation. Is there any way we can have your database to test on?
Does it have to match Row 1 with Row 2 (pairs), or Row 1 with any row?
You are actually doing O(n^2); it can certainly be improved, but the problem has to be understood very well first.
The most efficient approach in pandas is to view everything as a matrix that can be compared and operated on with NumPy arrays (vectorisation), but I think that is not the case here.
What is the data type of "interval"?

Vectorized way of finding the index of a previously occurring element

Let's say I have this Pandas series:
num = pd.Series([1,2,3,4,5,6,5,6,4,2,1,3])
What I want to do is take a number, say 5, and return the index where it previously occurred. So for the element 5 at index 6, I should get 4, since the element appears at indices 4 and 6. Now I want to do this for all elements of the series, which can easily be done with a for loop:
out = []
for idx, x in enumerate(num):
    idx_prev = num[num == x].idxmax()   # index of the first occurrence of x
    out.append(idx_prev if idx_prev < idx else np.nan)
However, this process consumes too much time for longer series lengths due to the looping. Is there a way to implement the same thing but in a vectorized form? The output should be something like this:
[NaN,NaN,NaN,NaN,NaN,NaN,4,5,3,1,0,2]
You can use groupby to shift the index:
num.index.to_series().groupby(num).shift()
Output:
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 4.0
7 5.0
8 3.0
9 1.0
10 0.0
11 2.0
dtype: float64
It's possible to keep working in numpy.
Equivalent of [num[num == x].idxmax() for idx,x in enumerate(num)] using numpy is:
_, out = np.unique(num.values, return_inverse=True)
which assigns
array([0, 1, 2, 3, 4, 5, 4, 5, 3, 1, 0, 2], dtype=int64)
to out. Now you can set the positions of out that have no earlier occurrence to NaN like this:
out_series = pd.Series(out)
out_series[out >= np.arange(len(out))] = np.nan
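As a runnable sketch of those two steps (casting to float up front so NaN can be assigned without dtype complaints):
import numpy as np
import pandas as pd

num = pd.Series([1, 2, 3, 4, 5, 6, 5, 6, 4, 2, 1, 3])

# Position of each value in the sorted unique array; for this data it coincides
# with the first-occurrence index computed by the idxmax comprehension above.
_, out = np.unique(num.values, return_inverse=True)

out_series = pd.Series(out, dtype=float)          # float so NaN can be stored
out_series[out >= np.arange(len(out))] = np.nan   # no earlier occurrence -> NaN
print(out_series.tolist())
# [nan, nan, nan, nan, nan, nan, 4.0, 5.0, 3.0, 1.0, 0.0, 2.0]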

Fast conversion to multiindexed pandas dataframe using bincounts

I have data from users who have left star ratings (1, 2 or 3 stars) on items in various categories, where each item may belong to multiple categories. In my current dataframe, each row represents a rating and the categories are one-hot encoded, like so:
import numpy as np
import pandas as pd
df_old = pd.DataFrame({
    'user': [1, 1, 2, 2, 2],
    'rate': [3, 2, 1, 1, 2],
    'cat1': [1, 0, 1, 1, 1],
    'cat2': [0, 1, 0, 0, 1]
})
# user rate cat1 cat2
# 0 1 3 1 0
# 1 1 2 0 1
# 2 2 1 1 0
# 3 2 1 1 0
# 4 2 2 1 1
I want to convert this to a new dataframe, multiindexed by user and rate, which shows the per-category bincounts for each star rating. I'm currently doing this with loops:
multi_idx = pd.MultiIndex.from_product(
    [df_old.user.unique(), range(1, 4)],
    names=['user', 'rate']
)
df_new = pd.DataFrame(  # preallocate in an attempt to speed up the code
    {'cat1': np.nan, 'cat2': np.nan},
    index=multi_idx
)
df_new.sort_index(inplace=True)
idx = pd.IndexSlice
for uid in df_old.user.unique():
    for cat in ['cat1', 'cat2']:
        df_new.loc[idx[uid, :], cat] = np.bincount(
            df_old.loc[(df_old.user == uid) & (df_old[cat] == 1),
                       'rate'].values, minlength=4)[1:]
# cat1 cat2
# user rate
# 1 1 0.0 0.0
# 2 0.0 1.0
# 3 1.0 0.0
# 2 1 2.0 0.0
# 2 1.0 1.0
# 3 0.0 0.0
Unfortunately the above code is hopelessly slow on my real dataframe, which is long and contains many categories. How can I eliminate the loops please?
With your multi-index, you can aggregate your old data frame, and reindex it:
df_old.groupby(['user', 'rate']).sum().reindex(multi_idx).fillna(0)
Or, as #piRSquared commented, reindex and fill the missing values in one step:
df_old.groupby(['user', 'rate']).sum().reindex(multi_idx, fill_value=0)

Splitting a Dataframe at NaN row

There is already an answer that deals with a relatively simple dataframe, given here.
However, the dataframe I have at hand has multiple columns and a large number of rows. One dataframe contains three dataframes attached along axis=0 (the bottom end of one is attached to the top of the next). They are separated by a row of NaN values.
How can I create three dataframes out of this one by splitting it along the NaN rows?
Like in the answer you linked, you want to create a column which identifies the group number. Then you can apply the same solution.
To do so, you have to test whether every value in a row is NaN. I don't know if there is such a test built into pandas for rows, but pandas can test whether a whole Series is NaN; performing that check along axis=1 means your "Series" is actually your row:
df["group_no"] = df.isnull().all(axis=1).cumsum()
At that point you can use the same technique from that answer to split the dataframes.
You might want to do a .dropna() at the end, because you will still have the NaN rows in your result.
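A minimal sketch of that splitting step, on a small hypothetical frame with all-NaN separator rows:
import numpy as np
import pandas as pd

# Hypothetical frame: three blocks separated by all-NaN rows.
df = pd.DataFrame({'a': [1, 2, np.nan, 3, 4, np.nan, 5],
                   'b': [6, 7, np.nan, 8, 9, np.nan, 10]})

group_no = df.isnull().all(axis=1).cumsum()

# One sub-frame per block, dropping the all-NaN separator rows afterwards.
parts = [g.dropna(how='all') for _, g in df.groupby(group_no)]
for part in parts:
    print(part, end='\n\n')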
Ran into this same question in 2022. Here's what I did to split dataframes on rows with NaNs; the caveat is that it relies on pip install python-rle for run-length encoding:
import rle

def nanchucks(df):
    # It chucks NaNs outta dataframes
    # True if any value in the row is NaN
    df_nans = pd.isnull(df).sum(axis="columns").astype(bool)
    values, counts = rle.encode(df_nans)
    df_nans = pd.DataFrame({"values": values, "counts": counts})
    df_nans["cum_counts"] = df_nans["counts"].cumsum()
    df_nans["start_idx"] = df_nans["cum_counts"].shift(1)
    df_nans.loc[0, "start_idx"] = 0
    df_nans["start_idx"] = df_nans["start_idx"].astype(int)  # np.nan makes it a float column
    df_nans["end_idx"] = df_nans["cum_counts"] - 1
    # Only keep the chunks of data w/o NaNs
    df_nans = df_nans[df_nans["values"] == False]
    indices = []
    for idx, row in df_nans.iterrows():
        indices.append((row["start_idx"], row["end_idx"]))
    return [df.loc[df.index[i[0]]: df.index[i[1]]] for i in indices]
Examples:
sample_df1 = pd.DataFrame({
    "a": [1, 2, np.nan, 3, 4],
    "b": [1, 2, np.nan, 3, 4],
    "c": [1, 2, np.nan, 3, 4],
})
sample_df2 = pd.DataFrame({
    "a": [1, 2, np.nan, 3, 4],
    "b": [1, 2, 3, np.nan, 4],
    "c": [1, 2, np.nan, 3, 4],
})
print(nanchucks(sample_df1))
# [ a b c
# 0 1.0 1.0 1.0
# 1 2.0 2.0 2.0,
# a b c
# 3 3.0 3.0 3.0
# 4 4.0 4.0 4.0]
print(nanchucks(sample_df2))
# [ a b c
# 0 1.0 1.0 1.0
# 1 2.0 2.0 2.0,
# a b c
# 4 4.0 4.0 4.0]
