Count each observation as a row - python

I have a pandas df named df, with millions of observations (rows) and only 4 columns.
I'm trying to convert the event_type column into several columns, and add a count to each row for that column.
My df looks like this:
   event_type          event_time               organization_id  user_id
0  Applied Saved View  2018-11-22 10:59:57.360  3                0
And I'm looking for this:
   Applied_Saved_View  event_time               organization_id  user_id
0  1                   2018-11-22 10:59:57.360  3                0

I believe you are looking for pd.get_dummies. I assume you are trying to make this categorical data? I have no way of testing without sample data, but see the code below.
df2 = pd.get_dummies(df['event_type'])
new_df = pd.concat([df2,df],axis=1)
I should mention, you should check how many unique values there are in the event_type column, because each of those will become its own column, whether it's 10 or 100,000 unique values.
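For illustration, a minimal runnable sketch on made-up rows shaped like the question's frame (the sample values are assumptions, not the asker's real data):

import pandas as pd

# Hypothetical rows mimicking the question's frame
df = pd.DataFrame({
    'event_type': ['Applied Saved View', 'Exported Report', 'Applied Saved View'],
    'event_time': ['2018-11-22 10:59:57.360'] * 3,
    'organization_id': [3, 5, 3],
    'user_id': [0, 1, 2],
})

# dtype=int gives 0/1 columns instead of booleans on newer pandas versions
df2 = pd.get_dummies(df['event_type'], dtype=int)
new_df = pd.concat([df2, df], axis=1)
print(new_df)

Dropping the original column afterwards with new_df.drop(columns='event_type') would match the layout shown in the question.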

Related

How to change values within a column, based on a condition, in a DataFrame with multi-index

My current DF looks like:
                     column 1  column 2  column 3
user_id  date
5678     2022-01-01       0.0       1.5       0.0
6253     2022-01-14       0.0       NaN       2.0
My DF has a lot of rows, and I need to change the value of column 2 based on whether the user_id is in a particular set called 'users'.
I am using the following code but it doesn't seem to be working.
My code:
for idx, row in df.iterrows():
    if idx[0] in users:
        row['column 2'] = 0
When I checked against a particular user_id that exists within the 'users' set, it shows up as 'NaN'. Does this mean the code hasn't worked? I need all values of column 2 to be zero if the user_id exists in the users set.
Thank you in advance.
df.loc[df.index.get_level_values("user_id").isin(users), "column 2"] = 0
You don't need the loop! You can:
get hold of the user_id level values in the index,
check which of them are in the predefined users set,
use that boolean mask as the row indexer and the column of interest "column 2" as the column indexer,
and then .loc will do the setting.
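A runnable illustration of that one-liner (the frame and the users set below are made up to match the question's shape):

import numpy as np
import pandas as pd

# Hypothetical MultiIndex frame shaped like the question's
df = pd.DataFrame(
    {'column 1': [0.0, 0.0], 'column 2': [1.5, np.nan], 'column 3': [0.0, 2.0]},
    index=pd.MultiIndex.from_tuples(
        [(5678, '2022-01-01'), (6253, '2022-01-14')],
        names=['user_id', 'date'],
    ),
)
users = {5678}

mask = df.index.get_level_values('user_id').isin(users)  # boolean mask over rows
df.loc[mask, 'column 2'] = 0
print(df)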
Here's how I solved this:
for user in users:
    if user in df.index.get_level_values(level='user_id'):
        df['column 2'].loc[user, :] = 0
The loop checks every user. If they are in that index of the dataframe, it changes the value in column 2 for that user (.loc works here).
This might also work:
for user in users:
    if user in df.index.get_level_values(0):
        df['column 2'].loc[user, :] = 0

Pandas delete duplicate rows based on timestamp

I have a dataset with multiple duplicate records based on timestamps for the same date. I want to keep the record with the max timestamp and delete the other records for a given id and date combination.
Sample dataset
id|timestamp|value
--|---------|-----
1|2022-04-19T18:46:36.259+0000|xyz
1|2022-04-19T18:46:36.302+0000|xyz
1|2022-04-19T18:46:36.357+0000|xyz
1|2022-04-24T00:41:40.871+0000|xyz
1|2022-04-24T00:41:40.879+0000|xyz
1|2022-05-02T10:15:25.829+0000|xyz
1|2022-05-02T10:15:25.832+0000|xyz
Final Df
id|timestamp|value
--|---------|-----
1|2022-04-19T18:46:36.357+0000|xyz
1|2022-04-24T00:41:40.879+0000|xyz
1|2022-05-02T10:15:25.832+0000|xyz
If you add the data as code, it'll be easier to share the result. Since you already have the data, it's simpler to post it as code or text.
# To keep only the latest timestamp for each date:
# create a date-only field from the timestamp, for identifying the duplicates
# sort values so we have the latest timestamp for an id at the end
# drop duplicates based on id and date, keeping the last row
# finally drop the temp column
(df.assign(d=pd.to_datetime(df['timestamp']).dt.date)
.sort_values(['id','timestamp'])
.drop_duplicates(subset=['id','d'], keep='last')
.drop(columns='d')
)
id timestamp value
2 1 2022-04-19T18:46:36.357+0000 xyz
4 1 2022-04-24T00:41:40.879+0000 xyz
6 1 2022-05-02T10:15:25.832+0000 xyz
A combination of .groupby and .max will do:
import pandas as pd
dates = pd.to_datetime(['01-01-1990', '01-02-1990', '01-02-1990', '01-03-1990'])
values = [1] * len(dates)
ids = values[:]
df = pd.DataFrame(zip(dates, values, ids), columns=['timestamp', 'val', 'id'])
selection = df.groupby(['val', 'id'])['timestamp'].max().reset_index()
print(selection)
output
val id timestamp
0 1 1 1990-01-03
You can use the following code for your task.
df.groupby(["id","value"]).max()
Explanation: first group by the id and value columns, then select only the maximum.

filtering pandas dataframe when data contains two parts

I have a pandas dataframe and want to filter down to all the rows that contain a certain criteria in the “Title” column.
The rows I want to filter down to are all rows that contain the format “(Axx)” (where xx is two digits).
The data in the “Title” column doesn’t just consist of “(Axx)” data.
The data in the “Title” column looks like so:
“some_string (Axx)”
I've been playing around a bit with different methods but can't seem to get it to work.
I think the closest I've gotten is:
df.filter(regex=r'(D\d{2})', axis=0)
but it's not correct, as the entries aren't being filtered.
Use Series.str.contains with escaped parentheses and $ for end of string, then filter with boolean indexing:
df = pd.DataFrame({'Title':['(D89)','aaa (D71)','(D5)','(D78) aa','D72']})
print (df)
       Title
0      (D89)
1  aaa (D71)
2       (D5)
3   (D78) aa
4        D72

df1 = df[df['Title'].str.contains(r'\(D\d{2}\)$')]
print (df1)
       Title
0      (D89)
1  aaa (D71)
If you need to match only strings that are exactly (Dxx), use Series.str.match:
df2 = df[df['Title'].str.match(r'\(D\d{2}\)$')]
print (df2)
Title
0 (D89)

Grouping the dataset by cluster_id attribute

I'd like to group my dataframe by cluster id and print all instances in each group. My dataframe looks somewhat like this:
Chemical Name,cluster_id
XA323, 0
ZC4-D, 2
XA324, 0
YB1050, 1
ZC5-D, 2
YB1052, 1
I'd like it grouped by the cluster_id like
cluster_id
0    XA323
     XA324
1    YB1050
     YB1052
2    ZC4-D
     ZC5-D
NOTE: This is a dummy dataset; my original dataset has around 3000 instances, where the cluster_id distribution is roughly 0: 2700+, 1: 200+, and 2: the remainder.
Thank you.
Following the comments, you can group by cluster_id, use list as the aggregation function, and turn the result into a dict:
df.groupby("cluster_id").agg(list)["Chemical Name"].to_dict()

Pivot across multiple columns with repeating values in each column

I am trying to pivot a pandas dataframe, but the data is following a strange format that I cannot seem to pivot. The data is structured as below:
Date, Location, Action1, Quantity1, Action2, Quantity2, ... ActionN, QuantityN
<date> 1 Lights 10 CFloor 1 ... Null Null
<date2> 2 CFloor 2 CWalls 4 ... CBasement 15
<date3> 2 CWalls 7 CBasement 4 ... NUll Null
Essentially, each action will always have a quantity attached to it (which may be 0), but null actions will never have a quantity (the quantity will just be null). The format I am trying to achieve is the following:
Lights CFloor CBasement CWalls
1 10 1 0 0
2 0 2 19 11
The index of the rows becomes the location, while the columns become any unique action found across the multiple activity columns. When pulling the data together, the value of each row/column is the sum of each quantity associated with the action (i.e. Action1 corresponds to Quantity1). Is there a way to do this with the native pandas pivot function?
My current code performs a ravel across all the activity columns to get a list of all unique activities. It will also grab all the unique locations from the Location column. Once I have the unique columns, I create an empty dataframe and fill it with zeros:
Lights CFloor CBasement CWalls
1 0 0 0 0
2 0 0 0 0
I then iterate back over the old data frame with the itertuples() method (I was told it was significantly faster than iterrows()) and populate the new dataframe. This empty dataframe acts as a template that is stored in memory and filled later.
#Creates a template from the dataframe
def create_template(df):
    act_cols = ['Activity01', 'Activity02', 'Activity03', 'Activity04']
    activities = df[act_cols]
    flat_acts = activities.values.ravel('K')
    unique_locations = pd.unique(df['Location'])
    unique_acts = pd.unique(flat_acts)
    pivot_template = pd.DataFrame(index=unique_locations, columns=unique_acts).fillna(0)
    return pivot_template
#Fills the template from the dataframe
def create_pivot(df, pivot_frmt):
    act_cols = ['Activity01', 'Activity02', 'Activity03', 'Activity04']
    quant_cols = ['Quantity01', 'Quantity02', 'Quantity03', 'Quantity04']
    for row in df.itertuples():
        for act, quantity in zip(act_cols, quant_cols):
            act_val = getattr(row, act)
            if pd.notna(act_val):
                quantity_val = getattr(row, quantity)
                location = getattr(row, 'Location')
                pivot_frmt.loc[location, act_val] += quantity_val
    return pivot_frmt
While my solution works, it is incredibly slow when dealing with a large dataset and has taken 10 seconds or more to complete this type of operation. Any help would be greatly appreciated!
After experimenting with various pandas functions, such as melt and pivot on multiple columns simultaneously, I found a solution that worked for me:
For every quantity-activity pair, I build a partial frame of the final dataset and store it in a list. Once every pair has been addressed, I end up with multiple dataframes that all have the same row counts but potentially different column counts. I solve this by concatenating the frames along the columns and, if any columns are repeated, summing them to get the final result.
import numpy as np
import pandas as pd

def test_pivot(df):
    act_cols = ['Activity01', 'Activity02', 'Activity03', 'Activity04']
    quant_cols = ['Quantity01', 'Quantity02', 'Quantity03', 'Quantity04']
    dfs = []
    for act, quant in zip(act_cols, quant_cols):
        partial = pd.crosstab(index=df['Location'], columns=df[act],
                              values=df[quant], aggfunc=np.sum).fillna(0)
        dfs.append(partial)
    finalDf = pd.concat(dfs, axis=1)
    # sum any activity columns that appear in more than one partial frame
    finalDf = finalDf.groupby(finalDf.columns, axis=1).sum()
    return finalDf
There are two assumptions that I make during this approach:
The indexes maintain their order across all partial dataframes
There are an equivalent number of indexes across all partial dataframes
While this is probably not the most elegant solution, it achieves the desired result and reduced the processing time by a very significant margin (from about 10 s to about 0.2 s on ~4k rows). If anybody has a better way to deal with this type of scenario and do the process outlined above in one shot, then I would love to see your response!
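For reference, a hypothetical call on a tiny hand-built frame using the Activity/Quantity column names the function assumes (the values are invented for illustration):

import numpy as np
import pandas as pd

sample = pd.DataFrame({
    'Location':   [1, 2, 2],
    'Activity01': ['Lights', 'CFloor', 'CWalls'],
    'Quantity01': [10, 2, 7],
    'Activity02': ['CFloor', 'CWalls', 'CBasement'],
    'Quantity02': [1, 4, 4],
    'Activity03': [None, 'CBasement', None],
    'Quantity03': [np.nan, 15.0, np.nan],
    'Activity04': ['CWalls', None, None],
    'Quantity04': [0.0, np.nan, np.nan],
})

print(test_pivot(sample))  # one row per Location, one column per unique activity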
