Pandas dataframe tweak - python

I have some data as follows:
+-----+-------+------+----------+
| Sys | Event | Code | Duration |
+-----+-------+------+----------+
|     | 1     | 65   |   355.52 |
|     | 1     | 66   |    18.78 |
|     | 1     | 66   |   223.42 |
|     | 1     | 66   |   392.17 |
|     | 2     | 66   |   449.03 |
|     | 2     | 66   |   506.03 |
|     | 2     | 66   |    73.93 |
|     | 3     | 66   |   123.17 |
|     | 3     | 66   |    97.85 |
+-----+-------+------+----------+
Now, for each Code, I want to sum the Durations per Event (all rows with Event = 1, all with Event = 2, and so on), regardless of Sys. How do I approach this?

As DYZ says:
df.groupby(['Code', 'Event']).Duration.sum()
Output:
Code  Event
65    1         355.52
66    1         634.37
      2        1028.99
      3         221.02
Name: Duration, dtype: float64
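For completeness, here is a minimal self-contained sketch of the same groupby (the sample data below is an abbreviated, made-up version of the table above); unstack() is added only to make the per-Event sums easier to read:

import pandas as pd

df = pd.DataFrame({
    'Sys': [''] * 4,
    'Event': [1, 1, 2, 3],
    'Code': [65, 66, 66, 66],
    'Duration': [355.52, 18.78, 449.03, 123.17],
})

# Sum Duration per (Code, Event); Sys is simply not part of the grouping key
totals = df.groupby(['Code', 'Event'])['Duration'].sum()
print(totals)

# Optional: pivot Event into columns for readability
print(totals.unstack('Event'))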

Related

Left join table without duplicating right table row values

I have 2 dataframes as below. I want to join the right table (cycle time) to the left table (current data).
Left Table- Current data (df_current)
| datetime_index | current | speed | cycle_counter |
|--------------------------|---------|-------|---------------|
| 27-10-2022 08:30:56.3056 | 30 | 60 | 1 |
| 27-10-2022 08:30:58.3058 | 30 | 60 | 1 |
| 27-10-2022 08:30:59.3059 | 31 | 62 | 1 |
| 27-10-2022 08:30:59.3059 | 30 | 60 | 1 |
| 27-10-2022 08:31:00.310 | 30.5 | 61 | 2 |
| 27-10-2022 08:31:01.311 | 30 | 60 | 2 |
| 27-10-2022 08:31:02.312 | 31 | 61 | 2 |
| 27-10-2022 08:31:02.312 | 30 | 60 | 3 |
| 27-10-2022 08:31:03.313 | 31 | 62 | 3 |
| 27-10-2022 08:31:04.314 | 30 | 60 | 3 |
Right Table- Cycletime data (df_cycletime)
| cycle_counter | total_time | up_time |
|---------------|------------|---------|
| 1 | 20 | 6 |
| 2 | 22 | 7 |
| 3 | 24 | 5 |
Code:
I used the code below:
df = df_current.reset_index().merge(df_cycletime, how='left', on='cycle_counter').set_index('datetime_index')
What I get
| datetime_index | current | speed | cycle_counter | total_time | up_time |
|--------------------------|---------|-------|---------------|------------|---------|
| 27-10-2022 08:30:56.3056 | 30 | 60 | 1 | 20 | 6 |
| 27-10-2022 08:30:58.3058 | 30 | 60 | 1 | 20 | 6 |
| 27-10-2022 08:30:59.3059 | 31 | 62 | 1 | 20 | 6 |
| 27-10-2022 08:30:59.3059 | 30 | 60 | 1 | 20 | 6 |
| 27-10-2022 08:31:00.310 | 30.5 | 61 | 2 | 22 | 7 |
| 27-10-2022 08:31:01.311 | 30 | 60 | 2 | 22 | 7 |
| 27-10-2022 08:31:02.312 | 31 | 61 | 2 | 22 | 7 |
| 27-10-2022 08:31:02.312 | 30 | 60 | 3 | 24 | 5 |
| 27-10-2022 08:31:03.313 | 31 | 62 | 3 | 24 | 5 |
| 27-10-2022 08:31:04.314 | 30 | 60 | 3 | 24 | 5 |
Requirement: I don't want 'total_time' and 'up_time' to repeat; they should appear only once per cycle_counter:
| datetime_index | current | speed | cycle_counter | total_time | up_time |
|--------------------------|---------|-------|---------------|------------|---------|
| 27-10-2022 08:30:56.3056 | 30 | 60 | 1 | 20 | 6 |
| 27-10-2022 08:30:58.3058 | 30 | 60 | 1 | | |
| 27-10-2022 08:30:59.3059 | 31 | 62 | 1 | | |
| 27-10-2022 08:30:59.3059 | 30 | 60 | 1 | | |
| 27-10-2022 08:31:00.310 | 30.5 | 61 | 2 | 22 | 7 |
| 27-10-2022 08:31:01.311 | 30 | 60 | 2 | | |
| 27-10-2022 08:31:02.312 | 31 | 61 | 2 | | |
| 27-10-2022 08:31:02.312 | 30 | 60 | 3 | 24 | 5 |
| 27-10-2022 08:31:03.313 | 31 | 62 | 3 | | |
| 27-10-2022 08:31:04.314 | 30 | 60 | 3 | | |
Find the duplicates in the total_time and up_time columns within each cycle_counter group and replace them with an empty string (""). This will work for any amount of data.
df.loc[df.duplicated(['cycle_counter','total_time', 'up_time']), ['total_time','up_time']] = ""
print(df)
   cycle_counter total_time up_time
0              1         20       6
1              1
2              1
3              1
4              2         22       7
5              2
6              2
7              3         24       5
8              3
9              3
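The same idea as a minimal, self-contained sketch (frame names follow the question; the sample data is abbreviated, and the duplicate test uses cycle_counter alone, which is enough here because total_time and up_time are constant within a cycle):

import pandas as pd

df_current = pd.DataFrame({
    'datetime_index': pd.to_datetime(['2022-10-27 08:30:56', '2022-10-27 08:30:58',
                                      '2022-10-27 08:31:00', '2022-10-27 08:31:01']),
    'current': [30, 30, 30.5, 30],
    'speed': [60, 60, 61, 60],
    'cycle_counter': [1, 1, 2, 2],
}).set_index('datetime_index')

df_cycletime = pd.DataFrame({'cycle_counter': [1, 2],
                             'total_time': [20, 22],
                             'up_time': [6, 7]})

# Left-join the per-cycle totals onto every row of the current data
df = (df_current.reset_index()
      .merge(df_cycletime, how='left', on='cycle_counter')
      .set_index('datetime_index'))

# Blank out repeated totals: keep them only on the first row of each cycle
df.loc[df['cycle_counter'].duplicated(), ['total_time', 'up_time']] = ""
print(df)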

How to check if next 3 consecutive rows in pandas column have same value?

I have a pandas dataframe with 3 columns - id, date and value.
| id | date | value |
| --- | --- | --- |
| 1001 | 1-04-2021 | 61 |
| 1001 | 3-04-2021 | 61 |
| 1001 | 10-04-2021 | 61 |
| 1002 | 11-04-2021 | 13 |
| 1002 | 12-04-2021 | 12 |
| 1015 | 18-04-2021 | 42 |
| 1015 | 20-04-2021 | 42 |
| 1015 | 21-04-2021 | 43 |
| 2001 | 8-04-2021 | 27 |
| 2001 | 11-04-2021 | 27 |
| 2001 | 12-04-2021 | 27 |
| 2001 | 27-04-2021 | 27 |
| 2001 | 29-04-2021 | 27 |
For each id, I want to find where 3 or more consecutive rows have the same value in the value column. Once such a run of 3 or more identical consecutive values is identified, flag those rows as 1 in a separate column, otherwise 0.
So the final dataframe would look like the following,
| id | date | value | pattern
| --- | --- | --- | --- |
| 1001 | 1-04-2021 | 61 | 1 |
| 1001 | 3-04-2021 | 61 | 1 |
| 1001 | 10-04-2021 | 61 | 1 |
| 1002 | 11-04-2021 | 13 | 0 |
| 1002 | 12-04-2021 | 12 | 0 |
| 1015 | 18-04-2021 | 42 | 0 |
| 1015 | 20-04-2021 | 42 | 0 |
| 1015 | 21-04-2021 | 43 | 0 |
| 2001 | 8-04-2021 | 27 | 1 |
| 2001 | 11-04-2021 | 27 | 1 |
| 2001 | 12-04-2021 | 27 | 1 |
| 2001 | 27-04-2021 | 27 | 1 |
| 2001 | 29-04-2021 | 27 | 1 |
Try with groupby:
df['pattern'] = (
    df.groupby(['id', df['value'].diff().ne(0).cumsum()])['id']
      .transform('size')
      .ge(3)
      .astype(int)
)
How about this:
def f(x):
    # x is the series of row-to-row differences of 'value';
    # the leading NaN is treated as 0 (i.e. "no change")
    x = x.fillna(0)
    y = len(x) * [0]
    for i in range(len(x) - 3):
        # the next two differences are both 0 -> three equal values in a row
        if x[i + 1] == 0 and x[i + 2] == 0:
            y[i] = 1
            y[i + 1] = 1
            y[i + 2] = 1
    # check the last window of three rows separately
    if x[len(x) - 1] == 0 and x[len(x) - 2] == 0 and x[len(x) - 3] == 0:
        y[len(x) - 1] = 1
    return pd.Series(y)

df['value'].diff().transform(f)
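For reference, a small runnable sketch of the groupby answer above (sample data abbreviated from the question): the cumulative sum of value.diff().ne(0) labels each run of consecutive equal values, and the run length per id drives the flag:

import pandas as pd

df = pd.DataFrame({
    'id':    [1001, 1001, 1001, 1002, 1002, 1015, 1015, 1015],
    'date':  ['1-04-2021', '3-04-2021', '10-04-2021', '11-04-2021',
              '12-04-2021', '18-04-2021', '20-04-2021', '21-04-2021'],
    'value': [61, 61, 61, 13, 12, 42, 42, 43],
})

# Label each run of consecutive equal values, then measure the run length per id;
# rows belonging to a run of 3 or more are flagged with 1
run_id = df['value'].diff().ne(0).cumsum()
df['pattern'] = (
    df.groupby(['id', run_id])['value']
      .transform('size')
      .ge(3)
      .astype(int)
)
print(df)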

Update table information based on columns of another table

I am new to Python and have two dataframes: df1 contains information about all students with their group and score, and df2 contains updated information for the few students who changed their group and score. How can I update the information in df1 based on the values of df2 (group and score)?
df1
+----+----------+-----------+----------------+
| |student No| group | score |
|----+----------+-----------+----------------|
| 0 | 0 | 0 | 0.839626 |
| 1 | 1 | 0 | 0.845435 |
| 2 | 2 | 3 | 0.830778 |
| 3 | 3 | 2 | 0.831565 |
| 4 | 4 | 3 | 0.823569 |
| 5 | 5 | 0 | 0.808109 |
| 6 | 6 | 4 | 0.831645 |
| 7 | 7 | 1 | 0.851048 |
| 8 | 8 | 3 | 0.843209 |
| 9 | 9 | 4 | 0.84902 |
| 10 | 10 | 0 | 0.835143 |
| 11 | 11 | 4 | 0.843228 |
| 12 | 12 | 2 | 0.826949 |
| 13 | 13 | 0 | 0.84196 |
| 14 | 14 | 1 | 0.821634 |
| 15 | 15 | 3 | 0.840702 |
| 16 | 16 | 0 | 0.828994 |
| 17 | 17 | 2 | 0.843043 |
| 18 | 18 | 4 | 0.809093 |
| 19 | 19 | 1 | 0.85426 |
+----+----------+-----------+----------------+
df2
+----+-----------+----------+----------------+
| | group |student No| score |
|----+-----------+----------+----------------|
| 0 | 2 | 1 | 0.887435 |
| 1 | 0 | 19 | 0.81214 |
| 2 | 3 | 17 | 0.899041 |
| 3 | 0 | 8 | 0.853333 |
| 4 | 4 | 9 | 0.88512 |
+----+-----------+----------+----------------+
The result (df3):
+----+----------+-----------+----------------+
| |student No| group | score |
|----+----------+-----------+----------------|
| 0 | 0 | 0 | 0.839626 |
| 1 | 1 | 2 | 0.887435 |
| 2 | 2 | 3 | 0.830778 |
| 3 | 3 | 2 | 0.831565 |
| 4 | 4 | 3 | 0.823569 |
| 5 | 5 | 0 | 0.808109 |
| 6 | 6 | 4 | 0.831645 |
| 7 | 7 | 1 | 0.851048 |
| 8 | 8 | 0 | 0.853333 |
| 9 | 9 | 4 | 0.88512 |
| 10 | 10 | 0 | 0.835143 |
| 11 | 11 | 4 | 0.843228 |
| 12 | 12 | 2 | 0.826949 |
| 13 | 13 | 0 | 0.84196 |
| 14 | 14 | 1 | 0.821634 |
| 15 | 15 | 3 | 0.840702 |
| 16 | 16 | 0 | 0.828994 |
| 17 | 17 | 3 | 0.899041 |
| 18 | 18 | 4 | 0.809093 |
| 19 | 19 | 0 | 0.81214 |
+----+----------+-----------+----------------+
My code to update df1 from df2:
dfupdated = df1.merge(df2, how='left', on=['student No'], suffixes=('', '_new'))
dfupdated['group'] = np.where(pd.notnull(dfupdated['group_new']),
                              dfupdated['group_new'], dfupdated['group'])
dfupdated['score'] = np.where(pd.notnull(dfupdated['score_new']),
                              dfupdated['score_new'], dfupdated['score'])
dfupdated.drop(['group_new', 'score_new'], axis=1, inplace=True)
dfupdated.reset_index(drop=True, inplace=True)
but I get the following error:
KeyError: "['group'] not in index"
I don't know what's wrong.

I ran the same code and got the expected answer, but here is a different way to solve it. Try:
dfupdated = df1.merge(df2, on='student No', how='left')
dfupdated['group'] = dfupdated['group_y'].fillna(dfupdated['group_x'])
dfupdated['score'] = dfupdated['score_y'].fillna(dfupdated['score_x'])
dfupdated.drop(['group_x', 'group_y','score_x', 'score_y'], axis=1,inplace=True)
This will give you the solution you want.
To get the max score from each group:
dfupdated.groupby(['group'], sort=False)['score'].max()
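Another common pattern for this kind of overwrite is DataFrame.update, which aligns on the index; a minimal sketch with abbreviated, made-up data (the set_index/reset_index steps make student No the alignment key):

import pandas as pd

df1 = pd.DataFrame({'student No': [0, 1, 2, 3],
                    'group': [0, 0, 3, 2],
                    'score': [0.8396, 0.8454, 0.8308, 0.8316]})
df2 = pd.DataFrame({'group': [2],
                    'student No': [1],
                    'score': [0.8874]})

# Align both frames on 'student No' and overwrite df1's values in place
# wherever df2 has a (non-NaN) entry
df1 = df1.set_index('student No')
df1.update(df2.set_index('student No'))
df1 = df1.reset_index()
print(df1)

Note that update() may upcast integer columns (like group) to float, because the aligned frame contains NaN for the rows it does not cover.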

How to select all rows from a Dask dataframe with value equal to minimal value of group

So I have the following Dask dataframe, grouped by the Problem column.
| Problem | Items | Min_Dimension | Max_Dimension | Cost |
|-------- |------ |---------------|-------------- |------ |
| A | 7 | 2 | 15 | 23 |
| A | 5 | 2 | 15 | 38 |
| A | 15 | 2 | 15 | 23 |
| B | 11 | 6 | 10 | 54 |
| B | 10 | 6 | 10 | 48 |
| B | 18 | 6 | 10 | 79 |
| C | 50 | 8 | 25 | 120 |
| C | 50 | 8 | 25 | 68 |
| C | 48 | 8 | 25 | 68 |
| ... | ... | ... | ... | ... |
The goal is to create a new dataframe with all rows where the Cost value is minimal for that particular Problem group. So we want the following result:
| Problem | Items | Min_Dimension | Max_Dimension | Cost |
|-------- |------ |---------------|-------------- |------ |
| A | 7 | 2 | 15 | 23 |
| A | 15 | 2 | 15 | 23 |
| B | 10 | 6 | 10 | 48 |
| C | 50 | 8 | 25 | 68 |
| C | 48 | 8 | 25 | 68 |
| ... | ... | ... | ... | ... |
How can I achieve this result? I already tried using idxmin() as mentioned in another question on here, but then I get ValueError: Not all divisions are known, can't align partitions. Please use set_index to set the index.
What if you create another dataframe that is grouped by Problem with the minimum Cost? Let's say the new column is called cost_min.
df1 = df.groupby('Problem')['Cost'].min().reset_index().rename(columns={'Cost': 'cost_min'})
Then, merge this new cost_min column back onto the dataframe.
df2 = df.merge(df1, how='left', on='Problem')
From there, do something like:
df_new = df2.loc[df2['Cost'] == df2['cost_min']]
This is just pseudocode, but I think it all works with Dask.
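If you want a concrete Dask version of that recipe, here is a minimal sketch with made-up sample data (column names follow the question; the partition count is arbitrary):

import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({
    'Problem': ['A', 'A', 'A', 'B', 'B', 'B'],
    'Items': [7, 5, 15, 11, 10, 18],
    'Cost': [23, 38, 23, 54, 48, 79],
})
ddf = dd.from_pandas(pdf, npartitions=2)

# Per-Problem minimum cost, renamed so the merge does not collide with 'Cost'
mins = (ddf.groupby('Problem')['Cost'].min()
           .reset_index()
           .rename(columns={'Cost': 'cost_min'}))

# Broadcast the minimum back onto every row, then keep the rows that match it
merged = ddf.merge(mins, how='left', on='Problem')
result = merged[merged['Cost'] == merged['cost_min']].drop(columns='cost_min')

print(result.compute())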

Hot Deck Imputation in Python

I have been trying to find Python code that would allow me to replace missing values in a dataframe's column. The focus of my analysis is in biostatistics so I am not comfortable with replacing values using means/medians/modes. I would like to apply the "Hot Deck Imputation" method.
I cannot find any Python functions or packages online that take a dataframe column and fill the missing values with the "Hot Deck Imputation" method.
I did, however, see this GitHub project and did not find it useful.
The following is an example of some of my data (assume this is a pandas dataframe):
| age | sex | bmi | anesthesia score | pain level |
|-----|-----|------|------------------|------------|
| 78 | 1 | 40.7 | 3 | 0 |
| 55 | 1 | 25.3 | 3 | 0 |
| 52 | 0 | 25.4 | 3 | 0 |
| 77 | 1 | 44.9 | 3 | 3 |
| 71 | 1 | 26.3 | 3 | 0 |
| 39 | 0 | 28.2 | 2 | 0 |
| 82 | 1 | 27 | 2 | 1 |
| 70 | 1 | 37.9 | 3 | 0 |
| 71 | 1 | NA | 3 | 1 |
| 53 | 0 | 24.5 | 2 | NA |
| 68 | 0 | 34.7 | 3 | 0 |
| 57 | 0 | 30.7 | 2 | 0 |
| 40 | 1 | 22.4 | 2 | 0 |
| 73 | 1 | 34.2 | 2 | 0 |
| 66 | 1 | NA | 3 | 1 |
| 55 | 1 | 42.6 | NA | NA |
| 53 | 0 | 37.5 | 3 | 3 |
| 65 | 0 | 31.6 | 2 | 2 |
| 36 | 0 | 29.6 | 1 | 0 |
| 60 | 0 | 25.7 | 2 | NA |
| 70 | 1 | 30 | NA | NA |
| 66 | 1 | 28.3 | 2 | 0 |
| 63 | 1 | 29.4 | 3 | 2 |
| 70 | 1 | 36 | 3 | 2 |
I would like to apply a Python function that would allow me to input a column as a parameter and return the column with the missing values replaced with imputed values using the "Hot Deck Imputation" method.
I am using this for the purpose of statistical modeling with models such as linear and logistic regression using Statsmodels.api. I am not using this for Machine Learning.
Any help would be much appreciated!
You can use ffill, which carries the last observation forward (LOCF), a form of hot deck imputation.
#...
df.fillna(method='ffill', inplace=True)
Scikit-learn's impute module offers KNN, mean, median, most-frequent and other imputing methods. (https://scikit-learn.org/stable/modules/impute.html)
# sklearn '>=0.22.x'
from sklearn.impute import KNNImputer
imputer = KNNImputer(n_neighbors=2, weights="uniform")
DF['imputed_x'] = imputer.fit_transform(DF[['bmi']]).ravel()  # ravel() flattens the (n, 1) output to 1-D
print(DF['imputed_x'])
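If you specifically want random hot deck imputation rather than LOCF or KNN, there is no ready-made pandas/scikit-learn function for it, but a simple version is easy to hand-roll. The sketch below (the hot_deck_impute helper is my own illustration, not an established API) fills each missing value by sampling a donor from the observed values of the same column:

import numpy as np
import pandas as pd

def hot_deck_impute(series, random_state=None):
    """Fill missing values by sampling (with replacement) from the
    observed values of the same column -- simple random hot deck."""
    rng = np.random.default_rng(random_state)
    out = series.copy()
    missing = out.isna()
    donors = out[~missing].to_numpy()
    if len(donors) == 0:
        return out  # nothing to draw from
    out.loc[missing] = rng.choice(donors, size=missing.sum(), replace=True)
    return out

df = pd.DataFrame({'bmi': [40.7, 25.3, np.nan, 44.9, np.nan, 28.2]})
df['bmi_imputed'] = hot_deck_impute(df['bmi'], random_state=0)
print(df)

A more refined hot deck restricts donors to similar records, e.g. df.groupby('anesthesia score')['bmi'].transform(hot_deck_impute) to draw donors only from patients with the same anesthesia score.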
