I have a dataframe in the following format:

   time  parameter TimeDelta
0     1        123         -
1     2        456         1
2     4        122         2
3     7        344         3
4     8        344         1
How can I build an additional label column that increments whenever TimeDelta exceeds a threshold (e.g. 1.5), with each label carried forward to the following rows until TimeDelta exceeds the threshold again?
   time  parameter TimeDelta  Label
0     1        123         -      1
1     2        456         1      1
2     4        122         2      2
3     7        344         3      3
4     8        344         1      3
I do not want to loop over every row, which is extremely slow.
Maybe it is possible with cumsum() to flag all the following rows up to the next value above threshold?
You can convert the column to numeric, compare against the threshold, take the cumulative sum of the flags, add 1, and assign the result to a new column:
df['Label'] = pd.to_numeric(df['TimeDelta'], errors='coerce').gt(1.5).cumsum().add(1)
print(df)
time parameter TimeDelta Label
0 1 123 - 1
1 2 456 1 1
2 4 122 2 2
3 7 344 3 3
4 8 344 1 3
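For reference, a self-contained sketch of the whole approach on the question's data:

```python
import pandas as pd

# Rebuild the example frame; TimeDelta holds '-' for the first row,
# so the column is object dtype until coerced.
df = pd.DataFrame({
    'time': [1, 2, 4, 7, 8],
    'parameter': [123, 456, 122, 344, 344],
    'TimeDelta': ['-', 1, 2, 3, 1],
})

# Coerce '-' to NaN, flag rows above the threshold, and let cumsum
# carry each flag forward to all following rows.
delta = pd.to_numeric(df['TimeDelta'], errors='coerce')
df['Label'] = delta.gt(1.5).cumsum().add(1)
print(df['Label'].tolist())  # [1, 1, 2, 3, 3]
```

NaN compares as False in `gt`, so the '-' row simply never starts a new label.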
For the following dataframe
df = pd.DataFrame({'Rounds':[1000,1000,1000,1000,3000,3000,4000,5000,6000,6000]})
I would like logic such that, whenever a value has already appeared in previous rows, a fixed integer (in this case 25) is added once per earlier occurrence, creating:
df = pd.DataFrame({'Rounds':[1000,1025,1050,1075,3000,3025,4000,5000,6000,6025]})
Initially I tried
for i in df.index:
    if df.iat[i, 1] == df.iloc[i - 1, 1]:
        df.iat[i, 1] = df.iat[i - 1, 1] + 25
The problem is that it doesn't work for more than two equal values in a column, and I would like to refer to the column by its name "Rounds" instead of by its index.
You need groupby.cumcount:
df['Rounds'] += df.groupby('Rounds').cumcount()*25
output:
Rounds
0 1000
1 1025
2 1050
3 1075
4 3000
5 3025
6 4000
7 5000
8 6000
9 6025
intermediate:
df.groupby('Rounds').cumcount()
0 0
1 1
2 2
3 3
4 0
5 1
6 0
7 0
8 0
9 1
dtype: int64
Use groupby + cumcount:
df["Rounds"] += df.groupby(df["Rounds"]).cumcount() * 25
print(df)
Output
Rounds
0 1000
1 1025
2 1050
3 1075
4 3000
5 3025
6 4000
7 5000
8 6000
9 6025
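One caveat worth noting with either spelling: groupby groups by value over the whole column, not by consecutive runs, so a value that reappears after a gap keeps incrementing its counter. A minimal sketch (with made-up data) contrasting value-based grouping with run-based grouping:

```python
import pandas as pd

s = pd.Series([1000, 1000, 3000, 1000, 1000])

# Value-based grouping keeps counting across the gap:
by_value = s.groupby(s).cumcount()        # [0, 1, 0, 2, 3]

# Run-based grouping restarts whenever the value changes;
# ne/shift/cumsum builds an id that increments on every change:
runs = s.ne(s.shift()).cumsum()           # [1, 1, 2, 3, 3]
by_run = s.groupby(runs).cumcount()       # [0, 1, 0, 0, 1]

print((s + by_value * 25).tolist())  # [1000, 1025, 3000, 1050, 1075]
print((s + by_run * 25).tolist())    # [1000, 1025, 3000, 1000, 1025]
```

For the question's data the two are identical, since each value only occurs in one consecutive block.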
I have a data frame like this.
FOOD_ID Cumulative_addition
0 110 0
1 110 15
2 110 15
3 110 35
4 111 0
5 111 10
6 111 10
I want to add another column that gives the addition only, per FOOD_ID. The final data frame that I want looks like this:
FOOD_ID Cumulative_addition Addition_Only
0 110 0 0
1 110 15 15
2 110 15 0
3 110 35 20
4 111 0 0
5 111 10 10
6 111 10 0
I know how to do this in Excel using an IF statement, but I do not know how to do it in Python.
Try:
df['Addition_only'] = (df.groupby('FOOD_ID').Cumulative_addition.shift(-1) - df.Cumulative_addition).shift(1).fillna(0)
Detail
df.groupby('FOOD_ID').Cumulative_addition.shift(-1)
This groups the Cumulative_addition column by FOOD_ID and shifts it up by one row within each group.
Then you subtract the original column to get the difference, and shift the result back down by one row.
Hope that helps.
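As a side note, the same per-group difference can be spelled more directly with groupby(...).diff(), which subtracts each row from the previous row within its group; a sketch on the question's data:

```python
import pandas as pd

df = pd.DataFrame({
    'FOOD_ID': [110, 110, 110, 110, 111, 111, 111],
    'Cumulative_addition': [0, 15, 15, 35, 0, 10, 10],
})

# diff() within each FOOD_ID yields NaN for the first row of a group;
# fillna(0) matches the desired output.
df['Addition_Only'] = (df.groupby('FOOD_ID')['Cumulative_addition']
                         .diff()
                         .fillna(0))
print(df['Addition_Only'].tolist())
# [0.0, 15.0, 0.0, 20.0, 0.0, 10.0, 0.0]
```

Note the result is float because of the intermediate NaNs; cast with `.astype(int)` if integers are needed.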
I have the following dataframe in pandas:
code tank nozzle_1 nozzle_2 nozzle_var nozzle_sale
123 1 1 1 10 10
123 1 2 2 12 10
123 2 1 1 10 10
123 2 2 2 12 10
123 1 1 1 10 10
123 2 2 2 12 10
Now, I want to generate the cumulative sum of all the columns, grouping over tank, and take out the last observation. The nozzle_1 and nozzle_2 columns are dynamic; there could be nozzle_3, nozzle_4, ..., nozzle_n, etc. I am doing the following in pandas to get the cumsum:
## Calculate the cumsum of the dynamic nozzle_1, nozzle_2, ... columns
cols = df.columns[df.columns.str.contains(pat=r'nozzle_\d+$', regex=True)]
df = df.assign(**df.groupby('tank')[cols].cumsum().add_suffix('_cumsum'))
## nozzle_sale_cumsum is a static column
df['nozzle_sale_cumsum'] = df.groupby('tank')['nozzle_sale'].cumsum()
From the above code I get the following cumsum columns:
tank nozzle_1 nozzle_2 nozzle_var nozzle_1_cumsum nozzle_2_cumsum nozzle_sale_cumsum
1 1 1 10 1 1 10
1 2 2 12 3 3 20
2 1 1 10 1 1 10
2 2 2 12 3 3 20
1 1 1 10 4 4 30
2 2 2 12 5 5 30
Now, I want to get the last values of all 3 cumsum columns, grouping over tank. I can do it with the following code in pandas, but it hard-codes the column names:
final_df= df.groupby('tank').agg({'nozzle_1_cumsum':'last',
'nozzle_2_cumsum':'last',
'nozzle_sale_cumsum':'last',
}).reset_index()
The problem with the above code is that nozzle_1_cumsum and nozzle_2_cumsum are hard-coded, while in practice the columns are dynamic. How can I do this in pandas with dynamic columns?
How about:
df.filter(regex='_cumsum').groupby(df['tank']).last()
Output:
nozzle_1_cumsum nozzle_2_cumsum nozzle_sale_cumsum
tank
1 4 4 30
2 5 5 30
You can also replace df.filter(...) by, e.g., df.iloc[:,-3:] or df[col_names].
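To make the pattern concrete, here is a minimal self-contained sketch on a hypothetical frame with the cumsum columns already computed:

```python
import pandas as pd

# Minimal frame matching the question's intermediate output.
df = pd.DataFrame({
    'tank': [1, 1, 2, 2, 1, 2],
    'nozzle_1_cumsum': [1, 3, 1, 3, 4, 5],
    'nozzle_2_cumsum': [1, 3, 1, 3, 4, 5],
    'nozzle_sale_cumsum': [10, 20, 10, 20, 30, 30],
})

# filter(regex=...) selects the dynamic *_cumsum columns by name, and
# grouping by the 'tank' Series (not the column label) works even
# though 'tank' itself was filtered out.
final_df = (df.filter(regex='_cumsum')
              .groupby(df['tank'])
              .last()
              .reset_index())
print(final_df)
# one row per tank with the last cumsum values: (4, 4, 30) and (5, 5, 30)
```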
I've got a date-ordered dataframe that can be grouped. What I am attempting to do is group by a variable (Person), determine the maximum weight for each person, and then drop all rows whose date comes after that maximum.
Here's an example of the data:
df = pd.DataFrame({'Person': [1,1,1,1,1,2,2,2,2,2],
                   'Date': ['1/1/2015','2/1/2015','3/1/2015','4/1/2015','5/1/2015',
                            '6/1/2011','7/1/2011','8/1/2011','9/1/2011','10/1/2011'],
                   'MonthNo': [1,2,3,4,5,1,2,3,4,5],
                   'Weight': [100,110,115,112,108,205,210,211,215,206]})
Date MonthNo Person Weight
0 1/1/2015 1 1 100
1 2/1/2015 2 1 110
2 3/1/2015 3 1 115
3 4/1/2015 4 1 112
4 5/1/2015 5 1 108
5 6/1/2011 1 2 205
6 7/1/2011 2 2 210
7 8/1/2011 3 2 211
8 9/1/2011 4 2 215
9 10/1/2011 5 2 206
Here's what I want the result to look like:
Date MonthNo Person Weight
0 1/1/2015 1 1 100
1 2/1/2015 2 1 110
2 3/1/2015 3 1 115
5 6/1/2011 1 2 205
6 7/1/2011 2 2 210
7 8/1/2011 3 2 211
8 9/1/2011 4 2 215
I think it's worth noting that the start dates can be disjoint and the maximum may appear at a different time in each group.
My idea was to find the maximum for each group, obtain the MonthNo that maximum falls in, and then discard any rows with a MonthNo greater than the max-weight MonthNo. So far I've been able to obtain the max by group, but cannot get past doing a comparison based on it.
Please let me know if I can edit or provide more information; I haven't posted many questions here! Thanks for the help, and sorry if my formatting or question isn't clear.
Using idxmax with groupby
df.groupby('Person', sort=False).apply(
    lambda x: x.reset_index(drop=True)
               .iloc[:x.reset_index(drop=True).Weight.idxmax() + 1])
Out[131]:
Date MonthNo Person Weight
Person
1 0 1/1/2015 1 1 100
1 2/1/2015 2 1 110
2 3/1/2015 3 1 115
2 0 6/1/2011 1 2 205
1 7/1/2011 2 2 210
2 8/1/2011 3 2 211
3 9/1/2011 4 2 215
You can use groupby.transform with idxmax. The first 2 steps may not be necessary depending on how your dataframe is structured.
# convert Date to datetime
df['Date'] = pd.to_datetime(df['Date'])
# sort by Person and Date to make index usable for next step
df = df.sort_values(['Person', 'Date']).reset_index(drop=True)
# filter for index less than idxmax transformed by group
df = df[df.index <= df.groupby('Person')['Weight'].transform('idxmax')]
print(df)
Date MonthNo Person Weight
0 2015-01-01 1 1 100
1 2015-02-01 2 1 110
2 2015-03-01 3 1 115
5 2011-06-01 1 2 205
6 2011-07-01 2 2 210
7 2011-08-01 3 2 211
8 2011-09-01 4 2 215
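Another spelling that avoids relying on positional index alignment runs cummax backwards through each group, keeping a row while the group's maximum still lies at or ahead of it. This is only a sketch, and it assumes each person's maximum weight is unique:

```python
import pandas as pd

df = pd.DataFrame({
    'Person': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
    'MonthNo': [1, 2, 3, 4, 5, 1, 2, 3, 4, 5],
    'Weight': [100, 110, 115, 112, 108, 205, 210, 211, 215, 206],
})

# Flag each group's maximum, then cummax over the reversed rows:
# a row stays True as long as the max is at or after it.
is_max = df['Weight'].eq(df.groupby('Person')['Weight'].transform('max'))
keep = is_max[::-1].groupby(df['Person'][::-1]).cummax()[::-1]
print(df[keep])
```

Because the mask is computed per group, this also works when the groups are interleaved rather than contiguous.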
In Python, given a list of ratings as:
import pandas as pd
path = 'ratings_ml100k.csv'
data = pd.read_csv(path,sep= ',')
print(data)
user_id item_id rating
28422 100 690 4
32020 441 751 4
15819 145 265 5
where the items are:
print(itemsTrain)
[ 690 751 265 ..., 1650 1447 1507]
For each item, I would like to compute the number of ratings. Is there any way to do this without resorting to a loop? All ideas are appreciated.
data is a pandas dataframe. The desired output should look like this:
pop =
item_id rating_count
690 120
751 10
265 159
... ...
Note that itemsTrain contains the unique item_ids in the ratings dataset data.
You can do it this way:
In [200]: df = pd.DataFrame(np.random.randint(0,8,(15,2)),columns=['id', 'rating'])
In [201]: df
Out[201]:
id rating
0 4 6
1 0 1
2 2 4
3 2 5
4 2 7
5 3 5
6 6 1
7 4 3
8 4 3
9 3 2
10 2 4
11 7 7
12 3 1
13 2 7
14 7 3
In [202]: df.groupby('id').rating.count()
Out[202]:
id
0 1
2 5
3 3
4 3
6 1
7 2
Name: rating, dtype: int64
if you want to have result as a DF (you can also name the count column as you wish):
In [206]: df.groupby('id').rating.count().to_frame('count').reset_index()
Out[206]:
id count
0 0 1
1 2 5
2 3 3
3 4 3
4 6 1
5 7 2
You can also count the number of unique ratings:
In [203]: df.groupby('id').rating.nunique()
Out[203]:
id
0 1
2 3
3 3
4 2
6 1
7 2
Name: rating, dtype: int64
You can use the method df.groupby() to group rows by item_id and then use the method count() to count the ratings in each group.
Do as follows:
# df is your dataframe
# groupby('item_id') groups the samples by the feature "item_id";
# .rating selects the feature to aggregate, and count() counts its non-null values
df.groupby('item_id').rating.count()
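An equivalent, arguably more direct spelling uses value_counts, which returns the counts sorted in descending order; a sketch on hypothetical sample data in the same shape:

```python
import pandas as pd

# Hypothetical sample in the shape of the ratings data.
data = pd.DataFrame({
    'item_id': [690, 751, 265, 690, 265, 265],
    'rating': [4, 4, 5, 3, 5, 2],
})

# value_counts counts occurrences of each item_id; rename_axis and
# reset_index reshape it into the two-column frame from the question.
pop = (data['item_id'].value_counts()
         .rename_axis('item_id')
         .reset_index(name='rating_count'))
print(pop)
# 265 appears 3 times, 690 twice, 751 once
```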