I have a Pandas DataFrame that looks like this:
index  ID  value_1  value_2
    0   1    200      126
    1   1    200      127
    2   1    200      128.1
    3   1    200      125.7
    4   2    300.1     85
    5   2    289.4      0
    6   2      0       76.9
    7   2    199.7      0
My aim is to find, within each ID group (1 and 2 in this example), the rows which have the maximum value in the value_1 column. The second condition is that, if there are multiple maximum values per group, the row with the maximum value in column value_2 should be taken.
So the target table should look like this:
index  ID  value_1  value_2
    0   1    200      128.1
    1   2    300.1     85
Use DataFrame.sort_values by all 3 columns and then DataFrame.drop_duplicates:
df1 = (df.sort_values(['ID', 'value_1', 'value_2'], ascending=[True, False, False])
         .drop_duplicates('ID'))
print (df1)
ID value_1 value_2
2 1 200.0 128.1
4 2 300.1 85.0
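A minimal, self-contained version of this approach, reconstructing the example frame from the question, might look like this (the frame construction is my own):
import pandas as pd

# sample data from the question
df = pd.DataFrame({'ID':      [1, 1, 1, 1, 2, 2, 2, 2],
                   'value_1': [200, 200, 200, 200, 300.1, 289.4, 0, 199.7],
                   'value_2': [126, 127, 128.1, 125.7, 85, 0, 76.9, 0]})

# sort so the best row per ID comes first, then keep only that first row
df1 = (df.sort_values(['ID', 'value_1', 'value_2'], ascending=[True, False, False])
         .drop_duplicates('ID'))
print(df1)   # rows 2 and 4, as in the output above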
I have the below data frame:
a
100
200
200
b
20
30
40
c
400
50
I need help calculating the sum of values for each item and placing it in a 2nd column, which ideally should look like below:
a 500
100
200
200
b 90
20
30
40
c 450
400
50
If you need sums per group, convert column col to numeric and use GroupBy.transform, with the non-numeric group labels repeated downwards by ffill:
s = pd.to_numeric(df['col'], errors='coerce')   # numbers where possible, NaN on the letter rows
mask = s.isna()                                 # True on the letter (group label) rows
# label every numeric row with its preceding letter, then broadcast the group sum
df.loc[mask, 'new'] = s.groupby(df['col'].where(mask).ffill()).transform('sum')
print (df)
col new
0 a 500.0
1 100 NaN
2 200 NaN
3 200 NaN
4 b 90.0
5 20 NaN
6 30 NaN
7 40 NaN
8 c 450.0
9 400 NaN
10 50 NaN
Or, if you prefer empty strings instead of NaN on the numeric rows (this needs import numpy as np and the transformed sums stored in a variable first):
new = s.groupby(df['col'].where(mask).ffill()).transform('sum')
df['new'] = np.where(mask, new.astype(int), '')
print (df)
col new
0 a 500
1 100
2 200
3 200
4 b 90
5 20
6 30
7 40
8 c 450
9 400
10 50
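Putting both variants together, a self-contained sketch with the imports and the intermediate new Series spelled out could look like this (column name col as in the answer above):
import numpy as np
import pandas as pd

df = pd.DataFrame({'col': ['a', 100, 200, 200, 'b', 20, 30, 40, 'c', 400, 50]})

s = pd.to_numeric(df['col'], errors='coerce')      # numbers, NaN on the letter rows
mask = s.isna()                                    # True on the letter (label) rows
labels = df['col'].where(mask).ffill()             # label each number by its preceding letter
new = s.groupby(labels).transform('sum')           # group sum broadcast to every row

df.loc[mask, 'new'] = new                          # first variant: sums on the label rows, NaN elsewhere
# df['new'] = np.where(mask, new.astype(int), '')  # second variant: blanks instead of NaN
print(df)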
For the DataFrame below, I need to create a new column 'unit_count' which is 'unit'/'count' for each year and month. However, because each year and month is not unique, for each entry, I only want to use the count for a given month from the B option.
key UID count month option unit year
0 1 100 1 A 10 2015
1 1 200 1 B 20 2015
2 1 300 2 A 30 2015
3 1 400 2 B 40 2015
Essentially, I need a function that does the following:
unit_count = df.unit / df.count
for the value of unit, but using only the 'count' value of option 'B' in that given 'month'.
So the end result would look like the table below, where unit_count is the number of units divided by the count of option 'B' for the given month.
key UID count month option unit year unit_count
0 1 100 1 A 10 2015 0.05
1 1 200 1 B 20 2015 0.10
2 1 300 2 A 30 2015 0.075
3 1 400 2 B 40 2015 0.10
Here is the code I used to create the original DataFrame:
df = pd.DataFrame({'UID': [1,1,1,1],
                   'year': [2015,2015,2015,2015],
                   'month': [1,1,2,2],
                   'option': ['A','B','A','B'],
                   'unit': [10,20,30,40],
                   'count': [100,200,300,400]})
You can first create NaN where option is not B and then divide by the back-filled values:
Notice: the DataFrame has to be sorted by year, month and option first, so that the B row is the last value in each group.
#if necessary in real data
#df.sort_values(['year','month', 'option'], inplace=True)
df['unit_count'] = df.loc[df.option=='B', 'count']
print (df)
UID count month option unit year unit_count
0 1 100 1 A 10 2015 NaN
1 1 200 1 B 20 2015 200.0
2 1 300 2 A 30 2015 NaN
3 1 400 2 B 40 2015 400.0
df['unit_count'] = df.unit.div(df['unit_count'].bfill())
print (df)
UID count month option unit year unit_count
0 1 100 1 A 10 2015 0.050
1 1 200 1 B 20 2015 0.100
2 1 300 2 A 30 2015 0.075
3 1 400 2 B 40 2015 0.100
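If you prefer not to rely on row order at all, an alternative sketch (my own variant, not the answer above, assuming df as constructed in the question) is to pull out the option 'B' counts and merge them back on year and month; the column name b_count is mine:
b_counts = (df.loc[df['option'] == 'B', ['year', 'month', 'count']]
              .rename(columns={'count': 'b_count'}))
out = df.merge(b_counts, on=['year', 'month'], how='left')
out['unit_count'] = out['unit'] / out['b_count']   # 0.05, 0.10, 0.075, 0.10
print(out)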
I have a table in pandas, df1:
id product_1 product_2 count
1 100 200 10
2 200 600 20
3 100 500 30
4 400 100 40
5 500 700 50
6 200 500 60
7 100 400 70
I also have another table in dataframe df2:
product price
100 5
200 10
300 15
400 20
500 25
600 30
700 35
I wanted to merge df2 with df1 such that I get price_x and price_y as columns,
and then divide price_y by price_x to get the final column perc_diff.
So I tried to do the merge using:
# Add prices for products 1 and 2
df3 = (df1.
       merge(df2, left_on='product_1', right_on='product').
       merge(df2, left_on='product_2', right_on='product'))
# Calculate the percent difference
df3['perc_diff'] = (df3.price_y - df3.price_x) / df3.price_x
But when I did the merge I got multiple columns of product_1 and product_2.
For example, my df3.head(1) after merging is:
id product_1 product_2 count product_1 product_2 price_x price_y
1 100 200 10 100 200 5 10
So how do I remove these duplicate columns of product_1 & product_2, either while merging or after merging?
One option is to join against df2 indexed by product, so no extra product columns are created:
df2_ = df2.set_index('product')
df3 = df1.join(df2_, on='product_1') \
         .join(df2_, on='product_2', lsuffix='_x', rsuffix='_y')
df3.assign(perc_diff=df3.price_y.div(df3.price_x).sub(1))
For removing the extra column it is necessary to rename before the second merge:
df3 = df1.merge(df2, left_on='product_1', right_on='product') \
.merge(df2.rename(columns={'product':'product_2'}), on='product_2')
#borrow from piRSquared solution
df3 = df3.assign(perc_diff=df3.price_y.div(df3.price_x).sub(1))
print (df3)
id product_1 product_2 count product price_x price_y perc_diff
0 1 100 200 10 100 5 10 1.00
1 3 100 500 30 100 5 25 4.00
2 6 200 500 60 200 10 25 1.50
3 7 100 400 70 100 5 20 3.00
4 2 200 600 20 200 10 30 2.00
5 4 400 100 40 400 20 5 -0.75
6 5 500 700 50 500 25 35 0.40
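As a further option (my own sketch, not part of either answer, assuming df1 and df2 as in the question), you can avoid the extra product columns entirely by mapping prices from a product -> price Series:
price = df2.set_index('product')['price']
df3 = df1.assign(price_x=df1['product_1'].map(price),
                 price_y=df1['product_2'].map(price))
df3['perc_diff'] = df3['price_y'] / df3['price_x'] - 1
print(df3)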
I have customer records with id, timestamp and status.
ID, TS, STATUS
1 10 GOOD
1 20 GOOD
1 25 BAD
1 30 BAD
1 50 BAD
1 600 GOOD
2 40 GOOD
.. ...
I am trying to calculate how much time is spent in consecutive BAD statuses (let's imagine the order above is correct) per customer. So for customer id=1, 30-25, 50-30 and 600-50, in total 575 seconds, was spent in BAD status.
What is the method of doing this in Pandas? If I calculate .diff() on TS, that would give me differences, but how can I tie that 1) to the customer 2) certain status "blocks" for that customer?
Sample data:
df = pandas.DataFrame({'ID': [1,1,1,1,1,1,2],
                       'TS': [10,20,25,30,50,600,40],
                       'Status': ['G','G','B','B','B','G','G']},
                      columns=['ID','TS','Status'])
Thanks,
In [1]: df = DataFrame({'ID':[1,1,1,1,1,2,2],'TS':[10,20,25,30,50,10,40],
   ...:                 'Status':['G','G','B','B','B','B','B']}, columns=['ID','TS','Status'])
In [2]: f = lambda x: x.diff().sum()
In [3]: df['diff'] = df[df.Status=='B'].groupby('ID')['TS'].transform(f)
In [4]: df
Out[4]:
ID TS Status diff
0 1 10 G NaN
1 1 20 G NaN
2 1 25 B 25
3 1 30 B 25
4 1 50 B 25
5 2 10 B 30
6 2 40 B 30
Explanation:
Subset the dataframe to only those records with the desired Status. Group by the ID and apply the lambda function diff().sum() to each group. Use transform instead of apply because transform returns a series indexed like the original frame, which you can assign directly to the new column 'diff'.
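To make the apply vs. transform point concrete, a tiny illustration using the df and f defined above (my own restatement):
bad = df[df.Status == 'B']
bad.groupby('ID')['TS'].apply(f)      # one value per ID: 1 -> 25.0, 2 -> 30.0
bad.groupby('ID')['TS'].transform(f)  # the same values broadcast to every original 'B' row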
EDIT: New response to account for expanded question scope.
In [1]: df
Out[1]:
ID TS Status
0 1 10 G
1 1 20 G
2 1 25 B
3 1 30 B
4 1 50 B
5 1 600 G
6 2 40 G
In [2]: df['shift'] = -df['TS'].diff(-1)
In [3]: df['diff'] = df[df.Status=='B'].groupby('ID')['shift'].transform('sum')
In [4]: df
Out[4]:
ID TS Status shift diff
0 1 10 G 10 NaN
1 1 20 G 5 NaN
2 1 25 B 5 575
3 1 30 B 20 575
4 1 50 B 550 575
5 1 600 G -560 NaN
6 2 40 G NaN NaN
Here's a solution to separately aggregate each contiguous block of bad status (part 2 of your question?).
In [5]: df = pandas.DataFrame({'ID':[1,1,1,1,1,1,1,1,2,2,2],
   ...:                        'TS':[10,20,25,30,50,600,650,670,40,50,60],
   ...:                        'Status':['G','G','B','B','B','G','B','B','G','B','B']},
   ...:                       columns=['ID','TS','Status'])
In [6]: grp = df.groupby('ID')
In [7]: def status_change(df):
   ...:     return (df.Status.shift(1) != df.Status).astype(int)
   ...:
In [8]: df['BlockId'] = grp.apply(lambda df: status_change(df).cumsum())
In [9]: df['Duration'] = grp.TS.diff().shift(-1)
In [10]: df
Out[10]:
ID TS Status BlockId Duration
0 1 10 G 1 10
1 1 20 G 1 5
2 1 25 B 2 5
3 1 30 B 2 20
4 1 50 B 2 550
5 1 600 G 3 50
6 1 650 B 4 20
7 1 670 B 4 NaN
8 2 40 G 1 10
9 2 50 B 2 10
10 2 60 B 2 NaN
In [11]: df[df.Status == 'B'].groupby(['ID', 'BlockId']).Duration.sum()
Out[11]:
ID BlockId
1 2 575
4 20
2 2 10
Name: Duration
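As a closing sketch (my own restatement of the block idea above, using per-group shifts so nothing leaks across customers), the whole thing can be written as:
import pandas as pd

df = pd.DataFrame({'ID':     [1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2],
                   'TS':     [10, 20, 25, 30, 50, 600, 650, 670, 40, 50, 60],
                   'Status': ['G', 'G', 'B', 'B', 'B', 'G', 'B', 'B', 'G', 'B', 'B']})

grp = df.groupby('ID')
# new block whenever Status changes within a customer
df['BlockId'] = grp['Status'].transform(lambda s: s.ne(s.shift()).cumsum())
# time from each row to the next timestamp of the same customer
df['Duration'] = grp['TS'].shift(-1) - df['TS']

# sum the durations of each contiguous BAD block per customer
print(df[df['Status'] == 'B'].groupby(['ID', 'BlockId'])['Duration'].sum())
# ID 1: blocks of 575.0 and 20.0 seconds; ID 2: one block of 10.0 seconds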