I have a DF like this:
Date DIS_NR NUM_EMPLOYEES
8/16/2018 868 200
8/17/2018 868 150
8/18/2018 868 200
8/16/2018 776 150
8/17/2018 776 250
8/18/2018 776 150
Now, for each DIS_NR, the NUM_EMPLOYEES value with the most occurrences must be used as the benchmark, and any of the other days that do not have the same value must be flagged.
Final Data should look like this:
Date DIS_NR NUM_EMPLOYEES FLAG
8/16/2018 868 200 0
8/17/2018 868 150 1
8/18/2018 868 200 0
8/16/2018 776 150 0
8/17/2018 776 250 1
8/18/2018 776 150 0
I grouped by Date and DIS_NR using
df1 = DF.groupby(["DIS_NR", "Date"])
I tried looping over each group and finding the mode, but it didn't work. Any help would be appreciated.
Thank you.
From your question, it seems like you are agnostic to the Date column in the grouping:
>>> func = lambda s: s.ne(s.value_counts().idxmax()).astype(int)
>>> df['FLAG'] = df.groupby('DIS_NR')['NUM_EMPLOYEES'].transform(func)
>>> df
Date DIS_NR NUM_EMPLOYEES FLAG
0 2018-08-16 868 200 0
1 2018-08-17 868 150 1
2 2018-08-18 868 200 0
3 2018-08-16 776 150 0
4 2018-08-17 776 250 1
5 2018-08-18 776 150 0
groupby().transform() is not always the fastest route, but in this case it can lean on some Cython routines, because the methods used within func are themselves vectorized rather than carried out in pure Python.
When you pass a function to .transform(), it gets applied to each subsetted Series, which you can view with .get_group():
>>> df.groupby('DIS_NR')['NUM_EMPLOYEES'].get_group(868)
0 200
1 150
2 200
Name: NUM_EMPLOYEES, dtype: int64
>>> df.groupby('DIS_NR')['NUM_EMPLOYEES'].get_group(776)
3 150
4 250
5 150
Name: NUM_EMPLOYEES, dtype: int64
>>> func(df.groupby('DIS_NR')['NUM_EMPLOYEES'].get_group(868))
0 0
1 1
2 0
Name: NUM_EMPLOYEES, dtype: int64
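Putting it all together, here is a self-contained sketch that rebuilds the sample frame from the question and applies the approach above (nothing beyond pandas is assumed):
import pandas as pd

df = pd.DataFrame({
    'Date': ['2018-08-16', '2018-08-17', '2018-08-18'] * 2,
    'DIS_NR': [868, 868, 868, 776, 776, 776],
    'NUM_EMPLOYEES': [200, 150, 200, 150, 250, 150],
})

# flag rows whose value differs from the most frequent value within their DIS_NR
func = lambda s: s.ne(s.value_counts().idxmax()).astype(int)
df['FLAG'] = df.groupby('DIS_NR')['NUM_EMPLOYEES'].transform(func)
print(df)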
Update:
For example, if DIS_NR 825 has values (125, 243, 221), then all of them should be flagged.
>>> df
Date DIS_NR NUM_EMPLOYEES
0 2018-08-16 868 200
1 2018-08-17 868 150
2 2018-08-18 868 200
3 2018-08-16 776 150
4 2018-08-17 776 250
5 2018-08-18 776 150
6 2018-08-16 825 125
7 2018-08-17 825 243
8 2018-08-18 825 221
In this case, you can throw in a second condition that tests whether every value in the group is unique (i.e. there is no clear mode) and, if so, flags the whole group. Note again that the function is passed to .transform() rather than .apply():
import numpy as np

func = lambda s: np.where(
    s.nunique() == len(s), 1,
    s.ne(s.value_counts().idxmax()).astype(int)
)
>>> df.groupby('DIS_NR')['NUM_EMPLOYEES'].transform(func)
0 0
1 1
2 0
3 0
4 1
5 0
6 1
7 1
8 1
Name: NUM_EMPLOYEES, dtype: int64
DF['counts'] = 1
df1 = DF.groupby(["DIS_NR", "Date"]).sum()
df1[df1['counts'] > 1] = 0
df1 = df1.reset_index()
DF = pd.merge(DF, df1, on=["DIS_NR", "Date"])
These are the key steps; after merging, you should see the counts column as the additional flag column you want.
I am typing this on a phone, so there may be syntax errors above.
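For what it's worth, here is a runnable sketch of the same counts-and-merge idea (my own reconstruction rather than the exact code above; the BENCHMARK and FLAG column names are illustrative): count how often each NUM_EMPLOYEES value occurs per DIS_NR, keep the most frequent one as the benchmark, then merge it back and flag the mismatches.
import pandas as pd

# counts of each (DIS_NR, NUM_EMPLOYEES) combination
counts = (DF.groupby(['DIS_NR', 'NUM_EMPLOYEES']).size()
            .reset_index(name='counts'))

# the most frequent NUM_EMPLOYEES per DIS_NR becomes the benchmark
benchmark = (counts.sort_values('counts', ascending=False)
                   .drop_duplicates('DIS_NR')
                   .rename(columns={'NUM_EMPLOYEES': 'BENCHMARK'})
                   [['DIS_NR', 'BENCHMARK']])

# merge the benchmark back and flag every row that deviates from it
DF = DF.merge(benchmark, on='DIS_NR')
DF['FLAG'] = (DF['NUM_EMPLOYEES'] != DF['BENCHMARK']).astype(int)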
Related
What is the easiest way to convert the following data frame, whose start values are ascending:
start end
0 100 500
1 400 700
2 450 580
3 750 910
4 920 940
5 1000 1200
6 1100 1300
into
start end
0 100 700
1 750 910
2 920 940
3 1000 1300
You may notice that rows 0:3 and rows 5:7 were merged, because those intervals overlap or one is contained in another: together they effectively have a single start and end.
Use a custom grouper built with shift to identify the overlapping intervals, then keep the first start and the maximum end of each block (the ends are not guaranteed to be increasing, so take the max rather than the last):
group = df['start'].gt(df['end'].shift()).cumsum()
out = df.groupby(group).agg({'start': 'first', 'end': 'max'})
output:
start end
0 100 700
1 750 910
2 920 940
3 1000 1300
intermediate group:
0 0
1 0
2 0
3 1
4 2
5 3
6 3
dtype: int64
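As a self-contained sketch (my reconstruction, which reproduces the requested output; comparing against end.cummax() instead of a plain shift also keeps the grouping correct when a short interval is fully contained in an earlier, longer one):
import pandas as pd

df = pd.DataFrame({'start': [100, 400, 450, 750, 920, 1000, 1100],
                   'end':   [500, 700, 580, 910, 940, 1200, 1300]})

# a new block starts whenever 'start' jumps past the largest 'end' seen so far
group = df['start'].gt(df['end'].cummax().shift()).cumsum()

out = (df.groupby(group)
         .agg({'start': 'min', 'end': 'max'})
         .reset_index(drop=True))
print(out)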
I am trying to sum up all the match wins for each team. There are 12 different teams, and each row has "team" and "match win". I want to sum up the match wins for each team, but I am running into a ton of errors with the ways I have tried. I tried to use iterrows and iteritems, but they just create a tuple of (0, rest of data). Any help would be extremely appreciated!
Here is my code and my results. The picture shows how my DataFrame is formatted:
for i in df.iteritems():
    #for k in team:
    print(i)
    #break
#tdf = df["match win"].iteritems()
print(tdf)
Here are my output examples:
('match id', 0 1
1 1
2 1
3 1
4 1
..
315 10
316 10
317 10
318 10
319 10
Name: match id, Length: 320, dtype: int64)
('game id', 0 1
1 1
2 1
3 1
4 1
..
315 5
316 5
317 5
318 5
319 5
Name: game id, Length: 320, dtype: int64)
('match win', 0 0
1 0
2 0
3 0
4 1
..
315 1
316 0
317 0
318 0
319 0
Name: match win, Length: 320, dtype: int64)
('mode', 0 hp
1 hp
2 hp
3 hp
4 hp
...
315 snd
316 snd
317 snd
318 snd
319 snd
Name: mode, Length: 320, dtype: object)
('final score', 0 194
1 194
2 194
3 194
4 250
...
315 6
316 2
317 2
318 2
319 2
This?
df.groupby('team')['match win'].sum()
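If you want the result back as a regular DataFrame, sorted from most to fewest wins (the 'team' and 'match win' column names are taken from the question), something like this should work:
wins_per_team = (df.groupby('team')['match win']
                   .sum()
                   .reset_index()
                   .sort_values('match win', ascending=False))
print(wins_per_team)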
I have a pandas series that can have non-negative integers (0, 8, 10, etc.) and -1s:
id values
1137 -1
1097 -1
201 8
610 -1
594 -1
727 -1
970 21
300 -1
243 0
715 -1
946 -1
548 4
Name: cluster, dtype: int64
I want to replace those -1s with values that don't already exist in the series and that are unique among themselves; in other words, I can't fill two rows with the same value, e.g. 90 twice. What's the most pythonic way to do that?
Here is the expected output:
id values
1137 1
1097 2
201 8
610 3
594 5
727 6
970 21
300 7
243 0
715 9
946 10
548 4
Name: cluster, dtype: int64
The idea is to create a pool of candidate values with np.arange, sized to len(df) plus the number of existing positive values so there are always enough candidates, remove the values that already exist with np.setdiff1d, and assign the leftovers to the filtered rows:
m = df['values'] != -1
s = np.setdiff1d(np.arange(len(df) + m.sum()), df.loc[m, 'values'])
df.loc[~m, 'values'] = s[:(~m).sum()]
print (df)
id values
0 1137 1
1 1097 2
2 201 8
3 610 3
4 594 5
5 727 6
6 970 21
7 300 7
8 243 0
9 715 9
10 946 10
11 548 4
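For completeness, the same approach as a runnable snippet with the sample frame rebuilt (column names follow the question):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'id': [1137, 1097, 201, 610, 594, 727, 970, 300, 243, 715, 946, 548],
    'values': [-1, -1, 8, -1, -1, -1, 21, -1, 0, -1, -1, 4],
})

m = df['values'] != -1                           # rows that already hold a real value
pool = np.arange(len(df) + m.sum())              # more candidates than rows to fill
fresh = np.setdiff1d(pool, df.loc[m, 'values'])  # drop values already in use
df.loc[~m, 'values'] = fresh[:(~m).sum()]        # fill the -1 rows in order
print(df)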
I started this question yesterday and have done more work on it.
Thanks @AMC, @ALollz.
I have a dataframe of surgical activity data that has 58 columns and 200,000 records. Each row corresponds to a patient encounter, and one of the columns, 'TRETSPEF', is the treatment specialty. I want to see the relative contribution of the medical specialties, so I used pd.read_csv with usecols=['TRETSPEF'] to import just that column.
df
TRETSPEF
0 150
1 150
2 150
3 150
4 150
... ...
218462 150
218463 &
218464 150
218465 150
218466 218
The most common treatment specialty is neurosurgery (code 150). So here's the problem: when I apply .value_counts() I get two groups for the 150 code (and for the 218 code):
df['TRETSPEF'].value_counts()
150 140411
150 40839
218 13692
108 10552
218 4143
...
501 1
120 1
302 1
219 1
106 1
Name: TRETSPEF, Length: 69, dtype: int64
There are some '&' values in there (454 of them), so I wondered if the fact that they aren't integers was messing things up. I changed them to null values and ran value_counts again:
df['TRETSPEF'].str.replace("&", "").value_counts()
150 140411
218 13692
108 10552
800 858
110 835
811 692
191 580
323 555
454
100 271
400 116
420 47
301 45
812 38
214 24
215 23
180 22
300 17
370 15
421 11
258 11
314 5
422 4
260 4
192 4
242 4
171 4
350 2
307 2
302 2
328 2
160 1
219 1
120 1
107 1
101 1
143 1
501 1
144 1
320 1
104 1
106 1
430 1
264 1
Name: TRETSPEF, dtype: int64
So now I seem to have lost the second group of 150 (about 40,000 records) by changing '&' to null. The nulls are still showing up in .value_counts(), though, and the length of the series has gone down to 45 from 69.
I tried stripping whitespace - no difference. Not sure what tests to run to see why this is happening. I feel it must somehow be due to the data.
This is 100% a data cleansing issue. Try to force the column to be numeric.
pd.to_numeric(df['TRETSPEF'], errors='coerce').value_counts()
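The duplicate-looking keys are a strong hint that the column holds a mix of real integers and strings that merely look like the same number; the .str accessor returns NaN for non-string entries, which would explain why a whole block of 150s disappeared after the replace. A quick way to confirm and then coerce (a sketch using only standard pandas):
# how many distinct Python types live in the column?
print(df['TRETSPEF'].map(type).value_counts())

# coerce everything to numbers; '&' and other junk become NaN
df['TRETSPEF'] = pd.to_numeric(df['TRETSPEF'], errors='coerce')
print(df['TRETSPEF'].value_counts(dropna=False))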
I have a dataframe from which I need to calculate a number of features. The dataframe df looks something like this for each object and event:
id event_id event_date age money_spent rank
1 100 2016-10-01 4 150 2
2 100 2016-09-30 5 10 4
1 101 2015-12-28 3 350 3
2 102 2015-10-25 5 400 5
3 102 2015-10-25 7 500 2
1 103 2014-04-15 2 1000 1
2 103 2014-04-15 3 180 6
From this I need to know, for each id and event_id (basically each row): the number of days since that id's previous event, the total money spent up to and including that date, the average money spent up to and including that date, the rank over the last 3 events, and so on.
What is the best way to work with this kind of problem in pandas, where for each row I need information from all rows with the same id on or before that row's date, and then perform the calculations? I want to return a new dataframe with the corresponding calculated features, like:
id event_id event_date days_last_event avg_money_spent total_money_spent
1 100 2016-10-01 278 500 1500
2 100 2016-09-30 341 196.67 590
1 101 2015-12-28 622 675 1350
2 102 2015-10-25 558 290 580
3 102 2015-10-25 0 500 500
1 103 2014-04-15 0 1000 1000
2 103 2014-04-15 0 180 180
I came up with the following solution:
df1 = df.sort_values(by="event_date")  # ascending, so cumulative stats run oldest -> newest
g = df1.groupby("id")["money_spent"]
df1["total_money_spent"] = g.cumsum()
df1["count"] = g.cumcount()
df1["avg_money_spent"] = df1["total_money_spent"] / (df1["count"] + 1)