Find overlapping rows in a Pandas DataFrame - python

What is the easiest way to convert the following ascending data frame:
start end
0 100 500
1 400 700
2 450 580
3 750 910
4 920 940
5 1000 1200
6 1100 1300
into
start end
0 100 700
1 750 910
2 920 940
3 1000 1300
You may notice that rows 0:3 and 5:7 were merged because those rows overlap, or one row is a sub-interval of another: together they span a single start and end.

Use a custom grouper built with shift and cumsum to identify the overlapping intervals, then keep the minimum start and maximum end of each group:
group = df['start'].gt(df['end'].shift()).cumsum()
out = df.groupby(group).agg({'start': 'min', 'end': 'max'})
output:
start end
0 100 700
1 750 910
2 920 940
3 1000 1300
intermediate group:
0 0
1 0
2 0
3 1
4 2
5 3
6 3
dtype: int64
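
Note that comparing each start only to the immediately preceding end can split a row that is still covered by an earlier, longer interval. If your data can contain such cases, a slightly more defensive variant (a sketch, assuming the same df with ascending start values) compares against the running maximum of end instead:

import pandas as pd

df = pd.DataFrame({'start': [100, 400, 450, 750, 920, 1000, 1100],
                   'end':   [500, 700, 580, 910, 940, 1200, 1300]})

# start a new group only when the current start exceeds the largest end seen so far
group = df['start'].gt(df['end'].cummax().shift()).cumsum()
out = df.groupby(group).agg({'start': 'min', 'end': 'max'}).reset_index(drop=True)
print(out)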

Related

Replace -1 in pandas series with unique values

I have a pandas series that can have non-negative integers (0, 8, 10, etc.) and -1s:
id values
1137 -1
1097 -1
201 8
610 -1
594 -1
727 -1
970 21
300 -1
243 0
715 -1
946 -1
548 4
Name: cluster, dtype: int64
I want to replace those -1s with values that don't already exist in the series and that are unique among themselves; in other words, I can't fill twice with, for example, 90. What's the most pythonic way to do that?
Here is the expected output:
id values
1137 1
1097 2
201 8
610 3
594 5
727 6
970 21
300 7
243 0
715 9
946 10
548 4
Name: cluster, dtype: int64
The idea is to create a pool of candidate values with np.arange (made large enough to account for the existing non-negative values), remove the values that are already present with np.setdiff1d, and assign the remainder to the filtered rows:
m = df['values'] != -1
s = np.setdiff1d(np.arange(len(df) + m.sum()), df.loc[m, 'values'])
df.loc[~m, 'values'] = s[:(~m).sum()]
print(df)
id values
0 1137 1
1 1097 2
2 201 8
3 610 3
4 594 5
5 727 6
6 970 21
7 300 7
8 243 0
9 715 9
10 946 10
11 548 4
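
An alternative that avoids building the whole candidate pool up front (a sketch under the same assumptions, with the sample data recreated inline) is to walk a counter and skip anything that is already present:

import itertools
import pandas as pd

df = pd.DataFrame({'id': [1137, 1097, 201, 610, 594, 727, 970, 300, 243, 715, 946, 548],
                   'values': [-1, -1, 8, -1, -1, -1, 21, -1, 0, -1, -1, 4]})

existing = set(df.loc[df['values'] != -1, 'values'])
counter = itertools.count()

# for every -1, draw the next integer from the counter that is not already taken;
# the counter never repeats, so the fillers cannot collide with each other either
fillers = [next(v for v in counter if v not in existing)
           for _ in range((df['values'] == -1).sum())]
df.loc[df['values'] == -1, 'values'] = fillers
print(df)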

Create a text file using pandas dataframes

I am new to Python. I have the following dataframe:
Document_ID OFFSET PredictedFeature word
0 0 2000 abcd
0 8 2000 is
0 16 2200 a
0 23 2200 good
0 25 315 XXYYZZ
1 0 2100 but
1 5 2100 it
1 7 2100 can
1 10 315 XXYYZZ
Now, what I am trying to do with this dataframe is to produce a file in a readable format like:
abcd is 2000, a good 2200
but it can 2100,
PredictedData feature offset endoffset
abcd is 2000 0 8
a good 2200 16 23
NewLine 315 25 25
but it can 2100 0 7
That is the kind of output I want: wherever the same PredictedFeature value occurs in sequence, I concatenate the words and pair them with that value; if the feature is 315, I start a new line.
So, is there any way through which I can do this? Any help will be appreciated.
Thanks.
IIUC, you can do groupby():
(df.groupby(['Document_ID', 'PredictedFeature'], as_index=False)
   .agg({'word': ' '.join,
         'OFFSET': ('min', 'max')})
)
Output:
Document_ID PredictedFeature word OFFSET
join min max
0 0 315 XXYYZZ 25 25
1 0 2000 abcd is 0 8
2 0 2200 a good 16 23
3 1 315 XXYYZZ 10 10
4 1 2100 but it can 0 7
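
To go from that grouped frame to an actual text file (which is what the question asks for), a minimal sketch could flatten the MultiIndex columns and write one line per group; the variable name out and the path predicted_features.txt are placeholders, not part of the original answer:

out = (df.groupby(['Document_ID', 'PredictedFeature'], as_index=False)
         .agg({'word': ' '.join, 'OFFSET': ['min', 'max']}))

# flatten the MultiIndex columns produced by the nested agg
out.columns = ['Document_ID', 'PredictedFeature', 'PredictedData', 'offset', 'endoffset']

# one line per group: concatenated words, feature code, first offset, last offset
with open('predicted_features.txt', 'w') as fh:  # hypothetical output path
    for row in out.itertuples(index=False):
        fh.write(f'{row.PredictedData} {row.PredictedFeature} {row.offset} {row.endoffset}\n')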

Pandas totalling balances with date timeline from multiple sheets

I have three sheets inside one Excel spreadsheet. I am trying to obtain the output listed below, or something close to it. The desired outcome is to know when there will be a shortage so that I can react and order in time to prevent it. All of these, except for the output, are sheets in one Excel file. How hard would this be to achieve, and is it possible? Note that all of the sheets contain many other data columns, so columns may need to be referenced by position (e.g. with iloc) or selected by name.
instock sheet
product someother datapoint qty
5.25 1 2 100
5.25 1 3 200
6 2 1 50
6 4 1 500
ordered
product something ordernum qty date
5 1/4 abc 52521 50 07/01/2019
5 1/4 ddd 22911 100 07/28/2019
6 eeee 72944 10 07/5/2019
promised
product order qty date
5 1/4 456 300 06/12/2019
5 1/4 789 50 06/20/2019
5 1/4 112 50 07/20/2019
6 113 800 07/22/2019
5 1/4 144 50 07/28/2019
9 155 100 08/22/2019
Output
product date onhand qtyordered commited balance shortage
5.25 06/10 300 300 n
5.25 06/12 300 300 0 n
5.25 06/20 0 50 -50 y
5.25 07/01 -50 50 0 n
6 07/05 550 10 0 560 n
5.25 07/20 0 50 -50 y
6 07/22 560 0 800 -240 y
5.25 07/28 -50 100 50 0 n
9 08/22 0 0 100 -100 y
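
No full solution is given here, but as a rough sketch of one possible approach (the workbook name inventory.xlsx, the sheet names, and the fraction-style product codes are assumptions based on the samples above), the sheets could be loaded with pd.read_excel, the product codes normalized, and a per-product ledger cumulated by date:

import pandas as pd

sheets = pd.read_excel('inventory.xlsx', sheet_name=['instock', 'ordered', 'promised'])

# normalize product codes such as "5 1/4" to 5.25 so the sheets can be aligned
def to_number(p):
    parts = str(p).split()
    value = float(parts[0])
    if len(parts) == 2:
        num, den = parts[1].split('/')
        value += float(num) / float(den)
    return value

instock = sheets['instock']
onhand = instock.groupby(instock['product'].map(to_number))['qty'].sum()

ordered = sheets['ordered'].assign(product=lambda d: d['product'].map(to_number),
                                   qtyordered=lambda d: d['qty'], commited=0)
promised = sheets['promised'].assign(product=lambda d: d['product'].map(to_number),
                                     qtyordered=0, commited=lambda d: d['qty'])

ledger = (pd.concat([ordered, promised])[['product', 'date', 'qtyordered', 'commited']]
            .assign(date=lambda d: pd.to_datetime(d['date']))
            .sort_values(['product', 'date']))

# running balance per product: on-hand stock plus received orders minus promised quantities
ledger['balance'] = (ledger['product'].map(onhand).fillna(0)
                     + ledger.groupby('product')['qtyordered'].cumsum()
                     - ledger.groupby('product')['commited'].cumsum())
ledger['shortage'] = ledger['balance'].lt(0).map({True: 'y', False: 'n'})
print(ledger)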

Get Most Occurring Value in a Grouped Pandas Dataframe

I have a DF like this:
Date DIS_NR NUM_EMPLOYEES
8/16/2018 868 200
8/17/2018 868 150
8/18/2018 868 200
8/16/2018 776 150
8/17/2018 776 250
8/18/2018 776 150
Now, for each DIS_NR, the NUM_EMPLOYEES value with the most occurrences must be used as the benchmark, and any of the other days that do not have the same value must be flagged.
Final Data should look like this:
Date DIS_NR NUM_EMPLOYEES FLAG
8/16/2018 868 200 0
8/17/2018 868 150 1
8/18/2018 868 200 0
8/16/2018 776 150 0
8/17/2018 776 250 1
8/18/2018 776 150 0
I grouped by Date and DIS_NR using
df1 = DF.groupby(["DIS_NR", "Date"])
I tried looping over each one and finding the mode but it won't work. Any help would be appreciated.
Thank you.
From your question, it seems like you are agnostic to the Date column in the grouping:
>>> func = lambda s: s.ne(s.value_counts().idxmax()).astype(int)
>>> df['FLAG'] = df.groupby('DIS_NR')['NUM_EMPLOYEES'].transform(func)
>>> df
Date DIS_NR NUM_EMPLOYEES FLAG
0 2018-08-16 868 200 0
1 2018-08-17 868 150 1
2 2018-08-18 868 200 0
3 2018-08-16 776 150 0
4 2018-08-17 776 250 1
5 2018-08-18 776 150 0
groupby().transform() with a custom function is not always the fastest route, but in this case it can still lean on pandas' vectorized (Cython-backed) routines, because the methods used inside func operate on whole Series rather than being carried out element by element in Python.
When you pass a function to .transform(), it gets applied to each per-group Series, which you can inspect with .get_group():
>>> df.groupby('DIS_NR')['NUM_EMPLOYEES'].get_group(868)
0 200
1 150
2 200
Name: NUM_EMPLOYEES, dtype: int64
>>> df.groupby('DIS_NR')['NUM_EMPLOYEES'].get_group(776)
3 150
4 250
5 150
Name: NUM_EMPLOYEES, dtype: int64
>>> func(df.groupby('DIS_NR')['NUM_EMPLOYEES'].get_group(868))
0 0
1 1
2 0
Name: NUM_EMPLOYEES, dtype: int64
Update:
For example, if DIS_NR 825 has values (125, 243, 221), then all of them should be flagged.
>>> df
Date DIS_NR NUM_EMPLOYEES
0 2018-08-16 868 200
1 2018-08-17 868 150
2 2018-08-18 868 200
3 2018-08-16 776 150
4 2018-08-17 776 250
5 2018-08-18 776 150
6 2018-08-16 825 100
7 2018-08-17 825 100
8 2018-08-18 825 100
In this case, you can throw in a second condition that tests the number of unique values, again passing the function to .transform():
import numpy as np

func = lambda s: np.where(
    s.nunique() == 1, 1,
    s.ne(s.value_counts().idxmax()).astype(int)
)
>>> df.groupby('DIS_NR')['NUM_EMPLOYEES'].transform(func)
0 0
1 1
2 0
3 0
4 1
5 0
6 1
7 1
8 1
Name: NUM_EMPLOYEES, dtype: int64
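As a usage note (same sample data as above, not part of the original answer), assigning that transformed result back gives the FLAG column, with every row of group 825 flagged because it has a single repeated value:

df['FLAG'] = df.groupby('DIS_NR')['NUM_EMPLOYEES'].transform(func)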
DF['counts'] = 1
df1 = DF.groupby(["DIS_NR", "Date"]).sum()
df1[df1['counts'] > 1] = 0
df1 = df1.reset_index()
DF = pd.merge(DF, df1, on=["DIS_NR", "Date"])
These are the key steps; after merging, you should see the counts column as the additional column you want.
I am typing this on a phone, so there may be syntax errors above.

How to do a calculation on a pandas dataframe that requires processing multiple rows?

I have a dataframe from which I need to calculate a number of features. The dataframe df looks something like this for each object and event:
id event_id event_date age money_spent rank
1 100 2016-10-01 4 150 2
2 100 2016-09-30 5 10 4
1 101 2015-12-28 3 350 3
2 102 2015-10-25 5 400 5
3 102 2015-10-25 7 500 2
1 103 2014-04-15 2 1000 1
2 103 2014-04-15 3 180 6
From this I need to know, for each id and event_id (basically each row), the number of days since the last event date, the total money spent up to that date, the average money spent up to that date, the rank in the last 3 events, etc.
What is the best way to work with this kind of problem in pandas, where for each row I need information from all rows with the same id on or before that row's date in order to do the calculations? I want to return a new dataframe with the corresponding calculated features, like:
id event_id event_date days_last_event avg_money_spent total_money_spent
1 100 2016-10-01 278 500 1500
2 100 2016-09-30 361 196.67 590
1 101 2015-12-28 622 675 1350
2 102 2015-10-25 558 290 580
3 102 2015-10-25 0 500 500
1 103 2014-04-15 0 1000 1000
2 103 2014-04-15 0 180 180
I came up with the following solution:
df1 = df.sort_values(by="event_date")  # ascending, so each row's cumulatives run up to its date
g = df1.groupby("id")["money_spent"]
df1["total_money_spent"] = g.cumsum()   # running total per id, including the current row
df1["count"] = g.cumcount()             # number of earlier events for this id
df1["avg_money_spent"] = df1["total_money_spent"] / (df1["count"] + 1)
