Find Pandas column largest/smallest values where dates don't overlap - python

I have a DataFrame like:
df = pd.DataFrame(index = [0,1,2,3,4,5])
df['XYZ'] = [2, 8, 6, 5, 9, 10]
df['Date2'] = ["2005-01-06", "2005-01-07", "2005-01-08", "1994-06-08", "1999-06-15", "2005-01-09"]
df['Date1'] = ["2005-01-02", "2005-01-03", "2005-01-04", "1994-06-04", "1999-06-12", "2005-01-05"]
df['Date1'] = pd.to_datetime(df['Date1'])
df['Date2'] = pd.to_datetime(df['Date2'])
I need to find the largest values of XYZ whose date ranges do not overlap. The expected output would be:
XYZ Date1 Date2
10 2005-01-05 2005-01-09
9 1999-06-12 1999-06-15
5 1994-06-04 1994-06-08
I tried to sort by "XYZ":
df.sort_values(by="XYZ", ascending=False, inplace=True)
And then compare dates:
df['overlap'] = (df['Date1'] <= df['Date2'].shift()) & (df['Date2'] >= df['Date1'].shift())
And then drop any rows where df['overlap'] is True and take the nlargest() values; however, that still returns cases that do overlap.
Any help would be much appreciated.

This is somewhat involved but hopefully will work for you. We introduce a mask indexed by every date between the min and the max date in your df, mark each date as 'used' once it falls inside a kept row's range, and then use that mask to reject rows that overlap an already-kept range.
First we get the min and the max date (while also sorting the original df by 'XYZ'):
df1 = df.sort_values('XYZ', ascending = False)
dmin, dmax = df1[['Date1', 'Date2']].unstack().agg([min,max])
Then we create a mask, initially populated with 0s:
mask = pd.Series(index = pd.date_range(dmin,dmax), data = 0)
Then we iterate over the rows, marking the ones we want to keep in an 'include' column:
for idx, row in df1.iterrows():
    if sum(mask[row['Date1']:row['Date2']]) > 0:
        df1.loc[idx, 'include'] = False
        continue
    mask[row['Date1']:row['Date2']] = 1
    df1.loc[idx, 'include'] = True
Finally, filter on 'include':
df1[df1['include']].drop(columns = 'include')
Output:
XYZ Date1 Date2
5 10 2005-01-05 2005-01-09
4 9 1999-06-12 1999-06-15
3 5 1994-06-04 1994-06-08
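For reference, here are the pieces above combined into one runnable snippet (using the sample df from the question):
import pandas as pd

df = pd.DataFrame(index=[0, 1, 2, 3, 4, 5])
df['XYZ'] = [2, 8, 6, 5, 9, 10]
df['Date2'] = pd.to_datetime(["2005-01-06", "2005-01-07", "2005-01-08", "1994-06-08", "1999-06-15", "2005-01-09"])
df['Date1'] = pd.to_datetime(["2005-01-02", "2005-01-03", "2005-01-04", "1994-06-04", "1999-06-12", "2005-01-05"])

df1 = df.sort_values('XYZ', ascending=False)
dmin, dmax = df1[['Date1', 'Date2']].unstack().agg([min, max])

# one slot per calendar day; 1 means the day is already covered by a kept row
mask = pd.Series(index=pd.date_range(dmin, dmax), data=0)

for idx, row in df1.iterrows():
    if sum(mask[row['Date1']:row['Date2']]) > 0:
        df1.loc[idx, 'include'] = False
        continue
    mask[row['Date1']:row['Date2']] = 1
    df1.loc[idx, 'include'] = True

print(df1[df1['include']].drop(columns='include'))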

Using Loc and Iloc together

Hello, I am trying to pull three rows of data: row 0, row 1, and the row that is titled "Inventories".
I figured the best way would be to find the row number of Inventories and select the data using iloc. However, I get an error that says too many indexers. Any help would be appreciated.
df.columns=df.iloc[1]
cols = df.columns.tolist()
A =df.loc[df[cols[0]].isin(['Inventories'])].index.tolist()
df = df.iloc[[0,1,[A]]]
I have also tried
df = df.iloc[[0,1,A]]
Also please note A returns 56, and if I replace A with 56 in
df = df.iloc[[0,1,56]]
I get the desired outcome.
For the position of the first matched condition use Series.argmax, so you can pass A (a scalar, not wrapped in []) to DataFrame.iloc. This works well only if the condition ALWAYS matches:
A = df[cols[0]].eq('Inventories').argmax()
df = df.iloc[[0,1,A]]
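To see the caveat: if the condition never matches, argmax still returns 0, so you would silently select the first row. A small illustration with made-up data:
import pandas as pd

s = pd.Series(['a', 'b', 'c'])
# no value equals 'Inventories', but argmax of an all-False mask is still 0
print(s.eq('Inventories').argmax())  # -> 0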
Another idea is to add a condition with bitwise OR (|) to also test for the first 2 rows:
import numpy as np
import pandas as pd

df = pd.DataFrame({'col' : [100,10,'s','Inventories',1,10,100]})
df.index += 10
print (df)
col
10 100
11 10
12 s
13 Inventories
14 1
15 10
16 100
df = df[np.in1d(np.arange(len(df)), [0,1]) | df.iloc[:, 0].eq('Inventories')]
print (df)
col
10 100
11 10
13 Inventories
Or join filtered rows by positions and by condition:
df = pd.concat([df.iloc[[0, 1]], df[df.iloc[:, 0].eq('Inventories')]])
print (df)
col
10 100
11 10
13 Inventories
IIUC, you want to pull out 3 specific values by index (which can be a number or a string). Setting an index lets you specify the values you want to pull back when referencing it:
df = pd.DataFrame({
    'Column': [1, 2, 3, 4, 5],
    'index': [0, 1, 'Test', 'Inventories', 4]
})
df = df.set_index('index')
df.loc[[0, 1, 'Inventories']]
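With the sample frame above, this should return something like:
             Column
index
0                 1
1                 2
Inventories       4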

Add new index to multiindex as a count of first level

I would like to add a new index level in between the already existing levels 'Warnings' and 'equip', containing the sum of the column 'count per equip' for each 'Warnings' level.
idx = pd.MultiIndex.from_product([['warning1', 'warning2', 'warning3'],
['ff0001', 'ff0002', 'ff0003']],
names=['Warnings', 'equip'])
col = ['count per equip']
df = pd.DataFrame([100,2,1,44,45,20,25,98,0], idx, col)
df
So the resulting dataframe would keep the same level-0 'Warnings' index, and for this example the new level would be [103, 109, 123], respectively.
I've managed to sum and insert the index in the right place, but when I try to do it all together, all the values are NaNs:
df = df.assign(total=df.groupby(level=[0]).size()).set_index('total', append=True).reorder_levels(['Warnings','total','equip'])
Passing the groupby result directly to assign doesn't work here: it is indexed by the group keys, so it doesn't align with the original MultiIndex and every value becomes NaN. Use transform instead, which keeps the original index. The following code creates similar data:
idx = pd.MultiIndex.from_product([['warning1', 'warning2', 'warning3'],
['ff0001', 'ff0002', 'ff0003']],
names=['Warnings', 'equip'])
col = ['count per equip']
df = pd.DataFrame([100,2,1,44,45,20,25,98,0], idx, col)
Group on level=0 and broadcast the per-group sum back onto every row with transform:
df['total'] = df.groupby(level=0)['count per equip'].transform('sum')
df = df.set_index('total', append=True).reorder_levels(['Warnings', 'total', 'equip'])
print(df)
                      count per equip
Warnings total equip
warning1 103   ff0001             100
               ff0002               2
               ff0003               1
warning2 109   ff0001              44
               ff0002              45
               ff0003              20
warning3 123   ff0001              25
               ff0002              98
               ff0003               0
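If you prefer a single chained expression, the groupby can also live inside assign via a lambda, using transform so the result aligns with the original MultiIndex (a sketch, assuming the df defined above):
df = (df.assign(total=lambda d: d.groupby(level=0)['count per equip'].transform('sum'))
        .set_index('total', append=True)
        .reorder_levels(['Warnings', 'total', 'equip']))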

Keep pandas DataFrame rows in df2 for each row in df1 with timedelta

I have two pandas dataframes. I would like to keep all rows in df2 where Type is equal to Type in df1 AND Date is between Date in df1 (- 1 day or + 1 day). How can I do this?
df1
IBSN Type Date
0 1 X 2014-08-17
1 1 Y 2019-09-22
df2
IBSN Type Date
0 2 X 2014-08-16
1 2 D 2019-09-22
2 9 X 2014-08-18
3 3 H 2019-09-22
4 3 Y 2019-09-23
5 5 G 2019-09-22
res
IBSN Type Date
0 2 X 2014-08-16 <-- keep because Type = df1[0]['Type'] AND Date = df1[0]['Date'] - 1
1 9 X 2014-08-18 <-- keep because Type = df1[0]['Type'] AND Date = df1[0]['Date'] + 1
2 3 Y 2019-09-23 <-- keep because Type = df1[1]['Type'] AND Date = df1[1]['Date'] + 1
This should do it:
import pandas as pd
from datetime import timedelta
# create dummy data
df1 = pd.DataFrame([[1, 'X', '2014-08-17'], [1, 'Y', '2019-09-22']], columns=['IBSN', 'Type', 'Date'])
df1['Date'] = pd.to_datetime(df1['Date']) # might not be necessary if your Date column already contains datetime objects
df2 = pd.DataFrame([[2, 'X', '2014-08-16'], [2, 'D', '2019-09-22'], [9, 'X', '2014-08-18'], [3, 'H', '2019-09-22'], [3, 'Y', '2014-09-23'], [5, 'G', '2019-09-22']], columns=['IBSN', 'Type', 'Date'])
df2['Date'] = pd.to_datetime(df2['Date']) # might not be necessary if your Date column already contains datetime objects
# add date boundaries to the first dataframe
df1['Date_from'] = df1['Date'].apply(lambda x: x - timedelta(days=1))
df1['Date_to'] = df1['Date'].apply(lambda x: x + timedelta(days=1))
# merge the date boundaries to df2 on 'Type'. Filter rows where date is between
# date_from and date_to (inclusive). Drop 'date_from' and 'date_to' columns
df2 = df2.merge(df1.loc[:, ['Type', 'Date_from', 'Date_to']], on='Type', how='left')
df2[(df2['Date'] >= df2['Date_from']) & (df2['Date'] <= df2['Date_to'])].\
drop(['Date_from', 'Date_to'], axis=1)
Note that according to your logic, row 4 in df2 (3 Y 2014-09-23) should not remain as its date (2014) is not in between the given dates in df1 (year 2019).
Assume the Date columns in both dataframes are already datetime dtype. I would construct an IntervalIndex and assign it as the index of df1, map the Type column of df1 onto df2's dates, and finally check equality to create a mask to slice df2:
iix = pd.IntervalIndex.from_arrays(df1.Date + pd.Timedelta(days=-1),
                                   df1.Date + pd.Timedelta(days=1), closed='both')
df1 = df1.set_index(iix)
s = df2['Date'].map(df1.Type)
df_final = df2[df2.Type == s]
Out[1131]:
IBSN Type Date
0 2 X 2014-08-16
2 9 X 2014-08-18
4 3 Y 2019-09-23
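For completeness, here is that approach as one runnable sketch with the sample data from the question (Date columns converted with pd.to_datetime):
import pandas as pd

df1 = pd.DataFrame({'IBSN': [1, 1],
                    'Type': ['X', 'Y'],
                    'Date': pd.to_datetime(['2014-08-17', '2019-09-22'])})
df2 = pd.DataFrame({'IBSN': [2, 2, 9, 3, 3, 5],
                    'Type': ['X', 'D', 'X', 'H', 'Y', 'G'],
                    'Date': pd.to_datetime(['2014-08-16', '2019-09-22', '2014-08-18',
                                            '2019-09-22', '2019-09-23', '2019-09-22'])})

# each df1 row becomes a closed interval [Date - 1 day, Date + 1 day]
iix = pd.IntervalIndex.from_arrays(df1.Date - pd.Timedelta(days=1),
                                   df1.Date + pd.Timedelta(days=1), closed='both')
df1 = df1.set_index(iix)

# for each df2 date, look up the Type of the df1 window containing it (NaN if none)
s = df2['Date'].map(df1.Type)
df_final = df2[df2['Type'] == s]
print(df_final)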

Python sum with condition using a date and a condition

I have two dataframes and I am using pandas.
I want to compute a cumulative sum starting from a variable date, per the value in a column.
I want to add a column to df2 that shows the date on which the cumulative sum of the AVG column, counting from date2 in df2, first exceeds 100.
For example with df1 and df2 being the dataframe I start with and df3 what I want and df3['date100'] is the day the sum of avg is greater than 100:
df1 = pd.DataFrame({'date1': ['1/1/2014', '2/1/2014', '3/1/2014','1/1/2014', '2/1/2014', '3/1/2014','1/1/2014', '2/1/2014', '3/1/2014'],
'Place':['A','A','A','B','B','B','C','C','C'],'AVG': [62,14,47,25,74,60,78,27,41]})
df2 = pd.DataFrame({'date2': ['1/1/2014', '2/1/2014'], 'Place':['A','C']})
*Something*
df3 = pd.DataFrame({'date2': ['1/1/2014', '2/1/2014'], 'Place':['A','C'], 'date100': ['3/1/2014', '2/1/2014'], 'sum': [123, 105]})
I found some answers but most of them use groupby, and df2 has no groups.
Since your example is very basic, if you have edge cases you want me to take care of, just ask. This solution assumes your DataFrame is sorted by date (see the comment in the code).
The solution:
# For this solution your DataFrame needs to be sorted by date.
limit = 100
df = pd.DataFrame({
    'date1': ['1/1/2014', '2/1/2014', '3/1/2014', '1/1/2014',
              '2/1/2014', '3/1/2014', '1/1/2014', '2/1/2014', '3/1/2014'],
    'Place': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'],
    'AVG': [62, 14, 47, 25, 74, 60, 78, 27, 41]})
df2 = pd.DataFrame({'date2': ['1/1/2014', '2/1/2014'], 'Place': ['A', 'C']})

result = []

for row in df2.to_dict('records'):
    # For each date, I want to select the rows that come AFTER this one.
    # Then, I take the .cumsum(), because it's the agg you wish to do.
    # Filter by your limit and take the first occurrence.
    # Converting this to a dict and appending it to a list makes it easy
    # to rebuild a DataFrame later.
    ndf = df.loc[(df['date1'] >= row['date2']) & (df['Place'] == row['Place'])]\
            .sort_values(by='date1')
    ndf['avgsum'] = ndf['AVG'].cumsum()
    final_df = ndf.loc[ndf['avgsum'] >= limit]

    # Error handling, in case there is no avgsum above the threshold.
    try:
        final_df = final_df.iloc[0][['date1', 'avgsum']].rename({'date1': 'date100'})
        result.append(final_df.to_dict())
    except IndexError:
        continue

df3 = pd.DataFrame(result)
final_df = pd.concat([df2, df3], axis=1, sort=False)
print(final_df)

#       date2 Place  avgsum   date100
# 0  1/1/2014     A   123.0  3/1/2014
# 1  2/1/2014     C     NaN       NaN
Here is a direct solution, with the following assumptions:
df1 is sorted by date
one solution exists for every date in df2
You can then do:
df2 = df2.join(pd.concat([
          pd.DataFrame(pd.DataFrame(df1.loc[df1.date1 >= d].AVG.cumsum())
                       .query('AVG>=100').iloc[0]).transpose()
          for d in df2.date2]).rename_axis('ix').reset_index())\
         .join(df1.drop(columns='AVG'), on='ix')\
         .rename(columns={'AVG': 'sum', 'date1': 'date100'})\
         .drop(columns='ix')[['date2', 'date100', 'sum']]
This does the following:
for each date in df2, find the first date when the cumulative sum of AVG reaches at least 100
combine the results in one single dataframe indexed by the index of that line in df1
store that index in an ix column and reset the index to join that dataframe to df2
join that to df1 minus the AVG column using the ix column
rename the columns, remove the ix column, and re-order everything
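For readability, here is the same chain decomposed into intermediate steps (a sketch of the identical logic, using the df1/df2 from the question; like the one-liner, it does not restrict the cumulative sum to the matching Place):
import pandas as pd

df1 = pd.DataFrame({'date1': ['1/1/2014', '2/1/2014', '3/1/2014', '1/1/2014',
                              '2/1/2014', '3/1/2014', '1/1/2014', '2/1/2014', '3/1/2014'],
                    'Place': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'],
                    'AVG': [62, 14, 47, 25, 74, 60, 78, 27, 41]})
df2 = pd.DataFrame({'date2': ['1/1/2014', '2/1/2014'], 'Place': ['A', 'C']})

rows = []
for d in df2.date2:
    cum = df1.loc[df1.date1 >= d, 'AVG'].cumsum()   # cumulative AVG from that date onward
    first_hit = cum[cum >= 100].head(1)             # first row reaching the threshold
    rows.append(first_hit.rename('sum').rename_axis('ix').reset_index())

hits = pd.concat(rows, ignore_index=True)           # columns: ix (df1 row label), sum
df3 = (df2.join(hits)
          .join(df1['date1'].rename('date100'), on='ix')
          .drop(columns='ix'))
print(df3)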

Python Pandas matching closest index from another Dataframe

df.index = 10,100,1000
df2.index = 1,2,11,50,101,500,1001
Just a sample.
I need to match the closest index from df2 against each index in df, with these conditions:
df2.index has to be > df.index
only one closest value
For example, the output:
df | df2
10 | 11
100 | 101
1000 | 1001
Right now I can do it with a for-loop but it's extremely slow.
I used new_df2 to hold the result instead of df2:
new_df2 = pd.DataFrame(columns=["value"])

for col in df.index:
    for col2 in df2.index:
        if col2 > col:
            new_df2.loc[col2] = df2.loc[col2]
            break
        else:
            df2 = df2[1:]  # delete first row for index speed
How can I avoid the for-loop in this case? Thanks.
Not sure how robust this is, but you can sort df2 so its index is decreasing, and use asof to find the most recent index label matching each key in df's index:
df2.sort_index(ascending=False, inplace=True)
df['closest_df2'] = df.index.map(lambda x: df2.index.asof(x))
df
Out[19]:
a closest_df2
10 1 11
100 2 101
1000 3 1001
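A vectorized alternative (not part of the answer above, just a sketch): np.searchsorted finds, for each df.index label, the position of the first strictly greater label in df2.index, provided df2.index is sorted ascending and such a label always exists:
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3]}, index=[10, 100, 1000])
df2 = pd.DataFrame({'value': range(7)}, index=[1, 2, 11, 50, 101, 500, 1001])

# position of the first df2.index label strictly greater than each df.index label
pos = np.searchsorted(df2.index, df.index, side='right')
df['closest_df2'] = df2.index[pos]
print(df)
#       a  closest_df2
# 10    1           11
# 100   2          101
# 1000  3         1001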
