I have a large data set and I want to add values to a column based on the highest values in another column in my data set.
Easy, I can just use df.quantile() and access the appropriate values
However, I want to check for each month in each year.
I solved it for looking at years only, see code below.
I'm sure I could do it for months with nested for loops but I'd rather avoid it if I can.
I guess the most pythonic way would be to not loop at all and use pandas in a smarter way.
Any suggestions?
Sample code:
import numpy as np
import pandas as pd

df = pd.read_excel(file)
df.index = df['date']
df = df.drop('date', axis=1)
df['new'] = 0
year = (2016, 2017, 2018, 2019, 2020)
for i in year:
    # flag rows at or above the yearly median cost
    df['new'].loc[str(i)] = np.where(df['cost'].loc[str(i)] < df['cost'].loc[str(i)].quantile(0.5), 0, 1)
print(df)
Sample input
file =
cost
date
2016-11-01 30
2016-12-01 29
2017-11-01 40
2017-12-01 45
2018-11-30 240
2018-12-01 200
2019-11-30 220
2019-12-30 180
2020-11-30 150
2020-12-30 130
Output
cost new
date
2016-11-01 30 1
2016-12-01 29 0
2017-11-01 40 0
2017-12-01 45 1
2018-11-30 240 1
2018-12-01 200 0
2019-11-30 220 1
2019-12-30 180 0
2020-11-30 150 1
2020-12-30 130 0
Desired output (if quantile works like that on single values, but as an example)
cost new
date
2016-11-01 30 1
2016-12-01 29 1
2017-11-01 40 1
2017-12-01 45 1
2018-11-30 240 1
2018-12-01 200 1
2019-11-30 220 1
2019-12-30 180 1
2020-11-30 150 1
2020-12-30 130 1
Thank you!
An interesting question; it took me a little while to work out a solution!
import pandas as pd

df = pd.DataFrame(data={"cost": [30, 29, 40, 45, 240, 200, 220, 180, 150, 130],
                        "date": ["2016-11-01", "2016-12-01", "2017-11-01",
                                 "2017-12-01", "2018-11-30", "2018-12-01",
                                 "2019-11-30", "2019-12-30", "2020-11-30",
                                 "2020-12-30"]})
df["date"] = pd.to_datetime(df["date"])
df.set_index("date", inplace=True)
df["new"] = df.groupby([lambda x: x.year, lambda x: x.month]).transform(lambda x: (x >= x.quantile(0.5))*1)
#Out:
# cost new
#date
#2016-11-01 30 1
#2016-12-01 29 1
#2017-11-01 40 1
#2017-12-01 45 1
#2018-11-30 240 1
#2018-12-01 200 1
#2019-11-30 220 1
#2019-12-30 180 1
#2020-11-30 150 1
#2020-12-30 130 1
What the important line does:
Groups by the index year and month
For each item in the group, calculates whether it is greater than or equal to the 0.5 quantile (as bool)
Multiplying by 1 converts the bool to an integer (1/0) instead of True/False
The initial creation of the dataframe should be equivalent to your df = pd.read_excel(file)
Leaving out the lambda x: x.month part of the groupby (grouping by year only), the output is the same as your current output:
# cost new
#date
#2016-11-01 30 1
#2016-12-01 29 0
#2017-11-01 40 0
#2017-12-01 45 1
#2018-11-30 240 1
#2018-12-01 200 0
#2019-11-30 220 1
#2019-12-30 180 0
#2020-11-30 150 1
#2020-12-30 130 0
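If the lambdas feel opaque, the same grouping can be written with explicit index attributes (a small equivalent sketch, not part of the original answer):
# same idea as above, grouping by the index's year and month directly
df["new"] = (df.groupby([df.index.year, df.index.month])["cost"]
               .transform(lambda x: (x >= x.quantile(0.5)).astype(int)))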
I have some data with dates of sales to my clients.
The data looks like this:
   Cod client  Items        Date
0         100      1  2022/01/01
1         100      7  2022/01/01
2         100      2  2022/02/01
3         101      5  2022/01/01
4         101      8  2022/02/01
5         101     10  2022/02/01
6         101      2  2022/04/01
7         101      2  2022/04/01
8         102      4  2022/02/01
9         102     10  2022/03/01
What I'm trying to accomplish is to calculate the differences between dates for each client, grouping first by "Cod client" and then by "Date" (because of the duplicates).
The expected result is like:
   Cod client  Items        Date  Date diff  Explain
0         100      1  2022/01/01        NaT  First date for client 100
1         100      7  2022/01/01        NaT  ...repeat above
2         100      2  2022/02/01         31  Diff from first date 2022/01/01
3         101      5  2022/01/01        NaT  First date for client 101
4         101      8  2022/02/01         31  Diff from first date 2022/01/01
5         101     10  2022/02/01         31  ...repeat above
6         101      2  2022/04/01         59  Diff from previous date 2022/02/01
7         101      2  2022/04/01         59  ...repeat above
8         102      4  2022/02/01        NaT  First date for client 102
9         102     10  2022/03/01         28  Diff from first date 2022/02/01
I already tried df["Date diff"] = df.groupby("Cod client")["Date"].diff(), but it considers the repeated dates and returns zeros for them.
I appreciate any help!
IIUC you can combine several groupby operations:
import pandas as pd

# ensure datetime
df['Date'] = pd.to_datetime(df['Date'])
# set up group
g = df.groupby('Cod client')
# identify duplicated dates per group
m = g['Date'].apply(pd.Series.duplicated)
# compute the diff, mask and ffill
df['Date diff'] = g['Date'].diff().mask(m).groupby(df['Cod client']).ffill()
output:
Cod client Items Date Date diff
0 100 1 2022-01-01 NaT
1 100 7 2022-01-01 NaT
2 100 2 2022-02-01 31 days
3 101 5 2022-01-01 NaT
4 101 8 2022-02-01 31 days
5 101 10 2022-02-01 31 days
6 101 2 2022-04-01 59 days
7 101 2 2022-04-01 59 days
8 102 4 2022-02-01 NaT
9 102 10 2022-03-01 28 days
Another way to do this, with transform:
import pandas as pd

# data saved as .csv
df = pd.read_csv("Data.csv", header=0, parse_dates=True)
# convert the Date column to datetime
df["Date"] = pd.to_datetime(df["Date"], format="%d/%m/%Y")
# new column: per-client diff, with the zero-day gaps from duplicate dates forward-filled
df["Date diff"] = (df.sort_values("Date")
                     .groupby("Cod client")["Date"]
                     .transform(lambda x: x.diff().replace("0 days", pd.NaT).ffill()))
Sorry, I'm new to Python.
I have a dataframe of entities that log values once a month. For each unique entity in my dataframe, I locate the max value and then locate the max value's corresponding month. Using the max value's month, a time delta between each of that entity's months and the max month can be calculated in days. This works for small dataframes.
I know my loop is not performant and can't scale to larger dataframes (e.g., 3M rows, 156+ MB). After weeks of research I've gathered that my loop is degenerate and feel there is a NumPy solution or something more pythonic. Can someone see a more performant approach to calculating this time delta in days?
I've tried various value.shift(x) calculations in a lambda function, but the peak value is not consistent. I've also tried calculating more columns to minimize my loop calculations.
import pandas as pd
df = pd.DataFrame({'entity':['A','A','A','A','B','B','B','C','C','C','C','C'], 'month': ['10/31/2018','11/30/2018','12/31/2018','1/31/2019','1/31/2009','2/28/2009','3/31/2009','8/31/2011','9/30/2011','10/31/2011','11/30/2011','12/31/2011'], 'value':['80','600','500','400','150','300','100','200','250','300','200','175'], 'month_number': ['1','2','3','4','1','2','3','1','2','3','4','5']})
df['month'] = df['month'].apply(pd.to_datetime)
for entity in set(df['entity']):
    # set peak value
    peak_value = df.loc[df['entity'] == entity, 'value'].max()
    # set peak value date
    peak_date = df.loc[(df['entity'] == entity) & (df['value'] == peak_value), 'month'].min()
    # subtract peak date from current date
    delta = df.loc[df['entity'] == entity, 'month'] - peak_date
    # update days_delta with delta in days
    df.loc[df['entity'] == entity, 'days_delta'] = delta
RESULT:
entity month value month_number days_delta
A 2018-10-31 80 1 0 days
A 2018-11-30 600 2 30 days
A 2018-12-31 500 3 61 days
A 2019-01-31 400 4 92 days
B 2009-01-31 150 1 -28 days
B 2009-02-28 300 2 0 days
B 2009-03-31 100 3 31 days
C 2011-08-31 200 1 -61 days
C 2011-09-30 250 2 -31 days
C 2011-10-31 300 3 0 days
C 2011-11-30 200 4 30 days
C 2011-12-31 175 5 61 days
Setup
First, let's also make sure value is numeric:
df = pd.DataFrame({
'entity':['A','A','A','A','B','B','B','C','C','C','C','C'],
'month': ['10/31/2018','11/30/2018','12/31/2018','1/31/2019',
'1/31/2009','2/28/2009','3/31/2009','8/31/2011',
'9/30/2011','10/31/2011','11/30/2011','12/31/2011'],
'value':['80','600','500','400','150','300','100','200','250','300','200','175'],
'month_number': ['1','2','3','4','1','2','3','1','2','3','4','5']
})
df['month'] = df['month'].apply(pd.to_datetime)
df['value'] = pd.to_numeric(df['value'])
transform and idxmax
max_months = df.groupby('entity').value.transform('idxmax').map(df.month)
df.assign(days_delta=df.month - max_months)
entity month value month_number days_delta
0 A 2018-10-31 80 1 -30 days
1 A 2018-11-30 600 2 0 days
2 A 2018-12-31 500 3 31 days
3 A 2019-01-31 400 4 62 days
4 B 2009-01-31 150 1 -28 days
5 B 2009-02-28 300 2 0 days
6 B 2009-03-31 100 3 31 days
7 C 2011-08-31 200 1 -61 days
8 C 2011-09-30 250 2 -31 days
9 C 2011-10-31 300 3 0 days
10 C 2011-11-30 200 4 30 days
11 C 2011-12-31 175 5 61 days
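The same result can also be reached without transform, by collecting each entity's peak month once and mapping it back onto the rows (a hedged sketch of an equivalent approach, using the numeric value column from the setup above):
# index label of the peak value per entity, then the month found at that label
peak_idx = df.groupby('entity')['value'].idxmax()
peak_month = df['entity'].map(df.loc[peak_idx].set_index('entity')['month'])
df['days_delta'] = df['month'] - peak_month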
I have a dataframe with three columns:
a b c
0 73 12
73 80 2
80 100 5
100 150 13
Values in "a" and "b" are days. I need to find the average values of "c" in each 30 day-interval (slice values inside [min(a),max(b)] in 30 days and calculate average of c). I want as a result have a dataframe like this:
aa bb c_avg
0 30 12
30 60 12
60 90 6.33
90 120 9
120 150 13
Another sample data could be:
a b c
0 1264.0 1629.0 0.000000
1 1629.0 1632.0 133.333333
6 1632.0 1699.0 0.000000
2 1699.0 1706.0 21.428571
7 1706.0 1723.0 0.000000
3 1723.0 1726.0 50.000000
8 1726.0 1890.0 0.000000
4 1890.0 1893.0 33.333333
1 1893.0 1994.0 0.000000
How can I get to the final table?
First create a ranges DataFrame covering the span of the a and b columns in 30-day steps:
import numpy as np
import pandas as pd

a = np.arange(0, 180, 30)
df1 = pd.DataFrame({'aa':a[:-1], 'bb':a[1:]})
#print (df1)
Then cross join all rows using a helper column tmp:
df3 = pd.merge(df1.assign(tmp=1), df.assign(tmp=1), on='tmp')
#print (df3)
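As a side note (not part of the original answer), on pandas 1.2+ the helper column can be skipped with a cross merge:
df3 = pd.merge(df1, df, how='cross')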
And finally filter. There are two solutions, depending on which columns are used for filtering:
df4 = df3[df3['aa'].between(df3['a'], df3['b']) | df3['bb'].between(df3['a'], df3['b'])]
print (df4)
aa bb tmp a b c
0 0 30 1 0 73 12
4 30 60 1 0 73 12
8 60 90 1 0 73 12
10 60 90 1 80 100 5
14 90 120 1 80 100 5
15 90 120 1 100 150 13
19 120 150 1 100 150 13
df4 = df4.groupby(['aa','bb'], as_index=False)['c'].mean()
print (df4)
aa bb c
0 0 30 12.0
1 30 60 12.0
2 60 90 8.5
3 90 120 9.0
4 120 150 13.0
df5 = df3[df3['a'].between(df3['aa'], df3['bb']) | df3['b'].between(df3['aa'], df3['bb'])]
print (df5)
aa bb tmp a b c
0 0 30 1 0 73 12
8 60 90 1 0 73 12
9 60 90 1 73 80 2
10 60 90 1 80 100 5
14 90 120 1 80 100 5
15 90 120 1 100 150 13
19 120 150 1 100 150 13
df5 = df5.groupby(['aa','bb'], as_index=False)['c'].mean()
print (df5)
aa bb c
0 0 30 12.000000
1 60 90 6.333333
2 90 120 9.000000
3 120 150 13.000000
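Both filters only test one range's endpoints against the other; a single overlap condition that keeps every pair of ranges sharing any days (a sketch, not part of the original answer) reproduces the desired output from the question, including 6.33 for the 60-90 bin:
# keep rows whose [a, b] range overlaps the [aa, bb] bin at all
df6 = df3[(df3['a'] < df3['bb']) & (df3['b'] > df3['aa'])]
df6 = df6.groupby(['aa','bb'], as_index=False)['c'].mean()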
I have a file that is dynamically generated (i.e., the file headers remain the same but the values change). For instance, let the file be of this form:
ID,CLASS,DATE,MRK
1,321,02/12/2016,30
2,321,05/12/2016,40
3,321,06/12/2016,0
4,321,07/12/2016,60
5,321,10/12/2016,70
6,876,5/12/2016,100
7,876,7/12/2016,80
Notice that for CLASS 321 there are some missing dates, namely 03/12/2016, 04/12/2016, 08/12/2016 and 09/12/2016. I'm trying to insert the missing dates in the appropriate places, with their corresponding MRK value set to 0. The expected output would be like so:
ID,CLASS,DATE,MRK
1,321,02/12/2016,30
2,321,03/12/2016,0
3,321,04/12/2016,0
4,321,05/12/2016,40
5,321,06/12/2016,0
6,321,07/12/2016,60
7,321,08/12/2016,0
8,321,09/12/2016,0
9,321,10/12/2016,70
10,876,5/12/2016,100
11,876,6/12/2016,0
12,876,7/12/2016,80
This is what I came up with so far:
import pandas as pd

df = pd.read_csv('In.txt')
resampled_df = df.resample('D').mean()
print(resampled_df)
But I'm getting exception:
TypeError: Only valid with DatetimeIndex, TimedeltaIndex or PeriodIndex, but got an instance of 'RangeIndex'
Could somebody help out a python newbie here?
Read your CSV like this -
import numpy as np
import pandas as pd

df = pd.read_csv('file.csv',
                 sep=',',
                 parse_dates=['DATE'],
                 dayfirst=True,  # this is important since you have days first
                 index_col=['DATE'])
Now, call groupby + resample + first, and tie up loose ends -
df = df.groupby('CLASS').resample('1D')[['MRK']].first()
df.ID = np.arange(1, len(df) + 1)
df.MRK = df.MRK.fillna(0).astype(int)
df.reset_index()
CLASS DATE ID MRK
0 321 2016-12-02 1 30
1 321 2016-12-03 2 0
2 321 2016-12-04 3 0
3 321 2016-12-05 4 40
4 321 2016-12-06 5 0
5 321 2016-12-07 6 60
6 321 2016-12-08 7 0
7 321 2016-12-09 8 0
8 321 2016-12-10 9 70
9 876 2016-12-05 10 100
10 876 2016-12-06 11 0
11 876 2016-12-07 12 80
In particular, MRK needs fillna. The rest can be forward filled.
If the order of columns is important, here's another version.
df = pd.read_csv('file.csv',
                 sep=',',
                 parse_dates=['DATE'],
                 dayfirst=True)
c = df.columns
df = df.set_index('DATE').groupby('CLASS').resample('1D')[['MRK']].first()
df['MRK'] = df.MRK.fillna(0).astype(int)
df['ID'] = np.arange(1, len(df) + 1)
df = df.reset_index().reindex(columns=c)
df['DATE'] = df['DATE'].dt.strftime('%d/%m/%Y')
df
ID CLASS DATE MRK
0 1 321 02/12/2016 30
1 2 321 03/12/2016 0
2 3 321 04/12/2016 0
3 4 321 05/12/2016 40
4 5 321 06/12/2016 0
5 6 321 07/12/2016 60
6 7 321 08/12/2016 0
7 8 321 09/12/2016 0
8 9 321 10/12/2016 70
9 10 876 05/12/2016 100
10 11 876 06/12/2016 0
11 12 876 07/12/2016 80
First convert to datetimes, then group by CLASS and resample, and last add the ID column with insert:
df['DATE'] = pd.to_datetime(df['DATE'], dayfirst=True)
df = (df.set_index('DATE')
        .groupby('CLASS')
        .resample('d')['MRK']
        .asfreq()
        .fillna(0)
        .astype(int)
        .reset_index())
df.insert(0, 'ID', range(1, len(df) + 1))
print (df)
ID CLASS DATE MRK
0 1 321 2016-12-02 30
1 2 321 2016-12-03 0
2 3 321 2016-12-04 0
3 4 321 2016-12-05 40
4 5 321 2016-12-06 0
5 6 321 2016-12-07 60
6 7 321 2016-12-08 0
7 8 321 2016-12-09 0
8 9 321 2016-12-10 70
9 10 876 2016-12-05 100
10 11 876 2016-12-06 0
11 12 876 2016-12-07 80
Alternative solution:
df = (df.set_index('DATE')
        .groupby('CLASS')
        .resample('d')['MRK']
        .first()
        .fillna(0)
        .astype(int)
        .reset_index())
df.insert(0, 'ID', range(1, len(df) + 1))
print (df)
ID CLASS DATE MRK
0 1 321 2016-12-02 30
1 2 321 2016-12-03 0
2 3 321 2016-12-04 0
3 4 321 2016-12-05 40
4 5 321 2016-12-06 0
5 6 321 2016-12-07 60
6 7 321 2016-12-08 0
7 8 321 2016-12-09 0
8 9 321 2016-12-10 70
9 10 876 2016-12-05 100
10 11 876 2016-12-06 0
11 12 876 2016-12-07 80
Last, to get the same format as the input, use strftime:
df['DATE'] = df['DATE'].dt.strftime('%d/%m/%Y')
print (df)
ID CLASS DATE MRK
0 1 321 02/12/2016 30
1 2 321 03/12/2016 0
2 3 321 04/12/2016 0
3 4 321 05/12/2016 40
4 5 321 06/12/2016 0
5 6 321 07/12/2016 60
6 7 321 08/12/2016 0
7 8 321 09/12/2016 0
8 9 321 10/12/2016 70
9 10 876 05/12/2016 100
10 11 876 06/12/2016 0
11 12 876 07/12/2016 80
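For readers who want to see what the resample is doing step by step, a plain per-class reindex builds the same rows (a rough sketch, not from the answers above; it assumes df is the original frame with DATE already parsed to datetime):
import pandas as pd

out = []
for cls, g in df.groupby('CLASS'):
    # full daily calendar between this class's first and last date
    full = pd.date_range(g['DATE'].min(), g['DATE'].max(), freq='D')
    filled = (g.set_index('DATE')['MRK']
                .reindex(full, fill_value=0)   # missing days get MRK = 0
                .rename_axis('DATE')
                .reset_index())
    filled.insert(0, 'CLASS', cls)
    out.append(filled)

out = pd.concat(out, ignore_index=True)
out.insert(0, 'ID', range(1, len(out) + 1))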