I have a dataframe that contains NaN values and I want to fill the missing data using information of the same month.
The dataframe looks like this:
import pandas as pd

data = {'x': [208.999, -894.0, -171.0, 108.999, -162.0, -29.0, -143.999, -133.0, -900.0],
        'e': [0.105, 0.209, 0.934, 0.150, 0.158, '', 0.333, 0.089, 0.189],
        }
df = pd.DataFrame(data, index=['2020-01-01', '2020-02-01', '2020-03-01',
                               '2020-01-01', '2020-02-01', '2020-03-01',
                               '2020-01-01', '2020-02-01', '2020-03-01'])
df.index = pd.to_datetime(df.index)
df['e'] = df['e'].apply(pd.to_numeric, errors='coerce')
Now I'm using df = df.fillna(df['e'].mean()) to fill the NaN value, but it takes all of the column's data and gives me 0.27. Is there a way to use only the data of the same month? The result should be 0.56.
Try grouping on index.month, taking the mean (with transform), then fillna:
df.index = pd.to_datetime(df.index)
out = df.fillna({'e':df.groupby(df.index.month)['e'].transform('mean')})
print(out)
x e
2020-01-01 208.999 0.1050
2020-02-01 -894.000 0.2090
2020-03-01 -171.000 0.9340
2020-01-01 108.999 0.1500
2020-02-01 -162.000 0.1580
2020-03-01 -29.000 0.5615
2020-01-01 -143.999 0.3330
2020-02-01 -133.000 0.0890
2020-03-01 -900.000 0.1890
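A small variant (my own note, not part of the answer above): the same fill can be written against just that column:
df['e'] = df['e'].fillna(df.groupby(df.index.month)['e'].transform('mean'))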
Maybe you could use interpolate() instead of fillna(), but you have to sort the index first, i.e.:
df.e.sort_index().interpolate()
Output:
2020-01-01 0.1050
2020-01-01 0.1500
2020-01-01 0.3330
2020-02-01 0.2090
2020-02-01 0.1580
2020-02-01 0.0890
2020-03-01 0.9340
2020-03-01 0.5615
2020-03-01 0.1890
Name: e, dtype: float64
By default linear interpolation is used, so with a single NaN you get the mean of its neighbours, and the missing value is replaced by 0.5615 as you expected. However, if the NaN were the first sample of a month after sorting, the result would be the mean of the previous month's last value and the current month's next value. On the other hand, it does work when a whole month is NaN and there is nothing within that month to average, so depending on how strict you are about the same-month requirement and how your missing values are spread across the dataframe, this solution may or may not be acceptable.
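If you do want interpolation but strictly within the same month, one possible variant (my own sketch, not part of the answer above) is to sort and then interpolate inside each month group:
s = df['e'].sort_index()
# interpolate only inside each calendar month; a month that is entirely NaN stays NaN,
# and a NaN at the start or end of a month is left untouched unless you pass limit_direction
out = s.groupby(s.index.month).transform(lambda g: g.interpolate())
print(out)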
How do you groupby on consecutive blocks of rows where each block is separated by a threshold value?
I have the following sample pandas DataFrame, and I'm having difficulty splitting it into blocks of rows where consecutive blocks are separated by a gap of more than 365 days.
Date        Data
2019-01-01  A
2019-05-01  B
2020-04-01  C
2021-07-01  D
2022-02-01  E
2024-05-01  F
The output I'm looking for is the following:
Min Date    Max Date    Data
2019-01-01  2020-04-01  ABC
2021-07-01  2022-02-01  DE
2024-05-01  2024-05-01  F
I was looking at pandas .diff() and .cumsum() to get the number of days between two rows and filter for rows with a difference > 365 days; however, that doesn't work when the dataframe has multiple blocks of rows.
I would also suggest .diff() and .cumsum():
import pandas as pd
df = pd.read_clipboard()
df["Date"] = pd.to_datetime(df["Date"])
blocks = df["Date"].diff().gt("365D").cumsum()
out = df.groupby(blocks).agg({"Date": ["min", "max"], "Data": "sum"})
out:
Date Data
min max sum
Date
0 2019-01-01 2019-05-01 AB
1 2020-06-01 2020-06-01 C
2 2021-07-01 2022-02-01 DE
3 2024-05-01 2024-05-01 F
after which you can replace the column labels (now a 2-level MultiIndex) as appropriate.
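For example, one way to get the labels from the expected output (just a guess at the desired layout):
# flatten the 2-level column MultiIndex and drop the group index
out.columns = ["Min Date", "Max Date", "Data"]
out = out.reset_index(drop=True)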
The date belonging to data "C" is more than 365 days apart from both "B" and "D", so it got its own group. Or am I misunderstanding your expected output?
I am trying to calculate the number of days that have elapsed since the launch of a marketing campaign. I have one row per date for each marketing campaign in my DataFrame (df) and all dates start from the same day (though there is not a data point for each day for each campaign). In column 'b' I have the date relating to the data points of interest (datetime64[ns]) and in column 'c' I have the launch date of the marketing campaign (datetime64[ns]). I would like the resulting calculation to return n/a (or np.NaN or a suitable alternative) when column 'b' is earlier than column 'c', else I would like the calculation to return the difference between the two dates.
Campaign  Date        Launch Date  Desired Column
A         2019-09-01  2022-12-01   n/a
A         2019-09-02  2022-12-01   n/a
B         2019-09-01  2019-09-01   0
B         2019-09-25  2019-09-01   24
When I try:
df['Days Since Launch'] = df['Date'] - df['Launch Date']
What I would hope returns a negative value actually returns a positive one, thus leading to duplicate values when I have dates that are 10 days prior and 10 days after the launch date.
When I try:
df['Days Since Launch'] = np.where(df['Date'] < df['Launch Date'], XXX, df['Date'] - df['Launch Date'])
Where XXX has to be the same data type as the two input columns, so I can't enter np.NaN because the calculation will fail, nor can I enter a date as this will still leave the same issue that I want to solve. IF statements do not work as the "truth value of a Series is ambiguous". Any ideas?
You can use a direct subtraction and conversion to days with dt.days, then mask the negative values with where:
s = pd.to_datetime(df['Date']).sub(pd.to_datetime(df['Launch Date'])).dt.days
# or, if already datetime:
#s = df['Date'].sub(df['Launch Date']).dt.days
df['Desired Column'] = s.where(s.ge(0))
Alternative closer to your initial attempt, using mask:
df['Desired Column'] = (df['Date'].sub(df['Launch Date']).dt.days
                        .mask(df['Date'] < df['Launch Date'])
                        )
Output:
Campaign Date Launch Date Desired Column
0 A 2019-09-01 2022-12-01 NaN
1 A 2019-09-02 2022-12-01 NaN
2 B 2019-09-01 2019-09-01 0.0
3 B 2019-09-25 2019-09-01 24.0
Add Series.dt.days to convert timedeltas to days:
df['Days Since Launch'] = np.where(df['Date'] < df['Launch Date'],
                                   np.nan,
                                   (df['Date'] - df['Launch Date']).dt.days)
print (df)
Campaign Date Launch Date Desired Column Days Since Launch
0 A 2019-09-01 2022-12-01 NaN NaN
1 A 2019-09-02 2022-12-01 NaN NaN
2 B 2019-09-01 2019-09-01 0.0 0.0
3 B 2019-09-25 2019-09-01 24.0 24.0
Another alternative:
df["Date"] = pd.to_datetime(df["Date"])
df["Launch Date"] = pd.to_datetime(df["Launch Date"])
df["Desired Column"] = df.apply(lambda x: x["Date"] - x["Launch Date"] if x["Date"] >= x["Launch Date"] else None, axis=1)
I have a sales dataset (simplified) with sales from existing customers (first_order = 0)
import pandas as pd
import datetime as dt
df = pd.DataFrame({'Date': ['2020-06-30 00:00:00', '2020-05-05 00:00:00', '2020-04-10 00:00:00', '2020-02-26 00:00:00'],
                   'email': ['1#abc.de', '2#abc.de', '3#abc.de', '1#abc.de'],
                   'first_order': [1, 1, 1, 1],
                   'Last_Order_Date': ['2020-06-30 00:00:00', '2020-05-05 00:00:00', '2020-04-10 00:00:00', '2020-02-26 00:00:00']
                   })
I would like to analyze how many existing customers we lose per month.
my idea is to
group(count) by month and
then count how many made their last purchase in the following months, which gives me a churn cross table where I can see that, e.g., we had 300 purchases in January and 10 of them bought for the last time in February.
like this:
Column B is the total number of repeating customers, and columns C and onwards show the last month they bought something.
E.g. we had 2400 customers in January, 677 of them made their last purchase in this month, 203 more followed in February etc.
I guess I could first group the total number of sales per month and then group a second dataset by Last_Order_Date and filter by month.
But I guess there is a handy Python way for this?! :)
Any ideas?
Thanks!
The code below shows how many purchases were made in each month:
df['Date'] = pd.to_datetime(df['Date'])
df = df.set_index('Date')
df.groupby(pd.Grouper(freq="M")).size()
Output:
Date
2020-02-29 1
2020-03-31 0
2020-04-30 1
2020-05-31 1
2020-06-30 1
Freq: M, dtype: int64
I couldn't quite work out the required output from the data and explanation, but this could be a starting point. Please let me know if it helped you in any way.
Update-1:
df.pivot_table(index='Date', columns='email', values='Last_Order_Date', aggfunc='count')
Output:
email 1#abc.de 2#abc.de 3#abc.de
Date
2020-02-26 00:00:00 1.0 NaN NaN
2020-04-10 00:00:00 NaN NaN 1.0
2020-05-05 00:00:00 NaN 1.0 NaN
2020-06-30 00:00:00 1.0 NaN NaN
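If the goal is the cross table itself (order month in the rows, month of the customer's last order in the columns), one possible sketch, assuming Last_Order_Date already holds each customer's final purchase date, is pd.crosstab on the two month periods:
import pandas as pd

# rebuilding the sample frame from the question (before Date was set as the index above)
df = pd.DataFrame({'Date': ['2020-06-30', '2020-05-05', '2020-04-10', '2020-02-26'],
                   'email': ['1#abc.de', '2#abc.de', '3#abc.de', '1#abc.de'],
                   'Last_Order_Date': ['2020-06-30', '2020-05-05', '2020-04-10', '2020-02-26']})

# rows: month of the purchase, columns: month of the last purchase, values: counts
churn = pd.crosstab(pd.to_datetime(df['Date']).dt.to_period('M'),
                    pd.to_datetime(df['Last_Order_Date']).dt.to_period('M'))
print(churn)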
I am having some trouble managing and combining columns in order to get one datetime column out of three columns containing the date, the hours and the minutes.
Assume the following df (copy it and run df = pd.read_clipboard() to reproduce) with the types as noted below:
>>>df
date hour minute
0 2021-01-01 7.0 15.0
1 2021-01-02 3.0 30.0
2 2021-01-02 NaN NaN
3 2021-01-03 9.0 0.0
4 2021-01-04 4.0 45.0
>>>df.dtypes
date object
hour float64
minute float64
dtype: object
I want to replace the three columns with one called 'datetime' and I have tried a few things but I face the following problems:
I first create a 'time' column with df['time'] = (pd.to_datetime(df['hour'], unit='h') + pd.to_timedelta(df['minute'], unit='m')).dt.time and then try to concatenate it with the 'date' via df['datetime'] = df['date'] + ' ' + df['time'] (with the aim of then converting the 'datetime' column with pd.to_datetime(df['datetime'])). However, I get
TypeError: can only concatenate str (not "datetime.time") to str
If I convert 'hour' and 'minute' to str to concatenate the three columns to 'datetime', then I face the problem with the NaN values, which prevents me from converting the 'datetime' to the corresponding type.
I have also tried to first convert the 'date' column df['date']= df['date'].astype('datetime64[ns]') and again create the 'time' column df['time']= (pd.to_datetime(df['hour'], unit='h') + pd.to_timedelta(df['minute'], unit='m')).dt.time to combine the two: df['datetime']= pd.datetime.combine(df['date'],df['time']) and it returns
TypeError: combine() argument 1 must be datetime.date, not Series
along with the warning
FutureWarning: The pandas.datetime class is deprecated and will be removed from pandas in a future version. Import from datetime module instead.
Is there a generic solution to combine the three columns and ignore the NaN values (assume it could return 00:00:00)?
What if I have a row with all NaN values? Would it be possible to ignore all NaNs and have 'datetime' be NaN for this row?
Thank you in advance, ^_^
First convert date to datetimes and then add the hour and minute timedeltas, replacing missing values with a 0 timedelta:
td = pd.Timedelta(0)
df['datetime'] = (pd.to_datetime(df['date']) +
                  pd.to_timedelta(df['hour'], unit='h').fillna(td) +
                  pd.to_timedelta(df['minute'], unit='m').fillna(td))
print (df)
date hour minute datetime
0 2021-01-01 7.0 15.0 2021-01-01 07:15:00
1 2021-01-02 3.0 30.0 2021-01-02 03:30:00
2 2021-01-02 NaN NaN 2021-01-02 00:00:00
3 2021-01-03 9.0 0.0 2021-01-03 09:00:00
4 2021-01-04 4.0 45.0 2021-01-04 04:45:00
Or you can use Series.add with fill_value=0:
df['datetime'] = (pd.to_datetime(df['date'])
                    .add(pd.to_timedelta(df['hour'], unit='h'), fill_value=0)
                    .add(pd.to_timedelta(df['minute'], unit='m'), fill_value=0))
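Regarding the follow-up about a row where everything is missing: pd.to_datetime turns a missing date into NaT, and NaT propagates through the additions above, so such a row comes out as NaT without any extra handling. A minimal check on a hypothetical one-row frame:
import numpy as np
import pandas as pd

row = pd.DataFrame({'date': [np.nan], 'hour': [np.nan], 'minute': [np.nan]})
td = pd.Timedelta(0)
# the missing date becomes NaT, and NaT plus any timedelta stays NaT
print(pd.to_datetime(row['date'])
      + pd.to_timedelta(row['hour'], unit='h').fillna(td)
      + pd.to_timedelta(row['minute'], unit='m').fillna(td))
# 0   NaT
# dtype: datetime64[ns]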
I would recommend converting hour and minute columns to string and constructing the datetime string from the provided components.
Logically, you need to perform the following steps:
Step 1. Fill missing values for hour and minute with zeros.
df['hour'] = df['hour'].fillna(0)
df['minute'] = df['minute'].fillna(0)
Step 2. Convert float values for hour and minute into integer ones, because your final output should look like 2021-01-01 7:15, not 2021-01-01 7.0:15.0.
df['hour'] = df['hour'].astype(int)
df['minute'] = df['minute'].astype(int)
Step 3. Convert integer values for hour and minute to the string representation.
df['hour'] = df['hour'].astype(str)
df['minute'] = df['minute'].astype(str)
Step 4. Concatenate date, hour and minute into one column of the correct format.
df['result'] = df['date'].str.cat(df['hour'].str.cat(df['minute'], sep=':'), sep=' ')
Step 5. Convert your result column to datetime object.
pd.to_datetime(df['result'])
It is also possible to perform all of these steps in one command, though it will read a bit messy:
df['result'] = pd.to_datetime(df['date'].str.cat(df['hour'].fillna(0).astype(int).astype(str).str.cat(df['minute'].fillna(0).astype(int).astype(str), sep=':'), sep=' '))
Result:
date hour minute result
0 2021-01-01 7.0 15.0 2021-01-01 07:15:00
1 2021-01-02 3.0 30.0 2021-01-02 03:30:00
2 2021-01-02 NaN NaN 2021-01-02 00:00:00
3 2021-01-03 9.0 0.0 2021-01-03 09:00:00
4 2021-01-04 4.0 45.0 2021-01-04 04:45:00
So I wanted to downsample my data using the ffill method.
I have this data:
2020-01-01 1.248310e+06
2021-01-01 1.259511e+06
2022-01-01 1.276312e+06
2023-01-01 1.298714e+06
The output should be:
2020-01-01 1.248310e+06
2020-02-01 1.248310e+06
2020-03-01 1.248310e+06
.... ...
2023-10-01 1.298714e+06
2023-11-01 1.298714e+06
2023-12-01 1.298714e+06
Here is what I tried
down_sampling = df.resample('MS', fill_method= 'ffill')
I get something like:
2020-01-01 1.248310e+06
2020-02-01 1.248310e+06
2020-03-01 1.248310e+06
.... ...
2022-11-01 1.276312e+06
2022-12-01 1.276312e+06
2023-01-01 1.298714e+06
The problem here is that the year 2023 has only one month.
Can you suggest any idea on how to fix it?
Thank you.
You can do it like this:
index = pd.date_range('1/1/2020', periods=4, freq='YS')
series = pd.Series([1.248310e+06, 1.259511e+06, 1.276312e+06, 1.298714e+06], index=index)
series2 = pd.Series(1.298714e+06, pd.date_range('12/1/2023', periods=1))
series = pd.concat([series, series2])  # Series.append was removed in pandas 2.0
down_sampling = series.resample('MS').ffill()
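A variant of the same idea that avoids the dummy point (my own sketch, assuming the target end date 2023-12-01 is known) is to build the monthly index explicitly and reindex the original annual series with a forward fill:
# original annual series from above (without series2 appended)
annual = pd.Series([1.248310e+06, 1.259511e+06, 1.276312e+06, 1.298714e+06],
                   index=pd.date_range('1/1/2020', periods=4, freq='YS'))
idx = pd.date_range(annual.index.min(), '2023-12-01', freq='MS')
down_sampling = annual.reindex(idx, method='ffill')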
A hacky but pythonic solution:
pd.concat([df, df.iloc[[-1]].set_index(df.iloc[[-1]].index.shift(1, freq="D"))]).resample("H").ffill()[:-1]
It picks the last row (as a df: df.iloc[[-1]]),
increases its index by one step (here 1 day: .index.shift(1, freq="D")),
then resamples: .resample("H").ffill()
And removes the single dummy row at the end of df ([:-1])
I'd actually have expected the closed= parameter of resample to do the job.