This is my first question on Stack Overflow, so the formatting might be a bit off. I have a problem that I know how to solve with a for loop in Python. However, I don't know if there is a way to do the same thing faster in pandas itself.
Problem:
Suppose I have a pandas Series 'in' with a date index, where every date has an integer value. There is also a Series 'out' with the same structure.
Ex:
in
date val
2022-12-01 5
2022-12-02 8
2022-12-03 19
out
date val
2022-12-01 3
2022-12-02 7
2022-12-03 21
If I want to make a Series of the number of events being processed each day, I could do it with a for loop where the value for every day is open.iloc[i] = open.iloc[i-1] + in.iloc[i] - out.iloc[i] (a minimal version of that loop is sketched after the expected output below). The result should be
open
date val
2022-12-01 2 #5-3
2022-12-02 3 #2+8-7
2022-12-03 1 #3+19-21
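For reference, a minimal version of the loop I have in mind, using the names ser_in and ser_out since in and open would shadow a Python keyword and a builtin:

import pandas as pd

ser_in = pd.Series([5, 8, 19], index=['2022-12-01', '2022-12-02', '2022-12-03'])
ser_out = pd.Series([3, 7, 21], index=['2022-12-01', '2022-12-02', '2022-12-03'])

# carry yesterday's open count forward, add today's arrivals, subtract today's departures
ser_open = pd.Series(0, index=ser_in.index)
prev = 0
for i in range(len(ser_in)):
    prev = prev + ser_in.iloc[i] - ser_out.iloc[i]
    ser_open.iloc[i] = prev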
Is there a way to do this in pandas itself, without the need for a for loop?
new answer
Use cumsum: since open[i] = open[i-1] + (in[i] - out[i]), the recurrence telescopes into the cumulative sum of in - out.
ser_open = ser_in.sub(ser_out).cumsum()
Output:
2022-12-01 2
2022-12-02 3
2022-12-03 1
dtype: int64
Used input:
ser_in = pd.Series([5, 8, 19], index=['2022-12-01', '2022-12-02', '2022-12-03'])
ser_out = pd.Series([3, 7, 21], index=['2022-12-01', '2022-12-02', '2022-12-03'])
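As a quick check with the inputs above: ser_in - ser_out is (2, 1, -2), and its cumulative sum is (2, 3, 1), which matches the loop result.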
initial answer
Use shift after setting date as index:
out = (df_open
.set_index('date').shift()
.add(df_in.set_index('date')-df_out.set_index('date'),
fill_value=0
)
.reset_index()
)
Or, for assignment, use a variant with map:
df_open['val'] = df_open['date'].map(df_open
.set_index('date').shift()
.add(df_in.set_index('date')-df_out.set_index('date'),
fill_value=0
)['val']
)
Output:
date val
0 2022-12-01 2.0
1 2022-12-02 3.0
2 2022-12-03 1.0
Used inputs:
df_in = pd.DataFrame({'date': ['2022-12-01', '2022-12-02', '2022-12-03'], 'val': [5, 8, 19]})
df_out = pd.DataFrame({'date': ['2022-12-01', '2022-12-02', '2022-12-03'], 'val': [3, 7, 21]})
df_open = pd.DataFrame({'date': ['2022-12-01', '2022-12-02', '2022-12-03'], 'val': [2.0, 3.0, 1.0]})
Related
I am trying to use rolling().sum() to create a dataframe with 2-month rolling sums within each 'type'. Here's what my data looks like:
import pandas as pd
df = pd.DataFrame({'type': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'C', 'C', 'C', 'C'],
'date': ['2022-01-01', '2022-02-01', '2022-03-01', '2022-04-01',
'2022-01-01', '2022-02-01', '2022-03-01', '2022-04-01',
'2022-01-01', '2022-02-01', '2022-03-01', '2022-04-01'],
'value': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]})
and here is the expected result (it matches the output shown in the answer below):
and here is what I have tried (unsuccessfully):
rolling_sum = df.groupby(['date', 'type']).rolling(2).sum().reset_index()
Here's a way to do it:
rolling_sum = (
df.assign(value=df.groupby(['type'])['value']
.rolling(2, min_periods=1).sum().reset_index()['value'])
)
Output:
type date value
0 A 2022-01-01 1.0
1 A 2022-02-01 3.0
2 A 2022-03-01 5.0
3 A 2022-04-01 7.0
4 B 2022-01-01 5.0
5 B 2022-02-01 11.0
6 B 2022-03-01 13.0
7 B 2022-04-01 15.0
8 C 2022-01-01 9.0
9 C 2022-02-01 19.0
10 C 2022-03-01 21.0
11 C 2022-04-01 23.0
Explanation:
Use groupby() only on type (without date), so that all dates are in the group for a given type.
The min_periods argument ensures rolling works even for the first row, where 2 periods are not available.
Use assign() to update the value column using index alignment (see the sketch below).
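To make the alignment visible, here is a small check using the df from the question; the intermediate result carries a (type, original row) MultiIndex, and reset_index() flattens it back to a RangeIndex so ['value'] lines up with df by position:

inner = df.groupby(['type'])['value'].rolling(2, min_periods=1).sum()
print(inner.index[:3].tolist())
# [('A', 0), ('A', 1), ('A', 2)] -- group key plus the original row index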
Answer
The code below does the job.
rolling_sum = df.groupby(['type']).rolling(2).sum()
You're pretty close -- just need to specify the level and set drop=True on reset_index(). Also, you can remove date from the groupby, since the grouping should be on type only. You should use sort_values to ensure the input dataframe is in the correct order for the rolling sum.
df = df.sort_values(by=['type', 'date'])
df['rolling_sum'] = df.groupby(['type']).rolling(2).sum().reset_index(level=0, drop=True)
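In case it helps, this is what reset_index(level=0, drop=True) is doing, sketched with an explicit column selection (one way to write it, not the only one):

inner = df.groupby(['type'])['value'].rolling(2).sum()
# inner is indexed by ('A', 0), ('A', 1), ..., ('C', 11); drop the 'type' level
inner = inner.reset_index(level=0, drop=True)
# now indexed 0..11, so it aligns with df row for row on assignment
df['rolling_sum'] = inner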
Let's say I have my main DataFrame.
df = pd.DataFrame({'ID': [1,1,1,2,2,2,3,3,3],
'Date': ['2021-01-01', '2021-01-02', '2021-01-03', '2021-01-01', '2021-01-02', '2021-01-03','2021-01-01', '2021-01-02', '2021-01-03'] ,
'Values': [11, np.nan, np.nan, 13, np.nan, np.nan, 15, np.nan, np.nan],
'Random_Col': [0,0,0,0,0,0,0,0,0]})
I want to fill the np.nan values with values from another dataframe that is not the same shape. The values have to match on "ID" and "Date".
new_df = pd.DataFrame({'ID': [1,1,2,2,3,3],
'Date': ['2021-01-02', '2021-01-03', '2021-01-02', '2021-01-03','2021-01-02','2021-01-03'],
'Values': [16, 19, 14, 14, 19, 18]})
What's the best way to do this?
I experimented with df.update(), but I'm not sure that works, since the dataframes do not have the same number of rows. Am I wrong about this?
I could also use pd.merge(), but then I end up with multiple versions of each column and have to .fillna() each original column from its duplicate holding the new values. This would be fine if I only had 1 column of data to do this for, but I have dozens.
Is there a simpler way that I haven't considered?
One option is to merge + sort_index + bfill to fill the missing data in df, then reindex with df.columns. Since '\x00' has the lowest character value, sorting the columns places each suffixed duplicate right next to its original.
out = (df.merge(new_df, on=['ID','Date'], how='left', suffixes=('','\x00'))
.sort_index(axis=1).bfill(axis=1)[df.columns])
Output:
ID Date Values Random_Col
0 1 2021-01-01 11.0 0
1 1 2021-01-02 16.0 0
2 1 2021-01-03 19.0 0
3 2 2021-01-01 13.0 0
4 2 2021-01-02 14.0 0
5 2 2021-01-03 14.0 0
6 3 2021-01-01 15.0 0
7 3 2021-01-02 19.0 0
8 3 2021-01-03 18.0 0
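As an aside, on the df.update() part of the question: update aligns on the index rather than on shape, so it can work here once both frames share an (ID, Date) index, and with overwrite=False it only fills values that are NA in the original. A sketch:

left = df.set_index(['ID', 'Date'])
left.update(new_df.set_index(['ID', 'Date']), overwrite=False)
out = left.reset_index()

Since update matches on column names, this also scales to dozens of value columns.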
I am trying to filter multiple dataframes at once by a specific date range (for this example, January 2 to January 4). I know you can filter a dataframe by date using the following code: df = (df['Date'] > 'start-date') & (df['Date'] < 'end-date'). However, when I created a list of dataframes and tried to loop over them, I got back the original dataframe with the original date range.
Any suggestions? I have provided some example code below:
d1 = {'Date': ['2021-01-01', '2021-01-02', '2021-01-03', '2021-01-04', '2021-01-05'],
'A': [1, 2, 3, 4, 5],
'B': [6, 7, 8, 9, 10]
}
df1 = pd.DataFrame(data=d1)
d2 = {'Date': ['2021-01-01', '2021-01-02', '2021-01-03', '2021-01-04', '2021-01-05'],
'C': [11, 12, 13, 14, 15],
'D': [16, 17, 18, 19, 20]
}
df2 = pd.DataFrame(data=d2)
df_list = [df1, df2]
for df in df_list:
    df = (df['Date'] > '2021-01-01') & (df['Date'] < '2021-01-05')
df1
Output:
Date A B
0 2021-01-01 1 6
1 2021-01-02 2 7
2 2021-01-03 3 8
3 2021-01-04 4 9
4 2021-01-05 5 10
I have tried various ways to filter, such as .loc, writing functions, and creating a mask, but still can't get it to work. Another thing to note: I am doing more formatting as part of this loop, and all the other formatting steps are applied except this one. Any help is greatly appreciated! Thanks!
The issue here is that you're simply reassigning the variable df in your for loop without actually writing the result back into df_list.
This solves your issue:
df_list = [df1, df2]
output_list = []
for df in df_list:
    df_filter = (df['Date'] > '2021-01-01') & (df['Date'] < '2021-01-05')
    output_list.append(df.loc[df_filter])
output_list now contains the filtered dataframes.
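The same thing as a list comprehension, if you prefer:

output_list = [df.loc[(df['Date'] > '2021-01-01') & (df['Date'] < '2021-01-05')]
               for df in df_list]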
As #anky mentioned, you need to convert your 'Date' columns to datetime. In addition, your for loop produces a boolean mask; you need to use it to select rows.
[...]
df1['Date'] = pd.to_datetime(df1['Date'])
df2['Date'] = pd.to_datetime(df2['Date'])
df_list_filtered = []
for df in df_list:
    df = df[(df['Date'] > '2021-01-01') & (df['Date'] < '2021-01-05')]
    df_list_filtered.append(df)
[...]
print(df_list_filtered[0])
Date A B
1 2021-01-02 2 7
2 2021-01-03 3 8
3 2021-01-04 4 9
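As an aside, Series.between gives a more compact mask; it is inclusive on both ends by default, so with daily data the equivalent bounds are the 2nd and the 4th:

df_list_filtered = [df[df['Date'].between('2021-01-02', '2021-01-04')]
                    for df in df_list]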
There is a huge dataframe containing multiple data types in different columns. I want to find rows that contain date values in different columns.
Here is a test dataframe:
import numpy as np
import pandas as pd
from datetime import datetime

dt = pd.Series(['abc', datetime.now(), 12, '', None, np.nan, '2020-05-05'])
dt1 = pd.Series([3, datetime.now(), 'sam', '', np.nan, 'abc-123', '2020-05-25'])
dt3 = pd.Series([1,2,3,4,5,6,7])
df = pd.DataFrame({"A":dt.values, "B":dt1.values, "C":dt3.values})
Now, I want to create a new dataframe that contains only the rows where both columns A and B hold dates, here the 2nd and last rows.
Expected output:
A B C
1 2020-06-01 16:58:17.274311 2020-06-01 17:13:20.391394 2
6 2020-05-05 2020-05-25 7
What is the best way to do that? Thanks.
P.S.> Dates can be in any standard format.
Use:
m = df[['A', 'B']].transform(pd.to_datetime, errors='coerce').isna().any(axis=1)
df = df[~m]
Result:
# print(df)
A B C
1 2020-06-01 17:54:16.377722 2020-06-01 17:54:16.378432 2
6 2020-05-05 2020-05-25 7
A solution that tests only the A and B columns is boolean indexing with DataFrame.notna and DataFrame.all, so that rows with any non-datetime value are dropped:
df = df[df[['A','B']].apply(pd.to_datetime, errors='coerce').notna().all(axis=1)]
print (df)
A B C
1 2020-06-01 16:14:35.020855 2020-06-01 16:14:35.021855 2
6 2020-05-05 2020-05-25 7
import numpy as np
import pandas as pd
from datetime import datetime
dt = pd.Series(['abc', datetime.now(), 12, '', None, np.nan, '2020-05-05'])
dt1 = pd.Series([3, datetime.now(), 'sam', '', np.nan, 'abc-123', '2020-05-25'])
dt3 = pd.Series([1,2,3,4,5,6,7])
df = pd.DataFrame({"A":dt.values, "B":dt1.values, "C":dt3.values})
m = pd.concat([pd.to_datetime(df['A'], errors='coerce'),
pd.to_datetime(df['B'], errors='coerce')], axis=1).isna().all(axis=1)
print(df[~m])
Prints:
A B C
1 2020-06-01 12:17:51.320286 2020-06-01 12:17:51.320826 2
6 2020-05-05 2020-05-25 7
I have a DataFrame in pandas with a letter and two dates as columns. I would like to calculate the difference between the two date columns for the previous row using shift(1), provided that the Letter value is the same (using a groupby). The complex part is that I would like to calculate business days, not just elapsed days. The best way I have found to do that is numpy.busday_count, which takes two lists as arguments. I am essentially trying to use .apply to make each column its own list. Not sure if this is the best way to do it, but I am running into some problems, and the errors are ambiguous.
import pandas as pd
from datetime import datetime
import numpy as np
# create dataframe
df = pd.DataFrame(data=[['A', datetime(2016,01,07), datetime(2016,01,09)],
['A', datetime(2016,03,01), datetime(2016,03,8)],
['B', datetime(2016,05,01), datetime(2016,05,10)],
['B', datetime(2016,06,05), datetime(2016,06,07)]],
columns=['Letter', 'First Day', 'Last Day'])
# convert to dates since pandas reads them in as time series
df['First Day'] = df['First Day'].apply(lambda x: x.to_datetime().date())
df['Last Day'] = df['Last Day'].apply(lambda x: x.to_datetime().date())
df['Gap'] = (df.groupby('Letter')
.apply(
lambda x: (
np.busday_count(x['First Day'].shift(1).tolist(),
x['Last Day'].shift(1).tolist())))
.reset_index(drop=True))
print df
I get the following error on the lambda function. I'm not sure which object it's having problems with, as the two arguments passed should be dates:
ValueError: Could not convert object to NumPy datetime
Desired Output:
Letter First Day Last Day Gap
0 A 2016-01-07 2016-01-09 NAN
1 A 2016-03-01 2016-03-08 1
2 B 2016-05-01 2016-05-10 NAN
3 B 2016-06-05 2016-06-07 7
The following should work (first removing the leading zeros from the date digits, which make them invalid literals). The key change is computing busday_count first and shifting the result afterwards: shifting the date columns first puts a missing value in each group's first row, which np.busday_count cannot convert, hence the ValueError.
df = pd.DataFrame(data=[['A', datetime(2016, 1, 7), datetime(2016, 1, 9)],
['A', datetime(2016, 3, 1), datetime(2016, 3, 8)],
['B', datetime(2016, 5, 1), datetime(2016, 5, 10)],
['B', datetime(2016, 6, 5), datetime(2016, 6, 7)]],
columns=['Letter', 'First Day', 'Last Day'])
df['Gap'] = (df.groupby('Letter')
               .apply(lambda x: pd.DataFrame(
                   np.busday_count(x['First Day'].tolist(),
                                   x['Last Day'].tolist())).shift())
               .reset_index(drop=True))
Output:
Letter First Day Last Day Gap
0 A 2016-01-07 2016-01-09 NaN
1 A 2016-03-01 2016-03-08 2.0
2 B 2016-05-01 2016-05-10 NaN
3 B 2016-06-05 2016-06-07 6.0
I don't think you need the .date() conversion.
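For what it's worth, np.busday_count also accepts datetime64[D] arrays, so instead of tolist() you could convert the columns directly, assuming they are left as datetime64 dtype (i.e. skipping the .date() conversion):

import numpy as np

first = df['First Day'].to_numpy().astype('datetime64[D]')
last = df['Last Day'].to_numpy().astype('datetime64[D]')
print(np.busday_count(first, last))  # [2 5 6 1] for the data above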