I have a DataFrame with two date columns, each row describing a disjoint interval of time. I am trying to produce a series whose index contains every date from the minimum to the maximum date in the original columns, with value 1 for each date that falls inside one of the original intervals. From this,
pd.DataFrame({"A":[pd.Timestamp("2017-1-1"), pd.Timestamp("2017-2-1")],
"B": [pd.Timestamp("2017-1-3"), pd.Timestamp("2017-2-3")]})
id A B
0 2017-01-01 2017-01-03
1 2017-02-01 2017-02-03
To this,
pd.DataFrame({"A":[pd.Timestamp("2017-1-1"),pd.Timestamp("2017-1-2"),pd.Timestamp("2017-1-3"),
pd.Timestamp("2017-2-1"),pd.Timestamp("2017-2-2"),pd.Timestamp("2017-2-3")],
"B": [1,1,1,1,1,1]})
id A B
0 2017-01-01 1
1 2017-01-02 1
2 2017-01-03 1
3 2017-02-01 1
4 2017-02-02 1
5 2017-02-03 1
Not really pythonic but I think it solves your issue:
In [1]:
from datetime import timedelta
import pandas as pd

df = pd.DataFrame({"A": [pd.Timestamp("2017-1-1"), pd.Timestamp("2017-2-1")],
                   "B": [pd.Timestamp("2017-1-3"), pd.Timestamp("2017-2-3")]})

dates_list = []
for k in range(len(df)):
    sdate = df.iloc[k, 0]  # start date
    edate = df.iloc[k, 1]  # end date
    delta = edate - sdate  # as timedelta
    for i in range(delta.days + 1):
        day = sdate + timedelta(days=i)
        dates_list.append(day)

final = pd.DataFrame(data=dates_list, columns=['A'])
final['B'] = 1
final
Out [1]:
A B
0 2017-01-01 1
1 2017-01-02 1
2 2017-01-03 1
3 2017-02-01 1
4 2017-02-02 1
5 2017-02-03 1
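The nested loop can likely be collapsed by building one date_range per row and concatenating; a sketch under the same sample data (variable names are my own):

```python
import pandas as pd

df = pd.DataFrame({"A": [pd.Timestamp("2017-1-1"), pd.Timestamp("2017-2-1")],
                   "B": [pd.Timestamp("2017-1-3"), pd.Timestamp("2017-2-3")]})

# One date_range per interval (both endpoints inclusive), then concatenate.
all_dates = pd.concat(
    [pd.Series(pd.date_range(row.A, row.B)) for row in df.itertuples()],
    ignore_index=True,
)
final = pd.DataFrame({"A": all_dates, "B": 1})
```

This keeps the daily expansion inside pandas instead of appending dates one at a time.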
Related
I'm loading a CSV file that has three columns: a date-and-time column, a value column, and another column 'data'. Example rows:
value data Date-Time
0 2 a 2019-3-18 23:11:00
1 3 b 2019-10-24 21:00:12
2 1 c 2019-1-10 23:00:00
3 2 d 2019-4-18 23:11:00
4 1 e 2019-1-1 23:00:00
I want to group by value; when there are duplicates on value, I need to keep the record with the most recent date and time. It should look as follows.
value data date
0 1 c 2019-1-10 23:00:00
1 2 d 2019-04-18 23:11:00
2 3 b 2019-10-24 21:00:12
df["date"] = pd.to_datetime(df["date"])
df = df.sort_values("date").groupby(['value'], as_index=False).first()
print(df)
Use sort_values and drop_duplicates:
# Convert 'Date-Time' column to datetime64
# df['Date-Time'] = pd.to_datetime(df['Date-Time'])
>>> df.sort_values('Date-Time') \
...   .drop_duplicates('value', keep='last') \
...   .sort_values('value')
value data Date-Time
2 1 c 2019-01-10 23:00:00
3 2 d 2019-04-18 23:11:00
1 3 b 2019-10-24 21:00:12
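Another sketch, if the goal is simply the most recent row per value: groupby plus idxmax avoids the second sort entirely (sample data reproduced from the question):

```python
import pandas as pd

df = pd.DataFrame({
    "value": [2, 3, 1, 2, 1],
    "data": list("abcde"),
    "Date-Time": ["2019-3-18 23:11:00", "2019-10-24 21:00:12",
                  "2019-1-10 23:00:00", "2019-4-18 23:11:00",
                  "2019-1-1 23:00:00"],
})
df["Date-Time"] = pd.to_datetime(df["Date-Time"])

# idxmax returns the row label of the latest timestamp within each group.
out = df.loc[df.groupby("value")["Date-Time"].idxmax()].sort_values("value")
```

This also preserves the original rows untouched, which can matter if there are more columns than shown.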
I have a list of date objects, say dates = ['2020-03-01', '2020-05-10'], and a dataframe indexed by datetime objects:
The dataset was generated from another one using the pandas resampling methods with "1h" as frequency.
I would like to fill the values of the column A and B using the list dates which contains date object.
More precisely, if an index is the same day as an element of dates, I want to put the corresponding row and all 1-month previous rows with 1 as values. The other rows should be filled with 0 as values.
My strategy is to begin with a null dataframe sharing the columns and index of the initial one.
But I run into problems when filling the values. I can do it with a for loop, but I believe pandas is powerful enough to do the job in a few lines.
Can someone help me? I would appreciate any help.
Edit :
To make coding/testing easier I have changed the problem statement a little. The datetime index is now all the hours of 2020-01-01.
The code below seems to work well, but I do not see a way to iterate through the dates list without a for loop... You can start from it.
Code :
import pandas as pd
import numpy as np
datetime_index = pd.date_range(start='2020-01-01 00:00:00',
                               end='2020-01-01 23:59:59', freq='1h')
n_cols = 2
n_rows = datetime_index.size
df_shape = (n_rows, n_cols)
dates = ['2020-01-01 05:00:00']

df = pd.DataFrame(np.random.randint(0, 10, size=df_shape),
                  index=datetime_index.values,
                  columns=[f'column {i}' for i in range(n_cols)])
data = pd.DataFrame(np.zeros(df_shape).astype(int),
                    index=datetime_index.values,
                    columns=[f'column {i}' for i in range(n_cols)])
data['column 0'] = data.apply(lambda x: int(x.name in pd.date_range(
    start=pd.to_datetime(dates[0]) - pd.Timedelta(1, unit='hours'),
    end=pd.to_datetime(dates[0]),
    freq='1h')), axis=1)
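The per-row apply can probably be replaced by a vectorized membership test with Index.isin; a sketch of the same 1-hour window, with variable names following the snippet above:

```python
import pandas as pd

datetime_index = pd.date_range('2020-01-01 00:00:00',
                               '2020-01-01 23:59:59', freq='1h')
dates = ['2020-01-01 05:00:00']

# Same window as the apply version: the target hour and the hour before it.
window = pd.date_range(end=pd.to_datetime(dates[0]), periods=2, freq='1h')

data = pd.DataFrame(0, index=datetime_index, columns=['column 0', 'column 1'])
data.loc[data.index.isin(window), 'column 0'] = 1
```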
Here is a slightly different version:
import pandas as pd
window = 30  # days
targets = ['2020-03-01', '2020-05-10']
dates = pd.date_range(start='2020-01-29', end='2020-07-01', freq='14d')

# create data frame
t = pd.DataFrame(index=dates)

# is target date close to i-th date?
for target in targets:
    t[target] = abs((t.index - pd.Timestamp(target)).days) < window
print(t.astype(int))
2020-03-01 2020-05-10
2020-01-29 0 0
2020-02-12 1 0
2020-02-26 1 0
2020-03-11 1 0
2020-03-25 1 0
2020-04-08 0 0
2020-04-22 0 1
2020-05-06 0 1
2020-05-20 0 1
2020-06-03 0 1
2020-06-17 0 0
2020-07-01 0 0
I have changed the freq to two weeks so I can check my result more easily
import pandas as pd
import numpy as np
from datetime import datetime
datetime_index = pd.date_range(start='2020-01-01 00:00:00',
                               end='2020-10-01 23:59:59', freq=f'{2*24*7}h')
n_cols = 2
n_rows = datetime_index.size
df_shape = (n_rows, n_cols)
dates = ['2020-03-01', '2020-05-10']

def to_datetime(x):
    year, month, day = map(int, x.split("-"))
    return datetime(year=year, month=month, day=day)

dates = list(map(to_datetime, dates))
date_series = pd.Series(datetime_index, index=datetime_index)

pd.DataFrame({f"column{i}": date_series
              .apply(lambda x: int(-pd.Timedelta(days=30) < (x - dates[i]) < pd.Timedelta(days=30)))
              for i in range(len(dates))})
this yields
column0 column1
2020-01-01 0 0
2020-01-15 0 0
2020-01-29 0 0
2020-02-12 1 0
2020-02-26 1 0
2020-03-11 1 0
2020-03-25 1 0
2020-04-08 0 0
2020-04-22 0 1
2020-05-06 0 1
2020-05-20 0 1
2020-06-03 0 1
2020-06-17 0 0
2020-07-01 0 0
2020-07-15 0 0
2020-07-29 0 0
2020-08-12 0 0
2020-08-26 0 0
2020-09-09 0 0
2020-09-23 0 0
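Both answers can also be written without any Python-level loop by broadcasting the index against the target dates with NumPy; a sketch with the same 30-day window and 14-day index (column names are my own):

```python
import pandas as pd
import numpy as np

idx = pd.date_range('2020-01-01', '2020-10-01', freq='14d')
targets = pd.to_datetime(['2020-03-01', '2020-05-10'])

# (n_rows, 1) minus (1, n_targets) broadcasts to an (n_rows, n_targets) grid
# of timedeltas; compare each against the 30-day window.
diff = np.abs(idx.values[:, None] - targets.values[None, :])
out = pd.DataFrame((diff < np.timedelta64(30, 'D')).astype(int),
                   index=idx,
                   columns=[f'column{i}' for i in range(len(targets))])
```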
I have a dataframe df1, and I want to calculate the days between two dates given three conditions and create a new column DiffDays with the difference in days.
1) When Yes is 1
2) When values in Value are non-zero
3) Must be UserId specific (perhaps with groupby())
df1 = pd.DataFrame({'Date': ['02.01.2017', '03.01.2017', '04.01.2017', '05.01.2017',
                             '01.01.2017', '02.01.2017', '03.01.2017'],
                    'UserId': [1, 1, 1, 1, 2, 2, 2],
                    'Value': [0, 0, 0, 100, 0, 1000, 0],
                    'Yes': [1, 0, 0, 0, 1, 0, 0]})
For example, for UserId 1, Yes is 1 on 02.01.2017 and Value first becomes non-zero on 05.01.2017; the difference is three days, which goes on row 3.
Expected outcome:
Date UserId Value Yes DiffDays
0 02.01.2017 1 0.0 1 0
1 03.01.2017 1 0.0 0.0 0
2 04.01.2017 1 0.0 0.0 0
3 05.01.2017 1 100 0.0 3
4 01.01.2017 2 0.0 1 0
5 02.01.2017 2 1000 0.0 1
6 03.01.2017 2 0.0 0.0 0
I couldn't find anything on Stackoverflow about this, and not sure how to start.
def dayDiff(group):
    # no "Yes" flag or no non-zero Value in this group -> all zeros
    if (not (group.Yes == 1).any()) or (not (group.Value > 0).any()):
        return np.zeros(group.Date.count())
    min_date = group[group.Yes == 1].Date.iloc[0]
    max_date = group[group.Value > 0].Date.iloc[0]
    delta = max_date - min_date
    return np.where(group.Value > 0, delta.days, 0)

df1.Date = pd.to_datetime(df1.Date, dayfirst=True)
DateDiff = (df1.groupby('UserId')
               .apply(dayDiff)
               .explode()
               .rename('DateDiff')
               .reset_index(drop=True))
pd.concat([df1, DateDiff], axis=1)
Returns:
Date UserId Value Yes DateDiff
0 2017-01-02 1 0 1 0
1 2017-01-03 1 0 0 0
2 2017-01-04 1 0 0 0
3 2017-01-05 1 100 0 3
4 2017-01-01 2 0 1 0
5 2017-01-02 2 1000 0 1
6 2017-01-03 2 0 0 0
Although this answers your question, the date diff logic is hard to follow, especially when it comes to the placement of the DateDiff values.
Update
pd.Series.explode() was only introduced in pandas 0.25; for those using earlier versions:
df1.Date = pd.to_datetime(df1.Date, dayfirst=True)
DateDiff = (df1
            .groupby('UserId')
            .apply(dayDiff)
            .to_frame()
            .explode(0)
            .reset_index(drop=True)
            .rename(columns={0: 'DateDiff'}))
pd.concat([df1, DateDiff], axis=1)
This will yield the same results.
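A loop-free sketch of the same logic: groupby().transform can broadcast each user's "Yes" date to every row, after which the day difference is a single vectorized expression (column names as in the question):

```python
import pandas as pd
import numpy as np

df1 = pd.DataFrame({'Date': ['02.01.2017', '03.01.2017', '04.01.2017', '05.01.2017',
                             '01.01.2017', '02.01.2017', '03.01.2017'],
                    'UserId': [1, 1, 1, 1, 2, 2, 2],
                    'Value': [0, 0, 0, 100, 0, 1000, 0],
                    'Yes': [1, 0, 0, 0, 1, 0, 0]})
df1['Date'] = pd.to_datetime(df1['Date'], dayfirst=True)

# Keep Date only where Yes == 1 (NaT elsewhere), then broadcast the first
# non-null date to every row of the same user.
start = df1['Date'].where(df1['Yes'].eq(1)).groupby(df1['UserId']).transform('first')

# Difference in days, but only on rows with a non-zero Value.
df1['DiffDays'] = np.where(df1['Value'].gt(0), (df1['Date'] - start).dt.days, 0)
```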
I have a dataframe with a time column, and then a value column which has repeating A/B values. I need to be able to group these values into pairs and find the timedelta between them.
import pandas as pd
df = pd.DataFrame()
df['time1'] = pd.date_range('2018-01-01', periods=6, freq='H')
df['id'] = range(1,7)
df['val'] = ['A','B'] * 3
time1 id val
0 2018-01-01 00:00:00 1 A
1 2018-01-01 01:00:00 2 B
2 2018-01-01 02:00:00 3 A
3 2018-01-01 03:00:00 4 B
4 2018-01-01 04:00:00 5 A
5 2018-01-01 05:00:00 6 B
needs to be...
index diff A B
0 01:00:00 1 2
1 01:00:00 3 4
2 01:00:00 5 6
Create a pair_id to identify the pairs, and add it to the df:
pair_id = sorted(list(range(0, int(df.shape[0]/2))) * 2)
df.loc[:, 'pair'] = pair_id
Define a difference function
def diff(x):
    return max(x) - min(x)
Using groupby make the difference calculation
diff_df = df.groupby('pair')['time1'].apply(diff).to_frame('diff')
And group the remaining data
id_df = df.groupby(['pair','val'])['id'].sum().unstack()
So we have diff_df:
diff
pair
0 01:00:00
1 01:00:00
2 01:00:00
And id_df:
val A B
pair
0 1 2
1 3 4
2 5 6
Join these two
diff_df.join(id_df)
diff A B
pair
0 01:00:00 1 2
1 01:00:00 3 4
2 01:00:00 5 6
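The same result can likely be reached with a single pivot, assuming rows always come in clean A/B pairs (a sketch on the question's sample data):

```python
import pandas as pd

df = pd.DataFrame()
df['time1'] = pd.date_range('2018-01-01', periods=6, freq='h')
df['id'] = range(1, 7)
df['val'] = ['A', 'B'] * 3

# Number the pairs, pivot, then subtract the two time columns.
df['pair'] = df.index // 2
wide = df.pivot(index='pair', columns='val')
out = wide['id'].copy()
out['diff'] = wide[('time1', 'B')] - wide[('time1', 'A')]
```

pivot leaves the remaining columns as a two-level column index, so wide['id'] is already the A/B id table and the time columns are addressed by tuple.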
There is probably a much simpler/faster way to do this within Pandas, but given your example data, here is something I came up with that seems to work. It uses the grouper() recipe from the itertools docs to pull the rows 2 at a time from the dataframe, and then takes the timedelta and merges into one new row.
from itertools import zip_longest
def grouper(iterable, n, fillvalue=None):
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)

new_rows = []
for a, b in grouper(df.iterrows(), 2):
    tdelta = b[1][0] - a[1][0]  # time1 of the B row minus time1 of the A row
    aid = a[1][1]               # id of the A row
    bid = b[1][1]               # id of the B row
    new_rows.append({'diff': tdelta, 'A': aid, 'B': bid})

new_df = pd.DataFrame(new_rows)
new_df = new_df.reindex(columns=['diff', 'A', 'B'])
Which gives:
>>> print(new_df)
diff A B
0 01:00:00 1 2
1 01:00:00 3 4
2 01:00:00 5 6
... But Dillon's solution above is much cleaner, and probably much more efficient :)
I need to subtract all elements in one column of pandas dataframe by its first value.
In this code, pandas complains about self.inferred_type, which I guess is the circular referencing.
df.Time = df.Time - df.Time[0]
And in this code, pandas complains about setting value on copies.
df.Time = df.Time - df.iat[0,0]
What is the correct way to do this computation in Pandas?
I think you can select the first item in column Time with iloc:
df.Time = df.Time - df.Time.iloc[0]
Sample:
start = pd.to_datetime('2015-02-24 10:00')
rng = pd.date_range(start, periods=5)
df = pd.DataFrame({'Time': rng, 'a': range(5)})
print (df)
Time a
0 2015-02-24 10:00:00 0
1 2015-02-25 10:00:00 1
2 2015-02-26 10:00:00 2
3 2015-02-27 10:00:00 3
4 2015-02-28 10:00:00 4
df.Time = df.Time - df.Time.iloc[0]
print (df)
Time a
0 0 days 0
1 1 days 1
2 2 days 2
3 3 days 3
4 4 days 4
Note: both of your original ways also work perfectly for me.
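One likely reason df.Time[0] can misbehave: [] indexes by label, not position, so it breaks as soon as the index no longer starts at 0, while .iloc[0] is always positional. A sketch (the .copy() also sidesteps the set-on-copy warning from the question):

```python
import pandas as pd

rng = pd.date_range('2015-02-24 10:00', periods=5)
df = pd.DataFrame({'Time': rng, 'a': range(5)})

# Drop the first two rows: the index now starts at label 2,
# so df.Time[0] would raise a KeyError here.
df = df.iloc[2:].copy()

# Positional access works regardless of the index labels.
df['Time'] = df['Time'] - df['Time'].iloc[0]
```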