groupby Date year-month - python

I read and transform data using the following code
import matplotlib.pyplot as plt
import pandas as pd
import matplotlib.dates as dates
import numpy as np
df = pd.read_csv('data/C2A2_data/BinnedCsvs_d400/fb441e62df2d58994928907a91895ec62c2c42e6cd075c2700843b89.csv', parse_dates=['Date'])
df.drop('ID', axis='columns', inplace = True)
df_min = df[(df['Date']<='2014-12') & (df['Date']>='2004-01') & (df['Element']=='TMIN')]
df_min.drop('Element', axis='columns', inplace = True)
df_min = df_min.groupby('Date').agg({'Data_Value': 'min'}).reset_index()
giving the following result:
        Date  Data_Value
0 2005-01-01         -56
1 2005-01-02         -56
2 2005-01-03           0
3 2005-01-04         -39
4 2005-01-05         -94
Now I want the Date grouped by year-month instead, like this:
      Date  Data_Value
0  2005-01         -94
1  2005-02          xx
2  2005-03          xx
3  2005-04          xx
4  2005-05          xx
where xx is the minimum value for that year-month.
How do I have to change the groupby call, or is this not possible with groupby?

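Use pd.Grouper() to aggregate by yearly, monthly, or daily frequency. With freq="M" each group is labelled by its month-end date (pandas 2.2 and later spell this alias "ME").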
Code
df_min["Date"] = pd.to_datetime(df_min["Date"])
df_ans = df_min.groupby(pd.Grouper(key="Date", freq="M")).min()
Result
print(df_ans)
Data_Value
Date
2005-01-31 -94
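The month-based grouping above can also produce the year-month labels the question asks for. The following is a minimal sketch, assuming a small made-up df_min in the question's shape. It uses freq="MS" (month start), an alias that is spelled the same across pandas versions, and formats the labels with strftime.

```python
import pandas as pd

# Made-up sample rows in the shape of the question's df_min (an assumption)
df_min = pd.DataFrame({
    "Date": pd.to_datetime(["2005-01-01", "2005-01-05", "2005-02-03", "2005-02-10"]),
    "Data_Value": [-56, -94, -12, -30],
})

# Group by calendar month; "MS" labels each group by the first day of the month
monthly = (
    df_min.groupby(pd.Grouper(key="Date", freq="MS"))["Data_Value"]
    .min()
    .reset_index()
)

# Turn the month-start timestamps into 'YYYY-MM' strings
monthly["Date"] = monthly["Date"].dt.strftime("%Y-%m")
print(monthly)
```

Note that the Date column then holds plain strings, so it is no longer usable for further date arithmetic.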

You can first map the Date column to keep only the year and month, then perform a groupby and take the min of each group:
# import libraries
import pandas as pd
# test data
data = [['2005-01-01', -56],['2005-01-01', -3],['2005-01-01', 6],
['2005-01-01', 26],['2005-01-01', 56],['2005-02-01', -26],
['2005-02-01', -2],['2005-02-01', 6],['2005-02-01', 26],
['2005-03-01', 56],['2005-03-01', -33],['2005-03-01', -5],
['2005-03-01', 6],['2005-03-01', 26],['2005-03-01', 56]]
# create dataframe
df_min = pd.DataFrame(data=data, columns=["Date", "Date_value"])
# convert 'Date' column to datetime datatype
df_min['Date'] = pd.to_datetime(df_min['Date'])
# get only year and month
df_min['Date'] = df_min['Date'].map(lambda x: str(x.year)+'-'+str(x.month))
# get min value for each group
df_min = df_min.groupby('Date').min()
After printing df_min, the output is as follows (the lambda does not zero-pad the month, so the labels are 2005-1, 2005-2, ...):
        Date_value
Date
2005-1         -56
2005-2         -26
2005-3         -33
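If zero-padded labels such as 2005-01 are wanted, one alternative (not the method used in the answer above) is to group on monthly periods with dt.to_period("M"). A small sketch with made-up rows:

```python
import pandas as pd

# Made-up rows; the column name Date_value follows the answer above
df_min = pd.DataFrame({
    "Date": ["2005-01-01", "2005-01-15", "2005-02-01", "2005-03-01"],
    "Date_value": [-56, -3, -26, -33],
})
df_min["Date"] = pd.to_datetime(df_min["Date"])

# Group on the monthly period of each date; the result has a PeriodIndex
result = df_min.groupby(df_min["Date"].dt.to_period("M"))["Date_value"].min()
print(result)
```

The periods print as 2005-01, 2005-02, and so on, and they still sort chronologically, which plain strings like 2005-10 and 2005-2 would not.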

Related

Why do I get all dates similar while trying to fill them in a dataset?

I have a dataset with 800 rows and I want to create a new column with a date that increases by one day in each row.
import datetime
date = datetime.datetime.strptime('5/11/2011', '%d/%m/%Y')
for x in range(800):
    df['Date'] = date + datetime.timedelta(days=x)
In every row the date is equal to '2014-01-12'. As I understand it, the column is filled as if x were always equal to 799.
Each time through the loop you are updating the ENTIRE Date column. You see the results of the 800th update at the end.
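You could use a date range. Note that pd.date_range parses '5/11/2011' month-first (11 May 2011), whereas the '%d/%m/%Y' format in the question reads it as 5 November 2011: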
dr = pd.date_range('5/11/2011', periods=800, freq='D')
df = pd.DataFrame({'Date': dr})
print(df)
Date
0 2011-05-11
1 2011-05-12
2 2011-05-13
3 2011-05-14
4 2011-05-15
.. ...
795 2013-07-14
796 2013-07-15
797 2013-07-16
798 2013-07-17
799 2013-07-18
Or:
df['Date'] = dr
pandas is a nice tool that can repeat a calculation over a whole column without a for-loop.
When you use df['Date'] = ... with a single value, you assign that same value to every cell in the column.
You have to use df.loc[x, 'Date'] = ... to assign to a single cell.
Minimal working example (with only 10 rows).
import pandas as pd
import datetime
df = pd.DataFrame({'Date':[1,2,3,4,5,6,7,8,9,0]})
date = datetime.datetime.strptime('5/11/2011', '%d/%m/%Y')
for x in range(10):
    df.loc[x,'Date'] = date + datetime.timedelta(days=x)
print(df)
But you could also use pd.date_range() for this.
Minimal working example (with only 10 rows).
import pandas as pd
import datetime
df = pd.DataFrame({'Date':[1,2,3,4,5,6,7,8,9,0]})
date = datetime.datetime.strptime('5/11/2011', '%d/%m/%Y')
df['Date'] = pd.date_range(date, periods=10)
print(df)
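The loop in the question can also be written as one vectorized expression: a start date plus an array of day offsets. This is a minimal sketch, assuming a hypothetical 10-row frame with a made-up value column; the start date is parsed with the question's '%d/%m/%Y' format.

```python
import numpy as np
import pandas as pd

# Hypothetical 10-row frame standing in for the 800-row dataset
df = pd.DataFrame({"value": range(10)})

# Parse with the question's day-first format: 5 November 2011
start = pd.to_datetime("5/11/2011", format="%d/%m/%Y")

# One offset per row (0, 1, 2, ... days) added to the start date in a single step
df["Date"] = start + pd.to_timedelta(np.arange(len(df)), unit="D")
print(df)
```

Because the offsets are built from len(df), the same line works unchanged for 800 rows.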

Price column lost when converting timestamp column data to datetime

I'm preparing my data for price analytics, so I wrote this code that pulls the price feed from the CoinGecko API, selects the required columns, renames the headers, and converts the date.
The problem I'm facing is that once I convert the timestamp to datetime, I lose the price column. How can I get it back along with the new date format?
import pandas as pd
from pycoingecko import CoinGeckoAPI
cg = CoinGeckoAPI()
response = cg.get_coin_market_chart_by_id(id='bitcoin',
                                          vs_currency='usd',
                                          days='90',
                                          interval='daily')
df1 = pd.json_normalize(response)
df2 = df1.explode('prices')
df2 = pd.DataFrame(df2['prices'].to_list(), columns=['dates','prices'])
df2.rename(columns={'dates': 'ds','prices': 'y'}, inplace=True)
print('DATAFRAME EXPLODED: ',df2)
df2 = df2['ds'].mul(1e6).apply(pd.Timestamp)
df2 = pd.DataFrame(df2.to_list(), columns=['ds','y'])
df3 = df2.tail()
print('DATAFRAME TAILED: ',df3)
DATAFRAME EXPLODED:
ds y
0 1618185600000 59988.020959
1 1618272000000 59911.020595
2 1618358400000 63576.676041
3 1618444800000 62807.123233
4 1618531200000 63179.772446
.. ... ...
86 1625616000000 34149.989815
87 1625702400000 33932.254638
88 1625788800000 32933.578199
89 1625875200000 33971.297750
90 1625895274000 33738.909080
[91 rows x 2 columns]
DATAFRAME TAILED:
86 2021-07-07 00:00:00
87 2021-07-08 00:00:00
88 2021-07-09 00:00:00
89 2021-07-10 00:00:00
90 2021-07-10 05:34:34
Name: ds, dtype: datetime64[ns]
ValueError: Shape of passed values is (91, 1), indices imply (91, 3)
Change:
df2 = df2['ds'].mul(1e6).apply(pd.Timestamp)
df2 = pd.DataFrame(df2.to_list(), columns=['ds','y'])
to:
df2['ds_datetime'] = df2['ds'].mul(1e6).apply(pd.Timestamp)
Try this:
import pandas as pd
from pycoingecko import CoinGeckoAPI
cg = CoinGeckoAPI()
response = cg.get_coin_market_chart_by_id(id='bitcoin',
                                          vs_currency='usd',
                                          days='90',
                                          interval='daily')
df1 = pd.json_normalize(response)
df2 = df1.explode('prices')
df2 = pd.DataFrame(df2['prices'].to_list(), columns=['dates','prices'])
df2.rename(columns={'dates': 'ds','prices': 'y'}, inplace=True)
print('DATAFRAME EXPLODED: ',df2)
df2['ds'] = df2['ds'].mul(1e6).apply(pd.Timestamp)
# df2 = pd.DataFrame(df2.to_list(), columns=['ds','y'])
df3 = df2.tail()
print('DATAFRAME TAILED: ',df3)
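By writing df2 = df2['ds'].mul(1e6).apply(pd.Timestamp), you replaced the whole DataFrame df2 with a single Series of timestamps, so the price column was lost. Assigning the result to df2['ds'] keeps it.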
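As an alternative to multiplying by 1e6 and applying pd.Timestamp row by row, pd.to_datetime can read epoch milliseconds directly with unit="ms". A minimal sketch, using three rows copied from the printed output above instead of a live API call:

```python
import pandas as pd

# Three (timestamp in ms, price) rows taken from the output shown in the question
df2 = pd.DataFrame({
    "ds": [1618185600000, 1618272000000, 1625895274000],
    "y": [59988.020959, 59911.020595, 33738.909080],
})

# Convert the column in place; assigning back to df2["ds"] keeps the price column "y"
df2["ds"] = pd.to_datetime(df2["ds"], unit="ms")
print(df2)
```

The key point is the same as in the answers above: assign the converted values to a column of df2 rather than to df2 itself.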

Pandas.DataFrame.resample inner level of MultiIndex

I need to resample a pandas DataFrame whose MultiIndex has two levels. The inner level is a datetime index, which needs to be resampled.
import numpy as np
import pandas as pd
rng = pd.date_range('2019-01-01', '2019-04-27', freq='B', name='date')
df = pd.DataFrame(np.random.randint(0, 100, (len(rng), 2)), index=rng, columns=['sec1', 'sec2'])
df['month'] = df.index.month
df.set_index(['month', rng], inplace=True)
print(df)
# At this point I need to apply resample. How do I specify the level I would like to resample?
df = df.resample('M').last() # does not work
# I'm looking for something like this: df = df.resample('M', level=1).last()
Try:
df.groupby('month').resample('M', level=1).last()
Output:
sec1 sec2
month date
1 2019-01-31 59 87
2 2019-02-28 70 33
3 2019-03-31 71 38
4 2019-04-30 56 79
Details:
First, group the dataframe on 'month', or level=0 of the index.
Next, use resample with the level parameter for a MultiIndex.
The level parameter accepts either a str, the index level name such as 'date' in this case, or the level number.
Lastly, chain an aggregation function such as last.
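The same grouping can also be sketched with a single groupby call that uses pd.Grouper on the datetime level. This is an alternative to the resample chain, not the method from the answer. It assumes deterministic made-up values instead of random ones, and it uses freq='MS', so each group is labelled by month start rather than month end.

```python
import numpy as np
import pandas as pd

# Same index layout as in the question, with deterministic values
rng = pd.date_range('2019-01-01', '2019-04-27', freq='B', name='date')
df = pd.DataFrame({'sec1': np.arange(len(rng)),
                   'sec2': np.arange(len(rng)) * 2}, index=rng)
df['month'] = df.index.month
df = df.set_index(['month', rng])

# Group on the outer level by name and on the inner datetime level by month
out = df.groupby(['month', pd.Grouper(level='date', freq='MS')]).last()
print(out)
```

Each row of out holds the last observation of that month, as with resample(...).last().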

PANDAS Time Series Window Labels

I currently have a process for windowing time series data, but I am wondering if there is a vectorized, in-place approach for performance/resource reasons.
I have two lists that have the start and end dates of 30 day windows:
start_dts = ['2014-01-01',...]
end_dts = ['2014-01-30',...]
I have a dataframe with a field called 'transaction_dt'.
What I am trying to accomplish is a method that adds two new columns ('start_dt' and 'end_dt') to each row whose transaction_dt falls between a pair of 'start_dt' and 'end_dt' values. Ideally, this would be vectorized and in-place if possible.
EDIT:
As requested here is some sample data of my format:
'customer_id','transaction_dt','product','price','units'
1,2004-01-02,thing1,25,47
1,2004-01-17,thing2,150,8
2,2004-01-29,thing2,150,25
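If I understand correctly, you can do this
by using an IntervalIndex. The columns are selected explicitly so that the values line up with the names on the left:
df2.index=pd.IntervalIndex.from_arrays(df2['Start'],df2['End'],closed='both')
df[['End','Start']]=df2.loc[df['transaction_dt'],['End','Start']].values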
df
Out[457]:
transaction_dt End Start
0 2017-01-02 2017-01-31 2017-01-01
1 2017-03-02 2017-03-31 2017-03-01
2 2017-04-02 2017-04-30 2017-04-01
3 2017-05-02 2017-05-31 2017-05-01
Data Input :
df=pd.DataFrame({'transaction_dt':['2017-01-02','2017-03-02','2017-04-02','2017-05-02']})
df['transaction_dt']=pd.to_datetime(df['transaction_dt'])
list1=['2017-01-01','2017-02-01','2017-03-01','2017-04-01','2017-05-01']
list2=['2017-01-31','2017-02-28','2017-03-31','2017-04-30','2017-05-31']
df2=pd.DataFrame({'Start':list1,'End':list2})
df2.Start=pd.to_datetime(df2.Start)
df2.End=pd.to_datetime(df2.End)
If you want the start and end of the month, you can use this approach (from "Extracting the first day of month of a datetime type column in pandas"):
import io
import pandas as pd
import datetime
string = """customer_id,transaction_dt,product,price,units
1,2004-01-02,thing1,25,47
1,2004-01-17,thing2,150,8
2,2004-01-29,thing2,150,25"""
df = pd.read_csv(io.StringIO(string))
df["transaction_dt"] = pd.to_datetime(df["transaction_dt"])
df["start"] = df['transaction_dt'].dt.floor('d') - pd.offsets.MonthBegin(1)
df["end"] = df['transaction_dt'].dt.floor('d') + pd.offsets.MonthEnd(1)
df
Returns
customer_id transaction_dt product price units start end
0 1 2004-01-02 thing1 25 47 2004-01-01 2004-01-31
1 1 2004-01-17 thing2 150 8 2004-01-01 2004-01-31
2 2 2004-01-29 thing2 150 25 2004-01-01 2004-01-31
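One edge case in the offsets above: subtracting MonthBegin(1) from a date that already is the first of a month moves it to the first of the previous month. A hedged variant that avoids this is to take the month start from the monthly period and the month end from MonthEnd(0), which leaves a date that is already a month end unchanged. A sketch with made-up dates chosen to hit both edges:

```python
import pandas as pd

# Made-up dates: a first of month, a mid-month day, and a last of month
df = pd.DataFrame({
    "transaction_dt": pd.to_datetime(["2004-01-01", "2004-01-17", "2004-01-31"]),
})

# Drop any time-of-day part before computing the month bounds
day = df["transaction_dt"].dt.normalize()

# Month start via the monthly period; month end via a zero-step MonthEnd offset
df["start"] = day.dt.to_period("M").dt.to_timestamp()
df["end"] = day + pd.offsets.MonthEnd(0)
print(df)
```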
new approach:
import io
import pandas as pd
import datetime
string = """customer_id,transaction_dt,product,price,units
1,2004-01-02,thing1,25,47
1,2004-01-17,thing2,150,8
2,2004-06-29,thing2,150,25"""
df = pd.read_csv(io.StringIO(string))
df["transaction_dt"] = pd.to_datetime(df["transaction_dt"])
# Get all timestamps that are necessary
# This assumes dates are sorted
# if not we should change [0] -> min_dt and [-1] --> max_dt
timestamps = [df.iloc[0]["transaction_dt"].floor('d') - pd.offsets.MonthBegin(1)]
while df.iloc[-1]["transaction_dt"].floor('d') > timestamps[-1]:
    timestamps.append(timestamps[-1]+datetime.timedelta(days=30))
# We store all ranges here
ranges = list(zip(timestamps,timestamps[1:]))
# Loop through all values and add to column start and end
for ind,value in enumerate(df["transaction_dt"]):
    for i,(start,end) in enumerate(ranges):
        if (value >= start and value <= end):
            df.loc[ind, "start"] = start
            df.loc[ind, "end"] = end
            # When match is found let's also
            # remove all ranges that aren't met
            # This can be removed if dates are not sorted
            # But this should speed things up for large datasets
            for _ in range(i):
                ranges.pop(0)
            # Stop scanning: ranges was just modified while being iterated
            break
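Since the question asks for a vectorized approach, the nested loops can be replaced by pd.merge_asof, which matches each transaction to the latest window start at or before it. This is a sketch under assumptions: the frames tx and windows and their values are made up, the windows do not overlap, and both frames are sorted. Rows that fall after the matched window's end are set to NaT.

```python
import pandas as pd

# Made-up transactions and 30-day windows
tx = pd.DataFrame({
    "transaction_dt": pd.to_datetime(
        ["2004-01-02", "2004-01-17", "2004-02-05", "2004-03-15"]),
})
windows = pd.DataFrame({
    "start_dt": pd.to_datetime(["2004-01-01", "2004-01-31"]),
    "end_dt": pd.to_datetime(["2004-01-30", "2004-02-29"]),
})

# For each transaction, pick the window with the latest start_dt <= transaction_dt
out = pd.merge_asof(
    tx.sort_values("transaction_dt"),
    windows.sort_values("start_dt"),
    left_on="transaction_dt",
    right_on="start_dt",
)

# A match only counts if the transaction is not past the window's end
outside = out["transaction_dt"] > out["end_dt"]
out.loc[outside, ["start_dt", "end_dt"]] = pd.NaT
print(out)
```

merge_asof returns a new frame rather than modifying tx in place, so assign the result back if the original name should be kept.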

Convert dataframe index to datetime

How do I convert a pandas index of strings to datetime format?
My dataframe df is like this:
value
2015-09-25 00:46 71.925000
2015-09-25 00:47 71.625000
2015-09-25 00:48 71.333333
2015-09-25 00:49 64.571429
2015-09-25 00:50 72.285714
The index is of type string, but I need it in datetime format, because I get the error:
'Index' object has no attribute 'hour'
when using
df["A"] = df.index.hour
It should work as expected. Try to run the following example.
import pandas as pd
import io
data = """value
"2015-09-25 00:46" 71.925000
"2015-09-25 00:47" 71.625000
"2015-09-25 00:48" 71.333333
"2015-09-25 00:49" 64.571429
"2015-09-25 00:50" 72.285714"""
df = pd.read_table(io.StringIO(data), sep=r"\s+")  # delim_whitespace=True is deprecated in recent pandas
# Converting the index as date
df.index = pd.to_datetime(df.index)
# Extracting hour & minute
df['A'] = df.index.hour
df['B'] = df.index.minute
df
# value A B
# 2015-09-25 00:46:00 71.925000 0 46
# 2015-09-25 00:47:00 71.625000 0 47
# 2015-09-25 00:48:00 71.333333 0 48
# 2015-09-25 00:49:00 64.571429 0 49
# 2015-09-25 00:50:00 72.285714 0 50
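When the index strings may contain bad entries, pd.to_datetime can be given an explicit format together with errors="coerce", which turns unparseable values into NaT instead of raising. A small sketch, assuming a made-up frame whose third index label is deliberately invalid:

```python
import pandas as pd

# Made-up frame; the last index label is intentionally not a date
df = pd.DataFrame(
    {"value": [71.925, 71.625, 64.571429]},
    index=["2015-09-25 00:46", "2015-09-25 00:47", "not a date"],
)

# Explicit format plus coerce: invalid labels become NaT
df.index = pd.to_datetime(df.index, format="%Y-%m-%d %H:%M", errors="coerce")

# The index is now a DatetimeIndex, so attributes such as .minute are available
df["minute"] = df.index.minute
print(df)
```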
You could explicitly create a DatetimeIndex when initializing the dataframe. Assuming your data is in string format
data = [
('2015-09-25 00:46', '71.925000'),
('2015-09-25 00:47', '71.625000'),
('2015-09-25 00:48', '71.333333'),
('2015-09-25 00:49', '64.571429'),
('2015-09-25 00:50', '72.285714'),
]
index, values = zip(*data)
frame = pd.DataFrame({
'values': values
}, index=pd.DatetimeIndex(index))
print(frame.index.minute)
Another option: once the index is a DatetimeIndex, read its date parts directly. The '.dt' accessor exists only on a Series, so on an index use the attributes without it:
import pandas as pd
df.index = pd.to_datetime(df.index)
# get the year
df.index.year
# get the month
df.index.month
# get the day
df.index.day
# get the hour
df.index.hour
# get the minute
df.index.minute
Doing
df.index = pd.to_datetime(df.index, errors='coerce')
the data type of the index has changed to
