How do I convert a pandas index of strings to datetime format?
My dataframe df is like this:
value
2015-09-25 00:46 71.925000
2015-09-25 00:47 71.625000
2015-09-25 00:48 71.333333
2015-09-25 00:49 64.571429
2015-09-25 00:50 72.285714
The index is of type string, but I need it in datetime format, because I get the error:
'Index' object has no attribute 'hour'
when using
df["A"] = df.index.hour
It should work as expected. Try running the following example.
import pandas as pd
import io
data = """value
"2015-09-25 00:46" 71.925000
"2015-09-25 00:47" 71.625000
"2015-09-25 00:48" 71.333333
"2015-09-25 00:49" 64.571429
"2015-09-25 00:50" 72.285714"""
df = pd.read_table(io.StringIO(data), delim_whitespace=True)  # in newer pandas delim_whitespace is deprecated; use sep=r'\s+'
# Convert the index to datetime
df.index = pd.to_datetime(df.index)
# Extracting hour & minute
df['A'] = df.index.hour
df['B'] = df.index.minute
df
# value A B
# 2015-09-25 00:46:00 71.925000 0 46
# 2015-09-25 00:47:00 71.625000 0 47
# 2015-09-25 00:48:00 71.333333 0 48
# 2015-09-25 00:49:00 64.571429 0 49
# 2015-09-25 00:50:00 72.285714 0 50
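If the data comes from a file anyway, you can also let the reader build the DatetimeIndex for you. A sketch, assuming a CSV whose first column holds the timestamps (the file name is made up):
df = pd.read_csv('myfile.csv', index_col=0, parse_dates=True)
df['A'] = df.index.hour  # works directly, because the index is already a DatetimeIndex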
You could explicitly create a DatetimeIndex when initializing the dataframe, assuming your data is in string format:
import pandas as pd
data = [
('2015-09-25 00:46', '71.925000'),
('2015-09-25 00:47', '71.625000'),
('2015-09-25 00:48', '71.333333'),
('2015-09-25 00:49', '64.571429'),
('2015-09-25 00:50', '72.285714'),
]
index, values = zip(*data)
frame = pd.DataFrame({
'values': values
}, index=pd.DatetimeIndex(index))
print(frame.index.minute)
Just to give another option for this question: note that .dt is the accessor for a datetime Series, while a DatetimeIndex exposes the same attributes directly, without .dt:
import pandas as pd
df.index = pd.to_datetime(df.index)
# get the year
df.index.year
# get the month
df.index.month
# get the day
df.index.day
# get the hour
df.index.hour
# get the minute
df.index.minute
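If the timestamps live in an ordinary column rather than in the index, that is where .dt comes in. A quick sketch:
df2 = df.reset_index()               # the index becomes a regular column named 'index'
df2['hour'] = df2['index'].dt.hour   # a Series needs the .dt accessor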
Doing
df.index = pd.to_datetime(df.index, errors='coerce')
changes the data type of the index to datetime64[ns]; with errors='coerce', any string that cannot be parsed becomes NaT instead of raising an error.
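A tiny sketch of that behaviour:
pd.to_datetime(['2015-09-25 00:46', 'not a date'], errors='coerce')
# DatetimeIndex(['2015-09-25 00:46:00', 'NaT'], dtype='datetime64[ns]', freq=None)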
I have a dataset with 800 rows and I want to create a new column with a date that increases by one day in each row.
import datetime
date = datetime.datetime.strptime('5/11/2011', '%d/%m/%Y')
for x in range(800):
df['Date'] = date + datetime.timedelta(days=x)
In each row the date is equal to '2014-01-12'; as I understand it, it fills as if x were always equal to 799.
Each time through the loop you are updating the ENTIRE Date column. You see the results of the 800th update at the end.
You could use a date range:
dr = pd.date_range('5/11/2011', periods=800, freq='D')
df = pd.DataFrame({'Date': dr})
print(df)
Date
0 2011-05-11
1 2011-05-12
2 2011-05-13
3 2011-05-14
4 2011-05-15
.. ...
795 2013-07-14
796 2013-07-15
797 2013-07-16
798 2013-07-17
799 2013-07-18
Or:
df['Date'] = dr
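One caveat: pd.date_range parses the bare string '5/11/2011' month-first (May 11, as the output above shows), while the strptime format '%d/%m/%Y' in the question reads it as 5 November. To keep the day-first interpretation, parse explicitly first:
dr = pd.date_range(pd.to_datetime('5/11/2011', dayfirst=True), periods=800, freq='D')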
pandas is a nice tool that can repeat some calculations without a for-loop.
When you use df['Date'] = ... then you assign the same value to all cells in column.
You have to use df.loc[x, 'Date'] = ... to assign to single cell.
Minimal working example (with only 10 rows).
import pandas as pd
import datetime
df = pd.DataFrame({'Date':[1,2,3,4,5,6,7,8,9,0]})
date = datetime.datetime.strptime('5/11/2011', '%d/%m/%Y')
for x in range(10):
df.loc[x,'Date'] = date + datetime.timedelta(days=x)
print(df)
But you could also use pd.date_range() for this.
Minimal working example (with only 10 rows).
import pandas as pd
import datetime
df = pd.DataFrame({'Date':[1,2,3,4,5,6,7,8,9,0]})
date = datetime.datetime.strptime('5/11/2011', '%d/%m/%Y')
df['Date'] = pd.date_range(date, periods=10)
print(df)
I'm preparing my data for price analytics, so I created this code that pulls the price feed from the CoinGecko API, sorts the required columns, renames the headers and converts the date.
The current blocker is that once I convert the timestamp to datetime, I lose the price column. How can I get it back along with the new date format?
import pandas as pd
from pycoingecko import CoinGeckoAPI
cg = CoinGeckoAPI()
response = cg.get_coin_market_chart_by_id(id='bitcoin',
vs_currency='usd',
days='90',
interval='daily')
df1 = pd.json_normalize(response)
df2 = df1.explode('prices')
df2 = pd.DataFrame(df2['prices'].to_list(), columns=['dates','prices'])
df2.rename(columns={'dates': 'ds','prices': 'y'}, inplace=True)
print('DATAFRAME EXPLODED: ',df2)
df2 = df2['ds'].mul(1e6).apply(pd.Timestamp)
df2 = pd.DataFrame(df2.to_list(), columns=['ds','y'])
df3 = df2.tail()
print('DATAFRAME TAILED: ',df3)
DATAFRAME EXPLODED:
ds y
0 1618185600000 59988.020959
1 1618272000000 59911.020595
2 1618358400000 63576.676041
3 1618444800000 62807.123233
4 1618531200000 63179.772446
.. ... ...
86 1625616000000 34149.989815
87 1625702400000 33932.254638
88 1625788800000 32933.578199
89 1625875200000 33971.297750
90 1625895274000 33738.909080
[91 rows x 2 columns]
DATAFRAME TAILED:
86 2021-07-07 00:00:00
87 2021-07-08 00:00:00
88 2021-07-09 00:00:00
89 2021-07-10 00:00:00
90 2021-07-10 05:34:34
Name: ds, dtype: datetime64[ns]
ValueError: Shape of passed values is (91, 1), indices imply (91, 3)
Change:
df2 = df2['ds'].mul(1e6).apply(pd.Timestamp)
df2 = pd.DataFrame(df2.to_list(), columns=['ds','y'])
to:
df2['ds_datetime'] = df2['ds'].mul(1e6).apply(pd.Timestamp)
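As a side note, pandas can do the epoch conversion in one step; a sketch, assuming the ds values are epoch milliseconds as in the CoinGecko feed:
df2['ds_datetime'] = pd.to_datetime(df2['ds'], unit='ms')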
Try this:
import pandas as pd
from pycoingecko import CoinGeckoAPI
cg = CoinGeckoAPI()
response = cg.get_coin_market_chart_by_id(id='bitcoin',
vs_currency='usd',
days='90',
interval='daily')
df1 = pd.json_normalize(response)
df2 = df1.explode('prices')
df2 = pd.DataFrame(df2['prices'].to_list(), columns=['dates','prices'])
df2.rename(columns={'dates': 'ds','prices': 'y'}, inplace=True)
print('DATAFRAME EXPLODED: ',df2)
df2['ds'] = df2['ds'].mul(1e6).apply(pd.Timestamp)
# df2 = pd.DataFrame(df2.to_list(), columns=['ds','y'])
df3 = df2.tail()
print('DATAFRAME TAILED: ',df3)
By writing df2 = df2['ds'].mul(1e6).apply(pd.Timestamp), you removed the price column from df2.
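The difference is only the assignment target; compare the two variants (not meant to be run in sequence):
# what the question did: rebinds the name df2 to a lone Series, dropping 'y'
df2 = df2['ds'].mul(1e6).apply(pd.Timestamp)
# what the answer does: replaces only the 'ds' column, keeping 'y'
df2['ds'] = df2['ds'].mul(1e6).apply(pd.Timestamp)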
I read and transform data using the following code
import matplotlib.pyplot as plt
import pandas as pd
import matplotlib.dates as dates
import numpy as np
df = pd.read_csv('data/C2A2_data/BinnedCsvs_d400/fb441e62df2d58994928907a91895ec62c2c42e6cd075c2700843b89.csv', parse_dates=['Date'])
df.drop('ID', axis='columns', inplace = True)
df_min = df[(df['Date']<='2014-12') & (df['Date']>='2004-01') & (df['Element']=='TMIN')]
df_min.drop('Element', axis='columns', inplace = True)
df_min = df_min.groupby('Date').agg({'Data_Value': 'min'}).reset_index()
giving the following result
Date Data_Value
0 2005-01-01 -56
1 2005-01-02 -56
2 2005-01-03 0
3 2005-01-04 -39
4 2005-01-05 -94
Now I'm trying to get the Date as Year-Month, like so:
Date Data_Value
0 2005-01 -94
1 2005-02 xx
2 2005-03 xx
3 2005-04 xx
4 2005-05 xx
Where xx is the minimum value for that year-month.
How do I have to change the groupby function, or is this not possible with it?
Use pd.Grouper() to aggregate by yearly/monthly/daily frequencies.
Code
df_min["Date"] = pd.to_datetime(df_min["Date"])
df_ans = df_min.groupby(pd.Grouper(key="Date", freq="M")).min()  # freq="ME" in pandas >= 2.2
Result
print(df_ans)
Data_Value
Date
2005-01-31 -94
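If you would rather have year-month labels such as 2005-01 instead of end-of-month timestamps, a sketch using periods (same grouping, different index):
df_min["Date"] = pd.to_datetime(df_min["Date"])
df_ans = df_min.groupby(df_min["Date"].dt.to_period("M"))["Data_Value"].min()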
You can first map the Date column to get only year and month, and then perform a groupby and take the min of each group:
# import libraries
import pandas as pd
# test data
data = [['2005-01-01', -56],['2005-01-01', -3],['2005-01-01', 6],
['2005-01-01', 26],['2005-01-01', 56],['2005-02-01', -26],
['2005-02-01', -2],['2005-02-01', 6],['2005-02-01', 26],
['2005-03-01', 56],['2005-03-01', -33],['2005-03-01', -5],
['2005-03-01', 6],['2005-03-01', 26],['2005-03-01', 56]]
# create dataframe
df_min = pd.DataFrame(data=data, columns=["Date", "Date_value"])
# convert 'Date' column to datetime datatype
df_min['Date'] = pd.to_datetime(df_min['Date'])
# get only year and month
df_min['Date'] = df_min['Date'].map(lambda x: f'{x.year}-{x.month:02d}')
# get min value for each group
df_min = df_min.groupby('Date').min()
After printing df_min, the output is:
         Date_value
Date
2005-01         -56
2005-02         -26
2005-03         -33
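The same mapping can also be done without a lambda, via the vectorized strftime (a one-line sketch):
df_min['Date'] = df_min['Date'].dt.strftime('%Y-%m')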
I have a large csv file with millions of rows and 2 columns (date, score). I need the missing dates (for example 1/1/16, 2/1/16, 4/1/16) to have '0' values in the 'score' column while keeping my existing 'date' and 'score' rows intact, all in the same csv. But I also have multiple (probably hundreds of) scores on many dates, so I'm really having trouble coding it. I've looked up quite a few examples on Stack Overflow but none of them has worked yet.
date score
3/1/16 0.6369
5/1/16 -0.2023
6/1/16 0.25
7/1/16 0.0772
9/1/16 -0.4215
12/1/16 0.296
15/1/16 0.25
15/1/16 0.7684
15/1/16 0.8537
...
...
31/12/18 0.5646
This is what I have done so far, but all I am getting is an index column spanning the 3 years, with the 'date' and 'score' columns filled with '0'. I will really appreciate your answers and suggestions. Thank you very much.
import csv
import pandas as pd
import datetime as dt
df =pd.read_csv('myfile.csv')
dtr =pd.date_range('01.01.2016', '31.12.2018')
df.index = pd.DatetimeIndex(df.index)
df =df.reindex(dtr,fill_value = 0)
df.to_csv('missingDateCorrected.csv', encoding ='utf-8', index =True)
Note: I know I set index=True; that's why the index appears, but I don't know why the 'date' column is not filling. If I put parse_dates=['date'] in my pd.read_csv, I get the 'date' column filled with dates from 1970, with otherwise the same results as before.
You can do it like this:
(I did it with a smaller timeframe, so change the dates to fit your data.)
import pandas as pd
x = {"date":["3/1/16","5/1/16","5/1/16"],
"score":[4,5,6]}
df = pd.DataFrame.from_dict(x)
df["date"] = pd.to_datetime(df["date"], format='%d/%m/%y')
df.set_index("date",inplace=True)
dtr =pd.date_range('01.01.2016', '01.10.2016', freq='D')
s = pd.Series(index=dtr, dtype='float64')  # explicit dtype avoids the empty-Series dtype warning in newer pandas
df = pd.concat([df,s[~s.index.isin(df.index)]]).sort_index()
df = df.drop([0],axis=1).fillna(0)
print(df)
Output
score
2016-01-01 0.0
2016-01-02 0.0
2016-01-03 4.0
2016-01-04 0.0
2016-01-05 5.0
2016-01-05 6.0
2016-01-06 0.0
2016-01-07 0.0
2016-01-08 0.0
2016-01-09 0.0
2016-01-10 0.0
With a file
Since you asked in the comments, here is an example with a file:
df = pd.read_csv('myfile.csv', index_col=0)
df.index = pd.to_datetime(df.index, format='%d/%m/%y')
dtr =pd.date_range('01.01.2016', '01.10.2016', freq='D')
s = pd.Series(index=dtr, dtype='float64')
df = pd.concat([df,s[~s.index.isin(df.index)]]).sort_index()
df = df.drop([0],axis=1).fillna(0)
df.to_csv('missingDateCorrected.csv', encoding ='utf-8', index =True)
Just an idea: try resampling to a 1-day frequency, like nd = df.resample('D').pad(). Note that .pad() forward-fills the gaps (it repeats the previous score) rather than inserting zeros.
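If collapsing the duplicate dates into one row per day is acceptable, resample can also produce the zero-filled daily series directly (a sketch; sum() adds up the multiple scores per date, and days with no rows sum to 0):
daily = df.resample('D').sum()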
Not very efficient, but it will work.
import pandas as pd
df = pd.read_csv('myfile.csv', index_col=0)
df.index = pd.to_datetime(df.index, format='%d/%m/%y')
dtr = pd.date_range('01.01.2016', '31.12.2018')
# Create an empty DataFrame from selected date range
empty = pd.DataFrame(index=dtr, columns=['score'])
# Append your CSV file
df = pd.concat([df, empty[~empty.index.isin(df.index)]]).sort_index().fillna(0)
df.to_csv('missingDateCorrected.csv', encoding='utf-8', index=True)
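For reference, the reason the plain reindex in the question produced zeros everywhere is twofold: the index was never parsed as dates (so none of the target dates matched), and reindex refuses a source index with duplicate labels anyway. With unique, parsed dates it would be a one-liner (sketch):
df = df.reindex(dtr, fill_value=0)  # only works when the existing index has no duplicate dates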
I have a data set which is aggregated between two dates, and I want to de-aggregate it daily by dividing the total number by the number of days between these dates.
As a sample
StoreID Date_Start Date_End Total_Number_of_sales
78 12/04/2015 17/05/2015 79089
80 12/04/2015 17/05/2015 79089
The data set I want is:
StoreID Date Number_Sales
78 12/04/2015 79089/36 (as there are 36 days in between, counting both endpoints)
78 13/04/2015 79089/36
78 14/04/2015 79089/36
78 ...
78 17/05/2015 79089/36
Any help would be useful.
Thanks
I'm not sure if this is exactly what you want but you can try this (I've added another imaginary row):
import pandas as pd
df = pd.DataFrame({'date_start':['12/04/2015','17/05/2015'],
'date_end':['18/05/2015','10/06/2015'],
'sales':[79089, 1000]})
df['date_start'] = pd.to_datetime(df['date_start'], format='%d/%m/%Y')
df['date_end'] = pd.to_datetime(df['date_end'], format='%d/%m/%Y')
df['days_diff'] = (df['date_end'] - df['date_start']).dt.days
master_df = pd.DataFrame(None)
for row in df.index:
new_df = pd.DataFrame(index=pd.date_range(start=df['date_start'].iloc[row],
end = df['date_end'].iloc[row],
freq='d'))
    new_df['number_sales'] = df['sales'].iloc[row] / (df['days_diff'].iloc[row] + 1)  # +1: date_range includes both endpoints
master_df = pd.concat([master_df, new_df], axis=0)
First convert the string dates to datetime objects (so you can calculate the number of days in each range), then create a new index based on the date range, and divide the sales. The loop sticks each row of your dataframe into an "expanded" dataframe, and then all of them are concatenated into one master dataframe.
What about creating a new dataframe?
start = pd.to_datetime(df['Date_Start'].values[0], dayfirst=True)
end = pd.to_datetime(df['Date_End'].values[0], dayfirst=True)
idx = pd.date_range(start=start, end=end, freq='D')  # pd.date_range replaces the removed DatetimeIndex(start=..., end=...) construction
res = pd.DataFrame(df['Total_Number_of_sales'].values[0]/len(idx), index=idx, columns=['Number_Sales'])
yields
In[42]: res.head(5)
Out[42]:
Number_Sales
2015-04-12 2196.916667
2015-04-13 2196.916667
2015-04-14 2196.916667
2015-04-15 2196.916667
2015-04-16 2196.916667
If you have multiple stores (according to your comment and edit), then you could loop over all rows, calculate sales and concatenate the resulting dataframes afterwards.
df = pd.DataFrame({'Store_ID': [78, 78, 80],
'Date_Start': ['12/04/2015', '18/05/2015', '21/06/2015'],
'Date_End': ['17/05/2015', '10/06/2015', '01/07/2015'],
'Total_Number_of_sales': [79089., 50000., 25000.]})
to_concat = []
for _, row in df.iterrows():
start = pd.to_datetime(row['Date_Start'], dayfirst=True)
end = pd.to_datetime(row['Date_End'], dayfirst=True)
    idx = pd.date_range(start=start, end=end, freq='D')
sales = [row['Total_Number_of_sales']/len(idx)] * len(idx)
id = [row['Store_ID']] * len(idx)
res = pd.DataFrame({'Store_ID': id, 'Number_Sales':sales}, index=idx)
to_concat.append(res)
res = pd.concat(to_concat)
There are definitely more elegant solutions; have a look for example at this thread, or at the vectorized sketch below.
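For completeness, a fully vectorized sketch of the same idea, under the same assumptions as the loop above (repeat expands each row into its days, cumcount numbers them):
import pandas as pd
df = pd.DataFrame({'Store_ID': [78, 78, 80],
                   'Date_Start': ['12/04/2015', '18/05/2015', '21/06/2015'],
                   'Date_End': ['17/05/2015', '10/06/2015', '01/07/2015'],
                   'Total_Number_of_sales': [79089., 50000., 25000.]})
start = pd.to_datetime(df['Date_Start'], dayfirst=True)
end = pd.to_datetime(df['Date_End'], dayfirst=True)
n_days = (end - start).dt.days + 1                    # inclusive day count per row
out = df.loc[df.index.repeat(n_days)].copy()          # one row per store-day
offset = out.groupby(level=0).cumcount()              # 0, 1, 2, ... within each original row
out['Date'] = start.loc[out.index] + pd.to_timedelta(offset, unit='D')
out['Number_Sales'] = out['Total_Number_of_sales'] / n_days.loc[out.index]
res = out[['Store_ID', 'Date', 'Number_Sales']].reset_index(drop=True)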
Consider building a list of data frames with the DataFrame constructor, iterating through each row of the main data frame. Each iteration expands a sequence of days from Date_Start through Date_End, with the needed sales division: total sales divided by the number of days in the range (inclusive):
from io import StringIO
import pandas as pd
from datetime import timedelta
txt = '''StoreID Date_Start Date_End Total_Number_of_sales
78 12/04/2015 17/05/2015 79089
80 12/04/2015 17/05/2015 89089'''
df = pd.read_table(StringIO(txt), sep=r"\s+", parse_dates=[1, 2], dayfirst=True)
df['Diff_Days'] = (df['Date_End'] - df['Date_Start']).dt.days
def calc_days_sales(row):
long_df = pd.DataFrame({'StoreID': row['StoreID'],
'Date': [row['Date_Start'] + timedelta(days=i)
for i in range(row['Diff_Days']+1)],
'Number_Sales': row['Total_Number_of_sales'] / (row['Diff_Days'] + 1)})  # +1: both endpoints included, so daily figures sum back to the total
return long_df
df_list = [calc_days_sales(row) for i, row in df.iterrows()]
final_df = pd.concat(df_list).reindex(['StoreID', 'Date', 'Number_Sales'], axis='columns')
print(final_df.head(10))
# StoreID Date Number_Sales
# 0 78 2015-04-12 2196.916667
# 1 78 2015-04-13 2196.916667
# 2 78 2015-04-14 2196.916667
# 3 78 2015-04-15 2196.916667
# 4 78 2015-04-16 2196.916667
# 5 78 2015-04-17 2196.916667
# 6 78 2015-04-18 2196.916667
# 7 78 2015-04-19 2196.916667
# 8 78 2015-04-20 2196.916667
# 9 78 2015-04-21 2196.916667
The reindex at the end is not needed on Python 3.6+, since the data frame's input dictionary preserves insertion order.