I'm preparing my data for price analytics, so I wrote this code that pulls the price feed from the CoinGecko API, selects the required columns, renames the headers, and converts the date.
The current blocker is that once I convert the timestamp to datetime, I lose the price column. How can I get it back along with the new date format?
import pandas as pd
from pycoingecko import CoinGeckoAPI
cg = CoinGeckoAPI()
response = cg.get_coin_market_chart_by_id(id='bitcoin',
                                          vs_currency='usd',
                                          days='90',
                                          interval='daily')
df1 = pd.json_normalize(response)
df2 = df1.explode('prices')
df2 = pd.DataFrame(df2['prices'].to_list(), columns=['dates','prices'])
df2.rename(columns={'dates': 'ds', 'prices': 'y'}, inplace=True)
print('DATAFRAME EXPLODED: ',df2)
df2 = df2['ds'].mul(1e6).apply(pd.Timestamp)
df2 = pd.DataFrame(df2.to_list(), columns=['ds','y'])
df3 = df2.tail()
print('DATAFRAME TAILED: ',df3)
DATAFRAME EXPLODED:
ds y
0 1618185600000 59988.020959
1 1618272000000 59911.020595
2 1618358400000 63576.676041
3 1618444800000 62807.123233
4 1618531200000 63179.772446
.. ... ...
86 1625616000000 34149.989815
87 1625702400000 33932.254638
88 1625788800000 32933.578199
89 1625875200000 33971.297750
90 1625895274000 33738.909080
[91 rows x 2 columns]
DATAFRAME TAILED:
86 2021-07-07 00:00:00
87 2021-07-08 00:00:00
88 2021-07-09 00:00:00
89 2021-07-10 00:00:00
90 2021-07-10 05:34:34
Name: ds, dtype: datetime64[ns]
ValueError: Shape of passed values is (91, 1), indices imply (91, 3)
Change:
df2 = df2['ds'].mul(1e6).apply(pd.Timestamp)
df2 = pd.DataFrame(df2.to_list(), columns=['ds','y'])
to:
df2['ds_datetime'] = df2['ds'].mul(1e6).apply(pd.Timestamp)
Try this:
import pandas as pd
from pycoingecko import CoinGeckoAPI
cg = CoinGeckoAPI()
response = cg.get_coin_market_chart_by_id(id='bitcoin',
                                          vs_currency='usd',
                                          days='90',
                                          interval='daily')
df1 = pd.json_normalize(response)
df2 = df1.explode('prices')
df2 = pd.DataFrame(df2['prices'].to_list(), columns=['dates','prices'])
df2.rename(columns={'dates': 'ds','prices': 'y'}, inplace=True)
print('DATAFRAME EXPLODED: ',df2)
df2['ds'] = df2['ds'].mul(1e6).apply(pd.Timestamp)
# df2 = pd.DataFrame(df2.to_list(), columns=['ds','y'])
df3 = df2.tail()
print('DATAFRAME TAILED: ',df3)
By writing df2 = df2['ds'].mul(1e6).apply(pd.Timestamp), you reassigned df2 to a single Series, which dropped the price column. Assigning the converted values back into the 'ds' column keeps both columns.
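As a side note, since the raw timestamps are in milliseconds, pd.to_datetime with unit='ms' is a simpler alternative that converts the column in place and leaves y untouched. A minimal sketch on hard-coded sample rows (not the live API response):

```python
import pandas as pd

# two sample rows copied from the output above, assuming the same
# millisecond-timestamp layout that the CoinGecko response uses
df2 = pd.DataFrame({'ds': [1618185600000, 1618272000000],
                    'y': [59988.020959, 59911.020595]})

# convert only the 'ds' column; 'y' is untouched
df2['ds'] = pd.to_datetime(df2['ds'], unit='ms')
print(df2)
```

This avoids the mul(1e6) step entirely, because unit='ms' tells pandas the epoch values are already in milliseconds.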
I read and transform data using the following code
import matplotlib.pyplot as plt
import pandas as pd
import matplotlib.dates as dates
import numpy as np
df = pd.read_csv('data/C2A2_data/BinnedCsvs_d400/fb441e62df2d58994928907a91895ec62c2c42e6cd075c2700843b89.csv', parse_dates=['Date'])
df.drop('ID', axis='columns', inplace = True)
df_min = df[(df['Date']<='2014-12') & (df['Date']>='2004-01') & (df['Element']=='TMIN')]
df_min.drop('Element', axis='columns', inplace = True)
df_min = df_min.groupby('Date').agg({'Data_Value': 'min'}).reset_index()
giving the following result
Date Data_Value
0 2005-01-01 -56
1 2005-01-02 -56
2 2005-01-03 0
3 2005-01-04 -39
4 2005-01-05 -94
Now I try to get the Date in Year-Month. So
Date Data_Value
0 2005-01 -94
1 2005-02 xx
2 2005-03 xx
3 2005-04 xx
4 2005-05 xx
Where xx is the minimum value for that year-month.
how do I have to change the Groupby function or is this not possible with this function?
Use pd.Grouper() to aggregate by yearly/monthly/daily frequencies.
Code
df_min["Date"] = pd.to_datetime(df_min["Date"])
df_ans = df_min.groupby(pd.Grouper(key="Date", freq="M")).min()
Result
print(df_ans)
Data_Value
Date
2005-01-31 -94
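For a self-contained illustration (with made-up values, not the asker's actual weather data), pd.Grouper can be exercised like this:

```python
import pandas as pd

# toy data: three dates across two months, made-up values
df = pd.DataFrame({'Date': pd.to_datetime(['2005-01-01', '2005-01-15', '2005-02-03']),
                   'Data_Value': [-56, -94, -12]})

# group into month-end bins and take the minimum per month
out = df.groupby(pd.Grouper(key='Date', freq='M')).min()
print(out)
```

Each group label is the month-end date (2005-01-31, 2005-02-28), matching the result shown above.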
You can first map the Date column to get only the year and month, and then perform a groupby and take the min of each group:
# import libraries
import pandas as pd
# test data
data = [['2005-01-01', -56], ['2005-01-01', -3], ['2005-01-01', 6],
        ['2005-01-01', 26], ['2005-01-01', 56], ['2005-02-01', -26],
        ['2005-02-01', -2], ['2005-02-01', 6], ['2005-02-01', 26],
        ['2005-03-01', 56], ['2005-03-01', -33], ['2005-03-01', -5],
        ['2005-03-01', 6], ['2005-03-01', 26], ['2005-03-01', 56]]
# create dataframe
df_min = pd.DataFrame(data=data, columns=["Date", "Date_value"])
# convert 'Date' column to datetime datatype
df_min['Date'] = pd.to_datetime(df_min['Date'])
# get only year and month (zero-padded, e.g. '2005-01')
df_min['Date'] = df_min['Date'].dt.strftime('%Y-%m')
# get min value for each group
df_min = df_min.groupby('Date').min()
After printing df_min, the output is:
         Date_value
Date
2005-01         -56
2005-02         -26
2005-03         -33
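A variation on the same idea uses dt.to_period('M') instead of a string conversion, so the year-month keys stay period-typed and sort chronologically. A sketch on a cut-down version of the test data:

```python
import pandas as pd

# cut-down toy data: two January rows and one February row
df_min = pd.DataFrame({'Date': pd.to_datetime(['2005-01-01', '2005-01-02', '2005-02-01']),
                       'Date_value': [-56, -3, -26]})

# to_period('M') truncates each timestamp to its month
df_min['Date'] = df_min['Date'].dt.to_period('M')
out = df_min.groupby('Date').min()
print(out)
```

The resulting index is a PeriodIndex ('2005-01', '2005-02'), which is the Year-Month shape the question asks for.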
I need to resample a Pandas MultiIndex consisting of two levels; the inner level is a datetime index, which needs to be resampled.
import numpy as np
import pandas as pd
rng = pd.date_range('2019-01-01', '2019-04-27', freq='B', name='date')
df = pd.DataFrame(np.random.randint(0, 100, (len(rng), 2)), index=rng, columns=['sec1', 'sec2'])
df['month'] = df.index.month
df.set_index(['month', rng], inplace=True)
print(df)
# At that point I need to apply pd.resample. I'm wondering how to specify the level that I would like to resample?
df = df.resample('M').last() # is not working;
# I'm looking for something like this: df = df.resample('M', level=1).last()
Try:
df.groupby('month').resample('M', level=1).last()
Output:
sec1 sec2
month date
1 2019-01-31 59 87
2 2019-02-28 70 33
3 2019-03-31 71 38
4 2019-04-30 56 79
Details.
First, group the dataframe on 'month' or level=0 of the index.
Next, use resample with the level parameter for MultiIndex.
The level parameter can use either str, the index level name such as 'date' in this case, or the level number.
Lastly, chain an aggregation function such as last.
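If you prefer a single groupby, the same shape can presumably also be reached by grouping on both levels at once, with a pd.Grouper binning the date level. A sketch with a fixed seed so it is reproducible:

```python
import numpy as np
import pandas as pd

np.random.seed(0)  # fixed seed so the random values are reproducible
rng = pd.date_range('2019-01-01', '2019-04-27', freq='B', name='date')
df = pd.DataFrame(np.random.randint(0, 100, (len(rng), 2)),
                  index=rng, columns=['sec1', 'sec2'])
df['month'] = df.index.month
df.set_index(['month', rng], inplace=True)

# group on both index levels: 'month' as-is, 'date' binned to month-end
out = df.groupby([pd.Grouper(level='month'),
                  pd.Grouper(level='date', freq='M')]).last()
print(out)
```

This yields one row per (month, month-end date) pair, the same four rows as the groupby-then-resample chain.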
I have a data set which is aggregated between two dates, and I want to de-aggregate it daily by dividing the total by the number of days between those dates.
As a sample
StoreID Date_Start Date_End Total_Number_of_sales
78 12/04/2015 17/05/2015 79089
80 12/04/2015 17/05/2015 79089
The data set I want is:
StoreID Date Number_Sales
78 12/04/2015 79089/38 (as there are 38 days in between)
78 13/04/2015 79089/38 (as there are 38 days in between)
78 14/04/2015 79089/38 (as there are 38 days in between)
78 ...
78 17/05/2015 79089/38 (as there are 38 days in between)
Any help would be useful.
Thanks
I'm not sure if this is exactly what you want but you can try this (I've added another imaginary row):
import pandas as pd
df = pd.DataFrame({'date_start': ['12/04/2015', '17/05/2015'],
                   'date_end': ['18/05/2015', '10/06/2015'],
                   'sales': [79089, 1000]})
df['date_start'] = pd.to_datetime(df['date_start'], format='%d/%m/%Y')
df['date_end'] = pd.to_datetime(df['date_end'], format='%d/%m/%Y')
df['days_diff'] = (df['date_end'] - df['date_start']).dt.days
master_df = pd.DataFrame(None)
for row in df.index:
    new_df = pd.DataFrame(index=pd.date_range(start=df['date_start'].iloc[row],
                                              end=df['date_end'].iloc[row],
                                              freq='d'))
    new_df['number_sales'] = df['sales'].iloc[row] / df['days_diff'].iloc[row]
    master_df = pd.concat([master_df, new_df], axis=0)
First convert string dates to datetime objects (so you can calculate number of days in between ranges), then create a new index based on the date range, and divide sales. The loop sticks each row of your dataframe into an "expanded" dataframe and then concatenates them into one master dataframe.
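The loop can presumably also be avoided with one date_range per row plus explode, which stays inside pandas. A sketch against the same toy frame (note it divides by the inclusive day count, i.e. 37 days for the first row):

```python
import pandas as pd

# same toy frame as the answer above
df = pd.DataFrame({'date_start': ['12/04/2015', '17/05/2015'],
                   'date_end': ['18/05/2015', '10/06/2015'],
                   'sales': [79089, 1000]})
df['date_start'] = pd.to_datetime(df['date_start'], format='%d/%m/%Y')
df['date_end'] = pd.to_datetime(df['date_end'], format='%d/%m/%Y')

# build one date_range per row, divide sales by its length,
# then explode to one row per day
df['date'] = [pd.date_range(s, e, freq='d')
              for s, e in zip(df['date_start'], df['date_end'])]
df['number_sales'] = df['sales'] / df['date'].str.len()
long_df = df.explode('date')[['date', 'number_sales']]
print(long_df.head())
```

explode repeats each row once per date in its range, so the final frame has one row per store-day.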
What about creating a new dataframe?
start = pd.to_datetime(df['Date_Start'].values[0], dayfirst=True)
end = pd.to_datetime(df['Date_End'].values[0], dayfirst=True)
idx = pd.date_range(start=start, end=end, freq='D')
res = pd.DataFrame(df['Total_Number_of_sales'].values[0]/len(idx), index=idx, columns=['Number_Sales'])
yields
In[42]: res.head(5)
Out[42]:
Number_Sales
2015-04-12 2196.916667
2015-04-13 2196.916667
2015-04-14 2196.916667
2015-04-15 2196.916667
2015-04-16 2196.916667
If you have multiple stores (according to your comment and edit), then you could loop over all rows, calculate sales and concatenate the resulting dataframes afterwards.
df = pd.DataFrame({'Store_ID': [78, 78, 80],
                   'Date_Start': ['12/04/2015', '18/05/2015', '21/06/2015'],
                   'Date_End': ['17/05/2015', '10/06/2015', '01/07/2015'],
                   'Total_Number_of_sales': [79089., 50000., 25000.]})
to_concat = []
for _, row in df.iterrows():
    start = pd.to_datetime(row['Date_Start'], dayfirst=True)
    end = pd.to_datetime(row['Date_End'], dayfirst=True)
    idx = pd.date_range(start=start, end=end, freq='D')
    sales = [row['Total_Number_of_sales']/len(idx)] * len(idx)
    id = [row['Store_ID']] * len(idx)
    res = pd.DataFrame({'Store_ID': id, 'Number_Sales': sales}, index=idx)
    to_concat.append(res)
res = pd.concat(to_concat)
There are definitely more elegant solutions; have a look for example at this thread.
Consider building a list of data frames with the DataFrame constructor, iterating through each row of the main data frame. Each iteration expands a sequence of days from Date_Start to the end of the range, with per-day sales computed as total sales divided by the day difference:
from io import StringIO
import pandas as pd
from datetime import timedelta
txt = '''StoreID Date_Start Date_End Total_Number_of_sales
78 12/04/2015 17/05/2015 79089
80 12/04/2015 17/05/2015 89089'''
df = pd.read_csv(StringIO(txt), sep=r"\s+", parse_dates=[1, 2], dayfirst=True)
df['Diff_Days'] = (df['Date_End'] - df['Date_Start']).dt.days
def calc_days_sales(row):
    long_df = pd.DataFrame({'StoreID': row['StoreID'],
                            'Date': [row['Date_Start'] + timedelta(days=i)
                                     for i in range(row['Diff_Days']+1)],
                            'Number_Sales': row['Total_Number_of_sales'] / row['Diff_Days']})
    return long_df
df_list = [calc_days_sales(row) for i, row in df.iterrows()]
final_df = pd.concat(df_list).reindex(['StoreID', 'Date', 'Number_Sales'], axis='columns')
print(final_df.head(10))
# StoreID Date Number_Sales
# 0 78 2015-04-12 2259.685714
# 1 78 2015-04-13 2259.685714
# 2 78 2015-04-14 2259.685714
# 3 78 2015-04-15 2259.685714
# 4 78 2015-04-16 2259.685714
# 5 78 2015-04-17 2259.685714
# 6 78 2015-04-18 2259.685714
# 7 78 2015-04-19 2259.685714
# 8 78 2015-04-20 2259.685714
# 9 78 2015-04-21 2259.685714
The reindex at the end is not needed on Python 3.6+, since the data frame's input dictionary will already be ordered.
I have this list :
20161216014500
20161216020000
20161216021500
20161216023000
20161216024500
20161216030000
20161216031500
20161216033000
20161216034500
20161216040000
20161216041500
20161216043000
20161216044500
20161216050000
20161216051500
20161216053000
20161216054500
After parsing it and putting it in the correct format with this code:
for row in rows:
    if "".join(row).strip() != "":
        chaine = str(row[0] + row[1])
        date = chaine[:10] + " " + chaine[11:]
        header = parseDate(date)
        header = str(header).replace('-', '')
        header = str(header).replace(':', '')
        header = str(header).replace(' ', '')
        print(header)
I want to insert the header(the list above) in a dataframe using pandas:
newDataframe = pd.DataFrame(data, index=index, columns=header)
This is the error I get:
14 columns passed, passed data had 1 columns
What is the reason for this error, and how do I correct it?
You can do the same thing this way:
import pandas as pd
rows = ['20161216014500',
'20161216020000',
'20161216021500',
'20161216023000',
'20161216024500',
'20161216030000',
'20161216031500',
'20161216033000',
'20161216034500',
'20161216040000',
'20161216041500',
'20161216043000',
'20161216044500',
'20161216050000',
'20161216051500',
'20161216053000',
'20161216054500']
df = pd.DataFrame(rows, columns=['date'])
df['date'] = pd.to_datetime(df['date'], format='%Y%m%d%H%M%S')
df
output:
                  date
0  2016-12-16 01:45:00
1  2016-12-16 02:00:00
2  2016-12-16 02:15:00
3  2016-12-16 02:30:00
4  2016-12-16 02:45:00
5  2016-12-16 03:00:00
6  2016-12-16 03:15:00
7  2016-12-16 03:30:00
8  2016-12-16 03:45:00
9  2016-12-16 04:00:00
10 2016-12-16 04:15:00
11 2016-12-16 04:30:00
12 2016-12-16 04:45:00
13 2016-12-16 05:00:00
14 2016-12-16 05:15:00
15 2016-12-16 05:30:00
16 2016-12-16 05:45:00
import io
import pandas as pd
a = io.StringIO(u"""20161216014500
20161216020000
20161216021500
20161216023000
20161216024500
20161216030000
20161216031500
20161216033000
20161216034500
20161216040000
20161216041500
20161216043000
20161216044500
20161216050000
20161216051500
20161216053000
20161216054500""")
df = pd.read_csv(a, header=None)
df[0] = pd.to_datetime(df[0], format='%Y%m%d%H%M%S')
df.head()
Output:
0
0 2016-12-16 01:45:00
1 2016-12-16 02:00:00
2 2016-12-16 02:15:00
3 2016-12-16 02:30:00
4 2016-12-16 02:45:00
How do I convert a pandas index of strings to datetime format?
My dataframe df is like this:
value
2015-09-25 00:46 71.925000
2015-09-25 00:47 71.625000
2015-09-25 00:48 71.333333
2015-09-25 00:49 64.571429
2015-09-25 00:50 72.285714
but the index is of type string, and I need it in datetime format because I get the error:
'Index' object has no attribute 'hour'
when using
df["A"] = df.index.hour
It should work as expected. Try to run the following example.
import pandas as pd
import io
data = """value
"2015-09-25 00:46" 71.925000
"2015-09-25 00:47" 71.625000
"2015-09-25 00:48" 71.333333
"2015-09-25 00:49" 64.571429
"2015-09-25 00:50" 72.285714"""
df = pd.read_csv(io.StringIO(data), sep=r"\s+")
# Converting the index as date
df.index = pd.to_datetime(df.index)
# Extracting hour & minute
df['A'] = df.index.hour
df['B'] = df.index.minute
df
# value A B
# 2015-09-25 00:46:00 71.925000 0 46
# 2015-09-25 00:47:00 71.625000 0 47
# 2015-09-25 00:48:00 71.333333 0 48
# 2015-09-25 00:49:00 64.571429 0 49
# 2015-09-25 00:50:00 72.285714 0 50
You could explicitly create a DatetimeIndex when initializing the dataframe. Assuming your data is in string format
data = [
('2015-09-25 00:46', '71.925000'),
('2015-09-25 00:47', '71.625000'),
('2015-09-25 00:48', '71.333333'),
('2015-09-25 00:49', '64.571429'),
('2015-09-25 00:50', '72.285714'),
]
index, values = zip(*data)
frame = pd.DataFrame({
'values': values
}, index=pd.DatetimeIndex(index))
print(frame.index.minute)
Just to give another option for this question: a DatetimeIndex exposes the datetime attributes directly, so the '.dt' accessor is not needed (it is only for Series):
import pandas as pd
df.index = pd.to_datetime(df.index)
# get the year
df.index.year
# get the month
df.index.month
# get the day
df.index.day
# get the hour
df.index.hour
# get the minute
df.index.minute
Doing
df.index = pd.to_datetime(df.index, errors='coerce')
changes the data type of the index to datetime64[ns]; any labels that cannot be parsed become NaT instead of raising an error.
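A minimal illustration of the errors='coerce' behaviour, on a made-up frame with one deliberately bad label:

```python
import pandas as pd

# string index with one unparseable label
df = pd.DataFrame({'value': [71.925, 71.625, 64.571]},
                  index=['2015-09-25 00:46', '2015-09-25 00:47', 'not-a-date'])

# coerce: valid strings become Timestamps, the bad one becomes NaT
df.index = pd.to_datetime(df.index, errors='coerce')
print(df.index)
```

With the default errors='raise', the same call would fail on 'not-a-date' instead of producing NaT.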