I have daily precipitation values with time information in the following form:
a = [(19500101,3.45),(19500102,1.2).......(19701231,1.4)]
I want to take the annual mean using the date information. There might be a simple solution. I have tried the code below. Any suggestions?
import numpy

prcp = numpy.array(precipitation)
time = numpy.array(time)
yearly = numpy.zeros(prcp.shape)
#-----------------Get annual means-----------------
for ii in xrange(len(time)):
    tt = time[ii]
    if ii == 0:
        year_old = tt[0:4]
        index_start = ii
    else:
        #----------------new year----------------
        year = tt[0:4]
        if year != year_old:
            year_mean = numpy.mean(prcp[index_start:ii])
            yearly[index_start:ii] = year_mean
            year_old = year
            index_start = ii
    #----------------Get the last year----------------
    if ii == len(time)-1:
        year_mean = numpy.mean(prcp[index_start:])
        yearly[index_start:] = year_mean
You could try Pandas for aggregations.
import pandas as pd
a = [(19500101,3.45),(19500102,1.2), (19701231,1.4)]
df = pd.DataFrame(a) # convert to dataframe
df[0] = pd.to_datetime(df[0], format='%Y%m%d') # create a datetime series
df.groupby(df[0].map(lambda x: x.year)).mean() # group by year and take the mean of each group
           1
0
1950   2.325
1970   1.400
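A slightly shorter variant, assuming your pandas version has the .dt accessor on datetime columns (0.15 and later), is to group on the year directly:
df.groupby(df[0].dt.year)[1].mean()  # same result: 1950 -> 2.325, 1970 -> 1.400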
You could use the snippet below to do this:
First, segregate the data based on the years:
>>> list_of_data = [(19500101,3.45), (19500102,1.2), (19701231,1.4)]
>>> from collections import defaultdict
>>> data = defaultdict(list)
>>> for item in list_of_data:
...     data[str(item[0])[:4]].append(item[1])
And now, calculate the mean using
>>> for key, value in data.iteritems():
...     print key, sum(value)/len(value)
...
1950 2.325
1970 1.4
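On Python 3, where dict.iteritems() and the print statement no longer exist, the same loop would be:
for key, value in data.items():
    print(key, sum(value) / len(value))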
Note that I am doing two passes over the data; @John's pandas answer will probably be faster if you are okay with using the pandas library.
I recommend pandas as @John-Galt suggested.
If you want a plain Python solution without pandas:
import numpy as np

a = [(19500101,3.45),(19500102,1.2).......(19701231,1.4)]
year = lambda x: int(x[0] / 10**4)
years = {year(x) for x in a}
annual_avg = dict()
for y in years:
    annual_avg[y] = np.mean([x[1] for x in a if year(x) == y])
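With just the three sample tuples used in the other answers, this gives annual_avg == {1950: 2.325, 1970: 1.4}, matching the pandas output above.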
I have a dataframe of daily sales:
import pandas as pd
date = ['28-01-2017','29-01-2017','30-01-2017','31-01-2017','01-02-2017','02-02-2017']
sales = [1,2,3,4,1,2]
ym = [201701,201701,201701,201701,201702,201702]
prev_1_ym = [201612,201612,201612,201612,201701,201701]
prev_2_ym = [201611,201611,201611,201611,201612,201612]
df_test = pd.DataFrame({'date': date,'ym':ym,'prev_1_ym':prev_1_ym,'prev_2_ym':prev_2_ym,'sales':sales})
df_test['date'] = pd.to_datetime(df_test['date'],format = '%d-%m-%Y')
I am trying to find the total sales in the previous month (prev_1_ym), the month before that (prev_2_ym), and so on.
My current approach is to use a list comprehension:
df_test['prev_1m_sales'] = [sum(df_test.loc[df_test['ym'] == x].sales) for x in df_test['prev_1_ym']]
However, this proves to be very slow.
Is there a way to speed it up by using .groupby()?
You can use the date column to group your data. First change its data type to pandas Timestamps (as you already do above):
df_test['date'] = pd.to_datetime(df_test['date'], format='%d-%m-%Y')
Then you can use it directly in a groupby, for example:
df_test.groupby(df_test['date'].dt.month)['sales'].sum().cumsum()
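For the previous-month totals specifically, one groupby-based sketch (the prev_1m_sales / prev_2m_sales names are just illustrative) is to compute the monthly totals once and map them onto the prev_*_ym columns:
# sum sales once per ym, then look those totals up for each row's prev_*_ym
monthly_totals = df_test.groupby('ym')['sales'].sum()
df_test['prev_1m_sales'] = df_test['prev_1_ym'].map(monthly_totals).fillna(0)
df_test['prev_2m_sales'] = df_test['prev_2_ym'].map(monthly_totals).fillna(0)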
I am pulling data using pytreasurydirect and I would like to query each unique cusip, append the results, and create a pandas DataFrame. I am having difficulty generating the pandas DataFrame. I believe it is because of the unicode structure of the data.
import pandas as pd
from pytreasurydirect import TreasuryDirect
td = TreasuryDirect()
cusip_list = [['912796PY9','08/09/2018'],['912796PY9','06/07/2018']]
for i in cusip_list:
    cusip = ''.join(i[0])
    issuedate = ''.join(i[1])
    cusip_value = td.security_info(cusip, issuedate)
    #pd.DataFrame(cusip_value.items())
    df = pd.DataFrame(cusip_value, index=['a'])
    td = td.append(df, ignore_index=False)
Example of data from pytreasurydirect :
Index([u'accruedInterestPer100', u'accruedInterestPer1000',
u'adjustedAccruedInterestPer1000', u'adjustedPrice',
u'allocationPercentage', u'allocationPercentageDecimals',
u'announcedCusip', u'announcementDate', u'auctionDate',
u'auctionDateYear',
...
u'totalTendered', u'treasuryDirectAccepted',
u'treasuryDirectTendersAccepted', u'type',
u'unadjustedAccruedInterestPer1000', u'unadjustedPrice',
u'updatedTimestamp', u'xmlFilenameAnnouncement',
u'xmlFilenameCompetitiveResults', u'xmlFilenameSpecialAnnouncement'],
dtype='object', length=116)
I think you want to define a function like this:
def securities(type):
    # build one pd.Series per key from every security of the given type
    secs = td.security_type(type)
    keys = secs[0].keys() if secs else []
    seri = [pd.Series([sec[key] for sec in secs]) for key in keys]
    return pd.DataFrame(dict(zip(keys, seri)))
Then, use it:
df = securities('Bond')
df[['cusip', 'issueDate', 'maturityDate']].head()
to get results like these, for example (TreasuryDirect returns a lot of additional columns):
cusip issueDate maturityDate
0 912810SD1 2018-08-15T00:00:00 2048-08-15T00:00:00
1 912810SC3 2018-07-16T00:00:00 2048-05-15T00:00:00
2 912810SC3 2018-06-15T00:00:00 2048-05-15T00:00:00
3 912810SC3 2018-05-15T00:00:00 2048-05-15T00:00:00
4 912810SA7 2018-04-16T00:00:00 2048-02-15T00:00:00
At least, those are the results as of today. The results will change over time as bonds are issued and, alas, mature. Note the multiple issueDates per cusip.
Finally, per the TreasuryDirect website (https://www.treasurydirect.gov/webapis/webapisecurities.htm), the possible security types are: Bill, Note, Bond, CMB, TIPS, FRN.
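The same helper can be reused for the other security types listed above, for example:
bills = securities('Bill')
tips = securities('TIPS')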
I would like to calculate the mean and standard deviation of a timedelta, grouped by bank, from a dataframe with the two columns shown below. When I run the code (also shown below) I get this error:
pandas.core.base.DataError: No numeric types to aggregate
My dataframe:
bank diff
Bank of Japan 0 days 00:00:57.416000
Reserve Bank of Australia 0 days 00:00:21.452000
Reserve Bank of New Zealand 55 days 12:39:32.269000
U.S. Federal Reserve 8 days 13:27:11.387000
My code:
means = dropped.groupby('bank').mean()
std = dropped.groupby('bank').std()
Pandas mean() and other aggregation methods support the numeric_only=False parameter.
dropped.groupby('bank').mean(numeric_only=False)
Found here: Aggregations for Timedelta values in the Python DataFrame
You need to convert the timedelta to a numeric value. Converting to int64 via .values is the most accurate option, because nanoseconds are the native numeric representation of a timedelta:
import numpy as np

dropped['new'] = dropped['diff'].values.astype(np.int64)
means = dropped.groupby('bank').mean()
means['new'] = pd.to_timedelta(means['new'])
std = dropped.groupby('bank').std()
std['new'] = pd.to_timedelta(std['new'])
Another solution is to convert the values to seconds with total_seconds(), but that is less accurate:
dropped['new'] = dropped['diff'].dt.total_seconds()
means = dropped.groupby('bank').mean()
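If you want the group means back as timedeltas rather than a float number of seconds, you can convert the result back, for example:
means['new'] = pd.to_timedelta(means['new'], unit='s')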
No need to convert timedelta back and forth. Numpy and pandas can seamlessly do it for you with a faster run time. Using your dropped DataFrame:
import numpy as np
grouped = dropped.groupby('bank')['diff']
mean = grouped.apply(lambda x: np.mean(x))
std = grouped.apply(lambda x: np.std(x))
I would suggest passing the numeric_only=False argument to mean as mentioned by Alexander Usikov - this works for pandas version 0.20+.
If you have an older version, the following works:
import pandas as pd
df = pd.DataFrame({
'td': pd.Series([pd.Timedelta(days=i) for i in range(5)]),
'group': ['a', 'a', 'a', 'b', 'b']
})
(
df
.astype({'td': int}) # convert timedelta to integer (nanoseconds)
.groupby('group')
.mean()
.astype({'td': 'timedelta64[ns]'})
)
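The question also asks for the standard deviation; the same chained pattern works, swapping .mean() for .std() (a sketch following the block above):
(
    df
    .astype({'td': int})                 # timedelta -> integer nanoseconds
    .groupby('group')
    .std()                               # standard deviation per group
    .astype({'td': 'timedelta64[ns]'})   # back to timedelta
)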
I am writing a script with pandas but I have not been able to extract the correct output that I want. Here is the problem:
I can read this data from a CSV file. Here you can find the table structure:
http://postimg.org/image/ie0od7ejr/
I want this output from the above table data:
Month       Demo1  Demo2
June 2013       3      1
July 2013       2      2
In the Demo1 and Demo2 columns I want to count the regular entries and the entries that start with 'u'. For June there are 3 regular entries in total, while 1 entry starts with 'u'.
So far I have written this code:
import sqlite3
from pylab import *
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import datetime as dt
conn = sqlite3.connect('Demo2.sqlite')
df = pd.read_sql("SELECT * FROM Data", conn)
df['DateTime'] = df['DATE'].apply(lambda x: dt.date.fromtimestamp(x))
df1 = df.set_index('DateTime', drop=False)
Thanks in advance for the help. The end result will be a bar graph; I can draw the graph from the output mentioned above.
For resample, you can define two aggregation functions like this:
def countU(x):
    return sum(i[0] == 'u' for i in x)

def countNotU(x):
    return sum(i[0] != 'u' for i in x)
print df.resample('M', how=[countU, countNotU])
Alternatively, consider groupby.
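A sketch of that groupby route, assuming the text column in the linked screenshot is named 'ENTRY' and that DATE holds Unix timestamps as in the question (both names are assumptions):
# bucket rows by calendar month, then count entries that do / do not start with 'u'
df['month'] = pd.to_datetime(df['DATE'], unit='s').dt.to_period('M')
grouped = df.groupby('month')['ENTRY']
demo2 = grouped.apply(lambda s: s.str.startswith('u').sum())     # entries starting with 'u'
demo1 = grouped.apply(lambda s: (~s.str.startswith('u')).sum())  # regular entries
result = pd.DataFrame({'Demo1': demo1, 'Demo2': demo2})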
I'm a beginner in the Python environment and I have a problem with using time series data.
Below is my 1-minute OHLC data.
2011-11-01,9:00:00,248.50,248.95,248.20,248.70
2011-11-01,9:01:00,248.70,249.00,248.65,248.85
2011-11-01,9:02:00,248.90,249.25,248.70,249.15
...
2011-11-01,15:03:00,250.25,250.30,250.05,250.15
2011-11-01,15:04:00,250.15,250.60,250.10,250.60
2011-11-01,15:15:00,250.55,250.55,250.55,250.55
2011-11-02,9:00:00,245.55,246.25,245.40,245.80
2011-11-02,9:01:00,245.85,246.40,245.75,246.35
2011-11-02,9:02:00,246.30,246.45,245.75,245.80
2011-11-02,9:03:00,245.75,245.85,245.30,245.35
...
I'd like to extract the "CLOSE" value from each row and convert the data into the following format:
2011-11-01, 248.70, 248.85, 249.15, ... 250.15, 250.60, 250.55
2011-11-02, 245.80, 246.35, 245.80, ...
...
I'd also like to calculate the highest CLOSE value and its time (minute) for EACH DAY, like the following:
2011-11-01, 10:23:03, 250.55
2011-11-02, 11:02:36, 251.00
....
Any help would be much appreciated. Thank you in advance.
You can use the pandas library. In the case of your data you can get the max as:
import pandas as pd
# Read in the data and parse the first two columns as a
# date-time and set it as index
df = pd.read_csv('your_file', parse_dates=[[0,1]], index_col=0, header=None)
# get only the fifth column (close)
df = df[[5]]
# Resample to date frequency and get the max value for each day.
df.resample('D', how='max')
If you also want to show the times, keep them in your DataFrame as a column and pass a function that will determine the max close value and return that row:
>>> df = pd.read_csv('your_file', parse_dates=[[0,1]], index_col=0, header=None,
usecols=[0, 1, 5], names=['d', 't', 'close'])
>>> df['time'] = df.index
>>> df.resample('D', how=lambda group: group.iloc[group['close'].argmax()])
             close                time
d_t
2011-11-01  250.60 2011-11-01 15:04:00
2011-11-02  246.35 2011-11-02 09:01:00
And if you want a list of the prices per day, just do a groupby per day and return the list of all prices from every group, using apply on the grouped object:
>>> df.groupby(lambda dt: dt.date()).apply(lambda group: list(group['close']))
2011-11-01 [248.7, 248.85, 249.15, 250.15, 250.6, 250.55]
2011-11-02 [245.8, 246.35, 245.8, 245.35]
For more information take a look at the docs: Time Series
Update for the concrete data set:
The problem with your data set is that you have some days without any data, so the function passed in as the resampler should handle those cases:
def func(group):
    if len(group) == 0:
        return None
    return group.iloc[group['close'].argmax()]
df.resample('D', how=func).dropna()