Hi, I have a time series and would like to count how many events I have per day (i.e. rows in the table within a day). The command I'd like to use is:
ts.resample('D', how='count')
but "count" is not a valid aggregation function for time series, I suppose.
Just to clarify, here is a sample of the dataframe:
0 2008-02-22 03:43:00
1 2008-02-22 03:43:00
2 2010-08-05 06:48:00
3 2006-02-07 06:40:00
4 2005-06-06 05:04:00
5 2008-04-17 02:11:00
6 2012-05-12 06:46:00
7 2004-05-17 08:42:00
8 2004-08-02 05:02:00
9 2008-03-26 03:53:00
Name: Data_Hora, dtype: datetime64[ns]
and this is the error I am getting:
ts.resample('D').count()
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-42-86643e21ce18> in <module>()
----> 1 ts.resample('D').count()
/usr/local/lib/python2.7/dist-packages/pandas/core/generic.pyc in resample(self, rule, how, axis, fill_method, closed, label, convention, kind, loffset, limit, base)
255 def resample(self, rule, how=None, axis=0, fill_method=None,
256 closed=None, label=None, convention='start',
--> 257 kind=None, loffset=None, limit=None, base=0):
258 """
259 Convenience method for frequency conversion and resampling of regular
/usr/local/lib/python2.7/dist-packages/pandas/tseries/resample.pyc in resample(self, obj)
98 return obj
99 else: # pragma: no cover
--> 100 raise TypeError('Only valid with DatetimeIndex or PeriodIndex')
101
102 rs_axis = rs._get_axis(self.axis)
TypeError: Only valid with DatetimeIndex or PeriodIndex
That can be fixed by turning the datetime column into an index with set_index. However, after I do that, I still get the following error:
DataError: No numeric types to aggregate
because my DataFrame does not have a numeric column.
But I just want to count rows!! The simple SELECT COUNT(*) ... GROUP BY from SQL.
In order to get this to work, after removing the rows in which the index was NaT:
df2 = df[df.index.notnull()]  # note: a != pd.NaT comparison is always True elementwise, so it filters nothing
I had to add a column of ones:
df2['n'] = 1
and then count only that column:
df2.n.resample('D', how="sum")
then I could visualize the data with:
plot(df2.n.resample('D', how="sum"))
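For reference, in modern pandas (0.18 and later) the whole workaround collapses: resample returns a Resampler object, and its size() method counts rows per bin regardless of dtype, so no numeric helper column is needed. A minimal sketch, using the first few rows of the sample above:
import pandas as pd

# hypothetical frame matching the question's sample
df = pd.DataFrame({'Data_Hora': pd.to_datetime([
    '2008-02-22 03:43:00', '2008-02-22 03:43:00', '2010-08-05 06:48:00'])})

daily = df.set_index('Data_Hora').resample('D').size()  # rows per day, like SQL count(*) ... GROUP BY
daily.plot()                                            # visualize the daily counts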
count works fine after resample, provided the resampled object has a DatetimeIndex:
In [104]: df = DataFrame(1,index=date_range('20130101 9:01',freq='h',periods=1000),columns=['A'])
In [105]: df
Out[105]:
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 1000 entries, 2013-01-01 09:01:00 to 2013-02-12 00:01:00
Freq: H
Data columns (total 1 columns):
A 1000 non-null values
dtypes: int64(1)
In [106]: df.resample('D').count()
Out[106]:
A 43
dtype: int64
You can do this with a one-liner, using value_counts and resampling.
Assuming your DataFrame is named df:
df.index.value_counts().resample('D', how='sum')
This method also works if datetime is not your index:
df.any_datetime_series.value_counts().resample('D', how='sum')
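In later pandas versions the how= argument was removed, so the method-chained equivalent would be as follows (sort_index is added because value_counts orders by frequency rather than by date; a sketch):
df.any_datetime_series.value_counts().sort_index().resample('D').sum()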
I am curious as to why, when I create a DataFrame in the manner below, using lists to create the values in the rows, it does not graph and instead gives me the error "ValueError: x must be a label or position".
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
values = [9.83, 19.72, 7.19, 3.04]
values
[9.83, 19.72, 7.19, 3.04]
cols = ['Condition', 'No-Show']
conditions = ['Scholarship', 'Hipertension', 'Diabetes', 'Alcoholism']
df = pd.DataFrame(columns = [cols])
df['Condition'] = conditions
df['No-Show'] = values
df
Condition No-Show
0 Scholarship 9.83
1 Hipertension 19.72
2 Diabetes 7.19
3 Alcoholism 3.04
df.plot(kind='bar', x='Condition', y='No-Show');
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Input In [17], in <cell line: 1>()
----> 1 df.plot(kind='bar', x='Condition', y='No-Show')
File ~\anaconda3\lib\site-packages\pandas\plotting\_core.py:938, in
PlotAccessor.__call__(self, *args, **kwargs)
936 x = data_cols[x]
937 elif not isinstance(data[x], ABCSeries):
--> 938 raise ValueError("x must be a label or position")
939 data = data.set_index(x)
940 if y is not None:
941 # check if we have y as int or list of ints
ValueError: x must be a label or position
Yet if I create the same DataFrame a different way, it graphs just fine....
df2 = pd.DataFrame({'Condition': ['Scholarship', 'Hipertension', 'Diabetes', 'Alcoholism'],
'No-Show': [9.83, 19.72, 7.19, 3.04]})
df2
Condition No-Show
0 Scholarship 9.83
1 Hipertension 19.72
2 Diabetes 7.19
3 Alcoholism 3.04
df2.plot(kind='bar', x='Condition', y='No-Show')
plt.ylim(0, 50)
#graph appears here just fine
Can someone enlighten me why it works the second way and not the first? I am a new student and am confused. I appreciate any insight.
Let's look at pd.DataFrame.info for both dataframes.
df.info()
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 (Condition,) 4 non-null object
1 (No-Show,) 4 non-null float64
dtypes: float64(1), object(1)
memory usage: 192.0+ bytes
Note, your column headers here are one-element tuples rather than plain strings.
Now, look at info for df2.
df2.info()
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Condition 4 non-null object
1 No-Show 4 non-null float64
dtypes: float64(1), object(1)
memory usage: 192.0+ bytes
Note your column headers here are strings.
As @BigBen states in his comment, you don't need the extra brackets in your DataFrame constructor for df.
FYI, to make your plot statement work with the incorrect constructor as-is, address the columns by their tuple labels:
df.plot(kind='bar', x=('Condition',), y=('No-Show',))
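For completeness, a sketch of the corrected constructor, which keeps the headers as plain strings so the original plot call works unchanged:
import pandas as pd

cols = ['Condition', 'No-Show']
conditions = ['Scholarship', 'Hipertension', 'Diabetes', 'Alcoholism']
values = [9.83, 19.72, 7.19, 3.04]

df = pd.DataFrame(columns=cols)  # pass the list itself, not [cols]
df['Condition'] = conditions
df['No-Show'] = values

df.plot(kind='bar', x='Condition', y='No-Show')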
When I use an f-string with format modifiers in an expression on a pandas Series, I get a TypeError. However, the same string expression works fine with regular data. Is it possible to do this in pandas at all?
It works with regular data:
episode = 42
frame = 4242
f"e{episode:03}_f{frame:06}.jpg"
> 'e042_f004242.jpg'
fails with pandas.Series:
df['filename'] = f"e{df.episode:03}_f{df.frame:05}_i{df.item:05}.jpg"
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-84-73001803277d> in <module>
----> 1 df['filename'] = f"e{df.episode:03}_f{df.frame:05}_i{df.item:05}.jpg"
TypeError: unsupported format string passed to Series.__format__
The TypeError shows up whenever I pass format modifiers in the f-string, of any kind, including {df.area:.2f}. However, without format modifiers I get no error, but receive something useless like e0 215\n1 1\n2 1\n3
I have workarounds, but would like to use f-strings as they are neat.
pd.__version__
> '0.25.3'
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 84604 entries, 0 to 84603
Data columns (total 5 columns):
episode 84604 non-null int32
frame 84604 non-null int64
item 84604 non-null int64
img_size 84604 non-null float64
bounds 84604 non-null object
dtypes: float64(1), int32(1), int64(2), object(1)
memory usage: 2.9+ MB
Unfortunately, f-strings do not work if you pass in whole DataFrame columns; it is necessary to process scalar values, here row by row in DataFrame.apply. Note the x['item'] lookup: because item is a valid pandas method (Series.item), the attribute access x.item would grab the method rather than the value and raise an error:
df = pd.DataFrame({'episode':[42, 21], 'frame':[4242,4248], 'item':[20,563]})
df['filename']=df.apply(lambda x: f"e{x.episode:03}_f{x.frame:05}_i{x['item']:05}.jpg", axis=1)
print (df)
episode frame item filename
0 42 4242 20 e042_f04242_i00020.jpg
1 21 4248 563 e021_f04248_i00563.jpg
Or in list comprehension:
df['filename'] = [f"e{e:03}_f{f:05}_i{i:05}.jpg"
for e,f,i
in zip(df.episode, df.frame, df.item)]
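A fully vectorised alternative (not from the original answer, just a sketch) builds the strings with the .str accessor and zfill, avoiding per-row Python entirely:
df['filename'] = ('e' + df['episode'].astype(str).str.zfill(3) +
                  '_f' + df['frame'].astype(str).str.zfill(5) +
                  '_i' + df['item'].astype(str).str.zfill(5) + '.jpg')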
However f-strings work fine with pd.Series without format modifiers: df['filename'] = f"e{df.episode}_f{df.frame}_i{df.item}.jpg" works OK.
It only appears to work: because these are whole Series, their string representations get interpolated, so the output is:
df['filename'] = f"e{df.episode}_f{df.frame}_i{df.item}.jpg"
print (df)
episode frame item filename
0 42 4242 20 e0 42\n1 21\nName: episode, dtype: int64...
1 21 4248 563 e0 42\n1 21\nName: episode, dtype: int64...
I have the following dataframe:
Join_Count 1
LSOA11CD
E01006512 15
E01006513 35
E01006514 11
E01006515 11
E01006518 11
...
But when I try to sort it:
BusStopList.sort("LSOA11CD",ascending=1)
I get the following:
KeyError: 'LSOA11CD'
How do I go about sorting this by either the LSOA column or the column full of numbers which doesn't have a heading?
The following is the information produced by Python about this dataframe:
<class 'pandas.core.frame.DataFrame'>
Index: 286 entries, E01006512 to E01033768
Data columns (total 1 columns):
1 286 non-null int64
dtypes: int64(1)
memory usage: 4.5+ KB
'LSOA11CD' is the name of the index, and 1 is the name of the column. So you must use sort_index (rather than sort_values):
BusStopList.sort_index(level="LSOA11CD", ascending=True)
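To sort by the numeric column instead, note from the df.info() output above that its label is the integer 1, so that label can be passed to sort_values; a sketch:
BusStopList.sort_values(by=1, ascending=True)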
I currently have a dataframe as follows, and all I want to do is replace the strings in Maturity with just the number within them. For example, I want to replace FZCY0D with 0, and so on.
Date Maturity Yield_pct Currency
0 2009-01-02 FZCY0D 4.25 AUS
1 2009-01-05 FZCY0D 4.25 AUS
2 2009-01-06 FZCY0D 4.25 AUS
My code is as follows. I tried replacing these strings with the numbers, but that led to the error AttributeError: 'Series' object has no attribute 'split' in the line result.Maturity.replace(result['Maturity'], [int(s) for s in result['Maturity'].split() if s.isdigit()]). I am hence struggling to understand how to do this.
from pandas.io.excel import read_excel
import pandas as pd
import numpy as np
import xlrd
url = 'http://www.rba.gov.au/statistics/tables/xls/f17hist.xls'
xls = pd.ExcelFile(url)
# Gets rid of the information that I don't need in my dataframe
df = xls.parse('Yields', skiprows=10, index_col=None, na_values=['NA'])
df.rename(columns={'Series ID': 'Date'}, inplace=True)
# This line assumes you want datetime, ignore if you don't
#combined_data['Date'] = pd.to_datetime(combined_data['Date'])
result = pd.melt(df, id_vars=['Date'])
result['Currency'] = 'AUS'
result.rename(columns={'value': 'Yield_pct'}, inplace=True)
result.rename(columns={'variable': 'Maturity'}, inplace=True)
result.Maturity.replace(result['Maturity'], [int(s) for s in result['Maturity'].split() if s.isdigit()])
print result
You can use the vectorised str methods and pass a regex to extract the number:
In [15]:
df['Maturity'] = df['Maturity'].str.extract('(\d+)')
df
Out[15]:
Date Maturity Yield_pct Currency
0 2009-01-02 0 4.25 AUS
1 2009-01-05 0 4.25 AUS
2 2009-01-06 0 4.25 AUS
You can call astype(int) to cast the series to int:
In [17]:
df['Maturity'] = df['Maturity'].str.extract('(\d+)').astype(int)
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 3 entries, 0 to 2
Data columns (total 4 columns):
Date 3 non-null object
Maturity 3 non-null int32
Yield_pct 3 non-null float64
Currency 3 non-null object
dtypes: float64(1), int32(1), object(2)
memory usage: 108.0+ bytes
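One caveat for newer pandas versions: str.extract returns a DataFrame by default there, so the equivalent one-liner passes expand=False (and a raw string avoids escape-sequence warnings); a sketch:
df['Maturity'] = df['Maturity'].str.extract(r'(\d+)', expand=False).astype(int)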
I have a .json file extension (logs.json) that was sent to me with the following data in it (I am showing only some of it as there are over 2,000 entries):
["2012-03-01T00:05:55+00:00", "2012-03-01T00:06:23+00:00", "2012-03-01T00:06:52+00:00", "2012-03-01T00:11:23+00:00", "2012-03-01T00:12:47+00:00", "2012-03-01T00:12:54+00:00", "2012-03-01T00:16:14+00:00", "2012-03-01T00:17:31+00:00", "2012-03-01T00:21:23+00:00", "2012-03-01T00:21:26+00:00", "2012-03-01T00:22:25+00:00", "2012-03-01T00:28:24+00:00", "2012-03-01T00:31:21+00:00", "2012-03-01T00:32:20+00:00", "2012-03-01T00:33:32+00:00", "2012-03-01T00:35:21+00:00", "2012-03-01T00:38:14+00:00", "2012-03-01T00:39:24+00:00", "2012-03-01T00:43:12+00:00", "2012-03-01T00:46:13+00:00", "2012-03-01T00:46:31+00:00", "2012-03-01T00:48:03+00:00", "2012-03-01T00:49:34+00:00", "2012-03-01T00:49:54+00:00", "2012-03-01T00:55:19+00:00", "2012-03-01T00:56:27+00:00", "2012-03-01T00:56:32+00:00"]
Using Pandas, I did:
import pandas as pd
logs = pd.read_json('logs.json')
logs.head()
And I get the following:
0
0 2012-03-01T00:05:55+00:00
1 2012-03-01T00:06:23+00:00
2 2012-03-01T00:06:52+00:00
3 2012-03-01T00:11:23+00:00
4 2012-03-01T00:12:47+00:00
[5 rows x 1 columns]
Then, in order to assign the proper data type including the UTC zone, I do:
logs = pd.to_datetime(logs[0], utc=True)
logs.head()
And get:
0 2012-03-01 00:05:55
1 2012-03-01 00:06:23
2 2012-03-01 00:06:52
3 2012-03-01 00:11:23
4 2012-03-01 00:12:47
Name: 0, dtype: datetime64[ns]
Here are my questions:
Is the above code correct to get my data in the right format?
where did my UTC zone go? and what if I want to create a column with the corresponding PST time and add it to this dataset in a data frame format?
I seem to recall that in order to obtain counts per day, week, or year, I need to add .day, .week, or .year somewhere (logs.day?), but I cannot figure it out, and I am guessing that it is because of the current shape of my data. How do I get counts by day, week, and year so that I can plot the data? And how would I go about plotting it?
Such simple questions that seem so hard for someone who is transitioning from R to using Python for Data Analysis! I hope you guys can help!
I think there may be a bug in the tz handling here; it's certainly possible that this should be converted by default (I was surprised that it wasn't; I suspect it's because it's just a list).
In [21]: s = pd.read_json(js, convert_dates=[0], typ='Series') # more honestly this is a Series
In [22]: s.head()
Out[22]:
0 2012-03-01 00:05:55
1 2012-03-01 00:06:23
2 2012-03-01 00:06:52
3 2012-03-01 00:11:23
4 2012-03-01 00:12:47
dtype: datetime64[ns]
To get counts of year, month, etc. I would probably use a DatetimeIndex (at the moment date-like columns don't have year/month etc methods, though I think they (c|sh)ould):
In [23]: dti = pd.DatetimeIndex(s)
In [24]: s.groupby(dti.year).size()
Out[24]:
2012 27
dtype: int64
In [25]: s.groupby(dti.month).size()
Out[25]:
3 27
dtype: int64
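In modern pandas the intermediate DatetimeIndex is not needed, because datetime Series expose the same fields through the .dt accessor; a sketch:
s.groupby(s.dt.year).size()   # counts per year
s.groupby(s.dt.month).size()  # counts per month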
Perhaps it makes more sense to view the data as a TimeSeries:
In [31]: ts = pd.Series(1, dti)
In [32]: ts.head()
Out[32]:
2012-03-01 00:05:55 1
2012-03-01 00:06:23 1
2012-03-01 00:06:52 1
2012-03-01 00:11:23 1
2012-03-01 00:12:47 1
dtype: int64
This way you can use resample:
In [33]: ts.resample('M', how='sum')
Out[33]:
2012-03-31 27
Freq: M, dtype: int64
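For readers on current pandas, where the how= argument is gone: the same flow is method-chained, utc=True now keeps the timezone on the result, and tz_convert answers the PST question above. A sketch under those assumptions:
import pandas as pd

logs = pd.read_json('logs.json', typ='series')  # one Series of timestamp strings
logs = pd.to_datetime(logs, utc=True)           # tz-aware UTC datetimes

df = logs.to_frame('utc')
df['pacific'] = df['utc'].dt.tz_convert('US/Pacific')  # PST/PDT column

ts = pd.Series(1, index=pd.DatetimeIndex(logs))
ts.resample('D').sum()         # counts per day
ts.resample('W').sum()         # counts per week
ts.resample('M').sum().plot()  # monthly counts, plotted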