I have a dataframe that I want to group by year and then by month within each year. Because the data are quite large (recorded from three decades ago until now), I would like the output presented in the format shown below for subsequent calculations, but without any aggregate function such as .mean() applied.
However, I am unable to do so because a groupby on its own returns a GroupBy object rather than a dataframe; printing it just shows something like: <pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000022BF79A52E0>
On the other hand, I am hesitant to import the data as a Series because I do not know how to set the parameters to get exactly the format below. Another reason is that I used the following lines to import the .csv into a dataframe:
df = pd.read_csv(r'file directory', index_col='date')
df.index = pd.to_datetime(df.index)
Strangely, if I define the date string format in pd.read_csv when importing and subsequently try to sort by year and month with other methods, the parsing gets confused once the records contain dates like 01(day)/01(month)/1990 followed by 01(day)/02(month)/1990. It interprets the first number in the January records as the day and the second as the month and sorts them chronologically, but when it reaches February, where the day should be 01, it treats 01 as the month and 02 as the day, and moves that February record into the January group.
Are there any ways to achieve the same format?
The methods shown in the post below do not seem to help me get the format I want: Pandas - Groupby dataframe store as dataframe without aggregating
IIUC:
You can use the dayfirst parameter in to_datetime() and set it to True, then create 'Year' and 'Month' columns, make them the index, and sort the index:
df = pd.read_csv(r'file directory')
df['date'] = pd.to_datetime(df['date'], dayfirst=True)  # day comes first in the strings
df['Year'] = df['date'].dt.year
df['Month'] = df['date'].dt.month
df = df.set_index(['Year', 'Month']).sort_index()
Or, in three steps via assign():
df = pd.read_csv(r'file directory')
df['date'] = pd.to_datetime(df['date'], dayfirst=True)
df = (df.assign(Year=df['date'].dt.year, Month=df['date'].dt.month)
        .set_index(['Year', 'Month'])
        .sort_index())
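Once the (Year, Month) MultiIndex is in place, later calculations are easy to bolt on. A minimal sketch with made-up data (the 'value' column and the sample dates are hypothetical):
import pandas as pd

# toy day-first data standing in for the real CSV
dates = pd.to_datetime(['01/01/1990', '01/02/1990', '15/02/1990'], dayfirst=True)
df = pd.DataFrame({'date': dates, 'value': [1.0, 2.0, 3.0]})
df = (df.assign(Year=df['date'].dt.year, Month=df['date'].dt.month)
        .set_index(['Year', 'Month'])
        .sort_index())

# aggregate on the named levels only when you eventually need to
monthly_means = df.groupby(level=['Year', 'Month'])['value'].mean()

# or pull out a single (year, month) block with no aggregation at all
feb_1990 = df.loc[(1990, 2)]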
You can iterate through the groups of the groupby result.
import pandas as pd
import numpy as np

rand = np.random.RandomState(1)
df = pd.DataFrame({'A': ['foo', 'bar'] * 3,
                   'B': rand.randn(6),
                   'C': rand.randint(0, 20, 6)})

groupby_obj = df.groupby(['A'])

for k, gdf in groupby_obj:
    print('Groupby Key:', k)
    print('Dataframe:\n', gdf, '\n')
You can apply any dataframe method to gdf.
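For instance, continuing with groupby_obj from above, each gdf is an ordinary DataFrame, so the usual methods work inside the loop:
# per-group statistics, filtering, export, etc. all work on gdf
for k, gdf in groupby_obj:
    print('mean of B for group', k, 'is', gdf['B'].mean())
    # gdf.to_csv(...)  # or write each group out to its own file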
Related
I have weather data over a variety of years. With it, I am trying to find the long-term averages for the temperature of each month, which I achieved using the following:
mh3 = mh3.groupby([mh3.index.month, mh3.index.day])
mh3 = mh3[['dry_bulb_tmp_mean', 'global_horiz_radiation']].mean()
However, in doing this, I get two indexes for the dataframe (both month and day, which is fine). The issue is that both of these index columns are assigned the name date. Is there a way to manually set the names? This causes problems later in my code when I need to do some data analysis by month. Thank you
The name of the Series you group with becomes the name of the index levels, so rename them in the grouper:
mh3 = mh3.groupby([mh3.index.month.rename('month'), mh3.index.day.rename('day')])
Or, if you don't want to type as much, you can create the grouping with a list comprehension, using getattr and renaming each level to the attribute name:
import pandas as pd

df = pd.DataFrame(index=pd.date_range('2010-01-01', freq='4H', periods=10),
                  data={'col1': range(10)})
grpr = [getattr(df.index, attr).rename(attr) for attr in ['month', 'day']]
df.groupby(grpr).sum()
#            col1
# month day
# 1     1      15
#       2      30
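A minor variation (not from the original answer): you can also group first and rename the levels afterwards with rename_axis:
# same result, naming the index levels after the groupby instead of in the grouper
out = df.groupby([df.index.month, df.index.day]).sum()
out = out.rename_axis(['month', 'day'])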
I am trying to change the format of the date in a pandas dataframe.
If I check the date in the beginning, I have:
df['Date'][0]
Out[158]: '01/02/2008'
Then, I use:
df['Date'] = pd.to_datetime(df['Date']).dt.date
To change the format to
df['Date'][0]
Out[157]: datetime.date(2008, 1, 2)
However, this takes a very long time, since my dataframe has millions of rows.
All I want to do is change the date format from MM-DD-YYYY to YYYY-MM-DD.
How can I do it in a faster way?
You should first collapse by Date using the groupby method to reduce the dimensionality of the problem.
Then you parse the dates into the new format and merge the results back into the original DataFrame.
This requires some time because of the merging, but it takes advantage of the fact that many dates are repeated a large number of times. You want to convert each date only once!
You can use the following code:
from datetime import datetime

date_parser = lambda x: datetime.strptime(str(x), '%m/%d/%Y')  # pd.datetime was removed in pandas 1.x
df['date_index'] = df['Date']
dates = df.groupby(['date_index']).first()['Date'].apply(date_parser)  # one parse per unique date
df = df.set_index(['date_index'])
df['New Date'] = dates  # aligns on date_index
df = df.reset_index()
df.head()
In my case, the execution time for a DataFrame with 3 million lines reduced from 30 seconds to about 1.5 seconds.
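The same "convert each unique date only once" idea can be written a bit more compactly with map — a sketch, assuming the same df and '%m/%d/%Y' strings as above:
from datetime import datetime

# parse each distinct string once, then broadcast the result over the column
mapping = {d: datetime.strptime(d, '%m/%d/%Y') for d in df['Date'].unique()}
df['New Date'] = df['Date'].map(mapping)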
I'm not sure if this will help with the performance issue, as I haven't tested with a dataset of your size, but at least in theory it should help. Pandas has a built-in parameter you can use to specify that it should load a column as a date or datetime field; see the parse_dates parameter in the pandas documentation.
Simply pass in a list of columns that you want parsed as dates and pandas will convert them for you when creating the DataFrame. Then you won't have to worry about looping back through the dataframe and attempting the conversion afterwards.
import pandas as pd
df = pd.read_csv('test.csv', parse_dates=[0,2])
The above example would try to parse the 1st and 3rd (zero-based) columns as dates.
The type of each resulting column value will be a pandas timestamp and you can then use pandas to print this out however you'd like when working with the dataframe.
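For example, once a column has been parsed you can render it in whatever display format you like (a small sketch; a parsed 'Date' column is assumed):
# format the parsed timestamps back to strings for display or export
print(df['Date'].dt.strftime('%Y-%m-%d').head())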
Following a lead in #pygo's comment, I found that my mistake was trying to read the data as
df['Date'] = pd.to_datetime(df['Date']).dt.date
This would be, as this answer explains:
This is because pandas falls back to dateutil.parser.parse for parsing the strings when it has a non-default format or when no format string is supplied (this is much more flexible, but also slower).
As you have shown above, you can improve the performance by supplying a format string to to_datetime. Or another option is to use infer_datetime_format=True
When using any of the date parsers from the answers above, we fall back to slow element-wise parsing. The same thing happens when we pass to_datetime the format we want (instead of the format the data actually has).
Hence, instead of doing
df['Date'] = pd.to_datetime(df['Date'],format='%Y-%m-%d')
or
df['Date'] = pd.to_datetime(df['Date']).dt.date
we should do
df['Date'] = pd.to_datetime(df['Date'],format='%m/%d/%Y').dt.date
By supplying the format the data is actually in, it is parsed into datetime very quickly. Then, using .dt.date, changing it to the new format is fast because the slow parser is avoided.
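If you want to check the speed-up yourself, here is a minimal timing sketch on synthetic data (timings vary by machine, and recent pandas versions also infer a consistent format automatically, which narrows the gap):
import time

import pandas as pd

# millions of repeated date strings, similar in shape to the question
s = pd.Series(['01/02/2008', '03/04/2009', '11/12/2010'] * 1_000_000)

t0 = time.perf_counter()
pd.to_datetime(s, format='%m/%d/%Y')
print('with explicit format:', time.perf_counter() - t0, 's')

t0 = time.perf_counter()
pd.to_datetime(s)
print('without format:', time.perf_counter() - t0, 's')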
Thank you to everyone who helped!
I'm hoping that there is a relatively simple solution to my problem:
I have a csv with a selection of data points but they all include a date field.
I would like to be able to split up the csv into multiple files based on the month of the date field.
For example: I would like to be able to have all the records before March 2015 in 1 file, all before April 2015 in another, up to all before October 2016 etc.
In this case there will be many duplicate records between the files.
Is there a way to do this with a simple bit of python code or is there an easier method?
Thanks in advance
This code assumes that the date field is in the first column and is labelled "dates". We use pandas to read the data into a dataframe, passing ['dates'] as the column to convert to date objects. We then take different slices of the dataframe by year and month to create subset views. Each view is then dumped to a new csv named year_month.csv.
import pandas as pd

df = pd.read_csv('filename.csv', parse_dates=['dates'])
for year in df.dates.apply(lambda x: x.year).unique():
    for month in df.dates.apply(lambda x: x.month).unique():
        view = df[df.dates.apply(lambda x: x.month == month and x.year == year)]
        if view.size:
            view.to_csv('{}_{:0>2}.csv'.format(year, month))
There is probably a better way to do this, but this will get the job done.
Update: including date_parser to solve the AttributeError: 'str' object has no attribute 'year' issue.
import pandas as pd

df = pd.read_csv('filename.csv', parse_dates=['Date'], date_parser=pd.to_datetime)
for year in df['Date'].apply(lambda x: x.year).unique():
    for month in df['Date'].apply(lambda x: x.month).unique():
        view = df[df['Date'].apply(lambda x: x.month == month and x.year == year)]
        if view.size:
            view.to_csv('{}_{:0>2}.csv'.format(year, month), index=False)
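Regarding the "better way" mentioned above: one alternative is to let groupby do the slicing in a single pass. A sketch, reusing the parsed 'Date' column (note this writes one file per calendar month, not the cumulative files described in the question):
# one pass over the data: group rows by calendar month and dump each group
for period, view in df.groupby(df['Date'].dt.to_period('M')):
    view.to_csv('{}_{:0>2}.csv'.format(period.year, period.month), index=False)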
I have a 'myfile.csv' file which has a 'timestamp' column which starts at
(01/05/2015 11:51:00)
and finishes at
(07/05/2015 23:22:00)
A total span of 9,727 minutes
'myfile.csv' also has a column named 'A' containing numerical values; there are multiple values for 'A' within each minute, each with a unique timestamp to the nearest second.
I have code as follows
df = pd.read_csv('myfile.csv')
df = df.set_index('timestamp')
df.index = pd.to_datetime(df.index)  # Index.to_datetime() has been removed; use pd.to_datetime
df.sort_index(inplace=True)
df = df['A'].resample('1Min').mean()
df.index = df.index.map(lambda t: t.strftime('%Y-%m-%d %H:%M'))
My problem is that Python seems to think 'timestamp' starts at (01/05/2015 11:51:00) -> 5th January and finishes at (07/05/2015 23:22:00) -> 5th July, but really 'timestamp' starts on the 1st of May and finishes on the 7th of May.
So the above code produces a dataframe with 261,332 rows, OMG, when it should really only have 9,727 rows.
Somehow Python is mixing up the month with the day and misinterpreting the dates. How do I sort this out?
There are many arguments to read_csv that can help you parse dates from a csv straight into your pandas DataFrame. Here we can set parse_dates to the columns we want treated as dates and then use dayfirst. It defaults to False, so the following should do what you want, assuming the dates are in the first column:
df = pd.read_csv('myfile.csv', parse_dates=[0], dayfirst=True)
If the dates column is not the first column, just change the 0 to the appropriate column number.
The format of the dates you have included in your question doesn't seem to match your strftime filter. Take a look at the strftime format codes to fix your string parameter.
It looks to me that it should be something in the lines of:
'%d/%m/%Y %H:%M:%S'
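Putting it together, a sketch of the corrected pipeline (column names 'timestamp' and 'A' taken from the question; the final strftime is optional and only for display):
import pandas as pd

# parse the day-first timestamps at read time, then resample per minute
df = pd.read_csv('myfile.csv', parse_dates=['timestamp'], dayfirst=True)
df = df.set_index('timestamp').sort_index()
out = df['A'].resample('1Min').mean()

# render the index back to day-first strings if needed
out.index = out.index.strftime('%d/%m/%Y %H:%M:%S')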
I am trying to combine two separate one-minute data series to create a ratio, then build Open High Low Close (OHLC) files for the ratio for each day. I bring in the two time series and create the associated dataframes using pandas. The time series have missing data, so I create a datetime variable in each file and then merge the files with pd.merge on the datetime variable. Up to this point everything is going fine.
Next I group the data by date using groupby. I then feed the grouped data to a for loop that calculates the OHLC and writes it into a new dataframe for each respective day. However, the newly populated dataframe uses the date (from the grouping) as its index, and the sorting is off. The index data looks like this (even when sorted):
01/29/2013
01/29/2014
01/29/2015
12/2/2013
12/2/2014
In short, the sort is being done on the date strings, so it compares the month digits first rather than treating the whole value as a date, and the result isn't chronological. My goal is to get it sorted by date so it is chronological. Perhaps I need to create a new column in the dataframe referencing the index (not sure how), or maybe there is a way to tell pandas that the index is a date, not just a value? I tried various sort approaches, including sort_index, but since the dates form the index and don't seem to be treated as dates, the sorts order by month regardless of year, and my output file ends up out of order. More generally, I am not sure how to reference or manipulate the index of a pandas dataframe, so any related material would be useful.
Thank you
Years later...
This fixes the problem.
Assuming df is your dataframe:
import pandas as pd

df.index = pd.to_datetime(df.index)  # convert the index to a DatetimeIndex
df = df.sort_index()                 # sort the converted index
This should get the sorting back into chronological order
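A quick demonstration with the sample index values from the question (the 'val' column is hypothetical filler for the OHLC data):
import pandas as pd

df = pd.DataFrame({'val': range(5)},
                  index=['01/29/2013', '01/29/2014', '01/29/2015',
                         '12/2/2013', '12/2/2014'])
df.index = pd.to_datetime(df.index)  # strings become real dates
print(df.sort_index())               # 2013-01-29, 2013-12-02, 2014-01-29, ...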