I have the following dataframe, grouped by invoice cycle first, with a count of the clinics in each invoice cycle added.
[Image: dataframe after the groupby]
I used the following code to add the count column:
df5 = df4.groupby(['Invoice Cycle', 'Clinic']).size().reset_index(name='counts')
and then this code to set the index and get the dataframe, as seen in the image above:
df5 = df5.set_index(['Invoice Cycle','Clinic'])
Now, I want to reorder the Invoice Cycle column so the dates are in order 16-Dec, 17-Jan, 17-Feb, 17-Mar, etc.
Then I want to reorder the clinics in each invoice cycle so clinic with the highest count is on the top and the clinic with the lowest count is on the bottom.
Given the values in Invoice Cycle are strings, and not timestamps, I can't seem to do both of the above tasks.
Is there a way to reorder the dataframe?
You can create a function to transform the date-string into a datetime format:
import pandas as pd
import datetime
def str_to_date(string):
    # this will get you a date pinned to the first day of the month
    # (e.g. '16-Dec' becomes 2016-12-01)
    date = datetime.datetime.strptime(string, '%y-%b')
    return date
df['Invoice Cycle'] = df['Invoice Cycle'].apply(str_to_date)
# now you can sort correctly: cycles ascending, counts descending
df = df.sort_values(['Invoice Cycle', 'counts'], ascending=[True, False])
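Alternatively, the helper function isn't needed: pd.to_datetime accepts the same format string and is vectorized. A minimal sketch, assuming the column holds 'yy-Mon' strings like '16-Dec':
import pandas as pd
# parse 'yy-Mon' strings in one vectorized call; the day defaults to the 1st
df['Invoice Cycle'] = pd.to_datetime(df['Invoice Cycle'], format='%y-%b')
df = df.sort_values(['Invoice Cycle', 'counts'], ascending=[True, False])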
I have a crime dataset where each row is a recorded crime, and the relevant columns are Date, Crime Type, District.
Here is an example with only 2 Districts and 2 Crime Types over a week:
I want to expand it into a dataframe that can be used to run regressions. In this simple example, I need the columns to be Date, District, Murder, and Theft. Each District would have a row for every date in the range, and each crime-type column would hold the number of those crimes committed on that Date in that District.
Here is the final dataframe:
In short, I need a time series where #Rows = #Districts * #Dates, with a column for each crime type.
Are there any good ways to make this without looping through the dataframes?
I can create the date list like this:
datelist = pd.date_range(start='01-01-2011', end='12-31-2015', freq='1d')
But how do I merge that with my other dataframe and create the columns described above?
I will try to answer my own question here. I think I figured it out, but would appreciate any input on my method. I was able to do it without looping, using pivot_table and merge instead.
Import packages:
import pandas as pd
from datetime import datetime
import numpy as np
Import crime dataset:
crime_df = pd.read_csv("/Users/howard/Crime_Data.csv", parse_dates=["Date"])  # parse Date so the later merge on datetimes lines up
Create a list of dates in the range:
datelist = pd.date_range(start='01-01-2011', end='12-31-2015', freq='1d')
Create variables for the length of this date list and length of unique districts list:
nd = len(datelist)
nu = len(crime_df['District'].unique())
Create dataframe combining dates and districts:
date_df = pd.DataFrame({'District': crime_df['District'].unique().tolist()*nd, 'Date': np.repeat(datelist, nu)})
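As an alternative sketch, the same date-district grid can be built with pd.MultiIndex.from_product, which avoids the manual list multiplication and repeat bookkeeping:
# build every (Date, District) combination in one step
idx = pd.MultiIndex.from_product([datelist, crime_df['District'].unique()],
                                 names=['Date', 'District'])
date_df = idx.to_frame(index=False)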
Now we turn to our crime dataset.
I added a column of 1s to have something to sum in the next step:
crime_df["ones"] = 1
Next we take our crime data and put it in wide form using Pandas pivot_table:
crime_df = pd.pivot_table(crime_df,index=["District","Date"], columns="Crime Type", aggfunc="sum")
This gave me stacked-level columns and an unnecessary index, so I removed them with the following:
crime_df.columns = crime_df.columns.droplevel()
crime_df.reset_index(inplace=True)
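A possible shortcut for this wide-form step: pd.crosstab counts rows directly, so the column of ones, the droplevel assignment, and reset_index collapse into one expression. A sketch, assuming crime_df still holds one row per recorded crime (crime_wide is an illustrative name):
# count crimes per (District, Date, Crime Type) without a helper column
crime_wide = pd.crosstab([crime_df['District'], crime_df['Date']],
                         crime_df['Crime Type']).reset_index()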
The final step is to merge the two datasets. I want to put date_df first and merge on it because it includes all the dates in the range and all the districts for each date, so this is a left merge.
final_df = pd.merge(date_df, crime_df, on=["Date", "District"],how="left")
Now I can finish by filling in NaN with 0s
final_df.fillna(0, inplace=True)
Our final dataframe is in the correct form to do time series analyses: regressions, plotting, etc. Many of the plots in matplotlib.pyplot that I use are easier to make if the date column is the index. This can be done like this:
final_df = final_df.set_index(['Date'])
That's it! Hope this helps others and please comment on any way to improve.
I have weather data spanning a number of years. From it I am trying to find the long-term average temperature for each month, which I achieved using the following.
mh3 = mh3.groupby([mh3.index.month, mh3.index.day])
mh3 = mh3[['dry_bulb_tmp_mean', 'global_horiz_radiation']].mean()
However, in doing this, I get two index levels for the dataframe (both month and day, which is fine). The issue is that both of these index columns are assigned the name date. Is there a way to manually add a name? This causes problems later in my code when I need to do some data analysis by month. Thank you
The name of the Series you group with becomes the name of the Index levels, so rename them in the grouper.
mh3 = mh3.groupby([mh3.index.month.rename('month'), mh3.index.day.rename('day')])
Or, if you don't want to type as much, you can create the grouping with a list comprehension, using getattr and renaming each level to the attribute name.
import pandas as pd
df = pd.DataFrame(index=pd.date_range('2010-01-01', freq='4H', periods=10),
                  data={'col1': range(10)})
grpr = [getattr(df.index, attr).rename(attr) for attr in ['month', 'day']]
df.groupby(grpr).sum()
# col1
#month day
#1 1 15
# 2 30
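If you would rather name the levels after the fact, a small sketch using rename_axis on the grouped result (applied to the question's mh3):
mh3 = mh3.groupby([mh3.index.month, mh3.index.day])[['dry_bulb_tmp_mean', 'global_horiz_radiation']].mean()
# name the two index levels after grouping
mh3 = mh3.rename_axis(['month', 'day'])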
I have a dataframe with following columns: movie_name, date, comment.
The date format is like this (example): 2018-06-27T09:09:00Z.
I want to make a new dataframe that contains ONLY the first date of a certain movie.
For example, for movie a, the first date may be 2018-09-11T02:02:00Z; in this case, I want all rows on 2018-09-11 for movie a. How would I do this when there are multiple movies with different dates?
Here's one way to do it:
# work on a copy of the full dataframe, not just the date column
new_df = old_df.copy()
# strip the time so rows can be compared by calendar date
new_df['date'] = pd.to_datetime(new_df['date']).dt.date
# earliest date of each movie, broadcast back to every row
first_date = new_df.groupby('movie_name')['date'].transform('min')
# keep only the rows that fall on each movie's first date
new_df = new_df[new_df['date'] == first_date]
import datetime as dt
df['My Time Format'] = df['Given time'].apply(lambda x: dt.datetime.strptime(x, "%Y-%m-%dT%H:%M:%SZ").strftime("%Y-%m-%d"))
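A vectorized sketch of the same conversion, assuming the column parses as ISO 8601 timestamps:
# let pandas parse the timestamps, then format back to date strings
df['My Time Format'] = pd.to_datetime(df['Given time']).dt.strftime('%Y-%m-%d')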
So I have a dataset containing the closing price of 30 stocks. I have to find the average annualized return and volatility of each stock. I don't have a problem with the formula; what I can't work out is how to iterate over each stock, find its closing prices, and then save each stock's result in a different column.
What I have tried:
I have tried to iterate over the columns, and then return the columns, and then assign the function to a variable like:
def get_columns(df):
    for columns in df:
        return columns
namesOfColumn = get_columns(df)
When I check the type of namesOfColumn, it returns str, and when I check the content of the string, it is the title of the first column in my dataset.
I have also tried
def get_columns(df):
    for columns in df:
        column = df[columns]
        for column in df[columns]:
            stock = column
            returns = df[stock].pct_change()
My current dataframe looks like this:
   A Close  B Close
0   823.45    201.9
1   824.90    198.9
2   823.60    198.3
A and B are the names of the companies.
There are 30 columns like this in total, and each column has around 240 values.
I want my output to look like this:
   A Return  B Return
0   xxxx.xx   xxxx.xx
I want to find the annual return of each stock, and then save the return in a dictionary, and then convert that dictionary to a dataframe.
Assuming the index of your dataframe is in datetime format, you could just use pandas resample (below I am resampling yearly; please refer to the pandas resample documentation for more info) and do the following:
(1 + df.pct_change()).resample('Y').prod() - 1
Since it looks like your dataframe is not indexed with pandas datetimes, you will have to reindex it first (and then apply the code shown above), as shown below:
import pandas as pd
initial_date = '20XX-XX-XX' #set here the initial date of your dataframe
end_date = '20XX-XX-XX' #set here the end date of your dataframe
df.set_index(pd.date_range(start=initial_date, end=end_date), inplace=True)
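For the volatility half of the question, a minimal sketch, assuming daily closes and roughly 252 trading days per year (both assumptions; adjust to your data):
import numpy as np
import pandas as pd

daily = df.pct_change().dropna()                            # daily returns per stock
ann_return = (1 + daily).prod() ** (252 / len(daily)) - 1   # geometric annualized return
ann_vol = daily.std() * np.sqrt(252)                        # annualized volatility
summary = pd.DataFrame({'Annualized Return': ann_return,
                        'Annualized Volatility': ann_vol})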
I have a pandas Series indexed by timestamp. I group by the value in the series, so I end up with a number of groups, each with its own timestamp-indexed Series. I then want to resample() each group's series to weekly, but want to align the first date across all groups.
My code looks something like this:
grp = df.groupby(df)
for userid, user_df in grp:
    resample = user_df.resample('1W').apply(__some_fun)
The only way I have found to make sure alignment happens on the left-hand side of the dates is to pad each group with one fake value, e.g.:
grp = df.groupby(df)
for userid, user_df in grp:
    pad = pandas.Series([0], index=[pandas.to_datetime('2013-09-02')])
    user_df = pandas.concat([user_df, pad])  # Series.append is gone in modern pandas
    resample = user_df.resample('1W').apply(__some_fun)
It seems to me that this must be a common workflow, any pandas insight?
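A hedged sketch of one alternative: resample each group, then reindex onto one shared weekly DatetimeIndex, so every group carries the same bins without padding fake values (start and end here are assumptions covering the full data range):
import pandas as pd

start, end = df.index.min(), df.index.max()
weekly_index = pd.date_range(start=start, end=end, freq='W')

aligned = {}
for userid, user_df in df.groupby(df):
    weekly = user_df.resample('1W').apply(__some_fun)
    # same weekly grid for every group; missing weeks become NaN
    aligned[userid] = weekly.reindex(weekly_index)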