How to sort a dataframe by time and another condition? - python

I have a dataframe with following columns: movie_name, date, comment.
The date format looks like this (example): 2018-06-27T09:09:00Z.
I want to make a new dataframe that contains ONLY the first date of a certain movie.
For example, for movie a, the first date might be 2018-09-11T02:02:00Z; in this case, I want all rows on 2018-09-11 for movie a. How would I do this when there are multiple movies with different dates?

Here's one way to do it:
# work on a copy with the columns you need
new_df = old_df[['movie_name', 'date']].copy()
# parse the timestamp and keep only the date part
new_df['date'] = pd.to_datetime(new_df['date']).dt.date
# earliest date of each movie
new_df.groupby('movie_name')['date'].min()

import datetime as dt
df['My Time Format'] = df['Given time'].apply(
    lambda x: dt.datetime.strptime(x, "%Y-%m-%dT%H:%M:%SZ").strftime("%Y-%m-%d"))
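Putting the pieces together, here is a minimal self-contained sketch using a hypothetical dataframe (the movie names, timestamps, and comments are invented for illustration). `transform('min')` broadcasts each movie's earliest day back to every row, so all rows that fall on that day can be kept:

```python
import pandas as pd

# hypothetical sample data matching the question's column layout
df = pd.DataFrame({
    'movie_name': ['a', 'a', 'a', 'b', 'b'],
    'date': ['2018-09-11T02:02:00Z', '2018-09-11T05:00:00Z',
             '2018-09-12T01:00:00Z', '2018-10-01T09:00:00Z',
             '2018-10-02T10:00:00Z'],
    'comment': ['c1', 'c2', 'c3', 'c4', 'c5'],
})

# calendar day of each comment
df['day'] = pd.to_datetime(df['date']).dt.date
# earliest day per movie, broadcast back to every row
first_day = df.groupby('movie_name')['day'].transform('min')
# keep every row that falls on its movie's first day
result = df[df['day'] == first_day]
```

Here `result` keeps both comments for movie a on 2018-09-11 and the single comment for movie b on its first day.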

Related

Pandas Dataframe return rows where state, city, and date occur multiple times

Firstly, this is my first post, so my apologies if it is formatted poorly.
So I have this dataframe, which I have attached a picture of. It contains UFO sightings, and I want to return the rows where the city and state are the same and the dates are also the same. I am trying to find sightings that occurred on the same day in the same city and state. Please let me know if more info is required.
Thank you in advance!
Alternatively, use boolean indexing to keep the duplicated rows:
df['date'] = pd.to_datetime(df['occurred_date_time']).dt.normalize()
df2 = df[df.duplicated(['date','city','state'], keep=False)]
If you don't want the new column:
df2 = df[df.assign(date=pd.to_datetime(df['occurred_date_time'])
                        .dt.normalize())
           .duplicated(['date','city','state'], keep=False)]
Try this.
# Create a column converting date_time to just date
df['date'] = pd.to_datetime(df['occurred_date_time']).dt.normalize()
# group by date, city and state and count the rows in each group,
# then create a boolean series where the count is greater than 1
m = df.groupby(['date','city','state'])['date'].transform('count') > 1
# boolean-filter the dataframe rows with that series, m
df[m]
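A small runnable sketch of the `duplicated(..., keep=False)` approach, using invented sighting data with the columns the question describes:

```python
import pandas as pd

# hypothetical sightings data: two in Austin on the same day, one in Dallas
df = pd.DataFrame({
    'city': ['Austin', 'Austin', 'Dallas'],
    'state': ['TX', 'TX', 'TX'],
    'occurred_date_time': ['2020-01-01 10:00', '2020-01-01 22:00',
                           '2020-01-02 09:00'],
})

# strip the time-of-day so two sightings on the same day compare equal
df['date'] = pd.to_datetime(df['occurred_date_time']).dt.normalize()
# keep=False marks every member of a duplicate group, not just the repeats
dups = df[df.duplicated(['date', 'city', 'state'], keep=False)]
```

Both Austin rows survive because `keep=False` flags every row of a duplicate group; the lone Dallas sighting is dropped.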

How can I take specific months out from a column in python

I have a dataframe with a column 'mon/yr' that stores the month and year in this format: Jun/19, Jan/22, etc.
I want to extract only these values from that column - ['Jul/19','Oct/19','Jan/20','Apr/20','Jul/20','Oct/20','Jan/21','Apr/21','Jul/21','Oct/21','Jan/22']
and put the matching rows into a variable called 'dates' so that I can use it for plotting.
My code, which does not work -
dates = df["mon/yr"] == ['Jul/19','Oct/19','Jan/20','Apr/20','Jul/20','Oct/20','Jan/21','Apr/21','Jul/21','Oct/21','Jan/22']
This is how to filter rows in pandas:
df.loc[df['column_name'].isin(some_values)]
Using your dates list, if we wanted to extract just 'Jul/20' and 'Oct/20' we can do:
import pandas as pd
df = pd.DataFrame(['Jul/19','Oct/19','Jan/20','Apr/20','Jul/20','Oct/20','Jan/21','Apr/21','Jul/21','Oct/21','Jan/22'], columns = ['dates'])
mydates = ['Jul/20','Oct/20']
df.loc[df['dates'].isin(mydates)]
which produces:
dates
4 Jul/20
5 Oct/20
So, for your actual use case, assuming that df is a pandas dataframe, and mon/yr is the name of the column, you can do:
dates = df.loc[df['mon/yr'].isin(['Jul/19','Oct/19','Jan/20','Apr/20','Jul/20','Oct/20','Jan/21','Apr/21','Jul/21','Oct/21','Jan/22'])]

When I use set_index, I am not able to create a separate dataframe with the set_index column name

I am trying to extract the values in the "d" row for the next 7 days from today's date (say 2020-04-22). So I have transposed the df so that the dates will be in a separate column. I want a separate dataframe with the Account and d columns so I can calculate 7 days from today's date (Apr 22) using the account column. I am a beginner with dataframes and numpy and I am learning the concepts.
I know I should use date.today(), but I am not able to access the Account column since I used it in set_index.
cashflow_path = "./data/input/wpptest.xlsx"
pd_xls_obj = pd.ExcelFile(cashflow_path)
data= pd.read_excel(pd_xls_obj,sheet_name="Sheet1")
data
I have transposed the sheet so that I can easily calculate from today's date:
inp=data.set_index('Account').T
inp
inp=inp[['Account','d']]
inp
KeyError: 'Account' not in index.
Since you have set 'Account' as the index, you can't select it as a column, but you only need to select the column 'd' and the dates will appear as well. To make 'Account' a column again, just duplicate it from the index:
inp['account'] = inp.index
inp = inp[['account', 'd']]
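A minimal sketch of the transpose-then-copy-the-index pattern, assuming a made-up layout where the 'Account' column includes a row named 'd' and the remaining columns are dates (the values here are invented):

```python
import pandas as pd

# assumed layout: an 'Account' column whose rows include 'd',
# with the remaining columns holding dates
data = pd.DataFrame({'Account': ['a', 'b', 'd'],
                     '2020-04-22': [1, 2, 3],
                     '2020-04-23': [4, 5, 6]})

inp = data.set_index('Account').T
# after the transpose the dates live in the index;
# copy them into a regular column so they can be selected
inp['date'] = inp.index
result = inp[['date', 'd']]
```

`result` now has one row per date with that date's value from the original 'd' row, which makes a date-based 7-day filter straightforward.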

How to find annualized return in a data set containing 30 stocks in python

So I have a dataset containing the closing prices of 30 stocks. I have to find the average annualized return and volatility of each stock. I don't have a problem with the formula; I can't seem to formulate how to iterate over each stock, find its closing price, and then save each closing price in a different column.
What I have tried:
I have tried to iterate over the columns, and then return the columns, and then assign the function to a variable like:
def get_columns(df):
    for columns in df:
        return columns

namesOfColumn = get_columns(df)
When I check the type of namesOfColumn, it returns str, and when I check the content of the string, it is the title of the first column in my dataset.
I have also tried
def get_columns(df):
    for columns in df:
        column = df[columns]
        for column in df[columns]:
            stock = column
            returns = df[stock].pct_change()
My current dataframe looks like:
   A Close  B Close
0   823.45    201.9
1   824.90    198.9
2   823.60    198.3
A & B are the names of the companies. There are 30 columns like this in total, and each column has around 240 values.
I want my output to look like this:
   A Return  B Return
0   xxxx.xx   xxxx.xx
I want to find the annual return of each stock, then save the returns in a dictionary, and then convert that dictionary to a dataframe.
Assuming the index of your dataframe is in datetime format you could just use pandas resample (below I am resampling it yearly - please refer to pandas resample documentation for more info) and do the following:
(1 + df.pct_change()).resample('Y').prod() - 1
Since it looks like your dataframe is not indexed with pandas datetimes, you will have to reindex it first (and then apply the code shown above), as shown below:
import pandas as pd
initial_date = '20XX-XX-XX' #set here the initial date of your dataframe
end_date = '20XX-XX-XX' #set here the end date of your dataframe
df.set_index(pd.date_range(start=initial_date, end=end_date), inplace=True)
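An end-to-end sketch of the resample approach with made-up price data (the column names, dates, and random-walk prices are invented for illustration; note that newer pandas versions prefer the 'YE' alias over 'Y'):

```python
import numpy as np
import pandas as pd

# hypothetical daily closes for two stocks across two calendar years
idx = pd.date_range('2020-01-01', '2021-12-31', freq='D')
rng = np.random.default_rng(0)
df = pd.DataFrame(
    {'A Close': 100 * np.cumprod(1 + rng.normal(0, 0.01, len(idx))),
     'B Close': 50 * np.cumprod(1 + rng.normal(0, 0.01, len(idx)))},
    index=idx)

# compound the daily returns within each year, then subtract 1
annual = (1 + df.pct_change()).resample('Y').prod() - 1
```

`annual` ends up with one row per year and one column per stock, so no per-column loop is needed.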

Grouping a dataframe and reordering based on date and counts

I have the following dataframe, which is grouped by the invoice cycle first, with a count of the clinics in each invoice cycle added.
Dataframe after groupby function
I used the following code to add the count column:
df5 = df4.groupby(['Invoice Cycle', 'Clinic']).size().reset_index(name='counts')
and then this code to set the index and get the dataframe, as seen in the image above:
df5 = df5.set_index(['Invoice Cycle','Clinic'])
Now, I want to reorder the Invoice Cycle column so the dates are in order 16-Dec, 17-Jan, 17-Feb, 17-Mar, etc.
Then I want to reorder the clinics in each invoice cycle so clinic with the highest count is on the top and the clinic with the lowest count is on the bottom.
Given that the values in Invoice Cycle are strings, not timestamps, I can't seem to do either of the above tasks.
Is there a way to reorder the dataframe?
You can create a function to transform the date-string into a datetime format:
import pandas as pd
import datetime
def str_to_date(string):
    # this parses a string like '17-Jan' into a datetime for the
    # first day of that month (e.g. 2017-01-01)
    date = datetime.datetime.strptime(string, '%y-%b')
    return date

df['Invoice Cycle'] = df['Invoice Cycle'].apply(str_to_date)
# now you can sort correctly: cycles ascending, highest count first
df = df.sort_values(['Invoice Cycle', 'counts'], ascending=[True, False])
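A small runnable sketch of this parse-then-sort idea with invented cycle strings and counts, keeping the original string column intact by parsing into a helper column:

```python
import datetime

import pandas as pd

# hypothetical grouped counts with string invoice cycles
df = pd.DataFrame({'Invoice Cycle': ['17-Feb', '16-Dec', '17-Jan', '17-Jan'],
                   'Clinic': ['c1', 'c2', 'c3', 'c4'],
                   'counts': [5, 2, 9, 3]})

# parse '%y-%b' strings such as '16-Dec' into real dates for sorting
df['cycle_date'] = df['Invoice Cycle'].apply(
    lambda s: datetime.datetime.strptime(s, '%y-%b'))
# cycles in chronological order, highest count first within each cycle
df = df.sort_values(['cycle_date', 'counts'], ascending=[True, False])
```

16-Dec sorts before 17-Jan and 17-Feb, and within 17-Jan the clinic with count 9 comes before the one with count 3.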
