I have a dataframe which looks like this (see table below). For simplicity's sake, "aapl" is the only ticker shown; however, the real dataframe has more tickers.
  ticker  year  return
0   aapl  1999       1
1   aapl  2000       3
2   aapl  2000       2
What I'd like to do is first group the dataframe by ticker, then by year. Next, I'd like to remove any duplicate years. In the end the dataframe should look like this:
  ticker  year  return
0   aapl  1999       1
1   aapl  2000       3
I have a working solution, but it's not very "Pandas-esque", and involves for loops. I'm semi-certain that if I come back to the solution in three months, it'll be completely foreign to me.
Right now, I've been working on the following, with little luck:
df = df.groupby('ticker').groupby('year').drop_duplicates(subset=['year'])
This, however, produces the following error:
AttributeError: 'DataFrameGroupBy' object has no attribute 'groupby'
Any help here would be greatly appreciated, thanks.
@QuangHoang provided the simplest version in the comments:
df.drop_duplicates(['ticker', 'year'])
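For reference, a quick check of that one-liner against the sample frame from the question (keep='first' is the default, so the first row of each (ticker, year) pair survives):

import pandas as pd

# the sample frame from the question
df = pd.DataFrame({'ticker': ['aapl', 'aapl', 'aapl'],
                   'year': [1999, 2000, 2000],
                   'return': [1, 3, 2]})
print(df.drop_duplicates(['ticker', 'year']))
#   ticker  year  return
# 0   aapl  1999       1
# 1   aapl  2000       3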
Alternatively, you can use .groupby twice, inside two .applys:
df.groupby("ticker", group_keys=False).apply(lambda x:
x.groupby("year", group_keys=False).apply(lambda x: x.drop_duplicates(['year']))
)
Alternatively, you can use the .duplicated method:
df.groupby('ticker', group_keys=False).apply(lambda x:
    x[~x['year'].duplicated(keep='first')]
)
You can try sorting the values first and then taking groupby(...).tail(1); since the rows are sorted by return, the row with the highest return in each (ticker, year) group is the one kept:
df.sort_values('return').groupby(['ticker','year']).tail(1)
ticker year return
0 aapl 1999 1
1 aapl 2000 3
I'm almost sure you want to do this:
df.drop_duplicates(subset=["ticker","year"])
The output is the desired frame shown above: the first occurrence of each (ticker, year) pair is kept.
Related
I am plotting the following pandas MultiIndex DataFrame:
print(log_returns_weekly.head())
AAPL MSFT TSLA FB GOOGL
Date Date
2016 1 -0.079078 0.005278 -0.155689 0.093245 0.002512
2 -0.001288 -0.072344 0.003811 -0.048291 -0.059711
3 0.119746 0.082036 0.179948 0.064994 0.061744
4 -0.150731 -0.102087 0.046722 0.030044 -0.074852
5 0.069314 0.067842 -0.075598 0.010407 0.056264
with the first sub-index representing the year, and the second one the week from that specific year.
This is simply achieved via the pandas plot() method; however, as seen below, the x-axis will not be in a (year, week) format i.e. (2016, 1), (2016, 2) etc. Instead, it simply shows 'Date,Date' - does anyone therefore know how I can overcome this issue?
log_returns_weekly.plot(figsize=(8,8))
You need to convert your MultiIndex to a single index of real dates, mapping each (year, week) pair to an actual date such as 2016-01-04:
import datetime
# day 1 = the Monday of that ISO week, e.g. (2016, 1) -> 2016-01-04
log1 = log_returns_weekly.set_index(
    log_returns_weekly.index.map(lambda x: datetime.datetime.fromisocalendar(x[0], x[1], 1))
)
log1.plot()
I have a table of trades, which have the form (for simplicity):
Ticker Timestamp price
0 AAPL 9:30:00 139
1 FB 11:33:14 110
And so on. Now, I want to extract the last trade of the day for each ticker, which is certainly possible, as follows (assuming the original table is called trades).
trades['Timestamp']=pd.to_datetime(trades['Timestamp'])
aux = trades.groupby(['Ticker'])['Timestamp'].max()
auxdf = aux.to_frame()
auxdf = auxdf.reset_index()
closing = pd.merge(left=trades,right=auxdf, left_on=['Ticker','Timestamp'],right_on=['Ticker', 'Timestamp'])
Now, this works, but I am not sure if it is either the most elegant or the most efficient approach. Any suggestions?
Try to use loc and idxmax:
trades['Timestamp'] = pd.to_datetime(trades['Timestamp'])
trades.loc[trades.groupby('Ticker').Timestamp.idxmax()]
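A sketch of an alternative, using the same sort-then-tail idea as earlier in this thread: sort by timestamp and keep each ticker's last row.

import pandas as pd

# assumes `trades` has 'Ticker', 'Timestamp' and 'price' columns as in the question
trades['Timestamp'] = pd.to_datetime(trades['Timestamp'])
closing = trades.sort_values('Timestamp').groupby('Ticker').tail(1)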
I have a dataframe in the following format:
month count
2015/01 100
2015/02 200
2015/03 300
...
And I want to get a new dataframe which contains only the rows whose month is greater than a given month, e.g. 2015/03.
I tried to use the following code:
sdf = df.loc[datetime.strptime(df['month'], '%Y-%m')>=datetime.date(2015,9,1,0,0,0)]
I'm new to Python. I'd really appreciate it if someone could help.
If you want to get rows which have months larger than 2015/02 for instance, use pd.to_datetime instead of datetime.strptime since the former is vectorized and can accept a Series object as parameter (assuming you are using pandas):
import pandas as pd
df[pd.to_datetime(df.month) >= pd.to_datetime("2015/02")]
# month count
#1 2015/02 200
#2 2015/03 300
I am very new to Python (and to Stack Overflow!) so hopefully this makes sense!
I have a dataframe which contains years and names (amongst other things; however, this is all I am interested in working with).
I have done df = df.groupby(['year', 'name']).size() to get the number of times each name appears in each year.
It returns something similar to this:
year name
2001 nameone 2
2001 nametwo 3
2002 nameone 1
2002 nametwo 5
What I'm trying to do is put the size data into a new column called 'count'.
(eventually what I am intending to do with this is plot it on graphs)
Any help would be greatly appreciated!
Here is the raw code (I have condensed it a bit for convenience):
hso_df = pd.read_csv('HibernationSurveyObservationsCleaned.csv')
hso_df[["startDate", "endDate", "commonName"]]
year_df = hso_df
year_df['startDate'] = pd.to_datetime(hso_df['startDate'])
year_df['year'] = year_df['startDate'].dt.year
year_df = year_df[["year", "commonName"]].sort_values('year')
year_df = year_df.groupby(['year', 'commonName']).size()
Here is an image of the first 3 rows of the data displayed with .head().
The only columns that are of interest from this data are commonName and year (which I have taken from startDate).
IIUC you want transform to add the result of the groupby with its index aligned to the original df:
df['count'] = df.groupby(['year', 'name'])['name'].transform('size')
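For illustration, a minimal sketch with made-up rows (the real frame has more columns, but the alignment works the same way):

import pandas as pd

# hypothetical raw rows; each (year, name) pair may appear several times
df = pd.DataFrame({
    'year': [2001, 2001, 2001, 2002],
    'name': ['nameone', 'nameone', 'nametwo', 'nametwo'],
})
df['count'] = df.groupby(['year', 'name'])['name'].transform('size')
print(df)
#    year     name  count
# 0  2001  nameone      2
# 1  2001  nameone      2
# 2  2001  nametwo      1
# 3  2002  nametwo      1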
EDIT
Looking at your requirements, I suggest calling reset_index on the groupby result and then merging this back to your main df:
year_df= year_df.reset_index()
hso_df.merge(year_df).rename(columns={0:'count'})
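A small variant, assuming a reasonably recent pandas: Series.reset_index accepts a name for the values column, which folds the rename step in:

# year_df here is still the Series returned by .size()
counts = year_df.reset_index(name='count')
hso_df.merge(counts)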
Code:
df = pd.read_csv("example.csv", parse_dates=['ds'])
df2 = df.set_index(['ds', 'city']).unstack('city')
rm = df2.rolling(3).mean()
sd = df2.rolling(3).std()
What I want: I want to be able to see whether for each city, for each date, if the number is greater than 1 std dev away from the mean of bookings for that city. For ex pseudocode:
for each (city column):
    for each (date):
        see whether (number of bookings) - (same date and city rolling mean) > (same date and city std dev)
        print that date and city and number of bookings
What the problem is: I'm having trouble figuring out how to access the data I need from each of the dataframes to do this. The parts of the pseudocode in parentheses are what I need help figuring out.
What I tried:
df2['city']
list(df2)
Both give me errors.
df2[1:2]
Slicing works, but I feel like that's not the best way to access it.
You should use the apply function of the DataFrame API. A demo is below:
import pandas as pd
df = pd.DataFrame({'A': [1,2,3,4,5], 'B': [1,2,3,4,5]})
df['C'] = df.apply(lambda row: row['A']*row['B'], axis=1)
Output:
>>> df
A B C
0 1 1 1
1 2 2 4
2 3 3 9
3 4 4 16
4 5 5 25
More concretely for your case:
1. Precompute the "same date and city rolling mean" and "same date and city std dev". You can use the groupby function for this: it lets you split the data by city, after which you can calculate the rolling mean and std dev for each group.
2. Put the std dev and mean back into your table. You can use a dictionary for this, e.g. some_dict = {('city', 'date'): [std_dev, mean], ..}, and fill it into the dataframe with apply.
3. You then have all the data you need to run your check with apply; a sketch of these steps follows below.
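A minimal sketch of those steps, assuming the frame from the question has columns 'ds', 'city', and a bookings column named 'bookings' (the column names are assumptions; adapt them to the real CSV). It uses transform in place of the side dictionary, which aligns the per-city rolling statistics straight back to the original rows:

import pandas as pd

df = pd.read_csv("example.csv", parse_dates=['ds'])

# per-city rolling statistics, aligned back to the original rows
grouped = df.sort_values('ds').groupby('city')['bookings']
df['roll_mean'] = grouped.transform(lambda s: s.rolling(3).mean())
df['roll_std'] = grouped.transform(lambda s: s.rolling(3).std())

# rows where bookings sit more than one rolling std dev above the rolling mean
flagged = df[df['bookings'] - df['roll_mean'] > df['roll_std']]
print(flagged[['ds', 'city', 'bookings']])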