Problem with a column in my new groupby object - python

So I have a dataframe, and I performed this operation:
df1 = df1.groupby(['trip_departure_date']).agg(occ = ('occ', 'mean'))
The problem is that when I try to plot, I get an error saying that trip_departure_date doesn't exist!
I did this:
df1.plot(x = 'trip_departure_date', y = 'occ', figsize = (8,5), color = 'purple')
and I get this error:
KeyError: 'trip_departure_date'
Please help!

Your question is similar to this question: pandas groupby without turning grouped by column into index
When you group by a column, that column stops being a column and becomes the index of the result, and the index is not a column. If you set as_index=False, pandas keeps the column you are grouping by as a column instead of moving it to the index.
The second problem is that the .agg() call is also aggregating occ over trip_departure_date and moving trip_departure_date to the index. You don't need this second function to get the mean of occ grouped by trip_departure_date.
import pandas as pd
df1 = pd.read_csv("trip_departures.txt")
df1_agg = df1.groupby(['trip_departure_date'],as_index=False).mean()
Or if you only want to aggregate the occ column:
df1_agg = df1.groupby(['trip_departure_date'],as_index=False)['occ'].mean()
df1_agg.plot(x = 'trip_departure_date', y = 'occ', figsize = (8,5), color = 'purple')
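Alternatively, if you prefer to keep the named-aggregation style from the question, a minimal sketch is to aggregate first and then move the index back into a column with reset_index():
df1_agg = df1.groupby('trip_departure_date').agg(occ=('occ', 'mean')).reset_index()
df1_agg.plot(x='trip_departure_date', y='occ', figsize=(8, 5), color='purple')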

Related

How to find the top any % of a dataframe?

I want to find the top 1% in my dataframe and append all the values to a list. Then I can check the first value inside and use it as a filter in the dataframe. Any idea how to do it, or a simpler way if you have one?
You can find the dataframe I use here:
https://raw.githubusercontent.com/srptwice/forstack/main/resultat_projet.csv
What I tried is to inspect my dataframe with a heatmap (from Seaborn) and use a filter like this:
df4 = df2[df2 > 50700]
You can use df.<column name>.quantile(<percentile>) to get the top percentage of a dataframe. For example, the code below gets the rows of df where the bfly column is in the top 10% (above the 90th percentile).
import pandas as pd
df = pd.read_csv('./resultat_projet.csv')
df.columns = df.columns.str.replace(' ', '') # remove blank spaces in columns
df2 = df[df.bfly > df.bfly.quantile(0.9)]
print(df2)
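For the top 1% the question asks about, the same pattern applies; a minimal sketch, still assuming the bfly column:
df4 = df[df.bfly > df.bfly.quantile(0.99)]  # rows where bfly is in the top 1%
top_list = df4.bfly.sort_values(ascending=False).tolist()  # the top-1% values in a list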

Pandas dataframe "ValueError: cannot reindex from a duplicate axis" duplicate indices brute force solution?

import pandas as pd
df_avocado = pd.read_csv("avocado.csv")
df_avocado.set_index("Date", inplace=True)
The issue is here:
'''
determines all unique regions (ex: "Alabama", "Alaska", "Arkansas") in dataframe "df_avocado"
finds all data-points belonging to that unique region
dumps those data-points into a temporary dataframe "df_region"
calculates the 25sma of every df_region
dumps the 25sma to "df_avocado_region_25ma" so I can compare 25sma of every region
'''
df_avocado_region_25ma = pd.DataFrame()
for region in df_avocado["region"].unique():
    df_region = df_avocado.copy()[df_avocado["region"] == region]
    df_avocado_region_25ma[f"{region}_25ma"] = df_region["AveragePrice"].rolling(25).mean()
Jupyter gives "ValueError: cannot reindex from a duplicate axis" when adding each df_region to df_avocado_region_25ma.
I looked into what the ValueError means; quoting from What does `ValueError: cannot reindex from a duplicate axis` mean?, "this error usually rises when you join / assign to a column when the index has duplicate values".
This makes sense, as the "Date" column (which I set as the index) has a lot of overlapping values. However, since I don't care that there are duplicate indices (they provide a high/low for the 25sma), and I don't want to overwrite a previous index (I'd prefer to include every data point), is there any way to brute-force it and add all of the points in?
www.kaggle.com/neuromusic/avocado-prices
import pandas as pd
df_avocado = pd.read_csv("avocado.csv")
wanted_columns = ["Date", "AveragePrice", "region"]
df_avocado = df_avocado[wanted_columns]
df_avocado["Date"] = pd.to_datetime(df_avocado["Date"])
df_avocado.set_index("Date", inplace=True)
df_avocado.sort_index(inplace=True)
df_avocado_region_25ma = pd.DataFrame()
for region in df_avocado["region"].unique():
    df_region = df_avocado.copy()[df_avocado["region"] == region]
    df_avocado_region_25ma[f"{region}_25ma"] = df_region["AveragePrice"].rolling(25).mean()
df_avocado_region_25ma.plot()
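If you would rather collapse the duplicate dates than keep them, one hedged alternative (assuming the duplicates come from multiple rows per date within a region, e.g. the conventional and organic types in this dataset) is to average them before rolling, so each region's series has a unique index:
df_avocado_region_25ma = pd.DataFrame()
for region in df_avocado["region"].unique():
    # collapse duplicate dates within the region, then take the 25-period rolling mean
    series = df_avocado.loc[df_avocado["region"] == region, "AveragePrice"]
    df_avocado_region_25ma[f"{region}_25ma"] = series.groupby(level=0).mean().rolling(25).mean()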

Add values to bottom of DataFrame automatically with Pandas

I'm initializing a DataFrame:
columns = ['Thing','Time']
df_new = pd.DataFrame(columns=columns)
and then writing values to it like this:
counter = 0 # row index
for t in df.Thing.unique():
    df_temp = df[df['Thing'] == t] # filtering the df
    df_new.loc[counter, 'Thing'] = t # writing the filter value to df_new
    df_new.loc[counter, 'Time'] = df_temp['delta'].sum(axis=0) # summing and adding that value to df_new
    counter += 1 # increment the row index
Is there a better way to add new values to the dataframe each time without explicitly incrementing the row index with 'counter'?
If I'm interpreting this correctly, I think this can be done in one line:
newDf = df.groupby('Thing')['delta'].sum().reset_index()
By grouping by 'Thing', you have the various "t-filters" from your for-loop. We then apply a sum() to 'delta', but only within the various "t-filtered" groups. At this point, the dataframe has the various values of "t" as the indices, and the sums of the "t-filtered deltas" as a corresponding column. To get to your desired output, we then bump the "t's" into their own column via reset_index().
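As a quick illustration of what that one line produces, a tiny made-up example:
df = pd.DataFrame({'Thing': ['a', 'a', 'b'], 'delta': [1, 2, 5]})
print(df.groupby('Thing')['delta'].sum().reset_index())
#   Thing  delta
# 0     a      3
# 1     b      5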

pandas groupby on columns

I am trying the following example, where I need to group on columns:
import pandas as pd
import numpy as np
y = pd.DataFrame(np.random.randint(0, 10, (20, 30)).astype(float),
                 columns=pd.MultiIndex.from_tuples(
                     list(zip(np.arange(30),
                              np.random.randint(0, 10, (30,))))))
y.T.groupby(level = 1).agg(lambda x: np.std(x)/np.mean(x))
and it works. However, the following returns an error:
y.groupby(level = 1, axis = 1).agg(lambda x: np.std(x)/np.mean(x))
Am I missing something?
Update: the following works when taken separately:
y.groupby(level = 1, axis = 1).agg(np.std)/\
y.groupby(level = 1, axis = 1).agg(np.mean)
The groupby function is applied column-wise to your dataframe; however, when the dataframe is transposed, rows become columns and vice versa.
This wouldn't be an issue except that your rows and columns aren't both multi-indexed. Since you're treating your row index as a multi-index via the level=1 argument, you get that error.
Also, if you're trying to group by rows, you should use axis=0:
y.groupby(y.index, axis = 0).agg(lambda x: np.std(x)/np.mean(x))
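If you do want the column-wise grouping from the question, one workaround consistent with the snippet above (the axis argument to groupby was eventually deprecated in pandas) is to transpose, aggregate on the row index level, and transpose back:
result = y.T.groupby(level=1).agg(lambda x: np.std(x) / np.mean(x)).T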

How to iterate over columns of pandas dataframe to run regression

I have this code using Pandas in Python:
import statsmodels.api as sm
from pandas import DataFrame
from pandas_datareader import data as web # assumption: get_data_yahoo comes from pandas-datareader

all_data = {}
for ticker in ['FIUIX', 'FSAIX', 'FSAVX', 'FSTMX']:
    all_data[ticker] = web.get_data_yahoo(ticker, '1/1/2010', '1/1/2015')

prices = DataFrame({tic: data['Adj Close'] for tic, data in all_data.items()})
returns = prices.pct_change()
I know I can run a regression like this:
regs = sm.OLS(returns.FIUIX,returns.FSTMX).fit()
but how can I do this for each column in the dataframe? Specifically, how can I iterate over columns, in order to run the regression on each?
Specifically, I want to regress each other ticker symbol (FIUIX, FSAIX and FSAVX) on FSTMX, and store the residuals for each regression.
I've tried various versions of the following, but nothing I've tried gives the desired result:
resids = {}
for k in returns.keys():
    reg = sm.OLS(returns[k], returns.FSTMX).fit()
    resids[k] = reg.resid
Is there something wrong with the returns[k] part of the code? How can I use the k value to access a column? Or else is there a simpler approach?
for column in df:
    print(df[column])
You can use iteritems() (renamed to items() in newer pandas):
for name, values in df.iteritems():
    print('{name}: {value}'.format(name=name, value=values[0]))
This answer covers iterating over selected columns as well as all columns in a DF.
df.columns gives an Index containing all the column names in the DF. That isn't very helpful if you want to iterate over all the columns, but it comes in handy when you want to iterate over columns of your choosing only.
We can use Python's list slicing to slice df.columns according to our needs. For example, to iterate over all columns but the first one, we can do:
for column in df.columns[1:]:
    print(df[column])
Similarly, to iterate over all the columns in reversed order, we can do:
for column in df.columns[::-1]:
    print(df[column])
We can iterate over all the columns in a lot of cool ways using this technique. Also remember that you can get the indices of all columns easily using:
for ind, column in enumerate(df.columns):
    print(ind, column)
You can index dataframe columns by position using .iloc (the old .ix accessor has been removed from modern pandas):
df1.iloc[:, 0]
This returns the first column, for example (the 0 is the positional index).
df1.iloc[0]
This returns the first row.
df1.iloc[0, 1]
This is the value at the intersection of row 0 and column 1, and so on. So you can enumerate() returns.keys() and use the number to index the dataframe.
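Putting that together with the question's loop, a hedged sketch (assuming statsmodels is imported as sm) might be:
resids = {}
for i, k in enumerate(returns.keys()):
    # regress column i on FSTMX; missing='drop' skips the NaN row that pct_change() creates
    reg = sm.OLS(returns.iloc[:, i], returns.FSTMX, missing='drop').fit()
    resids[k] = reg.resid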
A workaround is to transpose the DataFrame and iterate over the rows.
for column_name, column in df.transpose().iterrows():
    print(column_name)
Using a list comprehension, you can get all the column names (header):
[column for column in df]
Based on the accepted answer, if an index corresponding to each column is also desired:
for i, column in enumerate(df):
    print(i, df[column])
The above df[column] type is Series, which can simply be converted into numpy ndarrays:
for i, column in enumerate(df):
    print(i, np.asarray(df[column]))
I'm a bit late but here's how I did this. The steps:
Create a list of all columns
Use itertools to take x combinations
Append each result R squared value to a result dataframe along with excluded column list
Sort the result DF in descending order of R squared to see which is the best fit.
This is the code I used on a DataFrame called aft_tmt. Feel free to extrapolate to your use case.
import pandas as pd
# setting options to print without truncating output
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)
import statsmodels.formula.api as smf
import itertools
# This section gets the column names of the DF and removes some columns which I don't want to use as predictors.
itercols = aft_tmt.columns.tolist()
itercols.remove("sc97")
itercols.remove("sc")
itercols.remove("grc")
itercols.remove("grc97")
print(itercols)
len(itercols)
# results DF
regression_res = pd.DataFrame(columns=["Rsq", "predictors", "excluded"])
# change 9 to the number of columns you want to combine from N columns.
# Possibly run an outer loop from 0 to N/2?
for x in itertools.combinations(itercols, 9):
    lmstr = "+".join(x)
    m = smf.ols(formula="sc ~ " + lmstr, data=aft_tmt)
    f = m.fit()
    # record the R squared, the predictors used, and the columns left out
    excluded = "+".join([col for col in itercols if col not in x])
    regression_res = regression_res.append(
        pd.DataFrame([[f.rsquared, lmstr, excluded]],
                     columns=["Rsq", "predictors", "excluded"]))
regression_res.sort_values(by="Rsq", ascending=False)
I landed on this question as I was looking for a clean iterator of columns only (Series, no names).
Unless I am mistaken, there is no such thing, which, if true, is a bit annoying. In particular, one would sometimes like to assign a few individual columns (Series) to variables, e.g.:
x, y = df[['x', 'y']] # does not work
There is df.items() that gets close, but it gives an iterator of tuples (column_name, column_series). Interestingly, there is a corresponding df.keys() which returns df.columns, i.e. the column names as an Index, so a, b = df[['x', 'y']].keys() assigns properly a='x' and b='y'. But there is no corresponding df.values(), and for good reason, as df.values is a property and returns the underlying numpy array.
One (inelegant) way is to do:
x, y = (v for _, v in df[['x', 'y']].items())
but it's less pythonic than I'd like.
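One hedged workaround is a tiny (hypothetical) helper that yields just the Series:
def columns_of(frame, names):
    # hypothetical helper: yield each requested column as a bare Series
    return (frame[name] for name in names)

x, y = columns_of(df, ['x', 'y'])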
Most of these answers are going via the column name, rather than iterating the columns directly. They will also have issues if there are multiple columns with the same name. If you want to iterate the columns, I'd suggest:
for series in (df.iloc[:, i] for i in range(df.shape[1])):
    ...
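For example, with a made-up frame that has a duplicated column name (where df['a'] would return a two-column DataFrame rather than a Series), positional iteration still visits each column once:
df = pd.DataFrame([[1, 2, 3]], columns=['a', 'a', 'b'])
for series in (df.iloc[:, i] for i in range(df.shape[1])):
    print(series.name, series.iloc[0])  # a 1, a 2, b 3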
Assuming an X feature matrix and a (multicolumn) y label:
columns = [c for c in _df.columns if c in ['col1', 'col2','col3']] #or '..c not in..'
_df.set_index(columns, inplace=True)
print( _df.index)
X, y = _df.iloc[:,:4].values, _df.index.values
