pandas groupby on columns - python

I am trying the following example, where I need to group on columns:
import pandas as pd
import numpy as np
y = pd.DataFrame(np.random.randint(0, 10, (20, 30)).astype(float),
                 columns = pd.MultiIndex.from_tuples(
                     list(zip(np.arange(30),
                              np.random.randint(0, 10, (30,))))))
y.T.groupby(level = 1).agg(lambda x: np.std(x)/np.mean(x))
and it works. However, the following returns an error:
y.groupby(level = 1, axis = 1).agg(lambda x: np.std(x)/np.mean(x))
Am I missing something?
Update: The following works when the two aggregations are taken separately:
y.groupby(level = 1, axis = 1).agg(np.std)/\
y.groupby(level = 1, axis = 1).agg(np.mean)

The groupby function is applied column-wise to your dataframe; when the dataframe is transposed, rows become columns and vice versa.
This wouldn't be an issue if your rows and columns were both a MultiIndex. However, since the level=1 argument treats your row index as a MultiIndex when it isn't one, you get that error.
Also, if you're trying to group by rows, you should use axis=0:
y.groupby(y.index, axis = 0).agg(lambda x: np.std(x)/np.mean(x))
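If the goal is the column-wise grouping from the question, one workaround (a sketch building on the transpose call that already works above) is to transpose, group, aggregate, and transpose back:
# group the transposed frame on level 1 of its row MultiIndex, then
# transpose back so the group labels end up as columns again
result = y.T.groupby(level = 1).agg(lambda x: np.std(x)/np.mean(x)).T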

Related

Pandas: Convert 1D dataframe to 2D dataframe

I have a pandas dataframe whose shape is (4628,).
How do I change the shape of dataframe to (4628,1)?
You might actually have a Series; you can turn it into a DataFrame with Series.to_frame:
s = pd.Series(range(10))
out = s.to_frame('col_name') # or .to_frame()
print(s.shape)
(10,)
print(out.shape)
(10, 1)
I don't know how you got that Series, but if it came from selecting a single label like df['a'], you can pass a list of labels instead, like df[['a']], to get a DataFrame.
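For example, a minimal sketch (with hypothetical columns 'a' and 'b') contrasting the two selections:
import pandas as pd
df = pd.DataFrame({'a': range(5), 'b': range(5)})
print(df['a'].shape)    # (5,)   -- a single label returns a Series
print(df[['a']].shape)  # (5, 1) -- a list of labels returns a DataFrame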
You can also use numpy's reshape() on the array underlying the dataframe. It's easy to use; run the code below and you'll see the shape go from (4628,) to (4628, 1).
import pandas as pd
df = pd.DataFrame(range(1, 4629))
print(df)
arr = df.values.reshape(-1)  # flatten to a 1D numpy array
print(arr.shape)
arr = arr.reshape(-1, 1)     # back to 2D with a single column
print(arr.shape)
Results:
......
[4628 rows x 1 columns]
(4628,)
(4628, 1)

Problem with a column in my new groupby object

So I have a dataframe, and I performed this operation:
df1 = df1.groupby(['trip_departure_date']).agg(occ = ('occ', 'mean'))
The problem is that when I try to plot, it gives me an error saying that trip_departure_date doesn't exist!
I did this:
df1.plot(x = 'trip_departure_date', y = 'occ', figsize = (8,5), color = 'purple')
and I get this error:
KeyError: 'trip_departure_date'
Please help!
Your question is similar to this question: pandas groupby without turning grouped by column into index
When you group by a column, that column ceases to be a column and becomes the index of the result, and an index is not a column. If you set as_index=False, pandas keeps the grouping column as a regular column instead of moving it to the index.
The second problem is that the .agg() call is also aggregating occ over trip_departure_date and moving trip_departure_date to the index. You don't need it just to get the mean of occ grouped by trip_departure_date.
import pandas as pd
df1 = pd.read_csv("trip_departures.txt")
df1_agg = df1.groupby(['trip_departure_date'], as_index=False).mean()
Or, if you only want to aggregate the occ column:
df1_agg = df1.groupby(['trip_departure_date'], as_index=False)['occ'].mean()
df1_agg.plot(x = 'trip_departure_date', y = 'occ', figsize = (8,5), color = 'purple')
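Alternatively, a sketch that keeps the named aggregation from the original code and restores the grouping key as a column afterwards with reset_index:
df1_agg = (df1.groupby('trip_departure_date')
              .agg(occ = ('occ', 'mean'))
              .reset_index())  # move trip_departure_date back to a column
df1_agg.plot(x = 'trip_departure_date', y = 'occ', figsize = (8,5), color = 'purple')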

Concatenate two dataframes with different row indices

I want to concatenate two data frames of the same length, by adding a column to the first one (df).
But because certain df rows are filtered out, the indexes no longer match.
import io
import pandas as pd

df = pd.read_csv(io.StringIO(uploaded['customer.csv'].decode('utf-8')), sep=";")
df["Margin"] = df["Sales"] - df["Cost"]
df = df.loc[df["Margin"] > -100000]
df = df.loc[df["Sales"] > 1000]
df.reindex()  # has no effect: reindex() returns a new object and is not assigned
df
This returns the filtered dataframe, whose index still has gaps where rows were removed. So this operation:
customerCluster = pd.concat([df, clusters], axis = 1, ignore_index = True)
print(customerCluster)
returns a frame whose rows no longer line up.
So I've tried reindex and the ignore_index=True argument, as you can see in the code above.
Thanks for all the answers. If anyone encounters the same problem, the solution I found was this:
customerID = df["CustomerID"]
customerID = customerID.reset_index(drop=True)
df = df.reset_index(drop=True)
So, basically, the indexes of both data frames now match, and thus:
customerCluster = pd.concat((customerID, clusters), axis = 1)
This concatenates the two data frames correctly.
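As a minimal sketch of why this works (with made-up data): concat aligns on index labels, so a filtered frame pairs up wrongly until the indexes are reset:
import pandas as pd
df = pd.DataFrame({'CustomerID': [10, 20, 30, 40]})
df = df[df['CustomerID'] > 20]                  # keeps original index labels 2 and 3
clusters = pd.Series([0, 1], name = 'Cluster')  # fresh 0-based index
print(pd.concat([df, clusters], axis = 1))      # aligned on labels, produces NaNs
print(pd.concat([df.reset_index(drop = True), clusters], axis = 1))  # rows pair up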

How to update a pandas dataframe with an array of values, indexes, and columns?

I have a large dataframe and would like to update specific values at known row and column indices. I would like to do this without an explicit for loop.
For example:
import string
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.rand(10, 10), index = range(10), columns = list(string.ascii_lowercase)[:10])
I have arbitrary arrays of indexes, columns, and values that I would like to use to update df. For example:
update_values = [0,-2,-3]
update_index = [3,5,7]
update_columns = ["d","g","i"]
I can loop over the arrays to update the original dataframe:
for i, j, v in zip(update_index, update_columns, update_values):
    df.loc[i, j] = v
but would like to use a technique not involving an explicit for loop.
Use the underlying numpy values (note that writing through .values only updates the dataframe when the frame has a single dtype, so that .values is a view rather than a copy):
indexes = map(df.columns.get_loc, update_columns)
df.values[update_index, list(indexes)] = update_values
You can also try loc, which selects by index and column labels, as in df.loc[[index_labels], [column_labels]]. Be aware, though, that passing two lists selects the entire 3x3 block, so this broadcasts [0, -2, -3] across each selected row rather than updating just the three individual cells:
df.loc[[3,5,7], ["d","g","i"]] = [0,-2,-3]
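If you would rather not rely on .values being a writable view, a copy-based sketch (reusing the update_index, update_columns, and update_values arrays from the question) that still updates only the three individual cells:
col_idx = df.columns.get_indexer(update_columns)  # column labels -> positions
arr = df.to_numpy(copy = True)
arr[update_index, col_idx] = update_values        # pointwise fancy indexing
df = pd.DataFrame(arr, index = df.index, columns = df.columns)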

Reading values from Pandas dataframe rows into equations and entering result back into dataframe

I have a dataframe. For each row, I need to read values from two columns, pass them to a set of equations, enter the result of each equation into its own column in the same row, then go to the next row and repeat.
After reading the responses to similar questions I tried:
import pandas as pd
DF = pd.read_csv("...")
Equation_1 = f(x, y)
Equation_2 = g(x, y)
for index, row in DF.iterrows():
    a = DF[m]
    b = DF[n]
    DF[p] = Equation_1(a, b)
    DF[q] = Equation_2(a, b)
Rather than iterating over DF and entering new values for each row, this code iterates over DF and enters the same values for every row. I am not sure what I am doing wrong here.
Also, from what I have read it is actually faster to treat the DF as a NumPy array and perform the calculation over the entire array at once rather than iterating. Not sure how I would go about this.
Thanks.
Turns out that this is extremely easy. All that must be done is to define two variables and assign the desired columns to them, then set the column to be filled equal to the expression containing those variables.
Pandas already knows that it must apply the equation to every row and return each value to its proper index. I didn't realize it would be this easy and was looking for more explicit code.
e.g.,
import pandas as pd

df = pd.read_csv("...")  # df is a large 2D table

A = df[0]
B = df[1]

def f(A, B):
    ...  # your equation in terms of A and B

df[3] = f(A, B)  # computes the whole column at once
# If your equations are simple enough, do operations column-wise in Pandas:
import pandas as pd

test = pd.DataFrame([[1, 2], [3, 4], [5, 6]])
test             # default column names are 0, 1
test[0]          # this is column 0
test.iloc[:, 0]  # column by position, returned as a Series
test.columns = ['S', 'Q']  # column names are easier to use
test             # column names! Use them column-wise:
test['result'] = test.S**2 + test.Q
test             # results stored in the DataFrame

# For more complicated stuff, try apply, as in Python pandas apply on more columns:
def toyfun(row):
    return row['S'] - row['Q']**2

test['out2'] = test[['S', 'Q']].apply(toyfun, axis=1)

# You can also define the column names when you generate the DataFrame:
test2 = pd.DataFrame([[1, 2], [3, 4], [5, 6]], columns=list('AB'))
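As a follow-up to the question's note about speed: the column-wise expressions above are already vectorized, and you can go one step further by computing on the underlying numpy arrays directly, avoiding the per-row Python call that apply makes. A sketch reproducing the toyfun formula:
test['out3'] = test['S'].to_numpy() - test['Q'].to_numpy()**2  # same result as out2, computed on whole arrays at once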
