I need to divide all columns but the first in a DataFrame by the first column.
Here's what I'm doing, but I wonder if this isn't the "right" pandas way:
df = pd.DataFrame(np.random.rand(10,3), columns=list('ABC'))
df[['B', 'C']] = (df.T.iloc[1:] / df.T.iloc[0]).T
Is there a way to do something like df[['B','C']] / df['A']? (That just gives a 10x12 dataframe of nan.)
Also, after reading some similar questions on SO, I tried df['A'].div(df[['B', 'C']]) but that gives a broadcast error.
I believe df[['B','C']].div(df.A, axis=0) and df.iloc[:,1:].div(df.A, axis=0) work.
do: df.iloc[:,1:] = df.iloc[:,1:].div(df.A, axis=0)
This divides every column other than the first by column 'A'. The result keeps the first column unchanged and replaces each remaining column with its values divided by the divisor column.
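A minimal runnable sketch of the above (column names as in the question):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(10, 3), columns=list('ABC'))

# Divide every column except the first by column 'A', aligning on rows.
result = df.iloc[:, 1:].div(df['A'], axis=0)

# Spot-check one cell against manual division.
assert np.isclose(result.loc[0, 'B'], df.loc[0, 'B'] / df.loc[0, 'A'])
```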
This is not actually a matrix operation: for NumPy and pandas, "/" is elementwise division, and broadcasting requires the shapes of the two operands to be compatible (see here).
e.g.
df['A'].shape --> (10,)
df[['B','C']].shape --> (10,2)
To make the shapes broadcast, transpose so they line up as (2, 10) against (10,):
df[['B','C']].T.shape, df['A'].shape -->((2, 10), (10,))
The result then comes back transposed:
( df[['B','C']].T / df['A'] ).shape --> (2,10)
Therefore:
( df[['B','C']].T / df['A'] ).T
has shape (10, 2) and gives the result you wanted!
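To double-check, the transpose route and div(axis=0) agree; a small sketch with the same random df:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(10, 3), columns=list('ABC'))

# (2, 10) / (10,) broadcasts along the last axis, then transpose back.
via_transpose = (df[['B', 'C']].T / df['A']).T

# Equivalent, without transposing: align the Series on the row axis.
via_div = df[['B', 'C']].div(df['A'], axis=0)

assert np.allclose(via_transpose, via_div)
```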
I am working on a notebook with python/pandas, and I have:
a DataFrame, X (20550 rows × 18 columns), and
a Series, y (length 20550).
I want to merge (or concatenate, or append) the column y at the end of X
and get an X_total with 20550 rows and 19 columns.
This is probably very simple, but when I try to append or concatenate horizontally I end up with DataFrames of weird dimensions; at best I got a DataFrame with an extra row (20551 rows × 19 columns, or even 20551 rows × 20565 columns) full of NaNs.
EDIT:
I tried:
pd.concat([X,y], axis=1)
X.append(other=y)
dfsv=[X,y]
pd.concat([X,y], axis=1, join='outer', ignore_index=False)
X.append(y, ignore_index=True)
any thoughts?
cheers!
To append a Series as a column to a DataFrame, the Series must have a name, which will be used as the column name. At the same time, the index of the Series needs to match the index of the DataFrame. As such, you can do it this way:
y2 = pd.Series(y.values, name='y', index=X.index)
X.join(y2)
Here, we fulfill both prerequisites in one step by defining a Series y2 that takes the values of y, giving it the name y, and setting its index to match the index of the DataFrame X. Then we can use .join() to join y2 onto the end of X.
Edit
Another, even simpler, solution:
X['y'] = y.values
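Both approaches in a small runnable sketch (with tiny stand-in data for X and y):

```python
import pandas as pd

# Small stand-ins for the X (DataFrame) and y (Series) in the question.
X = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
y = pd.Series([7, 8, 9])

# Name the Series and align its index with X, then join.
y2 = pd.Series(y.values, name='y', index=X.index)
X_total = X.join(y2)

# The shorter route: assign by values (row order must already match).
X['y'] = y.values

assert X_total.equals(X)
```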
If X and y have the same index:
pd.concat([X, y], axis=1)
If X and y have different indices but the rows are in matching order, reset the indices first:
pd.concat([X.reset_index(drop=True), y.reset_index(drop=True)], axis=1)
You can use pd.concat; it is important, though, to specify the axis to be columns:
>>> X = pd.concat([X,Y], axis=1)
Code example:
a = pd.DataFrame({"a": [1,2,3],}, index=[1,2,2])
b = pd.DataFrame({"b": [1,4,5],}, index=[1,4,5])
pd.concat([a, b], axis=1)
It raises error: ValueError: Shape of passed values is (7, 2), indices imply (5, 2)
What I expected as a result:
Why does it not return like this? concat's default joining is outer so I think my thought is reasonable enough... Am I missing something?
TL;DR: Why? I don't know for sure, but I think it comes down to the design of the package.
An index in pandas "is like an address, that’s how any data point across the dataframe or series can be accessed. Rows and columns both have indexes, rows indices are called as index and for columns its general column names." source
Now you are concatenating with axis=1, i.e. along the column axis, placing the frames side by side. In a, the label 2 is an address that points to two different values, and we can still "access" those values with a[a.index == 2]. Note, however, that the index is then not a proper function in the mathematical sense, since one label maps to two different values (source). I am guessing the implementation was designed to assume unique (injective) labels in order to simplify alignment.
Thus, when concatenating, pandas wants to match up all the indices where possible and fill in NaNs where it can't. However, as the error says, the shape it infers from the indices is (5, 2), which conflicts with the data because one address shares two different values. So why doesn't it work? Because I believe pandas checks the expected shape beforehand, based on the indices, and only then performs the concatenation; the check on the indices is where it breaks.
Note that this would not work with identical column names either:
a = pd.DataFrame({"a": [1,2,3], 'b': [9,8,7]}, index=[1,2,2])
b = pd.DataFrame({"b": [1,4,5], 'bx': [1,4,3]}, index=[1,4,5]).rename(columns={'bx': 'b'})
pd.concat([a,b]) # axis=0 is the default
ValueError: Plan shapes are not aligned
Therefore pd.concat needs unique labels along whichever axis it aligns on: you can't have two identical column names when you concatenate row-wise, and likewise you can't have duplicate row labels when you concatenate column-wise.
Interestingly, for your original example, pd.concat([a, b], ignore_index=True, axis=1) also raises the same error, leading me to more strongly suspect that pandas is checking the shape before the concatenation.
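A common workaround, assuming it's the row order rather than the labels that matters, is to reset both indices so concat aligns by position:

```python
import pandas as pd

a = pd.DataFrame({"a": [1, 2, 3]}, index=[1, 2, 2])
b = pd.DataFrame({"b": [1, 4, 5]}, index=[1, 4, 5])

# Duplicate labels in `a` make column-wise alignment ambiguous, so
# drop both indices and concatenate by position instead.
out = pd.concat([a.reset_index(drop=True), b.reset_index(drop=True)], axis=1)
```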
I want to calculate rolling robust covariance using sklearn.covariance MinCovDet.
I have a DataFrame df with 3000 rows and 20 columns, with dates in the index.
For each row, I want to calculate the robust covariance over, say, the last 200 days.
I have tried with
df.apply(lambda x: MinCovDet().fit(df[x-400:x].values))
I get a TypeError: ("Cannot convert input [date\n2004-01-02 etc ...
Any idea?
A more general question would be how to apply a function to an n x m array of a pandas DataFrame.
Many thanks
Answering the 'more general question'.
There is the pandas.DataFrame.rolling() method made for exactly such cases: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.rolling.html
Note, however, that .apply() on a rolling object calls the function on each column's window separately, as a 1-D array, so a multivariate estimator like MinCovDet cannot be fitted through it directly (and a rolling object has no .values attribute). For a whole-matrix estimate you need to iterate over the windows yourself.
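Since each window has to be fitted as a whole matrix, an explicit loop over the windows is the straightforward route. A minimal sketch with made-up data, using np.cov as a stand-in for MinCovDet (sklearn may not be available here; the comment shows where to swap it in):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((300, 4)),
                  index=pd.date_range('2004-01-02', periods=300))

window = 200
covs = {}
for end in range(window, len(df) + 1):
    win = df.iloc[end - window:end]
    # For a robust estimate, replace np.cov with
    # MinCovDet().fit(win.values).covariance_
    covs[win.index[-1]] = np.cov(win.values, rowvar=False)
```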
If your dates are really in the index, then they should not be seen by df.values. If the dates are instead the first column, then
df1 = df.iloc[:, 1:]
df1.apply(lambda x: MinCovDet().fit(df1[x-400:x].values))
should work fine.
I'm trying to create a DataFrame in Pandas with the following code:
df_coefficients = pd.DataFrame(data = log_model.coef_, index = X.columns,
columns = ['Coefficients'])
However, I keep getting the following error:
Shape of passed values is (5, 1), indices imply (1, 5)
The values and indices are as follows:
Indices =
Index([u'Daily Time Spent on Site', u'Age', u'Area Income',
u'Daily Internet Usage', u'Male'],
dtype='object')
Values =
array([[ -4.45816498e-02, 2.18379839e-01, -7.63621392e-06,
-2.45264007e-02, 1.13334440e-03]])
How would I fix this? I've built the same type of table before and I've never gotten this error.
Any help would be appreciated.
Thanks
It looks like your index and values have incompatible shapes. Notice that the values array is written with double brackets: its shape is (1, 5), one row of five values, whereas your five index labels and single 'Coefficients' column imply five rows of one value.
If you enter Values as it is written in the question:
Values =
array([[ -4.45816498e-02, 2.18379839e-01, -7.63621392e-06,
-2.45264007e-02, 1.13334440e-03]])
and call Values.shape, it returns
Values.shape
(1,5)
Instead, if you set Values as:
Values = np.array([ -4.45816498e-02, 2.18379839e-01, -7.63621392e-06,
-2.45264007e-02, 1.13334440e-03])
then the shape of Values will be (5,), which matches the index array.
Your data has five columns and one row instead of one column and five rows. Just use the transposed version of it with .T:
df_coefficients = pd.DataFrame(data = log_model.coef_.T, index = X.columns,
columns = ['Coefficients'])
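A self-contained version of the fix, with a made-up coefficient array standing in for log_model.coef_:

```python
import numpy as np
import pandas as pd

# sklearn stores linear-model coefficients with shape (1, n_features).
coef = np.array([[-4.458e-02, 2.184e-01, -7.636e-06, -2.453e-02, 1.133e-03]])
cols = ['Daily Time Spent on Site', 'Age', 'Area Income',
        'Daily Internet Usage', 'Male']

# Transpose (1, 5) -> (5, 1) so rows line up with the feature names.
df_coefficients = pd.DataFrame(coef.T, index=cols, columns=['Coefficients'])
```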
I need to edit rows in a pandas.DataFrame by dividing each value by the row.max()
what is the recommended way to do this?
I tried
df.xs('rowlabel') /= df.xs('rowlabel').max()
as I'd do on a numpy array, but it didn't work.
The syntax for a single row is:
df.loc['rowlabel'] /= df.loc['rowlabel'].max()
(older answers use .ix, which has been removed from pandas; .loc is the label-based equivalent)
If you want that done on every row in the dataframe, you can use apply (with axis=1 to select rows instead of columns):
df.apply(lambda x: x / x.max(), axis=1)
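A vectorized alternative sketch: instead of apply, divide the whole frame by the Series of row maxima, aligned on the row axis.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(5, 4), columns=list('abcd'))

# Divide each row by its own maximum; axis=0 aligns the Series of
# row maxima with the DataFrame's rows.
normalized = df.div(df.max(axis=1), axis=0)

# Every row's maximum is now exactly 1.
assert np.allclose(normalized.max(axis=1), 1.0)
```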