Creating DataFrame in Pandas (Wrong Shape)

I'm trying to create a DataFrame in Pandas with the following code:
df_coefficients = pd.DataFrame(data = log_model.coef_, index = X.columns,
columns = ['Coefficients'])
However, I keep getting the following error:
Shape of passed values is (5, 1), indices imply (1, 5)
The values and indices are as follows:
Indices =
Index([u'Daily Time Spent on Site', u'Age', u'Area Income',
u'Daily Internet Usage', u'Male'],
dtype='object')
Values =
array([[ -4.45816498e-02, 2.18379839e-01, -7.63621392e-06,
-2.45264007e-02, 1.13334440e-03]])
How would I fix this? I've built the same type of table before and I've never gotten this error.
Any help would be appreciated.
Thanks

It looks like your index and values arrays have different shapes. Notice that the index is written with single brackets while the values array has double brackets: the index is a flat sequence of 5 labels, but the values form a 2-D array with one row of 5 columns. That mismatch is what the (5, 1) vs. (1, 5) in the error message refers to.
If you define Values exactly as written in the question:
Values = np.array([[-4.45816498e-02, 2.18379839e-01, -7.63621392e-06,
                    -2.45264007e-02, 1.13334440e-03]])
and then call Values.shape, it returns:
(1, 5)
Instead, if you set Values as:
Values = np.array([-4.45816498e-02, 2.18379839e-01, -7.63621392e-06,
                   -2.45264007e-02, 1.13334440e-03])
then Values.shape will be (5,), which matches the 5-element index.
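Equivalently, if you are starting from the 2-D log_model.coef_ array rather than retyping the numbers, ravel() (or flatten()) collapses it to the same (5,) shape. A short, self-contained sketch using the coefficient values quoted in the question:

import numpy as np
import pandas as pd

coef = np.array([[-4.45816498e-02, 2.18379839e-01, -7.63621392e-06,
                  -2.45264007e-02, 1.13334440e-03]])         # shape (1, 5), like log_model.coef_
index = pd.Index(['Daily Time Spent on Site', 'Age', 'Area Income',
                  'Daily Internet Usage', 'Male'])

df_coefficients = pd.DataFrame(coef.ravel(),                  # ravel() -> shape (5,)
                               index=index, columns=['Coefficients'])
print(df_coefficients)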

Your data has five columns and one row instead of one column and five rows. Just use the transposed version of it with .T:
df_coefficients = pd.DataFrame(data = log_model.coef_.T, index = X.columns,
columns = ['Coefficients'])
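For context, here is a minimal reproducible sketch with synthetic data, assuming log_model is a fitted scikit-learn LogisticRegression as in the question; it shows why the transposed coefficients line up with the index:

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(100, 5)),
                 columns=['Daily Time Spent on Site', 'Age', 'Area Income',
                          'Daily Internet Usage', 'Male'])
y = rng.integers(0, 2, size=100)                 # synthetic binary target

log_model = LogisticRegression().fit(X, y)
print(log_model.coef_.shape)                     # (1, 5): one row, one column per feature
print(log_model.coef_.T.shape)                   # (5, 1): matches 5 index labels, 1 column

df_coefficients = pd.DataFrame(data=log_model.coef_.T, index=X.columns,
                               columns=['Coefficients'])
print(df_coefficients)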

Related

How can I extract a dictionary into Excel?

I'm trying to export the following dictionary to Excel using a pandas DataFrame:
results = {'ZF_DTSPP': [735.0500558302846, 678.5413714617252, 772.0300704610595,
                        722.254907241738, 825.2955175305726],
           'ZF_DTSPPG': [732.0500558302845, 637.4786326591071, 655.8462451037873,
                         721.404907241738, 821.8455175305724]}
This is my code:
df = pd.DataFrame(data=results, index=[5, 2])
df = (df.T)
print(df)
df.to_excel('dict1.xlsx')
Somehow I always receive the following error:
"ValueError: Shape of passed values is (5, 2), indices imply (2, 2)".
What can I do? How do I need to adapt the index?
Is there a way to compare the different values of "ZF_DTSPP" and "ZF_DTSPPG" directly with python?
You can use pd.DataFrame.from_dict, as shown in the pandas from_dict documentation; your code then becomes:
df = pd.DataFrame.from_dict(results)
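Putting it together, a sketch of the full round trip could look like the following; the difference column is just one way to compare ZF_DTSPP and ZF_DTSPPG directly, as asked, and is not part of the original answer:

import pandas as pd

results = {'ZF_DTSPP': [735.0500558302846, 678.5413714617252, 772.0300704610595,
                        722.254907241738, 825.2955175305726],
           'ZF_DTSPPG': [732.0500558302845, 637.4786326591071, 655.8462451037873,
                         721.404907241738, 821.8455175305724]}

df = pd.DataFrame.from_dict(results)                    # 5 rows, one column per key
df['difference'] = df['ZF_DTSPP'] - df['ZF_DTSPPG']     # direct comparison of the two runs
df.to_excel('dict1.xlsx')                               # needs an Excel writer such as openpyxl
print(df)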

How to pass a series to call a user defined function?

I am trying to pass a series to a user defined function and getting this error:
Function:
def scale(series):
    sc = StandardScaler()
    sc.fit_transform(series)
    print(series)
Code for calling:
df['Value'].apply(scale) # df['Value'] is a Series having float dtype.
Error:
ValueError: Expected 2D array, got scalar array instead:
array=28.69.
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
Can anyone help address this issue?
The method apply applies a function to each element in the Series (or, in the case of a DataFrame, to each row or each column depending on the chosen axis). Here, however, you want your function to process the entire Series and output a new Series in its place.
You can therefore simply run:
StandardScaler().fit_transform(df['Value'].values.reshape(-1, 1))
StandardScaler expects a 2D array as input, where each row is a sample made up of one or more features. Even if there is just a single feature (as seems to be the case in your example), the input has to have the right dimensions. Therefore, before handing the Series over to sklearn, I access its values (the NumPy representation) and reshape it accordingly.
For more details on reshape(-1, ...) check this out: What does -1 mean in numpy reshape?
Now, the best bit. If your entire DataFrame consists of a single column you could simply do:
StandardScaler().fit_transform(df)
And even if it doesn't, you could still avoid the reshape:
StandardScaler().fit_transform(df[['Value']])
Note how in this case 'Value' is surrounded by two sets of brackets, so this time it is not a Series but rather a DataFrame containing a subset of the original columns (useful when you do not want to scale all of them). Since a DataFrame is already two-dimensional, you don't need to worry about reshaping.
Finally, if you want to scale just some of the columns and update your original DataFrame all you have to do is:
>>> df = pd.DataFrame({'A': [1,2,3], 'B': [0,5,6], 'C': [7, 8, 9]})
>>> columns_to_scale = ['A', 'B']
>>> df[columns_to_scale] = StandardScaler().fit_transform(df[columns_to_scale])
>>> df
A B C
0 -1.224745 -1.397001 7
1 0.000000 0.508001 8
2 1.224745 0.889001 9
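If you still want a named helper like the scale function from the question, a possible rewrite is sketched below; having it accept and return a whole Series (instead of being applied element by element) is a choice made here, not something from the original post:

import pandas as pd
from sklearn.preprocessing import StandardScaler

def scale(series: pd.Series) -> pd.Series:
    """Return a z-scored copy of the Series, keeping its index and name."""
    scaled = StandardScaler().fit_transform(series.to_frame())  # sklearn wants 2-D input
    return pd.Series(scaled.ravel(), index=series.index, name=series.name)

df = pd.DataFrame({'Value': [28.69, 30.10, 25.40, 31.70]})      # toy data
df['Value'] = scale(df['Value'])   # call it directly on the Series -- no .apply() needed
print(df)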

Pandas concat: why is the `DataFrame` with duplicated index not working with concat()?

Code example:
a = pd.DataFrame({"a": [1,2,3],}, index=[1,2,2])
b = pd.DataFrame({"b": [1,4,5],}, index=[1,4,5])
pd.concat([a, b], axis=1)
It raises this error: ValueError: Shape of passed values is (7, 2), indices imply (5, 2)
What I expected as a result was an outer join of the two frames, with NaN wherever an index label is missing from one side.
Why does it not return like this? concat's default join is outer, so I think my expectation is reasonable enough... Am I missing something?
TL;DR: Why? I don't know for sure, but I think it comes down to the design of the package.
An index in pandas "is like an address, that’s how any data point across the dataframe or series can be accessed. Rows and columns both have indexes, rows indices are called as index and for columns its general column names." source
Now you are concatenating with axis=1, i.e. gluing the frames together side by side along the columns. That means frame a has one address, the label 2, which points to two different values. We can still access those values with a[a.index == 2], but note that, in a mathematical sense, the index is no longer a proper function, because one label maps to two different values (source). I am guessing the implementation was designed around indices whose label-to-row mapping is unique, because that makes alignment easier to implement.
Thus, when concatenating, pandas wants to match all the indices together where possible and fill in NaN where not possible. However, as the error says, the shape it infers from the indices is (5, 2), while the values it actually has to place do not fit that shape, precisely because one address is shared by two different values. So why doesn't it work? I believe pandas computes the expected shape from the indices beforehand and only then performs the concatenation, so the operation breaks at that preliminary check.
Note, too, that the same thing happens with duplicated column names when concatenating row-wise:
a = pd.DataFrame({"a": [1,2,3], 'b': [9,8,7]}, index=[1,2,2])
b = pd.DataFrame({"b": [1,4,5], 'bx': [1,4,3]}, index=[1,4,5]).rename(columns={'bx': 'b'})
pd.concat([a,b]) # axis=0 is the default
ValueError: Plan shapes are not aligned
Therefore pd.concat needs unique labels along the axis it aligns on: you can't have two identical column names when you concatenate row-wise, and likewise you can't have duplicate index labels when you concatenate column-wise.
Interestingly, for your original example, pd.concat([a, b], ignore_index=True, axis=1) also raises the same error, leading me to more strongly suspect that pandas is checking the shape before the concatenation.
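If you still need the two frames lined up despite the repeated label, one possible workaround (a sketch, not the only option) is to move the index into an ordinary column and align on that column with an outer merge:

import pandas as pd

a = pd.DataFrame({"a": [1, 2, 3]}, index=[1, 2, 2])
b = pd.DataFrame({"b": [1, 4, 5]}, index=[1, 4, 5])

# Keep the old labels as a regular column named "key" (the name is arbitrary).
a2 = a.rename_axis("key").reset_index()
b2 = b.rename_axis("key").reset_index()

# An outer merge on the saved labels gives the "align where possible,
# NaN elsewhere" result that concat was expected to produce.
out = a2.merge(b2, on="key", how="outer")
print(out)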

Divide specific columns by another specific column [duplicate]

I need to divide all but the first column in a DataFrame by the first column.
Here's what I'm doing, but I wonder if this isn't the "right" pandas way:
df = pd.DataFrame(np.random.rand(10,3), columns=list('ABC'))
df[['B', 'C']] = (df.T.iloc[1:] / df.T.iloc[0]).T
Is there a way to do something like df[['B','C']] / df['A']? (That just gives a 10x12 dataframe of nan.)
Also, after reading some similar questions on SO, I tried df['A'].div(df[['B', 'C']]) but that gives a broadcast error.
I believe df[['B','C']].div(df.A, axis=0) and df.iloc[:,1:].div(df.A, axis=0) work.
Do:
df.iloc[:,1:] = df.iloc[:,1:].div(df.A, axis=0)
This divides every column other than the first by column 'A', so the result is the first column unchanged plus all the later columns divided by the divisor column.
You are actually relying on element-wise division with NumPy-style broadcasting, so the shapes need to be compatible (see here).
e.g.
df['A'].shape --> (10,)
df[['B','C']].shape --> (10, 2)
You can make them broadcast by transposing, so that a (2, 10) frame is divided by a (10,)-long Series:
df[['B','C']].T.shape, df['A'].shape --> ((2, 10), (10,))
The resulting frame then has shape:
(df[['B','C']].T / df['A']).shape --> (2, 10)
Therefore, transpose it back:
(df[['B','C']].T / df['A']).T
Its shape is (10, 2), and it gives you the result you wanted!
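A small runnable check of both forms, on synthetic data mirroring the question's columns:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(10, 3), columns=list('ABC'))

divided = df.iloc[:, 1:].div(df['A'], axis=0)      # divide B and C row-wise by A
print(divided.shape)                               # (10, 2)

# Equivalent to the transpose trick described above:
via_transpose = (df[['B', 'C']].T / df['A']).T
print(np.allclose(divided, via_transpose))         # True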

What's the difference between [] and [[]] in pandas? [duplicate]

This question already has answers here:
The difference between double brace `[[...]]` and single brace `[..]` indexing in Pandas
I'm confused about the results for indexing columns in pandas.
Both
db['varname']
and
db[['varname']]
give me the column value of 'varname'. However, it looks like there is some subtle difference, since the output from db['varname'] also shows me the type of the values.
The first looks up a specific key in your df, i.e. a specific column; the second takes a list of columns to sub-select from your df, so it returns all columns matching the values in the list.
The other subtle thing is that the first by default returns a Series object, whilst the second returns a DataFrame even if you pass a list containing a single item.
Example:
In [2]:
df = pd.DataFrame(columns=['VarName','Another','me too'])
df
Out[2]:
Empty DataFrame
Columns: [VarName, Another, me too]
Index: []
In [3]:
print(type(df['VarName']))
print(type(df[['VarName']]))
<class 'pandas.core.series.Series'>
<class 'pandas.core.frame.DataFrame'>
So when you pass a list, it tries to match all of its elements:
In [4]:
df[['VarName','Another']]
Out[4]:
Empty DataFrame
Columns: [VarName, Another]
Index: []
but without the additional [], this raises a KeyError:
df['VarName','Another']
KeyError: ('VarName', 'Another')
because you're then trying to look up a single column named ('VarName', 'Another'), which doesn't exist.
This is close to a dupe of another question, and I got this answer from it at https://stackoverflow.com/a/45201532/1331446, credit to @SethMMorton.
Answering here as this is the top hit on Google and it took me ages to "get" this.
Pandas has no [[ operator at all.
When you see df[['col_name']] you're really seeing:
col_names = ['col_name']
df[col_names]
In consequence, the only thing that [[ does for you is make the result a DataFrame rather than a Series.
[ on a DataFrame looks at the type of its argument: if it's a scalar, then you're only after one column, and it hands it back as a Series; if it's a list, then you must be after a set of columns, so it hands back a DataFrame (with only those columns).
That's it!
As @EdChum pointed out, [] will return pandas.core.series.Series whereas [[]] will return pandas.core.frame.DataFrame.
Both are different data structures in pandas.
For sklearn, it is better to use db[['varname']], which has a 2D shape.
For example:
from sklearn.preprocessing import KBinsDiscretizer
est = KBinsDiscretizer(n_bins=3, encode='onehot-dense', strategy='uniform')
est.fit(db[['varname']])  # using db['varname'] here causes an error
In [84]: single_brackets = np.array( [ 0, 13, 31, 1313 ] )
In [85]: single_brackets.shape, single_brackets.ndim
Out[85]: ((4,), 1)
# (4,) : 4 elements/values
# 1    : a one-dimensional array (in pandas, a 1-D array is generally called a "Series")
In [86]: double_brackets = np.array( [[ 0, 13, 31, 1313 ]] )
In [87]: double_brackets.shape, double_brackets.ndim
Out[87]: ((1, 4), 2)
# (1, 4) : 1 row and 4 columns
# 2      : a two-dimensional array (in pandas, a 2-D array is generally called a "DataFrame")
This is a NumPy concept... don't blame pandas:
[ ]   -> a one-dimensional array, which yields a Series
[[ ]] -> a two-dimensional array, which yields a DataFrame
Still don't believe it? Check this:
In [89]: three_brackets = np.array( [[[ 0, 13, 31, 1313 ]]] )
In [93]: three_brackets.shape, three_brackets.ndim
Out[93]: ((1, 1, 4), 3)
# (1, 1, 4) -> roughly (blocks, rows, columns)
# 3 -> a three-dimensional array
Try creating some NumPy arrays yourself, reshape them, and check ndim.
