Subtracting Two Columns of DataFrames not giving expected result - Python, Pandas

I have two data frames, each with 672 rows of data.
I want to subtract the values in a column of one data frame from the values in a column of the other data frame. The result can either be a new data frame, or a series, it does not really matter to me. The size of the result should obviously be 672 rows or 672 values.
I have the code:
stock_returns = beta_portfolios_196307_201906.iloc[:, 6] - \
                fama_french_factors_196307_201906.iloc[:, 4]
I also tried
stock_returns = beta_portfolios_196307_201906["Lo 10"] - \
                fama_french_factors_196307_201906["RF"]
For both, the result is a Series of size (1116,), and most of the values in it are NaN, with only a few numeric values.
Could someone please explain why this is happening and how I can get the result I want?
Here is the .head() of my data frames:
beta_portfolios_196307_201906.head()
Date Lo 20 Qnt 2 Qnt 3 Qnt 4 ... Dec 6 Dec 7 Dec 8 Dec 9 Hi 10
0 196307 1.13 -0.08 -0.97 -0.94 ... -1.20 -0.49 -1.39 -1.94 -0.77
1 196308 3.66 4.77 6.46 6.23 ... 7.55 7.57 4.91 9.04 10.47
2 196309 -2.78 -0.76 -0.78 -0.81 ... -0.27 -0.63 -1.00 -1.92 -3.68
3 196310 0.74 3.56 2.03 5.70 ... 1.78 6.63 4.78 3.10 3.01
4 196311 -0.63 -0.26 -0.81 -0.92 ... -0.69 -1.32 -0.51 -0.20 0.52
[5 rows x 16 columns]
fama_french_factors_196307_201906.head()
Date Mkt-RF SMB HML RF
444 196307 -0.39 -0.56 -0.83 0.27
445 196308 5.07 -0.94 1.67 0.25
446 196309 -1.57 -0.30 0.18 0.27
447 196310 2.53 -0.54 -0.10 0.29
448 196311 -0.85 -1.13 1.71 0.27
One last thing I should add: At first, all of the values in both data frames were strings, so I had to convert the values to numeric values using:
beta_portfolios_196307_201906 = beta_portfolios_196307_201906.apply(pd.to_numeric, errors='coerce')

Let's explain the issue with an example of just 5 rows.
When both DataFrames, a and b, have the same indices, e.g.:
a:
   Lo 10  Xxx
0     10    1
1     20    1
2     30    1
3     40    1
4     50    1
b:
   RF  Yyy
0   9    1
1   8    1
2   7    1
3   6    1
4   5    1
The result of the subtraction a['Lo 10'] - b['RF'] is:
0 1
1 12
2 23
3 34
4 45
dtype: int64
Rows of both DataFrames are aligned on the index and then corresponding
elements are subtracted.
And now take a look at the case when b has some other indices, e.g.:
RF Yyy
0 9 1
1 8 1
2 7 1
8 6 1
9 5 1
i.e. the last 2 rows have indices 8 and 9, which are absent in a.
Then the result of the same subtraction is:
0 1.0
1 12.0
2 23.0
3 NaN
4 NaN
8 NaN
9 NaN
dtype: float64
i.e.:
- rows with indices 0, 1 and 2 are computed as before, since both DataFrames have these indices,
- but if an index is present in only one DataFrame, the result is NaN,
- and the number of rows in the result is bigger.
If you want to align both columns by position instead of by the index, you
can run a.reset_index()['Lo 10'] - b.reset_index()['RF'], getting the
result as in the first case.

Related

Matrix are not aligned for dot product

Here is my df matrix:
0 Rooms Area Price
0 0 0.4 0.32 0.307692
1 0 0.4 0.40 0.461538
2 0 0.6 0.48 0.615385
3 0 0.6 0.56 0.646154
4 0 0.6 0.60 0.692308
5 0 0.8 0.72 0.769231
6 0 0.8 0.80 0.846154
7 0 1.0 1.00 1.000000
Here is my B matrix:
weights
0 88
1 87
2 44
3 46
When I write df.dot(B) it says the matrices are not aligned.
But here df is an 8×4 matrix and B is 4×1,
so shouldn't the dot product generate an 8×1 matrix?
Error: ValueError: matrices are not aligned
You can just use mul:
out = df.mul(B['weights'].values, axis=1)
Out[207]:
0 Rooms Area Price
0 0 34.8 14.08 14.153832
1 0 34.8 17.60 21.230748
2 0 52.2 21.12 28.307710
3 0 52.2 24.64 29.723084
4 0 52.2 26.40 31.846168
5 0 69.6 31.68 35.384626
6 0 69.6 35.20 38.923084
7 0 87.0 44.00 46.000000
"the column names of DataFrame and the index of other must contain the same values,"
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dot.html
Try renaming the index of B to match the column names of A.
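For instance, a sketch reconstructing the frames from the question (the first column, named '0', is all zeros):
import pandas as pd

df = pd.DataFrame({'0': [0] * 8,
                   'Rooms': [0.4, 0.4, 0.6, 0.6, 0.6, 0.8, 0.8, 1.0],
                   'Area': [0.32, 0.40, 0.48, 0.56, 0.60, 0.72, 0.80, 1.00],
                   'Price': [0.307692, 0.461538, 0.615385, 0.646154,
                             0.692308, 0.769231, 0.846154, 1.000000]})
B = pd.DataFrame({'weights': [88, 87, 44, 46]})

B.index = df.columns  # align B's index (previously 0..3) with df's column names
out = df.dot(B)       # 8x1 result, no "matrices are not aligned" error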

Using bins in pandas data frame

I am working on a data frame which has 4 columns in total. I want to bin each column of that data frame iteratively into 8 equal parts. The bin number should be assigned to the data in a separate column for each column.
The code should work even if a different data frame is provided with different column names.
Here is the code I tried:
for c in df3.columns:
    df3['bucket_' + c] = (df3.max() - df3.min()) // 2 + 1
    buckets = pd.cut(df3['bucket_' + c], 8, labels=False)
The respective bin columns display the bin number assigned to each data point according to the range in which it falls (using pd.cut to cut each column into 8 equal parts).
Thanks in advance!!
sample data
gp1_min gp2 gp3 gp4
17.39 23.19 28.99 44.93
0.74 1.12 3.35 39.78
12.63 13.16 13.68 15.26
72.76 73.92 75.42 94.35
77.09 84.14 74.89 89.87
73.24 75.72 77.28 92.3
78.63 84.35 64.89 89.31
65.59 65.95 66.49 92.43
76.79 83.93 75.89 89.73
57.78 57.78 2.22 71.11
99.9 99.1 100 100
100 100 40.963855 100
expected output
gp1_min gp2 gp3 gp4 bin_gp1 bin_gp2 bin_gp3 bin_gp4
17.39 23.19 28.99 44.93 2 2 2 3
0.74 1.12 3.35 39.78 1 1 1 3
12.63 13.16 13.68 15.26 1 2 2 2
72.76 73.92 75.42 94.35 5 6 6 7
77.09 84.14 74.89 89.87 6 7 6 7
73.24 75.72 77.28 92.3 6 6 6 7
78.63 84.35 64.89 89.31 6 7 5 7
65.59 65.95 66.49 92.43 5 6 5 7
76.79 83.93 75.89 89.73 6 7 6 7
57.78 57.78 2.22 71.11 4 4 1 6
99.9 99.1 100 100 8 8 8 8
100 100 40.96 100 8 8 3 8
I would use a couple of functions from numpy, namely np.linspace to make the bin boundaries and np.digitize to put the dataframe's values into bins:
import numpy as np

def binner(df, num_bins):
    for c in list(df.columns):  # snapshot, since columns are added in the loop
        # num_bins + 1 equally spaced edges spanning the column's range
        cbins = np.linspace(df[c].min(), df[c].max(), num_bins + 1)
        # np.digitize returns 1..num_bins + 1 (the max value lands on the last edge),
        # so clip back into 1..num_bins
        df[c + '_binned'] = np.digitize(df[c], cbins).clip(1, num_bins)
    return df
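Applied to the sample frame above (assuming it is loaded as df3), usage would look like the line below; note the new columns get a '_binned' suffix rather than the bin_ prefix shown in the expected output:
df3 = binner(df3, 8)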

How to create new df based on columns of two different data frames?

I am working on the following data frames. Though the original data frames are quite large, with thousands of lines, for illustration purposes I am using a much more basic df.
My first df is the following:
ID value
0 3 7387
1 8 4784
2 11 675
3 21 900
And there is another huge df, say df2
x y final_id
0 -7.35 2.09 3
1 -6.00 2.76 3
2 -5.89 1.90 4
3 -4.56 2.67 5
4 -3.46 1.34 8
5 -4.67 1.23 8
6 -1.99 3.44 8
7 -5.67 2.40 11
8 -7.56 1.66 11
9 -9.00 3.12 21
10 -8.01 3.11 21
11 -7.90 3.19 22
Now, from the first df, I want to consider only the "ID" column and match its values to the "final_id" column in the second data frame (df2).
I want to create another df which contains only the filtered rows of df2, i.e. only the rows whose "final_id" is 3, 8, 11, or 21 (as per the "ID" column of df1).
Below would be the resultant df:
x y final_id
0 -7.35 2.09 3
1 -6.00 2.76 3
2 -3.46 1.34 8
3 -4.67 1.23 8
4 -1.99 3.44 8
5 -5.67 2.40 11
6 -7.56 1.66 11
7 -9.00 3.12 21
8 -8.01 3.11 21
We can see rows 2, 3, 11 from df2 have been removed in the resultant df.
Please help.
You can use isin to create a mask and then use the boolean mask to subset your df2:
mask = df2["final_id"].isin(df["ID"])
print(df2[mask])
x y final_id
0 -7.35 2.09 3
1 -6.00 2.76 3
4 -3.46 1.34 8
5 -4.67 1.23 8
6 -1.99 3.44 8
7 -5.67 2.40 11
8 -7.56 1.66 11
9 -9.00 3.12 21
10 -8.01 3.11 21
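Note that df2[mask] keeps the original row labels (0, 1, 4, 5, ...). If you want the renumbered 0-8 index shown in your expected output, reset the index:
print(df2[mask].reset_index(drop=True))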

Fill in missing rows from columns after groupby in python pandas

I have a dataset that looks something like this but is much larger.
Column A Column B Result
1 1 2.4
1 4 2.9
1 1 2.8
2 5 9.3
3 4 1.2
df.groupby(['Column A', 'Column B'])['Result'].mean()
Column A Column B Result
1 1 2.6
4 2.9
2 5 9.3
3 4 1.2
I want to have a range from 1-10 for Column B, with the Result for these added rows being an average based on Column A and Column B. So this is my desired table:
Column A Column B Result
1 1 2.6
2 2.75
3 2.75
4 2.9
5 6.025
2 1 5.95
2 9.3
3 9.3
...
Hopefully the point is getting across. I know the average thing is pretty confusing so I would settle with just being able to fill in the missing values of my desired range. I appreciate the help!
You need to reindex by a new index created by MultiIndex.from_product, and then groupby the first level (Column A) with fillna by the mean per group:
import numpy as np
import pandas as pd

df = df.groupby(['Column A', 'Column B'])['Result'].mean()
# np.arange(1, 10) yields 1-9, as in the output below; use np.arange(1, 11) for the full 1-10 range
mux = pd.MultiIndex.from_product([df.index.get_level_values(0).unique(),
                                  np.arange(1, 10)],
                                 names=('Column A', 'Column B'))
df = df.reindex(mux)
df = df.groupby(level='Column A').apply(lambda x: x.fillna(x.mean()))
print(df)
Column A Column B
1 1 2.60
2 2.75
3 2.75
4 2.90
5 2.75
6 2.75
7 2.75
8 2.75
9 2.75
2 1 9.30
2 9.30
3 9.30
4 9.30
5 9.30
6 9.30
7 9.30
8 9.30
9 9.30
3 1 1.20
2 1.20
3 1.20
4 1.20
5 1.20
6 1.20
7 1.20
8 1.20
9 1.20
Name: Result, dtype: float64
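The result is a Series with a MultiIndex. If you would rather have a flat DataFrame again, one more call does it:
df = df.reset_index()  # columns: Column A, Column B, Result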

Pandas column of lists, create a row for each list element

I have a dataframe where some cells contain lists of multiple values. Rather than storing multiple
values in a cell, I'd like to expand the dataframe so that each item in the list gets its own row (with the same values in all other columns). So if I have:
import pandas as pd
import numpy as np
df = pd.DataFrame(
    {'trial_num': [1, 2, 3, 1, 2, 3],
     'subject': [1, 1, 1, 2, 2, 2],
     'samples': [list(np.random.randn(3).round(2)) for i in range(6)]}
)
df
Out[10]:
samples subject trial_num
0 [0.57, -0.83, 1.44] 1 1
1 [-0.01, 1.13, 0.36] 1 2
2 [1.18, -1.46, -0.94] 1 3
3 [-0.08, -4.22, -2.05] 2 1
4 [0.72, 0.79, 0.53] 2 2
5 [0.4, -0.32, -0.13] 2 3
How do I convert to long form, e.g.:
subject trial_num sample sample_num
0 1 1 0.57 0
1 1 1 -0.83 1
2 1 1 1.44 2
3 1 2 -0.01 0
4 1 2 1.13 1
5 1 2 0.36 2
6 1 3 1.18 0
# etc.
The index is not important; it's OK to set existing columns as the index, and the final ordering isn't important.
Pandas >= 0.25
Series and DataFrame have an .explode() method that explodes lists into separate rows. See the docs section on Exploding a list-like column.
df = pd.DataFrame({
    'var1': [['a', 'b', 'c'], ['d', 'e'], [], np.nan],
    'var2': [1, 2, 3, 4]
})
df
var1 var2
0 [a, b, c] 1
1 [d, e] 2
2 [] 3
3 NaN 4
df.explode('var1')
var1 var2
0 a 1
0 b 1
0 c 1
1 d 2
1 e 2
2 NaN 3 # empty list converted to NaN
3 NaN 4 # NaN entry preserved as-is
# to reset the index to be monotonically increasing...
df.explode('var1').reset_index(drop=True)
var1 var2
0 a 1
1 b 1
2 c 1
3 d 2
4 e 2
5 NaN 3
6 NaN 4
Note that this also handles mixed columns of lists and scalars, as well as empty lists and NaNs appropriately (this is a drawback of repeat-based solutions).
However, you should note that explode only works on a single column (for now).
P.S.: if you are looking to explode a column of strings, you need to split on a separator first, then use explode. See this (very much) related answer by me.
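A quick sketch of that string case (my own toy example, not from the linked answer): split into lists first, then explode:
df = pd.DataFrame({'var1': ['a,b,c', 'd,e'], 'var2': [1, 2]})
df.assign(var1=df['var1'].str.split(',')).explode('var1')  # one row per letter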
A bit longer than I expected:
>>> df
samples subject trial_num
0 [-0.07, -2.9, -2.44] 1 1
1 [-1.52, -0.35, 0.1] 1 2
2 [-0.17, 0.57, -0.65] 1 3
3 [-0.82, -1.06, 0.47] 2 1
4 [0.79, 1.35, -0.09] 2 2
5 [1.17, 1.14, -1.79] 2 3
>>>
>>> s = df.apply(lambda x: pd.Series(x['samples']),axis=1).stack().reset_index(level=1, drop=True)
>>> s.name = 'sample'
>>>
>>> df.drop('samples', axis=1).join(s)
subject trial_num sample
0 1 1 -0.07
0 1 1 -2.90
0 1 1 -2.44
1 1 2 -1.52
1 1 2 -0.35
1 1 2 0.10
2 1 3 -0.17
2 1 3 0.57
2 1 3 -0.65
3 2 1 -0.82
3 2 1 -1.06
3 2 1 0.47
4 2 2 0.79
4 2 2 1.35
4 2 2 -0.09
5 2 3 1.17
5 2 3 1.14
5 2 3 -1.79
If you want sequential index, you can apply reset_index(drop=True) to the result.
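For example, continuing the session above:
>>> df.drop('samples', axis=1).join(s).reset_index(drop=True)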
update:
>>> res = df.set_index(['subject', 'trial_num'])['samples'].apply(pd.Series).stack()
>>> res = res.reset_index()
>>> res.columns = ['subject','trial_num','sample_num','sample']
>>> res
subject trial_num sample_num sample
0 1 1 0 1.89
1 1 1 1 -2.92
2 1 1 2 0.34
3 1 2 0 0.85
4 1 2 1 0.24
5 1 2 2 0.72
6 1 3 0 -0.96
7 1 3 1 -2.72
8 1 3 2 -0.11
9 2 1 0 -1.33
10 2 1 1 3.13
11 2 1 2 -0.65
12 2 2 0 0.10
13 2 2 1 0.65
14 2 2 2 0.15
15 2 3 0 0.64
16 2 3 1 -0.10
17 2 3 2 -0.76
UPDATE: the solution below was helpful for older Pandas versions, because the DataFrame.explode() wasn’t available. Starting from Pandas 0.25.0 you can simply use DataFrame.explode().
lst_col = 'samples'

r = pd.DataFrame({
    col: np.repeat(df[col].values, df[lst_col].str.len())
    for col in df.columns.drop(lst_col)
}).assign(**{lst_col: np.concatenate(df[lst_col].values)})[df.columns]
Result:
In [103]: r
Out[103]:
samples subject trial_num
0 0.10 1 1
1 -0.20 1 1
2 0.05 1 1
3 0.25 1 2
4 1.32 1 2
5 -0.17 1 2
6 0.64 1 3
7 -0.22 1 3
8 -0.71 1 3
9 -0.03 2 1
10 -0.65 2 1
11 0.76 2 1
12 1.77 2 2
13 0.89 2 2
14 0.65 2 2
15 -0.98 2 3
16 0.65 2 3
17 -0.30 2 3
P.S. Here you may find a bit more generic solution.
UPDATE: some explanations. IMO the easiest way to understand this code is to try to execute it step-by-step:
In the following line we are repeating values in one column N times, where N is the length of the corresponding list:
In [10]: np.repeat(df['trial_num'].values, df[lst_col].str.len())
Out[10]: array([1, 1, 1, 2, 2, 2, 3, 3, 3, 1, 1, 1, 2, 2, 2, 3, 3, 3], dtype=int64)
this can be generalized for all columns containing scalar values:
In [11]: pd.DataFrame({
...: col:np.repeat(df[col].values, df[lst_col].str.len())
...: for col in df.columns.drop(lst_col)}
...: )
Out[11]:
trial_num subject
0 1 1
1 1 1
2 1 1
3 2 1
4 2 1
5 2 1
6 3 1
.. ... ...
11 1 2
12 2 2
13 2 2
14 2 2
15 3 2
16 3 2
17 3 2
[18 rows x 2 columns]
using np.concatenate() we can flatten all values in the list column (samples) and get a 1D vector:
In [12]: np.concatenate(df[lst_col].values)
Out[12]: array([-1.04, -0.58, -1.32, 0.82, -0.59, -0.34, 0.25, 2.09, 0.12, 0.83, -0.88, 0.68, 0.55, -0.56, 0.65, -0.04, 0.36, -0.31])
putting all this together:
In [13]: pd.DataFrame({
...: col:np.repeat(df[col].values, df[lst_col].str.len())
...: for col in df.columns.drop(lst_col)}
...: ).assign(**{lst_col:np.concatenate(df[lst_col].values)})
Out[13]:
trial_num subject samples
0 1 1 -1.04
1 1 1 -0.58
2 1 1 -1.32
3 2 1 0.82
4 2 1 -0.59
5 2 1 -0.34
6 3 1 0.25
.. ... ... ...
11 1 2 0.68
12 2 2 0.55
13 2 2 -0.56
14 2 2 0.65
15 3 2 -0.04
16 3 2 0.36
17 3 2 -0.31
[18 rows x 3 columns]
using pd.DataFrame()[df.columns] will guarantee that we are selecting columns in the original order...
you can also use pd.concat and pd.melt for this:
>>> objs = [df, pd.DataFrame(df['samples'].tolist())]
>>> pd.concat(objs, axis=1).drop('samples', axis=1)
subject trial_num 0 1 2
0 1 1 -0.49 -1.00 0.44
1 1 2 -0.28 1.48 2.01
2 1 3 -0.52 -1.84 0.02
3 2 1 1.23 -1.36 -1.06
4 2 2 0.54 0.18 0.51
5 2 3 -2.18 -0.13 -1.35
>>> pd.melt(_, var_name='sample_num', value_name='sample',
... value_vars=[0, 1, 2], id_vars=['subject', 'trial_num'])
subject trial_num sample_num sample
0 1 1 0 -0.49
1 1 2 0 -0.28
2 1 3 0 -0.52
3 2 1 0 1.23
4 2 2 0 0.54
5 2 3 0 -2.18
6 1 1 1 -1.00
7 1 2 1 1.48
8 1 3 1 -1.84
9 2 1 1 -1.36
10 2 2 1 0.18
11 2 3 1 -0.13
12 1 1 2 0.44
13 1 2 2 2.01
14 1 3 2 0.02
15 2 1 2 -1.06
16 2 2 2 0.51
17 2 3 2 -1.35
Last, if you need to, you can sort based on the first three columns.
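For example, continuing the session above (where _ is the melted result):
>>> _.sort_values(['subject', 'trial_num', 'sample_num'])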
Trying to work through Roman Pekar's solution step-by-step to understand it better, I came up with my own solution, which uses melt to avoid some of the confusing stacking and index resetting. I can't say that it's obviously a clearer solution though:
items_as_cols = df.apply(lambda x: pd.Series(x['samples']), axis=1)
# Keep original df index as a column so it's retained after melt
items_as_cols['orig_index'] = items_as_cols.index
melted_items = pd.melt(items_as_cols, id_vars='orig_index',
                       var_name='sample_num', value_name='sample')
melted_items.set_index('orig_index', inplace=True)
df.merge(melted_items, left_index=True, right_index=True)
Output (obviously we can drop the original samples column now):
samples subject trial_num sample_num sample
0 [1.84, 1.05, -0.66] 1 1 0 1.84
0 [1.84, 1.05, -0.66] 1 1 1 1.05
0 [1.84, 1.05, -0.66] 1 1 2 -0.66
1 [-0.24, -0.9, 0.65] 1 2 0 -0.24
1 [-0.24, -0.9, 0.65] 1 2 1 -0.90
1 [-0.24, -0.9, 0.65] 1 2 2 0.65
2 [1.15, -0.87, -1.1] 1 3 0 1.15
2 [1.15, -0.87, -1.1] 1 3 1 -0.87
2 [1.15, -0.87, -1.1] 1 3 2 -1.10
3 [-0.8, -0.62, -0.68] 2 1 0 -0.80
3 [-0.8, -0.62, -0.68] 2 1 1 -0.62
3 [-0.8, -0.62, -0.68] 2 1 2 -0.68
4 [0.91, -0.47, 1.43] 2 2 0 0.91
4 [0.91, -0.47, 1.43] 2 2 1 -0.47
4 [0.91, -0.47, 1.43] 2 2 2 1.43
5 [-1.14, -0.24, -0.91] 2 3 0 -1.14
5 [-1.14, -0.24, -0.91] 2 3 1 -0.24
5 [-1.14, -0.24, -0.91] 2 3 2 -0.91
For those looking for a version of Roman Pekar's answer that avoids manual column naming:
column_to_explode = 'samples'
res = (df
       .set_index([x for x in df.columns if x != column_to_explode])[column_to_explode]
       .apply(pd.Series)
       .stack()
       .reset_index())
res = res.rename(columns={
    res.columns[-2]: 'exploded_{}_index'.format(column_to_explode),
    res.columns[-1]: '{}_exploded'.format(column_to_explode)})
I found the easiest way was to:
Convert the samples column into a DataFrame
Join it with the original df
Melt it
Shown here:
df.samples.apply(lambda x: pd.Series(x)).join(df).\
melt(['subject','trial_num'],[0,1,2],var_name='sample')
subject trial_num sample value
0 1 1 0 -0.24
1 1 2 0 0.14
2 1 3 0 -0.67
3 2 1 0 -1.52
4 2 2 0 -0.00
5 2 3 0 -1.73
6 1 1 1 -0.70
7 1 2 1 -0.70
8 1 3 1 -0.29
9 2 1 1 -0.70
10 2 2 1 -0.72
11 2 3 1 1.30
12 1 1 2 -0.55
13 1 2 2 0.10
14 1 3 2 -0.44
15 2 1 2 0.13
16 2 2 2 -1.44
17 2 3 2 0.73
It's worth noting that this may have only worked because each trial has the same number of samples (3). Something more clever may be necessary for trials of different sample sizes.
import pandas as pd

df = pd.DataFrame([{'Product': 'Coke', 'Prices': [100, 123, 101, 105, 99, 94, 98]},
                   {'Product': 'Pepsi', 'Prices': [101, 104, 104, 101, 99, 99, 99]}])
print(df)
# Prices already holds lists, so explode directly;
# .str.split(',') would only be needed if Prices were comma-separated strings
df = df.explode('Prices')
print(df)
This works in pandas >= 0.25.
Very late answer, but I want to add this:
A fast solution using vanilla Python that also takes care of the sample_num column in OP's example. On my own large dataset with over 10 million rows, producing a result with 28 million rows, this only takes about 38 seconds. The accepted solution completely breaks down with that amount of data and leads to a memory error on my system, which has 128 GB of RAM.
df = df.reset_index(drop=True)
lstcol = df.lstcol.values
lstcollist = []
indexlist = []
countlist = []
for ii in range(len(lstcol)):
    lstcollist.extend(lstcol[ii])
    indexlist.extend([ii] * len(lstcol[ii]))
    countlist.extend([jj for jj in range(len(lstcol[ii]))])
df = pd.merge(df.drop("lstcol", axis=1),
              pd.DataFrame({"lstcol": lstcollist, "lstcol_num": countlist},
                           index=indexlist),
              left_index=True, right_index=True).reset_index(drop=True)
Also very late, but here is an answer from Karvy1 that worked well for me if you don't have pandas >= 0.25: https://stackoverflow.com/a/52511166/10740287
For the example above you may write:
data = [(row.subject, row.trial_num, sample) for row in df.itertuples() for sample in row.samples]
data = pd.DataFrame(data, columns=['subject', 'trial_num', 'samples'])
Speed test:
%timeit data = pd.DataFrame([(row.subject, row.trial_num, sample) for row in df.itertuples() for sample in row.samples], columns=['subject', 'trial_num', 'samples'])
1.33 ms ± 74.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit data = df.set_index(['subject', 'trial_num'])['samples'].apply(pd.Series).stack().reset_index()
4.9 ms ± 189 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit data = pd.DataFrame({col:np.repeat(df[col].values, df['samples'].str.len())for col in df.columns.drop('samples')}).assign(**{'samples':np.concatenate(df['samples'].values)})
1.38 ms ± 25 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
