Interpolate missing values using row and column values - python

In Python Pandas, how should I interactively interpolate a dataframe with some NaN rows and columns?
For example, the following dataframe -
90 92.5 95 100 110 120
Index
1 NaN NaN NaN NaN NaN NaN
2 0.469690 NaN NaN NaN NaN NaN
3 0.478220 NaN 0.492232 0.505685 NaN NaN
4 0.486377 NaN 0.503853 0.518890 0.550517 NaN
5 0.485862 NaN 0.502130 0.515076 0.537675 0.564383
My goal is to interpolate and fill all the NaNs efficiently, i.e. to interpolate whatever NaNs are possible. However, if I use
df.interpolate(inplace=True, axis=0, method='spline', order=1, limit=20, limit_direction='both')
it will return "TypeError: Cannot interpolate with all NaNs."

You can try this (thanks to @Boud for df.dropna(axis=1, how='all')):
In [138]: new = df.dropna(axis=1, how='all').interpolate(limit=20, limit_direction='both')
In [139]: new
Out[139]:
90 95 100 110 120
Index
1 0.469690 0.492232 0.505685 0.550517 0.564383
2 0.469690 0.492232 0.505685 0.550517 0.564383
3 0.478220 0.492232 0.505685 0.550517 0.564383
4 0.486377 0.503853 0.518890 0.550517 0.564383
5 0.485862 0.502130 0.515076 0.537675 0.564383
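As a minimal, self-contained sketch of the fix (the frame below mimics the example above; the 92.5 column is entirely NaN, which is what triggers the TypeError):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    '90':   [np.nan, 0.469690, 0.478220, 0.486377, 0.485862],
    '92.5': [np.nan] * 5,  # all-NaN column: this is what breaks interpolate()
    '95':   [np.nan, np.nan, 0.492232, 0.503853, 0.502130],
}, index=[1, 2, 3, 4, 5])

# Drop all-NaN columns first, then interpolate; limit_direction='both'
# also fills leading NaNs by propagating the first valid value backwards
new = df.dropna(axis=1, how='all').interpolate(limit=20, limit_direction='both')
```

An all-NaN column cannot be interpolated from its own values, so dropping it first is the only option short of filling it along the other axis.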

Related

How to map column values where two others match? "Reindexing only valid with uniquely valued Index objects"

I have one DataFrame, df, with the four columns shown below:
IDP1 IDP1Number IDP2 IDP2Number
1 100 1 NaN
3 110 2 150
5 120 3 NaN
7 140 4 160
9 150 5 190
NaN NaN 6 130
NaN NaN 7 NaN
NaN NaN 8 200
NaN NaN 9 90
NaN NaN 10 NaN
I want to map values from df.IDP1Number to IDP2Number, using IDP1 to match IDP2. I want to replace existing values in IDP2Number with IDP1Number whenever IDP1 and IDP2 both exist; otherwise, leave the values in IDP2Number alone.
The error message that appears reads, "Reindexing only valid with uniquely valued Index objects".
The Dataframe below is what I wish to have:
IDP1 IDP1Number IDP2 IDP2Number
1 100 1 100
3 110 2 150
5 120 3 110
7 140 4 160
9 150 5 120
NaN NaN 6 130
NaN NaN 7 140
NaN NaN 8 200
NaN NaN 9 150
NaN NaN 10 NaN
Here's a way to do it:
# filter the data and create a mapping dict
maps = df.query("IDP1.notna()")[['IDP1', 'IDP1Number']].set_index('IDP1')['IDP1Number'].to_dict()
# create new column using ifelse condition
df['IDP2Number'] = df.apply(lambda x: maps.get(x['IDP2'], None) if (pd.isna(x['IDP2Number']) or x['IDP2'] in maps) else x['IDP2Number'], axis=1)
print(df)
IDP1 IDP1Number IDP2 IDP2Number
0 1.0 100.0 1 100.0
1 3.0 110.0 2 150.0
2 5.0 120.0 3 110.0
3 7.0 140.0 4 160.0
4 9.0 150.0 5 120.0
5 NaN NaN 6 130.0
6 NaN NaN 7 140.0
7 NaN NaN 8 200.0
8 NaN NaN 9 150.0
9 NaN NaN 10 NaN
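A vectorized alternative (a sketch, reconstructing the frame from the question): build the IDP1 → IDP1Number mapping as a Series, map it through IDP2, and fall back to the existing IDP2Number wherever there is no match:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'IDP1':       [1, 3, 5, 7, 9] + [np.nan] * 5,
    'IDP1Number': [100, 110, 120, 140, 150] + [np.nan] * 5,
    'IDP2':       list(range(1, 11)),
    'IDP2Number': [np.nan, 150, np.nan, 160, 190, 130, np.nan, 200, 90, np.nan],
})

# map() yields NaN where IDP2 has no matching IDP1, and fillna() then
# falls back to the original IDP2Number for exactly those rows
mapping = df.dropna(subset=['IDP1']).set_index('IDP1')['IDP1Number']
df['IDP2Number'] = df['IDP2'].map(mapping).fillna(df['IDP2Number'])
```

This replaces IDP2Number whenever IDP2 appears among the IDP1 values and leaves it untouched otherwise, which matches the desired output above.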

How to convert rows into a list using pandas

Used code and file: https://github.com/CaioEuzebio/Python-DataScience-MachineLearning/tree/master/SalesLogistics
I am working on an analysis using pandas. Basically I need to group the orders by the quantity of products they contain and by the products themselves.
Example: I have order 1 and order 2, and both contain product A and product B. Using the product list and product quantity as a key, I will create a pivot that indexes this combination of products and returns the orders that contain the same products.
The general objective of the analysis is to obtain a dataframe as follows:
dfFinal
listProds Ordens NumProds
[prod1,prod2,prod3] 1 3
2
3
[prod1,prod3,prod5] 7 3
15
25
[prod5] 8 1
3
So far the code looks like this.
Setting the 'Ordem' column as the index so that the first pivot can be made:
df1.index=df1['Ordem']
df3 = df1.assign(col=df1.groupby(level=0).Produto.cumcount()).pivot(columns='col', values='Produto')
With this pivot I get the dataframe below.
df3 =
col 0 1 2 3 4 5 6 7 8 9 ... 54 55 56 57 58 59 60 61 62 63
Ordem
10911KD YIZ12FF-A YIZ12FF-A YIIE2FF-A YIR72FF-A YIR72FF-A YIR72FF-A NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
124636 HYY32ZY-A HYY32ZY-A NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1719KD5 YI742FF-A YI742FF-A YI742FF-A YI742FF-A NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
22215KD YI762FF-A YI762FF-A YI762FF-A YI762FF-A YI762FF-A YI762FF-A YI6E2FF-A YI6E2FF-A YI6E2FF-A NaN ... NaN NaN NaN NaN NaN
When I finish running the code, NaN values appear, and I need to remove them from the rows so that they don't influence the analysis I'm doing.
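One way to drop those NaNs (a sketch on a toy stand-in for df3, since the real file lives in the linked repository) is to collapse each pivoted row into a list of its non-NaN products, which also gives the product count per order:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the pivoted frame df3: one row per order, NaN padding
df3 = pd.DataFrame({
    0: ['YIZ12FF-A', 'HYY32ZY-A', 'YI742FF-A'],
    1: ['YIZ12FF-A', 'HYY32ZY-A', np.nan],
    2: ['YIIE2FF-A', np.nan,      np.nan],
}, index=['10911KD', '124636', '1719KD5'])

# Drop the NaNs row by row, keeping the remaining products as a list
listProds = df3.apply(lambda row: row.dropna().tolist(), axis=1)
numProds = listProds.str.len()
```

The resulting Series of lists can then serve as the grouping key for the final pivot described above.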

pandas: drop columns whose missing rate is over 90%

How can I combine this line with a pandas dataframe operation to drop the columns whose missing rate is over 90%?
This line shows every column together with its missing rate:
percentage = (LoanStats_securev1_2018Q1.isnull().sum()/LoanStats_securev1_2018Q1.isnull().count()*100).sort_values(ascending = False)
Someone familiar with pandas please kindly help.
You can use dropna with a threshold:
newdf = df.dropna(axis=1, thresh=int(len(df) * 0.1))
axis=1 operates on columns, and thresh is the minimum number of non-NA values a column needs in order to be kept, so a 10% threshold drops every column whose missing rate is over 90%. Note that thresh must be an integer.
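As a quick check of the thresh semantics (a sketch on a toy frame): dropna(thresh=k) keeps a column only if it has at least k non-NA values, so to drop columns that are more than 90% missing the threshold should be 10% of the row count:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'mostly_full':  [1.0] * 19 + [np.nan],   #  5% missing -> should be kept
    'mostly_empty': [1.0] + [np.nan] * 19,   # 95% missing -> should be dropped
})

# Keep columns with at least 10% non-NA values, i.e. drop those
# whose missing rate exceeds 90%; thresh must be an integer
newdf = df.dropna(axis=1, thresh=int(len(df) * 0.1))
```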
I think you need boolean indexing with the mean of the boolean mask:
df = df.loc[:, df.isnull().mean() < .9]
Sample:
np.random.seed(2018)
df = pd.DataFrame(np.random.randn(20,3), columns=list('ABC'))
df.iloc[3:8,0] = np.nan
df.iloc[:-1,1] = np.nan
df.iloc[1:,2] = np.nan
print (df)
A B C
0 -0.276768 NaN 2.148399
1 -1.279487 NaN NaN
2 -0.142790 NaN NaN
3 NaN NaN NaN
4 NaN NaN NaN
5 NaN NaN NaN
6 NaN NaN NaN
7 NaN NaN NaN
8 -0.172797 NaN NaN
9 -1.604543 NaN NaN
10 -0.276501 NaN NaN
11 0.704780 NaN NaN
12 0.138125 NaN NaN
13 1.072796 NaN NaN
14 -0.803375 NaN NaN
15 0.047084 NaN NaN
16 -0.013434 NaN NaN
17 -1.580231 NaN NaN
18 -0.851835 NaN NaN
19 -0.148534 0.133759 NaN
print(df.isnull().mean())
A 0.25
B 0.95
C 0.95
dtype: float64
df = df.loc[:, df.isnull().mean() < .9]
print (df)
A
0 -0.276768
1 -1.279487
2 -0.142790
3 NaN
4 NaN
5 NaN
6 NaN
7 NaN
8 -0.172797
9 -1.604543
10 -0.276501
11 0.704780
12 0.138125
13 1.072796
14 -0.803375
15 0.047084
16 -0.013434
17 -1.580231
18 -0.851835
19 -0.148534

Scatter plot different groups by colors, using groups that have NaN

I have a data frame in pandas:
d1_a d2_a d3_a group
BI59 NaN 0.023333 NaN 2
BI71 NaN 0.173333 NaN 2
BI52 NaN NaN NaN 1
BI44 0.450000 NaN NaN 1
BI36 NaN 0.286667 NaN 2
BI29 NaN 0.030000 NaN 2
BI50 NaN 0.633333 NaN 2
BI63 NaN 0.110000 NaN 2
BI64 NaN 0.320000 NaN 2
BI65 0.206667 NaN NaN 1
BI67 NaN 0.216667 NaN 2
BI68 NaN 0.473333 NaN 2
BI71 NaN 0.053333 NaN 2
BI72 NaN 0.006667 NaN 2
BI75 NaN 0.430000 NaN 2
BI76 NaN 0.260000 NaN 2
BI78 NaN 0.250000 NaN 2
BI81 NaN 0.006667 NaN 2
BI83 NaN 0.603333 NaN 2
BI84 NaN NaN 0.196667 3
BI86 NaN NaN 0.046667 3
BI89 NaN 0.110000 NaN 2
BI91 NaN NaN 0.213333 3
BI93 NaN 0.443333 NaN 2
BI97 0.586667 NaN NaN 1
BI98 0.380000 NaN NaN 1
BI99 0.016667 NaN NaN 1
BI11 NaN 0.206667 NaN 2
BI12 NaN 0.500000 NaN 2
BI17 0.626667 NaN NaN 1
The BI## values are the index column, and the group each row belongs to is denoted by the group column: d1_a is group 1, d2_a is group 2 and d3_a is group 3. The values in the index column would be the x axis. How do I create a scatter plot with each group represented by a different color? When I try plotting I get empty plots.
If I try something like subset_d1_a = df['d1_a'].dropna() and do something similar for each group, then I can remove the NaNs, but now the arrays are of different lengths and I cannot plot them all on the same graph.
Preferably I'd like to do this in seaborn but any method in python will do.
So far this is what I'm doing, though I'm not sure if I'm going down the right path:
subset = pd.concat([df.d1_a, df.d2_a, df.d3_a], axis=1)
subset = subset.sum(axis=1)
subset = pd.concat([subset, df.group], axis=1)
subset = subset.dropna()
g = subset.groupby('group')
It is not clear what a scatter chart would look like given your data, but you could do something like this:
colors = {1: 'red', 2: 'green', 3: 'blue'}
df.iloc[:, :3].sum(axis=1).plot(kind='bar', color=df.group.map(colors).tolist())
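A fuller sketch using plain matplotlib (a small subset of the question's data stands in for df; since each row has a value in exactly one of the three columns, summing across them collapses the data into a single value per row):

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so this runs headless
import matplotlib.pyplot as plt

df = pd.DataFrame({
    'd1_a':  [np.nan,   0.450000, np.nan,   0.586667],
    'd2_a':  [0.023333, np.nan,   np.nan,   np.nan],
    'd3_a':  [np.nan,   np.nan,   0.196667, np.nan],
    'group': [2, 1, 3, 1],
}, index=['BI59', 'BI44', 'BI84', 'BI97'])

# One value per row: sum() skips NaN, and each row has a single entry
values = df[['d1_a', 'd2_a', 'd3_a']].sum(axis=1)

colors = {1: 'red', 2: 'green', 3: 'blue'}
fig, ax = plt.subplots()
for grp, idx in df.groupby('group').groups.items():
    ax.scatter(idx, values.loc[idx], color=colors[grp], label=f'group {grp}')
ax.legend()
```

If you prefer seaborn, the same values/group pair can be passed to sns.scatterplot with hue=df['group'].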

Multiplying multiple columns in a DataFrame

I'm trying to multiply N columns in a DataFrame by N columns in the same DataFrame, and then divide the results by a single column. I'm having trouble with the first part, see example below.
import pandas as pd
from numpy import random
foo = pd.DataFrame({'A':random.rand(10),
'B':random.rand(10),
'C':random.rand(10),
'N':random.randint(1,100,10),
'X':random.rand(10),
'Y':random.rand(10),
'Z':random.rand(10), })
foo[['A','B','C']].multiply(foo[['X','Y','Z']], axis=0).divide(foo['N'], axis=0)
What I'm trying to get at is column-wise multiplication (i.e. A*X, B*Y, C*Z)
The result is not an N column matrix but a 2N one, where the columns I'm trying to multiply by are added to the DataFrame, and all the entries have NaN values, like so:
A B C X Y Z
0 NaN NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN NaN
3 NaN NaN NaN NaN NaN NaN
4 NaN NaN NaN NaN NaN NaN
5 NaN NaN NaN NaN NaN NaN
6 NaN NaN NaN NaN NaN NaN
7 NaN NaN NaN NaN NaN NaN
8 NaN NaN NaN NaN NaN NaN
9 NaN NaN NaN NaN NaN NaN
What's going on here, and how do I do column-wise multiplication?
This will work, using the raw values from columns X, Y, Z and N, and perhaps it will also help you see what the issue is:
>>> (foo[['A','B','C']]
.multiply(foo[['X','Y','Z']].values)
.divide(foo['N'].values, axis=0))
A B C
0 0.000452 0.004049 0.010364
1 0.004716 0.001566 0.012881
2 0.001488 0.000296 0.004415
3 0.000269 0.001168 0.000327
4 0.001386 0.008267 0.012048
5 0.000084 0.009588 0.003189
6 0.000099 0.001063 0.006493
7 0.009958 0.035766 0.012618
8 0.001252 0.000860 0.000420
9 0.006422 0.005013 0.004108
The result is indexed on columns A, B and C. When you multiply two DataFrames, pandas aligns them on both index and columns, and since A/B/C never match X/Y/Z the aligned result is all NaN; that is why the 2N-column frame of NaNs appears.
Appending .values to the expression above will give you the bare array you want, but it is then up to you to supply the index and columns.
>>> (foo[['A','B','C']]
.multiply(foo[['X','Y','Z']].values)
.divide(foo['N'].values, axis=0)).values
array([[ 4.51754797e-04, 4.04911292e-03, 1.03638836e-02],
[ 4.71588457e-03, 1.56556402e-03, 1.28805803e-02],
[ 1.48820116e-03, 2.95700572e-04, 4.41516179e-03],
[ 2.68791866e-04, 1.16836123e-03, 3.27217820e-04],
[ 1.38648301e-03, 8.26692582e-03, 1.20482313e-02],
[ 8.38762247e-05, 9.58768066e-03, 3.18903965e-03],
[ 9.94132918e-05, 1.06267623e-03, 6.49315435e-03],
[ 9.95764539e-03, 3.57657737e-02, 1.26179014e-02],
[ 1.25210929e-03, 8.59735215e-04, 4.20124326e-04],
[ 6.42175897e-03, 5.01250179e-03, 4.10783492e-03]])
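An alternative that keeps a labelled DataFrame (a sketch using the same foo construction): rename the X/Y/Z block to A/B/C so that pandas alignment pairs up exactly the columns you want multiplied:

```python
import pandas as pd
from numpy import random

foo = pd.DataFrame({'A': random.rand(10), 'B': random.rand(10), 'C': random.rand(10),
                    'N': random.randint(1, 100, 10),
                    'X': random.rand(10), 'Y': random.rand(10), 'Z': random.rand(10)})

# After the rename both operands share columns A/B/C, so the element-wise
# multiply aligns correctly instead of producing an all-NaN 2N-column frame
rhs = foo[['X', 'Y', 'Z']].rename(columns={'X': 'A', 'Y': 'B', 'Z': 'C'})
result = foo[['A', 'B', 'C']].multiply(rhs).divide(foo['N'], axis=0)
```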
