Python Pandas DataFrame: fill NaN values

I am trying to fill NaN values in a dataframe with values coming from a standard normal distribution.
This is currently my code:
sqlStatement = "select * from sn.clustering_normalized_dataset"
df = psql.frame_query(sqlStatement, cnx)
data = df.pivot("user", "phrase", "tfw")
dfrand = pd.DataFrame(data=np.random.randn(data.shape[0], data.shape[1]))
data[np.isnan(data)] = dfrand[np.isnan(data)]
After pivoting, the dataframe 'data' looks like this:
phrase aaron abbas abdul abe able abroad abu abuse \
user
14233664 NaN NaN NaN NaN NaN NaN NaN NaN
52602716 NaN NaN NaN NaN NaN NaN NaN NaN
123456789 NaN NaN NaN NaN NaN NaN NaN NaN
500158258 NaN NaN NaN NaN NaN NaN NaN NaN
517187571 0.4 NaN NaN 0.142857 1 0.4 0.181818 NaN
However, I need each NaN value to be replaced with a new random value. So I created a new df consisting of only random values (dfrand) and then tried to swap the missing numbers (NaN) with the values from dfrand at the indices of the NaN's. Unfortunately it doesn't work:
Although the expression
np.isnan(data)
returns a dataframe consisting of True and False values, the expression
dfrand[np.isnan(data)]
returns only NaN values, so the overall trick doesn't work.
Any ideas what the issue is?

Three thousand columns is not so many. How many rows do you have? You could always make a random dataframe of the same size and do a logical replacement (the size of your dataframe will dictate whether this is feasible or not).
If you know the size of your dataframe:
import pandas as pd
import numpy as np
# create random dummy dataframe (rows, cols are your known dimensions)
dfrand = pd.DataFrame(data=np.random.randn(rows, cols))
# import "real" dataframe
data = pd.read_csv(etc.) # or however you choose to read it in
# replace nans
data[np.isnan(data)] = dfrand[np.isnan(data)]
If you do not know the size of your dataframe, just shuffle things around:
import pandas as pd
import numpy as np
# import "real" dataframe
data = pd.read_csv(etc.) # or however you choose to read it in
# create random dummy dataframe
dfrand = pd.DataFrame(data=np.random.randn(data.shape[0], data.shape[1]))
# replace nans
data[np.isnan(data)] = dfrand[np.isnan(data)]
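As an aside, recent pandas versions can do this replacement directly: DataFrame.fillna accepts another DataFrame and aligns on index and columns. A minimal sketch with made-up data, assuming dfrand shares data's index and columns:
import numpy as np
import pandas as pd

data = pd.DataFrame({"a": [1.0, np.nan], "b": [np.nan, 4.0]})
dfrand = pd.DataFrame(np.random.randn(*data.shape),
                      index=data.index, columns=data.columns)

# fillna aligns on index and columns, so each NaN gets its own random draw
data = data.fillna(dfrand)
print(data)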
EDIT
Per "users" last comment:
"dfrand[np.isnan(data)] returns NaN only."
Right! And that is exactly what you wanted. In my solution I have: data[np.isnan(data)] = dfrand[np.isnan(data)]. Translated, this means: take the randomly-generated value from dfrand that corresponds to the NaN location within "data" and insert it into "data" wherever "data" is NaN. An example will help:
a = pd.DataFrame(data=np.random.randint(0, 100, (10, 3)))
a.loc[5, 0] = np.nan  # introduce a single NaN
In [32]: a
Out[32]:
0 1 2
0 2 26 28
1 14 79 82
2 89 32 59
3 65 47 31
4 29 59 15
5 NaN 58 90
6 15 66 60
7 10 19 96
8 90 26 92
9 0 19 23
# define randomly-generated dataframe, much like what you are doing, and replace NaN's
b = pd.DataFrame(data=np.random.randint(0,100,(10,3)))
In [39]: b
Out[39]:
0 1 2
0 92 21 55
1 65 53 89
2 54 98 97
3 48 87 79
4 98 38 62
5 46 16 30
6 95 39 70
7 90 59 9
8 14 85 37
9 48 29 46
a[np.isnan(a)] = b[np.isnan(a)]
In [38]: a
Out[38]:
0 1 2
0 2 26 28
1 14 79 82
2 89 32 59
3 65 47 31
4 29 59 15
5 46 58 90
6 15 66 60
7 10 19 96
8 90 26 92
9 0 19 23
As you can see, all NaN's in a have been replaced with the randomly-generated values from b, based on a's NaN-value indices.

You could try something like this, assuming you are dealing with one series:
ser = data['column_with_nulls_to_replace']
index = ser[ser.isnull()].index
df = pd.DataFrame(np.random.randn(len(index)), index=index, columns=['column_with_nulls_to_replace'])
ser.update(df)
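If you need this for every column rather than a single series, one variant along the same lines (a sketch, not from the original answer) is a per-column mask assignment:
import numpy as np

# fill the NaNs of each column with fresh draws from a standard normal
for col in data.columns:
    mask = data[col].isnull()
    data.loc[mask, col] = np.random.randn(mask.sum())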

Related

Dividing one dataframe by another in python using pandas with float values

I have two separate data frames named df1 and df2 as shown below:
Scaffold Position Ref_Allele_Count Alt_Allele_Count Coverage_Depth Alt_Allele_Frequency
0 1 11 7 51 58 0.879310
1 1 16 20 95 115 0.826087
2 2 9 9 33 42 0.785714
3 2 12 86 51 137 0.372263
4 2 67 41 98 139 0.705036
5 3 8 0 0 0 0.000000
6 4 99 32 26 58 0.448276
7 4 101 100 24 124 0.193548
8 4 115 69 26 95 0.273684
9 5 6 40 57 97 0.587629
10 5 19 53 87 140 0.621429
Scaffold Position Ref_Allele_Count Alt_Allele_Count Coverage_Depth Alt_Allele_Frequency
0 1 11 7 64 71 0.901408
1 1 16 10 90 100 0.900000
2 2 9 79 86 165 0.521212
3 2 12 12 73 85 0.858824
4 2 67 54 96 150 0.640000
5 3 8 0 0 0 0.000000
6 4 99 86 28 114 0.245614
7 4 101 32 25 57 0.438596
8 4 115 97 16 113 0.141593
9 5 6 86 43 129 0.333333
10 5 19 59 27 86 0.313953
I have already found the sum values for df1 and df2 in Alt_Allele_Count and Coverage_Depth, but I need to divide the resulting Alt_Allele_Count and Coverage_Depth of both df's by one another to find the total allele frequency (AF). I tried dividing the two variables and got the error message:
TypeError: float() argument must be a string or a number, not 'DataFrame'
when I tried to convert them to floats, and this table when I left it as a df:
Alt_Allele_Count Coverage_Depth
0 NaN NaN
1 NaN NaN
2 NaN NaN
3 NaN NaN
4 NaN NaN
5 NaN NaN
6 NaN NaN
7 NaN NaN
8 NaN NaN
9 NaN NaN
10 NaN NaN
My code so far:
import csv
import pandas as pd
import numpy as np
df1 = pd.read_csv('C:/Users/Tom/Python_CW/file_pairA_1.csv')
df2 = pd.read_csv('C:/Users/Tom/Python_CW/file_pairA_2.csv')
print(df1)
print(df2)
Ref_Allele_Count = (df1[['Ref_Allele_Count']] + df2[['Ref_Allele_Count']])
print(Ref_Allele_Count)
Alt_Allele_Count = (df1[['Alt_Allele_Count']] + df2[['Alt_Allele_Count']])
print(Alt_Allele_Count)
Coverage_Depth = (df1[['Coverage_Depth']] + df2[['Coverage_Depth']]).astype(float)
print(Coverage_Depth)
AF = Alt_Allele_Count / Coverage_Depth
print(AF)
The error stems from the difference between a pandas Series and a DataFrame. Series are 1-dimensional structures, like a single column, while DataFrames are 2-d objects, like tables. Series added together make a new Series of values, while DataFrames added together make something a lot less usable.
Taking slices of a dataframe can either result in a series or dataframe object depending on how you do it:
df['column_name'] -> Series
df[['column_name', 'column_2']] -> Dataframe
So in the line:
Ref_Allele_Count = (df1[['Ref_Allele_Count']] + df2[['Ref_Allele_Count']])
df1[['Ref_Allele_Count']] becomes a single-column DataFrame rather than a Series.
Ref_Allele_Count = (df1['Ref_Allele_Count'] + df2['Ref_Allele_Count'])
This should return the correct result here. The same goes for the rest of the columns you're adding together.
In short, the fix is to use one set of brackets ('[]') when referring to a column in a pandas df, rather than two.
import csv
import pandas as pd
import numpy as np
df1 = pd.read_csv('C:/Users/Tom/Python_CW/file_pairA_1.csv')
df2 = pd.read_csv('C:/Users/Tom/Python_CW/file_pairA_2.csv')
print(df1)
print(df2)
# note that I changed your double brackets ([["col_name"]]) to single (["col_name"])
# this results in pd.Series objects instead of pd.DataFrame objects
Ref_Allele_Count = (df1['Ref_Allele_Count'] + df2['Ref_Allele_Count'])
print(Ref_Allele_Count)
Alt_Allele_Count = (df1['Alt_Allele_Count'] + df2['Alt_Allele_Count'])
print(Alt_Allele_Count)
Coverage_Depth = (df1['Coverage_Depth'] + df2['Coverage_Depth']).astype(float)
print(Coverage_Depth)
AF = Alt_Allele_Count / Coverage_Depth
print(AF)
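To see why the double-bracket version yields NaN, here is a small illustration with made-up numbers: dividing two single-column DataFrames aligns on column labels, so mismatched labels produce all-NaN output, while Series division aligns only on the row index:
import pandas as pd

num = pd.DataFrame({"Alt_Allele_Count": [10, 20]})
den = pd.DataFrame({"Coverage_Depth": [50, 100]})

print(num / den)  # all NaN: the column labels don't match
print(num["Alt_Allele_Count"] / den["Coverage_Depth"])  # 0.2 and 0.2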

How to find spearman's correlation in python for only specific values?

I have a datamatrix of five columns:
0 1 2 3 4
nan 34 23 34 11
43 34 123 4 44
45 12 4 nan 66
89 78 43 435 23
nan 89 nan 12 687
6 232 34 4 nan
24 56 34 121 56
nan 9 nan 54 12
24 nan 54 12 nan
76 11 123 76 78
43 nan 65 23 89
68 233 34 nan 89
65 53 nan 7 78
34 65 12 8 12
56 98 43 nan 43
I also have an fvector:
fvector
23
67
23
nan
nan
87
323
nan
78
32
78
112
nan
56
nan
56
Till now I have only been able to find the correlation based on the full column:
from scipy.stats import spearmanr

for i in datamatrix:
    coef, p = spearmanr(datamatrix[i], fvector)
    print(coef, p, "for column ", i)
I want to achieve 2 things:
1) I want to find the Spearman's correlation between fvector and each column of datamatrix, but if one of the two paired values, or both, is NaN, then I want to drop that pair from the correlation.
For example, the 4th value in column 1 is 78 and the 4th value in fvector is NaN, so I want to exclude that particular pair (not the whole column) from the process of correlation. I don't have any idea how to work with specific pairs when finding a correlation.
2) If the total share of NaN values in fvector and a datamatrix column is > 30%, then exclude the whole column from the correlation.
Any resource or reference will be helpful
Thanks
1) If you set nan_policy="omit", the NaNs will be ignored in the calculation. See scipy.stats.spearmanr.
2) You can compute the percentage of NaNs in each column in this way: (df[i].isna().sum()*100)/df.shape[0]
All together:
nan_fvectr = int(vector.isna().sum())
for i in df:
    if ((df[i].isna().sum() + nan_fvectr) * 100) / (df.shape[0] * 2) >= 30:
        continue
    coef, p = stats.spearmanr(df[i], vector, nan_policy="omit")
    print(coef, p, "for column ", i)
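Put together as a self-contained sketch (the sample data below is made up; fvector is assumed to be a pandas Series):
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(16, 5)))
df.iloc[[0, 2, 4, 7, 10], 0] = np.nan        # column 0 gets many NaNs
fvector = pd.Series(rng.normal(size=16))
fvector.iloc[[3, 5, 8, 11, 13, 14]] = np.nan

nan_fvectr = int(fvector.isna().sum())
for i in df:
    # skip a column once the combined NaN share reaches 30%
    if ((df[i].isna().sum() + nan_fvectr) * 100) / (df.shape[0] * 2) >= 30:
        continue
    coef, p = stats.spearmanr(df[i], fvector, nan_policy="omit")
    print(coef, p, "for column ", i)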

Finding minimum value of a column between two entries in another column

I have two columns in a data frame containing more than 1000 rows. Column A can take values X,Y,None. Column B contains random numbers from 50 to 100.
Every time there is a non-'None' occurrence in Column A, it is considered occurrence4; the previous non-None occurrence in Column A is then occurrence3, the one before that occurrence2, and the one before that occurrence1. I want to find the minimum value of Column B between occurrence4 and occurrence3 and check whether it is greater than the minimum value of Column B between occurrence2 and occurrence1. The results can be stored in a new column in the data frame as "YES" or "NO".
SAMPLE INPUT
ROWNUM A B
1 None 68
2 None 83
3 X 51
4 None 66
5 None 90
6 Y 81
7 None 81
8 None 100
9 None 83
10 None 78
11 X 68
12 None 53
13 None 83
14 Y 68
15 None 94
16 None 50
17 None 71
18 None 71
19 None 52
20 None 67
21 None 82
22 X 76
23 None 66
24 None 92
For example, I need to find the minimum value of Column B between ROWNUM 14 and ROWNUM 11 and check if it is GREATER THAN the minimum value of Column B between ROWNUM 6 and ROWNUM 3. Next, I need to find the minimum value between ROWNUM 22 and ROWNUM 14 and check if it is GREATER THAN the minimum value between ROWNUM 11 and ROWNUM 6, and so on.
EDIT:
In the sample data, we start our calculation from row 14, since that is where we have the fourth non-None occurrence in column A. The minimum value between row 14 and row 11 is 53. The minimum value between row 6 and 3 is 51. Since 53 > 51, the minimum value of column B between occurrence 4 and occurrence 3 is GREATER THAN the minimum value of column B between occurrence 2 and occurrence 1. So, the output at row 14 would be "YES" or 1.
Next, at row 22, the minimum value between row 22 and row 14 is 50. The minimum value between row 11 and 6 is 68. Since 50 < 68, the minimum between occurrence 4 and occurrence 3 is NOT GREATER THAN the minimum between occurrence 2 and occurrence 1. So, the output at row 22 would be "NO" or 0.
I have the following code.
import numpy as np
import pandas as pd
df = pd.DataFrame([[0, 0]]*100, columns=list('AB'), index=range(1, 101))
df.loc[[3, 6, 11, 14, 22, 26, 38, 51, 64, 69, 78, 90, 98], 'A'] = 1
df['B'] = np.random.randint(50, 100, size=len(df))
df['result'] = df.index[df['A'] != 0].to_series().rolling(4).apply(
lambda x: df.loc[x[2]:x[3], 'B'].min() > df.loc[x[0]:x[1], 'B'].min(), raw=True)
print(df)
This code works when column A has inputs [0, 1]. But I need code where column A could contain [None, X, Y]. Also, this code produces output as [0, 1]; I need output as [YES, NO] instead.
I read your sample data as follows:
df = pd.read_fwf('input.txt', widths=[7, 6, 3], na_values=['None'])
Note na_values=['None'], which ensures that the string None in the input is read as NaN.
This way the DataFrame is:
ROWNUM A B
0 1 NaN 68
1 2 NaN 83
2 3 X 51
3 4 NaN 66
4 5 NaN 90
5 6 Y 81
6 7 NaN 81
7 8 NaN 100
8 9 NaN 83
9 10 NaN 78
10 11 X 68
11 12 NaN 53
12 13 NaN 83
13 14 Y 69
14 15 NaN 94
15 16 NaN 50
16 17 NaN 71
17 18 NaN 71
18 19 NaN 52
19 20 NaN 67
20 21 NaN 82
21 22 X 76
22 23 NaN 66
23 24 NaN 92
The code to do your task is:
res = df.index[df.A.notnull()].to_series().rolling(4).apply(
lambda x: df.loc[x[2]:x[3], 'B'].min() > df.loc[x[0]:x[1], 'B'].min(), raw=True)\
.dropna().map(lambda x: 'YES' if x > 0 else 'NO').rename('Result')
df = df.join(res)
df.Result.fillna('', inplace=True)
As you can see, it is in part a slight change of your code, with some additions.
The result is:
ROWNUM A B Result
0 1 NaN 68
1 2 NaN 83
2 3 X 51
3 4 NaN 66
4 5 NaN 90
5 6 Y 81
6 7 NaN 81
7 8 NaN 100
8 9 NaN 83
9 10 NaN 78
10 11 X 68
11 12 NaN 53
12 13 NaN 83
13 14 Y 69 YES
14 15 NaN 94
15 16 NaN 50
16 17 NaN 71
17 18 NaN 71
18 19 NaN 52
19 20 NaN 67
20 21 NaN 82
21 22 X 76 NO
22 23 NaN 66
23 24 NaN 92
The advantage of my solution over the other is that:
- the content is either YES or NO, just as you want,
- this content shows up only for non-null values in the A column, "ignoring" the first 3, which don't have enough "predecessors".
Here's my approach:
def is_incr(x):
    return x[:2].min() > x[2:].min()

# replace with s = df['A'] == 'None' if needed
s = df['A'].isna()
df['new_col'] = df.loc[s, 'B'].rolling(4).apply(is_incr)
Output:
ROWNUM A B new_col
0 1 NaN 68 NaN
1 2 NaN 83 NaN
2 3 X 51 NaN
3 4 NaN 66 NaN
4 5 NaN 90 1.0
5 6 Y 81 NaN
6 7 NaN 81 0.0
7 8 NaN 100 0.0
8 9 NaN 83 0.0
9 10 NaN 78 1.0
10 11 X 68 NaN
11 12 NaN 53 1.0
12 13 NaN 83 1.0
13 14 Y 68 NaN
14 15 NaN 94 0.0
15 16 NaN 50 1.0
16 17 NaN 71 1.0
17 18 NaN 71 0.0
18 19 NaN 52 0.0
19 20 NaN 67 1.0
20 21 NaN 82 0.0
21 22 X 76 NaN
22 23 NaN 66 0.0
23 24 NaN 92 1.0

Division in pandas not working as it should

I have two dataframes, each with one column. I'm pasting them exactly as they print below.
Top (it has no column names, as it is the result of Top = Df1.groupby('col1')['att1'].diff().dropna()):
1 15.566667
3 5.066667
5 57.266667
7 -10.366667
9 18.966667
11 50.966667
13 -5.633333
15 -14.266667
17 18.933333
19 3.100000
21 35.966667
23 -17.566667
25 -8.066667
27 -6.366667
29 7.133333
31 -2.633333
33 3.333333
35 -23.800000
37 2.333333
39 -53.533333
41 -17.300000
dtype: float64
Bottom: which is the result of Bottom = np.sqrt(Df2.groupby('ID')['Col2'].sum()/n)
ID
12868123 1.029001
757E13D7 1.432014
79731492 2.912770
799EFB29 1.826576
7D44062A 1.736757
7D4C0E2F 1.943503
7DBA169D 0.650023
7E558E2B 1.256287
7E8B3815 1.491974
7EB80123 0.558717
7FFB607D 1.505221
8065A321 1.809937
80EFE91B 2.064825
811F1B1E 0.992645
82B67C94 0.980618
833C27AE 0.969195
83957B28 0.469914
8447B85D 1.477168
84877498 0.872973
8569499D 2.215307
8617B7D9 1.033294
Name: Col2, dtype: float64
I want to divide those two columns' values by each other:
Top/Bottom
I get the following:
1 NaN
3 NaN
5 NaN
7 NaN
9 NaN
11 NaN
13 NaN
15 NaN
17 NaN
19 NaN
21 NaN
23 NaN
25 NaN
27 NaN
29 NaN
31 NaN
33 NaN
35 NaN
37 NaN
39 NaN
41 NaN
12868123 NaN
757E13D7 NaN
79731492 NaN
799EFB29 NaN
7D44062A NaN
7D4C0E2F NaN
7DBA169D NaN
7E558E2B NaN
7E8B3815 NaN
7EB80123 NaN
7FFB607D NaN
8065A321 NaN
80EFE91B NaN
811F1B1E NaN
82B67C94 NaN
833C27AE NaN
83957B28 NaN
8447B85D NaN
84877498 NaN
8569499D NaN
8617B7D9 NaN
dtype: float64
I tried resetting the index column; it didn't help. Not sure why it's not working.
The problem is the different index values: arithmetic operations align Series by index, so you need to cast one operand to a numpy array via .values:
print (Top/Bottom.values)
1 15.127942
3 3.538141
5 19.660552
7 -5.675464
9 10.920737
11 26.224126
13 -8.666359
15 -11.356216
17 12.690123
19 5.548426
21 23.894609
23 -9.705679
25 -3.906707
27 -6.413841
29 7.274324
31 -2.717031
33 7.093496
35 -16.111911
37 2.672858
39 -24.165198
41 -16.742573
Name: col, dtype: float64
Solution with div:
print (Top.div(Bottom.values))
1 15.127942
3 3.538141
5 19.660552
7 -5.675464
9 10.920737
11 26.224126
13 -8.666359
15 -11.356216
17 12.690123
19 5.548426
21 23.894609
23 -9.705679
25 -3.906707
27 -6.413841
29 7.274324
31 -2.717031
33 7.093496
35 -16.111911
37 2.672858
39 -24.165198
41 -16.742573
dtype: float64
But if you assign one index's values to the other, you can use:
Top.index = Bottom.index
print (Top/Bottom)
ID
12868123 15.127942
757E13D7 3.538141
79731492 19.660552
799EFB29 -5.675464
7D44062A 10.920737
7D4C0E2F 26.224126
7DBA169D -8.666359
7E558E2B -11.356216
7E8B3815 12.690123
7EB80123 5.548426
7FFB607D 23.894609
8065A321 -9.705679
80EFE91B -3.906707
811F1B1E -6.413841
82B67C94 7.274324
833C27AE -2.717031
83957B28 7.093496
8447B85D -16.111911
84877498 2.672858
8569499D -24.165198
8617B7D9 -16.742573
dtype: float64
And if you get an error like:
ValueError: operands could not be broadcast together with shapes (20,) (21,)
the problem is that the Series have different lengths.
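Another option (a sketch, not in the original answer) is to drop both indexes instead of assigning one to the other:
print (Top.reset_index(drop=True) / Bottom.reset_index(drop=True))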
I arrived here because I was looking for how to divide a column by a subset of itself, and I found a solution which is not reported here.
Suppose you have a df like
d = {'mycol1':[0,0,1,1,2,2],'mycol2':[1,2,3,6,4,8]}
df = pd.DataFrame(data=d)
i.e.
mycol1 mycol2
0 0 1
1 0 2
2 1 3
3 1 6
4 2 4
5 2 8
And now you want to divide mycol2 by a subset composed of its first two values:
df['mycol2'].div(df[df['mycol1']==0.0]['mycol2'])
will result in
0 1.0
1 1.0
2 NaN
3 NaN
4 NaN
5 NaN
because of the index alignment problem reported by jezrael above.
The solution is to simply use concat to concatenate the subset to match the length of the original df.
Nrows = df[df['mycol1'] == 0.0]['mycol2'].shape[0]
Nrows_tot = df['mycol2'].shape[0]
times_longer = int(Nrows_tot / Nrows)
df['mycol3'] = df['mycol2'].div(pd.concat([df[df['mycol1'] == 0.0]['mycol2']] * times_longer, ignore_index=True))
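An equivalent trick that sidesteps index alignment entirely (a sketch, assuming the frame length is an exact multiple of the subset length) is to repeat the subset's raw values with np.tile:
import numpy as np

subset = df.loc[df['mycol1'] == 0.0, 'mycol2'].values
df['mycol3'] = df['mycol2'].div(np.tile(subset, len(df) // len(subset)))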

Calculating the duration an event in a time series data frame (python 2.7)

I have a rather large pandas data frame which is a time series with a lot of different information for each time stamp (eye tracking data).
Part of the data looks a bit like:
In [58]: df
Out[58]:
time event
49 44295 NaN
50 44311 NaN
51 44328 NaN
52 44345 2
53 44361 2
54 44378 2
55 44395 2
56 44411 2
57 44428 3
58 44445 3
59 44461 3
60 44478 3
61 44495 NaN
62 44511 NaN
63 44528 NaN
64 44544 NaN
65 44561 NaN
66 44578 NaN
67 44594 NaN
68 44611 4
69 44628 4
70 44644 4
71 44661 NaN
72 44678 NaN
I would like to calculate the (time) duration of each event as the max(time)-min(time) for a given event e.g. for event 2: 44411-44345 = 66
This duration I would like in a new column so that the data ends up like this:
In [60]: df
Out[60]:
time event duration
49 44295 NaN NaN
50 44311 NaN NaN
51 44328 NaN NaN
52 44345 2 66
53 44361 2 66
54 44378 2 66
55 44395 2 66
56 44411 2 66
57 44428 3 50
58 44445 3 50
59 44461 3 50
60 44478 3 50
61 44495 NaN NaN
62 44511 NaN NaN
63 44528 NaN NaN
64 44544 NaN NaN
65 44561 NaN NaN
66 44578 NaN NaN
67 44594 NaN NaN
68 44611 4 33
69 44628 4 33
70 44644 4 33
71 44661 NaN NaN
72 44678 NaN NaN
How can I do that?
One way would be to use groupby and transform. max - min is also called peak-to-peak, or ptp for short, so "ptp" here basically stands for lambda x: x.max() - x.min().
>>> df = pd.read_csv("eye.csv",sep="\s+")
>>> df["duration"] = df.dropna().groupby("event")["time"].transform("ptp")
>>> df
time event duration
49 44295 NaN NaN
50 44311 NaN NaN
51 44328 NaN NaN
52 44345 2 66
53 44361 2 66
54 44378 2 66
55 44395 2 66
56 44411 2 66
57 44428 3 50
58 44445 3 50
59 44461 3 50
60 44478 3 50
61 44495 NaN NaN
62 44511 NaN NaN
63 44528 NaN NaN
64 44544 NaN NaN
65 44561 NaN NaN
66 44578 NaN NaN
67 44594 NaN NaN
68 44611 4 33
69 44628 4 33
70 44644 4 33
71 44661 NaN NaN
72 44678 NaN NaN
The dropna was to prevent each NaN value in the event column from being considered its own event. (There's also something weird going on in how ptp works when the key is NaN too, but that's a separate issue.)
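A caveat: in more recent pandas versions the "ptp" string may no longer be accepted by transform; if you hit that, an explicit lambda (a sketch with the same logic) does the job:
# peak-to-peak per event, spelled out; NaN events are excluded up front
df["duration"] = (df.dropna(subset=["event"])
                    .groupby("event")["time"]
                    .transform(lambda x: x.max() - x.min()))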
Iterate over the records using groupby from itertools. The grouping criterion shall be the event number. As you have the data properly ordered (all codes related to the same event are not interrupted by others), there is no need to sort by event code.
groupby will iteratively return tuples (key, group), where key is the event code and group is a list of all the records in it.
From the records, pick the minimal and maximal time and calculate the duration.
Then, do your work to get the durations as a new field in your records, as sketched below.
There might be more efficient methods using pandas, which I am not aware of. The described solution does not require pandas.
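A minimal sketch of that pandas-free approach (the record layout is made up for illustration):
from itertools import groupby

# (time, event) records; event is None outside of events
records = [(44295, None), (44345, 2), (44361, 2), (44411, 2),
           (44428, 3), (44478, 3), (44495, None)]

durations = {}
for event, group in groupby(records, key=lambda rec: rec[1]):
    if event is None:
        continue  # gaps between events carry no duration
    times = [t for t, _ in group]
    durations[event] = max(times) - min(times)

print(durations)  # {2: 66, 3: 50}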
I ended up doing the following workaround to the answer posted by @DSM:
df["dur"] = datalist[i][j].groupby("event")["time"].transform("ptp")
dur = []
for i in datalist.index:
if np.isnan(df["event"][i]):
dur.append(df["event"][i])
else:
dur.append(df["dur"][i])
df["Duration"] = dur
This at least works for me.
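For what it's worth, the same masking can be written without the loop (a sketch): keep the computed duration only where an event code is present.
df["Duration"] = df["dur"].where(df["event"].notna())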
