Handling Zeros or NaNs in a Pandas DataFrame operations - python

I have a DataFrame (df) like shown below where each column is sorted from largest to smallest for frequency analysis. That leaves some values either zeros or NaN values as each column has a different length.
08FB006 08FC001 08FC003 08FC005 08GD004
----------------------------------------------
0 253 872 256 11.80 2660
1 250 850 255 10.60 2510
2 246 850 241 10.30 2130
3 241 827 235 9.32 1970
4 241 821 229 9.17 1900
5 232 0 228 8.93 1840
6 231 0 225 8.05 1710
7 0 0 225 0 1610
8 0 0 224 0 1590
9 0 0 0 0 1590
10 0 0 0 0 1550
I need to perform the following calculation as if each column has different lengths or number of records (ignoring zero values). I have tried using NaN but for some reason operations on Nan values are not possible.
Here is what I am trying to do with my df columns :
shape_list1=[]
location_list1=[]
scale_list1=[]
for column in df.columns:
shape1, location1, scale1=stats.genpareto.fit(df[column])
shape_list1.append(shape1)
location_list1.append(location1)
scale_list1.append(scale1)

Assuming all values are positive (as seems from your example and description), try:
stats.genpareto.fit(df[df[column] > 0][column])
This filters every column to operate just on the positive values.
Or, if negative values are allowed,
stats.genpareto.fit(df[df[column] != 0][column])

The syntax is messy, but change
shape1, location1, scale1=stats.genpareto.fit(df[column])
to
shape1, location1, scale1=stats.genpareto.fit(df[column][df[column].nonzero()[0]])
Explanation: df[column].nonzero() returns a tuple of size (1,) whose only element, element [0], is a numpy array that holds the index labels where df is nonzero. To index df[column] by these nonzero labels, you can use df[column][df[column].nonzero()[0]].

Related

Interpolate and match missing values between two dataframes of different dimensions

I'm new to pandas and python in general.
Currently I'm trying to interpolate and make the coordinates of two different dataframes match. The data comes from two different GEOTIFF files from the same source, one being temperature and the other being radiation. The file was converted to pandas with georasters.
The radiation dataframe has more points and data, I want to upscale the temperature dataframe and have the same coordinates as the prior.
Radiation Dataframe:
row
col
value
x
y
0
197
2427
5.755
-83.9325
17.5075
1
197
2428
5.755
-83.93
17.5075
2
197
2429
5.755
-83.9275
17.5075
3
197
2430
5.755
-83.925
17.5075
4
197
2431
5.755
-83.9225
17.5075
1850011 rows × 5 columns
Temperature Dataframe:
row
col
value
x
y
0
59
725
26.8
-83.9583
17.5083
1
59
726
26.8
-83.95
17.5083
2
59
727
26.8
-83.9417
17.5083
3
59
728
26.8
-83.9333
17.5083
4
59
729
26.8
-83.925
17.5083
167791 rows × 5 columns
Source of data
"Gis data - LTAym_AvgDailyTotals (GeoTIFF)"
Temperature Map
Radiation (GHI) Map
In order to be able to change the information of a column, you have to use iloc. I said I gave the fourth column from the left and index 3, which is the same as column x, then I gave it your values, then I printed the result.
import pandas as pd
Radiation = {'row':["197","197","197","197","197"],
'col':["2427","2428","2429","2430","2431"],
'value':['5.755','5.755','5.755','5.755','5.755'],
'x':['-83.9325','-83.93','-83.9275','-83.925','-83.9225'],
'y':['17.5075','17.5075','17.5075','17.5075','17.5075']
}
Temperature = { 'row':["59","59","59","59","59"],
'col':["725","726","727","728","729"],
'value':["26.8","26.8","26.8","26.8","26.8"],
'x':["-83.9583","-83.95","-83.9417","-83.9333","-83.925"],
'y':["17.5083","17.5083","17.5083","17.5083","17.5083"]
}
df1 = pd.DataFrame(Radiation)
df2 = pd.DataFrame(Temperature)
df1.iloc[4:,3]='1850011'
df2.iloc[4:,3]='167791'
Comparison = df1.compare(df2, keep_shape=True, keep_equal=True)
print(df1)
print(df2)

Filtering a labeled image by particle area

I have a labeled image of detected particles and a dataframe with the corresponding area of each labeled particle. What I want to do is filter out every particle on the image with an area smaller than a specified value.
I got it working with the example below, but I know there must be a smarter and especially faster way.
For example skipping the loop by comparing the image with the array.
Thanks for your help!
Example:
labels = df["label"][df.area > 5000].to_numpy()
mask = np.zeros(labeled_image.shape)
for label in labels:
mask[labeled_image == label] = 1
Dataframe:
label centroid-0 centroid-1 area
0 1 15 3681 191
1 2 13 1345 390
2 3 43 3746 885
3 4 32 3616 817
4 5 20 4250 137
... ... ... ...
3827 3828 4149 1620 130
3828 3829 4151 852 62
3829 3830 4155 330 236
3830 3831 4157 530 377
3831 3832 4159 3975 81
You can use isin to check equality to several labels. The resulting boolean array can be directly used as the mask after casting to the required type (e.g. int):
labels = df.loc[df.area.gt(5000), 'label']
mask = np.isin(labeled_image, labels).astype(int)

Ordering a dataframe using value_counts

I have a dataframe in which under the column "component_id", I have component_ids repeating several times.
Here is what the df looks like:
In [82]: df.head()
Out[82]:
index molregno chembl_id assay_id tid tid component_id
0 0 942606 CHEMBL1518722 688422 103668 103668 4891
1 0 942606 CHEMBL1518722 688422 103668 103668 4891
2 0 942606 CHEMBL1518722 688721 78 78 286
3 0 942606 CHEMBL1518722 688721 78 78 286
4 0 942606 CHEMBL1518722 688779 103657 103657 5140
component_synonym
0 LMN1
1 LMNA
2 LGR3
3 TSHR
4 MAPT
As can be seen, the same component_id can be linked to various component_synonyms(essentially the same gene, but different names). I wanted to find out the frequency of each gene as I want to find out the top 20 most frequently hit genes and therefore, I performed a value_counts on the column "component_id". I get something like this.
In [84]: df.component_id.value_counts()
Out[84]:
5432 804
3947 402
5147 312
3 304
2693 294
75 282
Name: component_id, dtype: int64
Is there a way for me to order the entire dataframe according to the component_id that is present the most number of times?
And also, is it possible for my dataframe to contain only the first occurrence of each component_id?
Any advice would be greatly appreciated!
I think you can make use of count to sort the rows and then drop the count column i.e
df['count'] = df.groupby('component_id')['component_id'].transform('count')
df_sorted = df.sort_values(by='count',ascending=False).drop('count',1)

Join dataframe with matrix output using pandas

I am trying to translate the input dataframe (inp_df) to output dataframe (out_df) using the the data from the cell based intermediate dataframe (matrix_df) as shown below.
There are several cell number based files with distance values shown in matrix_df .
The program iterates by cell & fetches data from appropriate file so each time matrix_df will have the data for all rows of the current cell# that we are iterating for in inp_df.
inp_df
A B cell
100 200 1
115 270 1
145 255 2
115 266 1
matrix_df (cell_1.csv)
B 100 115 199 avg_distance
200 7.5 80.7 67.8 52
270 6.8 53 92 50
266 58 84 31 57
matrix_df (cell_2.csv)
B 145 121 166 avg_distance
255 74.9 77.53 8 53.47
out_df dataframe
A B cell distance avg_distance
100 200 1 7.5 52
115 270 1 53 50
145 255 2 74.9 53.47
115 266 1 84 57
My current thought process for each cell# based data is
use a apply function to go row by row
then use a join based on column B in the inp_df with with matrix_df, where the matrix df is somehow translated into a tuple of column name, distance & average distance.
But I am looking for a pandonic way of doing this since my approach will slow down when there are millions of rows in the input. I am specifically looking for core logic inside an iteration to fetch the matches, since in each cell the number of columns in matrix_df would vary
If its any help the matrix files is the distance based outputs from sklearn.metrics.pairwise.pairwise_distances .
NB: In inp_df the value of column B is unique and values of column A may or may not be unique
Also the matrix_dfs first column was empty & i had renamed it with the following code for easiness in understanding since it was a header-less matrix output file.
dist_df = pd.read_csv(mypath,index_col=False)
dist_df.rename(columns={'Unnamed: 0':'B'}, inplace=True)​
Step 1: Concatenate your inputs with pd.concat and merge with inp_df using df.merge
In [641]: out_df = pd.concat([matrix_df1, matrix_df2]).merge(inp_df)
Step 2: Create the distance column with df.apply by using A's values to index into the correct column
In [642]: out_df.assign(distance=out_df.apply(lambda x: x[str(int(x['A']))], axis=1))\
[['A', 'B', 'cell', 'distance', 'avg_distance']]
Out[642]:
A B cell distance avg_distance
0 100 200 1 7.5 52.00
1 115 270 1 53.0 50.00
2 115 266 1 84.0 57.00
3 145 255 2 74.9 53.47

Pandas: select by bigger than a value

My dataframe has a column called dir, it has several values, I want to know how many the values passes a certain point. For example:
df['dir'].value_counts().sort_index()
It returns a Series
0 855
20 881
40 2786
70 3777
90 3964
100 4
110 2115
130 3040
140 1
160 1697
180 1734
190 3
200 618
210 3
220 1451
250 895
270 2167
280 1
290 1643
300 1
310 1894
330 1
340 965
350 1
Name: dir, dtype: int64
Here, I want to know the number of the value passed 500. In this case, it's all except 100, 140, 190,210, 280,300,330,350.
How can I do that?
I can get away with df['dir'].value_counts()[df['dir'].value_counts() > 500]
(df['dir'].value_counts() > 500).sum()
This gets the value counts and returns them as a series of Truth Values. The parens treats this whole thing like a series. .sum() counts the True values as 1 and the False values as 0.

Categories

Resources