How to drop rows with a NaN cell in a dask dataframe? - python

I have a dask dataframe from which I want to delete all rows that have a NaN value in the "selling_price" column.
image_features_df.head(3)
feat1 feat2 feat3 ... feat25087 feat25088 fid selling_price
0 0.0 0.0 0.0 ... 0.0 0.0 2 269.00
1 0.2 0.0 0.8 ... 0.0 0.3 22 NaN
2 0.5 0.0 0.4 ... 0.0 0.1 70 NaN
The above table shows a view of my dataframe.
I want the output to be a dask dataframe without any NaN cells in the "selling_price" column.
Expected Output:
image_features_df.head(3)
feat1 feat2 feat3 ... feat25087 feat25088 fid selling_price
0 0.0 0.0 0.0 ... 0.0 0.0 2 269.00
4 0.3 0.1 0.0 ... 0.0 0.3 26 1720.00
6 0.8 0.0 0.0 ... 0.0 0.1 50 18145.25

Try the following; it removes a row whenever NaN is found in the selling_price column:
df.dropna(subset=['selling_price'])
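Note that dask operations are lazy: dropna only builds a task graph and returns a new dask dataframe. A minimal sketch of the full flow, assuming image_features_df was loaded with dask.dataframe (the dd.read_csv call here is a hypothetical stand-in for however the frame is actually built):
import dask.dataframe as dd
# hypothetical load; replace with your actual source
image_features_df = dd.read_csv('image_features*.csv')
# lazily drop rows whose selling_price is NaN
image_features_df = image_features_df.dropna(subset=['selling_price'])
# nothing is filtered until the result is materialized:
print(image_features_df.head(3))   # computes just enough partitions to show 3 rows
# or: pandas_df = image_features_df.compute()  # materializes the whole frame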

Related

How to create a co-occurrence matrix from two pandas columns?

I have a small dataset, for example :
import pandas as pd
df = pd.DataFrame({'a': [1,2,3,4,5,6,7,8,9,10], 'b': [11,22,11,22,33,11,22,44,11,22]})
df
I want to find the co-occurrences of column b values for column a.
What I tried:
df_co = pd.get_dummies(df.a).groupby(df.b).apply(max)
df_co
But this is not a co-occurrence matrix. So I also tried this:
df_co.T.dot(df_co)
Is this a correct method to calculate the co-occurrence matrix?
You can use df.pivot with a dummy column to represent count=1:
df.assign(v=1).pivot(index='a', columns='b').fillna(0)
v
b 11 22 33 44
a
1 1.0 0.0 0.0 0.0
2 0.0 1.0 0.0 0.0
3 1.0 0.0 0.0 0.0
4 0.0 1.0 0.0 0.0
5 0.0 0.0 1.0 0.0
6 1.0 0.0 0.0 0.0
7 0.0 1.0 0.0 0.0
8 0.0 0.0 0.0 1.0
9 1.0 0.0 0.0 0.0
10 0.0 1.0 0.0 0.0
Or, as @Quang Hoang suggested, try pd.crosstab.
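For completeness, a minimal pd.crosstab sketch using the df from the question:
import pandas as pd
df = pd.DataFrame({'a': [1,2,3,4,5,6,7,8,9,10],
                   'b': [11,22,11,22,33,11,22,44,11,22]})
# crosstab tabulates how many times each (a, b) pair occurs,
# giving the same 0/1 matrix here since each value of a appears once
co = pd.crosstab(df.a, df.b)
print(co)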

Add and fill missing columns with values of 0 in a pandas matrix [python]

I have a matrix of the form:
movie_id 1 2 3 ... 1494 1497 1500
user_id
1600 1.0 0.0 1.0 ... 0.0 0.0 1.0
1601 1.0 0.0 0.0 ... 1.0 0.0 0.0
1602 0.0 0.0 0.0 ... 0.0 1.0 1.0
1603 0.0 0.0 1.0 ... 0.0 0.0 0.0
1604 1.0 0.0 0.0 ... 1.0 0.0 0.0
. ...
.
.
As you can see, even though there are 1500 movies in my dataset, some movies haven't been recorded because of the preprocessing my data has gone through.
What I want is to add and fill all the columns (movie_ids) that haven't been recorded with values of 0 (I don't know exactly which movie_ids are missing). So, for example, I want a new matrix of the form:
movie_id 1 2 3 ... 1494 1495 1496 1497 1498 1499 1500
user_id
1600 1.0 0.0 1.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 1.0
1601 1.0 0.0 0.0 ... 1.0 0.0 0.0 0.0 0.0 0.0 0.0
1602 0.0 0.0 0.0 ... 0.0 0.0 0.0 1.0 0.0 0.0 1.0
1603 0.0 0.0 1.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1604 1.0 0.0 0.0 ... 1.0 0.0 0.0 0.0 0.0 0.0 0.0
. ...
.
.
Use DataFrame.reindex along axis=1 with fill_value=0 to conform the dataframe columns to a new index range:
df = df.reindex(range(df.columns.min(), df.columns.max() + 1), axis=1, fill_value=0)
Result:
movie_id 1 2 3 ... 1498 1499 1500
user_id
1600 1.0 0.0 1.0 ... 0 0 1.0
1601 1.0 0.0 0.0 ... 0 0 0.0
1602 0.0 0.0 0.0 ... 0 0 1.0
1603 0.0 0.0 1.0 ... 0 0 0.0
1604 1.0 0.0 0.0 ... 0 0 0.0
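One caveat with this answer: range(df.columns.min(), df.columns.max() + 1) only spans the ids that already appear, so if the very first or very last movie_id was dropped in preprocessing it stays missing. Since the question states the full range is 1 to 1500, a safer sketch is to reindex to that range directly:
# assuming movie ids run from 1 to 1500 inclusive
df = df.reindex(range(1, 1501), axis=1, fill_value=0)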
I assume the variable name of the matrix is matrix:
n_movies = 1500
movie_ids = matrix.columns
for movie_id in range(1, n_movies + 1):
    # if there's no column for this movie, create one filled with zeros
    if movie_id not in movie_ids:
        matrix[movie_id] = 0
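A small follow-up on this loop: new columns are appended on the right, so the movie ids end up out of order. Assuming a plain ascending order is wanted, sort the columns afterwards:
# restore ascending movie_id order after adding the missing columns
matrix = matrix.reindex(sorted(matrix.columns), axis=1)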

Drop the rows in front based on conditions

df:
Id timestamp data Date sig event1 Start End Timediff2 datadiff2 B
51253 51494 2020-01-27 06:22:08.330 19.5 2020-01-27 -1.0 0.0 NaN 1.0 NaN NaN NaN
51254 51495 2020-01-27 06:22:08.430 19.0 2020-01-27 1.0 1.0 0.0 0.0 0.1 NaN NaN
51255 51496 2020-01-27 07:19:06.297 19.5 2020-01-27 1.0 0.0 1.0 0.0 3417.967 0.0 0.000000
51256 51497 2020-01-27 07:19:06.397 20.0 2020-01-27 1.0 0.0 0.0 0.0 0.1 1.0 0.000293
51259 51500 2020-01-27 07:32:19.587 20.5 2020-01-27 1.0 0.0 0.0 1.0 793.290 1.0 0.001261
I have 2 questions:
I want to drop the rows immediately before the rows where Timediff2 == 0.1.
Add another condition: drop these rows unless, for that row, Start == 1.
I suggest the following: first I create a flag ("top") for the row just before Timediff2 == 0.1, then I filter:
import pandas as pd
import numpy as np
df = pd.DataFrame({"Start": [np.nan, 0.0, 1.0, 0.0, 0.0],
                   "Timediff2": [np.nan, 0.1, 3417, 0.1, 793]})
# flag rows that sit just before a row where Timediff2 == 0.1
df["top"] = (df["Timediff2"] == 0.1).shift(-1)
# keep rows that are not flagged, unless Start == 1 for that row
df = df.loc[(df["Start"] == 1) | (df["top"] == False), :]
df = df.drop(columns="top")
The result is :
Start Timediff2
1 0.0 0.1
2 1.0 3417.0
3 0.0 0.1
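One subtlety in this answer: shift(-1) leaves NaN in the last row's flag, and NaN == False evaluates to False, so the final row (Timediff2 793) is dropped as well, as the result above shows. If trailing rows should survive, give the flag an explicit fill value (shift accepts fill_value since pandas 0.24):
# keep the last row by filling the shifted-in gap with False
df["top"] = (df["Timediff2"] == 0.1).shift(-1, fill_value=False)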

Find values from another dataframe and assign them to the original dataframe

Given an input dataframe:
x_1 x_2
0 0.0 0.0
1 1.0 0.0
2 2.0 0.2
3 2.5 1.5
4 1.5 2.0
5 -2.0 -2.0
and an additional dataframe as follows:
index x_1_x x_2_x x_1_y x_2_y value dist dist_rank
0 0 0.0 0.0 0.1 0.1 5.0 0.141421 2.0
4 0 0.0 0.0 1.5 1.0 -2.0 1.802776 3.0
5 0 0.0 0.0 0.0 0.0 3.0 0.000000 1.0
9 1 1.0 0.0 0.1 0.1 5.0 0.905539 1.0
11 1 1.0 0.0 2.0 0.4 3.0 1.077033 3.0
14 1 1.0 0.0 0.0 0.0 3.0 1.000000 2.0
18 2 2.0 0.2 0.1 0.1 5.0 1.902630 3.0
20 2 2.0 0.2 2.0 0.4 3.0 0.200000 1.0
22 2 2.0 0.2 1.5 1.0 -2.0 0.943398 2.0
29 3 2.5 1.5 2.0 0.4 3.0 1.208305 3.0
30 3 2.5 1.5 2.5 2.5 4.0 1.000000 1.0
31 3 2.5 1.5 1.5 1.0 -2.0 1.118034 2.0
38 4 1.5 2.0 2.0 0.4 3.0 1.676305 3.0
39 4 1.5 2.0 2.5 2.5 4.0 1.118034 2.0
40 4 1.5 2.0 1.5 1.0 -2.0 1.000000 1.0
45 5 -2.0 -2.0 0.1 0.1 5.0 2.969848 2.0
46 5 -2.0 -2.0 1.0 -2.0 6.0 3.000000 3.0
50 5 -2.0 -2.0 0.0 0.0 3.0 2.828427 1.0
I want to create new columns in the input dataframe, based on the additional dataframe with respect to dist_rank. It should extract x_1_y, x_2_y and value for each row, with respect to index and dist_rank.
I tried the following lines:
df['value_dist_rank1']=result.loc[result['dist_rank']==1.0, 'value']
df['value_dist_rank1 ']=result[result['dist_rank']==1.0]['value']
but both gave the same output:
x_1 x_2 value_dist_rank1
0 0.0 0.0 NaN
1 1.0 0.0 NaN
2 2.0 0.2 NaN
3 2.5 1.5 NaN
4 1.5 2.0 NaN
5 -2.0 -2.0 3.0
Here is a way to do it:
(For the sake of clarity I consider the input df as df1 and the additional df as df2)
# First we group df2 by index to get all the column information for each index on one line
df2 = df2.groupby('index').agg(lambda x: list(x)).reset_index()
# Then we explode each list column into three columns, since there are always three rows per index
columns = ['dist_rank', 'value', 'x_1_y', 'x_2_y']
column_to_add = ['value', 'x_1_y', 'x_2_y']
for index, row in df2.iterrows():
    for i in range(3):
        # [:-2] strips the trailing ".0" of the float rank from the new column name
        column_names = ["{}_dist_rank{}".format(x, row.dist_rank[i])[:-2] for x in column_to_add]
        values = [row[x][i] for x in column_to_add]
        for column, value in zip(column_names, values):
            df2.loc[index, column] = value
# We drop the columns that are no longer useful:
df2.drop(columns=columns + ['dist', 'x_1_x', 'x_2_x'], inplace=True)
# Finally we merge the modified df2 with our initial dataframe:
result = df1.merge(df2, left_index=True, right_on='index', how='left')
Output:
x_1 x_2 index value_dist_rank2 x_1_y_dist_rank2 x_2_y_dist_rank2 \
0 0.0 0.0 0 5.0 0.1 0.1
1 1.0 0.0 1 3.0 0.0 0.0
2 2.0 0.2 2 -2.0 1.5 1.0
3 2.5 1.5 3 -2.0 1.5 1.0
4 1.5 2.0 4 4.0 2.5 2.5
5 -2.0 -2.0 5 5.0 0.1 0.1
value_dist_rank3 x_1_y_dist_rank3 x_2_y_dist_rank3 value_dist_rank1 \
0 -2.0 1.5 1.0 3.0
1 3.0 2.0 0.4 5.0
2 5.0 0.1 0.1 3.0
3 3.0 2.0 0.4 4.0
4 3.0 2.0 0.4 -2.0
5 6.0 1.0 -2.0 3.0
x_1_y_dist_rank1 x_2_y_dist_rank1
0 0.0 0.0
1 0.1 0.1
2 2.0 0.4
3 2.5 2.5
4 1.5 1.0
5 0.0 0.0
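For reference, a more compact alternative; this is a sketch assuming df2 has exactly one row per (index, dist_rank) pair, as in the example:
# pivot on dist_rank so each rank gets its own set of columns
wide = df2.pivot(index='index', columns='dist_rank',
                 values=['value', 'x_1_y', 'x_2_y'])
# flatten the MultiIndex columns into value_dist_rank1, x_1_y_dist_rank1, ...
wide.columns = ['{}_dist_rank{}'.format(col, int(rank)) for col, rank in wide.columns]
# attach to the input dataframe by index
result = df1.join(wide)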

Sort rows of a dataframe in descending order of NaN counts

I'm trying to sort the following Pandas DataFrame:
RHS age height shoe_size weight
0 weight NaN 0.0 0.0 1.0
1 shoe_size NaN 0.0 1.0 NaN
2 shoe_size 3.0 0.0 0.0 NaN
3 weight 3.0 0.0 0.0 1.0
4 age 3.0 0.0 0.0 1.0
in such a way that the rows with a greater number of NaN columns are positioned first.
More precisely, in the above df, the row with index 1 (2 NaNs) should come before the row with index 0 (1 NaN).
What I do now is:
df.sort_values(by=['age', 'height', 'shoe_size', 'weight'], na_position="first")
Using df.sort_values and iloc-based accessing:
df = df.iloc[df.isnull().sum(axis=1).sort_values(ascending=False).index]
print(df)
RHS age height shoe_size weight
1 shoe_size NaN 0.0 1.0 NaN
2 shoe_size 3.0 0.0 0.0 NaN
0 weight NaN 0.0 0.0 1.0
4 age 3.0 0.0 0.0 1.0
3 weight 3.0 0.0 0.0 1.0
df.isnull().sum(axis=1) counts the NaNs per row, and the rows are accessed based on this sorted count.
@ayhan offered a nice little improvement to the solution above, involving pd.Series.argsort:
df = df.iloc[df.isnull().sum(axis=1).mul(-1).argsort()]
print(df)
RHS age height shoe_size weight
1 shoe_size NaN 0.0 1.0 NaN
0 weight NaN 0.0 0.0 1.0
2 shoe_size 3.0 0.0 0.0 NaN
3 weight 3.0 0.0 0.0 1.0
4 age 3.0 0.0 0.0 1.0
Counting NaNs per row and sorting on that count gives the desired row order directly:
df.isnull().sum(axis=1).sort_values(ascending=False)
Here's a one-liner that will do it:
df.assign(Count_NA = lambda x: x.isnull().sum(axis=1)).sort_values('Count_NA', ascending=False).drop('Count_NA', axis=1)
# RHS age height shoe_size weight
# 1 shoe_size NaN 0.0 1.0 NaN
# 0 weight NaN 0.0 0.0 1.0
# 2 shoe_size 3.0 0.0 0.0 NaN
# 3 weight 3.0 0.0 0.0 1.0
# 4 age 3.0 0.0 0.0 1.0
This works by assigning a temporary column ("Count_NA") to count the NAs in each row, sorting on that column, and then dropping it, all in the same expression.
You can add a column of the number of null values, sort by that column, then drop the column. It's up to you if you want to use .reset_index(drop=True) to reset the row count.
df['null_count'] = df.isnull().sum(axis=1)
df.sort_values('null_count', ascending=False).drop('null_count', axis=1)
# returns
RHS age height shoe_size weight
1 shoe_size NaN 0.0 1.0 NaN
0 weight NaN 0.0 0.0 1.0
2 shoe_size 3.0 0.0 0.0 NaN
3 weight 3.0 0.0 0.0 1.0
4 age 3.0 0.0 0.0 1.0
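A final note on ordering: the answers above differ in how they break ties between rows with equal NaN counts (compare the outputs: 1, 2, 0, 4, 3 versus 1, 0, 2, 3, 4). If the original relative order of tied rows matters, ask for a stable sort explicitly, e.g.:
# preserve the original relative order of rows with equal NaN counts
df = df.iloc[df.isnull().sum(axis=1).mul(-1).argsort(kind="stable")]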
