Find row with nan value and delete it - python

I have a dataframe with three columns: id, horstid, and date. The date column contains one NaN value. The code below does what I want with pandas, but I would like to do the same thing with NumPy.
First I want to convert my dataframe to a NumPy array. Then I want to find all rows where the date is NaN and print them, and after that remove those rows. How can I do this in NumPy?
This is my dataframe
id horstid date
0 1 11 2008-09-24
1 2 22 NaN
2 3 33 2008-09-18
3 4 33 2008-10-24
This is my code. It works fine, but it uses pandas.
d = {'id': [1, 2, 3, 4], 'horstid': [11, 22, 33, 33], 'date': ['2008-09-24', np.nan, '2008-09-18', '2008-10-24']}
df = pd.DataFrame(data=d)
df['date'].isna()
[OUT]
0 False
1 True
2 False
3 False
df.drop(df.index[df['date'].isna() == True])
[OUT]
id horstid date
0 1 11 2008-09-24
2 3 33 2008-09-18
3 4 33 2008-10-24
What I want is the behaviour of the code above, but with NumPy instead of pandas. Here is my attempt:
npArray = df.to_numpy()
date = npArray[:, 2].astype(np.datetime64)
[OUT]
ValueError: Cannot create a NumPy datetime other than NaT with generic units

Here's a solution based on NumPy and pure Python:
df = pd.DataFrame.from_dict(dict(horstid=[11, 22, 33, 33], id=[1, 2, 3, 4],
                                 date=['2008-09-24', np.nan, '2008-09-18', '2008-10-24']))
a = df.values
# NaN is a float, so keeping only the non-float entries of the date column
# (column index 2) drops the rows with a missing date.
keep = list(map(lambda x: type(x) != type(1.), a[:, 2]))
print(a[keep, :])
[[11 1 '2008-09-24']
[33 3 '2008-09-18']
[33 4 '2008-10-24']]
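If all you need is a boolean mask over the NumPy array, pd.isna also works element-wise on object arrays, which avoids the type-based test. A minimal sketch, reusing the data from the question:
import numpy as np
import pandas as pd

d = {'id': [1, 2, 3, 4], 'horstid': [11, 22, 33, 33],
     'date': ['2008-09-24', np.nan, '2008-09-18', '2008-10-24']}
arr = pd.DataFrame(data=d).to_numpy()

nan_mask = pd.isna(arr[:, 2])   # True where the date is missing
print(arr[nan_mask, :])         # the rows that will be removed
print(arr[~nan_mask, :])        # the array with those rows dropped
Note that the conversion to datetime64 is not needed for the filtering itself.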

Related

Using Python, how do I remove duplicates in a PANDAS dataframe column while keeping/ignoring all 'nan' values?

I have a dataframe like this:
import pandas as pd
data1 = {
    "siteID": [1, 2, 3, 1, 2, 'nan', 'nan', 'nan'],
    "date": [42, 30, 43, 29, 26, 34, 10, 14],
}
df = pd.DataFrame(data1)
But I want to delete any duplicates in siteID, keeping only the most up-to-date value AND keeping all 'nan' values.
I get close with this code:
df_no_dup = df.sort_values('date').drop_duplicates('siteID', keep='last')
which keeps only the row with the highest date value for each siteID. The issue is that most of the rows with 'nan' for siteID are removed as well, when I want to keep them all. Is there any way to keep all the rows where siteID is equal to 'nan'?
Expected output:
siteID date
nan 10
nan 14
2 30
nan 34
1 42
3 43
I would use df.duplicated to build a custom condition, like this:
df.drop(df[df.sort_values('date').duplicated('siteID', keep='last') & (df.siteID!='nan')].index)
Result
siteID date
0 1 42
1 2 30
2 3 43
5 nan 34
6 nan 10
7 nan 14
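Another way to express the same intent, as a sketch: deduplicate only the non-'nan' rows and concatenate the untouched 'nan' rows back in. Sorting by date at the end reproduces the expected output:
import pandas as pd

df = pd.DataFrame({
    "siteID": [1, 2, 3, 1, 2, 'nan', 'nan', 'nan'],
    "date": [42, 30, 43, 29, 26, 34, 10, 14],
})

# Split off the 'nan' rows, deduplicate only the real site IDs,
# then put everything back together sorted by date.
nan_rows = df[df.siteID == 'nan']
dedup = (df[df.siteID != 'nan']
         .sort_values('date')
         .drop_duplicates('siteID', keep='last'))
print(pd.concat([nan_rows, dedup]).sort_values('date'))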

Concatenate data frames over finite index otherwise start a new column - pandas

I need to add new data to the last column of a dataframe if that column has any empty cells, or create a new column otherwise. I wonder if there is a pythonic way to achieve this through pandas functionality (e.g. concat, join, merge, etc.). The example is as follows:
import numpy as np
import pandas as pd
df1 = pd.DataFrame({'0':[8, 9, 3, 5, 0], '1':[9, 6, 6, np.nan, np.nan]})
df2 = pd.DataFrame({'2':[2, 9, 4]}, index = [3,4,0])
desired_output = pd.DataFrame({'0': [8, 9, 3, 5, 0],
                               '1': [9, 6, 6, 2, 9],
                               '2': [4, np.nan, np.nan, np.nan, np.nan]})
# df1
0 1
0 8 9
1 9 6
2 3 6
3 5 NaN
4 0 NaN
# df 2
2
3 2
4 9
0 4
# desired_output
0 1 2
0 8 9 4
1 9 6 NaN
2 3 6 NaN
3 5 2 NaN
4 0 9 NaN
Your problem can be broken down into two steps:
1. Concatenate df1 and df2 based on their indexes.
2. For each row of the concatenated dataframe, move the NaNs to the end.
Try this:
# Step 1: concatenate the two dataframes
result = pd.concat([df1, df2], axis=1)
# Step 2a: for each row, sort the elements based on their nan status
# For example: sort [1, 2, nan, 3] based on [False, False, True, False]
# np.argsort will return [0, 1, 3, 2]
# Stable sort is critical here since we don't want to swap elements whose
# sort keys are equal.
arr = result.to_numpy()
idx = np.argsort(np.isnan(arr), kind="stable")
# Step 2b: reconstruct the result dataframe based on the sort order
result = pd.DataFrame(np.take_along_axis(arr, idx, axis=1), columns=result.columns)
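For completeness, a pandas-only sketch of the same idea; it is not necessarily faster, and the column labels come back as a fresh 0..2 RangeIndex rather than the original string labels:
# Drop each row's NaNs and let pandas pad the shorter rows back out
# to the full width when the per-row Series are recombined.
result_alt = (pd.concat([df1, df2], axis=1)
              .apply(lambda row: pd.Series(row.dropna().to_numpy()), axis=1))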

Sparse data becomes NaN when dropping duplicates

I have a dataframe which consists of some columns that are of a sparse datatype, for example
df = pd.DataFrame({"A": pd.arrays.SparseArray([0, 0, 0]), "B": [55, 100, 55], "C": [4, 4, 4]})
A B C
0 0 55 4
1 0 100 4
2 0 55 4
However, when I try to drop duplicates, the sparse series becomes NaNs.
df.drop_duplicates(inplace=True)
A B C
0 NaN 55 4
1 NaN 100 4
My expected output is
A B C
0 0 55 4
1 0 100 4
How can I prevent this from happening and keep the original values?
SparseArray doesn't explicitly store the fill value, which defaults to 0 for integer data.
The docs say that by providing a different fill_value, say np.nan or np.inf, every element other than that fill value is stored. Of course, if the fill value never occurs in the data, this somewhat defeats the purpose of a sparse array, but it does keep the zeros from being dropped.
Eg:
import numpy as np
import pandas as pd
df = pd.DataFrame({"A": pd.arrays.SparseArray([0, 0, 0], fill_value=np.nan),
                   "B": [55, 100, 55], "C": [4, 4, 4]})
print(df.drop_duplicates())
Output
A B C
0 0 55 4
1 0 100 4
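If changing fill_value is not an option, one possible workaround (a sketch, not the only way) is to densify the sparse column just for the deduplication step via the Series.sparse.to_dense() accessor:
import pandas as pd

df = pd.DataFrame({"A": pd.arrays.SparseArray([0, 0, 0]),
                   "B": [55, 100, 55], "C": [4, 4, 4]})

# Densify the sparse column only for this operation; the zeros are then
# ordinary integers and survive drop_duplicates.
print(df.assign(A=df["A"].sparse.to_dense()).drop_duplicates())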

Convert list to int and sum all elements in pandas dataframe

I'm very new to pandas and I'm trying to sum the elements of the lists stored in a single column of a dataframe, but I can't find a way to do so.
The dataframe looks something like this:
index codes
0 [19, 19]
1 [3, 4]
2 [20, 5, 3]
3 NaN
4 [1]
5 NaN
6 [14, 2]
What I'm trying to get is:
index codes total
0 [19, 19] 38
1 [3, 4] 7
2 [20, 5, 3] 28
3 NaN 0
4 [1] 1
5 NaN 0
6 [14, 2] 16
However, the values in codes were obtained using str.findall('-(\d+)') on a different column, so they are lists of strings, not lists of ints.
Any help would be much appreciated, thanks.
I would use str.extractall() instead of str.findall():
# replace orig_column with the correct column name
df['total'] = (df['orig_column'].str.extractall(r'-(\d+)')
               .astype(int).sum(level=0)
               .reindex(df.index, fill_value=0)
               )
If you really want to use your current codes column:
df['total'] = df['codes'].explode().astype(float).sum(level=0)
Output:
index codes total
0 0 [19, 19] 38
1 1 [3, 4] 7
2 2 [20, 5, 3] 28
3 3 NaN 0
4 4 [1] 1
5 5 NaN 0
6 6 [14, 2] 16
Try df['total'] = df['codes'].apply(lambda x:int(np.nansum(x))) if you want int type output.
Try df['total'] = df['codes'].apply(lambda x:np.nansum(x)) otherwise.
df['total'] = (
    df.codes.apply(lambda x: sum([int(e) for e in x]) if type(x) == list else 0)
)
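As a side note, sum(level=0) was deprecated and later removed in pandas 2.0; the equivalent using groupby(level=0) looks roughly like this (a sketch assuming codes holds lists of digit strings, as described in the question):
import numpy as np
import pandas as pd

df = pd.DataFrame({"codes": [['19', '19'], ['3', '4'], ['20', '5', '3'],
                             np.nan, ['1'], np.nan, ['14', '2']]})

df['total'] = (df['codes'].explode().astype(float)   # one number per exploded row
               .groupby(level=0).sum()               # re-aggregate per original row
               .astype(int))                         # all-NaN groups sum to 0
print(df)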

Pad rows with no data as Na in Pandas Dataframe

I have a NumPy array of timestamps:
ts = np.arange(5)
In [34]: ts
Out[34]: array([0, 1, 2, 3, 4])
and I have a pandas DataFrame:
data = pd.DataFrame([10, 10, 10], index = [0,3,4])
In [33]: data
Out[33]:
0
0 10
3 10
4 10
The index of data is guaranteed to be a subset of ts. I want to generate the following data frame:
res:
0 10
1 nan
2 nan
3 10
4 10
So I want the index to be ts and the values to come from data, with NaN for rows whose timestamp doesn't exist in data. How can I do this?
You are looking for the reindex function.
For example:
data.reindex(index=ts)
Output:
0
0 10
1 NaN
2 NaN
3 10
4 10
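Putting it together as a runnable sketch:
import numpy as np
import pandas as pd

ts = np.arange(5)
data = pd.DataFrame([10, 10, 10], index=[0, 3, 4])

# Reindex onto the full timestamp range; missing timestamps become NaN.
res = data.reindex(index=ts)
print(res)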
