Sparse data becomes NaN when dropping duplicates - python

I have a dataframe which consists of some columns that are of a sparse datatype, for example
df = pd.DataFrame({"A": pd.arrays.SparseArray([0, 0, 0]), "B": [55, 100, 55], "C": [4, 4, 4]})
A B C
0 0 55 4
1 0 100 4
2 0 55 4
However, when I try to drop duplicates, the sparse series becomes NaNs.
df.drop_duplicates(inplace=True)
A B C
0 NaN 55 4
1 NaN 100 4
My expected output is
A B C
0 0 55 4
1 0 100 4
How can I prevent this from happening and keep the original values?

SparseArray doesn't explicitly store 0 for an integer dtype, because 0 is the default fill_value.
Per the docs, if you provide a different fill_value, say np.inf, every element other than np.inf is stored explicitly. Of course, if np.inf never occurs in the dataset, that basically defeats the purpose of a sparse array, but it does preserve the values through drop_duplicates.
Eg:
import numpy as np
import pandas as pd
df = pd.DataFrame({"A": pd.arrays.SparseArray([0, 0, 0], fill_value=np.nan),
                   "B": [55, 100, 55], "C": [4, 4, 4]})
print(df.drop_duplicates())
Output
A B C
0 0 55 4
1 0 100 4
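An alternative sketch that keeps the integer fill_value of 0: densify the sparse columns, deduplicate, then convert back to sparse. This assumes the pandas `.sparse` accessor and `pd.SparseDtype` (pandas >= 1.0); the round trip costs memory temporarily but avoids the np.nan workaround.

```python
import pandas as pd

df = pd.DataFrame({"A": pd.arrays.SparseArray([0, 0, 0]),
                   "B": [55, 100, 55], "C": [4, 4, 4]})

# Identify the sparse columns, densify them, deduplicate, then re-sparsify.
sparse_cols = [c for c in df.columns if isinstance(df[c].dtype, pd.SparseDtype)]
dense = df.copy()
dense[sparse_cols] = dense[sparse_cols].sparse.to_dense()

out = dense.drop_duplicates()
out[sparse_cols] = out[sparse_cols].astype(pd.SparseDtype("int", 0))
print(out)
```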

pandas index and columns in same line

I have a multiindex with two columns in a dataframe, as follows:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.array([[1, 2, 3, 0], [1, 5, 6, 0], [2, 5, 6, 0]]),columns=['a','b','c', 'd'])
df
df.set_index(['a', 'b'],inplace=True)
df
That renders the index names on a line below the column headers:
c d
a b
1 2 3 0
5 6 0
2 5 6 0
How can the index and columns be put on the same line without losing the index columns?
Desired output, index and columns in the same line
a b c d
1 2 3 0
5 6 0
2 5 6 0
The call itself is correct:
df.set_index(['a', 'b'], inplace=True)
What you're seeing is only how pandas displays a MultiIndex when you print it; the change is purely visual. If you save the dataframe, the data is written correctly.
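A quick way to see that the split is display-only: writing the frame to CSV puts the index names and the column names on the same header line.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.array([[1, 2, 3, 0], [1, 5, 6, 0], [2, 5, 6, 0]]),
                  columns=['a', 'b', 'c', 'd'])
df.set_index(['a', 'b'], inplace=True)

# The CSV header is a,b,c,d on one line -- the two-line layout
# only exists in the printed representation.
print(df.to_csv())
```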

Find row with nan value and delete it

I have a dataframe. This dataframe contains three columns: id, horstid, date. The date column has one NaN value. The code below works with pandas; I want the same thing with numpy.
First I want to transform my dataframe to a numpy array. Then I want to find all rows where the date is NaN, print them, and remove them. But how can I do this in numpy?
This is my dataframe
id horstid date
0 1 11 2008-09-24
1 2 22 NaN
2 3 33 2008-09-18
3 4 33 2008-10-24
This is my code. It works fine, but uses pandas.
d = {'id': [1, 2, 3, 4], 'horstid': [11, 22, 33, 33], 'date': ['2008-09-24', np.nan, '2008-09-18', '2008-10-24']}
df = pd.DataFrame(data=d)
df['date'].isna()
[OUT]
0 False
1 True
2 False
3 False
df.drop(df.index[df['date'].isna() == True])
[OUT]
id horstid date
0 1 11 2008-09-24
2 3 33 2008-09-18
3 4 33 2008-10-24
What I want is the above code without pandas but with numpy.
npArray = df.to_numpy()
date = npArray[:, 2].astype(np.datetime64)
[OUT]
ValueError: Cannot create a NumPy datetime other than NaT with generic units
Here's a solution based on NumPy and pure Python:
df = pd.DataFrame(dict(horstid=[11, 22, 33, 33], id=[1, 2, 3, 4], date=['2008-09-24', np.nan, '2008-09-18', '2008-10-24']))
a = df.values
# NaN is a float, so keep only the rows whose date entry is not a float
index = [not isinstance(x, float) for x in a[:, 2]]
print(a[index,:])
[[11 1 '2008-09-24']
[33 3 '2008-09-18']
[33 4 '2008-10-24']]
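If you do want real datetime64 values, one sketch is to let pandas parse the dates first (which turns NaN into NaT) and then mask with np.isnat. The "generic units" ValueError in the question came from casting the strings without a parse step.

```python
import numpy as np
import pandas as pd

d = {'id': [1, 2, 3, 4], 'horstid': [11, 22, 33, 33],
     'date': ['2008-09-24', np.nan, '2008-09-18', '2008-10-24']}
df = pd.DataFrame(d)

# Parse to datetime64[ns]; NaN becomes NaT, which np.isnat can detect.
dates = pd.to_datetime(df['date']).to_numpy()
mask = np.isnat(dates)

arr = df.to_numpy()
print(arr[mask])    # the rows to delete
clean = arr[~mask]  # the rows to keep
```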

Python Pandas Dataframe, remove all rows where 'None' is the value in any column

I have a large dataframe. When it was created, 'None' was used as the value where a number could not be calculated (instead of 'nan').
How can I delete all rows that have 'None' in any of their columns? I thought I could use df.dropna and tell it which value to treat as na, but I can't seem to get that to work.
Thanks
I think this is a good representation of the dataframe:
temp = pd.DataFrame(data=[['str1','str2',2,3,5,6,76,8],['str3','str4',2,3,'None',6,76,8]])
Setup
Borrowed @MaxU's df:
df = pd.DataFrame([
[1, 2, 3],
[4, None, 6],
[None, 7, 8],
[9, 10, 11]
], dtype=object)
Solution
You can just use pd.DataFrame.dropna as is
df.dropna()
0 1 2
0 1 2 3
3 9 10 11
Supposing you have None strings like in this df
df = pd.DataFrame([
[1, 2, 3],
[4, 'None', 6],
['None', 7, 8],
[9, 10, 11]
], dtype=object)
Then combine dropna with mask
df.mask(df.eq('None')).dropna()
0 1 2
0 1 2 3
3 9 10 11
To be safe, you can cast the entire dataframe to object before comparing:
df.mask(df.astype(object).eq('None')).dropna()
0 1 2
0 1 2 3
3 9 10 11
Thanks for all your help. In the end I was able to get
df = df.replace(to_replace='None', value=np.nan).dropna()
to work. I'm not sure why your suggestions didn't work for me.
UPDATE:
In [70]: temp[temp.astype(str).ne('None').all(1)]
Out[70]:
0 1 2 3 4 5 6 7
0 str1 str2 2 3 5 6 76 8
Old answer:
In [35]: x
Out[35]:
a b c
0 1 2 3
1 4 None 6
2 None 7 8
3 9 10 11
In [36]: x = x[~x.astype(str).eq('None').any(1)]
In [37]: x
Out[37]:
a b c
0 1 2 3
3 9 10 11
or a bit nicer variant from @roganjosh:
In [47]: x = x[x.astype(str).ne('None').all(1)]
In [48]: x
Out[48]:
a b c
0 1 2 3
3 9 10 11
I'm a bit late to the party, but this is probably the simplest method:
df.dropna(axis=0, how='any')
Parameters:
axis=0 drops rows (the most common case) and axis=1 drops columns instead.
how='any' drops a row/column if any of its values are missing, while how='all' drops it only if all of them are.
If None still isn't removed, we can do
df = df.replace(to_replace='None', value=np.nan).dropna()
In my case this worked only partially: the 'None' strings were converted to NaN but the rows were not removed (thanks to the answer above, which helped me move forward).
So I added one more step for the particular column:
df['column'] = df['column'].apply(lambda x: str(x))
This turns NaN into the string 'nan', which can then be filtered out:
df = df[df['column'] != 'nan']
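Putting the variants together, a minimal sketch that handles both real None values and the string 'None' in one pass:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2, 3],
                   [4, 'None', 6],
                   [None, 7, 8],
                   [9, 10, 11]], dtype=object)

# Map the string 'None' to NaN, then drop rows with any missing value
# (dropna already treats real None as missing).
clean = df.replace('None', np.nan).dropna(axis=0, how='any')
print(clean)
```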

Creating boolean matrix from one column with pandas

I have been searching for an answer, but I don't know what to search for, so I'll ask here instead. I'm a beginner Python and pandas enthusiast.
I have a dataset where I would like to produce a matrix from a column. The matrix should have the value 1 where the value in the column equals its transposed counterpart, and 0 where it does not.
input:
id x1
A 1
B 3
C 1
D 5
output:
A B C D
A 1 0 1 0
B 0 1 0 0
C 1 0 1 0
D 0 0 0 1
I would like to do this for six different columns and add the resulting matrices into one matrix where the values range from 0-6 instead of just 0-1.
Partly because there's as yet no convenient cartesian join (whistles and looks away), I tend to drop down to the numpy level and use broadcasting when I need to do things like this. IOW, because we can do things like this:
>>> df.x1.values - df.x1.values[:,None]
array([[ 0, 2, 0, 4],
[-2, 0, -2, 2],
[ 0, 2, 0, 4],
[-4, -2, -4, 0]])
We can do
>>> pdf = pd.DataFrame(index=df.id.values, columns=df.id.values,
...                    data=(df.x1.values == df.x1.values[:,None]).astype(int))
>>> pdf
A B C D
A 1 0 1 0
B 0 1 0 0
C 1 0 1 0
D 0 0 0 1
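For the multi-column part of the question, the same broadcast can be computed per column and summed. A sketch with two hypothetical columns x1 and x2 (extend cols to all six):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'id': list('ABCD'),
                   'x1': [1, 3, 1, 5],
                   'x2': [2, 2, 9, 9]})

cols = ['x1', 'x2']  # list all six columns here
# Each term is a 0/1 equality matrix; summing gives counts in 0..len(cols).
total = sum((df[c].values == df[c].values[:, None]).astype(int) for c in cols)
result = pd.DataFrame(total, index=df.id.values, columns=df.id.values)
print(result)
```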

Using pandas fillna() on multiple columns

I'm a new pandas user (as of yesterday), and have found it at times both convenient and frustrating.
My current frustration is in trying to use df.fillna() on multiple columns of a dataframe. For example, I've got two sets of data (a newer set and an older set) which partially overlap. For the cases where we have new data, I just use that, but I also want to use the older data if there isn't anything newer. It seems I should be able to use fillna() to fill the newer columns with the older ones, but I'm having trouble getting that to work.
Attempt at a specific example:
df.ix[:,['newcolumn1','newcolumn2']].fillna(df.ix[:,['oldcolumn1','oldcolumn2']], inplace=True)
But this doesn't work as expected: numbers show up in the new columns that had been NaNs, but not the ones that were in the old columns (in fact, looking through the data, I have no idea where the numbers it picked came from, as they don't exist anywhere in either the new or old data).
Is there a way to fill in NaNs of specific columns in a DataFrame with values from other specific columns of the DataFrame?
fillna is generally for carrying an observation forward or backward. Instead, I'd use np.where, if I understand what you're asking:
import numpy as np
np.where(np.isnan(df['newcolumn1']), df['oldcolumn1'], df['newcolumn1'])
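Extending that to several column pairs, a sketch with hypothetical data using the new/old column names from the question:

```python
import numpy as np
import pandas as pd

# Hypothetical data; the question's frame has paired new/old columns.
df = pd.DataFrame({'newcolumn1': [1.0, np.nan], 'oldcolumn1': [9.0, 8.0],
                   'newcolumn2': [np.nan, 2.0], 'oldcolumn2': [7.0, 6.0]})

# Prefer the new value; fall back to the old one where the new is NaN.
for new, old in [('newcolumn1', 'oldcolumn1'), ('newcolumn2', 'oldcolumn2')]:
    df[new] = np.where(np.isnan(df[new]), df[old], df[new])
print(df[['newcolumn1', 'newcolumn2']])
```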
To answer your question: yes. Look at using the value argument of fillna. Along with the to_dict() method on the other dataframe.
But to really solve your problem, have a look at the update() method of the DataFrame. Assuming your two dataframes are similarly indexed, I think it's exactly what you want.
In [36]: df = pd.DataFrame({'A': [0, np.nan, 2, 3, np.nan, 5], 'B': [1, 0, 1, np.nan, np.nan, 1]})
In [37]: df
Out[37]:
A B
0 0 1
1 NaN 0
2 2 1
3 3 NaN
4 NaN NaN
5 5 1
In [38]: df2 = pd.DataFrame({'A': [0, np.nan, 2, 3, 4, 5], 'B': [1, 0, 1, 1, 0, 0]})
In [40]: df2
Out[40]:
A B
0 0 1
1 NaN 0
2 2 1
3 3 1
4 4 0
5 5 0
In [52]: df.update(df2, overwrite=False)
In [53]: df
Out[53]:
A B
0 0 1
1 NaN 0
2 2 1
3 3 1
4 4 0
5 5 1
Notice that all the NaNs in df were replaced except for (1, A) since that was also NaN in df2. Also some of the values like (5, B) differed between df and df2. By using overwrite=False it keeps the value from df.
EDIT: Based on comments, it seems like you're looking for a solution where the column names don't match across the two DataFrames (it'd be helpful if you posted sample data). Let's try that, replacing column A with C and B with D.
In [33]: df = pd.DataFrame({'A': [0, np.nan, 2, 3, np.nan, 5], 'B': [1, 0, 1, np.nan, np.nan, 1]})
In [34]: df2 = pd.DataFrame({'C': [0, np.nan, 2, 3, 4, 5], 'D': [1, 0, 1, 1, 0, 0]})
In [35]: df
Out[35]:
A B
0 0 1
1 NaN 0
2 2 1
3 3 NaN
4 NaN NaN
5 5 1
In [36]: df2
Out[36]:
C D
0 0 1
1 NaN 0
2 2 1
3 3 1
4 4 0
5 5 0
In [37]: d = {'A': df2.C, 'B': df2.D}  # pass these values to fillna
In [38]: df
Out[38]:
A B
0 0 1
1 NaN 0
2 2 1
3 3 NaN
4 NaN NaN
5 5 1
In [40]: df.fillna(value=d)
Out[40]:
A B
0 0 1
1 NaN 0
2 2 1
3 3 1
4 4 0
5 5 1
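A variant sketch of the same idea: rename df2's columns to match df and pass the renamed frame straight to fillna (fillna accepts a DataFrame as its value), using the example frames above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [0, np.nan, 2, 3, np.nan, 5],
                   'B': [1, 0, 1, np.nan, np.nan, 1]})
df2 = pd.DataFrame({'C': [0, np.nan, 2, 3, 4, 5],
                    'D': [1, 0, 1, 1, 0, 0]})

# Align the column names, then fill NaNs in df from df2.
filled = df.fillna(df2.rename(columns={'C': 'A', 'D': 'B'}))
print(filled)
```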
I think if you invest the time to learn pandas you'll hit fewer moments of frustration. It's a massive library though, so it takes time.
