Python pandas read_table converts zero to NaN

Say I have the following file test.txt:
Aaa Bbb
Foo 0
Bar 1
Baz NULL
(The separator is actually a tab character, which I can't seem to input here.)
And I try to read it using pandas (0.10.0):
In [523]: pd.read_table("test.txt")
Out[523]:
Aaa Bbb
0 Foo NaN
1 Bar 1
2 Baz NaN
Note that the zero value in the second column has suddenly turned into NaN! I was expecting a DataFrame like this:
Aaa Bbb
0 Foo 0
1 Bar 1
2 Baz NaN
What do I need to change to obtain the latter? I suppose I could use pd.read_table("test.txt", na_filter=False) and subsequently replace 'NULL' values with NaN and change the column dtype. Is there a more straightforward solution?
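For reference, a minimal sketch of that fallback (untested; it assumes the Aaa/Bbb column names from the sample file):
import numpy as np
import pandas as pd

# Read everything verbatim (no NaN detection), then restore NaN and the numeric dtype by hand
df = pd.read_table("test.txt", na_filter=False)
df["Bbb"] = df["Bbb"].replace("NULL", np.nan).astype(float)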

I think this is issue #2599, "read_csv treats zeroes as nan if column contains any nan", which is now closed. I can't reproduce in my development version:
In [27]: with open("test.txt") as fp:
   ....:     for line in fp:
   ....:         print repr(line)
   ....:
'Aaa\tBbb\n'
'Foo\t0\n'
'Bar\t1\n'
'Baz\tNULL\n'
In [28]: pd.read_table("test.txt")
Out[28]:
Aaa Bbb
0 Foo 0
1 Bar 1
2 Baz NaN
In [29]: pd.__version__
Out[29]: '0.10.1.dev-f7f7e13'

Try:
import pandas as pd
df = pd.read_table("14256839_input.txt", sep=" ", na_values="NULL")
print df
print df.dtypes
This gives me
Aaa Bbb
0 Foo 0
1 Bar 1
2 Baz NaN
Aaa object
Bbb float64
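Since the original file is tab separated, the equivalent call would presumably be (untested; read_table already defaults to sep="\t"):
import pandas as pd

df = pd.read_table("test.txt", sep="\t", na_values="NULL")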

Related

Create a new column evaluating values in different rows

Starting from a df imported from Excel like this:
Code  Time  Rev
AAA      5    3
AAA      3    2
AAA      6    1
BBB     10    2
BBB      5    1
I want to add a new column that flags the last revision, like this:
Code  Time  Rev  Last
AAA      5    3  OK
AAA      3    2  NOK
AAA      6    1  NOK
BBB     10    2  OK
BBB      5    1  NOK
The df is already sorted by 'Code' and 'Rev'
df = df.sort_values(['Code', 'Rev'], ascending=[True, False])
My idea was to evaluate the 'Code' column: if the value in Code equals the value in the row above, the new column should contain NOK.
Unfortunately, I am not able to write this in Python.
You can do:
#Create a column called 'Last' with 'NOK' values
df['Last'] = 'NOK'
#Skipping sorting because you say df is already sorted.
#Then locate the first row in each group and change its value to 'OK'
df.loc[df.groupby('Code', as_index=False).nth(0).index, 'Last'] = 'OK'
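The "compare with the row above" idea from the question also translates almost literally with shift() - a sketch, assuming df is already sorted as described:
import numpy as np

# 'OK' where Code differs from the previous row (i.e. the first row of each group), else 'NOK'
df['Last'] = np.where(df['Code'] != df['Code'].shift(), 'OK', 'NOK')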
You can use pandas.groupby.cumcount and set the first row of each group to 'OK'.
import pandas as pd

dict_ = {
    'Code': ['AAA', 'AAA', 'AAA', 'BBB', 'BBB'],
    'Time': [5, 3, 6, 10, 5],
    'Rev': [3, 2, 1, 2, 1],
}
df = pd.DataFrame(dict_)
df['Last'] = 'NOK'
df.loc[df.groupby('Code').cumcount() == 0, 'Last'] = 'OK'
This gives us the expected output:
df
Code Time Rev Last
0 AAA 5 3 OK
1 AAA 3 2 NOK
2 AAA 6 1 NOK
3 BBB 10 2 OK
4 BBB 5 1 NOK
Or you can fetch the head of each group and set its value to 'OK':
df.loc[df.groupby('Code').head(1).index, 'Last'] = 'OK'
which gives us the same thing
df
Code Time Rev Last
0 AAA 5 3 OK
1 AAA 3 2 NOK
2 AAA 6 1 NOK
3 BBB 10 2 OK
4 BBB 5 1 NOK
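If the row order cannot be relied on, the highest Rev per Code can be flagged directly instead (a sketch, assuming, as the expected output suggests, that the largest Rev is the last revision, and numpy imported as np):
df['Last'] = np.where(df['Rev'] == df.groupby('Code')['Rev'].transform('max'), 'OK', 'NOK')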

Fill NAs in a Column with samples from itself

Simple toy dataframe:
import numpy as np
import pandas as pd

df = pd.DataFrame({'mycol': ['foo', 'bar', 'hello', 'there', np.nan, np.nan, np.nan, 'foo'],
                   'mycol2': 'this is here to make it a DF'.split()})
print(df)
mycol mycol2
0 foo this
1 bar is
2 hello here
3 there to
4 NaN make
5 NaN it
6 NaN a
7 foo DF
I'm trying to fill the NaNs in mycol with samples from itself, e.g. I want the NaNs to be replaced with samples of foo,bar,hello etc.
# fill NA values with n samples (n = number of NAs) from df['mycol']
# (note: df.isna().sum() returns a per-column Series rather than a scalar, which triggers the error below)
df['mycol'].fillna(df['mycol'].sample(n=df.isna().sum(), random_state=1, replace=True).values)
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
# fill NA values with n samples, n=1. Dropna from df['mycol'] before sampling:
df['mycol'] = df['mycol'].fillna(df['mycol'].dropna().sample(n=1, random_state=1,replace=True)).values
# nothing happens
Expected Output: Nas filled with random samples from mycol:
mycol mycol2
0 foo this
1 bar is
2 hello here
3 there to
4 foo make
5 foo it
6 hello a
7 foo DF
Edit: jezrael's answer below sorted it; I had a problem with my indexes.
df['mycol'] = (df['mycol']
               .dropna()
               .sample(n=len(df), replace=True)
               .reset_index(drop=True))
Interesting problem.
For me, setting the values with loc works, converting the values to a numpy array to avoid data alignment:
a = df['mycol'].dropna().sample(n=df['mycol'].isna().sum(), random_state=1,replace=True)
print (a)
3 there
7 foo
0 foo
Name: mycol, dtype: object
#pandas 0.24+
df.loc[df['mycol'].isna(), 'mycol'] = a.to_numpy()
#pandas below
#df.loc[df['mycol'].isna(), 'mycol'] = a.values
print (df)
mycol mycol2
0 foo this
1 bar is
2 hello here
3 there to
4 there make
5 foo it
6 foo a
7 foo DF
Your solution should work if the length and index of the Series are the same as in the original DataFrame:
s = df['mycol'].dropna().sample(n=len(df), random_state=1,replace=True)
s.index = df.index
print (s)
0 there
1 foo
2 foo
3 bar
4 there
5 foo
6 foo
7 bar
Name: mycol, dtype: object
df['mycol'] = df['mycol'].fillna(s)
print (df)
mycol mycol2
0 foo this
1 bar is
2 hello here
3 there to
4 there make
5 foo it
6 foo a
7 foo DF
You can also do a forward or backward fill:
#backward fill
df['mycol'] = df['mycol'].fillna(method='bfill')
#forward Fill
df['mycol'] = df['mycol'].fillna(method='ffill')
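In recent pandas versions fillna(method=...) is deprecated, so the equivalent there is the dedicated methods:
#backward fill
df['mycol'] = df['mycol'].bfill()
#forward fill
df['mycol'] = df['mycol'].ffill()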

Python Pandas Debugging on to_datetime

My dataframe contains millions of records. I have to convert one of the string columns to datetime. I'm doing it as follows:
allData['Col1'] = pd.to_datetime(allData['Col1'])
However some of the strings are not valid datetime strings, and thus I get a value error. I'm not very good at debugging in Python, so I'm struggling to find the reason why some of the data items are not convertible.
I need Python to show me the row number, as well as the value that is not convertible, instead of throwing out a useless error that tells me nothing. How can I achieve this?
You can use boolean indexing with a condition that checks, via isnull, for the NaT values created by to_datetime with errors='coerce' (it produces NaT wherever the datetime is invalid):
allData1 = allData[pd.to_datetime(allData['Col1'], errors='coerce').isnull()]
Sample:
allData = pd.DataFrame({'Col1':['2015-01-03','a','2016-05-08'],
                        'B':[4,5,6],
                        'C':[7,8,9],
                        'D':[1,3,5],
                        'E':[5,3,6],
                        'F':[7,4,3]})
print (allData)
B C Col1 D E F
0 4 7 2015-01-03 1 5 7
1 5 8 a 3 3 4
2 6 9 2016-05-08 5 6 3
print (pd.to_datetime(allData['Col1'], errors='coerce'))
0 2015-01-03
1 NaT
2 2016-05-08
Name: Col1, dtype: datetime64[ns]
print (pd.to_datetime(allData['Col1'], errors='coerce').isnull())
0 False
1 True
2 False
Name: Col1, dtype: bool
allData1 = allData[pd.to_datetime(allData['Col1'], errors='coerce').isnull()]
print (allData1)
B C Col1 D E F
1 5 8 a 3 3 4
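If you also want the offending row labels and raw values printed explicitly, a small follow-up sketch on the same idea:
bad = allData.loc[pd.to_datetime(allData['Col1'], errors='coerce').isnull(), 'Col1']
for idx, val in bad.items():
    print (idx, repr(val))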

Creating NaN values in Pandas (instead of Numpy)

I'm converting a .ods spreadsheet to a Pandas DataFrame. I have whole columns and rows I'd like to drop because they contain only "None". As "None" is a str, I have:
pandas.DataFrame.replace("None", numpy.nan)
...on which I call: .dropna(how='all')
Is there a pandas equivalent to numpy.nan?
Is there a way to use .dropna() with the string "None" rather than NaN?
You can use float('nan') if you really want to avoid importing things from the numpy namespace:
>>> import pandas as pd
>>> s = pd.Series([1, 2, 3])
>>> s[1] = float('nan')
>>> s
0 1.0
1 NaN
2 3.0
dtype: float64
>>>
>>> s.dropna()
0 1.0
2 3.0
dtype: float64
Moreover, if you have a string value "None", you can .replace("None", float("nan")):
>>> s[1] = "None"
>>> s
0 1
1 None
2 3
dtype: object
>>>
>>> s.replace("None", float("nan"))
0 1.0
1 NaN
2 3.0
dtype: float64
If you are trying to directly drop the rows containing a "None" string value (without converting these "None" cells to NaN), I guess it can be done without using replace + dropna.
Considering a DataFrame like :
In [3]: df = pd.DataFrame({
            "foo": [1, 2, 3, 4],
            "bar": ["None", 5, 5, 6],
            "baz": [8, "None", 9, 10]
        })
In [4]: df
Out[4]:
bar baz foo
0 None 8 1
1 5 None 2
2 5 9 3
3 6 10 4
Using replace and dropna will return
In [5]: df.replace('None', float("nan")).dropna()
Out[5]:
bar baz foo
2 5.0 9.0 3
3 6.0 10.0 4
Which can also be obtained by simply selecting the rows you need:
In [7]: df[df.eval("foo != 'None' and bar != 'None' and baz != 'None'")]
Out[7]:
bar baz foo
2 5 9 3
3 6 10 4
You can also use the drop method of your dataframe, selecting the targeted axis/labels appropriately:
In [9]: df.drop(df[(df.baz == "None") |
                   (df.bar == "None") |
                   (df.foo == "None")].index)
Out[9]:
bar baz foo
2 5 9 3
3 6 10 4
These two methods are more or less interchangeable as you can also do for example:
df[(df.baz != "None") & (df.bar != "None") & (df.foo != "None")]
(Note that a comparison like df.somecolumn == "some string" is only possible if the column dtype allows it. Unlike with the eval example, before these last two examples I had to do df = df.astype(object), because the foo column was of type int64.)

Pandas: Create new dataframe that averages duplicates from another dataframe

Say I have a dataframe my_df with duplicate columns, e.g.
foo bar foo hello
0 1 1 5
1 1 2 5
2 1 3 5
I would like to create another dataframe that averages the duplicates:
foo bar hello
0.5 1 5
1.5 1 5
2.5 1 5
How can I do this in Pandas?
So far I have managed to identify the duplicates:
import collections

my_columns = my_df.columns
my_duplicates = [x for x, y in collections.Counter(my_columns).items() if y > 1]
print my_duplicates
But I don't know how to ask Pandas to average them.
You can groupby the column index and take the mean:
In [11]: df.groupby(level=0, axis=1).mean()
Out[11]:
bar foo hello
0 1 0.5 5
1 1 1.5 5
2 1 2.5 5
A somewhat trickier example is if there is a non-numeric column:
In [21]: df
Out[21]:
foo bar foo hello
0 0 1 1 a
1 1 1 2 a
2 2 1 3 a
The above will raise: DataError: No numeric types to aggregate. Definitely not going to win any prizes for efficiency, but here's a generic method for this case:
In [22]: dupes = df.columns.get_duplicates()
In [23]: dupes
Out[23]: ['foo']
In [24]: pd.DataFrame({d: df[d] for d in df.columns if d not in dupes})
Out[24]:
bar hello
0 1 a
1 1 a
2 1 a
In [25]: pd.concat(df.xs(d, axis=1) for d in dupes).groupby(level=0, axis=1).mean()
Out[25]:
foo
0 0.5
1 1.5
2 2.5
In [26]: pd.concat([Out[24], Out[25]], axis=1)
Out[26]:
foo bar hello
0 0.5 1 a
1 1.5 1 a
2 2.5 1 a
I think the thing to take away is to avoid column duplicates... or perhaps that I don't know what I'm doing.
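For what it's worth, a compact sketch of the same idea that also copes with the non-numeric column, avoiding the older get_duplicates()/axis=1 groupby calls (assuming df is the mixed-type frame above):
dupes = df.columns[df.columns.duplicated()].unique()
# average each set of duplicated columns, keep the remaining columns as they are
averaged = pd.concat([df[d].mean(axis=1).rename(d) for d in dupes], axis=1)
rest = df.loc[:, ~df.columns.isin(dupes)]
result = pd.concat([averaged, rest], axis=1)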
