Selecting columns by column NAME dtype - python

import pandas as pd
import numpy as np
cols = ['string',pd.Timestamp('2017-10-13'), 'anotherstring', pd.Timestamp('2017-10-14')]
pd.DataFrame(np.random.rand(5,4), columns=cols)
How can I get back just the 2nd and 4th column (whose names have type datetime.datetime)? The dtypes of the column contents are exactly the same, so select_dtypes doesn't help.

Use type with map:
df = df.loc[:, df.columns.map(type) == pd.Timestamp]
print (df)
2017-10-13 00:00:00 2017-10-14 00:00:00
0 0.894932 0.502015
1 0.080334 0.155712
2 0.600152 0.206344
3 0.008913 0.919534
4 0.280229 0.951434
Details:
print (df.columns.map(type))
Index([<class 'str'>,
       <class 'pandas._libs.tslib.Timestamp'>,
       <class 'str'>,
       <class 'pandas._libs.tslib.Timestamp'>], dtype='object')
print (df.columns.map(type) == pd.Timestamp)
[False True False True]
Alternative solution:
df1 = df.loc[:, [isinstance(i, pd.Timestamp) for i in df.columns]]
print (df1)
2017-10-13 00:00:00 2017-10-14 00:00:00
0 0.818283 0.128299
1 0.570288 0.458400
2 0.857426 0.395963
3 0.595765 0.306861
4 0.196899 0.438231

Related

Find string data-type that includes a number in Pandas DataFrame and change the type

I have a dataframe with multiple columns. One or more columns contain string values that may or may not include numbers (integer or float).
import pandas as pd
import numpy as np
data = [('A', '>10', 'ABC'),
('B', '10', '15'),
('C', '<10', '>10'),
('D', '10', '15'),
('E', '10-20', '10-30'),
('F', '20.0', 'ABC'),
('G', '25.1', '30.1') ]
data_df = pd.DataFrame(data, columns = ['name', 'value1', 'value2'])
I am looking for a method to check each cell of the dataframe for values stored as strings that actually contain numerical (integer or float) values, and to convert those to integer or float while keeping the whole dataframe intact (not changing it to an array).
So far, I found the "How to find string data-type that includes a number in Pandas DataFrame" question on Stack Overflow useful, but it is aimed at dropping the numerical values stored as strings.
If you need all values to be numeric, replace the non-numeric values with missing values (NaN):
data_df.iloc[:, 1:] = data_df.iloc[:, 1:].apply(pd.to_numeric, errors='coerce')
print (data_df)
name value1 value2
0 A NaN NaN
1 B 10.0 15.0
2 C NaN NaN
3 D 10.0 15.0
4 E NaN NaN
5 F 20.0 NaN
6 G 25.1 30.1
If you need to replace the missing values with the original strings:
data_df.iloc[:, 1:] = (data_df.iloc[:, 1:]
.apply(pd.to_numeric, errors='coerce')
.fillna(data_df.iloc[:, 1:]))
print (data_df)
name value1 value2
0 A >10 ABC
1 B 10.0 15.0
2 C <10 >10
3 D 10.0 15.0
4 E 10-20 10-30
5 F 20.0 ABC
6 G 25.1 30.1
But then you get mixed types, numeric values together with strings:
print (data_df.iloc[:, 1:].applymap(type))
value1 value2
0 <class 'str'> <class 'str'>
1 <class 'float'> <class 'float'>
2 <class 'str'> <class 'str'>
3 <class 'float'> <class 'float'>
4 <class 'str'> <class 'str'>
5 <class 'float'> <class 'str'>
6 <class 'float'> <class 'float'>
EDIT: To convert only the string (object) columns other than name, stripping whitespace before the numeric conversion:
cols = data_df.select_dtypes(object).columns.difference(['name'], sort=False)
data_df[cols] = data_df[cols].apply(lambda x: pd.to_numeric(x.str.strip(), errors='coerce'))
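As a quick check (a minimal self-contained sketch repeating the data and the EDIT above), the value columns should now be genuinely numeric, with NaN where the original strings were not purely numbers:
import pandas as pd

data = [('A', '>10', 'ABC'), ('B', '10', '15'), ('C', '<10', '>10'),
        ('D', '10', '15'), ('E', '10-20', '10-30'),
        ('F', '20.0', 'ABC'), ('G', '25.1', '30.1')]
data_df = pd.DataFrame(data, columns=['name', 'value1', 'value2'])

# convert every object column except 'name', stripping whitespace first
cols = data_df.select_dtypes(object).columns.difference(['name'], sort=False)
data_df[cols] = data_df[cols].apply(lambda x: pd.to_numeric(x.str.strip(), errors='coerce'))

print(data_df.dtypes)
# name       object
# value1    float64
# value2    float64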

Parsing Pandas df Column of mixed data into Datetime

df = pd.DataFrame('23.Jan.2020 01.Mar.2017 5663:33 20.May.2021 626'.split())
I want to convert the date-like elements to datetime and, for the numbers, return the original value.
I have tried
t=pd.to_datetime(df[0], format='%d.%b.%Y', errors='ignore')
which just returns the original df with no change. And I have tried changing errors to 'coerce', which does the conversion for date-like elements, but the numbers are dropped
t=pd.to_datetime(df[0], format='%d.%b.%Y', errors='coerce')
Then I attempt to return the original df value if NaT, else substitute with the new datetime from t
df.where(t.isnull(), other=t, axis=1)
which works for returning the original df value where t is NaT, but it doesn't carry over the datetime values.
Maybe this is what you want?
dt = pd.Series('23.Jan.2020 01.Mar.2017 5663:33 20.May.2021 626'.split())
res = pd.to_datetime(dt, format="%d.%b.%Y", errors='coerce').fillna(dt)
This way the resulting elements in the series have the correct types:
>>> res.map(type)
0 <class 'pandas._libs.tslibs.timestamps.Timesta...
1 <class 'pandas._libs.tslibs.timestamps.Timesta...
2 <class 'str'>
3 <class 'pandas._libs.tslibs.timestamps.Timesta...
4 <class 'str'>
dtype: object
PS: I used a Series because it's easier to pass to to_datetime, and to Series.fillna.
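If the values live in a DataFrame column rather than a standalone Series, the same pattern can be applied to that column; a minimal sketch, assuming the column is named 0 as in the question:
import pandas as pd

df = pd.DataFrame('23.Jan.2020 01.Mar.2017 5663:33 20.May.2021 626'.split())

# parse date-like strings; anything unparseable becomes NaT and then gets the
# original string back via fillna, leaving an object column of Timestamps/strings
df[0] = pd.to_datetime(df[0], format='%d.%b.%Y', errors='coerce').fillna(df[0])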
This will combine the two field types in the way you have specified:
import pandas as pd
df = pd.DataFrame('23.Jan.2020 01.Mar.2017 5663:33 20.May.2021 626'.split())
mod = pd.to_datetime(df[0], format='%d.%b.%Y', errors='coerce')
ndf = pd.concat([df, mod], axis=1)
ndf.columns = ['original', 'modified']
def funk(col1, col2):
    return col1 if pd.isnull(col2) else col2
ndf.apply(lambda x: funk(x.original,x.modified), axis=1)
# 0 2020-01-23 00:00:00
# 1 2017-03-01 00:00:00
# 2 5663:33
# 3 2021-05-20 00:00:00
# 4 626

How can I get intersection of two pandas series text column?

I have two pandas Series whose elements are sets of strings. How can I get their element-wise intersection?
print(df)
0 {this, is, good}
1 {this, is, not, good}
print(df1)
0 {this, is}
1 {good, bad}
I'm looking for an output something like the one below.
print(df2)
0 {this, is}
1 {good}
I've tried the following, but it raises an error:
df.apply(lambda x: x.intersection(df1))
TypeError: unhashable type: 'set'
This looks like simple set arithmetic: since A ∩ B = A − (A − B), the intersection can be computed with two element-wise set differences:
s1 = pd.Series([{'this', 'is', 'good'}, {'this', 'is', 'not', 'good'}])
s2 = pd.Series([{'this', 'is'}, {'good', 'bad'}])
s1 - (s1 - s2)
#Out[122]:
#0 {this, is}
#1 {good}
#dtype: object
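An equivalent and arguably more explicit way (a sketch, not from the answer above) is to pass set.intersection to Series.combine, using the same s1 and s2:
s1.combine(s2, set.intersection)
# 0    {this, is}
# 1        {good}
# dtype: object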
This approach works for me
import pandas as pd
import numpy as np
data = np.array([{'this', 'is', 'good'},{'this', 'is', 'not', 'good'}])
data1 = np.array([{'this', 'is'},{'good', 'bad'}])
df = pd.Series(data)
df1 = pd.Series(data1)
df2 = pd.Series([df[i] & df1[i] for i in range(df.size)])
print(df2)
I appreciate the above answers. Here is a simple example to solve the same problem if you have DataFrames (as I guess from the variable names df and df1, you asked this for DataFrames).
This df.apply(lambda row: row[0].intersection(df1.loc[row.name][0]), axis=1) will do that. Let's see how I reached the solution.
The answer at https://stackoverflow.com/questions/266582... was helpful for me.
>>> import pandas as pd
>>>
>>> df = pd.DataFrame({
... "set": [{"this", "is", "good"}, {"this", "is", "not", "good"}]
... })
>>>
>>> df
set
0 {this, is, good}
1 {not, this, is, good}
>>>
>>> df1 = pd.DataFrame({
... "set": [{"this", "is"}, {"good", "bad"}]
... })
>>>
>>> df1
set
0 {this, is}
1 {bad, good}
>>>
>>> df.apply(lambda row: row[0].intersection(df1.loc[row.name][0]), axis=1)
0 {this, is}
1 {good}
dtype: object
>>>
How did I reach the above solution?
>>> df.apply(lambda x: print(x.name), axis=1)
0
1
0 None
1 None
dtype: object
>>>
>>> df.loc[0]
set {this, is, good}
Name: 0, dtype: object
>>>
>>> df.apply(lambda row: print(row[0]), axis=1)
{'this', 'is', 'good'}
{'not', 'this', 'is', 'good'}
0 None
1 None
dtype: object
>>>
>>> df.apply(lambda row: print(type(row[0])), axis=1)
<class 'set'>
<class 'set'>
0 None
1 None
dtype: object
>>> df.apply(lambda row: print(type(row[0]), df1.loc[row.name]), axis=1)
<class 'set'> set {this, is}
Name: 0, dtype: object
<class 'set'> set {good}
Name: 1, dtype: object
0 None
1 None
dtype: object
>>> df.apply(lambda row: print(type(row[0]), type(df1.loc[row.name])), axis=1)
<class 'set'> <class 'pandas.core.series.Series'>
<class 'set'> <class 'pandas.core.series.Series'>
0 None
1 None
dtype: object
>>> df.apply(lambda row: print(type(row[0]), type(df1.loc[row.name][0])), axis=1)
<class 'set'> <class 'set'>
<class 'set'> <class 'set'>
0 None
1 None
dtype: object
>>>
Similar to the above, except you keep everything in one dataframe.
Current df:
df = pd.DataFrame({0: np.array([{'this', 'is', 'good'},{'this', 'is', 'not', 'good'}]), 1: np.array([{'this', 'is'},{'good', 'bad'}])})
Intersection of series 0 & 1
df[2] = df.apply(lambda x: x[0] & x[1], axis=1)

Pandas DataFrames with NaNs equality comparison

In the context of unit testing some functions, I'm trying to establish the equality of 2 DataFrames using python pandas:
ipdb> expect
1 2
2012-01-01 00:00:00+00:00 NaN 3
2013-05-14 12:00:00+00:00 3 NaN
ipdb> df
identifier 1 2
timestamp
2012-01-01 00:00:00+00:00 NaN 3
2013-05-14 12:00:00+00:00 3 NaN
ipdb> df[1][0]
nan
ipdb> df[1][0], expect[1][0]
(nan, nan)
ipdb> df[1][0] == expect[1][0]
False
ipdb> df[1][1] == expect[1][1]
True
ipdb> type(df[1][0])
<type 'numpy.float64'>
ipdb> type(expect[1][0])
<type 'numpy.float64'>
ipdb> (list(df[1]), list(expect[1]))
([nan, 3.0], [nan, 3.0])
ipdb> df1, df2 = (list(df[1]), list(expect[1])) ;; df1 == df2
False
Given that I'm trying to test the entirety of expect against the entirety of df, including the NaN positions, what am I doing wrong?
What is the simplest way to compare equality of Series/DataFrames including NaNs?
You can use assert_frame_equal with check_names=False (so as not to check the index/columns names), which will raise if they are not equal:
In [11]: from pandas.testing import assert_frame_equal
In [12]: assert_frame_equal(df, expected, check_names=False)
You can wrap this in a function with something like:
def frames_equal(df, expected):  # hypothetical helper name
    try:
        assert_frame_equal(df, expected, check_names=False)
        return True
    except AssertionError:
        return False
In more recent pandas this functionality has been added as .equals:
df.equals(expected)
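For example (a minimal sketch, not from the original answer), equals treats NaNs in the same positions as equal, unlike element-wise ==:
import numpy as np
import pandas as pd

a = pd.DataFrame({'x': [1.0, np.nan]})
b = pd.DataFrame({'x': [1.0, np.nan]})

print(a.equals(b))           # True: NaNs in matching positions count as equal
print((a == b).all().all())  # False: element-wise ==, because NaN != NaN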
One of the properties of NaN is that NaN != NaN is True.
Check out this answer for a nice way to do this using numexpr.
(a == b) | ((a != a) & (b != b))
says this (in pseudocode):
a == b or (isnan(a) and isnan(b))
So, either a equals b, or both a and b are NaN.
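A minimal sketch of that expression applied element-wise to two DataFrames (not from the original answer):
import numpy as np
import pandas as pd

a = pd.DataFrame({'x': [1.0, np.nan, 3.0]})
b = pd.DataFrame({'x': [1.0, np.nan, 4.0]})

# True where the values are equal or where both are NaN
mask = (a == b) | ((a != a) & (b != b))
print(mask)
#        x
# 0   True
# 1   True
# 2  False
print(mask.values.all())  # False: the frames differ in the last row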
If you have small frames then assert_frame_equal will be okay. However, for large frames (10M rows) assert_frame_equal is pretty much useless. I had to interrupt it, it was taking so long.
In [1]: df = DataFrame(rand(10000000, 15))
In [2]: df = df[df > 0.5]
In [3]: df2 = df.copy()
In [4]: df
Out[4]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 10000000 entries, 0 to 9999999
Columns: 15 entries, 0 to 14
dtypes: float64(15)
In [5]: timeit (df == df2) | ((df != df) & (df2 != df2))
1 loops, best of 3: 598 ms per loop
timeit of the (presumably) desired single bool indicating whether the two DataFrames are equal:
In [9]: timeit ((df == df2) | ((df != df) & (df2 != df2))).values.all()
1 loops, best of 3: 687 ms per loop
Like @PhillipCloud's answer, but more written out
In [26]: df1 = DataFrame([[np.nan,1],[2,np.nan]])
In [27]: df2 = df1.copy()
They really are equivalent
In [28]: result = df1 == df2
In [29]: result[pd.isnull(df1) == pd.isnull(df2)] = True
In [30]: result
Out[30]:
0 1
0 True True
1 True True
A nan in df2 that doesn't exist in df1
In [31]: df2 = DataFrame([[np.nan,1],[np.nan,np.nan]])
In [32]: result = df1 == df2
In [33]: result[pd.isnull(df1) == pd.isnull(df2)] = True
In [34]: result
Out[34]:
0 1
0 True True
1 False True
You can also fill with a value you know not to be in the frame
In [38]: df1.fillna(-999) == df1.fillna(-999)
Out[38]:
0 1
0 True True
1 True True
Any equality comparison using == with np.NaN is False, even np.NaN == np.NaN is False.
Simply, df1.fillna('NULL') == df2.fillna('NULL'), if 'NULL' is not a value in the original data.
To be safe, do the following:
Example a) Compare two dataframes with NaN values
bools = (df1 == df2)
bools[pd.isnull(df1) & pd.isnull(df2)] = True
assert bools.all().all()
Example b) Filter rows in df1 that do not match with df2
bools = (df1 != df2)
bools[pd.isnull(df1) & pd.isnull(df2)] = False
df_outlier = df1[bools.all(axis=1)]
(Note: using bools[pd.isnull(df1) == pd.isnull(df2)] = False here would be wrong.)
df.fillna(0) == df2.fillna(0)
You can use fillna(); see the pandas fillna documentation.
from pandas import DataFrame
# create a dataframe with NaNs
df = DataFrame([{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}])
df2 = df
# comparison fails!
print(df == df2)
# all is well
print(df.fillna(0) == df2.fillna(0))

Reindexing pandas timeseries from object dtype to datetime dtype

I have a time-series that is not recognized as a DatetimeIndex despite being indexed by standard YYYY-MM-DD strings with valid dates. Coercing them to a valid DatetimeIndex seems to be inelegant enough to make me think I'm doing something wrong.
I read in (someone else's lazily formatted) data that contains invalid datetime values and remove these invalid observations.
In [1]: df = pd.read_csv('data.csv',index_col=0)
In [2]: print df['2008-02-27':'2008-03-02']
Out[2]:
count
2008-02-27 20
2008-02-28 0
2008-02-29 27
2008-02-30 0
2008-02-31 0
2008-03-01 0
2008-03-02 17
In [3]: def clean_timestamps(df):
# remove invalid dates like '2008-02-30' and '2009-04-31'
to_drop = list()
for d in df.index:
try:
datetime.date(int(d[0:4]),int(d[5:7]),int(d[8:10]))
except ValueError:
to_drop.append(d)
df2 = df.drop(to_drop,axis=0)
return df2
In [4]: df2 = clean_timestamps(df)
In [5] :print df2['2008-02-27':'2008-03-02']
Out[5]:
count
2008-02-27 20
2008-02-28 0
2008-02-29 27
2008-03-01 0
2008-03-02 17
This new index is still only recognized as an 'object' dtype rather than a DatetimeIndex.
In [6]: df2.index
Out[6]: Index([2008-01-01, 2008-01-02, 2008-01-03, ..., 2012-11-27, 2012-11-28,
2012-11-29], dtype=object)
Reindexing produces NaNs because they're different dtypes.
In [7]: i = pd.date_range(start=min(df2.index),end=max(df2.index))
In [8]: df3 = df2.reindex(index=i,columns=['count'])
In [9]: df3['2008-02-27':'2008-03-02']
Out[9]:
count
2008-02-27 NaN
2008-02-28 NaN
2008-02-29 NaN
2008-03-01 NaN
2008-03-02 NaN
I create a fresh dataframe with the appropriate index, dump the data to a dictionary, then populate the new dataframe based on the dictionary values (skipping missing values).
In [10]: df3 = pd.DataFrame(columns=['count'],index=i)
In [11]: values = dict(df2['count'])
In [12]: for d in i:
try:
df3.set_value(index=d,col='count',value=values[d.isoformat()[0:10]])
except KeyError:
pass
In [13]: print df3['2008-02-27':'2008-03-02']
Out[13]:
count
2008-02-27 20
2008-02-28 0
2008-02-29 27
2008-03-01 0
2008-03-02 17
In [14]: df3.index
Out[14]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2008-01-01 00:00:00, ..., 2012-11-29 00:00:00]
Length: 1795, Freq: D, Timezone: None
This last part of setting values based on lookups to a dictionary keyed by strings seems especially hacky and makes me think I've missed something important.
You could use pd.to_datetime:
In [1]: import pandas as pd
In [2]: pd.to_datetime('2008-02-27')
Out[2]: datetime.datetime(2008, 2, 27, 0, 0)
This allows you to "clean" the index (or similarly a column) by applying it to the Series:
df.index = pd.to_datetime(df.index)
or
df['date_col'] = df['date_col'].apply(pd.to_datetime)
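For the invalid labels in the question (e.g. '2008-02-30'), a sketch of the same idea using errors='coerce', so bad dates become NaT and can be dropped instead of being removed by hand (assuming df is the raw frame read from data.csv with YYYY-MM-DD strings as its index):
import pandas as pd

df = pd.read_csv('data.csv', index_col=0)

# parse the string index; invalid labels such as '2008-02-30' become NaT
df.index = pd.to_datetime(df.index, errors='coerce')
df = df[df.index.notna()]   # drop the rows whose label could not be parsed

# the index is now a DatetimeIndex, so reindexing to a daily range works directly
full_range = pd.date_range(df.index.min(), df.index.max(), freq='D')
df = df.reindex(full_range)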
