Python Pandas Debugging on to_datetime

My dataframe contains millions of records. I have to convert one of the string columns to datetime. I'm doing it as follows:
allData['Col1'] = pd.to_datetime(allData['Col1'])
However, some of the strings are not valid datetime strings, so I get a ValueError. I'm not very good at debugging in Python, so I'm struggling to find out why some of the data items are not convertible.
I need Python to show me the row number, as well as the value that is not convertible, instead of throwing out a useless error that tells me nothing. How can I achieve this?

You can use boolean indexing: call to_datetime with errors='coerce', which creates NaT wherever a string is not a valid datetime, and then check for those NaT values with isnull:
allData1 = allData[pd.to_datetime(allData['Col1'], errors='coerce').isnull()]
Sample:
allData = pd.DataFrame({'Col1':['2015-01-03','a','2016-05-08'],
'B':[4,5,6],
'C':[7,8,9],
'D':[1,3,5],
'E':[5,3,6],
'F':[7,4,3]})
print (allData)
B C Col1 D E F
0 4 7 2015-01-03 1 5 7
1 5 8 a 3 3 4
2 6 9 2016-05-08 5 6 3
print (pd.to_datetime(allData['Col1'], errors='coerce'))
0 2015-01-03
1 NaT
2 2016-05-08
Name: Col1, dtype: datetime64[ns]
print (pd.to_datetime(allData['Col1'], errors='coerce').isnull())
0 False
1 True
2 False
Name: Col1, dtype: bool
allData1 = allData[pd.to_datetime(allData['Col1'], errors='coerce').isnull()]
print (allData1)
B C Col1 D E F
1 5 8 a 3 3 4
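If you also want the row labels and the offending strings themselves (as asked in the question), you can reuse the same NaT mask; a minimal sketch, assuming the column is named Col1 as above:
import pandas as pd

#mask of rows whose strings could not be parsed
mask = pd.to_datetime(allData['Col1'], errors='coerce').isnull()

#index labels together with the unparseable values
print (allData.loc[mask, 'Col1'])

#positional row numbers, in case the index is not a plain RangeIndex
print ([i for i, bad in enumerate(mask) if bad])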

Related

Convert Dataframe to series and vice versa / Delete columns from series or dataframe

I'm trying to convert this dataframe into a series, or the series into a dataframe (basically one into the other), in order to be able to do operations with it. My second problem is that I want to delete the first column of the dataframe below (before or after converting doesn't really matter), or be able to delete a column from a series.
I searched for similar questions but they did not correspond to my issue.
Thanks in advance. Here are the dataframe and the series:
JOUR FL_AB_PCOUP FL_ABER_NEGA FL_AB_PMAX FL_AB_PSKVA FL_TROU_PDC \
0 2018-07-09 -0.448787 0.0 1.498464 -0.197012 1.001577
CDC_INCOMPLET_HORS_ABERRANTS CDC_COMPLET_HORS_ABERRANTS CDC_ABSENT \
0 -0.729002 -1.03586 1.032936
CDC_ABERRANTS PRM_X_PDC_ZERO mean.msr.pdc sd.msr.pdc sum.msr.pdc \
0 1.49976 -0.497693 -1.243274 -1.111366 0.558516
FL_AB_PCOUP 8.775974e-05
FL_ABER_NEGA 0.000000e+00
FL_AB_PMAX 1.865632e-03
FL_AB_PSKVA 2.027215e-05
FL_TROU_PDC 2.222952e-02
FL_AB_COMBI 1.931156e-03
CDC_INCOMPLET_HORS_ABERRANTS 1.562195e-03
CDC_COMPLET_HORS_ABERRANTS 9.758743e-01
CDC_ABSENT 2.063239e-02
CDC_ABERRANTS 1.931156e-03
PRM_X_PDC_ZERO 2.127753e+01
mean.msr.pdc 1.125987e+03
sd.msr.pdc 1.765955e+03
sum.msr.pdc 3.310615e+08
n.resil 3.884103e-04
dtype: float64
Setup:
df = pd.DataFrame({'B':[4,5,4,5,5,4],
'C':[7,8,9,4,2,3],
'D':[1,3,5,7,1,0],
'E':[5,3,6,9,2,4]})
print (df)
B C D E
0 4 7 1 5
1 5 8 3 3
2 4 9 5 6
3 5 4 7 9
4 5 2 1 2
5 4 3 0 4
To go from a DataFrame to a Series, select a row, e.g. by position with iloc or by index label with loc:
#select some row, e.g. first
s = df.iloc[0]
print (s)
B 4
C 7
D 1
E 5
Name: 0, dtype: int64
And for Series to DataFrame, use to_frame, with a transpose if necessary:
df = s.to_frame().T
print (df)
B C D E
0 4 7 1 5
Finally, to remove a column from a DataFrame, use DataFrame.drop:
df = df.drop('B',axis=1)
print (df)
C D E
0 7 1 5
And to remove a value from a Series, use Series.drop:
s = s.drop('C')
print (s)
B 4
D 1
E 5
Name: 0, dtype: int64
You can delete a particular column with
df.drop(df.columns[i], axis=1)
and to convert a dataframe to a series
pd.Series(df)
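A supplementary note, not from the answers above: when the frame has only a single row or a single column, DataFrame.squeeze collapses it to a Series, which is often what converting a DataFrame to a Series means in practice. A minimal sketch:
import pandas as pd

df = pd.DataFrame({'B':[4], 'C':[7], 'D':[1], 'E':[5]})

#one-row frame -> Series indexed by the column names
s = df.squeeze()
print (s)
B    4
C    7
D    1
E    5
Name: 0, dtype: int64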

Confused about the usage of .apply and lambda

After encountering this code:
I was confused about the usage of both .apply and lambda. Firstly, does .apply apply the desired change to all elements in all the specified columns at once, or to each column one by one? Secondly, does x in lambda x: iterate through every element in the specified columns, or through the columns separately? Thirdly, do x.min and x.max give us the minimum or maximum of all the elements in the specified columns, or the minimum and maximum of each column separately? Any answer explaining the whole process would make me more than grateful.
Thanks.
I think it is best to avoid apply here - it loops under the hood - and instead work with the subset of the DataFrame selected by a list of columns:
df = pd.DataFrame({'A':list('abcdef'),
'B':[4,5,4,5,5,4],
'C':[7,8,9,4,2,3],
'D':[1,3,5,7,1,0],
'E':[5,3,6,9,2,4],
'F':list('aaabbb')})
print (df)
c = ['B','C','D']
First get the minimum of each selected column (and similarly the maximum):
print (df[c].min())
B 4
C 2
D 0
dtype: int64
Then subtract and divide:
print ((df[c] - df[c].min()))
B C D
0 0 5 1
1 1 6 3
2 0 7 5
3 1 2 7
4 1 0 1
5 0 1 0
print (df[c].max() - df[c].min())
B 1
C 7
D 7
dtype: int64
df[c] = (df[c] - df[c].min()) / (df[c].max() - df[c].min())
print (df)
A B C D E F
0 a 0.0 0.714286 0.142857 5 a
1 b 1.0 0.857143 0.428571 3 a
2 c 0.0 1.000000 0.714286 6 a
3 d 1.0 0.285714 1.000000 9 b
4 e 1.0 0.000000 0.142857 2 b
5 f 0.0 0.142857 0.000000 4 b
EDIT:
To debug apply, it is best to create a custom function:
def f(x):
    #x is one column (a Series) per call
    print (x)
    #scalar - the minimum of that column
    print (x.min())
    #new Series - the normalised column
    print ((x - x.min()) / (x.max() - x.min()))
    return (x - x.min()) / (x.max() - x.min())
df[c] = df[c].apply(f)
print (df)
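To connect this back to the question about lambda: DataFrame.apply on df[c] calls the function once per column (axis=0 is the default), so x inside the lambda is a whole column (a Series), and x.min()/x.max() are that column's minimum and maximum, not a single element and not a global value. A minimal sketch of the equivalent lambda version, assuming the same df and c as above:
#each call receives one column of df[c] as the Series x
df[c] = df[c].apply(lambda x: (x - x.min()) / (x.max() - x.min()))
print (df)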
Check whether the data are really being normalised, because x.min and x.max may simply take the min and max of a single value, in which case no normalisation would occur.

pandas pivot table for heatmap

I am trying to generate a heatmap using seaborn, however I am having a small problem with the formatting of my data.
Currently, my data is in the form:
Name Diag Date
A 1 2006-12-01
A 1 1994-02-12
A 2 2001-07-23
B 2 1999-09-12
B 1 2016-10-12
C 3 2010-01-20
C 2 1998-08-20
I would like to create a heatmap (preferably in python) showing Name on one axis against Diag, marking whether it occurred. I have tried to pivot the table using pd.pivot, however I was given the error
ValueError: Index contains duplicate entries, cannot reshape
this came from:
piv = df.pivot_table(index='Name',columns='Diag')
Time is irrelevant, but I would like to show which Names have had which Diag, and which Diag combos cluster together. Do I need to create a new table for this, or is it possible with the one I have? In some cases a Name is not associated with every Diag.
EDIT:
I have since tried:
piv = df.pivot_table(index='Name',columns='Diag', values='Time', aggfunc='mean')
However as Time is in datetime format, I end up with:
pandas.core.base.DataError: No numeric types to aggregate
You need pivot_table with some aggregate function, because the same index and column combination has multiple values and pivot needs unique values only:
print (df)
Name Diag Time
0 A 1 12 <- duplicate Name A / Diag 1, different values
1 A 1 13 <- duplicate Name A / Diag 1, different values
2 A 2 14
3 B 2 18
4 B 1 1
5 C 3 9
6 C 2 8
df = df.pivot_table(index='Name',columns='Diag', values='Time', aggfunc='mean')
print (df)
Diag 1 2 3
Name
A 12.5 14.0 NaN
B 1.0 18.0 NaN
C NaN 8.0 9.0
Alternative solution:
df = df.groupby(['Name','Diag'])['Time'].mean().unstack()
print (df)
Diag 1 2 3
Name
A 12.5 14.0 NaN
B 1.0 18.0 NaN
C NaN 8.0 9.0
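To actually draw the heatmap the question asks about, the pivoted frame from either approach can be passed straight to seaborn; a minimal sketch, assuming df already holds the Name x Diag table shown above:
import seaborn as sns
import matplotlib.pyplot as plt

#missing Name/Diag combinations (NaN) are left as blank cells
sns.heatmap(df, annot=True)
plt.show()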
EDIT:
You can also find all the duplicated Name/Diag pairs with duplicated:
df = df.loc[df.duplicated(['Name','Diag'], keep=False), ['Name','Diag']]
print (df)
Name Diag
0 A 1
1 A 1
EDIT:
Taking the mean of datetimes is not easy - you need to convert the dates to nanoseconds, take the mean, and finally convert back to datetimes. There is another problem too - NaN has to be replaced by some scalar, e.g. 0, which converts to the zero datetime, 1970-01-01.
import numpy as np

df.Date = pd.to_datetime(df.Date)
df['dates_in_ns'] = pd.Series(df.Date.values.astype(np.int64), index=df.index)
df = df.pivot_table(index='Name',
                    columns='Diag',
                    values='dates_in_ns',
                    aggfunc='mean',
                    fill_value=0)
df = df.apply(pd.to_datetime)
print (df)
Diag 1 2 3
Name
A 2000-07-07 12:00:00 2001-07-23 1970-01-01
B 2016-10-12 00:00:00 1999-09-12 1970-01-01
C 1970-01-01 00:00:00 1998-08-20 2010-01-20

Creating column on filtered pandas DataFrame

From an initial DataFrame loaded from a csv file,
df = pd.read_csv("file.csv",sep=";")
I get a filtered copy with
df_filtered = df[df["filter_col_name"]== value]
However, when creating a new column using the diff() method,
df_filtered["diff"] = df_filtered["feature"].diff()
I get the following warning:
/usr/local/bin/ipython3:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
#!/usr/bin/python3
I notice also that the processing time is very long.
Surprisingly (at least to me...), if I do the same thing on the non-filtered DataFrame, it runs fine.
How should I proceed to create a "diff" column on the filtered data?
You need copy:
If you modify values in df_filtered later you will find that the modifications do not propagate back to the original data (df), and that pandas warns about it.
#need process sliced df, return sliced df
df_filtered = df[df["filter_col_name"]== value].copy()
Or:
#need process sliced df, return all df
df.loc[df["filter_col_name"] == value, 'feature'] = \
    df.loc[df["filter_col_name"] == value, 'feature'].diff()
Sample:
df = pd.DataFrame({'filter_col_name':[1,1,3],
'feature':[4,5,6],
'C':[7,8,9],
'D':[1,3,5],
'E':[5,3,6],
'F':[7,4,3]})
print (df)
C D E F feature filter_col_name
0 7 1 5 7 4 1
1 8 3 3 4 5 1
2 9 5 6 3 6 3
value = 1
df_filtered = df[df["filter_col_name"]== value].copy()
df_filtered["diff"] = df_filtered["feature"].diff()
print (df_filtered)
C D E F feature filter_col_name diff
0 7 1 5 7 4 1 NaN
1 8 3 3 4 5 1 1.0
value = 1
df.loc[df["filter_col_name"] == value, 'feature'] = \
    df.loc[df["filter_col_name"] == value, 'feature'].diff()
print (df)
C D E F feature filter_col_name
0 7 1 5 7 NaN 1
1 8 3 3 4 1.0 1
2 9 5 6 3 6.0 3
Try using
df_filtered.loc[:, "diff"] = df_filtered["feature"].diff()

Pandas messing up the date column on DataFrame.from_csv

I'm using Python 3.4.0 with pandas==0.16.2. I have soccer team results in CSV files that have the following columns: date, at, goals.scored, goals.lost, result. The 'at' column can have one of three values (H, A, N), which indicate whether the game took place at the team's home stadium, away, or at a neutral venue respectively. Here's the head of one such file:
date,at,goals.scored,goals.lost,result
16/09/2014,A,0,2,2
01/10/2014,H,4,1,1
22/10/2014,A,2,1,1
04/11/2014,H,3,3,0
26/11/2014,H,2,0,1
09/12/2014,A,4,1,1
25/02/2015,H,1,3,2
17/03/2015,A,2,0,1
19/08/2014,A,0,0,0
When I load this file into pandas.DataFrame in the usual way:
import pandas as pd
aTeam = pd.DataFrame.from_csv("teamA-results.csv")
the first two columns 'date' and 'at' seem to be treated as one and I get a malformed data frame like this one:
aTeam.dtypes
at object
goals.scored int64
goals.lost int64
result int64
dtype: object
aTeam
at goals.scored goals.lost result
date
2014-09-16 A 0 2 2
2014-01-10 H 4 1 1
2014-10-22 A 2 1 1
2014-04-11 H 3 3 0
2014-11-26 H 2 0 1
...
The code block does not clearly reflect the corruption, so I attached a screenshot from the Jupyter notebook.
As you can see 'date' and 'at' columns seemed to be treated as one column of object type:
aTeam['at']
date
2014-09-16 A
2014-01-10 H
2014-10-22 A
2014-04-11 H
2014-11-26 H
2014-09-12 A
Initially I thought the lack of quotes around the date was causing this problem, so I added those, but it did not help at all. I then quoted all the values in the 'at' column, which still did not solve the problem. I tried single and double quotes in the CSV file. Interestingly, using no quotes or double quotes around the values in 'date' and 'at' produced the same results as you can see above. Single quotes were interpreted as parts of the value in the 'at' column, but not in the date column.
Adding the parse_dates=True param did not have any effect on the data frame.
I did not have such issues when I was working with these CSV files in R. I will appreciate any help on this one.
I can replicate your issue using from_csv; the issue is that it uses column 0 as the index, so passing index_col=None works:
index_col : int or sequence, default 0
Column to use for index. If a sequence is given, a MultiIndex is used. Different default from read_table
import pandas as pd
aTeam = pd.DataFrame().from_csv("in.csv",index_col=None)
Output:
date at goals.scored goals.lost result
0 16/09/2014 A 0 2 2
1 01/10/2014 H 4 1 1
2 22/10/2014 A 2 1 1
3 04/11/2014 H 3 3 0
4 26/11/2014 H 2 0 1
5 09/12/2014 A 4 1 1
6 25/02/2015 H 1 3 2
7 17/03/2015 A 2 0 1
8 19/08/2014 A 0 0 0
Or, using .read_csv works correctly and is probably what you wanted, given that you were trying quotechar, which is a valid argument:
import pandas as pd
aTeam = pd.read_csv("in.csv")
Output:
date at goals.scored goals.lost result
0 16/09/2014 A 0 2 2
1 01/10/2014 H 4 1 1
2 22/10/2014 A 2 1 1
3 04/11/2014 H 3 3 0
4 26/11/2014 H 2 0 1
5 09/12/2014 A 4 1 1
6 25/02/2015 H 1 3 2
7 17/03/2015 A 2 0 1
8 19/08/2014 A 0 0 0
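One more note: the dates in the file are day-first (dd/mm/yyyy), and the misparsed frame in the question shows 01/10/2014 becoming 2014-01-10, i.e. January instead of October. If you do want the date column parsed as datetimes, read_csv takes parse_dates and dayfirst; a minimal sketch:
import pandas as pd

#parse_dates names the columns to convert; dayfirst=True makes
#01/10/2014 parse as 1 October rather than 10 January
aTeam = pd.read_csv("teamA-results.csv", parse_dates=['date'], dayfirst=True)
print (aTeam.dtypes)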
I have no problem here (using Python 2.7, pandas 0.16.2). I used the text you pasted, created a .csv file on my desktop, and loaded it into pandas using two methods:
import pandas as pd
a = pd.read_csv('test.csv')
b = pd.DataFrame.from_csv('test.csv')
>>>a
date at goals.scored goals.lost result
0 16/09/2014 A 0 2 2
1 01/10/2014 H 4 1 1
2 22/10/2014 A 2 1 1
3 04/11/2014 H 3 3 0
4 26/11/2014 H 2 0 1
5 09/12/2014 A 4 1 1
6 25/02/2015 H 1 3 2
7 17/03/2015 A 2 0 1
8 19/08/2014 A 0 0 0
>>>b
at goals.scored goals.lost result
date
2014-09-16 A 0 2 2
2014-01-10 H 4 1 1
2014-10-22 A 2 1 1
2014-04-11 H 3 3 0
2014-11-26 H 2 0 1
2014-09-12 A 4 1 1
2015-02-25 H 1 3 2
2015-03-17 A 2 0 1
2014-08-19 A 0 0 0
You can see the different behavior between the read_csv and from_csv commands in how the index is handled. However, I do not see anything like the problem you mentioned. You could always try further defining the read_csv parameters, though I doubt that will make a substantial difference.
Could you confirm that your date and "at" columns are being smashed together by querying the dataframe via aTeam['at'] and seeing what that yields?
