Could anyone please help me understand why I am getting the error below? Everything worked before with the same logic; the error appeared after I converted my date columns to the appropriate data type.
Below is the line of code I am trying to run:
data['OPEN_DT'] = data['OPEN_DT'].apply(lambda x: datetime.strptime(x,'%Y-%m-%d') if len(x[:x.find ('-')]) == 4 else datetime.strptime(x, '%d-%m-%Y'))
Error being received:
AttributeError Traceback (most recent call last)
<ipython-input-93-f0a22bfffeee> in <module>
----> 1 data['OPEN_DT'] = data['OPEN_DT'].apply(lambda x: datetime.strptime(x,'%Y-%m-%d') if len(x[:x.find ('-')]) == 4 else datetime.strptime(x, '%d-%m-%Y'))
~\Anaconda3\lib\site-packages\pandas\core\series.py in apply(self, func, convert_dtype, args, **kwds)
3846 else:
3847 values = self.astype(object).values
-> 3848 mapped = lib.map_infer(values, f, convert=convert_dtype)
3849
3850 if len(mapped) and isinstance(mapped[0], Series):
pandas\_libs\lib.pyx in pandas._libs.lib.map_infer()
<ipython-input-93-f0a22bfffeee> in <lambda>(x)
----> 1 data['OPEN_DT'] = data['OPEN_DT'].apply(lambda x: datetime.strptime(x,'%Y-%m-%d') if len(x[:x.find ('-')]) == 4 else datetime.strptime(x, '%d-%m-%Y'))
AttributeError: 'Timestamp' object has no attribute 'find'
ValueError: time data '30/09/2020' does not match format '%d-%m-%Y'
Many thanks.
As the ValueError shows, the format %d-%m-%Y needs to be changed to %d/%m/%Y to read a date like 30/09/2020.
from datetime import datetime
import pandas as pd

data = {}
dates = ['27/09/2020', '28/09/2020', '29/09/2020', '30/09/2020']  # a list keeps the order stable
data['OPEN_DT'] = pd.Series(dates)
data['OPEN_DT'] = data['OPEN_DT'].apply(lambda x: datetime.strptime(x, '%Y/%m/%d') if len(x[:str(x).find('-')]) == 4 else datetime.strptime(x, '%d/%m/%Y'))
print(data)
x is a Timestamp, so you should convert it to str before looking for '-' in it:
str(x).find('-')
And why don't you simply use pd.to_datetime with infer_datetime_format=True, so that pandas detects the format automatically?
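Alternatively, a small parser that dispatches on a leading 4-digit year avoids calling .find() on a Timestamp entirely. A minimal sketch, assuming the column holds strings in one of the two formats from the question:

```python
import pandas as pd

# hypothetical sample covering both formats from the question
s = pd.Series(['2020-09-30', '30/09/2020'])

def parse_date(x: str):
    # ISO-style dates start with a 4-digit year; otherwise assume day-first
    if x[:4].isdigit():
        return pd.to_datetime(x, format='%Y-%m-%d')
    return pd.to_datetime(x, format='%d/%m/%Y')

parsed = s.apply(parse_date)
```

Both sample elements parse to the same day, 2020-09-30.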
I'm running the code in Colab, but I get this error and I can't fix it. Could you help me out?
I have tried changing the column names between upper and lower case, but I don't know what else to do.
#Inativos: ajustar nomes das colunas
dfInativos = dfInativos.rename(columns={'userId': 'id'})
dfInativos = dfInativos.rename(columns={'classId': 'ClasseId'})
dfInativos[['id','ClasseId','lastActivityDate','inactivityDaysCount','sevenDayInactiveStatus']] = dfInativos
#dfInativos['id'] = dfInativos['id'].astype(int, errors = 'ignore')
#dfInativos['ClasseId'] = dfInativos['ClasseId'].astype(int, errors = 'ignore')
dfInativos['id'] = pd.to_numeric(dfInativos['id'],errors = 'coerce')
dfInativos['ClasseId'] = pd.to_numeric(dfInativos['ClasseId'],errors = 'coerce')
#dfInativos.dropna(subset = ['lastActivityDate'], inplace=True)
dfInativos.drop_duplicates(subset = ['id','ClasseId'], inplace=True)
dfInativos['seven DayInactiveStatus'] = dfInativos['sevenDayInactiveStatus'].replace(0,'')
#Add Inactive data to main data frame
df = df.merge(dfInativos, on=['id','ClasseId'], how='left')
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-79-10fe94c48d1f> in <module>()
2 dfInativos = dfInativos.rename(columns={'userId': 'id'})
3 dfInativos = dfInativos.rename(columns={'classId': 'ClasseId'})
----> 4 dfInativos[['id','ClasseId','lastActivityDate','inactivityDaysCount','sevenDayInactiveStatus']] = dfInativos
5
6
2 frames
/usr/local/lib/python3.7/dist-packages/pandas/core/indexers.py in check_key_length(columns, key, value)
426 if columns.is_unique:
427 if len(value.columns) != len(key):
--> 428 raise ValueError("Columns must be same length as key")
429 else:
430 # Missing keys in columns are represented as -1
ValueError: Columns must be same length as key
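For what it's worth, pandas raises this ValueError whenever the frame on the right-hand side of the assignment has a different number of columns than the key lists on the left. A minimal sketch of the failure and one possible fix (the 'spare' column is a made-up stand-in for whatever extra columns the real dfInativos carries):

```python
import pandas as pd

dfInativos = pd.DataFrame([[1, 2, '2022-01-01', 7, 0, 'x']],
                          columns=['id', 'ClasseId', 'lastActivityDate',
                                   'inactivityDaysCount', 'sevenDayInactiveStatus', 'spare'])

cols = ['id', 'ClasseId', 'lastActivityDate', 'inactivityDaysCount', 'sevenDayInactiveStatus']
# dfInativos[cols] = dfInativos   # ValueError: Columns must be same length as key (6 vs 5)
dfInativos = dfInativos[cols]     # select the five columns instead of assigning the whole frame
```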
The data in test.csv are like this:
TIMESTAMP POLYLINE
0 1408039037 [[-8.585676,41.148522],[-8.585712,41.148639],[...
1 1408038611 [[-8.610876,41.14557],[-8.610858,41.145579],[-...
2 1408038568 [[-8.585739,41.148558],[-8.58573,41.148828],[-...
3 1408039090 [[-8.613963,41.141169],[-8.614125,41.141124],[...
4 1408039177 [[-8.619903,41.148036],[-8.619894,41.148036]]
.. ... ...
315 1419171485 [[-8.570196,41.159484],[-8.570187,41.158962],[...
316 1419170802 [[-8.613873,41.141232],[-8.613882,41.141241],[...
317 1419172121 [[-8.6481,41.152536],[-8.647461,41.15241],[-8....
318 1419171980 [[-8.571699,41.156073],[-8.570583,41.155929],[...
319 1419171420 [[-8.574561,41.180184],[-8.572248,41.17995],[-...
[320 rows x 2 columns]
I read them from the csv file this way:
train = pd.read_csv("path/train.csv",engine='python',error_bad_lines=False)
So I have these timestamps in Unix format. I want to convert them to UTC time and then extract the year, month, day and so on.
This is the code for the conversion from Unix timestamp to UTC date time:
train["TIMESTAMP"] = [float(time) for time in train["TIMESTAMP"]]
train["data_time"] = [datetime.datetime.fromtimestamp(time, datetime.timezone.utc) for time in train["TIMESTAMP"]]
To extract year and other information I do this:
train["year"] = train["data_time"].dt.year
train["month"] = train["data_time"].dt.month
train["day"] = train["data_time"].dt.day
train["hour"] = train["data_time"].dt.hour
train["min"] = train["data_time"].dt.minute
But I obtain this error when the execution arrives at the extraction point:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-30-d2249cabe965> in <module>()
67 train["TIMESTAMP"] = [float(time) for time in train["TIMESTAMP"]]
68 train["data_time"] = [datetime.datetime.fromtimestamp(time, datetime.timezone.utc) for time in train["TIMESTAMP"]]
---> 69 train["year"] = train["data_time"].dt.year
70 train["month"] = train["data_time"].dt.month
71 train["day"] = train["data_time"].dt.day
2 frames
/usr/local/lib/python3.7/dist-packages/pandas/core/indexes/accessors.py in __new__(cls, data)
478 return PeriodProperties(data, orig)
479
--> 480 raise AttributeError("Can only use .dt accessor with datetimelike values")
AttributeError: Can only use .dt accessor with datetimelike values
I have also read a lot of similar discussions, but I can't figure out why I get this error.
Edited:
So the train["TIMESTAMP"] data are like this:
1408039037
1408038611
1408039090
Then I do this with this data:
train["TIMESTAMP"] = [float(time) for time in train["TIMESTAMP"]]
train["data_time"] = [datetime.datetime.fromtimestamp(time, datetime.timezone.utc) for time in train["TIMESTAMP"]]
train["year"] = train["data_time"].dt.year
train["month"] = train["data_time"].dt.month
train["day"] = train["data_time"].dt.day
train["hour"] = train["data_time"].dt.hour
train["min"] = train["data_time"].dt.minute
train = train[["year", "month", "day", "hour","min"]]
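One likely cause of the AttributeError above is that the list comprehension builds plain Python datetime objects, so data_time can end up with object dtype, which the .dt accessor rejects. A sketch of a vectorized conversion that keeps the column datetimelike (using a few timestamps from the question):

```python
import pandas as pd

train = pd.DataFrame({'TIMESTAMP': [1408039037, 1408038611, 1408039090]})

# unit='s' interprets the values as Unix seconds; utc=True makes the column
# tz-aware; the result is a proper datetime64 column, so .dt works afterwards
train['data_time'] = pd.to_datetime(train['TIMESTAMP'], unit='s', utc=True)
train['year'] = train['data_time'].dt.year
train['month'] = train['data_time'].dt.month
```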
I'm trying to plot a section of a dataframe. The first column is formatted using the to_datetime method:
all_data['Date_Time_(GMT)'] = pd.to_datetime(all_data['Date_Time_(GMT)'])
...
all_data['Date_Time_(GMT)'].dtype
[out] dtype('<M8[ns]')
The second column is a bunch of integers:
all_data[new_column].dtype
[out] dtype('int64')
When I try to plot the two columns I get a parser error. Here is the code for the plot:
my_column = 'My Column'
start_date = '2020-08-11 09:28:37'
end_date = '2020-08-11 09:29:28'
new_plot = pd.DataFrame()
new_plot['Date_Time_(GMT)'] = all_data['Date_Time_(GMT)']
new_plot[my_column] = all_data[my_column]
mask = (new_plot['Date_Time_(GMT)'] > start_date) & (new_plot['Date_Time_(GMT)'] <= end_date)
new_plot = new_plot.loc[mask]
df = pd.DataFrame(new_plot, columns=[new_plot['Date_Time_(GMT)'], new_plot[my_column]])
df.plot(x='Date_Time_(GMT)', y=my_column, kind='line' )
plt.show()
Here is the error output:
ParserError Traceback (most recent call last)
pandas\_libs\tslibs\conversion.pyx in pandas._libs.tslibs.conversion._convert_str_to_tsobject()
pandas\_libs\tslibs\parsing.pyx in pandas._libs.tslibs.parsing.parse_datetime_string()
c:\users\user name\appdata\local\programs\python\python38\lib\site-packages\dateutil\parser\_parser.py in parse(timestr, parserinfo, **kwargs)
1373 else:
-> 1374 return DEFAULTPARSER.parse(timestr, **kwargs)
1375
c:\users\user name\appdata\local\programs\python\python38\lib\site-packages\dateutil\parser\_parser.py in parse(self, timestr, default, ignoretz, tzinfos, **kwargs)
648 if res is None:
--> 649 raise ParserError("Unknown string format: %s", timestr)
650
ParserError: Unknown string format: Date_Time_(GMT)
Any ideas what completely obvious thing I've done wrong?
Setting the index should do the trick
df = all_data[
    (all_data['Date_Time_(GMT)'] > start_date) &
    (all_data['Date_Time_(GMT)'] <= end_date)
].copy().set_index('Date_Time_(GMT)')
df.plot(y=my_column, kind='line')
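The ParserError itself most likely comes from the pd.DataFrame(..., columns=[...]) call: columns= expects column names, but it was handed the Series objects themselves, which is how the literal string 'Date_Time_(GMT)' ended up being parsed as a date. A sketch of that fix, with made-up data:

```python
import pandas as pd

my_column = 'My Column'
all_data = pd.DataFrame({
    'Date_Time_(GMT)': pd.to_datetime(['2020-08-11 09:28:40', '2020-08-11 09:29:00']),
    my_column: [1, 2],
})

# pass column *names* to columns=, not the Series themselves
df = pd.DataFrame(all_data, columns=['Date_Time_(GMT)', my_column])
```

After this, df.plot(x='Date_Time_(GMT)', y=my_column, kind='line') works as written in the question.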
I want to find the country names for a data frame column of top-level domains such as 'de', 'it', 'us', using the iso3166 package.
There are domains in the dataset that do not exist in iso3166, so a ValueError got raised.
I tried to work around the error by letting the code return Boolean values, but it runs for a really long time. It would be great to know how to speed it up.
Sample data: df['country']
0 an
1 de
2 it
My code (note: the code does not raise the KeyError; my question is how to make it faster):
df['country_name'] = df['country'].apply(lambda x: countries.get(x)[0] if \
    df['country'].str.find(x).any() == countries.get(x)[1].lower() else 'unknown')
df['country'] is the data frame column, and countries.get() gets country names from iso3166.
df['country'].str.find(x).any() finds top-level domains and countries.get(x)[1].lower() returns top-level domains. If they are the same, I use countries.get(x)[0] to return the country name.
Expected output
country country_name
an unknown
de Germany
it Italy
Error if I run df['country_name'] = df['country'].apply(lambda x: countries.get(x)[0]) (I renamed the dataframe so it's different from the error message)
KeyError Traceback (most recent call last)
<ipython-input-110-d51176ce2978> in <module>
----> 1 bf['country_name'] = bf['country'].apply(lambda x: countries.get(x)[0])
/opt/anaconda3/lib/python3.8/site-packages/pandas/core/series.py in apply(self, func, convert_dtype, args, **kwds)
3846 else:
3847 values = self.astype(object).values
-> 3848 mapped = lib.map_infer(values, f, convert=convert_dtype)
3849
3850 if len(mapped) and isinstance(mapped[0], Series):
pandas/_libs/lib.pyx in pandas._libs.lib.map_infer()
<ipython-input-110-d51176ce2978> in <lambda>(x)
----> 1 bf['country_name'] = bf['country'].apply(lambda x: countries.get(x)[0])
/opt/anaconda3/lib/python3.8/site-packages/iso3166/__init__.py in get(self, key, default)
358
359 if r == NOT_FOUND:
--> 360 raise KeyError(key)
361
362 return r
KeyError: 'an'
A little error handling and defining your logic outside of the apply() method should get you where you want to go. Something like:
def get_country_name(x):
    try:
        return countries.get(x)[0]
    except KeyError:
        return 'unknown'

df['country_name'] = df['country'].apply(get_country_name)
This is James Tollefson's answer narrowed down to its core; I didn't want to change his answer too much. Here's the implementation:
def get_country_name(x: str) -> str:
    # pass a default so iso3166's get() doesn't raise KeyError for unknown codes
    return countries[x][0] if countries.get(x, None) else 'unknown'

df['country_name'] = df['country'].apply(get_country_name)
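Both answers still call into apply() once per row. Since the number of distinct domains is tiny compared to the number of rows, building the code-to-name table once and using Series.map is usually much faster. A sketch, where the two-entry dict is a stand-in for a table built from the real iso3166 data:

```python
import pandas as pd

df = pd.DataFrame({'country': ['an', 'de', 'it']})

# In the real code this mapping would be built once from iso3166 rather than
# hardcoded, e.g. something like {c.alpha2.lower(): c.name for c in countries}
code_to_name = {'de': 'Germany', 'it': 'Italy'}

# map() does a dict lookup per element; codes missing from the table become
# NaN, which fillna() then turns into 'unknown'
df['country_name'] = df['country'].map(code_to_name).fillna('unknown')
```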
I have issues with converting dates in an imported .txt file and I wonder what I'm doing wrong.
I import the data by:
df_TradingMonthlyDates = pd.read_csv(TradingMonthlyDates, dtype=str, sep=',') # header=True,
and it looks like the following table (the dates represent the start/end of each month, under the header Date):
Date
0 2008-12-30
1 2008-12-31
2 2009-01-01
3 2009-01-02
4 2009-01-29
.. ...
557 2020-06-29
558 2020-06-30
559 2020-07-01
560 2020-07-02
561 2020-07-30
.. ...
624 2021-11-30
625 2021-12-01
626 2021-12-02
627 2021-12-30
628 2021-12-31
[629 rows x 1 columns]
<class 'pandas.core.frame.DataFrame'>
I then calculate today's date:
df_EndDate = datetime.now().date()
I'm trying to apply the data above in this function to get the closest date before a given date (given date = today's date in my case):
# https://stackoverflow.com/questions/32237862/find-the-closest-date-to-a-given-date
def nearest(items, pivot):
    return min([i for i in items if i < pivot], key=lambda x: abs(x - pivot))
date_output = nearest(df_TradingMonthlyDates, df_EndDate)
# date_output should be = 2020-07-02 given today's date of 2020-07-12
The error messages I receive say that df_TradingMonthlyDates is not in a date format. So I have tried to convert the dataframe to datetime format, but I can't make it work.
What I have tried in order to convert the data to date format:
# df_TradingMonthlyDates["Date"] = pd.to_datetime(df_TradingMonthlyDates["Date"], format="%Y-%m-%d")
# df_TradingMonthlyDates = datetime.strptime(df_TradingMonthlyDates, "%Y-%m-%d").date()
# df_TradingMonthlyDates['Date'] = df_TradingMonthlyDates['Date'].apply(lambda x: pd.to_datetime(x[0], format="%Y-%m-%d"))
# df_TradingMonthlyDates = df_TradingMonthlyDates.iloc[1:]
# print(df_TradingMonthlyDates)
# df_TradingMonthlyDates = datetime.strptime(str(df_TradingMonthlyDates), "%Y-%m-%d").date()
# for line in split_source[1:]: # skip the first line
Code:
import pandas as pd
from datetime import datetime
# Version 1
TradingMonthlyDates = "G:/MonthlyDates.txt"
# Import file where all the first/end month date exists
df_TradingMonthlyDates = pd.read_csv(TradingMonthlyDates, dtype=str, sep=',') # header=True,
print(df_TradingMonthlyDates)
# https://community.dataquest.io/t/datetime-and-conversion/213425
# df_TradingMonthlyDates["Date"] = pd.to_datetime(df_TradingMonthlyDates["Date"], format="%Y-%m-%d")
# df_TradingMonthlyDates = datetime.strptime(df_TradingMonthlyDates, "%Y-%m-%d").date()
# df_TradingMonthlyDates['Date'] = df_TradingMonthlyDates['Date'].apply(lambda x: pd.to_datetime(x[0], format="%Y-%m-%d"))
# df_TradingMonthlyDates = df_TradingMonthlyDates.iloc[1:]
# print(df_TradingMonthlyDates)
# df_TradingMonthlyDates = datetime.strptime(str(df_TradingMonthlyDates), "%Y-%m-%d").date()
# for line in split_source[1:]: # skip the first line # maybe header is the problem
print(type(df_TradingMonthlyDates))
df_TradingMonthlyDates = df_TradingMonthlyDates.datetime.strptime(df_TradingMonthlyDates, "%Y-%m-%d")
df_TradingMonthlyDates = df_TradingMonthlyDates.time()
print(df_TradingMonthlyDates)
df_EndDate = datetime.now().date()
print(type(df_EndDate))
# https://stackoverflow.com/questions/32237862/find-the-closest-date-to-a-given-date
def nearest(items, pivot):
    return min([i for i in items if i < pivot], key=lambda x: abs(x - pivot))
date_output = nearest(df_TradingMonthlyDates, df_EndDate)
The error messages differ depending on how I tried to convert the data type, but I read them all as saying that my date conversion did not succeed:
df_TradingMonthlyDates = df_TradingMonthlyDates.datetime.strptime(df_TradingMonthlyDates, "%Y-%m-%d")
Traceback (most recent call last):
File "g:/till2.py", line 25, in <module>
df_TradingMonthlyDates = df_TradingMonthlyDates.datetime.strptime(df_TradingMonthlyDates, "%Y-%m-%d")
File "C:\Users\ID\AppData\Roaming\Python\Python38\site-packages\pandas\core\generic.py", line 5274, in __getattr__
return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute 'datetime'
df_TradingMonthlyDates["Date"] = pd.to_datetime(df_TradingMonthlyDates["Date"], format="%Y-%m-%d")
Traceback (most recent call last):
File "g:/till2.py", line 40, in <module>
date_output = nearest(df_TradingMonthlyDates, df_EndDate)
File "g:/till2.py", line 38, in nearest
return min([i for i in items if i < pivot], key=lambda x: abs(x - pivot))
File "g:/till2.py", line 38, in <listcomp>
return min([i for i in items if i < pivot], key=lambda x: abs(x - pivot))
TypeError: '<' not supported between instances of 'str' and 'datetime.date'
Edit: added Method 3, which might be the easiest, using .loc and then .iloc.
You could take a slightly different approach (with Method #1 or Method #2 below) by taking the absolute minimum of the difference between today's date and the data. The key thing you weren't doing was wrapping pd.to_datetime() around the datetime.date object df_EndDate to turn it into a pandas datetime that can be compared against your Date column; both sides have to be the same kind of datetime before they can be compared.
Method 1:
import pandas as pd
import datetime as dt
df_TradingMonthlyDates = pd.DataFrame({'Date': {
    '0': '2008-12-30',
    '1': '2008-12-31',
    '2': '2009-01-01',
    '3': '2009-01-02',
    '4': '2009-01-29',
    '557': '2020-06-29',
    '558': '2020-06-30',
    '559': '2020-07-01',
    '560': '2020-07-02',
    '561': '2020-07-30',
    '624': '2021-11-30',
    '625': '2021-12-01',
    '626': '2021-12-02',
    '627': '2021-12-30',
    '628': '2021-12-31',
}})
df_TradingMonthlyDates['Date'] = pd.to_datetime(df_TradingMonthlyDates['Date'])
df_TradingMonthlyDates['EndDate'] = pd.to_datetime(dt.datetime.now().date())
df_TradingMonthlyDates['diff'] = (df_TradingMonthlyDates['Date'] - df_TradingMonthlyDates['EndDate'])
a = min(abs(df_TradingMonthlyDates['diff']))
df_TradingMonthlyDates = df_TradingMonthlyDates.loc[(df_TradingMonthlyDates['diff'] == a)
                                                    | (df_TradingMonthlyDates['diff'] == -a)]
df_TradingMonthlyDates
output 1:
Date EndDate diff
560 2020-07-02 2020-07-11 -9 days
If you don't want the extra columns and just the date, then assign variables to create series rather than new columns:
Method 2:
d = pd.to_datetime(df_TradingMonthlyDates['Date'])
t = pd.to_datetime(dt.datetime.now().date())
e = (d-t)
a = min(abs(e))
df_TradingMonthlyDates = df_TradingMonthlyDates.loc[(e == a) | (e == -a)]
df_TradingMonthlyDates
output 2:
Date
560 2020-07-02
Method 3:
df_TradingMonthlyDates['Date'] = pd.to_datetime(df_TradingMonthlyDates['Date'])
date_output = df_TradingMonthlyDates.sort_values('Date') \
    .loc[df_TradingMonthlyDates['Date'] <=
         pd.to_datetime(dt.datetime.now().date())] \
    .iloc[-1, :]
date_output
output 3:
Date 2020-07-02
Name: 560, dtype: datetime64[ns]
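When the Date column is sorted, the "closest date at or before today" lookup can also be done directly with Series.asof, avoiding the min()/abs() scan. A sketch using a few dates from the question:

```python
import pandas as pd

dates = pd.to_datetime(['2020-06-29', '2020-06-30', '2020-07-01',
                        '2020-07-02', '2020-07-30'])
s = pd.Series(dates, index=dates)  # index and values are both the dates

# asof returns the last value whose index is <= the lookup key
nearest_before = s.asof(pd.Timestamp('2020-07-12'))
```

asof requires a sorted index, which the question's month-boundary file already satisfies.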