Hi guys, I would appreciate some help. I'm analyzing a series (a set of columns) that has a date format like this:
'1060208'
The first three digits represent the year, where the first digit, '1', exists for comparison purposes. In the case above, the year is 2006. The 4th and 5th digits represent the month and the rest represent the day. I want to convert these dates to something like this:
106-02-08
So that I can use .groupby to sort per month or year. Here is my code so far:
class Data:
    def convertdate(self):
        self.dates.apply(lambda x: x[0:3] + '-' + x[3:5] + '-' + x[5:7])
        return self.dates
When I run this, I get the error:
TypeError: 'int' object is not subscriptable
Can you please tell me what went wrong? Or can you suggest some alternative way to do this? Thank you so much.
Assuming that dates is a list of ints, you can do:
input_dates = [1060208, 1060209]
input_dates_to_str = map(lambda x: str(x), input_dates)
output = list(map(lambda x: '-'.join([x[0:3], x[3:5], x[5:]]), input_dates_to_str))
Anyway, when working with dates I suggest using the datetime package.
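As a rough sketch of the datetime suggestion: the helper name parse_prefixed_date is made up for illustration, and it assumes (as the question states) that the leading '1' is only a marker and that the next two digits are a year in the 2000s, so 1060208 means 2006-02-08.

```python
from datetime import datetime

def parse_prefixed_date(value):
    """Parse an int such as 1060208 into a date, ignoring the leading digit.

    Hypothetical helper: assumes the layout <flag><yy><mm><dd>.
    """
    text = str(value)  # ints are not subscriptable, so convert to str first
    # Drop the leading flag digit and parse the remaining yymmdd part.
    return datetime.strptime(text[1:], "%y%m%d").date()

input_dates = [1060208, 1060209]
parsed = [parse_prefixed_date(d) for d in input_dates]
print(parsed)
```

If the first digit turns out to carry real meaning, keep it in a separate variable instead of discarding it.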
Quick answer to your question: 1060208 is an integer, integers are not subscriptable, so you need to change it to a string.
Some other thoughts:
Where is your data? Is this all in a pandas dataframe? If so, why are you writing classes to convert your data? There are better and faster ways of doing it, such as converting your integer date to a string, getting rid of the first digit, and converting the rest to datetime.
What does "where 1 is put for comparison purposes" mean? It could have been recorded that way but obviously a date and a flag (I assume it is some kind of flag) should not be represented in the same field. So why don't you put that 1 in a field of its own?
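The pandas route described above might look like the following sketch. The column names dates, value and date are invented for the example, and it assumes the remaining six digits are yymmdd with years in the 2000s, as the question implies.

```python
import pandas as pd

# Hypothetical sample data; "dates" holds the integer-encoded dates.
df = pd.DataFrame({"dates": [1060208, 1060209, 1060315],
                   "value": [10, 20, 30]})

# Convert to string, drop the leading flag digit, then parse as yymmdd.
digits = df["dates"].astype(str).str[1:]
df["date"] = pd.to_datetime(digits, format="%y%m%d")

# With a real datetime column, grouping by month (or year) is direct.
monthly = df.groupby(df["date"].dt.month)["value"].sum()
print(monthly)
```

Grouping on df["date"].dt.year works the same way.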
I have a date column in my DataFrame say df_dob and it looks like -
id     DOB
23312  31-12-9999
1482   31-12-9999
807    #VALUE!
2201   06-12-1925
653    01/01/1855
108    01/01/1855
768    1967-02-20
What I want to print is a list of unique years, like `['9999', '1925', '1855', '1967']`.
Basically, I just want to use this list to check whether any unwanted year is present.
I have tried the code pasted below, but I get ValueError: time data 01/01/1855 doesn't match format specified, and I could not resolve it.
df_dob['DOB'] = df_dob['DOB'].replace('01/01/1855 00:00:00', '1855-01-01')
df_dob['DOB'] = pd.to_datetime(df_dob.DOB, format='%Y-%m-%d')
df_dob['DOB'] = df_dob['DOB'].dt.strftime('%Y-%m-%d')
print(np.unique(df_dob['DOB']))
# print(list(df_dob['DOB'].year.unique()))
P.S. When I print df_dob['DOB'], I get values like 1967-02-20 00:00:00
Can you try this?
df_dob["DOB"] = pd.to_datetime(df_dob["DOB"])
df_dob['YOB'] = df_dob['DOB'].dt.strftime('%Y')
Use pandas' unique for this. And on year only.
So try:
print(df_dob['DOB'].dt.year.unique())
Also, you don't need to stringify your time, and you don't need to replace anything; pandas is smart enough to do it for you. So your overall code becomes:
df_dob['DOB'] = pd.to_datetime(df_dob.DOB) # No need to pass format unless there is some specific anomaly
print(df_dob['DOB'].dt.year.unique())
Edit:
Another method:
Since you have an out-of-bounds problem,
another method you can try is not converting the values to datetime at all, but instead finding the four-digit number in each value using a regex.
So,
df_dob['DOB'].str.extract(r'(\d{4})')[0].unique()
The [0] is there because unique() is a method of pd.Series, not of a DataFrame, so we take the first column of the DataFrame that extract returns.
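Here is a small sketch of the regex approach, run on the sample values from the question. The dropna() call is an addition not in the original answer; it removes the missing value produced by rows such as '#VALUE!' that contain no four-digit run.

```python
import pandas as pd

# Sample values copied from the question's DOB column.
df_dob = pd.DataFrame({
    "id": [23312, 1482, 807, 2201, 653, 108, 768],
    "DOB": ["31-12-9999", "31-12-9999", "#VALUE!", "06-12-1925",
            "01/01/1855", "01/01/1855", "1967-02-20"],
})

# extract() returns a DataFrame with one column per capture group;
# column 0 holds the first run of four digits found in each string.
years = df_dob["DOB"].str.extract(r"(\d{4})")[0]

# Rows without a four-digit run become NaN, so drop them before listing.
unique_years = years.dropna().unique().tolist()
print(unique_years)
```

Because the values stay strings, years such as 9999 that fall outside the pandas timestamp range cause no error.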
The first thing you need to know is whether the resulting values (which you said look like 1967-02-20 00:00:00) are datetimes or not. Checking is as simple as calling df_dob.info().
If the result says similar to datetime64[ns] for the DOB column, you're good. If not you'll need to cast it as a DateTime. You have a couple of different formats so that might be part of your problem. Also, because there're several ways of doing this and it's a separate question, I'm not addressing it.
We're going to leverage the speed of sets, plus a bit of pandas, and then convert the result back to a list, since that is the final form you wanted.
years = list({i for i in df_dob['DOB'].dt.year})
And just a side note, you can't use [] instead of list() as you'll end with a list with a single element that's a set.
That's a list as you indicated. If you want it as a column, you won't get unique values
Nitish's answer will also work but give you something like: array([9999, 1925, 1855, 1967])
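To illustrate the list() versus [] point, here is a minimal sketch. The series dob is made-up data that is already parseable as datetimes, which sidesteps the mixed-format issue this answer deliberately leaves aside.

```python
import pandas as pd

# Hypothetical, already-clean dates (ISO format) for illustration only.
dob = pd.to_datetime(pd.Series(["1967-02-20", "1925-12-06", "1967-05-01"]))

# A set comprehension removes duplicates; list() unpacks the set's elements.
years = list({y for y in dob.dt.year})

# Square brackets instead build a one-element list whose element is the set.
wrapped = [{y for y in dob.dt.year}]

print(sorted(years))
print(len(wrapped))
```

Sets are unordered, so sort the list if the order of the years matters.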
I have a column of yearly data with months 1-12 stored as strings. I am trying to convert it to datetime.month values and then merge it with the main df, which already has dt.month values derived from some date.
usage_12month["MONTH"]= pd.to_datetime(usage_12month["MONTH"])
usage_12month['MONTH'] = usage_12month['MONTH'].dt.month
display(usage_12month)
merge = pd.merge(df,usage_12month, how='left',on='MONTH')
ValueError: Given date string not likely a datetime.
I get the error on the first line.
.dt.month on a datetime returns an int. So I'm assuming you want to convert usage_12month["MONTH"] from a string to an int to be able to merge it with the other df.
There is a simpler way than converting it to a datetime. You could replace the first two lines with usage_12month["MONTH"]= pd.to_numeric(usage_12month["MONTH"]) and it should work.
--
The error you get on the first line occurs because you don't tell the to_datetime function how to interpret the string as a datetime (the number in the string could represent a day, an hour...).
To make your approach work, you have to pass a 'format' parameter to the to_datetime function. In your case, the string contains only the month number, so the format string would be '%m' (see https://strftime.org/): usage_12month["MONTH"]= pd.to_datetime(usage_12month["MONTH"], format = '%m')
When you supply the function with a "usual" date format like 'yyyy/mm/dd', it guesses how to interpret it, but it is always better to provide a format explicitly.
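Both options can be sketched side by side. The usage column and the sample rows are invented for the example; only the MONTH column and the merge call come from the question.

```python
import pandas as pd

# Hypothetical data: months stored as strings, plus a main frame with dates.
usage_12month = pd.DataFrame({"MONTH": ["1", "2", "12"],
                              "usage": [5.0, 7.5, 9.0]})
df = pd.DataFrame({"date": pd.to_datetime(["2021-01-15", "2021-02-03",
                                           "2021-12-24"])})
df["MONTH"] = df["date"].dt.month

# Option 1: convert the month strings straight to integers.
as_int = pd.to_numeric(usage_12month["MONTH"])

# Option 2: parse with an explicit format, then take the month back out.
via_datetime = pd.to_datetime(usage_12month["MONTH"], format="%m").dt.month

# Either result gives an integer key that matches df["MONTH"] in the merge.
usage_12month["MONTH"] = as_int
merged = pd.merge(df, usage_12month, how="left", on="MONTH")
print(merged)
```

Option 1 avoids creating placeholder dates just to read the month back out.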
Currently, I'm working with a column called 'amount' that contains transaction amounts. This column has the string data type and I would like to convert it to a numeric data type.
The code I wrote does convert the strings to numbers, but when I remove the ',' as in the code below, the decimal digits get merged into the integer part, which produces extremely high values in my data. So, 100000,95 became 10000095. I used the following code to convert my string data type to numbers:
df["amount"] = df["amount"].str.replace(',', '')
df['amount'] = pd.to_numeric(df['amount'], errors='coerce')
Can someone help me with this problem?
EDIT: Not all values contain decimals. I'm looking for a solution for only the values that contain a ','.
If you need floats, replace the comma with a period instead of removing it:
df["amount"] = df["amount"].str.replace(',', '.')
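Put together with the to_numeric call from the question, a minimal sketch could look like this. The sample amounts are made up, and it assumes the comma is only ever a decimal separator (no thousands separators in the data).

```python
import pandas as pd

# Hypothetical amounts: some with a decimal comma, some without.
df = pd.DataFrame({"amount": ["100000,95", "250", "1,5"]})

# Turn the decimal comma into a decimal point; values without a comma
# are left untouched. regex=False treats ',' as a literal character.
df["amount"] = df["amount"].str.replace(",", ".", regex=False)

# Now the strings parse as floats; unparseable values become NaN.
df["amount"] = pd.to_numeric(df["amount"], errors="coerce")
print(df)
```

Values that contain no comma simply convert to whole-number floats, which covers the case mentioned in the EDIT.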
So I am trying to transform the data I have into a form I can work with. I have a column called "season/ teams" that looks something like "1989-90 Bos".
I would like to transform it into a string like "1990" in Python using a pandas dataframe. I read some tutorials about pd.replace() but can't seem to find a use for it in my scenario. How can I solve this? Thanks for the help.
FYI, I have 16k lines of data.
A snapshot of the data I am working with:
To change that field from "1989-90 BOS" to "1990" you could do the following:
df['Yr/Team'] = df['Yr/Team'].str[:2] + df['Yr/Team'].str[5:7]
If the structure of your data will always be the same, this is an easy way to do it.
If the data in your Yr/Team column has a standard format you can extract the values you need based on their position.
import pandas as pd
df = pd.DataFrame({'Yr/Team': ['1990-91 team'], 'data': [1]})
df['year'] = df['Yr/Team'].str[0:2] + df['Yr/Team'].str[5:7]
print(df)
        Yr/Team  data  year
0  1990-91 team     1  1991
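One caveat with positional slicing, shown as a sketch: a season that crosses a century, such as "1999-00", would come out as "1900". A possible alternative, assuming the season always ends in the year after the start year, is to add one to the four-digit start year.

```python
import pandas as pd

# Hypothetical rows, including a season that crosses the century boundary.
df = pd.DataFrame({"Yr/Team": ["1989-90 Bos", "1999-00 Bos"]})

# Positional slicing: first two digits of the start year + two-digit end year.
sliced = df["Yr/Team"].str[:2] + df["Yr/Team"].str[5:7]

# Alternative: take the four-digit start year and add one, then go back to str.
df["year"] = (df["Yr/Team"].str[:4].astype(int) + 1).astype(str)

print(sliced.tolist())
print(df["year"].tolist())
```

If the data never crosses a century boundary, the slicing version is enough.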
You can use pd.Series.str.extract to extract a pattern from a column of string. For example, if you want to extract the first year, second year and team in three different columns, you can use this:
df["year"].str.extract(r"(?P<start_year>\d+)-(?P<end_year>\d+) (?P<team>\w+)")
Note the use of named groups, which automatically name the resulting columns.
See https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.extract.html
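A short sketch of the named-group pattern on made-up rows; the column name Yr/Team is borrowed from the other answers and may differ in your data.

```python
import pandas as pd

# Hypothetical sample rows in the "<start>-<end> <team>" layout.
df = pd.DataFrame({"Yr/Team": ["1989-90 Bos", "1990-91 LAL"]})

# Each named group (?P<name>...) becomes a column with that name.
pattern = r"(?P<start_year>\d+)-(?P<end_year>\d+) (?P<team>\w+)"
parts = df["Yr/Team"].str.extract(pattern)

print(parts)
```

Rows that do not match the pattern come back as NaN in every column, which makes malformed entries easy to spot.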
I have two date strings in a list (i.e. dateList = ['2013-11-26 08:09:51', '2013-11-26 01:19:51']).
If it is possible to compare date strings of a specified format, please provide a solution that returns the latest date from the list.
Thanks in Advance
...please provide a solution by returning latest date from the list.
max(dateList)
Because of the formatting of your strings (i.e. starting with the largest time unit and working step by step to the smallest, additional zeroes for single-digit values), they can be directly compared to one another.
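As a quick check of that claim, the sketch below compares plain string comparison with parsing through datetime.strptime. The format string is an assumption based on the two sample values in the question.

```python
from datetime import datetime

dateList = ['2013-11-26 08:09:51', '2013-11-26 01:19:51']

# Zero-padded, largest-unit-first strings sort the same way the dates do.
latest_as_string = max(dateList)

# Parsing explicitly gives the same answer and also works for formats
# that do not sort correctly as plain text (e.g. day-first dates).
latest_parsed = max(
    dateList,
    key=lambda s: datetime.strptime(s, "%Y-%m-%d %H:%M:%S"),
)

print(latest_as_string)
print(latest_parsed)
```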
You have asked for two different things: comparing the dates and getting the latest date.
To get the latest date, use @jonrsharpe's solution.
To compare the dates, you can compare them as strings using all(). I'm using all() so it works with any number of dates and not just two:
dateList = ['2013-11-26 08:09:51', '2013-11-26 08:09:51']
if all(dateList[0] == x for x in dateList):
    print("Equal")
else:
    print("Not equal")
Use dateutil library:
from dateutil import parser
dateList = [parser.parse(date) for date in dateList]
latest_date = max(dateList)
It will give you latest date.
:)