Change a string in a column to an integer in pandas


[Image: table of movie data]
As you can see, the release date column stores the month as a string.
I want to change all of them into numbers.
For example, Dec 18, 2009 can just be 12. I am not interested in the year.
Update: I think I got it. The values still show up as objects when I call .info(), but at least I was able to get the numbers.

You can convert the column to datetime and use Series.dt.month, which returns the month as an integer:
df['release date'] = pd.to_datetime(df['release date']).dt.month
print(df)
release date
0 12
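
As a minimal sketch of the same idea, the example below passes an explicit format so that pandas does not have to guess it. The format string "%b %d, %Y" is an assumption based on the single value shown in the question (Dec 18, 2009), and the second row is hypothetical sample data.

```python
import pandas as pd

# Hypothetical sample; only "Dec 18, 2009" comes from the question.
df = pd.DataFrame({"release date": ["Dec 18, 2009", "Jul 4, 2012"]})

# Parse with an explicit format, then keep only the month number.
# Assumes every value looks like "Dec 18, 2009" (abbreviated English month).
df["release date"] = pd.to_datetime(
    df["release date"], format="%b %d, %Y"
).dt.month

print(df["release date"].tolist())
print(df["release date"].dtype)  # an integer dtype, not object
```

If .info() still reports object after this, the assignment was probably made to a different column or to a copy of the frame.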

Related

Convert pandas datetime column to Excel serial date

I have a pandas dataframe with date values. I need to convert them to the number Excel shows for a date in General format (not to a date string), in order to match primary key values in SQL, which are, unfortunately, stored in General format. Is it possible to do this in Python, or is converting the column to General format in Excel the only way?
Here is what the dataframe's column looks like:
ID Desired Output
1/1/2022 44562
7/21/2024 45494
1/1/1931 11324

Yes, it's possible. The general format in Excel starts counting the days from the date 1900-1-1.
You can calculate a time delta between the dates in ID and 1900-1-1.
Inspired by this post you could do...
import pandas as pd

data = pd.DataFrame({'ID': ['1/1/2022','7/21/2024','1/1/1931']})
data['General format'] = (
    pd.to_datetime(data["ID"]) - pd.Timestamp("1900-01-01")
).dt.days + 2
print(data)
ID General format
0 1/1/2022 44562
1 7/21/2024 45494
2 1/1/1931 11324
The +2 is because:
Excel starts counting from 1 instead of 0
Excel incorrectly considers 1900 as a leap year
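
An equivalent way to write this, shown here only as a sketch, is to fold the +2 into the reference date: 1899-12-30 is two days before 1900-01-01, so subtracting it gives the same serial numbers. The name EXCEL_EPOCH is made up for this example, and the result only matches Excel for dates after February 28, 1900.

```python
import pandas as pd

data = pd.DataFrame({"ID": ["1/1/2022", "7/21/2024", "1/1/1931"]})

# Hypothetical constant: 1900-01-01 shifted back by the two-day offset.
EXCEL_EPOCH = pd.Timestamp("1899-12-30")

# Parse month/day/year strings, then count whole days since the epoch.
dates = pd.to_datetime(data["ID"], format="%m/%d/%Y")
data["General format"] = (dates - EXCEL_EPOCH).dt.days

print(data)
```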

Excel stores dates as sequential serial numbers so that they can be
used in calculations. By default, January 1, 1900 is serial number 1,
and January 1, 2008 is serial number 39448 because it is 39,447 days
after January 1, 1900.
-Microsoft's documentation
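So you can calculate the serial number from the difference between your date and January 1, 1900. Note that Excel's own count includes the nonexistent February 29, 1900, so with real calendar arithmetic the offset is + 2 rather than + 1 for any date after February 28, 1900. For example, January 1, 2008 is 39,446 actual days after January 1, 1900, and 39446 + 2 = 39448.
See How to calculate number of days between two given dates.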

Pandas reads date from CSV incorrectly

I am very new to Python, and finding it very frustrating.
I have a CSV that I am importing, but it's reading the date column incorrectly.
In the Month column, I have the 1st of each month, so it should read (yyyy-mm-dd):
2020-01-01
2020-02-01
2020-03-01
etc
however, it's reading it as (yyyy-dd-mm)
2020-01-01
2020-01-02
2020-01-03
etc
I've tried several conversion functions from stackoverflow as well as other websites, but they either just don't work, or do nothing.
My import is as follows:
try:
    collections_data = pd.read_csv('./monthly_collections.csv')
    print("Collections Data imported successfully.")
except Exception as e:
    print("Error importing Collections Data!")
I have tried the parse_dates parameter on the import, but it doesn't help.
If I then try this:
temp = pd.to_datetime(collections_data['Collections Month'], format='%m/%d/%Y')
temp
then I get output in which, as you can see, the months are read as the days. In other words, it shows individual days of the month instead of the 1st day of each month.
I'd greatly appreciate some help to get these dates corrected, as I need to do some date calculations on them, and also join two tables based on this date - which is going to be my next problem.
Kind Regards

Inferring date format
Some dates are ambiguous, while others aren't. Consider these dates:
2020-27-01
2020-12-14
2020-01-02
11-10-12
In examples #1 and #2 we can easily infer the date format. In example #1, the first four digits have to be the year (there's no 2020th month or 2020th day of a month), the following two digits have to be the day of the month (there's no 27th month and we already have the year) and the last two digits are the month (we already have the year and the day of the month). We can use a similar approach for example #2.
For example #3, is that the first day of the second month, or the second day of the first month? It's impossible to tell without more information. If, for instance, we had the following sequence of dates: '2020-22-01', '2020-25-01', '2020-01-02', it would be reasonable to infer that '2020-01-02' refers to the first day of the second month, otherwise we would not be able to parse the previous two dates.
In example #4, it's impossible to infer the date format. Any pair of digits would make sense as a year, month or day. (Using pandas.read_csv() you can make use of the dayfirst and yearfirst kwargs, or you can explicitly declare your date format with pandas.to_datetime(some_column, format=...).)
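
To make the ambiguity concrete, here is a small sketch using pandas.to_datetime on single strings. The sample string 01/02/2020 is chosen for illustration; 2020-27-01 is example #1 from the list above.

```python
import pandas as pd

ambiguous = "01/02/2020"

# Default (dayfirst=False): read month-first, i.e. January 2nd.
month_first = pd.to_datetime(ambiguous, dayfirst=False)

# dayfirst=True: read day-first, i.e. February 1st.
day_first = pd.to_datetime(ambiguous, dayfirst=True)

# An explicit format removes the guesswork entirely (year-day-month here).
explicit = pd.to_datetime("2020-27-01", format="%Y-%d-%m")

print(month_first.date(), day_first.date(), explicit.date())
```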
Your problem
Your dates are ambiguous: from what you've included in your question, it is not possible to infer whether they are in a day-first format (dd-mm) or a month-first format (mm-dd). pandas defaults to dayfirst=False, so an ambiguous date such as 01/02/2020 is read month-first, as the second day of the first month, unless you specify otherwise. See pandas.read_csv().
dayfirst : bool, default False
DD/MM format dates, international and European format.
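This means that in order to parse 01/02 (DD/MM) or 01/02/2020 (European format) as the first day of the second month, you need to pass dayfirst=True together with parse_dates, e.g. pandas.read_csv('somefile.csv', ..., parse_dates=[...], dayfirst=True). A string that is already in ISO format, such as 2020-02-01, is read as year-month-day by default.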
I've tried several conversion functions from stackoverflow as well as other websites, but they either just don't work, or do nothing.
You haven't provided the code that you've used that didn't work, nor the code which you used which parsed your dates as month first. If you include an example of what you actually tried I can make a specific comment.
In your question you say that your date format is in (yyyy-mm-dd) but you passed format='%m/%d/%Y' and in your screenshots you have '/' and '-' as your separator in different places. So I'm not sure what your original dates look like.
What you passed to the format kwarg means the first two digits are zero-padded months (e.g. 04) followed by a '/', then zero-padded days, followed by '/' and then the year as yyyy. If what you wrote at the beginning of your question is correct, you should have passed format='%Y-%m-%d' (see the strftime format codes).
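
A minimal sketch of that suggestion, assuming the file really does contain yyyy-mm-dd strings as described at the top of the question. The CSV is simulated with io.StringIO, and the Amount column is hypothetical; only the Collections Month column name comes from the question.

```python
import io
import pandas as pd

# Stand-in for ./monthly_collections.csv (contents are assumed, not known).
csv_text = (
    "Collections Month,Amount\n"
    "2020-01-01,100\n"
    "2020-02-01,150\n"
    "2020-03-01,125\n"
)
collections_data = pd.read_csv(io.StringIO(csv_text))

# Declare the format explicitly: year, then month, then day.
collections_data["Collections Month"] = pd.to_datetime(
    collections_data["Collections Month"], format="%Y-%m-%d"
)

months = collections_data["Collections Month"].dt.month.tolist()
print(months)
```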

Try https://towardsdatascience.com/4-tricks-you-should-know-to-parse-date-columns-with-pandas-read-csv-27355bb2ad0e
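Essentially, try the dayfirst optional argument of the read_csv function.
It only affects columns that read_csv is asked to parse as dates, so set it to True together with parse_dates:
collections_data = pd.read_csv('./monthly_collections.csv', parse_dates=['Collections Month'], dayfirst=True)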
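
The sketch below shows the effect on day-first data. It assumes the file holds dd/mm/yyyy strings, which the question does not confirm, and simulates the CSV with io.StringIO; the Amount column is hypothetical.

```python
import io
import pandas as pd

# Stand-in for the real file: 1st of January, February and March, day-first.
csv_text = (
    "Collections Month,Amount\n"
    "01/01/2020,100\n"
    "01/02/2020,150\n"
    "01/03/2020,125\n"
)

# parse_dates selects the column to parse; dayfirst tells pandas how to read it.
collections_data = pd.read_csv(
    io.StringIO(csv_text),
    parse_dates=["Collections Month"],
    dayfirst=True,
)

parsed_months = collections_data["Collections Month"].dt.month.tolist()
print(parsed_months)
```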

Convert column to datetime format, in a leap year

I'm new to Python and programming in general, so I wasn't able to figure out the following: I have a dataframe named ozon, for which column 1 is the time stamp in mm-dd format. Now I want to change that column to a datetime format using the following code:
ozon[1] = pd.to_datetime(ozon[1], format='%m-%d')
Now this is giving me the following error: ValueError: day is out of range for month.
I think it has to do with the fact that it's a leap year, so it doesn't recognize February 29 as a valid date. How can I overcome this error? And could I also add a year to the timestamp (2020)?
Thanks so much in advance!

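Without a year, the parser assumes 1900, which was not a leap year, so 02-29 is rejected. Add the year to the column and also to the format:
ozon[1] = pd.to_datetime(ozon[1] + '-2020', format='%m-%d-%Y')
If it still does not work because some values are not valid, add the errors='coerce' parameter, which turns those values into NaT:
ozon[1] = pd.to_datetime(ozon[1] + '-2020', format='%m-%d-%Y', errors='coerce')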
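
A small self-contained sketch of this fix, using made-up mm-dd values around the leap day. The frame ozon and its integer column label 1 mirror the question; the rows themselves are hypothetical, and 2020 is the year the question asks for.

```python
import pandas as pd

# Hypothetical rows in mm-dd format, including the leap day.
ozon = pd.DataFrame({1: ["02-28", "02-29", "03-01"]})

# Append a leap year so that February 29 is a valid calendar date.
ozon[1] = pd.to_datetime(ozon[1] + "-2020", format="%m-%d-%Y")

print(ozon[1].dt.strftime("%Y-%m-%d").tolist())
```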

How to convert date format (whole column) and produce one more column with Quarter & Year

I have an excel file with a date column. Is there a way to change the date format to MM-DD-YY and create one more column with Quarter & Year? I am very new to Python and I would really appreciate it if you could help me with this one. Thanks!
Current format
Date format: Jan 1, 2016
Desired outcome
Date format: 01/01/2016
One more additional column with something like this "Q1-2016"

Python's datetime module has you covered. For input:
from datetime import datetime
myDate = datetime.strptime(<datestring>, "%b %d, %Y")
And for output:
print(myDate.strftime("%m/%d/%Y"))
Getting the quarter is a little harder, but you can derive it from myDate.month. See also the Python datetime reference.
For example, using integer division so that January-March are Q1, April-June are Q2, etc.:
print("Q%d-%d" % ((myDate.month - 1) // 3 + 1, myDate.year))
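
Since the question is about a whole column, the same steps can also be sketched with pandas instead of the datetime module. This swaps in pandas' Series.dt.quarter for the manual division; the column names Date and Quarter and the sample rows (apart from Jan 1, 2016) are assumptions for illustration.

```python
import pandas as pd

# Hypothetical column in the question's current format.
df = pd.DataFrame({"Date": ["Jan 1, 2016", "Mar 31, 2016", "Dec 15, 2016"]})

# Parse once, then derive both outputs from the parsed values.
parsed = pd.to_datetime(df["Date"], format="%b %d, %Y")
df["Date"] = parsed.dt.strftime("%m/%d/%Y")
df["Quarter"] = (
    "Q" + parsed.dt.quarter.astype(str) + "-" + parsed.dt.year.astype(str)
)

print(df)
```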

Python: Date conversion to year-weeknumber, issue at switch of year

I am trying to convert a dataframe column with a date and timestamp to a year-weeknumber format, i.e., 01-05-2017 03:44 = 2017-1. This is pretty easy, however, I am stuck at dates that are in a new year, yet their weeknumber is still the last week of the previous year. The same thing that happens here.
I did the following:
df['WEEK_NUMBER'] = df.date.dt.year.astype(str).str.cat(df.date.dt.week.astype(str), sep='-')
Where df['date'] is a very large column with date and times, ranging over multiple years.
A date which gives a problem is for example:
Timestamp('2017-01-01 02:11:27')
The output for my code will be 2017-52, while it should be 2016-52. Since the data covers multiple years, and weeknumbers and their corresponding dates change every year, I cannot simply subtract a few days.
Does anybody have an idea of how to fix this? Thanks!

Replace df.date.dt.year by this:
(df.date.dt.year- ((df.date.dt.week>50) & (df.date.dt.month==1)))
Basically, it means that you subtract 1 from the year value if the week number is greater than 50 and the month is January.
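
Note that Series.dt.week was deprecated in pandas 1.1 and removed in pandas 2.0. On recent versions, one possible alternative (a sketch, not the method from the answer above) is Series.dt.isocalendar(), which returns the ISO year alongside the ISO week, so the year already matches the week number. The two sample timestamps are taken from the question.

```python
import pandas as pd

df = pd.DataFrame(
    {"date": pd.to_datetime(["2017-01-01 02:11:27", "2017-01-05 03:44:00"])}
)

# isocalendar() gives ISO year, week and weekday as separate columns.
iso = df["date"].dt.isocalendar()

# Combine ISO year and ISO week; no manual year correction is needed.
df["WEEK_NUMBER"] = iso["year"].astype(str) + "-" + iso["week"].astype(str)

print(df)
```

January 1, 2017 falls in ISO week 52 of 2016, so it comes out as 2016-52, while January 5, 2017 comes out as 2017-1.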
