Issues with date format in pandas - python

I am working with a dataset that contains dates in the day-first D/M/Y format.
When I load the dataset into a Pandas DataFrame and convert the column to the datetime type, the dates get mixed up.
Example: in the dataset the first date is written as (11/04/2015), which means the 11th of April 2015. But when I convert the column to datetime and sort the DataFrame by date, the first date is (01/08/2015), which is incorrect. How can I convert the column to datetime without this mix-up?
Dataset example:
IDX_CUSTOMER_ITEM_CODE IDX_COMPANY QtySold TotalOnHand Date
0 131 1 3 26 11/04/2015
1 134 1 3 17 11/04/2015
2 137 1 3 114 11/04/2015
3 140 1 3 18 11/04/2015
4 179 1 1 21 11/04/2015
... ... ... ... ... ...
1048570 1059 10 0 23 04/03/2017
1048571 1075 10 3 14 04/03/2017
1048572 2135 10 2 4 04/03/2017
1048573 1035 10 2 3 04/03/2017
1048574 1038 10 0 5 04/03/2017
The first date is the 11th of April 2015 and the last is the 4th of March 2017.
When I do:
transactions['Date'] = pd.to_datetime(transactions['Date'])
The oldest date becomes 01/08/2015 and the latest 31/12/2016, which is incorrect, so I tried:
transactions['Date'] = pd.to_datetime(transactions['Date'], format = '%dd-%mm-%yy')
Got the following error:
time data '11/04/2015' does not match format '%dd-%mm-%yy' (match)

You can also use the dayfirst parameter:
pd.to_datetime(df['Date'], dayfirst=True)
Output:
0 2015-04-11
1 2015-04-11
2 2015-04-11
3 2015-04-11
4 2015-04-11
5 2017-03-04
6 2017-03-04
7 2017-03-04
8 2017-03-04
9 2017-03-04
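
Note that dayfirst is a hint rather than a strict rule: a string that can only be read month-first will still be parsed that way. A minimal sketch contrasting the hint with an explicit format (which raises on any non-matching string), using a small sample Series mirroring the question's data:
import pandas as pd

# sample day-first strings, as in the question
s = pd.Series(['11/04/2015', '04/03/2017'])

# dayfirst=True prefers day-first parsing for ambiguous dates
print(pd.to_datetime(s, dayfirst=True))

# an explicit format is strict and raises if any string doesn't match
print(pd.to_datetime(s, format='%d/%m/%Y'))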

Your format is wrong: %dd is parsed as the %d directive followed by a literal d, and your data uses slashes rather than dashes. You can refer to the Python strftime reference for the meaning of the % codes.
transactions['Date'] = pd.to_datetime(transactions['Date'], format = '%d/%m/%Y')
print(transactions['Date'])
0 2015-04-11
1 2015-04-11
2 2015-04-11
3 2015-04-11
4 2015-04-11
5 2017-03-04
6 2017-03-04
7 2017-03-04
8 2017-03-04
9 2017-03-04
Name: Date, dtype: datetime64[ns]
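
With the column converted, sorting the DataFrame now orders the rows chronologically rather than by string, which resolves the original issue:
transactions = transactions.sort_values('Date')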

Related

Python Pandas - Difference between groupby keys with repeated values

I have some data with dates of sales to my clients.
The data looks like this:
   Cod client  Items        Date
0         100      1  2022/01/01
1         100      7  2022/01/01
2         100      2  2022/02/01
3         101      5  2022/01/01
4         101      8  2022/02/01
5         101     10  2022/02/01
6         101      2  2022/04/01
7         101      2  2022/04/01
8         102      4  2022/02/01
9         102     10  2022/03/01
What I'm trying to accomplish is to calculate the differences between dates for each client, grouping first by "Cod client" and then by "Date" (because of the duplicates).
The expected result is like this:
   Cod client  Items        Date  Date diff  Explain
0         100      1  2022/01/01        NaT  First date for client 100
1         100      7  2022/01/01        NaT  ...repeat above
2         100      2  2022/02/01         31  Diff from first date 2022/01/01
3         101      5  2022/01/01        NaT  First date for client 101
4         101      8  2022/02/01         31  Diff from first date 2022/01/01
5         101     10  2022/02/01         31  ...repeat above
6         101      2  2022/04/01         59  Diff from previous date 2022/02/01
7         101      2  2022/04/01         59  ...repeat above
8         102      4  2022/02/01        NaT  First date for client 102
9         102     10  2022/03/01         28  Diff from first date 2022/02/01
I already tried doing df["Date diff"] = df.groupby("Cod client")["Date"].diff() but it considers the repeated dates and returns zeroes for them.
I'd appreciate any help!
IIUC you can combine several groupby operations:
# ensure datetime
df['Date'] = pd.to_datetime(df['Date'])
# set up group
g = df.groupby('Cod client')
# identify duplicated dates per group
m = g['Date'].apply(pd.Series.duplicated)
# compute the diff, mask and ffill
df['Date diff'] = g['Date'].diff().mask(m).groupby(df['Cod client']).ffill()
output:
Cod client Items Date Date diff
0 100 1 2022-01-01 NaT
1 100 7 2022-01-01 NaT
2 100 2 2022-02-01 31 days
3 101 5 2022-01-01 NaT
4 101 8 2022-02-01 31 days
5 101 10 2022-02-01 31 days
6 101 2 2022-04-01 59 days
7 101 2 2022-04-01 59 days
8 102 4 2022-02-01 NaT
9 102 10 2022-03-01 28 days
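The expected table shows whole-day integers rather than Timedeltas; if that display is wanted, a small follow-up on the Date diff column computed above converts (NaT becomes NaN):
df['Date diff'] = df['Date diff'].dt.days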
Another way to do this, with transform:
import pandas as pd
# data saved as .csv
df = pd.read_csv("Data.csv", header=0, parse_dates=True)
# convert Date column to correct date.
df["Date"] = pd.to_datetime(df["Date"], format="%d/%m/%Y")
# new column!
df["Date diff"] = df.sort_values("Date").groupby("Cod client")["Date"].transform(lambda x: x.diff().replace("0 days", pd.NaT).ffill())

Split date column into two

I have the following dataframe:
     date    wind (°)   wind (kt)  temp (C°)  humidity(%)  currents (°)  currents (kt)  stemp (C°)  sea_temp_diff  wind_distance_diff  wind_speed_diff    temp_diff  humidity_diff  current_distance_diff  current_speed_diff
8   12018  175.000000   16.333333  25.500000    82.500000     60.000000       0.100000   25.400000      -1.066667           23.333333        -0.500000    -0.333333     -12.000000             160.000000        6.666667e-02
9   12019  180.000000   17.000000  23.344828    79.724138    230.000000       0.100000   23.827586      -0.379310           22.068966         1.068966     0.827586      -7.275862             315.172414        3.449034e+02
10  12020  365.000000  208.653846  24.192308    79.346154    355.769231     192.500000   24.730769     574.653846         1121.923077      1151.153846  1149.346154     -19.538462            1500.000000        1.538454e+03
14  22019  530.357143  372.964286  23.964286    81.964286   1270.714286    1071.560714  735.642857    -533.642857         -327.500000      -356.892857     1.857143     -10.321429            -873.571429       -8.928107e+02
15  22020  216.551724   12.689655  24.517241    81.137931    288.275862     172.565517  196.827586    -171.379310           -8.965517         3.724138     1.413793      -7.137931            -105.517241       -1.722724e+02
16  32019  323.225806  174.709677  25.225806    80.741935    260.000000     161.451613   25.709677     480.709677          486.451613       483.967742     0.387097     153.193548            1044.516129        9.677065e+02
17  32020  351.333333  178.566667  25.533333    78.800000    427.666667     166.666667   26.600000     165.533333         -141.000000      -165.766667   166.633333     158.933333               8.333333        1.500000e-01
18  42017  180.000000   14.000000  27.000000  5000.000000    200.000000       0.400000   25.400000       2.600000           20.000000        -4.000000     0.000000       0.000000             -90.000000       -1.000000e-01
19  42019  694.230769  589.769231  24.038462    69.461538    681.153846     577.046154   26.884615      -1.346154           37.307692        -1.692308     1.500000       4.769231              98.846154        1.538462e-01
20  42020  306.666667  180.066667  24.733333    75.166667    427.666667     166.666667   26.800000     165.066667          205.333333       165.200000     1.100000      -4.066667             360.333333        3.334233e+02
21  52017  146.333333   11.966667  22.900000  5000.000000    116.333333       0.410000   26.066667      -1.553333            8.666667         0.833333    -0.766667       0.000000              95.000000       -1.300000e-01
22  52019  107.741935   12.322581  23.419355    63.032258    129.354839       0.332258   25.935484      -1.774194           14.838710         0.096774    -0.612903     -14.451613             130.967742
I need to sort the 'date' column chronologically, and I'm wondering if there's a way for me to split it into two columns, with e.g. the '10' in one column and the 2017 in another, sort both of them in ascending order, and then bring them back together.
I had tried this:
australia_overview[['month','year']] = australia_overview['date'].str.split("2",expand=True)
But I am getting an error like this:
ValueError: Columns must be same length as key
How can I solve this issue?
From your DataFrame:
>>> df = pd.DataFrame({'id': [1, 2, 3, 4],
... 'date': ['1 42018', '12 32019', '8 112020', '23 42021']},
... index = [0, 1, 2, 3])
>>> df
id date
0 1 1 42018
1 2 12 32019
2 3 8 112020
3 4 23 42021
We can split the column to get the day as the first value, like so:
>>> df['day'] = df['date'].str.split(' ', expand=True)[0]
>>> df
id date day
0 1 1 42018 1
1 2 12 32019 12
2 3 8 112020 8
3 4 23 42021 23
Then take the last 4 digits of the date column for the year to get the expected result:
>>> df['year'] = df['date'].str[-4:].astype(int)
>>> df
id date day year
0 1 1 42018 1 2018
1 2 12 32019 12 2019
2 3 8 112020 8 2020
3 4 23 42021 23 2021
Bonus: as asked in the comments, you can even get the month using the same principle:
>>> df['month'] = df['date'].str.split(' ', expand=True)[1].str[:-4].astype(int)
>>> df
id date day year month
0 1 1 42018 1 2018 4
1 2 12 32019 12 2019 3
2 3 8 112020 8 2020 11
3 4 23 42021 23 2021 4
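
As a follow-up sketch, the parts can be reassembled into a real datetime for the chronological sort that was the original goal; pd.to_datetime accepts a mapping of year/month/day columns (full_date is a hypothetical column name):
# rebuild a proper datetime from the extracted parts, then sort
df['full_date'] = pd.to_datetime(
    {'year': df['year'], 'month': df['month'], 'day': df['day'].astype(int)})
df = df.sort_values('full_date')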

Sort pandas csv list with string date

I have read a couple of similar posts regarding the issue before, but none of the solutions worked for me. I have the following csv:
Score date term
0 72 3 Feb ·   1
1 47 1 Feb ·   1
2 119 6 Feb ·   1
8 101 7 hrs ·   1
9 536 11 min ·   1
10 53 2 hrs ·   1
11 20 11 Feb ·   3
3 15 1 hrs ·   2
4 33 7 Feb ·   1
5 153 4 Feb ·   3
6 34 3 min ·   2
7 26 3 Feb ·   3
I want to sort the csv by date. What's the easiest way to do that?
You can create 2 helper columns: one with datetimes created by to_datetime and one with timedeltas created by to_timedelta. to_timedelta needs the HH:MM:SS format, so the strings are first normalized with Series.replace and regexes; finally, sort by both columns with DataFrame.sort_values:
df['date1'] = pd.to_datetime(df['date'], format='%d %b', errors='coerce')
times = df['date'].replace({r'(\d+)\s+min': r'00:\1:00',
                            r'\s+hrs': ':00:00'}, regex=True)
df['times'] = pd.to_timedelta(times, errors='coerce')
df = df.sort_values(['times','date1'])
print (df)
Score date term date1 times
6 34 3 min 2 NaT 00:03:00
9 536 11 min 1 NaT 00:11:00
3 15 1 hrs 2 NaT 01:00:00
10 53 2 hrs 1 NaT 02:00:00
8 101 7 hrs 1 NaT 07:00:00
1 47 1 Feb 1 1900-02-01 NaT
0 72 3 Feb 1 1900-02-03 NaT
7 26 3 Feb 3 1900-02-03 NaT
5 153 4 Feb 3 1900-02-04 NaT
2 119 6 Feb 1 1900-02-06 NaT
4 33 7 Feb 1 1900-02-07 NaT
11 20 11 Feb 3 1900-02-11 NaT
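
Once sorted, the helper columns can be dropped if they were only needed for ordering:
df = df.sort_values(['times', 'date1']).drop(columns=['date1', 'times'])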

Comparing daily value in each year in DataFrame to same day-number's value in another specific year

I have a daily time series of closing prices of a financial instrument going back to 1990.
I am trying to compare the daily percentage change for each trading day of the previous years to its respective trading day in 2019. I have 41 trading days of data for 2019 at this time.
I got as far as filtering down and creating a new DataFrame with only the first 41 dates, closing prices, daily percentage changes, and the "trading day of year" ("tdoy") classifier for each day in the set, but am not having luck from there.
I've found other Stack Overflow questions that help people compare datetime days, weeks, years, etc. but I am not able to recreate this because of the arbitrary value each "tdoy" represents.
I won't bother creating a sample DataFrame because of the number of rows so I've linked the CSV I've come up with to this point: Sample CSV.
I think the easiest approach would just be to create a new column that returns what the 2019 percentage change is for each corresponding "tdoy" (Trading Day of Year) using df.loc, and if I could figure this much out I could then create yet another column with the simple difference between that year/day's percentage change and 2019's respective value. Below is what I tried to use (among other variations), to no avail.
df['2019'] = df['perc'].loc[((df.year == 2019) & (df.tdoy == df.tdoy))]
I've tried to search Stack and Google in probably 20 different variations of my problem and can't seem to find an answer that fits my issue of arbitrary "Trading Day of Year" classification.
I'm sure the answer is right in front of my face somewhere but I am still new to data wrangling.
First step is to import the csv properly. I'm not sure if you made the adjustment, but your data's date column is a string object.
# import the csv and assign to df. parse dates to datetime
df = pd.read_csv('TimeSeriesEx.csv', parse_dates=['Dates'])
# filter the dataframe so that you only have 2019 and 2018 data
df=df[df['year'] >= 2018]
df.tail()
Unnamed: 0 Dates last perc year tdoy
1225 7601 2019-02-20 29.96 0.007397 2019 37
1226 7602 2019-02-21 30.49 0.017690 2019 38
1227 7603 2019-02-22 30.51 0.000656 2019 39
1228 7604 2019-02-25 30.36 -0.004916 2019 40
1229 7605 2019-02-26 30.03 -0.010870 2019 41
Put the tdoy and year into a multiindex.
# create a multiindex
df.set_index(['tdoy','year'], inplace=True)
df.tail()
           Unnamed: 0      Dates   last      perc
tdoy year
37 2019 7601 2019-02-20 29.96 0.007397
38 2019 7602 2019-02-21 30.49 0.017690
39 2019 7603 2019-02-22 30.51 0.000656
40 2019 7604 2019-02-25 30.36 -0.004916
41 2019 7605 2019-02-26 30.03 -0.010870
Make pivot table
# make a pivot table and assign it to a variable
df1 = df.pivot_table(values='last', index='tdoy', columns='year')
df1.head()
year 2018 2019
tdoy
1 33.08 27.55
2 33.38 27.90
3 33.76 28.18
4 33.74 28.41
5 33.65 28.26
Create calculated column
# create the new column
df1['pct_change'] = (df1[2019]-df1[2018])/df1[2018]
df1
year 2018 2019 pct_change
tdoy
1 33.08 27.55 -0.167170
2 33.38 27.90 -0.164170
3 33.76 28.18 -0.165284
4 33.74 28.41 -0.157973
5 33.65 28.26 -0.160178
6 33.43 28.18 -0.157045
7 33.55 28.32 -0.155887
8 33.29 27.94 -0.160709
9 32.97 28.17 -0.145587
10 32.93 28.11 -0.146371
11 32.93 28.24 -0.142423
12 32.79 28.23 -0.139067
13 32.51 28.77 -0.115042
14 32.23 29.01 -0.099907
15 32.28 29.01 -0.101301
16 32.16 29.06 -0.096393
17 32.52 29.38 -0.096556
18 32.68 29.51 -0.097001
19 32.50 30.03 -0.076000
20 32.79 30.30 -0.075938
21 32.87 30.11 -0.083967
22 33.08 30.42 -0.080411
23 33.07 30.17 -0.087693
24 32.90 29.89 -0.091489
25 32.51 30.13 -0.073208
26 32.50 30.38 -0.065231
27 33.16 30.90 -0.068154
28 32.56 30.81 -0.053747
29 32.21 30.87 -0.041602
30 31.96 30.24 -0.053817
31 31.85 30.33 -0.047724
32 31.57 29.99 -0.050048
33 31.80 29.89 -0.060063
34 31.70 29.95 -0.055205
35 31.54 29.95 -0.050412
36 31.54 29.74 -0.057070
37 31.86 29.96 -0.059636
38 32.07 30.49 -0.049267
39 32.04 30.51 -0.047753
40 32.36 30.36 -0.061805
41 32.62 30.03 -0.079399
Altogether, without comments and data, the code looks like:
df = pd.read_csv('TimeSeriesEx.csv', parse_dates=['Dates'])
df=df[df['year'] >= 2018]
df.set_index(['tdoy','year'], inplace=True)
df1 = df.pivot_table(values='last', index='tdoy', columns='year')
df1['pct_change'] = (df1[2019]-df1[2018])/df1[2018]
[EDIT] The poster asked for all dates compared to 2019.
df = pd.read_csv('TimeSeriesEx.csv', parse_dates=['Dates'])
df.set_index(['tdoy','year'], inplace=True)
Ignore the year filter above and create the pivot table:
df1 = df.pivot_table(values='last', index='tdoy', columns='year')
Loop through the year columns and create a new field for each year, comparing it to 2019.
for y in df1.columns:
    df1[str(y) + '_pct_change'] = (df1[2019]-df1[y])/df1[y]
To view some data...
df1.loc[1:4, "1990_pct_change":"1994_pct_change"]
year 1990_pct_change 1991_pct_change 1992_pct_change 1993_pct_change 1994_pct_change
tdoy
1 0.494845 0.328351 0.489189 0.345872 -0.069257
2 0.496781 0.364971 0.516304 0.361640 -0.045828
3 0.523243 0.382050 0.527371 0.369956 -0.035262
4 0.524960 0.400888 0.531536 0.367838 -0.034659
Final code for all years:
df = pd.read_csv('TimeSeriesEx.csv', parse_dates=['Dates'])
df.set_index(['tdoy','year'], inplace=True)
df1 = df.pivot_table(values='last', index='tdoy', columns='year')
for y in df1.columns:
    df1[str(y) + '_pct_change'] = (df1[2019]-df1[y])/df1[y]
df1
I also came up with my own answer, more along the lines of what I was originally trying to accomplish. The DataFrame I'll work with for the example, df:
Dates last perc year tdoy
0 2016-01-04 29.93 -0.020295 2016 2
1 2016-01-05 29.63 -0.010023 2016 3
2 2016-01-06 29.59 -0.001350 2016 4
3 2016-01-07 29.44 -0.005069 2016 5
4 2017-01-03 34.57 0.004358 2017 2
5 2017-01-04 34.98 0.011860 2017 3
6 2017-01-05 35.00 0.000572 2017 4
7 2017-01-06 34.77 -0.006571 2017 5
8 2018-01-02 33.38 0.009069 2018 2
9 2018-01-03 33.76 0.011384 2018 3
10 2018-01-04 33.74 -0.000592 2018 4
11 2018-01-05 33.65 -0.002667 2018 5
12 2019-01-02 27.90 0.012704 2019 2
13 2019-01-03 28.18 0.010036 2019 3
14 2019-01-04 28.41 0.008162 2019 4
15 2019-01-07 28.26 -0.005280 2019 5
I created a DataFrame with only the 2019 values for tdoy and perc
df19 = df[['tdoy','perc']].loc[df['year'] == 2019]
and then zipped a dictionary for those values
perc19 = dict(zip(df19.tdoy,df19.perc))
to end up with
perc19=
{2: 0.012704174228675058,
3: 0.010035842293906852,
4: 0.008161816891412365,
5: -0.005279831045406497}
Then map these keys onto the tdoy column in the original DataFrame to create a column titled 2019 that holds the corresponding 2019 percentage-change value for that trading day:
df['2019'] = df['tdoy'].map(perc19)
and then create a vs2019 column where I take the difference of 2019 vs. perc and square it, yielding:
Dates last perc year tdoy 2019 vs2019
0 2016-01-04 29.93 -0.020295 2016 2 0.012704 6.746876
1 2016-01-05 29.63 -0.010023 2016 3 0.010036 3.995038
2 2016-01-06 29.59 -0.001350 2016 4 0.008162 1.358162
3 2016-01-07 29.44 -0.005069 2016 5 -0.005280 0.001590
4 2017-01-03 34.57 0.004358 2017 2 0.012704 0.431608
5 2017-01-04 34.98 0.011860 2017 3 0.010036 0.033038
6 2017-01-05 35.00 0.000572 2017 4 0.008162 0.864802
7 2017-01-06 34.77 -0.006571 2017 5 -0.005280 0.059843
8 2018-01-02 33.38 0.009069 2018 2 0.012704 0.081880
9 2018-01-03 33.76 0.011384 2018 3 0.010036 0.018047
10 2018-01-04 33.74 -0.000592 2018 4 0.008162 1.150436
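The line creating vs2019 is not shown above; a formula consistent with the printed values is the squared relative difference (e.g. for row 0, ((-0.020295 - 0.012704) / 0.012704)**2 ≈ 6.746876):
# presumed from the printed values: squared relative difference vs. 2019
df['vs2019'] = ((df['perc'] - df['2019']) / df['2019']) ** 2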
From here I can group by in various ways and calculate further to find the years whose percentage changes trend most similarly to the year I am comparing against (2019).
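For instance, one possible grouping, a sketch that assumes a smaller summed squared difference means a more similar year:
# rank years by similarity to 2019 (smaller total = more similar)
similarity = df[df['year'] != 2019].groupby('year')['vs2019'].sum().sort_values()
print(similarity)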

Pandas/Python Pulling end of month rows from dataframe into separate dataframe

Currently I have a time series data frame as follows:
dfMain =
Date Portfolio Value
0 2016-07-01 1.000000e+06
1 2016-07-08 1.025168e+06
2 2016-07-15 1.028053e+06
3 2016-07-22 1.024184e+06
4 2016-07-29 1.022491e+06
5 2016-08-05 1.023241e+06
6 2016-08-12 1.030325e+06
7 2016-08-19 1.032742e+06
8 2016-08-26 1.032567e+06
9 2016-09-02 1.028614e+06
10 2016-09-09 9.930876e+05
11 2016-09-16 9.956875e+05
12 2016-09-23 1.010174e+06
13 2016-09-30 1.010388e+06
14 2016-10-07 1.004989e+06
15 2016-10-14 9.924929e+05
16 2016-10-21 9.969708e+05
17 2016-10-28 9.816373e+05
18 2016-11-04 9.563689e+05
19 2016-11-11 9.869579e+05
20 2016-11-18 9.936929e+05
21 2016-11-25 1.009625e+06
Given that the dataframe can differ (I can't just pull specific rows from the example), what would be the best way to pull the dates closest to the end of each month from the dataframe? For example, index 4 would be pulled because it is the closest to the end-of-month date.
Any tips would be greatly appreciated!
Group on the month number and find the last record:
df.Date = pd.to_datetime(df.Date, errors='coerce')
df.groupby(df.Date.dt.month).last()
Date Portfolio Value
Date
7 2016-07-29 1022491.0
8 2016-08-26 1032567.0
9 2016-09-30 1010388.0
10 2016-10-28 981637.3
11 2016-11-25 1009625.0
If rows aren't sorted by Date, call sort_values first:
df.sort_values('Date').groupby(df.Date.dt.month).last()
Date Portfolio Value
Date
7 2016-07-29 1022491.0
8 2016-08-26 1032567.0
9 2016-09-30 1010388.0
10 2016-10-28 981637.3
11 2016-11-25 1009625.0
Should work in any case.
If you have dates spanning multiple years, it is better to group on the year and month:
df.sort_values('Date').groupby([df.Date.dt.year, df.Date.dt.month]).last()
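An equivalent spelling uses pd.Grouper, which also handles multiple years; note that Date is consumed as the group key here, so the result is indexed by month-end timestamps and only the remaining columns appear:
df.sort_values('Date').groupby(pd.Grouper(key='Date', freq='M')).last()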
You need to sort the dates and then find the last value for each group.
df['Date'] = pd.to_datetime(df['Date'])
grp = df.sort_values('Date').groupby(df['Date'].dt.month)
pd.DataFrame([grp.get_group(x).iloc[-1] for x in grp.groups])
Output:
Date Portfolio Value
4 2016-07-29 1022491.0
8 2016-08-26 1032567.0
13 2016-09-30 1010388.0
17 2016-10-28 981637.3
21 2016-11-25 1009625.0
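A more idiomatic equivalent of the list comprehension is GroupBy.tail, which keeps the last row of each group together with its original index:
df.sort_values('Date').groupby(df['Date'].dt.month).tail(1)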
