Python date and datetime from multiple columns

I have some data that looks like:
key DATE - DAY DATE - MONTH DATE - YEAR GMT HRS GMT MINUTES
1 2 29 2 2016 2 2
2 3 29 2 2016 2 2
3 4 29 2 2016 2 2
4 5 29 2 2016 2 2
5 6 29 2 2016 2 2
6 7 29 2 2016 2 2
7 8 29 2 2016 2 3
8 9 29 2 2016 2 3
9 10 29 2 2016 2 3
GMT SECONDS
1 54
2 55
3 56
4 57
5 58
6 59
7 0
8 1
9 2
At first the data was of type float and the year was in two-digit form (16), so I did:
t['DATE - MONTH'] = t['DATE - MONTH'].astype(int)
t['DATE - YEAR'] = t['DATE - YEAR'].astype(int)
t['DATE - YEAR'] = t['DATE - YEAR']+2000
t['DATE - DAY'] = t['DATE - DAY'].astype(int)
Note on the above: I was also confused about why, when I used an index number rather than the column name, the operation seemed to work only on a temporary copy, i.e. I could print the desired result but the data frame itself did not change.
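One possible reading of the note above is that the converted values were never assigned back. A minimal sketch, assuming that is what happened; the one-column frame `t` and the variable names are illustrative only:

```python
import pandas as pd

# Hypothetical one-column frame standing in for the asker's data.
t = pd.DataFrame({"DATE - DAY": [29.0, 29.0]})

# astype returns a new Series; on its own this line leaves t untouched.
converted = t.iloc[:, 0].astype(int)
dtype_before = t.dtypes.iloc[0]
print(dtype_before)  # still a float dtype

# Assigning the result back to the column is what changes the frame.
t[t.columns[0]] = converted
print(t.dtypes.iloc[0])  # now an integer dtype
```

Whether the column is addressed by position or by name, the frame only changes once the result is assigned back.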
Then I tried two methods:
t['Date'] = pd.to_datetime(dict(year=t['DATE - YEAR'], month = t['DATE - MONTH'], day = t['DATE - DAY']))
t['Date'] = pd.to_datetime((t['DATE - YEAR']*10000+t['DATE - MONTH']*100+t['DATE - DAY']).apply(str),format='%Y%m%d')
Both return:
ValueError: cannot assemble the datetimes: time data 20000000 does not match format '%Y%m%d' (match)
I'd like to create a date column (and afterwards use similar logic for a datetime column with the additional 3 columns).
What is the problem?
EDIT: I had bad data and added errors='coerce' to handle those rows
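The value 20000000 in the error message is what year 2000, month 0, day 0 would produce, which is consistent with the bad rows mentioned in the edit. A minimal sketch of the errors='coerce' approach, assuming a small made-up frame in which the second row holds zeros:

```python
import pandas as pd

# Hypothetical sample: the second row holds zeros, as bad input might.
t = pd.DataFrame({
    "DATE - YEAR": [2016, 2000],
    "DATE - MONTH": [2, 0],
    "DATE - DAY": [29, 0],
})

parts = dict(year=t["DATE - YEAR"], month=t["DATE - MONTH"], day=t["DATE - DAY"])

# errors="coerce" turns rows that cannot form a valid date into NaT
# instead of raising ValueError.
t["Date"] = pd.to_datetime(parts, errors="coerce")

# The rows that failed to parse can then be inspected separately.
bad_rows = t[t["Date"].isna()]
print(t)
print(bad_rows)
```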

First rename all the columns, select the renamed columns using the values of the dict, and pass the result to to_datetime.
The to_datetime documentation describes this as assembling a datetime from multiple columns of a DataFrame: the keys can be common abbreviations like ['year', 'month', 'day', 'minute', 'second', 'ms', 'us', 'ns'] or plurals of the same.
d = {'DATE - YEAR':'year','DATE - MONTH':'month','DATE - DAY':'day',
'GMT HRS':'hour','GMT MINUTES':'minute','GMT SECONDS':'second'}
df['datetime'] = pd.to_datetime(df.rename(columns=d)[list(d.values())])
print (df)
key DATE - DAY DATE - MONTH DATE - YEAR GMT HRS GMT MINUTES \
1 2 29 2 2016 2 2
2 3 29 2 2016 2 2
3 4 29 2 2016 2 2
4 5 29 2 2016 2 2
5 6 29 2 2016 2 2
6 7 29 2 2016 2 2
7 8 29 2 2016 2 3
8 9 29 2 2016 2 3
9 10 29 2 2016 2 3
GMT SECONDS datetime
1 54 2016-02-29 02:02:54
2 55 2016-02-29 02:02:55
3 56 2016-02-29 02:02:56
4 57 2016-02-29 02:02:57
5 58 2016-02-29 02:02:58
6 59 2016-02-29 02:02:59
7 0 2016-02-29 02:03:00
8 1 2016-02-29 02:03:01
9 2 2016-02-29 02:03:02
Detail:
print (df.rename(columns=d)[list(d.values())])
day month second year minute hour
1 29 2 54 2016 2 2
2 29 2 55 2016 2 2
3 29 2 56 2016 2 2
4 29 2 57 2016 2 2
5 29 2 58 2016 2 2
6 29 2 59 2016 2 2
7 29 2 0 2016 3 2
8 29 2 1 2016 3 2
9 29 2 2 2016 3 2
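The question also asks for a date-only column. A minimal sketch, assuming the datetime column is assembled as in the answer; the one-row sample frame and the `Date` column name are illustrative only:

```python
import pandas as pd

# One-row sample using the question's column names.
df = pd.DataFrame({
    "DATE - DAY": [29], "DATE - MONTH": [2], "DATE - YEAR": [2016],
    "GMT HRS": [2], "GMT MINUTES": [2], "GMT SECONDS": [54],
})

d = {"DATE - YEAR": "year", "DATE - MONTH": "month", "DATE - DAY": "day",
     "GMT HRS": "hour", "GMT MINUTES": "minute", "GMT SECONDS": "second"}

df["datetime"] = pd.to_datetime(df.rename(columns=d)[list(d.values())])

# normalize() resets the time part to midnight and keeps the datetime64 dtype.
df["Date"] = df["datetime"].dt.normalize()
print(df[["datetime", "Date"]])
```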

Related

How to do rolling sum with conditional window criteria on different index levels in Python

I want to compute a rolling sum based on different levels of the index, but I am struggling to make it work. Rather than explain the problem in the abstract, I give demo input and the desired output below, along with the kind of questions I need to answer.
I have multiple brands, each with sales of various item categories, grouped by year, month and day as shown below. What I want is a dynamic rolling sum at the day level, rolled over a window of years.
For example, if someone asks:
Demo question 1) Up to a certain day (not including that day), what were the last 2 years' sales of that particular category for that particular brand?
I need to be able to answer this for every single day, i.e. every row should have a number as shown in Table 2.0.
I want to write the code so that if the question changes from 2 years to 3 years, I only need to change one number. I also need to do the same thing at the month level.
Demo question 2) Up to a certain day (not including that day), what were the last 3 months' sales of that particular category for that particular year and brand?
Below is the demo input.
The tables are grouped by brand, category, year, month and day, with the sum of sales taken from a master table that had all the info and sales at hour level for each day.
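Table 1.0
Brand  Category        Year  Month  Day  Sales
ABC    Big Appliances  2021  9      3    0
ABC    Clothing        2021  9      2    0
ABC    Electronics     2020  10     18   2
ABC    Utensils        2020  10     18   0
ABC    Utensils        2021  9      2    4
ABC    Utensils        2021  9      3    0
XYZ    Big Appliances  2012  4      29   7
XYZ    Big Appliances  2013  4      7    6
XYZ    Clothing        2012  4      29   3
XYZ    Electronics     2013  4      9    1
XYZ    Electronics     2013  4      27   2
XYZ    Electronics     2013  5      4    5
XYZ    Electronics     2015  4      27   7
XYZ    Electronics     2015  5      2    2
XYZ    Fans            2013  4      14   4
XYZ    Fans            2013  5      4    0
XYZ    Fans            2015  4      18   1
XYZ    Fans            2015  5      17   11
XYZ    Fans            2016  4      12   18
XYZ    Furniture       2012  5      4    1
XYZ    Furniture       2012  5      8    6
XYZ    Furniture       2012  5      20   4
XYZ    Furniture       2013  4      5    1
XYZ    Furniture       2013  4      7    8
XYZ    Furniture       2013  4      9    2
XYZ    Furniture       2015  4      18   12
XYZ    Furniture       2015  4      27   15
XYZ    Furniture       2015  5      2    4
XYZ    Furniture       2015  5      17   3
XYZ    Musical-inst    2012  5      18   10
XYZ    Musical-inst    2013  4      5    6
XYZ    Musical-inst    2015  4      16   10
XYZ    Musical-inst    2015  4      18   0
XYZ    Musical-inst    2016  4      12   1
XYZ    Musical-inst    2016  4      16   13
XYZ    Utencils        2012  5      8    2
XYZ    Utencils        2016  4      16   3
XYZ    Utencils        2016  4      18   2
XYZ    Utencils        2017  4      12   13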
Below is desired output for demo question 1 based on the demo table(last 2 years cumsum not including that day)
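Table 2.0
Brand  Category        Year  Month  Day  Sales  Conditional Cumsum (till last 2 years)
ABC    Big Appliances  2021  9      3    0      0
ABC    Clothing        2021  9      2    0      0
ABC    Electronics     2020  10     18   2      0
ABC    Utensils        2020  10     18   0      0
ABC    Utensils        2021  9      2    4      0
ABC    Utensils        2021  9      3    0      4
XYZ    Big Appliances  2012  4      29   7      0
XYZ    Big Appliances  2013  4      7    6      7
XYZ    Clothing        2012  4      29   3      0
XYZ    Electronics     2013  4      9    1      0
XYZ    Electronics     2013  4      27   2      1
XYZ    Electronics     2013  5      4    5      3
XYZ    Electronics     2015  4      27   7      8
XYZ    Electronics     2015  5      2    2      15
XYZ    Fans            2013  4      14   4      0
XYZ    Fans            2013  5      4    0      4
XYZ    Fans            2015  4      18   1      4
XYZ    Fans            2015  5      17   11     5
XYZ    Fans            2016  4      12   18     12
XYZ    Furniture       2012  5      4    1      0
XYZ    Furniture       2012  5      8    6      1
XYZ    Furniture       2012  5      20   4      7
XYZ    Furniture       2013  4      5    1      11
XYZ    Furniture       2013  4      7    8      12
XYZ    Furniture       2013  4      9    2      20
XYZ    Furniture       2015  4      18   12     11
XYZ    Furniture       2015  4      27   15     23
XYZ    Furniture       2015  5      2    4      38
XYZ    Furniture       2015  5      17   3      42
XYZ    Musical-inst    2012  5      18   10     0
XYZ    Musical-inst    2013  4      5    6      10
XYZ    Musical-inst    2015  4      16   10     6
XYZ    Musical-inst    2015  4      18   0      16
XYZ    Musical-inst    2016  4      12   1      10
XYZ    Musical-inst    2016  4      16   13     11
XYZ    Utencils        2012  5      8    2      0
XYZ    Utencils        2016  4      16   3      0
XYZ    Utencils        2016  4      18   2      3
XYZ    Utencils        2017  4      12   13     5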
End thoughts:
The idea is basically to roll a window over the Year column, maintaining the 2-year span, and keep summing the sales figures.
P.S. I really need a fast solution because of the huge data size. I wrote a row-wise .apply function, but it was not feasible. A better solution using some kind of grouped rolling sum or helper columns would be really helpful.
Here is a sample solution for the above problem.
I have considered just one product to keep the solution simple.
Code:
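from datetime import date,timedelta
Input={"Utencils": [[2012,5,8,2],[2016,4,16,3],[2017,4,12,13]]}
Input1=Input["Utencils"]
Limit=timedelta(365*2)  # window length in days
cumsum=0                # sum of sales currently inside the window
lis=[]                  # indices of earlier rows still inside the window
Tot=[]                  # result: windowed sum before each row
for i in range(len(Input1)):
    if(lis):
        while(lis):
            idx=lis[0]
            Y,M,D=Input1[i][:3]
            reqDate=date(Y,M,D)-Limit
            Y,M,D=Input1[idx][:3]
            # drop the oldest row once it falls out of the window
            if(date(Y,M,D)<=reqDate):
                lis.pop(0)
                cumsum-=Input1[idx][3]
            else:
                break
    # record the sum before adding the current row (current day excluded)
    Tot.append(cumsum)
    lis.append(i)
    cumsum+=Input1[i][3]
print(Tot)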
Here Tot holds the required cumsum column for the given data.
Output:
[0, 0, 3]
You can specify the time span as a number of days in the Limit variable.
Hope this solves your problem.
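The figures in Table 2.0 appear to count whole calendar years (earlier rows whose Year is at least the current Year minus 2) rather than a 730-day span. A minimal sketch under that assumption, for one brand/category at a time; the function name `prior_sales_by_year_window` and its `years` parameter are hypothetical, and the rows are assumed to be sorted by date with one row per day:

```python
from collections import deque

def prior_sales_by_year_window(rows, years=2):
    """rows: (year, month, day, sales) tuples for ONE brand/category,
    sorted by date. Returns, for each row, the sales of earlier rows
    whose year is >= (current year - years)."""
    window = deque()  # (year, sales) of earlier rows still inside the window
    running = 0       # sum of sales currently inside the window
    totals = []
    for year, month, day, sales in rows:
        # Evict rows whose year is older than the allowed span.
        while window and window[0][0] < year - years:
            running -= window.popleft()[1]
        totals.append(running)  # current row is not included
        window.append((year, sales))
        running += sales
    return totals

# XYZ / Furniture rows from Table 1.0.
furniture = [
    (2012, 5, 4, 1), (2012, 5, 8, 6), (2012, 5, 20, 4),
    (2013, 4, 5, 1), (2013, 4, 7, 8), (2013, 4, 9, 2),
    (2015, 4, 18, 12), (2015, 4, 27, 15), (2015, 5, 2, 4), (2015, 5, 17, 3),
]
print(prior_sales_by_year_window(furniture))
# [0, 1, 7, 11, 12, 20, 11, 23, 38, 42]
```

Each row is appended and evicted at most once, so the pass is linear in the number of rows per group. Changing the question from 2 years to 3 years means changing only the `years` argument.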

How to unpack lists of tuples of varying length in a pandas dataframe?

ID LIST_OF_TUPLE (2col)
1 [('2012','12'), ('2012','33'), ('2014', '82')]
2 NA
3 [('2012','12')]
4 [('2012','12'), ('2012','33'), ('2014', '82'), ('2022', '67')]
Result:
ID TUP_1 TUP_2(3col)
1 2012 12
1 2012 33
1 2014 82
3 2012 12
4 2012 12
4 2012 33
4 2014 82
4 2022 67
Thanks in advance.
Use explode, then create a DataFrame from the exploded tuples, and then join it back to the ID column:
s = df['LIST_OF_TUPLE'].explode()
out = (df[['ID']].join(pd.DataFrame(s.tolist(),index=s.index)
.add_prefix("TUP_")).reset_index(drop=True)) #you can chain a dropna if reqd
print(out)
ID TUP_0 TUP_1
0 1 2012 12
1 1 2012 33
2 1 2014 82
3 2 NaN None
4 3 2012 12
5 4 2012 12
6 4 2012 33
7 4 2014 82
8 4 2022 67
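The requested result has no row for ID 2. A minimal sketch of one way to get that, assuming the missing rows are dropped right after explode and an inner join is used; the three-row sample frame is illustrative, and the column names TUP_1 and TUP_2 follow the question:

```python
import numpy as np
import pandas as pd

# Shortened sample in the shape of the question's data.
df = pd.DataFrame({
    "ID": [1, 2, 3],
    "LIST_OF_TUPLE": [[("2012", "12"), ("2012", "33")], np.nan, [("2012", "12")]],
})

# One tuple per row; rows that had no list are dropped.
s = df["LIST_OF_TUPLE"].explode().dropna()

# Split each tuple into two columns, keeping the original row index.
parts = pd.DataFrame(s.tolist(), index=s.index, columns=["TUP_1", "TUP_2"])

# Inner join on the index keeps only IDs that had at least one tuple.
out = df[["ID"]].join(parts, how="inner").reset_index(drop=True)
print(out)
```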

Sort pandas csv list with string date

I have read a couple of similar posts regarding this issue, but none of the solutions worked for me. I have the following csv:
Score date term
0 72 3 Feb ·   1
1 47 1 Feb ·   1
2 119 6 Feb ·   1
8 101 7 hrs ·   1
9 536 11 min ·   1
10 53 2 hrs ·   1
11 20 11 Feb ·   3
3 15 1 hrs ·   2
4 33 7 Feb ·   1
5 153 4 Feb ·   3
6 34 3 min ·   2
7 26 3 Feb ·   3
I want to sort the csv by date. What's the easiest way to do that?
You can create 2 helper columns: one for datetimes created by to_datetime, and a second for timedeltas created by to_timedelta. The timedelta values are first converted to HH:MM:SS format with Series.replace and regexes. Finally, sort by both columns with DataFrame.sort_values:
df['date1'] = pd.to_datetime(df['date'], format='%d %b', errors='coerce')
times = df['date'].replace({r'(\d+)\s+min': r'00:\1:00',
r'\s+hrs': ':00:00'}, regex=True)
df['times'] = pd.to_timedelta(times, errors='coerce')
df = df.sort_values(['times','date1'])
print (df)
Score date term date1 times
6 34 3 min 2 NaT 00:03:00
9 536 11 min 1 NaT 00:11:00
3 15 1 hrs 2 NaT 01:00:00
10 53 2 hrs 1 NaT 02:00:00
8 101 7 hrs 1 NaT 07:00:00
1 47 1 Feb 1 1900-02-01 NaT
0 72 3 Feb 1 1900-02-03 NaT
7 26 3 Feb 3 1900-02-03 NaT
5 153 4 Feb 3 1900-02-04 NaT
2 119 6 Feb 1 1900-02-06 NaT
4 33 7 Feb 1 1900-02-07 NaT
11 20 11 Feb 3 1900-02-11 NaT

Mean value based on another column group

I have a dataframe (2000 rows, 5 columns):
year month day GroupBy_Day
0 2013 11 6 3
1 2013 11 7 10
2 2013 11 8 4
3 2013 11 9 4
4 2013 11 10 4
...
24 2013 12 1 5
25 2013 12 2 4
26 2013 12 3 5
27 2013 12 4 2
28 2013 12 5 7
29 2013 12 6 1
I already grouped my elements and got the count for each day (column GroupBy_Day). I need to get the mean count by day of the month (e.g., for all days 6, the mean is (3+1)/2 = 2 occurrences), and subtract this value from GroupBy_Day in a new column.
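A minimal sketch of one way to do this, assuming a groupby on the day column with transform('mean'); the four-row sample and the new column names `mean_by_day` and `diff_from_mean` are illustrative only:

```python
import pandas as pd

# Small sample: day 6 appears twice with counts 3 and 1, as in the question.
df = pd.DataFrame({
    "year": [2013, 2013, 2013, 2013],
    "month": [11, 11, 12, 12],
    "day": [6, 7, 5, 6],
    "GroupBy_Day": [3, 10, 7, 1],
})

# transform("mean") returns the group mean aligned to every original row.
df["mean_by_day"] = df.groupby("day")["GroupBy_Day"].transform("mean")

# Difference between each count and the mean for that day of the month.
df["diff_from_mean"] = df["GroupBy_Day"] - df["mean_by_day"]
print(df)
```

For day 6 the mean is (3+1)/2 = 2, so the two day-6 rows get differences of 1 and -1.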

How can I join columns in Pandas?

I've got the following data frame:
IDENT YEAR MONTH DAY HOUR MIN XXXX YYYY GPS SNR
0 0 2015 5 13 5 0 20.45 16 0 44
1 0 2015 5 13 4 0 20.43 16 0 44
2 0 2015 5 13 3 0 20.42 16 0 44
3 0 2015 5 13 2 0 20.47 16 0 40
4 0 2015 5 13 1 0 20.50 16 0 44
5 0 2015 5 13 0 0 20.54 16 0 44
6 0 2015 5 12 23 0 20.56 16 0 40
It comes from a csv file, and I built the data frame using Python pandas.
Now I'd like to join the columns YEAR+MONTH+DAY+HOUR+MIN to make a new one, for example
DATE-TIME
2015-5-13-5-0
How can I do that?
date_cols = ['YEAR','MONTH','DAY','HOUR','MIN']
df[date_cols] = df[date_cols].astype(str)
df['the_date'] = df[date_cols].apply(lambda x: '-'.join(x),axis=1)
Output:
IDENT YEAR MONTH DAY HOUR MIN XXXX YYYY GPS SNR the_date
0 0 2015 5 13 5 0 20.45 16 0 44 2015-5-13-5-0
1 0 2015 5 13 4 0 20.43 16 0 44 2015-5-13-4-0
2 0 2015 5 13 3 0 20.42 16 0 44 2015-5-13-3-0
3 0 2015 5 13 2 0 20.47 16 0 40 2015-5-13-2-0
4 0 2015 5 13 1 0 20.50 16 0 44 2015-5-13-1-0
5 0 2015 5 13 0 0 20.54 16 0 44 2015-5-13-0-0
6 0 2015 5 12 23 0 20.56 16 0 40 2015-5-12-23-0
df.loc[:, 'DATE-TIME'] = df.apply(lambda x: "{0}-{1}-{2}-{3}-{4}"
.format(int(x.YEAR),
int(x.MONTH),
int(x.DAY),
int(x.HOUR),
int(x.MIN)),
axis=1)
>>> df
IDENT YEAR MONTH DAY HOUR MIN XXXX YYYY GPS SNR DATE-TIME
0 0 2015 5 13 5 0 20.45 16 0 44 2015-5-13-5-0
1 0 2015 5 13 4 0 20.43 16 0 44 2015-5-13-4-0
2 0 2015 5 13 3 0 20.42 16 0 44 2015-5-13-3-0
3 0 2015 5 13 2 0 20.47 16 0 40 2015-5-13-2-0
4 0 2015 5 13 1 0 20.50 16 0 44 2015-5-13-1-0
5 0 2015 5 13 0 0 20.54 16 0 44 2015-5-13-0-0
6 0 2015 5 12 23 0 20.56 16 0 40 2015-5-12-23-0
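If a real timestamp is more useful than a joined string, the same columns can be passed to to_datetime after renaming them to the names it recognizes. A minimal sketch, assuming a two-row sample; the `timestamp` column name is illustrative only:

```python
import pandas as pd

# Two sample rows with the question's column names.
df = pd.DataFrame({
    "YEAR": [2015, 2015], "MONTH": [5, 5], "DAY": [13, 12],
    "HOUR": [5, 23], "MIN": [0, 0],
})

# to_datetime assembles a datetime from columns named year, month, day, hour, minute.
names = {"YEAR": "year", "MONTH": "month", "DAY": "day",
         "HOUR": "hour", "MIN": "minute"}
df["timestamp"] = pd.to_datetime(df[list(names)].rename(columns=names))
print(df)
```

A datetime column sorts chronologically and supports date arithmetic, which a string such as 2015-5-13-5-0 does not.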
