Retrieve date from corrupted datetime column

I have a time-series dataframe with over 11,000 observations. Unfortunately, the datetime column got corrupted when the data was stored in .csv format: the date portion (Y/M/D) went missing and only the time is left, as shown below in the first rows of the dataframe.
I know that rows sharing the same value in the corrupted Date_Time column belong to the same date. For example, all observations with the value "10:27.9" correspond to one specific date, and all observations with the value "45:05.8" correspond to another date (here, the previous day).
Given this, how can I reconstruct the original datetime column (in Y/M/D H:M:S format), assuming the first set of rows belongs to 15 April 2021, the second set to 14 April 2021, and so on, going back one day per set? I am not sure what 10:27.9 represents (I guess it is in S:M:H format), so the H:M:S portion can hold any value as long as the date is correct.
I would appreciate any input.
D Date_Time
0 349 10:27.9
1 20 10:27.9
2 66 10:27.9
3 29 10:27.9
4 14 10:27.9
5 112 10:27.9
6 104 10:27.9
7 22 10:27.9
8 135 10:27.9
9 33 10:27.9
10 81 10:27.9
11 53 10:27.9
12 2 10:27.9
13 9 10:27.9
14 18 10:27.9
15 24 10:27.9
16 50 10:27.9
17 1 10:27.9
18 28 10:27.9
19 4 10:27.9
20 9 10:27.9
21 11 10:27.9
22 5 10:27.9
23 1 10:27.9
24 0 10:27.9
25 3 10:27.9
26 0 10:27.9
27 0 10:27.9
28 0 10:27.9
29 0 10:27.9
30 0 10:27.9
31 0 10:27.9
32 0 10:27.9
33 0 10:27.9
34 2 10:27.9
35 0 10:27.9
36 278 45:05.8
37 22 45:05.8
38 38 45:05.8
39 25 45:05.8
40 18 45:05.8
41 104 45:05.8
42 67 45:05.8
43 24 45:05.8
44 120 45:05.8
45 29 45:05.8
46 73 45:05.8
47 51 45:05.8
48 3 45:05.8
49 8 45:05.8
50 18 45:05.8

Create a reverse date_range() starting at 2021-04-15, with one date per unique Date_Time value, and then map() the current Date_Time values onto those dates.
Note that this does not preserve the times, but that was acceptable if I understood the comments correctly.
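import pandas as pd
# unique() keeps first-appearance order, so the first key gets the latest date
keys = df.Date_Time.unique()
# one date per key, stepping back one day at a time from 2021-04-15
values = pd.date_range('2021-04-15', periods=keys.size, freq='-1D')
mapping = dict(zip(keys, values))
df.Date_Time = df.Date_Time.map(mapping)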
# D Date_Time
# 0 349 2021-04-15
# 1 20 2021-04-15
# 2 66 2021-04-15
# ...
# 48 3 2021-04-14
# 49 8 2021-04-14
# 50 18 2021-04-14
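One caveat of mapping unique values is that a time string appearing again on a later, non-adjacent block of rows would be given the same date as its first appearance. A minimal sketch of an alternative that numbers consecutive runs instead is shown here. The last row of the sample frame is hypothetical and only added to illustrate a repeated time string; the `Date` column name and the `run_id` helper are also assumptions.

```python
import pandas as pd

# Small sample; the last row is hypothetical and repeats an earlier time string
df = pd.DataFrame({
    'D': [349, 20, 278, 22, 5],
    'Date_Time': ['10:27.9', '10:27.9', '45:05.8', '45:05.8', '10:27.9'],
})

# A new run starts wherever the value differs from the previous row
run_id = (df['Date_Time'] != df['Date_Time'].shift()).cumsum() - 1

# Run 0 is 2021-04-15, run 1 is the day before, and so on
df['Date'] = pd.Timestamp('2021-04-15') - pd.to_timedelta(run_id, unit='D')
print(df)
```

With this approach the third block of rows gets 2021-04-13 even though its time string matches the first block. Whether that is the right behaviour depends on how the corrupted values were produced.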

Related

Issue with sorting in pandas column in ascending order

I have the following code.
I am trying to sort the values of the first column of the 'happydflist' dataframe in ascending order.
However, the output includes values such as '2', '3' and '8' that are out of ascending order.
happydflist = happydflist[happydflist.columns[0]]
happydflistnew = happydflist.sort_values(ascending=True)
print(happydflistnew)
12 13
10 19
13 2
11 24
15 3
6 33
24 35
8 36
5 37
25 49
17 49
20 50
26 51
22 52
16 52
18 52
19 52
28 53
27 54
23 54
21 59
9 74
7 75
14 8
Name: 0_happy, dtype: object
I would be so grateful for a helping hand!
'happydflist' looks like this:
5 37
6 33
7 75
8 36
9 74
10 19
11 24
12 13
13 2
14 8
15 3
16 52
17 49
18 52
19 52
20 50
21 59
22 52
23 54
24 35
25 49
26 51
27 54
28 53
Name: 0_happy, dtype: object

The dtype of your column is probably str (object), so the values are sorted as text. Convert it to int before sorting:
happydflist.astype('int').sort_values()
If you need the str dtype afterwards, call astype once more:
happydflist.astype('int').sort_values().astype('str')
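To see why the original output looked wrong, here is a minimal sketch that sorts a few of the values from the question first as strings and then as numbers. The variable names are illustrative, and pd.to_numeric is used here as one alternative to astype('int').

```python
import pandas as pd

# A few values from the question, stored as strings (dtype: object)
s = pd.Series(['13', '19', '2', '24', '3', '8'], name='0_happy')

# Strings compare character by character, so '2' sorts after '19'
as_text = s.sort_values().tolist()

# Converting to numbers first gives the expected numeric order
as_numbers = pd.to_numeric(s).sort_values().tolist()

print(as_text)     # ['13', '19', '2', '24', '3', '8']
print(as_numbers)  # [2, 3, 8, 13, 19, 24]
```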

I managed to resolve the issue by using the .str.strip() method to remove whitespace around the text in the column, combined with the .dropna() method.
happydflistnew = happydflist[happydflist.columns[0]].str.strip()
happydflistnew = happydflistnew.dropna()
happydflistsorted = happydflistnew.astype('int').sort_values(ascending=True)
maxvalue = len(happydflistsorted)
minhappiness = happydflistsorted.iloc[0]
maxhappiness = happydflistsorted.iloc[maxvalue-1]

Calculate mean from multiple columns

I have 12 columns filled with wages. I want to calculate the mean, but my output is 12 separate means, one per column. I want a single mean calculated over the whole dataset.
This is how my df looks:
Month 1 Month 2 Month 3 Month 4 ... Month 9 Month 10 Month 11 Month 12
0 1429.97 2816.61 2123.29 2123.29 ... 2816.61 2816.61 1429.97 1776.63
1 3499.53 3326.20 3499.53 2112.89 ... 1939.56 2806.21 2632.88 2459.55
2 2599.95 3119.94 3813.26 3466.60 ... 3466.60 3466.60 2946.61 2946.61
3 2599.95 2946.61 3466.60 2773.28 ... 2253.29 3119.94 1906.63 2773.28
I used this code to calculate the mean:
mean = df.mean()
Do I have to convert these 12 columns into one column, or how else can I calculate one mean?

Just call the mean again to get the mean of those 12 values. This equals the overall mean as long as every column has the same number of non-missing values:
df.mean().mean()

Use numpy.mean after converting the values to a 2D array:
mean = np.mean(df.to_numpy())
print (mean)
2914.254166666667
Or use DataFrame.melt:
mean = df.melt()['value'].mean()
print (mean)
2914.254166666666

You can also use stack:
df.stack().mean()
Suppose this dataframe:
>>> df
A B C D E F G H
0 60 1 59 25 8 27 34 43
1 81 48 32 30 60 3 90 22
2 66 15 21 5 23 36 83 46
3 56 42 14 86 41 64 89 56
4 28 53 89 89 52 13 12 39
5 64 7 2 16 91 46 74 35
6 81 81 27 67 26 80 19 35
7 56 8 17 39 63 6 34 26
8 56 25 26 39 37 14 41 27
9 41 56 68 38 57 23 36 8
>>> df.stack().mean()
41.6625
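The approaches above agree when every column is complete, but they can diverge once values are missing. The following is a minimal sketch with a small hypothetical frame (the numbers are made up for illustration and are not from the question) comparing the mean of column means with a mean over all values pooled together.

```python
import numpy as np
import pandas as pd

# Hypothetical wages; 'Month 2' has two missing entries
df = pd.DataFrame({
    'Month 1': [1.0, 2.0, 3.0],
    'Month 2': [10.0, np.nan, np.nan],
})

# Mean of the per-column means: (2.0 + 10.0) / 2
mean_of_means = df.mean().mean()

# Mean over every non-missing value: (1 + 2 + 3 + 10) / 4
pooled = df.stack().mean()

# NumPy equivalent that also skips NaN
numpy_mean = np.nanmean(df.to_numpy())

print(mean_of_means, pooled, numpy_mean)  # 6.0 4.0 4.0
```

If the data has no missing values, all three give the same result, so the choice is mostly a matter of taste.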

Create bi-weekly and monthly labels with week numbers in pandas

I have a dataframe with profit values, IDs, and week values. It looks a little like this:
ID Week Profit
A 1 2
A 2 2
A 3 0
A 4 0
I want to create two new columns called "Bi-Weekly" and "Monthly". Week 1 would get the bi-weekly label 2, week 2 would also get label 2, while weeks 3 and 4 would get label 4; all four weeks would be labeled month 1. That way I could group by weekly, bi-weekly, or monthly profit as needed. Right now I've created two functions that work, but the weeks will go up to a full year (52 weeks), so I was wondering if there's a more efficient way. My bi-weekly function is below.
def biweek(prof_calc):
    if (prof_calc['week']==2):
        return 2
    elif (prof_calc['week']==3):
        return 2
    elif (prof_calc['week']==4):
        return 4
    elif (prof_calc['week']==5):
        return 4
    elif (prof_calc['week']==6):
        return 6
    elif (prof_calc['week']==7):
        return 6
    elif (prof_calc['week']==8):
        return 8
    elif (prof_calc['week']==9):
        return 8
    elif (prof_calc['week']==10):
        return 10
    elif (prof_calc['week']==11):
        return 10
prof_calc['BiWeek'] = prof_calc.apply(biweek, axis=1)

IIUC, you could try:
df["Biweekly"] = (df["Week"]-1)//2+1
df["Monthly"] = (df["Week"]-1)//4+1
>>> df
ID Week Profit Biweekly Monthly
0 A 1 42 1 1
1 A 2 69 1 1
2 A 3 53 2 1
3 A 4 63 2 1
4 A 5 56 3 2
5 A 6 57 3 2
6 A 7 86 4 2
7 A 8 23 4 2
8 A 9 35 5 3
9 A 10 10 5 3
10 A 11 25 6 3
11 A 12 21 6 3
12 A 13 39 7 4
13 A 14 82 7 4
14 A 15 76 8 4
15 A 16 20 8 4
16 A 17 97 9 5
17 A 18 67 9 5
18 A 19 21 10 5
19 A 20 22 10 5
20 A 21 88 11 6
21 A 22 67 11 6
22 A 23 33 12 6
23 A 24 38 12 6
24 A 25 8 13 7
25 A 26 67 13 7
26 A 27 16 14 7
27 A 28 49 14 7
28 A 29 3 15 8
29 A 30 17 15 8
30 A 31 79 16 8
31 A 32 19 16 8
32 A 33 21 17 9
33 A 34 9 17 9
34 A 35 56 18 9
35 A 36 83 18 9
36 A 37 1 19 10
37 A 38 53 19 10
38 A 39 66 20 10
39 A 40 55 20 10
40 A 41 85 21 11
41 A 42 90 21 11
42 A 43 34 22 11
43 A 44 3 22 11
44 A 45 9 23 12
45 A 46 28 23 12
46 A 47 58 24 12
47 A 48 14 24 12
48 A 49 42 25 13
49 A 50 69 25 13
50 A 51 76 26 13
51 A 52 49 26 13
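The answer above numbers the bi-weekly periods 1, 2, 3, and so on. If the labels should instead be the last week of each period (2, 2, 4, 4, ...) as described in the question's text, a small variation of the same integer-division idea might look like the sketch below. It uses the four sample rows from the question; the `biweekly_profit` name is an assumption for illustration.

```python
import pandas as pd

# The four sample rows from the question
df = pd.DataFrame({
    'ID': ['A'] * 4,
    'Week': [1, 2, 3, 4],
    'Profit': [2, 2, 0, 0],
})

# Round each week up to the next even number: 1 -> 2, 2 -> 2, 3 -> 4, 4 -> 4
df['Bi-Weekly'] = (df['Week'] + 1) // 2 * 2

# Four-week "months": weeks 1-4 -> 1, weeks 5-8 -> 2, ...
df['Monthly'] = (df['Week'] - 1) // 4 + 1

# Aggregate profit per ID and bi-weekly period
biweekly_profit = df.groupby(['ID', 'Bi-Weekly'])['Profit'].sum()
print(biweekly_profit)
```

Both columns are computed on the whole column at once, so the same two lines cover all 52 weeks without any per-row function.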

Place data from a Pandas DF into a Grid or Template

I have a process whose end product is a Pandas DF. The output varies in data and length, but it is structured like this example:
9 80340796
10 80340797
11 80340798
12 80340799
13 80340800
14 80340801
15 80340802
16 80340803
17 80340804
18 80340805
19 80340806
20 80340807
21 80340808
22 80340809
23 80340810
24 80340811
25 80340812
26 80340813
27 80340814
28 80340815
29 80340816
30 80340817
31 80340818
32 80340819
33 80340820
34 80340821
35 80340822
36 80340823
37 80340824
38 80340825
39 80340826
40 80340827
41 80340828
42 80340829
43 80340830
44 80340831
45 80340832
46 80340833
I need to get the numbers in the second column above, into the following grid format based on the numbers in the first column above.
1 2 3 4 5 6 7 8 9 10 11 12
A 1 9 17 25 33 41 49 57 65 73 81 89
B 2 10 18 26 34 42 50 58 66 74 82 90
C 3 11 19 27 35 43 51 59 67 75 83 91
D 4 12 20 28 36 44 52 60 68 76 84 92
E 5 13 21 29 37 45 53 61 69 77 85 93
F 6 14 22 30 38 46 54 62 70 78 86 94
G 7 15 23 31 39 47 55 63 71 79 87 95
H 8 16 24 32 40 48 56 64 72 80 88 96
So the end result in this example would be
Any advice on how to go about this would be much appreciated. I've been asked for this by a colleague, so the data is easy to read for their team (as it matches the layout of a physical test) but I have no idea how to produce it.

A pandas pivot table can do what you want, but first you have to create two auxiliary columns: one determining which grid column the value goes in, and another determining which row. You can get both as shown in the following example:
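import numpy as np
import pandas as pd
df = pd.DataFrame({'num': list(range(9, 28)), 'val': list(range(80001, 80020))})
max_rows = 8
# zero-based row position within the grid (0 for A, ..., 7 for H)
df['row'] = (df['num'] - 1) % max_rows
# one-based grid column; each column holds max_rows consecutive numbers
df['col'] = np.ceil(df['num'] / max_rows).astype(int)
df.pivot_table(values=['val'], columns=['col'], index=['row'])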
val
col 2 3 4
row
0 80001.0 80009.0 80017.0
1 80002.0 80010.0 80018.0
2 80003.0 80011.0 80019.0
3 80004.0 80012.0 NaN
4 80005.0 80013.0 NaN
5 80006.0 80014.0 NaN
6 80007.0 80015.0 NaN
7 80008.0 80016.0 NaN
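To get closer to the A-H by 1-12 layout in the question, the same idea can be extended with letter row labels and a reindex that fills in the empty positions. This is a minimal sketch using the same made-up `num`/`val` data as the example above; the `grid` and `rows` names are assumptions.

```python
import string

import pandas as pd

# Same illustrative data as above: position numbers 9..27 with made-up values
df = pd.DataFrame({'num': range(9, 28), 'val': range(80001, 80020)})

rows = list(string.ascii_uppercase[:8])   # ['A', ..., 'H']
cols = list(range(1, 13))                 # 1..12

# Positions run down each column first: 1 -> A1, 8 -> H1, 9 -> A2, ...
df['row'] = [rows[(n - 1) % 8] for n in df['num']]
df['col'] = (df['num'] - 1) // 8 + 1

# Pivot, then reindex so unused rows and columns still appear (as NaN)
grid = (
    df.pivot(index='row', columns='col', values='val')
      .reindex(index=rows, columns=cols)
)
print(grid)
```

Positions with no data show up as NaN, which keeps the grid the same shape as the physical layout regardless of how many values the process produces.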

performing differences between rows in pandas based on columns values

I have this dataframe and I'm trying to create a new column that stores the difference in products sold, based on code and date.
For example, this is the starting dataframe:
date code sold
0 20150521 0 47
1 20150521 12 39
2 20150521 16 39
3 20150521 20 38
4 20150521 24 38
5 20150521 28 37
6 20150521 32 36
7 20150521 4 43
8 20150521 8 43
9 20150522 0 47
10 20150522 12 37
11 20150522 16 36
12 20150522 20 36
13 20150522 24 36
14 20150522 28 35
15 20150522 32 31
16 20150522 4 42
17 20150522 8 41
18 20150523 0 50
19 20150523 12 48
20 20150523 16 46
21 20150523 20 46
22 20150523 24 46
23 20150523 28 45
24 20150523 32 42
25 20150523 4 49
26 20150523 8 49
27 20150524 0 39
28 20150524 12 33
29 20150524 16 30
... ... ... ...
150 20150606 32 22
151 20150606 4 34
152 20150606 8 33
153 20150607 0 31
154 20150607 12 30
155 20150607 16 30
156 20150607 20 29
157 20150607 24 28
158 20150607 28 26
159 20150607 32 24
160 20150607 4 30
161 20150607 8 30
162 20150608 0 47
I think this could be a solution...
full_df1=full_df[full_df.date == '20150609'].reset_index(drop=True)
full_df1['code'] = full_df1['code'].astype(float)
full_df1 = full_df1.sort_values(['code'], ascending=[False])
code date sold
8 32 20150609 33
7 28 20150609 36
6 24 20150609 37
5 20 20150609 39
4 16 20150609 42
3 12 20150609 46
2 8 20150609 49
1 4 20150609 49
0 0 20150609 50
full_df1.set_index('code')['sold'].diff().reset_index()
That gives me this output for the single date 20150609:
code difference
0 32 NaN
1 28 3
2 24 1
3 20 2
4 16 3
5 12 4
6 8 3
7 4 0
8 0 1
Is there a better, more pythonic way to get the same result?
I would like to create a new column [difference] and store the data there, so that the result has 4 columns: [date, code, sold, difference].

This is exactly the kind of thing that pandas' groupby functionality is built for, and I highly recommend reading and working through the pandas groupby documentation.
This code replicates what you are asking for, but for every date.
df = pd.DataFrame({'date':['Mon','Mon','Mon','Tue','Tue','Tue'],'code':[10,21,30,10,21,30], 'sold':[12,13,34,10,15,20]})
df['difference'] = df.groupby('date')['sold'].diff()
df
code date sold difference
0 10 Mon 12 NaN
1 21 Mon 13 1
2 30 Mon 34 21
3 10 Tue 10 NaN
4 21 Tue 15 5
5 30 Tue 20 5
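In the question's data the codes are not in order within each date, and the desired output is sorted by code in descending order. A minimal sketch combining a sort with the grouped diff is shown below; it uses six rows taken from the question's dataframe (codes 0, 12 and 4 on the first two dates).

```python
import pandas as pd

# Six rows from the question: codes 0, 12 and 4 on two dates
df = pd.DataFrame({
    'date': ['20150521'] * 3 + ['20150522'] * 3,
    'code': [0, 12, 4, 0, 12, 4],
    'sold': [47, 39, 43, 47, 37, 42],
})

# Order by date, then by code descending, as in the question's expected output
df = df.sort_values(['date', 'code'], ascending=[True, False])

# diff() restarts within each date, so the first row of every date is NaN
df['difference'] = df.groupby('date')['sold'].diff()
print(df)
```

The result keeps all four columns (date, code, sold, difference) in a single frame, so no per-date filtering is needed.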
