I have read a couple of similar posts regarding this issue before, but none of the solutions worked for me. I have the following CSV:
Score date term
0 72 3 Feb · 1
1 47 1 Feb · 1
2 119 6 Feb · 1
8 101 7 hrs · 1
9 536 11 min · 1
10 53 2 hrs · 1
11 20 11 Feb · 3
3 15 1 hrs · 2
4 33 7 Feb · 1
5 153 4 Feb · 3
6 34 3 min · 2
7 26 3 Feb · 3
I want to sort the CSV by date. What's the easiest way to do that?
You can create two helper columns: one with datetimes created by to_datetime, and a second with timedeltas created by to_timedelta. to_timedelta needs the format HH:MM:SS, so first normalize the strings with Series.replace using regexes; then sort by both columns with DataFrame.sort_values:
import pandas as pd

# Parse the '3 Feb'-style values; 'hrs'/'min' rows become NaT
df['date1'] = pd.to_datetime(df['date'], format='%d %b', errors='coerce')

# Normalize '11 min' -> '00:11:00' and '1 hrs' -> '1:00:00'
times = df['date'].replace({r'(\d+)\s+min': r'00:\1:00',
                            r'\s+hrs': ':00:00'}, regex=True)
df['times'] = pd.to_timedelta(times, errors='coerce')

df = df.sort_values(['times', 'date1'])
print(df)
Score date term date1 times
6 34 3 min 2 NaT 00:03:00
9 536 11 min 1 NaT 00:11:00
3 15 1 hrs 2 NaT 01:00:00
10 53 2 hrs 1 NaT 02:00:00
8 101 7 hrs 1 NaT 07:00:00
1 47 1 Feb 1 1900-02-01 NaT
0 72 3 Feb 1 1900-02-03 NaT
7 26 3 Feb 3 1900-02-03 NaT
5 153 4 Feb 3 1900-02-04 NaT
2 119 6 Feb 1 1900-02-06 NaT
4 33 7 Feb 1 1900-02-07 NaT
11 20 11 Feb 3 1900-02-11 NaT
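For reference, a self-contained sketch of the same approach with the sample data inlined (the DataFrame construction is an assumption based on the question's listing):

```python
import pandas as pd

df = pd.DataFrame({
    'Score': [72, 47, 119, 15, 33, 153, 34, 26, 101, 536, 53, 20],
    'date': ['3 Feb', '1 Feb', '6 Feb', '1 hrs', '7 Feb', '4 Feb',
             '3 min', '3 Feb', '7 hrs', '11 min', '2 hrs', '11 Feb'],
    'term': [1, 1, 1, 2, 1, 3, 2, 3, 1, 1, 1, 3],
})

# Datetimes for the '3 Feb'-style values; 'hrs'/'min' rows become NaT
df['date1'] = pd.to_datetime(df['date'], format='%d %b', errors='coerce')

# Normalize '11 min' -> '00:11:00' and '1 hrs' -> '1:00:00' for to_timedelta
times = df['date'].replace({r'(\d+)\s+min': r'00:\1:00',
                            r'\s+hrs': ':00:00'}, regex=True)
df['times'] = pd.to_timedelta(times, errors='coerce')

df = df.sort_values(['times', 'date1'])
```

NaT sorts last by default, which is why the min/hrs rows come first and the Feb rows follow in calendar order.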
I have a pandas df, like this:
ID date value
0 10 2022-01-01 100
1 10 2022-01-02 150
2 10 2022-01-03 0
3 10 2022-01-04 0
4 10 2022-01-05 200
5 10 2022-01-06 0
6 10 2022-01-07 150
7 10 2022-01-08 0
8 10 2022-01-09 0
9 10 2022-01-10 0
10 10 2022-01-11 0
11 10 2022-01-12 100
12 23 2022-02-01 490
13 23 2022-02-02 0
14 23 2022-02-03 350
15 23 2022-02-04 333
16 23 2022-02-05 0
17 23 2022-02-06 0
18 23 2022-02-07 0
19 23 2022-02-08 211
20 23 2022-02-09 100
I would like to calculate the number of days since the last non-zero value, as in the example below. How can I do this using diff()? The calculation should restart for each ID.
Output:
ID date value days_last_value
0 10 2022-01-01 100 0
1 10 2022-01-02 150 1
2 10 2022-01-03 0
3 10 2022-01-04 0
4 10 2022-01-05 200 3
5 10 2022-01-06 0
6 10 2022-01-07 150 2
7 10 2022-01-08 0
8 10 2022-01-09 0
9 10 2022-01-10 0
10 10 2022-01-11 0
11 10 2022-01-12 100 5
12 23 2022-02-01 490 0
13 23 2022-02-02 0
14 23 2022-02-03 350 2
15 23 2022-02-04 333 1
16 23 2022-02-05 0
17 23 2022-02-06 0
18 23 2022-02-07 0
19 23 2022-02-08 211 4
20 23 2022-02-09 100 1
Explanation below.
import pandas as pd
df = pd.DataFrame({'ID': 12 * [10] + 9 * [23],
'value': [100, 150, 0, 0, 200, 0, 150, 0, 0, 0, 0, 100, 490, 0, 350, 333, 0, 0, 0, 211, 100]})
days = df.groupby(['ID', (df['value'] != 0).cumsum()]).size().groupby('ID').shift(fill_value=0)
days.index = df.index[df['value'] != 0]
df['days_last_value'] = days
df
ID value days_last_value
0 10 100 0.0
1 10 150 1.0
2 10 0 NaN
3 10 0 NaN
4 10 200 3.0
5 10 0 NaN
6 10 150 2.0
7 10 0 NaN
8 10 0 NaN
9 10 0 NaN
10 10 0 NaN
11 10 100 5.0
12 23 490 0.0
13 23 0 NaN
14 23 350 2.0
15 23 333 1.0
16 23 0 NaN
17 23 0 NaN
18 23 0 NaN
19 23 211 4.0
20 23 100 1.0
First, we'll have to group by 'ID'.
We also create groups for each block of days by building a True/False series where value is not 0, then taking a cumulative sum. That is the (df['value'] != 0).cumsum() part, which results in
0 1
1 2
2 2
3 2
4 3
5 3
6 4
7 4
8 4
9 4
10 4
11 5
12 6
13 6
14 7
15 8
16 8
17 8
18 8
19 9
20 10
We can use the values in this series to also group on; combining that with the 'ID' group, you have the individual blocks of days. This is the df.groupby(['ID', (df['value'] != 0).cumsum()]) part.
Now, for each block, we get its size, which is exactly the interval in days: what you want. We do need to shift by one, since each row should receive the size of the *previous* block rather than its own. Because this shift has to stay within each ID, we first group by ID again before shifting (we lost the grouping after doing .size()), filling the first entry of each ID group with 0.
Now, this new series needs to be assigned back to the dataframe, but it's obviously shorter. Since its index is also reset, we can't easily reassign it (not with df['days_last_value'], df.loc[...] or df.iloc).
Instead, we select the index values of the original dataframe where value is not zero, and set the index of the days equal to that.
Now it's an easy step to assign the days directly to the relevant column in the dataframe: pandas will match the indices.
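To make the intermediate pieces concrete, here is a minimal sketch (same data as above) that exposes the block labels and sizes separately:

```python
import pandas as pd

df = pd.DataFrame({'ID': 12 * [10] + 9 * [23],
                   'value': [100, 150, 0, 0, 200, 0, 150, 0, 0, 0, 0, 100,
                             490, 0, 350, 333, 0, 0, 0, 211, 100]})

# Block labels: a new block starts at every non-zero value
blocks = (df['value'] != 0).cumsum()

# Size of each (ID, block) pair -- the gap in days to the next non-zero value
sizes = df.groupby(['ID', blocks]).size()

# Shift within each ID so a row gets the size of the *previous* block,
# filling the first entry of each ID with 0
days = sizes.groupby('ID').shift(fill_value=0)
```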
I have the following dataframe:
    date   wind (°)   wind (kt)  temp (C°)  humidity(%)  currents (°)  currents (kt)  stemp (C°)  sea_temp_diff  wind_distance_diff  wind_speed_diff  temp_diff  humidity_diff  current_distance_diff  current_speed_diff
8   12018  175.000000  16.333333  25.500000    82.500000     60.000000       0.100000   25.400000      -1.066667           23.333333        -0.500000  -0.333333     -12.000000             160.000000        6.666667e-02
9   12019  180.000000  17.000000  23.344828    79.724138    230.000000       0.100000   23.827586      -0.379310           22.068966         1.068966   0.827586      -7.275862             315.172414        3.449034e+02
10  12020  365.000000 208.653846  24.192308    79.346154    355.769231     192.500000   24.730769     574.653846         1121.923077      1151.153846 1149.346154  -19.538462            1500.000000        1.538454e+03
14  22019  530.357143 372.964286  23.964286    81.964286   1270.714286    1071.560714  735.642857    -533.642857         -327.500000      -356.892857   1.857143    -10.321429            -873.571429       -8.928107e+02
15  22020  216.551724  12.689655  24.517241    81.137931    288.275862     172.565517  196.827586    -171.379310           -8.965517         3.724138   1.413793     -7.137931            -105.517241       -1.722724e+02
16  32019  323.225806 174.709677  25.225806    80.741935    260.000000     161.451613   25.709677     480.709677          486.451613       483.967742   0.387097    153.193548            1044.516129        9.677065e+02
17  32020  351.333333 178.566667  25.533333    78.800000    427.666667     166.666667   26.600000     165.533333         -141.000000      -165.766667 166.633333    158.933333               8.333333        1.500000e-01
18  42017  180.000000  14.000000  27.000000  5000.000000    200.000000       0.400000   25.400000       2.600000           20.000000        -4.000000   0.000000      0.000000             -90.000000       -1.000000e-01
19  42019  694.230769 589.769231  24.038462    69.461538    681.153846     577.046154   26.884615      -1.346154           37.307692        -1.692308   1.500000      4.769231              98.846154        1.538462e-01
20  42020  306.666667 180.066667  24.733333    75.166667    427.666667     166.666667   26.800000     165.066667          205.333333       165.200000   1.100000     -4.066667             360.333333        3.334233e+02
21  52017  146.333333  11.966667  22.900000  5000.000000    116.333333       0.410000   26.066667      -1.553333            8.666667         0.833333  -0.766667      0.000000              95.000000       -1.300000e-01
22  52019  107.741935  12.322581  23.419355    63.032258    129.354839       0.332258   25.935484      -1.774194           14.838710         0.096774  -0.612903    -14.451613             130.967742
I need to sort the 'date' column chronologically, and I'm wondering if there's a way for me to split it two ways, with the '10' in one column and 2017 in another, sort both of them in ascending order, and then bring them back together.
I had tried this:
australia_overview[['month','year']] = australia_overview['date'].str.split("2",expand=True)
But I am getting error like this:
ValueError: Columns must be same length as key
How can I solve this issue?
From your DataFrame:
>>> df = pd.DataFrame({'id': [1, 2, 3, 4],
... 'date': ['1 42018', '12 32019', '8 112020', '23 42021']},
... index = [0, 1, 2, 3])
>>> df
id date
0 1 1 42018
1 2 12 32019
2 3 8 112020
3 4 23 42021
We can split the column to get the day, like so:
>>> df['day'] = df['date'].str.split(' ', expand=True)[0]
>>> df
id date day
0 1 1 42018 1
1 2 12 32019 12
2 3 8 112020 8
3 4 23 42021 23
And take the last 4 digits of the date column for the year to get the expected result:
>>> df['year'] = df['date'].str[-4:].astype(int)
>>> df
id date day year
0 1 1 42018 1 2018
1 2 12 32019 12 2019
2 3 8 112020 8 2020
3 4 23 42021 23 2021
Bonus: as asked in the comments, you can even get the month using the same principle:
>>> df['month'] = df['date'].str.split(' ', expand=True)[1].str[:-4].astype(int)
>>> df
id date day year month
0 1 1 42018 1 2018 4
1 2 12 32019 12 2019 3
2 3 8 112020 8 2020 11
3 4 23 42021 23 2021 4
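Since the original goal was a chronological sort, here is a hedged sketch of the final step, reusing the same hypothetical df:

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3, 4],
                   'date': ['1 42018', '12 32019', '8 112020', '23 42021']})

parts = df['date'].str.split(' ', expand=True)
df['day'] = parts[0].astype(int)
df['year'] = df['date'].str[-4:].astype(int)
df['month'] = parts[1].str[:-4].astype(int)

# Rebuild a real datetime from the parts and sort on it
df['parsed'] = pd.to_datetime(dict(year=df['year'], month=df['month'], day=df['day']))
df = df.sort_values('parsed')
```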
So I have a data frame that is something like this
Resource 2020-06-01 2020-06-02 2020-06-03
Name1 8 7 8
Name2 7 9 9
Name3 10 10 10
Imagine that the header is literally all the days of the month, and that there are far more names than just three.
I need to reduce the columns to five. Considering the first column to be the days between 2020-06-01 till 2020-06-05. Then from Saturday till Friday of the same week. Or the last day of the month if it is before Friday. So for June would be these weeks:
week 1: 2020-06-01 to 2020-06-05
week 2: 2020-06-06 to 2020-06-12
week 3: 2020-06-13 to 2020-06-19
week 4: 2020-06-20 to 2020-06-26
week 5: 2020-06-27 to 2020-06-30
I have no problem defining these weeks. The problem is grouping the columns based on them.
I couldn't come up with anything.
Does someone have any ideas about this?
I had to use this code to generate your dataframe:
import numpy as np
import pandas as pd

dates = pd.date_range(start='2020-06-01', end='2020-06-30')
df = pd.DataFrame({
    'Name1': np.random.randint(1, 10, size=len(dates)),
    'Name2': np.random.randint(1, 10, size=len(dates)),
    'Name3': np.random.randint(1, 10, size=len(dates)),
})
df = df.set_index(dates).transpose().reset_index().rename(columns={'index': 'Resource'})
Then, the solution starts from here.
# Set the first column as index
df = df.set_index(df['Resource'])
# Remove the unused column
df = df.drop(columns=['Resource'])
# Transpose the dataframe
df = df.transpose()
# Output:
Resource Name1 Name2 Name3
2020-06-01 00:00:00 3 2 7
2020-06-02 00:00:00 5 6 8
2020-06-03 00:00:00 2 3 6
...
# Bring "Resource" from index to column
df = df.reset_index()
df = df.rename(columns={'index': 'Resource'})
# Add a "week of year" column (weekofyear is deprecated in recent pandas;
# use dt.isocalendar().week instead)
df['week_no'] = df['Resource'].dt.isocalendar().week.astype(int)
# You can simply group by the week no column
df.groupby('week_no').sum().reset_index()
# Output:
Resource week_no Name1 Name2 Name3
0 23 38 42 41
1 24 37 30 43
2 25 38 29 23
3 26 29 40 42
4 27 2 8 3
I don't know what you want to do next. If you want the original form, just transpose() it back.
EDIT: OP clarified that the week should start on Saturday and end on Friday
# 0: Monday
# 1: Tuesday
# 2: Wednesday
# 3: Thursday
# 4: Friday
# 5: Saturday
# 6: Sunday
df['weekday'] = df['Resource'].dt.weekday.apply(lambda day: 0 if day <= 4 else 1)
df['customised_weekno'] = df['week_no'] + df['weekday']
Output:
Resource Resource Name1 Name2 Name3 week_no weekday customised_weekno
0 2020-06-01 4 7 7 23 0 23
1 2020-06-02 8 6 7 23 0 23
2 2020-06-03 5 9 5 23 0 23
3 2020-06-04 7 6 5 23 0 23
4 2020-06-05 6 3 7 23 0 23
5 2020-06-06 3 7 6 23 1 24
6 2020-06-07 5 4 4 23 1 24
7 2020-06-08 8 1 5 24 0 24
8 2020-06-09 2 7 9 24 0 24
9 2020-06-10 4 2 7 24 0 24
10 2020-06-11 6 4 4 24 0 24
11 2020-06-12 9 5 7 24 0 24
12 2020-06-13 2 4 6 24 1 25
13 2020-06-14 6 7 5 24 1 25
14 2020-06-15 8 7 7 25 0 25
15 2020-06-16 4 3 3 25 0 25
16 2020-06-17 6 4 5 25 0 25
17 2020-06-18 6 8 2 25 0 25
18 2020-06-19 3 1 2 25 0 25
So, you can use customised_weekno for grouping.
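Putting the steps above together, a runnable sketch (random sample data, so only the week labels are deterministic):

```python
import numpy as np
import pandas as pd

dates = pd.date_range(start='2020-06-01', end='2020-06-30')
df = pd.DataFrame({'Resource': dates,
                   'Name1': np.random.randint(1, 10, size=len(dates))})

# ISO week number (weekofyear is deprecated in recent pandas)
df['week_no'] = df['Resource'].dt.isocalendar().week.astype(int)

# Saturday (5) and Sunday (6) are pushed into the *next* week number,
# so each group runs Saturday..Friday
df['weekday'] = df['Resource'].dt.weekday.apply(lambda day: 0 if day <= 4 else 1)
df['customised_weekno'] = df['week_no'] + df['weekday']

weekly = df.groupby('customised_weekno')['Name1'].sum()
```

For June 2020 this yields exactly the five groups listed in the question: 2020-06-01..05, 06..12, 13..19, 20..26 and 27..30.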
I have some data that looks like:
key DATE - DAY DATE - MONTH DATE - YEAR GMT HRS GMT MINUTES
1 2 29 2 2016 2 2
2 3 29 2 2016 2 2
3 4 29 2 2016 2 2
4 5 29 2 2016 2 2
5 6 29 2 2016 2 2
6 7 29 2 2016 2 2
7 8 29 2 2016 2 3
8 9 29 2 2016 2 3
9 10 29 2 2016 2 3
GMT SECONDS
1 54
2 55
3 56
4 57
5 58
6 59
7 0
8 1
9 2
At first the data was of type float and the year was in the two-digit format 16, so I did:
t['DATE - MONTH'] = t['DATE - MONTH'].astype(int)
t['DATE - YEAR'] = t['DATE - YEAR'].astype(int)
t['DATE - YEAR'] = t['DATE - YEAR']+2000
t['DATE - DAY'] = t['DATE - DAY'].astype(int)
^Note: I was also confused why, when using an index number rather than the column name, you only seem to work on a temporary table, i.e. you can print the desired result but it didn't change the dataframe.
Then I tried two methods:
t['Date'] = pd.to_datetime(dict(year=t['DATE - YEAR'], month = t['DATE - MONTH'], day = t['DATE - DAY']))
t['Date'] = pd.to_datetime((t['DATE - YEAR']*10000+t['DATE - MONTH']*100+t['DATE - DAY']).apply(str),format='%Y%m%d')
Both return:
ValueError: cannot assemble the datetimes: time data 20000000 does not match format '%Y%m%d' (match)
I'd like to create a date column (and then after use a similar logic for a datetime column with the additional 3 columns).
What is the problem?
EDIT: I had bad data and added errors='coerce' to handle those rows
First rename all the columns, select them by the values of the dict, and use to_datetime:
Assembling a datetime from multiple columns of a DataFrame. The keys can be common abbreviations like ['year', 'month', 'day', 'minute', 'second', 'ms', 'us', 'ns'] or plurals of the same.
d = {'DATE - YEAR':'year','DATE - MONTH':'month','DATE - DAY':'day',
'GMT HRS':'hour','GMT MINUTES':'minute','GMT SECONDS':'second'}
df['datetime'] = pd.to_datetime(df.rename(columns=d)[list(d.values())])
print (df)
key DATE - DAY DATE - MONTH DATE - YEAR GMT HRS GMT MINUTES \
1 2 29 2 2016 2 2
2 3 29 2 2016 2 2
3 4 29 2 2016 2 2
4 5 29 2 2016 2 2
5 6 29 2 2016 2 2
6 7 29 2 2016 2 2
7 8 29 2 2016 2 3
8 9 29 2 2016 2 3
9 10 29 2 2016 2 3
GMT SECONDS datetime
1 54 2016-02-29 02:02:54
2 55 2016-02-29 02:02:55
3 56 2016-02-29 02:02:56
4 57 2016-02-29 02:02:57
5 58 2016-02-29 02:02:58
6 59 2016-02-29 02:02:59
7 0 2016-02-29 02:03:00
8 1 2016-02-29 02:03:01
9 2 2016-02-29 02:03:02
Detail:
print (df.rename(columns=d)[list(d.values())])
day month second year minute hour
1 29 2 54 2016 2 2
2 29 2 55 2016 2 2
3 29 2 56 2016 2 2
4 29 2 57 2016 2 2
5 29 2 58 2016 2 2
6 29 2 59 2016 2 2
7 29 2 0 2016 3 2
8 29 2 1 2016 3 2
9 29 2 2 2016 3 2
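As the question's EDIT mentions, bad rows can be handled with errors='coerce'; a small sketch with one deliberately invalid row (month and day 0, as produced by the bad data):

```python
import pandas as pd

df = pd.DataFrame({'DATE - YEAR': [2016, 2000],
                   'DATE - MONTH': [2, 0],
                   'DATE - DAY': [29, 0]})

d = {'DATE - YEAR': 'year', 'DATE - MONTH': 'month', 'DATE - DAY': 'day'}

# Invalid rows become NaT instead of raising ValueError
df['date'] = pd.to_datetime(df.rename(columns=d)[list(d.values())],
                            errors='coerce')
```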
import pandas as pd
import numpy as np
df1=pd.DataFrame(np.arange(25).reshape((5,5)),index=pd.date_range('2015/01/01',periods=5,freq='D'))
df1['trading_signal']=[1,-1,1,-1,1]
df1
0 1 2 3 4 trading_signal
2015-01-01 0 1 2 3 4 1
2015-01-02 5 6 7 8 9 -1
2015-01-03 10 11 12 13 14 1
2015-01-04 15 16 17 18 19 -1
2015-01-05 20 21 22 23 24 1
and
df2
0 1 2 3 4
Date Time
2015-01-01 22:55:00 0 1 2 3 4
23:55:00 5 6 7 8 9
2015-01-02 00:55:00 10 11 12 13 14
01:55:00 15 16 17 18 19
02:55:00 20 21 22 23 24
How would I get the value of trading_signal from df1 and send it to df2?
I want an output like this:
0 1 2 3 4 trading_signal
Date Time
2015-01-01 22:55:00 0 1 2 3 4 1
23:55:00 5 6 7 8 9 1
2015-01-02 00:55:00 10 11 12 13 14 -1
01:55:00 15 16 17 18 19 -1
02:55:00 20 21 22 23 24 -1
You need to either merge or join. If you merge, you need to reset_index, which is less memory efficient and slower than using join. Please read the docs on Joining a single index to a multi index:
New in version 0.14.0.
You can join a singly-indexed DataFrame with a level of a
multi-indexed DataFrame. The level will match on the name of the index
of the singly-indexed frame against a level name of the multi-indexed
frame
If you want to use join, you must name the index of df1 to be Date so that it matches the name of the first level of df2:
df1.index.names = ['Date']
df1[['trading_signal']].join(df2, how='right')
trading_signal 0 1 2 3 4
Date Time
2015-01-01 22:55:00 1 0 1 2 3 4
23:55:00 1 5 6 7 8 9
2015-01-02 00:55:00 -1 10 11 12 13 14
01:55:00 -1 15 16 17 18 19
02:55:00 -1 20 21 22 23 24
I'm joining right for a reason; if you don't understand what this means, please read Brief primer on merge methods (relational algebra).
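For comparison, a sketch of both routes side by side; df2's construction here is an assumption matching the printed frame:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(25).reshape((5, 5)),
                   index=pd.date_range('2015/01/01', periods=5, freq='D'))
df1['trading_signal'] = [1, -1, 1, -1, 1]
df1.index.names = ['Date']  # must match the first level name of df2

idx = pd.MultiIndex.from_tuples(
    [(pd.Timestamp('2015-01-01'), '22:55:00'),
     (pd.Timestamp('2015-01-01'), '23:55:00'),
     (pd.Timestamp('2015-01-02'), '00:55:00'),
     (pd.Timestamp('2015-01-02'), '01:55:00'),
     (pd.Timestamp('2015-01-02'), '02:55:00')],
    names=['Date', 'Time'])
df2 = pd.DataFrame(np.arange(25).reshape((5, 5)), index=idx)

# Join route: the single index matches against the 'Date' level of df2
joined = df1[['trading_signal']].join(df2, how='right')

# Merge route: flatten the indexes to columns, merge on 'Date', restore index
merged = (df2.reset_index()
             .merge(df1[['trading_signal']].reset_index(), on='Date')
             .set_index(['Date', 'Time']))
```

Both give the same trading_signal broadcast across the Time level; join simply avoids the index round-trip.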