pandas: sort a DataFrame by an object column and a datetime column? - python

I'm having trouble sorting a pandas DataFrame, first by a column containing a string and then by a datetime column. When doing so, the dates returned are out of order. What am I doing wrong?
df looks like
Date Field 1
0 2013-07-01 00:00:00 1
1 2013-07-02 00:00:00 1
2 2013-07-03 00:00:00 1
3 2013-07-03 00:00:00 2
4 2013-07-05 00:00:00 2
5 2013-07-05 00:00:00 1
6 2013-07-08 00:00:00 2
7 2013-07-09 00:00:00 2
8 2013-07-11 00:00:00 2
9 2013-07-12 00:00:00 2
10 2013-07-15 00:00:00 1
11 2013-07-16 00:00:00 1
12 2013-07-17 00:00:00 1
13 2013-07-18 00:00:00 1
14 2013-07-19 00:00:00 1
When the DataFrame was created, Date was an object; it was converted to datetime using:
df['Date'] = df['Date'].apply(dateutil.parser.parse)
now the dtypes are:
Date datetime64[ns]
Field 1 int64
dtype: object
When I run either
df.sort_index(by=['Field 1', 'Date'])
or
df.sort(['Field 1', 'Date'])
I get back:
Date Field 1
0 2013-07-01 00:00:00 1
1 2013-07-02 00:00:00 1
2 2013-07-03 00:00:00 1
10 2013-07-15 00:00:00 1
5 2013-07-05 00:00:00 1
11 2013-07-16 00:00:00 1
12 2013-07-17 00:00:00 1
13 2013-07-18 00:00:00 1
14 2013-07-19 00:00:00 1
8 2013-07-11 00:00:00 2
9 2013-07-12 00:00:00 2
3 2013-07-03 00:00:00 2
4 2013-07-05 00:00:00 2
6 2013-07-08 00:00:00 2
7 2013-07-09 00:00:00 2
What I really want back is:
Date Field 1
0 2013-07-01 00:00:00 1
1 2013-07-02 00:00:00 1
2 2013-07-03 00:00:00 1
5 2013-07-05 00:00:00 1
10 2013-07-15 00:00:00 1
11 2013-07-16 00:00:00 1
12 2013-07-17 00:00:00 1
13 2013-07-18 00:00:00 1
14 2013-07-19 00:00:00 1
3 2013-07-03 00:00:00 2
4 2013-07-05 00:00:00 2
6 2013-07-08 00:00:00 2
7 2013-07-09 00:00:00 2
8 2013-07-11 00:00:00 2
9 2013-07-12 00:00:00 2
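For reference, df.sort() and sort_index(by=...) were removed in later pandas releases; a minimal sketch of the same two-column sort with the modern sort_values API (rebuilding the sample frame from the question, so Date is guaranteed to be datetime64):

```python
import io
import pandas as pd

# rebuild the sample frame from the question
csv = io.StringIO(
    "Date,Field 1\n"
    "2013-07-01,1\n2013-07-02,1\n2013-07-03,1\n2013-07-03,2\n"
    "2013-07-05,2\n2013-07-05,1\n2013-07-08,2\n2013-07-09,2\n"
    "2013-07-11,2\n2013-07-12,2\n2013-07-15,1\n2013-07-16,1\n"
    "2013-07-17,1\n2013-07-18,1\n2013-07-19,1\n"
)
df = pd.read_csv(csv, parse_dates=["Date"])  # Date becomes datetime64[ns]

# sort_values replaced sort/sort_index(by=...) from pandas 0.17 onward
out = df.sort_values(["Field 1", "Date"])
print(out)
```

If the dates still come back out of order, double-check df.dtypes: a lexicographic sort on an object column is the usual culprit.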

Related

Pandas DateTime only partially showing in Matplotlib

Problem: I am trying to make a bar chart based on a very simple Pandas DataFrame that has a DateTime index and integers in one column. Data below. Only some of the data is showing up in Matplotlib, however.
Code:
import matplotlib.pyplot as plt
import matplotlib.dates as mdates

fig, ax = plt.subplots(figsize=(16, 6))
ax.bar(unique_opps_by_month.index, unique_opps_by_month['Opportunity Name'])
ax.xaxis.set_major_locator(mdates.MonthLocator((1, 4, 7, 10)))
ax.xaxis.set_minor_locator(mdates.MonthLocator())
ax.xaxis.set_major_formatter(mdates.DateFormatter('%Y-%b'))
ax.set_xlim('2016-01', '2021-04')
fig.autofmt_xdate()
The output, however, looks like:
[output image omitted: a bar chart in which only a few bars per year appear, all at the same height]
This doesn't match the data though! There should be eight bars in 2017, and not all the same height. There are similar problems in the other years as well. Why are only some of the bars showing? How can I make Matplotlib show all the data?
Data
2016-02-01 1
2016-05-01 1
2016-08-01 1
2016-09-01 1
2017-01-01 1
2017-02-01 1
2017-03-01 1
2017-04-01 1
2017-07-01 3
2017-10-01 2
2017-11-01 3
2017-12-01 1
2018-02-01 2
2018-03-01 2
2018-04-01 2
2018-06-01 1
2018-07-01 1
2018-08-01 1
2018-11-01 1
2018-12-01 2
2019-03-01 5
2019-04-01 2
2019-05-01 1
2019-06-01 2
2019-07-01 1
2019-08-01 2
2019-09-01 4
2019-11-01 5
2020-01-01 4
2020-02-01 6
2020-03-01 1
2020-06-01 1
2020-07-01 2
2020-09-01 3
2020-10-01 5
2020-11-01 4
2020-12-01 6
2021-01-01 3
2021-02-01 6
2021-03-01 3
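One likely culprit, sketched below on a small stand-in for unique_opps_by_month (the DataFrame name and column are taken from the question): on a date axis, matplotlib's default bar width of 0.8 is interpreted as 0.8 days, so monthly bars spread over five years can shrink below one pixel and silently vanish. Passing an explicit width in days keeps every bar visible:

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt
import matplotlib.dates as mdates

# small stand-in for unique_opps_by_month from the question
idx = pd.to_datetime(["2017-01-01", "2017-02-01", "2017-07-01"])
unique_opps_by_month = pd.DataFrame({"Opportunity Name": [1, 1, 3]}, index=idx)

fig, ax = plt.subplots(figsize=(16, 6))
# default bar width is 0.8 (i.e. 0.8 *days* on a date axis); widen explicitly
ax.bar(unique_opps_by_month.index,
       unique_opps_by_month["Opportunity Name"], width=20)
ax.xaxis.set_major_locator(mdates.MonthLocator((1, 4, 7, 10)))
ax.xaxis.set_major_formatter(mdates.DateFormatter("%Y-%b"))
fig.autofmt_xdate()
```

This is a sketch under the assumption that the missing bars are a rendering-width issue rather than missing data; printing the frame before plotting rules out the latter.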

Create regular time series from irregular interval with python

I wonder if it is possible to convert an irregular time-series interval to a regular one, without interpolating values from another column, like this:
Index count
2018-01-05 00:00:00 1
2018-01-07 00:00:00 4
2018-01-08 00:00:00 15
2018-01-11 00:00:00 2
2018-01-14 00:00:00 5
2018-01-19 00:00:00 5
....
2018-12-26 00:00:00 6
2018-12-29 00:00:00 7
2018-12-30 00:00:00 8
And I expect the result to be something like this:
Index count
2018-01-01 00:00:00 0
2018-01-02 00:00:00 0
2018-01-03 00:00:00 0
2018-01-04 00:00:00 0
2018-01-05 00:00:00 1
2018-01-06 00:00:00 0
2018-01-07 00:00:00 4
2018-01-08 00:00:00 15
2018-01-09 00:00:00 0
2018-01-10 00:00:00 0
2018-01-11 00:00:00 2
2018-01-12 00:00:00 0
2018-01-13 00:00:00 0
2018-01-14 00:00:00 5
2018-01-15 00:00:00 0
2018-01-16 00:00:00 0
2018-01-17 00:00:00 0
2018-01-18 00:00:00 0
2018-01-19 00:00:00 5
....
2018-12-26 00:00:00 6
2018-12-27 00:00:00 0
2018-12-28 00:00:00 0
2018-12-29 00:00:00 7
2018-12-30 00:00:00 8
2018-12-31 00:00:00 0
So far I have just tried pandas resample, but it only partially solved my problem.
Thanks in advance
Use DataFrame.reindex with date_range:
# if necessary, convert the index to datetime first
df.index = pd.to_datetime(df.index)
df = df.reindex(pd.date_range('2018-01-01', '2018-12-31'), fill_value=0)
print(df)
count
2018-01-01 0
2018-01-02 0
2018-01-03 0
2018-01-04 0
2018-01-05 1
...
2018-12-27 0
2018-12-28 0
2018-12-29 7
2018-12-30 8
2018-12-31 0
[365 rows x 1 columns]
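A self-contained version of that reindex answer, trimmed to three sample rows, shows that the result covers every day of the year with zeros filled in:

```python
import pandas as pd

# a few rows of the irregular series from the question
df = pd.DataFrame({"count": [1, 4, 15]},
                  index=pd.to_datetime(["2018-01-05", "2018-01-07", "2018-01-08"]))

# expand to a daily index for all of 2018; missing days get 0
full = df.reindex(pd.date_range("2018-01-01", "2018-12-31"), fill_value=0)
print(len(full))  # 365 rows, one per day
```

DataFrame.asfreq('D', fill_value=0) does something similar, but only between the first and last existing dates; reindex with an explicit date_range also pads the start and end of the year.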

How to prevent the .diff() function from returning ridiculous values when applied to a DataFrame of datetimes and NaT values in pandas?

I have got a dataframe loc_df where all the values are datetime and some of them are NaT. This is what loc_df looks like:
loc_df = pd.DataFrame({'10101':['2020-01-03','2019-11-06','2019-10-09','2019-09-26','2019-09-19','2019-08-19','2019-08-08','2019-07-05','2019-07-04','2019-06-27','2019-05-21','2019-04-21','2019-04-15','2019-04-06','2019-03-28','2019-02-28'], '10102':['2020-01-03','2019-11-15','2019-11-11','2019-10-23','2019-10-10','2019-10-06','2019-09-26','2019-07-14','2019-05-21','2019-03-15','2019-03-11','2019-02-27','2019-02-25',None,None,None], '10103':['2019-08-27','2019-07-14','2019-06-24','2019-05-21','2019-04-11','2019-03-06','2019-02-11',None,None,None,None,None,None,None,None,None]})
loc_df = loc_df.apply(pd.to_datetime)
print(loc_df)
10101 10102 10103
0 2020-01-03 2020-01-03 2019-08-27
1 2019-11-06 2019-11-15 2019-07-14
2 2019-10-09 2019-11-11 2019-06-24
3 2019-09-26 2019-10-23 2019-05-21
4 2019-09-19 2019-10-10 2019-04-11
5 2019-08-19 2019-10-06 2019-03-06
6 2019-08-08 2019-09-26 2019-02-11
7 2019-07-05 2019-07-14 NaT
8 2019-07-04 2019-05-21 NaT
9 2019-06-27 2019-03-15 NaT
10 2019-05-21 2019-03-11 NaT
11 2019-04-21 2019-02-27 NaT
12 2019-04-15 2019-02-25 NaT
13 2019-04-06 NaT NaT
14 2019-03-28 NaT NaT
15 2019-02-28 NaT NaT
I want to know the number of days between the dates in each column, so I have used:
loc_df = loc_df.diff(periods = -1)
The result was:
print(loc_df)
10101 10102 10103
0 58 days 49 days 00:00:00 44 days 00:00:00
1 28 days 4 days 00:00:00 20 days 00:00:00
2 13 days 19 days 00:00:00 34 days 00:00:00
3 7 days 13 days 00:00:00 40 days 00:00:00
4 31 days 4 days 00:00:00 36 days 00:00:00
5 11 days 10 days 00:00:00 23 days 00:00:00
6 34 days 74 days 00:00:00 -88814 days +00:12:43.145224
7 1 days 54 days 00:00:00 0 days 00:00:00
8 7 days 67 days 00:00:00 0 days 00:00:00
9 37 days 4 days 00:00:00 0 days 00:00:00
10 30 days 12 days 00:00:00 0 days 00:00:00
11 6 days 2 days 00:00:00 0 days 00:00:00
12 9 days -88800 days +00:12:43.145224 0 days 00:00:00
13 9 days 0 days 00:00:00 0 days 00:00:00
14 28 days 0 days 00:00:00 0 days 00:00:00
15 NaT NaT NaT
Do you know why I get these huge values at the end of each column? I guess it has something to do with subtracting a NaT from a datetime.
Is there an alternative to my code that prevents this?
Thanks in advance
If you have some initial data:
print(loc_df)
10101 10102 10103
0 2020-01-03 2020-01-03 2019-08-27
1 2019-11-06 2019-11-15 2019-07-14
2 2019-10-09 2019-11-11 2019-06-24
3 2019-09-26 2019-10-23 2019-05-21
4 2019-09-19 2019-10-10 2019-04-11
5 2019-08-19 2019-10-06 2019-03-06
6 2019-08-08 2019-09-26 2019-02-11
7 2019-07-05 2019-07-14 NaT
8 2019-07-04 2019-05-21 NaT
9 2019-06-27 2019-03-15 NaT
10 2019-05-21 2019-03-11 NaT
11 2019-04-21 2019-02-27 NaT
12 2019-04-15 2019-02-25 NaT
13 2019-04-06 NaT NaT
14 2019-03-28 NaT NaT
15 2019-02-28 NaT NaT
You could use DataFrame.ffill to fill in the NaT values before you use diff():
loc_df = loc_df.ffill()
loc_df = loc_df.diff(periods=-1)
print(loc_df)
10101 10102 10103
0 58 days 49 days 44 days
1 28 days 4 days 20 days
2 13 days 19 days 34 days
3 7 days 13 days 40 days
4 31 days 4 days 36 days
5 11 days 10 days 23 days
6 34 days 74 days 0 days
7 1 days 54 days 0 days
8 7 days 67 days 0 days
9 37 days 4 days 0 days
10 30 days 12 days 0 days
11 6 days 2 days 0 days
12 9 days 0 days 0 days
13 9 days 0 days 0 days
14 28 days 0 days 0 days
15 NaT NaT NaT
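A runnable version of the ffill-then-diff answer, reduced to one column from the question, shows the forward-filled rows diffing to 0 days instead of a garbage timedelta:

```python
import pandas as pd

# one column from the question, with trailing NaT values
loc_df = pd.DataFrame({"10103": pd.to_datetime(
    ["2019-08-27", "2019-07-14", None, None])})

# forward-fill the NaT rows, then diff against the next row
out = loc_df.ffill().diff(periods=-1)
```

Note the trade-off: ffill turns the NaT gaps into explicit 0-day intervals; if you would rather keep them as missing, modern pandas returns NaT for datetime-minus-NaT, so diff alone may already behave on a recent version.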

Is there a way to create relational pandas DataFrames?

I am struggling to get my pandas DataFrame into the format I require, because I am populating a bit-masked DataFrame incorrectly.
I have a number of data frames:
plot_d1_sw1 - this is a read from a .csv
timestamp switchID deviceID count
0 2019-05-01 07:00:00 1 GTEC122277 1
1 2019-05-01 08:00:00 1 GTEC122277 1
3 2019-05-01 10:00:00 1 GTEC122277 3
d1_sw1 - this is the last 12 hours, with a boolean column indicating whether the data appears in filt
timestamp num
0 2019-05-01 12:00:00 False
1 2019-05-01 11:00:00 False
2 2019-05-01 10:00:00 True
3 2019-05-01 09:00:00 False
4 2019-05-01 08:00:00 True
5 2019-05-01 07:00:00 True
6 2019-05-01 06:00:00 False
7 2019-05-01 05:00:00 False
8 2019-05-01 04:00:00 False
9 2019-05-01 03:00:00 False
10 2019-05-01 02:00:00 False
11 2019-05-01 01:00:00 False
I have tried masking this and pulling the count column through into the True values using the following:
mask_d1_sw1 = d1_sw1.num == False
d1_sw1.loc[mask_d1_sw1, column_name] = 0
i = 0
for row in plot_d1_sw1.itertuples():
    mask_d1_sw1 = d1_sw1.num == True
    d1_sw1.loc[mask_d1_sw1, column_name] = plot_d1_sw1['count'].values[i]
    print(d1_sw1)
    i = i + 1
this gives me:
timestamp num
0 2019-05-01 12:00:00 0
1 2019-05-01 11:00:00 0
2 2019-05-01 10:00:00 3
3 2019-05-01 09:00:00 0
4 2019-05-01 08:00:00 3
5 2019-05-01 07:00:00 3
6 2019-05-01 06:00:00 0
7 2019-05-01 05:00:00 0
8 2019-05-01 04:00:00 0
9 2019-05-01 03:00:00 0
10 2019-05-01 02:00:00 0
11 2019-05-01 01:00:00 0
... I know this is because I am looping through the count column of plot_d1_sw1, but I cannot for the life of me work out how to fill this logically to get the outcome:
timestamp num
0 2019-05-01 12:00:00 0
1 2019-05-01 11:00:00 0
2 2019-05-01 10:00:00 3
3 2019-05-01 09:00:00 0
4 2019-05-01 08:00:00 1
5 2019-05-01 07:00:00 1
6 2019-05-01 06:00:00 0
7 2019-05-01 05:00:00 0
8 2019-05-01 04:00:00 0
9 2019-05-01 03:00:00 0
10 2019-05-01 02:00:00 0
11 2019-05-01 01:00:00 0
How can I achieve this outcome?
One way is to merge on the timestamp and then multiply the boolean values with count:
df = d1_sw1.merge(plot_d1_sw1, how='left', on='timestamp')
df['num'] = df.num.mul(df['count'].fillna(0)).astype(int)
df[['timestamp', 'num']]
Which gives:
timestamp num
0 2019-05-01 12:00:00 0
1 2019-05-01 11:00:00 0
2 2019-05-01 10:00:00 3
3 2019-05-01 09:00:00 0
4 2019-05-01 08:00:00 1
5 2019-05-01 07:00:00 1
6 2019-05-01 06:00:00 0
7 2019-05-01 05:00:00 0
8 2019-05-01 04:00:00 0
9 2019-05-01 03:00:00 0
10 2019-05-01 02:00:00 0
11 2019-05-01 01:00:00 0
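A runnable version of that merge answer, with the two frames trimmed to six hours for brevity (names as in the question):

```python
import pandas as pd

plot_d1_sw1 = pd.DataFrame({
    "timestamp": pd.to_datetime(["2019-05-01 07:00", "2019-05-01 08:00",
                                 "2019-05-01 10:00"]),
    "count": [1, 1, 3],
})
d1_sw1 = pd.DataFrame({
    "timestamp": pd.to_datetime(["2019-05-01 12:00", "2019-05-01 11:00",
                                 "2019-05-01 10:00", "2019-05-01 09:00",
                                 "2019-05-01 08:00", "2019-05-01 07:00"]),
    "num": [False, False, True, False, True, True],
})

# align on timestamp, then let the booleans zero out the unmatched hours
df = d1_sw1.merge(plot_d1_sw1, how="left", on="timestamp")
df["num"] = df["num"].mul(df["count"].fillna(0)).astype(int)
out = df[["timestamp", "num"]]
```

The merge is what makes this relational: each True row picks up the count for its own timestamp, rather than whatever value the loop happened to assign last.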

Split a numeric ID column into two using pandas

DateTime Junction Vehicles ID
0 2015-11-01 00:00:00 1 15 20151101001
1 2015-11-01 01:00:00 1 13 20151101011
2 2015-11-01 02:00:00 1 10 20151101021
3 2015-11-01 03:00:00 1 7 20151101031
4 2015-11-01 04:00:00 1 9 20151101041
5 2015-11-01 05:00:00 1 6 20151101051
6 2015-11-01 06:00:00 1 9 20151101061
7 2015-11-01 07:00:00 1 8 20151101071
8 2015-11-01 08:00:00 1 11 20151101081
9 2015-11-01 09:00:00 1 12 20151101091
I want to split the ID column into two separate columns such that the first 4 digits are in one, and the remaining digits are in the second.
Code I've tried:
new_ID = data.apply(lambda x: x.rsplit(4))
But it doesn't work. How can I do this with pandas?
Option 1
The simplest and most direct - use the str accessor.
v = df.ID.astype(str)
df['Year'], df['ID'] = v.str[:4], v.str[4:]
df
DateTime Junction Vehicles ID Year
0 2015-11-01 00:00:00 1 15 1101001 2015
1 2015-11-01 01:00:00 1 13 1101011 2015
2 2015-11-01 02:00:00 1 10 1101021 2015
3 2015-11-01 03:00:00 1 7 1101031 2015
4 2015-11-01 04:00:00 1 9 1101041 2015
5 2015-11-01 05:00:00 1 6 1101051 2015
6 2015-11-01 06:00:00 1 9 1101061 2015
7 2015-11-01 07:00:00 1 8 1101071 2015
8 2015-11-01 08:00:00 1 11 1101081 2015
9 2015-11-01 09:00:00 1 12 1101091 2015
Option 2
str.extract
v = df.ID.astype(str).str.extract(r'(?P<Year>\d{4})(?P<ID>.*)', expand=True)
df = pd.concat([df.drop(columns='ID'), v], axis=1)
df
DateTime Junction Vehicles Year ID
0 2015-11-01 00:00:00 1 15 2015 1101001
1 2015-11-01 01:00:00 1 13 2015 1101011
2 2015-11-01 02:00:00 1 10 2015 1101021
3 2015-11-01 03:00:00 1 7 2015 1101031
4 2015-11-01 04:00:00 1 9 2015 1101041
5 2015-11-01 05:00:00 1 6 2015 1101051
6 2015-11-01 06:00:00 1 9 2015 1101061
7 2015-11-01 07:00:00 1 8 2015 1101071
8 2015-11-01 08:00:00 1 11 2015 1101081
9 2015-11-01 09:00:00 1 12 2015 1101091
Here is a numeric solution (assuming that the length of ID column is constant):
In [10]: df['Year'], df['ID'] = df['ID'] // 10**7, df['ID'] % 10**7
In [11]: df
Out[11]:
DateTime Junction Vehicles ID Year
0 2015-11-01 00:00:00 1 15 1101001 2015
1 2015-11-01 01:00:00 1 13 1101011 2015
2 2015-11-01 02:00:00 1 10 1101021 2015
3 2015-11-01 03:00:00 1 7 1101031 2015
4 2015-11-01 04:00:00 1 9 1101041 2015
5 2015-11-01 05:00:00 1 6 1101051 2015
6 2015-11-01 06:00:00 1 9 1101061 2015
7 2015-11-01 07:00:00 1 8 1101071 2015
8 2015-11-01 08:00:00 1 11 1101081 2015
9 2015-11-01 09:00:00 1 12 1101091 2015
Or slice the string representation directly (the first four digits are the year):
df[id_col].map(lambda x: int(str(x)[:4]))  # as an integer
df[id_col].map(lambda x: str(x)[:4])  # as a string
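A self-contained sketch of Option 1 on a two-row sample (column names as in the question), casting both pieces back to integers:

```python
import pandas as pd

df = pd.DataFrame({"ID": [20151101001, 20151101011]})

# round-trip through str so we can slice positionally
v = df["ID"].astype(str)
df["Year"], df["ID"] = v.str[:4].astype(int), v.str[4:].astype(int)
```

Drop the astype(int) calls if you want to keep leading zeros in the remainder.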
