Problem: I am trying to make a bar chart from a very simple Pandas DataFrame that has a DatetimeIndex and a single column of integers (data below). However, only some of the data shows up in the Matplotlib plot.
Code:
import matplotlib.dates as mdates
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(16, 6))
ax.bar(unique_opps_by_month.index, unique_opps_by_month['Opportunity Name'])
ax.xaxis.set_major_locator(mdates.MonthLocator((1,4,7,10)))
ax.xaxis.set_minor_locator(mdates.MonthLocator())
ax.xaxis.set_major_formatter(mdates.DateFormatter('%Y-%b'))
ax.set_xlim('2016-01', '2021-04')
fig.autofmt_xdate()
The output, however, looks like:
This doesn't match the data though! There should be eight bars in 2017, and not all the same height. There are similar problems in the other years as well. Why are only some of the bars showing? How can I make Matplotlib show all the data?
Data
2016-02-01 1
2016-05-01 1
2016-08-01 1
2016-09-01 1
2017-01-01 1
2017-02-01 1
2017-03-01 1
2017-04-01 1
2017-07-01 3
2017-10-01 2
2017-11-01 3
2017-12-01 1
2018-02-01 2
2018-03-01 2
2018-04-01 2
2018-06-01 1
2018-07-01 1
2018-08-01 1
2018-11-01 1
2018-12-01 2
2019-03-01 5
2019-04-01 2
2019-05-01 1
2019-06-01 2
2019-07-01 1
2019-08-01 2
2019-09-01 4
2019-11-01 5
2020-01-01 4
2020-02-01 6
2020-03-01 1
2020-06-01 1
2020-07-01 2
2020-09-01 3
2020-10-01 5
2020-11-01 4
2020-12-01 6
2021-01-01 3
2021-02-01 6
2021-03-01 3
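A likely cause, offered as an assumption rather than a confirmed diagnosis: ax.bar uses a default width of 0.8 in data units, and on a date axis one data unit is one day. Across a five-year range, bars less than a day wide can end up thinner than a pixel, so some of them may vanish or blend together when rendered. The sketch below uses only the 2017 rows from the data above, and the width of 20 days is an arbitrary choice for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so the sketch runs without a display
import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import pandas as pd

# The eight 2017 rows from the data above
idx = pd.to_datetime([
    "2017-01-01", "2017-02-01", "2017-03-01", "2017-04-01",
    "2017-07-01", "2017-10-01", "2017-11-01", "2017-12-01",
])
unique_opps_by_month = pd.DataFrame(
    {"Opportunity Name": [1, 1, 1, 1, 3, 2, 3, 1]}, index=idx
)

fig, ax = plt.subplots(figsize=(16, 6))
# width is given in days on a date axis; 20 is an assumed value for monthly data
bars = ax.bar(
    unique_opps_by_month.index,
    unique_opps_by_month["Opportunity Name"],
    width=20,
)
ax.xaxis.set_major_locator(mdates.MonthLocator((1, 4, 7, 10)))
ax.xaxis.set_minor_locator(mdates.MonthLocator())
ax.xaxis.set_major_formatter(mdates.DateFormatter("%Y-%b"))
ax.set_xlim(pd.Timestamp("2016-01-01"), pd.Timestamp("2021-04-01"))
fig.autofmt_xdate()
```

With monthly data, any width comfortably below 28 days keeps neighbouring bars from overlapping.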
Related
I was trying to find the difference between a series of dates and a single date. For example, the series runs from May 1 to June 1:
In[0]: date = pd.DataFrame()
date['test'] = pd.date_range("2021-05-01", "2021-06-01", freq = "D")
date
Out[0]:
test
0 2021-05-01 00:00:00
1 2021-05-02 00:00:00
2 2021-05-03 00:00:00
3 2021-05-04 00:00:00
4 2021-05-05 00:00:00
5 2021-05-06 00:00:00
6 2021-05-07 00:00:00
7 2021-05-08 00:00:00
8 2021-05-09 00:00:00
9 2021-05-10 00:00:00
In[1]
date['test'] = date['test'].dt.date
Out[1]:
test
0 2021-05-01
1 2021-05-02
2 2021-05-03
3 2021-05-04
4 2021-05-05
5 2021-05-06
6 2021-05-07
7 2021-05-08
8 2021-05-09
9 2021-05-10
In[2]:date['base'] = dt.strptime("2021-05-01",'%Y-%m-%d')
Out[2]:
0 2021-05-01 00:00:00
1 2021-05-01 00:00:00
2 2021-05-01 00:00:00
3 2021-05-01 00:00:00
4 2021-05-01 00:00:00
5 2021-05-01 00:00:00
6 2021-05-01 00:00:00
7 2021-05-01 00:00:00
8 2021-05-01 00:00:00
9 2021-05-01 00:00:00
In[3]:date['base'] = date['base'].dt.date
Out[3]:
base
0 2021-05-01
1 2021-05-01
2 2021-05-01
3 2021-05-01
4 2021-05-01
5 2021-05-01
6 2021-05-01
7 2021-05-01
8 2021-05-01
9 2021-05-01
In[4]:date['test']-date['base']
Out[4]:
diff
0 0 days 00:00:00.000000000
1 1 days 00:00:00.000000000
2 2 days 00:00:00.000000000
3 3 days 00:00:00.000000000
4 4 days 00:00:00.000000000
5 5 days 00:00:00.000000000
6 6 days 00:00:00.000000000
7 7 days 00:00:00.000000000
8 8 days 00:00:00.000000000
9 9 days 00:00:00.000000000
10 10 days 00:00:00.000000000
This is the only result I could get. I want nothing but the plain day numbers, because I need them for further numerical calculation, but I can't get rid of the rest. Also, how can I construct a time series that contains just the date, without the hours, minutes, and seconds after it? I don't want to call .dt.date manually on every column, and it sometimes messes things up.
You don't need to create a column base for this, simply do:
>>> (date['test'] - pd.to_datetime("2021-05-01", format='%Y-%m-%d')).dt.days
0 0
1 1
2 2
3 3
4 4
...
27 27
28 28
29 29
30 30
31 31
Name: test, dtype: int64
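The .dt.days call above assumes that the test column still has a datetime64 dtype, that is, that the .dt.date conversion was skipped. A minimal sketch under that assumption follows; .dt.normalize() is included only to show one way to reset a time-of-day component to midnight while keeping the datetime64 dtype, and the diff column name is illustrative.

```python
import pandas as pd

date = pd.DataFrame({"test": pd.date_range("2021-05-01", "2021-06-01", freq="D")})

# Keep the column as datetime64; normalize() sets any time-of-day to midnight
date["test"] = date["test"].dt.normalize()

# Subtracting a Timestamp yields timedeltas; .dt.days extracts whole days as integers
date["diff"] = (date["test"] - pd.Timestamp("2021-05-01")).dt.days

print(date.head())
```

Because the column stays datetime64, later date arithmetic keeps working without converting back and forth.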
You can first convert the timestamps to epoch seconds. A datetime64[ns] column is stored internally as an integer count of nanoseconds since the Unix epoch, so the conversion is plain integer arithmetic.
Using pandas datetime to unix timestamp seconds
import pandas as pd
# start df with date column
df = pd.DataFrame({"date": pd.date_range("2021-05-01", "2021-06-01", freq = "D")})
# create a column of Unix timestamps (whole seconds since 1970-01-01)
df["ts"] = (df["date"] - pd.Timestamp("1970-01-01")) // pd.Timedelta("1s")
>>> df
date ts
0 2021-05-01 1619827200
1 2021-05-02 1619913600
2 2021-05-03 1620000000
3 2021-05-04 1620086400
...
31 2021-06-01 1622505600
This will allow you to do integer math before converting back
>>> df["days"] = (df["ts"] - min(df["ts"])) // (60*60*24) # 1 day in seconds
>>> df
date ts days
0 2021-05-01 1619827200 0
1 2021-05-02 1619913600 1
2 2021-05-03 1620000000 2
3 2021-05-04 1620086400 3
...
31 2021-06-01 1622505600 31
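Alternatively, because this DataFrame was generated with one row per consecutive day and has the default integer index, the index itself is already the day offset. This shortcut stops working as soon as the dates have gaps or the index is anything else.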
>>> import pandas as pd
>>> df = pd.DataFrame({"date": pd.date_range("2021-05-01", "2021-06-01", freq = "D")})
>>> df["days"] = df.index
>>> df
date days
0 2021-05-01 0
1 2021-05-02 1
2 2021-05-03 2
3 2021-05-04 3
...
31 2021-06-01 31
I am struggling to get my pandas df into the format I require due to incorrectly populating a bit masked dataframe.
I have a number of data frames:
plot_d1_sw1 - this is a read from a .csv
timestamp switchID deviceID count
0 2019-05-01 07:00:00 1 GTEC122277 1
1 2019-05-01 08:00:00 1 GTEC122277 1
3 2019-05-01 10:00:00 1 GTEC122277 3
d1_sw1 - this is the last 12 hours and a conditional as to whether the data appears in filt
timestamp num
0 2019-05-01 12:00:00 False
1 2019-05-01 11:00:00 False
2 2019-05-01 10:00:00 True
3 2019-05-01 09:00:00 False
4 2019-05-01 08:00:00 True
5 2019-05-01 07:00:00 True
6 2019-05-01 06:00:00 False
7 2019-05-01 05:00:00 False
8 2019-05-01 04:00:00 False
9 2019-05-01 03:00:00 False
10 2019-05-01 02:00:00 False
11 2019-05-01 01:00:00 False
I have tried masking this and pulling the count column through into the rows where num is True, using the following:
mask_d1_sw1 = d1_sw1.num == False
d1_sw1.loc[mask_d1_sw1, column_name] = 0
i = 0
for row in plot_d1_sw1.itertuples():
    mask_d1_sw1 = d1_sw1.num == True
    d1_sw1.loc[mask_d1_sw1, column_name] = plot_d1_sw1['count'].values[i]
    print(d1_sw1)
    i = i + 1
this gives me:
timestamp num
0 2019-05-01 12:00:00 0
1 2019-05-01 11:00:00 0
2 2019-05-01 10:00:00 3
3 2019-05-01 09:00:00 0
4 2019-05-01 08:00:00 3
5 2019-05-01 07:00:00 3
6 2019-05-01 06:00:00 0
7 2019-05-01 05:00:00 0
8 2019-05-01 04:00:00 0
9 2019-05-01 03:00:00 0
10 2019-05-01 02:00:00 0
11 2019-05-01 01:00:00 0
... I know that this is because I am looping through the count column of plot_d1_sw1 but I cannot for the life of me work out how to logically fill this to get the outcome:
timestamp num
0 2019-05-01 12:00:00 0
1 2019-05-01 11:00:00 0
2 2019-05-01 10:00:00 3
3 2019-05-01 09:00:00 0
4 2019-05-01 08:00:00 1
5 2019-05-01 07:00:00 1
6 2019-05-01 06:00:00 0
7 2019-05-01 05:00:00 0
8 2019-05-01 04:00:00 0
9 2019-05-01 03:00:00 0
10 2019-05-01 02:00:00 0
11 2019-05-01 01:00:00 0
How can I achieve this outcome?
One way is to merge on the timestamp and then multiply the boolean values with count:
df = d1_sw1.merge(plot_d1_sw1, how='left', on='timestamp')
df['num'] = df.num.mul(df['count'].fillna(0)).astype(int)
df[['timestamp', 'num']]
Which gives:
timestamp num
0 2019-05-01-12:00:00 0
1 2019-05-01-11:00:00 0
2 2019-05-01-10:00:00 3
3 2019-05-01-09:00:00 0
4 2019-05-01-08:00:00 1
5 2019-05-01-07:00:00 1
6 2019-05-01-06:00:00 0
7 2019-05-01-05:00:00 0
8 2019-05-01-04:00:00 0
9 2019-05-01-03:00:00 0
10 2019-05-01-02:00:00 0
11 2019-05-01-01:00:00 0
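For reference, the merge approach can be run end to end as a small sketch. The frames below are rebuilt by hand from the samples in the question, and the result name is illustrative; it assumes the timestamp columns in both frames have the same datetime dtype.

```python
import pandas as pd

# Counts read from the .csv (sample rows from the question)
plot_d1_sw1 = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2019-05-01 07:00:00", "2019-05-01 08:00:00", "2019-05-01 10:00:00",
    ]),
    "count": [1, 1, 3],
})

# Last 12 hours, newest first, with a boolean flag per hour
d1_sw1 = pd.DataFrame({
    "timestamp": pd.to_datetime(
        [f"2019-05-01 {h:02d}:00:00" for h in range(12, 0, -1)]
    ),
})
d1_sw1["num"] = d1_sw1["timestamp"].isin(plot_d1_sw1["timestamp"])

# A left merge keeps every hour; hours without a count get NaN, filled with 0
merged = d1_sw1.merge(plot_d1_sw1, how="left", on="timestamp")
merged["num"] = merged["num"].mul(merged["count"].fillna(0)).astype(int)
result = merged[["timestamp", "num"]]

print(result)
```

Matching on the timestamp value, rather than on row position as in the loop, is what gives each True row its own count.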
How can I convert the timestamp column of a dataframe to a numeric value? The dtype of the Time column in the dataframe 'df' below is 'datetime64'.
Time Count
2018-05-15 00:00:00 4
2018-05-15 00:15:00 1
2018-05-15 00:30:00 5
2018-05-15 00:45:00 6
2018-05-15 01:15:00 3
2018-05-15 01:30:00 4
2018-05-15 02:30:00 5
2018-05-15 02:45:00 3
2018-05-15 03:15:00 2
2018-05-15 03:30:00 5
By using to_numeric
pd.to_numeric(df.Time)
Out[218]:
0 1526342400000000000
1 1526343300000000000
2 1526344200000000000
3 1526345100000000000
4 1526346900000000000
5 1526347800000000000
6 1526351400000000000
7 1526352300000000000
8 1526354100000000000
9 1526355000000000000
Name: Time, dtype: int64
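The integers above are nanoseconds since the Unix epoch. If seconds are easier to work with, a sketch along these lines may help; it uses the first three rows of the sample, and the explicit cast to datetime64[ns] is an assumption added so that the integer unit is unambiguous regardless of how the column was parsed.

```python
import pandas as pd

df = pd.DataFrame({
    "Time": pd.to_datetime([
        "2018-05-15 00:00:00", "2018-05-15 00:15:00", "2018-05-15 00:30:00",
    ]),
    "Count": [4, 1, 5],
})

# Pin the resolution to nanoseconds, then reinterpret the values as integers
ns = df["Time"].astype("datetime64[ns]").astype("int64")

# Integer-divide to get whole seconds since 1970-01-01
seconds = ns // 10**9

print(seconds)
```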
I have a dataframe with about 50 columns, some of them are period_start_time, id, speed_throughput, etc.
dataframe sample:
id period_start_time speed_througput ...
0 1 2017-06-14 20:00:00 6
1 1 2017-06-14 20:00:00 10
2 1 2017-06-14 21:00:00 2
3 1 2017-06-14 21:00:00 5
4 2 2017-06-14 20:00:00 8
5 2 2017-06-14 20:00:00 12
...
I have tried to create two new columns by grouping on two columns (id and period_start_time) and computing the avg and min of speed_trhoughput.
The code I've tried:
df['Throughput_avg']=df.sort_values(['period_start_time'],ascending=False).groupby(['period_start_time','id'])[['speed_trhoughput']].max()
df['Throughput_min'] = df.groupby(['period_start_time', 'id'])[['speed_trhoughput']].min()
As you can see, there are two ways I've tried, but nothing works.
The error message I received for both attempts:
TypeError: incompatible index of inserted column with frame index
I suppose you know what my output needs to be, so there is no need to post it.
Option 1
Use agg in a groupby and join to attach to main dataframe
df.join(
    df.groupby(['id', 'period_start_time']).speed_througput.agg(
        ['mean', 'min']
    ).rename(columns={'mean': 'avg'}).add_prefix('Throughput_'),
    on=['id', 'period_start_time']
)
id period_start_time speed_througput Throughput_avg Throughput_min
0 1 2017-06-14 20:00:00 6 8.0 6
1 1 2017-06-14 20:00:00 10 8.0 6
2 1 2017-06-14 21:00:00 2 3.5 2
3 1 2017-06-14 21:00:00 5 3.5 2
4 2 2017-06-14 20:00:00 8 10.0 8
5 2 2017-06-14 20:00:00 12 10.0 8
Option 2
Use transform in a groupby context and use assign to add the new columns
g = df.groupby(['id', 'period_start_time']).speed_througput.transform
df.assign(Throughput_avg=g('mean'), Throughput_min=g('min'))
id period_start_time speed_througput Throughput_avg Throughput_min
0 1 2017-06-14 20:00:00 6 8.0 6
1 1 2017-06-14 20:00:00 10 8.0 6
2 1 2017-06-14 21:00:00 2 3.5 2
3 1 2017-06-14 21:00:00 5 3.5 2
4 2 2017-06-14 20:00:00 8 10.0 8
5 2 2017-06-14 20:00:00 12 10.0 8
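The original attempts most likely failed because an aggregated groupby result is indexed by the group keys (id, period_start_time), which cannot be aligned with the frame's row index on assignment. transform avoids this by returning one value per original row. A self-contained sketch using the sample rows from the question, with the column spelled as in the sample:

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 1, 1, 1, 2, 2],
    "period_start_time": pd.to_datetime([
        "2017-06-14 20:00:00", "2017-06-14 20:00:00",
        "2017-06-14 21:00:00", "2017-06-14 21:00:00",
        "2017-06-14 20:00:00", "2017-06-14 20:00:00",
    ]),
    "speed_througput": [6, 10, 2, 5, 8, 12],
})

grouped = df.groupby(["id", "period_start_time"])["speed_througput"]

# transform returns a Series aligned with df's own index, so direct assignment works
df["Throughput_avg"] = grouped.transform("mean")
df["Throughput_min"] = grouped.transform("min")

print(df)
```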
I'm having trouble sorting a pandas dataframe first by an integer column and then by a datetime column. When I do so, the dates returned are out of order. What am I doing wrong?
df looks like
Date Field 1
0 2013-07-01 00:00:00 1
1 2013-07-02 00:00:00 1
2 2013-07-03 00:00:00 1
3 2013-07-03 00:00:00 2
4 2013-07-05 00:00:00 2
5 2013-07-05 00:00:00 1
6 2013-07-08 00:00:00 2
7 2013-07-09 00:00:00 2
8 2013-07-11 00:00:00 2
9 2013-07-12 00:00:00 2
10 2013-07-15 00:00:00 1
11 2013-07-16 00:00:00 1
12 2013-07-17 00:00:00 1
13 2013-07-18 00:00:00 1
14 2013-07-19 00:00:00 1
When the dataframe was created, Date was an object, and converted to datetime using:
df['Date'] = df['Date'].apply(dateutil.parser.parse)
now the dtypes are:
Date datetime64[ns]
Field 1 int64
dtype: object
when running a
df.sort_index(by=['Field 1', 'Date'])
or
df.sort(['Field 1','Date'])
I get back:
Date Field 1
0 2013-07-01 00:00:00 1
1 2013-07-02 00:00:00 1
2 2013-07-03 00:00:00 1
10 2013-07-15 00:00:00 1
5 2013-07-05 00:00:00 1
11 2013-07-16 00:00:00 1
12 2013-07-17 00:00:00 1
13 2013-07-18 00:00:00 1
14 2013-07-19 00:00:00 1
8 2013-07-11 00:00:00 2
9 2013-07-12 00:00:00 2
3 2013-07-03 00:00:00 2
4 2013-07-05 00:00:00 2
6 2013-07-08 00:00:00 2
7 2013-07-09 00:00:00 2
what I really want back is:
Date Field 1
0 2013-07-01 00:00:00 1
1 2013-07-02 00:00:00 1
2 2013-07-03 00:00:00 1
5 2013-07-05 00:00:00 1
10 2013-07-15 00:00:00 1
11 2013-07-16 00:00:00 1
12 2013-07-17 00:00:00 1
13 2013-07-18 00:00:00 1
14 2013-07-19 00:00:00 1
3 2013-07-03 00:00:00 2
4 2013-07-05 00:00:00 2
6 2013-07-08 00:00:00 2
7 2013-07-09 00:00:00 2
8 2013-07-11 00:00:00 2
9 2013-07-12 00:00:00 2
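In current pandas, DataFrame.sort and the by= argument of sort_index are not available; sort_values is the method for sorting by column values. The sketch below rebuilds the sample frame and sorts it that way. Whether the misordering above came from the old API or from something else in the asker's setup is not known, so treat this as an illustration of the modern call rather than a diagnosis.

```python
import pandas as pd

df = pd.DataFrame({
    "Date": pd.to_datetime([
        "2013-07-01", "2013-07-02", "2013-07-03", "2013-07-03", "2013-07-05",
        "2013-07-05", "2013-07-08", "2013-07-09", "2013-07-11", "2013-07-12",
        "2013-07-15", "2013-07-16", "2013-07-17", "2013-07-18", "2013-07-19",
    ]),
    "Field 1": [1, 1, 1, 2, 2, 1, 2, 2, 2, 2, 1, 1, 1, 1, 1],
})

# Sort by 'Field 1' first, then by 'Date' within each 'Field 1' value
sorted_df = df.sort_values(["Field 1", "Date"])

print(sorted_df)
```

sort_values returns a new frame and leaves df unchanged, so the result has to be assigned to a name to be kept.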