I have two date columns, each with a corresponding dollar amount in another column. I want to plot everything in a single chart, but that requires some data preparation in Python.
Actual table
StartDate   start$   EndDate   End$
5 June      500      7 June    300
7 June      600      10 June   550
8 June      900      10 June   600
Expected table
PythonDate   start$   End$
5 June       500      0
6 June       0        0
7 June       600      300
8 June       900      0
9 June       0        0
10 June      0        1150
Any solution in Python?
I can suggest the basic logic and you can figure out the implementation; it isn't difficult and it will be good learning too:
Read only the subset of columns you need from the input table into a dataframe. Build two such dataframes, filling the missing column with 0 in each, and then append them together.
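A minimal sketch of that idea, assuming the column names from the tables above (parsing the dates with pd.to_datetime and reindexing over the full date range would also fill in the empty days such as 6 June and 9 June):
import pandas as pd

df = pd.DataFrame({'StartDate': ['5 June', '7 June', '8 June'],
                   'start$': [500, 600, 900],
                   'EndDate': ['7 June', '10 June', '10 June'],
                   'End$': [300, 550, 600]})

starts = df[['StartDate', 'start$']].rename(columns={'StartDate': 'PythonDate'})
starts['End$'] = 0   # column missing on the start side, filled with 0
ends = df[['EndDate', 'End$']].rename(columns={'EndDate': 'PythonDate'})
ends['start$'] = 0   # column missing on the end side, filled with 0

combined = (pd.concat([starts, ends])
              .groupby('PythonDate', as_index=False)[['start$', 'End$']]
              .sum())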
So I have some sea surface temperature anomaly data. These data have already been filtered down to the values that are below a certain threshold. I am now trying to identify cold spells - that is, to isolate events that last 5 or more consecutive days. A sample of my data is below (I've been working between xarray datasets/dataarrays and pandas dataframes). Note, 'day' is the day number of the month I am looking at (eventually this will be expanded to the whole year). I have been scouring SO/the internet for ways to extract these 5-day-or-longer events based on the 'day' column, but I haven't gotten anything to work. I'm still relatively new to coding, so my first thought was looping over the rows of the 'day' column, but I'm not sure. Any insight is appreciated.
Here's what some of my data look like as a pandas df:
lat lon time day ssta
5940 24.125 262.375 1984-06-03 3 -1.233751
21072 24.125 262.375 1984-06-04 4 -1.394495
19752 24.125 262.375 1984-06-05 5 -1.379742
10223 24.125 262.375 1984-06-27 27 -1.276407
47355 24.125 262.375 1984-06-28 28 -1.840763
... ... ... ... ... ...
16738 30.875 278.875 2015-06-30 30 -1.345640
3739 30.875 278.875 2020-06-16 16 -1.212824
25335 30.875 278.875 2020-06-17 17 -1.446407
41891 30.875 278.875 2021-06-01 1 -1.714249
27740 30.875 278.875 2021-06-03 3 -1.477497
64228 rows × 5 columns
As a filtered xarray:
xarray.Dataset
Dimensions: lat: 28, lon: 68, time: 1174
Coordinates:
time (time) datetime64[ns] 1982-06-01 ... 2021-06-04
lon (lon) float32 262.1 262.4 262.6 ... 278.6 278.9
lat (lat) float32 24.12 24.38 24.62 ... 30.62 30.88
day (time) int64 1 2 3 4 5 6 7 ... 28 29 30 1 2 3 4
Data variables:
ssta (time, lat, lon) float32 nan nan nan nan ... nan nan nan nan
Attributes: (0)
TL;DR: I want to identify (and retain the information of) events that span 5+ consecutive days, i.e. a day 3 through day 8, or a day 21 through day 30, etc.
Rather than filtering your original data, I think you should try to do it the pandas way, which in this case means obtaining a series of True/False values depending on your condition.
Your data don't seem to include temperatures, so here is my example:
import pandas as pd
import numpy as np
df = pd.DataFrame(data={'temp':np.random.randint(10,high=40,size=64228,dtype='int64')})
Will generate a DataFrame with a single column containing random temperatures between 10 and 40 degrees. Notice that I can just work with the auto generated index but you might have to switch it to a column like time or date or something like that using .set_index. Say we are interested in the consecutive days with more than 30 degrees.
is_over_30 = df['temp'] > 30
will give us a True/False array with that information. Notice that this format is very useful, since we can index with it: e.g. df[is_over_30] will give us the rows of the dataframe for days where the temperature is over 30 deg. Now we want to shift the True/False values in is_over_30 one spot forward and generate a new series that is True only if both are True, like so:
is_over_30 & np.roll(is_over_30, -1)
Basically we are done here and could write 3 more of those & rolls, but there is a more concise way to write it:
from functools import reduce
is_consecutively_over_30 = reduce(lambda a,b: a&b, [np.roll(is_over_30, -i) for i in range(5)])
Keep in mind that even though the last 4 days can't start a run of 5 consecutive days over 30 deg, they might still be marked here, since roll wraps the first values around into those positions. You can just set the last 4 values to False to resolve this:
is_consecutively_over_30[-4:] = False
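As a follow-up sketch using the same names: is_consecutively_over_30 marks each day that starts a 5-day window entirely over 30 deg. To flag every day that belongs to such a run, you can OR the mask shifted forward up to 4 positions, using shift() with a fill value to avoid roll's wrap-around:
mask = pd.Series(is_consecutively_over_30, index=df.index)
in_spell = reduce(lambda a, b: a | b,
                  [mask.shift(i, fill_value=False) for i in range(5)])
spell_rows = df[in_spell]   # every row that is part of a >= 5-day run over 30 deg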
You can pull the day ranges of the spells using this approach:
import numpy as np
import pandas as pd

min_spell_days = 6
days = {'day': [1, 2, 5, 6, 7, 8, 9, 10, 17, 19, 21, 22, 23, 24, 25, 26, 27, 31]}
df = pd.DataFrame(days)
Find number of days between consecutive entries:
diff = df['day'].diff()
Mark the last day of a spell:
df['last'] = (diff == 1) & (diff.shift(-1) > 1)
Accumulate the number of days in each spell:
df['diff0'] = np.where(diff > 1, 0, diff)
df['cs'] = df['diff0'].eq(0).cumsum()
df['spell_days'] = df.groupby('cs')['diff0'].transform('cumsum')
Mark the last entry as the last day of a spell if applicable:
if diff.iat[-1] == 1:
    df['last'].iat[-1] = True
Select the last day of all qualifying spells:
df_spells = (df[df['last'] & (df['spell_days'] >= (min_spell_days-1))]).copy()
Identify the start, end and duration of each spell:
df_spells['end_day'] = df_spells['day']
df_spells['start_day'] = (df_spells['day'] - df['spell_days'])
df_spells['spell_days'] = df['spell_days'] + 1
Resulting df:
df_spells[['start_day','end_day','spell_days']].astype('int')
start_day end_day spell_days
7 5 10 6
16 21 27 7
Also, using date arithmetic, 'day' could instead be a serial day number relative to some base date, like 1/1/1900. That way spells that span month and year boundaries would be handled, and it would be trivial to convert the serial number back to a date.
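A small sketch of that idea, assuming a 'time' column of datetimes as in the question (the base date is arbitrary):
base = pd.Timestamp('1900-01-01')
df['serial_day'] = (pd.to_datetime(df['time']) - base).dt.days
# 'serial_day' can replace 'day' in the diff()-based logic above, so spells that
# cross month or year boundaries stay consecutive; converting back is just
# base + pd.to_timedelta(df['serial_day'], unit='D')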
I have a dataframe/series containing hourly sampled data over a couple of years. I'd like to sum the values for each month, then calculate the mean of those monthly totals over all the years.
I can get a multi-index dataframe/series of the totals using:
df.groupby([df.index.year, df.index.month]).sum()
Date & Time Date & Time
2016 3 220.246292
4 736.204574
5 683.240291
6 566.693919
7 948.116766
8 761.214823
9 735.168033
10 771.210572
11 542.314915
12 434.467037
2017 1 728.983901
2 639.787918
3 709.944521
4 704.610437
5 685.729297
6 760.175060
7 856.928659
But I don't know how to then combine the data to get the means.
I might be totally off on the wrong track too. Also not sure I've labelled the question very well.
I think you need the mean per year - i.e. per the first level of the index:
df.groupby([df.index.year, df.index.month]).sum().mean(level=0)
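Note that in recent pandas versions the level argument of mean() has been removed; grouping by the index level is the equivalent:
df.groupby([df.index.year, df.index.month]).sum().groupby(level=0).mean()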
You can use groupby twice, once to get the monthly sum, once to get the mean of monthly sum:
(df.groupby(pd.Grouper(freq='M')).sum()
.groupby(pd.Grouper(freq='Y')).mean()
)
I need help with a big pandas issue.
As a lot of people asked for the real input and the real desired output in order to answer the question, here it goes.
So I have the following dataframe:
Date        user  cumulative_num_exercises  total_exercises  %_exercises  %_exercises_accum
2017-01-01  1     2                         7                28,57        28,57
2017-01-01  2     1                         7                14,28        42,85
2017-01-01  4     3                         7                42,85        85,7
2017-01-01  10    1                         7                14,28        100
2017-02-02  1     2                         14               14,28        14,28
2017-02-02  2     3                         14               21,42        35,7
2017-02-02  4     4                         14               28,57        64,27
2017-02-02  10    5                         14               35,71        100
2017-03-03  1     3                         17               17,64        17,64
2017-03-03  2     3                         17               17,64        35,28
2017-03-03  4     5                         17               29,41        64,69
2017-03-03  10    6                         17               35,29        100
-The column %_exercises is (cumulative_num_exercises / total_exercises) * 100.
-The column %_exercises_accum is the running sum of %_exercises within each month. (Note that at the end of each month it reaches the value 100.)
-With this data, I need to calculate the % of users that contributed to doing 50%, 80% and 90% of the total exercises during each month.
-In order to do so, I have thought to create a new column, called category, which will later be used to count how many users contributed to each of the 3 percentages (50%, 80% and 90%). The category column takes the following values:
0 if the user did a %_exercises_accum = 0.
1 if the user did a %_exercises_accum < 50 and > 0.
50 if the user did a %_exercises_accum = 50.
80 if the user did a %_exercises_accum = 80.
90 if the user did a %_exercises_accum = 90.
And so on, because there are many cases in order to determine who contributes to which percentage of the total number of exercises on each month.
I have already determined all the cases and all the values that must be taken.
Basically, I traverse the dataframe using a for loop, with two main branches:
if df.iloc[i]['Date'] == df.iloc[i - 1]['Date']:
    # determine the percentage or percentages to which the user (from the second
    # to the last row of the same month group) contributes, because the same user
    # can contribute to all the percentages, or to more than one
    ...
else:
    # determine to which percentage of exercises the first member of each
    # month group contributes
    ...
The calculations involve:
Looking at the value of the category column in the previous row using shift().
Doing while loops inside the for loop, because when a user suddenly reaches a big percentage we need to go back over the earlier users of the same month and change their category column value to 50: they contributed to the 50% but didn't reach it themselves. For instance, in this situation:
Date %_exercises_accum
2017-01-01 1,24
2017-01-01 3,53
2017-01-01 20,25
2017-01-01 55,5
The desired output for the given dataframe at the beginning of the question would include the same columns as before (date, user, cumulative_num_exercises, total_exercises, %_exercises and %_exercises_accum) plus the category column, which is the following:
category
50
50
508090
90
50
50
5080
8090
50
50
5080
8090
Note that rows with values such as 508090 or 8090 mean that that user contributes to:
508090: the 50%, 80% and 90% of total exercises in a month.
8090: the 80% and 90% of exercises in a month.
Does anyone know how I can simplify this for loop by traversing the groups of a groupby object?
Thank you very much!
Given no sense of what calculations you wish to accomplish, this is my best guess at what you're looking for. However, I'd re-iterate Datanovice's point that the best way to get answers is to provide a sample output.
You can slice to each unique date using the following code:
import pandas as pd

dates = ['2017-01-01', '2017-01-01', '2017-01-01', '2017-01-01',
         '2017-02-02', '2017-02-02', '2017-02-02', '2017-02-02',
         '2017-03-03', '2017-03-03', '2017-03-03', '2017-03-03']
df = pd.DataFrame(
    {'date': pd.to_datetime(dates),
     'user': [1, 2, 4, 10, 1, 2, 4, 10, 1, 2, 4, 10],
     'cumulative_num_exercises': [2, 1, 3, 1, 2, 3, 4, 5, 3, 3, 5, 6],
     'total_exercises': [7, 7, 7, 7, 14, 14, 14, 14, 17, 17, 17, 17]}
)
df = df.set_index('date')

for idx in df.index.unique():
    hold = df.loc[idx]  # all rows for one unique date
    ### YOUR CODE GOES HERE ###
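If the aim is to traverse month groups rather than unique index values, here is a minimal sketch building on the frame above (it recomputes percentage columns under assumed names pct_exercises and pct_accum; it does not implement your full category logic):
# recompute the percentage columns per month without an explicit row loop
df['pct_exercises'] = df['cumulative_num_exercises'] / df['total_exercises'] * 100
df['pct_accum'] = df.groupby(level='date')['pct_exercises'].cumsum()

# iterate the groups of a groupby object instead of df.index.unique()
for date, month_group in df.groupby(level='date'):
    # month_group is the sub-DataFrame for one month; the per-user category
    # logic would go here in place of the shift()-based for loop
    print(date, month_group['pct_accum'].round(2).tolist())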
With Pandas I have created a DataFrame from an imported .csv file (the file is generated through simulation). The DataFrame consists of half-hourly energy consumption data for a single year. I have already created a DatetimeIndex for the dates.
I would like to reformat this data into average hourly weekday and weekend profiles, with the weekday profile excluding holidays.
DataFrame:
Date_Time Equipment:Electricity:LGF Equipment:Electricity:GF
01/01/2000 00:30 0.583979872 0.490327348
01/01/2000 01:00 0.583979872 0.490327348
01/01/2000 01:30 0.583979872 0.490327348
01/01/2000 02:00 0.583979872 0.490327348
I found an example (Getting the average of a certain hour on weekdays over several years in a pandas dataframe) that explains doing this for several years, but not explicitly for a week (without holidays) and weekend.
I realise that there are no resampling techniques in Pandas that do this directly; I used several offset aliases (http://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases) for creating monthly and daily profiles.
I was thinking of using the business day frequency to create a new date index of working days and comparing it against my DataFrame's DatetimeIndex for every half hour, then returning values for working days and weekend days where the comparison is true or false respectively, to create a new dataset. But I am not sure how to do this.
PS: I am just getting into Python and Pandas.
Dummy data (for future reference, more likely to get an answer if you post some in a copy-paste-able form)
import numpy as np
import pandas as pd

df = pd.DataFrame(data={'a': np.random.randn(1000)},
                  index=pd.date_range(start='2000-01-01', periods=1000, freq='30T'))
Here's an approach. First define a US (or modify as appropriate) business day offset with holidays, and generate a range covering your dates.
from pandas.tseries.holiday import USFederalHolidayCalendar
from pandas.tseries.offsets import CustomBusinessDay
bday_us = CustomBusinessDay(calendar=USFederalHolidayCalendar())
bday_over_df = pd.date_range(start=df.index.min().date(),
                             end=df.index.max().date(), freq=bday_us)
Then, develop your two grouping columns. An hour column is easy.
df['hour'] = df.index.hour
For weekday/weekend/holiday, define a function to group the data.
def group_day(date):
    if date.weekday() in [5, 6]:
        return 'weekend'
    elif date.date() in bday_over_df:
        return 'weekday'
    else:
        return 'holiday'
df['day_group'] = df.index.map(group_day)
Then, just group by the two columns as you wish.
In [140]: df.groupby(['day_group', 'hour']).sum()
Out[140]:
a
day_group hour
holiday 0 1.890621
1 -0.029606
2 0.255001
3 2.837000
4 -1.787479
5 0.644113
6 0.407966
7 -1.798526
8 -0.620614
9 -0.567195
10 -0.822207
11 -2.675911
12 0.940091
13 -1.601885
14 1.575595
15 1.500558
16 -2.512962
17 -1.677603
18 0.072809
19 -1.406939
20 2.474293
21 -1.142061
22 -0.059231
23 -0.040455
weekday 0 9.192131
1 2.759302
2 8.379552
3 -1.189508
4 3.796635
5 3.471802
... ...
18 -5.217554
19 3.294072
20 -7.461023
21 8.793223
22 4.096128
23 -0.198943
weekend 0 -2.774550
1 0.461285
2 1.522363
3 4.312562
4 0.793290
5 2.078327
6 -4.523184
7 -0.051341
8 0.887956
9 2.112092
10 -2.727364
11 2.006966
12 7.401570
13 -1.958666
14 1.139436
15 -1.418326
16 -2.353082
17 -1.381131
18 -0.568536
19 -5.198472
20 -3.405137
21 -0.596813
22 1.747980
23 -6.341053
[72 rows x 1 columns]
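If you want average rather than total profiles, the same grouping with .mean() should give the mean of all half-hourly samples falling in each day_group/hour combination:
df.groupby(['day_group', 'hour']).mean()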