I have two date columns, each with an associated dollar amount in another column. I want to plot them in a single chart, but that requires some data preparation in Python.
Actual table
StartDate   start$   EndDate   End$
5 June      500      7 June    300
7 June      600      10 June   550
8 June      900      10 June   600
Expected Table
PythonDate   start$   End$
5 June       500      0
6 June       0        0
7 June       600      300
8 June       900      0
9 June       0        0
10 June      0        1150
Any solution in Python?
I can suggest a basic approach and you can figure out how to implement it; it's not difficult and it will be good learning too:
Read only the subset of columns you need from the input table as a single dataframe. Build two such dataframes, setting the missing dollar column to 0 in each, and then append them together.
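A minimal sketch of that logic, assuming the frame and column names reconstructed from the tables above (a year is added so the dates parse deterministically):

import pandas as pd

# Reconstructed input; the actual frame and the year are assumptions
df = pd.DataFrame({'StartDate': ['5 June 2021', '7 June 2021', '8 June 2021'],
                   'start$':    [500, 600, 900],
                   'EndDate':   ['7 June 2021', '10 June 2021', '10 June 2021'],
                   'End$':      [300, 550, 600]})

# One frame per date column, with the missing dollar column filled with 0
starts = df[['StartDate', 'start$']].rename(columns={'StartDate': 'PythonDate'})
starts['End$'] = 0
ends = df[['EndDate', 'End$']].rename(columns={'EndDate': 'PythonDate'})
ends['start$'] = 0

# Append, sum per date, and fill in the missing calendar days with 0
out = pd.concat([starts, ends], sort=False)
out['PythonDate'] = pd.to_datetime(out['PythonDate'])
out = (out.groupby('PythonDate')[['start$', 'End$']].sum()
          .asfreq('D', fill_value=0)
          .reset_index())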
Related
I have a dataframe like below.
I'm trying to sum the Week 7 and Week 8 SalesQuantity values for each ProductCode [1-317], update the week 7 rows with that new SalesQuantity value, and then delete the week 8 rows from the DataFrame.
The Week column ranges over [7-26], and every week contains all product codes [1-317], since this data comes from grouping the original data by [Week, ProductCode].
Week ProductCode SalesQuantity
7 1 49.285714
7 2 36.714286
7 3 33.285714
7 4 36.857143
7 5 42.714286
... ... ...
8 3 61.000000
26 314 4.285714
26 315 3.571429
26 316 6.142857
26 317 3.285714
Example result: from the table above, adding the week 7 and week 8 SalesQuantities for ProductCode 3 gives 61.000000 + 33.285714 = 94.285714, which becomes the new week 7 SalesQuantity value for ProductCode 3.
After that, the week 8 row for ProductCode 3 needs to be deleted.
How can this be automated for all of ProductCode [1-317]?
Thanks
Use the `groupby()` method:
sumSales = data[['ProductCode', 'SalesQuantity']].groupby('ProductCode').sum()
This creates a new DataFrame with the sum of SalesQuantity, indexed by product code. The data[['ProductCode', 'SalesQuantity']] part creates a sub-selection of the original data frame, otherwise the weeks would also get summed.
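That only gives the overall per-product totals. One possible way to finish the week 7/8 case described in the question (an assumed continuation, not part of the original answer):

import pandas as pd

# Sum week 7 + week 8 per ProductCode and label the result as week 7
mask = data['Week'].isin([7, 8])
week7 = data[mask].groupby('ProductCode', as_index=False)['SalesQuantity'].sum()
week7['Week'] = 7

# Drop the original week 7/8 rows and append the combined week 7 rows
data = (pd.concat([data[~mask], week7], ignore_index=True, sort=False)
          .sort_values(['Week', 'ProductCode'])
          .reset_index(drop=True))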
I am having trouble illustrating my problem without complicating things, so bear with me: the following screenshot is for explaining the problem only (i.e. the data is not actually in this form):
I would like to identify the past 14 days with a number > 0 across all bins (i.e. the total row has a value greater than 0). This would include all days except days 5 and 12 (highlighted in red). I would then like to sum across bins horizontally for those 14 days (i.e. sum all days except 5 and 12, by bin), with the goal of ultimately calculating a 14-day average by Bin number.
Note the example above is for one "Lane", where my data has > 10,000. The example also only illustrates today being day 16, but I would like to apply this logic to every day in the data set, i.e. on day 20 (or any other date) it would look at the last 14 days with a value across all bins, then use that date range to aggregate by Bin. This is a screenshot sample of how the data looks:
A simple example using the data as it is structured, with only 3 Bins, 1 Lane, and a 3 data point/date look back:
Lane Date Bin KG
AMS-ORD 2018-08-26 3 10
AMS-ORD 2018-08-29 1 25
AMS-ORD 2018-08-30 2 30
AMS-ORD 2018-09-03 2 20
AMS-ORD 2018-09-04 1 40
Note KG here is a sum. Again this is for one day (aka today), but I would like every date in my data set to follow the same logic. The output would look like the following:
Lane Date Bin KG Average
AMS-ORD 2018-09-04 1 40 13.33
AMS-ORD 2018-09-04 2 50 16.67
AMS-ORD 2018-09-04 3 0 -
I have messed around with .rolling(14).mean(), .tail(), and some others. The problem I have is specifying the correct date range for the correct Bin aggregation.
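For what it's worth, a rough sketch of the described look-back logic, using the small 3-date example above for a single Lane (column names come from the sample; the rest is an assumed illustration, not a full solution across all Lanes and dates):

import pandas as pd

df = pd.DataFrame({'Lane': ['AMS-ORD'] * 5,
                   'Date': pd.to_datetime(['2018-08-26', '2018-08-29', '2018-08-30',
                                           '2018-09-03', '2018-09-04']),
                   'Bin':  [3, 1, 2, 2, 1],
                   'KG':   [10, 25, 30, 20, 40]})

lookback = 3
# Most recent dates on which anything was recorded (total across bins > 0)
active_dates = (df.groupby('Date')['KG'].sum()
                  .loc[lambda s: s > 0]
                  .sort_index(ascending=False)
                  .head(lookback)
                  .index)

# Sum KG per Bin over those dates and average over the look-back window
window = df[df['Date'].isin(active_dates)]
per_bin = window.groupby('Bin')['KG'].sum().reindex([1, 2, 3], fill_value=0)
average = per_bin / lookback   # Bin 1: 13.33, Bin 2: 16.67, Bin 3: 0.0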
I'm working in Python and I have a Pandas DataFrame of Uber data from New York City. A part of the DataFrame looks like this:
Year Week_Number Total_Dispatched_Trips
2015 51 1,109
2015 5 54,380
2015 50 8,989
2015 51 1,025
2015 21 10,195
2015 38 51,957
2015 43 266,465
2015 29 66,139
2015 40 74,321
2015 39 3
2015 50 854
As it is right now, the same week appears multiple times for each year. I want to sum the values for "Total_Dispatched_Trips" for every week for each year. I want each week to appear only once per year. (So week 51 can't appear multiple times for year 2015 etc.). How do I do this? My dataset is over 3k rows, so I would prefer not to do this manually.
Thanks in advance.
Okidoki, here it is, borrowing from Convert number strings with commas in pandas DataFrame to float:
import pandas as pd
import locale
from locale import atof
locale.setlocale(locale.LC_NUMERIC, '')
df['numeric_trip'] = pd.to_numeric(df.Total_Dispatched_Trips.apply(atof), errors='coerce')
df.groupby(['Year', 'Week_Number']).numeric_trip.sum()
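An alternative sketch that skips locale entirely, stripping the thousands separators before converting (assuming the column holds strings like "1,109"):

df['numeric_trip'] = pd.to_numeric(
    df['Total_Dispatched_Trips'].str.replace(',', '', regex=False),
    errors='coerce')
weekly = df.groupby(['Year', 'Week_Number'], as_index=False)['numeric_trip'].sum()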
So, I was trying to accomplish this in SQL but was advised there would be a simple way to do this in Pandas... I would appreciate your help/hints!
I currently have the table on the left with two columns (begin subsession and end subsession), and I would like to add two new columns, "session start" and "session end". I know how to simply add the columns, but I can't figure out the query that identifies the continuous values in the two original columns (i.e. the end subsession value is the same as the next row's begin subsession value) and then writes the first begin subsession value and the last end subsession value (for the continuous rows) into the respective new columns. Please refer to the image: for example, for the first three rows the "end subsession" value is the same as the next row's "begin subsession" value, so the first three rows get the same "session start" and "session end", namely the minimum of the "begin subsession" values and the maximum of the "end subsession" values.
I was trying something along these lines in SQL; it obviously didn't work, and I realized an aggregate function doesn't work in this case:
SELECT
FROM viewershipContinuous =
CASE
WHEN endSubsession.ROWID = beginSubession.ROWID+1
THEN MIN(beginSubsession)
ELSE beginSubsession.ROWID+1
END;
The table on the left is what I have, the table on the right is what I want to achieve
You can first compare each bsub value with the previous row's esub (shifted) using ne (!=), and then create group identifiers with cumsum:
s = df['bsub'].ne(df['esub'].shift()).cumsum()
print (s)
0 1
1 1
2 1
3 2
4 2
5 2
6 2
7 3
8 3
dtype: int32
Then group by the Series s and use transform with min and max:
g = df.groupby(s)
df['session start'] = g['bsub'].transform('min')
df['session end'] = g['esub'].transform('max')
print (df)
bsub esub session start session end
0 1700 1705 1700 1800
1 1705 1730 1700 1800
2 1730 1800 1700 1800
3 1900 1920 1900 1965
4 1920 1950 1900 1965
5 1950 1960 1900 1965
6 1960 1965 1900 1965
7 2000 2001 2000 2002
8 2001 2002 2000 2002
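For reference, the sample frame used above could be reconstructed from the printed output like this:

import pandas as pd

df = pd.DataFrame({'bsub': [1700, 1705, 1730, 1900, 1920, 1950, 1960, 2000, 2001],
                   'esub': [1705, 1730, 1800, 1920, 1950, 1960, 1965, 2001, 2002]})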
With Pandas I have created a DataFrame from an imported .csv file (this file is generated through simulation). The DataFrame consists of half-hourly energy consumption data for a single year. I have already created a DateTimeindex for the dates.
I would like to be able to reformat this data into average hourly weekday and weekend profiles, with the weekday profile excluding holidays.
DataFrame:
Date_Time Equipment:Electricity:LGF Equipment:Electricity:GF
01/01/2000 00:30 0.583979872 0.490327348
01/01/2000 01:00 0.583979872 0.490327348
01/01/2000 01:30 0.583979872 0.490327348
01/01/2000 02:00 0.583979872 0.490327348
I found an example (Getting the average of a certain hour on weekdays over several years in a pandas dataframe) that explains doing this for several years, but not explicitly for a week (without holidays) and weekend.
I realised that there are no resampling techniques in Pandas that do this directly; I used several offset aliases (http://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases) for creating monthly and daily profiles.
I was thinking of using the business day frequency to create a new date index of working days, comparing it against my DataFrame's DatetimeIndex for every half hour, and then returning values for working days and weekend days when true or false respectively to create a new dataset, but I am not sure how to do this.
PS: I am just getting into Python and Pandas.
Dummy data (for future reference, you're more likely to get an answer if you post some in a copy-paste-able form):
import numpy as np
import pandas as pd
df = pd.DataFrame(data={'a': np.random.randn(1000)},
                  index=pd.date_range(start='2000-01-01', periods=1000, freq='30T'))
Here's an approach. First define a US (or modify as appropriate) business day offset with holidays, and generate a range covering your dates.
from pandas.tseries.holiday import USFederalHolidayCalendar
from pandas.tseries.offsets import CustomBusinessDay
bday_us = CustomBusinessDay(calendar=USFederalHolidayCalendar())
bday_over_df = pd.date_range(start=df.index.min().date(),
end=df.index.max().date(), freq=bday_us)
Then, develop your two grouping columns. An hour column is easy.
df['hour'] = df.index.hour
For weekday/weekend/holiday, define a function to group the data.
def group_day(date):
    if date.weekday() in [5, 6]:
        return 'weekend'
    elif date.date() in bday_over_df:
        return 'weekday'
    else:
        return 'holiday'
df['day_group'] = df.index.map(group_day)
Then, just group by the two columns as you wish.
In [140]: df.groupby(['day_group', 'hour']).sum()
Out[140]:
a
day_group hour
holiday 0 1.890621
1 -0.029606
2 0.255001
3 2.837000
4 -1.787479
5 0.644113
6 0.407966
7 -1.798526
8 -0.620614
9 -0.567195
10 -0.822207
11 -2.675911
12 0.940091
13 -1.601885
14 1.575595
15 1.500558
16 -2.512962
17 -1.677603
18 0.072809
19 -1.406939
20 2.474293
21 -1.142061
22 -0.059231
23 -0.040455
weekday 0 9.192131
1 2.759302
2 8.379552
3 -1.189508
4 3.796635
5 3.471802
... ...
18 -5.217554
19 3.294072
20 -7.461023
21 8.793223
22 4.096128
23 -0.198943
weekend 0 -2.774550
1 0.461285
2 1.522363
3 4.312562
4 0.793290
5 2.078327
6 -4.523184
7 -0.051341
8 0.887956
9 2.112092
10 -2.727364
11 2.006966
12 7.401570
13 -1.958666
14 1.139436
15 -1.418326
16 -2.353082
17 -1.381131
18 -0.568536
19 -5.198472
20 -3.405137
21 -0.596813
22 1.747980
23 -6.341053
[72 rows x 1 columns]
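Since the question asks for average hourly profiles rather than totals, the same grouping could presumably be finished with mean() instead of sum():

df.groupby(['day_group', 'hour']).mean()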