I have a DataFrame and need to implement the calculation below.
I will run this script every month, so it should automatically pick the number of days in the month based on extracted_date.
Input DataFrame:
client_id expo_value value cal_value extracted_date
1 126 30 27.06 08/2022
2 135 60 36.18 08/2022
3 144 120 45 08/2022
4 162 30 54.09 08/2022
5 153 90 63.63 08/2022
6 181 120 72.9 08/2022
Expected output:
client_id expo_value value cal_value extracted_date Output_Value
1 126 30 27.06 08/2022 126+26.18 = 152.18
2 135 60 36.18 08/2022 261.29+70.02 = 331.31
3 144 120 45 08/2022 557.4+174.19 = 731.59
4 162 30 54.09 08/2022 156.7+ 52.34 = 209.04
5 153 90 63.63 08/2022 444.19+ 182.9 =627.09
6 181 120 72.9 08/2022 700.64+282.19=982.83
I want the function below to use 31, 30, or 28 days as appropriate. I tried entering 31 (days) manually for the calculation, but the number should be picked automatically based on how many days the month has.
def month_data(data):
    if (data['value'] <=30).any():
        return data['expo_value'] *30/ 31(days) + data['cal_value'] * 45/ 31(days)
    elif (data['value'] <=60).any():
        return data['expo_value'] *60/ 31(days) + data['cal_value'] * 90/31(days)
    elif (data['value'] <=90).any():
        return data['expo_value'] *100/31(days) + data['cal_value'] * 120/ 31(days)
    else (data['value'] <=120).any():
        return np.nan
Let me see if I understood you correctly. I tried to reproduce a small subset of your dataframe (you should do this next time you post something). The answer is as follows:
import pandas as pd
from datetime import datetime
import calendar
# I'll make a subset dataframe based on your example
data = [[30, '02/2022'], [60, '08/2022']]
df = pd.DataFrame(data, columns=['value', 'extracted_date'])
# First, turn the extracted_date column into a correct date format
date_correct_format = [datetime.strptime(i, '%m/%Y') for i in df['extracted_date']]
# Second, calculate the number of days per month
num_days = [calendar.monthrange(i.year, i.month)[1] for i in date_correct_format]
num_days
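To connect the day count back to the calculation in the question, here is a minimal sketch. It assumes the tiers should be applied row by row (the question's function uses .any(), which would pick one branch for the whole column), it copies the multipliers from the question's function as they are, and it uses made-up sample rows with one February date for contrast. The column names days_in_month and output are my own choice.

```python
import numpy as np
import pandas as pd

# Made-up sample rows in the shape of the question's input
df = pd.DataFrame({
    "client_id": [1, 2, 3],
    "expo_value": [126, 135, 144],
    "value": [30, 60, 120],
    "cal_value": [27.06, 36.18, 45.0],
    "extracted_date": ["08/2022", "02/2022", "08/2022"],
})

# Number of days in each row's month, derived from extracted_date
days = pd.to_datetime(df["extracted_date"], format="%m/%Y").dt.days_in_month
df["days_in_month"] = days

# Row-wise tiers; the first matching condition wins
conditions = [df["value"] <= 30, df["value"] <= 60, df["value"] <= 90]
choices = [
    df["expo_value"] * 30 / days + df["cal_value"] * 45 / days,
    df["expo_value"] * 60 / days + df["cal_value"] * 90 / days,
    df["expo_value"] * 100 / days + df["cal_value"] * 120 / days,
]
# Rows matching none of the tiers (value > 90) get NaN, as in the question
df["output"] = np.select(conditions, choices, default=np.nan)
print(df)
```

Because days is a Series, each row is divided by the day count of its own month, so the script needs no change from one monthly run to the next.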
I have a data frame:
I have to calculate the sum of the differences separately for each event. In the data frame, you can see that index_col jumps from 8 to 12, which marks the start of a new event, so that event's differences should be summed separately. In other words, whenever there is a gap in index_col (here a gap of 4), a new event starts and its differences should be summed on their own.
So the sums per event should look like this, e.g.:
index_col 1-8: the sum of Difference should be 20.96 (belongs to the first event)
index_col 12-17: the sum of Difference should be 16.17 (belongs to the second event)
and so on ...
index_col Depth(nm) Load(µN) Time (s) Difference
1 42.478033 432.482376 5.460979 8.70957
2 44.217959 432.163277 5.461261 1.73993
3 44.517313 432.764691 5.461824 3.36262
4 44.602024 433.754851 5.462669 2.37831
5 44.452232 434.808104 5.463514 1.8221
6 44.785705 435.698639 5.464358 1.1552
7 44.008191 436.724050 5.464922 1.02758
8 44.104820 438.753727 5.466611 1.04814
12 39.918249 390.597846 5.476275 7.61717
13 40.939905 391.229950 5.477120 2.66319
14 40.709209 392.333573 5.477965 1.99305
15 40.975959 393.208349 5.478810 1.88325
16 40.415786 395.135862 5.480218 1.00294
17 40.748377 396.057784 5.481062 1.13622
21 45.101152 441.052546 5.554368 5.64005
22 43.096024 442.489659 5.554931 2.13311
23 44.581075 442.264911 5.555213 1.48505
24 43.757947 443.295160 5.555776 2.34133
25 44.020544 444.209317 5.556621 2.15143
26 44.457026 445.121651 5.557466 2.2784
27 44.332075 446.131261 5.558310 1.36814
28 43.853956 447.344522 5.559155 1.0139
32 38.420457 381.697812 5.462362 5.80165
33 39.247295 382.417916 5.463206 2.51963
34 38.910364 383.542124 5.464051 1.67136
38 45.939504 467.899009 5.564736 6.58783
39 44.251143 469.194422 5.565299 1.40849
40 46.242257 468.823029 5.565581 1.99111
41 45.032736 469.930914 5.566144 1.95164
42 45.540791 470.765236 5.566989 2.50574
43 45.520035 471.821972 5.567834 1.91457
44 45.593076 472.835489 5.568678 1.24077
45 45.267980 474.618237 5.570086 1.05416
46 45.238412 475.640147 5.570931 1.038062
49 38.193023 392.286042 5.490368 8.13389
50 41.444420 391.411630 5.490650 3.2514
The way you added the data as plain text is very unhelpful. It would be much easier and faster if you provided the data in the form index_col = ..., load = ... and so on.
That aside, this is my code:
import pandas as pd
index_col = [1, 2, 3, 4, 5, 6, 7, 8, 12, 13, 14, 15, 16, 17, 21, 22, 23, 24, 25, 26, 27, 28, 32, 33, 34, 38, 39, 40, 41, 42, 43, 44, 45, 46, 49, 50]
depth = [42.478033, 44.217959, 44.517313, 44.602024, 44.452232, 44.785705, 44.008191, 44.10482, 39.918249, 40.939905, 40.709209, 40.975959, 40.415786, 40.748377, 45.101152, 43.096024, 44.581075, 43.757947, 44.020544, 44.457026, 44.332075, 43.853956, 38.420457, 39.247295, 38.910364, 45.939504, 44.251143, 46.242257, 45.032736, 45.540791, 45.520035, 45.593076, 45.26798, 45.238412, 38.193023, 41.44442]
load = [432.482376, 432.163277, 432.764691, 433.754851, 434.808104, 435.698639, 436.72405, 438.753727, 390.597846, 391.22995, 392.333573, 393.208349, 395.135862, 396.057784, 441.052546, 442.489659, 442.264911, 443.29516, 444.209317, 445.121651, 446.131261, 447.344522, 381.697812, 382.417916, 383.542124, 467.899009, 469.194422, 468.823029, 469.930914, 470.765236, 471.821972, 472.835489, 474.618237, 475.640147, 392.286042, 391.41163]
time = [5.460979, 5.461261, 5.461824, 5.462669, 5.463514, 5.464358, 5.464922, 5.466611, 5.476275, 5.47712, 5.477965, 5.47881, 5.480218, 5.481062, 5.554368, 5.554931, 5.555213, 5.555776, 5.556621, 5.557466, 5.55831, 5.559155, 5.462362, 5.463206, 5.464051, 5.564736, 5.565299, 5.565581, 5.566144, 5.566989, 5.567834, 5.568678, 5.570086, 5.570931, 5.490368, 5.49065]
difference = [8.70957, 1.73993, 3.36262, 2.37831, 1.8221, 1.1552, 1.02758, 1.04814, 7.61717, 2.66319, 1.99305, 1.88325, 1.00294, 1.13622, 5.64005, 2.13311, 1.48505, 2.34133, 2.15143, 2.2784, 1.36814, 1.0139, 5.80165, 2.51963, 1.67136, 6.58783, 1.40849, 1.99111, 1.95164, 2.50574, 1.91457, 1.24077, 1.05416, 1.03806, 8.13389, 3.2514]
df = pd.DataFrame({'index': index_col, 'depth': depth, 'load': load, 'time': time, 'difference': difference})
sum_diff = []
start = 0
for i in range(len(df)):
    if i == len(df) - 1:
        end = i+1
        sum_diff.append(sum(df['difference'][start:end]))
    else:
        if df['index'][i] + 1 != df['index'][i + 1]:
            end = i+1
            sum_diff.append(sum(df['difference'][start:end]))
            start = end
print(sum_diff)
Output:
[21.24345, 16.295820000000003, 18.411409999999997, 9.99264, 19.69237, 11.38529]
I checked if the calculation is correct by doing this manually:
print(sum(df['difference'][0:8]))
print(sum(df['difference'][8:14]))
print(sum(df['difference'][14:22]))
print(sum(df['difference'][22:25]))
print(sum(df['difference'][25:34]))
print(sum(df['difference'][34:36]))
and yes, I got the same output:
21.24345
16.295820000000003
18.411409999999997
9.99264
19.69237
11.38529
An elegant solution is to group the dataframe by the gaps in index_col and build a dict of the groups, so the sums are flexible to use. Start with an empty dataframe and use it to store the summed results.
You can do as follows:
df = pd.DataFrame(data)  # data: the table from the question
result = pd.DataFrame(columns = ['event_no', 'sum'])
grouped_dict = dict(tuple(df.groupby(df['index_col'].diff().gt(1).cumsum())))
# Note: DataFrame.append was removed in pandas 2.0; on newer versions, collect the rows in a list and build the frame at the end
for index in grouped_dict:
    result = result.append({'event_no': index+1, 'sum': grouped_dict[index]['difference'].sum()}, ignore_index=True)
And this will give you exactly what you want:
event_no sum
0 1.0 21.243450
1 2.0 16.295820
2 3.0 18.411410
3 4.0 9.992640
4 5.0 19.692372
5 6.0 11.385290
What does df.groupby(df['index_col'].diff().gt(1).cumsum()) do?
diff() calculates the difference between consecutive values in df['index_col']. gt(1) returns, for each element of that result, whether it is greater than 1. cumsum() then takes the running sum of these boolean results. Rows 0-7 are all False, so cumsum is 0 for each of them. Row 8 is True, so cumsum becomes 1 and stays at 1 for the following consecutive rows, since they return False for gt(1).
The calculation proceeds the same way for the remaining consecutive segments. So df.groupby() receives group labels from 0 to 5, as follows:
0
0
0
0
0
0
0
0
1
1
1
1
1
1
2
2
2
2
2
2
2
2
3
3
3
4
4
4
4
4
4
4
4
4
5
5
Hence the groupby is done on these 6 group labels (0 to 5) for your given input.
Hope that's clear now!
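The three steps can also be watched one at a time on a small frame. This is only a sketch with made-up numbers (not the question's data); the intermediate names steps, is_new_event and event_id are mine. It also sums each group directly with groupby(...).sum(), which avoids the dict and the loop.

```python
import pandas as pd

# Made-up data: gaps after index_col 3 and after index_col 8
df = pd.DataFrame({
    "index_col": [1, 2, 3, 7, 8, 12],
    "difference": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
})

steps = df["index_col"].diff()      # NaN, 1, 1, 4, 1, 4
is_new_event = steps.gt(1)          # False, False, False, True, False, True
event_id = is_new_event.cumsum()    # 0, 0, 0, 1, 1, 2

# One sum per event, keyed by the running event id
sums = df.groupby(event_id)["difference"].sum()
print(sums)
```

The first diff() value is NaN, and NaN.gt(1) is False, so the first row always falls into event 0.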
I have a few Django formatting issues that require changes to the df headers.
Test data:
Test_Data = [
    ('Year_Month', ['Done_RFQ', 'Not_Done_RFQ', 'Total_RFQ']),
    ('2018_11', [10, 20, 30]),
    ('2019_06', [10, 20, 30]),
    ('2019_12', [40, 50, 60]),
]
df = pd.DataFrame(dict(Test_Data))
print(df)
Year_Month 2018_11 2019_06 2019_12
0 Done_RFQ 10 10 40
1 Not_Done_RFQ 20 20 50
2 Total_RFQ 30 30 60
Desired output:
Year_Month 2018_Nov 2019_Jun 2019_Dec
0 Done_RFQ 10 10 40
1 Not_Done_RFQ 20 20 50
2 Total_RFQ 30 30 60
My attempt:
df_names = df.columns
for df_name in df_names:
    if df_name[:1] == '20':
        df.df_name = str(pd.to_datetime(df_name, format='%Y_%m').dt.strftime('%Y_%b'))
Error: AttributeError: 'Timestamp' object has no attribute 'dt'
I was hoping the date object could be used for the formatting. Any suggestions on how to generalise this for any string in the headers?
IIUC
s=pd.Series(df.columns)
s2=pd.to_datetime(s,format='%Y_%m',errors ='coerce').dt.strftime('%Y_%b')
df.columns=s2.mask(s2=='NaT').fillna(s)
df
Out[368]:
2018_Nov 2019_Jun 2019_Dec Year_Month
0 10 10 40 Done_RFQ
1 20 20 50 Not_Done_RFQ
2 30 30 60 Total_RFQ
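You can drop the .dt, since .strftime is a method of Timestamp. Two more things need fixing in the loop: df_name[:1] is a single character, so compare df_name[:2] == '20' instead, and df.df_name = ... sets an attribute literally called df_name rather than renaming the column, so use rename:
df = df.rename(columns={df_name: pd.to_datetime(df_name, format='%Y_%m').strftime('%Y_%b')})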
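For the "any string in the headers" part of the question, one option is to try parsing every header and keep it unchanged when it does not match. This is only a sketch: format_header is a hypothetical helper name, and it uses the standard library's datetime.strptime instead of pd.to_datetime. The month abbreviation produced by %b depends on the locale; an English locale is assumed here.

```python
from datetime import datetime

import pandas as pd


def format_header(name, in_fmt="%Y_%m", out_fmt="%Y_%b"):
    """Reformat a header that matches in_fmt; return it unchanged otherwise."""
    try:
        return datetime.strptime(name, in_fmt).strftime(out_fmt)
    except (ValueError, TypeError):
        # Not a date-like header (e.g. 'Year_Month') or not a string
        return name


df = pd.DataFrame({
    "Year_Month": ["Done_RFQ", "Not_Done_RFQ", "Total_RFQ"],
    "2018_11": [10, 20, 30],
    "2019_06": [10, 20, 30],
    "2019_12": [40, 50, 60],
})

# rename accepts a callable and applies it to every column label
df = df.rename(columns=format_header)
print(df)
```

Passing the function to rename avoids the explicit loop and the prefix check, since headers that fail to parse simply fall through.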
I have a dataframe that contains the duration of a trip as text values as shown below in the column driving_duration_text.
print(df)
yelp_id driving_duration_text \
0 alexander-rubin-photography-napa 1 hour 43 mins
1 jumas-automotive-napa-2 1 hour 32 mins
2 larson-brothers-painting-napa 1 hour 30 mins
3 preferred-limousine-napa 1 hour 32 mins
4 cardon-y-el-tirano-miami 1 day 16 hours
5 sweet-dogs-miami 1 day 3 hours
As you can see some are written in hours and others in days. How could I convert this format to seconds?
UPDATE:
In [150]: df['seconds'] = (pd.to_timedelta(df['driving_duration_text']
.....: .str.replace(' ', '')
.....: .str.replace('mins', 'min'))
.....: .dt.total_seconds())
In [151]: df
Out[151]:
yelp_id driving_duration_text seconds
0 alexander-rubin-photography-napa 1 hour 43 mins 6180.0
1 jumas-automotive-napa-2 1 hour 32 mins 5520.0
2 larson-brothers-painting-napa 1 hour 30 mins 5400.0
3 preferred-limousine-napa 1 hour 32 mins 5520.0
4 cardon-y-el-tirano-miami 1 day 16 hours 144000.0
5 sweet-dogs-miami 1 day 3 hours 97200.0
OLD answer:
you can do it this way:
from collections import defaultdict
import re
def humantime2seconds(s):
    d = {
        'w': 7*24*60*60,
        'week': 7*24*60*60,
        'weeks': 7*24*60*60,
        'd': 24*60*60,
        'day': 24*60*60,
        'days': 24*60*60,
        'h': 60*60,
        'hr': 60*60,
        'hour': 60*60,
        'hours': 60*60,
        'm': 60,
        'min': 60,
        'mins': 60,
        'minute': 60,
        'minutes':60
    }
    mult_items = defaultdict(lambda: 1).copy()
    mult_items.update(d)
    parts = re.search(r'^(\d+)([^\d]*)', s.lower().replace(' ', ''))
    if parts:
        return int(parts.group(1)) * mult_items[parts.group(2)] + humantime2seconds(re.sub(r'^(\d+)([^\d]*)', '', s.lower()))
    else:
        return 0
df['seconds'] = df.driving_duration_text.map(humantime2seconds)
Output:
In [64]: df
Out[64]:
yelp_id driving_duration_text seconds
0 alexander-rubin-photography-napa 1 hour 43 mins 6180
1 jumas-automotive-napa-2 1 hour 32 mins 5520
2 larson-brothers-painting-napa 1 hour 30 mins 5400
3 preferred-limousine-napa 1 hour 32 mins 5520
4 cardon-y-el-tirano-miami 1 day 16 hours 144000
5 sweet-dogs-miami 1 day 3 hours 97200
Given that the text does seem to follow a standardized format, this is relatively straightforward. We need to break the string apart, decompose it into the relevant pieces, and then process them.
def parse_duration(duration):
    # Tokens alternate between a count and a unit word, e.g. "1 hour 43 mins"
    items = duration.split()
    words = items[1::2]
    counts = items[::2]
    seconds = 0
    for i, each in enumerate(words):
        seconds += get_seconds(each, counts[i])
    return seconds
def get_seconds(word, count):
    counts = {
        'second': 1,
        'minute': 60,
        'min': 60,  # needed for the 'mins' abbreviation in the data
        'hour': 3600,
        'day': 86400
        # and so on
    }
    # Bit complicated here to handle plurals: try the word without its
    # trailing 's' first, then the word as given
    base = counts.get(word[:-1], counts.get(word, 0))
    # count arrives as a string from split(), so convert it before multiplying
    return base * int(count)
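A variation on the same idea, as a sketch rather than a drop-in replacement: a regular expression pulls out the (count, unit) pairs, and an unknown unit raises an error instead of silently contributing 0 seconds. The names UNIT_SECONDS and duration_to_seconds are hypothetical, and the unit table only covers what appears in the question plus a few obvious neighbours.

```python
import re

# Seconds per unit, singular forms only; plurals are handled below
UNIT_SECONDS = {
    "second": 1,
    "sec": 1,
    "minute": 60,
    "min": 60,
    "hour": 3600,
    "day": 86400,
    "week": 604800,
}


def duration_to_seconds(text):
    """Convert text such as '1 day 16 hours' to a number of seconds."""
    total = 0
    for count, unit in re.findall(r"(\d+)\s*([a-z]+)", text.lower()):
        if unit not in UNIT_SECONDS:
            unit = unit.rstrip("s")  # 'hours' -> 'hour', 'mins' -> 'min'
        if unit not in UNIT_SECONDS:
            raise ValueError(f"unknown unit in {text!r}")
        total += int(count) * UNIT_SECONDS[unit]
    return total


for sample in ["1 hour 43 mins", "1 day 16 hours", "1 day 3 hours"]:
    print(sample, "->", duration_to_seconds(sample))
```

It can be applied to the column in the same way as the other answers, for example df['seconds'] = df['driving_duration_text'].map(duration_to_seconds).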