I would like to ask how to sum using python or excel.
Like to do summation of "number" columns based on "time" column.
Sum of the Duration for (00:00 am - 00:59 am) is (2+4) 6.
Sum of the Duration for (02:00 am - 02:59 am) is (3+1) 4.
Could you please advise how to ?
When you have a dataframe you can use groupby to accomplish this:
# import pandas module
import pandas as pd
# Create a dictionary with the values
data = {
'time' : ["12:20:51", "12:40:51", "2:26:35", "2:37:35"],
'number' : [2, 4, 3, 1]}
# create a Pandas dataframe
df = pd.DataFrame(data)
# or load the CSV
df = pd.read_csv('path/dir/filename.csv')
# Convert time column to datetime data type
df['time'] = df['time'].apply(pd.to_datetime, format='%H:%M:%S')
# add values by hour
dff = df.groupby(df['time'].dt.hour)['number'].sum()
print(dff.head(50))
output:
time
12 6
2 4
When you need more than one column. You can pass the columns as a list inside .groupby(). The code will look like this:
import pandas as pd
df = pd.read_csv('filename.csv')
# Convert time column to datetime data type
df['time'] = df['time'].apply(pd.to_datetime, format='%H:%M:%S')
df['date'] = df['date'].apply(pd.to_datetime, format='%d/%m/%Y')
# add values by hour
dff = df.groupby([df['date'], df['time'].dt.hour])['number'].sum()
print(dff.head(50))
# save the file
dff.to_csv("filename.csv")
I need to create a new column and the value should be:
the current fair_price - fair_price 15 minutes ago(or the closest row)
I need to filter who is the row 15 minutes before then calculate the diff.
import numpy as np
import pandas as pd
from datetime import timedelta
df = pd.DataFrame(pd.read_csv('./data.csv'))
def calculate_15min(row):
end_date = pd.to_datetime(row['date']) - timedelta(minutes=15)
mask = (pd.to_datetime(df['date']) <= end_date).head(1)
price_before = df.loc[mask]
return price_before['fair_price']
def calc_new_val(row):
return 'show date 15 minutes before, maybe it will be null, nope'
df['15_min_ago'] = df.apply(lambda row: calculate_15min(row), axis=1)
myFields = ['pkey_id', 'date', '15_min_ago', 'fair_price']
print(df[myFields].head(5))
df[myFields].head(5).to_csv('output.csv', index=False)
I did it using nodejs but python is not my beach, maybe you have a fast solution...
pkey_id,date,fair_price,15_min_ago
465620,2021-05-17 12:28:30,45080.23,fair_price_15_min_before
465625,2021-05-17 12:28:35,45060.17,fair_price_15_min_before
465629,2021-05-17 12:28:40,45052.74,fair_price_15_min_before
465633,2021-05-17 12:28:45,45043.89,fair_price_15_min_before
465636,2021-05-17 12:28:50,45040.93,fair_price_15_min_before
465640,2021-05-17 12:28:56,45049.95,fair_price_15_min_before
465643,2021-05-17 12:29:00,45045.38,fair_price_15_min_before
465646,2021-05-17 12:29:05,45039.87,fair_price_15_min_before
465650,2021-05-17 12:29:10,45045.55,fair_price_15_min_before
465652,2021-05-17 12:29:15,45042.53,fair_price_15_min_before
465653,2021-05-17 12:29:20,45039.34,fair_price_15_min_before
466377,2021-05-17 12:42:50,45142.74,fair_price_15_min_before
466380,2021-05-17 12:42:55,45143.24,fair_price_15_min_before
466393,2021-05-17 12:43:00,45130.98,fair_price_15_min_before
466398,2021-05-17 12:43:05,45128.13,fair_price_15_min_before
466400,2021-05-17 12:43:10,45140.9,fair_price_15_min_before
466401,2021-05-17 12:43:15,45136.38,fair_price_15_min_before
466404,2021-05-17 12:43:20,45118.54,fair_price_15_min_before
466405,2021-05-17 12:43:25,45120.69,fair_price_15_min_before
466407,2021-05-17 12:43:30,45121.37,fair_price_15_min_before
466413,2021-05-17 12:43:36,45133.71,fair_price_15_min_before
466415,2021-05-17 12:43:40,45137.74,fair_price_15_min_before
466419,2021-05-17 12:43:45,45127.96,fair_price_15_min_before
466431,2021-05-17 12:43:50,45100.83,fair_price_15_min_before
466437,2021-05-17 12:43:55,45091.78,fair_price_15_min_before
466438,2021-05-17 12:44:00,45084.75,fair_price_15_min_before
466445,2021-05-17 12:44:06,45094.08,fair_price_15_min_before
466448,2021-05-17 12:44:10,45106.51,fair_price_15_min_before
466456,2021-05-17 12:44:15,45122.97,fair_price_15_min_before
466461,2021-05-17 12:44:20,45106.78,fair_price_15_min_before
466466,2021-05-17 12:44:25,45096.55,fair_price_15_min_before
466469,2021-05-17 12:44:30,45088.06,fair_price_15_min_before
466474,2021-05-17 12:44:35,45086.12,fair_price_15_min_before
466491,2021-05-17 12:44:40,45065.95,fair_price_15_min_before
466495,2021-05-17 12:44:45,45068.21,fair_price_15_min_before
466502,2021-05-17 12:44:55,45066.47,fair_price_15_min_before
466506,2021-05-17 12:45:00,45063.82,fair_price_15_min_before
466512,2021-05-17 12:45:05,45070.48,fair_price_15_min_before
466519,2021-05-17 12:45:10,45050.59,fair_price_15_min_before
466523,2021-05-17 12:45:16,45041.13,fair_price_15_min_before
466526,2021-05-17 12:45:20,45038.36,fair_price_15_min_before
466535,2021-05-17 12:45:25,45029.72,fair_price_15_min_before
466553,2021-05-17 12:45:31,45016.2,fair_price_15_min_before
466557,2021-05-17 12:45:35,45011.2,fair_price_15_min_before
466559,2021-05-17 12:45:40,45007.04,fair_price_15_min_before
This is the CSV
Firstly convert your date column to datetime dtype:
df['date']=pd.to_datetime(df['date'])
Then filter values:
date15min=df['date']-pd.offsets.DateOffset(minutes=15)
out=df.loc[df['date'].isin(date15min.tolist())]
Now Finally do your calculations:
df['price_before_15min']=df['fair_price'].where(df['date'].isin((out['date']+pd.offsets.DateOffset(minutes=15)).tolist()))
df['price_before_15min']=df['price_before_15min'].diff()
df['date_before_15min']=date15min
Now If you print df you will get your desired output
Update:
For that purpose just make a slightly change in the above method:
out=df.loc[df['date'].dt.minute.isin(date15min.dt.minute.tolist())]
df['price_before_15min']=df['fair_price'].where(df['date'].dt.minute.isin((out['date']+pd.offsets.DateOffset(minutes=15)).dt.minute.tolist()))
I have a DataFrame that looks like this
date Burned
8/11/2019 7:00 0.0
8/11/2019 7:00 10101.0
8/11/2019 8:16 5.2
I have this code:
import pandas as pd
import numpy as np
# Read data from file 'filename.csv'
# (in the same directory that your python process is based)
# Control delimiters, rows, column names with read_csv (see later)
df = pd.read_csv("../example.csv")
# Preview the first 5 lines of the loaded data
df = df.assign(Burned = df['Quantity'])
df.loc[df['To'] != '0x0000000000000000000000000000000000000000', 'Burned'] = 0.0
# OR:
df['cum_sum'] = df['Burned'].cumsum()
df['percent_burned'] = df['cum_sum']/df['Quantity'].max()*100.0
a=pd.concat([df['DateTime'], df['Burned']], axis=1, keys=['date', 'Burned'])
b=a.groupby(df.index.date).count()
But I get this error: AttributeError: 'RangeIndex' object has no attribute 'date'
Basically I am wanting to sort all these times just by day since it has timestamps throughout the day. I don't care what time of the day different things occured, I just want to get the total number of 'Burned' per day.
First add parse_dates=['DateTime'] to read_csv for convert column Datetime:
df = pd.read_csv("../example.csv", parse_dates=['DateTime'])
Or first column:
df = pd.read_csv("../example.csv", parse_dates=[0])
In your solution is date column, so need Series.dt.date with sum:
b = a.groupby(a['date'].dt.date)['Burned'].sum().reset_index(name='Total')
I have a dataframe with date values and would like to manipulate them to 1 Jan or later. Since I need to do this element-wise, I use np.maximum(). The code below however gives
TypeError: Cannot compare type 'Timestamp' with type 'int'.
What's the appropriate method to deal with this kind of data type?
import pandas as pd
import numpy as np
df = pd.DataFrame({'date': np.arange('1999-12', '2000-02', dtype='datetime64[D]')})
df['corrected_date'] = np.maximum(pd.to_datetime('20000101', format='%Y%m%d'), df['date'])
For me working comparing with Series:
s = pd.Series(pd.to_datetime('20000101', format='%Y%m%d'), index=df.index)
df['corrected_date'] = np.maximum(s, df['date'])
Or with DatetimeIndex:
i = np.repeat(pd.to_datetime(['20000101'], format='%Y%m%d'), len(df))
df['corrected_date'] = np.maximum(i, df['date'])
I have the following dataframe:
Month Col_1 Col_2
1 0,121 0,123
2 0,231 0,356
3 0,150 0,156
4 0,264 0,426
...
I need to resample this to weekly resolution and to interpolate between the points. The latter part, the interpolation is straight-forward. The reindex part is a bit tricky, on the other hand, at least for me.
If I use the DataFrame.reindex() method, it will only erase all the entries from the dataframe. I have tried to do it manually, by using .loc() to create new 'NaN' entries between each consecutive months, but this method overwrites the entries I already have.
Any clue how to do it? Thanks!
I have to assume a start date, I chose 2009-12-31.
To get resample to work, you need a pd.DateTimeIndex.
start_date = pd.to_datetime('2009-12-31')
df.Month = df.Month.apply(lambda x: start_date + pd.offsets.MonthEnd(x))
df = df.set_index('Month')
df.resample('W').interpolate()
Replicable code
from StringIO import StringIO
import pandas as pd
text = """Month Col_1 Col_2
1 0,121 0,123
2 0,231 0,356
3 0,150 0,156
4 0,264 0,426"""
df = pd.read_csv(StringIO(text), decimal=',', delim_whitespace=True)
start_date = pd.to_datetime('2009-12-31')
df.Month = df.Month.apply(lambda x: start_date + pd.offsets.MonthEnd(x))
df = df.set_index('Month')
df.resample('W').interpolate()