I am new to pandas and I am struggling to add dates to my pandas DataFrame df, which comes from a .csv file. The DataFrame has several unique ids, and each id has 120 months. I need to add a date column, and each id should have exactly the same dates for its 120 periods. The difficulty is that after the first id another id follows, and the dates should start over again. My data in the csv file looks like this:
month id
1 1593
2 1593
...
120 1593
1 8964
2 8964
...
120 8964
1 58944
...
Here is my code. I am not sure how I should use the groupby method to add dates to my DataFrame based on id:
group=df.groupby('id')
group['date']=pd.date_range(start='2020/6/1', periods=120, freq='MS').shift(14,freq='D')
Please help me!!!
If you know how many sets of 120 you have, you can use this; just change the 2 at the end. This example repeats the same 120 dates twice. You may have to adapt it for your specific use.
new_dates = list(pd.date_range(start='2020/6/1', periods=120, freq='MS').shift(14,freq='D'))*2
df = pd.DataFrame({'date': new_dates})
These two are equivalent, except that the second one uses a lambda:
import pandas
def repeatingDates(numIds):
    return [d.strftime('%Y/%m/%d') for d in pandas.date_range(start='2020/6/1', periods=120, freq='MS')] * numIds
repeatingDates = lambda numIds: [d.strftime('%Y/%m/%d') for d in pandas.date_range(start='2020/6/1', periods=120, freq='MS')] * numIds
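If the number of ids is not known in advance, one option is to build the dates once and pick one per row by its position within its id. This is only a minimal sketch, assuming the rows of each id are already in month order; the small frame and the name n_periods are made up for illustration (3 periods instead of 120):

```python
import pandas as pd

# Hypothetical miniature of the question's data: 3 months per id instead of 120.
df = pd.DataFrame({"month": [1, 2, 3, 1, 2, 3],
                   "id": [1593, 1593, 1593, 8964, 8964, 8964]})

n_periods = 3  # would be 120 for the real data
dates = pd.date_range(start="2020-06-01", periods=n_periods, freq="MS").shift(14, freq="D")

# cumcount() numbers the rows of each id 0, 1, 2, ... so the dates restart with every new id.
position = df.groupby("id").cumcount().to_numpy()
df["date"] = dates[position]
print(df)
```

Because the position is computed per id, nothing has to be changed when more ids are added to the file.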
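You can use pandas transform. This is how I solved it:
dataf['dates'] = (
    dataf
    .groupby("id")["month"]
    # within each id, build as many month-start dates as the largest month, then shift them to the 15th
    .transform(lambda d: pd.date_range(start='2020/6/1', periods=d.max(), freq='MS').shift(14, freq='D'))
)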
Results:
month id dates
0 1 1593 2020-06-15
1 2 1593 2020-07-15
2 3 1593 2020-08-15
3 1 8964 2020-06-15
4 2 8964 2020-07-15
5 1 58944 2020-06-15
6 2 58944 2020-07-15
7 3 58944 2020-08-15
8 4 58944 2020-09-15
Test data:
import io
import pandas as pd
dataf = pd.read_csv(io.StringIO("""
month,id
1,1593
2,1593
3,1593
1,8964
2,8964
1,58944
2,58944
3,58944
4,58944""")).astype(int)
I have a pandas dataframe in this format:
Dates
11-Feb-18
18-Feb-18
03-Mar-18
25-Mar-18
29-Mar-18
04-Apr-18
08-Apr-18
14-Apr-18
17-Apr-18
30-Apr-18
04-May-18
I want to find dates between two consecutive dates. In this example I want to make a new column which will contain dates between two consecutive dates. For example between 11-Feb-18 and 18-Feb-18, I will get all the dates between these two dates.
I tried this code, but it throws an error:
pd.DataFrame({'dates':pd.date_range(pd.to_datetime(df_new['Time.[Day]'].loc[i].diff(-1)))})
If you want to add a column with the list of dates that are missing in between, this should work. It could be more efficient, and because it has to work around the NaT in the last row it is a bit longer than intended, but it gives you the result.
import pandas as pd
from datetime import timedelta
test_df = pd.DataFrame({
"Dates" :
["11-Feb-18", "18-Feb-18", "03-Mar-18", "25-Mar-18", "29-Mar-18", "04-Apr-18",
"08-Apr-18", "14-Apr-18", "17-Apr-18", "30-Apr-18", "04-May-18"]
})
res = (
test_df
.assign(
# convert to datetime
Dates = lambda x : pd.to_datetime(x.Dates),
# get next rows date
Dates_next = lambda x : x.Dates.shift(-1),
# create the date range
Dates_list = lambda x : x.apply(
lambda x :
pd.date_range(
x.Dates + timedelta(days=1),
x.Dates_next - timedelta(days=1),
freq="D").date.tolist()
if pd.notnull(x.Dates_next)
else None
, axis = 1
))
)
print(res)
results in:
Dates Dates_next Dates_list
0 2018-02-11 2018-02-18 [2018-02-12, 2018-02-13, 2018-02-14, 2018-02-1...
1 2018-02-18 2018-03-03 [2018-02-19, 2018-02-20, 2018-02-21, 2018-02-2...
2 2018-03-03 2018-03-25 [2018-03-04, 2018-03-05, 2018-03-06, 2018-03-0...
3 2018-03-25 2018-03-29 [2018-03-26, 2018-03-27, 2018-03-28]
4 2018-03-29 2018-04-04 [2018-03-30, 2018-03-31, 2018-04-01, 2018-04-0...
5 2018-04-04 2018-04-08 [2018-04-05, 2018-04-06, 2018-04-07]
6 2018-04-08 2018-04-14 [2018-04-09, 2018-04-10, 2018-04-11, 2018-04-1...
7 2018-04-14 2018-04-17 [2018-04-15, 2018-04-16]
8 2018-04-17 2018-04-30 [2018-04-18, 2018-04-19, 2018-04-20, 2018-04-2...
9 2018-04-30 2018-05-04 [2018-05-01, 2018-05-02, 2018-05-03]
10 2018-05-04 NaT None
As a sidenote, if you don't need the last row after the analysis, you could filter out the last row after assigning the next date and eliminate the if statement to make it faster.
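The sidenote above could be sketched roughly as follows. This is an illustration under assumptions, not the answer's own code: the frame is shortened to three dates, the list is built with a plain comprehension instead of apply, and the name one_day is made up here.

```python
import pandas as pd

test_df = pd.DataFrame({"Dates": ["11-Feb-18", "18-Feb-18", "03-Mar-18"]})
one_day = pd.Timedelta(days=1)

res = (
    test_df
    .assign(
        Dates=lambda x: pd.to_datetime(x.Dates, format="%d-%b-%y"),
        Dates_next=lambda x: x.Dates.shift(-1),
    )
    # the last row has no next date (NaT), so drop it up front
    .dropna(subset=["Dates_next"])
)

# no NaT is left at this point, so no null check is needed
res = res.assign(
    Dates_list=[
        pd.date_range(start + one_day, end - one_day, freq="D").date.tolist()
        for start, end in zip(res["Dates"], res["Dates_next"])
    ]
)
print(res)
```

The trade-off is that the final date no longer appears in the result at all, which is only acceptable if that row is not needed afterwards.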
This works with dataframes, adding a new column that contains the requested list.
It iterates over the first column, preparing a list of lists for the second column.
At the end it creates a new dataframe column and assigns the prepared values to it.
import pandas as pd
from datetime import datetime, timedelta
df = pd.read_csv("test.csv")
in_betweens = []
for i in range(len(df["dates"]) - 1):
    # parse the current date and the next one
    d = datetime.strptime(df["dates"][i], "%d-%b-%y")
    d2 = datetime.strptime(df["dates"][i + 1], "%d-%b-%y")
    # collect every day strictly between the two dates
    d = d + timedelta(days=1)
    in_between = []
    while d < d2:
        in_between.append(d.strftime("%d-%b-%y"))
        d = d + timedelta(days=1)
    in_betweens.append(in_between)
# the last date has no successor, so it gets an empty list
in_betweens.append([])
df["in_betweens"] = in_betweens
df.head()
I'm still a novice with Python and I'm having problems trying to group some data to show the record that has the highest (maximum) date. The dataframe is as follows:
...
I am trying the following:
df_2 = df.max(axis = 0)
df_2 = df.periodo.max()
df_2 = df.loc[df.groupby('periodo').periodo.idxmax()]
And it gives me back:
Timestamp('2020-06-01 00:00:00')
periodo 2020-06-01 00:00:00
valor 3.49136
Although the value for 'periodo' is correct, the value for 'valor' is not: I need the corresponding complete record ('periodo' and 'valor'), not the maximum of each column. I have tried other ways but I can't get what I want ...
What do I need to do?
Thank you in advance, I will be attentive to your answers!
Regards!
# import packages we need, seed random number generator
import pandas as pd
import datetime
import random
random.seed(1)
Create example dataframe
# start_date and day_count are not defined in the original post;
# the values here are assumptions chosen to match the date range in the output below
start_date = datetime.datetime(2020, 1, 1)
day_count = 10
dates = [start_date + datetime.timedelta(n) for n in range(day_count)]
values = [random.randint(1,1000) for _ in dates]
df = pd.DataFrame(zip(dates,values),columns=['dates','values'])
i.e. df will be:
dates values
0 2020-01-01 389
1 2020-01-02 808
2 2020-01-03 215
3 2020-01-04 97
4 2020-01-05 500
5 2020-01-06 30
6 2020-01-07 915
7 2020-01-08 856
8 2020-01-09 400
9 2020-01-10 444
Select rows with highest entry in each column
You can do:
df[df['dates'] == df['dates'].max()]
(Or, if you want to use idxmax, you can do: df.loc[[df['dates'].idxmax()]])
Returning:
dates values
9 2020-01-10 444
i.e. this is the row with the latest date.
And:
df[df['values'] == df['values'].max()]
(Or, if you want to use idxmax again, you can do: df.loc[[df['values'].idxmax()]], as in Scott Boston's answer.)
Returning:
dates values
6 2020-01-07 915
i.e. this is the row with the highest value in the values column.
I think you need something like:
df.loc[[df['valor'].idxmax()]]
Where you use idxmax on the 'valor' column. Then use that index to select that row.
MVCE:
import pandas as pd
import numpy as np
np.random.seed(123)
df = pd.DataFrame({'periodo':pd.date_range('2018-07-01', periods = 600, freq='D'),
'valor':np.random.random(600)+3})
df.loc[[df['valor'].idxmax()]]
Output:
periodo valor
474 2019-10-18 3.998918
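The question itself asks for the complete record with the latest 'periodo', and the same idea seems to apply with idxmax on that column instead. A minimal sketch on made-up data (the three rows are hypothetical and not taken from the question):

```python
import pandas as pd

# Hypothetical stand-in for the question's frame, which is not shown in full.
df = pd.DataFrame({
    "periodo": pd.to_datetime(["2020-04-01", "2020-06-01", "2020-05-01"]),
    "valor": [3.5, 3.2, 3.9],
})

# idxmax() returns the index label of the latest date;
# the double brackets keep the result a one-row DataFrame.
latest = df.loc[[df["periodo"].idxmax()]]
print(latest)
```

Here 'valor' comes from the same row as the maximum 'periodo', rather than being the column-wise maximum that df.max(axis=0) returns.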
Data
data = {"account":{"0":383080,"1":383080,"2":383080,"3":412290,"4":412290,"5":412290,"6":412290,"7":412290,"8":218895,"9":218895,"10":218895,"11":218895},"name":{"0":"Will LLC","1":"Will LLC","2":"Will LLC","3":"Jerde-Hilpert","4":"Jerde-Hilpert","5":"Jerde-Hilpert","6":"Jerde-Hilpert","7":"Jerde-Hilpert","8":"Kulas Inc","9":"Kulas Inc","10":"Kulas Inc","11":"Kulas Inc"},"order":{"0":10001,"1":10001,"2":10001,"3":10005,"4":10005,"5":10005,"6":10005,"7":10005,"8":10006,"9":10006,"10":10006,"11":10006},"sku":{"0":"B1-20000","1":"S1-27722","2":"B1-86481","3":"S1-06532","4":"S1-82801","5":"S1-06532","6":"S1-47412","7":"S1-27722","8":"S1-27722","9":"B1-33087","10":"B1-33364","11":"B1-20000"},"quantity":{"0":7,"1":11,"2":3,"3":48,"4":21,"5":9,"6":44,"7":36,"8":32,"9":23,"10":3,"11":-1},"unit price":{"0":33.69,"1":21.12,"2":35.99,"3":55.82,"4":13.62,"5":92.55,"6":78.91,"7":25.42,"8":95.66,"9":22.55,"10":72.3,"11":72.18},"ext price":{"0":235.83,"1":232.32,"2":107.97,"3":2679.36,"4":286.02,"5":832.95,"6":3472.04,"7":915.12,"8":3061.12,"9":518.65,"10":216.9,"11":72.18}}
df = pd.DataFrame(data=data)
Current Solution
sku_total = df.groupby(['order','sku'])['ext price'].sum().rename('sku total').reset_index()
sku_total['sku total'] / sku_total['order'].map(df.groupby('order')['ext price'].sum())
Question
How to divide:
df.groupby(['order','sku'])['ext price'].sum()
by
df.groupby('order')['ext price'].sum()
Without having to reset_index?
Doesn't div do the trick, or am I misunderstanding something?
import pandas as pd
import numpy as np
data = {"account":{"0":383080,"1":383080,"2":383080,"3":412290,"4":412290,"5":412290,"6":412290,"7":412290,"8":218895,"9":218895,"10":218895,"11":218895},"name":{"0":"Will LLC","1":"Will LLC","2":"Will LLC","3":"Jerde-Hilpert","4":"Jerde-Hilpert","5":"Jerde-Hilpert","6":"Jerde-Hilpert","7":"Jerde-Hilpert","8":"Kulas Inc","9":"Kulas Inc","10":"Kulas Inc","11":"Kulas Inc"},"order":{"0":10001,"1":10001,"2":10001,"3":10005,"4":10005,"5":10005,"6":10005,"7":10005,"8":10006,"9":10006,"10":10006,"11":10006},"sku":{"0":"B1-20000","1":"S1-27722","2":"B1-86481","3":"S1-06532","4":"S1-82801","5":"S1-06532","6":"S1-47412","7":"S1-27722","8":"S1-27722","9":"B1-33087","10":"B1-33364","11":"B1-20000"},"quantity":{"0":7,"1":11,"2":3,"3":48,"4":21,"5":9,"6":44,"7":36,"8":32,"9":23,"10":3,"11":-1},"unit price":{"0":33.69,"1":21.12,"2":35.99,"3":55.82,"4":13.62,"5":92.55,"6":78.91,"7":25.42,"8":95.66,"9":22.55,"10":72.3,"11":72.18},"ext price":{"0":235.83,"1":232.32,"2":107.97,"3":2679.36,"4":286.02,"5":832.95,"6":3472.04,"7":915.12,"8":3061.12,"9":518.65,"10":216.9,"11":72.18}}
df = pd.DataFrame(data=data)
print(df)
df_1 = df.groupby(['order','sku'])['ext price'].sum()
df_2 = df.groupby('order')['ext price'].sum()
df_res = df_1.div(df_2)
print(df_res)
Output:
order sku
10001 B1-20000 0.409342
B1-86481 0.187409
S1-27722 0.403249
10005 S1-06532 0.429090
S1-27722 0.111798
S1-47412 0.424170
S1-82801 0.034942
10006 B1-20000 0.018657
B1-33087 0.134058
B1-33364 0.056063
S1-27722 0.791222
Name: ext price, dtype: float64
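If you prefer to make the alignment explicit, Series.div also accepts a level argument that names the MultiIndex level to broadcast over. A minimal sketch on a made-up frame (the numbers are hypothetical, chosen so that the shares are easy to check by hand):

```python
import pandas as pd

# Hypothetical small frame with the same column names as the question.
df = pd.DataFrame({
    "order": [1, 1, 2, 2, 2],
    "sku": ["a", "b", "a", "a", "c"],
    "ext price": [10.0, 30.0, 5.0, 15.0, 20.0],
})

sku_total = df.groupby(["order", "sku"])["ext price"].sum()
order_total = df.groupby("order")["ext price"].sum()

# Divide each (order, sku) total by the total of its order,
# matching on the "order" level of the MultiIndex.
share = sku_total.div(order_total, level="order")
print(share)
```

The result keeps the (order, sku) MultiIndex, so no reset_index is involved.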
IIUC, we can use transform, which lets you do groupby operations while maintaining the original index.
You can then assign the result to a new column if you wish.
s = (df.groupby(['order','sku'])['ext price'].transform('sum')
/ df.groupby('order')['ext price'].transform('sum'))
print(s)
0 0.409342
1 0.403249
2 0.187409
3 0.429090
4 0.034942
5 0.429090
6 0.424170
7 0.111798
8 0.791222
9 0.134058
10 0.056063
11 0.018657
I have a pandas df, and I use between_time with a and b to clean the data. How do I
get a non_between_time behavior?
I know I can try something like:
df.between_time('00:00:00', a)
df.between_time(b, '23:59:59')
and then combine the results and sort the new df. That is very inefficient, and it doesn't work for me because I have data between 23:59:59 and 00:00:00.
Thanks
You could find the index locations for rows with time between a and b, and then use Index.difference (called Index.diff in very old pandas versions) to remove those from the index:
import pandas as pd
import io
text = '''\
date,time, val
20120105, 080000, 1
20120105, 080030, 2
20120105, 080100, 3
20120105, 080130, 4
20120105, 080200, 5
20120105, 235959.01, 6
'''
# io.StringIO for Python 3; the nested parse_dates list combines the date and time columns
# (newer pandas releases deprecate this nested form)
df = pd.read_csv(io.StringIO(text), parse_dates=[[0, 1]], index_col=0)
index = df.index
ivals = index.indexer_between_time('8:01:30','8:02')
print(df.reindex(index.difference(index[ivals])))
yields
val
date_time
2012-01-05 08:00:00 1
2012-01-05 08:00:30 2
2012-01-05 08:01:00 3
2012-01-05 23:59:59.010000 6
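Another way to express the same complement is a boolean mask built from the positions that indexer_between_time returns. This is only a sketch on a small hand-made frame; the names inside, mask and outside are made up for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a DatetimeIndex, similar in shape to the example data.
idx = pd.to_datetime([
    "2012-01-05 08:00:00",
    "2012-01-05 08:01:30",
    "2012-01-05 08:02:00",
    "2012-01-05 23:59:59",
])
df = pd.DataFrame({"val": [1, 2, 3, 4]}, index=idx)

# Integer positions of the rows whose time of day lies in [08:01:30, 08:02:00].
inside = df.index.indexer_between_time("8:01:30", "8:02")

# Start with "keep everything" and switch off the rows inside the window.
mask = np.ones(len(df), dtype=bool)
mask[inside] = False
outside = df[mask]
print(outside)
```

Since the mask works on positions, it keeps the original row order and also copes with duplicate timestamps, which a reindex on labels may not.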