I have a pandas df, and I use between_time(a, b) to clean the data. How do I
get the opposite, non-between_time behavior?
I know I can try something like:
df.between_time('00:00:00', a)
df.between_time(b, '23:59:59')
then combine the two results and sort the new df. But that is inefficient, and it doesn't work for me since I have data between 23:59:59 and 00:00:00.
Thanks
You could find the index locations of the rows with time between a and b, and then use Index.difference to remove those from the index:
import pandas as pd
import io
text = '''\
date,time, val
20120105, 080000, 1
20120105, 080030, 2
20120105, 080100, 3
20120105, 080130, 4
20120105, 080200, 5
20120105, 235959.01, 6
'''
df = pd.read_csv(io.StringIO(text), parse_dates=[[0, 1]], index_col=0, skipinitialspace=True)
index = df.index
ivals = index.indexer_between_time('8:01:30','8:02')
print(df.reindex(index.difference(index[ivals])))
yields
val
date_time
2012-01-05 08:00:00 1
2012-01-05 08:00:30 2
2012-01-05 08:01:00 3
2012-01-05 23:59:59.010000 6
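If you prefer to stay positional, a more compact sketch (assuming df has a DatetimeIndex, as above) builds a boolean mask from the same indexer_between_time call and inverts it:
import numpy as np
# keep every row whose position is NOT returned by indexer_between_time
mask = np.ones(len(df), dtype=bool)
mask[df.index.indexer_between_time('8:01:30', '8:02')] = False
print(df[mask])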
This is some sample data:
import pandas as pd
import numpy as np
from datetime import date
data = pd.DataFrame([(date(2022,1,1), np.random.randint(10, size=30)),
                     (date(2022,2,1), np.random.randint(10, size=30)),
                     (date(2022,3,1), np.random.randint(10, size=30))],
                    columns=('month_begin','daily_sales'))
I want to (1) create a column filled with each day (so the column would hold 2022-01-01, 2022-01-02, ..., 2022-03-31); and (2) break the array-like column into one value per row.
I was thinking about creating a list of days from 2022-01-01 to 2022-03-01, but I was stuck on how to fill each row with the daily data. Any suggestion is appreciated!
import pandas as pd
import numpy as np
from datetime import date
data = pd.DataFrame([(date(2022,1,1), np.random.randint(10, size=30)),
                     (date(2022,2,1), np.random.randint(10, size=30)),
                     (date(2022,3,1), np.random.randint(10, size=30))],
                    columns=('month_begin','daily_sales'))
result = pd.DataFrame()
for index, row in data.iterrows():
    # build one frame per month: one date per daily value, then append it to result
    df = pd.DataFrame({'date': pd.date_range(row['month_begin'],
                                             periods=len(row['daily_sales'])),
                       'daily_sales': row['daily_sales']})
    result = pd.concat([result, df], ignore_index=True)
print(result)
Output:
date daily_sales
0 2022-01-01 9
1 2022-01-02 7
2 2022-01-03 7
3 2022-01-04 8
4 2022-01-05 4
.. ... ...
85 2022-03-26 3
86 2022-03-27 1
87 2022-03-28 9
88 2022-03-29 7
89 2022-03-30 0
[90 rows x 2 columns]
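If you would rather avoid the Python-level loop, a vectorized sketch (assuming pandas >= 0.25 for DataFrame.explode) expands the arrays first and then derives each date from the row's position within its month:
# explode turns each 30-element array into 30 rows sharing the same month_begin
out = data.explode('daily_sales').reset_index(drop=True)
# the date is the month start plus the row's position within its month (0..29 days)
out['date'] = (pd.to_datetime(out['month_begin'])
               + pd.to_timedelta(out.groupby('month_begin').cumcount(), unit='D'))
result = out[['date', 'daily_sales']]
print(result)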
I have some data I want to count by month. The column I want to count has three different possible values, each representing a different car sold. Here is an example of my dataframe:
Date Type_Car_Sold
2015-01-01 00:00:00 2
2015-01-01 00:00:00 1
2015-01-01 00:00:00 1
2015-01-01 00:00:00 3
... ...
I want a dataframe that counts each specific car type sold per month, like this:
Month Car_Type_1 Car_Type_2 Car_Type_3 Total_Cars_Sold
1 15 12 17 44
2 9 18 20 47
... ... ... ... ...
How exactly would I go about doing this? I've tried doing:
cars_sold = car_data['Type_Car_Sold'].groupby(car_data.Date.dt.month).agg('count')
but that just sums up all the cars sold in the month, rather than breaking it down by the number of each type sold. Any thoughts?
Maybe not the cleanest solution, but this should get you pretty close:
import pandas as pd
from datetime import datetime
df = pd.DataFrame({
    "Date": [datetime(2022,1,1), datetime(2022,1,1), datetime(2022,2,1), datetime(2022,2,1)],
    "Type": [1, 2, 1, 1],
})
df['Date'] = df["Date"].dt.to_period('M')
df['Value'] = 1
print(pd.pivot_table(df, values='Value', index=['Date'], columns=['Type'], aggfunc='count'))
Type 1 2
Date
2022-01 1.0 1.0
2022-02 2.0 NaN
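If the NaN for missing month/type combinations bothers you, pivot_table also takes a fill_value argument; the same call with fill_value=0 reports zero counts instead:
print(pd.pivot_table(df, values='Value', index=['Date'], columns=['Type'],
                     aggfunc='count', fill_value=0))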
Alternatively you can also pass multiple columns to groupby:
import pandas as pd
from datetime import datetime
df = pd.DataFrame({
    "Date": [datetime(2022,1,1), datetime(2022,1,1), datetime(2022,2,1), datetime(2022,2,1)],
    "Type": [1, 2, 1, 1],
})
df['Date'] = df["Date"].dt.to_period('M')
df.groupby(['Date', 'Type']).size()
Date Type
2022-01 1 1
2 1
2022-02 1 2
dtype: int64
This seems to have the unfortunate side effect of excluding keys with zero value. Also the result is multiindexed rows rather than having the index as rows+columns.
For more information on this approach, check this question.
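One way to work around both caveats is to unstack the Type level and fill the gaps with 0, which gives months as rows and car types as columns (a sketch built on the same groupby):
counts = df.groupby(['Date', 'Type']).size().unstack('Type', fill_value=0)
print(counts)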
I am new to pandas and I am struggling to add dates to my pandas DataFrame df, which comes from a .csv file. The DataFrame has several unique ids, and each id has 120 months; I need to add a date column. Each id should get exactly the same 120 dates. I am struggling because after the first id comes another id, and the dates should start over again. My data in the csv file looks like this:
month id
1 1593
2 1593
...
120 1593
1 8964
2 8964
...
120 8964
1 58944
...
Here is my code; I am not really sure how I should use the groupby method to add dates to my dataframe based on id:
group=df.groupby('id')
group['date']=pd.date_range(start='2020/6/1', periods=120, freq='MS').shift(14,freq='D')
Please help me!!!
If you know how many sets of 120 you have, you can use this. Just change the 2 at the end. This example repeats the 120 dates twice. You may have to adapt it for your specific use.
new_dates = list(pd.date_range(start='2020/6/1', periods=120, freq='MS').shift(14,freq='D'))*2
df = pd.DataFrame({'date': new_dates})
These two are the same, except one uses a lambda:
def repeatingDates(numIds):
    return [d.strftime('%Y/%m/%d')
            for d in pandas.date_range(start='2020/6/1', periods=120, freq='MS')] * numIds

repeatingDates = lambda numIds: [d.strftime('%Y/%m/%d')
                                 for d in pandas.date_range(start='2020/6/1', periods=120, freq='MS')] * numIds
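A usage sketch (assuming pandas is imported as pandas, every id really has exactly 120 rows, and the rows for each id are contiguous as in the sample data; df['id'].nunique() is just one way to count the ids):
num_ids = df['id'].nunique()          # number of distinct ids in the frame
df['date'] = repeatingDates(num_ids)  # one block of 120 dates repeated per id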
You can use Pandas transform. This is how I solved it:
dataf['dates'] = (dataf
                  .groupby("id")
                  .transform(lambda d: pd.date_range(start='2020/6/1', periods=d.max(),
                                                     freq='MS').shift(14, freq='D'))
                  )
Results:
month id dates
0 1 1593 2020-06-15
1 2 1593 2020-07-15
2 3 1593 2020-08-15
3 1 8964 2020-06-15
4 2 8964 2020-07-15
5 1 58944 2020-06-15
6 2 58944 2020-07-15
7 3 58944 2020-08-15
8 4 58944 2020-09-15
Test data:
import io
import pandas as pd
dataf = pd.read_csv(io.StringIO("""
month,id
1,1593
2,1593
3,1593
1,8964
2,8964
1,58944
2,58944
3,58944
4,58944""")).astype(int)
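An alternative sketch that avoids transform entirely: precompute the 120 shifted dates once and pick one per row using the row's position within its id group (assumes no id has more than 120 rows):
all_dates = pd.date_range(start='2020/6/1', periods=120, freq='MS').shift(14, freq='D')
dataf['dates'] = dataf.groupby('id').cumcount().map(lambda i: all_dates[i])
print(dataf)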
I'm still a novice with Python and I'm having problems trying to group some data to show the record that has the highest (maximum) date. The dataframe is as follows:
...
I am trying the following:
df_2 = df.max(axis = 0)
df_2 = df.periodo.max()
df_2 = df.loc[df.groupby('periodo').periodo.idxmax()]
And it gives me back:
Timestamp('2020-06-01 00:00:00')
periodo 2020-06-01 00:00:00
valor 3.49136
Although the value for 'periodo' is correct, the one for 'valor' is not, since I need the complete corresponding record ('periodo' and 'valor'), not the maximum of each column separately. I have tried other ways but I can't get what I want.
What do I need to do?
Thank you in advance, I will be attentive to your answers!
Regards!
# import packages we need, seed random number generator
import pandas as pd
import datetime
import random
random.seed(1)
Create example dataframe
start_date = datetime.date(2020, 1, 1)  # example start date (matches the output below)
day_count = 10                          # number of example days (matches the output below)
dates = [start_date + datetime.timedelta(n) for n in range(day_count)]
values = [random.randint(1,1000) for _ in dates]
df = pd.DataFrame(zip(dates,values),columns=['dates','values'])
ie df will be:
dates values
0 2020-01-01 389
1 2020-01-02 808
2 2020-01-03 215
3 2020-01-04 97
4 2020-01-05 500
5 2020-01-06 30
6 2020-01-07 915
7 2020-01-08 856
8 2020-01-09 400
9 2020-01-10 444
Select rows with highest entry in each column
You can do:
df[df['dates'] == df['dates'].max()]
(Or, if you want to use idxmax, you can do: df.loc[[df['dates'].idxmax()]])
Returning:
dates values
9 2020-01-10 444
i.e. this is the row with the latest date.
And:
df[df['values'] == df['values'].max()]
(Or, if you want to use idxmax again, you can do: df.loc[[df['values'].idxmax()]], as in Scott Boston's answer.)
returning:
dates values
6 2020-01-07 915
i.e. this is the row with the highest value in the values column.
I think you need something like:
df.loc[[df['valor'].idxmax()]]
Where you use idxmax on the 'valor' column. Then use that index to select that row.
MVCE:
import pandas as pd
import numpy as np
np.random.seed(123)
df = pd.DataFrame({'periodo': pd.date_range('2018-07-01', periods=600, freq='d'),
                   'valor': np.random.random(600) + 3})
df.loc[[df['valor'].idxmax()]]
Output:
periodo valor
474 2019-10-18 3.998918
So... I have a Dataframe that looks like this, but much larger:
DATE ITEM STORE STOCK
0 2018-06-06 A L001 4
1 2018-06-06 A L002 0
2 2018-06-06 A L003 4
3 2018-06-06 B L001 1
4 2018-06-06 B L002 2
You can reproduce the same DataFrame with the following code:
import pandas as pd
import numpy as np
import itertools as it
lojas = ['L001', 'L002', 'L003']
itens = list("ABC")
dr = pd.date_range(start='2018-06-06', end='2018-06-12')
df = pd.DataFrame(data=list(it.product(dr, itens, lojas)), columns=['DATE', 'ITEM', 'STORE'])
df['STOCK'] = np.random.randint(0,5, size=len(df.ITEM))
I want to calculate the STOCK difference between days for every ITEM-STORE pair. Iterating over the groups in a groupby object and using the .diff() function gives something like this:
DATE ITEM STORE STOCK DELTA
0 2018-06-06 A L001 4 NaN
9 2018-06-07 A L001 0 -4.0
18 2018-06-08 A L001 4 4.0
27 2018-06-09 A L001 0 -4.0
36 2018-06-10 A L001 3 3.0
45 2018-06-11 A L001 2 -1.0
54 2018-06-12 A L001 2 0.0
I've managed to do so with the following code:
gg = df.groupby([df.ITEM, df.STORE])
lg = []
for (name, group) in gg:
    aux = group.copy()
    aux.reset_index(drop=True, inplace=True)
    aux['DELTA'] = aux.STOCK.diff().fillna(value=0)
    lg.append(aux)
df = pd.concat(lg)
But on a large DataFrame this becomes impractical. Is there a faster, more Pythonic way to do this task?
I've tried to improve your groupby code, so this should be a lot faster.
v = df.groupby(['ITEM', 'STORE'], sort=False).STOCK.diff()
df['DELTA'] = np.where(np.isnan(v), 0, v)
Some pointers/ideas here:
Don't iterate over groups
Don't pass series as the groupers if the series belong to the same DataFrame. Pass string labels instead.
diff can be vectorized
The last line is tantamount to a fillna, but fillna is slower than np.where
Specifying sort=False will prevent the output from being sorted by grouper keys, improving performance further
This can also be re-written as
df['DELTA'] = df.groupby(['ITEM', 'STORE'], sort=False).STOCK.diff().fillna(0)