I have a dataframe that looks like this:
        ID      time    city  transport
0        1  10:20:00  London        car
1       20  08:50:20  Berlin  air plane
2       44  21:10:00   Paris      train
3       32  10:24:00    Rome        car
4       56  08:53:10  Berlin  air plane
5       90   21:8:00   Paris      train
.
.
.
1009   446  10:21:24  London        car
I want to group these data so that rows with the same 'city' and 'transport' values whose times differ by at most 3 minutes (+3 min or -3 min) share the same 'ID'.
I already tried pd.Grouper() like this, but it didn't work:
df['time'] = pd.to_datetime(df['time'])
df['ID'] = df.groupby([pd.Grouper(key= 'time',freq ='3min'),'city','transport'])['ID'].transform('first')
The output is the original dataframe, without any changes. One reason could be that pd.to_datetime attaches a date to "time", and because my data is very large the dates differ, so the groupby does not group as intended.
I couldn't figure out how to apply a time interval (+3 min or -3 min) in the groupby without a date being added to the 'time' column.
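A minimal sketch of one way around the date issue (an aside, not a full solution): parsing 'time' with pd.to_timedelta keeps only the time of day, so no date is attached and gaps can be compared directly.
import pandas as pd

# Sketch only: a Timedelta column carries no date, just the time of day.
df = pd.DataFrame({"time": ["10:20:00", "10:21:24", "08:50:20"]})
df["td"] = pd.to_timedelta(df["time"])                      # 0 days 10:20:00, ...
print((df["td"] - df["td"].shift()).abs() <= pd.Timedelta("3min"))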
What I'm expecting is this:
        ID      time    city  transport
0        1  10:20:00  London        car
1       20  08:50:20  Berlin  air plane
2       44  21:10:00   Paris      train
3       32  10:24:00    Rome        car
4       20  08:53:10  Berlin  air plane
5       44   21:8:00   Paris      train
.
.
.
1009     1  10:21:24  London        car
I have been struggling with this question for a while and I would really appreciate any help.
Thanks in advance.
from datetime import datetime

def convert(seconds):
    """Split a number of seconds into (hour, minutes, seconds)."""
    seconds = seconds % (24 * 3600)
    hour = seconds // 3600
    seconds %= 3600
    minutes = seconds // 60
    seconds %= 60
    return hour, minutes, seconds

def get_sec(h, m, s):
    """Get total seconds from hour/minute/second (None counts as 0)."""
    if h is None:
        h = 0
    if m is None:
        m = 0
    if s is None:
        s = 0
    return int(h) * 3600 + int(m) * 60 + int(s)

# Parse the time strings once; values that are already parsed are left as they are.
df['time'] = df['time'].apply(lambda x: datetime.strptime(x, '%H:%M:%S') if isinstance(x, str) else x)
df = df.sort_values(by=["time"])
print(df)

prev_hour = None
prev_minute = None
prev_second = None
prev_id = None
for key, item in df.iterrows():
    curr_hour = item.time.hour
    curr_minute = item.time.minute
    curr_second = item.time.second
    curr_id = item.id
    curr_seconds = get_sec(curr_hour, curr_minute, curr_second)
    prev_seconds = get_sec(prev_hour, prev_minute, prev_second)
    diff_seconds = curr_seconds - prev_seconds
    hour, minute, second = convert(diff_seconds)
    # Rows are sorted by time, so each row is compared with the previous one only.
    # hour == 0 and minute <= 3 accepts gaps of up to 3 min 59 s; city/transport
    # are not checked here.
    if (hour == 0) and (minute <= 3) and (prev_id is not None):
        df.loc[key, 'id'] = prev_id
    prev_hour = item.time.hour
    prev_minute = item.time.minute
    prev_second = item.time.second
    prev_id = item.id
print(df)
output:
   id                time    city  transport
1  20 1900-01-01 08:50:20  Berlin  air plane
4  20 1900-01-01 08:53:10  Berlin  air plane
0   1 1900-01-01 10:20:00  London        car
3  32 1900-01-01 10:24:00    Rome        car
5  90 1900-01-01 21:08:00   Paris      train
2  90 1900-01-01 21:10:00   Paris      train
Exploring pd.Grouper()
I found it useful to insert a start time so that it's more obvious how the buckets are being generated.
Your +/- 3 min requirement is most closely matched by a 6-minute bucket. That mostly matches the requirement, but +/- 3 mins of what?
I have also added a step that just shows what has been grouped and which time bucket each row falls into.
setup
import io
import numpy as np
import pandas as pd

df = pd.read_csv(io.StringIO("""ID      time      city    transport
0    1  10:20:00  London  car
1   20  08:50:20  Berlin  air plane
2   44  21:10:00  Paris   train
3   32  10:24:00  Rome    car
4   56  08:53:10  Berlin  air plane
5   90  21:08:00  Paris   train
6   33  05:08:22  Paris   train"""), sep=r"\s\s+", engine="python")
# force in origin so grouper generates bucket every Xmins from midnight with no seconds...
df = pd.concat([pd.DataFrame({"time":[pd.Timedelta(0)],"dummy":[True]}), df]).assign(dummy=lambda dfa: dfa.dummy.fillna(False))
df = df.assign(td=pd.to_timedelta(df.time))
analysis
### DEBUGGER ### - see what's being grouped...
df.groupby([pd.Grouper(key="td", freq="6min"), "city","transport"]).agg(lambda x: list(x) if len(x)>0 else np.nan).dropna()
You can see that two of the time buckets group more than one ID:
                                                       time                      dummy           ID
(Timedelta('0 days 05:06:00'), 'Paris', 'train')       ['05:08:22']              [False]         [33.0]
(Timedelta('0 days 08:48:00'), 'Berlin', 'air plane')  ['08:50:20', '08:53:10']  [False, False]  [20.0, 56.0]
(Timedelta('0 days 10:18:00'), 'London', 'car')        ['10:20:00']              [False]         [1.0]
(Timedelta('0 days 10:24:00'), 'Rome', 'car')          ['10:24:00']              [False]         [32.0]
(Timedelta('0 days 21:06:00'), 'Paris', 'train')       ['21:10:00', '21:08:00']  [False, False]  [44.0, 90.0]
solution
# finally, group on double the window (6 min). NB this is not a true +/- 3 min of each row, but rows that fall into the same bucket
(df.assign(ID=lambda dfa: dfa
.groupby([pd.Grouper(key= 'td',freq ='6min'),'city','transport'])['ID']
.transform('first'))
# cleanup... NB needs changing if dummy row is not inserted
.query("not dummy")
.drop(columns=["td","dummy"])
.assign(ID=lambda dfa: dfa.ID.astype(int))
)
       time  ID    city  transport
0  10:20:00   1  London        car
1  08:50:20  20  Berlin  air plane
2  21:10:00  44   Paris      train
3  10:24:00  32    Rome        car
4  08:53:10  20  Berlin  air plane
5  21:08:00  44   Paris      train
6  05:08:22  33   Paris      train
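If a true chained +/- 3 min rule is wanted rather than fixed buckets, one possible sketch (an interpretation of the requirement, using the question's column names) is to sort within each (city, transport) group and start a new cluster whenever the gap to the previous row exceeds 3 minutes:
import pandas as pd

df = pd.DataFrame({
    "ID":        [1, 20, 44, 32, 56, 90],
    "time":      ["10:20:00", "08:50:20", "21:10:00", "10:24:00", "08:53:10", "21:08:00"],
    "city":      ["London", "Berlin", "Paris", "Rome", "Berlin", "Paris"],
    "transport": ["car", "air plane", "train", "car", "air plane", "train"],
})

df["td"] = pd.to_timedelta(df["time"])
df = df.sort_values(["city", "transport", "td"])

# Start a new cluster whenever the gap to the previous row of the same
# (city, transport) group exceeds 3 minutes; chained small gaps stay together.
new_cluster = df.groupby(["city", "transport"])["td"].diff() > pd.Timedelta("3min")
cluster = new_cluster.cumsum()

df["ID"] = df.groupby(["city", "transport", cluster])["ID"].transform("first")
print(df.sort_index().drop(columns="td"))

Which ID labels a cluster depends on row order (here the earliest time in each cluster wins), and chaining means a long run of rows, each within 3 minutes of the previous one, ends up in a single cluster.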
Related
A      B   C      Time             D
1  sandy  12  02:30:24    California
2  sandy  22  01:24:06    California
3  sunny   8  05:03:52  Rhode Island
4  sunny  32  07:03:25  Rhode Island
Required output
A B C Time D
1 sandy 12 02:30:24 California
2 22 01:24:06 California
sandy Total 34 01:57:15
3 sunny 8 05:03:52 Rhode Island
4 32 07:03:25 Rhode Island
sunny Total 40 06:03:38
Total 74 04:00:27
I want to add a total of the numeric columns at the end of each group, plus the average of the time column (I have two time columns in my actual data).
You can generate the Total lines with .groupby() + .agg() and .assign(), and the Grand Total line with pd.Series(). Then append them to the original df with .append(), followed by .sort_index() to sort rows with the same column B value ('sandy', 'sunny') back together:
df_total = (df.assign(Time=pd.to_timedelta(df['Time']))
.groupby('B')[['C', 'Time']]
.agg({'C': 'sum', 'Time': lambda x: str(x.mean().round('1s')).split()[-1]})
.assign(A='Total: ', D='')
)
df_grand_total = pd.Series({'A': '',
'C': df['C'].sum(),
'Time': str(pd.to_timedelta(df['Time']).mean().round('1s')).split()[-1],
'D': ''},
name='~Grand Total:')
df_final = (df.set_index('B')
.append(df_total)
.append(df_grand_total)
.sort_index()
.reset_index()
)
Result:
print(df_total)
C Time A D
B
sandy 34 01:57:15 Total:
sunny 40 06:03:38 Total:
print(df_grand_total)
A
C 74
Time 04:00:27
D
Name: ~Grand Total:, dtype: object
print(df_final)
B A C Time D
0 sandy 1 12 02:30:24 California
1 sandy 2 22 01:24:06 California
2 sandy Total: 34 01:57:15
3 sunny 3 8 05:03:52 Rhode Island
4 sunny 4 32 07:03:25 Rhode Island
5 sunny Total: 40 06:03:38
6 ~Grand Total: 74 04:00:27
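A side note: DataFrame.append() was removed in pandas 2.0. A minimal sketch of the same assembly with pd.concat(), assuming df, df_total and df_grand_total exactly as built above:
import pandas as pd

# df, df_total and df_grand_total are the objects constructed above.
df_final = (pd.concat([df.set_index('B'),
                       df_total,
                       df_grand_total.to_frame().T])   # Series -> one-row frame
            .rename_axis('B')                          # keep 'B' as the index name
            .sort_index()
            .reset_index())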
Recently I have been working with this data set:
import pandas as pd
data = {'Product':['Box','Bottles','Pen','Markers','Bottles','Pen','Markers','Bottles','Box','Markers','Markers','Pen'],
'State':['Alaska','California','Texas','North Carolina','California','Texas','Alaska','Texas','North Carolina','Alaska','California','Texas'],
'Sales':[14,24,31,12,13,7,9,31,18,16,18,14]}
df1=pd.DataFrame(data, columns=['Product','State','Sales'])
df1
I want to find the 3 groups that have the highest sales
grouped_df1 = df1.groupby('State')
grouped_df1.apply(lambda x: x.sort_values(by = 'Sales', ascending=False))
So I have a dataframe like this
Now, I want to find the top 3 States with the highest sales.
I tried to use
grouped_df1.apply(lambda x: x.sort_values(by = 'Sales', ascending=False)).head(3)
# It gives me the first three rows
grouped_df1.apply(lambda x: x.sort_values(by = 'Sales', ascending=False)).max()
#It only gives me the maximum value
The expected result should be:
Texas: 31
California: 24
North Carolina: 18
How can I fix this? Sometimes a single State can hold all of the top sales rows (for example, Alaska might have the 3 highest sales), so if I simply sort and take the first three rows they are all Alaska and the other two groups are never found.
Many thanks!
You could add a new column called Sales_Max_For_State and then use drop_duplicates and nlargest:
>>> df1['Sales_Max_For_State'] = df1.groupby(['State'])['Sales'].transform(max)
>>> df1
    Product           State  Sales  Sales_Max_For_State
0       Box          Alaska     14                   16
1   Bottles      California     24                   24
2       Pen           Texas     31                   31
3   Markers  North Carolina     12                   18
4   Bottles      California     13                   24
5       Pen           Texas      7                   31
6   Markers          Alaska      9                   16
7   Bottles           Texas     31                   31
8       Box  North Carolina     18                   18
9   Markers          Alaska     16                   16
10  Markers      California     18                   24
11      Pen           Texas     14                   31
>>> df2 = df1.drop_duplicates(['Sales_Max_For_State']).nlargest(3, 'Sales_Max_For_State')[['State', 'Sales_Max_For_State']]
>>> df2
State Sales_Max_For_State
2 Texas 31
1 California 24
3 North Carolina 18
I think there are a few ways to do this:
1- df1.groupby('State').agg({'Sales': 'max'}).sort_values(by='Sales', ascending=False).iloc[:3]
2- df1.groupby('State').agg({'Sales': 'max'})['Sales'].nlargest(3)
Sales
State
Texas 31
California 24
North Carolina 18
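Another equivalent sketch, assuming df1 as defined in the question: sort by Sales, keep each State's best row, then take the top three, which prevents one State from filling all three slots.
# df1 is the DataFrame built in the question.
top3 = (df1.sort_values('Sales', ascending=False)   # best rows first
           .drop_duplicates('State')                # keep only each State's max
           .head(3)[['State', 'Sales']])
print(top3)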
My first data frame
product=pd.DataFrame({
'Product_ID':[101,102,103,104,105,106,107,101],
'Product_name':['Watch','Bag','Shoes','Smartphone','Books','Oil','Laptop','New Watch'],
'Category':['Fashion','Fashion','Fashion','Electronics','Study','Grocery','Electronics','Electronics'],
'Price':[299.0,1350.50,2999.0,14999.0,145.0,110.0,79999.0,9898.0],
'Seller_City':['Delhi','Mumbai','Chennai','Kolkata','Delhi','Chennai','Bengalore','New York']
})
My 2nd data frame has transactions
customer=pd.DataFrame({
'id':[1,2,3,4,5,6,7,8,9],
'name':['Olivia','Aditya','Cory','Isabell','Dominic','Tyler','Samuel','Daniel','Jeremy'],
'age':[20,25,15,10,30,65,35,18,23],
'Product_ID':[101,0,106,0,103,104,0,0,107],
'Purchased_Product':['Watch','NA','Oil','NA','Shoes','Smartphone','NA','NA','Laptop'],
'City':['Mumbai','Delhi','Bangalore','Chennai','Chennai','Delhi','Kolkata','Delhi','Mumbai']
})
I want the Price from the first data frame to appear in the merged dataframe, the common key being 'Product_ID'. Note that Product_ID 101 has two prices, 299.00 and 9898.00. I want the latter, 9898.0, to appear in the merged data set, since it is the latest price.
Currently my code is not giving the right answer; it returns both rows:
customerpur = pd.merge(customer,product[['Price','Product_ID']], on="Product_ID", how = "left")
customerpur
id name age Product_ID Purchased_Product City Price
0 1 Olivia 20 101 Watch Mumbai 299.0
1 1 Olivia 20 101 Watch Mumbai 9898.0
There is no explicit timestamp, so I assume the row order of the dataframe reflects recency. You can drop duplicates at the end:
customerpur.drop_duplicates(subset = ['id'], keep = 'last')
result:
id name age Product_ID Purchased_Product City Price
1 1 Olivia 20 101 Watch Mumbai 9898.0
2 2 Aditya 25 0 NA Delhi NaN
3 3 Cory 15 106 Oil Bangalore 110.0
4 4 Isabell 10 0 NA Chennai NaN
5 5 Dominic 30 103 Shoes Chennai 2999.0
6 6 Tyler 65 104 Smartphone Delhi 14999.0
7 7 Samuel 35 0 NA Kolkata NaN
8 8 Daniel 18 0 NA Delhi NaN
9 9 Jeremy 23 107 Laptop Mumbai 79999.0
Please note the keep='last' argument, since we are keeping only the last price registered.
Deduplication should be done before merging if you care about performance or the dataset is huge:
product = product.drop_duplicates(subset = ['Product_ID'], keep = 'last')
In your data frame there is no indicator of the latest entry, so you might need to first remove the first entry for Product_ID 101 from the product dataframe as follows:
result_product = product.drop_duplicates(subset=['Product_ID'], keep='last')
It will keep the last entry based on Product_ID and you can do the merge as:
pd.merge(result_product, customer, on='Product_ID')
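A combined sketch of both suggestions, assuming product and customer as defined in the question; the left merge keeps customers that have no matching Product_ID:
# product and customer are the DataFrames defined in the question.
latest_product = product.drop_duplicates(subset=['Product_ID'], keep='last')
customerpur = customer.merge(latest_product[['Product_ID', 'Price']],
                             on='Product_ID', how='left')
print(customerpur)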
I ran the code below in a Jupyter Notebook. I was expecting the output to appear like an Excel table, but instead the output was split up and not shown as a table. How can I get it to show up in table format?
import pandas as pd
from matplotlib import pyplot as plt
df = pd.read_csv("Robbery_2014_to_2019.csv")
print(df.head())
Output:
X Y Index_ event_unique_id occurrencedate \
0 -79.270393 43.807190 17430 GO-2015134200 2015-01-23T14:52:00.000Z
1 -79.488281 43.764091 19205 GO-20142956833 2014-09-21T23:30:00.000Z
2 -79.215836 43.761856 15831 GO-2015928336 2015-03-23T11:30:00.000Z
3 -79.436264 43.642963 16727 GO-20142711563 2014-08-15T22:00:00.000Z
4 -79.369461 43.654526 20091 GO-20142492469 2014-07-12T19:00:00.000Z
reporteddate premisetype ucr_code ucr_ext \
0 2015-01-23T14:57:00.000Z Outside 1610 210
1 2014-09-21T23:37:00.000Z Outside 1610 200
2 2015-06-03T15:08:00.000Z Other 1610 220
3 2014-08-16T00:09:00.000Z Apartment 1610 200
4 2014-07-14T01:35:00.000Z Apartment 1610 100
offence ... occurrencedayofyear occurrencedayofweek \
0 Robbery - Business ... 23.0 Friday
1 Robbery - Mugging ... 264.0 Sunday
2 Robbery - Other ... 82.0 Monday
3 Robbery - Mugging ... 227.0 Friday
4 Robbery With Weapon ... 193.0 Saturday
occurrencehour MCI Division Hood_ID Neighbourhood \
0 14 Robbery D42 129 Agincourt North (129)
1 23 Robbery D31 27 York University Heights (27)
2 11 Robbery D43 137 Woburn (137)
3 22 Robbery D11 86 Roncesvalles (86)
4 19 Robbery D51 73 Moss Park (73)
Long Lat ObjectId
0 -79.270393 43.807190 2001
1 -79.488281 43.764091 2002
2 -79.215836 43.761856 2003
3 -79.436264 43.642963 2004
4 -79.369461 43.654526 2005
[5 rows x 29 columns]
Use display(df.head()) (it produces slightly nicer output than without display()).
The print function renders any kind of information as plain text, such as a string or a computed value, whereas display() renders the DataFrame as a formatted HTML table in the notebook.
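A minimal sketch of both routes, assuming the CSV path from the question:
import pandas as pd
from IPython.display import display

df = pd.read_csv("Robbery_2014_to_2019.csv")

# In a notebook, displaying the frame (rather than printing it) renders an HTML table.
display(df.head())                 # or simply end the cell with df.head()

# If you prefer print(), widen the text output so columns are not wrapped across lines.
pd.set_option("display.max_columns", None)
pd.set_option("display.width", 200)
print(df.head())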
Background
I have five years of NO2 measurement data in CSV files, one file for every location and year. I have loaded all the files into pandas dataframes with the same format:
Date Hour Location NO2_Level
0 01/01/2016 00 Street 18
1 01/01/2016 01 Street 39
2 01/01/2016 02 Street 129
3 01/01/2016 03 Street 76
4 01/01/2016 04 Street 40
Goal
For each dataframe, count the number of times NO2_Level is greater than 150 and output this.
I wrote a loop that creates all the dataframes from the right directories and cleans them appropriately.
Problem
Whatever I've tried produces results I know on inspection are incorrect, e.g.:
- the count value for every location in a given year is the same (possible but unlikely)
- for a year where I know the count should be positive, every location returns 0
What I've tried
I have tried a lot of approaches to getting this value for each dataframe, such as making the column a series:
NO2_Level = pd.Series(df['NO2_Level'])
count = (NO2_Level > 150).sum()
Using DataFrame.count():
count = df[df['NO2_Level'] >= 150].count()
These two approaches have come closest to what I want to output.
Example to test on
data = {'Date': ['01/01/2016', '01/02/2016', '01/03/2016', '01/04/2016', '01/05/2016'], 'Hour': ['00', '01', '02', '03', '04'], 'Location': ['Street', 'Street', 'Street', 'Street', 'Street'], 'NO2_Level': [18, 39, 129, 76, 40]}
df = pd.DataFrame(data=data)
NO2_Level = pd.Series(df['NO2_Level'])
count = (NO2_Level > 150).sum()
count
Expected Outputs
From this, I'm trying to get a single output line for each dataframe, in the format Location, year, count (of the condition):
Kirkstall Road,2013,47
Haslewood Close,2013,97
...
Jack Lane Hunslet,2015,158
So the above example would produce
Street, 2016, 1
Actual
Every year produces the same result for each location, and for some years (2014) the count doesn't seem to work at all, even though on inspection there should be positive counts:
Kirkstall Road,2013,47
Haslewood Close,2013,47
Tilbury Terrace,2013,47
Corn Exchange,2013,47
Temple Newsam,2014,0
Queen Street Morley,2014,0
Corn Exchange,2014,0
Tilbury Terrace,2014,0
Haslewood Close,2015,43
Tilbury Terrace,2015,43
Corn Exchange,2015,43
Jack Lane Hunslet,2015,43
Norman Rows,2015,43
Hopefully this helps.
import pandas as pd
ddict = {
'Date':['2016-01-01','2016-01-01','2016-01-01','2016-01-01','2016-01-01','2016-01-02',],
'Hour':['00','01','02','03','04','02'],
'Location':['Street','Street','Street','Street','Street','Street',],
'N02_Level':[19,39,129,76,40, 151],
}
df = pd.DataFrame(ddict)
# Convert dates to datetime
df['Date'] = pd.to_datetime(df['Date'])
# Make a Year column
df['Year'] = df['Date'].apply(lambda x: x.strftime('%Y'))
# Group by location and year, count rows where N02_Level > 150
df1 = df[df['N02_Level'] > 150].groupby(['Location','Year']).size().reset_index(name='Count')
# Iterate the results
for i in range(len(df1)):
    loc = df1['Location'][i]
    yr = df1['Year'][i]
    cnt = df1['Count'][i]
    print(f'{loc},{yr},{cnt}')
### To not use f-strings
for i in range(len(df1)):
    print('{loc},{yr},{cnt}'.format(loc=df1['Location'][i], yr=df1['Year'][i], cnt=df1['Count'][i]))
Sample data:
Date Hour Location N02_Level
0 2016-01-01 00 Street 19
1 2016-01-01 01 Street 39
2 2016-01-01 02 Street 129
3 2016-01-01 03 Street 76
4 2016-01-01 04 Street 40
5 2016-01-02 02 Street 151
Output:
Street,2016,1
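A shorter alternative to the explicit loops, assuming df1 exactly as built above: let pandas format the comma-separated lines directly.
# df1 is the grouped result built above (columns Location, Year, Count).
print(df1.to_csv(index=False, header=False), end='')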
Here is a solution with a randomly generated sample:
import numpy as np
import pandas as pd

def random_dates(start, end, n):
    start_u = start.value // 10 ** 9
    end_u = end.value // 10 ** 9
    return pd.to_datetime(np.random.randint(start_u, end_u, n), unit='s')

location = ['street', 'avenue', 'road', 'town', 'campaign']
df = pd.DataFrame({'Date': random_dates(pd.to_datetime('2015-01-01'), pd.to_datetime('2018-12-31'), 20),
                   'Location': np.random.choice(location, 20),
                   'NOE_level': np.random.randint(low=130, high=200, size=20)})
#Keep only year for Date
df['Date'] = df['Date'].dt.strftime("%Y")
print(df)
df = df.groupby(['Location', 'Date'])['NOE_level'].apply(lambda x: (x>150).sum()).reset_index(name='count')
print(df)
Example df generated:
Date Location NOE_level
0 2018 town 191
1 2017 campaign 187
2 2017 town 137
3 2016 avenue 148
4 2017 campaign 195
5 2018 town 181
6 2018 road 187
7 2018 town 184
8 2016 town 155
9 2016 street 183
10 2018 road 136
11 2017 road 171
12 2018 street 165
13 2015 avenue 193
14 2016 campaign 170
15 2016 street 132
16 2016 campaign 165
17 2015 road 161
18 2018 road 161
19 2015 road 140
output:
Location Date count
0 avenue 2015 1
1 avenue 2016 0
2 campaign 2016 2
3 campaign 2017 2
4 road 2015 1
5 road 2017 1
6 road 2018 2
7 street 2016 1
8 street 2018 1
9 town 2016 1
10 town 2017 0
11 town 2018 3