I have a big data file that I'm trying to extract data from. It has two columns, Start Time and Date. I would like to display each Date, followed by each Start Time, followed by a count of how often that start time occurs on that date. So the output would look like this:
Date Start Time
30/12/2021 15:00 2
30/12/2021 16:00 6
30/12/2021 17:00 3
This is what I've tried:
df = pd.read_excel(xls)
counter = df['Start Time'].value_counts()
date_counter = df['Date'].value_counts()
total = (df['Start Time']).groupby(df['Date']).sum()
pd.set_option("display.max_rows", None, "display.max_columns", None)
print(total)
input()
But this outputs like this:
Date Start Time
30/12/2021 15:0016:0016:0017:0018:0018:00
Any suggestions are much appreciated!
You're only grouping by one column, and calling sum() on strings concatenates them. Group by both columns and count the rows in each group using size():
df.groupby(['Date', 'Start Time']).size()
Alternatively, you can call value_counts() on both columns:
counts = df[['Date','Start Time']].value_counts()
For this input:
Date Start Time
0 30/12/21 15:00
1 30/12/21 16:00
2 31/12/21 15:00
3 30/12/21 15:00
4 31/12/21 16:00
5 30/12/21 18:00
6 30/12/21 13:00
7 31/12/21 15:00
it returns:
Date Start Time
31/12/21 15:00 2
30/12/21 15:00 2
31/12/21 16:00 1
30/12/21 18:00 1
16:00 1
13:00 1
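As a minimal sketch of the groupby approach, using the sample input above: the grouped sizes can be turned back into a regular table with reset_index. The column name "count" is my own choice, not something from the question.

```python
import pandas as pd

# Sample input copied from the answer above
df = pd.DataFrame({
    "Date": ["30/12/21", "30/12/21", "31/12/21", "30/12/21",
             "31/12/21", "30/12/21", "30/12/21", "31/12/21"],
    "Start Time": ["15:00", "16:00", "15:00", "15:00",
                   "16:00", "18:00", "13:00", "15:00"],
})

# Count rows per (Date, Start Time) pair; groupby sorts the keys by default
counts = (
    df.groupby(["Date", "Start Time"])
      .size()
      .reset_index(name="count")  # "count" is an illustrative column name
)
print(counts)
```

Unlike value_counts(), which orders rows by frequency, this keeps the rows ordered by Date and then Start Time.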
I'm using .weekday() to find the day of the week as an integer (Monday = 0 ... Sunday = 6) for every day from today until next year (+365 days from today). The problem is that if the 1st of the month falls mid-week, I need the numbering for that first week to start from the 1st, so that the 1st day of the month = 0.
Ex. If the month starts Wednesday then Wednesday = 0... Sunday = 4 (for that week only).
[Image: annotated picture of a month explaining what I want to do]
I originally had the code below, but it is wrong because the first branch always runs, regardless of the date.
import datetime
from datetime import date

for day in range(1, 365):
    departure_date = date.today() + datetime.timedelta(days=day)
    if departure_date.weekday() < 7:
        day_of_week = departure_date.day
    else:
        day_of_week = departure_date.weekday()
The following seems to do the job properly:
import datetime as dt

def custom_weekday(date):
    if date.weekday() > (date.day - 1):
        return date.day - 1
    else:
        return date.weekday()

for day in range(1, 366):
    departure_date = dt.date.today() + dt.timedelta(days=day)
    day_of_week = custom_weekday(date=departure_date)
    print(departure_date, day_of_week, departure_date.weekday())
Your code had two small bugs:
the if condition was wrong: date.weekday() is always less than 7, so the else branch never ran
days are represented inconsistently: date.weekday() is 0-based, date.day is 1-based
For every date, get the first week of that month. Then, check if the date is within that first week. If it is, use the .day - 1 value (since you are 0-based). Otherwise, use the .weekday().
from datetime import date, datetime, timedelta

for day in range(-5, 40):
    departure_date = date.today() + timedelta(days=day)
    first_week = date(departure_date.year, departure_date.month, 1).isocalendar()[1]
    if first_week == departure_date.isocalendar()[1]:
        day_of_week = departure_date.day - 1
    else:
        day_of_week = departure_date.weekday()
    print(departure_date, day_of_week)
2021-08-27 4
2021-08-28 5
2021-08-29 6
2021-08-30 0
2021-08-31 1
2021-09-01 0
2021-09-02 1
2021-09-03 2
2021-09-04 3
2021-09-05 4
2021-09-06 0
2021-09-07 1
2021-09-08 2
2021-09-09 3
2021-09-10 4
2021-09-11 5
2021-09-12 6
2021-09-13 0
2021-09-14 1
2021-09-15 2
2021-09-16 3
2021-09-17 4
2021-09-18 5
2021-09-19 6
2021-09-20 0
2021-09-21 1
2021-09-22 2
2021-09-23 3
2021-09-24 4
2021-09-25 5
2021-09-26 6
2021-09-27 0
2021-09-28 1
2021-09-29 2
2021-09-30 3
2021-10-01 0
2021-10-02 1
2021-10-03 2
2021-10-04 0
2021-10-05 1
2021-10-06 2
2021-10-07 3
2021-10-08 4
2021-10-09 5
2021-10-10 6
For any date D.M.Y, get the weekday W of 1.M.Y.
Then you need to adjust weekday value only for the first 7-W days of that month. To adjust, simply subtract the value W.
Example for September 2021: the first date of month (1.9.2021) is a Wednesday, so W is 2. You need to adjust weekdays for dates 1.9.2021 to 5.9.2021 (because 7-2 is 5) in that month by minus 2.
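The rule described above could be sketched roughly as follows. The function name month_relative_weekday is made up for illustration; only the standard datetime module is used.

```python
from datetime import date

def month_relative_weekday(d: date) -> int:
    """Weekday of d, renumbered so the 1st of the month is 0 during the first week."""
    # W: weekday of the 1st of d's month (Monday = 0 ... Sunday = 6)
    w = date(d.year, d.month, 1).weekday()
    # Only the first 7 - W days of the month belong to the partial first week
    if d.day <= 7 - w:
        return d.weekday() - w
    return d.weekday()

# September 2021 starts on a Wednesday, so W = 2
print(month_relative_weekday(date(2021, 9, 1)))  # Wednesday the 1st
print(month_relative_weekday(date(2021, 9, 5)))  # Sunday of the first week
print(month_relative_weekday(date(2021, 9, 6)))  # first full week, Monday
```

When the month starts on a Monday, W is 0 and the function reduces to plain weekday().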
I have the following data in CSV file:
time conc time conc time conc time conc
1:00 10 5:00 11 9:00 55 13:00 1
2:00 13 6:00 8 10:00 6 14:00 4
3:00 9 7:00 7 11:00 8 15:00 3
4:00 8 8:00 1 12:00 11 16:00 8
And I just wanted to merge them together as:
time conc
1:00 10
2:00 13
3:00 9
4:00 8
...
16:00 8
I've got more than 1000 columns, and I'm new to pandas. How can I achieve this?
One approach is to cut the dataframe in two-column slices, then re-combine using pd.concat() after renaming.
First load the dataframe normally:
import pandas as pd
df = pd.read_csv('time_conc.csv')
df
Which looks something like the below. Notice that pd.read_csv() has added a suffix to the duplicate column names:
time conc time.1 conc.1 time.2 conc.2 time.3 conc.3
0 1:00 10 5:00 11 9:00 55 13:00 1
1 2:00 13 6:00 8 10:00 6 14:00 4
2 3:00 9 7:00 7 11:00 8 15:00 3
3 4:00 8 8:00 1 12:00 11 16:00 8
Then slice using pd.DataFrame.iloc:
total_columns = len(df.columns)
columns_per_set = 2
column_sets = [
    df.iloc[:, set_start:set_start + columns_per_set].copy()
    for set_start in range(0, total_columns, columns_per_set)
]
column_sets is now a list holding each pair of duplicate columns as a separate dataframe. Next, loop through the list to rename the columns back to the original:
for s in column_sets:
    s.columns = ['time', 'conc']
This modifies each two-column dataframe in place so that their column names match.
Finally, use pd.concat() to stack them vertically (axis=0), aligning on the matching column names:
new_df = pd.concat(column_sets, axis=0, sort=False)
new_df
Which gives you the full two columns:
time conc
0 1:00 10
1 2:00 13
2 3:00 9
3 4:00 8
0 5:00 11
1 6:00 8
2 7:00 7
3 8:00 1
0 9:00 55
1 10:00 6
2 11:00 8
3 12:00 11
0 13:00 1
1 14:00 4
2 15:00 3
3 16:00 8
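For reference, here is a compact, self-contained sketch of the same slice-rename-concat idea. The inline CSV text is a shortened stand-in for the real file, and ignore_index=True is my addition to get a fresh 0..n-1 index instead of the repeating one shown above.

```python
import io
import pandas as pd

# A small stand-in for the real CSV file (assumed comma-separated)
raw = """time,conc,time,conc
1:00,10,5:00,11
2:00,13,6:00,8
"""
df = pd.read_csv(io.StringIO(raw))  # duplicate names become time.1, conc.1

pairs = []
for start in range(0, df.shape[1], 2):
    pair = df.iloc[:, start:start + 2].copy()  # one (time, conc) pair
    pair.columns = ["time", "conc"]            # restore the shared names
    pairs.append(pair)

# Stack the pairs on top of each other and renumber the rows
stacked = pd.concat(pairs, ignore_index=True)
print(stacked)
```

Because the loop steps through the columns by position, it does not depend on the exact suffixes pandas adds to the duplicate names.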
Since your file has duplicated column names, Pandas will add suffixes. The DataFrame header by default will be like ['time', 'conc', 'time.1', 'conc.1', 'time.2', 'conc.2', 'time.3', 'conc.3' ...]
Assuming that the separator of your CSV file is a comma:
import pandas as pd

df = pd.read_csv('/path/to/your/file.csv', sep=',')
total_n = len(df.columns)
lst = []
for x in range(total_n // 2):
    if x == 0:
        cols = ['time', 'conc']
    else:
        cols = ['time.' + str(x), 'conc.' + str(x)]
    df_sub = df[cols].copy()  # Slice two columns each time; copy so renaming is safe
    df_sub.columns = ['time', 'conc']  # Slices should have the same column names
    lst.append(df_sub)
df = pd.concat(lst)  # Concatenate all the objects
Assuming that df is a DataFrame with the csv file data you can try the following:
# rename columns if needed
df.columns = ["time", "conc"]*(df.shape[1]//2)
# concatenate pairs of adjacent columns
pd.concat([df.iloc[:, [i, i+1]] for i in range(0, df.shape[1], 2)])
It gives:
time conc
0 1:00 10
1 2:00 13
2 3:00 9
3 4:00 8
0 5:00 11
.. ... ..
3 12:00 11
0 13:00 1
1 14:00 4
2 15:00 3
3 16:00 8
I have a df in the format:
date number category
2014-02-02 17:00:00 4 red
2014-02-03 17:00:00 5 red
2014-02-04 17:00:00 4 blue
2014-02-05 17:00:00 4 blue
2014-02-06 17:00:00 4 red
2014-02-07 17:00:00 4 red
2014-02-08 17:00:00 4 blue
...
How do I group by day of the week and total 'number' for each day, so I'd have a df of 7 rows (Monday, Tuesday, etc.) with the total of 'number' on that day? With this I want to make a histogram with number on the y-axis and day of the week on the x-axis.
After reading your question again, I understand why @Quang Hoang answered the way he did. I'm not sure whether that is what you wanted, or whether the below is:
import pandas as pd
import matplotlib.pyplot as plt

df['date'] = pd.to_datetime(df['date'])
df['day'] = df['date'].dt.day_name()
counts = df.groupby('day')['number'].sum()  # the column is 'number' (lowercase)
plt.bar(counts.index, counts)
plt.show()
You can use dt.day_name() to extract the day name, then use pd.crosstab to count the number:
pd.crosstab(df['date'].dt.day_name(),df['number'])
Output:
number 4 5
date
Friday 1 0
Monday 0 1
Saturday 1 0
Sunday 1 0
Thursday 1 0
Tuesday 1 0
Wednesday 1 0
And to plot the result as a bar chart, you can chain the above with .plot.bar(), i.e. pd.crosstab(df['date'].dt.day_name(), df['number']).plot.bar().
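If the goal is the total of 'number' per weekday rather than a count of each value, a minimal sketch could look like this. The sample rows are copied from the question; reindexing into calendar order is my own addition, since groupby sorts the day names alphabetically.

```python
import pandas as pd

# Sample rows from the question
df = pd.DataFrame({
    "date": pd.to_datetime([
        "2014-02-02 17:00:00", "2014-02-03 17:00:00", "2014-02-04 17:00:00",
        "2014-02-05 17:00:00", "2014-02-06 17:00:00", "2014-02-07 17:00:00",
        "2014-02-08 17:00:00",
    ]),
    "number": [4, 5, 4, 4, 4, 4, 4],
    "category": ["red", "red", "blue", "blue", "red", "red", "blue"],
})

weekday_order = ["Monday", "Tuesday", "Wednesday", "Thursday",
                 "Friday", "Saturday", "Sunday"]

# Sum 'number' per weekday name, then put the days in calendar order
totals = (
    df.groupby(df["date"].dt.day_name())["number"]
      .sum()
      .reindex(weekday_order, fill_value=0)  # days with no rows get 0
)
print(totals)
```

With matplotlib installed, totals.plot.bar() would then draw one bar per weekday.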
This is my transaction dataframe, where each row represents a transaction:
date station
30/10/2017 15:20 A
30/10/2017 15:45 A
31/10/2017 07:10 A
31/10/2017 07:25 B
31/10/2017 07:55 B
I need to group the dates into one-hour intervals and count the transactions for each station, so the end result will be:
date hour station count
30/10/2017 16:00 A 2
31/10/2017 08:00 A 1
31/10/2017 08:00 B 2
Where the first row means from 15:00 to 16:00 on 30/10/2017, there are 2 transactions in station A
How to do this in Pandas?
I tried this code, but the result is wrong :
df_start_tmp = df_trip[['Start Date', 'Start Station']]
times = pd.DatetimeIndex(df_start_tmp['Start Date'])
df_start = df_start_tmp.groupby([times.hour, df_start_tmp['Start Station']]).count()
Thanks a lot for the help
If I understand correctly, you can use size() together with pd.Grouper:
df.date = pd.to_datetime(df.date, dayfirst=True)  # dates are day/month/year
df.groupby([pd.Grouper(key='date', freq='H'), df.station]).size().reset_index(name='count')
Output:
date station count
0 2017-10-30 15:00:00 A 2
1 2017-10-31 07:00:00 A 1
2 2017-10-31 07:00:00 B 2
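The output above labels each group with the start of its hour, while the question asks for the end of the interval (16:00 for 15:00 to 16:00). A rough sketch of one way to get that, flooring each timestamp and adding an hour afterwards; the column names hour_start and hour_end are illustrative only:

```python
import pandas as pd

# Sample rows from the question
df = pd.DataFrame({
    "date": ["30/10/2017 15:20", "30/10/2017 15:45", "31/10/2017 07:10",
             "31/10/2017 07:25", "31/10/2017 07:55"],
    "station": ["A", "A", "A", "B", "B"],
})

# Parse with an explicit day-first format to avoid month/day mix-ups
df["date"] = pd.to_datetime(df["date"], format="%d/%m/%Y %H:%M")

# Round each timestamp down to the start of its hour
df["hour_start"] = df["date"].dt.floor("60min")

counts = (
    df.groupby(["hour_start", "station"])
      .size()
      .reset_index(name="count")
)

# Label each interval by its end, as in the desired output
counts["hour_end"] = counts["hour_start"] + pd.Timedelta(hours=1)
print(counts[["hour_end", "station", "count"]])
```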
I'm trying to combine all rows of a dataframe that have the same time stamp into a single row. The df is 5k by 20.
A B ...
timestamp
11:00 NaN 10 ...
11:00 5 NaN ...
12:00 15 20 ...
... ... ...
I want to merge the two 11:00 rows as follows:
A B ...
timestamp
11:00 5 10 ...
12:00 15 20 ...
... ... ...
Any help would be appreciated. Thank you.
I have tried
df.groupby( df.index ).sum()
You could melt ('unpivot') the DataFrame to convert it from wide form to long form, remove the null values, then aggregate via groupby.
import pandas as pd
df = pd.DataFrame({'timestamp' : ['11:00','11:00','12:00'],
'A' : [None,5,15],
'B' : [10,None,20]
})
A B timestamp
0 NaN 10 11:00
1 5 NaN 11:00
2 15 20 12:00
df2 = pd.melt(df, id_vars = 'timestamp') # specify the value_vars if needed
timestamp variable value
0 11:00 A NaN
1 11:00 A 5
2 12:00 A 15
3 11:00 B 10
4 11:00 B NaN
5 12:00 B 20
df2.dropna(inplace=True)
df3 = df2.groupby(['timestamp', 'variable']).sum()
value
timestamp variable
11:00 A 5
B 10
12:00 A 15
B 20
df3.unstack()
value
variable A B
timestamp
11:00 5 10
12:00 15 20
Use groupby after replacing the NaN values with 0's:
df.fillna(0, inplace=True)
df.groupby(df.index).sum()
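As a small self-contained sketch of grouping on the index (the frame below is built by hand to mimic the question's layout): pandas sums skip NaN by default, and first() is an alternative that keeps the first non-null value per column instead of adding values together.

```python
import numpy as np
import pandas as pd

# Hand-built frame mimicking the question: timestamp is the index
df = pd.DataFrame(
    {"A": [np.nan, 5, 15], "B": [10, np.nan, 20]},
    index=pd.Index(["11:00", "11:00", "12:00"], name="timestamp"),
)

# Sum per timestamp; NaN values are skipped by default
summed = df.groupby(level=0).sum()

# Alternative: take the first non-null value in each column per timestamp
first_valid = df.groupby(level=0).first()

print(summed)
print(first_valid)
```

The two results agree here because each timestamp has at most one non-null value per column; if several rows held real values, sum() would add them while first() would keep only the earliest.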
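Try using resample (this requires a DatetimeIndex; the old how='sum' argument has been removed from pandas, so call .sum() on the result instead):
>>> df.resample('60min').sum()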
A B
2015-05-28 11:00:00 5 10
2015-05-28 12:00:00 15 20
More examples can be found in the Pandas Documentation.
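Adding a number and a NaN in plain Python gives NaN, but pandas skips NaN by default when summing, so groupby(...).sum() should already combine the rows. If you need a different combining rule, you can pass another function to .aggregate().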