My data has various columns, including a date and a time column. The data spans three months. I need to count the number of rows in a particular hour irrespective of the date, which means getting the count of rows in the 00:00 to 01:00 window and similarly for the remaining 23 hours. How do I do that? Overall I will have 24 rows with their counts.
Here is my data:
>>>df[["date","time"]]
date time
0 2006-11-10 00:01:21
1 2006-11-10 00:02:26
2 2006-11-10 00:02:38
3 2006-11-10 00:05:38
4 2006-11-10 00:05:38
Output should be like:
00:00-00:59 SomeCount
Both are object types
I think the simplest approach is to convert both columns to datetimes and count hours with Series.dt.hour and Series.value_counts:
out = pd.to_datetime(df["date"] + ' ' + df["time"]).dt.hour.value_counts().sort_index()
Or if you need your format, use Series.dt.strftime with GroupBy.size:
s = pd.to_datetime(df["date"] + ' ' + df["time"]).dt.strftime('%H:00-%H:59')
print (s)
0 00:00-00:59
1 00:00-00:59
2 00:00-00:59
3 00:00-00:59
4 00:00-00:59
dtype: object
out = s.groupby(s, sort=False).size()
print (out)
00:00-00:59 5
dtype: int64
Last, for a DataFrame use:
df = out.rename_axis('times').reset_index(name='count')
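Putting it together, a minimal self-contained sketch on the sample rows above (all of which fall in hour 0):

import pandas as pd

# Sample frame matching the question (both columns are strings / object dtype)
df = pd.DataFrame({
    "date": ["2006-11-10"] * 5,
    "time": ["00:01:21", "00:02:26", "00:02:38", "00:05:38", "00:05:38"],
})

# Hour counts keyed by the integer hour: here a single entry, 0 -> 5
out = pd.to_datetime(df["date"] + ' ' + df["time"]).dt.hour.value_counts().sort_index()

# The "HH:00-HH:59" variant, reshaped into a two-column DataFrame
s = pd.to_datetime(df["date"] + ' ' + df["time"]).dt.strftime('%H:00-%H:59')
res = s.groupby(s, sort=False).size().rename_axis('times').reset_index(name='count')
print(res)   # one row: 00:00-00:59, 5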
You can split the time string on the delimiter :, then create another column hour for the hour. Then use groupby() to group on the basis of the new hour column. You can then store the data in a new Series or DataFrame to get the desired output.
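A minimal sketch of that string-splitting idea, assuming the two object columns from the question:

import pandas as pd

df = pd.DataFrame({
    "date": ["2006-11-10"] * 5,
    "time": ["00:01:21", "00:02:26", "00:02:38", "00:05:38", "00:05:38"],
})

# Split the time string on ":" and keep the first piece as the hour
df["hour"] = df["time"].str.split(":").str[0]

# Group on the new hour column and count rows per hour
counts = df.groupby("hour").size()
print(counts)   # hour "00" -> 5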
groupby() the hour, then build a DataFrame that has the values you want, then clean up the indexes and column names:
import io
import pandas as pd
df = pd.read_csv(io.StringIO(""" date time
0 2006-11-10 00:01:21
1 2006-11-10 00:02:26
2 2006-11-10 00:02:38
3 2006-11-10 00:05:38
4 2006-11-10 02:05:38"""), sep=r"\s\s+", engine="python")
dfc = (df.groupby(pd.to_datetime(df.time).dt.hour)
         .apply(lambda d: pd.DataFrame(
             {"count": [len(d)]},
             index=[pd.to_datetime(d["time"]).min().strftime("%H:%M")
                    + "-" + pd.to_datetime(d["time"]).max().strftime("%H:%M")]))
         .reset_index()
         .drop(columns=["time"])
         .rename(columns={"level_1": "time"})
      )
          time  count
0  00:01-00:05      4
1  02:05-02:05      1
My solution generates row counts for all 24 hours, with 0 for hours "absent"
in the source DataFrame.
To show a more instructive example, I defined the source DataFrame containing
rows from several hours:
date time
0 2006-11-10 01:21:00
1 2006-11-10 02:26:00
2 2006-11-10 02:38:00
3 2006-11-10 05:38:00
4 2006-11-10 05:38:00
5 2006-11-11 05:43:00
6 2006-11-11 05:51:00
Note that the last 2 rows are from a different date, but as you want grouping by
hour only, they are counted in the same group as the previous 2 rows (hour 5).
The first step is to create a Series containing almost what you want:
wrk = df.groupby(pd.to_datetime(df.time).dt.hour).apply(
lambda grp: grp.index.size).reindex(range(24), fill_value=0)
The initial part of wrk is:
time
0 0
1 1
2 2
3 0
4 0
5 4
6 0
7 0
The left column (the index) contains hour as an integer and the
right column is the count - how many rows are in this hour.
The only thing to do is to reformat the index to your desired format:
wrk.index = wrk.index.map(lambda h: f'{h:02}:00-{h:02}:59')
The result (initial part only) is:
time
00:00-00:59 0
01:00-01:59 1
02:00-02:59 2
03:00-03:59 0
04:00-04:59 0
05:00-05:59 4
06:00-06:59 0
07:00-07:59 0
But if you want to get counts only for hours present in your source
data, then drop .reindex(…) from the above code.
Then your (full) result, for the above DataFrame will be:
time
01:00-01:59 1
02:00-02:59 2
05:00-05:59 4
dtype: int64
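Putting the steps together, a self-contained sketch assuming the seven-row sample above:

import pandas as pd

df = pd.DataFrame({
    "date": ["2006-11-10"] * 5 + ["2006-11-11"] * 2,
    "time": ["01:21:00", "02:26:00", "02:38:00", "05:38:00",
             "05:38:00", "05:43:00", "05:51:00"],
})

# Count rows per hour, then reindex so every hour 0-23 appears (0 where absent)
wrk = (df.groupby(pd.to_datetime(df.time).dt.hour)
         .apply(lambda grp: grp.index.size)
         .reindex(range(24), fill_value=0))

# Reformat the integer hour index to "HH:00-HH:59"
wrk.index = wrk.index.map(lambda h: f'{h:02}:00-{h:02}:59')
print(wrk.head(8))   # 01:00-01:59 -> 1, 02:00-02:59 -> 2, 05:00-05:59 -> 4, others 0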
I would like to add a column to my dataset which corresponds to the time stamp and counts the days as steps. That is, for one year there should be 365 "steps": I would like all grouped payments for each account on day 1 to be labeled 1 in this column, all payments on day 2 to be labeled 2, and so on up to day 365. I would like it to look something like this:
account time steps
0 A 2022.01.01 1
1 A 2022.01.02 2
2 A 2022.01.02 2
3 B 2022.01.01 1
4 B 2022.01.03 3
5 B 2022.01.05 5
I have tried this:
def day_step(x):
    x['steps'] = x.time.dt.day.shift()
    return x
df = df.groupby('account').apply(day_step)
however, it only counts within each month; once a new month begins it starts again from 1.
How can I fix this to make it provide the step count for the entire year?
Use GroupBy.transform with first (or min) to get each account's first time, subtract it from the time column, convert the timedeltas to days and add 1:
df['time'] = pd.to_datetime(df['time'])
df['steps1'] = (df['time'].sub(df.groupby('account')['time'].transform('first'))
                          .dt.days
                          .add(1))
print (df)
account time steps steps1
0 A 2022-01-01 1 1
1 A 2022-01-02 2 2
2 A 2022-01-02 2 2
3 B 2022-01-01 1 1
4 B 2022-01-03 3 3
5 B 2022-01-05 5 5
First idea, which works only if the first row is January 1:
df['steps'] = df['time'].dt.dayofyear
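For completeness, a self-contained sketch of the transform('first') approach on the sample above (the '%Y.%m.%d' format string is an assumption based on the dotted dates shown):

import pandas as pd

df = pd.DataFrame({
    "account": ["A", "A", "A", "B", "B", "B"],
    "time": ["2022.01.01", "2022.01.02", "2022.01.02",
             "2022.01.01", "2022.01.03", "2022.01.05"],
})
df["time"] = pd.to_datetime(df["time"], format="%Y.%m.%d")

# Days elapsed since each account's first payment, plus 1
df["steps"] = (df["time"]
               .sub(df.groupby("account")["time"].transform("first"))
               .dt.days
               .add(1))
print(df)   # steps: 1, 2, 2, 1, 3, 5 as in the desired output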
I have a question similar to this SO question, except that instead of only having numbers and text, my column has numbers, text and dates in it. How do I remove all rows that have values other than numbers and dates?
For example, I have a generic dataframe:
id  Nums and Dates
1   40
2   1/1/2021
3   AABBC
4   20
5   1/2/2021
And after removal, it should look like this.
id  Nums and Dates
1   40
2   1/1/2021
4   20
5   1/2/2021
You can create two series that check for dates and numbers with to_datetime and to_numeric, and keep only the rows where at least one of them is non-null with notnull(). Passing errors='coerce' returns null for non-dates / non-numbers:
dates = pd.to_datetime(df['Nums and Dates'], errors='coerce')
nums = pd.to_numeric(df['Nums and Dates'], errors='coerce')
df[(dates.notnull()) | (nums.notnull())]
Out[1]:
id Nums and Dates
0 1 40
1 2 1/1/2021
3 4 20
4 5 1/2/2021
Supposing the name of your dataframe is df and the name of your column is Nums and Dates, try this:
not_str_values = [value for value in df['Nums and Dates'] if type(value) is not str]
Then:
df = df.loc[df['Nums and Dates'].isin(not_str_values)]
With contains and a regex, you can do this in one pass through the column by removing
the rows that contain alphabetic characters.
df[~df['Nums and Dates'].str.contains(r'[A-Za-z]', regex=True)]
id Nums and Dates
0 1 40
1 2 1/1/2021
3 4 20
4 5 1/2/2021
I have a pandas dataframe that looks like this,
id start end
0 1 2020-02-01 2020-04-01
1 2 2020-04-01 2020-04-28
I have two additional parameters that are date values, say x and y. x and y will always be the first day of a month.
I want to expand the above data frame to the one shown below for x = "2020-01-01" and y = "2020-06-01",
id month status
0 1 2020-01 -1
1 1 2020-02 1
2 1 2020-03 2
3 1 2020-04 2
4 1 2020-05 -1
5 1 2020-06 -1
6 2 2020-01 -1
7 2 2020-02 -1
8 2 2020-03 -1
9 2 2020-04 1
10 2 2020-05 -1
11 2 2020-06 -1
The dataframe is expanded such that for each id there will be months_between(x, y) additional rows. A status column is added and its values are filled in such that:
If the month column value is equal to the month of the start column, fill status with 1.
If the month column value is greater than the month of start but less than or equal to the month of end, fill it with 2.
If the month column value is less than the month of start, fill it with -1. Also, if the month column value is greater than the month of end, fill status with -1.
I'm trying to solve this in pandas without looping. The current solution I have is with loops and takes longer to run with huge datasets.
Is there any pandas functions that can help me here?
Thanks @Code Different for the solution. It solves the issue. However, there is an extension to the problem where the dataframe can look like this:
id start end
0 1 2020-02-01 2020-02-20
1 1 2020-04-01 2020-05-10
2 2 2020-04-10 2020-04-28
One id can have more than one entry. For the above x and y, which are 6 months apart, I want to have 6 rows per id in the dataframe. The solution currently creates 6 rows for each row in the dataframe, which is okay but not ideal when dealing with a dataframe with millions of ids.
Make sure the start and end columns are of type Timestamp:
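If they are plain strings, a quick conversion first (a sketch using the column names from the question):

df['start'] = pd.to_datetime(df['start'])
df['end'] = pd.to_datetime(df['end'])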
# Explode each month between x and y
x = '2020-01-01'
y = '2020-06-01'
df['month'] = [pd.date_range(x, y, freq='MS')] * len(df)
df = df.explode('month').drop_duplicates(['id', 'month'])
# Determine the status
df['status'] = -1
cond = df['start'] == df['month']
df.loc[cond, 'status'] = 1
cond = (df['start'] < df['month']) & (df['month'] <= df['end'])
df.loc[cond, 'status'] = 2
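If the month column should match the YYYY-MM display from the question, one extra formatting step (an assumption about the desired output, not part of the original answer):

df['month'] = pd.to_datetime(df['month']).dt.strftime('%Y-%m')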
I have a pandas data frame (df) with many columns. For the sake of simplicity, I am posting three columns with dummy data here.
Timestamp Source Length
0 1 5
1 1 5
2 1 5
3 2 5
4 2 5
5 3 5
6 1 5
7 3 5
8 2 5
9 1 5
Using pandas functions, first I set the timestamp as the index of the df.
index = pd.DatetimeIndex(df[df.columns[0]]*10**9) # Convert timestamp (seconds -> ns)
df = df.set_index(index) # Set Timestamp as index
Next I can use groupby and pd.TimeGrouper to group the data into 5-second bins and compute the cumulative length for each bin as follows:
df_length = df[df.columns[2]].groupby(pd.TimeGrouper('5S')).sum()
So the df_length dataframe should look like:
Timestamp Length
0 25
5 25
Now the problem is: I want to get the same bins of 5 seconds, but I want to compute the cumulative length for each source (1, 2 and 3) in separate columns, in the following format:
Timestamp 1 2 3
0 15 10 0
5 10 5 10
I think I can use df.groupby with some conditions to get it. But I'm confused and tired now :(
I'd appreciate a solution using pandas functions only.
You can group by an additional key, the Source column, to get a MultiIndex Series, and then reshape by unstack-ing the last level of the MultiIndex into columns:
print (df[df.columns[2]].groupby([pd.TimeGrouper('5S'), df['Source']]).sum())
Timestamp Source
1970-01-01 00:00:00 1 15
2 10
1970-01-01 00:00:05 1 10
2 5
3 10
Name: Length, dtype: int64
df1 = (df[df.columns[2]].groupby([pd.TimeGrouper('5S'), df['Source']])
                        .sum()
                        .unstack(fill_value=0))
print (df1)
Source 1 2 3
Timestamp
1970-01-01 00:00:00 15 10 0
1970-01-01 00:00:05 10 5 10
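Note that pd.TimeGrouper has since been removed from pandas; in newer versions the same grouping can be written with pd.Grouper (a sketch, not part of the original answer):

df1 = (df['Length'].groupby([pd.Grouper(freq='5S'), df['Source']])
                   .sum()
                   .unstack(fill_value=0))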
I have the following time series dataset of the number of sales per day as a pandas data frame.
date, sales
20161224,5
20161225,2
20161227,4
20161231,8
Now if I have to include the missing data points here (i.e. missing dates) with a constant value (zero) and want to make it look the following way, how can I do this efficiently (assuming the data frame is ~50MB) using pandas?
date, sales
20161224,5
20161225,2
20161226,0**
20161227,4
20161228,0**
20161229,0**
20161231,8
**Missing rows which have been added to the data frame.
Any help will be appreciated.
You can first convert column date with to_datetime, then set_index and reindex by the min and max values of the index, then reset_index and, if necessary, change the format with strftime:
df.date = pd.to_datetime(df.date, format='%Y%m%d')
df = df.set_index('date')
df = (df.reindex(pd.date_range(df.index.min(), df.index.max()), fill_value=0)
        .reset_index()
        .rename(columns={'index':'date'}))
print (df)
date sales
0 2016-12-24 5
1 2016-12-25 2
2 2016-12-26 0
3 2016-12-27 4
4 2016-12-28 0
5 2016-12-29 0
6 2016-12-30 0
7 2016-12-31 8
Last, if you need to change the format:
df.date = df.date.dt.strftime('%Y%m%d')
print (df)
date sales
0 20161224 5
1 20161225 2
2 20161226 0
3 20161227 4
4 20161228 0
5 20161229 0
6 20161230 0
7 20161231 8
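A shorter variant of the same idea, starting again from the original frame, uses DataFrame.asfreq with fill_value (an alternative sketch, not part of the original answer):

df.date = pd.to_datetime(df.date, format='%Y%m%d')
out = (df.set_index('date')
         .asfreq('D', fill_value=0)   # insert missing daily rows, filling sales with 0
         .reset_index())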