I have a dataframe with many temperature measurements. I want to count the number of measurements for every day of the month. So far, I have managed to display the number of measurements and to create a new dataframe containing the unique days.
How can I add the number of measurements as a new column to the new dataframe (the one containing all the unique days)?
So far, I have managed this function, which counts the number of measurements in the given day:
def measurements_in_a_day(day, month, year):
    full_date = day.format(), '/', month.format(), '/', year.format()
    full_date = ''.join(full_date)
    seriesObj = data.apply(lambda x: True if x['day'] == (full_date) else False, axis=1)
    no_of_rows = len(seriesObj[seriesObj == True].index)
    print('Number of Rows in dataframe in which date is ', full_date, ' are ', no_of_rows)
The thing is that I have to call this function 3 different times, because the csv file doesn't keep the same format for the data throughout. How can I add the count of measurements as a new column in the dataframe created for the unique days of the month?
Did you try using pandas groupby?
Something like data.groupby('day').count() should give you what you want.
df1=df.groupby('day')['time'].count().reset_index()
df1=df1.rename(columns={'time':'count'})
In one line:
df1=df.groupby('day')['time'].count().reset_index().rename(columns={'time':'count'})
If you prefer to have the days as the index, you can do the following:
df1=df.groupby('day')['time'].count().rename('count')
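To attach those counts to the separate dataframe of unique days, one option is to map the grouped counts onto its day column. This is only a minimal sketch on toy data; the names `data`, `unique_days` and the column names are assumptions, not taken from the original file:

```python
import pandas as pd

# Toy stand-in for the measurements; the column names are assumptions.
data = pd.DataFrame({
    "day": ["01/03/2020", "01/03/2020", "02/03/2020",
            "03/03/2020", "03/03/2020", "03/03/2020"],
    "time": ["08:00", "12:00", "08:00", "08:00", "12:00", "16:00"],
    "temperature": [3.1, 7.4, 2.8, 4.0, 9.2, 6.5],
})

# Hypothetical dataframe holding one row per unique day.
unique_days = pd.DataFrame({"day": data["day"].unique()})

# Series indexed by day whose values are the number of rows for that day.
counts = data.groupby("day").size()

# Look up each unique day in the counts Series and store it as a new column.
unique_days["count"] = unique_days["day"].map(counts)
print(unique_days)
```

Because the lookup goes through the day value itself, the order of rows in `unique_days` does not have to match the order of the grouped result.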
Related
I want to extract specific information from this csv file.
I need to make a list of days and give an overview.
You're looking for DataFrame.resample. Based on a specific column, it will group the rows of the dataframe by a specific time interval.
First you need to do this, if you haven't already:
df['Date/Time'] = pd.to_datetime(df['Date/Time'])
Get the lowest 5 days of visibility:
>>> df.resample(rule='D', on='Date/Time')['Visibility (km)'].mean().nsmallest(5)
Date/Time
2012-03-01 2.791667
2012-03-14 5.350000
2012-12-27 6.104167
2012-01-17 6.433333
2012-02-01 6.795833
Name: Visibility (km), dtype: float64
Basically what that does is this:
Groups all the rows by day
Converts each group to the average value of all the Visibility (km) items for that day
Returns the 5 smallest
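The steps above can be tried on a small synthetic frame. This is a sketch only: the values are made up, and just the column names follow the question.

```python
import pandas as pd

# Synthetic hourly readings over three days; the values are made up.
df = pd.DataFrame({
    "Date/Time": pd.date_range("2012-01-01", periods=72,
                               freq=pd.Timedelta(hours=1)),
    "Visibility (km)": [10.0] * 24 + [2.0] * 24 + [5.0] * 24,
})

# One row per calendar day, holding that day's mean visibility.
daily_mean = df.resample(rule="D", on="Date/Time")["Visibility (km)"].mean()

# The two days with the lowest mean visibility, smallest first.
lowest = daily_mean.nsmallest(2)
print(lowest)
```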
Count the number of foggy days
>>> df.resample(rule='D', on='Date/Time').apply(lambda x: x['Weather'].str.contains('Fog').any()).sum()
78
Basically what that does is this:
Groups all the rows by day
For each day, adds a True if any row inside that day contains 'Fog' in the Weather column, False otherwise
Counts how many True's there were, and thus the number of foggy days.
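The same per-day "any fog" logic can also be written with a plain groupby on the calendar date instead of resample. The sketch below uses that alternative on made-up data, so treat it as an illustration rather than the exact code above:

```python
import pandas as pd

# Synthetic hourly weather for three days; only day 1 and day 3 have fog.
weather = ["Clear"] * 24 + ["Rain"] * 24 + ["Clear"] * 24
weather[5] = "Fog"
weather[6] = "Freezing Fog"
weather[60] = "Fog"
df = pd.DataFrame({
    "Date/Time": pd.date_range("2012-01-01", periods=72,
                               freq=pd.Timedelta(hours=1)),
    "Weather": weather,
})

# Flag each hourly row that mentions fog.
is_foggy = df["Weather"].str.contains("Fog")

# Group the flags by calendar day; .any() is True if at least one hour was foggy.
foggy_by_day = is_foggy.groupby(df["Date/Time"].dt.normalize()).any()

# Summing booleans counts the True values, i.e. the foggy days.
n_foggy_days = int(foggy_by_day.sum())
print(n_foggy_days)
```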
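This will get you an array of all unique foggy Date/Time values. You can use its shape attribute to get the count. Note that if Date/Time holds hourly timestamps, you need to reduce it to dates first; otherwise you get unique foggy hours rather than days.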
df[df["Weather"].apply(lambda x : "Fog" in x)]["Date/Time"].unique()
I need to make a list of days with the lowest visibility and give an overview of other time parameters for those days in tabular form.
Since your Date/Time column represents a particular hour, you'll need to do some grouping to get the minimum visibility for a particular day. The following will find the 5 least-visible days.
# Extract the date from the "Date/Time" column
>>> data["Date"] = pandas.to_datetime(data["Date/Time"]).dt.date
# Group on the new "Date" column and get the minimum values of
# each column for each group.
>>> min_by_day = data.groupby("Date").min()
# Now we can use nsmallest, since 1 row == 1 day in min_by_day.
# `nsmallest` returns a DataFrame with "Date" as the index, so
# we use `.index` to pull the date objects from the result.
>>> least_visible_days = min_by_day.nsmallest(5, "Visibility (km)").index
Then you can limit your original dataset to the least-visible days with
data[data["Date"].isin(least_visible_days)]
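For the "overview in tabular form" part, one possibility is to aggregate the filtered rows per day. This is a rough sketch on made-up data; the `Temp (C)` column and the summary column names are assumptions for illustration:

```python
import pandas as pd

# Synthetic hourly data for three days; all values are made up.
data = pd.DataFrame({
    "Date/Time": pd.date_range("2012-01-01", periods=72,
                               freq=pd.Timedelta(hours=1)),
    "Visibility (km)": [10.0] * 24 + [1.0] + [8.0] * 23 + [4.0] * 24,
    "Temp (C)": [0.0] * 24 + [2.0] * 24 + [-3.0] * 24,
})
data["Date"] = data["Date/Time"].dt.date

# Daily minimum visibility, then the two days where that minimum is lowest.
min_by_day = data.groupby("Date")[["Visibility (km)", "Temp (C)"]].min()
least_visible_days = min_by_day.nsmallest(2, "Visibility (km)").index

# Keep only the rows for those days and summarise them per day.
subset = data[data["Date"].isin(least_visible_days)]
overview = subset.groupby("Date").agg(
    min_visibility=("Visibility (km)", "min"),
    mean_visibility=("Visibility (km)", "mean"),
    mean_temp=("Temp (C)", "mean"),
)
print(overview)
```

The result has one row per selected day, so it can be printed or exported as a table directly.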
I also need the total number of foggy days.
We can use the extracted date in this case too:
# Extract the date from the "Date/Time" column
>>> data["Date"] = pandas.to_datetime(data["Date/Time"]).dt.date
# Filter on hours which have foggy weather
>>> foggy = data[data["Weather"].str.contains("Fog")]
# Count number of unique days
>>> len(foggy["Date"].unique())
I am trying to drop specific rows in a dataframe where the index is a date with 1hr intervals during specific times of the day. (It is hourly intervals of stock market data).
For instance, 2021-10-26 09:30:00-4:00,2021-10-26 10:30:00-4:00,2021-10-26 11:30:00-4:00, 2021-10-26 12:30:00-4:00 etc.
I want to be able to specify the row to keep by hh:mm (e.g. keep just the 6:30, 10:30 data each day), and drop all the rest.
I'm pretty new to programming so have absolutely no idea how to do this.
If your columns are datetime objects and not strings, you can do something like this
import pandas as pd

df = pd.DataFrame()
# ...input data, etc...
columns = df.columns
kept = []
for col in columns:
    # Each column label is a single Timestamp, so use .hour/.minute (no .dt)
    if (col.hour == 6 or col.hour == 10) and col.minute == 30:
        kept.append(col)
df = df[kept]
See about halfway down this source for working with time in pandas:
https://www.dataquest.io/blog/python-datetime-tutorial/
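If the timestamps are in the row index, as described in the question, the same idea can be applied to the index without a loop. A minimal sketch, assuming a timezone-naive DatetimeIndex and a made-up `close` column:

```python
import datetime as dt
import pandas as pd

# Synthetic hourly bars at half past each hour; 'close' is a made-up column.
idx = pd.date_range("2021-10-26 06:30", periods=48,
                    freq=pd.Timedelta(hours=1))
df = pd.DataFrame({"close": range(48)}, index=idx)

# Times of day (hh:mm) to keep on every date.
keep_times = [dt.time(6, 30), dt.time(10, 30)]

# index.time gives each row's time of day; keep rows whose time is in the list.
mask = pd.Series(df.index.time, index=df.index).isin(keep_times)
filtered = df[mask]
print(filtered)
```

Adding or removing entries in `keep_times` changes which rows survive, and everything else is dropped.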
I have a pandas dataframe where observations are broken out per every two days. The values in the 'Date' column each describe a range of two days (eg 2020-02-22 to 2020-02-23).
I want to split those Date values into individual days, with a row for each day. The closest I got was by doing newdf = df_day.set_index(df_day.columns.drop('Date',1).tolist()).Date.str.split(' to ', expand=True).stack().reset_index().loc[:, df_day.columns]
The problem here is that the new date values are returned as NaNs. Is there a way to achieve this data broken out by individual day?
I might not be understanding, but based on the image it's a single date per row as is, just poorly labeled -- I would manipulate the index strings, and if I can't do that I would create a new date column, or new df w/ clean date and merge it.
You should be able to chop off the first 14 characters with a lambda -- leaving you with second listed date in index.
I can't reproduce this, so bear with me.
df.rename(index=lambda s: s[14:])
#should remove first 14 characters from each row label.
#leaving just '2020-02-23' in row 2.
#If you must skip row 1, idx = df.index[1:]
#or df.iloc[1:].rename(index=lambda s: s[14:])
Otherwise, I would just replace it with a new datetime index.
didx = pd.date_range(start='2000-01-10', end='2020-02-26', freq='D')
#Make sure same length as df
df.set_index(didx)
#Or
#df['new_date'] = didx.values
#df.set_index('new_date').drop(columns=['Date'])
#Or
#pd.concat([df, didx.to_series(index=df.index, name='new_date')], axis=1)
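If the ranges really do need to become one row per day, splitting the string and exploding the result is one way to do it. This sketch assumes the ranges sit in a regular `Date` column (not the index) and each range covers exactly two days; `value` is a made-up column:

```python
import pandas as pd

# Toy frame where each 'Date' value spans two days.
df_day = pd.DataFrame({
    "Date": ["2020-02-22 to 2020-02-23", "2020-02-24 to 2020-02-25"],
    "value": [10, 20],
})

# Turn each 'start to end' string into a two-element list of date strings.
split_dates = df_day["Date"].str.split(" to ")

# explode() emits one row per list element and repeats the other columns.
newdf = df_day.assign(Date=split_dates).explode("Date").reset_index(drop=True)

# Convert the strings to real datetimes.
newdf["Date"] = pd.to_datetime(newdf["Date"])
print(newdf)
```

For ranges longer than two days this would only keep the two endpoints, so the list would have to be built with something like pd.date_range per row instead.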
My dataframe1 contains the day column which has numeric data from 1 to 7 for each day of the week. 1 - Monday, 2 - Tuesday...etc.
This day column is the day of Departure of a flight.
I need to create a new column dayOfBooking in a second dataframe2 which finds day of the week based on the number of days before a person books a flight and the day of departure of the flight.
For that I've written this function:
def findDay(dayOfDeparture, beforeDay):
    beforeDay = int(beforeDay)
    beforeDay = beforeDay % 7
    if (dayOfDeparture - beforeDay) > 0:
        dayAns = dayOfDeparture - beforeDay
    else:
        dayAns = 7 - abs(dayOfDeparture - beforeDay)
    return dayAns
I want something like:
dataframe2["dayOfBooking"] = findDay(dataframe1["day"], i)
where i is the scalar value.
I can see that findDay takes the entire column day of dataframe1 instead of taking a single value for each row.
Is there an easy way to accomplish this like when we want a third column to be the sum of two other columns for each row, we can just write this:
dataframe2["sum"] = dataframe2["val1"] + dataframe2["val2"]
EDIT: Figured it out. Answer and explanation below.
df2["colname"] = df.apply(lambda row: findDay(row['col'], i), axis = 1)
One way to pass each row's value of a particular column to a user-defined function is to use apply.
axis = 1 means the function is called once per row rather than once per column.
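For this particular calculation the row-by-row call can probably be avoided, since the wrap-around is plain modular arithmetic on the whole column. A sketch with made-up departure days and a hypothetical `i`:

```python
import pandas as pd

# Toy departure days (1 = Monday ... 7 = Sunday); the values are made up.
dataframe1 = pd.DataFrame({"day": [1, 3, 7, 2]})
i = 3  # hypothetical: the booking was made 3 days before departure

# Shift to 0-6, step back i days with wrap-around, then shift back to 1-7.
day_of_booking = (dataframe1["day"] - 1 - i) % 7 + 1
print(day_of_booking.tolist())
```

This relies on `%` returning a non-negative result for a positive divisor, which holds for both Python integers and pandas Series.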
I have census data that looks like this for a full month and I want to find out how many unique inmates there were for the month. The information is taken daily so there are multiples.
_id,Date,Gender,Race,Age at Booking,Current Age
1,2016-06-01,M,W,32,33
2,2016-06-01,M,B,25,27
3,2016-06-01,M,W,31,33
My current method is to group them by day and then add the ones that are not yet accounted for to a new DataFrame. My question is how to account for two people with the same info: would one of them not get added to the new DataFrame because the other already exists? I'm trying to figure out how many people in total were in the prison during this time.
_id is incremental, for example here is some data from the second day
2323,2016-06-02,M,B,20,21
2324,2016-06-02,M,B,44,45
2325,2016-06-02,M,B,22,22
2326,2016-06-02,M,B,38,39
link to the dataset here: https://data.wprdc.org/dataset/allegheny-county-jail-daily-census
You could use df.drop_duplicates(), which returns the DataFrame with only unique rows, and then count the entries.
Something like this should work:
import pandas as pd
df = pd.read_csv('inmates_062016.csv', index_col=0, parse_dates=True)
uniqueDF = df.drop_duplicates()
countUniques = len(uniqueDF.index)
print(countUniques)
Result:
>> 11845
Pandas drop_duplicates Documentation
Inmates June 2016 CSV
The problem with this approach / data is that there could be many individual inmates that are the same age / gender / race that would be filtered out.
I think the trick here is to groupby as much as possible and check the differences in those (small) groups through the month:
inmates = pd.read_csv('inmates.csv')
# group by everything except _id and count number of entries
grouped = inmates.groupby(
    ['Gender', 'Race', 'Age at Booking', 'Current Age', 'Date']).count()
# pivot the dates out and transpose - this gives us the number of each
# combination for each day
grouped = grouped.unstack().T.fillna(0)
# get the difference between each day of the month - the assumption here
# being that a negative number means someone left, 0 means that nothing
# has changed and positive means that someone new has come in. As you
# mentioned yourself, that isn't necessarily true
diffed = grouped.diff()
# replace the first day of the month with the grouped numbers to give
# the number in each group at the start of the month
diffed.iloc[0, :] = grouped.iloc[0, :]
# sum only the positive numbers in each row to count those that have
# arrived but ignore those that have left
diffed['total'] = diffed.apply(lambda x: x[x > 0].sum(), axis=1)
# sum total column
diffed['total'].sum() # 3393
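To see how the positive-difference count behaves, here is the same idea on a tiny made-up census with fewer grouping columns. It is only a sketch of the approach, and it carries the same caveat that a departure and an arrival with identical attributes on the same day cancel out:

```python
import pandas as pd

# Toy census: one row per inmate per day; all values are made up.
inmates = pd.DataFrame({
    "_id": [1, 2, 3, 4, 5, 6, 7],
    "Date": ["2016-06-01"] * 3 + ["2016-06-02"] * 4,
    "Gender": ["M", "M", "F", "M", "M", "M", "F"],
    "Race": ["W", "W", "B", "W", "W", "W", "W"],
})

# Head count per (Gender, Race) profile for each day, with dates as rows.
per_day = (inmates.groupby(["Date", "Gender", "Race"]).size()
           .unstack(["Gender", "Race"]).fillna(0))

# Day-over-day change per profile; the first day keeps its starting counts.
change = per_day.diff()
change.iloc[0, :] = per_day.iloc[0, :]

# Only increases count as arrivals, so clip negatives to zero and add up.
estimated_total = int(change.clip(lower=0).to_numpy().sum())
print(estimated_total)
```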