I want to extract specific information from [this csv file][1].
I need to make a list of days and give an overview.
You're looking for DataFrame.resample. Given a datetime column, it groups the rows of the dataframe by a specified time interval.
First you need to do this, if you haven't already:
df['Date/Time'] = pd.to_datetime(df['Date/Time'])
Get the lowest 5 days of visibility:
>>> df.resample(rule='D', on='Date/Time')['Visibility (km)'].mean().nsmallest(5)
Date/Time
2012-03-01 2.791667
2012-03-14 5.350000
2012-12-27 6.104167
2012-01-17 6.433333
2012-02-01 6.795833
Name: Visibility (km), dtype: float64
Basically what that does is this:
- Groups all the rows by day
- Converts each group to the average value of all the Visibility (km) items for that day
- Returns the 5 smallest
Count the number of foggy days
>>> df.resample(rule='D', on='Date/Time').apply(lambda x: x['Weather'].str.contains('Fog').any()).sum()
78
Basically what that does is this:
- Groups all the rows by day
- For each day, adds a True if any row inside that day contains 'Fog' in the Weather column, False otherwise
- Counts how many Trues there were, and thus the number of foggy days
This will get you an array of all unique foggy days; you can use its shape attribute (or len) to count them. Note that the date has to be pulled out of the hourly Date/Time stamps first:
df[df["Weather"].apply(lambda x: "Fog" in x)]["Date/Time"].dt.date.unique()
I need to make a list of the days with the lowest visibility and give an overview of the other parameters for those days in tabular form.
Since your Date/Time column represents a particular hour, you'll need to do some grouping to get the minimum visibility for a particular day. The following will find the 5 least-visible days.
# Extract the date from the "Date/Time" column
>>> data["Date"] = pandas.to_datetime(data["Date/Time"]).dt.date
# Group on the new "Date" column and get the minimum values of
# each column for each group.
>>> min_by_day = data.groupby("Date").min()
# Now we can use nsmallest, since 1 row == 1 day in min_by_day.
# Since `nsmallest` here returns a DataFrame indexed by "Date",
# we use `.index` to pull the date objects from the result.
>>> least_visible_days = min_by_day.nsmallest(5, "Visibility (km)").index
Then you can limit your original dataset to the least-visible days with
data[data["Date"].isin(least_visible_days)]
I also need the total number of foggy days.
We can use the extracted date in this case too:
# Extract the date from the "Date/Time" column
>>> data["Date"] = pandas.to_datetime(data["Date/Time"]).dt.date
# Filter on hours which have foggy weather
>>> foggy = data[data["Weather"].str.contains("Fog")]
# Count number of unique days
>>> len(foggy["Date"].unique())
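Equivalently, pandas can count the unique days in one call:
# Count unique days directly
>>> foggy["Date"].nunique()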
I have a few data frames that I am resampling to match each other. I would like to set the timestamps (index) for all the data to the first day of the month in which the measurements were taken. I cannot find anywhere how to do it; the closest I got was with resample(..., kind='period'), but it leaves me without the day.
The code I tried
df['value'].resample('M', kind='period').sum()
The index comes out as a monthly period (e.g. 2018-09), whereas I would like the timestamp to have the form 2018-09-01.
This line is all you need:
df.index = pd.to_datetime(df.index).strftime('%Y-%m-%d')
# Output
# value
# 2018-09-01 11
# 2018-10-01 12
It parses the index and formats it as strings of the form YYYY-MM-DD; for monthly data the first day of the month is inserted automatically. For more details, see the docs.
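For context, a minimal end-to-end sketch (the toy data is an assumption); calling to_timestamp() on the resampled PeriodIndex is an equivalent way to land on the first day of each month:
import pandas as pd

# Toy frame: daily values spanning two months (made-up data)
idx = pd.date_range('2018-09-20', '2018-10-12', freq='D')
df = pd.DataFrame({'value': 1}, index=idx)

monthly = df['value'].resample('M', kind='period').sum()
monthly.index = monthly.index.to_timestamp()  # PeriodIndex -> first-of-month timestamps
print(monthly)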
I'm trying to count the number of 0s in a column based on conditions in another column. I have three columns in the spreadsheet: DATE, LOCATION, and SALES. Column 1 is the date column. Column 2 is the location column (there are 5 different locations). Column 3 is the sales volume for the day.
I want to count the number of instances where the different locations have 0 sales for the day.
I have tried a number of groupby combinations and cannot get an answer.
df_summary = df.groupby(['Location']).count()['Sales'] == 0
Any help is appreciated.
Try filtering first:
(df[df['Sales'] == 0].groupby('Location').size()
 .reindex(df['Location'].unique(), fill_value=0)  # include locations with no zero-sales days
 .reset_index(name='No Sales Days')  # convert to dataframe
)
Or
df['Sales'].eq(0).groupby(df['Location']).sum()
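As a quick self-contained check of the second approach (the sample data below is made up):
import pandas as pd

df = pd.DataFrame({
    'Location': ['A', 'A', 'B', 'B', 'C'],
    'Sales':    [0,   5,   0,   0,   7],
})

# Count zero-sales rows per location
print(df['Sales'].eq(0).groupby(df['Location']).sum())
# A: 1, B: 2, C: 0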
I have looked for similar questions, but none seems to be addressing the following challenge. I have a pandas dataframe with a list of minutes and corresponding values, like the following:
minute value
0 454
1 434
2 254
The list is a year-long list, thus counting 60 minutes * 24 hours * 365 days = 525600 observations.
I would like to add a new column called hour, which expresses the hour of the day (assuming minutes 0-59 are 12 AM, 60-119 are 1 AM, and so forth until the following day, where the sequence restarts).
Then, once the hour column is added, I would like to group observations by it and calculate the average value for every hour of the year, and end up with a dataframe with 24 observations, each expressing the average value of the original data at each hour n.
Using integer (floor) and remainder division you can get the hour. For example, minute 1500 gives 1500 // 60 = 25 full hours, and 25 % 24 = 1, i.e. 1 AM on the second day.
df['hour'] = df['minute'] // 60 % 24
If you want other date information it can be useful to use January 1st of some year (not a leap year) as the origin and convert to a datetime. Then you can grab a lot of the date attributes, in this case hour.
df['hour'] = pd.to_datetime(df['minute'], unit='m', origin='2017-01-01').dt.hour
Then for your averages you get the resulting 24 row Series with:
df.groupby('hour')['value'].mean()
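For illustration, a minimal self-contained sketch tying the two steps together (the values are synthetic):
import numpy as np
import pandas as pd

df = pd.DataFrame({'minute': np.arange(525600),
                   'value': np.random.rand(525600)})

df['hour'] = df['minute'] // 60 % 24            # hour of day, 0-23
hourly_avg = df.groupby('hour')['value'].mean()  # 24-row Series
print(hourly_avg.head())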
Here's another way to do it:
import numpy as np
import pandas as pd

# sample df
df = pd.DataFrame({'minute': np.arange(525600), 'value': np.arange(525600)})
# convert the minute counter to a timedelta
df['minute'] = pd.to_timedelta(df['minute'], unit='m')
# calculate the mean of each one-hour bin
df_new = df.groupby(pd.Grouper(key='minute', freq='1H'))['value'].mean().reset_index()
Although you don't need the hour column explicitly to calculate these values, if you want it you can extract it from the timedelta components:
df_new['hour'] = df_new['minute'].dt.components.hours
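Note that the Grouper above yields one row per hour of the year (8,760 bins), not 24. To collapse to the 24-row hour-of-day average the question asks for, group the binned result by the extracted hour:
# Average the 8,760 hourly bins down to 24 hour-of-day rows
df_new.groupby('hour')['value'].mean()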
To get the count of rows per weekday I do the following:
df['day'] = pandas.to_datetime(df['datetime']).dt.day_name()
print(df['day'].value_counts())
But how can I get the average number of rows per weekday if, for example, there are more Fridays in the data frame than Mondays?
Or asked differently: how can I divide each count by the number of times that weekday actually occurred?
To clarify:
If, for example, there have been 5 each of Monday through Thursday but only 4 each of Friday through Sunday, I would like to divide the counts for Monday through Thursday by 5 and the counts for Friday through Sunday by 4.
Assuming you just want the number of days:
num_days = df['day'].value_counts()
If you want percentages of days in the dataset:
df['day'].value_counts(normalize=True)
Taking this a step further, it looks like you want the number of days in your dataset versus the number of possible days.
# Create series for days in your dataframe
days_in_df = df['day'].value_counts()
# Create a dataframe with all days
start = '01/01/2019'
end = '01/31/2019'
all_days_df = pd.DataFrame(data={'datetime': pd.date_range(start=start, end=end, freq='d')})
all_days_df['all_days'] = all_days_df['datetime'].dt.day_name()
# Use that for value counts
all_days_count = all_days_df['all_days'].value_counts()
# We now merge them
result = pd.concat([all_days_count,days_in_df],axis=1,sort=True)
# Finally we can get the ratio
result['day']/result['all_days']
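Equivalently, you can skip the concat and divide the two Series directly, since pandas aligns them on the weekday-name index:
# Average rows per occurrence of each weekday
days_in_df.div(all_days_count)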
I have a file with one row per EMID per Effective Date. I need to find the maximum Effective date per EMID that occurred before a specific date. For instance, if EMID =1 has 4 rows, one for 1/1/16, one for 10/1/16, one for 12/1/16, and one for 12/2/17, and I choose the date 1/1/17 as my specific date, I'd want to know that 12/1/16 is the maximum date for EMID=1 that occurred before 1/1/17.
I know how to find the maximum date overall by EMID (groupby.max()). I also can filter the file to just dates before 1/1/17 and find the max of the remaining rows. However, ultimately I need the last row before 1/1/17, and then all the rows following 1/1/17, so filtering out the rows that occur after the date isn't optimal, because then I have to do complicated joins to get them back in.
import datetime
import random
import numpy as np
import pandas as pd

# Create dummy data
dummy = pd.DataFrame(columns=['EmID', 'EffectiveDate'])
dummy['EmID'] = [random.randint(1, 10000) for x in range(49999)]
dummy['EffectiveDate'] = [np.random.choice(pd.date_range(datetime.datetime(2016, 1, 1), datetime.datetime(2018, 1, 3))) for i in range(49999)]
#Create group by
g = dummy.groupby('EmID')['EffectiveDate']
# This doesn't work, but effectively shows what I'm trying to do
dummy['max_prestart'] = max(dt for dt in g if dt < datetime(2017,1,1))
I expect that output to be an additional column in my dataframe that has the maximum date that occurred before the specified date.
Using map after selecting the rows before the cutoff date:
s = dummy.loc[dummy.EffectiveDate < '2017-01-01'].groupby('EmID').EffectiveDate.max()
dummy['new'] = dummy.EmID.map(s)
Or using transform, falling back to the row's own EffectiveDate where no pre-cutoff date exists:
dummy['new'] = dummy.loc[dummy.EffectiveDate < '2017-01-01'].groupby('EmID').EffectiveDate.transform('max')
dummy['new'] = dummy['new'].fillna(dummy.EffectiveDate)
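A tiny self-contained check of the map variant (the frame is made up; note the < filter keeps only rows before the cutoff):
import pandas as pd

demo = pd.DataFrame({
    'EmID': [1, 1, 1, 2],
    'EffectiveDate': pd.to_datetime(['2016-01-01', '2016-12-01', '2017-12-02', '2017-06-01']),
})

s = demo.loc[demo.EffectiveDate < '2017-01-01'].groupby('EmID').EffectiveDate.max()
demo['max_prestart'] = demo.EmID.map(s)
print(demo)
# EmID 1 rows all get 2016-12-01; EmID 2 has no pre-2017 rows, so it gets NaT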