import pandas as pd
import numpy as np
# Show the specified columns and save it to a new file
col_list = ["STATION", "NAME", "DATE", "AWND", "SNOW"]
df = pd.read_csv('Data.csv', usecols=col_list)
df.to_csv('filteredData.csv')
df['year'] = pd.DatetimeIndex(df['DATE']).year
df2016 = df[(df.year==2016)]
df_2016 = df2016.groupby(['NAME', 'DATE'])['SNOW'].mean()
df_2016.to_csv('average2016.csv')
How come my dates are not ordered correctly here? Row 12 should be at the top, but it ends up at the bottom of May instead, and the same goes for row 25.
The average of SNOW per NAME/month is also not showing up in my Excel sheet. Why is that? Basically, I'm trying to calculate the average SNOW for May in ADA 0.7 SE, MI US, then the average SNOW for June in ADA 0.7 SE, MI US, and so on.
I've spent all day on this and this is all I have. Any help will be appreciated. Thanks in advance.
original data
https://gofile.io/?c=1gpbyT
Please try
Data
df = pd.read_csv(r'directory where the data is\data.csv')
df
Working
df.dtypes  # checking the datatype of each column
df.columns  # listing the columns
df['DATE'] = pd.to_datetime(df['DATE'])  # converting DATE from object to a datetime format
df.set_index(df['DATE'], inplace=True)  # setting the date as the index
df['SNOW'] = df['SNOW'].fillna(0)  # filling all NaN values with zeros to make aggregation possible (note the assignment, otherwise fillna has no effect)
df['SnowMean'] = df.groupby([df.index.month, df.NAME])['SNOW'].transform('mean')  # group by month and NAME and calculate the mean of SNOW, stored in a new column df['SnowMean']
df
Checking
df.loc[:, ['DATE', 'NAME', 'SnowMean']]  # slice relevant columns to check (the month is part of DATE)
I realize you have multiple years. If you want the mean per month within each year, extract the year as well and add it to the keys passed to groupby, as follows:
df['SnowMeanPerYearPerMonth']=df.groupby([df.index.month,df.index.year,df.NAME])['SNOW'].transform('mean')
df
Check again
pd.set_option('display.max_rows', 999)  # display up to 999 rows to check
df.loc[:, ['DATE', 'NAME', 'SnowMean', 'SnowMeanPerYearPerMonth']]  # slice relevant columns to check
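A side note, not from the original answer: transform('mean') repeats the group mean on every daily row. If you only want one row per year/month/NAME, as in the average2016.csv attempt, a plain groupby aggregation is a possible sketch (the output filename here is just an example):
monthly_mean = (df.assign(Year=df.index.year, Month=df.index.month)
                  .groupby(['Year', 'Month', 'NAME'])['SNOW']
                  .mean()
                  .reset_index())
monthly_mean.to_csv('average_per_month.csv', index=False)  # hypothetical filename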
Related
I have a data set where the index is 30-minute data from Monday to Friday. There might be some missing dates (possibly because of holidays), but I would like to find the highest value from column high and the lowest value from column low for the previous week, for every week. For example, when calculating for today, it is the previous week's high and low that I'm after (marked in yellow in the attached image).
I tried using rolling and resampling, but somehow it's not working. Can anyone help?
You really should add sample data to your question (by that I mean a piece of code/text that can easily be used to create a dataframe for illustrating how the proposed solution works).
Here's a suggestion. With df your dataframe, and column datetime containing datetimes (and not strings):
df["week"] = (
df["datetime"].dt.isocalendar().year.astype(str)
+ df["datetime"].dt.isocalendar().week.astype(str)
)
mask = df["high"] == df.groupby("week")["high"].transform("max")
df = df.merge(
df[mask].rename(columns={"low": "high_low"})
.groupby("week").agg({"high_low": "min"}).shift(),
on="week", how="left"
).drop(columns="week")
Add a week column to df (year + week) for grouping along weeks.
Extract the rows with the weekly maximum highs by mask (there could be more than one for a week).
Build a corresponding dataframe with the weekly minimum of the lows corresponding to the weekly maximum highs (column named high_low), shift it once to get the value from the previous week, and .merge it to df.
If column datetime doesn't contain datetimes:
df["datetime"] = pd.to_datetime(df["datetime"])
If I have understood correctly, the solution should be:
get the week number from the date
group by the week number and fetch the max high and min low
group by the week and fetch the max date to get the last date of each week
merge all the dataframes into one on the date key
Once these steps are done, you could do any formatting as required; a rough sketch follows.
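A simplified sketch of these steps (not from the original post), assuming the same hypothetical datetime/high/low columns as in the previous answer; it only covers the grouping and merge steps and ignores year boundaries in the week number:
import pandas as pd

df["datetime"] = pd.to_datetime(df["datetime"])
df["week"] = df["datetime"].dt.isocalendar().week.astype(int)

# weekly max of "high" and min of "low"
weekly = df.groupby("week").agg(prev_week_high=("high", "max"),
                                prev_week_low=("low", "min"))

# shift by one row so each week is paired with the previous week's values,
# then merge back onto the original frame on the week key
df = df.merge(weekly.shift().reset_index(), on="week", how="left")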
Looking to clean multiple data sets in a more automated way. The current format has the year in one column and each month as its own column, with the numbers as values.
Below is an example of the current format, the original data has multiple years/months.
Current Format:

Year  Jan  Feb
2022  300  200
Below is an example of how I would like the new format to look. It combines month and year into one column and transposes the numbers into another column.
How would I go about doing this in Excel or Python? I have files with many years and multiple months.
New Format:

Date     Number
2022-01  300
2022-02  200
Check the solution below. You will need to extend month_df to cover all the months; it currently only caters to the example.
import pandas as pd

df = pd.DataFrame({'Year': [2022], 'Jan': [300], 'Feb': [200]})
month_df = pd.DataFrame({'Char_Month': ['Jan', 'Feb'], 'Int_Month': ['01', '02']})

melted_df = pd.melt(df, id_vars=['Year'], value_vars=['Jan', 'Feb'],
                    var_name='Char_Month', value_name='Number')

pd.merge(melted_df, month_df, on='Char_Month')\
    .assign(Year=lambda d: d['Year'].astype(str) + '-' + d['Int_Month'])\
    [['Year', 'Number']]
Output:
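For the two-row example above, the result should look roughly like this:

      Year  Number
0  2022-01     300
1  2022-02     200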
I'm trying to add two columns and display their total in a new column, along with the following:
The total sum of sales in the month of Jan
The minimum sales amount in the month of Feb
The average (mean) sales for the month of Mar
I'm also trying to create a data frame called d2 that only contains the rows of data in d that don't have any missing (NaN) values.
I have implemented the following code:
import pandas as pd
new_val = pd.read_csv("/Users/mayur/574_repos_2019/ml-python-class/assignments/data/assg-01-data.csv")
new_val['total'] = 'total'
new_val.to_csv('output.csv', index=False)
display(new_val)
d.head(5)  # it's not showing the top five lines of the .csv data
# .CSV file sample data
#account name street city state postal-code Jan Feb Mar total
#0118 Kerl, 3St . Waily Texas 28752.0 10000 62000 35000 total
#0118 mkrt, 1Wst. con Texas 22751.0 12000 88200 15000 total
It's giving me the word 'total' in the total column instead of the actual sum.
When you used new_val['total'] = 'total' you basically told pandas that you want a column in your DataFrame called total where every value is the string 'total'.
What you want to fix is the value assignment. For this I can give you a quick and dirty solution that will hopefully make a more elegant solution clearer to you.
You can iterate through your DataFrame and add the month columns together to get the value for the total column.
for i, row in new_val.iterrows():
    new_val.loc[i, 'total'] = row['Jan'] + row['Feb'] + row['Mar']
Note that this iterates through your entire data set, so if your data set is large this is not the best option.
As mentioned by #Cavenfish, new_val['total'] = 'total' creates a column total where the value of every cell is the string 'total'.
You should rather use new_val['total'] = new_val['Jan'] + new_val['Feb'] + new_val['Mar']
For the treatment of NA values you can use a mask, new_val.isna(), which generates a boolean for every cell indicating whether it is NA or not. You can then apply any logic on top of it. For your example, the below should work:
new_val[['Jan', 'Feb', 'Mar', 'total']].isna().any(axis=1)
Considering the four columns Jan, Feb, Mar and total, this returns True for any row in which at least one of them is NA. You can use this mask to assign a default value to new_val['total'] for those rows, or invert it to keep only the complete rows.
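And as a small sketch for the d2 part of the question (rows without any missing values), assuming d is the frame read from the CSV:
mask = d.isna().any(axis=1)  # True for rows with at least one NaN
d2 = d[~mask]                # keep only complete rows
# equivalently: d2 = d.dropna()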
I have a large (10m+ rows) dataframe with three columns: sales dates (dtype: datetime64[ns]), customer names and sales per customer. Sales dates include day, month and year in the form yyyy-mm-dd (i.e. 2019-04-19). I discovered the pandas to_period function and would like to use the period[A-MAR] dtype. As the business year (ending in March) differs from the calendar year, that is exactly what I was looking for. With the to_period function I can assign the respective sales dates to the correct business year while avoiding creating new columns with additional information.
I convert the date column as follows:
df_input['Date'] = pd.DatetimeIndex(df_input['Date']).to_period("A-MAR")
Now a peculiar issue arises when I use pivot_table to aggregate the data and set margins=True. The aggfunc returns the correct values in the output table. However, the results in the last row (the total row created by the margins) are wrong, as NaN is shown (or, in my case, 0 because I set fill_value=0). The function I use:
df_output = df_input.pivot_table(index="Customer",
                                 columns="Date",
                                 values="Sales",
                                 aggfunc={"Sales": np.sum},
                                 fill_value=0,
                                 margins=True)
When I do not convert the dates to a period but use a simple year (integer) instead, the margins are calculated correctly and no NaN appears in the last row of the pivot output table.
I searched all over the internet but could not find a solution that works. I would like to keep working with the period datatype and just need the margins to be calculated correctly. I hope someone can help me out here. Thank you!
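One possible workaround, offered purely as a sketch since it is not from the original post: cast the period column to plain strings before pivoting, so the column labels keep the business-year grouping but the margins row is computed on an ordinary object column. Whether this actually resolves the NaN in the margins may depend on the pandas version; the Date_str column name is hypothetical.
df_input["Date_str"] = df_input["Date"].astype(str)  # hypothetical helper column with the business-year label as text
df_output = df_input.pivot_table(index="Customer",
                                 columns="Date_str",
                                 values="Sales",
                                 aggfunc="sum",
                                 fill_value=0,
                                 margins=True)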
I have census data that looks like this for a full month, and I want to find out how many unique inmates there were for the month. The information is collected daily, so there are duplicates.
_id,Date,Gender,Race,Age at Booking,Current Age
1,2016-06-01,M,W,32,33
2,2016-06-01,M,B,25,27
3,2016-06-01,M,W,31,33
My method right now is to group them by day and then add the ones that are not yet accounted for into the DataFrame. My question is how to account for two people with the same info: would one of them not get added to the new DataFrame because the other already exists? I'm trying to figure out how many people in total were in the prison during this time.
_id is incremental, for example here is some data from the second day
2323,2016-06-02,M,B,20,21
2324,2016-06-02,M,B,44,45
2325,2016-06-02,M,B,22,22
2326,2016-06-02,M,B,38,39
link to the dataset here: https://data.wprdc.org/dataset/allegheny-county-jail-daily-census
You could use df.drop_duplicates(), which returns the DataFrame with only unique rows, and then count the entries.
Something like this should work:
import pandas as pd
df = pd.read_csv('inmates_062016.csv', index_col=0, parse_dates=True)
uniqueDF = df.drop_duplicates()
countUniques = len(uniqueDF.index)
print(countUniques)
Result:
>> 11845
Pandas drop_duplicates Documentation
Inmates June 2016 CSV
The problem with this approach / data is that there could be many individual inmates who share the same age / gender / race and would therefore be filtered out.
I think the trick here is to groupby as much as possible and check the differences in those (small) groups through the month:
inmates = pd.read_csv('inmates.csv')
# group by everything except _id and count number of entries
grouped = inmates.groupby(
    ['Gender', 'Race', 'Age at Booking', 'Current Age', 'Date']).count()
# pivot the dates out and transpose - this give us the number of each
# combination for each day
grouped = grouped.unstack().T.fillna(0)
# get the difference between each day of the month - the assumption here
# being that a negative number means someone left, 0 means that nothing
# has changed and positive means that someone new has come in. As you
# mentioned yourself, that isn't necessarily true
diffed = grouped.diff()
# replace the first day of the month with the grouped numbers to give
# the number in each group at the start of the month
diffed.iloc[0, :] = grouped.iloc[0, :]
# sum only the positive numbers in each row to count those that have
# arrived but ignore those that have left
diffed['total'] = diffed.apply(lambda x: x[x > 0].sum(), axis=1)
# sum total column
diffed['total'].sum() # 3393