I'm new to python, and I have this assignment I have to deliver soon.
I have a .xlsx file that I've imported with pandas. It's a file from my workplace which tells us the day (mon - sat), time (from 10 am - 8 pm), sales per hour, visiting customers and customers that actually bought from the store (5 rows, 65 col). How can I get the total sales from each of the days? I tried to get the sum from monday by writing the cols from that day, but it wasn't accurate.
monday = (data['Sales per hour'][1:12].sum())
Is there a better way to sum the data from monday without having to write down the cols [1:12].sum())?
Here is a pic of the file I'm using. I want to get the total sum for each of the days and plot them into a histogram. I's also like to plot a comparison histogram between visiting customers and buying customers.
The file
#You can try Pandas's Group by to resolve your issue
First, rename the column for better use remove blank space from the name
data.rename(columns = {'Sales per hour':'Sales_per_hour'}, inplace = True)
Daywise_Data=data.groupby('Day').Sales_per_hour.sum().reset_index()
This will give you day-wise data into a separate data frame which can be used further to plot the histogram.
Related
I'm trying to add two-columns and trying to display their total in a new column and following as well
The total sum of sales in the month of Jan
The minimum sales amount in the month of Feb
The average (mean) sales for the month of Mar
and trying to create a data frame called d2 that only contains rows of data in d that don't have any missing (NaN) values
I have implemented the following code
import pandas as pd
new_val= pd.read_csv("/Users/mayur/574_repos_2019/ml-python-
class/assignments/data/assg-01-data.csv")
new_val['total'] = 'total'
new_val.to_csv('output.csv', index=False)
display(new_val)
d.head(5)# it's not showing top file lines of the .csv data
# .CSV file sample data
#account name street city state postal-code Jan Feb Mar total
#0118 Kerl, 3St . Waily Texas 28752.0 10000 62000 35000 total
#0118 mkrt, 1Wst. con Texas 22751.0 12000 88200 15000 total
It's giving me a total as a word.
When you used new_val['total'] = 'total' you basically told Pandas that you want a Column in your DataFrame called total where every variable is the string total.
What you want to fix is the variable assignment. For this I can give you quick and dirty solution that will hopefully make a more appealing solution be clearer to you.
You can iterate through your DataFrame and add the two columns to get the variable for the third.
for i,j in new_val.iterrows():
new_val.iloc[i]['total'] = new_val.iloc[i]['Jan'] + new_val.iloc[i]['Feb'] + new_val.iloc[i]['Mar']
Note, that this requires column total to have already been defined. This also requires iterating through your entire data set, so if your data set is large this is not the best option.
As mentioned by #Cavenfish, that new_val['total'] = 'total' creates a column total where value of every cell is the string total.
You should rather use new_val['total'] = new_val['Jan']+new_val['Feb']+new_val['Mar']
For treatment of NA values you can use a mask new_val.isna() which will generate boolean for all cells whether they are NA or not in your array. You can then apply any logic on top of it. For your example, the below should work:
new_val.isna().sum(axis=1)==4
Considering that you now have 4 columns in your dataframe Jan,Feb,Mar,total; it will return False in case one of the row contains NA. You can then apply this mask to new_val['total'] to assign default value in case NA is encountered in one of the columns for a row.
I have daily data and want to calculate 5 days, 30 days and 90 days moving average per user and write out to a CSV. New data comes in everyday. How do I calculate these averages for the new data only, assuming I will load the data frame with last 89 days data plus today's data.
date user daily_sales 5_days_MA 30_days_MV 90_days_MV
2019-05-01 1 34
2019-05-01 2 20
....
2019-07-18 .....
The number of rows per day is about 1 million. If data for 90days is too much, 30 days is OK
You can apply rolling() method on your dataset if it's in DataFrame format.
your_df['MA_30_days'] = df[where_to_apply].rolling(window = 30).mean()
If you need different window on which moving average will be calculated just change window parameter. In my example I used mean() to calculate but you can choose some other statistic as well.
This code will create another column named 'MA_30_days' with calculated moving average in your DataFrame.
You can also create another DataFrame where you will collect and loop over your dataset to calculate all moving averages and save it to CSV format as you wanted.
your_df.to_csv('filename.csv')
In your case to calculation should be consider only the newest data. If you want to perform this on latest data just slice it. However the very first rows will be NaN (depends on window).
df[where_to_apply][-90:].rolling(window = 30).mean()
This will calculate moving average on last 90 rows of specific column in some df and first 29 rows would be NaN. If your latest 90 rows should be all meaningful data than you can start calculation earlier than on last 90 rows - depends on window size.
if the df already contains yesterday's moving average, and just the new day's Simple MA is required, I would say use this approach:
MAlength=90
df.loc[day-1:'MA']=(
(df.loc[day-1:'MA']*MAlength) #expand yesterday's MA value
-df.loc[day-MAlength:'Price'] #remove oldest price
+df.loc[day-MAlength:'Price'] #add newest price
)/MAlength #re-average
The database I'm using shows the total number of debtors for every town for every quarter.
Since there's 43 towns listed, there's 43 'total debtors' per quarter (30-Sep-17, etc).
My goal is to find the total number of debtors for every quarter (so theoretically, finding the sum of every 43 'total debtors' listed) but I'm not quite sure how.
I've tried using the sum() function, but I'm sure how to make it so it only adds the total quarter by quarter.
Here's what the database looks like and my attempt (I printed the first 50 rows just to provide an idea of what it looks like)
https://i.imgur.com/h1y43j8.png
Sorry in advance if the explanation was a bit unclear.
You should use groupby. It's a nice pandas function to do exactly what you are trying to do. It groups the df according to whatever column you pick.
total_debtors_pq = df.groupby('Quarter end date')['Total number of debtors'].sum()
You can then extract the total for each quarter from total_debtors_pq.
I'm trying to find the maximum rainfall value for each season (DJF, MAM, JJA, SON) over a 10 year period. I am using netcdf data and xarray to try and do this. The data consists of rainfall (recorded every 3 hours), lat, and lon data. Right now I have the following code:
ds.groupby('time.season).max('time')
However, when I do it this way the output has a shape of (4,145,192) indicating that it's taking the maximum value for each season over the entire period. I would like the maximum for each individual season every year. In other words, output should have something with a shape like (40,145,192) (4 values for each year x 10 years)
I've looked into trying to do this with DataSet.resample as well using time=3M as the frequency, but then it doesn't split the months up correctly. If I have to I can alter the dataset, so it starts in the correct place, but I was hoping there would be an easier way considering there's already a function to group it correctly.
Thanks and let me know if you need anymore details!
Resample is going to be the easiest tool for this job. You are close with the time frequency but you probably want to use the quarterly frequency with an offset:
ds.resample(time='QS-Mar').max('time')
These offsets can be further configured as described in the Pandas documentation: http://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases
So far I've read in 2 CSV's and merged them based on a common element. I take the output of the merged CSV and iterate through the unique element they've been merged on. While I have them separated I want to generate a daily count line and a two week rolling average from the current date going backward. I cannot index based of the 'Date Opened' field but I still need my outputs organized by this with the most recent first. Once these are sorted by date my daily count plotting issue will be rectified. My remaining task would be to compute a two week rolling average for count within the week. I've looked into the Pandas documentation and I think the rolling_mean will work but the parameters of this function don't really make sense to me. I've tried biwk_avg = pd.rolling_mean(open_dt, 28) but that doesnt seem to work. I know there is an easier way to do this but I think I've hit a roadblock with the documentation available. The end result should look something like this graph. Right now my daily count graph isnt sorted(even though I think I've instructed it to) and is unusable in line form.
def data_sort():
data_merge = data_extract()
domains = data_merge.groupby('PWx Domain')
for domain in domains.groups.items():
dsort = (data_merge.loc[domain[1]])
print (dsort.head())
open_dt = pd.to_datetime(dsort['Date Opened']).dt.date
#open_dt.to_csv('output\''+str(domain)+'_out.csv', sep = ',')
open_ct = open_dt.value_counts(sort= False)
biwk_avg = pd.rolling_mean(open_ct, 28)
plt.plot(open_ct,'bo')
plt.show()
data_sort()
Rolling mean alone is not enough in your case; you need a combination of resampling (to group data by days) followed by a 14-day rolling mean (why do you use 28 in your code?). Something like thins:
for _,domain in data_merge.groupby('PWx Domain'):
# Convert date to the index
domain.index = pd.to_datetime(domain['Date Opened'])
# Sort dy dates
domain.sort_index(inplace=True)
# Do the averaging
rolling = pd.rolling_mean(domain.resample('1D').mean(), 14)
plt.plot(rolling,'bo')
plt.show()