pandas calculating mean per month - python

I created the following dataframe:
availability = pd.DataFrame(propertyAvailableData).set_index("createdat")
monthly_availability = availability.fillna(value=0).groupby(pd.TimeGrouper(freq='M'))
This gives the following output
2015-08-18 2015-09-09 2015-09-10 2015-09-11 2015-09-12 \
createdat
2015-08-12 1.0 1.0 1.0 1.0 1.0
2015-08-17 0.0 0.0 0.0 0.0 0.0
2015-08-18 0.0 1.0 1.0 1.0 1.0
2015-08-18 0.0 0.0 0.0 0.0 0.0
2015-08-19 0.0 1.0 1.0 1.0 1.0
2015-09-03 0.0 1.0 1.0 1.0 1.0
2015-09-03 0.0 1.0 1.0 1.0 1.0
2015-09-07 0.0 0.0 0.0 0.0 0.0
2015-09-08 0.0 0.0 0.0 0.0 0.0
2015-09-11 0.0 0.0 0.0 0.0 0.0
I'm trying to get the averages per created at month by doing:
monthly_availability_mean = monthly_availability.mean()
However, here I get the following output:
2015-08-18 2015-09-09 2015-09-10 2015-09-11 2015-09-12 \
createdat
2015-08-31 0.111111 0.444444 0.666667 0.777778 0.777778
2015-09-30 0.000000 0.222222 0.222222 0.222222 0.222222
2015-10-31 0.000000 0.000000 0.000000 0.000000 0.000000
But when I check August by hand for the first column, I get:
(1.0 + 0 + 0 + 0 + 0) / 5 = 0.2
How do I get the correct mean per month?

availability.resample('M').mean()
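A minimal sketch of what the monthly mean computes, on made-up data shaped like one column of the question (the column name `available` and the values are assumptions, not the asker's data). It groups by calendar month with `to_period("M")` instead of `resample`, which is a different route to the same per-month average. Each monthly value is the column's sum over that month divided by the number of rows in that month, so a hand check has to count every row of the frame for that month, which may be more than the rows visible in a printout.

```python
import pandas as pd

# Hypothetical data: one availability column indexed by creation date.
df = pd.DataFrame(
    {"available": [1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0]},
    index=pd.to_datetime([
        "2015-08-12", "2015-08-17", "2015-08-18", "2015-08-18", "2015-08-19",
        "2015-09-03", "2015-09-03", "2015-09-07",
    ]),
)

# Group rows by the calendar month of their index, then average each column.
monthly_mean = df.groupby(df.index.to_period("M")).mean()

# Mean for August 2015: one 1.0 among five rows.
august = monthly_mean.loc[pd.Period("2015-08", freq="M"), "available"]
print(monthly_mean)
```

In this toy frame August has five rows with a single 1.0, so its mean is 0.2.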

I just encountered the same issue and solved it with the following code:
# load the daily data
df = pd.read_csv('./name.csv')
# set Date as the index
df.Date = pd.to_datetime(df.Date)
df_date = df.set_index('Date', inplace=False)
# get the mean for each month
df_month = df_date.resample('M').mean()
# average the monthly means by calendar month (across years)
df_monthly_mean = df_month.groupby(df_month.index.month).mean()
Hope this was helpful!

Related

I am trying to write a For Loop in Python to identify types of sales for use with a 'sales report'

UPDATED - 4.13.22
I am new to programming python and am trying to create a program using For Loops that will go through a data frame by rows to identify different types of 'group sales' made up by different combinations of product sales and posting the results in a 'Result' column.
I was told in previous comments to print the df and paste it:
Date LFMIX SALE LCSIX SALE LOTIX SALE LSPIX SALE LEQIX SALE \
0 0.0 0.0 30000.0 0.0 0.0 0.0
1 0.0 0.0 30000.0 0.0 0.0 0.0
2 0.0 30000.0 0.0 0.0 0.0 0.0
3 0.0 25000.0 25000.0 0.0 0.0 0.0
4 0.0 30000.0 30000.0 0.0 0.0 0.0
5 0.0 30000.0 0.0 0.0 0.0 30000.0
6 0.0 0.0 30000.0 0.0 0.0 30000.0
7 0.0 25000.0 25000.0 0.0 0.0 25000.0
AUM LFMIX AUM LCSIX AUM LOTIX AUM LSPIX AUM LEQIX \
0 200000.0 0.0 0.0 0.0 0.0
1 500000.0 0.0 0.0 0.0 0.0
2 0.0 200000.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0 0.0
4 0.0 0.0 0.0 0.0 200000.0
5 0.0 200000.0 0.0 0.0 0.0
6 200000.0 0.0 0.0 0.0 0.0
7 0.0 0.0 0.0 0.0 0.0
is the sale = 10% of pairing fund AUM LFMIX LCSIX LOTIX LSPIX LEQIX \
0 0.0 1 1 0.0 0.0 0.0
1 0.0 1 1 0.0 0.0 0.0
2 0.0 1 1 0.0 0.0 0.0
3 0.0 1 1 0.0 0.0 0.0
4 0.0 1 1 0.0 0.0 1.0
5 0.0 1 1 0.0 0.0 1.0
6 0.0 1 1 0.0 0.0 1.0
7 0.0 1 1 0.0 0.0 1.0
Expected_Result Result
0 DP1
1 0
2 DP2
3 DP3
4 TT1
5 TT2
6 TT3
7 TT4
My Python code to classify just the 1st row:
for row in range(len(df)):
    if df["LCSIX"][row] >= df["AUM LFMIX"][row] * .1:
        df["Result"][row] = "DP1"
and the results:
Date LFMIX SALE LCSIX SALE LOTIX SALE LSPIX SALE LEQIX SALE \
0 0.0 0.0 30000.0 0.0 0.0 0.0
1 0.0 0.0 30000.0 0.0 0.0 0.0
2 0.0 30000.0 0.0 0.0 0.0 0.0
3 0.0 25000.0 25000.0 0.0 0.0 0.0
4 0.0 30000.0 30000.0 0.0 0.0 0.0
5 0.0 30000.0 0.0 0.0 0.0 30000.0
6 0.0 0.0 30000.0 0.0 0.0 30000.0
7 0.0 25000.0 25000.0 0.0 0.0 25000.0
AUM LFMIX AUM LCSIX AUM LOTIX AUM LSPIX AUM LEQIX \
0 200000.0 0.0 0.0 0.0 0.0
1 500000.0 0.0 0.0 0.0 0.0
2 0.0 200000.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0 0.0
4 0.0 0.0 0.0 0.0 200000.0
5 0.0 200000.0 0.0 0.0 0.0
6 200000.0 0.0 0.0 0.0 0.0
7 0.0 0.0 0.0 0.0 0.0
is the sale = 10% of pairing fund AUM LFMIX LCSIX LOTIX LSPIX LEQIX \
0 0.0 1 1 0.0 0.0 0.0
1 0.0 1 1 0.0 0.0 0.0
2 0.0 1 1 0.0 0.0 0.0
3 0.0 1 1 0.0 0.0 0.0
4 0.0 1 1 0.0 0.0 1.0
5 0.0 1 1 0.0 0.0 1.0
6 0.0 1 1 0.0 0.0 1.0
7 0.0 1 1 0.0 0.0 1.0
Expected_Result Result
0 DP1
1 0
2 DP2 DP1
3 DP3 DP1
4 TT1 DP1
5 TT2 DP1
6 TT3
7 TT4 DP1
As you can see, the code fails to identify row[0] as a DP1 and misidentifies other rows.
I am planning on coding 'For Loops' that will identify 17 different types of group sales; this is simply the 1st group I am trying to identify...
Thanks for the help.
When you're working with pandas, you need to think in terms of doing things with whole columns, NOT row by row, which is hopelessly slow in pandas. If you need to go row by row, then do all of that before you convert to pandas.
In this case, you need to set the "Result" column for all rows where your condition is met. This does that in one line:
df.loc[df["LCSIX"] >= df["AUM LFMIX"] * 0.1, "Result"] = "DP1"
So, we select the rows where the relation is true, and the "Result" column. Simple. ;)
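Since the question mentions 17 group types, one possible way to keep the whole-column approach is `numpy.select`, which takes a list of boolean conditions and a matching list of labels. The sketch below is only an illustration: the two rules are guesses at what "DP1" and "DP2" might mean (a sale of at least 10% of the paired fund's AUM, compared against the sale-amount columns), and the data is a three-row excerpt typed in by hand.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row excerpt using column names from the printed frame.
df = pd.DataFrame({
    "LFMIX SALE": [0.0, 0.0, 30000.0],
    "LCSIX SALE": [30000.0, 30000.0, 0.0],
    "AUM LFMIX": [200000.0, 500000.0, 0.0],
    "AUM LCSIX": [0.0, 0.0, 200000.0],
})

# Assumed rules (not confirmed by the question): the sale in one fund
# must be at least 10% of the AUM held in the paired fund.
conditions = [
    (df["AUM LFMIX"] > 0) & (df["LCSIX SALE"] >= df["AUM LFMIX"] * 0.1),
    (df["AUM LCSIX"] > 0) & (df["LFMIX SALE"] >= df["AUM LCSIX"] * 0.1),
]
labels = ["DP1", "DP2"]

# The first matching condition wins; rows matching none get the default.
df["Result"] = np.select(conditions, labels, default="0")
print(df["Result"].tolist())
```

Each further group type would be one more entry in `conditions` and `labels`, and the order of the list decides which label wins when several rules match.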

Add and fill missing columns with values of 0s in pandas matrix [python]

I have a matrix of the form :
movie_id 1 2 3 ... 1494 1497 1500
user_id
1600 1.0 0.0 1.0 ... 0.0 0.0 1.0
1601 1.0 0.0 0.0 ... 1.0 0.0 0.0
1602 0.0 0.0 0.0 ... 0.0 1.0 1.0
1603 0.0 0.0 1.0 ... 0.0 0.0 0.0
1604 1.0 0.0 0.0 ... 1.0 0.0 0.0
. ...
.
.
As you can see, even though there are 1500 movies in my dataset, some movies are missing because of the preprocessing my data has gone through.
What I want is to add all the columns (movie_ids) that are missing and fill them with 0 (I don't know exactly which movie_ids are missing). So, for example, I want a new matrix of the form:
movie_id 1 2 3 ... 1494 1495 1496 1497 1498 1499 1500
user_id
1600 1.0 0.0 1.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 1.0
1601 1.0 0.0 0.0 ... 1.0 0.0 0.0 0.0 0.0 0.0 0.0
1602 0.0 0.0 0.0 ... 0.0 0.0 0.0 1.0 0.0 0.0 1.0
1603 0.0 0.0 1.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1604 1.0 0.0 0.0 ... 1.0 0.0 0.0 0.0 0.0 0.0 0.0
. ...
.
.
Use DataFrame.reindex along axis=1 with fill_value=0 to conform the dataframe columns to a new index range:
df = df.reindex(range(df.columns.min(), df.columns.max() + 1), axis=1, fill_value=0)
Result:
movie_id 1 2 3 1498 1499 1500
user_id
1600 1.0 0.0 1.0 0 0 1.0
1601 1.0 0.0 0.0 0 0 0.0
1602 0.0 0.0 0.0 ... 0 0 1.0
1603 0.0 0.0 1.0 ... 0 0 0.0
1604 1.0 0.0 0.0 0 0 0.0
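One caveat with building the range from `df.columns.min()` and `df.columns.max()`: if the first or last movie id is itself among the missing ones, it will not be added. A small sketch of the same `reindex` idea with the full id range spelled out, assuming ids run from 1 to a known total (the toy matrix and `n_movies = 6` are made up for illustration):

```python
import pandas as pd

# Toy user x movie matrix in which movie ids 3, 4 and 6 are missing.
matrix = pd.DataFrame(
    [[1.0, 0.0, 1.0], [0.0, 1.0, 0.0]],
    index=pd.Index([1600, 1601], name="user_id"),
    columns=pd.Index([1, 2, 5], name="movie_id"),
)

n_movies = 6  # assumed total number of movies

# Conform the columns to 1..n_movies; new columns are filled with 0.0.
full = matrix.reindex(columns=range(1, n_movies + 1), fill_value=0.0)
print(full)
```

Existing columns keep their values and the result comes back in id order.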
I assume the matrix variable is named matrix:
n_moovies = 1500
moove_ids = matrix.columns
for moovie_id in range(1, n_moovies + 1):
    # iterate over the ids
    if moovie_id not in moove_ids:
        # if there's no such movie, create a column filled with zeros
        matrix[moovie_id] = 0

Subset pandas dataframe based on first non zero occurrence

Here is the sample dataframe:-
Trade_signal
2007-07-31 0.0
2007-08-31 0.0
2007-09-28 0.0
2007-10-31 0.0
2007-11-30 0.0
2007-12-31 0.0
2008-01-31 0.0
2008-02-29 0.0
2008-03-31 0.0
2008-04-30 0.0
2008-05-30 0.0
2008-06-30 0.0
2008-07-31 -1.0
2008-08-29 0.0
2008-09-30 -1.0
2008-10-31 -1.0
2008-11-28 -1.0
2008-12-31 0.0
2009-01-30 -1.0
2009-02-27 -1.0
2009-03-31 0.0
2009-04-30 0.0
2009-05-29 1.0
2009-06-30 1.0
2009-07-31 1.0
2009-08-31 1.0
2009-09-30 1.0
2009-10-30 0.0
2009-11-30 1.0
2009-12-31 1.0
1 represents buy and -1 represents sell. I want to subset the dataframe so that the new dataframe starts with first 1 occurrence. Expected Output:-
2009-05-29 1.0
2009-06-30 1.0
2009-07-31 1.0
2009-08-31 1.0
2009-09-30 1.0
2009-10-30 0.0
2009-11-30 1.0
2009-12-31 1.0
Please suggest the way forward. Apologies if this is a repeated question.
Simply do the following. Here df["Trade_signal"] is the column containing the buy/sell data; because the index holds dates, slice by label with .loc:
new_df = df.loc[df[df["Trade_signal"] == 1].index[0]:]
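A self-contained sketch of the same idea on a shortened, made-up version of the sample data. It uses `idxmax()` on the boolean mask to find the label of the first 1 and guards against the case where no 1 exists, which the one-liner above does not handle; it assumes the date index is sorted and unique.

```python
import pandas as pd

# Shortened, hypothetical version of the sample frame.
df = pd.DataFrame(
    {"Trade_signal": [-1.0, 0.0, 0.0, 1.0, 0.0, 1.0]},
    index=pd.to_datetime([
        "2008-07-31", "2008-08-29", "2009-04-30",
        "2009-05-29", "2009-10-30", "2009-11-30",
    ]),
)

is_buy = df["Trade_signal"] == 1

if is_buy.any():
    # idxmax() on a boolean Series returns the label of the first True.
    new_df = df.loc[is_buy.idxmax():]
else:
    # No buy signal at all: return an empty frame with the same columns.
    new_df = df.iloc[0:0]

print(new_df)
```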

Counting consonants and vowels in a split string

I read in a .csv file. I have the following data frame that counts vowels and consonants in a string in the column Description. This works great, but my problem is I want to split Description into 8 columns and count the consonants and vowels for each column. The second part of my code allows for me to split Description into 8 columns. How can I count the vowels and consonants on all 8 columns the Description is split into?
import pandas as pd
import re
def anti_vowel(s):
    result = re.sub(r'[AEIOU]', '', s, flags=re.IGNORECASE)
    return result
data = pd.read_csv('http://core.secure.ehc.com/src/util/detail-price-list/TristarDivision_SummitMedicalCenter_CM.csv')
data.dropna(inplace = True)
data['Vowels'] = data['Description'].str.count(r'[aeiou]', flags=re.I)
data['Consonant'] = data['Description'].str.count(r'[bcdfghjklmnpqrstvwxzy]', flags=re.I)
print (data)
This is the code I'm using to split the column Description into 8 columns.
import pandas as pd
data = pd.read_csv('http://core.secure.ehc.com/src/util/detail-price-list/TristarDivision_SummitMedicalCenter_CM.csv')
data.dropna(inplace = True)
data = data["Description"].str.split(" ", n = 8, expand = True)
print (data)
Now how can I put it all together?
To count the consonants and vowels in each of the 8 columns, I know I can use the following, replacing the 0 with 0-7:
testconsonant = data[0].str.count(r'[bcdfghjklmnpqrstvwxzy]', flags=re.I)
testvowel = data[0].str.count(r'[aeiou]', flags=re.I)
Desired output would be:
Description [0] vowel count consonant count Description [1] vowel count consonant count Description [2] vowel count consonant count Description [3] vowel count consonant count Description [4] vowel count consonant count all the way to description [7]
Stack, then unstack:
stacked = data.stack()
pd.concat({
    'Vowels': stacked.str.count('[aeiou]', flags=re.I),
    'Consonant': stacked.str.count('[bcdfghjklmnpqrstvwxzy]', flags=re.I)
}, axis=1).unstack()
Consonant Vowels
0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8
0 3.0 5.0 5.0 1.0 2.0 NaN NaN NaN NaN 1.0 0.0 0.0 0.0 0.0 NaN NaN NaN NaN
1 8.0 5.0 1.0 0.0 0.0 0.0 0.0 0.0 NaN 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 NaN
2 8.0 5.0 1.0 0.0 0.0 0.0 0.0 0.0 NaN 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 NaN
3 8.0 5.0 1.0 0.0 0.0 0.0 0.0 0.0 NaN 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 NaN
4 3.0 5.0 3.0 1.0 0.0 0.0 0.0 0.0 NaN 0.0 0.0 2.0 0.0 0.0 0.0 0.0 0.0 NaN
5 3.0 5.0 3.0 1.0 0.0 0.0 0.0 0.0 NaN 0.0 0.0 2.0 0.0 0.0 0.0 0.0 0.0 NaN
6 3.0 4.0 0.0 1.0 0.0 0.0 0.0 NaN NaN 3.0 1.0 0.0 0.0 0.0 0.0 0.0 NaN NaN
7 3.0 3.0 0.0 1.0 0.0 0.0 0.0 NaN NaN 3.0 1.0 0.0 1.0 0.0 0.0 0.0 NaN NaN
8 3.0 3.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 3.0 1.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0
9 3.0 3.0 0.0 1.0 0.0 0.0 0.0 NaN NaN 3.0 1.0 0.0 1.0 0.0 0.0 0.0 NaN NaN
10 3.0 3.0 0.0 1.0 0.0 0.0 0.0 0.0 NaN 3.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 NaN
11 3.0 3.0 0.0 2.0 2.0 NaN NaN NaN NaN 3.0 0.0 0.0 0.0 0.0 NaN NaN NaN NaN
12 3.0 3.0 0.0 1.0 0.0 0.0 0.0 0.0 NaN 3.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 NaN
13 3.0 3.0 0.0 2.0 2.0 NaN NaN NaN NaN 3.0 1.0 0.0 0.0 0.0 NaN NaN NaN NaN
14 3.0 5.0 0.0 2.0 0.0 0.0 0.0 0.0 0.0 3.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
15 3.0 3.0 0.0 3.0 1.0 NaN NaN NaN NaN 3.0 0.0 0.0 0.0 1.0 NaN NaN NaN NaN
If you want to combine this with the data dataframe, include the stacked values themselves so that all three pieces share the same index:
stacked = data.stack()
pd.concat({
    'Data': stacked,
    'Vowels': stacked.str.count('[aeiou]', flags=re.I),
    'Consonant': stacked.str.count('[bcdfghjklmnpqrstvwxzy]', flags=re.I)
}, axis=1).unstack()
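For anyone who wants to try the stack/unstack pattern without downloading the CSV, here is a small sketch on two invented descriptions (the strings and the `Word` key are assumptions for illustration). Stacking turns the split columns into one long Series, so each count runs once over all words, and unstacking spreads the results back out per word position.

```python
import re

import pandas as pd

# Two made-up descriptions standing in for the CSV column.
desc = pd.Series(["HC BLOOD TEST", "XRAY ARM"], name="Description")

# One column per word, then one long Series of words.
parts = desc.str.split(" ", expand=True)
stacked = parts.stack()

counts = pd.concat({
    "Word": stacked,
    "Vowels": stacked.str.count(r"[aeiou]", flags=re.I),
    "Consonant": stacked.str.count(r"[bcdfghjklmnpqrstvwxzy]", flags=re.I),
}, axis=1).unstack()

# Columns are (measure, word position) pairs, e.g. ("Vowels", 0).
print(counts)
```

Rows with fewer words than the longest description end up with NaN in the trailing positions, as in the output shown above.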

Losing Values Joining DataFrames

I can't work out why this code is dropping values
solddf[['Name', 'Barcode', 'SalesRank', 'SoldPrices', 'SoldDates', 'SoldIds']].head()
Out[3]:
Name Barcode \
62693 Near Dark [DVD] [1988] [Region 1] [US Import] ... 1.313124e+10
94823 Battlefield 2 Modern Combat / Game 1.463315e+10
24965 Star Wars: The Force Unleashed (PS3) 2.327201e+10
24964 Star Wars: The Force Unleashed (PS3) 2.327201e+10
24963 Star Wars: The Force Unleashed (PS3) 2.327201e+10
SalesRank SoldPrices SoldDates SoldIds
62693 14.04 2017-08-05 07:28:56 162558627930
94823 1.49 2017-09-06 04:48:42 132301267483
24965 4.29 2017-08-23 18:44:42 302424166550
24964 5.27 2017-09-08 19:55:02 132317908530
24963 5.56 2017-09-15 08:23:24 132322978130
Here's my dataframe. It stores each sale I pull from an eBay API as a new row.
My aim is to look for a correlation between weekly sales and Amazon's Sales Rank.
solddf['Week'] = solddf['SoldDates'].apply(lambda x: x.week)
weeklysales = solddf.groupby(['Barcode', 'Week']).size().unstack()
weeklysales = weeklysales.fillna(0)
weeklysales['Mean'] = weeklysales.mean(axis=1)
weeklysales.head()
Out[5]:
Week 29 30 31 32 33 34 35 36 37 38 39 40 41 \
Barcode
1.313124e+10 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1.463315e+10 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0
2.327201e+10 0.0 0.0 0.0 0.0 0.0 1.0 0.0 1.0 2.0 2.0 0.0 2.0 1.0
2.327201e+10 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0
2.327201e+10 0.0 0.0 3.0 2.0 2.0 2.0 1.0 1.0 5.0 0.0 2.0 2.0 1.0
Week 42 Mean
Barcode
1.313124e+10 0.0 0.071429
1.463315e+10 0.0 0.071429
2.327201e+10 0.0 0.642857
2.327201e+10 0.0 0.142857
2.327201e+10 0.0 1.500000
So, I've worked out the mean weekly sales for each item (or barcode)
I then want to take the mean values and insert them back into my solddf dataframe that I started with.
s1 = pd.Series(weeklysales.Mean, index=solddf.Barcode).reset_index()
s1 = s1.sort_values('Barcode')
s1.head()
Out[17]:
Barcode Mean
0 1.313124e+10 0.071429
1 1.463315e+10 0.071429
2 2.327201e+10 0.642857
3 2.327201e+10 0.642857
4 2.327201e+10 0.642857
This is looking fine, has the right number of rows and should fit
solddf = solddf.sort_values('Barcode')
solddf['WeeklySales'] = s1.Mean
This method seems to work, but some np.nan values have now appeared that weren't in s1 before
s1.Mean.isnull().sum()
Out[13]: 0
len(s1) == len(solddf)
Out[14]: True
But loads of my values that have passed across are now np.nan
solddf.WeeklySales.isnull().sum()
Out[16]: 27214
Can anyone tell me why?
While writing this I had an idea for a work-around
s1list = s1.Mean.tolist()
solddf['WeeklySales'] = s1list
solddf.WeeklySales.isnull().sum()
Out[20]: 0
Still curious what the problem with the previous method is though!
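Instead of trying to align the two indices and insert the new column by assignment, you should just use pd.merge. Because s1 holds one row per sale, drop its duplicate barcodes first so the merge does not multiply rows:
output = pd.merge(solddf, s1.drop_duplicates('Barcode'), on='Barcode')
This way you can also select the type of join you would like using the how kwarg.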
I would also advise reading Merge, join, and concatenate as it covers a lot of helpful methods for combining dataframes.
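As for why the NaN values appeared: assigning a Series to a DataFrame column aligns on index labels, not on position. After `reset_index()`, `s1` is labelled 0, 1, 2, ... while `solddf` keeps its original row labels, so only labels that happen to exist in both get a value. The list work-around succeeds because a plain list has no index and is assigned by position. A minimal sketch with made-up barcodes and means (all values here are hypothetical):

```python
import pandas as pd

# solddf keeps its original, non-consecutive row labels.
solddf = pd.DataFrame(
    {"Barcode": [111.0, 222.0, 222.0]},
    index=[62693, 24965, 24964],
)

# s1 was rebuilt with reset_index(), so its labels are 0, 1, 2.
s1 = pd.DataFrame({"Barcode": [111.0, 222.0, 222.0],
                   "Mean": [0.07, 0.64, 0.64]})

# Assignment aligns on labels: none of 0, 1, 2 exist in solddf -> all NaN.
aligned = solddf.copy()
aligned["WeeklySales"] = s1["Mean"]

# A list or NumPy array is assigned by position instead.
by_position = solddf.copy()
by_position["WeeklySales"] = s1["Mean"].to_numpy()

# Looking each barcode up by key avoids relying on row order at all.
mean_by_barcode = s1.drop_duplicates("Barcode").set_index("Barcode")["Mean"]
mapped = solddf["Barcode"].map(mean_by_barcode)

print(aligned["WeeklySales"].isna().sum(), by_position["WeeklySales"].isna().sum())
```

The positional version only gives correct values when both frames are sorted the same way; the key-based lookup does not depend on that.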
