I can't work out why this code is dropping values
solddf[['Name', 'Barcode', 'SalesRank', 'SoldPrices', 'SoldDates', 'SoldIds']].head()
Out[3]:
Name Barcode \
62693 Near Dark [DVD] [1988] [Region 1] [US Import] ... 1.313124e+10
94823 Battlefield 2 Modern Combat / Game 1.463315e+10
24965 Star Wars: The Force Unleashed (PS3) 2.327201e+10
24964 Star Wars: The Force Unleashed (PS3) 2.327201e+10
24963 Star Wars: The Force Unleashed (PS3) 2.327201e+10
SalesRank SoldPrices SoldDates SoldIds
62693 14.04 2017-08-05 07:28:56 162558627930
94823 1.49 2017-09-06 04:48:42 132301267483
24965 4.29 2017-08-23 18:44:42 302424166550
24964 5.27 2017-09-08 19:55:02 132317908530
24963 5.56 2017-09-15 08:23:24 132322978130
Here's my dataframe. It stores each sale I pull from an eBay API as a new row.
My aim is to look for a correlation between weekly sales and Amazon's Sales Rank.
solddf['Week'] = solddf['SoldDates'].apply(lambda x: x.week)
weeklysales = solddf.groupby(['Barcode', 'Week']).size().unstack()
weeklysales = weeklysales.fillna(0)
weeklysales['Mean'] = weeklysales.mean(axis=1)
weeklysales.head()
Out[5]:
Week 29 30 31 32 33 34 35 36 37 38 39 40 41 \
Barcode
1.313124e+10 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1.463315e+10 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0
2.327201e+10 0.0 0.0 0.0 0.0 0.0 1.0 0.0 1.0 2.0 2.0 0.0 2.0 1.0
2.327201e+10 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0
2.327201e+10 0.0 0.0 3.0 2.0 2.0 2.0 1.0 1.0 5.0 0.0 2.0 2.0 1.0
Week 42 Mean
Barcode
1.313124e+10 0.0 0.071429
1.463315e+10 0.0 0.071429
2.327201e+10 0.0 0.642857
2.327201e+10 0.0 0.142857
2.327201e+10 0.0 1.500000
So I've worked out the mean weekly sales for each item (barcode).
I then want to take those mean values and insert them back into the solddf dataframe I started with.
s1 = pd.Series(weeklysales.Mean, index=solddf.Barcode).reset_index()
s1 = s1.sort_values('Barcode')
s1.head()
Out[17]:
Barcode Mean
0 1.313124e+10 0.071429
1 1.463315e+10 0.071429
2 2.327201e+10 0.642857
3 2.327201e+10 0.642857
4 2.327201e+10 0.642857
This is looking fine: it has the right number of rows and should fit.
solddf = solddf.sort_values('Barcode')
solddf['WeeklySales'] = s1.Mean
This method seems to work, but I'm having an issue: some np.nan values have now appeared that weren't in s1 before.
s1.Mean.isnull().sum()
Out[13]: 0
len(s1) == len(solddf)
Out[14]: True
But loads of my values that have passed across are now np.nan
solddf.WeeklySales.isnull().sum()
Out[16]: 27214
Can anyone tell me why?
While writing this I had an idea for a work-around:
s1list = s1.Mean.tolist()
solddf['WeeklySales'] = s1list
solddf.WeeklySales.isnull().sum()
Out[20]: 0
Still curious what the problem with the previous method is though!
Instead of trying to align the two indices and inserting the new column, you should just use pd.merge. The NaN values appear because column assignment aligns on index labels: after reset_index, s1 has a default RangeIndex (0, 1, 2, ...), while solddf keeps its original row labels, so only the labels the two frames share receive values. Your list work-around succeeds because a plain list has no index, so its values are assigned purely by position.
output = pd.merge(solddf, s1, on='Barcode')
This way you can select the type of join you would like to do as well using the how kwarg.
I would also advise reading Merge, join, and concatenate as it covers a lot of helpful methods for combining dataframes.
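One caveat worth flagging: both solddf and s1 repeat barcodes, so a plain merge on Barcode multiplies the repeated rows. A minimal sketch that avoids this, assuming weeklysales is still indexed by Barcode as above:
# s1 carries one row per sale, so deduplicate the key before merging
output = pd.merge(solddf, s1.drop_duplicates('Barcode'), on='Barcode', how='left')
# or skip the intermediate series entirely and map Barcode -> Mean
solddf['WeeklySales'] = solddf['Barcode'].map(weeklysales['Mean'])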
Related
I have a dataframe that looks like this, with many more date columns
AUTHOR 2022-07-01 2022-10-14 2022-10-15 .....
0 Kathrine 0.0 7.0 0.0
1 Catherine 0.0 13.0 17.0
2 Amanda Jane 0.0 0.0 0.0
3 Jaqueline 0.0 3.0 0.0
4 Christine 0.0 0.0 0.0
I would like to set values in each column after the AUTHOR to 1 when the value is greater than 0, so the resulting table would look like this:
AUTHOR 2022-07-01 2022-10-14 2022-10-15 .....
0 Kathrine 0.0 1.0 0.0
1 Catherine 0.0 1.0 1.0
2 Amanda Jane 0.0 0.0 0.0
3 Jaqueline 0.0 1.0 0.0
4 Christine 0.0 0.0 0.0
I tried the following line of code but got an error, which makes sense, as I need to figure out how to apply this just to the date columns while keeping the AUTHOR column in my table.
Counts[Counts != 0] = 1
TypeError: Cannot do inplace boolean setting on mixed-types with a non np.nan value
You can select the date columns first, then mask on those columns:
cols = df.drop(columns='AUTHOR').columns
# or
cols = df.filter(regex=r'\d{4}-\d{2}-\d{2}').columns
# or
cols = df.select_dtypes(include='number').columns
df[cols] = df[cols].mask(df[cols] != 0, 1)
print(df)
AUTHOR 2022-07-01 2022-10-14 2022-10-15
0 Kathrine 0.0 1.0 0.0
1 Catherine 0.0 1.0 1.0
2 Amanda Jane 0.0 0.0 0.0
3 Jaqueline 0.0 1.0 0.0
4 Christine 0.0 0.0 0.0
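If you prefer plain boolean arithmetic over mask, a gt/astype chain gives the same 0/1 result (a minimal sketch, assuming the date columns are all numeric):
# True where the count is positive, then cast back to float 0.0/1.0
df[cols] = df[cols].gt(0).astype(float)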
Since you would like to exclude only the first column, you can set it as the index, create your booleans, and reset the index at the end. Note that the result needs to be assigned back:
df = df.set_index('AUTHOR').pipe(lambda g: g.mask(g > 0, 1)).reset_index()
df
        AUTHOR  2022-07-01  2022-10-14  2022-10-15
0     Kathrine         0.0         1.0         0.0
1    Catherine         0.0         1.0         1.0
2  Amanda Jane         0.0         0.0         0.0
3    Jaqueline         0.0         1.0         0.0
4    Christine         0.0         0.0         0.0
I have a small dataset, for example :
import pandas as pd
df = pd.DataFrame({'a': [1,2,3,4,5,6,7,8,9,10], 'b': [11,22,11,22,33,11,22,44,11,22]})
df
I want to find out the co-occurrence of column b values for the column a.
What I tried :
df_co = pd.get_dummies(df.a).groupby(df.b).apply(max)
df_co
But this is not a co-occurrence matrix. So I also tried this:
df_co.T.dot(df_co)
which gives me a 10x10 matrix indexed by the values of a.
Is this a correct method to calculate the co-occurrence matrix?
You can use df.pivot with a dummy column to represent count=1:
df.assign(v=1).pivot(index='a', columns='b').fillna(0)
v
b 11 22 33 44
a
1 1.0 0.0 0.0 0.0
2 0.0 1.0 0.0 0.0
3 1.0 0.0 0.0 0.0
4 0.0 1.0 0.0 0.0
5 0.0 0.0 1.0 0.0
6 1.0 0.0 0.0 0.0
7 0.0 1.0 0.0 0.0
8 0.0 0.0 0.0 1.0
9 1.0 0.0 0.0 0.0
10 0.0 1.0 0.0 0.0
Or, as @Quang Hoang suggested, try pd.crosstab.
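For instance, a minimal sketch: crosstab builds the incidence of a against b, and multiplying it by its own transpose yields the co-occurrence of a values that share a b value:
inc = pd.crosstab(df.a, df.b)  # 10x4 incidence matrix
co = inc.dot(inc.T)            # 10x10 co-occurrence of a values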
I have a matrix of the form :
movie_id 1 2 3 ... 1494 1497 1500
user_id
1600 1.0 0.0 1.0 ... 0.0 0.0 1.0
1601 1.0 0.0 0.0 ... 1.0 0.0 0.0
1602 0.0 0.0 0.0 ... 0.0 1.0 1.0
1603 0.0 0.0 1.0 ... 0.0 0.0 0.0
1604 1.0 0.0 0.0 ... 1.0 0.0 0.0
. ...
.
.
As you can see, even though there are 1500 movies in my dataset, some movies haven't been recorded because of the preprocessing my data has gone through.
What I want is to add and fill all the columns (movie_ids) that haven't been recorded with values of 0 (I don't know exactly which movie_ids are missing). So, for example, I want a new matrix of the form:
movie_id 1 2 3 ... 1494 1495 1496 1497 1498 1499 1500
user_id
1600 1.0 0.0 1.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 1.0
1601 1.0 0.0 0.0 ... 1.0 0.0 0.0 0.0 0.0 0.0 0.0
1602 0.0 0.0 0.0 ... 0.0 0.0 0.0 1.0 0.0 0.0 1.0
1603 0.0 0.0 1.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1604 1.0 0.0 0.0 ... 1.0 0.0 0.0 0.0 0.0 0.0 0.0
. ...
.
.
Use DataFrame.reindex along axis=1 with fill_value=0 to conform the dataframe columns to a new index range:
df = df.reindex(range(df.columns.min(), df.columns.max() + 1), axis=1, fill_value=0)
Result:
movie_id    1    2    3  ...  1498  1499  1500
user_id
1600      1.0  0.0  1.0  ...     0     0   1.0
1601      1.0  0.0  0.0  ...     0     0   0.0
1602      0.0  0.0  0.0  ...     0     0   1.0
1603      0.0  0.0  1.0  ...     0     0   0.0
1604      1.0  0.0  0.0  ...     0     0   0.0
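If the ids are known to run from 1 to 1500 regardless of which ones survived the preprocessing (columns.min()/max() would miss any ids cut off at either end), you can pass the full range explicitly; a sketch assuming that fixed range:
df = df.reindex(range(1, 1501), axis=1, fill_value=0)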
I assume the variable name of the matrix is matrix:
n_moovies = 1500
moove_ids = matrix.columns
for moovie_id in range(1, n_moovies + 1):
    # iterate over all possible ids
    if moovie_id not in moove_ids:
        # if there's no such movie, create a column filled with zeros
        matrix[moovie_id] = 0
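Since each missing id is appended as a new rightmost column, you will probably want to restore the ascending column order afterwards:
matrix = matrix.sort_index(axis=1)  # put the newly added ids back in numeric order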
I have a Pandas Dataframe which tells me monthly sales of items in shops
df.head():
ID month sold
0 150983 0 1.0
1 56520 0 13.0
2 56520 1 7.0
3 56520 2 13.0
4 56520 3 8.0
I want to remove all IDs where there were no sales last month, i.e. month == 33 & sold == 0. Doing the following
unwanted_df = df[((df['month'] == 33) & (df['sold'] == 0.0))]
I get just 46 rows, which is far too few. But never mind, I would like to have the data in a different format anyway. A pivoted version of the above table is just what I want:
pivoted_df = df.pivot(index='month', columns = 'ID', values = 'sold').fillna(0)
pivoted_df.head()
ID 0 2 3 5 6 7 8 10 11 12 ... 214182 214185 214187 214190 214191 214192 214193 214195 214197 214199
month
0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0
1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
Question. How to remove columns with the value 0 in the last row in pivoted_df?
You can do this in one line:
pivoted_df = pivoted_df.drop(pivoted_df.columns[pivoted_df.iloc[-1, :] == 0], axis=1)
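An equivalent selection keeps the columns whose last row is non-zero rather than dropping the zero ones, which reads a little more directly:
pivoted_df = pivoted_df.loc[:, pivoted_df.iloc[-1] != 0]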
I want to remove all IDs where there were no sales last month
You can first calculate the IDs satisfying your condition:
id_selected = df.loc[(df['month'] == 33) & (df['sold'] == 0), 'ID']
Then filter these from your dataframe via a Boolean mask:
df = df[~df['ID'].isin(id_selected)]
Finally, use pd.pivot_table with your filtered dataframe.
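A sketch of that last step: pivot_table accepts fill_value directly, so the separate fillna(0) is no longer needed:
pivoted_df = df.pivot_table(index='month', columns='ID', values='sold', fill_value=0)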
I created the following dataframe:
availability = pd.DataFrame(propertyAvailableData).set_index("createdat")
monthly_availability = availability.fillna(value=0).groupby(pd.Grouper(freq='M'))
This gives the following output
2015-08-18 2015-09-09 2015-09-10 2015-09-11 2015-09-12 \
createdat
2015-08-12 1.0 1.0 1.0 1.0 1.0
2015-08-17 0.0 0.0 0.0 0.0 0.0
2015-08-18 0.0 1.0 1.0 1.0 1.0
2015-08-18 0.0 0.0 0.0 0.0 0.0
2015-08-19 0.0 1.0 1.0 1.0 1.0
2015-09-03 0.0 1.0 1.0 1.0 1.0
2015-09-03 0.0 1.0 1.0 1.0 1.0
2015-09-07 0.0 0.0 0.0 0.0 0.0
2015-09-08 0.0 0.0 0.0 0.0 0.0
2015-09-11 0.0 0.0 0.0 0.0 0.0
I'm trying to get the averages per createdat month by doing:
monthly_availability_mean = monthly_availability.mean()
However, here I get the following output:
2015-08-18 2015-09-09 2015-09-10 2015-09-11 2015-09-12 \
createdat
2015-08-31 0.111111 0.444444 0.666667 0.777778 0.777778
2015-09-30 0.000000 0.222222 0.222222 0.222222 0.222222
2015-10-31 0.000000 0.000000 0.000000 0.000000 0.000000
And when I hand-check August I get:
(1.0 + 0 + 0 + 0 + 0) / 5 = 0.2
How do I get the correct mean per month?
availability.resample('M').mean()
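In full, keeping the fillna from the question (a minimal sketch, assuming availability carries the DatetimeIndex shown above):
# resample('M') buckets rows by the calendar month of the index,
# so each createdat row contributes to exactly one monthly mean
monthly_availability_mean = availability.fillna(value=0).resample('M').mean()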
I just encountered the same issue and solved it with the following code
#load data daily
df = pd.read_csv('./name.csv')
#set Date as index
df.Date = pd.to_datetime(df.Date)
df_date = df.set_index('Date', inplace=False)
#get monthly mean
df_month = df_date.resample('M').mean()
#group the same calendar month across years
df_monthly_mean = df_month.groupby(df_month.index.month).mean()
Hope that this was helpful!