Conditionally Set Values Greater Than 0 To 1 - python

I have a dataframe that looks like this, with many more date columns
AUTHOR 2022-07-01 2022-10-14 2022-10-15 .....
0 Kathrine 0.0 7.0 0.0
1 Catherine 0.0 13.0 17.0
2 Amanda Jane 0.0 0.0 0.0
3 Jaqueline 0.0 3.0 0.0
4 Christine 0.0 0.0 0.0
I would like to set values in each column after the AUTHOR to 1 when the value is greater than 0, so the resulting table would look like this:
AUTHOR 2022-07-01 2022-10-14 2022-10-15 .....
0 Kathrine 0.0 1.0 0.0
1 Catherine 0.0 1.0 1.0
2 Amanda Jane 0.0 0.0 0.0
3 Jaqueline 0.0 1.0 0.0
4 Christine 0.0 0.0 0.0
I tried the following line of code but got an error, which makes sense, as I need to figure out how to apply it only to the date columns while keeping the AUTHOR column in my table.
Counts[Counts != 0] = 1
TypeError: Cannot do inplace boolean setting on mixed-types with a non np.nan value

You can select the date columns first, then mask on those columns:
cols = df.drop(columns='AUTHOR').columns
# or
cols = df.filter(regex=r'\d{4}-\d{2}-\d{2}').columns
# or
cols = df.select_dtypes(include='number').columns
df[cols] = df[cols].mask(df[cols] != 0, 1)
print(df)
AUTHOR 2022-07-01 2022-10-14 2022-10-15
0 Kathrine 0.0 1.0 0.0
1 Catherine 0.0 1.0 1.0
2 Amanda Jane 0.0 0.0 0.0
3 Jaqueline 0.0 1.0 0.0
4 Christine 0.0 0.0 0.0
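A comparison followed by a cast gives the same result without mask; here is a minimal, self-contained sketch (the frame is a cut-down version of the question's data, and the column selection mirrors the answer above):
import pandas as pd

df = pd.DataFrame({
    'AUTHOR': ['Kathrine', 'Catherine', 'Amanda Jane'],
    '2022-07-01': [0.0, 0.0, 0.0],
    '2022-10-14': [7.0, 13.0, 0.0],
    '2022-10-15': [0.0, 17.0, 0.0],
})

cols = df.columns.drop('AUTHOR')
# (df[cols] > 0) is a boolean frame; casting to float turns True/False into 1.0/0.0
df[cols] = (df[cols] > 0).astype(float)
print(df)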

Since you only want to exclude the first column, you can set it as the index, apply the mask, and reset the index at the end:
df = df.set_index('AUTHOR').pipe(lambda g: g.mask(g > 0, 1)).reset_index()
df
AUTHOR 2022-10-14 2022-10-15
0 Kathrine 0.0 1.0
1 Catherine 1.0 1.0

Related

Convert the data from summary to daily time series data (pandas)

I have a dataset which is a time series. It has several regions at once; here is a small example:
date confirmed deaths recovered region_code
0 2020-03-27 3.0 0.0 0.0 ARK
1 2020-03-27 4.0 0.0 0.0 BA
2 2020-03-27 1.0 0.0 0.0 BEL
..........................................................
71540 2022-07-19 164194.0 2830.0 160758.0 YAR
71541 2022-07-19 19170.0 555.0 18484.0 YEV
71542 2022-07-19 169603.0 2349.0 167075.0 ZAB
I have three columns for which I want to show, in three separate columns, how many new cases were added:
date confirmed deaths recovered region_code daily_confirmed daily_deaths daily_recovered
0 2020-03-27 3.0 0.0 0.0 ARK 3.0 0.0 0.0
1 2020-03-27 4.0 0.0 0.0 BA 4.0 0.0 0.0
2 2020-03-27 1.0 0.0 0.0 BEL 1.0 0.0 0.0
..........................................................
71540 2022-07-19 164194.0 2830.0 160758.0 YAR 32.0 16.0 8.0
71541 2022-07-19 19170.0 555.0 18484.0 YEV 6.0 1.0 1.0
71542 2022-07-19 169603.0 2349.0 167075.0 ZAB 1.0 8.0 9.0
That is, for each region, I need the difference between the current date and the previous day in order to understand how many new cases have occurred.
The problem is that I don't know how to do this process correctly. Since there are no missing dates in the data, you can use something like this: df['daily_cases'] = df['confirmed'] - df['confirmed'].shift(fill_value=0). But there are many different regions here, that is, first you need to filter everything correctly somehow ... Any ideas how to do this?
Use DataFrameGroupBy.diff, replace the first (missing) values with the original columns, add a prefix to the new columns, and cast to integers if necessary:
print (df)
date confirmed deaths recovered region_code
0 2020-03-27 3.0 0.0 0.0 ARK
1 2020-03-27 4.0 0.0 0.0 BA
2 2020-03-27 1.0 0.0 0.0 BEL
3 2020-03-28 4.0 0.0 4.0 ARK
4 2020-03-28 6.0 0.0 0.0 BA
5 2020-03-28 1.0 0.0 0.0 BEL
6 2020-03-29 6.0 0.0 10.0 ARK
7 2020-03-29 8.0 0.0 0.0 BA
8 2020-03-29 5.0 0.0 0.0 BEL
cols = ['confirmed','deaths','recovered']
df1 = (df.groupby(['region_code'])[cols]
         .diff()
         .fillna(df[cols])
         .add_prefix('daily_')
         .astype(int))
print (df1)
daily_confirmed daily_deaths daily_recovered
0 3 0 0
1 4 0 0
2 1 0 0
3 1 0 4
4 2 0 0
5 0 0 0
6 2 0 6
7 2 0 0
8 4 0 0
Finally, join back to the original:
df = df.join(df1)
print (df)
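For completeness, a minimal self-contained sketch of the same pipeline, building the frame, computing the per-region differences, and joining them back in one go (the data here is a made-up two-region example, not the asker's full dataset):
import pandas as pd

df = pd.DataFrame({
    'date': ['2020-03-27', '2020-03-27', '2020-03-28', '2020-03-28'],
    'confirmed': [3.0, 4.0, 4.0, 6.0],
    'deaths': [0.0, 0.0, 0.0, 0.0],
    'recovered': [0.0, 0.0, 4.0, 0.0],
    'region_code': ['ARK', 'BA', 'ARK', 'BA'],
})

cols = ['confirmed', 'deaths', 'recovered']
daily = (df.groupby('region_code')[cols]
           .diff()                # difference to the previous row within each region
           .fillna(df[cols])      # the first row of each region keeps its original value
           .add_prefix('daily_')
           .astype(int))
df = df.join(daily)
print(df)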

new Pandas Dataframe column calculated from other column values

How can I create a new column in a dataframe that consists of the MEAN of an indexed range of values in that row?
example:
1 2 3 JUNK
0 0.0 0.0 0.0 A
1 1.0 1.0 -1.0 B
2 2.0 2.0 1.0 C
the JUNK column would be ignored when trying to determine the MEAN column
expected output:
1 2 3 JUNK MEAN
0 0.0 0.0 0.0 A 0.0
1 1.0 1.0 -1.0 B 0.33
2 2.0 2.0 1.0 C 1.66
Use drop to remove, or iloc to filter out, the unnecessary columns:
df['MEAN'] = df.drop('JUNK', axis=1).mean(axis=1)
# or
df['MEAN'] = df.iloc[:, :-1].mean(axis=1)
print (df)
1 2 3 JUNK MEAN
0 0.0 0.0 0.0 A 0.000000
1 1.0 1.0 -1.0 B 0.333333
2 2.0 2.0 1.0 C 1.666667
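If the frame has several non-numeric columns rather than a single known JUNK column, selecting by dtype may be more robust; a small sketch assuming every column you want averaged is numeric:
import pandas as pd

df = pd.DataFrame({'1': [0.0, 1.0, 2.0],
                   '2': [0.0, 1.0, 2.0],
                   '3': [0.0, -1.0, 1.0],
                   'JUNK': ['A', 'B', 'C']})
# mean over the numeric columns only; the string column is excluded automatically
df['MEAN'] = df.select_dtypes(include='number').mean(axis=1)
print(df)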

Filling the missing rows in pandas dataframe

import numpy as np
import pandas as pd

data = {'node1': [1, 1, 1, 2, 2, 5],
        'node2': [8, 16, 22, 5, 25, 10],
        'weight': [1, 1, 1, 1, 1, 1]}
df = pd.DataFrame(data, columns=['node1', 'node2', 'weight'])
df2 = (df.assign(Cu=df.groupby('node1').cumcount()).set_index('Cu')
         .groupby('node1').apply(lambda x: x['node2'])
         .unstack('Cu').fillna(np.nan))
Output:
1 8.0 16.0 22.0
2 5.0 25.0 0.0
5 10.0 0.0 0.0
This is the output I am getting, but I require the output:
1 8 16 22
2 5 25 0
3 0 0 0
4 0 0 0
5 10 0 0
The rows that are missing from the data, like 3 and 4, should have all their columns filled with zeros.
Here are a few ways of doing it.
Option 1
In [36]: idx = np.arange(df.node1.min(), df.node1.max()+1)
In [37]: df.groupby('node1')['node2'].apply(list).apply(pd.Series).reindex(idx).fillna(0)
Out[37]:
0 1 2
node1
1 8.0 16.0 22.0
2 5.0 25.0 0.0
3 0.0 0.0 0.0
4 0.0 0.0 0.0
5 10.0 0.0 0.0
Option 2
In [39]: (df.groupby('node1')['node2'].apply(lambda x: pd.Series(x.values))
            .unstack().reindex(idx).fillna(0))
Out[39]:
0 1 2
node1
1 8.0 16.0 22.0
2 5.0 25.0 0.0
3 0.0 0.0 0.0
4 0.0 0.0 0.0
5 10.0 0.0 0.0
Option 3
In [55]: pd.DataFrame.from_dict(
             {i: x.values for i, x in df.groupby('node1')['node2']},
             orient='index').reindex(idx).fillna(0)
Out[55]:
0 1 2
1 8.0 16.0 22.0
2 5.0 25.0 0.0
3 0.0 0.0 0.0
4 0.0 0.0 0.0
5 10.0 0.0 0.0
Measure the efficiency and readability against your use case and pick accordingly.
For comparison, pivot_table can also do the reshaping in one call:
In [15]: idx = np.arange(df.node1.min(), df.node1.max()+1)
In [16]: df.pivot_table(index='node1',
                        columns=df.groupby('node1').cumcount(),
                        values='node2',
                        fill_value=0) \
             .reindex(idx) \
             .fillna(0)
Out[16]:
0 1 2
node1
1 8.0 16.0 22.0
2 5.0 25.0 0.0
3 0.0 0.0 0.0
4 0.0 0.0 0.0
5 10.0 0.0 0.0
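If the output should literally contain integer zeros, as in the expected table, the pivot_table variant can fill during the reindex instead of afterwards; a sketch reusing the question's data:
import numpy as np
import pandas as pd

data = {'node1': [1, 1, 1, 2, 2, 5],
        'node2': [8, 16, 22, 5, 25, 10],
        'weight': [1, 1, 1, 1, 1, 1]}
df = pd.DataFrame(data)

idx = np.arange(df.node1.min(), df.node1.max() + 1)
out = (df.pivot_table(index='node1',
                      columns=df.groupby('node1').cumcount(),
                      values='node2',
                      fill_value=0)
         .reindex(idx, fill_value=0)   # missing node1 values (3, 4) become all-zero rows
         .astype(int))                 # cast so the table shows 0 rather than 0.0
print(out)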

Losing Values Joining DataFrames

I can't work out why this code is dropping values:
solddf[['Name', 'Barcode', 'SalesRank', 'SoldPrices', 'SoldDates', 'SoldIds']].head()
Out[3]:
Name Barcode \
62693 Near Dark [DVD] [1988] [Region 1] [US Import] ... 1.313124e+10
94823 Battlefield 2 Modern Combat / Game 1.463315e+10
24965 Star Wars: The Force Unleashed (PS3) 2.327201e+10
24964 Star Wars: The Force Unleashed (PS3) 2.327201e+10
24963 Star Wars: The Force Unleashed (PS3) 2.327201e+10
SalesRank SoldPrices SoldDates SoldIds
62693 14.04 2017-08-05 07:28:56 162558627930
94823 1.49 2017-09-06 04:48:42 132301267483
24965 4.29 2017-08-23 18:44:42 302424166550
24964 5.27 2017-09-08 19:55:02 132317908530
24963 5.56 2017-09-15 08:23:24 132322978130
Here's my dataframe. It stores each sale I pull from an eBay API as a new row.
My aim is to look for correlation between weekly sales and Amazon's Sales Rank.
solddf['Week'] = solddf['SoldDates'].apply(lambda x: x.week)
weeklysales = solddf.groupby(['Barcode', 'Week']).size().unstack()
weeklysales = weeklysales.fillna(0)
weeklysales['Mean'] = weeklysales.mean(axis=1)
weeklysales.head()
Out[5]:
Week 29 30 31 32 33 34 35 36 37 38 39 40 41 \
Barcode
1.313124e+10 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1.463315e+10 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0
2.327201e+10 0.0 0.0 0.0 0.0 0.0 1.0 0.0 1.0 2.0 2.0 0.0 2.0 1.0
2.327201e+10 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0
2.327201e+10 0.0 0.0 3.0 2.0 2.0 2.0 1.0 1.0 5.0 0.0 2.0 2.0 1.0
Week 42 Mean
Barcode
1.313124e+10 0.0 0.071429
1.463315e+10 0.0 0.071429
2.327201e+10 0.0 0.642857
2.327201e+10 0.0 0.142857
2.327201e+10 0.0 1.500000
So, I've worked out the mean weekly sales for each item (or barcode).
I then want to take the mean values and insert them back into my solddf dataframe that I started with.
s1 = pd.Series(weeklysales.Mean, index=solddf.Barcode).reset_index()
s1 = s1.sort_values('Barcode')
s1.head()
Out[17]:
Barcode Mean
0 1.313124e+10 0.071429
1 1.463315e+10 0.071429
2 2.327201e+10 0.642857
3 2.327201e+10 0.642857
4 2.327201e+10 0.642857
This is looking fine: it has the right number of rows and should fit.
solddf = solddf.sort_values('Barcode')
solddf['WeeklySales'] = s1.Mean
This method seems to work, but I'm having an issue: some np.nan values have now appeared that weren't in s1 before.
s1.Mean.isnull().sum()
Out[13]: 0
len(s1) == len(solddf)
Out[14]: True
But loads of my values that have passed across are now np.nan
solddf.WeeklySales.isnull().sum()
Out[16]: 27214
Can anyone tell me why?
While writing this I had an idea for a work-around
s1list = s1.Mean.tolist()
solddf['WeeklySales'] = s1list
solddf.WeeklySales.isnull().sum()
Out[20]: 0
Still curious what the problem with the previous method is though!
The NaNs appear because column assignment aligns on the index: after reset_index, s1 has a fresh RangeIndex (0, 1, 2, ...), while solddf keeps its original index, so only the rows whose labels happen to match receive a value. Instead of trying to align the two indexes yourself, just use pd.merge.
output = pd.merge(solddf, s1, on='Barcode')
This way you can also choose the type of join you would like via the how kwarg.
I would also advise reading Merge, join, and concatenate as it covers a lot of helpful methods for combining dataframes.
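As a minimal illustration of both the pitfall and the fix (made-up frames, not the asker's eBay data):
import pandas as pd

solddf = pd.DataFrame({'Barcode': [111, 222, 111]},
                      index=[62693, 94823, 24965])   # non-default index, as in the question
means = pd.DataFrame({'Barcode': [111, 222], 'Mean': [0.5, 1.0]})

# Column assignment aligns on index labels; the labels don't overlap, so this is all NaN:
solddf['WeeklySales'] = means['Mean']

# Merging on Barcode attaches the means regardless of either frame's index:
out = pd.merge(solddf, means, on='Barcode', how='left')
print(out)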

Sort rows of a dataframe in descending order of NaN counts

I'm trying to sort the following Pandas DataFrame:
RHS age height shoe_size weight
0 weight NaN 0.0 0.0 1.0
1 shoe_size NaN 0.0 1.0 NaN
2 shoe_size 3.0 0.0 0.0 NaN
3 weight 3.0 0.0 0.0 1.0
4 age 3.0 0.0 0.0 1.0
so that the rows with a greater number of NaN values are positioned first.
More precisely, in the above df, the row with index 1 (2 NaNs) should come before the row with index 0 (1 NaN).
What I do now is:
df.sort_values(by=['age', 'height', 'shoe_size', 'weight'], na_position="first")
Using df.sort_values and iloc-based accessing:
df = df.iloc[df.isnull().sum(1).sort_values(ascending=0).index]
print(df)
RHS age height shoe_size weight
1 shoe_size NaN 0.0 1.0 NaN
2 shoe_size 3.0 0.0 0.0 NaN
0 weight NaN 0.0 0.0 1.0
4 age 3.0 0.0 0.0 1.0
3 weight 3.0 0.0 0.0 1.0
df.isnull().sum(1) counts the NaNs and the rows are accessed based on this sorted count.
@ayhan offered a nice little improvement to the solution above, involving pd.Series.argsort:
df = df.iloc[df.isnull().sum(axis=1).mul(-1).argsort()]
print(df)
RHS age height shoe_size weight
1 shoe_size NaN 0.0 1.0 NaN
0 weight NaN 0.0 0.0 1.0
2 shoe_size 3.0 0.0 0.0 NaN
3 weight 3.0 0.0 0.0 1.0
4 age 3.0 0.0 0.0 1.0
df.isnull().sum().sort_values(ascending=False)
Here's a one-liner that will do it:
df.assign(Count_NA = lambda x: x.isnull().sum(axis=1)).sort_values('Count_NA', ascending=False).drop('Count_NA', axis=1)
# RHS age height shoe_size weight
# 1 shoe_size NaN 0.0 1.0 NaN
# 0 weight NaN 0.0 0.0 1.0
# 2 shoe_size 3.0 0.0 0.0 NaN
# 3 weight 3.0 0.0 0.0 1.0
# 4 age 3.0 0.0 0.0 1.0
This works by assigning a temporary column ("Count_NA") to count the NAs in each row, sorting on that column, and then dropping it, all in the same expression.
You can add a column of the number of null values, sort by that column, then drop the column. It's up to you if you want to use .reset_index(drop=True) to reset the row count.
df['null_count'] = df.isnull().sum(axis=1)
df.sort_values('null_count', ascending=False).drop('null_count', axis=1)
# returns
RHS age height shoe_size weight
1 shoe_size NaN 0.0 1.0 NaN
0 weight NaN 0.0 0.0 1.0
2 shoe_size 3.0 0.0 0.0 NaN
3 weight 3.0 0.0 0.0 1.0
4 age 3.0 0.0 0.0 1.0
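A small self-contained variant of the same idea: compute the per-row NaN count, sort it with a stable sort (which keeps the original relative order of rows that tie), and reorder with loc. The frame below just reconstructs the question's data:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'RHS': ['weight', 'shoe_size', 'shoe_size', 'weight', 'age'],
    'age': [np.nan, np.nan, 3.0, 3.0, 3.0],
    'height': [0.0, 0.0, 0.0, 0.0, 0.0],
    'shoe_size': [0.0, 1.0, 0.0, 0.0, 0.0],
    'weight': [1.0, np.nan, np.nan, 1.0, 1.0],
})

# count NaNs per row, sort descending, and reorder the rows by that ranking
order = df.isna().sum(axis=1).sort_values(ascending=False, kind='stable').index
print(df.loc[order])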
