Set minimal spacing between values - python

I have the following dataframe where the column value is sorted:
df = pd.DataFrame({'variable': {0: 'Chi', 1: 'San Antonio', 2: 'Dallas', 3: 'PHL', 4: 'Houston', 5: 'NY', 6: 'Phoenix', 7: 'San Diego', 8: 'LA', 9: 'San Jose', 10: 'SF'}, 'value': {0: 191.28, 1: 262.53, 2: 280.21, 3: 283.08, 4: 290.75, 5: 295.72, 6: 305.6, 7: 357.89, 8: 380.07, 9: 452.71, 10: 477.67}})
Output:
variable value
0 Chi 191.28
1 San Antonio 262.53
2 Dallas 280.21
3 PHL 283.08
4 Houston 290.75
5 NY 295.72
6 Phoenix 305.60
7 San Diego 357.89
8 LA 380.07
9 San Jose 452.71
10 SF 477.67
I want to find values where the distance between neighboring values is smaller than 10:
df['value'].diff() < 10
Output:
0 False
1 False
2 False
3 True
4 True
5 True
6 True
7 False
8 False
9 False
10 False
Name: value, dtype: bool
Now I want to equally space those True values that are too close to each other. The idea is to take the last value before the run of Trues (280.21) and add a cumulative 5 to each True value: first True = 280.21 + 5, second True = 280.21 + 10, third True = 280.21 + 15, and so on.
Expected Output:
variable value
0 Chi 191.28
1 San Antonio 262.53
2 Dallas 280.21
3 PHL 285.21 <-
4 Houston 290.21 <-
5 NY 295.21 <-
6 Phoenix 300.21 <-
7 San Diego 357.89
8 LA 380.07
9 San Jose 452.71
10 SF 477.67
My solution:
mask = df['value'].diff() < 10
df.loc[mask, 'value'] = 5
df.loc[mask | mask.shift(-1), 'value'] = df.loc[mask | mask.shift(-1), 'value'].cumsum()
Maybe there is a more elegant one.

Let's try this:
df = pd.DataFrame({'variable': {0: 'Chi', 1: 'San Antonio', 2: 'Dallas', 3: 'PHL', 4: 'Houston', 5: 'NY', 6: 'Phoenix', 7: 'San Diego', 8: 'LA', 9: 'San Jose', 10: 'SF'}, 'value': {0: 191.28, 1: 262.53, 2: 280.21, 3: 283.08, 4: 290.75, 5: 295.72, 6: 305.6, 7: 357.89, 8: 380.07, 9: 452.71, 10: 477.67}})
s = df['value'].diff() < 10                   # True where the gap to the previous value is < 10
add_amt = s.cumsum().mask(~s) * 5             # 5, 10, 15, ... on those True rows, NaN elsewhere
# carry the last untouched value forward and add the offsets
df_out = df.assign(value=df['value'].mask(add_amt.notna()).ffill() + add_amt.fillna(0))
df_out
Output:
variable value
0 Chi 191.28
1 San Antonio 262.53
2 Dallas 280.21
3 PHL 285.21
4 Houston 290.21
5 NY 295.21
6 Phoenix 300.21
7 San Diego 357.89
8 LA 380.07
9 San Jose 452.71
10 SF 477.67
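One caveat: s.cumsum() keeps counting across the whole column, so if the data had several separate runs of too-close values, the offsets in a later run would not restart at 5. A small variation (just a sketch, not needed for the data above) restarts the counter per run by grouping on (~s).cumsum():
s = df['value'].diff() < 10
run_id = (~s).cumsum()                                 # a new group id starts at each run of True values
offset = s.astype(int).groupby(run_id).cumsum() * 5    # 5, 10, 15, ... within each run, 0 elsewhere
df_out = df.assign(value=df['value'].mask(s).ffill() + offset)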

Related

dropna() not working on dataframe with NaN values?

I have a data frame with Nan values. For some reason, df.dropna() doesn't work when I try to drop these rows. Any thoughts?
Example of a row:
30754 22 Nan Nan Nan Nan Nan Nan Jewellery-Women N
df = pd.read_csv('/Users/xxx/Desktop/CS 677/Homework_4/FashionDataset.csv')
df.dropna()
df.head().to_dict()
{'Unnamed: 0': {0: 0, 1: 1, 2: 2, 3: 3, 4: 4},
'BrandName': {0: 'life',
1: 'only',
2: 'fratini',
3: 'zink london',
4: 'life'},
'Deatils': {0: 'solid cotton blend collar neck womens a-line dress - indigo',
1: 'polyester peter pan collar womens blouson dress - yellow',
2: 'solid polyester blend wide neck womens regular top - off white',
3: 'stripes polyester sweetheart neck womens dress - black',
4: 'regular fit regular length denim womens jeans - stone'},
'Sizes': {0: 'Size:Large,Medium,Small,X-Large,X-Small',
1: 'Size:34,36,38,40',
2: 'Size:Large,X-Large,XX-Large',
3: 'Size:Large,Medium,Small,X-Large',
4: 'Size:26,28,30,32,34,36'},
'MRP': {0: 'Rs\n1699',
1: 'Rs\n3499',
2: 'Rs\n1199',
3: 'Rs\n2299',
4: 'Rs\n1699'},
'SellPrice': {0: '849', 1: '2449', 2: '599', 3: '1379', 4: '849'},
'Discount': {0: '50% off',
1: '30% off',
2: '50% off',
3: '40% off',
4: '50% off'},
'Category': {0: 'Westernwear-Women',
1: 'Westernwear-Women',
2: 'Westernwear-Women',
3: 'Westernwear-Women',
4: 'Westernwear-Women'}}
This is what I get when using df.head().to_dict()
Try this. Note that dropna() returns a new DataFrame instead of modifying df in place, so you have to keep the result (assign it back, or pass inplace=True):
df = pd.DataFrame({"col1":[12,20,np.nan,np.nan],
"col2":[10,np.nan,np.nan,40]})
df1 = df.dropna()
# df;
col1 col2
0 12.0 10.0
1 20.0 NaN
2 NaN NaN
3 NaN 40.0
# df1;
col1 col2
0 12.0 10.0
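If reassigning the result still leaves the rows in place, the "Nan" cells in your CSV may be literal strings rather than real missing values (a guess based on the example row you pasted; in that case dropna() sees nothing to drop). Converting them first, with the same path from your question:
import numpy as np
import pandas as pd

df = pd.read_csv('/Users/xxx/Desktop/CS 677/Homework_4/FashionDataset.csv')
df = df.replace('Nan', np.nan)   # turn the placeholder strings into real NaN
df = df.dropna()                 # keep the returned copy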

How to plot correlation between two columns

The task is the following:
Is there a correlation between the age of an athlete and his result at the Olympics in the entire dataset?
Each athlete has a name, age, medal (gold, silver, bronze or NA).
In my opinion, it is necessary to count the number of athletes of each age and calculate the percentage of them who have any kind of medal (data.Medal.notnull()). The graph should show all ages on the x-axis and the percentage of medal winners on the y-axis. How can I get this data and create the graphic with the help of pandas and matplotlib?
For instance, some data like in table:
Name Age Medal
Name1 20 Silver
Name2 21 NA
Name3 20 NA
Name4 22 Bronze
Name5 22 NA
Name6 21 NA
Name7 20 Gold
Name8 19 Silver
Name9 20 Gold
Name10 20 NA
Name11 21 Silver
The result should be (in the graphic):
19 - 100%
20 - 60%
21 - 33%
22 - 50%
First, turn df.Medal into 1s for a medal and 0s for NaN values using np.where.
import pandas as pd
import numpy as np
data = {'Name': {0: 'Name1', 1: 'Name2', 2: 'Name3', 3: 'Name4', 4: 'Name5',
5: 'Name6', 6: 'Name7', 7: 'Name8', 8: 'Name9', 9: 'Name10',
10: 'Name11'},
'Age': {0: 20, 1: 21, 2: 20, 3: 22, 4: 22, 5: 21, 6: 20, 7: 19, 8: 20,
9: 20, 10: 21},
'Medal': {0: 'Silver', 1: np.nan, 2: np.nan, 3: 'Bronze', 4: np.nan,
5: np.nan, 6: 'Gold', 7: 'Silver', 8: 'Gold', 9: np.nan,
10: 'Silver'}}
df = pd.DataFrame(data)
df.Medal = np.where(df.Medal.notna(),1,0)
print(df)
Name Age Medal
0 Name1 20 1
1 Name2 21 0
2 Name3 20 0
3 Name4 22 1
4 Name5 22 0
5 Name6 21 0
6 Name7 20 1
7 Name8 19 1
8 Name9 20 1
9 Name10 20 0
10 Name11 21 1
Now, you could plot the data, for example as follows. Note that sns.barplot aggregates with the mean by default, and the mean of a 0/1 column per age is exactly the share of athletes with a medal:
import seaborn as sns
import matplotlib.ticker as mtick
sns.set_theme()
ax = sns.barplot(data=df, x=df.Age, y=df.Medal, errorbar=None)
# in versions prior to `seaborn 0.12` use
# `ax = sns.barplot(data=df, x=df.Age, y=df.Medal, ci=None)`
ax.yaxis.set_major_formatter(mtick.PercentFormatter(xmax=1.0))
# adding labels
ax.bar_label(ax.containers[0],
labels=[f'{round(v*100,2)}%' for v in ax.containers[0].datavalues])
Result:
Incidentally, if you wanted to calculate these percentages yourself, one option is pd.crosstab:
percentages = pd.crosstab(df.Age,df.Medal, normalize='index')\
.rename(columns={1:'percentages'})['percentages']
print(percentages)
Age
19 1.000000
20 0.600000
21 0.333333
22 0.500000
Name: percentages, dtype: float64
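Since Medal is already 0/1 at this point, the same numbers also fall out of a plain groupby mean (an equivalent one-liner, not what the plot above relies on):
percentages = df.groupby('Age')['Medal'].mean()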
So, with matplotlib, you could also do something like:
import matplotlib.pyplot as plt

percentages = pd.crosstab(df.Age, df.Medal, normalize='index')\
    .rename(columns={1: 'percentages'})['percentages'].mul(100)
my_cmap = plt.get_cmap("viridis")
rescale = lambda y: (y - np.min(y)) / (np.max(y) - np.min(y))
fig, ax = plt.subplots()
ax.bar(x=percentages.index.astype(str),
height=percentages.to_numpy(),
color=my_cmap(rescale(percentages.to_numpy())))
ax.yaxis.set_major_formatter(mtick.PercentFormatter())
ax.bar_label(ax.containers[0], fmt='%.1f%%')
plt.show()
Result:

How to resample, and create a time series of value_counts() and count() on multiple columns?

I have the following dataframe:
Client_Id Date Age_Group Gender
0 579427 2020-02-01 Under 65 Female
1 579464 2020-02-01 Under 65 Female
2 579440 2020-02-01 Under 65 Male
3 579470 2020-02-01 75 - 79 Female
4 579489 2020-02-01 75 - 79 Female
5 579424 2020-02-01 75 - 79 Male
6 579492 2020-02-01 75 - 79 Male
7 579552 2020-02-01 75 - 79 Male
8 579439 2020-02-01 80 - 84 Male
9 579445 2020-03-01 80 - 84 Female
10 579496 2020-03-01 80 - 84 Female
11 579569 2020-03-01 80 - 84 Male
12 579610 2020-03-01 80 - 84 Male
13 579450 2020-03-01 80 - 84 Female
14 579423 2020-03-01 85 and over Female
15 579428 2020-03-01 85 and over Male
I am trying to resample and get a time series with the count of Client_Id and the value counts of Gender and Age_Group.
For example, I can get value_counts of Gender:
df.set_index('Date').resample('D')['Gender'].value_counts()
Date Gender
2020-02-01 Male 5
Female 4
2020-03-01 Female 4
Male 3
I can also get value_counts for Age_Group.
And I can get number of clients per day:
df.set_index('Date').resample('D')['Client_Id'].count()
Date
2020-01-02 9
2020-01-03 7
However, I would like all outputs to be in one dataframe, with each value count as its own column.
I have managed to do it, but the code is VERY ugly. I also have more columns to process, and I would prefer not to have such a long chain of merges.
This is what I've done, using unstack and merge:
(df.set_index('Date').resample('D')['Client_Id'].count().to_frame()
.merge(df.set_index('Date').resample('D')['Gender'].value_counts().unstack(), left_index=True, right_index=True)
.merge(df.set_index('Date').resample('D')['Age_Group'].value_counts().unstack(), left_index=True, right_index=True))
Is there an easier / more tidy / built in way to do this?
My dataframe as a dict:
{'Client_Id': {0: 579427,
1: 579464,
2: 579440,
3: 579470,
4: 579489,
5: 579424,
6: 579492,
7: 579552,
8: 579439,
9: 579445,
10: 579496,
11: 579569,
12: 579610,
13: 579450,
14: 579423,
15: 579428},
'Date': {0: Timestamp('2020-01-02 00:00:00'),
1: Timestamp('2020-01-02 00:00:00'),
2: Timestamp('2020-01-02 00:00:00'),
3: Timestamp('2020-01-02 00:00:00'),
4: Timestamp('2020-01-02 00:00:00'),
5: Timestamp('2020-01-02 00:00:00'),
6: Timestamp('2020-01-02 00:00:00'),
7: Timestamp('2020-01-02 00:00:00'),
8: Timestamp('2020-01-02 00:00:00'),
9: Timestamp('2020-01-03 00:00:00'),
10: Timestamp('2020-01-03 00:00:00'),
11: Timestamp('2020-01-03 00:00:00'),
12: Timestamp('2020-01-03 00:00:00'),
13: Timestamp('2020-01-03 00:00:00'),
14: Timestamp('2020-01-03 00:00:00'),
15: Timestamp('2020-01-03 00:00:00')},
'Age_Group': {0: 'Under 65',
1: 'Under 65',
2: 'Under 65',
3: '75 - 79',
4: '75 - 79',
5: '75 - 79',
6: '75 - 79',
7: '75 - 79',
8: '80 - 84',
9: '80 - 84',
10: '80 - 84',
11: '80 - 84',
12: '80 - 84',
13: '80 - 84',
14: '85 and over',
15: '85 and over'},
'Gender': {0: 'Female ',
1: 'Female ',
2: 'Male ',
3: 'Female ',
4: 'Female ',
5: 'Male ',
6: 'Male ',
7: 'Male ',
8: 'Male ',
9: 'Female ',
10: 'Female ',
11: 'Male ',
12: 'Male ',
13: 'Female ',
14: 'Female ',
15: 'Male '}}
Use Series.unstack so df1 gets a DatetimeIndex, which makes it possible to use concat:
df1 = df.set_index('Date').resample('D')['Gender'].value_counts().unstack()
df2 = df.set_index('Date').resample('D')['Client_Id'].count()
df = pd.concat([df1, df2], axis=1)
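This covers Gender and the client count; a sketch extending the same pattern to Age_Group (and any further columns) without a merge chain, where Client_Count is just a made-up name for the count column:
daily = df.set_index('Date').resample('D')
parts = [daily['Client_Id'].count().rename('Client_Count')]
parts += [daily[col].value_counts().unstack(fill_value=0)
          for col in ['Gender', 'Age_Group']]
out = pd.concat(parts, axis=1)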

Using pandas, how to assign values in a column based on the values from another column?

I have dataframe with two columns that looks like this:
Name Date
Brown 1/5/2021
Brown 12/15/2011
Brown 1/25/2006
Davis 1/9/2021
Davis 3/9/2004
Davis 1/29/2021
Garcia 1/7/2021
Garcia 11/17/2008
Garcia 1/13/2013
Johnson 1/3/2021
Johnson 1/13/2017
Johnson 12/23/2011
Jones 1/6/2021
Jones 1/16/2009
Jones 1/4/2014
Martinez 1/11/2018
Martinez 1/21/2002
Martinez 1/31/2021
Miller 1/8/2021
Miller 2/18/2021
Miller 1/28/2021
Rodriguez 1/10/2020
Rodriguez 1/20/2001
Rodriguez 1/30/2021
Smith 1/2/2021
Smith 1/12/2021
Smith 5/22/2010
Williams 1/4/2021
Williams 1/24/2016
Williams 1/4/2006
I am trying to add new column and enter the value in that column based on value from the first column. So for every instance of:
Smith in column Name, the new column Grade will have letter A,
Johnson will have B
Williams will have C
Brown will have D
Jones will have E
Garcia will have F
Miller will have G
Davis will have H
Rodriguez will have K
Martinez will have L
So desired output would look like this:
Name Date Grade
Brown 1/5/2021 D
Brown 12/15/2011 D
Brown 1/25/2006 D
Davis 1/9/2021 H
Davis 3/9/2004 H
Davis 1/29/2021 H
Garcia 1/7/2021 F
Garcia 11/17/2008 F
Garcia 1/13/2013 F
Johnson 1/3/2021 B
Johnson 1/13/2017 B
Johnson 12/23/2011 B
Jones 1/6/2021 E
Jones 1/16/2009 E
Jones 1/4/2014 E
Martinez 1/11/2018 L
Martinez 1/21/2002 L
Martinez 1/31/2021 L
Miller 1/8/2021 G
Miller 2/18/2021 G
Miller 1/28/2021 G
Rodriguez 1/10/2020 K
Rodriguez 1/20/2001 K
Rodriguez 1/30/2021 K
Smith 1/2/2021 A
Smith 1/12/2021 A
Smith 5/22/2010 A
Williams 1/4/2021 C
Williams 1/24/2016 C
Williams 1/4/2006 C
What I tried so far works for a single name, like:
df['Grade'] = df['Name'].apply(lambda x: 'D' if x == 'Brown' else None)
Instead of 10 lambda functions, is there a more elegant way to do this within one function?
I'm not sure if this qualifies as elegant, but it's another solution.
df['Grade']= df['Name'].map({'Brown': 'D' , ....})
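Spelled out with the full mapping listed in the question (any name not in the dict would come back as NaN):
grade_map = {'Smith': 'A', 'Johnson': 'B', 'Williams': 'C', 'Brown': 'D',
             'Jones': 'E', 'Garcia': 'F', 'Miller': 'G', 'Davis': 'H',
             'Rodriguez': 'K', 'Martinez': 'L'}
df['Grade'] = df['Name'].map(grade_map)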
You can use np.select from numpy (usually imported as np):
import pandas as pd
import numpy as np
df = pd.DataFrame({'Name': {0: 'Brown', 1: 'Brown', 2: 'Brown', 3: 'Davis', 4: 'Davis', 5: 'Davis', 6: 'Garcia', 7: 'Garcia', 8: 'Garcia', 9: 'Johnson', 10: 'Johnson', 11: 'Johnson', 12: 'Jones', 13: 'Jones', 14: 'Jones', 15: 'Martinez', 16: 'Martinez', 17: 'Martinez', 18: 'Miller', 19: 'Miller', 20: 'Miller', 21: 'Rodriguez', 22: 'Rodriguez', 23: 'Rodriguez', 24: 'Smith', 25: 'Smith', 26: 'Smith', 27: 'Williams', 28: 'Williams', 29: 'Williams'},
'Date': {0: '1/5/2021', 1: '12/15/2011', 2: '1/25/2006', 3: '1/9/2021', 4: '3/9/2004', 5: '1/29/2021', 6: '1/7/2021', 7: '11/17/2008', 8: '1/13/2013', 9: '1/3/2021', 10: '1/13/2017', 11: '12/23/2011', 12: '1/6/2021', 13: '1/16/2009', 14: '1/4/2014', 15: '1/11/2018', 16: '1/21/2002', 17: '1/31/2021', 18: '1/8/2021', 19: '2/18/2021', 20: '1/28/2021', 21: '1/10/2020', 22: '1/20/2001', 23: '1/30/2021', 24: '1/2/2021', 25: '1/12/2021', 26: '5/22/2010', 27: '1/4/2021', 28: '1/24/2016', 29: '1/4/2006'}})
conds = [df['Name'].eq('Smith'),
df['Name'].eq('Johnson'),
df['Name'].eq('Williams'),
df['Name'].eq('Brown'),
df['Name'].eq('Jones'),
df['Name'].eq('Garcia'),
df['Name'].eq('Miller'),
df['Name'].eq('Davis'),
df['Name'].eq('Rodriguez'),
df['Name'].eq('Martinez')]
choices = ['A','B','C','D','E','F','G','H','K','L']
df['Grade'] = np.select(conds, choices, np.nan)
Name Date Grade
0 Brown 1/5/2021 D
1 Brown 12/15/2011 D
2 Brown 1/25/2006 D
3 Davis 1/9/2021 H
4 Davis 3/9/2004 H
5 Davis 1/29/2021 H
6 Garcia 1/7/2021 F
7 Garcia 11/17/2008 F
8 Garcia 1/13/2013 F
9 Johnson 1/3/2021 B
10 Johnson 1/13/2017 B
11 Johnson 12/23/2011 B
12 Jones 1/6/2021 E
13 Jones 1/16/2009 E
14 Jones 1/4/2014 E
15 Martinez 1/11/2018 L
16 Martinez 1/21/2002 L
17 Martinez 1/31/2021 L
18 Miller 1/8/2021 G
19 Miller 2/18/2021 G
20 Miller 1/28/2021 G
21 Rodriguez 1/10/2020 K
22 Rodriguez 1/20/2001 K
23 Rodriguez 1/30/2021 K
24 Smith 1/2/2021 A
25 Smith 1/12/2021 A
26 Smith 5/22/2010 A
27 Williams 1/4/2021 C
28 Williams 1/24/2016 C
29 Williams 1/4/2006 C

Check if column in dataframe is missing values

I have a column full of state names.
I know how to iterate down through it, but I don't know what syntax to use to have it check for empty values as it goes. I tried isnull(), but that seems to be the wrong approach. Does anyone know a way?
was thinking something like:
for idx, state_name in datFrame['state_name'].items():
    if pd.isnull(state_name):
        print('no name value', datFrame.loc[idx].to_dict())   # plus the other values from the row
    else:
        print('row is good')
df.head():
state_name state_ab city zip_code
0 Alabama AL Chickasaw 36611
1 Alabama AL Louisville 36048
2 Alabama AL Columbiana 35051
3 Alabama AL Satsuma 36572
4 Alabama AL Dauphin Island 36528
to_dict():
{'state_name': {0: 'Alabama',
1: 'Alabama',
2: 'Alabama',
3: 'Alabama',
4: 'Alabama'},
'state_ab': {0: 'AL', 1: 'AL', 2: 'AL', 3: 'AL', 4: 'AL'},
'city': {0: 'Chickasaw',
1: 'Louisville',
2: 'Columbiana',
3: 'Satsuma',
4: 'Dauphin Island'},
'zip_code': {0: '36611', 1: '36048', 2: '35051', 3: '36572', 4: '36528'}}
Based on your description, you can use np.where to check if rows are either null or empty strings.
df['status'] = np.where(df['state'].eq('') | df['state'].isnull(), 'Not Good', 'Good')
(MCVE) For example, suppose you have the following dataframe
state
0 New York
1 Nevada
2
3 None
4 New Jersey
then,
state status
0 New York Good
1 Nevada Good
2 Not Good
3 None Not Good
4 New Jersey Good
It's always worth mentioning that you should avoid loops whenever possible, because they are much slower than vectorized masking.
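Applied to your own frame (using the state_name column from your df.head(), and assuming an empty cell could be either NaN or an empty string):
import numpy as np
mask = df['state_name'].isnull() | df['state_name'].eq('')
df['status'] = np.where(mask, 'Not Good', 'Good')
print(df.loc[mask])   # rows with no state name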
