Plotting in pivot table using label - python

My dataset:
df
Month 1 2 3 4 5 Label
Name
A 120 80.5 120 105.5 140 0
B 80 110 98.5 105 100 1
C 150 90.5 105 120 190 2
D 100 105 98.5 110 120 1
...
To plot by Month, I transposed the DataFrame:
df = df.T
df
Name A B C D
Month
1 120 80 150 100
2 80.5 110 90.5 105
3 120 98.5 105 98.5
4 105.5 105 120 110
5 140 100 190 120
Label 0.00 1.0 2.0 1.000
Ultimately, I want to draw a plot with the month on the x-axis and the value on the y-axis.
However, I have two questions.
Q1.
After transposing, the dtype of the 'Label' row changes (int -> float).
Can only the 'Label' row be kept as int?
Desired output:
df = df.T
df
Name A B C D
Month
1 120 80 150 100
2 80.5 110 90.5 105
3 120 98.5 105 98.5
4 105.5 105 120 110
5 140 100 190 120
Label 0 1 2 1
Q2.
Q1 is really a means to Q2.
When plotting, I want to group the lines by Label (like hue in seaborn).
Is there a way to group by Label when plotting from the pivot table above?
(Either matplotlib or seaborn is fine.)
The Label doesn't have to be int, so if Q2 can be solved directly, there is no need to answer Q1.
Thank you for reading.

Q2: You need to reshape the data to long format, e.g. with DataFrame.melt, so that Label can be passed as hue:
df1 = df.reset_index().melt(['Name','Label'])
print (df1)
sns.stripplot(data=df1,hue='Label',x='Name',y='value')
Q1: pandas does not support this. Every column holds a single dtype, so even if you convert the 'Label' row to int after transposing, the values are stored as floats again:
df = df.T
df.loc['Label', :] = df.loc['Label', :].astype(int)
print (df)
Name A B C D
1 120.0 80.0 150.0 100.0
2 80.5 110.0 90.5 105.0
3 120.0 98.5 105.0 98.5
4 105.5 105.0 120.0 110.0
5 140.0 100.0 190.0 120.0
Label 0.0 1.0 2.0 1.0
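The upcast happens because each column of a DataFrame stores one dtype: after df.T every column mixes float month values with an int Label, so pandas settles on float64. As a minimal sketch (the frame is rebuilt from the question's sample, and integer month labels are an assumption), casting to object before transposing keeps each cell as it was, at the cost of losing fast numeric columns:

```python
import pandas as pd

# Sample frame rebuilt from the question; integer month labels are an assumption.
df = pd.DataFrame(
    {
        1: [120, 80, 150, 100],
        2: [80.5, 110, 90.5, 105],
        3: [120, 98.5, 105, 98.5],
        4: [105.5, 105, 120, 110],
        5: [140, 100, 190, 120],
        "Label": [0, 1, 2, 1],
    },
    index=pd.Index(["A", "B", "C", "D"], name="Name"),
)

# Plain transpose: every resulting column is upcast to float64.
t_float = df.T
print(t_float.dtypes)

# Object dtype stores each cell individually, so the Label row stays integer-valued.
t_obj = df.astype(object).T
print(t_obj)
```

This is mainly useful for display; for plotting, keeping Label as its own column in a long table is usually simpler.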
EDIT:
df1 = df.reset_index().melt(['Name','Label'], var_name='Month')
print (df1)
Name Label Month value
0 A 0 1 120.0
1 B 1 1 80.0
2 C 2 1 150.0
3 D 1 1 100.0
4 A 0 2 80.5
5 B 1 2 110.0
6 C 2 2 90.5
7 D 1 2 105.0
8 A 0 3 120.0
9 B 1 3 98.5
10 C 2 3 105.0
11 D 1 3 98.5
12 A 0 4 105.5
13 B 1 4 105.0
14 C 2 4 120.0
15 D 1 4 110.0
16 A 0 5 140.0
17 B 1 5 100.0
18 C 2 5 190.0
19 D 1 5 120.0
sns.lineplot(data=df1,hue='Label',x='Month',y='value')
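A self-contained sketch of the same reshape, with the frame rebuilt from the question's sample (integer month labels are an assumption). Because Label becomes an ordinary column in the long table, it keeps its int dtype, which also sidesteps Q1. Note that sns.lineplot with only hue='Label' aggregates the names that share a label into one line; passing units='Name' together with estimator=None should draw one line per name instead. The plotting call is left as a comment so the block runs with pandas alone.

```python
import pandas as pd

# Original (untransposed) frame from the question; month labels assumed to be ints.
df = pd.DataFrame(
    {
        1: [120, 80, 150, 100],
        2: [80.5, 110, 90.5, 105],
        3: [120, 98.5, 105, 98.5],
        4: [105.5, 105, 120, 110],
        5: [140, 100, 190, 120],
        "Label": [0, 1, 2, 1],
    },
    index=pd.Index(["A", "B", "C", "D"], name="Name"),
)

# Wide -> long: one row per (Name, Month) pair, Label carried along as an id column.
long_df = df.reset_index().melt(id_vars=["Name", "Label"], var_name="Month")
print(long_df.head())
print(long_df["Label"].dtype)

# Plotting (requires seaborn); one line per Name, coloured by Label:
# import seaborn as sns
# sns.lineplot(data=long_df, x="Month", y="value", hue="Label",
#              units="Name", estimator=None)
```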

Related

Pandas Python highest 2 rows of every 3 and tabling the results

Suppose I have the following dataframe:
. Column1 Column2
0 25 1
1 89 2
2 59 3
3 78 10
4 99 20
5 38 30
6 89 100
7 57 200
8 87 300
I'm not sure whether what I want is possible. I want to compare every three rows of Column1, take the highest two of each three, and assign the corresponding two Column2 values to a new column. It does not matter whether the values in Column3 are joined or how they are ordered, since I know that every two values in Column3 belong to a block of three rows of Column1.
. Column1 Column2 Column3
0 25 1 2
1 89 2 3
2 59 3
3 78 10 20
4 99 20 10
5 38 30
6 89 100 100
7 57 200 300
8 87 300
You can use np.arange with np.repeat to create a grouping array that groups every 3 rows.
Then use GroupBy.nlargest, extract the original indices of those values with pd.Index.get_level_values, and assign the matching Column2 values to Column3; pandas handles the index alignment.
n_grps = len(df) // 3  # number of 3-row groups (assumes len(df) is a multiple of 3)
g = np.repeat(np.arange(n_grps), 3)
idx = df.groupby(g)['Column1'].nlargest(2).index.get_level_values(1)
vals = df.loc[idx, 'Column2']
vals
# 1 2
# 2 3
# 4 20
# 3 10
# 6 100
# 8 300
# Name: Column2, dtype: int64
df['Column3'] = vals
df
Column1 Column2 Column3
0 25 1 NaN
1 89 2 2.0
2 59 3 3.0
3 78 10 10.0
4 99 20 20.0
5 38 30 NaN
6 89 100 100.0
7 57 200 NaN
8 87 300 300.0
To get the output shown in the question, sort within each group so that NaN comes last, using this additional step:
df['Column3'] = df.groupby(g)['Column3'].apply(lambda x:x.sort_values()).values
Column1 Column2 Column3
0 25 1 2.0
1 89 2 3.0
2 59 3 NaN
3 78 10 10.0
4 99 20 20.0
5 38 30 NaN
6 89 100 100.0
7 57 200 300.0
8 87 300 NaN
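For reference, the first step can be run end to end as below. This is a sketch rather than the answer's exact code: the frame is rebuilt from the question, and the grouping array uses integer division on the row position, which also tolerates a final group shorter than three rows.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "Column1": [25, 89, 59, 78, 99, 38, 89, 57, 87],
    "Column2": [1, 2, 3, 10, 20, 30, 100, 200, 300],
})

# 0,0,0,1,1,1,2,2,2: one group id per block of three rows.
g = np.arange(len(df)) // 3

# nlargest returns a Series indexed by (group id, original row label).
top2 = df.groupby(g)["Column1"].nlargest(2)
idx = top2.index.get_level_values(1)

# Assignment aligns on the row labels; rows that were not selected become NaN.
df["Column3"] = df.loc[idx, "Column2"]
print(df)
```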

How to fill in missing values in a pandas DataFrame

I have a dataframe that contains missing values.
index month value
0 201501 100
1 201507 172
2 201602 181
3 201605 98
I want to fill in the missing months of the above DataFrame using the list below.
list = [201501, 201502, 201503 ... 201612]
The result I want:
index month value
0 201501 100
1 201502 100
2 201503 100
3 201504 100
4 201505 100
5 201506 100
6 201507 172
7 201508 172
...
...
22 201611 98
23 201612 98
Setup:
my_list = list(range(201501,201509))
df=df.drop('index',axis=1)  # drop the 'index' column created by pd.read_clipboard
print(df)
month value
0 201501 100
1 201507 172
2 201602 181
3 201605 98
Using pd.DataFrame.reindex:
df = (df.set_index('month')
        .reindex(index=np.sort(np.unique(df['month'].tolist() + my_list)))
        .ffill()
        .reset_index())
print(df)
month value
0 201501 100.0
1 201502 100.0
2 201503 100.0
3 201504 100.0
4 201505 100.0
5 201506 100.0
6 201507 172.0
7 201508 172.0
8 201602 181.0
9 201605 98.0
Using pandas.DataFrame.merge:
l = list(range(201501,201509))
new_df = df.merge(pd.Series(l,name='month'),how='outer').sort_values('month').ffill()
new_df['index'] = range(new_df.shape[0])
Output:
index month value
0 0 201501 100.0
4 1 201502 100.0
5 2 201503 100.0
6 3 201504 100.0
7 4 201505 100.0
8 5 201506 100.0
1 6 201507 172.0
9 7 201508 172.0
2 8 201602 181.0
3 9 201605 98.0
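Both answers use a shortened list that stops at 201508. To cover the full range from the question (201501 to 201612), the month codes have to skip the invalid values between 201512 and 201601, so a plain range() will not do. A minimal sketch, with all_months and filled as names introduced here for illustration:

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "month": [201501, 201507, 201602, 201605],
    "value": [100, 172, 181, 98],
})

# Every YYYYMM code for 2015 and 2016, without gaps such as 201513.
all_months = [year * 100 + month for year in (2015, 2016) for month in range(1, 13)]

filled = (
    df.set_index("month")
      .reindex(index=all_months, method="ffill")  # carry the last known value forward
      .rename_axis("month")
      .reset_index()
)
print(filled)
```

method="ffill" requires the existing index to be sorted, which holds for this sample.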

Slice values of a column and calculate average in python

I have a dataframe with three columns:
a b c
0 73 12
73 80 2
80 100 5
100 150 13
Values in "a" and "b" are days. I need the average value of "c" in each 30-day interval, i.e. split [min(a), max(b)] into 30-day slices and average c over each slice. The result should be a DataFrame like this:
aa bb c_avg
0 30 12
30 60 12
60 90 6.33
90 120 9
120 150 13
Another sample data could be:
a b c
0 1264.0 1629.0 0.000000
1 1629.0 1632.0 133.333333
6 1632.0 1699.0 0.000000
2 1699.0 1706.0 21.428571
7 1706.0 1723.0 0.000000
3 1723.0 1726.0 50.000000
8 1726.0 1890.0 0.000000
4 1890.0 1893.0 33.333333
1 1893.0 1994.0 0.000000
How can I get to the final table?
First create a DataFrame of 30-day ranges covering the a and b columns:
a = np.arange(0, 180, 30)
df1 = pd.DataFrame({'aa':a[:-1], 'bb':a[1:]})
#print (df1)
Then cross join all rows using a helper column tmp:
df3 = pd.merge(df1.assign(tmp=1), df.assign(tmp=1), on='tmp')
#print (df3)
Finally, filter. There are two possible filters, depending on which columns are tested:
df4 = df3[df3['aa'].between(df3['a'], df3['b']) | df3['bb'].between(df3['a'], df3['b'])]
print (df4)
aa bb tmp a b c
0 0 30 1 0 73 12
4 30 60 1 0 73 12
8 60 90 1 0 73 12
10 60 90 1 80 100 5
14 90 120 1 80 100 5
15 90 120 1 100 150 13
19 120 150 1 100 150 13
df4 = df4.groupby(['aa','bb'], as_index=False)['c'].mean()
print (df4)
aa bb c
0 0 30 12.0
1 30 60 12.0
2 60 90 8.5
3 90 120 9.0
4 120 150 13.0
df5 = df3[df3['a'].between(df3['aa'], df3['bb']) | df3['b'].between(df3['aa'], df3['bb'])]
print (df5)
aa bb tmp a b c
0 0 30 1 0 73 12
8 60 90 1 0 73 12
9 60 90 1 73 80 2
10 60 90 1 80 100 5
14 90 120 1 80 100 5
15 90 120 1 100 150 13
19 120 150 1 100 150 13
df5 = df5.groupby(['aa','bb'], as_index=False)['c'].mean()
print (df5)
aa bb c
0 0 30 12.000000
1 60 90 6.333333
2 90 120 9.000000
3 120 150 13.000000
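Neither filter reproduces the table in the question exactly: the first gives 8.5 for 60-90 and the second drops the 30-60 slice. A strict interval-overlap test (a < bb and b > aa) does match it on the first sample. The sketch below is an alternative filter, not part of the original answer; it assumes pandas 1.2 or later for how="cross" and takes an unweighted mean of every row that overlaps a slice.

```python
import numpy as np
import pandas as pd

# First sample from the question.
df = pd.DataFrame({"a": [0, 73, 80, 100], "b": [73, 80, 100, 150], "c": [12, 2, 5, 13]})

# 30-day slice edges covering [min(a), max(b)].
edges = np.arange(df["a"].min(), df["b"].max() + 1, 30)
bins = pd.DataFrame({"aa": edges[:-1], "bb": edges[1:]})

# Pair every slice with every row, then keep the pairs whose intervals overlap.
pairs = bins.merge(df, how="cross")
overlap = pairs[(pairs["a"] < pairs["bb"]) & (pairs["b"] > pairs["aa"])]

result = (
    overlap.groupby(["aa", "bb"], as_index=False)["c"].mean()
           .rename(columns={"c": "c_avg"})
)
print(result)
```

If the average should be weighted by how many days of each row fall inside the slice, the overlap length would need to be computed as well; that is not shown here.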

Adding a row from a dataframe into another by matching columns with NaN values in row pandas python

The Scenario:
I have two DataFrames, fc0 and yc0. fc0 is a cluster, and yc0 is another DataFrame that needs to be merged into fc0.
The data looks as follows:
fc0
uid 1 2 3 4 5 6
234 235 4.000000 4.074464 4.128026 3.973045 3.921663 4.024864
235 236 3.524208 3.125669 3.652112 3.626923 3.524318 3.650589
236 237 4.174080 4.226267 4.200133 4.150983 4.124157 4.200052
yc0
iid uid 1 2 5 6 9 15
0 944 5.0 3.0 4.0 3.0 3.0 5.0
The Twist
I have 1682 columns in fc0 and a few hundred values in yc0. I need the yc0 row to go into fc0, aligned on the matching columns.
In my haste to resolve it, I even tried yc0.reset_index(inplace=True), but that didn't really help.
Expected Output
uid 1 2 3 4 5 6
234 235 4.000000 4.074464 4.128026 3.973045 3.921663 4.024864
235 236 3.524208 3.125669 3.652112 3.626923 3.524318 3.650589
236 237 4.174080 4.226267 4.200133 4.150983 4.124157 4.200052
944 5.0 3.0 NaN NaN 4.0 3.0 3.0
References
Link1 Tried this, but ended up with NaN values in the first 16 columns and the rest of the data shifted by that many columns
Link2 Couldn't match column keys, besides I tried it for row.
Link3 Merging doesn't match the columns in it.
Link4 Concatenation doesn't work that way.
Link5 Same issues with Join.
EDIT 1
fc0.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 235 entries, 234 to 468
Columns: 1683 entries, uid to 1682
dtypes: float64(1682), int64(1)
memory usage: 3.0 MB
and
yc0.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1 entries, 0 to 0
Columns: 336 entries, uid to 1007
dtypes: float64(335), int64(1)
memory usage: 2.7 KB
Here's an MCVE. Does this small sample show the behaviour you are expecting?
df1 = pd.DataFrame(np.random.randint(0,100,(5,4)), columns=list('ABCE'))
A B C E
0 81 57 54 88
1 63 63 74 10
2 13 89 88 66
3 90 81 3 31
4 66 93 55 4
df2 = pd.DataFrame(np.random.randint(0,100,(5,4)), columns=list('BCDE'))
B C D E
0 93 48 62 25
1 24 97 52 88
2 53 50 21 13
3 81 27 7 81
4 10 21 77 19
df_out = pd.concat([df1,df2])
print(df_out)
Output:
A B C D E
0 81.0 57 54 NaN 88
1 63.0 63 74 NaN 10
2 13.0 89 88 NaN 66
3 90.0 81 3 NaN 31
4 66.0 93 55 NaN 4
0 NaN 93 48 62.0 25
1 NaN 24 97 52.0 88
2 NaN 53 50 21.0 13
3 NaN 81 27 7.0 81
4 NaN 10 21 77.0 19
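Applied to frames shaped like fc0 and yc0, the same idea might look like the sketch below. The values are copied from the question, and integer column labels are an assumption. pd.concat aligns on column labels, so the labels must compare equal: a column named '1' (str) will not line up with 1 (int), which is worth checking if the columns appear shifted. Reindexing to fc0's columns afterwards drops the columns that only yc0 has (9 and 15 here).

```python
import pandas as pd

# Small excerpts of the question's frames; integer column labels are assumed.
fc0 = pd.DataFrame(
    {
        "uid": [235, 236, 237],
        1: [4.000000, 3.524208, 4.174080],
        2: [4.074464, 3.125669, 4.226267],
        3: [4.128026, 3.652112, 4.200133],
        4: [3.973045, 3.626923, 4.150983],
        5: [3.921663, 3.524318, 4.124157],
        6: [4.024864, 3.650589, 4.200052],
    },
    index=[234, 235, 236],
)
yc0 = pd.DataFrame({"uid": [944], 1: [5.0], 2: [3.0], 5: [4.0], 6: [3.0], 9: [3.0], 15: [5.0]})

# Stack the rows (columns are matched by label), then keep only fc0's columns.
combined = pd.concat([fc0, yc0], ignore_index=True).reindex(columns=fc0.columns)
print(combined)
```

ignore_index=True renumbers the rows 0..3; drop it to keep fc0's original row labels.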

Pandas - Sum up previous values of a column

I'm new to pandas and trying to sum up all previous values of a column. In SQL I did this by joining the table to itself, so I've taken the same approach in pandas, but I'm having some issues.
Original Data Frame
TeamName PlayerCount Goals CalMonth
0 A 25 126 1
1 A 25 100 2
2 A 25 156 3
3 B 22 205 1
4 B 30 300 2
5 B 28 189 3
Code
prev_month = np.where(df3['CalMonth'] == 12, df3['CalMonth'] - 11, df3['CalMonth'] + 1)
df4 = pd.merge(df3, df3, how='left', left_on=['TeamName','CalMonth'], right_on=['TeamName', prev_month])
print(df4.head(20))
Output
TeamName PlayerCount_x Goals_x CalMonth_x
0 A 25 126 1
1 A 25 100 2
2 A 25 156 3
3 B 22 205 1
4 B 22 300 2
5 B 22 189 3
PlayerCount_y Goals_y CalMonth_y
NaN NaN NaN
25 126 1
25 100 2
22 NaN NaN
22 205 1
22 100 2
The output is what I had in mind, but what I want now is to create a column that is YTD and sum up all Goals from previous months. Here are my desired results (can either include the current month or not, that can be done in an additional step):
TeamName PlayerCount_x Goals_x CalMonth_x
0 A 25 126 1
1 A 25 100 2
2 A 25 156 3
3 B 22 205 1
4 B 22 300 2
5 B 22 189 3
PlayerCount_y Goals_y CalMonth_y Goals_YTD
NaN NaN NaN NaN
25 126 1 126
25 100 2 226
22 NaN NaN NaN
22 205 1 205
22 100 2 305
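One common way to get a running total without a self-join is a grouped cumulative sum. A minimal sketch on the original frame, assuming the rows are sorted by month within each team (df3 is rebuilt here from the question's sample, and Goals_prev_YTD is a column name introduced for illustration):

```python
import pandas as pd

# Original frame from the question.
df3 = pd.DataFrame({
    "TeamName": ["A", "A", "A", "B", "B", "B"],
    "PlayerCount": [25, 25, 25, 22, 30, 28],
    "Goals": [126, 100, 156, 205, 300, 189],
    "CalMonth": [1, 2, 3, 1, 2, 3],
})

# Make sure months are in order inside each team before accumulating.
df3 = df3.sort_values(["TeamName", "CalMonth"]).reset_index(drop=True)

# Running total per team, including the current month.
df3["Goals_YTD"] = df3.groupby("TeamName")["Goals"].cumsum()

# Total of the previous months only: shift the running total down one row per team.
df3["Goals_prev_YTD"] = df3.groupby("TeamName")["Goals_YTD"].shift()
print(df3)
```

The first month of each team has no earlier months, so Goals_prev_YTD is NaN there, as in the desired output.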
