Groupby data in Pandas to label my pie chart - python

I am using groupby in Pandas:
df_0 = df_find_quality_und.groupby("user score").size()
df_0
and get this output:
user score
3.0 14
7.0 2
7.1 1
7.2 2
7.3 1
7.5 1
7.7 1
7.8 2
7.9 1
8.0 1
8.1 2
8.3 1
8.7 1
dtype: int64
The right side is the frequency of the data.
Then I want to plot this data as a pie chart, but I don't know how to take only the left-hand data to use as the labels of my pie chart. So I write it manually like this:
user_sc_und = [3.0,7.0,7.1,7.2,7.3,7.5,7.7,7.8,7.9,8.0,8.1,8.3,8.7]
user_sc_und_1 = df_find_quality_und.groupby("user score")
plot = plt.pie(df_0, labels=user_sc_und)
The pie chart is shown correctly in the output with the labels. But I don't want to write the labels manually every time. Can anyone help me label the pie chart by taking the groupby data on the left side?
Thank you.

Pandas allows you to plot a pie chart directly:
df_0 = df_find_quality_und.groupby("user score").size()
df_0.plot.pie(autopct="%.2f",figsize=(10,10))
Output: a pie chart labeled with the user scores and their percentages (image omitted).

The labels are actually the index of the groupby sizes object, so you can try:
plot = plt.pie(df_0, labels=df_0.index)
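For completeness, a minimal self-contained version (the toy scores below are made up and only stand in for the real df_find_quality_und):
import pandas as pd
import matplotlib.pyplot as plt

# toy stand-in for the real df_find_quality_und
df_find_quality_und = pd.DataFrame({"user score": [3.0, 3.0, 3.0, 7.0, 7.0, 8.1]})

# count the occurrences of each score; the scores become the index
df_0 = df_find_quality_und.groupby("user score").size()

# the index supplies the labels, so no manual list is needed
plt.pie(df_0, labels=df_0.index)
plt.show()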

Related

Pandas Plot range as bar

I have the following DataFrame (this table is just an example; the real data has more types and sizes):
df = pd.DataFrame({
    'type': ['A','A','B','B','C','C','D','D'],
    'size': ['a','b','c','d','e','f','g','h'],
    'Nx':   [4.3,2.4,2.5,4.4,3.5,1.8,4.5,2.8],
    'min':  [0.5,2.5,0.7,3.2,0.51,2,0.3,3],
    'max':  [1.5,3.4,1.7,4.3,1.51,3,1.2,4]})
print(df)
ax=df.plot.bar(x='type',y='max',stacked=True,bottom=df['min'])
ax.plt(x='type',y='Nx')
This is the result:
type size Nx min max
0 A a 4.3 0.50 1.50
1 A b 2.4 2.50 3.40
2 B c 2.5 0.70 1.70
3 B d 4.4 3.20 4.30
4 C e 3.5 0.51 1.51
5 C f 1.8 2.00 3.00
6 D g 4.5 0.30 1.20
7 D h 2.8 3.00 4.00
How can I plot this data with just one bar column per type (A, B, C, ...), and then plot a scatter of Nx per type, to look like the desired plot (image omitted)?
You can add a new column called height equal to max - min, since the plt.bar method takes a height parameter, then set the DataFrame's index to ['type','size']. Then loop through the levels of this multiindex DataFrame and plot a bar with a different color for each unique type and size combination.
This also requires you to define your own color palette. I chose a discrete color palette from plt.cm and mapped integer values to each color. As you loop through each unique type and size, you can keep a counter in the innermost loop to ensure that each bar within the same type gets a different color.
NOTE: this does make the assumption that there aren't multiple rows with the same type and size.
To show this is generalizable, I added another bar of type 'D' and size 'i' and it appears as a distinct bar in the plot.
import pandas as pd
import matplotlib.pyplot as plt

## added a third size to type D
df = pd.DataFrame({
    'type': ['A','A','B','B','C','C','D','D','D'],
    'size': ['a','b','c','d','e','f','g','h','i'],
    'Nx':   [4.3,2.4,2.5,4.4,3.5,1.8,4.5,2.8,5.6],
    'min':  [0.5,2.5,0.7,3.2,0.51,2,0.3,3,4.8],
    'max':  [1.5,3.4,1.7,4.3,1.51,3,1.2,4,5.3]})

## create a height column for convenience
df['height'] = df['max'] - df['min']
df_grouped = df.set_index(['type','size'])

## create a list of as many colors as there are categories
cmap = plt.cm.get_cmap('Accent', 10)

## loop through the levels of the grouped DataFrame
for each_type, df_type in df_grouped.groupby(level=0):
    color_idx = 0
    for each_size, df_type_size in df_type.groupby(level=1):
        color_idx += 1
        plt.bar(x=[each_type]*len(df_type_size), height=df_type_size['height'],
                bottom=df_type_size['min'], width=0.4,
                edgecolor='grey', color=cmap(color_idx))
        plt.scatter(x=[each_type]*len(df_type_size), y=df_type_size['Nx'],
                    color=cmap(color_idx))

plt.ylim([0, 7])
plt.show()
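A compatibility note, not part of the original answer: plt.cm.get_cmap was deprecated and later removed from matplotlib, so on recent versions the equivalent discrete palette is:
import matplotlib as mpl

# same 10-color discrete 'Accent' palette on matplotlib >= 3.6
cmap = mpl.colormaps['Accent'].resampled(10)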

Comparing values in two dataframes and generating a report if the difference is greater than a set point

I have two data frames (master and slave) that look like below.
# Master
C D E F G
0 5 44 4.0 33 22
1 1 0 4.5 565 11
# Slave
C D E F G
0 5 44 4.0 33.0 22
1 1 4 6.5 562.5 10
Expected result (highlight those cells where the difference is > 1):
C D E F G
0 5 44 4.0 33.0 22
1 1 4 6.5 562.5 10
where 4, 6.5, and 562.5 are highlighted.
Picture attached for better understanding.
I would like to compare the two data frames and highlight the cells where the difference exceeds the SET VALUE (> 1) in a newly created data frame. The SET VALUE = 1 is constant for the entire data frame.
Please note the difference should be based on absolute value, i.e. ABS(master - slave).
I would like to use the numpy np.isclose function to achieve my goal.
This should work for a bigger data frame with 200 rows and 300 columns; the data frame displayed here is small for better understanding.
Cell D2: highlight required since (D2_MASTER) - (D2_Slave) = 0 - 4 = -4
Cell E2: highlight required since (E2_MASTER) - (E2_Slave) = 4.5 - 6.5 = -2
Cell F2: highlight required since (F2_MASTER) - (F2_Slave) = 565 - 562.5 = 2.5
Cell G2: NO highlight since (G2_MASTER) - (G2_Slave) = 11 - 10 = 1 (should not be highlighted since the difference is within the limit)
I just started coding in python and using pandas on my own and I admit I am a bit lost.
Thanks for reading all this, and thanks in advance for any suggestions and feedback!
Code
for ind, row in dfmaster.iterrows():
    print(row)
(dfmaster.iloc()) = np.isclose((dfmaster.iloc()), (dfmaster.iloc()), atol=1)  # .any()
Let's try Styler.apply:
import numpy as np
import pandas as pd

def highlight_error(df):
    return pd.DataFrame(np.where(df.sub(slave).abs() > 1, 'background-color: red', ''),
                        df.index, df.columns)

master.style.apply(highlight_error, axis=None)
In a Jupyter notebook the styled frame renders with the flagged cells highlighted in red (screenshot omitted).
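Since the question specifically asks for np.isclose, the same mask can be built with it by setting rtol=0 so the tolerance is purely absolute; this is a sketch along the lines of the answer above, not part of it:
def highlight_error_isclose(df):
    # cells NOT within atol=1 of the slave value get flagged
    mask = ~np.isclose(df.to_numpy(), slave.to_numpy(), rtol=0, atol=1)
    return pd.DataFrame(np.where(mask, 'background-color: red', ''),
                        df.index, df.columns)

master.style.apply(highlight_error_isclose, axis=None)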

Is there a way to plot data from a CSV file so that every 10 data points in a column become a different line on the same graph?

Say I have a data set in two columns. I want to plot a line for every 10 rows: the first 10 rows form one line, the next 10 rows (directly under the first 10) form another line on the same graph (in a different color), and so on. The data is stacked in a CSV file with no header.
Currently I read in the entire column and plot it, but there is no differentiation as to which data set each chunk belongs to. I want to plot multiple lines on the same graph, yet the CSV file has all the data sets in one column, so I need to graph every 10 rows separately.
EDIT
Below I have added data. I would like the first column to be the x-axis and the second to be the y-axis.
Sample Data:
0 8.2
1 9.1
2 2.2
3 3.3
4 9.8
5 6.3
6 4.8
7 8.6
8 3.9
9 2.1
0 9.34
1 10.2
2 7.22
3 6.98
4 1.34
5 2.56
6 6.78
7 4.56
8 3.3
9 9.4
OK, try this:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# this is the toy data
df = pd.DataFrame({0: list(range(10))*2,
                   1: np.random.uniform(9, 11, 20)})

# set up axes for the plots
fig, ax = plt.subplots(1, 1)

# the groupby key groups every 10 rows together, then passes
# each chunk to the lambda, which plots it onto the given axis
df.groupby(df.reset_index().index//10).apply(lambda x: ax.plot(x[0], x[1]))
plt.show()
Option 2:
I found seaborn is a better tool for this purpose:
import seaborn as sns

fig, ax = plt.subplots(1, 1, figsize=(10, 6))
sns.lineplot(x=df[0],
             y=df[1],
             hue=df.reset_index().index//10,
             data=df,
             palette='Set1')
plt.show()
Output: one colored line per 10-row chunk (image omitted).
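One gap the answer leaves open: the question's data lives in a headerless CSV. Assuming whitespace-separated columns laid out like the sample above (the file name is hypothetical), it can be loaded into the df used here with:
import pandas as pd

# headerless, whitespace-separated file; 'data.csv' is a hypothetical name
df = pd.read_csv('data.csv', sep=r'\s+', header=None)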

Pandas - Outer Join on Column with Repeating Values

This is my first question on Stack Overflow; please let me know how I can help you help me if my question is unclear.
Goal: Use Python and Pandas to Outer join (or merge) Data Sets containing different experimental trials where the "x" axis of each trial is extremely similar but has some deviations. Most importantly, the "x" axis increases, hits a maximum and then decreases, often overlapping with previously existing "x" points.
Problem: When I go to join/merge the datasets on "x", the "x" column is sorted, messing up the order of the collected data and making it impossible to plot it correctly.
Here is a small example of what I am trying to do:
The site wouldn't let me add pictures because I am new, so here is the code to generate these example data sets.
Data Sets:
Import:
import numpy as np
import pandas as pd
import random as rand
Code :
T1 = {'x':np.array([1,1.5,2,2.5,3,3.5,4,5,2,1]),'y':np.array([10000,8500,7400,6450,5670,5100,4600,4500,8400,9000]),'z':np.array(rand.sample(range(0,10000),10))}
T2 = {'x':np.array([1,2,3,4,5,6,7,2,1.5,1]),'y':np.array([10500,7700,5500,4560,4300,3900,3800,5400,8400,8800]),'z':np.array(rand.sample(range(0,10000),10))}
Trial1 = pd.DataFrame(T1)
Trial2 = pd.DataFrame(T2)
Attempt to Merge/Join:
WomboCombo = Trial1.join(Trial2,how='outer',lsuffix=1,rsuffix=2, on='x')
WomboCombo2 = pd.merge(left=Trial1, right=Trial2, how='outer', left_on='x', right_on='x')
Attempt to split into two parts, an increasing and a decreasing part (I manually found the row number where "x" starts decreasing):
Trial1Inc = Trial1[0:8]
Trial2Inc = Trial2[0:7]
Result: merge works well; join messes with the "x" column, and I'm not sure why:
Trial1Inc.merge(Trial2Inc,on='x',how='outer', suffixes=[1,2])
Incrementing section Merge Result (image omitted)
Trial1Inc.join(Trial2Inc,on='x',how='outer', lsuffix=1,rsuffix=2)
Incrementing section Join Result (image omitted)
Hopefully my example is clear, the "x" column in Trial 1 increases until 5, then decreases back towards 0. In Trial 2, I altered the test a bit because I noticed that I needed data at a slightly higher "x" value. Trial 2 Increases until 7 and then quickly decreases back towards 0.
My end goal is to plot the average of all y values (where there is overlap between the trials) against the corresponding x values.
If there is overlap I can add error bars. Pandas is almost perfect for what I am trying to do because an Outer join adds null values where there is no overlap and is capable of horizontally concatenating the two trials when there is overlap.
All thats left now is to figure out how to join on the "x" column but maintain its order of increasing values and then decreasing values. The reason it is important for me to first increase "x" and then decrease it is because when looking at the "y" values, it seems as though the initial "y" value at a given "x" is greater than the "y" value when "x" is decreasing (E.G. in trial 1 when x=1, y=10000, however, later in the trial when we come back to x=1, y=9000, this trend is important. When Pandas sorts the column before merging, instead of there being a clean curve showing a decrease in "y" as "x" increases and then the reverse, there are vertical downward jumps at any point where the data was joined.
I would really appreciate any help with either:
A) a perfect solution that lets me join on "x" when "x" contains duplicates
B) an efficient way to split the data sets into increasing "x" and decreasing "x" so that I can merge the increasing and decreasing sections of each trial separately and then vertically concat them.
Hopefully I did an okay job explaining the problem I would like to solve. Please let me know if I can clarify anything,
Thanks for the help!
I think #xyzjayne's idea of splitting the dataframe is a great one.
Splitting Trial1 and Trial2:
# index of max x value in Trial2
t2_max_index = Trial2.index[Trial2['x'] == Trial2['x'].max()].tolist()
# split Trial2 by max value
trial2_high = Trial2.loc[:t2_max_index[0]].set_index('x')
trial2_low = Trial2.loc[t2_max_index[0]+1:].set_index('x')
# index of max x value in Trial1
t1_max_index = Trial1.index[Trial1['x'] == Trial1['x'].max()].tolist()
# split Trial1 by max value
trial1_high = Trial1.loc[:t1_max_index[0]].set_index('x')
trial1_low = Trial1.loc[t1_max_index[0]+1:].set_index('x')
Once we split the dataframes we join the highers together and the lowers together:
WomboCombo_high = trial1_high.join(trial2_high, how='outer', lsuffix='1', rsuffix='2', on='x').reset_index()
WomboCombo_low = trial1_low.join(trial2_low, how='outer', lsuffix='1', rsuffix='2', on='x').reset_index()
We now combine them into one dataframe, WomboCombo:
WomboCombo = WomboCombo_high.append(WomboCombo_low)
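Note that DataFrame.append was removed in pandas 2.0; on current versions the equivalent step is:
# concat replaces the removed DataFrame.append
WomboCombo = pd.concat([WomboCombo_high, WomboCombo_low])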
OUTPUT:
x y1 z1 y2 z2
0 1.0 10000.0 3425.0 10500.0 3061.0
1 1.5 8500.0 5059.0 NaN NaN
2 2.0 7400.0 2739.0 7700.0 7090.0
3 2.5 6450.0 9912.0 NaN NaN
4 3.0 5670.0 2099.0 5500.0 1140.0
5 3.5 5100.0 9637.0 NaN NaN
6 4.0 4600.0 7581.0 4560.0 9584.0
7 5.0 4500.0 8616.0 4300.0 3940.0
8 6.0 NaN NaN 3900.0 5896.0
9 7.0 NaN NaN 3800.0 6211.0
0 2.0 8400.0 3181.0 5400.0 9529.0
2 1.5 NaN NaN 8400.0 3260.0
1 1.0 9000.0 4280.0 8800.0 8303.0
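From this layout, the asker's stated end goal, the trial-averaged y plotted against x in sweep order, follows directly; a sketch, not part of the original answer:
# row-wise mean skips NaN, so x values covered by only one trial
# simply fall back to that trial's y
WomboCombo['y_avg'] = WomboCombo[['y1', 'y2']].mean(axis=1)
plt.plot(WomboCombo['x'], WomboCombo['y_avg'], marker='o')
plt.show()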
One possible solution is to give your trial rows specific IDs and then merge on the IDs. That should keep the x values from being sorted.
Here's what I was trying out, but it doesn't address varying numbers of data points. I like gym-hh's answer, though it's not clear to me that you wanted two columns of y,z pairs. So you could combine his ideas and this code to get what you need.
Trial1['index1'] = Trial1.index
Trial2['index1'] = Trial2.index
WomboCombo = Trial1.append(Trial2)
WomboCombo.sort_values(by=['index1'],inplace=True)
WomboCombo
Output:
x y z index1
0 1.0 10000 7148 0
0 1.0 10500 2745 0
1 1.5 8500 248 1
1 2.0 7700 9505 1
2 2.0 7400 6380 2
2 3.0 5500 3401 2
3 2.5 6450 6183 3
3 4.0 4560 5281 3
4 3.0 5670 99 4
4 5.0 4300 8864 4
5 3.5 5100 5132 5
5 6.0 3900 7570 5
6 4.0 4600 9951 6
6 7.0 3800 7447 6
7 2.0 5400 3713 7
7 5.0 4500 3863 7
8 1.5 8400 8776 8
8 2.0 8400 1592 8
9 1.0 9000 2167 9
9 1.0 8800 782 9
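The merge-on-IDs variant mentioned at the top would look something like this (a sketch; it produces gym-hh-style paired columns while preserving row order):
# merging on the positional ID keeps the sweep order intact;
# suffixes keep the two trials' columns apart
WomboCombo2 = Trial1.merge(Trial2, on='index1', how='outer', suffixes=('_1', '_2'))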

Find points in cells through pandas dataframes of coordinates

I have to find which points are inside a grid of square cells, given the points coordinates and the coordinates of the bounds of the cells, through two pandas dataframes.
I'm calling dfc the dataframe containing the code and the boundary coordinates of the cells (I have simplified the problem; in the real analysis I have a big grid of geographical cells and tons of points to check):
Code,minx,miny,maxx,maxy
01,0.0,0.0,2.0,2.0
02,2.0,2.0,3.0,3.0
and dfp the dataframe containing an Id and the coordinates of the points:
Id,x,y
0,1.5,1.5
1,1.1,1.1
2,2.2,2.2
3,1.3,1.3
4,3.4,1.4
5,2.0,1.5
Now I would like to perform a search that adds to the dfp dataframe a new column (called 'GridCode') holding the code of the cell each point falls in. The cells are perfectly square, so I would like to perform the analysis with something like:
a = np.where(
    (dfp['x'] > dfc['minx']) &
    (dfp['x'] < dfc['maxx']) &
    (dfp['y'] > dfc['miny']) &
    (dfp['y'] < dfc['maxy']),
    dfc['Code'],
    'na')
avoiding several loops over the dataframes. The lengths of the dataframes are not the same. The resulting dataframe should be as follows:
Id x y GridCode
0 0 1.5 1.5 01
1 1 1.1 1.1 01
2 2 2.2 2.2 02
3 3 1.3 1.3 01
4 4 3.4 1.4 na
5 5 2.0 1.5 na
Thanks in advance for your help!
There is probably a better way, but since this has been sitting out there for a while...
Use Pandas boolean indexing to filter the dfc data frame instead of np.where():
def findGrid(dfp):
    c = dfc[(dfp['x'] > dfc['minx']) &
            (dfp['x'] < dfc['maxx']) &
            (dfp['y'] > dfc['miny']) &
            (dfp['y'] < dfc['maxy'])].Code
    if len(c) == 0:
        return None
    else:
        return c.iat[0]
Then use the pandas apply() function:
dfp['GridCode'] = dfp.apply(findGrid,axis=1)
This will yield:
Id x y GridCode
0 0 1.5 1.5 1
1 1 1.1 1.1 1
2 2 2.2 2.2 2
3 3 1.3 1.3 1
4 4 3.4 1.4 NaN
5 5 2.0 1.5 NaN
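For the big grids mentioned in the question, a vectorized alternative (a sketch, not from the original answer) broadcasts every point against every cell at once, trading memory for speed:
import numpy as np

# shape (n_points, n_cells): True where point i falls inside cell j
inside = ((dfp['x'].to_numpy()[:, None] > dfc['minx'].to_numpy()) &
          (dfp['x'].to_numpy()[:, None] < dfc['maxx'].to_numpy()) &
          (dfp['y'].to_numpy()[:, None] > dfc['miny'].to_numpy()) &
          (dfp['y'].to_numpy()[:, None] < dfc['maxy'].to_numpy()))

# argmax picks the first matching cell per point; any() guards
# points that fall in no cell
first_match = inside.argmax(axis=1)
dfp['GridCode'] = np.where(inside.any(axis=1),
                           dfc['Code'].to_numpy()[first_match], 'na')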
