For an assignment I have to remove the outliers from a CSV file using several different methods.
I am working with the 'height' column after loading the CSV into a pandas DataFrame and trying the KNN method in Python, but the code either raises errors or leaves the outliers untouched.
This is the code I wrote:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import chi2
import pandas as pd
from sklearn.neighbors import NearestNeighbors
from sklearn.datasets import make_blobs
df = pd.read_csv("data.csv")
print(df.describe())
print(df.columns)
df['height'].plot(kind='hist')
print(df['height'].value_counts())
data= pd.DataFrame(df['height'],df['active'])
k=1
knn = NearestNeighbors(n_neighbors=k)
knn.fit([df['height']])
neighbors_and_distances = knn.kneighbors([df['height']])
knn_distances = neighbors_and_distances[0]
tnn_distance = np.mean(knn_distances, axis=1)
print(knn_distances)
PCM = df.plot(kind='scatter', x='x', y='y', c=tnn_distance, colormap='viridis')
plt.show()
And the data looks something like this:
id,age,gender,height,weight,ap_hi,ap_lo,cholesterol,gluc,smoke,alco,active,cardio
0,18393,2,168,62.0,110,80,1,1,0,0,1,0
1,20228,1,156,85.0,140,90,3,1,0,0,1,1
2,18857,1,50,64.0,130,70,3,1,0,0,0,1
3,17623,2,250,82.0,150,100,1,1,0,0,1,1
I don't know what I'm missing or doing wrong.
df = pd.read_csv("data.csv")
X = df[['height', 'weight']].copy()  # copy so that adding a column later does not trigger SettingWithCopyWarning
X.plot(kind='scatter', x='weight', y='height', colormap='viridis')
plt.show()
knn = NearestNeighbors(n_neighbors=2).fit(X)
distances, indices = knn.kneighbors(X)
X['distances'] = distances[:,1]
X.distances
0 1.000000
1 1.000000
2 1.000000
3 3.000000
4 1.000000
5 1.000000
6 133.958949
7 100.344407
...
X.plot(kind='scatter', x='weight', y='height', c='distances', colormap='viridis')
plt.show()
MAX_DIST = 10
X[distances[:, 1] < MAX_DIST]  # column 1 is the distance to the nearest other point; column 0 is the point itself
height weight
0 162 78.0
1 162 78.0
2 151 76.0
3 151 76.0
4 171 84.0
...
And finally to filter out all the outliers:
MAX_DIST = 10
X = X[X.distances < MAX_DIST]
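The question's code passes `[df['height']]` to `fit`, which scikit-learn reads as one sample with many features, so no per-row distances come out of it. Below is a minimal sketch of the same nearest-neighbour filter on the single 'height' column. The height values and the `MAX_DIST` threshold are made up for illustration and are not taken from the real file.

```python
import pandas as pd
from sklearn.neighbors import NearestNeighbors

# Hypothetical heights; 50 and 250 play the role of the outliers.
df = pd.DataFrame({"height": [168, 156, 50, 250, 165, 170, 160, 158, 172, 163]})

# Double brackets keep a 2-D (n_samples, 1) frame, which is what fit() expects.
X = df[["height"]]

# n_neighbors=2 because each point's nearest neighbour is itself (distance 0).
knn = NearestNeighbors(n_neighbors=2).fit(X)
distances, _ = knn.kneighbors(X)

MAX_DIST = 10  # assumed threshold, tune it for the real data
nn_distance = distances[:, 1]  # distance to the nearest *other* point
clean = df[nn_distance < MAX_DIST]
print(clean["height"].tolist())
```

A fixed distance threshold is only one possible rule. On real data it may be worth looking at a histogram of `nn_distance` before choosing the cut-off.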
I am producing a pandas bar plot of raw counts, but I would like to annotate each bar with the percentage that count represents of the whole. I have seen many people annotate through ax.patches, but my values are unrelated to the get_height of the actual bars.
Here is some toy data. The plot shows the individual counts of each type. I want to add an annotation above each bar giving that type's percentage of all types for that person's name.
Let me know if you need any more clarification.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
d = {'ID': [1,1,1,2,2,3,3,3,4],
'name': ['bob','bob','bob','shelby','shelby','jordan','jordan','jordan','jeff'],
'type': ['type1','type2','type4','type1','type6','type5','type8','type2',None]}
df: pd.DataFrame = pd.DataFrame(data=d)
df_pivot: pd.DataFrame = df.pivot_table(index='type', columns=['name'], values='ID', aggfunc={'ID': np.sum}).fillna(0)
# create percent totals of the specific type's row of the total
df_pivot['bob_pct_total']: pd.Series = (df_pivot['bob']/df_pivot['bob'].sum()).mul(100).round(1)
df_pivot['shelby_pct_total']: pd.Series = (df_pivot['shelby']/df_pivot['shelby'].sum()).mul(100).round(1)
df_pivot['jordan_pct_total']: pd.Series = (df_pivot['jordan']/df_pivot['jordan'].sum()).mul(100).round(1)
df_pivot.head(10)
name bob jordan shelby bob_pct_total shelby_pct_total jordan_pct_total
type
type1 1.0 0.0 2.0 33.3 50.0 0.0
type2 1.0 3.0 0.0 33.3 0.0 33.3
type4 1.0 0.0 0.0 33.3 0.0 0.0
type5 0.0 3.0 0.0 0.0 0.0 33.3
type6 0.0 0.0 2.0 0.0 50.0 0.0
type8 0.0 3.0 0.0 0.0 0.0 33.3
fig, ax = plt.subplots(figsize=(15,15))
df_pivot.plot(kind='bar', y=['bob','jordan','shelby'], ax=ax)
You can use the old approach: loop through the bars and use each bar's height to position whatever text you want. Since matplotlib 3.4.0 there is also a function, bar_label, that removes much of the boilerplate:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
d = {'ID': [1, 1, 1, 2, 2, 3, 3, 3, 4],
'name': ['bob', 'bob', 'bob', 'shelby', 'shelby', 'jordan', 'jordan', 'jordan', 'jeff'],
'type': ['type1', 'type2', 'type4', 'type1', 'type6', 'type5', 'type8', 'type2', None]}
df: pd.DataFrame = pd.DataFrame(data=d)
df_pivot: pd.DataFrame = df.pivot_table(index='type', columns=['name'], values='ID', aggfunc={'ID': np.sum}).fillna(0)
# create percent totals of the specific type's row of the total
df_pivot['bob_pct_total']: pd.Series = (df_pivot['bob'] / df_pivot['bob'].sum()).mul(100).round(1)
df_pivot['shelby_pct_total']: pd.Series = (df_pivot['shelby'] / df_pivot['shelby'].sum()).mul(100).round(1)
df_pivot['jordan_pct_total']: pd.Series = (df_pivot['jordan'] / df_pivot['jordan'].sum()).mul(100).round(1)
fig, ax = plt.subplots(figsize=(12, 5))
columns = ['bob', 'jordan', 'shelby']
df_pivot.plot(kind='bar', y=columns, rot=0, ax=ax)
for bars, col in zip(ax.containers, ['bob_pct_total', 'jordan_pct_total', 'shelby_pct_total']):
    ax.bar_label(bars, labels=['' if val == 0 else f'{val}' for val in df_pivot[col]])
plt.tight_layout()
plt.show()
PS: To skip labeling the first bars, you could experiment with:
for bars, col in zip(ax.containers, ['bob_pct_total', 'jordan_pct_total', 'shelby_pct_total']):
    labels = ['' if val == 0 else f'{val}' for val in df_pivot[col]]
    labels[0] = ''
    ax.bar_label(bars, labels=labels)
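The per-name percentage columns can also be computed in one step rather than one line per person. The sketch below is one way to do that, assuming the counts from the pivot table in the question are typed in directly. The names `counts` and `pct` are illustrative, and the `Agg` backend is selected only so the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Counts copied from the pivot table shown in the question.
counts = pd.DataFrame(
    {"bob": [1, 1, 1, 0, 0, 0],
     "jordan": [0, 3, 0, 3, 0, 3],
     "shelby": [2, 0, 0, 0, 2, 0]},
    index=["type1", "type2", "type4", "type5", "type6", "type8"],
)

# Each column divided by its own total, expressed as a percentage.
pct = counts.div(counts.sum(axis=0), axis=1).mul(100).round(1)

fig, ax = plt.subplots(figsize=(12, 5))
counts.plot(kind="bar", rot=0, ax=ax)

# pandas draws one bar container per column, in column order.
for bars, name in zip(ax.containers, counts.columns):
    labels = ["" if v == 0 else f"{v}%" for v in pct[name]]
    ax.bar_label(bars, labels=labels)

fig.tight_layout()
```

Keeping the percentages in a separate frame with the same column names means the loop needs no hand-written list of `*_pct_total` columns.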
Given a dataframe df with columns A, B, C, and D,
A B C D
0 88 38 15.66 30.0
1 88 34 15.66 40.0
2 15 15 12.00 20.0
3 15 19 8.00 15.0
4 45 12 6.00 15.0
5 45 30 4.00 30.0
6 29 31 3.60 15.0
7 88 20 3.60 10.0
8 64 25 3.60 15.0
9 45 43 3.60 20.0
I want to make a scatter plot that graphs A vs B, with sizes based on C and colors based on D. After trying many ways to do this, I settled on grouping the data by D, then plotting each group in D:
fig, axes = plt.subplots()
factor = df.groupby('D')
for name, group in factor:
    axes.scatter(group.A, group.B, s=(group.C)**2, c=group.D,
                 cmap='viridis', norm=Normalize(vmin=min(df.D), vmax=max(df.D)), label=name)
This yields the appropriate result, but the default legend() function is wrong. The groups listed in the legend have correct names, but incorrect colors and sizes (colors should vary by group, and sizes of all markers should be the same).
I tried to manually set the legend, which I can approximate the colors but can't get the sizes to be equal. Eventually I'd like a second legend that will link sizes to the appropriate levels of C.
axes.legend(loc=1,scatterpoints=1,fontsize='small',frameon=False,ncol=2)
leg=axes.get_legend()
for i in range(len(factor)):
    z = plt.cm.viridis(np.linspace(0, 1, len(factor)))
    leg.legendHandles[i].set_color(z[i])
Here's one approach that seems to satisfy your requirements, using Seaborn's lmplot(). (Inspiration taken from this post.)
First, generate some sample data:
import numpy as np
import pandas as pd
n = 10
min_size = 50
max_size = 300
A = np.random.random(n)
B = np.random.random(n)*2
C = np.random.randint(min_size, max_size, size=n)
D = np.random.choice(['Group1','Group2'], n)
df = pd.DataFrame({'A':A,'B':B,'C':C,'D':D})
Now plot:
import seaborn as sns
sns.lmplot(x='A', y='B', hue='D',
fit_reg=False, data=df,
scatter_kws={'s':df.C})
UPDATE
Given updated example data from OP, the same lmplot() approach should fulfill specifications: group legend is tracked by color, size of legend indicators is equal.
sns.lmplot(x='A', y='B', hue='D', data=df,
scatter_kws={'s':df.C**2}, fit_reg=False,)
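An alternative that stays in plain matplotlib is to draw a single scatter and ask the returned collection for its legend entries with `legend_elements()`. The sketch below uses the data from the question; the legend titles and positions are arbitrary choices, and the `Agg` backend is selected only so the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [88, 88, 15, 15, 45, 45, 29, 88, 64, 45],
    "B": [38, 34, 15, 19, 12, 30, 31, 20, 25, 43],
    "C": [15.66, 15.66, 12.0, 8.0, 6.0, 4.0, 3.6, 3.6, 3.6, 3.6],
    "D": [30.0, 40.0, 20.0, 15.0, 15.0, 30.0, 15.0, 10.0, 15.0, 20.0],
})

fig, ax = plt.subplots()
points = ax.scatter(df["A"], df["B"], s=df["C"] ** 2, c=df["D"], cmap="viridis")

# Colour legend: one equally sized marker per distinct value of D.
color_handles, color_labels = points.legend_elements(prop="colors")
color_legend = ax.legend(color_handles, color_labels, title="D", loc="upper left")
ax.add_artist(color_legend)  # keep this legend when the second one is created

# Size legend: func maps marker area back to C (the inverse of C ** 2).
size_handles, size_labels = points.legend_elements(prop="sizes", func=np.sqrt, alpha=0.6)
ax.legend(size_handles, size_labels, title="C", loc="lower right")
```

The call to `add_artist` matters because a second `ax.legend(...)` would otherwise replace the first legend.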
My CSV data looks something like the sample below. I want to create a stacked bar plot with pandas where each bar shows the male and female portions in two colors, with the total count of people taking the drug on top of the bar. For instance, 7 people are aged 20, 6 male and 1 female, so the bar for age 20 should show 7 on top and the 6:1 split in two colors. I managed to group the people by age and plot the counts, but I also want each bar split by gender in different colors. Any help will be appreciated. Thank you.
Drug_ID,Age,Gender
FF90E5F780133BBCAEE9BA0E4654C5CA,15,F
2AB2529352E6A1D0BA91E167C5191231,17,M
6E4B2C80EA83C5B51675BC077A182E06,19,M
8AEB19A68DE4C301154973E0C87C774D,19,M
FA533E7DD1FCAB83822D4A2FC514AEE7,20,F
208F39485A8FBA7214CA004B53CC29B7,20,M
3172B95E8A5732D2EDB089A354787612,20,M
38FBAE52AAB04E56AB3A35C4AFBF3813,20,M
62D8A33130574C984EAF3A2E80C243A6,20,M
646AF35E192885FE1336649BA3735FC4,20,M
778D2B1354F30ED3D3BDD7B9437CF670,20,M
891312933FE5FB25703B3E958EF943E3,21,F
5CC63DFC5EF399F65CB9BC583A770DFB,21,M
97C9ED5750EC6FF618A059085F0BF4A6,21,M
C6F03C5E3CC37249E0308A09292F5A58,21,M
0713C427BDC3A90F291AF49599987F38,23,F
518EBD35FCDAB7C744334F993D8CEB28,23,F
A3FC185C4531AFF1E8B7B9F5985028CB,23,F
1837406921314CB182FB0C7BC5565204,23,M
CA0E6274BD39E8DE5C823F4E6F234252,23,M
E2F64030BB011C11541EDC192BAED09A,23,M
EF8F3D028C5759858AB7574864833015,23,M
5DCD427F26E05CC1C3F565BB05EAE10B,24,F
8646ED503722C3C6C6B44208EF1A5202,24,F
F9F45112C472282778E1F437F54B0B70,24,F
0CF4F20B89303CB4C03CF3BD3B27CAF0,24,M
63727039DFF77A46995DA8EDBC4E3497,24,M
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
data = pd.read_csv('/home/MedResearch/Desktop/drug_ids_age_gender.csv')
data.values
df = pd.DataFrame(data)
df2 = pd.merge(df1,df, left_index = True, right_index = True)
temp1 = df2.groupby('Age').Age.count()
df3 = pd.merge(df1,df, left_index = True, right_index = True)
temp2 = df3.groupby('Gender').Age.count()
ax = temp1.plot(kind='bar',stacked = False, colormap = 'Paired')
for p in ax.patches:
    ax.annotate(np.round(p.get_height(), decimals=0), (p.get_x()+p.get_width()/2., p.get_height()), ha='center', va='center', xytext=(2, 10), textcoords='offset points')
plt.ylabel('Numbers Of Patients Taking the drug')
plt.title('Age Distribution Of Patients Based Upon drug Intake')
plt.show()
Got something like this as a result:
This question comes up often, so I decided to write a step-by-step explanation. Note that I'm not a pandas guru, so there are things that could probably be optimized.
I started by getting a list of ages that I will use for my x-axis:
cvsdata = '''Drug_ID,Age,Gender
FF90E5F780133BBCAEE9BA0E4654C5CA,15,F
2AB2529352E6A1D0BA91E167C5191231,17,M
6E4B2C80EA83C5B51675BC077A182E06,19,M
8AEB19A68DE4C301154973E0C87C774D,19,M
FA533E7DD1FCAB83822D4A2FC514AEE7,20,F
208F39485A8FBA7214CA004B53CC29B7,20,M
3172B95E8A5732D2EDB089A354787612,20,M
38FBAE52AAB04E56AB3A35C4AFBF3813,20,M
62D8A33130574C984EAF3A2E80C243A6,20,M
646AF35E192885FE1336649BA3735FC4,20,M
778D2B1354F30ED3D3BDD7B9437CF670,20,M
891312933FE5FB25703B3E958EF943E3,21,F
5CC63DFC5EF399F65CB9BC583A770DFB,21,M
97C9ED5750EC6FF618A059085F0BF4A6,21,M
C6F03C5E3CC37249E0308A09292F5A58,21,M
0713C427BDC3A90F291AF49599987F38,23,F
518EBD35FCDAB7C744334F993D8CEB28,23,F
A3FC185C4531AFF1E8B7B9F5985028CB,23,F
1837406921314CB182FB0C7BC5565204,23,M
CA0E6274BD39E8DE5C823F4E6F234252,23,M
E2F64030BB011C11541EDC192BAED09A,23,M
EF8F3D028C5759858AB7574864833015,23,M
5DCD427F26E05CC1C3F565BB05EAE10B,24,F
8646ED503722C3C6C6B44208EF1A5202,24,F
F9F45112C472282778E1F437F54B0B70,24,F
0CF4F20B89303CB4C03CF3BD3B27CAF0,24,M
63727039DFF77A46995DA8EDBC4E3497,24,M'''
from io import StringIO  # Python 3 location; the Python 2 StringIO module no longer exists
df = pd.read_csv(StringIO(cvsdata))
ages = df.Age.unique()
array([15, 17, 19, 20, 21, 23, 24])
Then I generated a grouped dataframe with the counts of each M and F per age:
counts = df.groupby(['Age','Gender']).count()
print(counts)
Drug_ID
Age Gender
15 F 1
17 M 1
19 M 2
20 F 1
M 6
21 F 1
M 3
23 F 3
M 4
24 F 3
M 2
Using that, I can easily calculate the total number of individual per age group:
totals = counts.groupby(level=0).sum()  # counts.sum(level=0) was removed in pandas 2.0
print(totals)
Drug_ID
Age
15 1
17 1
19 2
20 7
21 4
23 7
24 5
To prepare for plotting, I'll transform my counts dataframe to separate each sex by columns, instead of by index. Here I also drop that 'Drug_ID' column name because the unstack() operation creates a MultiIndex and it's much easier to manipulate the dataframe without that MultiIndex.
counts = counts.unstack(level=1)
counts.columns = counts.columns.droplevel(level=0)
print(counts)
Gender F M
Age
15 1.0 NaN
17 NaN 1.0
19 NaN 2.0
20 1.0 6.0
21 1.0 3.0
23 3.0 4.0
24 3.0 2.0
Looks pretty good. I'll just do a final refinement and replace the NaN by 0.
counts = counts.fillna(0)
print(counts)
Gender F M
Age
15 1.0 0.0
17 0.0 1.0
19 0.0 2.0
20 1.0 6.0
21 1.0 3.0
23 3.0 4.0
24 3.0 2.0
With this dataframe, it is trivial to plot the stacked bars:
plt.bar(ages, counts['M'], bottom=None, color='blue', label='M')
plt.bar(ages, counts['F'], bottom=counts['M'], color='pink', label='F')
plt.legend()
plt.xlabel('Ages')
plt.ylabel('Count')
To plot the total counts on top of the bars, we'll use the annotate() function. It cannot be done in a single call, so we loop over the ages and the totals. totals is a one-column DataFrame, so totals.values is a 2-D array; flatten() turns it into the 1-D sequence the loop needs.
for age, tot in zip(ages, totals.values.flatten()):
    plt.annotate('N={:d}'.format(tot), xy=(age, tot), xytext=(0,5), textcoords='offset points', ha='center', va='bottom')
The coordinates for the annotations are (age, tot) because, since matplotlib 2.0, bars are centered on their x value by default (align='center'), and tot is the full height of the bar. In older versions, where bars ran from x to x+width with width=0.8, the center was x+0.4 instead. To lift the text clear of the bar, I offset it by a few (5) points in the y direction. Adjust to your liking.
Check out the documentation for bar() to adjust the parameters of the bar plots.
Check out the documentation for annotate() to customize your annotations
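For comparison, here is a shorter sketch of the same plot that lets pandas do the counting and stacking. It assumes `pd.crosstab` as a replacement for the groupby/unstack/fillna steps above, rebuilds only the Age and Gender columns of the sample data (the Drug_ID values are omitted), and selects the `Agg` backend only so the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Age/Gender pairs matching the counts in the question's CSV (Drug_ID omitted).
rows = ([(15, "F"), (17, "M")] + [(19, "M")] * 2
        + [(20, "F")] + [(20, "M")] * 6
        + [(21, "F")] + [(21, "M")] * 3
        + [(23, "F")] * 3 + [(23, "M")] * 4
        + [(24, "F")] * 3 + [(24, "M")] * 2)
df = pd.DataFrame(rows, columns=["Age", "Gender"])

# One row per age, one column per gender; missing combinations become 0.
counts = pd.crosstab(df["Age"], df["Gender"])
totals = counts.sum(axis=1)

ax = counts[["M", "F"]].plot(kind="bar", stacked=True, color=["blue", "pink"], rot=0)

# pandas places bar i at x = i, so the label positions are 0, 1, 2, ...
for x, total in enumerate(totals):
    ax.annotate(f"N={total}", xy=(x, total), xytext=(0, 5),
                textcoords="offset points", ha="center", va="bottom")

ax.set_xlabel("Age")
ax.set_ylabel("Count")
```

Note that pandas bar plots treat the index as categories, so the bars are evenly spaced even though ages 16, 18 and 22 are missing; the plt.bar() version above spaces them by their numeric value.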
I have a .dat file containing numbers arranged in 7 columns separated by two spaces. Is it possible to read the file, extract the data for each column, and assign each column to a key in a dictionary as an array? Can NumPy arrays be used as the values in a dictionary?
The .dat file has numbers like this:
1 -0.8 92.3 2.8 150 0 0
2 -0.7 99.3 1.9 140 0 0
3 -0.3 96.4 2.5 120 0 0
4 -0.3 95.0 3.1 130 0 0
5 -0.8 95.7 3.1 130 0 0
6 -0.5 95.0 2.1 120 0 0
7 -0.7 90.9 3.6 110 0 0
8 -0.6 85.7 2.6 80 0 0
9 -0.7 85.7 3.1 60 0 0
10 -1.2 85.6 3.6 50 0 8
I first read all the lines, then split each line on whitespace. I tried to assign the values in each column to the corresponding key in the dictionary, but this does not work. I think I have to put the values in an array and then put the array in the dictionary in some way?
def read_data(filename):
    infile = open(filename, 'r')
    for line in infile.readlines():
        data = {'hour': None, 'temperature': None, 'humidity':
                None, 'wind_speed':
                None, 'wind_direction':
                None, 'direct_flux': None, 'diffuse_flux': None}
        lines = line.split()
        data['hour'] = lines[0]
        data['temperature'] = lines[1]
        data['humidity'] = lines[2]
        data['wind_speed'] = lines[3]
        data['wind_direction'] = lines[4]
        data['direct_flux'] = lines[5]
        data['diffuse_flux'] = lines[6]
    return data
EDIT: I realized numpy arrays are a specific scientific data structure. I have not used them but assume converting the below lists (and its append operation) into numpy arrays is trivial.
You are correct. A dictionary holds (key, value) pairs. An entry of the form (key, value, value, ..., value) is not acceptable. Using a list() as the value (as you suggested) is a solution. Note now that the index corresponds to the line number the data was in.
data = {'hour': None, 'temperature': None, 'humidity':
        None, 'wind_speed':
        None, 'wind_direction':
        None, 'direct_flux': None, 'diffuse_flux': None}
# For each key, initialize a list as its value.
for key in data:
    data[key] = list()
for line in infile.readlines():
    lines = line.split()
    # we simply append into the list this key references.
    data['hour'].append(lines[0])
    data['temperature'].append(lines[1])
    data['humidity'].append(lines[2])
    data['wind_speed'].append(lines[3])
    data['wind_direction'].append(lines[4])
    data['direct_flux'].append(lines[5])
    data['diffuse_flux'].append(lines[6])
return data
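As a sketch of the conversion to NumPy arrays mentioned in the EDIT, the function below collects each column in a list and converts it at the end. It takes any iterable of lines (an open file works), so the sample here uses an in-memory `io.StringIO` with three rows copied from the question. The `KEYS` constant and the blank-line check are additions for illustration.

```python
import io
import numpy as np

KEYS = ["hour", "temperature", "humidity", "wind_speed",
        "wind_direction", "direct_flux", "diffuse_flux"]

def read_data(lines):
    """Collect each whitespace-separated column into a NumPy array."""
    columns = {key: [] for key in KEYS}
    for line in lines:
        fields = line.split()
        if not fields:  # skip blank lines
            continue
        for key, field in zip(KEYS, fields):
            columns[key].append(float(field))
    # Convert once at the end; appending to a list is cheaper than growing an array.
    return {key: np.array(values) for key, values in columns.items()}

sample = io.StringIO("1  -0.8  92.3  2.8  150  0  0\n"
                     "2  -0.7  99.3  1.9  140  0  0\n"
                     "10  -1.2  85.6  3.6  50  0  8\n")
data = read_data(sample)
print(data["temperature"].mean())
```

With a real file the call would look like `with open(filename) as infile: data = read_data(infile)`.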
I'm not quite sure I got right what you are asking for, but I'll try to answer.
I guess you want to load those tabulated data in a way you can easily work with, and making use of numpy's functionality.
Then, I think you have two options.
Using PANDAS
Pandas (see its documentation) is a comprehensive package built on numpy that lets you work with labelled data (so that columns and rows have names, and not only a positional index).
With pandas, the idea would be:
import pandas as pd
df = pd.read_csv('data.tab', sep=r"\s+", index_col=0, header=None,  # \s+ matches the two-space separator
                 names=['hour', 'temp', 'hum', 'w_speed', 'w_direction',
                        'direct_flux', 'diffuse_flux'])
df
temp hum w_speed w_direction direct_flux diffuse_flux
hour
1 -0.8 92.3 2.8 150 0 0
2 -0.7 99.3 1.9 140 0 0
3 -0.3 96.4 2.5 120 0 0
4 -0.3 95.0 3.1 130 0 0
5 -0.8 95.7 3.1 130 0 0
6 -0.5 95.0 2.1 120 0 0
7 -0.7 90.9 3.6 110 0 0
8 -0.6 85.7 2.6 80 0 0
9 -0.7 85.7 3.1 60 0 0
10 -1.2 85.6 3.6 50 0 8
Or, if you have the column names as the first row of the file simply:
import pandas as pd
df = pd.read_csv('data.tab', sep=r"\s+", index_col=0)
If you haven't heard of this library and you are managing this kind of data, I think it is really worthwhile to give it a close look.
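If a dictionary of arrays is still what you want after loading with pandas, each column can be turned into a NumPy array. This is a small sketch under the assumption that the file has no header row. It reads two sample rows from an in-memory `io.StringIO` instead of 'data.tab' and keeps `hour` as an ordinary column rather than the index.

```python
import io
import pandas as pd

names = ["hour", "temp", "hum", "w_speed", "w_direction", "direct_flux", "diffuse_flux"]
sample = io.StringIO("1  -0.8  92.3  2.8  150  0  0\n"
                     "2  -0.7  99.3  1.9  140  0  0\n")

# sep=r"\s+" treats any run of whitespace as one separator.
df = pd.read_csv(sample, sep=r"\s+", header=None, names=names)

# One NumPy array per column, keyed by column name.
data = {name: column.to_numpy() for name, column in df.items()}
```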
Using only Numpy
If you don't need to do much with those data, or won't do it again or whatever, getting Pandas may be a bit too much...
In any case, you can always read the tabulated file from numpy
import numpy as np
array = np.loadtxt("data.tab")  # the default delimiter splits on any run of whitespace
It will ignore comment lines (by default lines with #) and you can also skip the first row and so on.
Now you'll have all the data on array, and you can access it slicing and indexing. If you want to have labelled categories (and you don't like the first option), you can build your dictionary of arrays following the last snippet of code by:
data = {}
headers = ['hour', 'temp', 'hum', 'w_speed', 'w_direction', 'direct_flux',
'diffuse_flux']
for i, header in enumerate(headers):
    data[header] = array[:, i]