Boxplot visualization - python

I need to make a boxplot, but I want to restrict it to a subset of the values in one column of a dataset, and I don't know how to do that. This is what I have so far. I want to pick the ten most frequent nationalities in the column, but I cannot figure out how to do it.

If I understand your question correctly, this should work for a dataframe called df with a column called "Nationality":
import collections
counts = collections.Counter(df.Nationality)
top10countries = [elem for elem, _ in counts.most_common(10)]
df_top10 = df[df['Nationality'].isin(top10countries)]
and then use df_top10 to make boxplots.
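The same filter can also be written with pandas alone. The following is a minimal sketch, not the answerer's code: the toy data, the "Age" column, and top_n = 2 are assumptions made only so the example is small enough to check by hand.

```python
import pandas as pd

# Toy data; the "Age" column is a hypothetical numeric column to plot.
df = pd.DataFrame({
    "Nationality": ["A", "A", "A", "B", "B", "C"],
    "Age": [20, 22, 25, 30, 31, 40],
})

top_n = 2  # use 10 for the real dataset

# value_counts() counts each nationality; nlargest keeps the most frequent ones.
top = df["Nationality"].value_counts().nlargest(top_n).index

# Keep only the rows whose nationality is among the most frequent.
df_top = df[df["Nationality"].isin(top)]

# With matplotlib installed, one box per nationality could then be drawn with:
# df_top.boxplot(column="Age", by="Nationality")
```

Here df_top keeps the rows for "A" and "B" and drops the single "C" row.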


Concatenating and appending to a DataFrame inside the For Loop in Python

I have the following problem.
I have a fairly large dataset with features and IDs. Because of how the task is defined, I am not clustering the whole dataset at once; instead, I take each ID and train the model on the feature data for that ID alone. In detail:
Suppose the initial dataframe is df_init.
First I create an array of the unique IDs:
dd = df_init['ID'].unique()
Then I build a dict comprehension that maps each ID to its rows:
dds = {x:y for x,y in df_init.groupby('ID')}
Iterating over dds in a for loop, I take the data for each ID and use it to train the clustering algorithm. After that, pd.concat() is used to put the results back into one dataframe (for this example I show only two IDs):
df = pd.DataFrame()
d={}
n=5
for i in dd[:2]:
    d[i] = dds[i].iloc[: , 1:5].values
    ac = AgglomerativeClustering(n_clusters=n, linkage='complete').fit(d[i])
    labels = ac.labels_
    labels = pd.DataFrame(labels)
    df = pd.concat([df, labels])
    print(i)
    print('Labels: ', labels)
The loop prints the following output:
And the resulting df looks like this (shown only for the first ID; the remaining labels are there as well):
My question is: how can I add a column to this dataframe inside the loop that matches each ID to its labels (4 labels for ID_1, another 4 labels for ID_2, etc.)? Is there a pandas solution for this?
Many thanks in advance!
Below this line:
labels = pd.DataFrame(labels)
Add the following:
labels['ID']=i
This gives you an extra column holding the proper ID for each subset; the scalar i is broadcast to every row of labels.
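A variation on the same idea is to collect the per-ID frames in a list and concatenate once at the end. This is only a sketch under stated assumptions: cluster_labels is a hypothetical stand-in for AgglomerativeClustering(...).fit(values).labels_ so the example runs without scikit-learn, and the toy df_init with columns f1 and f2 is invented for illustration.

```python
import numpy as np
import pandas as pd


def cluster_labels(values, n_clusters):
    # Hypothetical stand-in for a real clustering call; it only returns
    # one integer label per row so the example stays self-contained.
    return np.arange(len(values)) % n_clusters


# Invented toy data: two IDs with four rows each.
df_init = pd.DataFrame({
    "ID": ["ID_1"] * 4 + ["ID_2"] * 4,
    "f1": range(8),
    "f2": range(8, 16),
})

parts = []
for group_id, group in df_init.groupby("ID"):
    labels = cluster_labels(group[["f1", "f2"]].values, n_clusters=2)
    # The scalar group_id is broadcast to every row of this small frame.
    parts.append(pd.DataFrame({"ID": group_id, "label": labels}))

# One concat at the end instead of growing a dataframe inside the loop.
result = pd.concat(parts, ignore_index=True)
```

Concatenating once avoids copying the accumulated dataframe on every iteration, which can matter when there are many IDs.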

Display additional values in holoviews sankey labels or hover information

I would like to modify the labels on holoviews Sankey diagrams so that they show the percentage values in addition to the numerical values.
For example:
import holoviews as hv
import pandas as pd
hv.extension('bokeh')
data = {'A':['XX','XY','YY','XY','XX','XX'],
'B':['RR','KK','KK','RR','RK','KK'],
'values':[10,5,8,15,19,1]}
df = pd.DataFrame(data, columns=['A','B','values'])
sankey = hv.Sankey(df)
For the 'From' label 'YY', which is currently 'YY - 8', I would like 'YY - 8 (13.8%)', i.e. the percentage added after the value.
I have found ways to change from the absolute value to percentage by using something along the lines of:
value_dim = hv.Dimension('Percentage', unit='%')
But can't find a way to have both values in the label.
Additionally, I tried to modify the hover tag. In my search to find ways to modify this I found ways to reference and display various attributes in the hover information (through the bokeh tooltips) but it does not seem like you can manipulate this information.
This answer explains two possible ways to achieve the desired result. Let's start with the example DataFrame and the necessary imports.
import holoviews as hv
from holoviews import opts, dim # only needed for option 2
import pandas as pd
hv.extension('bokeh')
data = {'A':['XX','XY','YY','XY','XX','XX'],
'B':['RR','KK','KK','RR','RK','KK'],
'values':[10,5,8,15,19,1],
}
df = pd.DataFrame(data)
1. Option
Use hv.Dimension(spec, **params), which lets you attach a formatter to a column through the value_format keyword. The formatter here simply combines the value with its share of the total in percent.
total = df['values'].sum()
def fmt(x):
    # the .1% format multiplies by 100, so 8 out of 58 is shown as 13.8%
    return f'{x} ({x / total:.1%})'
hv.Sankey(df, vdims=hv.Dimension('values', value_format=fmt))
2. Option
Extend the DataFrame df by one column which stores the labels you want to use. This column can later be reused inside the Sankey with opts(labels=dim('labels')). To check that the calculations are correct, you can turn show_values on, but this duplicates the values inside the labels. Therefore show_values is set to False in the final solution. Getting the labels into the correct order can be tricky.
labels = []
for item in ['A', 'B']:
    grouper = df.groupby(item, sort=False)['values']
    total_sum = grouper.sum().sum()
    for name, group in grouper:
        _sum = group.sum()
        # multiply by 100 so the value is a percentage, not a fraction
        _percent = round(100 * _sum / total_sum, 1)
        labels.append(f'{name} - {_sum} ({_percent}%)')
df['labels'] = labels
hv.Sankey(df).opts(show_values=False, labels=dim('labels'))
The downside of this solution is that we apply a groupby to both columns 'A' and 'B', which holoviews will do again internally, so it is not very efficient. Note also that the assignment df['labels'] = labels only works here because the example happens to have six nodes and six rows.
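To check the label text of option 2 without rendering a figure, the per-node sums and percentages can be computed with pandas alone. This is a minimal sketch; the dictionary node_labels is an illustrative name that does not come from the answer, and nothing here touches the holoviews API.

```python
import pandas as pd

df = pd.DataFrame({
    "A": ["XX", "XY", "YY", "XY", "XX", "XX"],
    "B": ["RR", "KK", "KK", "RR", "RK", "KK"],
    "values": [10, 5, 8, 15, 19, 1],
})

total = df["values"].sum()  # 58 for this example

# Map each node name to a label of the form "name - sum (percent%)".
node_labels = {}
for column in ["A", "B"]:
    sums = df.groupby(column, sort=False)["values"].sum()
    for name, node_sum in sums.items():
        node_labels[name] = f"{name} - {node_sum} ({node_sum / total:.1%})"

print(node_labels["YY"])  # YY - 8 (13.8%)
```

Keying the labels by node name sidesteps the ordering problem mentioned above, at the cost of having to map them back onto the nodes yourself.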
Output
Comment
Both solutions create nearly the same figure; only the hover tooltips differ.

How to populate arrays with values read in from csv via pandas?

I have created a DataFrame with pandas by reading a csv file. What I want to do is iterate down the rows and collect the values in column 1 into one array, and do the same for the values in column 2 with a different array. This seems like it would normally be a fairly easy thing to do, so I think I am missing something, but I can't find much online that isn't overly complicated or that does what I want. Stack questions like this one appear to be asking the same thing, but the answers are long and complicated. Is there no way to do this in a few lines of code? Here is what I have set up:
import pandas as pd
#available possible players
playerNames = []
df = pd.read_csv('Fantasy Week 1.csv')
What I anticipate I should be able to do would be something like:
for row in df.columns[1]:
    playerNames.append(row)
This however does not return the desired result.
Essentially, if df =
[1,2,3
4,5,6
7,8,9], I would want my array to be [1,4,7]
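df.columns[1] is only the label of the second column, not its values, so the loop never touches the data. Index the DataFrame with it to get the values:
for row in df[df.columns[1]]:
    playerNames.append(row)
Or even better, skip the loop:
playerNames = df[df.columns[1]].tolist()
In your example you want the first column's values, so use index 0:
for row in df[df.columns[0]]:
    playerNames.append(row)
Or even better:
playerNames = df[df.columns[0]].tolist()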
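Columns can also be selected by position with iloc, which avoids looking up the column name first. A minimal sketch: the in-memory CSV text and the header names a, b, c are assumptions standing in for 'Fantasy Week 1.csv'.

```python
import io

import pandas as pd

# Stand-in for the real file; the header names are invented.
csv_text = "a,b,c\n1,2,3\n4,5,6\n7,8,9\n"
df = pd.read_csv(io.StringIO(csv_text))

# iloc[:, k] selects every row of the k-th column (0-based).
first_col = df.iloc[:, 0].tolist()
second_col = df.iloc[:, 1].tolist()

print(first_col)   # [1, 4, 7]
print(second_col)  # [2, 5, 8]
```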

Trying to get the largest number in a column of a .csv file

This is what I have currently. I get the error 'int' object is not iterable. If I understand correctly, the issue is that BIKES_AVAILABLE is assigned a number at the top of my project, so instead of looking at the column, the code looks at that number and hits an error. How should I go about going through the column? I apologize in advance for the newbie question.
for i in range(len(stations[BIKES_AVAILABLE]) -1):
    most_bikes = max(stations[BIKES_AVAILABLE])
    sort(stations[BIKES_AVAILABLE]).remove(max(stations[BIKES_AVAILABLE]))
    if most_bikes == max(stations[BIKES_AVAILABLE]):
        second_most = max(stations[BIKES_AVAILABLE])
    index_1 = index(most_bikes)
    index_2 = index(second_most)
    most_bikes = max(data[0][index_1], data[0][index_2])
return most_bikes
Another approach that may suit data manipulation better is the pandas module.
With it you could do this:
import pandas as pd
data = pd.read_csv('bicycle_data.csv')
# Alternative:
# most_sales = data['sold'].max()
most_sales = max(data['sold'])
Now you don't have to worry about indexing columns with numbers.
You can also do something like this:
sorted_data = data.sort_values(by='sold', ascending=False)
# Displays top 5 sold bicycles.
print(sorted_data.head(5))
If you do need the position of the maximum, pandas has a built-in method for that: idxmax returns the index label of the largest value.
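As a small illustration of idxmax, here is a sketch on invented data. The in-memory CSV text and the 'model' column are assumptions standing in for bicycle_data.csv; only the 'sold' column name comes from the answer above.

```python
import io

import pandas as pd

# Stand-in for the real file; the rows are made up.
csv_text = "model,sold\nroad,12\nmountain,30\ncity,7\n"
data = pd.read_csv(io.StringIO(csv_text))

# idxmax returns the index label of the row with the largest value.
best_row = data["sold"].idxmax()
best = data.loc[best_row]

# nlargest returns the top rows directly, e.g. the two best sellers.
top_two = data.nlargest(2, "sold")

print(best_row)         # 1
print(best["model"])    # mountain
```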
Using a generator inside max()
If you have a CSV file named test.csv, with contents:
line1,3,abc
line2,1,ahc
line3,9,sbc
line4,4,agc
You can use a generator expression inside the max() function for a memory-efficient solution (i.e. no list is created). Iterating over the file object directly reads one line at a time, whereas readlines() would build the whole list first.
If you wanted to do this for the second column, then:
max(int(l.split(',')[1]) for l in open("test.csv"))
which would give 9 for this example.
Update
To get the row (index), pair each value with its line number so that you can return the index of the maximum:
max(((i,int(l.split(',')[1])) for i,l in enumerate(open("test.csv"))),key=lambda t:t[1])[0]
which gives 2 here, because the maximum of column 2 (which is 9) is on the line with index 2 of test.csv (i.e. the third line).
This works fine, but you may prefer to break it up slightly:
with open("test.csv") as f:
    row_index = max(((i,int(l.split(',')[1])) for i,l in enumerate(f)),key=lambda t:t[1])[0]
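The same search can be written with the standard csv module, which handles the splitting and would also cope with quoted fields. This is a sketch under assumptions: an in-memory string stands in for test.csv, and the variable names are illustrative.

```python
import csv
import io

# Stand-in for test.csv with the contents shown above.
csv_text = "line1,3,abc\nline2,1,ahc\nline3,9,sbc\nline4,4,agc\n"

with io.StringIO(csv_text) as handle:
    reader = csv.reader(handle)
    # enumerate pairs each parsed row with its 0-based position;
    # the key compares rows by the integer in their second column.
    row_index, row = max(enumerate(reader), key=lambda pair: int(pair[1][1]))

largest = int(row[1])
print(row_index, largest)  # 2 9
```

With a real file, io.StringIO(csv_text) would be replaced by open("test.csv", newline="").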
Assuming a csv structure like so:
data = ['1,blue,15,True',
'2,red,25,False',
'3,orange,35,False',
'4,yellow,24,True',
'5,green,12,True']
To get the max value from the 3rd column, convert each field to int before comparing (otherwise the values are compared as strings):
largest_number = max(int(n.split(',')[2]) for n in data)

pandas grouping based on different data

I want to group one dataframe by bins (cuts) computed from a different dataframe.
For instance, I compute the cuts from one frame:
my_fcuts = pd.qcut(frame1['prices'],5)
pd.groupby(frame2, my_fcuts)
The above statement fails because the lengths must be the same.
I know I can easily write a mapper function, but what if the cut were
my_fcuts = pd.qcut(frame1['prices'],20)
or some higher number? Surely there must be a built-in way in pandas to do this very simple thing: groupby should be able to accept "cuts" from different data and reclassify.
Any ideas?
Thanks, I figured out the answer myself:
import numpy as np
# Bin edges come from data; np.digitize then assigns each btest value to a bin.
volgroups = np.digitize(btest['vol_proxy'],np.linspace(min(data['vol_proxy']), max(data['vol_proxy']), 10))
trendgroups = np.digitize(btest['trend_proxy'],np.linspace(min(data['trend_proxy']), max(data['trend_proxy']), 10))
#btest.groupby([volgroups,trendgroups]).mean()['pnl'].plot(kind='bar')
#plt.show()
# Group btest by the pair of bin numbers.
df = btest.groupby([volgroups,trendgroups]).groups
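For quantile bins, as in the question, one possible pandas-only route is to ask pd.qcut for its bin edges with retbins=True and reuse them with pd.cut on the second frame. This is a sketch, not the poster's solution: the toy frames and the 'pnl' column values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Invented data: frame1 defines the bins, frame2 is the frame to group.
frame1 = pd.DataFrame({"prices": np.arange(1.0, 101.0)})
frame2 = pd.DataFrame({
    "prices": [5.0, 25.0, 50.0, 75.0, 95.0, 99.0],
    "pnl": [1.0, 2.0, 3.0, 4.0, 5.0, 7.0],
})

# retbins=True also returns the quantile edges computed from frame1.
_, edges = pd.qcut(frame1["prices"], 5, retbins=True)

# Reuse those edges to classify frame2; include_lowest keeps the minimum edge.
buckets = pd.cut(frame2["prices"], bins=edges, include_lowest=True)

# buckets has the same length as frame2, so groupby accepts it.
mean_pnl = frame2.groupby(buckets, observed=False)["pnl"].mean()
print(mean_pnl)
```

Values in frame2 that fall outside the range of frame1 get no bin (NaN) and are left out of the groups, so the edges may need widening if that can happen.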
