Violin plot for Olympic dataset [closed] - python

Closed. This question needs details or clarity. It is not currently accepting answers.
Want to improve this question? Add details and clarify the problem by editing this post.
Closed yesterday.
I have the "120 years of Olympic history: athletes and results" dataset (athlete_events.csv, shape (271116, 15)). Draw a violin plot (using Matplotlib) that shows the height distributions of athletes in the Gymnastics, Cycling and Basketball sports at the Olympic Games between 2000 and 2016 (inclusive). The line inside each violin shows the location of the median value. The three violins should be purple, green and blue in colour, including their fill colour and the colour of the median lines (hint: consider the return values of the violin plot method). Since some athletes may compete in multiple events or multiple Olympics, make sure to include only one instance of each athlete (no duplicates).
I am looking for the appropriate code
# specify the sports we want to plot
sports = ['Gymnastics', 'Cycling', 'Basketball']

# filter the relevant data
df1 = df[(df['Sport'].isin(sports)) & (df['Year'] >= 2000) & (df['Year'] <= 2016)]
df1 = df1.drop_duplicates(subset=['Name'])
df2 = df1[['Sport', 'Year', 'Height']]

I filtered this way but I am not getting an appropriate plot.
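The filtering above looks right; what remains is the plot itself. A possible sketch follows, using synthetic data as a stand-in for athlete_events.csv (the real file is not reproduced here). The key point is that matplotlib's violinplot returns a dict whose 'bodies' (the filled violins) and 'cmedians' (the median lines) can be recoloured:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend, safe for scripts
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for athlete_events.csv
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Name": [f"athlete{i}" for i in range(600)],
    "Sport": rng.choice(["Gymnastics", "Cycling", "Basketball", "Judo"], 600),
    "Year": rng.choice(range(1990, 2021, 2), 600),
    "Height": rng.normal(175, 10, 600),
})

sports = ["Gymnastics", "Cycling", "Basketball"]
df1 = df[df["Sport"].isin(sports) & df["Year"].between(2000, 2016)]
df1 = df1.drop_duplicates(subset=["Name"])  # one row per athlete

# One array of heights per sport, NaNs dropped
data = [df1.loc[df1["Sport"] == s, "Height"].dropna().to_numpy() for s in sports]

fig, ax = plt.subplots()
parts = ax.violinplot(data, showmedians=True, showextrema=False)

colors = ["purple", "green", "blue"]
for body, c in zip(parts["bodies"], colors):
    body.set_facecolor(c)   # fill colour of the violin
    body.set_edgecolor(c)   # outline in the same colour
parts["cmedians"].set_color(colors)  # LineCollection: one colour per median line

ax.set_xticks([1, 2, 3])
ax.set_xticklabels(sports)
ax.set_ylabel("Height (cm)")
```

With the real dataset, the synthetic DataFrame construction would be replaced by `pd.read_csv('athlete_events.csv')`.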

Related

How to merge 3 different pandas dataframes? [closed]

I have 3 pandas dataframes:

abc        colour     lenght
xyz        type       breadth
           pattern    height
                      area
I want to combine the dataframes so that the result looks like this:

abc   colour    lenght
      type      breadth
      pattern   height
                area
xyz   colour    lenght
      type      breadth
      pattern   height
                area
I also want to export the end result to an Excel sheet - how do I do that without making it look messy?
First concat the second and third dataframes alongside each row of the first, then stack the results:

df1 = pd.DataFrame([['abc'], ['xyz']], columns=['col1'])
df2 = pd.DataFrame([['colour'],
                    ['type'],
                    ['pattern']], columns=['col2'])
df3 = pd.DataFrame([['lenght'],
                    ['breadth'],
                    ['height'],
                    ['area']], columns=['col3'])

# note: the snippet must select from df1 here (not an undefined df)
pd.concat([pd.concat([df1[lambda x: x.index == i].reset_index(), df2, df3], axis=1)
           for i in range(len(df1))]).drop('index', axis=1)
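The question also asked about the Excel export. A possible sketch, assuming an Excel writer engine such as openpyxl is installed: blank out the NaN padding first so the sheet stays readable, then use DataFrame.to_excel.

```python
import pandas as pd

df1 = pd.DataFrame([["abc"], ["xyz"]], columns=["col1"])
df2 = pd.DataFrame([["colour"], ["type"], ["pattern"]], columns=["col2"])
df3 = pd.DataFrame([["lenght"], ["breadth"], ["height"], ["area"]], columns=["col3"])

# Line each row of df1 up with df2 and df3 side by side, then stack the blocks
blocks = [
    pd.concat([df1.iloc[[i]].reset_index(drop=True), df2, df3], axis=1)
    for i in range(len(df1))
]
result = pd.concat(blocks, ignore_index=True)

# Blank out the NaN padding so the exported sheet does not show 'NaN' cells
result = result.fillna("")

try:
    result.to_excel("combined.xlsx", index=False)  # needs an engine like openpyxl
except ImportError:
    pass  # no Excel writer engine available in this environment
```

The `ignore_index=True` keeps the stacked blocks from repeating index labels, which also avoids a messy first column in the exported sheet.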

draw boxplot for data in a loop [closed]

I have a 1000*8 dataset where each column represents the price of a stock at different times, so there are 8 stocks. I want to draw 8 boxplots, one for each stock, in a loop in Python to examine the extreme values. Could you please tell me how I can do that?
As a quick alternative to using matplotlib directly, pandas has a reasonable boxplot function that could be used.

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(1000, 8), columns=list('ABCDEFGH'))
df.boxplot(column=list(df.columns))

Edit: just realised your question asked to do this in a loop.

import matplotlib.pyplot as plt

for c in df.columns:
    fig, ax = plt.subplots()
    df.boxplot(column=c, ax=ax)  # draw each stock's boxplot on its own figure
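For the "examine the extreme values" part, a possible sketch of flagging the points a boxplot would draw as outliers (the conventional 1.5 * IQR whisker rule), again using random data as a stand-in for the real prices:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
df = pd.DataFrame(rng.standard_normal((1000, 8)), columns=list("ABCDEFGH"))

# Flag values outside the boxplot whiskers (1.5 * IQR rule) for each stock
outliers = {}
for c in df.columns:
    q1, q3 = df[c].quantile([0.25, 0.75])
    iqr = q3 - q1
    mask = (df[c] < q1 - 1.5 * iqr) | (df[c] > q3 + 1.5 * iqr)
    outliers[c] = df.loc[mask, c]
```

Each entry of `outliers` then holds exactly the points that matplotlib would plot as fliers for that column.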

How to plot in this kind of graph in Python or R? [closed]

I have searched for how to plot this kind of graph with matplotlib or ggplot, but I couldn't figure out how to make it.
The figure is from Nature 500(7463):415-421, August 2013.
I want to plot the data as dots, with a mark for the median, to show the distribution.
Many thanks for any help!
This question is really about how to research the literature. So let's do that.
Here's the article in PubMed. It's also freely-available at PubMed Central. There, we find supplementary data files in XLS format. The file with data closest to what we need is this XLS file. Unfortunately, exploration reveals that it contains only 8 distinct tissue types, whereas Figure 1 contains 30. So we cannot reproduce that figure from the data. This is not uncommon in science.
However: the figure caption points us to this article, which contains a similar figure. Data is available in this XLS file.
I downloaded that file, opened it in Excel and saved it as the latest XLSX format. Now we can read it into R, assuming the file is in Downloads:

library(tidyverse)
library(readxl)

tableS2 <- read_excel("~/Downloads/NIHMS471461-supplement-3.xlsx",
                      sheet = "Table S2")
Now we read the figure caption:
Each dot corresponds to a tumor-normal pair, with vertical position indicating the total frequency of somatic mutations in the exome. Tumor types are ordered by their median somatic mutation frequency...
In our file, the pairs correspond to name, total frequency is n_coding_mutations and somatic mutation frequency is coding_mutation_rate. So we want to:
group by tumor_type
calculate the median of coding_mutation_rate
order the values of n_coding_mutations within tumor_type
order tumor_type by median coding_mutation_rate
And then plot the ordered total frequencies versus sample, grouped by the ordered tumor types.
Which might look something like this:
tableS2 %>%
  group_by(tumor_type) %>%
  mutate(median_n = median(n_coding_mutations)) %>%
  arrange(tumor_type, coding_mutation_rate) %>%
  mutate(idx = row_number()) %>%
  arrange(median_n) %>%
  ungroup() %>%
  mutate(tumor_type = factor(tumor_type,
                             levels = unique(tumor_type))) %>%
  ggplot(aes(idx, n_coding_mutations)) +
  geom_point() +
  facet_grid(~tumor_type,
             switch = "x") +
  scale_y_log10() +
  geom_hline(aes(yintercept = median_n),
             color = "red") +
  theme_minimal() +
  theme(strip.text.x = element_text(angle = 90),
        axis.title.x = element_blank(),
        axis.text.x = element_blank())
The resulting plot looks pretty close to the original figure.
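For readers without R, roughly the same grouping and ordering logic can be sketched in Python with pandas and matplotlib. The data below is synthetic, standing in for Table S2; the column names are assumed to match those used above:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for Table S2
rng = np.random.default_rng(0)
tumor_types = ["AML", "breast", "melanoma", "lung"]
df = pd.DataFrame({
    "tumor_type": rng.choice(tumor_types, 200),
    "n_coding_mutations": rng.lognormal(3, 1, 200).round().astype(int) + 1,
})
df["coding_mutation_rate"] = df["n_coding_mutations"] / 30.0

# Order tumour types by median mutation count; one panel per type
medians = df.groupby("tumor_type")["n_coding_mutations"].median().sort_values()
fig, axes = plt.subplots(1, len(medians), sharey=True)
for ax, t in zip(axes, medians.index):
    # Samples ordered by mutation rate within each tumour type
    sub = df[df["tumor_type"] == t].sort_values("coding_mutation_rate")
    ax.scatter(range(len(sub)), sub["n_coding_mutations"], s=5)
    ax.axhline(medians[t], color="red")  # median marker, as in the R version
    ax.set_yscale("log")
    ax.set_xticks([])
    ax.set_xlabel(t, rotation=90)
```

With the real XLSX file, `pd.read_excel("NIHMS471461-supplement-3.xlsx", sheet_name="Table S2")` would replace the synthetic DataFrame.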

Handling a pandas column that has multiple values for data analysis [closed]

I have a dataframe with 'genre' as a column. Each entry in this column has several values; for example, the movie 'Harry Potter' could have fantasy,adventure in the genre column. As I am doing data analysis and exploration, I have no idea how to represent this column with multiple values so as to show relationships between movies and/or genres.
I have thought of using graph analysis to show the relationships, but I would like to explore other approaches I could consider.
You can use str.get_dummies to create new indicator columns from the genres:

import pandas as pd

df = pd.DataFrame({'Movies': ['Harry Potter', 'Toy Story'],
                   'Genres': ['fantasy,adventure',
                              'adventure,animation,children,comedy,fantasy']})

df = df.set_index('Movies')['Genres'].str.get_dummies(',')
print(df)

              adventure  animation  children  comedy  fantasy
Movies
Harry Potter          1          0         0       0        1
Toy Story             1          1         1       1        1
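If indicator columns are not what you want, another common representation (an alternative suggestion, not part of the original answer) is one row per (movie, genre) pair, via str.split plus DataFrame.explode, which works well with groupby-style exploration:

```python
import pandas as pd

df = pd.DataFrame({"Movies": ["Harry Potter", "Toy Story"],
                   "Genres": ["fantasy,adventure",
                              "adventure,animation,children,comedy,fantasy"]})

# Split the comma-separated string into a list, then expand to one row per genre
long = (df.assign(Genres=df["Genres"].str.split(","))
          .explode("Genres")
          .rename(columns={"Genres": "Genre"}))
```

From this long format, e.g. `long.groupby("Genre")["Movies"].count()` gives genre frequencies directly.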

Python method to display dataframe rows with least common column string [closed]

I have a dataframe with 3 columns (department, sales, region), and I want to write a method to display all rows that are from the least common region. Then I need to write another method to count the frequency of the departments represented in the least common region. I have no idea how to do this.
Functions would be unnecessary - pandas already has implementations to accomplish what you want! Suppose I had the following csv file, test.csv...
department,sales,region
sales,26,midwest
finance,45,midwest
tech,69,west
finance,43,east
hr,20,east
sales,34,east
If I'm understanding you correctly, I would obtain a DataFrame representing the least common region like so:
import pandas as pd
df = pd.read_csv('test.csv')
counts = df['region'].value_counts()
least_common = counts[counts == counts.min()].index[0]
least_common_df = df.loc[df['region'] == least_common]
least_common_df is now:
department sales region
2 tech 69 west
As for obtaining the department frequency for the least common region, I'll leave that up to you. (I've already shown you how to get the frequency for region.)
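For completeness, a possible sketch of that last step (department frequency within the least common region), with the same test.csv data inlined so the snippet is self-contained; value_counts does the counting:

```python
import io
import pandas as pd

csv = """department,sales,region
sales,26,midwest
finance,45,midwest
tech,69,west
finance,43,east
hr,20,east
sales,34,east
"""
df = pd.read_csv(io.StringIO(csv))

counts = df["region"].value_counts()
least_common = counts.idxmin()                      # region with the fewest rows
least_common_df = df[df["region"] == least_common]

# Frequency of departments within the least common region
dept_freq = least_common_df["department"].value_counts()
```

Note that `counts.idxmin()` is a slightly shorter equivalent of the `counts[counts == counts.min()].index[0]` used in the answer above.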
