I'm new to python and pandas but I've used R in the past for data analysis. I have a simple dataset:
df.head()
Sequence Level Count
1 Easy 5
1 Medium 7
1 Hard 9
I would like to convert this to:
Sequence Easy Medium Hard
1 5 7 9
In R, I could simply do this by using the reshape2 package. In python it seems like one of my options is to create dummy variables using get_dummies but that would still generate multiple rows for the same Sequence in my case. Is there an easy way of achieving my resultset?
I'm finally trying to plot it using:
import matplotlib.pyplot as plt
df.plot(kind='bar', stacked=True)
plt.show()
Any help would be appreciated.
You could use pandas pivot_table:
In [1436]: pd.pivot_table(df, index='Sequence', columns='Level', values='Count')
Out[1436]:
Level Easy Hard Medium
Sequence
1 5 9 7
Then you could plot it:
df1 = pd.pivot_table(df, index='Sequence', columns='Level', values='Count')
df1.plot(kind='bar', stacked=True)
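If each (Sequence, Level) pair occurs only once, as in the sample, DataFrame.pivot does the same reshaping without aggregating (pivot_table averages duplicates by default). A minimal sketch, assuming just the three sample rows from the question, that also puts the columns back in the Easy, Medium, Hard order of the desired output:

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "Sequence": [1, 1, 1],
    "Level": ["Easy", "Medium", "Hard"],
    "Count": [5, 7, 9],
})

# pivot() reshapes without aggregating; it raises if a (Sequence, Level) pair repeats
wide = df.pivot(index="Sequence", columns="Level", values="Count")

# The columns come back sorted alphabetically, so restore the desired order explicitly
wide = wide[["Easy", "Medium", "Hard"]]
print(wide)
```

The reordered frame can then be passed to plot(kind='bar', stacked=True) exactly as above.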
I have the following df:
Country 2013 2014 2015 2016 2017
0 USA 40 30 20 30 30
1 Chile 1 2 4 6 1
So I need to plot the total infected (the numbers in each year column) over time, year by year.
So I did:
grid = sns.FacetGrid(data=df, col="Country", col_wrap=5, hue="Country")
grid.map(plt.plot,)
But this is not going to work because each year is a column and I cannot pass that to the grid.map
Any ideas on how to do this?
I'm not sure exactly what kind of plot you wanted, but this is one way I got around your problem:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame({'Country':['USA', 'Chile'],
'2013':[40,1],
'2014':[30,2],
'2015':[20,4],
'2016':[30,6],
'2017':[30,1]})
df = df.T # This will transpose our df: see https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.transpose.html
df.columns = df.iloc[0] #Set the row [0] as our header
df.drop(['Country'], inplace=True, axis=0) # Drop row [0] since we don't want it.
Right now, this is what our df looks like:
From our df we can call:
df.plot.bar()
plt.xticks(rotation=0)
And we get the desired plot:
[Plot image]
P.S. I can't post pictures yet, so please take a look at the links Stack Overflow provides for them.
This code is one way of solving it, but you can definitely approach it with a different method. Remember that the plot is based on matplotlib, so you can customize it accordingly.
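Another route, closer to what the question attempted, is to reshape the frame so the years become values in a single column. This is only a sketch: the column names "Year" and "Infected" are chosen here for illustration and are not in the original data.

```python
import pandas as pd

df = pd.DataFrame({
    "Country": ["USA", "Chile"],
    "2013": [40, 1],
    "2014": [30, 2],
    "2015": [20, 4],
    "2016": [30, 6],
    "2017": [30, 1],
})

# One row per (Country, Year) pair; "Year" and "Infected" are names made up for this example
long_df = df.melt(id_vars="Country", var_name="Year", value_name="Infected")

# The year labels were column names (strings), so convert them to integers
long_df["Year"] = long_df["Year"].astype(int)
print(long_df)
```

With the data in this long layout, something like sns.FacetGrid(data=long_df, col="Country") followed by grid.map(plt.plot, "Year", "Infected") should work, since grid.map takes column names.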
I have the following dataframe with dates as the index:
Apples Oranges Strawberries
07-13-2020 1 5 10
07-14-2020 1 17 4
I have to make a line chart of the above dataframe with the number of fruits on the y axis and the dates on the x axis.
df.plot(x=df.index,y=["Apples","Oranges","Strawberries"],kind="line") is not working
How can I fix it?
Try converting your pandas index to datetime format and try again as below:
df.index = pd.to_datetime(df.index, format='%m-%d-%Y')
df.plot(kind="line")
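As a self-contained check of the conversion, here is a small sketch that rebuilds the two sample rows from the question (assuming the index holds the dates as strings) and confirms the index really becomes a datetime index:

```python
import pandas as pd

# Sample rows from the question, with the dates stored as plain strings
df = pd.DataFrame(
    {"Apples": [1, 1], "Oranges": [5, 17], "Strawberries": [10, 4]},
    index=["07-13-2020", "07-14-2020"],
)

# Parse the month-day-year labels into a DatetimeIndex
df.index = pd.to_datetime(df.index, format="%m-%d-%Y")
print(type(df.index).__name__)

# df.plot(kind="line") would now use the dates on the x axis
# and draw one line per column
```

If a label does not match the format, to_datetime raises an error here, which makes bad dates easy to spot.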
That's df.plot.line(). The index is automatically the x axis, and the columns are the groups.
df.plot.line()
There are also dedicated plotting libraries for this that you can learn, such as matplotlib: https://matplotlib.org/
I have an enormous dataset in .csv format with many columns, of which only columns 3 and 7 interest me. I want to print both columns.
Sample Data: {Only Col 3 and 7 are displayed}
Names Numbers
John 12
Kim 5
Alex 16
mike 2
giki 8
David 18
Desired Output #values greater than 10:
John 12
Alex 16
David 18
Desired Output #values lesser than 10:
Kim 5
mike 2
giki 8
Rhea
I'm not sure I understand what you are trying to accomplish, so I'll try to help by going through some basics:
a) Do you already have your data in a DataFrame? Or is it in some other tabular form, such as a CSV or Excel file?
Dataframe = Two-dimensional size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns).
Either way, you will have to import pandas to read or manipulate this file. Then you can load it into a DataFrame using one of pandas' reading functions, such as pandas.read_csv or pandas.read_excel.
import pandas as pd
# if your data is in a dictionary
df = pd.DataFrame(data=d)
# csv
df = pd.read_csv('file name and path')
b) Then you can slice through it using pandas, and create new DataFrames
output1 = df.loc[df['Numbers'] > 10]
output2 = df.loc[df['Numbers'] < 10]
c) The most basic way to plot is using the pandas method plot on your new DataFrame (you can get a lot fancier than that using matplotlib or seaborn). Although you should probably think about what kind of information you want to visualize, which is not clear to me.
output1.plot()
#histogram
output2.hist()
d) You can save your new DataFrames using pandas as well. Here is the to_csv call for writing a CSV file (note that sep must be a single character):
df.to_csv(path_or_buf=None, sep=',', na_rep='', float_format=None, columns=None, header=True, index=True, index_label=None)
I hope this sheds some light on your questions ;) .
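Putting steps a) and b) together for the question's case, here is a minimal sketch. It assumes "column 3 and 7" are counted from 1, and it uses an in-memory CSV with made-up filler columns (c1, c2, ...) in place of the real file:

```python
import io
import pandas as pd

# Hypothetical 7-column CSV; the filler columns c1..c6 are invented for this example
csv_text = """c1,c2,Names,c4,c5,c6,Numbers
x,x,John,x,x,x,12
x,x,Kim,x,x,x,5
x,x,Alex,x,x,x,16
x,x,mike,x,x,x,2
x,x,giki,x,x,x,8
x,x,David,x,x,x,18
"""

# usecols takes zero-based positions, so columns 3 and 7 are positions 2 and 6
df = pd.read_csv(io.StringIO(csv_text), usecols=[2, 6])

# Boolean masks split the rows by value
greater = df[df["Numbers"] > 10]
lesser = df[df["Numbers"] < 10]

print(greater.to_string(index=False))
print(lesser.to_string(index=False))
```

Reading only the needed columns also keeps memory use down for a very large file. Note that a value of exactly 10 lands in neither output with these two conditions.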
I have a dataset which looks like this:
val
1
1
3
4
6
6
9
...
I can't load it into a pandas dataframe due to its huge size. So I aggregate the data using Spark to form:
val occurrences
1 2
3 1
4 1
6 2
9 1
...
and load it into pandas dataframe. "val" column is not above 100, so it doesn't take much memory.
My problem is that I can't easily operate on such a structure, e.g. find the mean or median using pandas, or plot a boxplot with seaborn. I can only do it using explicit formulas I write myself, not with ready-made built-in methods. Is there a pandas structure, or any other way, that can cope with such data?
For example:
1,1,3,4,6,6,9
would be:
df = pd.DataFrame({'val': [1,3,4,6,9], "occurrences" : [2,1,1,2,1]})
Median is 4. I'm looking for a method to extract median directly from given df.
No, pandas does not operate on such objects the way you would expect. Elsewhere on StackOverflow, even computing a median for that table structure takes at least a few lines of code.
If you wanted to make your own seaborn hooks/wrappers, a good place to start would probably be an efficient percentiles(df, p) method. The median is then just percentiles(df, [50]). A box plot would just be percentiles(df, [0, 25, 50, 75, 100]), and so on. Your development time could then be fairly minimal (depending on how complicated the statistics you need are).
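One possible shape for such a percentiles(df, p) helper is sketched below. It is not a pandas or seaborn function. It assumes the column names val and occurrences from the question, and it uses the nearest-rank definition (no interpolation), which is a choice made for this example:

```python
import numpy as np
import pandas as pd

def percentiles(df, p):
    """Nearest-rank percentiles of a frequency table with 'val' and 'occurrences' columns."""
    table = df.sort_values("val")
    values = table["val"].to_numpy()
    # Running count of observations up to and including each value
    cum = table["occurrences"].cumsum().to_numpy()
    n = cum[-1]
    # Nearest rank: the k-th smallest observation with k = ceil(p/100 * n), at least 1
    ranks = np.maximum(np.ceil(np.asarray(p, dtype=float) / 100 * n), 1)
    # First table row whose running count reaches that rank
    idx = np.searchsorted(cum, ranks, side="left")
    return values[idx]

df = pd.DataFrame({'val': [1, 3, 4, 6, 9], "occurrences": [2, 1, 1, 2, 1]})
print(percentiles(df, [50]))                  # median of 1,1,3,4,6,6,9
print(percentiles(df, [0, 25, 50, 75, 100]))  # five-number summary for a box plot
```

Because nothing is expanded back to one row per observation, this works however large the counts are. Nearest rank does not average the two middle values when the total count is even, so it can differ from Series.median in that case. The mean needs no helper: np.average(df["val"], weights=df["occurrences"]) computes it directly.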
I have a large data set that contains 3 attributes per row: A,B,C
Column A: can take the values 1, 2, and 0.
Column B and C: can take any values.
I'd like to perform density estimation using histograms for P(A = 2 | B,C) and plot the results using python.
I do not need the code to do it; I can try to figure that out on my own. I just need to know which procedures and tools I should use.
To answer your overall question, we should go through several steps and answer different questions:
How to read a csv file (or text data)?
How to filter data?
How to plot data?
At each stage you need specific techniques and tools, and you may have different choices at different stages (you can look on the internet for alternatives).
1- How to read a csv file:
Python has a built-in csv module for going through the csv file where you store your data, but most people recommend pandas for dealing with csv files.
After installing the pandas package, you can read your csv file using the read_csv function.
import pandas as pd
df= pd.read_csv("file.csv")
As you didn't share the csv file, I will make a random dataset to explain the up-coming steps.
import pandas as pd
import numpy as np
t= [1,1,1,2,0,1,1,0,0,2,1,1,2,0,0,0,0,1,1,1]
df = pd.DataFrame(np.random.randn(20, 2), columns=list('AC'))
df['B']=t #put a random column with only 0,1,2 values, then insert it to the dataframe
Note: NumPy is a Python package that helps with mathematical operations. You don't strictly need it; it is used here only to generate the random data.
If you print df, you will get a result like this (the values are random, so yours will differ):
A C B
0 -0.090162 0.035458 1
1 2.068328 -0.357626 1
2 -0.476045 -1.217848 1
3 -0.405150 -1.111787 2
4 0.502283 1.586743 0
5 1.822558 -0.398833 1
6 0.367663 0.305023 1
7 2.731756 0.563161 0
8 2.096459 1.323511 0
9 1.386778 -1.774599 2
10 -0.512147 -0.677339 1
11 -0.091165 0.587496 1
12 -0.264265 1.216617 2
13 1.731371 -0.906727 0
14 0.969974 1.305460 0
15 -0.795679 -0.707238 0
16 0.274473 1.842542 0
17 0.771794 -1.726273 1
18 0.126508 -0.206365 1
19 0.622025 -0.322115 1
2- How to filter data:
There are different techniques to filter data. The easiest one is to select the column inside your dataframe and apply a condition to it. In our case, the criterion is selecting the value 2 in column B.
l = df[df['B'] == 2]
print(l)
You can also use other approaches, such as groupby or a lambda, to go through the dataframe and apply different conditions to filter the data.
# each item is a (group value, sub-dataframe) tuple
for key in df.groupby('B'):
    print(key)
If you run the above-mentioned scripts you'll get:
For the first one: Only data where B==2
A C B
3 -0.405150 -1.111787 2
9 1.386778 -1.774599 2
12 -0.264265 1.216617 2
For the second one: Printing the results divided in groups.
(0, A C B
4 0.502283 1.586743 0
7 2.731756 0.563161 0
8 2.096459 1.323511 0
13 1.731371 -0.906727 0
14 0.969974 1.305460 0
15 -0.795679 -0.707238 0
16 0.274473 1.842542 0)
(1, A C B
0 -0.090162 0.035458 1
1 2.068328 -0.357626 1
2 -0.476045 -1.217848 1
5 1.822558 -0.398833 1
6 0.367663 0.305023 1
10 -0.512147 -0.677339 1
11 -0.091165 0.587496 1
17 0.771794 -1.726273 1
18 0.126508 -0.206365 1
19 0.622025 -0.322115 1)
(2, A C B
3 -0.405150 -1.111787 2
9 1.386778 -1.774599 2
12 -0.264265 1.216617 2)
3- How to plot your data:
The simplest way to plot your data is to use matplotlib.
The easiest way to plot the data in column B is by running:
import matplotlib.pyplot as plt
# B only takes the values 0, 1 and 2, so the histogram shows how often each occurs
plt.hist(df.B, bins=20, color='blue')
plt.show()
You'll get this result:
If you want to plot the columns together, you should use different colors or markers to make the plot useful.
import matplotlib.pyplot as plt
t = range(20)
# one call per column, each with a label so that plt.legend() has something to show
plt.plot(t, df.A, 'r--', label='A')
plt.plot(t, df.B, 'bs--', label='B')
plt.plot(t, df.C, 'g^--', label='C')
plt.legend()
plt.show()
You'll get as a result:
Plotting data is driven by a specific need. You can explore the different ways to plot data by going through the examples on the official matplotlib.org website.
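For the estimate the question actually asks about, P(A = 2 | B, C), one common histogram approach is to bin (B, C) on a shared grid and divide the count of rows with A == 2 in each cell by the count of all rows in that cell. The sketch below uses the question's column roles (A categorical, B and C free) on synthetic stand-in data. The sample size, the 10 x 10 grid and the normal distributions are all assumptions made for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data (assumption): A in {0, 1, 2}, B and C continuous
n = 5000
A = rng.integers(0, 3, size=n)
B = rng.normal(size=n)
C = rng.normal(size=n)

# Shared bin edges so both histograms cover exactly the same (B, C) cells
b_edges = np.linspace(B.min(), B.max(), 11)
c_edges = np.linspace(C.min(), C.max(), 11)

# Counts of all rows, and of rows with A == 2, in each cell
total, _, _ = np.histogram2d(B, C, bins=[b_edges, c_edges])
hits, _, _ = np.histogram2d(B[A == 2], C[A == 2], bins=[b_edges, c_edges])

# Estimated P(A = 2 | B, C) per cell; cells without any data stay NaN
p_cond = np.full_like(total, np.nan)
np.divide(hits, total, out=p_cond, where=total > 0)
print(p_cond.shape)
```

To look at the result, plt.pcolormesh(b_edges, c_edges, p_cond.T) followed by plt.colorbar() draws it as a heat map (the transpose is needed because histogram2d puts its first variable along the rows). Cells with few rows give noisy estimates, so the number of bins is a trade-off between resolution and stability.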
If you're looking for other tools that do slightly more sophisticated things than nonparametric density estimation with histograms, please check this link to the python repository or directly install the package with
pip install cde
In addition to extensive documentation, the package implements:
nonparametric methods (conditional & neighborhood kernel density estimation),
semiparametric methods (least squares cde), and
parametric neural network-based methods (mixture density networks, kernel density estimation).
The package can also compute centered moments, statistical divergences (kl-divergence, hellinger, jensen-shannon), percentiles and expected shortfalls, and it provides data generating processes (arma-jump, jump-diffusion, GMMs etc.).
Disclaimer: I am one of the package developers.