I have a large data set that contains 3 attributes per row: A,B,C
Column A: can take the values 1, 2, and 0.
Columns B and C: can take any values.
I'd like to perform density estimation using histograms for P(A = 2 | B,C) and plot the results using python.
I do not need the code to do it; I can try and figure that out on my own. I just need to know the procedure and the tools I should use.
To answer your overall question, we need to go through several steps and answer a few smaller questions:
How to read a CSV file (or text data)?
How to filter data?
How to plot data?
At each stage you need specific techniques and tools, and you often have several choices at each step (you can look online for alternatives).
1- How to read a CSV file:
Python has a built-in csv module to go through the file where your data is stored, but most people recommend pandas for dealing with CSV files.
After installing the pandas package, you can read your CSV file using the read_csv function.
import pandas as pd
df = pd.read_csv("file.csv")
Since you didn't share your CSV file, I will make a random dataset to explain the upcoming steps.
import pandas as pd
import numpy as np

# a column of 0/1/2 values to play the role of the categorical column
t = [1, 1, 1, 2, 0, 1, 1, 0, 0, 2, 1, 1, 2, 0, 0, 0, 0, 1, 1, 1]
df = pd.DataFrame(np.random.randn(20, 2), columns=list('AC'))
df['B'] = t  # insert it into the DataFrame as column B
Note: NumPy is a Python package for numerical operations. You don't strictly need it here; it is only used to generate the random values, and I mention it to avoid confusion.
If you print df at this point, you will get something like:
A C B
0 -0.090162 0.035458 1
1 2.068328 -0.357626 1
2 -0.476045 -1.217848 1
3 -0.405150 -1.111787 2
4 0.502283 1.586743 0
5 1.822558 -0.398833 1
6 0.367663 0.305023 1
7 2.731756 0.563161 0
8 2.096459 1.323511 0
9 1.386778 -1.774599 2
10 -0.512147 -0.677339 1
11 -0.091165 0.587496 1
12 -0.264265 1.216617 2
13 1.731371 -0.906727 0
14 0.969974 1.305460 0
15 -0.795679 -0.707238 0
16 0.274473 1.842542 0
17 0.771794 -1.726273 1
18 0.126508 -0.206365 1
19 0.622025 -0.322115 1
2- How to filter data:
There are different techniques to filter data. The easiest one is selecting the column of your DataFrame together with a condition. In our case, the criterion is selecting the value 2 in column B.
l = df[df['B'] == 2]
print(l)
You can also use other approaches, such as groupby or a lambda, to go through the DataFrame and apply different conditions to filter the data.
for key in df.groupby('B'):
    print(key)
If you run the above-mentioned scripts you'll get:
For the first one: Only data where B==2
A C B
3 -0.405150 -1.111787 2
9 1.386778 -1.774599 2
12 -0.264265 1.216617 2
For the second one: Printing the results divided in groups.
(0, A C B
4 0.502283 1.586743 0
7 2.731756 0.563161 0
8 2.096459 1.323511 0
13 1.731371 -0.906727 0
14 0.969974 1.305460 0
15 -0.795679 -0.707238 0
16 0.274473 1.842542 0)
(1, A C B
0 -0.090162 0.035458 1
1 2.068328 -0.357626 1
2 -0.476045 -1.217848 1
5 1.822558 -0.398833 1
6 0.367663 0.305023 1
10 -0.512147 -0.677339 1
11 -0.091165 0.587496 1
17 0.771794 -1.726273 1
18 0.126508 -0.206365 1
19 0.622025 -0.322115 1)
(2, A C B
3 -0.405150 -1.111787 2
9 1.386778 -1.774599 2
12 -0.264265 1.216617 2)
3- How to plot your data:
The simplest way to plot your data is by using matplotlib.
The easiest way to plot the data in column B is by running:
import matplotlib.pyplot as plt

plt.hist(df.B, bins=20, color='blue')
plt.show()
You'll get this result:
If you want to plot the results combined, you should use different colors and markers to keep the plot readable.
import matplotlib.pyplot as plt

t = range(20)
plt.plot(t, df.A, 'r--', label='A')
plt.plot(t, df.B, 'bs--', label='B')
plt.plot(t, df.C, 'g^--', label='C')
plt.legend()
plt.show()
You'll get as a result:
How you plot your data is driven by what you need to show. You can explore the different ways to plot data by going through the examples on the official matplotlib.org website.
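To tie this back to your original question: once the data is read and filtered, you can estimate a conditional probability such as P(A = 2 | B, C) by binning the two continuous columns and, in each bin, taking the fraction of rows that carry the target category. Below is a minimal sketch using the example DataFrame above, where the 0/1/2 values happen to live in column B and the continuous values in A and C (in your data the roles are swapped); the bin count of 5 is an arbitrary choice you should tune to your data.

import numpy as np
import matplotlib.pyplot as plt

# joint histogram of the two continuous columns over all rows
bins = 5
counts_all, xedges, yedges = np.histogram2d(df['A'], df['C'], bins=bins)

# same histogram, but only for rows where the categorical column equals 2
mask = df['B'] == 2
counts_2, _, _ = np.histogram2d(df.loc[mask, 'A'], df.loc[mask, 'C'],
                                bins=[xedges, yedges])

# per-bin fraction: estimated P(B = 2 | A, C); NaN where a bin has no data
with np.errstate(invalid='ignore', divide='ignore'):
    p_2 = counts_2 / counts_all

plt.imshow(p_2.T, origin='lower',
           extent=[xedges[0], xedges[-1], yedges[0], yedges[-1]],
           aspect='auto')
plt.colorbar(label='estimated P(B = 2 | A, C)')
plt.xlabel('A')
plt.ylabel('C')
plt.show()

With only 20 rows most bins will be empty or based on a single observation, so this only becomes meaningful on a dataset of realistic size (or with fewer, wider bins).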
If you're looking for other tools that do slightly more sophisticated things than nonparametric density estimation with histograms, please check this link to the Python repository, or install the package directly with
pip install cde
In addition to extensive documentation, the package implements
nonparametric methods (conditional & neighborhood kernel density estimation),
semiparametric methods (least-squares CDE), and
parametric neural-network-based methods (mixture density networks, kernel density estimation).
The package also lets you compute centered moments, statistical divergences (KL divergence, Hellinger, Jensen-Shannon), percentiles and expected shortfalls, and it provides data-generating processes (ARMA-jump, jump-diffusion, GMMs, etc.).
Disclaimer: I am one of the package developers.
Related
I am trying to create a pie chart in Python. I have a dataset with 2137 responses to a question, with answer choices ranging from 1 to 5. I am trying to produce a pie chart with the percentage of responses for each answer choice, but when I run my code it produces a pie chart with one slice per respondent (so 2137 pieces of the pie). I am thinking I need to use the .groupby function, but I am not entirely sure how to do it correctly.
df3 = pd.DataFrame(df, columns=['Q78']).groupby(['Q78'])
df3.plot.pie(subplots=True)
Here is what I have tried. (PS I am just starting to learn Python, so sorry if this is a dumb question!!)
One possible solution:
s = df.Q87
s.groupby(s).size().plot.pie(autopct='%1.1f%%');
To test my code I created a DataFrame limited to just 8 answers:
Q87
0 A
1 B
2 C
3 D
4 E
5 A
6 B
7 A
and I got the following picture:
I have a DataFrame (df) in the format below. I want to create a multi-line chart from this data.
Name Category Score Count
A Books 12025.4375 48
A Music 17893.25 4
A Movie 31796.37838 37
A Script 1560.4 5
A Art 973.125 8
B Music 1929 15
B Movie 3044.229167 48
B Script 3402.4 10
B Art 2450.125 8
C Books 14469.3 10
C Music 10488.78947 57
C Movie 1827.101695 59
C Script 7077 2
Expected Output:
I want the unique Category values on the x-axis,
Score on the y-axis, and multiple lines, one per Name.
Count is just additional data that is not needed for this graph.
I tried the syntax below, which does not show the output in the expected format.
lines = df.line(x= 'Category',\
y=['Name','Score'],figsize=(20,10))
I tried multiple options and answers available here, but it seems like nothing is working for me.
First pivot the data and then plot with DataFrame.plot; line is the default kind, so it can be omitted:
import matplotlib.pyplot as plt
import numpy as np

df1 = df.pivot(index='Category', columns='Name', values='Score')
df1.plot(figsize=(20, 10))
# show the category values on the x axis
plt.xticks(np.arange(len(df1.index)), df1.index)
plt.show()
This may be a basic question. I have categorical data that I want to feed into my machine learning model, but the model accepts only numerical data. What is the correct way to convert this categorical data into numerical data?
My Sample DF:
T-size Gender Label
0 L M 1
1 L M 1
2 M F 1
3 S F 0
4 M M 1
5 L M 0
6 S F 1
7 S F 0
8 M M 1
I know the following code converts my categorical data into numerical data:
Type-1:
df['T-size'] = df['T-size'].cat.codes
The line above simply converts each category to a code from 0 to N-1; it doesn't respect any relationship between them.
For this example I know S < M < L. What should I do when I want to convert data like this?
Type-2:
In this type there is no ordering between M and F, but I can tell that M has a higher probability than F of the label being 1 (i.e., number of samples with label 1 / total number of samples):
for Male, 4/5
for Female, 2/4
We know that 4/5 > 2/4.
How should I encode this kind of column?
Can I replace M with 4/5 and F with 2/4 for this problem?
What is the proper way of dealing with such a column?
Help me understand this better.
There are many ways to encode categorical data, and some of them depend on exactly what you plan to do with it. For example, one-hot encoding, which is easily the most popular choice, is an extremely poor choice if you're planning on using a decision tree / random forest / GBM.
Regarding your t-shirts above, you can give a pandas categorical type an order:
df['T-size'] = df['T-size'].astype(pd.api.types.CategoricalDtype(['S', 'M', 'L'], ordered=True))
If you set up your t-shirt categorical like that, then your .cat.codes approach works perfectly. It also means you can easily use scikit-learn's LabelEncoder, which fits neatly into pipelines.
Regarding your encoding of gender, you need to be very careful when using your target variable (your Label). You don't want to do this encoding before your train-test split, otherwise you're using knowledge of your unseen data, making it not truly unseen. This gets even more complicated if you're using cross-validation, as you'll need to redo the encoding within each CV iteration (i.e. a new encoding per fold). If you want to do this, I recommend you check out TargetEncoder from scikit-contrib's Category Encoders, but again, be sure to use it within an sklearn Pipeline, or you will mess up the train-test splits and leak information from your test set into your training set.
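Here is a minimal sketch of that idea, assuming the category_encoders package is installed (pip install category_encoders) and using a logistic regression purely as a placeholder model:

from category_encoders import TargetEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

X = df[['Gender']]   # categorical feature(s) to target-encode
y = df['Label']

pipe = Pipeline([
    ('encode', TargetEncoder(cols=['Gender'])),  # fitted only on the training part of each fold
    ('model', LogisticRegression()),
])

# cross_val_score re-fits the whole pipeline per fold, so the encoder
# never sees the corresponding validation data
scores = cross_val_score(pipe, X, y, cv=3)

Because the encoder lives inside the pipeline, each fold gets its own encoding and no information leaks from the held-out rows.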
If you want to have a hierarchy in your size parameter, you may consider using a linear mapping for it. This would be:
size_mapping = {"S": 1, "M": 2, "L": 3}
# map the sizes onto the DataFrame
df['T-size_num'] = df['T-size'].map(size_mapping)
This allows you to treat the input as numerical data while preserving the hierarchy.
As for the gender, you are confusing the class distribution with the preprocessing. If you feed the observed proportions in as an input, you will introduce a bias into your data. You should treat male and female as two distinct categories regardless of their distribution in the data, and map them to two different numbers without encoding the proportions.
df['Gender_num'] = df['Gender'].map({'M': 0, 'F': 1})
For a more detailed explanation that covers more cases than your question, I suggest reading this article on categorical data in machine learning.
For the first question, if you have a small number of categories, you could map the column with a dictionary. In this way you can set an order:
d = {'L':2, 'M':1, 'S':0}
df['T-size'] = df['T-size'].map(d)
Output:
T-size Gender Label
0 2 M 1
1 2 M 1
2 1 F 1
3 0 F 0
4 1 M 1
5 2 M 0
6 0 F 1
7 0 F 0
8 1 M 1
For the second question, you can use the same method, but I would leave the two values for males and females as 0 and 1. If you just need the category and you don't have to do arithmetic with the values, one value is as good as another.
It might be overkill for the M/F example since it's binary, but if you are ever concerned about mapping a categorical variable into numerical form, consider one-hot encoding. It basically stretches your single column containing n categories into n binary columns.
So a dataset of:
Gender
M
F
M
M
F
Would become
Gender_M Gender_F
1 0
0 1
1 0
1 0
0 1
This takes away any notion of one thing being more "positive" than another, which is an absolute must for categorical data with more than two options, where there's no transitive A > B > C relationship and you don't want to skew your results by forcing one into your encoding scheme.
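In pandas this is essentially a one-liner with get_dummies; a minimal sketch using the small Gender column above (note that the resulting columns come out in alphabetical order):

import pandas as pd

df = pd.DataFrame({'Gender': ['M', 'F', 'M', 'M', 'F']})
# one binary column per category; the prefix keeps the column names readable
dummies = pd.get_dummies(df['Gender'], prefix='Gender')
print(dummies)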
I have a dataset which looks like this:
val
1
1
3
4
6
6
9
...
I can't load it into a pandas DataFrame due to its huge size, so I aggregate the data using Spark to form:
val occurrences
1 2
3 1
4 1
6 2
9 1
...
and load that into a pandas DataFrame. The "val" column never exceeds 100, so it doesn't take much memory.
My problem is that I can't operate easily on such a structure, e.g. find the mean or median with pandas, or plot a boxplot with seaborn. I can only do it with explicit formulas I write myself, not with the ready-made built-in methods. Is there a pandas structure, or any other approach, that lets me cope with such data?
For example:
1,1,3,4,6,6,9
would be:
df = pd.DataFrame({'val': [1,3,4,6,9], "occurrences" : [2,1,1,2,1]})
The median is 4. I'm looking for a method to extract the median directly from the given df.
No, pandas does not operate on such objects the way you would expect. Elsewhere on Stack Overflow, even computing a median for that table structure takes at least a few lines of code.
If you wanted to make your own seaborn hooks/wrappers, a good place to start would probably be an efficient percentiles(df, p) method. The median is then just percentiles(df, [50]). A box plot would just be percentiles(df, [0, 25, 50, 75, 100]), and so on. Your development time could then be fairly minimal (depending on how complicated the statistics you need are).
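As an illustration, here is a minimal sketch of such a helper for the val/occurrences layout from the question. It uses the simple nearest-rank definition of a percentile (no interpolation), which differs slightly from numpy's default, but it never expands the data back to one row per observation:

import numpy as np
import pandas as pd

def percentiles(df, ps):
    # nearest-rank percentiles computed from (val, occurrences) pairs
    d = df.sort_values('val')
    cum = d['occurrences'].cumsum()
    total = cum.iloc[-1]
    out = []
    for p in ps:
        rank = max(1, int(np.ceil(p / 100 * total)))  # 1-based rank of the target observation
        out.append(d.loc[cum >= rank, 'val'].iloc[0])
    return out

df = pd.DataFrame({'val': [1, 3, 4, 6, 9], 'occurrences': [2, 1, 1, 2, 1]})
print(percentiles(df, [50]))                  # [4], the median from the question
print(percentiles(df, [0, 25, 50, 75, 100]))  # rough box-plot statistics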
I'm new to Python and pandas, but I've used R in the past for data analysis. I have a simple dataset:
df.head()
Sequence Level Count
1 Easy 5
1 Medium 7
1 Hard 9
I would like to convert this to:
Sequence Easy Medium Hard
1 5 7 9
In R, I could simply do this using the reshape2 package. In Python, it seems like one of my options is to create dummy variables using get_dummies, but that would still generate multiple rows for the same Sequence in my case. Is there an easy way of achieving my result set?
I'm finally trying to plot it using:
import matplotlib.pyplot as plt
df.plot(kind='bar', stacked=True)
plt.show()
Any help would be appreciated.
You could use pandas pivot_table:
In [1436]: pd.pivot_table(df, index='Sequence', columns='Level', values='Count')
Out[1436]:
Level Easy Hard Medium
Sequence
1 5 9 7
Then you could plot it:
df1 = pd.pivot_table(df, index='Sequence', columns='Level', values='Count')
df1.plot(kind='bar', stacked=True)