I have what is (to me at least) a complicated dataframe that I'm trying to reshape so that I can more easily create visualizations. The data is from a survey: each row is a complete survey with 247 columns. The columns fall into a few kinds of data: identifying information (who took the survey, what product the survey is on), the scores on particular questions, and the comments on that particular product. Here is a simplification of the dataframe:
id Evaluator item Mar1 Mar1[Comments] Comf1 Comf1[Com..
1 001 11 3 "asf adfsfs.." 3 "text.."
2 001 14 2 "asf adfsfs.." 4 "text.."
3 002 11 4 "asf adfsfs.." 2 "text.."
4 002 14 3 "asf adfsfs.." 3 "text.."
5 002 34 0 "asf adfsfs.." 1 "text.."
6 003 11 2 "asf adfsfs.." 0 "text.."
....
It continues on from here, but in this case 'Mar1' and 'Comf1' are rated questions. I have another table that describes all the questions and question types in the survey, so I can perform data selections like the following:
df[df['ItemNum']==11][(qtable[(qtable['type'].str.contains("OtoU")==True)]).id]
Which pulls from qtable all the 'types' of 'OtoU' (all the rating questions) for the ItemNum 11. This is all well and good and gets me something like this...
Mar1 Mar2 Comf1 Comf2 Comf3 Interop1 Interop2 .....
1 2 3 1 3 4 4
2 3 3 2 4 2 2
2 1 1 4 4 1 2
1 3 2 2 2 1 1
3 4 1 2 3 3 3
I can't really do much with it in that form (at least I don't think I can). What I 'think' I need to do is flatten it out into a form that goes more like
Item Question Score Section Evaluator ...
11 Mar1 3 Maritime 001 ...
11 Comf1 2 Comfort 001 ...
11 Comf2 3 Comfort 001 ...
14 Mar1 1 Maritime 001 ...
But I'll be damned if I know how to do that. I tried to do it (the wrong way, I'm pretty sure) by iterating through the dataframe, but I quickly realized that it took quite some time and that the resulting data was of questionable integrity.
So, (very) long story short. How do I go about doing this sort of transform through the power of pandas? I would like to do a number of plots including box plots by question for each 'item' as well as factorplots broken by 'section' and multi line charts plotting the mean of each question by item... if that helps you better understand where I am trying to go with this thing. Sorry for the long post, I just wanted to make sure I supplied enough information to get a solid answer.
Thanks,
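One way to get the long layout described above is DataFrame.melt followed by a merge with the question table. This is only a minimal sketch on the simplified data from the question: the 'id' and 'type' columns of qtable are taken from the selection shown earlier, while the 'section' column and the exact column names of the result are assumptions made for illustration.

```python
import pandas as pd

# Simplified survey data from the question (comment columns left out).
df = pd.DataFrame({
    'Evaluator': ['001', '001', '002', '002', '002', '003'],
    'item': [11, 14, 11, 14, 34, 11],
    'Mar1': [3, 2, 4, 3, 0, 2],
    'Comf1': [3, 4, 2, 3, 1, 0],
})

# Hypothetical question table: 'id' and 'type' appear in the question,
# the 'section' column is an assumption.
qtable = pd.DataFrame({
    'id': ['Mar1', 'Comf1'],
    'type': ['OtoU', 'OtoU'],
    'section': ['Maritime', 'Comfort'],
})

# Names of the rated questions, as in the original selection.
rated = qtable.loc[qtable['type'].str.contains('OtoU'), 'id'].tolist()

# Wide to long: one row per (Evaluator, item, question).
long_df = df.melt(id_vars=['Evaluator', 'item'], value_vars=rated,
                  var_name='Question', value_name='Score')

# Attach the section of each question from the question table.
long_df = (long_df.merge(qtable[['id', 'section']],
                         left_on='Question', right_on='id')
                  .drop(columns='id')
                  .rename(columns={'item': 'Item', 'section': 'Section'}))

print(long_df)
```

From this shape, something like long_df.groupby(['Item', 'Question'])['Score'].mean() gives the mean of each question by item, and the Question and Section columns can be used directly as grouping variables in plotting libraries.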
I have a pandas dataframe of employees that I need to filter based on 2 columns. I need to filter on department and level. So let's say we have department 'Human Resources' and within that it has level 1,2,3,4,5. I'm specifically looking for Human Resources level 2,4 and 5.
I have my desired departments and levels stored in dictionary, for example:
departments = dict({'Human Resources' : ['2','4','5'] ,'IT' : ['1','3','5','6'], etc.... })
My dataframe will list every employee, for all departments and for all levels (plus lots more). I now want to filter that dataframe using the dictionary above. So in the Human Resources example, I just want the employees who are in 'Human Resources' and are in levels 2, 4 and 5.
An example of the df would be:
employee_ID Department Level
001 Human Resources 1
002 Human Resources 1
003 Human Resources 2
004 Human Resources 3
005 Human Resources 4
006 Human Resources 4
007 Human Resources 5
008 IT 1
009 IT 2
010 IT 3
011 IT 4
012 IT 5
013 IT 6
Using the dictionary I've displayed above, my expected result would be
employee_ID Department Level
003 Human Resources 2
005 Human Resources 4
006 Human Resources 4
007 Human Resources 5
008 IT 1
010 IT 3
012 IT 5
013 IT 6
I have no idea how I'd do this?
You can group by Departement, then use isin on Level within each group, looking up the allowed levels for that department in the dictionary via the name of the group.
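import pandas as pd

# example data
departments = {'Human Resources': ['2', '4', '5'], 'IT': ['1', '3', '5', '6']}
df = pd.DataFrame({'Id': range(10),
                   'Departement': ['Human Resources'] * 5 + ['IT'] * 5,
                   'Level': list(range(1, 6)) * 2})
# filter: the dictionary stores levels as strings, so compare Level as strings;
# group_keys=False keeps the original index so the mask lines up with df
mask = (df['Level'].astype(str)
        .groupby(df['Departement'], group_keys=False)
        .apply(lambda x: x.isin(departments[x.name])))
print(df[mask])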
Id Departement Level
1 1 Human Resources 2
3 3 Human Resources 4
4 4 Human Resources 5
5 5 IT 1
7 7 IT 3
9 9 IT 5
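As an alternative that avoids groupby, the dictionary can be turned into a small lookup table and merged with the dataframe. The sketch below uses the example data from the question and assumes that Level is stored as integers in the dataframe (the dictionary holds strings, so they are converted); the names allowed and filtered are made up for illustration.

```python
import pandas as pd

departments = {'Human Resources': ['2', '4', '5'], 'IT': ['1', '3', '5', '6']}

# Example data from the question; Level as integers is an assumption.
df = pd.DataFrame({
    'employee_ID': ['%03d' % i for i in range(1, 14)],
    'Department': ['Human Resources'] * 7 + ['IT'] * 6,
    'Level': [1, 1, 2, 3, 4, 4, 5, 1, 2, 3, 4, 5, 6],
})

# One row per allowed (Department, Level) pair, with levels cast to int
# so that they compare equal to the values in df.
allowed = pd.DataFrame(
    [(dept, int(level)) for dept, levels in departments.items() for level in levels],
    columns=['Department', 'Level'],
)

# An inner merge keeps only employees whose pair appears in the lookup table.
filtered = df.merge(allowed, on=['Department', 'Level'], how='inner')

print(filtered)
```

On this data the merge keeps employees 003, 005, 006, 007, 008, 010, 012 and 013, which matches the expected result in the question.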
I am new to python and trying to move some of my work from excel to python, and wanted an excel SUMIFS equivalent in pandas, for example something like:
SUMIFS(F:F, D:D, "<="&C2, B:B, B2, F:F, ">"&0)
In my case, I have 6 columns: a unique Trade ID, an Issuer, a Trade date, a release date, a trader, and a quantity. I wanted to get a column which shows the sum of the quantity available for release at each row. Something like the below:
A B C D E F G
ID Issuer TradeDate ReleaseDate Trader Quantity SumOfAvailableRelease
1 Horse 1/1/2012 13/3/2012 Amy 7 0
2 Horse 2/2/2012 15/5/2012 Dave 2 0
3 Horse 14/3/2012 NaN Dave -3 7
4 Horse 16/5/2012 NaN John -4 9
5 Horse 20/5/2012 10/6/2012 John 2 9
6 Fish 6/6/2013 20/6/2013 John 11 0
7 Fish 25/6/2013 9/9/2013 Amy 4 11
8 Fish 8/8/2013 15/9/2013 Dave 5 11
9 Fish 25/9/2013 NaN Amy -3 20
Usually, in excel, I just pull the SUMIFS formulas down the whole column and it will work, I am not sure how I can do it in python.
Many thanks!
What you could do is a df.where
so for example you could say
Qdf = df.where(df["Quantity"]>=5)
and then do your sum. I don't know exactly what you want to do, since I have no knowledge of Excel, but I hope this helps.
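For the exact SUMIFS in the question, one straightforward (if not the fastest) approach is to evaluate the three conditions for every row with DataFrame.apply. This is a minimal sketch on the sample data from the question, assuming the dates are day/month/year; the helper name available_release is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'ID': range(1, 10),
    'Issuer': ['Horse'] * 5 + ['Fish'] * 4,
    'TradeDate': ['1/1/2012', '2/2/2012', '14/3/2012', '16/5/2012', '20/5/2012',
                  '6/6/2013', '25/6/2013', '8/8/2013', '25/9/2013'],
    'ReleaseDate': ['13/3/2012', '15/5/2012', None, None, '10/6/2012',
                    '20/6/2013', '9/9/2013', '15/9/2013', None],
    'Trader': ['Amy', 'Dave', 'Dave', 'John', 'John', 'John', 'Amy', 'Dave', 'Amy'],
    'Quantity': [7, 2, -3, -4, 2, 11, 4, 5, -3],
})

# Parse the dates (assumed day/month/year); missing release dates become NaT.
for col in ('TradeDate', 'ReleaseDate'):
    df[col] = pd.to_datetime(df[col], format='%d/%m/%Y')


def available_release(row):
    # Same three criteria as SUMIFS(F:F, D:D, "<="&C2, B:B, B2, F:F, ">"&0).
    mask = (
        (df['ReleaseDate'] <= row['TradeDate'])   # D:D <= C2 (NaT compares False)
        & (df['Issuer'] == row['Issuer'])         # B:B == B2
        & (df['Quantity'] > 0)                    # F:F > 0
    )
    return df.loc[mask, 'Quantity'].sum()


df['SumOfAvailableRelease'] = df.apply(available_release, axis=1)

print(df)
```

This scans the whole frame once per row, so it is quadratic in the number of rows; that is fine for small tables, but a large one would need a different approach.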
I have two tables: one of customer information and one of transaction info.
Customer information includes each person's quality of health (from 0 to 100)
e.g. if I extract just the Name and HealthQuality columns:
John: 70
Mary: 20
Paul: 40
etc etc.
After applying featuretools I noticed a new DIFF(HealthQuality) variable.
According to the docs, this is what DIFF does:
"Compute the difference between the value in a list and the previous value in that list."
Is featuretools calculating the difference between Mary and John's health quality in this instance?
I don't think this kind of feature synthesis really works for customer records e.g. CUM_SUM(emails_sent) for John. John's record is one row, and he has one value for the amount of emails we sent him.
For now I am using the ignore_variables=[all_customer_info] option to remove all of the customer data except for transactions table of course.
This also leads me into another question.
Using data from the transactions table, John now has a DIFF(MEAN(transactions.amount)). What is the DIFF measured in this instance?
id MEAN(transactions.amount) DIFF(MEAN(transactions.amount))
0 1 21.950000 NaN
1 2 20.000000 -1.950000
2 3 35.604581 15.604581
3 4 NaN NaN
4 5 22.782682 NaN
5 6 35.616306 12.833624
6 7 24.560536 -11.055771
7 8 331.316552 306.756016
8 9 60.565852 -270.750700
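The numbers in the table above are consistent with a plain row-to-row difference: each DIFF value equals the mean amount of that customer minus the mean amount of the customer in the previous row. The sketch below reproduces the column with pandas Series.diff on the values copied from the table; it is not featuretools code and says nothing about how featuretools computes the feature internally.

```python
import numpy as np
import pandas as pd

# MEAN(transactions.amount) per customer id, copied from the table above.
mean_amount = pd.Series(
    [21.95, 20.0, 35.604581, np.nan, 22.782682,
     35.616306, 24.560536, 331.316552, 60.565852],
    index=pd.Index(range(1, 10), name='id'),
    name='MEAN(transactions.amount)',
)

# Difference between each value and the value in the previous row.
# The first row has no predecessor, and any row next to a NaN yields NaN.
diff = mean_amount.diff()

print(pd.DataFrame({'mean': mean_amount, 'diff': diff}))
```

Read this way, the DIFF depends purely on the order of the rows, which is why it is hard to interpret for one-row-per-customer records.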
I'll start off by saying that I'm not really talented in statistical analysis. I have a dataset stored in a .csv file that I'm looking to represent graphically. What I'm trying to represent is the frequency of survival (represented for each person as a 0 or 1 in the Survived column) for each unique entry in the other columns.
For example: one of the other columns, Class, holds one of three possible values (1, 2, or 3). I want to graph the probability that someone from Class 1 survives versus Class 2 versus Class 3, so that I can visually determine whether or not class is correlated to survival rate.
I've attached the snippet of code that I've developed so far, but I'd understand if everything I'm doing is wrong because I've never used pandas before.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('train.csv')

print(list(df)[2:]) # slicing first 2 values of "ID" and "Survived"

for column in list(df)[2:]:
    try:
        df.plot(x='Survived',y=column,kind='hist')
    except TypeError:
        print("Column {} not usable.".format(column))

plt.show()
EDIT: I've attached a small segment of the dataframe below
PassengerId Survived Pclass Name ... Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris ... A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... ... PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina ... STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) ... 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry ... 373450 8.0500 NaN S
5 6 0 3 Moran, Mr. James ... 330877 8.4583 NaN Q
I think you want this:
df.groupby('Pclass')['Survived'].mean()
This separates the dataframe into three groups based on the three unique values of Pclass. It then takes the mean of Survived, which is equal to the number of 1 values divided by the total number of values. This would produce a Series looking something like this:
Pclass
1 0.558824
2 0.636364
3 0.696970
It is then trivial from there to plot a bar graph with .plot.bar() if you wish.
Adding to the answer, here is a simple bar graph.
result = df.groupby('Pclass')['Survived'].mean()
result.plot(kind='bar', rot=1, ylim=(0, 1))
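The same groupby can be repeated for each column of interest to get a survival rate per unique value. The sketch below uses only the six rows shown in the question's EDIT (so the rates are not representative of the full file), and the choice of Pclass and Embarked as columns is just an illustration.

```python
import pandas as pd

# The six sample rows from the question, reduced to three columns.
df = pd.DataFrame({
    'Survived': [0, 1, 1, 1, 0, 0],
    'Pclass': [3, 1, 3, 1, 3, 3],
    'Embarked': ['S', 'C', 'S', 'S', 'S', 'Q'],
})

# Survival rate for each unique value of each chosen column.
rates = {col: df.groupby(col)['Survived'].mean() for col in ['Pclass', 'Embarked']}

for col, rate in rates.items():
    print(rate)
```

Each entry of rates is a Series indexed by the unique values of that column, so it can be passed to .plot.bar() in the same way as above.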
I was going through this question where Ted Petrou explains the difference between .transform and .apply
This is the DataFrame used
df = pd.DataFrame({'State':['Texas', 'Texas', 'Florida', 'Florida'],
'a':[4,5,1,3], 'b':[6,10,3,11]})
State a b
0 Texas 4 6
1 Texas 5 10
2 Florida 1 3
3 Florida 3 11
Function inspect is defined
def inspect(x):
    print(x)
When I call inspect function using apply, I get 3 dataframes instead of 2
df.groupby('State').apply(lambda x:inspect(x))
State a b
2 Florida 1 3
3 Florida 3 11
State a b
2 Florida 1 3
3 Florida 3 11
State a b
0 Texas 4 6
1 Texas 5 10
Why am I getting 3 dataframes instead of 2 when printing? I would really like to understand how the apply function works.
Thanks in advance.
From the docs:
In the current implementation apply calls func twice on the first column/row to decide whether it can take a fast or slow code path. This can lead to unexpected behavior if func has side-effects, as they will take effect twice for the first column/row.
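To see how often the function is really called, the calls can be recorded instead of printed. The sketch below is a small experiment along those lines; the helper name record and the list seen are made up for illustration. The number of calls depends on the pandas version in use (the note quoted above describes the implementation at the time), so the sketch makes no claim about the exact count.

```python
import pandas as pd

df = pd.DataFrame({'State': ['Texas', 'Texas', 'Florida', 'Florida'],
                   'a': [4, 5, 1, 3], 'b': [6, 10, 3, 11]})

seen = []  # one entry per call of the function, holding the group key


def record(group):
    # A side effect, like print in the question: remember which group we got.
    seen.append(group.name)
    return len(group)


sizes = df.groupby('State')[['a', 'b']].apply(record)

print(seen)   # a key that appears twice was evaluated twice
print(sizes)
```

If a group key shows up more than once in seen, the function was run more than once on that group, which is exactly what makes side effects such as printing or appending misleading inside apply.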