Count votes of a survey by answer - Python

I'm working with a survey about income. I have my data like this:
   form  Survey1  Survey2  Country
0     1        1        1        1
1     2        1        2        5
2     3        2        2        4
3     4        2        1        1
4     5        2        2        4
I want to group by the answer and by the Country. For example, suppose Survey2 refers to the number of cars a respondent owns; I want to know the number of people that own one car in a certain country.
The expected output is as follows:
   Country  Survey1_1  Survey1_2  Survey2_1  Survey2_2
0        1          1          1          2          0
1        4          0          2          0          2
2        5          1          0          0          1
Here I appended '_#', where # is the answer being counted.
So far I've written code to find the distinct answers for each column and to count the answers equal to, say, 1, but I haven't found a way to count the answers for a specific country.
number_unic = df.iloc[:, j + ci].nunique()         # number of unique answers
vals_unic = list(df.iloc[:, j + ci].unique())      # the unique answers themselves
for i in range(len(vals_unic)):
    name = df.columns[j + ci] + '_' + str(vals_unic[i])  # name of the new column
    count = (df.iloc[:, j + ci] == vals_unic[i]).sum()   # count the values equal to this answer
    df.insert(len(df.columns), name, count)              # insert the new column

I would do this with a pivot_table:
In [11]: df.pivot_table(["Survey1", "Survey2"], ["Country"], df.groupby("Country").cumcount())
Out[11]:
        Survey1      Survey2
              0    1       0    1
Country
1           1.0  2.0     1.0  1.0
4           2.0  2.0     2.0  2.0
5           1.0  NaN     2.0  NaN
To get the output you wanted you could do something like:
In [21]: res = df.pivot_table(["Survey1", "Survey2"], ["Country"], df.groupby("Country").cumcount())
In [22]: res.columns = [s + "_" + str(n + 1) for s, n in res.columns.values]
In [23]: res
Out[23]:
         Survey1_1  Survey1_2  Survey2_1  Survey2_2
Country
1              1.0        2.0        1.0        1.0
4              2.0        2.0        2.0        2.0
5              1.0        NaN        2.0        NaN
But generally it's better to work with the MultiIndex here rather than flattening it...
To count the number of each response, you can do a somewhat more complicated groupby and value_counts:
In [31]: df1 = df.set_index("Country")[["Survey1", "Survey2"]] # more columns work fine here
In [32]: df1.unstack().groupby(level=[0, 1]).value_counts().unstack(level=0, fill_value=0).unstack(fill_value=0)
Out[32]:
        Survey1    Survey2
              1  2       1  2
Country
1             1  1       2  0
4             0  2       0  2
5             1  0       0  1
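As an alternative sketch (my own variation, not from the answer above): melting the survey columns to long form lets a single pd.crosstab do the per-country counting, and the flattened names follow the '_#' convention from the question.

```python
import pandas as pd

df = pd.DataFrame({
    'form':    [1, 2, 3, 4, 5],
    'Survey1': [1, 1, 2, 2, 2],
    'Survey2': [1, 2, 2, 1, 2],
    'Country': [1, 5, 4, 1, 4],
})

# Melt the survey columns to long form, then cross-tabulate
# Country against (survey, answer) pairs.
long = df.melt(id_vars='Country', value_vars=['Survey1', 'Survey2'],
               var_name='survey', value_name='answer')
res = pd.crosstab(long['Country'], [long['survey'], long['answer']])
res.columns = [f'{s}_{a}' for s, a in res.columns]
print(res)
```

This reproduces the expected output directly, with zeros filled in for missing (country, answer) combinations.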

Related

Find the number of previous consecutive occurrences of a value different from the current row value in a pandas dataframe

Assume that we have the following pandas dataframe:
df = pd.DataFrame({'x': [0,0,1,0,0,0,0], 'y': [1,1,1,1,1,1,0], 'z': [0,1,1,1,0,0,1]})
   x  y  z
0  0  1  0
1  0  1  1
2  1  1  1
3  0  1  1
4  0  1  0
5  0  1  0
6  0  0  1
The whole dataframe is filled with either 1s or 0s. Looking at each column separately, if the current row value is different from the previous value, I need to count the number of previous consecutive equal values:
   x  y  z
0
1        1
2  2
3  1
4        3
5
6     6  2
I tried to write a lambda function and apply it to the entire dataframe, but I failed. Any ideas?
Let's try this:
def f(col):
    x = (col != col.shift().bfill())
    s = x.cumsum()
    return s.groupby(s).transform('count').shift().where(x)

df.apply(f).fillna('')
Output:
   x  y  z
0
1        1
2  2
3  1
4        3
5
6     6  2
Details:
Use apply to run a custom function on each column of the dataframe.
Find the change points in the column, then use cumsum to label groups of consecutive values, then groupby and transform to attach each run's length to every row of that run, and finally mask the values with where so only the change points keep a count.
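To see the mechanics, here is the function unrolled step by step on the z column (an illustrative trace of the answer above, with the intermediates named):

```python
import pandas as pd

col = pd.Series([0, 1, 1, 1, 0, 0, 1])    # the 'z' column from the question

x = col != col.shift().bfill()            # True where the value changes
s = x.cumsum()                            # each change starts a new run label
runs = s.groupby(s).transform('count')    # length of the run each row belongs to
out = runs.shift().where(x)               # at a change point, report the previous run's length
print(out.tolist())
```

At rows 1, 4, and 6 (the change points of z) this yields 1, 3, and 2, matching the expected output; every other row stays NaN.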
You can try the following, where you identify the "runs" first and then get the run lengths. An entry is only filled where the value switches, so the result is the lengths of the runs except the last one.
import pandas as pd
import numpy as np

def func(x, missing=np.nan):
    runs = np.cumsum(np.append(0, np.diff(x) != 0))   # label each run of equal values
    switches = np.where(np.diff(x) != 0)[0] + 1       # positions where the value changes
    out = np.repeat(missing, len(x))
    out[switches] = np.bincount(runs)[:-1]            # previous run lengths at each switch
    # thanks to Scott, see comments below:
    ## out[switches] = pd.value_counts(runs, sort=False)[:-1]
    return out

df.apply(func)
     x    y    z
0  NaN  NaN  NaN
1  NaN  NaN  1.0
2  2.0  NaN  NaN
3  1.0  NaN  NaN
4  NaN  NaN  3.0
5  NaN  NaN  NaN
6  NaN  6.0  2.0
It might be faster with a good implementation of run-length encoding, but I am not too familiar with one in Python.
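For reference, a minimal run-length-encoding sketch in plain NumPy (the function names are my own, not from any library):

```python
import numpy as np

def rle_lengths(x):
    """Run-length encode a 1-D array: return the length of each run."""
    x = np.asarray(x)
    change = np.flatnonzero(np.diff(x) != 0) + 1            # index where each new run starts
    starts = np.concatenate(([0], change))
    return np.diff(np.concatenate((starts, [len(x)])))      # lengths of consecutive runs

def prev_run_length(x, missing=np.nan):
    """At each switch point, report the length of the run that just ended."""
    x = np.asarray(x)
    out = np.full(len(x), missing, dtype=float)
    change = np.flatnonzero(np.diff(x) != 0) + 1
    out[change] = rle_lengths(x)[:-1]                       # every run's length except the last
    return out

print(prev_run_length([0, 1, 1, 1, 0, 0, 1]))
```

On the z column this fills positions 1, 4, and 6 with 1, 3, and 2, like the answers above.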

Get Sum of Every Time Two Values Match

My googling has failed me; I think my main issue is that I'm unsure how to phrase the question (sorry about the poor title). I'm trying to find the total number of times two people vote the same way. Below you will see an example of how the data looks and the output I was looking for. I have a working solution, but it's very slow (see bottom), and I was wondering if there's a better way to go about this.
This is how the data is shaped
----------------------------------
event  person  vote
    1       a     y
    1       b     n
    1       c    nv
    1       d    nv
    1       e     y
    2       a     n
    2       b    nv
    2       c     y
    2       d     n
    2       e     n
----------------------------------
This is the output I'm looking for
----------------------------------
Person  a  b  c  d  e
a       2  0  0  1  2
b       0  2  0  0  0
c       0  0  2  1  0
d       1  0  1  2  1
e       2  0  0  1  2
----------------------------------
Working Code
df = df.pivot(index='event', columns='person', values='vote')
frame = pd.DataFrame(columns=df.columns, index=df.columns)
for person1, value in frame.iterrows():
    for person2 in frame:
        count = 0
        for i, row in df.iterrows():
            person1_votes = row[person1]
            person2_votes = row[person2]
            if person1_votes == person2_votes:
                count += 1
        frame.at[person1, person2] = count
Try looking at your problem in a different way:
df=df.assign(key=1)
mergedf=df.merge(df,on=['event','key'])
mergedf['equal']=mergedf['vote_x'].eq(mergedf['vote_y'])
output=mergedf.groupby(['person_x','person_y'])['equal'].sum().unstack()
output
Out[1241]:
person_y    a    b    c    d    e
person_x
a         2.0  0.0  0.0  1.0  2.0
b         0.0  2.0  0.0  0.0  0.0
c         0.0  0.0  2.0  1.0  0.0
d         1.0  0.0  1.0  2.0  1.0
e         2.0  0.0  0.0  1.0  2.0
@Wen-Ben already answered your question. It is based on the concept of finding all pair-wise combinations of people and counting those having the same vote. Finding all pairs is a Cartesian product (cross join). You may read the great post from @cs95 on cartesian product (CROSS JOIN) with pandas.
In your problem, you count same votes per event, so it is a cross join per event. Therefore, you don't need to add a helper key column as in @cs95's post; you may cross join directly on the column event. After the cross join, keep only the person<->person pairs having the same vote using query. Finally, use crosstab to count those pairs.
Below is my solution:
df_match = df.merge(df, on='event').query('vote_x == vote_y')
pd.crosstab(index=df_match.person_x, columns=df_match.person_y)
Out[1463]:
person_y  a  b  c  d  e
person_x
a         2  0  0  1  2
b         0  2  0  0  0
c         0  0  2  1  0
d         1  0  1  2  1
e         2  0  0  1  2
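Another way to look at the counting (a sketch of my own, not from either answer): one-hot encode each person's (event, vote) pair, and the agreement matrix is then just the indicator matrix times its own transpose.

```python
import pandas as pd

df = pd.DataFrame({
    'event':  [1] * 5 + [2] * 5,
    'person': list('abcde') * 2,
    'vote':   ['y', 'n', 'nv', 'nv', 'y', 'n', 'nv', 'y', 'n', 'n'],
})

# One row per person, one indicator column per (event, vote) combination;
# the matrix product counts, for each pair of people, how many events
# they voted identically on.
ind = pd.crosstab(df['person'], [df['event'], df['vote']])
agree = ind.dot(ind.T)
print(agree)
```

Since each person casts exactly one vote per event, the diagonal equals the number of events, matching the expected output.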

Loop that counts unique values in a pandas df

I am trying to create a loop, or a more efficient process, that can count the occurrences of values in a pandas df. At the moment I'm selecting one value at a time to perform the function on.
So for the df below, I'm trying to determine two counts.
1) ['u'] returns the count of the same values remaining in ['Code', 'Area'], i.e. how many more times the same value occurs.
2) ['On'] returns the number of values currently occurring in ['Area']. It achieves this by parsing through the df to see if those values occur again, so it essentially looks into the future.
import pandas as pd
d = {'Code': ['A','A','A','A','B','A','B','A','A','A'],
     'Area': ['Home','Work','Shops','Park','Cafe','Home','Cafe','Work','Home','Park']}
df = pd.DataFrame(data=d)
#Select value
df1 = df[df.Code == 'A'].copy()
df1['u'] = df1[::-1].groupby('Area').Area.cumcount()
ids = [1]
seen = set([df1.iloc[0].Area])
dec = False
for val, u in zip(df1.Area[1:], df1.u[1:]):
    ids.append(ids[-1] + (val not in seen) - dec)
    seen.add(val)
    dec = u == 0
df1['On'] = ids
df1 = df1.reindex(df.index).fillna(df1)
The problem is that I want to run this script on all values in Code instead of selecting one at a time. For instance, to do the same thing for Code 'B', I would have to change the selection to df2 = df[df.Code == 'B'].copy() and run the script again.
If I have numerous values in Code this becomes very inefficient. I need a loop that finds all unique values in 'Code'. Ideally, the script would look like:
df1 = df[df.Code == 'All unique values'].copy()
Intended Output:
  Code   Area    u   On
0    A   Home  2.0  1.0
1    A   Work  1.0  2.0
2    A  Shops  0.0  3.0
3    A   Park  1.0  3.0
4    B   Cafe  1.0  1.0
5    A   Home  1.0  3.0
6    B   Cafe  0.0  1.0
7    A   Work  0.0  3.0
8    A   Home  0.0  2.0
9    A   Park  0.0  1.0
I find your "On" logic very confusing. That said, I think I can reproduce it:
df["u"] = df.groupby(["Code", "Area"]).cumcount(ascending=False)
df["nunique"] = pd.get_dummies(df.Area).groupby(df.Code).cummax().sum(axis=1)
df["On"] = (df["nunique"] -
            (df["u"] == 0).groupby(df.Code).cumsum().groupby(df.Code).shift().fillna(0))
which gives me
In [212]: df
Out[212]:
  Code   Area  u  nunique   On
0    A   Home  2        1  1.0
1    A   Work  1        2  2.0
2    A  Shops  0        3  3.0
3    A   Park  1        4  3.0
4    B   Cafe  1        1  1.0
5    A   Home  1        4  3.0
6    B   Cafe  0        1  1.0
7    A   Work  0        4  3.0
8    A   Home  0        4  2.0
9    A   Park  0        4  1.0
In this, u is the number of matching (Code, Area) pairs after that row, and nunique is the number of unique Area values seen so far within that Code.
On is the number of unique Areas seen so far, except that once we "run out" of an Area (once it's not used any more) we start subtracting it from nunique.
Using GroupBy with size and cumcount, you can construct your u series.
Your logic for On isn't clear: this requires clarification.
g = df.groupby(['Code', 'Area'])
df['u'] = g['Code'].transform('size') - (g.cumcount() + 1)
print(df)
  Code   Area  u
0    A   Home  2
1    A   Home  1
2    B  Shops  1
3    A   Park  1
4    B   Cafe  1
5    B  Shops  0
6    A   Home  0
7    B   Cafe  0
8    A   Work  0
9    A   Park  0
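The two u constructions above are equivalent: counting group size minus position from the front is the same as counting positions from the back. A quick check on the question's data (my own verification, not from either answer):

```python
import pandas as pd

df = pd.DataFrame({
    'Code': ['A','A','A','A','B','A','B','A','A','A'],
    'Area': ['Home','Work','Shops','Park','Cafe','Home','Cafe','Work','Home','Park'],
})

g = df.groupby(['Code', 'Area'])
# "remaining occurrences" computed two equivalent ways:
u1 = g['Code'].transform('size') - (g.cumcount() + 1)   # group size minus forward position
u2 = g.cumcount(ascending=False)                        # position counted from the back
print(u1.equals(u2))
```

Both count how many rows with the same (Code, Area) pair come after the current row.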

Find the increase or decrease in a column of a Dataframe, grouped by another column, in Python / Pandas

In my dataframe I want to know whether the ordonnee value is decreasing, increasing, or not changing compared with the preceding value (the row before), grouped by the column temps.
I already tried the method from this stackoverflow post, and I tried to groupby, but it is not working. Do you have any ideas?
entry = pd.DataFrame([['1',0,0],['1',1,1],['1',2,1],['1',3,1],['1',3,-2],
                      ['2',1,2],['2',1,3]],
                     columns=['temps','abcisse','ordonnee'])
output = pd.DataFrame([['1',0,0,'--'],['1',1,1,'increase'],['1',2,1,'--'],
                       ['1',3,1,'--'],['1',3,-2,'decrease'],['2',1,2,'--'],
                       ['2',1,3,'increase']],
                      columns=['temps','abcisse','ordonnee','variation'])
Use
In [5537]: s = entry.groupby('temps').ordonnee.diff().fillna(0)
In [5538]: entry['variation'] = np.where(s.eq(0), '--',
                                np.where(s.gt(0), 'increase', 'decrease'))
In [5539]: entry
Out[5539]:
  temps  abcisse  ordonnee variation
0     1        0         0        --
1     1        1         1  increase
2     1        2         1        --
3     1        3         1        --
4     1        3        -2  decrease
5     2        1         2        --
6     2        1         3  increase
Also, as pointed out in jezrael's comment, you can use np.select instead of np.where:
In [5549]: entry['variation'] = np.select([s > 0, s < 0], ['increase', 'decrease'],
                                          default='--')
Details
In [5541]: s
Out[5541]:
0    0.0
1    1.0
2    0.0
3    0.0
4   -3.0
5    0.0
6    1.0
Name: ordonnee, dtype: float64
Use np.where with groupby transform, i.e.
entry['new'] = entry.groupby(['temps'])['ordonnee'].transform(
    lambda x: np.where(x.diff() > 0, 'increase',
                       np.where(x.diff() < 0, 'decrease', '--')))
Output:
  temps  abcisse  ordonnee       new
0     1        0         0        --
1     1        1         1  increase
2     1        2         1        --
3     1        3         1        --
4     1        3        -2  decrease
5     2        1         2        --
6     2        1         3  increase
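A more compact sketch along the same lines (my own variation, not from either answer): the sign of the within-group difference maps directly onto the three labels, so a single map call replaces the nested np.where.

```python
import numpy as np
import pandas as pd

entry = pd.DataFrame([['1',0,0],['1',1,1],['1',2,1],['1',3,1],['1',3,-2],
                      ['2',1,2],['2',1,3]],
                     columns=['temps','abcisse','ordonnee'])

# Sign of the within-group difference: +1 increase, -1 decrease, 0 no change.
diff = entry.groupby('temps')['ordonnee'].diff().fillna(0)
entry['variation'] = np.sign(diff).map({1.0: 'increase', -1.0: 'decrease', 0.0: '--'})
print(entry)
```

The fillna(0) turns the first row of each temps group into the '--' case, as in the expected output.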

taking a count of numbers occurring in a column in a dataframe using pandas

I have a dataframe like the one given below. There is one question on top, with a column for each element below it. I am trying to take a count of all the numbers under each element and to transpose the data, so that after transposing the rating becomes the column header and the count becomes the data underneath each rating. I tried multiple pandas methods, like
df.eq('1').sum(axis=1)
df2 = df.transpose()
but I am not getting the desired output.
how would you rank these items on a scale of 1-5
X  Y  Z
1  2  1
2  1  3
3  1  1
1  3  2
1  1  2
2  5  3
4  1  2
1  4  4
3  3  5
desired output is something like
   1             2             3  4  5
X  (count of 1s) (count of 2s) ... and so on
Y  (count of 1s) (count of 2s) ...
Z  (count of 1s) (count of 2s) ...
any help would really mean a lot.
You can apply pd.value_counts to all columns, which will count the values in each column, and then transpose the result:
df.apply(pd.value_counts).fillna(0).T
#     1    2    3    4    5
# X  4.0  2.0  2.0  1.0  0.0
# Y  4.0  1.0  2.0  1.0  1.0
# Z  2.0  3.0  2.0  1.0  1.0
Option 0
pd.concat
pd.concat({c: s.value_counts() for c, s in df.iteritems()}).unstack(fill_value=0)
Option 1
stack preserves int dtype
df.stack().groupby(level=1).apply(
    pd.value_counts
).unstack(fill_value=0)
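One more option in the same spirit (a sketch of my own, not from the answers above): after stacking, the second index level holds the column name and the values hold the ratings, so a single pd.crosstab does the counting with integer zeros filled in.

```python
import pandas as pd

df = pd.DataFrame({'X': [1, 2, 3, 1, 1, 2, 4, 1, 3],
                   'Y': [2, 1, 1, 3, 1, 5, 1, 4, 3],
                   'Z': [1, 3, 1, 2, 2, 3, 2, 4, 5]})

# Stack to long form: level 1 of the index is the element name (X/Y/Z),
# the values are the ratings; crosstab counts each (element, rating) pair.
s = df.stack()
res = pd.crosstab(s.index.get_level_values(1), s)
print(res)
```

Rows are the elements, columns are the ratings 1-5, and cells hold the counts, matching the desired layout.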
