I am trying to create a loop, or a more efficient process, that can count the number of current values in a pandas df. At the moment I'm selecting the value I want to perform the function on.
So for the df below, I'm trying to determine two counts.
1) ['u'] returns the count of how many more times the same ['Code', 'Area'] value occurs in the remaining rows.
2) ['On'] returns the number of values that are currently occurring in ['Area']. It achieves this by parsing through the df to see if those values occur again, so it essentially looks into the future to see whether those values occur again.
import pandas as pd
d = ({
'Code' : ['A','A','A','A','B','A','B','A','A','A'],
'Area' : ['Home','Work','Shops','Park','Cafe','Home','Cafe','Work','Home','Park'],
})
df = pd.DataFrame(data=d)
#Select value
df1 = df[df.Code == 'A'].copy()
df1['u'] = df1[::-1].groupby('Area').Area.cumcount()
ids = [1]
seen = set([df1.iloc[0].Area])
dec = False
for val, u in zip(df1.Area[1:], df1.u[1:]):
    ids.append(ids[-1] + (val not in seen) - dec)
    seen.add(val)
    dec = u == 0
df1['On'] = ids
df1 = df1.reindex(df.index).fillna(df1)
The problem is I want to run this script on all values in Code, instead of selecting one at a time. For instance, if I wanted to do the same thing on Code['B'], I would have to change it to df2 = df[df.Code == 'B'].copy() and then run the script again.
If I have numerous values in Code this becomes very inefficient. I need a loop that finds all the unique values in 'Code'. Ideally, the script would look like:
df1 = df[df.Code == 'All unique values'].copy()
Intended Output:
Code Area u On
0 A Home 2.0 1.0
1 A Work 1.0 2.0
2 A Shops 0.0 3.0
3 A Park 1.0 3.0
4 B Cafe 1.0 1.0
5 A Home 1.0 3.0
6 B Cafe 0.0 1.0
7 A Work 0.0 3.0
8 A Home 0.0 2.0
9 A Park 0.0 1.0
I find your "On" logic very confusing. That said, I think I can reproduce it:
df["u"] = df.groupby(["Code", "Area"]).cumcount(ascending=False)
df["nunique"] = pd.get_dummies(df.Area).groupby(df.Code).cummax().sum(axis=1)
df["On"] = (df["nunique"] -
            (df["u"] == 0).groupby(df.Code).cumsum().groupby(df.Code).shift().fillna(0))
which gives me
In [212]: df
Out[212]:
Code Area u nunique On
0 A Home 2 1 1.0
1 A Work 1 2 2.0
2 A Shops 0 3 3.0
3 A Park 1 4 3.0
4 B Cafe 1 1 1.0
5 A Home 1 4 3.0
6 B Cafe 0 1 1.0
7 A Work 0 4 3.0
8 A Home 0 4 2.0
9 A Park 0 4 1.0
In this, u is the number of matching (Code, Area) pairs after that row, and nunique is the number of unique Area values seen so far in that Code.
On is the number of unique Areas seen so far, except that once we "run out" of an Area -- once it's not used any more -- we start subtracting it from nunique.
Using GroupBy with size and cumcount, you can construct your u series.
Your logic for On isn't clear: this requires clarification.
g = df.groupby(['Code', 'Area'])
df['u'] = g['Code'].transform('size') - (g.cumcount() + 1)
print(df)
Code Area u
0 A Home 2
1 A Work 1
2 A Shops 0
3 A Park 1
4 B Cafe 1
5 A Home 1
6 B Cafe 0
7 A Work 0
8 A Home 0
9 A Park 0
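Alternatively, if you would rather keep an explicit loop than switch to a vectorized solution, a minimal sketch (reusing the per-Code script from the question unchanged, and using pd.concat plus sort_index to put the rows back in their original order) could look like this:
import pandas as pd
d = ({
'Code' : ['A','A','A','A','B','A','B','A','A','A'],
'Area' : ['Home','Work','Shops','Park','Cafe','Home','Cafe','Work','Home','Park'],
})
df = pd.DataFrame(data=d)
pieces = []
for code in df.Code.unique():
    # same per-Code logic as in the question
    sub = df[df.Code == code].copy()
    sub['u'] = sub[::-1].groupby('Area').Area.cumcount()
    ids = [1]
    seen = set([sub.iloc[0].Area])
    dec = False
    for val, u in zip(sub.Area[1:], sub.u[1:]):
        ids.append(ids[-1] + (val not in seen) - dec)
        seen.add(val)
        dec = u == 0
    sub['On'] = ids
    pieces.append(sub)
out = pd.concat(pieces).sort_index()  # reassemble in the original row order
print(out)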
I have a pandas dataframe with columns 'Expected' and 'Actual' that show a product (A, B, C or D) for each record:
ID Expected Actual
1 A B
2 A A
3 C B
4 B D
5 C D
6 A A
7 B B
8 A D
I want to get a count from both columns for each unique value found across both columns (the two columns don't share all the same products). So the result should look like this:
Value Expected Actual
A 4 2
B 2 3
C 2 0
D 0 3
Thank you for all your help
You can use apply and value_counts
df = pd.DataFrame({'Expected':['A','A','C','B','C','A','B','A'],'Actual':['B','A','B','D','D','A','B','D']})
df.apply(pd.Series.value_counts).fillna(0)
output:
Expected Actual
A 4.0 2.0
B 2.0 3.0
C 2.0 0.0
D 0.0 3.0
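If you also want an explicit Value column as in the expected output, rather than leaving the values as an unnamed index, one small optional tweak (a sketch using rename_axis and reset_index) is:
out = df.apply(pd.Series.value_counts).fillna(0).rename_axis('Value').reset_index()
print(out)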
I would do it the following way:
import pandas as pd
df = pd.DataFrame({'Expected':['A','A','C','B','C','A','B','A'],'Actual':['B','A','B','D','D','A','B','D']})
ecnt = df['Expected'].value_counts()
acnt = df['Actual'].value_counts()
known = sorted(set(df['Expected']).union(df['Actual']))
cntdf = pd.DataFrame({'Value':known,'Expected':[ecnt.get(k,0) for k in known],'Actual':[acnt.get(k,0) for k in known]})
print(cntdf)
output
Value Expected Actual
0 A 4 2
1 B 2 3
2 C 2 0
3 D 0 3
Explanation: the main idea here is having separate value counts for the Expected column and the Actual column. If you would rather have Value as the index of your pandas.DataFrame, you can do
...
cntdf = pd.DataFrame([acnt,ecnt]).T.fillna(0)
print(cntdf)
output
Actual Expected
D 3.0 0.0
B 3.0 2.0
A 2.0 4.0
C 0.0 2.0
I'm sure there is an elegant solution for this, but I cannot find one. In a pandas dataframe, how do I remove all duplicate values in a column while ignoring one value?
repost_of_post_id title
0 7139471603 Man with an RV needs a place to park for a week
1 6688293563 Land for lease
2 None 2B/1.5B, Dishwasher, In Lancaster
3 None Looking For Convenience? Check Out Cordova Par...
4 None 2/bd 2/ba, Three Sparkling Swimming Pools, Sit...
5 None 1 bedroom w/Closet is bathrooms in Select Unit...
6 None Controlled Access/Gated, Availability 24 Hours...
7 None Beautiful 3 Bdrm 2 & 1/2 Bth Home For Rent
8 7143099582 Need Help Getting Approved?
9 None *MOVE IN READY APT* REQUEST TOUR TODAY!
What I want is to keep all None values in repost_of_post_id, but omit any duplicates of the numerical values, for example if there are duplicates of 7139471603 in the dataframe.
[UPDATE]
I got the desired outcome using this script, but I would like to accomplish this in a one-liner, if possible.
# remove duplicate repost id if present (i.e. don't remove rows where repost_of_post_id value is "None")
# ca_housing is the original dataframe that needs to be cleaned
ca_housing_repost_none = ca_housing.loc[ca_housing['repost_of_post_id'] == "None"]
ca_housing_repost_not_none = ca_housing.loc[ca_housing['repost_of_post_id'] != "None"]
ca_housing_repost_not_none_unique = ca_housing_repost_not_none.drop_duplicates(subset="repost_of_post_id")
ca_housing_unique = ca_housing_repost_none.append(ca_housing_repost_not_none_unique)
You could try dropping the None values, then detecting duplicates, then filtering them out of the original DataFrame.
In [1]: import pandas as pd
...: from string import ascii_lowercase
...:
...: ids = [1,2,3,None,None, None, 2,3, None, None,4,5]
...: df = pd.DataFrame({'id': ids, 'title': list(ascii_lowercase[:len(ids)])})
...: print(df)
...:
...: print(df[~df.index.isin(df.id.dropna().duplicated().loc[lambda x: x].index)])
id title
0 1.0 a
1 2.0 b
2 3.0 c
3 NaN d
4 NaN e
5 NaN f
6 2.0 g
7 3.0 h
8 NaN i
9 NaN j
10 4.0 k
11 5.0 l
id title
0 1.0 a
1 2.0 b
2 3.0 c
3 NaN d
4 NaN e
5 NaN f
8 NaN i
9 NaN j
10 4.0 k
11 5.0 l
You could use drop_duplicates and merge with the NaNs as follows:
df_cleaned = df.drop_duplicates('post_id', keep='first').merge(df[df.post_id.isnull()], how='outer')
This will keep the first occurrence of duplicated ids and all NaN rows.
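If you prefer the single boolean mask the question asked for, a sketch along the same lines (assuming, as in the update, that the placeholder is the literal string "None" in the repost_of_post_id column) would be:
# keep every "None" row, plus only the first occurrence of each real repost id
ca_housing_unique = ca_housing[(ca_housing['repost_of_post_id'] == "None") |
                               ~ca_housing['repost_of_post_id'].duplicated()]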
My googling has failed me; I think my main issue is that I'm unsure how to phrase the question (sorry about the crappy title). I'm trying to find the total number of times 2 people vote the same way. Below you will see an example of how the data looks and the output I was looking for. I have a working solution, but it's very slow (see bottom), and I was wondering if there's a better way to go about this.
This is how the data is shaped
----------------------------------
event person vote
1 a y
1 b n
1 c nv
1 d nv
1 e y
2 a n
2 b nv
2 c y
2 d n
2 e n
----------------------------------
This is the output im looking for
----------------------------------
Person a b c d e
a 2 0 0 1 2
b 0 2 0 0 0
c 0 0 2 1 0
d 1 0 1 2 1
e 2 0 0 1 2
----------------------------------
Working Code
df = df.pivot(index='event', columns='person', values='vote')
frame = pd.DataFrame(columns=df.columns, index=df.columns)
for person1, value in frame.iterrows():
    for person2 in frame:
        count = 0
        for i, row in df.iterrows():
            person1_votes = row[person1]
            person2_votes = row[person2]
            if person1_votes == person2_votes:
                count += 1
        frame.at[person1, person2] = count
Try looking at your problem in a different way:
df=df.assign(key=1)
mergedf=df.merge(df,on=['event','key'])
mergedf['equal']=mergedf['vote_x'].eq(mergedf['vote_y'])
output=mergedf.groupby(['person_x','person_y'])['equal'].sum().unstack()
output
Out[1241]:
person_y a b c d e
person_x
a 2.0 0.0 0.0 1.0 2.0
b 0.0 2.0 0.0 0.0 0.0
c 0.0 0.0 2.0 1.0 0.0
d 1.0 0.0 1.0 2.0 1.0
e 2.0 0.0 0.0 1.0 2.0
#Wen-Ben already answered your question. It is based on the concept of finding all pair-wise combinations of persons and counting those having the same vote. Finding all pair-wise combinations is a cartesian product (cross join). You may read the great post from #cs95 on cartesian product (CROSS JOIN) with pandas.
In your problem you count same votes per event, so it is a cross join per event. Therefore, you don't need to add a helper key column as in #cs95's post. You may cross join directly on column event. After the cross join, keep only those person<->person pairs having the same vote using query. Finally, use crosstab to count those pairs.
Below is my solution:
df_match = df.merge(df, on='event').query('vote_x == vote_y')
pd.crosstab(index=df_match.person_x, columns=df_match.person_y)
Out[1463]:
person_y a b c d e
person_x
a 2 0 0 1 2
b 0 2 0 0 0
c 0 0 2 1 0
d 1 0 1 2 1
e 2 0 0 1 2
I'm working with a survey related to income. I have my data like this:
form Survey1 Survey2 Country
0 1 1 1 1
1 2 1 2 5
2 3 2 2 4
3 4 2 1 1
4 5 2 2 4
I want to group by the answer and by the Country. For example, if Survey2 refers to the number of cars of the respondent, I want to know the number of people that own one car in a certain country.
The expected output is as follows:
Country Survey1_1 Survey1_2 Survey2_1 Survey2_2
0 1 1 1 2 0
1 4 0 2 0 2
2 5 1 0 0 1
Here I added '_#' where # is the answer to count.
So far I've written code to find the different answers for each column and I've counted the answers equal to, say, 1, but I haven't found a way to count the answers for a specific country.
number_unic = df.head().iloc[:, j+ci].nunique()  # count unique answers
val_unic = list(df.iloc[:, column].unique())  # unique answers
for i in range(len(val_unic)):
    names = str(df.columns[j+ci] + '_' + str(val_unic[i]))  # names of the new columns
    count = (df.iloc[:, j+ci] == val_unic[i]).sum()  # count the values equal to each unique answer
    df.insert(len(df.columns.values), names, count)  # insert the new columns
I would do this with a pivot_table:
In [11]: df.pivot_table(["Survey1", "Survey2"], ["Country"], df.groupby("Country").cumcount())
Out[11]:
Survey1 Survey2
0 1 0 1
Country
1 1.0 2.0 1.0 1.0
4 2.0 2.0 2.0 2.0
5 1.0 NaN 2.0 NaN
To get the output you wanted you could do something like:
In [21]: res = df.pivot_table(["Survey1", "Survey2"], ["Country"], df.groupby("Country").cumcount())
In [22]: res.columns = [s + "_" + str(n + 1) for s, n in res.columns.values]
In [23]: res
Out[23]:
Survey1_1 Survey1_2 Survey2_1 Survey2_2
Country
1 1.0 2.0 1.0 1.0
4 2.0 2.0 2.0 2.0
5 1.0 NaN 2.0 NaN
But, generally it's better to use the MultiIndex here...
To count the number of each response you can do this somewhat more complicated groupby and value_counts:
In [31]: df1 = df.set_index("Country")[["Survey1", "Survey2"]] # more columns work fine here
In [32]: df1.unstack().groupby(level=[0, 1]).value_counts().unstack(level=0, fill_value=0).unstack(fill_value=0)
Out[32]:
Survey1 Survey2
1 2 1 2
Country
1 1 1 2 0
4 0 2 0 2
5 1 0 0 1
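If you then want the flat column names from the question (Survey1_1, Survey1_2, ...), a possible follow-up, in the same spirit as the renaming in [22] above, is to join the two levels of the resulting MultiIndex yourself:
res = (df1.unstack().groupby(level=[0, 1]).value_counts()
          .unstack(level=0, fill_value=0).unstack(fill_value=0))
res.columns = [s + "_" + str(v) for s, v in res.columns]  # e.g. ('Survey1', 1) -> 'Survey1_1'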
The DataFrame consists of a table whose format is shown in the attached image. I apologize for not being able to type the format here: while trying to type out the DataFrame, it kept getting messed up by the long decimal values, so I thought to attach a snapshot of it instead.
Country names are the index of the DataFrame and the cell values consist of the corresponding GDP values. The intent is to calculate the average of all the rows for each country. When np.average was applied -
#name of Dataframe - GDP
def function_average():
    GDP['Average'] = np.average(GDP.iloc[:,0:])
    return GDP
function_average()
The new column got created, but all the values showed up as NaN. I assumed it's probably due to inappropriately formatted cell values, so I tried truncating them using the following code -
GDP = np.round(GDP, decimals =2)
And yet there was no change in the values, although the code ran successfully and there was no error.
Please advise how to proceed in this case: should I try to make the change in the spreadsheet itself, or attempt to format the cell values in the DataFrame?
I apologize for not being able to provide any other required information at this point. Please let me know if any other detail is required.
The problem is that you need axis=1 to compute the mean per row, and you need to change the function to numpy.nanmean or DataFrame.mean:
Sample:
np.random.seed(100)
GDP = pd.DataFrame(np.random.randint(10, size=(5,5)), columns=list('ABCDE'))
GDP.loc[0, 'A'] = np.nan
GDP['Average1'] = np.average(GDP.iloc[:,0:], axis=1)
GDP['Average2'] = np.nanmean(GDP.iloc[:,0:], axis=1)
GDP['Average3'] = GDP.iloc[:,0:].mean(axis=1)
print (GDP)
A B C D E Average1 Average2 Average3
0 NaN 8 3 7 7 NaN 6.25 6.25
1 0.0 4 2 5 2 2.6 2.60 2.60
2 2.0 2 1 0 8 2.6 2.60 2.60
3 4.0 0 9 6 2 4.2 4.20 4.20
4 4.0 1 5 3 4 3.4 3.40 3.40
You get NaN because there is at least one NaN, and without axis np.average aggregates over the whole DataFrame:
print (np.average(GDP.iloc[:,0:]))
nan
GDP['Average'] = np.average(GDP.iloc[:,0:])
print (GDP)
A B C D E Average
0 NaN 8 3 7 7 NaN
1 0.0 4 2 5 2 NaN
2 2.0 2 1 0 8 NaN
3 4.0 0 9 6 2 NaN
4 4.0 1 5 3 4 NaN