I have this dataframe (User is the index; also, this is part of a bigger dataframe, so a Series cannot be used):
User Score
A 5
B 10
A 4
C 8
I want to add the Scores of duplicate Users.
So first I calculate the sum of the duplicate User's Score:
sum = df.loc['A'].sum()
Then I drop the duplicate rows:
df.drop('A',inplace=True)
Then I append the new values as a dictionary to the dataframe:
dic = {'User':'A','Score':10}
df = df.append(dic,ignore_index=True)
But I get this dataframe:
Score User
0 10 NaN
1 8 NaN
2 10 A
The default autoincrement values have replaced User as the index, and the values of User are now NaN.
The expected dataframe would be:
User Score
B 10
C 8
A 9
What you are attempting will not work because your original dataframe has an index called 'User', not a column. You're trying to append a dict that defines a column called 'User' to a DataFrame that has no such column.
Compare the result of your
df = df.append(dic,ignore_index=True)
with
df = df.reset_index().append(dic,ignore_index=True)
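Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0. A minimal sketch of the same append using pd.concat instead, assuming df and dic as defined above:
import pandas as pd
# build a one-row frame from the dict and concatenate it onto the reset frame
df = pd.concat([df.reset_index(), pd.DataFrame([dic])], ignore_index=True)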
However, this raises the question of why you're doing it this way. What you want to do can be achieved simply with a groupby:
import pandas as pd
data = {'User': ['A', 'B', 'A', 'C'], 'Score': [5, 10, 4, 8]}
data
df = pd.DataFrame(data)
df
Out[6]:
User Score
0 A 5
1 B 10
2 A 4
3 C 8
df.set_index('User')
Out[7]:
Score
User
A 5
B 10
A 4
C 8
df = df.set_index('User')
df
Out[10]:
Score
User
A 5
B 10
A 4
C 8
df.groupby('User').sum()
Out[30]:
Score
User
A 9
B 10
C 8
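Also, since 'User' is already the index of your original frame, you can group on the index directly without resetting anything. A minimal sketch against the frame from the question:
# level=0 groups on the index itself
df.groupby(level=0)['Score'].sum()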
You can try this. Let's say you have the dataframe from the question; you can use the code below to get the result:
df.groupby('User')['Score'].sum().reset_index().set_index('User')
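The reset_index/set_index round trip just turns the resulting Series back into a one-column frame; an equivalent sketch:
df.groupby('User')[['Score']].sum()  # DataFrame with User as the index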
Related
Given any df with only 3 columns and n rows, I'm trying to split it horizontally, in a loop, at the position where the value in a column is max.
Something close to what np.array_split() does, but not necessarily into equal sizes. The cut would have to be at the row with the value determined by the max rule at that moment in the loop. I imagine the over- or under-cutting bit is not necessarily the harder part.
An example (sorry, it's my first time actually asking a question; formatting code here is unknown to me yet):
df = pd.DataFrame({'a': [3,1,5,5,4,4], 'b': [1,7,1,2,5,5], 'c': [2,4,1,3,2,2]})
This df, with the max value condition applied on column b (7), would be cut into a 2-row df and another with 4 rows.
Perhaps this might help you. Assume our n by 3 dataframe is as follows:
df = pd.DataFrame({'a': [1,2,3,4], 'b': [4,3,2,1], 'c': [2,4,1,3]})
>>> df
a b c
0 1 4 2
1 2 3 4
2 3 2 4
3 4 1 3
We can create a list of rows where max values occur for each column.
rows = [df[df[i] == max(df[i])] for i in df.columns]
>>> rows[0]
a b c
3 4 1 3
>>> rows[2]
a b c
1 2 3 4
2 3 2 4
This can also be written as a list of indexes if preferred.
indexes = [i.index for i in rows]
>>> indexes
[Int64Index([3], dtype='int64'), Int64Index([0], dtype='int64'), Int64Index([1, 2], dtype='int64')]
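To actually cut the frame at the row holding the max of a given column, a sketch for column b, matching the example above where the first piece ends at the max row:
import pandas as pd
df = pd.DataFrame({'a': [3,1,5,5,4,4], 'b': [1,7,1,2,5,5], 'c': [2,4,1,3,2,2]})
pos = df.index.get_loc(df['b'].idxmax())            # position of the max row
top, bottom = df.iloc[:pos + 1], df.iloc[pos + 1:]  # a 2-row piece and a 4-row piece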
I am working with student test data. The data provided is in a new format and I need to align it with the older format for an existing BI application. Where a range of columns used to contain question numbers, the column name now contains the correct answer (this includes duplicate column names as imported from the source XLSX). Different year levels have a different number of questions, so the "Total" column is not in a fixed position. I need to rename the answer columns back to the sequential question numbers starting at 1. What is the best way to achieve this?
NB the sample df is not quite right, as the real data has duplicate column names (the column name represents the correct answer); I cannot provide such a sample df without importing it from a CSV/XLSX.
Updated with some sample df data:
data = {
'StudentID': [10, 11, 12, 13],
'Year' : [2021,2021,2021,2021],
'TestName': ['Math83', 'Math83','Math83','Math83'],
'A' : ['C','A','C','B'],
'B' : ['D','C','C','C'],
'C' : ['D','D','C','D'],
'D' : ['B','C','C','C'],
'Total': [5,4,3,5,],
'Score': [3,3,4,2,],
'Error': [1,2,1,1]
}
df = pd.DataFrame(data)
Here is a solution using set_axis()
cols = df.columns
tn = cols.get_loc('TestName') + 1  # position of the first answer column
total = cols.get_loc('Total')      # position of the first column after the answers
(df.set_axis(cols[:tn].tolist() +              # leading columns unchanged
             list(range(1, total - tn + 1)) +  # question numbers 1..n
             cols[total:].tolist(), axis=1))
Output:
StudentID Year TestName 1 2 3 4 Total Score Error
0 10 2021 Math83 C D D B 5 3 1
1 11 2021 Math83 A C D C 4 3 2
2 12 2021 Math83 C C C C 3 4 1
3 13 2021 Math83 B C D C 5 2 1
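Because set_axis assigns labels purely by position, the same approach works when the imported answer columns have duplicate names. An equivalent positional sketch, reusing cols, tn, and total from above:
new_cols = cols.tolist()
new_cols[tn:total] = range(1, total - tn + 1)  # question numbers 1..n
df.columns = new_cols  # duplicate old names don't matter; only positions are used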
With this data set I want to know the people (id) who have made payments for both types a and b, and to create a subset of the data with those people. (This is just an example set of data; the one I'm using is much larger.)
I've tried grouping by the id, then making a subset of the data where type.len >= 2, then creating another subset based on the conditions df.loc[(df.type == 'a') & (df.type == 'b')]. I thought that if I grouped by the id first and then ran that df.loc code it would work, but it doesn't.
Any help is much appreciated.
Thanks.
Separate the dataframe into two, one with type a payments and the other with type b payments, then merge them:
df_typea = df[(df['type'] == 'a')]
df_typeb = df[(df['type'] == 'b')]
df_merge = pd.merge(df_typea, df_typeb, how='outer', on='id', suffixes=('_a', '_b'))
This will create a separate column for each payment type.
Now, you can find the ids for which both payments have been made,
df_payments = df_merge[(df_merge['type_a'] == 'a') & (df_merge['type_b'] == 'b')]
Note that this will create two records for items similar to that of id 9, for which there are more than two payments. I am assuming that you simply want to check whether any payments of type 'a' and 'b' have been made for each id. In this case, you can simply drop any duplicates:
df_payments_no_duplicates = df_payments['id'].drop_duplicates()
You first split your DataFrame into two DataFrames:
one with type a payments only
one with type b payments only
You then join both DataFrames on id.
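A minimal sketch of that approach, assuming columns id and type as in the question:
import pandas as pd
df = pd.DataFrame({'id': [1, 1, 2, 3], 'type': ['a', 'b', 'a', 'b']})
type_a = df.loc[df['type'] == 'a', ['id']].drop_duplicates()
type_b = df.loc[df['type'] == 'b', ['id']].drop_duplicates()
both = type_a.merge(type_b, on='id')    # ids present in both frames
subset = df[df['id'].isin(both['id'])]  # all rows for those ids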
You can use groupby to solve this problem. First group by id and type; then you can group again to see whether each id had both types.
import pandas as pd
df = pd.DataFrame({"id" : [1, 1, 2, 3, 4, 4, 5, 5], 'payment' : [10, 15, 5, 20, 35, 30, 10, 20], 'type' : ['a', 'b', 'a','a','a','a','b', 'a']})
df_group = df.groupby(['id', 'type']).nunique()
#print(df_group)
'''
payment
id type
1 a 1
b 1
2 a 1
3 a 1
4 a 2
5 a 1
b 1
'''
# if the value in this series is 2, the id has both a and b
data = df_group.groupby('id').size()
#print(data)
'''
id
1 2
2 1
3 1
4 1
5 2
dtype: int64
'''
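To turn that into the subset the question asks for, a short sketch continuing from the data series above:
both_ids = data[data == 2].index        # ids that have both 'a' and 'b'
subset = df[df['id'].isin(both_ids)]    # rows for those ids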
You can use groupby and nunique to get the count of unique payment types done.
print (df.groupby('id')['type'].agg(['nunique']))
This will give you:
   nunique
id
1        2
2        1
3        1
4        1
5        1
6        2
7        1
8        1
9        2
If you want to list out only the rows that had both a and b types:
df['count'] = df.groupby('id')['type'].transform('nunique')
print (df[df['count'] > 1])
By using groupby.transform, each row is populated with the unique count value. You can then use count > 1 to keep only the rows whose id has both a and b.
This will give you:
id payment type count
0 1 10 a 2
1 1 15 b 2
7 6 10 b 2
8 6 15 a 2
11 9 35 a 2
12 9 30 a 2
13 9 10 b 2
You may also use the length of the returned set for the given id for column 'type':
len(set(df[df['id']==1]['type'])) # returns 2
len(set(df[df['id']==2]['type'])) # returns 1
Thus, the following would give you an answer to your question
paid_both = []
for i in set(df['id']):
    if len(set(df[df['id']==i]['type'])) == 2:
        paid_both.append(i)
# paid_both = [1, 6, 9]  # the ids who paid both
Iterating through the unique id values this way returns the result for every id: if 2 is returned, that person has made payments of both types (a) and (b).
I have a dataframe as shown below
>> df
A 1
B 2
A 5
B 6
A 7
B 8
How do I reformat it to look like this?
A 1 5 7
B 2 6 8
Thanks
Given a data frame like this
df = pd.DataFrame(dict(one=list('ABABAB'), two=range(6)))
you can do
df.groupby('one').two.apply(lambda s: s.reset_index(drop=True)).unstack()
# 0 1 2
# one
# A 0 2 4
# B 1 3 5
or (slightly slower, and giving a slightly different result)
df.groupby('one').apply(lambda d: d.two.reset_index(drop=True))
# two 0 1 2
# one
# A 0 2 4
# B 1 3 5
The first approach works with a DataFrameGroupBy, the second uses a SeriesGroupBy.
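Another common idiom (a sketch, same frame as above) numbers each group's rows with cumcount and pivots on that number:
(df.assign(col=df.groupby('one').cumcount())
   .pivot(index='one', columns='col', values='two'))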
You can grab the series and use np.reshape to keep the correct dimensions.
order='F' fills column-by-column (like Fortran); order='C' fills row-by-row (like C).
Then put the result back into a dataframe:
import numpy as np
import pandas as pd
df = pd.DataFrame(data=np.arange(10), columns=['a'])  # name the column 'a' so df['a'] works
data = df['a'].values.reshape((2, 5), order='F')
df = pd.DataFrame(data=data, index=['a', 'b'])
How did you generate this data frame? I think it should have been generated using a dictionary, and the dataframe then built from that dict:
import pandas as pd
d = {'A': [1, 5, 7], 'B': [2, 6, 8]}
df = pd.DataFrame(data=d, index=['p1', 'p2', 'p3'])
and then you can use df.T to transpose your dataframe if you need to.
So I've been doing things like this with pandas:
usrdata['columnA'] = usrdata.apply(functionA, axis=1)
in order to do row operations and changing/adding columns to my dataframe.
However, now I want to try to do something like this:
usrdata['columnB', 'columnC'] = usrdata.apply(functionB, axis=1)
But the output of functionB is apparently a Series of tuples (with two values for each row). Is there a nice way for me to either:
format the output from functionB so it can readily be added to my dataframe
add (and possibly have to unpack) the output from functionB and assign each column of the output to each column of my dataframe?
Try using zip:
usrdata['columnB'], usrdata['columnC'] = zip(*usrdata.apply(functionB, axis=1))
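For example, a sketch assuming functionB returns a two-element tuple per row (the frame and function here are hypothetical):
import pandas as pd
usrdata = pd.DataFrame({'x': [1, 2, 3]})
def functionB(row):
    # hypothetical: derive two values from each row
    return row['x'] * 2, row['x'] ** 2
usrdata['columnB'], usrdata['columnC'] = zip(*usrdata.apply(functionB, axis=1))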
I'd assign directly to the new columns and modify the func body to return a Series constructed from a list of the data:
In [9]:
df = pd.DataFrame({'a':[1, 2, 3, 4, 5]})
df
Out[9]:
a
0 1
1 2
2 3
3 4
4 5
In [10]:
def func(x):
    return pd.Series([x*3, x*10])
df[['b','c']] = df['a'].apply(func)
df
Out[10]:
a b c
0 1 3 10
1 2 6 20
2 3 9 30
3 4 12 40
4 5 15 50