I have student data with IDs and some values, and I need to pivot the table to get a count of IDs.
Here's an example of data:
     id   name  maths  science
0  B001   john     50       60
1  B021  Kenny     89       77
2  B041  Jessi    100       89
3  B121  Annie     91       73
4  B456   Mark     45       33
pivot table:
count of ID
5
There are lots of different ways to approach this; I would use either shape or nunique(), as Sandeep suggested.
import pandas as pd

data = {'id': ['0', '1', '2', '3', '4'],
        'name': ['john', 'kenny', 'jessi', 'Annie', 'Mark'],
        'math': [50, 89, 100, 91, 45],
        'science': [60, 77, 89, 73, 33]}
df = pd.DataFrame(data)
print(df)
id name math science
0 0 john 50 60
1 1 kenny 89 77
2 2 jessi 100 89
3 3 Annie 91 73
4 4 Mark 45 33
Then use either of the following:
df.shape (an attribute, not a method) returns a (rows, columns) tuple, so df.shape[0] gives the number of rows in the dataframe.
or
In : df['id'].nunique()
Out: 5
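For completeness, a quick sketch of what each option returns on the frame above:

print(df.shape)            # (5, 4) -> (rows, columns); note it's an attribute, not a method
print(df.shape[0])         # 5 -> number of rows
print(len(df))             # 5 -> also the number of rows
print(df['id'].nunique())  # 5 -> number of distinct ids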
I have a dataframe df0 which contains 100 entries. Is there a way to generate a subset dataframe df1 with 20 entries that appear in a random order relative to df0 every time we print df1?
Example -
df0 =
    calories  duration   name
0        420        50    Ana
1        380        40   Mike
:          :         :      :
99       390        45  James
print(df1)  # first time
    calories  duration   name
0        420        50    Ana
1        230        10    Joe
:          :         :      :
49       380        42    Eli
print(df1)  # second time
    calories  duration   name
0        620        36  Megan
1        390        45  James
:          :         :      :
49       430        42   Rick
and so on...
The number of columns remains the same, and all of the values appearing in the df1 subsets are present in df0.
Try this:
df1 = df0.sample(frac=1).head(20)
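A note on the design choice: sample(frac=1) shuffles the whole frame and head(20) keeps the first 20 rows; df0.sample(n=20) draws 20 random rows directly. A minimal sketch with stand-in data (the names and values here are made up):

import pandas as pd
import numpy as np

# toy stand-in for the 100-row df0 from the question
df0 = pd.DataFrame({'calories': np.random.randint(200, 700, 100),
                    'duration': np.random.randint(10, 60, 100),
                    'name': [f'person_{i}' for i in range(100)]})

df1 = df0.sample(n=20)  # a fresh random 20-row subset on each call
print(df1)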
I would like to number repeated room occurrences with the same ID. My dataframe has two columns ('ID' and 'Room'), and I want to append a number to each Room according to how often it has occurred in the 'Room' column for a single ID. Below are the original df and the desired df.
Example: ID34 has 3 bedrooms so I want the first to be -> bedroom_1, the second -> bedroom_2 and the third -> bedroom_3.
original df:
ID Room
34 Livingroom
34 Bedroom
34 Kitchen
34 Bedroom
34 Bedroom
34 Storage
50 Kitchen
50 Kitchen
89 Livingroom
89 Bedroom
89 Bedroom
98 Livingroom
Desired df:
ID Room
34 Livingroom_1
34 Bedroom_1
34 Kitchen_1
34 Bedroom_2
34 Bedroom_3
34 Storage_1
50 Kitchen_1
50 Kitchen_2
89 Livingroom_1
89 Bedroom_1
89 Bedroom_2
98 Livingroom_1
Tried code:
import pandas as pd
import numpy as np

data = pd.DataFrame({"ID": [34, 34, 34, 34, 34, 34, 50, 50, 89, 89, 89, 98],
                     "Room": ['Livingroom', 'Bedroom', 'Kitchen', 'Bedroom', 'Bedroom', 'Storage',
                              'Kitchen', 'Kitchen', 'Livingroom', 'Bedroom', 'Bedroom', 'Livingroom']})
df = pd.DataFrame(columns=['ID'])
for i in range(df['Room'].nunique()):
    df_new = (df[df['Room'] == ])
    df_new.columns = ['ID', 'Room' + str(i)]
    df_result = df_result.merge(df_new, on='ID', how='outer')
Let's try concatenating Room with the cumcount of the dataframe grouped by ID and Room, as follows:
df = df.assign(Room=df.Room + "_" + (df.groupby(['ID', 'Room']).cumcount() + 1).astype(str))
ID Room
0 34 Livingroom_1
1 34 Bedroom_1
2 34 Kitchen_1
3 34 Bedroom_2
4 34 Bedroom_3
5 34 Storage_1
6 50 Kitchen_1
7 50 Kitchen_2
8 89 Livingroom_1
9 89 Bedroom_1
10 89 Bedroom_2
11 98 Livingroom_1
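To see what drives the suffix, the intermediate cumcount can be inspected on its own (a quick sketch using the data frame from the question):

# 0-based running count of each (ID, Room) pair, in row order
print(data.groupby(['ID', 'Room']).cumcount().tolist())
# [0, 0, 0, 1, 2, 0, 0, 1, 0, 0, 1, 0]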
An alternative uses the inflect library to build ordinal suffixes (I copied this approach from https://stackoverflow.com/a/59951701/3756587, adjusting the groupby to include ID so the numbering restarts per ID):

import inflect

p = inflect.engine()
df['Room'] += df.groupby(['ID', 'Room']).cumcount().add(1).map(p.ordinal).radd('_')
print(df)
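Note that p.ordinal(1) returns '1st', so this variant yields names like Bedroom_1st and Bedroom_2nd rather than the numeric Bedroom_1 style in the desired output.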
Here is some code that can do that for you. I'm basically breaking it down into three steps:
Perform a groupby-apply to run a custom function on each group. This lets you generate new names for each (ID, Room) pair.
Reduce the MultiIndex to the original index. Because we group on two columns, the index is now a hierarchical pairing of those columns; we discard the Room level because we want to use our new names.
Perform an explode on each entry. For simplicity, the apply result is computed as an array; a subsequent explode then gives each element of the array its own row.
def f(rooms_col):
    # number each occurrence within the group: "name_1", "name_2", ...
    arr = np.empty(len(rooms_col), dtype=object)
    for i, name in enumerate(rooms_col):
        arr[i] = name + f"_{i + 1}"
    return arr
# assuming data is the data from above
tmp_df = data.groupby(["ID", "Room"])["Room"].apply(f)
# Drop the old room name
tmp_df.index = tmp_df.index.droplevel(1)
# Explode the results array -> 1 row per entry
df = tmp_df.explode()
print(df)
Here is your output:
ID
34 Bedroom_1
34 Bedroom_2
34 Bedroom_3
34 Kitchen_1
34 Livingroom_1
34 Storage_1
50 Kitchen_1
50 Kitchen_2
89 Bedroom_1
89 Bedroom_2
89 Livingroom_1
98 Livingroom_1
Name: Room, dtype: object
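If you want ID back as a regular column rather than the index, a small follow-up (not part of the original answer) is:

df = tmp_df.explode().reset_index()  # columns: ID, Room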
This question already has answers here:
Change one value based on another value in pandas
(7 answers)
Closed 2 years ago.
I have a sample dataset here. In the real case, there are a train and a test dataset; both have around 300 columns and 800 rows. I want to filter rows based on a certain value in one column and then set all values in those rows from, e.g., column 3 to column 50 to zero. How can I do it?
Sample dataset:
import pandas as pd
data = {'Name': ['Jai', 'Princi', 'Gaurav', 'Princi', 'Anuj', 'Nancy'],
        'Age': [27, 24, 22, 32, 66, 43],
        'Address': ['Delhi', 'Kanpur', 'Allahabad', 'Kannauj', 'Katauj', 'vbinauj'],
        'Payment': [15, 20, 40, 50, 3, 23],
        'Qualification': ['Msc', 'MA', 'MCA', 'Phd', 'MA', 'MS']}
df = pd.DataFrame(data)
df
Here is the output of sample dataset:
Name Age Address Payment Qualification
0 Jai 27 Delhi 15 Msc
1 Princi 24 Kanpur 20 MA
2 Gaurav 22 Allahabad 40 MCA
3 Princi 32 Kannauj 50 Phd
4 Anuj 66 Katauj 3 MA
5 Nancy 43 vbinauj 23 MS
As you can see, the Name column contains the value "Princi" more than once. If I find rows whose Name value == "Princi", I want to set the "Address" and "Payment" columns in those rows to zero.
Here is the expected output:
Name Age Address Payment Qualification
0 Jai 27 Delhi 15 Msc
1 Princi 24 0 0 MA #this row
2 Gaurav 22 Allahabad 40 MCA
3 Princi 32 0 0 Phd #this row
4 Anuj 66 Katauj 3 MA
5 Nancy 43 vbinauj 23 MS
In my real dataset, I tried:
train.loc[:, 'got':'tod']  # I could select all those columns
and
train.loc[train['column_wanted'] == "that value"]  # I got all those rows
But how can I combine them? Thanks for your help!
Use the loc accessor: df.loc[boolean selection, columns]
df.loc[df['Name'].eq('Princi'), 'Address':'Payment'] = 0
Name Age Address Payment Qualification
0 Jai 27 Delhi 15 Msc
1 Princi 24 0 0 MA
2 Gaurav 22 Allahabad 40 MCA
3 Princi 32 0 0 Phd
4 Anuj 66 Katauj 3 MA
5 Nancy 43 vbinauj 23 MS
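Applied to the real dataset from the question, the same pattern combines the row mask and the column slice in one statement (column and value names taken from the question):

train.loc[train['column_wanted'] == "that value", 'got':'tod'] = 0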
A question about choosing values based on two dataframes.
>>> df[['age','name']]
age name
0 44 Anna
1 22 Bob
2 33 Cindy
3 44 Danis
4 55 Cindy
5 66 Danis
6 11 Anna
7 43 Bob
8 12 Cindy
9 19 Danis
10 11 Anna
11 32 Anna
12 55 Anna
13 33 Anna
14 32 Anna
>>> df2[['age','name']]
age name
5 66 Danis
4 55 Cindy
0 44 Anna
7 43 Bob
The expected result is every row of df whose 'age' is higher than the 'age' in df2 for the same 'name'.
Expected result:
age name
12 55 Anna
Per the comments, use merge and then filter the dataframe (note that suffixes takes an ordered sequence such as a tuple, not a set, since sets have no defined order):
df.merge(df2, on='name', suffixes=('', '_y')).query('age > age_y')[['name', 'age']]
Output:
name age
4 Anna 55
IIUC, you can use this to find the max age of all names:
pd.concat([df,df2]).groupby('name')['age'].max()
Output:
name
Anna 55
Bob 43
Cindy 55
Danis 66
Name: age, dtype: int64
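If you also want the full rows that attain these per-name maxima, one follow-up sketch (not part of the original answer) is:

# keep, for each name, the row with the highest age
pd.concat([df, df2]).sort_values('age').drop_duplicates('name', keep='last')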
Try this:
# look up each df row's matching age in df2 by name, then keep the rows of df that exceed it
age = df['name'].map(df2.set_index('name')['age'])
index = df[df['age'] > age].index
df.loc[index]
There are a few edge cases you don't mention how you would like to resolve, but generally what you want to do is iterate down the df, compare the ages, and use the larger. You could do so in the following manner:
df3 = pd.DataFrame(columns=['age', 'name'])
for x in range(len(df)):
    if df['age'][x] > df2['age'][x]:
        df3.loc[x] = [df['age'][x], df['name'][x]]
    else:
        df3.loc[x] = [df2['age'][x], df2['name'][x]]
Although you will need to modify this to reflect how you want to resolve names that are only in one list, or if the lists are of different sizes.
One solution that comes to mind is a merge followed by a query (the helper column is then dropped by selecting only age and name):
df.merge(df2, on='name', suffixes=('', '_y')).query('age.gt(age_y)', engine='python')[['age', 'name']]
Out[175]:
age name
4 55 Anna
This question already has an answer here:
Convert pandas.groupby to dict
(1 answer)
Closed 4 years ago.
Given a table (/DataFrame) x:
name day earnings revenue
Oliver 1 100 44
Oliver 2 200 69
John 1 144 11
John 2 415 54
John 3 33 10
John 4 82 82
Is it possible to split the table into two tables based on the name column (which acts as an index), and nest the two tables under the same object (not sure about the exact terms to use)? So in the example above, tables[0] will be:
name day earnings revenue
Oliver 1 100 44
Oliver 2 200 69
and tables[1] will be:
name day earnings revenue
John 1 144 11
John 2 415 54
John 3 33 10
John 4 82 82
Note that that the number of rows in each 'sub-table' may vary.
Cheers,
Create dictionary of DataFrames:
dfs = dict(tuple(df.groupby('name')))
And then select by keys - value of name column:
print (dfs['Oliver'])
print (dfs['John'])
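If you want positional access as in the question (tables[0], tables[1]), a list of the groups works too; a minimal sketch:

# one sub-DataFrame per name; note groupby sorts keys, so John comes before Oliver
tables = [group for _, group in df.groupby('name')]
print(tables[0])  # John's rows
print(tables[1])  # Oliver's rows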