I am trying to get all unique values in a DataFrame.
This is the code:
for i in df.columns.tolist():
    print(f"{i}")
    print(df[i].unique())
This is the result I am getting
customerID
['7590-VHVEG' '5575-GNVDE' '3668-QPYBK' ... '4801-JZAZL' '8361-LTMKD' '3186-AJIEK']
gender
['Female' 'Male']
SeniorCitizen
[0 1]
Partner
['Yes' 'No']
Dependents
['No' 'Yes']
tenure
[ 1 34 2 45 8 22 10 28 62 13 16 58 49 25 69 52 71 21 12 30 47 72 17 27
5 46 11 70 63 43 15 60 18 66 9 3 31 50 64 56 7 42 35 48 29 65 38 68
32 55 37 36 41 6 4 33 67 23 57 61 14 20 53 40 59 24 44 19 54 51 26 0
39]
PhoneService
['No' 'Yes']
MultipleLines
['No phone service' 'No' 'Yes']
InternetService
['DSL' 'Fiber optic' 'No']
OnlineSecurity
['No' 'Yes' 'No internet service']
OnlineBackup
['Yes' 'No' 'No internet service']
DeviceProtection
['No' 'Yes' 'No internet service']
TechSupport
['No' 'Yes' 'No internet service']
StreamingTV
['No' 'Yes' 'No internet service']
StreamingMovies
['No' 'Yes' 'No internet service']
Contract
['Month-to-month' 'One year' 'Two year']
PaperlessBilling
['Yes' 'No']
PaymentMethod
['Electronic check' 'Mailed check' 'Bank transfer (automatic)'
'Credit card (automatic)']
MonthlyCharges
[29.85 56.95 53.85 ... 63.1 44.2 78.7 ]
TotalCharges
['29.85' '1889.5' '108.15' ... '346.45' '306.6' '6844.5']
Churn
['No' 'Yes']
Why is it skipping most values in MonthlyCharges and TotalCharges, and how do I deal with it?
Thank you
df[i].unique() returns a NumPy array, and NumPy truncates the printed output of long arrays: once an array has more than 1000 elements (the default threshold), it shows the first few values, then an ellipsis (...), then the last few, which is what is happening here with customerID, MonthlyCharges and TotalCharges.
You could change that by executing the following:
np.set_printoptions(threshold=np.inf)
I would anyhow not do that, as printing out thousands of values is time consuming and hard to read.
Alternatively you could iterate over the unique values and just print them out one by one:
for value in df[i].unique():
    print(value)
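Putting it together, a minimal sketch of the full loop (assuming df is already loaded as in the question):
for col in df.columns:
    print(col)
    for value in df[col].unique():
        print(value)
This prints every distinct value on its own line, so nothing gets truncated regardless of any print options.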
assign values to a column based on multiple columns in dataframe
I have the following code, where I am trying to assign a value to a column based on the age of the person:
conditions = [df['age']<=25,df['age']>25,df['age']>=50]
values = ['age below 25','between 25 and 50','50+']
df['age category']=np.select(conditions,values)
output -
gender name age age category
0 male A 45 between 25 and 50
1 female B 22 age below 25
2 other C 54 between 25 and 50
For the age of 54 it should assign the age category as 50+, so I have tried the following code, which shows an error:
conditions = [df['age']<=25,(df['age']>25 & df['age']<50),df['age']>=50]
values = ['age below 25','between 25 and 50','50+']
df['age category']=np.select(conditions,values)
I think we can use either where, select, or loc for this, but I'm not entirely sure. Thanks in advance.
I would use cut here:
### user defined threshold ages in order
ages = [25, 50]
### below is programmatic
labels = ([f'age below {ages[0]}']
          + [f'between {a} and {b}' for a, b in zip(ages, ages[1:])]
          + [f'{ages[-1]}+'])
df['age category'] = pd.cut(df['age'], bins=[0] + ages + [np.inf], labels=labels)
Output:
gender name age age category
0 male A 45 between 25 and 50
1 female B 22 age below 25
2 other C 54 50+
You can use the default parameter of np.select, and since the first condition that matches is the one selected, you can use:
conditions = [df['age'] < 25, df['age'] < 50]
values = ['age below 25', 'between 25 and 50']
df['age category'] = np.select(conditions, values, default='50+')
print(df)
# Output:
age age category
0 56 50+
1 18 age below 25
2 39 between 25 and 50
3 21 age below 25
4 13 age below 25
5 24 age below 25
6 54 50+
7 47 between 25 and 50
8 43 between 25 and 50
9 60 50+
10 65 50+
11 21 age below 25
12 53 50+
13 66 50+
14 52 50+
15 13 age below 25
16 10 age below 25
17 46 between 25 and 50
18 13 age below 25
19 57 50+
I would like to number repeated rooms that occur under the same ID. My dataframe has two columns ('ID' and 'Room'), and I want to append a number to each Room according to how often it has occurred in the 'Room' column for a single ID. Below are the original df and the desired df.
Example: ID34 has 3 bedrooms so I want the first to be -> bedroom_1, the second -> bedroom_2 and the third -> bedroom_3.
original df:
ID Room
34 Livingroom
34 Bedroom
34 Kitchen
34 Bedroom
34 Bedroom
34 Storage
50 Kitchen
50 Kitchen
89 Livingroom
89 Bedroom
89 Bedroom
98 Livingroom
Desired df:
ID Room
34 Livingroom_1
34 Bedroom_1
34 Kitchen_1
34 Bedroom_2
34 Bedroom_3
34 Storage_1
50 Kitchen_1
50 Kitchen_2
89 Livingroom_1
89 Bedroom_1
89 Bedroom_2
98 Livingroom_1
Tried code:
import pandas as pd
import numpy as np
data = pd.DataFrame({"ID": [34,34,34,34,34,34,50,50,89,89,89,98],
"Room": ['Livingroom','Bedroom','Kitchen','Bedroom','Bedroom','Storage','Kitchen','Kitchen','Livingroom','Bedroom','Bedroom','Livingroom']})
df = pd.DataFrame(columns=['ID'])
for i in range(df['Room'].nunique()):
    df_new = (df[df['Room'] == ])
    df_new.columns = ['ID', 'Room' + str(i)]
    df_result = df_result.merge(df_new, on='ID', how='outer')
Let's try concatenating Room with the cumcount of the df grouped by ID and Room, as follows:
df=df.assign(Room=df.Room+"_"+(df.groupby(['ID','Room']).cumcount()+1).astype(str))
ID Room
0 34 Livingroom_1
1 34 Bedroom_1
2 34 Kitchen_1
3 34 Bedroom_2
4 34 Bedroom_3
5 34 Storage_1
6 50 Kitchen_1
7 50 Kitchen_2
8 89 Livingroom_1
9 89 Bedroom_1
10 89 Bedroom_2
11 98 Livingroom_1
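For reference, here is what the cumcount piece does on its own, run before the assignment (assuming df holds the example data from the question):
print(df.groupby(['ID', 'Room']).cumcount().tolist())
# [0, 0, 0, 1, 2, 0, 0, 1, 0, 0, 1, 0]
cumcount numbers repeated (ID, Room) pairs starting at 0, which is why 1 is added before converting to string.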
import inflect
p = inflect.engine()
df['Room'] += df.groupby(['ID', 'Room']).cumcount().add(1).map(p.ordinal).radd('_')
print(df)
I copied this from here: https://stackoverflow.com/a/59951701/3756587
Here is some code that can do that for you. I'm basically breaking it down into three steps.
1. Perform a groupby apply to run a custom function on each group. This allows you to generate new names for each pair of (ID, Room).
2. Reduce the MultiIndex to the original index. Because we are grouping on two columns, the index is now a hierarchical grouping of the two; we discard the Room level because we want to use our new names.
3. Perform an explode on each entry. For simplicity, the apply result is computed as an array, and a subsequent explode then gives each element in the array its own row.
def f(rooms_col):
    arr = np.empty(len(rooms_col), dtype=object)
    for i, name in enumerate(rooms_col):
        arr[i] = name + f"_{i + 1}"
    return arr
# assuming data is the data from above
tmp_df = data.groupby(["ID", "Room"])["Room"].apply(f)
# Drop the old room name
tmp_df.index = tmp_df.index.droplevel(1)
# Explode the results array -> 1 row per entry
df = tmp_df.explode()
print(df)
Here is your output:
ID
34 Bedroom_1
34 Bedroom_2
34 Bedroom_3
34 Kitchen_1
34 Livingroom_1
34 Storage_1
50 Kitchen_1
50 Kitchen_2
89 Bedroom_1
89 Bedroom_2
89 Livingroom_1
98 Livingroom_1
Name: Room, dtype: object
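If you need a two-column frame like the desired output in the question, a reset_index turns the ID index back into a column (note that row order within each ID follows the groupby sort, not the original data):
print(df.reset_index())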
I have a dataframe which I named parking which has multiple columns, in this case Registration State, Violation Code, and Summons Number.
For each Registration State, I want the 3 Violation Codes with the highest row count. The best I've been able to get is:
parking_state_group = parking.groupby(['Registration State', 'Violation Code'])['Summons Number'].count()
When printed (i.e. print(parking_state_group.reset_index())), it looks like:
Registration State Violation Code Summons Number
0 99 0 14
1 99 6 1
2 99 10 6
3 99 13 2
4 99 14 75
... ... ... ...
1811 WY 37 3
1812 WY 38 4
1813 WY 40 4
1814 WY 46 1
1815 WY 68 1
This at least gets me the count of each Violation Code for each state (Summons Number is like an ID field for each row). I want this to return only the 3 violation codes for each state with the highest count, so something like:
Registration State Violation Code Summons Number
0 99 14 75
1 99 31 61
2 99 87 55
... ... ... ...
1812 WY 38 4
1813 WY 40 4
1811 WY 37 3
I've tried .nlargest() but this doesn't seem to get the largest .count(), only the largest values within a column, which isn't what I'm looking for.
Let's try:
df[['Registration State', 'Violation Code', 'Summons Number']].groupby('Registration State')['Summons Number'].nlargest(3).reset_index().rename(columns={'level_1':'Violation Code'})
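For reference, a sketch of one way to do this end to end, starting from the raw parking DataFrame described in the question (column names taken from the question; this is an alternative spelling, not the answer above):
top3 = (parking.groupby(['Registration State', 'Violation Code'])['Summons Number']
               .count()
               .reset_index()
               .sort_values('Summons Number', ascending=False)
               .groupby('Registration State')
               .head(3)
               .sort_values(['Registration State', 'Summons Number'], ascending=[True, False]))
print(top3)
This keeps the three highest counts per Registration State while preserving the Violation Code column.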
I have student data with IDs and some values, and I need to pivot the table for a count of ID.
Here's an example of data:
id name maths science
0 B001 john 50 60
1 B021 Kenny 89 77
2 B041 Jessi 100 89
3 B121 Annie 91 73
4 B456 Mark 45 33
pivot table:
count of ID
5
There are lots of different ways to approach this; I would use either shape or nunique(), as Sandeep suggested.
data = {'id' : ['0','1','2','3','4'],
'name' : ['john', 'kenny', 'jessi', 'Annie', 'Mark'],
'math' : [50,89,100,91,45],
'science' : [60,77,89,73,33]}
df = pd.DataFrame(data)
print(df)
id name math science
0 0 john 50 60
1 1 kenny 89 77
2 2 jessi 100 89
3 3 Annie 91 73
4 4 Mark 45 33
then use either of the following:
df.shape[0] (or len(df)), which gives you the number of rows in the DataFrame,
or
In:  df['id'].nunique()
Out: 5
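If you literally need the one-cell pivot-table shape shown in the question, a small sketch (the 'count of ID' label is just illustrative):
pd.DataFrame({'count of ID': [df['id'].nunique()]})
which gives a single row containing 5 for this data.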
I feel like I am missing something really simple here, can someone tell me what is wrong with this code?
I am trying to group by Sex where the Age > 30 and the Survived value = 1.
'Sex' is a boolean value (1 or 0), if that makes a difference
data_r.groupby('Sex')([data_r.Age >30],[data_r.Survived == 1]).count()
This is throwing:
"'DataFrameGroupBy' object is not callable"
Any ideas? Thanks
You need to filter first and then group by.
data_r[(data_r.Age>30) & (data_r.Survived==1)].groupby('Sex').count()
You can do your filtering before grouping.
data_r.query('Age > 30 and Survived == 1').groupby('Sex').count()
Output:
PassengerId Survived Pclass Name Age SibSp Parch Ticket Fare \
Sex
female 83 83 83 83 83 83 83 83 83
male 41 41 41 41 41 41 41 41 41
Cabin Embarked
Sex
female 47 81
male 25 41
IMHO I'd use size here; it is safer, since count does not include null (NaN) values. Notice the different values across the columns above: that is due to NaN values.
data_r.query('Age > 30 and Survived == 1').groupby('Sex').size()
Output:
Sex
female 83
male 41
dtype: int64