I have a pandas DataFrame, and I'm trying to build a new table from it by grouping the rows by country and keeping the most frequent (mode) combination of values in each group.
df
country scores attempts
india 11 6
india 12 3
india 12 3
india 12 7
india 10 3
india 12 3
pakistan 10 4
Pakistan 14 4
pakistan 14 5
srilanka 23 5
srilanka 21 5
srilanka 21 6
srilanka 23 5
srilanka 23 6
srilanka 23 5
The expected result looks like this:
country scores attempts
0 India 12 3
1 Pakistan 14 4
2 srilanka 23 5
Please help me solve this issue.
Count each (country, scores, attempts) combination with GroupBy.size first, then call idxmax on the per-country counts to get the index of the most frequent combination (the first one if there is a tie):
print (df)
country scores attempts
0 india 11 6
1 india 12 3
2 india 12 3
3 india 12 7
4 india 10 3
5 india 12 3
6 pakistan 10 4
7 pakistan 14 4
8 pakistan 14 4 <- correct data
9 srilanka 23 5
10 srilanka 21 5
11 srilanka 21 6
12 srilanka 23 5
13 srilanka 23 6
14 srilanka 23 5
df1 = df.groupby(['country','scores','attempts']).size().reset_index(name='count')
print (df1)
country scores attempts count
0 india 10 3 1
1 india 11 6 1
2 india 12 3 3
3 india 12 7 1
4 pakistan 10 4 1
5 pakistan 14 4 2
6 srilanka 21 5 1
7 srilanka 21 6 1
8 srilanka 23 5 3
9 srilanka 23 6 1
df2 = df1.loc[df1.groupby('country')['count'].idxmax()].drop('count', axis=1).reset_index(drop=True)
print (df2)
country scores attempts
0 india 12 3
1 pakistan 14 4
2 srilanka 23 5
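The two steps above can be wrapped into one self-contained sketch. The helper name `most_common_rows` and its parameters are made up for illustration; the data is the corrected sample from this answer.

```python
import pandas as pd

df = pd.DataFrame({
    "country": ["india"] * 6 + ["pakistan"] * 3 + ["srilanka"] * 6,
    "scores": [11, 12, 12, 12, 10, 12, 10, 14, 14, 23, 21, 21, 23, 23, 23],
    "attempts": [6, 3, 3, 7, 3, 3, 4, 4, 4, 5, 5, 6, 5, 6, 5],
})


def most_common_rows(frame, key, value_cols):
    """Hypothetical helper: most frequent value combination per `key`."""
    # Count how often each (key, values...) combination occurs.
    counts = frame.groupby([key] + value_cols).size().reset_index(name="count")
    # idxmax returns the row label of the first maximal count in each group.
    best = counts.groupby(key)["count"].idxmax()
    return counts.loc[best].drop(columns="count").reset_index(drop=True)


result = most_common_rows(df, "country", ["scores", "attempts"])
print(result)
```

Because idxmax keeps only the first maximal count, a country with two equally frequent combinations yields just one of them.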
How can I fill only the NaN values of column ["state"] with the values from another column ["country"]?
Like in this Pandas DataFrame:
state country sum
0 NaN China 1
1 Assam India 2
2 Odisa India 3
3 Bihar India 4
4 NaN India 5
5 NaN Srilanka 6
6 NaN Malaysia 7
7 NaN Bhutan 8
8 California US 9
9 Texas US 10
10 Newyork US 11
11 NaN US 12
12 NaN Canada 13
What code should I use to fill the state column from the country column, only where state is NaN, like this:
state country sum
0 China China 1
1 Assam India 2
2 Odisa India 3
3 Bihar India 4
4 India India 5
5 Srilanka Srilanka 6
6 Malaysia Malaysia 7
7 Bhutan Bhutan 8
8 California US 9
9 Texas US 10
10 Newyork US 11
11 US US 12
12 Canada Canada 13
I can use this code:
df.loc[df['state'].isnull(), 'state'] = df[df['state'].isnull()]['country'].replace(df['country'])
But on a very large dataset with 300K rows, it runs for 5-6 minutes and crashes every time, because it replaces one value at a time.
Can anyone help me with efficient code for this?
Please!
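Perhaps use fillna directly, without isnull() and replace(). Assigning the result back to the column avoids the chained inplace call, which recent pandas versions warn about:
df['state'] = df['state'].fillna(df['country'])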
print(df)
Output
state country sum
0 China China 1
1 Assam India 2
2 Odisa India 3
3 Bihar India 4
4 India India 5
5 Srilanka Srilanka 6
6 Malaysia Malaysia 7
7 Bhutan Bhutan 8
8 California US 9
9 Texas US 10
10 Newyork US 11
11 US US 12
12 Canada Canada 13
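A minimal self-contained sketch of the same idea on a shortened, made-up subset of the data. It also shows an equivalent spelling with `Series.where`, which is an addition for comparison and not part of the original answer.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "state": [np.nan, "Assam", np.nan, "Texas", np.nan],
    "country": ["China", "India", "India", "US", "Canada"],
    "sum": [1, 2, 5, 10, 13],
})

# Equivalent alternative: keep state where it is present, else take country.
alt = df["state"].where(df["state"].notna(), df["country"])

# fillna aligns on the index, so each NaN gets the country from the same row.
df["state"] = df["state"].fillna(df["country"])
print(df)
```

Both forms are vectorized, so they should scale to a few hundred thousand rows without the per-value replacement cost described in the question.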
The question is still not answered !!!!
Let's say that I have this dataframe:
import pandas as pd
Name = ['ID', 'Country', 'IBAN','ID_bal_amt', 'ID_bal_time','Dan_city','ID_bal_mod','Dan_country','ID_bal_type', 'ID_bal_amt', 'ID_bal_time','ID_bal_mod','ID_bal_type' ,'Dan_sex', 'Dan_Age', 'Dan_country','Dan_sex' , 'Dan_city','Dan_country','ID_bal_amt', 'ID_bal_time','ID_bal_mod','ID_bal_type' ]
Value = ['TAMARA_CO', 'GERMANY','FR56', '12','June','Berlin','OPBD', '55','CRDT','432', 'August', 'CLBD','DBT', 'M', '22', 'FRA', 'M', 'Madrid', 'ESP','432','March','FABD','CRDT']
Ccy = ['','','','EUR','EUR','','EUR','','','','EUR','EUR','USD','USD','USD','','CHF', '','DKN','','','USD','CHF']
Group = ['0','0','0','1','1','1','1','1','1','2','2','2','2','2','2','2','3','3','3','4','4','4','4']
df = pd.DataFrame({'Name':Name, 'Value' : Value, 'Ccy' : Ccy,'Group':Group})
print(df)
Name Value Ccy Group
0 ID TAMARA_CO 0
1 Country GERMANY 0
2 IBAN FR56 0
3 ID_bal_amt 12 EUR 1
4 ID_bal_time June EUR 1
5 Dan_city Berlin 1
6 ID_bal_mod OPBD EUR 1
7 Dan_country 55 1
8 ID_bal_type CRDT 1
9 ID_bal_amt 432 2
10 ID_bal_time August EUR 2
11 ID_bal_mod CLBD EUR 2
12 ID_bal_type DBT USD 2
13 Dan_sex M USD 2
14 Dan_Age 22 USD 2
15 Dan_country FRA 2
16 Dan_sex M CHF 3
17 Dan_city Madrid 3
18 Dan_country ESP DKN 3
19 ID_bal_amt 432 4
20 ID_bal_time March 4
21 ID_bal_mod FABD USD 4
22 ID_bal_type CRDT CHF 4
I want to reduce this dataframe. Only the rows whose name contains the string "bal" should be reduced: among them, I want to keep the group of rows associated with the mode "CLBD". That means I look for the value "CLBD" under the name "ID_bal_mod" and then keep all the other names (ID_bal_amt, ID_bal_time, ID_bal_mod, ID_bal_type) that are in the same group. In our example, these are the names in group 2.
In addition, I want to change their value in the column "Group" to 0.
At the end I would like to get this new dataframe, with the index reset as well:
Name Value Ccy Group
0 ID TAMARA_CO 0
1 Country GERMANY 0
2 IBAN FR56 0
3 Dan_city Berlin 1
4 Dan_country 55 1
5 ID_bal_amt 432 0
6 ID_bal_time August EUR 0
7 ID_bal_mod CLBD EUR 0
8 ID_bal_type DBT USD 0
9 Dan_sex M USD 2
10 Dan_Age 22 USD 2
11 Dan_country FRA 2
12 Dan_sex M CHF 3
13 Dan_city Madrid 3
14 Dan_country ESP DKN 3
Does anyone have an efficient idea?
Thank you.
Let's try your logic:
rows_with_bal = df['Name'].str.contains('bal')
groups_with_CLBD = ((rows_with_bal & df['Value'].eq('CLBD'))
.groupby(df['Group']).transform('any')
)
# set `Group` to '0' (a string, like the rest of the column) only for the 'bal' rows in `groups_with_CLBD`
df.loc[rows_with_bal & groups_with_CLBD, 'Group'] = '0'
# keep the rows without bal, plus every row in `groups_with_CLBD`, and reset the index
df = df.loc[(~rows_with_bal) | groups_with_CLBD].reset_index(drop=True)
Output:
Name Value Ccy Group
0 ID TAMARA_CO 0
1 Country GERMANY 0
2 IBAN FR56 0
3 Dan_city Berlin 1
4 Dan_country 55 1
5 ID_bal_amt 432 0
6 ID_bal_time August EUR 0
7 ID_bal_mod CLBD EUR 0
8 ID_bal_type DBT USD 0
9 Dan_sex M USD 2
10 Dan_Age 22 USD 2
11 Dan_country FRA 2
12 Dan_sex M CHF 3
13 Dan_city Madrid 3
14 Dan_country ESP DKN 3
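The key step is `groupby(...).transform('any')`, which broadcasts "this group contains a CLBD row" back to every row of the group. Here is a self-contained sketch on a smaller, made-up frame; the variable names are illustrative, and the `Group` reset is restricted to the "bal" rows, which is what the expected output in the question shows.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["ID", "ID_bal_mod", "ID_bal_amt", "Dan_city",
             "ID_bal_mod", "ID_bal_amt", "Dan_sex"],
    "Value": ["TAMARA_CO", "OPBD", "12", "Berlin", "CLBD", "432", "M"],
    "Group": ["0", "1", "1", "1", "2", "2", "2"],
})

is_bal = df["Name"].str.contains("bal")
is_target = is_bal & df["Value"].eq("CLBD")
# True for every row whose group holds at least one CLBD row.
in_target_group = is_target.groupby(df["Group"]).transform("any")

# Only the 'bal' rows of the CLBD group are kept and moved to group '0'.
selected_bal = is_bal & in_target_group
df.loc[selected_bal, "Group"] = "0"

# All non-'bal' rows stay untouched; the other 'bal' rows are dropped.
out = df[~is_bal | selected_bal].reset_index(drop=True)
print(out)
```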
I hope you can help me with this.
The df looks like this:
region AMER
country Brazil Canada Columbia Mexico United States
metro Rio de Janeiro Sao Paulo Toronto Bogota Mexico City Monterrey Atlanta Boston Chicago Culpeper Dallas Denver Houston Los Angeles Miami New York Philadelphia Seattle Silicon Valley Washington D.C.
ID
321321 2 1 1 13 15 29 1 2 1 11 6 15 3 2 14 3
23213 3
231 2 2 3 1 5 6 3 3 4 3 3 4
23213 4 1 1 1 4 1 2 27 1
21321 4 2 2 1 14 3 2 4 2
12321 1 2 1 1 1 1 10
123213 2 45 5 1
12321 1
123 1 3 2
For every row (ID/index), I want to count the columns that contain data, per metro and country within each region, and store that count in a new column.
Regards,
RJ
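You may want to try the following. Note that it sums the values rather than counting the non-empty cells, and that the level argument of sum only exists in pandas versions before 2.0:
df['new'] = df.sum(level=0, axis=1)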
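If the goal is to count the filled cells rather than sum them, one possible sketch is below. It assumes the empty cells are NaN and that the columns form a three-level MultiIndex named region/country/metro; the tiny frame is made up, since the original data cannot be fully recovered from the printout.

```python
import numpy as np
import pandas as pd

cols = pd.MultiIndex.from_tuples(
    [
        ("AMER", "Brazil", "Rio de Janeiro"),
        ("AMER", "Brazil", "Sao Paulo"),
        ("AMER", "Canada", "Toronto"),
    ],
    names=["region", "country", "metro"],
)
df = pd.DataFrame(
    [[2, 1, np.nan], [np.nan, np.nan, 3], [np.nan, np.nan, np.nan]],
    index=pd.Index([321321, 23213, 231], name="ID"),
    columns=cols,
)

# 1 where a cell holds data, 0 where it is NaN.
filled = df.notna().astype(int)

# Number of filled metro columns per country, for every ID.
counts_per_country = filled.T.groupby(level="country").sum().T

# Number of filled columns overall, for every ID.
total_filled = filled.sum(axis=1)
print(counts_per_country)
print(total_filled)
```

Grouping the transposed frame by a column level works on both older and newer pandas versions, unlike `sum(level=...)`.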
Having grouped data, I want to drop from the results groups that contain only a single observation with the value below a certain threshold.
Initial data:
df = pd.DataFrame(data={'Province' : ['ON','QC','BC','AL','AL','MN','ON'],
'City' :['Toronto','Montreal','Vancouver','Calgary','Edmonton','Winnipeg','Windsor'],
'Sales' : [13,6,16,8,4,3,1]})
City Province Sales
0 Toronto ON 13
1 Montreal QC 6
2 Vancouver BC 16
3 Calgary AL 8
4 Edmonton AL 4
5 Winnipeg MN 3
6 Windsor ON 1
Now grouping the data:
df.groupby(['Province', 'City']).sum()
Sales
Province City
AL Calgary 8
Edmonton 4
BC Vancouver 16
MN Winnipeg 3
ON Toronto 13
Windsor 1
QC Montreal 6
Now the part I can't figure out is how to drop provinces with only one city (or generally N observations) and total sales less than 10. The expected output should be:
Sales
Province City
AL Calgary 8
Edmonton 4
BC Vancouver 16
ON Toronto 13
Windsor 1
I.e. MN/Winnipeg and QC/Montreal are gone from the results. Ideally, they won't be completely gone but combined into a new group called 'Other', but this may be material for another question.
You can do it this way:
In [188]: df
Out[188]:
City Province Sales
0 Toronto ON 13
1 Montreal QC 6
2 Vancouver BC 16
3 Calgary AL 8
4 Edmonton AL 4
5 Winnipeg MN 3
6 Windsor ON 1
In [189]: g = df.groupby(['Province', 'City']).sum().reset_index()
In [190]: g
Out[190]:
Province City Sales
0 AL Calgary 8
1 AL Edmonton 4
2 BC Vancouver 16
3 MN Winnipeg 3
4 ON Toronto 13
5 ON Windsor 1
6 QC Montreal 6
Now we will create a mask for those 'provinces with more than one city':
In [191]: mask = g.groupby('Province').City.transform('count') > 1
In [192]: mask
Out[192]:
0 True
1 True
2 False
3 False
4 True
5 True
6 False
dtype: bool
Then keep the rows that either belong to a multi-city province or have total sales greater than or equal to 10:
In [193]: g[(mask) | (g.Sales >= 10)]
Out[193]:
Province City Sales
0 AL Calgary 8
1 AL Edmonton 4
2 BC Vancouver 16
4 ON Toronto 13
5 ON Windsor 1
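The mask approach above can be collected into one self-contained sketch. The function name `drop_small_groups` and the parameters `min_cities` and `min_sales` are made up for illustration; the sales test here uses the province total, which equals the city value whenever the province has a single city.

```python
import pandas as pd

df = pd.DataFrame({
    "Province": ["ON", "QC", "BC", "AL", "AL", "MN", "ON"],
    "City": ["Toronto", "Montreal", "Vancouver", "Calgary",
             "Edmonton", "Winnipeg", "Windsor"],
    "Sales": [13, 6, 16, 8, 4, 3, 1],
})


def drop_small_groups(frame, min_cities=2, min_sales=10):
    """Hypothetical helper: drop provinces with few cities AND low sales."""
    g = frame.groupby(["Province", "City"], as_index=False)["Sales"].sum()
    # Broadcast per-province statistics back onto every city row.
    n_cities = g.groupby("Province")["City"].transform("count")
    province_sales = g.groupby("Province")["Sales"].transform("sum")
    keep = (n_cities >= min_cities) | (province_sales >= min_sales)
    return g[keep].set_index(["Province", "City"])


kept = drop_small_groups(df)
print(kept)
```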
I wasn't satisfied with any of the answers given, so I kept chipping at this until I figured out the following solution:
In [72]: df
Out[72]:
City Province Sales
0 Toronto ON 13
1 Montreal QC 6
2 Vancouver BC 16
3 Calgary AL 8
4 Edmonton AL 4
5 Winnipeg MN 3
6 Windsor ON 1
In [73]: df.groupby(['Province', 'City']).sum().groupby(level=0).filter(lambda x: len(x) > 1 or x.Sales.sum() >= 10)
Out[73]:
Sales
Province City
AL Calgary 8
Edmonton 4
BC Vancouver 16
ON Toronto 13
Windsor 1
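The question also mentions folding the dropped provinces into a group called 'Other' instead of removing them. A rough sketch of that variant, assuming the same thresholds (fewer than 2 cities and sales below 10) and relabeling only the Province column:

```python
import pandas as pd

df = pd.DataFrame({
    "Province": ["ON", "QC", "BC", "AL", "AL", "MN", "ON"],
    "City": ["Toronto", "Montreal", "Vancouver", "Calgary",
             "Edmonton", "Winnipeg", "Windsor"],
    "Sales": [13, 6, 16, 8, 4, 3, 1],
})

g = df.groupby(["Province", "City"], as_index=False)["Sales"].sum()
n_cities = g.groupby("Province")["City"].transform("count")

# Rows that would otherwise be dropped: single-city province, sales below 10.
small = (n_cities < 2) & (g["Sales"] < 10)
g.loc[small, "Province"] = "Other"

other_total = g.loc[g["Province"] == "Other", "Sales"].sum()
combined = g.set_index(["Province", "City"]).sort_index()
print(combined)
```

The city names are kept under 'Other' here; summing them into a single row would need one more groupby on Province.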