I have a csv like this:
Country Values Address
USA 1 AnyAddress
USA 2 AnyAddress
Brazil 1 AnyAddress
UK 3 AnyAddress
Australia 0 AnyAddress
Australia 0 AnyAddress
I need to group the data by Country and sum Values, then return a string with the country that has the maximum summed value. When two countries tie, the lexicographically greater one wins; in this case USA is lexicographically greater than UK, so the output should be:
"Country: USA, Value: 3"
When I use groupby in pandas I am not able to build the string with the country name and value. How can I do that?
try:
max_values = df.groupby('Country')['Values'].sum().reset_index().max().values
your_string = f"Country: {max_values[0]}, Value: {max_values[1]}"
Output:
>>> print(your_string)
Country: USA, Value: 3
Note that column-wise .max() takes the maximum of each column independently, so in general it can pair a country with a sum that belongs to a different country; it happens to give the right answer on this data.
You can do:
df.groupby("Country", as_index=False)["Values"].sum()\
.sort_values(["Values", "Country"], ascending=False).iloc[0]
Outputs:
Country USA
Values 3
Name: 3, dtype: object
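For reference, the sort-based answer can be run end-to-end as a self-contained sketch (data and column names taken from the question):

```python
import pandas as pd

df = pd.DataFrame({
    "Country": ["USA", "USA", "Brazil", "UK", "Australia", "Australia"],
    "Values": [1, 2, 1, 3, 0, 0],
    "Address": ["AnyAddress"] * 6,
})

# Sum Values per country, then sort by the sum (and by Country, to break
# ties lexicographically) in descending order; the first row is the winner.
top = (df.groupby("Country", as_index=False)["Values"].sum()
         .sort_values(["Values", "Country"], ascending=False)
         .iloc[0])
result = f"Country: {top['Country']}, Value: {top['Values']}"
print(result)  # Country: USA, Value: 3
```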
Given the following data ...
city country
0 London UK
1 Paris FR
2 Paris US
3 London UK
... I'd like a count of each city-country pair
city country n
0 London UK 2
1 Paris FR 1
2 Paris US 1
The following works but feels like a hack:
import pandas as pd

df = pd.DataFrame([('London', 'UK'), ('Paris', 'FR'), ('Paris', 'US'), ('London', 'UK')],
                  columns=['city', 'country'])
df.assign(**{'n': 1}).groupby(['city', 'country']).count().reset_index()
I'm assigning an additional column n of all 1s, grouping on city&country, and then count()ing occurrences of this new 'all 1s' column. It works, but adding a column just to count it feels wrong.
Is there a cleaner solution?
There is a better way: use DataFrame.value_counts (available since pandas 1.1):
df.value_counts().reset_index(name='n')
city country n
0 London UK 2
1 Paris FR 1
2 Paris US 1
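If you're on a pandas version older than 1.1 (before DataFrame.value_counts existed), an equivalent spelling uses groupby(...).size():

```python
import pandas as pd

df = pd.DataFrame([('London', 'UK'), ('Paris', 'FR'), ('Paris', 'US'), ('London', 'UK')],
                  columns=['city', 'country'])

# size() counts rows per group directly, so no dummy column is needed
out = df.groupby(['city', 'country']).size().reset_index(name='n')
print(out)
```

The only difference from value_counts is the ordering: size() sorts by the group keys rather than by the count.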
I have a table. There are numbers in the column 'Para'. I have to find the index of a particular sentence from the column 'Country_Title', such that the value in column 'Para' is 2.
Main DataFrame 'df_countries' is shown below:
Index  Sequence  Para  Country_Title
0      5         4     India is seventh largest country
1      6         6     Australia is a continent country
2      7         2     Canada is the 2nd largest country
3      9         3     UAE is a country in Western Asia
4      10        2     China is a country in East Asia
5      11        1     Germany is in Central Europe
6      13        2     Russia is the largest country
7      14        3     Capital city of China is Beijing
Suppose my keyword is China, and I want to get the index of the sentence containing 'China', but only the one where 'Para' = 2.
Consider the rows at index 4 and 7; both mention China in their Country_Title. But I want the index of the one with 'Para' = 2, i.e., the result must be index = 4.
My Approach:
I derived another DataFrame 'df_para2_countries' from above table as shown below:
Index  Para  Country_Title
2      2     Canada is the 2nd largest country
4      2     China is a country in East Asia
6      2     Russia is the largest country
Now I store the country title as:
c = list(df_para2_countries['Country_Title'])
I used a for loop to iterate over the elements of 'c' and find the index of a particular country in the table 'df_para2_countries':
for i in c:
    if 'China' in i:
        print(i)
        ind = df_para2_countries.loc[df_para2_countries['Country_Title'] = i]
        print(ind)
The line assigning 'ind' gives an error. I want to get the index, but this doesn't work.
Please suggest how I can approach this.
You need two equals signs in your condition: == is comparison, while a single = is assignment.
If you need only the 'index', that is, the value from your first column called Index, you can take the index of the rows returned by .loc, convert it to a list, and get the first value, for instance:
ind = df_para2_countries.loc[df_para2_countries['Country_Title'] == i].index.to_list()[0]
Hope it works :)
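Alternatively, assuming the original df_countries is available, the loop and the derived table can be skipped entirely with one boolean mask (a sketch; the frame below is rebuilt from the question's table):

```python
import pandas as pd

df_countries = pd.DataFrame({
    'Sequence': [5, 6, 7, 9, 10, 11, 13, 14],
    'Para': [4, 6, 2, 3, 2, 1, 2, 3],
    'Country_Title': [
        'India is seventh largest country',
        'Australia is a continent country',
        'Canada is the 2nd largest country',
        'UAE is a country in Western Asia',
        'China is a country in East Asia',
        'Germany is in Central Europe',
        'Russia is the largest country',
        'Capital city of China is Beijing',
    ],
})

keyword = 'China'
# Combine both conditions: Para must be 2 AND the title must mention the keyword
mask = (df_countries['Para'] == 2) & df_countries['Country_Title'].str.contains(keyword)
ind = df_countries.index[mask][0]  # first matching index
print(ind)  # 4
```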
I would like to assign a binary value (1 or 0) to each row depending on whether a column contains a non-empty or an empty list.
For example:
Country Test
Germany []
Italy ['pizza']
United Kingdom ['queen', 'king','big']
France ['Eiffel']
Spain []
...
What I would expect is something like this:
Country Test Binary
Germany [] 0
Italy ['pizza'] 1
United Kingdom ['queen', 'king','big'] 1
France ['Eiffel'] 1
Spain [] 0
...
I do not know how to use np.where or another method to get these results.
I think that to check whether a column contains an empty list I should do something like this: df[df['Test'] != '[]']
You can do a simple check on the length and convert the result to 0 or 1:
df['Binary'] = (df['Test'].str.len() != 0).astype(int)
While this is good, the most efficient way to do it was provided by @Marat:
df['Binary'] = df['Test'].astype(bool).astype(int)
The full code is here:
import pandas as pd

c = ['Country', 'Test']
d = [['Germany', []],
     ['Italy', ['pizza']],
     ['United Kingdom', ['queen', 'king', 'big']],
     ['France', ['Eiffel']],
     ['Spain', []]]
df = pd.DataFrame(data=d, columns=c)
df['Binary'] = df['Test'].astype(bool).astype(int)
print(df)
The output of this will be:
Country Test Binary
0 Germany [] 0
1 Italy [pizza] 1
2 United Kingdom [queen, king, big] 1
3 France [Eiffel] 1
4 Spain [] 0
Use str.len:
np.clip(df.Test.str.len(), 0, 1)
#or (note != 0, so that empty lists map to 0)
(df.Test.str.len() != 0).astype(int)
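All of the suggested approaches produce the same 0/1 column; a quick sanity check on the question's sample data (a sketch):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'Country': ['Germany', 'Italy', 'United Kingdom', 'France', 'Spain'],
    'Test': [[], ['pizza'], ['queen', 'king', 'big'], ['Eiffel'], []],
})

a = df['Test'].astype(bool).astype(int)      # empty list is falsy
b = (df['Test'].str.len() != 0).astype(int)  # .str.len also works on lists
c = np.clip(df['Test'].str.len(), 0, 1)      # clamp counts into {0, 1}

# All three give the same Binary column
print(a.tolist())  # [0, 1, 1, 1, 0]
```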
I have a dataframe
name country gender
Ada US 1
Aby UK 0
Alan US 0
Eli US 1
Eddy US 1
Bing NW 0
Bing US 1
Eli UK 0
Eli US 0
Alan US 1
Ada UK 0
Some names are assigned different genders and countries. E.g. Eli has US with 1 and also UK with 0.
I have used
df.groupby('name')['gender']
df.groupby('name')['country']
After the groupby, I am hoping to return the "gender" and "country" with the highest frequency. For example, if Eli has two US and one UK, then the country should be US. Same rule applies to gender.
For gender I used a > 0.5 rule:
df = df_inv.groupby('name')['gender'].mean()
df = df.reset_index()
df['gender'] = (df['gender'] >= 0.5).astype(int)
Is there an easier way to write this code? Also, is there any solution for a categorical variable like country?
You should group by two properties (name and country/gender), build a table, and choose the column with the maximum value in each row:
df.groupby(['name','country']).size().unstack().idxmax(1)
#name
#Aby UK
#Ada UK
#Alan US
#Bing NW
#Eddy US
#Eli US
df.groupby(['name','gender']).size().unstack().idxmax(1)
#name
#Aby 0
#Ada 0
#Alan 0
#Bing 0
#Eddy 1
#Eli 0
You can later join the results if you want.
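For instance, the two results can be joined with pd.concat (a sketch on the question's data):

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Ada', 'Aby', 'Alan', 'Eli', 'Eddy', 'Bing', 'Bing', 'Eli', 'Eli', 'Alan', 'Ada'],
    'country': ['US', 'UK', 'US', 'US', 'US', 'NW', 'US', 'UK', 'US', 'US', 'UK'],
    'gender': [1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0],
})

# Count name/value pairs, pivot to a table, take the most frequent column per row
country = df.groupby(['name', 'country']).size().unstack().idxmax(1)
gender = df.groupby(['name', 'gender']).size().unstack().idxmax(1)

# Join the two per-name results into one frame
out = pd.concat([country.rename('country'), gender.rename('gender')], axis=1)
print(out)
```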
We can do groupby with the function mode via agg:
df = df.groupby('name').agg({'country':lambda x : x.mode()[0],'gender':lambda x : int(x.mean()>0.5)})
Out[154]:
country gender
name
Aby UK 0
Ada UK 0
Alan US 0
Bing NW 0
Eddy US 1
Eli US 0
Looks like this will do the work... please check and confirm. (Note: groupby(...).max() would pick the largest value rather than the most frequent one, so the mode is taken per group.)
a = df.groupby('name')['gender'].agg(lambda s: s.mode()[0]).to_frame().reset_index()
b = df.groupby('name')['country'].agg(lambda s: s.mode()[0]).to_frame().reset_index()
df = b
df['gender'] = a['gender']
del a, b
I'm looking to delete rows of a DataFrame if the value in a particular column occurs only once in total.
Example of raw table (values are arbitrary for illustrative purposes):
print(df)
Country Series Value
0 Bolivia Population 123
1 Kenya Population 1234
2 Ukraine Population 12345
3 US Population 123456
5 Bolivia GDP 23456
6 Kenya GDP 234567
7 Ukraine GDP 2345678
8 US GDP 23456789
9 Bolivia #McDonalds 3456
10 Kenya #Schools 3455
11 Ukraine #Cars 3456
12 US #Tshirts 3456789
Intended outcome:
print(df)
Country Series Value
0 Bolivia Population 123
1 Kenya Population 1234
2 Ukraine Population 12345
3 US Population 123456
5 Bolivia GDP 23456
6 Kenya GDP 234567
7 Ukraine GDP 2345678
8 US GDP 23456789
I know that df.Series.value_counts() > 1 will identify which df.Series values occur more than once; the returned result will look something like the following:
Population    True
GDP           True
#McDonalds    False
#Schools      False
#Cars         False
#Tshirts      False
I want to write something like the following so that my new DataFrame drops the rows whose df.Series value occurs only once, but this doesn't work:
df.drop(df.Series.value_counts()==1,axis=1,inplace=True)
You can do this by creating a boolean list/array by either list comprehensions or using DataFrame's string manipulation methods.
The list comprehension approach is:
vc = df['Series'].value_counts()
u = [i not in set(vc[vc==1].index) for i in df['Series']]
df = df[u]
The other approach is to use the str.contains method to check whether the values of the Series column contain a given string or match a given regular expression (used in this case as we are using multiple strings):
vc = df['Series'].value_counts()
pat = r'|'.join(vc[vc==1].index) #Regular expression
df = df[~df['Series'].str.contains(pat)] #Tilde is to negate boolean
Using this regular expressions approach is a bit more hackish and may require some extra processing (character escaping, etc) on pat in case you have regex metacharacters in the strings you want to filter out (which requires some basic regex knowledge). However, it's worth noting this approach is about 4x faster than using the list comprehension approach (tested on the data provided in the question).
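For the escaping the answer mentions, re.escape handles regex metacharacters; a small sketch (the 'C++ (jobs)' value is made up here just to show the problem):

```python
import re
import pandas as pd

df = pd.DataFrame({'Country': ['Bolivia', 'Kenya', 'US'],
                   'Series': ['Pop', 'Pop', 'C++ (jobs)']})
vc = df['Series'].value_counts()

# re.escape makes '+', '(' etc. match literally instead of raising re.error
pat = '|'.join(re.escape(s) for s in vc[vc == 1].index)
df = df[~df['Series'].str.contains(pat)]
print(df)  # only the two 'Pop' rows remain
```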
As a side note, I recommend avoiding using the word Series as a column name as that's the name of a pandas object.
This is an old question, but the current answer doesn't scale to moderately large dataframes. A much faster and more "dataframe-native" way is to add a value-count column and filter on it.
Create the dataset:
df = pd.DataFrame({'Country': 'Bolivia Kenya Ukraine US Bolivia Kenya Ukraine US Bolivia Kenya Ukraine US'.split(),
                   'Series': 'Pop Pop Pop Pop GDP GDP GDP GDP McDonalds Schools Cars Tshirts'.split()})
Drop rows that have a count of 1 for the given column ('Series' in this case):
# Group values for Series and add 'cnt' column with count
df['cnt'] = df.groupby(['Series'])['Country'].transform('count')
# Drop indexes for count value == 1, and dropping 'cnt' column
df.drop(df[df.cnt==1].index)[['Country','Series']]
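The same idea also works as a single filtering expression via transform('size'), without materializing a helper column (a sketch on the same dataset):

```python
import pandas as pd

df = pd.DataFrame({'Country': 'Bolivia Kenya Ukraine US Bolivia Kenya Ukraine US Bolivia Kenya Ukraine US'.split(),
                   'Series': 'Pop Pop Pop Pop GDP GDP GDP GDP McDonalds Schools Cars Tshirts'.split()})

# transform('size') broadcasts each group's row count back to every row,
# so the comparison gives a boolean mask aligned with df
out = df[df.groupby('Series')['Series'].transform('size') > 1]
print(out)  # only the Pop and GDP rows remain
```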