Assign a binary value based on whether a column contains an empty list - python

I would like to assign a binary value (1 or 0) depending on whether a column contains a non-empty or an empty list.
For example:
Country Test
Germany []
Italy ['pizza']
United Kingdom ['queen', 'king','big']
France ['Eiffel']
Spain []
...
What I would expect is something like this:
Country Test Binary
Germany [] 0
Italy ['pizza'] 1
United Kingdom ['queen', 'king','big'] 1
France ['Eiffel'] 1
Spain [] 0
...
I do not know how to use np.where or another approach to get these results.
I think that to check whether a column contains an empty list, I should do something like this: df[df['Test'] != '[]']

You can do a simple check for length and, based on the value, convert it to 0 or 1.
df['Binary'] = (df['Test'].str.len() != 0).astype(int)
While this is good, the most efficient way to do it was provided by @Marat:
df['Binary'] = df['Test'].astype(bool).astype(int)
The full code is here:
import pandas as pd

c = ['Country', 'Test']
d = [['Germany', []],
     ['Italy', ['pizza']],
     ['United Kingdom', ['queen', 'king', 'big']],
     ['France', ['Eiffel']],
     ['Spain', []]]
df = pd.DataFrame(data=d, columns=c)
df['Binary'] = df['Test'].astype(bool).astype(int)
print(df)
The output of this will be:
Country Test Binary
0 Germany [] 0
1 Italy [pizza] 1
2 United Kingdom [queen, king, big] 1
3 France [Eiffel] 1
4 Spain [] 0

Use str.len:
import numpy as np
np.clip(df.Test.str.len(), 0, 1)
# or
(df.Test.str.len() != 0).astype(int)
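Putting the variants side by side (a minimal sketch with made-up data; all of them assume the column holds actual list objects, not their string representation):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Country': ['Germany', 'Italy', 'Spain'],
    'Test': [[], ['pizza'], []],
})

# All three produce 1 for a non-empty list and 0 for an empty one.
a = df['Test'].astype(bool).astype(int)       # truthiness of the list
b = (df['Test'].str.len() != 0).astype(int)   # explicit length check
c = np.clip(df['Test'].str.len(), 0, 1)       # clip lengths into {0, 1}

print(a.tolist())  # [0, 1, 0]
```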

Related

How can I count the # of occurrences of more than one column (e.g. city & country)?

Given the following data ...
city country
0 London UK
1 Paris FR
2 Paris US
3 London UK
... I'd like a count of each city-country pair
city country n
0 London UK 2
1 Paris FR 1
2 Paris US 1
The following works but feels like a hack:
df = pd.DataFrame([('London', 'UK'), ('Paris', 'FR'), ('Paris', 'US'), ('London', 'UK')], columns=['city', 'country'])
df.assign(**{'n': 1}).groupby(['city', 'country']).count().reset_index()
I'm assigning an additional column n of all 1s, grouping on city & country, and then count()ing occurrences of this new 'all 1s' column. It works, but adding a column just to count it feels wrong.
Is there a cleaner solution?
There is a better way: use value_counts.
df.value_counts().reset_index(name='n')
city country n
0 London UK 2
1 Paris FR 1
2 Paris US 1
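For completeness, a self-contained sketch of the value_counts approach (DataFrame.value_counts requires pandas >= 1.1):

```python
import pandas as pd

df = pd.DataFrame([('London', 'UK'), ('Paris', 'FR'), ('Paris', 'US'), ('London', 'UK')],
                  columns=['city', 'country'])

# value_counts on the whole frame counts unique rows, sorted by count descending.
counts = df.value_counts().reset_index(name='n')
print(counts)
```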

Get Max Sum value of a column in Pandas

I have a csv like this:
Country Values Address
USA 1 AnyAddress
USA 2 AnyAddress
Brazil 1 AnyAddress
UK 3 AnyAddress
Australia 0 AnyAddress
Australia 0 AnyAddress
I need to group the data by Country and sum Values, then return a string with the country and its maximum summed value. In this case USA and UK both sum to 3, and USA is lexicographically greater than UK, so the output should be:
"Country: USA, Value: 3"
When I use groupby in pandas I am not able to get the strings with country name and value, how can I do that?
try:
max_values = df.groupby('Country').sum().reset_index().max().values
your_string = f"Country: {max_values[0]}, Value: {max_values[1]}"
Note that .max() is taken per column independently here, so the pairing is only correct because 'USA' also happens to be the lexicographic maximum of the country names.
Output:
>>> print(your_string)
Country: USA, Value: 3
You can do:
df.groupby("Country", as_index=False)["Values"].sum()\
.sort_values(["Values", "Country"], ascending=False).iloc[0]
Outputs:
Country USA
Values 3
Name: 3, dtype: object
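If you want the second approach to end in the same formatted string, you can pull the fields out of the resulting row; a runnable sketch with the sample data:

```python
import pandas as pd

df = pd.DataFrame({
    'Country': ['USA', 'USA', 'Brazil', 'UK', 'Australia', 'Australia'],
    'Values': [1, 2, 1, 3, 0, 0],
})

# Sum per country, then sort by total (and country name, to break ties) descending.
totals = df.groupby('Country', as_index=False)['Values'].sum()
top = totals.sort_values(['Values', 'Country'], ascending=False).iloc[0]

result = f"Country: {top['Country']}, Value: {top['Values']}"
print(result)  # Country: USA, Value: 3
```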

Problem with New Column in Pandas Dataframe

I have a dataframe and I'm trying to create a new column of values that is one column divided by the other. This should be obvious but I'm only getting 0's and 1's as my output.
I also tried converting the output to float in case the output was somehow being rounded off but that didn't change anything.
def answer_seven():
    df = answer_one()
    columns_to_keep = ['Self-citations', 'Citations']
    df = df[columns_to_keep]
    df['ratio'] = df['Self-citations'] / df['Citations']
    return df

answer_seven()
Output:
Self_cite Citations ratio
Country
Aus. 15606 90765 0
Brazil 14396 60702 0
Canada 40930 215003 0
China 411683 597237 1
France 28601 130632 0
Germany 27426 140566 0
India 37209 128763 0
Iran 19125 57470 0
Italy 26661 111850 0
Japan 61554 223024 0
S Korea 22595 114675 0
Russian 12422 34266 0
Spain 23964 123336 0
Britain 37874 206091 0
America 265436 792274 0
Does anyone know why I'm only getting 1's and 0's when I want float values? I tried the solutions given in the link suggested and none of them worked. I've tried to convert the values to floats using a few different methods including .astype('float'), float(df['A']) and df['ratio'] = df['Self-citations'] * 1.0 / df['Citations']. But none have worked so far.
Without having the exact dataframe it is difficult to say, but it is most likely a casting problem.
Let's build a MCVE:
import io
import pandas as pd
s = io.StringIO("""Country;Self_cite;Citations
Aus.;15606;90765
Brazil;14396;60702
Canada;40930;215003
China;411683;597237
France;28601;130632
Germany;27426;140566
India;37209;128763
Iran;19125;57470
Italy;26661;111850
Japan;61554;223024
S. Korea;22595;114675
Russian;12422;34266
Spain;23964;123336
Britain;37874;206091
America;265436;792274""")
df = pd.read_csv(s, sep=';', header=0).set_index('Country')
Then we can perform the desired operation as you suggested:
df['ratio'] = df['Self_cite']/df['Citations']
Checking dtypes:
df.dtypes
Self_cite int64
Citations int64
ratio float64
dtype: object
The result is:
Self_cite Citations ratio
Country
Aus. 15606 90765 0.171939
Brazil 14396 60702 0.237159
Canada 40930 215003 0.190369
China 411683 597237 0.689313
France 28601 130632 0.218943
Germany 27426 140566 0.195111
India 37209 128763 0.288973
Iran 19125 57470 0.332782
Italy 26661 111850 0.238364
Japan 61554 223024 0.275997
S. Korea 22595 114675 0.197035
Russian 12422 34266 0.362517
Spain 23964 123336 0.194299
Britain 37874 206091 0.183773
America 265436 792274 0.335031
Graphically:
df['ratio'].plot(kind='bar')
If you want to enforce the type, you can cast the dataframe using the astype method:
df.astype(float)
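One plausible cause of the reported symptom (hard to confirm without the original dataframe): the reported output (0 everywhere except China's 1) matches the true ratios rounded to the nearest integer, which suggests the result was cast or displayed as int at some point. A small sketch reproducing this on two of the rows:

```python
import pandas as pd

df = pd.DataFrame({'Self_cite': [15606, 411683], 'Citations': [90765, 597237]},
                  index=['Aus.', 'China'])

# True division yields the expected float ratios (~0.17 and ~0.69).
ratio = df['Self_cite'] / df['Citations']

# Rounding to the nearest int reproduces the reported 0s and 1s.
rounded = ratio.round().astype(int)
print(rounded.tolist())  # [0, 1]
```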

Pandas string contains and replace

I have the following dataframe
A B
0 France United States of America
1 Italie France
2 United Stats Italy
I'm looking for a function that can take (for each word in column A) the first 4 letters and then search in column B whether or not these 4 letters are there. Now if this is the case, I want to replace the value in A with the similar value (similar first 4 letters) in B.
Example : for the word Italie in column A, I have to take Ital then search in B whether or not we can find it. Then I want to replace Italie with its similar word Italy.
I've tried to do it with the str.contains function, but still cannot take only the first 4 letters.
Output expected :
A B
0 France United States of America
1 Italy France
2 United States of America Italy
To summarize, I am looking to correct the values in column A so they match the similar values in column B.
Solution using fuzzy matching with fuzzywuzzy:
from fuzzywuzzy import process

def fuzzyreturn(x):
    return process.extract(x, df.B.values, limit=1)[0][0]

df.A.apply(fuzzyreturn)
Out[608]:
0 France
1 Italy
2 United States of America
Name: A, dtype: object
df.A = df.A.apply(fuzzyreturn)
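If installing fuzzywuzzy is not an option, the standard library's difflib can do a similar fuzzy lookup; a sketch on the same data (the helper name and the 0.4 cutoff are arbitrary choices here):

```python
import difflib

import pandas as pd

df = pd.DataFrame({
    'A': ['France', 'Italie', 'United Stats'],
    'B': ['United States of America', 'France', 'Italy'],
})

def closest_in_b(value, candidates, cutoff=0.4):
    # Best candidate by similarity ratio, or the original value if none clears the cutoff.
    matches = difflib.get_close_matches(value, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else value

df['A'] = df['A'].apply(lambda v: closest_in_b(v, df['B'].tolist()))
print(df['A'].tolist())  # ['France', 'Italy', 'United States of America']
```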

Weird behaviour with pandas cut, groupby and multiindex in Python

I have a dataframe like this one,
Continent % Renewable
Country
China Asia 2
United States North America 1
Japan Asia 1
United Kingdom Europe 1
Russian Federation Europe 2
Canada North America 5
Germany Europe 2
India Asia 1
France Europe 2
South Korea Asia 1
Italy Europe 3
Spain Europe 3
Iran Asia 1
Australia Australia 1
Brazil South America 5
where the % Renewable is a column created using the cut function,
Top15['% Renewable'] = pd.cut(Top15['% Renewable'], 5, labels=range(1,6))
When I group by Continent and % Renewable to count the number of countries in each subset, I do
count_groups = Top15.groupby(['Continent', '% Renewable']).size()
which is,
Continent % Renewable
Asia 1 4
2 1
Australia 1 1
Europe 1 1
2 3
3 2
North America 1 1
5 1
South America 5 1
The weird thing is the indexing now: if I index a pair whose count is > 0, this gives me the value,
count_groups.loc['Asia', 1]
>> 4
if not,
count_groups.loc['Asia', 3]
>> IndexingError: Too many indexers
Shouldn't it give me a 0, as there are no entries in that category? I would assume so, as that dataframe was created using the groupby.
If not, can anyone suggest a procedure so I can preserve the 0 count of countries for a category of % Renewable?
You have a Series with MultiIndex. Normally, we use tuples for indexing with MultiIndexes but pandas can be flexible about that.
In my opinion, count_groups.loc[('Asia', 3)] should raise a KeyError since this pair does not appear in the index but that's for pandas developers to decide I guess.
To return a default value from a Series, we can use get like we do in dictionaries:
count_groups.get(('Asia', 3), 0)
This will return 0 if the key does not exist.
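To actually materialise the missing combinations as zeros, rather than handling them one key at a time, you can unstack with a fill value and stack back (a sketch on simplified data; the column names are made up):

```python
import pandas as pd

df = pd.DataFrame({
    'Continent': ['Asia', 'Asia', 'Asia', 'Europe', 'Europe'],
    'Renewable': [1, 1, 2, 1, 3],
})

counts = df.groupby(['Continent', 'Renewable']).size()

# get() returns a default for missing keys, like a dict.
missing = counts.get(('Asia', 3), 0)

# unstack/stack with fill_value=0 creates every Continent x Renewable combination.
full = counts.unstack(fill_value=0).stack()
```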
