Pandas string contains and replace - python

I have the following dataframe:
              A                         B
0        France  United States of America
1        Italie                    France
2  United Stats                     Italy
I'm looking for a function that, for each word in column A, takes the first 4 letters and searches column B for those 4 letters. If they are found, I want to replace the value in A with the matching value (same first 4 letters) from B.
Example: for the word Italie in column A, I take Ital and search for it in B. Since it is found, I replace Italie with its match, Italy.
I've tried using the str.contains function, but I still cannot restrict the match to only the first 4 letters.
Expected output:
                          A                         B
0                    France  United States of America
1                     Italy                    France
2  United States of America                     Italy
To summarize, I am looking to correct the values in column A so that they match the similar values in column B.

Solution using fuzzy matching (fuzzywuzzy)
from fuzzywuzzy import process

def fuzzyreturn(x):
    # process.extract returns a list of (match, score) tuples;
    # take the string of the single best match
    return process.extract(x, df.B.values, limit=1)[0][0]

df.A.apply(fuzzyreturn)
Out[608]:
0 France
1 Italy
2 United States of America
Name: A, dtype: object
df.A = df.A.apply(fuzzyreturn)
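A caveat on the above: process.extract always returns its best guess, so a value with no good counterpart in B would still be overwritten. If that matters, process.extractOne accepts a score_cutoff and returns None below it. And if fuzzy matching is overkill, the literal first-4-letters rule from the question can be done in plain Python. A minimal sketch (the cutoff value and helper names are illustrative, not part of the original answer):

from fuzzywuzzy import process

def safe_fuzzy(x, choices, cutoff=80):
    # extractOne returns None when no choice scores >= cutoff;
    # in that case keep the original value
    best = process.extractOne(x, choices, score_cutoff=cutoff)
    return best[0] if best else x

def prefix_match(a, choices, n=4):
    # first value in choices containing the first n letters of a,
    # else a unchanged
    matches = [b for b in choices if a[:n] in b]
    return matches[0] if matches else a

df['A'] = df['A'].apply(lambda a: prefix_match(a, df['B']))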


Flag if a row is duplicated and attach whether it's the 1st, 2nd, etc. duplicated row

I'd like to flag if a row is duplicated, and attach whether it's the 1st, 2nd, 3rd, etc. duplicated row in a Pandas DataFrame.
More visually, I'd like to go from:
id  Country  City
1   France   Paris
2   France   Paris
3   France   Lyon
4   France   Lyon
5   France   Lyon
to
id  Country  City   duplicated_flag
1   France   Paris  1
2   France   Paris  1
3   France   Lyon   2
4   France   Lyon   2
5   France   Lyon   2
Note that id is not taken into account to see if the row is duplicated.
Two options:
First, if you have lots of columns that you need to compare, you can use:
comparison_df = df.drop("id", axis=1)
df["duplicated_flag"] = (comparison_df != comparison_df.shift()).any(axis=1).cumsum()
We drop the columns that aren't needed in the comparison. Then, we check whether each row differs from the one above it using .shift() and .any(). Finally, .cumsum() numbers the resulting groups of identical consecutive rows, which gives duplicated_flag.
But, if you only have two columns to compare (or if for some reason you have lots of columns that you would need to drop), you can find mismatched rows one column at a time and then use .cumsum() to get the value of duplicated_flag for each row. It's a bit more verbose, so I'm not super happy with this option, but I'm leaving it here for completeness in case it suits your use case better:
country_comparison = df["Country"].ne(df["Country"].shift())
city_comparison = df["City"].ne(df["City"].shift())
df["duplicated_flag"] = (country_comparison | city_comparison).cumsum()
print(df)
These output:
id Country City duplicated_flag
0 1 France Paris 1
1 2 France Paris 1
2 3 France Lyon 2
3 4 France Lyon 2
4 5 France Lyon 2
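For reference, here is a self-contained version of the first option, with the sample frame rebuilt from the tables above:

import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "Country": ["France"] * 5,
    "City": ["Paris", "Paris", "Lyon", "Lyon", "Lyon"],
})

# Compare every non-id column with the row above: any difference marks
# the start of a new group, and cumsum() numbers the groups 1, 2, ...
comparison_df = df.drop("id", axis=1)
df["duplicated_flag"] = (comparison_df != comparison_df.shift()).any(axis=1).cumsum()
print(df)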

Get the index of a particular element (a sentence string) in a DataFrame column, subject to a condition

I have a table with numbers in the column 'Para'. I have to find the index of a particular sentence from the column 'Country_Title', such that the value in column 'Para' is 2.
Main DataFrame 'df_countries' is shown below:
Index  Sequence  Para  Country_Title
0      5         4     India is seventh largest country
1      6         6     Australia is a continent country
2      7         2     Canada is the 2nd largest country
3      9         3     UAE is a country in Western Asia
4      10        2     China is a country in East Asia
5      11        1     Germany is in Central Europe
6      13        2     Russia is the largest country
7      14        3     Capital city of China is Beijing
Suppose my keyword is China, and I want to get the index of the sentence containing 'China', but only the one where 'Para' = 2.
Consider the rows at index 4 and 7; both mention China in Country_Title. But I want to obtain the index of the one with 'Para' = 2, i.e., the result must be index = 4.
My Approach:
I derived another DataFrame 'df_para2_countries' from the above table, as shown below:
Index  Para  Country_Title
2      2     Canada is the 2nd largest country
4      2     China is a country in East Asia
6      2     Russia is the largest country
Now I store the country titles as:
c = list(df_para2_countries['Country_Title'])
I used a for loop to iterate through the elements of 'c' and find the index of a particular country in the table 'df_countries':
for i in c:
    if 'China' in i:
        print(i)
        ind = df_para2_countries.loc[df_para2_countries['Country_Title'] = i]
        print(ind)
The line assigning 'ind' gives an error.
I want to get the index, but this doesn't work.
Please suggest how I can approach this.
You need two equals signs (==) in your condition.
If you need only the 'index', that is, the value from your first column called Index, you can take the index of the rows returned by .loc, convert it to a list, and get the first value, for instance:
ind = df_para2_countries.loc[df_para2_countries['Country_Title'] == i].index.to_list()[0]
Hope it works :)
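For this particular lookup, the loop can also be avoided entirely: a boolean mask over the original df_countries combines both conditions and returns the matching index directly. A minimal sketch, assuming the keyword 'China' and the sample table above:

# rows where Para equals 2 AND Country_Title mentions the keyword
mask = (df_countries['Para'] == 2) & df_countries['Country_Title'].str.contains('China')
print(df_countries.index[mask].tolist())  # [4]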

Assign binary value whether a column contains empty list

I would like to assign a binary value (1 or 0) whether a column contains not empty/empty lists.
For example:
Country Test
Germany []
Italy ['pizza']
United Kingdom ['queen', 'king','big']
France ['Eiffel']
Spain []
...
What I would expect is something like this:
Country Test Binary
Germany [] 0
Italy ['pizza'] 1
United Kingdom ['queen', 'king','big'] 1
France ['Eiffel'] 1
Spain [] 0
...
I do not know how to use np.where or another method to get these results.
I think that to check if a column contains an empty list, I should do something like this: df[df['Test'] != '[]']
You can do a simple check for length and based on the value, you can convert it to 0 or 1.
df['Binary'] = (df['Test'].str.len() != 0).astype(int)
While this is good, the most efficient way to do it was provided by @Marat:
df['Binary'] = df['Test'].astype(bool).astype(int)
The full code is here:
import pandas as pd

c = ['Country', 'Test']
d = [['Germany', []],
     ['Italy', ['pizza']],
     ['United Kingdom', ['queen', 'king', 'big']],
     ['France', ['Eiffel']],
     ['Spain', []]]
df = pd.DataFrame(data=d, columns=c)
df['Binary'] = df['Test'].astype(bool).astype(int)
print(df)
The output of this will be:
Country Test Binary
0 Germany [] 0
1 Italy [pizza] 1
2 United Kingdom [queen, king, big] 1
3 France [Eiffel] 1
4 Spain [] 0
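The astype(bool) approach works because empty Python containers are falsy and pandas applies that truth test element-wise; a quick check:

print(bool([]), bool(['pizza']))  # False True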
Use str.len:
import numpy as np

np.clip(df.Test.str.len(), 0, 1)
# or
(df.Test.str.len() != 0).astype(int)

Pandas - value_counts on multiple values in one cell

I have a dataframe which has a column with multiple values, separated by a comma like this:
Country
Australia, Cuba, Argentina
Australia
United States, Canada, United Kingdom, Argentina
I would like to count each unique value, similar to value_counts, like this:
Australia: 2
Cuba: 1
Argentina: 2
United States: 1
My simplest method is shown below, but I suspect that this can be done more efficiently and neatly.
from collections import Counter
Counter(pd.DataFrame(data['Country'].str.split(',', expand=True)).values.ravel())
Cheers
You can use get_dummies:
df.Country.str.get_dummies(sep=', ').sum()
Out[354]:
Argentina 2
Australia 2
Canada 1
Cuba 1
United Kingdom 1
United States 1
dtype: int64
Another option is to split and then use value_counts
pd.Series(df.Country.str.split(', ').sum()).value_counts()
Argentina 2
Australia 2
United Kingdom 1
Canada 1
Cuba 1
United States 1
dtype: int64
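On pandas 0.25 or newer, Series.explode gives yet another one-liner that skips building the intermediate flattened list; a sketch assuming the same df:

# split each cell into a list, emit one row per element, then count
df.Country.str.split(', ').explode().value_counts()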

Weird behaviour with pandas cut, groupby and multiindex in Python

I have a dataframe like this one,
Continent % Renewable
Country
China Asia 2
United States North America 1
Japan Asia 1
United Kingdom Europe 1
Russian Federation Europe 2
Canada North America 5
Germany Europe 2
India Asia 1
France Europe 2
South Korea Asia 1
Italy Europe 3
Spain Europe 3
Iran Asia 1
Australia Australia 1
Brazil South America 5
where % Renewable is a column created using the cut function,
Top15['% Renewable'] = pd.cut(Top15['% Renewable'], 5, labels=range(1,6))
When I group by Continent and % Renewable to count the number of countries in each subset, I do
count_groups = Top15.groupby(['Continent', '% Renewable']).size()
which is,
Continent % Renewable
Asia 1 4
2 1
Australia 1 1
Europe 1 1
2 3
3 2
North America 1 1
5 1
South America 5 1
The weird thing is the indexing now: if I index a combination whose count is greater than 0, this gives me the value,
count_groups.loc['Asia', 1]
>> 4
if not,
count_groups.loc['Asia', 3]
>> IndexingError: Too many indexers
Shouldn't it give me a 0, as there are no entries in that category? I would assume so, as that Series was created using the groupby.
If not, can anyone suggest a procedure so I can preserve the 0 count of countries for a category of % Renewable?
You have a Series with MultiIndex. Normally, we use tuples for indexing with MultiIndexes but pandas can be flexible about that.
In my opinion, count_groups.loc[('Asia', 3)] should raise a KeyError since this pair does not appear in the index but that's for pandas developers to decide I guess.
To return a default value from a Series, we can use get like we do in dictionaries:
count_groups.get(('Asia', 3), 0)
This will return 0 if the key does not exist.
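If you want to actually preserve the zero counts, as the question asks, one option is to pivot the inner level out to columns with a fill value and stack it back; a sketch assuming the count_groups Series from above:

# unstack() turns the % Renewable level into columns, filling absent
# (Continent, % Renewable) pairs with 0; stack() restores the MultiIndex
full_counts = count_groups.unstack(fill_value=0).stack()
print(full_counts.loc['Asia', 3])  # 0

Note that this only fills in combinations of labels that appear somewhere in the index; a label that no continent has at all would still be absent.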
