Generating binary columns based on a string column which needs splitting - python

I have a pandas column, which is shown as below.
finalDF['question.DAL_negative_countries_MX_BR_VE']
0 NaN
1 China
2 NaN
...
9787 United States
9788 United States | China | France | Germany | Uni...
I want to generate individual binary columns for each response option (country). For example, one of the generated binary columns can be "question.DAL_positive_countries_MX_BR_VE_United_States" containing 1 if the respondent selected "United_States", and 0 otherwise.
I know I can do df.col.str.split('|'), but that doesn't create the binary columns I need.
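For pipe-separated multi-select columns like this, pandas' Series.str.get_dummies does the split-and-binarize step in one call. A minimal sketch on toy data (the column name and the ' | ' separator are assumed from the sample output shown above):

```python
import pandas as pd

df = pd.DataFrame({'countries': [None, 'China', 'United States | China | France']})

# get_dummies splits each cell on the separator and builds one 0/1 column
# per distinct value; NaN rows come out as all zeros
dummies = df['countries'].str.get_dummies(sep=' | ')

# optionally restore the original question prefix on each generated column
dummies = dummies.add_prefix('question.DAL_negative_countries_MX_BR_VE_')
```

The result can then be joined back onto the original frame with pd.concat([df, dummies], axis=1).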

Related

Pandas string contains and replace

I have the following dataframe
A B
0 France United States of America
1 Italie France
2 United Stats Italy
I'm looking for a function that can take (for each word in column A) the first 4 letters and then search in column B whether or not these 4 letters are there. Now if this is the case, I want to replace the value in A with the similar value (similar first 4 letters) in B.
Example : for the word Italie in column A, I have to take Ital then search in B whether or not we can find it. Then I want to replace Italie with its similar word Italy.
I've tried using the str.contains function, but I still cannot take only the first 4 letters.
Output expected :
A B
0 France United States of America
1 Italy France
2 United Stats of America Italy
In order to summarize, I am looking for correcting values in column A to become similar to those in column b
Solution using fuzzy matching with fuzzywuzzy:
from fuzzywuzzy import process
def fuzzyreturn(x):
    # return the single best match for x among the values of column B
    return process.extract(x, df.B.values, limit=1)[0][0]
df.A.apply(fuzzyreturn)
Out[608]:
0 France
1 Italy
2 United States of America
Name: A, dtype: object
df.A=df.A.apply(fuzzyreturn)
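If you want the literal first-4-letters rule described in the question rather than fuzzy scoring, a minimal sketch could look like the following (the helper name prefix_match is made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({'A': ['France', 'Italie', 'United Stats'],
                   'B': ['United States of America', 'France', 'Italy']})

def prefix_match(a, candidates):
    # take the first 4 letters of the value in A and look for them in B
    prefix = a[:4]
    for b in candidates:
        if prefix in b:
            return b
    return a  # no match found: keep the original value

df['A'] = df['A'].apply(lambda a: prefix_match(a, df['B']))
```

This replaces Italie with Italy and United Stats with United States of America, while France is left matched to itself.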

Combining similar rows in Stata / python

I am doing some data prep for graph analysis and my data looks as follows.
country1 country2 pair volume
USA CHN USA_CHN 10
CHN USA CHN_USA 5
AFG ALB AFG_ALB 2
ALB AFG ALB_AFG 5
I would like to combine them such that
country1 country2 pair volume
USA CHN USA_CHN 15
AFG ALB AFG_ALB 7
Is there a simple way for me to do so in Stata or python? I've tried making a duplicate dataframe and renamed the 'pair' as country2_country1, then merged them, and dropped duplicate volumes, but it's a hairy way of going about things: I was wondering if there is a better way.
If it helps to know, my data format is for a directed graph, and I am converting it to undirected.
Your key must consist of sets of two countries, so that they compare equal regardless of order. In Python/Pandas, this can be accomplished as follows.
import pandas as pd
import io
# load in your data
s = """
country1 country2 pair volume
USA CHN USA_CHN 10
CHN USA CHN_USA 5
AFG ALB AFG_ALB 2
ALB AFG ALB_AFG 5
"""
data = pd.read_table(io.StringIO(s), sep=r'\s+')  # StringIO, since s is a str, not bytes
# create your key (using frozenset instead of set, since frozenset is hashable)
key = data[['country1', 'country2']].apply(frozenset, axis=1)
# group by the key and aggregate using sum()
print(data.groupby(key).sum())
This results in
volume
(CHN, USA) 15
(AFG, ALB) 7
which isn't exactly what you wanted, but you should be able to get it into the right shape from here.
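One way to get it back into the original column layout from here (a sketch; it keeps the labels of the first row seen for each unordered pair) is to aggregate the other columns with 'first' while summing volume:

```python
import pandas as pd

df = pd.DataFrame({'country1': ['USA', 'CHN', 'AFG', 'ALB'],
                   'country2': ['CHN', 'USA', 'ALB', 'AFG'],
                   'pair': ['USA_CHN', 'CHN_USA', 'AFG_ALB', 'ALB_AFG'],
                   'volume': [10, 5, 2, 5]})

# frozenset key compares equal regardless of country order
key = df[['country1', 'country2']].apply(frozenset, axis=1)

# sum volume per unordered pair, keeping the first row's labels for each pair
out = df.groupby(key, sort=False).agg({'country1': 'first', 'country2': 'first',
                                       'pair': 'first', 'volume': 'sum'})
```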
Here is a solution that takes advantage of pandas' automatic alignment of indexes:
df1 = df.set_index(['country1'])
df2 = df.set_index(['country2'])
df1['volume'] += df2['volume']
df1.reset_index().query('country1 > country2')
country1 country2 pair volume
0 USA CHN USA_CHN 15
3 ALB AFG ALB_AFG 7
Here is a solution based on Jean-François Fabre's comment.
split_sorted = df.pair.str.split('_').map(sorted)
df_switch = pd.concat([split_sorted.str[0],
                       split_sorted.str[1],
                       df['volume']], axis=1, keys=['country1', 'country2', 'volume'])
df_switch.groupby(['country1', 'country2'], as_index=False, sort=False).sum()
output
country1 country2 volume
0 CHN USA 15
1 AFG ALB 7
In Stata you can just lean on the fact that alphabetical ordering gives a distinct signature to each pair.
clear
input str3 (country1 country2) volume
USA CHN 10
CHN USA 5
AFG ALB 2
ALB AFG 5
end
gen first = cond(country1 < country2, country1, country2)
gen second = cond(country1 < country2, country2, country1)
collapse (sum) volume, by(first second)
list
+-------------------------+
| first second volume |
|-------------------------|
1. | AFG ALB 7 |
2. | CHN USA 15 |
+-------------------------+
You can merge back with the original dataset if wished.
Documented and discussed here
NB: Presenting a clear data example is helpful. Presenting it as the code to input the data is even more helpful.
Note: As Nick Cox comments below, this solution gets a bit crazy when the number of countries is large. (With 200 countries, you need to accurately store a 200-bit number)
Here's a neat way to do it using pure Stata.
I effectively convert the countries into binary "flags", making something like the following mapping:
AFG 0001
ALB 0010
CHN 0100
USA 1000
This is achieved by numbering each country as normal, then calculating 2^(country_number). When we then add these binary numbers, the result is a combination of the two "flags". For example,
AFG + CHN = 0101
CHN + AFG = 0101
Notice that it now doesn't make any difference which order the countries come in!
So we can now happily add the flags and collapse by the result, summing over volume as we go.
Here's the complete code (heavily commented so it looks much longer than it is!)
// Convert country names into numbers, storing the resulting
// name/number mapping in a label called "countries"
encode country1, generate(from_country) label(countries)
// Do it again for the other country, using the existing
// mappings where the countries already exist, and adding to the
// existing mapping where they don't
encode country2, generate(to_country) label(countries)
// Add these numbers as if they were binary flags
// Thus CHN (3) + USA (4) becomes:
// 010 +
// 100
// ---
// 110
// This makes adding strings commutative and unique. This means that
// the new variable doesn't care which way round the countries are
// nor can it get confused by pairs of countries adding up to the same
// number.
generate bilateral = 2^from_country + 2^to_country
// The rest is easy. Collapse by the new summed variable
// taking (arbitrarily) the lowest of the from_countries
// and the highest of the to_countries
collapse (sum) volume (min) from_country (max) to_country, by(bilateral)
// Tell Stata that these new min and max countries still have the same
// label:
label values from_country countries
label values to_country countries

How to create Boolean indicator matrix in PYTHON

I have the following dataset:
user artist sex country
0 1 red hot chili peppers f Germany
1 1 the black dahlia murder f Germany
2 1 goldfrapp f Germany
3 2 dropkick murphys f Germany
4 2 le tigre f Germany
.
.
289950 19718 bob dylan f Canada
289951 19718 pixies f Canada
289952 19718 the clash f Canada
I want to create a Boolean indicator matrix using a dataframe, with one row for each user and one column for each artist. Each entry should be 1 if that user has the artist and 0 otherwise.
Just to mention, there are 1004 unique artists and 15000 unique users—it’s a large data set.
I have created an empty matrix using the following:
pd.DataFrame(index=user, columns=artist)
I am having difficulty populating the dataframe correctly.
There is a method in pandas called notnull.
Suppose your dataframe is named df; you can use:
df['has_artist'] = df['artist'].notnull()
This will add a boolean column named has_artist to your dataframe.
If you want to have 0 and 1 do instead:
df['has_artist'] = df['artist'].notnull().astype(int)
You can also store it in a different variable and not alter your dataframe.
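To actually build the user × artist indicator matrix asked for, pd.crosstab gets there directly. A sketch on toy data:

```python
import pandas as pd

df = pd.DataFrame({'user': [1, 1, 2, 2],
                   'artist': ['pixies', 'the clash', 'pixies', 'bob dylan']})

# crosstab counts (user, artist) occurrences;
# clip(upper=1) turns repeat counts into plain 0/1 flags
matrix = pd.crosstab(df['user'], df['artist']).clip(upper=1)
```

With 15000 users and 1004 artists this yields a 15000 × 1004 frame of 0/1 values, which is still comfortably small in memory.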

Pandas: Delete rows of a DataFrame if total count of a particular column occurs only 1 time

I'm looking to delete rows of a DataFrame if total count of a particular column occurs only 1 time
Example of raw table (values are arbitrary for illustrative purposes):
print df
Country Series Value
0 Bolivia Population 123
1 Kenya Population 1234
2 Ukraine Population 12345
3 US Population 123456
5 Bolivia GDP 23456
6 Kenya GDP 234567
7 Ukraine GDP 2345678
8 US GDP 23456789
9 Bolivia #McDonalds 3456
10 Kenya #Schools 3455
11 Ukraine #Cars 3456
12 US #Tshirts 3456789
Intended outcome:
print df
Country Series Value
0 Bolivia Population 123
1 Kenya Population 1234
2 Ukraine Population 12345
3 US Population 123456
5 Bolivia GDP 23456
6 Kenya GDP 234567
7 Ukraine GDP 2345678
8 US GDP 23456789
I know that df.Series.value_counts()>1 will identify which df.Series occur more than 1 time; and that the code returned will look something like the following:
Population    True
GDP           True
#McDonalds    False
#Schools      False
#Cars         False
#Tshirts      False
I want to write something like the following so that my new DataFrame drops the rows whose df.Series value occurs only 1 time, but this doesn't work:
df.drop(df.Series.value_counts()==1,axis=1,inplace=True)
You can do this by creating a boolean list/array by either list comprehensions or using DataFrame's string manipulation methods.
The list comprehension approach is:
vc = df['Series'].value_counts()
u = [i not in set(vc[vc==1].index) for i in df['Series']]
df = df[u]
The other approach is to use the str.contains method to check whether the values of the Series column contain a given string or match a given regular expression (used here because we are matching against multiple strings):
vc = df['Series'].value_counts()
pat = r'|'.join(vc[vc==1].index) #Regular expression
df = df[~df['Series'].str.contains(pat)] #Tilde is to negate boolean
Using this regular expressions approach is a bit more hackish and may require some extra processing (character escaping, etc) on pat in case you have regex metacharacters in the strings you want to filter out (which requires some basic regex knowledge). However, it's worth noting this approach is about 4x faster than using the list comprehension approach (tested on the data provided in the question).
As a side note, I recommend avoiding using the word Series as a column name as that's the name of a pandas object.
This is an old question, but the current answer doesn't scale to moderately large dataframes. A much faster and more "dataframe" way is to add a value-count column and filter on it.
Create the dataset:
df = pd.DataFrame({'Country': 'Bolivia Kenya Ukraine US Bolivia Kenya Ukraine US Bolivia Kenya Ukraine US'.split(),
                   'Series': 'Pop Pop Pop Pop GDP GDP GDP GDP McDonalds Schools Cars Tshirts'.split()})
Drop rows whose value in the column ('Series' in this case) has a count of 1:
# Group values for Series and add 'cnt' column with count
df['cnt'] = df.groupby(['Series'])['Country'].transform('count')
# Drop rows where the count is 1, then drop the helper 'cnt' column
df.drop(df[df.cnt==1].index)[['Country','Series']]
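Another compact option along the same lines is value_counts combined with isin, keeping only rows whose Series value appears more than once (a sketch on a reduced version of the question's data):

```python
import pandas as pd

df = pd.DataFrame({'Country': ['Bolivia', 'Kenya', 'Bolivia', 'US'],
                   'Series': ['Population', 'Population', 'GDP', '#Tshirts']})

vc = df['Series'].value_counts()
# keep only rows whose Series value appears more than once
kept = df[df['Series'].isin(vc[vc > 1].index)]
```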

Delete rows based on values in column in python

I am performing data cleaning on a .csv file before running analytics. I am trying to delete the rows that have null values in any of their columns.
Sample file:
Unnamed: 0 2012 2011 2010 2009 2008 2005
0 United States of America 760739 752423 781844 812514 843683 862220
1 Brazil 732913 717185 715702 651879 649996 NaN
2 Germany 520005 513458 515853 519010 518499 494329
3 United Kingdom (England and Wales) 310544 336997 367055 399869 419273 541455
4 Mexico 211921 212141 230687 244623 250932 239166
5 France 193081 192263 192906 193405 187937 148651
6 Sweden 87052 89457 87854 86281 84566 72645
7 Romania 17219 12299 12301 9072 9457 8898
8 Nigeria 15388 NaN 18093 14075 14692 NaN
So far used is:
from pandas import read_csv
link = "https://docs.google.com/spreadsheets......csv"
data = read_csv(link)
data.head(100000)
How can I delete these rows?
Once you have your data loaded, you just need to figure out which rows to remove:
bad_rows = data.isnull().any(axis=1)  # isnull handles mixed dtypes, unlike np.isnan
Then:
data[~bad_rows].head(100)
You need to use the dropna method to remove these values. Passing in how='any' into the method as an argument will remove the row if any of the values is null and how='all' will only remove the row if all of the values are null.
cleaned_data = data.dropna(how='any')
Edit 1.
It's worth noting that you may not want to create a copy of your cleaned data (i.e. cleaned_data = data.dropna(how='any')).
To save memory you can pass in the inplace option that will modify your original DataFrame and return None.
data.dropna(how='any', inplace=True)
data.head(100)
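The difference between how='any' and how='all' is easy to see on a tiny frame (a sketch):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'a': [1.0, np.nan, np.nan],
                   'b': [4.0, 5.0, np.nan]})

any_dropped = df.dropna(how='any')  # drops every row containing at least one NaN
all_dropped = df.dropna(how='all')  # drops only rows where every value is NaN
```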
