I am doing some data prep for graph analysis and my data looks as follows.
country1 country2 pair volume
USA CHN USA_CHN 10
CHN USA CHN_USA 5
AFG ALB AFG_ALB 2
ALB AFG ALB_AFG 5
I would like to combine them such that
country1 country2 pair volume
USA CHN USA_CHN 15
AFG ALB AFG_ALB 7
Is there a simple way for me to do so in Stata or Python? I've tried making a duplicate dataframe, renaming 'pair' to country2_country1, then merging them and dropping duplicate volumes, but that's a hairy way of going about it; I was wondering if there is a better way.
If it helps to know, my data format is for a directed graph, and I am converting it to undirected.
Your key must consist of sets of two countries, so that they compare equal regardless of order. In Python/Pandas, this can be accomplished as follows.
import pandas as pd
import io
# load in your data
s = """
country1 country2 pair volume
USA CHN USA_CHN 10
CHN USA CHN_USA 5
AFG ALB AFG_ALB 2
ALB AFG ALB_AFG 5
"""
data = pd.read_table(io.StringIO(s), sep=r'\s+')
# create your key (using frozenset instead of set, since frozenset is hashable)
key = data[['country1', 'country2']].apply(frozenset, axis=1)
# group by the key and aggregate the volume column with sum()
print(data.groupby(key)[['volume']].sum())
This results in
volume
(CHN, USA) 15
(AFG, ALB) 7
which isn't exactly what you wanted, but you should be able to get it into the right shape from here.
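For example, a minimal sketch (my addition, not part of the original answer) of one way to get separate country columns back is to sort each pair alphabetically before grouping:
import numpy as np
# sort each pair alphabetically so both directions collapse to the same key
pair = pd.DataFrame(np.sort(data[['country1', 'country2']].to_numpy(), axis=1),
                    columns=['country1', 'country2'], index=data.index)
print(data['volume'].groupby([pair['country1'], pair['country2']]).sum().reset_index())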
Here is a solution that takes advantage of pandas' automatic index alignment.
# index one copy by the origin country and one by the destination country
df1 = df.set_index(['country1'])
df2 = df.set_index(['country2'])
# the addition aligns on the index labels, so each directed pair
# picks up the volume of its reversed counterpart
df1['volume'] += df2['volume']
# keep only one direction of each pair
df1.reset_index().query('country1 > country2')
country1 country2 pair volume
0 USA CHN USA_CHN 15
3 ALB AFG ALB_AFG 7
Here is a solution based on @jean-françois-fabre's comment.
split_sorted = df.pair.str.split('_').map(sorted)
df_switch = pd.concat([split_sorted.str[0],
split_sorted.str[1],
df['volume']], axis=1, keys=['country1', 'country2', 'volume'])
df_switch.groupby(['country1', 'country2'], as_index=False, sort=False).sum()
Output:
country1 country2 volume
0 CHN USA 15
1 AFG ALB 7
In Stata you can just lean on the fact that alphabetical ordering gives a distinct signature to each pair.
clear
input str3 (country1 country2) volume
USA CHN 10
CHN USA 5
AFG ALB 2
ALB AFG 5
end
gen first = cond(country1 < country2, country1, country2)
gen second = cond(country1 < country2, country2, country1)
collapse (sum) volume, by(first second)
list
+-------------------------+
| first second volume |
|-------------------------|
1. | AFG ALB 7 |
2. | CHN USA 15 |
+-------------------------+
You can merge back with the original dataset if you wish.
Documented and discussed here
NB: Presenting a clear data example is helpful. Presenting it as the code to input the data is even more helpful.
Note: As Nick Cox comments below, this solution gets a bit crazy when the number of countries is large. (With 200 countries, you need to accurately store a 200-bit number)
Here's a neat way to do it using pure Stata.
I effectively convert the countries into binary "flags", making something like the following mapping:
AFG 0001
ALB 0010
CHN 0100
USA 1000
This is achieved by numbering each country as normal, then calculating 2^(country_number). When we then add these binary numbers, the result is a combination of the two "flags". For example,
AFG + CHN = 0101
CHN + AFG = 0101
Notice that it now doesn't make any difference which order the countries come in!
So we can now happily add the flags and collapse by the result, summing over volume as we go.
Here's the complete code (heavily commented so it looks much longer than it is!)
// Convert country names into numbers, storing the resulting
// name/number mapping in a label called "countries"
encode country1, generate(from_country) label(countries)
// Do it again for the other country, using the existing
// mappings where the countries already exist, and adding to the
// existing mapping where they don't
encode country2, generate(to_country) label(countries)
// Add these numbers as if they were binary flags
// Thus CHN (3) + USA (4) becomes:
//  01000 +
//  10000
//  -----
//  11000
// This makes combining the two countries commutative and unique. This
// means that the new variable doesn't care which way round the
// countries are, nor can two different pairs of countries ever add up
// to the same number.
generate bilateral = 2^from_country + 2^to_country
// The rest is easy. Collapse by the new summed variable
// taking (arbitrarily) the lowest of the from_countries
// and the highest of the to_countries
collapse (sum) volume (min) from_country (max) to_country, by(bilateral)
// Tell Stata that these new min and max countries still have the same
// label:
label values from_country countries
label values to_country countries
I work in R, where this operation would be easy with the tidyverse; however, I'm having trouble figuring out how to do it in Python and pandas.
Let's say we're using the gapminder dataset
data_url = 'https://raw.githubusercontent.com/resbaz/r-novice-gapminder-files/master/data/gapminder-FiveYearData.csv'
gapminder = pd.read_csv(data_url)
and let's say that I want to filter out of the dataset all rows where year equals 1952 or 1957. I would think that something like this would work, but it doesn't:
vector = [1952, 1957]
gapminder.query("year isin(vector)")
I realize that what I've called a vector here is really a list. When I try to pass those two year values into an array as vector = pd.array(1952, 1957), that doesn't work either.
In R, for instance, you would have to do something simple like
vector = c(1952, 1957)
gapminder %>% filter(year %in% vector)
#or
gapminder %>% filter(year %in% c(1952, 1957))
So really this is a two-part question: first, how can I create a vector of many values (if I were pulling these values from another dataset, I believe I could just use pd.to_numpy), and second, how do I remove all rows from a dataframe based on that vector of observations?
I've looked at a lot of different variations for using query like here, for instance, https://www.geeksforgeeks.org/python-filtering-data-with-pandas-query-method/, but this has been surprisingly hard to find.
Update: I found that this isn't working if I pull a vector from another dataset (or even from the same dataset); for instance:
vector = (1952, 1957)
#how to take a dataframe and make a vector
#how to make a vector
gapminder.vec = gapminder\
.query('year == [1952, 1958]')\
[['country']]\
.to_numpy()
gap_sum = gapminder.query("year != @gapminder.vec")
gap_sum
I receive the following error:
Thanks much!
James
You can use in or even == inside the query string like so:
# gapminder.query("year == #vector") returns the same result
print(gapminder.query("year in #vector"))
country year pop continent lifeExp gdpPercap
0 Afghanistan 1952 8425333.0 Asia 28.801 779.445314
1 Afghanistan 1957 9240934.0 Asia 30.332 820.853030
12 Albania 1952 1282697.0 Europe 55.230 1601.056136
13 Albania 1957 1476505.0 Europe 59.280 1942.284244
24 Algeria 1952 9279525.0 Africa 43.077 2449.008185
... ... ... ... ... ... ...
1669 Yemen Rep. 1957 5498090.0 Asia 33.970 804.830455
1680 Zambia 1952 2672000.0 Africa 42.038 1147.388831
1681 Zambia 1957 3016000.0 Africa 44.077 1311.956766
1692 Zimbabwe 1952 3080907.0 Africa 48.451 406.884115
1693 Zimbabwe 1957 3646340.0 Africa 50.469 518.764268
The @ symbol tells the query string to look for a variable named vector outside of the context of the dataframe.
There are a few issues with the updated part of your question that I'll address:
1. The direct issue you're seeing is caused by using double square brackets to select a column. Double brackets force the selection to be returned as a 2-d table (i.e. a dataframe containing a single column) instead of the column itself. To resolve this, simply get rid of the double brackets. The to_numpy is also not necessary.
2. In your gap_sum variable, you're checking where the values in "year" are not in gapminder.vec, which is a pd.Series (an array, in more generic terms) of country names, so the two don't really make sense to compare.
3. Don't use dot notation to create variables in Python. You're not making a new variable, but attaching a new attribute to an existing object. Instead, use underscores, as is common practice in Python (e.g. gapminder_vec instead of gapminder.vec).
# countries that have years that are either 1952 or 1958
# will contain duplicate country names
gapminder_vec = gapminder.query('year == [1952, 1958]')['country']
# This won't actually filter anything, because `gapminder_vec` is
# a bunch of country names, not years.
gapminder.query("year not in @gapminder_vec")
Also, to perform a filter (dropping the matching rows) rather than taking a subset (keeping them):
vec = (1952, 1958)
# returns a subset containing the rows whose year IS in `vec`
subset_with_years_in_vec = gapminder.query('year in @vec')
# returns the complement: rows whose year is NOT in `vec`
subset_without_years_in_vec = gapminder.query('year not in @vec')
To filter out years 1952 and 1957 you can use:
print(gapminder.loc[~(gapminder.year.isin([1952, 1957]))])
Prints:
country year pop continent lifeExp gdpPercap
2 Afghanistan 1962 1.026708e+07 Asia 31.99700 853.100710
3 Afghanistan 1967 1.153797e+07 Asia 34.02000 836.197138
4 Afghanistan 1972 1.307946e+07 Asia 36.08800 739.981106
5 Afghanistan 1977 1.488037e+07 Asia 38.43800 786.113360
6 Afghanistan 1982 1.288182e+07 Asia 39.85400 978.011439
7 Afghanistan 1987 1.386796e+07 Asia 40.82200 852.395945
8 Afghanistan 1992 1.631792e+07 Asia 41.67400 649.341395
9 Afghanistan 1997 2.222742e+07 Asia 41.76300 635.341351
10 Afghanistan 2002 2.526840e+07 Asia 42.12900 726.734055
11 Afghanistan 2007 3.188992e+07 Asia 43.82800 974.580338
14 Albania 1962 1.728137e+06 Europe 64.82000 2312.888958
15 Albania 1967 1.984060e+06 Europe 66.22000 2760.196931
16 Albania 1972 2.263554e+06 Europe 67.69000 3313.422188
17 Albania 1977 2.509048e+06 Europe 68.93000 3533.003910
...
I have a pandas column, which is shown below.
finalDF['question.DAL_negative_countries_MX_BR_VE']
0 NaN
1 China
2 NaN
...
9787 United States
9788 United States | China | France | Germany | Uni...
I want to generate individual binary columns for each response option (country). For example, one of the generated binary columns can be "question.DAL_positive_countries_MX_BR_VE_United_States" containing 1 if the respondent selected "United_States", and 0 otherwise.
I know I can do df.col.str.split('|'), but that doesn't create the binary columns I need.
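A minimal sketch of one way to do this (my addition, assuming the separator is literally ' | ' as in the sample output) uses Series.str.get_dummies:
col = 'question.DAL_negative_countries_MX_BR_VE'
# one 0/1 indicator column per country; NaN rows become all zeros
dummies = finalDF[col].str.get_dummies(sep=' | ')
# prefix with the question name and replace spaces to get e.g. ..._United_States
dummies.columns = [col + '_' + c.replace(' ', '_') for c in dummies.columns]
finalDF = finalDF.join(dummies)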
I've tried to use df2.nlargest(5, ['1960']), and this gives me:
Country Name Country Code ... 2017 2018
0 IDA & IBRD total IBT ... 6335039629.0000 6412522234.0000
1 Low & middle income LMY ... 6306560891.0000 6383958209.0000
2 Middle income MIC ... 5619111361.0000 5678540888.0000
3 IBRD only IBD ... 4731120193.0000 4772284113.0000
6 Upper middle income UMC ... 2637690770.0000 2655635719.0000
This is somewhat right, but it's outputting all the columns. I just want it to include the columns "Country Name" and "1960", sorted by the column "1960".
So the output should look like this...
Country Name 1960
China 5000000000
India 499999999
USA 300000
France 100000
Germany 90000
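A minimal sketch of one way to get that (my addition, assuming df2 is the DataFrame shown and that the '1960' column is numeric):
# select only the two columns of interest, then take the five largest rows
# by '1960' (nlargest already returns them sorted in descending order)
print(df2[['Country Name', '1960']].nlargest(5, '1960'))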
I'm looking to delete the rows of a DataFrame where the value of a particular column occurs only one time in total.
Example of raw table (values are arbitrary for illustrative purposes):
print df
Country Series Value
0 Bolivia Population 123
1 Kenya Population 1234
2 Ukraine Population 12345
3 US Population 123456
5 Bolivia GDP 23456
6 Kenya GDP 234567
7 Ukraine GDP 2345678
8 US GDP 23456789
9 Bolivia #McDonalds 3456
10 Kenya #Schools 3455
11 Ukraine #Cars 3456
12 US #Tshirts 3456789
Intended outcome:
print df
Country Series Value
0 Bolivia Population 123
1 Kenya Population 1234
2 Ukraine Population 12345
3 US Population 123456
5 Bolivia GDP 23456
6 Kenya GDP 234567
7 Ukraine GDP 2345678
8 US GDP 23456789
I know that df.Series.value_counts() > 1 will identify which values of df.Series occur more than once, and that the returned output will look something like the following:
Population     True
GDP            True
#McDonalds    False
#Schools      False
#Cars         False
#Tshirts      False
I want to write something like the following so that my new DataFrame drops the rows whose df.Series value occurs only once, but this doesn't work:
df.drop(df.Series.value_counts()==1,axis=1,inplace=True)
You can do this by creating a boolean list/array, either with a list comprehension or with pandas' string methods.
The list comprehension approach is:
vc = df['Series'].value_counts()
singletons = set(vc[vc == 1].index)  # values that appear only once
u = [s not in singletons for s in df['Series']]
df = df[u]
The other approach is to use the str.contains method to check whether the values of the Series column contain a given string or match a given regular expression (used in this case as we are using multiple strings):
vc = df['Series'].value_counts()
pat = r'|'.join(vc[vc==1].index) #Regular expression
df = df[~df['Series'].str.contains(pat)] #Tilde is to negate boolean
Using this regular expressions approach is a bit more hackish and may require some extra processing (character escaping, etc) on pat in case you have regex metacharacters in the strings you want to filter out (which requires some basic regex knowledge). However, it's worth noting this approach is about 4x faster than using the list comprehension approach (tested on the data provided in the question).
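For instance, a minimal sketch of that escaping step (my addition, using the standard library's re.escape):
import re
# escape any regex metacharacters so each value is matched literally
pat = r'|'.join(re.escape(s) for s in vc[vc == 1].index)
df = df[~df['Series'].str.contains(pat)]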
As a side note, I recommend avoiding using the word Series as a column name as that's the name of a pandas object.
This is an old question, but the existing answer doesn't scale to moderately large dataframes. A much faster and more "dataframe" way is to add a value-count column and filter on it.
Create the dataset:
df = pd.DataFrame({'Country': 'Bolivia Kenya Ukraine US Bolivia Kenya Ukraine US Bolivia Kenya Ukraine US'.split(),
'Series': 'Pop Pop Pop Pop GDP GDP GDP GDP McDonalds Schools Cars Tshirts'.split()})
Drop rows that have a count of 1 for the column ('Series' in this case):
# Group values for Series and add 'cnt' column with count
df['cnt'] = df.groupby(['Series'])['Country'].transform('count')
# Drop indexes for count value == 1, and dropping 'cnt' column
df.drop(df[df.cnt==1].index)[['Country','Series']]
I am performing data cleaning on a .csv file for analytics. I am trying to delete the rows that have null values in their columns, in Python.
Sample file:
Unnamed: 0 2012 2011 2010 2009 2008 2005
0 United States of America 760739 752423 781844 812514 843683 862220
1 Brazil 732913 717185 715702 651879 649996 NaN
2 Germany 520005 513458 515853 519010 518499 494329
3 United Kingdom (England and Wales) 310544 336997 367055 399869 419273 541455
4 Mexico 211921 212141 230687 244623 250932 239166
5 France 193081 192263 192906 193405 187937 148651
6 Sweden 87052 89457 87854 86281 84566 72645
7 Romania 17219 12299 12301 9072 9457 8898
8 Nigeria 15388 NaN 18093 14075 14692 NaN
So far I have used:
from pandas import read_csv
link = "https://docs.google.com/spreadsheets......csv"
data = read_csv(link)
data.head(100000)
How can I delete these rows?
Once you have your data loaded you just need to figure out which rows to remove:
# isna handles mixed string/numeric columns (np.isnan would fail on the country column)
bad_rows = data.isna().any(axis=1)
Then:
data[~bad_rows].head(100)
You need to use the dropna method to remove these values. Passing in how='any' into the method as an argument will remove the row if any of the values is null and how='all' will only remove the row if all of the values are null.
cleaned_data = data.dropna(how='any')
Edit 1.
It's worth noting that you may not want to create a copy of your cleaned data (i.e. cleaned_data = data.dropna(how='any')).
To save memory you can pass in the inplace option that will modify your original DataFrame and return None.
data.dropna(how='any', inplace=True)
data.head(100)
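As a further option (a sketch of mine, using column names from the sample above), dropna also accepts a subset argument if you only care about NaNs in particular columns:
# drop rows only when one of the listed year columns is NaN
data.dropna(subset=['2011', '2005'], how='any', inplace=True)
data.head(100)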