Is there an efficient way to merge two tables? [duplicate] - python

This question already has answers here:
Pandas Merging 101
(8 answers)
Closed 3 years ago.
I have a question about merging two tables. Say I have a table A with columns Country, City, and Zip Code. I also have a table B with unique country names and a column that specifies which continent each country is located in (NA, Asia, EU, etc.).
How can I merge the two tables into one so that I have the columns Country, City, and Zip Code, plus a column with the corresponding continent from table B?
Many thanks!

You can make use of the pd.merge function.
Example: you have a "country" df with "country", "city" and "zipcode" columns and a "continent" df with "country" and "continent" columns. Use the pd.merge function on the common column "country":

import pandas as pd

# Table A: one row per (country, city, zipcode)
country = pd.DataFrame([['country1','city1','zip1'],['country1','city1','zip2'],['country1','city2','zip3'],['country1','city2','zip4'],
                        ['country2','city3','zip5'],['country2','city3','zip6'],['country2','city4','zip7'],
                        ['country3','city5','zip8'],['country3','city6','zip9']],
                       columns=['country','city','zipcode'])

# Table B: one row per unique country
continent = pd.DataFrame([['country1','A'],['country2','B'],['country3','C'],['country4','D'],['country5','E']],
                         columns=['country','continent'])

# Merge on the shared "country" column
country = country.merge(continent, on=['country'])
print(country)
Output:

    country   city zipcode continent
0  country1  city1    zip1         A
1  country1  city1    zip2         A
2  country1  city2    zip3         A
3  country1  city2    zip4         A
4  country2  city3    zip5         B
5  country2  city3    zip6         B
6  country2  city4    zip7         B
7  country3  city5    zip8         C
8  country3  city6    zip9         C
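Note that merge performs an inner join by default, so countries present in only one table (country4 and country5 above) are dropped from the result. If you would rather keep every row of the left table and get NaN where no continent is known, a minimal variation is:

country = country.merge(continent, on=['country'], how='left')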

Related

combining dataframes that have the same 'country name' and same 'year'

I'm trying to merge these dataframes so that the final data frame matches the country-year GDP from the first dataframe with its corresponding values from the second data frame.
first data frame:

| Country  | Country code | year  | rgdpe   |
|:---------|:------------:|:-----:|:-------:|
| country1 | Code1        | year1 | rgdpe1  |
| country1 | Code1        | yearn | rgdpen  |
| country2 | Code2        | year1 | rgdpe1' |
second dataframe:

| countries | value  | year  |
|:----------|:------:|:-----:|
| country1  | value1 | year1 |
| country1  | valuen | yearn |
| country2  | Code2  | year1 |
combined dataframe:
| Country | Country code | year |rgdpe |value|
|:--------|:------------:|:----:|:-----:|:---:|
|country1 | Code1 | year1|rgdpe1 |value|
|country1 | Code1 | yearn|rgdpen |Value|
|country2 | Code2 | year1|rgdpe1'|Value|
combined=pd.merge(left=df_biofuel_prod, right=df_GDP[['rgdpe']], left_on='Value', right_on='country', how='right')
combined.to_csv('../../combined_test.csv')
The result of this code gives me just the rgdpe column while the other columns are empty.
What would be the most efficient way to merge and match these dataframes?
First, from the data screen cap, it looks like the "country" column in your first dataset "df_GDP" is set as the index. Reset it using reset_index(). Then merge on multiple columns with left_on=["countries","year"] and right_on=["country","year"]. Since you want to retain all records from your main dataframe "df_biofuel_prod", it should be a "left" join:
combined_df = df_biofuel_prod.merge(df_GDP.reset_index(), left_on=["countries","year"], right_on=["country","year"], how="left")
Full example with dummy data:
import pandas as pd

df_GDP = pd.DataFrame(data=[["USA",2001,400],["USA",2002,450],["CAN",2001,150],["CAN",2002,170]], columns=["country","year","rgdpe"]).set_index("country")
df_biofuel_prod = pd.DataFrame(data=[["USA",400,2001],["USA",450,2003],["CAN",150,2001],["CAN",170,2003]], columns=["countries","Value","year"])

# Bring "country" back from the index, then left-join on both keys
combined_df = df_biofuel_prod.merge(df_GDP.reset_index(), left_on=["countries","year"], right_on=["country","year"], how="left")
[Out]:

  countries  Value  year country  rgdpe
0       USA    400  2001     USA  400.0
1       USA    450  2003     NaN    NaN
2       CAN    150  2001     CAN  150.0
3       CAN    170  2003     NaN    NaN
You see "NaN" where matching data is not available in "df_GDP".

How to extract unique values from pandas column where values are in list

I want to extract the unique cities from the City column in a pandas dataframe. The City column has its values stored as lists. How would I extract the city frequencies, like:
Lahore 3
Karachi 2
Sydney 1
etc.
Sample dataframe:
       Name  Age                         City
a      jack   34              [Sydney, Delhi]
b      Riti   31              [Lahore, Delhi]
c      Aadi   16  [New York, Karachi, Lahore]
d     Mohit   32   [Peshawar, Delhi, Karachi]
Thank you
Let us try explode + value_counts
out = df.City.explode().value_counts()
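A caveat: explode works as shown only if City actually holds Python lists. If the values are strings that merely look like lists, e.g. "['Sydney', 'Delhi']" (common after a round-trip through CSV), parse them first; a minimal sketch assuming that string format:

import ast

df['City'] = df['City'].apply(ast.literal_eval)  # turn string representations into real lists
out = df.City.explode().value_counts()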

Selecting all values greater than a number in a pandas data frame

I have a dataframe like this with more than 50 columns (for years from 1963 to 2016). I want to select all countries with a population over a certain number (say 60 million). All the questions I found were about picking values from a single column, which is not the case here. I also tried df[df.T[(df.T > 0.33)].any()], as suggested in an answer; it doesn't work. Any ideas?
The data frame looks like this:
  Country Country_Code  Year_1979  Year_1999   Year_2013
    Aruba          ABW    59980.0      89005    103187.0
   Angola          AGO  8641521.0   15949766  25998340.0
  Albania          ALB  2617832.0    3108778   2895092.0
  Andorra          AND    34818.0      64370     80788.0
First, filter only the columns with Year in their names using DataFrame.filter, compare all values, and then use DataFrame.any to test for at least one match per row:

df1 = df[(df.filter(like='Year') > 2000000).any(axis=1)]
print(df1)

   Country Country_Code  Year_1979  Year_1999   Year_2013
1   Angola          AGO  8641521.0   15949766  25998340.0
2  Albania          ALB  2617832.0    3108778   2895092.0
Or compare all columns except the first two, selected by position with DataFrame.iloc:

df1 = df[(df.iloc[:, 2:] > 2000000).any(axis=1)]
print(df1)

   Country Country_Code  Year_1979  Year_1999   Year_2013
1   Angola          AGO  8641521.0   15949766  25998340.0
2  Albania          ALB  2617832.0    3108778   2895092.0
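Both variants keep a country if any single year exceeds the threshold. If you instead want countries above the threshold in every year, replace any with all; a minimal variation:

df1 = df[(df.filter(like='Year') > 2000000).all(axis=1)]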

pandas replace NaNs with mode of another column based on second column

I have a pandas dataframe with two columns, city and country. Both city and country contain missing values. Consider this data frame:

import numpy as np
import pandas as pd

temp = pd.DataFrame({"country": ["country A", "country A", "country A", "country A", "country B", "country B", "country B", "country B", "country C", "country C", "country C", "country C"],
                     "city": ["city 1", "city 2", np.nan, "city 2", "city 3", "city 3", np.nan, "city 4", "city 5", np.nan, np.nan, "city 6"]})
I now want to fill the NaNs in the city column with the mode of that country's city values in the remaining data frame, e.g. for country A: city 1 is mentioned once and city 2 twice; thus, fill the city column at index 2 with city 2, and so on.
I have done
cities = [city for city in temp["country"].value_counts().index]
modes = temp.groupby(["country"]).agg(pd.Series.mode)
dict_locations = modes.to_dict(orient="index")

new_dict_locations = {}
for k in dict_locations.keys():
    new_dict_locations[k] = dict_locations[k]["city"]
Now having the value of the country and the corresponding city mode, I face two issues:
First: the case country C is bimodal - the key contains two entries. I want this key to refer to each of the entries with equal probability. The real data set has multiple modes, so it would be a list of len > 2.
Second: I'm stuck replacing the NaNs in city with the value corresponding to the same row's country cell in new_dict_locations. In pseudo-code, this would be: go through the column 'city'; if you find a missing value at position temp[i, 'city'], take the value of 'country' in that row (-> country_tmp); use country_tmp as the key into the dictionary new_dict_locations; if the dictionary entry at key country_tmp is a list, randomly select one item from it; take the returned value (-> city_tmp) and fill the missing cell temp[i, 'city'] with city_tmp.
I've tried using different combinations of .fillna() and .replace() (and read this and other questions) to no avail.* Can someone give me a pointer?
Many thanks in advance.
* (Note: the referenced question replaces values in one cell according to a dict; my reference values are, however, in a different column.)
** EDIT **
Executing temp["city"].fillna(temp['country'], inplace=True) and temp.replace({'city': dict_locations}) gives me an error: TypeError: unhashable type: 'dict'. [This error is TypeError: unhashable type: 'numpy.ndarray' for the original data set, but I cannot reproduce it with an example; if someone knows the source of the difference, I'd be super happy to hear their thoughts.]
Try map with the dict new_dict_locations to create a new series s, then map again on s with np.random.choice to pick one value from each modes array. Finally, use s to fill the NaNs:

s = (temp.country.map(new_dict_locations)
         .map(lambda x: np.random.choice(x) if isinstance(x, np.ndarray) else x))
temp['city'] = temp.city.fillna(s)
temp
Out[247]:
      country    city
0   country A  city 1
1   country A  city 2
2   country A  city 2
3   country A  city 2
4   country B  city 3
5   country B  city 3
6   country B  city 3
7   country B  city 4
8   country C  city 5
9   country C  city 6
10  country C  city 5
11  country C  city 6
Note: I thought the two map calls might be combined into one using a dict comprehension. However, doing so would pick the random city once per country up front and lose the per-row randomness.
def get_mode(d):
    # Where a country has several modes (a numpy array), pick one at random;
    # np.random.choice is uniform by default, so no probability list is needed
    for k, v in d.items():
        if isinstance(v, np.ndarray) and len(v) > 1:
            d[k] = np.random.choice(v)
    return d
The dictionary below is the one that will be used for filling.

new_dict_locations = get_mode(new_dict_locations)
keys = list(new_dict_locations.keys())
values = list(new_dict_locations.values())

# Filling happens here
temp.city = temp.city.fillna(temp.country).replace(keys, values)

This gives the desired output:
      country    city
0   country A  city 1
1   country A  city 2
2   country A  city 2
3   country A  city 2
4   country B  city 3
5   country B  city 3
6   country B  city 3
7   country B  city 4
8   country C  city 5
9   country C  city 5
10  country C  city 5
11  country C  city 6
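One practical note on both answers: they draw from np.random, so the fill for the bimodal country C can differ between runs. If reproducibility matters, a seeded generator can be substituted into the first approach; a minimal sketch with an arbitrary seed:

rng = np.random.default_rng(42)  # arbitrary fixed seed for repeatable fills
s = (temp.country.map(new_dict_locations)
         .map(lambda x: rng.choice(x) if isinstance(x, np.ndarray) else x))
temp['city'] = temp.city.fillna(s)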

Add a value to a new column on Data frame that depends on the value on another Data frame [duplicate]

This question already has answers here:
Pandas Merging 101
(8 answers)
Closed 4 years ago.
I have two data frames, df1 and df2. df1 has entries of amounts spent by users, and each user can have several entries with different amount values.
The second data frame just holds the information for every user (each user is unique in this data frame).
I want to create a new column on df1 that includes the country value of each unique user from df2.
Any help will be appreciated
df1

      name_id         Dept  amt_spent
0     Alex-01  Engineering          5
1      Bob-01      Finance          5
2  Charles-01           HR         10
3    David-01           HR          6
4     Alex-01  Engineering         50

df2

      name_id Country
0     Alex-01      UK
1      Bob-01     USA
2  Charles-01   GHANA
3    David-01  BRAZIL

Result

      name_id         Dept  amt_spent Country
0     Alex-01  Engineering          5      UK
1      Bob-01      Finance          5     USA
2  Charles-01           HR         10   GHANA
3    David-01           HR          6  BRAZIL
4     Alex-01  Engineering         50      UK
This should work:
df = pd.merge(df1, df2)
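By default, pd.merge joins on every column name the two frames share (here only name_id) and performs an inner join, which is why the one-liner works. Spelling the keys out guards against accidental shared columns; an equivalent explicit form:

df = pd.merge(df1, df2, on='name_id', how='inner')

If some users in df1 could be missing from df2, how='left' would keep those rows with NaN in Country.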
