How to fill pandas dataframe columns with random dictionary values - python

I'm new to Pandas and I would like to play with random text data. I am trying to add 2 new columns to a DataFrame df, each of which would be filled by a key (newcol1) + value (newcol2) randomly selected from a dictionary.
countries = {'Africa':'Ghana','Europe':'France','Europe':'Greece','Asia':'Vietnam','Europe':'Lithuania'}
My df already has 2 columns and I'd like something like this:
   Year Approved Continent    Country
0  2016      Yes    Africa      Ghana
1  2016      Yes    Europe  Lithuania
2  2017       No    Europe     Greece
I can certainly use a for or while loop to fill df['Continent'] and df['Country'], but I sense .apply() and np.random.choice may provide a simpler, more pandorable solution for that.

Yep, you're right. You can use np.random.choice with map:
df
   Year Approved
0  2016      Yes
1  2016      Yes
2  2017       No
df['Continent'] = np.random.choice(list(countries), len(df))
df['Country'] = df['Continent'].map(countries)
df
   Year Approved Continent    Country
0  2016      Yes    Africa      Ghana
1  2016      Yes      Asia    Vietnam
2  2017       No    Europe  Lithuania
You choose len(df) keys at random from the country key-list, and then use the country dictionary as a mapper to find the country equivalents of the picked keys. (Note that a Python dict keeps only one value per key, so the duplicate 'Europe' entries in the question's dictionary collapse to the last one, 'Lithuania'.)
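For reference, a minimal self-contained version of this approach (the toy df and the de-duplicated countries dict are assumptions reconstructed from the question):

import numpy as np
import pandas as pd

df = pd.DataFrame({'Year': [2016, 2016, 2017], 'Approved': ['Yes', 'Yes', 'No']})
countries = {'Africa': 'Ghana', 'Europe': 'Lithuania', 'Asia': 'Vietnam'}

df['Continent'] = np.random.choice(list(countries), len(df))  # len(df) random keys
df['Country'] = df['Continent'].map(countries)                # look up the matching values
print(df)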

You could also try using DataFrame.sample():
df.join(
    pd.DataFrame(list(countries.items()), columns=["continent", "country"])
      .sample(len(df), replace=True)
      .reset_index(drop=True)
)
This can be made faster if your continent-country map is already a dataframe.
If you're on Python 3.6 or later, another method would be to use random.choices():
from random import choices

df.join(
    pd.DataFrame(choices([*countries.items()], k=len(df)), columns=["continent", "country"])
)
random.choices() is similar to numpy.random.choice() except that you can pass a list of key-value tuple pairs whereas numpy.random.choice() only accepts 1-D arrays.
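A runnable sketch of the random.choices() route, again assuming the toy df from the question:

from random import choices
import pandas as pd

df = pd.DataFrame({'Year': [2016, 2016, 2017], 'Approved': ['Yes', 'Yes', 'No']})
countries = {'Africa': 'Ghana', 'Europe': 'Lithuania', 'Asia': 'Vietnam'}

# choices() samples with replacement directly from the (key, value) tuples.
picked = choices([*countries.items()], k=len(df))
print(df.join(pd.DataFrame(picked, columns=['continent', 'country'])))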

Related

Merging data in Pandas dataframe by values in a column

I have a pandas dataframe like this:
Name  Year  Sales
Ann   2010    500
Ann   2011    500
Bob   2010    400
Bob   2011    700
Ed    2010    300
Ed    2011    300
I want to be able to combine the figures in the Sales column for each name, returning:
Name  Sales
Ann    1000
Bob    1100
Ed      600
Perhaps I need a for loop to go through and combine the two values for both years into a new column, but I'm not quite sure. Is there a pandas function that can help me with this?
That's a simple dataframe groupby.
In that case you'll just have to select the two columns you need:
df = df[["Name", "Sales"]]
And then apply the groupby:
df.groupby(["Name"], as_index=False).sum()
By default the groupby will make the grouped-by columns part of the index. If you want to keep them as columns you need to specify as_index=False.
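A minimal end-to-end sketch, with the data reconstructed from the question:

import pandas as pd

df = pd.DataFrame({'Name': ['Ann', 'Ann', 'Bob', 'Bob', 'Ed', 'Ed'],
                   'Year': [2010, 2011, 2010, 2011, 2010, 2011],
                   'Sales': [500, 500, 400, 700, 300, 300]})

# Sum Sales per Name; as_index=False keeps Name as a regular column.
print(df[['Name', 'Sales']].groupby('Name', as_index=False).sum())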

merge two datasets to find a mean

I have two similar looking tables:
df1:
country  type     mean  count  last_checked_date
Brazil   Weather  x     2      2022-02-13
Brazil   Corona   y     3      2022-02-13
China    Corona   z     1      2022-02-13
China    Fruits   s     2      2022-02-13
df2:
country  type     mean  count  last_checked_date
Ghana    Weather  a     2      2022-02-13
Brazil   Corona   b     5      2022-02-13
China    Corona   c     1      2022-02-13
Germany  Fruits   d     2      2022-02-13
I want to join df2 with df1 such that no combination of country, type is lost. For each combination of country and type, I want to calculate a mean value with this formula:
def find_new_values(old_mean, new_mean, old_count, new_count):
    mean = (old_mean + new_mean) / (old_count + new_count)
    count = old_count + new_count
    return mean, count
For example, in df2, China, Corona is present in df1 as well so the mean would be (c+z)/(1+1)
However, Ghana, Weather is present in df2 but not in df1, so in this case I want to simply add that row to df1 as it is, without the formula calculation.
How can I achieve this? What's the correct join/merge type to use here?
We may consider the problem this way: first combine them into one table,
df = pd.concat([df1, df2])
then use groupby to apply aggregations to each group of rows that share the same country and type.
df.groupby(['country', 'type']).agg({'mean': 'mean', 'count': 'sum'})
For a country-type combination that occurs in only one of the dataframes, the corresponding group will contain a single row, so the aggregation functions won't change anything.
You may add 'last_checked_date': 'last' to the agg dict if needed.
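A runnable sketch of this answer; the numeric mean values are placeholders, since the question uses letters:

import pandas as pd

df1 = pd.DataFrame({'country': ['Brazil', 'Brazil', 'China', 'China'],
                    'type': ['Weather', 'Corona', 'Corona', 'Fruits'],
                    'mean': [1.0, 2.0, 3.0, 4.0],
                    'count': [2, 3, 1, 2]})
df2 = pd.DataFrame({'country': ['Ghana', 'Brazil', 'China', 'Germany'],
                    'type': ['Weather', 'Corona', 'Corona', 'Fruits'],
                    'mean': [5.0, 6.0, 7.0, 8.0],
                    'count': [2, 5, 1, 2]})

# Stack both tables, then aggregate each (country, type) group.
out = (pd.concat([df1, df2])
         .groupby(['country', 'type'], as_index=False)
         .agg({'mean': 'mean', 'count': 'sum'}))
print(out)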

Split DataFrame as per series values

I am working on a Netflix dataset where some columns have comma-separated values.
I would like to have a count of shows released per country, but the country column holds a comma-separated list of countries.
How do I split the data so that each row has a single country in the country column? For example, if one show releases in 3 countries (Norway, Iceland, United States), that row should appear 3 times, once per country:
show_id  country
s5       Norway
s5       Iceland
s5       United States
NOTE: Using pandas
You can split the comma-separated string into a list and then apply explode to that column:
df['country'] = df['country'].str.split(',')
df = df.explode('country')
print(df)
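A self-contained sketch of the same steps on a made-up row (the data is an assumption, since the question only shows an image):

import pandas as pd

df = pd.DataFrame({'show_id': ['s5'],
                   'country': ['Norway, Iceland, United States']})

df['country'] = df['country'].str.split(',')   # one list of countries per row
df = df.explode('country')                     # one row per list element
df['country'] = df['country'].str.strip()      # drop spaces left over after the commas
print(df)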

Create a new dataframe column by comparing two other columns in different dataframes [duplicate]

I have a DataFrame which contains Alpha 2 country codes (UK, ES, SL etc) and I need these to be the country names. I created a second data frame that has all the Alpha 2 country codes in one column and the corresponding names in another.
I'm trying to compare these two columns and then use the index to create the new column. However, I am struggling to do this without using a loop. I feel like there is a more efficient way to do this without looping?
I have tried using a for loop, iterating over:
cube_data = pd.DataFrame({'Country Code': ['UK', 'ES', 'SL']})
alpha2 = pd.DataFrame({'Code': ['ES', 'GH', 'UK', 'SL'],
                       'Name': ['Spain', 'Ghana', 'United Kingdom', 'Sierra Leone']})
cube_data
  Country Code
0           UK
1           ES
2           SL
alpha2
  Code            Name
0   ES           Spain
1   GH           Ghana
2   UK  United Kingdom
3   SL    Sierra Leone
I have used a for loop to iterate through the columns; when the code from cube_data is found in alpha2['Code'], the index is used to create a new series that has alpha2['Name'] at the position corresponding to cube_data.
The end result is:
cube_data
  Country Code            Name
0           UK  United Kingdom
1           ES           Spain
2           SL    Sierra Leone
Surely there is a better way to do this without looping? I have had a look at series.isin() and series.map() but these do not seem to provide the result I need.
Can this be done without a loop?
You can use pandas merge:
df = alpha2.merge(cube_data, left_on='Code', right_on='Country Code', how='inner').drop('Code', axis=1)
merge works like an SQL join: here we merge alpha2 with cube_data. We use the column 'Code' from alpha2 and 'Country Code' from cube_data to merge the two dataframes together, with 'inner' join logic, meaning that only values present in both dataframes are kept. Finally, we drop the column 'Code' from alpha2, which contains the same values as 'Country Code'. Note that an inner merge preserves the order of the left frame's keys, so the result here follows alpha2's row order, not cube_data's.
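A small variant, not part of the original answer: if you want the result to keep cube_data's row order, merge from cube_data with a left join instead:

df = cube_data.merge(alpha2, left_on='Country Code', right_on='Code',
                     how='left').drop('Code', axis=1)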
Use map after converting alpha2 to a mappable object.
First we make our map:
>>> country_map = alpha2.set_index('Code')['Name'].to_dict()
>>> # country_map = dict(alpha2[['Code', 'Name']].values)
>>> # country_map = alpha2.set_index('Code')['Name']
>>> print(country_map)
{'ES': 'Spain', 'UK': 'United Kingdom', 'GH': 'Ghana', 'SL': 'Sierra Leone'}
Then we map it on the Country Code column:
>>> cube_data['Country'] = cube_data['Country Code'].map(country_map)
>>> print(cube_data)
  Country Code         Country
0           UK  United Kingdom
1           ES           Spain
2           SL    Sierra Leone
Have you looked into the pycountry module?
I've changed your 'UK' alpha_2 to 'GB' (the ISO 3166-1 alpha-2 code for the United Kingdom).
import pandas as pd
import pycountry

cube_data = pd.DataFrame({'Country Code': ['GB', 'ES', 'SL']})

for alpha2_code in cube_data['Country Code']:
    c = pycountry.countries.get(alpha_2=alpha2_code)
    print(c.name)
output:
United Kingdom
Spain
Sierra Leone
Using a lambda to create the new column:
df = cube_data
df['Name'] = df['Country Code'].apply(lambda x: pycountry.countries.get(alpha_2=x).name)
print(df)
output:
  Country Code            Name
0           GB  United Kingdom
1           ES           Spain
2           SL    Sierra Leone

Pandas: Delete rows of a DataFrame if total count of a particular column occurs only 1 time

I'm looking to delete rows of a DataFrame where the value in a particular column occurs only once in total.
Example of raw table (values are arbitrary for illustrative purposes):
print df
    Country      Series     Value
0   Bolivia  Population       123
1     Kenya  Population      1234
2   Ukraine  Population     12345
3        US  Population    123456
5   Bolivia         GDP     23456
6     Kenya         GDP    234567
7   Ukraine         GDP   2345678
8        US         GDP  23456789
9   Bolivia  #McDonalds      3456
10    Kenya    #Schools      3455
11  Ukraine       #Cars      3456
12       US    #Tshirts   3456789
Intended outcome:
print df
   Country      Series     Value
0  Bolivia  Population       123
1    Kenya  Population      1234
2  Ukraine  Population     12345
3       US  Population    123456
5  Bolivia         GDP     23456
6    Kenya         GDP    234567
7  Ukraine         GDP   2345678
8       US         GDP  23456789
I know that df.Series.value_counts()>1 will identify which df.Series values occur more than 1 time, and that the result will look something like the following:
Population     True
GDP            True
#McDonalds    False
#Schools      False
#Cars         False
#Tshirts      False
I want to write something like the following so that my new DataFrame drops the rows whose df.Series value occurs only 1 time, but this doesn't work:
df.drop(df.Series.value_counts()==1,axis=1,inplace=True)
You can do this by creating a boolean list/array, using either a list comprehension or the DataFrame's string manipulation methods.
The list comprehension approach is:
vc = df['Series'].value_counts()
u = [i not in set(vc[vc == 1].index) for i in df['Series']]  # True where the value occurs more than once
df = df[u]
The other approach is to use the str.contains method to check whether the values of the Series column match a given regular expression (used in this case because we are testing against multiple strings at once):
vc = df['Series'].value_counts()
pat = r'|'.join(vc[vc==1].index) #Regular expression
df = df[~df['Series'].str.contains(pat)] #Tilde is to negate boolean
Using this regular-expression approach is a bit more hackish and may require some extra processing (character escaping, etc.) on pat in case you have regex metacharacters in the strings you want to filter out (which requires some basic regex knowledge). However, it's worth noting this approach is about 4x faster than the list comprehension approach (tested on the data provided in the question).
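If the values may contain regex metacharacters, a small addition to the same approach (assuming the df from the question) escapes them before building the pattern:

import re

vc = df['Series'].value_counts()
pat = '|'.join(re.escape(s) for s in vc[vc == 1].index)  # escape any regex metacharacters
df = df[~df['Series'].str.contains(pat)]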
As a side note, I recommend avoiding using the word Series as a column name as that's the name of a pandas object.
This is an old question, but the current answer doesn't scale to moderately large dataframes. A much faster and more "dataframe" way is to add a value-count column and filter on it.
Create the dataset:
df = pd.DataFrame({'Country': 'Bolivia Kenya Ukraine US Bolivia Kenya Ukraine US Bolivia Kenya Ukraine US'.split(),
                   'Series': 'Pop Pop Pop Pop GDP GDP GDP GDP McDonalds Schools Cars Tshirts'.split()})
Drop the rows whose count for the column ('Series' in this case) is 1:
# Group values for Series and add 'cnt' column with count
df['cnt'] = df.groupby(['Series'])['Country'].transform('count')
# Drop indexes for count value == 1, and dropping 'cnt' column
df.drop(df[df.cnt==1].index)[['Country','Series']]
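The same filter can be written as a one-liner that doesn't mutate df (a small variant, not part of the original answer):

df[df.groupby('Series')['Country'].transform('count') > 1]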
