merge two datasets to find a mean - python

I have two similar looking tables:
df1:
country  type     mean  count  last_checked_date
Brazil   Weather  x     2      2022-02-13
Brazil   Corona   y     3      2022-02-13
China    Corona   z     1      2022-02-13
China    Fruits   s     2      2022-02-13

df2:
country  type     mean  count  last_checked_date
Ghana    Weather  a     2      2022-02-13
Brazil   Corona   b     5      2022-02-13
China    Corona   c     1      2022-02-13
Germany  Fruits   d     2      2022-02-13
I want to join df2 with df1 such that no combination of country, type is lost. For each combination of country and type, I want to calculate a mean value with this formula:
def find_new_values(old_mean, new_mean, old_count, new_count):
    mean = (old_mean + new_mean) / (old_count + new_count)
    count = old_count + new_count
    return mean, count
For example, (China, Corona) is present in both df2 and df1, so its mean would be (c + z)/(1 + 1).
However, (Ghana, Weather) is present in df2 but not in df1; in that case I simply want to add the row to df1 as-is, without applying the formula.
How can I achieve this? What's the correct join/merge type to use here?

We can look at the problem this way: combine the two frames into one table,
df = pd.concat([df1, df2])
then use groupby to apply aggregations on each group of the rows that share the same country and type.
df.groupby(['country', 'type']).agg({'mean': 'mean', 'count': 'sum'})
For a country-type combination that occurs in only one of the dataframes, the corresponding group will contain just one row, and the aggregation functions won't change anything.
You may add 'last_checked_date': 'last' to the agg dict if needed.
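Putting it together, here is a minimal runnable sketch of that approach (the numbers are arbitrary stand-ins for the symbolic means x, y, z, s, a, b, c, d from the question):
import pandas as pd

# Toy frames mirroring df1 and df2 above; the mean values are placeholders.
df1 = pd.DataFrame({
    'country': ['Brazil', 'Brazil', 'China', 'China'],
    'type': ['Weather', 'Corona', 'Corona', 'Fruits'],
    'mean': [1.0, 2.0, 3.0, 4.0],
    'count': [2, 3, 1, 2],
    'last_checked_date': '2022-02-13',
})
df2 = pd.DataFrame({
    'country': ['Ghana', 'Brazil', 'China', 'Germany'],
    'type': ['Weather', 'Corona', 'Corona', 'Fruits'],
    'mean': [5.0, 6.0, 7.0, 8.0],
    'count': [2, 5, 1, 2],
    'last_checked_date': '2022-02-15',
})

combined = pd.concat([df1, df2])
result = (combined
          .groupby(['country', 'type'], as_index=False)
          .agg({'mean': 'mean', 'count': 'sum', 'last_checked_date': 'last'}))
print(result)
Combinations that appear in only one frame (e.g. Ghana/Weather) come through unchanged, while shared combinations get the averaged mean and the summed count.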

Related

Multiple similar columns with similar values

The dataframe looks like:
name    education    education_2   education_3
name_1  NaN          some college  NaN
name_2  NaN          NaN           graduate degree
name_3  high school  NaN           NaN
I just want to keep one education column. I tried a conditional statement comparing the columns to each other, but I got nothing but errors. I also looked into a merge-based solution, but in vain. Does anyone know how to deal with this using Python or pandas? Thank you in advance. The desired output is:
name    education
name_1  some college
name_2  graduate degree
name_3  high school
One day I hope pandas will have better functions for string-type rows, rather than the limited column-wise support currently available:
df['education'] = (df.filter(like='education')  # Filters to only education columns.
                     .T                         # Transposes.
                     .convert_dtypes()          # Converts to pandas dtypes, still somewhat in beta.
                     .max()                     # Gets the max value from the column, which will be the not-null one.
                   )
df = df[['name', 'education']]
print(df)
Output:
     name        education
0  name 1     some college
1  name 2  graduate degree
2  name 3      high school
Looping this wouldn't be too hard e.g.:
cols = ['education', 'age', 'income']
for col in cols:
    df[col] = df.filter(like=col).bfill(axis=1)[col]
df = df[['name'] + cols]
You can use df.fillna to do so.
df['combine'] = df[['education','education2','education3']].fillna('').sum(axis=1)
df
    name    education    education2       education3          combine
0  name1          NaN  some college              NaN     some college
1  name2          NaN           NaN  graduate degree  graduate degree
2  name3  high school           NaN              NaN      high school
If you have a lot of columns to combine, you can try this.
df['combine'] = df[df.columns[1:]].fillna('').sum(axis=1)
Use bfill to fill the empty (NaN) values:
df.bfill(axis=1).drop(columns=['education 2','education 3'])
     name        education
0  name 1     some college
1  name 2  graduate degree
2  name 3      high school
If there are other columns in between, choose which columns to apply bfill to. In essence, if you have multiple education columns that you need to consolidate into a single column, restrict the bfill to those columns and subsequently drop the columns you back-filled from:
df[['education','education 2','education 3']].bfill(axis=1).drop(columns=['education 2','education 3'])

calculate new mean using old mean

I had a dataset df that looked like this:
Value      themes   country  date
-1.975767  Weather  Brazil   2022-02-13
-0.540979  Fruits   China    2022-02-13
-2.359127  Fruits   China    2022-02-13
-2.815604  Corona   China    2022-02-13
-0.712323  Weather  UK       2022-02-13
-0.929755  Weather  Brazil   2022-02-13
I grouped by themes + country to calculate the mean and count for each combination of theme and country (e.g. Weather/Brazil or Weather/UK):
df_calculations = df.groupby(["themes", "country"], as_index=False)["Value"].mean()
df_calculations['count'] = df.groupby(["themes", "country"])["Value"].count().tolist()
Then I added this info to a new table df_avg that looks like this:
country  type     mean  count  last_checked_date
Brazil   Weather  x     2      2022-02-13   # same for all rows
Brazil   Corona   y            2022-02-13
China    Corona   z     1      2022-02-13
China    Fruits   s     2      2022-02-13
However, there are now additional rows in the same original df:
Value      themes   country  date
-1.975560  Weather  Brazil   2022-02-15
-0.540123  Fruits   China    2022-02-16
-2.359234  Fruits   China    2022-02-16
-2.359234  Corona   UK       2022-02-16
I want to go through the df rows whose date is after the last_checked_date.
Then I want to calculate a new mean for each combination again, but using the old mean and count from my df_avg table instead of re-calculating over the whole df.
How can I achieve this?
Please see this: Calculate new mean from old mean
Since you are maintaining a count (and if not, adding one is trivial), you can use it together with the existing mean to calculate the updated mean from the new observations.
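For illustration, a minimal sketch of the count-weighted update described in the linked answer (update_mean is a hypothetical helper; plug in the old mean/count stored in df_avg and the new values whose date is after last_checked_date):
def update_mean(old_mean, old_count, new_values):
    # Fold a batch of new observations into an existing mean
    # without re-reading the old raw rows.
    new_count = len(new_values)
    updated_count = old_count + new_count
    updated_mean = (old_mean * old_count + sum(new_values)) / updated_count
    return updated_mean, updated_count

# Example: an old mean of -1.45 built from 2 values, plus one new observation.
print(update_mean(-1.45, 2, [-1.975560]))   # -> (-1.62518..., 3)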

Pandas duplicated rows with missing values

Hello, I have a dataframe that contains duplicates.
df = pd.DataFrame({'id': [1, 1, 1],
                   'name': ['Hamburg', 'Hamburg', 'Hamburg'],
                   'country': ['Germany', 'Germany', None],
                   'state': [None, None, 'Hamburg']})
Removing the duplicates with df.drop_duplicates() returns:
   id     name  country    state
0   1  Hamburg  Germany     None
2   1  Hamburg     None  Hamburg
How can I configure drop_duplicates such that only one row is left, that contains all the information?
In the case where no single row has all the information at once, you can use groupby and first, but first replace None with np.nan so the missing values are handled:
import numpy as np

print(df.fillna(value=np.nan).groupby('id').first())
       name  country    state
id
1   Hamburg  Germany  Hamburg
In your very special case, here's my proposal:
import pandas

df = pandas.DataFrame({'id': [1, 1, 1, 2, 2],
                       'name': ['Hamburg', 'Hamburg', 'Hamburg', 'Paris', 'Paris'],
                       'country': ['Germany', 'Germany', None, None, 'France'],
                       'state': [None, None, 'Hamburg', 'Paris', None]})

df_result = pandas.DataFrame()
for id in df['id'].unique().tolist():
    df_subset = df[df['id'] == id].copy(deep=True)
    df_subset.sort_values(by=['id', 'name', 'country', 'state'], inplace=True)
    df_subset.bfill(inplace=True)
    df_subset.ffill(inplace=True)
    df_subset.drop_duplicates(inplace=True)
    df_result = pandas.concat([df_result, df_subset])  # DataFrame.append was removed in recent pandas
df = df_result
Out[18]:
   id     name  country    state
0   1  Hamburg  Germany  Hamburg
4   2    Paris   France    Paris
Subsetting the records prevents ffill or bfill from filling values across adjacent rows that belong to different ids.
Regards
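For what it's worth, here is a loop-free sketch of the same idea (not part of the original answer): fill within each id group only, then drop the duplicates that remain.
filled = (df.sort_values(['id', 'name', 'country', 'state'])
            .groupby('id', group_keys=False)
            .apply(lambda g: g.ffill().bfill()))   # fill only inside each id group
result = filled.drop_duplicates()
print(result)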

How to fill pandas dataframe columns with random dictionary values

I'm new to Pandas and I would like to play with random text data. I am trying to add 2 new columns to a DataFrame df, each filled with a key (newcol1) and the corresponding value (newcol2) randomly selected from a dictionary.
countries = {'Africa':'Ghana','Europe':'France','Europe':'Greece','Asia':'Vietnam','Europe':'Lithuania'}
My df already has 2 columns, and I'd like something like this:
   Year Approved Continent    Country
0  2016      Yes    Africa      Ghana
1  2016      Yes    Europe  Lithuania
2  2017       No    Europe     Greece
I can certainly use a for or while loop to fill df['Continent'] and df['Country'], but I sense that .apply() and np.random.choice may provide a simpler, more pandorable solution.
Yep, you're right. You can use np.random.choice with map:
df
   Year Approved
0  2016      Yes
1  2016      Yes
2  2017       No
df['Continent'] = np.random.choice(list(countries), len(df))
df['Country'] = df['Continent'].map(countries)
df
   Year Approved Continent    Country
0  2016      Yes    Africa      Ghana
1  2016      Yes      Asia    Vietnam
2  2017       No    Europe  Lithuania
You choose len(df) keys at random from the dictionary's keys, and then use the countries dictionary as a mapper to find the country equivalents of the picked keys.
You could also try using DataFrame.sample():
df.join(
    pd.DataFrame(list(countries.items()), columns=["continent", "country"])
      .sample(len(df), replace=True)
      .reset_index(drop=True)
)
This can be made faster if your continent-country map is already a dataframe.
If you're on Python 3.6, another method would be to use random.choices():
from random import choices

df.join(
    pd.DataFrame(choices([*countries.items()], k=len(df)), columns=["continent", "country"])
)
random.choices() is similar to numpy.random.choice() except that you can pass a list of key-value tuple pairs whereas numpy.random.choice() only accepts 1-D arrays.
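A quick illustration of that difference, assuming the countries dict from the question (the numpy call is left commented out because it rejects the non-1-D input):
import numpy as np
from random import choices

pairs = list(countries.items())      # list of (continent, country) tuples
print(choices(pairs, k=3))           # fine: samples whole tuples
# np.random.choice(pairs, 3)         # raises ValueError because pairs is not a 1-D array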

Pandas: Delete rows of a DataFrame if total count of a particular column occurs only 1 time

I'm looking to delete rows of a DataFrame where the value in a particular column occurs only one time in total.
Example of raw table (values are arbitrary for illustrative purposes):
print df
    Country      Series     Value
0   Bolivia  Population       123
1     Kenya  Population      1234
2   Ukraine  Population     12345
3        US  Population    123456
5   Bolivia         GDP     23456
6     Kenya         GDP    234567
7   Ukraine         GDP   2345678
8        US         GDP  23456789
9   Bolivia  #McDonalds      3456
10    Kenya    #Schools      3455
11  Ukraine       #Cars      3456
12       US    #Tshirts   3456789
Intended outcome:
print df
   Country      Series     Value
0  Bolivia  Population       123
1    Kenya  Population      1234
2  Ukraine  Population     12345
3       US  Population    123456
5  Bolivia         GDP     23456
6    Kenya         GDP    234567
7  Ukraine         GDP   2345678
8       US         GDP  23456789
I know that df.Series.value_counts() > 1 will identify which values of df.Series occur more than once, and that the returned output will look something like the following:
Population    True
GDP           True
#McDonalds    False
#Schools      False
#Cars         False
#Tshirts      False
I want to write something like the following so that my new DataFrame drops the rows whose df.Series value occurs only 1 time, but this doesn't work:
df.drop(df.Series.value_counts()==1,axis=1,inplace=True)
You can do this by creating a boolean list/array, using either a list comprehension or pandas' string manipulation methods.
The list comprehension approach is:
vc = df['Series'].value_counts()
u = [i not in set(vc[vc==1].index) for i in df['Series']]
df = df[u]
The other approach is to use the str.contains method to check whether the values of the Series column contain a given string or match a given regular expression (used in this case as we are using multiple strings):
vc = df['Series'].value_counts()
pat = r'|'.join(vc[vc==1].index) #Regular expression
df = df[~df['Series'].str.contains(pat)] #Tilde is to negate boolean
Using this regular expressions approach is a bit more hackish and may require some extra processing (character escaping, etc) on pat in case you have regex metacharacters in the strings you want to filter out (which requires some basic regex knowledge). However, it's worth noting this approach is about 4x faster than using the list comprehension approach (tested on the data provided in the question).
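If the values can contain regex metacharacters, one way to handle the escaping mentioned above is re.escape (a sketch reusing vc from the snippet above):
import re

pat = '|'.join(re.escape(s) for s in vc[vc == 1].index)
df = df[~df['Series'].str.contains(pat)]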
As a side note, I recommend avoiding using the word Series as a column name as that's the name of a pandas object.
This is an old question, but the current answer doesn't scale well to moderately large dataframes. A much faster and more "dataframe" way is to add a value-count column and filter on that count.
Create the dataset:
df = pd.DataFrame({'Country': 'Bolivia Kenya Ukraine US Bolivia Kenya Ukraine US Bolivia Kenya Ukraine US'.split(),
                   'Series': 'Pop Pop Pop Pop GDP GDP GDP GDP McDonalds Schools Cars Tshirts'.split()})
Drop the rows that have a count of 1 for the column ('Series' in this case):
# Group values for Series and add 'cnt' column with count
df['cnt'] = df.groupby(['Series'])['Country'].transform('count')
# Drop indexes for count value == 1, and dropping 'cnt' column
df.drop(df[df.cnt==1].index)[['Country','Series']]
