Given a dataframe such as this, is it possible to add up each country's value even if there are multiple countries in one row? For example, in the first row both Japan and USA are present, so I would want the value to count as Japan=1 and USA=1.
import pandas as pd
import numpy as np
countries=["Europe","USA","Japan"]
data= {'Employees':[1,2,3,4],
'Country':['Japan;USA','USA;Europe',"Japan","Europe;Japan"]}
df=pd.DataFrame(data)
print(df)
patt = '(' + '|'.join(countries) + ')'
grp = df.Country.str.extractall(pat=patt).values
new_df = df.groupby(grp).agg({'Employees': sum})
print(new_df)
I have tried this, but it returns a "Grouper and axis must be same length" error. Is this the correct way to do it?
ValueError Traceback (most recent call last)
<ipython-input-81-53e8e9f0f301> in <module>()
10 patt = '(' + '|'.join(countries) + ')'
11 grp = df.Country.str.extractall(pat=patt).values
---> 12 new_df = df.groupby(grp).agg({'Employees': sum})
13 print(new_df)
/usr/local/lib/python3.7/dist-packages/pandas/core/groupby/grouper.py in _convert_grouper(axis, grouper)
842 elif isinstance(grouper, (list, Series, Index, np.ndarray)):
843 if len(grouper) != len(axis):
--> 844 raise ValueError("Grouper and axis must be same length")
845 return grouper
846 else:
Thus, I would like the end result to be:
Japan: 8
Europe: 6
USA: 3
Thanks
Could you please try the following, written and tested with the shown samples. It uses the split, explode, and groupby functions of pandas.
df['Country'] = df['Country'].str.split(';')
df.explode('Country').groupby('Country')['Employees'].sum()
Output will be as follows:
Country
Europe 6
Japan 8
USA 3
Name: Employees, dtype: int64
Explanation: A simple explanation would be:
First, split the Country column of the DataFrame on ; and save the result back into the same column.
Then use explode on the Country column, group by Country, and apply sum to get the total of the Employees column for each country.
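For reference, here is a minimal end-to-end sketch with the question's sample data, showing the intermediate exploded frame:

import pandas as pd

df = pd.DataFrame({'Employees': [1, 2, 3, 4],
                   'Country': ['Japan;USA', 'USA;Europe', 'Japan', 'Europe;Japan']})

# split the semicolon-separated string into a list per row
df['Country'] = df['Country'].str.split(';')

# explode gives each country its own row, repeating that row's Employees value
exploded = df.explode('Country')
print(exploded)

# group the repeated rows by country and sum the employee counts
print(exploded.groupby('Country')['Employees'].sum())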
I am a beginner getting familiar with pandas.
It throws an error when I try to create a new column this way:
drinks['total_servings'] = drinks.loc[: ,'beer_servings':'wine_servings'].apply(calculate,axis=1)
Below is my code, and I get the following error for line number 9:
"Cannot set a DataFrame with multiple columns to the single column total_servings"
Any help or suggestion would be appreciated :)
import pandas as pd
drinks = pd.read_csv('drinks.csv')
def calculate(drinks):
    return drinks['beer_servings']+drinks['spirit_servings']+drinks['wine_servings']
print(drinks)
drinks['total_servings'] = drinks.loc[:, 'beer_servings':'wine_servings'].apply(calculate,axis=1)
drinks['beer_sales'] = drinks['beer_servings'].apply(lambda x: x*2)
drinks['spirit_sales'] = drinks['spirit_servings'].apply(lambda x: x*4)
drinks['wine_sales'] = drinks['wine_servings'].apply(lambda x: x*6)
drinks
In your code, when the function calculate is called with axis=1, each row of the DataFrame is passed to it as an argument. Here the function calculate returns a DataFrame with multiple columns, but you are trying to assign it to a single column, which is not possible. You can try updating your code to this:
def calculate(each_row):
    return each_row['beer_servings'] + each_row['spirit_servings'] + each_row['wine_servings']
drinks['total_servings'] = drinks.apply(calculate, axis=1)
drinks['beer_sales'] = drinks['beer_servings'].apply(lambda x: x*2)
drinks['spirit_sales'] = drinks['spirit_servings'].apply(lambda x: x*4)
drinks['wine_sales'] = drinks['wine_servings'].apply(lambda x: x*6)
print(drinks)
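As a side note, if the goal is only the row-wise total, a plain vectorized sum over the three columns avoids apply entirely; a minimal sketch, assuming the same column names as in the question:

# sum the three serving columns row by row without apply
drinks['total_servings'] = drinks[['beer_servings', 'spirit_servings', 'wine_servings']].sum(axis=1)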
I suppose the reason is the wrong argument name inside the calculate method: the given argument is drink, but drinks is used to calculate the sum of the columns.
The difference matters because drink is a Series object that represents a row, so the sum of its elements is a scalar. Meanwhile, drinks is a DataFrame, and the sum of its columns will be a Series object.
The sample code below shows that this method works.
import pandas as pd
df = pd.DataFrame({
    "A": [1, 1, 1, 1, 1],
    "B": [2, 2, 2, 2, 2],
    "C": [3, 3, 3, 3, 3]
})
def calculate(to_calc_df):
    return to_calc_df["A"] + to_calc_df["B"] + to_calc_df["C"]
df["total"] = df.loc[:, "A":"C"].apply(calculate, axis=1)
print(df)
Result
A B C total
0 1 2 3 6
1 1 2 3 6
2 1 2 3 6
3 1 2 3 6
4 1 2 3 6
I am working on the following dataset: https://drive.google.com/file/d/1UVgSfIO-46aLKHeyk2LuKV6nVyFjBdWX/view?usp=sharing
I am trying to replace the countries in the "Nationality" column whose value_counts() are less than 450 with the value of "Others".
def collapse_category(df):
    df.loc[df['Nationality'].map(df['Nationality'].value_counts(normalize=True)
                                 .lt(450)), 'Nationality'] = 'Others'
    print(df['Nationality'].unique())
This is the code I used, but it returns the result as just: ['Others']
Here is the link to my notebook for reference: https://colab.research.google.com/drive/1MfwwBfi9_4E1BaZcPnS7KJjTy8xVsgZO?usp=sharing
Use boolean indexing:
s = df['Nationality'].value_counts()
df.loc[df['Nationality'].isin(s[s<450].index), 'Nationality'] = 'Others'
New value_counts after the change:
FRA 12307
PRT 11382
DEU 10164
GBR 8610
Others 5354
ESP 4864
USA 3398
... ...
FIN 632
RUS 578
ROU 475
Name: Nationality, dtype: int64
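Since the linked dataset may not be available, here is a self-contained sketch of the same boolean-indexing idea on made-up data (the nationality codes and counts below are illustrative only):

import pandas as pd

# toy data: 'FRA' and 'PRT' are frequent, 'XYZ' is rare
df = pd.DataFrame({'Nationality': ['FRA'] * 500 + ['PRT'] * 460 + ['XYZ'] * 10})

s = df['Nationality'].value_counts()
rare = s[s < 450].index                                   # nationalities seen fewer than 450 times
df.loc[df['Nationality'].isin(rare), 'Nationality'] = 'Others'

print(df['Nationality'].value_counts())                   # 'XYZ' now appears under 'Others'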
value_filter = df.Nationality.value_counts().lt(450)                         # True for nationalities seen fewer than 450 times
temp_dict = value_filter[value_filter].replace({True: "others"}).to_dict()   # {rare nationality: "others"}
df = df.replace(temp_dict)
In general, the third line will apply the replacement to the entire df rather than to a particular column, but the code above will work for you.
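If you want to be safe about that, the replacement can be restricted to the column itself (a sketch, reusing the temp_dict built above):

df['Nationality'] = df['Nationality'].replace(temp_dict)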
So here, for example, I have two columns, Column1a and Column1b, and another three columns, Column2a, Column2b, and Column2c. I want to make an output column containing an array of combinations from Column1a to Column2c (if present), as given below.
At least one value from column group 1 and one from column group 2 must be present for the output.
Column1a Column1b Column2a Column2b Column2c OUTPUT
123A QWER ERTY 1256Y 234
3456 89AS
WERT 1234 9087
CVBT
OUTPUT should be as follows:
OUTPUT
["123A|ERTY","123A|1256Y","123A|234","QWER|ERTY","QWER|1256Y","QWER|234]
""
["WERT|1234","WERT|9087"]
""
Please help me with using a loop in such cases. Thanks
Here is the answer to your question:
import pandas as pd
import numpy as np
from itertools import product

# df=pd.read_excel('demo2.xlsx')
all_columns = list(df)  # Creates list of all column headers
df[all_columns] = df[all_columns].astype(str)

# pairs of column positions: (Column1a, Column1b) x (Column2a, Column2b, Column2c)
x = pd.DataFrame(list(product([0, 1], [2, 3, 4])), columns=['l1', 'l2'])

for j in range(len(df)):
    full = []
    if (((df.iloc[j, 0] == "nan") & (df.iloc[j, 1] == "nan")) |
            ((df.iloc[j, 2] == "nan") & (df.iloc[j, 3] == "nan") & (df.iloc[j, 4] == "nan"))):
        # no value in column group 1 or no value in column group 2
        full.append("")
    else:
        for k in range(len(x)):
            if df.iloc[j, x.iloc[k, 0]] != "nan":
                l1 = df.iloc[j, x.iloc[k, 0]]
                if df.iloc[j, x.iloc[k, 1]] != "nan":
                    l2 = df.iloc[j, x.iloc[k, 1]]
                    full.append(l1 + "|" + l2)
    df.loc[j, "OUTPUT"] = full
Output looks like this:
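For comparison, here is a more compact sketch of the same cross-join idea using a row-wise apply; the cross_join helper and its column checks are my own illustration, not the answer's code:

from itertools import product
import pandas as pd

def cross_join(row):
    # non-empty values from each column group
    left = [v for v in row[['Column1a', 'Column1b']] if pd.notna(v) and v != 'nan']
    right = [v for v in row[['Column2a', 'Column2b', 'Column2c']] if pd.notna(v) and v != 'nan']
    if not left or not right:
        return ""
    return [a + "|" + b for a, b in product(left, right)]

df['OUTPUT'] = df.apply(cross_join, axis=1)
print(df['OUTPUT'])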
I have a dataframe news_count. Here are its column names, from the output of news_count.columns.values:
[('date', '') ('EBIX UW Equity', 'NEWS_SENTIMENT_DAILY_AVG') ('Date', '')
('day', '') ('month', '') ('year', '')]
I need to group by year and month and sum the values of 'NEWS_SENTIMENT_DAILY_AVG'. Below is the code I tried, but neither attempt works:
Attempt 1
news_count.groupby(['year','month']).NEWS_SENTIMENT_DAILY_AVG.values.sum()
'AttributeError: 'DataFrameGroupBy' object has no attribute'
Attempt 2
news_count.groupby(['year','month']).iloc[:,1].values.sum()
AttributeError: Cannot access callable attribute 'iloc' of 'DataFrameGroupBy' objects, try using the 'apply' method
Input data:
ticker date EBIX UW Equity month year
field NEWS_SENTIMENT_DAILY_AVG
0 2007-05-25 0.3992 5 2007
1 2007-11-06 0.3936 11 2007
2 2007-11-07 0.2039 11 2007
3 2009-01-14 0.2881 1 2014
Extract the required columns from the dataframe into a news_count_res variable and then apply the aggregation function:
news_count_res = news_count[['year','month','NEWS_SENTIMENT_DAILY_AVG']]
news_count_res.groupby(['year','month']).sum()
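Note that the column names printed in the question are tuples, which suggests news_count has a MultiIndex on its columns; if the plain-name selection above fails for that reason, a possible workaround (a sketch, assuming the two-level structure shown) is to flatten the columns first:

# keep the second-level name where one exists ('NEWS_SENTIMENT_DAILY_AVG'), otherwise the first level
news_count.columns = [second if second else first for first, second in news_count.columns]

news_count_res = news_count[['year', 'month', 'NEWS_SENTIMENT_DAILY_AVG']]
print(news_count_res.groupby(['year', 'month']).sum())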
Thanks for the answers so far (I've made comments there, as I haven't got those solutions to work; maybe I'm not understanding something). In the meantime, I've also come up with another approach, which I still suspect isn't very Pythonic. It does get the job done and doesn't take too long for my purposes, but it would be great if I could figure out how to tweak the approaches suggested above to get them to work. Any thoughts are very welcome!
Here's what I've got:
import pandas as pd
import numpy as np
import math
y = ['Alex'] * 2321 + ['Doug'] * 34123 + ['Chuck'] * 2012 + ['Bob'] * 9281
z = ['xyz'] * len(y)
df = pd.DataFrame({'persons': y, 'data' : z})
percent = 10 #CHANGE AS NEEDED
#add a 'helper'column with random numbers
df['rand'] = np.random.random(df.shape[0])
df = df.sample(frac=1) #optional: this shuffles data, just to show order doesn't matter
#CREATE A HELPER LIST
helper = pd.DataFrame(df.groupby('persons')['rand'].count()).reset_index().values.tolist()
for row in helper:
    df_temp = df[df['persons'] == row[0]][['persons','rand']]
    lim = math.ceil(len(df_temp) * percent * 0.01)
    row.append(df_temp.nlargest(lim,'rand').iloc[-1][1])
def flag(name,num):
    for row in helper:
        if row[0] == name:
            if num >= row[2]:
                return 'yes'
            else:
                return 'no'
df['flag'] = df.apply(lambda x: flag(x['persons'], x['rand']), axis=1)
And to check the results:
piv = df.pivot_table(index="persons", columns="flag", values="data", aggfunc='count', fill_value=0)
piv = piv.append(piv.sum().rename('Total')).assign(Total=lambda x: x.sum(1))
piv['% selected'] = 100 * piv.yes/piv.Total
print(piv)
OUTPUT:
flag no yes Total % selected
persons
Alex 2088 233 2321 10.038776
Bob 8352 929 9281 10.009697
Chuck 1810 202 2012 10.039761
Doug 30710 3413 34123 10.002051
Total 42960 4777 47737 10.006913
Seems to work with different %s and different numbers of persons...but it would be nice to make it simpler, I think.
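One possibly simpler route, if a percentile-rank cutoff is acceptable instead of the exact math.ceil count, would be to rank the helper column within each group and flag the top slice directly; a rough sketch reusing df and percent from above (not tested against every edge case):

import numpy as np

# percentile rank of 'rand' within each person's group, largest value first
pct_rank = df.groupby('persons')['rand'].rank(pct=True, ascending=False)
df['flag'] = np.where(pct_rank <= percent / 100, 'yes', 'no')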
I have a bunch of data files with the columns 'Names', 'Gender', 'Count', one file per year. I need to concatenate all the files for some period, sum all counts for all unique names, and add a new column with the number of consonants. I can't extract the string value from 'Names'. How can I implement that?
Here is my code:
import os
import re
import pandas as pd
PATH = ...
def consonants_dynamics(years):
    names_by_year = {}
    for year in years:
        names_by_year[year] = pd.read_csv(PATH + "\\yob{}.txt".format(year), names=['Names', 'Gender', 'Count'])
    names_all = pd.concat(names_by_year, names=['Year', 'Pos'])
    dynamics = names_all.groupby('Names').sum().sort_values(by='Count', ascending=False).unstack('Names')
    dynamics['Consonants'] = dynamics.apply(count_vowels(dynamics.Names), axis=1)
    return dynamics.head(10)

def count_vowels(name):
    vowels = re.compile('A|E|I|O|U|a|e|i|o|u')
    return len(name) - len(vowels.findall(name))
If I run something like
a = consonants_dynamics(i for i in range (1900, 2001, 10))
I get the following error message
<ipython-input-9-942fc155267e> in consonants_dynamcis(years)
...
---> 12 dynamics['Consonants'] = dynamics.apply(count_vowels(dynamics.Names), axis = 1)
AttributeError: 'Series' object has no attribute 'Names'
I tried various ways but all failed. How can it be done?
After doing unstack, you converted dynamics to a Series object, so you no longer have a Names column to access as dynamics.Names. I think it should be fixed by removing .unstack('Names').
After that, use the index:
dynamics['Consonants'] = dynamics.reset_index()['Names'].apply(count_vowels).values
Convert index to_series and apply function:
print (dynamics)
Count
Names
James 2
John 3
Robert 10
def count_vowels(name):
    vowels = re.compile('A|E|I|O|U|a|e|i|o|u')
    return len(name) - len(vowels.findall(name))
dynamics['Consonants'] = dynamics.index.to_series().apply(count_vowels)
Solution without a function, using str.len and subtracting only the vowels counted by str.count:
pat = 'A|E|I|O|U|a|e|i|o|u'
s = dynamics.index.to_series()
dynamics['Consonants_new'] = s.str.len() - s.str.count(pat)
print (dynamics)
Count Consonants_new Consonants
Names
James 2 3 3
John 3 3 3
Robert 10 4 4
EDIT:
A solution without to_series is to add as_index=False to groupby so it returns a DataFrame:
names_all = pd.DataFrame({
    'Names': ['James', 'James', 'John', 'John', 'Robert', 'Robert'],
    'Count': [10, 20, 10, 30, 80, 20]
})

dynamics = (names_all.groupby('Names', as_index=False).sum()
                     .sort_values(by='Count', ascending=False))

pat = 'A|E|I|O|U|a|e|i|o|u'
dynamics['Consonants'] = dynamics['Names'].str.len() - dynamics['Names'].str.count(pat)
print (dynamics)
Names Count Consonants
2 Robert 100 4
1 John 40 3
0 James 30 3