I am new to pandas and python.
I am trying to group items by one column and list the information from the data frame per group.
My dataframe:
B C D E F
1 Honda USA 2000 Washington New
2 Honda USA 2001 Salt Lake Used
3 Ford Canada 2005 Washington New
4 Toyota USA 2010 Ney York Used
5 Honda USA 2001 Salt Lake Used
6 Honda Canada 2011 Salt Lake Crashed
7 Ford Italy 2014 Rome New
I am trying to group my dataframe by column B and list how many of each C, D, E, and F value fall into each group. For example, column B contains 4 Hondas, which I group together. For that group I then want to list: USA(3), Canada(1), 2000(1), 2001(2), 2011(1), Washington(1), Salt Lake(3), New(1), Used(2), Crashed(1), and do the same for every group (car make) in column B:
Car Country Year City Condition
1 Honda(4) USA(3) 2000(1) Washington(1) New(1)
Canada(1) 2001(2) Salt Lake(3) Used(2)
2011(1) Crashed(1)
2 Ford(2) Canada(1) 2005(1) Washington(1) New(2)
Italy(1) 2014(1) Rome(1)
...
What I've tried so far:
df.groupby(['B'])
Which gives me back <pandas.core.groupby.generic.DataFrameGroupBy object at 0x11d559080>
At this point, I am not sure how I should code moving on forward getting the desired results after grouping the column B.
Thank you for your suggestions.
You need a lambda with a custom function to process each column separately using Series.value_counts, then join the index values to the counts of the resulting Series:
def f(x):
    x = x.value_counts()
    y = x.index.astype(str) + '(' + x.astype(str) + ')'
    return y.reset_index(drop=True)

df1 = df.groupby(['B']).apply(lambda x: x.apply(f)).reset_index(drop=True)
print(df1)
B C D E F
0 Ford(2) Italy(1) 2014(1) Washington(1) New(2)
1 NaN Canada(1) 2005(1) Rome(1) NaN
2 Honda(4) USA(3) 2001(2) Salt Lake(3) Used(2)
3 NaN Canada(1) 2011(1) Washington(1) Crashed(1)
4 NaN NaN 2000(1) NaN New(1)
5 Toyota(1) USA(1) 2010(1) Ney York(1) Used(1)
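For reference, here is a minimal reconstruction of the sample frame from the question so the snippet above runs end to end (column labels B-F and the index as shown in the question):
import pandas as pd

df = pd.DataFrame({'B': ['Honda', 'Honda', 'Ford', 'Toyota', 'Honda', 'Honda', 'Ford'],
                   'C': ['USA', 'USA', 'Canada', 'USA', 'USA', 'Canada', 'Italy'],
                   'D': [2000, 2001, 2005, 2010, 2001, 2011, 2014],
                   'E': ['Washington', 'Salt Lake', 'Washington', 'Ney York', 'Salt Lake', 'Salt Lake', 'Rome'],
                   'F': ['New', 'Used', 'New', 'Used', 'Used', 'Crashed', 'New']},
                  index=range(1, 8))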
I am trying to replace values in a dataframe column with values from another dataframe, matched on a third column, while keeping the rest of the values from the first df.
# df1
country name value
romania john 100
russia emma 200
sua mark 300
china jack 400
# df2
name value
emma 2
mark 3
Desired result:
# df3
country name value
romania john 100
russia emma 2
sua mark 3
china jack 400
Thank you
One approach could be as follows:
Use Series.map on column name and turn df2 into a Series for mapping by setting its index to name (df.set_index).
Next, chain Series.fillna to replace NaN values with original values from df.value (i.e. whenever mapping did not result in a match) and assign to df['value'].
df['value'] = df['name'].map(df2.set_index('name')['value']).fillna(df['value'])
print(df)
country name value
0 romania john 100.0
1 russia emma 2.0
2 sua mark 3.0
3 china jack 400.0
N.B. The result will now contain floats. If you prefer integers, chain .astype(int) as well.
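A minimal sketch of that integer variant, assuming no NaN survives the fillna (every name either matches in df2 or keeps its original integer value):
df['value'] = df['name'].map(df2.set_index('name')['value']).fillna(df['value']).astype(int)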
Another option could be using pandas.DataFrame.update:
df1.set_index('name', inplace=True)
df1.update(df2.set_index('name'))
df1.reset_index(inplace=True)
name country value
0 john romania 100.0
1 emma russia 2.0
2 mark sua 3.0
3 jack china 400.0
Another option:
df3 = df1.merge(df2, on = 'name', how = 'left')
df3['value'] = df3.value_y.fillna(df3.value_x)
df3.drop(['value_x', 'value_y'], axis = 1, inplace = True)
# country name value
# 0 romania john 100.0
# 1 russia emma 2.0
# 2 sua mark 3.0
# 3 china jack 400.0
Reproducible data:
import pandas as pd

df1 = pd.DataFrame({'country': ['romania', 'russia', 'sua', 'china'],
                    'name': ['john', 'emma', 'mark', 'jack'],
                    'value': [100, 200, 300, 400]})
df2 = pd.DataFrame({'name': ['emma', 'mark'], 'value': [2, 3]})
I have a dataframe (3.7 million rows) with a column of different country names:
id Country
1 RUSSIA
2 USA
3 RUSSIA
4 RUSSIA
5 INDIA
6 USA
7 USA
8 ITALY
9 USA
10 RUSSIA
I want to replace INDIA and ITALY with "Miscellaneous" because each occurs in less than 15% of the rows in the column.
My alternate solution is to replace the names with their frequency using
df.column_name = df.column_name.map(df.column_name.value_counts())
Use:
df.loc[df.groupby('Country')['id']
         .transform('size')
         .div(len(df))
         .lt(0.15),
       'Country'] = 'Miscellaneous'
Or
df.loc[df['Country'].map(df['Country'].value_counts(normalize=True)
                                       .lt(0.15)),
       'Country'] = 'Miscellaneous'
If you want to put every country whose frequency is below a threshold into the "Misc" category:
threshold = 0.15
freq = df['Country'].value_counts(normalize=True)
mappings = freq.index.to_series().mask(freq < threshold, 'Misc').to_dict()
df['Country'].map(mappings)
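On the sample column above, INDIA and ITALY each account for 1 of 10 rows (10%), so the mapping dictionary would look roughly as sketched below; assigning the mapped result back completes the replacement:
# Rough shape of `mappings` for the sample data (sketch, not program output):
# {'RUSSIA': 'RUSSIA', 'USA': 'USA', 'INDIA': 'Misc', 'ITALY': 'Misc'}
df['Country'] = df['Country'].map(mappings)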
Here is another option:
s = df['Country'].value_counts()
s = s / s.sum()
s = s.loc[s < .15]
df = df.replace(s.index.tolist(), 'Miscellaneous')
You can use a dictionary and map for this:
d = df.Country.value_counts(normalize=True).to_dict()
df.Country.map(lambda x: x if d[x] > 0.15 else 'Miscellaneous')
Output:
id
1 RUSSIA
2 USA
3 RUSSIA
4 RUSSIA
5 Miscellaneous
6 USA
7 USA
8 Miscellaneous
9 USA
10 RUSSIA
Name: Country, dtype: object
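For reference, the sample frame can be reconstructed like this (assuming id is an ordinary column rather than the index):
import pandas as pd

df = pd.DataFrame({'id': range(1, 11),
                   'Country': ['RUSSIA', 'USA', 'RUSSIA', 'RUSSIA', 'INDIA',
                               'USA', 'USA', 'ITALY', 'USA', 'RUSSIA']})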
I have a dataset in which I add coordinates to cities based on zip-codes but several of these zip-codes are missing. Also, in some cases cities are missing, states are missing, or both are missing. For example:
ca_df[['OWNER_CITY', 'OWNER_STATE', 'OWNER_ZIP']]
OWNER_CITY OWNER_STATE OWNER_ZIP
495 MIAMI SHORE PA
496 SEATTLE
However, a second dataset has city, state & the matching zip-codes. This one is complete without any missing values.
df_coord.head()
OWNER_ZIP CITY STATE
0 71937 Cove AR
1 72044 Edgemont AR
2 56171 Sherburn MN
I want to fill in the missing zip-codes in the first dataframe if:
Zip-code is empty
City is present
State is present
This is an all-or-nothing operation: either all three criteria are met and the zip-code gets filled, or nothing changes.
However, this is a fairly large dataset with > 50 million records so ideally I want to vectorize the operation by working column-wise.
Technically, that would fit np.where, but as far as I know np.where only takes one condition in the following format:
df1['OWNER_ZIP'] = np.where(df["cond"] ==X, df_coord['OWNER_ZIP'], "")
How do I ensure I only fill missing zip-codes when all conditions are met?
Given ca_df:
OWNER_CITY OWNER_STATE OWNER_ZIP
0 Miami Shore Florida 111
1 Los Angeles California NaN
2 Houston NaN NaN
and df_coord:
OWNER_ZIP CITY STATE
0 111 Miami Shore Florida
1 222 Los Angeles California
2 333 Houston Texas
You can use pd.notna along with DataFrame.index.isin like this:
inferrable_zips_df = pd.notna(ca_df["OWNER_CITY"]) & pd.notna(ca_df["OWNER_STATE"])
is_inferrable_zip = ca_df.index.isin(df_coord[inferrable_zips_df].index)
ca_df.loc[is_inferrable_zip, "OWNER_ZIP"] = df_coord["OWNER_ZIP"]
with ca_df resulting as:
OWNER_CITY OWNER_STATE OWNER_ZIP
0 Miami Shore Florida 111
1 Los Angeles California 222
2 Houston NaN NaN
I've changed the "" to np.nan, but if you still wish to use "" then you just need to change pd.notna(ca_df[...]) to ca_df[...] != "".
You can combine multiple conditions in a single numpy.where call by joining the boolean masks with &. This gives you the array of row indices which satisfy all three rules:
np.where((df["OWNER_ZIP"] == X) & (df["CITY"] == Y) & (df["STATE"] == Z))
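As a sketch of applying that combined mask to the frames from the question (column names assumed from the sample data above), you could fill only the qualifying rows from a (city, state) -> zip lookup:
# Rows where the zip is missing but city and state are present.
mask = ca_df['OWNER_ZIP'].isna() & ca_df['OWNER_CITY'].notna() & ca_df['OWNER_STATE'].notna()
# (city, state) -> zip lookup built from the complete second dataframe.
lookup = df_coord.set_index(['CITY', 'STATE'])['OWNER_ZIP']
keys = pd.MultiIndex.from_frame(ca_df.loc[mask, ['OWNER_CITY', 'OWNER_STATE']])
ca_df.loc[mask, 'OWNER_ZIP'] = lookup.reindex(keys).to_numpy()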
Use:
print (df_coord)
OWNER_ZIP CITY STATE
0 71937 Cove AR
1 72044 Edgemont AR
2 56171 Sherburn MN
3 123 MIAMI SHORE PA
4 789 SEATTLE AA
print (ca_df)
OWNER_ZIP OWNER_CITY OWNER_STATE
0 NaN NaN NaN
1 72044 Edgemont AR
2 56171 NaN MN
3 NaN MIAMI SHORE PA
4 NaN SEATTLE NaN
First, it is necessary to test whether the matching columns have the same dtypes:
#or convert ca_df['OWNER_ZIP'] to integers
df_coord['OWNER_ZIP'] = df_coord['OWNER_ZIP'].astype(str)
print (df_coord.dtypes)
OWNER_ZIP object
CITY object
STATE object
dtype: object
print (ca_df.dtypes)
OWNER_ZIP object
OWNER_CITY object
OWNER_STATE object
dtype: object
Then filter for each combination of missing and non-missing columns, add the new data by merge, convert the index to match the filtered data, and assign back:
mask1 = ca_df['OWNER_CITY'].notna() & ca_df['OWNER_STATE'].notna() & ca_df['OWNER_ZIP'].isna()
df1 = ca_df[mask1].drop('OWNER_ZIP', axis=1).merge(df_coord.rename(columns={'CITY':'OWNER_CITY','STATE':'OWNER_STATE'})).set_index(ca_df.index[mask1])
ca_df.loc[mask1, ['OWNER_ZIP','OWNER_CITY','OWNER_STATE']] = df1
mask2 = ca_df['OWNER_CITY'].notna() & ca_df['OWNER_STATE'].isna() & ca_df['OWNER_ZIP'].isna()
df2 = ca_df[mask2].drop(['OWNER_ZIP','OWNER_STATE'], axis=1).merge(df_coord.rename(columns={'CITY':'OWNER_CITY','STATE':'OWNER_STATE'})).set_index(ca_df.index[mask2])
ca_df.loc[mask2, ['OWNER_ZIP','OWNER_CITY','OWNER_STATE']] = df2
mask3 = ca_df['OWNER_CITY'].isna() & ca_df['OWNER_STATE'].notna() & ca_df['OWNER_ZIP'].notna()
df3 = ca_df[mask3].drop(['OWNER_CITY'], axis=1).merge(df_coord.rename(columns={'CITY':'OWNER_CITY','STATE':'OWNER_STATE'})).set_index(ca_df.index[mask3])
ca_df.loc[mask3, ['OWNER_ZIP','OWNER_CITY','OWNER_STATE']] = df3
print (ca_df)
OWNER_ZIP OWNER_CITY OWNER_STATE
0 NaN NaN NaN
1 72044 Edgemont AR
2 56171 Sherburn MN
3 123 MIAMI SHORE PA
4 789 SEATTLE AA
You can do a left join on these dataframes, joining on the 'city' and 'state' columns. That gives you the zip-code corresponding to a city and state whenever both values are non-null in the first dataframe, and since it is a left join it also preserves the rows which either don't have a zip-code or have null/empty city and state values.
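A minimal sketch of that left-join idea, assuming the column names used earlier in the thread and matching dtypes:
merged = ca_df.merge(df_coord.rename(columns={'CITY': 'OWNER_CITY', 'STATE': 'OWNER_STATE'}),
                     on=['OWNER_CITY', 'OWNER_STATE'], how='left', suffixes=('', '_lookup'))
# Keep the existing zip where present, otherwise take the looked-up one.
merged['OWNER_ZIP'] = merged['OWNER_ZIP'].fillna(merged['OWNER_ZIP_lookup'])
merged = merged.drop(columns='OWNER_ZIP_lookup')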
I have a data frame which looks like this:
ga:country ga:hostname ga:pagePathLevel1 ga:pagePathLevel2 ga:keyword ga:adMatchedQuery ga:operatingSystem ga:hour ga:exitPagePath ga:sessions
0 (not set) de.google.com /beste-sms/ / +sms sms Germany best for Android 09 /beste-sms/ 1
1 (not set) de.google.com /beste-sms/ / +sms sms argentinien Macintosh 14 /beste-sms/ 1
2 (not set) de.google.com /beste-sms/ / +sms sms skandinav Android 18 /beste-sms/ 1
3 (not set) de.google.com /beste-sms/ / +sms sms skandinav Macintosh 20 /beste-sms/ 1
4 (not set) de.google.com /beste-sms/ / sms sms iOS 22 /beste-sms/ 1
... ... ... ... ... ... ... ... ... ... ...
85977 Yemen google.com /reviews/ /iphone/ 45to54 not set) Android 23 /reviews/iphone/ 1
85978 Yemen google.com /tr/ /best-sms/ sms sms Windows 10 /tr/best-sms/ 1
85979 Zambia google.com /best-sms/ /iphone/ +best +sms (not set) Android 16 /best-sms/iphone/ 1
85980 Zimbabwe google.com /reviews/ /testsms/ test test Windows 22 /reviews/testsms/ 1
85981 Zimbabwe google.com /reviews/ /testsms/ testsms testsms Windows 23 /reviews/testsms/ 1
I would like to group them by column ga:adMatchedQuery and get counts of each column's values within each group.
This question is a follow-up to the first question above, which may provide more information on what I am trying to achieve.
After using the same code structure that jezrael suggested:
def f(x):
    x = x.value_counts()
    y = x.index.astype(str) + ' (' + x.astype(str) + ')'
    return y.reset_index(drop=True)

df = df.groupby(['ga:adMatchedQuery']).apply(lambda x: x.apply(f))
print(df)
I get this result:
ga:country ga:hostname ga:pagePathLevel1 ga:pagePathLevel2 ga:keyword ga:adMatchedQuery ga:operatingSystem ga:hour ga:exitPagePath ga:sessions
United States(5683) google.com(14924) /us/(4187) /best-sms/(4565) Undetermined(1855) (not set)(15327) Windows(7616) 18(806) /reviews/testsms/(1880) 1(14005)
United Kingdom(1691) zh.google.com(170) /reviews/(4093) /testsms/(3561) free sms(1729) Android(4291) 20(805) /reviews/scandina/(1307) 2(815)
Canada(1201) t.google.com(80) /best-sms/(2169) /free-sms/(2344) +sms(1414) iOS(2136) 19(804) /best-sms/(1291) 3(231)
Indonesia(445) es.google.com(33) /coupons/(1264) /scandina/(1751) +free +sms(1008) Macintosh(978) 17(787) /coupons/testsms/holiday-deal/(760) 4(92)
Hong Kong(443) pl.google.com(33) /uk/(1172) /(1508) 25to34(988) Linux(160) 21(779) /coupons/scandina/holiday-deal/(239) 6(40)
Australia(353) fr.google.com(27) /ca/(886) /windows/(365) best sms(803) Chrome OS(73) 16(766) (not set)(112) 5(38)
Whereas I am trying to achieve this:
ga:adMatchedQuery ga:country ga:hostname
Undetermined(1855) United States(100) google.com(1000)
United Kingdom(200) zh.google.com(12)
free sms(1855) United States(100) google.com(1000)
United Kingdom(200) zh.google.com(12)
...
Thank you for your suggestions.
I think only the order of the columns is changed; you can apply this before my solution:
cols = df.columns.difference(['ga:adMatchedQuery'], sort=False).tolist()
df = df[['ga:adMatchedQuery'] + cols]
Sample with the data from the previous answer:
Here the data is grouped by column F; the order of the column names is unchanged:
def f(x):
    x = x.value_counts()
    y = x.index.astype(str) + '(' + x.astype(str) + ')'
    return y.reset_index(drop=True)

df1 = df.groupby(['F']).apply(lambda x: x.apply(f)).reset_index(drop=True)
print(df1)
B C D E F
0 Honda(1) Canada(1) 2011(1) Salt Lake(1) Crashed(1)
1 Ford(2) Italy(1) 2014(1) Washington(2) New(3)
2 Honda(1) Canada(1) 2005(1) Rome(1) NaN
3 NaN USA(1) 2000(1) NaN NaN
4 Honda(2) USA(3) 2001(2) Salt Lake(2) Used(3)
5 Toyota(1) NaN 2010(1) Ney York(1) NaN
Column names are reordered:
cols = df.columns.difference(['F'], sort=False).tolist()
df = df[['F'] + cols]
print (df)
F B C D E
1 New Honda USA 2000 Washington
2 Used Honda USA 2001 Salt Lake
3 New Ford Canada 2005 Washington
4 Used Toyota USA 2010 Ney York
5 Used Honda USA 2001 Salt Lake
6 Crashed Honda Canada 2011 Salt Lake
7 New Ford Italy 2014 Rome
def f(x):
    x = x.value_counts()
    y = x.index.astype(str) + '(' + x.astype(str) + ')'
    return y.reset_index(drop=True)

df1 = df.groupby(['F']).apply(lambda x: x.apply(f)).reset_index(drop=True)
print(df1)
F B C D E
0 Crashed(1) Honda(1) Canada(1) 2011(1) Salt Lake(1)
1 New(3) Ford(2) Italy(1) 2014(1) Washington(2)
2 NaN Honda(1) Canada(1) 2005(1) Rome(1)
3 NaN NaN USA(1) 2000(1) NaN
4 Used(3) Honda(2) USA(3) 2001(2) Salt Lake(2)
5 NaN Toyota(1) NaN 2010(1) Ney York(1)
I have the following dataset:
user artist sex country
0 1 red hot chili peppers f Germany
1 1 the black dahlia murder f Germany
2 1 goldfrapp f Germany
3 2 dropkick murphys f Germany
4 2 le tigre f Germany
.
.
289950 19718 bob dylan f Canada
289951 19718 pixies f Canada
289952 19718 the clash f Canada
I want to create a Boolean indicator matrix as a dataframe, with one row for each user and one column for each artist. Each cell should be 1 if that user has the artist and 0 otherwise.
Just to mention, there are 1004 unique artists and 15000 unique users—it’s a large data set.
I have created an empty matrix using the following:
pd.DataFrame(index=user, columns=artist)
I am having difficulty populating the dataframe correctly.
There is a method in pandas called notnull.
Suppose your dataframe is named df; you should use:
df['has_artist'] = df['artist'].notnull()
This will add a boolean column named has_artist to your dataframe.
If you want 0 and 1 instead, do:
df['has_artist'] = df['artist'].notnull().astype(int)
You can also store it in a different variable and not alter your dataframe.
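For the original user-by-artist indicator matrix, one possible sketch (not part of the answer above) is pd.crosstab, clipping the counts to 0/1:
# Rows are users, columns are artists, values are 0/1 indicators.
indicator = pd.crosstab(df['user'], df['artist']).clip(upper=1)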