I have Excel files that contain two columns. I want to check every cell in column 1 against the data in column 2:
if the value of a cell in column 1 is present anywhere in column 2, the output should be 1, and if not, 0.
Here's the dataframe:
COLUMN 1 COLUMN 2
ZUBEDA SALIBOKO JUMANNE REDEMPTHA MATINDI
STEPHEN STAFFORD MIHUNGO PETER G. DATTAN
JUMANNE MWALIMU JOANES PETER LUGAZIA
HUWAIDA IDRISSA JUMBE HAMIS JUMA IDD ISAKA
AIDANIA LUAMBANO EDWIN MARTIN MUHONDEZI
KESSY BONIFAS FULANO RICHARD THOMAS MLIWA
KENEDY STEPHEN MSHOMI JUMANNE MWALIMU
JOANES PETER LUGAZIA ISAAC RUGEMALILA ABRAHAM
MWANAISHA MOHAMED MUNGIA ZAITUN SALUM MGAZA
PETRO ZACHARIA MAGANGA STEPHEN STAFFORD MIHUNGO
Desired output
COLUMN 1 COLUMN 2 RESULTS
ZUBEDA SALIBOKO JUMANNE REDEMPTHA MATINDI 0
STEPHEN STAFFORD MIHUNGO PETER G. DATTAN 1
JUMANNE MWALIMU JOANES PETER LUGAZIA 1
HUWAIDA IDRISSA JUMBE HAMIS JUMA IDD ISAKA 0
AIDANIA LUAMBANO EDWIN MARTIN MUHONDEZI 0
KESSY BONIFAS FULANO PETRO ZACHARIA MAGANGA 0
KENEDY STEPHEN MSHOMI JUMANNE MWALIMU 0
JOANES PETER LUGAZIA ISAAC RUGEMALILA ABRAHAM 0
MWANAISHA MOHAMED MUNGIA ZAITUN SALUM MGAZA 0
PETRO ZACHARIA MAGANGA STEPHEN STAFFORD MIHUNGO 1
My attempt:
df['RESULTS'] = df['COLUMN 1'] isin df['COLUMN 2']
You almost had it:
df["RESULTS"] = df["COLUMN 1"].isin(df["COLUMN 2"]).astype(int)
>>> df
COLUMN 1 COLUMN 2 RESULTS
0 ZUBEDA SALIBOKO JUMANNE REDEMPTHA MATINDI 0
1 STEPHEN STAFFORD MIHUNGO PETER G. DATTAN 1
2 JUMANNE MWALIMU JOANES PETER LUGAZIA 1
3 HUWAIDA IDRISSA JUMBE HAMIS JUMA IDD ISAKA 0
4 AIDANIA LUAMBANO EDWIN MARTIN MUHONDEZI 0
5 KESSY BONIFAS FULANO RICHARD THOMAS MLIWA 0
6 KENEDY STEPHEN MSHOMI JUMANNE MWALIMU 0
7 JOANES PETER LUGAZIA ISAAC RUGEMALILA ABRAHAM 1
8 MWANAISHA MOHAMED MUNGIA ZAITUN SALUM MGAZA 0
9 PETRO ZACHARIA MAGANGA STEPHEN STAFFORD MIHUNGO 0
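One caveat worth noting (my addition, not part of the answer above): isin does exact, case-sensitive matching, so stray whitespace or case differences read from Excel will produce 0 even for "matching" names. Normalizing both columns first is a cheap safeguard; the sample rows below are hypothetical:

```python
import pandas as pd

# Hypothetical rows with the kinds of mismatches Excel data often has:
# trailing whitespace and mixed case.
df = pd.DataFrame({'COLUMN 1': ['JUMANNE MWALIMU', 'zubeda saliboko'],
                   'COLUMN 2': ['Jumanne Mwalimu ', 'PETER G. DATTAN']})

# strip whitespace and upper-case both sides before the membership test
c1 = df['COLUMN 1'].str.strip().str.upper()
c2 = df['COLUMN 2'].str.strip().str.upper()
df['RESULTS'] = c1.isin(c2).astype(int)
print(df['RESULTS'].tolist())  # [1, 0]
```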
Use np.where together with isin (a plain == would only compare values on the same row, not membership anywhere in the column):
import numpy as np
df["RESULTS"] = np.where(df["COLUMN 1"].isin(df["COLUMN 2"]), 1, 0)
https://numpy.org/doc/stable/reference/generated/numpy.where.html
Related
I have a dataset with unique names. Another dataset contains several rows with the same names as in the first dataset.
I want to create a unique-id column in the first dataset, and a column in the second dataset carrying the matching id for every row with the same name.
For example:
Dataframe 1:
player_id Name
1 John Dosh
2 Michael Deesh
3 Julia Roberts
Dataframe 2:
player_id Name
1 John Dosh
1 John Dosh
2 Michael Deesh
2 Michael Deesh
2 Michael Deesh
3 Julia Roberts
3 Julia Roberts
I want to use both data frames to run deep feature synthesis using featuretools.
To be able to do something like this:
entity_set = ft.EntitySet("basketball_players")
entity_set.add_dataframe(dataframe_name="players_set",
                         dataframe=players_set,
                         index='name')
entity_set.add_dataframe(dataframe_name="season_stats",
                         dataframe=season_stats,
                         index='season_stats_id')
entity_set.add_relationship("players_set", "player_id", "season_stats", "player_id")
This should do what your question asks:
import pandas as pd
df1 = pd.DataFrame([
    'John Dosh',
    'Michael Deesh',
    'Julia Roberts'], columns=['Name'])
df2 = pd.DataFrame([
    ['John Dosh'],
    ['John Dosh'],
    ['Michael Deesh'],
    ['Michael Deesh'],
    ['Michael Deesh'],
    ['Julia Roberts'],
    ['Julia Roberts']], columns=['Name'])
print('inputs:', '\n')
print(df1)
print(df2)
# give each unique name a 1-based id, keeping the id column first
df1 = df1.reset_index().rename(columns={'index': 'id'}).assign(id=df1.index + 1)
# look up each row's id by name and move the id column to the front
df2 = df2.join(df1.set_index('Name'), on='Name')[['id'] + list(df2.columns)]
print('\noutputs:', '\n')
print(df1)
print(df2)
Input/output:
inputs:
Name
0 John Dosh
1 Michael Deesh
2 Julia Roberts
Name
0 John Dosh
1 John Dosh
2 Michael Deesh
3 Michael Deesh
4 Michael Deesh
5 Julia Roberts
6 Julia Roberts
outputs:
id Name
0 1 John Dosh
1 2 Michael Deesh
2 3 Julia Roberts
id Name
0 1 John Dosh
1 1 John Dosh
2 2 Michael Deesh
3 2 Michael Deesh
4 2 Michael Deesh
5 3 Julia Roberts
6 3 Julia Roberts
UPDATE:
An alternative solution which should give the same result is:
df1 = df1.assign(id=list(range(1, len(df1) + 1)))[['id'] + list(df1.columns)]
df2 = df2.merge(df1)[['id'] + list(df2.columns)]
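A quick sanity check of this alternative, rebuilt on the same sample names so the snippet runs standalone:

```python
import pandas as pd

# Same sample names as in the inputs above.
df1 = pd.DataFrame({'Name': ['John Dosh', 'Michael Deesh', 'Julia Roberts']})
df2 = pd.DataFrame({'Name': ['John Dosh', 'John Dosh', 'Michael Deesh',
                             'Michael Deesh', 'Michael Deesh',
                             'Julia Roberts', 'Julia Roberts']})

# assign 1-based ids in df1, then propagate them to df2 by merging on Name
df1 = df1.assign(id=list(range(1, len(df1) + 1)))[['id'] + list(df1.columns)]
df2 = df2.merge(df1)[['id'] + list(df2.columns)]
print(df2['id'].tolist())  # [1, 1, 2, 2, 2, 3, 3]
```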
I have a dataframe that looks like the one below:
Date Name
01-2021 Mark 714.53
Chris 112681.49
Ashley 3127.07
Brad 16875.00
Michelle 429520.04
...
12-2021 Mark 429520.04
Chris 975261.29
Ashley 377449.79
Brad 53391.73
Michelle 4286.00
But I need to pivot it into the shape below:
Name 01-2021 12-2021
Mark 714.53 429520.04
Chris 112681.49 975261.29
Ashley 3127.07 377449.79
Brad 16875.00 53391.73
Michelle 429520.04 4286.00
Does anyone have a solution, please?
Use pd.pivot:
# the last (values) column is assumed to be named 'amt'
df.pivot(index='Name', columns='Date', values='amt').reset_index().rename_axis(columns=None)
Name 01-2021 12-2021
0 Ashley 3127.07 377449.79
1 Brad 16875.00 53391.73
2 Chris 112681.49 975261.29
3 Mark 714.53 429520.04
4 Michelle 429520.04 4286.00
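One thing to watch for (my addition): df.pivot raises a ValueError if any (Name, Date) pair occurs more than once. If duplicates are possible, df.pivot_table aggregates them instead; the sample below is hypothetical and assumes summing duplicates is what you'd want:

```python
import pandas as pd

# Hypothetical sample with a duplicated (Date, Name) pair; the column
# name 'amt' follows the assumption in the answer above.
df = pd.DataFrame({
    'Date': ['01-2021', '01-2021', '01-2021', '12-2021'],
    'Name': ['Mark', 'Mark', 'Chris', 'Mark'],
    'amt':  [700.00, 14.53, 112681.49, 429520.04],
})

# pivot() would raise on the duplicated Mark/01-2021 pair;
# pivot_table() aggregates duplicates instead (sum here)
out = (df.pivot_table(index='Name', columns='Date', values='amt', aggfunc='sum')
         .reset_index()
         .rename_axis(columns=None))
print(out)
```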
I'm a relative Python noob and also new to natural language processing (NLP).
I have a dataframe containing names and sales. I want to: 1) break out all the tokens, and 2) aggregate sales by each token.
Here's an example of the dataframe:
name sales
Mike Smith 5
Mike Jones 3
Mary Jane 4
Here's the desired output:
token sales
mike 8
mary 4
Smith 5
Jones 3
Jane 4
Thoughts on what to do? I'm using Python.
Assumption: you have a function tokenize that takes in a string as input and returns a list of tokens
I'll use this function as a tokenizer for now:
def tokenize(word):
    return word.casefold().split()
Solution
df.assign(tokens=df['name'].apply(tokenize)).explode('tokens').groupby('tokens')['sales'].sum().reset_index()
In [45]: df
Out[45]:
name sales
0 Mike Smith 5
1 Mike Jones 3
2 Mary Jane 4
3 Mary Anne Jane 1
In [46]: df.assign(tokens=df['name'].apply(tokenize)).explode('tokens').groupby('tokens')['sales'].sum().reset_index()
Out[46]:
tokens sales
0 anne 1
1 jane 5
2 jones 3
3 mary 5
4 mike 8
5 smith 5
Explanation
The assign step creates a column called tokens by applying the tokenize function to the name column.
Note: for this particular tokenize function you could use df['name'].str.lower().str.split() - however, that won't generalize to custom tokenizers, hence the .apply(tokenize).
This generates a df that looks like:
name sales tokens
0 Mike Smith 5 [mike, smith]
1 Mike Jones 3 [mike, jones]
2 Mary Jane 4 [mary, jane]
3 Mary Anne Jane 1 [mary, anne, jane]
Use df.explode on this to get:
name sales tokens
0 Mike Smith 5 mike
0 Mike Smith 5 smith
1 Mike Jones 3 mike
1 Mike Jones 3 jones
2 Mary Jane 4 mary
2 Mary Jane 4 jane
3 Mary Anne Jane 1 mary
3 Mary Anne Jane 1 anne
3 Mary Anne Jane 1 jane
The last step is just a groupby-agg step.
You can use the str.split() method and keep item 0 for the first name, use that as the groupby key, and take the sum; then do the same for item -1 (the last name) and concatenate the two. Note this only counts the first and last tokens, so a middle name like 'Anne' in 'Mary Anne Jane' would be missed.
import pandas as pd
df = pd.DataFrame({'name': {0: 'Mike Smith', 1: 'Mike Jones', 2: 'Mary Jane'},
                   'sales': {0: 5, 1: 3, 2: 4}})
# select the 'sales' column before summing so the string 'name' column
# isn't aggregated (which errors on recent pandas versions)
df = pd.concat([df.groupby(df.name.str.split().str[0])['sales'].sum(),
                df.groupby(df.name.str.split().str[-1])['sales'].sum()]).reset_index()
df.rename(columns={'name': 'token'}, inplace=True)
df[["fname", "lname"]] = df["name"].str.split(expand=True)  # split into tokens, assuming they are space-separated
tokens_df = pd.concat([df[['fname', 'sales']].rename(columns={'fname': 'tokens'}),
                       df[['lname', 'sales']].rename(columns={'lname': 'tokens'})])
tokens_df.groupby('tokens')['sales'].sum().to_frame()
I have two dataframes, here are snippets of both below. I am trying to find and replace the artists names in the second dataframe with the id's in the first dataframe. Is there a good way to do this?
id fullName
0 1 Colin McCahon
1 2 Robert Henry Dickerson
2 3 Arthur Dagley
Artists
0 Arthur Dagley, Colin McCahon, Maria Cruz
1 Fiona Gilmore, Peter Madden, Nicholas Spratt, ...
2 Robert Henry Dickerson
3 Steve Carr
Desired output:
Artists
0 3, 1, Maria Cruz
1 Fiona Gilmore, Peter Madden, Nicholas Spratt, ...
2 2
3 Steve Carr
You can do this with replace:
df1.Artists.replace(dict(zip(df.fullName, df.id.astype(str))), regex=True)
0 3, 1, Maria Cruz
1 Fiona Gilmore, Peter Madden, Nicholas Spratt, ...
2 2
3 Steve Carr
Name: Artists, dtype: object
Convert your first dataframe into a dictionary:
d = pd.Series(name_df.id.astype(str), index=name_df.fullName).to_dict()
Then use .replace():
artists_df["Artists"] = artists_df["Artists"].replace(d, regex=True)
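One caveat to both answers (my addition): with regex=True the names are treated as regular expressions, so a shorter name can match inside a longer one, and characters like . in a name become metacharacters. Escaping each name and anchoring it with word boundaries avoids both problems; the frames below are hypothetical stand-ins for the ones in the question:

```python
import re
import pandas as pd

# Hypothetical frames mirroring the shapes in the question.
name_df = pd.DataFrame({'id': [1, 2],
                        'fullName': ['Colin McCahon', 'Arthur Dagley']})
artists_df = pd.DataFrame({'Artists': ['Arthur Dagley, Colin McCahon, Maria Cruz',
                                       'Steve Carr']})

# Escape each name so metacharacters are literal, and wrap it in word
# boundaries so one name cannot match inside a longer one.
d = {r'\b' + re.escape(name) + r'\b': str(i)
     for name, i in zip(name_df.fullName, name_df.id)}
artists_df['Artists'] = artists_df['Artists'].replace(d, regex=True)
print(artists_df)
```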
Given a Pandas dataframe which has a few labeled series in it, say Name and Villain.
Say the dataframe has values such as:
Name: ['Batman', 'Batman', 'Spiderman', 'Spiderman', 'Spiderman', 'Spiderman']
Villain: ['Joker', 'Bane', 'Green Goblin', 'Electro', 'Venom', 'Dr Octopus']
In total the above dataframe has two series (or columns), each with six data points.
Now, based on the Name, I want to concatenate 3 more columns: FirstName, LastName, LoveInterest to each datapoint.
The result of which adds 'Bruce; Wayne; Catwoman' to every row which has Name as Batman. And 'Peter; Parker; MaryJane' to every row which has Name as Spiderman.
The final result should be a dataframe containing 5 columns(series) and 6 rows each.
This is a classic inner-join scenario. In pandas, use the merge module-level function:
In [13]: df1
Out[13]:
Name Villain
0 Batman Joker
1 Batman Bane
2 Spiderman Green Goblin
3 Spiderman Electro
4 Spiderman Venom
5 Spiderman Dr. Octopus
In [14]: df2
Out[14]:
FirstName LastName LoveInterest Name
0 Bruce Wayne Catwoman Batman
1 Peter Parker MaryJane Spiderman
In [15]: pd.merge(df1, df2, on='Name')
Out[15]:
Name Villain FirstName LastName LoveInterest
0 Batman Joker Bruce Wayne Catwoman
1 Batman Bane Bruce Wayne Catwoman
2 Spiderman Green Goblin Peter Parker MaryJane
3 Spiderman Electro Peter Parker MaryJane
4 Spiderman Venom Peter Parker MaryJane
5 Spiderman Dr. Octopus Peter Parker MaryJane
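The same join is more commonly spelled as an instance method, df1.merge(df2, on='Name'). If some Name in df1 had no match in df2 (the extra 'Superman' row below is a hypothetical addition to illustrate this), the default inner join silently drops it; how='left' keeps it with NaNs instead:

```python
import pandas as pd

# Frames shaped like those in the answer, plus one hypothetical hero
# ('Superman') with no matching row in df2.
df1 = pd.DataFrame({'Name': ['Batman', 'Spiderman', 'Superman'],
                    'Villain': ['Joker', 'Venom', 'Lex Luthor']})
df2 = pd.DataFrame({'FirstName': ['Bruce', 'Peter'],
                    'LastName': ['Wayne', 'Parker'],
                    'LoveInterest': ['Catwoman', 'MaryJane'],
                    'Name': ['Batman', 'Spiderman']})

inner = df1.merge(df2, on='Name')             # drops Superman (no match)
left = df1.merge(df2, on='Name', how='left')  # keeps Superman with NaNs
print(inner.shape, left.shape)  # (2, 5) (3, 5)
```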