Create new columns in dataframe using a dictionary mapping - python

I created a map where the key is a string and the value is a list of tuples. I also have a dataframe that looks like this:
import pandas as pd

d = {'comments': ['This is a bad website', 'The website is slow']}
df = pd.DataFrame(data=d)
The map's value for 'This is a bad website' contains something like this:
[("There isn't enough support to our site",
'Staff Not Onsite',
0.7323943661971831),
('I would like to have them on site more frequently',
'Staff Not Onsite',
0.6875)]
What I want to do now is create six new columns in the dataframe using the first two tuples in the map's value.
So what I would want is something like this:
d = {'comments': ['This is a bad website', 'The website is slow'],
     'comment_match_1': ["There isn't enough support to our site", ''],
     'Negative_category_1': ['Staff Not Onsite', ''],
     'Score_1': [0.7323, 0],
     'comment_match_2': ['I would like to have them on site more frequently', ''],
     'Negative_category_2': ['Staff Not Onsite', ''],
     'Score_2': [0.6875, 0]}
df = pd.DataFrame(data=d)
Any suggestions on how to achieve this are greatly appreciated.
Here is how I generated the map, for reference:

import difflib

d = {}
a = []
for x, y in zip(df['comments'], df['negative_category']):
    for z in unlabeled_df['comments']:
        a.append((x, y, difflib.SequenceMatcher(None, x, z).ratio()))
        d[z] = a
Thus when I execute this line of code
d['This is a bad website']
I get
[("There isn't enough support to our site",
'Staff Not Onsite',
0.7323943661971831),
('I would like to have them on site more frequently',
'Staff Not Onsite',
0.6875), ...]

You can create a mapping dictionary by flattening the list of tuples for each key, then use Series.map to substitute the values in the comments column from the mapping dictionary, and finally build a new dataframe from the substituted values and join it back to the comments column:
import numpy as np

mapping = {k: np.hstack(v) for k, v in d.items()}
df.join(pd.DataFrame(df['comments'].map(mapping).dropna().tolist()))
                comments                                        0                 1                   2                                                   3                 4       5
0  This is a bad website  There isn't enough support to our site  Staff Not Onsite  0.7323943661971831  I would like to have them on site more frequently  Staff Not Onsite  0.6875
1    The website is slow                                      NaN               NaN                 NaN                                                 NaN               NaN     NaN
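If you want the exact column names from the question rather than 0-5, a minimal follow-up sketch (assuming the df and mapping from above): rename the joined columns, coerce the scores back to numbers, and fill the unmatched row.

out = df.join(pd.DataFrame(df['comments'].map(mapping).dropna().tolist()))
out.columns = ['comments',
               'comment_match_1', 'Negative_category_1', 'Score_1',
               'comment_match_2', 'Negative_category_2', 'Score_2']
# np.hstack casts every tuple element to a string, so convert the scores back
for col in ['Score_1', 'Score_2']:
    out[col] = pd.to_numeric(out[col]).fillna(0)
out = out.fillna('')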

Related

Long to wide format using a dictionary

I would like to make a long to wide transformation of my dataframe, starting from
match_id  player  goals  home
1         John    1      home
1         Jim     3      home
...
2         John    0      away
2         Jim     2      away
...
ending up with:
match_id  player_1  player_2  player_1_goals  player_2_goals  player_1_home  player_2_home  ...
1         John      Jim       1               3               home           home
2         John      Jim       0               2               away           away
...
Since I'm going to have columns with new names, I thought that maybe I should build a dictionary for that, where the outer key is the match id, like so:
d = {1: {'player_1': 'John',
         'player_1_goals': 1,
         'player_1_home': 'home',
         'player_2': 'Jim',
         'player_2_goals': 3,
         'player_2_home': 'home'},
     2: {'player_1': 'John',
         'player_1_goals': 0,
         'player_1_home': 'away',
         'player_2': 'Jim',
         'player_2_goals': 2,
         'player_2_home': 'away'}}
and then:
pd.DataFrame.from_dict(d).T
In the real case scenario, however, the number of players will vary and I can't hardcode it.
Is this the best way of doing this using dictionaries? If so, how could I build and populate this dict from my original pandas dataframe?
It looks like you want to pivot the dataframe. The problem is that there is no column in your dataframe that "enumerates" the players for you. If you assign such a column via the assign() method, then pivot() becomes easy.
It actually looks incredibly similar to this case. The only difference is that you seem to need the column names formatted in a specific way, where the string "player" is prepended to each name. The set_axis() call below does that.
(df
 .assign(
     ind=df.groupby('match_id').cumcount().add(1).astype(str)
 )
 .pivot(index='match_id', columns='ind', values=['player', 'goals', 'home'])
 .pipe(lambda x: x.set_axis([
     '_'.join([c, i]) if c == 'player' else '_'.join(['player', i, c])
     for (c, i) in x
 ], axis=1))
 .reset_index()
)
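To try the pipeline, a sketch of a small frame built from the question's sample (the rows hidden behind '...' are omitted):

import pandas as pd

df = pd.DataFrame({'match_id': [1, 1, 2, 2],
                   'player': ['John', 'Jim', 'John', 'Jim'],
                   'goals': [1, 3, 0, 2],
                   'home': ['home', 'home', 'away', 'away']})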

Adding entries to a column in a pandas DataFrame row-by-row using a dictionary

I am attempting to build a word cloud of cuisine types and wanted to include synonyms of a cuisine into its counter as a dictionary where the key is the cuisine and the values are a list of its synonyms. For example:
> 'Desserts': {'afters', 'sweet', 'dessert'}
The DataFrame I'm working with has thousands of rows, but something very similar can be generated using this (note: the synonyms column was added for this exercise):
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'primary_cuisine': ['fast food', 'desserts', 'chinese', 'american', 'asian', 'turkish'],
    'synonyms': ['', '', '', '', '', '']
})
It generates this sample:
  primary_cuisine synonyms
0       fast food
1        desserts
2         chinese
3        american
4           asian
5         turkish
I've generated the list of synonyms for each cuisine as follows:
from nltk.corpus import wordnet as wn

syn_dict = {}
for cuisine in df['primary_cuisine']:
    synonyms = []
    word = cuisine
    # Store synonyms in a dictionary
    for syn in wn.synsets(word):
        for lm in syn.lemmas():
            synonyms.append(lm.name())
    # Add the values to the key as a set to remove duplicates
    if len(synonyms) > 0:
        syn_dict[word] = set(synonyms)
    else:
        syn_dict[word] = set()
Here's where I'm stuck: how would I write the synonyms column of the DataFrame for each key using the values in the dictionary? Any help would be appreciated, and any suggestions for a better/easier way to do what I'm trying to accomplish would be great too! Here is what I'm hoping to achieve (for the above sample), if it helps:
  primary_cuisine                                        synonyms
0       fast food
1        desserts                          afters, sweet, dessert
2         chinese                    Chinese, Taiwanese, Formosan
3        american  American_language, American_English, American,
4           asian                                  Asian, Asiatic
5         turkish                                         Turkish
You can use .map() and a dict like so:
dct = {'desserts': ['afters', 'sweet', 'dessert']}
df['synonyms'] = df['primary_cuisine'].map(dct)
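Since the question already builds syn_dict, a minimal sketch (assuming the syn_dict and df from above) that maps it directly and joins each set into the comma-separated strings shown in the desired output:

df['synonyms'] = (df['primary_cuisine']
                  .map(syn_dict)
                  .apply(lambda s: ', '.join(sorted(s)) if s else ''))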

Best way to match list of words with a list of job descriptions python

Here is my problem (I'm working on python) :
I have a Dataframe with columns: Index(['job_title', 'company', 'job_label', 'description'], dtype='object')
And I have a list of words that contains 300 skills:
keywords = ["C++","Data Analytics","python","R", ............ "Django"]
I need to match those keywords against each of the job descriptions and obtain a new dataframe saying whether it is true or false that C++ is in job description[0], job description[1], job description[2], and so on.
My new dataframe will be:
columns : ['job_title', 'company', 'description', "C++", "Data Analytics",
....... "Django"]
where each keyword column says True or False depending on whether the keyword is found in the job description.
There might be other ways to structure the dataframe (I'm open to suggestions).
Hope I'm clear with my question. I tried regex but I couldn't make it iterate through each row; I tried a loop using the fnmatch library and couldn't make it work. The best approach so far was:
df["microservice"]= df.description.str.contains("microservice")
df["cloud-based architecture"] = df.description.str.contains("cloud-based architecture")
df["service oriented architecture"] = df.description.str.contains("service oriented architecture")
However, first I could not manage to make it loop through each row of the description column, so I had to write that line 300 times, once per word (which doesn't make sense). Second, this way I have problems with a few words such as "R", because it finds the letter R inside every description, so it pulls True for each of them.
Iterate over the list of keywords and test each one against the description column:

for name in keywords:
    df[name] = df['description'].apply(lambda x: name in x)
EDIT:
That doesn't solve the problem with R. To do so you could add surrounding spaces to make sure the keyword is isolated, so the code would be:

for name in keywords:
    df[name] = df['description'].apply(lambda x: ' ' + name + ' ' in x)

But that's really ugly and not optimised, and it still misses keywords at the very start or end of a description. A regular expression does the trick: wrap the keyword in lookarounds such as r'(?<!\w)' + re.escape(name) + r'(?!\w)' so it only matches as a whole word.
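A sketch of that regex approach using str.contains (lookarounds rather than \b, since \b fails after non-word characters like the trailing '+' in C++):

import re

for name in keywords:
    pattern = r'(?<!\w)' + re.escape(name) + r'(?!\w)'
    df[name] = df['description'].str.contains(pattern, regex=True)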
One way is to build a regex string to identify any keyword in your string... this example is case insensitive and will find any substring matches - not just whole words...
import pandas as pd
import re
keywords = ['python', 'C++', 'admin', 'Developer']
rx = '(?i)(?P<keyword>{})'.format('|'.join(re.escape(kw) for kw in keywords))
Then with a sample DF of:
df = pd.DataFrame({
    'job_description': ['C++ developer', 'traffic warden', 'Python developer',
                        'linux admin', 'cat herder']
})
You can find all keywords for the relevant column...
matches = df['job_description'].str.extractall(rx)
Which gives:
           keyword
  match
0 0            C++
  1      developer
2 0         Python
  1      developer
3 0          admin
Then you want to get a list of "dummies" and take the max per original row (so you always get a 1 where a word was found) using:

dummies = pd.get_dummies(matches).groupby(level=0).max()
Which gives:
   keyword_C++  keyword_Python  keyword_admin  keyword_developer
0            1               0              0                  1
2            0               1              0                  1
3            0               0              1                  0
You then left join that back to your original DF:
result = df.join(dummies, how='left')
And the result is:
    job_description  keyword_C++  keyword_Python  keyword_admin  keyword_developer
0     C++ developer          1.0             0.0            0.0                1.0
1    traffic warden          NaN             NaN            NaN                NaN
2  Python developer          0.0             1.0            0.0                1.0
3       linux admin          0.0             0.0            1.0                0.0
4        cat herder          NaN             NaN            NaN                NaN
skill = "C++", or any of the others
frame = an instance of
Index(['job_title', 'company', 'job_label', 'description'],
dtype='object')
jobs = a list/np.array of frames, which is probably your input
A naive implementation could look a bit like this:

for skill in keywords:
    for frame in jobs:
        if skill in frame["description"]:  # or more exact matching, but this is what's in the question
            ...  # the skill exists in this description; record it somewhere
But you need to put more work into what output structure you are going to use. Just having an output array of 300 columns, most of which contain False, isn't going to be a good plan. I've never worked with pandas myself, but if these were plain numpy arrays (which pandas DataFrames are built on under the hood), I would add a "skills" column that enumerates the matches.
You can leverage .apply() like so (#Jacco van Dorp made a solid suggestion of storing all of the found skills inside a single column, which I agree is likely the best approach to your problem):
import pandas as pd

df = pd.DataFrame([['Engineer', 'Firm', 'AERO1', 'Work with python and Django'],
                   ['IT', 'Dell', 'ITD4', 'Work with Django and R'],
                   ['Office Assistant', 'Dental', 'OAD3', 'Coordinate schedules'],
                   ['QA Engineer', 'Factory', 'QA2', 'Work with R and python'],
                   ['Mechanic', 'Autobody', 'AERO1', 'Love the movie Django']],
                  columns=['job_title', 'company', 'job_label', 'description'])
Which yields:
          job_title   company job_label                  description
0          Engineer      Firm     AERO1  Work with python and Django
1                IT      Dell      ITD4       Work with Django and R
2  Office Assistant    Dental      OAD3         Coordinate schedules
3       QA Engineer   Factory       QA2       Work with R and python
4          Mechanic  Autobody     AERO1        Love the movie Django
Then define your skill set and your list comprehension to pass to .apply():
skills = ['python','R','Django']
df['skills'] = df.apply(lambda x: [i for i in skills if i in x['description'].split()], axis=1)
Which yields this column:
             skills
0  [python, Django]
1       [R, Django]
2                []
3       [python, R]
4          [Django]
If you are still interested in having individual columns for each skill, I can edit my answer to provide that as well.
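In the meantime, a sketch of those per-skill boolean columns (assuming the df and skills above), one membership test per keyword against the list column:

for skill in skills:
    df[skill] = df['skills'].apply(lambda found: skill in found)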

Conditional Filling in Missing Values in a Pandas Data frame using non-conventional means

TLDR; How can I improve my code and make it more pythonic?
Hi,
One of the interesting challenge(s) we were given in a tutorial was the following:
"There are X missing entries in the data frame with an associated code but a 'blank' entry next to the code. This is a random occurance across the data frame. Using your knowledge of pandas, map each missing 'blank' entry to the associated code."
So this looks like the following:
code  name
001   Australia
002   London
...
001   <blank>
The approach I used is as follows:
Loop through the entire dataframe and identify entries with blanks (""), then replace each blank by copying in the correct name associated with its (ordered) code.
code_names = [ "",
'Economic management',
'Public sector governance',
'Rule of law',
'Financial and private sector development',
'Trade and integration',
'Social protection and risk management',
'Social dev/gender/inclusion',
'Human development',
'Urban development',
'Rural development',
'Environment and natural resources management'
]
df_copy = df_.copy()
# Looks through each code name, and if it is empty, stores the proper name in its place
for x in range(len(df_copy.mjtheme_namecode)):
for y in range(len(df_copy.mjtheme_namecode[x])):
if(df_copy.mjtheme_namecode[x][y]['name'] == ""):
df_copy.mjtheme_namecode[x][y]['name'] = code_names[int(df_copy.mjtheme_namecode[x][y]['code'])]
limit = 25
counter = 0
for x in range(len(df_copy.mjtheme_namecode)):
for y in range(len(df_copy.mjtheme_namecode[x])):
print(df_copy.mjtheme_namecode[x][y])
counter += 1
if(counter >= limit):
break
While the above approach works - is there a better, more pythonic way of achieving what I'm after? I feel the approach I have used is very clunky due to my skills not being very well developed.
Thank you!
Method 1:
One way to do this would be to replace all your "" blanks with NaN, sort the dataframe by code and name, and use fillna(method='ffill'):
Starting with this:
>>> df
   code       name
0     1  Australia
1     2     London
2     1
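(For reproducibility, a sketch constructing that frame:)

import numpy as np
import pandas as pd

df = pd.DataFrame({'code': [1, 2, 1],
                   'name': ['Australia', 'London', '']})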
You can apply the following:
new_df = (df.replace({'name': {'': np.nan}})
            .sort_values(['code', 'name'])
            .fillna(method='ffill')
            .sort_index())
>>> new_df
   code       name
0     1  Australia
1     2     London
2     1  Australia
Method 2:
This is more convoluted, but it works as well.
Using groupby, first, and squeeze, you can create a pd.Series that maps the codes to non-blank names, and use .map to map that series onto your code column:
df['name'] = (df['code']
              .map(df.replace({'name': {'': np.nan}})
                     .sort_values(['code', 'name'])
                     .groupby('code')
                     .first()
                     .squeeze()))
>>> df
   code       name
0     1  Australia
1     2     London
2     1  Australia
Explanation: The pd.Series map that this creates looks like this:
code
1    Australia
2    London
And it works because it gets the first instance for every code (via the groupby), sorted in such a manner that the NaNs are last. So as long as each code is associated with a name, this method will work.
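A compact third option (my own sketch, not from either method above): groupby().transform('first') returns each group's first non-null value, so after blanking the empty strings you can assign it directly:

s = df['name'].replace('', np.nan)
df['name'] = s.groupby(df['code']).transform('first')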

How to fix Pandas dataframe which shows NaN for string, and remove list brackets when write dataframe to csv

I am converting Python lists into a pandas dataframe, then writing the dataframe to CSV. The lists are as follows:
name = ['james beard', 'james beard']
ids = [304589, 304589]
year = [1999, 1999]
co_authors = [['athman bouguettaya', 'boualem benatallah', 'lily hendra', 'kevin smith', 'mourad quzzani'], ['athman bouguettaya', 'boualem benatallah', 'lily hendra', 'kevin smith', 'mourad quzzani']]
title = ['world wide databaseintegrating the web corba and databases', 'world wide databaseintegrating the web corba and databases']
venue = ['international conference on management of data', 'international conference on management of data']
data = {
    'Name': name,
    'ID': ids,
    'Year': year,
    'Co-author': co_authors,
    'Title:': title,
    'Venue:': venue,
}
df = pd.DataFrame(data, columns=['Name', 'ID', 'Year', 'Co-author', 'Title', 'Venue'])
df
df.to_csv('test.csv')
My questions are:
(a) The "Title" and "Venue" columns are shown as NaN instead of their values (see below). How can I fix this?
          Name      ID  Year                                          Co-author  Title  Venue
0  james beard  304589  1999  [athman bouguettaya, boualem benatallah, lily ...    NaN    NaN
1  james beard  304589  1999  [athman bouguettaya, boualem benatallah, lily ...    NaN    NaN
(b) In CSV (see below), how to add "Index" to the header and remove brackets in "Co-author"?
,Name,ID,Year,Co-author,Title,Venue
0,james beard,304589,1999,"['athman bouguettaya', 'boualem benatallah', 'lily hendra', 'kevin smith', 'mourad quzzani']",,
1,james beard,304589,1999,"['athman bouguettaya', 'boualem benatallah', 'lily hendra', 'kevin smith', 'mourad quzzani']",,
As for the first problem: in data you have the character : in the names 'Title:' and 'Venue:', so the DataFrame can't find 'Title' and 'Venue' in data.
You have to remove the colons.
Or you can skip columns=[...] entirely, and it will use the names with the colons ('Title:', 'Venue:'):
df = pd.DataFrame(data)
As for the second: I searched for a solution in pandas applied after (or during) creating the DataFrame, and I didn't find one.
But if you can modify the data before you create the DataFrame, then you can write your version more concisely:
co_authors = [','.join(row) for row in co_authors]
Ah well, I solved (b) using the below before loading into data:

tmp = []
for c in range(len(co_authors)):
    tmp.append(','.join(map(str, co_authors[c])))
co_authors = tmp
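For the other half of (b), naming the index column in the CSV header: to_csv accepts an index_label argument, so a one-line sketch is:

df.to_csv('test.csv', index_label='Index')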
