(I suck at titling these questions...)
So I've gotten 90% of the way through a very laborious learning process with pandas, but I have one thing left to figure out. Let me show an example (actual original is a comma-delimited CSV that has many more rows):
Name Price Rating URL Notes1 Notes2 Notes3
Foo $450 9 a.com/x NaN NaN NaN
Bar $99 5 see over www.b.com Hilarious Nifty
John $551 2 www.c.com Pretty NaN NaN
Jane $999 8 See Over in Notes Funky http://www.d.com Groovy
The URL column can say many different things, but they all include "see over," and do not indicate with consistency which column to the right includes the site.
I would like to do a few things here: first, move websites from any Notes column to URL; second, collapse all Notes columns into one, with a new line between entries. So this (NaNs replaced with empty strings, which pandas makes me do before I can use the columns in df.loc):
Name Price Rating URL Notes1
Foo $450 9 a.com/x
Bar $99 5 www.b.com Hilarious
Nifty
John $551 2 www.c.com Pretty
Jane $999 8 http://www.d.com Funky
Groovy
I got partway there by doing this:
df['URL'] = df['URL'].fillna('')
df['Notes1'] = df['Notes1'].fillna('')
df['Notes2'] = df['Notes2'].fillna('')
df['Notes3'] = df['Notes3'].fillna('')
to_move = df['URL'].str.lower().str.contains('see over')
df.loc[to_move, 'URL'] = df['Notes1']
What I don't know is how to find the Notes column with either www or .com. If I, for example, try to use my above method as a condition, e.g.:
if df['Notes1'].str.lower().str.contains('www'):
    df.loc[to_move, 'URL'] = df['Notes1']
I get back ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all(). But adding .any() or .all() has an obvious flaw: they don't give me what I'm looking for. With any(), for example, every line that meets the to_move requirement in URL will get whatever's in Notes1. I need the check to occur row by row. For similar reasons, I can't even get started on collapsing the Notes columns (and I don't know how to check for non-null empty-string cells either, a problem I created at this point).
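For what it's worth, the row-by-row version of that check is an elementwise boolean combination: instead of an `if`, the two masks are ANDed with `&`, and the combined mask drives the assignment. A minimal sketch, assuming the column names from the example above:

```python
import pandas as pd

df = pd.DataFrame({
    'URL': ['a.com/x', 'see over', 'www.c.com', 'See Over in Notes'],
    'Notes1': ['', 'www.b.com', 'Pretty', 'Funky'],
})

to_move = df['URL'].str.lower().str.contains('see over')
has_url = df['Notes1'].str.contains(r'www|\.com')

# Both masks are aligned boolean Series; & compares them row by row
mask = to_move & has_url
df.loc[mask, 'URL'] = df.loc[mask, 'Notes1']
```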
Where it stands, I know I also have to move in Notes2 to Notes1, Notes3 to Notes2, and '' to Notes3 when the first condition is satisfied, because I don't want the leftover URLs in the Notes columns. I'm sure pandas has easier routes than what I'm doing, because it's pandas, and when I try to do anything with pandas, I find out that it can be done in one line instead of my 20...
(PS, I don't care if the empty columns Notes2 and Notes3 are left over, b/c I'm not using them in my CSV import in the next step, though I can always learn more than I need)
UPDATE: So I figured out a crummy verbose solution using my non-pandas python logic one step at a time. I came up with this (same first five lines above, minus the df.loc line):
url_in1 = df['Notes1'].str.contains(r'\.com')
url_in2 = df['Notes2'].str.contains(r'\.com')
to_move = df['URL'].str.lower().str.contains('see over')
to_move1 = to_move & url_in1
to_move2 = to_move & url_in2
df.loc[to_move1, 'URL'] = df.loc[url_in1, 'Notes1']
df.loc[url_in1, 'Notes1'] = df['Notes2']
df.loc[url_in1, 'Notes2'] = ''
df.loc[to_move2, 'URL'] = df.loc[url_in2, 'Notes2']
df.loc[url_in2, 'Notes2'] = ''
(Lines moved around and to_move repeated in actual code) I know there has to be a more efficient method... This also doesn't collapse in the Notes columns, but that should be easy using the same method, except that I still don't know a good way to find the empty strings.
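On finding the empty strings afterwards: an ordinary vectorized comparison does it, and `replace('', np.nan)` turns them back into proper missing values if needed. A small sketch with a made-up Notes1 column:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Notes1': ['', 'Hilarious', 'Pretty', '']})

is_empty = df['Notes1'] == ''                    # elementwise boolean mask
df['Notes1'] = df['Notes1'].replace('', np.nan)  # back to NaN if preferred
```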
I'm still learning pandas, so some parts of this code may not be elegant, but the general idea is: get all Notes columns, find all URLs in them, combine those with the URL column, and then concatenate the remaining notes into the Notes1 column:
import pandas as pd
import numpy as np
import pandas.core.strings as strings
# Helper to get the first non-null occurrence in a row
def geturl(s):
    try:
        return next(e for e in s if not pd.isnull(e))
    except StopIteration:
        return np.nan
df = pd.read_csv("d:/temp/data2.txt")
dfnotes = df[[e for e in df.columns if 'Notes' in e]]
# Notes1 Notes2 Notes3
# 0 NaN NaN NaN
# 1 www.b.com Hilarious Nifty
# 2 Pretty NaN NaN
# 3 Funky http://www.d.com Groovy
dfurls = dfnotes.apply(lambda x: x.str.contains(r'\.com'), axis=1)
dfurls = dfurls.fillna(False).astype(bool)
# Notes1 Notes2 Notes3
# 0 False False False
# 1 True False False
# 2 False False False
# 3 False True False
turl = dfnotes[dfurls].apply(geturl, axis=1)
df['URL'] = np.where(turl.isnull(), df['URL'], turl)
df['Notes1'] = dfnotes[~dfurls].apply(lambda x: ' '.join(x.dropna()), axis=1)  # join remaining notes
del df['Notes2']
del df['Notes3']
df
# Name Price Rating URL Notes1
# 0 Foo $450 9 a.com/x
# 1 Bar $99 5 www.b.com Hilarious Nifty
# 2 John $551 2 www.c.com Pretty
# 3 Jane $999 8 http://www.d.com Funky Groovy
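For comparison, the same idea can also be written with `where`/`bfill` so the first URL per row is extracted without a helper function. This is only a sketch under the same column names and the same '.com' heuristic used above:

```python
import pandas as pd

df = pd.DataFrame({
    'URL': ['a.com/x', 'see over', 'www.c.com', 'See Over in Notes'],
    'Notes1': [None, 'www.b.com', 'Pretty', 'Funky'],
    'Notes2': [None, 'Hilarious', None, 'http://www.d.com'],
    'Notes3': [None, 'Nifty', None, 'Groovy'],
})

notes = df[['Notes1', 'Notes2', 'Notes3']]
is_url = notes.apply(lambda col: col.str.contains(r'\.com', na=False))

# First URL found in each row (NaN where none): back-fill row-wise,
# then take the leftmost column
first_url = notes.where(is_url).bfill(axis=1).iloc[:, 0]

need_move = df['URL'].str.lower().str.contains('see over', na=False)
df['URL'] = df['URL'].mask(need_move & first_url.notnull(), first_url)

# Collapse the non-URL notes into one newline-separated column
df['Notes1'] = notes.where(~is_url).apply(
    lambda row: '\n'.join(row.dropna()), axis=1)
```

The `bfill(axis=1)` pulls the first matching value in each row into the leftmost column, which plays the same role as geturl above.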
Related
I have the following dataframe
index,name,score,attempts,qualify
a,Anastasia,12.5,1,yes
b,Dima,9.0,3,no
c,Katherine,16.5,2,yes
d,James,NaN,3,no
e,Emily,9.0,2,no
I am trying to use the pandas map function to update the name column where name is either James or Emily to a test value 99.
codes = {'James':'99','Emily':'99'}
dff['name'] = dff['name'].map(codes)
dff
I am getting the following output -
index,name,score,attempts,qualify
a,NaN,12.5,1,yes
b,NaN,9.0,3,no
c,NaN,16.5,2,yes
d,99,NaN,3,no
e,99,9.0,2,no
Note that name column values James and Emily have been updated to 99, but the rest of name values are mapped to NaN.
How can we ignore indexes which are not intended to be mapped?
The issue is that the map function will apply the dictionary values to all values in the 'name' column, not just the ones specified. To get around this, you can use the replace method instead:
dff['name'] = dff['name'].replace({'James':'99','Emily':'99'})
This will replace only the specified values and leave the others unchanged.
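The difference is easy to see side by side: `map` builds a brand-new Series from the dictionary (keys that are missing become NaN), while `replace` leaves non-matching values alone. A quick sketch:

```python
import pandas as pd

names = pd.Series(['Anastasia', 'Dima', 'James', 'Emily'])
codes = {'James': '99', 'Emily': '99'}

mapped = names.map(codes)        # names not in the dict become NaN
replaced = names.replace(codes)  # names not in the dict are untouched
```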
I believe you may be looking for replace instead of map.
import pandas as pd
names = pd.Series([
"Anastasia",
"Dima",
"Katherine",
"James",
"Emily"
])
names.replace({"James": "99", "Emily": "99"})
# 0 Anastasia
# 1 Dima
# 2 Katherine
# 3 99
# 4 99
# dtype: object
If you're really set on using map, then you have to provide a function that knows how to handle every single name it might encounter.
codes = {"James": "99", "Emily": "99"}
# If the lookup into `code` fails,
# return the name that was used for lookup
names.map(lambda name: codes.get(name, name))
codes = {'James':'99',
'Emily':'99'}
dff['name'] = dff['name'].replace(codes)
dff
replace() satisfies the requirement -
index,name,score,attempts,qualify
a,Anastasia,12.5,1,yes
b,Dima,9.0,3,no
c,Katherine,16.5,2,yes
d,99,NaN,3,no
e,99,9.0,2,no
You can map and then fill the original values back in for the unmatched rows; one way to achieve it:
dff['name'] = dff['name'].map(codes).fillna(dff['name'])
codes = {'James':'99','Emily':'99'}
dff['name'] = dff['name'].map(codes).fillna(dff['name'])
dff
index name score attempts qualify
0 a Anastasia 12.5 1 yes
1 b Dima 9.0 3 no
2 c Katherine 16.5 2 yes
3 d 99 NaN 3 no
4 e 99 9.0 2 no
I'm trying to change the string "SLL" under the competitions column to "League", but when I tried this:
messi_dataset.replace("SLL", "League",regex = True)
It only changed the first "SLL" to "League", but then other strings that were "SLL" became "UCL". I have no idea why. I also tried changing regex = True to inplace = True, but no luck.
https://drive.google.com/file/d/1ldq6o70j-FsjX832GbYq24jzeR0IwlEs/view?usp=sharing
https://drive.google.com/file/d/1OeCSutkfdHdroCmTEG9KqnYypso3bwDm/view?usp=sharing
Suppose you have a dataframe as below:
import pandas as pd
import re
df = pd.DataFrame({'Competitions': ['SLL', 'sll','apple', 'banana', 'aabbSLL', 'ccddSLL']})
# write a regex pattern that replaces 'SLL'
# I assumed case-irrelevant
regex_pat = re.compile(r'SLL', flags=re.IGNORECASE)
df['Competitions'].str.replace(regex_pat, 'league', regex=True)
# Input DataFrame
Competitions
0 SLL
1 sll
2 apple
3 banana
4 aabbSLL
5 ccddSLL
Output:
0 league
1 league
2 apple
3 banana
4 aabbleague
5 ccddleague
Name: Competitions, dtype: object
Hope it clarifies.
Based on this answer, test this code:
messi_dataset['competitions'] = messi_dataset['competitions'].replace("SLL", "League")
Also, there are many different ways to do this, like this one that I tested:
messi_dataset.replace({'competitions': 'SLL'}, "League")
For those cases where 'SLL' is part of another word:
messi_dataset.replace({'competitions': 'SLL'}, "League", regex=True)
domain intents expressionId name
0 [] [] 8 When I dare to be powerful – to use my strengt...
1 [] [] 7 When one door of happiness closes, another ope...
2 [] [] 6 The first step toward success is taken when yo...
3 [] [] 5 Get busy living or get busy dying
4 [financial, education] [resolutions] 4 You know you’re in love when you can’t fall as...
5 [financial, business] [materialistic] 3 Honesty is the best policy
Here is my dataframe which is having the domain column with list of strings. And, what I want is to fetch only rows that are having
'domain' contains 'financial'(domain==financial), so that I get the below results:
domain intents expressionId name
4 [financial, education] [resolutions] 4 You know you’re in love when you can’t fall as...
5 [financial, business] [materialistic] 3 Honesty is the best policy
What I have tried so far is with the below command:
df['domain'].map(lambda x: 'financial' in x)
This returns a column with dtype 'bool'.
0 False
1 False
2 False
3 False
4 True
5 True
Name: domain, dtype: bool
But what I want is the filtered result not the bool values.
Please help me with this. Thank you.
dfFinaicals = df[df['domain'].map(lambda x: 'financial' in x)]
This just builds a boolean mask from the domain column and uses it to select the rows. You were almost there.
dfFinaicals = df[df['domain'].str.contains('financial')]
would be more elegant for a plain string column, but a Series has no .contains method, and str.contains doesn't apply to columns holding lists, so the map version above is the one to use here.
df[df.domain.apply(lambda v: 'financial' in v)]
df.domain.contains('financial') didn't work for me.
Without actually trying to make your dataframe, I would suggest:
new_df = old_df[old_df['domain'] == 'financial']
(Note that this exact match only works if the column holds plain strings rather than lists.)
Overview: I am working with pandas dataframes of census information, while they only have two columns, they are several hundred thousand rows in length. One column is a census block ID number and the other is a 'place' value, which is unique to the city in which that census block ID resides.
Example Data:
BLOCKID PLACEFP
0 60014001001000 53000
1 60014001001001 53000
...
5844 60014099004021 53000
5845 60014100001000
5846 60014100001001
5847 60014100001002 53000
Problem: As shown above, several place values are blank even though their rows have a census block ID. In several instances, a census block ID that is missing a place value is located within the same city as the surrounding blocks that do have one, especially when the bookend place values match. As shown above with index 5844 through 5847, those two blank blocks sit in the same general area as the surrounding blocks, but simply seem to be missing the place value.
Goal: I want to be able to go through this dataframe, find these instances and fill in the missing place value, based on the place value before the missing value and the place value that immediately follows.
Current State & Obstacle: I wrote a loop that goes through the dataframe to correct these issues, shown below.
current_state_blockid_df = pandas.DataFrame({'BLOCKID':[60014099004021,60014100001000,60014100001001,60014100001002,60014301012019,60014301013000,60014301013001,60014301013002,60014301013003,60014301013004,60014301013005,60014301013006],
'PLACEFP': ['53000', '', '', '53000', '11964', '', '', '', '', '', '', '11964']})
for i in current_state_blockid_df.index:
    if current_state_blockid_df.loc[i, 'PLACEFP'] == '':
        # Get value before blank
        prior_place_fp = current_state_blockid_df.loc[i - 1, 'PLACEFP']
        next_place_fp = ''
        _n = 1
        # Find the end of the blank section
        while next_place_fp == '':
            next_place_fp = current_state_blockid_df.loc[i + _n, 'PLACEFP']
            if next_place_fp == '':
                _n += 1
        # If the blanks could likely be in the same city, assign them the city's place value
        if prior_place_fp == next_place_fp:
            for _i in range(_n):
                current_state_blockid_df.loc[i + _i, 'PLACEFP'] = prior_place_fp
However, as expected, it is very slow when dealing with hundreds of thousands of rows of data. I have considered using a ThreadPoolExecutor to split up the work, but I haven't quite figured out the logic I'd use to get that done. One possibility for speeding it up slightly is to eliminate the check for the end of the gap and instead just fill in whatever the previous place value was before the blanks. While that may end up being my go-to, there's still a chance it's too slow, and ideally I'd like it to fill in only when the before and after values match, eliminating the possibility of a block being mistakenly assigned. If someone has another suggestion as to how this could be achieved quickly, it would be very much appreciated.
You can use shift to help speed up the process. However, this doesn't solve for cases where there are multiple blanks in a row.
df['PLACEFP_PRIOR'] = df['PLACEFP'].shift(1)
df['PLACEFP_SUBS'] = df['PLACEFP'].shift(-1)
criteria1 = df['PLACEFP'].isnull()
criteria2 = df['PLACEFP_PRIOR'] == df['PLACEFP_SUBS']
df.loc[criteria1 & criteria2, 'PLACEFP'] = df.loc[criteria1 & criteria2, 'PLACEFP_PRIOR']
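For the multiple-blanks-in-a-row case, the same vectorized idea works with `ffill`/`bfill`: fill a gap only where the non-blank values on both sides of it agree. A sketch, assuming blanks are empty strings as in the question's sample data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'BLOCKID': range(6),
    'PLACEFP': ['53000', '', '', '53000', '11964', ''],
})

s = df['PLACEFP'].replace('', np.nan)
prior = s.ffill()   # last non-blank value above each row
nxt = s.bfill()     # next non-blank value below each row

# Fill a blank only when the values bracketing its gap match
fill_ok = s.isnull() & (prior == nxt)
df.loc[fill_ok, 'PLACEFP'] = prior[fill_ok]
```

The trailing blank is left alone because there is no following value to compare against, which matches the "only fill if before and after agree" requirement.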
If you end up needing to iterate over the dataframe, use df.itertuples. You can access the column values in the row via dot notation (row.column_name).
for row in df.itertuples():
    # logic goes here; the index is available as row.Index
Using your dataframe as defined
def fix_df(current_state_blockid_df):
    # .copy() avoids SettingWithCopyWarning when assigning below
    df_with_blanks = current_state_blockid_df[current_state_blockid_df['PLACEFP'] == ''].copy()
    df_no_blanks = current_state_blockid_df[current_state_blockid_df['PLACEFP'] != '']
    sections = {}
    last_i = 0
    grouping = []
    for i in df_with_blanks.index:
        if i - 1 == last_i:
            grouping.append(i)
            last_i = i
        else:
            last_i = i
            if len(grouping) > 0:
                sections[min(grouping)] = {'indexes': grouping}
                grouping = []
            grouping.append(i)
    if len(grouping) > 0:
        sections[min(grouping)] = {'indexes': grouping}
    for i in sections.keys():
        sections[i]['place'] = current_state_blockid_df.loc[i - 1, 'PLACEFP']
    l = []
    for i in sections:
        for x in sections[i]['indexes']:
            l.append(sections[i]['place'])
    df_with_blanks['PLACEFP'] = l
    final_df = pandas.concat([df_with_blanks, df_no_blanks]).sort_index(axis=0)
    return final_df
df = fix_df(current_state_blockid_df)
print(df)
Output:
BLOCKID PLACEFP
0 60014099004021 53000
1 60014100001000 53000
2 60014100001001 53000
3 60014100001002 53000
4 60014301012019 11964
5 60014301013000 11964
6 60014301013001 11964
7 60014301013002 11964
8 60014301013003 11964
9 60014301013004 11964
10 60014301013005 11964
11 60014301013006 11964
Here is my problem (I'm working in Python):
I have a Dataframe with columns: Index(['job_title', 'company', 'job_label', 'description'], dtype='object')
And I have a list of words that contains 300 skills:
keywords = ["C++","Data Analytics","python","R", ............ "Django"]
I need to match those keywords against each of the job descriptions and obtain a new dataframe saying whether or not C++ appears in job description[0], job description[1], job description[2], and so on.
My new dataframe will be:
columns : ['job_title', 'company', 'description', "C++", "Data Analytics",
....... "Django"]
Each keyword column says True or False depending on whether the keyword is found in the job description.
There might be other ways to structure the dataframe (I'm open to suggestions).
Hope I'm clear with my question. I tried regex, but I couldn't make it iterate through each row; I tried a loop using the "fnmatch" library and couldn't make that work either. The best approach so far was:
df["microservice"]= df.description.str.contains("microservice")
df["cloud-based architecture"] = df.description.str.contains("cloud-based architecture")
df["service oriented architecture"] = df.description.str.contains("service oriented architecture")
However, first, I could not manage to make it loop through each row of the description column, so I would have to write that line 300 times, once for each word (which doesn't make sense). Second, this way I have problems with a few words such as "R", because it finds the letter R in every description, so it pulls True for all of them.
Iterate over the list of keywords and build one column per keyword from the description column:
for name in keywords:
    df[name] = df['description'].apply(lambda x: name in x)
EDIT:
That doesn't solve the problem with "R", though. To isolate it you could pad the keyword with spaces, so the code would be:
for name in keywords:
    df[name] = df['description'].apply(lambda x: ' ' + str(name) + ' ' in x)
But that's really ugly, not optimised, and misses keywords at the start or end of a description or next to punctuation. A word-boundary regular expression, r'\b' + re.escape(name) + r'\b', is the more appropriate tool.
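A word-boundary pattern handles the "R" case without the space hack. This is a sketch with a tiny made-up dataframe; `re.escape` keeps regex metacharacters in keywords from being interpreted (though `\b` only works cleanly next to alphanumeric characters, so a keyword like "C++" would still need special handling):

```python
import re
import pandas as pd

df = pd.DataFrame({'description': ['Work with R and python',
                                   'Regular maintenance work']})
keywords = ['python', 'R']

for name in keywords:
    # \b anchors the match at word boundaries, so 'R' does not
    # match the 'R' inside 'Regular'
    pattern = r'\b' + re.escape(name) + r'\b'
    df[name] = df['description'].str.contains(pattern, regex=True)
```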
One way is to build a regex string to identify any keyword in your string... this example is case insensitive and will find any substring matches - not just whole words...
import pandas as pd
import re
keywords = ['python', 'C++', 'admin', 'Developer']
rx = '(?i)(?P<keyword>{})'.format('|'.join(re.escape(kw) for kw in keywords))
Then with a sample DF of:
df = pd.DataFrame({
'job_description': ['C++ developer', 'traffic warden', 'Python developer', 'linux admin', 'cat herder']
})
You can find all keywords for the relevant column...
matches = df['job_description'].str.extractall(rx)
Which gives:
keyword
match
0 0 C++
1 developer
2 0 Python
1 developer
3 0 admin
Then you want to get a list of "dummies" and take the max (so you always get a 1 where a word was found) using:
dummies = pd.get_dummies(matches).groupby(level=0).max()
Which gives:
keyword_C++ keyword_Python keyword_admin keyword_developer
0 1 0 0 1
2 0 1 0 1
3 0 0 1 0
You then left join that back to your original DF:
result = df.join(dummies, how='left')
And the result is:
job_description keyword_C++ keyword_Python keyword_admin keyword_developer
0 C++ developer 1.0 0.0 0.0 1.0
1 traffic warden NaN NaN NaN NaN
2 Python developer 0.0 1.0 0.0 1.0
3 linux admin 0.0 0.0 1.0 0.0
4 cat herder NaN NaN NaN NaN
Given:
skill = "C++", or any of the others
frame = an instance of Index(['job_title', 'company', 'job_label', 'description'], dtype='object')
jobs = a list/np.array of frames, which is probably your input
A naive implementation could look a bit like this:
for skill in keywords:
    for frame in jobs:
        if skill in frame["description"]:  # or more exact matching, but this is what's in the question
            pass  # exists
But you need to put more work into what output structure you are going to use. Just having an output array of 300 columns, most of which contain False, isn't going to be a good plan. I've never worked with pandas myself, but if these were normal numpy arrays (which pandas DataFrames are under the hood), I would add a column "skills" that enumerates them.
You can leverage .apply() like so (#Jacco van Dorp made a solid suggestion of storing all of the found skills inside a single column, which I agree is likely the best approach to your problem):
df = pd.DataFrame([['Engineer','Firm','AERO1','Work with python and Django'],
['IT','Dell','ITD4','Work with Django and R'],
['Office Assistant','Dental','OAD3','Coordinate schedules'],
['QA Engineer','Factory','QA2','Work with R and python'],
['Mechanic','Autobody','AERO1','Love the movie Django']],
columns=['job_title','company','job_label','description'])
Which yields:
job_title company job_label description
0 Engineer Firm AERO1 Work with python and Django
1 IT Dell ITD4 Work with Django and R
2 Office Assistant Dental OAD3 Coordinate schedules
3 QA Engineer Factory QA2 Work with R and python
4 Mechanic Autobody AERO1 Love the movie Django
Then define your skill set and your list comprehension to pass to .apply():
skills = ['python','R','Django']
df['skills'] = df.apply(lambda x: [i for i in skills if i in x['description'].split()], axis=1)
Which yields this column:
skills
0 [python, Django]
1 [R, Django]
2 []
3 [python, R]
4 [Django]
If you are still interested in having individual columns for each skill, I can edit my answer to provide that as well.
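As a follow-up to that last point, the list column can be spread into one column per skill without re-scanning the descriptions, for example by exploding the lists and pivoting to dummies. A sketch reusing a skills column like the one built above:

```python
import pandas as pd

df = pd.DataFrame({'skills': [['python', 'Django'],
                              ['R', 'Django'],
                              [],
                              ['python', 'R']]})

# explode() gives one row per (index, skill) pair; get_dummies one-hots it;
# groupby(level=0).max() collapses back to one row per original index
dummies = (pd.get_dummies(df['skills'].explode())
             .groupby(level=0).max()
             .reindex(df.index, fill_value=0))
df = df.join(dummies)
```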