Python Pandas: replace substring if string starts with a double quote

I am trying to write the following function:
def d(row):
    if df['name'].str.startswith('"'):
        return df['name'].str.replace("'", "''")
    else:
        return df['name']

df['name2'] = df.apply(lambda row: d(row), axis=1)
I am trying to add a second apostrophe whenever a string has a single apostrophe within a contraction; this only appears when I have double-quoted strings.
I keep getting: KeyError: ('name', 'occurred at index 0')
This only happens a few times in my dataset, but I need to replace "jack's place" with "jack''s place" so that I can inject it into a SQL query.

Why can't you do a full replace?
df['name2'] = df['name'].str.replace('\'', '"')
print(df)
           name         name2
0           ABC           ABC
1           SDF           SDF
2  jack's place  jack"s place
3  jack's place  jack"s place
4  jack's place  jack"s place
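Note that the replace above swaps the apostrophe for a double quote; since the stated goal is the doubled apostrophe ('') for SQL escaping, a direct unconditional replace works too. A minimal sketch, assuming a hypothetical df shaped like the question's:

```python
import pandas as pd

# Hypothetical data mirroring the question
df = pd.DataFrame({'name': ['ABC', "jack's place"]})

# Double every single quote; names without apostrophes pass through
# unchanged, so no startswith('"') check is needed
df['name2'] = df['name'].str.replace("'", "''")
print(df['name2'].tolist())  # → ['ABC', "jack''s place"]
```

For building SQL, parameterized queries are generally safer than escaping strings by hand.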

Related

How to identify records in a DataFrame (Python/Pandas) that contains leading or trailing spaces

I would like to know how to write a formula that would identify/display records of string/object data type in a Pandas DataFrame that contain leading or trailing spaces.
The purpose for this is to get an audit on a Jupyter notebook of such records before applying any strip functions.
The goal is for the script to identify these records automatically without having to type the name of the columns manually. The scope should be any column of str/object data type that contains a value that includes either a leading or trailing spaces or both.
Please note: I would like to see the resulting output in a dataframe format.
Thank you!
Link to sample dataframe data
You can use:
df['col'].str.startswith(' ')
df['col'].str.endswith(' ')
or with a regex:
df['col'].str.match(r'\s+')
df['col'].str.contains(r'\s+$')
Example:
df = pd.DataFrame({'col': [' abc', 'def', 'ghi ', ' jkl ']})
df['start'] = df['col'].str.startswith(' ')
df['end'] = df['col'].str.endswith(' ')
df['either'] = df['start'] | df['end']
      col  start    end  either
0    abc    True  False    True
1     def  False  False   False
2    ghi   False   True    True
3    jkl    True   True    True
However, this is likely not faster than directly stripping the spaces:
df['col'] = df['col'].str.strip()
col
0 abc
1 def
2 ghi
3 jkl
updated answer
To detect the columns with leading/trailing spaces, you can use:
cols = df.astype(str).apply(lambda c: c.str.contains(r'^\s+|\s+$')).any()
cols[cols].index
example on the provided link:
Index(['First Name', 'Team'], dtype='object')
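For reference, a self-contained version of that detection (the data here is made up, not taken from the linked sample):

```python
import pandas as pd

# Hypothetical frame: 'First Name' has a trailing space, 'Team' a leading one
df = pd.DataFrame({'First Name': ['Avery ', 'Jordan'],
                   'Team': [' Lakers', 'Bulls'],
                   'Age': [25, 30]})

# Flag each column where any value carries leading or trailing whitespace
cols = df.astype(str).apply(lambda c: c.str.contains(r'^\s+|\s+$')).any()
print(cols[cols].index.tolist())  # → ['First Name', 'Team']
```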

Multiple lambda outputs in string replacement using apply [Python]

I have a list of "states" from which I have to iterate:
states = ['antioquia', 'boyaca', 'cordoba', 'choco']
I have to iterate one column in a pandas df to replace or cut the string where the state text is found, so I try:
df_copy['joined'].apply([(lambda x: x.replace(x,x[:-len(j)]) if x.endswith(j) and len(j) != 0 else x) for j in states])
And the result is:
Result wanted:
joined column is the input and the desired output is p_joined column
If it's possible also to find the state not only in the end of the string but check if the string contains it and replace it
Thanks in advance for your help.
This will do what your question asks:
df_copy['p_joined'] = df_copy.joined.str.replace('(' + '|'.join(states) + ')$', '', regex=True)
Output:
joined p_joined
0 caldasantioquia caldas
1 santafeantioquia santafe
2 medelinantioquiamedelinantioquia medelinantioquiamedelin
3 yarumalantioquia yarumal
4 medelinantioquiamedelinantioquia medelinantioquiamedelin
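As for the follow-up in the question (matching the state anywhere, not only at the end of the string), dropping the $ anchor removes every occurrence. A sketch with made-up rows shaped like the output above; note that recent pandas versions require regex=True for patterns in str.replace:

```python
import pandas as pd

states = ['antioquia', 'boyaca', 'cordoba', 'choco']
df_copy = pd.DataFrame({'joined': ['caldasantioquia',
                                   'medelinantioquiamedelinantioquia']})

pattern = '|'.join(states)
# Anchored: strip a state name only from the end of the string
df_copy['p_joined'] = df_copy['joined'].str.replace(f'({pattern})$', '', regex=True)
# Unanchored: remove every occurrence, wherever it appears
df_copy['stripped'] = df_copy['joined'].str.replace(pattern, '', regex=True)
print(df_copy['stripped'].tolist())  # → ['caldas', 'medelinmedelin']
```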

parsing dataframe columns containing functions

Python/pandas newbie here. The csv file I'm trying to work with has been populated with data that looks something like this:
A B C D
Option1(item1=12345, item12='string', item345=0.123) 2020-03-16 1.234 Option2(item4=123, item56=234, item678=345)
I'd like it to look like this:
item1 item12 item345 B C item4 item56 item678
12345 'string' 0.123 2020-03-16 1.234 123 234 345
In other words, I want to replace columns A and D with new columns headed by what's on the left of the equal sign, using what's to the right of the equal sign as the corresponding value, and with the Option1() and Option2() parts and the commas stripped out. The columns that don't contain functions should be left as is.
Is there an elegant way to do this?
Actually, at this point, I'd settle for any old way, elegant or not; I've found various ways of dealing with this situation if, say, there were dicts populating columns, but nothing to help me pick it apart if there are functions there. Trying to search for the answer only gives me a bunch of results for how to apply functions to dataframes.
As long as your functions always have the same arguments, this should work.
You can read the csv with (if separators are 2 or more spaces, that's what I get when I paste your question example):
df = pd.read_csv('test.csv', sep=r'\s{2,}', index_col=False, engine='python')
If your dataframe is df:
# break out both sides of the equal sign in function into columns
A_vals = df['A'].str.extractall(r'([\w\d]+)=([^,\)]*)')
# get rid of the multi-index and put the values after '=' into columns
A_converted = A_vals.unstack(level=-1)[1]
# set column names to values before '='
A_converted.columns = list(A_vals.unstack(level=-1)[0].values[0])
# same thing for 'D'
D_vals = df['D'].str.extractall(r'([\w\d]+)=([^,\)]*)')
D_converted = D_vals.unstack(level=-1)[1]
D_converted.columns = list(D_vals.unstack(level=-1)[0].values[0])
# join everything together
df = A_converted.join(df.drop(['A','D'], axis=1)).join(D_converted)
Some clarification on the regex: '([\w\d]+)=([^,\)]*)' has two capture groups (each part in parens):
Group 1 ([\w\d]+) is one or more characters (+) that are word characters \w or numbers \d.
= between groups.
Group 2 ([^,\)]*) is 0 or more characters (*) that are not (^) a comma , or paren \).
I believe you're looking for something along these lines:
contracts = ["Option(conId=384688665, symbol='SPX', lastTradeDateOrContractMonth='20200116', strike=3205.0, right='P', multiplier='100', exchange='SMART', currency='USD', localSymbol='SPX 200117P03205000', tradingClass='SPX')",
"Option(conId=12345678, symbol='DJX', lastTradeDateOrContractMonth='20200113', strike=1205.0, right='P', multiplier='200', exchange='SMART', currency='USD', localSymbol='DJXX 333117Y13205000', tradingClass='DJX')"]
new_conts = []
columns = []
for i in range(len(contracts)):
    mod = contracts[i].replace('Option(', '').replace(')', '')
    contracts[i] = mod
    new_cont = contracts[i].split(',')
    new_conts.append(new_cont)
for contract in new_conts:
    column = []
    for i in range(len(contract)):
        mod = contract[i].split('=')
        contract[i] = mod[1]
        column.append(mod[0])
    columns.append(column)
print(len(columns[0]))
df = pd.DataFrame(new_conts, columns=columns[0])
df
Output:
conId symbol lastTradeDateOrContractMonth strike right multiplier exchange currency localSymbol tradingClass
0 384688665 'SPX' '20200116' 3205.0 'P' '100' 'SMART' 'USD' 'SPX 200117P03205000' 'SPX'
1 12345678 'DJX' '20200113' 1205.0 'P' '200' 'SMART' 'USD' 'DJXX 333117Y13205000' 'DJX'
Obviously you can then delete unwanted columns, change names, etc.
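The same key=value extraction can also be done per cell with re.findall, which maps each cell straight to a dict; a sketch, assuming the arguments always have the question's key=value shape (parse_call is a made-up helper name):

```python
import re

def parse_call(cell):
    """Extract key=value pairs from a string like "Option1(a=1, b='x')"."""
    return dict(re.findall(r'(\w+)=([^,\)]*)', cell))

# Hypothetical cell shaped like the question's column A
cell = "Option1(item1=12345, item12='string', item345=0.123)"
print(parse_call(cell))
# → {'item1': '12345', 'item12': "'string'", 'item345': '0.123'}
```

From there, pd.DataFrame(df['A'].map(parse_call).tolist()) would build the new columns in one go.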

Is there a way in pandas to remove duplicates from within a series?

I have a dataframe which has some duplicate tags separated by commas in the "Tags" column; is there a way to remove the duplicate strings from the series? I want the output in row 400 to have just Museum, Drinking, Shopping.
I can't split on a comma & remove them because there are some tags in the series that have similar words like for example: [Museum, Art Museum, Shopping] so splitting and dropping multiple museum strings would affect the unique 'Art Museum' string.
Desired Output
You can split by comma and, after removing leading/trailing whitespace with str.strip(), convert to a set(), which removes duplicates. Then you can df.apply() this to your column. (Note that set() does not preserve the original tag order.)
df['Tags'] = df['Tags'].apply(lambda x: ', '.join(set([y.strip() for y in x.split(',')])))
You can create a function that removes duplicates from a given string. Then apply this function to your column Tags.
def remove_dup(strng):
    '''
    Split the string on ', ' and re-join it with duplicates removed,
    preserving the original order (dict.fromkeys keeps insertion order).
    '''
    return ', '.join(list(dict.fromkeys(strng.split(', '))))

df['Tags'] = df['Tags'].apply(lambda x: remove_dup(x))
DEMO:
import pandas as pd
my_dict = {'Tags':["Museum, Art Museum, Shopping, Museum",'Drink, Drink','Shop','Visit'],'Country':['USA','USA','USA', 'USA']}
df = pd.DataFrame(my_dict)
df['Tags'] = df['Tags'].apply(lambda x: remove_dup(x))
df
Output:
Tags Country
0 Museum, Art Museum, Shopping USA
1 Drink USA
2 Shop USA
3 Visit USA
Without a code example in the question, I've thrown together something that should work.
import pandas as pd
test = [['Museum', 'Art Museum', 'Shopping', "Museum"]]
df = pd.DataFrame()
df[0] = test
df[0] = df.applymap(set)
Out[35]:
0
0 {Museum, Shopping, Art Museum}
One approach that avoids apply
# in your code just s = df['Tags']
s = pd.Series(['','', 'Tour',
'Outdoors, Beach, Sports',
'Museum, Drinking, Drinking, Shopping'])
(s.str.split(r',\s+', expand=True)
.stack()
.reset_index()
.drop_duplicates(['level_0',0])
.groupby('level_0')[0]
.agg(','.join)
)
Output:
level_0
0
1
2 Tour
3 Outdoors,Beach,Sports
4 Museum,Drinking,Shopping
Name: 0, dtype: object
There may be fancier ways of doing this kind of thing, but this will do the job.
Make it lower-case:
data['tags'] = data['tags'].str.lower()
Split every row in the tags column by comma; this returns a list of strings:
data['tags'] = data['tags'].str.split(',')
Map str.strip onto every element of the list (removing surrounding spaces), then apply set() to get the set of words with duplicates removed:
data['tags'] = data['tags'].apply(lambda x: set(map(str.strip , x)))
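Putting those steps together into one runnable sketch (swapping set() for dict.fromkeys so the first-seen tag order is kept; the sample row is made up):

```python
import pandas as pd

data = pd.DataFrame({'tags': ['Museum, Drinking, Drinking, Shopping']})

data['tags'] = data['tags'].str.lower()
data['tags'] = data['tags'].str.split(',')
# dict.fromkeys de-duplicates while preserving insertion order, unlike set()
data['tags'] = data['tags'].apply(
    lambda x: ', '.join(dict.fromkeys(s.strip() for s in x)))
print(data['tags'][0])  # → museum, drinking, shopping
```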

Remove all rows with a special character in pandas

I have a dataframe where there are special characters (like a square) in one of the columns, EPI_ID. I want to remove all rows that contain this special character. It isn't a standard character, and I haven't found similar issues involving a dataframe, only plain strings. Nevertheless, I am having trouble identifying these rows. Any suggestions?
df
EPI_ID stuff
2342F randoM_words
FER43 predictive_words
u'\u25A1' blank
My attempt:
df[~df['EPI_ID'].apply(lambda x: x.encode('ascii') == True)]
My results are throwing False for every row.
Expected output:
EPI_ID stuff
2342F randoM_words
FER43 predictive_words
Edit: the square doesn't come up in the mock df, but the character is the one shown above (u'\u25A1').
Assuming your DataFrame looks something like this:
>>> df = pd.DataFrame({'EPI_ID': ['2343F', 'FER43', 'DF' + u'\u25A1' + '123', 'PQRX74'], 'STUFF': ['abc', 'def', 'ghi', 'jkl']})
>>> df
EPI_ID STUFF
0 2343F abc
1 FER43 def
2 DF□123 ghi
3 PQRX74 jkl
You can use str.contains, which handles regex:
df.loc[df['EPI_ID'].str.contains(r'[^\x00-\x7F]+') == False]
EPI_ID STUFF
0 2343F abc
1 FER43 def
3 PQRX74 jkl
Regex courtesy of this answer: (grep) Regex to match non-ASCII characters?
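A self-contained sketch of that filter, using ~ to negate the mask instead of comparing with == False (the data is made up to mirror the question):

```python
import pandas as pd

df = pd.DataFrame({'EPI_ID': ['2342F', 'FER43', '\u25A1'],
                   'stuff': ['randoM_words', 'predictive_words', 'blank']})

# Keep only rows whose EPI_ID contains no non-ASCII character
clean = df[~df['EPI_ID'].str.contains(r'[^\x00-\x7F]')]
print(clean['EPI_ID'].tolist())  # → ['2342F', 'FER43']
```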
