I have a list of "states" from which I have to iterate:
states = ['antioquia', 'boyaca', 'cordoba', 'choco']
I have to iterate one column in a pandas df to replace or cut the string where the state text is found, so I try:
df_copy['joined'].apply([(lambda x: x.replace(x,x[:-len(j)]) if x.endswith(j) and len(j) != 0 else x) for j in states])
And the result is:
Result wanted:
joined column is the input and the desired output is p_joined column
If it's possible also to find the state not only in the end of the string but check if the string contains it and replace it
Thanks in advance for your help.
This will do what your question asks:
df_copy['p_joined'] = df_copy.joined.str.replace('(' + '|'.join(states) + ')$', '')
Output:
joined p_joined
0 caldasantioquia caldas
1 santafeantioquia santafe
2 medelinantioquiamedelinantioquia medelinantioquiamedelin
3 yarumalantioquia yarumal
4 medelinantioquiamedelinantioquia medelinantioquiamedelin
Related
I would like to know if it is possible to create a dataframe from two dictionaries.
I get two dictionaries like this:
dict= {'MO': ['N-2', 'N-8', 'N-7', 'N-6', 'N-9'], 'MO2': ['N0-6'], 'MO3': ['N-2']}
My result will be like this :
ID NUM
0 MO 'N-2', 'N-8', 'N-7', 'N-6', 'N-9'
1 MO2 'N0-6'
2 MO3 'N-2'
I try to obtain this result but in the column with the value I get [] and I can't remove it
liste_id=list(dict.keys())
liste_num=list(dict.values())
df = pandas.DataFrame({'ID':liste_id,'NUM':liste_num})
Merge the values in the dictionary into a string, before creating the dataframe; this ensures the arrays are of the same length
pd.DataFrame([(key, ", ".join(value))
for key, value in dicts.items()],
columns = ['ID', 'NUM'])
ID NUM
0 MO N-2, N-8, N-7, N-6, N-9
1 MO2 N0-6
2 MO3 N-2
I am trying to write the following function:
def d (row):
if df['name'].str.startswith('"'):
return df['name'].str.replace("'","''")
else:
return df['name']
df['name2'] = df.apply(lambda row: d(row), axis=1)
I am trying to add a second apostrophe whenever a string has a single apostrophe within a contraction. this only appears when i have double quoted strings.
I continue to get a ' KeyError: ('name', occurred at index 0')
this only happens a few times in my dataset, but i need to replace "jack's place" with "jack''s place" so that i can inject this into a sql query.
Why can't you do a full replace:
df['name2'] = df['name'].str.replace('\'', '"')
print(df)
name name2
0 ABC ABC
1 SDF SDF
2 jack's place jack"s place
3 jack's place jack"s place
4 jack's place jack"s place
I have used re.search to get strings of uniqueID from larger strings.
ex:
import re
string= 'example string with this uniqueID: 300-350'
combination = '(\d+)[-](\d+)'
m = re.search(combination, string)
print (m.group(0))
Out: '300-350'
I have created a dataframe with the UniqueID and the Combination as columns.
uniqueID combinations
0 300-350 (\d+)[-](\d+)
1 off-250 (\w+)[-](\d+)
2 on-stab (\w+)[-](\w+)
And a dictionary meaning_combination relating the combination with the variable meaning it represents:
meaning_combination={'(\\d+)[-](\\d+)': 'A-B',
'(\\w+)[-](\\d+)': 'C-A',
'(\\w+)[-](\\w+)': 'C-D'}
I want to create new columns for each variable (A, B, C, D) and fill them with their corresponding values.
the final result should look like this:
uniqueID combinations A B C D
0 300-350 (\d+)[-](\d+) 300 350
1 off-250 (\w+)[-](\d+) 250 off
2 on-stab (\w+)[-](\w+) stab on
I would fix your regexes to:
meaning_combination={'(\d+-\d+)': 'A-B',
'([^0-9\W]+\-\d+)': 'C-A',
'([^0-9\W]+\-[^0-9\W]+)': 'C-D'}
To capture the full group instead of having three capturing groups.
I.e. (300-350, 300, 350) --> (300-350)
You don't need to have extra two capturing groups because if a specific pattern is satisfied then you know what the positions of the word or digit characters are (based on how you defined the pattern) and you can split by - to access them individually.
I.e.:
str = 'example string with this uniqueID: 300-350'
values = re.findall('(\d+-\d+)', str)
>>>['300-350']
#first digit char:
values[0].split('-')[0]
>>>'300'
If you use this way, you can loop over dictionary keys and list of strings and test if the pattern is satisfied in the string. If it's satisfied (len(re.findall(pattern, string)) != 0), then grab the corresponding dictionary value for the key and split it and split the match and assign dictionary_value.split('-')[0] : match[0].split('-')[0] and dictionary_value.split('-')[1] : match[0].split('-')[1] in a new dictionary that your creating in the loop - also assign unique id to the full match value and combination to the matched pattern. Then use pandas to make a Dataframe.
Altogether:
import re
import pandas as pd
stri= ['example string with this uniqueID: 300-350', 'example string with this uniqueID: off-250', 'example string with this uniqueID: on-stab']
meaning_combination={'(\d+-\d+)': 'A-B',
'([^0-9\W]+\-\d+)': 'C-A',
'([^0-9\W]+\-[^0-9\W]+)': 'C-D'}
values = [{'Unique ID': re.findall(x, st)[0], 'Combination': x, y.split('-')[0] : re.findall(x, st)[0].split('-')[0], y.split('-')[1] : re.findall(x, st)[0].split('-')[1]} for st in stri for x, y in meaning_combination.items() if len(re.findall(x, st)) != 0]
df = pd.DataFrame.from_dict(values)
#just to sort it in order since default is alphabetical
col_val = ['Unique ID', 'Combination', 'A', 'B', 'C', 'D']
df = df.reindex(sorted(df.columns, key=lambda x: col_val.index(x) ), axis=1)
print(df)
output:
Unique ID Combination A B C D
0 300-350 (\d+-\d+) 300 350 NaN NaN
1 off-250 ([^0-9\W]+\-\d+) 250 NaN off NaN
2 on-stab ([^0-9\W]+\-[^0-9\W]+) NaN NaN on stab
Also, note, I think you have a typo in your expected output because you have:
'(\\w+)[-](\\d+)': 'C-A'
which would match off-250, but in your final result you have:
uniqueID combinations A B C D
1 off-250 (\w+)[-](\d+) 250 off
When based on your key this should be in C and A.
I have some problems with regular expression. I have a dataset with money amount and in some rows there is an odd separator. And i need a regular expression to remove only the odd separator.
For example, this is a data i have:
user_id sum
1 10.10
2 154.24
3 19.565.02
4 2.142.00
And the expected result is:
user_id sum
1 10.10
2 154.24
3 19565.02
4 2142.00
5 1.99
I use python and pandas lib for data analysis.
Help please with regex. Thank you!
Well, if your data is formed with 2 decimal places on the end, you can skip the regex and just use python.
For example, let's say you get all your data into a list (negate the header row) you can do the following to fix the dataset:
dirty = ['10.10', '154.24', '19.565.02', '2.142.00', '1.99']
# this is a list comprehension that replaces the any '.' with '' in all
# but the last three characters of your strings
clean = [item[:-3].replace('.', '') + item[-3:] for item in dirty]
>>> clean
['10.10', '154.24', '19565.02', '2142.00', '1.99']
Answer updated thanks to #match.
slighty different way with conditional column creation using np.where from the numpy module:
df['sum'] = np.where(df.sum_col.str.count('\.') >= 2, df.sum_col.str.replace('.', '', 1), df.sum_col )
or for any amount of .:
df['sum'] = pd.to_numeric([i.replace('.','',x) for i,x in
zip(df['sum'],df['sum'].str.count('\.')-1)])
Returns:
sum_col sum
0 10.10 10.10
1 154.24 154.24
2 19.565.02 19565.02
3 2.142.00 2142.00
The sum column is the cleaned up column
I have currently run the following script which uses Fuzzylogic to replace some common words from the list. Dataframe df1 contains my default list of possible values. Dataframe df2 is the main dataframe where transformations/changes are undertaken after referring to Dataframe df1. The code is as follows:
df1 = pd.DataFrame(['one','two','three','four','five','tsst'])
df2 = pd.DataFrame({'not_shifted':[np.nan,'one','too','three','fours','five','six',np.nan,'test']})
# Drop nan value
df2=pd.DataFrame(df2['not_shifted'].fillna(value=''))
df2['not_shifted'] = df2['not_shifted'].map(lambda x: difflib.get_close_matches(x, df1[0]))
The problem is the output is a dataframe which contains square brackets. To make matters worse, none of the texts within df2['not_shifted'] are viewable/ recallable:
Out[421]:
not_shifted
0 []
1 [one]
2 [two]
3 [three]
4 [four]
5 [five]
6 []
7 []
8 [tsst]
Please help.
df2.not_shifted.apply(lambda x: x[0] if len(x) != 0 else "") or simply df2.not_shifted.str[0] as solved by #Psidom
def replace_all(eg):
rep = {"[":"",
"]":"",
"u":"",
"}":"",
"'":"",
'"':"",
"frozenset":""}
for i,j in rep.items():
eg = eg.replace(i,j)
return eg
for each in df.columns:
df[each] = df[each].apply(lambda x : replace_all(str(x)))