Pandas: Replacing string with hashed string via regex - python

I have a DataFrame with 29 columns, and need to replace part of a string in some columns with a hashed part of the string.
Example of the column is as follows:
ABSX, PLAN=PLAN_A ;SFFBJD
ADSFJ, PLAN=PLAN_B ;AHJDG
...
...
Code that captures the part of the string:
Test[14] = Test[14].replace({r'(?<=PLAN=)([^"]+ ;)': 'hello'}, regex=True)
I want to replace 'hello' with a hash of whatever '(?<=PLAN=)([^"]+ ;)' matches, but it doesn't work this way. Wanted to check if anyone did this before without looping over the DataFrame line by line?

Here is what I suggest:
import hashlib
import re
import pandas as pd
# First, reproduce a similar dataset
df = pd.DataFrame({"v1": ["ABSX", "ADSFJ"],
                   "v2": ["PLAN=PLAN_A", "PLAN=PLAN_B"],
                   "v3": ["SFFBJD", "AHJDG"]})
# Search for the regex and store the hashed match in a helper column
r = re.compile(r'=[a-zA-Z_]+')
df["matched_el"] = ["".join(r.findall(w)) for w in df.v2]
df["matched_el"] = df["matched_el"].str.replace("=", "")
df["matched_el"] = [hashlib.md5(w.encode()).hexdigest() for w in df.matched_el]
# Then replace the plan name in v2 with the hash
df["v2"] = df["v2"].str.replace("=[a-zA-Z_]+", "=", regex=True) + df["matched_el"]
df = df.drop(columns="matched_el")
Here is the result
v1 v2 v3
0 ABSX PLAN=8d846f78aa0b0debd89fc1faafc4c40f SFFBJD
1 ADSFJ PLAN=3b9a3c8184829ca5571cb08c0cf73c8d AHJDG
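The helper column can also be skipped entirely: with regex=True, Series.str.replace accepts a callable replacement that receives each match object, so the hash can be computed inline. A minimal sketch on a reproduced v2 column:

```python
import hashlib
import pandas as pd

df = pd.DataFrame({"v2": ["PLAN=PLAN_A", "PLAN=PLAN_B"]})

# The callable receives each re.Match and returns the hashed plan name;
# the lookbehind keeps the "PLAN=" prefix untouched.
df["v2"] = df["v2"].str.replace(
    r"(?<=PLAN=)[A-Za-z_]+",
    lambda m: hashlib.md5(m.group(0).encode()).hexdigest(),
    regex=True,
)
print(df)
```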

Related

How can I handle missing values in the dictionary when I use the function eval(String dictionary) -> dictionary in Python?

I need to convert the 'content' column from a string dictionary to a dictionary in Python. After that I will use the following line of code:
df['content'].apply(pd.Series)
to have the dictionary keys as the column names and the dictionary values in the cells.
I can't do this now because there are missing values in the dictionary strings.
I'm working on the 'content' column, which I want to convert to the correct format first. I tried the eval() function, but it doesn't work because there are missing values. This is JSON data. My goal is to have the 'content' keys as column titles and the values in the cells (screenshot: https://i.stack.imgur.com/1CsIl.png).
You can use json.loads in a lambda function: if the row value is NaN, keep it; if not, apply json.loads:
import json
import numpy as np
import pandas as pd
df['content'] = df['content'].apply(lambda x: json.loads(x) if pd.notna(x) else np.nan)
Now you can use pd.Series:
v1 = df['content'].apply(pd.Series)
df = df.drop(['content'], axis=1).join(v1)
If you have missing values inside the string dictionaries:
import ast
import json
import numpy as np
import pandas as pd

def check_json(x):
    if pd.isna(x):
        return np.nan
    try:
        return json.loads(x)
    except (TypeError, ValueError):
        pass
    try:
        mask = x.replace('{', '').replace('}', '')  # strip the braces
        mask = mask.split(",")
        for i in range(len(mask)):
            if not len(mask[i].partition(":")[-1]) > 0:  # value is missing
                mask[i] = mask[i] + '"None"'  # ---> replace None with whatever you want
        return json.loads('{' + ','.join(mask) + '}')
    except (TypeError, ValueError):
        pass
    try:
        x = x.replace("\'", "\"")
        mask = x.replace('{', '').replace('}', '')  # strip the braces
        mask = mask.split(",")
        for i in range(len(mask)):
            if not len(mask[i].partition(":")[-1]) > 0:  # value is missing
                mask[i] = mask[i] + '"None"'  # ---> replace None with whatever you want
        return ast.literal_eval('{' + ','.join(mask) + '}')
    except (TypeError, ValueError, SyntaxError):
        print("Could not parse json object. Returning nan")
        return np.nan

df['content'] = df['content'].apply(check_json)
v1 = df['content'].apply(pd.Series)
df = df.drop(['content'], axis=1).join(v1)
I cannot see what the missing values look like in your screenshot, but I tested the following code and got what seems to be a good result. The simple explanation is to use str.replace() to fix the null values before parsing the string to a dict.
import pandas as pd
import json
## set up an example dataframe; note that row2 has a null value
json_example = [
    '{"row1_key1":"row1_value1","row1_key2":"row1_value2"}',
    '{"row2_key1":"row2_value1","row2_key2": null}'
]
df = pd.DataFrame()
df['Content'] = json_example
## use string replace on the string representation of the json to clean it up
df['Content'] = df['Content'].apply(lambda x: x.replace('null', '"0"'))
## use a lambda to first load the string into a dict, then apply pd.Series()
df = df['Content'].apply(lambda x: pd.Series(json.loads(x)))
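If the goal is just to expand the parsed dicts into columns, pd.json_normalize is an alternative to .apply(pd.Series). A sketch, assuming a hypothetical 'content' column of JSON strings with possible missing rows:

```python
import json
import pandas as pd

df = pd.DataFrame({"content": ['{"a": 1, "b": 2}', None, '{"a": 3}']})

# Parse non-null rows; map missing rows to an empty dict so every
# row still produces a record for json_normalize.
parsed = df["content"].apply(lambda x: json.loads(x) if pd.notna(x) else {})
expanded = pd.json_normalize(parsed.tolist())
result = df.drop(columns="content").join(expanded)
```

Rows without a value for some key simply get NaN in that column.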

How to check the pattern of a column in a dataframe

I have a dataframe which has some ids. I want to check the pattern of those column values.
Here is how the column looks:
id: {ASDH12HK, GHST67KH, AGSH90IL, THKI86LK}
I want to write code that can distinguish characters from numerics in the pattern above and display an output like 'SSSS99SS' for the column, where 'S' represents a character and '9' represents a numeric. This is a large dataset, so I can't predefine the positions of the characters and numerics; I want the code to work out their positions. I am new to Python, so any leads will be helpful!
You can try something like:
my_string = "ASDH12HK"

def decode_pattern(my_string):
    my_string = ''.join('9' if s.isdigit() else s for s in my_string)
    my_string = ''.join('S' if s.isalpha() else s for s in my_string)
    return my_string

decode_pattern(my_string)
Output:
'SSSS99SS'
You can apply this to the column in your dataframe as well as below:
import pandas as pd
df = pd.DataFrame(['ASDH12HK','GHST67KH','AGSH90IL','THKI86LK', 'SOMEPATTERN123'], columns=['id'])
df['pattern'] = df['id'].map(decode_pattern)
df
Output:
id pattern
0 ASDH12HK SSSS99SS
1 GHST67KH SSSS99SS
2 AGSH90IL SSSS99SS
3 THKI86LK SSSS99SS
4 SOMEPATTERN123 SSSSSSSSSSS999
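The same pattern column can also be produced without a Python-level function, using two vectorized regex replacements (a sketch of an alternative, not the answer's code):

```python
import pandas as pd

df = pd.DataFrame({"id": ["ASDH12HK", "GHST67KH", "SOMEPATTERN123"]})

# Replace every letter with S, then every digit with 9.
df["pattern"] = (
    df["id"]
    .str.replace(r"[A-Za-z]", "S", regex=True)
    .str.replace(r"\d", "9", regex=True)
)
```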
You can use a regular expression:
import re
st = "SSSS99SSSS"
a = re.match("[A-Za-z]{4}[0-9]{2}[A-Za-z]{4}", st)
It will return a match if the string starts with 4 characters followed by 2 digits and again 4 characters, so you can use this on your df to filter it.
You can use the function findall() from the re module:
import re
text = "ASDH12HK,GHST67KH,AGSH90IL,THKI86LK"
result = re.findall("[A-Za-z]{4}[0-9]{2}[A-Za-z]{2}", text)
print(result)

How to replace substrings in a dataframe in Python

I have a dataframe where I want to replace some words with others, based on another dataframe:
import pandas as pd
dist = pd.DataFrame([["21","apple"],["25","balana"],["30","lemon"]], columns=["idx","item"])
a = pd.DataFrame(["apple - banana"], columns=["pf"])
a['pf'] = a['pf'].replace(dist["item"], dist["idx"], regex=True)
print(a)
How can I do that? (This does not work in its current form.)
You can try this:
dist = pd.DataFrame([["21","apple"],["25","balana"],["30","lemon"]], columns=["idx","item"])
a = pd.DataFrame(["apple - banana"], columns=["pf"])
b = dict(zip(dist["idx"], dist["item"]))

def replace_items(token):
    # replace each item (the dict value) with its idx (the dict key)
    for key, value in b.items():
        token = token.replace(value, key)
    return token

a["pf"] = a["pf"].apply(replace_items)
Please be aware that the banana in your dist dataframe is spelled balana. Not sure if this is intended...
Converting the translation table to a dictionary seems to solve the problem:
import pandas as pd
dist = pd.DataFrame([["apple","21"],["banana","25"],["lemon","30"]],columns=["item","idx"])
dist = dist.set_index('item')['idx'].to_dict()
a = pd.DataFrame(["apple - banana"],columns=["pf"])
a['pf'] = a['pf'].replace(dist, regex=True)
print(a)
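Note that replacing with a plain dict can misfire when one item is a substring of another. A single alternation pattern with a callable replacement avoids that; a sketch assuming the same mapping:

```python
import re
import pandas as pd

mapping = {"apple": "21", "banana": "25", "lemon": "30"}
a = pd.DataFrame(["apple - banana"], columns=["pf"])

# Sort longest-first so longer items win over their substrings,
# then route every match through the mapping in one pass.
pattern = "|".join(sorted(map(re.escape, mapping), key=len, reverse=True))
a["pf"] = a["pf"].str.replace(pattern, lambda m: mapping[m.group(0)], regex=True)
```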

Create pandas dataframe from string

I can easily build a pandas dataframe from a string that contains only one key-value pair. For example:
string1 = '{"Country":"USA","Name":"Ryan"}'
dict1 = json.loads(string1)
df=pd.DataFrame([dict1])
print(df)
However, when I use a string that has more than one key-value pair:
string2 = '{"Country":"USA","Name":"Ryan"}{"Country":"Sweden","Name":"Sam"}{"Country":"Brazil","Name":"Ralf"}'
dict2 = json.loads(string2)
I get the following error:
raise JSONDecodeError("Extra data", s, end)
I am aware that string2 is not a valid JSON.
What modifications can I do on string2 programmatically so that I can convert it to a valid JSON and then get a dataframe output which is as follows:
| Country | Name |
|---------|------|
| USA | Ryan |
| Sweden | Sam |
| Brazil | Ralf |
Your error
The error says it all: the JSON is not valid. Where did you get that string2? Are you typing it in yourself?
In that case you should surround the items with brackets [] and separate them with commas.
Working example:
import pandas as pd
import json
string2 = '[{"Country":"USA","Name":"Ryan"},{"Country":"Sweden","Name":"Sam"},{"Country":"Brazil","Name":"Ralf"}]'
df = pd.DataFrame(json.loads(string2))
print(df)
Returns:
Country Name
0 USA Ryan
1 Sweden Sam
2 Brazil Ralf
Interestingly, if you are extra observant, in the line df=pd.DataFrame([dict1]) you are actually putting your dictionary inside an array with brackets []. This is because pandas DataFrame accepts arrays of data. What you actually have in your first example is a single item, in which case a Series would make more sense: df = pd.Series(dict1).to_frame().T.
Or:
string1 = '[{"Country":"USA","Name":"Ryan"}]' # <--- brackets here to read json as arr
dict1 = json.loads(string1)
df=pd.DataFrame(dict1)
print(df)
And if you understood this, I think it becomes easier to see that we need , to separate the elements.
Alternative inputs
But let's say you are creating this dataset yourself, then you could go ahead and do this:
data = [("USA","Ryan"),("Sweden","Sam"),("Brazil","Ralf")]
dict1 = [{"Country":i, "Name":y} for i,y in data] # <-- dictionaries inside arr
df = pd.DataFrame(dict1)
Or:
data = [("USA","Ryan"),("Sweden","Sam"),("Brazil","Ralf")]
df = pd.DataFrame(data, columns=['Country','Name'])
Or, which I would prefer, use a CSV structure:
import io
data = '''\
Country,Name
USA,Ryan
Sweden,Sam
Brazil,Ralf'''
df = pd.read_csv(io.StringIO(data))
On the off chance that you are getting data from elsewhere in the weird format that you described, the following regular-expression-based substitutions can fix your JSON, and thereafter you can proceed as per @Anton vBR's solution.
import pandas as pd
import json
import re
string2 = '{"Country":"USA","Name":"Ryan"}{"Country":"Sweden","Name":"Sam"}{"Country":"Brazil","Name":"Ralf"}'
# create a dict of substitutions
rd = {'^{' : '[{',   # substitute the starting char with [
      '}$' : '}]',   # substitute the ending char with ]
      '}{' : '},{'}  # add , in between two dicts
# replace as per the dict
for k, v in rd.items():
    string2 = re.sub(k, v, string2)
df = pd.DataFrame(json.loads(string2))
print(df)
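Since each record here is a flat object with no nested braces, a non-greedy findall can also split the concatenated objects without rewriting the string (a sketch under that flat-object assumption):

```python
import json
import re
import pandas as pd

string2 = ('{"Country":"USA","Name":"Ryan"}'
           '{"Country":"Sweden","Name":"Sam"}'
           '{"Country":"Brazil","Name":"Ralf"}')

# Match each {...} object non-greedily and parse it on its own.
records = [json.loads(m) for m in re.findall(r"\{.*?\}", string2)]
df = pd.DataFrame(records)
```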

python pandas - function applied to csv is not persisted

I need to polish a CSV dataset, but it seems the changes are not applied to the dataset itself.
CSV is in this format:
ID, TRACK_LINK
761607, https://mylink.com//track/...
This is my script:
import pandas as pd
df = pd.read_csv('./file.csv').fillna('')

# remove the double // from TRACK_LINK
def polish_track_link(track_link):
    return track_link.replace("//track", "/track")

df['TRACK_LINK'].apply(polish_track_link)
print(df)
print(df)
this prints something like:
...
761607 https://mylink.com//track/...
note the //track
If I do print(df['TRACK_LINK'].apply(polish_track_link)) I get:
...
761607, https://mylink.com/track/...
So the function polish_track_link works but it's not applied to the dataset. Any idea why?
You need to assign it back:
df['TRACK_LINK'] = df['TRACK_LINK'].apply(polish_track_link)
But it is better to use the pandas functions str.replace or replace with regex=True for replacing substrings:
df['TRACK_LINK'] = df['TRACK_LINK'].str.replace("//track", "/track")
Or:
df['TRACK_LINK'] = df['TRACK_LINK'].replace("//track", "/track", regex=True)
print(df)
ID TRACK_LINK
0 761607 https://mylink.com/track/
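The underlying point is that apply (and str.replace) return a new Series and never mutate the frame in place; a small sketch with a made-up link:

```python
import pandas as pd

df = pd.DataFrame({"TRACK_LINK": ["https://mylink.com//track/abc"]})

cleaned = df["TRACK_LINK"].str.replace("//track", "/track")
# df is untouched until you assign the result back
print(df["TRACK_LINK"][0])   # still contains //track
df["TRACK_LINK"] = cleaned
print(df["TRACK_LINK"][0])   # now contains /track
```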
