Remove unwanted str in Pandas dataframe - python

I am reading a CSV file using pandas read_csv which contains data like:
Id;LibId;1;mod;modId;28;Index=10, Step=0, data=d720983f0000c0bf0000000014ae47bf0fe7c23ad1de3039;
Id;LibId;1;mod;modId;4;f9e9003e;
...
In the last column, I want to remove the Index, Step, and data= parts and retain only the hex value.
I created a list of the unwanted values and used a regex, but nothing seems to work.
to_remove = ['Index','Step','data=']
rex = '[' + re.escape (''. join (to_remove )) + ']'
output_csv['Column_name'].str.replace(rex , '', regex=True)

I suggest fixing your code as follows:
to_remove = ['Index','Step','data=']
output_csv['Column_name'] = output_csv['Column_name'].str.replace('|'.join([re.escape(x) for x in to_remove]), '', regex=True)
The '|'.join([re.escape(x) for x in to_remove]) part builds an alternation regex like Index|Step|data= that matches any of the to_remove substrings. Your original square brackets created a character class, which matches single characters rather than whole substrings, so it could never remove Index or Step as units.
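A minimal sketch of what the corrected pattern does, using plain re without a DataFrame (Python 3.7+, where re.escape leaves = alone). Note it removes only the listed substrings, so the stray = signs and digits after Index and Step remain:

```python
import re

to_remove = ['Index', 'Step', 'data=']
pattern = '|'.join(re.escape(x) for x in to_remove)
print(pattern)  # Index|Step|data=

# only the listed substrings are removed; '=10, =0, ' is left behind
cleaned = re.sub(pattern, '', 'Index=10, Step=0, data=f9e9003e')
print(cleaned)  # =10, =0, f9e9003e
```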

Input (added columns name for reference, can be avoided):
col1;col2;col3;col4;col5;col6;col7
Id;LibId;1;mod;modId;28;Index=10, Step=0, data=d720983f0000c0bf0000000014ae47bf0fe7c23ad1de3039
Id;LibId;1;mod;modId;28;Index=10, Step=0, data=d7203ad1de3039
Id;LibId;1;mod;modId;28;Index=10, Step=0, data=d720e47bf0fe7c23ad1de3039
Code:
import pandas as pd
df = pd.read_csv(r"check.csv", sep=";")
df["col7"] = df["col7"].replace(regex=True, to_replace="(Index=)(.*)(data=)", value="")
This will extract only the hex value from the "data" part and remove everything else. Assign the result back to the column; chained inplace=True on df["col7"] may not modify df under copy-on-write in newer pandas.
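If the goal is to keep only the hex payload and the value always follows data=, a str.extract sketch (hypothetical Series standing in for col7) also works; rows that have no data= prefix come out as NaN:

```python
import pandas as pd

s = pd.Series(["Index=10, Step=0, data=d720983f", "f9e9003e"])
# capture just the hex digits after 'data='; rows without it yield NaN
hexes = s.str.extract(r"data=([0-9a-f]+)", expand=False)
print(hexes.tolist())
```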

Can pandas findall() return a str instead of list?

I have a pandas dataframe containing a lot of variables:
df.columns
Out[0]:
Index(['COUNADU_SOIL_P_NUMBER_16_DA_B_VE_count_nr_lesion_PRATZE',
'COUNEGG_SOIL_P_NUMBER_50_DA_B_VT_count_nr_lesion_PRATZE',
'COUNJUV_SOIL_P_NUMBER_128_DA_B_V6_count_nr_lesion_PRATZE',
'COUNADU_SOIL_P_SAUDPC_150_DA_B_V6_lesion_saudpc_PRATZE',
'CONTRO_SOIL_P_pUNCK_150_DA_B_V6_lesion_p_control_PRATZE',
'COUNJUV_SOIL_P_p_0_100_16_DA_B_V6_lesion_incidence_PRATZE',
'COUNADU_SOIL_P_p_0_100_50_DA_B_VT_lesion_incidence_PRATZE',
'COUNEGG_SOIL_P_p_0_100_128_DA_B_VT_lesion_incidence_PRATZE',
'COUNEGG_SOIL_P_NUMBER_50_DA_B_V6_count_nr_spiral_HELYSP',
'COUNJUV_SOIL_P_NUMBER_128_DA_B_V10_count_nr_spiral_HELYSP', # and so on
I would like to keep only the number followed by DA, so the first column is 16_DA. I have been using the pandas function findall():
df.columns.str.findall(r'[0-9]*\_DA')
Out[595]:
Index([ ['16_DA'], ['50_DA'], ['128_DA'], ['150_DA'], ['150_DA'],
['16_DA'], ['50_DA'], ['128_DA'], ['50_DA'], ['128_DA'], ['150_DA'],
['150_DA'], ['50_DA'], ['128_DA'],
But this returns a list, which I would like to avoid, so that I end up with a column index looking like this:
df.columns
Out[595]:
Index(['16_DA', '50_DA', '128_DA', '150_DA', '150_DA',
'16_DA', '50_DA', '128_DA', '50_DA', '128_DA', '150_DA',
Is there a smoother way to do this?
You can use .str.join(", ") to join all found matches with a comma and space:
df.columns.str.findall(r'\d+_DA').str.join(", ")
Or, just use str.extract to get the first match:
df.columns.str.extract(r'(\d+_DA)', expand=False)
Another option is to flatten all matches into a single comma-separated string:
from typing import List

pattern = r'[0-9]*_DA'
flattened: List[str] = sum(df.columns.str.findall(pattern), [])
output: str = ",".join(flattened)
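To see how the two approaches above compare, here is a toy index (column names shortened for the example) run through both; with a single match per name they agree:

```python
import pandas as pd

cols = pd.Index(['COUNADU_SOIL_16_DA_B_VE', 'COUNEGG_SOIL_50_DA_B_VT'])
joined = cols.str.findall(r'\d+_DA').str.join(", ")      # list per name, then joined
extracted = cols.str.extract(r'(\d+_DA)', expand=False)  # first match as a string
print(list(joined))     # ['16_DA', '50_DA']
print(list(extracted))  # ['16_DA', '50_DA']
```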

How to strip/replace "domain\" from Pandas DataFrame Column?

I have a pandas DataFrame that's being read in from a CSV that has hostnames of computers including the domain they belong to along with a bunch of other columns. I'm trying to strip out the Domain information such that I'm left with ONLY the Hostname.
DataFrame ex:
name
domain1\computername1
domain1\computername45
dmain3\servername1
dmain3\computername3
domain1\servername64
....
I've tried using both str.strip() and str.replace() with a regex as well as a string literal, but I can't seem to target the domain information correctly.
Examples of what I've tried thus far:
df['name'].str.strip('.*\\')
df['name'].str.replace('.*\\', '', regex = True)
df['name'].str.replace(r'[.*\\]', '', regex = True)
df['name'].str.replace('domain1\\\\', '', regex = False)
df['name'].str.replace('dmain3\\\\', '', regex = False)
None of these seem to make any changes when I spit the DataFrame out using logging.debug(df)
You are already close to the answer, just use:
df['name'] = df['name'].str.replace(r'.*\\', '', regex = True)
which just adds an r-string to one of the versions you already tried.
Without the r-string, the Python literal '.*\\' contains only one backslash, so the regex engine sees .*\ with a dangling escape, which is invalid. With the r-string, r'.*\\' keeps both backslashes, and the regex engine reads \\ as one literal backslash, so the pattern matches everything up to and including the last backslash in the name.
Output:
0 computername1
1 computername45
2 servername1
3 computername3
4 servername64
Name: name, dtype: object
You can use .str.split:
df["name"] = df["name"].str.split("\\", n=1).str[-1]
print(df)
Prints:
name
0 computername1
1 computername45
2 servername1
3 computername3
4 servername64
A no-regex approach with ntpath.basename:
import pandas as pd
import ntpath
df = pd.DataFrame({'name':[r'domain1\computername1']})
df["name"] = df["name"].apply(lambda x: ntpath.basename(x))
Results: computername1.
With rsplit:
df["name"] = df["name"].str.rsplit('\\').str[-1]
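For reference, the regex and split answers above agree on the sample names (a small self-contained check; note they can differ if a name contains more than one backslash, since the regex drops everything up to the last one while n=1 splits at the first):

```python
import pandas as pd

df = pd.DataFrame({"name": [r"domain1\computername1", r"dmain3\servername1"]})
via_regex = df["name"].str.replace(r".*\\", "", regex=True)  # drop through last backslash
via_split = df["name"].str.split("\\", n=1).str[-1]          # drop through first backslash
print(list(via_regex))  # ['computername1', 'servername1']
```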

How to check the pattern of a column in a dataframe

I have a dataframe which has some id's. I want to check the pattern of those column values.
Here is how the column looks like-
id: {ASDH12HK,GHST67KH,AGSH90IL,THKI86LK}
I want to write code that can distinguish letters from digits in the values above and display a pattern like 'SSSS99SS', where 'S' represents a letter and '9' represents a digit. This is a large dataset, so I can't predefine the positions of the letters and digits; the code should work out the positions itself. I am new to Python, so any leads will be helpful!
You can try something like:
my_string = "ASDH12HK"

def decode_pattern(my_string):
    my_string = ''.join('9' if s.isdigit() else s for s in my_string)
    my_string = ''.join('S' if s.isalpha() else s for s in my_string)
    return my_string

decode_pattern(my_string)
Output:
'SSSS99SS'
You can apply this to the column in your dataframe as well as below:
import pandas as pd
df = pd.DataFrame(['ASDH12HK','GHST67KH','AGSH90IL','THKI86LK', 'SOMEPATTERN123'], columns=['id'])
df['pattern'] = df['id'].map(decode_pattern)
df
Output:
id pattern
0 ASDH12HK SSSS99SS
1 GHST67KH SSSS99SS
2 AGSH90IL SSSS99SS
3 THKI86LK SSSS99SS
4 SOMEPATTERN123 SSSSSSSSSSS999
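A vectorized alternative is two regex passes with str.replace instead of a Python-level map; this is a sketch that assumes the ids contain only letters and digits:

```python
import pandas as pd

df = pd.DataFrame({'id': ['ASDH12HK', 'SOMEPATTERN123']})
# mask every letter as S, then every digit as 9
df['pattern'] = (df['id']
                 .str.replace(r'[A-Za-z]', 'S', regex=True)
                 .str.replace(r'[0-9]', '9', regex=True))
print(df['pattern'].tolist())  # ['SSSS99SS', 'SSSSSSSSSSS999']
```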
You can use a regular expression:
import re

st = "SSSS99SSSS"
a = re.match("[A-Za-z]{4}[0-9]{2}[A-Za-z]{4}", st)
It will return a match if the string starts with 4 letters followed by 2 digits and another 4 letters.
You can then use this pattern on your df to filter the rows.
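A sketch of that filtering idea, using the 4-letter/2-digit/2-letter shape of the sample ids (adjust the quantifiers to your data; the trailing $ makes it a full match):

```python
import pandas as pd

df = pd.DataFrame({'id': ['ASDH12HK', 'bad-id', 'GHST67KH']})
# str.match anchors at the start; '$' anchors the end
mask = df['id'].str.match(r'[A-Za-z]{4}[0-9]{2}[A-Za-z]{2}$')
print(df[mask]['id'].tolist())  # ['ASDH12HK', 'GHST67KH']
```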
You can use the function findall() from the re module:
import re
text = "ASDH12HK,GHST67KH,AGSH90IL,THKI86LK"
result = re.findall("[A-Za-z]{4}[0-9]{2}[A-Za-z]{2}", text)
print(result)

python pandas - function applied to csv is not persisted

I need to polish a csv dataset, but it seems the changes are not applied to the dataset itself.
CSV is in this format:
ID, TRACK_LINK
761607, https://mylink.com//track/...
This is my script:
import pandas as pd

df = pd.read_csv('./file.csv').fillna('')

# remove double // from TRACK_LINK
def polish_track_link(track_link):
    return track_link.replace("//track", "/track")

df['TRACK_LINK'].apply(polish_track_link)
print(df)
this prints something like:
...
761607 https://mylink.com//track/...
note the //track
If I do print(df['TRACK_LINK'].apply(polish_track_link)) I get:
...
761607, https://mylink.com/track/...
So the function polish_track_link works but it's not applied to the dataset. Any idea why?
You need assign back:
df['TRACK_LINK'] = df['TRACK_LINK'].apply(polish_track_link)
But better is use pandas functions str.replace or replace with regex=True for replace substrings:
df['TRACK_LINK'] = df['TRACK_LINK'].str.replace("//track", "/track")
Or:
df['TRACK_LINK'] = df['TRACK_LINK'].replace("//track", "/track", regex=True)
print(df)
ID TRACK_LINK
0 761607 https://mylink.com/track/
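The crux is that apply returns a new Series and leaves the frame untouched until you assign the result back, e.g.:

```python
import pandas as pd

df = pd.DataFrame({'TRACK_LINK': ['https://mylink.com//track/x']})
result = df['TRACK_LINK'].apply(lambda s: s.replace('//track', '/track'))
# df is unchanged until the result is assigned back
print(df['TRACK_LINK'].iloc[0])  # https://mylink.com//track/x
df['TRACK_LINK'] = result
print(df['TRACK_LINK'].iloc[0])  # https://mylink.com/track/x
```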

replacing quotes, commas, apostrophes w/ regex - python/pandas

I have a column with addresses, and sometimes it contains characters I want to remove: ' (apostrophe), " (double quote), and , (comma).
I would like to replace these characters with a space in one shot. I'm using pandas, and this is the code I have so far to replace one of them.
test['Address 1'].map(lambda x: x.replace(',', ''))
Is there a way to modify these code so I can replace these characters in one shot? Sorry for being a noob, but I would like to learn more about pandas and regex.
Your help will be appreciated!
You can use str.replace:
test['Address 1'] = test['Address 1'].str.replace(r"[\"\',]", '', regex=True)
Sample:
import pandas as pd
test = pd.DataFrame({'Address 1': ["'aaa",'sa,ss"']})
print (test)
Address 1
0 'aaa
1 sa,ss"
test['Address 1'] = test['Address 1'].str.replace(r"[\"\',]", '', regex=True)
print (test)
Address 1
0 aaa
1 sass
Here's the DataFrame-wide solution:
To apply it to an entire dataframe, use df.replace. Don't forget the \ escape for the apostrophe.
Example:
import pandas as pd

df = ...  # some dataframe
df.replace('\'', '', regex=True, inplace=True)
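To replace all three characters with a space in one shot, as the question asks, a sketch on a made-up address:

```python
import pandas as pd

test = pd.DataFrame({'Address 1': ['12 O\'Brien St, "Apt 4"']})
# one regex pass: apostrophe, double quote and comma all become a space
test['Address 1'] = test['Address 1'].str.replace(r"['\",]", ' ', regex=True)
print(test['Address 1'].iloc[0])
```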
