Pandas Split rows based on different delimiters - python

So I currently have this:
s = final_df['Column Name'].str.split(';').apply(pd.Series, 1).stack()
which splits a row whenever it finds the ; delimiter. However, I will not always have the semicolon as my delimiter. Is there a way to incorporate re.split or other delimiters into str.split? Basically, any of ':', ';' or '|' could be my delimiter, but I won't know which in advance.
I tried split(';', '|'), but I knew that wouldn't work.

str.split accepts a regex just like re.split does, so you do not need to use the latter. The following should do:
s = final_df['Column Name'].str.split(r'[;:|]').apply(pd.Series, 1).stack()
If the starting file contains those delimiters, you could actually provide the regular expression pattern to the sep parameter of the read_table function and set its engine parameter to "python". The following uses the io module and a random string to illustrate the point:
import io
import pandas as pd
mystring = u"hello:world|123;here|we;go,again"
with io.StringIO(mystring) as f:
    df = pd.read_table(f, sep=r"[;:|,]", engine="python", header=None)
df
# 0 1 2 3 4 5 6
# 0 hello world 123 here we go again
This one splits on :, ;, | and ,.
I hope this proves useful.
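For reference, a minimal sketch of the same regex split on a small made-up frame. The sample values are assumptions, and explode() is swapped in for apply(pd.Series).stack(), so it needs pandas 0.25 or later:

```python
import pandas as pd

# Hypothetical sample data; the real column contents are not shown in the question
final_df = pd.DataFrame({"Column Name": ["a;b", "c:d|e", "f"]})

# A multi-character pattern is treated as a regular expression by str.split
parts = final_df["Column Name"].str.split(r"[;:|]").explode()

print(parts.tolist())        # one element per split piece
print(parts.index.tolist())  # original row labels are repeated for each piece
```

Unlike stack(), explode() keeps the original row label on every piece instead of building a two-level index, which may or may not be what you want.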

Related

Remove everything after second caret regex and apply to pandas dataframe column

I have a dataframe with a column that looks like this:
0 EIAB^EIAB^6
1 8W^W844^A
2 8W^W844^A
3 8W^W858^A
4 8W^W844^A
...
826136 EIAB^EIAB^6
826137 SICU^6124^A
826138 SICU^6124^A
826139 SICU^6128^A
826140 SICU^6128^A
I just want to keep everything before the second caret, e.g. 8W^W844. Similarly, PACU^SPAC^06 would become PACU^SPAC. What regex would I use in Python, and how do I apply it to the whole column?
I tried r'[\\^].+$' since I thought it would take the last caret and everything after, but it didn't work.
You can negate the character group to match everything except ^ and put it in a match group. You don't need to escape the ^ inside the character group, but you do need to escape the one outside.
import re
re.match(r"([^^]+\^[^^]+)", "8W^W844^A").group(1)
This is quite useful in a pandas dataframe. Assuming you want to do this on a single column you can extract the string you want with
df['col'].str.extract(r'^([^^]+\^[^^]+)', expand=False)
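A quick check of that pattern on a few of the values from the question (the Series name col is only for illustration):

```python
import pandas as pd

# A handful of values copied from the question
col = pd.Series(["EIAB^EIAB^6", "8W^W844^A", "SICU^6124^A"])

# The capture group keeps "text ^ text"; everything from the second caret on is dropped
kept = col.str.extract(r"^([^^]+\^[^^]+)", expand=False)

print(kept.tolist())
```

With expand=False and a single capture group, extract returns a Series rather than a one-column DataFrame, so it can be assigned straight back to the column.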
NOTE
Originally, I used replace, but the extract solution suggested in the comments executed in 1/4 the time.
import pandas as pd
import numpy as np
from timeit import timeit
df = pd.DataFrame({"foo":np.arange(1_000_000)})
df["bar"] = "8W^W844^A"
df2 = df.copy()
def t1():
    df.bar.str.replace(r"([^^]+\^[^^]+).*", r"\1", regex=True)
def t2():
    df.bar.str.extract(r'^([^^]+\^[^^]+)', expand=False)
print("replace", timeit("t1()", globals=globals(), number=20))
print("extract", timeit("t2()", globals=globals(), number=20))
output
replace 39.73989862400049
extract 9.910304663004354
I don't think regex is really necessary here, just slice the string up to the position of the second caret:
>>> s = 'PACU^SPAC^06'
>>> s[:s.find("^", s.find("^") + 1)]
'PACU^SPAC'
Explanation: str.find accepts a second argument that says where to start the search. Here it is set to just after the position of the first caret.
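To apply the same slicing to a whole column, one option is a small helper passed to Series.map. This is only a sketch: the helper name before_second_caret and the sample values are assumptions. It also adds a guard, because str.find returns -1 when there is no second caret, and slicing with -1 would silently chop off the last character.

```python
import pandas as pd

def before_second_caret(s: str) -> str:
    """Return the part of s before the second '^', or s unchanged if there is none."""
    first = s.find("^")
    if first == -1:
        return s
    second = s.find("^", first + 1)
    return s if second == -1 else s[:second]

# Hypothetical column; the last two values have fewer than two carets
col = pd.Series(["EIAB^EIAB^6", "8W^W844^A", "PACU^SPAC^06", "A^B", "NOCARET"])
result = col.map(before_second_caret)

print(result.tolist())
```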

Problem with strip, replace functions in pandas dataframe

I am trying to strip all the special characters from a pandas dataframe column of words with the strip() and replace() functions.
However, it does not work. The special characters are not stripped from the words.
Can somebody enlighten me, please?
import pandas as pd
import datetime
df = pd.read_csv("2022-12-08_word_selection.csv")
for n in df.index:
    i = str(df.loc[n, "words"])
    if len(i) > 12:
        df.loc[n, "words"] = ""
df["words"] = df["words"].str.replace("$", "s")
df["words"] = df["words"].str.strip('[,:."*+-#/\^`#}{~&%’àáâæ¢ß¥£™©®ª×÷±²³¼½¾µ¿¶·¸º°¯§…¤¦≠¬ˆ¨‰øœšÞùúûý€')
df["words"] = df["words"].str.strip("\n")
df = df.groupby(["words"]).mean()
print(df)
Firstly, the program replaces all words in the "words" column longer than 12 characters with an empty string. Then, I was hoping it would strip all the special characters from the "words" column.
First, avoid using a loop and instead use transform() to replace words longer than 12 characters with an empty string. Second, Series.replace() can be used instead of Series.str.replace(), as long as it is given a regex (regex=True or a compiled pattern); with a plain string it only replaces cells whose entire value matches. Third, strip() only removes leading and trailing characters, so it is not what you want. Use a regular expression with replace() instead. Finally, to remove special characters, it is cleaner to use a negated character set that matches and removes every character that is not a letter or a number: "[^A-Za-z0-9]".
Here is some example data and code that works:
import pandas as pd
import re
df = pd.DataFrame(
    {
        "words": [
            123,
            "abcd",
            "efgh",
            "abcdefghijklmn",
            "lol%",
            "Hornbæk",
            "10:03",
            "$999¼",
        ]
    }
)
# Convert everything to strings so len() and the regexes work on every row
df["words"] = df["words"].astype(str)
# Faster and more concise than a loop
df["words"] = df["words"].transform(lambda x: "" if len(x) > 12 else x)
# Not sure why you do this but okay; regex=True makes it a substring replacement
df["words"] = df["words"].replace(r"\$", "s", regex=True)
# Use a negated regex set to keep only letters and numbers
df["words"] = df["words"].replace(re.compile("[^A-Za-z0-9]"), "")
print(df)
outputs (row 3 is now an empty string):
words
0 123
1 abcd
2 efgh
3
4 lol
5 Hornbk
6 1003
7 s999
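To make the difference between strip() and a regex replace concrete, here is a small sketch on made-up values (the Series name and contents are assumptions, not the asker's data):

```python
import pandas as pd

# Hypothetical words with special characters at the ends and in the middle
words = pd.Series(["*lol%", "10:03", "$999"])

# strip() only trims the listed characters from both ends of each string
stripped = words.str.strip("*%:$")

# A negated character set removes every non-alphanumeric character, wherever it is
cleaned = words.str.replace(r"[^A-Za-z0-9]", "", regex=True)

print(stripped.tolist())
print(cleaned.tolist())
```

The colon inside "10:03" survives strip() because it is not at either end of the string, which is the behaviour the question ran into.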

How to get only the word from a string in python?

I am new to pandas and have an issue with strings. I have the string s = "'hi'+'bikes'-'cars'>=20+'rangers'" and I want only the words from it, not the symbols or the integers. How can I do it?
My input:
s = "'hi'+'bikes'-'cars'>=20+'rangers'"
Expected output:
s = "'hi','bikes','cars','rangers'"
Try this using a regex:
import re
s = "'hi'+'bikes'-'cars'>=20+'rangers'"
# match runs of letters only
samp = re.compile('[a-zA-Z]+')
word = samp.findall(s)
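One detail worth checking in a pattern like this is the upper-case range. A short sketch of why A-Z and A-z behave differently (the test string tricky is made up for illustration):

```python
import re

s = "'hi'+'bikes'-'cars'>=20+'rangers'"

# Letters only: A-Z and a-z are two separate ranges
words = re.findall(r"[a-zA-Z]+", s)

# The range A-z spans ASCII 65 to 122, which also covers [ \ ] ^ _ and the backtick
tricky = "snake_case^2"
strict = re.findall(r"[a-zA-Z]+", tricky)
loose = re.findall(r"[a-zA-z]+", tricky)

print(words)
print(strict, loose)
```

On the string from the question both patterns give the same result, since it contains none of those six extra characters, but the looser range would let underscores and carets through on other input.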
I'm not sure about pandas, but you can also do it with a regex that keeps the quotes. Here is the solution:
import re
s = "'hi'+'bikes'-'cars'>=20+'rangers'"
words = re.findall("(\'.+?\')", s)
output = ','.join(words)
print(output)
For pandas I would convert the column in the dataframe to string first:
df
a b
0 'hi'+'bikes'-'cars'>=20+'rangers' 1
1 random_string 'with'+random,# 4
2 more,weird/stuff=wrong 6
df["a"] = df["a"].astype("string")
df["a"]
0 'hi'+'bikes'-'cars'>=20+'rangers'
1 random_string 'with'+random,#
2 more,weird/stuff=wrong
Name: a, dtype: string
Now you can see that the dtype is string, which means you can do string operations on it, including translate and split (pandas strings). But first you have to make a translation table with punctuation and digits imported from the string module:
from string import digits, punctuation
Then make a dictionary mapping each of the digits and punctuation characters to a space:
from itertools import chain
t = {k: " " for k in chain(punctuation, digits)}
Create the translation table using str.maketrans (no import necessary with Python 3.8, but it may be a bit different with other versions) and apply translate and split (with "str" in between) to the column:
t = str.maketrans(t)
df["a"] = df["a"].str.translate(t).str.split()
df
a b
0 [hi, bikes, cars, rangers] 1
1 [random, string, with, random] 4
2 [more, weird, stuff, wrong] 6
As you can see you only have the words now.
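As an alternative to the translation table, Series.str.findall can collect the letter runs directly. This is a sketch of a different technique, not the method from the answer above, and it rebuilds the example frame by hand:

```python
import pandas as pd

# The example frame from above, rebuilt by hand
df = pd.DataFrame(
    {
        "a": [
            "'hi'+'bikes'-'cars'>=20+'rangers'",
            "random_string 'with'+random,#",
            "more,weird/stuff=wrong",
        ],
        "b": [1, 4, 6],
    }
)

# Each cell becomes the list of all runs of ASCII letters found in it
df["a"] = df["a"].str.findall(r"[A-Za-z]+")

print(df["a"].tolist())
```

Both approaches treat the underscore as a separator here; the regex version is shorter, while the translation table makes the set of removed characters explicit.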

How to replace string in python string with specific character?

For example, I have a column named CHILDREN in a dataframe named DF. A few of the names are [tom (peter), lily, fread, gregson (jaeson 123)], etc.
What code should I write to remove the part of each name that starts with an opening bracket '(' and ends with a closing bracket ')'? From the examples above, tom (peter) should become tom and gregson (jaeson 123) should become gregson. There are thousands of names with a bracketed part. The dataframe has many columns, but I only want to edit this one column.
As suggested by @Ruslan S., you can use pandas.Series.str.replace or you could also use re.sub (and there are other methods as well):
import pandas as pd
df = pd.DataFrame({"name":["tom (peter)" , "lily", "fread", "gregson (jaeson 123)"]})
# OPTION 1: with str.replace (regex=True is required since pandas 2.0, where the default became False)
df["name"] = df["name"].str.replace(r"\([a-zA-Z0-9\s]+\)", "", regex=True).str.strip()
# OPTION 2: with re.sub
import re
r = re.compile(r"\([a-zA-Z0-9\s]+\)")
df["name"] = df["name"].apply(lambda x: r.sub("", x).strip())
And the result in both cases:
name
0 tom
1 lily
2 fread
3 gregson
Note that I also use strip to remove leading and trailing whitespaces here. For more info on the regular expression to use, see re doc for instance.
You can try:
# to remove text between () (regex=True is required since pandas 2.0)
df['columnname'] = df['columnname'].str.replace(r'\((.*)\)', '', regex=True)
# to remove text between %%
df['columnname'] = df['columnname'].str.replace(r'%(.*)%', '', regex=True)
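One thing to watch with .* is that it is greedy, so a name with two bracketed parts loses everything between the first '(' and the last ')'. A sketch comparing it with a negated character class (the third sample name is made up to show the difference):

```python
import pandas as pd

# Names from the question plus one hypothetical name with two bracketed parts
names = pd.Series(["tom (peter)", "gregson (jaeson 123)", "ann (a) lee (b)"])

# Greedy: matches from the first "(" to the last ")" in the string
greedy = names.str.replace(r"\(.*\)", "", regex=True).str.strip()

# Negated class: each match stops at the nearest ")"; \s* also eats the space before "("
per_group = names.str.replace(r"\s*\([^)]*\)", "", regex=True).str.strip()

print(greedy.tolist())
print(per_group.tolist())
```

If every name has at most one bracketed part, the two patterns give the same result.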

read csv with escape characters

I have a csv file with some text, among others. I want to tokenize (split into a list of words) this text and am having problems with how pd.read_csv interprets escape characters.
My csv file looks like this:
text, number
one line\nother line, 12
and the code is as follows:
df = pd.read_csv('test.csv')
word_tokenize(df.iloc[0,0])
output is:
['one', 'line\\nother', 'line']
while what I want is:
['one', 'line', 'other', 'line']
The problem is pd.read_csv() is not interpreting the \n as a newline character but as two characters (\ and n).
I've tried setting the escapechar argument to '\' and to '\\', but both just remove the backslash from the string without interpreting it as a newline character, i.e. the string becomes one linenother line.
If I explicitly set df.iloc[0,0] = 'one line\nother line', word_tokenize works just fine, because \n is actually interpreted as a newline character this time.
Ideally I would do this simply changing the way pd.read_csv() interprets the file, but other solutions are also ok.
The question is a bit poorly worded. I guess pandas escaping the \ in the string is confusing nltk.word_tokenize. pandas.read_csv can only use one separator (or a regex, but I doubt you want that), so it will always read the text column as "one line\nother line", and escape the backslash to preserve it. If you want to further parse and format it, you could use converters. Here's an example:
import pandas as pd
import re
df = pd.read_csv(
    "file.csv", converters={"text": lambda s: re.split("\\\\n| ", s)}
)
The above results in:
text number
0 [one, line, other, line] 12
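To try this without a file on disk, the same converter can be run against an in-memory copy of the CSV. This is a sketch: csv_text is a hand-written stand-in for test.csv, with the backslash and the n stored as two literal characters.

```python
import io
import re

import pandas as pd

# "\\n" in this literal is a backslash followed by "n", exactly as stored in the file
csv_text = "text, number\none line\\nother line, 12\n"

df = pd.read_csv(
    io.StringIO(csv_text),
    # Split each text cell on a literal backslash-n or on a space
    converters={"text": lambda s: re.split(r"\\n| ", s)},
)

tokens = df["text"].iloc[0]
print(tokens)
```

The raw string r"\\n| " is the same pattern as "\\\\n| " in the answer, just written with fewer backslashes.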
Edit: In case you need to use nltk to do the splitting (say the splitting depends on the language model), you would need to unescape the string before passing on to word_tokenize; try something like this:
lambda s: word_tokenize(s.encode('utf-8').decode('unicode_escape'))
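The unescaping step can be checked on its own, without nltk. In this sketch str.split() stands in for word_tokenize, and the variable names are only for illustration:

```python
# The cell as pandas reads it: a backslash followed by "n", not a newline
raw = "one line\\nother line"

# unicode_escape turns the two-character sequence into a real newline
unescaped = raw.encode("utf-8").decode("unicode_escape")

# Any whitespace-based tokenizer now sees the newline as a separator
tokens = unescaped.split()

print(repr(unescaped))
print(tokens)
```

Note that unicode_escape decodes the bytes as Latin-1, so text with non-ASCII characters would come out garbled with this round trip.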
Note: Matching lists in queries is incredibly tricky, so you might want to convert them to tuples by altering the lambda like this:
lambda s: tuple(re.split("\\\\n| ", s))
You can simply try this
import pandas as pd
df = pd.read_csv("test.csv", header=None)
# replace the literal backslash-n (two characters) with a space
df = df.apply(lambda x: x.str.replace('\\n', " ", regex=False))
print(df.iloc[1, 0])
# output: one line other line
In your case, simply use:
data = pd.read_csv('test.csv', sep='\\,', names=['c1', 'c2', 'c3', 'c4'], engine='python')
