In dataframe: delete the parentheses and everything within them in a column - python

I have a pandas dataframe where a column contains parentheses. I want to keep the content of the column but delete the parentheses and everything inside them, as shown below, and then append the constant text "data".
col1
counties(17) - cities(8)
I tried df['col1']=df['col1'].str.replace(r"\(.*\)","")
This command outputs only counties.
My desired output is
counties - cities data

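Your expression replaces each match with "", so nothing ever adds the text " data". The greedy .* also matches from the first ( to the last ), which is why only counties is left. Use a non-greedy match and append " data" after the replacement.
Change
df['col1']=df['col1'].str.replace(r"\(.*\)","")
to
df['col1']=df['col1'].str.replace(r"\(.*?\)", "", regex=True) + " data"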

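Pandas uses the re module under the hood, so your regex must follow its syntax and can use all of its features. Here you want a non-greedy match for each parenthesized group (the shortest possible match), so you should use df['col1'].str.replace(r"\(.*?\)", "", regex=True). Passing regex=True matters because str.replace treats the pattern as a literal string by default in pandas 2.0 and later. If you also want to append " data", it becomes:
df['col1'] = df.col1.str.replace(r'\(.*?\)', '', regex=True) + ' data'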
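To see why the greedy pattern drops " - cities", here is a minimal sketch using the standard re module directly on the sample string from the question (pandas is left out so the snippet runs anywhere; the same patterns apply to str.replace with regex=True):

```python
import re

text = "counties(17) - cities(8)"

# Greedy: .* runs from the first "(" to the last ")",
# so " - cities" is swallowed together with both groups.
greedy = re.sub(r"\(.*\)", "", text)

# Non-greedy: .*? stops at the nearest ")",
# so each parenthesized group is removed on its own.
lazy = re.sub(r"\(.*?\)", "", text)

print(repr(greedy))           # 'counties'
print(repr(lazy + " data"))   # 'counties - cities data'
```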

Related

Extracting Sub-string Between Two Characters in String in Pandas Dataframe

I have a column containing strings that are made up of different words but always have a similar structure. E.g.:
2cm off ORDER AGAIN (191 1141)
I want to extract the sub-string that starts after the second space and ends at the space before the opening bracket/parenthesis. So in this example I want to extract ORDER AGAIN.
Is this possible?
You could use str.extract here:
df["out"] = df["col"].str.extract(r'^\w+ \w+ (.*?)(?: \(|$)')
Note that this answer is robust even if the string doesn't have a (...) term at the end.
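The robustness claim can be checked with plain re, as a small sketch. The helper name extract_middle and the second and third test strings are made up for illustration; only the pattern comes from the answer above.

```python
import re

# Same pattern as in the str.extract call above.
pattern = re.compile(r"^\w+ \w+ (.*?)(?: \(|$)")

def extract_middle(s):
    """Return the text after the second space, up to ' (' or the end of the string."""
    m = pattern.search(s)
    return m.group(1) if m else None

print(extract_middle("2cm off ORDER AGAIN (191 1141)"))  # ORDER AGAIN
print(extract_middle("2cm off ORDER AGAIN"))             # ORDER AGAIN
print(extract_middle("single"))                          # None
```

A string with fewer than two spaces does not match at all, which in pandas shows up as NaN in the extracted column.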
You can try the following:
r"2cm off ORDER AGAIN (191 1141)".split(r"(")[0].split(" ", maxsplit=2)[-1].strip()
#Out[3]: 'ORDER AGAIN'
If the pattern of data is similar to what you have posted then I think the below code snippet should work for you:
import re
data = "2cm off ORDER AGAIN (191 1141)"
extr = re.match(r".*?\s.*?\s(.*)\s\(.*", data)
if extr:
    print(extr.group(1))
You can try the following code
s = '2cm off ORDER AGAIN (191 1141)'
second_space = s.find(' ', s.find(' ') + 1)
openparenthesis = s.find('(')
# start just after the second space and drop the space before the parenthesis
substring = s[second_space + 1 : openparenthesis].strip()
print(substring) #ORDER AGAIN

Why does "\n" appear in my string output?

I have elements that I've scraped off of a website and when I print them using the following code, they show up neatly as spaced out elements.
print("\n" + time_element)
prints like this
F
4pm-5:50pm
but when I pass time_element into a dataframe as a column and convert it to a string, the output looks like this
# b' \n F\n \n 4pm-5:50pm\n
I am having trouble understanding why it appears so and how to get rid of this "\n" character. I tried using regex to match the "F" and the "4pm-5:50pm" and I thought this way I could separate out the data I need. But using various methods including
# Define the list and the regex pattern to match
time = df['Time']
pattern = '[A-Z]+'
# Filter out all elements that match the pattern
filtered = [x for x in time if re.match(pattern, x)]
print(filtered)
I get back an empty list.
From my research, I understand the "\n" represents a new line and that there might be invisible characters. However, I'm not understanding more about how they behave so I can get rid of them/around them to extract the data that I need.
When I pass the data to csv format, it prints like this all in one cell
F
4pm-5:50pm
but I still end up in the similar place when it comes to separating out the data that I need.
You can call strip() on the text when you extract it from the website to remove the leading and trailing whitespace, including "\n".
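A short sketch of what is going on, assuming the scraped value is the text shown in the question (the variable name raw is made up here). It also shows why the list comprehension in the question comes back empty: re.match only tries to match at the very start of the string, and the string starts with whitespace.

```python
import re

# The scraped text, with the whitespace and newlines from the question.
raw = " \n F\n \n 4pm-5:50pm\n"

# re.match is anchored at position 0, which is a space here, so it fails.
at_start = re.match(r"[A-Z]+", raw)

# re.search scans the whole string and finds the day letter.
day = re.search(r"[A-Z]+", raw).group()

# strip() removes leading/trailing whitespace (including "\n");
# split() with no arguments splits on any run of whitespace.
parts = raw.split()

print(at_start)            # None
print(day)                 # F
print(repr(raw.strip()))   # 'F\n \n 4pm-5:50pm'
print(parts)               # ['F', '4pm-5:50pm']
```

If the value really is a bytes object (the b'...' prefix suggests so), decoding it first, for example with raw_bytes.decode("utf-8") under the assumption that the page is UTF-8, gives a normal string to work with.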

Filtering keywords/sentences in a dataframe pandas

Currently I have a dataframe. Here is an example of my dataframe:
I also have a list of keywords/ sentences. I want to match it to the column 'Content' and see if any of the keywords or sentences match.
Here is what I've done
# instructions_list is just the list of keywords and key sentences
instructions_list = instructions['Key words & sentence search'].tolist()
pattern = '|'.join(instructions_list)
bureau_de_sante[bureau_de_sante['Content'].str.contains(pattern, regex = True)]
While it is giving me the results, it is also giving me this UserWarning : UserWarning: This pattern has match groups. To actually get the groups, use str.extract.
return func(self, *args, **kwargs).
Questions:
How can I prevent the userwarning from showing up?
After checking whether a match is in the column, how can I print the specific match in a new column?
You are supplying a regex to search the dataframe. If you have parentheses in your instruction list (as is the case in your example), they form a match group. To avoid this, you have to escape them (i.e. add \ in front of them, so that (Critical risk) becomes \(Critical risk\)). You will probably also want to escape the other regex special characters such as \ . + * ? [ ] and so on; re.escape does this for you.
Now, you can use these groups to extract the match from your data. Here is an example:
import pandas as pd
df = pd.DataFrame(["Hello World", "Foo Bar Baz", "Goodbye"], columns=["text"])
pattern = "(World|Bar)"
# .str is a Series accessor, so select the column first
print(df["text"].str.extract(pattern))
#        0
# 0  World
# 1    Bar
# 2    NaN
You can add this column to your dataframe with a simple assignment (e.g. df["result"] = df["text"].str.extract(pattern, expand=False))
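Escaping by hand gets tedious for a long keyword list, so here is a minimal sketch that builds the pattern with re.escape. The keyword list and the helper first_keyword are invented for illustration; the real list would come from instructions_list.

```python
import re

# Hypothetical keywords; some contain regex special characters.
keywords = ["(Critical risk)", "follow-up required", "cost is $5.00"]

# re.escape neutralises the special characters, so the joined pattern
# contains no capture groups and matches the keywords literally.
pattern = "|".join(re.escape(k) for k in keywords)

def first_keyword(text):
    """Return the first keyword found in text, or None if there is none."""
    m = re.search(pattern, text)
    return m.group(0) if m else None

print(re.compile(pattern).groups)                       # 0
print(first_keyword("Status: (Critical risk) noted"))   # (Critical risk)
print(first_keyword("nothing relevant here"))           # None
```

Because the escaped pattern has no groups, passing it to str.contains should no longer trigger the UserWarning. To get the match into a new column, wrap the whole pattern in one pair of parentheses, "(" + pattern + ")", before passing it to str.extract.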

How can I split a string and take only one from the separated string in Python?

I have a string which includes ":", it shows like this:
: SHOES
and I want to split the colon and SHOES, then make a variable that contains only "SHOES"
I have split them used df.split(':') but then how should I create a variable with "SHOES" only?
You can index the list returned by split, and then use lstrip and rstrip (or simply strip) to remove excess spaces before and after the word.
df=": shoes"
d=df.split(":")[-1].lstrip().rstrip()
print(d)
You can use the 'apply' method to run a function over every value in the column and split each one with 'split()'.
This is an example:
import pandas as pd
df=pd.DataFrame({'A':[':abd', ':cda', ':vfe', ':brg']})
# First, we create a new column named 'new_column' -> df['new_column']
# Second, we loop over the column with apply
# Third, we run a lambda that calls split, keeping only the text after ':'
df['new_column']=df['A'].apply(lambda x: x.split(':')[1] )
df
A new_column
0 :abd abd
1 :cda cda
2 :vfe vfe
3 :brg brg
If your original strings always start with ": " then you could just remove the first two characters using:
myString[2:]
Here is a small working sample. Both stripValue[1] and newString hold the same value. It is a matter of concise code versus verbose code:
# set initial string
myString = "string : value"
# split it, which returns a list: ['string ', ' value']
stripValue = myString.split(":")
# you can create a new variable with the value you need from the list
newString = (stripValue[1])
# or you can use the list element directly
print(stripValue[1])
# using the new variable
print(newString)
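As an alternative sketch to the answers above, str.partition splits on the first separator only and always returns three parts, which avoids an IndexError when the colon is missing. The sample strings other than ": SHOES" are made up.

```python
raw = ": SHOES"

# partition returns (before, separator, after) for the first ":" only.
_, sep, after = raw.partition(":")
value = after.strip()
print(value)  # SHOES

# With several colons, maxsplit=1 keeps everything after the first one together.
rest = "string : value : more".split(":", 1)[1].strip()
print(rest)  # value : more

# No colon at all: sep is empty, which makes the missing case easy to detect.
_, sep_missing, after_missing = "SHOES".partition(":")
print(repr(sep_missing), repr(after_missing))  # '' ''
```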

Python 2.7: How to split on first occurrence?

I am trying to split a string I extract on the first occurrence of a comma. I have tried using the split, but something is wrong, as it doesn't split.
for i in range(len(items)):
    alldata = items[i].getText().encode('utf-8').split(',', 1)
    csvfile.writerow(alldata)
The variable items contains the data I extract from a URL. The output in the CSV file is put in one column. I want it to be in two columns. An example of the data (alldata) I get in the CSV file looks like this:
['\n\n\n1958\n\n\nGeorge Lees\n']
Using this data as an example, I need the year 1958 to be on one column, and the name George Lees to be on another column instead of the new lines.
EDIT
Forgot to mention what I meant with the commas. The reason why I mentioned the commas is that I also tried splitting on whitespaces. When I did that I got the data:
['1958', 'George', 'Lees']
So what I tried to achieve was to split the data on the first comma occurrence. That's why I did split(',', 1), forgetting that I also need to split on whitespace. So my problem is that I don't know how to split so that the year is in one column and the whole name is in another column. I got
['\n\n\n1958\n\n\nGeorge Lees\n']
When I tried to split with split(',', 1)
You can use strip to remove all the whitespace at the start and end, and then split on "\n" to get the required output. I have also used filter to remove any empty strings.
Ex:
A = ['\n\n\n1958\n\n\nGeorge Lees\n']
print filter(None, A[0].strip().split("\n"))
Output:
['1958', 'George Lees']
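The answer above targets Python 2.7, where filter returns a list and print is a statement. For readers on Python 3, a roughly equivalent sketch follows; writing to an in-memory io.StringIO instead of a real file is an assumption made here so the snippet is self-contained.

```python
import csv
import io

raw = "\n\n\n1958\n\n\nGeorge Lees\n"

# Split on line breaks and keep only the non-empty lines.
fields = [line.strip() for line in raw.splitlines() if line.strip()]
print(fields)  # ['1958', 'George Lees']

# Each list element becomes its own column in the CSV row.
buffer = io.StringIO()
csv.writer(buffer).writerow(fields)
print(repr(buffer.getvalue()))  # '1958,George Lees\r\n'
```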
