I have the following data stored in my pandas DataFrame:
Factor SimTime RealTime SimStatus
0 Factor[0.48] SimTime[83.01] RealTime[166.95] Paused[F]
1 Factor[0.48] SimTime[83.11] RealTime[167.15] Paused[F]
2 Factor[0.49] SimTime[83.21] RealTime[167.36] Paused[F]
3 Factor[0.48] SimTime[83.31] RealTime[167.57] Paused[F]
I want to create a new dataframe containing only the text inside the square brackets.
I am attempting to use the following code:
df = dataframe.apply(lambda x: x.str.slice(start=x.str.find('[')+1, stop=x.str.find(']')))
However, all I see in df is NaN. Why? What's going on? What should I do to achieve the desired behavior?
You can use regex to replace the contents.
df.replace(r'\w+\[([\S]+)\]', r'\1', regex=True)
Edit
The replace function of a pandas DataFrame replaces the values given in to_replace with value.
Both the target string and its replacement can be regular expressions. To enable that, pass regex=True in the arguments to replace.
https://regex101.com/r/7KCs6q/1
See the link above for a detailed explanation of the regular expression.
In short, the target pattern is a run of word characters followed by square brackets that enclose non-whitespace characters, and the replacement value is the non-whitespace content captured inside the brackets.
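A minimal sketch of the replace call on data shaped like the question's, with two rows copied from the sample. As far as I can tell, the original attempt produces NaN because Series.str.slice takes plain integers for start and stop, while x.str.find(...) returns a whole Series of positions; treat that diagnosis as an assumption rather than a verified fact.

```python
import pandas as pd

# Two rows shaped like the question's data (values copied from the sample).
dataframe = pd.DataFrame({
    "Factor": ["Factor[0.48]", "Factor[0.49]"],
    "SimTime": ["SimTime[83.01]", "SimTime[83.21]"],
    "SimStatus": ["Paused[F]", "Paused[F]"],
})

# \w+ matches the label, \[ and \] match the literal brackets, and the
# capture group (\S+) holds the bracket contents that \1 puts back.
df = dataframe.replace(r"\w+\[(\S+)\]", r"\1", regex=True)

print(df)
```

The resulting values are still strings, so a column such as Factor would need something like astype(float) before doing arithmetic on it.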
Related
I have a column in my dataframe containing very large strings.
Here is a short sample of one such string:
FixedChar{3bf3423 Data to keep}, FixedChar{5e0d20 Data to keep}, FixedChar{6cb86d9 Data to keep}, ...
I need to remove the recurring static "FixedChar{", the variable substring after it (which has a static length of 6), and the closing "}",
and keep only the "Data to keep" strings, which have variable lengths.
What is the best way to remove this recurring variable pattern?
It was easier than I thought.
At first I used re.sub() from the re library.
The regex \w* matches all the word characters (letters, digits and underscores) after "FixedChar{", and the argument flags=re.I makes the match case-insensitive.
import re
re.sub(r"FixedChar{\w*","",dataFrame.Column[row],flags = re.I)
But I found str.replace() more useful, and I replaced the values in my dataFrame using loc, because I needed to filter my dataframe: this pattern shows up only in specific rows.
# ":" selects every row; put a boolean row mask in its place to touch only specific rows
dataFrame.loc[:, 'Column'] = dataFrame['Column'].str.replace(r"FixedChar{\w* ", '', regex=True)
dataFrame.loc[:, 'Column'] = dataFrame['Column'].str.replace("}", '', regex=True)
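For what it's worth, the two replacements above can probably be folded into one substitution with a capture group. This is only a sketch using the standard re module on a truncated copy of the sample string; it assumes the variable token consists of word characters followed by a single space, as in the sample.

```python
import re

# Sample string copied from the question (truncated to two entries).
text = "FixedChar{3bf3423 Data to keep}, FixedChar{5e0d20 Data to keep}"

# FixedChar\{   the literal prefix
# \w+ and " "   the variable token and the space after it
# ([^}]*)       everything up to the closing brace, captured as group 1
# \}            the literal closing brace
pattern = re.compile(r"FixedChar\{\w+ ([^}]*)\}", flags=re.IGNORECASE)

cleaned = pattern.sub(r"\1", text)
print(cleaned)  # Data to keep, Data to keep
```

The same pattern string should also work with Series.str.replace(..., r"\1", regex=True) if you would rather stay inside pandas.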
I am using this line of code
df_mask = ~df[new_col_titles[:1]].apply(lambda x : x.str.contains('|'.join(filter_list), flags=re.IGNORECASE)).any(1)
to create a mask for my df. The filter list is
filter_list = ["[1]", "[2]", "[3]", "[4]", "[5]", "[6]", "[7]", "[8]","[9]",..."[n]"]
But I am getting weird results. I was hoping it would just filter out the rows in column 0 of the df that contain [1]...[n], but it also filters out rows that don't contain those elements. There is something of a pattern to it, though: it filters out rows that have numbers with "characters", by which I mean £55, 2010), 55*, 55 *
Can anyone explain what is going on, and whether there is a workaround for this?
If you want to match the items in filter list exactly, use re.escape() to escape the special characters. [1] is a regular expression that just matches the digit 1, not the string [1].
df_mask = ~df[new_col_titles[:1]].apply(lambda x : x.str.contains('|'.join(re.escape(f) for f in filter_list), flags=re.IGNORECASE)).any(axis=1)
See Reference - What does this regex mean?
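To see the difference without a dataframe, here is a small sketch with the standard re module. The filter list is shortened and the row strings are made up from the examples in the question.

```python
import re

filter_list = ["[1]", "[2]", "[5]"]
rows = ["see note [1]", "£55", "2010)", "plain text"]

# Unescaped: "[1]|[2]|[5]" is three character classes, so any row that
# contains the character 1, 2 or 5 matches.
raw_pattern = "|".join(filter_list)
raw_hits = [bool(re.search(raw_pattern, r)) for r in rows]

# Escaped: each item is matched literally, brackets included.
safe_pattern = "|".join(re.escape(f) for f in filter_list)
safe_hits = [bool(re.search(safe_pattern, r)) for r in rows]

print(raw_hits)   # [True, True, True, False]
print(safe_hits)  # [True, False, False, False]
```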
I have a list of words, negative, that has 4783 elements. I want to use the following code
tweets3 = tweets2[tweets2['full_text'].str.contains('|'.join(negative))]
But it gives an error like this: error: multiple repeat at position 4193.
I do not understand this error. If I use a single word in str.contains, such as str.contains("deal"), I am able to get results.
All I need is a new dataframe that contains only those rows of tweets2 whose full_text column contains any of the words in negative.
Optionally, I would also like a boolean column that marks present and absent values as 1 or 0.
I arrived at the following code with the help of @wp78de:
tweets2['negative'] = tweets2.loc[tweets2['full_text'].str.contains(r'(?:{})'.format('|'.join(negative)), regex=True, na=False)].copy()
For arbitrary literal strings that may contain regular expression metacharacters, you can use the re.escape() function. Escape each word before joining, so that the | separators keep their meaning as alternation. Something along these lines should be sufficient:
.str.contains(r'(?:{})'.format('|'.join(re.escape(w) for w in words)), regex=True, na=False)]
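A rough sketch of why the error appears and how escaping each word avoids it. The three-word list and the sample texts are hypothetical stand-ins for the real 4783-word list and the tweets; only the standard re module is used.

```python
import re

# Hypothetical stand-in for the word list; "f**k" carries the kind of
# metacharacters that trigger "multiple repeat".
negative = ["awful", "f**k", "deal"]

try:
    re.compile("|".join(negative))
    raw_ok = True
except re.error:
    raw_ok = False  # "**" is a quantifier applied to a quantifier

# Escape each word on its own, then join with the alternation operator.
pattern = r"(?:{})".format("|".join(re.escape(w) for w in negative))

texts = ["what a f**k up", "great deal", "nothing here"]
flags = [int(bool(re.search(pattern, t))) for t in texts]

print(raw_ok)  # False
print(flags)   # [1, 1, 0]
```

In pandas, the same pattern can be passed to tweets2['full_text'].str.contains(pattern, regex=True, na=False). Indexing tweets2 with that mask gives the filtered rows, and calling astype(int) on the mask gives a 0/1 column.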
I have a pandas dataframe that consists of strings. I would like to remove the n-th character from the end of the strings. I have the following code:
DF = pandas.DataFrame({'col': ['stri0ng']})
DF['col'] = DF['col'].str.replace('(.)..$','')
Instead of removing the third to the last character (0 in this case), it removes 0ng. The result should be string but it outputs stri. Where am I wrong?
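You may rather want to replace a single character that is followed by exactly n-1 characters before the end of the string (regex=True is needed in pandas 2.0 and later, where str.replace treats the pattern as a literal string by default):
DF['col'] = DF['col'].str.replace('.(?=.{2}$)', '', regex=True)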
col
0 string
If you want to make sure you're only removing digits (so that 'string' in one special row doesn't get changed to 'strng'), then use something like '[0-9](?=.{2}$)' as pattern.
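The lookahead generalises to any n. Below is a small sketch with the standard re module; drop_nth_from_end is a hypothetical helper name, not something pandas provides.

```python
import re

def drop_nth_from_end(text: str, n: int) -> str:
    """Remove the n-th character counted from the end (n=1 is the last one)."""
    # One character, followed by exactly n-1 more characters and then the end.
    pattern = r".(?=.{%d}$)" % (n - 1)
    return re.sub(pattern, "", text)

print(drop_nth_from_end("stri0ng", 3))  # string
```

Strings shorter than n are returned unchanged, because the lookahead can never be satisfied.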
Another way, using pd.Series.str.slice_replace (here the positions 4 and 5 are counted from the start of the string):
df['col'].str.slice_replace(4,5,'')
Output:
0 string
Name: col, dtype: object
I'm trying to pull wind direction out of a Metar string with format:
EGAA 010020Z 33004KT 300V010 9999 FEW029 04/04 Q1019
I'm using this to extract the wind direction which works for most of my data but fails on some strings:
df["Wind_Dir"] = df.metar.apply(lambda x: int(re.search(r"\s\d*KT\s", metar_data.metar[0]).group().strip()[:3]))
I'd like to inspect the Metar strings that it's failing on so instead of pulling group() out of the re.search I just applied the search as follows to get the re.Match object:
df["Wind_Dir"] = df.metar.apply(lambda x: re.search(r"\s\d*KT\s", x))
I've tried filtering by type and by Null but neither of those work.
Any help would be appreciated.
Thanks for your answers. Unfortunately I can't mark both as solutions, even though I used both to solve my problem.
In the end I changed my regex to:
df["Wind_Dir"] = df.metar.str.findall(r"Z\s\d\d\d|Z\sVRB")
to also match variable wind direction, but I wouldn't have been able to find that without df.metar.str.contains().
You are looking for this:
pandas.Series.str.contains returns a boolean mask that is True for the rows that match the pattern; it is based on re.search.
As the pandas documentation states, if you want a mask based on re.match, you should use pandas.Series.str.match.
You can also use the following:
pandas.Series.str.extract, which extracts the first match of the pattern in every row of the Series you are analysing. Rows that do not contain the pattern are filled with NaN, so you can look for NaN values to retrieve those rows.
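A sketch of the str.extract approach for finding the strings that fail. The first METAR is the sample from the question; the second is a made-up variant with a VRB wind group, included only to produce a non-matching row. The capture group around the three direction digits is my addition, since str.extract needs at least one group.

```python
import pandas as pd

# First string is the sample from the question; the second is a made-up
# variant with a variable wind group, used only to show a non-matching row.
df = pd.DataFrame({"metar": [
    "EGAA 010020Z 33004KT 300V010 9999 FEW029 04/04 Q1019",
    "EGAA 010050Z VRB02KT 9999 FEW029 04/04 Q1019",
]})

# str.extract needs a capture group; rows without a match become NaN.
df["Wind_Dir"] = df["metar"].str.extract(r"\s(\d{3})\d*KT\s", expand=False)

# Rows where the pattern failed, ready for inspection.
failing = df[df["Wind_Dir"].isna()]
print(failing["metar"].tolist())
```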
Your code needs to return the matched string, not an re.Match object.
It will also fail when there is no match, since re.search then returns None.
Try pandas.Series.str.findall.
In your case, try this:
df['Wind_Dir'] = df.metar.str.findall(r"\s\d*KT\s")
df["Wind_Dir"] = df['Wind_Dir'].apply(lambda x: x[0].strip()[:3])
You may also want to check whether there is a match before executing the second statement.
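One possible way to add that check, as a sketch: guard the indexing so that an empty list of matches yields None instead of raising an IndexError. The second METAR string is made up to provide a row with no match.

```python
import pandas as pd

# The second string is a made-up example with no digits before "KT".
df = pd.DataFrame({"metar": [
    "EGAA 010020Z 33004KT 300V010 9999 FEW029 04/04 Q1019",
    "EGAA 010050Z VRB02KT 9999 FEW029 04/04 Q1019",
]})

matches = df["metar"].str.findall(r"\s\d*KT\s")

# Take the first three characters of the first match, or None when the
# list of matches is empty.
df["Wind_Dir"] = matches.apply(lambda m: m[0].strip()[:3] if m else None)

print(df["Wind_Dir"])
```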