How to find rows in Pandas dataframe where re.search fails - python

I'm trying to pull wind direction out of a Metar string with format:
EGAA 010020Z 33004KT 300V010 9999 FEW029 04/04 Q1019
I'm using this to extract the wind direction which works for most of my data but fails on some strings:
df["Wind_Dir"] = df.metar.apply(lambda x: int(re.search(r"\s\d*KT\s", x).group().strip()[:3]))
I'd like to inspect the Metar strings that it's failing on so instead of pulling group() out of the re.search I just applied the search as follows to get the re.Match object:
df["Wind_Dir"] = df.metar.apply(lambda x: re.search(r"\s\d*KT\s", x))
I've tried filtering by type and by Null but neither of those work.
Any help would be appreciated.
Thanks for your answers; unfortunately I can't mark them both as solutions, despite using both to solve my problem.
In the end I changed my regex to:
df["Wind_Dir"] = df.metar.str.findall(r"Z\s\d\d\d|Z\sVRB")
to match for variable direction but wouldn't have been able to find that without df.metar.str.contains().

You are searching for this:
pandas.Series.str.contains returns a mask with True for indexes that match the pattern, based on re.search.
As the Pandas documentation states, if you want a mask based on re.match you should use pandas.Series.str.match.
You can also use the following:
pandas.Series.str.extract, which extracts the first match of the pattern from every row of the Series. NaN fills the rows that didn't contain the pattern, so you can filter for NaN values to retrieve such rows.
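A minimal sketch of that filtering approach on made-up METAR rows (the column name metar is taken from the question; the second row is invented to show a failure case):

```python
import pandas as pd

# The second row has a VRB wind group, so the \s\d+KT\s pattern fails on it
df = pd.DataFrame({"metar": [
    "EGAA 010020Z 33004KT 300V010 9999 FEW029 04/04 Q1019",
    "EGAA 010050Z VRB02KT 9999",
]})

# Invert the boolean mask from str.contains to inspect the failing rows
failing = df[~df.metar.str.contains(r"\s\d+KT\s")]
```

Printing `failing` shows only the rows where the regex found nothing.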

You need your code to return the matched string, not a re.Match object.
It will also fail when there is no match, since re.search returns None.
Try pandas.Series.str.findall.
In your case, try this:
df['Wind_Dir'] = df.metar.str.findall(r"\s\d*KT\s")
df["Wind_Dir"] = df['Wind_Dir'].apply(lambda x: x[0].strip()[:3])
You also might want to check whether there is a match or not before executing 2nd statement.
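That check can be folded into the second statement; a sketch with made-up rows (the no-match row is invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({"metar": [
    "EGAA 010020Z 33004KT 300V010 9999 FEW029 04/04 Q1019",
    "EGAA 010050Z VRB02KT 9999",  # no \s\d*KT\s match -> empty list
]})

matches = df.metar.str.findall(r"\s\d*KT\s")
# Guard against empty match lists before indexing [0]
df["Wind_Dir"] = matches.apply(lambda x: x[0].strip()[:3] if x else None)
```

Rows without a match get None instead of raising an IndexError.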

Related

Extract a match in a row and take everything until a comma or take it if it is the end ot the string in pandas

I have a dataset. In the column 'Tags' I want to extract from each row all the content that contains the word player. It could be repeated or appear alone in the same cell. Something like this:
'view_snapshot_hi:hab,like_hi:hab,view_snapshot_foinbra,completed_profile,view_page_investors_landing,view_foinbra_inv_step1,view_foinbra_inv_step2,view_foinbra_inv_step3,view_snapshot_acium,player,view_acium_inv_step1,view_acium_inv_step2,view_acium_inv_step3,player_acium-ronda-2_r1,view_foinbra_rinv_step1,view_page_makers_landing'
expected output:
'player,player_acium-ronda-2_r1'
And I need both.
df["Tags"] = df["Tags"].str.ectract(r'*player'*,?\s*')
I tried this but it's not working.
You need to use Series.str.extract keeping in mind that the pattern should contain a capturing group embracing the part you need to extract.
The pattern you need is player[^,]*:
df["Tags"] = df["Tags"].str.extract(r'(player[^,]*)', expand=False)
With expand=False it returns a Series/Index rather than a DataFrame.
Note that Series.str.extract finds and fetches the first match only. To get all matches, use either of the two solutions below with Series.str.findall (which takes no expand argument):
df["Tags"] = df["Tags"].str.findall(r'player[^,]*')
df["Tags"] = df["Tags"].str.findall(r'player[^,]*').str.join(", ")
This simple list comprehension also gives what you want:
words_with_players = [item for item in your_str.split(',') if 'player' in item]
players = ','.join(words_with_players)
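Both routes can be checked against the sample string from the question:

```python
import re

s = ("view_snapshot_hi:hab,like_hi:hab,view_snapshot_foinbra,completed_profile,"
     "view_page_investors_landing,view_foinbra_inv_step1,view_foinbra_inv_step2,"
     "view_foinbra_inv_step3,view_snapshot_acium,player,view_acium_inv_step1,"
     "view_acium_inv_step2,view_acium_inv_step3,player_acium-ronda-2_r1,"
     "view_foinbra_rinv_step1,view_page_makers_landing")

# Regex route: every comma-delimited chunk starting with 'player'
regex_result = ",".join(re.findall(r"player[^,]*", s))
# Plain-Python route: split on commas and keep chunks containing 'player'
split_result = ",".join(item for item in s.split(",") if "player" in item)
```

On this data both produce 'player,player_acium-ronda-2_r1'; note the split route matches 'player' anywhere in a chunk, while the regex matches from 'player' onward.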

Find any word of a list in the column of dataframe

I have a list of words negative that has 4783 elements. I want to use the following code
tweets3 = tweets2[tweets2['full_text'].str.contains('|'.join(negative))]
But it gives an error: multiple repeat at position 4193.
I do not understand this error. Apparently, if I use a single word in str.contains such as str.contains("deal") I am able to get results.
All I need is a new dataframe containing only those rows of tweets2 whose full_text column contains any of the words.
As a matter of choice, I would also like a boolean column marking present and absent values as 1 or 0.
I arrived at using the following code with the help of #wp78de:
tweets2['negative'] = tweets2.loc[tweets2['full_text'].str.contains(r'(?:{})'.format('|'.join(negative)), regex=True, na=False)].copy()
For arbitrary literal strings that may have regular expression metacharacters in them, you can use the re.escape() function. Escape each word before joining, so the | separators remain regex alternation; something along this line should be sufficient:
.str.contains(r'(?:{})'.format('|'.join(map(re.escape, words))), regex=True, na=False)
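A runnable sketch of that approach, including the 1/0 presence column the asker wanted (the word list and tweet texts are invented; 'c++' contains the '+' metacharacter that triggers errors like "multiple repeat"):

```python
import re
import pandas as pd

negative = ["bad deal", "c++", "scam"]  # hypothetical word list
tweets2 = pd.DataFrame({"full_text": ["what a bad deal", "great product", None]})

# Escape each word *before* joining with '|', so the pipes stay alternation
pattern = r"(?:{})".format("|".join(map(re.escape, negative)))
mask = tweets2["full_text"].str.contains(pattern, regex=True, na=False)

tweets3 = tweets2[mask]                  # rows containing any of the words
tweets2["negative"] = mask.astype(int)   # 1/0 presence column
```

na=False makes missing full_text values count as non-matches instead of NaN.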

How to slice all of the elements of pandas dataframe at once?

I have the following data stored in my Pandas dataframe:
Factor SimTime RealTime SimStatus
0 Factor[0.48] SimTime[83.01] RealTime[166.95] Paused[F]
1 Factor[0.48] SimTime[83.11] RealTime[167.15] Paused[F]
2 Factor[0.49] SimTime[83.21] RealTime[167.36] Paused[F]
3 Factor[0.48] SimTime[83.31] RealTime[167.57] Paused[F]
I want to create a new dataframe with only everything within [].
I am attempting to use the following code:
df = dataframe.apply(lambda x: x.str.slice(start=x.str.find('[')+1, stop=x.str.find(']')))
However, all I see in df is NaN. Why? What's going on? What should I do to achieve the desired behavior?
You can use regex to replace the contents.
df.replace(r'\w+\[([\S]+)\]', r'\1', regex=True)
Edit
The replace function of a pandas DataFrame:
"Replace values given in to_replace with value."
Both the target string and the value with which it needs to be replaced can be regular expressions; for that you need to set regex=True in the arguments to replace.
https://regex101.com/r/7KCs6q/1
Look at the above link to see the explanation of the regular expression in detail.
Basically, it captures the non-whitespace content within the square brackets as the replacement value, and matches any run of word characters followed by square brackets containing non-whitespace characters as the target string.
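A self-contained sketch with a cut-down version of the sample dataframe (only two of the four columns, for brevity):

```python
import pandas as pd

dataframe = pd.DataFrame({
    "Factor": ["Factor[0.48]", "Factor[0.49]"],
    "SimStatus": ["Paused[F]", "Paused[F]"],
})

# Replace each cell with the captured content between the brackets
df = dataframe.replace(r"\w+\[([\S]+)\]", r"\1", regex=True)
```

Every cell keeps only what was inside the square brackets.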

Regex loop stops after 60 iterations

I'm trying to find a regex pattern and put it in a dataframe column, while looping over the values of another column.
Problem : It works wonders up until the 60th iteration but then it only shows NaN. I have 400 000 entries and most of them should match.
Why is that and how can I fix it?
import re

new_mail = []
for urlcore in re.finditer(r'https*://[www.]*(\S*).*\.(fr|com)', str(df['Site_Web'])):
    yolo = urlcore.group(1)
    new_mail.append(yolo)
df['urlcore'] = pd.Series(new_mail)
df['urlcore'] = df['urlcore'].str.replace('.', '', regex=True).replace('-', '', regex=True)
Your regex suffers from performance issues due to (\S*).*. Change it to https?://(www\.)?(\S*)\.(fr|com)
The correct regular expression for that is:
(?:https?://)?(?:www\.)?([a-zA-Z0-9][a-zA-Z0-9-]{1,61})\.[a-zA-Z]{2,}
Note that there are three groups in the regular expression, but the first two are non-capturing, so to access the core part you should use urlcore.group(1).
In your case you need to change the ending to (fr|com), and if you need to handle sub-domains you also need to add a preceding optional group: (?:[a-zA-Z0-9][a-zA-Z0-9-]{1,61}\.)*
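A vectorized sketch of that regex applied per row with Series.str.extract, avoiding the finditer-over-str(Series) loop entirely (the Site_Web values are invented):

```python
import pandas as pd

df = pd.DataFrame({"Site_Web": ["https://www.example.com", "http://test.fr", "not a url"]})

# One capture per row; rows with no match get NaN instead of being dropped
df["urlcore"] = df["Site_Web"].str.extract(
    r"(?:https?://)?(?:www\.)?([a-zA-Z0-9][a-zA-Z0-9-]{1,61})\.(?:fr|com)",
    expand=False,
)
```

Because the result stays aligned with the original index, there is no risk of the 60-row cutoff the loop ran into.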

Search for string in file while ignoring id and replacing only a substring

I’ve got a master .xml file generated by an external application and want to create several new .xmls by adapting and deleting some rows with python. The search strings and replace strings for these adaptions are stored within an array, e.g.:
replaceArray = [
[u'ref_layerid_mapping="x4049" lyvis="off" toc_visible="off"',
u'ref_layerid_mapping="x4049" lyvis="on" toc_visible="on"'],
[u'<TOOL_BUFFER RowID="106874" id_tool_base="3651" use="false"/>',
u'<TOOL_BUFFER RowID="106874" id_tool_base="3651" use="true"/>'],
[u'<TOOL_SELECT_LINE RowID="106871" id_tool_base="3658" use="false"/>',
u'<TOOL_SELECT_LINE RowID="106871" id_tool_base="3658" use="true"/>']]
So I'd like to iterate through my file and replace all occurences of 'ref_layerid_mapping="x4049" lyvis="off" toc_visible="off"' with 'ref_layerid_mapping="x4049" lyvis="on" toc_visible="on"' and so on.
Unfortunately the ID values of "RowID", "id_tool_base" and "ref_layerid_mapping" might change occasionally. So what I need is to search for matches of the whole string in the master file, regardless of which ID value is between the quotation marks, and only replace the substring that differs between the two strings of the replaceArray (e.g. use="true" instead of use="false"). I'm not very familiar with regular expressions, but I think I need something like this for my search?
re.sub(r'<TOOL_SELECT_LINE RowID="\d+" id_tool_base="\d+" use="false"/>', "", sentence)
I'm happy about any hint that points me in the right direction! If you need any further information or if something is not clear in my question, please let me know.
One way to do this is to have a function for replacing text. The function would get the match object from re.sub and insert id captured from the string being replaced.
import re

s = 'ref_layerid_mapping="x4049" lyvis="off" toc_visible="off"'
pat = re.compile(r'ref_layerid_mapping=(.+) lyvis="off" toc_visible="off"')

def replacer(m):
    return "ref_layerid_mapping=" + m.group(1) + ' lyvis="on" toc_visible="on"'

re.sub(pat, replacer, s)
Output:
'ref_layerid_mapping="x4049" lyvis="on" toc_visible="on"'
Another way is to use back-references in replacement pattern. (see http://www.regular-expressions.info/replacebackref.html)
For example:
import re
s = "Ab ab"
re.sub(r"(\w)b (\w)b", r"\1d \2d", s)
Output:
'Ad ad'
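Applied to the TOOL_BUFFER case from the question, back-references let the changing IDs pass through unchanged (a sketch; the RowID value is invented to show that it need not match the replaceArray entry):

```python
import re

line = '<TOOL_BUFFER RowID="999999" id_tool_base="3651" use="false"/>'

# Capture everything around the part that changes, and splice it back in
fixed = re.sub(
    r'(<TOOL_BUFFER RowID="\d+" id_tool_base="\d+" use=")false("/>)',
    r'\1true\2',
    line,
)
```

Only the use attribute flips; the IDs are copied through by \1 and \2.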
