I have a pandas series ['\ufffa', 'abc'] and I would like to check if a string contains \. I try
import pandas as pd
tmp = ['\ufffa', 'abc']
tmp = pd.Series(tmp)
tmp.str.startswith('\\')
and it returns
0 False
1 False
dtype: bool
With a single string, I can use r'\ufffa'.startswith('\\'). Could you please elaborate on how to do so for a whole series?
Your string doesn't start with a backslash. \ufffa is a unicode escape and your string contains the unicode code point U+FFFA ("Interlinear Annotation Separator").
In your other example, you used r'\ufffa', not '\ufffa'. That is a raw string, so the unicode escape doesn't take effect. If you use raw strings in your Series as well, then startswith will work as you expect there too.
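A minimal sketch of the difference, assuming only pandas is installed (the variable names escaped, raw and starts_with_backslash are made up for illustration):

```python
import pandas as pd

escaped = '\ufffa'   # unicode escape: a single character, U+FFFA
raw = r'\ufffa'      # raw string: six characters, the first is a backslash

print(len(escaped), len(raw))

# With the raw string in the Series, startswith finds the backslash.
tmp = pd.Series([raw, 'abc'])
starts_with_backslash = tmp.str.startswith('\\')
print(starts_with_backslash.tolist())
```

If the data comes from a file rather than from string literals in your code, no escape processing happens at all, so a backslash in the file stays a backslash.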
I have a dataframe with one column and I want to know if the values of the column contain a "+". I did it like this:
mask = df['column'].str.contains("\+")
But when I run the SonarQube analysis, it reports a bug ("\" should only be used as an escape character outside of raw strings).
I tried to change the line of code to this:
mask = df['column'].str.contains(r"+")
But it throws an error:
re.error: nothing to repeat at position 0
How can I get the same result as the first line without the error?
There is a difference between escaping the characters in python to be interpreted as a special character (e.g. \n is a newline and has nothing to do with n), and escaping the character not to be interpreted as a special symbol in the regex. Here you need both.
Either use a raw string:
mask = df['column'].str.contains(r'\+')
or escape the backslash:
mask = df['column'].str.contains('\\+')
Example:
df = pd.DataFrame({'column': ['abc', 'ab+c']})
df['column'].str.contains(r'\+')
0 False
1 True
Name: column, dtype: bool
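If you don't need a regex at all, two other options may be worth a look. This is only a sketch: regex=False asks str.contains for a plain substring match, and re.escape builds the escaped pattern for you. The names mask_literal and mask_escaped are made up for the example.

```python
import re
import pandas as pd

df = pd.DataFrame({'column': ['abc', 'ab+c']})

# Option 1: plain substring search, so '+' has no special meaning.
mask_literal = df['column'].str.contains('+', regex=False)

# Option 2: keep regex matching, but let re.escape add the backslash.
mask_escaped = df['column'].str.contains(re.escape('+'))

print(mask_literal.tolist())
print(mask_escaped.tolist())
```

Both avoid writing a backslash in a non-raw string literal, which is what the analyzer complained about.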
I have a pandas dataframe that consists of strings. I would like to remove the n-th character from the end of the strings. I have the following code:
DF = pandas.DataFrame({'col': ['stri0ng']})
DF['col'] = DF['col'].str.replace('(.)..$','')
Instead of removing the third to the last character (0 in this case), it removes 0ng. The result should be string but it outputs stri. Where am I wrong?
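Your pattern consumes the last three characters, so all three are replaced. Instead, match a single character only when it is followed by n-1 characters at the end of the string (in recent pandas versions you need to pass regex=True explicitly):
DF['col'] = DF['col'].str.replace('.(?=.{2}$)', '', regex=True)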
col
0 string
If you want to make sure you're only removing digits (so that 'string' in one special row doesn't get changed to 'strng'), then use something like '[0-9](?=.{2}$)' as pattern.
Another way, if the character sits at a known position from the start (index 4 here), using pd.Series.str.slice_replace:
DF['col'].str.slice_replace(4,5,'')
Output:
0 string
Name: col, dtype: object
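To generalize the lookahead idea to any n, something like the following might work. It is a sketch under the assumption that n counts from 1 (n=1 is the last character); drop_nth_from_end is a hypothetical helper name, not a pandas function.

```python
import pandas as pd

def drop_nth_from_end(series, n):
    """Remove the n-th character from the end of each string (n=1 is the last)."""
    if n < 1:
        raise ValueError("n must be >= 1")
    # Match one character that is followed by exactly n-1 characters
    # and then the end of the string.
    pattern = r'.(?=.{%d}$)' % (n - 1)
    return series.str.replace(pattern, '', regex=True)

DF = pd.DataFrame({'col': ['stri0ng', 'ab']})
result = drop_nth_from_end(DF['col'], 3)
print(result.tolist())
```

Strings shorter than n characters have no match and are left unchanged.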
I have the following data stored in my Pandas dataframe:
Factor SimTime RealTime SimStatus
0 Factor[0.48] SimTime[83.01] RealTime[166.95] Paused[F]
1 Factor[0.48] SimTime[83.11] RealTime[167.15] Paused[F]
2 Factor[0.49] SimTime[83.21] RealTime[167.36] Paused[F]
3 Factor[0.48] SimTime[83.31] RealTime[167.57] Paused[F]
I want to create a new dataframe with only everything within [].
I am attempting to use the following code:
df = dataframe.apply(lambda x: x.str.slice(start=x.str.find('[')+1, stop=x.str.find(']')))
However, all I see in df is NaN. Why? What's going on? What should I do to achieve the desired behavior?
You can use regex to replace the contents.
df.replace(r'\w+\[([\S]+)\]', r'\1', regex=True)
Edit
The replace function of a pandas DataFrame replaces the values given in to_replace with value.
Both the target string and its replacement can be regular expressions. To enable that, pass regex=True to replace.
https://regex101.com/r/7KCs6q/1
Look at the above link to see the explanation of the regular expression in detail.
Basically, the target pattern is a run of word characters followed by square brackets that enclose non-whitespace characters, and the replacement is the content captured inside the brackets.
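As for the NaN: str.slice most likely expects plain integer positions for start and stop, not a Series of positions per row. An alternative to replace is to capture the bracket contents with str.extract, column by column. This is a sketch on a cut-down copy of the data; the name inner is made up for the example.

```python
import pandas as pd

dataframe = pd.DataFrame({
    'Factor': ['Factor[0.48]', 'Factor[0.49]'],
    'SimTime': ['SimTime[83.01]', 'SimTime[83.21]'],
    'SimStatus': ['Paused[F]', 'Paused[F]'],
})

# For each column, capture whatever sits between '[' and ']'.
# expand=False makes str.extract return a Series for the single group.
inner = dataframe.apply(
    lambda col: col.str.extract(r'\[([^\]]*)\]', expand=False)
)

# The captured values are still strings; convert numeric columns explicitly.
inner['Factor'] = inner['Factor'].astype(float)
print(inner)
```

Rows without brackets come back as NaN from str.extract, which makes bad input easy to spot.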
How can I extract a number from a string whose length varies?
I tried slicing, but the length of the slice changes from case to case.
For example in one case I should copy number 128 from string '"edge_liked_by":{"count":128}', in another I should copy 15332 from "edge_liked_by":{"count":15332}
You could use a regular expression:
import re
string = '"edge_liked_by":{"count":15332}'
number = re.search(r'{"count":(\d*)}', string).group(1)
Really depends on the situation, however I find regular expressions to be useful.
To grab the numbers from the string without caring about their location, you would do as follows:
import re
def get_string(string):
    return re.search(r'\d+', string).group(0)
>>> get_string('"edge_liked_by":{"count":128}')
'128'
To only get numbers from the end of the string, you can use an anchor to ensure the result is pulled from the far end. The following example will grab any unbroken sequence of digits that is both preceded by a colon and ends within 5 characters of the end of the string:
import re
def get_string(string):
    rval = None
    string_match = re.search(r':(\d+).{0,5}$', string)
    if string_match:
        rval = string_match.group(1)
    return rval
>>> get_string('"edge_liked_by":{"count":128}')
'128'
>>> get_string('"edge_liked_by":{"1321":1}')
'1'
In the above example, adding the colon will ensure that we only pick values and don't match keys such as the "1321" that I added in as a test.
If you just want anything after the last colon, but excluding the bracket, try combining split with slicing:
>>> '"edge_liked_by":{"count":128}'.split(':')[-1][0:-1]
'128'
Finally, considering this looks like a JSON object, you can add curly brackets to the string and treat it as such. Then it becomes a nested dict you can query:
>>> import json
>>> string = '"edge_liked_by":{"count":128}'
>>> string = '{' + string + '}'
>>> string = json.loads(string)
>>> string.get('edge_liked_by').get('count')
128
The regex and split approaches return a string, while the JSON approach returns a number because the value is parsed as JSON.
It looks like the type of string you are working with is read from JSON, maybe you are getting it as the output of some API you are working with?
If it is JSON, you've probably gone one step too far in atomizing it to a string like this. I'd work with the original output, if possible, if I were you.
If not, to make it more JSON-like, I'd wrap it in {} and then parse it with the json.loads function.
import json
string = '"edge_liked_by":{"count":15332}'
string = "{"+string+"}"
json_obj = json.loads(string)
count = json_obj['edge_liked_by']['count']
count will have the desired output. I prefer this option to using regular expressions because you can rely on the structure of the data and reuse the code in case you wish to parse out other attributes, in a very intuitive way. With regular expressions, the code you use will change if the data are decimal, or negative, or contain non-numeric characters.
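If some of the fragments might be malformed, the same idea can be wrapped in a small function that returns None instead of raising. This is only a sketch; get_count and its key parameter are hypothetical names for the example.

```python
import json

def get_count(fragment, key='edge_liked_by'):
    """Return the count from a '"key":{"count":N}' fragment, or None."""
    try:
        # Wrap the fragment in braces so it becomes a complete JSON object.
        obj = json.loads('{' + fragment + '}')
    except json.JSONDecodeError:
        return None
    inner = obj.get(key)
    if not isinstance(inner, dict):
        return None
    return inner.get('count')

print(get_count('"edge_liked_by":{"count":15332}'))
print(get_count('not json'))
```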
Does this help?
a='"edge_liked_by":{"count":128}'
import re
b=re.findall(r'\d+', a)[0]
b
Out[16]: '128'
I have this 'file.csv' file to read with pandas:
Title|Tags
T1|"[Tag1,Tag2]"
T1|"[Tag1,Tag2,Tag3]"
T2|"[Tag3,Tag1]"
using
df = pd.read_csv('file.csv', sep='|')
the output is:
Title Tags
0 T1 [Tag1,Tag2]
1 T1 [Tag1,Tag2,Tag3]
2 T2 [Tag3,Tag1]
I know that the column Tags is a full string, since:
In [64]: df['Tags'][0][0]
Out[64]: '['
I need to read it as a list of strings like ["Tag1","Tag2"]. I tried the solution provided in this question but had no luck, since the [ and ] characters mess things up.
The expected output should be:
In [64]: df['Tags'][0][0]
Out[64]: 'Tag1'
You can split the string manually:
>>> df['Tags'] = df.Tags.apply(lambda x: x[1:-1].split(','))
>>> df.Tags[0]
['Tag1', 'Tag2']
You could use the built-in ast.literal_eval; it works for tuples as well as lists:
import ast
import pandas as pd
df = pd.DataFrame({"mytuples": ["(1,2,3)"]})
print(df.iloc[0,0])
# >> '(1,2,3)'
df["mytuples"] = df["mytuples"].apply(ast.literal_eval)
print(df.iloc[0,0])
# >> (1,2,3)
EDIT: eval should be avoided! If the string being evaluated is os.system('rm -rf /'), it will start deleting all the files on your computer. For ast.literal_eval, the string or node provided may only consist of the following Python literal structures: strings, bytes, numbers, tuples, lists, dicts, sets, booleans, and None. Thanks @TrentonMcKinney :)
Or
df.Tags=df.Tags.str[1:-1].str.split(',').tolist()
I think you could use the json module.
import json
import pandas as pd
df = pd.read_csv('file.csv', sep='|')
df['Tags'] = df['Tags'].apply(lambda x: json.loads(x))
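This loads your dataframe as before, then applies a lambda function to each item in the Tags column. The lambda calls json.loads(), which converts the string representation of a list to an actual list. Note that json.loads only accepts valid JSON, so the list items must be quoted strings such as ["Tag1","Tag2"]; with unquoted items such as [Tag1,Tag2] it raises json.JSONDecodeError.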
You can convert the string to a list using strip and split.
df_out = df.assign(Tags=df.Tags.str.strip('[]').str.split(','))
df_out.Tags[0][0]
Output:
'Tag1'
Your df['Tags'] column holds strings, not lists. If you print it as a list you should get ["[Tag1,Tag2]","[Tag1,Tag2,Tag3]","[Tag3,Tag1]"]. This is why, when you take the first element of the first element, you get the first single character of the string rather than what you want.
You need to parse that string afterward, with something like
df['Tags'][0] = df['Tags'][0].split(',')
But as you saw in your cited example, this will give you a list that looks like
in: df['Tags'][0][0]
out: '[Tag1'
What you need is a way to parse the string editing out multiple characters. You can use a simple regex expression to do this. Something like:
import re
df['Tags'][0] = re.findall(r"[\w']+", df['Tags'][0])
print(df['Tags'][0][0])
will print:
Tag1
Using the other answer involving Pandas converters you might write a converter like this:
def clean(seq_string):
    return re.findall(r"[\w']+", seq_string)
If you don't know regular expressions, they can be quite powerful, but also unpredictable if you're not sure about the content of your input strings. The expression used here, r"[\w']+", matches runs of word characters (letters, digits, and underscores) plus apostrophes, so re.findall effectively treats everything else as a separator between list items.
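Wired into read_csv, the converter could look roughly like this. It is a sketch that reads the sample data from an in-memory string (csv_text, via io.StringIO) instead of file.csv, so it runs on its own.

```python
import io
import re
import pandas as pd

def clean(seq_string):
    # Keep runs of word characters; brackets and commas act as separators.
    return re.findall(r"[\w']+", seq_string)

csv_text = (
    'Title|Tags\n'
    'T1|"[Tag1,Tag2]"\n'
    'T1|"[Tag1,Tag2,Tag3]"\n'
    'T2|"[Tag3,Tag1]"\n'
)

# converters maps a column name to a function applied to every raw cell.
df = pd.read_csv(io.StringIO(csv_text), sep='|', converters={'Tags': clean})
first_tag = df['Tags'][0][0]
print(first_tag)
```

With a real file you would pass 'file.csv' in place of the StringIO object. The column then holds lists from the moment it is read, so no later conversion step is needed.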