pandas - convert string into list of strings [duplicate] - python

I have this 'file.csv' file to read with pandas:
Title|Tags
T1|"[Tag1,Tag2]"
T1|"[Tag1,Tag2,Tag3]"
T2|"[Tag3,Tag1]"
using
df = pd.read_csv('file.csv', sep='|')
the output is:
Title Tags
0 T1 [Tag1,Tag2]
1 T1 [Tag1,Tag2,Tag3]
2 T2 [Tag3,Tag1]
I know that the column Tags is a full string, since:
In [64]: df['Tags'][0][0]
Out[64]: '['
I need to read it as a list of strings like ["Tag1","Tag2"]. I tried the solution provided in this question, but it did not work because the [ and ] characters get in the way.
The expected output is:
In [64]: df['Tags'][0][0]
Out[64]: 'Tag1'

You can split the string manually:
>>> df['Tags'] = df.Tags.apply(lambda x: x[1:-1].split(','))
>>> df.Tags[0]
['Tag1', 'Tag2']

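You could use the built-in ast.literal_eval, which works for tuples as well as lists. Note that it only parses valid Python literals, so string elements must be quoted, e.g. "['Tag1','Tag2']"; the unquoted "[Tag1,Tag2]" from the question raises a ValueError.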
import ast
import pandas as pd
df = pd.DataFrame({"mytuples": ["(1,2,3)"]})
print(df.iloc[0,0])
# >> '(1,2,3)'
df["mytuples"] = df["mytuples"].apply(ast.literal_eval)
print(df.iloc[0,0])
# >> (1,2,3)
EDIT: eval should be avoided! If the string being evaluated is os.system('rm -rf /'), it will start deleting all the files on your computer. For ast.literal_eval, the string or node provided may only consist of the following Python literal structures: strings, bytes, numbers, tuples, lists, dicts, sets, booleans, and None. Thanks #TrentonMcKinney :)
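To illustrate the restriction to literals, here is a minimal sketch. The helper name parse_list_literal is made up for this example and is not part of ast or pandas; it simply returns None when the string is not a valid list literal.

```python
import ast

def parse_list_literal(text):
    """Parse a string holding a Python list literal; return None if it is not one."""
    try:
        value = ast.literal_eval(text)
    except (ValueError, SyntaxError):
        # Raised for unquoted names, function calls, or malformed input.
        return None
    return value if isinstance(value, list) else None

quoted = parse_list_literal("['Tag1','Tag2']")
unquoted = parse_list_literal("[Tag1,Tag2]")
print(quoted)
print(unquoted)
```

The quoted input parses into a real list, while the unquoted form from the question is rejected, which is why the split-based answers are a better fit for that file.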

Or
df.Tags=df.Tags.str[1:-1].str.split(',').tolist()

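I think you could use the json module, provided the column holds valid JSON (string elements in double quotes, e.g. ["Tag1","Tag2"]).
import json
import pandas as pd
df = pd.read_csv('file.csv', sep='|')
df['Tags'] = df['Tags'].apply(lambda x: json.loads(x))
This loads your dataframe as before, then applies a lambda function to each item in the Tags column. The lambda calls json.loads(), which converts the string representation of the list into an actual list. With the unquoted tags shown in the question, json.loads() raises json.JSONDecodeError, so use one of the split-based approaches instead.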

You can convert the string to a list using strip and split.
df_out = df.assign(Tags=df.Tags.str.strip('[]').str.split(','))
df_out.Tags[0][0]
Output:
'Tag1'

Your df['Tags'] column holds strings, not lists. If you print it as a list you get ["[Tag1,Tag2]","[Tag1,Tag2,Tag3]","[Tag3,Tag1]"], which is why indexing the first element of the first element returns the first character of the string rather than the first tag.
You need to parse that string afterward. You could start with something like
df['Tags'][0] = df['Tags'][0].split(',')
But as you saw in your cited example, this gives you a list whose first element still carries the bracket:
in: df['Tags'][0][0]
out: '[Tag1'
What you need is a way to parse the string while dropping several characters at once. A simple regular expression can do this. Something like:
import re
df['Tags'][0] = re.findall(r"[\w']+", df['Tags'][0])
print(df['Tags'][0][0])
will print:
Tag1
Using the other answer involving Pandas converters you might write a converter like this:
def clean(seq_string):
    return re.findall(r"[\w']+", seq_string)
Regular expressions are powerful, but they can behave unpredictably if you are not sure what your input strings contain. The expression used here, r"[\w']+", matches runs of word characters (letters, digits, and underscores) plus apostrophes, so re.findall returns those runs and everything else acts as a separator.
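As a rough sketch of how such a converter could be wired in, the example below passes clean to the converters argument of pd.read_csv. The in-memory csv_text stands in for 'file.csv' from the question and is an assumption made so the example runs on its own.

```python
import io
import re

import pandas as pd

def clean(seq_string):
    # Return every run of word characters or apostrophes as one list element.
    return re.findall(r"[\w']+", seq_string)

# In-memory stand-in for 'file.csv' from the question.
csv_text = 'Title|Tags\nT1|"[Tag1,Tag2]"\nT1|"[Tag1,Tag2,Tag3]"\nT2|"[Tag3,Tag1]"\n'

# The converter runs on each raw Tags cell while the file is parsed.
df = pd.read_csv(io.StringIO(csv_text), sep="|", converters={"Tags": clean})
print(df["Tags"][0])
```

Keep in mind that this pattern would also split tags containing spaces or hyphens into separate elements, so it only suits tags made of word characters.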

Related

How to extract a certain part of a given string in Python [duplicate]

I have a string-type URL --https://drive.google.com/file/d/1vBiwyAL3OZ9VVCWCn5t6BagvLQoMjk82/view?usp=sharing
I want only the part after "d/" and before "/view". How can I extract it with a regex or any other function? For example, I want only 1vBiwyAL3OZ9VVCWCn5t6BagvLQoMjk82 from the string above. Right now I am using this:
a = df['image']
for i in a:
    print(type(i))
    res = re.findall('/d/(.*)/view', i)
    print(res)
but res is an empty list when I print it.
I guess /d/(.*)/view is the regex you are looking for. re.findall('/d/(.*)/view', url)[0] works with your example.
import re
s="--https://drive.google.com/file/d/1vBiwyAL3OZ9VVCWCn5t6BagvLQoMjk82/view?usp=sharing"
res=re.findall('/d/(.*)/view', s)
if len(res)==0:
    print('bad url')
else:
    print(res[0])
ADDITION (example using dataframes)
import pandas as pd
import re
df=pd.DataFrame([["url1", "--https://drive.google.com/file/d/1vBiwyAL3OZ9VVCWCn5t6BagvLQoMjk82/view?usp=sharing"], ["url2", "--https://drive.google.com/file/d/second/one/view?usp=sharing"], ["noUrl", "anotherstring"]], columns=['name', 'image'])
Then
for i in df['image']:
    res=re.findall('/d/(.*)/view', i)
    print(res)
output
['1vBiwyAL3OZ9VVCWCn5t6BagvLQoMjk82']
['second/one']
[]
as expected: a list of length 1 whose res[0] element is the part you want if the pattern matches, or an empty list if it does not.
If your URL contains more than one /d/.../view pattern, the greedy (.*) still returns a single match, spanning from the first /d/ to the last /view; use the non-greedy (.*?) to get each part separately. An empty list only occurs when the string has no .../d/.../view... form.
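The difference between the greedy and non-greedy forms can be seen in a small sketch. The input string here is invented purely for illustration and does not come from the question.

```python
import re

# Hypothetical string with two /d/.../view sections, made up for illustration.
text = "/file/d/first/view and /file/d/second/view"

# Greedy: (.*) runs from the first "/d/" to the last "/view".
greedy = re.findall(r"/d/(.*)/view", text)

# Non-greedy: (.*?) stops at the nearest "/view" each time.
lazy = re.findall(r"/d/(.*?)/view", text)

print(greedy)
print(lazy)
```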
You can use the str.split method.
First split the string on "/d/", then take the second part and split it on "/view". The first part of that result is what you want.
html.split("/d/")[1].split("/view")[0]
If your string is always going to look like this, I would break the string apart using the "/" as a delimiter.
s = "https://drive.google.com/file/d/1vBiwyAL3OZ9VVCWCn5t6BagvLQoMjk82/view?usp=sharing"
s = s.split('/')[5]
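If relying on a fixed position feels fragile, one possible variation is to parse the URL with the standard library's urllib.parse and look for the segment that follows "d". This is only a sketch; the variable names segments and file_id are chosen for the example.

```python
from urllib.parse import urlsplit

url = "https://drive.google.com/file/d/1vBiwyAL3OZ9VVCWCn5t6BagvLQoMjk82/view?usp=sharing"

# urlsplit separates the path from the scheme, host, and query string.
segments = urlsplit(url).path.strip("/").split("/")

# Take the segment that follows "d", if there is one.
file_id = segments[segments.index("d") + 1] if "d" in segments[:-1] else None
print(file_id)
```

This avoids counting slashes by hand and leaves the query string out of the result.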

How to check if a pandas series contains "\"?

I have a pandas series ['\ufffa', 'abc'] and I would like to check whether a string contains \. I tried
import pandas as pd
tmp = ['\ufffa', 'abc']
tmp = pd.Series(tmp)
tmp.str.startswith('\\')
and it returns
0 False
1 False
dtype: bool
With a single string, I can use r'\ufffa'.startswith('\\'). Could you please elaborate on how to do so for a whole series?
Your string doesn't start with a backslash. \ufffa is a unicode escape and your string contains the unicode code point U+FFFA ("Interlinear Annotation Separator").
In your other example, you used r'\ufffa', not '\ufffa'; that is a raw string, so the unicode escape doesn't take effect. If you do the same in your Series, then startswith will work as you expect there as well.
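The difference between the two literals can be checked in plain Python, as in this small sketch. Series.str.startswith applies the same per-element test, so the same reasoning carries over to the Series.

```python
escaped = '\ufffa'   # one character: the code point U+FFFA
raw = r'\ufffa'      # six characters: a backslash followed by "ufffa"

print(len(escaped), escaped.startswith('\\'))
print(len(raw), raw.startswith('\\'))
```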

How to copy changing substring in string?

How can I copy data from a string whose contents change?
I tried slicing, but the length of the slice changes.
For example, in one case I need to copy the number 128 from the string '"edge_liked_by":{"count":128}'; in another I need to copy 15332 from '"edge_liked_by":{"count":15332}'.
You could use a regular expression:
import re
string = '"edge_liked_by":{"count":15332}'
number = re.search(r'{"count":(\d*)}', string).group(1)
Really depends on the situation, however I find regular expressions to be useful.
To grab the numbers from the string without caring about their location, you would do as follows:
import re
def get_string(string):
    return re.search(r'\d+', string).group(0)
>>> get_string('"edge_liked_by":{"count":128}')
'128'
To get numbers only from the end of the string, you can use an anchor to ensure the result is pulled from the far end. The following example grabs any unbroken sequence of digits that is preceded by a colon and ends within 5 characters of the end of the string:
import re
def get_string(string):
    rval = None
    string_match = re.search(r':(\d+).{0,5}$', string)
    if string_match:
        rval = string_match.group(1)
    return rval
>>> get_string('"edge_liked_by":{"count":128}')
'128'
>>> get_string('"edge_liked_by":{"1321":1}')
'1'
In the above example, adding the colon will ensure that we only pick values and don't match keys such as the "1321" that I added in as a test.
If you just want anything after the last colon, but excluding the bracket, try combining split with slicing:
>>> '"edge_liked_by":{"count":128}'.split(':')[-1][0:-1]
'128'
Finally, considering this looks like a JSON object, you can add curly brackets to the string and treat it as such. Then it becomes a nested dict you can query:
>>> import json
>>> string = '"edge_liked_by":{"count":128}'
>>> string = '{' + string + '}'
>>> string = json.loads(string)
>>> string.get('edge_liked_by').get('count')
128
The first two will return a string and the final one returns a number due to being treated as a JSON object.
It looks like the string you are working with comes from JSON. Maybe you are getting it as the output of some API?
If it is JSON, you've probably gone one step too far in atomizing it into a string like this. I'd work with the original output if possible.
If not, to make it more JSON-like, I'd wrap it in {} and then parse it with json.loads.
import json
string = '"edge_liked_by":{"count":15332}'
string = "{"+string+"}"
json_obj = json.loads(string)
count = json_obj['edge_liked_by']['count']
count will have the desired output. I prefer this option to using regular expressions because you can rely on the structure of the data and reuse the code in case you wish to parse out other attributes, in a very intuitive way. With regular expressions, the code you use will change if the data are decimal, or negative, or contain non-numeric characters.
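The same idea could be wrapped in a small reusable helper. The sketch below assumes the fragment always has the shape '"key":{...}'; the function name extract_count and its key parameter are invented for this example.

```python
import json

def extract_count(fragment, key="edge_liked_by"):
    """Wrap a '"key":{...}' fragment in braces, parse it, and return its count (or None)."""
    try:
        data = json.loads("{" + fragment + "}")
    except json.JSONDecodeError:
        # The fragment was not JSON-like after wrapping.
        return None
    return data.get(key, {}).get("count")

print(extract_count('"edge_liked_by":{"count":128}'))
print(extract_count('"edge_liked_by":{"count":15332}'))
```

Because the value is parsed rather than pattern-matched, it comes back as an int, and negative or decimal counts would need no change to the code.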
Does this help ?
a='"edge_liked_by":{"count":128}'
import re
b=re.findall(r'\d+', a)[0]
b
Out[16]: '128'

how to get second last and last value in a string after separator in python

In Python, how do you get the last and second-to-last elements of a string after splitting on a separator?
string: "client_user_username_type_1234567"
expected output: "type_1234567"
Try this:
>>> s = "client_user_username_type_1234567"
>>> '_'.join(s.split('_')[-2:])
'type_1234567'
You can also use re.findall:
import re
s = "client_user_username_type_1234567"
result = re.findall(r'[a-zA-Z]+_\d+$', s)[0]
Output:
'type_1234567'
There's no single built-in function that does this for you; you have to combine what Python gives you:
split, slice, and join
"_".join("one_two_three".split("_")[-2:])
In steps:
Split the string by the common separator, "_"
s.split("_")
Slice the list so that you get the last two elements by using a negative index
s.split("_")[-2:]
You now have a list of the last two elements. Join that list with the separator "_" so it looks like the original string again.
"_".join("one_two_three".split("_")[-2:])
That's pretty much it. Another way to investigate is through regex.
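A slight variation on split, slice, and join is to split from the right with str.rsplit. This is a sketch, and the helper name last_parts and its parameters are made up for the example.

```python
def last_parts(text, sep="_", count=2):
    # rsplit with maxsplit=count only splits at the rightmost separators,
    # so the leading part of the string stays in one piece.
    return sep.join(text.rsplit(sep, count)[-count:])

print(last_parts("client_user_username_type_1234567"))
```

When the string has fewer separators than count, the whole string is returned unchanged.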

Why is the split() returning list objects that are empty? [duplicate]

I have the following file names that exhibit this pattern:
000014_L_20111007T084734-20111008T023142.txt
000014_U_20111007T084734-20111008T023142.txt
...
I want to extract the middle two time stamp parts after the second underscore '_' and before '.txt'. So I used the following Python regex string split:
time_info = re.split(r'^[0-9]+_[LU]_|-|\.txt$', f)
But this gives me two extra empty strings in the returned list:
time_info=['', '20111007T084734', '20111008T023142', '']
How do I get only the two time stamp information? i.e. I want:
time_info=['20111007T084734', '20111008T023142']
I'm no Python expert but maybe you could just remove the empty strings from your list?
str_list = re.split(r'^[0-9]+_[LU]_|-|\.txt$', f)
time_info = list(filter(None, str_list))
Don't use re.split(), use the groups() method of regex Match/SRE_Match objects.
>>> f = '000014_L_20111007T084734-20111008T023142.txt'
>>> time_info = re.search(r'[LU]_(\w+)-(\w+)\.', f).groups()
>>> time_info
('20111007T084734', '20111008T023142')
You can even name the capturing groups and retrieve them in a dict, though you use groupdict() rather than groups() for that. (The regex pattern for such a case would be something like r'[LU]_(?P<groupA>\w+)-(?P<groupB>\w+)\.')
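A short sketch of the named-group variant follows. The group names start and end are illustrative choices, not anything the question prescribes.

```python
import re

f = '000014_L_20111007T084734-20111008T023142.txt'

# The group names "start" and "end" are illustrative choices.
match = re.search(r'[LU]_(?P<start>\w+)-(?P<end>\w+)\.', f)

# groupdict() maps each group name to the text it captured.
time_info = match.groupdict() if match else {}
print(time_info)
```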
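If the timestamps are always after the second _ then you can use str.split and str.strip. Note that strip(".txt") removes any leading or trailing '.', 't', and 'x' characters rather than the literal suffix, which happens to be safe for these file names: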
>>> strs = "000014_L_20111007T084734-20111008T023142.txt"
>>> strs.strip(".txt").split("_",2)[-1].split("-")
['20111007T084734', '20111008T023142']
Since this came up on google and for completeness, try using re.findall as an alternative!
This does require a little re-thinking, but it still returns a list of matches like split does. This makes it a nice drop-in replacement for some existing code and gets rid of the unwanted text. Pair it with lookaheads and/or lookbehinds and you get very similar behavior.
Yes, this is a bit of a "you're asking the wrong question" answer, and it doesn't use re.split(). It does solve the underlying issue: your list of matches suddenly has zero-length strings in it and you don't want that.
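One way the re.findall approach could look for these file names is sketched below. The pattern assumes every timestamp is 8 digits, a literal T, then 6 digits, which matches the examples in the question but is otherwise a guess about the data.

```python
import re

f = '000014_L_20111007T084734-20111008T023142.txt'

# Assumed timestamp shape: 8 digits, a literal "T", then 6 digits.
time_info = re.findall(r'\d{8}T\d{6}', f)
print(time_info)
```

Since findall only returns what the pattern matches, there are no empty strings to filter out afterward.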
You can also slice off the fixed-width prefix and the extension:
>>> f='000014_L_20111007T084734-20111008T023142.txt'
>>> f[9:-4].split('-')
['20111007T084734', '20111008T023142']
or, somewhat more general:
>>> f[f.rfind('_')+1:-4].split('-')
['20111007T084734', '20111008T023142']
