I have a csv file with some text, among others. I want to tokenize (split into a list of words) this text and am having problems with how pd.read_csv interprets escape characters.
My csv file looks like this:
text, number
one line\nother line, 12
and the code is like follows:
df = pd.read_csv('test.csv')
word_tokenize(df.iloc[0,0])
output is:
['one', 'line\\nother', 'line']
while what I want is:
['one', 'line', 'other', 'line']
The problem is pd.read_csv() is not interpreting the \n as a newline character but as two characters (\ and n).
I've tried setting the escapechar argument to '\' and to '\\', but both just remove the backslash from the string without interpreting it as a newline character, i.e. the string becomes one linenother line.
If I explicitly set df.iloc[0,0] = 'one line\nother line', word_tokenize works just fine, because \n is actually interpreted as a newline character this time.
Ideally I would do this by simply changing the way pd.read_csv() interprets the file, but other solutions are also ok.
The question is a bit poorly worded. pandas.read_csv does not interpret escape sequences inside a field, so the text column is read as the literal characters one line\nother line (a backslash followed by n), and the doubled backslash in 'line\\nother' is only how Python displays a literal backslash. That is why nltk.word_tokenize never sees a newline. pandas.read_csv can also only use one separator (or a regex, but I doubt you want that). If you want to further parse and format the column, you could use converters. Here's an example:
import pandas as pd
import re
df = pd.read_csv(
"file.csv", converters={"text":lambda s: re.split("\\\\n| ", s)}
)
The above results in:
text number
0 [one, line, other, line] 12
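The pattern in the converter is easier to read once the escaping is unpacked. The following is a minimal sketch in plain Python (no pandas), assuming the field arrives as the literal characters backslash and n. The variable names are only for illustration.

```python
import re

# The field as read from the file: a literal backslash followed by 'n'.
text = r"one line\nother line"

# "\\\\n| " in a normal string and r"\\n| " in a raw string are the same pattern:
# the regex \\n matches one literal backslash followed by 'n'; "| " adds a space.
pattern_plain = "\\\\n| "
pattern_raw = r"\\n| "
same = pattern_plain == pattern_raw

tokens = re.split(pattern_raw, text)
print(same)    # True
print(tokens)  # ['one', 'line', 'other', 'line']
```

Writing the pattern as a raw string avoids counting backslashes and behaves identically.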
Edit: In case you need to use nltk to do the splitting (say the splitting depends on the language model), you would need to unescape the string before passing on to word_tokenize; try something like this:
lambda s: word_tokenize(s.encode('utf-8').decode('unicode_escape'))
Note: Matching lists in queries is incredibly tricky, so you might want to convert them to tuples by altering the lambda like this:
lambda s: tuple(re.split("\\\\n| ", s))
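To see the unescaping step on its own, here is a small sketch that uses str.split() in place of word_tokenize so that it runs without nltk. The helper name unescape is hypothetical. One caveat: decoding with unicode_escape treats the bytes as Latin-1, so non-ASCII text can come out mangled. If only \n matters, s.replace('\\n', '\n') is a narrower alternative.

```python
# The field as read from the file: backslash + 'n', not a real newline.
raw = r"one line\nother line"

def unescape(s):
    """Turn literal escape sequences such as \\n into the characters they name."""
    return s.encode("utf-8").decode("unicode_escape")

unescaped = unescape(raw)
tokens = unescaped.split()  # split() with no argument splits on any whitespace

print("\n" in raw)        # False
print("\n" in unescaped)  # True
print(tokens)             # ['one', 'line', 'other', 'line']
```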
You can simply try this
import pandas as pd
df = pd.read_csv("test.csv", header=None)
df = df.apply(lambda x: x.str.replace('\\n', " ", regex=False))
print(df.iloc[1, 0])
# output: one line other line
In your case simply use:
data = pd.read_csv('test.csv', sep='\\,', names=['c1', 'c2', 'c3', 'c4'], engine='python')
When I'm splitting a string "abac" I'm getting undesired results.
Example
print("abac".split("a"))
Why does it print:
['', 'b', 'c']
instead of
['b', 'c']
Can anyone explain this behavior and guide me on how to get my desired output?
Thanks in advance.
As @DeepSpace pointed out (referring to the docs):
If sep is given, consecutive delimiters are not grouped together and are deemed to delimit empty strings (for example, '1,,2'.split(',') returns ['1', '', '2']).
Therefore I'd suggest using a better delimiter, such as a comma. If this is the formatting you're stuck with, you could use the built-in filter() function, as suggested in this answer: when passed None as the function, it removes any "empty" strings.
sample = 'abac'
filtered_sample = list(filter(None, sample.split('a')))
print(filtered_sample)
#['b', 'c']
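The quoted rule only applies when a separator is passed. As a quick sketch of the difference, calling split() with no argument groups runs of whitespace and drops empty strings at the edges:

```python
with_sep = "1,,2".split(",")     # consecutive delimiters give an empty string
with_space = "1  2".split(" ")   # same rule when the separator is a space
no_sep = "  1  2 ".split()       # no separator: whitespace runs are grouped

print(with_sep)    # ['1', '', '2']
print(with_space)  # ['1', '', '2']
print(no_sep)      # ['1', '2']
```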
When you split a string in python you keep everything between your delimiters (even when it's an empty string!)
For example, if you had a list of letters separated by commas:
>>> "a,b,c,d".split(',')
['a','b','c','d']
If your list had some missing values you might leave the space in between the commas blank:
>>> "a,b,,d".split(',')
['a','b','','d']
The start and end of the string act as delimiters themselves, so if you have a leading or trailing delimiter you will also get this "empty string" sliced out of your main string:
>>> "a,b,c,d,,".split(',')
['a','b','c','d','','']
>>> ",a,b,c,d".split(',')
['','a','b','c','d']
If you want to get rid of any empty strings in your output, you can use the filter function.
If instead you just want to get rid of this behavior near the edges of your main string, you can strip the delimiters off first:
>>> ",,a,b,c,d".strip(',')
"a,b,c,d"
>>> ",,a,b,c,d".strip(',').split(',')
['a','b','c','d']
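The two clean-up options are not equivalent. A short sketch on a made-up string shows that filtering drops every empty string, while stripping first only removes the ones at the edges:

```python
s = ",a,,b,"

everything = s.split(",")                    # keeps every empty string
no_empties = [p for p in s.split(",") if p]  # drops all empty strings
edges_trimmed = s.strip(",").split(",")      # drops only leading/trailing ones

print(everything)     # ['', 'a', '', 'b', '']
print(no_empties)     # ['a', 'b']
print(edges_trimmed)  # ['a', '', 'b']
```

Which one is appropriate depends on whether an empty field in the middle carries meaning, e.g. a missing value.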
In your example, "a" is what's called a delimiter. It acts as a boundary between the characters before it and after it. So, when you call split, it gets the characters before "a" and after "a" and inserts it into the list. Since there's nothing in front of the first "a" in the string "abac", it returns an empty string and inserts it into the list.
split will return the characters between the delimiters you specify (or between an end of the string and a delimiter), even if there aren't any, in which case it will return an empty string. (See the documentation for more information.)
In this case, if you don't want any empty strings in the output, you can use filter to remove them:
list(filter(lambda s: len(s) > 0, "abac".split("a")))
I have strings as tuples that I'm trying to remove quotation marks from. If there isn't a comma present in the string, then it works. But if there is a comma, then quotation marks still remain:
example = [('7-30-17','0x34','"Upload Complete"'),('7-31-17','0x35','"RCM","Interlock error"')]
example = [(x,y,(z.strip('"')))
for x,y,z in example]
The result is that quotation marks partially remain in the strings that had commas in them. The second tuple now reads RCM","Interlock error as opposed to RCM, Interlock error
('7-30-17','0x34','Upload Complete')
('7-31-17','0x35','RCM","Interlock error')
Any ideas what I'm doing wrong? Thanks!
You can use list comprehension to iterate the list items and similarly for the inner tuple items
>>> [tuple(s.replace('"','') for s in tup) for tup in example]
[('7-30-17', '0x34', 'Upload Complete'), ('7-31-17', '0x35', 'RCM,Interlock error')]
It seems like you're looking for the behaviour of replace(), rather than strip().
Try using replace('"', '') instead of strip('"'). strip only removes characters from the beginning and end of strings, while replace will take care of all occurrences.
Your example would be updated to look like this:
example = [('7-30-17','0x34','"Upload Complete"'),('7-31-17','0x35','"RCM","Interlock error"')]
example = [(x,y,(z.replace('"', '')))
for x,y,z in example]
example ends up with the following value:
[('7-30-17', '0x34', 'Upload Complete'), ('7-31-17', '0x35', 'RCM,Interlock error')]
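The difference between the two methods can be seen on the problematic string alone. This is just an illustrative sketch using the value from the question:

```python
z = '"RCM","Interlock error"'

stripped = z.strip('"')        # only the outermost quotes are removed
replaced = z.replace('"', '')  # every quote is removed

print(stripped)  # RCM","Interlock error
print(replaced)  # RCM,Interlock error
```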
The problem is that strip removes characters only from the ends of the string.
Use a regex to replace ":
import re
example = [('7-30-17','0x34','"Upload Complete"'),('7-31-17','0x35','"RCM","Interlock error"')]
example = [(x,y,(re.sub('"','',z)))
for x,y,z in example]
print(example)
# [('7-30-17', '0x34', 'Upload Complete'), ('7-31-17', '0x35', 'RCM,Interlock error')]
I have this 'file.csv' file to read with pandas:
Title|Tags
T1|"[Tag1,Tag2]"
T1|"[Tag1,Tag2,Tag3]"
T2|"[Tag3,Tag1]"
using
df = pd.read_csv('file.csv', sep='|')
the output is:
Title Tags
0 T1 [Tag1,Tag2]
1 T1 [Tag1,Tag2,Tag3]
2 T2 [Tag3,Tag1]
I know that the column Tags is a full string, since:
In [64]: df['Tags'][0][0]
Out[64]: '['
I need to read it as a list of strings like ["Tag1","Tag2"]. I tried the solution provided in this question but had no luck, since the [ and ] characters mess things up.
The expected output should be:
In [64]: df['Tags'][0][0]
Out[64]: 'Tag1'
You can split the string manually:
>>> df['Tags'] = df.Tags.apply(lambda x: x[1:-1].split(','))
>>> df.Tags[0]
['Tag1', 'Tag2']
You could use the built-in ast.literal_eval; it works for tuples as well as lists.
import ast
import pandas as pd
df = pd.DataFrame({"mytuples": ["(1,2,3)"]})
print(df.iloc[0,0])
# >> '(1,2,3)'
df["mytuples"] = df["mytuples"].apply(ast.literal_eval)
print(df.iloc[0,0])
# >> (1,2,3)
EDIT: eval should be avoided! If the string being evaluated is os.system('rm -rf /'), it will start deleting all the files on your computer. For ast.literal_eval, the string or node provided may only consist of the following Python literal structures: strings, bytes, numbers, tuples, lists, dicts, sets, booleans, and None. Thanks @TrentonMcKinney :)
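A small sketch of what ast.literal_eval accepts and rejects. Note that the unquoted [Tag1,Tag2] from the question is not a Python literal (Tag1 is a bare name), so it is rejected as well; the tags would need to be quoted for this approach to apply.

```python
import ast

# Strings that contain only Python literals are parsed into real objects.
accepted = [ast.literal_eval(s) for s in ["(1,2,3)", '["Tag1","Tag2"]']]

# Bare names and function calls are not literals, so they are refused
# instead of being executed.
rejected = []
for text in ["[Tag1,Tag2]", "__import__('os').getcwd()"]:
    try:
        ast.literal_eval(text)
    except (ValueError, SyntaxError):
        rejected.append(text)

print(accepted)  # [(1, 2, 3), ['Tag1', 'Tag2']]
print(rejected)  # ['[Tag1,Tag2]', "__import__('os').getcwd()"]
```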
Or
df.Tags=df.Tags.str[1:-1].str.split(',').tolist()
I think you could use the json module.
import json
import pandas as pd
df = pd.read_csv('file.csv', sep='|')
df['Tags'] = df['Tags'].apply(lambda x: json.loads(x))
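So this will load your dataframe as before, then apply a lambda function to each of the items in the Tags column. The lambda function calls json.loads(), which converts the string representation of the list to an actual list. Note that this only works if each cell is valid JSON, i.e. the tags are double-quoted as in ["Tag1","Tag2"]; the unquoted [Tag1,Tag2] from the sample file raises json.JSONDecodeError.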
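A quick check of json.loads on individual cell values, without pandas, as a sketch. The flag name unquoted_ok is only for illustration.

```python
import json

# A cell whose tags are double-quoted is valid JSON and parses to a list.
quoted = json.loads('["Tag1","Tag2"]')

# A cell with bare tags, as in the sample file, is not valid JSON.
try:
    json.loads("[Tag1,Tag2]")
    unquoted_ok = True
except json.JSONDecodeError:
    unquoted_ok = False

print(quoted)       # ['Tag1', 'Tag2']
print(unquoted_ok)  # False
```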
You can convert the string to a list using strip and split.
df_out = df.assign(Tags=df.Tags.str.strip('[]').str.split(','))
df_out.Tags[0][0]
Output:
'Tag1'
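The slicing and strip variants do the same thing to a single cell. Here is a sketch on plain strings, including two edge cases that are assumptions about data not shown in the question (spaces after commas, and an empty list):

```python
cell = "[Tag1,Tag2,Tag3]"

by_slice = cell[1:-1].split(",")        # drop first and last character
by_strip = cell.strip("[]").split(",")  # drop any leading/trailing brackets

# If the file had spaces after the commas, each tag would need trimming too.
spaced = [t.strip() for t in "[Tag1, Tag2]".strip("[]").split(",")]

# An empty list in the file yields one empty string, not an empty list.
empty = "[]"[1:-1].split(",")

print(by_slice)  # ['Tag1', 'Tag2', 'Tag3']
print(by_strip)  # ['Tag1', 'Tag2', 'Tag3']
print(spaced)    # ['Tag1', 'Tag2']
print(empty)     # ['']
```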
Your df['Tags'] appears to be a list of strings. If you print that list you should get ["[Tag1,Tag2]","[Tag1,Tag2,Tag3]","[Tag3,Tag1]"]. This is why, when you take the first element of the first element, you actually get the first character of the string rather than what you want.
You need to parse that string afterward, with something like
df['Tags'][0] = df['Tags'][0].split(',')
But as you saw in your cited example this will give you a list that looks like
in: df['Tags'][0][0]
out: '[Tag1'
What you need is a way to parse the string while dropping several different characters. You can use a simple regular expression to do this. Something like:
import re
df['Tags'][0] = re.findall(r"[\w']+", df['Tags'][0])
print(df['Tags'][0][0])
will print:
Tag1
Using the other answer involving Pandas converters you might write a converter like this:
def clean(seq_string):
    return re.findall(r"[\w']+", seq_string)
Regular expressions can be quite powerful, but also unpredictable if you're not sure about the content of your input strings. The expression used here, r"[\w']+", matches runs of word characters (letters, digits, and underscores) plus apostrophes; everything else acts as a boundary between the items re.findall returns.
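To make the "unpredictable" part concrete, here is a sketch of the same pattern on the sample data and on a made-up cell containing a space and a hyphen (the second input is an assumption, not from the question):

```python
import re

pattern = r"[\w']+"

clean_case = re.findall(pattern, "[Tag1,Tag2,Tag3]")
# Spaces and hyphens are not word characters, so they also split the tags.
messy_case = re.findall(pattern, "[New York,Tag-2]")

print(clean_case)  # ['Tag1', 'Tag2', 'Tag3']
print(messy_case)  # ['New', 'York', 'Tag', '2']
```

If tags may contain such characters, splitting on the comma after removing the brackets is the safer choice.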
I need to turn the input_string into the comment below using a for loop. First I sliced it using the split() function, but now I need to somehow turn the input string into ['result1', 'result2', 'result3', 'result5']. I tried replacing the .xls and the dash for nothing (''), but the string output is unchanged. Please don't import anything, I'm trying to do this with functions and loops only.
input_string = "01-result.xls,2-result.xls,03-result.xls,05-result.xls"
# Must be turned into ['result1','result2', 'result3', 'result5']
splitted = input_string.split(',')
for c in ['.xls', '-', '0']:
    if c in splitted:
        splitted = splitted.replace(splitted, 'c', '')
When I type splitted, the output is ['01-result.xls', '2-result.xls', '03-result.xls', '05-result.xls'] therefore nothing is happening.
Use the re module's sub function and split.
>>> input_string = "01-result.xls,2-result.xls,03-result.xls,05-result.xls"
>>> import re
>>> re.sub(r'(\d+)-(\w+)\.xls',r'\2\1',input_string)
'result01,result2,result03,result05'
>>> re.sub(r'(\d+)-(\w+)\.xls',r'\2\1',input_string).split(',')
['result01', 'result2', 'result03', 'result05']
Using no imports, you can use a list comprehension
>>> [''.join(x.split('.')[0].split('-')[::-1]) for x in input_string.split(',')]
['result01', 'result2', 'result03', 'result05']
The algo here is, we loop through the string after splitting it on ,. Now we split the individual words on . and the first element of these on -. We now have the number and the words, which we can easily join.
Complete explanation of the list comp answer -
To understand what a list comprehension is, read What does "list comprehension" mean? How does it work and how can I use it?
Coming to the answer,
Splitting the input list on ,, gives us the list of individual file names
>>> input_string.split(',')
['01-result.xls', '2-result.xls', '03-result.xls', '05-result.xls']
Now using the list comprehension construct, we can iterate through this,
>>> [i for i in input_string.split(',')]
['01-result.xls', '2-result.xls', '03-result.xls', '05-result.xls']
As we need only the file name and not the extension, we split by using . and take the first value.
>>> [i.split('.')[0] for i in input_string.split(',')]
['01-result', '2-result', '03-result', '05-result']
Now again, what we need is the number and the name as two parts. So we again split by -
>>> [i.split('.')[0].split('-') for i in input_string.split(',')]
[['01', 'result'], ['2', 'result'], ['03', 'result'], ['05', 'result']]
Now we have the [number, name] in a list. However, the format that we need is "namenumber". Hence we have two options:
Concat them like i.split('.')[0].split('-')[1]+i.split('.')[0].split('-')[0]. This is an unnecessarily long way
Reverse them and join. We can use slices to reverse a list (See How can I reverse a list in python?) and str.join to join like ''.join(x.split('.')[0].split('-')[::-1]).
So we get our final list comprehension
>>> [''.join(x.split('.')[0].split('-')[::-1]) for x in input_string.split(',')]
['result01', 'result2', 'result03', 'result05']
Here's a solution using list comprehension and string manipulation if you don't want to use re.
input_string = "01-result.xls,2-result.xls,03-result.xls,05-result.xls"
# Must be turned into ['result1','result2', 'result3', 'result5']
splitted = input_string.split(',')
#Remove extension, then split by hyphen, switch the two values,
#and combine them into the result string
print(["".join(i.split(".")[0].split("-")[::-1]) for i in splitted])
#Output
#['result01', 'result2', 'result03', 'result05']
The way this list comprehension works is:
Take the list of results and remove the ".xls". i.split(".")[0]
Split on the - and switch positions of the number and "result". .split("-")[::-1]
For every item in the list, join the list into a string. "".join()
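The answers above produce 'result01' rather than the 'result1' shown in the question's comment. Below is a minimal sketch with a plain for loop and no imports that also drops the leading zeros. It assumes every name has the form number-word.xls. The original loop changed nothing because c in splitted tests whether the list contains an element equal to c, and replace is a method of strings, not lists, so it has to be applied to each item.

```python
input_string = "01-result.xls,2-result.xls,03-result.xls,05-result.xls"

results = []
for item in input_string.split(","):
    stem = item.replace(".xls", "")    # e.g. '01-result'
    number, word = stem.split("-")     # e.g. '01', 'result'
    # lstrip("0") removes leading zeros only: '01' -> '1', '2' -> '2'
    results.append(word + number.lstrip("0"))

print(results)  # ['result1', 'result2', 'result3', 'result5']
```

One caveat: a number made only of zeros would become an empty string with lstrip("0"); str(int(number)) is an alternative that keeps a single '0'.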