I have a csv file with some text in it, among other columns. I want to tokenize (split into a list of words) this text and am having problems with how pd.read_csv interprets escape characters.
My csv file looks like this:
text, number
one line\nother line, 12
and the code is like follows:
df = pd.read_csv('test.csv')
word_tokenize(df.iloc[0,0])
output is:
['one', 'line\\nother', 'line']
while what I want is:
['one', 'line', 'other', 'line']
The problem is pd.read_csv() is not interpreting the \n as a newline character but as two characters (\ and n).
I've tried setting the escapechar argument to '\' and to '\\', but both just remove the slash from the string without interpreting a newline character, i.e. the string becomes 'one linenother line'.
If I explicitly set df.iloc[0,0] = 'one line\nother line', word_tokenize works just fine, because \n is actually interpreted as a newline character this time.
Ideally I would do this simply changing the way pd.read_csv() interprets the file, but other solutions are also ok.
The question is a bit poorly worded. I guess the literal backslash that pandas keeps in the string is what's confusing nltk.word_tokenize. pandas.read_csv can only use one separator (or a regex, but I doubt you want that), so it will always read the text column as "one line\nother line", with a literal backslash and n rather than an actual newline. If you want to further parse and format it, you could use converters. Here's an example:
import pandas as pd
import re
df = pd.read_csv(
"file.csv", converters={"text":lambda s: re.split("\\\\n| ", s)}
)
The above results in:
                       text  number
0  [one, line, other, line]      12
Edit: In case you need to use nltk to do the splitting (say the splitting depends on the language model), you would need to unescape the string before passing it on to word_tokenize; try something like this:
lambda s: word_tokenize(s.encode('utf-8').decode('unicode_escape'))
Note: Matching lists in queries is incredibly tricky, so you might want to convert them to tuples by altering the lambda like this:
lambda s: tuple(re.split("\\\\n| ", s))
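Putting the converter and the unescaping together, a minimal sketch (assuming nltk and its punkt tokenizer data are installed, and that the column is named text as in the example above):
import pandas as pd
from nltk.tokenize import word_tokenize

# Unescape the literal "\n" first, then tokenize inside the converter
df = pd.read_csv(
    "file.csv",
    converters={"text": lambda s: word_tokenize(s.encode("utf-8").decode("unicode_escape"))},
)
print(df.iloc[0, 0])  # expected: ['one', 'line', 'other', 'line']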
You can simply try this
import pandas as pd
df = pd.read_csv("test.csv", header=None)
df = df.apply(lambda x: x.str.replace(r'\n', ' ', regex=False))  # replace the literal backslash-n with a space
print(df.iloc[1, 0])
# output: one line other line
In your case simply use:
data = pd.read_csv('test.csv', sep='\\,', names=['c1', 'c2', 'c3', 'c4'], engine='python')
I am zipping two lists into a dictionary using dict(zip(list1, list2))
list2 contains escape characters generated through:
import pandas as pd
import re
data = pd.read_table(file, sep = '\t', usecols = ['list1', 'list2'], error_bad_lines=False)
data['list2'] = data['list2'].map(re.escape)
data
list1 list2
0 c100001_g1_i1 mRNA::jcf7190000025784:336550\-338439\(\-\)
1 c100003_g1_i1 mRNA::jcf7190000164685:24994\-31705\(\+\)
When I attempt to create a dictionary, extraneous escape characters are introduced:
data_dict = dict(zip(data.list1, data.list2))
data_dict
{'c99999_g2_i2': 'mRNA::jcf7190000086075:207510\\-229401\\(\\-\\)', 'c99999_g2_i3': 'mRNA::jcf7190000086075:207510\\-229401\\(\\-\\)'}
How can I get the extraneous escape characters to stop being introduced? What am I doing that is adding these extra escape characters to the dictionary?
It's just the way that dictionaries are printed that gives that behaviour. To see the real output, print each individual key and value and check whether it still "adds" extra backslashes.
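A quick check, assuming the data_dict built above:
# The doubled backslashes only show up in the dict's repr;
# printing a value directly shows single backslashes.
for key, value in data_dict.items():
    print(key, value)
# e.g. c99999_g2_i2 mRNA::jcf7190000086075:207510\-229401\(\-\)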
I'm using the code below to remove special characters and punctuation from a column in a pandas dataframe. But this method of using re.sub is not time efficient. Are there other options I could try to get better time efficiency while removing punctuation and special characters? Or is the way I'm removing special characters and writing them back to the column of the pandas dataframe causing the major computational burden?
import re, string  # string.punctuation supplies the ASCII punctuation characters
for n, text in data['text'].iteritems():
    data.loc[n, 'text'] = re.sub(f"([{string.punctuation}“”¨«»®´·º½¾¿¡§£₤‘’])", '', text)
One way would be to keep only alphanumeric characters. Consider this dataframe:
df = pd.DataFrame({'Text': ['#^#346fetvx#!.,;:', 'fhfgd54#!#><?']})
Text
0 #^#346fetvx#!.,;:
1 fhfgd54#!#><?
You can use str.extract to pull out the first run of word characters:
df['Text'] = df['Text'].str.extract(r'(\w+)', expand=False)
Text
0 346fetvx
1 fhfgd54
Use a regex with a lambda function:
import re
data['PROD_NAME'] = data['PROD_NAME'].apply(lambda x: re.sub('[^A-Za-z0-9]', ' ', x))
This replaces every character except letters and digits with a space.
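Since the question is about speed, here is a rough sketch of the same substitution done with pandas' vectorized .str.replace instead of a Python-level loop, which is usually faster on large columns (the column name data['text'] is taken from the question above):
# Vectorized regex substitution over the whole column at once
data['text'] = data['text'].str.replace(r'[^A-Za-z0-9]', ' ', regex=True)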
I am dealing with a type of ASCII file where there are effectively 4 columns of data and each row is assigned to a line in the file. Below is an example of a row of data from this file:
'STOP.F 11966.0000:STOP DEPTH'
The data is always structured so that the delimiter between the first and second column is a period, the delimiter between the second and third column is a space and the delimiter between the third and fourth column is a colon.
Ideally, I would like to find a way to return the following result from the string above
['STOP', 'F', '11966.0000', 'STOP DEPTH']
I tried using a regular expression with the period, space and colon as delimiters, but it breaks down (see the example below) because I don't know how to specify the order in which to split the string, and I don't know if there is a way to limit the number of splits per delimiter within the regular expression itself. I want it to split on the delimiters in that specific order, each delimiter at most once.
import re
line = 'STOP.F 11966.0000:STOP DEPTH'
re.split("[. :]", line)
>>> ['STOP', 'F', '11966', '0000', 'STOP', 'DEPTH']
Any suggestions on a tidy way to do this?
This may work. Credit to Juan
import re
pattern = re.compile(r'^(.+)\.(.+) (.+):(.+)$')
line = 'STOP.F 11966.0000:STOP DEPTH'
pattern.search(line).groups()
Out[6]: ('STOP', 'F', '11966.0000', 'STOP DEPTH')
re.split() solution with a specific regex pattern. Note that the variable-width lookbehinds it uses are not supported by Python's built-in re module (which only allows fixed-width lookbehinds), so this needs the third-party regex package:
import regex as re  # pip install regex; needed for the variable-width lookbehinds
s = 'STOP.F 11966.0000:STOP DEPTH'
result = re.split(r'(?<=^[^.]+)\.|(?<=^[^ ]+) |:', s)
print(result)
The output:
['STOP', 'F', '11966.0000', 'STOP DEPTH']
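For comparison, a plain str.split chain with maxsplit=1 (no regex) also yields the four fields by splitting once on each delimiter in order; a minimal sketch:
line = 'STOP.F 11966.0000:STOP DEPTH'
first, rest = line.split('.', 1)    # split once on the first period
second, rest = rest.split(' ', 1)   # split once on the first space
third, fourth = rest.split(':', 1)  # split once on the first colon
print([first, second, third, fourth])
# ['STOP', 'F', '11966.0000', 'STOP DEPTH']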
So I currently have this:
s = final_df['Column Name'].str.split(';').apply(pd.Series, 1).stack()
which splits a row when it finds the ; delimiter. However, I will not always have the semicolon as my delimiter. Is there a way to incorporate re.split or other delimiters into str.split? Basically, the delimiter could be ':', ';' or '|', but I won't know which.
I tried to just do split(';', '|') but I knew that wouldn't work.
str.split accepts a regex pattern just like re.split does, so you don't need the latter. The following should do:
s = final_df['Column Name'].str.split(r'[;:|]').apply(pd.Series, 1).stack()
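A quick sanity check of that pattern on a small made-up Series (the sample values here are just for illustration):
import pandas as pd
s = pd.Series(['a;b', 'c:d', 'e|f'])
print(s.str.split(r'[;:|]').tolist())
# [['a', 'b'], ['c', 'd'], ['e', 'f']]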
If the starting file contains those delimiters, you could actually provide the regular expression pattern to the sep parameter of the read_table function and set its engine parameter to "python". The following uses the io module and a random string to illustrate the point:
import io
import pandas as pd
mystring = u"hello:world|123;here|we;go,again"
with io.StringIO(mystring) as f:
    df = pd.read_table(f, sep=r"[;:|,]", engine="python", header=None)
df
# 0 1 2 3 4 5 6
# 0 hello world 123 here we go again
This one splits on :, ;, | and ,.
I hope this proves useful.
I have this 'file.csv' file to read with pandas:
Title|Tags
T1|"[Tag1,Tag2]"
T1|"[Tag1,Tag2,Tag3]"
T2|"[Tag3,Tag1]"
using
df = pd.read_csv('file.csv', sep='|')
the output is:
Title Tags
0 T1 [Tag1,Tag2]
1 T1 [Tag1,Tag2,Tag3]
2 T2 [Tag3,Tag1]
I know that the column Tags is a full string, since:
In [64]: df['Tags'][0][0]
Out[64]: '['
I need to read it as a list of strings like ["Tag1","Tag2"]. I tried the solution provided in this question but no luck there, since I have the [ and ] characters that actually mess things up.
The expecting output should be:
In [64]: df['Tags'][0][0]
Out[64]: 'Tag1'
You can split the string manually:
>>> df['Tags'] = df.Tags.apply(lambda x: x[1:-1].split(','))
>>> df.Tags[0]
['Tag1', 'Tag2']
You could use the built-in ast.literal_eval; it works for tuples as well as lists:
import ast
import pandas as pd
df = pd.DataFrame({"mytuples": ["(1,2,3)"]})
print(df.iloc[0,0])
# >> '(1,2,3)'
df["mytuples"] = df["mytuples"].apply(ast.literal_eval)
print(df.iloc[0,0])
# >> (1,2,3)
EDIT: eval should be avoided! If the string being evaluated is os.system('rm -rf /') it will start deleting all the files on your computer (here). For ast.literal_eval the string or node provided may only consist of the following Python literal structures: strings, bytes, numbers, tuples, lists, dicts, sets, booleans, and None (here). Thanks @TrentonMcKinney :)
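Applied at read time via converters, a minimal sketch (note this assumes the list elements are valid Python literals, e.g. quoted strings or numbers; the unquoted [Tag1,Tag2] from the question above would raise a ValueError):
import ast
import pandas as pd

# Parse the stringified lists while reading; each element must be a valid Python literal
df = pd.read_csv('file.csv', sep='|', converters={'Tags': ast.literal_eval})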
Or
df.Tags = df.Tags.str[1:-1].str.split(',').tolist()
I think you could use the json module.
import json
import pandas as pd
df = pd.read_csv('file.csv', sep='|')
df['Tags'] = df['Tags'].apply(lambda x: json.loads(x))
So this will load your dataframe as before, then apply a lambda function to each of the items in the Tags column. The lambda function calls json.loads(), which converts the string representation of the list to an actual list. Note that this only works if the list elements are valid JSON, i.e. double-quoted strings or numbers; unquoted tags like [Tag1,Tag2] from the example would raise a JSONDecodeError.
You can convert the string to a list using strip and split.
df_out = df.assign(Tags=df.Tags.str.strip('[]').str.split(','))
df_out.Tags[0][0]
Output:
'Tag1'
Your df['Tags'] is effectively a column of strings. If you print it you should get something like ["[Tag1,Tag2]", "[Tag1,Tag2,Tag3]", "[Tag3,Tag1]"]; this is why, when you ask for the first element of the first element, you actually get the first single character of the string rather than what you want.
One option is to parse that string afterward, doing something like:
df['Tags'][0] = df['Tags'][0].split(',')
But as you saw in your cited example this will give you a list that looks like
in: df['Tags'][0][0]
out: '[Tag1'
What you need is a way to parse the string while editing out multiple characters. You can use a simple regular expression to do this. Something like:
import re
df['Tags'][0] = re.findall(r"[\w']+", df['Tags'][0])
print(df['Tags'][0][0])
will print:
'Tag1'
Using the other answer involving Pandas converters you might write a converter like this:
def clean(seq_string):
return re.findall(r"[\w']+", seq_string)
If you don't know regular expressions, they can be quite powerful, but also unpredictable if you're not sure of the content of your input strings. The expression used here, r"[\w']+", matches runs of word characters (alphanumerics and underscores) plus the apostrophe, and treats everything else as a boundary, so re.findall effectively splits the string at those points.
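A minimal usage sketch of that converter, assuming the file.csv with the Title|Tags layout shown above:
import re
import pandas as pd

def clean(seq_string):
    # keep runs of word characters (and apostrophes), dropping brackets and commas
    return re.findall(r"[\w']+", seq_string)

df = pd.read_csv('file.csv', sep='|', converters={'Tags': clean})
print(df['Tags'][0][0])
# Tag1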