How to remove nonprintable characters in csv file? [closed]

I have some invalid characters in my file that I'm trying to remove. But I ran into a strange problem with one of them.
When I try to use the replace function, I get the error SyntaxError: EOL while scanning string literal.
I found that I was dealing with \x1d which is a group separator. I have this code to remove it:
import pandas as pd
df = pd.read_csv('C:/Users/tkp/Desktop/Holdings_Download/dws/example.csv',index_col=False, sep=';', encoding='utf-8')
print(df['col'][0])
df = df['col'][0].encode("utf-8").replace(b"\x1d", b"").decode()
df = pd.DataFrame([x.split(';') for x in df.split('\n')])
print(df[0][0])
Output:
Is there another way to do this? It seems to me that I couldn't have done it any worse than this.

Notice that you are getting a SyntaxError. This means that Python never gets as far as actually running your program, because it can't figure out what the program is!
To be honest, I'm not quite sure why this happens in this case, but using "exotic" characters in string constants is always a bit iffy, because it makes you dependent on the character encoding of the source code and puts you at the mercy of all sorts of buggy editors. Therefore, I would recommend using the '\uXXXX' syntax to explicitly write the Unicode code point of the character you wish to replace. (It looks like what you have here is U+2194 LEFT RIGHT ARROW, so '\u2194' should do it.)
Having said that, I would first verify that this is actually the problem, by changing the '↔' bit to something more mundane, like 'x' and seeing whether that causes the same error. If it does, then your problem is somewhere else...
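To make the point about escapes concrete, here is a small Python 3 sketch. The sample string is made up for illustration; it is not the asker's data. It writes both characters as escapes, so the source file contains only plain ASCII, and uses the standard unicodedata module to check what each character actually is.

```python
import unicodedata

# Hypothetical sample text containing both characters discussed above,
# written as escapes so the source code itself stays plain ASCII.
text = "left\u2194right\x1dend"

for ch in ("\u2194", "\x1d"):
    # Control characters have no Unicode name, so supply a default.
    name = unicodedata.name(ch, "<unnamed control character>")
    print(f"U+{ord(ch):04X} category={unicodedata.category(ch)} {name}")

# Remove both characters using escapes rather than literal characters.
cleaned = text.replace("\u2194", "").replace("\x1d", "")
print(cleaned)  # leftrightend
```

If the first line printed for your character shows category Cc, you are dealing with a control character such as \x1d rather than a visible symbol.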

You have to specify the encoding when you read the file (as your read_csv call already does); replace itself has no encoding parameter. After that, a replace on the DataFrame is enough:
df = df.replace('\x1d', '', regex=True)
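As another way to do this, one option is to strip control characters from the raw text before parsing it. The following is only a sketch under some assumptions: the sample data and the helper name strip_control_chars are made up, and it uses the standard csv module so that it runs without extra packages. The cleaned text could equally be handed to pd.read_csv through io.StringIO.

```python
import csv
import io
import re

# ASCII control characters except tab, newline and carriage return.
CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")


def strip_control_chars(text):
    # Hypothetical helper: drops characters such as the group separator.
    return CONTROL_CHARS.sub("", text)


# Made-up sample standing in for the contents of example.csv.
raw = "col;other\nab\x1dc;1\n"

cleaned = strip_control_chars(raw)
rows = list(csv.reader(io.StringIO(cleaned), delimiter=";"))
print(rows)  # [['col', 'other'], ['abc', '1']]
```

Cleaning the whole text once avoids the encode/replace/decode round trip on individual cells.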

Related

Search and Replace a word within a word in Python. Replace() method not working [closed]

How do I search and replace using built-in Python methods?
For instance, with a string of appleorangegrapes (yes all of them joined),
Replace "apple" with "mango".
The .replace method only works if the words are evenly spaced out but not if they are combined as one. Is there a way around this?
I searched the web but again the .replace method only gives me an example if they are spaced out.
Thank you for looking at the problem!

This works exactly as expected and advertised. Have a look:
s = 'appleorangegrapes'
print(s) # -> appleorangegrapes
s = s.replace('apple', 'mango')
print(s) # -> mangoorangegrapes
The only thing that you have to be careful of is that replace is not an in-place operator and as such it does not update s automatically; it only creates a new string that you have to assign to something.
s = 'appleorangegrapes'
s.replace('apple', 'mango') # the change is made but not saved
print(s) # -> appleorangegrapes

replace works on any string. Why do you think that it doesn't? Here is a test:
>>> s='appleorangegrapes'
>>> s.replace('apple','mango')
'mangoorangegrapes'
>>>
Don't you see that you received your expected result?

Use u'string' on string stored as variable in Python [closed]

As a French user of Python 2.7, I'm trying to properly print strings containing accents such as "é", "è", "à", etc. in the Python console.
I already know the trick of using u before the explicit value of a string, such as:
print(u'Université')
which properly prints the last character.
Now, my question is: how can I do the same for a string that is stored as a variable?
Indeed, I know that I could do the following:
mystring = u'Université'
print(mystring)
but the problem is that the value of mystring is bound to be passed into a SQL query (using psycopg2), and therefore I can't afford to store the u inside the value of mystring.
So how could I do something like
"print the unicode value of mystring" ?

The u sigil is not part of the value, it's just a type indicator. To convert a string into a Unicode string, you need to know the encoding.
unicodestring = mystring.decode('utf-8') # or 'latin-1' or ... whatever
and to print it you typically (in Python 2) need to convert back to whatever the system accepts on the output filehandle:
print(unicodestring.encode('utf-8')) # or 'latin-1' or ... whatever
Python 3 clarifies (though not directly simplifies) the situation by keeping Unicode strings and (what is now called) bytes objects separate.
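For comparison, here is a minimal Python 3 sketch of that separation. It is an illustration only, not part of the original Python 2.7 setup. The text is a str, and encoding it yields a distinct bytes object; there is no u prefix to carry around.

```python
# In Python 3 every string literal is already Unicode text (type str).
# \u00e9 is the escape for the accented e in the example word.
text = "Universit\u00e9"

# Encoding produces a separate bytes object; decoding reverses it.
encoded = text.encode("utf-8")

print(type(text).__name__, len(text))        # str 10
print(type(encoded).__name__, len(encoded))  # bytes 11
print(encoded.decode("utf-8") == text)       # True
```

The accented character is one character of text but two bytes in UTF-8, which is why the two lengths differ.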

re.findall working in console but not in script? [closed]

I'm probably missing something very basic here, but here goes:
I'm using Python 2.7 and regex to identify digits within a string.
In the console, I type in:
>>> newstr = 'NukeNews/File_132.txt'
>>> int(re.findall(r'\d+',newstr)[0])
132
Which is what I expect.
However, in the script I'm running, I have the strings stored in a dictionary, linedict. I'm running this script:
news_id=[]
for line in line_vec:
    print linedict[line]
    newstr= linedict[line]
    id_int = re.findall('r\d+',newstr)
    print id_int
    news_id.append(id_int)
It's a long list, but the output looks like:
NukeNews/File_132.txt
[]
So - the correct string is registered, but it's not matching on anything.
I was calling the first item in the list earlier (to match the console input of int(re.findall(r'\d+',newstr)[0])), but the script is telling me that the regex didn't find any instances of the digits in the string. I would expect this to return:
NukeNews/File_132.txt
['132']
Any idea why it's not working as expected? When I try running re.match(r'/d+',newstr) I also get an empty group (following the groups example on https://docs.python.org/2/library/re.html).
Edit: As pointed out, this is a case of not being careful with 'r' and r'*'. I'm just going to leave this up in case anyone else googling "why does my regex work in console but not in script" forgets to check this typo, like I did.

You've got your r inside the quotes so instead of getting a "raw string" you're getting a string with an 'r' in it ...
id_int = re.findall('r\d+',newstr)
#                    ^
# should be:
id_int = re.findall(r'\d+',newstr)
Your "console" version also takes only the first of the found matches, whereas your "script" version appends the entire list.
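Here is a short Python 3 sketch of the difference, using the string from the question. The misplaced r is written as "r\\d+" here, which is the same pattern as 'r\d+' but avoids an invalid-escape warning in recent Python versions.

```python
import re

newstr = "NukeNews/File_132.txt"

# r inside the quotes: the pattern means a literal "r" followed by digits.
in_quotes = re.findall("r\\d+", newstr)

# r before the quotes: a raw string, so the pattern is just one or more digits.
raw = re.findall(r"\d+", newstr)

print(in_quotes)  # []
print(raw)        # ['132']

# Take the first match as an int, as the console version did,
# guarding against the case where nothing matched.
news_id = int(raw[0]) if raw else None
print(news_id)  # 132
```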

Biopython - String assigning error [closed]

I tried to get started with Biopython so that I can do my thesis with it, but this really makes me think twice. The features show up as missing; when I try an integer value it does not work, and the same is the case with a string. Kindly help. Thank you.
Link:
http://imgur.com/87Gw9E5

Biopython seems pretty robust to me; the errors are probably due to your inexperience with it.
You have several errors. One of them is that you forgot to close the string literals with a closing quote before the comma. The following lines
print "location start, features[ftNum].location.start # note location.start"
print "feature qualifiers,features[ftNum].qualifiers"
should be corrected to
print "location start", features[ftNum].location.start # note location.start
print "feature qualifiers", features[ftNum].qualifiers
Furthermore, as Wooble pointed out, the condition in your while loop is wrong. I'm guessing you meant to invert the ">", that is, the number of features should be greater than zero.
Please add some example data and error messages.

The guys at Biopython actually made it easy to deal with the features. Your problem is string management (plain python). I've used format, but you can use the % operator.
Also in python you rarely have to keep the count when looping. Python is not C.
from Bio import SeqIO
for record in SeqIO.parse("NG_009616.gb", "genbank"):
    # You don't have to take care of the number of features with a while
    # Loop all of them.
    for feature in record.features:
        print "Attributes of feature"
        print "Type {0}".format(feature.type)
        print "Start {0}".format(feature.location.start)
        print "End {0}".format(feature.location.end)
        print "Qualifiers {0}".format(feature.qualifiers)
        # This is the right way to extract the sequence:
        print "Sequence {0}".format(feature.location.extract(record).seq)
        print "Sub-features {0}".format(feature.sub_features)

Python -- Morse Code Translation through a binary tree [closed]

I'm writing a program that will create a binary tree of the Morse Code alphabet (as well as a period and an apostrophe), and which will then read a line of Morse Code and translate it into English. (Yes, I know that a look-up table would be easier, but I need to sort out my binary trees). I think a good bit of my problem is that I want to put the values into the tree in alphabetical order, rather than by symbol order. But surely there must be a way to do that? Because if I had a million such values that weren't numeric, I wouldn't need to sort them into the simplest order for insertion...right?
It's reading from a text file where each line has one sentence in Morse Code.
- .... .. ...  .. ...  ..-. ..- -. .-.-.- for example, which is "This is fun."
1 space between symbols means it's a new letter, 2 spaces means it's a new word.
As it stands, I'm getting the output ".$$$" for that line given above, which means it's reading a period and then getting an error which is symbolized by ('$$$'), which is obviously wrong...
Like I said before, I know I'm being complicated, but surely there's a way to do this without sorting the values in my tree first, and I'd like to figure this out now, rather than when I'm in a time crunch.
Does anyone have any insight? Is this something so horribly obvious that I should be embarrassed for asking about it?

Welcome to SO and thanks for an interesting question. Yes, it looks to me like you're overcomplicating things a bit. For example, there's absolutely no need to use classes here. You can reuse existing python data structures to represent a tree:
def add(node, value, code):
    if code:
        add(node.setdefault(code[0], {}), value, code[1:])
    else:
        node['value'] = value
tree = {}
for value, code in alphabet:
    add(tree, value, code)
import pprint; pprint.pprint(tree)
This gives you a nested dict with keys ., -, and value which will be easier to work with.
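To show how such a nested dict could be used for the translation itself, here is a rough Python 3 sketch. Several things in it are assumptions rather than part of the answer above: alphabet is taken to be a list of (character, code) pairs and only a small subset is listed, decode_symbol and decode_line are hypothetical helper names, and '$$$' is reused as the error marker from the question.

```python
# Assumed shape of alphabet: (character, morse code) pairs, in any order.
# Only the symbols needed for the sample sentence are listed here.
alphabet = [("t", "-"), ("h", "...."), ("i", ".."), ("s", "..."),
            ("f", "..-."), ("u", "..-"), ("n", "-."), (".", ".-.-.-")]


def add(node, value, code):
    # Walk or create one level per dot/dash, then store the character.
    if code:
        add(node.setdefault(code[0], {}), value, code[1:])
    else:
        node["value"] = value


def decode_symbol(tree, code):
    # Follow the dots and dashes down the tree; '$$$' marks an unknown code.
    node = tree
    for mark in code:
        node = node.get(mark)
        if node is None:
            return "$$$"
    return node.get("value", "$$$")


def decode_line(tree, line):
    # One space separates letters, two spaces separate words.
    words = line.strip().split("  ")
    return " ".join(
        "".join(decode_symbol(tree, code) for code in word.split(" "))
        for word in words
    )


tree = {}
for value, code in alphabet:
    add(tree, value, code)

print(decode_line(tree, "- .... .. ...  .. ...  ..-. ..- -. .-.-.-"))
# this is fun.
```

Because each code is a path from the root, the order in which the characters are inserted does not matter, so alphabetical order works fine.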
