I'm having a hard time trying to encode a python list, I already did it with a text file in order to count specific words inside it, using re module.
This is the code:
# encoding text file
with codecs.open('projectsinline.txt', 'r', encoding="utf-8") as f:
for line in f:
# Using re module to extract specific words
unicode_pattern = re.compile(r'\b\w{4,20}\b', re.UNICODE)
result = unicode_pattern.findall(line)
word_counts = Counter(result) # It creates a dictionary key and wordCount
Allwords = []
for clave in word_counts:
if word_counts[clave] >= 10: # We look for the most repeated words
word = clave
Allwords.append(word)
print Allwords
Part of the output looks like this:
[...u'recursos', u'Partidos', u'Constituci\xf3n', u'veh\xedculos', u'investigaci\xf3n', u'Pol\xedticos']
If I print variable word the output looks as it should be. However, when I use append, all the words breaks again, as the example before.
I use this example:
[x.encode("utf-8") for x in Allwords]
The output looks exactly the same as before.
I also use this example:
Allwords.append(str(word.encode("utf-8")))
The output change, but the words don't look as they should be:
[...'recursos', 'Partidos', 'Constituci\xc3\xb3n', 'veh\xc3\xadculos', 'investigaci\xc3\xb3n', 'Pol\xc3\xadticos']
Some of the answers have given this example:
print('[' + ', '.join(Allwords) + ']')
The output looks like this:
[...recursos, Partidos, Constitución, vehÃculos, investigación, PolÃticos]
To be honest I do not want to print the list, just encode it, so that all items (words) are recognized.
I'm looking for something like this:
[...'recursos', 'Partidos', 'Constitución', 'vehículos', 'investigación', 'Políticos']
Any suggestions to solve the problem are appreciated
Thanks,
you might what to try
print('[' + ', '.join(Allwords) + ']')
Your Unicode string list is correct. When you print lists the items in the list display as their repr() function. When you print the items themselves, the items display as their str() function. It is only a display option, similar to printing integers as decimal or hexadecimal.
So print the individual words if you want to see them correctly, but for comparisons the content is correct.
It's worth noting that Python 3 changes the behavior of repr() and now will display non-ASCII characters without escape codes if the terminal supports them directly and the ascii() function reproduces the Python 2 repr() behavior.
Related
I'm writing a program which jumbles clauses within a text using punctuation marks as delimiters for when to split the text.
At the moment my code has a large list where each item is a group of clauses.
import re
from random import shuffle
clause_split_content = []
text = ["this, is. a test?", "this: is; also. a test!"]
for i in text:
clause_split = re.split('[,;:".?!]', i)
clause_split.remove(clause_split[len(clause_split)-1])
for x in range(0, len(clause_split)):
clause_split_content.append(clause_split[x])
shuffle(clause_split_content)
print(*content, sep='')
at the moment the result jumbles the text without retaining the punctuation which is used as the delimiter to split it.
The output would be something like this:
a test this also this is a test is
I want to retain the punctuation within the final output so it would look something like this:
a test! this, also. this: is. a test? is;
I think you are simply using the wrong function of re for your purpose. split() excludes your separator, but you can use another function e.g. findall() to manually select all words you want. For example with the following code I can create your desired output:
import re
from random import shuffle
clause_split_content = []
text = ["this, is. a test?", "this: is; also. a test!"]
for i in text:
words_with_seperator = re.findall(r'([^,;:".?!]*[,;:".?!])\s?', i)
clause_split_content.extend(words_with_seperator)
shuffle(clause_split_content)
print(*clause_split_content, sep=' ')
Output:
this, this: is. also. a test! a test? is;
The pattern ([^,;:".?!]*[,;:".?!])\s? simply takes all characters that are not a separator until a separator is seen. These characters are all in the matching group, which creates your result. The \s? is only to get rid of the space characters in between the words.
Here's a way to do what you've asked:
import re
from random import shuffle
text = ["this, is. a test?", "this: is; also. a test!"]
content = [y for x in text for y in re.findall(r'([^,;:".?!]*[,;:".?!])', x)]
shuffle(content)
print(*content, sep=' ')
Output:
is; is. also. a test? this, a test! this:
Explanation:
the regex pattern r'([^,;:".?!]*[,;:".?!])' matches 0 or more non-separator characters followed by a separator character, and findall() returns a list of all such non-overlapping matches
the list comprehension iterates over the input strings in list text and has an inner loop that iterates over the findall results for each input string, so that we create a single list of every matched pattern within every string.
shuffle and print are as in your original code.
I'm exploring python and want to iterate through a string
mystring = '''1,"abc","abcde",1,2,3
2,"zzz","zzzde",1,2,5
3,"xyz","xyzde",4,3,2'''
This is a collection of three pairs of string and separated by a new line '\n'
Ideally I want to split it with '\n' and iterate through that array.
How can I do this in Python?
This is what I was thinking:
x = mystring.split('\n')
for aword in x
print('\n' + str(aword[1]))
I want to be able to access each element of each line. I hope it makes sense.
This should work. Python reads those enters as a newline. So using the built in python function split() we can create a list separated by "\n".
Code
mystring = '''1,"abc","abcde",1,2,3
2,"zzz","zzzde",1,2,5
3,"xyz","xyzde",4,3,2'''
mystring = mystring.split()
for i in mystring:
pass
I am handed a bunch of data and trying to get rid of certain characters. The data contains multiple instances of "^{number}" → "^0", "^1", "^2", etc.
I am trying to set all of these instances to an empty string, "", is there a better way to do this than
string.replace("^0", "").replace("^1", "").replace("^2", "")
I understand you can use a dictionary, but it seems a little overkill considering each item will be replaced with "".
I understand that the digits are always at the end of the string, have a look at the solutions below.
with regex:
import re
text = 'xyz125'
s = re.sub("\d+$",'', text)
print(s)
it should print:
'xyz'
without regex, keep in mind that this solution removes all digits and not only the ones at the end of a string:
text = 'xyz125'
result = ''.join(i for i in text if not i.isdigit())
print(result)
it should print:
'xyz'
I'm trying to replace words like 'rna' with 'ribonucleic acid' from a dictionary of abbreviations. I tried writing the following, but it doesn't replace the abbreviations.
import csv,re
outfile = open ("Dict.txt", "w")
with open('Dictionary.csv', mode='r') as infile:
reader = csv.reader(infile)
mydict = {rows[0]:rows[1] for rows in reader}
print >> outfile, mydict
out = open ("out.txt", "w")
ss = open ("trial.csv", "r").readlines()
s = str(ss)
def process(s):
da = ''.join( mydict.get( word, word ) for word in re.split( '(\W+)', s ) )
print >> out, da
process(s)
A sample trial.csv file would be
A,B,C,D
RNA,lung cancer,15,biotin
RNA,lung cancer,15,biotin
RNA,breast cancer,15,biotin
RNA,breast cancer,15,biotin
RNA,lung cancer,15,biotin
Sample Dictionary.csv:
rna,ribonucleic acid
rnd,radical neck dissection
rni,recommended nutrient intake
rnp,ribonucleoprotein
My output file should have 'RNA' replaced by 'ribonucleic acid'
I'm trying to replace 'RNA' but my dictionary has 'rna'. Is there a way I can ignore the case.
Sure. Just call casefold on each key while creating the dictionary, and again while looking up values:
mydict = {rows[0].casefold(): rows[1] for rows in reader}
# ...
da = ''.join( mydict.get(word.casefold(), word) for word in re.split( '(\W+)', s ) )
If you're using an older version of Python that doesn't have casefold (IIRC, it was added in 2.7 and 3.2, but it may have been later than that…), use lower instead. It won't always do the right thing for non-English characters (e.g., 'ß'.casefold() is 'ss', while 'ß'.lower() is 'ß'), but it seems like that's OK for your application. (If it's not, you have to either write something more complicated with unicodedata, or find a third-party library.)
Also, I don't want it to replace 'corna' (I know such a word doesn't exist, but I want to make sure it doesn't happen) with 'coribonucleic acid'.
Well, you're already doing that with your re.split, which splits on any "non-word" characters; you then look up each resulting word separtely. Since corna won't be in the dict, it won't be replaced. (Although note that re's notion of "word" characters may not actually be what you want—it includes underscores and digits as part of a word, so rna2dna won't match, while a chunk of binary data like s1$_2(rNa/ might.)
You've also got another serious problem in your code:
ss = open ("trial.csv", "r").readlines()
s = str(ss)
Calling readlines means that ss is going to be a list of lines. Calling str on that list means that s is going to be a big string with [, then the repr of each line (with quotes around it, backslash escapes within it, etc.) separated by commas, then ]. You almost certainly don't want that. Just use read() if you want to read the whole file into a string as-is.
And you appear to have a problem in your data, too:
rna,ibonucleic acid
If you replace rna with ibonucleic acid, and so forth, you're going to have some hard-to-read output. If this is really your dictionary format, and the dictionary's user is supposed to infer some logic, e.g., that the first letter gets copied from the abbreviation, you have to write that logic. For example:
def lookup(word):
try:
return word[0] + mydict[word.casefold()]
except KeyError:
return word
da = ''.join(lookup(word) for word in re.split('(\W+), s))
Finally, it's a bad idea to use unescaped backslashes in a string literal. In this case, you get away with it, because Python happens to not have a meaning for \W, but that's not always going to be true. The best way around this is to use raw string literals, like r'(\W+)'.
I think this line s = str(ss) is causing the problem - the list that was created just became a string!
Try this instead:
def process(ss):
for line in ss:
da = ''.join( mydict.get( word, word ) for word in re.split( '(\W+)', line ) )
print >> out, da
process(ss)
I have this code:
filenames=["file1","FILE2","file3","fiLe4"]
def alignfilenames():
#build a string that can be used to add labels to the R variables.
#format goal: suffixes=c(".fileA",".fileB")
filestring='suffixes=c(".'
for filename in filenames:
filestring=filestring+str(filename)+'",".'
print filestring[:-3]
#now delete the extra characters
filestring=filestring[-1:-4]
filestring=filestring+')'
print "New String"
print str(filestring)
alignfilenames()
I'm trying to get the string variable to look like this format: suffixes=c(".fileA",".fileB".....) but adding on the final parenthesis is not working. When I run this code as is, I get:
suffixes=c(".file1",".FILE2",".file3",".fiLe4"
New String
)
Any idea what's going on or how to fix it?
Does this do what you want?
>>> filenames=["file1","FILE2","file3","fiLe4"]
>>> c = "suffixes=c(%s)" % (",".join('".%s"' %f for f in filenames))
>>> c
'suffixes=c(".file1",".FILE2",".file3",".fiLe4")'
Using a string.join is a much better way to add a common delimiter to a list of items. It negates the need to have to check for being on the last item before adding the delimiter, or in your case attempting to strip off the last one added.
Also, you may want to look into List Comprehensions
It looks like you might be trying to use python to write an R script, which can be a quick solution if you don't know how to do it in R. But in this case the R-only solution is actually rather simple:
R> filenames= c("file1","FILE2","file3","fiLe4")
R> suffixes <- paste(".", tolower(filenames), sep="")
R> suffixes
[1] ".file1" ".file2" ".file3" ".file4"
R>
What's going on is that this slicing returns an empty string
filestring=filestring[-1:-4]
Because the end is before the begin. Try the following on the command line:
>>> a = "hello world"
>>> a[-1:-4]
''
The solution is to instead do
filestring=filestring[:-4]+filestring[-1:]
But I think what you actually wanted was to just drop the last three characters.
filestring=filestring[:-3]
The better solution is to use the join method of strings as sberry2A suggested