I'm writing a program in Python and I want to compare two strings that exist in a text file, separated by a newline character. How can I read the file in and assign each string to a different variable, i.e. string1 and string2?
Right now I'm using:
file = open("text.txt").read();
but this gives me extra content and not just the strings. I'm not sure what it is returning, but this text file just contains two strings. I tried other methods such as .read().splitlines(), but they did not yield the result I'm looking for. I'm new to Python, so any help would be appreciated!
This only reads the first 2 lines, strips off the newline char at the end, and stores them in 2 separate variables. It does not read in the entire file just to get the first 2 strings in it.
with open('text.txt') as f:
    word1 = f.readline().strip()
    word2 = f.readline().strip()
print word1, word2
# now you can compare word1 and word2 if you like
text.txt:
foo
bar
asdijaiojsd
asdiaooiasd
Output:
foo bar
EDIT: to make it work with any number of newlines or whitespace:
with open('text.txt') as f:
    # sequence of all words in all lines
    words = (word for line in f for word in line.split())
    # consume the first 2 items from the words sequence
    word1 = next(words)
    word2 = next(words)
I've verified this to work reliably with various "non-clean" contents of text.txt.
Note: I'm using generator expressions, which are like lazy lists, to avoid reading more data than needed. Generator expressions are otherwise equivalent to list comprehensions, except that they produce the items of the sequence lazily, i.e. only as many as are requested.
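To make the laziness concrete, here's a minimal sketch (the names are illustrative, not from the question) contrasting the two:

```python
# A list comprehension builds every item up front; a generator expression
# computes items only when asked for them.
squares_list = [n * n for n in range(10)]   # all 10 items exist in memory now
squares_gen = (n * n for n in range(10))    # nothing computed yet
first = next(squares_gen)   # computes only the first item
second = next(squares_gen)  # ...and now the second
print(first, second)        # 0 1
```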
with open('text.txt') as f:
    lines = [line.strip() for line in f]
print lines[0] == lines[1]
I'm not sure what it is returning but this text file just contains two strings.
Your problem is likely related to whitespace characters (most common being carriage return, linefeed/newline, space and tab). So if you tried to compare your string1 to 'expectedvalue' and it fails, it's likely because of the newline itself.
Try this: print the length of each string then print each of the actual bytes in each string to see why the comparison fails.
For example:
>>> print len(string1), len(expected)
4 3
>>> for got_character, expected_character in zip(string1, expected):
...     print 'got "{}" ({}), but expected "{}" ({})'.format(got_character, ord(got_character), expected_character, ord(expected_character))
...
got " " (32), but expected "f" (102)
got "f" (102), but expected "o" (111)
got "o" (111), but expected "o" (111)
If that's your problem, then you should strip off the leading and trailing whitespace and then execute the comparison:
>>> string1 = string1.strip()
>>> string1 == expected
True
If you're on a unix-like system, you'll probably have an xxd or od binary available to dump a more detailed representation of the file. If you're using windows, you can download many different "hex editor" programs to do the same.
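If you'd rather stay inside Python, reading the file in binary mode and printing its repr makes invisible whitespace obvious. A minimal sketch (it writes its own sample file so the byte values are known):

```python
# Write a sample file with Windows-style line endings, then look at the raw
# bytes -- stray '\r' characters become visible immediately in the repr.
with open("text.txt", "wb") as f:
    f.write(b"foo\r\nbar\r\n")
with open("text.txt", "rb") as f:
    data = f.read()
print(repr(data))  # b'foo\r\nbar\r\n' -- the '\r' would break a naive == comparison
```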
Related
I have a text file with all existing words in the Dutch language and I need only the words with a specific amount of characters, without any digits or special characters or capitals. I tried to do it by hand (which works) but it's about 400 thousand words :) So I wanted to use Python. I'm very new to Python and I can't find a good solution.
With my code (which is far from optimal) I get results but not good enough. Some words seem to be split halfway and concatenated, in some lines two words are not put on a separate line (to name a few things that I don't want).
My question: Is there a simple code that can remove words longer than 10 characters, remove all words starting or containing a Cap, remove all words with special characters? Thank you all in advance.
My code:
import re

input_file = open("basiswoorden-gekeurd.txt", "r+")
output_file = open("word_crumble_wordlist.txt", "w")
filetext = input_file.read()
res_caps = re.sub(r"\s*[A-Z]\w*\s*", " ", filetext).strip()
res_dig = re.sub(r"\s*\d\w*\s*", "", res_caps).strip()
res = re.sub(r"[^a-zA-Z0-9\n\.]\w*\s*", "", res_dig).strip()
for line in res:
    if len(line) < 10:
        output_file.write(line)
Original part of word-list:
Original: see the numbers and special characters
Resulting part:
Result: looks ok but the word "aaaaagje" seems a combination of other words :) HOW?
Also:
Original, with "aanbevolencomité AND aanbevolen" as two separate words on two separate lines
And:
See "aanbevolencomitaanbevolen"
In this case it might be easier to find the matching words rather than delete the unwanted ones. Consider the following example: let file.txt content be
Capital
okay
thisistoolong
okaytoo
d.o.t.s
then
import re
with open("file.txt", "r") as f:
    text = f.read()
for i in re.findall(r'^[a-z]{1,10}$', text, re.MULTILINE):
    print(i)
gives output
okay
okaytoo
Explanation: I use MULTILINE mode so ^ and $ mean start of line and end of line; then I find the lines that consist of 1 to 10 lowercase ASCII letters.
I am reading a .dat file and the first few lines are just metadata before it gets to the actual data. A shortened example of the .dat file is below.
&SRS
SRSRUN=266128,SRSDAT=20180202,SRSTIM=122132,
fc.fcY=0.9000
&END
energy rc ai2
8945.016 301.32 6.7959
8955.497 301.18 6.8382
8955.989 301.18 6.8407
8956.990 301.16 6.8469
Or as the list:
[' &SRS\n', ' SRSRUN=266128,SRSDAT=20180202,SRSTIM=122132,\n', 'fc.fcY=0.9000\n', '\n', ' &END\n', 'energy\trc\tai2\n', '8945.016\t301.32\t6.7959\n', '8955.497\t301.18\t6.8382\n', '8955.989\t301.18\t6.8407\n', '8956.990\t301.16\t6.8469\n']
I tried this previously but it raises an error:
def import_absorptionscan(file_path, start, end):
    for i in range(start, end):
        lines = []
        f = open(file_path + str(i) + '.dat', 'r')
        for line in f:
            lines.append(line)
        for line in lines:
            for c in line:
                if c.isalpha():
                    lines.remove(line)
        print lines
But I get this error: ValueError: list.remove(x): x not in list
I started looking through Stack Overflow then, but most of what came up was how to strip alphabetical characters from a string, so I made this question.
This produces a list of strings, with each string making up one line in the file. I want to remove any string which contains any alphabet characters as this should remove all the metadata and leave just the data. Any help would be appreciated thank you.
I have a suspicion you will want a more robust rule than "does the string contain a letter?", but you can use a regular expression to check:
re.search("[a-zA-Z]", line)
You'll probably want to take a look at the regular expression docs.
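As a sketch of how that pattern filters the list (using a few of the sample lines from the question to stand in for the file):

```python
import re

# A few of the sample lines from the question's list.
lines = [' &SRS\n', 'energy\trc\tai2\n', '8945.016\t301.32\t6.7959\n']
# Keep only the lines containing no ASCII letters, i.e. the numeric data rows.
data = [line for line in lines if not re.search("[a-zA-Z]", line)]
print(data)  # ['8945.016\t301.32\t6.7959\n']
```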
Additionally, you can use the built-in any() function to check for letters. Inside your loop over the lines, add:
if any(word.isalpha() for word in line.split()):
Notice that this will say that "ver9" is all numbers (since "ver9".isalpha() is False), so if that's a problem, replace it with:
line_is_meta = False
for word in line.split():
    if any(letter.isalpha() for letter in word):
        line_is_meta = True
        break
if not line_is_meta:
    lines.append(line)
I have a .txt doc full of text. I'd like to search it for specific characters (or ideally groups of characters, i.e. strings), then do things with the character found and with the characters 2 in front / 4 behind the selected characters.
I made a version that searches lines for the character, but I can't find the equivalent for characters.
f = open("C:\Users\Calum\Desktop\Robopipe\Programming\data2.txt", "r")
searchlines = f.readlines()
f.close()
for i, line in enumerate(searchlines):
    if "_" in line:
        for l in searchlines[i:i+2]:
            print l,  # if i+2 then prints this line and the next
        print
If I understand the problem, what you want is to repeatedly search one giant string, instead of a searching a list of strings one by one.
So, the first step is, don't use readlines, use read, so you get that one giant string in the first place.
Next, how do you repeatedly search for all matches in a string?
Well, a string is an iterable, just like a list is—it's an iterable of characters (which are themselves strings with length 1). So, you can just iterate over the string:
f = open(path)
searchstring = f.read()
f.close()

for i, ch in enumerate(searchstring):
    if ch == "_":
        print searchstring[i-4:i+2]
However, notice that this only works if you're only searching for a single-character match. And it will fail if you find a _ in the first four characters. And it can be inefficient to loop over a few MB of text character by character.* So, you probably want to instead loop over str.find:
i = 4
while True:
    i = searchstring.find("_", i)
    if i == -1:
        break
    print searchstring[i-4:i+2]
    i += 1  # advance past this match so find doesn't return the same index forever
* You may be wondering how find could possibly be doing anything but the same kind of loop. And you're right, it's still iterating character by character. But it's doing it in optimized code provided by the standard library—with the usual CPython implementation, this means the "inner loop" is in C code rather than Python code, it doesn't have to "box up" each character to test it, etc., so it can be much, much faster.
You could use a regex for this:
The regex searches for any two characters (that are not _), an _, then any four characters that are not an underscore.
import re

with open(path) as f:
    searchstring = f.read()

regex = re.compile("([^_]{2}_[^_]{4})")
for match in regex.findall(searchstring):
    print match
With the input of:
hello_there my_wonderful_friend
The script returns:
lo_ther
my_wond
ul_frie
I'm trying to replace words like 'rna' with 'ribonucleic acid' from a dictionary of abbreviations. I tried writing the following, but it doesn't replace the abbreviations.
import csv, re

outfile = open("Dict.txt", "w")
with open('Dictionary.csv', mode='r') as infile:
    reader = csv.reader(infile)
    mydict = {rows[0]: rows[1] for rows in reader}
print >> outfile, mydict

out = open("out.txt", "w")
ss = open("trial.csv", "r").readlines()
s = str(ss)

def process(s):
    da = ''.join(mydict.get(word, word) for word in re.split('(\W+)', s))
    print >> out, da

process(s)
A sample trial.csv file would be
A,B,C,D
RNA,lung cancer,15,biotin
RNA,lung cancer,15,biotin
RNA,breast cancer,15,biotin
RNA,breast cancer,15,biotin
RNA,lung cancer,15,biotin
Sample Dictionary.csv:
rna,ribonucleic acid
rnd,radical neck dissection
rni,recommended nutrient intake
rnp,ribonucleoprotein
My output file should have 'RNA' replaced by 'ribonucleic acid'
I'm trying to replace 'RNA' but my dictionary has 'rna'. Is there a way I can ignore the case.
Sure. Just call casefold on each key while creating the dictionary, and again while looking up values:
mydict = {rows[0].casefold(): rows[1] for rows in reader}
# ...
da = ''.join( mydict.get(word.casefold(), word) for word in re.split( '(\W+)', s ) )
If you're using an older version of Python that doesn't have casefold (it was added in 3.3, and doesn't exist at all in 2.x), use lower instead. It won't always do the right thing for non-English characters (e.g., 'ß'.casefold() is 'ss', while 'ß'.lower() is 'ß'), but it seems like that's OK for your application. (If it's not, you have to either write something more complicated with unicodedata, or find a third-party library.)
Also, I don't want it to replace 'corna' (I know such a word doesn't exist, but I want to make sure it doesn't happen) with 'coribonucleic acid'.
Well, you're already doing that with your re.split, which splits on any "non-word" characters; you then look up each resulting word separately. Since corna won't be in the dict, it won't be replaced. (Although note that re's notion of "word" characters may not actually be what you want—it includes underscores and digits as part of a word, so rna2dna won't match, while a chunk of binary data like s1$_2(rNa/ might.)
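A quick sketch of that behavior, splitting a made-up sample string:

```python
import re

# Digits (and underscores) count as word characters for \W, so 'rna2dna'
# survives as a single token and won't match the 'rna' dictionary key.
tokens = re.split(r'(\W+)', 'corna rna2dna rna')
print(tokens)  # ['corna', ' ', 'rna2dna', ' ', 'rna']
```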
You've also got another serious problem in your code:
ss = open ("trial.csv", "r").readlines()
s = str(ss)
Calling readlines means that ss is going to be a list of lines. Calling str on that list means that s is going to be a big string with [, then the repr of each line (with quotes around it, backslash escapes within it, etc.) separated by commas, then ]. You almost certainly don't want that. Just use read() if you want to read the whole file into a string as-is.
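You can see the problem directly by calling str on a small list of lines (sample data borrowed from trial.csv):

```python
# str() on a list gives you the list's repr: brackets, quotes, commas,
# and backslash-escaped newlines -- not the file's actual text.
lines = ['RNA,lung cancer,15,biotin\n', 'RNA,breast cancer,15,biotin\n']
s = str(lines)
print(s)  # "['RNA,lung cancer,15,biotin\\n', 'RNA,breast cancer,15,biotin\\n']"
```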
And you appear to have a problem in your data, too:
rna,ibonucleic acid
If you replace rna with ibonucleic acid, and so forth, you're going to have some hard-to-read output. If this is really your dictionary format, and the dictionary's user is supposed to infer some logic, e.g., that the first letter gets copied from the abbreviation, you have to write that logic. For example:
def lookup(word):
    try:
        return word[0] + mydict[word.casefold()]
    except KeyError:
        return word

da = ''.join(lookup(word) for word in re.split(r'(\W+)', s))
Finally, it's a bad idea to use unescaped backslashes in a string literal. In this case, you get away with it, because Python happens to not have a meaning for \W, but that's not always going to be true. The best way around this is to use raw string literals, like r'(\W+)'.
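A minimal illustration of why the raw literal is the safer habit:

```python
# A raw literal keeps the backslash explicitly; a normal literal only
# preserves '\W' because that escape happens to have no meaning (and newer
# Python versions emit a warning for such unknown escapes).
print(r'(\W+)' == '(\\W+)')   # True -- the raw form is the explicit equivalent
print(len(r'\n'), len('\n'))  # 2 1 -- raw keeps two characters; normal is a newline
```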
I think this line s = str(ss) is causing the problem - the list that was created just became a string!
Try this instead:
def process(ss):
    for line in ss:
        da = ''.join(mydict.get(word, word) for word in re.split('(\W+)', line))
        print >> out, da

process(ss)
So, I already have the code to get all the words with digits in them out of the text, now all I need to do is to have the text all in one line.
with open("lolpa.txt") as f:
    for word in f.readline().split():
        digits = [c for c in word if c.isdigit()]
        if not digits:
            print(word)
The split makes each word print on a separate line.
If I take out the .split(), it prints the words without the digits (it literally just takes the digits out of the words) and puts every letter on a separate line.
EDIT: Yes, print(word,end=" ") works, thanks. But I also want the script to now read only one line. It can't read anything that is on line 2 or 3 etc.
The second problem is that the script reads only the FIRST line. So if the input in the first line would be
i li4ke l0ke like p0tatoes potatoes
300 bla-bla-bla 00bla-bla-0211
the output would be
i like potatoes
In Python v 3.x you'd use
print(word, end='')
to avoid the newline.
in Python v 2.x
print word,
you'd use the comma at the end of the items you are printing. Note that unlike in v3 you'd get a single blank space between consecutive prints
Note that print(word), won't prevent a newline in v 3.x.
--
Update based on edit in original post re code problem:
With input:
i li4ke l0ke like p0tatoes potatoes
300 bla-bla-bla 00bla-bla-0211
this code:
def hasDigit(w):
    for c in w:
        if c.isdigit():
            return True
    return False

with open("data.txt") as f:
    for line in f:
        words = [w for w in line.split() if not hasDigit(w)]
        if words:
            print ' '.join(words)
        # break  # uncomment the "break" if you ONLY want to process the first line
will yield all the "words" that do not contain digits:
i like potatoes
bla-bla-bla <-- this line won't show if the "break" is uncommented above
Note:
The post was a bit unclear if you wanted to process only the first line of the file, or if the problem was that your script only processed the first line. This solution can work either way depending on whether the break statement is commented out or not.
with open("lolpa.txt") as f:
    for word in f.readline().split():
        digits = [c for c in word if c.isdigit()]
        if not digits:
            print word,
    print
Note the , at the end of the print statement; it suppresses the newline.
If you're using python 3.x, you can do:
print (word,end="")
to suppress the newline -- python 2.x uses the somewhat strange syntax:
print word, #trailing comma
Alternatively, use sys.stdout.write(str(word)). (this works for both python 2.x and 3.x).
you can use join():
with open("lolpa.txt") as f:
    print ' '.join(w for x in f for w in x.split() if not any(c.isdigit() for c in w))
using a simple for loop:
import sys

with open("data.txt") as f:
    for x in f:  # loop over f, not f.readline()
        for word in x.split():
            digits = [c for c in word if c.isdigit()]
            if not digits:
                sys.stdout.write(word + ' ')  # from mgilson's solution