Create dictionary from a list deleting all spaces - python

Sorry for asking, but I'm kind of new to this. I'm splitting words out of a text and putting them into a dict, creating an index for each token:
import re

f = open('/Users/Half_Pint_Boy/Desktop/sentenses.txt', 'r')
a = 0
c = 0
e = []
for line in f:
    b = re.split('[^a-z]', line.lower())
    a += len(list(filter(None, b)))
    c = c + 1
    e = e + b
d = dict(zip(e, range(len(e))))
But in the end I receive a dict with an empty-string key in it, like this:
{'': 633,
 'a': 617,
 'according': 385,
 'adjacent': 237,
 'allow': 429,
 'allows': 459}
How can I remove "" from the final result in the dict? And how can I change the indexing afterwards so that "" is not counted? (With "" the index count is 633; without it, 248.)
Big thanks!

How about this?
b = list(filter(None, re.split('[^a-z]', line.lower())))
As an alternative:
b = re.findall('[a-z]+', line.lower())
Either way, you can then also remove that filter from the next line:
a += len(b)
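A quick sanity check that the two variants agree (the sample line here is made up):
import re

line = "Foo, bar!! baz"
split_way = list(filter(None, re.split('[^a-z]', line.lower())))
findall_way = re.findall('[a-z]+', line.lower())
print(split_way == findall_way == ['foo', 'bar', 'baz'])  # True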
EDIT
As an aside, I think what you end up with here is a dictionary mapping words to the last position in which they appear in the text. I'm not sure if that's what you intended to do. E.g.
>>> dict(zip(['hello', 'world', 'hello', 'again'], range(4)))
{'world': 1, 'hello': 2, 'again': 3}
If you instead want to keep track of all the positions a word occurs, perhaps try this code instead:
from collections import defaultdict
import re
indexes = defaultdict(list)
with open('test.txt', 'r') as f:
for index, word in enumerate(re.findall(r'[a-z]+', f.read().lower())):
indexes[word].append(index)
indexes then maps each word to a list of indexes at which the word appears.
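For example, a quick sketch of the result using an in-memory string instead of a file (the sample text is made up):
from collections import defaultdict
import re

indexes = defaultdict(list)
for index, word in enumerate(re.findall(r'[a-z]+', 'hello world hello again')):
    indexes[word].append(index)

print(dict(indexes))
# {'hello': [0, 2], 'world': [1], 'again': [3]}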
EDIT 2
Based on the comment discussion below, I think you want something more like this:
import re

word_positions = {}
with open('test.txt', 'r') as f:
    index = 0
    for word in re.findall(r'[a-z]+', f.read().lower()):
        if word not in word_positions:
            word_positions[word] = index
            index += 1

print(word_positions)
# Output:
# {'hello': 0, 'goodbye': 2, 'world': 1}

Your regex isn't a good choice here. Consider using:
line = re.sub('[^a-z]*$', '', line.strip())
b = re.split('[^a-z]+', line.lower())

Replace:
d = dict(zip(e, range(len(e))))
With:
d = {word: n for n, word in enumerate(e) if word}
Alternatively, to avoid the empty entries in the first place, replace:
b = re.split('[^a-z]', line.lower())
With:
b = re.split('[^a-z]+', re.sub('(^[^a-z]+|[^a-z]+$)', '', line.lower()))
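For example, a small check of that cleanup on a made-up line:
import re

line = '  Hello, world -- again!  '
b = re.split('[^a-z]+', re.sub('(^[^a-z]+|[^a-z]+$)', '', line.lower()))
print(b)  # ['hello', 'world', 'again']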

Related

Remove duplicates from large list but remove both if it does exist?

So I have a text file like this
123
1234
123
1234
12345
123456
You can see 123 appears twice, so both instances should be removed, but 12345 appears once so it stays. My text file is about 70,000 lines.
Here is what I came up with.
file = open("test.txt", 'r')
lines = file.read().splitlines()  # to ignore the '\n' and turn into a list structure
for appId in lines:
    if lines.count(appId) > 1:  # if element count is not unique, remove both elements
        lines.remove(appId)  # first instance removed
        lines.remove(appId)  # second instance removed

writeFile = open("duplicatesRemoved.txt", 'a')  # output the leftover unique elements to file
for element in lines:
    writeFile.write(element + "\n")
When I run this I feel like my logic is correct, but I know for a fact the output is supposed to be around 950, yet I'm still getting 23,000 elements in my output, so a lot is not getting removed. Any ideas where the bug could reside?
Edit: I FORGOT TO MENTION. An element can only appear twice MAX.
Use Counter from the built-in collections module:
In [1]: from collections import Counter
In [2]: a = [123, 1234, 123, 1234, 12345, 123456]
In [3]: a = Counter(a)
In [4]: a
Out[4]: Counter({123: 2, 1234: 2, 12345: 1, 123456: 1})
In [5]: a = [k for k, v in a.items() if v == 1]
In [6]: a
Out[6]: [12345, 123456]
For your particular problem I would do it like this:
from collections import defaultdict

out = defaultdict(int)
with open('input.txt') as f:
    for line in f:
        out[line.strip()] += 1

with open('out.txt', 'w') as f:
    for k, v in out.items():
        if v == 1:  # here you use logic suitable for what you want
            f.write(k + '\n')
Be careful about removing elements from a list while still iterating over that list. This changes the behavior of the list iterator, and can make it skip over elements, which may be part of your problem.
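Here is a minimal illustration of that skipping (the numbers are a made-up sample):
nums = [1, 1, 2, 2, 3]
for n in nums:
    if nums.count(n) > 1:
        nums.remove(n)

print(nums)  # [1, 2, 3] -- duplicates that should have been removed survive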
Instead, I suggest creating a filtered copy of the list using a list comprehension - instead of removing elements that appear more than twice, you would keep elements that appear less than that:
file = open("test.txt", 'r')
lines = file.read().splitlines()
unique_lines = [line for line in lines if lines.count(line) <= 2]  # if it appears twice or less
with open("duplicatesRemoved.txt", "w") as writefile:
    # splitlines() stripped the newlines, so add them back when writing
    writefile.write("\n".join(unique_lines) + "\n")
You could also easily modify this code to look for only one occurrence (if lines.count(line) == 1) or for more than two occurrences.
You can count all of the elements and store them in a dictionary:
dic = {a: lines.count(a) for a in lines}
Then remove all duplicated ones from the list:
for k in dic:
    if dic[k] > 1:
        while k in lines:
            lines.remove(k)
NOTE: The while loop is needed because lines.remove(k) removes only the first occurrence of k from the list, so it must be repeated until there is no k value left in the list.
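For example:
nums = [1, 2, 1, 3]
nums.remove(1)
print(nums)  # [2, 1, 3] -- only the first 1 was removed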
If the for loop is complicated, you can use the dictionary in another way to get rid of duplicated values:
lines = [k for k, v in dic.items() if v==1]

Python: Read text file into dict and ignore comments

I am trying to put the following text file into a dictionary, but I would like anything starting with '#' and any empty lines ignored.
My text file looks something like this:
# This is my header info followed by an empty line
Apples 1 # I want to ignore this comment
Oranges 3 # I want to ignore this comment
#~*~*~*~*~*~*~*Another comment~*~*~*~*~*~*~*~*~*~*
Bananas 5 # I want to ignore this comment too!
My desired output would be:
myVariables = {'Apples': 1, 'Oranges': 3, 'Bananas': 5}
My Python code reads as follows:
filename = "myFile.txt"
myVariables = {}
with open(filename) as f:
    for line in f:
        if line.startswith('#') or not line:
            next(f)
        key, val = line.split()
        myVariables[key] = val
        print "key: " + str(key) + " and value: " + str(val)
The error I get:
Traceback (most recent call last):
File "C:/Python27/test_1.py", line 11, in <module>
key, val = line.split()
ValueError: need more than 1 value to unpack
I understand the error but I do not understand what is wrong with the code.
Thank you in advance!
Given your text:
text = """
# This is my header info followed by an empty line
Apples 1 # I want to ignore this comment
Oranges 3 # I want to ignore this comment
#~*~*~*~*~*~*~*Another comment~*~*~*~*~*~*~*~*~*~*
Bananas 5 # I want to ignore this comment too!
"""
We can do this in two ways: using regex, or using Python generators. I would choose the latter (described below), as regex is not particularly faster in such cases.
To open the file:
with open('file_name.xyz', 'r') as file:
    # everything else below stays the same; just substitute
    # `for line in lines` with `for line in file`
Now, to simulate the file, we split the text into a list of lines:
lines = text.split('\n')  # as if read from a file using `open`.
Here is how we do all you want in a couple of lines:
# Discard all comments and empty values.
comment_less = filter(None, (line.split('#')[0].strip() for line in lines))
# Separate items and totals.
separated = {item.split()[0]: int(item.split()[1]) for item in comment_less}
Let's test:
>>> print(separated)
{'Apples': 1, 'Oranges': 3, 'Bananas': 5}
Hope this helps.
This doesn't exactly reproduce your error, but there's a problem with your code:
>>> x = "Apples\t1\t# This is a comment"
>>> x.split()
['Apples', '1', '#', 'This', 'is', 'a', 'comment']
>>> key, val = x.split()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: too many values to unpack
Instead try:
key = line.split()[0]
val = line.split()[1]
Edit: and I think your "need more than 1 value to unpack" is coming from the blank lines. Also, I'm not familiar with using next() like this. I guess I would do something like:
if line.startswith('#') or line == "\n":
    pass
else:
    key = line.split()[0]
    val = line.split()[1]
To strip comments, you could use str.partition() which works whether the comment sign is present or not in the line:
for line in file:
    line, _, comment = line.partition('#')
    if line.strip():  # non-blank line
        key, value = line.split()
line.split() may raise an exception in this code too; it happens if there is a non-blank line that does not contain exactly two whitespace-separated words. It is application-dependent what you want to do in this case (ignore such lines, print a warning, etc.).
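For instance, one way to skip such malformed lines (a sketch; whether you skip, warn, or raise is up to you):
for line in file:
    line, _, comment = line.partition('#')
    if line.strip():
        parts = line.split()
        if len(parts) != 2:
            continue  # malformed line: skip it (or print a warning, raise, etc.)
        key, value = parts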
You need to ignore empty lines and lines starting with #, splitting the remaining lines after removing the comment, either by splitting on # or by using rfind as below to slice the string. An empty line still contains a newline, so you need the line.strip() check to catch one. You cannot just split on whitespace and unpack, as you would have more than two elements after splitting, including what is in the comment:
with open("in.txt") as f:
d = dict(line[:line.rfind("#")].split() for line in f
if not line.startswith("#") and line.strip())
print(d)
Output:
{'Apples': '1', 'Oranges': '3', 'Bananas': '5'}
Another option is to split twice and slice:
with open("in.txt") as f:
d = dict(line.split(None,2)[:2] for line in f
if not line.startswith("#") and line.strip())
print(d)
Or splitting twice and unpacking using an explicit loop:
with open("in.txt") as f:
d = {}
for line in f:
if not line.startswith("#") and line.strip():
k, v, _ = line.split(None, 2)
d[k] = v
You can also use itertools.groupby to group the lines you want.
from itertools import groupby

with open("in.txt") as f:
    grouped = groupby(f, lambda x: not x.startswith("#") and x.strip())
    d = dict(next(v).split(None, 2)[:2] for k, v in grouped if k)

print(d)
To handle where we have multiple words in single quotes we can use shlex to split:
import shlex

with open("in.txt") as f:
    d = {}
    for line in f:
        if not line.startswith("#") and line.strip():
            data = shlex.split(line)
            d[data[0]] = data[1]

print(d)
So changing the Banana line to:
Bananas 'north-side disabled' # I want to ignore this comment too!
We get:
{'Apples': '1', 'Oranges': '3', 'Bananas': 'north-side disabled'}
And the same will work for the slicing:
with open("in.txt") as f:
d = dict(shlex.split(line)[:2] for line in f
if not line.startswith("#") and line.strip())
print(d)
If the format of the file is correctly defined, you can try a solution with regular expressions.
Here's just an idea:
import re

fruits = {}
with open('fruits_list.txt', mode='r') as f:
    for line in f:
        match = re.match(r"([a-zA-Z0-9]+)\s+([0-9]+).*", line)
        if match:
            fruit_name, fruit_amount = match.groups()
            fruits[fruit_name] = fruit_amount

print fruits
UPDATED:
I changed the way of reading lines to take care of large files. Now I read line by line and not all at once. This improves the memory usage.

Why is my code recording into the file only when I run it a second time?

My goal is to calculate the amount of words. When I run my code I am supposed to:
read in strings from the file
split every line in words
add these words into the dictionary
sort keys and add them to the list
write the string that consists of keys and appropriate values into the file
When I run the code for the first time it does not write anything to the file, but I see the result on my screen; the file is empty. Only when I run the code a second time do I see the content recorded in the file.
Why is that happening?
#read in the file
fileToRead = open('../folder/strings.txt')
fileToWrite = open('../folder/count.txt', 'w')
d = {}
#iterate over every line in the file
for line in fileToRead:
    listOfWords = line.split()
    #iterate over every word in the list
    for word in listOfWords:
        if word not in d:
            d[word] = 1
        else:
            d[word] = d.get(word) + 1
#sort the keys
listF = sorted(d)
#iterate over sorted keys and write them in the file with appropriate value
for word in listF:
    string = "{:<18}\t\t\t{}\n".format(word, d.get(word))
    print string
    fileToWrite.write(string)
A minimalistic version:
import collections

with open('strings.txt') as f:
    d = collections.Counter(s for line in f for s in line.split())

with open('count.txt', 'a') as f:
    for word in sorted(d.iterkeys()):
        string = "{:<18}\t\t\t{}\n".format(word, d[word])
        print string,
        f.write(string)
A couple of changes: I think you meant 'a' (append to the file) instead of 'w' (overwrite the file each time) in open('count.txt', 'a'). Please also try to use the with statement for reading and writing files, as it automatically closes the file descriptor after the read/write is done.
#read in the file
fileToRead = open('strings.txt')
d = {}
#iterate over every line in the file
for line in fileToRead:
    listOfWords = line.split()
    #iterate over every word in the list
    for word in listOfWords:
        if word not in d:
            d[word] = 1
        else:
            d[word] = d.get(word) + 1
#sort the keys
listF = sorted(d)
#iterate over sorted keys and write them in the file with appropriate value
with open('count.txt', 'a') as fileToWrite:
    for word in listF:
        string = "{:<18}\t\t\t{}\n".format(word, d.get(word))
        print string,
        fileToWrite.write(string)
When you do file.write(some_data), it writes the data into a buffer but not necessarily into the file. The data only reliably reaches the disk when the buffer fills up, when you call file.flush(), or when you do file.close().
f = open('some_temp_file.txt', 'w')
f.write("booga boo!")
# nothing written yet to disk
f.close()
# flushes the buffer and writes to disk
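If you want the data on disk before closing, you can also flush the buffer explicitly; a small sketch:
f = open('some_temp_file.txt', 'w')
f.write("booga boo!")
f.flush()  # pushes Python's buffer out to the operating system
f.close()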
The better way to do this would be to store the path in a variable, rather than the file object. Then you can open the file (and close it again) on demand.
read_path = '../folder/strings.txt'
write_path = '../folder/count.txt'
This also allows you to use the with keyword, which handles file opening and closing much more elegantly.
read_path = '../folder/strings.txt'
write_path = '../folder/count.txt'

d = dict()
with open(read_path) as inf:
    for line in inf:
        for word in line.split():
            # remember dict.get's default value! Saves a conditional
            d[word] = d.get(word, 0) + 1

# since we've left the block, `inf` is closed by here
sorted_words = sorted(d)

with open(write_path, 'w') as outf:
    for word in sorted_words:
        # don't shadow the stdlib `string` module
        # also: why are you using both fixed width AND tab-delimiters in the same line?
        s = "{:<18}\t\t\t{}\n".format(word, d.get(word))
        print(s)  # not sure why you're doing this, but okay...
        outf.write(s)
# since we leave the block, the file closes automagically.
That said, there are a couple of things you could do to make this a little better in general. First off: counting how many of something are in a container is a job for a collections.Counter.
In [1]: from collections import Counter
In [2]: Counter('abc')
Out[2]: Counter({'a': 1, 'b': 1, 'c': 1})
and Counters can be added together with the expected behavior
In [3]: Counter('abc') + Counter('cde')
Out[3]: Counter({'c': 2, 'a': 1, 'b': 1, 'd': 1, 'e': 1})
and can also be sorted the same way you'd sort a dictionary's items
In [4]: sorted((Counter('abc') + Counter('cde')).items(), key=lambda kv: kv[0])
Out[4]: [('a', 1), ('b', 1), ('c', 2), ('d', 1), ('e', 1)]
Put those all together and you could do something like:
from collections import Counter

read_path = '../folder/strings.txt'
write_path = '../folder/count.txt'

with open(read_path) as inf:
    # sum needs an empty Counter as its start value: 0 + Counter() is a TypeError
    results = sum((Counter(line.split()) for line in inf), Counter())

with open(write_path, 'w') as outf:
    for word, count in sorted(results.items(), key=lambda kv: kv[0]):
        s = "{:<18}\t\t\t{}\n".format(word, count)
        outf.write(s)

Trying to Find Most Occurring Name in a File

I have 4 text files that I want to read and find the top 5 most occurring names of. The text files have names in the following format "Rasmus,M,11". Below is my code which right now is able to call all of the text files and then read them. Right now, this code prints out all of the names in the files.
def top_male_names():
    for x in range(2008, 2012):
        txt = "yob" + str(x) + ".txt"
        file_handle = open(txt, "r", encoding="utf-8")
        file_handle.seek(0)
        line = file_handle.readline().strip()
        while line != "":
            print(line)
            line = file_handle.readline().strip()

top_male_names()
My question is, how can I keep track of all of these names and find the top 5 that occur the most? The only way I could think of was creating a variable for each name, but that wouldn't work, because there are 100s of entries in each text file, probably with 100s of different names.
This is the gist of it:
from collections import Counter

counter = Counter()
for line in file_handle:
    name, gender, age = line.split(',')
    counter[name] += 1

print(counter.most_common())
You can adapt it to your program.
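For example, a sketch of how it might plug into your loop over the four files (counting one occurrence of a name per line, like the snippet above):
from collections import Counter

def top_male_names():
    counter = Counter()
    for x in range(2008, 2012):
        with open("yob" + str(x) + ".txt", "r", encoding="utf-8") as file_handle:
            for line in file_handle:
                line = line.strip()
                if line:
                    name, gender, count = line.split(',')
                    counter[name] += 1
    return counter.most_common(5)  # the five most common names

print(top_male_names())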
If you need to count the number of words in a text, use a regex.
For example
import re
my_string = "Wow! Is this true? Really!?!? This is crazy!"
words = re.findall(r'\w+', my_string) #This finds words in the document
Output:
>>> words
['Wow', 'Is', 'this', 'true', 'Really', 'This', 'is', 'crazy']
"Is" and "is" are two different words. So we can just capitalize all the words, and then count them.
from collections import Counter
cap_words = [word.upper() for word in words] #capitalizes all the words
word_counts = Counter(cap_words) #counts the number each time a word appears
Output:
>>> word_counts
Counter({'THIS': 2, 'IS': 2, 'CRAZY': 1, 'WOW': 1, 'TRUE': 1, 'REALLY': 1})
Now, reading from a file:
import re
from collections import Counter

with open('file.txt') as f:
    text = f.read()

words = re.findall(r'\w+', text)
cap_words = [word.upper() for word in words]
word_counts = Counter(cap_words)
Then you only have to sort the dict containing all the words by its values rather than its keys to see the top 5 words.
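For example (a short sketch; Counter.most_common does the sorting for you):
top5 = word_counts.most_common(5)
# equivalent to sorting on the counts by hand:
top5 = sorted(word_counts.items(), key=lambda kv: kv[1], reverse=True)[:5]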

prints line number in both txtfile and list?

I have this code, which prints the line number in infile but also the line number in the word list. What do I do to print only the line number of the txt file next to the words?
d = {}
counter = 0
wrongwords = []
for line in infile:
    infile = line.split()
    wrongwords.extend(infile)
    counter += 1
    for word in infile:
        if word not in d:
            d[word] = [counter]
        if word in d:
            d[word].append(counter)

for stuff in wrongwords:
    print(stuff, d[stuff])
The output is:
hello [1, 2, 7, 9] # this is printing the linenumber of the txt file
hello [1] # this is printing the linenumber of the list words
hello [1]
What I want is:
hello [1, 2, 7, 9]
Four things:
You can keep track of the line number by using enumerate instead of handling a counter on your own:
for line_no, line in enumerate(infile):
As sateesh pointed out above, you probably need an else in your conditions:
if word not in d:
    d[word] = [counter]
else:
    d[word].append(counter)
Also note that the above code snippet is exactly what defaultdicts are for:
from collections import defaultdict
d = defaultdict(list)
Then in your main loop, you can get rid of the if..else part:
d[word].append(counter)
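A tiny demonstration of why that works (the keys here are made up):
from collections import defaultdict

d = defaultdict(list)
d['hello'].append(1)  # no KeyError: an empty list is created on first access
d['hello'].append(7)
print(d['hello'])  # [1, 7]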
Why are you doing wrongwords.extend(infile)?
Also, I don't really understand how you are supposed to decide what "wrong words" are. I assume that you have a set named wrongwords that contains the wrong words, which makes your final code something like this:
from collections import defaultdict

d = defaultdict(list)
wrongwords = set(["hello", "foo", "bar", "baz"])

for counter, line in enumerate(infile):
    words = line.split()  # don't rebind `infile` while iterating over it
    for word in words:
        if word in wrongwords:
            d[word].append(counter)
