Python - Importing strings into a list, into another list :)

Basically I want to read strings from a text file, put them in lists three by three, and then put all those three by three lists into another list. Actually let me explain it better :)
Text file (just an example, I can structure it however I want):
party
sleep
study
--------
party
sleep
sleep
-----
study
sleep
party
---------
etc
From this, I want Python to create a list that looks like this:
List1 = [['party','sleep','study'],['party','sleep','sleep'],['study','sleep','party']etc]
But it's super hard. I was experimenting with something like:
test2 = open('test2.txt','r')
List=[]
for line in 'test2.txt':
    a = test2.readline()
    a = a.replace("\n","")
    List.append(a)
    print(List)
But this just does horrible horrible things. How to achieve this?

If you want to group the data in chunks of three, and the data in the text file is not separated by any delimiter, read the file sequentially and build a list. To group it, you can use any of the well-known grouper idioms:
import pprint
from itertools import izip, imap  # Python 2; in Python 3 use the built-in zip and map

with open("test.txt") as fin:
    data = list(imap(list, izip(*[imap(str.strip, fin)]*3)))
pprint.pprint(data)
[['party', 'sleep', 'study'],
['party', 'sleep', 'sleep'],
['study', 'sleep', 'party']]
Steps of Execution
Create a context manager with the file object.
Strip each line (remove the newline).
Calling izip on a list that holds the same iterator three times groups the items into tuples of three.
Convert each tuple to a list.
Convert the resulting iterator to a list.
Because every stage is a lazy iterator, it is all done in a single pass over the file.
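For readers on Python 3, the grouping steps above can be sketched with the built-in zip and map, which are already lazy. This is a minimal sketch, not the answerer's code: the function name group_in_threes is hypothetical, and io.StringIO stands in for the opened text file.

```python
import io

def group_in_threes(lines):
    """Group an iterable of lines into lists of three stripped strings."""
    stripped = map(str.strip, lines)  # lazy iterator in Python 3
    # The list holds the SAME iterator three times, so zip pulls three
    # consecutive items for each tuple. A trailing partial group is dropped.
    return [list(group) for group in zip(*[stripped] * 3)]

# io.StringIO stands in for open("test.txt") in this sketch.
sample = io.StringIO(
    "party\nsleep\nstudy\nparty\nsleep\nsleep\nstudy\nsleep\nparty\n"
)
groups = group_in_threes(sample)
print(groups)
```

Note that zip silently discards an incomplete final group; if that matters, itertools.zip_longest can pad it instead.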
If instead your data is separated and grouped by a delimiter line, you can use the itertools.groupby solution. The code below assumes every delimiter line is exactly ----------:
import pprint
from itertools import imap, groupby  # Python 2; in Python 3 use the built-in map

class Key(object):
    def __init__(self, sep):
        self.sep = sep
        self.count = 0
    def __call__(self, line):
        # each separator line starts a new group number
        if line == self.sep:
            self.count += 1
        return self.count

with open("test.txt") as fin:
    data = [[e for e in v if "----------" not in e]
            for k, v in groupby(imap(str.strip, fin), key=Key("----------"))]
pprint.pprint(data)
[['party', 'sleep', 'study'],
['party', 'sleep', 'sleep'],
['study', 'sleep', 'party']]
Steps of Execution
Create a Key class that increments a counter whenever the separator is encountered. Calling an instance returns the current counter every time, after conditionally increasing it.
Create a Context Manager with the file object.
Strip each line. (Remove newline)
Group the data using itertools.groupby and using your custom key
Remove the separator from the grouped data and create a list of the groups.
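The sample file in the question uses dash lines of different lengths, which an exact-match separator would miss. Below is a hedged Python 3 variation of the groupby idea. It assumes (my assumption, not stated in the answer) that any non-empty line made only of dashes is a separator; is_separator and split_on_separators are hypothetical names.

```python
import io
from itertools import groupby

def is_separator(line):
    """Assumed rule: a separator is a non-empty line made only of dashes."""
    return line != "" and set(line) == {"-"}

def split_on_separators(lines):
    stripped = (line.strip() for line in lines)
    # groupby yields alternating runs of separator and non-separator
    # lines; keep only the non-separator runs.
    return [list(run)
            for sep, run in groupby(stripped, key=is_separator)
            if not sep]

# io.StringIO stands in for the opened text file.
sample = io.StringIO(
    "party\nsleep\nstudy\n--------\n"
    "party\nsleep\nsleep\n-----\n"
    "study\nsleep\nparty\n---------\n"
)
groups = split_on_separators(sample)
print(groups)
```

Using a boolean key instead of a counter means no state has to be carried between calls, at the cost of treating consecutive separators as one.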

You can try with this:
res = []
tmp = []
for i, line in enumerate(open('file.txt'), 1):
    tmp.append(line.strip())
    if i % 3 == 0:
        res.append(tmp)
        tmp = []
print(res)
I've assumed that you don't have the dashes.
Edit:
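Here is an example for when you have dashes (every fourth line is a separator, including one after the last group). Note that the count must start at 1, otherwise the very first line is treated as a separator:
res = []
tmp = []
for i, line in enumerate(open('file.txt'), 1):
    if i % 4 == 0:
        res.append(tmp)
        tmp = []
        continue
    tmp.append(line.strip())
print(res)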

First big problem:
for line in 'test2.txt':
gives you
't', 'e', 's', 't', '2', '.', 't', 'x', 't'
You need to loop through the file you open:
for line in test2:
Or, better:
with open("test2.txt", 'r') as f:
    for line in f:
Next, you need to do one of two things:
If the line contains "-----", create a new sub-list (myList.append([]))
Otherwise, append the line to the last sub-list in your list (myList[-1].append(line))
Finally, your print at the end should not be so far indented; currently, it prints for every line, rather than just when the processing is complete.
    List.append(a)
print(List)
Perhaps a better structure for your file would be:
party,sleep,study
party,sleep,sleep
...
Now each line is a sub-list:
for line in f:
    myList.append(line.strip().split(','))  # strip() drops the trailing newline
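The two rules given earlier in this answer (open a new sub-list at a separator line, otherwise append to the last sub-list) can be sketched as follows. This is an illustrative Python 3 sketch: read_groups is a hypothetical name, io.StringIO stands in for the opened file, and dropping empty sub-lists at the end is my own addition.

```python
import io

def read_groups(lines):
    groups = [[]]                  # start with one empty sub-list
    for line in lines:
        line = line.strip()
        if "-----" in line:        # separator: open a new sub-list
            groups.append([])
        elif line:                 # skip blank lines
            groups[-1].append(line)
    # A trailing separator leaves an empty sub-list behind; drop empties.
    return [group for group in groups if group]

sample = io.StringIO(
    "party\nsleep\nstudy\n--------\n"
    "party\nsleep\nsleep\n-----\n"
    "study\nsleep\nparty\n---------\n"
)
my_list = read_groups(sample)
print(my_list)
```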

Related

Remove duplicates from large list but remove both if it does exist?

So I have a text file like this
123
1234
123
1234
12345
123456
You can see 123 appears twice, so both instances should be removed, but 12345 appears once, so it stays. My text file is about 70,000 lines.
Here is what I came up with.
file = open("test.txt",'r')
lines = file.read().splitlines() #to ignore the '\n' and turn to list structure
for appId in lines:
    if(lines.count(appId) > 1): #if element count is not unique remove both elements
        lines.remove(appId) #first instance removed
        lines.remove(appId) #second instance removed
writeFile = open("duplicatesRemoved.txt",'a') #output the left over unique elements to file
for element in lines:
    writeFile.write(element + "\n")
When I run this I feel like my logic is correct, but I know for a fact the output is supposed to be around 950 elements, and I'm still getting 23000 elements in my output, so a lot is not getting removed. Any ideas where the bug could reside?
Edit: I forgot to mention: an element can appear at most twice.
Use Counter from built in collections:
In [1]: from collections import Counter
In [2]: a = [123, 1234, 123, 1234, 12345, 123456]
In [3]: a = Counter(a)
In [4]: a
Out[4]: Counter({123: 2, 1234: 2, 12345: 1, 123456: 1})
In [5]: a = [k for k, v in a.items() if v == 1]
In [6]: a
Out[6]: [12345, 123456]
For your particular problem I will do it like this:
from collections import defaultdict
out = defaultdict(int)
with open('input.txt') as f:
    for line in f:
        out[line.strip()] += 1
with open('out.txt', 'w') as f:
    for k, v in out.items():
        if v == 1: #here you use logic suitable for what you want
            f.write(k + '\n')
Be careful about removing elements from a list while still iterating over that list. This changes the behavior of the list iterator, and can make it skip over elements, which may be part of your problem.
Instead, I suggest creating a filtered copy of the list using a list comprehension: rather than removing the elements that appear more than once, keep only the elements that appear exactly once:
with open("test.txt", 'r') as file:
    lines = file.read().splitlines()
unique_lines = [line for line in lines if lines.count(line) == 1] # keep only lines that appear once
with open("duplicatesRemoved.txt", "w") as writefile:
    writefile.writelines(line + "\n" for line in unique_lines) # splitlines() dropped the newlines
You could also easily modify the condition to allow up to two occurrences (if lines.count(line) <= 2) or to look for more than two. Note that lines.count scans the whole list on every call, so this is quadratic in the number of lines; for 70,000 lines the Counter approach above will be much faster.
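To make the iterator-skipping problem described above concrete, here is a small Python 3 sketch. The sample list is hypothetical: it extends the question's data with an extra value ("555") chosen so that the skip actually shows up. It then contrasts the in-place removal with a Counter-based filter.

```python
from collections import Counter

lines = ["123", "1234", "555", "123", "555", "1234", "12345"]

# Mutating while iterating, as in the question: the loop position keeps
# advancing while the list shrinks, so both copies of "1234" are never
# visited and survive.
mutated = list(lines)
for app_id in mutated:
    if mutated.count(app_id) > 1:
        mutated.remove(app_id)
        mutated.remove(app_id)

# Counting first and filtering into a new list never skips anything.
counts = Counter(lines)
unique = [line for line in lines if counts[line] == 1]

print(mutated)
print(unique)
```

Counter counts every element in one pass, so the filter is linear in the number of lines rather than quadratic.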
You can count all of the elements and store the counts in a dictionary:
dic = {a: lines.count(a) for a in lines}
Then remove all duplicated ones from the list:
for k in dic:
    if dic[k] > 1:
        while k in lines:
            lines.remove(k)
NOTE: The while loop is there because lines.remove(k) removes only the first occurrence of k, so it must be repeated until no k is left in the list.
If the for loop is complicated, you can use the dictionary in another way to get rid of duplicated values:
lines = [k for k, v in dic.items() if v == 1]

Python - How do i build a dictionary from a text file?

For the Data Structures and Algorithms class at Tilburg University I got this question in an in-class test:
Build a dictionary from testfile.txt with only unique keys, where if a key appears again, its value should be added to the total sum of that product class.
The text file looked like this (it was not a .csv file):
apples,1
pears,15
oranges,777
apples,-4
oranges,222
pears,1
bananas,3
So apples will be -3 and the output would be {"apples": -3, "oranges": 999...}
In the exams I am not allowed to import any external packages besides the normal ones (pcinput, math, etc.). I am also not allowed to use the internet.
I have no idea how to accomplish this, and it seems to be a big gap in my Python skills: this kind of question is not covered in a 'dictionaries in Python' video on YouTube (maybe it would be too hard), but it is also not covered in an expert course, because there it would be too simple.
Hope you guys can help!
from collections import Counter
from sys import exit
from os.path import exists, isfile
## I did not finish it, but what I wanted to achieve was to build a list of the
## strings and their belonging integers, then use the Counter method to add
## them together
## by splitting the string by marking the comma as the split point.
filename = input("filename voor input: ")
if not isfile(filename):
    print(filename, "bestaat niet")
    exit()
keys = []
values = []
with open(filename) as f:
    xs = f.read().split()
    for i in xs:
        keys.append([i])
print(keys)
my_dict = {}
for i in range(len(xs)):
    my_dict[xs[i]] = xs.count(xs[i])
print(my_dict)
word_and_integers_dict = dict(zip(keys, values))
print(word_and_integers_dict)
values2 = my_dict.split(",")
for j in values2:
    print( value2 )
The output is this:
[['schijndel,-3'], ['amsterdam,0'], ['tokyo,5'], ['tilburg,777'], ['zaandam,5']]
{'zaandam,5': 1, 'tilburg,777': 1, 'amsterdam,0': 1, 'tokyo,5': 1, 'schijndel,-3': 1}
{}
So I got a dictionary from it, but I did not separate the values.
The error message is this:
28 values2 = my_dict.split(",") <-- here was the error
29 for j in values2:
30 print( value2 )
AttributeError: 'dict' object has no attribute 'split'
I don't understand what your code is actually doing, and I think you have lost track of what your variables contain, but this is an easy problem to solve in Python. Split into a list, split each item again, and sum:
>>> text = "apples,1 pears,15 oranges,777 apples,-4 oranges,222 pears,1 bananas,3"
>>> parts = text.split()
>>> parts
['apples,1', 'pears,15', 'oranges,777', 'apples,-4', 'oranges,222', 'pears,1', 'bananas,3']
Then split again. Behold the list comprehension. This is an idiomatic way to transform a list to another in python. Note that the numbers are strings, not ints yet.
>>> pairs = [s.split(',') for s in parts]
>>> pairs
[['apples', '1'], ['pears', '15'], ['oranges', '777'], ['apples', '-4'], ['oranges', '222'], ['pears', '1'], ['bananas', '3']]
Now you want to iterate over pairs, and sum all the same fruits. This calls for a dict:
>>> result = {}
>>> for fruit, countstr in pairs:
...     if fruit not in result:
...         result[fruit] = 0
...     result[fruit] += int(countstr)
...
>>> result
{'pears': 16, 'apples': -3, 'oranges': 999, 'bananas': 3}
This pattern of adding an element if it doesn't exist comes up frequently. You should check out defaultdict in the collections module. If you use that, you don't even need the if.
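As a sketch of the defaultdict suggestion, assuming the same space-separated sample text as in the session above (the names text, totals and count_str are mine):

```python
from collections import defaultdict

text = "apples,1 pears,15 oranges,777 apples,-4 oranges,222 pears,1 bananas,3"

totals = defaultdict(int)      # a missing key starts at int(), i.e. 0
for item in text.split():
    fruit, count_str = item.split(",")
    totals[fruit] += int(count_str)

print(dict(totals))
```

collections is part of the standard library, so this should not count as an external package, but check the exam rules before relying on it.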
Let's walk through what you need to do. First, check that the file exists and read its contents into a variable. Second, parse each line: split the line on the comma, convert the number from a string to an integer, and then add the values to a dictionary. In this case I would recommend using defaultdict from collections, but we can also do it with a standard dictionary:
import sys
from os.path import isfile
filename = input("filename voor input: ")
if not isfile(filename):
    print(filename, "bestaat niet")
    sys.exit()
# this reads the file to a list, removing newline characters
with open(filename) as f:
    line_list = [x.strip() for x in f]
# create a dictionary
my_dict = {}
# update the value in the dictionary if it already exists,
# otherwise add it to the dictionary
for line in line_list:
    k, v_str = line.split(',')
    if k in my_dict:
        my_dict[k] += int(v_str)
    else:
        my_dict[k] = int(v_str)
# print the dictionary
table_str = '{:<30}{}'
print(table_str.format('Item','Count'))
print('='*35)
for k,v in sorted(my_dict.items()):
    print(table_str.format(k,v))

Why my code is recording into the file only when I run it second time?

My goal is to count the words in a file. When I run my code I am supposed to:
read in strings from the file
split every line in words
add these words into the dictionary
sort keys and add them to the list
write the string that consists of keys and appropriate values into the file
When I run the code for the first time, it does not write anything to the file, but I see the result on my screen. The file is empty. Only when I run the code a second time do I see the content recorded in the file.
Why is that happening?
#read in the file
fileToRead = open('../folder/strings.txt')
fileToWrite = open('../folder/count.txt', 'w')
d = {}
#iterate over every line in the file
for line in fileToRead:
    listOfWords = line.split()
    #iterate over every word in the list
    for word in listOfWords:
        if word not in d:
            d[word] = 1
        else:
            d[word] = d.get(word) + 1
#sort the keys
listF = sorted(d)
#iterate over sorted keys and write them in the file with appropriate value
for word in listF:
    string = "{:<18}\t\t\t{}\n".format(word, d.get(word))
    print string
    fileToWrite.write(string)
A minimalistic version:
import collections
with open('strings.txt') as f:
    d = collections.Counter(s for line in f for s in line.split())
with open('count.txt', 'a') as f:
    for word in sorted(d.iterkeys()):
        string = "{:<18}\t\t\t{}\n".format(word, d[word])
        print string,
        f.write(string)
A couple of changes: I think you meant 'a' (append to the file) instead of 'w' (overwrite the file each time) in open('count.txt', 'a'). Please also try to use the with statement for reading and writing files, as it automatically closes the file once the block is done.
#read in the file
fileToRead = open('strings.txt')
d = {}
#iterate over every line in the file
for line in fileToRead:
    listOfWords = line.split()
    #iterate over every word in the list
    for word in listOfWords:
        if word not in d:
            d[word] = 1
        else:
            d[word] = d.get(word) + 1
#sort the keys
listF = sorted(d)
#iterate over sorted keys and write them in the file with appropriate value
with open('count.txt', 'a') as fileToWrite:
    for word in listF:
        string = "{:<18}\t\t\t{}\n".format(word, d.get(word))
        print string,
        fileToWrite.write(string)
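When you do file.write(some_data), it writes the data into a buffer, not necessarily into the file. The buffer is written out when it fills up, when you call file.flush(), or when you call file.close(). Your script never closes the file, so the data can sit in the buffer while you look at the file.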
f = open('some_temp_file.txt', 'w')
f.write("booga boo!")
# nothing written yet to disk
f.close()
# flushes the buffer and writes to disk
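If the file has to stay open, flush() is the other way to push buffered data out. A small Python 3 sketch, using a temporary directory so it can run anywhere (the path and the after_flush name are hypothetical):

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "some_temp_file.txt")

f = open(path, "w")
f.write("booga boo!")
f.flush()                       # hand Python's buffer to the operating system
with open(path) as check:
    after_flush = check.read()  # a second reader can see the data now
f.close()                       # close() also flushes, then releases the file

print(after_flush)
```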
The better way to do this would be to store the path in the variable, rather than the file object. Then you can open the file (and close it again) on demand.
read_path = '../folder/strings.txt'
write_path = '../folder/count.txt'
This also allows you to use the with keyword, which handles file opening and closing much more elegantly.
read_path = '../folder/strings.txt'
write_path = '../folder/count.txt'
d = dict()
with open(read_path) as inf:
    for line in inf:
        for word in line.split():
            d[word] = d.get(word, 0) + 1
            # remember dict.get's default value! Saves a conditional
# since we've left the block, `inf` is closed by here
sorted_words = sorted(d)
with open(write_path, 'w') as outf:
    for word in sorted_words:
        s = "{:<18}\t\t\t{}\n".format(word, d.get(word))
        # don't shadow the stdlib `string` module
        # also: why are you using both fixed width AND tab-delimiters in the same line?
        print(s) # not sure why you're doing this, but okay...
        outf.write(s)
# since we leave the block, the file closes automagically.
That said, there's a couple things you could do to make this a little better in general. First off: counting how many of something are in a container is a job for a collections.Counter.
In [1]: from collections import Counter
In [2]: Counter('abc')
Out[2]: Counter({'a': 1, 'b': 1, 'c': 1})
and Counters can be added together with the expected behavior
In [3]: Counter('abc') + Counter('cde')
Out[3]: Counter({'c': 2, 'a': 1, 'b': 1, 'd': 1, 'e': 1})
and also sorted the same way you'd sort a dictionary with keys
In [4]: sorted((Counter('abc') + Counter('cde')).items(), key=lambda kv: kv[0])
Out[4]: [('a', 1), ('b', 1), ('c', 2), ('d', 1), ('e', 1)]
Put those all together and you could do something like:
from collections import Counter
read_path = '../folder/strings.txt'
write_path = '../folder/count.txt'
with open(read_path) as inf:
    # sum() needs an empty Counter as its start value; the default start
    # of 0 cannot be added to a Counter
    results = sum((Counter(line.split()) for line in inf), Counter())
with open(write_path, 'w') as outf:
    for word, count in sorted(results.items(), key=lambda kv: kv[0]):
        s = "{:<18}\t\t\t{}\n".format(word, count)
        outf.write(s)

How to read specific lines of a large csv file

I am trying to read some specific rows of a large csv file, and I don't want to load the whole file into memory. The indices of the specific rows are given in a list L = [2, 5, 15, 98, ...] and my csv file looks like this:
Col 1, Col 2, Col3
row11, row12, row13
row21, row22, row23
row31, row32, row33
...
Using the ideas mentioned here I use the following command to read the rows
with open('~/file.csv') as f:
    r = csv.DictReader(f) # I need to read it as a dictionary for my purpose
    for i in L:
        for row in enumerate(r):
            print row[i]
I immediately get the following error:
IndexError Traceback (most recent call last)
<ipython-input-25-78951a0d4937> in <module>()
6 for i in L:
7 for row in enumerate(r):
----> 8 print row[i]
IndexError: tuple index out of range
Question 1. It seems like my use of the for loops here is obviously wrong. Any ideas on how to fix this?
On the other hand, the following gets the job done, but it's too slow:
def read_csv_line(line_number):
    with open("~/file.csv") as f:
        r = csv.DictReader(f)
        for i, line in enumerate(r):
            if i == (line_number - 2):
                return line
    return None
for i in L:
    print read_csv_line(i)
Question 2. Any idea on how to improve this basic method of going through the whole file until I reach row i then print it?
A file doesn't have "lines" or "rows". What you consider a "line" is "what is found between two newline characters". As such you cannot read the nth line without reading the lines before it, as you couldn't count the newline characters.
Answer 1: if you consider your example, but with L=[9], unrolling your loops would give:
i=9
row = (0, {'Col 2': 'row12', 'Col 3': 'row13', 'Col 1': 'row11'})
As you can see, row is a tuple with two members, calling row[i] means row[9], hence the IndexError.
Answer 2: This is very slow because you are reading the file up to the line number every time. In your example, you read the first 2 lines, then the first 5, then the first 15, then the first 98, etc. So you've read the first 5 lines 3 times. You could create a generator that only returns the lines you want (beware, line numbers would be 0-indexed):
def read_my_lines(csv_reader, lines_list):
    for line_number, row in enumerate(csv_reader):
        if line_number in lines_list:
            yield line_number, row
So when you want to process the lines, you would do:
L = [2, 5, 15, 98, ...]
with open('~/file.csv') as f:
    r = csv.DictReader(f)
    for line_number, line in read_my_lines(r, L):
        do_something_with_line(line)
* Edit *
This could further be improved to stop reading the file when you've read all the lines you wanted:
def read_my_lines(csv_reader, lines_list):
    # make sure every line number shows up only once:
    lines_set = set(lines_list)
    for line_number, row in enumerate(csv_reader):
        if line_number in lines_set:
            yield line_number, row
            lines_set.remove(line_number)
            # Stop when the set is empty. Use return, not raise StopIteration:
            # since Python 3.7 (PEP 479) raising StopIteration inside a
            # generator turns into a RuntimeError.
            if not lines_set:
                return
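A self-contained Python 3 sketch of the early-stopping generator follows, run against an in-memory CSV so the behaviour can be checked. The name read_selected_rows, the sample data and the leftover check are my own illustration; indices are 0-based over the data rows, since DictReader consumes the header itself.

```python
import csv
import io

def read_selected_rows(csv_reader, wanted):
    """Yield (index, row) for each 0-based data-row index in `wanted`."""
    remaining = set(wanted)
    if not remaining:
        return
    for index, row in enumerate(csv_reader):
        if index in remaining:
            yield index, row
            remaining.remove(index)
            if not remaining:
                return  # a plain return ends the generator cleanly

# io.StringIO stands in for the large csv file.
sample = io.StringIO(
    "Col 1,Col 2,Col 3\n"
    "row11,row12,row13\n"
    "row21,row22,row23\n"
    "row31,row32,row33\n"
    "row41,row42,row43\n"
)
reader = csv.DictReader(sample)
selected = list(read_selected_rows(reader, [0, 2]))
for index, row in selected:
    print(index, row["Col 1"])

# The generator stopped right after index 2, so the last row is unread.
leftover = [row["Col 1"] for row in reader]
```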
Assuming L is a list containing the line numbers you want, you could do :
with open("~/file.csv") as f:
    r = csv.DictReader(f)
    for i, line in enumerate(r):
        if i in L: # or (i+2) in L: from your second example
            print line
That way :
you read the file only once
you do not load the whole file in memory
you only get the lines you are interested in
The only caveat is that you read the whole file even if L = [3]
for row in enumerate(r):
will pull tuples. You are then trying to select your ith element from a 2 element tuple.
for example
>> for i in enumerate({"a":1, "b":2}): print i
(0, 'a')
(1, 'b')
Additionally, since dictionaries are hash tables, your initial order is not necessarily preserved (this applies to Python versions before 3.7; from 3.7 on, dicts keep insertion order). For instance:
>>list({"a":1, "b":2, "c":3, "d":5})
['a', 'c', 'b', 'd']
Just to sum up the great ideas, I ended up using something like this: L can be sorted relatively quickly, and in my case it was actually already sorted. So, instead of several membership checks in L it pays off to sort it and then only check each index against the first entry of it. Here is my piece of code:
count=0
with open('~/file.csv') as f:
    r = csv.DictReader(f)
    for row in r:
        count += 1
        if L == []:
            break
        elif count == L[0]:
            print (row)
            L.pop(0)
Note that this stops as soon as we've gone through L once.

Simple way to "mix" respectively two lists in python?

I have the following issue:
I need to "mix" respectively two lists in python...
I have this:
names = open('contactos.txt')
numbers = open('numeros.txt')
names1 = []
numbers1= []
for line in numbers:
    numberdata = line.strip()
    numbers1.append(numberdata)
print numbers1
for line in names:
    data = line.strip()
    names1.append(data)
print names1
names.close()
numbers.close()
This prints about 300 numbers first and then the respective 300 names. What I need to do is to make a new file (txt) that has each name and its number on one line, separated by a comma (,), like this:
Name1,64673635
Name2,63513635
Name3,67867635
Name4,12312635
Name5,78679635
Name6,63457635
Name7,68568635
..... and so on...
I hope you can help me do this, I've tried with "for"s but I'm not sure on how to do it if I'm iterating two lists at once, thank you :)
Utilize zip on the two lists you have already built:
for name, num in zip(names1, numbers1):
    print('{0},{1}'.format(name, num))
zip will combine the two lists together, letting you write them to a file:
In [1]: l1 = ['one', 'two', 'three']
In [2]: l2 = [1, 2, 3]
In [3]: zip(l1, l2)
Out[3]: [('one', 1), ('two', 2), ('three', 3)]
However you can save yourself a bit of time. In your code, you are iterating over each file separately, creating a list from each. You could also iterate over both at the same time, creating your list in one sweep:
results = []
with open('contactos.txt') as c:
    with open('numeros.txt') as n:
        for line in c:
            results.append([line.strip(), n.readline().strip()])
print results
This uses a with statement (context manager), that essentially handles the closing of files for you. This will iterate through contactos, reading a line from numeros and appending the pair to the list. You can even cut out the list step and write directly to your third file in the same loop:
with open('output.txt', 'w') as output:
    with open('contactos.txt', 'r') as c:
        with open('numeros.txt', 'r') as n:
            for line in c:
                output.write('{0}, {1}\n'.format(line.strip(), n.readline().strip()))
A "pythonic" way to mix two lists in this way is the zip function!
names = open('contactos.txt')
numbers = open('numeros.txt')
names1 = []
numbers1= []
for line in numbers:
    numberdata = line.strip()
    numbers1.append(numberdata)
for line in names:
    data = line.strip()
    names1.append(data)
names.close()
numbers.close()
for name, number in zip(names1, numbers1):
    print '%s, %s' % (name, number)
There are other and better ways to print formatted text (e.g. Yuushi's answer). I also like to use the with statement and list comprehensions, e.g.
with open('contactos.txt') as f:
    names = [line.strip() for line in f]
with open('numeros.txt') as f:
    numbers = [line.strip() for line in f]
for name, number in zip(names, numbers):
    print '%s, %s' % (name, number)
Finally, I just want to comment on how you could do it without the zip function. I'm not sure what you want to do if there are a different number of numbers and names, but you can use a for loop like this for the last bit to access the values from both lists in a single for loop:
for i in range(len(numbers)):
    print '%s, %s' % (names[i], numbers[i])
This code in particular will throw an IndexError if there are more numbers than names (and it silently ignores any extra names), so you would probably want to add some code to handle that.
Everything at once:
import itertools
with open('contactos.txt') as names, open('numeros.txt') as numbers:
    for name, num in itertools.izip(names, numbers):
        print '%s, %s' % (name.strip(), num.strip())
This reads the two files in parallel, rather than reading each file completely into memory one at a time.
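In Python 3 the built-in zip is already lazy, so the same parallel read can write the requested name,number file directly. A minimal sketch: merge_files is a hypothetical helper, and the temporary stand-in files exist only so the example runs on its own.

```python
import os
import tempfile

def merge_files(names_path, numbers_path, output_path):
    """Write one 'name,number' line per pair of input lines."""
    with open(names_path) as names, \
         open(numbers_path) as numbers, \
         open(output_path, "w") as output:
        # zip pairs the files lazily, line by line, and stops at the
        # shorter one; extra lines in the longer file are ignored.
        for name, number in zip(names, numbers):
            output.write("{},{}\n".format(name.strip(), number.strip()))

# Tiny stand-in files for contactos.txt and numeros.txt.
workdir = tempfile.mkdtemp()
names_path = os.path.join(workdir, "contactos.txt")
numbers_path = os.path.join(workdir, "numeros.txt")
output_path = os.path.join(workdir, "output.txt")
with open(names_path, "w") as f:
    f.write("Name1\nName2\n")
with open(numbers_path, "w") as f:
    f.write("64673635\n63513635\n")

merge_files(names_path, numbers_path, output_path)
with open(output_path) as f:
    merged = f.read()
print(merged)
```

If the two files might differ in length and the extra lines matter, itertools.zip_longest with a fill value is one way to keep them.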
