How to read specific lines of a large csv file

I am trying to read some specific rows of a large csv file, and I don't want to load the whole file into memory. The indices of the specific rows are given in a list L = [2, 5, 15, 98, ...] and my csv file looks like this:
Col 1, Col 2, Col3
row11, row12, row13
row21, row22, row23
row31, row32, row33
...
Using the ideas mentioned here I use the following command to read the rows
with open('~/file.csv') as f:
    r = csv.DictReader(f)  # I need to read it as a dictionary for my purpose
    for i in L:
        for row in enumerate(r):
            print row[i]
I immediately get the following error:
IndexError Traceback (most recent call last)
<ipython-input-25-78951a0d4937> in <module>()
6 for i in L:
7 for row in enumerate(r):
----> 8 print row[i]
IndexError: tuple index out of range
Question 1. It seems like my use of the for loops here is obviously wrong. Any ideas on how to fix this?
On the other hand, the following gets the job done, but it's too slow:
def read_csv_line(line_number):
    with open("~/file.csv") as f:
        r = csv.DictReader(f)
        for i, line in enumerate(r):
            if i == (line_number - 2):
                return line
    return None

for i in L:
    print read_csv_line(i)
Question 2. Any idea on how to improve this basic method of going through the whole file until I reach row i then print it?

A file doesn't have "lines" or "rows". What you consider a "line" is "what is found between two newline characters". As such you cannot read the nth line without reading the lines before it, as you couldn't count the newline characters.
Answer 1: if you consider your example, but with L=[9], unrolling your loops would give:
i=9
row = (0, {'Col 2': 'row12', 'Col 3': 'row13', 'Col 1': 'row11'})
As you can see, row is a tuple with two members, calling row[i] means row[9], hence the IndexError.
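A minimal sketch of the tuple shape described above, written for Python 3. The in-memory io.StringIO sample stands in for the real file and is an assumption made for illustration only:

```python
import csv
import io

# Hypothetical in-memory stand-in for the CSV file from the question.
sample = io.StringIO("Col 1,Col 2,Col 3\nrow11,row12,row13\nrow21,row22,row23\n")

reader = csv.DictReader(sample)
pairs = list(enumerate(reader))

# Each item produced by enumerate() is a 2-tuple: (position, row dictionary).
first = pairs[0]
index, row = first          # unpack the pair instead of indexing it with i
print(len(first))           # 2
print(index, row["Col 2"])  # 0 row12
```

Unpacking the pair as `index, row` is usually what is wanted here: `index` can be compared against the entries of L, and `row` is the dictionary itself.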
Answer 2: This is very slow because you are reading the file up to the line number every time. In your example, you read the first 2 lines, then the first 5, then the first 15, then the first 98, etc. So you've read the first 5 lines 3 times. You could create a generator that only returns the lines you want (beware, line numbers would be 0-indexed):
def read_my_lines(csv_reader, lines_list):
    for line_number, row in enumerate(csv_reader):
        if line_number in lines_list:
            yield line_number, row
So when you want to process the lines, you would do:
L = [2, 5, 15, 98, ...]
with open('~/file.csv') as f:
    r = csv.DictReader(f)
    for line_number, line in read_my_lines(r, L):
        do_something_with_line(line)
* Edit *
This could further be improved to stop reading the file when you've read all the lines you wanted:
def read_my_lines(csv_reader, lines_list):
    # make sure every line number shows up only once:
    lines_set = set(lines_list)
    for line_number, row in enumerate(csv_reader):
        if line_number in lines_set:
            yield line_number, row
            lines_set.remove(line_number)
            # Stop when the set is empty. Use return rather than
            # raise StopIteration, which turns into a RuntimeError inside
            # a generator on Python 3.7+ (PEP 479).
            if not lines_set:
                return

Assuming L is a list containing the line numbers you want, you could do:
with open("~/file.csv") as f:
    r = csv.DictReader(f)
    for i, line in enumerate(r):
        if i in L:  # or (i+2) in L: from your second example
            print(line)
That way:
you read the file only once
you do not load the whole file in memory
you only get the lines you are interested in
The only caveat is that you read the whole file even if L = [3].

for row in enumerate(r):
will pull tuples. You are then trying to select the ith element from a 2-element tuple.
For example:
>>> for i in enumerate({"a":1, "b":2}): print i
(0, 'a')
(1, 'b')
Additionally, before Python 3.7 dictionaries are not guaranteed to preserve insertion order, so your initial order is not necessarily kept. For instance:
>>> list({"a":1, "b":2, "c":3, "d":5})
['a', 'c', 'b', 'd']

Just to sum up the great ideas, I ended up using something like this: L can be sorted relatively quickly, and in my case it was actually already sorted. So, instead of several membership checks in L it pays off to sort it and then only check each index against the first entry of it. Here is my piece of code:
count = 0
with open('~/file.csv') as f:
    r = csv.DictReader(f)
    for row in r:
        count += 1
        if L == []:
            break
        elif count == L[0]:
            print (row)
            L.pop(0)
Note that this stops as soon as we've gone through L once.
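The same idea can be sketched as a generator that does not modify L. This is only an illustration under stated assumptions: the function name read_sorted_rows is hypothetical, the indices are 0-based positions of data rows (unlike the 1-based count above), and the index list is assumed to be sorted ascending with no duplicates.

```python
import csv
import io

def read_sorted_rows(csv_reader, sorted_indices):
    """Yield (index, row) for each 0-based data-row index in sorted_indices.

    Assumes sorted_indices is sorted ascending and has no duplicates.
    """
    wanted = iter(sorted_indices)
    target = next(wanted, None)
    if target is None:
        return  # nothing requested, so do not read the file at all
    for index, row in enumerate(csv_reader):
        if index == target:
            yield index, row
            target = next(wanted, None)
            if target is None:
                return  # every requested row was found, stop reading

# In-memory sample data standing in for the large file.
sample = io.StringIO("a,b\n" + "".join("r{0},v{0}\n".format(n) for n in range(10)))
selected = list(read_sorted_rows(csv.DictReader(sample), [2, 5]))
print([(i, row["a"]) for i, row in selected])  # [(2, 'r2'), (5, 'r5')]
```

Each row is compared against a single pending index, so the cost per row does not depend on the length of the index list, and the caller's list is left untouched.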

Related

Remove duplicates from large list but remove both if it does exist?

So I have a text file like this
123
1234
123
1234
12345
123456
You can see 123 appears twice so both instances should be removed. but 12345 appears once so it stays. My text file is about 70,000 lines.
Here is what I came up with.
file = open("test.txt",'r')
lines = file.read().splitlines()  # to ignore the '\n' and turn to list structure
for appId in lines:
    if(lines.count(appId) > 1):  # if element count is not unique remove both elements
        lines.remove(appId)  # first instance removed
        lines.remove(appId)  # second instance removed
writeFile = open("duplicatesRemoved.txt",'a')  # output the left over unique elements to file
for element in lines:
    writeFile.write(element + "\n")
When I run this I feel like my logic is correct, but I know for a fact the output is supposed to be around 950 elements, and I'm still getting 23000 elements in my output, so a lot is not getting removed. Any ideas where the bug could reside?
Edit: I forgot to mention that an element can appear at most twice.

Use Counter from built in collections:
In [1]: from collections import Counter
In [2]: a = [123, 1234, 123, 1234, 12345, 123456]
In [3]: a = Counter(a)
In [4]: a
Out[4]: Counter({123: 2, 1234: 2, 12345: 1, 123456: 1})
In [5]: a = [k for k, v in a.items() if v == 1]
In [6]: a
Out[6]: [12345, 123456]
For your particular problem I will do it like this:
from collections import defaultdict
out = defaultdict(int)
with open('input.txt') as f:
    for line in f:
        out[line.strip()] += 1
with open('out.txt', 'w') as f:
    for k, v in out.items():
        if v == 1:  # here you use logic suitable for what you want
            f.write(k + '\n')

Be careful about removing elements from a list while still iterating over that list. This changes the behavior of the list iterator, and can make it skip over elements, which may be part of your problem.
Instead, I suggest creating a filtered copy of the list using a list comprehension: rather than removing elements that appear more than once, keep only the elements that appear exactly once:
with open("test.txt", 'r') as file:
    lines = file.read().splitlines()
unique_lines = [line for line in lines if lines.count(line) == 1]  # keep lines that appear exactly once
with open("duplicatesRemoved.txt", "w") as writefile:
    # splitlines() dropped the newlines, so add them back when writing
    writefile.writelines(line + "\n" for line in unique_lines)
You could also easily modify the condition to keep other counts, for example lines.count(line) <= 2 for lines that appear at most twice. Note that lines.count() rescans the whole list for every line, so this gets slow on large files.
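The skipping problem mentioned at the start of this answer can be reproduced on a small list. The sample values below are made up for illustration; the second half shows one possible fix using collections.Counter, which counts every line in a single pass:

```python
from collections import Counter

items = ["a", "b", "c", "a", "c", "b", "d"]

# Buggy pattern: remove from the same list the for loop is walking over.
buggy = list(items)
for value in buggy:
    if buggy.count(value) > 1:
        buggy.remove(value)
        buggy.remove(value)
print(buggy)  # ['b', 'b', 'd'] : both copies of 'b' were skipped

# Safer pattern: count first, then build a new list.
counts = Counter(items)
fixed = [value for value in items if counts[value] == 1]
print(fixed)  # ['d']
```

After each pair of removals the remaining elements shift left while the loop's internal position keeps advancing, which is why some pairs are never examined.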

You can count all of the elements and store the counts in a dictionary:
dic = {a: lines.count(a) for a in lines}
Then remove all the duplicated ones from the list:
for k in dic:
    if dic[k] > 1:
        while k in lines:
            lines.remove(k)
NOTE: The while loop is needed because lines.remove(k) removes only the first k value from the list, so it must be repeated until there is no k value left in the list.
If the for loop is complicated, you can use the dictionary in another way to get rid of duplicated values:
lines = [k for k, v in dic.items() if v == 1]

Need help deleting repeating lines in txt file

I need to write an output file that contains a single list with no duplicates. The list I am using has about 100k emails with around 1000x repeats, and I want to remove those.
I have tried some code I found online, but nothing is written to my new file and PyCharm just freezes when running it:
def uniquelines(lineslist):
    unique = {}
    result = []
    for item in lineslist:
        if item.strip() in unique: continue
        unique[item.strip()] = 1
        result.append(item)
    return result

file1 = open("wordlist.txt","r")
filelines = file1.readlines()
file1.close()
output = open("wordlist_unique.txt","w")
output.writelines(uniquelines(filelines))
output.close()
I expect it to just print all the emails with none repeating into a new text file

Before I get into the few ways to hopefully solve the issue, one thing I see off the bat is that you are using both a dictionary and a list within your function. This almost doubles the memory you will need to process things. I suggest using one or the other.
Using a set will provide you with a guaranteed "list" of unique items. The set.add() function will ignore duplicates.
s = {1, 2, 3}
print(s) #{1, 2, 3}
s.add(4)
print(s) #{1, 2, 3, 4}
s.add(4)
print(s) #{1, 2, 3, 4}
With that, you can modify your function to the following to achieve what you want. For my example, I have input.txt as a series of lines just containing a single integer value with plenty of duplicates.
def uniquelines(lineslist):
    unique = set()
    for line in lineslist:
        unique.add(str(line).strip())
    return list(unique)

with open('input.txt', 'r') as f:
    lines = f.readlines()
output = uniquelines(lines)
with open('output.txt', 'w') as f:
    f.write("\n".join(output))
output.txt is as follows without any duplicates!
2
0
4
5
3
1
9
6
You can accomplish the same thing by calling set() on a list comprehension, but the disadvantage here is that you will need to load all the records into memory first and then pull out the duplicates. The method above holds only the unique values, not the duplicates, so depending on the size of your data, you probably want to use the function.
with open('input.txt', 'r') as f:
    lines = f.readlines()
output = set([l.strip() for l in lines])
with open('output.txt', 'w') as f:
    f.write("\n".join(output))
I couldn't quite tell if you were looking to maintain a running count of how many times each unique line occurred. If that's what you're going for, then you can use the in operator to see if it is in the keys already.
def uniquelines(lineslist):
    unique = {}
    for line in lineslist:
        line = line.strip()
        if line in unique:
            unique[line] += 1
        else:
            unique[line] = 1
    return unique
# {'9': 2, '0': 3, '4': 3, '1': 1, '3': 4, '2': 1, '6': 3, '5': 1}
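Since a set does not keep the original order, here is a hedged sketch of a variant that writes each line the first time it is seen, so the input order is preserved and only the unique values are kept in memory. The function name write_unique_lines and the sample addresses are hypothetical; io.StringIO objects stand in for the real input and output files.

```python
import io

def write_unique_lines(source, destination):
    """Copy lines from source to destination, skipping repeats.

    Keeps the first occurrence of each (stripped) line, in input order,
    and returns the number of unique lines written.
    """
    seen = set()
    for line in source:
        key = line.strip()
        if key not in seen:
            seen.add(key)
            destination.write(key + "\n")
    return len(seen)

# In-memory stand-ins for the input and output files.
src = io.StringIO("a@x.com\nb@x.com\na@x.com\nc@x.com\nb@x.com\n")
dst = io.StringIO()
written = write_unique_lines(src, dst)
print(written)         # 3
print(dst.getvalue())
```

With real files the same function can be called as write_unique_lines(open_input, open_output) inside a with block, which reads the input one line at a time instead of calling readlines().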

How can I customize map() for a list of strings in Python?

How do I tell map() to selectively convert only some of the strings (not all the strings) within a list to integer values?
Input file (tab-delimited):
abc1 34 56
abc1 78 90
My attempt:
import csv
with open('file.txt') as f:
    start = csv.reader(f, delimiter='\t')
    for row in start:
        X = map(int, row)
        print X
Error message: ValueError: invalid literal for int() with base 10: 'abc1'
When I read in the file with the csv module, it is a list of strings:
['abc1', '34', '56']
['abc1', '78', '90']
map() obviously does not like 'abc1' even though it is a string just like '34' is a string.
I thoroughly examined Convert string to integer using map() but it did not help me deal with the first column of my input file.

def safeint(val):
    try:
        return int(val)
    except ValueError:
        return val

for row in start:
    X = map(safeint, row)
    print X
is one way to do it ... you can step it up even more
from functools import partial
myMapper = partial(map,safeint)
map(myMapper,start)
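On Python 3, map() returns a lazy iterator rather than a list, so printing its result directly shows a map object instead of the converted values. A small sketch of the same safeint idea with list comprehensions, using the two rows from the question as in-memory sample data:

```python
def safeint(val):
    """Return val converted to int, or val unchanged if it is not numeric."""
    try:
        return int(val)
    except ValueError:
        return val

# The rows as csv.reader would produce them for the sample input.
rows = [["abc1", "34", "56"], ["abc1", "78", "90"]]

# A nested comprehension builds real lists on both Python 2 and Python 3.
converted = [[safeint(cell) for cell in row] for row in rows]
print(converted)  # [['abc1', 34, 56], ['abc1', 78, 90]]
```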

Map only the part of the list that interests you:
row[1:] = map(int, row[1:])
print row
Here, row[1:] is a slice of the list that starts at the second element (the one with index 1) up to the end of the list.

I like Roberto Bonvallet's answer, but if you want to do things immutably, as you're doing in your question, you can:
import csv
with open('file.txt') as f:
    start = csv.reader(f, delimiter='\t')
    for row in start:
        # list() is needed on Python 3, where map() returns an iterator
        X = [row[0]] + list(map(int, row[1:]))
        print(X)
… or…
numeric_cols = (1, 2)
X = [int(value) if col in numeric_cols else value
     for col, value in enumerate(row)]
… or, probably most readably, wrap that up in a map_partial function, so you can do this:
X = map_partial(int, (1, 2), row)
You could implement it as:
def map_partial(func, indices, iterable):
    return [func(value) if i in indices else value
            for i, value in enumerate(iterable)]
If you want to be able to access all of the rows after you're done, you can't just print each one, you have to store it in some kind of structure. What structure you want depends on how you want to refer to these rows later.
For example, maybe you just want a list of rows:
rows = []
with open('file.txt') as f:
    for row in csv.reader(f, delimiter='\t'):
        rows.append(map_partial(int, (1, 2), row))
print('The second column of the first row is {}'.format(rows[0][1]))
Or maybe you want to be able to look them up by the string ID in the first column, rather than by index. Since those IDs aren't unique, each ID will map to a list of rows:
rows = {}
with open('file.txt') as f:
    for row in csv.reader(f, delimiter='\t'):
        rows.setdefault(row[0], []).append(map_partial(int, (1, 2), row))
print('The second column of the first abc1 row is {}'.format(rows['abc1'][0][1]))

Python - Importing strings into a list, into another list :)

Basically I want to read strings from a text file, put them in lists three by three, and then put all those three by three lists into another list. Actually let me explain it better :)
Text file (just an example, I can structure it however I want):
party
sleep
study
--------
party
sleep
sleep
-----
study
sleep
party
---------
etc
From this, I want Python to create a list that looks like this:
List1 = [['party','sleep','study'],['party','sleep','sleep'],['study','sleep','party']etc]
But it's super hard. I was experimenting with something like:
test2 = open('test2.txt','r')
List=[]
for line in 'test2.txt':
    a = test2.readline()
    a = a.replace("\n","")
    List.append(a)
    print(List)
But this just does horrible horrible things. How to achieve this?

If you want to group the data in groups of 3, assuming the data in the text file is not grouped by any separator:
You need to read the file sequentially and create a list. To group it you can use any of the known grouper algorithms.
import pprint
from itertools import izip, imap  # Python 2 only; on Python 3 use the built-in zip and map
with open("test.txt") as fin:
    data = list(imap(list, izip(*[imap(str.strip, fin)]*3)))
pprint.pprint(data)
[['party', 'sleep', 'study'],
 ['party', 'sleep', 'sleep'],
 ['study', 'sleep', 'party']]
Steps of Execution
Create a Context Manager with the file object.
Strip each line. (Remove newline)
Using zip on the iterator list of size 3, ensures that the items are grouped as tuples of three items
Convert tuples to list
Convert the resulting iterator to a list.
Since every step is a lazy iterator, it's all done in a single pass over the file.
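The steps above can be sketched for Python 3, where the built-in zip and map are already lazy. The helper name group_lines is hypothetical, and an in-memory io.StringIO sample replaces the real file:

```python
import io

def group_lines(lines, size=3):
    """Group stripped lines into lists of `size` consecutive items."""
    stripped = (line.strip() for line in lines)
    # The same iterator is repeated `size` times, so zip pulls `size`
    # consecutive items from it for every tuple it builds.
    return [list(chunk) for chunk in zip(*[stripped] * size)]

sample = io.StringIO(
    "party\nsleep\nstudy\nparty\nsleep\nsleep\nstudy\nsleep\nparty\n"
)
data = group_lines(sample)
print(data)
```

One design point to be aware of: zip stops at the last complete group, so any leftover lines that do not fill a group of three are silently dropped.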
Instead, if your data is separated and grouped by a delimiter ------ you can use the itertools.groupby solution
import pprint
from itertools import imap, groupby  # imap is Python 2 only; on Python 3 use the built-in map
class Key(object):
    def __init__(self, sep):
        self.sep = sep
        self.count = 0
    def __call__(self, line):
        if line == self.sep: self.count += 1
        return self.count
with open("test.txt") as fin:
    data = [[e for e in v if "----------" not in e]
            for k, v in groupby(imap(str.strip, fin), key = Key("----------"))]
pprint.pprint(data)
[['party', 'sleep', 'study'],
 ['party', 'sleep', 'sleep'],
 ['study', 'sleep', 'party']]
Steps of Execution
Create a Key class to increase a counter whenever the separator is encountered. Calling the object returns the current counter every time, in addition to conditionally increasing it.
Create a Context Manager with the file object.
Strip each line. (Remove newline)
Group the data using itertools.groupby and using your custom key
Remove the separator from the grouped data and create a list of the groups.

You can try with this:
res = []
tmp = []
for i, line in enumerate(open('file.txt'), 1):
    tmp.append(line.strip())
    if i % 3 == 0:
        res.append(tmp)
        tmp = []
print(res)
I've assumed that you don't have the dashes.
Edit:
Here is an example for when you have dashes:
res = []
tmp = []
# Count lines from 1 so that every fourth line is the dash separator.
for i, line in enumerate(open('file.txt'), 1):
    if i % 4 == 0:
        res.append(tmp)
        tmp = []
        continue
    tmp.append(line.strip())
print(res)

First big problem:
for line in 'test2.txt':
gives you
't', 'e', 's', 't', '2', '.', 't', 'x', 't'
You need to loop through the file you open:
for line in test2:
Or, better:
with open("test2.txt", 'r') as f:
    for line in f:
Next, you need to do one of two things:
If the line contains "-----", create a new sub-list (myList.append([]))
Otherwise, append the line to the last sub-list in your list (myList[-1].append(line))
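The two rules above can be sketched as follows. The function name split_on_separator and the marker default are assumptions made for illustration, and the sample text mirrors the file from the question:

```python
import io

def split_on_separator(lines, marker="-----"):
    """Split lines into sub-lists, starting a new one at each dashed line."""
    groups = [[]]
    for raw in lines:
        line = raw.strip()
        if marker in line:
            groups.append([])        # separator: start a new sub-list
        elif line:
            groups[-1].append(line)  # ordinary line: add to the last sub-list
    # Drop empty groups left by leading, trailing or repeated separators.
    return [group for group in groups if group]

sample = io.StringIO(
    "party\nsleep\nstudy\n--------\n"
    "party\nsleep\nsleep\n-----\n"
    "study\nsleep\nparty\n---------\n"
)
result = split_on_separator(sample)
print(result)
```

Because the check is `marker in line`, separators of different lengths (as in the example file) are all recognised as long as they contain at least five dashes.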
Finally, your print at the end should not be so far indented; currently, it prints for every line, rather than just when the processing is complete.
    List.append(a)
print(List)
Perhaps a better structure for your file would be:
party,sleep,study
party,sleep,sleep
...
Now each line is a sub-list:
for line in f:
    # strip() removes the trailing newline before splitting
    myList.append(line.strip().split(','))

Simple way to "mix" respectively two lists in python?

I have the following issue:
I need to "mix" respectively two lists in python...
I have this:
names = open('contactos.txt')
numbers = open('numeros.txt')
names1 = []
numbers1= []
for line in numbers:
    numberdata = line.strip()
    numbers1.append(numberdata)
print numbers1
for line in names:
    data = line.strip()
    names1.append(data)
print names1
names.close()
numbers.close()
This prints about 300 numbers first, and the respective 300 names later. What I need to do is make a new file (txt) that has the names and the numbers on one line, separated by a comma (,), like this:
Name1,64673635
Name2,63513635
Name3,67867635
Name4,12312635
Name5,78679635
Name6,63457635
Name7,68568635
..... and so on...
I hope you can help me do this, I've tried with "for"s but I'm not sure on how to do it if I'm iterating two lists at once, thank you :)

Utilize zip:
# names1 and numbers1 are the stripped lists built in the question
for name, num in zip(names1, numbers1):
    print('{0},{1}'.format(name, num))

zip will combine the two lists together, letting you write them to a file:
In [1]: l1 = ['one', 'two', 'three']
In [2]: l2 = [1, 2, 3]
In [3]: list(zip(l1, l2))
Out[3]: [('one', 1), ('two', 2), ('three', 3)]
However you can save yourself a bit of time. In your code, you are iterating over each file separately, creating a list from each. You could also iterate over both at the same time, creating your list in one sweep:
results = []
with open('contactos.txt') as c:
    with open('numeros.txt') as n:
        for line in c:
            results.append([line.strip(), n.readline().strip()])
print results
This uses a with statement (context manager), that essentially handles the closing of files for you. This will iterate through contactos, reading a line from numeros and appending the pair to the list. You can even cut out the list step and write directly to your third file in the same loop:
with open('output.txt', 'w') as output:
    with open('contactos.txt', 'r') as c:
        with open('numeros.txt', 'r') as n:
            for line in c:
                output.write('{0}, {1}\n'.format(line.strip(), n.readline().strip()))

A "pythonic" way to mix two lists in this way is the zip function!
names = open('contactos.txt')
numbers = open('numeros.txt')
names1 = []
numbers1= []
for line in numbers:
    numberdata = line.strip()
    numbers1.append(numberdata)
for line in names:
    data = line.strip()
    names1.append(data)
names.close()
numbers.close()
for name, number in zip(names1, numbers1):
    print '%s, %s' % (name, number)
There are other and better ways to print formatted text (e.g. Yuushi's answer). I also like to use the with statement and list comprehensions, e.g.
with open('contactos.txt') as f:
    names = [line.strip() for line in f]
with open('numeros.txt') as f:
    numbers = [line.strip() for line in f]
for name, number in zip(names, numbers):
    print '%s, %s' % (name, number)
Finally, I just want to comment on how you could do it without the zip function. I'm not sure what you want to do if there are a different number of numbers and names, but you can use a for loop like this for the last bit to access the values from both lists in a single for loop:
for i in range(len(numbers)):
    print '%s, %s' % (names[i], numbers[i])
This code in particular will throw an IndexError if there are more numbers than names (and will silently ignore any extra names), so you would probably want to add some code to handle that.
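One possible way to handle lists of different lengths, sketched for Python 3 with itertools.zip_longest. The sample values are taken from the question's example output, and the empty-string filler is an arbitrary choice made for illustration:

```python
from itertools import zip_longest

# Sample data with one more name than numbers.
names = ["Name1", "Name2", "Name3"]
numbers = ["64673635", "63513635"]

# zip stops at the shorter list, silently dropping "Name3".
paired = ["{0},{1}".format(name, number) for name, number in zip(names, numbers)]
print(paired)  # ['Name1,64673635', 'Name2,63513635']

# zip_longest pads the shorter list with fillvalue instead.
padded = ["{0},{1}".format(name, number)
          for name, number in zip_longest(names, numbers, fillvalue="")]
print(padded)  # ['Name1,64673635', 'Name2,63513635', 'Name3,']
```

Whether to drop, pad, or reject mismatched entries depends on what the output file is used for; comparing len(names) and len(numbers) up front is another simple option.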

Everything at once:
import itertools
with open('contactos.txt') as names, open('numeros.txt') as numbers:
    # izip is Python 2 only; on Python 3 the built-in zip is already lazy
    for num, name in itertools.izip(numbers, names):
        print '%s, %s' % (num.strip(), name.strip())
This reads the two files in parallel, rather than reading each file completely into memory one at a time.
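A Python 3 sketch of the same parallel read that also writes the requested name,number lines to a third file. The helper name merge_files is hypothetical, and temporary files with made-up contents are used so the example is self-contained:

```python
import os
import tempfile

def merge_files(names_path, numbers_path, output_path):
    """Write one 'name,number' line per pair of input lines."""
    with open(names_path) as names, open(numbers_path) as numbers, \
            open(output_path, "w") as output:
        # zip pulls one line from each file at a time, so neither file
        # is loaded into memory as a whole.
        for name, number in zip(names, numbers):
            output.write("{0},{1}\n".format(name.strip(), number.strip()))

# Demonstration with temporary files; the contents are sample data.
with tempfile.TemporaryDirectory() as folder:
    names_path = os.path.join(folder, "contactos.txt")
    numbers_path = os.path.join(folder, "numeros.txt")
    output_path = os.path.join(folder, "output.txt")
    with open(names_path, "w") as f:
        f.write("Name1\nName2\n")
    with open(numbers_path, "w") as f:
        f.write("64673635\n63513635\n")
    merge_files(names_path, numbers_path, output_path)
    with open(output_path) as f:
        merged = f.read()

print(merged)
```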
