Python simple nested loop

I'm trying to do something very simple: I have a recurring csv file that may have repetitions of emails and I need to find how many times each email is repeated, so I did as follows:
import csv

file = open('click.csv')
reader = csv.reader(file)
for row in reader:
    email = row[0]
    print(email)  # just to check which value is parsing
    counter = 1
    for row in reader:
        if email == row[0]:
            counter += 1
    print(counter)  # just to check if it counted correctly
and it only prints out:
firstemailaddress
2
Indeed there are 2 occurrences of the first email, but somehow this stops after the first email in the csv file.
So I simplified it to
for row in reader:
    email = row[0]
    print(email)
and this indeed prints out all the email addresses in the csv file.
This is a simple nested loop, so what's the deal here?
Of course, just counting occurrences could be done without a script, but I then have to process those emails and the data related to them and merge it with another csv file, which is why I'm writing one.
Many thanks,

The problem with your first snippet comes down to a misunderstanding of iterators, or how csv.reader works.
Your reader object is an iterator. That means it yields rows and, similar to a generator object, it has a certain "state" between iterations. Every time you iterate - in this case over rows - you are "consuming" the next available row, until all rows are consumed and the iterator is entirely exhausted. Here's an example of a different kind of iterator being exhausted:
Imagine you have a text file, file.txt with these lines:
hello
world
this
is
a
test
Then this code:
with open("file.txt", "r") as file:
print("printing all lines for the first time:")
for line in file:
# strip the trailing newline character
print(line.rstrip())
print("printing all lines for the second time:")
for line in file:
# strip the trailing newline character
print(file.rstrip())
print("Done!")
Output:
printing all lines for the first time:
hello
world
this
is
a
test
printing all lines for the second time:
Done!
If this output surprises you, then it's because you've misunderstood how iterators work. In this case, file is an iterator that yields lines. The first for-loop exhausts all available lines in the file iterator. This means the iterator is already exhausted by the time we reach the second for-loop, and there are no lines left to print.
The same thing is true for your reader. Your outer for-loop consumes one row, and then the inner for-loop consumes all of the remaining rows. You can expect your code to behave strangely when you consume rows this way.
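If you genuinely need two passes, one simple option (a minimal sketch, not from the original answer) is to materialize the rows into a list first, since a list can be iterated any number of times:

import csv

with open('click.csv') as f:
    rows = list(csv.reader(f))  # consume the reader once, keep the rows

for row in rows:
    print(row[0])  # first pass works

for row in rows:
    print(row[0])  # second pass works too, unlike with the raw reader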

You cannot use the reader that way - it is stream based and cannot be "wound back" as you try to do. You also never close your file.
Reading the file multiple times is not needed - you can get all the information in one pass through your file, using a dictionary to count the email addresses:
# create demo file
with open("click.csv", "w") as f:
    f.write("email@somewhere, other , value, in , csv\n" * 4)
    f.write("otheremail@somewhere, other , value, in , csv\n" * 2)
Process demo file:
from collections import defaultdict
import csv

emails = defaultdict(int)

with open('click.csv') as file:
    reader = csv.reader(file)
    for row in reader:
        email = row[0]
        print(email)  # just to check which value is parsing
        emails[email] += 1

for adr, count in emails.items():
    print(adr, count)
Output:
email@somewhere 4
otheremail@somewhere 2
See:
Why can't I call read() twice on an open file?
defaultdict

As answered already, the problem is that reader is an iterator, so it is only good for a single pass. You can just put all the items in a container, like a list.
However, you only need a single pass to count things. Using a dict the most basic approach is:
counts = {}
for row in reader:
    email = row[0]
    if email in counts:
        counts[email] += 1
    else:
        counts[email] = 1
There are even cleaner ways. For example, using a collections.Counter object, which is just a dict specialized for counting:
import collections
counts = collections.Counter(row[0] for row in reader)
Or even:
counts = collections.Counter(email for email, *_ in reader)
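As a usage note (a small illustrative sketch, with made-up addresses): a Counter built this way also hands you the most frequent entries directly via most_common:

import collections

counts = collections.Counter(['a@x.com', 'b@x.com', 'a@x.com'])
print(counts.most_common(1))  # [('a@x.com', 2)]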

Try appending your email IDs to a list, then do the following:
import pandas as pd
email_list = ["abc@something.com", "xyz@something.com", "abc@something.com"]
series = pd.Series(email_list)
print(series.value_counts())
You will get output like:
abc@something.com 2
xyz@something.com 1
dtype: int64
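If the emails are already in a csv file, you could probably let pandas read it directly instead of building the list by hand (a sketch; header=None and the emails sitting in column 0 are assumptions about the file layout):

import pandas as pd

df = pd.read_csv('click.csv', header=None)  # assumes no header row
print(df[0].value_counts())  # column 0 assumed to hold the emails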

The problem is that reader is a handle on the file (it is a stream).
You can only walk through it in one direction and cannot go back, similar to how generators are "consumed" by walking through them once.
But what you need is to iterate again and again - IF you want to use nested for-loops.
In any case, that is not an efficient approach, because you would keep re-counting rows you have already counted once.
So for your purpose, the best option is to create a dictionary, go row by row, and if there is no entry in the dictionary for an email yet, create a new key for it with the count as the value.
import csv

file = open('click.csv')
reader = csv.reader(file)

email_counts = {}  # initiate dictionary
for row in reader:
    email_counts[row[0]] = email_counts.get(row[0], 0) + 1
That's it!
email_counts[row[0]] = ... assigns a new value for that particular email in the dictionary.
The whole trick is in email_counts.get(row[0], 0) + 1.
email_counts.get(row[0]) is nothing else than email_counts[row[0]],
but the additional argument 0 is the default value.
So this means: check if row[0] already has an entry in email_counts. If yes, return the value for row[0] in email_counts. Otherwise, if it is the first occurrence, return 0. Whatever is returned, increase it by 1. This does all the equality checking and correctly increases the count for the entry.
Finally email_counts will give you all entries with their counts.
And best of all: you only go through the file once!
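For example, to print the result sorted from most to least frequent (a small follow-up sketch, not part of the original answer):

for email, count in sorted(email_counts.items(), key=lambda kv: kv[1], reverse=True):
    print(email, count)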

Not sure I get your question, but if you want a count of emails you should not have those nested loops; just go with one loop and a dictionary, like:
cnt_mails[email] = cnt_mails.get(email, 0) + 1
This should store the counts. Your code is not working because you have two loops on the same iterator.
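Spelled out, that one loop might look like this (a minimal sketch; the filename and the email column position are taken from the question):

import csv

cnt_mails = {}
with open('click.csv') as f:
    for row in csv.reader(f):
        email = row[0]
        cnt_mails[email] = cnt_mails.get(email, 0) + 1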

The problem is that reader is an iterator and you are depleting it with your second loop.
If you did something like:
import csv

with open('click.csv') as file:
    lines = list(csv.reader(file))

for row in lines:
    email = row[0]
    print(email)  # just to check which value is parsing
    counter = 0   # start at 0: the inner loop now also counts this row itself
    for row in lines:
        if email == row[0]:
            counter += 1
    print(counter)  # just to check if it counted correctly
You should get what you are looking for.
A simpler implementation:
from collections import defaultdict
import csv

counter = defaultdict(int)

with open('click.csv') as file:
    reader = csv.reader(file)
    for row in reader:
        counter[row[0]] += 1

# counter is now a dictionary mapping each email address to the number of times it was seen.

Related

Need to read csv files in Python (when there are multiple input files)

I have a school assignment that is asking me to write a program that first reads in the name of an input file and then reads the file using the csv.reader() method. The file contains a list of words separated by commas. The program should output the words and their frequencies (the number of times each word appears in the file) without any duplicates.
I have been able to figure out how to do this somewhat for one specific input file, but the program needs to be able to read multiple input files. This is what I have so far:
import csv

with open('input1.csv', 'r') as input1file:
    csv_reader = csv.reader(input1file, delimiter=',')
    for row in csv_reader:
        new_row = set(row)
        for m in new_row:
            count = row.count(m)
            print(m, count)
This is what I get:
woman 1
man 2
Cat 1
Hello 1
boy 2
cat 2
dog 2
hey 2
hello 1
This works (almost) for the input1 file, except it changes the order each time I run it.
And I need it to work for two other input files?
sample CSV
hello,cat,man,hey,dog,boy,Hello,man,cat,woman,dog,Cat,hey,boy
See the code below for an example, I've commented it so you understand what it does and why.
As for why the order is different each time you run your implementation: that is due to the usage of set. A set is by definition unordered.
Also note that with your implementation you are passing over the rows twice, once to turn it into a set, and once more to count. Besides this, if the file contains more than one row, your logic would fail, as the counting part only gets reached when the last line of the file is read.
import csv

def count_things(filename):
    with open(filename) as infile:
        csv_reader = csv.reader(infile, delimiter=',')
        result = {}
        for row in csv_reader:
            # go over the row by element
            for element in row:
                # does it exist already?
                if element in result:
                    # if yes, increase count
                    result[element] += 1
                else:
                    # if no, add and set count to 1
                    result[element] = 1
    # sorting, explained in detail here:
    # https://stackoverflow.com/a/613218/9267296
    return {k: v for k, v in sorted(result.items(), key=lambda item: item[1], reverse=True)}
    # you could just return the unsorted result by using:
    # return result

for key, value in count_things("input1.csv").items():
    # iterate over items() by key/value pairs
    # see this link:
    # https://www.w3schools.com/python/python_dictionaries_access.asp
    print(key, value)

Creating a function to concatenate strings based on len(array)

I am trying to concatenate a string to send a message via python > telegram.
My plan is for the function to be modular.
It first imports lines from a .txt file, and based on how many lines there are it creates two different arrays,
array1[] and array2[]. array1 receives the values of the list as strings, and array2 receives user-generated input that complements what is stored at the same position in array1, so each entry in array1[pos] can be told apart. To put it in code:
while k < len(list):
    array2[k] = str(input(array1[k] + ": "))
    k += 1
I want to create a single string to send in a single message, in such a way that my whole list goes inside the same string:
string1 = array1[pos]+": "+array2[pos]+"\n"
I have tried using while loops to compare against the length, but I kept recalling and rewriting my own string again and again.
It looks like what you're looking for is to have one list that comes directly from your text file. There's lots of ways to do that, but you most likely won't want to create a list iteratively with the index position. I would say to just append items to your list.
The accepted answer on this post has a good reference, which is basically the following:
import csv

with open('filename.csv', 'r') as fd:
    reader = csv.reader(fd)
    for row in reader:
        ...  # do something with each row
Which, in your case would mean something like this:
import csv

actual_text_list = []
with open('filename.csv', 'r') as fd:
    reader = csv.reader(fd)
    for row in reader:
        # extend, not append: each row is a list of fields and we want plain strings
        actual_text_list.extend(row)

user_input_list = []
for actual_text in actual_text_list:
    the_users_input = input(f'What is your response to {actual_text}? ')
    user_input_list.append(the_users_input)
This creates two lists, one with the actual text, and the other with the other's input. Which I think is what you're trying to do.
Another way: if the list in your text file will not have duplicates, you could consider using a dict, which is just a dictionary, a key-value data store. You would make the key the actual_text from the file and the value the user_input. As yet another technique, you could make a list of lists.
import csv

actual_text_list = []
with open('filename.csv', 'r') as fd:
    reader = csv.reader(fd)
    for row in reader:
        # extend so that the dictionary keys below are hashable strings, not lists
        actual_text_list.extend(row)

dictionary = dict()
for actual_text in actual_text_list:
    the_users_input = input(f'What is your response to {actual_text}? ')
    dictionary[actual_text] = the_users_input
Then you could use that data like this:
for actual_text, user_input in dictionary.items():
    print(f'In response to {actual_text}, you specified {user_input}.')
list_of_strings_from_txt = ["A","B","C"]
modified_list = [f"{w}: {input(f'{w}:')}" for w in list_of_strings_from_txt]
I guess? maybe?

Iterating through CSV file in python to find titles with leading spaces

I'm working with a large csv file that contains songs and their ownership properties. Each song record is written top-down, with associated writer and publisher names below each title. So a given song may comprise, say, 4-6 rows, depending on how many writers/publishers control it (example with header row below):
Title,RoleType,Name,Shares,Note
BOOGIE BREAK 2,ASCAP,Total Current ASCAP Share,100,
BOOGIE BREAK 2,W,MERCADO JOSEPH M,,
BOOGIE BREAK 2,P,CRAFTIN MUSIC,,
BOOGIE BREAK 2,P,NEXT DIMENSION MUSIC,,
I'm currently trying to loop through the entire file to extract all of the song titles that contain leading spaces (e.g.,' song title'). Here's the code that I'm currently using:
import csv
import re

with open('output/sws.txt', 'w') as sws:
    with open('data/ascap_catalog1.csv', 'r') as ac:
        ascap = csv.reader(ac, delimiter=',')
        ascap = list(ascap)
        for row in ascap:
            for strings in row:
                if re.search(r'\A\s+', strings):
                    row = str(row)
                    sws.write(row)
                    sws.write('\n')
                else:
                    continue
Due to the size of this file csv file that I'm working with (~2GB), it takes quite a bit of time to iterate through and produce a result file. However, based on the results that I've gotten, it appears the song titles with leading spaces are all clustered at the beginning of the file. Once those songs have all been listed, then normal songs w/o leading spaces appear.
Is there a way to make this code a bit more efficient, time-wise? I tried using a few breaks after every for and if statement, but depending on how many I used, it either didn't affect the result at all, or broke too quickly, not capturing any rows.
I also tried wrapping it in a function and implementing return, however, for some reason the code only seemed to iterate through the first row (not counting the header row, which I would skip).
Thanks so much for your time,
list(ascap) isn't doing you any favors. reader objects are iterators over their contents, but they don't load it all into memory until it's needed. Just iterate over the reader object directly.
For each row, just check row[0][0].isspace(). That checks the first character of the first entry, which is all you need to determine whether something begins with whitespace.
import csv

with open('output/sws.txt', 'w', newline="") as sws:
    with open('data/ascap_catalog1.csv', 'r', newline="") as ac:
        ascap = csv.reader(ac, delimiter=',')
        for row in ascap:
            if row and row[0] and row[0][0].isspace():
                print(row, file=sws)
You could also play with your output, like saving all the rows you want to keep in a list before writing them at the end. It sounds like your input might be sorted, if all the leading whitespace names come first. If that's the case, you can just add else: break to skip the rest of the file.
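Concretely, that early exit could look like this (a sketch that assumes the input really is sorted with all leading-whitespace titles first):

import csv

with open('data/ascap_catalog1.csv', newline="") as ac, open('output/sws.txt', 'w') as sws:
    reader = csv.reader(ac)
    next(reader, None)  # skip the header row, which has no leading space
    for row in reader:
        if row and row[0] and row[0][0].isspace():
            print(row, file=sws)
        else:
            break  # sorted input assumed: nothing after this point has a leading space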
You can use a dictionary to find each song and group all of its associated values:
from collections import defaultdict
import csv, re

d = defaultdict(list)
count = 0  # count is needed to skip the header without loading the full file into memory
with open('filename.csv') as f:
    for a, *b in csv.reader(f):
        if count:
            if re.findall(r'^\s', a):
                d[a].append(b)
        count += 1
This one worked well for me and seems simple enough.
import csv
import re

with open('C:\\results.csv', 'w') as sws:
    with open('C:\\ascap.csv', 'r') as ac:
        ascap = csv.reader(ac, delimiter=',')
        for row in ascap:
            if re.match(r'\s+', row[0]):
                sws.write(str(row) + '\n')
Here are some things you can improve:
Use the reader object as an iterator directly without creating an intermediate list. This will save you both computation time and memory.
Check only the first value in a row (which is a title), not all.
Remove an unnecessary else clause.
Combining all of this and applying some best practices you can do:
import csv
import re

with open('data/ascap_catalog1.csv') as ac, open('output/sws.txt', 'w') as sws:
    reader = csv.reader(ac)
    for row in reader:
        if re.search(r'\A\s+', row[0]):
            print(row, file=sws)
It appears the song titles with leading spaces are all clustered at the beginning of the file.
In this case you can use itertools.takewhile to only iterate the file as long the titles have leading spaces:
import csv
import re
from itertools import takewhile

with open('data/ascap_catalog1.csv') as ac, open('output/sws.txt', 'w') as sws:
    reader = csv.reader(ac)
    next(reader)  # skip the header
    for row in takewhile(lambda x: re.search(r'\A\s+', x[0]), reader):
        print(row, file=sws)

python: adding a zero if my value is less than 4 digits long

I have a csv file where I need to add a zero in front of the number if it's less than 4 digits.
I only have to update a particular row:
import csv

f = open('csvpatpos.csv')
csv_f = csv.reader(f)

for row in csv_f:
    print row[5]
Then I want to parse through that row and add a 0 to the front of any number that is shorter than 4 digits, and then write it into a new csv file with the adjusted data.
You want to use string formatting for these things:
>>> '{:04}'.format(99)
'0099'
Format String Syntax documentation
When you think about parsing, you either need to think about regex or pyparsing. In this case, regex would perform the parsing quite easily.
But that's not all, once you are able to parse the numbers, you need to zero fill it. For that purpose, you need to use str.format for padding and justifying the string accordingly.
Consider your string
st = "parse through that row and add a 0 to the front of any number that is shorter than 4 digits."
In the above lines, you can do something like
Implementation
parts = re.split(r"(\d{0,3})", st)
''.join("{:>04}".format(elem) if elem.isdigit() else elem for elem in parts)
Output
'parse through that row and add a 0000 to the front of any number that is shorter than 0004 digits.'
The following code will read in the given csv file, iterate through each row and each item in each row, and output it to a new csv file.
import csv
import os

f = open('csvpatpos.csv')
# open temp .csv file for output
out = open('csvtemp.csv', 'w')

csv_f = csv.reader(f)

for row in csv_f:
    # create a temporary list for this row
    temp_row = []
    # iterate through all of the items in the row
    for item in row:
        # add the zero-filled value of each item to the list
        temp_row.append(item.zfill(4))
    # join the current temporary list with commas and write it to the out file
    out.write(','.join(temp_row) + '\n')

out.close()
f.close()
Your results will be in csvtemp.csv. If you want to save the data with the original filename, just add the following code to the end of the script
# remove original file
os.remove('csvpatpos.csv')
# rename temp file to original file name
os.rename('csvtemp.csv','csvpatpos.csv')
Pythonic Version
The code above is very verbose in order to make it understandable. Here is the code refactored to make it more Pythonic:
import csv

new_rows = []

with open('csvpatpos.csv', 'r') as f:
    csv_f = csv.reader(f)
    for row in csv_f:
        row = [x.zfill(4) for x in row]
        new_rows.append(row)

with open('csvpatpos.csv', 'wb') as f:
    csv_f = csv.writer(f)
    csv_f.writerows(new_rows)
Will leave you with two hints:
s = "486"
s.isdigit() == True
for finding what things are numbers.
And
s = "486"
s.zfill(4) == "0486"
for filling in zeroes.
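Combining the two hints on a whole row (an illustrative sketch; padding every numeric column is an assumption, the question may only need row[5]):

row = ["486", "bob", "12"]
padded = [s.zfill(4) if s.isdigit() else s for s in row]
# padded == ["0486", "bob", "0012"]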

Read initially unknown number of N lines from file in a nested dictionary and start in next iteration at line N+1

I want to process a text file (line by line). An (initially unknown) number of consecutive lines belong to the same entity (i.e. they carry the same identifier with the line). For example:
line1: stuff, stuff2, stuff3, ID1, stuff4, stuff5
line2: stuff, stuff2, stuff3, ID1, stuff4, stuff5
line3: stuff, stuff2, stuff3, ID1, stuff4, stuff5
line4: stuff, stuff2, stuff3, ID2, stuff4, stuff5
line5: stuff, stuff2, stuff3, ID2, stuff4, stuff5
...
In this dummy example, lines 1-3 belong to the entity ID1 and lines 4-5 to ID2. I want to read each of these lines as a dictionary and then nest them into a dictionary containing all the dictionaries of that IDX (e.g. a dictionary ID1 with the 3 nested dictionaries of lines 1-3, respectively).
More specifically I would like to define a function that:
opens the file
reads all (but only) the lines of entity ID1 into individual dictionaries
returns the dictionary which carries the nested dictionaries of the ID1 lines
I want to be able to call the function some time later again to read in the next dictionary of all the lines of the following identifier (ID2) and later ID3 etc. One of the problems I am having is that I need to test in every line whether my current line is still carrying the ID of interest or already a new one. If it is a new one, I sure can stop and return the dictionary but in the next round (say, ID2) the first line of ID2 has then already been read and I thus seem to lose that line.
In other words: I would like to somehow reset the counter in the function once it encounters a line with new ID so that in the next iteration this first line with the new ID is not lost.
This seems such a straightforward task but I cannot figure out a way to do that elegantly. I currently pass some "memory"-flags/variables between functions in order to keep track of whether the first line of a new ID was already read in a previous iteration. That is quite bulky and error prone.
Thanks for reading... any ideas/hints are highly appreciated. If some points are unclear please ask.
Here is my "solution". It seems to work in the sense that it prints the dictionary correctly (although I am sure there is a more elegant way to do that).
I also forgot to mention that the textfile is very large and I hence want to process it ID by ID instead of reading the whole file into memory.
with open(infile, "r") as f:
    newIDLine = None
    for line in f:
        if not line:
            break
        # the following function returns the ID
        ID = get_ID_from_line(line)
        counter = 1
        ID_Dic = dict()
        # if the first line is completely new (i.e. first line in infile)
        if newIDLine is None:
            currID = ID
            # the following function returns the line as a dic
            ID_Dic[counter] = process_line(line)
        # if the first line of this new ID was already read in
        # the previous "while" iteration (see below):
        if newIDLine is not None:
            # if the current "line" is of the same ID as the
            # previous one: put previous and current line in
            # the same dic and start the while loop.
            if ID == oldID:
                ID_Dic[counter] = process_line(newIDLine)
                counter += 1
                ID_Dic[counter] = process_line(line)
                currID = ID
        # iterate over the following lines until file end or a
        # new ID starts. In the latter case: keep the info in
        # the objects newIDLine and oldID.
        while True:
            newLine = next(f)
            if not newLine:
                break
            ID = get_ID_from_line(newLine)
            if ID == currID:
                counter += 1
                ID_Dic[counter] = process_line(newLine)
            # new ID; save the line for the upcoming ID dic
            if not ID == currID:
                newIDLine = newLine
                oldID = ID
                break
        # at this point it would be great to return the dictionary of
        # the current ID to the calling function, but on return to this
        # function continue where I left off.
        print ID_Dic
If you want this function to lazily return a dict for each id, you should make it a generator function by using yield instead of return. At the end of each id, yield the dict for that id. Then you can iterate over that generator.
To handle the file, write a generator function that iterates over a source unless you send it a value, in which case it returns that value next, then goes back to iterating. (For example, here's a module I wrote to do this for myself: politer.py.)
Then you can solve this problem easily by sending the value "back" if you don't want it:
with open(infile, 'r') as f:
    polite_f = politer(f)
    current_id = None
    while True:
        id_dict = {}
        for i, line in enumerate(polite_f):
            id = get_id_from_line(line)
            if id != current_id:
                polite_f.send(line)
                break
            else:
                id_dict[i] = process_line(line)
        if current_id is not None:
            yield id_dict
        current_id = id
Note that this keeps the state handling abstracted in the generator where it belongs.
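As an alternative sketch of the same lazy idea, itertools.groupby groups consecutive lines that share a key, which is exactly the "runs of the same ID" structure here (get_ID_from_line and process_line are the question's own helpers; using groupby is my substitution, not part of the original answer):

from itertools import groupby

def read_by_id(infile):
    with open(infile) as f:
        for current_id, lines in groupby(f, key=get_ID_from_line):
            # each group is one entity's consecutive lines; the file is read once
            yield current_id, {i: process_line(line) for i, line in enumerate(lines, 1)}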
You could use a dictionary to keep track of all the IDX columns and just add each line's IDX column to the appropriate list in the dictionary, something like:
from collections import defaultdict
import csv

all_lines_dict = defaultdict(list)

with open('your_file') as f:
    csv_reader = csv.reader(f)
    for line_list in csv_reader:
        all_lines_dict[line_list[3]].append(line_list)
The csv reader is part of the Python standard library and makes reading csv files easy. It will read each line as a list of its columns.
This differs from your requirements in that each key holds not a dictionary of dictionaries but a list of the lines that share that IDX key.
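If you do need the nested-dictionary shape described in the question, a small follow-up sketch can convert each list of rows afterwards:

nested = {
    idx: {i: line for i, line in enumerate(lines, 1)}
    for idx, lines in all_lines_dict.items()
}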
