Read CSV group by 1 column and apply sum, without pandas - python

As I wrote in the title, I would like to read a CSV, group it by one column, apply a sum, and then replace the old CSV with the new values, using as few libraries as possible (and avoiding pandas). I have come this far:
from csv import reader, writer

index = {}
with open('event.csv') as f:
    cr = reader(f)
    for row in cr:
        index.setdefault(row[0], []).append(int(row[1]))
f.close()
with open('event.csv', 'w', newline='\n') as csv_file:
    writer = writer(csv_file)
    for key, value in index.items():
        writer.writerow([key, value[0]])
csv_file.close()
But this way I could make the average… and I also have to open the file twice, which doesn't seem smart to me. Here is a CSV similar to event.csv:
work1,100
work2,200
work3,200
work1,50
work3,20
Desired output:
work1,150
work2,200
work3,220

You're actually very close. Just sum the values that were read while rewriting the file. Note that when using with on a file, you don't have to explicitly close it; that's done for you automatically. Also note that CSV files should be opened with newline='' for both reading and writing, as per the documentation.
import csv

index = {}
with open('event.csv', newline='') as csv_file:
    cr = csv.reader(csv_file)
    for row in cr:
        index.setdefault(row[0], []).append(int(row[1]))

with open('event2.csv', 'w', newline='') as csv_file:
    writer = csv.writer(csv_file)
    for key, values in index.items():
        value = sum(values)
        writer.writerow([key, value])

print('-fini-')
The above could be written a little more concisely by eliminating some temporary variables and using a generator expression:
import csv

index = {}
with open('event.csv', newline='') as csv_file:
    for row in csv.reader(csv_file):
        index.setdefault(row[0], []).append(int(row[1]))

with open('event2.csv', 'w', newline='') as csv_file:
    csv.writer(csv_file).writerows(
        [key, sum(values)] for key, values in index.items())

print('-fini-')

Another simplification of solutions already shown, without additional libraries:
import csv

index = {}
with open('event.csv', newline='') as f:
    cr = csv.reader(f)
    for item, value in cr:
        index[item] = index.get(item, 0) + int(value)  # sum as you go

with open('event2.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerows(index.items())  # write all the items in one shot

print('-fini-')

With an additional library - convtools - which provides a lot of functionality so you don't have to write as much code by hand every time:
from convtools import conversion as c
from convtools.contrib.tables import Table

rows = Table.from_csv("event.csv", header=False).into_iter_rows(list)
converter = (
    c.group_by(c.item(0))
    .aggregate(
        (
            c.item(0),
            c.ReduceFuncs.Sum(c.item(1).as_type(int)),
        )
    )
    .gen_converter()
)
processed_rows = converter(rows)

Table.from_rows(processed_rows, header=False).into_csv(
    "event2.csv", include_header=False
)

Here's another way to think of it.
Instead of storing arrays of ints during reading and then "compressing" them into the desired value during writing, show up-front that you're summing something during the read:
import csv
from collections import defaultdict

summed_work = defaultdict(int)
with open('event_input.csv', newline='') as f:
    reader = csv.reader(f)
    for row in reader:
        work_id = row[0]
        work_value = int(row[1])
        summed_work[work_id] += work_value

with open('event_processed.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    for work_id, summed_value in summed_work.items():
        writer.writerow([work_id, summed_value])
This is functionally equivalent to what you were aiming for and what martineau helped you with, but, I argue, shows you and your reader sooner and more clearly what the intent is.
It technically uses one more import, defaultdict, but that comes from the standard library, and I'm not sure how strictly you're counting the number of libraries being used.
EDIT
Oh, I just remembered there's the Counter class from collections, too. Might be even clearer:
summed_work = Counter()
and everything else is the same.
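For completeness, here is a minimal sketch of the Counter variant, with the sample data from the question written out first so the snippet is self-contained (the file names event_input.csv and event_processed.csv follow the example above):

```python
import csv
from collections import Counter

# sample data from the question, written out so the snippet is self-contained
with open('event_input.csv', 'w', newline='') as f:
    f.write('work1,100\nwork2,200\nwork3,200\nwork1,50\nwork3,20\n')

# Counter() defaults missing keys to 0, so we can sum as we read
summed_work = Counter()
with open('event_input.csv', newline='') as f:
    for work_id, work_value in csv.reader(f):
        summed_work[work_id] += int(work_value)

with open('event_processed.csv', 'w', newline='') as f:
    csv.writer(f).writerows(summed_work.items())
```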

How do I update every row of one column of a CSV with Python?

I'm trying to update every row of 1 particular column in a CSV.
My actual use-case is a bit more complex but it's just the CSV syntax I'm having trouble with, so for the example, I'll use this:
Name,Number
Bob,1
Alice,2
Bobathy,3
If I have a CSV with the above data, how would I get it to add 1 to each number & update the CSV or spit it out into a new file?
How can I take syntax like this & apply it to the CSV?
test = [1, 2, 3]
for n in test:
    n = n + 1
    print(n)
I've been looking through a bunch of tutorials & haven't been able to quite figure it out.
Thanks!
Edit:
I can read the data & get what I'm looking for printed out, my issue now is just with getting that back into the CSV
import csv

with open('file.csv', newline='') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        print(row['name'], int(row['number']) + 1)
└─$ python3 test_csv_script.py
bob 2
alice 3
bobathy 4
Thank you Mark Tolonen for the comment - that example was very helpful & led me to my solution:
import csv

with open('file.csv', newline='') as csv_input, \
        open('out.csv', 'w', newline='') as csv_output:
    reader = csv.reader(csv_input)
    writer = csv.writer(csv_output)
    # Header doesn't need extra processing
    header = next(reader)
    writer.writerow(header)
    for name, number in reader:
        writer.writerow([name, int(number) + 1])
 
Also sharing for anybody who finds this in the future, if you're looking to move the modified data to a new column/header, use this:
import csv

with open('file.csv', newline='') as csv_input, \
        open('out.csv', 'w', newline='') as csv_output:
    reader = csv.reader(csv_input)
    writer = csv.writer(csv_output)
    header = next(reader)
    header.append("new column")
    writer.writerow(header)
    for name, number in reader:
        writer.writerow([name, number, int(number) + 1])
You can open another file, out.csv, which you write the new data into.
For example:
import csv

with open('file.csv', newline='') as csvfile, open('out.csv', 'w') as file_write:
    reader = csv.DictReader(csvfile)
    for row in reader:
        file_write.write(f"{row['name']},{int(row['number']) + 1}\n")

CSV reading and writing; outputted CSV is blank

My program needs a function that reads data from a CSV file ("all.csv") and extracts all the data pertaining to 'Virginia' (extract each row that has 'Virginia' in it), then writes the extracted data to another CSV file named "Virginia.csv". The program runs without error; however, when I open the "Virginia.csv" file, it is blank. My guess is that the issue is with my nested for loop, but I am not entirely sure what is causing the issue.
Here is the data within the all.csv file:
https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv
Here is my code:
import csv

input_file = 'all.csv'
output_file = 'Virginia.csv'
state = 'Virginia'
mylist = []

def extract_records_for_state(input_file, output_file, state):
    with open(input_file, 'r') as infile:
        contents = infile.readlines()
    with open(output_file, 'w') as outfile:
        writer = csv.writer(outfile)
        for row in range(len(contents)):
            contents[row] = contents[row].split(',')  # split elements
        for row in range(len(contents)):
            for word in range(len(contents[row])):
                if contents[row][2] == state:
                    writer.writerow(row)

extract_records_for_state(input_file, output_file, state)
I ran your code and it gave me an error
Traceback (most recent call last):
  File "c:\Users\Dolimight\Desktop\Stack Overflow\Geraldo\main.py", line 27, in <module>
    extract_records_for_state(input_file, output_file, state)
  File "c:\Users\Dolimight\Desktop\Stack Overflow\Geraldo\main.py", line 24, in extract_records_for_state
    writer.writerow(row)
_csv.Error: iterable expected, not int
I fixed the error by putting the contents of the row [contents[row]] into the writerow() function and ran it again and the data showed up in Virginia.csv. It gave me duplicates so I also removed the word for-loop.
import csv

input_file = 'all.csv'
output_file = 'Virginia.csv'
state = 'Virginia'
mylist = []

def extract_records_for_state(input_file, output_file, state):
    with open(input_file, 'r') as infile:
        contents = infile.readlines()
    with open(output_file, 'w') as outfile:
        writer = csv.writer(outfile)
        for row in range(len(contents)):
            contents[row] = contents[row].split(',')  # split elements
        print(contents)
        for row in range(len(contents)):
            if contents[row][2] == state:
                writer.writerow(contents[row])  # this is what I changed

extract_records_for_state(input_file, output_file, state)
You have two errors. The first is that you try to write the row index at writer.writerow(row) - the row itself is contents[row]. The second is that you leave the newline in the final column on read but don't strip it on write. Instead, you could leverage the csv module more fully: let the reader parse the rows, and instead of reading into a list, which uses a fair amount of memory, filter and write row by row.
import csv

input_file = 'all.csv'
output_file = 'Virginia.csv'
state = 'Virginia'

def extract_records_for_state(input_file, output_file, state):
    with open(input_file, 'r', newline='') as infile, \
            open(output_file, 'w', newline='') as outfile:
        reader = csv.reader(infile)
        writer = csv.writer(outfile)
        # add header
        writer.writerow(next(reader))
        # filter for state
        writer.writerows(row for row in reader if row[2] == state)

extract_records_for_state(input_file, output_file, state)
Looking at your code two things jump out at me:
I see a bunch of nested statements (logic)
I see you reading a CSV as plain text, then interpreting it as CSV yourself (contents[row] = contents[row].split(',')).
I recommend two things:
break up logic into distinct chunks: all that nesting can be hard to interpret and debug; do one thing, prove that works; do another thing, prove that works; etc...
use the CSV API to its fullest: use it to both read and write your CSVs
I don't want to try and replicate/fix your code, instead I'm offering this general approach to achieve those two goals:
import csv

# Read in
all_rows = []
with open('all.csv', 'r', newline='') as f:
    reader = csv.reader(f)
    next(reader)  # discard header (I didn't see you keep it)
    for row in reader:
        all_rows.append(row)

# Process
filtered_rows = []
for row in all_rows:
    if row[2] == 'Virginia':
        filtered_rows.append(row)

# Write out
with open('filtered.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerows(filtered_rows)
Once you understand both the logic and the API of those discrete steps, you can move on (advance) to composing something more complex, like the following which reads a row, decides if it should be written, and if so, writes it:
import csv

with open('filtered.csv', 'w', newline='') as f_out:
    writer = csv.writer(f_out)
    with open('all.csv', 'r', newline='') as f_in:
        reader = csv.reader(f_in)
        next(reader)  # discard header
        for row in reader:
            if row[2] == 'Virginia':
                writer.writerow(row)
Using either of those two pieces of code on this (really scaled-down) sample of all.csv:
date,county,state,fips,cases,deaths
2020-03-09,Fairfax,Virginia,51059,4,0
2020-03-09,Virginia Beach city,Virginia,51810,1,0
2020-03-09,Chelan,Washington,53007,1,1
2020-03-09,Clark,Washington,53011,1,0
gets me a filtered.csv that looks like:
2020-03-09,Fairfax,Virginia,51059,4,0
2020-03-09,Virginia Beach city,Virginia,51810,1,0
Given the size of this dataset, the second approach of write-on-demand-inside-the-read-loop is both faster (about 5x faster on my machine) and uses significantly less memory (about 40x less on my machine) because there's no intermediate storage with all_rows.
But, please take the time to run both, read them carefully, and see how each works the way it does.

file handling python - vlookup

source.csv as follows.
AB;CD
a;1;
b;2;
c;3;
target.csv as follows.
DE;FG;HI
1;e;1;
2;a;2;
3;f;3;
I need to do a vlookup using file handling mechanisms in python.
So need to update column 'FG' of 'target.csv' by looking up the column 'AB' of 'source.csv' and update with 'CD' column value of 'source.csv'.
So my desired output is like below.
DE;FG;HI
1;e;1;
2;1;2; #a is replaced with 1
3;f;3;
How can I approach this without using pandas or any other external library?
Below is how I approached this.
with open('D:/target.csv', "w+", encoding="utf-8") as Tgt_csvFile:
    with open('D:/source.csv', "r", encoding="utf-8") as Src_csvFile:
        for line in Src_csvFile:
            fields = line.split(";")
            x = fields[0]
            for line_1 in Tgt_csvFile:
                fields_1 = line_1.split(";")
                y = fields[1]
                if y == x:
                    update        # not sure how to do this
                else:
                    keep as it is
I'd appreciate any support.
This will solve your particular problem, but if the number of input/output columns changes you will need to adjust the logic accordingly.
It's also worth noting that the trailing ';' on each non-header row of your CSV file will cause most parsers to see an extra (empty) column. I don't think you want that.
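To illustrate that remark with a hypothetical snippet: feeding one of the target.csv rows through csv.reader shows the empty extra field produced by the trailing ';':

```python
import csv
import io

# a row from target.csv, with the trailing ';' the remark above warns about
row = next(csv.reader(io.StringIO('1;e;1;\n'), delimiter=';'))
print(row)  # ['1', 'e', '1', ''] - the trailing delimiter yields an empty final field
```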
import csv

# Read in input, creating a dict where key is column 1 and value is column 2
with open('source.csv', mode='r') as infile:
    reader = csv.reader(infile, delimiter=';')
    s = {x[0]: x[1] for x in reader}
print(s)

# If column 2 is a key in dict s, update it with the value from the dict
output = []
with open('target.csv', mode='r') as infile:
    reader = csv.reader(infile, delimiter=';')
    for row in reader:
        if row[1] in s.keys():
            row[1] = s[row[1]]
        output.append(row)

# Output to csv
with open('output.csv', mode='w', newline='') as outfile:
    writer = csv.writer(outfile, delimiter=';')
    writer.writerows(output)
Here is my suggestion:
with open('D:/source.csv', "r", encoding="utf-8") as Src_csvFile:
    l = Src_csvFile.readlines()
d = {}
for i in l[1:]:
    x = i.split(';')
    d[x[0]] = x[1]

with open('D:/target.csv', "r", encoding="utf-8") as Tgt_csvFile:
    m = Tgt_csvFile.readlines()
for i in range(1, len(m)):
    x = m[i].split(';')
    if x[1] in d:
        x[1] = d.get(x[1])
        m[i] = ';'.join(x)

with open('D:/target.csv', "w", encoding="utf-8") as Tgt_csvFile:
    Tgt_csvFile.writelines(m)
Output:
DE;FG;HI
1;e;1;
2;1;2;
3;f;3;
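If the files are guaranteed to have a header row, the same lookup can also be sketched with the csv module instead of manual string splitting. This is a sketch, not either answer's exact method; the sample files are created first so the snippet is self-contained, and the empty last field produced by the trailing ';' is carried through unchanged:

```python
import csv

# sample files from the question, created so the snippet is self-contained
with open('source.csv', 'w', newline='') as f:
    f.write('AB;CD\na;1;\nb;2;\nc;3;\n')
with open('target.csv', 'w', newline='') as f:
    f.write('DE;FG;HI\n1;e;1;\n2;a;2;\n3;f;3;\n')

# build the lookup: column AB -> column CD
with open('source.csv', newline='') as f:
    reader = csv.reader(f, delimiter=';')
    next(reader)  # skip the AB;CD header
    lookup = {row[0]: row[1] for row in reader}

# rewrite target.csv, replacing column FG where a lookup key matches
with open('target.csv', newline='') as f:
    rows = list(csv.reader(f, delimiter=';'))
for row in rows[1:]:  # skip the DE;FG;HI header
    if row[1] in lookup:
        row[1] = lookup[row[1]]
with open('target.csv', 'w', newline='') as f:
    csv.writer(f, delimiter=';').writerows(rows)
```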

Best way to count unique values from CSV in Python?

I need a quick way of counting unique values from a CSV (it's a really big file (>100 MB) that can't be opened in Excel, for example), and I thought of creating a Python script.
The CSV looks like this:
431231
3412123
321231
1234321
12312431
634534
I just need the script to return how many different values are in the file. E.g. for above the desired output would be:
6
So far this is what I have:
import csv

input_file = open(r'C:\Users\guill\Downloads\uu.csv')
csv_reader = csv.reader(input_file, delimiter=',')
thisdict = {
    "UserId": 1
}
for row in csv_reader:
    if row[0] not in thisdict:
        thisdict[row[0]] = 1
print(len(thisdict) - 1)
Seems to be working fine, but I wonder if there's a better/more efficient/elegant way to do this?
A set is more tailor-made for this problem than a dictionary:
import csv

with open(r'C:\Users\guill\Downloads\uu.csv', newline='') as f:
    csv_reader = csv.reader(f, delimiter=',')
    uniqueIds = set()
    for row in csv_reader:
        uniqueIds.add(row[0])
print(len(uniqueIds))
Use a set instead of a dict, like this:
import csv

input_file = open(r'C:\Users\guill\Downloads\uu.csv')
csv_reader = csv.reader(input_file, delimiter=',')
aa = set()
for row in csv_reader:
    aa.add(row[0])
print(len(aa))
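Both answers can be condensed further with a set comprehension. A minimal sketch, using a local file name 'uu.csv' and writing the question's sample data first so the snippet is self-contained:

```python
import csv

# sample data from the question, written so the snippet is self-contained
with open('uu.csv', 'w', newline='') as f:
    f.write('431231\n3412123\n321231\n1234321\n12312431\n634534\n')

# build the set of first-column values in one expression, then count it
with open('uu.csv', newline='') as f:
    unique_count = len({row[0] for row in csv.reader(f)})
print(unique_count)  # 6
```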

Python, write csv single row, multiple column

Sorry for asking; I have searched a lot but can't find what I need.
I need this list to be written to a CSV file on one row (row 1), with each element in columns A to E:
import csv

coluna_socio = ['bilhete', 'tipo', 'nome', 'idade', 'escalao']
outfile = open('socios_adeptos.csv', 'w', newline='')
writer = csv.writer(outfile)
for i in range(len(coluna_socio)):
    writer.writerow([coluna_socio[i]])
I've tried almost everything and it always writes in one column, or just in one cell (A1).
Thanks.
You can use the string join method to insert a comma between all the elements of your list. Then write the resulting string to file. The csv module isn't necessary in this case.
For example ...
with open('socios_adeptos.csv', 'w') as out:
    out.write(','.join(coluna_socio))
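One caveat with the plain join approach, shown with a hypothetical field value: if a field ever contains the delimiter itself, str.join produces an ambiguous line, while csv.writer quotes the field so it round-trips correctly:

```python
import csv
import io

fields = ['bilhete', 'nome, completo']  # second field contains a comma

# str.join produces an ambiguous line: three apparent fields
print(','.join(fields))  # bilhete,nome, completo

# csv.writer quotes the problematic field so it parses back as two fields
buf = io.StringIO()
csv.writer(buf).writerow(fields)
print(buf.getvalue())  # bilhete,"nome, completo"
```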
You should call the csv.writer.writerow method with the list directly:
with open('socios_adeptos.csv', 'w', newline='') as outfile:
    writer = csv.writer(outfile)
    writer.writerow(coluna_socio)
I could write it in one row with your code, as follows:
import csv

coluna_socio = ['bilhete', 'tipo', 'nome', 'idade', 'escalao']
outfile = open('socios_adeptos.csv', 'w', newline='')
writer = csv.writer(outfile)
writer.writerow(coluna_socio)
outfile.close()
