Best way to count unique values from CSV in Python?

I need a quick way of counting unique values from a CSV (it's a really big file, >100 MB, that can't be opened in Excel, for example), and I thought of creating a Python script.
The CSV looks like this:
431231
3412123
321231
1234321
12312431
634534
I just need the script to return how many different values are in the file. E.g. for above the desired output would be:
6
So far this is what I have:
import csv

input_file = open(r'C:\Users\guill\Downloads\uu.csv')
csv_reader = csv.reader(input_file, delimiter=',')

thisdict = {
    "UserId": 1
}

for row in csv_reader:
    if row[0] not in thisdict:
        thisdict[row[0]] = 1

print(len(thisdict)-1)
Seems to be working fine, but I wonder if there's a better/more efficient/elegant way to do this?

A set is more tailor-made for this problem than a dictionary:
import csv

with open(r'C:\Users\guill\Downloads\uu.csv') as f:
    csv_reader = csv.reader(f, delimiter=',')
    uniqueIds = set()
    for row in csv_reader:
        uniqueIds.add(row[0])

print(len(uniqueIds))
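If the file really is just a single column of ids, the whole thing can also be collapsed into a set comprehension. A minimal sketch, assuming there is no header row (if there is one, skip it with next() first):

import csv

with open(r'C:\Users\guill\Downloads\uu.csv', newline='') as f:
    unique_ids = {row[0] for row in csv.reader(f) if row}  # 'if row' skips blank lines

print(len(unique_ids))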

Use a set instead of a dict, like this:
import csv

input_file = open(r'C:\Users\guill\Downloads\uu.csv')
csv_reader = csv.reader(input_file, delimiter=',')

aa = set()
for row in csv_reader:
    aa.add(row[0])

print(len(aa))

Related

Compare two CSV files and write difference in the same file as an extra column in python

Hey intelligent community,
I need a little bit of help because I think I can't see the wood for the trees.
I have two CSV files that look like this:
Name,Number
AAC;2.2.3
AAF;2.4.4
ZCX;3.5.2
Name,Number
AAC;2.2.3
AAF;2.4.4
ZCX;3.5.5
I would like to compare both files and then write any changes like this:
Name,Number,Changes
AAC;2.2.3
AAF;2.4.4
ZCX;3.5.5;change: 3.5.2
So on every line where there is a difference in the number, I want to add this as a new column at the end of the line.
The files are formatted the same but sometimes have a new row, so that's why I think I have to map the keys.
I've come this far, but now I am lost in my thoughts:
Python 3.10.9
import csv

# Reading the first csv and setting up the mapping
with open('test1.csv', 'r') as csvfile:
    reader = csv.reader(csvfile)
    rows = list(reader)
    file1_dict = {row[1]: row[0] for row in rows}

# Reading the second csv and setting up the mapping
with open('test2.csv', 'r') as csvfile:
    reader = csv.reader(csvfile)
    rows = list(reader)
    file2_dict = {row[1]: row[0] for row in rows}

# Comparing the keys and finding the diff
for k in test1_dict:
    if test1_dict[k] != test2_dict[k]:
        test1_dict[k] = test2_dict[k]
        for row in rows:
            if row[1] == k:
                row.append(test2_dict[k])

# Write the csv (not sure how to add the word "change:")
with open('test1.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerows(rows)
If I try this, I don't get a new column; it just "updates" the csv file with the same columns.
For example, this code gives me the diff rows, but I'm not able to just add them to the existing file and row:
with open('test1.csv') as fin1:
    with open('test2.csv') as fin2:
        read1 = csv.reader(fin1)
        read2 = csv.reader(fin2)
        diff_rows = (row1 for row1, row2 in zip(read1, read2) if row1 != row2)
        with open('test3.csv', 'w') as fout:
            writer = csv.writer(fout)
            writer.writerows(diff_rows)
Does someone have any tips or help for my problem? I read many answers on here but can't figure it out.
Thanks a lot.
@bigkeefer
Thanks for your answer. I tried to change it for the delimiter ';' but it gives a "list index out of range" error.
with open('test3.csv', 'r') as file1:
    reader = csv.reader(file1, delimiter=';')
    rows = list(reader)[1:]
    file1_dict = {row[0]: row[1] for row in rows}

with open('test4.csv', 'r') as file2:
    reader = csv.reader(file2, delimiter=';')
    rows = list(reader)[1:]
    file2_dict = {row[0]: row[1] for row in rows}

new_file = ["Name;Number;Changes\n"]

with open('output.csv', 'w') as nf:
    for key, value in file1_dict.items():
        if value != file2_dict[key]:
            new_file.append(f"{key};{file2_dict[key]};change: {value}\n")
        else:
            new_file.append(f"{key};{value}\n")
    nf.writelines(new_file)
You will need to adapt this to overwrite your first file etcetera, as you mentioned above, but I've left it like this for your testing purposes. Hopefully this will help you in some way.
I've assumed you've actually got the headers above in each file. If not, remove the slicing on the list creations, and change the new_file variable assignment to an empty list ([]).
import csv

with open('f1.csv', 'r') as file1:
    reader = csv.reader(file1, delimiter=";")
    rows = list(reader)[1:]
    file1_dict = {row[0]: row[1] for row in rows if row}

with open('f2.csv', 'r') as file2:
    reader = csv.reader(file2, delimiter=";")
    rows = list(reader)[1:]
    file2_dict = {row[0]: row[1] for row in rows if row}

new_file = ["Name;Number;Changes\n"]

for key, value in file1_dict.items():
    if value != file2_dict[key]:
        new_file.append(f"{key};{file2_dict[key]};change: {value}\n")
    else:
        new_file.append(f"{key};{value}\n")

with open('new.csv', 'w') as nf:
    nf.writelines(new_file)
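Since the question mentions that the files sometimes gain a new row, one thing to watch with this approach is that file2_dict[key] raises a KeyError for any name that only exists in the first file. A hedged variant of the comparison loop using dict.get (the "missing in file 2" wording is just an illustration, not part of the original requirement):

for key, value in file1_dict.items():
    other = file2_dict.get(key)  # None if the name is missing from the second file
    if other is None:
        new_file.append(f"{key};{value};missing in file 2\n")
    elif other != value:
        new_file.append(f"{key};{other};change: {value}\n")
    else:
        new_file.append(f"{key};{value}\n")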

How do I update every row of one column of a CSV with Python?

I'm trying to update every row of 1 particular column in a CSV.
My actual use-case is a bit more complex but it's just the CSV syntax I'm having trouble with, so for the example, I'll use this:
Name,Number
Bob,1
Alice,2
Bobathy,3
If I have a CSV with the above data, how would I get it to add 1 to each number & update the CSV or spit it out into a new file?
How can I take syntax like this & apply it to the CSV?
test = [1, 2, 3]

for n in test:
    n = n + 1
    print(n)
I've been looking through a bunch of tutorials & haven't been able to quite figure it out.
Thanks!
Edit:
I can read the data & get what I'm looking for printed out; my issue now is just getting that back into the CSV:
import csv

with open('file.csv', newline='') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        print(row['name'], int(row['number']) + 1)
└─$ python3 test_csv_script.py
bob 2
alice 3
bobathy 4
Thank you Mark Tolonen for the comment - that example was very helpful & led me to my solution:
import csv

with open('file.csv', newline='') as csv_input, open('out.csv', 'w') as csv_output:
    reader = csv.reader(csv_input)
    writer = csv.writer(csv_output)
    # Header doesn't need extra processing
    header = next(reader)
    writer.writerow(header)
    for name, number in reader:
        writer.writerow([name, int(number) + 1])
Also, for anybody who finds this in the future: if you're looking to put the modified data in a new column/header, use this:
import csv

with open('file.csv', newline='') as csv_input, open('out.csv', 'w') as csv_output:
    reader = csv.reader(csv_input)
    writer = csv.writer(csv_output)
    header = next(reader)
    header.append("new column")
    writer.writerow(header)
    for name, number in reader:
        writer.writerow([name, number, int(number) + 1])
You can open another file, out.csv, to write the new data into.
For example:
import csv

with open('file.csv', newline='') as csvfile, open('out.csv', 'w') as file_write:
    reader = csv.DictReader(csvfile)
    for row in reader:
        file_write.write(f"{row['name']},{int(row['number']) + 1}\n")
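If you'd rather keep the DictReader approach from the snippet above and still produce a proper CSV, csv.DictWriter can handle the writing. A small sketch, assuming the header names are 'name' and 'number' as in the snippet:

import csv

with open('file.csv', newline='') as csv_input, open('out.csv', 'w', newline='') as csv_output:
    reader = csv.DictReader(csv_input)
    writer = csv.DictWriter(csv_output, fieldnames=reader.fieldnames)
    writer.writeheader()
    for row in reader:
        row['number'] = int(row['number']) + 1  # bump the value in place
        writer.writerow(row)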

Remove row from CSV file python

Is there a way to remove a row from a CSV file without rewriting the entire thing?
Currently, I am using a dictionary 'db' that contains the database with the row I want to delete. First I read the columns, then I completely rewrite every row in the CSV except for the row with the ID I want to delete. Is there a way to do this without having to rewrite everything?
def remove_from_csv(file_name, id, db):
    with open(file_name, "r") as f:
        reader = csv.reader(f)
        i = next(reader)
    with open(file_name, 'w') as f:
        writer = csv.writer(f, lineterminator='\n')
        writer.writerow(i)
        for i in db:
            if id != i:
                for j in db[i]:
                    writer.writerow([i, j, db[i][j]])
A way I have done so in the past is to use a pandas dataframe and the drop function based on the row index or label.
For example:
import pandas as pd
df = pd.read_csv('yourFile.csv')
newDf = df.drop('rowLabel')
or by index position:
newDf = df.drop(df.index[indexNumber])
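Note that drop returns a new DataFrame and nothing is written back to disk by itself; to actually update the file you still rewrite it with to_csv. A minimal sketch, assuming a default integer index and the same file name as above:

import pandas as pd

df = pd.read_csv('yourFile.csv')
newDf = df.drop(df.index[2])               # drop the third data row by position
newDf.to_csv('yourFile.csv', index=False)  # rewrite the file without that row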

Beginner deleting columns from CSV (no pandas)

I've just started coding, and I'm trying to remove certain columns from a CSV for a project; we aren't supposed to use pandas. For instance, one of the fields I have to delete is called DwTm, but there are about 15 columns I have to get rid of and I only want the first few. Here's what I've got:
import csv

FTemp = "D:/tempfile.csv"
FOut = "D:/NewFile.csv"

with open(FTemp, 'r') as csv_file:
    csv_reader = csv.reader(csv_file)
    with open(FOut, 'w') as new_file:
        fieldnames = ['Stn_Name', 'Lat', 'Long', 'Prov', 'Tm']
        csv_writer = csv.DictWriter(new_file, fieldnames=fieldnames)
        for line in csv_reader:
            del line['DwTm']
            csv_writer.writerow(line)
When I run this, I get the error
del line['DwTm']
TypeError: list indices must be integers or slices, not str
This is the only method I've found to almost work without using pandas. Any ideas?
The easiest way around this is to use a DictReader to read the file. Like DictWriter, which you are using to write the file, DictReader uses dictionaries for rows, so your approach of deleting keys from the old row then writing to the new file will work as you expect.
import csv

FTemp = "D:/tempfile.csv"
FOut = "D:/NewFile.csv"

with open(FTemp, 'r') as csv_file:
    # Adjust the list to have the correct order
    old_fieldnames = ['Stn_Name', 'Lat', 'Long', 'Prov', 'Tm', 'DwTm']
    csv_reader = csv.DictReader(csv_file, fieldnames=old_fieldnames)
    with open(FOut, 'w') as new_file:
        fieldnames = ['Stn_Name', 'Lat', 'Long', 'Prov', 'Tm']
        csv_writer = csv.DictWriter(new_file, fieldnames=fieldnames)
        for line in csv_reader:
            del line['DwTm']
            csv_writer.writerow(line)
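If the real file has many more columns than the six listed (the question mentions roughly 15 to drop), a variant worth considering is to let DictReader take the field names from the file's own header and tell DictWriter to discard everything not in fieldnames via extrasaction='ignore'. A hedged sketch, assuming the first line of the input is a header row:

import csv

FTemp = "D:/tempfile.csv"
FOut = "D:/NewFile.csv"

with open(FTemp, 'r', newline='') as csv_file, open(FOut, 'w', newline='') as new_file:
    csv_reader = csv.DictReader(csv_file)  # field names come from the header line
    fieldnames = ['Stn_Name', 'Lat', 'Long', 'Prov', 'Tm']
    csv_writer = csv.DictWriter(new_file, fieldnames=fieldnames, extrasaction='ignore')
    csv_writer.writeheader()
    for line in csv_reader:
        csv_writer.writerow(line)  # columns not in fieldnames are silently dropped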
Below is an approach that removes columns by index:
import csv

# We only want to keep the 'department' field
# We are not interested in 'name' and 'birthday month'
# Make sure the index list is in descending order,
# so earlier deletions don't shift the later indices
NON_INTERESTING_FIELDS_IDX = [2, 0]

rows = []
with open('example.csv') as csv_file:
    csv_reader = csv.reader(csv_file, delimiter=',')
    for row in csv_reader:
        for idx in NON_INTERESTING_FIELDS_IDX:
            del row[idx]
        rows.append(','.join(row))

with open('example_out.csv', 'w') as out:
    for row in rows:
        out.write(row + '\n')
example.csv
name,department,birthday month
John Smith,Accounting,November
Erica Meyers,IT,March
example_out.csv
department
Accounting
IT
It's possible to simultaneously open the file to read from and the file to write to. Let's say you know the indices of the columns you want to keep, say, 0,2, and 4:
good_cols = (0, 2, 4)

with open(FTemp, 'r') as fin, open(FOut, 'w') as fout:
    for line in fin:
        line = line.rstrip()    # clean up newlines
        temp = line.split(',')  # make a list from the line
        data = [temp[x] for x in range(len(temp)) if x in good_cols]
        fout.write(','.join(data) + '\n')
The list comprehension (data) pulls only the columns you want to keep out of each row and immediately writes line by line to your new file, using the join method (plus tacking on a newline for each new row).
If you only know the names of the fields you want to keep/remove, it's a bit more involved: you have to extract the indices from the first line of the csv file, but it's not much more difficult.
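For instance, a hedged sketch of that name-based variant, reusing the FTemp/FOut paths from the question and assuming a plain comma-separated file with a header line:

keep_names = ['Stn_Name', 'Lat', 'Long', 'Prov', 'Tm']  # columns to keep, by name

with open(FTemp, 'r') as fin, open(FOut, 'w') as fout:
    header = fin.readline().rstrip().split(',')
    good_cols = [header.index(name) for name in keep_names]  # positions of the wanted columns
    fout.write(','.join(keep_names) + '\n')
    for line in fin:
        temp = line.rstrip().split(',')
        fout.write(','.join(temp[i] for i in good_cols) + '\n')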

How to search for a 'text' or 'number' in a csv file with Python AND, if it exists, print only the first and second column values to a new csv file

I want to do the following using Python.
Step-1: Read a specific (third) column from a csv file.
Step-2: Create a list with the values got from step-1.
Step-3: Take the value of index[0], search the csv file, and if it is present, print the values of columns 1 and 2 only to a new csv file (there are 6 columns). If it is not present, just ignore it and go to the next search.
file1.csv:
Country,Location,number,letter,name,pup-name,null
a,ab,1,qw,abcd,test1,3
b,cd,1,df,efgh,test2,4
c,ef,2,er,fgh,test3,5
d,gh,3,sd,sds,test4,
e,ij,5,we,sdrt,test5,
f,kl,6,sc,asdf,test6,
g,mn,7,df,xcxc,test7,
h,op,8,gb,eretet,test8,
i,qr,8,df,hjjh,test9,
Python script written for this:
import csv
import time
from collections import defaultdict

columns = defaultdict(list)

with open('file1.csv') as f:
    reader = csv.reader(f)
    reader.next()
    for row in reader:
        for (i, v) in enumerate(row):
            columns[i].append(v)

#print(columns[2])
b = (columns[2])

for x in b[:]:
    time.sleep(1)
    print x
Output of above script:
MacBook-Pro:test_usr$ python csv_file.py
1
1
2
3
5
6
7
8
8
MacBook-Pro:test_usr$
I am able to do steps 1 and 2.
Please guide me on Step-3, that is: how do I search for a text/string in a csv file and, if it is present, extract only specific column values to a new csv file?
Output file should look like:
a,ab
b,cd
c,ef
d,gh
e,ij
f,kl
g,mn
h,op
i,qr
Note: the search strings will come from another csv file, so please don't suggest a direct answer that just prints the values of columns 1 and 2.
The FINAL CODE looks like this:
import csv
import time
from collections import defaultdict

columns = defaultdict(list)

with open('file1.csv') as f:
    reader = csv.reader(f)
    reader.next()
    for row in reader:
        for (i, v) in enumerate(row):
            columns[i].append(v)

b = (columns[2])

for x in b[:]:
    with open('file2.csv') as f, open('file3.csv', 'a') as g:
        reader = csv.reader(f)
        #next(reader, None) # discard the header
        writer = csv.writer(g)
        for row in reader:
            if row[2] == x:
                writer.writerow(row[:2])
file1.csv:
Country,Location,number,letter,name,pup-name,null
a,ab,1,qw,abcd,test1,3
b,cd,1,df,efgh,test2,4
c,ef,2,er,fgh,test3,5
d,gh,3,sd,sds,test4,
e,ij,5,we,sdrt,test5,
f,kl,6,sc,asdf,test6,
g,mn,7,df,xcxc,test7,
h,op,8,gb,eretet,test8,
i,qr,8,df,hjjh,test9,
file2.csv:
count,name,number,Type,status,Config Version,,IP1,port
1,bob,1,TRAFFIC,end,1.2,,1.1.1.1,1
2,john,1,TRAFFIC,end,2.1,,1.1.1.2,2
4,foo,2,TRAFFIC,end,1.1,,1.1.1.3,3
5.333333333,test,3,TRAFFIC,end,3.1,,1.1.1.4,4
6.833333333,raa,5,TRAFFIC,end,5.1,,1.1.1.5,5
8.333333333,kaa,6,TRAFFIC,end,7.1,,1.1.1.6,6
9.833333333,thaa,7,TRAFFIC,end,9.1,,1.1.1.7,7
11.33333333,paa,8,TRAFFIC,end,11.1,,1.1.1.8,8
12.83333333,maa,8,TRAFFIC,end,13.1,,1.1.1.9,9
If I run the above script, the output in file3.csv is:
1,bob
2,john
1,bob
2,john
1,bob
2,john
1,bob
2,john
1,bob
2,john
1,bob
2,john
1,bob
2,john
1,bob
2,john
1,bob
2,john
1,bob
2,john
1,bob
2,john
1,bob
2,john
.
.
.
It goes on like this in a loop.
But the output should be like this:
count,name
1,bob,
2,john,
4,foo,
5.333333333,test,
6.833333333,raa,
8.333333333,kaa,
9.833333333,thaa,
11.33333333,paa,
12.83333333,maa,
I think you should reconsider your approach. You can achieve your goal simply by iterating over the CSV file, without creating intermediate dicts and lists. And since you want to work with specific columns, you'll make your life easier and your code more readable by using DictReader and DictWriter:
import csv

search_string = "whatever"

with open('file1.csv', newline='') as f, open('file2.csv', 'w', newline='') as g:
    reader = csv.DictReader(f)
    c1, c2, c3, *_ = reader.fieldnames
    writer = csv.DictWriter(g, fieldnames=(c1, c2))
    for row in reader:
        if row[c3] == search_string:
            writer.writerow({c1: row[c1], c2: row[c2]})
Keep in mind that the csv module will always return strings. You have to handle data-type conversions yourself, if you need them (I've left that out from the above).
If you don't want to use DictReader/DictWriter and don't want a header in your output file, it is a little more verbose:
with open('file1.csv') as f, open('file2.csv', 'w') as g:
    reader = csv.reader(f)
    next(reader, None)  # discard the header
    writer = csv.writer(g)
    for row in reader:
        if row[2] == search_string:
            writer.writerow(row[:2])
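Since the search strings actually come from a column of another CSV (as the question notes), one way to avoid re-reading the second file once per value, which is what produces the repeated 1,bob / 2,john output shown above, is to collect the values into a set first and make a single pass. A hedged sketch along the lines of the snippet above, assuming the column positions of file1.csv and file2.csv from the question:

import csv

with open('file1.csv') as f:
    reader = csv.reader(f)
    next(reader, None)  # discard the header
    search_values = {row[2] for row in reader}  # third column of file1.csv

with open('file2.csv') as f, open('file3.csv', 'w', newline='') as g:
    reader = csv.reader(f)
    next(reader, None)  # discard the header
    writer = csv.writer(g)
    for row in reader:
        if row[2] in search_values:
            writer.writerow(row[:2])  # first two columns only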
That is, how to search for text/string in a csv file and, if present, how to extract only specific column values to a new csv file?
This is two questions.
First question: to search for text in a file, the simplest answer would be to read the file text into memory and look for the text. If you want to look for the text in a specific column of the csv you're reading in, you can use a DictReader to make life easy:
for row in reader:
    if search_target in row[header]:
        # found it!
Second question:
One way to write specific columns to a new csv would be as follows:
keys = ["Country", "Location"]
new_rows = [{key: row[key] for key in keys} for row in reader]
writer = csv.DictWriter(somefile, keys)
writer.writerows(new_rows)
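For completeness, a hedged sketch of how that second snippet might be wired up end to end (the file names here are placeholders, not from the question):

import csv

keys = ["Country", "Location"]

with open('file1.csv', newline='') as f, open('out.csv', 'w', newline='') as somefile:
    reader = csv.DictReader(f)
    writer = csv.DictWriter(somefile, keys)
    writer.writeheader()
    writer.writerows({key: row[key] for key in keys} for row in reader)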
This may help you understand better: read two csv files, check whether the row index values are the same, and if they are, write the row to another csv.
import csv
import os

output_dir = r"D:\Laneending\data-ars540"
file1 = "3rd_test_rec_road_width_changing_scenarios_250_inference.csv"
file2 = "df_5_signals_1597515776730734.csv"
ars540 = os.path.join(output_dir, file1)
veh_dyn = os.path.join(output_dir, file2)

file3 = "df_5_signals_1597515776730734_processed.csv"
output_file = os.path.join(output_dir, file3)

with open(ars540, 'r') as f1, open(veh_dyn, 'r') as f2, \
        open(output_file, 'w+', newline='') as f3:
    f1_reader = csv.reader(f1)
    f2_reader = csv.reader(f2)
    header_f1 = next(f1_reader)  # read the header line of the first csv file
    header_f2 = next(f2_reader)  # read the header line of the second csv file
    count = 0
    writer = csv.writer(f3)  # prepare f3 for writing
    writer.writerow(["Timestamp", "no of detections", "velocity", "yawrate", "afdr"])
    for row_f1 in f1_reader:  # look at each row from csv file f1
        for row_f2 in f2_reader:  # look for a matching row from csv file f2
            if row_f1[1] == row_f2[0]:  # check the condition; worst-case time complexity O(n^2)
                # print(row_f2)
                print(count)
                writer.writerows([row_f2])
                count += 1
                break
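The nested loop above is quadratic in the worst case, as its own comment notes. If the goal is simply "keep the rows of the second file whose first column appears in column 1 of the first file", a set lookup makes it a single pass per file. A hedged sketch reusing the ars540 / veh_dyn / output_file paths from above:

import csv

# Collect the join keys (column 1 of the first file) into a set.
with open(ars540, 'r') as f1:
    f1_reader = csv.reader(f1)
    next(f1_reader)                         # skip the header
    wanted = {row[1] for row in f1_reader}

# Stream the second file once, writing the rows whose first column matches.
with open(veh_dyn, 'r') as f2, open(output_file, 'w+', newline='') as f3:
    f2_reader = csv.reader(f2)
    next(f2_reader)                         # skip the header
    writer = csv.writer(f3)
    writer.writerow(["Timestamp", "no of detections", "velocity", "yawrate", "afdr"])
    for row_f2 in f2_reader:
        if row_f2[0] in wanted:
            writer.writerow(row_f2)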
