Count and flag duplicates in a column in a csv - python

This type of question has been asked many times, so apologies. I have searched hard for an answer but have not found anything close enough to my needs, and as a total newbie I am not advanced enough to customize an existing answer. Thanks in advance for any help.
Here's my query:
I have 30 or so csv files and each contains between 500 and 15,000 rows.
Within each of them (in the 1st column) - are rows of alphabetical IDs (some contain underscores and some also have numbers).
I don't care about the unique IDs - but I would like to identify the duplicate IDs and the number of times they appear in all the different csv files.
Ideally I'd like the output for each duped ID to appear in a new csv file and be listed in 2 columns ("ID", "times_seen")
It may be that I need to compile just 1 csv with all the IDs for your code to run properly - so please let me know if I need to do that
I am using python 2.7 (a crawling script that I run needs this version, apparently).
Thanks again

The easiest way to achieve what you want is probably to use a dictionary.
import csv
import os
# Assuming all your csv files are in a single directory, we iterate over
# the files in this directory, selecting only those ending with .csv.
# To list the files in the directory we use the walk function in the
# os module. os.walk(path_to_dir) returns a generator (a lazy iterator).
# This generator yields tuples of the form (root_directory,
# list_of_directories, list_of_files).
# So: declare the generator
file_generator = os.walk("/path/to/csv/dir")
# Get the first tuple only; as we won't recurse into subdirectories, we
# only need this one
root_dir, list_of_dir, list_of_files = next(file_generator)
# Now, we only keep the files ending with .csv. Let me break that down
csv_list = []
for f in list_of_files:
    if f.endswith(".csv"):
        # join with root_dir so the file can be opened from any working directory
        csv_list.append(os.path.join(root_dir, f))
# The loop above is equivalent to the one-liner
# csv_list = [os.path.join(root_dir, f) for f in list_of_files if f.endswith(".csv")]
# The dictionary (key value map) that will contain the id count.
ref_count = {}
# We loop on all the csv filenames...
for csv_file in csv_list:
    # open the file in binary read mode, as the Python 2 csv module expects
    with open(csv_file, "rb") as csv_fh:
        # build a csv reader around the file
        csv_reader = csv.reader(csv_fh)
        # loop on all the lines of the file, transformed to lists by the
        # csv reader
        for row in csv_reader:
            # skip empty lines, which the reader returns as empty lists
            if not row:
                continue
            # If we haven't encountered this id yet, create
            # the corresponding entry in the dictionary.
            if row[0] not in ref_count:
                ref_count[row[0]] = 0
            # increment the number of occurrences associated with
            # this id
            ref_count[row[0]] += 1
# now write to csv output
with open("youroutput.csv", "wb") as out_fh:
    writer = csv.writer(out_fh)
    # header row with the two column names requested in the question
    writer.writerow(["ID", "times_seen"])
    for k, v in ref_count.iteritems():
        # as requested we only take duplicates
        if v > 1:
            # use the writer to write the list to the file
            # the delimiters will be added by it.
            writer.writerow([k, v])
You may need to tweak the csv reader and writer options a little to fit your needs, but this should do the trick. You'll find the documentation here: https://docs.python.org/2/library/csv.html. I haven't tested it, though. Correcting any little mistakes that may have slipped in is left as a practice exercise :).
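As an illustration of the kind of option tweaking mentioned above, here is a minimal sketch, not a tested drop-in. It is written for Python 3 (under 2.7 the in-memory file handling would differ), and the sample data, the semicolon delimiter and the header row are all assumptions made up for the example.

```python
import csv
import io
from collections import Counter

# Hypothetical sample input: semicolon-delimited, with a header row.
sample = io.StringIO("ID;url\nabc_1;x\nabc_2;y\nabc_1;z\n")

# delimiter is one of the reader options you may need to change.
reader = csv.reader(sample, delimiter=";")
# Skip the header row; the None default avoids an error on an empty file.
next(reader, None)

# Count the first column of every non-empty row.
counts = Counter(row[0] for row in reader if row)

# Write only the duplicated IDs, with the two requested column names.
out = io.StringIO()
writer = csv.writer(out, lineterminator="\n")
writer.writerow(["ID", "times_seen"])
for id_, times_seen in counts.items():
    if times_seen > 1:
        writer.writerow([id_, times_seen])

print(out.getvalue())
```

With real files you would replace the two `io.StringIO` objects with `open(...)` handles.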

That's rather easy to achieve. It would look something like:
import os
# Set to what kind of separator you have. '\t' for TAB
delimiter = ','
# Dictionary to keep count of ids
ids = {}
# Iterate over files in a dir
for in_file in os.listdir(os.curdir):
    # Check whether it is csv file (dummy way but it shall work for you)
    if in_file.endswith('.csv'):
        with open(in_file, 'r') as ifile:
            for line in ifile:
                my_id = line.strip().split(delimiter)[0]
                # If id does not exist in a dict = set count to 0
                if my_id not in ids:
                    ids[my_id] = 0
                # Increment the count
                ids[my_id] += 1
# saves ids and counts to a file
with open('ids_counts.csv', 'w') as ofile:
    for key, val in ids.iteritems():
        # write down counts to a file using same column delimiter
        ofile.write('{}{}{}\n'.format(key, delimiter, val))

Check out the pandas package. You can read and write csv files quite easily with it.
http://pandas.pydata.org/pandas-docs/stable/10min.html#csv
Then, once you have the csv content as a DataFrame, you can convert it to an array with the as_matrix function (deprecated in later pandas releases in favor of .values).
Use the answers to this question to get the duplicates as a list.
Find and list duplicates in a list?
I hope this helps

As you are a newbie, I'll try to give some directions instead of posting an answer, mainly because this is not a "code this for me" platform.
Python has a library called csv that lets you read data from CSV files (Boom! Surprised?). Start by reading the file, preferably an example file that you create with just 10 or so rows; later you can increase the number of rows or use a for loop to iterate over different files. The examples at the bottom of the page that I linked will help you print this info.
As you will see, the output you get from this library is a list with all the elements of each row. Your next step should be extracting just the ID that you are interested in.
The next logical step is counting the number of appearances. There is a class in the standard library called Counter. It has a method called update that you can use as follows:
from collections import Counter
c = Counter()
c.update(['safddsfasdf'])
c # Counter({'safddsfasdf': 1})
c['safddsfasdf'] # 1
c.update(['safddsfasdf'])
c # Counter({'safddsfasdf': 2})
c['safddsfasdf'] # 2
c.update(['fdf'])
c # Counter({'safddsfasdf': 2, 'fdf': 1})
c['fdf'] # 1
So basically you will have to pass it a list with the elements you want to count. You could put more than one ID in the list, for example reading 10 IDs before inserting them for improved efficiency, but remember not to build a list of thousands of elements if you want good memory behaviour.
If you try this and get into some trouble come back and we will help further.
Edit
Spoiler alert: I decided to give a full answer to the problem. Please skip it if you want to find your own solution and learn Python in the process.
# The csv module will help us read and write to the files
from csv import reader, writer
# The collections module has a useful type called Counter that fulfills our needs
from collections import Counter
# Getting the names/paths of the files is not this question's goal,
# so I'll just have them in a list
files = [
    "file_1.csv",
    "file_2.csv",
]
# The output file name/path will also be stored in a variable
output = "output.csv"
# We create the item that is gonna count for us
appearances = Counter()
# Now we will loop over each file
for file_name in files:
    # We open the file in binary reading mode (what the Python 2 csv
    # module expects) and get a handle
    with open(file_name, "rb") as file_h:
        # We create a csv parser from the handle
        file_reader = reader(file_h)
        # Here you may need to do something if your first row is a header
        # We loop over all the rows
        for row in file_reader:
            # We insert the id into the counter
            appearances.update(row[:1])
            # row[:1] will get explained afterwards, it is the first column of the row in list form
# Now we will open/create the output file and get a handle
with open(output, "wb") as file_h:
    # We create a csv writer for the handle
    file_writer = writer(file_h)
    # If you want to insert a header to the output file this is the place
    # We loop through our Counter object to write them:
    # here we have different options, if you want them sorted
    # by number of appearances Counter.most_common() is your friend,
    # if you don't care about the order you can use the Counter object
    # as if it was a normal dict
    # Option 1: ordered
    for id_and_times in appearances.most_common():
        # id_and_times is a tuple with the id and the times it appears,
        # so we check the second element (they start at 0)
        if id_and_times[1] == 1:
            # As they are ordered, we can stop the loop when we reach
            # the first 1 to finish the earliest possible.
            break
        # As we have ended the loop if it appears once,
        # only duplicate IDs will reach this point
        file_writer.writerow(id_and_times)
    # Option 2: unordered (use this loop instead of Option 1, not in addition to it)
    # for id_and_times in appearances.iteritems():
    #     # This time we can not stop the loop as they are unordered,
    #     # so we must check them all
    #     if id_and_times[1] > 1:
    #         file_writer.writerow(id_and_times)
I offered 2 options: writing them ordered (based on the Counter.most_common() doc) and unordered (based on the normal dict method dict.iteritems()). Choose one. From a speed point of view I'm not sure which one would be faster: the first needs to sort the Counter but stops looping at the first non-duplicated element, while the second doesn't need to sort but has to loop over every ID. The speed will probably depend on your data.
About the row[:1] thingy:
row is a list.
You can get a subset of a list by giving the initial and final positions.
In this case the initial position is omitted, so it defaults to the start.
The final position is 1, so just the first element gets selected.
So the output is another list with just the first element.
row[:1] == [row[0]] for a non-empty row: both give the same result, since taking a sublist of only the first element is the same as constructing a new list with only the first element.
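A small sketch of the slicing behaviour described above. The rows are made-up sample values. It also shows one practical difference that the explanation does not mention: on an empty row the slice quietly returns an empty list, whereas indexing raises an error.

```python
from collections import Counter

# Hypothetical rows, as the csv reader would hand them over.
row = ["abc_1", "http://example.com"]
empty_row = []

# For a non-empty row, the slice and the one-element list are equal.
same = row[:1] == [row[0]]

counts = Counter()
counts.update(row[:1])        # counts "abc_1" once
counts.update(empty_row[:1])  # slice of an empty list is [], so nothing is counted

# Indexing an empty row, by contrast, raises IndexError.
try:
    empty_row[0]
    raised = False
except IndexError:
    raised = True

print(same, counts, raised)
```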

Related

How to copy a csv file into a dictionary?

I'm working on cs50's pset6, DNA, and I want to read a csv file that looks like this:
name,AGATC,AATG,TATC
Alice,2,8,3
Bob,4,1,5
Charlie,3,2,5
But the problem is that dictionaries only have a key, and a value, so I don't know how I could structure this. What I currently have is this piece of code:
import sys
with open(argv[1]) as data_file:
    data_reader = csv.DictReader(data_file)
And also, my csv file has multiple columns and rows, with a header and the first column indicating the name of the person. I don't know how to do this, and I will later need to access the individual amount of say, Alice's value of AATG.
Also, I'm using the module sys, to import DictReader and also reader
You can always try to create the function on your own.
You can use my code here:
def csv_to_dict(csv_file):
    # csv_file is the whole content of the file as one string, not a file name
    key_list = csv_file[:csv_file.index('\n')].split(',')  # save the keys
    info = []  # list of dictionaries
    # for each line after the header
    for line in csv_file[csv_file.index('\n') + 1:].split('\n'):
        if not line:
            continue  # skip empty lines, such as the one after a trailing newline
        data = {}  # the dictionary for this line
        count = 0  # this variable saves the key index in my key_list.
        # for each string before comma
        for value in line.split(','):
            data[key_list[count]] = value  # for each key in key_list (which I've created before), I put the value. This is the way to set a dictionary's values.
            count += 1
        info.append(data)  # after filling my data (dictionary), I append it to my list.
    print(info)  # I print it.
### Be aware that this function prints a list of dictionaries.
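Another option, sketched here for Python 3 on the assumption that the standard csv module is acceptable for the exercise, is to let csv.DictReader do the parsing and to nest one dictionary per person inside an outer dictionary keyed by name. The variable names (`people`, `marker`) are made up for the example, and the in-memory sample stands in for the real file.

```python
import csv
import io

# In-memory stand-in for the csv file shown in the question.
sample = io.StringIO(
    "name,AGATC,AATG,TATC\n"
    "Alice,2,8,3\n"
    "Bob,4,1,5\n"
    "Charlie,3,2,5\n"
)

# Outer dict: person name -> inner dict of {column name: count}.
people = {}
for row in csv.DictReader(sample):
    name = row.pop("name")  # remove the name column and use it as the key
    people[name] = {marker: int(count) for marker, count in row.items()}

print(people["Alice"]["AATG"])
```

A value can be a dictionary itself, which is how a single key can lead to several columns.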

Problem parsing data from a firewall log and finding "worm"

I am struggling with trying to see what is wrong with my code. I am new to python.
import os
uniqueWorms = set()
logLineList = []
with open("redhat.txt", 'r') as logFile:
    for eachLine in logFile:
        logLineList.append(eachLine.split())
    for eachColumn in logLineList:
        if 'worm' in eachColumn.lower():
            uniqueWorms.append()
            print (uniqueWorms)
eachLine.split() returns a list of words. When you append this to logLineList, it becomes a 2-dimensional list of lists.
Then when you iterate over it, eachColumn is a list, not a single column.
If you want logLineList to be a list of words, use
logLineList += eachLine.split()
instead of
logLineList.append(eachLine.split())
Finally, uniqueWorms.append() should be uniqueWorms.add(eachColumn): uniqueWorms is a set, and sets have add() rather than append(). And print(uniqueWorms) should be outside the loop, so you just see the final result rather than a printout every time a worm is added.
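Putting those corrections together, a minimal sketch could look like the following. The log lines are invented stand-ins for redhat.txt, and lowercasing each word before adding it to the set is an extra assumption, made so that differently cased spellings count as the same worm.

```python
# Hypothetical log lines standing in for the contents of redhat.txt.
log_lines = [
    "Jan 1 kernel: DROP src=10.0.0.1 Worm.Blaster detected",
    "Jan 1 kernel: ACCEPT src=10.0.0.2",
    "Jan 2 kernel: DROP src=10.0.0.3 worm.blaster detected",
    "Jan 2 kernel: DROP src=10.0.0.4 Worm.Slammer detected",
]

unique_worms = set()
for line in log_lines:
    # split() yields individual words, so each word is a string
    for word in line.split():
        if "worm" in word.lower():
            # sets use add(), not append()
            unique_worms.add(word.lower())

# Print once, after the loop has finished.
print(sorted(unique_worms))
```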

Cannot combine two lists into a map in multiprocessing in Python

I have one csv with SKUs and URLs. I split them into two lists with
def myskus():
    myskus = []
    with open('websupplies2.csv', 'r') as csvf:
        reader = csv.reader(csvf, delimiter=";")
        for row in reader:
            myskus.append(row[0])  # Add each sku to list myskus
    return myskus
def mycontents():
    contents = []
    with open('websupplies2.csv', 'r') as csvf:
        reader = csv.reader(csvf, delimiter=";")
        for row in reader:
            contents.append(row[1])  # Add each url to list contents
    return contents
Then I multiprocess my URLs, but I want to join each result with the corresponding SKU
if __name__ == "__main__":
    with Pool(4) as p:
        records = p.map(parse, web_links)
    if len(records) > 0:
        with open('output_websupplies.csv', 'a') as f:
            f.write('\n'.join(records))
Can I use
records = p.map(parse, skus, web_links)
instead? It is not working for me.
My desired output format would be
sku price availability
bkk11 10,00 available
how can I achieve this?
minor refactor
I recommend naming your pair of functions def get_skus() and def get_urls(), to match your problem definition.
data structure
Having a pair of lists, skus and urls, does not seem like a good fit for your high level problem.
Keep them together, as a list of (sku, url) tuples, or as a sku_to_url dict.
That is, delete one of your two functions, so you're reading the CSV once, and keeping the related details together.
Then your parse() routine would have more information available to it.
The list of tuples boils down to Monty's starmap() suggestion.
writing results
You're using this:
if len(records) > 0:
    with open('output_websupplies.csv', 'a') as f:
        f.write('\n'.join(records))
Firstly, testing for at least one record is probably superfluous: it's not the end of the world to open for append and then write zero records.
If you care about the timestamp on the file then perhaps it's a useful optimization.
More importantly, the write() seems Bad.
One day an unfortunate character may creep into one of your records.
Much better to feed your structured records to a csv.writer, to ensure appropriate quoting.
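The suggestions above can be sketched roughly as follows. This is an illustration under several assumptions, not the asker's real code: `parse` is a placeholder that returns fixed values instead of fetching the URL, the sample rows are invented, and `ThreadPool` is used only to keep the sketch self-contained (`multiprocessing.Pool` offers the same `starmap` method).

```python
import csv
import io
from multiprocessing.pool import ThreadPool

# Hypothetical stand-in for websupplies2.csv (sku;url per line).
sample_csv = io.StringIO(
    "bkk11;http://example.com/a\n"
    "bkk12;http://example.com/b\n"
)

def get_sku_url_pairs(csvf):
    """Read the CSV once and keep each sku together with its url."""
    reader = csv.reader(csvf, delimiter=";")
    return [(row[0], row[1]) for row in reader if len(row) >= 2]

def parse(sku, url):
    """Placeholder: a real version would fetch url and extract the values."""
    return [sku, "10,00", "available"]

pairs = get_sku_url_pairs(sample_csv)

# starmap unpacks each (sku, url) tuple into parse(sku, url).
with ThreadPool(4) as pool:
    records = pool.starmap(parse, pairs)

# csv.writer takes care of quoting if a value ever contains the delimiter.
out = io.StringIO()
writer = csv.writer(out, delimiter=";", lineterminator="\n")
writer.writerow(["sku", "price", "availability"])
writer.writerows(records)

print(out.getvalue())
```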

Counting Occurrences of Zip Codes in Big Data Set w/Python

I'm a python newbie looking to count the 100 most occurring zip codes in several .csv files (6+). There are literally 3 million+ zip codes in the data set, and I'm looking for a way to pull out only the top 100 most occurring. Here is a sample of code below that was inspired from another post, although I'm trying to count across several .csv files. Thanks in advance!
import csv
import collections
zip = collections.Counter()
with open('zipcodefile1.csv', 'zipcodefile2.csv', 'zipcodefile3.csv') as input file:
    for row in csv.reader(input_file, delimiter=';'):
        ZIP[row[1]] += 1
print ZIP.most_common(100)
I'd suggest using Python's generators here, as they will be nice and efficient. First, suppose we have two files:
zc1.txt:
something;00001
another;00002
test;00003
and zc2.txt:
foo;00001
bar;00001
quuz;00003
Now let's write a function that takes several filenames and iterates through the lines in all of the files, returning only the zip codes:
import csv
def iter_zipcodes(paths):
    for path in paths:
        with open(path) as fh:
            for row in csv.reader(fh, delimiter=';'):
                yield row[1]
Note that we write yield row[1]. This signals that the function is a generator, and it returns its values lazily.
We can test it out as follows:
>>> list(iter_zipcodes(['zc1.txt', 'zc2.txt']))
['00001', '00002', '00003', '00001', '00001', '00003']
So we see that the generator simply spits out the zip codes in each file, in order. Now let's count them:
>>> import collections
>>> zipcodes = iter_zipcodes(['zc1.txt', 'zc2.txt'])
>>> counts = collections.Counter(zipcodes)
>>> counts
Counter({'00001': 3, '00003': 2, '00002': 1})
Looks like it worked. This approach is efficient because it only reads one line in at a time. When one file is completely read, it moves on to the next. To get the 100 most frequent zip codes, call counts.most_common(100).
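For completeness, here is a self-contained sketch (Python 3) of the whole pipeline, ending with the top-N step the question asks for. It writes the two sample files from above into a temporary directory, which is purely an assumption for demonstration, and asks for the top 2 instead of the top 100 because the sample is tiny.

```python
import collections
import csv
import os
import tempfile

def iter_zipcodes(paths):
    """Lazily yield the second column of every row in every file."""
    for path in paths:
        with open(path, newline="") as fh:
            for row in csv.reader(fh, delimiter=";"):
                if len(row) > 1:  # ignore blank or short lines
                    yield row[1]

# Create the two sample files in a temporary directory.
samples = {
    "zc1.txt": "something;00001\nanother;00002\ntest;00003\n",
    "zc2.txt": "foo;00001\nbar;00001\nquuz;00003\n",
}
tmp_dir = tempfile.mkdtemp()
paths = []
for name, text in samples.items():
    path = os.path.join(tmp_dir, name)
    with open(path, "w") as fh:
        fh.write(text)
    paths.append(path)

counts = collections.Counter(iter_zipcodes(paths))
# most_common(n) returns the n most frequent items as (value, count) pairs.
top = counts.most_common(2)
print(top)
```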

Create stock data from csv file, list, dictionary

My stock program's input is as follows.
'Sqin.txt' is the data read in; it is a csv file:
AAC,D,20111207,9.83,9.83,9.83,9.83,100
AACC,D,20111207,3.46,3.47,3.4,3.4,13400
AACOW,D,20111207,0.3,0.3,0.3,0.3,500
AAME,D,20111207,1.99,1.99,1.95,1.99,8600
AAON,D,20111207,21.62,21.9,21.32,21.49,93200
AAPL,D,20111207,389.93,390.94,386.76,389.09,10892800
AATI,D,20111207,5.75,5.75,5.73,5.75,797900
The output is
dat1[]
['AAC', ['9.83', '9.83', '9.83', '9.83', '100'], ['9.83', '9.83', '9.83', '9.83', '100']]
dat1[0] is the stock symbol 'AAC', used for lookup and data updates
dat1[1...?] is the EOD (end of day) data
At the close of the stock markets the EOD data will be inserted with dat1.insert(1, M) each update cycle.
Guys, you can probably code this out in one line. Mine so far is over 30 lines, so seeing my code isn't relevant. Above is an example of some simple input and the desired output.
If you decide to take on some real world programming, please keep it verbose. Declare your variables, then populate them, and finally use them, e.g.
M = []
M = q[0][3:]  ## had to do it this way because 'AAC' made the variable M begin as a string (immutable). So I could not add M to the data, because dat1 also became a string (immutable strings, how stupid). Had to force 'AAC' to be a list so I can create a list of lists, dat1
dat1.insert(1, M)  ## M is used to add another list to the master.dat record
Maybe it would be OK to be somewhat pythonic and a little less verbose.
You should use a dictionary with the names as keys:
import csv
import collections
filename = 'csv.txt'
with open(filename) as file_:
    reader = csv.reader(file_)
    data = collections.defaultdict(list)
    for line in reader:
        # line[1] contains "D" and line[2] is the date
        key, value = line[0], line[3:]
        data[key].append(value)
To add data, call data[name].insert(0, new_data), where name could be AAC and new_data is a list of the data. This places the new data at the beginning of the list, as you described in your post.
I would recommend append instead of insert; it is faster. If you really want the data added at the beginning of the list, use collections.deque instead of list, since its appendleft method is fast.
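A short sketch of the deque variant mentioned above, for Python 3. The two input rows are taken from the question, while the new end-of-day row (`new_eod`) is an invented value for illustration.

```python
import collections
import csv
import io

# Two of the rows from the question, as an in-memory stand-in for the file.
sample = io.StringIO(
    "AAC,D,20111207,9.83,9.83,9.83,9.83,100\n"
    "AACC,D,20111207,3.46,3.47,3.4,3.4,13400\n"
)

# Each symbol maps to a deque of end-of-day records.
data = collections.defaultdict(collections.deque)
for line in csv.reader(sample):
    # line[1] contains "D" and line[2] is the date
    key, value = line[0], line[3:]
    data[key].append(value)

# Hypothetical next-day record, placed at the front of the deque.
new_eod = ["9.90", "9.95", "9.80", "9.85", "200"]
data["AAC"].appendleft(new_eod)

print(list(data["AAC"]))
```

The newest record is then always at index 0, which matches the insert-at-the-front behaviour asked for in the question.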
