I have a dataset which looks like this:
id,created_at,username
1,2006-10-09T18:21:51Z,hey
2,2007-10-09T18:30:28Z,bob
3,2008-10-09T18:40:33Z,bob
4,2009-10-09T18:47:42Z,john
5,2010-10-09T18:51:04Z,brad
...
It contains 1M+ lines.
I'd like to extract the list of usernames, without duplicates, from it using Python. So far my code looks like this:
import csv
file1 = open("sample.csv", 'r')
file2 = open("users.csv", 'w')
reader = csv.reader(file1)
writer = csv.writer(file2)
rownum = 0
L = []
for row in reader:
    if not rownum == 0:
        if not row[2] in L:
            L.append(row[2])
            writer.writerow(row[2])
    rownum += 1
I have several questions:
1 - my output in users.csv looks like this:
h,e,y
b,o,b
j,o,h,n
b,r,a,d
How do I remove the commas between each letter?
2 - My code is not very elegant; is there any way to import the csv file as a matrix to select the last column, and then to use an elegant library (like underscore.js in JavaScript) to remove the duplicates?
Many thanks
You can use a set here: it provides O(1) membership lookup, compared to O(N) for lists.
seen = set()
add_ = seen.add
next(reader)  # skip header
writer.writerows([row[-1]] for row in reader
                 if row[-1] not in seen and not add_(row[-1]))
And always use the with statement for handling files; it'll automatically close the file for you:
with open("sample.csv", 'r') as file1, open("users.csv", 'w') as file2:
    # Do stuff with file1 and file2 here
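Putting both ideas together, here's a minimal runnable sketch of the whole approach (using io.StringIO in place of real files so the example is self-contained; swap in open() calls for actual files):

```python
import csv
import io

# Stand-ins for sample.csv and users.csv
sample = io.StringIO(
    "id,created_at,username\n"
    "1,2006-10-09T18:21:51Z,hey\n"
    "2,2007-10-09T18:30:28Z,bob\n"
    "3,2008-10-09T18:40:33Z,bob\n"
)
output = io.StringIO()

reader = csv.reader(sample)
writer = csv.writer(output)
next(reader)  # skip the header row

seen = set()
for row in reader:
    username = row[-1]
    if username not in seen:          # O(1) membership test
        seen.add(username)
        writer.writerow([username])   # note the list: one one-column row

usernames = output.getvalue().splitlines()
print(usernames)  # ['hey', 'bob']
```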
Change
writer.writerow(row[2])
to
writer.writerow([row[2]])
Also, checking for membership in a list is computationally expensive [O(n)]. If you will be checking for membership in a large collection of items, and doing it often, use a set [O(1)]:
L = set()
next(reader)  # Skip the header
for row in reader:
    if row[2] not in L:
        L.add(row[2])
        writer.writerow([row[2]])
Alternatively
If you're okay with using a few megabytes of memory, just do this:
with open("sample.csv", "rb") as infile:
    reader = csv.reader(infile)
    next(reader)
    no_duplicates = set(tuple(row) for row in reader)

with open("users.csv", "wb") as outfile:
    csv.writer(outfile).writerows(no_duplicates)
If order is important, use an OrderedDict instead of a set:
from collections import OrderedDict

with open("sample.csv", "rb") as infile:
    reader = csv.reader(infile)
    next(reader)
    no_duplicates = OrderedDict.fromkeys(tuple(row) for row in reader)

with open("users.csv", "wb") as outfile:
    csv.writer(outfile).writerows(no_duplicates.keys())
Easy and short!
for line in reader:
    string = str(line)         # e.g. "['1', '2006-10-09T18:21:51Z', 'hey']"
    split = string.split(",", 2)
    username = split[2][2:-2]  # slice off the surrounding " '" and "']"
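Note that round-tripping the row through str() and slicing is fragile (it breaks if a username contains a comma or quote). Since csv has already parsed the row into a list, plain indexing does the same job; a small self-contained sketch:

```python
import csv
import io

reader = csv.reader(io.StringIO("1,2006-10-09T18:21:51Z,hey\n"))
for line in reader:
    username = line[2]  # the row is already a list; just index it
print(username)  # hey
```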
Related
I have an ASCII file with the following columns:
ID, val1, val2, val3
where ID is a row number, but not sorted. I want to write a new ASCII file with the same columns, with the IDs sorted from smallest to largest.
How could I do that in Python?
In fact, this file was produced by concatenating 2 ASCII files using the following code:
import os.path

maindir1 = "/home/d01/"
maindir2 = "/home/d02/"
outdir = "/home/final/"
pols = ["F1", "F2", "F3"]
months = ["Jan","Feb","Mar","Apr","May","Jun","Jul","Aug","Sep","Oct","Nov","Dec"]

for ipol in pols:
    for imonth in months:
        for kk in range(1, 7):
            filename = "{ipol}_{imonth}_0{kk}_L1.txt".format(ipol=ipol, imonth=imonth, kk=kk)
            out_name = os.path.join(outdir, filename)
            in_names = [os.path.join(maindir1, filename), os.path.join(maindir2, filename)]
            with open(out_name, "w") as out_file:
                for in_name in in_names:
                    with open(in_name, "r") as in_file:
                        out_file.write(in_file.read())
How could I modify the above code so that the final file is written in sorted order (based on the first column)?
Assuming Comma Separated Values
I think you're talking about a Comma Separated Values (CSV) file. The character encoding is probably ASCII. If this is true, you'll have an input like this:
id,val1,val2,val3
3,a,b,c
1,a,b,c
2,a,b,c
Python has a good standard library for this: csv.
import csv

with open("in.csv") as f:
    reader = csv.reader(f)
We import the csv library first, then open the file using a context manager. Basically, it's a nice way to open the file, do stuff (in the with block) and then have it closed for you.
The csv.reader method takes the file pointer f as an argument. This reader can be iterated and represents the contents of your file. If you cast it to a list, you get a list of lists. The first item in the list of lists is the header, which you want to save, and the rest is the contents:
contents = list(reader)
header = contents[0]
rows = contents[1:]
You then want to sort the rows. But sorting a list of lists might not do what you expect. You need to write a function that helps you find the key to use to perform the sorting:
lambda line: line[0]
This means for every line (which we expect to be a list), the key is equal to the first member of the list. If you prefer not to use lambdas, you can also define a function:
def get_key(line):
    return line[0]
get_key is identical to the lambda.
Combine this all together to get:
new_file = sorted(rows, key=lambda line: line[0])
If you didn't use the lambda, that's:
new_file = sorted(rows, key=get_key)
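As a quick illustration with made-up rows, both forms produce the same ordering:

```python
rows = [["3", "a", "b", "c"], ["1", "a", "b", "c"], ["2", "a", "b", "c"]]

def get_key(line):
    return line[0]

by_func = sorted(rows, key=get_key)
by_lambda = sorted(rows, key=lambda line: line[0])
print(by_func == by_lambda)  # True
print(by_func[0])  # ['1', 'a', 'b', 'c']
```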
To write it to a file, you can use the csv library again. Remember to first write the header then the rest of the contents:
with open("out.csv", "w") as f:
    writer = csv.writer(f)
    writer.writerow(header)
    writer.writerows(new_file)
All together, the code looks like this:
import csv

with open("in.csv") as f:
    reader = csv.reader(f)
    contents = list(reader)

header = contents[0]
rows = contents[1:]
new_file = sorted(rows, key=lambda line: line[0])

with open("out.csv", "w") as f:
    writer = csv.writer(f)
    writer.writerow(header)
    writer.writerows(new_file)
Assuming Custom
If the file is custom, definitely has the spaces in the header like you described (almost like a CSV), or you don't want to use the csv library, you can extract rows like this:
contents = [row.strip().replace(" ", "").split(",") for row in f.readlines()]
If, for instance, it's space-delimited instead of comma-delimited, you would use this:
contents = [row.split() for row in f.readlines()]
You can write rows like this:
with open("out.csv", "w") as f:
    f.write(", ".join(header) + "\n")
    for row in new_file:
        f.write(", ".join(row) + "\n")
In ensemble:
with open("in.txt") as f:
    contents = [row.strip().replace(" ", "").split(",") for row in f.readlines()]

header = contents[0]
rows = contents[1:]
new_file = sorted(rows, key=lambda line: line[0])

with open("out.csv", "w") as f:
    f.write(", ".join(header) + "\n")
    for row in new_file:
        f.write(", ".join(row) + "\n")
Hope that helps!
EDIT: This would perform a lexicographical sort on the first column, which is probably not what you want. If you can guarantee that all of the first column (aside from the header) are integers, you can just cast them from a str:
lambda line: line[0]
...becomes:
lambda line: int(line[0])
...with full code:
import csv

with open("in.csv") as f:
    reader = csv.reader(f)
    contents = list(reader)

header = contents[0]
rows = contents[1:]
new_file = sorted(rows, key=lambda line: int(line[0]))

with open("out.csv", "w") as f:
    writer = csv.writer(f)
    writer.writerow(header)
    writer.writerows(new_file)
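The difference between the two sorts shows up as soon as IDs reach two digits; with some hypothetical rows:

```python
rows = [["10", "x"], ["2", "y"], ["1", "z"]]

lex = sorted(rows, key=lambda line: line[0])       # string comparison: "10" < "2"
num = sorted(rows, key=lambda line: int(line[0]))  # numeric comparison

print(lex)  # [['1', 'z'], ['10', 'x'], ['2', 'y']]
print(num)  # [['1', 'z'], ['2', 'y'], ['10', 'x']]
```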
So, you need to sort the data you have in CSV format in ascending order on the basis of Id.
You can use this function to do it:
def Sort(sub_li):
    sub_li.sort(key=lambda x: x[0])
    return sub_li
x[0] sorts according to Id, i.e. the first column; you can change it according to your use case.
I took the following as input:
a = [["1a", 122323, 1000, 0],
     ["6a", 12323213, 24, 2],
     ["3a", 1233, 1, 3]]
So, using the above function I got the output as
[['1a', 122323, 1000, 0],
['3a', 1233, 1, 3],
['6a', 12323213, 24, 2]]
I hope this will help.
I am trying to do the following:
reader = csv.DictReader(open(self.file_path), delimiter='|')
reader_length = sum(1 for item in reader)
for line in reader:
print line
However, computing reader_length exhausts the reader, leaving it unreadable afterwards. Note that I do not want to call list() on the reader, as the file is too big to fit entirely in memory on my machine.
Use enumerate with a start value of 1, when you get to the end of the file you will have the line count:
for count, line in enumerate(reader, 1):
    pass  # do work
print count
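A self-contained sketch of this, using an in-memory file in place of self.file_path:

```python
import csv
import io

f = io.StringIO("colA|colB\n1|2\n3|4\n5|6\n")
reader = csv.DictReader(f, delimiter='|')

count = 0
for count, line in enumerate(reader, 1):
    pass  # do work with line here

print(count)  # 3 data rows
```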
Or if you need the count at the start for some reason sum using a generator expression and seek back to the start of the file:
with open(self.file_path) as f:
    reader = csv.DictReader(f, delimiter='|')
    count = sum(1 for _ in reader)
    f.seek(0)
    reader = csv.DictReader(f, delimiter='|')
    for line in reader:
        print(line)
reader = list(csv.DictReader(open(self.file_path), delimiter='|'))
print len(reader)
is one way to do this, I suppose, though it reads the whole file into memory.
another way to do it would be
reader = csv.DictReader(open(self.file_path), delimiter='|')
for i, row in enumerate(reader):
    ...
num_rows = i + 1
I have some data in a .txt file structured as follows:
Soup Tomato
Beans Kidney
.
.
.
I read in the data with
combo = open("combo.txt", "r")
lines = combo.readlines()
However, the data then appears as
lines=['Soup\tTomato\r\n','Beans\tKidney\r\n',...]
I would like each entry to be its own element in the list, like
lines=['Soup','Tomato',...]
And even better would be to have two lists, one for each column.
Can someone suggest a way to achieve this or fix my problem?
You should split the lines:
lines = [a_line.strip().split() for a_line in combo.readlines()]
Or without using readlines:
lines = [a_line.strip().split() for a_line in combo]
It looks like you're opening a tab-delimited CSV file.
Use Python's csv module.
import csv

lines = []
with open('combo.txt', 'rb') as csvfile:
    for row in csv.reader(csvfile, delimiter='\t'):
        lines += row
print(lines)
lines is now a flat list.
Or, with a list of lists, you can invert it to get the columns:
lines = []
with open('combo.txt', 'rb') as csvfile:
    for row in csv.reader(csvfile, delimiter='\t'):
        lines.append(row)  # gives you a list of lists

columns = map(list, zip(*lines))
# columns[0] == ['Soup', 'Beans', ...]
If you want to get all the items in a single list:
>>> with open('combo.txt','r') as f:
... all_soup = f.read().split()
...
>>> all_soup
['Soup', 'Tomato', 'Beans', 'Kidney']
If you want to get each column, then do this:
>>> with open('combo.txt','r') as f:
... all_cols = zip(*(line.strip().split() for line in f))
...
>>> all_cols
[('Soup', 'Beans'), ('Tomato', 'Kidney')]
Use the csv module to handle csv-like files (in this case, tab-separated values, not comma-separated values).
import csv
import itertools

with open('path/to/file.tsv') as tsvfile:
    reader = csv.reader(tsvfile, delimiter="\t")
    result = list(itertools.chain.from_iterable(reader))
csv.reader essentially splits each line on the delimiter, turning your file into a list of lists, and chain.from_iterable then flattens those rows. Together they behave roughly like this:
def read_and_flatten(path, delimiter=","):
    result = []
    with open(path) as tsvfile:
        for line in tsvfile:
            result += line.strip().split(delimiter)
    return result
(This rough sketch ignores quoting, which the real csv.reader handles.)
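A runnable version of the real pipeline, with the sample data inlined via io.StringIO:

```python
import csv
import io
import itertools

tsvfile = io.StringIO("Soup\tTomato\nBeans\tKidney\n")
reader = csv.reader(tsvfile, delimiter="\t")
flat = list(itertools.chain.from_iterable(reader))
print(flat)  # ['Soup', 'Tomato', 'Beans', 'Kidney']
```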
I am new to Python and I'm trying to create a csv parsing script.
I pass rows from the csv to a list, but what currently troubles me is that I need to combine each row with the first header line into a dictionary.
def parse_csv(datafile):
    data = []
    with open(datafile, "r") as f:
        next(f)  # skip header line
        for line in f:
            splitLine = line.strip(',')
            rowL = splitLine.rstrip('\n')  # remove the newline char
            data.append(rowL)
    pprint(data)
    return data
If the 1st header line has the keys (e.g. Title, Name etc.), how am I going to attach them to each stripped element?
e.g {'Dict1': 'data1', 'Dict2': 'data2' }
This may be considered duplicate but tried various ways from similar posts but none worked properly on my case.
I strongly recommend using the provided csv library. It will save you a lot of time and effort. Here is what you want to do:
import csv

data = []
with open(datafile, 'r') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        data.append(row)
        print(row['Title'], row['Name'])
In this example each row is actually a python dictionary.
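For instance, with a two-line file (inlined here via io.StringIO for a self-contained demo), each row comes back keyed by the header:

```python
import csv
import io

csvfile = io.StringIO("Title,Name\nMr,Smith\nMs,Jones\n")
data = []
for row in csv.DictReader(csvfile):
    data.append(row)

print(data[0]['Title'], data[0]['Name'])  # Mr Smith
```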
@GeorgiDimitrov is certainly right that the proper approach is to use the csv module from the standard library, but, if you're doing this only for self-instruction purposes, then...:
def parse_csv(datafile):
    data = []
    with open(datafile, "r") as f:
        headers = next(f).strip().split(',')
        for line in f:
            splitLine = line.strip().split(',')
            dd = dict(zip(headers, splitLine))
            data.append(dd)
    pprint(data)
    return data
This will not properly deal with quoted/escaped commas, &c -- all subtleties that are definitely best left to the csv module:-).
How can I add records from a csv file into a dictionary, in a function whose input argument is the path to that csv file?
Please help with this incomplete function:
def csv_file(p):
    dictionary = {}
    file = csv.reader(p)
    for rows in file:
        dictionary......(rows)
    return dictionary
You need to open the file first:
def csv_file(p):
    dictionary = {}
    with open(p, "rb") as infile:  # Assuming Python 2
        file = csv.reader(infile)  # Possibly DictReader might be more suitable,
        for row in file:           # but that...
            dictionary......(row)  # depends on what you want to do.
    return dictionary
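The question doesn't say what the keys and values should be; one plausible reading is the first column as key and the rest of the row as value. A sketch under that assumption (with the file contents inlined via io.StringIO for the demo):

```python
import csv
import io

infile = io.StringIO("1,hey\n2,bob\n3,john\n")
dictionary = {}
for row in csv.reader(infile):
    dictionary[row[0]] = row[1:]  # key on the first column (an assumption)

print(dictionary)  # {'1': ['hey'], '2': ['bob'], '3': ['john']}
```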
It seems as though you haven't even opened the file; you need to use open for that.
Try the following code:
import csv
from pprint import pprint

INFO_LIST = []

with open('sample.csv') as f:
    reader = csv.reader(f, delimiter=',', quotechar='"')
    for i, row in enumerate(reader):
        if i == 0:
            TITLE_LIST = [var for var in row]
            continue
        INFO_LIST.append({title: info for title, info in zip(TITLE_LIST, row)})

pprint(INFO_LIST)
I use the following csv file as an example:
"REVIEW_DATE","AUTHOR","ISBN","DISCOUNTED_PRICE"
"1985/01/21","Douglas Adams",0345391802,5.95
"1990/01/12","Douglas Hofstadter",0465026567,9.95
"1998/07/15","Timothy ""The Parser"" Campbell",0968411304,18.99
"1999/12/03","Richard Friedman",0060630353,5.95
"2001/09/19","Karen Armstrong",0345384563,9.95
"2002/06/23","David Jones",0198504691,9.95
"2002/06/23","Julian Jaynes",0618057072,12.50
"2003/09/30","Scott Adams",0740721909,4.95
"2004/10/04","Benjamin Radcliff",0804818088,4.95
"2004/10/04","Randel Helms",0879755725,4.50
You can put all that logic into a function like so:
def csv_file(file_path):
    # Check that the filepath is a string; if not, return None
    if not isinstance(file_path, str):
        return None
    # Create the list in which we will hold our dictionaries
    _info_list = []
    with open(file_path) as f:
        # Setting the delimiter and quotechar
        reader = csv.reader(f, delimiter=',', quotechar='"')
        # We use enumerate here because we know the first row contains the headings
        for i, row in enumerate(reader):
            # The first row contains the headings
            if i == 0:
                # Creating a list from the first row
                title_list = [var for var in row]
                continue
            # Zip title_list and the row together so that a dictionary comprehension is possible
            _info_list.append({title: info for title, info in zip(title_list, row)})
    return _info_list
APPENDIX
open()
zip
Dictionary Comprehension
Delimiter: the character that separates values, in this case ,.
Quotechar: the character that encloses values in a csv, in this case ".
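The quotechar is also what lets the parser cope with the doubled quotes in the sample's third row; a one-row demonstration:

```python
import csv
import io

line = '"1998/07/15","Timothy ""The Parser"" Campbell",0968411304,18.99\n'
row = next(csv.reader(io.StringIO(line), delimiter=',', quotechar='"'))
print(row[1])  # Timothy "The Parser" Campbell
```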