I have a csv file containing about 10 lines of data in the following format:
Attendees
ID, name, location, age
001, John, Harper, 53
002, Lucy, Jones, 23
etc...
I need to import it into Python and then sort the records by age. I want to do it using some kind of comparison loop (this is what we have been taught in class).
I've imported the records into Python as one long list and split it into the separate records, but I'm having trouble with how to convert the age value into an integer (I tried int(item[3]) but got an error message), and also with how I can loop through the lists one by one and refer to the last field without the lists having individual names.
This is what I have so far:
text_file = open("attendees.csv", "r")
lines = text_file.readlines()
print(lines)
new_list = []
for line in lines:
    item = line.strip().split(',')
    new_list.append(item)
print(new_list)
text_file.close()
You need to skip the first two lines of your input; until you do, you can't convert e.g. the age to an integer.
First, skip over the title and header rows in your file so they don't break the sort. Next, read all of the rows into a list. Finally, sort the rows based on the integer value held in the age column:
with open('attendees.csv', 'r') as f_input:
    title = next(f_input)
    header = next(f_input)
    rows = [[col.strip('\n ') for col in row.split(',')] for row in f_input]

for row in sorted(rows, key=lambda x: int(x[3])):
    print(row)
This would display the following output for your sample input:
['002', 'Lucy', 'Jones', '23']
['001', 'John', 'Harper', '53']
Note: it is always safer to use Python's with keyword when dealing with files. It ensures the file is automatically closed when execution leaves the with block.
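For comparison, a rough sketch of the manual equivalent (the try/finally is roughly what with does for you, ensuring the file is closed even if an error occurs):

f_input = open('attendees.csv', 'r')
try:
    title = next(f_input)
finally:
    f_input.close()

# equivalent, and shorter:
with open('attendees.csv', 'r') as f_input:
    title = next(f_input)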
So this involves a previous question I posted, where I got a lot of good answers. But for this scenario, I want to enter more than one input at the same time at the prompt, and search through a list of csv files.
For example:
Enter your strings:
11234567, 1234568, 1234569, etc.
(I want to allow anywhere from 1 to 20 values.)
And as for the file input, is there a way to search an entire folder for files with the .csv extension, instead of hardcoding the names of the csv files inside my code? That way I don't have to keep adding csv file names to my code if, say, I want to search through 50 files. Is there a script-like feature in Python to do this?
FYI, each input value I enter is distinct, so it cannot exist in 2 or more csv files at the same time.
Code:
import csv
files_to_open = ["C:/Users/CSV/salesreport1.csv", "C:/Users/CSV/salesreport2.csv"]
data=[]
##iterate through list of files and add body to list
for file in files_to_open:
    csvfile = open(file, "r")
    reader = csv.reader(csvfile)
    for row in reader:
        data.append(row)
keys_dict = {}
column = int(input("Enter the column you want to search: "))
val = input("Enter the value you want to search for in this column: ")
for row in data:
    v = row[column]
    ##adds the row to the list with the right key
    keys_dict[v] = row

if val in keys_dict:
    print(keys_dict[val])
else:
    print("Nothing was found at this column and key!")
Also one last thing, how do I show the name of the csv file as a result too?
Enter the column to search: 0
Enter values to search (separated by comma): 107561898, 107607997
['107561898', ......]
Process finished with exit code 0
107561898 is from column 0 of file 1, and 107607997 is the second value stored in file 2 (also column 0).
As you can see, the result only returns the file that contains the first input, whereas I want both inputs to be returned, so two records.
Column 0 is where all the input values (the card numbers) are.
Seeing as you want to check a large number of files, here's an example of a very simple approach that checks all the CSVs in the same folder as this script.
This script allows you to specify the column and multiple values to search for.
Sample input:
Enter the column to search: 2
Enter values to search (separated by comma): leet,557,hello
This will search the 3rd column for the words "leet" and "hello" and the number 557.
Note that columns start counting at 0, and there should be no extra spaces unless the keyword itself has a space char.
import csv
import os
# this gets all the filenames of files that end with .csv in the specified directory
# this is where the file finding happens, to answer your comment #1
path_to_dir = 'C:\\Users\\Public\\Desktop\\test\\CSV'
csv_files = [x for x in os.listdir(path_to_dir) if x.endswith('.csv')]
# get row number and cast it to an integer type
# will throw error if you don't type a number
column_number = int(input("Enter the column to search: "))
# get values list
values = input("Enter values to search (separated by comma): ").split(',')
# loops over every filename
for filename in csv_files:
    # opens the file
    with open(path_to_dir + '\\' + filename) as file:
        reader = csv.reader(file)
        # loops for every row in the file
        for row in reader:
            # checks if the entry at this row, in the specified column is one of the values we want
            if row[column_number] in values:
                # prints the row
                print(f'{filename}: {row}')
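If you'd rather not build the paths by hand, the standard glob module can do the folder scan in one step; a minimal sketch, assuming the same directory as above (glob returns full paths, so there is nothing to concatenate):

import glob
import os

path_to_dir = 'C:\\Users\\Public\\Desktop\\test\\CSV'  # same assumed folder as above

# glob.glob returns the full path of every matching file
for full_path in glob.glob(os.path.join(path_to_dir, '*.csv')):
    print(full_path)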
I'm currently trying to write a function that takes an integer and a dataset (one that I already have, named data), and looks for a column in this dataset called name. It then has to return the number of different names there are in the column (there are 4 values, but only 3 distinct ones; two of them are the same).
I'm having a hard time with this program, but this is what I have so far:
def name_count(data):
    unique = []
    for name in data:
        if name.strip() not in unique:
            unique[name] += 1
        else:
            unique[name] = 1
            unique.append(name)
The only import I'm allowed to use for this challenge is math.
Does anyone have any help or advice they can offer with this problem?
You can use a set to remove duplicates, for example:
data = ['name1', 'name2', 'name3', 'name3 ']
cleaned_data = map(lambda x: x.strip(), data)
count = len(set(cleaned_data))
print(count)
>>> 3
You almost had it. Unique should be a dictionary, not a list.
def name_count(data):
    unique = {}
    for name in data:
        key = name.strip()
        if key in unique:
            unique[key] += 1
        else:
            unique[key] = 1
    return unique
#test
print(name_count(['Jack', 'Jill', 'Mary', 'Sam', 'Jack', 'Mary']))
#output
{'Jack': 2, 'Jill': 1, 'Mary': 2, 'Sam': 1}
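Since the original task asks for the number of different names rather than their counts, you can simply take the len() of the returned dictionary:

# 4 distinct names here: Jack, Jill, Mary, Sam
print(len(name_count(['Jack', 'Jill', 'Mary', 'Sam', 'Jack', 'Mary'])))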
import pandas

def name_count(data):
    df = pandas.DataFrame(data)
    unique = []
    for name in df["name"]:  # if the column name is "name"
        if name:
            if name not in unique:
                unique.append(name)
    return unique
You need to pass the complete dataset to the function and not just the integers.
It is not clear what kind of data variable you already have there.
So, I will suggest a solution, starting from reading the file.
Considering that you have a csv file and that there is a restriction on importing only the math module (as you mentioned), this should work.
def name_count(filename):
    with open(filename, 'r') as fh:
        headers = next(fh).split(',')
        name_col_idx = headers.index('name')
        names = [
            line.split(',')[name_col_idx]
            for line in fh
        ]
    return len(set(names))
Here we read the first line, identify the location of name in the header, collect all items in the name column into a variable names and finally return the length of the set, which contains only unique elements.
Here is a solution if you are feeding a csv file to your function. It reads the csv file, drops the header line, accumulates the names (found at index 1 of each line), casts the list to a set to remove the duplicates, and returns the length of the set, which equals the number of unique names.
import csv

def name_count(filename):
    with open(filename, "r") as csvfile:
        csvreader = csv.reader(csvfile)
        names = [row[1] for row in csvreader if row][1:]
    return len(set(names))
Alternatively, if you don't want to use a csv reader, you can use a plain text-file reader without any imports, as follows. The code splits each line on commas.
def name_count(filename):
    with open(filename, "r") as infile:
        names = [row.rstrip('\n').split(',')[1] for row in infile if row][1:]
    return len(set(names))
I'm having difficulty completing a coding challenge and would like some help if possible.
I have a CSV file with 1000 forenames, surnames, emails, genders and dobs (dates of birth), and I am trying to separate each column.
My first challenge was to print the first 10 rows of the file, which I did using:
counter = 0
print("FORENAME SURNAME EMAIL GENDER DATEOFBIRTH", "\n")
csvfile = open("1000people.csv").read().splitlines()
for line in csvfile:
    print(line + "\n")
    counter = counter + 1
    if counter >= 10:
        break
This works and prints the first 10 rows as intended. The second challenge is to do the same, but in alphabetical order and with only the names. I can only manage to print the first 10 rows alphabetically using:
counter = 0
print("FORENAME SURNAME EMAIL GENDER DATEOFBIRTH", "\n")
csvfile = open("1000people.csv").read().splitlines()
for line in sorted(csvfile):
    print(line + "\n")
    counter = counter + 1
    if counter >= 10:
        break
Outcome of above code:
>>>
FORENAME SURNAME EMAIL GENDER DATEOFBIRTH
Abba,Ebbings,aebbings7z#diigo.com,Male,23/06/1993
Abby,Powney,apowneynj#walmart.com,Female,01/03/1998
Abbye,Cardus,acardusji#ft.com,Female,30/10/1960
Abeu,Chaize,achaizehi#apple.com,Male,25/06/1994
Abrahan,Shuard,ashuardb5#zdnet.com,Male,09/12/1995
Adah,Lambkin,alambkinga#skyrock.com,Female,21/08/1992
Addison,Shiers,ashiersmg#freewebs.com,Male,13/07/1981
Adelaida,Sheed,asheedqh#elpais.com,Female,06/08/1976
Adelbert,Jurkowski,ajurkowski66#amazonaws.com,Male,27/10/1957
Adelice,Van Arsdall,avanarsdallqp#pagesperso-orange.fr,Female,30/06/1979
So would there be a way to separate the columns so I can print just one specific column when chosen?
Thank you for reading and replying if you do.
The csv module helps to split the columns. A Pythonic way to achieve what you want would be:
import csv

with open("1000people.csv") as f:
    cr = csv.reader(f)  # default sep is comma, no need to change it
    first_10_rows = [next(cr, []) for _ in range(10)]
The next(cr, []) call consumes one row per iteration of the list comprehension; if the file has fewer than 10 rows, you get empty lists instead of an exception (that's the purpose of the second argument).
Now first_10_rows is a list of lists containing your values. Quotes and commas are properly handled thanks to the csv module.
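For the second part of your challenge (names only, in alphabetical order), one way is to read all the rows, sort them, and slice off the first 10; a minimal sketch, assuming the forename is in column 0 and the file has no header row (which your sample output suggests):

import csv

with open("1000people.csv") as f:
    rows = list(csv.reader(f))

# sort by forename (column 0), then print just that column for the first 10 rows
for row in sorted(rows, key=lambda r: r[0])[:10]:
    print(row[0])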
I have three different columns in my csv file, with their respective values. Column B (the Name column) has its values in all caps. I am trying to convert them to title case, but when I run the code it returns all the columns squished together and wrapped in quotes.
The Original File:
Company Name Job Title
xxxxxx JACK NICHOLSON Manager
yyyyyy BRAD PITT Accountant
I am trying to do:
Company Name Job Title
xxxxxx Jack Nicholson Manager
yyyyyy Brad Pitt Accountant
My code:
import csv

with open('C:\\Users\\Data.csv', 'rb') as f:
    reader = csv.reader(f, delimiter='\t')
    data = list(reader)

for item in data:
    if len(item) > 1:
        item[1] = item[1].title()

with open('C:\\Users\\Data.csv', 'wb') as f:
    writer = csv.writer(f, delimiter='\t')
    writer.writerows(data)
My result after I run the code: instead of three separate columns with the second column adjusted by title(), it returns all three columns squished together into just one column, with quotes.
"Company","Name","Job Title"
xxxxxx,"JACK NICHOLSON","Manager"
yyyyyy,"BRAD PITT","Accountant"
I do not know what is wrong with my snippet. The result also has odd markings at the beginning.
A slight change to Mohammed's solution using read_fwf to simplify reading the file.
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_fwf.html
import pandas as pd
df = pd.read_fwf('old_csv_file')
df.Name = df.Name.str.title()
df.to_csv('new_csv_file', index=False, sep='\t')
EDIT:
Changed to use a string method over a lambda. I prefer to use lambdas as a last resort.
You can do something like this with pandas:
import pandas as pd

df = pd.read_csv('old_csv_file', sep=r'\s{3,}')
df.Name = df.Name.apply(lambda x: x.title())
df.to_csv('new_csv_file', index=False, sep='\t')
str.title() converts a string to title case, i.e. the first letter of every word is capitalized and subsequent letters are converted to lower case.
With df.apply you can perform some operation on an entire column or row.
r'\s{3,}' is a regular expression: \s matches a whitespace character, and {3,} means 3 or more, so the pattern matches runs of 3 or more spaces.
When you are reading a CSV format you have to specify how your columns are separated.
Generally columns are separated by a comma or a tab, but in your case you have something like 5 or 6 spaces between the columns of a row.
So by using \s{3,} I am telling the CSV processor that the columns in a row are delimited by 3 or more spaces.
If I had used only \s, it would have treated the first name and last name as two separate columns, because they have 1 space in between. By requiring 3 or more spaces, the full name stays in a single column.
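As a quick illustration of how that pattern splits a row (the spacing in this sample line is made up for the example):

import re

line = 'xxxxxx     JACK NICHOLSON     Manager'
print(re.split(r'\s{3,}', line))
# ['xxxxxx', 'JACK NICHOLSON', 'Manager'] - the single space inside the
# name is not matched, so the full name stays in one column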
Take note that data stores each row as a list containing one string only (your file contains no tabs, so with delimiter='\t' the csv reader sees the whole row as a single field). Since each list has length 1, the statement inside this if block never executes:

if len(item) > 1:
    item[1] = item[1].title()
Aside from that, reading and writing in binary format is unnecessary.
import csv

with open('C:\\Users\\Data.csv', 'r') as f:
    reader = csv.reader(f, delimiter='\t')
    data = list(reader)

for item in data[1:]:  # excludes headers
    item[0] = item[0].title()  # will capitalize the Company column too
    item[0] = item[0][0].lower() + item[0][1:]  # that's why we need to revert
    print(item)
    # data stores each row as a list having one element only,
    # so the print above outputs:
    # ['xxxxxx Jack Nicholson Manager']
    # ['yyyyyy Brad Pitt Accountant']

with open('C:\\Users\\Data.csv', 'w') as f:
    writer = csv.writer(f, delimiter='\t')
    writer.writerows(data)
Here's a sample csv file
id, serial_no
2, 500
2, 501
2, 502
3, 600
3, 601
This is the output I'm looking for (a list of serial_no values grouped under each id):
[2, [500,501,502]]
[3, [600, 601]]
I have implemented my solution but it's too much code and I'm sure there are better solutions out there. Still learning Python and I don't know all the tricks yet.
file = 'test.csv'
data = csv.reader(open(file))
fields = data.next()

for row in data:
    each_row = []
    each_row.append(row[0])
    each_row.append(row[1])
    zipped_data.append(each_row)

for rec in zipped_data:
    if rec[0] not in ids:
        ids.append(rec[0])

for id in ids:
    for rec in zipped_data:
        if rec[0] == id:
            ser_no.append(rec[1])
    tmp.append(id)
    tmp.append(ser_no)
    print tmp
    tmp = []
    ser_no = []
(I've omitted variable initialization for simplicity.)
The print tmp gives me the output I mentioned above. I know there's a better, more Pythonic way to do this. It's just too messy! Any suggestions would be great!
from collections import defaultdict
import csv

records = defaultdict(list)

file = 'test.csv'
data = csv.reader(open(file))
fields = data.next()
for row in data:
    records[row[0]].append(row[1])

# sorting by ids since dict keys don't maintain order
results = sorted(records.items(), key=lambda x: x[0])
print results
If the list of serial_nos needs to be unique, just replace defaultdict(list) with defaultdict(set) and records[row[0]].append(row[1]) with records[row[0]].add(row[1]).
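A minimal sketch of that unique variant:

from collections import defaultdict
import csv

records = defaultdict(set)
with open('test.csv') as f:
    data = csv.reader(f)
    next(data)  # skip the header row
    for row in data:
        records[row[0]].add(row[1])  # a set silently discards duplicates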
Instead of a list, I'd make it a collections.defaultdict(list), and then just call the append() method on the value.
import collections

result = collections.defaultdict(list)
for row in data:
    result[row[0]].append(row[1])
Here's a version I wrote; it looks like there are plenty of answers for this one already, though.
You might like using csv.DictReader, which gives you easy access to each column by field name (taken from the header / first line).
#!/usr/bin/python
import csv

myFile = open('sample.csv', 'rb')
csvFile = csv.DictReader(myFile)
# first row will be used for field names (by default)
myData = {}
for myRow in csvFile:
    myId = myRow['id']
    if not myData.has_key(myId): myData[myId] = []
    myData[myId].append(myRow['serial_no'])
for myId in sorted(myData):
    print '%s %s' % (myId, myData[myId])
myFile.close()
Some observations:
0) file is a built-in (a synonym for open), so it's a poor choice of name for a variable. Further, the variable actually holds a file name, so...
1) The file can be closed as soon as we're done reading from it. The easiest way to accomplish that is with a with block.
2) The first loop appears to go over all the rows, grab the first two elements from each, and make a list with those results. However, your rows already all contain only two elements, so this has no net effect. The CSV reader is already an iterator over rows, and the simple way to create a list from an iterator is to pass it to the list constructor.
3) You proceed to make a list of unique ID values, by manually checking. A list of unique things is better known as a set, and the Python set automatically ensures uniqueness.
4) You have the name zipped_data for your data. This is telling: applying zip to the list of rows would produce a list of columns - and the IDs are simply the first column, transformed into a set.
5) We can use a list comprehension to build the list of serial numbers for a given ID. Don't tell Python how to make a list; tell it what you want in it.
6) Printing the results as we get them is kind of messy and inflexible; better to create the entire chunk of data (then we have code that creates that data, so we can do something else with it other than just printing it and forgetting it).
Applying these ideas, we get:
import csv

filename = 'test.csv'
with open(filename) as in_file:
    data = csv.reader(in_file)
    data.next()  # ignore the field labels
    rows = list(data)  # read the rest of the rows from the iterator

print [
    # We want a list of all serial numbers from rows with a matching ID...
    [serial_no for row_id, serial_no in rows if row_id == id]
    # for each of the IDs that there is to match, which come from making
    # a set from the first column of the data.
    for id in set(zip(*rows)[0])
]
We can probably do even better than this by using the groupby function from the itertools module.
An example using itertools.groupby. Note that this only works if the rows are already grouped by id.
from csv import DictReader
from itertools import groupby
from operator import itemgetter

filename = 'test.csv'
# the context manager ensures that infile is closed when it goes out of scope
with open(filename) as infile:
    # group by id - this requires that the rows are already grouped by id
    groups = groupby(DictReader(infile), key=itemgetter('id'))
    # loop through the groups, printing a list for each one
    for i, j in groups:
        print [i, map(itemgetter(' serial_no'), list(j))]
Note the space in front of ' serial_no'; this is because of the space after the comma in the input file.
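Alternatively, the csv module can strip that space for you via the skipinitialspace format parameter, so the field comes out as plain 'serial_no'; a minimal sketch:

from csv import DictReader

with open('test.csv') as infile:
    # skipinitialspace=True drops the whitespace that follows each delimiter,
    # both in the header row (the field names) and in the data rows
    for row in DictReader(infile, skipinitialspace=True):
        print row['id'], row['serial_no']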