I'm trying to solve a simple practice test question:
Parse the CSV file to:
Find only the rows where the user started before September 6th, 2010.
Next, order the values from the "words" column in ascending order (by start date)
Return the compiled "hidden" phrase
The CSV file has 19 columns and 1000 rows of data, most of which are irrelevant. As the problem states, we're only concerned with sorting the start_date column in ascending order to get the associated word from the 'words' column. Together, the words will give the "hidden" phrase.
The dates in the source file are Unix timestamps (UTC), so I had to convert them. I'm at the point now where I think I've got the right rows selected, but I'm having issues sorting the dates.
Here's my code:
import csv
from collections import OrderedDict
from datetime import datetime

with open('TSE_sample_data.csv', 'rb') as csvIn:
    reader = csv.DictReader(csvIn)
    for row in reader:
        # convert from UTC to more standard date format
        startdt = datetime.fromtimestamp(int(row['start_date']))
        new_startdt = datetime.strftime(startdt, '%Y%m%d')
        # find dates before Sep 6th, 2010
        if new_startdt < '20100906':
            # add the values from the 'words' column to a list
            words = []
            words.append(row['words'])
            # add the dates to a list
            dates = []
            dates.append(new_startdt)
            # create an ordered dictionary to sort the dates... this is where I'm having issues
            dict1 = OrderedDict(zip(words, dates))
            print dict1
            #print list(dict1.items())[0][1]
            #dict2 = sorted([(y,x) for x,y in dict1.items()])
            #print dict2
When I print dict1, I'm expecting one ordered dictionary with the words and the dates included as items. Instead, what I'm getting is a separate ordered dictionary for each key-value pair.
Here's the corrected version:
import csv
from collections import OrderedDict
from datetime import datetime

with open('TSE_sample_data.csv', 'rb') as csvIn:
    reader = csv.DictReader(csvIn)
    words = []
    dates = []
    for row in reader:
        # convert from UTC to more standard date format
        startdt = datetime.fromtimestamp(int(row['start_date']))
        new_startdt = datetime.strftime(startdt, '%Y%m%d')
        # find dates before Sep 6th, 2010
        if new_startdt < '20100906':
            # add the values from the 'words' column to a list
            words.append(row['words'])
            # add the dates to a list
            dates.append(new_startdt)

# This is where I was going wrong! I had to move the lines below outside of the for loop.
# Originally, because I was still inside the for loop, I was creating a new OrderedDict for each "row in reader" that met my if condition.
# By doing this outside of the for loop, I create a single ordered dict storing all of the values that have been found.
# create an ordered dictionary to sort by the dates
dict1 = OrderedDict(zip(words, dates))
dict2 = sorted([(y,x) for x,y in dict1.items()])
# print the hidden message
for i in dict2:
    print i[1]
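As an aside, the intermediate OrderedDict isn't really needed, and it would silently drop duplicate words since they become keys. Because the goal is to order words by date, you can sort (date, word) tuples directly. A minimal sketch of that idea, assuming Python 3 and the same file and column names as above:

import csv
from datetime import datetime

with open('TSE_sample_data.csv', newline='') as csv_in:
    pairs = []
    for row in csv.DictReader(csv_in):
        # start_date is assumed to be a Unix timestamp, as in the code above
        started = datetime.fromtimestamp(int(row['start_date']))
        if started < datetime(2010, 9, 6):
            pairs.append((started, row['words']))

# tuples sort by their first element, so the words come out in start-date order
print(' '.join(word for _, word in sorted(pairs)))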
I have a CSV file with the download times of various files, and I want to know the number of files that were downloaded per day.
Code:
import csv
from dateutil.parser import parse  # assuming parse comes from dateutil

with open('hello.csv', 'r', encoding="latin-1") as csvfile:
    readCSV = csv.reader(csvfile, delimiter=',')
    list1 = list(readCSV)
    count = 0
    b = -1
    for j in list1:
        b = b + 1
        if b > 0:
            dt = j[1]
            dt_obj = parse(dt)
            d = dt_obj.date()
            if dt == d:
                count += 1
            else:
                print(count)
                break
hello.csv is my CSV file. I have datetimes, so I use the parser to get the date. I want the number of downloads per day. I know that this code can't work, but I don't know how to compare whether the next entry has the same date or not.
My datetimes look like "2004-01-05 17:56:46" and are in the second column of the CSV file. When I have 7 entries on 2004-01-05 and 5 on 2004-01-06, the count vector should look like count=[7, 5], for example.
You can follow this procedure:
Convert to a datetime object.
Create a column containing only the date (remove the time).
Group by the new date column.
Count the objects.
import pandas as pd

# Read the csv file
data = pd.read_csv('hello.csv')
# Convert the column to datetime objects
data['timestamp'] = pd.to_datetime(data['timestamp'])
# Create a date-only column
data['date'] = data['timestamp'].apply(lambda x: x.date())
# Group by date and count ('column' stands for any other column in the file)
data.groupby('date')['column'].count()
# Result
date
2019-05-20 4
2019-05-21 3
Name: column, dtype: int64
When you want to count elements, the Python collections module provides the Counter class, which can be used as a dictionary of {element_name: count}. I will assume that your parse function does what you want. The code can simply be:
import collections
import csv
from dateutil.parser import parse  # assuming parse comes from dateutil

with open('hello.csv', 'r', encoding="latin-1") as csvfile:
    readCSV = csv.reader(csvfile, delimiter=',')
    next(readCSV)  # skip the header row
    counter = collections.Counter(parse(row[1]).date() for row in readCSV)
    print(counter)
With your expected data, it should print (the keys are date objects):
Counter({datetime.date(2004, 1, 5): 7, datetime.date(2004, 1, 6): 5})
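If you want the plain per-day vector from the question (count=[7, 5]), you can sort the counter items chronologically; a small sketch building on the counter above:

# date objects sort chronologically, so this yields the counts in day order
counts = [n for day, n in sorted(counter.items())]
print(counts)  # [7, 5] for the sample data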
I suggest using Pandas. Say your date column is called date. Once the column is parsed as datetimes and normalized (the time stripped off), you can group by date and use the transform method:
import pandas as pd

df = pd.read_csv('hello.csv')
# parse the column and normalize each timestamp to midnight so rows group by day
df['date'] = pd.DatetimeIndex(df.date).normalize()
# transform('count') writes the per-day count onto every row of that day
df['count'] = df.groupby('date')['date'].transform('count')
df = df[['date', 'count']]
Now you have a new dataframe with what you want.
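Note that this keeps one row per download, so each date appears repeatedly with the same count. If you'd rather have a single row per date, dropping duplicates should do it (a follow-up on the same df):

df = df.drop_duplicates()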
I'm working with a large CSV file where each row includes the date and two values. I'm trying to set up a dictionary with the date as the key for the two values. I then need to multiply the two values of each key and record the answer. I have 3000 rows in the file.
So far I have the date set as the key for each pair of values; however, it's also storing the date again as part of each key's value. Is there a way to remove this?
Once I've removed it, is there a way to multiply the values by each other in each key set?
This is my code so far:
main_file = "newnoblanks.csv"

import csv
import collections
import pprint

with open(main_file) as fp:
    root = csv.reader(fp, delimiter=',')
    result = collections.defaultdict(list)
    for row in root:
        date = row[0].split(",")[0]
        result[date].append(row)

print ("Result:-")
pprint.pprint(result)
My output maps each date key to the full row, so the date shows up again inside the value list.
I don't think you even need a defaultdict here; just assign the rest of the row (minus the date) as the value for the date key. You should just be able to do:
with open(main_file) as fp:
    root = csv.reader(fp, delimiter=',')
    result = dict()
    for row in root:
        date = row[0].split(",")[0]
        result[date] = row[1:]
If you want to get the product of the two values, you could do something like
from functools import reduce  # needed on Python 3
for key in result:
    # the csv fields are strings, so convert them before multiplying
    result[key] = reduce(lambda x, y: float(x) * float(y), result[key])
I know this has been answered, but feel there is an alternative worth considering:
import csv
from pprint import pprint

with open('newnoblanks.csv') as fp:
    root = csv.reader(fp)
    result = dict((date, float(a) * float(b)) for date, a, b in root)

pprint(result)
With the following data file:
19/08/2004,49.8458,44994500
20/08/2004,53.80505,23005800
23/08/2004,54.34653,18393200
The output is:
{'19/08/2004': 2242786848.1,
'20/08/2004': 1237828219.29,
'23/08/2004': 999606595.5960001}
I am writing a script (once upon a time...) where I read the data from an Excel file. For that data I create an id based on the date and time. I have one missing variable, which is contained in a txt file. The txt file also has a date and time from which to create an id.
Now I would like to link the data from the Excel file and the txt file based on the id. Right now I am building two lists from the txt file: one containing the id and the other containing the value I need. Then I get the index in the id list where the id matches the Excel data, using the enumerate function, and use that index to get the value from the value list. The code looks something like this:
datelist = []
valuelist = []

txtfile = open(folder + os.sep + "Textfile.txt", "r")
ILines = txtfile.readlines()
for i, row in enumerate(ILines):
    datelist.append(row.split(",")[1])
    valuelist.append(row.split(",")[2])

rows = myexceldata
for row in rows:
    x = row[id]
    row = row + valuelist[[i for i, e in enumerate(datelist) if e == x][0]]
However, that takes ages, and I wonder if there is a better way to do that.
The files look like this:
Excelfile:
Date Time Var1 Var2
03.02.2016 12:53:24 10 27
03.02.2016 12:53:25 10 27
03.02.2016 12:53:26 10 27
Textfile:
Date Time Var3
03.02.2016 12:53:24 16
03.02.2016 12:53:25 20
Result:
Date Time Var1 Var2 Var3
03.02.2016 12:53:24 10 27 16
03.02.2016 12:53:25 10 27 20
03.02.2016 12:53:26 10 27 *)
*) It would be perfect if the same value as above appeared here, but empty would be OK, too.
Ok, I forgot one important thing, sorry about that: not all times of the Excel file are in the text file. The best option would be to get Var3 from the latest time in the text file just before the time in the Excel file, but it would also be an option to leave it blank then.
If both of your files are sorted in time order then the following kind of approach would be fast:
from heapq import merge
from itertools import groupby, chain
import csv

with open('excel.txt', 'rb') as f_excel, open('textfile.txt', 'rb') as f_text, open('output.txt', 'wb') as f_output:
    csv_excel = csv.reader(f_excel)
    csv_text = csv.reader(f_text)
    csv_output = csv.writer(f_output)

    header_excel = next(csv_excel)
    header_text = next(csv_text)
    csv_output.writerow(header_excel + [header_text[-1]])

    for k, g in groupby(merge(csv_text, csv_excel), key=lambda x: x[0:2]):
        csv_output.writerow(k + list(chain.from_iterable(cols[2:] for cols in g)))
This assumes your two input files are both in csv format, and works as follows:
Create csv readers/writers for all of the files. This allows the files to automatically be read in as lists of columns without requiring each line to be split.
Extract the headers from both of the files and write a combined form to the output.
Take the two input files and pass them to merge. This returns a row at a time from either input file in order.
Pass this to groupby to group rows with the same date and time together. This returns a key and a group, where the key is the date and time that matched, and the group is an iterable of the matching rows.
For each grouped entry, write the key and columns 2 onwards from each row to the output file. chain is used to produce a flat list.
This would give you an output file as follows:
Date,Time,Var1,Var2,Var3
03.02.2016,12:53:24,10,27,16
03.02.2016,12:53:25,10,27,20
As you already have the excel data, this would need to be passed to merge instead of csv_excel as a list of rows/cols.
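For the follow-up point about times that are missing from the text file, a dict lookup with a carried-forward value is one option. A sketch, assuming Python 3, that the Excel rows (myexceldata, as in the question) are already [date, time, var1, var2] lists in time order, and that the text file is the CSV shown above:

import csv

# build a lookup of (date, time) -> Var3 from the text file
with open('textfile.txt', newline='') as f_text:
    csv_text = csv.reader(f_text)
    next(csv_text)  # skip the header
    var3_by_time = {(row[0], row[1]): row[2] for row in csv_text}

last_var3 = ''  # stays empty until the first match
merged = []
for row in myexceldata:
    # reuse the previous value when this exact time is missing from the text file
    last_var3 = var3_by_time.get((row[0], row[1]), last_var3)
    merged.append(row + [last_var3])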
I have a bunch of files with file-names as
companyname-date_somenumber.txt
I have to sort the files according to company name, then according to date, and copy their content in this sorted order to another text file.
Here's the approach I'm trying:
From each file name, extract the company name and then the date, put these two fields in a dictionary, append this dictionary to a list, and then sort the list by companyname and then by date.
Once I have the sorted order, I think I can look up the files in the folder in the file order I just obtained, then copy each file's content into a txt file, and I'll have my final txt file.
Here's the code I have so far:
myfiles = [f for f in listdir(path) if isfile(join(path, f))]
file_list = []
for file1 in myfiles:
    # find indices of companyname and date in the file-name
    idx1 = file1.index('-', 0)
    idx2 = file1.index('_', idx1)
    company = file1[0:idx1]  # extract companyname
    thisdate = file1[idx1+1:idx2]  # extract date, which is in format MMDDYY
    dict = {}
    # extract month, date and year from thisdate
    m = thisdate[0:2]
    d = thisdate[2:4]
    y = '20' + thisdate[4:6]
    # convert into date object
    mydate = date(int(y), int(m), int(d))
    dict['date'] = mydate
    dict['company'] = company
    file_list.append(dict)
I checked the output of file_list at the end of this block of code and I think I have my list of dicts. Now, how do I sort by companyname and then by date? I looked up sorting by multiple keys online but how would I get the increasing order by date?
Is there any other way that I could sort a list by a string and then a date field?
import os
from datetime import datetime

MY_DIR = 'somedirectory'
# my_files = [f for f in os.listdir(MY_DIR) if os.path.isfile(os.path.join(MY_DIR, f))]
my_files = [
    'ABC-031814_01.txt',
    'ABC-031214_02.txt',
    'DEF-010114_03.txt'
]

file_list = []
for file_name in my_files:
    company, _, rhs = file_name.partition('-')
    datestr, _, rhs = rhs.partition('_')
    file_date = datetime.strptime(datestr, '%m%d%y')
    file_list.append(dict(file_date=file_date, file_name=file_name, company=company))

for row in sorted(file_list, key=lambda x: (x.get('company'), x.get('file_date'))):
    print row
The function sorted takes a keyword argument key that is a function applied to each item in the sequence you're sorting. If this function returns a tuple, the sequence will be sorted by the items in the tuple in turn.
Here lambda x: (x.get('company'),x.get('file_date')) allows sorted to sort by company name and then by date.
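To finish the original task (concatenating the file contents in sorted order), one possible follow-up, reusing the file_list built above and assuming the files live in MY_DIR (the output name combined.txt is made up here):

# write the files' contents to one output file, in company/date order
with open('combined.txt', 'w') as out_file:
    for row in sorted(file_list, key=lambda x: (x['company'], x['file_date'])):
        with open(os.path.join(MY_DIR, row['file_name'])) as in_file:
            out_file.write(in_file.read())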
Here's a sample csv file
id, serial_no
2, 500
2, 501
2, 502
3, 600
3, 601
This is the output I'm looking for (a list of serial_no values within a list for each id):
[2, [500,501,502]]
[3, [600, 601]]
I have implemented my solution but it's too much code and I'm sure there are better solutions out there. Still learning Python and I don't know all the tricks yet.
file = 'test.csv'
data = csv.reader(open(file))
fields = data.next()
for row in data:
    each_row = []
    each_row.append(row[0])
    each_row.append(row[1])
    zipped_data.append(each_row)

for rec in zipped_data:
    if rec[0] not in ids:
        ids.append(rec[0])

for id in ids:
    for rec in zipped_data:
        if rec[0] == id:
            ser_no.append(rec[1])
    tmp.append(id)
    tmp.append(ser_no)
    print tmp
    tmp = []
    ser_no = []
**I've omitted the variable initializing for simplicity of the code.
print tmp gives me the output I mentioned above. I know there's a better, more Pythonic way to do this; it's just too messy! Any suggestions would be great!
import csv
from collections import defaultdict

records = defaultdict(list)

file = 'test.csv'
data = csv.reader(open(file))
fields = data.next()
for row in data:
    records[row[0]].append(row[1])

# sorting by ids since dict keys don't maintain order
results = sorted(records.items(), key=lambda x: x[0])
print results
If the list of serial_nos needs to be unique, just replace defaultdict(list) with defaultdict(set) and records[row[0]].append(row[1]) with records[row[0]].add(row[1]), as sketched below.
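A minimal sketch of that substitution applied to the loop above:

records = defaultdict(set)
for row in data:
    records[row[0]].add(row[1])  # a set keeps each serial_no at most once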
Instead of a list, I'd make it a collections.defaultdict(list), and then just call the append() method on the value.
result = collections.defaultdict(list)
for row in data:
    result[row[0]].append(row[1])
Here's a version I wrote; it looks like there are plenty of answers for this one already, though.
You might like using csv.DictReader, gives you easy access to each column by field name (from the header / first line).
#!/usr/bin/python
import csv

myFile = open('sample.csv', 'rb')
csvFile = csv.DictReader(myFile)
# first row will be used for field names (by default)
myData = {}
for myRow in csvFile:
    myId = myRow['id']
    if not myData.has_key(myId): myData[myId] = []
    myData[myId].append(myRow['serial_no'])

for myId in sorted(myData):
    print '%s %s' % (myId, myData[myId])

myFile.close()
Some observations:
0) file is a built-in (a synonym for open), so it's a poor choice of name for a variable. Further, the variable actually holds a file name, so...
1) The file can be closed as soon as we're done reading from it. The easiest way to accomplish that is with a with block.
2) The first loop appears to go over all the rows, grab the first two elements from each, and make a list with those results. However, your rows already all contain only two elements, so this has no net effect. The CSV reader is already an iterator over rows, and the simple way to create a list from an iterator is to pass it to the list constructor.
3) You proceed to make a list of unique ID values, by manually checking. A list of unique things is better known as a set, and the Python set automatically ensures uniqueness.
4) You have the name zipped_data for your data. This is telling: applying zip to the list of rows would produce a list of columns - and the IDs are simply the first column, transformed into a set.
5) We can use a list comprehension to build the list of serial numbers for a given ID. Don't tell Python how to make a list; tell it what you want in it.
6) Printing the results as we get them is kind of messy and inflexible; better to create the entire chunk of data (then we have code that creates that data, so we can do something else with it other than just printing it and forgetting it).
Applying these ideas, we get:
filename = 'test.csv'
with open(filename) as in_file:
    data = csv.reader(in_file)
    data.next()  # ignore the field labels
    rows = list(data)  # read the rest of the rows from the iterator

print [
    # We want a list of all serial numbers from rows with a matching ID...
    [serial_no for row_id, serial_no in rows if row_id == id]
    # for each of the IDs that there is to match, which come from making
    # a set from the first column of the data.
    for id in set(zip(*rows)[0])
]
We can probably do even better than this by using the groupby function from the itertools module.
An example using itertools.groupby. This only works if the rows are already grouped by id:
from csv import DictReader
from itertools import groupby
from operator import itemgetter

filename = 'test.csv'
# the context manager ensures that infile is closed when it goes out of scope
with open(filename) as infile:
    # group by id - this requires that the rows are already grouped by id
    groups = groupby(DictReader(infile), key=itemgetter('id'))
    # loop through the groups, printing a list for each one
    for i, j in groups:
        print [i, map(itemgetter(' serial_no'), list(j))]
Note the space in front of ' serial_no'. This is because of the space after the comma in the header line of the input file.
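As an alternative to hard-coding the space, the csv module's skipinitialspace option makes the reader drop whitespace right after each delimiter, including in the header line, so the field name comes back as a plain 'serial_no'. A sketch of the one changed line:

# with skipinitialspace=True the header parses as 'serial_no', no leading space
groups = groupby(DictReader(infile, skipinitialspace=True), key=itemgetter('id'))

The later itemgetter would then use 'serial_no' without the space as well.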