I have a list of blog posts with two columns: the date each post was created and the unique ID of the person who created it.
I want to return the date of the most recent blog post for each unique ID. Simple, but all of the date values are stored as strings, and none of the strings have a leading 0 when the month is less than 10.
I've been struggling with strftime and strptime but can't get them to return the right result.
import csv
Posters = {}
with open('datetouched.csv','rU') as f:
    reader = csv.reader(f)
    for i in reader:
        UID = i[0]
        Date = i[1]
        if UID in Posters:
            Posters[UID].append(Date)
        else:
            Posters[UID] = [Date]
for i in Posters:
    print i, max(Posters[i]), Posters[i]
This returns the following output
0014000000s5NoEAAU 7/1/10 ['1/6/14', '7/1/10', '1/18/14', '1/24/14', '7/1/10', '2/5/14']
0014000000s5XtPAAU 2/3/14 ['1/4/14', '1/10/14', '1/16/14', '1/22/14', '1/28/14', '2/3/14']
0014000000vHZp7AAG 2/1/14 ['1/2/14', '1/8/14', '1/14/14', '1/20/14', '1/26/14', '2/1/14']
0014000000wnPK6AAM 2/2/14 ['1/3/14', '1/9/14', '1/15/14', '1/21/14', '1/27/14', '2/2/14']
0014000000d5YWeAAM 2/4/14 ['1/5/14', '1/11/14', '1/17/14', '1/23/14', '1/29/14', '2/4/14']
0014000000s5VGWAA2 7/1/10 ['7/1/10', '1/7/14', '1/13/14', '1/19/14', '7/1/10', '1/31/14']
It's returning 7/1/10 because the values are compared as strings, and '7' sorts after '1' and '2'. I need the max value of the list, by actual date, returned as the exact same string value.
I'd convert the date to a datetime when loading, and store the results in a defaultdict, eg:
import csv
from collections import defaultdict
from datetime import datetime
posters = defaultdict(list)
with open('datetouched.csv','rU') as fin:
    csvin = csv.reader(fin)
    items = ((row[0], datetime.strptime(row[1], '%m/%d/%y')) for row in csvin)
    for uid, dt in items:
        posters[uid].append(dt)
for uid, dates in posters.iteritems():
    # print uid, list of datetime objects, and max date in same format as input
    # ({0:%y} is needed for the year; a bare %y is not expanded by str.format)
    print uid, dates, '{0.month}/{0.day}/{0:%y}'.format(max(dates))
Parse the dates with datetime.datetime.strptime(), either when loading the CSV or as a key function to max().
While loading:
from datetime import datetime
Date = datetime.strptime(i[1], '%m/%d/%y')
or when using max():
print i, max(Posters[i], key=lambda d: datetime.strptime(d, '%m/%d/%y')), Posters[i]
Demo of the latter:
>>> from datetime import datetime
>>> dates = ['1/6/14', '7/1/10', '1/18/14', '1/24/14', '7/1/10', '2/5/14']
>>> max(dates, key=lambda d: datetime.strptime(d, '%m/%d/%y'))
'2/5/14'
Your code can be optimised a little:
import csv
from datetime import datetime
posters = {}
with open('datetouched.csv','rb') as f:
    reader = csv.reader(f)
    for row in reader:
        uid, date = row[:2]
        # the input is month/day/year, so the format is %m/%d/%y
        posters.setdefault(uid, []).append(datetime.strptime(date, '%m/%d/%y'))
for uid, dates in posters.iteritems():
    print uid, max(dates), dates
The dict.setdefault() method sets a default value (an empty list here) whenever the key is not present yet.
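As a rough Python 3 sketch of the two ideas above combined (grouping with dict.setdefault() and picking the latest date with a key function), assuming the rows are already available as (uid, date string) pairs. The function name latest_per_user and the sample rows are made up for illustration:

```python
from datetime import datetime

def latest_per_user(rows, fmt='%m/%d/%y'):
    """Map each uid to its most recent date, returned as the original string."""
    grouped = {}
    for uid, date_str in rows:
        # setdefault creates the empty list the first time a uid is seen
        grouped.setdefault(uid, []).append(date_str)
    # compare by parsed date, but keep the original string as the result
    return {uid: max(dates, key=lambda d: datetime.strptime(d, fmt))
            for uid, dates in grouped.items()}

# hypothetical sample data
rows = [('A', '1/6/14'), ('A', '7/1/10'), ('A', '2/5/14'), ('B', '7/1/10')]
result = latest_per_user(rows)
print(result)
```

Because max() only uses the key function for comparison, the value returned is the unchanged input string, which is what the question asks for.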
Related
I need to remove the day from a date. I tried to use datetime.strftime and datetime.strptime but couldn't get them to work. I need to create a tuple of 2 items (date, price) from a nested list, but I need to change the date format first.
Here's part of the code:
def get_data(my_csv):
    with open("my_csv.csv", "r") as csv_file:
        csv_reader = csv.reader(csv_file, delimiter = (','))
        next(csv_reader)
        data = []
        for line in csv_reader:
            data.append(line)
    return data
def get_monthly_avg(data):
    oldformat = '20040819'
    datetimeobject = datetime.strptime(oldformat,'%y%m%d')
    newformat = datetime.strftime('%y%m ')
You have a typo in the date formats: 'Y' has to be capitalized (%Y is the four-digit year, %y the two-digit year).
from datetime import datetime
# use datetime to convert
def strip_date(data):
    d = datetime.strptime(data,'%Y%m%d')
    return datetime.strftime(d,'%Y%m')
data = '20110513'
print (strip_date(data))
# or just cut off day (2 last symbols) from date string
print (data[:6])
The first variant is better because you can verify that the string is in a proper date format.
Output:
201105
201105
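To illustrate the validation point, here is a small sketch: strptime raises ValueError for a string that is not a real date, whereas slicing accepts anything. The helper name to_year_month and the choice to return None on bad input are assumptions made for this example only:

```python
from datetime import datetime

def to_year_month(date_str):
    """Convert 'YYYYMMDD' to 'YYYYMM'; return None if it is not a valid date."""
    try:
        parsed = datetime.strptime(date_str, '%Y%m%d')
    except ValueError:
        # e.g. February 30th: the digits fit the pattern, but no such date exists
        return None
    return parsed.strftime('%Y%m')

print(to_year_month('20110513'))  # a valid date
print(to_year_month('20110230'))  # rejected by strptime
print('20110230'[:6])             # slicing happily accepts the bad value
```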
You didn't specify any code, but this might work:
date = functionThatGetsDate()
date = date[0:6]
I'm working with a large CSV file where each row includes the date and two values. I'm trying to set up a dictionary with the date as the key for the two values. I then need to multiply the two values of each key and record the answer. I have 3000 rows in the file.
Sample:
So far I have the date set as the key for each pair of values. However, the date is also repeated as a third value under each key. Is there a way to remove it?
Once I've removed it, is there a way to multiply the two values under each key by each other?
This is my code so far:
main_file = "newnoblanks.csv"
import csv
import collections
import pprint
with open(main_file) as fp:
    root = csv.reader(fp, delimiter=',')
    result = collections.defaultdict(list)
    for row in root:
        date = row[0].split(",")[0]
        result[date].append(row)
print ("Result:-")
pprint.pprint(result)
This is my output:
I don't think you even need to use a defaultdict here, just assign the whole row (minus the date) to the key of the dict. You should just be able to do
with open(main_file) as fp:
    root = csv.reader(fp, delimiter=',')
    result = dict()
    for row in root:
        date = row[0].split(",")[0]
        result[date] = row[1:]
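If you want to get the product of the two values, convert them to numbers first (csv.reader yields strings), and then you could do something like
from functools import reduce
for key in result:
    result[key] = reduce(lambda x, y: x*y, map(float, result[key]))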
I know this has been answered, but feel there is an alternative worth considering:
import csv
from pprint import pprint
with open('newnoblanks.csv') as fp:
    root = csv.reader(fp)
    result = dict((date, float(a) * float(b)) for date, a, b in root)
pprint(result)
With the following data file:
19/08/2004,49.8458,44994500
20/08/2004,53.80505,23005800
23/08/2004,54.34653,18393200
The output is:
{'19/08/2004': 2242786848.1,
'20/08/2004': 1237828219.29,
'23/08/2004': 999606595.5960001}
I am trying to read dates from a txt file and have that converted to datetime format
Code:
from datetime import datetime, date
with open("birth.txt") as f:
    content = f.readlines()
    content = [x.strip() for x in content]
    for i in content:
        a = i.split(":")
        date_b = []
        date_b.append(a[-1])
        print date_b
        for j in date_b:
            date_object = datetime.strptime(str(j), '%m-%d-%Y')
            print date_object
Text File:
a:11-23-2001
b:02-14-2002
ValueError: time data ' 11-23-2001' does not match format '%m-%d-%Y'
Can someone help me resolve this error?
There are multiple problematic parts with your code. The error is caused by having a space before your date string although I'm not sure where it comes from given your file. Also, why are you even having the second loop? And you're overwriting the date_b in each line loop... Try this:
from datetime import datetime
with open("birth.txt") as f:
    dates = []  # store this outside of your loop
    for line in f:  # read line by line
        v, d = line.strip().split(":")
        d = datetime.strptime(d.strip(), '%m-%d-%Y')  # just in case of additional whitespace
        dates.append((v, d))
print(dates)
# [('a', datetime.datetime(2001, 11, 23, 0, 0)), ('b', datetime.datetime(2002, 2, 14, 0, 0))]
You can turn the latter into a dictionary, too (dict(dates)) or build a dictionary immediately.
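A sketch of the "build a dictionary immediately" variant. It uses an in-memory list of lines in place of birth.txt so that it runs standalone; the function name parse_birthdays and the sample lines (one with the stray space from the error message) are assumptions for illustration:

```python
from datetime import datetime

def parse_birthdays(lines, fmt='%m-%d-%Y'):
    """Build {key: datetime} from lines shaped like 'key:MM-DD-YYYY'."""
    birthdays = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        key, date_str = line.split(':', 1)
        # strip both parts so stray spaces around the colon do not break parsing
        birthdays[key.strip()] = datetime.strptime(date_str.strip(), fmt)
    return birthdays

# hypothetical sample input
sample = ['a: 11-23-2001\n', 'b:02-14-2002\n', '\n']
birthdays = parse_birthdays(sample)
print(birthdays)
```

Since an open file is iterable over its lines, the same function can be called as parse_birthdays(f) inside a with open(...) block.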
Besides the issues that @zwer points out, you have the major inefficiency of reading the entire file into memory before processing it. This actually makes your job harder than it needs to be because files in Python are iterable over their lines. You can do something like:
from datetime import datetime, date
with open('birth.txt') as f:
    for line in f:
        key, datestr = line.strip().split(':')
        # strip the date part too, in case there is a space after the colon
        dateobj = datetime.strptime(datestr.strip(), '%m-%d-%Y')
        print(dateobj)
Using the fact that the file is an iterator, you can write a one-line list comprehension to generate a full list of dates:
with open('birth.txt') as f:
    dates = [datetime.strptime(line.strip().split(':')[1], '%m-%d-%Y') for line in f]
If the key has some significance, you can create a dictionary with a dictionary comprehension using a similar syntax:
with open('birth.txt') as f:
    dates = {key: datetime.strptime(datestr, '%m-%d-%Y') for key, datestr in (line.strip().split(':') for line in f)}
I'm trying to solve a simple practice test question:
Parse the CSV file to:
Find only the rows where the user started before September 6th, 2010.
Next, order the values from the "words" column in ascending order (by start date)
Return the compiled "hidden" phrase
The csv file has 19 columns and 1000 rows of data, most of which are irrelevant. As the problem states, we're only concerned with sorting the start_date column in ascending order to get the associated word from the 'words' column. Together, the words will give the "hidden" phrase.
The dates in the source file are stored as UTC timestamps (integer seconds), so I had to convert them. I'm at the point now where I think I've got the right rows selected, but I'm having issues sorting the dates.
Here's my code:
import csv
from collections import OrderedDict
from datetime import datetime
with open('TSE_sample_data.csv', 'rb') as csvIn:
    reader = csv.DictReader(csvIn)
    for row in reader:
        #convert from UTC to more standard date format
        startdt = datetime.fromtimestamp(int(row['start_date']))
        new_startdt = datetime.strftime(startdt, '%Y%m%d')
        # find dates before Sep 6th, 2010
        if new_startdt < '20100906':
            # add the values from the 'words' column to a list
            words = []
            words.append(row['words'])
            # add the dates to a list
            dates = []
            dates.append(new_startdt)
            # create an ordered dictionary to sort the dates... this is where I'm having issues
            dict1 = OrderedDict(zip(words, dates))
            print dict1
            #print list(dict1.items())[0][1]
            #dict2 = sorted([(y,x) for x,y in dict1.items()])
            #print dict2
When I print dict1 I'm expecting to have one ordered dictionary with the words and the dates included as items. Instead, what I'm getting is multiple ordered dictionaries for each key-value pair created.
Here's the corrected version:
import csv
from collections import OrderedDict
from datetime import datetime
with open('TSE_sample_data.csv', 'rb') as csvIn:
    reader = csv.DictReader(csvIn)
    words = []
    dates = []
    for row in reader:
        #convert from UTC to more standard date format
        startdt = datetime.fromtimestamp(int(row['start_date']))
        new_startdt = datetime.strftime(startdt, '%Y%m%d')
        # find dates before Sep 6th, 2010
        if new_startdt < '20100906':
            # add the values from the 'words' column to a list
            words.append(row['words'])
            # add the dates to a list
            dates.append(new_startdt)
    # This is where I was going wrong! Had to move the lines below outside of the for loop
    # Originally, because I was still inside the for loop, I was creating a new Ordered Dict for each "row in reader" that met my if condition
    # By doing this outside of the for loop, I'm able to create the ordered dict storing all of the values that have been found in tuples inside the ordered dict
    # create an ordered dictionary to sort by the dates
    dict1 = OrderedDict(zip(words, dates))
    dict2 = sorted([(y,x) for x,y in dict1.items()])
    # print the hidden message
    for i in dict2:
        print i[1]
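An alternative Python 3 sketch that skips the intermediate dictionary (a dictionary keyed by word would silently drop repeated words): collect (timestamp, word) pairs and sort them directly. The column names start_date and words come from the question. The function name hidden_phrase, the sample rows, and the choice to interpret the cutoff in UTC are assumptions:

```python
from datetime import datetime, timezone

def hidden_phrase(rows, cutoff):
    """Join the 'words' of rows that start before cutoff, ordered by start date."""
    cutoff_ts = cutoff.timestamp()
    selected = [(int(row['start_date']), row['words'])
                for row in rows
                if int(row['start_date']) < cutoff_ts]
    # sort on the timestamp only, so equal timestamps keep their file order
    selected.sort(key=lambda pair: pair[0])
    return ' '.join(word for _, word in selected)

# assumption: the cutoff "September 6th, 2010" is taken as midnight UTC
cutoff = datetime(2010, 9, 6, tzinfo=timezone.utc)

# hypothetical rows standing in for csv.DictReader output
rows = [
    {'start_date': '1262304000', 'words': 'world'},    # 2010-01-01 UTC
    {'start_date': '1230768000', 'words': 'hello'},    # 2009-01-01 UTC
    {'start_date': '1293840000', 'words': 'ignored'},  # 2011-01-01 UTC
]
phrase = hidden_phrase(rows, cutoff)
print(phrase)
```

Comparing the integer timestamps directly also avoids the round trip through formatted date strings.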
I am trying to remove rows with a specific ID within particular dates from a large CSV file.
The CSV file contains a column [3] with dates formatted like "1962-05-23" and a column with identifiers [2]: "ddd:011232700:mpeg21:a00191".
Within the following date ranges:
01-01-1951 to 12-31-1951
07-01-1962 to 12-31-1962
01-01-1963 to 09-30-1963
07-01-1965 to 07-31-1965
10-01-1965 to 10-31-1965
04-01-1966 to 11-30-1966
01-01-1969 to 12-31-1969
01-01-1970 to 12-31-1989
I want to remove rows that contain the ID ddd:11*
I think I have to create a variable that contains both the date range and the ID, and look for these in every row, but I'm very new to Python so I'm not sure what would be an elegant way to do this.
This is what I have now. -CODE UPDATED
import csv
import collections
import sys
import re
from datetime import datetime
csv.field_size_limit(sys.maxsize)
dateranges = [("01-01-1951","12-31-1951"),("07-01-1962","12-31-1962")]
dateranges = list(map(lambda dr: tuple(map(lambda x: datetime.strptime(x,"%m-%d-%Y"),dr)),dateranges))
def datefilter(x):
    x = datetime.strptime(x,"%Y-%m-%d")
    for r in dateranges:
        if r[0]<=x and r[1]>=x: return True
    return False
writer = csv.writer(open('filtered.csv', 'wb'))
for row in csv.reader('my_file.csv', delimiter='\t'):
    if datefilter(row[3]):
        if not row[2].startswith("dd:111"):
            writer.writerow(row)
    else:
        writer.writerow(row)
writer.close()
I'd recommend using pandas: it's great for filtering tables. Nice and readable.
import pandas as pd
# assumes the csv contains a header, and the 2 columns of interest are labeled "mydate" and "identifier"
# Note that "date" is a pandas keyword so not wise to use for column names
df = pd.read_csv(inputFilename, parse_dates=['mydate'])  # parse the "mydate" column as datetimes
has_id = df.identifier.str.startswith('ddd:11')  # rows whose identifier starts with 'ddd:11'
# rows whose date falls inside one of the specified date ranges (bounds inclusive):
in_range = (((pd.to_datetime("1951-01-01") <= df.mydate) & (df.mydate <= pd.to_datetime("1951-12-31"))) |
            ((pd.to_datetime("1962-07-01") <= df.mydate) & (df.mydate <= pd.to_datetime("1962-12-31"))))
# drop only the rows that match both conditions: the ID and a date range
df = df[~(has_id & in_range)]
df.to_csv(outputFilename)
See Pandas Boolean Indexing
Here is how I would approach that, but it may not be the best method.
import csv
import collections
from datetime import datetime
dateranges = [("01-01-1951","12-31-1951"),("07-01-1962","12-31-1962")]
dateranges = list(map(lambda dr: tuple(map(lambda x: datetime.strptime(x,"%m-%d-%Y"),dr)),dateranges))
def datefilter(x):
    # The date format is different here to match the format of the csv
    x = datetime.strptime(x,"%Y-%m-%d")
    for r in dateranges:
        if r[0]<=x and r[1]>=x: return True
    return False
with open(main_file, "rb") as fp:
    root = csv.reader(fp, delimiter='\t')
    result = collections.defaultdict(list)
    for row in root:
        if datefilter(row[3]):
            # use a regular expression or any other means to filter on id here
            if row[2].startswith("ddd:11"):
                pass  # code to remove item
What I have done is create a list of tuples of your date ranges (for brevity, I only put 2 ranges in it), and then I convert those into datetime objects.
I have used maps for doing that in one line: first loop over all tuples in that list, applying a function which loops over all entries in that tuple and converts to a date time, using the tuple and list functions to get back to the original structure. Doing it the long way would look like:
dateranges2=[]
for dr in dateranges:
    dateranges2.append((datetime.strptime(dr[0],"%m-%d-%Y"),datetime.strptime(dr[1],"%m-%d-%Y")))
dateranges = dateranges2
Notice that I just convert each item in the tuple into a datetime, and add the tuples to the new list, replacing the original (which I don't need anymore).
Next, I create a datefilter function which takes a date string, converts it to a datetime, and then loops over all the ranges, checking if the value is in the range. If it is, we return True (indicating this item should be filtered); otherwise we return False once we have checked all ranges with no match (indicating that we don't filter this item).
Now you can check out the id using any method that you want once the date has matched, and remove the item if desired. As your example is constant in the first few characters, we can just use the string startswith function to check the id. If it is more complex, we could use a regex.
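The date-range check and the ID check described above can be sketched in Python 3 as a single predicate. This assumes the column positions from the question (ID in column 2, date in column 3) and includes only two of the ranges for brevity. The names DATE_RANGES, in_any_range and keep_row, and the sample rows, are made up for illustration:

```python
from datetime import date, datetime

# only two of the question's ranges, for brevity; both bounds are inclusive
DATE_RANGES = [
    (date(1951, 1, 1), date(1951, 12, 31)),
    (date(1962, 7, 1), date(1962, 12, 31)),
]

def in_any_range(date_str, ranges=DATE_RANGES):
    """True if a 'YYYY-MM-DD' string falls inside any of the ranges."""
    day = datetime.strptime(date_str, '%Y-%m-%d').date()
    return any(start <= day <= end for start, end in ranges)

def keep_row(row, id_prefix='ddd:11', id_col=2, date_col=3):
    """Drop a row only when both the ID prefix and the date range match."""
    return not (row[id_col].startswith(id_prefix) and in_any_range(row[date_col]))

# hypothetical rows standing in for csv.reader output
rows = [
    ['x', 'y', 'ddd:110000001:mpeg21:a0001', '1951-06-01'],  # dropped
    ['x', 'y', 'ddd:110000001:mpeg21:a0001', '1955-06-01'],  # kept: outside the ranges
    ['x', 'y', 'ddd:010000001:mpeg21:a0001', '1951-06-01'],  # kept: different ID
]
kept = [row for row in rows if keep_row(row)]
print(kept)
```

Keeping the decision in one function means the file loop only needs a single test per row before writing it out.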
My approach works like this:
import csv
import re
import datetime
field_id = 'ddd:11'
d1 = datetime.date(1951,1,01) #change the start date
d2 = datetime.date(1951,12,31) #change the end date
diff = d2 - d1
date_list = []
for i in range(diff.days + 1):
    date_list.append((d1 + datetime.timedelta(i)).isoformat())
with open('mwevers_example_2016.01.02-07.25.55.csv','rb') as csv_file:
    reader = csv.reader(csv_file)
    for row in reader:
        for date in date_list:
            if row[3] == date:
                print row
                var = re.search('\\b'+field_id,row[2])
                if bool(var) == True:
                    print 'olalala'#here you can make a function to copy those rows into another file or any list
import csv
import sys
import re
from datetime import datetime
csv.field_size_limit(sys.maxsize)
field_id = 'ddd:11'
dateranges = [("1951-01-01", "1951-12-31"),
              ("1962-07-01", "1962-12-31"),
              ("1963-01-01", "1963-09-30"),
              ("1965-07-01", "1965-07-31"),
              ("1965-10-01", "1965-10-31"),
              ("1966-04-01", "1966-11-30"),
              ("1969-01-01", "1989-12-31")
              ]
dateranges = list(map(lambda dr:
                      tuple(map(lambda x:
                                datetime.strptime(x, "%Y-%m-%d"), dr)),
                      dateranges))
def datefilter(x):
    x = datetime.strptime(x, "%Y-%m-%d")
    for r in dateranges:
        if r[0] <= x and r[1] >= x:
            return True
    return False
output = []
with open('my_file.csv', 'r') as f:
    reader = csv.reader(f, delimiter='\t', quotechar='"')
    next(reader)
    for row in reader:
        if datefilter(row[4]):
            var = re.search('\\b'+field_id, row[3])
            if bool(var) == False:
                output.append(row)
        else:
            output.append(row)
with open('output.csv', 'w') as outputfile:
    writer = csv.writer(outputfile, delimiter='\t', quotechar='"')
    writer.writerows(output)