How to count entries per day in a csv file? - python

I have a csv file with the download times of various files, and I want to know the number of files that were downloaded per day.
Code:
import csv
from dateutil.parser import parse

with open('hello.csv', 'r', encoding="latin-1") as csvfile:
    readCSV = csv.reader(csvfile, delimiter=',')
    list1 = list(readCSV)
    count = 0
    b = -1
    for j in list1:
        b = b + 1
        if b > 0:
            dt = j[1]
            dt_obj = parse(dt)
            d = dt_obj.date()
            if dt == d:
                count += 1
            else:
                print(count)
                break
hello.csv is my csv file. I have date-times, so I use the parser to get the date. I want to get the number of downloads per day. I know that this code can't work, but I don't know how to check whether the next entry has the same date or not.
My date-times look like "2004-01-05 17:56:46" and are in the second column of the csv file. When I have 7 entries on 2004-01-05 and 5 on 2004-01-06, the count vector should look like count=[7, 5], for example.

You can follow the procedure below.
1. Convert to a datetime object.
2. Create a column containing only the date (remove the time).
3. Group by the new date column.
4. Count the objects.
import pandas as pd

# Read csv file
data = pd.read_csv('hello.csv')
# Convert to a datetime object
data['timestamp'] = pd.to_datetime(data['timestamp'])
# Create a date column (date part only)
data['date'] = data['timestamp'].apply(lambda x: x.date())
# Group by date and count entries in some column
data.groupby('date')['column'].count()
# Result
date
2019-05-20 4
2019-05-21 3
Name: column, dtype: int64
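As a side note, the same counts can be had in one line with the .dt accessor and value_counts() (still assuming the hypothetical 'timestamp' column name used above):
data['timestamp'].dt.date.value_counts().sort_index()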

When you want to count elements, Python's collections module provides the Counter class, which can be used as a dictionary of {element_name: count}. I will assume that your parse function does what you want. The code can simply be:
import csv
import collections
from dateutil.parser import parse

with open('hello.csv', 'r', encoding="latin-1") as csvfile:
    readCSV = csv.reader(csvfile, delimiter=',')
    next(readCSV)  # skip the header row (your loop skipped row 0 too)
    counter = collections.Counter(parse(row[1]).date() for row in readCSV)
    print(counter)
With your expected data, it should print something like:
Counter({datetime.date(2004, 1, 5): 7, datetime.date(2004, 1, 6): 5})
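If you then want the counts as a plain vector in date order, like the count=[7, 5] from the question, a one-line follow-up on the counter built above:
counts = [counter[d] for d in sorted(counter)]  # e.g. [7, 5]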

I suggest using Pandas. Say your date column is called date. After parsing it to datetimes and normalizing away the time component, you can group by date and use the transform method:
import pandas as pd

df = pd.read_csv('hello.csv')
df['date'] = pd.DatetimeIndex(df.date).normalize()
df['count'] = df.groupby('date')['date'].transform('count')
df = df[['date', 'count']]
Now you have a new dataframe with what you want.
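Since transform keeps one row per original entry, each date is repeated as many times as it has downloads; a short follow-up (a sketch building on the df above) to collapse that to one row per day and recover the count vector from the question:
per_day = df.drop_duplicates().sort_values('date')
counts = per_day['count'].tolist()  # e.g. [7, 5]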

Related

How to arrange data week wise csv file in python

I have generated a csv file whose format is shown in the image below:
In this image, the data is mostly arranged week-wise, but in some places I couldn't arrange the data week-wise. If you look at the image, you will see the red mark and the blue mark. I want to separate those two marks. How do I do that?
Note: If there is a holiday on Friday, then the week should be set from Monday to Thursday.
Currently, I'm using the logic below:
import csv

blank_fields = [' ']
fields = [' ', 'Weekly Avg']

# Read csv file
file1 = open('test.csv', 'rb')
reader = csv.reader(file1)
new_rows_list = []

# Read data row by row and store into a new list,
# inserting a separator row after each Friday
for row in reader:
    new_rows_list.append(row)
    if 'Friday' in row:
        new_rows_list.append(fields)
file1.close()
Overall you are going in the right direction, but your condition is a little too error-prone; things can go wrong (e.g., when just one day of a week appears in your list). So testing for the weekday string isn't the best choice here.
I would suggest "understanding" the date/time in your table and solving this using weekday numbers, like this:
from datetime import datetime as dt, timedelta as td

# remember the last date seen
last = None
# each item in 'weeks' represents one week; _cur_week is a temp-var
weeks = []
_cur_week = []
for row in reader:
    # assuming the date is in row[1] (split, convert to int, pass as args)
    _cur_date = dt(*map(int, row[1].split("-")))
    # weekday() will be in [0 .. 6]
    # if _cur_date.weekday() <= last.weekday(), a week is complete
    # (also catching the corner case of more than 7 days between two entries)
    if last and (_cur_date.weekday() <= last.weekday() or (_cur_date - last) >= td(days=7)):
        # append collected rows to 'weeks', empty _cur_week for the next rows
        weeks.append(_cur_week)
        _cur_week = []
    # regular invariant: append row and set 'last' to '_cur_date'
    _cur_week.append(row)
    last = _cur_date
# don't forget the final partial week
if _cur_week:
    weeks.append(_cur_week)
Pretty verbose and extensive, but I hope it conveys the pattern used here:
parse the existing date and use the weekday to distinguish one week from the next (the weekday increases monotonically within a week, so any decrease, or equality, tells you the current date belongs to the next week)
store rows in a temporary list during one week
append _cur_week to weeks once the condition for the next week is triggered
empty _cur_week for the rows of the next week
Finally, the last thing to do is to "concat" the data, e.g. like this:
new_rows_list = [[fields] + week for week in weeks]
I found another approach for the same problem; it worked successfully and is an easy solution.
import csv
import datetime

fields = {'': ' ', 'Date': 'Weekly Avg'}

# Read csv file
file1 = open('temp.csv', 'rb')
reader = csv.DictReader(file1)
new_rows_list = []
last_date = None

# Read data row by row and store into a new list,
# inserting a separator row whenever a new week starts
for row in reader:
    cur_date = datetime.datetime.strptime(row['Date'], '%Y-%m-%d').date()
    if last_date and ((last_date.weekday() > cur_date.weekday()) or (cur_date.weekday() == 0)):
        new_rows_list.append(fields)
    last_date = cur_date
    new_rows_list.append(row)
file1.close()
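The snippet builds new_rows_list but never writes it back out; a minimal sketch of that last step (the output filename is an assumption; Python 2 file modes as above, and extrasaction='ignore' silently drops the separator dict's extra '' key if the input has no unnamed column):
with open('temp_out.csv', 'wb') as file2:
    # reader.fieldnames is cached after the loop above, so it is still usable here
    writer = csv.DictWriter(file2, fieldnames=reader.fieldnames, extrasaction='ignore')
    writer.writeheader()
    writer.writerows(new_rows_list)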

How to compare date from csv(string) to actual date

filenameA ="ApptA.csv"
filenameAc = "CheckoutA.csv"
def checkouttenantA():
global filenameA
global filenameAc
import csv
import datetime
with open(filenameA, 'r') as inp, open(filenameAc, 'a' , newline = "") as out:
my_writer = csv.writer(out)
for row in csv.reader(inp):
my_date= datetime.date.today()
string_date = my_date.strftime("%d/%m/%Y")
if row[5] <= string_date:
my_writer.writerow(row)
Dates are saved in the format %d/%m/%Y in an excel file, in column [5]. I am trying to compare the dates in the csv file with the actual date, but it is only comparing the %d part. I assume it is because the dates are in string format.
OK, so there are a few improvements to make as well, which I'll include below, but the core problem is that you're converting today's date to a string with strftime() and comparing the two strings; you should instead be converting the string date from the csv file to a datetime object and comparing those.
I'll add plenty of comments to try and explain the code and the reasoning behind it.
# imports should go at the top
import csv
# notice we are importing datetime from datetime (the `datetime` type from the datetime module)
from datetime import datetime

# try to avoid globals where possible (they're not needed here)
def check_dates_in_csv(input_filepath):
    '''Load a csv file and compare its dates to today's date.'''
    # create a list to store the rows which meet our criteria
    # appending the rows to this will make a list of lists (nested list)
    output_data = []
    # get today's date before the loop to avoid calling now() every line
    # we only need it once, and calling it every row would slow the loop down
    todays_date = datetime.now()
    # open your csv here using the function argument
    with open(input_filepath) as csv_file:
        reader = csv.reader(csv_file)
        # iterate over the rows and grab the date in each row
        for row in reader:
            string_date = row[5]
            # convert the string to a datetime object
            csv_date = datetime.strptime(string_date, '%d/%m/%Y')
            # compare the dates and append if the row meets the criteria
            if csv_date <= todays_date:
                output_data.append(row)
    # a function should only do one thing: compare the dates
    # save the output afterwards
    return output_data

# then run the script here
# this comparison is basically the entry point of the python program
# this answer explains it better than I could: https://stackoverflow.com/questions/419163/what-does-if-name-main-do
if __name__ == "__main__":
    # use our new function to get the output data
    output_data = check_dates_in_csv("input_file.csv")
    # save the data here
    with open("output.csv", "w") as output_file:
        writer = csv.writer(output_file)
        writer.writerows(output_data)
I would recommend using Pandas for such tasks:
import pandas as pd

filenameA = "ApptA.csv"
filenameAc = "CheckoutA.csv"

today = pd.Timestamp.today()
# dayfirst=True because the dates are in %d/%m/%Y format
df = pd.read_csv(filenameA, parse_dates=[5], dayfirst=True)
df.loc[df.iloc[:, 5] <= today].to_csv(filenameAc, index=False)

What is the best way to link data from two sources, for example by date?

I am writing a script (i.e. one to be run only once) in which I read data from an excel file. For that data I create an id based on the date and time. I have one missing variable, which is contained in a txt file. The txt file also has date and time, from which I create an id.
Now I would like to link the data from the excel file and the txt file based on the id. Right now I am building two lists from the txt file: one containing the id and the other containing the value I need. Then I get the index in the id list where the id is the same in both data sets, using the enumerate function, and use that index to get the value from the value list. The code looks something like this:
datelist = []
valuelist = []
txtfile = open(folder + os.sep + "Textfile.txt", "r")
ILines = txtfile.readlines()
for i, row in enumerate(ILines):
    datelist.append(row.split(",")[1])
    valuelist.append(row.split(",")[2])

rows = myexceldata
for row in rows:
    x = row[id]
    row = row + valuelist[[i for i, e in enumerate(datelist) if e == x][0]]
However, that takes ages and I wonder if there is a better way to to that.
The files look like that:
Excelfile:
Date Time Var1 Var2
03.02.2016 12:53:24 10 27
03.02.2016 12:53:25 10 27
03.02.2016 12:53:26 10 27
Textfile:
Date Time Var3
03.02.2016 12:53:24 16
03.02.2016 12:53:25 20
Result:
Date Time Var1 Var2 Var3
03.02.2016 12:53:24 10 27 16
03.02.2016 12:53:25 10 27 20
03.02.2016 12:53:26 10 27 *)
*) It would be perfect, if here would be the same value as above, but empty would be ok, too
OK, I forgot one important thing, sorry about that: not all times from the excel file are in the text file. The best option would be to get Var3 from the last time in the text file just before the time in the excel file, but leaving it blank would be OK, too.
If both of your files are sorted in time order then the following kind of approach would be fast:
from heapq import merge
from itertools import groupby, chain
import csv

with open('excel.txt', 'rb') as f_excel, open('textfile.txt', 'rb') as f_text, open('output.txt', 'wb') as f_output:
    csv_excel = csv.reader(f_excel)
    csv_text = csv.reader(f_text)
    csv_output = csv.writer(f_output)
    header_excel = next(csv_excel)
    header_text = next(csv_text)
    csv_output.writerow(header_excel + [header_text[-1]])
    for k, g in groupby(merge(csv_text, csv_excel), key=lambda x: x[0:2]):
        csv_output.writerow(k + list(chain.from_iterable(cols[2:] for cols in g)))
This assumes your two input files are both in csv format, and works as follows:
Create csv readers/writers for all of the files. This allows the files to automatically be read in as lists of columns without requiring each line to be split.
Extract the headers from both of the files and write a combined form to the output.
Take the two input files and pass them to merge. This returns a row at a time from either input file in order.
Pass this to groupby to group rows with the same date and time together. This returns a key and a group, where the key is the date and time that matched, and the group is an iterable of the matching rows.
For each grouped entry, write the key and columns 2 onwards from each row to the output file. chain is used to produce a flat list.
This would give you an output file as follows:
Date,Time,Var1,Var2,Var3
03.02.2016,12:53:24,10,27,16
03.02.2016,12:53:25,10,27,20
As you already have the excel data, this would need to be passed to merge instead of csv_excel as a list of rows/cols.
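A small sketch of that variation, assuming myexceldata (the name from the question) is already an in-memory list of [date, time, var1, var2] rows with the header as its first element; merge accepts any sorted iterable, so a plain list works in place of the reader:
header_excel, excel_rows = myexceldata[0], myexceldata[1:]
csv_output.writerow(header_excel + [header_text[-1]])
for k, g in groupby(merge(csv_text, excel_rows), key=lambda x: x[0:2]):
    csv_output.writerow(k + list(chain.from_iterable(cols[2:] for cols in g)))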

Filtering out csv rows by column data

I'm not sure what to call this, but I have a csv with the data:
...|Address | Date |...
...|Abraham st.| 01/01/2008 |...
...|Abraham st.| 02/02/2007 |...
...|Abraham st.| 03/03/2011|...
What I want to do is keep only the newest entry (in this case it would be row 4); I'm really having trouble bending my mind around this.
My initial idea is to read the data from the csv into a list of rows and then:
convert the date strings to datetime objects
go through every row, get its name, and compare it with every other row to find the highest date, saving that date's row.
Is there a better way to approach this?
Keep track of the highest value seen so far instead; I'm assuming here you already have a csv.reader() object reading the CSV data:
from datetime import datetime

max_date = datetime.min
newest_row = None

for row in csv_reader:
    # assumption: your date is the 4th column in each row
    date = datetime.strptime(row[3], '%m/%d/%Y')
    if date > max_date:
        # row is newer, remember it
        max_date = date
        newest_row = row
When you've read the whole file, newest_row will hold the data row with the most recent date. However, the code never holds more than 2 rows in memory (the row currently being processed, and the newest row found so far).
Note that I started max_date as datetime.min, which is the minimum value you can store in a datetime object; as long as your input file does not contain any rows for January 1st in the year 1, you should be good.
Just use the max builtin with a key function that extracts and converts the date field into a datetime object. I assume that your dates are dd/mm/yyyy.
import csv
from datetime import datetime

DATE_COLUMN = 1

with open('input.csv') as f:
    reader = csv.reader(f, delimiter='|')
    next(reader)  # skip over the CSV header row
    most_recent = max(reader, key=lambda x: datetime.strptime(x[DATE_COLUMN].strip(), '%d/%m/%Y'))
>>> print most_recent
['Abraham st.', ' 03/03/2011']
I think your intent is to group by the "Address" column and select the most recent date from the "Date" column, in which case you can use itertools.groupby() like this:
import csv
from itertools import groupby
from datetime import datetime

ADDRESS_COLUMN = 0
DATE_COLUMN = 1

most_recent = []
with open('input.csv') as f:
    reader = csv.reader(f, delimiter='|')
    next(reader)  # skip over the CSV header row
    for k, g in groupby(sorted(reader), lambda x: x[ADDRESS_COLUMN]):
        most_recent.append(max(g, key=lambda x: datetime.strptime(x[DATE_COLUMN].strip(), '%d/%m/%Y')))
>>> print most_recent
[['Abraham st.', ' 03/03/2011'], ['Moses rd.', ' 10/12/2013'], ['Smith St.', ' 01/01/1999']]
Assuming input.csv contains this:
Address |Date
Abraham st.| 01/01/2008
Abraham st.| 02/02/2007
Abraham st.| 03/03/2011
Moses rd.| 10/12/2013
Moses rd.| 11/11/2011
Smith St.| 01/01/1999
I'm not sure you need to "compare with every other row" (but that might just be me misunderstanding your intent). I would simply save the currently newest row as I loop over the rows.
Something like this pseudocode:
saved_row = None
for row in table:
    if not saved_row:
        saved_row = row
    elif row.date > saved_row.date:
        saved_row = row
There is probably a more elegant way to store the initial row into saved_row

Python modifying csv column by increments

First off, I apologize for the terrible title; I didn't know how to summarize my problem. Okay, so here are the first few lines of my .csv file. The first column is the timestamp. The program I'm getting this data from samples 24 times per second, so there are 24 rows that start with 15:40:15, 24 that start with 15:40:16, and so on. Instead of 24 rows with the same timestamp, I want the timestamp to increase in increments of 1/24 second, i.e. .042 seconds: 15:40:15.042, 15:40:15.084, etc.
Another problem is that there aren't 24 rows for the first second, because the recording starts in the middle of a second. For example, there are only 13 rows starting with 15:40:14. For those it would preferably count backwards from 15:40:15.000, subtracting .042 seconds for every row.
How can I do this in Python? Thanks in advance!
CPUtime,Displacement Into Surface,Load On Sample,Time On Sample,Raw Load,Raw Displacement
15:40:14,-990.210561,-0.000025,1.7977E+308,-115.999137,-989.210000
15:40:14,-989.810561,-0.000025,1.7977E+308,-115.999105,-988.810000
15:40:14,-989.410561,-0.000025,1.7977E+308,-115.999073,-988.410000
15:40:14,-989.010561,-0.000025,1.7977E+308,-115.999041,-988.010000
15:40:14,-988.590561,-0.000025,1.7977E+308,-115.999007,-987.590000
15:40:14,-988.170561,-0.000025,1.7977E+308,-115.998974,-987.170000
15:40:14,-987.770561,-0.000025,1.7977E+308,-115.998942,-986.770000
15:40:14,-987.310561,-0.000025,1.7977E+308,-115.998905,-986.310000
15:40:14,-986.870561,-0.000025,1.7977E+308,-115.998870,-985.870000
15:40:14,-986.430561,-0.000025,1.7977E+308,-115.998834,-985.430000
15:40:14,-985.990561,-0.000025,1.7977E+308,-115.998799,-984.990000
15:40:14,-985.570561,-0.000025,1.7977E+308,-115.998766,-984.570000
15:40:14,-985.170561,-0.000025,1.7977E+308,-115.998734,-984.170000
15:40:15,-984.730561,-0.000025,1.7977E+308,-115.998698,-983.730000
15:40:15,-984.310561,-0.000025,1.7977E+308,-115.998665,-983.310000
15:40:15,-983.890561,-0.000025,1.7977E+308,-115.998631,-982.890000
15:40:15,-983.490561,-0.000025,1.7977E+308,-115.998599,-982.490000
15:40:15,-983.090561,-0.000025,1.7977E+308,-115.998567,-982.090000
I'd add to @robert king's answer that you could use itertools.groupby() to group rows with the same timestamp:
import csv
import shutil
from itertools import groupby

n = 24
time_increment = 1. / n
# '.000', '.042', '.083', ..., '.958'
fractions = [("%.3f" % (i * time_increment,)).lstrip('0') for i in xrange(n)]

with open('input.csv', 'rb') as f, open('output.csv', 'wb') as fout:
    reader = csv.reader(f)
    writer = csv.writer(fout)
    writer.writerow(next(reader))  # copy the header row unchanged
    # assume the file is sorted by timestamp
    for timestamp, group in groupby(reader, key=lambda row: row[0]):
        sametime = list(group)  # all rows that have the same timestamp
        assert n >= len(sametime)
        # start the index so a partial first second ends just before the next second
        for i, row in enumerate(sametime, start=n - len(sametime)):
            row[0] += fractions[i]  # append fractions of a second
        writer.writerows(sametime)
shutil.move('output.csv', 'input.csv')  # update the input file
The 'b' file mode is mandatory for the csv module in Python 2, otherwise entries that span several physical lines won't be handled correctly.
If there are fewer than n entries with the same timestamp, the code assumes that they are consecutive values from the end of a second.
Open the csv file and create a csv reader as per http://docs.python.org/library/csv.html
Also create a csv writer as per http://docs.python.org/library/csv.html
Now loop through each row of the file. On each row, modify the timestamp and then write it to your new csv file.
If you want the new csv file to replace the old csv file, at the end use shutil (http://docs.python.org/library/shutil.html) to replace it.
I recommend that inside your loop you have a variable called "current_timestamp" and a variable called "current_increment". If the timestamp in the row is equal to current_timestamp, simply add the increment; otherwise update them both appropriately, as in the sketch below.
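This answer is prose-only, so here is a minimal sketch of the loop it describes, using the current_timestamp/current_increment variables it names. The 1/24-second step and the column layout are taken from the question; the filenames are assumptions, and the backwards counting for the partial first second is not handled (the groupby answer above covers that):
import csv
import shutil

STEP = 1.0 / 24  # seconds between samples, from the question

with open('input.csv', 'rb') as f, open('new.csv', 'wb') as out:
    reader = csv.reader(f)
    writer = csv.writer(out)
    writer.writerow(next(reader))  # copy the header row unchanged
    current_timestamp = None
    current_increment = 0.0
    for row in reader:
        if row[0] == current_timestamp:
            # same second as the previous row: advance the fraction
            current_increment += STEP
        else:
            # a new second begins: reset the fraction
            current_timestamp = row[0]
            current_increment = 0.0
        # append the fractional part, e.g. '15:40:15' -> '15:40:15.042'
        row[0] = current_timestamp + ("%.3f" % current_increment).lstrip('0')
        writer.writerow(row)

# replace the old file with the new one, as suggested above
shutil.move('new.csv', 'input.csv')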
