How to arrange data week-wise in a CSV file in Python

I have generated a CSV file with the format shown in the image below.
The data is mostly grouped week by week, but in some places the weekly grouping breaks down. In the image you can see a red mark and a blue mark; I want to split the rows at both marks into separate weeks. How can I do that?
Note: if the Friday of a week is a holiday, the week should run from Monday to Thursday.
Currently I'm using the logic below:
import csv

blank_fields = [' ']
fields = [' ', 'Weekly Avg']

# Read csv file
file1 = open('test.csv', 'rb')
reader = csv.reader(file1)
new_rows_list = []

# Read data row by row and store into new list
for row in reader:
    new_rows_list.append(row)
    if 'Friday' in row:
        new_rows_list.append(fields)
file1.close()

Overall you are heading in the right direction, but your condition is a little too error-prone; things can go wrong, e.g., when only one day of a week appears in your list. So testing for the weekday string isn't the best choice here.
I would suggest actually parsing the date/time in your table and solving this with weekdays, like this:
from datetime import datetime as dt, timedelta as td

# remember the last date seen
last = None
# each item in 'weeks' represents one week; _cur_week is a temp var
weeks = []
_cur_week = []

for row in reader:
    # assuming the date is in row[1] (split, convert to int, pass as args)
    _cur_date = dt(*map(int, row[1].split("-")))
    # weekday() will be in [0 .. 6]
    # if _cur_date.weekday() <= last.weekday(), a week is complete
    # (also catching the corner case of more than 7 days between two entries)
    if last and (_cur_date.weekday() <= last.weekday() or (_cur_date - last) >= td(days=7)):
        # append collected rows to 'weeks', empty _cur_week for the next rows
        weeks.append(_cur_week)
        _cur_week = []
    # regular invariant: append row and set 'last' to '_cur_date'
    _cur_week.append(row)
    last = _cur_date

# don't forget the final, still-open week
if _cur_week:
    weeks.append(_cur_week)
Pretty verbose and extensive, but I hope it conveys the pattern used here:
- parse the existing date and use the weekday to distinguish one week from another (the weekday increases monotonically within a week, so any decrease, or equality, tells you the current date belongs to the next week)
- store rows in a temporary list during one week
- append _cur_week to weeks once the next-week condition triggers
- empty _cur_week for the next week's rows
Finally, the last thing to do is to "concat" the data, prefixing each week with the header row and flattening everything back into a single list of rows, e.g. like this:
new_rows_list = [row for week in weeks for row in [fields] + week]
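The answer stops short of writing the grouped rows back out; a minimal sketch of that last step, assuming a hypothetical output name of weekly.csv:

import csv

# 'new_rows_list' as built above: a flat list of rows, with the
# 'fields' header row inserted before each week's rows
with open('weekly.csv', 'w', newline='') as f:  # hypothetical output file
    csv.writer(f).writerows(new_rows_list)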

I found another approach for the same thing; it works successfully and is an easy solution.
import csv
import datetime

fields = {'': ' ', 'Date': 'Weekly Avg'}

# Read csv file
file1 = open('temp.csv', 'rb')
reader = csv.DictReader(file1)
new_rows_list = []
last_date = None

# Read data row by row and store into new list
for row in reader:
    cur_date = datetime.datetime.strptime(row['Date'], '%Y-%m-%d').date()
    if last_date and ((last_date.weekday() > cur_date.weekday()) or (cur_date.weekday() == 0)):
        new_rows_list.append(fields)
    last_date = cur_date
    new_rows_list.append(row)
file1.close()
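Writing new_rows_list back out is again left implicit; a small sketch using csv.DictWriter, assuming a hypothetical output name of temp_weekly.csv and that '' and 'Date' are the only columns (in practice you would pass the original reader.fieldnames):

import csv

# 'new_rows_list' holds the DictReader rows plus the inserted
# {'': ' ', 'Date': 'Weekly Avg'} separator rows
with open('temp_weekly.csv', 'w', newline='') as f:  # hypothetical output name
    writer = csv.DictWriter(f, fieldnames=['', 'Date'], extrasaction='ignore')
    writer.writeheader()
    writer.writerows(new_rows_list)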

Related

How to extract a month from date in csv file?

I'm trying to get an output of all the employees who worked this month by extracting the month from the date, but I get this error:
month = int(row[1].split('-')[1])
IndexError: list index out of range
A row in the attendance log csv looks like this:
"404555403","2020-10-14 23:58:15.668520","Chandler Bing"
I don't understand why it's out of range.
Thanks for any help!
import csv
import datetime

def monthly_attendance_report():
    """
    The function prints the attendance data of all employees from this month.
    """
    this_month = datetime.datetime.now().month
    with open('attendance_log.csv', 'r') as csvfile:
        content = csv.reader(csvfile, delimiter=',')
        for row in content:
            month = int(row[1].split('-')[1])
            if month == this_month:
                return row

monthly_attendance_report()
It works for me. The problem is probably in how the CSV file is processed: CSV files usually have a header row, and the header text can't be split as a date. Note that a csv.reader object isn't subscriptable, so slicing it with [1:] won't work; skip the header line with next() before the loop instead:
next(content, None)  # skip the header row
for row in content:
And extracting the date by string slicing isn't ideal either. Use the datetime module or something like that.
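For example, here is a minimal sketch of the datetime-based version (it prints every matching row rather than returning the first one, which is what the docstring seems to intend; the timestamp format is taken from the sample row above):

import csv
import datetime

def monthly_attendance_report():
    """Print the attendance rows from the current month."""
    this_month = datetime.datetime.now().month
    with open('attendance_log.csv', 'r', newline='') as csvfile:
        content = csv.reader(csvfile, delimiter=',')
        next(content, None)  # skip the header row
        for row in content:
            # e.g. "2020-10-14 23:58:15.668520"
            stamp = datetime.datetime.strptime(row[1], '%Y-%m-%d %H:%M:%S.%f')
            if stamp.month == this_month:
                print(row)

monthly_attendance_report()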

Looking for ideal method to "scrub" csv file to be put into Excel

Bit of an involved setup to this question, but bear with me!
(Copying and pasting the block below into an editor works well.)
I am using clevercsv to load my data from a financial website's csv file.
Each row is stored as an item in a list.
data = clevercsv.wrappers.read_csv(in_file_name)
After some account info lines, the stock data begins:
stock_data = data[8:]
I wish to remove these columns: Market, and Loan Value all the way through Day High (inclusive).
And keep: Symbol, Description through % of Positions (inclusive), 52-wk Low, 52-wk High.
Each stock has this data associated with it on the relevant line.
Any best practices for removing this data? I have been trying, but I seem to be making logic errors.
As of Date,2020-04-29 18:44:29
Account,TD Direct Investing - HAHAHA
Cash,123.12
Investments,1234.12
Total Value,12345.12
Margin,123456.12,
,
Symbol,Market,Description,Quantity,Average Cost,Price,Book Cost,Market Value,Unrealized $,Unrealized %,% of Positions,Loan Value,Change Today $,Change Today %,Bid,Bid Lots,Ask,Ask Lots,Volume,Day Low,Day High,52-wk Low,52-wk High
AFL,US,"AFLAC INC",500,43.79,39.23,21895.79,19615.00,-2280.79,-10.42,7.26,,1.4399986,3.81,39.19,1,40.2,1,3001288,38.31,39.48,23.07,57.18
AKTS,US,"AKOUSTIS TECHNOLOGIES INC",2500,5.04,8.94,12609.87,22350.00,9740.13,77.24,8.27,,0.35999966,4.20,8.68,1,9.2,10,1161566,8.65,9.25,3.76,9.25
And here is my code so far:
import clevercsv

data = clevercsv.wrappers.read_csv(in_file_name)

# store the earlier lines for later use, all rows 8 and later are stock data
cash = data[2]
investments = data[3]
tot_value = data[4]
margin = data[5]
full_header = data[7]
stock_data = data[8:]

new_header = []
new_stock_data = []

# I have found the index positions I wish to save, append their data to the new_ lists:
for i in range(len(full_header)):
    if i == 0:
        new_header.append(full_header[i])
    if (i >= 2 and i <= 10):
        new_header.append(full_header[i])
    if i == 21:
        new_header.append(full_header[i])
    if i == 22:
        new_header.append(full_header[i])

# I have found the index positions I wish to save, append their data to the new_ lists:
for i in range(len(stock_data)):
    if i == 0:
        new_stock_data.append(stock_data[i])
    if (i >= 2 and i <= 10):
        new_stock_data.append(stock_data[i])
    if i == 21:
        new_stock_data.append(stock_data[i])
    if i == 22:
        new_stock_data.append(stock_data[i])

with open(os.path.join(folder_path, out_file_name), 'w') as out_file:
    writer = clevercsv.writer(out_file)
    writer.writerow(cash)
    writer.writerow(investments)
    writer.writerow(tot_value)
    writer.writerow(margin)
    writer.writerow(new_header)
    for row in new_stock_data:
        writer.writerow(row)
If this is too involved, I understand; if someone has a better library to use, or a better way to use the csv library, that will be plenty of help on its own.
If you already know the column indices and the header length, you can do something like this:
import csv

with open('input.csv', 'r', newline='') as input_file, open('output.csv', 'w', newline='') as output_file:
    reader = csv.reader(input_file)
    writer = csv.writer(output_file)
    for line_number, row in enumerate(reader, start=0):  # Avoid range(len(x))
        if line_number < 7:
            writer.writerow(row)  # Write cash, investments, etc.
        else:
            shortened_row = row[0:1] + row[2:11] + row[21:]  # Slice only the columns you need
            writer.writerow(shortened_row)
Whenever you find yourself writing range(len(something)), that's a good sign that you probably want to use enumerate(), which will loop through your data and automatically keep track of the current index.
For parsing each row after the header, you can use the slice notation row[start:end] and add slices together to get a new list, which you can then write to a file. Keep in mind that row[start:end] will not include the item at index end, which can be counterintuitive.
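A quick illustration of that end-exclusive behavior and of adding slices together (plain Python, no assumptions about the data):

row = ['a', 'b', 'c', 'd', 'e']
print(row[0:1])             # ['a']            -- index 1 is excluded
print(row[2:4])             # ['c', 'd']       -- index 4 is excluded
print(row[0:1] + row[2:4])  # ['a', 'c', 'd']  -- slices concatenate into a new list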
Finally, I always add newline='' when working with CSVs, since you can get unexpected line breaks otherwise, but this might be something clevercsv handles for you.
In Python, I would recommend using Pandas for this sort of operation.
First isolate the CSV data. Then treat it as a stream. I am dropping part of your sample in as x:
# This is python3 code
# first treat string as though it is a file
import io
x = io.StringIO("""Symbol,Market,Description,Quantity,Average Cost,Price,Book Cost,Market Value,Unrealized $,Unrealized %,% of Positions,Loan Value,Change Today $,Change Today %,Bid,Bid Lots,Ask,Ask Lots,Volume,Day Low,Day High,52-wk Low,52-wk High
AFL,US,"AFLAC INC",500,43.79,39.23,21895.79,19615.00,-2280.79,-10.42,7.26,,1.4399986,3.81,39.19,1,40.2,1,3001288,38.31,39.48,23.07,57.18
AKTS,US,"AKOUSTIS TECHNOLOGIES INC",2500,5.04,8.94,12609.87,22350.00,9740.13,77.24,8.27,,0.35999966,4.20,8.68,1,9.2,10,1161566,8.65,9.25,3.76,9.25""")
Then use pandas to read the string as CSV, treating the first row as headers by default:
import pandas as pd
df = pd.read_csv(x)
Then select the columns you want by passing a list of column names to the data frame:
new_df = df[['Book Cost', 'Market Value', 'Unrealized $', 'Unrealized %','% of Positions','52-wk Low', '52-wk High']]
   Book Cost  Market Value  Unrealized $  Unrealized %  % of Positions  \
0   21895.79       19615.0      -2280.79        -10.42            7.26
1   12609.87       22350.0       9740.13         77.24            8.27

   52-wk Low  52-wk High
0      23.07       57.18
1       3.76        9.25
Finally you can save it:
new_df.to_csv('test.csv', index=False) # Turn off indexing
And you are set:
Book Cost,Market Value,Unrealized $,Unrealized %,% of Positions,52-wk Low,52-wk High
21895.79,19615.0,-2280.79,-10.42,7.26,23.07,57.18
12609.87,22350.0,9740.13,77.24,8.27,3.76,9.25
(Full disclosure, I'm the author of CleverCSV.)
If you'd like to use CleverCSV for this task, and your data is small enough to fit into memory, you could use clevercsv.read_csv to load the data and clevercsv.write_table to save the data. By using these functions you don't have to worry about CSV dialects etc. You could also find the index of the header row automatically. It could go something like this:
from clevercsv import read_csv, write_table

# Load the table with CleverCSV
table = read_csv(in_file_name)

# Find the index of the header row and get the header
header_idx = next((i for i, r in enumerate(table) if r[0] == 'Symbol'), None)
header = table[header_idx]

# Extract the data as a separate table
data = table[header_idx + 1:]

# Create a list of header names that you want to keep
keep = ["Symbol", "Description", "Quantity", "Average Cost", "Price", "Book Cost", "Market Value", "Unrealized $", "Unrealized %", "% of Positions", "52-wk Low", "52-wk High"]

# Turn that list into column indices (and ensure all exist)
keep_idx = [header.index(k) for k in keep]

# Then create a new table by adding the header and the sliced rows
new_table = [keep]
for row in data:
    new_row = [row[i] for i in keep_idx]
    new_table.append(new_row)

# Finally, write the table to a new csv file
write_table(new_table, out_file_name)

How to count entries per day in a csv file?

I have a CSV file with the download times of various files, and I want to know the number of files that were downloaded per day.
Code:
import csv
from dateutil.parser import parse  # assuming dateutil, as 'parse' is used below

with open('hello.csv', 'r', encoding="latin-1") as csvfile:
    readCSV = csv.reader(csvfile, delimiter=',')
    list1 = list(readCSV)
    count = 0
    b = -1
    for j in list1:
        b = b + 1
        if b > 0:
            dt = j[1]
            dt_obj = parse(dt)
            d = dt_obj.date()
            if dt == d:
                count += 1
            else:
                print(count)
                break
hello.csv is my CSV file. I have datetimes, so I use the parser to get the date. I want the number of downloads per day. I know this code can't work, but I don't know how to check whether the next entry has the same date or not.
My datetimes look like "2004-01-05 17:56:46" and are in the second column of the CSV file. When I have 7 entries on 2004-01-05 and 5 on 2004-01-06, the count vector should look like count=[7 5], for example.
You can follow this procedure:
1. Convert to a datetime object.
2. Create a column containing only the date (remove the time).
3. Group by the new date column.
4. Count the objects.
import pandas as pd

# Read csv file
data = pd.read_csv('hello.csv')
# Converting to datetime object
data['timestamp'] = pd.to_datetime(data['timestamp'])
# Creating date column
data['date'] = data['timestamp'].apply(lambda x: x.date())
# Grouping by date and counting
data.groupby('date')['column'].count()
# Result
date
2019-05-20    4
2019-05-21    3
Name: column, dtype: int64
When you want to count elements, Python's collections module provides the Counter class, which behaves like a dictionary of {element: count}. I will assume that your parse function does what you want. The code can simply be:
import collections

with open('hello.csv', 'r', encoding="latin-1") as csvfile:
    readCSV = csv.reader(csvfile, delimiter=',')
    next(readCSV, None)  # skip the header row
    counter = collections.Counter(parse(row[1]).date() for row in readCSV)
print(counter)
With your expected data, it should print:
Counter({datetime.date(2004, 1, 5): 7, datetime.date(2004, 1, 6): 5})
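If you then want the count vector from the question, pull the values out in date order (a small follow-up on the counter built above):

counts = [counter[d] for d in sorted(counter)]
print(counts)  # [7, 5] for the expected data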
I suggest using Pandas. Say your date column is called date. Since it holds datetimes, you can normalize to dates, group by them, and use the transform method:
import pandas as pd

df = pd.read_csv('hello.csv')
df['date'] = pd.DatetimeIndex(df.date).normalize()
df['count'] = df.groupby('date')['date'].transform('count')
df = df[['date', 'count']]
Now you have a new dataframe with what you want.
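Note that transform broadcasts the count back onto every row, so dates repeat. If you would rather have one row per date, a small follow-up sketch on the df built above:

daily_counts = df.drop_duplicates(subset='date')
print(daily_counts)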

Data Scrubbing using python

So, I am cleaning a data set that has a timestamp as the first column and prices and volumes as the next two columns. I am trying to remove bad rows using several rules, then output one file with all the bad ticks and another with all the good ticks. Every rule seems to work except the one where I remove duplicates: I end up with fewer rows than I started with:
from datetime import datetime

to_sort = list()
noise = list()
line_counter = 0
with open("hi.txt", 'r') as f:
    for line in f:
        # splitting the lines using the delimiter comma
        splitted_line = line.strip().split(",")
        # stripping the time using the datetime function from the datetime library
        date1 = datetime.strptime(splitted_line[0], '%Y%m%d:%H:%M:%S.%f')
        # creating columns for volume and price
        price = float(splitted_line[1])
        volume = int(splitted_line[2])
        # creating a tuple using date as first column, price as second and volume as third
        my_tuple = (date1, price, volume)
        # EDA shows that the prices are between 0 and 3000 and volume must be greater than zero
        if price > 0 and price < 3000 and volume > 0:
            to_sort.append(my_tuple)
        else:
            noise.append(my_tuple)
        line_counter += 1
        if line_counter % 13 == 0:
            # removing duplicates using the set function
            sorted_signal = sorted(set(to_sort))
            with open("true.txt", "a") as s:
                for line in sorted_signal:
                    s.write(str(line[0]) + "," + str(line[1]) + "," + str(line[2]) + "\n")
            to_sort = list()
            with open("noise.txt", "a") as n:
                for line in noise:
                    n.write(str(line[0]) + "," + str(line[1]) + "," + str(line[2]) + "\n")
            noise = list()

Find datetime instances in multiple datetime groups python

I have two CSV files with timestamp data in str format.
The first, CSV_1, has data resampled from a pandas timeseries into 15-minute blocks and looks like:
time ave_speed
1/13/15 4:30 34.12318398
1/13/15 4:45 0.83396195
1/13/15 5:00 1.466816057
CSV_2 has regular times from GPS points, e.g.:
id time lat lng
513620 1/13/15 4:31 -8.15949 118.26005
513667 1/13/15 4:36 -8.15215 118.25847
513668 1/13/15 5:01 -8.15211 118.25847
I'm trying to iterate through both files to find instances where a time in CSV_2 falls within a 15-minute time group in CSV_1, and then do something; in this case, append ave_speed to every entry for which the condition is true.
Desired result using the above examples:
id time lat lng ave_speed
513620 1/13/15 4:31 -8.15949 118.26005 0.83396195
513667 1/13/15 4:36 -8.15215 118.25847 0.83396195
513668 1/13/15 5:01 -8.15211 118.25847 something else
I tried doing it solely in pandas dataframes but ran into some trouble, so I thought this might be a workaround to achieve what I'm after.
This is the code I've written so far. I feel like it's close, but I can't seem to nail the logic to get my for loop to return the entries within each 15-minute time group.
with open('path/CSV_2.csv', mode="rU") as infile:
    with open('path/CSV_1.csv', mode="rU") as newinfile:
        reader = csv.reader(infile)
        nreader = csv.reader(newinfile)
        next(nreader, None)  # skip the headers
        next(reader, None)  # skip the headers
        for row in nreader:
            for dfrow in reader:
                if (datetime.datetime.strptime(dfrow[2], '%Y-%m-%d %H:%M:%S') < datetime.datetime.strptime(row[0], '%Y-%m-%d %H:%M:%S') and
                        datetime.datetime.strptime(dfrow[2], '%Y-%m-%d %H:%M:%S') > datetime.datetime.strptime(row[0], '%Y-%m-%d %H:%M:%S') - datetime.timedelta(minutes=15)):
                    print dfrow[2]
Link to pandas question I posted with same problem Pandas, check if timestamp value exists in resampled 30 min time bin of datetimeindex
EDIT:
Creating two lists of times, i.e. listOne with all the times from CSV_1 and listTwo with all the times from CSV_2, I'm able to find instances in the time groups. So something is weird about using the CSV values directly. Any help would be appreciated.
I feel like this is pretty close to what I want, in case anyone is curious how to do the same thing. It's not massively efficient: because of the double loop, the current script takes roughly a day to iterate over all the rows multiple times.
If anyone has any thoughts on how to make this easier or quicker, I'd be very interested.
# OPEN THE CSV FILES
with open('/GPS_Timepoints.csv', mode="rU") as infile:
    with open('/Resampled.csv', mode="rU") as newinfile:
        reader = csv.reader(infile)
        nreader = csv.reader(newinfile)
        next(nreader, None)  # skip the headers
        next(reader, None)  # skip the headers

        # DICT COMPREHENSION TO GET ONLY THE DESIRED DATA FROM CSV
        checkDates = {row[0]: row[7] for row in nreader}
        x = checkDates.items()

        # READ CSV INTO LIST (SEEMED TO BE EASIER THAN READING DIRECT FROM CSV FILE, I DON'T KNOW IF IT'S FASTER)
        csvDates = []
        for row in reader:
            csvDates.append(row)

# LOOP 1 TO ITERATE OVER FULL RANGE OF DATES IN RESAMPLED DATA AND A PRINT STATEMENT TO GIVE ME HOPE THE PROGRAM IS RUNNING
for i in range(0, len(x)):
    print 'checking', i
    # TEST TO SEE IF THE TIME IS IN THE TIME RANGE, THEN IF TRUE INSERT THE DESIRED ATTRIBUTE, IN THIS CASE SPEED, INTO THE ROW
    for row in csvDates:
        if row[2] > x[i-1][0] and row[2] < x[i][0]:
            row.insert(9, x[i][1])

# GET THE RESULT TO CSV TO UPLOAD INTO GIS
with open('/result.csv', mode="w") as outfile:
    wr = csv.writer(outfile)
    wr.writerow(['id', 'boat_id', 'time', 'state', 'lat', 'lng', 'activity', 'speed', 'state_reason'])
    for row in csvDates:
        wr.writerow(row)
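If anyone is still curious, one way to drop the double loop is to bucket each GPS time into its 15-minute bin and look the speed up in a dict, turning the O(n*m) scan into O(n + m). A rough sketch, assuming a hypothetical timestamp format of '%m/%d/%y %H:%M', that a bin is labeled by its end time (as in the sample data, where 4:31 and 4:36 fall in the 4:45 bin), and the column layout of the samples above (time in column 0 and ave_speed in column 1 of the resampled file; the real files may differ, e.g. the answer above reads the speed from column 7):

import csv
from datetime import datetime, timedelta

FMT = '%m/%d/%y %H:%M'  # hypothetical; adjust to the real timestamp format

def bin_end(ts):
    """Round a timestamp up to the end of its 15-minute bin."""
    start = ts.replace(minute=(ts.minute // 15) * 15, second=0, microsecond=0)
    return start + timedelta(minutes=15) if ts > start else start

# speed per bin-end time, read once from the resampled file
with open('/Resampled.csv', 'r') as f:
    nreader = csv.reader(f)
    next(nreader, None)  # skip the header
    speeds = {datetime.strptime(r[0], FMT): r[1] for r in nreader}

# single pass over the GPS rows, dict lookup instead of an inner loop
with open('/GPS_Timepoints.csv', 'r') as f:
    reader = csv.reader(f)
    next(reader, None)  # skip the header
    for row in reader:
        ts = datetime.strptime(row[1], FMT)       # assumes time in column 1
        row.append(speeds.get(bin_end(ts), ''))   # '' when no bin matches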
