I have a report that is generated at the beginning of each month, in .csv format. Currently, the report contains a series of columns with assorted data; one of the columns is an 'add_date' field containing data in "YYYY-mm-dd HH:MM:SS" format.
My end goal is to parse this source CSV so that only rows containing 'add_date' cells with dates from the previous month remain. So for example, if the script were run on February 1st 2021, only the rows containing dates from January 2021 would remain in the output CSV file.
This is an example of the source CSV contents:
Name,Data1,add_date
jasmine,stuff ,2021-01-26 17:29:46
ariel,things,2021-01-26 17:48:04
ursula,foo,2016-11-02 19:32:09
belle,bar,2016-01-21 18:47:33
and this is the python script I have so far:
#!/usr/bin/env python3
import csv

filtered_rows = []
with open('test123.csv', newline='') as csvfile:
    rowreader = csv.reader(csvfile, delimiter=',')
    for row in rowreader:
        if row["2021-01"] in csvfile.add_date:
            filtered_rows.append(row)
print(filtered_rows)
which I call with the following command:
./testscript.py > testfile.csv
Currently, when I run the above command I am greeted with the following error message:
Traceback (most recent call last):
  File "./testscript.py", line 9, in <module>
    if row["2021-01"] in csvfile.add_date:
TypeError: list indices must be integers or slices, not str
My current Python version is Python 3.6.4, running in CentOS Linux release 7.6.1810 (Core).
If I understood correctly, you can do something like this:
import pandas as pd
from datetime import datetime

df = pd.read_csv('test.csv', sep=',', header=0)
df['add_date'] = pd.to_datetime(df['add_date'])
# Use an exclusive upper bound, so times during Jan 31 are kept too.
filtered = df[(df.add_date >= datetime(2021, 1, 1)) & (df.add_date < datetime(2021, 2, 1))]
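If the goal is the previous month relative to the run date rather than a hard-coded January 2021, a sketch of the same pandas approach using month periods (the inline sample data is an assumption for illustration):

```python
import pandas as pd

# Small stand-in for the parsed CSV.
df = pd.DataFrame({
    'Name': ['jasmine', 'ursula'],
    'add_date': ['2021-01-26 17:29:46', '2016-11-02 19:32:09'],
})
df['add_date'] = pd.to_datetime(df['add_date'])

# Compare year-month periods instead of hand-written date bounds;
# this sidesteps the end-of-month edge case entirely.
target = pd.Period('2021-01', freq='M')  # in practice: pd.Timestamp.today().to_period('M') - 1
filtered = df[df['add_date'].dt.to_period('M') == target]
print(filtered['Name'].tolist())  # ['jasmine']
```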
To do this properly you need to determine the previous month and year, then compare them to the add_date field of each row. The year is important to handle the December → January transition (as well as gaps of more than a year).
Here's what I mean.
import csv
import datetime

filename = 'test123.csv'
ADD_DATE_COL = 2

# Determine previous month and year.
first = datetime.date.today().replace(day=1)
last = first - datetime.timedelta(days=1)
previous_month, previous_year = last.month, last.year

# Extract rows for previous month.
filtered_rows = []
with open(filename, newline='') as csvfile:
    reader = csv.reader(csvfile, delimiter=',')
    next(reader)  # Ignore header row.
    for row in reader:
        add_date = datetime.datetime.strptime(row[ADD_DATE_COL], '%Y-%m-%d %H:%M:%S')
        if add_date.month == previous_month and add_date.year == previous_year:
            filtered_rows.append(row)

print(filtered_rows)
I got the basic idea of how to determine the date of the previous month from bgporter's answer to the question How to determine date of the previous month?.
Related
I am very new to Python and need some help finding the maximum value in a column of data (time) that is imported from a csv file. This is the code I have tried.
file = open("results.csv")
unneeded = file.readline()

for line in file:
    data = file.readline()
    linelist = line.split(",")
    hours = linelist[4]

maxtime = 0
for x in hours:
    if x > maxtime:
        maxtime = x
print(maxtime)
any help is appreciated
Edit: I tried this code but it gives me the wrong answer :(
file = open("results.csv")
unneeded = file.readline()

maxtime = 0
for line in file:
    data = file.readline()
    linelist = line.split(",")
    hours = linelist[4]
    if hours > str(maxtime):
        maxtime = hours
print(maxtime)
[first few lines of results][1]
Edit: results csv
[1]: https://i.stack.imgur.com/z3pEJ.png
I haven't tested it, but this should work; the csv library makes parsing CSV files easy.
import csv

maxtime = 0.0
with open("results.csv") as file:
    csv_reader = csv.reader(file, delimiter=',')
    next(csv_reader)  # skip the header row
    for row in csv_reader:
        hours = float(row[4])  # compare as numbers, not strings
        if hours > maxtime:
            maxtime = hours
print(maxtime)
My recommendation is to use the pandas module for anything CSV-related.
Using dateutil, I create a dataset of dates and identification numbers whose date values are shuffled (in no specific order), like so:
from dateutil.parser import *
from dateutil.rrule import *
from random import shuffle

dates = list(rrule(
    DAILY,
    dtstart=parse("19970902T090000"),
    until=parse("19971224T000000")
))
shuffle(dates)

with open('out.csv', mode='w', encoding='utf-8') as out:
    for i, date in enumerate(dates):
        out.write(f'{i},{date}\n')
Thus, in this particular dataset, 1997-12-23 09:00:00 would be the "largest" date. Then, to extract the maximum date, we can do it via string comparison, since the dates are formatted in the ISO 8601 date/time format:
from pandas import read_csv

# The file has no header row, so pass header=None to keep the first data row.
df = read_csv('out.csv', names=['index', 'time'], header=None)
print(max(df['time']))
After running it, we indeed get 1997-12-23 09:00:00 printed in the terminal!
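The same string-ordering property works without pandas at all; a minimal sketch, assuming the timestamps are ISO 8601 strings:

```python
# ISO 8601 timestamps sort lexicographically in chronological order,
# so max() on the raw strings finds the latest moment.
times = [
    '1997-09-02 09:00:00',
    '1997-12-23 09:00:00',
    '1997-11-05 09:00:00',
]
latest = max(times)
print(latest)  # 1997-12-23 09:00:00
```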
I simply want to remove columns 6 to 11 entirely from the .csv file. Implementing solutions found online hasn't yielded a fix, unfortunately. Most solutions delete columns identified by their column title; however, my .csv file doesn't have column titles, as it is easier for the future without them.
from binance.client import Client
import config, csv
import pandas as pd

client = Client(config.apikey, config.apisecret)
candles = client.get_klines(symbol='ETHUSDT', interval=Client.KLINE_INTERVAL_1HOUR)

csvfile = open('1hour_dec2020_2021.csv', 'w', newline='')
candlestick_writer = csv.writer(csvfile, delimiter=',')
candlesticks = client.get_historical_klines('ETHUSDT', client.KLINE_INTERVAL_1HOUR, "1 Dec 2020", "1 Jan 2021")

for candlestick in candlesticks:
    candlestick_writer.writerow(candlestick)

csvfile.close()
Example row from .csv file:
1502942400000,301.13000000,302.57000000,298.00000000,301.61000000,125.66877000,1502945999999,37684.80418100,129,80.56377000,24193.44078900,47039.70675719
which corresponds to the Binance kline response
Removing unwanted columns would preferably result in (timestamp, o, h, l, c, v):
1502942400000,301.13000000,302.57000000,298.00000000,301.61000000,125.66877000
You could use itemgetter() to pick the required values from each row. In your case they are contiguous so you could also use a simple slice:
from operator import itemgetter
from binance.client import Client
import config, csv

required_cols = itemgetter(0, 1, 2, 3, 4, 5)  # specify required columns

client = Client(config.apikey, config.apisecret)

with open('1hour_dec2020_2021.csv', 'w', newline='') as csvfile:
    candlestick_writer = csv.writer(csvfile)
    candlesticks = client.get_historical_klines('ETHUSDT', client.KLINE_INTERVAL_1HOUR, "1 Dec 2020", "1 Jan 2021")
    for candlestick in candlesticks:
        candlestick_writer.writerow(required_cols(candlestick))
        # or just use a slice if the columns are contiguous:
        # candlestick_writer.writerow(candlestick[:6])
Using a with block avoids the need to call close() explicitly.
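Since the question already imports pandas, another option is to trim the file after the fact with usecols; a sketch (the inline sample row stands in for the real kline dump):

```python
import pandas as pd

# Stand-in for the real kline dump (12 columns per row).
with open('1hour_dec2020_2021.csv', 'w') as f:
    f.write('1502942400000,301.13,302.57,298.00,301.61,125.66,'
            '1502945999999,37684.80,129,80.56,24193.44,47039.70\n')

# Keep only columns 0-5: timestamp, open, high, low, close, volume.
df = pd.read_csv('1hour_dec2020_2021.csv', header=None, usecols=range(6))
df.to_csv('1hour_trimmed.csv', header=False, index=False)
```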
I'm trying to get an output of all the employees who worked this month by extracting the month from the date but I get this error:
  month = int(row[1].split('-')[1])
IndexError: list index out of range
A row in the attendance log csv looks like this:
"404555403","2020-10-14 23:58:15.668520","Chandler Bing"
I don't understand why it's out of range?
Thanks for any help!
import csv
import datetime

def monthly_attendance_report():
    """
    The function prints the attendance data of all employees from this month.
    """
    this_month = datetime.datetime.now().month
    with open('attendance_log.csv', 'r') as csvfile:
        content = csv.reader(csvfile, delimiter=',')
        for row in content:
            month = int(row[1].split('-')[1])
            if month == this_month:
                return row

monthly_attendance_report()
It works for me with your sample row, so the problem is probably in processing the csv file: most csv files have a header row, and the header cell contains no date to split. A csv.reader object can't be sliced, but you can skip the header by calling next() on the reader before the loop:

next(content)  # skip the header row
for row in content:

Also, processing the date by string slicing isn't ideal; use the datetime module (e.g. strptime) or something similar instead.
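Putting both fixes together, a sketch of the function that skips the header, parses the timestamp with strptime, and collects every matching row instead of returning on the first one (collecting all rows is an assumption based on the stated goal of listing all employees):

```python
import csv
import datetime

def monthly_attendance_report(path='attendance_log.csv'):
    """Collect all rows whose timestamp falls in the current month."""
    now = datetime.datetime.now()
    matches = []
    with open(path, newline='') as csvfile:
        content = csv.reader(csvfile)
        next(content, None)  # skip the header row, which has no date to parse
        for row in content:
            stamp = datetime.datetime.strptime(row[1], '%Y-%m-%d %H:%M:%S.%f')
            if stamp.month == now.month and stamp.year == now.year:
                matches.append(row)
    return matches
```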
I have generated a csv file in the format shown in the image below:
In this image, the data is arranged week by week, but in some places I couldn't arrange the data by week. If you look at the image, you will see the red mark and the blue mark. I want to separate those two marked sections. How can I do it?
Note: If a holiday falls on Friday, then the week should run from Monday to Thursday.
Currently, I'm using the logic below:
Image: Please click here to see image
Current logic:
import csv

blank_fields = [' ']
fields = [' ', 'Weekly Avg']

# Read csv file
file1 = open('test.csv', 'rb')
reader = csv.reader(file1)
new_rows_list = []

# Read data row by row and store into new list
for row in reader:
    new_rows_list.append(row)
    if 'Friday' in row:
        new_rows_list.append(fields)

file1.close()
Overall you are going in the right direction, but your condition is a little too error-prone; things can go wrong (e.g., only one day of a week appears in your list), so testing for the weekday string isn't the best choice here.
I would suggest "understanding" the date/time in your table and solving this using weekdays, like this:
from datetime import datetime as dt, timedelta as td

# remember the last weekday
last = None
# each item in the list represents one week; _cur_week is a temp var
weeks = []
_cur_week = []

for row in reader:
    # assuming the date is in row[1] (split, convert to int, pass as args)
    _cur_date = dt(*map(int, row[1].split("-")))
    # weekday() will be in [0 .. 6]
    # now if _cur_date.weekday() <= last.weekday(), a week is complete
    # (also catching the corner case of more than 7 days between two entries)
    if last and (_cur_date.weekday() <= last.weekday() or (_cur_date - last) >= td(days=7)):
        # append collected rows to 'weeks', empty _cur_week for next rows
        weeks.append(_cur_week)
        _cur_week = []
    # regular invariant: append row and set 'last' to '_cur_date'
    _cur_week.append(row)
    last = _cur_date

# don't forget the final, still-open week
if _cur_week:
    weeks.append(_cur_week)
Pretty verbose and extensive, but I hope it conveys the pattern used here:
- parse the existing date and use the weekday to distinguish one week from another (the weekday increases monotonically within a week, so any decrease or equality tells you the current date belongs to the next week)
- store rows in a temporary list during one week
- append _cur_week to weeks once the next-week condition triggers
- empty _cur_week for the next week's rows
Finally, the last thing to do is to "concat" the data, e.g. like this:
new_rows_list = [[fields] + week for week in weeks]
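To get the grouped rows back into a flat CSV, one option is a sketch like this (the inline weeks data is an assumption for illustration):

```python
import csv

# 'weeks' as built by the loop above: one list of rows per week.
weeks = [
    [['2021-01-04', 'Monday'], ['2021-01-08', 'Friday']],
    [['2021-01-11', 'Monday']],
]
fields = [' ', 'Weekly Avg']

# Prepend the separator/average row to each week, then write all rows out.
new_rows_list = [[fields] + week for week in weeks]
with open('out_weekly.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    for week in new_rows_list:
        writer.writerows(week)
```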
I have another approach for the same thing; it worked successfully and is an easy solution.
import csv
import datetime

fields = {'': ' ', 'Date': 'Weekly Avg'}

# Read csv file
file1 = open('temp.csv', 'rb')
reader = csv.DictReader(file1)
new_rows_list = []
last_date = None

# Read data row by row and store into new list
for row in reader:
    cur_date = datetime.datetime.strptime(row['Date'], '%Y-%m-%d').date()
    if last_date and ((last_date.weekday() > cur_date.weekday()) or (cur_date.weekday() == 0)):
        new_rows_list.append(fields)
    last_date = cur_date
    new_rows_list.append(row)

file1.close()
My overarching goal is to write a Python script that transforms each row of a spreadsheet into a standalone markdown file, using each column as a value in the file's YAML header. Right now, the final for loop I've written not only keeps going and going and going… it also doesn't seem to place the values correctly.
import csv

f = open('data.tsv')
csv_f = csv.reader(f, dialect=csv.excel_tab)

date = []
title = []

for column in csv_f:
    date.append(column[0])
    title.append(column[1])

for year in date:
    for citation in title:
        print "---\ndate: %s\ntitle: %s\n---\n\n" % (year, citation)
I'm using tab-separated values because some of the fields in my spreadsheet are chunks of text with commas. So ideally, the script should output something like the following (I figured I'd tackle splitting this output into individual markdown files later. One thing at a time):
---
date: 2015
title: foo
---
---
date: 2016
title: bar
---
But instead I'm getting misplaced values and output that never ends. I'm obviously learning as I go along here, so any advice is appreciated.
import csv

with open('data.tsv', newline='') as f:
    csv_f = csv.reader(f, dialect=csv.excel_tab)
    for column in csv_f:
        year, citation = column  # column is a list; unpack it directly
        print("---\ndate: %s\ntitle: %s\n---\n" % (year, citation))
This is all I can do without the sample CSV file.
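For the later step of splitting the output into standalone markdown files, a sketch (the sample TSV contents and the filename scheme are assumptions for illustration):

```python
import csv

# Sample stand-in for the real spreadsheet export.
with open('data.tsv', 'w', newline='') as f:
    f.write('2015\tfoo\n2016\tbar\n')

with open('data.tsv', newline='') as f:
    csv_f = csv.reader(f, dialect=csv.excel_tab)
    for year, citation in csv_f:
        # one standalone markdown file per spreadsheet row
        with open('%s-%s.md' % (year, citation), 'w') as out:
            out.write('---\ndate: %s\ntitle: %s\n---\n' % (year, citation))
```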