Removing entire column from .csv file without column titles - simple question - python

I simply want to remove columns 6 to 11 entirely from the .csv file. The solutions I've found online haven't worked, unfortunately. Most of them delete columns identified by their column title; however, my .csv file doesn't have column titles, as it is easier to work with going forward without them.
from binance.client import Client
import config, csv
import pandas as pd
client = Client(config.apikey, config.apisecret)
candles = client.get_klines(symbol='ETHUSDT', interval=Client.KLINE_INTERVAL_1HOUR)
csvfile = open('1hour_dec2020_2021.csv', 'w', newline='')
candlestick_writer = csv.writer(csvfile, delimiter=',')
candlesticks = client.get_historical_klines('ETHUSDT', client.KLINE_INTERVAL_1HOUR, "1 Dec 2020", "1 Jan 2021")
for candlestick in candlesticks:
    candlestick_writer.writerow(candlestick)
csvfile.close()
Example row from .csv file:
1502942400000,301.13000000,302.57000000,298.00000000,301.61000000,125.66877000,1502945999999,37684.80418100,129,80.56377000,24193.44078900,47039.70675719
which corresponds to the Binance kline response
Removing the unwanted columns would preferably leave just timestamp, o, h, l, c, v:
1502942400000,301.13000000,302.57000000,298.00000000,301.61000000,125.66877000

You could use itemgetter() to pick the required values from each row. In your case they are contiguous so you could also use a simple slice:
from operator import itemgetter
from binance.client import Client
import config, csv
required_cols = itemgetter(0, 1, 2, 3, 4, 5) # specify required columns
client = Client(config.apikey, config.apisecret)
with open('1hour_dec2020_2021.csv', 'w', newline='') as csvfile:
    candlestick_writer = csv.writer(csvfile)
    candlesticks = client.get_historical_klines('ETHUSDT', client.KLINE_INTERVAL_1HOUR, "1 Dec 2020", "1 Jan 2021")
    for candlestick in candlesticks:
        candlestick_writer.writerow(required_cols(candlestick))
        # or just use a slice if the columns are contiguous
        # candlestick_writer.writerow(candlestick[:6])
Using a with statement avoids the need to also call close().
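As a quick standalone illustration of the two approaches, using a (truncated) sample kline row like the one in the question:

```python
from operator import itemgetter

# Sample kline row shaped like the question's data (values abbreviated).
row = ['1502942400000', '301.13', '302.57', '298.00', '301.61', '125.66',
       '1502945999999', '37684.80', '129', '80.56', '24193.44', '47039.70']

required_cols = itemgetter(0, 1, 2, 3, 4, 5)

print(list(required_cols(row)))  # itemgetter returns a tuple of the picked fields
print(row[:6])                   # a slice returns a list of the same six fields
```

itemgetter is the more general tool (it also works for non-contiguous columns such as `itemgetter(0, 2, 5)`); the slice is simpler when, as here, the wanted columns are the first six.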


Data Cleansing Dates in Python

I am processing a .csv in Python which has a column full of timestamps in the below format:-
August 21st 2020, 13:58:19.000
The full content of a line in the .csv looks like this:-
"August 21st 2020, 14:55:12.000","joebloggs#gmail.com",IE,Ireland,"192.168.0.1",2
What I wish to do is populate a separate column with the respective date and time and so I'd have '2020-08-21' in a new column and '13:58:19' in an additional new column.
Looking at this, I think I first need to tidy up the date and remove the letters after the actual day number (21st, 4th, 7th etc.) so that they read 21, 4 etc.
I started writing the below, but this will replace the actual numbers too. How can I strip out just the letters after the numbers? I did think of a simple find/replace, but that would remove 'st' from August, for example, so I think a regex is needed: look for a space followed by 1 or 2 digits, check whether it is followed by 'nd', 'st', 'rd' or 'th', and if so strip that suffix out:
with open('input_csv', 'r') as input_file:
    lines = input_file.readlines()
    for line in lines:
        line = re.sub(r"\s\d+", "", line)
        print(line)
Based on a suggestion to use the datetime module, I tried splitting to see if this helped:-
import re
from datetime import datetime
with open('input_csv.csv', 'r') as input_file:
    lines = input_file.readlines()[1:]
    for line in lines:
        line_lst = line.split(',')
        line_date = datetime.strptime(line_lst[0], '%B %d %y')
        line_time = datetime.strptime(line_lst[1], '%H:%M:%S.%f')
        print(line_date)
        print(line_time)
but I still receive an error:-
ValueError: time data '"May 12th 2020' does not match format '%B %d %y'
It's probably easier to use the datetime library. For example,
from datetime import datetime
with open('input_csv', 'r') as input_file:
    lines = input_file.readlines()
    for line in lines:
        line_date = datetime.strptime(line, '%B %d %y, %H:%M:%S.%f')
and then use strftime() to write it to the new file in the new format. To read more about these methods:
https://docs.python.org/3/library/datetime.html#strftime-strptime-behavior
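For instance, once a line has been parsed into a datetime object (the value below is hypothetical, matching the question's sample timestamp), strftime() produces the two new column values directly:

```python
from datetime import datetime

# A parsed timestamp, standing in for the result of strptime() on one line.
dt = datetime(2020, 8, 21, 13, 58, 19)

# strftime() formats a datetime back into a string in the desired layout.
date_col = dt.strftime('%Y-%m-%d')  # '2020-08-21'
time_col = dt.strftime('%H:%M:%S')  # '13:58:19'
print(date_col, time_col)
```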
I recommend using as many good libraries as you can for this: a lot of what you want to do has already been developed and tested.
Use a CSV parser
Parsing a CSV can look simple at first but becomes increasingly difficult, as there are a lot of special rules around quoting (escaping) characters. Use Python's standard csv module to do this:
import csv
with open('input.csv', newline='') as f:
    reader = csv.reader(f)
    for row in reader:
        date_val = row[0]
        print(f'Raw string: {date_val}')
and I get the following without any special code:
Raw string: August 21st 2020, 14:55:12.000
Use a really good date-time parser
As a couple of others already pointed out, it's important to use a date-time parser, but they didn't mention that Python's standard parser cannot handle ordinals like '12th' and '21st' (maybe they just didn't see those...). The third-party dateutil library can handle them, though:
from dateutil import parser
...
...
        print(f'Raw string: {date_val}')
        dt = parser.parse(date_val)
        print(f'Date: {dt.date()}')
        print(f'Time: {dt.time()}')
Raw string: August 21st 2020, 14:55:12.000
Date: 2020-08-21
Time: 14:55:12
Make a new CSV
Again, use the CSV module's writer class to take the modified rows from the reader and turn them into a CSV.
For simplicity, I recommend just accumulating new rows as you're reading and parsing the input. I've also included commented-out lines for copying a header, if you have one:
...
new_rows = []
with open('input.csv', newline='') as f:
    reader = csv.reader(f)
    # header = next(reader)
    # new_rows.append(header)
    for row in reader:
        date_val = row[0]
        remainder = row[1:]
        dt = parser.parse(date_val)
        new_row = [dt.date(), dt.time()]
        new_row += remainder
        new_rows.append(new_row)
with open('output.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerows(new_rows)
My output.csv looks like this:
2020-08-21,14:55:12,joebloggs#gmail.com,IE,Ireland,192.168.0.1,2

How do you add a header to an excel csv file using python

So I'm trying to add a header to a csv file dynamically. My current code looks like the following:
import csv
from datetime import datetime
import pandas as pd
rows = []
with open(r'Test_Timestamp.csv', 'r', newline='') as file:
    with open(r'Test_Timestamp_Result.csv', 'w', newline='') as file2:
        reader = csv.reader(file, delimiter=',')
        for row in reader:
            rows.append(row)
        file_write = csv.writer(file2)
        for val in rows:
            current_date_time = datetime.now().replace(hour=0, minute=0, second=0, microsecond=0)
            val.insert(0, current_date_time)
            file_write.writerow(val)
Currently this inserts a new timestamp at column A, which is exactly what I want it to do, since everything else should be pushed over; I'll be working with csv files with varying numbers of columns.
What I'm having trouble with is adding a column header. Currently the timestamp is created next to the existing header; I would instead want to create a new header named Execution_Date.
I have looked at pandas as a solution, but the examples I've seen in the documentation all use a pre-determined set of column headers. I've tried inserting a column header with df.insert(0, "Execution_Date", current_date_time), but it gives me an error.
I know I'm fairly close to doing this, but I'm running into errors. Is there a way to do this dynamically, so it automatically works with different csv files and a different number of columns in each?
Any help with this would be greatly appreciated! I'm going to continue to see if I can solve this in the meantime, but I'm at a wall with how to proceed.
If the end result is something Excel can read, like a csv, you can likely bypass pandas altogether:
Edit: adding support for existing titles
Given a simple csv like:
Title,Other
Geeks1,foo
Geeks2,bar
Then you might use:
import contextlib
import csv
from datetime import datetime
with contextlib.ExitStack() as stack:
    file_in = stack.enter_context(open('Test_Timestamp.csv', "r", encoding="utf-8"))
    file_out = stack.enter_context(open('Test_Timestamp_Result.csv', "w", encoding="utf-8", newline=""))
    reader = csv.reader(file_in, delimiter=',')
    writer = csv.writer(file_out)
    writer.writerow(["Execution_Date"] + next(reader))
    writer.writerows(
        [datetime.now().replace(hour=0, minute=0, second=0, microsecond=0)] + row
        for row in reader
    )
to give you a file like:
Execution_Date,Title,Other
2022-02-11 00:00:00,Geeks1,foo
2022-02-11 00:00:00,Geeks2,bar
One way to do this is to utilize to_csv().
Example:
# importing python package
import pandas as pd
# read contents of csv file
file = pd.read_csv("gfg.csv")
print("\nOriginal file:")
print(file)
# adding header
headerList = ['id', 'name', 'profession']
# converting data frame to csv
file.to_csv("gfg2.csv", header=headerList, index=False)
# display modified csv file
file2 = pd.read_csv("gfg2.csv")
print('\nModified file:')
print(file2)

Extract rows from CSV based on column data

I have a report that is generated at the beginning of each month, in .csv format. Currently, the report contains a series of columns with assorted data; one of the columns is an 'add_date' field containing data in "YYYY-mm-dd HH:MM:SS" format.
My end goal is to parse this source CSV so that only rows containing 'add_date' cells with dates from the previous month remain. So for example, if the script were run on February 1st 2021, only the rows containing dates from January 2021 would remain in the output CSV file.
This is an example of the source CSV contents:
Name,Data1,add_date
jasmine,stuff ,2021-01-26 17:29:46
ariel,things,2021-01-26 17:48:04
ursula,foo,2016-11-02 19:32:09
belle,bar,2016-01-21 18:47:33
and this is the python script I have so far:
#!/usr/bin/env python3
import csv
filtered_rows = []
with open('test123.csv', newline='') as csvfile:
    rowreader = csv.reader(csvfile, delimiter=',')
    for row in rowreader:
        if row["2021-01"] in csvfile.add_date:
            filtered_rows.append(row)
print(filtered_rows)
which I call with the following command:
./testscript.py > testfile.csv
Currently, when I run the above command I am greeted with the following error message:
Traceback (most recent call last):
  File "./testscript.py", line 9, in <module>
    if row["2021-01"] in csvfile.add_date:
TypeError: list indices must be integers or slices, not str
My current Python version is Python 3.6.4, running in CentOS Linux release 7.6.1810 (Core).
If I understood correctly, you can do something like this:
import pandas as pd
from datetime import datetime
df = pd.read_csv('test.csv', sep=',', header=0)
df['add_date'] = pd.to_datetime(df['add_date'])
# Use an exclusive upper bound so timestamps later in the day on Jan 31 are included.
filtered = df[(df.add_date >= datetime.strptime('2021-01-01', '%Y-%m-%d')) & (df.add_date < datetime.strptime('2021-02-01', '%Y-%m-%d'))]
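Alternatively, a sketch of the same filter using year-month periods, which avoids spelling out the month boundaries at all (the inline frame below stands in for the real test.csv):

```python
import pandas as pd

# Sample data matching the question's column layout.
df = pd.DataFrame({
    'Name': ['jasmine', 'ursula'],
    'Data1': ['stuff', 'foo'],
    'add_date': ['2021-01-26 17:29:46', '2016-11-02 19:32:09'],
})
df['add_date'] = pd.to_datetime(df['add_date'])

# Compare whole year-month periods instead of start/end timestamps.
filtered = df[df['add_date'].dt.to_period('M') == pd.Period('2021-01')]
print(filtered)
```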
To do this properly you need to determine the previous month and year, then compare them to the add_date field of each row. The year is important to handle December → January transitions (as well as the possibility of multi-year gaps).
Here's what I mean.
import csv
import datetime
filename = 'test123.csv'
ADD_DATE_COL = 2
# Determine previous month and year.
first = datetime.date.today().replace(day=1)
last = first - datetime.timedelta(days=1)
previous_month, previous_year = last.month, last.year
# Extract rows for previous month.
filtered_rows = []
with open(filename, newline='') as csvfile:
    reader = csv.reader(csvfile, delimiter=',')
    next(reader)  # Ignore header row.
    for row in reader:
        add_date = datetime.datetime.strptime(row[ADD_DATE_COL], '%Y-%m-%d %H:%M:%S')
        if add_date.month == previous_month and add_date.year == previous_year:
            filtered_rows.append(row)
print(filtered_rows)
I got the basic idea of how to determine the date of the previous month from @bgporter's answer to the question How to determine date of the previous month?.

I have a bulk csv file; I want to read it into a dictionary, then write the dictionary out to a new csv file

I have the below lines in a bulk csv file:
date,id,site,linkup,linkdown,count,connection
20190102,100000000204197,google.com,1,2,1,5
20190102,100000000204197,yahoo.com,2,2,1,5
20190102,100000000204197,yahoo.com,1,2,2,3
20190102,41602323232,google.com,4,11,3
20190102,41602323232,google.com,1,3,1,7
Based on id and site, I want to aggregate them:
100000000204197,google.com,1,2,1,5
100000000204197,yahoo.com,3,4,3,8
20190102,41602323232,google.com,5,4,2,10
from datetime import datetime
from dateutil.parser import parse
from collections import Counter
import csv
with open('/home/mahmoudod/Desktop/Tareq-Qassrawi/report.txt', 'r') as rf:
    reader = csv.reader(rf)
    with open('/home/mahmoudod/Desktop/Tareq-Qassrawi/writer.txt', 'w') as wf:
        hashing_table = {}
        connection_val = 0
        connection_val_2 = 0
        for line in reader:
            key = int(line[1])
            if key != hashing_table.items():
                hashing_table = ({'IMSI': key,
                                  'SITE': str(line[2]),
                                  'DATE': str(line[0]),
                                  'linkup': int(line[3]),
                                  'linkdown': int(line[4]),
                                  'count': int(line[5]),
                                  'connection': int(line[6])
                                  })
                connection_val = connection_val + int(hashing_table.get('connection'))
                hashing_table[key].update({'connection': connection_val})
            else:
                connection_val_2 = connection_val_2 + int(hashing_table.get('connection'))
                hashing_table[key].update({'connection': connection_val_2})
Here it is, using the amazing pandas module by http://wesmckinney.com/ (and a whole host of open-source contributors now). See the docs here: http://pandas.pydata.org/pandas-docs/stable/
import pandas as pd
df = pd.read_csv('a.csv')  # read in your data from the csv file
df.groupby(['id', 'site']).sum()  # group rows by both id and site, summing the numeric columns
To get the id to show on every row instead of omitting repeated values, use reset_index:
df.groupby(['id', 'site']).sum().reset_index()
If you are using data a lot in your life/career, look into jupyter notebook or jupyter lab as well: https://jupyter.org/
Good luck and welcome to SO and python open source data.
You can use pandas' read_csv and to_dict for this purpose.
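A minimal sketch of that idea, using inline data shaped like the question's file (the filename aggregated.csv is just an example):

```python
import io
import pandas as pd

# Inline stand-in for the bulk csv file from the question.
data = io.StringIO(
    "date,id,site,linkup,linkdown,count,connection\n"
    "20190102,100000000204197,google.com,1,2,1,5\n"
    "20190102,100000000204197,yahoo.com,2,2,1,5\n"
    "20190102,100000000204197,yahoo.com,1,2,2,3\n"
)

df = pd.read_csv(data)

# to_dict('records') turns each row into a plain dictionary.
rows = df.to_dict('records')

# Aggregate by id and site, then write the result to a new csv.
agg = df.groupby(['id', 'site'], as_index=False)[['linkup', 'linkdown', 'count', 'connection']].sum()
agg.to_csv('aggregated.csv', index=False)
```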

Header of data file disappears when sorting

I have a csv file with rows of data. The first row is headers for the columns.
I'd like to sort the data by some parameter (specifically, the first column), but of course keep the header where it is.
When I do the following, the header disappears completely and is not included in the output file.
Can anyone please advise how to keep the header but skip it and sort the rest of the rows?
(for fwiw, the first column is a mix of numbers and letters).
Thanks!
Here's my code:
import csv
import operator
sankey = open('rawforsankey.csv', "rb")
raw_reader = csv.reader(sankey)
raw_data = []
for row in raw_reader:
    raw_data.append(row)
raw_data_sorted = sorted(raw_data, key=operator.itemgetter(0))
myfiletest = open('newfiletest.csv', 'wb')
wr = csv.writer(myfiletest,quoting = csv.QUOTE_ALL)
wr.writerows(raw_data_sorted)
sankey.close()
myfiletest.close()
EDIT: should mention I tried this variation in the code:
raw_data_sorted = sorted(raw_data[1:], key=operator.itemgetter(0))
but got the same result
You sorted all data, including the header, which means it is still there but perhaps in the middle of your resulting output somewhere.
This is how you'd sort a CSV on one column, preserving the header:
import csv
import operator
with open('rawforsankey.csv', "rb") as sankey:
    raw_reader = csv.reader(sankey)
    header = next(raw_reader, None)
    sorted_data = sorted(raw_reader, key=operator.itemgetter(0))
with open('newfiletest.csv', 'wb') as myfiletest:
    wr = csv.writer(myfiletest, quoting=csv.QUOTE_ALL)
    if header:
        wr.writerow(header)
    wr.writerows(sorted_data)
Just remember that sorting is done lexicographically as all columns are strings. So 10 sorts before 9, for example. Use a more specific sorting key if your data is numeric, for example.
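To illustrate the lexicographic pitfall with a small, hypothetical dataset (this only applies if the key column is purely numeric; the question says column one mixes numbers and letters, in which case string sorting may be what you want):

```python
import operator

rows = [['10', 'a'], ['9', 'b'], ['2', 'c']]

# Lexicographic sort compares strings character by character: '10' < '2' < '9'.
lex = sorted(rows, key=operator.itemgetter(0))

# Numeric sort converts the key column to int first.
num = sorted(rows, key=lambda row: int(row[0]))

print(lex)  # [['10', 'a'], ['2', 'c'], ['9', 'b']]
print(num)  # [['2', 'c'], ['9', 'b'], ['10', 'a']]
```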
