I have a CSV file that for some reason creates unnecessary Unnamed columns every time I do anything with it.
Here are the first two lines of the CSV file before anything is done to it:
Date,Type,Action,Symbol,Instrument Type,Description,Value,Quantity,Average Price,Commissions,Fees,Multiplier,Underlying Symbol,Expiration Date,Strike Price,Call or Put
2020-04-29T06:37:14-0400,Receive Deliver,BUY_TO_OPEN,USO1 201016C00014500,Equity Option,Reverse split: Open 2 USO1 201016C00014500,-10,2,-5,,0,100,USO,10/16/2020,14.5,CALL
Note how there are no extraneous commas. When I read the file as is, it seems to have Date as the index, but I can't access it: I get an error with ['Date'], with [0], and with .index. So I created a new column and made it the index, and all seemed well. Then I ran a function that converts the Date values to timestamps and also reverses the index. Yet when I try to reset the index with drop=True, inplace=True, or anything else I can think of, an unnamed column is set as the index and the index column I created is pushed to the end:
,Date,Type,Action,Symbol,Instrument Type,Description,Value,Quantity,Average Price,Commissions,Fees,Multiplier,Underlying Symbol,Expiration Date,Strike Price,Call or Put,ind
0,2019-12-10 17:00:00,Money Movement,,,,ACH DEPOSIT,"1,500.00",0,,,0.0,,,,,,709
This is the current snippet I have:
import pandas as pd
# set csv file as constant
TRADER_READER = pd.read_csv('TastyTrades.csv')
TRADER_READER['ind'] = TRADER_READER.index
# change date format, make date into timestamp object, set date as index, write changes to csv file
def clean_date():
    TRADER_READER['Date'] = TRADER_READER['Date'].replace({'T': ' ', '-0500': '', '-0400': ''}, regex=True)
    TRADER_READER['Date'] = pd.to_datetime(TRADER_READER['Date'], format="%Y-%m-%d %H:%M:%S")
    reversed_index = TRADER_READER.iloc[::-1].reset_index(drop=True)
    reversed_index.to_csv('TastyTrades.csv')
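For what it's worth, the usual culprit here is that to_csv writes the DataFrame index as an unlabeled first column by default, and the next read_csv reads that column back in as "Unnamed: 0". A minimal sketch of the round trip, using in-memory buffers instead of TastyTrades.csv:

```python
import io
import pandas as pd

csv_text = "Date,Type,Value\n2020-04-29T06:37:14-0400,Receive Deliver,-10\n"
df = pd.read_csv(io.StringIO(csv_text))

# Default to_csv writes the index as an unlabeled first column;
# reading that file back produces an "Unnamed: 0" column.
buf = io.StringIO()
df.to_csv(buf)
buf.seek(0)
round_trip = pd.read_csv(buf)
print(round_trip.columns[0])   # Unnamed: 0

# Passing index=False leaves the file's columns untouched.
buf2 = io.StringIO()
df.to_csv(buf2, index=False)
buf2.seek(0)
clean = pd.read_csv(buf2)
print(list(clean.columns))     # ['Date', 'Type', 'Value']
```

So writing with reversed_index.to_csv('TastyTrades.csv', index=False), or reading with pd.read_csv('TastyTrades.csv', index_col=0), should stop the extra columns from accumulating.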
I have a python script that generates a CSV file for me. The script adds an index column to the CSV that rightfully starts at 0 and goes onto the last row. This works as expected. However, I need to find a way to make sure that the index starts from the last index number in the previously generated CSV file. That is, if I have the following CSV:
csv1.csv
,Name,Customer
0,test1,customer1
1,test2,customer2
I would like the next CSV file generated to read:
csv2.csv
,Name,Customer
2,test1,customer1
3,test2,customer2
I suspect we will need to import the last generated file to read the last index on it, but how do I make the new CSV file generated start from that point?
Note: I do not wish to import data from the previously generated CSV. I simply want to have the new index start from where the last index ended.
Below is my script thus far, as a reference.
import csv
import pandas as pd

by_name = {}
with open('flavor.csv') as b:
    for row in csv.DictReader(b):
        name = row.pop('Name')
        by_name[name] = row

with open('output.csv', 'w') as c:
    w = csv.DictWriter(c, ['ID', 'Name', 'Flavor', 'RAM', 'Disk', 'Ephemeral', 'VCPUs', 'Customer', 'Misc', 'date_stamp', 'Month', 'Year', 'Unixtime'])
    w.writeheader()
    with open('instance.csv') as a:
        for row in csv.DictReader(a):
            try:
                match = by_name[row['Flavor']]
            except KeyError:
                # skip rows whose flavor has no match
                continue
            row.update(match)
            w.writerow(row)

df = pd.read_csv('output.csv')
df[['Customer', 'Misc']] = df.Name.str.split('-', n=1, expand=True)
# date_time, month, year and unixtime are defined elsewhere in the script
df['date_stamp'] = date_time
df['Month'] = month
df['Year'] = year
df['Unixtime'] = unixtime
df.loc[df.Misc.str.startswith('business', na=False), 'Customer'] += '-business'
df.Misc = df.Misc.str.strip('business-')
df['Customer'] = df.Customer.str.title()
df.to_csv('final-output.csv')
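One way to get that behavior (a sketch, assuming the previous file was written with pandas' default index column): read the old file's index, take its maximum, and offset the new frame's index before writing. In-memory buffers stand in for csv1.csv and csv2.csv here:

```python
import io
import pandas as pd

# Stand-in for csv1.csv, previously written with the default index column.
prev = io.StringIO(",Name,Customer\n0,test1,customer1\n1,test2,customer2\n")
last_index = pd.read_csv(prev, index_col=0).index.max()

# New data to write (stand-in for the rows of csv2.csv).
new_df = pd.DataFrame({"Name": ["test1", "test2"],
                       "Customer": ["customer1", "customer2"]})

# Shift the index so it continues where the previous file stopped.
new_df.index = range(last_index + 1, last_index + 1 + len(new_df))

out = io.StringIO()
new_df.to_csv(out)          # the new file's index now starts at 2
print(out.getvalue())
```

In your script you would replace the StringIO stand-ins with the real file names; only the index column of the previous file is read, not its data.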
I would like to create a txt file where every line is a so-called "ticker symbol" (= the symbol for a stock). As a first step, I downloaded all the tickers I want via the wikipedia package:
import pandas as pd
import wikipedia as wp
html1 = wp.page("List of S&P 500 companies").html().encode("UTF-8")
df = pd.read_html(html1,header =0)[0]
df = df.drop(['SEC filings','CIK', 'Headquarters Location', 'Date first added', 'Founded'], axis = 1)
df.columns = df.columns.str.replace('Symbol', 'Ticker')
Secondly, I would like to create a txt file, as mentioned above, with all the ticker names from the "Ticker" column of df. To do so, I probably have to do something similar to:
f = open("tickertest.txt","w+")
f.write("MMM\nABT\n...etc.")
f.close()
Now my problem: does anybody know how to bring the Ticker column of df into one big string where every ticker is separated by \n, i.e. where every ticker is on its own line?
You can use to_csv for this.
df.to_csv("test.txt", columns=["Ticker"], header=False, index=False)
This provides flexibility to include other columns, column names, and index values at some future point (should you need to do some sleuthing, or in case your boss asks for more information). You can even change the separator. This would be a simple modification (obvious changes, e.g.):
df.to_csv("test.txt", columns=["Ticker", "Symbol",], header=True, index=True, sep="\t")
I think the benefit of this method over jfaccioni's answer is flexibility and ease of adaptability. It also gets you away from explicitly opening a file. However, if you still want to explicitly open a file, you should consider using "with", which will automatically close the buffer when you leave the block, e.g.:
with open("test.txt", "w") as fid:
    fid.write("MMM\nABT\n...etc.")
This should do the trick:
'\n'.join(df['Ticker'].astype(str).values)
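Either way, writing the joined string out is then short. A sketch with a toy DataFrame standing in for the real df:

```python
import pandas as pd

# Toy stand-in for the real df built from the Wikipedia table.
df = pd.DataFrame({"Ticker": ["MMM", "ABT", "ABBV"]})

# Join every ticker with a newline so each symbol lands on its own line.
ticker_text = '\n'.join(df['Ticker'].astype(str))

with open("tickertest.txt", "w") as f:
    f.write(ticker_text + '\n')  # trailing newline so the last line is terminated
```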
I am concatenating 100 CSVs with names like XXX_XX_20112020.csv to create one file, let's say master.csv.
Can I extract the date from each file name and create a new column with that date auto-populated for all records in that file? Should I be doing this before or after concatenation, and how?
If they all follow the same XXX_XX_20112020.csv pattern, you can pull the date part out with 'XXX_XX_20112020.csv'.rsplit('_', 1)[-1].rsplit('.', 1)[0]:
import datetime
file_name = 'XXX_XX_20112020.csv'
file_name_ending = file_name.rsplit('_',1)[-1]
date_part = file_name_ending.rsplit('.',1)[0]
date_part_parsed = datetime.datetime.strptime(date_part, "%d%m%Y").date()
The first rsplit splits the file name on '_' to grab the last piece, and the second does the same on '.' to drop the '.csv' suffix. Now you need to turn the date string into a real date.
Read here:
https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior
https://docs.python.org/3/library/datetime.html#datetime.datetime.strptime
strptime will turn the string into a datetime object when the right format is given.
Now you can make this a function and apply it to all the file names you have.
P.S: rsplit https://docs.python.org/3/library/stdtypes.html#str.rsplit
import datetime
import os
import pandas as pd

master_df = pd.DataFrame()
for file in os.listdir('folder_with_csvs'):
    # take the part after the last underscore and before the '.csv' suffix
    date_for_file = file.split('_')[-1].split('.')[0]
    date_for_file = datetime.datetime.strptime(date_for_file, "%d%m%Y").date()
    df = pd.read_csv(os.path.join('folder_with_csvs', file))
    # put the date in the POST_DATE column for every record of this file
    df['POST_DATE'] = date_for_file
    master_df = pd.concat([master_df, df], ignore_index=True)
# Eventually
master_df.to_csv('master.csv', index=False)
How do I go about manipulating each file of a folder based on values pulled from a dictionary? Basically, say I have x files in a folder. I use pandas to reformat the dataframe, add a column which includes the date of the report, and save the new file under the same name and the date.
import os
import pandas as pd
from pathlib import Path

source = Path("Users/Yay/AlotofFiles/April")
items = os.listdir(source)
d_dates = {'0401': '04/1/2019', '0402': '4/2/2019', '0403': '04/03/2019'}

for item in items:
    for key, value in d_dates.items():
        df = pd.read_excel(source / item, header=None)
        df.columns = ['A', 'B', 'C']
        df = df[df['A'].str.contains("Awesome")]
        df['Date'] = value
        file_basic = "retrofile"
        short_date = key
        xlsx = ".xlsx"
        file_name = file_basic + short_date + xlsx
        df.to_excel(file_name)
I want each file to be unique and categorized by the date. In this case, I would want to have three files, for example "retrofile0401.xlsx" that has a column that contains "04/01/2019" and only has data relevant to the original file.
What actually happens: the loop takes each individual item, creates three different files with those values, moves on to the next file, and repeats, replacing the previous iteration's output until I am left with three files that are copies of the last file. The only differences are that each file has a different date and a different name. That is almost what I want, except every file contains the data from the last input file.
If I remove the second loop, it works the way I want it but there's no way of categorizing it based on the value I made in the dictionary.
Try the following. I'm only making input filenames explicit to make clear what's going on. You can continue to use yours from the source.
input_filenames = [
    'retrofile0401_raw.xlsx',
    'retrofile0402_raw.xlsx',
    'retrofile0403_raw.xlsx',
]
date_dict = {
    '0401': '04/1/2019',
    '0402': '4/2/2019',
    '0403': '04/03/2019'}

for filename in input_filenames:
    date_key = filename[9:13]
    df = pd.read_excel(filename, header=None)
    df.columns = ['A', 'B', 'C']
    df = df[df['A'].str.contains("Awesome")]
    df['Date'] = date_dict[date_key]
    df.to_excel('retrofile{date_key}.xlsx'.format(date_key=date_key))
filename[9:13] takes characters #9-12 from the filename. Those are the ones that correspond to your date codes.
I have a csv file that I need to change the date value in each row. The date to be changed appears in the exact same column in each row of the csv.
import csv

firstfile = open('example.csv', "r")
firstReader = csv.reader(firstfile, delimiter='|')
firstData = list(firstReader)
DateToChange = firstData[1][25]
ChangedDate = '2018-09-30'

for row in firstReader:
    for column in row:
        print(column)
        if column == DateToChange:
            # Change the date
            outputFile = open("output.csv", "w")
            outputFile.writelines(firstfile)
            outputFile.close()
I am trying to grab and store a date already in the CSV, change it using a for loop, and then output the original file with the changed dates. However, the code above doesn't seem to do anything at all. I am new to Python, so I might not be understanding how to use a for loop correctly.
Any help at all is greatly appreciated!
When you call list(firstReader), you read all of the CSV data in to the firstData list. When you then, later, call for row in firstReader:, the firstReader is already exhausted, so nothing will be looped. Instead, try changing it to for row in firstData:.
Also, when writing to the file, you are trying to write firstfile (the input file handle) into the output, rather than the altered row. I'll leave you to figure out how to update the date in the row, but after that you'll need to give the file a string to write. That string should be ', '.join(row), so outputFile.write(', '.join(row)).
Finally, you should open your output file once, not each time in the loop. Move the open call to above your loop, and the close call to after your loop. Then when you have a moment, search google for 'python context manager open file' for a better way to manage the open file.
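Putting those three fixes together, a sketch of the corrected flow (with an in-memory, '|'-delimited stand-in for example.csv, and the date in column 2 rather than column 25):

```python
import csv
import io

# Stand-in for example.csv: '|'-delimited, with the date in column 2.
source = io.StringIO("id|name|date\n1|foo|2018-01-15\n2|bar|2018-01-15\n")

reader = csv.reader(source, delimiter='|')
rows = list(reader)          # read the data once; the reader is now exhausted
date_to_change = rows[1][2]  # the date sitting in the first data row
changed_date = '2018-09-30'

# Loop over the stored rows, not the exhausted reader.
for row in rows[1:]:
    if row[2] == date_to_change:
        row[2] = changed_date

# Open the output once, outside the loop, and write the altered rows.
output = io.StringIO()
writer = csv.writer(output, delimiter='|', lineterminator='\n')
writer.writerows(rows)
print(output.getvalue())
```

In the real script, replace the StringIO objects with open('example.csv') and open('output.csv', 'w') context managers, and index 2 with 25.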
You could use pandas and NumPy. Here I create a DataFrame from scratch, but you could load it directly from a .csv:
import pandas as pd
import numpy as np
date_df = pd.DataFrame(
{'col1' : ['12', '14', '14', '3412', '2'],
'col2' : ['2018-09-30', '2018-09-14', '2018-09-01', '2018-09-30', '2018-12-01']
})
date_to_change = '2018-09-30'
replacement_date = '2018-10-01'
date_df['col2'] = np.where(date_df['col2'] == date_to_change, replacement_date, date_df['col2'])
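The same replacement also works without NumPy, via Series.replace, when the match is an exact string. A sketch on the same toy frame:

```python
import pandas as pd

date_df = pd.DataFrame(
    {'col1': ['12', '14', '14', '3412', '2'],
     'col2': ['2018-09-30', '2018-09-14', '2018-09-01', '2018-09-30', '2018-12-01']
    })

# Series.replace swaps exact matches, standing in for the np.where construction.
date_df['col2'] = date_df['col2'].replace('2018-09-30', '2018-10-01')
print(date_df['col2'].tolist())
```

np.where remains the more general tool if the condition is anything other than exact equality (e.g. a date range or a regex).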