My code goes as follows:
import csv

with open('Remarks_Drug.csv', newline='', encoding='utf-8') as myFile:
    reader = csv.reader(myFile)
    for row in reader:
        product = row[0].lower()
        filename = row[1]
        product_patterns = ', '.join([i.split("+")[0].strip() for i in product.split(",")])
        print(product_patterns, filename)
which outputs as below (where film-coated tablet should be one column and the filename should be another column):
film-coated tablet RECD outcome AUBAGIO IAIN-21 AoR.txt
solution for injection 093 Acceptance NO Safety profil.txt
I want to output this to a csv file with one column as product_patterns and another as filename. I wrote the code below, but only the last row gets appended. Can anyone please help me with the looping here? The code I wrote is:
with open('drug_output.csv', 'a') as csvfile:
    fieldnames = ['product_patterns', 'filename']
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writerow({'product_patterns': product_patterns, 'filename': filename})
Depending on the environment available to you, it might be more practical to use a more dedicated package to solve your problem. In particular, the pandas package seems useful in your case.
Then you can load the csv using:
import pandas as pd
df = pd.read_csv(file_path)
After doing the necessary manipulations, you can save it again with
df.to_csv(file_path)
This will spare you a lot of the issues that typically occur when parsing line by line, and it should also improve performance a bit. Pandas is a pretty good package to learn anyway if you need to do some data manipulation.
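For instance, a rough sketch applied to the question above (assuming Remarks_Drug.csv has no header row, with the product string in the first column and the filename in the second):

import pandas as pd

# Assuming Remarks_Drug.csv has no header: column 0 is the product string,
# column 1 is the filename.
df = pd.read_csv('Remarks_Drug.csv', header=None, names=['product', 'filename'])

# Same transformation as the loop above, applied to the whole column at once.
df['product_patterns'] = df['product'].str.lower().apply(
    lambda p: ', '.join(i.split('+')[0].strip() for i in p.split(',')))

# index=False stops pandas from writing the row index as an extra column,
# and every row is written, not just the last one.
df[['product_patterns', 'filename']].to_csv('drug_output.csv', index=False)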
I have a data file saved in .txt format which has a header row at the top and is pipe delimited. I am working in Databricks and need to create a Spark dataframe of this data, with all columns read in as StringType(), the headers defined by the first row, and the columns separated based on the pipe delimiter.
When importing .csv files I am able to set the delimiter and header options. However, I am not able to get the .txt files to import in the same way.
Example Data (completely made up)... for ease, please imagine it is just called datafile.txt:
URN|Name|Supported
12233345757777701|Tori|Yes
32313185648456414|Dave|No
46852554443544854|Steph|No
I would really appreciate a hand in getting this imported into a Spark dataframe so that I can crack on with other parts of the analysis. Thank you!
Any delimiter-separated file is a good candidate for csv reading methods; the 'c' of csv is mostly by convention. Thus nothing stops us from reading this:
col1|col2|col3
0|1|2
1|3|8
Like this (in pure python):
import csv
from pathlib import Path

with Path("pipefile.txt").open() as f:
    reader = csv.DictReader(f, delimiter="|")
    data = list(reader)
print(data)
Since whatever custom reader your libraries use probably relies on csv.reader under the hood, you simply need to figure out how to pass the right separator to it.
#blackbishop notes in a comment that
spark.read.csv("datafile.txt", header=True, sep="|")
would be the appropriate spark call.
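Put together, a minimal sketch of that read (assuming a Databricks notebook, where a spark session already exists; getOrCreate() simply reuses it, and builds one elsewhere):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# header=True takes the column names from the first row; with inferSchema
# left at its default, every column is read as StringType.
df = spark.read.csv("datafile.txt", header=True, sep="|")
df.printSchema()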
I am stuck trying to build a database using a CSV file. My input is a set of symbols (stock market tickers), and for each symbol I am able to generate a link to the company's website.
I would like to save that database to a CSV file named BiotechDatabase.csv
Every time I input a new symbol in Python, I would like to verify the first column of the CSV file to see if the symbol exists. If it does, I need to overwrite the Web column to make sure it is updated.
If the symbol does not exist, a row will need to be appended containing the symbol and the Web.
Since I need to expand the columns to add more information in the future, I need to use DictWriter as some columns might have missing information and need to be skipped.
I have been able to update information for a symbol if the symbol is in the database using the code below:
from csv import DictWriter
from tempfile import NamedTemporaryFile
import shutil
import csv

# Replace the symbol below with any stock symbol I want the website for
symbol = 'PAVM'
# Running web(symbol) generates the website I need for PAVM, which is
# http://www.pavmed.com, converted to a string below
web(symbol)

filename = 'BiotechDatabase.csv'
tempfile = NamedTemporaryFile('w', newline='', delete=False)
fields = ['symbol', 'Web']

# I was able to replace any symbol row using the code below:
with open(filename, 'r', newline='') as csvfile, tempfile:
    reader = csv.DictReader(csvfile, fieldnames=fields)
    writer = csv.DictWriter(tempfile, fieldnames=fields)
    for row in reader:
        if row['symbol'] == symbol:
            print('adding row', row['symbol'])
            row['symbol'], row['Web'] = symbol, str(web(symbol))
        row = {'symbol': row['symbol'], 'Web': row['Web']}
        writer.writerow(row)
shutil.move(tempfile.name, filename)
However, if the symbol I entered in Python doesn't exist in the CSV file, how can I append a new row at the bottom of the list, without messing with the header, and while still using a temporary file?
Since the tempfile I defined above uses mode 'w', do I need to create another temporary file that allows mode 'a' in order to append rows?
You can simplify your code dramatically using the Pandas python library.
Note: I do not know what the raw data looks like, so you might need to do some tweaking to get it to work; please feel free to ask me more about the solution in the comments.
import pandas as pd

symbol = 'PAVM'
web(symbol)

filename = 'BiotechDatabase.csv'

# Read the csv, using the symbol column as the index
# (assuming the file has a 'symbol,Web' header row)
df = pd.read_csv(filename, index_col='symbol')
# .loc updates the Web cell if the symbol already exists,
# and appends a new row for it otherwise
df.loc[symbol, 'Web'] = str(web(symbol))
# Saving back to filename overwrites it - be careful!
df.to_csv(filename)
There might be some faster ways to do that but this one is very elegant.
I have a bunch of comma-delimited files that I am trying to change to pipe-delimited files.
I am following the example provided here: Python CSV change separator
Here is my code:
print("setting new delimiter...")
reader = list(csv.reader(open(localfile, "rU"), delimiter=','))
writer = csv.writer(open(localfile, 'w'), delimiter='|', lineterminator='\n')
writer.writerows(row for row in reader)
I can't tell if memory usage is cumulative or due to a specific file size, but either way, on my third file I get a memory error. Since the third file is nearly the same size as the first two, it appears cumulative. Is there a better way of doing this? Thanks!
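For what it's worth, here is a sketch that avoids building the full list in memory by streaming rows straight from reader to writer through a temporary file (comma_to_pipe is just a hypothetical helper name; localfile as above):

import csv
import shutil
from tempfile import NamedTemporaryFile

def comma_to_pipe(localfile):
    # Stream one row at a time so memory use stays flat regardless of
    # how large the file is.
    with open(localfile, newline='') as src, \
            NamedTemporaryFile('w', newline='', delete=False) as tmp:
        reader = csv.reader(src, delimiter=',')
        writer = csv.writer(tmp, delimiter='|', lineterminator='\n')
        writer.writerows(reader)
    # Replace the original file with the converted copy.
    shutil.move(tmp.name, localfile)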
I have a python program that runs a certain experiment on an artificial deck of cards. I want to save the average number of trials it took to get a certain pattern to a csv file, but I am having trouble getting the program to write to a specific csv file whose directory I've specified. The relevant code is shown below:
row = [str(n), str(limit), str(np.mean(trial_time_list)), str(max(trial_time_list)), str(np.std(trial_time_list))]

with open("D:\Documents\python projects\\results.csv", "a") as csvFile:
    writer = csv.writer(csvFile, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
    writer.writerow(row)
csvFile.close()
The code runs without any errors, but when I check the csv file, no new data is written to it. Is it because I'm not running IDLE with admin permissions?
I was unable to reproduce your issue so I suspect there's something wrong with your environment. I suggest enforcing your assumptions:
import csv
import os

import numpy as np

# something about n, limit, and trial_time_list
row = [str(x) for x in (n, limit, np.mean(trial_time_list), max(trial_time_list), np.std(trial_time_list))]

path = r'D:\Documents\python projects'
os.makedirs(path, exist_ok=True)
file = os.path.join(path, 'results.csv')

with open(file, 'a', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(row)
I have an RSS feed I want to grab data from, manipulate, and then save to a CSV file. The RSS feed refresh rate is a big window, 1 minute to several hours, and it only holds 100 items at a time. So to capture everything, I'm looking to have my script run every minute. The problem with this is that if the script runs before the feed updates, I will be grabbing past data, which leads to adding duplicate data to the CSV.
I tried using the examples mentioned here, but it kept erroring out.
Data Flow:
RSS Feed --> Python Script --> CSV file
Sample data and code below:
Sample Data from CSV:
gandcrab,acad5fc7ebe8c6979d98cb8537e3a247,18bb2c3b82649314dfd45a379058869804954276,bf0ac94c6ae6f1ecfcccc049ae2373bfc659b2efb2e48e824e2e78fb43b6ebef,54,C
Sample Data from list:
zeus,186e84c5fd7da7331a62f1f13b1f4608,3c34aee767859fd75eb0c8c701716cbfd5655437,05c8e4f01ec8d4e6f4595db93bbcc0f85386c9f1b82b5833d983c9092640573a,49,C
Code for comparing:
if trends_f.is_file():
    with open('trendsv3.csv', 'r+', newline='') as csv_file:
        h_reader = csv.reader(csv_file)
        next(h_reader)  # skip reading header of csv
        # should i load the csv into a list then compare it with diff() against the other list?
        # or is there an easier, faster, more efficient way?
I would recommend downloading everything into a CSV and then deduplicating in batches (e.g. nightly), generating a new "clean" CSV for whatever you're working on.
To dedup, load the data with the pandas library, then use the drop_duplicates function on the data.
http://pandas.pydata.org/pandas-docs/version/0.17/generated/pandas.DataFrame.drop_duplicates.html
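A rough sketch of that batch dedup, assuming the raw dump is the trendsv3.csv from the question and the clean copy goes to a new file:

import pandas as pd

# Drop exact duplicate rows and write a clean copy for downstream work.
df = pd.read_csv('trendsv3.csv')
df = df.drop_duplicates()
df.to_csv('trendsv3_clean.csv', index=False)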
Adding the ID from the feed seemed to make things the easiest to check against. Thanks to #blhsing for mentioning that. I ended up reading the IDs from the csv into a list and checking the new data's IDs against that. There may be a faster, more efficient way, but this works for me.
Code to check csv before saving to it:
if trends_f.is_file():
    with open('trendsv3.csv', 'r') as csv_file:
        h_reader = csv.reader(csv_file, delimiter=',')
        next(h_reader, None)
        for row in h_reader:
            csv_list.append(row[6])
    with open('trendsv3.csv', 'a', newline='') as csv_file:
        h_writer = csv.writer(csv_file)
        for entry in data_list:
            if entry[6].strip() not in csv_list:
                print(entry[6], ' is not in the list, saving ', entry[6], ' to the list')
                h_writer.writerow(entry)
            else:
                print(entry[6], ' is in the list')
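One likely speedup, for what it's worth: collecting the IDs in a set instead of a list makes each membership check constant time. A sketch under the same assumptions (data_list from the feed, the ID in column 6):

import csv

seen_ids = set()
with open('trendsv3.csv', 'r', newline='') as csv_file:
    h_reader = csv.reader(csv_file)
    next(h_reader, None)  # skip the header
    for row in h_reader:
        seen_ids.add(row[6].strip())

with open('trendsv3.csv', 'a', newline='') as csv_file:
    h_writer = csv.writer(csv_file)
    for entry in data_list:
        # Set lookup is O(1), so this scales better as the csv grows.
        if entry[6].strip() not in seen_ids:
            h_writer.writerow(entry)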