I have an RSS feed I want to grab data from, manipulate, and then save to a CSV file. The feed's refresh rate varies widely, from 1 minute to several hours, and it only holds 100 items at a time. So to capture everything, I'm looking to have my script run every minute. The problem is that if the script runs before the feed updates, I will be grabbing past data, which leads to adding duplicate data to the CSV.
I tried using the examples mentioned here, but they kept erroring out.
Data Flow:
RSS Feed --> Python Script --> CSV file
Sample data and code below:
Sample Data from CSV:
gandcrab,acad5fc7ebe8c6979d98cb8537e3a247,18bb2c3b82649314dfd45a379058869804954276,bf0ac94c6ae6f1ecfcccc049ae2373bfc659b2efb2e48e824e2e78fb43b6ebef,54,C
Sample Data from list:
zeus,186e84c5fd7da7331a62f1f13b1f4608,3c34aee767859fd75eb0c8c701716cbfd5655437,05c8e4f01ec8d4e6f4595db93bbcc0f85386c9f1b82b5833d983c9092640573a,49,C
Code for comparing:
if trends_f.is_file():
    with open('trendsv3.csv', 'r+', newline='') as csv_file:
        h_reader = csv.reader(csv_file)
        next(h_reader)  # skip reading header of csv
        # Should I load the csv into a list then compare it with diff() against the other list?
        # Or is there an easier, faster, more efficient way?
I would recommend downloading everything into a CSV and then deduplicating in batches (e.g. nightly), generating a new "clean" CSV for whatever you're working on.
To dedup, load the data with the pandas library and then use drop_duplicates on the DataFrame.
http://pandas.pydata.org/pandas-docs/version/0.17/generated/pandas.DataFrame.drop_duplicates.html
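A minimal sketch of that nightly dedup pass, assuming the accumulated file is trendsv3.csv as in the question and that one batch fits in memory (the commented-out column name is a placeholder):

import pandas as pd

# read the accumulated feed data (the file already has a header row)
df = pd.read_csv('trendsv3.csv')

# drop rows that are exact duplicates, keeping the first occurrence
clean = df.drop_duplicates()

# or, if one column uniquely identifies an item (e.g. the feed's ID column),
# dedup on that column alone:
# clean = df.drop_duplicates(subset=['id'])

clean.to_csv('trendsv3_clean.csv', index=False)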
Adding the ID from the feed seemed to make things the easiest to check against. Thanks to #blhsing for mentioning that. I ended up reading the IDs from the CSV into a list and checking the new data's IDs against that. There may be a faster, more efficient way, but this works for me.
Code to check csv before saving to it:
if trends_f.is_file():
    with open('trendsv3.csv', 'r') as csv_file:
        h_reader = csv.reader(csv_file, delimiter=',')
        next(h_reader, None)
        for row in h_reader:
            csv_list.append(row[6])
    with open('trendsv3.csv', 'a', newline='') as csv_file:
        h_writer = csv.writer(csv_file)
        for entry in data_list:
            if entry[6].strip() not in csv_list:
                print(entry[6], ' is not in the list, saving ', entry[6], ' to the list')
                h_writer.writerow(entry)
            else:
                print(entry[6], ' is in the list')
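One small tweak worth considering (a sketch, not required for correctness): membership tests against a list are linear scans, so if the CSV grows large, collecting the seen IDs in a set keeps each check constant-time:

seen_ids = set()
with open('trendsv3.csv', 'r', newline='') as csv_file:
    h_reader = csv.reader(csv_file, delimiter=',')
    next(h_reader, None)  # skip the header row
    for row in h_reader:
        seen_ids.add(row[6].strip())

new_entries = [entry for entry in data_list if entry[6].strip() not in seen_ids]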
I have a very large dataframe with many millions of rows, and it is normally not feasible to load the entire file into memory to work with. Recently some bad data got in, and I need to remove it from the database. So far what I've done is:
import pandas as pd

file = '/path to database'
rf = pd.read_csv(f'{file}.csv', chunksize=3000000, index_col=False)
res = pd.concat([chunk[chunk['timestamp'] < 1.6636434764745E+018] for chunk in rf])
res.to_csv(f'{file}.csv', index=False)
Basically it opens the database and saves the part I want, overwriting the original file.
However, the data has grown so large that this no longer fits in memory. Is there a better way to truncate part of the dataframe based on a simple query?
The truncated part would usually be very small compared to the rest, say 100k rows, and always at the end.
I would avoid using pandas in this case and just edit the csv file directly. For example:
import csv

with open("test_big.csv", "r", newline="") as f_in, open("test_out.csv", "w", newline="") as f_out:
    reader = csv.reader(f_in)
    writer = csv.writer(f_out)
    for row in reader:
        if int(row[-1]) > 9900:  # your condition here
            writer.writerow(row)
For context, test_big.csv looks like this
1,2,3,4,5891
1,2,3,4,7286
1,2,3,4,7917
1,2,3,4,937
...
And is 400,000 records long. Execution took 0.2s.
Edit: Ran with 40,000,000 records and took 15s.
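For the original question, the same streaming pattern works with the timestamp condition swapped in. A sketch, assuming the file has a header row with a column literally named 'timestamp' (as the pandas code above implies); the file names are placeholders:

import csv

THRESHOLD = 1.6636434764745E+018

with open('database.csv', 'r', newline='') as f_in, open('database_filtered.csv', 'w', newline='') as f_out:
    reader = csv.reader(f_in)
    writer = csv.writer(f_out)
    header = next(reader)
    writer.writerow(header)               # keep the header row
    ts_index = header.index('timestamp')  # position of the timestamp column
    for row in reader:
        if float(row[ts_index]) < THRESHOLD:
            writer.writerow(row)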
My company recently purchased a machine and I'm trying to find a way to store its data in our database, but first I need to clean up the CSV or select certain cells to write into a new CSV. I'm currently using Python 3.9.x.
I need to extract the following items from this file: serial number (highlighted yellow), start time, end time, pass steps, fail steps, and test result.
If I can manage to select one cell, I will try to do the rest on my own, but I'm currently stuck trying to select the serial number and then write it into a new CSV.
DATA FROM CSV
import csv

# read CSV
csvFile = r"C:\Users\Hunter\Documents\Programing\Python\Measu Dev\11.csv"
f = open(csvFile, 'rt')
myReader = csv.reader(f)

Headers = ['SerialNo', 'PartNo', 'Startime', 'Endtime', 'TabPassed', 'TabFailed', 'TestResult']
SerialNo = []

with open('Processed.csv', 'w', encoding='utf-8', newline='') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(Headers)
    writer.writerow(SerialNo)
RESULT
This is my ending result. I want to be able to store the serial number under its header 'SerialNo', but nothing seems to work on my end. I'm pretty new to this; any help will be appreciated.
Thank you, guys.
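The machine's CSV layout is only shown in the screenshots, so the row and column positions below are placeholders, but a minimal sketch of selecting one cell and writing it under its header could look like this:

import csv

source = r"C:\Users\Hunter\Documents\Programing\Python\Measu Dev\11.csv"
Headers = ['SerialNo', 'PartNo', 'Startime', 'Endtime', 'TabPassed', 'TabFailed', 'TestResult']

# read the whole file into a list of rows so single cells can be picked by index
with open(source, 'rt', newline='') as f:
    rows = list(csv.reader(f))

# ROW and COL are placeholders: set them to wherever the serial number
# actually sits in the machine's file (counting from zero)
ROW, COL = 2, 1
serial_no = rows[ROW][COL]

with open('Processed.csv', 'w', encoding='utf-8', newline='') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(Headers)
    writer.writerow([serial_no, '', '', '', '', '', ''])  # other fields left blank for now

Once one cell comes out correctly, the same indexing approach can fill in the remaining columns.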
I have a directory containing multiple CSVs that I would like to read into a single dictionary. The dictionary would use the original file names as keys and the contents of the CSVs as values. I don't want to use pandas because I am new to Python and want to understand these tasks first before pulling out the big guns. I would like to use DictReader for the task. Here is the code I have so far. It works fine for one file at a time. Help is greatly appreciated.
import csv

def read_lines():
    data = []
    with open('vari_late_low_scores.csv', newline='') as stream:
        reader = csv.reader(stream, delimiter=',', skipinitialspace=True)
        for row in reader:
            data.append(row)
    return data
Thank you!
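A minimal sketch of the multi-file version, keyed by the original file names, assuming the CSVs sit in one folder and each has a header row (which DictReader uses for the keys of each row dict):

import csv
import glob
import os

def read_all_csvs(folder='.'):
    data = {}
    for path in glob.glob(os.path.join(folder, '*.csv')):
        with open(path, newline='') as stream:
            reader = csv.DictReader(stream, skipinitialspace=True)
            # key: the file name, value: the file's rows as a list of dicts
            data[os.path.basename(path)] = list(reader)
    return data

all_data = read_all_csvs()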
My code goes as follows:
import csv

with open('Remarks_Drug.csv', newline='', encoding='utf-8') as myFile:
    reader = csv.reader(myFile)
    for row in reader:
        product = row[0].lower()
        filename = row[1]
        product_patterns = ', '.join([i.split("+")[0].strip() for i in product.split(",")])
        print(product_patterns, filename)
which outputs as below (where the film-coated tablet text should be one column and the filename should be another column):
film-coated tablet RECD outcome AUBAGIO IAIN-21 AoR.txt
solution for injection 093 Acceptance NO Safety profil.txt
I want to output this to a csv file with one column as product_patterns and another as filename. I wrote the code below, but only the last row gets appended. Can anyone please help me with the looping here? The code I wrote is:
with open('drug_output.csv', 'a') as csvfile:
    fieldnames = ['product_patterns', 'filename']
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writerow({'product_patterns': product_patterns, 'filename': filename})
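Only the last row ends up in drug_output.csv because writer.writerow sits outside the reading loop, so it runs once with whatever values the final iteration left behind. A sketch of doing the writing inside the same loop, reusing the question's own files and transform:

import csv

with open('Remarks_Drug.csv', newline='', encoding='utf-8') as myFile, \
        open('drug_output.csv', 'w', newline='') as csvfile:
    reader = csv.reader(myFile)
    fieldnames = ['product_patterns', 'filename']
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writeheader()
    for row in reader:
        product = row[0].lower()
        filename = row[1]
        product_patterns = ', '.join([i.split("+")[0].strip() for i in product.split(",")])
        writer.writerow({'product_patterns': product_patterns, 'filename': filename})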
Depending on the environment you can use, it might be more practical to use a more dedicated package to solve your problem.
In particular, the pandas package seems useful in your case.
Then you can load the csv using:
import pandas as pd

df = pd.read_csv(file_path)
After doing the necessary manipulations, you can save it again with
df.to_csv(file_path)
This will save you a lot of issues that typically occur when parsing line by line, and it should also increase performance a bit. Pandas is a pretty good package to learn anyway if you need to do some data manipulation.
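Applied to this question, a sketch could look like the following (the input column names are assumptions, since Remarks_Drug.csv is read without a header in the code above):

import pandas as pd

df = pd.read_csv('Remarks_Drug.csv', header=None, names=['product', 'filename'], encoding='utf-8')

# same transform as the question's loop: keep the part before '+' in each comma-separated entry
df['product_patterns'] = df['product'].str.lower().apply(
    lambda p: ', '.join(part.split('+')[0].strip() for part in p.split(',')))

df[['product_patterns', 'filename']].to_csv('drug_output.csv', index=False)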
In a loop through many web pages, I'm web scraping to get some information.
I'm thinking of building a CSV, something like this:
fieldnames = ['id', 'variable1', 'variable2']
f = open('file.csv', 'w', newline='')
my_writer = csv.DictWriter(f, fieldnames)
my_writer.writeheader()
for webpage in webpages:
    # ... something where I get the information and put it in a dictionary mydict,
    # e.g. mydict = {'id': 1, 'variable1': 200, 'variable2': 300}
    my_writer.writerow(mydict)
f.close()
The problem is that there may be a different number of variables on each webpage, so I would need to modify this.
The other alternative I'm thinking of is to create a list of dictionaries and, at the end, convert it to a dataframe and then to a csv:
finalist = []
for webpage in webpages:
    # ... something where I get the information and put it in a dictionary mydict,
    # e.g. mydict = {'id': 1, 'variable1': 200, 'variable2': 300}
    finalist.append(mydict)

df = pd.DataFrame(finalist)
df.to_csv('file.csv', index=False)
It is a very long loop, so there will be a lot of rows. Which of the two would be more efficient, or is there another way that is more efficient than these two? Also, should I keep a JSON file, a CSV, or some other format to store the data at the end so it can be used later in R or another program?
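If the set of variables differs from page to page, one way to handle it with plain csv is to collect the rows first, build the union of keys, and let DictWriter fill in the gaps. A sketch, where scrape() stands in for the "something where I get the information" step in the pseudocode above:

import csv

rows = []
fieldnames = []

for webpage in webpages:
    mydict = scrape(webpage)       # placeholder for the scraping step
    rows.append(mydict)
    for key in mydict:
        if key not in fieldnames:  # union of all keys, in first-seen order
            fieldnames.append(key)

with open('file.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=fieldnames, restval='')  # missing keys become ''
    writer.writeheader()
    writer.writerows(rows)

The list-of-dicts route handles the same situation automatically, since pd.DataFrame(finalist) inserts NaN for any key a page did not provide; either way, writing once at the end keeps the file handling simple, and a CSV is straightforward to read back into R later.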