I'm web scraping many web pages in a loop to collect some information.
I'm thinking of building a CSV as I go, something like this:
import csv

fieldnames = ['id', 'variable1', 'variable2']
f = open('file.csv', 'w', newline='')
my_writer = csv.DictWriter(f, fieldnames)
my_writer.writeheader()
for webpage in webpages:
    # something where I get the information and put it in a dictionary mydict
    # Example:
    mydict = {'id': 1, 'variable1': 200, 'variable2': 300}
    my_writer.writerow(mydict)
f.close()
The problem is that each web page may have a different number of variables, so this approach would need some modification.
The other alternative I'm thinking of is to build a list of dictionaries and, at the end, convert it to a DataFrame and then to a CSV:
import pandas as pd

finalist = []
for webpage in webpages:
    # something where I get the information and put it in a dictionary mydict
    # Example:
    mydict = {'id': 1, 'variable1': 200, 'variable2': 300}
    finalist.append(mydict)
df = pd.DataFrame(finalist)
df.to_csv('file.csv', index=False)
It is a very long loop, so there will be a lot of rows. Which of the two would be more efficient? Or is there another way that is more efficient than both? Also, should I store the data at the end as a JSON file, a CSV, or some other format so that it can be used later in R or another program?
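One possible way to cope with pages that yield different variables, as a rough sketch only: collect the dictionaries, build the header from the union of all keys, and let csv.DictWriter fill the gaps through its restval parameter. The helper name write_rows_with_varying_keys and the sample records are made up for illustration; an in-memory buffer stands in for the real file.

```python
import csv
import io

def write_rows_with_varying_keys(rows, stream):
    """Write dicts that may have different keys as one CSV table."""
    # Build the header as the union of all keys, in first-seen order.
    fieldnames = []
    for row in rows:
        for key in row:
            if key not in fieldnames:
                fieldnames.append(key)
    # restval is written for any key a given row does not have.
    writer = csv.DictWriter(stream, fieldnames=fieldnames, restval="")
    writer.writeheader()
    writer.writerows(rows)
    return fieldnames

# Hypothetical scraped records: the second page has a different variable.
scraped = [
    {"id": 1, "variable1": 200, "variable2": 300},
    {"id": 2, "variable1": 150, "variable3": 75},
]
buffer = io.StringIO()  # stands in for open('file.csv', 'w', newline='')
columns = write_rows_with_varying_keys(scraped, buffer)
csv_text = buffer.getvalue()
```

The trade-off is that the header is only known once every page has been seen, so the rows have to be held in memory (or written to an intermediate file) before the CSV is produced.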
My company recently purchased a machine, and I'm trying to find a way to store its data in our database. First I need to clean up the CSV, or select certain cells and write them into a new CSV. I'm currently using Python 3.9XX.
I need to extract the following items from this file: serial number (highlighted yellow), start time, end time, pass steps, fail steps, and test results.
If I can manage to select one cell, I will try to do the rest on my own, but I'm currently stuck trying to select the serial number and write it into a new CSV.
DATA FROM CSV
import csv

# read CSV
csvFile = r"C:\Users\Hunter\Documents\Programing\Python\Measu Dev\11.csv"
f = open(csvFile, 'rt')
myReader = csv.reader(f)

Headers = ['SerialNo', 'PartNo', 'Startime', 'Endtime', 'TabPassed', 'TabFailed', 'TestResult']
SerialNo = []  # still empty: nothing is read from myReader yet

with open('Processed.csv', 'w', encoding='utf-8', newline='') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(Headers)
    writer.writerow(SerialNo)
RESULT
This is my end result. I want to store the serial number under its header 'SerialNo', but nothing I try seems to work. I'm pretty new to this, so any help will be appreciated.
Thank you guys.
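A minimal sketch of picking one cell out of the parsed rows and writing it under a header. The layout of the machine's file is not shown here, so this assumes (purely for illustration) that the serial number sits in the cell to the right of a label cell reading "Serial No"; the helper find_value_after_label and the sample text are hypothetical.

```python
import csv
import io

def find_value_after_label(rows, label):
    """Return the cell just to the right of the first cell equal to label."""
    for row in rows:
        for i, cell in enumerate(row[:-1]):
            if cell.strip() == label:
                return row[i + 1].strip()
    return None  # label not found

# Made-up sample standing in for the machine's CSV export.
sample = "Report,,\r\nSerial No,ABC123,\r\nStart time,08:00,\r\n"
rows = list(csv.reader(io.StringIO(sample)))
serial = find_value_after_label(rows, "Serial No")

out = io.StringIO()  # stands in for open('Processed.csv', 'w', newline='')
writer = csv.writer(out)
writer.writerow(["SerialNo"])
writer.writerow([serial])  # writerow expects a list of cells, one per column
```

Note that writerow takes a sequence of cells: passing the bare string instead of [serial] would write one character per column.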
I have a directory containing multiple CSVs that I would like to read into a single dictionary. The dictionary would use the original file names as keys and the contents of the CSVs as values. I don't want to use pandas because I am new to Python and want to understand these tasks first before pulling out the big guns. I would like to use DictReader for the task. Here is the code I have so far. It works fine for one file at a time. Help is greatly appreciated.
import csv

def read_lines():
    data = []
    with open('vari_late_low_scores.csv', newline='') as stream:
        reader = csv.reader(stream, delimiter=',', skipinitialspace=True)
        for row in reader:
            data.append(row)
    return data
Thank you!
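One way this could be extended to a whole directory, sketched under the assumption that every *.csv file in it has a header row. The function name read_csv_directory is made up, and the demo creates two throwaway files in a temporary directory rather than touching real data.

```python
import csv
import tempfile
from pathlib import Path

def read_csv_directory(directory):
    """Map each CSV file name in directory to a list of row dicts."""
    tables = {}
    for path in sorted(Path(directory).glob("*.csv")):
        with path.open(newline="") as stream:
            reader = csv.DictReader(stream, skipinitialspace=True)
            # Each row becomes a dict keyed by the file's header row.
            tables[path.name] = list(reader)
    return tables

# Demo with two throwaway files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "a.csv").write_text("name,score\nann,1\n")
    Path(tmp, "b.csv").write_text("name,score\nbob,2\n")
    tables = read_csv_directory(tmp)
```

Using path.stem instead of path.name as the key would drop the ".csv" extension, if that is preferred.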
I have an RSS feed I want to grab data from, manipulate, and then save to a CSV file. The feed's refresh interval varies widely, from 1 minute to several hours, and it only holds 100 items at a time. So, to capture everything, I'm looking to have my script run every minute. The problem is that if the script runs before the feed updates, I will grab data I already have, which leads to duplicate rows in the CSV.
I tried looking at using examples mentioned here but it kept erroring out.
Data Flow:
RSS Feed --> Python Script --> CSV file
Sample data and code below:
Sample Data from CSV:
gandcrab,acad5fc7ebe8c6979d98cb8537e3a247,18bb2c3b82649314dfd45a379058869804954276,bf0ac94c6ae6f1ecfcccc049ae2373bfc659b2efb2e48e824e2e78fb43b6ebef,54,C
Sample Data from list:
zeus,186e84c5fd7da7331a62f1f13b1f4608,3c34aee767859fd75eb0c8c701716cbfd5655437,05c8e4f01ec8d4e6f4595db93bbcc0f85386c9f1b82b5833d983c9092640573a,49,C
Code for comparing:
if trends_f.is_file():
    with open('trendsv3.csv', 'r+', newline='') as csv_file:
        h_reader = csv.reader(csv_file)
        next(h_reader)  # skip reading header of csv
        # should I load the csv into a list then compare it with diff() against the other list?
        # or is there an easier, faster, more efficient way?
I would recommend downloading everything into a CSV and then deduplicating in batches (e.g. nightly), generating a new "clean" CSV for whatever you're working on.
To dedup, load the data with the pandas library and then call drop_duplicates on the resulting DataFrame.
http://pandas.pydata.org/pandas-docs/version/0.17/generated/pandas.DataFrame.drop_duplicates.html
Adding the ID from the feed turned out to be the easiest thing to check against. Thanks to #blhsing for mentioning that. I ended up reading the IDs from the CSV into a list and checking the new data's IDs against it. There may be a faster, more efficient way, but this works for me.
Code to check csv before saving to it:
csv_list = []  # IDs already saved in the CSV
if trends_f.is_file():
    with open('trendsv3.csv', 'r', newline='') as csv_file:
        h_reader = csv.reader(csv_file, delimiter=',')
        next(h_reader, None)  # skip the header
        for row in h_reader:
            csv_list.append(row[6])
    # the with block closes the file, so no explicit close() is needed
    with open('trendsv3.csv', 'a', newline='') as csv_file:
        h_writer = csv.writer(csv_file)
        for entry in data_list:
            if entry[6].strip() not in csv_list:
                print(entry[6], ' is not in the list, saving ', entry[6], ' to the list')
                h_writer.writerow(entry)
            else:
                print(entry[6], ' is in the list')
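On the "faster, more efficient way" point, one option is to keep the known IDs in a set, since membership tests on a set do not slow down as the file grows the way a list scan does. The following is only a sketch: select_new_rows and the sample data are made up, the ID is assumed to be in column index 6 as in the code above, and the existing CSV is passed in as text to keep the example self-contained.

```python
import csv
import io

ID_COLUMN = 6  # same column index the code above uses for the feed ID

def select_new_rows(existing_text, new_rows, id_column=ID_COLUMN):
    """Return the rows whose ID is not yet in existing_text (header included)."""
    reader = csv.reader(io.StringIO(existing_text))
    next(reader, None)  # skip the header
    seen = {row[id_column].strip() for row in reader if len(row) > id_column}
    fresh = []
    for entry in new_rows:
        entry_id = entry[id_column].strip()
        if entry_id not in seen:
            seen.add(entry_id)  # also guards against repeats within one batch
            fresh.append(entry)
    return fresh

# Made-up data: one ID is already stored, one new ID arrives twice.
existing = "name,md5,sha1,sha256,score,grade,id\r\nx,1,2,3,54,C,id-1\r\n"
incoming = [
    ["y", "1", "2", "3", "49", "C", "id-1"],
    ["z", "1", "2", "3", "49", "C", "id-2"],
    ["z", "1", "2", "3", "49", "C", " id-2 "],
]
fresh_rows = select_new_rows(existing, incoming)
```

The rows returned can then be appended with csv.writer exactly as in the code above.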
I'm writing a dictionary into a csv file using Python's csv library and a DictWriter object like so:
with open('test.csv', 'w', newline='') as fp:
    fieldnames = [<fieldnames for csv DictWriter>]
    dict_writer = csv.DictWriter(fp, fieldnames=fieldnames)
    dict_writer.writeheader()
    dict_writer.writerow(<some dictionary here>)
The result is that if I have a long name as one of the dictionary's values, the cells look like this:
But I want it to look like this:
Is there a way to fit the cells' size to the string lengths? I imagine this is a relatively common issue, since you don't want to do it manually each time you open a CSV file.
How can I fit it beforehand?
A CSV does not carry formatting information. It only has data. How the cells are sized is up to the viewer's configuration.
If you really need formatting, you should write in the file format of the receiving application (.ods, .xls, .xlsx...).
I have a two-column CSV which I have uploaded via an HTML page to be operated on by a Python CGI script. Looking at the file on the server side, it appears to be one long string, i.e. a file called test.csv with the contents
col1, col2
x,y
has become
('upfile', 'test.csv', 'col1,col2'\t\r\nx,y')
Col1 contains the data I want to operate on (i.e. x) and col2 contains its identifier (y). Is there a better way of doing the uploading, or do I need to manually extract the fields I want? That seems potentially very error-prone.
Thanks
If you're using the cgi module in python, you should be able to do something like:
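import cgi
import csv
import io

form = cgi.FieldStorage()
thefile = form['upfile']
# thefile.file is a binary stream in Python 3; csv.reader needs text (assuming UTF-8)
text_stream = io.TextIOWrapper(thefile.file, encoding='utf-8', newline='')
reader = csv.reader(text_stream)
header = next(reader)  # list of column names
for row in reader:
    # row is a list of fields
    process_row(row)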
See, for example, cgi programming or the python cgi module docs.
Can't you use the csv module to parse this? It's certainly better than rolling your own.
Something along the lines of
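import csv
import cgi
import io

form = cgi.FieldStorage()
thefile = form['upfile']
# wrap the uploaded binary stream as text (assuming UTF-8) for csv.reader
reader = csv.reader(io.TextIOWrapper(thefile.file, encoding='utf-8', newline=''), delimiter=',')
for row in reader:
    for field in row:
        doThing()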
EDIT: Corrected my answer based on ars's answer.
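If the upload is already in hand as bytes rather than as a file object, the csv module can still do the parsing. A small sketch, assuming the content is UTF-8 and has a header row; parse_uploaded_csv and the sample bytes are made up for illustration.

```python
import csv
import io

def parse_uploaded_csv(raw_bytes, encoding="utf-8"):
    """Decode uploaded CSV bytes and return (header, rows)."""
    text = raw_bytes.decode(encoding)
    # newline="" leaves line endings untouched so csv can handle them itself.
    reader = csv.reader(io.StringIO(text, newline=""), skipinitialspace=True)
    header = next(reader, [])
    rows = [row for row in reader if row]  # skip blank lines
    return header, rows

# Made-up upload matching the two-column file from the question.
uploaded = b"col1, col2\r\nx,y\r\n"
header, rows = parse_uploaded_csv(uploaded)
```

Here skipinitialspace=True drops the space after the comma in "col1, col2", so the header comes back as clean column names.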
It looks like your file is being modified by the HTML upload. Is there anything stopping you from just FTPing in and dropping the CSV file where you need it?
Once the CSV file is in proper shape, here is a quick function that will put it into a 2D array:
import csv

def genTableFrCsv(incsv):
    table = []
    # newline='' lets the csv module handle line endings itself
    with open(incsv, newline='') as fin:
        reader = csv.reader(fin)
        for row in reader:
            table.append(row)
    return table
From here you can then operate on the whole list in memory rather than pulling bit by bit from the file as in Vitor's solution.
The easy solution is rows = [r.split('\t') for r in csv_string.split('\r\n')]. It's only error-prone if you have users on different platforms submitting data. They might submit commas or tabs, and their line breaks could be \n, \r\n, \r, or ^M. The easiest solution is to use regular expressions. Bookmark this page if you don't know regular expressions:
http://regexlib.com/CheatSheet.aspx
And here's the solution:
import re
csv_string = 'col1,col2\t\r\nx,y'  # obviously your csv opening code goes here
# one tuple per line: two fields separated by a tab or a comma
rows = re.findall(r'^([^\t,\r\n]*)[\t,]([^\t,\r\n]*)', csv_string, re.MULTILINE)
rows = rows[1:]  # remove header
rows is now a list of tuples, one per data row.