converting an uploaded csv to python list - python

I have a two-column csv which I have uploaded via an HTML page to be operated on by a python cgi script. Looking at the file on the server side, it looks to be a long string i.e for a file called test.csv with the contents.
col1, col2
x,y
has become
('upfile', 'test.csv', 'col1,col2'\t\r\nx,y')
Col1 contains the data I want to operate on (i.e. x) and col 2 contains its identifier (y). Is there a better way of doing the uploading or do I need to manually extract the fields I want - this seems potentially very error-prone
thanks

If you're using the cgi module in python, you should be able to do something like:
form = cgi.FieldStorage()
thefile = form['upfile']
reader = csv.reader(thefile.file)
header = reader.next() # list of column names
for row in reader:
# row is a list of fields
process_row(row)
See, for example, cgi programming or the python cgi module docs.

Can't you use the csv module to parse this? It certantly better than rolling your own.
Something along the lines of
import csv
import cgi
form = cgi.FieldStorage()
thefile = form['upfile']
reader = csv.reader(thefile, delimiter=',')
for row in reader:
for field in row:
doThing()
EDIT: Correcting my answer from the ars answer posted below.

Looks like your file is becoming modified by the HTML upload. Is there anything stopping you from just ftp'ing in and dropping the csv file where you need it?
Once the CSV file is more proper, here is a quick function that will put it into a 2D array:
def genTableFrCsv(incsv):
table = []
fin = open(incsv, 'rb')
reader = csv.reader(fin)
for row in reader:
table.append(row)
fin.close()
return table
From here you can then operate on the whole list in memory rather than pulling bit by bit from the file as in Vitor's solution.

The easy solution is rows = [row.split('\t') for r in csv_string.split('\r\n')]. It's only error proned if you have users from different platforms submit data. They might submit comas or tabs and their line breaks could be \n, \r\n, \r, or ^M. The easiest solution is to use regular expressions. Book mark this page if you don't know regular expressions:
http://regexlib.com/CheatSheet.aspx
And here's the solution:
import re
csv_string = 'col1,col2'\t\r\nx,y' #obviously your csv opening code goes here
rows = re.findall(r'(.*?)[\t,](.*?)',csv_string)
rows = rows[1:] # remove header
Rows is now a list of tuples for all of the rows.

Related

Combine csv files with identical columns and unescape html code

Dear all i often need to concat csv files with identical headers (i.e. but them into one big file). Usually i just use pandas, but i now need to operate in an enviroment were i am not at liberty to install any library. The csv and the html libs do exsit.
I also need to remove all remaining html tags like %amp; for the apercent symbol which are still within the data. I do know in which columns these can come up.
I thought aboug doing it like this, and the concat part of my code seems to work fine:
import CSV
import html
for file in files: # files is a list of csv files.
with open(file, "rt", encoding="utf-8-sig") as source, open(outfilePath, "at", newline='',
encoding='utf-8-sig') as result:
d_reader = csv.DictReader(source,delimiter=";")
# Set header based on first file in file_list:
if file == test_files[0]:
Common_header = d_reader.fieldnames
# Define DcitwriterObject
wtr = csv.DictWriter(result, fieldnames=Common_header, lineterminator='\n', delimiter=";")
# Write Header only once to emtpy file
if result.tell() == 0:
wtr.writeheader()
# If i remove this block i get my concateneated singe csv file as a result
# Howerver the html tags/encoded symbols are sill present.
for row in d_reader:
print(html.unescape(row['ColA'])) # This prints the unescaped Values in the column correctly
# If i kepp these two lines, i get an empty file with just the header as a result of the concatenation
row['ColA'] = html.unescape(row['ColA'])
row['ColB'] = html.unescape(row['ColB'])
wtr.writerows(d_reader)
I would have thought the simply suppling the encoding='utf-8-sig' part to the result file would be sufficient to get rid of the html symbols but that does not work. If you could give me a hint what i am doint wrong in the usage of the code containing the html.unescape function in my code that would be nice.
Thank you in advance

Grab values from seperate csv file and replace the values of columns in a pipe delimited file

Trying to whip this out in python. Long story short I got a csv file that contains column data i need to inject into another file that is pipe delimited. My understanding is that python can't replace values, so i have to re-write the whole file with the new values.
data file(csv):
value1,value2,iwantthisvalue3
source file(txt, | delimited)
value1|value2|iwanttoreplacethisvalue3|value4|value5|etc
fixed file(txt, | delimited)
samevalue1|samevalue2| replacedvalue3|value4|value5|etc
I can't figure out how to accomplish this. This is my latest attempt(broken code):
import re
import csv
result = []
row = []
with open("C:\data\generatedfixed.csv","r") as data_file:
for line in data_file:
fields = line.split(',')
result.append(fields[2])
with open("C:\data\data.txt","r") as source_file, with open("C:\data\data_fixed.txt", "w") as fixed_file:
for line in source_file:
fields = line.split('|')
n=0
for value in result:
fields[2] = result[n]
n=n+1
row.append(line)
for value in row
fixed_file.write(row)
I would highly suggest you use the pandas package here, it makes handling tabular data very easy and it would help you a lot in this case. Once you have installed pandas import it with:
import pandas as pd
To read the files simply use:
data_file = pd.read_csv("C:\data\generatedfixed.csv")
source_file = pd.read_csv('C:\data\data.txt', delimiter = "|")
and after that manipulating these two files is easy, I'm not exactly sure how many values or which ones you want to replace, but if the length of both "iwantthisvalue3" and "iwanttoreplacethisvalue3" is the same then this should do the trick:
source_file['iwanttoreplacethisvalue3'] = data_file['iwantthisvalue3]
now all you need to do is save the dataframe (the table that we just updated) into a file, since you want to save it to a .txt file with "|" as the delimiter this is the line to do that (however you can customize how to save it in a lot of ways):
source_file.to_csv("C:\data\data_fixed.txt", sep='|', index=False)
Let me know if everything works and this helped you. I would also encourage to read up (or watch some videos) on pandas if you're planning to work with tabular data, it is an awesome library with great documentation and functionality.

import web based .txt file into python

I think this is simple but I am not finding an answer that works. The data importing seems to work but separating the "/" numbers doesnt code is below. thanks for the help.
import urllib.request
opener = urllib.request.FancyURLopener({})
url = "http://jse.amstat.org/v22n1/kopcso/BeefDemand.txt"
f = opener.open(url)
content = f.read()
# below are the 3 different ways I tried to separate the data
content.encode('string-escape').split("\\x")
content.split('\r')
content.split('\\')
I highly recommend Pandas for reading and analysing this kind of file. It supports reading directly from a url and also gives meaningful analysis ability.
import pandas
url = "http://jse.amstat.org/v22n1/kopcso/BeefDemand.txt"
df = pandas.read_table(url, sep="\t+", engine='python', index_col="Year")
Note that you have multiple repeated tabs as separators in that file, which is handled by the sep="\t+". The repeats also means you have to use the python engine.
Now that the file is read into a dataframe, we can do easy plotting for instance:
df[['ChickPrice', 'BeefPrice']].plot()
Simply use a csv.reader or csv.DictReader to parse the contents. Make sure to set the delimiter to tabs, in this case:
import requests
import csv
import re
url = "http://jse.amstat.org/v22n1/kopcso/BeefDemand.txt"
response = requests.get(url)
response.raise_for_status()
text = re.sub("\t{1,}", "\t", response.text)
reader = csv.DictReader(text.splitlines(), delimiter="\t")
for row in reader:
print(row)
I like csv.DictReader better in this case, because it consumes the header line for you and each "row" is a dictionary. Your specific text file sometimes seperates fields with repeated tabs to make it look prettier, so you'll have to take that into account in some way. In my snippet, I used a regular expression to replace all tab-clusters with a single tab.

Extracting columns containing a certain name

I'm trying to use it to manipulate data in large txt-files.
I have a txt-file with more than 2000 columns, and about a third of these have a title which contains the word 'Net'. I want to extract only these columns and write them to a new txt file. Any suggestion on how I can do that?
I have searched around a bit but haven't been able to find something that helps me. Apologies if similar questions have been asked and solved before.
EDIT 1: Thank you all! At the moment of writing 3 users have suggested solutions and they all work really well. I honestly didn't think people would answer so I didn't check for a day or two, and was happily surprised by this. I'm very impressed.
EDIT 2: I've added a picture that shows what a part of the original txt-file can look like, in case it will help anyone in the future:
One way of doing this, without the installation of third-party modules like numpy/pandas, is as follows. Given an input file, called "input.csv" like this:
a,b,c_net,d,e_net
0,0,1,0,1
0,0,1,0,1
(remove the blank lines in between, they are just for formatting the
content in this post)
The following code does what you want.
import csv
input_filename = 'input.csv'
output_filename = 'output.csv'
# Instantiate a CSV reader, check if you have the appropriate delimiter
reader = csv.reader(open(input_filename), delimiter=',')
# Get the first row (assuming this row contains the header)
input_header = reader.next()
# Filter out the columns that you want to keep by storing the column
# index
columns_to_keep = []
for i, name in enumerate(input_header):
if 'net' in name:
columns_to_keep.append(i)
# Create a CSV writer to store the columns you want to keep
writer = csv.writer(open(output_filename, 'w'), delimiter=',')
# Construct the header of the output file
output_header = []
for column_index in columns_to_keep:
output_header.append(input_header[column_index])
# Write the header to the output file
writer.writerow(output_header)
# Iterate of the remainder of the input file, construct a row
# with columns you want to keep and write this row to the output file
for row in reader:
new_row = []
for column_index in columns_to_keep:
new_row.append(row[column_index])
writer.writerow(new_row)
Note that there is no error handling. There are at least two that should be handled. The first one is the check for the existence of the input file (hint: check the functionality provide by the os and os.path modules). The second one is to handle blank lines or lines with an inconsistent amount of columns.
This could be done for instance with Pandas,
import pandas as pd
df = pd.read_csv('path_to_file.txt', sep='\s+')
print(df.columns) # check that the columns are parsed correctly
selected_columns = [col for col in df.columns if "net" in col]
df_filtered = df[selected_columns]
df_filtered.to_csv('new_file.txt')
Of course, since we don't have the structure of your text file, you would have to adapt the arguments of read_csv to make this work in your case (see the the corresponding documentation).
This will load all the file in memory and then filter out the unnecessary columns. If your file is so large that it cannot be loaded in RAM at once, there is a way to load only specific columns with the usecols argument.
You can use pandas filter function to select few columns based on regex
data_filtered = data.filter(regex='net')

How would I transfer CSV "words" into Python as strings

So I am quite the beginner in Python, but what I'm trying to do is to download each CSV file for the NYSE. In an excel file I have every symbol. The Yahoo API allows you to download the CSV file by adding the symbol to the base url.
My first instinct was to use pandas, but pandas doesn't store strings.
So what I have
import urllib
strs = ["" for x in range(3297)]
#strs makes the blank string array for each symbol
#somehow I need to be able to iterate each symbol into the blank array spots
while y < 3297:
strs[y] = "symbol of each company from csv"
y = y+1
#loop for downloading each file from the link with strs[y].
while i < 3297:
N = urllib.URLopener()
N.retrieve('http://ichart.yahoo.com/table.csv?s='strs[y],'File')
i = i+1
Perhaps the solution is simpler than what I am doing.
From what I can see in this question you can't see how to connect your list of stock symbols to how you read the data in Pandas. e.g. 'http://ichart.yahoo.com/table.csv?s='strs[y] is not valid syntax.
Valid syntax for this is
pd.read_csv('http://ichart.yahoo.com/table.csv?s={}'.format(strs[y]))
It would be helpful if you could add a few sample lines from your csv file to the question. Guessing at your structure you would do something like:
import pandas as pd
symbol_df = pd.read_csv('path_to_csv_file')
for stock_symbol in symbol_df.symbol_column_name:
df = pd.read_csv('http://ichart.yahoo.com/table.csv?s={}'.format(stock_symbol))
# process your dataframe here
Assuming you take that Excel file w/ the symbols and output as a CSV, you can use Python's built-in CSV reader to parse it:
import csv
base_url = 'http://ichart.yahoo.com/table.csv?s={}'
reader = csv.reader(open('symbols.csv'))
for row in reader:
symbol = row[0]
data_csv = urllib.urlopen(base_url.format(symbol)).read()
# save out to file, or parse with CSV library...

Categories

Resources