I have a TSV file, in.txt, from which I would like to extract a smaller TSV file called out.txt.
I want to copy into out.txt only those rows of in.txt that contain the string value My String Value in column 6.
import csv
# r is textmode
# rb is binary mode
# binary mode is faster
with open('in.txt','rb') as tsvIn, open('out.txt', 'w') as tsvOut:
    tsvIn = csv.reader(tsvIn, delimiter='\t')
    tsvOut = csv.writer(tsvOut)
    for row in tsvIn:
        if "My String Value" in row:
            tsvOut.writerows(row)
My output looks like this.
D,r,a,m,a
1,9,6,1,-,0,4,-,1,3
H,y,u,n, ,M,o,k, ,Y,o,o
B,e,o,m,-,s,e,o,n, ,L,e,e
M,u,-,r,y,o,n,g, ,C,h,o,i,",", ,J,i,n, ,K,y,u, ,K,i,m,",", ,J,e,o,n,g,-,s,u,k, ,M,o,o,n,",", ,A,e,-,j,a, ,S,e,o
A, ,p,u,b,l,i,c, ,a,c,c,o,u,n,t,a,n,t,',s, ,s,a,l,a,r,y, ,i,s, ,f,a,r, ,t,o,o, ,s,m,a,l,l, ,f,o,r, ,h,i,m, ,t,o, ,e,v,e,n, ,g,e,t, ,a, ,c,a,v,i,t,y, ,f,i,x,e,d,",", ,l,e,t, ,a,l,o,n,e, ,s,u,p,p,o,r,t, ,h,i,s, ,f,a,m,i,l,y,., ,H,o,w,e,v,e,r,",", ,h,e, ,m,u,s,t, ,s,o,m,e,h,o,w, ,p,r,o,v,i,d,e, ,f,o,r, ,h,i,s, ,s,e,n,i,l,e,",", ,s,h,e,l,l,-,s,h,o,c,k,e,d, ,m,o,t,h,e,r,",", ,h,i,s, ,.,.,.
K,o,r,e,a,n,",", ,E,n,g,l,i,s,h
S,o,u,t,h, ,K,o,r,e,a
It should look like this, with tab-separated values:
Drama Hyun Mok Yoo A public accountant's salary is far too small for him...etc
There are a few things wrong with your code. Let's look at it line by line.
import csv
Import module csv. Ok.
with open('in.txt','rb') as tsvIn, open('out.txt', 'w') as tsvOut:
With auto-closed binary file read handle tsvIn from in.txt, and text write handle tsvOut from out.txt, do... (Note: in Python 2 you probably want to use mode wb instead of mode w; see this post. In Python 3, open both files in text mode with newline='' instead.)
tsvIn = csv.reader(tsvIn, delimiter='\t')
Let tsvIn be the result of the call of function reader in module csv with arguments tsvIn and delimiter='\t'. Ok.
tsvOut = csv.writer(tsvOut)
Let tsvOut be the result of the call of function writer in module csv with argument tsvOut. You probably want to add another argument, delimiter='\t', too; otherwise the output is comma-separated.
for row in tsvIn:
For each element in tsvIn as row, do...
if "My String Value" in row:
If the string "My String Value" is equal to any element of row, do... You mentioned that you wanted to keep only those rows whose sixth element was equal to the string, so you should use something like this instead...
if len(row) >= 6 and row[5] == "My String Value":
This means: If the length of row is at least 6, and the sixth element of row is equal to "My String Value", do...
tsvOut.writerows(row)
Call method writerows of object tsvOut with argument row. Remember that in Python a string is a sequence of characters, and a character is just a one-element string, so a character is itself a sequence. According to the docs, row is a list of strings, each representing one column of the row. The writerows method, however, expects a list of rows, that is, a list of lists of strings. Given a single row, it treats each element of row (a string) as if it were a row, and each character of that string as if it were a field (characters are strings!). The result is the messy, character-by-character output you are seeing. You should try this instead...
tsvOut.writerow(row)
Method writerow expects a single row as an argument, not a list of rows, thus this will yield the expected result.
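Putting the corrections together, a minimal sketch for Python 3 might look like the following. The function name filter_tsv and its parameters are illustrative and not part of the original code. It assumes Python 3, where the csv module expects text-mode files opened with newline='' rather than binary mode.

```python
import csv

def filter_tsv(in_path, out_path, wanted, column=5):
    """Copy rows whose `column` (0-based) equals `wanted`; return the count written."""
    written = 0
    # In Python 3 the csv module wants text-mode files opened with newline=''.
    with open(in_path, newline='') as tsv_in, \
         open(out_path, 'w', newline='') as tsv_out:
        reader = csv.reader(tsv_in, delimiter='\t')
        writer = csv.writer(tsv_out, delimiter='\t')
        for row in reader:
            # Skip rows that are too short, then compare the sixth field exactly.
            if len(row) > column and row[column] == wanted:
                writer.writerow(row)  # a single row, so writerow, not writerows
                written += 1
    return written
```

It would be called as filter_tsv('in.txt', 'out.txt', 'My String Value').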
Try this:
import csv
# r is text mode
# rb is binary mode
with open('in.txt','r') as tsvIn, open('out.txt', 'w') as tsvOut:
    reader = csv.reader(tsvIn, delimiter='\t')
    writer = csv.writer(tsvOut, delimiter='\t')
    # keeps rows in which any column equals the value
    for row in reader:
        if "My String Value" in row:
            writer.writerow(row)
I'm trying to understand/visualise the process of parsing a raw csv data file in Python from dataquest.io's training course.
I understand that rows = data.split('\n') splits the long string of csv file into rows based on where the line break is. ie:
day1, sunny, \n day2, rain \n
becomes
day1, sunny
day2, rain
I thought the for loop would further break the data into something like:
day 1
sunny
day 2
rain
Instead, the course seems to imply that it actually becomes a list of lists. I don't understand why that happens.
weather_data = []
f = open("la_weather.csv", 'r')
data = f.read()
rows = data.split('\n')
for row in rows:
    split_row = row.split(",")
    weather_data.append(split_row)
I'm ignoring the CSV stuff and concentrating just on your list misunderstanding. When you split the text on newlines, it becomes a list of strings. That is, rows becomes: ["day1, sunny","day2, rain"].
The for statement, applied to a list, iterates through the elements of that list. So, on the first time through row will be "day1, sunny", the second time through it will be "day2, rain", etc.
Inside each iteration of the for loop, it creates a new list, by splitting row at the commas into, eg, ["day1"," sunny"]. All of these lists are added to the weather_data list you created at the start. You end up with a list of lists, ie [['day1', ' sunny'], ['day2', ' rain']]. If you wanted ['day1', ' sunny', 'day2', ' rain'], you could do:
for row in rows:
    split_row = row.split(",")
    for ele in split_row:
        weather_data.append(ele)
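To see both shapes side by side, here is a small sketch using the two example lines from the question. The names nested and flat are made up for illustration, and the strip() call is an addition that removes the space after each comma.

```python
data = "day1, sunny\nday2, rain"

nested = []   # one inner list per line
flat = []     # every field in a single list
for row in data.split("\n"):
    fields = [field.strip() for field in row.split(",")]
    nested.append(fields)   # adds the list itself as one element
    flat.extend(fields)     # adds each field separately

print(nested)  # [['day1', 'sunny'], ['day2', 'rain']]
print(flat)    # ['day1', 'sunny', 'day2', 'rain']
```

The only difference is append versus extend: append stores the whole split row as one item, while extend unpacks it.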
That code does make it a list of lists.
As you say, the first split converts the data into a list, one element per line.
Then, for each line, the second split converts it into another list, one element per column.
And then the second list is appended, as a single item, to the weather_data list - which is now, as the instructions say, a list of lists.
Note that this code isn't very good - quite apart from the fact that you would always use the csv module, as others have pointed out, you would never do f.read() and then split the result. You would just do for line in f which automatically iterates over each row.
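As a rough sketch of iterating over the file directly, the following uses io.StringIO as a stand-in for the opened la_weather.csv, with made-up contents. A real file object is iterated line by line in the same way.

```python
import io

# io.StringIO stands in for open("la_weather.csv"); the contents are invented.
f = io.StringIO("day1,sunny\nday2,rain\n")

weather_data = []
for line in f:                    # one line at a time, no f.read()
    line = line.rstrip("\n")      # drop the trailing newline
    if line:                      # skip blank lines
        weather_data.append(line.split(","))

print(weather_data)  # [['day1', 'sunny'], ['day2', 'rain']]
```

This also avoids the empty final row that data.split('\n') produces when the file ends with a newline.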
A more Pythonic and flexible way of dealing with csv files is to use the csv module instead of reading the file as raw text:
import csv
with open("la_weather.csv", 'rb') as f:
    spamreader = csv.reader(f,delimiter=',')
    for row in spamreader:
        pass  # do stuff with row
Here spamreader is a reader object, and looping over it yields each row as a list of strings.
And if you want all of the rows in a single list, you can just convert spamreader to a list:
with open("la_weather.csv", 'rb') as f:
    spamreader = csv.reader(f,delimiter=',')
    print list(spamreader)
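The snippets above are written for Python 2. A minimal Python 3 sketch follows, in which the file would be opened in text mode with newline='' instead of 'rb'. Here io.StringIO stands in for the file and the contents are invented to show one thing the csv module handles that a plain split does not: a quoted field containing a comma.

```python
import csv
import io

# Stand-in for open("la_weather.csv", newline=""); the contents are made up.
f = io.StringIO('day1,"sunny, windy"\nday2,rain\n', newline="")

rows = list(csv.reader(f, delimiter=","))
print(rows)  # [['day1', 'sunny, windy'], ['day2', 'rain']]

# A plain split breaks the quoted field apart.
print('day1,"sunny, windy"'.split(","))  # ['day1', '"sunny', ' windy"']
```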
I have a csv file that has each line formatted with the line name followed by 11 pieces of data. Here is an example of a line.
CW1,0,-0.38,2.04,1.34,0.76,1.07,0.98,0.81,0.92,0.70,0.64
There are 12 lines in total, each with a unique name and data.
What I would like to do is extract the first cell from each line and use that to name the corresponding data, either as a variable equal to a list containing that line's data, or maybe as a dictionary, with the first cell being the key.
I am new to working with inputting files, so the farthest I have gotten is to read the file in using the stock solution in the documentation
import csv
path = r'data.csv'
with open(path,'rb') as csvFile:
    reader = csv.reader(csvFile,delimiter=' ')
    for row in reader:
        print(row[0])
I am failing to figure out how to assign each row to a new variable, especially when I am not sure what the variable names will be (this is because the csv file will be created by a user other than myself).
The destination for this data is a tool that I have written. It accepts lists as input such as...
CW1 = [0,-0.38,2.04,1.34,0.76,1.07,0.98,0.81,0.92,0.70,0.64]
so this would be the ideal end solution. If it is easier, and considered better to have the output of the file read be in another format, I can certainly re-write my tool to work with that data type.
As Scironic said in their answer, it is best to use a dict for this sort of thing.
However, be aware that in Python versions before 3.7, dict objects do not have any guaranteed "order", so the order of the rows can be lost if you use one. If this is a problem, you can use an OrderedDict instead (which is just what it sounds like: a dict that "remembers" the order of its contents):
import csv
from collections import OrderedDict as od
data = od() # ordered dict object remembers the order in the csv file
with open(path,'rb') as csvFile:
    reader = csv.reader(csvFile, delimiter=',')  # the sample rows are comma-separated
    for row in reader:
        data[row[0]] = row[1:] # Slice the row up into 0 (first item) and 1: (remaining)
Now if you go looping through your data object, the contents will be in the same order as in the csv file:
for d in data.values():
    myspecialtool(*d)
You need to use a dict for these kinds of things (dynamic variables):
import csv
path = r'data.csv'
data = {}
with open(path,'rb') as csvFile:
    reader = csv.reader(csvFile,delimiter=',')  # the sample rows are comma-separated
    for row in reader:
        data[row[0]] = row[1:]
dicts are especially useful for dynamic variables and are the best way to store data like this. To access a row, you just need to use:
data['CW1']
This solution also means that if you add any extra rows in with new names, you won't have to change anything.
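Note that the csv module returns every cell as a string, so the values need converting if the tool expects numbers. A small sketch of that, assuming every cell after the name is numeric: the function name load_named_rows is hypothetical, io.StringIO stands in for the real file, and the second sample line is invented.

```python
import csv
import io

def load_named_rows(lines):
    """Map the first cell of each CSV row to a list of floats from the rest."""
    data = {}
    for row in csv.reader(lines, delimiter=","):
        if row:  # skip empty lines
            data[row[0]] = [float(cell) for cell in row[1:]]
    return data

# Stand-in for the real file; the second line is invented for illustration.
sample = io.StringIO("CW1,0,-0.38,2.04\nCW2,1,0.5,-1.25\n", newline="")
data = load_named_rows(sample)
print(data["CW1"])  # [0.0, -0.38, 2.04]
```

With a real file in Python 3, the argument would be the object returned by open(path, newline='').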
If you are desperate to have the variable names in the global namespace and not within a dict, use exec (N.B. IF ANY OF THIS USES INPUT FROM OUTSIDE SOURCES, USING EXEC/EVAL CAN BE HIGHLY DANGEROUS (rm * level) SO MAKE SURE ALL INPUT IS CONTROLLED AND UNDERSTOOD BY YOURSELF).
with open(path,'rb') as csvFile:
    reader = csv.reader(csvFile,delimiter=',')  # the sample rows are comma-separated
    for row in reader:
        exec("{} = {}".format(row[0], row[1:]))
In python, you can use slicing: row[1:] will contain the row, except the first element, so you could do:
>>> import csv
>>> d={}
>>> with open("f") as f:
...     c = csv.reader(f, delimiter=',')
...     for r in c:
...         d[r[0]]=[int(x) for x in r[1:]]
...
>>> d
{'var1': [1, 3, 1], 'var2': [3, 0, -1]}
Regarding variable variables, check How do I do variable variables in Python? or How to get a variable name as a string in Python?. I would stick to dictionary though.
An alternative to using the proper csv library could be as follows:
path = r'data.csv'
csvRows = open(path, "r").readlines()
dataRows = [[float(col) for col in row.rstrip("\n").split(",")[1:]] for row in csvRows]
for dataRow in dataRows: # Where dataRow is a list of numbers
    print(dataRow)
You could then call your function in place of the print.
This reads the whole file in and produces a list of lines with trailing newlines. It then removes each newline and splits each row into a list of strings. It skips the initial column and calls float() for each remaining entry, resulting in a list of lists. Whether this is suitable depends on how important the first column is.
Using CSV writer, I am trying to write a list of strings to a file.
Each string should occupy a separate row.
sectionlist = ["cat", "dog", "frog"]
When I implement the following code:
with open('pdftable.csv', 'wt') as csvfile:
    writer = csv.writer(csvfile, delimiter=',')
    for i in sectionlist:
        writer.writerow(i)
I create
c,a,t
d,o,g
f,r,o,g
when I want
cat
dog
frog
Why does the for loop parse each character separately and how can I pass the entire string into csv.writer together so each can be written?
It doesn't look like you even need to use csv writer.
l = ["cat", "dog", "frog"] # Don't name your variable list!
with open('pdftable.csv', 'w') as csvfile:
    for word in l:
        csvfile.write(word + '\n')
Or as @GP89 suggested
with open('pdftable.csv', 'w') as csvfile:
    csvfile.writelines(word + '\n' for word in l)  # writelines adds no newlines itself
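One detail worth checking with writelines: it writes the strings exactly as given and adds no line separators, so each string needs its own newline. A quick sketch using io.StringIO in place of the file (the buffer names are made up):

```python
import io

words = ["cat", "dog", "frog"]

no_newlines = io.StringIO()
no_newlines.writelines(words)            # writes the strings back to back
print(repr(no_newlines.getvalue()))      # 'catdogfrog'

one_per_line = io.StringIO()
one_per_line.writelines(word + "\n" for word in words)
print(repr(one_per_line.getvalue()))     # 'cat\ndog\nfrog\n'
```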
I think that what you need is:
with open('pdftable.csv', 'wt') as csvfile:
    writer = csv.writer(csvfile, delimiter=',')
    for i in sectionlist:
        writer.writerow([i]) # note the square brackets here
writerow treats its argument as an iterable, so if you pass a string, it will see it as if each character is one element in the row; however, you want the whole string to be an item, so you must enclose it in a list or a tuple.
PS: That said, if your particular case is not any more complex than what you are posting, you may not need csv.writer at all, as suggested by other answers.
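The difference is easy to see by writing both forms into an in-memory buffer. This is only an illustrative sketch, with io.StringIO standing in for the file:

```python
import csv
import io

buf = io.StringIO(newline="")
writer = csv.writer(buf, delimiter=",")
writer.writerow("cat")      # a string is iterated character by character
writer.writerow(["cat"])    # a one-element list is a one-column row
print(repr(buf.getvalue()))  # 'c,a,t\r\ncat\r\n'
```

The \r\n at the end of each row is the csv writer's default line terminator.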
The problem is that i represents a string (a word), not a list (a row). Strings are also iterable sequences (of characters) in Python, so the CSV function accepts the object without error, even though the results are "strange".
There are three ways to fix this: make sectionlist a list of lists of strings (rows), so that i is a list of strings; wrap each word in a list when passing it to writerow; or simply don't use writerow, which expects a list of strings.
Trivially, the following structure would be saved correctly:
sectionlist = [
    ["cat", "meow"],
    ["dog"],
    ["frog", "hop", "pond"]
]
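As a sketch of how such a structure is written, writerows can take the whole list of rows in one call. Here io.StringIO stands in for the opened pdftable.csv:

```python
import csv
import io

sectionlist = [["cat", "meow"], ["dog"], ["frog", "hop", "pond"]]

buf = io.StringIO(newline="")
csv.writer(buf).writerows(sectionlist)   # one output line per inner list
print(buf.getvalue().splitlines())  # ['cat,meow', 'dog', 'frog,hop,pond']
```

Rows do not need to have the same number of fields; each inner list simply becomes one line.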