I have a CSV file.
There are a fixed number of columns and an unknown number of rows.
The information I need is always in the same 2 columns but not in the same row.
When column 6 has a 17 character value I also need to get the data from column 0.
This is an example row from the CSV file:
E4:DD:EF:1C:00:4F, 2012-10-08 11:29:04, 2012-10-08 11:29:56, -75, 9, 18:35:2C:18:16:ED,
You could open the file and go through it line by line. Split each line on commas and, if the value in column 6 (index 5, since indexing starts at 0) is 17 characters long, append element 0 to your result list:
with open(file_name) as f:
    res = []
    for line in f:
        L = line.split(',')
        # strip the leading space left after the comma before measuring
        if len(L[5].strip()) == 17:
            res.append(L[0])
Now you have a list with all the column 0 values from the rows whose column 6 holds a 17-character value.
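As a quick sanity check against the example row above (note that the value in column 6 carries a leading space after the comma, hence the strip):

```python
line = "E4:DD:EF:1C:00:4F, 2012-10-08 11:29:04, 2012-10-08 11:29:56, -75, 9, 18:35:2C:18:16:ED,"

parts = line.split(',')
# parts[5] is ' 18:35:2C:18:16:ED'; strip before measuring its length
value = parts[5].strip()
print(len(value), parts[0])   # 17 E4:DD:EF:1C:00:4F
```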
You can use the csv module to read the file, supplying whatever delimiter/dialect you need (`,`, `|`, tab, etc.) when you create the reader.
The csv reader hands you each row/record as a list of column values. If you would rather access each record as a dict, use DictReader and its methods.
import csv

res = []
with open('simple.csv', newline='') as f:
    reader = csv.reader(f, delimiter=',')
    for row in reader:
        # Indexing starts at 0, so the 6th column is index 5.
        # strip() trims surrounding spaces from the column value.
        # Checking len(row) > 5 guards against short/uneven rows.
        if len(row) > 5 and len(row[5].strip()) == 17:
            res.append(row[0])
# Result: column 0 of every row where column 6 has a 17-character value
print(res)
Related
I have a text file with some content.
I want to edit only the column "Medicalization". For example, by entering B on the keyboard, every value in the "Medicalization" column becomes B.
Each letter of that column sits at character position 14 of its line.
I tried something but I get an "index out of range" error:
with open('d:/test.txt','r') as infile:
    with open('d:/test2.txt','w') as outfile:
        for line in infile:
            line = line.split()
            new_line = '"B"\n'.format(line[14])
            outfile.write(new_line)
Is it possible to do that with Python?
Since the data is tabular, use pandas.read_csv with sep=r"\s+", then use pandas.DataFrame.loc to replace A with B in the medicalization column.
import pandas as pd
df = pd.read_csv("test.txt", sep=r"\s+")
df.loc[df["medicalization"] == "A" ,"medicalization"] = "B"
print(df)
typtpt name medicalization
0 1 Entrance B
1 2 Departure B
2 3 Consultation B
3 4 Meeting B
4 5 Transfer B
And if you want to save it back then use:
df.to_csv('test.txt', sep='\t', index=False)
The 'A' value you wish to change cannot possibly be at column position 14 in every line. If you look at, for example, the 4th row (with 'Consultation' as the name), even with a single space separating the columns, the third column would be at column position 17. So your assumption about fixed column positions must be wrong. If, for example, a single space or tab character separates each column, then for the first row of actual data the 'A' value would be at offset 12, and this would explain your exception.
Assuming a single space is separating each column from one another, then you could use the csv module as follows:
import csv

with open('d:/test.txt') as infile:
    with open('d:/test2.txt', 'w', newline='') as outfile:
        rdr = csv.reader(infile, delimiter=' ')
        wtr = csv.writer(outfile, delimiter=' ')
        # just write out the first row:
        header = next(rdr)
        wtr.writerow(header)
        for row in rdr:
            row[2] = 'B'
            wtr.writerow(row)
Or specify delimiter='\t' if a tab is used to separate the columns.
If an arbitrary number of whitespace characters (spaces or tabs) separates each column, then:
with open('test.txt') as infile:
    with open('test2.txt', 'w') as outfile:
        first_time = True
        for row in infile:
            columns = row.split()
            if first_time:
                first_time = False
            else:
                columns[2] = 'B'
            print(' '.join(columns), file=outfile)
The index out of range error comes from line = line.split(). Splitting on all whitespace turns line 2, for example, into a list like ['01', 'Entrance', 'A']. Indexing at 14 then fails because the list has no element at that position.
If your data file's format is consistent (all Medicalization data in the 3rd column), you can achieve what you're after in pure Python like so:
with open('test.txt','r') as infile:
    with open('test2.txt','w') as outfile:
        for idx, line in enumerate(infile):
            line = line.split()
            # idx 0 is the header row, which we leave unchanged
            if idx != 0:
                line[2] = '"B"'
            outfile.write(' '.join(line) + '\n')
However, @Hamza's answer using pandas is potentially a nicer one.
I have an issue where I need to take in different files with different column locations. One file's columns might start 4 rows down, whereas another file's columns might start on row one.
One file might look like this:
This
is
a
column 1, column 2, column 3, column 4
Another might have columns like this on row 1:
column 1, column 2, column 3
I need to get a list of every file's column headers. I consider a row to be the header row when it has more than 3 items. If I'm using the csv module, how can I write this?
I have something like:
temprow = next(csvfile)
for value in temprow:
    if value == '':
        temprow = next(csvfile)
    if len(value) > 3:
        header = temprow
    else:
        header = temprow
This is not quite working, as it also returns rows that contain only one string.
Try this:
with open('yourfile.csv', 'r') as f:
    for line in f:  # iterate over each line
        if "," in line:  # the header line should contain commas
            header = line
            break  # stop once the header line is found
print(header)
Output:
column 1, column 2, column 3, column 4
According to the specifications in your post, this code works. It returns the first row in a .csv file that has 4 or more elements ('greater than 3 items').
import csv

headers = []  # Column names will be appended to this list
files = ['./test']  # Insert files here
for f in files:  # Loop over files
    with open(f, 'r') as fh:  # Open file
        reader = csv.reader(fh, delimiter=',')  # Create reader
        for row in reader:  # Loop over rows
            if len(row) >= 4:  # Criteria for appending to headers
                headers.append(row)
                break  # stop after the first qualifying row in this file
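A quick way to sanity-check the criterion, using a throwaway file shaped like the first example above:

```python
import csv

# build a sample file: three short lines, then the real header row
with open('test.csv', 'w', newline='') as fh:
    fh.write("This\nis\na\ncolumn 1, column 2, column 3, column 4\n")

header = None
with open('test.csv', newline='') as fh:
    for row in csv.reader(fh):
        if len(row) >= 4:   # a header row has more than 3 fields
            header = row
            break
print(header)   # ['column 1', ' column 2', ' column 3', ' column 4']
```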
I created a program to write a simple .csv (code below):
opencsv = open('agentstatus.csv', 'w')
a = csv.writer(opencsv)
data = [[agents125N],
        [okstatusN],
        [warningstatusN],
        [criticalstatusN],
        [agentdisabledN],
        [agentslegacyN]]
a.writerows(data)
opencsv.close()
The .csv looks like this (it has empty rows in between, but that's not a problem):
36111
96
25887
10128
7
398
Now I am trying to read the .csv and store each of these numbers in a variable, but without success; see below an example for the number 36111:
import csv

with open('agentstatus.csv', 'r') as csvfile:
    f = csv.reader(csvfile)
    for row in f:
        firstvalue = row[0]
However, I get the error:
line 6, in <module>
firstvalue = row[0]
IndexError: list index out of range
Could you support me here?
Your file contains empty lines, so you need to check the length of the row:
values = []
for row in f:
    if len(row) > 0:
        values.append(row[0])
values is now ['36111', '96', '25887', '10128', '7', '398']
At the moment you're writing 6 rows into a csv file, with each row containing one column. To make a single row with six columns, you need to use a list of values, not each value in its own list.
ie change
data = [[agents125N], [okstatusN], [warningstatusN], [criticalstatusN], [agentdisabledN], [agentslegacyN]]
to
data = [[agents125N, okstatusN, warningstatusN, criticalstatusN, agentdisabledN, agentslegacyN]]
(a list containing one list of six values). Writing this with csv.writerows will result in
36111, 96, 25887, 10128, 7, 398
row[1] in your reading loop will then return the string '96'.
I'm trying to learn Python but I'm stuck here, any help appreciated.
I have 2 files.
1 is a .dat file with no column headers that is fixed width containing multiple rows of data
1 is a .fmt file that contains the column headers, column length, and datatype
.dat example:
10IFKDHGHS34
12IFKDHGHH35
53IFHDHGDF33
.fmt example:
ID,2,n
NAME,8,c
CODE,2,n
Desired Output as .csv:
ID,NAME,CODE
10,IFKDHGHS,34
12,IFKDHGHH,35
53,IFHDHGDF,33
First, I'd parse the format file.
import csv

with open("format_file.fmt") as f:
    # csv.reader parses away the commas for us
    # and we get rows as nice lists
    reader = csv.reader(f)
    # this gives us a list of lists that looks like
    # [["ID", "2", "n"], ...]
    row_definitions = list(reader)

# iterate and just unpack the headers
# this gives us ["ID", "NAME", "CODE"]
header = [name for name, length, dtype in row_definitions]
# [2, 8, 2]
lengths = [int(length) for name, length, dtype in row_definitions]

# define a generator that simply slices into the string piece by
# piece -- for these lengths it yields characters 0 - 2, then
# characters 2 - 10, then 10 - 12
def parse_line(line):
    start = 0
    for length in lengths:
        yield line[start: start + length]
        start = start + length

with open(data_file) as f:
    # iterating over a file object gives us each line separately;
    # call list on each use of parse_line to force the generator to a list
    parsed_lines = [list(parse_line(line)) for line in f]

# prepend the header
table = [header] + parsed_lines

# use the csv module again to easily output to csv
with open("my_output_file.csv", 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerows(table)
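As a quick check of the slicing logic against the sample data (column widths [2, 8, 2] from the .fmt example):

```python
lengths = [2, 8, 2]   # from the .fmt example: ID, NAME, CODE

def parse_line(line):
    # slice the fixed-width record into fields of the given widths
    start = 0
    for length in lengths:
        yield line[start: start + length]
        start += length

print(list(parse_line("10IFKDHGHS34")))   # ['10', 'IFKDHGHS', '34']
```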
unique.txt contains 2 columns separated by a tab. total.txt contains 3 columns, each separated by a tab.
I take each row from unique.txt and look for it in total.txt. If present, I extract the entire matching row from total.txt and save it to a new output file.
###Total.txt
column a column b column c
interaction1 mitochondria_205000_225000 mitochondria_195000_215000
interaction2 mitochondria_345000_365000 mitochondria_335000_355000
interaction3 mitochondria_345000_365000 mitochondria_5000_25000
interaction4 chloroplast_115000_128207 chloroplast_35000_55000
interaction5 chloroplast_115000_128207 chloroplast_15000_35000
interaction15 2_10515000_10535000 2_10505000_10525000
###Unique.txt
column a column b
mitochondria_205000_225000 mitochondria_195000_215000
mitochondria_345000_365000 mitochondria_335000_355000
mitochondria_345000_365000 mitochondria_5000_25000
chloroplast_115000_128207 chloroplast_35000_55000
chloroplast_115000_128207 chloroplast_15000_35000
mitochondria_185000_205000 mitochondria_25000_45000
2_16595000_16615000 2_16585000_16605000
4_2785000_2805000 4_2775000_2795000
4_11395000_11415000 4_11385000_11405000
4_2875000_2895000 4_2865000_2885000
4_13745000_13765000 4_13735000_13755000
My program:
file = open('total.txt')
file2 = open('unique.txt')
all_content = file.readlines()
all_content2 = file2.readlines()
store_id_lines = []
ff = open('match.dat', 'w')
for i in range(len(all_content)):
    line = all_content[i].split('\t')
    seq = line[1] + '\t' + line[2]
    for j in range(len(all_content2)):
        if all_content2[j] == seq:
            ff.write(seq)
            break
Problem:
But this doesn't give the desired output (the first-column values of the rows that satisfy the if condition). What I need is: if the jth row of unique.txt matches the last two columns of the ith row of total.txt, then write the entire ith row of total.txt to the new file.
import csv

with open('unique.txt') as uniques, open('total.txt') as total:
    # both files are tab-separated, so tell csv.reader about the delimiter
    uniques = list(tuple(line) for line in csv.reader(uniques, delimiter='\t'))
    totals = {}
    for line in csv.reader(total, delimiter='\t'):
        # key each total.txt row by its last two columns
        totals[tuple(line[1:])] = line

with open('output.txt', 'w', newline='') as outfile:
    writer = csv.writer(outfile, delimiter='\t')
    for line in uniques:
        writer.writerow(totals.get(line, []))
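The dictionary lookup at the heart of this approach can be sketched in isolation, with sample values taken from the files above:

```python
total_rows = [
    ["interaction1", "mitochondria_205000_225000", "mitochondria_195000_215000"],
    ["interaction4", "chloroplast_115000_128207", "chloroplast_35000_55000"],
]
uniques = [
    ("mitochondria_205000_225000", "mitochondria_195000_215000"),
    ("mitochondria_185000_205000", "mitochondria_25000_45000"),   # no match in total
]

# index total.txt rows by their (column b, column c) pair
totals = {tuple(row[1:]): row for row in total_rows}

# each unique pair either retrieves the full total.txt row or nothing
matches = [totals[pair] for pair in uniques if pair in totals]
for row in matches:
    print('\t'.join(row))
```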
I would write your code this way:
file = open('total.txt')
list_file = list(file)
file2 = open('unique.txt')
list_file2 = list(file2)
store_id_lines = []
ff = open('match.dat', 'w')
for curr_line_total in list_file:
    line = curr_line_total.split('\t')
    seq = line[1] + '\t' + line[2]
    if seq in list_file2:
        ff.write(curr_line_total)
Please avoid readlines() and use the with syntax when you open your files.
There are good explanations elsewhere of why you don't need readlines().