I have a problem; let me explain.
I have a fasta file like this:
>seqA
AAAAATTTGG
>seqB
ATTGGGCCG
>seqC
ATTGGCC
>seqD
ATTGGACAG
and a dataframe:
seq name    new name seq
seqB        BOBO
seqC        JOHN
and I simply want to change the sequence ID in the fasta file whenever the same seq name appears in my dataframe, replacing it with the new name, which would give:
New fasta file:
>seqA
AAAAATTTGG
>BOBO
ATTGGGCCG
>JOHN
ATTGGCC
>seqD
ATTGGACAG
Thank you very much
Edit: I used this script:
import pandas as pd
from Bio import SeqIO

blast = pd.read_table("matches_Busco_0035_0042.m8", header=None)
blast.columns = ["qseqid", "Busco_ID", "pident", "length", "mismatch", "gapopen",
                 "qstart", "qend", "sstart", "send", "evalue", "bitscore"]
repl = blast[blast.pident > 95]
print(repl)
# substitution dataframe
newfile = []
count = 0
for rec in SeqIO.parse("concatenate_0035_0042_aa2.fa", "fasta"):
    # get corresponding value for record ID from dataframe
    x = repl.loc[repl.seq == rec.id, "Busco_ID"]
    # change record, if not empty
    if x.any():
        rec.name = rec.description = rec.id = x.iloc[0]
        count += 1
    # append record to list
    newfile.append(rec)
# write list into new fasta file
SeqIO.write(newfile, "changedtest.faa", "fasta")
# tell us, how hard you had to work for us
print("I changed {} entries!".format(count))
And I got the following error:
Traceback (most recent call last):
  File "Get_busco_blast.py", line 74, in <module>
    x = repl.loc[repl.seq == rec.id, "Busco_ID"]
  File "/usr/local/lib/python3.6/site-packages/pandas/core/generic.py", line 3614, in __getattr__
    return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute 'seq'
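That AttributeError comes from repl.seq: the blast table's columns were renamed to qseqid, Busco_ID, pident, and so on, so there is no column called seq. Assuming qseqid is the column that holds the fasta record IDs, the lookup line would become:

# look up the record ID in the 'qseqid' column instead of the non-existent 'seq'
x = repl.loc[repl["qseqid"] == rec.id, "Busco_ID"]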
It's easier to do this with something like BioPython.
First create a dictionary mapping the old names to the new ones:
names = pd.Series(df['new name seq'].values, index=df['seq name']).to_dict()
Now iterate over the fasta records and rename where a mapping exists:
from Bio import SeqIO

outs = []
for record in SeqIO.parse("orig.fasta", "fasta"):
    new_id = names.get(record.id, record.id)  # fall back to the current ID if no mapping
    record.name = record.description = record.id = new_id
    outs.append(record)
SeqIO.write(outs, "new.fasta", "fasta")
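To sanity-check the mapping with the dataframe from the question (assuming its columns are literally named 'seq name' and 'new name seq'):

import pandas as pd

df = pd.DataFrame({"seq name": ["seqB", "seqC"],
                   "new name seq": ["BOBO", "JOHN"]})
names = pd.Series(df["new name seq"].values, index=df["seq name"]).to_dict()
print(names)  # {'seqB': 'BOBO', 'seqC': 'JOHN'}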
If you have Biopython installed, then you can use SeqIO to read/write fasta files:
import numpy as np
import pandas as pd
from Bio import SeqIO

# substitution dataframe
repl = pd.DataFrame(np.asarray([["seqB_3652_i36", "Bob"], ["seqC_123_6XXX1", "Patrick"]]),
                    columns=["seq", "newseq"])

newfile = []
count = 0
for rec in SeqIO.parse("test.faa", "fasta"):
    # get corresponding value for record ID from dataframe
    # repl["seq"] and repl["newseq"] are the pandas columns with the old and new sequence names, respectively
    x = repl.loc[repl["seq"] == rec.id, "newseq"]
    # change record, if not empty
    if x.any():
        # append old identifier number to the new id name
        rec.name = rec.description = rec.id = x.iloc[0] + rec.id[rec.id.index("_"):]
        count += 1
    # append record to list
    newfile.append(rec)

# write list into new fasta file
SeqIO.write(newfile, "changedtest.faa", "fasta")
# tell us, how hard you had to work for us
print("I changed {} entries!".format(count))
Please note that this script doesn't check for multiple entries in the substitution table: it just takes the first matching element, and leaves the record unchanged if its ID is not in the dataframe.
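If duplicate old names in the substitution table are a concern, a quick sanity check could be added before the loop (a sketch using the repl dataframe defined above):

# rows whose 'seq' value occurs more than once in the substitution table
dupes = repl[repl["seq"].duplicated(keep=False)]
if not dupes.empty:
    print("Warning: duplicate substitution entries:")
    print(dupes)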
Related
This is currently my code for reading through a CSV file, creating a Person object, and adding each person to a list. One example input line: John,Langley,1,2,2,3,5
When I print(per) each time after creating a Person object, the output is correct, but as soon as I add that person to the list, the numeric values (the 'traits') for every person end up the same as the last person's traits in the CSV file.
For Example:
John,Langley,1,2,2,3,5 --(add to list)-->John,Langley,1,1,1,1,1
Isabel,Smith,3,2,4,4,0 --(add to list)-->Isabel,Smith,1,1,1,1,1
John,Doe,1,1,1,1,1 --(add to list)-->John,Doe,1,1,1,1,1
This is blocking me from continuing because I need the Person objects' traits to be valid in order to perform analysis on them in the next couple of methods. Please ignore my print statements; they were for debugging.
def read_file(filename):
    file = open(filename, "r", encoding='utf-8-sig')
    Traits_dict = {}
    pl = []
    next(file)
    for line in file:
        line = line.rstrip('\n')
        line = line.split(',')
        first = str(line[0].strip())
        last = str(line[1].strip())
        w = line[2].strip()
        hobby = line[3].strip()
        social = line[4].strip()
        eat = line[5].strip()
        sleep = line[6].strip()
        Traits_dict["Work"] = w
        Traits_dict["Hobbies"] = hobby
        Traits_dict["Socialize"] = social
        Traits_dict["Eat"] = eat
        Traits_dict["Sleep"] = sleep
        per = Person(first, last, Traits_dict)
        print(per)
        pl.append(per)
    print(pl[0])
    print(pl[1])
    print(pl[2])
    print(pl[3])
    print(pl[4])
    return pl
All the Person objects end up with the same traits because Traits_dict = {} is created once, before the loop, so every Person gets a reference to the same dict.
Put Traits_dict = {} inside the loop so that each Person gets its own new dict:
for line in file:
    Traits_dict = {}
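To see why this happens, here is a minimal standalone sketch (not using the Person class from the question) contrasting a shared dict with a fresh dict per iteration:

rows = [("John", 1), ("Isabel", 3)]

# shared dict: every appended entry points at the same object,
# so all of them end up showing the values from the last iteration
shared = {}
people_bad = []
for name, work in rows:
    shared["Work"] = work
    people_bad.append({"name": name, "traits": shared})
print(people_bad)   # both entries show Work = 3

# fresh dict per iteration: each entry keeps its own values
people_good = []
for name, work in rows:
    traits = {"Work": work}
    people_good.append({"name": name, "traits": traits})
print(people_good)  # Work = 1 and Work = 3 as expected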
I'm trying to convert a text file to an Excel sheet in Python. The txt file contains data in the format specified below.
Column names: reg no, zip code, loc id, emp id, lastname, first name. Each record has one or more error numbers, and each record has its column names listed above the values. I would like to create an Excel sheet containing reg no, firstname, lastname, and the errors listed in separate rows for each record.
How can I put the records into an Excel sheet? Should I be using regular expressions? And how can I insert the error numbers in different rows for the corresponding record?
Expected output:
Here is the link to the input file:
https://github.com/trEaSRE124/Text_Excel_python/blob/master/new.txt
Any code snippets or suggestions are kindly appreciated.
Here is a draft. Let me know if any changes are needed:
# import pandas as pd
from collections import OrderedDict
from datetime import date
import csv

with open('in.txt') as f:
    with open('out.csv', 'wb') as csvfile:
        spamwriter = csv.writer(csvfile, delimiter=',', quoting=csv.QUOTE_MINIMAL)
        # Remove initial clutter
        while("INPUT DATA" not in f.readline()):
            continue
        header = ["REG NO", "ZIP CODE", "LOC ID", "EMP ID", "LASTNAME", "FIRSTNAME", "ERROR"]
        data = list()
        errors = list()
        spamwriter.writerow(header)
        print header
        while(True):
            line = f.readline()
            errors = list()
            if("END" in line):
                exit()
            try:
                int(line.split()[0])
                data = line.strip().split()
                f.readline()  # get rid of \n
                line = f.readline()
                while("ERROR" in line):
                    errors.append(line.strip())
                    line = f.readline()
                spamwriter.writerow(data + errors)
                csvfile.flush()  # csv.writer has no flush(); flush the underlying file instead
            except:
                continue
Run it with Python 2. The errors are appended as extra columns; laying them out exactly the way you described is slightly more involved, and I can adjust it if still needed.
Output looks like:
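Since the goal is an Excel sheet rather than a CSV, the out.csv written above could then be converted with pandas (a sketch, assuming pandas and openpyxl are installed; Python 3 syntax, and ragged error columns may need padding or an explicit names= list first):

import pandas as pd

# turn the intermediate CSV into a real .xlsx workbook
df = pd.read_csv("out.csv")
df.to_excel("out.xlsx", index=False, engine="openpyxl")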
You can do this using the openpyxl library which is capable of depositing items directly into a spreadsheet. This code shows how to do that for your particular situation.
NEW_PERSON, ERROR_LINE = 1, 2

def Line_items():
    with open('katherine.txt') as katherine:
        for line in katherine:
            line = line.strip()
            if not line:
                continue
            items = line.split()
            if items[0].isnumeric():
                yield NEW_PERSON, items
            elif items[:2] == ['ERROR', 'NUM']:
                yield ERROR_LINE, line
            else:
                continue

from openpyxl import Workbook

wb = Workbook()
ws = wb.active

ws['A2'] = 'REG NO'
ws['B2'] = 'LASTNAME'
ws['C2'] = 'FIRSTNAME'
ws['D2'] = 'ERROR'

row = 2
for kind, data in Line_items():
    if kind == NEW_PERSON:
        row += 2
        ws['A{:d}'.format(row)] = int(data[0])
        ws['B{:d}'.format(row)] = data[-2]
        ws['C{:d}'.format(row)] = data[-1]
        first = True
    else:
        if first:
            first = False
        else:
            row += 1
        ws['D{:d}'.format(row)] = data

wb.save(filename='katherine.xlsx')
This is a screen snapshot of the result.
I am trying to write a program to do the following:
take a field from a record in a csv file called data;
take the same field from a record in a csv file called log;
compare the position of the two in data and in log: if they are on the same line, write the record from log to a new file called result;
if the field does not match the record position in the log file, move to the next record in the log file and compare until a matching record is found, then save that record in result;
reset the index of the log file;
go to the next line in the data file and repeat the check until the data file reaches the end.
This is what I was able to do, but I am stuck:
import csv

def main():
    datafile_csv = open('data.txt')
    logfile_csv = open('log.txt')
    row_data = []
    row_log = []
    row_log_temp = []
    index_data = 1
    index_log = 1
    index_log_temp = index_log
    counter = 0
    data = ''
    datareader = ''
    logreader = ''
    log = ''
    # row = 0
    logfile_len = sum(1 for lines in open('log.txt'))
    with open('resultfile.csv', 'w') as csvfile:
        out_write = csv.writer(csvfile, delimiter=',', quotechar='"')
        with open('data.txt', 'r') as (data):
            row_data = csv.reader(csvfile, delimiter=',', quotechar='"')
            row_data = next(data)
            print(row_data)
            with open('log.txt', 'r') as (log):
                row_log = next(log)
                print(row_log)
                while counter != logfile_len:
                    comp_data = row_data[index_data:]
                    comp_log = row_log[index_log:]
                    comp_data = comp_data.strip('"')
                    comp_log = comp_log.strip('"')
                    print(row_data[1])
                    print(comp_data)
                    print(comp_log)
                    if comp_data != comp_log:
                        while comp_data != comp_log:
                            row_log = next(log)
                            comp_log = row_log[index_log]
                        out_write.writerow(row_log)
                        row_data = next(data)
                    else:
                        out_write.writerow(row_log)
                        row_data = next(data)
                        log.seek(0)
                    counter += 1
The problems I have are the following:
I cannot convert the data line into a string properly, so I cannot compare correctly.
Also, I need to be able to reset the pointer in the log file, but seek does not seem to be working.
This is the content of the data file
"test1","test2","test3"
"1","2","3"
"4","5","6"
This is the content of the log file
"test1","test2","test3"
"4","5","6"
"1","2","3"
This is the output I get:
t
"test1","test2","test3"
t
test1","test2","test3"
test1","test2","test3"
1
1","2","3"
test1","test2","test3"
Traceback (most recent call last):
  File "H:/test.py", line 100, in <module>
    main()
  File "H:/test.py", line 40, in main
    comp_log = row_log[index_log]
IndexError: string index out of range
Thank you very much for the help
Regards
Danilo
This joins the two files by columns (a row counter plus a specific column, not defined here), returning the results limited to the columns of the left/first file:
import petl
log = petl.fromcsv('log.txt').addrownumbers() # Load csv/txt file into PETL table, and add row numbers
log_columns = len(petl.header(log)) # Get the amount of columns in the log file
data = petl.fromcsv('data.txt').addrownumbers() # Load csv/txt file into PETL table, and add row numbers
joined_files = petl.join(log, data, key=['row', 'SpecificField']) # Join the tables using row and a specific field
joined_files = petl.cut(joined_files, *range(1, log_columns)) # Remove the extra columns obtained from right table
petl.tocsv(joined_files, 'resultfile.csv') # Output results to csv file
log.txt
data.txt
resultfile.csv
Also, do not forget to pip install petl (version used for this example):
pip install petl==1.0.11
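If you would rather stay with the plain csv module from the question, the log file pointer can be reset with seek(0) on the file object, building a fresh csv.reader after each rewind; a minimal sketch, assuming the files look like the samples above and whole rows are compared:

import csv

with open('data.txt', newline='') as data_f, \
     open('log.txt', newline='') as log_f, \
     open('resultfile.csv', 'w', newline='') as out_f:
    writer = csv.writer(out_f, quoting=csv.QUOTE_ALL)
    for data_row in csv.reader(data_f):
        log_f.seek(0)                      # rewind the log file for every data row
        for log_row in csv.reader(log_f):  # fresh reader after the rewind
            if log_row == data_row:        # compare parsed fields, not raw strings
                writer.writerow(log_row)
                break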
I have a single .csv file containing multiple tables.
Using Pandas, what would be the best strategy to get two DataFrames, inventory and HPBladeSystemRack, from this one file?
The input .csv looks like this:
Inventory
System Name         IP Address     System Status
dg-enc05                           Normal
dg-enc05_vc_domain                 Unknown
dg-enc05-oa1        172.20.0.213   Normal

HP BladeSystem Rack
System Name    Rack Name   Enclosure Name
dg-enc05       BU40
dg-enc05-oa1   BU40        dg-enc05
dg-enc05-oa2   BU40        dg-enc05
The best I've come up with so far is to convert this .csv file into an Excel workbook (xlsx), split the tables into sheets, and use:
inventory = read_excel('path_to_file.csv', 'sheet1', skiprow=1)
HPBladeSystemRack = read_excel('path_to_file.csv', 'sheet2', skiprow=2)
However:
This approach requires the xlrd module.
Those log files have to be analyzed in real time, so it would be much better to find a way to analyze them as they come in from the logs.
The real logs have far more tables than those two.
If you know the table names beforehand, then something like this:
df = pd.read_csv("jahmyst2.csv", header=None, names=range(3))
table_names = ["Inventory", "HP BladeSystem Rack", "Network Interface"]
groups = df[0].isin(table_names).cumsum()
tables = {g.iloc[0,0]: g.iloc[1:] for k,g in df.groupby(groups)}
should work to produce a dictionary with keys as the table names and values as the subtables.
>>> list(tables)
['HP BladeSystem Rack', 'Inventory']
>>> for k,v in tables.items():
...     print("table:", k)
...     print(v)
...     print()
...
table: HP BladeSystem Rack
              0          1               2
6   System Name  Rack Name  Enclosure Name
7      dg-enc05       BU40             NaN
8  dg-enc05-oa1       BU40        dg-enc05
9  dg-enc05-oa2       BU40        dg-enc05

table: Inventory
                    0             1              2
1         System Name    IP Address  System Status
2            dg-enc05           NaN         Normal
3  dg-enc05_vc_domain           NaN        Unknown
4        dg-enc05-oa1  172.20.0.213         Normal
Once you've got that, you can set the column names from the first row of each subtable, etc.
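For example, a small follow-up sketch (continuing from the tables dict built above) that promotes each subtable's first row to its column names:

cleaned = {}
for name, t in tables.items():
    t = t.copy()
    t.columns = t.iloc[0]                     # first row holds the column names
    cleaned[name] = t.iloc[1:].reset_index(drop=True)

inventory = cleaned["Inventory"]
HPBladeSystemRack = cleaned["HP BladeSystem Rack"]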
I assume you know the names of the tables you want to parse out of the csv file. If so, you could retrieve the index positions of each, and select the relevant slices accordingly. As a sketch, this could look like:
df = pd.read_csv('path_to_file')

index_positions = []
for table in table_names:
    index_positions.append(df[df['col_with_table_names'] == table].index.tolist()[0])

## Include end of table for last slice, omit for iteration below
index_positions.append(df.index.tolist()[-1])

tables = {}
for position in index_positions[:-1]:
    table_no = index_positions.index(position)
    tables[table_names[table_no]] = df.loc[position:index_positions[table_no + 1]]
There are certainly more elegant solutions but this should give you a dictionary with the table names as keys and the corresponding tables as values.
Pandas doesn't seem to be ready to do this easily, so I ended up doing my own split_csv function. It only requires table names and will output .csv files named after each table.
import csv
from os.path import dirname  # gets parent folder in a path
from os.path import join     # concatenate paths

table_names = ["Inventory", "HP BladeSystem Rack", "Network Interface"]

def split_csv(csv_path, table_names):
    tables_infos = detect_tables_from_csv(csv_path, table_names)
    for table_info in tables_infos:
        split_csv_by_indexes(csv_path, table_info)

def split_csv_by_indexes(csv_path, table_info):
    title, start_index, end_index = table_info
    print title, start_index, end_index
    dir_ = dirname(csv_path)
    output_path = join(dir_, title) + ".csv"
    with open(output_path, 'w') as output_file, open(csv_path, 'rb') as input_file:
        writer = csv.writer(output_file)
        reader = csv.reader(input_file)
        for i, line in enumerate(reader):
            if i < start_index:
                continue
            if i > end_index:
                break
            writer.writerow(line)

def detect_tables_from_csv(csv_path, table_names):
    output = []
    with open(csv_path, 'rb') as csv_file:
        reader = csv.reader(csv_file)
        for idx, row in enumerate(reader):
            for col in row:
                match = [title for title in table_names if title in col]
                if match:
                    match = match[0]  # get the first matching element
                    try:
                        end_index = idx - 1
                        start_index
                    except NameError:
                        start_index = 0
                    else:
                        output.append((previous_match, start_index, end_index))
                    print "Found new table", col
                    start_index = idx
                    previous_match = match
                    match = False
        end_index = idx  # last 'end_index' set to EOF
        output.append((previous_match, start_index, end_index))
    return output

if __name__ == '__main__':
    csv_path = 'switch_records.csv'
    try:
        split_csv(csv_path, table_names)
    except IOError as e:
        print "This file doesn't exist. Aborting."
        print e
        exit(1)
I have the following csv called report.csv. It's an excel file:
email            agent_id             misc
test#email.com   65483843154f35d54    blah1
test1#email.com  sldd989eu99ufj9ej9e  blah 2
I have the following code:
import csv

data_file = 'report.csv'

def import_data(data_file):
    attendee_data = csv.reader(open(data_file, 'rU'), dialect=csv.excel_tab)
    for row in attendee_data:
        email = row[1]
        agent_id = row[2]
        pdf_file_name = agent_id + '_' + email + '.pdf'
        generate_certificate(email, agent_id, pdf_file_name)
I get the following error:
Traceback (most recent call last):
  File "report_test.py", line 56, in <module>
    import_data(data_file)
  File "report_test.py", line 25, in import_data
    email = row[1]
IndexError: list index out of range
I thought the index referred to the column number within each row; row[1] and row[2] should be within range, no?
There is most likely a blank line in your CSV file. Also, list indices start at 0, not 1.
import csv

data_file = 'report.csv'

def import_data(data_file):
    attendee_data = csv.reader(open(data_file, 'rU'), dialect=csv.excel_tab)
    for row in attendee_data:
        try:
            email = row[0]
            agent_id = row[1]
        except IndexError:
            pass
        else:
            pdf_file_name = agent_id + '_' + email + '.pdf'
            generate_certificate(email, agent_id, pdf_file_name)
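Alternatively, blank rows (the usual cause of that IndexError) can be skipped up front instead of caught; a small variation on the same loop:

for row in attendee_data:
    if not row:          # skip blank lines in the CSV
        continue
    email = row[0]
    agent_id = row[1]
    pdf_file_name = agent_id + '_' + email + '.pdf'
    generate_certificate(email, agent_id, pdf_file_name)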
You say you have an "Excel CSV", which I don't quite understand so I'll answer assuming you have an actual .csv file.
If I'm loading a .csv into memory (and the file isn't enormous), I'll often have a load_file method on my class that doesn't care about indexes.
Assuming the file has a header row:
import csv

def load_file(filename):
    # Define data in case the file is empty.
    data = []
    with open(filename) as csvfile:
        reader = csv.reader(csvfile)
        headers = next(reader)
        data = [dict(zip(headers, row)) for row in reader]
    return data
This returns a list of dictionaries you can access by key instead of by index. The key will simply be absent if, say, misc (index 2) is missing from a row, so just .get the value from the row dict; this is cleaner than a try...except.
for row in data:
    email = row.get('email')
    agent_id = row.get('agent_id')
    misc = row.get('misc')
This way the order of the file columns doesn't matter, only the headers do. Also, if any of the columns have a blank value, your script won't error out with an IndexError. If you don't want to include blank values, simply handle them by checking:
if not email:
    do.something()
if not agent_id:
    do.something_else()
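Putting it together with the question's code (generate_certificate comes from the question; 'email' and 'agent_id' are assumed to be the actual header names in report.csv, and the file is assumed to be genuinely comma-separated, as this answer presumes):

data = load_file('report.csv')
for row in data:
    email = row.get('email')
    agent_id = row.get('agent_id')
    if not email or not agent_id:
        continue  # skip incomplete rows
    pdf_file_name = agent_id + '_' + email + '.pdf'
    generate_certificate(email, agent_id, pdf_file_name)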