Python Pandas - Read csv file containing multiple tables

I have a single .csv file containing multiple tables.
Using Pandas, what would be the best strategy to get two DataFrames, inventory and HPBladeSystemRack, from this one file?
The input .csv looks like this:
Inventory
System Name IP Address System Status
dg-enc05 Normal
dg-enc05_vc_domain Unknown
dg-enc05-oa1 172.20.0.213 Normal
HP BladeSystem Rack
System Name Rack Name Enclosure Name
dg-enc05 BU40
dg-enc05-oa1 BU40 dg-enc05
dg-enc05-oa2 BU40 dg-enc05
The best I've come up with so far is to convert this .csv file into an Excel workbook (xlsx), split the tables into sheets and use:
inventory = read_excel('path_to_file.xlsx', 'sheet1', skiprows=1)
HPBladeSystemRack = read_excel('path_to_file.xlsx', 'sheet2', skiprows=2)
However:
This approach requires the xlrd module.
These log files have to be analyzed in real time, so it would be much better to find a way to analyze them directly as they come in from the logs.
The real logs have far more tables than those two.

If you know the table names beforehand, then something like this:
df = pd.read_csv("jahmyst2.csv", header=None, names=range(3))
table_names = ["Inventory", "HP BladeSystem Rack", "Network Interface"]
groups = df[0].isin(table_names).cumsum()
tables = {g.iloc[0,0]: g.iloc[1:] for k,g in df.groupby(groups)}
should work to produce a dictionary with keys as the table names and values as the subtables.
>>> list(tables)
['HP BladeSystem Rack', 'Inventory']
>>> for k,v in tables.items():
... print("table:", k)
... print(v)
... print()
...
table: HP BladeSystem Rack
0 1 2
6 System Name Rack Name Enclosure Name
7 dg-enc05 BU40 NaN
8 dg-enc05-oa1 BU40 dg-enc05
9 dg-enc05-oa2 BU40 dg-enc05
table: Inventory
0 1 2
1 System Name IP Address System Status
2 dg-enc05 NaN Normal
3 dg-enc05_vc_domain NaN Unknown
4 dg-enc05-oa1 172.20.0.213 Normal
Once you've got that, you can set the column names from the first row of each subtable and drop that row, for example:
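A minimal follow-up sketch (assuming the tables dict built above) that promotes each subtable's first remaining row to its column names; the cleaned dict and variable names are just illustrative:

cleaned = {}
for name, sub in tables.items():
    sub = sub.copy()
    sub.columns = sub.iloc[0]  # the first remaining row holds the real header
    cleaned[name] = sub.iloc[1:].reset_index(drop=True)

inventory = cleaned["Inventory"]
HPBladeSystemRack = cleaned["HP BladeSystem Rack"]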

I assume you know the names of the tables you want to parse out of the csv file. If so, you could retrieve the index positions of each, and select the relevant slices accordingly. As a sketch, this could look like:
df = pd.read_csv('path_to_file')
index_positions = []
for table in table_names:
    index_positions.append(df[df['col_with_table_names'] == table].index.tolist()[0])

## Include end of table for last slice, omit for iteration below
index_positions.append(df.index.tolist()[-1])

tables = {}
for position in index_positions[:-1]:
    table_no = index_positions.index(position)
    tables[table_names[table_no]] = df.loc[position:index_positions[table_no + 1]]
There are certainly more elegant solutions but this should give you a dictionary with the table names as keys and the corresponding tables as values.

Pandas doesn't seem to be ready to do this easily, so I ended up writing my own split_csv function. It only requires the table names and outputs one .csv file named after each table.
import csv
from os.path import dirname # gets parent folder in a path
from os.path import join # concatenate paths
table_names = ["Inventory", "HP BladeSystem Rack", "Network Interface"]
def split_csv(csv_path, table_names):
    tables_infos = detect_tables_from_csv(csv_path, table_names)
    for table_info in tables_infos:
        split_csv_by_indexes(csv_path, table_info)

def split_csv_by_indexes(csv_path, table_info):
    title, start_index, end_index = table_info
    print title, start_index, end_index
    dir_ = dirname(csv_path)
    output_path = join(dir_, title) + ".csv"
    with open(output_path, 'w') as output_file, open(csv_path, 'rb') as input_file:
        writer = csv.writer(output_file)
        reader = csv.reader(input_file)
        for i, line in enumerate(reader):
            if i < start_index:
                continue
            if i > end_index:
                break
            writer.writerow(line)

def detect_tables_from_csv(csv_path, table_names):
    output = []
    with open(csv_path, 'rb') as csv_file:
        reader = csv.reader(csv_file)
        for idx, row in enumerate(reader):
            for col in row:
                match = [title for title in table_names if title in col]
                if match:
                    match = match[0]  # get the first matching element
                    try:
                        end_index = idx - 1
                        start_index
                    except NameError:
                        start_index = 0
                    else:
                        output.append((previous_match, start_index, end_index))
                    print "Found new table", col
                    start_index = idx
                    previous_match = match
                    match = False
        end_index = idx  # last 'end_index' set to EOF
        output.append((previous_match, start_index, end_index))
    return output

if __name__ == '__main__':
    csv_path = 'switch_records.csv'
    try:
        split_csv(csv_path, table_names)
    except IOError as e:
        print "This file doesn't exist. Aborting."
        print e
        exit(1)
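Once split_csv has produced one .csv per table, the pieces can be loaded back with pandas. A small sketch, assuming the file names produced above and that each split file starts with the table title row followed by the column header row:

import pandas as pd

inventory = pd.read_csv("Inventory.csv", skiprows=1)
HPBladeSystemRack = pd.read_csv("HP BladeSystem Rack.csv", skiprows=1)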

Pandas read_table columns with multiple lines

I am working with a text file (ClassTest.txt) and pandas. The text file has 3 tab-separated columns: Title, Description, and Category. Title and Description are normal strings and Category is a (non-zero) integer.
I was gathering the data as follows:
data = pd.read_table('ClassTest.txt')
feature_names = ['Title', 'Description']
X = data[feature_names]
y = data['Category']
However, because values in the Description column can themselves contain newlines, the 'y' DataFrame ends up with too many rows, since most items in the Description column span multiple lines. I attempted to get around this by making '|' the line terminator in the file (by regenerating it) and using:
data = pd.read_table('ClassTest.txt', lineterminator='|')
X = data[feature_names]
y = data['Category']
This time, I get the error:
pandas.errors.ParserError: Error tokenizing data. C error: Expected 3 fields in line 20, saw 5
Can anyone help me with this issue?
EDIT: Adding previous code
import sqlite3 as lite

con = lite.connect('JobDetails.db')
cur = con.cursor()
cur.execute('''SELECT Title, Description, Category FROM ReviewJobs''')
results = [list(each) for each in cur.fetchall()]
cur.execute('''SELECT Title, Description, Category FROM Jobs''')
for each in cur.fetchall():
    results.append(list(each))
a = open('ClassTest.txt', 'ab')
newLine = "|"
a.write(u''.join(c for c in 'Title\tDescription\tCategory' + newLine).encode('utf-8'))
for r in results:
    toWrite = "".encode('utf-8')
    title = u''.join(c for c in r[0].replace("\n", " ")).encode('utf-8') + "\t".encode('utf-8')
    description = u''.join(c for c in r[1]).encode('utf-8') + "\t".encode('utf-8')
    toWrite += title + description
    toWrite += str(r[2]).encode('utf-8') + newLine.encode('utf-8')
    a.write(toWrite)
a.close()
pandas.read_table() is deprecated – use read_csv() instead. And rather than writing lots of code by hand that produces something CSV-like but can't cope with record or field delimiters within fields, really use the CSV format: the csv module in the Python standard library does this for you.
Opening the file as a text file and passing the encoding to open() spares you from encoding everything yourself in different places.
#!/usr/bin/env python3
from contextlib import closing
import csv
import sqlite3

def main():
    with sqlite3.connect("JobDetails.db") as connection:
        with closing(connection.cursor()) as cursor:
            #
            # TODO Having two tables with the same columns for essentially
            # the same kind of records smells like a broken DB design.
            #
            rows = list()
            for table_name in ["reviewjobs", "jobs"]:
                cursor.execute(
                    f"SELECT title, description, category FROM {table_name}"
                )
                rows.extend(cursor.fetchall())

    with open("ClassTest.txt", "a", encoding="utf8", newline="") as csv_file:
        writer = csv.writer(csv_file, delimiter="\t")
        writer.writerow(["Title", "Description", "Category"])
        for title, description, category in rows:
            writer.writerow([title.replace("\n", " "), description, category])

if __name__ == "__main__":
    main()
And then in the other program:
data = pd.read_csv("ClassTest.txt", delimiter="\t")

change seq name in a fasta file with a dataframe

I have a problem; let me explain.
I have a fasta file like this:
>seqA
AAAAATTTGG
>seqB
ATTGGGCCG
>seqC
ATTGGCC
>seqD
ATTGGACAG
and a dataframe:
seq name New name seq
seqB BOBO
seqC JOHN
and I simply want to change the sequence ID in my fasta file whenever the same seq name appears in my dataframe, replacing it with the new name seq, which would give:
New fasta file:
>seqA
AAAAATTTGG
>BOBO
ATTGGGCCG
>JOHN
ATTGGCC
>seqD
ATTGGACAG
Thank you very much
edit:
I used this script:
blast = pd.read_table("matches_Busco_0035_0042.m8", header=None)
blast.columns = ["qseqid", "Busco_ID", "pident", "length", "mismatch", "gapopen", "qstart", "qend", "sstart", "send", "evalue", "bitscore"]
repl = blast[blast.pident > 95]
print(repl)
#substitution dataframe
newfile = []
count = 0
for rec in SeqIO.parse("concatenate_0035_0042_aa2.fa", "fasta"):
    #get corresponding value for record ID from dataframe
    x = repl.loc[repl.seq == rec.id, "Busco_ID"]
    #change record, if not empty
    if x.any():
        rec.name = rec.description = rec.id = x.iloc[0]
        count += 1
    #append record to list
    newfile.append(rec)
#write list into new fasta file
SeqIO.write(newfile, "changedtest.faa", "fasta")
#tell us, how hard you had to work for us
print("I changed {} entries!".format(count))
And I got the following error:
Traceback (most recent call last):
File "Get_busco_blast.py", line 74, in <module>
x = repl.loc[repl.seq == rec.id, "Busco_ID"]
File "/usr/local/lib/python3.6/site-packages/pandas/core/generic.py", line 3614, in __getattr__
return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute 'seq'
It's easier to do this with something like BioPython.
First create a dictionary mapping old names to new names:
names = pd.Series(df['New name seq'].values, index=df['seq name']).to_dict()
Now iterate over the records:
from Bio import SeqIO

outs = []
for record in SeqIO.parse("orig.fasta", "fasta"):
    # update description too, so the old name doesn't linger in the header
    record.id = record.description = names.get(record.id, record.id)
    outs.append(record)
SeqIO.write(outs, "new.fasta", "fasta")
If you have Biopython installed, then you can use SeqIO to read/write fasta files:
from Bio import SeqIO
import numpy as np
import pandas as pd

#substitution dataframe
repl = pd.DataFrame(np.asarray([["seqB_3652_i36", "Bob"], ["seqC_123_6XXX1", "Patrick"]]),
                    columns=["seq", "newseq"])

newfile = []
count = 0
for rec in SeqIO.parse("test.faa", "fasta"):
    #get corresponding value for record ID from dataframe
    #repl["seq"] and repl["newseq"] are the pandas columns with the old and new sequence names, respectively
    x = repl.loc[repl["seq"] == rec.id, "newseq"]
    #change record, if not empty
    if x.any():
        #append old identifier number to the new id name
        rec.name = rec.description = rec.id = x.iloc[0] + rec.id[rec.id.index("_"):]
        count += 1
    #append record to list
    newfile.append(rec)

#write list into new fasta file
SeqIO.write(newfile, "changedtest.faa", "fasta")
#tell us, how hard you had to work for us
print("I changed {} entries!".format(count))
Please note that this script doesn't check for multiple entries in the substitution table: it just takes the first matching element, and leaves the record unchanged if its id is not in the dataframe.
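If you want to reuse this with the dataframe from the question (columns 'seq name' and 'New name seq'), here is a hedged sketch that also makes the "first match wins" behaviour explicit by dropping duplicate old names up front; df is assumed to be that dataframe:

repl = (df.drop_duplicates(subset="seq name")
          .rename(columns={"seq name": "seq", "New name seq": "newseq"}))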

How to ignore commas within cells when reading data from a CSV - Python

I am attempting to read data from a CSV file and load it into a DynamoDB table. The issue is that the description field is written in sentences and contains commas. How do I read the columns with a comma delimiter, but ignore the commas within the cells?
Currently, I am using this code to read the CSV file and write to the DB:
def import_csv_to_dynamodb(table_name, csv_file_name, col_names, column_types):
    '''
    Import a CSV file to a DynamoDB table
    '''
    dynamodb_conn = boto.connect_dynamodb(aws_access_key_id=MY_ACCESS_KEY_ID,
                                          aws_secret_access_key=MY_SECRET_ACCESS_KEY)
    dynamodb_table = dynamodb_conn.get_table(table_name)
    BATCH_COUNT = 2  # 25 is the maximum batch size for Amazon DynamoDB

    items = []
    count = 0
    csv_file = open(csv_file_name, 'r', encoding="utf-8-sig")
    for cur_line in csv_file:
        count += 1
        cur_line = cur_line.strip().split(',')

        row = {}
        for col_number, col_name in enumerate(col_names):
            row[col_name] = column_types[col_number](cur_line[col_number])

        item = dynamodb_table.new_item(
            attrs=row
        )
        items.append(item)

        if count % BATCH_COUNT == 0:
            print('batch write start ... ')
            do_batch_write(items, table_name, dynamodb_table, dynamodb_conn)
            items = []
            print('batch done! (row number: ' + str(count) + ')')

    # flush remaining items, if any
    if len(items) > 0:
        do_batch_write(items, table_name, dynamodb_table, dynamodb_conn)

    csv_file.close()
Python's built-in csv module handles this for you: fields that are wrapped in quotes keep their embedded commas. The documentation covers the details:
https://docs.python.org/3/library/csv.html
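For instance, a minimal sketch (hypothetical file name) of how csv.reader would slot into the loop above; it splits on commas but leaves commas inside quoted cells alone, unlike str.split(','):

import csv

with open("items.csv", "r", encoding="utf-8-sig", newline="") as csv_file:
    for cur_line in csv.reader(csv_file):
        # cur_line is already a list of cell values; a quoted description
        # such as "Red, round, shiny" arrives as one element.
        print(cur_line)

In import_csv_to_dynamodb that would replace cur_line = cur_line.strip().split(',') with iterating over csv.reader(csv_file).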

Txt file to excel conversion in python

I'm trying to convert a text file to an Excel sheet in Python. The txt file contains data in the format specified below.
Column names: reg no, zip code, loc id, emp id, lastname, first name. Each record has one or more error numbers. Each record has its column names listed above its values. I would like to create an Excel sheet containing reg no, firstname, lastname and the errors listed in separate rows for each record.
How can I put the records in an Excel sheet? Should I be using regular expressions? And how can I insert the error numbers in different rows for the corresponding record?
Expected output:
Here is the link to the input file:
https://github.com/trEaSRE124/Text_Excel_python/blob/master/new.txt
Any code snippets or suggestions are kindly appreciated.
Here is a draft; let me know if any changes are needed:
# import pandas as pd
from collections import OrderedDict
from datetime import date
import csv

with open('in.txt') as f:
    with open('out.csv', 'wb') as csvfile:
        spamwriter = csv.writer(csvfile, delimiter=',', quoting=csv.QUOTE_MINIMAL)

        # Remove initial clutter
        while("INPUT DATA" not in f.readline()):
            continue

        header = ["REG NO", "ZIP CODE", "LOC ID", "EMP ID", "LASTNAME", "FIRSTNAME", "ERROR"]; data = list(); errors = list()
        spamwriter.writerow(header)
        print header

        while(True):
            line = f.readline()
            errors = list()
            if("END" in line):
                exit()
            try:
                int(line.split()[0])
                data = line.strip().split()
                f.readline()  # get rid of \n
                line = f.readline()
                while("ERROR" in line):
                    errors.append(line.strip())
                    line = f.readline()
                spamwriter.writerow(data + errors)
                csvfile.flush()  # csv.writer has no flush(); flush the underlying file
            except:
                continue

        # while(True):
        #     line = f.readline()
Run it with Python 2. The errors are appended as subsequent columns. Getting them into separate rows the way you want is slightly more complicated; I can fix it if it's still needed.
Output looks like:
You can do this using the openpyxl library which is capable of depositing items directly into a spreadsheet. This code shows how to do that for your particular situation.
NEW_PERSON, ERROR_LINE = 1, 2

def Line_items():
    with open('katherine.txt') as katherine:
        for line in katherine:
            line = line.strip()
            if not line:
                continue
            items = line.split()
            if items[0].isnumeric():
                yield NEW_PERSON, items
            elif items[:2] == ['ERROR', 'NUM']:
                yield ERROR_LINE, line
            else:
                continue

from openpyxl import Workbook

wb = Workbook()
ws = wb.active

ws['A2'] = 'REG NO'
ws['B2'] = 'LASTNAME'
ws['C2'] = 'FIRSTNAME'
ws['D2'] = 'ERROR'

row = 2
for kind, data in Line_items():
    if kind == NEW_PERSON:
        row += 2
        ws['A{:d}'.format(row)] = int(data[0])
        ws['B{:d}'.format(row)] = data[-2]
        ws['C{:d}'.format(row)] = data[-1]
        first = True
    else:
        if first:
            first = False
        else:
            row += 1
        ws['D{:d}'.format(row)] = data

wb.save(filename='katherine.xlsx')
This is a screen snapshot of the result.
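If you'd rather lean on pandas (note the commented-out import in the first answer), here is a rough alternative sketch, assuming the records have already been parsed into (reg no, lastname, firstname, error) tuples by something like Line_items() above; the rows below are made up for illustration:

import pandas as pd

records = [  # made-up example rows
    (1234, "DOE", "JANE", "ERROR NUM 123"),
    (1234, "DOE", "JANE", "ERROR NUM 456"),
]
df = pd.DataFrame(records, columns=["REG NO", "LASTNAME", "FIRSTNAME", "ERROR"])
df.to_excel("records.xlsx", index=False)  # writing .xlsx needs openpyxl installed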

How to bypass IndexError

Here is my situation: my code parses data out of HTML tables that are inside emails. The roadblock I'm running into is that some of these tables have blank rows right in the middle, as seen in the photo below. This blank space causes my code to fail (IndexError: list index out of range) when it attempts to extract text from the cells.
Is it possible to tell Python: "OK, if you run into this error that comes from these blank rows, just stop there, take the rows you have extracted text from so far, and execute the rest of the code on those"?
That might sound like a dumb solution, but my project only involves taking data from the most recent date in the table anyway, which is always among the first few rows and always before these blank rows.
So if it is possible to say "if you hit this error, just ignore it and proceed", I would like to learn how to do that. If it's not, I'll have to figure out another way around this. Thanks for any and all help.
The table with the gap:
My code:
from bs4 import BeautifulSoup, NavigableString, Tag
import pandas as pd
import numpy as np
import os
import re
import email
import cx_Oracle

dsnStr = cx_Oracle.makedsn("sole.nefsc.noaa.gov", "1526", "sole")
con = cx_Oracle.connect(user="user", password="password", dsn=dsnStr)

def celltext(cell):
    '''
    textlist=[]
    for br in cell.findAll('br'):
        next = br.nextSibling
        if not (next and isinstance(next,NavigableString)):
            continue
        next2 = next.nextSibling
        if next2 and isinstance(next2,Tag) and next2.name == 'br':
            text = str(next).strip()
            if text:
                textlist.append(next)
    return (textlist)
    '''
    textlist = []
    y = cell.find('span')
    for a in y.childGenerator():
        if isinstance(a, NavigableString):
            textlist.append(str(a))
    return (textlist)

path = 'Z:\\blub_2'

for filename in os.listdir(path):
    file_path = os.path.join(path, filename)
    if os.path.isfile(file_path):
        html = open(file_path, 'r').read()
        soup = BeautifulSoup(html, 'lxml')  # Parse the HTML as a string
        table = soup.find_all('table')[1]  # Grab the second table

        df_Quota = pd.DataFrame()

        for row in table.find_all('tr'):
            columns = row.find_all('td')
            if columns[0].get_text().strip() != 'ID':  # skip header
                Quota = celltext(columns[1])
                Weight = celltext(columns[2])
                price = celltext(columns[3])
                print(Quota)
                Nrows = max([len(Quota), len(Weight), len(price)])  # get the max number of rows
                IDList = [columns[0].get_text()] * Nrows
                DateList = [columns[4].get_text()] * Nrows
                if price[0].strip() == 'Package':
                    price = [columns[3].get_text()] * Nrows
                if len(Quota) < len(Weight):  # if Quota has fewer items, extend with NaN
                    lstnans = [np.nan] * (len(Weight) - len(Quota))
                    Quota.extend(lstnans)
                if len(price) < len(Quota):  # if price column has fewer items than quota column,
                    val = [columns[3].get_text()] * (len(Quota) - len(price))  # extend with whatever
                    price.extend(val)                                          # is in the price column
                #if len(DateList) > len(Quota):  # if DateList is longer than Quota,
                    #print("it's longer than")
                    #value = [columns[4].get_text()] * (len(DateList) - len(Quota))
                    #DateList = value * Nrows
                if len(Quota) < len(DateList):  # if Quota is shorter than DateList (due to gap),
                    stu = [np.nan] * (len(DateList) - len(Quota))  # extend with NaN
                    Quota.extend(stu)
                if len(Weight) < len(DateList):
                    dru = [np.nan] * (len(DateList) - len(Weight))
                    Weight.extend(dru)
                FinalDataframe = pd.DataFrame(
                    {
                        'ID': IDList,
                        'AvailableQuota': Quota,
                        'LiveWeightPounds': Weight,
                        'price': price,
                        'DatePosted': DateList
                    })
                df_Quota = df_Quota.append(FinalDataframe, ignore_index=True)

        #df_Quota = df_Quota.loc[df_Quota['DatePosted']=='5/20']
        df_Q = df_Quota['DatePosted'].iloc[0]
        df_Quota = df_Quota[df_Quota['DatePosted'] == df_Q]
        print(df_Quota)

for filename in os.listdir(path):
    file_path = os.path.join(path, filename)
    if os.path.isfile(file_path):
        with open(file_path, 'r') as f:
            pattern = re.compile(r'Sent:.*?\b(\d{4})\b')
            email = f.read()
            dates = pattern.findall(email)
            if dates:
                print("Date:", ''.join(dates))

#cursor = con.cursor()
#exported_data = [tuple(x) for x in df_Quota.values]
#sql_query = ("INSERT INTO ROUGHTABLE(species, date_posted, stock_id, pounds, money, sector_name, ask)" "VALUES (:1, :2, :3, :4, :5, 'NEFS 2', '1')")
#cursor.executemany(sql_query, exported_data)
#con.commit()
#cursor.close()
#con.close()
continue is the keyword to use for skipping empty/problem rows. The IndexError comes from trying to access columns[0] on an empty columns list, so just skip to the next row when that exception is raised.
for row in table.find_all('tr'):
    columns = row.find_all('td')
    try:
        if columns[0].get_text().strip() != 'ID':
            ...  # Rest as above in original code.
    except IndexError:
        continue
Use try: ... except: ...:
try:
    ...  # extract data from table
except IndexError:
    ...  # execute rest of the program
