Create table from CSV tree file using Python

I have a CSV file that appears to have the structure of a tree (the file has 3000 lines):
A,
,B
,,B1
,,B2
,,,,,B2a
,C
,,C1
,,,C1a
,,C2
,,,,,C2a1a
I would like to parse the file to obtain a table that looks like this:
Parent Child
A B
B B1
B B2
B2 B2a
A C
C C1
C1 C1a
C C2
C2 C2a1a
Note that the leaves B2a and C2a1a have extra commas but are still related to their closest parent.

You can try YAML:
import yaml
import re
import io
s = """A,
,B
,,B1
,,B2
,,,,,B2a
,C
,,C1
,,,C1a
,,C2
,,,,,C2a1a"""
s_ = re.sub(r'(,*[\w\d]+)', r'\1:', s)
parsed = yaml.safe_load(io.StringIO(s_.replace(',', ' ')))  # commas become YAML indentation

def flatten_and_print(d):
    for k, v in d.items():
        if isinstance(v, dict):
            for k2 in v:
                print(k, k2)
            flatten_and_print(v)

flatten_and_print(parsed)
# A B
# A C
# B B1
# B B2
# B2 B2a
# C C1
# C C2
# C1 C1a
# C2 C2a1a

You can use a stack for this, which maintains the path from the root to the node currently being read from the input.
As the number of commas can apparently increase by more than 1 at a time, the stack should store the depth of each element alongside it.
Here is an implementation:
def pairs(csv):
    stack = []
    for line in csv.splitlines():
        name = line.lstrip(",")
        depth = len(line) - len(name)
        name = name.rstrip(",")
        while stack and depth <= stack[-1][0]:
            stack.pop()
        if stack:
            yield stack[-1][1], name
        stack.append((depth, name))
Here is how you could call it:
csv = """A,
,B
,,B1
,,B2
,,,,,B2a
,C
,,C1
,,,C1a
,,C2
,,,,,C2a1a"""
for pair in pairs(csv):
    print(*pair)
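If the goal is literally the two-column Parent/Child table from the question, the generator's output can be fed straight into a table structure. A minimal sketch, assuming pandas is available (not part of the original answer) and reusing the pairs() function defined above:
import pandas as pd

# Collect the (parent, child) pairs produced by the generator into a table
table = pd.DataFrame(list(pairs(csv)), columns=["Parent", "Child"])
print(table)
table.to_csv("parent_child.csv", index=False)  # hypothetical output file name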

Related

What is a more efficient way of calculating means within an array?

I have some data that is stored as daily means over 20 years, and I would like to create a new array that tells me the monthly means over the same period.
I am not very experienced with Python, so the best I could figure out was something like the following:
dis2 = "array of daily means"
a2 = np.sum(dis2[:365])
b2 = np.sum(dis2[365:731])
c2 = np.sum(dis2[731:1096])
d2 = np.sum(dis2[1096:1461])
e2 = np.sum(dis2[1461:1826])
f2 = np.sum(dis2[1826:2191])
g2 = np.sum(dis2[2191:2556])
h2 = np.sum(dis2[2556: 2921])
i2 = np.sum(dis2[2921:3286])
j2 = np.sum(dis2[3286:3651])
k2 = np.sum(dis2[3651:4016])
l2 = np.sum(dis2[4016:4381])
m2 = np.sum(dis2[4381:4746])
n2 = np.sum(dis2[4746:5111])
o2 = np.sum(dis2[5111:5476])
p2 = np.sum(dis2[5476:5841])
q2 = np.sum(dis2[5841:6206])
r2 = np.sum(dis2[6206:6571])
s2 = np.sum(dis2[6571:6936])
t2 = np.sum(dis2[6936:7301])
z2 = [a2,b2,c2,d2,e2,f2,g2,h2,i2,j2, k2, l2, m2, n2, o2, p2, q2, r2, s2, t2]
z2 = [i/365 for i in z2]
The above method of course only gives me the yearly means, and to get the monthly means this way I'd need well over a hundred variables. I am certain there must be a simpler, more efficient way of doing this, but I don't have the experience to work out what it is.
In case it is at all relevant, here is how I loaded my data:
filename = 'LOWL.txt'
f2 = open(filename, 'r')
date2 = []
discharge2 = []
lines = f2.readlines()
import pandas as pd
data2 = pd.read_csv('LOWL.txt',sep='\t',header=None,usecols=[2,3])
date2 = data2[2].values
discharge2 = data2[3].values
date2 = np.array(date2, dtype = "datetime64")
dis2 = [float(i) for i in discharge2]
Combining np.mean(), slicing and a list comprehension makes for a more efficient way of doing your calculation:
z2 = [np.mean(dis2[i:(i+365)]) for i in range(0, len(dis2), 365)]
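The question actually asks for monthly rather than yearly means; since the dates are already parsed into date2 in the loading code above, a datetime-indexed pandas Series can do that grouping directly. A minimal sketch, assuming date2 and dis2 are defined as in the question:
import pandas as pd

# Monthly means from daily values: index the series by date and resample by calendar month
series = pd.Series(dis2, index=pd.to_datetime(date2))
monthly_means = series.resample('M').mean()   # 'M' = month-end bins
print(monthly_means.head())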

How to get a value from a two-dimensional list that is in an adjacent column from one that matches the value of another two-dimensional list

I have 2 files that I converted to list-of-lists format. Short examples:
a
c1 165.001 17593685
c2 1650.94 17799529
c3 16504399 17823261
b
1 rs3094315 **0.48877594** *17593685* G A
1 rs12562034 0.49571378 768448 A G
1 rs12124819 0.49944228 776546 G A
Using a 'for' loop I tried to find the common values of these lists, but I can't get the looping right. I need to get the value that is adjacent to the value that is common to the two lists (in this example it is 0.48877594, since 17593685 is common to 'a' and 'b'). My attempts, which completely froze:
for i in a:
    if i[2] == [d[3] for d in b]:
        print(i[0], i[2] + d[2])
or
for i in a and d in b:
    if i[2] == d[3]
        print(i[0], i[2] + d[2]
Overall I need to get the first file with a new column, which will be that bold adjacent value. It is my first month of programming and I can't work out the logic. Thanks in advance!
+++
List's original format:
a = [['c1', '165.001', '17593685'], ['c2', '1650.94', '17799529'], ['c3', '16504399', '17823261']]
[['c1', '16504399', '17593685.1\n'], ['c2', '16504399', '17799529.1\n'], ['c3', '16504399', '17823261.\n']]
++++ My original data
Two or more people can have DNA segments that are the same because they were inherited from a common ancestor. File 'a' contains the following columns:
SegmentID, start of segment, end of segment, IDs of individuals that share this segment (from 2 upwards). Example (just a small part, since the real list has > 1000 rows, i.e. segments 'c'); the number of individuals can differ:
c1 16504399 17593685 19N 19N.0 19N 19N.0 182AR 182AR.0 182AR 182AR.0 6i 6i.1 6i 6i.1 153A 153A.1 153A 153A.1
c2 14404399 17799529 62BB 62BB.0 62BB 62BB.0 55k 55k.0 55k 55k.0 190k 190k.0 190k 190k.0 51A 51A.1 51A 51A.1 3A 3A.1 3A 3A.1 38k 38k.1 38k 38k.1
c3 1289564 177953453 164Bur 164Bur.0 164Bur 164Bur.0 38BO 38BO.1 38BO 38BO.1 36i 36i.1 36i 36i.1 100k 100k.1 100k 100k.1
File 'b':
This one always has 6 columns, but the number of rows is more than 100 million, so only part of it:
1 rs3094315 0.48877594 16504399 G A
1 rs12562034 0.49571378 17593685 A G
1 rs12124819 0.49944228 14404399 G A
1 rs3094221 0.48877594 17799529 G A
1 rs12562222 0.49571378 1289564 A G
1 rs121242223 0.49944228 177953453 G A
So, I need to compare a[1] with b[3] and, if they are equal,
print(a[1], b[3]), because b[3] is the position of the segment too, just in another measurement system. That is what I can't do.
Taking a leap (because the question isn't really clear), I think you are looking for the product of a, b, e.g.:
In []:
for i in a:
    for d in b:
        if i[2] == d[3]:
            print(i[0], i[2] + d[2])
Out[]:
c1 175936850.48877594
You can do the same with itertools.product():
In []:
import itertools as it
for i, d in it.product(a, b):
    if i[2] == d[3]:
        print(i[0], i[2] + d[2])
Out[]:
c1 175936850.48877594
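Not part of the original answer, but since file b can have 100 million rows, the pairwise scan above will be slow; building a dictionary keyed on the position column of b turns each check into a constant-time lookup. A minimal sketch, assuming a and b are the lists of lists from the question:
# Map position (b[3]) to the adjacent value (b[2]) once, then look up each row of a
lookup = {d[3]: d[2] for d in b}
for i in a:
    if i[2] in lookup:
        print(i[0], i[2], lookup[i[2]])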
It would be much faster to leave your data as strings and search:
for a_line in [_ for _ in a.split('\n') if _]:  # skip blank lines
    search_term = a_line.strip().split()[-1]  # get search term
    term_loc_in_b = b.find(search_term)  # get search term location in file b
    if term_loc_in_b != -1:  # -1 means term not found
        # split b once just before search term starts
        value_in_b = b[:term_loc_in_b].strip().rsplit(maxsplit=1)[-1]
        print(value_in_b)
    else:
        print('{} not found'.format(search_term))
If the file size is large you might consider using mmap to search b.
mmap.find requires bytes, e.g. search_term.encode().
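A minimal sketch of that mmap variant, mirroring the string-search loop above; it assumes file b lives on disk as 'b.txt' (a hypothetical path) and that a is the in-memory string of file a:
import mmap

# Hypothetical sketch: search a large on-disk file 'b.txt' without reading it all into a str
with open('b.txt', 'rb') as fh:
    mm = mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ)
    for a_line in [_ for _ in a.split('\n') if _]:
        search_term = a_line.strip().split()[-1]
        loc = mm.find(search_term.encode())   # mmap.find works on bytes
        if loc != -1:
            # the value just before the match, decoded back to str
            value_in_b = mm[:loc].strip().rsplit(maxsplit=1)[-1].decode()
            print(value_in_b)
        else:
            print('{} not found'.format(search_term))
    mm.close()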

An elegant, readable way to read Butcher tableau from a file

I'm trying to read a specifically formatted file (namely, a Butcher tableau) in Python 3.5.
The file looks like this (tab separated):
S
a1 b11 b12 ... b1S
a2 b21 b22 ... b2S
...
aS bS1 bS2 ... bSS
0.0 c1 c2 ... cS
[tolerance]
For example (tab separated):
2
0.0 0.0 0.0
1.0 0.5 0.5
0.0 0.5 0.5
0.0001
So my code looks like I'm writing in C. Is there a more Pythonic approach to parsing this file? Maybe there are NumPy methods that could be used here?
# the data from the .dat file
S = 0          # method order, first char in the .dat file
a = []         # S-dim left column of the Butcher tableau
b = []         # S x S matrix of the Butcher tableau
c = []         # S-dim lower row
tolerance = 0  # for implicit methods

def parse_method(file_name):
    'read the file_name, process lines, produce a Method object'
    try:
        with open('methods\\' + file_name) as file:
            global S
            S = int(next(file))
            temp = []
            for line in file:
                temp.append([float(x) for x in line.replace('\n', '').split('\t')])
            for i in range(S):
                a.append(temp[i].pop(0))
                b.append(temp[i])
            global c
            c = temp[S][1:]
            global tolerance
            tolerance = temp[-1][0] if len(temp) > S + 1 else 0
    except OSError as ioerror:
        print('File Error: ' + str(ioerror))
My suggestion using NumPy:
import numpy as np
def read_butcher(filename):
    with open(filename, 'rb') as fh:
        S = int(fh.readline())
        array = np.fromfile(fh, float, (S+1)**2, '\t')
        rest = fh.read().strip()
    array.shape = (S+1, S+1)
    a = array[:-1, 0]
    b = array[:-1, 1:]
    c = array[-1, 1:]
    tolerance = float(rest) if rest else 0.0
    return a, b, c, tolerance
Although I'm not entirely sure how consistently numpy.fromfile advances the file pointer... There are no guarantees in the documentation.
Handling of file exceptions should probably be done outside of the parsing method.
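Not part of the original answer, but if the file-pointer behaviour of numpy.fromfile is a concern, one way to sidestep it is to read the remaining lines yourself and parse the first S+1 of them with np.loadtxt. A minimal sketch under the same file-format assumptions (the name read_butcher_loadtxt is just illustrative):
import io
import numpy as np

def read_butcher_loadtxt(filename):
    with open(filename) as fh:
        S = int(fh.readline())
        lines = [line for line in fh if line.strip()]
    # the first S+1 lines form the (S+1) x (S+1) tableau; anything after is the tolerance
    array = np.loadtxt(io.StringIO(''.join(lines[:S + 1])))
    a = array[:-1, 0]
    b = array[:-1, 1:]
    c = array[-1, 1:]
    tolerance = float(lines[S + 1]) if len(lines) > S + 1 else 0.0
    return a, b, c, tolerance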
Code -
from collections import namedtuple
def parse_file(file_name):
    with open(file_name, 'r') as f:
        file_content = f.readlines()
    file_content = [line.strip('\n') for line in file_content]
    s = int(file_content[0])
    a = [float(file_content[i].split()[0]) for i in range(1, s + 1)]
    b = [list(map(float, file_content[i].split()[1:]))
         for i in range(1, s + 1)]
    c = list(map(float, file_content[-2].split()))
    tolerance = float(file_content[-1])
    ButcherTableau = namedtuple('ButcherTableau', 's a b c tolerance')
    bt = ButcherTableau(s, a, b, c, tolerance)
    return bt
p = parse_file('a.txt')
print('S :', p.s)
print('a :', p.a)
print('b :', p.b)
print('c :', p.c)
print('tolerance :', p.tolerance)
Output -
S : 2
a : [0.0, 1.0]
b : [[0.0, 0.0], [0.5, 0.5]]
c : [0.0, 0.5, 0.5]
tolerance : 0.0001
Here's a bunch of suggestions you should consider:
from types import SimpleNamespace
import csv

def parse_method(file_name):
    # for convenience, collect the results in a simple record object
    # (a namedtuple would not work here, since its fields are immutable)
    bt = SimpleNamespace(a=[], b=[], c=[], order=0, tolerance=0)
    line = None
    # advice ①: do not assume the file path inside a function; make assumptions as close
    # to your main function as possible (to make it easier to parameterize later on)
    # advice ②: do not call your file object "file", so you are not shadowing the
    # built-in class name of the same spelling
    with open(file_name, 'r') as f:
        # read the first line alone to set up your "method order" value
        # before reading all the tab separated values
        bt.order = int(f.readline())
        # create a csv reader with tab as the cell separator
        # and enumerate it to have an index for each line
        for idx, line in enumerate(csv.reader(f, delimiter='\t')):
            # instead of iterating again, you can just check the index
            # and build your a and b values
            if idx < bt.order:
                bt.a.append(line.pop(0))
                bt.b.append(line)
            # the row right after the a/b block holds the c values
            elif idx == bt.order:
                bt.c = line[1:]
        # if line is still None (as set before the for), it means we did not iterate,
        # so the file is empty and that is an error
        if line is None:
            raise Exception("File is empty. Could not parse {}".format(file_name))
        # the optional tolerance line, if present, is conveniently still available
        # in "line" once the for loop is finished
        if idx > bt.order:
            bt.tolerance = line[0]
    # avoid the globals: return the record instead and use the results in the caller function
    return bt
This code is untested (just a rework of your code as I read it), so it might not work as is, but you might want to take the good ideas and make them your own.

Operating on a huge table: group of rows at a time using python

I have a huge table file that looks like the following. In order to work on individual products (name), I tried to use pandas groupby, but it seems to put the whole table (~10G) in memory, which I cannot afford.
name index change
A Q QQ
A Q QQ
A Q QQ
B L LL
C Q QQ
C L LL
C LL LL
C Q QQ
C L LL
C LL LL
The name column is well sorted and I will only care about one name at a time. I hope to use the following criteria on column "change" to filter each name:
Check whether the number of "QQ" overwhelms the number of "LL". Basically, if the number of rows containing "QQ" minus the number of rows containing "LL" is >= 2, then discard/ignore the "LL" rows for this name from now on. If "LL" overwhelms "QQ", then discard the rows with "QQ". (E.g. A has 3 QQ and 0 LL, and C has 4 LL and 2 QQ; both are fine.)
Resulting table:
name index change
A Q QQ
A Q QQ
A Q QQ
C L LL
C LL LL
C L LL
C LL LL
Comparing "change" to "index", if no change occurs (e.g. LL in both columns), the row is not valid. Further, for the valid changes, the remaining QQ or LL has to be continuous for >=3 times. Therefore C only has 2 valid changes, and it will be filtered out.
Resulting table:
name index change
A Q QQ
A Q QQ
A Q QQ
I wonder if there is a way to work on the table name by name and release the memory after each name (and not have to apply the two criteria step by step). Any hint or suggestion will be appreciated!
Because the file is sorted by "name", you can read the file row-by-row:
def process_name(name, data, output_file):
    group_by = {}
    for index, change in data:
        if index not in group_by:
            group_by[index] = []
        group_by[index].append(change)
    # do the step 1 filter logic here
    # do the step 2 filter logic here
    for index in group_by:
        if index in group_by[index]:
            # Because there is at least one "no change" this
            # whole "name" can be thrown out, so return here.
            return
    for index in group_by:
        for change in group_by[index]:
            output_file.write("%s\t%s\t%s\n" % (name, index, change))

current_name = None
current_data = []
input_file = open(input_filename, "r")
output_file = open(output_filename, "w")
header = input_file.readline()
for row in input_file:
    cols = row.strip().split("\t")
    name = cols[0]
    index = cols[1]
    change = cols[2]
    if name != current_name:
        if current_name is not None:
            process_name(current_name, current_data, output_file)
        current_name = name
        current_data = []
    current_data.append((index, change))
# process what's left in the buffer
if current_name is not None:
    process_name(current_name, current_data, output_file)
input_file.close()
output_file.close()
I don't totally understand the logic you've explained in #1, so I left that blank. I also feel like you probably want to do step #2 first as that will quickly rule out entire "name"s.
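Not part of the original answer, but since the step 1 counting rule was left blank above, here is one possible reading of it, written against the (index, change) pairs that process_name receives; the thresholds are taken from the question as stated and are worth confirming:
def filter_step1(data):
    # data is a list of (index, change) tuples for one name
    qq = sum(1 for _, change in data if change == "QQ")
    ll = sum(1 for _, change in data if change == "LL")
    if qq - ll >= 2:
        # QQ overwhelms LL: drop the LL rows for this name
        return [(i, c) for i, c in data if c != "LL"]
    if ll - qq >= 2:
        # LL overwhelms QQ: drop the QQ rows for this name
        return [(i, c) for i, c in data if c != "QQ"]
    return data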
Since your file is sorted and you only seem to be operating on the sub segments by name, perhaps just use Python's groupby and create a table for each name segment as you go:
from itertools import groupby
import pandas as pd
with open('/tmp/so.csv') as f:
    header = next(f).split()
    for k, segment in groupby(f, key=lambda line: line.split()[0]):
        seg_data = {col: [] for col in header}
        for e in segment:
            for key, v in zip(header, e.split()):
                seg_data[key].append(v)
        seg_fram = pd.DataFrame.from_dict(seg_data)
        print(k)
        print(seg_fram)
        print()
Prints:
A
change index name
0 QQ Q A
1 QQ Q A
2 QQ Q A
B
change index name
0 LL L B
C
change index name
0 QQ Q C
1 LL L C
2 LL LL C
3 QQ Q C
4 LL L C
5 LL LL C
Then the largest piece of memory you will have will be dictated by the largest contiguous group and not the size of the file.
You can use 1/2 the memory of that method by appending to the data frame row by row instead of building the intermediate dict:
with open('/tmp/so.csv') as f:
    header = next(f).split()
    for k, segment in groupby(f, key=lambda line: line.split()[0]):
        seg_fram = pd.DataFrame(columns=header)
        for idx, e in enumerate(segment):
            df = pd.DataFrame({key: v for key, v in zip(header, e.split())}, index=[idx])
            # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
            seg_fram = pd.concat([seg_fram, df])
(might be slower though...)
If that does not work, consider using a disk database.
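If you do go the disk-database route, the standard-library sqlite3 module is enough for this shape of problem. A minimal illustrative sketch (the database file and table names are placeholders) that loads the rows once and then pulls one name at a time:
import sqlite3

# Hypothetical sketch: 'groups.db' and the table name are placeholders, adjust to your data
conn = sqlite3.connect('groups.db')
conn.execute('CREATE TABLE IF NOT EXISTS rows (name TEXT, idx TEXT, change TEXT)')

with open('/tmp/so.csv') as f:
    next(f)  # skip the header line
    conn.executemany('INSERT INTO rows VALUES (?, ?, ?)',
                     (line.split() for line in f))
conn.commit()

# pull one name at a time so only that group is ever in memory
names = [n for (n,) in conn.execute('SELECT DISTINCT name FROM rows')]
for name in names:
    group = conn.execute('SELECT idx, change FROM rows WHERE name = ?',
                         (name,)).fetchall()
    # apply the two filtering criteria to "group" here
    print(name, len(group))
conn.close()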

Writing to one column using openpyxl

I want to write the values of the list only to column A of the new workbook, for example:
a1 = 1
a2 = 2
a3 = 3
etc. etc. but right now I get this:
a1 = 1 b1 = 2 c1= 3 d1= 4
a1 = 1 b1 = 2 c1= 3 d1= 4
a1 = 1 b1 = 2 c1= 3 d1= 4
My code:
# create new workbook and worksheet
from openpyxl import Workbook

values = [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15]
wb = Workbook(write_only = True)
ws = wb.create_sheet()
for row in range(0, len(values)):
    ws.append([i for i in values])
wb.save('newfile.xlsx')
The code above fills every cell in the range A1:O15; I only want to fill the values down column A (A1:A15).
Tested: note that the loop variable is row but the value you append has to come from values, and you only want one value per appended row. The following works:
for row in range(0, len(values)):
    ws.append([values[row]])
You need to create a nested list to append the values into a single column; see the code below.
from openpyxl import Workbook
values = [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15]
wb = Workbook()
ws = wb.create_sheet()
newlist = [[i] for i in values]
print(newlist)
for x in newlist:
    ws.append(x)
wb.save('newfile.xlsx')
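As a side note (not from the original answers): the question's code used Workbook(write_only=True), and write-only worksheets only support adding rows with append(), so the one-value-per-row pattern above is the way to go there. With a regular workbook you can also write straight into column A with cell(); a small illustrative sketch:
from openpyxl import Workbook

values = [1, 2, 3, 4, 5]
wb = Workbook()
ws = wb.active
# put each value into column A, one row per value
for row, value in enumerate(values, start=1):
    ws.cell(row=row, column=1, value=value)
wb.save('newfile.xlsx')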
