This question is a follow-up to: Openpyxl: TypeError - Concatenation of several columns into one cell per row
What I want to do:
I want to concatenate the cells from columns F to M per row and put the concatenated value into column E like below. This needs to be done for all rows at the same time.
Input:
A B C D E F G H .. M
....... E1 90 2A .. 26
....... 0 80 F8 ..
Output:
A B C D E F G H .. M
....... E1902A..26
....... 080F8..
Code:
def concat_f_to_m():
    for row_value in range(1, sheet.max_row + 1):
        values = []
        del values[:]
        for row in sheet.iter_rows(min_col=6, max_col=14, min_row=row_value, max_row=row_value):
            for cell in row:
                if cell.value != None:
                    values.append(str(cell.value))
                else:
                    del values[:]
                    pass
        sheet[f'E{row_value}'].value = ''.join(values)

concat_f_to_m()
Also, I have set the max column to N (14) because the longest code goes up to column M, and I want the loop to stop once an empty cell is found so I can join the list's items. The issue I cannot overcome is that, although printing the values list shows only the current row's items, the joined string is not written into the cell.
Could you give me a hint how to concatenate through all rows by joining the values list at each row? Thank you!
Correct implementation:
def concat_f_to_m():
    for row_value in range(1, sheet.max_row + 1):
        values = []
        del values[:]
        for row in sheet.iter_rows(min_col=6, max_col=14, min_row=row_value, max_row=row_value):
            for cell in row:
                if cell.value != None:
                    values.append(str(cell.value))
                    sheet[f'E{row_value}'].value = ''.join(values)
                else:
                    del values[:]
                    break

concat_f_to_m()
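For reference, here is a shorter sketch of the same idea (the workbook name codes.xlsx is only a placeholder, and the sheet object is assumed to come from load_workbook as in your setup). It joins the non-empty cells of F to M (columns 6 to 13) in one pass per row and writes the result into column E:

from openpyxl import load_workbook

wb = load_workbook('codes.xlsx')   # placeholder file name
sheet = wb.active

# Columns F..M are 6..13; join every non-empty cell per row and write it into E (column 5).
for row in sheet.iter_rows(min_row=1, max_row=sheet.max_row, min_col=6, max_col=13):
    joined = ''.join(str(cell.value) for cell in row if cell.value is not None)
    sheet.cell(row=row[0].row, column=5, value=joined)

wb.save('codes.xlsx')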
I have a dataframe with two columns that contain JSON. So, for example:
df =
   A    B    C                            D
   1.   2.   {b:1,c:2,d:{r:1,t:{y:0}}}    {v:9}
I want to flatten it entirely, so that every value in the JSON ends up in a separate column whose name is the full path. So here the value 0 will be in the column:
C_d_t_y
What is the best way to do this, without having to predefine the depth of the JSON or the fields?
If your dataframe contains only nested dictionaries (no lists), you can try:
import pandas as pd

def get_values(df):
    def _parse(val, current_path):
        if isinstance(val, dict):
            for k, v in val.items():
                yield from _parse(v, current_path + [k])
        else:
            yield "_".join(map(str, current_path)), val

    rows = []
    for idx, row in df.iterrows():
        tmp = {}
        for i in row.index:
            tmp.update(dict(_parse(row[i], [i])))
        rows.append(tmp)
    return pd.DataFrame(rows, index=df.index)

print(get_values(df))
Prints:
   A  B  C_b  C_c  C_d_r  C_d_t_y  D_v
0  1  2    1    2      1        0    9
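As an alternative, if the column values are already Python dicts rather than JSON strings, pandas' json_normalize can produce the same flattened columns; a minimal sketch, assuming the sample frame above:

import pandas as pd

df = pd.DataFrame({
    'A': [1],
    'B': [2],
    'C': [{'b': 1, 'c': 2, 'd': {'r': 1, 't': {'y': 0}}}],
    'D': [{'v': 9}],
})

# json_normalize flattens nested dicts, joining each path with sep
flat = pd.json_normalize(df.to_dict(orient='records'), sep='_')
print(flat)

If the columns hold JSON strings instead, parse them first (for example with json.loads) before normalizing.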
I have 2 files that I converted to list-of-lists format. Short examples:
a
c1 165.001 17593685
c2 1650.94 17799529
c3 16504399 17823261
b
1 rs3094315 **0.48877594** *17593685* G A
1 rs12562034 0.49571378 768448 A G
1 rs12124819 0.49944228 776546 G A
Using a for loop I tried to find the common values of these lists, but I can't get the loop to work. It is necessary since I need to get the value that is adjacent to the value common to the two lists (in this example it is 0.48877594, since 17593685 is common to 'a' and 'b'). My attempts, which completely froze:
for i in a:
    if i[2] == [d[3] for d in b]:
        print(i[0], i[2] + d[2])
or
for i in a and d in b:
    if i[2] == d[3]
        print(i[0], i[2] + d[2]
Overall I need to get the first file with a new column, which will be that bold adjacent value. It is my first month of programming and I can't work out the logic. Thanks in advance!
+++
List's original format:
a = [['c1', '165.001', '17593685'], ['c2', '1650.94', '17799529'], ['c3', '16504399', '17823261']]
[['c1', '16504399', '17593685.1\n'], ['c2', '16504399', '17799529.1\n'], ['c3', '16504399', '17823261.\n']]
++++ My original data
Two or more people can have DNA segments that are the same because they were inherited from a common ancestor. File 'a' contains the following columns:
SegmentID, start of segment, end of segment, IDs of the individuals that share this segment (from 2 to infinity). Example (just a small part, since the real list has > 1000 rows, i.e. segments 'c'; the number of individuals can differ):
c1 16504399 17593685 19N 19N.0 19N 19N.0 182AR 182AR.0 182AR 182AR.0 6i 6i.1 6i 6i.1 153A 153A.1 153A 153A.1
c2 14404399 17799529 62BB 62BB.0 62BB 62BB.0 55k 55k.0 55k 55k.0 190k 190k.0 190k 190k.0 51A 51A.1 51A 51A.1 3A 3A.1 3A 3A.1 38k 38k.1 38k 38k.1
c3 1289564 177953453 164Bur 164Bur.0 164Bur 164Bur.0 38BO 38BO.1 38BO 38BO.1 36i 36i.1 36i 36i.1 100k 100k.1 100k 100k.1
file b:
This one always has 6 columns, but the number of rows is more than 100 million, so only part of it:
1 rs3094315 0.48877594 16504399 G A
1 rs12562034 0.49571378 17593685 A G
1 rs12124819 0.49944228 14404399 G A
1 rs3094221 0.48877594 17799529 G A
1 rs12562222 0.49571378 1289564 A G
1 rs121242223 0.49944228 177953453 G A
So, I need to compare a[1] with b[3], and if they are equal, print(a[1], b[3]), because b[3] is the position of the segment too, just in another measurement system. That is what I can't do.
Taking a leap (because the question isn't really clear), I think you are looking for the product of a and b, e.g.:
In []:
for i in a:
    for d in b:
        if i[2] == d[3]:
            print(i[0], i[2] + d[2])

Out[]:
c1 175936850.48877594
You can do the same with itertools.product():
In []:
import itertools as it

for i, d in it.product(a, b):
    if i[2] == d[3]:
        print(i[0], i[2] + d[2])

Out[]:
c1 175936850.48877594
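Since file b can have around 100 million rows, the nested scan above is quadratic; a dictionary keyed on b's position column avoids that. A small sketch of the same lookup, assuming the list-of-lists form of a and b shown in the question:

# build the lookup once: b's position column -> the adjacent value
b_lookup = {d[3]: d[2] for d in b}

for i in a:
    if i[2] in b_lookup:
        print(i[0], i[2] + b_lookup[i[2]])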
It would be much faster to leave your data as strings and search:
for a_line in [_ for _ in a.split('\n') if _]:    # skip blank lines
    search_term = a_line.strip().split()[-1]      # get search term
    term_loc_in_b = b.find(search_term)           # get search term location in file b
    if term_loc_in_b != -1:                       # -1 means term not found
        # split b once just before search term starts
        value_in_b = b[:term_loc_in_b].strip().rsplit(maxsplit=1)[-1]
        print(value_in_b)
    else:
        print('{} not found'.format(search_term))
If the file size is large you might consider using mmap to search b.
mmap.find requires bytes, e.g. 'search_term'.encode().
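A minimal sketch of that mmap idea, assuming file b has been saved to disk as b.txt (the file name and the search term are placeholders):

import mmap

search_term = '17593685'.encode()                 # mmap.find requires bytes
with open('b.txt', 'rb') as fh:                   # placeholder file name for file b
    mm = mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ)
    loc = mm.find(search_term)
    if loc != -1:
        # take the whitespace-separated field immediately before the match
        value_in_b = mm[:loc].strip().rsplit(None, 1)[-1].decode()
        print(value_in_b)
    else:
        print('{} not found'.format(search_term.decode()))
    mm.close()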
I have a dataframe with a column x.
I want to make a new column x_new, but I want the first row of this new column to be set to a specific number (let's say -2).
Then, from the 2nd row on, use the previous row to iterate through the cx function:
data = {'x': [1, 2, 3, 4, 5]}
df = pd.DataFrame(data)

def cx(x):
    if df.loc[1, 'x_new'] == 0:
        df.loc[1, 'x_new'] = -2
    else:
        x_new = -10*x + 2
        return x_new

df['x_new'] = (cx(df['x']))
The final dataframe
I am not sure how to do this.
Thank you for your help
This is what I have so far:
data = {'depth': [1, 2, 3, 4, 5]}
df = pd.DataFrame(data)
df

# calculate equation
def depth_cal(d):
    z = -3*d + 1   # d must be previous row
    return z

depth_cal = (depth_cal(df['depth']))   # how to set d as previous row
print(depth_cal)

depth_new = []
for row in df['depth']:
    if row == 1:
        depth_new.append('-5.63')
    else:
        depth_new.append(depth_cal)    # does not put list in a column
df['Depth_correct'] = depth_new
correct output:
There are still two problems with this:
1. it does not put the depth_cal list properly into the column
2. in the depth_cal function, I want d to be the previous row
Thank you
I would do this by just using a loop to generate your new data; it might not be ideal if the data is particularly huge, but it's a quick operation. Let me know how you get on with this:
data = {'depth': [1, 2, 3, 4, 5]}
df = pd.DataFrame(data)

res = data['depth']
res[0] = -5.63
for i in range(1, len(res)):
    res[i] = -3 * res[i-1] + 1

df['new_depth'] = res
print(df)
To get:
   depth  new_depth
0      1      -5.63
1      2      17.89
2      3     -52.67
3      4     159.01
4      5    -476.03
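If you would rather not write the loop by hand, the same recurrence can be expressed with itertools.accumulate; a sketch, seeding with -5.63 and applying z = -3*previous + 1 once per remaining row (only the row count of the original column is used):

from itertools import accumulate
import pandas as pd

data = {'depth': [1, 2, 3, 4, 5]}
df = pd.DataFrame(data)

# accumulate starts from the seed value and feeds each previous result into the lambda;
# the placeholder Nones only set how many times the recurrence is applied
seed = [-5.63] + [None] * (len(df) - 1)
df['new_depth'] = list(accumulate(seed, lambda prev, _: -3 * prev + 1))
print(df)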
I have a dataframe with a column containing comma-separated strings. What I want to do is split them on the comma, count the parts, and append the count to a new dataframe. If the column contains a list with only one element, I want to differentiate whether it is a string or an integer. If it is an integer, I want to append the value 0 in that row to the new df.
My code looks as follows:
def decide(dataframe):
    df = pd.DataFrame()
    for liste in DataFrameX['Column']:
        x = liste.split(',')
        if len(x) > 1:
            df.append(pd.Series([len(x)]), ignore_index=True)
        else:
            # check if element in list is int
            for i in x:
                try:
                    int(i)
                    print i
                    x = []
                    df.append(pd.Series([int(len(x))]), ignore_index=True)
                except:
                    print i
                    x = [1]
                    df.append(pd.Series([len(x)]), ignore_index=True)
    return df
The Input data look like this:
C1
0 a,b,c
1 0
2 a
3 ab,x,j
If I now run the function with my original dataframe as input, it returns an empty dataframe. Through the print statements in the try/except blocks I could see that everything works. The problem is appending the resulting values to the new dataframe. What do I have to change in my code? If possible, please do not give an entirely different solution, but tell me what I am doing wrong in my code so I can learn.
******************UPDATE************************************
I edited the code so that it can be called as a lambda function. It looks like this now:
def decide(x):
    for liste in DataFrameX['Column']:
        x = liste.split(',')
        if len(x) > 1:
            x = len(x)
            print x
        else:
            # check if element in list is int
            for i in x:
                try:
                    int(i)
                    x = []
                    x = len(x)
                    print x
                except:
                    x = [1]
                    x = len(x)
                    print x
And I call it like this:
df['Count']=df['C1'].apply(lambda x: decide(x))
It prints the right values, but the new column only contains None.
Any ideas why?
This is a good start, it could be simplified, but I think it works as expected.
#I have a dataframe with a column containing comma separated strings.
df = pd.DataFrame({'data': ['apple, peach', 'banana, peach, peach, cherry','peach','0']})
# What I want to do is separate them by comma, count them and append the counted number to a new data frame.
df['data'] = df['data'].str.split(',')
df['count'] = df['data'].apply(lambda row: len(row))
# If the column contains a list with only one element
df['first'] = df['data'].apply(lambda row: row[0])
# I want to differentiate whether it is a string or an integer
df['first'] = pd.to_numeric(df['first'], errors='coerce')
# if the element in x is an integer, len(x) should be set to zero
df.loc[pd.notnull(df['first']), 'count'] = 0
# Dropping temp column
df.drop('first', axis=1, inplace=True)
df
data count
0 [apple, peach] 2
1 [banana, peach, peach, cherry] 4
2 [peach] 1
3 [0] 0
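As for what was going wrong in the original code: DataFrame.append is not an in-place operation; it returns a new frame, so every df.append(...) result was discarded and decide() returned an empty frame. Each call would need to be assigned back, for example:

df = df.append(pd.Series([len(x)]), ignore_index=True)

Similarly, the lambda version prints the right values but fills the column with None because the updated decide() never returns anything; it needs a return x at the end (and the loop over DataFrameX['Column'] can go, since apply already passes one value at a time).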
I have a huge table file that looks like the following. In order to work on individual products (name), I tried to use pandas groupby, but it seems to put the whole table (~10G) in memory, which I cannot afford.
name index change
A Q QQ
A Q QQ
A Q QQ
B L LL
C Q QQ
C L LL
C LL LL
C Q QQ
C L LL
C LL LL
The name column is well sorted and I will only care about one name at a time. I hope to use the following criteria on column "change" to filter each name:
Check if the number of "QQ" overwhelms the number of "LL". Basically, if the number of rows containing "QQ" minus the number of rows containing "LL" is >= 2, then discard/ignore the "LL" rows for this name from now on. If "LL" overwhelms "QQ", then discard the rows with "QQ". (E.g. A has 3 QQ and 0 LL, and C has 4 LL and 2 QQ; both are fine.)
Resulting table:
name index change
A Q QQ
A Q QQ
A Q QQ
C L LL
C LL LL
C L LL
C LL LL
Comparing "change" to "index", if no change occurs (e.g. LL in both columns), the row is not valid. Further, for the valid changes, the remaining QQ or LL has to be continuous for >=3 times. Therefore C only has 2 valid changes, and it will be filtered out.
Resulting table:
name index change
A Q QQ
A Q QQ
A Q QQ
I wonder if there is a way to just work on the table name by name and release the memory after each name (and without having to apply the two criteria step by step). Any hint or suggestion will be appreciated!
Because the file is sorted by "name", you can read the file row-by-row:
def process_name(name, data, output_file):
    group_by = {}
    for index, change in data:
        if index not in group_by:
            group_by[index] = []
        group_by[index].append(change)

    # do the step 1 filter logic here

    # do the step 2 filter logic here
    for index in group_by:
        if index in group_by[index]:
            # Because there is at least one "no change" this
            # whole "name" can be thrown out, so return here.
            return

    for index in group_by:
        output_file.write("%s\t%s\t%s\n" % (name, index, group_by[index]))

current_name = None
current_data = []

input_file = open(input_filename, "r")
output_file = open(output_filename, "w")
header = input_file.readline()

for row in input_file:
    cols = row.strip().split("\t")
    name = cols[0]
    index = cols[1]
    change = cols[2]
    if name != current_name:
        if current_name is not None:
            process_name(current_name, current_data, output_file)
        current_name = name
        current_data = []
    current_data.append((index, change))

# process what's left in the buffer
if current_name is not None:
    process_name(current_name, current_data, output_file)

input_file.close()
output_file.close()
I don't totally understand the logic you've explained in #1, so I left that blank. I also feel like you probably want to do step #2 first as that will quickly rule out entire "name"s.
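For what it's worth, one possible reading of step 1 is to count the "QQ" and "LL" rows for the name and drop the minority type when the other dominates by at least 2; a sketch that could sit at the top of process_name, operating on the (index, change) pairs:

# one reading of step 1: keep only the dominant change type for this name
changes = [change for _, change in data]
qq, ll = changes.count("QQ"), changes.count("LL")
if qq - ll >= 2:
    data = [(index, change) for index, change in data if change != "LL"]
elif ll - qq >= 2:
    data = [(index, change) for index, change in data if change != "QQ"]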
Since your file is sorted and you only seem to be operating on the sub segments by name, perhaps just use Python's groupby and create a table for each name segment as you go:
from itertools import groupby
import pandas as pd

with open('/tmp/so.csv') as f:
    header = next(f).split()
    for k, segment in groupby(f, key=lambda line: line.split()[0]):
        seg_data = {k: [] for k in header}
        for e in segment:
            for key, v in zip(header, e.split()):
                seg_data[key].append(v)

        seg_fram = pd.DataFrame.from_dict(seg_data)
        print(k)
        print(seg_fram)
        print()
Prints:
A
change index name
0 QQ Q A
1 QQ Q A
2 QQ Q A
B
change index name
0 LL L B
C
change index name
0 QQ Q C
1 LL L C
2 LL LL C
3 QQ Q C
4 LL L C
5 LL LL C
Then the largest piece of memory you will have will be dictated by the largest contiguous group and not the size of the file.
You can use 1/2 the memory of that method by appending to the data frame row by row instead of building the intermediate dict:
with open('/tmp/so.csv') as f:
    header = next(f).split()
    for k, segment in groupby(f, key=lambda line: line.split()[0]):
        seg_fram = pd.DataFrame(columns=header)
        for idx, e in enumerate(segment):
            df = pd.DataFrame({k: v for k, v in zip(header, e.split())}, index=[idx])
            seg_fram = seg_fram.append(df)
(might be slower though...)
If that does not work, consider using a disk database.
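If the disk-database route appeals, here is a minimal sqlite3 sketch (the file and table names are placeholders, and the input is assumed to be tab-separated):

import sqlite3
import pandas as pd

conn = sqlite3.connect('products.db')                    # placeholder database file

# load the big file into SQLite in chunks, so the whole table never sits in memory
for chunk in pd.read_csv('table.tsv', sep='\t', chunksize=100000):
    chunk.to_sql('products', conn, if_exists='append', index=False)

# pull back one name at a time; only that group is ever held in memory
names = [r[0] for r in conn.execute('SELECT DISTINCT name FROM products')]
for name in names:
    group = pd.read_sql_query('SELECT * FROM products WHERE name = ?',
                              conn, params=(name,))
    # apply the two filtering criteria to `group` here
conn.close()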