How to parse data to corresponding columns in awk - python

I have TAB-separated data and I would like to parse it like this:
Input:
more input.tsv
A B 5 A1,A2,A3,A4,A5 B1,B2,B3,B4,B5
C D 3 C1,C2,C3 D1,D2,D3
And the required output is:
A B 5 A1 B1
A B 5 A2 B2
.
.
A B 5 A5 B5
C D 3 C1 D1
.
C D 3 C3 D3
So the idea is to keep the first three columns and split the 4th and 5th columns into their corresponding values. The number of values in the 4th and 5th columns is defined by the value in the 3rd column.
I would prefer awk, or maybe python, with an explained example, so it is easy to understand and learn something.
My try without any loop:
awk '{OFS="\t"}{split($4,arr4,","); split($5,arr5,","); print $1,$2,$3,arr4[1],arr5[1]; print $1,$2,$3,arr4[2],arr5[2]}'

In Python you can do something like this:
tempstr = """A\tB\t5\tA1,A2,A3,A4,A5\tB1,B2,B3,B4,B5
C\tD\t3\tC1,C2,C3\tD1,D2,D3"""
data = []
for line in tempstr.split("\n"):
    line = line.split("\t")
    split_column_1 = line[3].split(",")
    split_column_2 = line[4].split(",")
    if len(split_column_1) != len(split_column_2):
        print("Something wrong")
    else:
        for c1, c2 in zip(split_column_1, split_column_2):
            data.append((line[0], line[1], line[2], c1, c2))
for d in data:
    print("\t".join(d))
Output:
A B 5 A1 B1
A B 5 A2 B2
A B 5 A3 B3
A B 5 A4 B4
A B 5 A5 B5
C D 3 C1 D1
C D 3 C2 D2
C D 3 C3 D3
With a TSV file
You can use the csv module to process your data:
import csv

data = []
with open('resources/data.tsv') as csv_file:
    csv_reader = csv.reader(csv_file, delimiter='\t')
    for row in csv_reader:
        split_column_1 = row[3].split(",")
        split_column_2 = row[4].split(",")
        if len(split_column_1) != len(split_column_2):
            print("Something wrong")
        else:
            for c1, c2 in zip(split_column_1, split_column_2):
                data.append((row[0], row[1], row[2], c1, c2))
for d in data:
    print("\t".join(d))
Explanation
Open the file with the csv module. The advantage is that it already splits on the delimiter we specify. The default is "," but we use \t since we have a TSV file.
with open('resources/data.tsv') as csv_file:
    csv_reader = csv.reader(csv_file, delimiter='\t')
We go through each row / line. The csv module also makes this easy with a for loop.
for row in csv_reader:
Now we split the fourth and fifth columns on "," because they are still strings. This gives us a list of the split elements.
split_column_1 = row[3].split(",")
split_column_2 = row[4].split(",")
If the lengths of these two are not the same, then something is wrong with the data and could lead to unexpected behaviour (depending on your code), so we check for that case (if your data has no errors, it will never be true).
if len(split_column_1) != len(split_column_2):
    print("Something wrong")
We save each row's data as a tuple in a list. You can also access this data in a later step if you need to (e.g. data[3][3] # 4th row, 4th element -> A4).
else:
    for c1, c2 in zip(split_column_1, split_column_2):
        data.append((row[0], row[1], row[2], c1, c2))
Print it nicely so that it looks like your expected output. You can call join on a string (in our case \t) and pass a tuple/list as the parameter. It then concatenates all elements of the tuple/list, using the string on the left as the separator:
for d in data:
    print("\t".join(d))
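If you want to write the result to a new TSV file instead of printing it, csv.writer does the symmetric job (a small sketch; the output path is just an example):
import csv

with open('resources/output.tsv', 'w', newline='') as out_file:
    # writerows writes each tuple in data as one TAB-separated row
    csv.writer(out_file, delimiter='\t').writerows(data)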

A regex solution with a sed loop:
# recreate input
# tr to replace spaces with tabs, as the input is tsv
tr -s ' ' '\t' <<EOF |
A B 5 A1,A2,A3,A4,A5 B1,B2,B3,B4,B5
C D 3 C1,C2,C3 D1,D2,D3
EOF
# sed script
sed -E '
# label a
: a
# take the last item after the `,` comma in each of the two lists,
# append a new line with the first three columns plus those two items,
# and remove the last items from the lists in the first line
s/([^\t]+\t[^\t]+\t[^\t]+\t)(.+),([^\t]+)\t(.+),([^\n]+)/\1\2\t\4\n\1\3\t\5/
# if the last substitution was successful, branch to label a
t a
'
which on repl gives the following output:
A B 5 A1 B1
A B 5 A2 B2
A B 5 A3 B3
A B 5 A4 B4
A B 5 A5 B5
C D 3 C1 D1
C D 3 C2 D2
C D 3 C3 D3
And a one-liner without extended regex:
sed ':a;s/\([^\t]*\t[^\t]*\t[^\t]*\t\)\(.*\),\([^\t]*\)\t\(.*\),\([^\n]*\)/\1\2\t\4\n\1\3\t\5/;ta'
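Applied directly to the file from the question (this form relies on GNU sed, which accepts \t in the pattern and ;-separated commands after the label):
sed ':a;s/\([^\t]*\t[^\t]*\t[^\t]*\t\)\(.*\),\([^\t]*\)\t\(.*\),\([^\n]*\)/\1\2\t\4\n\1\3\t\5/;ta' input.tsv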

Could you please try the following; I haven't tested it yet.
awk '
BEGIN{
  FS=OFS="\t"
}
{
  num1=split($4,array1,",")
  num2=split($5,array2,",")
  till=num1>num2?num1:num2
  for(j=1;j<=till;j++){
    print $1,$2,$3,array1[j],array2[j]
  }
  delete array1
  delete array2
}
' Input_file
Testing the above code without setting the field separator to TAB:
awk '
{
  num1=split($4,array1,",")
  num2=split($5,array2,",")
  till=num1>num2?num1:num2
  for(j=1;j<=till;j++){
    print $1,$2,$3,array1[j],array2[j]
  }
  delete array1
  delete array2
}
' Input_file
A B 5 A1 B1
A B 5 A2 B2
A B 5 A3 B3
A B 5 A4 B4
A B 5 A5 B5
C D 3 C1 D1
C D 3 C2 D2
C D 3 C3 D3

Related

How to slice/chop a string using multiple indexes in a pandas DataFrame

I'm in need of some advice on the following issue:
I have a DataFrame that looks like this:
ID SEQ LEN BEG_GAP END_GAP
0 A1 AABBCCDDEEFFGG 14 2 4
1 A1 AABBCCDDEEFFGG 14 10 12
2 B1 YYUUUUAAAAMMNN 14 4 6
3 B1 YYUUUUAAAAMMNN 14 8 12
4 C1 LLKKHHUUTTYYYYYYYYAA 20 7 9
5 C1 LLKKHHUUTTYYYYYYYYAA 20 12 15
6 C1 LLKKHHUUTTYYYYYYYYAA 20 17 18
And what I need to get is the SEQ that's separated between the different BEG_GAP and END_GAP. I already have worked it out (thanks to a previous question) for sequences that have only one pair of gaps, but here they have multiple.
This is what the sequences should look like:
ID SEQ
0 A1 AA---CDDEE---GG
1 B1 YYUU---A-----NN
2 C1 LLKKHHU---YY----Y--A
Or in an exploded DF:
ID Seq_slice
0 A1 AA
1 A1 CDDEE
2 A1 GG
3 B1 YYUU
4 B1 A
5 B1 NN
6 C1 LLKKHHU
7 C1 YY
8 C1 Y
9 C1 A
At the moment, I'm using a piece of code (that I got thanks to a previous question) that works only if there's one gap, and it looks like this:
import pandas as pd
df = pd.read_csv("..\path_to_the_csv.csv")
df["BEG_GAP"] = df["BEG_GAP"].astype(int)
df["END_GAP"]= df["END_GAP"].astype(int)
df['SEQ'] = df.apply(lambda x: [x.SEQ[:x.BEG_GAP], x.SEQ[x.END_GAP+1:]], axis=1)
output = df.explode('SEQ').query('SEQ!=""')
But this has the problem that it generates a bunch of sequences that don't really exist because they actually have another gap in the middle.
I.e what it would generate:
ID Seq_slice
0 A1 AA
1 A1 CDDEEFFG #<- this one shouldn't exist! Because there's another gap in 10-12
2 A1 AABBCCDDEE #<- Also, this one shouldn't exist, it's missing the previous gap.
3 A1 GG
And so on, with the other sequences. As you can see, some slices are not being generated and some are wrong, because I don't know how to tell the code to take all the gaps into account while analyzing the sequence.
All advice is appreciated, I hope I was clear!
Let's try defining a function and apply. zip pairs each slice start (0, then the gap ends) with each slice end (the gap starts, then the sequence length), so the list comprehension collects exactly the segments that lie between the gaps:
def truncate(data):
    seq = data.SEQ.iloc[0]
    ll = data.LEN.iloc[0]
    return [seq[x:y] for x, y in zip([0] + list(data.END_GAP),
                                     list(data.BEG_GAP) + [ll])]

(df.groupby('ID').apply(truncate)
 .explode().reset_index(name='Seq_slice')
)
Output:
ID Seq_slice
0 A1 AA
1 A1 CCDDEE
2 A1 GG
3 B1 YYUU
4 B1 AA
5 B1 NN
6 C1 LLKKHHU
7 C1 TYY
8 C1 YY
9 C1 AA
In one line:
df.groupby('ID').agg({'BEG_GAP': list, 'END_GAP': list, 'SEQ': max, 'LEN': max}).apply(lambda x: [x['SEQ'][b: e] for b, e in zip([0] + x['END_GAP'], x['BEG_GAP'] + [x['LEN']])], axis=1).explode()
ID
A1 AA
A1 CCDDEE
A1 GG
B1 YYUU
B1 AA
B1 NN
C1 LLKKHHU
C1 TYY
C1 YY
C1 AA

Pandas: Join data to a single row to new columns

I'm new to pandas and have been having trouble using the merge, join and concatenate functions on a single row of data.
I'm iterating over a handful of rows in a table and in each iteration add some data I've found to the row I'm handling. I know, blasphemy! Thou shall not iterate. Each iteration results in a call to a server, so I need to control flow. There aren't that many rows. It's just for my own use. I promise I'll not iterate when I shouldn't.
That aside, my basic question is this: How do I add data to a given row where the new data has priority over existing data and has new columns?
Let's suppose I have a DataFrame df that I'm iterating over by row:
> df
c1 c2 c3
0 a b c
1 d e f
and when iterating on row 0, I get some new data that I want to add to row 0. That new data is in df_a:
> df_a
c4 c5 c6
0 g h i
I want to add data from df_a to row 0 of df so df is now:
> df
c1 c2 c3 c4 c5 c6
0 a b c g h i
1 d e f NaN NaN NaN
Next I iterate on row 1 and I get some columns which overlap and some which don't in df_b:
> df_b
c5 c7 c8
0 j k l
And again I want to add this data to row 1 so df now has
> df
c1 c2 c3 c4 c5 c6 c7 c8
0 a b c g h i NaN NaN
1 d e f NaN j NaN k l
I can't list columns names because I don't know what they'll be and new ones can appear beyond my control. Rows don't have a key because the whole thing gets thrown away after I disconnect. Data I find during each iteration always overwrites what's currently in df.
Thanks in advance!
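A minimal sketch of one way to do this, assuming each batch of new data arrives as a one-row DataFrame (df_a, df_b): reindex df so any new columns exist, then assign into the target row with .loc so the new values overwrite the old ones:
import pandas as pd

df = pd.DataFrame({'c1': ['a', 'd'], 'c2': ['b', 'e'], 'c3': ['c', 'f']})
df_a = pd.DataFrame({'c4': ['g'], 'c5': ['h'], 'c6': ['i']})
df_b = pd.DataFrame({'c5': ['j'], 'c7': ['k'], 'c8': ['l']})

def add_row_data(df, i, new):
    # make sure every new column exists (filled with NaN for the other rows)
    df = df.reindex(columns=df.columns.union(new.columns, sort=False))
    # overwrite row i with the values from the one-row DataFrame
    df.loc[i, new.columns] = new.iloc[0].values
    return df

df = add_row_data(df, 0, df_a)
df = add_row_data(df, 1, df_b)
print(df)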

How to group by two columns with swapped values in pandas?

I want to group by columns where the commutative rule applies.
For example, if columns 1 and 2 contain the values (a,b) in one row and (b,a) in another row, I want to group those two records together when performing the group by operation.
Input:
From To Count
a1 b1 4
b1 a1 3
a1 b2 2
b3 a1 12
a1 b3 6
Output:
From To Count(+)
a1 b1 7
a1 b2 2
b3 a1 18
I tried to apply group by after swapping the elements, but I don't have an approach to solve this problem. Help me solve it.
Thanks in advance.
Use numpy.sort for sorting each row:
cols = ['From','To']
df[cols] = pd.DataFrame(np.sort(df[cols], axis=1))
print (df)
From To Count
0 a1 b1 4
1 a1 b1 3
2 a1 b2 2
3 a1 b3 12
4 a1 b3 6
df1 = df.groupby(cols, as_index=False)['Count'].sum()
print (df1)
From To Count
0 a1 b1 7
1 a1 b2 2
2 a1 b3 18
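One caveat (an assumption about the setup, not part of the answer above): pd.DataFrame(np.sort(df[cols], axis=1)) creates a new frame with a default 0..n-1 index and columns, so if df does not have a default RangeIndex the assignment can misalign. Passing the labels explicitly avoids that:
df[cols] = pd.DataFrame(np.sort(df[cols], axis=1), columns=cols, index=df.index)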

Taking the last characters of a column of objects and making it a column on a dataframe - pandas python

I have a dataframe like the following:
df =
A B D
a1 b1 9052091001A
a2 b2 95993854906
a3 b3 93492480190
a4 b4 93240941993
What I want:
df_resp =
A B D
a1 b1 001A
a2 b2 4906
a3 b3 0190
a4 b4 1993
What I tried:
for i in (0,len(df['D'])):
df['D'][i]= df['D'][i][-4:]
Error I got:
KeyError: 4906
Also, it takes a really long time and I think there should be a quicker way with pandas.
Use pd.Series.str string accessor for vectorized string operations. These are preferred over using apply.
If D elements are already strings
df.assign(D=df.D.str[-4:])
A B D
0 a1 b1 001A
1 a2 b2 4906
2 a3 b3 0190
3 a4 b4 1993
If not
df.assign(D=df.D.astype(str).str[-4:])
A B D
0 a1 b1 001A
1 a2 b2 4906
2 a3 b3 0190
3 a4 b4 1993
You can change in place with
df['D'] = df.D.str[-4:]
Use the apply() method of pandas.Series; it will be way faster than iterating with a for loop...
This should work (provided the column contains only strings):
df_resp = df.copy()
df_resp['D'] = df_resp['D'].apply(lambda x : x[-4:])
As for the KeyError, it probably comes from your DataFrame's index, since calling df['D'][i] looks up i as an index label, not as a position. It would (probably) work if you replaced it with df['D'].iloc[i], which refers to the value at position i. (Note also that for i in (0, len(df['D'])) iterates over only the two values 0 and len(df['D']); you likely meant range(len(df['D'])).)
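A tiny illustration of the label/position difference (the index labels here are hypothetical):
import pandas as pd

s = pd.Series(['9052091001A', '95993854906'], index=[100, 200])
print(s[100])      # label-based lookup: '9052091001A'
print(s.iloc[0])   # position-based lookup: '9052091001A'
# s[0] would raise KeyError: 0, because 0 is not a label in this index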
I hope this helps!

How to make an Excel sheet into a dict with xlrd

I have the following data in Excel:
Column(A) Column(B) Column(C)
Header1
A a
FC Qty
select a1 1
a2 2
derived a3 3
Header 2
B b
FC Qty
select b1 1
derived b2 2
b3 3
And I need to add this data to a dictionary in the following format (dict with tuples):
my_dict = { A:[a,select:[(a1,1),(a2,2),(a3,3)], derived:[(a3,3)]], B:[b,select:(b1,1),derived:[(b2,2),(b3,3)]]}
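The dict literal above isn't valid Python as written, so here is a minimal sketch that builds a close nested-dict equivalent using the classic xlrd API; the file name and the block-detection rules (rows starting with "Header", the "FC Qty" sub-header, blank FC cells continuing the previous label) are assumptions based on the sample layout:
import xlrd  # classic xlrd API: open_workbook / sheet_by_index / row_values

wb = xlrd.open_workbook("data.xls")   # hypothetical file name
sheet = wb.sheet_by_index(0)

my_dict = {}
current = None  # dict for the block currently being filled
fc = None       # last seen FC label ("select" / "derived"), reused when the cell is blank

for i in range(sheet.nrows):
    a, b, c = (sheet.row_values(i) + [""] * 3)[:3]
    a, b = str(a).strip(), str(b).strip()
    if a.startswith("Header"):                # "Header1" / "Header 2" start a new block
        current = fc = None
    elif current is None and a and b and not str(c).strip():  # the "A a" key row
        current = {"value": b}
        my_dict[a] = current
    elif a == "FC":                           # the "FC Qty" sub-header row: skip it
        continue
    elif current is not None and b:           # data row: append (name, qty) under the FC label
        if a:                                 # blank FC cell -> keep the previous label
            fc = a
        current.setdefault(fc, []).append((b, int(c)))

# my_dict -> {'A': {'value': 'a', 'select': [('a1', 1), ('a2', 2)], 'derived': [('a3', 3)]},
#             'B': {'value': 'b', 'select': [('b1', 1)], 'derived': [('b2', 2), ('b3', 3)]}}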
