I have TAB-separated data and I would like to parse it like this:
Input:
more input.tsv
A B 5 A1,A2,A3,A4,A5 B1,B2,B3,B4,B5
C D 3 C1,C2,C3 D1,D2,D3
And the required output is:
A B 5 A1 B1
A B 5 A2 B2
.
.
A B 5 A5 B5
C D 3 C1 D1
.
C D 3 C3 D3
So the idea is to keep the first three columns and split the 4th and 5th columns into their corresponding values; the number of values in the 4th and 5th columns is defined by the value in the 3rd column.
I would prefer awk, or maybe Python, with an explained example, so it is easy to understand and I can learn something.
My try without any loop:
awk '{OFS="\t"}{split($4,arr4,","); split($5,arr5,","); print $1,$2,$3,arr4[1],arr5[1]; print $1,$2,$3,arr4[2],arr5[2]}'
In Python you can do something like this:
tempstr = """A\tB\t5\tA1,A2,A3,A4,A5\tB1,B2,B3,B4,B5
C\tD\t3\tC1,C2,C3\tD1,D2,D3"""
data = []
for line in tempstr.split("\n"):
    line = line.split("\t")
    split_column_1 = line[3].split(",")
    split_column_2 = line[4].split(",")
    if len(split_column_1) != len(split_column_2):
        print("Something wrong")
    else:
        for c1, c2 in zip(split_column_1, split_column_2):
            data.append((line[0], line[1], line[2], c1, c2))
for d in data:
    print("\t".join(d))
Output:
A B 5 A1 B1
A B 5 A2 B2
A B 5 A3 B3
A B 5 A4 B4
A B 5 A5 B5
C D 3 C1 D1
C D 3 C2 D2
C D 3 C3 D3
With a TSV file
You can use the csv module to process your data:
import csv
data = []
with open('resources/data.tsv') as csv_file:
    csv_reader = csv.reader(csv_file, delimiter='\t')
    for row in csv_reader:
        split_column_1 = row[3].split(",")
        split_column_2 = row[4].split(",")
        if len(split_column_1) != len(split_column_2):
            print("Something wrong")
        else:
            for c1, c2 in zip(split_column_1, split_column_2):
                data.append((row[0], row[1], row[2], c1, c2))
for d in data:
    print("\t".join(d))
Explanation
Open the file with the csv module. The advantage is that it already splits each line on the delimiter we specify. The default is "," but we use \t since we have a TSV file.
with open('resources/data.tsv') as csv_file:
    csv_reader = csv.reader(csv_file, delimiter='\t')
We go through each row / line; iterating with a simple for loop is another convenience of the csv module.
for row in csv_reader:
Now we split the fourth and fifth columns on "," because they are still strings. This gives us lists of the split elements.
split_column_1 = row[3].split(",")
split_column_2 = row[4].split(",")
If the lengths of these two lists are not the same, something is wrong with the data and could lead to unexpected behavior (depending on your code), so we check for that case. (If your data has no errors, the condition will never be true.)
if len(split_column_1) != len(split_column_2):
    print("Something wrong")
We save all the data as tuples in a list. You can also access this data in a later step if you need it (e.g. data[3][3] # 4th row, 4th element -> A4).
else:
    for c1, c2 in zip(split_column_1, split_column_2):
        data.append((row[0], row[1], row[2], c1, c2))
Print it nicely so that it matches your expected output. You can call join on a string (in our case \t) and pass a tuple/list as the argument; it then concatenates all elements of the tuple/list with that string as the separator:
for d in data:
    print("\t".join(d))
A regex approach with a sed loop:
# recreate input
# tr to replace spaces with tabs, as the input is tsv
tr -s ' ' '\t' <<EOF |
A B 5 A1,A2,A3,A4,A5 B1,B2,B3,B4,B5
C D 3 C1,C2,C3 D1,D2,D3
EOF
# sed script
sed -E '
# label a
: a
# take the last items after `,` comma
# and add a new line to the pattern space with the two items
# and remove the last items from the list in the first line
s/([^\t]+\t[^\t]+\t[^\t]+\t)(.+),([^\t]+)\t(.+),([^\n]+)/\1\2\t\4\n\1\3\t\5/
# if the last substitution was successful, branch to label a
t a
'
Running this gives the following output:
A B 5 A1 B1
A B 5 A2 B2
A B 5 A3 B3
A B 5 A4 B4
A B 5 A5 B5
C D 3 C1 D1
C D 3 C2 D2
C D 3 C3 D3
And a one-liner without extended regex:
sed ':a;s/\([^\t]*\t[^\t]*\t[^\t]*\t\)\(.*\),\([^\t]*\)\t\(.*\),\([^\n]*\)/\1\2\t\4\n\1\3\t\5/;ta'
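To make the loop concrete, here is a trace of the first substitution on the first input line (tabs written as \t):
pattern space: A\tB\t5\tA1,A2,A3,A4,A5\tB1,B2,B3,B4,B5
\1 = A\tB\t5\t      (the first three fields)
\2 = A1,A2,A3,A4    (4th field without its last item)
\3 = A5             (the item peeled off the 4th field)
\4 = B1,B2,B3,B4    (5th field without its last item)
\5 = B5             (the item peeled off the 5th field)
result: A\tB\t5\tA1,A2,A3,A4\tB1,B2,B3,B4\nA\tB\t5\tA5\tB5
Each successful substitution peels the last remaining pair onto its own line, and the t command branches back to the label until no substitution succeeds.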
Could you please try the following? I haven't tested it yet.
awk '
BEGIN{
  FS=OFS="\t"                    ##Setting input and output field separators as TAB.
}
{
  num1=split($4,array1,",")      ##Splitting 4th field into array1 on comma.
  num2=split($5,array2,",")      ##Splitting 5th field into array2 on comma.
  till=num1>num2?num1:num2       ##Looping till the larger of the two counts.
  for(j=1;j<=till;j++){
    print $1,$2,$3,array1[j],array2[j]
  }
  delete array1                  ##Deleting arrays so values do not leak into the next line.
  delete array2
}
' Input_file
Testing the above code without setting the field separator to TAB:
awk '
{
  num1=split($4,array1,",")
  num2=split($5,array2,",")
  till=num1>num2?num1:num2
  for(j=1;j<=till;j++){
    print $1,$2,$3,array1[j],array2[j]
  }
  delete array1
  delete array2
}
' Input_file
A B 5 A1 B1
A B 5 A2 B2
A B 5 A3 B3
A B 5 A4 B4
A B 5 A5 B5
C D 3 C1 D1
C D 3 C2 D2
C D 3 C3 D3
df1=
A B C D
a1 b1 c1 1
a2 b2 c2 2
a3 b3 c3 4
df2=
A B C D
a1 b1 c1 2
a2 b2 c2 1
I want to compare the values of column 'D' in both dataframes. If both dataframes had the same number of rows I would just do this:
newDF = df1['D']-df2['D']
However, there are times when the numbers of rows differ. I want a result DataFrame like this:
resultDF=
A B C D_df1 D_df2 Diff
a1 b1 c1 1 2 -1
a2 b2 c2 2 1 1
EDIT: if and only if the 1st row of A, B, C from df1 and df2 is the same, compare the 1st row of column D of each dataframe. Similarly, repeat for all the rows.
Use merge and df.eval
df1.merge(df2, on=['A','B','C'], suffixes=['_df1','_df2']).eval('Diff=D_df1 - D_df2')
Out[314]:
A B C D_df1 D_df2 Diff
0 a1 b1 c1 1 2 -1
1 a2 b2 c2 2 1 1
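If you also need rows that appear in only one of the frames (the question notes the row counts can differ), here is a minimal sketch using an outer merge; for unmatched rows the Diff comes out as NaN:
import pandas as pd

df1 = pd.DataFrame({'A': ['a1', 'a2', 'a3'], 'B': ['b1', 'b2', 'b3'],
                    'C': ['c1', 'c2', 'c3'], 'D': [1, 2, 4]})
df2 = pd.DataFrame({'A': ['a1', 'a2'], 'B': ['b1', 'b2'],
                    'C': ['c1', 'c2'], 'D': [2, 1]})

# how='outer' keeps rows present in only one frame; their D_df1/D_df2 is NaN
merged = df1.merge(df2, on=['A', 'B', 'C'], how='outer', suffixes=['_df1', '_df2'])
merged['Diff'] = merged['D_df1'] - merged['D_df2']
print(merged)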
I am trying to loop through the sheets of an Excel file and append the data from multiple sheets into one data frame.
So far I have:
master_df = pd.DataFrame()
for sheet in target_sheets:
    df1 = file.parse(sheet, skiprows=4)
    master_df.append(df1, ignore_index=True)
But then when I call master_df.head() it returns an empty DataFrame.
The data on these sheets is in the same format, and the sheets relate to each other.
So I would like to join them like this:
Sheet 1 contains:
A1
B1
C1
Sheet 2 contains:
A2
B2
C2
Sheet 3:
A3
B3
C3
End result:
A1
B1
C1
A2
B2
C2
A3
B3
C3
Is my logic correct or how can I achieve this?
The code below will work even if you don't know the exact sheet names in the Excel file. You can try this:
import pandas as pd

xls = pd.ExcelFile('myexcel.xls')
out_df = pd.DataFrame()
for sheet in xls.sheet_names:
    df = pd.read_excel('myexcel.xls', sheet_name=sheet)
    out_df = out_df.append(df)  ## appends the rows of one dataframe to another (just like your expected output)
print(out_df)
## out_df will have data from all the sheets
Note that append returns a new DataFrame rather than modifying out_df in place, which is why the result must be assigned back; in recent pandas versions append is deprecated in favor of pd.concat (see the next answer).
Let me know if this helps.
Simply use pd.concat():
pd.concat([pd.read_excel(file, sheet_name=sheet) for sheet in ['Sheet1','Sheet2','Sheet3']], axis=1)
For example, it will yield:
A1 B1 C1 A2 B2 C2 A3 B3 C3
0 1 2 3 1 2 3 1 2 3
1 4 5 6 4 5 6 4 5 6
2 7 8 9 7 8 9 7 8 9
The output desired in the question is obtained by setting axis=0.
import pandas as pd
df2 = pd.concat([pd.read_excel(io="projects.xlsx", sheet_name=sheet) for sheet in ['JournalArticles','Proposals','Books']], axis=0)
df2
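If the sheet names are not known in advance, here is a minimal sketch reusing the projects.xlsx file from above: passing sheet_name=None makes read_excel return a dict mapping each sheet name to its DataFrame, which pd.concat can stack directly.
import pandas as pd

# sheet_name=None loads every sheet into a {sheet_name: DataFrame} dict
all_sheets = pd.read_excel('projects.xlsx', sheet_name=None)

# concat the dict values; ignore_index=True rebuilds a clean 0..n-1 index
master_df = pd.concat(all_sheets.values(), ignore_index=True)
print(master_df.head())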
I want to group by columns where the commutative rule applies.
For example, if columns 1 and 2 contain the values (a,b) in one row and (b,a) in another row, then I want to group these two records together when performing a group-by operation.
Input:
From To Count
a1 b1 4
b1 a1 3
a1 b2 2
b3 a1 12
a1 b3 6
Output:
From To Count(+)
a1 b1 7
a1 b2 2
b3 a1 18
I tried to apply the group by after swapping the elements, but I couldn't figure out an approach to solve this problem. Help me to solve this problem.
Thanks in advance.
Use numpy.sort for sorting each row:
import numpy as np

cols = ['From','To']
# sort the values within each row so that, e.g., (b1, a1) becomes (a1, b1)
df[cols] = pd.DataFrame(np.sort(df[cols], axis=1))
print (df)
From To Count
0 a1 b1 4
1 a1 b1 3
2 a1 b2 2
3 a1 b3 12
4 a1 b3 6
df1 = df.groupby(cols, as_index=False)['Count'].sum()
print (df1)
From To Count
0 a1 b1 7
1 a1 b2 2
2 a1 b3 18
Now I would like to handle this dataframe:
df
A B
1 A0
1 A1
1 B0
2 B1
2 B2
3 B3
3 A2
3 A3
First, I would like to group by df.A
sub1
A B
1 A0
1 A1
1 B0
Second, I would like to extract the first row which contains the letter A
A B
1 A0
If there is no A
sub2
A B
2 B1
2 B2
I would like to extract the first row
A B
2 B1
So, I would like to get the result below
A B
1 A0
2 B1
3 A2
I would like to handle this priority-based extraction. I tried grouping but couldn't figure it out. How can I handle this?
You can group by column A and for each group use idxmax() on str.contains("A"); if there is an A in column B, it will get the first index which contains the letter A, otherwise it falls back to the first row, as all values are False:
df.groupby("A", as_index=False).apply(lambda g: g.loc[g.B.str.contains("A").idxmax()])
# A B
#0 1 A0
#1 2 B1
#2 3 A2
In cases where you may have a duplicated index, you can use numpy.ndarray.argmax() with iloc, which accepts integers as positional indexing:
df.groupby("A", as_index=False).apply(lambda g: g.iloc[g.B.str.contains("A").values.argmax()])
# A B
#0 1 A0
#1 2 B1
#2 3 A2
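For reference, a minimal self-contained sketch that rebuilds the example frame and runs the first variant, so the trick can be tested end to end:
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 1, 2, 2, 3, 3, 3],
                   'B': ['A0', 'A1', 'B0', 'B1', 'B2', 'B3', 'A2', 'A3']})

# idxmax() yields the first index where str.contains("A") is True,
# or the group's first index when every value is False
result = df.groupby("A", as_index=False).apply(lambda g: g.loc[g.B.str.contains("A").idxmax()])
print(result)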