I want to create a matrix from multiple files. Each of these files has a list of gene names, of various lengths.
To create the matrix I need to group all the gene names from all the files in the first column.
Then, for each file, append a new column (with the file name as header): if a gene name from the first column appears in that file's list, put 1 in the cell; otherwise put 0.
This is what I got until now:
import os
import pandas as pd

files = os.listdir("/gene_files")
df01 = pd.DataFrame()
for file in files:
    file_name = "/gene_files/" + file
    for file in file:
        df = pd.read_csv(file, sep='\t', header=0)
        df01 = pd.concat(df01, df)
df01.to_csv('gene_matrix.csv')
This gives me all the gene lists in one column. I then drop all the duplicates.
df01 = df01.drop_duplicates()
Now I need to append a new column for each file, check whether each gene name in the first column appears in that file's list, and add 1s or 0s accordingly. I'm stuck, and also utterly confused.
The files look like this:
File1     File2     File3 etc...
GeneName  GeneName  GeneName
A         B         A
B         C         B
C         D         E
F         E         F
The output I want would be a matrix/dataframe:
GeneName  File1  File2  File3
A         1      0      1
B         1      1      1
C         1      1      0
D         0      1      0
E         0      1      1
F         1      0      1
These are the actual first few lines of the files:
fileAIB        fileAIC        fileAID
Plekha4        Dffb           Rabggta
1700012D01Rik  A430033K04Rik  Sc5d
Isg20          Tubb3          Gnpnat1
Smad6          Rbm17          Nabp1
Ndufa10        Isg20          Isg20
Wdr90          Arrb2          Lrrc27
Thumpd1        Ankrd13c       Add3
Cd2bp2         Ndufa10        Prkaa1
Cndp2          Inpp5e         Gmeb2
Jmjd1c         Lamtor2        B4galt7
And the output would look like:
GeneName       fileAIB  fileAIC  fileAID
Plekha4        1        0        0
1700012D01Rik  1        0        0
Isg20          1        1        1
Smad6          1        0        0
Ndufa10        1        0        0
Wdr90          1        0        0
Thumpd1        1        0        0
Cd2bp2         1        0        0
Rbm17          1        0        1
Jmjd1c         1        0        0
Dffb           0        1        0
A430033K04Rik  0        1        0
Tubb3          0        1        1
Rbm17          0        1        0
Arrb2          0        1        0
Ankrd13c       0        1        0
Ndufa10        0        1        0
Gnpnat1        0        1        0
Lamtor2        0        1        0
Rabggta        0        0        1
Sc5d           0        0        1
Gnpnat1        0        0        1
Lrrc27         0        0        1
Prkaa1         0        0        1
Gmeb2          0        0        1
B4galt7        0        0        1
Consider appending all text file data into a long form dataframe and then pivoting to wide format:
dfList = []
for file in files:
    df = pd.read_csv(file, sep='\t', header=None, names=['GeneName'])
    df = df.assign(file=file.replace('.txt', ''), num=1)
    dfList.append(df)

finaldf = pd.concat(dfList)

# PIVOT (LONG TO WIDE)
finaldf = finaldf.pivot_table(index=['GeneName'], columns=['file'],
                              values='num', aggfunc='count').fillna(0).reset_index()

# CONVERT TO INTEGER (.ix is deprecated; use .iloc)
numcols = list(range(1, len(finaldf.columns)))
finaldf.iloc[:, numcols] = finaldf.iloc[:, numcols].astype(int)
Output (using the posted three columns saved as .txt files):
# file  GeneName       fileAIB  fileAIC  fileAID
# 0     1700012D01Rik  1        0        0
# 1     A430033K04Rik  0        1        0
# 2     Add3           0        0        1
# 3     Ankrd13c       0        1        0
# 4     Arrb2          0        1        0
# 5     B4galt7        0        0        1
# 6     Cd2bp2         1        0        0
# 7     Cndp2          1        0        0
# 8     Dffb           0        1        0
# 9     Gmeb2          0        0        1
# 10    Gnpnat1        0        0        1
# 11    Inpp5e         0        1        0
# 12    Isg20          1        1        1
# 13    Jmjd1c         1        0        0
# 14    Lamtor2        0        1        0
# 15    Lrrc27         0        0        1
# 16    Nabp1          0        0        1
# 17    Ndufa10        1        1        0
# 18    Plekha4        1        0        0
# 19    Prkaa1         0        0        1
# 20    Rabggta        0        0        1
# 21    Rbm17          0        1        0
# 22    Sc5d           0        0        1
# 23    Smad6          1        0        0
# 24    Thumpd1        1        0        0
# 25    Tubb3          0        1        0
# 26    Wdr90          1        0        0
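As a side note, the long frame built in the loop above can also be reshaped in a single call with pd.crosstab (a sketch, using finaldf as it is before the pivot); the counts come out as integers, so the fillna/astype step disappears:
# Cross-tabulate gene names against source files; counts are already ints
wide = pd.crosstab(finaldf['GeneName'], finaldf['file']).reset_index()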
You should be able to do this easily by putting the gene name in the index, creating a column of all ones with the file name as the column name, and then concatenating. This can all be done in one for loop. Your current for-loop syntax doesn't look right. Try something like the following, which assumes that read_csv gives you a one-column dataframe with the column name 'GeneName'.
import os
import pandas as pd

files = os.listdir("/gene_files")
df_list = []
for file in files:
    df = pd.read_csv(file, sep='\t', header=0)
    df[file] = 1                   # column of ones named after the file
    df = df.set_index('GeneName')  # set_index returns a new frame
    df_list.append(df)

pd.concat(df_list, axis=1).fillna(0)
Try using pd.concat() with the axis attribute. In your case:
df01 = pd.concat([df01, df], axis=1)
Beforehand, you can use df.columns = [file_name] to give each newly read dataframe a column name.
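Putting both hints together, here is a minimal sketch of the whole matrix build, assuming each file in /gene_files is a one-column, tab-separated gene list with a header row and no duplicate genes within a file:
import os
import pandas as pd

frames = []
for file in os.listdir("/gene_files"):
    path = os.path.join("/gene_files", file)
    genes = pd.read_csv(path, sep='\t', header=0).iloc[:, 0]
    # One column of ones per file, indexed by gene name
    frames.append(pd.Series(1, index=genes, name=file))

# Outer-join on the shared gene index; genes missing from a file become 0
matrix = pd.concat(frames, axis=1).fillna(0).astype(int)
matrix.index.name = 'GeneName'
matrix.to_csv('gene_matrix.csv')
The axis=1 concat lines the files up on the gene index, which is exactly the 1/0 membership matrix you describe.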
a = [[0,0,0,0],[0,-1,1,0],[1,-1,1,0],[1,-1,1,0]]
df = pd.DataFrame(a, columns=['A','B','C','D'])
df
Output:
A B C D
0 0 0 0 0
1 0 -1 1 0
2 1 -1 1 0
3 1 -1 1 0
Reading down each column, every value begins at 0 in the first row; once a value changes it can never change back, and it can only become 1 or -1. I would like to rearrange the dataframe columns into this order:
Columns that hit 1, starting with the earliest row in which they do so
Columns that hit -1, starting with the earliest row in which they do so
Finally, the remaining columns that never changed and stayed at zero (if any are left)
Desired Output:
C A B D
0 0 0 0 0
1 1 0 -1 0
2 1 1 -1 0
3 1 1 -1 0
My main dataframe is 3000 rows by 61 columns; is there any way of doing this quickly?
We have to handle the positive and negative values separately. One way is to take the sum of each column and then adjust the ordering with sort_values: because each column is monotone and only moves to 1 or -1, a column that changes earlier accumulates a larger absolute sum, so the sums encode how early each column changed.
a = df.sum().sort_values(ascending=False)  # column sums, largest (earliest 1) first
# positives first, then negatives (most negative sum = earliest -1), then zeros
b = pd.concat((a[a.gt(0)], a[a.lt(0)].sort_values(), a[a.eq(0)]))
out = df.reindex(columns=b.index)
print(out)
C A B D
0 0 0 0 0
1 1 0 -1 0
2 1 1 -1 0
3 1 1 -1 0
Try with pd.Series.first_valid_index:
s = df.where(df.ne(0))                     # mask zeros, keep only the +/-1 entries
s1 = s.apply(pd.Series.first_valid_index)  # first row where each column changes
s2 = s.bfill().iloc[0]                     # the value it changes to (1 or -1)
# sort by first value descending (1s before -1s), then by change row ascending
out = df.loc[:, pd.concat([s2, s1], axis=1, keys=[0, 1])
                  .sort_values([0, 1], ascending=[False, True]).index]
out
Out[35]:
C A B D
0 0 0 0 0
1 1 0 -1 0
2 1 1 -1 0
3 1 1 -1 0
I have 6 dataframes with the same column names.
The column names are:
"session_id", "player_id", "gersey_color", "timestamp"
The data in each dataframe looks like:
session_id  player_id  gersey_color  timestamp
123xyz      yellow     9             1347.85
I want to combine these dataframes into a single dataframe with a format like:
session_id  player_info  df1  df2  df3  df4  df5  df6  total_occurance  timestamp
123xyz      yellow9      0    1    0    3    3    0    7                1347.85
            green2       0    1    1    0    2    0    4
            blue5        1    1    1    1    1    1    6
523pqr      yellow1      2    1    0    0    0    0    3                747.45
            white2       0    1    0    0    0    0    1
205abd      green1       0    1    0    0    3    0    4                57.61
111mnz      yellow10     1    0    0    0    0    0    1                1821.21
            black2       0    1    0    1    1    0    3
Here I am using the timestamp as the unique identifier and want to get the frequency of each timestamp occurring across all the dataframes, categorised by session_id and by player_id and gersey_color combined.
My current code can gather all the information but cannot produce the format I want:
for i, combo_row in combo_df.iterrows():
    value_in_combo = combo_row['timestamp']
    count = 0
    player_info = []
    session_id = []
    for id, df_path in enumerate(df_list):
        rule_df = pd.read_excel(df_path)
        sub_counter = 0
        for idx, entry in rule_df.iterrows():
            idr = list(rule_df.columns).index('timestamp')
            value = entry[idr]
            s_id = entry[list(rule_df.columns).index('session.id')]
            player_team = entry[list(rule_df.columns).index('gersey_color')]
            player_num = entry[list(rule_df.columns).index('player_id')]
            if value == value_in_combo:
                sub_counter += 1
                session_id.append(s_id)
                player_info.append(str(player_team) + str(player_num))
        combo_df.at[i, f'df{id + 1}'] = ','.join(list(set(player_info)))
        combo_df.at[i, 'session_id'] = ','.join(list(set(session_id)))
        count += sub_counter
    combo_df.at[i, 'occurrence_across_rules'] = count
Here combo_df is the df I predefined to populate all the data.
The current combo_df looks like:
session_id  player_info     df1       df2             df3  df4  df5  df6  timestamp  total_occurance
123xyz      yellow1                                                       623.15     1
423pqz      green1,yellow5  yellow55  green1,yellow5                      1347.85    5
...
But as I said, my code does not generate the format I want.
Can anyone suggest how to do it?
Edit:
I solved the problem using:
combo_df.set_index(['session_id', 'player_team', 'player_num'], inplace=True)
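For reference, a vectorized sketch of the same reshape, assuming the six dataframes are already loaded in a list named dfs (a name invented here), each with the four columns from the question:
import pandas as pd

# Hypothetical: dfs holds the six dataframes, each with columns
# session_id, player_id, gersey_color, timestamp
long_df = pd.concat(
    [d.assign(source=f'df{i + 1}') for i, d in enumerate(dfs)],
    ignore_index=True,
)
# Combine colour and number into one player_info key, e.g. "yellow9"
long_df['player_info'] = (
    long_df['gersey_color'].astype(str) + long_df['player_id'].astype(str)
)

# Count occurrences per source dataframe, then widen to one column per dfN
wide = (
    long_df.groupby(['session_id', 'player_info', 'timestamp', 'source'])
           .size()
           .unstack('source', fill_value=0)
)
wide['total_occurance'] = wide.sum(axis=1)
Each count lands in its dfN column, and the row sum gives total_occurance.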
I am trying to get the frequency distribution of a column holding lists of words against the class labels.
Label Numbers
0 [(a,b,c)]
0 [(d)]
0 [(e,f,g)]
1 [(a,z)]
1 [(d,x,y)]
The output should be:
0 1
a 1 1
b 1 0
c 1 0
d 1 1
e 1 0
f 1 0
g 1 0
x 0 1
y 0 1
z 0 1
The list-wrapped tuples in the 'Numbers' column make manipulating the DataFrame as-is very difficult (this is not tidy data). The solution is to expand the DataFrame so that each row has one value in the 'Numbers' column corresponding to one value in the 'Label' column. Assuming your data is in a DataFrame called df, the following code performs that operation:
rows_list = []
for index, row in df.iterrows():
    for element in row['Numbers'][0]:
        dict1 = {}
        dict1.update(key=row['Label'], value=element)
        rows_list.append(dict1)

new_df = pd.DataFrame(rows_list)
new_df.columns = ['Label', 'Numbers']
The result is
    Label  Numbers
0   0      a
1   0      b
2   0      c
3   0      d
4   0      e
5   0      f
6   0      g
7   1      a
8   1      z
9   1      d
10  1      x
11  1      y
Now it's a matter of pivoting:
print(new_df.pivot_table(index='Numbers', columns='Label', aggfunc=len,
                         fill_value=0))
The result is
Label 0 1
Numbers
a 1 1
b 1 0
c 1 0
d 1 1
e 1 0
f 1 0
g 1 0
x 0 1
y 0 1
z 0 1
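As a side note, newer pandas (0.25+) can do the expansion in one step with explode; a sketch, assuming the same df with one list-wrapped tuple per 'Numbers' cell:
import pandas as pd

# Unwrap the outer list, then expand each tuple to one row per element
exploded = df.assign(Numbers=df['Numbers'].str[0]).explode('Numbers')
# Cross-tabulate the elements against the class labels
result = pd.crosstab(exploded['Numbers'], exploded['Label'])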
I have the following dataframe:
c3ann c3nfx c3per c4ann c4per pastr primf
c3ann 1 0 1 0 1 0 1
c3nfx 1 0 1 0 1 0 1
c3per 1 0 1 0 1 0 1
c4ann 1 0 1 0 1 0 1
c4per 1 0 1 0 1 0 1
pastr 1 0 1 0 1 0 1
primf 1 0 1 0 1 0 1
I would like to reorder the rows and columns so that the order is this:
primf pastr c3ann c3nfx c3per c4ann c4per
I can do this for just the columns like this:
cols = ['primf', 'pastr', 'c3ann', 'c3nfx', 'c3per', 'c4ann', 'c4per']
df = df[cols]
How do I do this such that the row headers are also changed appropriately?
You can use reindex to reorder both the columns and the index at the same time; since the row labels here match the column labels, the same cols list works for both axes.
df = df.reindex(index=cols, columns=cols)
I have a DataFrame where a combination of column values (A, B, C) identifies a unique address. I would like to identify all such rows and assign them a unique identifier that I increment per address.
For example
A B C D E
0 1 1 0 1
0 1 2 0 1
0 1 1 1 1
0 1 3 0 1
0 1 2 1 0
0 1 1 2 1
I would like to generate the following
A B C D E ID
0 1 1 0 1 0
0 1 2 0 1 1
0 1 1 1 1 0
0 1 3 0 1 2
0 1 2 1 0 1
0 1 1 2 1 0
I tried the following:
id = 0
def set_id(df):
    global id
    df['ID'] = id
    id += 1

df.groupby(['A','B','C']).transform(set_id)
This returns a NULL dataframe, so this is definitely not the way to do it; I am new to pandas. The above should actually use df[['A','B','C']].drop_duplicates() to get all the unique values.
Thank you.
I think this is what you need:
df2 = df[['A','B','C']].drop_duplicates() #get unique values of ABC
df2 = df2.reset_index(drop = True).reset_index() #reset index to create a column named index
df2=df2.rename(columns = {'index':'ID'}) #rename index to ID
df = pd.merge(df,df2,on = ['A','B','C'],how = 'left') #append ID column with merge
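A shorter alternative worth knowing: groupby().ngroup() numbers the groups directly, and with sort=False the IDs follow order of first appearance, matching the output above (a sketch on the same df):
# Number each unique (A, B, C) combination in order of first appearance
df['ID'] = df.groupby(['A', 'B', 'C'], sort=False).ngroup()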
# Create tuple triplet using values from columns A, B & C.
df['key'] = [triplet for triplet in zip(*[df[col].values.tolist() for col in ['A', 'B', 'C']])]
# Sort dataframe on new `key` column.
df.sort_values('key', inplace=True)
# Use a running count of changes in the sorted key (shift + cumsum) to number groups.
df['ID'] = (df['key'] != df['key'].shift()).cumsum() - 1
# Clean up.
del df['key']
df.sort_index(inplace=True)
>>> df
A B C D E ID
0 0 1 1 0 1 0
1 0 1 2 0 1 1
2 0 1 1 1 1 0
3 0 1 3 0 1 2
4 0 1 2 1 0 1
5 0 1 1 2 1 0