I have 6 dataframes with the same column names.
The column names are:
"session_id", "player_id", "gersey_color", "timestamp"
Data in each frame looks like:
session_id   player_id   gersey_color   timestamp
123xyz       9           yellow         1347.85
I want to combine these dataframes into a single dataframe with a format like this:
session_id   player_info   df1   df2   df3   df4   df5   df6   total_occurance   timestamp
123xyz       yellow9       0     1     0     3     3     0     7                 1347.85
             green2        0     1     1     0     2     0     4
             blue5         1     1     1     1     1     1     6
523pqr       yellow1       2     1     0     0     0     0     3                 747.45
             white2        0     1     0     0     0     0     1
205abd       green1        0     1     0     0     3     0     4                 57.61
111mnz       yellow10      1     0     0     0     0     0     1                 1821.21
             black2        0     1     0     1     1     0     3
Here I am using timestamp as the unique identifier and want to get the frequency with which each timestamp occurs across all the dataframes, categorised by session_id and by player_id and gersey_color combined.
My current code can gather all the information but cannot produce the format I want:
for i, combo_row in combo_df.iterrows():
    value_in_combo = combo_row['timestamp']
    count = 0
    player_info = []
    session_id = []
    for id, df_path in enumerate(df_list):
        rule_df = pd.read_excel(df_path)
        sub_counter = 0
        for idx, entry in rule_df.iterrows():
            idr = list(rule_df.columns).index('timestamp')
            value = entry[idr]
            s_id = entry[list(rule_df.columns).index('session_id')]
            player_team = entry[list(rule_df.columns).index('gersey_color')]
            player_num = entry[list(rule_df.columns).index('player_id')]
            if value == value_in_combo:
                sub_counter += 1
                session_id.append(s_id)
                player_info.append(str(player_team) + str(player_num))
        # Record the players matched in this dataframe and keep a running total.
        combo_df.at[i, f'df{id + 1}'] = ','.join(list(set(player_info)))
        combo_df.at[i, 'session_id'] = ','.join(list(set(session_id)))
        count += sub_counter
    combo_df.at[i, 'occurrence_across_rules'] = count
Here combo_df is the dataframe I predefined to hold all the data.
The current combo_df looks like:
session_id   player_info      df1        df2              df3   df4   df5   df6   timestamp   total_occurance
123xyz       yellow1                                                               623.15      1
423pqz       green1,yellow5   yellow55   green1,yellow5                             1347.85     5
...
But as I said, my code does not generate the format I want.
Can anyone suggest how to do it?
Edit:
I solved the problem using:
combo_df.set_index(['session_id', 'player_team', 'player_num'], inplace=True)
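For reference, the same layout can also be built directly by concatenating all six frames with a source label and pivoting. A minimal sketch, reusing df_list (the list of Excel paths from the code above); the pivot_table size idiom is an assumption about the intended counting, not the asker's original method:

import pandas as pd

# Load each frame and tag it with its source column name (df1..df6).
labelled = pd.concat(
    [pd.read_excel(p).assign(source=f'df{n + 1}') for n, p in enumerate(df_list)],
    ignore_index=True)
labelled['player_info'] = (labelled['gersey_color'].astype(str)
                           + labelled['player_id'].astype(str))

# Count rows per (session, player, timestamp) combination in each source frame.
out = labelled.pivot_table(index=['session_id', 'player_info', 'timestamp'],
                           columns='source', aggfunc='size', fill_value=0)
out['total_occurance'] = out.sum(axis=1)
out = out.reset_index()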
I am using pandas version 0.23.0. I want to use the dataframe groupby function to generate new aggregated columns using lambda functions.
My data frame looks like
ID Flag Amount User
1 1 100 123345
1 1 55 123346
2 0 20 123346
2 0 30 123347
3 0 50 123348
I want to generate a table which looks like
ID Flag0_Count Flag1_Count Flag0_Amount_SUM Flag1_Amount_SUM Flag0_User_Count Flag1_User_Count
1 0 2 0 155 0 2
2 2 0 50 0 2 0
3 1 0 50 0 1 0
here:
Flag0_Count is count of Flag = 0
Flag1_Count is count of Flag = 1
Flag0_Amount_SUM is SUM of Amount when Flag = 0
Flag1_Amount_SUM is SUM of Amount when Flag = 1
Flag0_User_Count is Count of Distinct User when Flag = 0
Flag1_User_Count is Count of Distinct User when Flag = 1
I have tried something like
df.groupby(["ID"])["Flag"].apply(lambda x: sum(x==0)).reset_index()
but it creates a new data frame. This means I would have to do this for all columns and then merge them together into a new data frame.
Is there an easier way to accomplish this?
Use DataFrameGroupBy.agg with a dictionary mapping column names to aggregate functions, then reshape with unstack, flatten the MultiIndex of columns, rename the columns, and finally reset_index:
df = (df.groupby(["ID", "Flag"])
        .agg({'Flag':'size', 'Amount':'sum', 'User':'nunique'})
        .unstack(fill_value=0))
# python 3.6+
df.columns = [f'{i}{j}' for i, j in df.columns]
# python below 3.6
# df.columns = ['{}{}'.format(i, j) for i, j in df.columns]
d = {'Flag0':'Flag0_Count',
     'Flag1':'Flag1_Count',
     'Amount0':'Flag0_Amount_SUM',
     'Amount1':'Flag1_Amount_SUM',
     'User0':'Flag0_User_Count',
     'User1':'Flag1_User_Count'}
df = df.rename(columns=d).reset_index()
print (df)
ID Flag0_Count Flag1_Count Flag0_Amount_SUM Flag1_Amount_SUM \
0 1 0 2 0 155
1 2 2 0 50 0
2 3 1 0 50 0
Flag0_User_Count Flag1_User_Count
0 0 2
1 2 0
2 1 0
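On pandas 0.25 or newer (the question targets 0.23, so this is an assumption about upgrading), named aggregation can replace the rename dictionary. A sketch, starting again from the original df:

out = (df.groupby(['ID', 'Flag'])
         .agg(Count=('Flag', 'size'),
              Amount_SUM=('Amount', 'sum'),
              User_Count=('User', 'nunique'))
         .unstack(fill_value=0))
# Columns are a (stat, Flag) MultiIndex, e.g. ('Count', 0) -> 'Flag0_Count'.
out.columns = [f'Flag{flag}_{stat}' for stat, flag in out.columns]
out = out.reset_index()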
Ok, I admit, I had trouble formulating a good title for this. So I will try to give an example.
This is my sample dataframe:
df = pd.DataFrame([
    (1, "a", "good"),
    (1, "a", "good"),
    (1, "b", "good"),
    (1, "c", "bad"),
    (2, "a", "good"),
    (2, "b", "bad"),
    (3, "a", "none")], columns=["id", "type", "eval"])
What I do with it is the following:
df.groupby(["id", "type"])["id"].agg({'id':'count'})
This results in:
id
id type
1 a 2
b 1
c 1
2 a 1
b 1
3 a 1
This is fine, although later on I will need the id repeated in every row, for example. But this is not the most important part.
What I would need now is something like this:
id good bad none
id type
1 a 2 2 0 0
b 1 1 0 0
c 1 0 1 0
2 a 1 1 0 0
b 1 0 1 0
3 a 1 0 0 1
And even better would be a result like this, because I will need it back in a dataframe (and finally in an Excel sheet) with all fields populated. In reality there will be many more columns I am grouping by, and they would all have to be completely populated as well.
id good bad none
id type
1 a 2 2 0 0
1 b 1 1 0 0
1 c 1 0 1 0
2 a 1 1 0 0
2 b 1 0 1 0
3 a 1 0 0 1
Thank you for helping me out.
You can use groupby + size (with the eval column added to the grouping) or value_counts, followed by unstack:
df1 = (df.groupby(["id", "type", 'eval'])
         .size()
         .unstack(fill_value=0)
         .rename_axis(None, axis=1))
print (df1)
bad good none
id type
1 a 0 2 0
b 0 1 0
c 1 0 0
2 a 0 1 0
b 1 0 0
3 a 0 0 1
df1 = (df.groupby(["id", "type"])['eval']
         .value_counts()
         .unstack(fill_value=0)
         .rename_axis(None, axis=1))
print (df1)
bad good none
id type
1 a 0 2 0
b 0 1 0
c 1 0 0
2 a 0 1 0
b 1 0 0
3 a 0 0 1
But writing to Excel with df1.to_excel('file.xlsx') keeps the MultiIndex, so reset_index is needed last:
df1.reset_index().to_excel('file.xlsx', index=False)
EDIT:
I forgot the id count column; id is already taken as a column name, so call it id1:
df1.insert(0, 'id1', df1.sum(axis=1))
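As a side note, pd.crosstab builds the same count table in a single call; a small sketch using the sample df from the question:

df1 = pd.crosstab([df['id'], df['type']], df['eval'])
df1.insert(0, 'id1', df1.sum(axis=1))          # total count per (id, type)
df1.reset_index().to_excel('file.xlsx', index=False)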
I want to create a matrix from multiple files. Each of these files has a list of gene names, of various lengths.
To create the matrix I need to gather all the gene names from all the files into the first column.
Then, for each file, append a new column (with the file name as header): if the gene name is in that file's list, put 1 in the cell; if the gene name in the first column is not found in that file, put 0.
This is what I got until now:
import os
files = os.listdir("/gene_files")
df01 = pd.DataFrame()
for file in files:
    file_name = "/gene_files/" + file
    for file in file:
        df = pd.read_csv(file, sep='\t', header = 0)
        df01 = pd.concat(df01, df)
df01.to_csv('gene_matrix.csv')
This gives me all the gene lists in one column. I then drop all the duplicates:
df01 = df01.drop_duplicates()
Now I need to append a new column for each file, evaluate whether each gene name in the first column occurs in that file, and add 1s or 0s accordingly. I'm stuck... and also utterly confused.
The files look like this:
File1 File2 File3 etc...
GeneName GeneName GeneName
A B A
B C B
C D E
F E F
The output I want would be a matrix/dataframe:
GeneName File1 File2 File3
A 1 0 1
B 1 1 1
C 1 1 0
D 0 1 0
E 0 1 1
F 1 0 1
These are the actual first few lines of the files:
fileAIB fileAIC fileAID
Plekha4 Dffb Rabggta
1700012D01Rik A430033K04Rik Sc5d
Isg20 Tubb3 Gnpnat1
Smad6 Rbm17 Nabp1
Ndufa10 Isg20 Isg20
Wdr90 Arrb2 Lrrc27
Thumpd1 Ankrd13c Add3
Cd2bp2 Ndufa10 Prkaa1
Cndp2 Inpp5e Gmeb2
Jmjd1c Lamtor2 B4galt7
And the output would look like:
GeneName fileAIB fileAIC fileAID
Plekha4 1 0 0
1700012D01Rik 1 0 0
Isg20 1 1 1
Smad6 1 0 0
Ndufa10 1 0 0
Wdr90 1 0 0
Thumpd1 1 0 0
Cd2bp2 1 0 0
Rbm17 1 0 1
Jmjd1c 1 0 0
Dffb 0 1 0
A430033K04Rik 0 1 0
Tubb3 0 1 1
Rbm17 0 1 0
Arrb2 0 1 0
Ankrd13c 0 1 0
Ndufa10 0 1 0
Gnpnat1 0 1 0
Lamtor2 0 1 0
Rabggta 0 0 1
Sc5d 0 0 1
Gnpnat1 0 0 1
Lrrc27 0 0 1
Prkaa1 0 0 1
Gmeb2 0 0 1
B4galt7 0 0 1
Consider appending all text file data into a long form dataframe and then pivoting to wide format:
import pandas as pd

dfList = []
for file in files:
    df = pd.read_csv(file, sep='\t', header=None, names=['GeneName'])
    df = df.assign(file=file.replace('.txt', ''), num=1)
    dfList.append(df)

finaldf = pd.concat(dfList)

# PIVOT (LONG TO WIDE)
finaldf = finaldf.pivot_table(index=['GeneName'], columns=['file'],
                              values='num', aggfunc='count').fillna(0).reset_index()

# CONVERT TO INTEGER (.ix is deprecated; use .iloc)
numcols = list(range(1, len(finaldf.columns)))
finaldf.iloc[:, numcols] = finaldf.iloc[:, numcols].astype(int)
Output (using posted actual three columns as .txt files)
# file GeneName fileAIB fileAIC fileAID
# 0 1700012D01Rik 1 0 0
# 1 A430033K04Rik 0 1 0
# 2 Add3 0 0 1
# 3 Ankrd13c 0 1 0
# 4 Arrb2 0 1 0
# 5 B4galt7 0 0 1
# 6 Cd2bp2 1 0 0
# 7 Cndp2 1 0 0
# 8 Dffb 0 1 0
# 9 Gmeb2 0 0 1
# 10 Gnpnat1 0 0 1
# 11 Inpp5e 0 1 0
# 12 Isg20 1 1 1
# 13 Jmjd1c 1 0 0
# 14 Lamtor2 0 1 0
# 15 Lrrc27 0 0 1
# 16 Nabp1 0 0 1
# 17 Ndufa10 1 1 0
# 18 Plekha4 1 0 0
# 19 Prkaa1 0 0 1
# 20 Rabggta 0 0 1
# 21 Rbm17 0 1 0
# 22 Sc5d 0 0 1
# 23 Smad6 1 0 0
# 24 Thumpd1 1 0 0
# 25 Tubb3 0 1 0
# 26 Wdr90 1 0 0
You should be able to do this easily by putting the gene name in the index, creating a column of all ones with the file name as the column name, and then concatenating. This can all be done in one for loop. Your current for-loop syntax doesn't look right. Try something like the following, which assumes read_csv gives you a one-column dataframe with the column name 'GeneName'.
import os
import pandas as pd

files = os.listdir("/gene_files")
df_list = []
for file in files:
    df = pd.read_csv(file, sep='\t', header=0)
    df[file] = 1
    df = df.set_index('GeneName')  # assign back: set_index returns a copy
    df_list.append(df)
pd.concat(df_list, axis=1).fillna(0)
Try using pd.concat() with the axis argument. In your case:
df01 = pd.concat([df01, df], axis=1)
Beforehand you can use df.columns = [filename] to give each new dataframe a column name.
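Another way to get the same 0/1 matrix is plain set membership, skipping concatenation entirely. A minimal sketch; the file names here are hypothetical stand-ins for the os.listdir result:

import pandas as pd

files = ['fileAIB.txt', 'fileAIC.txt', 'fileAID.txt']  # hypothetical names

# One set of gene names per file (each file holds a single unnamed column).
gene_sets = {f.replace('.txt', ''): set(pd.read_csv(f, sep='\t', header=None)[0])
             for f in files}

# The union of all genes forms the rows; membership gives the 0/1 cells.
all_genes = sorted(set.union(*gene_sets.values()))
matrix = pd.DataFrame({name: [int(g in genes) for g in all_genes]
                       for name, genes in gene_sets.items()},
                      index=pd.Index(all_genes, name='GeneName'))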
I have the following dataframe:
c3ann c3nfx c3per c4ann c4per pastr primf
c3ann 1 0 1 0 1 0 1
c3nfx 1 0 1 0 1 0 1
c3per 1 0 1 0 1 0 1
c4ann 1 0 1 0 1 0 1
c4per 1 0 1 0 1 0 1
pastr 1 0 1 0 1 0 1
primf 1 0 1 0 1 0 1
I would like to reorder the rows and columns so that the order is this:
primf pastr c3ann c3nfx c3per c4ann c4per
I can do this for just the columns like this:
cols = ['primf', 'pastr', 'c3ann', 'c3nfx', 'c3per', 'c4ann', 'c4per']
df = df[cols]
How do I do this such that the row headers are also changed appropriately?
You can use reindex to reorder both the columns and index at the same time.
df = df.reindex(index=cols, columns=cols)
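One design note: reindex fills missing labels with NaN instead of raising, so it is forgiving if a label is absent. If every label is guaranteed to exist, plain label selection does the same reordering:

df = df.loc[cols, cols]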
I have a DataFrame where a combination of column values (A, B, C) identifies a unique address. I would like to identify all such rows and assign them a unique identifier that I increment per address.
For example
A B C D E
0 1 1 0 1
0 1 2 0 1
0 1 1 1 1
0 1 3 0 1
0 1 2 1 0
0 1 1 2 1
I would like to generate the following
A B C D E ID
0 1 1 0 1 0
0 1 2 0 1 1
0 1 1 1 1 0
0 1 3 0 1 2
0 1 2 1 0 1
0 1 1 2 1 0
I tried the following:
id = 0
def set_id(df):
    global id
    df['ID'] = id
    id += 1

df.groupby(['A','B','C']).transform(set_id)
This returns a NULL dataframe... this is definitely not the way to do it. I am new to pandas. The above should actually use df[['A','B','C']].drop_duplicates() to get all unique values.
Thank you.
I think this is what you need:
df2 = df[['A','B','C']].drop_duplicates() #get unique values of ABC
df2 = df2.reset_index(drop = True).reset_index() #reset index to create a column named index
df2=df2.rename(columns = {'index':'ID'}) #rename index to ID
df = pd.merge(df,df2,on = ['A','B','C'],how = 'left') #append ID column with merge
# Create tuple triplet using values from columns A, B & C.
df['key'] = [triplet for triplet in zip(*[df[col].values.tolist() for col in ['A', 'B', 'C']])]
# Sort dataframe on new `key` column.
df.sort_values('key', inplace=True)
# Use `groupby` to keep running total of changes in key value.
df['ID'] = (df['key'] != df['key'].shift()).cumsum() - 1
# Clean up.
del df['key']
df.sort_index(inplace=True)
>>> df
A B C D E ID
0 0 1 1 0 1 0
1 0 1 2 0 1 1
2 0 1 1 1 1 0
3 0 1 3 0 1 2
4 0 1 2 1 0 1
5 0 1 1 2 1 0
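For what it's worth, on pandas 0.20.2 and later both answers collapse to a single groupby call; with sort=False the IDs follow order of first appearance, matching the output above:

df['ID'] = df.groupby(['A', 'B', 'C'], sort=False).ngroup()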