Python DataFrames concat or append problem

I have a problem with dataframes in Python. I am trying to copy certain rows to a new dataframe but I can't figure it out.
There are two dataframes:
pokemon_data
# HP Attack Defense Sp. Atk Sp. Def Speed
0 1 45 49 49 65 65 45
1 2 60 62 63 80 80 60
2 3 80 82 83 100 100 80
3 4 80 100 123 122 120 80
4 5 39 52 43 60 50 65
... ... ... ... ... ... ... ...
795 796 50 100 150 100 150 50
796 797 50 160 110 160 110 110
797 798 80 110 60 150 130 70
798 799 80 160 60 170 130 80
799 800 80 110 120 130 90 70
800 rows × 7 columns
combats_data
First_pokemon Second_pokemon Winner
0 266 298 1
1 702 701 1
2 191 668 1
3 237 683 1
4 151 231 0
... ... ... ...
49995 707 126 0
49996 589 664 0
49997 303 368 1
49998 109 89 0
49999 9 73 0
50000 rows × 3 columns
I created third dataset with columns:
output1
HP0 Attack0 Defens0 Sp. Atk0 Sp. Def0 Speed0 HP1 Attack1 Defense1 Sp. Atk1 Sp. Def1 Speed1 Winner
What I'm trying to do is copy attributes from pokemon_data to output1 in order from combats_data.
HP0 and HP1 are, respectively, the HP of the first Pokemon and the HP of the second Pokemon.
I want to use that data in neural networks with TensorFlow to predict what Pokemon would win.

For this type of wrangling, you should first "melt" or "tidy" the combats_data so each ID has its own row, then do a "join" or "merge" of the two dataframes.
You didn't provide a minimal reproducible example, so here's mine:
import pandas as pd
df1 = pd.DataFrame({'id': [1,2,3,4,5],
                    'var1': [10,20,30,40,50],
                    'var2': [15,25,35,45,55]})
df2 = pd.DataFrame({'id1': [1,2],
                    'id2': [3,4],
                    'outcome': [1,4]})
df2tidy = pd.melt(df2, id_vars=['outcome'], value_vars=['id1', 'id2'],
                  var_name='name', value_name='id')
df2tidy
# outcome name id
# 0 1 id1 1
# 1 4 id1 2
# 2 1 id2 3
# 3 4 id2 4
output = pd.merge(df2tidy, df1, on='id')
output
# outcome name id var1 var2
# 0 1 id1 1 10 15
# 1 4 id1 2 20 25
# 2 1 id2 3 30 35
# 3 4 id2 4 40 45
which you could then train some sort of classifier on outcome.
(Btw, you should make outcome a 0 or 1 (for pokemon1 vs pokemon2) instead of the actual ID of the winner.)
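A minimal sketch of that re-encoding on the toy df2 above (applied before the melt), assuming outcome currently holds the winning id:
# 0 if the first pokemon (id1) won, 1 if the second pokemon (id2) won
df2['outcome'] = (df2['outcome'] == df2['id2']).astype(int)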

So I would like to create a new dataframe based on these two dataframes. For example:
# ids represent pokemons and their attributes
pokemons = pd.DataFrame({'id': [1,2,3,4,5],
                         'HP': [10,20,30,40,50],
                         'Attack': [15,25,35,45,55],
                         'Defense': [25,15,45,15,35]})
# here 0 or 1 represents whether the first or second pokemon won
combats = pd.DataFrame({'id1': [1,2],
                        'id2': [3,4],
                        'winner': [0,1]})
# in the output data I want to replace ids with attributes; the order is based on the combats dataframe
output = pd.DataFrame({'HP1': [10,20],
                       'Attack1': [15,25],
                       'Defense1': [25,15],
                       'HP2': [30,40],
                       'Attack2': [35,45],
                       'Defense2': [45,15],
                       'winner': [0,1]})
Not sure if this is the right way of thinking about it. I want to train a neural network to figure out which pokemon will win.

This is a solution from user part on the 4programmers.net forum.
import pandas as pd
if __name__ == "__main__":
    pokemon_data = pd.DataFrame({
        "Id": [1, 2, 3, 4, 5],
        "HP": [45, 60, 80, 80, 39],
        "Attack": [49, 62, 82, 100, 52],
        "Defense": [49, 63, 83, 123, 43],
        "Sp. Atk": [65, 80, 100, 122, 60],
        "Sp. Def": [65, 80, 100, 120, 50],
        "Speed": [45, 60, 80, 80, 65]})
    combats_data = pd.DataFrame({
        "First_pokemon": [1, 2, 3],
        "Second_pokemon": [2, 3, 4],
        "Winner": [1, 0, 1]})
    output = pokemon_data.merge(combats_data, left_on="Id", right_on="First_pokemon")
    output = output.merge(pokemon_data, left_on="Second_pokemon", right_on="Id",
                          suffixes=("_pokemon1", "_pokemon2"))
    print(output)
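If only the attribute columns and the label are wanted in the final frame, a possible follow-up (column names assumed from the merge above) is to drop the ID columns:
output = output.drop(columns=["Id_pokemon1", "Id_pokemon2",
                              "First_pokemon", "Second_pokemon"])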

Related

how to do complex calculations in pandas dataframe

sample dataframe:
df = pd.DataFrame({'sales': ['2020-01','2020-02','2020-03','2020-04','2020-05','2020-06'],
                   '2020-01': [24,42,18,68,24,30],
                   '2020-02': [24,42,18,68,24,30],
                   '2020-03': [64,24,70,70,88,57],
                   '2020-04': [22,11,44,3,5,78],
                   '2020-05': [11,35,74,12,69,51]})
I want to find df['L2'] as shown below.
I studied pandas rolling, groupby, etc., but cannot solve it.
Please read the L2 formula below and give me your opinion.
L2 formula
L2(Jan-20) = 24
-------------------
sales 2020-01
0 2020-01 24
-------------------
L2(Feb-20) = 132 (sum of the 2x2 matrix below)
sales 2020-01 2020-02
0 2020-01 24 24
1 2020-02 42 42
-------------------
L2(Mar-20) = 154 (sum of the 2x2 matrix below)
sales 2020-02 2020-03
0 2020-02 42 24
1 2020-03 18 70
-------------------
L2(Apr-20) = 187 (sum of the 2x2 matrix below)
sales 2020-03 2020-04
0 2020-03 70 44
1 2020-04 70 3
output
Unnamed: 0 sales Jan-20 Feb-20 Mar-20 Apr-20 May-20 L2 L3
0 0 Jan-20 24 24 64 22 11 24 24
1 1 Feb-20 42 42 24 11 35 132 132
2 2 Mar-20 18 18 70 44 74 154 326
3 3 Apr-20 68 68 70 3 12 187 350
4 4 May-20 24 24 88 5 69 89 545
5 5 Jun-20 30 30 57 78 51 203 433
Values = f.values[:, 1:]
L2 = []
RANGE = Values.shape[0]
for a in range(RANGE):
    if a == 0:
        result = Values[a, a]
    else:
        if Values[a-1:a+1, a-1:a+1].shape == (2, 1):
            result = np.sum(Values[a-1:a+1, a-2:a])
        else:
            result = np.sum(Values[a-1:a+1, a-1:a+1])
    L2.append(result)
print(L2)
L2 output: [24, 132, 154, 187, 89, 203]
f["L2"] = L2
f:
import pandas as pd
import numpy as np
# make a dataset
df = pd.DataFrame({'sales': ['2020-01','2020-02','2020-03','2020-04','2020-05','2020-06'],
                   '2020-01': [24,42,18,68,24,30],
                   '2020-02': [24,42,18,68,24,30],
                   '2020-03': [64,24,70,70,88,57],
                   '2020-04': [22,11,44,3,5,78],
                   '2020-05': [11,35,74,12,69,51]})
print(df)
# datawork(L2)
for i in range(0, df.shape[0]):
    if i == 0:
        df.loc[i, 'L2'] = df.loc[i, '2020-01']
    else:
        if i != df.shape[0]-1:
            df.loc[i, 'L2'] = df.iloc[i-1:i+1, i:i+2].sum().sum()
        if i == df.shape[0]-1:
            df.loc[i, 'L2'] = df.iloc[i-1:i+1, i-1:i+1].sum().sum()
print(df)
# sales 2020-01 2020-02 2020-03 2020-04 2020-05 L2
#0 2020-01 24 24 64 22 11 24.0
#1 2020-02 42 42 24 11 35 132.0
#2 2020-03 18 18 70 44 74 154.0
#3 2020-04 68 68 70 3 12 187.0
#4 2020-05 24 24 88 5 69 89.0
#5 2020-06 30 30 57 78 51 203.0
I tried another method.
This method uses reshape long (melt in Python), but I applied it twice here because the time frequency of sales and of the other columns in df is monthly rather than daily, so I did one more reshape long to build an integer column corresponding to the monthly date.
(I have used Stata more often than Python; in Stata I would only need to reshape long once because it handles the monthly time frequency, and the reshape task is much easier there than in pandas.)
If you are interested, take a look:
# 00.module
import pandas as pd
import numpy as np
from order import order # https://stackoverflow.com/a/68464246/16478699
# 0.make a dataset
df = pd.DataFrame({'sales': ['2020-01', '2020-02', '2020-03', '2020-04', '2020-05', '2020-06'],
                   '2020-01': [24, 42, 18, 68, 24, 30],
                   '2020-02': [24, 42, 18, 68, 24, 30],
                   '2020-03': [64, 24, 70, 70, 88, 57],
                   '2020-04': [22, 11, 44, 3, 5, 78],
                   '2020-05': [11, 35, 74, 12, 69, 51]}
                  )
df.to_stata('dataset.dta', version=119, write_index=False)
print(df)
# 1.reshape long(in python: melt)
t = list(df.columns)
t.remove('sales')
df_long = df.melt(id_vars='sales', value_vars=t, var_name='var', value_name='val')
df_long['id'] = list(range(1, df_long.shape[0] + 1)) # make id for another resape long
print(df_long)
# 2.another reshape long(in python: melt, reason: make int(col name: tid) corresponding to monthly date of sales and monthly columns in df)
df_long2 = df_long.melt(id_vars=['id', 'val'], value_vars=['sales', 'var'])
df_long2['tid'] = df_long2['value'].apply(lambda x: 1 + list(df_long2.value.unique()).index(x))
print(df_long2)
# 3.back to wide form with tid(in python: pd.pivot)
df_wide = pd.pivot(df_long2, index=['id', 'val'], columns='variable', values=['value', 'tid'])
df_wide.columns = df_wide.columns.map(lambda x: x[1] if x[0] == 'value' else f'{x[0]}_{x[1]}') # change multiindex columns name into just normal columns name
df_wide = df_wide.reset_index()
print(df_wide)
# 4.make values of L2
for i in df_wide.tid_sales.unique():
    if list(df_wide.tid_sales.unique()).index(i) + 1 == len(df_wide.tid_sales.unique()):
        df_wide.loc[df_wide['tid_sales'] == i, 'L2'] = df_wide.loc[(((df_wide['tid_sales'] == i) | (
            df_wide['tid_sales'] == i - 1)) & ((df_wide['tid_var'] == i - 1) | (
            df_wide['tid_var'] == i - 2))), 'val'].sum()
    else:
        df_wide.loc[df_wide['tid_sales'] == i, 'L2'] = df_wide.loc[(((df_wide['tid_sales'] == i) | (
            df_wide['tid_sales'] == i - 1)) & ((df_wide['tid_var'] == i) | (
            df_wide['tid_var'] == i - 1))), 'val'].sum()
print(df_wide)
# 5.back to shape of df with L2(reshape wide, in python: pd.pivot)
df_final = df_wide.drop(columns=df_wide.filter(regex='^tid').columns)  # no more columns starting with tid needed
df_final = pd.pivot(df_final, index=['sales', 'L2'], columns='var', values='val').reset_index()
df_final = order(df_final, 'L2', f_or_l='last') # order function is made by me
print(df_final)
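For comparison, here is a compact sketch of the same 2x2-window sum written directly against the question's six-row df; the window-clamping logic is my own reading of the L2 formula, not part of the answers above:
import numpy as np
# only the monthly value columns, in order
vals = df[['2020-01', '2020-02', '2020-03', '2020-04', '2020-05']].to_numpy()
n_rows, n_cols = vals.shape
L2 = []
for i in range(n_rows):
    r0 = max(i - 1, 0)             # top row of the 2x2 window
    c1 = min(i, n_cols - 1)        # right column, clamped to the last month
    c0 = max(c1 - 1, 0)            # left column
    L2.append(vals[r0:i + 1, c0:c1 + 1].sum())
print(L2)  # [24, 132, 154, 187, 89, 203]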

How to group by and aggregate on multiple columns in pandas

I have following dataframe in pandas
ID Balance ATM_drawings Value
1 100 50 345
1 150 33 233
2 100 100 333
2 100 100 234
I want the data in the following desired format:
ID Balance_mean Balance_sum ATM_Drawings_mean ATM_drawings_sum
1 75 250 41.5 83
2 200 100 200 100
I am using following command to do it in pandas
df1= df[['Balance','ATM_drawings']].groupby('ID', as_index = False).agg(['mean', 'sum']).reset_index()
But, it does not give what I intended to get.
You can use a dictionary to specify aggregation functions for each series:
d = {'Balance': ['mean', 'sum'], 'ATM_drawings': ['mean', 'sum']}
res = df.groupby('ID').agg(d)
# flatten MultiIndex columns
res.columns = ['_'.join(col) for col in res.columns.values]
print(res)
Balance_mean Balance_sum ATM_drawings_mean ATM_drawings_sum
ID
1 125 250 41.5 83
2 100 200 100.0 200
Or you can define d via dict.fromkeys:
d = dict.fromkeys(('Balance', 'ATM_drawings'), ['mean', 'sum'])
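If your pandas version supports named aggregation (0.25+), a sketch like this builds the flat column names directly and skips the MultiIndex-flattening step:
res = df.groupby('ID').agg(
    Balance_mean=('Balance', 'mean'),
    Balance_sum=('Balance', 'sum'),
    ATM_drawings_mean=('ATM_drawings', 'mean'),
    ATM_drawings_sum=('ATM_drawings', 'sum'),
).reset_index()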
Not sure how to achieve this using agg, but you could reuse the `groupby` object to avoid having to do the operation multiple times, and then use transformations:
import pandas as pd
df = pd.DataFrame({
    "ID": [1, 1, 2, 2],
    "Balance": [100, 150, 100, 100],
    "ATM_drawings": [50, 33, 100, 100],
    "Value": [345, 233, 333, 234]
})
gb = df.groupby("ID")
df["Balance_mean"] = gb["Balance"].transform("mean")
df["Balance_sum"] = gb["Balance"].transform("sum")
df["ATM_drawings_mean"] = gb["ATM_drawings"].transform("mean")
df["ATM_drawings_sum"] = gb["ATM_drawings"].transform("sum")
print(df)
Which yields:
ID Balance Balance_mean Balance_sum ATM_drawings ATM_drawings_mean ATM_drawings_sum Value
0 1 100 125 250 50 41.5 83 345
1 1 150 125 250 33 41.5 83 233
2 2 100 100 200 100 100.0 200 333
3 2 100 100 200 100 100.0 200 234
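If you then want one row per ID, as in the desired output, one possible follow-up on this transformed frame (column names as above) is to drop the per-row columns and the duplicates:
summary = df.drop(columns=["Balance", "ATM_drawings", "Value"]).drop_duplicates("ID")
print(summary)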

Convert dict constructor to Pandas MultiIndex dataframe

I have a lot of data that I'd like to structure in a Pandas dataframe. However, I need a multi-index format for this. The Pandas MultiIndex feature has always confused me and also this time I can't get my head around it.
I built the structure as I want it as a dict, but because my actual data is much larger, I want to use Pandas instead. The code below is the dict variant. Note that the original data has a lot more labels and more rows as well.
The idea is that the original data contains rows of a task with index Task_n that has been performed by a participant with index Participant_n. Each row is a segment. Even though the original data does not have this distinction, I want to add this to my dataframe. In other words:
Participant_n | Task_n | val | dur
----------------------------------
1 | 1 | 12 | 2
1 | 1 | 3 | 4
1 | 1 | 4 | 12
1 | 2 | 11 | 11
1 | 2 | 34 | 4
The above example contains one participant and two tasks, with three and two segments (rows) respectively.
In Python, with a dict structure this looks like this:
import pandas as pd
cols = ['Participant_n', 'Task_n', 'val', 'dur']
data = [[1,1,25,83],
[1,1,4,68],
[1,1,9,987],
[1,2,98,98],
[1,2,84,4],
[2,1,9,21],
[2,2,15,6],
[2,2,185,6],
[2,2,18,4],
[2,3,8,12],
[3,1,7,78],
[3,1,12,88],
[3,2,12,48]]
d = pd.DataFrame(data, columns=cols)
part_d = {}
for row in d.itertuples():
    participant_n = row.Participant_n
    participant = "participant" + str(participant_n)
    task = "task" + str(row.Task_n)
    if participant in part_d:
        part_d[participant]['all_sum']['val'] += int(row.val)
        part_d[participant]['all_sum']['dur'] += int(row.dur)
    else:
        part_d[participant] = {
            'prof': 0 if participant_n < 20 else 1,
            'all_sum': {
                'val': int(row.val),
                'dur': int(row.dur),
            }
        }
    if task in part_d[participant]:
        # Get already existing keys
        k = list(part_d[participant][task].keys())
        k_int = []
        # Only get the ints (i.e. not all_sum etc.)
        for n in k:
            # Get digit from e.g. seg1
            n = n[3:]
            try:
                k_int.append(int(n))
            except ValueError:
                pass
        # Increment max by 1
        i = max(k_int) + 1
        part_d[participant][task][f"seg{i}"] = {
            'val': int(row.val),
            'dur': int(row.dur),
        }
        part_d[participant][task]['task_sum']['val'] += int(row.val)
        part_d[participant][task]['task_sum']['dur'] += int(row.dur)
    else:
        part_d[participant][task] = {
            'seg1': {
                'val': int(row.val),
                'dur': int(row.dur),
            },
            'task_sum': {
                'val': int(row.val),
                'dur': int(row.dur),
            }
        }
print(part_d)
In the end result here I have some additional variables such as: task_sum (the sum over the task of a participant), all_sum (sum of all a participant's actions), and also prof which is an arbitrary boolean flag. The resulting dict looks like this (not beautified to save space. If you want to inspect, open in text editor as JSON or Python dict and beautify):
{'participant1': {'prof': 0, 'all_sum': {'val': 220, 'dur': 1240}, 'task1': {'seg1': {'val': 25, 'dur': 83}, 'task_sum': {'val': 38, 'dur': 1138}, 'seg2': {'val': 4, 'dur': 68}, 'seg3': {'val': 9, 'dur': 987}}, 'task2': {'seg1': {'val': 98, 'dur': 98}, 'task_sum': {'val': 182, 'dur': 102}, 'seg2': {'val': 84, 'dur': 4}}}, 'participant2': {'prof': 0, 'all_sum': {'val': 235, 'dur': 49}, 'task1': {'seg1': {'val': 9, 'dur': 21}, 'task_sum': {'val': 9, 'dur': 21}}, 'task2': {'seg1': {'val': 15, 'dur': 6}, 'task_sum': {'val': 218, 'dur': 16}, 'seg2': {'val': 185, 'dur': 6}, 'seg3': {'val': 18, 'dur': 4}}, 'task3': {'seg1': {'val': 8, 'dur': 12}, 'task_sum': {'val': 8, 'dur': 12}}}, 'participant3': {'prof': 0, 'all_sum': {'val': 31, 'dur': 214}, 'task1': {'seg1': {'val': 7, 'dur': 78}, 'task_sum': {'val': 19, 'dur': 166}, 'seg2': {'val': 12, 'dur': 88}}, 'task2': {'seg1': {'val': 12, 'dur': 48}, 'task_sum': {'val': 12, 'dur': 48}}}}
Instead of a dictionary, I would like this to end up in a pd.DataFrame with multiple indexes that looks like the representation below, or similar. (For simplicity's sake, instead of task1 or seg1 I just used the indices.)
Participant Prof all_sum Task Task_sum Seg val dur
val dur val dur
====================================================================
participant1 0 220 1240 1 38 1138 1 25 83
2 4 68
3 9 987
2 182 102 1 98 98
2 84 4
--------------------------------------------------------------------
participant2 0 235 49 1 9 21 1 9 21
2 218 16 1 15 6
2 185 6
3 18 4
3 8 12 1 8 12
--------------------------------------------------------------------
participant3 0 31 214 1 19 166 1 7 78
2 12 88
2 12 48 1 12 48
Is this a structure that is possible in Pandas? If not, what are reasonable alternatives?
Again I have to emphasise that in reality there is a lot more data and possibly more sub-levels. The solution thus has to be flexible, and efficient. If it makes things a lot simpler, I am willing to only have multi-index on one axis, and change the header to:
Participant Prof all_sum_val all_sum_dur Task Task_sum_val Task_sum_dur Seg
The main issue I am having is that I do not understand how I can build a multi-index df if I don't know the dimensions in advance. I don't know in advance how many tasks or segments there will be. So I am pretty sure I can keep the loop construct from my initial dict approach, and I guess I'd then have to append/concat to an initially empty DataFrame, but the question is then what the structure has to look like. It can't be a simple Series, because that does not take a multi-index into account. So how?
For those who have read this far and want to try their hand at this, I think my original code can be re-used for the most part (the loop and variable assignment), but instead of a dict it has to use accessors on the DataFrame. That is an important aspect: the data should be easily readable with getters/setters, just as with a regular DataFrame. E.g. it should be easy to get the duration value for participant two, task 2, segment 2, and so on. But also, getting a subset of the data (e.g. where prof == 0) should pose no problems.
My only suggestion is to get rid of all your dictionary stuff. All of that code can be re-written in Pandas without much effort. This will likely speed up the transformation process as well but will take some time. To help you in the process I have rewritten the section you provided. The rest is up to you.
import pandas as pd
cols = ['Participant_n', 'Task_n', 'val', 'dur']
data = [[1,1,25,83],
[1,1,4,68],
[1,1,9,987],
[1,2,98,98],
[1,2,84,4],
[2,1,9,21],
[2,2,15,6],
[2,2,185,6],
[2,2,18,4],
[2,3,8,12],
[3,1,7,78],
[3,1,12,88],
[3,2,12,48]]
df = pd.DataFrame(data, columns=cols)
df["Task Sum val"] = df.groupby(["Participant_n","Task_n"])["val"].transform("sum")
df["Task Sum dur"] = df.groupby(["Participant_n","Task_n"])["dur"].transform("sum")
df["seg"] =df.groupby(["Participant_n","Task_n"]).cumcount() + 1
df["All Sum val"] = df.groupby("Participant_n")["val"].transform("sum")
df["All Sum dur"] = df.groupby("Participant_n")["dur"].transform("sum")
df = df.set_index(["Participant_n","All Sum val","All Sum dur","Task_n","Task Sum val","Task Sum dur"])[["seg","val","dur"]]
df = df.sort_index()
df
Output
seg val dur
Participant_n All Sum val All Sum dur Task_n Task Sum val Task Sum dur
1 220 1240 1 38 1138 1 25 83
1138 2 4 68
1138 3 9 987
2 182 102 1 98 98
102 2 84 4
2 235 49 1 9 21 1 9 21
2 218 16 1 15 6
16 2 185 6
16 3 18 4
3 8 12 1 8 12
3 31 214 1 19 166 1 7 78
166 2 12 88
2 12 48 1 12 48
Try to run this code and let me know what you think. Comment with any questions.
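To address the lookup requirement from the question (e.g. the dur value for participant 2, task 2, segment 2), a small sketch against the indexed frame built above; the xs calls and the seg filter are my additions, not part of the original answer:
# rows for participant 2, task 2 (the selected index levels are dropped)
sub = df.xs(2, level="Participant_n").xs(2, level="Task_n")
# dur for segment 2 of that task
print(sub.loc[sub["seg"] == 2, "dur"].iloc[0])  # 6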
I faced a similar issue with data presentation and came up with the following helper functions for groupby with subtotals.
With this process it's possible to generate subtotals for an arbitrary number of group-by columns; however, the output data has a different format. Instead of the subtotals being put in their own columns, each subtotal adds an extra row to the data frame.
For interactive data exploration & analysis, I find this very helpful as it's possible to get the subtotals with just a couple of lines of code.
import numpy as np
import pandas as pd

def get_subtotals(frame, columns, aggvalues, subtotal_level):
    if subtotal_level == 0:
        return frame.groupby(columns, as_index=False).agg(aggvalues)
    elif subtotal_level == len(columns):
        return pd.DataFrame(frame.agg(aggvalues)).transpose().assign(
            **{c: np.nan for i, c in enumerate(columns)}
        )
    return frame.groupby(
        columns[:subtotal_level],
        as_index=False
    ).agg(aggvalues).assign(
        **{c: np.nan for i, c in enumerate(columns[subtotal_level:])}
    )

def groupby_with_subtotals(frame, columns, aggvalues, grand_totals=False, totals_position='last'):
    gt = 1 if grand_totals else 0
    out = pd.concat(
        [get_subtotals(frame, columns, aggvalues, i)
         for i in range(len(columns)+gt)]
    ).sort_values(columns, na_position=totals_position)
    out[columns] = out[columns].fillna('total')
    return out.set_index(columns)
Reusing the dataframe creation code from Gabriel A's answer:
cols = ['Participant_n', 'Task_n', 'val', 'dur']
data = [[1,1,25,83],
[1,1,4,68],
[1,1,9,987],
[1,2,98,98],
[1,2,84,4],
[2,1,9,21],
[2,2,15,6],
[2,2,185,6],
[2,2,18,4],
[2,3,8,12],
[3,1,7,78],
[3,1,12,88],
[3,2,12,48]]
df = pd.DataFrame(data, columns=cols)
It is first necessary to add the seg column
df['seg'] = df.groupby(['Participant_n', 'Task_n']).cumcount() + 1
Then we can use groupby_with_subtotals like this. Additionally, note that you can place the subtotals at the top and also include grand totals by passing grand_totals=True, totals_position='first'.
groupby_columns = ['Participant_n', 'Task_n', 'seg']
groupby_aggs = {'val': 'sum', 'dur': 'sum'}
aggdf = groupby_with_subtotals(df, groupby_columns, groupby_aggs)
aggdf
# outputs
dur val
Participant_n Task_n seg
1 1.0 1.0 83 25
2.0 68 4
3.0 987 9
total 1138 38
2.0 1.0 98 98
2.0 4 84
total 102 182
total total 1240 220
2 1.0 1.0 21 9
total 21 9
2.0 1.0 6 15
2.0 6 185
3.0 4 18
total 16 218
3.0 1.0 12 8
total 12 8
total total 49 235
3 1.0 1.0 78 7
2.0 88 12
total 166 19
2.0 1.0 48 12
total 48 12
total total 214 31
Here, the subtotal rows are marked with total, and the left-most total indicates the subtotal level.
Once the aggregate data frame is created, it's possible to access the subtotals using loc. For example:
aggdf.loc[1,'total','total']
# outputs:
dur 1240
val 220
Name: (1, total, total), dtype: int64
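And, as mentioned above, subtotals at the top together with grand totals can be requested from the same helper (a usage sketch, with the same df and arguments as before):
aggdf_top = groupby_with_subtotals(df, groupby_columns, groupby_aggs,
                                   grand_totals=True, totals_position='first')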

Fastest way to do cumulative totals in Pandas dataframe

I've got a pandas dataframe of golfers' round scores going back to 2003 (approx 300000 rows). It looks something like this:
Date        Golfer          Tournament              Score  Player Total Rounds Played
2008-01-01  Tiger Woods     Invented Tournament R1  72     50
2008-01-01  Phil Mickelson  Invented Tournament R1  73     108
I want the 'Player Total Rounds Played' column to be a running total of the number of rounds (i.e. instance in the dataframe) that a player has played up to that date. Is there a quick way of doing it? My current solution (basically using iterrows and then a one-line function) works fine but will take approx 11hrs to run.
Thanks,
Tom
Here is one way:
df = df.sort_values('Date')
df['Rounds CumSum'] = df.groupby('Golfer')['Rounds'].cumsum()
For example:
import pandas as pd
df = pd.DataFrame([['A', 70, 50],
['B', 72, 55],
['A', 73, 45],
['A', 71, 60],
['B', 74, 55],
['A', 72, 65]],
columns=['Golfer', 'Rounds', 'Played'])
df['Rounds CumSum'] = df.groupby('Golfer')['Rounds'].cumsum()
# Golfer Rounds Played Rounds CumSum
# 0 A 70 50 70
# 1 B 72 55 72
# 2 A 73 45 143
# 3 A 71 60 214
# 4 B 74 55 146
# 5 A 72 65 286
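If 'Player Total Rounds Played' is meant to be a running count of rounds (i.e. how many rounds the golfer has appeared in so far) rather than a running sum of scores, a sketch using cumcount on the question's frame (column names assumed from the question) would be closer:
df = df.sort_values('Date')
df['Player Total Rounds Played'] = df.groupby('Golfer').cumcount() + 1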

output multiple files based on column value python pandas

I have a sample pandas data frame:
import pandas as pd
df = {'ID': [73, 68,1,94,42,22, 28,70,47, 46,17, 19, 56, 33 ],
'CloneID': [1, 1, 1, 1, 1, 2, 2, 3, 3, 3, 4, 4, 4, 4 ],
'VGene': ['64D', '64D', '64D', 61, 61, 61, 311, 311, 311, 311, 311, 311, 311, 311]}
df = pd.DataFrame(df)
it looks like this:
df
Out[7]:
CloneID ID VGene
0 1 73 64D
1 1 68 64D
2 1 1 64D
3 1 94 61
4 1 42 61
5 2 22 61
6 2 28 311
7 3 70 311
8 3 47 311
9 3 46 311
10 4 17 311
11 4 19 311
12 4 56 311
13 4 33 311
I want to write a simple script to output each CloneID to a different output file, so in this case there would be 4 different files.
The first file would be named 'CloneID1.txt' and it would look like this:
CloneID ID VGene
1 73 64D
1 68 64D
1 1 64D
1 94 61
1 42 61
second file would be named 'CloneID2.txt':
CloneID ID VGene
2 22 61
2 28 311
third file would be named 'CloneID3.txt':
CloneID ID VGene
3 70 311
3 47 311
3 46 311
and last file would be 'CloneID4.txt':
CloneID ID VGene
4 17 311
4 19 311
4 56 311
4 33 311
The code I found online was:
import pandas as pd
data = pd.read_excel('data.xlsx')
for group_name, data in data.groupby('CloneID'):
    with open('results.csv', 'a') as f:
        data.to_csv(f)
but it outputs everything to one file instead of multiple files.
You can do something like the following:
In [19]:
gp = df.groupby('CloneID')
for g in gp.groups:
    print('CloneID' + str(g) + '.txt')
    print(gp.get_group(g).to_csv())
CloneID1.txt
,CloneID,ID,VGene
0,1,73,64D
1,1,68,64D
2,1,1,64D
3,1,94,61
4,1,42,61
CloneID2.txt
,CloneID,ID,VGene
5,2,22,61
6,2,28,311
CloneID3.txt
,CloneID,ID,VGene
7,3,70,311
8,3,47,311
9,3,46,311
CloneID4.txt
,CloneID,ID,VGene
10,4,17,311
11,4,19,311
12,4,56,311
13,4,33,311
So here we iterate over the groups with for g in gp.groups:, use this to build the result file path, and call to_csv on the group, so the following should work for you:
gp = df.groupby('CloneID')
for g in gp.groups:
    path = 'CloneID' + str(g) + '.txt'
    gp.get_group(g).to_csv(path)
Actually the following would be even simpler:
gp = df.groupby('CloneID')
gp.apply(lambda x: x.to_csv('CloneID' + str(x.name) + '.txt'))
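If the commas and the index column in those files are unwanted (the desired files in the question look tab- or space-separated without an index), a sketch that writes tab-separated files without the index:
for clone_id, group in df.groupby('CloneID'):
    group.to_csv('CloneID' + str(clone_id) + '.txt', sep='\t', index=False)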
