How to add or merge duplicate rows and columns - python

I have a data frame with over 2000 rows, but the first column contains a number of duplicates. I want to add up the data for each set of duplicates. An example of the data is shown below.
[Example data: a Player column containing duplicate names, plus per-player stat columns such as Gls]

Considering your case, I think this is what you are expecting: it squeezes the duplicate rows into a single row per player and returns the sums.
df2 = df.groupby('Player').sum()
print(df2)
You can also specify explicitly which column to apply the sum() operation to. The example below sums the Gls column shown in the image you provided.
df2 = df.groupby('Player')['Gls'].sum()
print(df2)
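For a self-contained illustration, here is a minimal sketch with invented toy data (the player names and values are assumptions, not from the question's image):
import pandas as pd

df = pd.DataFrame({
    'Player': ['Messi', 'Ronaldo', 'Messi', 'Ronaldo', 'Neymar'],
    'Gls': [2, 1, 3, 2, 1],
})

# one row per player, duplicate rows summed
df2 = df.groupby('Player')['Gls'].sum()
print(df2)
# Player
# Messi      5
# Neymar     1
# Ronaldo    3
# Name: Gls, dtype: int64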
Hope this helps your case; if not, please feel free to comment. Thanks.

Related

Pandas: How to Squash Multiple Rows into One Row with More Columns

I'm looking for a way to convert 5 rows in a pandas dataframe into one row with 5 times the amount of columns (so I have the same information, just squashed into one row). Let me explain:
I'm working with hockey game statistics. Currently, there are 5 rows representing the same game in different situations, each with 111 columns. I want to convert these 5 rows into one row (so that one game is represented by one row) but keep the information contained in the different situations. In other words, I want to convert 5 rows of 111 columns each into one row with 554 columns (554 = 111 * 5 - 1, since the five gameId columns collapse into a single join key).
Here is my DF head: [table omitted; each gameId appears in five rows, one per situation]
So, as an example, we can see the first 5 rows have gameId = 2008020001, but each have a different situation (i.e. other, all, 5on5, 4on5, and 5on4). I'd like these 5 rows to be converted into one row with gameId = 2008020001, and with columns labelled according to their situation.
For example, I want columns for all unblockedShotAttemptsAgainst, 5on5 unblockedShotAttemptsAgainst, 5on4 unblockedShotAttemptsAgainst, 4on5 unblockedShotAttemptsAgainst, and other unblockedShotAttemptsAgainst (and the same for every other stat).
Any info would be greatly appreciated. It's also worth mentioning that my dataset is fairly large (177990 rows), so an efficient solution is desired. The resulting dataframe should have one-fifth the rows and 5 times the columns. Thanks in advance!
---- What I've Tried Already ----
I tried to do this using df.apply() and some nested for loops, but it got very ugly very quickly and was incredibly slow. I think pandas has a better way of doing this, but I'm not sure how.
Looking at other SO answers, I initially thought it might have something to do with df.pivot() or df.groupby(), but I couldn't figure it out. Thanks again!
It sounds like what you are looking for is pd.get_dummies()
cols = df.columns
# get dummies for the situation column
df1 = pd.get_dummies(df, columns=['situation'])
# drop all original columns from df1, keeping only the dummy columns
# ('situation' is already gone after get_dummies, hence errors='ignore')
df1 = df1.drop(columns=cols, errors='ignore')
# add the dummy cols back onto the original df
df = pd.concat([df, df1], axis=1)
# collapse duplicate rows (note the result must be assigned)
df = df.groupby(list(cols)).first().reset_index()
For the last line you can also use df.drop_duplicates(): https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop_duplicates.html
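As an alternative sketch (assuming the frame really has one row per gameId/situation pair, with the remaining columns being the stats), df.pivot can widen the five situation rows into one row per game directly:
# one row per gameId, one column per (stat, situation) pair
wide = df.pivot(index='gameId', columns='situation')
# flatten the MultiIndex columns into names like 'unblockedShotAttemptsAgainst_5on5'
wide.columns = [f'{stat}_{situation}' for stat, situation in wide.columns]
wide = wide.reset_index()
This avoids row-wise loops entirely, so it should stay fast even at 177990 rows.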

Using pandas I need to create a new column that takes a value from a previous row

I have many rows of data and one of the columns is a flag. I have 3 identifiers that need to match between rows.
What I have:
partnumber, datetime1, previousdatetime1, datetime2, previousdatetime2, flag
What I need:
partnumber, datetime1, previousdatetime1, datetime2, previousdatetime2, flag, previous_flag
I need to find the flag from the row where partnumber matches, where previousdatetime1 (current row)* == datetime1 (other row), and where previousdatetime2 (current row) == datetime2 (other row).
*To note, the rows are not necessarily in order so the previous row may come later in the dataframe
I'm not quite sure where to start. I got this logic working in PBI using a LookUpValue and basically finding where partnumber = Value(partnumber), datetime1 = Value(datetime1), datetime2 = Value(datetime2). Thanks for the help!
Okay, so assuming you've read this in as a pandas dataframe df1:
(1) Make a copy of the dataframe:
df2 = df1.copy()
(2) For sanity, drop some columns in df2:
df2.drop(['previousdatetime1', 'previousdatetime2'], axis=1, inplace=True)
Now you have a df2 that has columns:
['partnumber','datetime1','datetime2','flag']
(3) Merge the two dataframes, matching on partnumber and on both previous timestamps (the question requires previousdatetime1 and previousdatetime2 to match):
newdf = df1.merge(df2, how='left',
                  left_on=['partnumber', 'previousdatetime1', 'previousdatetime2'],
                  right_on=['partnumber', 'datetime1', 'datetime2'],
                  suffixes=('', '_previous'))
Now you have a newdf that has columns:
['partnumber', 'datetime1', 'previousdatetime1', 'datetime2', 'previousdatetime2', 'flag', 'datetime1_previous', 'datetime2_previous', 'flag_previous']
(4) Drop the unnecessary columns:
newdf.drop(['datetime1_previous', 'datetime2_previous'], axis=1, inplace=True)
Now you have a newdf that has columns:
['partnumber','datetime1','previousdatetime1','datetime2','previousdatetime2','flag','flag_previous']
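A minimal sketch with invented values to show the lookup in action (the data below is assumed, not from the question):
import pandas as pd

df1 = pd.DataFrame({
    'partnumber': ['A', 'A'],
    'datetime1': ['2020-01-02', '2020-01-03'],
    'previousdatetime1': ['2020-01-01', '2020-01-02'],
    'datetime2': ['2020-02-02', '2020-02-03'],
    'previousdatetime2': ['2020-02-01', '2020-02-02'],
    'flag': [0, 1],
})
df2 = df1.drop(columns=['previousdatetime1', 'previousdatetime2'])
newdf = df1.merge(df2, how='left',
                  left_on=['partnumber', 'previousdatetime1', 'previousdatetime2'],
                  right_on=['partnumber', 'datetime1', 'datetime2'],
                  suffixes=('', '_previous'))
# the second row finds the first row (flag_previous = 0);
# the first row has no matching predecessor, so its flag_previous is NaN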

How can I convert columns of a table to rows in python?

One ID can have multiple dates and results; the date and result columns are stacked sideways, and I want them stacked into a single Date column and a single Result column. How can I convert the columns of a table to rows?
[Table which needs to be transposed: one row per ID, with columns ID, Date, Result, Date1, Result1, Date2, Result2]
[Desired result: one row per ID/date pair, with columns ID, Date, Result]
This seems to work, not sure if it's the best solution:
df2 = pd.concat([
    df.loc[:, ['ID', 'Date', 'Result']],
    df.loc[:, ['ID', 'Date1', 'Result1']].rename(columns={'Date1': 'Date', 'Result1': 'Result'}),
    df.loc[:, ['ID', 'Date2', 'Result2']].rename(columns={'Date2': 'Date', 'Result2': 'Result'}),
]).dropna().sort_values(by='ID')
It's just separating the dataframes, concatenating them together inline, removing the NAs and then sorting.
If you are looking to reshape data in pandas more generally, see pandas.DataFrame.pivot; its documentation has more examples of the syntax. (Note that pivot goes long-to-wide; the wide-to-long direction needed here is covered by pandas.wide_to_long.)
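For completeness, a sketch using pandas.wide_to_long, assuming the wide frame has exactly one row per ID with columns ID, Date, Result, Date1, Result1, Date2, Result2 (the 'visit' index name below is invented):
import pandas as pd

# give the first pair a numeric suffix so every pair matches the Date<i>/Result<i> pattern
renamed = df.rename(columns={'Date': 'Date0', 'Result': 'Result0'})
tidy = (pd.wide_to_long(renamed, stubnames=['Date', 'Result'], i='ID', j='visit')
          .dropna()
          .reset_index()
          .drop(columns='visit')
          .sort_values('ID'))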

How can I rearrange a pandas dataframe into this specific configuration?

I'm trying to rearrange a pandas dataframe that looks like this: [wide table: CentroidID_O and CentroidID_D columns, followed by one float column per 15-minute interval]
into a dataframe that looks like this:
[long table: columns CentroidID_O, CentroidID_D, dt_15, counts]
This is derived as follows: for each original row, several rows are created; the first two columns are unchanged, the third column records which of the remaining original columns the value came from, and the fourth column holds the corresponding float value (e.g. 20.33333).
I don't think this is a pivot table, but I'm not sure how exactly to get this cleanly. Apologies if this question has been asked before, I can't seem to find what I'm looking for. Apologies also if my explanation or formatting were less than ideal! Thanks for your help.
I think you need DataFrame.melt, with GroupBy.size if you need the counts per combination of the 3 columns:
df1 = df.melt(id_vars=['CentroidID_O', 'CentroidID_D'], var_name='dt_15')
df2 = (df1.groupby(['CentroidID_O', 'CentroidID_D', 'dt_15'])
          .size()
          .reset_index(name='counts'))
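A toy run of the same two steps (the interval column names and values are invented; CentroidID_O, CentroidID_D, and dt_15 come from the answer):
import pandas as pd

df = pd.DataFrame({
    'CentroidID_O': ['c1', 'c1', 'c2'],
    'CentroidID_D': ['c9', 'c9', 'c8'],
    '08:00': [20.3, 20.3, 11.0],
    '08:15': [18.7, 19.2, 11.0],
})
df1 = df.melt(id_vars=['CentroidID_O', 'CentroidID_D'], var_name='dt_15')
df2 = (df1.groupby(['CentroidID_O', 'CentroidID_D', 'dt_15'])
          .size()
          .reset_index(name='counts'))
# df2 has one row per (CentroidID_O, CentroidID_D, dt_15) with its row count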

Replacing one column with another in python/pandas, but keeping the replaced column's values where the replacing column has a NaN value?

I have two columns in a data frame that I want to merge together. The attached image shows the columns:
Image of the two columns I want to merge
I want the "precio_uf_y" column to take precedent over the "precio_uf_x" column a new column, but if there is a NaN value in the "precio_uf_y" column I want the value in the "precio_uf_x" column to go to the new column. My ideal new merged column would look like this:
Desired new column
I have tried different merge functions, and taking min and max with numpy, but maybe there is a way to write a function with these parameters?
Thank you in advance for any help.
You can use df.apply:
import numpy as np

def get_new_val(x):
    # fall back to precio_uf_x when precio_uf_y is missing
    if np.isnan(x.precio_uf_y):
        return x.precio_uf_x
    else:
        return x.precio_uf_y

df["new_precio_uf"] = df.apply(get_new_val, axis=1)
