How can I use groupby to merge rows in Pandas? - python

I have a dataframe that looks like this:
ID
Name
Major1
Major2
Major3
12
Dave
English
NaN
NaN
12
Dave
NaN
Biology
NaN
12
Dave
NaN
NaN
History
13
Nate
Spanish
NaN
NaN
13
Nate
NaN
Business
NaN
I need to merge rows resulting in this:
ID
Name
Major1
Major2
Major3
12
Dave
English
Biology
History
13
Nate
Spanish
Business
NaN
I know this is possible with groupby but I haven't been able to get it to work correctly. Can anyone help?

If you are intent on using groupby, you could do something like this:
dataframe = dataframe.melt(['ID', 'Name']).dropna()
dataframe = dataframe.groupby(['ID', 'Name', 'variable'])['value'].sum().unstack('variable')
You may have to mess with the column names a bit, but this is what comes to me as a possible solution using groupby.

Use melt and pivot
>>> df.melt(['ID', 'Name']).dropna() \
.pivot(['ID', 'Name'], 'variable', 'value') \
.reset_index().rename_axis(columns=None)
ID Name Major1 Major2 Major3
0 12 Dave English Biology History
1 13 Nate Spanish Business NaN

Related

How to create multiple dataframe from a excel data table

I have extracted this data frame from an excel spreadsheet using pandas library,
after getting the needed columns and,
I have table formatted like this,
REF PLAYERS
0 103368 Andrés Posada Sanmiguel
1 300552 Diego Posada Sanmiguel
2 103304 Roberto Motta Stanziola
3 NaN NaN
4 REF PLAYERS
5 1047012 ANABELLA EISMANN DE AMAYA
6 104701 FERNANDO ENRIQUE AMAYA CASTRO
7 103451 AUGUSTO ANTONIO ALVARADO AZCARRAGA
8 103484 Kevin Adrian Villarreal Kam
9 REF PLAYERS
10 NaN NaN
11 NaN NaN
12 NaN NaN
13 NaN NaN
14 REF PLAYERS
15 NaN NaN
16 NaN NaN
17 NaN NaN
18 NaN NaN
19 REF PLAYERS
I want to create multiple dataframes converting each row [['REF', 'PLAYERS']] to a new dataframe columns.
suggestions are welcomed I also need to preserve the blank spaces. A pandas newbie.
For this to work, you must first read the dataframe from the file differently: set the argument header=None in your pd.read_excel() function. Because now your columns are called "REF" and "PLAYERS", but we would like to group by them.
Then the first column name probably would be "0", and the first line will be as follows, where the df is the name of your dataframe:
# Set unique index for each group
df["group_id"] = (df[0] == "REF").cumsum()
Solution:
# Set unique index for each group
df["group_id"] = (df["name_of_first_column"] == "REF").cumsum()
# Iterate over groups
dataframes = []
for name, group in df.groupby("group_id"):
df_ = group
# promote 1st row to column name
df_.columns = df_.iloc[0]
# and drop it
df_ = df_.iloc[1:]
# drop index column
df_ = df_[["REF", "PLAYERS"]]
# append to the list of dataframes
dataframes.append(df_)
All your multiple dataframes are now stored in an array dataframes.
You can split your dataframe, into equal lengths (in your case 4 rows for each df), using np.split.
Since you want 4 rows per dataframe, you can split it into 5 different df:
import numpy as np
dfs = [df.loc[idx] for idx in np.split(df.index,5)]
And then create your individual dataframes:
df1 = dfs[1]
df1
REF PLAYERS
4 REF PLAYERS
5 1047012 ANABELLA EISMANN DE AMAYA
6 104701 FERNANDO ENRIQUE AMAYA CASTRO
7 103451 AUGUSTO ANTONIO ALVARADO AZCARRAGA
df2 = dfs[2]
df2
REF PLAYERS
8 103484 Kevin Adrian Villarreal Kam
9 REF PLAYERS
10 NaN NaN
11 NaN NaN

How to spread the data in pandas?

i'm working on spread r equivalent in pandas my dataframe looks like below
Name age Language year Period
Nik 18 English 2018 Beginer
John 19 French 2019 Intermediate
Kane 33 Russian 2017 Advanced
xi 44 Thai 2015 Beginer
and looking for output like this
Name age Language Beginer Intermediate Advanced
Nik 18 English 2018
John 19 French 2019
Kane 33 Russian 2017
John 44 Thai 2015
my code
pd.pivot(x1,values='year', columns=['Period'])
i'm getting only these columns Beginer,Intermediate,Advanced not the entire dataframe
while reshaping it i tried using index but says no duplicates in index.
So i created new index column but still not getting entire dataframe
If I understood correctly you could do something like this:
# create dummy columns
res = pd.get_dummies(df['Period']).astype(np.int64)
res.values[np.arange(len(res)), np.argmax(res.values, axis=1)] = df['year']
# concat and drop columns
output = pd.concat((df.drop(['year', 'Period'], 1), res), 1)
print(output)
Output
Name age Language Advanced Beginner Intermediate
0 Nik 18 English 0 2018 0
1 John 19 French 0 0 2019
2 Kane 33 Russian 2017 0 0
3 xi 44 Thai 0 2015 0
If you want to match the exact same output, convert the column to categorical first, and specify the order:
# encode as categorical
df['Period'] = pd.Categorical(df['Period'], ['Beginner', 'Advanced', 'Intermediate'], ordered=True)
# create dummy columns
res = pd.get_dummies(df['Period']).astype(np.int64)
res.values[np.arange(len(res)), np.argmax(res.values, axis=1)] = df['year']
# concat and drop columns
output = pd.concat((df.drop(['year', 'Period'], 1), res), 1)
print(output)
Output
Name age Language Beginner Advanced Intermediate
0 Nik 18 English 2018 0 0
1 John 19 French 0 0 2019
2 Kane 33 Russian 0 2017 0
3 xi 44 Thai 2015 0 0
Finally if you want to replace the 0, with missing values, add a third step:
# create dummy columns
res = pd.get_dummies(df['Period']).astype(np.int64)
res.values[np.arange(len(res)), np.argmax(res.values, axis=1)] = df['year']
res = res.replace(0, np.nan)
Output (with missing values)
Name age Language Beginner Advanced Intermediate
0 Nik 18 English 2018.0 NaN NaN
1 John 19 French NaN NaN 2019.0
2 Kane 33 Russian NaN 2017.0 NaN
3 xi 44 Thai 2015.0 NaN NaN
One way you can get to the equivalent of R's spread function using pd.pivot_table:
If you don't mind about the index, you can use reset_index() on the newly created df:
new_df = (pd.pivot_table(df, index=['Name','age','Language'],columns='Period',values='year',aggfunc='sum')).reset_index()
which will get you:
Period Name age Language Advanced Beginer Intermediate
0 John 19 French NaN NaN 2019.0
1 Kane 33 Russian 2017.0 NaN NaN
2 Nik 18 English NaN 2018.0 NaN
3 xi 44 Thai NaN 2015.0 NaN
EDIT
If you have many columns in your dataframe and you want to include them in the reshaped dataset:
Grab in a list the columns to be used in pivot table (i.e. Period and year)
Grab all the other columns in your dataframe in a list (using not in)
Use the index_cols as index in the pd.pivot_table() command
non_index_cols = ['Period','year'] # SPECIFY THE 2 COLUMNS IN THE PIVOT TABLE TO BE USED
index_cols = [i for i in df.columns if i not in non_index_cols] # GET ALL THE REST IN A LIST
new_df = (pd.pivot_table(df, index=index_cols,columns='Period',values='year',aggfunc='sum')).reset_index()
The new_df, will include all the columns of your initial dataframe.

How to add a word to the end of each string in a specific column (pandas dataframe)

I want to add "NSW" to the end of each town name in a pandas data frame.The dataframe currently looks like this:
0 Parkes NaN
1 Forbes NaN
2 Yanco NaN
3 Orange NaN
4 Narara NaN
5 Wyong NaN
I need every town to also have the word NSW added to it
Try with
df['Name'] = df['Name'] + 'NSW'

Drop items with all NaN values from a pandas multi-indexed dataframe

I'm having some trouble wrangling a dataframe that looks something like this:
value
year name
2015 bob 10.0
cat NaN
2016 bob NaN
cat NaN
I want to drop those items where all the values for the same name are NaN. In this case the result should be this:
value
year name
2015 bob 10.0
2016 bob NaN
All the cat values were NaN so cat is gone. Since bob had one non-NaN value, it gets to stay.
Note that both the 2016 values were NaN in the input, but 2016 is still around in the output - because this rule only applies to the name column. Ideally I'd like to be able to provide which column this applies to as a parameter.
Is this even possible? How should I do this? I'm okay with reindexing/transposing/whatever if that's needed to get the job done (only if it's necessary though!).
You can use groupby with filter
df.groupby(level='name').filter(lambda x: x.value.notnull().any())
value
year name
2015 bob 10.0
2016 bob NaN
In [208]: df.reset_index().sort_values('name').drop_duplicates(['value']).set_index(['year','name'])
Out[208]:
value
year name
2015 bob 10.0
2016 bob NaN
You can use unstack, isnull, all, and stack:
df.unstack().loc[:,~df.unstack().isnull().all()].stack(-1, dropna=False)
Or use notnull and any:
df.unstack().loc[:,df.unstack().notnull().any()].stack(-1, dropna=False)
Output:
value
year name
2015 bob 10.0
2016 bob NaN

parse text into different columns in pandas

I have a dataframe containing the query part of multiple urls.
For eg.
in=2015-09-19&stars_4=yes&min=4&a=3&city=New+York,+NY,+United+States&out=2015-09-20&search=1\n
in=2015-09-14&stars_3=yes&min=4&a=3&city=London,+United+Kingdom&out=2015-09-15&search=1\n
in=2015-09-26&Filter=175&min=5&a=2&city=New+York,+NY,+United+States&out=2015-09-27&search=2\n
My desired dataframe should be:
in Filter stars min a max city country out search
--------------------------------------------------------------------------------
2015-09-19 NAN stars_4 4 3 NAN NY US 2015-09-20 1
2015-09-14 NAN stars_3 4 3 NAN LONDON UK 2015-09-15 1
2015-09-26 175 NAN 5 2 NAN NY US 2015-09-27 2
Is there any easy way out for this using regex?
Any help will be much appreciated! Thanks in advance!
A quick-and-dirty fix would be to just use list comprehensions:
json_data = [{c[0]:c[1] for c in [b.split('=') for b in line.split('&')]} \
for line in open('data_file.txt')]
df = pd.DataFrame.from_records(json_data)
This won't solve your location classification issues, but will get you a better dataframe from which to work.

Categories

Resources