I'm working on an equivalent of R's spread in pandas. My dataframe looks like below:
Name  age  Language  year  Period
Nik   18   English   2018  Beginner
John  19   French    2019  Intermediate
Kane  33   Russian   2017  Advanced
xi    44   Thai      2015  Beginner
and I'm looking for output like this:
Name  age  Language  Beginner  Intermediate  Advanced
Nik   18   English   2018
John  19   French              2019
Kane  33   Russian                           2017
xi    44   Thai      2015
My code:
pd.pivot(x1,values='year', columns=['Period'])
I'm getting only the columns Beginner, Intermediate and Advanced, not the entire dataframe.
While reshaping I tried passing an index, but it says the index cannot contain duplicates.
So I created a new index column, but I'm still not getting the entire dataframe.
If I understood correctly, you could do something like this:
import numpy as np
import pandas as pd

# create dummy columns (one 0/1 column per Period value)
res = pd.get_dummies(df['Period']).astype(np.int64)
# put the year into the single "hot" column of each row
res.values[np.arange(len(res)), np.argmax(res.values, axis=1)] = df['year']
# drop the original columns and concatenate
output = pd.concat((df.drop(columns=['year', 'Period']), res), axis=1)
print(output)
Output
Name age Language Advanced Beginner Intermediate
0 Nik 18 English 0 2018 0
1 John 19 French 0 0 2019
2 Kane 33 Russian 2017 0 0
3 xi 44 Thai 0 2015 0
If you want to match the exact same output, convert the column to categorical first, and specify the order:
# encode as categorical with an explicit column order
df['Period'] = pd.Categorical(df['Period'], ['Beginner', 'Advanced', 'Intermediate'], ordered=True)
# create dummy columns
res = pd.get_dummies(df['Period']).astype(np.int64)
res.values[np.arange(len(res)), np.argmax(res.values, axis=1)] = df['year']
# drop the original columns and concatenate
output = pd.concat((df.drop(columns=['year', 'Period']), res), axis=1)
print(output)
Output
Name age Language Beginner Advanced Intermediate
0 Nik 18 English 2018 0 0
1 John 19 French 0 0 2019
2 Kane 33 Russian 0 2017 0
3 xi 44 Thai 2015 0 0
Finally, if you want to replace the 0s with missing values, add a third step:
# create dummy columns
res = pd.get_dummies(df['Period']).astype(np.int64)
res.values[np.arange(len(res)), np.argmax(res.values, axis=1)] = df['year']
res = res.replace(0, np.nan)
Output (with missing values)
Name age Language Beginner Advanced Intermediate
0 Nik 18 English 2018.0 NaN NaN
1 John 19 French NaN NaN 2019.0
2 Kane 33 Russian NaN 2017.0 NaN
3 xi 44 Thai 2015.0 NaN NaN
One way you can get the equivalent of R's spread function is by using pd.pivot_table.
If you don't mind about the index, you can use reset_index() on the newly created df:
new_df = pd.pivot_table(df, index=['Name', 'age', 'Language'], columns='Period', values='year', aggfunc='sum').reset_index()
which will get you:
Period Name age Language Advanced Beginner Intermediate
0 John 19 French NaN NaN 2019.0
1 Kane 33 Russian 2017.0 NaN NaN
2 Nik 18 English NaN 2018.0 NaN
3 xi 44 Thai NaN 2015.0 NaN
EDIT
If you have many columns in your dataframe and you want to include them all in the reshaped dataset:
Grab the columns to be used inside the pivot table (i.e. Period and year) in a list
Grab all the other columns of your dataframe in another list (using not in)
Use the index_cols as index in the pd.pivot_table() call
non_index_cols = ['Period', 'year']  # SPECIFY THE 2 COLUMNS IN THE PIVOT TABLE TO BE USED
index_cols = [i for i in df.columns if i not in non_index_cols]  # GET ALL THE REST IN A LIST
new_df = pd.pivot_table(df, index=index_cols, columns='Period', values='year', aggfunc='sum').reset_index()
The new_df will include all the columns of your initial dataframe.
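For reference, here is a minimal end-to-end sketch of that approach (the sample dataframe below just mirrors the data from the question; adapt the column lists to your own frame):

import pandas as pd

df = pd.DataFrame({
    'Name': ['Nik', 'John', 'Kane', 'xi'],
    'age': [18, 19, 33, 44],
    'Language': ['English', 'French', 'Russian', 'Thai'],
    'year': [2018, 2019, 2017, 2015],
    'Period': ['Beginner', 'Intermediate', 'Advanced', 'Beginner'],
})

# everything except the pivoted pair goes into the index
non_index_cols = ['Period', 'year']
index_cols = [c for c in df.columns if c not in non_index_cols]

new_df = (pd.pivot_table(df, index=index_cols, columns='Period',
                         values='year', aggfunc='sum')
            .reset_index()
            .rename_axis(columns=None))
print(new_df)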
Unlike the other questions, I don't want to create a new column with the new values; I want to use the same column, just replacing the old values with new ones where they exist.
For a new column I would have:
import pandas as pd
df1 = pd.DataFrame(data = {'Name' : ['Carl','Steave','Julius','Marcus'],
'Work' : ['Home','Street','Car','Airplane'],
'Year' : ['2022','2021','2020','2019'],
'Days' : ['',5,'','']})
df2 = pd.DataFrame(data = {'Name' : ['Carl','Julius'],
'Work' : ['Home','Car'],
'Days' : [1,2]})
df_merge = pd.merge(df1, df2, how='left', on=['Name','Work'], suffixes=('','_'))
print(df_merge)
Name Work Year Days Days_
0 Carl Home 2022 1.0
1 Steave Street 2021 5 NaN
2 Julius Car 2020 2.0
3 Marcus Airplane 2019 NaN
But what I really want is exactly like this:
Name Work Year Days
0 Carl Home 2022 1
1 Steave Street 2021 5
2 Julius Car 2020 2
3 Marcus Airplane 2019
How can I make such a union?
You can use combine_first, setting the empty strings to NaNs beforehand (the indexing at the end is to rearrange the columns to match the desired output):
df1.loc[df1["Days"] == "", "Days"] = float("NaN")
df1.combine_first(df1[["Name", "Work"]].merge(df2, "left"))[df1.columns.values]
This outputs:
Name Work Year Days
0 Carl Home 2022 1.0
1 Steave Street 2021 5
2 Julius Car 2020 2.0
3 Marcus Airplane 2019 NaN
You can use the update method of Series:
df1.Days.update(pd.merge(df1, df2, how='left', on=['Name','Work']).Days_y)
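As a quick self-contained check, here is a sketch using the sample frames from the question; update only overwrites the positions where the merged Days_y is not NaN, so existing values such as 5 are kept:

import pandas as pd

df1 = pd.DataFrame({'Name': ['Carl', 'Steave', 'Julius', 'Marcus'],
                    'Work': ['Home', 'Street', 'Car', 'Airplane'],
                    'Year': ['2022', '2021', '2020', '2019'],
                    'Days': ['', 5, '', '']})
df2 = pd.DataFrame({'Name': ['Carl', 'Julius'],
                    'Work': ['Home', 'Car'],
                    'Days': [1, 2]})

# the left merge keeps df1's row order, so Days_y lines up with df1's index
df1.Days.update(pd.merge(df1, df2, how='left', on=['Name', 'Work']).Days_y)
print(df1)
# Days becomes 1.0, 5, 2.0 and an empty string for Marcus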
I have a dataframe that looks like this:
ID  Name  Major1   Major2    Major3
12  Dave  English  NaN       NaN
12  Dave  NaN      Biology   NaN
12  Dave  NaN      NaN       History
13  Nate  Spanish  NaN       NaN
13  Nate  NaN      Business  NaN
I need to merge rows resulting in this:
ID  Name  Major1   Major2    Major3
12  Dave  English  Biology   History
13  Nate  Spanish  Business  NaN
I know this is possible with groupby but I haven't been able to get it to work correctly. Can anyone help?
If you are intent on using groupby, you could do something like this:
dataframe = dataframe.melt(['ID', 'Name']).dropna()
dataframe = dataframe.groupby(['ID', 'Name', 'variable'])['value'].sum().unstack('variable')
You may have to mess with the column names a bit, but this is what comes to me as a possible solution using groupby.
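For completeness, a self-contained sketch of that idea on the sample data, including the column cleanup (the frame construction below is just for illustration):

import pandas as pd

dataframe = pd.DataFrame({
    'ID': [12, 12, 12, 13, 13],
    'Name': ['Dave', 'Dave', 'Dave', 'Nate', 'Nate'],
    'Major1': ['English', None, None, 'Spanish', None],
    'Major2': [None, 'Biology', None, None, 'Business'],
    'Major3': [None, None, 'History', None, None],
})

dataframe = dataframe.melt(['ID', 'Name']).dropna()
dataframe = dataframe.groupby(['ID', 'Name', 'variable'])['value'].sum().unstack('variable')

# tidy the column labels back up
dataframe = dataframe.reset_index().rename_axis(columns=None)
print(dataframe)
#    ID  Name   Major1    Major2   Major3
# 0  12  Dave  English   Biology  History
# 1  13  Nate  Spanish  Business      NaN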
Use melt and pivot
>>> df.melt(['ID', 'Name']).dropna() \
       .pivot(index=['ID', 'Name'], columns='variable', values='value') \
       .reset_index().rename_axis(columns=None)
ID Name Major1 Major2 Major3
0 12 Dave English Biology History
1 13 Nate Spanish Business NaN
I am comparing two Excel files which contain information about the students of two schools. However, those files might contain a different number of rows.
The first step I did was to import the Excel files into two dataframes:
df1 = pd.read_excel('School A - Information.xlsx')
df2 = pd.read_excel('School B - Information.xlsx')
print(df1)
Name Age Birth_Country Previous Schools
0 tom 10 USA 3
1 nick 15 MEX 1
2 juli 14 CAN 0
3 tom 19 NOR 1
print(df2)
Name Age Birth_Country Previous Schools
0 tom 10 USA 3
1 tom 19 NOR 1
2 nick 15 MEX 4
After this, I would like to check the divergences between those two dataframes (index order is not important). However, I am receiving a warning because the dataframes differ in size.
compare = df1.values == df2.values
<ipython-input-9-7cc64ba0e622>:1: DeprecationWarning: elementwise comparison failed; this will raise an error in the future.
compare = df1.values == df2.values
print(compare)
False
In addition, I would like to create a third DataFrame that shows the corresponding divergences.
import numpy as np

rows, cols = np.where(compare == False)
for item in zip(rows, cols):
    df1.iloc[item[0], item[1]] = '{} --> {}'.format(df1.iloc[item[0], item[1]], df2.iloc[item[0], item[1]])
However, using this code is not working, as the index order may be different between the two dataframes.
My expected output should be the below dataframe:
You can use pd.merge to accomplish this. If you're unfamiliar with dataframe merges, here's a post that describes relational database merging ideas: link. So in this case, what we want to do is first do a left merge of df2 onto df1 to find how the Previous Schools column differs:
df_merged = pd.merge(df1, df2, how="left", on=["Name", "Age", "Birth_Country"], suffixes=["_A", "_B"])
print(df_merged)
will give you a new dataframe
Name Age Birth_Country Previous Schools_A Previous Schools_B
0 tom 10 USA 3 3.0
1 nick 15 MEX 1 4.0
2 juli 14 CAN 0 NaN
3 tom 19 NOR 1 1.0
This new dataframe has all the information you're looking for. To find just the rows where the Previous Schools entries differ:
df_different = df_merged[df_merged["Previous Schools_A"]!=df_merged["Previous Schools_B"]]
print(df_different)
Name Age Birth_Country Previous Schools_A Previous Schools_B
1 nick 15 MEX 1 4.0
2 juli 14 CAN 0 NaN
and to find the rows where Previous Schools has not changed:
df_unchanged = df_merged[df_merged["Previous Schools_A"]==df_merged["Previous Schools_B"]]
print(df_unchanged)
Name Age Birth_Country Previous Schools_A Previous Schools_B
0 tom 10 USA 3 3.0
3 tom 19 NOR 1 1.0
If I were you, I'd stop here, because the final dataframe you want will end up with generic object column types due to the mix of strings and integers, which will limit its uses... but maybe you need it in that particular formatting for some reason. In that case, it's all about putting these dataframe subsets together in the right way to get your desired formatting. Here's one way.
First, initialize the final dataframe with the unchanged rows:
df_final = df_unchanged[["Name", "Age", "Birth_Country", "Previous Schools_A"]].copy()
df_final = df_final.rename(columns={"Previous Schools_A": "Previous Schools"})
print(df_final)
Name Age Birth_Country Previous Schools
0 tom 10 USA 3
3 tom 19 NOR 1
Now process the entries that have changed between the dataframes. There are two cases here: entries whose value has changed (where Previous Schools_B is not NaN) and entries that are new (where Previous Schools_B is NaN). We'll deal with each in turn:
changed_entries = df_different[~pd.isnull(df_different["Previous Schools_B"])].copy()
changed_entries["Previous Schools"] = changed_entries["Previous Schools_A"].astype('str') + " --> " + changed_entries["Previous Schools_B"].astype('int').astype('str')
changed_entries = changed_entries.drop(columns=["Previous Schools_A", "Previous Schools_B"])
print(changed_entries)
Name Age Birth_Country Previous Schools
1 nick 15 MEX 1 --> 4
and now process the entries that are completely new:
new_entries = df_different[pd.isnull(df_different["Previous Schools_B"])].copy()
new_entries = "NaN --> " + new_entries[["Name", "Age", "Birth_Country", "Previous Schools_A"]].astype('str')
new_entries = new_entries.rename(columns={"Previous Schools_A": "Previous Schools"})
print(new_entries)
Name Age Birth_Country Previous Schools
2 NaN --> juli NaN --> 14 NaN --> CAN NaN --> 0
and finally, concatenate all the dataframes:
df_final = pd.concat([df_final, changed_entries, new_entries])
print(df_final)
Name Age Birth_Country Previous Schools
0 tom 10 USA 3
3 tom 19 NOR 1
1 nick 15 MEX 1 --> 4
2 NaN --> juli NaN --> 14 NaN --> CAN NaN --> 0
I have a DataFrame and want to extract 3 columns from it, but one of them is an input from the user. I made a list, but I need it to be iterable so I can run a for loop over it.
So far I got through by making a dictionary with 2 of the columns, making a list of each and zipping them... but I really need all 3 columns...
My code:
Data=pd.read_csv(----------)
selec=input("What month would you want to show?")
NewData=[(Data['Country']),(Data['City']),(Data[selec].astype('int64'))]
#here I try to iterate:
iteration=[i for i in NewData if NewData[i]<=25]
print (iteration)
TypeError: list indices must be integers or slices, not Series
My CSV is the following:
I want to be able to choose the month with the variable "selec" and filter the results of the month I've chosen... so the output for selec="Feb" would be:
I also tried with loc/iloc, but no luck at all (unhashable type: 'list').
See the below example for how you can:
select specific columns from a DataFrame by providing a list of columns between the selection brackets (link to tutorial)
select specific rows from a DataFrame by providing a condition between the selection brackets (link to tutorial)
iterate rows of a DataFrame, although I don't suppose you need it - if you'd like to keep working with the DataFrame after filtering it, it's better to use the method mentioned above (you won't have to put the rows back together, and it will likely be more performant because pandas is optimized for bulk operations)
import pandas as pd
# this is just for testing, instead of pd.read_csv(...)
df = pd.DataFrame([
dict(Country="Spain", City="Madrid", Jan="15", Feb="16", Mar="17", Apr="18", May=""),
dict(Country="Spain", City="Galicia", Jan="1", Feb="2", Mar="3", Apr="4", May=""),
dict(Country="France", City="Paris", Jan="0", Feb="2", Mar="3", Apr="4", May=""),
dict(Country="Algeria", City="Argel", Jan="20", Feb="28", Mar="29", Apr="30", May=""),
])
print("---- Original df:")
print(df)
selec = "Feb" # let's pretend this comes from input()
print("\n---- Just the 3 columns:")
df = df[["Country", "City", selec]] # narrow down the df to just the 3 columns
df[selec] = df[selec].astype("int64") # convert the selec column to proper type
print(df)
print("\n---- Filtered dataframe:")
df1 = df[df[selec] <= 25]
print(df1)
print("\n---- Iterated & filtered rows:")
for row in df.itertuples():
# we could also use row[3] instead of getattr(...)
if getattr(row, selec) <= 25:
print(row)
Output:
---- Original df:
Country City Jan Feb Mar Apr May
0 Spain Madrid 15 16 17 18
1 Spain Galicia 1 2 3 4
2 France Paris 0 2 3 4
3 Algeria Argel 20 28 29 30
---- Just the 3 columns:
Country City Feb
0 Spain Madrid 16
1 Spain Galicia 2
2 France Paris 2
3 Algeria Argel 28
---- Filtered dataframe:
Country City Feb
0 Spain Madrid 16
1 Spain Galicia 2
2 France Paris 2
---- Iterated & filtered rows:
Pandas(Index=0, Country='Spain', City='Madrid', Feb=16)
Pandas(Index=1, Country='Spain', City='Galicia', Feb=2)
Pandas(Index=2, Country='France', City='Paris', Feb=2)
So I had a dataframe and I had to do some cleansing to minimize the duplicates. In order to do that I created a dataframe that has only 8 of the original 40 columns. Now I need two columns from the original dataframe for further analysis, but they would have messed with the desired outcome if I had used them in my previous analysis. Does anyone have any idea how to "extract" these columns based on the new "clean" dataframe I have?
You can merge the new "clean" dataframe with the other two variables by using the indexes. Let me use a practical example. Suppose the "initial" dataframe, called "df", is:
df
name year reports location
0 Jason 2012 4 Cochice
1 Molly 2012 24 Pima
2 Tina 2013 31 Santa Cruz
3 Jake 2014 2 Maricopa
4 Amy 2014 3 Yuma
while the "clean" dataframe is:
d1
year location
0 2012 Cochice
2 2013 Santa Cruz
3 2014 Maricopa
The remaing columns are saved in dataframe "d2" ( d2 = df[['name','reports']] ):
d2
name reports
0 Jason 4
1 Molly 24
2 Tina 31
3 Jake 2
4 Amy 3
By using an inner join on the indexes, d1.merge(d2, how='inner', left_index=True, right_index=True), you get the following result:
name year reports location
0 Jason 2012 4 Cochice
2 Tina 2013 31 Santa Cruz
3 Jake 2014 2 Maricopa
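A minimal runnable sketch of that index-based merge, using the example frames above (d1 is assumed to keep the original index of df, as shown):

import pandas as pd

df = pd.DataFrame({'name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'],
                   'year': [2012, 2012, 2013, 2014, 2014],
                   'reports': [4, 24, 31, 2, 3],
                   'location': ['Cochice', 'Pima', 'Santa Cruz', 'Maricopa', 'Yuma']})

d1 = df.loc[[0, 2, 3], ['year', 'location']]   # the "clean" dataframe, keeping df's index
d2 = df[['name', 'reports']]                   # the remaining columns

# the inner join on the index keeps only the rows still present in d1
result = d1.merge(d2, how='inner', left_index=True, right_index=True)
print(result[['name', 'year', 'reports', 'location']])
#     name  year  reports    location
# 0  Jason  2012        4     Cochice
# 2   Tina  2013       31  Santa Cruz
# 3   Jake  2014        2    Maricopa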
You can make a new dataframe with the specified columns:
import pandas as pd

# If your columns are named a, b, c, d etc.
df1 = df[['a', 'b']]

# This will extract columns 0 and 1, based on their position
# (remember that pandas indexes columns from zero!)
df2 = df.iloc[:, 0:2]
If you could provide a sample piece of data, that'd make it easier for us to help you.