Pandas Duplicating Data but pivoting columns - python

I want to convert this DF
Location  Date      F1_ID  F1_Name  F1_Height  F1_Status  F2_ID  F2_Name  F2_Height  F2_Status
USA       12/31/19  1      Jon      67         W          2      Anthony  68         L
To this DF by duplicating the rows but switching the data around.
Location  Date      F1_ID  F1_Name  F1_Height  F1_Status  F2_ID  F2_Name  F2_Height  F2_Status
USA       12/31/19  1      Jon      67         W          2      Anthony  68         L
USA       12/31/19  2      Anthony  68         L          1      Jon      67         W
How can I achieve this in Pandas? I tried creating a copy of the df and renaming the columns, but I would get an error because of unique indexing.

Let's try a concat and sort_index:
import re
import pandas as pd
df = pd.DataFrame(
    {'Location': {0: 'USA'}, 'Date': {0: '12/31/19'},
     'F1_ID': {0: 1}, 'F1_Name': {0: 'Jon'}, 'F1_Height': {0: 67},
     'F1_Status': {0: 'W'}, 'F2_ID': {0: 2},
     'F2_Name': {0: 'Anthony'}, 'F2_Height': {0: 68},
     'F2_Status': {0: 'L'}})
# Columns not to swap
keep_columns = ['Location', 'Date']
# Get F1 and F2 column names
f1_columns = list(filter(re.compile(r'F1_').search, df.columns))
f2_columns = list(filter(re.compile(r'F2_').search, df.columns))
# Create inverse DataFrame (copy so relabelling cannot touch df)
inverse_df = df[[*keep_columns, *f2_columns, *f1_columns]].copy()
# Set columns so they match df (prevents concat from un-inverting)
inverse_df.columns = df.columns
# Concat and sort index
new_df = pd.concat((df, inverse_df)).sort_index().reset_index(drop=True)
print(new_df.to_string())
Src:
Location Date F1_ID F1_Name F1_Height F1_Status F2_ID F2_Name F2_Height F2_Status
0 USA 12/31/19 1 Jon 67 W 2 Anthony 68 L
Output:
Location Date F1_ID F1_Name F1_Height F1_Status F2_ID F2_Name F2_Height F2_Status
0 USA 12/31/19 1 Jon 67 W 2 Anthony 68 L
1 USA 12/31/19 2 Anthony 68 L 1 Jon 67 W
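The rename approach the question attempted can also work once the two copies get distinct indexes. A minimal sketch (rows are appended rather than interleaved, and it assumes F1_/F2_ are the only prefixes to swap):

```python
import pandas as pd

df = pd.DataFrame({'Location': ['USA'], 'Date': ['12/31/19'],
                   'F1_ID': [1], 'F1_Name': ['Jon'], 'F1_Height': [67], 'F1_Status': ['W'],
                   'F2_ID': [2], 'F2_Name': ['Anthony'], 'F2_Height': [68], 'F2_Status': ['L']})

# Swap the F1_/F2_ prefixes, restore the original column order, then append;
# ignore_index sidesteps the duplicate-index error the rename attempt hit
swap = {c: ('F2_' + c[3:] if c.startswith('F1_') else
            'F1_' + c[3:] if c.startswith('F2_') else c) for c in df.columns}
swapped = df.rename(columns=swap)[df.columns]
new_df = pd.concat([df, swapped], ignore_index=True)
```

Use `sort_index` on the original indexes instead of `ignore_index=True` if the interleaved row order matters.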

Related

Using melt() in Pandas

I am trying to melt the following data so that the result is 4 rows: one for "John-1" containing his before data, one for "John-2" containing his after data, one for "Kelly-1" containing her before data, and one for "Kelly-2" containing her after data. The columns would be "Name", "Weight", and "Height". Can this be done solely with the melt function?
df = pd.DataFrame({'Name': ['John', 'Kelly'],
                   'Weight Before': [200, 175],
                   'Weight After': [195, 165],
                   'Height Before': [6, 5],
                   'Height After': [7, 6]})
Use the pandas.wide_to_long function as shown below:
pd.wide_to_long(df, ['Weight', 'Height'], 'Name', 'grp', ' ', '\\w+').reset_index()
Name grp Weight Height
0 John Before 200 6
1 Kelly Before 175 5
2 John After 195 7
3 Kelly After 165 6
Or you could use pivot_longer from pyjanitor as follows:
import janitor
df.pivot_longer('Name', names_to = ['.value', 'grp'], names_sep = ' ')
Name grp Weight Height
0 John Before 200 6
1 Kelly Before 175 5
2 John After 195 7
3 Kelly After 165 6
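melt by itself can only produce one value column, so it can't reach the four-row shape in a single call; it needs a split and a pivot afterwards. A hedged sketch of that route (column and variable names here are my own choices, not from the answers above):

```python
import pandas as pd

df = pd.DataFrame({'Name': ['John', 'Kelly'],
                   'Weight Before': [200, 175],
                   'Weight After': [195, 165],
                   'Height Before': [6, 5],
                   'Height After': [7, 6]})

# Melt everything, split 'Weight Before' into a measure and a group label,
# then pivot the measures back out into their own columns
long = df.melt(id_vars='Name')
long[['measure', 'grp']] = long['variable'].str.split(' ', expand=True)
out = (long.pivot_table(index=['Name', 'grp'], columns='measure', values='value')
           .reset_index())
```

So the answer to "solely with melt" is no, but wide_to_long or pivot_longer wrap this whole round trip in one call.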

split row to create header in pandas

I have a data frame df like:
age=14 gender=male loc=NY key=0012328434 Unnamed: 4
age=45 gender=female loc=CS key=834734hh43 pre="axe"
age=23 gender=female loc=CA key=545df35fdf NaN
..
..
age=65 gender=male loc=LA key=dfdf545dfg pre="cold"
And I need this df to have a header and remove the redundant data, like desired_df:
age gender loc key pre
14 male NY 0012328434 NaN
45 female CS 834734hh43 axe
23 female CA 545df35fdf NaN
..
..
65 male LA dfdf545dfg cold
what I tried to do:
df1 = df.str.split()
df_out = pd.DataFrame(df1.str[1::2].tolist(), columns=df1[0][0::2])
but this fails, clearly as I do not have a df name to begin with. Any help would be really appreciated.
# df = pd.read_csv(r'xyz.csv', header=None)
df1 = (pd.DataFrame(df.fillna('NaN=NaN')
                      .apply(lambda x: dict(list(x.str.replace('"', '')
                                                  .str.split('='))), axis=1)
                      .to_list())
         .drop('NaN', axis=1))
age gender loc key pre
0 14 male NY 0012328434
1 45 female CS 834734hh43 axe
2 23 female CA 545df35fdf NaN
3 65 male LA dfdf545dfg cold
(Untested!)
headers = ['age', 'gender', 'loc', 'key', 'pre']
df.columns = headers
for name in df.columns:
    df[name] = df[name].str.removeprefix(f'{name}=')
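For what it's worth, that untested snippet does run on pandas ≥ 1.4, where Series.str.removeprefix exists. A self-contained sketch with made-up rows mirroring the question (the quote-stripping step for `pre="axe"` is my addition):

```python
import pandas as pd
import numpy as np

# Hypothetical frame as it would look after read_csv(..., header=None)
df = pd.DataFrame([['age=14', 'gender=male', 'loc=NY', 'key=0012328434', np.nan],
                   ['age=45', 'gender=female', 'loc=CS', 'key=834734hh43', 'pre="axe"']])
df.columns = ['age', 'gender', 'loc', 'key', 'pre']
for name in df.columns:
    # Strip the 'col=' prefix and any surrounding quotes; NaN passes through
    df[name] = df[name].str.removeprefix(f'{name}=').str.strip('"')
```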

Is there a better way to join two dataframes in Pandas?

I have to convert data from a SQL table to pandas and display the output. The data is a sales table:
cust prod day month year state quant
0 Bloom Pepsi 2 12 2017 NY 4232
1 Knuth Bread 23 5 2017 NJ 4167
2 Emily Pepsi 22 1 201 CT 4404
3 Emily Fruits 11 1 2010 NJ 4369
4 Helen Milk 7 11 2016 CT 210
I have to convert this to find average sales for each customer for each state for year 2017:
CUST AVG_NY AVG_CT AVG_NJ
Bloom 28923 3241 1873
Sam 4239 872 142
Below is my code:
import pandas as pd
import psycopg2 as pg
engine = pg.connect("dbname='postgres' user='postgres' host='127.0.0.1' port='8800' password='sh'")
df = pd.read_sql('select * from sales', con=engine)
df.drop("prod", axis=1, inplace=True)
df.drop("day", axis=1, inplace=True)
df.drop("month", axis=1, inplace=True)
df_main = df.loc[df.year == 2017]
#df.drop(df[df['state'] != 'NY'].index, inplace=True)
df2 = df_main.loc[df_main.state == 'NY']
df2.drop("year",axis=1,inplace=True)
NY = df2.groupby(['cust']).mean()
df3 = df_main.loc[df_main.state == 'CT']
df3.drop("year",axis=1,inplace=True)
CT = df3.groupby(['cust']).mean()
df4 = df_main.loc[df_main.state == 'NJ']
df4.drop("year",axis=1,inplace=True)
NJ = df4.groupby(['cust']).mean()
NY = NY.join(CT,how='left',lsuffix = 'NY', rsuffix = '_right')
NY = NY.join(NJ,how='left',lsuffix = 'NY', rsuffix = '_right')
print(NY)
This gives me output like:
quantNY quant_right quant
cust
Bloom 3201.500000 3261.0 2277.000000
Emily 2698.666667 1432.0 1826.666667
Helen 4909.000000 2485.5 2352.166667
I found a question where I can change the column names to the output I need, but I am not sure if the below two lines of the code are the right way to join the dataframes:
NY = NY.join(CT,how='left',lsuffix = 'NY', rsuffix = '_right')
NY = NY.join(NJ,how='left',lsuffix = 'NY', rsuffix = '_right')
Is there a better way of doing this with Pandas?
Use pivot_table:
df.pivot_table(index=['year', 'cust'], columns='state',
               values='quant', aggfunc='mean').add_prefix('AVG_')
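Since only 2017 is wanted, filtering before pivoting keeps `year` out of the index entirely. A sketch with made-up rows shaped like the question's table:

```python
import pandas as pd

# Hypothetical sales rows mirroring the question's table
df = pd.DataFrame({'cust': ['Bloom', 'Bloom', 'Knuth'],
                   'state': ['NY', 'CT', 'NJ'],
                   'year': [2017, 2017, 2016],
                   'quant': [4232, 3241, 4167]})

out = (df[df['year'] == 2017]
       .pivot_table(index='cust', columns='state', values='quant', aggfunc='mean')
       .add_prefix('AVG_')
       .reset_index())
```

This replaces all three groupby/join blocks: pivot_table groups by (cust, state) and spreads the states into columns in one step.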

How to flatten a dataframe within dataframe

I would like to flatten a dataframe that is inside another dataframe. In this example, the column account has a dataframe as its value. I would like to flatten this into a single dataframe.
Example: (Updated)
import pandas as pd
account1 = pd.DataFrame([{'nr': '123', 'balance': 56}, {'nr': '230', 'balance': 55}])
account2 = pd.DataFrame([{'nr': '456', 'balance': 575}])
account3 = pd.DataFrame([{'nr': '350', 'balance': 59}])
df = pd.DataFrame([{'id': 1, 'age': 23, 'name': 'anna', 'account': account1},
                   {'id': 2, 'age': 71, 'name': 'mary', 'account': account2},
                   {'id': 3, 'age': 42, 'name': 'bob', 'account': account3}])
print(df)
gives the dataframe:
id age name account
0 1 23 anna nr balance
0 123 56
1 230 55
1 2 71 mary nr balance
0 456 575
2 3 42 bob nr balance
0 350 59
And I would like to get:
id name age account|nr|0 account|balance|0 account|nr|1 account|balance|1
0 1 anna 23 123 56 230 55
1 2 mary 71 456 575
2 3 bob 42 350 59
How can I flatten a dataframe inside a dataframe to a single dataframe? Is this type of structure called a hierarchical DataFrame?
This is the solution that I have found.
list_accounts = []
for index_j, row_j in df.iterrows():
    account = row_j["account"]
    account = pd.DataFrame(account).stack().to_frame().T
    account.columns = ['%s%s' % (a, '|%s' % b if b else '') for a, b in account.columns]
    list_accounts.append(account)
df = pd.concat([df, pd.concat(list_accounts).reset_index(drop=True)], axis=1)
df.drop(columns="account", inplace=True)
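An alternative sketch that avoids iterrows: stack each nested frame once, line the stacked Series up side by side, and transpose. The `account|col|row` column names are my own labelling, chosen to match the desired output's pattern:

```python
import pandas as pd

account1 = pd.DataFrame([{'nr': '123', 'balance': 56}, {'nr': '230', 'balance': 55}])
account2 = pd.DataFrame([{'nr': '456', 'balance': 575}])
df = pd.DataFrame([{'id': 1, 'age': 23, 'name': 'anna', 'account': account1},
                   {'id': 2, 'age': 71, 'name': 'mary', 'account': account2}])

# Stack each nested frame into a (row, column)-indexed Series, concat them as
# columns, then transpose so each outer row becomes one flat row; missing
# accounts become NaN automatically
flat = pd.concat([a.stack() for a in df['account']], axis=1).T
flat.columns = [f'account|{col}|{row}' for row, col in flat.columns]
flat.index = df.index
out = pd.concat([df.drop(columns='account'), flat], axis=1)
```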

Creating a new dataframe by applying a function to all cells of a dataframe

I have a data frame, df, like this:
data = {'A': ['Jason (121439)', 'Molly (194439)', 'Tina (114439)', 'Jake (127859)', 'Amy (122579)'],
        'B': ['Bob (127439)', 'Mark (136489)', 'Tyler (121443)', 'John (126259)', 'Anna(174439)'],
        'C': ['Jay (121596)', 'Ben (12589)', 'Toom (123586)', 'Josh (174859)', 'Al(121659)'],
        'D': ['Paul (123839)', 'Aaron (124159)', 'Steve (161899)', 'Vince (179839)', 'Ron (128379)']}
df = pd.DataFrame(data)
And I want to create a new data frame with one column with the name and the other column with the number between parenthesis, which would look like this:
data2 = {'Name': ['Jason ', 'Molly ', 'Tina ', 'Jake ', 'Amy '],
         'ID#': ['121439', '194439', '114439', '127859', '122579']}
result = pd.DataFrame(data2)
I tried different things, but none of them worked:
1)
List_name=pd.DataFrame()
List_id=pd.DataFrame()
List_both=pd.DataFrame(columns=["Name","ID"])
for i in df.columns:
    left = df[i].str.split("(", 1).str[0]
    right = df[i].str.split("(", 1).str[1]
    List_name = List_name.append(left)
    List_id = List_id.append(right)
List_both = pd.concat([List_name, List_id], axis=1)
List_both
2) applying a function to all cells
Names = lambda x: x.str.split("(",1).str[0]
IDS = Names = lambda x: x.str.split("(",1).str[1]
But I was wondering how to do it in order to store it in a data frame that will look like result...
You can use stack followed by str.extract.
(df.stack()
   .str.strip()
   .str.extract(r'(?P<Name>.*?)\s*\((?P<ID>.*?)\)$')
   .reset_index(drop=True))
Name ID
0 Jason 121439
1 Bob 127439
2 Jay 121596
3 Paul 123839
4 Molly 194439
5 Mark 136489
6 Ben 12589
7 Aaron 124159
8 Tina 114439
9 Tyler 121443
10 Toom 123586
11 Steve 161899
12 Jake 127859
13 John 126259
14 Josh 174859
15 Vince 179839
16 Amy 122579
17 Anna 174439
18 Al 121659
19 Ron 128379
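If only one column is wanted, as in the question's `result` (which covers column A alone), `str.extract` on that single column is enough, with no stacking needed. A sketch assuming the `df` built above:

```python
import pandas as pd

df = pd.DataFrame({'A': ['Jason (121439)', 'Molly (194439)', 'Tina (114439)',
                         'Jake (127859)', 'Amy (122579)']})

# Pull the name and the parenthesized digits into separate named columns
result = df['A'].str.extract(r'(?P<Name>.*?)\s*\((?P<ID>\d+)\)')
```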
