Transform DataFrame in Pandas - python

I am struggling with the following issue.
My DF is:
import pandas as pd

df = pd.DataFrame(
    [
        ['7890-1', '12345N', 'John', 'Intermediate'],
        ['7890-4', '30909N', 'Greg', 'Intermediate'],
        ['3300-1', '88117N', 'Mark', 'Advanced'],
        ['2502-2', '90288N', 'Olivia', 'Elementary'],
        ['7890-2', '22345N', 'Joe', 'Intermediate'],
        ['7890-3', '72245N', 'Ana', 'Elementary']
    ],
    columns=['Id', 'Code', 'Person', 'Level'])
print(df)
I would like to get such a result:

|    |   Id | Code 1 | Person 1 | Level 1      | Code 2 | Person 2 | Level 2      | Code 3 | Person 3 | Level 3    | Code 4 | Person 4 | Level 4      |
|---:|-----:|:-------|:---------|:-------------|:-------|:---------|:-------------|:-------|:---------|:-----------|:-------|:---------|:-------------|
|  0 | 7890 | 12345N | John     | Intermediate | 22345N | Joe      | Intermediate | 72245N | Ana      | Elementary | 30909N | Greg     | Intermediate |
|  1 | 3300 | 88117N | Mark     | Advanced     | NaN    | NaN      | NaN          | NaN    | NaN      | NaN        | NaN    | NaN      | NaN          |
|  2 | 2502 | NaN    | NaN      | NaN          | 90288N | Olivia   | Elementary   | NaN    | NaN      | NaN        | NaN    | NaN      | NaN          |

I'd start with the same approach as @Andrej Kesely, but then sort by index after unstacking and map over the column names with ' '.join.
df[["Id", "No"]] = df["Id"].str.split("-", expand=True)
df_wide = df.set_index(["Id", "No"]).unstack(level=1).sort_index(axis=1, level=1)
df_wide.columns = df_wide.columns.map(' '.join)
Output
       Code 1       Level 1 Person 1  Code 2       Level 2 Person 2  Code 3  \
Id
2502      NaN           NaN      NaN  90288N    Elementary   Olivia     NaN
3300   88117N      Advanced     Mark     NaN           NaN      NaN     NaN
7890   12345N  Intermediate     John  22345N  Intermediate      Joe  72245N

          Level 3 Person 3  Code 4       Level 4 Person 4
Id
2502          NaN      NaN     NaN           NaN      NaN
3300          NaN      NaN     NaN           NaN      NaN
7890   Elementary      Ana  30909N  Intermediate     Greg

Try:
df[["Id", "Id2"]] = df["Id"].str.split("-", expand=True)
x = df.set_index(["Id", "Id2"]).unstack(level=1)
x.columns = [f"{a} {b}" for a, b in x.columns]
print(
    x[sorted(x.columns, key=lambda k: int(k.split()[-1]))]
    .reset_index()
    .to_markdown()
)
Prints:
|    |   Id | Code 1 | Person 1 | Level 1      | Code 2 | Person 2 | Level 2      | Code 3 | Person 3 | Level 3    | Code 4 | Person 4 | Level 4      |
|---:|-----:|:-------|:---------|:-------------|:-------|:---------|:-------------|:-------|:---------|:-----------|:-------|:---------|:-------------|
|  0 | 2502 | nan    | nan      | nan          | 90288N | Olivia   | Elementary   | nan    | nan      | nan        | nan    | nan      | nan          |
|  1 | 3300 | 88117N | Mark     | Advanced     | nan    | nan      | nan          | nan    | nan      | nan        | nan    | nan      | nan          |
|  2 | 7890 | 12345N | John     | Intermediate | 22345N | Joe      | Intermediate | 72245N | Ana      | Elementary | 30909N | Greg     | Intermediate |
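For what it's worth, a hedged alternative sketch of the same reshape, starting again from the original df: since the split already yields an entry number per Id, DataFrame.pivot reaches the same wide layout as unstack (column order aside).
# Split "7890-1" into Id ("7890") and entry number No ("1"), then pivot wide.
df[["Id", "No"]] = df["Id"].str.split("-", expand=True)
wide = df.pivot(index="Id", columns="No", values=["Code", "Person", "Level"])
# Flatten ("Code", "1") -> "Code 1" and group the columns by entry number.
wide.columns = [f"{a} {b}" for a, b in wide.columns]
wide = wide[sorted(wide.columns, key=lambda k: int(k.split()[-1]))]
print(wide.reset_index())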

Iterate over dataframes and merge by conditions

I have this DataFrame:

id-input  id-output  Date        Price  Type
1         3          20/09/2020  100    ABC
2         1          20/09/2020  200    ABC
2         1          21/09/2020  300    ABC
1         3          21/09/2020  50     AD
1         2          21/09/2020  40     AD
I want to get this output:

id-inp-ABC  id-out-ABC  Date-ABC    Price-ABC  Type-ABC  id-inp-AD  id-out-AD  Date-AD     Price-AD  Type-AD
1           3           20/09/2020  10         ABC       2          1          20/09/2020  10        AD
1'          3           20/09/2020  90         ABC       NaN        NaN        NaN         NaN       NaN
2           1           20/09/2020  40         ABC       1          2          21/09/2020  40        AD
2'          1           20/09/2020  160        ABC       NaN        NaN        NaN         NaN       NaN
2           1           21/09/2020  300        ABC       NaN        NaN        NaN         NaN       NaN
My idea is to:

- divide the DataFrame into two DataFrames by Type,
- iterate through both DataFrames and check whether id-input == id-output,
- check whether the prices are equal; if not, split the row and subtract the price,
- rename the columns and merge them.
grp = df.groupby('Type')
transformed_df_list = []
for idx, frame in grp:
    frame.reset_index(drop=True, inplace=True)
    transformed_df_list.append(frame.copy())

ABC = transformed_df_list[0]
AD = transformed_df_list[1]

for i, row in ABC.iterrows():
    for j, row1 in AD.iterrows():
        if row['id-input'] == row1['id-output']:
            row_df = pd.DataFrame([row1])
            row_df = row_df.rename(columns={'id-input': 'id-inp-AD', 'id-output': 'id-out-AD',
                                            'Date': 'Date-AD', 'Price': 'Price-AD'})
            output = pd.merge(ABC.set_index('id-input', drop=False),
                              row_df.set_index('id-out-AD', drop=False),
                              how='left', left_on=['id-input'], right_on=['id-inp-AD'])
but the result is NaN in the id-inp-AD, id-out-AD, Date-AD, Price-AD, Type-AD part, and row_df contains just the last row:
1 2 21/09/2020 40 AD
I also want the iteration to respect the order, so that each insert into the output DataFrame is sorted by date.
The most elegant way to solve your problem is to use pandas.DataFrame.pivot. You end up with multilevel column names instead of a single level. If you need to transfer the DataFrame back to single level column names, check the second answer here.
import pandas as pd

input = [
    [1, 3, '20/09/2020', 100, 'ABC'],
    [2, 1, '20/09/2020', 200, 'ABC'],
    [2, 1, '21/09/2020', 300, 'ABC'],
    [1, 3, '21/09/2020', 50, 'AD'],
    [1, 2, '21/09/2020', 40, 'AD']
]
df = pd.DataFrame(data=input, columns=["id-input", "id-output", "Date", "Price", "Type"])
df_pivot = df.pivot(columns=["Type"])
print(df_pivot)
Output
     id-input       id-output             Date                    Price
Type      ABC   AD        ABC   AD         ABC          AD          ABC    AD
0         1.0  NaN        3.0  NaN  20/09/2020         NaN        100.0   NaN
1         2.0  NaN        1.0  NaN  20/09/2020         NaN        200.0   NaN
2         2.0  NaN        1.0  NaN  21/09/2020         NaN        300.0   NaN
3         NaN  1.0        NaN  3.0         NaN  21/09/2020          NaN  50.0
4         NaN  1.0        NaN  2.0         NaN  21/09/2020          NaN  40.0
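As noted above, pivot leaves a two-level column index. A minimal sketch for flattening it back to single-level names, assuming the df_pivot from the snippet above (the '-' separator is an arbitrary choice):
# Join each (column, Type) pair into one name, e.g. "id-input-ABC".
df_pivot.columns = [f"{col}-{typ}" for col, typ in df_pivot.columns]
print(df_pivot.head())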

Extract column value based on another column, reading multiple files

I would like to extract the values for Name, Grade, School and Class based on another column.
For example, to find Name and Grade I would go through column 0 and look for the value in the next few columns; the value is scattered across those columns. The same goes for School and Class.
Refer to this: extract column value based on another column pandas dataframe
I have multiple files:
0 1 2 3 4 5 6 7 8
0 nan nan nan Student Registration nan nan
1 Name: nan nan John nan nan nan nan nan
2 Grade: nan 6 nan nan nan nan nan nan
3 nan nan nan School: C College nan Class: 1A
0 1 2 3 4 5 6 7 8 9
0 nan nan nan Student Registration nan nan nan
1 nan nan nan nan nan nan nan nan nan nan
2 Name: Mary nan nan nan nan nan nan nan nan
3 Grade: 7 nan nan nan nan nan nan nan nan
4 nan nan nan School: nan D College Class: nan 5A
This is my code (it raises an error):
d = []
for file in files:
    df = pd.read_csv(file, header=0)
    df['Name'] = df.loc[df[0].isin(['Name']), [1, 2, 3]]
    df['Grade'] = df.loc[df[0].isin(['Grade']), [1, 2, 3]]
    df['School'] = df.loc[df[3].isin(['School']), [4, 5]]
    df['Class'] = df.loc[df[7].isin(['Class']), [8, 9]]
    d.append(df)
df = pd.concat(d, ignore_index=True)
This is the outcome I want: (Melt Function)
Name Grade School Class ... .... ... ...
John 6 C College 1A
John 6 C College 1A
John 6 C College 1A
John 6 C College 1A
Mary 7 D College 5A
Mary 7 D College 5A
Mary 7 D College 5A
Mary 7 D College 5A
I think it is possible to use:
d = []
for file in files:
    df = pd.read_csv(file, header=0)
    # drop the first row and reshape - remove NaNs, convert to a one-column df
    df = df.iloc[1:].stack().reset_index(drop=True).to_frame('data')
    # identify label cells by the trailing colon
    m = df['data'].str.endswith(':', na=False)
    # shift the following value into a new column
    df['val'] = df['data'].shift(-1)
    # filter the label rows and transpose
    df = df[m].set_index('data').T.rename_axis(None, axis=1)
    d.append(df)
df = pd.concat(d, ignore_index=True)
EDIT:
You can use:
for file in files:
    # if the inputs are Excel files, change read_csv to read_excel
    df = pd.read_excel(file, header=None)
    df['Name'] = df.loc[df[0].eq('Name:'), [1, 2, 3]].dropna(axis=1).squeeze()
    df['Grade'] = df.loc[df[0].eq('Grade:'), [1, 2, 3]].dropna(axis=1).squeeze()
    df['School'] = df.loc[df[3].eq('School:'), [4, 5]].dropna(axis=1).squeeze()
    df['Class'] = df.loc[df[6].eq('Class:'), [7, 8]].dropna(axis=1).squeeze()
    print(df)
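If the end goal is one tidy row per file rather than printing each annotated frame, a hedged sketch that collects the same four lookups into a summary DataFrame (the label positions are taken from the question's sample files; everything else is an assumption):
d = []
for file in files:
    df = pd.read_excel(file, header=None)
    # same label lookups as above, collected into one record per file
    d.append({
        'Name': df.loc[df[0].eq('Name:'), [1, 2, 3]].dropna(axis=1).squeeze(),
        'Grade': df.loc[df[0].eq('Grade:'), [1, 2, 3]].dropna(axis=1).squeeze(),
        'School': df.loc[df[3].eq('School:'), [4, 5]].dropna(axis=1).squeeze(),
        'Class': df.loc[df[6].eq('Class:'), [7, 8]].dropna(axis=1).squeeze(),
    })
out = pd.DataFrame(d)
print(out)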

Python List match with appropriate index

I need to match the lists with the appropriate indexes. There are five lists, and one of them is the main list. List1/List2 belong together, as do List3/List4. The values of List1 and List3 appear in main_list; List2 and List4 need to be matched to the appropriate index in main_list.
main_list = ['munnar', 'ooty', 'coonoor', 'nilgri', 'wayanad', 'coorg', 'chera', 'hima']
List1 = ['ooty', 'coonoor', 'chera']
List2 = ['hill', 'hill', 'hill']
List3 = ['nilgri', 'hima', 'ooty']
List4 = ['mount', 'mount', 'mount']
df = pd.DataFrame(dict(Area=main_list))
df1 = pd.DataFrame(
    list(zip(List1, List2)),
    columns=('Area', 'Content')
)
df2 = pd.DataFrame(
    list(zip(List3, List4)),
    columns=('Area', 'cont')
)
re = pd.concat([df, df1, df2], ignore_index=True, sort=False)
Output:
Area Content cont
0 munnar NaN NaN
1 ooty NaN NaN
2 coonoor NaN NaN
3 nilgri NaN NaN
4 wayanad NaN NaN
5 coorg NaN NaN
6 chera NaN NaN
7 hima NaN NaN
8 ooty hill NaN
9 coonoor hill NaN
10 chera hill NaN
11 nilgri NaN mount
12 hima NaN mount
13 ooty NaN mount
Expected Output:
Area Content cont
0 munnar NaN NaN
1 ooty hill mount
2 coonoor hill NaN
3 nilgri NaN mount
4 wayanad NaN NaN
5 coorg NaN NaN
6 chera hill NaN
7 hima NaN mount
IIUC set_index before concat
pd.concat([df.set_index('Area'), df1.set_index('Area'), df2.set_index('Area')], axis=1).reset_index()
Out[312]:
index Content cont
0 chera hill NaN
1 coonoor hill NaN
2 coorg NaN NaN
3 hima NaN mount
4 munnar NaN NaN
5 nilgri NaN mount
6 ooty hill mount
7 wayanad NaN NaN
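If the row order of main_list matters (as in the expected output), a hedged variant of the same idea reindexes the combined frame by df['Area'] before resetting the index:
# concat on the Area index, then restore the original main_list order.
out = pd.concat(
    [df.set_index('Area'), df1.set_index('Area'), df2.set_index('Area')],
    axis=1,
).reindex(df['Area']).reset_index()
print(out)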

MultiIndex slicing doesn't work as expected (error involving lexsorted tuples)

I've got a problem, and it just doesn't make sense. I've got a large pd.DataFrame that I reduced in size so that I could easily show it in an example (called test1):
>>> print(test1)
value           TIME                                                        \
star               0            1            2            3            4
0        1952.205873  1952.205873  1952.205873  1952.205873  1952.205873
1        1952.226307  1952.226307  1952.226307  1952.226307  1952.226307
2        1952.246740  1952.246740  1952.246740  1952.246740  1952.246740
3        1952.267174  1952.267174  1952.267174  1952.267174  1952.267174

value                           CNTS                                        \
star               5               0              1              2
0        1952.205873   575311.432228  534103.079080  179471.239561
1        1952.226307   571480.854183  533138.021051  187456.451900
2        1952.246740   555631.798095  530263.846685  203247.734806
3        1952.267174   553639.056784  527058.335157  210088.229427

value
star                 3             4            5
0        121884.201457  39003.397835  2089.321993
1        122796.312201  39552.401359  2810.010142
2        123500.068304  39158.050385  2652.409086
3        124357.387418  38881.565235  2721.908129
and I want to perform slice indexing on it. However it just doesn't seem to work. Here is what I try:
test1.loc[:, (slice(None), 0)]
and I get this error:
*** KeyError: 'MultiIndex Slicing requires the index to be fully lexsorted tuple len (2), lexsort depth (0)'
This isn't the first time I've had this error or asked this question, but I still don't understand what's wrong or how to fix it.
Even more confusing is that the following code seems to work without a hitch:
import pandas as pd
import numpy as np
column_values = ['TIME', 'XPOS']
target = range(0,2)
mindex = pd.MultiIndex.from_product([column_values, target], names=['value', 'target'])
df = pd.DataFrame(columns=mindex, index=range(10), dtype=float)
print(df.loc[:,(slice(None),0)])
I just don't understand what's happening and what's wrong here.
You only need to sort the MultiIndex in the columns with sort_index:
df = df.sort_index(axis=1)
You can also check docs - sorting a multiindex.
Sample (columns are not lexsorted):
#your sample, only swap values in column_values
column_values = ['XPOS', 'TIME']
target = range(0,2)
mindex = pd.MultiIndex.from_product([column_values, target], names=['value', 'target'])
df = pd.DataFrame(columns=mindex, index=range(10), dtype=float)
print (df)
value XPOS TIME
target 0 1 0 1
0 NaN NaN NaN NaN
1 NaN NaN NaN NaN
2 NaN NaN NaN NaN
3 NaN NaN NaN NaN
4 NaN NaN NaN NaN
5 NaN NaN NaN NaN
6 NaN NaN NaN NaN
7 NaN NaN NaN NaN
8 NaN NaN NaN NaN
9 NaN NaN NaN NaN
print (df.columns.is_lexsorted())
False
df = df.sort_index(axis=1)
print (df.columns.is_lexsorted())
True
print(df.loc[:,(slice(None),0)])
value TIME XPOS
target 0 0
0 NaN NaN
1 NaN NaN
2 NaN NaN
3 NaN NaN
4 NaN NaN
5 NaN NaN
6 NaN NaN
7 NaN NaN
8 NaN NaN
9 NaN NaN
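An equivalent, arguably more readable spelling of the same slice uses pd.IndexSlice; the columns must be sorted first, exactly as above:
# Same selection as df.loc[:, (slice(None), 0)].
idx = pd.IndexSlice
print(df.loc[:, idx[:, 0]])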

Pandas Dataframe: split column into multiple columns, right-align inconsistent cell entries

I have a pandas dataframe with a column named 'City, State, Country'. I want to separate this column into three new columns, 'City, 'State' and 'Country'.
0 HUN
1 ESP
2 GBR
3 ESP
4 FRA
5 ID, USA
6 GA, USA
7 Hoboken, NJ, USA
8 NJ, USA
9 AUS
Splitting the column into three columns is trivial enough:
location_df = df['City, State, Country'].apply(lambda x: pd.Series(x.split(',')))
However, this creates left-aligned data:
0 1 2
0 HUN NaN NaN
1 ESP NaN NaN
2 GBR NaN NaN
3 ESP NaN NaN
4 FRA NaN NaN
5 ID USA NaN
6 GA USA NaN
7 Hoboken NJ USA
8 NJ USA NaN
9 AUS NaN NaN
How would one go about creating the new columns with the data right-aligned? Would I need to iterate through every row, count the number of commas and handle the contents individually?
I'd do something like the following:
foo = lambda x: pd.Series([i for i in reversed(x.split(','))])
rev = df['City, State, Country'].apply(foo)
print(rev)
0 1 2
0 HUN NaN NaN
1 ESP NaN NaN
2 GBR NaN NaN
3 ESP NaN NaN
4 FRA NaN NaN
5 USA ID NaN
6 USA GA NaN
7 USA NJ Hoboken
8 USA NJ NaN
9 AUS NaN NaN
I think that gets you what you want but if you also want to pretty things up and get a City, State, Country column order, you could add the following:
rev.rename(columns={0: 'Country', 1: 'State', 2: 'City'}, inplace=True)
rev = rev[['City', 'State', 'Country']]
print(rev)
City State Country
0 NaN NaN HUN
1 NaN NaN ESP
2 NaN NaN GBR
3 NaN NaN ESP
4 NaN NaN FRA
5 NaN ID USA
6 NaN GA USA
7 Hoboken NJ USA
8 NaN NJ USA
9 NaN NaN AUS
Assume the column is named target:
df[["City", "State", "Country"]] = df["target"].str.split(pat=",", expand=True)
Since you are dealing with strings, I would suggest this amendment to your current code:
location_df = df[['City, State, Country']].apply(lambda x: pd.Series(str(x).split(',')))
I got mine to work by testing it on one of the columns, but give this one a try.
