I am fairly new to data frames and have been trying to find a way to shift data in specific columns onto another row when ID and CODE match.
import numpy as np
import pandas as pd

df = pd.DataFrame({'ID': ['123', '123', '123', '154', '167', '167'],
                   'NAME': ['Adam', 'Adam', 'Adam', 'Bob', 'Charlie', 'Charlie'],
                   'CODE': ['1001', '1001', '1011', '1002', 'A0101', 'A0101'],
                   'TAG1': ['A123', 'B123', 'K123', 'D123', 'E123', 'G123'],
                   'TAG2': [np.nan, 'C123', 'L123', np.nan, 'F123', 'H123'],
                   'TAG3': [np.nan, 'M123', np.nan, np.nan, np.nan, np.nan]})
ID NAME CODE TAG1 TAG2 TAG3
0 123 Adam 1001 A123 NaN NaN
1 123 Adam 1001 B123 C123 M123
2 123 Adam 1011 K123 L123 NaN
3 154 Bob 1002 D123 NaN NaN
4 167 Charlie A0101 E123 F123 NaN
5 167 Charlie A0101 G123 H123 NaN
Above is the code that builds the initial data frame. ID '123' has two rows with the same CODE whose TAG values differ; I would like to shift the TAG data from the second row into the first row after TAG1 (or into TAG1 if it is empty) and then delete the second row completely. The same applies to ID '167'.
Below is a second data frame I built manually, which depicts the final result; any suggestions would be great.
df_result = pd.DataFrame({'ID': ['123', '123', '154', '167'],
                          'NAME': ['Adam', 'Adam', 'Bob', 'Charlie'],
                          'CODE': ['1001', '1011', '1002', 'A0101'],
                          'TAG1': ['A123', 'K123', 'D123', 'E123'],
                          'TAG2': ['B123', 'L123', np.nan, 'F123'],
                          'TAG3': ['C123', np.nan, np.nan, 'G123'],
                          'TAG4': ['M123', np.nan, np.nan, 'H123']})
ID NAME CODE TAG1 TAG2 TAG3 TAG4
0 123 Adam 1001 A123 B123 C123 M123
1 123 Adam 1011 K123 L123 NaN NaN
2 154 Bob 1002 D123 NaN NaN NaN
3 167 Charlie A0101 E123 F123 G123 H123
The code I tried is below (test_shift is the data frame above); it gets close, but not exactly the output I wanted:
df2 = pd.pivot_table(test_shift, index=['ID', 'NAME', 'CODE'],
                     columns=test_shift.groupby(['ID', 'CODE']).cumcount().add(1),
                     values=['TAG1'], aggfunc='sum')
(screenshot of the resulting output omitted)
NOTE: Sorry for the bad posting the first time; I tried adding the data frame as code to help you visually, but failed. I will try to learn that over the coming days and be a better member of Stack Overflow.
Thank you for the help.
Here is one way:
def f(x):
    # stack() drops the NaNs, leaving one entry per non-null TAG value
    x = x.stack().to_frame(name='val')
    # number the surviving values sequentially within the group: Tag1, Tag2, ...
    x = x.assign(tags='Tag' + x["val"].notna().cumsum().astype(str))
    # drop the original TAG-column level, then spread the values back out
    x = (x.reset_index(level=3, drop=True)
          .set_index('tags', append=True)['val'].unstack())
    return x

df_out = (df.set_index(['ID', 'NAME', 'CODE'])
            .groupby(level=[0, 1, 2], as_index=False).apply(f)
            .reset_index().drop('level_0', axis=1))
print(df_out)
Output:
ID NAME CODE Tag1 Tag2 Tag3 Tag4
0 123 Adam 1001 A123 B123 C123 M123
1 123 Adam 1011 K123 L123 NaN NaN
2 154 Bob 1002 D123 NaN NaN NaN
3 167 Charlie A0101 E123 F123 G123 H123
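To see why the Tag renumbering in f works, look at the stacked intermediate for the '123'/'1001' group (a minimal check on the sample df; the classic stack() behavior drops the NaNs):
grp = df[df['CODE'] == '1001'].set_index(['ID', 'NAME', 'CODE'])
print(grp.stack().tolist())   # ['A123', 'B123', 'C123', 'M123'] -- NaNs already gone
With the missing values dropped, notna().cumsum() simply labels these four survivors Tag1 through Tag4.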
An approach with pd.wide_to_long: flatten the TAG columns, map each row's index to its (ID, CODE) group, then unstack and join the de-duplicated ID/NAME/CODE rows.
# flatten TAG1..TAG3 into one long column, keyed by the original row index
u = pd.wide_to_long(df.filter(like='TAG').reset_index(), 'TAG', 'index', 'j').dropna()
# map every original row to the integer id of its (ID, CODE) group
idx = u.index.get_level_values(0).map(df.groupby(['ID', 'CODE']).ngroup())
# re-key by (group id, running count within group), then spread back into columns
u = u.set_index(pd.MultiIndex.from_arrays((idx, u.groupby(idx).cumcount() + 1))).unstack()
u.columns = u.columns.map('{0[0]}{0[1]}'.format)   # ('TAG', 1) -> 'TAG1'
out = df[['ID', 'NAME', 'CODE']].drop_duplicates().reset_index(drop=True).join(u)
print(out)
print(out)
ID NAME CODE TAG1 TAG2 TAG3 TAG4
0 123 Adam 1001 A123 B123 C123 M123
1 123 Adam 1011 K123 L123 NaN NaN
2 154 Bob 1002 D123 NaN NaN NaN
3 167 Charlie A0101 E123 F123 G123 H123
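The mapping step hinges on ngroup, which labels each row with the integer id of its (ID, CODE) group, so both rows of a duplicated code share one label; on the sample df:
print(df.groupby(['ID', 'CODE']).ngroup().tolist())   # [0, 0, 1, 2, 3, 3]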
Related
I have a dataframe like the one shown below:
# note: a Python dict cannot hold duplicate keys, so only the last '2017Q1' and
# '2017Q2' entries survive in this literal; the real Excel file does have them all
tdf = pd.DataFrame(
    {'Unnamed: 0': ['Region', 'Asean', 'Asean', 'Asean', 'Asean', 'Asean', 'Asean'],
     'Unnamed: 1': ['Name', 'DEF', 'GHI', 'JKL', 'MNO', 'PQR', 'STU'],
     '2017Q1': ['target_achieved', 2345, 5678, 7890, 1234, 6789, 5454],
     '2017Q1': ['target_set', 3000, 6000, 8000, 1500, 7000, 5500],
     '2017Q1': ['score', 86, 55, 90, 65, 90, 87],
     '2017Q2': ['target_achieved', 245, 578, 790, 123, 689, 454],
     '2017Q2': ['target_set', 300, 600, 800, 150, 700, 500],
     '2017Q2': ['score', 76, 45, 70, 55, 60, 77]})
As you can see, my column names are duplicated: there are three 2017Q1 columns and three 2017Q2 columns. A dataframe built from a dict cannot even represent columns with duplicate names.
I tried the below to get my expected output:
tdf.columns = tdf.iloc[0]   # but this still ignores the columns with duplicate names
update
After reading the Excel file per jezrael's answer, I get a MultiIndex frame (screenshot omitted).
I expect my output to look like the long, reshaped table in the answer below.
First create a MultiIndex in both the columns and the index:
df = pd.read_excel(file, header=[0,1], index_col=[0,1])
If that is not possible, here is an alternative built from your sample data: convert the columns plus the first row of data to a MultiIndex in the columns, and the first two columns to a MultiIndex in the index:
tdf = pd.read_excel(file)
tdf.columns = pd.MultiIndex.from_arrays([tdf.columns, tdf.iloc[0]])
df = (tdf.iloc[1:]
.set_index(tdf.columns[:2].tolist())
.rename_axis(index=['Region','Name'], columns=['Year',None]))
print (df.index)
MultiIndex([('Asean', 'DEF'),
('Asean', 'GHI'),
('Asean', 'JKL'),
('Asean', 'MNO'),
('Asean', 'PQR'),
('Asean', 'STU')],
names=['Region', 'Name'])
print (df.columns)
MultiIndex([('2017Q1', 'target_achieved'),
('2017Q1', 'target_set'),
('2017Q1', 'score'),
('2017Q2', 'target_achieved'),
('2017Q2', 'target_set'),
('2017Q2', 'score')],
names=['Year', None])
And then reshape:
df1 = df.stack(0).reset_index()
print (df1)
Region Name Year score target_achieved target_set
0 Asean DEF 2017Q1 86 2345 3000
1 Asean DEF 2017Q2 76 245 300
2 Asean GHI 2017Q1 55 5678 6000
3 Asean GHI 2017Q2 45 578 600
4 Asean JKL 2017Q1 90 7890 8000
5 Asean JKL 2017Q2 70 790 800
6 Asean MNO 2017Q1 65 1234 1500
7 Asean MNO 2017Q2 55 123 150
8 Asean PQR 2017Q1 90 6789 7000
9 Asean PQR 2017Q2 60 689 700
10 Asean STU 2017Q1 87 5454 5500
11 Asean STU 2017Q2 77 454 500
EDIT: The solution for the edited question is similar:
df = pd.read_excel(file, header=[0,1], index_col=[0,1])
df1 = df.rename_axis(index=['Region','Name'], columns=['Year',None]).stack(0).reset_index()
I need to make a function to expand a dataframe. For example, the input of the function is:
df = pd.DataFrame({
'Name':['Ali', 'Ali', 'Ali', 'Sasha', 'Sasha', 'Sasha'],
'Cart':['book', 'phonecase', 'shirt', 'phone', 'food', 'bag']
})
Suppose the n value is 3. Then, for each person in the Name column, I have to add 3 new rows, leaving Cart as np.nan. The output should look like this:
df = pd.DataFrame({
'Name':['Ali', 'Ali', 'Ali', 'Ali', 'Ali', 'Ali', 'Sasha', 'Sasha', 'Sasha', 'Sasha', 'Sasha', 'Sasha'],
'Cart':['book', 'phonecase', 'shirt', np.nan, np.nan, np.nan, 'phone', 'food', 'bag', np.nan, np.nan, np.nan]
})
How can I solve this using copy() and append()?
You can use np.repeat with pd.Series.unique:
n = 3
# padding rows: each unique name repeated n times; Cart is absent, so it fills with NaN
extra = pd.DataFrame(np.repeat(df["Name"].unique(), n), columns=["Name"])
print(df.append(extra))   # pandas < 2.0; see the pd.concat equivalent after the output
Name Cart
0 Ali book
1 Ali phonecase
2 Ali shirt
3 Sasha phone
4 Sasha food
5 Sasha bag
0 Ali NaN
1 Ali NaN
2 Ali NaN
3 Sasha NaN
4 Sasha NaN
5 Sasha NaN
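DataFrame.append was removed in pandas 2.0; on current pandas the same result comes from pd.concat (a minimal equivalent of the call above):
import numpy as np
import pandas as pd

n = 3
extra = pd.DataFrame(np.repeat(df["Name"].unique(), n), columns=["Name"])
print(pd.concat([df, extra]))   # identical output: Cart is filled with NaN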
Try this one (it adds n rows to each group of rows with the same Name value):
import pandas as pd
import numpy as np

n = 3
list_of_df_unique_names = [df[df["Name"] == name] for name in df["Name"].unique()]
# pad each per-name frame with n empty rows, then glue the pieces back together
# (on pandas >= 2.0, replace d.append(...) with pd.concat([d, ...]))
df2 = pd.concat([d.append(pd.DataFrame({"Name": np.repeat(d["Name"].values[-1], n)}))
                 for d in list_of_df_unique_names]).reset_index(drop=True)
print(df2)
print(df2)
Output:
Name Cart
0 Ali book
1 Ali phonecase
2 Ali shirt
3 Ali NaN
4 Ali NaN
5 Ali NaN
6 Sasha phone
7 Sasha food
8 Sasha bag
9 Sasha NaN
10 Sasha NaN
11 Sasha NaN
Maybe not the most beautiful of all solutions, but it works. Say you want to add 4 NaN rows per group. Then, given your df:
df = pd.DataFrame({
'Name':['Ali', 'Ali', 'Ali', 'Sasha', 'Sasha', 'Sasha'],
'Cart':['book', 'phonecase', 'shirt', 'phone', 'food', 'bag']
})
you can create an empty list DF and loop over range(4); on each pass, filter df by name and append a copy of that slice with one extra empty row:
DF = []
names = list(set(df.Name))
for i in range(4):
    for name in names:
        gf = df[df['Name'] == '{}'.format(name)]
        # the groupby/apply below yields a single (name, NaN) row, appended to gf
        a = pd.concat([gf,
                       gf.groupby('Name')['Cart']
                         .apply(lambda x: x.shift(-1).iloc[-1])
                         .reset_index()]).sort_values('Name').reset_index(drop=True)
        DF.append(a)
DF_full = pd.concat(DF)
Now you'll end up with repeated copies of your original rows, so you need to drop the duplicates without dropping the NaN rows:
DFF = DF_full.sort_values(['Name','Cart'])
DFF = DFF[(~DFF.duplicated()) | (DFF['Cart'].isnull())]
which gives:
Name Cart
0 Ali book
1 Ali phonecase
2 Ali shirt
3 Ali NaN
3 Ali NaN
3 Ali NaN
3 Ali NaN
2 Sasha bag
1 Sasha food
0 Sasha phone
3 Sasha NaN
3 Sasha NaN
3 Sasha NaN
3 Sasha NaN
I'm trying to apply to my pandas dataframe something similar to R's tidyr::spread. I saw people using pd.pivot in some places, but so far I've had no success.
So in this example I have the following dataframe df:
df = pd.DataFrame({'action_id' : [1,2,1,4,5],
'name': ['jess', 'alex', 'jess', 'cath', 'mary'],
'address': ['house', 'house', 'park', 'park', 'park'],
'date': [ '01/01', '02/01', '03/01', '04/01', '05/01']})
What I want is a multi-index pivot table having 'action_id' and 'name' as the index, with the address column "spread" out and filled from the 'date' column (screenshots of the current and desired frames omitted; the desired shape matches the final output below).
What I tried was:
df.pivot(index = ['action_id', 'name'], columns = 'address', values = 'date')
And I got the error TypeError: MultiIndex.name must be a hashable type (at the time, pivot did not accept a list of columns as the index).
Does anyone know what am I doing wrong?
You do not need to pass an index to pd.pivot.
This will work:
import pandas as pd
df = pd.DataFrame({'action_id' : [1,2,1,4,5],
'name': ['jess', 'alex', 'jess', 'cath', 'mary'],
'address': ['house', 'house', 'park', 'park', 'park'],
'date': [ '01/01', '02/01', '03/01', '04/01', '05/01']})
df = pd.concat([df, pd.pivot(data=df, index=None, columns='address', values='date')], axis=1) \
.reset_index(drop=True).drop(['address','date'], axis=1)
print(df)
action_id name house park
0 1 jess 01/01 NaN
1 2 alex 02/01 NaN
2 1 jess NaN 03/01
3 4 cath NaN 04/01
4 5 mary NaN 05/01
And to arrive at what you want, you need to do a groupby
df = df.groupby(['action_id','name']).agg({'house':'first','park':'first'}).reset_index()
print(df)
action_id name house park
0 1 jess 01/01 03/01
1 2 alex 02/01 NaN
2 4 cath NaN 04/01
3 5 mary NaN 05/01
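As an aside, pandas 1.1 and later let pivot take lists directly, so the question's original call works as written on recent versions (run against the original, unmodified df):
print(df.pivot(index=['action_id', 'name'], columns='address', values='date'))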
Another option:
df2 = df.set_index(['action_id','name','address']).date.unstack().reset_index()
df2.columns.name = None
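For the sample df this should yield roughly (reconstructed from the code above):
   action_id  name house   park
0          1  jess 01/01  03/01
1          2  alex 02/01    NaN
2          4  cath   NaN  04/01
3          5  mary   NaN  05/01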
I want to parse a csv which has no headers and a tab ('\t') delimiter, but the result is wrong: all the values of a row are read into the first field and the other fields are NaN.
csv_df = pd.read_csv(
    "./123",
    sep="\t",
    names=["month", "number", "age", "name", "column1"],
    quotechar='"',
    doublequote=True,
    skip_blank_lines=True,
    encoding="utf-8"
)
print(csv_df)
The csv contents are:
1 'Pete Houston' 'Software Engineer' 92
2 'John Wick' 'Assassin' 95
3 'Bruce Wayne' 'Batman' 99
4 'Clark Kent' 'Superman' 95
The parse result is:
month number age name
0 1 'Pete Houston' 'Software Engineer' 92 NaN NaN NaN
1 2 'John Wick' 'Assassin' 95 NaN NaN NaN
2 3 'Bruce Wayne' 'Batman' 99 NaN NaN NaN
3 4 'Clark Kent' 'Superman' 95 NaN NaN NaN
The following code works fine for your csv example:
pd.read_csv("filename.csv", sep="\s+", quotechar="'", header=None,
names=["a", "b", "c", "d"])
Instead of using "\t" as the separator, it is better to use "\s+", which is more versatile. Also, your quote parameters were mixed up: the file quotes fields with single quotes, not double quotes.
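For the sample file this should parse into four clean columns with the quotes stripped, roughly:
   a             b                  c   d
0  1  Pete Houston  Software Engineer  92
1  2     John Wick           Assassin  95
2  3   Bruce Wayne             Batman  99
3  4    Clark Kent           Superman  95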
That happens because your csv file does not actually contain \t (tab) between the fields.
I replaced the spaces in each row with tabs and ran your code, which gave me the following result:
month number age name column1
0 1 'Pete Houston' 'Software Engineer' 92 NaN
1 2 'John Wick' 'Assassin' 95 NaN
2 3 'Bruce Wayne' 'Batman' 99 NaN
3 4 'Clark Kent' 'Superman' 95 NaN
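If you would rather fix the file than the parser, a rough sketch like the following rewrites the spacing as real tabs (the quoted-field regex and the ./123.tsv output path are assumptions):
import re

with open("./123") as src, open("./123.tsv", "w") as dst:
    for line in src:
        # keep single-quoted fields intact, split everything else on whitespace
        fields = re.findall(r"'[^']*'|\S+", line)
        dst.write("\t".join(fields) + "\n")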
This question already has answers here:
How can I pivot a dataframe?
(5 answers)
Closed 3 years ago.
The sample data looks like this:
d = pd.DataFrame({'name': ['Adam', 'Adam', 'Bob', 'Bob', 'Craig'],
'number': [111, 222, 333, 444, 555],
'type': ['phone', 'fax', 'phone', 'phone', 'fax']})
name number type
------ ------ -----
Adam 111 phone
Adam 222 fax
Bob 333 phone
Bob 444 phone
Craig 555 fax
I am trying to convert the numbers (phone and fax) to a wide format. The ideal output:
name fax phone
---- ----- -----
Adam 222.0 111.0
Bob NaN 333.0
Bob NaN 444.0
Craig 555.0 NaN
When I tried to use the pivot method and ran the following code:
p = d.pivot(index='name', columns = 'type', values='number').reset_index()
I received the error ValueError: Index contains duplicate entries, cannot reshape, because Bob has two phone numbers.
Is there a workaround for this?
Here you go: use cumcount to create an additional key:
d['key'] = d.groupby(['name', 'type']).cumcount()
p = d.pivot_table(index=['key', 'name'], columns='type', values='number',
                  aggfunc='sum').reset_index()
p
Out[71]:
type key name fax phone
0 0 Adam 222.0 111.0
1 0 Bob NaN 333.0
2 0 Craig 555.0 NaN
3 1 Bob NaN 444.0
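To match the requested row order, sort on name and key and drop the helper column (a small follow-up on p above):
print(p.sort_values(['name', 'key']).drop(columns='key'))   # rows ordered Adam, Bob, Bob, Craig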