Make a unique enumeration for concatenated pandas DataFrames - Python

I have some dataframes where data is tagged in groups, let's say as such:
df1 = pd.DataFrame({'id':[1,3,7, 10,30, 70, 100, 300], 'name':[1,1,1,1,1,1,1,1], 'tag': [1,1,1, 2,2,2, 3,3]})
df2 = pd.DataFrame({'id':[2,5,6, 20, 50, 200, 500, 600], 'name': [2,2,2,2,2,2,2,2], 'tag':[1,1,1, 2, 2, 3,3,3]})
df3 = pd.DataFrame({'id':[4, 8, 9, 40, 400, 800, 900], 'name': [3,3,3,3,3,3,3], 'tag':[1,1,1, 2, 3, 3,3]})
In each dataframe, the tag is attributed in ascending order of ids (so bigger ids will have equal or bigger tags).
My wish is to recalculate tags in the concatenated dataframe,
df = pd.concat([df1, df2, df3])
so that the tag of each group will be in ascending order of the ids of the first element of each. So, the group starting with id=1 will be tagged 1 (that is, ids 1, 3, 7), the group starting with id=2 will be tagged 2 (that is, ids 2, 5, 6), the group starting with 4 will be tagged 3, the group starting with 10 will be tagged 4, and so on.
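So the full expected mapping would be: ids (1, 3, 7) -> tag 1, (2, 5, 6) -> 2, (4, 8, 9) -> 3, (10, 30, 70) -> 4, (20, 50) -> 5, (40) -> 6, (100, 300) -> 7, (200, 500, 600) -> 8, (400, 800, 900) -> 9.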
I did manage to get a (complicated!) solution:
1) Get the first row of each group, put those in a dataframe, sort by id and create the new tags:
dff = pd.concat([df1.groupby('tag').first(), df2.groupby('tag').first(), df3.groupby('tag').first()])
dff = dff.sort_values('id')
dff = dff.reset_index()
dff['new_tags'] = dff.index +1
2) Concatenate this dataframe with the initial ones, drop_duplicates so as to keep the newly tagged rows, order by group, then propagate the new tags:
df = pd.concat([dff, df1, df2, df3])
df = df.drop_duplicates(subset=['id', 'tag', 'name'])
df = df.sort_values(['name', 'tag'])
df = df.ffill()
The new tags are exactly what is needed, but my solution seems too complicated. Would you have a suggestion on how to make this easier? I think I must be missing something!
Thanks in advance,
M.

Using pd.concat with keys, I break the steps down:
df = pd.concat([df1, df2, df3], keys=[0, 1, 2])
df = df.reset_index(level=0)  # turn the level-0 key into a 'level_0' column
df = df.sort_values(['tag', 'level_0'])  # sort by tag, then by source frame
df['New'] = (df['tag'].diff().ne(0) | df['level_0'].diff().ne(0)).cumsum()
df
Out[110]:
   level_0   id  name  tag  New
0        0    1     1    1    1
1        0    3     1    1    1
2        0    7     1    1    1
0        1    2     2    1    2
1        1    5     2    1    2
2        1    6     2    1    2
0        2    4     3    1    3
1        2    8     3    1    3
2        2    9     3    1    3
3        0   10     1    2    4
4        0   30     1    2    4
5        0   70     1    2    4
3        1   20     2    2    5
4        1   50     2    2    5
3        2   40     3    2    6
6        0  100     1    3    7
7        0  300     1    3    7
5        1  200     2    3    8
6        1  500     2    3    8
7        1  600     2    3    8
4        2  400     3    3    9
5        2  800     3    3    9
6        2  900     3    3    9

Once concatenated, you can groupby the columns 'tag' and 'name' with transform('first') on the column 'id'. Then sort_values this series and take the cumsum of where the diff is non-zero, such as:
df = pd.concat([df1, df2, df3]).sort_values('id').reset_index(drop=True)
df['new'] = (df.groupby(['tag','name'])['id'].transform('first')
               .sort_values().diff().ne(0.).cumsum())
and you get the expected output:
    id  name  tag  new
0    1     1    1    1
1    2     2    1    2
2    3     1    1    1
3    4     3    1    3
4    5     2    1    2
5    6     2    1    2
6    7     1    1    1
7    8     3    1    3
8    9     3    1    3
9   10     1    2    4
10  20     2    2    5
11  30     1    2    4
12  40     3    2    6
...
EDIT: to avoid using groupby, you can use drop_duplicates and its index to get the positions of the first id of each group, create the column new with an incremental value using loc and range, and then ffill after sort_values to fill in the values:
df = pd.concat([df1, df2, df3]).sort_values('id').reset_index(drop=True)
list_ind = df.drop_duplicates(['name','tag']).index
df.loc[list_ind,'new'] = range(1,len(list_ind)+1)
df['new'] = df.sort_values(['tag','name'])['new'].ffill().astype(int)
and you get the same result
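For reference, a shorter equivalent is possible with transform('min') plus a dense rank; this is a sketch on the same data, not part of the original answers:
import pandas as pd

df = pd.concat([df1, df2, df3]).sort_values('id').reset_index(drop=True)
# Each row gets its group's smallest id; a dense rank of those minima then
# numbers the groups 1..N in ascending order of their first id.
df['new'] = (df.groupby(['name', 'tag'])['id']
               .transform('min')
               .rank(method='dense')
               .astype(int))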

Related

Sort part of DataFrame in Python Pandas, return new column with order depending on row values

My first question here, I hope this is understandable.
I have a Pandas DataFrame:
   order_numbers  x_closest_autobahn
0             34                   3
1             11                   3
2              5                   3
3              8                  12
4              2                  12
I would like to get a new column with the order_number per closest_autobahn in ascending order:
   order_numbers  x_closest_autobahn  order_number_autobahn_x
2              5                   3                        1
1             11                   3                        2
0             34                   3                        3
4              2                  12                        1
3              8                  12                        2
I have tried:
df['order_number_autobahn_x'] = ([df.loc[(df['x_closest_autobahn'] == 3)]].sort_values(by=['order_numbers'], ascending=True, inplace=True))
I have looked at slicing, sort_values and reset_index
df.sort_values(by=['order_numbers'], ascending=True, inplace=True)
df = df.reset_index() # reset index to the order after sort
df['order_numbers_index'] = df.index
but I can't seem to get the DataFrame I am looking for.
Use DataFrame.sort_values by both columns, and for the counter use GroupBy.cumcount:
df = df.sort_values(['x_closest_autobahn','order_numbers'])
df['order_number_autobahn_x'] = df.groupby('x_closest_autobahn').cumcount().add(1)
print (df)
   order_numbers  x_closest_autobahn  order_number_autobahn_x
2              5                   3                        1
1             11                   3                        2
0             34                   3                        3
4              2                  12                        1
3              8                  12                        2
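A groupby rank produces the same counter without pre-sorting the frame; a sketch, assuming order_numbers are unique within each group:
# rank(method='first') numbers rows 1..n within each group by ascending
# order_numbers, leaving the original row order untouched.
df['order_number_autobahn_x'] = (df.groupby('x_closest_autobahn')['order_numbers']
                                   .rank(method='first')
                                   .astype(int))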

Pandas Merge Columns with Priority

My input dataframe:
   MinA  MinB  MaxA  MaxB
0     1     2     5     7
1     1     0     8     6
2     2   NaN    15    15
3   NaN     3   NaN   NaN
4   NaN   NaN   NaN    10
I want to merge the "min" and "max" columns amongst themselves, with priority (the A columns have higher priority than the B columns).
If both columns are null, they should get default values: 0 for Min and 100 for Max.
Desired output is:
   MinA  MinB  MaxA  MaxB  Min  Max
0     1     2     5     7    1    5
1     1     0     8     6    1    8
2     2   NaN    15    15    2   15
3   NaN     3   NaN   NaN    3  100
4   NaN   NaN   NaN    10    0   10
Could you please help me with this?
This can be accomplished using mask. With your data that would look like the following:
df = pd.DataFrame({
    'MinA': [1, 1, 2, None, None],
    'MinB': [2, 0, None, 3, None],
    'MaxA': [5, 8, 15, None, None],
    'MaxB': [7, 6, 15, None, 10],
})
# Create the new column using A as the base; where it is NaN, use B.
# Then do the same again with the specified default values.
df['Min'] = df['MinA'].mask(pd.isna, df['MinB']).mask(pd.isna, 0)
df['Max'] = df['MaxA'].mask(pd.isna, df['MaxB']).mask(pd.isna, 100)
The above would result in the desired output:
   MinA  MinB  MaxA  MaxB  Min  Max
0     1     2     5     7    1    5
1     1     0     8     6    1    8
2     2   NaN    15    15    2   15
3   NaN     3   NaN   NaN    3  100
4   NaN   NaN   NaN    10    0   10
Just using fillna() will be fine:
df['Min'] = df['MinA'].fillna(df['MinB']).fillna(0)
df['Max'] = df['MaxA'].fillna(df['MaxB']).fillna(100)
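combine_first expresses the same coalescing, if you prefer it to chained fillna (a sketch):
df['Min'] = df['MinA'].combine_first(df['MinB']).fillna(0)
df['Max'] = df['MaxA'].combine_first(df['MaxB']).fillna(100)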

How to Transpose multiple columns into multiple rows but retain primary keys as is using Pandas

I have a dataframe which can be generated from the code given below
df = pd.DataFrame({'person_id' :[1,2,3],'date1': ['12/31/2007','11/25/2009','10/06/2005'],'date1derived':[0,0,0],'val1':[2,4,6],'date2': ['12/31/2017','11/25/2019','10/06/2015'],'date2derived':[0,0,0],'val2':[1,3,5],'date3':['12/31/2027','11/25/2029','10/06/2025'],'date3derived':[0,0,0],'val3':[7,9,11]})
The dataframe has one row per person, with the date, derived and val fields repeated as columns (the original post showed this as a screenshot). I would like to have the data of each person as separate rows rather than columns. In addition, I want the date1derived, date2derived and date3derived columns to be dropped.
I did try the approaches below, but they didn't provide the expected output:
1) df.set_index(['person_id']).stack()/unstack
2) df.set_index(['person_id','date1','date2','date3']).stack()/unstack()
3) df.set_index('person_id').unstack()/stack
How can I get such an output? I have more than 600 columns, so I don't think writing the column names manually would help me.
This is a wide_to_long problem:
pd.wide_to_long(df, stubnames=['date', 'val'], i='person_id', j='grp').sort_index(level=0)
                     date  val
person_id grp
1         1    12/31/2007    2
          2    12/31/2017    1
          3    12/31/2027    7
2         1    11/25/2009    4
          2    11/25/2019    3
          3    11/25/2029    9
3         1    10/06/2005    6
          2    10/06/2015    5
          3    10/06/2025   11
To match your expected output:
df = pd.wide_to_long(df, stubnames=['date', 'val'], i='person_id', j='grp').sort_index(level=0)
df = df.reset_index(level=1, drop=True).reset_index()
   person_id        date  val
0          1  12/31/2007    2
1          1  12/31/2017    1
2          1  12/31/2027    7
3          2  11/25/2009    4
4          2  11/25/2019    3
5          2  11/25/2029    9
6          3  10/06/2005    6
7          3  10/06/2015    5
8          3  10/06/2025   11
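Note that the question also asks to drop the dateNderived columns; since their suffixes don't match wide_to_long's default numeric suffix, they are left untouched, so one option (a sketch) is to drop them from the wide frame before reshaping:
# Hypothetical clean-up step: drop every column whose name contains 'derived'.
df = df.drop(columns=df.filter(like='derived').columns)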
You can do it without wide_to_long(), just with a loop and pd.concat() (DataFrame.append was removed in pandas 2.0):
parts = []
for i in range(1, 4):
    new_df = df[['person_id', f'date{i}', f'val{i}']].copy()
    new_df.columns = ['person_id', 'date', 'val']
    parts.append(new_df)
df2 = pd.concat(parts)
df2.sort_values('person_id').reset_index(drop=True)
Output:
   person_id        date  val
0          1  12/31/2007    2
1          1  12/31/2017    1
2          1  12/31/2027    7
3          2  11/25/2009    4
4          2  11/25/2019    3
5          2  11/25/2029    9
6          3  10/06/2005    6
7          3  10/06/2015    5
8          3  10/06/2025   11

How to remove ugly row in pandas.DataFrame

So I am filling dataframes from 2 different files. While those 2 files should have the same structure (the values should be different, though), the resulting dataframes look different. When printing those I get:
               a             b         c    d
0       70402.14  70370.602112  0.533332   98
1       31362.21  31085.682726  1.912552  301
...          ...           ...       ...  ...
753919  64527.16  64510.008206  0.255541   71
753920  58077.61  58030.943621  0.835758  152

                a              b         c    d
index
0       118535.32  118480.657338  0.280282   47
1        49536.10   49372.999416  0.429902   86
...           ...            ...       ...  ...
753970   52112.95   52104.717927  0.356051  116
753971   37044.40   36915.264944  0.597472  165
So in the second dataframe there is that "index" row that doesn't make any sense to me, and it causes trouble in my following code. I neither wrote the code that fills the files into the dataframes nor created those files. So I am rather interested in checking whether such a row exists and how I might be able to remove it. Does anyone have an idea about this?
The second dataframe has an index level named "index".
You can remove the name with
df.index.name = None
For example,
In [126]: df = pd.DataFrame(np.arange(15).reshape(5,3))
In [128]: df.index.name = 'index'
In [129]: df
Out[129]:
        0   1   2
index
0       0   1   2
1       3   4   5
2       6   7   8
3       9  10  11
4      12  13  14
In [130]: df.index.name = None
In [131]: df
Out[131]:
    0   1   2
0   0   1   2
1   3   4   5
2   6   7   8
3   9  10  11
4  12  13  14
The dataframe might have picked up the name "index" if you used reset_index and set_index like this:
In [138]: df.reset_index()
Out[138]:
   index   0   1   2
0      0   0   1   2
1      1   3   4   5
2      2   6   7   8
3      3   9  10  11
4      4  12  13  14
In [140]: df.reset_index().set_index('index')
Out[140]:
        0   1   2
index
0       0   1   2
1       3   4   5
2       6   7   8
3       9  10  11
4      12  13  14
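An equivalent non-mutating spelling (a sketch) uses rename_axis:
df = df.rename_axis(None)  # returns a copy with the index name cleared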
Index is just the first column - it's numbering the rows by default, but you can change it in a number of ways (e.g. filling it with values from one of the columns).

How to convert pandas dataframe so that index is the unique set of values and data is the count of each value?

I have a dataframe from multiple choice questions and it is formatted like so:
      Sex  Qu1  Qu2  Qu3
Name
Bob     M    1    2    1
John    M    3    3    5
Alex    M    4    1    2
Jen     F    3    2    4
Mary    F    4    3    4
The data is a rating from 1 to 5 for the 3 multiple choice questions. I want to rearrange the data so that the index is range(1,6) where 1='bad', 2='poor', 3='ok', 4='good', 5='excellent', the columns are the same, and the data is the count of the occurrences of each value (excluding the Sex column). This is basically a histogram with fixed bin sizes and the x-axis labeled with strings. I like the output of df.plot() much better than df.hist() for this, but I can't figure out how to rearrange the table to give me a histogram of the data. Also, how do you change the x-labels to be strings?
Series.value_counts gives you the histogram you're looking for:
In [9]: df['Qu1'].value_counts()
Out[9]:
4    2
3    2
1    1
So, apply this function to each of those 3 columns:
In [13]: table = df[['Qu1', 'Qu2', 'Qu3']].apply(lambda x: x.value_counts())
In [14]: table
Out[14]:
   Qu1  Qu2  Qu3
1    1    1    1
2  NaN    2    1
3    2    2  NaN
4    2  NaN    2
5  NaN  NaN    1
In [15]: table = table.fillna(0)
In [16]: table
Out[16]:
   Qu1  Qu2  Qu3
1    1    1    1
2    0    2    1
3    2    2    0
4    2    0    2
5    0    0    1
Using table.reindex or table.loc[some_array] you can rearrange the data.
To transform the index to strings, use table.rename:
In [17]: table.rename(index=str)
Out[17]:
  Qu1  Qu2  Qu3
1   1    1    1
2   0    2    1
3   2    2    0
4   2    0    2
5   0    0    1
In [18]: table.rename(index=str).index[0]
Out[18]: '1'
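To put the rating words from the question on the axis, rename with a mapping instead; a sketch, where the reindex guards against ratings that never occur:
labels = {1: 'bad', 2: 'poor', 3: 'ok', 4: 'good', 5: 'excellent'}
table = table.reindex(range(1, 6)).fillna(0).rename(index=labels)
table.plot()  # the x-axis ticks now show the string labels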
