Pandas sort str alphabetically then numerically - python

This is probably a simple question, but I couldn't find the answer. In a pandas DataFrame like the one below, how can the values be sorted first alphabetically and then numerically?
START:
import pandas as pd
d ={'col1': ['A1','B2','A10','A7','C4','C2','C22','B4']}
df = pd.DataFrame(data=d)
df
col1
0 A1
1 A7
2 A10
3 B2
4 B4
5 C2
6 C4
7 C22
WHAT I WANT TO GET:
col1
0 A1
1 A7
2 A10
3 B2
4 B4
5 C2
6 C4
7 C22
WHAT I GET:
>>>df.sort_values(by='col1')
col1
0 A1
2 A10
1 A7
3 B2
4 B4
5 C2
7 C22
6 C4

Using pandas just to sort a list is overkill:
lot_file = pd.DataFrame()
lot_file['SPOOL'] = ['A39','B34','A3','B37','A6','B18','A48','B15','A47']
group_lots = lot_file.sort_values(by=['SPOOL'])
group_lots['SPOOL'].tolist()
Output:
['A3', 'A39', 'A47', 'A48', 'A6', 'B15', 'B18', 'B34', 'B37']
Or use the built-in sorted:
spool_list = ['A39','B34','A3','B37','A6','B18','A48','B15','A47']
sorted(spool_list)
Output:
['A3', 'A39', 'A47', 'A48', 'A6', 'B15', 'B18', 'B34', 'B37']
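Note that both outputs above are in plain string order ('A39' comes before 'A6'), which is not the order the question asks for. A minimal sketch of a letter-then-number sort follows, assuming every label is one or more letters followed by one or more digits. The helper name split_label and the temporary frame parts are illustrative, not part of the original answers.

```python
import re
import pandas as pd

def split_label(label):
    """Split a label such as 'A10' into ('A', 10) so numbers compare as ints."""
    # Assumption: every label is letters followed by digits.
    match = re.fullmatch(r"([A-Za-z]+)(\d+)", label)
    return match.group(1), int(match.group(2))

labels = ['A1', 'B2', 'A10', 'A7', 'C4', 'C2', 'C22', 'B4']

# Plain Python list: sort on the (letters, number) tuple.
sorted_labels = sorted(labels, key=split_label)

# DataFrame: extract both parts, sort on them, then reorder the original rows.
df = pd.DataFrame({'col1': labels})
parts = df['col1'].str.extract(r'^(?P<prefix>[A-Za-z]+)(?P<number>\d+)$')
parts['number'] = parts['number'].astype(int)
order = parts.sort_values(['prefix', 'number']).index
df_sorted = df.loc[order].reset_index(drop=True)

print(sorted_labels)
print(df_sorted)
```

Sorting on helper columns rather than on the strings themselves keeps the approach independent of the pandas version in use.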

Related

How to transform tables using pandas

---I have a CSV dataset---
import pandas as pd
df = pd.DataFrame({'A':['a','a','a','a1','a1','a1','a1','a1','a1'], 'B':['b','b','b','b1','b1','b1','b1','b1','b1'], 'C':['c','c','c','c1','c1','c1','c1','c1','c1'], 'D':['d','d1','d2','d3','d4','d5','d6','d7','d8'], 'Rank':[1,2,3,1,2,3,4,5,6]})
---I tried to transform it with pivot_table---
pd.pivot_table(df, values=['D'], index=['A','B','C'], columns='Rank').reset_index()
---I didn't get what I want, which is the following table---
pd.DataFrame({'A':['a','a1'], 'B':['b','b1'], 'C':['c','c1'], '1':['d','d3'], '2':['d1','d4'], '3':['d2','d5'], '4':['NaN','d6'], '5':['NaN','d7'], '6':['NaN','d8'], '7':['NaN','NaN']})
You have to use pivot, not pivot_table in this case:
df.pivot(index=['A', 'B', 'C'], columns='Rank', values='D').reset_index()
Output:
Rank A B C 1 2 3 4 5 6
0 a b c d d1 d2 NaN NaN NaN
1 a1 b1 c1 d3 d4 d5 d6 d7 d8
pivot_table aggregates duplicate entries, whereas pivot only reshapes the data without aggregating, which is what you want here.
To remove axis name:
df.pivot(index=['A', 'B', 'C'], columns='Rank', values='D').reset_index().rename_axis(columns=None)
Output:
A B C 1 2 3 4 5 6
0 a b c d d1 d2 NaN NaN NaN
1 a1 b1 c1 d3 d4 d5 d6 d7 d8
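To illustrate the difference between the two functions, here is a small sketch built on the question's data. It assumes a pandas version in which pivot accepts a list for index (1.1 or later). The extra frame dup and the variable pivot_error are hypothetical additions used only to show how pivot reacts to duplicates.

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['a'] * 3 + ['a1'] * 6,
    'B': ['b'] * 3 + ['b1'] * 6,
    'C': ['c'] * 3 + ['c1'] * 6,
    'D': ['d', 'd1', 'd2', 'd3', 'd4', 'd5', 'd6', 'd7', 'd8'],
    'Rank': [1, 2, 3, 1, 2, 3, 4, 5, 6],
})
keys = ['A', 'B', 'C']

# pivot: pure reshape, no aggregation.
wide = df.pivot(index=keys, columns='Rank', values='D')

# pivot_table needs an aggregation that works on strings, e.g. 'first'.
wide_first = df.pivot_table(index=keys, columns='Rank', values='D', aggfunc='first')

# With a duplicated (A, B, C, Rank) combination, pivot refuses to reshape.
dup = pd.concat([df, df.iloc[[0]]], ignore_index=True)
pivot_error = None
try:
    dup.pivot(index=keys, columns='Rank', values='D')
except ValueError as exc:
    pivot_error = str(exc)

print(wide.reset_index().rename_axis(columns=None))
print(pivot_error)
```

If duplicates are expected, pivot_table with an explicit aggfunc is the safer choice; if they would indicate a data problem, the error raised by pivot is a useful check.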

Understanding the FutureWarning on using join_axes when concatenating with Pandas

I have two DataFrames:
df1:
A B C
1 A1 B1 C1
2 A2 B2 C2
df2:
B C D
3 B3 C3 D3
4 B4 C4 D4
Columns B and C are identical for both.
I'd like to concatenate them vertically and keep the columns of the first DataFrame:
pd.concat([df1, df2], join_axes=[df1.columns]):
A B C
1 A1 B1 C1
2 A2 B2 C2
3 NaN B3 C3
4 NaN B4 C4
This works, but raises a
FutureWarning: The join_axes-keyword is deprecated. Use .reindex or .reindex_like on the result to achieve the same functionality.
I couldn't find (either in the documentation or through Google) how to "Use .reindex or .reindex_like on the result to achieve the same functionality".
Colab notebook illustrating issue: https://colab.research.google.com/drive/13EBq2z0Nh05JY7ovrdnLGtfeqdKVvZq0
As the warning suggests, use reindex:
pd.concat([df1,df2.reindex(columns=df1.columns)])
Out[286]:
A B C
1 A1 B1 C1
2 A2 B2 C2
3 NaN B3 C3
4 NaN B4 C4
df1 = pd.DataFrame({'A': ['A1', 'A2'], 'B': ['B1', 'B2'], 'C': ['C1', 'C2']}, index=[1, 2])
df2 = pd.DataFrame({'B': ['B3', 'B4'], 'C': ['C3', 'C4'], 'D': ['D3', 'D4']}, index=[3, 4])
pd.concat([df1, df2], sort=False)[df1.columns]
yields the desired result.
OR...
pd.concat([df1, df2], sort=False).reindex(df1.columns, axis=1)
Output:
A B C
1 A1 B1 C1
2 A2 B2 C2
3 NaN B3 C3
4 NaN B4 C4
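For reference, a self-contained sketch that rebuilds the two frames from the question and checks that selecting the columns and reindexing after the concat give the same result. The index values 1 to 4 are taken from the printed tables; the variable names by_selection and by_reindex are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({'A': ['A1', 'A2'], 'B': ['B1', 'B2'], 'C': ['C1', 'C2']},
                   index=[1, 2])
df2 = pd.DataFrame({'B': ['B3', 'B4'], 'C': ['C3', 'C4'], 'D': ['D3', 'D4']},
                   index=[3, 4])

# Concatenate on the union of columns, then keep only df1's columns.
combined = pd.concat([df1, df2], sort=False)
by_selection = combined[df1.columns]
by_reindex = combined.reindex(columns=df1.columns)

print(by_reindex)
print(by_selection.equals(by_reindex))
```

The two forms differ only when df1 has a column that is missing from the concatenated result, which cannot happen here because df1 is part of the concat.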

Summing columns from different dataframe according to some column names

Suppose I have a main dataframe
main_df
Cri1 Cri2 Cri3 total
0 A1 A2 A3 4
1 B1 B2 B3 5
2 C1 C2 C3 6
I also have 3 dataframes
df_1
Cri1 Cri2 Cri3 value
0 A1 A2 A3 1
1 B1 B2 B3 2
df_2
Cri1 Cri2 Cri3 value
0 A1 A2 A3 9
1 C1 C2 C3 10
df_3
Cri1 Cri2 Cri3 value
0 B1 B2 B3 15
1 C1 C2 C3 17
What I want is to add the value from each of these dataframes to total in main_df, matching rows on the Cri columns,
i.e. main_df will become
main_df
Cri1 Cri2 Cri3 total
0 A1 A2 A3 14
1 B1 B2 B3 22
2 C1 C2 C3 33
Of course I can do it using a for loop, but in the end I want to apply the method to a large amount of data, say 50000 rows in each dataframe.
Are there other ways to solve it?
Thank you!
First, align the names of the numeric columns. In this case, rename total in main_df to value:
df_main = main_df.rename(columns={'total': 'value'})
Then you have a couple of options.
concat + groupby
You can concatenate and then perform a groupby with sum:
res = pd.concat([df_main, df_1, df_2, df_3])\
        .groupby(['Cri1', 'Cri2', 'Cri3']).sum()\
        .reset_index()
print(res)
Cri1 Cri2 Cri3 value
0 A1 A2 A3 14
1 B1 B2 B3 22
2 C1 C2 C3 33
set_index + reduce / add
Alternatively, you can create a list of dataframes indexed by your criteria columns. Then use functools.reduce with pd.DataFrame.add to sum these dataframes.
from functools import reduce
dfs = [df.set_index(['Cri1', 'Cri2', 'Cri3']) for df in [df_main, df_1, df_2, df_3]]
res = reduce(lambda x, y: x.add(y, fill_value=0), dfs).reset_index()
print(res)
Cri1 Cri2 Cri3 value
0 A1 A2 A3 14.0
1 B1 B2 B3 22.0
2 C1 C2 C3 33.0
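The concat + groupby option can be tried end to end with the sketch below, which rebuilds the question's frames. The helper make_frame and the list name keys are hypothetical conveniences, and the final rename back to total is an assumption about the desired column name.

```python
import pandas as pd

keys = ['Cri1', 'Cri2', 'Cri3']

def make_frame(rows, value_name):
    """Build a frame with the three criteria columns plus one numeric column."""
    return pd.DataFrame(rows, columns=keys + [value_name])

main_df = make_frame([['A1', 'A2', 'A3', 4],
                      ['B1', 'B2', 'B3', 5],
                      ['C1', 'C2', 'C3', 6]], 'total')
df_1 = make_frame([['A1', 'A2', 'A3', 1], ['B1', 'B2', 'B3', 2]], 'value')
df_2 = make_frame([['A1', 'A2', 'A3', 9], ['C1', 'C2', 'C3', 10]], 'value')
df_3 = make_frame([['B1', 'B2', 'B3', 15], ['C1', 'C2', 'C3', 17]], 'value')

# Align the numeric column name, stack all frames, and sum per criteria group.
frames = [main_df.rename(columns={'total': 'value'}), df_1, df_2, df_3]
res = (pd.concat(frames)
         .groupby(keys, as_index=False)['value'].sum()
         .rename(columns={'value': 'total'}))

print(res)
```

Because the whole operation is a single groupby, no explicit Python loop over rows is involved.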

select the first N elements of each row in a column

I am looking to select the first two elements of each row in column a and column b.
Here is an example
df = pd.DataFrame({'a': ['A123', 'A567','A100'], 'b': ['A156', 'A266666','A35555']})
>>> df
a b
0 A123 A156
1 A567 A266666
2 A100 A35555
desired output
>>> df
a b
0 A1 A1
1 A5 A2
2 A1 A3
I have been trying to use df.loc but have not been successful.
Use
In [905]: df.apply(lambda x: x.str[:2])
Out[905]:
a b
0 A1 A1
1 A5 A2
2 A1 A3
Or (note that applymap is deprecated since pandas 2.1 in favour of DataFrame.map):
In [908]: df.applymap(lambda x: x[:2])
Out[908]:
a b
0 A1 A1
1 A5 A2
2 A1 A3
In [107]: df.apply(lambda c: c.str.slice(stop=2))
Out[107]:
a b
0 A1 A1
1 A5 A2
2 A1 A3
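If the frame also holds non-string columns, the slicing can be limited to the text columns. The sketch below assumes a hypothetical extra numeric column named count and a variable n for the number of characters to keep; neither appears in the question.

```python
import pandas as pd

df = pd.DataFrame({'a': ['A123', 'A567', 'A100'],
                   'b': ['A156', 'A266666', 'A35555'],
                   'count': [1, 2, 3]})  # hypothetical non-string column

n = 2                    # number of leading characters to keep
text_cols = ['a', 'b']   # only these columns are sliced

out = df.copy()
out[text_cols] = out[text_cols].apply(lambda col: col.str[:n])

print(out)
```

Working on a copy leaves the original frame untouched, which makes it easy to compare before and after.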

Pandas: How to expand data frame rows containing a dictionary with varying keys in a column?

I'm a little stuck, can you please help me with this. I've simplified the problem I'm facing to the following:
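Input
a b c
0 a0 b0 {'c00': 'v00', 'c01': 'v01'}
1 a1 b1 {'c10': 'v10'}
2 a2 b2 {'c20': 'v20', 'c21': 'v21', 'c22': 'v22'}
Desired Output
a b key val
0 a0 b0 c00 v00
1 a0 b0 c01 v01
2 a1 b1 c10 v10
3 a2 b2 c20 v20
4 a2 b2 c21 v21
5 a2 b2 c22 v22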
I know how to handle the case where the dictionaries in column c have the same keys.
You can create a DataFrame with the constructor, reshape it with stack, and finally join it to the original:
df1 = (pd.DataFrame(df.c.values.tolist())
         .stack()
         .reset_index(level=1)
         .rename(columns={0:'val','level_1':'key'}))
print (df1)
key val
0 c00 v00
0 c01 v01
1 c10 v10
2 c20 v20
2 c21 v21
2 c22 v22
df = df.drop(columns='c').join(df1).reset_index(drop=True)
print (df)
a b key val
0 a0 b0 c00 v00
1 a0 b0 c01 v01
2 a1 b1 c10 v10
3 a2 b2 c20 v20
4 a2 b2 c21 v21
5 a2 b2 c22 v22
Here is one way:
import numpy as np
import pandas as pd
from itertools import chain
df = pd.DataFrame([['a0', 'b0', {'c00': 'v00', 'c01': 'v01'}],
                   ['a1', 'b1', {'c10': 'v10'}],
                   ['a2', 'b2', {'c20': 'v20', 'c21': 'v21', 'c22': 'v22'}]],
                  columns=['a', 'b', 'c'])
# first convert 'c' to list of tuples
df['c'] = df['c'].apply(lambda x: list(x.items()))
lens = list(map(len, df['c']))
# create dataframe
df_out = pd.DataFrame({'a': np.repeat(df['a'].values, lens),
                       'b': np.repeat(df['b'].values, lens),
                       'c': list(chain.from_iterable(df['c'].values))})
# unpack tuple
df_out = df_out.join(df_out['c'].apply(pd.Series))\
               .rename(columns={0: 'key', 1: 'val'}).drop(columns='c')
# a b key val
# 0 a0 b0 c00 v00
# 1 a0 b0 c01 v01
# 2 a1 b1 c10 v10
# 3 a2 b2 c20 v20
# 4 a2 b2 c21 v21
# 5 a2 b2 c22 v22
My solution is as follows:
import pandas as pd
t = pd.DataFrame([['a0','b0',{'c00':'v00','c01':'v01'}],['a1','b1',{'c10':'v10'}],['a2','b2',{'c20':'v20','c21':'v21','c22':'v22'}]], columns=['a','b','c'])
l2 = []
for i in t.index:
    for j in t.loc[i, 'c']:
        l2 += [[t.loc[i, 'a'], t.loc[i, 'b'], j, t.loc[i, 'c'][j]]]
t2 = pd.DataFrame(l2, columns=['a','b','key','val'])
where 't' is your DataFrame and 't2' is the result in the shape you want.
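Another possible variant, not taken from the answers above, relies on Series.explode (available since pandas 0.25). This is only a sketch and assumes every dictionary in column c is non-empty; the names pairs, kv and out are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    [['a0', 'b0', {'c00': 'v00', 'c01': 'v01'}],
     ['a1', 'b1', {'c10': 'v10'}],
     ['a2', 'b2', {'c20': 'v20', 'c21': 'v21', 'c22': 'v22'}]],
    columns=['a', 'b', 'c'])

# Turn each dict into a list of (key, value) tuples, then emit one row per tuple.
# The original row label is repeated for every pair.
pairs = df['c'].map(lambda d: list(d.items())).explode()

# Split the tuples into two columns, keeping the repeated row labels.
kv = pd.DataFrame(pairs.tolist(), index=pairs.index, columns=['key', 'val'])

# Join on the row label so 'a' and 'b' are repeated alongside each pair.
out = df.drop(columns='c').join(kv).reset_index(drop=True)

print(out)
```

An empty dictionary would explode to a missing value and break the tuple unpacking, so such rows would need to be filtered out or handled separately first.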
