Pandas DataFrame splitting with multiple attributes - python

If I had a pandas data frame with columns x , y , h , w and label.
For example:
x y h w label
0 0 4 4 1
4 0 4 8 1
0 4 8 4 2
8 0 4 4 3
8 8 4 8 2
Is there any way to split them into groups with same h and w values?
Into following
Group 1
x y h w label
0 0 4 4 1
8 0 4 4 3
Group 2
x y h w label
4 0 4 8 1
8 8 4 8 2
Group3
x y h w label
0 4 8 4 2

Use DataFrame.groupby + dict comprehension to get a dict of DataFrame, you could create new variables with globals() but it is not recommended
dict_dfs = {i:group for i, group in df.groupby(['h', 'w'])}
Or
dict_dfs = {f'Group {i}' : group
for i, (_, group) in enumerate(df.groupby(['h', 'w']), 1)}

Related

keeping first column value .melt func

I want to use dataframe.melt function in pandas lib to convert data format from rows into column but keeping first column value. I ve just tried also .pivot, but it is not working good. Please look at the example below and please help:
ID Alphabet Unspecified: 1 Unspecified: 2
0 1 A G L
1 2 B NaN NaN
2 3 C H NaN
3 4 D I M
4 5 E J NaN
5 6 F K O
Into this:
ID Alphabet
0 1 A
1 1 G
2 1 L
3 2 B
4 3 C
5 3 H
6 4 D
7 4 I
8 4 M
9 5 E
10 5 J
11 6 F
12 6 K
11 6 O
Try (assuming ID is unique and sorted):
df = (
pd.melt(df, "ID")
.sort_values("ID", kind="stable")
.drop(columns="variable")
.dropna()
.reset_index(drop=True)
.rename(columns={"value": "Alphabet"})
)
print(df)
Prints:
ID Alphabet
0 1 A
1 1 G
2 1 L
3 2 B
4 3 C
5 3 H
6 4 D
7 4 I
8 4 M
9 5 E
10 5 J
11 6 F
12 6 K
13 6 O
Don't melt but rather stack, this will directly drop the NaNs and keep the order per row:
out = (df
.set_index('ID')
.stack().droplevel(1)
.reset_index(name='Alphabet')
)
Output:
ID Alphabet
0 1 A
1 1 G
2 1 L
3 2 B
4 3 C
5 3 H
6 4 D
7 4 I
8 4 M
9 5 E
10 5 J
11 6 F
12 6 K
13 6 O
One option is with pivot_longer from pyjanitor:
# pip install pyjanitor
import pandas as pd
import janitor
(df
.pivot_longer(
index = 'ID',
names_to = 'Alphabet',
names_pattern = ['.+'],
sort_by_appearance = True)
.dropna()
)
ID Alphabet
0 1 A
1 1 G
2 1 L
3 2 B
6 3 C
7 3 H
9 4 D
10 4 I
11 4 M
12 5 E
13 5 J
15 6 F
16 6 K
17 6 O
In the code above, the names_pattern accepts a list of regular expression to match the desired columns, all the matches are collated into one column names Alphabet in names_to.

How can one duplicate columns N times in DataFrame?

I have a dataframe with one column and I would like to get a Dataframe with N columns all of which will be identical to the first one. I can simply duplicate it by:
df[['new column name']] = df[['column name']]
but I have to make more than 1000 identical columns that's why it doesnt work
. One important thing is figures in columns should change for instance if first columns is 0 the nth column is n and the previous is n-1
If it's a single column, you can use tranpose and then simply replicate them with pd.concat and tranpose back to the original format, this avoids looping and should be faster, then you can change the column names in a second line, but without dealing with all the data in the dataframe which would be the most consuming performance wise:
import pandas as pd
df = pd.DataFrame({'Column':[1,2,3,4,5]})
Original dataframe:
Column
0 1
1 2
2 3
3 4
4 5
df = pd.concat([df.T]*1000).T
Output:
Column Column Column Column ... Column Column Column Column
0 1 1 1 1 ... 1 1 1 1
1 2 2 2 2 ... 2 2 2 2
2 3 3 3 3 ... 3 3 3 3
3 4 4 4 4 ... 4 4 4 4
4 5 5 5 5 ... 5 5 5 5
[5 rows x 1000 columns]
df.columns = ['Column'+'_'+str(i) for i in range(1000)]
Say that you have a df:, with column name 'company_name' that consists of 8 companies:
df = {"company_name":{"0":"Telia","1":"Proximus","2":"Tmobile","3":"Orange","4":"Telefonica","5":"Verizon","6":"AT&T","7":"Koninklijke"}}
company_name
0 Telia
1 Proximus
2 Tmobile
3 Orange
4 Telefonica
5 Verizon
6 AT&T
7 Koninklijke
You can use a loop and range to determine how many identical columns to create, and do:
for i in range(0,1000):
df['company_name'+str(i)] = df['company_name']
which results in the shape of the df:
df.shape
(8, 1001)
i.e. it replicated 1000 times the same columns. The names of the duplicated columns will be the same as the original one, plus an integer (=+1) at the end:
'company_name', 'company_name0', 'company_name1', 'company_name2','company_name..N'
df
A B C
0 x x x
1 y x z
Duplicate column "C" 5 times using df.assign:
n = 5
df2 = df.assign(**{f'C{i}': df['C'] for i in range(1, n+1)})
df2
A B C C1 C2 C3 C4 C5
0 x x x x x x x x
1 y x z z z z z z
Set n to 1000 to get your desired output.
You can also directly assign the result back:
df[[f'C{i}' for i in range(1, n+1)]] = df[['C']*n].to_numpy()
df
A B C C1 C2 C3 C4 C5
0 x x x x x x x x
1 y x z z z z z z
I think the most efficient is to index with DataFrame.loc instead of using an outer loop
n = 3
new_df = df.loc[:, ['column_duplicate']*n +
df.columns.difference(['column_duplicate']).tolist()]
print(new_df)
column_duplicate column_duplicate column_duplicate other
0 0 0 0 10
1 1 1 1 11
2 2 2 2 12
3 3 3 3 13
4 4 4 4 14
5 5 5 5 15
6 6 6 6 16
7 7 7 7 17
8 8 8 8 18
9 9 9 9 19
If you want add a suffix
suffix_tup = ('a', 'b', 'c')
not_dup_cols = df.columns.difference(['column_duplicate']).tolist()
new_df = (df.loc[:, ['column_duplicate']*len(suffix_tup) +
not_dup_cols]
.set_axis(list(map(lambda suffix: f'column_duplicate_{suffix}',
suffix_tup)) +
not_dup_cols, axis=1)
)
print(new_df)
column_duplicate_a column_duplicate_b column_duplicate_c other
0 0 0 0 10
1 1 1 1 11
2 2 2 2 12
3 3 3 3 13
4 4 4 4 14
5 5 5 5 15
6 6 6 6 16
7 7 7 7 17
8 8 8 8 18
or add an index
n = 3
not_dup_cols = df.columns.difference(['column_duplicate']).tolist()
new_df = (df.loc[:, ['column_duplicate']*n +
not_dup_cols]
.set_axis(list(map(lambda x: f'column_duplicate_{x}', range(n))) +
not_dup_cols, axis=1)
)
print(new_df)
column_duplicate_0 column_duplicate_1 column_duplicate_2 other
0 0 0 0 10
1 1 1 1 11
2 2 2 2 12
3 3 3 3 13
4 4 4 4 14
5 5 5 5 15
6 6 6 6 16
7 7 7 7 17
8 8 8 8 18
9 9 9 9 19

How to compute nested group proportions using python pandas without losing original row count

Given the following data:
df = pd.DataFrame({
'where': ['a','a','a','a','a','a'] + ['b','b','b','b','b','b'],
'what': ['x','y','z','x','y','z'] + ['x','y','z','x','y','z'],
'val' : [1,3,2,5,4,3] + [5,6,3,4,5,3]
})
Which looks as:
where what val
0 a x 1
1 a y 3
2 a z 2
3 a x 5
4 a y 4
5 a z 3
6 b x 5
7 b y 6
8 b z 3
9 b x 4
10 b y 5
11 b z 3
I would like to compute the proportion of what in where, and create a new
column that represented this.
The column will have duplicates, If I consider what = x in the above, and
add that column in then the data would be as follows
where what val what_where_prop
0 a x 1 6/18
1 a y 3
2 a z 2
3 a x 5 6/18
4 a y 4
5 a z 3
6 b x 5 9/26
7 b y 6
8 b z 3
9 b x 4 9/26
10 b y 5
11 b z 3
Here 6/18 is computed by finding the total x (6 = 1 + 5) in a over the total of val in a. The same process is taken for 9/26
The full solution will be filled similarly for y and z in the final column.
IIUC,
df['what_where_group'] = (df.groupby(['where', 'what'], as_index=False)['val']
.transform('sum')
.div(df.groupby('where')['val']
.transform('sum'),
axis=0))
df
Output:
where what val what_where_prop what_where_group
0 a x 1 6 0.333333
1 a y 3 7 0.388889
2 a z 2 5 0.277778
3 a x 5 6 0.333333
4 a y 4 7 0.388889
5 a z 3 5 0.277778
6 b x 5 9 0.346154
7 b y 6 11 0.423077
8 b z 3 6 0.230769
9 b x 4 9 0.346154
10 b y 5 11 0.423077
11 b z 3 6 0.230769
Details:
First groupby two levels using what and where, by using index=False, I am not setting the index as the groups, and transform sum. Next, groupby only where and transform sum. Lastly, divide, using div, the first groupby by the second groupby using the direction as rows with axis=0.
Another way:
g = df.set_index(['where', 'what'])['val']
num = g.sum(level=[0,1])
denom = g.sum(level=0)
ww_group = num.div(denom, level=0).rename('what_where_group')
df.merge(ww_group, left_on=['where','what'], right_index=True)
Output:
where what val what_where_prop what_where_group
0 a x 1 6 0.333333
3 a x 5 6 0.333333
1 a y 3 7 0.388889
4 a y 4 7 0.388889
2 a z 2 5 0.277778
5 a z 3 5 0.277778
6 b x 5 9 0.346154
9 b x 4 9 0.346154
7 b y 6 11 0.423077
10 b y 5 11 0.423077
8 b z 3 6 0.230769
11 b z 3 6 0.230769
Details:
Basically the same as before just using steps. And, merge results to apply division to each line.

Merge a pandas dataframe with a list through an outer join

I have a dataframe that looks as follow
A B
0 1 4
1 2 5
2 3 6
and a list
names = ['x','y']
I want to get a dataframe that kind of performs and outer join with that list. The desired result is:
A B name
0 1 4 x
1 1 4 y
2 2 5 x
3 2 5 y
4 3 6 x
5 3 6 y
Using pd.concat:
res = pd.concat([df.assign(name=i) for i in names], ignore_index=True)
Result:
A B name
0 1 4 x
1 2 5 x
2 3 6 x
3 1 4 y
4 2 5 y
5 3 6 y
Using additional key for merge
df.assign(key=1).merge(pd.DataFrame({'Name':names,'key':1})).drop('key',1)
Out[54]:
A B Name
0 1 4 x
1 1 4 y
2 2 5 x
3 2 5 y
4 3 6 x
5 3 6 y
Comprehension
pd.DataFrame(
[r + (n,) for r in zip(*map(df.get, df)) for n in names],
columns=[*df.columns, *['name']]
)
A B name
0 1 4 x
1 1 4 y
2 2 5 x
3 2 5 y
4 3 6 x
5 3 6 y

Python: how to reshape a Pandas dataframe and keeping the information?

I have a dataframe counting the geographical information of points.
df:
A B ax ay bx by
0 x y 5 7 3 2
1 z w 2 0 7 4
2 k x 5 7 2 0
3 v y 2 3 3 2
I would like to create a dataframe with the geographical info of the unique points
df1:
ID x y
0 x 5 7
1 y 3 2
2 z 2 0
3 w 7 4
4 k 5 7
5 v 2 3
First flatten values in columns by numpy.ravel, create DataFrame by contructor and last add drop_duplicates, thanks #zipa:
a = df[['A','B']].values.ravel()
b = df[['ax','bx']].values.ravel()
c = df[['ay','by']].values.ravel()
df = pd.DataFrame({'ID':a, 'x':b, 'y':c}).drop_duplicates('ID').reset_index(drop=True)
print (df)
ID x y
0 x 5 7
1 y 3 2
2 z 2 0
3 w 7 4
4 k 5 7
5 v 2 3

Categories

Resources