Python: how to reshape a Pandas dataframe and keeping the information? - python

I have a dataframe counting the geographical information of points.
df:
A B ax ay bx by
0 x y 5 7 3 2
1 z w 2 0 7 4
2 k x 5 7 2 0
3 v y 2 3 3 2
I would like to create a dataframe with the geographical info of the unique points
df1:
ID x y
0 x 5 7
1 y 3 2
2 z 2 0
3 w 7 4
4 k 5 7
5 v 2 3

First flatten values in columns by numpy.ravel, create DataFrame by contructor and last add drop_duplicates, thanks #zipa:
a = df[['A','B']].values.ravel()
b = df[['ax','bx']].values.ravel()
c = df[['ay','by']].values.ravel()
df = pd.DataFrame({'ID':a, 'x':b, 'y':c}).drop_duplicates('ID').reset_index(drop=True)
print (df)
ID x y
0 x 5 7
1 y 3 2
2 z 2 0
3 w 7 4
4 k 5 7
5 v 2 3

Related

Pandas DataFrame splitting with multiple attributes

If I had a pandas data frame with columns x , y , h , w and label.
For example:
x y h w label
0 0 4 4 1
4 0 4 8 1
0 4 8 4 2
8 0 4 4 3
8 8 4 8 2
Is there any way to split them into groups with same h and w values?
Into following
Group 1
x y h w label
0 0 4 4 1
8 0 4 4 3
Group 2
x y h w label
4 0 4 8 1
8 8 4 8 2
Group3
x y h w label
0 4 8 4 2
Use DataFrame.groupby + dict comprehension to get a dict of DataFrame, you could create new variables with globals() but it is not recommended
dict_dfs = {i:group for i, group in df.groupby(['h', 'w'])}
Or
dict_dfs = {f'Group {i}' : group
for i, (_, group) in enumerate(df.groupby(['h', 'w']), 1)}

How can one duplicate columns N times in DataFrame?

I have a dataframe with one column and I would like to get a Dataframe with N columns all of which will be identical to the first one. I can simply duplicate it by:
df[['new column name']] = df[['column name']]
but I have to make more than 1000 identical columns that's why it doesnt work
. One important thing is figures in columns should change for instance if first columns is 0 the nth column is n and the previous is n-1
If it's a single column, you can use tranpose and then simply replicate them with pd.concat and tranpose back to the original format, this avoids looping and should be faster, then you can change the column names in a second line, but without dealing with all the data in the dataframe which would be the most consuming performance wise:
import pandas as pd
df = pd.DataFrame({'Column':[1,2,3,4,5]})
Original dataframe:
Column
0 1
1 2
2 3
3 4
4 5
df = pd.concat([df.T]*1000).T
Output:
Column Column Column Column ... Column Column Column Column
0 1 1 1 1 ... 1 1 1 1
1 2 2 2 2 ... 2 2 2 2
2 3 3 3 3 ... 3 3 3 3
3 4 4 4 4 ... 4 4 4 4
4 5 5 5 5 ... 5 5 5 5
[5 rows x 1000 columns]
df.columns = ['Column'+'_'+str(i) for i in range(1000)]
Say that you have a df:, with column name 'company_name' that consists of 8 companies:
df = {"company_name":{"0":"Telia","1":"Proximus","2":"Tmobile","3":"Orange","4":"Telefonica","5":"Verizon","6":"AT&T","7":"Koninklijke"}}
company_name
0 Telia
1 Proximus
2 Tmobile
3 Orange
4 Telefonica
5 Verizon
6 AT&T
7 Koninklijke
You can use a loop and range to determine how many identical columns to create, and do:
for i in range(0,1000):
df['company_name'+str(i)] = df['company_name']
which results in the shape of the df:
df.shape
(8, 1001)
i.e. it replicated 1000 times the same columns. The names of the duplicated columns will be the same as the original one, plus an integer (=+1) at the end:
'company_name', 'company_name0', 'company_name1', 'company_name2','company_name..N'
df
A B C
0 x x x
1 y x z
Duplicate column "C" 5 times using df.assign:
n = 5
df2 = df.assign(**{f'C{i}': df['C'] for i in range(1, n+1)})
df2
A B C C1 C2 C3 C4 C5
0 x x x x x x x x
1 y x z z z z z z
Set n to 1000 to get your desired output.
You can also directly assign the result back:
df[[f'C{i}' for i in range(1, n+1)]] = df[['C']*n].to_numpy()
df
A B C C1 C2 C3 C4 C5
0 x x x x x x x x
1 y x z z z z z z
I think the most efficient is to index with DataFrame.loc instead of using an outer loop
n = 3
new_df = df.loc[:, ['column_duplicate']*n +
df.columns.difference(['column_duplicate']).tolist()]
print(new_df)
column_duplicate column_duplicate column_duplicate other
0 0 0 0 10
1 1 1 1 11
2 2 2 2 12
3 3 3 3 13
4 4 4 4 14
5 5 5 5 15
6 6 6 6 16
7 7 7 7 17
8 8 8 8 18
9 9 9 9 19
If you want add a suffix
suffix_tup = ('a', 'b', 'c')
not_dup_cols = df.columns.difference(['column_duplicate']).tolist()
new_df = (df.loc[:, ['column_duplicate']*len(suffix_tup) +
not_dup_cols]
.set_axis(list(map(lambda suffix: f'column_duplicate_{suffix}',
suffix_tup)) +
not_dup_cols, axis=1)
)
print(new_df)
column_duplicate_a column_duplicate_b column_duplicate_c other
0 0 0 0 10
1 1 1 1 11
2 2 2 2 12
3 3 3 3 13
4 4 4 4 14
5 5 5 5 15
6 6 6 6 16
7 7 7 7 17
8 8 8 8 18
or add an index
n = 3
not_dup_cols = df.columns.difference(['column_duplicate']).tolist()
new_df = (df.loc[:, ['column_duplicate']*n +
not_dup_cols]
.set_axis(list(map(lambda x: f'column_duplicate_{x}', range(n))) +
not_dup_cols, axis=1)
)
print(new_df)
column_duplicate_0 column_duplicate_1 column_duplicate_2 other
0 0 0 0 10
1 1 1 1 11
2 2 2 2 12
3 3 3 3 13
4 4 4 4 14
5 5 5 5 15
6 6 6 6 16
7 7 7 7 17
8 8 8 8 18
9 9 9 9 19

How to compute nested group proportions using python pandas without losing original row count

Given the following data:
df = pd.DataFrame({
'where': ['a','a','a','a','a','a'] + ['b','b','b','b','b','b'],
'what': ['x','y','z','x','y','z'] + ['x','y','z','x','y','z'],
'val' : [1,3,2,5,4,3] + [5,6,3,4,5,3]
})
Which looks as:
where what val
0 a x 1
1 a y 3
2 a z 2
3 a x 5
4 a y 4
5 a z 3
6 b x 5
7 b y 6
8 b z 3
9 b x 4
10 b y 5
11 b z 3
I would like to compute the proportion of what in where, and create a new
column that represented this.
The column will have duplicates, If I consider what = x in the above, and
add that column in then the data would be as follows
where what val what_where_prop
0 a x 1 6/18
1 a y 3
2 a z 2
3 a x 5 6/18
4 a y 4
5 a z 3
6 b x 5 9/26
7 b y 6
8 b z 3
9 b x 4 9/26
10 b y 5
11 b z 3
Here 6/18 is computed by finding the total x (6 = 1 + 5) in a over the total of val in a. The same process is taken for 9/26
The full solution will be filled similarly for y and z in the final column.
IIUC,
df['what_where_group'] = (df.groupby(['where', 'what'], as_index=False)['val']
.transform('sum')
.div(df.groupby('where')['val']
.transform('sum'),
axis=0))
df
Output:
where what val what_where_prop what_where_group
0 a x 1 6 0.333333
1 a y 3 7 0.388889
2 a z 2 5 0.277778
3 a x 5 6 0.333333
4 a y 4 7 0.388889
5 a z 3 5 0.277778
6 b x 5 9 0.346154
7 b y 6 11 0.423077
8 b z 3 6 0.230769
9 b x 4 9 0.346154
10 b y 5 11 0.423077
11 b z 3 6 0.230769
Details:
First groupby two levels using what and where, by using index=False, I am not setting the index as the groups, and transform sum. Next, groupby only where and transform sum. Lastly, divide, using div, the first groupby by the second groupby using the direction as rows with axis=0.
Another way:
g = df.set_index(['where', 'what'])['val']
num = g.sum(level=[0,1])
denom = g.sum(level=0)
ww_group = num.div(denom, level=0).rename('what_where_group')
df.merge(ww_group, left_on=['where','what'], right_index=True)
Output:
where what val what_where_prop what_where_group
0 a x 1 6 0.333333
3 a x 5 6 0.333333
1 a y 3 7 0.388889
4 a y 4 7 0.388889
2 a z 2 5 0.277778
5 a z 3 5 0.277778
6 b x 5 9 0.346154
9 b x 4 9 0.346154
7 b y 6 11 0.423077
10 b y 5 11 0.423077
8 b z 3 6 0.230769
11 b z 3 6 0.230769
Details:
Basically the same as before just using steps. And, merge results to apply division to each line.

Merge a pandas dataframe with a list through an outer join

I have a dataframe that looks as follow
A B
0 1 4
1 2 5
2 3 6
and a list
names = ['x','y']
I want to get a dataframe that kind of performs and outer join with that list. The desired result is:
A B name
0 1 4 x
1 1 4 y
2 2 5 x
3 2 5 y
4 3 6 x
5 3 6 y
Using pd.concat:
res = pd.concat([df.assign(name=i) for i in names], ignore_index=True)
Result:
A B name
0 1 4 x
1 2 5 x
2 3 6 x
3 1 4 y
4 2 5 y
5 3 6 y
Using additional key for merge
df.assign(key=1).merge(pd.DataFrame({'Name':names,'key':1})).drop('key',1)
Out[54]:
A B Name
0 1 4 x
1 1 4 y
2 2 5 x
3 2 5 y
4 3 6 x
5 3 6 y
Comprehension
pd.DataFrame(
[r + (n,) for r in zip(*map(df.get, df)) for n in names],
columns=[*df.columns, *['name']]
)
A B name
0 1 4 x
1 1 4 y
2 2 5 x
3 2 5 y
4 3 6 x
5 3 6 y

How to reshape multiple time-series signals for use with sns.tsplot?

I'm trying to reshape data that looks like this:
t y0 y1 y2
0 0 -1 0 1
1 1 0 1 2
2 2 1 2 3
3 3 2 3 4
4 4 3 4 5
into something like this:
t trial signal value
0 0 0 y -1
1 0 1 y 0
2 0 2 y 1
3 1 0 y 0
4 1 1 y 1
5 1 2 y 2
6 2 0 y 1
7 2 1 y 2
8 2 2 y 3
9 3 0 y 2
10 3 1 y 3
11 3 2 y 4
12 4 0 y 3
13 4 1 y 4
14 4 2 y 5
so that I can feed it into sns.tsplot.
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
fig = plt.figure()
num_points = 5
# Create some dummy line signals and assemble a data frame
t = np.arange(num_points)
y0 = t - 1
y1 = t
y2 = t + 1
df = pd.DataFrame(np.vstack((t, y0, y1, y2)).transpose(), columns=['t', 'y0', 'y1', 'y2'])
print(df)
# Do some magic transformations
df = pd.melt(df, id_vars=['t'])
print(df)
# Plot the time-series data
sns.tsplot(time="t", value="value", unit="trial", condition="signal", data=df, ci=[68, 95])
plt.savefig("dummy.png")
plt.close()
I'm hoping to achieve this for lines:
https://stanford.edu/~mwaskom/software/seaborn/generated/seaborn.tsplot.html
http://pandas.pydata.org/pandas-docs/stable/reshaping.html
I think you can use melt for reshaping, get first and second char by indexing with str and last sort_values with reordering columns:
df1 = pd.melt(df, id_vars=['t'])
#create helper Series
variable = df1['variable']
#extract second char, convert to int
df1['trial'] = variable.str[1].astype(int)
#extract first char
df1['signal'] = variable.str[0]
#sort values by column t, reset index
df1 = df1.sort_values('t').reset_index(drop=True)
#reorder columns
df1 = df1[['t','trial','signal','value']]
print df1
t trial signal value
0 0 0 y -1
1 0 1 y 0
2 0 2 y 1
3 1 0 y 0
4 1 1 y 1
5 1 2 y 2
6 2 0 y 1
7 2 1 y 2
8 2 2 y 3
9 3 0 y 2
10 3 1 y 3
11 3 2 y 4
12 4 0 y 3
13 4 1 y 4
14 4 2 y 5
Another solution, if all values in column signal are only y:
#remove y from column name, first value of column names is same
df.columns = df.columns[:1].tolist() + [int(col[1]) for col in df.columns[1:]]
print df
t 0 1 2
0 0 -1 0 1
1 1 0 1 2
2 2 1 2 3
3 3 2 3 4
4 4 3 4 5
df1 = pd.melt(df, id_vars=['t'], var_name=['trial'])
#all values in column signal are y
df1['signal'] = 't'
#sort values by column t, reset index
df1 = df1.sort_values('t').reset_index(drop=True)
#reorder columns
df1 = df1[['t','trial','signal','value']]
print df1
t trial signal value
0 0 0 t -1
1 0 1 t 0
2 0 2 t 1
3 1 0 t 0
4 1 1 t 1
5 1 2 t 2
6 2 0 t 1
7 2 1 t 2
8 2 2 t 3
9 3 0 t 2
10 3 1 t 3
11 3 2 t 4
12 4 0 t 3
13 4 1 t 4
14 4 2 t 5

Categories

Resources