I am trying to find some elegant ways of rearranging a pandas dataframe.
My initial dataframe looks like this:
PS PSS 10PS 10PSS 5PS 5PSS
1 6 263 5 23 2 101
2 5 49 2 30 1 30
The desired arrangement would be:
1-PS 1-PSS 1-10PS 1-10PSS 1-5PS 1-5PSS 2-PS 2-PSS 2-10PS 2-10PSS 2-5PS 2-5PSS
A 6 263 5 23 2 101 5 49 2 30 1 30
Here A is a new index, and I would like each row to be merged into the columns, with the original row label prefixed to each column name.
You need stack here, then join the two levels of the resulting index into the column names:
s=df.stack().to_frame('A')
s.index=s.index.map('{0[0]}-{0[1]}'.format)
s.T
Out[42]:
1-PS 1-PSS 1-10PS 1-10PSS 1-5PS 1-5PSS 2-PS 2-PSS 2-10PS 2-10PSS \
A 6 263 5 23 2 101 5 49 2 30
2-5PS 2-5PSS
A 1 30
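For completeness, here is a self-contained version of the above (the DataFrame construction is an assumption reconstructed from the sample data in the question):
import pandas as pd

df = pd.DataFrame({'PS': [6, 5], 'PSS': [263, 49], '10PS': [5, 2],
                   '10PSS': [23, 30], '5PS': [2, 1], '5PSS': [101, 30]},
                  index=[1, 2])

s = df.stack().to_frame('A')                    # long format with a (row, column) MultiIndex
s.index = s.index.map('{0[0]}-{0[1]}'.format)   # flatten the MultiIndex to 'row-column' labels
print(s.T)                                      # one wide row named A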
Hopefully these lines can help you out:
# Put a pandas Series from each line in a generator
series = (pd.Series(i, index=['{}-{}'.format(ind, x) for x in df.columns])
          for ind, i in zip(df.index, df.values))
# Concatenate and convert to frame + transpose
df = pd.concat(series).to_frame('A').T
Full example:
import io
import pandas as pd
data = '''\
index PS PSS 10PS 10PSS 5PS 5PSS
1 6 263 5 23 2 101
2 5 49 2 30 1 30'''
df = pd.read_csv(io.StringIO(data), sep=r'\s+').set_index('index')  # pd.compat.StringIO was removed in pandas 1.0
# Put a pandas Series from each line in a generator
series = (pd.Series(i, index=['{}-{}'.format(ind, x) for x in df.columns])
          for ind, i in zip(df.index, df.values))
# Concatenate and convert to frame + transpose
df = pd.concat(series).to_frame('A').T
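Printing df after the full example should reproduce the single row from the question (column order follows the original frame; shown on one line here):
print(df)
#    1-PS  1-PSS  1-10PS  1-10PSS  1-5PS  1-5PSS  2-PS  2-PSS  2-10PS  2-10PSS  2-5PS  2-5PSS
# A     6    263       5       23      2     101     5     49       2       30      1      30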
Related
I have a data frame df1 like this:
A B C ...
mean 10 100 1
std 11 110 2
median 12 120 3
I want to make another df with a separate column for each df1 column-header/row-name pair:
A-mean A-std A-median B-mean B-std B-median C-mean C-std C-median ...
10 11 12 100 110 120 1 2 3
Basically I have used the pandas.DataFrame.describe function and now I would like to transpose it this way.
You can unstack your DataFrame into a Series, flatten the Index, turn it back into a DataFrame and transpose the result.
out = (
df.unstack()
.pipe(lambda s:
s.set_axis(s.index.map('-'.join))
)
.to_frame().T
)
print(out)
A-mean A-std A-median B-mean B-std B-median C-mean C-std C-median
0 10 11 12 100 110 120 1 2 3
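Since the frame comes from pandas.DataFrame.describe, the same pattern can be applied to its output directly; note that describe labels the median row '50%'. A sketch, where the input df is an assumption for illustration:
import pandas as pd

df = pd.DataFrame({'A': [9, 10, 11], 'B': [90, 100, 110], 'C': [0, 1, 2]})

# Keep only the statistics of interest and restore the 'median' label
stats = df.describe().loc[['mean', 'std', '50%']].rename(index={'50%': 'median'})

out = stats.unstack().to_frame().T
out.columns = out.columns.map('-'.join)
print(out)  # columns: A-mean A-std A-median B-mean ...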
I have two datasets. The first one (df1) contains more than 200,000 rows, and the second one (df2) only two. I need to create a new column df1['column_2'] which is the sum of df1['column_1'] and df2['column_1'].
When I try df1['column_2'] = df1['column_1'] + df2['column_1'] I get the error "A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead"
How can I sum values of datasets with different numbers of rows?
I will be thankful for any help!
Screenshot of my notebook: https://prnt.sc/p1d6ze
I tried your code and it works with no error, using pandas 0.25.0 and Python 3.7.0.
If you use older versions, consider upgrading.
For the test I used a shorter df1 with 10 rows:
column_1
0 10
1 20
2 30
3 40
4 50
5 60
6 70
7 80
8 90
9 100
and df2 with 2 rows (just as in your post):
column_1
0 3
1 5
Your instruction df1['column_2'] = df1['column_1'] + df2['column_1']
gives the following result:
column_1 column_2
0 10 13.0
1 20 25.0
2 30 NaN
3 40 NaN
4 50 NaN
5 60 NaN
6 70 NaN
7 80 NaN
8 90 NaN
9 100 NaN
So that:
Elements with "overlapping" index values are summed.
Other elements (with no corresponding index in df2) are NaN.
Because of the presence of NaN values, this column is coerced to float.
Alternative form of this instruction, using .loc[...] is:
df1['column_2'] = df1.loc[:, 'column_1'] + df2.loc[:, 'column_1']
It also works on my computer.
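A side note: if the NaN rows are unwanted, Series.add with fill_value treats the missing positions in the shorter frame as 0, so every row stays populated (a sketch, assuming 0 is an acceptable filler for your data):
# Missing positions in df2 are treated as 0 instead of producing NaN
df1['column_2'] = df1['column_1'].add(df2['column_1'], fill_value=0)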
Or maybe you want to "multiply" (replicate) df2 to the length of df1
before summing? If yes, run:
df1['column_2'] = df1.column_1 + df2.column_1.values.tolist() * 5
In this case 5 is the number of times df2 should be "multiplied".
This time no index alignment takes place and the result is:
column_1 column_2
0 10 13
1 20 25
2 30 33
3 40 45
4 50 53
5 60 65
6 70 73
7 80 85
8 90 93
9 100 105
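To avoid hard-coding the 5, the replication factor can be computed from the lengths (a sketch, assuming len(df1) is an exact multiple of len(df2)):
import numpy as np

factor = len(df1) // len(df2)  # 10 // 2 == 5 here
df1['column_2'] = df1['column_1'].to_numpy() + np.tile(df2['column_1'].to_numpy(), factor)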
reindex_like is applied to the DataFrame that has fewer records than the other one, here y, so that it aligns to the shape of x. For example:
Subtraction:
import pandas as pd
x = pd.DataFrame([(100, 200), (300, 400), (100, 111)], columns=['a', 'b'])
y = pd.DataFrame([(1, 2), (3, 4)], columns=['a', 'b'])
z = x - y.reindex_like(x).fillna(0)
Addition
import pandas as pd
x = pd.DataFrame([(100, 200), (300, 400), (100, 111)], columns=['a', 'b'])
y = pd.DataFrame([(1, 2), (3, 4)], columns=['a', 'b'])
z = x + y.reindex_like(x).fillna(0)
Multiplication
import pandas as pd
x = pd.DataFrame([(100, 200), (300, 400), (100, 111)], columns=['a', 'b'])
y = pd.DataFrame([(1, 2), (3, 4)], columns=['a', 'b'])
z = x * y.reindex_like(x).fillna(1)
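The same pattern extends to other operations; the fill value should be the identity element of the operation (0 for addition and subtraction, 1 for multiplication), so unmatched rows pass through unchanged. A sketch for division, under the same assumption:
import pandas as pd
x = pd.DataFrame([(100, 200), (300, 400), (100, 111)], columns=['a', 'b'])
y = pd.DataFrame([(1, 2), (3, 4)], columns=['a', 'b'])
# Unmatched rows are divided by 1, i.e. left unchanged
z = x / y.reindex_like(x).fillna(1)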
I have discovered that I cannot make df_1['column_3'] = df_1['column_1'] + df_1['column_2'] if df_1 is a slice from the original dataframe df. So, I have solved my question by writing a function:
def new_column(dataframe):
    if dataframe['column'] == 'value_1':
        dataframe['new_column'] = (dataframe['column_1']
                                   - df_2[df_2['column'] == 'value_1']['column_1'].values[0])
    else:
        dataframe['new_column'] = (dataframe['column_1']
                                   - df_2[df_2['column'] == 'value_2']['column_1'].values[0])
    return dataframe

dataframe = df_1.apply(new_column, axis=1)
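A side note, since df_1 has 200,000+ rows: apply with axis=1 calls the function once per row and can be slow. A vectorized sketch of the same logic, assuming df_2['column'] holds exactly one row per value:
# Build a value -> offset lookup from df_2, then subtract in one vectorized step
offsets = df_2.set_index('column')['column_1']
df_1['new_column'] = df_1['column_1'] - df_1['column'].map(offsets)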
I have 3 different data-frames which can be generated using the code given below
import numpy as np
import pandas as pd

data_file = pd.DataFrame({'person_id': [1, 2, 3],
                          'gender': ['Male', 'Female', 'Not disclosed'],
                          'ethnicity': ['Chinese', 'Indian', 'European'],
                          'Marital_status': ['Single', 'Married', 'Widowed'],
                          'Smoke_status': ['Yes', 'No', 'No']})
map_file = pd.DataFrame({'gender': ['1.Male', '2. Female', '3. Not disclosed'],
                         'ethnicity': ['1.Chinese', '2. Indian', '3.European'],
                         'Marital_status': ['1.Single', '2. Married', '3 Widowed'],
                         'Smoke_status': ['1. Yes', '2. No', np.nan]})
hash_file = pd.DataFrame({'keys': ['gender', 'ethnicity', 'Marital_status', 'Smoke_status',
                                   'Yes', 'No', 'Male', 'Female', 'Single', 'Married',
                                   'Widowed', 'Chinese', 'Indian', 'European'],
                          'values': [21, 22, 23, 24, 125, 126, 127, 128, 129, 130,
                                     131, 141, 142, 0]})
And another empty dataframe in which the output should be filled can be generated using the code below
columns = ['person_id','obsid','valuenum','valuestring','valueid']
obs = pd.DataFrame(columns=columns)
What I am trying to achieve is shown in the table, where you can see the rules and a description of how the data is to be filled.
I did try a for-loop approach, but as soon as I unstack, I lose the column names and I am not sure how to proceed further:
a = 1
for i in range(len(data_file)):
    df_temp = data_file[i:a]
    a = a + 1
    df_temp = df_temp.unstack()
    df_temp = df_temp.to_frame().reset_index()
How can I get my output dataframe filled as shown below (ps: I have shown only person_id = 1 and 4 columns)? In reality I have more than 25k persons and 400 columns per person, so any elegant and efficient approach is helpful, unlike my for loop.
After discussion in chat and removal of duplicated data, it is possible to use the following (the column names studyid, VARIABLE and concept_id reflect the updated data from the chat):
s = hash_file.set_index('VARIABLE')['concept_id']
df1 = map_file.melt().dropna(subset=['value'])
df1[['valueid','valuestring']] = df1.pop('value').str.extract(r'(\d+)\.(.+)')
df1['valuestring'] = df1['valuestring'].str.strip()
columns = ['studyid','obsid','valuenum','valuestring','valueid']
obs = data_file.melt('studyid', value_name='valuestring').sort_values('studyid')
#merge by 2 columns variable, valuestring
obs = (obs.merge(df1, on=['variable','valuestring'], how='left')
.rename(columns={'valueid':'valuenum'}))
obs['obsid'] = obs['variable'].map(s)
obs['valueid'] = obs['valuestring'].map(s)
#map by only one column variable
s1 = df1.drop_duplicates('variable').set_index('variable')['valueid']
obs['valuenum_new'] = obs['variable'].map(s1)
obs = obs.reindex(columns + ['valuenum_new'], axis=1)
print (obs)
#compare number of non missing rows
print (len(obs.dropna(subset=['valuenum'])))
print (len(obs.dropna(subset=['valuenum_new'])))
Here is an alternative approach using DataFrame.melt and Series.map:
# Solution for pandas V 0.24.0 +
columns = ['person_id','obsid','valuenum','valuestring','valueid']
# Create map Series
hash_map = hash_file.set_index('keys')['values']
value_map = map_file.stack().str.split(r'\.\s?', expand=True).set_index(1, append=True).droplevel(0)[0]
# Melt and add mapped columns
obs = data_file.melt(id_vars=['person_id'], value_name='valuestring')
obs['obsid'] = obs.variable.map(hash_map)
obs['valueid'] = obs.valuestring.map(hash_map).astype('Int64')
obs['valuenum'] = obs[['variable', 'valuestring']].apply(tuple, axis=1).map(value_map)
# Reindex and sort for desired output
obs.reindex(columns=columns).sort_values('person_id')
[out]
person_id obsid valuenum valuestring valueid
0 1 21 1 Male 127
3 1 22 1 Chinese 141
6 1 23 1 Single 129
9 1 24 1 Yes 125
1 2 21 2 Female 128
4 2 22 2 Indian 142
7 2 23 2 Married 130
10 2 24 2 No 126
2 3 21 3 Not disclosed NaN
5 3 22 3 European 0
8 3 23 3 Widowed 131
11 3 24 2 No 126
I'd like to keep the columns in the order they were defined with pd.DataFrame. In the example below, df.info() shows GroupId as the first column and print(df.iloc[:,0]) also prints GroupId, even though Id was defined first.
I'm using Python version 3.6.3
import numpy as np
import pandas as pd
df = pd.DataFrame({'Id' : np.random.randint(1,100,10),
'GroupId' : np.random.randint(1,5,10) })
df.info()
print(df.iloc[:,0])
One way is to use collections.OrderedDict, as below. Note that the OrderedDict object takes a list of tuples as an input.
from collections import OrderedDict
df = pd.DataFrame(OrderedDict([('Id', np.random.randint(1,100,10)),
('GroupId', np.random.randint(1,5,10))]))
# Id GroupId
# 0 37 4
# 1 10 2
# 2 42 1
# 3 97 2
# 4 6 4
# 5 59 2
# 6 12 2
# 7 69 1
# 8 79 1
# 9 17 1
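Alternatively, and regardless of Python version, the columns argument of pd.DataFrame pins the order explicitly (a sketch using the same data):
# columns= selects and orders the columns, independent of dict iteration order
df = pd.DataFrame({'Id': np.random.randint(1, 100, 10),
                   'GroupId': np.random.randint(1, 5, 10)},
                  columns=['Id', 'GroupId'])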
Unless you're using python-3.6+ where dictionaries are ordered, this just isn't possible with a (standard) dictionary. You will need to zip your items together and pass a list of tuples:
np.random.seed(0)
a = np.random.randint(1, 100, 10)
b = np.random.randint(1, 5, 10)
df = pd.DataFrame(list(zip(a, b)), columns=['Id', 'GroupId'])
Or,
data = [a, b]
df = pd.DataFrame(list(zip(*data)), columns=['Id', 'GroupId'])
df
Id GroupId
0 45 3
1 48 1
2 65 1
3 68 1
4 68 3
5 10 2
6 84 3
7 22 4
8 37 4
9 88 3
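As a follow-up: from Python 3.7 onward, insertion order of plain dicts is guaranteed by the language specification (and CPython 3.6 already behaves this way), so the original construction keeps Id first without any workaround:
# Python 3.7+: plain dicts preserve insertion order
df = pd.DataFrame({'Id': np.random.randint(1, 100, 10),
                   'GroupId': np.random.randint(1, 5, 10)})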
I have a list of arrays (one-dimensional numpy array) (a_) and a list (l_) and want to have a DataFrame with them as its columns. They look like this:
a_: [array([381]), array([376]), array([402]), array([400])...]
l_: [1.5,2.34,4.22,...]
I can do it by:
df_l = pd.DataFrame(l_)
df_a = pd.DataFrame(a_)
df = pd.concat([df_l, df_a], axis=1)
Is there a shorter way of doing it? I tried to use DataFrame.append:
df_l = pd.DataFrame(l_)
df_l = df_l.append(a_)
However, because the column indices are both 0, it appends a_ to the end of the existing column, resulting in a single long column. Is there something like this:
l_ = l_.append(a_).reset(columns)
that sets a new column index for the appended array? Well, obviously this does not work!
the desired output is like:
0 0
0 1.50 381
1 2.34 376
2 4.22 402
...
Thanks.
Suggestion:
df_l = pd.DataFrame(l_)
df_l['a_'] = pd.Series(a_, index=df_l.index)
Example #1:
L = list(data)
A = list(data)
data_frame = pd.DataFrame(L)
data_frame['A'] = pd.Series(A, index=data_frame.index)
Example #2 - Same Series length (create series and set index to the same as existing data frame):
In [33]: L = list(item for item in range(10))
In [34]: A = list(item for item in range(10,20))
In [35]: data_frame = pd.DataFrame(L,columns=['L'])
In [36]: data_frame['A'] = pd.Series(A, index=data_frame.index)
In [37]: print(data_frame)
L A
0 0 10
1 1 11
2 2 12
3 3 13
4 4 14
5 5 15
6 6 16
7 7 17
8 8 18
9 9 19
Example #3 - Different Series lengths (create series and let pandas handle index matching):
In [45]: not_same_length = list(item for item in range(50,55))
In [46]: data_frame['nsl'] = pd.Series(not_same_length)
In [47]: print(data_frame)
L A nsl
0 0 10 50
1 1 11 51
2 2 12 52
3 3 13 53
4 4 14 54
5 5 15 NaN
6 6 16 NaN
7 7 17 NaN
8 8 18 NaN
9 9 19 NaN
Based on your comments, it looks like you want to flatten your list of lists. I'm assuming a plain list-of-lists structure here; the array(...) repr in your question suggests numpy arrays, but the same flattening works for those too. To do that you would do the following:
In [63]: A = [[381],[376], [402], [400]]
In [64]: A = [inner_item for item in A for inner_item in item]
In [65]: print(A)
[381, 376, 402, 400]
Then create the Series using the new array and follow the steps above to add to your data frame.
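If the elements really are one-element numpy arrays, as the array([381]) repr in the question suggests, np.concatenate flattens them in one call. A minimal sketch (the fourth value of l_ is made up for illustration, since the question truncates the list):
import numpy as np
import pandas as pd

a_ = [np.array([381]), np.array([376]), np.array([402]), np.array([400])]
l_ = [1.5, 2.34, 4.22, 5.01]  # last value is a placeholder to match lengths

# Flatten the one-element arrays and build both columns in one constructor call
df = pd.DataFrame({'l': l_, 'a': np.concatenate(a_)})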