Efficiently transform pandas DataFrame using column names as factors - python

I would like to transform a DataFrame produced by a piece of software into a more Python-friendly one, and I can't do it in a simple way with pandas because I have to use information contained in the column names. Here is a simple example:
import pandas as pd
d = {'00' : [1],'01' : [11], '10': [111], '11':[1111]}
pd.DataFrame(d)
00 01 10 11
0 1 11 111 1111
The column names contain the factors that I need as rows; I would like to get something like this:
df = {'trt': [0,0,1,1], 'grp': [0,1,0,1], 'value':[1,11,111,1111]}
pd.DataFrame(df)
grp trt value
0 0 0 1
1 1 0 11
2 0 1 111
3 1 1 1111
Any ideas on how to do it properly?

Solution with MultiIndex.from_arrays, created by indexing the column names with str, then transposing with T:
df.columns = pd.MultiIndex.from_arrays([df.columns.str[0], df.columns.str[1]])
print (df)
0 1
0 1 0 1
0 1 11 111 1111
df1 = df.T.reset_index()
df1.columns = ['grp','trt','value']
print (df1)
grp trt value
0 0 0 1
1 0 1 11
2 1 0 111
3 1 1 1111
A similar solution using rename_axis and renaming the index:
d = {'00' : [1],'01' : [11], '10': [111], '11':[1111]}
df = pd.DataFrame(d)
df.columns = pd.MultiIndex.from_arrays([df.columns.str[0], df.columns.str[1]])
print(df.rename_axis(('grp','trt'), axis=1).rename(index={0:'value'}).T.reset_index())
grp trt value
0 0 0 1
1 0 1 11
2 1 0 111
3 1 1 1111

To me the simplest solution is just melting the original frame and splitting the column names in a second step. Something like this:
df = pd.DataFrame(d)
mf = pd.melt(df)
mf[['grp', 'trt']] = mf.pop('variable').apply(lambda x: pd.Series(tuple(x)))
Here's mf after melting:
variable value
0 00 1
1 01 11
2 10 111
3 11 1111
And the final result, after splitting the variable column:
value grp trt
0 1 0 0
1 11 0 1
2 111 1 0
3 1111 1 1
I'd encourage you to read up more on melting here: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.melt.html. It can be incredibly useful.
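For reference, here is a minimal self-contained sketch of the melt approach, using the vectorized .str accessor instead of apply to split the names (assuming every column name is exactly two single-character codes):
import pandas as pd

d = {'00': [1], '01': [11], '10': [111], '11': [1111]}
df = pd.DataFrame(d)

# Melt to long format: one row per original column.
mf = pd.melt(df)

# Split each two-character name into its factor codes.
mf['grp'] = mf['variable'].str[0]
mf['trt'] = mf['variable'].str[1]
mf = mf.drop(columns='variable')
print(mf)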

Related

Data frame Merge in a specific format

I have two dataframes, and I am able to merge them, but I want to merge them in a specific format (column-wise). Below are further details.
>df1
id A B C
0 1 20 0 1
1 2 23 1 2
>df2
id A B C
0 1 10 1 1
1 2 20 1 1
Below is my code and output
df = pd.merge(df1,df2,on='id',suffixes=('_Pre', '_Post'))
The output of this is :
id A_Pre B_Pre C_Pre A_Post B_Post C_Post
0 1 20 0 1 10 1 1
1 2 23 1 2 20 1 1
But the EXPECTED output should be as below. Can someone help or guide me with this?
id A_Pre A_Post B_Pre B_Post C_Pre C_Post
0 1 20 10 0 1 1 1
1 2 23 20 1 1 2 1
When subsequent manipulation is acceptable you can do something like:
import numpy as np
df[np.array([[x + "_Pre", x + "_Post"] for x in df1.columns.drop("id")]).flatten()]
If you just want to modify the order of your columns, you can use reindex:
df = df.reindex(columns=['A_Pre','A_Post','B_Pre','B_Post','C_Pre','C_Post'])
You can order the columns in the new dataset using sorted, and then prepend the "id" column in a second statement:
order_col = sorted(df.columns[1:], key=lambda x:x[:3])
df_final = pd.concat([df['id'],df[order_col]], axis=1)
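For reference, a minimal sketch that builds the interleaved _Pre/_Post order programmatically from df1's columns, so it scales beyond three columns (the frames here just mirror the question's data):
import pandas as pd

df1 = pd.DataFrame({'id': [1, 2], 'A': [20, 23], 'B': [0, 1], 'C': [1, 2]})
df2 = pd.DataFrame({'id': [1, 2], 'A': [10, 20], 'B': [1, 1], 'C': [1, 1]})
df = pd.merge(df1, df2, on='id', suffixes=('_Pre', '_Post'))

# Interleave the suffixed names: A_Pre, A_Post, B_Pre, B_Post, ...
order = [c + s for c in df1.columns.drop('id') for s in ('_Pre', '_Post')]
print(df[['id'] + order])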

How to multiply every column in one dataframe with all columns in another dataframe

I have two dataframes X_dummy and X_var, where X_dummy contains dummies and looks like this:
dummy1 dummy2
1 0
0 1
1 0
The X_var dataframe contains variables and looks like this:
var1 var2
4 2
10 5
1 1
Now I want to create a dataframe containing the cellwise product of every column from X_dummy with the complete X_var dataframe. Hence, my resulting dataframe should look like, X_result:
var1dummy1 var2dummy1 var1dummy2 var2dummy2
4 2 0 0
0 0 10 5
1 1 0 0
Does anyone know how to do this without using multiple for loops?
Something like numpy broadcasting:
import numpy as np
new = pd.DataFrame(np.concatenate(df2.T.values * df1.T.values[:, None]).T)
new
Out[161]:
0 1 2 3
0 4 2 0 0
1 0 0 10 5
2 1 1 0 0
# Optionally, label the columns:
# new.columns = pd.MultiIndex.from_product([df1.columns, df2.columns]).map('_'.join)
Try:
pd.concat([(df1[i]*df2[j]).rename(f'{i}{j}') for i in df1 for j in df2], axis=1)
Output:
dummy1var1 dummy1var2 dummy2var1 dummy2var2
0 4 2 0 0
1 0 0 10 5
2 1 1 0 0
You can definitely do it with one loop:
dummies = X_dummy.astype(bool)
pd.concat([X_var.loc[dummies[c]] for c in dummies], axis=1).fillna(0).astype(int)
# var1 var2 var1 var2
#0 4 2 0 0
#1 0 0 10 5
#2 1 1 0 0
Note that because one of your dataframes contains dummies, you do not need multiplication at all.
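Putting the broadcasting answer together end to end, here is a sketch with labeled columns, where df1 holds the dummies and df2 the variables (mirroring X_dummy and X_var):
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'dummy1': [1, 0, 1], 'dummy2': [0, 1, 0]})
df2 = pd.DataFrame({'var1': [4, 10, 1], 'var2': [2, 5, 1]})

# Broadcast every dummy column against every variable column:
# the result has one block of var columns per dummy column.
new = pd.DataFrame(np.concatenate(df2.T.values * df1.T.values[:, None]).T)
new.columns = pd.MultiIndex.from_product([df1.columns, df2.columns]).map('_'.join)
print(new)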

Group a dataframe and count the number of items in a column that is not shown

OK, I admit, I had trouble formulating a good header for this, so I will give an example.
This is my sample dataframe:
df = pd.DataFrame([
    (1, "a", "good"),
    (1, "a", "good"),
    (1, "b", "good"),
    (1, "c", "bad"),
    (2, "a", "good"),
    (2, "b", "bad"),
    (3, "a", "none")], columns=["id", "type", "eval"])
What I do with it is the following:
df.groupby(["id", "type"])["id"].agg({'id':'count'})
This results in:
id
id type
1 a 2
b 1
c 1
2 a 1
b 1
3 a 1
This is fine, although later on I will need the id repeated in every row. But this is not the most important part.
What I would need now is something like this:
id good bad none
id type
1 a 2 2 0 0
b 1 1 0 0
c 1 0 1 0
2 a 1 1 0 0
b 1 0 1 0
3 a 1 0 0 1
And even better would be a result like this, because I will need this back in a dataframe (and finally in an Excel sheet) with all fields populated. In reality, there will be many more columns I am grouping by. They would have to be completely populated as well.
id good bad none
id type
1 a 2 2 0 0
1 b 1 1 0 0
1 c 1 0 1 0
2 a 1 1 0 0
2 b 1 0 1 0
3 a 1 0 0 1
Thank you for helping me out.
You can use groupby + size (with the eval column added as a grouping key) or value_counts, together with unstack:
df1 = (df.groupby(["id", "type", 'eval'])
         .size()
         .unstack(fill_value=0)
         .rename_axis(None, axis=1))
print (df1)
bad good none
id type
1 a 0 2 0
b 0 1 0
c 1 0 0
2 a 0 1 0
b 1 0 0
3 a 0 0 1
df1 = (df.groupby(["id", "type"])['eval']
         .value_counts()
         .unstack(fill_value=0)
         .rename_axis(None, axis=1))
print (df1)
bad good none
id type
1 a 0 2 0
b 0 1 0
c 1 0 0
2 a 0 1 0
b 1 0 0
3 a 0 0 1
But writing to Excel with
df1.to_excel('file.xlsx')
keeps the MultiIndex, so reset_index is needed first:
df1.reset_index().to_excel('file.xlsx', index=False)
EDIT:
I forgot the id count column; id would be a duplicate column name, so it has to be id1:
df1.insert(0, 'id1', df1.sum(axis=1))
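Putting it all together, a sketch of the value_counts route including the count column and the flat output for Excel (the name id1 is only there to avoid clashing with the id index level; writing .xlsx assumes an engine such as openpyxl is installed):
import pandas as pd

df = pd.DataFrame([
    (1, "a", "good"), (1, "a", "good"), (1, "b", "good"), (1, "c", "bad"),
    (2, "a", "good"), (2, "b", "bad"), (3, "a", "none")],
    columns=["id", "type", "eval"])

df1 = (df.groupby(["id", "type"])['eval']
         .value_counts()
         .unstack(fill_value=0)
         .rename_axis(None, axis=1))

# Total count per (id, type) group; 'id1' avoids a clash with the 'id' level.
df1.insert(0, 'id1', df1.sum(axis=1))

# Flatten the index so every row carries its id and type, then export.
df1.reset_index().to_excel('file.xlsx', index=False)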

sum values of columns starting with the same string in pandas dataframe

I have a dataframe with about 100 columns that looks like this:
Id Economics-1 English-107 English-2 History-3 Economics-zz Economics-2 \
0 56 1 1 0 1 0 0
1 11 0 0 0 0 1 0
2 6 0 0 1 0 0 1
3 43 0 0 0 1 0 1
4 14 0 1 0 0 1 0
Histo Economics-51 Literature-re Literatureu4
0 1 0 1 0
1 0 0 0 1
2 0 0 0 0
3 0 1 1 0
4 1 0 0 0
My goal is to keep only the global categories -- Economics, English, History, Literature -- and write the sum of the values of their components, respectively, into this dataframe. For instance, "English" would be the sum of "English-107" and "English-2":
Id Economics English History Literature
0 56 1 1 2 1
1 11 1 0 0 1
2 6 0 1 1 0
3 43 2 0 1 1
4 14 0 1 1 0
For this purpose, I have tried two methods. First method:
df = pd.read_csv(file_path, sep='\t')
df['History'] = df.loc[df[df.columns[pd.Series(df.columns).str.startswith('History')]].sum(axes=1)]
Second method:
df = pd.read_csv(file_path, sep='\t')
filter_col = [col for col in list(df) if col.startswith('History')]
df['History'] = 0 # initialize value, otherwise throws KeyError
for c in df[filter_col]:
    df['History'] = df[filter_col].sum(axes=1)
    print df['History', df[filter_col]]
However, both give the error:
TypeError: 'DataFrame' objects are mutable, thus they cannot be
hashed
My question is either: how can I debug this error or is there another solution for my problem. Notice that I have a rather large dataframe with about 100 columns and 400000 rows, so I'm looking for an optimized solution, like using loc in pandas.
I'd suggest that you do something different, which is to perform a transpose, groupby the prefix of the rows (your original columns), sum, and transpose again.
Consider the following:
df = pd.DataFrame({
    'a_a': [1, 2, 3, 4],
    'a_b': [2, 3, 4, 5],
    'b_a': [1, 2, 3, 4],
    'b_b': [2, 3, 4, 5],
})
Now
[s.split('_')[0] for s in df.T.index.values]
is the prefix of the columns. So
>>> df.T.groupby([s.split('_')[0] for s in df.T.index.values]).sum().T
a b
0 3 3
1 5 5
2 7 7
3 9 9
does what you want.
In your case, make sure to split using the '-' character.
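Concretely, a small sketch of the transpose-groupby idea adapted to the question's '-' separator (columns like Histo or Literatureu4, which lack a clean separator, would still need special handling):
import pandas as pd

df = pd.DataFrame({
    'Economics-1': [1, 0], 'Economics-2': [0, 1],
    'English-107': [1, 0], 'English-2': [0, 0],
})

# Group the transposed rows (the original columns) by the prefix before '-'.
print(df.T.groupby([c.split('-')[0] for c in df.columns]).sum().T)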
You can use the following to create the sum of columns starting with a specific name (the '^' anchors the match to the start of the name):
df['Economics'] = df[list(df.filter(regex='^Economics'))].sum(axis=1)
Using DSM's brilliant idea:
from __future__ import print_function
import pandas as pd
categories = set(['Economics', 'English', 'Histo', 'Literature'])
def correct_categories(cols):
    return [cat for col in cols for cat in categories if col.startswith(cat)]
df = pd.read_csv('data.csv', sep=r'\s+', index_col='Id')
#print(df)
print(df.groupby(correct_categories(df.columns),axis=1).sum())
Output:
Economics English Histo Literature
Id
56 1 1 2 1
11 1 0 0 1
6 1 1 0 0
43 2 0 1 1
14 1 1 1 0
Here is another version, which takes care of the "Histo/History" problem:
from __future__ import print_function
import pandas as pd
#categories = set(['Economics', 'English', 'Histo', 'Literature'])
#
# mapping: common starting pattern: desired name
#
categories = {
    'Histo': 'History',
    'Economics': 'Economics',
    'English': 'English',
    'Literature': 'Literature'
}
def correct_categories(cols):
    return [categories[cat] for col in cols for cat in categories.keys() if col.startswith(cat)]
df = pd.read_csv('data.csv', sep=r'\s+', index_col='Id')
#print(df.columns, len(df.columns))
#print(correct_categories(df.columns), len(correct_categories(df.columns)))
#print(df.groupby(pd.Index(correct_categories(df.columns)),axis=1).sum())
rslt = df.groupby(correct_categories(df.columns),axis=1).sum()
print(rslt)
print('History\n', rslt['History'])
Output:
Economics English History Literature
Id
56 1 1 2 1
11 1 0 0 1
6 1 1 0 0
43 2 0 1 1
14 1 1 1 0
History
Id
56 2
11 0
6 0
43 1
14 1
Name: History, dtype: int64
PS: You may want to add any missing categories to the categories map/dictionary; see the guard sketch below.
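For example, a hypothetical guard (not from the original answer) that fails loudly when a column has no mapped category:
def correct_categories(cols):
    # Map each column to its category, raising on unknown prefixes.
    result = []
    for col in cols:
        matches = [name for pat, name in categories.items() if col.startswith(pat)]
        if not matches:
            raise KeyError('no category defined for column %r' % col)
        result.append(matches[0])
    return result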

How to reorder indexed rows based on a list in Pandas data frame

I have a data frame that looks like this:
company Amazon Apple Yahoo
name
A 0 130 0
C 173 0 0
Z 0 0 150
It was created using this code:
import pandas as pd
df = pd.DataFrame({'name': ['A', 'Z', 'C'],
                   'company': ['Apple', 'Yahoo', 'Amazon'],
                   'height': [130, 150, 173]})
df = df.pivot(index="name", columns="company", values="height").fillna(0)
What I want to do is to sort the rows (indexed by name) according to a predefined list:
["Z", "C", "A"]
Resulting in this:
company Amazon Apple Yahoo
name
Z 0 0 150
C 173 0 0
A 0 130 0
How can I achieve that?
You could put the index into the predefined order using reindex, like:
In [14]: df.reindex(["Z", "C", "A"])
Out[14]:
company Amazon Apple Yahoo
Z 0 0 150
C 173 0 0
A 0 130 0
However, since this particular order happens to be reverse alphabetical, you could also use sort_index(ascending=False):
In [12]: df.sort_index(ascending=False)
Out[12]:
company Amazon Apple Yahoo
name
Z 0 0 150
C 173 0 0
A 0 130 0
As pointed out below, you need to assign the result back to a variable:
In [13]: df = df.sort_index(ascending=False)
We could also use loc:
lst = ["Z", "C", "A"]
df = df.loc[lst]
Output:
company Amazon Apple Yahoo
name
Z 0 0 150
C 173 0 0
A 0 130 0
Note that if there are values in lst that do not exist in df.index (e.g. if lst=['Z','C','A','D']), then loc throws a KeyError (whereas reindex creates a new row 'D' full of NaNs).
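A quick sketch of that difference, reusing the pivoted df from the question:
import pandas as pd

df = pd.DataFrame({'name': ['A', 'Z', 'C'],
                   'company': ['Apple', 'Yahoo', 'Amazon'],
                   'height': [130, 150, 173]})
df = df.pivot(index="name", columns="company", values="height").fillna(0)

lst = ["Z", "C", "A", "D"]   # 'D' is not in df.index

print(df.reindex(lst))       # adds a row 'D' filled with NaN
# df.loc[lst]                # raises KeyError because of 'D'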
MultiIndex
If df has a MultiIndex, such as:
C3
C1 C2
2 eleven 0
ten 1
twelve 2
1 eleven 3
ten 4
twelve 5
and if you want to sort the second level as ten, eleven, twelve, then using loc:
out = df.loc[:, ['ten', 'eleven', 'twelve'], :]
Output:
C3
C1 C2
2 eleven 0
ten 1
twelve 2
1 eleven 3
ten 4
twelve 5
and for both levels:
out = df.loc[[1, 2], ['ten', 'eleven', 'twelve'], :]
Output:
C3
C1 C2
1 ten 4
eleven 3
twelve 5
2 ten 1
eleven 0
twelve 2
IMHO, especially if you want to sort by multiple values, the best solution is:
df = df.set_index("C1")
df = df.sort_values(["C1", "C2"])
df.reset_index(inplace=True)
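An alternative sketch (not from the original answers): if the data starts out flat, an ordered categorical on C2 lets sort_values apply the custom ten/eleven/twelve order directly:
import pandas as pd

df = pd.DataFrame({'C1': [2, 2, 2, 1, 1, 1],
                   'C2': ['eleven', 'ten', 'twelve'] * 2,
                   'C3': range(6)})

# An ordered categorical makes sort_values follow the custom order.
df['C2'] = pd.Categorical(df['C2'],
                          categories=['ten', 'eleven', 'twelve'], ordered=True)
print(df.sort_values(['C1', 'C2']).set_index(['C1', 'C2']))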
