In Python, I have the following Pandas dataframe:
Factor Value
0 a 1.2
1 b 3.4
2 b 4.5
3 b 5.6
4 c 1.3
5 d 4.6
I would like to organize this where:
unique row identifiers (the factor col) become columns
Their respective values remain under the created columns
The factor values are not in any organized order.
Target:
     A    B    C    D
0  1.2  3.4  1.3  4.6
1       4.5
2       5.6
Use set_index and unstack together with groupby.cumcount:
df.set_index(['Factor', df.groupby('Factor').cumcount()])['Value'].unstack(0)
Output:
Factor a b c d
0 1.2 3.4 1.3 4.6
1 NaN 4.5 NaN NaN
2 NaN 5.6 NaN NaN
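For reference, a minimal runnable sketch of the whole transformation, rebuilding the example frame from above:
import pandas as pd

df = pd.DataFrame({'Factor': ['a', 'b', 'b', 'b', 'c', 'd'],
                   'Value': [1.2, 3.4, 4.5, 5.6, 1.3, 4.6]})

# cumcount() numbers the rows within each Factor group (0, 1, 2, ...),
# giving every value a unique (Factor, position) pair to pivot on.
out = df.set_index(['Factor', df.groupby('Factor').cumcount()])['Value'].unstack(0)
print(out)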
Related
I have a data frame:
A B C D E
12 4.5 6.1 BUY NaN
12 BUY BUY 5.6 NaN
BUY 4.5 6.1 BUY NaN
12 4.5 6.1 0 NaN
I want to count the number of times 'BUY' appears in each row. Intended result:
A B C D E score
12 4.5 6.1 BUY NaN 1
12 BUY BUY 5.6 NaN 2
15 4.5 6.1 BUY NaN 1
12 4.5 6.1 0 NaN 0
I have tried the following but it simply gives 0 for all the rows:
df['score'] = df[df == 'BUY'].sum(axis=1)
Note that BUY can only appear in B, C, D, E columns.
I tried to find a solution online but found none.
A little help would be appreciated. Thanks!
You can compare and then sum:
df['score'] = (df[['B','C','D','E']] == 'BUY').sum(axis=1)
This sums up all the booleans and you get the correct result.
When you do df[df == 'BUY'], you are just replacing anything that is not 'BUY' with np.nan, so taking the sum over axis=1 does not work: all that is left is np.nan and the 'BUY' strings, and the numeric sum of each row is therefore 0.
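A runnable sketch of this approach, rebuilding the example frame (restricted to columns B through E, per the note in the question):
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [12, 12, 'BUY', 12],
                   'B': [4.5, 'BUY', 4.5, 4.5],
                   'C': [6.1, 'BUY', 6.1, 6.1],
                   'D': ['BUY', 5.6, 'BUY', 0],
                   'E': [np.nan] * 4})

# The comparison yields a boolean frame; True sums as 1 and False as 0.
df['score'] = (df[['B', 'C', 'D', 'E']] == 'BUY').sum(axis=1)
print(df)  # scores: 1, 2, 1, 0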
Or you could use apply with list.count:
df['score'] = df.apply(lambda x: x.tolist().count('BUY'), axis=1)
print(df)
Output:
A B C D E score
0 12 4.5 6.1 BUY NaN 1
1 12 BUY BUY 5.6 NaN 2
2 BUY 4.5 6.1 BUY NaN 2
3 12 4.5 6.1 0 NaN 0
Try using apply with a lambda over axis=1. This picks up one row at a time as a Series. You can use the condition row == 'BUY' to filter the row and then count the number of 'BUY' entries with len():
df['score'] = df.apply(lambda row: len(row[row == 'BUY']), axis=1)
print(df)
A B C D E score
0 12 4.5 6.1 BUY NaN 1
1 12 BUY BUY 5.6 NaN 2
2 BUY 4.5 6.1 BUY NaN 2
3 12 4.5 6.1 0 NaN 0
NumPy's count_nonzero also works directly on the boolean mask:
import numpy as np
df['score'] = np.count_nonzero(df == 'BUY', axis=1)
Output:
A B C D E score
0 12 4.5 6.1 BUY NaN 1
1 12 BUY BUY 5.6 NaN 2
2 BUY 4.5 6.1 BUY NaN 2
3 12 4.5 6.1 0 NaN 0
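One caveat: row 2 of the example input has 'BUY' in column A as well, which is why the apply- and count_nonzero-based answers above report a score of 2 for that row while the intended result shows 1. If, per the question's note, only columns B through E should count, restrict the mask first:
import numpy as np

# Count 'BUY' only in the columns where it can meaningfully appear.
df['score'] = np.count_nonzero(df[['B', 'C', 'D', 'E']] == 'BUY', axis=1)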
Using pandas, I like to use groupby with an aggregate function, e.g. mean,
and then put the results back into the original dataframe, but shifted into the next group rather than the group itself. How can I do this in a vectorized way?
I have a pandas dataframe like this:
import pandas as pd

data = {'Group': ['A','A','B','B','B','B','C','C','D','D'],
        'Value': [1.1,1.3,9.1,9.2,9.5,9.4,6.2,6.4,2.2,2.3]}
df = pd.DataFrame(data, columns=['Group','Value'])
print(df)
Group Value
0 A 1.1
1 A 1.3
2 B 9.1
3 B 9.2
4 B 9.5
5 B 9.4
6 C 6.2
7 C 6.4
8 D 2.2
9 D 2.3
I would like to get this, where each group has the mean value of the previous group:
Group Value
0 A NaN
1 A NaN
2 B 1.2
3 B 1.2
4 B 1.2
5 B 1.2
6 C 9.3
7 C 9.3
8 D 6.3
9 D 6.3
I tried this, but it computes the mean within each group, without the shift to the next group:
df.groupby('Group')['Value'].transform('mean')
Easy, use map on a groupby result:
df['Value'] = df['Group'].map(df.groupby('Group')['Value'].mean().shift())
df
Group Value
0 A NaN
1 A NaN
2 B 1.2
3 B 1.2
4 B 1.2
5 B 1.2
6 C 9.3
7 C 9.3
8 D 6.3
9 D 6.3
How It Works
Get the mean
df.groupby('Group')['Value'].mean()
Group
A 1.20
B 9.30
C 6.30
D 2.25
Name: Value, dtype: float64
Shift it down by 1
df.groupby('Group')['Value'].mean().shift()
Group
A NaN
B 1.2
C 9.3
D 6.3
Name: Value, dtype: float64
Map it back.
df['Group'].map(df.groupby('Group')['Value'].mean().shift())
0 NaN
1 NaN
2 1.2
3 1.2
4 1.2
5 1.2
6 9.3
7 9.3
8 6.3
9 6.3
Name: Group, dtype: float64
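Putting it together, a self-contained sketch of the map-and-shift technique with the data from the question:
import pandas as pd

df = pd.DataFrame({'Group': ['A','A','B','B','B','B','C','C','D','D'],
                   'Value': [1.1,1.3,9.1,9.2,9.5,9.4,6.2,6.4,2.2,2.3]})

# Mean per group, shifted down one group, then broadcast back to the rows.
prev_group_mean = df.groupby('Group')['Value'].mean().shift()
df['Value'] = df['Group'].map(prev_group_mean)
print(df)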
You can compute the aggregated GroupBy.mean of each group, shift it with Series.shift, and take advantage of pandas index alignment:
df.set_index('Group').assign(value=df.groupby('Group')['Value'].mean().shift()).reset_index()
Group Value value
0 A 1.1 NaN
1 A 1.3 NaN
2 B 9.1 1.2
3 B 9.2 1.2
4 B 9.5 1.2
5 B 9.4 1.2
6 C 6.2 9.3
7 C 6.4 9.3
8 D 2.2 6.3
9 D 2.3 6.3
Suppose I have two data frames, for example:
df1:
x y
0 1.1 2.1
1 3.1 5.1
df2:
x y
0 0.0 2.2
1 1.1 2.1
2 3.0 6.6
3 3.1 5.1
4 0.2 8.8
and I want df2 reordered so that the rows it has in common with df1 come first, in the same order as in df1, with the non-matching rows kept after them. How would I do that using pandas, or maybe something else?
desired output:
new_df:
x y
0 1.1 2.1
1 3.1 5.1
2 0.0 2.2
3 3.0 6.6
4 0.2 8.8
For rows 2-4 I don't care about the order, as long as the matching rows follow the same order as df1; I want the matching rows to end up at equal indexes in df1 and df2.
any way to do this?
Sorry if the way I submitted this is wrong.
Thanks, guys!
Just use merge with how='right'; matched rows sort first by default, and indicator=True shows where each row came from:
df1.merge(df2, indicator=True, how='right')
Out[354]:
x y _merge
0 1.1 2.1 both
1 3.1 5.1 both
2 0.0 2.2 right_only
3 3.0 6.6 right_only
4 0.2 8.8 right_only
Use pd.concat with drop_duplicates:
pd.concat([df1,df2]).drop_duplicates().reset_index(drop=True)
Output:
x y
0 1.1 2.1
1 3.1 5.1
2 0.0 2.2
3 3.0 6.6
4 0.2 8.8
Look at the .combine_first and .update methods:
df1.combine_first(df2)
They are explained in the pandas documentation.
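For what it's worth, a small sketch of what combine_first does with the frames from this question: it aligns on the index and keeps df1's values wherever df1 has them (indexes 0-1), falling back to df2 elsewhere (indexes 2-4), so it patches values rather than reordering rows:
import pandas as pd

df1 = pd.DataFrame({'x': [1.1, 3.1], 'y': [2.1, 5.1]})
df2 = pd.DataFrame({'x': [0.0, 1.1, 3.0, 3.1, 0.2],
                    'y': [2.2, 2.1, 6.6, 5.1, 8.8]})

# df2's rows at indexes 0-1 are shadowed by df1's, so the result
# differs from the target output above.
print(df1.combine_first(df2))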
I have two large square matrices ( in two CSV files). The two matrices may have a few different labels and different dimensions.
I want to add these two matrices and retain all labels. How do I do this in Python?
Example:
{a, b, c ... e} are labels.
        a    b    c    d              a    e
    a  1.2  1.3  1.4  1.5        a  9.1  9.2
X = b  2.1  2.2  2.3  2.4    Y = e  8.1  8.2
    c  3.3  3.4  3.5  3.6
    d  4.2  4.3  4.4  4.5

            a      b    c    d    e
      a  1.2+9.1  1.3  1.4  1.5  9.2
X+Y = b  2.1      2.2  2.3  2.4  0
      c  3.3      3.4  3.5  3.6  0
      d  4.2      4.3  4.4  4.5  0
      e  8.1      0    0    0    8.2
If someone wants to see the files (matrices), they are here.
Trying the method suggested by @piRSquared:
import pandas as pd
X = pd.read_csv('30203_Transpose.csv')
Y = pd.read_csv('62599_1999psCSV.csv')
Z = X.add(Y, fill_value=0).fillna(0)
print(Z)
Z -> 467 rows x 661 columns
The resulting matrix should be square, too.
This approach also loses the row headers (they become 0, 1, 2, ...; they should be 10010, 10071, 10107, 1013, ...):
10010 10071 10107 1013 ....
0 0 0 0.01705 0.0439666659
1 0 0 0 0
2 0 0 0 0.0382000022
3 0.0663666651 0 0 0.0491333343
4 0 0 0 0
5 0.0208000001 0 0 0.1275333315
...
What should I be doing?
Use the add method with the parameter fill_value=0:
X.add(Y, fill_value=0).fillna(0)
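For illustration, a minimal sketch of the technique on the example matrices. For the CSV case, the lost row labels most likely come from read_csv not being told which column holds them; assuming the first column of each file contains the labels, passing index_col=0 should keep the result square and labeled:
import pandas as pd

X = pd.DataFrame([[1.2, 1.3, 1.4, 1.5],
                  [2.1, 2.2, 2.3, 2.4],
                  [3.3, 3.4, 3.5, 3.6],
                  [4.2, 4.3, 4.4, 4.5]],
                 index=list('abcd'), columns=list('abcd'))
Y = pd.DataFrame([[9.1, 9.2],
                  [8.1, 8.2]], index=list('ae'), columns=list('ae'))

# add() aligns on the union of row and column labels; fill_value=0 treats
# a label present in only one matrix as 0 there, and fillna(0) clears the
# cells that are absent from both.
Z = X.add(Y, fill_value=0).fillna(0)
print(Z)  # a square 5x5 matrix labeled a, b, c, d, e

# For the CSV files (assuming the labels are in the first column):
# X = pd.read_csv('30203_Transpose.csv', index_col=0)
# Y = pd.read_csv('62599_1999psCSV.csv', index_col=0)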
I'm new to Python and Pandas so there might be a simple solution which I don't see.
I have a number of discontinuous datasets which look like this:
ind A B C
0 0.0 1 3
1 0.5 4 2
2 1.0 6 1
3 3.5 2 0
4 4.0 4 5
5 4.5 3 3
I now look for a solution to get the following:
ind A B C
0 0.0 1 3
1 0.5 4 2
2 1.0 6 1
3 1.5 NaN NaN
4 2.0 NaN NaN
5 2.5 NaN NaN
6 3.0 NaN NaN
7 3.5 2 0
8 4.0 4 5
9 4.5 3 3
The problem is that the gap in A varies from dataset to dataset in both position and length...
set_index and reset_index are your friends.
df = DataFrame({"A":[0,0.5,1.0,3.5,4.0,4.5], "B":[1,4,6,2,4,3], "C":[3,2,1,0,5,3]})
First move column A to the index:
In [64]: df.set_index("A")
Out[64]:
B C
A
0.0 1 3
0.5 4 2
1.0 6 1
3.5 2 0
4.0 4 5
4.5 3 3
Then reindex with a new index; here the missing data is filled in with NaNs. We use the Index object since we can name it, which will be used in the next step.
In [66]: new_index = Index(arange(0,5,0.5), name="A")
In [67]: df.set_index("A").reindex(new_index)
Out[67]:
B C
0.0 1 3
0.5 4 2
1.0 6 1
1.5 NaN NaN
2.0 NaN NaN
2.5 NaN NaN
3.0 NaN NaN
3.5 2 0
4.0 4 5
4.5 3 3
Finally move the index back to the columns with reset_index. Since we named the index, it all works magically:
In [69]: df.set_index("A").reindex(new_index).reset_index()
Out[69]:
A B C
0 0.0 1 3
1 0.5 4 2
2 1.0 6 1
3 1.5 NaN NaN
4 2.0 NaN NaN
5 2.5 NaN NaN
6 3.0 NaN NaN
7 3.5 2 0
8 4.0 4 5
9 4.5 3 3
Using the answer by EdChum below, I created the following function:
import numpy as np
import pandas as pd

def fill_missing_range(df, field, range_from, range_to, range_step=1, fill_with=0):
    # Right-merge onto the complete range, then restore order and fill the gaps.
    full = pd.DataFrame({field: np.arange(range_from, range_to, range_step)})
    return (df.merge(full, on=field, how='right')
              .sort_values(by=field).reset_index(drop=True).fillna(fill_with))
Example usage (note that np.arange excludes the endpoint, so pass one step past the last value you want):
fill_missing_range(df, 'A', 0.0, 5.0, 0.5, np.nan)
In this case I generate a new dataframe covering the full range of A, right-merge it onto your original df, and then re-sort it:
In [177]:
df.merge(how='right', on='A', right=pd.DataFrame({'A': np.arange(df.iloc[0]['A'], df.iloc[-1]['A'] + 0.5, 0.5)})).sort_values(by='A').reset_index().drop(['index'], axis=1)
Out[177]:
A B C
0 0.0 1 3
1 0.5 4 2
2 1.0 6 1
3 1.5 NaN NaN
4 2.0 NaN NaN
5 2.5 NaN NaN
6 3.0 NaN NaN
7 3.5 2 0
8 4.0 4 5
9 4.5 3 3
So in the general case you can adjust the arange function, which takes start, end, and step values; note I added 0.5 to the end because such ranges are half-open (the end value is excluded).
A more general method could be like this:
In [197]:
df.set_index('A').reindex(np.arange(df['A'].iloc[0], df['A'].iloc[-1] + 0.5, 0.5)).reset_index()
Out[197]:
A B C
0 0.0 1 3
1 0.5 4 2
2 1.0 6 1
3 1.5 NaN NaN
4 2.0 NaN NaN
5 2.5 NaN NaN
6 3.0 NaN NaN
7 3.5 2 0
8 4.0 4 5
9 4.5 3 3
Here we move column A into the index, reindex it against the arange values (the index name is preserved), and then reset the index to get A back as a column.
This question was asked a long time ago, but I have a simple solution that's worth mentioning. You can simply assign NumPy's NaN to the cells that should be missing. For instance:
import numpy as np
df.loc[i, j] = np.nan
will do the trick (where i is the row label and j the column name).