I have two large square matrices (in two CSV files). The two matrices may have a few different labels and different dimensions.
I want to add the two matrices and retain all labels. How do I do this in Python?
Example:
{a, b, c ... e} are labels.
         a    b    c    d                 a    e
X =  a  1.2  1.3  1.4  1.5      Y =  a  9.1  9.2
     b  2.1  2.2  2.3  2.4           e  8.1  8.2
     c  3.3  3.4  3.5  3.6
     d  4.2  4.3  4.4  4.5

              a        b    c    d    e
X + Y =  a  1.2+9.1  1.3  1.4  1.5  9.2
         b  2.1      2.2  2.3  2.4  0
         c  3.3      3.4  3.5  3.6  0
         d  4.2      4.3  4.4  4.5  0
         e  8.1      0    0    0    8.2
If someone wants to see the files (matrices), they are here.
Update: trying the method suggested by @piRSquared:
import pandas as pd

X = pd.read_csv('30203_Transpose.csv')
Y = pd.read_csv('62599_1999psCSV.csv')
Z = X.add(Y, fill_value=0).fillna(0)
print(Z)
Z comes out as 467 rows x 661 columns, but the resulting matrix should be square. This approach also loses the row labels: they become 0, 1, 2, ... when they should be 10010, 10071, 10107, 1013, ...
   10010         10071  10107    1013          ...
0  0             0      0.01705  0.0439666659
1  0             0      0        0
2  0             0      0        0.0382000022
3  0.0663666651  0      0        0.0491333343
4  0             0      0        0
5  0.0208000001  0      0        0.1275333315
...
What should I be doing?
Use the add method with the parameter fill_value=0:
X.add(Y, fill_value=0).fillna(0)
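To keep the labels, the row labels need to be read in as the index rather than as data; otherwise read_csv assigns a default 0, 1, 2, ... index and the two matrices are aligned by position instead of by label. A minimal sketch, assuming the first column of each CSV holds the row labels (the file names are the ones from the question):

import pandas as pd

# Read the first column of each file as the row-label index
X = pd.read_csv('30203_Transpose.csv', index_col=0)
Y = pd.read_csv('62599_1999psCSV.csv', index_col=0)

# Add on the union of row and column labels; a label missing from one
# matrix is treated as 0, and cells missing from both are filled with 0
Z = X.add(Y, fill_value=0).fillna(0)

# The labels survive, so they can be written back out with the result
Z.to_csv('sum.csv')

Because the rows now align by label, the union of labels is the same on both axes and the result should come out square again.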
Related
For each nth row I have to check whether rows n+1 to n+3 fall within the range (nth row value) - 0.5 to (nth row value) + 0.5, and then AND (&) the three results.
A result
0 1.1 1 # 1.2, 1.3 and 1.5 are in the range 0.6 to 1.6 -> (1 & 1 & 1)
1 1.2 0 # 1.3 and 1.5 are in the range 0.7 to 1.7, but 2.0 is not -> (1 & 1 & 0)
2 1.3 0 # 1.5 and 1.0 are in the range 0.8 to 1.8, but 2.0 is not -> (1 & 0 & 1)
3 1.5
4 2.0
5 1.0
6 2.5
7 1.8
8 4.0
9 4.2
10 4.5
11 3.9
import pandas as pd

df = pd.DataFrame({
    'A': [1.1, 1.2, 1.3, 1.9, 2, 1, 2.5, 1.8, 4, 4.2, 4.5, 3.9]
})
I have done some research on the site, but couldn't find the exact syntax. I tried using the rolling function to take 3 rows, the between function to check the range, and then ANDing the results. Could you please help?
s = pd.Series([1, 2, 3, 4])
s.rolling(2).between(s-1,s+1)
which raises this error:
AttributeError: 'Rolling' object has no attribute 'between'
You can also achieve the result without using rolling() while still using .between(), as follows:
df['result'] = (
(df['A'].shift(-1).between(df['A'] - 0.5, df['A'] + 0.5)) &
(df['A'].shift(-2).between(df['A'] - 0.5, df['A'] + 0.5)) &
(df['A'].shift(-3).between(df['A'] - 0.5, df['A'] + 0.5))
).astype(int)
Result:
print(df)
A result
0 1.1 1
1 1.2 0
2 1.3 0
3 1.5 0
4 2.0 0
5 1.0 0
6 2.5 0
7 1.8 0
8 4.0 1
9 4.2 0
10 4.5 0
11 3.9 0
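If the look-ahead window ever needs to change, the same shift-and-between logic can be built in a loop over the offsets (a sketch, not from the original answer; the window size of 3 is just the value from the question):

window = 3  # number of following rows to check against the current value
checks = [df['A'].shift(-i).between(df['A'] - 0.5, df['A'] + 0.5)
          for i in range(1, window + 1)]
df['result'] = pd.concat(checks, axis=1).all(axis=1).astype(int)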
Rolling windows tend to be quite slow in pandas. One quick alternative is to generate a dataframe with the window values per row:
df_temp = pd.concat([df['A'].shift(i) for i in range(-1, 2)], axis=1)
df_temp
A A A
0 1.2 1.1 NaN
1 1.3 1.2 1.1
2 1.9 1.3 1.2
3 2.0 1.9 1.3
4 1.0 2.0 1.9
5 2.5 1.0 2.0
6 1.8 2.5 1.0
7 4.0 1.8 2.5
8 4.2 4.0 1.8
9 4.5 4.2 4.0
10 3.9 4.5 4.2
11 NaN 3.9 4.5
Then you can check per row if the value is in the desired range:
df['result'] = df_temp.apply(lambda x: (x - x.iloc[0]).between(-0.5, 0.5), axis=1).all(axis=1).astype(int)
A result
0 1.1 0
1 1.2 1
2 1.3 0
3 1.9 0
4 2.0 0
5 1.0 0
6 2.5 0
7 1.8 0
8 4.0 0
9 4.2 1
10 4.5 0
11 3.9 0
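The row-wise apply can also be avoided by broadcasting the comparison over the whole window frame at once (a sketch built on the same df_temp; whether this is actually faster will depend on the data):

# Subtract the reference column (the shift(-1) column, i.e. the first column
# of df_temp) from every column, then test all absolute differences per row
deltas = df_temp.sub(df_temp.iloc[:, 0], axis=0)
df['result'] = deltas.abs().le(0.5).all(axis=1).astype(int)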
I tried converting the values in some columns of a DataFrame of floats to integers by using round then astype. However, the values still contained decimal places. What is wrong with my code?
import numpy as np
import pandas as pd

nums = np.arange(1, 11)
arr = nums.reshape((2, 5))
df = pd.DataFrame(arr)
df += 0.1
df
Original df:
0 1 2 3 4
0 1.1 2.1 3.1 4.1 5.1
1 6.1 7.1 8.1 9.1 10.1
Rounding and then converting to int:
df.iloc[:, 2:] = df.iloc[:, 2:].round()
df.iloc[:, 2:] = df.iloc[:, 2:].astype(int)
df
Output:
0 1 2 3 4
0 1.1 2.1 3.0 4.0 5.0
1 6.1 7.1 8.0 9.0 10.0
Expected output:
0 1 2 3 4
0 1.1 2.1 3 4 5
1 6.1 7.1 8 9 10
The problem is that assignment through .iloc assigns the values but does not change the column dtype. Select the columns by label instead:
l = df.columns[2:]
df[l] = df[l].astype(int)
df
0 1 2 3 4
0 1.1 2.1 3 4 5
1 6.1 7.1 8 9 10
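To confirm the difference, you can inspect the dtypes after the label-based assignment (a quick check; the exact integer width may differ by platform):

print(df.dtypes)
# Expected output, roughly:
# 0    float64
# 1    float64
# 2      int64   (may be int32 on some platforms)
# 3      int64
# 4      int64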
One way to solve this is to use .convert_dtypes():
df.iloc[:, 2:] = df.iloc[:, 2:].round()
df = df.convert_dtypes()
print(df)
output:
0 1 2 3 4
0 1.1 2.1 3 4 5
1 6.1 7.1 8 9 10
It coerces every column of the DataFrame to a better-fitting dtype.
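These are pandas' nullable extension dtypes, which you can verify with a quick check (the dtypes shown are what I would expect; they may vary by pandas version):

print(df.dtypes)
# 0    Float64
# 1    Float64
# 2      Int64
# 3      Int64
# 4      Int64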
I had the same issue and was able to resolve it by converting the numbers to str and cutting off the trailing zeros.
df['converted'] = df['floats'].astype(str)

def cut_zeros(value):
    # Drop a trailing '.0' from the string representation
    if value[-2:] == '.0':
        value = value[:-2]
    return value

df['converted'] = df.apply(lambda row: cut_zeros(row['converted']), axis=1)
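If the goal is only to strip a trailing '.0', a vectorized alternative (a sketch; 'floats' is the column name used above, and regex=True needs a reasonably recent pandas):

df['converted'] = df['floats'].astype(str).str.replace(r'\.0$', '', regex=True)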
Using pandas, I'd like to use groupby with an aggregate function, e.g. mean,
and then put the results back into the original dataframe, but into the next group rather than the group itself. How can I do this in a vectorized way?
I have a pandas dataframe like this:
data = {'Group': ['A','A','B','B','B','B', 'C','C', 'D','D'],
'Value': [1.1,1.3,9.1,9.2,9.5,9.4,6.2,6.4,2.2,2.3]
}
df = pd.DataFrame(data, columns = ['Group','Value'])
print (df)
Group Value
0 A 1.1
1 A 1.3
2 B 9.1
3 B 9.2
4 B 9.5
5 B 9.4
6 C 6.2
7 C 6.4
8 D 2.2
9 D 2.3
I'd like to get this, where each group gets the mean value of the previous group.
Group Value
0 A NaN
1 A NaN
2 B 1.2
3 B 1.2
4 B 1.2
5 B 1.2
6 C 9.3
7 C 9.3
8 D 6.3
9 D 6.3
I tried this, but it is missing the shift to the next group:
df.groupby('Group')['Value'].transform('mean')
Easy, use map on a groupby result:
df['Value'] = df['Group'].map(df.groupby('Group')['Value'].mean().shift())
df
Group Value
0 A NaN
1 A NaN
2 B 1.2
3 B 1.2
4 B 1.2
5 B 1.2
6 C 9.3
7 C 9.3
8 D 6.3
9 D 6.3
How It Works
Get the mean
df.groupby('Group')['Value'].mean()
Group
A 1.20
B 9.30
C 6.30
D 2.25
Name: Value, dtype: float64
Shift it down by 1
df.groupby('Group')['Value'].mean().shift()
Group
A NaN
B 1.2
C 9.3
D 6.3
Name: Value, dtype: float64
Map it back.
df['Group'].map(df.groupby('Group')['Value'].mean().shift())
0 NaN
1 NaN
2 1.2
3 1.2
4 1.2
5 1.2
6 9.3
7 9.3
8 6.3
9 6.3
Name: Group, dtype: float64
You can calculate the aggregated GroupBy.mean of each group, shift it with pd.Series.shift, and take advantage of pandas index alignment:
df.set_index('Group').assign(value = df.groupby('Group').mean().shift()).reset_index()
Group Value value
0 A 1.1 NaN
1 A 1.3 NaN
2 B 9.1 1.2
3 B 9.2 1.2
4 B 9.5 1.2
5 B 9.4 1.2
6 C 6.2 9.3
7 C 6.4 9.3
8 D 2.2 6.3
9 D 2.3 6.3
In Python, I have the following Pandas dataframe:
Factor Value
0 a 1.2
1 b 3.4
2 b 4.5
3 b 5.6
4 c 1.3
5 d 4.6
I would like to organize this so that:
the unique row identifiers (the Factor column) become columns
their respective values sit under the newly created columns
The Factor values are not in any particular order.
Target:
     A    B    C    D
0  1.2  3.4  1.3  4.6
1       4.5
2       5.6
3
4
5
Use set_index with groupby.cumcount, then unstack:
df.set_index(['Factor', df.groupby('Factor').cumcount()])['Value'].unstack(0)
Output:
Factor a b c d
0 1.2 3.4 1.3 4.6
1 NaN 4.5 NaN NaN
2 NaN 5.6 NaN NaN
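An equivalent way to spell the same idea with pivot instead of unstack (a sketch; the helper column name idx is arbitrary):

out = (df.assign(idx=df.groupby('Factor').cumcount())
         .pivot(index='idx', columns='Factor', values='Value'))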
If I have two data frames for an example:
df1:
     x    y
0  1.1  2.1
1  3.1  5.1

df2:
     x    y
0  0.0  2.2
1  1.1  2.1
2  3.0  6.6
3  3.1  5.1
4  0.2  8.8
I want the rows of df2 that also appear in df1 to come first, in the same order as in df1, while keeping the rows that don't match after them. How would I do that using pandas, or maybe something else?
desired output:
new_df:
     x    y
0  1.1  2.1
1  3.1  5.1
2  0.0  2.2
3  3.0  6.6
4  0.2  8.8
For rows 2-4 I don't care about the order, as long as the matching rows follow the same order as df1, i.e. the matching rows end up at the same index positions in df1 and the new frame. Is there any way to do this?
sorry if the way I submitted this is wrong.
thanks guys
Just use merge with indicator=True and how='right', leaving sort at its default:
df1.merge(df2,indicator=True,how='right')
Out[354]:
x y _merge
0 1.1 2.1 both
1 3.1 5.1 both
2 0.0 2.2 right_only
3 3.0 6.6 right_only
4 0.2 8.8 right_only
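If you only need the reordered frame, the indicator column (named _merge by default) can be dropped afterwards, for example:

new_df = df1.merge(df2, indicator=True, how='right').drop(columns='_merge')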
Use pd.concat with drop_duplicates:
pd.concat([df1,df2]).drop_duplicates().reset_index(drop=True)
Output:
x y
0 1.1 2.1
1 3.1 5.1
2 0.0 2.2
3 3.0 6.6
4 0.2 8.8
Look at the .combine_first & .update methods.
df1.combine_first(df2)
They are explained in the pandas documentation.