multiply 2 columns in 2 dfs if they match the column name - python

I have 2 dfs with some shared column names.
I tried this; it worked only when the column names in the national df are not repeated.
out = {}
for col in national.columns:
    for col2 in F.columns:
        if col == col2:
            out[col] = national[col].values * F[col2].values
I tried to use the same code on a df that has repeated column names, but I got the following error: 'shapes (26,33) and (1,26) not aligned: 33 (dim 1) != 1 (dim 0)'. That is because the second df has 33 columns with the same name, and those need to be multiplied element-wise with a single column from the first df.
This code does not work either, as there are repeated column names in urban.columns.
[np.matrix(urban[col].values) * np.matrix(F[col2].values) for col in urban.columns for col2 in F.columns if col == col2]
Reproducible code
df1 = pd.DataFrame({
    'Col1': [1, 2, 1, 2, 3],
    'Col2': [2, 4, 2, 4, 6],
    'Col3': [7, 4, 2, 8, 6]})
df2 = pd.DataFrame({
    'Col1': [1.5, 2.0, 3.0, 5.0, 10.0],
    'Col2': [1.0, 0.0, 4.0, 5.0, 7.0]})

Hopefully the working example below helps. Please provide a minimal reproducible example in your question, with input code and the desired output, like I have provided here. Please see how to ask a good pandas question.
df1 = pd.DataFrame({
    'Product': ['AA', 'AA', 'BB', 'BB', 'BB'],
    'Col1': [1, 2, 1, 2, 3],
    'Col2': [2, 4, 2, 4, 6]})
print(df1)
df2 = pd.DataFrame({
    'FX Rate': [1.5, 2.0, 3.0, 5.0, 10.0]})
print(df2)
df1 = df1.reset_index(drop=True)
df2 = df2.reset_index(drop=True)
for col in ['Col1', 'Col2']:
    df1[col] = df1[col] * df2['FX Rate']
df1
(df1)
Product Col1 Col2
0 AA 1 2
1 AA 2 4
2 BB 1 2
3 BB 2 4
4 BB 3 6
(df2)
FX Rate
0 1.5
1 2.0
2 3.0
3 5.0
4 10.0
Out[1]:
Product Col1 Col2
0 AA 1.5 3.0
1 AA 4.0 8.0
2 BB 3.0 6.0
3 BB 10.0 20.0
4 BB 30.0 60.0

You can't multiply two DataFrames if they have different shapes, but if you want to multiply them anyway then use transpose:
out = {}
for col in national.columns:
    for col2 in F.columns:
        if col == col2:
            out[col] = national[col].values * F[col2].T.values
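As a minimal sketch of the shape issue (national and F are not shown in the question, so the arrays below are made-up stand-ins): element-wise multiplication only works when the shapes broadcast, e.g. (26, 33) against (26, 1), which is why the one-column side has to end up as a column rather than a row.
import numpy as np

block = np.ones((26, 33))   # stand-in for 33 same-named columns of one df
row = np.ones((1, 26))      # stand-in for the single column read as a row
# block * row               # fails: (26, 33) and (1, 26) do not broadcast
ok = block * row.T          # (26, 33) * (26, 1) broadcasts column-wise
print(ok.shape)             # (26, 33)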

You can get the common columns of the 2 dataframes, then multiply the 2 dataframes directly. Then, join the column(s) that exist only in df1 back onto the multiplication result, as follows:
common_cols = df1.columns.intersection(df2.columns)
df1_only_cols = df1.columns.difference(common_cols)
df1_out = df1[df1_only_cols].join(df1[common_cols] * df2[common_cols])
df1 = df1_out.reindex_like(df1)
Demo
df1 = pd.DataFrame({
    'Product': ['AA', 'AA', 'BB', 'BB', 'BB'],
    'Col1': [1, 2, 1, 2, 3],
    'Col2': [2, 4, 2, 4, 6],
    'Col3': [7, 4, 2, 8, 6]})
df2 = pd.DataFrame({
    'Col1': [1.5, 2.0, 3.0, 5.0, 10.0],
    'Col2': [1.0, 0.0, 4.0, 5.0, 7.0]})
common_cols = df1.columns.intersection(df2.columns)
df1_only_cols = df1.columns.difference(common_cols)
df1_out = df1[df1_only_cols].join(df1[common_cols] * df2[common_cols])
df1 = df1_out.reindex_like(df1)
print(df1)
Product Col1 Col2 Col3
0 AA 1.5 2.0 7
1 AA 4.0 0.0 4
2 BB 3.0 8.0 2
3 BB 10.0 20.0 8
4 BB 30.0 42.0 6

A friend of mine sent this solution, which works just as I wanted.
out = urban.copy()
for col in urban.columns:
    for col2 in F.columns:
        if col == col2:
            out.loc[:, col] = urban.loc[:, [col]].values * F.loc[:, [col2]].values
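For what it's worth, a minimal sketch of why the double brackets matter (urban and F are the question's frames; the toy data below is made up): .loc[:, [col]] always returns a 2-D block, so the duplicated name in urban gives an (n, k) array, the single name in F gives an (n, 1) array, and NumPy broadcasting multiplies every duplicate column by F's column.
import numpy as np
import pandas as pd

urban = pd.DataFrame(np.arange(6).reshape(3, 2), columns=['a', 'a'])  # 'a' duplicated
F = pd.DataFrame({'a': [10, 20, 30]})                                 # 'a' appears once

out = urban.copy()
for col in urban.columns.unique():
    if col in F.columns:
        # (3, 2) * (3, 1) broadcasts, multiplying both 'a' columns element-wise
        out.loc[:, col] = urban.loc[:, [col]].values * F.loc[:, [col]].values
print(out)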

Related

Writing a DataFrame to an excel file where items in a list are put into separate cells

Consider a dataframe like pivoted, where replicates of some data are given as lists in a dataframe:
d = {'Compound': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C', 'C'],
'Conc': [1, 0.5, 0.1, 1, 0.5, 0.1, 2, 1, 0.5, 0.1],
'Data': [[100, 90, 80], [50, 40, 30], [10, 9.7, 8],
[20, 15, 10], [3, 4, 5, 6], [100, 110, 80],
[30, 40, 50, 20], [10, 5, 9, 3], [2, 1, 2, 2], [1, 1, 0]]}
df = pd.DataFrame(data=d)
pivoted = df.pivot(index='Conc', columns='Compound', values='Data')
This df can be written to an excel file as such:
with pd.ExcelWriter('output.xlsx') as writer:
    pivoted.to_excel(writer, sheet_name='Sheet1', index_label='Conc')
How can this instead be written where replicate data are given in side-by-side cells? Desired excel file:
Then you need to pivot your data in a slightly different way: first explode the Data column, then deduplicate with groupby.cumcount:
(df.explode('Data')
.assign(n=lambda d: d.groupby(level=0).cumcount())
.pivot(index='Conc', columns=['Compound', 'n'], values='Data')
.droplevel('n', axis=1).rename_axis(columns=None)
)
Output:
A A A B B B B C C C C
Conc
0.1 10 9.7 8 100 110 80 NaN 1 1 0 NaN
0.5 50 40 30 3 4 5 6 2 1 2 2
1.0 100 90 80 20 15 10 NaN 10 5 9 3
2.0 NaN NaN NaN NaN NaN NaN NaN 30 40 50 20
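To actually write this to the spreadsheet, the writer block from the question can be reused on the reshaped frame; a minimal sketch (file and sheet names taken from the question):
out = (df.explode('Data')
       .assign(n=lambda d: d.groupby(level=0).cumcount())
       .pivot(index='Conc', columns=['Compound', 'n'], values='Data')
       .droplevel('n', axis=1).rename_axis(columns=None))
with pd.ExcelWriter('output.xlsx') as writer:
    out.to_excel(writer, sheet_name='Sheet1', index_label='Conc')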
Besides @mozway's answer, just for formatting, you can use:
piv = (df.explode('Data').assign(col=lambda x: x.groupby(level=0).cumcount())
.pivot(index='Conc', columns=['Compound', 'col'], values='Data')
.rename_axis(None))
piv.columns = pd.Index([i if j == 0 else '' for i, j in piv.columns], name='Conc')
piv.to_excel('file.xlsx')

Ordering a dataframe with a key function and multiple columns

I have the following
import pandas as pd
import numpy as np
df = pd.DataFrame({
'col1': ['A', 'A', 'B', np.nan, 'D', 'C'],
'col2': [2, -1, 9, -8, 7, 4],
'col3': [0, 1, 9, 4, 2, 3],
'col4': ['a', 'B', 'c', 'D', 'e', 'F'],
'col5': [2, 1, 9, 8, 7, 4],
'col6': [1.00005,1.00001,-2.12132, -2.12137,1.00003,-2.12135]
})
print(df)
print(df.sort_values(by=['col5']))
print(df.sort_values(by=['col2']))
print(df.sort_values(by='col2', key=lambda col: col.abs() ))
So far so good.
However I would like to order the dataframe by two columns:
First col6 and then col5
However, with the following conditions:
col6 only has to consider 4 decimals (meaning that 1.00005 and 1.00001 should be considered equal)
col6 should be considered in absolute value (meaning 1.00005 counts as less than -2.12132)
So the desired output would be
col1 col2 col3 col4 col5 col6
1 A -1 1 B 1 1.00001
0 A 2 0 a 2 1.00005
4 D 7 2 e 7 1.00003
5 C 4 3 F 4 -2.12135
3 NaN -8 4 D 8 -2.12137
2 B 9 9 c 9 -2.12132
How can I combine the usage of keys with multiple columns?
If you want to use arbitrary conditions on different columns, the easiest (and most efficient) approach is to use numpy.lexsort:
import numpy as np
out = df.iloc[np.lexsort([df['col5'].abs(), df['col6'].round(4)])]
NB: unlike sort_values, with lexsort the key with the highest priority goes last.
If you really want to use sort_values, you can use a custom function that chooses the operation to apply depending on the Series name:
def sorter(s):
    funcs = {
        'col5': lambda s: s.abs(),
        'col6': lambda s: s.round(4)
    }
    return funcs[s.name](s) if s.name in funcs else s

out = df.sort_values(by=['col6', 'col5'], key=sorter)
Output:
col1 col2 col3 col4 col5 col6
5 C 4 3 F 4 -2.12135
3 NaN -8 4 D 8 -2.12137
2 B 9 9 c 9 -2.12132
1 A -1 1 B 1 1.00001
4 D 7 2 e 7 1.00003
0 A 2 0 a 2 1.00005
Provided example
Reading the question and the provided example again, I think you might want:
df.iloc[np.lexsort([df['col5'], np.trunc(df['col6'].abs()*10**4)/10**4])]
Output:
col1 col2 col3 col4 col5 col6
1 A -1 1 B 1 1.00001
0 A 2 0 a 2 1.00005
4 D 7 2 e 7 1.00003
5 C 4 3 F 4 -2.12135
3 NaN -8 4 D 8 -2.12137
2 B 9 9 c 9 -2.12132
round() should not be used to truncate because round(1.00005, 4) = 1.0001.
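A quick sketch of the difference (the exact result of round depends on how 1.00005 is stored as a binary float, whereas truncation is unambiguous):
import numpy as np

x = 1.00005
print(round(x, 4))                  # may round up to 1.0001, depending on the stored float
print(np.trunc(x * 10**4) / 10**4)  # 1.0: truncation simply drops the extra decimals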
Proposed code:
import pandas as pd
import numpy as np
df = pd.DataFrame({
'col1': ['A', 'A', 'B', np.nan, 'D', 'C'],
'col2': [2, -1, 9, -8, 7, 4],
'col3': [0, 1, 9, 4, 2, 3],
'col4': ['a', 'B', 'c', 'D', 'e', 'F'],
'col5': [2, 1, 9, 8, 7, 4],
'col6': [1.00005,1.00001,-2.12132, -2.12137,1.00003,-2.12135]
})
# str(x)[:-1] drops the last (5th) decimal digit, so col6 is compared on 4 decimals only
r = df.sort_values(by=['col6', 'col5'], key=lambda c: c.apply(lambda x: abs(float(str(x)[:-1]))) if c.name == 'col6' else c)
print(r)
Result :
col1 col2 col3 col4 col5 col6
1 A -1 1 B 1 1.00001
0 A 2 0 a 2 1.00005
4 D 7 2 e 7 1.00003
5 C 4 3 F 4 -2.12135
3 NaN -8 4 D 8 -2.12137
2 B 9 9 c 9 -2.12132
Other coding style inspired by Mozway
I have read the inspiring @mozway answer.
Very interesting, but since s is a Series you should use the following script:
import pandas as pd
import numpy as np
df = pd.DataFrame({
'col1': ['A', 'A', 'B', np.nan, 'D', 'C'],
'col2': [2, -1, 9, -8, 7, 4],
'col3': [0, 1, 9, 4, 2, 3],
'col4': ['a', 'B', 'c', 'D', 'e', 'F'],
'col5': [2, 1, 9, 8, 7, 4],
'col6': [1.00005,1.00001,-2.12132, -2.12137,1.00003,-2.12135]
})
def truncate(x):
    s = str(x).split('.')
    s[1] = s[1][:4]
    return '.'.join(s)

def sorter(s):
    funcs = {
        'col5': lambda s: s,
        'col6': lambda s: s.apply(lambda x: abs(float(truncate(x))))
    }
    return funcs[s.name](s) if s.name in funcs else s

out = df.sort_values(by=['col6', 'col5'], key=sorter)
print(out)

Combine 2 dataframes when the dataframes have different sizes

I have 2 dfs; one is
df1 = {'col_1': [3, 2, 1, 0], 'col_2': ['a', 'b', 'c', 'd']}
df2 = {'col_1': [3, 2, 1, 3]}
I want the result as follows
df3 = {'col_1': [3, 2, 1, 3], 'col_2': ['a', 'b', 'c', 'a']}
col_2 of the new df is taken from df1, matched on the value of col_1.
Add the new column by mapping the values from df1 after setting its first column as index:
df3 = df2.copy()
df3['col_2'] = df2['col_1'].map(df1.set_index('col_1')['col_2'])
output:
col_1 col_2
0 3 a
1 2 b
2 1 c
3 3 a
You can do it with merge after converting the dicts to df with pd.DataFrame():
output = pd.DataFrame(df2)
output = output.merge(pd.DataFrame(df1),on='col_1',how='left')
Or in a one-liner:
output = pd.DataFrame(df2).merge(pd.DataFrame(df1),on='col_1',how='left')
Outputs:
col_1 col_2
0 3 a
1 2 b
2 1 c
3 3 a
This could be a simple way of doing it.
# use df1 to create a lookup dictionary
lookup = df1.set_index("col_1").to_dict()["col_2"]
# look up each value from df2's "col_1" in the lookup dict
df2["col_2"] = df2["col_1"].apply(lambda d: lookup[d])

Python Pandas concatenate strings and numbers into one string

I am working with a pandas dataframe and trying to concatenate multiple string and numbers into one string.
This works
df1 = pd.DataFrame({'Col1': ['a', 'b', 'c'], 'Col2': ['a', 'b', 'c']})
df1.apply(lambda x: ', '.join(x), axis=1)
0 a, a
1 b, b
2 c, c
How can I make this work just like df1?
df2 = pd.DataFrame({'Col1': ['a', 'b', 1], 'Col2': ['a', 'b', 1]})
df2.apply(lambda x: ', '.join(x), axis=1)
TypeError: ('sequence item 0: expected str instance, int found', 'occurred at index 2')
Consider the dataframe df
np.random.seed([3,1415])
df = pd.DataFrame(
np.random.randint(10, size=(3, 3)),
columns=list('abc')
)
print(df)
a b c
0 0 2 7
1 3 8 7
2 0 6 8
You can use astype(str) ahead of the lambda
df.astype(str).apply(', '.join, 1)
0 0, 2, 7
1 3, 8, 7
2 0, 6, 8
dtype: object
Using a comprehension
pd.Series([', '.join(l) for l in df.values.astype(str).tolist()], df.index)
0 0, 2, 7
1 3, 8, 7
2 0, 6, 8
dtype: object
In [75]: df2
Out[75]:
Col1 Col2 Col3
0 a a x
1 b b y
2 1 1 2
In [76]: df2.astype(str).add(', ').sum(1).str[:-2]
Out[76]:
0 a, a, x
1 b, b, y
2 1, 1, 2
dtype: object
You have to convert column types to strings.
import pandas as pd
df2 = pd.DataFrame({'Col1': ['a', 'b', 1], 'Col2': ['a', 'b', 1]})
df2.apply(lambda x: ', '.join(x.astype('str')), axis=1)
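For what it's worth, a sketch of the same idea written with agg instead of apply (my variant, not part of the original answers):
import pandas as pd

df2 = pd.DataFrame({'Col1': ['a', 'b', 1], 'Col2': ['a', 'b', 1]})
out = df2.astype(str).agg(', '.join, axis=1)
print(out)
# 0    a, a
# 1    b, b
# 2    1, 1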

Remove columns that have 'N' number of NA values in it - python

Suppose I use df.isnull().sum() and get a count of the NA values in each column of the df dataframe. I want to remove any column that has K or more NA values.
For eg.,
df = pd.DataFrame({'A': [1, 2.1, np.nan, 4.7, 5.6, 6.8],
'B': [0, np.nan, np.nan, 0, 0, 0],
'C': [0, 0, 0, 0, 0, 0.0],
'D': [5, 5, np.nan, np.nan, 5.6, 6.8],
'E': [0,np.nan,np.nan,np.nan,np.nan,np.nan],})
df.isnull().sum()
A 1
B 2
C 0
D 2
E 5
dtype: int64
Suppose I want to remove columns that have 2 or more NA values. How would I approach this problem? My output should be:
df.columns
A,C
Can anybody help me in doing this?
Thanks
Call dropna and pass axis=1 to drop column-wise, and use thresh, which sets the minimum number of non-NaN values a column must have to be kept. To drop columns with K or more NaN values, pass thresh=len(df)-K+1; here K=2, so that is len(df)-1.
In [22]:
df.dropna(axis=1, thresh=len(df)-1)
Out[22]:
A C
0 1.0 0
1 2.1 0
2 NaN 0
3 4.7 0
4 5.6 0
5 6.8 0
If you just want the columns:
In [23]:
df.dropna(axis=1, thresh=len(df)-1).columns
Out[23]:
Index(['A', 'C'], dtype='object')
Or simply mask the counts output against the columns:
In [28]:
df.columns[df.isnull().sum() <2]
Out[28]:
Index(['A', 'C'], dtype='object')
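A minimal sketch tying the thresh formula to K (assuming, as above, that K means "drop columns with K or more NaN values"):
import numpy as np
import pandas as pd

K = 2  # drop columns with K or more NaN values
df = pd.DataFrame({'A': [1, 2.1, np.nan, 4.7, 5.6, 6.8],
                   'B': [0, np.nan, np.nan, 0, 0, 0],
                   'C': [0, 0, 0, 0, 0, 0.0],
                   'D': [5, 5, np.nan, np.nan, 5.6, 6.8],
                   'E': [0, np.nan, np.nan, np.nan, np.nan, np.nan]})
out = df.dropna(axis=1, thresh=len(df) - K + 1)
print(out.columns.tolist())  # ['A', 'C']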
Could do something like:
df = df.reindex(columns=[x for x in df.columns.values if df[x].isnull().sum() < threshold])
Which just builds a list of columns that match your requirement (fewer than threshold nulls), and then uses that list to reindex the dataframe. So if you set threshold to 1:
threshold = 1
df = pd.DataFrame({'A': [1, 2.1, np.nan, 4.7, 5.6, 6.8],
'B': [0, np.nan, np.nan, 0, 0, 0],
'C': [0, 0, 0, 0, 0, 0.0],
'D': [5, 5, np.nan, np.nan, 5.6, 6.8],
'E': ['NA', 'NA', 'NA', 'NA', 'NA', 'NA'],})
df = df.reindex(columns=[x for x in df.columns.values if df[x].isnull().sum() < threshold])
df.count()
Will yield:
C 6
E 6
dtype: int64
The dropna() function has a thresh argument that allows you to give the number of non-NaN values you require, so this would give you your desired output:
df.dropna(axis=1,thresh=5).count()
A 5
C 6
E 6
If you wanted just C & E, you'd have to change thresh to 6 in this case.
