pandas multiply other columns by another - python

I would like to apply a scaling factor on some data:
col1 col2 col3
10 4 5
100 2 3
1000 6 7
Then, I would like the output to be:
col1 col2 col3
10 40 50
100 200 300
1000 6000 7000
I was trying to use a lambda but it kept throwing me errors.

pandas.DataFrame.mul with axis=0
When Pandas operates between a DataFrame and a Series, it aligns the index of the Series with the columns of the DataFrame. We can alter that behavior by using the equivalent operation method, mul, and passing axis=0 to tell Pandas to align the Series index with the DataFrame's row index instead.
df[['col2', 'col3']] = df[['col2', 'col3']].mul(df['col1'], axis=0)
df
col1 col2 col3
0 10 40 50
1 100 200 300
2 1000 6000 7000
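For reference, a minimal self-contained sketch of the same approach; the frame below just reproduces the sample data from the question:
import pandas as pd

df = pd.DataFrame({'col1': [10, 100, 1000],
                   'col2': [4, 2, 6],
                   'col3': [5, 3, 7]})

# mul(..., axis=0) aligns df['col1'] with the row index instead of the columns
df[['col2', 'col3']] = df[['col2', 'col3']].mul(df['col1'], axis=0)
print(df)
#    col1  col2  col3
# 0    10    40    50
# 1   100   200   300
# 2  1000  6000  7000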
A shorter way of doing this:
df.update(df.drop('col1', axis=1).mul(df.col1, axis=0))
in-line
And not in-place; meaning, produce a copy and leave the original alone:
df.assign(**df.drop('col1', axis=1).mul(df.col1, axis=0))
col1 col2 col3
0 10 40 50
1 100 200 300
2 1000 6000 7000
After thought
I was messing around with this completely hacky way of doing it.
[df.get(c).__imul__(df.col1) for c in [*df][1:]];
Super gross as it depends on a side effect of the comprehension and throws the result of the comprehension away.
Please ignore this!

Related

BigQuery is not supporting split function on dataframes to expand column into multiple columns

To insert data into BigQuery, I am loading a CSV file. To create the CSV file I build dataframes and then convert them to CSV. After creating one dataframe, I use the split function to expand some columns (I have 6 columns and I want to expand each column into 21 columns, i.e. 6*21 in total), but when I apply split to more than one column of the dataframe, it gives me an error.
I also tried other methods, like creating multiple dataframes and then joining them with merge/concat, but that didn't work either.
For example, the data in one column is: '11.7_16.1_20.6_25.0_29.4_33.9_38.3_42.7_47.1_51.6_56.0_60.4_64.8_69.3_73.7_78.1_82.5_87.0_91.4_95.8_100.2' and I want to split this into 21 separate columns, and similarly for the other 5 columns.
Any help would be appreciated. I've been trying to solve this for the past 3 days.
I would use a combo of stack/unstack:
out = df.stack().str.split("_", expand=True).unstack()
NB: for the sake of the output, I divided your data into 2 columns (len(df.columns) == 2).
Output:
print(out)
(wide MultiIndex result: the outer column level is the split position 0, 1, 2, ..., the inner level repeats col1/col2, with one split value per cell)
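If you need flat column names for the CSV you load into BigQuery (col1_0, col1_1, ..., rather than a MultiIndex), a sketch along these lines should also work; the frame below is a trimmed stand-in for your data (3 values per cell instead of 21):
import pandas as pd

df = pd.DataFrame({
    'col1': ['11.7_16.1_20.6', '1.0_2.0_3.0'],
    'col2': ['56.0_60.4_64.8', '4.0_5.0_6.0'],
})

# split each column on '_' and give the pieces flat names like col1_0, col1_1, ...
parts = []
for col in df.columns:
    expanded = df[col].str.split('_', expand=True).astype(float)
    expanded.columns = [f'{col}_{i}' for i in expanded.columns]
    parts.append(expanded)

out = pd.concat(parts, axis=1)
print(out)
#    col1_0  col1_1  col1_2  col2_0  col2_1  col2_2
# 0    11.7    16.1    20.6    56.0    60.4    64.8
# 1     1.0     2.0     3.0     4.0     5.0     6.0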

pandas groupby where and else

I have a dataframe like this:
col1 col2
0 a 100
1 a 200
2 a 150
3 b 1000
4 c 400
5 c 200
What I want to do is group by col1 and count the number of occurrences; if the count is equal to or greater than 2, calculate the mean of col2 for those rows, and if not, apply another function. The output should be:
col1 mean
0 a 150
1 b whatever the aggregator function returns
2 c 300
I followed #ansev's solution in pandas groupby count and then conditional mean, but I don't want to replace those rows with NaN; I actually want to replace them with the value returned from another function, like this:
def aggregator(col1, col2):
    return col1 + col2
Please keep in mind that the actual aggregator function is more complicated and depends on other tables; this is just to simplify the question.
I'm not sure this is what you need, but you can resort to apply:
def aggregator(x):
    if len(x) == 1:
        return pd.Series((x['col1'] + x['col2'].astype(str)).values)
    else:
        return pd.Series(x['col2'].mean())

df.groupby('col1').apply(aggregator)
Output:
0
col1
a 150
b b1000
c 300
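For completeness, a runnable version of the snippet above, rebuilding the sample frame from the question:
import pandas as pd

df = pd.DataFrame({'col1': ['a', 'a', 'a', 'b', 'c', 'c'],
                   'col2': [100, 200, 150, 1000, 400, 200]})

def aggregator(x):
    # single-row groups fall through to the custom logic,
    # larger groups are reduced to the mean of col2
    if len(x) == 1:
        return pd.Series((x['col1'] + x['col2'].astype(str)).values)
    return pd.Series(x['col2'].mean())

print(df.groupby('col1').apply(aggregator))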

For each row, select all instances where value is not '0' (in any/all columns)

I have a df that looks something like this:
Col1 Col2 Col3 ColN
0 0 2 1
10 5 0 8
0 0 0 12
Trying to get a sum/mean of all the times a value has not been zero, for each row (and then add it as a 'Sum/Mean' column), to get this output:
Col1 Col2 Col3 ColN Sum
0 0 2 1 2
10 5 0 8 3
0 0 0 12 1
In the df, I'm recording the number of times an event has occurred. I'm trying to get the average number of occurrences or frequency (or, I guess, the number of times a value in a row has not been 0).
Is there some way to apply this dataframe-wide? I have about 2000 rows, and have been hacking away trying to use Counter, but have only managed to count observations for a single row :(
Or maybe I should convert all non-zero numbers to a dummy variable, but then I still don't know how to select and sum?
As yatu suggested,
df.ne(0).sum(1)
does the job. (Note: when I use it to do df['Sum'] = df.ne(0).sum(1), I get a warning message, but I don't really understand the implications)
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
Actually, I still get several rows with zeroes in the new column (not sure why), so I also remove any rows with zeroes afterwards (this is all very ugly, but I have no better idea):
df = df[(df[['Sum']] != 0).all(axis=1)]
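A self-contained sketch of the suggestion, using the sample data from the question; if df is itself a slice of a larger frame, taking an explicit .copy() first is the usual way to avoid the SettingWithCopyWarning:
import pandas as pd

df = pd.DataFrame({'Col1': [0, 10, 0],
                   'Col2': [0, 5, 0],
                   'Col3': [2, 0, 0],
                   'ColN': [1, 8, 12]})

# df = some_bigger_df[some_filter].copy()  # if df came from a slice of another frame

# count the non-zero entries in each row
df['Sum'] = df.ne(0).sum(axis=1)
print(df)
#    Col1  Col2  Col3  ColN  Sum
# 0     0     0     2     1    2
# 1    10     5     0     8    3
# 2     0     0     0    12    1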

comparing two dataframes and finding a unique combination of columns

I have two DataFrames with different sizes and different numbers of columns, for example:
DF1:
index col1 col2 col3
1 AA A12 SH7B
2 Ac DJS 283
3 ZH 28S 48d
DF2:
index col1 col2 col3 col4
2 AA cc2 SH7B hd5
7 Ac DJS 283,dhb re
10 ZH 28S SJE,48d 385d
23 3V4 38D 350,eh4 sm4
44 S3 3YE 032,she 3927
So the indexes are different, and there are some combinations of data in the first dataframe that also appear in the second one; I want to find them. I want to iterate through the rows of the second dataframe, build every combination of values per row (for example, (7, Ac, DJS, 283, re) and (7, Ac, DJS, dhb, re) are two combinations for index 7, since one column holds more than one value), and compare them with the first dataframe's rows, printing any combination that appears in both.
result:
1 Ac DJS 283
2 ZH 28S 48d
thank you
You need to split col3 of data frame 2 first, and then merge it back with data frame 1. To split col3, a common approach is to split and flatten the column while using numpy.repeat to bring the other columns to the same length:
import pandas as pd
import numpy as np
from itertools import chain
# count how many repeats are needed for other columns based on commas
repeats = df2.col3.str.count(",") + 1
# repeat columns except for col3, split and flatten col3 and merge it back with df1
(df2.drop('col3', axis=1).apply(lambda col: np.repeat(col, repeats))
.assign(col3 = list(chain.from_iterable(df2['col3'].str.split(','))))
.merge(df1))
# col1 col2 col4 col3
#0 Ac DJS re 283
#1 ZH 28S 385d 48d
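On pandas 0.25+ you can also use DataFrame.explode to turn the comma-separated col3 into one row per value, which reads a bit shorter; a sketch with a trimmed version of the sample data:
import pandas as pd

df1 = pd.DataFrame({'col1': ['AA', 'Ac', 'ZH'],
                    'col2': ['A12', 'DJS', '28S'],
                    'col3': ['SH7B', '283', '48d']}, index=[1, 2, 3])
df2 = pd.DataFrame({'col1': ['AA', 'Ac', 'ZH'],
                    'col2': ['cc2', 'DJS', '28S'],
                    'col3': ['SH7B', '283,dhb', 'SJE,48d'],
                    'col4': ['hd5', 're', '385d']}, index=[2, 7, 10])

# one row per comma-separated value in col3, then keep only combinations that also exist in df1
exploded = df2.assign(col3=df2['col3'].str.split(',')).explode('col3')
print(exploded.merge(df1))
#   col1 col2 col3  col4
# 0   Ac  DJS  283    re
# 1   ZH  28S  48d  385d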

Grouping by many columns in Pandas

I basically have a dataset that looks as follows
Col1 Col2 Col3 Count
A B 1 50
A B 1 50
A C 20 1
A D 17 2
A E 5 70
A E 15 20
Suppose it is called data. I basically do data.groupby(by=['Col1', 'Col2', 'Col3'], as_index=False, sort=False).sum(), which should give me this:
Col1 Col2 Col3 Count
A B 1 100
A C 20 1
A D 17 2
A E 5 70
A E 15 20
However, this returns an empty dataset, which does have the columns I want but no rows. The only caveat is that the by parameter is computed dynamically rather than fixed (that's because the columns might change, although Count will always be there).
Any ideas on why this could be failing, and how to fix it?
EDIT: Further searching revealed that pandas' groupby removes rows that have NULL in any of the grouped columns. This is a problem for me because every single column might be NULL. Hence, the actual question is: is there any reasonable way to deal with NULLs and still use groupby?
I would love to be corrected here, but I'm not sure there is a clean way to handle missing data. As you noted, Pandas will simply exclude from the groupby any rows that contain NaN values.
You could fill the NaN values with something beyond the range of your data:
data = pd.read_csv("c:/Users/simon/Desktop/data.csv")
data.fillna(-999, inplace=True)
new = data.groupby(by=['Col1', 'Col2', 'Col3'], as_index=False, sort=False).sum()
This is messy because it won't add those values to the correct group for the summation. But there's no real way to group by something that's missing.
Another method might be to fill each column separately with some missing value that is appropriate for that variable.
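Separately, if you are on pandas 1.1 or newer, groupby accepts dropna=False, which keeps NaN keys as their own group instead of dropping those rows; a minimal sketch (the data here is made up to show one NaN key):
import numpy as np
import pandas as pd

data = pd.DataFrame({'Col1': ['A', 'A', np.nan],
                     'Col2': ['B', 'B', 'C'],
                     'Col3': [1, 1, 20],
                     'Count': [50, 50, 1]})

# dropna=False keeps the row whose Col1 is NaN as its own group
new = data.groupby(by=['Col1', 'Col2', 'Col3'], as_index=False,
                   sort=False, dropna=False).sum()
print(new)
#   Col1 Col2  Col3  Count
# 0    A    B     1    100
# 1  NaN    C    20      1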
