I have two dataframes, shown below. Some of the index values could be common between the two, and I would like to add the values across the two where the same index is present. The output should contain all the index values (from both 1 and 2) with their cumulative values.
Build
2.1.3.13 2
2.1.3.1 1
2.1.3.15 1
2.1.3.20 1
2.1.3.8 1
2.1.3.9 1
Ref_Build
2.1.3.13 2
2.1.3.10 1
2.1.3.14 1
2.1.3.17 1
2.1.3.18 1
2.1.3.22 1
For example, in the above case 2.1.3.13 should show 4, and the remaining 11 entries should show 1 each.
What's an efficient way to do this? I tried merge etc., but some of those options were giving me the 'intersection' and not the 'union'.
Use Series.add and Series.fillna
df1['Build'].add(df2['Ref_Build']).fillna(df1['Build']).fillna(df2['Ref_Build'])
2.1.3.1 1.0
2.1.3.10 1.0
2.1.3.13 4.0
2.1.3.14 1.0
2.1.3.15 1.0
2.1.3.17 1.0
2.1.3.18 1.0
2.1.3.20 1.0
2.1.3.22 1.0
2.1.3.8 1.0
2.1.3.9 1.0
dtype: float64
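Equivalently, Series.add takes a fill_value argument, which treats an index that is missing on one side as 0 and avoids the fillna chain:
df1['Build'].add(df2['Ref_Build'], fill_value=0)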
Or:
pd.concat([df1['Build'], df2['Ref_Build']], axis=1).sum(axis=1)
2.1.3.13 4.0
2.1.3.1 1.0
2.1.3.15 1.0
2.1.3.20 1.0
2.1.3.8 1.0
2.1.3.9 1.0
2.1.3.10 1.0
2.1.3.14 1.0
2.1.3.17 1.0
2.1.3.18 1.0
2.1.3.22 1.0
dtype: float64
You can try merge with the outer option, or concat on the columns:
out = pd.merge(df1, df2, left_index=True, right_index=True, how='outer').fillna(0)
# or
out = pd.concat([df1, df2], axis=1).fillna(0)
out['sum'] = out['Build'] + out['Ref_Build']
# or with `eval` in one line
out = pd.concat([df1, df2], axis=1).fillna(0).eval('sum = Build + Ref_Build')
print(out)
Build Ref_Build sum
2.1.3.13 2.0 2.0 4.0
2.1.3.1 1.0 0.0 1.0
2.1.3.15 1.0 0.0 1.0
2.1.3.20 1.0 0.0 1.0
2.1.3.8 1.0 0.0 1.0
2.1.3.9 1.0 0.0 1.0
2.1.3.10 0.0 1.0 1.0
2.1.3.14 0.0 1.0 1.0
2.1.3.17 0.0 1.0 1.0
2.1.3.18 0.0 1.0 1.0
2.1.3.22 0.0 1.0 1.0
I would like to make a table of frequency and percent by container, class, and score.
df = pd.read_csv('https://drive.google.com/file/d/1pL8fHCc25-XRBYgj9n6NdRt5VHrIr-p1/view?usp=sharing', sep=',')
df.groupby(['Containe', 'Class']).count()
The output should be:
But that script does not work!
First, we stack the values in order to have one per row:
>>> df1 = (df.set_index(["Containe", "Class"])
... .stack()
... .reset_index(name='Score')
... .rename(columns={'level_2':'letters'}))
Then, we use a groupby to get the size of each combination of values, like so:
>>> df_grouped = df1.groupby(['Containe', 'Class', 'letters', 'Score'], as_index=False).size()
To finish, we use the pivot_table method to get the expected result:
>>> pd.pivot_table(df_grouped, values='size', index=['letters', 'Class', 'Containe'], columns=['Score']).fillna(0)
Score 0 1 2
letters Class Containe
AB A 1 2.0 1.0 1.0
2 1.0 2.0 1.0
B 3 2.0 1.0 1.0
4 1.0 2.0 1.0
AC A 1 0.0 2.0 2.0
2 1.0 2.0 1.0
B 3 1.0 2.0 1.0
4 2.0 2.0 0.0
AD A 1 2.0 0.0 2.0
2 1.0 3.0 0.0
B 3 2.0 1.0 1.0
4 1.0 1.0 2.0
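The question also asks for a percent table. One way to add it, assuming "percent" means the share of each Score within a (letters, Class, Containe) combination, is to divide the counts by their row sums:
>>> counts = pd.pivot_table(df_grouped, values='size', index=['letters', 'Class', 'Containe'], columns=['Score']).fillna(0)
>>> percent = counts.div(counts.sum(axis=1), axis=0).mul(100).round(1)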
def drop_cols_na(df, threshold):
    df.drop(df.isna[col for col in df if ....])
    return df
Hard-coding is relatively simple, but I want to create a quick program that changes the threshold for when to drop a column depending on the input parameter I choose. For example: drop columns if the number of NaNs amounts to 50%, 60%, and so on.
I have found a few examples to follow, but I am struggling to turn them into a function.
The following line must run without my changing it:
df = drop_cols_na(df)
which naturally returns the error "missing 1 required positional argument: 'threshold'".
Test case:
>>> df
0 1 2 3 4 5 6 7 8 9
0 1.0 NaN NaN 1.0 1.0 1.0 NaN 1.0 1.0 1.0
1 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 NaN 1.0
2 1.0 1.0 NaN 1.0 1.0 NaN 1.0 1.0 1.0 1.0
3 1.0 1.0 NaN NaN 1.0 1.0 1.0 1.0 1.0 1.0
4 1.0 1.0 1.0 NaN 1.0 1.0 1.0 1.0 1.0 1.0
5 NaN 1.0 1.0 1.0 1.0 1.0 NaN 1.0 1.0 1.0
6 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
7 1.0 1.0 NaN 1.0 1.0 1.0 1.0 NaN 1.0 1.0
8 1.0 1.0 NaN 1.0 1.0 NaN 1.0 1.0 1.0 NaN
9 NaN 1.0 1.0 1.0 1.0 NaN 1.0 1.0 1.0 1.0
10 NaN 1.0 1.0 1.0 NaN 1.0 1.0 1.0 1.0 1.0
11 1.0 NaN 1.0 NaN 1.0 1.0 1.0 NaN NaN NaN
12 1.0 1.0 NaN 1.0 1.0 1.0 NaN 1.0 NaN 1.0
13 1.0 1.0 NaN NaN 1.0 1.0 1.0 1.0 NaN 1.0
14 1.0 1.0 1.0 1.0 1.0 1.0 1.0 NaN 1.0 NaN
15 1.0 NaN 1.0 NaN NaN 1.0 NaN 1.0 1.0 1.0
16 1.0 1.0 1.0 1.0 NaN 1.0 1.0 1.0 1.0 1.0
17 NaN 1.0 1.0 NaN 1.0 1.0 NaN 1.0 NaN 1.0
18 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
19 1.0 1.0 1.0 1.0 1.0 1.0 NaN 1.0 1.0 NaN
# 20% 15% 35% 30% 15% 15% 30% 15% 25% 20% % of NaN
def drop_cols_na(df, threshold):
    return df[df.columns[df.isna().sum() / len(df) < threshold]]
Drop all columns where the fraction of NaNs is >= 0.25:
>>> drop_cols_na(df, 0.25)
0 1 4 5 7 9
0 1.0 NaN 1.0 1.0 1.0 1.0
1 1.0 1.0 1.0 1.0 1.0 1.0
2 1.0 1.0 1.0 NaN 1.0 1.0
3 1.0 1.0 1.0 1.0 1.0 1.0
4 1.0 1.0 1.0 1.0 1.0 1.0
5 NaN 1.0 1.0 1.0 1.0 1.0
6 1.0 1.0 1.0 1.0 1.0 1.0
7 1.0 1.0 1.0 1.0 NaN 1.0
8 1.0 1.0 1.0 NaN 1.0 NaN
9 NaN 1.0 1.0 NaN 1.0 1.0
10 NaN 1.0 NaN 1.0 1.0 1.0
11 1.0 NaN 1.0 1.0 NaN NaN
12 1.0 1.0 1.0 1.0 1.0 1.0
13 1.0 1.0 1.0 1.0 1.0 1.0
14 1.0 1.0 1.0 1.0 NaN NaN
15 1.0 NaN NaN 1.0 1.0 1.0
16 1.0 1.0 NaN 1.0 1.0 1.0
17 NaN 1.0 1.0 1.0 1.0 1.0
18 1.0 1.0 1.0 1.0 1.0 1.0
19 1.0 1.0 1.0 1.0 1.0 NaN
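A built-in alternative, roughly equivalent apart from rounding at the boundary, is DataFrame.dropna with axis=1 and the thresh parameter, which keeps a column only if it has at least thresh non-NaN values:
def drop_cols_na(df, threshold):
    # keep columns that have strictly more than (1 - threshold) * len(df) non-NaN values
    return df.dropna(axis=1, thresh=int(len(df) * (1 - threshold)) + 1)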
First find the columns where the condition is met. Then, drop them.
def drop_cols_na(df, threshold):
    cols = [col for col in df.columns if df[col].isna().sum() / df[col].shape[0] > threshold]
    df = df.drop(cols, axis=1)
    return df
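If, as the question states, df = drop_cols_na(df) has to run unchanged, give threshold a default value (the 0.5 here is just an arbitrary placeholder cutoff):
def drop_cols_na(df, threshold=0.5):  # default of 0.5 is an arbitrary choice
    cols = [col for col in df.columns if df[col].isna().sum() / df[col].shape[0] > threshold]
    return df.drop(cols, axis=1)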
How can I create a new column in a dataframe that consists of the MEAN of an indexed range of values in that row?
example:
1 2 3 JUNK
0 0.0 0.0 0.0 A
1 1.0 1.0 -1.0 B
2 2.0 2.0 1.0 C
The JUNK column should be ignored when determining the MEAN column.
expected output:
1 2 3 JUNK MEAN
0 0.0 0.0 0.0 A 0.0
1 1.0 1.0 -1.0 B 0.33
2 2.0 2.0 1.0 C 1.66
Use drop to remove, or iloc to filter out, the unnecessary columns:
df['MEAN'] = df.drop('JUNK', axis=1).mean(axis=1)
df['MEAN'] = df.iloc[:, :-1].mean(axis=1)
print(df)
1 2 3 JUNK MEAN
0 0.0 0.0 0.0 A 0.000000
1 1.0 1.0 -1.0 B 0.333333
2 2.0 2.0 1.0 C 1.666667
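If the columns to ignore are not known by name, a more general option, assuming every numeric column should contribute to the mean, is select_dtypes:
df['MEAN'] = df.select_dtypes(include='number').mean(axis=1)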
I have a dataframe such as:
0 1 2 3 4 5
0 41.0 22.0 9.0 4.0 2.0 1.0
1 6.0 1.0 2.0 1.0 1.0 1.0
2 4.0 2.0 4.0 1.0 0.0 1.0
3 1.0 2.0 1.0 1.0 1.0 1.0
4 5.0 1.0 0.0 1.0 0.0 1.0
5 11.4 5.6 3.2 1.6 0.8 1.0
The final row contains averages. I would like to rename the final row's label to "A" so that the dataframe looks like this:
0 1 2 3 4 5
0 41.0 22.0 9.0 4.0 2.0 1.0
1 6.0 1.0 2.0 1.0 1.0 1.0
2 4.0 2.0 4.0 1.0 0.0 1.0
3 1.0 2.0 1.0 1.0 1.0 1.0
4 5.0 1.0 0.0 1.0 0.0 1.0
A 11.4 5.6 3.2 1.6 0.8 1.0
I understand columns can be renamed with df.columns = .... But how can I do this with a specific row label?
You can get the last index label using negative indexing, similar to plain Python:
last = df.index[-1]
Then:
df = df.rename(index={last: 'A'})
Edit: If you are looking for a one-liner,
df.index = df.index[:-1].tolist() + ['A']
Use the index attribute:
df.index = df.index[:-1].append(pd.Index(['A']))
My dask dataframe is the following:
In [65]: df.head()
Out[65]:
id_orig id_cliente id_cartao inicio_processo fim_processo score \
0 1.0 1.0 1.0 1.0 1.0 1.0
1 1.0 1.0 1.0 1.0 1.0 1.0
2 1.0 1.0 1.0 1.0 1.0 1.0
3 1.0 1.0 1.0 1.0 1.0 1.0
4 1.0 1.0 1.0 1.0 1.0 1.0
automatico canal aceito motivo_recusa variante
0 1.0 1.0 1.0 1.0 1.0
1 1.0 1.0 1.0 1.0 1.0
2 1.0 1.0 1.0 1.0 1.0
3 1.0 1.0 1.0 1.0 1.0
4 1.0 1.0 1.0 1.0 1.0
Assigning an integer works:
In [92]: df = df.assign(id_cliente=999)
In [93]: df.head()
Out[93]:
id_orig id_cliente id_cartao inicio_processo fim_processo score \
0 1.0 999 1.0 1.0 1.0 1.0
1 1.0 999 1.0 1.0 1.0 1.0
2 1.0 999 1.0 1.0 1.0 1.0
3 1.0 999 1.0 1.0 1.0 1.0
4 1.0 999 1.0 1.0 1.0 1.0
automatico canal aceito motivo_recusa variante
0 1.0 1.0 1.0 1.0 1.0
1 1.0 1.0 1.0 1.0 1.0
2 1.0 1.0 1.0 1.0 1.0
3 1.0 1.0 1.0 1.0 1.0
4 1.0 1.0 1.0 1.0 1.0
However, no other method of assigning a Series or any other iterable to an existing column works.
How can I achieve that?
DataFrame.assign accepts any scalar or any dd.Series
df = df.assign(a=1) # accepts scalars
df = df.assign(z=df.x + df.y) # accepts dd.Series objects
If you are trying to assign a NumPy array or Python list, then your data might be small enough to fit in RAM, in which case Pandas might be a better fit than Dask.dataframe.
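A minimal sketch of that route, assuming the data fits in memory, that dd refers to dask.dataframe, and that values is a hypothetical Python list with one entry per row:
pdf = df.compute()                        # materialise the dask dataframe as an in-memory pandas DataFrame
pdf['id_cliente'] = values                # plain pandas assignment works here
df = dd.from_pandas(pdf, npartitions=4)   # convert back to dask if needed; npartitions is arbitrary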
You can also use plain setitem syntax:
df['a'] = 1
df['z'] = df.x + df.y