Table of frequency of specific scores in python pandas

I would like to make a table of frequency and percent by container, class, and score.
df = pd.read_csv('https://drive.google.com/file/d/1pL8fHCc25-XRBYgj9n6NdRt5VHrIr-p1/view?usp=sharing', sep=',')
df.groupby(['Containe', 'Class']).count()
The output should be a frequency table by container, class, and score, but that script does not work!

First, we stack the values so that there is one score per row:
>>> df1 = (df.set_index(["Containe", "Class"])
... .stack()
... .reset_index(name='Score')
... .rename(columns={'level_2':'letters'}))
Then, we use a groupby to get the size of each combination of values, like so:
>>> df_grouped = df1.groupby(['Containe', 'Class', 'letters', 'Score'], as_index=False).size()
Finally, we use the pivot_table method to get the expected result:
>>> pd.pivot_table(df_grouped, values='size', index=['letters', 'Class', 'Containe'], columns=['Score']).fillna(0)
Score 0 1 2
letters Class Containe
AB A 1 2.0 1.0 1.0
2 1.0 2.0 1.0
B 3 2.0 1.0 1.0
4 1.0 2.0 1.0
AC A 1 0.0 2.0 2.0
2 1.0 2.0 1.0
B 3 1.0 2.0 1.0
4 2.0 2.0 0.0
AD A 1 2.0 0.0 2.0
2 1.0 3.0 0.0
B 3 2.0 1.0 1.0
4 1.0 1.0 2.0
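As a possibly shorter route to the same counts (a sketch that reuses the df1 produced by the stack step above), pd.crosstab accepts lists of Series and can also turn the counts into percentages via its normalize argument:
>>> pd.crosstab(index=[df1['letters'], df1['Class'], df1['Containe']],
...             columns=df1['Score'])
>>> # normalize='index' gives row-wise percentages instead of raw counts
>>> pd.crosstab(index=[df1['letters'], df1['Class'], df1['Containe']],
...             columns=df1['Score'], normalize='index')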

Adding values in columns from 2 dataframes

I have 2 dataframes as below; some of the index values could be common between the two, and I would like to add the values across the two if the same index is present. The output should have all the index values present (from both dataframes) and their cumulative values.
Build
2.1.3.13 2
2.1.3.1 1
2.1.3.15 1
2.1.3.20 1
2.1.3.8 1
2.1.3.9 1
Ref_Build
2.1.3.13 2
2.1.3.10 1
2.1.3.14 1
2.1.3.17 1
2.1.3.18 1
2.1.3.22 1
For example, in the above case 2.1.3.13 should show 4 and the remaining 10 entries should show 1 each.
What's an efficient way to do this? I tried merge etc., but some of those options gave me the 'intersection' rather than the 'union'.
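For reference, a minimal reproduction of the two frames (a sketch; the names and values are taken from the output above):
import pandas as pd

idx1 = ['2.1.3.13', '2.1.3.1', '2.1.3.15', '2.1.3.20', '2.1.3.8', '2.1.3.9']
idx2 = ['2.1.3.13', '2.1.3.10', '2.1.3.14', '2.1.3.17', '2.1.3.18', '2.1.3.22']
df1 = pd.DataFrame({'Build': [2, 1, 1, 1, 1, 1]}, index=idx1)
df2 = pd.DataFrame({'Ref_Build': [2, 1, 1, 1, 1, 1]}, index=idx2)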
Use Series.add and Series.fillna
df1['Build'].add(df2['Ref_Build']).fillna(df1['Build']).fillna(df2['Ref_Build'])
2.1.3.1 1.0
2.1.3.10 1.0
2.1.3.13 4.0
2.1.3.14 1.0
2.1.3.15 1.0
2.1.3.17 1.0
2.1.3.18 1.0
2.1.3.20 1.0
2.1.3.22 1.0
2.1.3.8 1.0
2.1.3.9 1.0
dtype: float64
Or:
pd.concat([df1['Build'], df2['Ref_Build']], axis=1).sum(axis=1)
2.1.3.13 4.0
2.1.3.1 1.0
2.1.3.15 1.0
2.1.3.20 1.0
2.1.3.8 1.0
2.1.3.9 1.0
2.1.3.10 1.0
2.1.3.14 1.0
2.1.3.17 1.0
2.1.3.18 1.0
2.1.3.22 1.0
dtype: float64
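Series.add also takes a fill_value argument, which (assuming the values themselves contain no NaNs, as here) collapses the chained fillna calls into a single step:
df1['Build'].add(df2['Ref_Build'], fill_value=0)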
You can try merge with how='outer', or concat along the columns:
out = pd.merge(df1, df2, left_index=True, right_index=True, how='outer').fillna(0)
# or
out = pd.concat([df1, df2], axis=1).fillna(0)
out['sum'] = out['Build'] + out['Ref_Build']
# or with `eval` in one line
out = pd.concat([df1, df2], axis=1).fillna(0).eval('sum = Build + Ref_Build')
print(out)
Build Ref_Build sum
2.1.3.13 2.0 2.0 4.0
2.1.3.1 1.0 0.0 1.0
2.1.3.15 1.0 0.0 1.0
2.1.3.20 1.0 0.0 1.0
2.1.3.8 1.0 0.0 1.0
2.1.3.9 1.0 0.0 1.0
2.1.3.10 0.0 1.0 1.0
2.1.3.14 0.0 1.0 1.0
2.1.3.17 0.0 1.0 1.0
2.1.3.18 0.0 1.0 1.0
2.1.3.22 0.0 1.0 1.0
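Yet another compact option (a sketch, equivalent to the above) is to stack the two Series end to end and aggregate by index label:
pd.concat([df1['Build'], df2['Ref_Build']]).groupby(level=0).sum()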

new Pandas Dataframe column calculated from other column values

How can I create a new column in a dataframe that consists of the MEAN of an indexed range of values in that row?
example:
1 2 3 JUNK
0 0.0 0.0 0.0 A
1 1.0 1.0 -1.0 B
2 2.0 2.0 1.0 C
The JUNK column should be ignored when computing the MEAN column.
expected output:
1 2 3 JUNK MEAN
0 0.0 0.0 0.0 A 0.0
1 1.0 1.0 -1.0 B 0.33
2 2.0 2.0 1.0 C 1.66
Use drop to remove, or iloc to filter out, the unnecessary columns:
df['MEAN'] = df.drop('JUNK', axis=1).mean(axis=1)
df['MEAN'] = df.iloc[:, :-1].mean(axis=1)
print (df)
1 2 3 JUNK MEAN
0 0.0 0.0 0.0 A 0.000000
1 1.0 1.0 -1.0 B 0.333333
2 2.0 2.0 1.0 C 1.666667
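If there were several non-numeric columns to skip rather than just JUNK, another option is to select only the numeric columns first (a sketch using select_dtypes):
import numpy as np
df['MEAN'] = df.select_dtypes(include=np.number).mean(axis=1)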

How can I change a specific row label in a Pandas dataframe?

I have a dataframe such as:
0 1 2 3 4 5
0 41.0 22.0 9.0 4.0 2.0 1.0
1 6.0 1.0 2.0 1.0 1.0 1.0
2 4.0 2.0 4.0 1.0 0.0 1.0
3 1.0 2.0 1.0 1.0 1.0 1.0
4 5.0 1.0 0.0 1.0 0.0 1.0
5 11.4 5.6 3.2 1.6 0.8 1.0
Where the final row contains averages. I would like to rename the final row label to "A" so that the dataframe will look like this:
0 1 2 3 4 5
0 41.0 22.0 9.0 4.0 2.0 1.0
1 6.0 1.0 2.0 1.0 1.0 1.0
2 4.0 2.0 4.0 1.0 0.0 1.0
3 1.0 2.0 1.0 1.0 1.0 1.0
4 5.0 1.0 0.0 1.0 0.0 1.0
A 11.4 5.6 3.2 1.6 0.8 1.0
I understand columns can be renamed with df.columns = .... But how can I do this for a specific row label?
You can get the last index label using negative indexing, just as in plain Python:
last = df.index[-1]
Then
df = df.rename(index={last: 'A'})
Edit: If you are looking for a one-liner,
df.index = df.index[:-1].tolist() + ['A']
Use the index attribute:
df.index = df.index[:-1].append(pd.Index(['A']))
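A quick end-to-end sketch of the rename approach on a toy frame (made-up numbers, with the last row holding the averages):
import pandas as pd

df = pd.DataFrame([[1.0, 2.0], [3.0, 4.0], [2.0, 3.0]])
df = df.rename(index={df.index[-1]: 'A'})
print(df)
#      0    1
# 0  1.0  2.0
# 1  3.0  4.0
# A  2.0  3.0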

How do I assign series or sequences to dask dataframe column?

My dask dataframe is the following:
In [65]: df.head()
Out[65]:
id_orig id_cliente id_cartao inicio_processo fim_processo score \
0 1.0 1.0 1.0 1.0 1.0 1.0
1 1.0 1.0 1.0 1.0 1.0 1.0
2 1.0 1.0 1.0 1.0 1.0 1.0
3 1.0 1.0 1.0 1.0 1.0 1.0
4 1.0 1.0 1.0 1.0 1.0 1.0
automatico canal aceito motivo_recusa variante
0 1.0 1.0 1.0 1.0 1.0
1 1.0 1.0 1.0 1.0 1.0
2 1.0 1.0 1.0 1.0 1.0
3 1.0 1.0 1.0 1.0 1.0
4 1.0 1.0 1.0 1.0 1.0
Assigning an integer works:
In [92]: df = df.assign(id_cliente=999)
In [93]: df.head()
Out[93]:
id_orig id_cliente id_cartao inicio_processo fim_processo score \
0 1.0 999 1.0 1.0 1.0 1.0
1 1.0 999 1.0 1.0 1.0 1.0
2 1.0 999 1.0 1.0 1.0 1.0
3 1.0 999 1.0 1.0 1.0 1.0
4 1.0 999 1.0 1.0 1.0 1.0
automatico canal aceito motivo_recusa variante
0 1.0 1.0 1.0 1.0 1.0
1 1.0 1.0 1.0 1.0 1.0
2 1.0 1.0 1.0 1.0 1.0
3 1.0 1.0 1.0 1.0 1.0
4 1.0 1.0 1.0 1.0 1.0
However, no method for assigning a Series or any other iterable to an existing column works for me.
How can I achieve that?
DataFrame.assign accepts any scalar or any dd.Series
df = df.assign(a=1) # accepts scalars
df = df.assign(z=df.x + df.y) # accepts dd.Series objects
If you are trying to assign a NumPy array or Python list then it might be your data is small enough to fit in RAM, and so Pandas might be a better fit than Dask.dataframe.
You can also use plain setitem syntax
df['a'] = 1
df['z'] = df.x + df.y
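A minimal runnable sketch of both styles (assumes dask is installed; the column names x and y are illustrative, not from the question):
import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({'x': [1, 2, 3, 4], 'y': [10, 20, 30, 40]})
ddf = dd.from_pandas(pdf, npartitions=2)

ddf = ddf.assign(a=1)              # scalar broadcast to every row
ddf = ddf.assign(z=ddf.x + ddf.y)  # dd.Series built from existing columns
ddf['w'] = ddf.x * 2               # plain setitem also works
print(ddf.compute())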

Select rows where at least one value from the list of columns is not null

I have a big dataframe with many columns (around 1000). I have a list of columns (around 10, generated by a script), and I would like to select all the rows in the original dataframe where at least one of the columns in my list is not null.
So if I knew the number of columns in advance, I could do something like this:
list_of_cols = ['col1', ...]
df[
df[list_of_cols[0]].notnull() |
df[list_of_cols[1]].notnull() |
...
df[list_of_cols[6]].notnull()
]
I can also iterate over the list of columns and build up a mask to apply to df, but this looks too tedious. Knowing how powerful pandas is when dealing with NaN, I would expect there to be a much easier way to achieve what I want.
Use the thresh parameter of the dropna() method. Setting thresh=1 means: keep a row if it has at least 1 non-null item.
df = pd.DataFrame(np.random.choice((1., np.nan), (1000, 1000), p=(.3, .7)))
list_of_cols = list(range(10))
df[list_of_cols].dropna(thresh=1).head()
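Note that df[list_of_cols].dropna(thresh=1) returns only the listed columns. If you want every column of the original frame, the same condition can be expressed as a boolean mask (a sketch):
mask = df[list_of_cols].notna().any(axis=1)
df[mask].head()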
Starting with this:
data = {'a' : [np.nan,0,0,0,0,0,np.nan,0,0, 0,0,0, 9,9,],
'b' : [np.nan,np.nan,1,1,1,1,1,1,1, 2,2,2, 1,7],
'c' : [np.nan,np.nan,1,1,2,2,3,3,3, 1,1,1, 1,1],
'd' : [np.nan,np.nan,7,9,6,9,7,np.nan,6, 6,7,6, 9,6]}
df = pd.DataFrame(data, columns=['a','b','c','d'])
df
a b c d
0 NaN NaN NaN NaN
1 0.0 NaN NaN NaN
2 0.0 1.0 1.0 7.0
3 0.0 1.0 1.0 9.0
4 0.0 1.0 2.0 6.0
5 0.0 1.0 2.0 9.0
6 NaN 1.0 3.0 7.0
7 0.0 1.0 3.0 NaN
8 0.0 1.0 3.0 6.0
9 0.0 2.0 1.0 6.0
10 0.0 2.0 1.0 7.0
11 0.0 2.0 1.0 6.0
12 9.0 1.0 1.0 9.0
13 9.0 7.0 1.0 6.0
Rows where not all values are null (this removes row index 0):
df[~df.isnull().all(axis=1)]
a b c d
1 0.0 NaN NaN NaN
2 0.0 1.0 1.0 7.0
3 0.0 1.0 1.0 9.0
4 0.0 1.0 2.0 6.0
5 0.0 1.0 2.0 9.0
6 NaN 1.0 3.0 7.0
7 0.0 1.0 3.0 NaN
8 0.0 1.0 3.0 6.0
9 0.0 2.0 1.0 6.0
10 0.0 2.0 1.0 7.0
11 0.0 2.0 1.0 6.0
12 9.0 1.0 1.0 9.0
13 9.0 7.0 1.0 6.0
One can use boolean indexing
df[~pd.isnull(df[list_of_cols]).all(axis=1)]
Explanation:
The expression ~pd.isnull(df[list_of_cols]).all(axis=1) returns a boolean array that is applied as a filter to the dataframe:
isnull() applied to df[list_of_cols] creates a boolean mask with True for the null elements of df[list_of_cols] and False otherwise
all(axis=1) returns True for a row if all of its elements are True
So, by negation ~ (not all null = at least one is non-null) one gets a mask for all rows that have at least one non-null element in the given list of columns.
An example:
Dataframe:
>>> df=pd.DataFrame({'A':[11,22,33,np.NaN],
'B':['x',np.NaN,np.NaN,'w'],
'C':['2016-03-13',np.NaN,'2016-03-14','2016-03-15']})
>>> df
A B C
0 11 x 2016-03-13
1 22 NaN NaN
2 33 NaN 2016-03-14
3 NaN w 2016-03-15
Negated isnull mask (here list_of_cols = ['B', 'C']):
>>> ~pd.isnull(df[list_of_cols])
B C
0 True True
1 False False
2 False True
3 True True
apply all(axis=1) row-wise and negate:
>>> ~pd.isnull(df[list_of_cols]).all(axis=1)
0 True
1 False
2 True
3 True
dtype: bool
Boolean selection from dataframe:
>>> df[~pd.isnull(df[list_of_cols]).all(axis=1)]
A B C
0 11 x 2016-03-13
2 33 NaN 2016-03-14
3 NaN w 2016-03-15
