How do I assign series or sequences to dask dataframe column? - python

My dask dataframe is the following:
In [65]: df.head()
Out[65]:
   id_orig  id_cliente  id_cartao  inicio_processo  fim_processo  score  \
0      1.0         1.0        1.0              1.0           1.0    1.0
1      1.0         1.0        1.0              1.0           1.0    1.0
2      1.0         1.0        1.0              1.0           1.0    1.0
3      1.0         1.0        1.0              1.0           1.0    1.0
4      1.0         1.0        1.0              1.0           1.0    1.0

   automatico  canal  aceito  motivo_recusa  variante
0         1.0    1.0     1.0            1.0       1.0
1         1.0    1.0     1.0            1.0       1.0
2         1.0    1.0     1.0            1.0       1.0
3         1.0    1.0     1.0            1.0       1.0
4         1.0    1.0     1.0            1.0       1.0
Assigning an integer works:
In [92]: df = df.assign(id_cliente=999)
In [93]: df.head()
Out[93]:
   id_orig  id_cliente  id_cartao  inicio_processo  fim_processo  score  \
0      1.0         999        1.0              1.0           1.0    1.0
1      1.0         999        1.0              1.0           1.0    1.0
2      1.0         999        1.0              1.0           1.0    1.0
3      1.0         999        1.0              1.0           1.0    1.0
4      1.0         999        1.0              1.0           1.0    1.0

   automatico  canal  aceito  motivo_recusa  variante
0         1.0    1.0     1.0            1.0       1.0
1         1.0    1.0     1.0            1.0       1.0
2         1.0    1.0     1.0            1.0       1.0
3         1.0    1.0     1.0            1.0       1.0
4         1.0    1.0     1.0            1.0       1.0
However, no method I have tried for assigning a Series or any other iterable to an existing column works.
How can I achieve that?

DataFrame.assign accepts any scalar or any dd.Series:
df = df.assign(a=1) # accepts scalars
df = df.assign(z=df.x + df.y) # accepts dd.Series objects
If you are trying to assign a NumPy array or a Python list, then your data might be small enough to fit in RAM, in which case Pandas might be a better fit than Dask.dataframe.
You can also use plain setitem syntax:
df['a'] = 1
df['z'] = df.x + df.y
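For context, here is a minimal runnable sketch of both styles (the column names x and y and the two-partition split are illustrative assumptions, not from the question):

import pandas as pd
import dask.dataframe as dd

# Illustrative data; any pandas DataFrame works here.
pdf = pd.DataFrame({'x': range(10), 'y': range(10)})
df = dd.from_pandas(pdf, npartitions=2)

df = df.assign(a=1)            # scalar, broadcast to every row
df = df.assign(z=df.x + df.y)  # dd.Series built from existing columns
df['w'] = df.x * 2             # plain setitem works the same way

print(df.head())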

Related

Adding values in columns from 2 dataframes

I have 2 dataframes as below. Some of the index values could be common between the two, and I would like to add the values across the two when the same index is present. The output should have all the index values present (the union of 1 & 2) with their cumulative values.
Build
2.1.3.13    2
2.1.3.1     1
2.1.3.15    1
2.1.3.20    1
2.1.3.8     1
2.1.3.9     1
Ref_Build
2.1.3.13    2
2.1.3.10    1
2.1.3.14    1
2.1.3.17    1
2.1.3.18    1
2.1.3.22    1
For example, in the above case 2.1.3.13 should show 4 and the remaining 10 of them 1 each.
What's the efficient way to do this? I tried merge etc., but some of those options were giving me the 'intersection' and not the 'union'.
Use Series.add and Series.fillna:
df1['Build'].add(df2['Ref_Build']).fillna(df1['Build']).fillna(df2['Ref_Build'])
2.1.3.1 1.0
2.1.3.10 1.0
2.1.3.13 4.0
2.1.3.14 1.0
2.1.3.15 1.0
2.1.3.17 1.0
2.1.3.18 1.0
2.1.3.20 1.0
2.1.3.22 1.0
2.1.3.8 1.0
2.1.3.9 1.0
dtype: float64
Or:
pd.concat([df1['Build'], df2['Ref_Build']], axis=1).sum(axis=1)
2.1.3.13 4.0
2.1.3.1 1.0
2.1.3.15 1.0
2.1.3.20 1.0
2.1.3.8 1.0
2.1.3.9 1.0
2.1.3.10 1.0
2.1.3.14 1.0
2.1.3.17 1.0
2.1.3.18 1.0
2.1.3.22 1.0
dtype: float64
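As a side note, Series.add also accepts a fill_value argument that treats a missing entry as 0 before adding, which collapses the chained fillna calls above into a single step:

df1['Build'].add(df2['Ref_Build'], fill_value=0)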
You can try merge with the outer option, or concat on columns:
out = pd.merge(df1, df2, left_index=True, right_index=True, how='outer').fillna(0)
# or
out = pd.concat([df1, df2], axis=1).fillna(0)
out['sum'] = out['Build'] + out['Ref_Build']
# or with `eval` in one line
out = pd.concat([df1, df2], axis=1).fillna(0).eval('sum = Build + Ref_Build')
print(out)
Build Ref_Build sum
2.1.3.13 2.0 2.0 4.0
2.1.3.1 1.0 0.0 1.0
2.1.3.15 1.0 0.0 1.0
2.1.3.20 1.0 0.0 1.0
2.1.3.8 1.0 0.0 1.0
2.1.3.9 1.0 0.0 1.0
2.1.3.10 0.0 1.0 1.0
2.1.3.14 0.0 1.0 1.0
2.1.3.17 0.0 1.0 1.0
2.1.3.18 0.0 1.0 1.0
2.1.3.22 0.0 1.0 1.0
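For completeness, a self-contained sketch that rebuilds the example data from the question, so either approach above can be run directly:

import pandas as pd

df1 = pd.DataFrame({'Build': [2, 1, 1, 1, 1, 1]},
                   index=['2.1.3.13', '2.1.3.1', '2.1.3.15',
                          '2.1.3.20', '2.1.3.8', '2.1.3.9'])
df2 = pd.DataFrame({'Ref_Build': [2, 1, 1, 1, 1, 1]},
                   index=['2.1.3.13', '2.1.3.10', '2.1.3.14',
                          '2.1.3.17', '2.1.3.18', '2.1.3.22'])

out = pd.concat([df1, df2], axis=1).fillna(0)
out['sum'] = out['Build'] + out['Ref_Build']
print(out)  # 2.1.3.13 sums to 4.0; every other index keeps 1.0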

Table of frequency of specific scores in python pandas

I would like to make a table of frequency and percent by container, class, and score.
df = pd.read_csv('https://drive.google.com/file/d/1pL8fHCc25-XRBYgj9n6NdRt5VHrIr-p1/view?usp=sharing', sep=',')
df.groupby(['Containe', 'Class']).count()
The expected output is a table of counts and percentages by container, class, and score, but that script does not work!
First, we stack the values in order to have one score per row:
>>> df1 = (df.set_index(["Containe", "Class"])
... .stack()
... .reset_index(name='Score')
... .rename(columns={'level_2':'letters'}))
Then, we use a groupby to get the size of each combination of values, like so:
>>> df_grouped = df1.groupby(['Containe', 'Class', 'letters', 'Score'], as_index=False).size()
Finally, we use the pivot_table method to get the expected result:
>>> pd.pivot_table(df_grouped, values='size', index=['letters', 'Class', 'Containe'], columns=['Score']).fillna(0)
Score                     0    1    2
letters Class Containe
AB      A     1         2.0  1.0  1.0
              2         1.0  2.0  1.0
        B     3         2.0  1.0  1.0
              4         1.0  2.0  1.0
AC      A     1         0.0  2.0  2.0
              2         1.0  2.0  1.0
        B     3         1.0  2.0  1.0
              4         2.0  2.0  0.0
AD      A     1         2.0  0.0  2.0
              2         1.0  3.0  0.0
        B     3         2.0  1.0  1.0
              4         1.0  1.0  2.0
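As a possible shortening (an alternative, not from the original answer), pd.crosstab can replace the groupby and pivot_table pair, assuming df1 from the stack step above; crosstab counts the co-occurrences and fills missing combinations with 0 by default:

pd.crosstab(index=[df1['letters'], df1['Class'], df1['Containe']],
            columns=df1['Score'])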

Drop columns if number of NaNs equals the threshold parameter

def drop_cols_na(df, threshold):
    df.drop(df.isna[col for col in df if ....])
    return df
Hard-coding is relatively simple, but I want to create a quick program that changes the threshold for when to drop a column depending on the input parameter I choose. For example: drop columns if the number of NaNs equates to 50%, 60%, and so on.
I have found a few examples to follow, but I am struggling to implement this in a def function.
The following line must run without my changing it:
df = drop_cols_na(df)
which naturally returns the error "missing 1 required positional argument: 'threshold'".
Test case:
>>> df
0 1 2 3 4 5 6 7 8 9
0 1.0 NaN NaN 1.0 1.0 1.0 NaN 1.0 1.0 1.0
1 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 NaN 1.0
2 1.0 1.0 NaN 1.0 1.0 NaN 1.0 1.0 1.0 1.0
3 1.0 1.0 NaN NaN 1.0 1.0 1.0 1.0 1.0 1.0
4 1.0 1.0 1.0 NaN 1.0 1.0 1.0 1.0 1.0 1.0
5 NaN 1.0 1.0 1.0 1.0 1.0 NaN 1.0 1.0 1.0
6 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
7 1.0 1.0 NaN 1.0 1.0 1.0 1.0 NaN 1.0 1.0
8 1.0 1.0 NaN 1.0 1.0 NaN 1.0 1.0 1.0 NaN
9 NaN 1.0 1.0 1.0 1.0 NaN 1.0 1.0 1.0 1.0
10 NaN 1.0 1.0 1.0 NaN 1.0 1.0 1.0 1.0 1.0
11 1.0 NaN 1.0 NaN 1.0 1.0 1.0 NaN NaN NaN
12 1.0 1.0 NaN 1.0 1.0 1.0 NaN 1.0 NaN 1.0
13 1.0 1.0 NaN NaN 1.0 1.0 1.0 1.0 NaN 1.0
14 1.0 1.0 1.0 1.0 1.0 1.0 1.0 NaN 1.0 NaN
15 1.0 NaN 1.0 NaN NaN 1.0 NaN 1.0 1.0 1.0
16 1.0 1.0 1.0 1.0 NaN 1.0 1.0 1.0 1.0 1.0
17 NaN 1.0 1.0 NaN 1.0 1.0 NaN 1.0 NaN 1.0
18 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
19 1.0 1.0 1.0 1.0 1.0 1.0 NaN 1.0 1.0 NaN
# 20% 15% 35% 30% 15% 15% 30% 15% 25% 20% % of NaN
def drop_cols_na(df, threshold):
    return df[df.columns[df.isna().sum() / len(df) < threshold]]
Drop all cols where NaN >= 0.25:
>>> drop_cols_na(df, 0.25)
0 1 4 5 7 9
0 1.0 NaN 1.0 1.0 1.0 1.0
1 1.0 1.0 1.0 1.0 1.0 1.0
2 1.0 1.0 1.0 NaN 1.0 1.0
3 1.0 1.0 1.0 1.0 1.0 1.0
4 1.0 1.0 1.0 1.0 1.0 1.0
5 NaN 1.0 1.0 1.0 1.0 1.0
6 1.0 1.0 1.0 1.0 1.0 1.0
7 1.0 1.0 1.0 1.0 NaN 1.0
8 1.0 1.0 1.0 NaN 1.0 NaN
9 NaN 1.0 1.0 NaN 1.0 1.0
10 NaN 1.0 NaN 1.0 1.0 1.0
11 1.0 NaN 1.0 1.0 NaN NaN
12 1.0 1.0 1.0 1.0 1.0 1.0
13 1.0 1.0 1.0 1.0 1.0 1.0
14 1.0 1.0 1.0 1.0 NaN NaN
15 1.0 NaN NaN 1.0 1.0 1.0
16 1.0 1.0 NaN 1.0 1.0 1.0
17 NaN 1.0 1.0 1.0 1.0 1.0
18 1.0 1.0 1.0 1.0 1.0 1.0
19 1.0 1.0 1.0 1.0 1.0 NaN
First find the columns where the condition is met. Then, drop them.
def drop_cols_na(df, threshold):
    cols = [col for col in df.columns if df[col].isna().sum() / df[col].shape[0] > threshold]
    df = df.drop(cols, axis=1)
    return df
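Since the question notes that df = drop_cols_na(df) must run unmodified, giving threshold a default value avoids the missing-argument error. A minimal sketch (the 0.5 default is an assumed choice, not from the question):

def drop_cols_na(df, threshold=0.5):
    # isna().mean() is the per-column NaN fraction; keep columns below the threshold.
    return df[df.columns[df.isna().mean() < threshold]]

df = drop_cols_na(df)        # uses the 0.5 default
df = drop_cols_na(df, 0.25)  # or pass an explicit threshold

The built-in df.dropna(axis=1, thresh=...) is a related alternative, but its thresh expects a minimum count of non-NaN values rather than a fraction.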

Unable to properly read the lines from a file

I have a file which I wrote using a Python script. The file is large, containing more than 1000 lines, and each line is very long. It goes like this (shortened):
1 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
2 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
3 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
...
Note that each line can take the space of 3 lines when displayed on the monitor.
When I try:
fp = open('data.txt', 'r')
c = 0
for line in fp:
    c += 1
print("No. of line = ", c)
I get the correct value, and when I use the read() function, I get a different value, as in:
fp = open('data.txt', 'r')
c = 0
data = fp.read()
for line in data:
    c += 1
print("No. of line = ", c)
Can somebody explain what the difference is between using the read() function and not using it?
Thanks in advance...
Using
data = fp.read()
for line in data:
    c += 1
you read everything into one string, and the for loop treats this string as a sequence of characters, so you count characters.
You have to use readlines() to get a list of lines and count the lines in that list:
data = fp.readlines()
for line in data:
    c += 1
BTW: len() gives the same counts directly. To count characters:
data = fp.read()
c = len(data)
and to count lines:
data = fp.readlines()
c = len(data)
BTW: You could also use print() to see what you have in the variable:
data = fp.read()
print(data[0])
print(data[:3])
print(data)
and
data = fp.readlines()
print(data[0])
print(data[:3])
print(data)
If you want to test both in one script, then you have to close and open the file again, or use fp.seek(0) to move to the beginning of the file before you read again.
To work with lines you should use
fp = open('data.txt', 'r')
for line in fp:
    # ...code ...
fp.close()
or
fp = open('data.txt', 'r')
all_lines = fp.readlines()
for line in all_lines:
    # ...code ...
fp.close()
The same works with with ... as ...:
with open('data.txt', 'r') as fp:
    for line in fp:
        # ...code ...
or
with open('data.txt', 'r') as fp:
    all_lines = fp.readlines()
    for line in all_lines:
        # ...code ...
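A small runnable sketch contrasting the two counts (the file name is taken from the question):

with open('data.txt', 'r') as fp:
    n_lines = sum(1 for line in fp)  # iterating the file object yields lines
    fp.seek(0)                       # rewind before reading again
    n_chars = len(fp.read())         # read() returns one big string

print("lines:", n_lines, "chars:", n_chars)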

How can I change a specific row label in a Pandas dataframe?

I have a dataframe such as:
0 1 2 3 4 5
0 41.0 22.0 9.0 4.0 2.0 1.0
1 6.0 1.0 2.0 1.0 1.0 1.0
2 4.0 2.0 4.0 1.0 0.0 1.0
3 1.0 2.0 1.0 1.0 1.0 1.0
4 5.0 1.0 0.0 1.0 0.0 1.0
5 11.4 5.6 3.2 1.6 0.8 1.0
Where the final row contains averages. I would like to rename the final row label to "A" so that the dataframe will look like this:
0 1 2 3 4 5
0 41.0 22.0 9.0 4.0 2.0 1.0
1 6.0 1.0 2.0 1.0 1.0 1.0
2 4.0 2.0 4.0 1.0 0.0 1.0
3 1.0 2.0 1.0 1.0 1.0 1.0
4 5.0 1.0 0.0 1.0 0.0 1.0
A 11.4 5.6 3.2 1.6 0.8 1.0
I understand columns can be renamed with df.columns = .... But how can I do this with a specific row label?
You can get the last index label using negative indexing, as in plain Python:
last = df.index[-1]
Then
df = df.rename(index={last: 'A'})
Edit: If you are looking for a one-liner,
df.index = df.index[:-1].tolist() + ['A']
Use the index attribute:
df.index = df.index[:-1].append(pd.Index(['A']))
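A minimal runnable sketch using the question's data:

import pandas as pd

df = pd.DataFrame([[41.0, 22.0, 9.0, 4.0, 2.0, 1.0],
                   [6.0, 1.0, 2.0, 1.0, 1.0, 1.0],
                   [4.0, 2.0, 4.0, 1.0, 0.0, 1.0],
                   [1.0, 2.0, 1.0, 1.0, 1.0, 1.0],
                   [5.0, 1.0, 0.0, 1.0, 0.0, 1.0],
                   [11.4, 5.6, 3.2, 1.6, 0.8, 1.0]])

df = df.rename(index={df.index[-1]: 'A'})  # relabel only the last row
print(df)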
