Use of sample and seed in numpy - python

die = pd.DataFrame([1, 2, 3, 4, 5, 6])
sum_of_dice = die.sample(n=2, replace=True).sum().loc[0]
print (sum_of_dice)
Can someone explain what .sum().loc[0] is doing here?

It's always useful to print the intermediate steps to get an idea.
sum calculates the sum of the dataframe for each column.
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sum.html
loc selects a group of rows/columns.
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.loc.html
Called on a dataframe, sum returns a Series with one element per column (here a single element, labeled 0). Since we need the sum as a plain integer rather than a Series, we use loc[0] to select the element with label 0.
import pandas as pd
die = pd.DataFrame([1, 2, 3, 4, 5, 6])
sum_of_dice = die.sample(n=2, replace=True)
print(sum_of_dice)
sum_of_dice = sum_of_dice.sum()
print('---')
print (sum_of_dice)
sum_of_dice = sum_of_dice.loc[0]
print('---')
print (sum_of_dice)
   0
4  5
0  1
---
0    6
dtype: int64
---
6
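Because sample draws at random, each run of the code above can print a different sum. A minimal sketch of making the draw repeatable, assuming a fixed seed is what is wanted: DataFrame.sample accepts a random_state argument. The seed value 42 is an arbitrary choice for illustration.

```python
import pandas as pd

die = pd.DataFrame([1, 2, 3, 4, 5, 6])

# random_state fixes the seed, so both draws pick the same two rows
first = die.sample(n=2, replace=True, random_state=42)
second = die.sample(n=2, replace=True, random_state=42)

# .sum() gives a one-element Series; .loc[0] selects the value for column label 0
sum_first = first.sum().loc[0]
sum_second = second.sum().loc[0]
print(sum_first, sum_second)
```

Without random_state, the two draws are independent and the sums will usually differ.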

Related

Lists become pd.Series, then again lists with one more dimension

I have another problem with pandas; I doubt I will ever master this library.
First, this is - I think - how zip() is supposed to work with lists:
import numpy as np
import pandas as pd
a = [1,2]
b = [3,4]
print(type(a))
print(type(b))
vv = zip([1,2], [3,4])
for i, v in enumerate(vv):
    print(f"{i}: {v}")
with output:
<class 'list'>
<class 'list'>
0: (1, 3)
1: (2, 4)
Problem. I create a dataframe, with list elements (in the actual code the lists come from grouping ops and I cannot change them, basically they contain all the values in a dataframe grouped by a column).
# create dataframe
values = [{'x': list( (1, 2, 3) ), 'y': list( (4, 5, 6))}]
df = pd.DataFrame.from_dict(values)
print(df)
           x          y
0  [1, 2, 3]  [4, 5, 6]
However, the lists are now pd.Series:
print(type(df["x"]))
<class 'pandas.core.series.Series'>
If I do this:
col1 = df["x"].tolist()
col2 = df["y"].tolist()
print(f"col1 is of type {type(col1)}, with length {len(col1)}, first el is {col1[0]} of type {type(col1[0])}")
col1 is of type <class 'list'>, with length 1, first el is [1, 2, 3] of type <class 'list'>
Basically, the tolist() returned a list of lists (why?):
Indeed:
print("ZIP AND ITER")
vv = zip(col1, col2)
for v in zip(col1, col2):
    print(v)
ZIP AND ITER
([1, 2, 3], [4, 5, 6])
I only need to compute this:
# this fails because x (y) is a list
# df['s'] = [np.sqrt(x**2 + y**2) for x, y in zip(df["x"], df["y"])]
I could use df["x"][0], but that does not seem very elegant.
Question:
How am I supposed to compute sqrt(x^2 + y^2) when x and y are in two columns df["x"] and df["y"]?
This should calculate df['s']
df['s'] = df.apply(lambda row: [np.sqrt(x**2 + y**2) for x, y in zip(row["x"], row["y"])], axis=1)
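The apply call above can also be written with NumPy arrays instead of an inner list comprehension. This is only a sketch, assuming every row holds two lists of equal length; np.hypot computes sqrt(x**2 + y**2) element-wise.

```python
import numpy as np
import pandas as pd

# one row whose cells are lists, as in the question
df = pd.DataFrame([{'x': [1, 2, 3], 'y': [4, 5, 6]}])

# for each row, convert both lists to arrays and take the element-wise hypotenuse
df['s'] = df.apply(
    lambda row: np.hypot(np.array(row['x']), np.array(row['y'])).tolist(),
    axis=1,
)
print(df['s'][0])
```

The new column s still holds one list per row, so the layout of the original dataframe is preserved.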
Basically, the tolist() returned a list of lists (why?):
Because your dataframe has only one row and two columns, and each column holds a list as its single value. Converting a column to a list of its values therefore returns a list with one element: the list stored in that row.
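A quick check of that explanation, as a sketch using the same data as the question (built here with the plain DataFrame constructor): the frame has one row, so each column converts to a one-element list whose only element is the original list.

```python
import pandas as pd

values = [{'x': [1, 2, 3], 'y': [4, 5, 6]}]
df = pd.DataFrame(values)

print(df.shape)          # (1, 2): one row, two columns
col1 = df['x'].tolist()  # one entry per row, and there is only one row
print(len(col1))         # 1
print(col1[0])           # [1, 2, 3]
```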
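I think you wanted to create a dataframe like this:
values = {'x': list( (1, 2, 3) ), 'y': list( (4, 5, 6))}
df = pd.DataFrame.from_dict(values)
print(df) # yields
   x  y
0  1  4
1  2  5
2  3  6
Wrapping the dictionary in a list, as in your code, creates a single row instead:
values = [{'x': list( (1, 2, 3) ), 'y': list( (4, 5, 6))}]
df = pd.DataFrame.from_dict(values)
print(df) # yields
           x          y
0  [1, 2, 3]  [4, 5, 6]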
One elegant way to compute sqrt(x^2 + y^2) is to convert the dataframe as follows:
new_df = df.iloc[0,:].apply(pd.Series).T.reset_index(drop=True)
This yields the following output:
   x  y
0  1  4
1  2  5
2  3  6
Now compute the sqrt(x^2 + y^2)
np.sqrt(new_df['x']**2 + new_df['y']**2)
This yields:
0    4.123106
1    5.385165
2    6.708204
dtype: float64

Scripting a simple counter

I wanted to create a simple script which counts, row by row, how often the value in one column is higher than the value in the other column:
d = {'a': [1, 3], 'b': [0, 2]}
df = pd.DataFrame(data=d, index=[1, 2])
print(df)
   a  b
1  1  0
2  3  2
My function:
def diff(dataframe):
    a_counter=0
    b_counter=0
    for i in dataframe["a"]:
        for ii in dataframe["b"]:
            if i>ii:
                a_counter+=1
            elif ii>i:
                b_counter+=1
    return a_counter, b_counter
However
diff(df)
returns (3, 1), instead of (2,0). I know the problem is that every single value of one column gets compared to every value of the other column (e.g. 1 gets compared to 0 and 2 of column b). There probably is a special function for my problem, but can you help me fix my script?
I would suggest adding some helper columns in an intuitive way to help compute the sum of each condition a > b and b > a
A working example based on your code :
import numpy as np
import pandas as pd
d = {'a': [1, 3], 'b': [0, 2]}
df = pd.DataFrame(data=d, index=[1, 2])
def diff(dataframe):
    dataframe['a>b'] = np.where(dataframe['a']>dataframe['b'], 1, 0)
    dataframe['b>a'] = np.where(dataframe['b']>dataframe['a'], 1, 0)
    return dataframe['a>b'].sum(), dataframe['b>a'].sum()
print(diff(df))
>>> (2, 0)
Basically what np.where() does, the way I used it, is that it produces 1 if the condition is met and 0 otherwise. You can then add those columns up using a simple sum() function applied on the desired columns.
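To see the intermediate result that np.where produces here, a small sketch using the same sample data (the variable name flags is only for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 3], 'b': [0, 2]}, index=[1, 2])

# element-wise: 1 where a > b, otherwise 0
flags = np.where(df['a'] > df['b'], 1, 0)
print(flags)        # [1 1]
print(flags.sum())  # 2
```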
Update
Maybe you can use:
>>> df['a'].gt(df['b']).sum(), df['b'].gt(df['a']).sum()
(2, 0)
IIUC, to fix your code:
def diff(dataframe):
    a_counter=0
    b_counter=0
    for i in dataframe["a"]:
        for ii in dataframe["b"]:
            if i>ii:
                a_counter+=1
            elif ii>i:
                b_counter+=1
    # Subtract the minimum of counters
    m = min(a_counter, b_counter)
    return a_counter-m, b_counter-m
Output:
>>> diff(df)
(2, 0)
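Subtracting the minimum gives (2, 0) for this sample, but the nested loops still compare every value of a with every value of b, so the result can differ from a row-by-row count on other data. A sketch of a loop that compares only values from the same row, assuming that is what the question intends (the name diff_rowwise is made up for illustration):

```python
import pandas as pd

def diff_rowwise(dataframe):
    a_counter = 0
    b_counter = 0
    # zip pairs each value of "a" with the value of "b" in the same row
    for i, ii in zip(dataframe["a"], dataframe["b"]):
        if i > ii:
            a_counter += 1
        elif ii > i:
            b_counter += 1
    return a_counter, b_counter

df = pd.DataFrame({'a': [1, 3], 'b': [0, 2]}, index=[1, 2])
print(diff_rowwise(df))  # (2, 0)

# a case where the two rows disagree: 1 < 2 but 3 > 0
df_mixed = pd.DataFrame({'a': [1, 3], 'b': [2, 0]})
print(diff_rowwise(df_mixed))  # (1, 1)
```

On df_mixed the nested-loop version with the minimum subtracted would return (2, 0), which is not the row-by-row count.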
IIUC, you can use the sign of the difference and count the values:
d = {1: 'a', -1: 'b', 0: 'equal'}
(np.sign(df['a'].sub(df['b']))
   .map(d)
   .value_counts()
   .reindex(list(d.values()), fill_value=0)
)
output:
a        2
b        0
equal    0
dtype: int64

pandas groupby ID and select row with minimal value of specific columns

I want to select the whole row in which the minimal value of 3 selected columns is found, in a dataframe like this:
It is supposed to look like this afterwards:
I tried something like
dfcheckminrow = dfquery[dfquery == dfquery['A':'C'].min().groupby('ID')]
Obviously, it didn't work out well.
Thanks in advance!
Bkeesey's answer looks like it almost got you to your solution. I added one more step to get the overall minimum for each group.
import pandas as pd
# create sample df
df = pd.DataFrame({'ID': [1, 1, 2, 2, 3, 3],
                   'A': [30, 14, 100, 67, 1, 20],
                   'B': [10, 1, 2, 5, 100, 3],
                   'C': [1, 2, 3, 4, 5, 6],
                   })
# set "ID" as the index
df = df.set_index('ID')
# get the min for each column
mindf = df[['A','B']].groupby('ID').transform('min')
# get the min between columns and add it to df
df['min'] = mindf.apply(min, axis=1)
# filter df for when A or B matches the min
df2 = df.loc[(df['A'] == df['min']) | (df['B'] == df['min'])]
print(df2)
In my simplified example, I'm just finding the minimum between columns A and B. Here's the output:
      A    B  C  min
ID
1    14    1  2    1
2   100    2  3    2
3     1  100  5    1
One way to filter the initial DataFrame based on a groupby condition is to use transform to find the minimum of each column within an "ID" group, and then use loc to keep the rows where any(axis=1) (checking across each row) is True.
# create sample df
df = pd.DataFrame({'ID': [1, 1, 2, 2, 3, 3],
                   'A': [30, 14, 100, 67, 1, 20],
                   'B': [10, 1, 2, 5, 100, 3]})
# set "ID" as the index
df = df.set_index('ID')
Sample df:
      A    B
ID
1    30   10
1    14    1
2   100    2
2    67    5
3     1  100
3    20    3
Use groupby and transform to find minimum value based on "ID" group.
Then use loc to filter initial df to where any(axis=1) is valid
df.loc[(df == df.groupby('ID').transform('min')).any(axis=1)]
Output:
      A    B
ID
1    14    1
2   100    2
2    67    5
3     1  100
3    20    3
In this example only the first row is removed, because neither of its values is the minimum of its column within the "ID" group.
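If exactly one row per ID is wanted, namely the row that holds the smallest value across the selected columns, another option is to take the row-wise minimum first and then look up its position per group with idxmin. This is a sketch under assumptions: it keeps ID as a regular column, reuses the sample data with columns A, B and C from the first answer, and on ties returns the first matching row.

```python
import pandas as pd

df = pd.DataFrame({'ID': [1, 1, 2, 2, 3, 3],
                   'A': [30, 14, 100, 67, 1, 20],
                   'B': [10, 1, 2, 5, 100, 3],
                   'C': [1, 2, 3, 4, 5, 6]})

# smallest value of A, B, C within each row
row_min = df[['A', 'B', 'C']].min(axis=1)

# index label of the row holding the smallest row_min in each ID group
best_rows = row_min.groupby(df['ID']).idxmin()

result = df.loc[best_rows]
print(result)
```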

Proper way to do this in pandas without using for loop

I would like to avoid iterrows here.
From my dataframe I want to create a new column "unique". Each distinct pair of values in columns "a" and "b" should get a label "uniqueN", and every occurrence of that exact pair should get the same label.
In this case:
"1", "3" (the first row) from "a" and "b" is the first unique pair, so I give it the value "unique1"; the seventh row is also "1", "3", so it gets "unique1" as well.
"2", "2" (the second row) is the next unique "a", "b" pair, so I give it "unique2"; the eighth row is also "2", "2", so it gets "unique2" too.
"3", "1" (third row) is the next unique pair, so it gets "unique3"; no other row in the df is "3", "1", so that value won't repeat.
And so on.
I have working code that uses loops, but this is not the pandas way. Can anyone suggest how I can do this using pandas functions?
Expected output (my code works, but it's not using pandas methods):
   a  b   unique
0  1  3  unique1
1  2  2  unique2
2  3  1  unique3
3  4  2  unique4
4  3  3  unique5
5  4  2  unique4
6  1  3  unique1
7  2  2  unique2
Code
import pandas as pd
df = pd.DataFrame({'a': [1, 2, 3, 4, 3, 4, 1, 2], 'b': [3, 2, 1, 2, 3, 2, 3, 2]})
c = 1
seen = {}
for i, j in df.iterrows():
    j = tuple(j)
    if j not in seen:
        seen[j] = 'unique' + str(c)
        c += 1
for key, value in seen.items():
    df.loc[(df.a == key[0]) & (df.b == key[1]), 'unique'] = value
Let's use groupby ngroup with sort=False to ensure values are enumerated in order of appearance, add 1 so group numbers start at one, then convert to string with astype so we can add the prefix unique to the number:
df['unique'] = 'unique' + \
               df.groupby(['a', 'b'], sort=False).ngroup().add(1).astype(str)
Or with map and format instead of converting and concatenating:
df['unique'] = (
    df.groupby(['a', 'b'], sort=False).ngroup()
        .add(1)
        .map('unique{}'.format)
)
df:
   a  b   unique
0  1  3  unique1
1  2  2  unique2
2  3  1  unique3
3  4  2  unique4
4  3  3  unique5
5  4  2  unique4
6  1  3  unique1
7  2  2  unique2
Setup:
import pandas as pd
df = pd.DataFrame({
    'a': [1, 2, 3, 4, 3, 4, 1, 2], 'b': [3, 2, 1, 2, 3, 2, 3, 2]
})
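To see why sort=False matters in the groupby approach, here is a small sketch comparing the group numbers with and without it, on the same data (the variable names by_appearance and by_key are only for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    'a': [1, 2, 3, 4, 3, 4, 1, 2], 'b': [3, 2, 1, 2, 3, 2, 3, 2]
})

# sort=False numbers the groups in order of first appearance
by_appearance = df.groupby(['a', 'b'], sort=False).ngroup()
# the default (sort=True) numbers them by sorted (a, b) key instead
by_key = df.groupby(['a', 'b']).ngroup()

print(by_appearance.tolist())  # [0, 1, 2, 3, 4, 3, 0, 1]
print(by_key.tolist())         # [0, 1, 2, 4, 3, 4, 0, 1]
```

With the default sorting, the pair (4, 2) would be labeled after (3, 3), which does not match the expected output in the question.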
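I came up with a slightly different solution. I'll add this for posterity, but the groupby answer is superior.
import pandas as pd
df = pd.DataFrame({'a': [1, 2, 3, 4, 3, 4, 1, 2], 'b': [3, 2, 1, 2, 3, 2, 3, 2]})
print(df)
# keep the first occurrence of each (a, b) pair; copy to avoid SettingWithCopyWarning
df1 = df[~df.duplicated()].copy()
print(df1)
# label each distinct pair with the index of its first occurrence (not a "uniqueN" string)
df1['unique'] = df1.index
print(df1)
# merge on the shared columns a and b to propagate the label to every row
df2 = df.merge(df1, how='left')
print(df2)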

how to assign categorical values according to numbers in a column of a dataframe?

I have a data frame with a column 'score'. It contains scores from 1 to 10. I want to create a new column "color" which gives the color corresponding to the score.
For example, if the score is 1, the value of color should be "#75968f"; if the score is 2, the value of color should be "#a5bab7". That is, we need colors ["#75968f", "#a5bab7", "#c9d9d3", "#e2e2e2","#f1d4Af", "#dfccce", "#ddb7b1", "#cc7878", "#933b41", "#550b1d"] for scores [1, 2, 3, 4, 5, 6, 7, 8, 9, 10] respectively.
Is it possible to do this without using a loop?
Let me know in case you have a problem understanding the question.
Use Series.map with a dictionary built by zipping both lists. Alternatively, if the scores are simply 1 up to the length of the colors list, build the dictionary with enumerate:
df = pd.DataFrame({'score':[2,4,6,3,8]})
colors = ["#75968f", "#a5bab7", "#c9d9d3", "#e2e2e2","#f1d4Af",
          "#dfccce", "#ddb7b1", "#cc7878", "#933b41", "#550b1d"]
scores = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
df['new'] = df['score'].map(dict(zip(scores, colors)))
df['new1'] = df['score'].map(dict(enumerate(colors, 1)))
print (df)
   score      new     new1
0      2  #a5bab7  #a5bab7
1      4  #e2e2e2  #e2e2e2
2      6  #dfccce  #dfccce
3      3  #c9d9d3  #c9d9d3
4      8  #cc7878  #cc7878
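Series.map returns NaN for any score that has no key in the dictionary. A sketch of handling that case, assuming a fallback color is acceptable; the score 11, the fallback "#000000" and the name score_to_color are made up for illustration.

```python
import pandas as pd

colors = ["#75968f", "#a5bab7", "#c9d9d3", "#e2e2e2", "#f1d4Af",
          "#dfccce", "#ddb7b1", "#cc7878", "#933b41", "#550b1d"]
# keys 1..10 map to the colors in order
score_to_color = dict(enumerate(colors, 1))

df = pd.DataFrame({'score': [2, 11]})   # 11 has no entry in the mapping
df['color'] = df['score'].map(score_to_color)
print(df['color'].isna().tolist())      # [False, True]

# replace the missing entries with a fallback color
df['color'] = df['color'].fillna("#000000")
print(df['color'].tolist())             # ['#a5bab7', '#000000']
```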
