How to repeatedly loop through list to assign values - python

I have two pandas data frames. Within df1 I have a string column with a finite list of unique values. I want to make those values a list, then loop through it and append a new column onto df2. The values would cycle through the list, starting over from the beginning, for the entire length of the second data frame.
df1
my_value
0 A
1 B
2 C
df2
color
0 red
1 orange
2 yellow
3 green
4 blue
5 indigo
6 violet
7 maroon
8 brown
9 black
What I want
color my_value
0 red A
1 orange B
2 yellow C
3 green A
4 blue B
5 indigo C
6 violet A
7 maroon B
8 brown C
9 black A
#create list
my_list = pd.Series(df1.my_value.values).to_list()
# create column
my_new_column = []
for i in range(len(df2)):
    assigned_value = my_list[i]
    my_new_column.append(assigned_value)
df2['my_new_column'] = my_new_column
return df2
The list and the range have differing lengths, which is where I'm getting hung up.
This is super straightforward and I'm sure I'm looking right past the solution; please feel free to link me to another question if this is answered elsewhere. Thanks for your input!

You can use zip() with itertools.cycle() to cycle through the shorter list/Series:
import itertools
df1 = pd.Series(data=['a', 'b', 'c'], name='my_values')
df2 = pd.Series(data=['red', 'orange', 'yellow', 'green', 'blue', 'indigo', 'violet', 'maroon', 'brown', 'black'], name='color')
df2 = pd.concat([df2, pd.Series([b for a, b in zip(df2, itertools.cycle(df1))], name='my_value')], axis=1)
df2
color my_value
0 red a
1 orange b
2 yellow c
3 green a
4 blue b
5 indigo c
6 violet a
7 maroon b
8 brown c
9 black a
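An alternative sketch (my addition, not part of the original answer, assuming numpy is available and df1/df2 are the frames from the question): index df1's column by df2's row positions modulo len(df1), which avoids the explicit loop entirely.
import numpy as np

# Cycle df1.my_value across all of df2 by taking row positions modulo len(df1).
df2['my_value'] = df1['my_value'].to_numpy()[np.arange(len(df2)) % len(df1)]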

Related

Finding combinations of values across multiple rows and columns in pandas dataframe

I am trying to create all possible sets of values from a pandas dataframe, taking 3 values from each row.
For example, if I consider the dataframe below:
0 1 2 3 4 5 6
Table Black Brown Red Blue Green Amber
Chair Magenta Turquoise White Orange Violet Pink
Window Indigo Yellow Cerulean Grey Peach Aqua
I want to generate all possible solution sets, by taking 3 values from each of the first, second, and third rows.
This is what I tried:
from itertools import product
uniques = [df[i].unique().tolist() for i in df.columns ]
pd.DataFrame(product(*uniques), columns = df.columns)
But this generates all combinations with all 6 columns like this:
0 1 2 3 4 5 6
Table Black Brown Red Blue Green Aqua
Table Black Brown Red Blue Green Violet
Here, all values of Row 1 remain the same except for the last value, and all combinations are generated like this.
But what I need is this:
0 1 2 3 4 5 6 7 8 9
Table Black Red Blue Magenta White Orange Yellow Peach Aqua
Here, the first three values are from Row 1, the second 3 values are from Row 2, and the last 3 values are from Row 3.
Similarly, I want to display all such sets, and store them in a new dataframe.
Any help will be appreciated.
df
###
0 1 2 3 4 5 6
0 Table Black Brown Red Blue Green Amber
1 Chair Magenta Turquoise White Orange Violet Pink
2 Window Indigo Yellow Cerulean Grey Peach Aqua
from itertools import product
import random
uniques = [df[i].unique().tolist() for i in df.columns]
rl = list(product(*uniques))
pd.DataFrame(random.choices(rl))
product() generates all the combinations from sets, but you want a random result from the combinations list.
0 1 2 3 4 5 6
0 Table Black Brown White Orange Peach Aqua
Supplement: combinations with 3 sets.
If you select one element from each set, how many combinations are there?
2 * 2 * 3 = 12
Let's check whether the total number of all combinations is 12 or not.
list_of_lists = [['Yellow','Blue'],['Cat','Dog'],['Swim','Run','Sleep']]
combination = product(*list_of_lists)
combination_list = list(combination)
pd.DataFrame(combination_list)
###
0 1 2
0 Yellow Cat Swim
1 Yellow Cat Run
2 Yellow Cat Sleep
3 Yellow Dog Swim
4 Yellow Dog Run
5 Yellow Dog Sleep
6 Blue Cat Swim
7 Blue Cat Run
8 Blue Cat Sleep
9 Blue Dog Swim
10 Blue Dog Run
11 Blue Dog Sleep
Choosing one row from the above at random is then the way to generate a single set from the combinations.
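For instance, a minimal sketch that draws one such set at random from the combination_list built above (the output varies from run to run):
import random

# Pick one random combination, e.g. ('Blue', 'Dog', 'Run')
random.choice(combination_list)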

Sampling of a pandas dataframe with groupby on one column such that the sample has no duplicates in another column

Assume the below dummy dataframe:
a b
0 red 0
1 red 1
2 red 3
3 blue 0
4 blue 1
5 blue 3
6 black 4
7 black 2
I want to sample on column a, with a sample size of 1. But the only condition is that in the sample created the values of column b should be unique, such that it gives me a result something like:
a b
0 red 0
4 blue 1
6 black 4
The below result is not acceptable:
a b
1 red 1
4 blue 1
6 black 4
Right now I am using the code below with the pandas sample function, but it returns duplicate values in column b:
df.groupby('a').sample(n=1)
sample(n=1) selects one row at random from each group of column 'a'.
You can use df.groupby('a').first(), which chooses the first row of each group and would return the output you are looking for.
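As a supplementary sketch (my addition, not from the original answer): if column b must be strictly unique across a random sample, one option is to simply redraw until no duplicates remain, assuming such a sample exists. groupby.sample requires pandas >= 1.1.
import pandas as pd

df = pd.DataFrame({'a': ['red'] * 3 + ['blue'] * 3 + ['black'] * 2,
                   'b': [0, 1, 3, 0, 1, 3, 4, 2]})

# Redraw random per-group samples until column b is duplicate-free.
while True:
    sample = df.groupby('a').sample(n=1)
    if sample['b'].is_unique:
        break
print(sample)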

Using groupby() and value_counts()

The goal is to identify the count of colors in each group.
As the outcome below shows, the first group's color Blue appears 3 times, the second group's Yellow appears 2 times, and the third group's Blue appears 5 times.
So far this is what I got:
df.groupby(['X','Y','Color']).Color.value_counts()
but this produces only counts of 1, since the color of each row appears once within its (X, Y, Color) group.
The final output should look like the table shown below.
Thanks in advance for any assistance.
If you pass 'size' to the transform function, the per-group count is broadcast back onto every row, giving a tabular result without collapsing the groups.
df['Count'] = df.groupby(['X','Y']).Color.transform('size')
df.set_index(['X','Y'], inplace=True)
df
Color Count
X Y
A B Blue 3
B Blue 3
B Blue 3
C D Yellow 2
D Yellow 2
E F Blue 5
F Blue 5
F Blue 5
F Blue 5
F Blue 5
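For completeness, here is a sketch of input data, reconstructed by me from the output table above since the question does not show its frame, that reproduces the answer end to end:
import pandas as pd

# Input inferred from the output table above.
df = pd.DataFrame({
    'X': ['A'] * 3 + ['C'] * 2 + ['E'] * 5,
    'Y': ['B'] * 3 + ['D'] * 2 + ['F'] * 5,
    'Color': ['Blue'] * 3 + ['Yellow'] * 2 + ['Blue'] * 5,
})

df['Count'] = df.groupby(['X', 'Y']).Color.transform('size')
print(df.set_index(['X', 'Y']))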

How to assign a new descriptive column while concatenating dataframes

I have two data frames that I want to concatenate in Python. However, I want to add another column, type, in order to distinguish between the rows that came from each frame.
Here is my sample data:
import pandas as pd
df = pd.DataFrame({'numbers': [1, 2, 3], 'colors': ['red', 'white', 'blue']},
                  columns=['numbers', 'colors'])
df1 = pd.DataFrame({'numbers': [7, 9, 9], 'colors': ['yellow', 'brown', 'blue']},
                   columns=['numbers', 'colors'])
pd.concat([df,df1])
This code will give me the following result:
numbers colors
0 1 red
1 2 white
2 3 blue
0 7 yellow
1 9 brown
2 9 blue
but what I would like to get is as follows:
numbers colors type
0 1 red first
1 2 white first
2 3 blue first
0 7 yellow second
1 9 brown second
2 9 blue second
type column is going to help me to differentiate between the values of the two data frames.
Can anyone help me with this please?
Use DataFrame.assign for new columns:
df = pd.concat([df.assign(typ='first'),df1.assign(typ='second')])
print (df)
numbers colors typ
0 1 red first
1 2 white first
2 3 blue first
0 7 yellow second
1 9 brown second
2 9 blue second
Using a list comprehension:
df = pd.concat([d.assign(typ=f'id{i}') for i, d in enumerate([df, df1])], ignore_index=True)
numbers colors typ
0 1 red id0
1 2 white id0
2 3 blue id0
3 7 yellow id1
4 9 brown id1
5 9 blue id1
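Another option worth sketching (my addition, not from the original answers): pd.concat accepts keys and names arguments that label each input frame through a MultiIndex level, which can then be moved into a regular column.
# Label each frame via the MultiIndex, then turn the level into a column.
out = (pd.concat([df, df1], keys=['first', 'second'], names=['type'])
         .reset_index(level='type')
         .reset_index(drop=True))
print(out)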

Pandas: find group index of first row matching a predicate in a group, if any

I want to group a DataFrame by some criteria, and then find the integer index in the group (not the DataFrame) of the first row satisfying some predicate. If there is no such row, I want to get NaN.
For example, I group by column a divided by 5 and then in each group, find the index of the first row where column b is "red":
import pandas as pd
import numpy as np
df = pd.DataFrame({'a': range(12), 'b': ['red', 'green', 'blue'] * 4})
a b
0 0 red
1 1 green
2 2 blue
3 3 red
4 4 green
5 5 blue
6 6 red
7 7 green
8 8 blue
9 9 red
10 10 green
11 11 blue
df.groupby(df.a // 5).apply(lambda g: next((idx for idx, row in g.reset_index(drop=True).iterrows() if row.b == "red"), None))
a
0 0
1 1
2 NaN
dtype: float64
(I guess I'm assuming rows stay in the same order as in the original DataFrame, but I can sort the group if needed.) Is there a more concise, efficient way to do this?
This is a bit longer, but IMHO it is more understandable / customizable.
In [126]: df2 = df.copy()
This is your group metric
In [127]: g = df.a//5
A reference to the created groups
In [128]: grp = df.groupby(g)
Create columns holding the generated group and the cumulative count within the group
In [129]: df2['group'] = g
In [130]: df2['count'] = grp.cumcount()
In [131]: df2
Out[131]:
a b group count
0 0 red 0 0
1 1 green 0 1
2 2 blue 0 2
3 3 red 0 3
4 4 green 0 4
5 5 blue 1 0
6 6 red 1 1
7 7 green 1 2
8 8 blue 1 3
9 9 red 1 4
10 10 green 2 0
11 11 blue 2 1
Filtering and grouping gives you back the first element that you want. The count is the within-group count.
In [132]: df2[df2.b=='red'].groupby('group').first()
Out[132]:
a b count
group
0 0 red 0
1 6 red 1
You can generate all of the group keys (in case nothing came back from your filter for a group) this way:
In [133]: df2[df2.b=='red'].groupby('group').first().reindex(grp.groups.keys())
Out[133]:
a b count
group
0 0 red 0
1 6 red 1
2 NaN NaN NaN
Best I could do:
import itertools as it
df.groupby(df.a // 5).apply(lambda group: next(it.chain(np.where(group.b.values == "red")[0], [None])))
The only real difference is using np.where on the values (so I'd expect this to be faster usually), but you may even want to just write your own first_where function and use that.
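A supplementary sketch (my addition, not from the original thread): compute within-group positions with cumcount, keep only the matching rows, take the per-group minimum, and reindex so that groups with no match come back as NaN.
import pandas as pd

df = pd.DataFrame({'a': range(12), 'b': ['red', 'green', 'blue'] * 4})

key = df.a // 5                     # group label for each row
pos = df.groupby(key).cumcount()    # integer position within each group
mask = df.b == 'red'
# Minimum matching position per group; reindex restores empty groups as NaN.
result = pos[mask].groupby(key[mask]).min().reindex(key.unique())
print(result)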
