How to assign a new descriptive column while concatenating dataframes - python

I have two data frames that I want to concatenate in Python. However, I also want to add a column, type, that distinguishes which data frame each row came from.
Here is my sample data:
import pandas as pd

df = pd.DataFrame({'numbers': [1, 2, 3], 'colors': ['red', 'white', 'blue']},
                  columns=['numbers', 'colors'])
df1 = pd.DataFrame({'numbers': [7, 9, 9], 'colors': ['yellow', 'brown', 'blue']},
                   columns=['numbers', 'colors'])
pd.concat([df, df1])
This code will give me the following result:
numbers colors
0 1 red
1 2 white
2 3 blue
0 7 yellow
1 9 brown
2 9 blue
but what I would like to get is as follows:
numbers colors type
0 1 red first
1 2 white first
2 3 blue first
0 7 yellow second
1 9 brown second
2 9 blue second
The type column will help me differentiate between the values of the two data frames.
Can anyone help me with this, please?

Use DataFrame.assign for new columns:
df = pd.concat([df.assign(typ='first'), df1.assign(typ='second')])
print (df)
numbers colors typ
0 1 red first
1 2 white first
2 3 blue first
0 7 yellow second
1 9 brown second
2 9 blue second
Using a list comprehension:
df = pd.concat([d.assign(typ=f'id{i}') for i, d in enumerate([df, df1])], ignore_index=True)
numbers colors typ
0 1 red id0
1 2 white id0
2 3 blue id0
3 7 yellow id1
4 9 brown id1
5 9 blue id1
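If a label in the index is acceptable instead of an extra column, pd.concat's keys parameter is another route (a sketch, not part of the original answers; the 'first'/'second' labels become an index level that is then moved into a typ column):
out = (pd.concat([df, df1], keys=['first', 'second'])
         .rename_axis(index=['typ', None])  # name the key level added by keys=
         .reset_index(level='typ'))         # move it from the index into a column
print(out)
typ numbers colors
0 first 1 red
1 first 2 white
2 first 3 blue
0 second 7 yellow
1 second 9 brown
2 second 9 blue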

Related

Create a matrix from two columns

I'm trying to create a matrix from two columns within an Excel sheet. The first column is a key with multiple repeating instances, and the second column holds the different values tied to each key. I'd like to create a matrix of all the values in the second column that counts how many times each pair of values appears together under the same key.
a b
1 red
1 blue
1 green
2 yellow
2 red
3 blue
3 green
3 yellow
and I'd like to turn this sample dataframe into
color red blue yellow green
red 0 1 1 1
blue 1 0 1 2
yellow 1 1 0 1
green 1 2 1 0
Essentially, I want to use column a in a groupby() to segment each key, then make a running tally of the relationships encountered. I can't quite figure out how to implement a pivot table or a crosstab to accomplish this (if that's even the best route).
Use how='cross' as a parameter of pd.merge. I assume you have no ('a', 'b') duplicates, like two (1, red) rows.
out = (
    pd.merge(df, df, how='cross')
      .query('a_x == a_y & b_x != b_y')[['b_x', 'b_y']]
      .assign(dummy=1)
      .pivot_table('dummy', 'b_x', 'b_y', 'count', fill_value=0)
      .rename_axis(index=None, columns=None)
)
print(out)
# Output:
blue green red yellow
blue 0 2 1 1
green 2 0 1 1
red 1 1 0 1
yellow 1 1 1 0
This looks like a join of the table with itself, so I went with that:
import numpy as np

df = pd.DataFrame({'a': [1, 1, 1, 2, 2, 3, 3, 3],
                   'b': ['red', 'blue', 'green', 'yellow', 'red', 'blue', 'green', 'yellow']})
df_count = (df.merge(df, on='a')
              .groupby(['b_x', 'b_y']).count()
              .reset_index()
              .pivot(index='b_x', columns='b_y', values='a'))
np.fill_diagonal(df_count.values, 0)
df_count.index.name = 'color'
df_count.columns.name = None
blue green red yellow
color
blue 0 2 1 1
green 2 0 1 1
red 1 1 0 1
yellow 1 1 1 0
import numpy as np
import pandas as pd

s = pd.crosstab(df.a, df.b)    # cross-tabulate keys against colors
s = s.T @ s                    # transpose and take the dot product
np.fill_diagonal(s.values, 0)  # fill the diagonal with 0
print(s)
b blue green red yellow
b
blue 0 2 1 1
green 2 0 1 1
red 1 1 0 1
yellow 1 1 1 0
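For readers following the crosstab approach, this is what the intermediate table looks like before the dot product (a sketch using the sample data above; s.T @ s then counts, for every pair of colors, how many keys contain both):
s = pd.crosstab(df.a, df.b)
print(s)
b blue green red yellow
a
1 1 1 1 0
2 0 0 1 1
3 1 1 0 1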

Pandas Groupby Based on Values in Multiple Columns

I have a dataframe that I am trying to use pandas.groupby on to get the cumulative sum. The values that I am grouping by show up in two different columns, and I am having trouble getting the groupby to work correctly. My starting dataframe is:
df = pd.DataFrame({'col_A': ['red', 'red', 'blue', 'red'],
                   'col_B': ['blue', 'red', 'blue', 'red'],
                   'col_A_qty': [1, 1, 1, 1],
                   'col_B_qty': [1, 1, 1, 1]})
col_A col_B col_A_qty col_B_qty
red blue 1 1
red red 1 1
blue blue 1 1
red red 1 1
The result I am trying to get is:
col_A col_B col_A_qty col_B_qty red_cumsum blue_cumsum
red blue 1 1 1 1
red red 1 1 3 1
blue blue 1 1 3 3
red red 1 1 5 3
I've tried:
df.groupby(['col_A', 'col_B'])['col_A_qty'].cumsum()
but this groups on the combination of col_A and col_B. How can I use pandas.groupby to calculate the cumulative sum of red and blue, regardless of whether the value is in col_A or col_B?
Try two pivots:
out = (pd.pivot(df, columns='col_A', values='col_A_qty').fillna(0).cumsum()
         .add(pd.pivot(df, columns='col_B', values='col_B_qty').fillna(0).cumsum(),
              fill_value=0))
Out[404]:
col_A blue red
0 1.0 1.0
1 1.0 3.0
2 3.0 3.0
3 3.0 5.0
df = df.join(out)
A simple method is to define each cumsum column by two Series.cumsum, as follows:
df['red_cumsum'] = df['col_A'].eq('red').cumsum() + df['col_B'].eq('red').cumsum()
df['blue_cumsum'] = df['col_A'].eq('blue').cumsum() + df['col_B'].eq('blue').cumsum()
In each of the columns col_A and col_B, check for values equal to 'red' / 'blue' (the results are boolean Series). Then use Series.cumsum on these boolean Series to get the cumulative counts. You don't really need pandas.groupby in this use case.
If you have multiple items in col_A and col_B, you can also iterate through the unique item list, as follows:
for item in np.unique(df[['col_A', 'col_B']]):
    df[f'{item}_cumsum'] = df['col_A'].eq(item).cumsum() + df['col_B'].eq(item).cumsum()
Result:
print(df)
col_A col_B col_A_qty col_B_qty red_cumsum blue_cumsum
0 red blue 1 1 1 1
1 red red 1 1 3 1
2 blue blue 1 1 3 3
3 red red 1 1 5 3
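If the quantity columns can hold values other than 1 (an assumption; in the sample data they are all 1, so counting and summing coincide), a small variation of the loop above weights each match by its quantity before accumulating:
for item in np.unique(df[['col_A', 'col_B']]):
    df[f'{item}_cumsum'] = (df['col_A'].eq(item).mul(df['col_A_qty']).cumsum()
                            + df['col_B'].eq(item).mul(df['col_B_qty']).cumsum())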

How to replace empty value with value from another row

I am dealing with a dataframe that has some empty cells. For example:
df = pd.DataFrame(data={'id': [1, 2, 3, 1, 2, 3],
                        'id2': [1, 1, 1, 2, 2, 2],
                        'color': ["red", "", "green", "yellow", "", "blue"],
                        'rate': ["good", "", "good", "average", "", "good"]})
id id2 color rate
0 1 1 red good
1 2 1
2 3 1 green good
3 1 2 yellow average
4 2 2
5 3 2 blue good
For both the columns "color" and "rate", I would like to replace the empty cells with the values from the row where id is 1 within the same id2 group. Therefore, I would like the dataframe to look like this in the end:
id id2 color rate
0 1 1 red good
1 2 1 red good
2 3 1 green good
3 1 2 yellow average
4 2 2 yellow average
5 3 2 blue good
I prefer not to fill the empty cells from the previous row. Instead, I would like to specify an id and replace the empty cells with the values from the rows that have that specific id.
IIUC you can groupby and transform with first, and finally assign to empty values:
df.loc[df["color"].eq(""), ["color", "rate"]] = (
    df.groupby(df["id"].eq(1).cumsum())[["color", "rate"]].transform("first")
)
print (df)
id id2 color rate
0 1 1 red good
1 2 1 red good
2 3 1 green good
3 1 2 yellow average
4 2 2 yellow average
5 3 2 blue good
This only works if the number of rows with empty values in [color, rate] equals the number of rows with id == 1, and they appear in the same order. Please expand on the question if that is not the intention.
If I understood you correctly:
empty_rows = df.loc[df['color'] == '', ['color', 'rate']].index
df.loc[empty_rows, ['color', 'rate']] = df.loc[df['id'] == 1, ['color', 'rate']].values
Result df:
id id2 color rate
0 1 1 red good
1 2 1 red good
2 3 1 green good
3 1 2 yellow average
4 2 2 yellow average
5 3 2 blue good
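A variant that does not rely on the empty rows and the id == 1 rows lining up one-to-one (a sketch, assuming there is exactly one id == 1 row per id2 group) builds a lookup keyed on id2 and maps it onto the empty rows:
# reference values: the id == 1 row of each id2 group
ref = df[df['id'].eq(1)].set_index('id2')[['color', 'rate']]
mask = df['color'].eq('')
df.loc[mask, 'color'] = df.loc[mask, 'id2'].map(ref['color']).to_numpy()
df.loc[mask, 'rate'] = df.loc[mask, 'id2'].map(ref['rate']).to_numpy()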

How to repeatedly loop through list to assign values

I have two pandas data frames. Within df1 I have a string column with a finite list of unique values. I want to make those values a list, then loop through it and append a new column onto df2. The values should cycle through the list, starting over from the beginning, for the entire length of the second data frame.
df1
my_value
0 A
1 B
2 C
df2
color
0 red
1 orange
2 yellow
3 green
4 blue
5 indigo
6 violet
7 maroon
8 brown
9 black
What I want
color my_value
0 red A
1 orange B
2 yellow C
3 green A
4 blue B
5 indigo C
6 violet A
7 maroon B
8 brown C
9 black A
# create list
my_list = pd.Series(df1.my_value.values).to_list()

# create column
my_new_column = []
for i in range(len(df2)):
    assigned_value = my_list[i]
    my_new_column.append(assigned_value)

df2['my_new_column'] = my_new_column
return df2
The list and the range have differing lengths, which is where I'm getting hung up: my_list runs out of items before the loop over df2 ends.
This is probably super straightforward and I'm completely looking past the solution; please feel free to link me to another question if this is answered elsewhere. Thanks for your input!
You can use zip with itertools.cycle() to cycle through the smaller list/Series:
import itertools

df1 = pd.Series(data=['a', 'b', 'c'], name='my_values')
df2 = pd.Series(data=['red', 'orange', 'yellow', 'green', 'blue', 'indigo',
                      'violet', 'maroon', 'brown', 'black'], name='color')
df2 = pd.concat([df2, pd.Series([b for a, b in zip(df2, itertools.cycle(df1))],
                                name='my_value')], axis=1)
df2
color my_value
0 red a
1 orange b
2 yellow c
3 green a
4 blue b
5 indigo c
6 violet a
7 maroon b
8 brown c
9 black a
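Another option (a sketch, not from the original answers, assuming df1 and df2 are the original data frames from the question) avoids the explicit loop by repeating df1's values with modular indexing:
import numpy as np

# positions 0, 1, 2, 0, 1, 2, ... for every row of df2
df2['my_value'] = df1['my_value'].to_numpy()[np.arange(len(df2)) % len(df1)]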

Pandas: find group index of first row matching a predicate in a group, if any

I want to group a DataFrame by some criteria, and then find the integer index in the group (not the DataFrame) of the first row satisfying some predicate. If there is no such row, I want to get NaN.
For example, I group by column a divided by 5 and then in each group, find the index of the first row where column b is "red":
import pandas as pd
import numpy as np

df = pd.DataFrame({'a': range(12), 'b': ['red', 'green', 'blue'] * 4})
a b
0 0 red
1 1 green
2 2 blue
3 3 red
4 4 green
5 5 blue
6 6 red
7 7 green
8 8 blue
9 9 red
10 10 green
11 11 blue
df.groupby(df.a // 5).apply(
    lambda g: next((idx for idx, row in g.reset_index(drop=True).iterrows()
                    if row.b == "red"), None)
)
a
0 0
1 1
2 NaN
dtype: float64
(I guess I'm assuming rows stay in the same order as in the original DataFrame, but I can sort the group if needed.) Is there a more concise, efficient way to do this?
This is a bit longer, but IMHO it is more understandable / customizable.
In [126]: df2 = df.copy()
This is your group metric
In [127]: g = df.a//5
A reference to the created groups
In [128]: grp = df.groupby(g)
Create columns with the generated group and the cumulative count within the group
In [129]: df2['group'] = g
In [130]: df2['count'] = grp.cumcount()
In [131]: df2
Out[131]:
a b group count
0 0 red 0 0
1 1 green 0 1
2 2 blue 0 2
3 3 red 0 3
4 4 green 0 4
5 5 blue 1 0
6 6 red 1 1
7 7 green 1 2
8 8 blue 1 3
9 9 red 1 4
10 10 green 2 0
11 11 blue 2 1
Filtering and grouping gives you back the first element that you want. The count is the within-group count.
In [132]: df2[df2.b=='red'].groupby('group').first()
Out[132]:
a b count
group
0 0 red 0
1 6 red 1
You can generate all of the group keys this way (e.g. for groups where nothing came back from your filter):
In [133]: df2[df2.b=='red'].groupby('group').first().reindex(grp.groups.keys())
Out[133]:
a b count
0 0 red 0
1 6 red 1
2 NaN NaN NaN
Best I could do:
import itertools as it

df.groupby(df.a // 5).apply(
    lambda group: next(it.chain(np.where(group.to_numpy() == "red")[0], [None]))
)
The only real difference is using np.where on the values (so I'd expect this to be faster usually), but you may even want to just write your own first_where function and use that.
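A loop-free variant (a sketch, not part of the original answers) gets the within-group position of every row with cumcount, masks out rows that fail the predicate, and takes the per-group minimum, which is NaN for groups with no matching row:
grp = df.a // 5
pos = df.groupby(grp).cumcount()                     # integer position of each row within its group
out = pos.where(df.b.eq('red')).groupby(grp).min()   # first matching position per group, NaN if none
print(out)
a
0    0.0
1    1.0
2    NaN
dtype: float64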
