I have a dataframe as follows:
data
0 a
1 a
2 a
3 a
4 a
5 b
6 b
7 b
8 b
9 b
I want to group the repeating values of a and b into a single row element as follows:
data
0 a
a
a
a
a
1 b
b
b
b
b
How do I go about doing this? I tried the following, but it puts each repeating value in its own column:
df.groupby('data')
This looks like a pivot problem, but since you are missing both the column key (created by cumcount) and the index key (created by factorize), it is hard to see at first:
pd.crosstab(pd.factorize(df.data)[0], df.groupby('data').cumcount(), df.data, aggfunc='sum')
Out[358]:
col_0 0 1 2 3 4
row_0
0 a a a a a
1 b b b b b
Something like:
index = ((df['data'] != df['data'].shift()).cumsum() - 1).rename(None)
df = df.set_index(index)
data
0 a
0 a
0 a
0 a
0 a
1 b
1 b
1 b
1 b
1 b
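A runnable sketch of the shift/cumsum idea above (sample data reconstructed from the question). Note that, unlike factorize, this starts a new index value on every change, so non-adjacent repeats would get distinct labels:

```python
import pandas as pd

df = pd.DataFrame({'data': ['a'] * 5 + ['b'] * 5})

# a new group starts wherever the value differs from the previous row;
# rename(None) drops the Series name so it doesn't become the index name
index = ((df['data'] != df['data'].shift()).cumsum() - 1).rename(None)
df = df.set_index(index)
print(df)
```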
You can use pd.factorize followed by set_index:
df = df.assign(key=pd.factorize(df['data'], sort=False)[0]).set_index('key')
print(df)
data
key
0 a
0 a
0 a
0 a
0 a
1 b
1 b
1 b
1 b
1 b
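For completeness, the factorize answer as a self-contained sketch (sample data reconstructed from the question):

```python
import pandas as pd

df = pd.DataFrame({'data': ['a'] * 5 + ['b'] * 5})

# factorize labels each distinct value in order of first appearance
df = df.assign(key=pd.factorize(df['data'], sort=False)[0]).set_index('key')
print(df)
```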
I have a data frame and an array as follows:
df = pd.DataFrame({'x': range(0, 5), 'y': range(1, 6)})
s = np.array(['a', 'b', 'c'])
I would like to attach the array to every row of the data frame, such that I get a data frame as follows:
What would be the most efficient way to do this?
Just plain assignment:
# replace the first `s` with your desired column names
df[s] = [s]*len(df)
Try this:
for i in s:
    df[i] = i
Output:
x y a b c
0 0 1 a b c
1 1 2 a b c
2 2 3 a b c
3 3 4 a b c
4 4 5 a b c
You could use pandas.concat:
pd.concat([df, pd.DataFrame(s).T], axis=1).ffill()
output:
x y 0 1 2
0 0 1 a b c
1 1 2 a b c
2 2 3 a b c
3 3 4 a b c
4 4 5 a b c
You can try using df.loc here.
df.loc[:, s] = s
print(df)
x y a b c
0 0 1 a b c
1 1 2 a b c
2 2 3 a b c
3 3 4 a b c
4 4 5 a b c
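The plain-assignment approach can be checked with a self-contained sketch (using the df and s from the question):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': range(0, 5), 'y': range(1, 6)})
s = np.array(['a', 'b', 'c'])

# every row of the three new columns gets a copy of s
df[s] = [s] * len(df)
print(df)
```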
I have a DataFrame which looks like this:
df:
A B
1 a
1 a
1 b
2 c
3 d
Now, using this DataFrame, I want to get the following new_df:
new_df:
item val_not_present
1    c    # 1 doesn't have values c and d (values not part of group 1)
1    d
2    a    # 2 doesn't have values a, b and d (values not part of group 2)
2    b
2    d
3    a    # 3 doesn't have values a, b and c (values not part of group 3)
3    b
3    c
or an individual DataFrame for each items like:
df1:
item val_not_present
1 c
1 d
df2:
item val_not_present
2 a
2 b
2 d
df3:
item val_not_present
3 a
3 b
3 c
I want to get all the values which are not part of that group.
You can use np.setdiff1d and explode:
values_b = df.B.unique()
pd.DataFrame(df.groupby("A")["B"].unique()
               .apply(lambda x: np.setdiff1d(values_b, x))
               .rename("val_not_present")
               .explode())
Output:
val_not_present
A
1 c
1 d
2 a
2 b
2 d
3 a
3 b
3 c
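As a self-contained sketch (with the sample data from the question, and a reset_index/rename step added here to match the requested item/val_not_present layout):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 1, 2, 3], 'B': ['a', 'a', 'b', 'c', 'd']})

# for each group, subtract its own values from the full set of B values
values_b = df['B'].unique()
new_df = (df.groupby('A')['B'].unique()
            .apply(lambda x: np.setdiff1d(values_b, x))
            .rename('val_not_present')
            .explode()
            .reset_index()
            .rename(columns={'A': 'item'}))
print(new_df)
```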
Another approach is to use crosstab/pivot_table to get counts, then filter where the count is 0 and transform back to a DataFrame:
m = pd.crosstab(df['A'],df['B'])
pd.DataFrame(m.where(m.eq(0)).stack().index.tolist(), columns=['A', 'val_not_present'])
A val_not_present
0 1 c
1 1 d
2 2 a
3 2 b
4 2 d
5 3 a
6 3 b
7 3 c
You could convert B to a categorical datatype and then compute the value counts. Categorical variables show categories with a frequency count of zero, so you could do something like this:
df['B'] = df['B'].astype('category')
new_df = (
    df.groupby('A')
      .apply(lambda x: x['B'].value_counts())
      .reset_index()
      .query('B == 0')
      .drop(labels='B', axis=1)
      .rename(columns={'level_1': 'val_not_present',
                       'A': 'item'})
)
I have a dataframe that looks like this :
A B C
1 0 0
1 1 0
0 1 0
0 0 1
I want to replace all values with the respective column name, so that the data looks like:
A B C
A 0 0
A B 0
0 B 0
0 0 C
Afterwards, I want to create a column that is a list of all column values like so:
A B C D
A 0 0 ['A','0','0']
A B 0 ['A','B','0']
0 B 0 ['0','B','0']
0 0 C ['0','0','C']
Finally, I want to group by column D and count the number of occurrences for each pattern.
You can do this with mul:
df.mul(df.columns).replace('',0)
Out[63]:
A B C
0 A 0 0
1 A B 0
2 0 B 0
3 0 0 C
# df['D'] = df.mul(df.columns).replace('', 0).values.tolist()
There must be cleaner ways to achieve this, but you can use:
for column in df:
    df[column] = df[column].astype(str).replace("1", column)
df["D"] = df.values.tolist()
Output:
A B C D
0 A 0 0 [A, 0, 0]
1 A B 0 [A, B, 0]
2 0 B 0 [0, B, 0]
3 0 0 C [0, 0, C]
PS: W-B's answer is the cleaner way.
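Neither answer shows the final step the question asks for: grouping by column D and counting each pattern. Since lists are unhashable, one hedged sketch is to convert each row-list to a tuple before counting:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 0, 0], 'B': [0, 1, 1, 0], 'C': [0, 0, 0, 1]})

# replace 1s with the column name, 0s stay as 0
labeled = df.mul(df.columns).replace('', 0)
labeled['D'] = labeled.values.tolist()

# tuples are hashable, so the patterns can be counted directly
counts = labeled['D'].map(tuple).value_counts()
print(counts)
```

With the sample data every pattern occurs once; duplicate rows would be aggregated into a single count.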
I have two dataframes that look like this:
df1: condition
A
A
A
B
B
B
B
df2: condition value
A 1
B 2
I would like to assign to each condition its value, adding a column to df1 in order to obtain:
df1: condition value
A 1
A 1
A 1
B 2
B 2
B 2
B 2
How can I do this? Thank you in advance!
Use map with a Series created by set_index if you only need to append one column:
df1['value'] = df1['condition'].map(df2.set_index('condition')['value'])
print (df1)
condition value
0 A 1
1 A 1
2 A 1
3 B 2
4 B 2
5 B 2
6 B 2
Or use merge with a left join if df2 has more columns:
df = df1.merge(df2, on='condition', how='left')
print (df)
condition value
0 A 1
1 A 1
2 A 1
3 B 2
4 B 2
5 B 2
6 B 2
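The map approach can be checked with a minimal sketch (data reconstructed from the question):

```python
import pandas as pd

df1 = pd.DataFrame({'condition': ['A', 'A', 'A', 'B', 'B', 'B', 'B']})
df2 = pd.DataFrame({'condition': ['A', 'B'], 'value': [1, 2]})

# map looks up each condition in a lookup Series built from df2
df1['value'] = df1['condition'].map(df2.set_index('condition')['value'])
print(df1)
```

Conditions present in df1 but missing from df2 would get NaN with either map or a left merge.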
I have a simple problem but haven't been able to fix it.
I have a simple table such as:
group1
a
a
a
b
b
b
c
c
I can add a count to the column with:
df['count'] = range(1, len(df) + 1)
I have tried to alter this with groupby functions, but I can't manage to make the count restart for each group, like this:
group1 count
a 1
a 2
a 3
b 1
b 2
b 3
c 1
c 2
You could use cumcount. If you want to start from 1, just add 1:
In [16]: df['count'] = df.groupby('group1').cumcount()+1
In [17]: df
Out[17]:
group1 count
0 a 1
1 a 2
2 a 3
3 b 1
4 b 2
5 b 3
6 c 1
7 c 2
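As a self-contained sketch with the sample data from the question:

```python
import pandas as pd

df = pd.DataFrame({'group1': ['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c']})

# cumcount numbers the rows within each group starting at 0; add 1 to start at 1
df['count'] = df.groupby('group1').cumcount() + 1
print(df)
```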