Pandas flatten rows - python

I have a text file that looks like
item1 value1 0
item1 value2 0
item1 value3 0
item2 value1 0
item1 value2 0
item1 value3 0
I'd like to get a pandas dataframe where I have each value as a column, and each item as a row.
E.g.
item | value1 | value2 | value3 | value4...
item1 | 0 | 0 | 0 | NaN
I know how to do it by iterating over the dataframe, but I thought there could be a way to avoid iteration (as that's an anti-pattern), perhaps through groupby?

What you're looking for is called pivoting in pandas nomenclature. Here is a link to the pandas pivot documentation.
You simply have to do this:
df.pivot(index="item", columns="value", values="zero_col")
Change the names according to your dataframe column names.
Edit
I tested it locally and it seems to work, at least in the general case. Although, as @tdy suggested, some cleanup after the pivot operation might be necessary to fit your use case.
A snippet:
import numpy as np
import pandas as pd

c = {"items": np.arange(5), "values": np.arange(5), "zero_cols": np.zeros(5)}
df = pd.DataFrame(c, columns=["items", "values", "zero_cols"])
df.pivot(index="items", columns="values", values="zero_cols")
Here is the result:
values 0 1 2 3 4
items
0 0.0 NaN NaN NaN NaN
1 NaN 0.0 NaN NaN NaN
2 NaN NaN 0.0 NaN NaN
3 NaN NaN NaN 0.0 NaN
4 NaN NaN NaN NaN 0.0

It seems @kalgoritmi's answer works fine for you, but it breaks on my end given your sample data. I'm not sure if this is a versioning issue (I'm on pandas 1.2.3). In any case, this might be useful for others.
If there are duplicate pairs, pivoting immediately will throw a duplicate index ValueError:
>>> df = pd.DataFrame({'item': ['item1','item1','item1','item2','item1','item1'], 'value': ['value1','value2','value3']*2, 'number': 0})
item value number
0 item1 value1 0
1 item1 value2 0
2 item1 value3 0
3 item2 value1 0
4 item1 value2 0
5 item1 value3 0
>>> df.pivot(index='item', columns='value', values='number')
ValueError: Index contains duplicate entries, cannot reshape
One workaround is to aggregate the duplicate pairs before pivoting, e.g. with mean():
>>> df = df.groupby(['item', 'value'], as_index=False).mean()
item value number
0 item1 value1 0
1 item1 value2 0
2 item1 value3 0
3 item2 value1 0
>>> df.pivot(index='item', columns='value', values='number')
value value1 value2 value3
item
item1 0.0 0.0 0.0
item2 0.0 NaN NaN
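A one-step alternative (not part of the answer above) is pivot_table, which aggregates the duplicate pairs itself; mean happens to be its default aggfunc:
>>> df.pivot_table(index='item', columns='value', values='number', aggfunc='mean')
value   value1  value2  value3
item
item1      0.0     0.0     0.0
item2      0.0     NaN     NaN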

Related

Performing Addition on Two Pandas DataFrames by Index Without NaNs

I have a list of Pandas dataframes where I want to add together the rows that have the same index. For example, say we have two dataframes whose indexes are unordered:
Column1 Column2
Item1 1 4
Item3 2 5
Item2 3 6
Column1 Column2
Item1 1 3
Item2 2 4
Is there a way to add these two dataframes together by index to get the following result, with Item3 included? A simple df1 + df2 returns the first two lines correctly, but Item3 ends up with NaNs. Having the results become floats is fine.
# What I want to calculate
Column1 Column2
Item1 2 7
Item2 5 10
Item3 2 5
# What actually calculates
Column1 Column2
Item1 2.0 7.0
Item2 5.0 10.0
Item3 NaN NaN
You can try
df_final = df1.add(df2, fill_value=0)
# float to int (optional; the question says floats are fine)
df_final = df_final.astype(int)
print(df_final)
output
       Column1  Column2
Item1        2        7
Item2        5       10
Item3        2        5
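Since the question mentions a whole list of dataframes, the same idea extends to any number of frames with functools.reduce. A minimal sketch (the frames here are made up to mirror the question):

import functools
import pandas as pd

df1 = pd.DataFrame({'Column1': [1, 2, 3], 'Column2': [4, 5, 6]},
                   index=['Item1', 'Item3', 'Item2'])
df2 = pd.DataFrame({'Column1': [1, 2], 'Column2': [3, 4]},
                   index=['Item1', 'Item2'])

# fill_value=0 keeps indexes that appear in only some of the frames
total = functools.reduce(lambda a, b: a.add(b, fill_value=0), [df1, df2])
print(total)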

Get last non NaN value after groupby and aggregation

I have a data frame like this for example:
col1 col2
0 A 3
1 B 4
2 A NaN
3 B 5
4 A 5
5 A NaN
6 B NaN
.
.
.
47 B 8
48 A 9
49 B NaN
50 A NaN
When I try df.groupby(['col1'], sort=False).agg({'col2':'last'}).reset_index() it gives me this output:
col1 col2
0 A NaN
1 B NaN
I want to get the last non-NaN value after groupby and agg. The desired output is below:
col1 col2
0 A 9
1 B 8
Your solution works fine for me if the NaNs are real missing values (np.nan).
Here is an alternative:
df = df.dropna(subset=['col2']).drop_duplicates('col1', keep='last')
If the NaNs are strings, first convert them to missing values:
df['col2'] = df['col2'].replace('NaN', np.nan)
df.groupby(['col1'], sort=False).agg({'col2':'last'}).reset_index()
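A minimal end-to-end sketch of the second case (the data here is made up to mirror the question):

import numpy as np
import pandas as pd

# the 'NaN' entries are literal strings, not missing values
df = pd.DataFrame({'col1': ['A', 'B', 'A', 'B', 'A', 'B'],
                   'col2': ['3', '4', '9', '8', 'NaN', 'NaN']})
df['col2'] = df['col2'].replace('NaN', np.nan)
# 'last' skips missing values, so it picks 9 for A and 8 for B
print(df.groupby(['col1'], sort=False).agg({'col2': 'last'}).reset_index())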

Pandas: Compare next column value with previous column value

I have the following DataFrame structure with example data:
Col1 Col2 Col3
1 1 8
5 4 7
3 9 9
1 NaN NaN
Columns have a sequential ordering, meaning Col1 comes before Col2 and so on...
I want to compare whether two (or more) subsequent columns have the same value. If so, I want to drop the entire row. NaN values can appear but should not be treated as having the same value.
So with the rows above, I'd like rows 1 and 3 dropped (row 1: Col1 -> Col2 have the same value; row 3: Col2 -> Col3 have the same value) and rows 2 and 4 kept in the dataframe.
How can I accomplish this? Thanks!
Use DataFrame.diff along the columns, test for "not equal 0" with DataFrame.ne (a 0 difference means two neighbouring columns are equal), check that all values per row are True with DataFrame.all, and filter in boolean indexing:
df = df[df.diff(axis=1).ne(0).all(axis=1)]
print (df)
Col1 Col2 Col3
1 5 4.0 7.0
3 1 NaN NaN
Detail:
print (df.diff(axis=1))
Col1 Col2 Col3
0 NaN 0.0 7.0
1 NaN -1.0 3.0
2 NaN 6.0 0.0
3 NaN NaN NaN
print (df.diff(axis=1).ne(0))
Col1 Col2 Col3
0 True False True
1 True True True
2 True True False
3 True True True
print (df.diff(axis=1).ne(0).all(axis=1))
0 False
1 True
2 False
3 True
dtype: bool
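An equivalent formulation (my sketch, not part of the answer above) compares each column with its left neighbour via shift(axis=1). Since NaN never compares equal in eq, NaN pairs are kept automatically, and this variant also works for non-numeric columns:

import numpy as np
import pandas as pd

df = pd.DataFrame({'Col1': [1, 5, 3, 1],
                   'Col2': [1, 4, 9, np.nan],
                   'Col3': [8, 7, 9, np.nan]})

# True wherever a cell equals the cell immediately to its left
has_equal_neighbour = df.eq(df.shift(axis=1)).any(axis=1)
print(df[~has_equal_neighbour])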

Compare each of the column values and return final value based on conditions

I currently have a dataframe which looks like this:
col1 col2 col3
1 2 3
2 3 NaN
3 4 NaN
2 NaN NaN
0 2 NaN
What I want to do is apply some condition to the column values and return the final result in a new column.
The condition is to assign values based on this order of priority, where 2 is the first priority: [2,1,3,0,4]
I tried to define a function to append the final results but wasn't really getting anywhere... any thoughts?
The desired outcome would look something like:
col1 col2 col3 col4
1 2 3 2
2 3 NaN 2
3 4 NaN 3
2 NaN NaN 2
0 2 NaN 2
where col4 is the new column created.
Thanks
First you may want to get rid of the NaNs:
df = df.fillna(5)
and then apply a function to every row to find your value:
def func(x, l=[2, 1, 3, 0, 4, 5]):
    # 5 acts as a sentinel for cells that were NaN
    for j in l:
        if j in x:
            return j

df['new'] = df.apply(lambda x: func(list(x)), axis=1)
Output:
col1 col2 col3 new
0 1 2 3 2
1 2 3 5 2
2 3 4 5 3
3 2 5 5 2
4 0 2 5 2
Maybe a little late.
import numpy as np

def f(x):
    for i in [2, 1, 3, 0, 4]:
        if i in x.tolist():
            return i
    return np.nan

df["col4"] = df.apply(f, axis=1)
and the output:
col1 col2 col3 col4
0 1 2.0 3.0 2
1 2 3.0 NaN 2
2 3 4.0 NaN 3
3 2 NaN NaN 2
4 0 2.0 NaN 2
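A vectorized alternative (my sketch, assuming the same priority list): map every cell to its rank in the priority list, take the smallest rank per row, and translate that rank back to a value:

import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3, 2, 0],
                   'col2': [2, 3, 4, np.nan, 2],
                   'col3': [3, np.nan, np.nan, np.nan, np.nan]})

priority = [2, 1, 3, 0, 4]
rank = {v: i for i, v in enumerate(priority)}  # value -> priority rank
unrank = {i: v for v, i in rank.items()}       # rank -> value

# Series.map leaves NaN as NaN, and min(axis=1) skips NaN
df['col4'] = df.apply(lambda col: col.map(rank)).min(axis=1).map(unrank)
print(df)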

.combine_first for merging multiple rows

I have a pandas dataframe (df) that contains duplicated rows for some records. Some columns in these repeated rows hold NaN values, while the same columns in their duplicates hold values. I would like to merge the duplicated rows so that the missing values are filled in from the duplicates, and then drop the now-redundant rows. For example, the following are duplicated rows:
id col1 col2 col3
0 01 abc 123
9 01 xy
The result should be like:
id col1 col2 col3
0 01 abc xy 123
I tried .combine_first by using df.iloc[0:1,].combine_first(df.iloc[9:10,]) but had no success. Can anybody help me with this? Thanks!
I think you need groupby with forward and back filling of NaNs, then drop_duplicates. (combine_first aligns on the index, so rows labelled 0 and 9 never line up, which is why your attempt failed.)
print (df)
id col1 col2 col3
0 1 abc NaN 123.0
9 1 NaN xy NaN
0 2 abc NaN 17.0
9 2 NaN xr NaN
9 2 NaN xu NaN
df = df.groupby('id').apply(lambda x: x.ffill().bfill()).drop_duplicates()
print (df)
id col1 col2 col3
0 1 abc xy 123.0
0 2 abc xr 17.0
9 2 abc xu 17.0
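If every id should collapse to a single row, as in the two-row example from the question, GroupBy.first is a one-liner alternative (my sketch; note it keeps only one row per id, unlike the answer above, which preserves genuinely different duplicates such as xr/xu):

import numpy as np
import pandas as pd

df = pd.DataFrame({'id': ['01', '01'],
                   'col1': ['abc', np.nan],
                   'col2': [np.nan, 'xy'],
                   'col3': [123.0, np.nan]})

# first() picks the first non-null value per column within each group
print(df.groupby('id', as_index=False).first())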
