I have a data frame where I need to convert all the columns into rows listing each column's unique values
A B C
1 2 2
1 2 3
5 2 9
Desired output
X1 V1
A 1
A 5
B 2
C 2
C 3
C 9
I can get unique values with the unique() function, but I don't know how to get the desired output in pandas
You can use melt and drop_duplicates:
df.melt(var_name='X1', value_name='V1').drop_duplicates()
Output:
X1 V1
0 A 1
2 A 5
3 B 2
6 C 2
7 C 3
8 C 9
P.S. You can add .reset_index(drop=True) if you want the index to be sequential integers
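For reference, here is a minimal runnable sketch of the whole thing, reconstructing the example frame from the question:

import pandas as pd

df = pd.DataFrame({'A': [1, 1, 5], 'B': [2, 2, 2], 'C': [2, 3, 9]})

# Unpivot columns into (X1, V1) pairs, then drop repeated values per column.
out = (df.melt(var_name='X1', value_name='V1')
         .drop_duplicates()
         .reset_index(drop=True))
print(out)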
I'm working in Python, with Pandas DataFrames.
I have a problem where my dataframe looks like this:
Index A B Copy_of_B
1 a 0 0
2 a 1 1
3 a 5 5
4 b 0 0
5 b 4 4
6 c 6 6
My expected output is:
Index A B Copy_of_B
1 a 0 1
2 a 1 1
3 a 5 5
4 b 0 4
5 b 4 4
6 c 6 6
I would like to replace the 0 values in the Copy_of_B column with the values in the following row, but I don't want to use a for loop to iterate.
Is there an easy solution for this?
Thanks,
Barna
I make use of the fact that your DataFrame has an index composed of consecutive numbers.
Start by creating two index objects:
ind = df[df.Copy_of_B == 0].index
ind2 = ind + 1
The first contains index values of rows where Copy_of_B == 0.
The second contains indices of subsequent rows.
Then, to "copy" data from subsequent rows to rows containing zeroes, run:
df.loc[ind, 'Copy_of_B'] = df.loc[ind2, 'Copy_of_B'].tolist()
As you can see, no loop over the whole DataFrame is needed.
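For completeness, a minimal end-to-end sketch of this approach, assuming a default RangeIndex of consecutive integers as in the question (and no zero in the last row, since there would be no following row to copy from):

import pandas as pd

df = pd.DataFrame({'A': list('aaabbc'),
                   'B': [0, 1, 5, 0, 4, 6],
                   'Copy_of_B': [0, 1, 5, 0, 4, 6]})

ind = df[df.Copy_of_B == 0].index  # rows holding a zero
ind2 = ind + 1                     # the rows immediately below them
df.loc[ind, 'Copy_of_B'] = df.loc[ind2, 'Copy_of_B'].tolist()
print(df)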
You can use mask and bfill:
df['Copy_of_B'] = df['B'].mask(df['B'].eq(0)).bfill()
Output:
Index A B Copy_of_B
0 1 a 0 1.0
1 2 a 1 1.0
2 3 a 5 5.0
3 4 b 0 4.0
4 5 b 4 4.0
5 6 c 6 6.0
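Note that mask introduces NaN before bfill fills them, which upcasts the column to float (hence the 1.0, 4.0 above). If you need integers back, you can cast at the end:

df['Copy_of_B'] = df['B'].mask(df['B'].eq(0)).bfill().astype(int)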
I have some dataframes where data is tagged in groups, let's say as such:
df1 = pd.DataFrame({'id':[1,3,7, 10,30, 70, 100, 300], 'name':[1,1,1,1,1,1,1,1], 'tag': [1,1,1, 2,2,2, 3,3]})
df2 = pd.DataFrame({'id':[2,5,6, 20, 50, 200, 500, 600], 'name': [2,2,2,2,2,2,2,2], 'tag':[1,1,1, 2, 2, 3,3,3]})
df3 = pd.DataFrame({'id':[4, 8, 9, 40, 400, 800, 900], 'name': [3,3,3,3,3,3,3], 'tag':[1,1,1, 2, 3, 3,3]})
In each dataframe, the tag is attributed in ascending order of ids (so bigger ids will have equal or bigger tags).
My wish is to recalculate tags in the concatenated dataframe,
df = pd.concat([df1, df2, df3])
so that the tags of the groups are in ascending order of the id of each group's first element. So, the group starting with id=1 will be tagged 1 (that is, ids 1, 3, 7), the group starting with id=2 will be tagged 2 (that is, ids 2, 5, 6), the group starting with 4 will be tagged 3, the group starting with 10 will be tagged 4, and so on.
I did manage to get a (complicated!) solution:
1) Get the first row of each group, put those in a dataframe, sort by id, and create the new tags:
dff = pd.concat([df1.groupby('tag').first(), df2.groupby('tag').first(), df3.groupby('tag').first()])
dff = dff.sort_values('id')
dff = dff.reset_index()
dff['new_tags'] = dff.index +1
2) Concatenate this dataframe with the initial ones, drop_duplicates to keep the newly tagged rows, order by group, then propagate the new tags:
df = pd.concat([dff, df1, df2, df3])
df = df.drop_duplicates(subset=['id', 'tag', 'name'])
df = df.sort_values(['name', 'tag'])
df = df.ffill()
The new tags are exactly what's needed, but my solution seems too complicated. Would you have a suggestion on how to make it easier? I think I must be missing something!
Thanks in advance,
M.
Using pd.concat with keys, I break down the steps:
df = pd.concat([df1, df2, df3], keys=[0, 1, 2])
df = df.reset_index(level=0)             # expose the frame-of-origin key as column 'level_0'
df = df.sort_values(['tag', 'level_0'])  # sort by tag, then by source frame
df['New'] = (df['tag'].diff().ne(0) | df['level_0'].diff().ne(0)).cumsum()
df
Out[110]:
level_0 id name tag New
0 0 1 1 1 1
1 0 3 1 1 1
2 0 7 1 1 1
0 1 2 2 1 2
1 1 5 2 1 2
2 1 6 2 1 2
0 2 4 3 1 3
1 2 8 3 1 3
2 2 9 3 1 3
3 0 10 1 2 4
4 0 30 1 2 4
5 0 70 1 2 4
3 1 20 2 2 5
4 1 50 2 2 5
3 2 40 3 2 6
6 0 100 1 3 7
7 0 300 1 3 7
5 1 200 2 3 8
6 1 500 2 3 8
7 1 600 2 3 8
4 2 400 3 3 9
5 2 800 3 3 9
6 2 900 3 3 9
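If you then want the frame back in id order without the helper column, a small follow-up tidy-up (my addition, not part of the steps above):

df = df.drop(columns='level_0').sort_values('id').reset_index(drop=True)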
Once concatenated, you can groupby the columns 'tag' and 'name' and use transform with 'first' on the column 'id'. Then sort_values this series and cumsum wherever the diff is non-zero, such as:
df = pd.concat([df1, df2, df3]).sort_values('id').reset_index(drop=True)
df['new'] = (df.groupby(['tag','name'])['id'].transform('first')
.sort_values().diff().ne(0.).cumsum())
and you get the expected output:
id name tag new
0 1 1 1 1
1 2 2 1 2
2 3 1 1 1
3 4 3 1 3
4 5 2 1 2
5 6 2 1 2
6 7 1 1 1
7 8 3 1 3
8 9 3 1 3
9 10 1 2 4
10 20 2 2 5
11 30 1 2 4
12 40 3 2 6
...
EDIT: to avoid using groupby, you can use drop_duplicates and index to get the indices of the first ids, create the column new with incremental values using loc and range, and then ffill after sort_values to fill in the remaining values:
df = pd.concat([df1, df2, df3]).sort_values('id').reset_index(drop=True)
list_ind = df.drop_duplicates(['name','tag']).index
df.loc[list_ind,'new'] = range(1,len(list_ind)+1)
df['new'] = df.sort_values(['tag','name'])['new'].ffill().astype(int)
and you get the same result
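As a further alternative (a different technique from either answer, sketched here on the assumption that ids are unique across the frames): densely rank each group's minimum id, which directly yields tags in ascending order of each group's first id:

df = pd.concat([df1, df2, df3])
# Each (name, tag) pair identifies one original group; its min id orders it.
first_id = df.groupby(['name', 'tag'])['id'].transform('min')
df['new'] = first_id.rank(method='dense').astype(int)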
I am new to Python. Here is the question I have, which is really weird to me.
A simple data frame looks like:
a1 = pd.DataFrame({'Hash': [1, 1, 2, 2, 2, 3, 4, 4],
                   'Card': [1, 1, 2, 2, 3, 3, 4, 4]})
I need to group a1 by Hash, calculate how many rows are in each group, then add a column to a1 holding that count. So, I want to use groupby + transform.
When I use:
a1['CustomerCount']=a1.groupby(['Hash']).transform(lambda x: x.shape[0])
The result is correct:
Card Hash CustomerCount
0 1 1 2
1 1 1 2
2 2 2 3
3 2 2 3
4 3 2 3
5 3 3 1
6 4 4 2
7 4 4 2
But when I use:
a1.loc[:,'CustomerCount']=a1.groupby(['Hash']).transform(lambda x: x.shape[0])
The result is:
Card Hash CustomerCount
0 1 1 NaN
1 1 1 NaN
2 2 2 NaN
3 2 2 NaN
4 3 2 NaN
5 3 3 NaN
6 4 4 NaN
7 4 4 NaN
So, why does this happen?
As far as I know, loc and iloc (like a1.loc[:,'CustomerCount']) are preferred over direct column access (like a1['CustomerCount']), so loc and iloc are usually recommended. But why does this happen?
Also, I have tried loc and iloc many times to generate a new column in a data frame. They usually work. So does this have something to do with groupby + transform?
The difference is how loc deals with assigning a DataFrame object to a single column. When you assigned the DataFrame, with its column Card, loc attempted to line up the index and the column name. The column names didn't line up, and you got NaNs. When assigning via direct column access, pandas determined that it was one column for another and just did it.
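To see the shape difference that drives this, compare what the two expressions return (a sketch using the question's a1):

res_df = a1.groupby(['Hash']).transform(lambda x: x.shape[0])  # DataFrame with the single column 'Card'
res_s = a1.groupby(['Hash'])['Card'].transform('size')         # Series, no column name to align
# Assigning res_df via .loc aligns on the column name and misses;
# assigning the Series aligns only on the index and works.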
Reduce to a single column
You can resolve this by reducing the result of the groupby operation to just one column, so that only the index needs to align.
a1.loc[:,'CustomerCount'] = a1.groupby(['Hash']).Card.transform('size')
a1
Hash Card CustomerCount
0 1 1 2
1 1 1 2
2 2 2 3
3 2 2 3
4 2 3 3
5 3 3 1
6 4 4 2
7 4 4 2
Rename the column
Don't really do this; the other approach is far simpler
a1.loc[:, 'CustomerCount'] = a1.groupby('Hash').transform(len).rename(
columns={'Card': 'CustomerCount'})
a1
pd.factorize and np.bincount
What I'd actually do
import numpy as np

f, u = pd.factorize(a1.Hash)             # f: integer codes per row, u: the unique Hash values
a1['CustomerCount'] = np.bincount(f)[f]  # count each code's occurrences, broadcast back per row
a1
Or inline, making a copy:
a1.assign(CustomerCount=(lambda f: np.bincount(f)[f])(pd.factorize(a1.Hash)[0]))
Hash Card CustomerCount
0 1 1 2
1 1 1 2
2 2 2 3
3 2 2 3
4 2 3 3
5 3 3 1
6 4 4 2
7 4 4 2
When I directly append two dataframes with different numbers of columns, an error occurs: pandas.io.common.CParserError: Error tokenizing data. C error: Expected 4 fields in line 242, saw 5. How can I avoid this error in pandas?
I have figured out one naive approach: preprocess the original data so that the numbers of columns are equal.
Can it be done more elegantly? I think the missing columns could be filled with np.nan after appending.
You should be able to concat the dataframes as shown.
You will need to rename the columns to suit your needs.
df1 = pd.DataFrame({'a':[1,2,3,4],'b':[1,2,3,4],'c':[1,2,3,4]})
df2 = pd.DataFrame({'a':[1,2,3,4],'c':[1,2,3,4]})
df = pd.concat([df1,df2])
print('df1')
print(df1)
print('\ndf2')
print(df2)
print('\ndf')
print(df)
Output:
df1
a b c
0 1 1 1
1 2 2 2
2 3 3 3
3 4 4 4
df2
a c
0 1 1
1 2 2
2 3 3
3 4 4
df
a b c
0 1 1.0 1
1 2 2.0 2
2 3 3.0 3
3 4 4.0 4
0 1 NaN 1
1 2 NaN 2
2 3 NaN 3
3 4 NaN 4
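If you also want to avoid the duplicated 0-3 index visible above, pd.concat can renumber the rows as it stacks the frames:

df = pd.concat([df1, df2], ignore_index=True)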