Move values in rows to new columns in pandas - python

I have a DataFrame with an id column and several columns with data, like the column "value" in this example.
For this DataFrame I want to move all the values that correspond to the same id into new columns in that id's row, as shown below.
I guess there is an opposite function to "melt" that allows this, but I'm not getting how to pivot this DF.
The dicts for the input and output DFs are:
d = {"id":[1,1,1,2,2,3,3,4,5],"value":[12,13,1,22,21,23,53,64,9]}
d2 = {"id":[1,2,3,4,5],"value1":[12,22,23,64,9],"value2":[13,21,53,"",""],"value3":[1,"","","",""]}

Create a MultiIndex with cumcount, reshape with unstack, and change the column names with add_prefix:
df = (df.set_index(['id', df.groupby('id').cumcount()])['value']
        .unstack()
        .add_prefix('value')
        .reset_index())
print (df)
   id  value0  value1  value2
0   1    12.0    13.0     1.0
1   2    22.0    21.0     NaN
2   3    23.0    53.0     NaN
3   4    64.0     NaN     NaN
4   5     9.0     NaN     NaN
Missing values can be replaced with fillna, but the result mixes numeric and string data, so some functions may fail:
df = (df.set_index(['id', df.groupby('id').cumcount()])['value']
        .unstack()
        .add_prefix('value')
        .reset_index()
        .fillna(''))
print (df)
   id  value0 value1 value2
0   1    12.0     13      1
1   2    22.0     21
2   3    23.0     53
3   4    64.0
4   5     9.0

You can GroupBy to a list, then expand the series of lists:
df = pd.DataFrame(d) # create input dataframe
res = df.groupby('id')['value'].apply(list).reset_index() # groupby to list
res = res.join(pd.DataFrame(res.pop('value').values.tolist())) # expand lists to columns
print(res)
   id   0     1    2
0   1  12  13.0  1.0
1   2  22  21.0  NaN
2   3  23  53.0  NaN
3   4  64   NaN  NaN
4   5   9   NaN  NaN
In general, such operations will be expensive as the number of columns is arbitrary. Pandas / NumPy solutions work best when you can pre-allocate memory, which isn't possible here.
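As the question notes, this reshape can also be phrased as the "opposite of melt", i.e. a pivot keyed on a per-id counter. A sketch (the helper column name n is my own choice, not from the question):

```python
import pandas as pd

d = {"id": [1, 1, 1, 2, 2, 3, 3, 4, 5],
     "value": [12, 13, 1, 22, 21, 23, 53, 64, 9]}
df = pd.DataFrame(d)

# Number the occurrences of each id (1, 2, 3, ...) and pivot on that key
out = (df.assign(n=df.groupby("id").cumcount() + 1)
         .pivot(index="id", columns="n", values="value")
         .add_prefix("value")
         .reset_index())
print(out)
```

This produces the value1/value2/value3 layout from d2 directly, with NaN (rather than empty strings) for the missing slots.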

Related

Pandas dataframe. Wrong series fly in the line

I want to have 81 rows x 1 column.
How do I correct this distortion?
Use fillna. Basically, use the values in the second column to fill holes in the first column:
df['first_column'].fillna(df['second_column'])
For example, if you have DataFrame df:
a b
0 1.0 NaN
1 2.0 NaN
2 NaN 100.0
then
df['a'] = df['a'].fillna(df['b'])
df = df.drop(columns=['b'])
Output:
a
0 1.0
1 2.0
2 100.0
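As a sketch of an equivalent alternative, Series.combine_first does the same fill: it keeps the caller's values and falls back to the other series where the caller is NaN:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, 2.0, np.nan],
                   "b": [np.nan, np.nan, 100.0]})

# a keeps its own values; holes are filled from b
df["a"] = df["a"].combine_first(df["b"])
df = df.drop(columns=["b"])
print(df)
```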

How to replace part of the data-frame with another data-frame

I have two data frames, and I would like to filter the data and replace a list of columns from df1 with the same columns in df2.
I want to filter this df with df1.loc[df1["name"]=="A"]
first_data={"col1":[2,3,4,5,7],
"col2":[4,2,4,6,4],
"col3":[7,6,9,11,2],
"col4":[14,11,22,8,5],
"name":["A","A","V","A","B"],
"n_roll":[8,2,1,3,9]}
df1=pd.DataFrame.from_dict(first_data)
and put the columns ["col1","col2","n_roll"] where name=="A"
into the same places in df2 (at the same indexes)
sec_df={"col1":[55,0,57,1,3],
"col2":[55,0,4,4,53],
"col3":[55,33,9,0,2],
"col4":[55,0,22,4,5],
"name":["A","A","V","A","B"],
"n_roll":[8,2,1,3,9]}
df2=pd.DataFrame.from_dict(sec_df)
If I pass the list cols=["col1","col2","col3","col4"],
I would like to get this:
data={"col1":[55,0,4,1,7],
"col2":[55,0,4,4,4],
"col3":[55,33,9,0,2],
"col4":[55,0,22,4,5],
"name":["A","A","V","A","B"],
"n_roll":[8,2,1,3,9]}
df=pd.DataFrame.from_dict(data)
df
You can achieve this with a double combine_first.
Combine a filtered version of df1 with df2.
The columns that were excluded from the filtered version of df1 are left behind as NaN values, but that is okay -- just do another combine_first with df2 to fill them in:
(df1.loc[df1['name'] != 'A', ["col1","col2","n_roll"]]
.combine_first(df2)
.combine_first(df2))
Out[1]:
col1 col2 col3 col4 n_roll name
0 55.0 55.0 55.0 55.0 8.0 A
1 0.0 0.0 33.0 0.0 2.0 A
2 4.0 4.0 9.0 22.0 1.0 V
3 1.0 4.0 0.0 4.0 3.0 A
4 7.0 4.0 2.0 5.0 9.0 B
You can achieve this in one line. (DataFrame.append was removed in pandas 2.0, so pd.concat is used here.)
df1 = pd.concat([df1[df1.name != 'A'], df2[df2.name == 'A']]).sort_index()
   col1  col2  col3  col4 name  n_roll
0    55    55    55    55    A       8
1     0     0    33     0    A       2
2     4     4     9    22    V       1
3     1     4     0     4    A       3
4     7     4     2     5    B       9
How it works
d = df1[df1.name != 'A']        # selects df1 rows where name is not A
e = df2[df2.name == 'A']        # selects df2 rows where name is A
pd.concat([d, e])               # combines the dataframes
pd.concat([d, e]).sort_index()  # restores the original row order
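Reading the expected output, whole rows are taken from df2 wherever name is "A". A sketch of that with a boolean mask and .loc, using the frames from the question (this keeps the integer dtypes because no NaN is introduced):

```python
import pandas as pd

df1 = pd.DataFrame({"col1": [2, 3, 4, 5, 7], "col2": [4, 2, 4, 6, 4],
                    "col3": [7, 6, 9, 11, 2], "col4": [14, 11, 22, 8, 5],
                    "name": ["A", "A", "V", "A", "B"], "n_roll": [8, 2, 1, 3, 9]})
df2 = pd.DataFrame({"col1": [55, 0, 57, 1, 3], "col2": [55, 0, 4, 4, 53],
                    "col3": [55, 33, 9, 0, 2], "col4": [55, 0, 22, 4, 5],
                    "name": ["A", "A", "V", "A", "B"], "n_roll": [8, 2, 1, 3, 9]})

mask = df1["name"].eq("A")
out = df1.copy()
out.loc[mask] = df2.loc[mask]  # overwrite the "A" rows with df2's rows
print(out)
```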

Pandas extensive 'describe' include count the null values

I have a large data frame with 450 columns and 550,000 rows.
The columns are:
73 float columns
30 date columns
the remainder object columns
I would like to make a description of my variables, but not only the usual describe; I also want to include other statistics in the same matrix. In the end, we will have a description matrix over the set of 450 variables, with a detailed description of:
- dtype
- count
- count null values
- % number of null values
- max
- min
- 50%
- 75%
- 25%
- ......
For now, I just have the basic function that describes my data like this:
Dataframe.describe(include = 'all')
Do you have a function or method to do a more extensive description?
Thanks.
You need to write custom functions for each statistic and then add them as rows to the final describe DataFrame.
Notice:
The first row of the final df is count - the count function counts non-NaN values.
df = pd.DataFrame({
'A':list('abcdef'),
'B':[4,np.nan,np.nan,5,5,4],
'C':[7,8,9,4,2,3],
'D':[1,3,5,7,1,0],
'E':[5,3,6,9,2,4],
'F':list('aaabbb')
})
print (df)
A B C D E F
0 a 4.0 7 1 5 a
1 b NaN 8 3 3 a
2 c NaN 9 5 6 a
3 d 5.0 4 7 9 b
4 e 5.0 2 1 2 b
5 f 4.0 3 0 4 b
df1 = df.describe(include = 'all')
df1.loc['dtype'] = df.dtypes
df1.loc['size'] = len(df)
df1.loc['% null'] = df.isnull().mean()
print (df1)
              A         B        C        D        E       F
count         6         4        6        6        6       6
unique        6       NaN      NaN      NaN      NaN       2
top           e       NaN      NaN      NaN      NaN       b
freq          1       NaN      NaN      NaN      NaN       3
mean        NaN       4.5      5.5  2.83333  4.83333     NaN
std         NaN   0.57735  2.88097  2.71416  2.48328     NaN
min         NaN         4        2        0        2     NaN
25%         NaN         4     3.25        1     3.25     NaN
50%         NaN       4.5      5.5        2      4.5     NaN
75%         NaN         5     7.75      4.5     5.75     NaN
max         NaN         5        9        7        9     NaN
dtype    object   float64    int64    int64    int64  object
size          6         6        6        6        6       6
% null        0  0.333333        0        0        0       0
In pandas there is no single alternative to describe(), but you can combine its various parameters to get closer to what you need.
By default, describe() on a DataFrame summarizes only the numeric columns. If you think a variable is numeric but it doesn't show up in describe(), change its type with:
df[['col1', 'col2']] = df[['col1', 'col2']].astype(float)
You could also create new columns to hold the numeric part of a mixed-type column, or convert strings to numbers using a dictionary and the map() function.
describe() on a non-numeric Series gives you some statistics (like count, unique and the most frequently-occurring value).
To call describe() on just the objects (strings), use describe(include = ['O']).
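As a sketch of a compact variant of the same idea, you can build the extra statistics per column and join them to the transposed describe(), so each variable becomes a row; the column names n_null and pct_null are my own:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": list("abcdef"),
                   "B": [4, np.nan, np.nan, 5, 5, 4],
                   "C": [7, 8, 9, 4, 2, 3]})

# One row per variable: dtype and null statistics, plus the usual describe()
summary = pd.DataFrame({
    "dtype": df.dtypes,
    "n_null": df.isna().sum(),
    "pct_null": df.isna().mean() * 100,
}).join(df.describe(include="all").T)
print(summary)
```

This layout scales better to 450 columns than the wide describe() matrix, since each variable is a row.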

Reshaping long pandas dataframe

I have a very simple dataframe, made of only one column and the indexes. This is a very long column (52 rows) and I would like to group the items in groups of, let's say, 5 and put indexes and values side by side. Something like going from this
value
index
1 123
2 345
...
...
...
...
...
...
52 567
to this
value value ....
index index ....
1 123 6 ###
2 345 7 ###
3 567 8 ###
4 678 9 ###
5 789 10 ###
All this is for visual clarity, so that I can then simply do df.to_latex() without having to arrange things in LaTeX. Is that possible?
First create a new column from the index with reset_index, then create a MultiIndex by floor division by 5 and reshape with unstack, changing the order of columns with sort_index. Last, flatten the MultiIndex columns with map:
df = pd.DataFrame({
'value': list(range(10, 19))
})
df = (df.reset_index()
        .set_index([df.index % 5, df.index // 5])
        .unstack()
        .sort_index(axis=1, level=1))
df.columns = df.columns.map('{0[0]}_{0[1]}'.format)
print (df)
index_0 value_0 index_1 value_1
0 0.0 10.0 5.0 15.0
1 1.0 11.0 6.0 16.0
2 2.0 12.0 7.0 17.0
3 3.0 13.0 8.0 18.0
4 4.0 14.0 NaN NaN
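A plainer sketch of the same side-by-side layout slices the frame into chunks of 5 rows and concatenates them along the columns (shown on a 9-row frame for brevity):

```python
import pandas as pd

df = pd.DataFrame({"value": list(range(10, 19))})  # 9 rows for brevity

n = 5
# Each chunk carries its original index along as a column via reset_index
chunks = [df.iloc[i:i + n].reset_index() for i in range(0, len(df), n)]
wide = pd.concat(chunks, axis=1)  # (index, value) blocks side by side
print(wide)
```

The duplicated index/value column names are fine for a to_latex() table, and the last block is padded with NaN where the column length is not a multiple of 5.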

Fill Nan based on group

I would like to fill NaN based on a column values' mean.
Example:
   Groups  Temp
1       5    27
2       5    23
3       5   NaN   (will be replaced by 25)
4       1   NaN   (will be replaced by the mean of the Temps that are in group 1)
Any suggestions ? Thanks !
Use groupby and transform with a lambda function combining fillna and mean:
df = df.assign(Temp=df.groupby('Groups')['Temp'].transform(lambda x: x.fillna(x.mean())))
print(df)
Output:
   Groups  Temp
0       5  27.0
1       5  23.0
2       5  25.0
3       1   NaN
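As an equivalent sketch without the lambda, you can fill from the broadcast group means directly:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Groups": [5, 5, 5, 1],
                   "Temp": [27, 23, np.nan, np.nan]})

# transform('mean') broadcasts each group's mean back onto the original rows
df["Temp"] = df["Temp"].fillna(df.groupby("Groups")["Temp"].transform("mean"))
print(df)
```

Note that group 1 has no non-NaN temperatures in this example, so its mean is NaN and the hole stays unfilled, exactly as with the lambda version.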
