Reshaping long pandas dataframe - python

I have a very simple dataframe, made of a single column plus the index. The column is very long (52 rows), and I would like to group the items into groups of, say, 5 and put indexes and values side by side. Something like going from this
       value
index
1        123
2        345
...      ...
52       567
to this
value value ....
index index ....
1 123 6 ###
2 345 7 ###
3 567 8 ###
4 678 9 ###
5 789 10 ###
All for visual clarity, so that I can then simply call df.to_latex() without having to arrange things in LaTeX. Is that possible?

First create a new column from the index with reset_index, then build a MultiIndex from the row position modulo 5 and its floor division by 5, reshape with unstack, and reorder the columns with sort_index. Last, flatten the MultiIndex columns with map:
import pandas as pd

df = pd.DataFrame({
    'value': list(range(10, 19))
})

df = (df.reset_index()
        .set_index([df.index % 5, df.index // 5])
        .unstack()
        .sort_index(axis=1, level=1))
df.columns = df.columns.map('{0[0]}_{0[1]}'.format)
print (df)
index_0 value_0 index_1 value_1
0 0.0 10.0 5.0 15.0
1 1.0 11.0 6.0 16.0
2 2.0 12.0 7.0 17.0
3 3.0 13.0 8.0 18.0
4 4.0 14.0 NaN NaN
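Once the columns are flattened, the OP's stated goal is a one-liner; a minimal sketch (assuming the reshaped df printed above):
# Render the reshaped frame as LaTeX; index=False hides the row index and
# na_rep='' blanks the NaN cells left by the shorter final column pair.
print(df.to_latex(index=False, na_rep=''))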

Normalizing/Adjusting time series dataframe

I am fairly new to Python and pandas; I've been searching for a solution for a couple of days with no luck... here's the problem:
I have a data set like the one below and I need to cull the first few values of some rows so that the highest value in each row ends up in column A. In the example below, rows 0 & 3 would drop the values in column A, and row 4 would drop the values in columns A and B, then shift all remaining values to the left.
A B C D
0 11 23 21 14
1 24 18 17 15
2 22 18 15 13
3 10 13 12 10
4 5 7 14 11
Desired
A B C D
0 23 21 14 NaN
1 24 18 17 15
2 22 18 15 13
3 13 12 10 NaN
4 14 11 NaN NaN
I've looked at df.shift(), but don't see how I can get that function to work on a row-by-row basis. Should I instead be using an array and a loop?
Any help is greatly appreciated.
You need to turn every value to the left of the row max into np.nan, then push the NaNs to the end of each row (I use the approach from @cs95):
df_final = df[df.eq(df.max(1), axis=0).cummax(1)].apply(lambda x: sorted(x, key=pd.isnull), 1)
A B C D
0 23.0 21.0 14.0 NaN
1 24.0 18.0 17.0 15.0
2 22.0 18.0 15.0 13.0
3 13.0 12.0 10.0 NaN
4 14.0 11.0 NaN NaN
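The one-liner packs three steps together; an equivalent spelled-out sketch of the same logic (same df, using result_type='expand' so the sorted lists come back as columns):
import pandas as pd

# True at each row's max and everywhere to its right
mask = df.eq(df.max(axis=1), axis=0).cummax(axis=1)

# NaN out everything left of the max, then push NaNs to the end of each row
df_final = df[mask].apply(lambda row: sorted(row, key=pd.isnull),
                          axis=1, result_type='expand')
df_final.columns = df.columns  # restore the original column labels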
You can loop over the unique shifts (fewer of these than rows) with a groupby and join the results back:
import pandas as pd
shifts = df.to_numpy().argmax(1)
pd.concat([gp.shift(-i, axis=1) for i, gp in df.groupby(shifts)]).sort_index()
A B C D
0 23.0 21.0 14.0 NaN
1 24.0 18.0 17.0 15.0
2 22.0 18.0 15.0 13.0
3 13.0 12.0 10.0 NaN
4 14.0 11.0 NaN NaN
One approach is to convert each row of the data frame to a list (excluding the index) and append NaN values. Then keep N elements, starting with the max value.
import numpy as np
import pandas as pd

ncols = len(df.columns)
nans = [np.nan] * ncols

new_rows = list()
for row in df.itertuples():
    # convert each row of the data frame to a list;
    # start at 1 to exclude the index, and append a list of NaNs
    new_list = list(row[1:]) + nans
    # find index of max value (excluding the NaNs we appended)
    k = np.argmax(new_list[:ncols])
    # collect the new row, starting at the max element
    new_rows.append(new_list[k : k + ncols])

# create new data frame
df_new = pd.DataFrame(new_rows, columns=df.columns)
df_new
import numpy as np

df = df.astype(float)  # allow NaN alongside the integer values

for i in range(df.shape[0]):
    arr = list(df.iloc[i, :])
    c = 0
    # drop leading values until the row max comes first
    while arr[0] != max(arr):
        arr.pop(0)
        c += 1
    # pad with NaN on the right to keep the column count
    arr.extend([np.nan] * c)
    df.iloc[i, :] = arr
print(df)
This loops over every row, finds the max value, removes the values before the max, and pads NaN values at the end so every row keeps the same number of columns.

Move values in rows in a new column in pandas

I have a DataFrame with an id column and several columns with data, like the column "value" in this example.
For this DataFrame I want to move all the values that correspond to the same id into new columns in that id's row, as shown below.
I guess there is a function that does the opposite of "melt" to allow this, but I'm not getting how to pivot this DF.
The dicts for the input and out DFs are:
d = {"id":[1,1,1,2,2,3,3,4,5],"value":[12,13,1,22,21,23,53,64,9]}
d2 = {"id":[1,2,3,4,5],"value1":[12,22,23,64,9],"value2":[1,21,53,"","",],"value3":[1,"","","",""]}
Create a MultiIndex with cumcount, reshape with unstack, and change the column names with add_prefix:
import pandas as pd

df = pd.DataFrame(d)  # build the input frame from the dict above

df = (df.set_index(['id', df.groupby('id').cumcount()])['value']
        .unstack()
        .add_prefix('value')
        .reset_index())
print (df)
id value0 value1 value2
0 1 12.0 13.0 1.0
1 2 22.0 21.0 NaN
2 3 23.0 53.0 NaN
3 4 64.0 NaN NaN
4 5 9.0 NaN NaN
Missing values can be replaced with fillna, but that mixes numeric and string data, so some functions may then fail:
df = pd.DataFrame(d)  # start again from the original frame

df = (df.set_index(['id', df.groupby('id').cumcount()])['value']
        .unstack()
        .add_prefix('value')
        .reset_index()
        .fillna(''))
print (df)
print (df)
id value0 value1 value2
0 1 12.0 13 1
1 2 22.0 21
2 3 23.0 53
3 4 64.0
4 5 9.0
You can GroupBy to a list, then expand the series of lists:
df = pd.DataFrame(d) # create input dataframe
res = df.groupby('id')['value'].apply(list).reset_index() # groupby to list
res = res.join(pd.DataFrame(res.pop('value').values.tolist())) # expand lists to columns
print(res)
id 0 1 2
0 1 12 13.0 1.0
1 2 22 21.0 NaN
2 3 23 53.0 NaN
3 4 64 NaN NaN
4 5 9 NaN NaN
In general, such operations will be expensive as the number of columns is arbitrary. Pandas / NumPy solutions work best when you can pre-allocate memory, which isn't possible here.
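Since the OP reached for "the opposite of melt", the same reshape can also be written as a pivot on a cumcount key; a minimal sketch (assuming the d dict from the question):
import pandas as pd

df = pd.DataFrame(d)
out = (df.assign(key=df.groupby('id').cumcount() + 1)  # 1-based position within each id
         .pivot(index='id', columns='key', values='value')
         .add_prefix('value')
         .reset_index())
print(out)  # columns id, value1, value2, value3, with NaN for the short groups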

Pandas extensive 'describe' include count the null values

I have a large data frame composed of 450 columns and 550,000 rows.
Among the columns I have:
73 float columns
30 date columns
the remaining columns are object
I would like to make a description of my variables, but not only describe() as usual; I want to include other descriptions in the same matrix. In the end, we would have a matrix of descriptions for the set of 450 variables, with a detailed description of:
- dtype
- count
- count null values
- % number of null values
- max
- min
- 50%
- 75%
- 25%
- ......
For now, I have just the basic function that describes my data like this:
Dataframe.describe(include = 'all')
Do you have a function or method to do this more extensive description?
Thanks.
You need to write custom rows as Series and then add them to the final describe DataFrame:
Notice:
The first row of the final df is count, which uses the count function to count non-NaN values.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A':list('abcdef'),
    'B':[4,np.nan,np.nan,5,5,4],
    'C':[7,8,9,4,2,3],
    'D':[1,3,5,7,1,0],
    'E':[5,3,6,9,2,4],
    'F':list('aaabbb')
})
print (df)
A B C D E F
0 a 4.0 7 1 5 a
1 b NaN 8 3 3 a
2 c NaN 9 5 6 a
3 d 5.0 4 7 9 b
4 e 5.0 2 1 2 b
5 f 4.0 3 0 4 b
df1 = df.describe(include = 'all')
df1.loc['dtype'] = df.dtypes
df1.loc['size'] = len(df)
df1.loc['% nulls'] = df.isnull().mean()  # fraction of missing values per column
print (df1)
A B C D E F
count 6 4 6 6 6 6
unique 6 NaN NaN NaN NaN 2
top e NaN NaN NaN NaN b
freq 1 NaN NaN NaN NaN 3
mean NaN 4.5 5.5 2.83333 4.83333 NaN
std NaN 0.57735 2.88097 2.71416 2.48328 NaN
min NaN 4 2 0 2 NaN
25% NaN 4 3.25 1 3.25 NaN
50% NaN 4.5 5.5 2 4.5 NaN
75% NaN 5 7.75 4.5 5.75 NaN
max NaN 5 9 7 9 NaN
dtype object float64 int64 int64 int64 object
size 6 6 6 6 6 6
% nulls 0 0.333333 0 0 0 0
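The OP's wishlist also includes the raw count of null values; the same pattern covers it (and any other per-column statistic):
df1.loc['nulls'] = df.isnull().sum()  # absolute count of missing values per column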
In pandas there is no single more extensive alternative to describe(), and it clearly isn't displaying all the values that you need, but you can combine its parameters with custom rows as shown above.
By default, describe() on a DataFrame only summarizes the numeric columns. If you think you have a numeric variable and it doesn't show up in describe(), change the type with:
df[['col1', 'col2']] = df[['col1', 'col2']].astype(float)
You could also create new columns for handling the numeric part of a mixed-type column, or convert strings to numbers using a dictionary and the map() function.
describe() on a non-numeric Series will give you some statistics (like count, unique and the most frequently-occurring value).
To call describe() on just the objects (strings) use describe(include = ['O']).
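A minimal sketch of the dictionary-plus-map() conversion mentioned above (the column name and mapping here are hypothetical):
import pandas as pd

df = pd.DataFrame({'grade': ['low', 'high', 'medium', 'low']})  # hypothetical column
grade_map = {'low': 0, 'medium': 1, 'high': 2}                  # hypothetical mapping
df['grade_num'] = df['grade'].map(grade_map)
print(df.describe())  # grade_num now shows up in the numeric summary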

Fill Nan based on group

I would like to fill NaN values with the mean of the other values in the same group.
Example:
  Groups  Temp
1      5    27
2      5    23
3      5   NaN   (will be replaced by 25)
4      1   NaN   (will be replaced by the mean of the Temps that are in group 1)
Any suggestions? Thanks!
Use groupby and transform with a lambda function combining fillna and mean:
df = df.assign(Temp=df.groupby('Groups')['Temp'].transform(lambda x: x.fillna(x.mean())))
print(df)
Output:
   Groups  Temp
0       5  27.0
1       5  23.0
2       5  25.0
3       1   NaN
Row 3 stays NaN because group 1 has no non-missing Temp values to average.
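An equivalent idiom skips the lambda by filling straight from the group means; a minimal sketch (fillna aligns on the index, so the transformed means slot into the NaN rows):
df['Temp'] = df['Temp'].fillna(df.groupby('Groups')['Temp'].transform('mean'))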

(pandas) Fill NaN based on groupby and column condition

Using 'bfill' or 'ffill' on a groupby element is trivial, but what if you need to fill the NaN with a specific value from a second column, based on a condition in a third column?
For example:
>>> df=pd.DataFrame({'date':['01/10/2017', '02/09/2017', '02/10/2016','01/10/2017', '01/11/2017', '02/10/2016'], 'a':[1,1,1,2,2,2], 'b':[4,np.nan,6, 5, np.nan, 7]})
>>> df
a b date
0 1 4.0 01/10/2017
1 1 NaN 02/09/2017
2 1 6.0 02/10/2016
3 2 5.0 01/10/2017
4 2 NaN 01/11/2017
5 2 7.0 02/10/2016
I need to group by column 'a', and fill the NaN with the column 'b' value where the date for that row is closest to the date in the NaN row.
So the output should look like:
a b date
0 1 4.0 01/10/2017
1 1 6.0 02/09/2017
2 1 6.0 02/10/2016
3 2 5.0 01/10/2017
4 2 5.0 01/11/2017
5 2 7.0 02/10/2016
Assume there is a closest_date() function that takes the NaN date and the list of other dates in that group, and returns the closest date.
I'm trying to find a clean solution that doesn't have to iterate through rows, ideally able to use apply() with lambdas. Any ideas?
This should work:
df['closest_date_by_a'] = df.groupby('a')['date'].apply(closest_date)
df['b'] = df.groupby(['a', 'closest_date_by_a'])['b'].ffill().bfill()
Given a function (closest_date()), you need to apply that function by group so it calculates the closest dates for rows within each group. Then you can group by both the main grouping column (a) and the closest date column (closest_date_by_a) and perform your filling.
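closest_date() itself is only assumed by the OP; one hypothetical sketch of it takes the whole group so it can see both date and b (you would then apply it with df.groupby('a', group_keys=False).apply(closest_date) instead of selecting ['date'] first, and date must already be parsed as datetime, as the next answer shows):
def closest_date(group):
    # hypothetical helper: for every row, return the date of the nearest
    # row in the same group whose 'b' is not NaN
    valid = group.loc[group['b'].notna(), 'date']
    return group['date'].apply(lambda d: valid.iloc[(valid - d).abs().argmin()])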
Ensure that your date column is in fact of datetime type:
df = pd.DataFrame(
{'date': ['01/10/2017', '02/09/2017', '02/10/2016','01/10/2017', '01/11/2017', '02/10/2016'],
'a':[1,1,1,2,2,2], 'b':[4,np.nan,6, 5, np.nan, 7]})
df.date = pd.to_datetime(df.date)
print(df)
a b date
0 1 4.0 2017-01-10
1 1 NaN 2017-02-09
2 1 6.0 2016-02-10
3 2 5.0 2017-01-10
4 2 NaN 2017-01-11
5 2 7.0 2016-02-10
Use reindex with method='nearest' after dropping the NaNs:
def fill_with_nearest(df):
    # index the group's b values by date, drop the NaNs, then look each
    # date up against its nearest non-NaN neighbour
    s = df.set_index('date').b
    s = s.dropna().reindex(s.index, method='nearest')
    s.index = df.index
    return s

df.loc[df.b.isnull(), 'b'] = df.groupby('a').apply(fill_with_nearest).reset_index(0, drop=True)
print(df)
a b date
0 1 4.0 2017-01-10
1 1 4.0 2017-02-09
2 1 6.0 2016-02-10
3 2 5.0 2017-01-10
4 2 5.0 2017-01-11
5 2 7.0 2016-02-10
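A sketch of another route (a different technique from the answers above) is pd.merge_asof with direction='nearest', which performs a nearest-date join within each group; it assumes the df with parsed dates from above:
import pandas as pd

known = df.dropna(subset=['b']).sort_values('date')            # rows with a known b
missing = df[df['b'].isna()].sort_values('date').reset_index() # rows to fill

# nearest-date match within each group 'a'; merge_asof needs sorted 'on' keys
filled = pd.merge_asof(missing.drop(columns='b'),
                       known[['a', 'date', 'b']],
                       on='date', by='a', direction='nearest')

# put the matches back by original row label
df['b'] = df['b'].fillna(filled.set_index('index')['b'])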
