Concatenate columns skipping pasted rows and columns - python

I hope to describe what I need well. I have a data frame with the same column names and another column that works as an index. The data frame looks as follows:
df = pd.DataFrame({'ID':[1,1,1,1,1,2,2,2,3,3,3,3],'X':[1,2,3,4,5,2,3,4,1,3,4,5],'Y':[1,2,3,4,5,2,3,4,5,4,3,2]})
df
Out[21]:
ID X Y
0 1 1 1
1 1 2 2
2 1 3 3
3 1 4 4
4 1 5 5
5 2 2 2
6 2 3 3
7 2 4 4
8 3 1 5
9 3 3 4
10 3 4 3
11 3 5 2
My intention is to keep X as an index or as a column (it doesn't matter) and append the Y column of each 'ID' as a separate column, in the following way:

You can try
out = pd.concat([group.rename(columns={'Y': f'Y{name}'}) for name, group in df.groupby('ID')])
out.columns = out.columns.str.replace(r'\d+$', '', regex=True)
print(out)
ID X Y Y Y
0 1 1 1.0 NaN NaN
1 1 2 2.0 NaN NaN
2 1 3 3.0 NaN NaN
3 1 4 4.0 NaN NaN
4 1 5 5.0 NaN NaN
5 2 2 NaN 2.0 NaN
6 2 3 NaN 3.0 NaN
7 2 4 NaN 4.0 NaN
8 3 1 NaN NaN 5.0
9 3 3 NaN NaN 4.0
10 3 4 NaN NaN 3.0
11 3 5 NaN NaN 2.0
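If the goal is really to have X as the index with one Y column per ID (an assumption, since the expected output is not shown above), a pivot may be simpler. A minimal sketch, assuming each (X, ID) pair occurs at most once:
# one Y column per ID, indexed by X; NaN where that ID has no row for a given X
wide = df.pivot(index='X', columns='ID', values='Y').add_prefix('Y')
print(wide)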

Here's another way to do it:
df_org = pd.DataFrame({'ID':[1,1,1,1,1,2,2,2,3,3,3,3],
                       'X':[1,2,3,4,5,2,3,4,1,3,4,5],
                       'Y':[1,2,3,4,5,2,3,4,5,4,3,2]})
df = df_org[['ID', 'X']].copy()
for i in set(df_org['ID']):
    df1 = df_org[df_org['ID'] == i][['ID', 'Y']]
    col = 'Y' + str(i)
    df1.columns = ['ID', col]
    df = pd.concat([df, df1[[col]]], axis=1)
df.columns = df.columns.str.replace(r'\d+$', '', regex=True)
print(df)
Output:
ID X Y Y Y
0 1 1 1.0 NaN NaN
1 1 2 2.0 NaN NaN
2 1 3 3.0 NaN NaN
3 1 4 4.0 NaN NaN
4 1 5 5.0 NaN NaN
5 2 2 NaN 2.0 NaN
6 2 3 NaN 3.0 NaN
7 2 4 NaN 4.0 NaN
8 3 1 NaN NaN 5.0
9 3 3 NaN NaN 4.0
10 3 4 NaN NaN 3.0
11 3 5 NaN NaN 2.0

Another solution could be as follows.
Get the unique values of column ID (stored in array s).
Use np.transpose to repeat column ID n times (n == len(s)) and compare the repeated array against s.
Use np.where to replace True with the values from df.Y and False with NaN.
Finally, drop the original df.Y and rename the new columns as required.
import pandas as pd
import numpy as np
df = pd.DataFrame({'ID':[1,1,1,1,1,2,2,2,3,3,3,3],
                   'X':[1,2,3,4,5,2,3,4,1,3,4,5],
                   'Y':[1,2,3,4,5,2,3,4,5,4,3,2]})
s = df.ID.unique()
df[s] = np.where((np.transpose([df.ID]*len(s))==s),
                 np.transpose([df.Y]*len(s)),
                 np.nan)
df.drop('Y', axis=1, inplace=True)
df.rename(columns={k:'Y' for k in s}, inplace=True)
print(df)
ID X Y Y Y
0 1 1 1.0 NaN NaN
1 1 2 2.0 NaN NaN
2 1 3 3.0 NaN NaN
3 1 4 4.0 NaN NaN
4 1 5 5.0 NaN NaN
5 2 2 NaN 2.0 NaN
6 2 3 NaN 3.0 NaN
7 2 4 NaN 4.0 NaN
8 3 1 NaN NaN 5.0
9 3 3 NaN NaN 4.0
10 3 4 NaN NaN 3.0
11 3 5 NaN NaN 2.0
If performance is an issue, this method should be faster than the groupby-based answer above, especially as the number of unique values of ID increases.
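As a rough check of that claim, the two approaches can be timed on the sample data with something like the sketch below (the sample, via_groupby and via_numpy names are made up here, and the absolute numbers depend entirely on data size and library versions):
import timeit

# fresh copy of the example data, since df was modified in place above
sample = pd.DataFrame({'ID':[1,1,1,1,1,2,2,2,3,3,3,3],
                       'X':[1,2,3,4,5,2,3,4,1,3,4,5],
                       'Y':[1,2,3,4,5,2,3,4,5,4,3,2]})

def via_groupby(df):
    out = pd.concat([g.rename(columns={'Y': f'Y{k}'}) for k, g in df.groupby('ID')])
    out.columns = out.columns.str.replace(r'\d+$', '', regex=True)
    return out

def via_numpy(df):
    s = df.ID.unique()
    res = df.copy()
    res[s] = np.where(np.transpose([df.ID] * len(s)) == s,
                      np.transpose([df.Y] * len(s)),
                      np.nan)
    return res.drop(columns='Y').rename(columns={k: 'Y' for k in s})

print(timeit.timeit(lambda: via_groupby(sample), number=100))
print(timeit.timeit(lambda: via_numpy(sample), number=100))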

Related

Pandas Matrix Addition/Division between index values and column header values

I have the following data frame and I want to add the index value to the column header and divide by 2.
The initial grid looks like this:
grid = pd.DataFrame(
columns=[1,2,3,4,5],
index = [1,2,3,4,5]
)
grid
This results in the following:
1 2 3 4 5
1 NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN
3 NaN NaN NaN NaN NaN
4 NaN NaN NaN NaN NaN
5 NaN NaN NaN NaN NaN
What I'm looking to get is shown below. For example, an index of 1 plus a column header of 3 gives 4, which divided by 2 is 2:
1 2 3 4 5
1 1 1.5 2 2.5 3
2 1.5 2 2.5 3 3.5
3 2 2.5 3 3.5 4
4 2.5 3 3.5 4 4.5
5 3 3.5 4 4.5 5
You can use broadcasting here:
grid[:] = (grid.index.values[:,None] + grid.columns.values)/2
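Equivalently, NumPy's outer addition builds the same matrix; a small sketch assuming the index and columns are numeric, as in the example:
import numpy as np
# add every index value to every column header, then halve
grid[:] = np.add.outer(grid.index.to_numpy(), grid.columns.to_numpy()) / 2
print(grid)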

Pandas countif based on multiple conditions, result in new column

How can I add a field that returns 1/0 if the value in any specified column is not NaN?
Example:
df = pd.DataFrame({'id': [1,2,3,4,5,6,7,8,9,10],
'val1': [2,2,np.nan,np.nan,np.nan,1,np.nan,np.nan,np.nan,2],
'val2': [7,0.2,5,8,np.nan,1,0,np.nan,1,1],
})
display(df)
mycols = ['val1', 'val2']
# if entry in mycols != np.nan, then df[row, 'countif'] =1; else 0
Desired output dataframe:
We do not need countif logic in pandas; try notna + any:
df['out'] = df[['val1','val2']].notna().any(axis=1).astype(int)
df
Out[381]:
id val1 val2 out
0 1 2.0 7.0 1
1 2 2.0 0.2 1
2 3 NaN 5.0 1
3 4 NaN 8.0 1
4 5 NaN NaN 0
5 6 1.0 1.0 1
6 7 NaN 0.0 1
7 8 NaN NaN 0
8 9 NaN 1.0 1
9 10 2.0 1.0 1
Using the iloc accessor, filter the last two columns. Check whether the count of non-NaN values in each row is greater than zero. Convert the resulting Boolean to an integer.
df['countif'] = df.iloc[:, 1:].notna().sum(axis=1).gt(0).astype(int)
id val1 val2 countif
0 1 2.0 7.0 1
1 2 2.0 0.2 1
2 3 NaN 5.0 1
3 4 NaN 8.0 1
4 5 NaN NaN 0
5 6 1.0 1.0 1
6 7 NaN 0.0 1
7 8 NaN NaN 0
8 9 NaN 1.0 1
9 10 2.0 1.0 1
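If an actual count of filled columns is wanted rather than a 0/1 flag, the same boolean mask can simply be summed. A small sketch using the mycols list from the question (the n_filled column name is made up here):
df['n_filled'] = df[mycols].notna().sum(axis=1)
print(df)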

Appending Pandas DataFrame in a loop

Let's say I have a df such as this:
df = pd.DataFrame({'A': [1,2,3,4,5], 'A_z': [2,3,4,5,6], 'B': [3,4,5,6,7], 'B_z': [4,5,6,7,8],
'C': [5,6,7,8,9], 'C_z': [6,7,8,9,10]})
Which looks like this:
A A_z B B_z C C_z
0 1 2 3 4 5 6
1 2 3 4 5 6 7
2 3 4 5 6 7 8
3 4 5 6 7 8 9
4 5 6 7 8 9 10
What I'm looking to do is create a new df and for each letter (A,B,C) append this new df vertically with the data from the two columns per letter so that it looks like this:
Letter Letter_z
0 1 2
1 2 3
2 3 4
3 4 5
4 5 6
5 3 4
6 4 5
7 5 6
8 6 7
9 7 8
10 5 6
11 6 7
12 7 8
13 8 9
14 9 10
As far as I'm concerned something like this should work fine:
for col in df.columns:
    if col[-1] != 'z':
        new_df = new_df.append(df[[col, col + '_z']])
However this results in the following mess:
A A_z B B_z C C_z
0 1.0 2.0 NaN NaN NaN NaN
1 2.0 3.0 NaN NaN NaN NaN
2 3.0 4.0 NaN NaN NaN NaN
3 4.0 5.0 NaN NaN NaN NaN
4 5.0 6.0 NaN NaN NaN NaN
0 NaN NaN 3.0 4.0 NaN NaN
1 NaN NaN 4.0 5.0 NaN NaN
2 NaN NaN 5.0 6.0 NaN NaN
3 NaN NaN 6.0 7.0 NaN NaN
4 NaN NaN 7.0 8.0 NaN NaN
0 NaN NaN NaN NaN 5.0 6.0
1 NaN NaN NaN NaN 6.0 7.0
2 NaN NaN NaN NaN 7.0 8.0
3 NaN NaN NaN NaN 8.0 9.0
4 NaN NaN NaN NaN 9.0 10.0
What am I doing wrong? Any help would be really appreciated, cheers.
EDIT:
After the kind help from jezrael the renaming of the columns in his answer got me thinking about a possible way to do it using my original train of thought.
I can now also achieve the new df I want using the following:
for col in df:
    if col[-1] != 'z':
        d = df[[col, col + '_z']]
        d.columns = ['Letter', 'Letter_z']
        new_df = new_df.append(d)
The differing column names were clearly what was causing the problem, which is something I wasn't aware of at the time. Hope this helps anyone.
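Note that DataFrame.append was deprecated and later removed (in pandas 2.0), so on newer versions the same loop idea can be written with pd.concat; a minimal sketch along the lines of the edit above:
parts = []
for col in df:
    if not col.endswith('_z'):
        d = df[[col, col + '_z']].copy()
        d.columns = ['Letter', 'Letter_z']
        parts.append(d)
new_df = pd.concat(parts, ignore_index=True)
print(new_df)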
One idea is to use Series.str.split with expand=True to get a MultiIndex, then rename to avoid NaNs and set the final column names, reshape with DataFrame.stack, sort into the correct order with DataFrame.sort_index, and finally drop the helper index:
df.columns = df.columns.str.split('_', expand=True)
df = df.rename(columns=lambda x:'Letter_z' if x == 'z' else 'Letter', level=1)
df = df.stack(0).sort_index(level=[1,0]).reset_index(drop=True)
print (df)
Letter Letter_z
0 1 2
1 2 3
2 3 4
3 4 5
4 5 6
5 3 4
6 4 5
7 5 6
8 6 7
9 7 8
10 5 6
11 6 7
12 7 8
13 8 9
14 9 10
Or, if possible, simplify the problem by reshaping all non-z values into one column and all z values into another using numpy.ravel:
m = df.columns.str.endswith('_z')
a = df.loc[:, ~m].to_numpy().T.ravel()
b = df.loc[:, m].to_numpy().T.ravel()
df = pd.DataFrame({'Letter': a,'Letter_z': b})
print (df)
Letter Letter_z
0 1 2
1 2 3
2 3 4
3 4 5
4 5 6
5 3 4
6 4 5
7 5 6
8 6 7
9 7 8
10 5 6
11 6 7
12 7 8
13 8 9
14 9 10
You can use the function concat and a list comprehension:
cols = df.columns[~df.columns.str.endswith('_z')]
func = lambda x: 'letter_z' if x.endswith('_z') else 'letter'
pd.concat([df.filter(like=i).rename(func, axis=1) for i in cols])
or
cols = df.columns[~df.columns.str.endswith('_z')]
pd.concat([df.filter(like=i).set_axis(['letter', 'letter_z'], axis=1) for i in cols])

How to remove certain features that have a low completeness rate in a Data frame (Python)

I have a Data Frame with more than 450 variables and more than 500,000 rows. However, some variables have over 90% null values. I would like to delete the features with more than 90% empty rows.
Here is the description of my variables:
Data Frame:
df = pd.DataFrame({
'A':list('abcdefghij'),
'B':[4,np.nan,np.nan,np.nan,np.nan,np.nan, np.nan, np.nan, np.nan, np.nan],
'C':[7,8,np.nan,4,2,3,6,5, 4, 6],
'D':[1,3,5,np.nan,1,0,10,7, np.nan, 5],
'E':[5,3,6,9,2,4,7,3, 5, 9],
'F':list('aaabbbckfr'),
'G':[np.nan,8,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan, np.nan, np.nan]})
print(df)
A B C D E F G
0 a 4.0 7 1 5 a NaN
1 b NaN 8 3 3 a 8.0
2 c NaN NaN 5 6 a NaN
3 d NaN 4 NaN 9 b NaN
4 e NaN 2 1 2 b NaN
5 f NaN 3 0 4 b NaN
6 g NaN 6 10 7 c NaN
7 h NaN 5 7 3 k NaN
8 i NaN 4 NaN 5 f NaN
9 j NaN 6 5 9 r NaN
Describe:
desc = df.describe(include = 'all')
d1 = desc.loc['varType'] = desc.dtypes
d3 = desc.loc['rowsNull'] = df.isnull().sum()
d4 = desc.loc['%rowsNull'] = round((d3/len(df))*100, 2)
print(desc)
A B C D E F G
count 10 1 10 10 10 10 1
unique 10 NaN NaN NaN NaN 6 NaN
top i NaN NaN NaN NaN b NaN
freq 1 NaN NaN NaN NaN 3 NaN
mean NaN 4 5.4 4.3 5.3 NaN 8
std NaN NaN 2.22111 3.16403 2.45176 NaN NaN
min NaN 4 2 0 2 NaN 8
25% NaN 4 4 1.5 3.25 NaN 8
50% NaN 4 5.5 4.5 5 NaN 8
75% NaN 4 6.75 6.5 6.75 NaN 8
max NaN 4 9 10 9 NaN 8
varType object float64 float64 float64 float64 object float64
rowsNull 0 9 1 2 0 0 9
%rowsNull 0 90 10 20 0 0 90
In this example we have just 2 features to delete, 'B' and 'G'.
But in my case I found 40 variables whose '%rowsNull' is greater than 90%. How can I exclude these variables from my modeling?
I have no idea how to do this.
Please help me.
Thanks.
First compare against missing values and then take the mean per column (this works because True is treated as 1), then filter with boolean indexing via loc, since we are removing columns:
df = df.loc[:, df.isnull().mean() <.9]
print (df)
A C D E F
0 a 7.0 1.0 5 a
1 b 8.0 3.0 3 a
2 c NaN 5.0 6 a
3 d 4.0 NaN 9 b
4 e 2.0 1.0 2 b
5 f 3.0 0.0 4 b
6 g 6.0 10.0 7 c
7 h 5.0 7.0 3 k
8 i 4.0 NaN 5 f
9 j 6.0 5.0 9 r
Detail:
print (df.isnull().mean())
A 0.0
B 0.9
C 0.1
D 0.2
E 0.0
F 0.0
G 0.9
dtype: float64
You can find columns with 90% or more null values and drop them:
cols_to_drop = df.columns[df.isnull().sum()/len(df) >= .90]
df.drop(cols_to_drop, axis = 1, inplace = True)
A C D E F
0 a 7.0 1.0 5 a
1 b 8.0 3.0 3 a
2 c NaN 5.0 6 a
3 d 4.0 NaN 9 b
4 e 2.0 1.0 2 b
5 f 3.0 0.0 4 b
6 g 6.0 10.0 7 c
7 h 5.0 7.0 3 k
8 i 4.0 NaN 5 f
9 j 6.0 5.0 9 r
Based on your code, you could do something like
keepCols = desc.columns[desc.loc['%rowsNull'] < 90]
df = df[keepCols]
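DataFrame.dropna can also express this directly through its thresh argument. A sketch, starting from the original df; the threshold below mirrors the "strictly less than 90% null" rule of the first answer:
import math
# minimum number of non-null values a column must have to be kept
thresh = math.floor(len(df) * 0.1) + 1
df_kept = df.dropna(axis=1, thresh=thresh)
print(df_kept)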

Missing data, insert rows in Pandas and fill with NAN

I'm new to Python and Pandas so there might be a simple solution which I don't see.
I have a number of discontinuous datasets which look like this:
ind A B C
0 0.0 1 3
1 0.5 4 2
2 1.0 6 1
3 3.5 2 0
4 4.0 4 5
5 4.5 3 3
I now look for a solution to get the following:
ind A B C
0 0.0 1 3
1 0.5 4 2
2 1.0 6 1
3 1.5 NAN NAN
4 2.0 NAN NAN
5 2.5 NAN NAN
6 3.0 NAN NAN
7 3.5 2 0
8 4.0 4 5
9 4.5 3 3
The problem is,that the gap in A varies from dataset to dataset in position and length...
set_index and reset_index are your friends.
df = pd.DataFrame({"A":[0,0.5,1.0,3.5,4.0,4.5], "B":[1,4,6,2,4,3], "C":[3,2,1,0,5,3]})
First move column A to the index:
In [64]: df.set_index("A")
Out[64]:
B C
A
0.0 1 3
0.5 4 2
1.0 6 1
3.5 2 0
4.0 4 5
4.5 3 3
Then reindex with a new index; here the missing data is filled in with NaNs. We use the Index object since we can name it; this will be used in the next step.
In [66]: new_index = pd.Index(np.arange(0, 5, 0.5), name="A")
In [67]: df.set_index("A").reindex(new_index)
Out[67]:
B C
0.0 1 3
0.5 4 2
1.0 6 1
1.5 NaN NaN
2.0 NaN NaN
2.5 NaN NaN
3.0 NaN NaN
3.5 2 0
4.0 4 5
4.5 3 3
Finally move the index back to the columns with reset_index. Since we named the index, it all works magically:
In [69]: df.set_index("A").reindex(new_index).reset_index()
Out[69]:
A B C
0 0.0 1 3
1 0.5 4 2
2 1.0 6 1
3 1.5 NaN NaN
4 2.0 NaN NaN
5 2.5 NaN NaN
6 3.0 NaN NaN
7 3.5 2 0
8 4.0 4 5
9 4.5 3 3
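The same idea can also be written as a single chain; a sketch that derives the range from the data rather than hard-coding it (the 0.5 step is assumed from the example data):
import numpy as np
import pandas as pd

step = 0.5
# full index from the smallest to the largest A value, in 0.5 increments
full_index = pd.Index(np.arange(df["A"].min(), df["A"].max() + step, step), name="A")
out = df.set_index("A").reindex(full_index).reset_index()
print(out)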
Using EdChum's answer, I created the following function:
def fill_missing_range(df, field, range_from, range_to, range_step=1, fill_with=0):
    return df\
        .merge(how='right', on=field,
               right=pd.DataFrame({field: np.arange(range_from, range_to, range_step)}))\
        .sort_values(by=field).reset_index().fillna(fill_with).drop(['index'], axis=1)
Example usage (note that range_to is exclusive, as in np.arange, so pass 5.0 to include 4.5):
fill_missing_range(df, 'A', 0.0, 5.0, 0.5, np.nan)
In this case I am merging your original df with a newly generated dataframe that contains the full A range, and then re-sorting it:
In [177]:
df.merge(how='right', on='A', right = pd.DataFrame({'A':np.arange(df.iloc[0]['A'], df.iloc[-1]['A'] + 0.5, 0.5)})).sort_values(by='A').reset_index().drop(['index'], axis=1)
Out[177]:
A B C
0 0.0 1 3
1 0.5 4 2
2 1.0 6 1
3 1.5 NaN NaN
4 2.0 NaN NaN
5 2.5 NaN NaN
6 3.0 NaN NaN
7 3.5 2 0
8 4.0 4 5
9 4.5 3 3
So in the general case you can adjust the arange call, which takes a start and end value; note I added 0.5 to the end because the range is half-open (the end value is excluded), and pass a step value.
A more general method could be like this:
In [197]:
df = df.set_index(keys='A', drop=False).reindex(np.arange(df.iloc[0]['A'], df.iloc[-1]['A'] + 0.5, 0.5))
df.reset_index(inplace=True)
df['A'] = df['index']
df.drop(['A'], axis=1, inplace=True)
df.reset_index().drop(['level_0'], axis=1)
Out[197]:
index B C
0 0.0 1 3
1 0.5 4 2
2 1.0 6 1
3 1.5 NaN NaN
4 2.0 NaN NaN
5 2.5 NaN NaN
6 3.0 NaN NaN
7 3.5 2 0
8 4.0 4 5
9 4.5 3 3
Here we set the index to column A but don't drop it and then reindex the df using the arange function.
This question was asked a long time ago, but I have a simple solution that's worth mentioning. You can simply use NumPy's NaN. For instance:
import numpy as np
df.loc[i, j] = np.nan
will do the trick (where i is the row label and j is the column name).
