pd.wide_to_long() lost data - python

I'm very new to Python. I've tried to reshape a data set using pd.wide_to_long. The original dataframe looks like this:
chk1 chk2 chk3 ... chf1 chf2 chf3 id var1 var2
0 3 4 2 ... nan nan nan 1 1 0
1 4 4 4 ... nan nan nan 2 1 0
2 2 nan nan ... 3 4 3 3 0 1
3 3 3 3 ... 3 2 2 4 1 0
I used the following code:
df2 = pd.wide_to_long(df,
                      stubnames=['chk', 'chf'],
                      i=['id', 'var1', 'var2'],
                      j='type')
When I check the data after running this code, it looks like this:
chk chf
id var1 var2 type
1 1 0 1 3 nan
2 4 nan
3 2 nan
4 nan nan
5 4 nan
6 nan nan
7 4 nan
8 4 nan
2 1 0 1 4 nan
2 4 nan
3 4 nan
4 5 nan
But when I check the columns in the new data set, it seems that all columns except 'chk' and 'chf' are gone!
df2.columns
Out[47]: Index(['chk', 'chf'], dtype='object')
for col in df2.columns:
    print(col)
chk
chf
From the dataview it looks like 'id', 'var1', 'var2' have been merged into one common index:
[screenshot of the data view]
Can someone please help me? :)
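For what it's worth, nothing is actually lost here: pd.wide_to_long moves the columns passed as i (plus the new j level) into the row MultiIndex, which is why df2.columns only shows the stub names. A minimal sketch of how to get them back as ordinary columns, using the df2 from above:
df2 = df2.reset_index()
print(df2.columns)
# Index(['id', 'var1', 'var2', 'type', 'chk', 'chf'], dtype='object')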

Related

Concatenate columns skipping pasted rows and columns

I hope I can describe what I need well. I have a data frame where several groups share the same column names, and another column (ID) that works as an index. The data frame looks as follows:
df = pd.DataFrame({'ID':[1,1,1,1,1,2,2,2,3,3,3,3],'X':[1,2,3,4,5,2,3,4,1,3,4,5],'Y':[1,2,3,4,5,2,3,4,5,4,3,2]})
df
Out[21]:
ID X Y
0 1 1 1
1 1 2 2
2 1 3 3
3 1 4 4
4 1 5 5
5 2 2 2
6 2 3 3
7 2 4 4
8 3 1 5
9 3 3 4
10 3 4 3
11 3 5 2
My intention is to keep X as an index or as a column (it doesn't matter) and append the Y column from each 'ID' side by side, one Y column per ID.
You can try:
# give each group's Y a unique temporary name, then stack the groups
out = pd.concat([group.rename(columns={'Y': f'Y{name}'})
                 for name, group in df.groupby('ID')])
# strip the trailing digits so every column is called 'Y' again
out.columns = out.columns.str.replace(r'\d+$', '', regex=True)
print(out)
ID X Y Y Y
0 1 1 1.0 NaN NaN
1 1 2 2.0 NaN NaN
2 1 3 3.0 NaN NaN
3 1 4 4.0 NaN NaN
4 1 5 5.0 NaN NaN
5 2 2 NaN 2.0 NaN
6 2 3 NaN 3.0 NaN
7 2 4 NaN 4.0 NaN
8 3 1 NaN NaN 5.0
9 3 3 NaN NaN 4.0
10 3 4 NaN NaN 3.0
11 3 5 NaN NaN 2.0
Here's another way to do it:
df_org = pd.DataFrame({'ID':[1,1,1,1,1,2,2,2,3,3,3,3],
                       'X':[1,2,3,4,5,2,3,4,1,3,4,5],
                       'Y':[1,2,3,4,5,2,3,4,5,4,3,2]})
df = df_org[['ID', 'X']].copy()
for i in set(df_org['ID']):
    col = 'Y' + str(i)                                   # unique temporary name per ID
    df1 = df_org.loc[df_org['ID'] == i, ['Y']].rename(columns={'Y': col})
    df = pd.concat([df, df1], axis=1)                    # aligns on the row index
df.columns = df.columns.str.replace(r'\d+$', '', regex=True)
print(df)
Output:
ID X Y Y Y
0 1 1 1.0 NaN NaN
1 1 2 2.0 NaN NaN
2 1 3 3.0 NaN NaN
3 1 4 4.0 NaN NaN
4 1 5 5.0 NaN NaN
5 2 2 NaN 2.0 NaN
6 2 3 NaN 3.0 NaN
7 2 4 NaN 4.0 NaN
8 3 1 NaN NaN 5.0
9 3 3 NaN NaN 4.0
10 3 4 NaN NaN 3.0
11 3 5 NaN NaN 2.0
Another solution could be as follows:
Get the unique values of column ID (stored in array s).
Use np.transpose to repeat column ID n times (n == len(s)) and compare the result with s.
Use np.where to replace True with values from df.Y and False with NaN.
Finally, drop the original df.Y and rename the new columns as required.
import pandas as pd
import numpy as np

df = pd.DataFrame({'ID':[1,1,1,1,1,2,2,2,3,3,3,3],
                   'X':[1,2,3,4,5,2,3,4,1,3,4,5],
                   'Y':[1,2,3,4,5,2,3,4,5,4,3,2]})

s = df.ID.unique()
# build a (len(df), len(s)) grid of ID values and compare each column against s
df[s] = np.where(np.transpose([df.ID]*len(s)) == s,
                 np.transpose([df.Y]*len(s)),
                 np.nan)
df.drop('Y', axis=1, inplace=True)
df.rename(columns={k: 'Y' for k in s}, inplace=True)
print(df)
ID X Y Y Y
0 1 1 1.0 NaN NaN
1 1 2 2.0 NaN NaN
2 1 3 3.0 NaN NaN
3 1 4 4.0 NaN NaN
4 1 5 5.0 NaN NaN
5 2 2 NaN 2.0 NaN
6 2 3 NaN 3.0 NaN
7 2 4 NaN 4.0 NaN
8 3 1 NaN NaN 5.0
9 3 3 NaN NaN 4.0
10 3 4 NaN NaN 3.0
11 3 5 NaN NaN 2.0
If performance is an issue, this method should be faster than the loop-based answers above, especially as the number of unique values for ID increases.
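If you want to verify that claim on your own data, here is a minimal timing sketch (the frame size and the number of unique IDs below are arbitrary assumptions, not from the question):
import timeit
import numpy as np
import pandas as pd

big = pd.DataFrame({'ID': np.random.randint(0, 100, 100_000),
                    'X': np.arange(100_000),
                    'Y': np.random.randn(100_000)})

def numpy_where(df):
    # the np.where approach from this answer
    s = df.ID.unique()
    out = df[['ID', 'X']].copy()
    out[s] = np.where(np.transpose([df.ID] * len(s)) == s,
                      np.transpose([df.Y] * len(s)),
                      np.nan)
    return out

def groupby_concat(df):
    # the groupby/concat approach from the first answer
    return pd.concat([g.rename(columns={'Y': f'Y{name}'})
                      for name, g in df.groupby('ID')])

print(timeit.timeit(lambda: numpy_where(big), number=10))
print(timeit.timeit(lambda: groupby_concat(big), number=10))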

Merge a specific row from two dataframes

I have dataframes like this:
df1:
Name A B C
a b r t y U
0 xyz 1 2 3 4 3 4
1 abc 3 5 4 7 7 8
2 pqr 2 4 4 5 4 6
df2:
Name A B C
a b r t y U
0 xyz NaN NaN NaN NaN NaN NaN
1 abc 2 4 5 7 7 9
2 pqr NaN NaN NaN NaN NaN NaN
I want a df like this:
Name A B C
a b r t y U
0 xyz NaN NaN NaN NaN NaN NaN
1 abc 5 9 9 14 14 17
2 pqr NaN NaN NaN NaN NaN NaN
Basically I want the sum of the 'abc' row only.
First check the column names: the first one is the tuple ('Name', '') here (the columns are a MultiIndex), so set it as the index and then sum:
print(df1.columns.tolist())
print(df2.columns.tolist())

df1 = df1.set_index([('Name', '')])
df2 = df2.set_index([('Name', '')])

# set by position instead:
# df1 = df1.set_index([df1.columns[0]])
# df2 = df2.set_index([df2.columns[0]])

df = df1.add(df2)
Or:
df = df1 + df2
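For reference, a minimal self-contained sketch of the same idea; the MultiIndex construction below is a reconstruction of the frames shown in the question:
import numpy as np
import pandas as pd

cols = pd.MultiIndex.from_tuples([('Name', ''), ('A', 'a'), ('A', 'b'),
                                  ('B', 'r'), ('B', 't'),
                                  ('C', 'y'), ('C', 'U')])
df1 = pd.DataFrame([['xyz', 1, 2, 3, 4, 3, 4],
                    ['abc', 3, 5, 4, 7, 7, 8],
                    ['pqr', 2, 4, 4, 5, 4, 6]], columns=cols)
df2 = pd.DataFrame([['xyz'] + [np.nan] * 6,
                    ['abc', 2, 4, 5, 7, 7, 9],
                    ['pqr'] + [np.nan] * 6], columns=cols)

df1 = df1.set_index([('Name', '')])
df2 = df2.set_index([('Name', '')])
print(df1 + df2)   # rows with NaN stay NaN; the 'abc' row is summed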

Create dataframe with hierarchical indices and extra columns from non-hierarchically indexed dataframe

Consider a simple dataframe:
import numpy as np
import pandas as pd
x = pd.DataFrame(np.arange(10).reshape(5,2))
print(x)
0 1
0 0 1
1 2 3
2 4 5
3 6 7
4 8 9
I would like to create a hierarchically indexed dataframe of the form:
0 1
a b a b
0 0 NaN 1 NaN
1 2 NaN 3 NaN
2 4 NaN 5 NaN
3 6 NaN 7 NaN
4 8 NaN 9 NaN
where the 'a' columns correspond to the original dataframe columns and the 'b' columns are blank (or nan).
I can certainly create a hierarchically indexed dataframe with all NaNs and loop over the columns of the original dataframe, writing them into
the new dataframe. Is there something more compact than that?
You can do this with MultiIndex.from_product:
extra_level = ['a', 'b']
new_cols = pd.MultiIndex.from_product([x.columns, extra_level])
x.columns = new_cols[::len(extra_level)]  # keep every first ('a') entry for now
x = x.reindex(columns=new_cols)
print(x)
print(x)
0 1
a b a b
0 0 NaN 1 NaN
1 2 NaN 3 NaN
2 4 NaN 5 NaN
3 6 NaN 7 NaN
4 8 NaN 9 NaN
Very much like @Ben.T, I am using MultiIndex.from_product:
out = (x.assign(l='a')
        .set_index('l', append=True)
        .unstack()
        .reindex(pd.MultiIndex.from_product([x.columns.tolist(), ['a', 'b']]), axis=1))
print(out)
Output:
0 1
a b a b
0 0 NaN 1 NaN
1 2 NaN 3 NaN
2 4 NaN 5 NaN
3 6 NaN 7 NaN
4 8 NaN 9 NaN
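For what it's worth, a third compact variant along the same lines (a sketch, starting again from the original x: pd.concat with a dict key creates the extra column level, and swaplevel moves it underneath the original one):
out = pd.concat({'a': x}, axis=1).swaplevel(axis=1)
out = out.reindex(pd.MultiIndex.from_product([x.columns, ['a', 'b']]), axis=1)
print(out)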

How to insert an empty column after each column in an existing dataframe

I have a dataframe that looks as follows
df = pd.DataFrame({"A":[1,2,3,4],
"B":[3,4,5,6],
"C":[2,3,4,5]})
I would like to insert an empty column (with type string) after each existing column in the dataframe, such that the output looks like
A col1 B col2 C col3
0 1 NaN 3 NaN 2 NaN
1 2 NaN 4 NaN 3 NaN
2 3 NaN 5 NaN 4 NaN
3 4 NaN 6 NaN 5 NaN
Actually there's a much simpler way thanks to reindex:
df.reindex([x for i, c in enumerate(df.columns, 1) for x in (c, f'col{i}')], axis=1)
Result:
A col1 B col2 C col3
0 1 NaN 3 NaN 2 NaN
1 2 NaN 4 NaN 3 NaN
2 3 NaN 5 NaN 4 NaN
3 4 NaN 6 NaN 5 NaN
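The question asks for the new columns to have string dtype, while reindex fills them with float NaN. If that matters, a hedged follow-up (assuming pandas >= 1.0, where the dedicated 'string' dtype exists) is to cast the inserted columns afterwards:
out = df.reindex([x for i, c in enumerate(df.columns, 1) for x in (c, f'col{i}')], axis=1)
empty = [c for c in out.columns if c.startswith('col')]
out[empty] = out[empty].astype('string')   # NaN becomes <NA> with string dtype
print(out.dtypes)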
Here's another, more complicated way:
import numpy as np
df.join(pd.DataFrame(np.empty(df.shape, dtype=object), columns=df.columns + '_sep')).sort_index(axis=1)
A A_sep B B_sep C C_sep
0 1 None 3 None 2 None
1 2 None 4 None 3 None
2 3 None 5 None 4 None
3 4 None 6 None 5 None
This solution worked for me (note it appends a single empty column at the end, rather than one after each existing column):
merged = pd.concat([myDataFrame, pd.DataFrame(columns=[' '])], axis=1)
This is what you can do:
import numpy as np

for count in range(len(df.columns)):
    df.insert(count*2 + 1, 'col' + str(count+1), np.nan)
print(df)
Output:
A col1 B col2 C col3
0 1 NaN 3 NaN 2 NaN
1 2 NaN 4 NaN 3 NaN
2 3 NaN 5 NaN 4 NaN
3 4 NaN 6 NaN 5 NaN

Pandas: replace column A with column B if B is not missing

I have a question similar to a previous post. I want to replace missing values in A with B if B is not missing. I've used a toy dataset.
# Create sample dataset
import pandas as pd
import numpy as np

np.random.seed(12345)
df = pd.DataFrame(np.random.randn(100, 4), columns=list('ABCD'))
df[df < 0] = 'NaN'   # note: this assigns the *string* 'NaN', not np.nan
print(df)
Obs. A B
0 NaN 0.478943
1 NaN NaN
2 1.96578 1.39341
3 0.0929079 0.281746
4 0.769023 1.24643
5 1.00719 NaN
6 0.274992 0.228913
7 1.35292 0.886429
8 NaN NaN
9 1.66903 NaN
#Replace NaN in A with B if B is not NaN
df['A'] = np.where(pd.isnull(df['A']) & pd.notnull(df['B']) == 0, df['B']*1, df['A'])
print(df)
obs A B
0 0.478943 0.478943
1 NaN NaN
2 1.39341 1.39341
3 0.281746 0.281746
4 1.24643 1.24643
5 NaN NaN
6 0.228913 0.228913
7 0.886429 0.886429
8 NaN NaN
9 NaN NaN
This code does the job. But why do I need pd.notnull(df['B']) == 0? If I write:
pd.notnull(df['B'])
instead, the code does not work correctly. The output from that is:
Obs. A B
0 NaN 0.478943
1 NaN NaN
2 1.96578 1.39341
3 0.0929079 0.281746
4 0.769023 1.24643
5 1.00719 NaN
6 0.274992 0.228913
7 1.35292 0.886429
8 NaN NaN
9 1.66903 NaN
I'm trying to understand the flaw in my logic. Any other simple, intuitive code would also be appreciated.
I basically need to do this simple operation for a very large dataset (100M+ observations), so I'm looking for a fast way (in terms of processing time) to do it. Thanks in advance.
Replace the string 'NaN' with np.nan, then fill the missing values in column A from column B with fillna:
df = df.replace('NaN', np.nan)
df['A'] = df['A'].fillna(df['B'])
Output:
A B
0 0.478943 0.478943
1 NaN NaN
2 1.965781 1.393406
3 0.092908 0.281746
4 0.769023 1.246435
5 1.007189 NaN
6 0.274992 0.228913
7 1.352917 0.886429
8 NaN NaN
9 1.669025 NaN
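As for the flaw in the logic of the == 0 version (my reading; worth double-checking): in Python, & binds tighter than ==, so the condition actually parses as (pd.isnull(df['A']) & pd.notnull(df['B'])) == 0. Because df[df < 0] = 'NaN' stores the string 'NaN' rather than a real missing value, pd.isnull is False everywhere, the combined mask is False everywhere, and comparing it to 0 makes it True everywhere; np.where then copies B into A wholesale. That is why row 2, whose A value was not missing, was overwritten in the question's output. A tiny demonstration:
import pandas as pd

s = pd.Series(['NaN', 1.0])        # the string 'NaN' is not detected as missing
print(pd.isnull(s).tolist())       # [False, False]
print(((pd.isnull(s) & pd.isnull(s)) == 0).tolist())   # [True, True]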
