I hope I can describe clearly what I need. I have a data frame with identically named columns and another column ('ID') that works as an index. The data frame looks as follows:
df = pd.DataFrame({'ID':[1,1,1,1,1,2,2,2,3,3,3,3],'X':[1,2,3,4,5,2,3,4,1,3,4,5],'Y':[1,2,3,4,5,2,3,4,5,4,3,2]})
df
Out[21]:
ID X Y
0 1 1 1
1 1 2 2
2 1 3 3
3 1 4 4
4 1 5 5
5 2 2 2
6 2 3 3
7 2 4 4
8 3 1 5
9 3 3 4
10 3 4 3
11 3 5 2
My intention is to copy X as an index or one column (it doesn't matter) and append Y columns from each 'ID' in the following way:
You can try
out = pd.concat([group.rename(columns={'Y': f'Y{name}'}) for name, group in df.groupby('ID')])
out.columns = out.columns.str.replace(r'\d+$', '', regex=True)
print(out)
ID X Y Y Y
0 1 1 1.0 NaN NaN
1 1 2 2.0 NaN NaN
2 1 3 3.0 NaN NaN
3 1 4 4.0 NaN NaN
4 1 5 5.0 NaN NaN
5 2 2 NaN 2.0 NaN
6 2 3 NaN 3.0 NaN
7 2 4 NaN 4.0 NaN
8 3 1 NaN NaN 5.0
9 3 3 NaN NaN 4.0
10 3 4 NaN NaN 3.0
11 3 5 NaN NaN 2.0
Here's another way to do it:
df_org = pd.DataFrame({'ID':[1,1,1,1,1,2,2,2,3,3,3,3],'X':[1,2,3,4,5,2,3,4,1,3,4,5],'Y':[1,2,3,4,5,2,3,4,5,4,3,2]})
df = df_org[['ID', 'X']].copy()
for i in sorted(set(df_org['ID'])):
    col = 'Y' + str(i)
    # Y values of this ID only, under a temporary per-ID column name
    df1 = df_org.loc[df_org['ID'] == i, ['Y']].rename(columns={'Y': col})
    # axis=1 aligns on the row index, so rows of other IDs become NaN
    df = pd.concat([df, df1], axis=1)
df.columns = df.columns.str.replace(r'\d+$', '', regex=True)
print(df)
Output:
ID X Y Y Y
0 1 1 1.0 NaN NaN
1 1 2 2.0 NaN NaN
2 1 3 3.0 NaN NaN
3 1 4 4.0 NaN NaN
4 1 5 5.0 NaN NaN
5 2 2 NaN 2.0 NaN
6 2 3 NaN 3.0 NaN
7 2 4 NaN 4.0 NaN
8 3 1 NaN NaN 5.0
9 3 3 NaN NaN 4.0
10 3 4 NaN NaN 3.0
11 3 5 NaN NaN 2.0
Another solution could be as follows.
Get the unique values of column ID (stored in array s).
Use np.transpose to repeat column ID n times (n == len(s)) and compare the resulting array elementwise with s.
Use np.where to replace True with values from df.Y and False with NaN.
Finally, drop the original df.Y and rename the new columns as required.
import pandas as pd
import numpy as np
df = pd.DataFrame({'ID':[1,1,1,1,1,2,2,2,3,3,3,3],
'X':[1,2,3,4,5,2,3,4,1,3,4,5],
'Y':[1,2,3,4,5,2,3,4,5,4,3,2]})
s = df.ID.unique()
df[s] = np.where((np.transpose([df.ID]*len(s))==s),
np.transpose([df.Y]*len(s)),
np.nan)
df.drop('Y', axis=1, inplace=True)
df.rename(columns={k:'Y' for k in s}, inplace=True)
print(df)
ID X Y Y Y
0 1 1 1.0 NaN NaN
1 1 2 2.0 NaN NaN
2 1 3 3.0 NaN NaN
3 1 4 4.0 NaN NaN
4 1 5 5.0 NaN NaN
5 2 2 NaN 2.0 NaN
6 2 3 NaN 3.0 NaN
7 2 4 NaN 4.0 NaN
8 3 1 NaN NaN 5.0
9 3 3 NaN NaN 4.0
10 3 4 NaN NaN 3.0
11 3 5 NaN NaN 2.0
If performance is an issue, this method should be faster than this answer, especially when the number of unique values for ID increases.
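The comparison step in that answer can also be written with NumPy broadcasting instead of np.transpose. The following is only a sketch on a made-up four-row frame (the data and the names mask and spread are assumptions for illustration, not part of the answer above):

```python
import numpy as np
import pandas as pd

# Hypothetical small frame, just to show the shapes involved
df = pd.DataFrame({'ID': [1, 1, 2, 3], 'Y': [10, 20, 30, 40]})
s = df['ID'].unique()  # unique IDs in order of appearance

# An (n, 1) column of IDs compared with the (k,) array of unique IDs
# broadcasts to an (n, k) boolean mask: one column per unique ID.
mask = df['ID'].to_numpy()[:, None] == s

# Keep Y where the mask is True and put NaN everywhere else.
spread = np.where(mask, df['Y'].to_numpy()[:, None], np.nan)

print(mask)
print(spread)
```

Each row of mask has exactly one True, so each row of spread carries its Y value in the column of its own ID and NaN in the others.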
I am a newbie at Python and programming in general. I hope the following question is well explained.
I have a big dataset with 80+ columns, and some of these columns only have data on a weekly basis. I would like to transform these columns to have values on a daily basis by simply dividing the weekly value by 7 and assigning the result to the value itself and to the 6 other days of that week.
This is what my input dataset looks like:
date col1 col2 col3
02-09-2019 14 NaN 1
09-09-2019 NaN NaN 2
16-09-2019 NaN 7 3
23-09-2019 NaN NaN 4
30-09-2019 NaN NaN 5
07-10-2019 NaN NaN 6
14-10-2019 NaN NaN 7
21-10-2019 21 NaN 8
28-10-2019 NaN NaN 9
04-11-2019 NaN 14 10
11-11-2019 NaN NaN 11
..
This is what the output should look like:
date col1 col2 col3
02-09-2019 2 NaN 1
09-09-2019 2 NaN 2
16-09-2019 2 1 3
23-09-2019 2 1 4
30-09-2019 2 1 5
07-10-2019 2 1 6
14-10-2019 2 1 7
21-10-2019 3 1 8
28-10-2019 3 1 9
04-11-2019 3 2 10
11-11-2019 3 2 11
..
I can't come up with a solution, but here is what I thought might work:
def convert_to_daily(df):
    for column in df.columns.tolist():
        if df[column].isna().any():  # the column has missing values
            for line in range(len(df[column])):
                # check if the value is not empty and is followed by
                # 6 empty values, or some better logic
                # I don't know how to do that.
                pass
I believe you need to select the columns that contain at least one missing value, forward fill the missing values, and divide by 7:
m = df.isna().any()
df.loc[:, m] = df.loc[:, m].ffill(limit=7).div(7)
print (df)
date col1 col2 col3
0 02-09-2019 2.0 NaN 1
1 09-09-2019 2.0 NaN 2
2 16-09-2019 2.0 1.0 3
3 23-09-2019 2.0 1.0 4
4 30-09-2019 2.0 1.0 5
5 07-10-2019 2.0 1.0 6
6 14-10-2019 2.0 1.0 7
7 21-10-2019 3.0 1.0 8
8 28-10-2019 3.0 1.0 9
9 04-11-2019 3.0 2.0 10
10 11-11-2019 3.0 2.0 11
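To see what the column mask and the limit argument of ffill do, here is a minimal sketch on a shortened, made-up frame. The data and limit=2 are assumptions chosen so the effect is visible in a few rows; they are not the values from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: col1 is sparse, col3 is complete
df = pd.DataFrame({
    'col1': [14, np.nan, np.nan, np.nan, 21, np.nan],
    'col3': [1, 2, 3, 4, 5, 6],
})

# Boolean Series indexed by column name: True where a column has any NaN
m = df.isna().any()

# Forward fill at most 2 consecutive NaNs after each known value, then divide
out = df.loc[:, m].ffill(limit=2).div(7)

print(m)
print(out)
```

Only col1 is selected by the mask. The third NaN after 14 stays NaN because limit=2 stops the fill there, which is how limit keeps a value from spilling past the rows it is meant to cover.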
I'm having trouble merging my predicted values into an existing dataframe. I currently have 2 dataframes, one with filenames and the other with prediction values, and both are of the same length. However, when I try merging or concatenating, I don't get the desired output.
Dataframe 1
filename
0 1gBZ9vG1.txt
1 4XztkgDw.txt
2 GfCk8XGZ.txt
3 gfHCMnJM.txt
4 GfLCd17y.txt
5 gFqruhps.txt
6 gfsZpRDu.txt
7 gfT1yDbz.txt
8 GfT9mkJL.txt
9 GFTbJDLn.txt
10 gFwh0Ekb.txt
11 GGB7680Q.txt
12 R7NkR2q2.txt
13 tK2Xmi4C.txt
Dataframe 2
predictedLabels
0 2
1 2
2 2
3 1
4 2
5 2
6 2
7 2
8 1
9 1
10 1
11 0
12 2
13 2
Output
filename predictedLabels
0 1gBZ9vG1.txt NaN
1 4XztkgDw.txt NaN
2 GfCk8XGZ.txt NaN
3 gfHCMnJM.txt NaN
4 GfLCd17y.txt NaN
5 gFqruhps.txt NaN
6 gfsZpRDu.txt NaN
7 gfT1yDbz.txt NaN
8 GfT9mkJL.txt NaN
9 GFTbJDLn.txt NaN
10 gFwh0Ekb.txt NaN
11 GGB7680Q.txt NaN
12 R7NkR2q2.txt NaN
13 tK2Xmi4C.txt NaN
0 NaN 2.0
1 NaN 2.0
2 NaN 2.0
3 NaN 1.0
4 NaN 2.0
5 NaN 2.0
6 NaN 2.0
7 NaN 2.0
8 NaN 1.0
9 NaN 1.0
10 NaN 1.0
11 NaN 0.0
12 NaN 2.0
13 NaN 2.0
I'm not sure why the labels appear below with NaN values even though both dataframes are of the same length. I tried both merge and concat and also tried to reset my index, but it does not work.
Try it:
Dataframe1["predictedLabels"] = Dataframe2["predictedLabels"]
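The output in the question looks like what pd.concat produces with its default axis=0, which stacks rows; that is an assumption, since the exact call is not shown. A small sketch with made-up data (names, labels and labels_shifted are hypothetical) contrasts it with axis=1 and shows why resetting the index matters when the two indexes differ:

```python
import pandas as pd

names = pd.DataFrame({'filename': ['a.txt', 'b.txt', 'c.txt']})
labels = pd.DataFrame({'predictedLabels': [2, 1, 0]})

# Default axis=0 stacks the frames vertically: 6 rows, NaN where a column is absent
stacked = pd.concat([names, labels])

# axis=1 places them side by side, aligned on the row index
side_by_side = pd.concat([names, labels], axis=1)

# If the second frame carries a different index, alignment fails silently;
# resetting the index first restores positional alignment.
labels_shifted = labels.copy()
labels_shifted.index = [10, 11, 12]
fixed = pd.concat([names, labels_shifted.reset_index(drop=True)], axis=1)

print(stacked)
print(side_by_side)
print(fixed)
```

Direct column assignment, as in the answer above, also aligns on the index, so it has the same requirement that both frames share index labels.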
I have a dataframe in which I want to apply a rolling mean over a column of numbers that come in runs of three repeated values, and I only want 4 unique values to go into the mean.
Let's say my dataframe looks like:
Group Column to roll
1 9
2 5
2 5
2 4
2 4
2 4
2 3
2 3
2 3
2 6
2 6
2 6
2 8
Since I want 4 unique values to go into the mean but all values to be of equal weight and within the same group, my expected output (assuming I need 4 unique values) would be:
Group Output
1 nan
2 nan
2 nan
2 nan
2 nan
2 nan
2 nan
2 nan
2 nan
2 (6+3+4+5)/4
2 (6+3+4+5)/4
2 (6+3+4+5)/4
2 (8+6+3+4)/4
Any ideas how to do this?
You could try something like this:
df['Column to roll'].drop_duplicates().rolling(4).mean().reindex(df.index).ffill()
Output:
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
7 NaN
8 4.50
9 4.50
10 4.50
11 5.25
Name: Column to roll, dtype: float64
Edit: the question changed
df_out = df.groupby('Group')['Column to roll']\
.apply(lambda x: x.drop_duplicates().rolling(4).mean()).rename('Output')
df.set_index('Group',append=True).swaplevel(0,1)\
.join(df_out, how='left').ffill().reset_index(level=1, drop=True)
Output:
Column to roll Output
Group
1 9 NaN
2 5 NaN
2 5 NaN
2 4 NaN
2 4 NaN
2 4 NaN
2 3 NaN
2 3 NaN
2 3 NaN
2 6 4.50
2 6 4.50
2 6 4.50
2 8 5.25
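One thing to keep in mind: drop_duplicates removes every later occurrence of a value, not just consecutive repeats. If a value can come back later in the column and should count again, a run-based filter may be closer to what is wanted. This is a sketch under that assumption, with made-up data:

```python
import pandas as pd

# Hypothetical column where 5 reappears after other values
s = pd.Series([5, 5, 4, 4, 3, 3, 6, 6, 5, 5, 8])

# Keep only the first row of each consecutive run of equal values
run_starts = s[s.ne(s.shift())]
by_runs = run_starts.rolling(4).mean().reindex(s.index).ffill()

# For comparison: global de-duplication drops the second run of 5 entirely
by_unique = s.drop_duplicates().rolling(4).mean().reindex(s.index).ffill()

print(pd.DataFrame({'value': s, 'by_runs': by_runs, 'by_unique': by_unique}))
```

On the last row the two variants differ: the run-based mean uses 3, 6, 5, 8, while the drop_duplicates mean uses 4, 3, 6, 8 because the second 5 was discarded.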
I'm new to Python and Pandas so there might be a simple solution which I don't see.
I have a number of discontinuous datasets which look like this:
ind A B C
0 0.0 1 3
1 0.5 4 2
2 1.0 6 1
3 3.5 2 0
4 4.0 4 5
5 4.5 3 3
I now look for a solution to get the following:
ind A B C
0 0.0 1 3
1 0.5 4 2
2 1.0 6 1
3 1.5 NAN NAN
4 2.0 NAN NAN
5 2.5 NAN NAN
6 3.0 NAN NAN
7 3.5 2 0
8 4.0 4 5
9 4.5 3 3
The problem is that the gap in A varies from dataset to dataset in position and length...
set_index and reset_index are your friends.
import numpy as np
import pandas as pd
df = pd.DataFrame({"A":[0,0.5,1.0,3.5,4.0,4.5], "B":[1,4,6,2,4,3], "C":[3,2,1,0,5,3]})
First move column A to the index:
In [64]: df.set_index("A")
Out[64]:
B C
A
0.0 1 3
0.5 4 2
1.0 6 1
3.5 2 0
4.0 4 5
4.5 3 3
Then reindex with a new index; the missing data is filled in with NaNs. We use the Index object since we can name it; this will be used in the next step.
In [66]: new_index = pd.Index(np.arange(0,5,0.5), name="A")
In [67]: df.set_index("A").reindex(new_index)
Out[67]:
B C
0.0 1 3
0.5 4 2
1.0 6 1
1.5 NaN NaN
2.0 NaN NaN
2.5 NaN NaN
3.0 NaN NaN
3.5 2 0
4.0 4 5
4.5 3 3
Finally move the index back to the columns with reset_index. Since we named the index, it all works magically:
In [69]: df.set_index("A").reindex(new_index).reset_index()
Out[69]:
A B C
0 0.0 1 3
1 0.5 4 2
2 1.0 6 1
3 1.5 NaN NaN
4 2.0 NaN NaN
5 2.5 NaN NaN
6 3.0 NaN NaN
7 3.5 2 0
8 4.0 4 5
9 4.5 3 3
Using the answer by EdChum above, I created the following function
def fill_missing_range(df, field, range_from, range_to, range_step=1, fill_with=0):
    return df\
        .merge(how='right', on=field,
               right = pd.DataFrame({field:np.arange(range_from, range_to, range_step)}))\
        .sort_values(by=field).reset_index().fillna(fill_with).drop(['index'], axis=1)
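Example usage (range_to is exclusive, as in np.arange, so pass a value one step past the last one you want to keep):
fill_missing_range(df, 'A', 0.0, 5.0, 0.5, np.nan)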
In this case I generate a new dataframe holding the complete range of A values, right-merge it with your original df on A, and then re-sort:
In [177]:
df.merge(how='right', on='A', right = pd.DataFrame({'A':np.arange(df.iloc[0]['A'], df.iloc[-1]['A'] + 0.5, 0.5)})).sort_values(by='A').reset_index().drop(['index'], axis=1)
Out[177]:
A B C
0 0.0 1 3
1 0.5 4 2
2 1.0 6 1
3 1.5 NaN NaN
4 2.0 NaN NaN
5 2.5 NaN NaN
6 3.0 NaN NaN
7 3.5 2 0
8 4.0 4 5
9 4.5 3 3
So in the general case you can adjust the arange call, which takes a start value, an end value, and a step. Note that I added 0.5 to the end because arange ranges are half-open: the end value itself is excluded.
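A quick check of the end-point behaviour, as a minimal sketch (the variable names are only for illustration):

```python
import numpy as np

start, stop, step = 0.0, 4.5, 0.5

# The stop value is excluded, so the last element is one step short
without_end = np.arange(start, stop, step)

# Adding one step to stop makes the original end value part of the range
with_end = np.arange(start, stop + step, step)

# np.linspace includes both ends and takes the number of points instead of a step
same_points = np.linspace(start, stop, 10)

print(without_end[-1], len(without_end))
print(with_end[-1], len(with_end))
```

For steps that are not exactly representable in binary floating point (0.1, for example), np.linspace is usually the safer choice, since the length of an arange result can then be off by one.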
A more general method could be like this:
In [197]:
df = df.set_index(keys='A', drop=False).reindex(np.arange(df.iloc[0]['A'], df.iloc[-1]['A'] + 0.5, 0.5))
# the retained copy of column A now has gaps; the index holds the full range
df = df.drop(['A'], axis=1)
df.rename_axis('index').reset_index()
Out[197]:
index B C
0 0.0 1 3
1 0.5 4 2
2 1.0 6 1
3 1.5 NaN NaN
4 2.0 NaN NaN
5 2.5 NaN NaN
6 3.0 NaN NaN
7 3.5 2 0
8 4.0 4 5
9 4.5 3 3
Here we set the index to column A but don't drop it and then reindex the df using the arange function.
This question was asked a long time ago, but I have a simple solution that's worth mentioning. You can simply assign NumPy's NaN to the cell at row position i and column position j. For instance:
import numpy as np
df.iloc[i, j] = np.nan
will do the trick.
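A small runnable sketch of setting single cells to NaN, by position and by label. The frame is made up, and it uses float columns on purpose so that NaN fits the existing dtype:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with float columns
df = pd.DataFrame({'A': [0.0, 0.5, 1.0], 'B': [1.0, 4.0, 6.0]})

# By integer position: row 0, column 1 (which is 'B')
df.iloc[0, 1] = np.nan

# By label: row label 2, column 'A'
df.at[2, 'A'] = np.nan

print(df)
```

Both forms change df in place; .at is limited to a single cell, while .iloc and .loc also accept slices and lists.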