Appending columns to other columns in Pandas - python

Given the dataframe:
d = {'col1': [1, 2, 3, 4, 7], 'col2': [4, 5, 6, 9, 5], 'col3': [7, 8, 12, 1, 11], 'col4': [12, 13, 14, 15, 16]}
What is the easiest way to append the third column to the first and the fourth column to the second?
The result should look like this:
d = {'col1': [1, 2, 3, 4, 7, 7, 8, 12, 1, 11], 'col2': [4, 5, 6, 9, 5, 12, 13, 14, 15, 16]}
I need to use this in a script with different column names, so referencing columns by name is not possible. I have tried something along the lines of df.iloc[:,x] to achieve this.

You can use:
out = pd.concat([subdf.set_axis(['col1', 'col2'], axis=1)
                 for _, subdf in df.groupby(pd.RangeIndex(df.shape[1]) // 2, axis=1)])
print(out)
# Output
   col1  col2
0     1     4
1     2     5
2     3     6
3     4     9
4     7     5
0     7    12
1     8    13
2    12    14
3     1    15
4    11    16
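If your pandas version warns that groupby(..., axis=1) is deprecated, a sketch of the same idea that avoids it is to group the transposed frame instead (assuming the same example df):
out = pd.concat([subdf.T.set_axis(['col1', 'col2'], axis=1)
                 for _, subdf in df.T.groupby(pd.RangeIndex(df.shape[1]) // 2)])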

You can change the column names and concat:
pd.concat([df[['col1', 'col2']],
           df[['col3', 'col4']].set_axis(['col1', 'col2'], axis=1)])
Add ignore_index=True to reset the index in the process.
Output:
   col1  col2
0     1     4
1     2     5
2     3     6
3     4     9
4     7     5
0     7    12
1     8    13
2    12    14
3     1    15
4    11    16
Or, using numpy:
N = 2
pd.DataFrame(
    df
    .values.reshape((-1, df.shape[1]//2, N))
    .reshape(-1, N, order='F'),
    columns=df.columns[:N]
)
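Since the script uses different column names, here is a purely positional sketch of the concat approach (assuming, as in the example, an even number of columns where the second half should be appended below the first half):
n = df.shape[1] // 2
out = pd.concat([df.iloc[:, :n],
                 df.iloc[:, n:].set_axis(df.columns[:n], axis=1)],
                ignore_index=True)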

This may not be the most efficient solution, but you can do it using the pd.concat() function in pandas.
First convert your initial dict d into a pandas DataFrame, then apply the concat function.
d = {'col1': [1, 2, 3, 4, 7], 'col2': [4, 5, 6, 9, 5], 'col3': [7, 8, 12, 1, 11], 'col4': [12, 13, 14, 15, 16]}
df = pd.DataFrame(d)
d_2 = {'col1': pd.concat([df.iloc[:, 0], df.iloc[:, 2]]),
       'col2': pd.concat([df.iloc[:, 1], df.iloc[:, 3]])}
d_2 is your required dict. Convert it to a DataFrame if you need to:
df_2 = pd.DataFrame(d_2)
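If you want the combined columns to get a fresh 0..9 index rather than the original 0..4 repeated, you can pass ignore_index=True to each concat (a small tweak to the snippet above):
d_2 = {'col1': pd.concat([df.iloc[:, 0], df.iloc[:, 2]], ignore_index=True),
       'col2': pd.concat([df.iloc[:, 1], df.iloc[:, 3]], ignore_index=True)}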

Related

Pandas data frame index

If I have a Series:
s = pd.Series(1, index=[1,2,3,5,6,9,10])
But I need a standard index = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10], with the values at indices 4, 7 and 8 equal to zero.
So I expect the updated series to be:
s = pd.Series([1,1,1,0,1,1,0,0,1,1], index=[1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
How should I update the series?
Thank you in advance!
Try this:
s.reindex(range(1,s.index.max() + 1),fill_value=0)
Output:
1     1
2     1
3     1
4     0
5     1
6     1
7     0
8     0
9     1
10    1
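Note that reindex returns a new Series rather than modifying s in place, so assign the result back if you want to keep it:
s = s.reindex(range(1, s.index.max() + 1), fill_value=0)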

Remove n elements from start of a list in pandas column, where n is the value in another column

Say I have the following DataFrame:
a = [[1,2,3,4,5],[6,7,8,9,10],[11,12,13,14,15]]
b = [3,1,2]
df = pd.DataFrame(zip(a,b), columns = ['a', 'b'])
df:
a b
0 [1, 2, 3, 4, 5] 3
1 [6, 7, 8, 9, 10] 1
2 [11, 12, 13, 14, 15] 2
How can I remove the first n elements from each list in column a, where n is the value in column b?
The result I would expect for the above df is:
a b
0 [4, 5] 3
1 [7, 8, 9, 10] 1
2 [13, 14, 15] 2
I imagine the answer revolves around using .apply() and a lambda function, but I cannot get my head around this one!
Try:
df["a"] = df.apply(lambda x: x["a"][x["b"] :], axis=1)
print(df)
Prints:
a b
0 [4, 5] 3
1 [7, 8, 9, 10] 1
2 [13, 14, 15] 2
Try this:
df['a'] = df.apply(lambda row: row['a'][row['b']:], axis=1)
Output:
a b
0 [4, 5] 3
1 [7, 8, 9, 10] 1
2 [13, 14, 15] 2
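If row-wise apply turns out to be slow on a large frame, a plain list comprehension over the two columns does the same slicing (just a sketch, equivalent to the lambdas above):
df['a'] = [lst[n:] for lst, n in zip(df['a'], df['b'])]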

How to replicate same values based on the index value of other column in python

I have a dataframe like the one below, and I want to add another column whose value is repeated until a certain condition is met.
sample_df = pd.DataFrame(data={
    'id': ['A', 'B', 'C'],
    'n' : [1, 2, 3],
    'v' : [10, 13, 8],
    'z' : [5, 3, 6],
    'g' : [8, 8, 10]
})
additional_rows=
Now I want to add another column that contains additional information about the dataframe. For instance, I want to repeat 'Yes' up to and including id B, 'No' from C up to D, and 'Maybe' from D to E.
The output I am expecting is as follows:
sample_df = pd.DataFrame(data={
    'id': ['A', 'B', 'C', 'G', 'D', 'E'],
    'n' : [1, 2, 3, 5, 5, 9],
    'v' : [10, 13, 8, 8, 4, 3],
    'z' : [5, 3, 6, 9, 9, 8],
    'New Info': ['Yes', 'Yes', 'No', 'No', 'Maybe', 'Maybe']
})
sample_df
  id  n   v  z New Info
0  A  1  10  5      Yes
1  B  2  13  3      Yes
2  C  3   8  6       No
3  G  5   8  9       No
4  D  5   4  9    Maybe
5  E  9   3  8    Maybe
How can I achieve this in python?
You can use np.select to return results based on conditions. Since you were talking more about positional conditions, I used df.index:
sample_df = pd.DataFrame(data={
    'id': ['A', 'B', 'C', 'G', 'D', 'E'],
    'n' : [1, 2, 3, 5, 5, 9],
    'v' : [10, 13, 8, 8, 4, 3],
    'z' : [5, 3, 6, 9, 9, 8]
})
sample_df['New Info'] = np.select([sample_df.index < 2, sample_df.index < 4],
                                  ['Yes', 'No'], 'Maybe')
sample_df
Out[1]:
  id  n   v  z New Info
0  A  1  10  5      Yes
1  B  2  13  3      Yes
2  C  3   8  6       No
3  G  5   8  9       No
4  D  5   4  9    Maybe
5  E  9   3  8    Maybe
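If the breakpoints should follow particular id values rather than hard-coded positions, you could look up the positions first (a sketch, assuming each id occurs once and the frame has a default RangeIndex):
b_pos = sample_df.index[sample_df['id'] == 'B'][0]
d_pos = sample_df.index[sample_df['id'] == 'D'][0]
sample_df['New Info'] = np.select([sample_df.index <= b_pos, sample_df.index < d_pos],
                                  ['Yes', 'No'], 'Maybe')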

Why is pandas df.add_suffix() not working with for-loop

I am trying to use pandas df.add_suffix() on multiple dataframes that are stored in a list, via a for loop:
df_1 = pd.DataFrame({'X': [2, 3, 4, 5], 'Y': [4, 5, 6, 7]})
df_2 = pd.DataFrame({'X': [6, 7, 8, 9], 'Y': [9, 8, 7, 6]})
df_3 = pd.DataFrame({'X': [6, 3, 1, 13], 'Y': [7, 0, 1, 4]})
mylist = [df_1, df_2, df_3]
for i in mylist:
    i = i.add_suffix('_test')
However, when I print the dataframes afterwards, I still see the old column names "X" and "Y".
When doing the same operation on each of the dataframes separately:
df1 = df_1.add_suffix('_test')
everything works as expected and I get the column names "X_test" and "Y_test".
Does anyone have any idea what I am missing here?
You are changing the value of the variable i, but rebinding i does not change the elements of mylist. When you iterate with a for loop, i is assigned consecutive elements from mylist, and assigning a new value to i only rebinds the variable. You should use the list index to change the elements:
for i in range(len(mylist)):
    mylist[i] = mylist[i].add_suffix('_test')
The problem is that the output is not assigned back to the list, so nothing changes.
Solution, if you want to assign back to the same list of DataFrames, using enumerate for indexing:
for j, i in enumerate(mylist):
    mylist[j] = i.add_suffix('_test')
print (mylist)
[ X_test Y_test
0 2 4
1 3 5
2 4 6
3 5 7, X_test Y_test
0 6 9
1 7 8
2 8 7
3 9 6, X_test Y_test
0 6 7
1 3 0
2 1 1
3 13 4]
Or, if you want a new list of DataFrames, use a list comprehension:
dfs = [i.add_suffix('_test') for i in mylist]
print (dfs)
[ X_test Y_test
0 2 4
1 3 5
2 4 6
3 5 7, X_test Y_test
0 6 9
1 7 8
2 8 7
3 9 6, X_test Y_test
0 6 7
1 3 0
2 1 1
3 13 4]
df_1 = pd.DataFrame({'X': [2, 3, 4, 5], 'Y': [4, 5, 6, 7]})
df_2 = pd.DataFrame({'X': [6, 7, 8, 9], 'Y': [9, 8, 7, 6]})
df_3 = pd.DataFrame({'X': [6, 3, 1, 13], 'Y': [7, 0, 1, 4]})
mylist = [df_1, df_2, df_3]
for i, j in enumerate(mylist):
    mylist[i] = j.add_suffix('_test')
The updated DataFrames are in the list (mylist) rather than in the original variables.
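A small variation (just a sketch, not from the answers above): if you need to keep addressing the renamed frames by name, store them in a dict instead of a list:
frames = {'df_1': df_1, 'df_2': df_2, 'df_3': df_3}
frames = {name: f.add_suffix('_test') for name, f in frames.items()}
print(frames['df_1'].columns)  # Index(['X_test', 'Y_test'], dtype='object')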

Pandas: add column based on groupby with condition

I have a dataframe with four columns: id1, id2, age, stime. For example
df = pd.DataFrame(np.array([[1, 1, 3, pd.to_datetime('2020-01-10 00:30:16')],
                            [2, 1, 10, pd.to_datetime('2020-01-27 00:20:20')],
                            [3, 1, 60, pd.to_datetime('2020-01-26 00:10:08')],
                            [4, 2, 1, pd.to_datetime('2020-01-13 00:20:19')],
                            [5, 2, 2, pd.to_datetime('2020-01-12 00:40:17')],
                            [6, 2, 3, pd.to_datetime('2020-01-10 00:10:53')],
                            [7, 3, 20, pd.to_datetime('2020-01-21 00:20:57')],
                            [8, 3, 40, pd.to_datetime('2020-01-20 00:10:38')],
                            [9, 3, 60, pd.to_datetime('2020-01-01 00:30:38')],
                            ]),
                  columns=['id1', 'id2', 'age', 'stime'])
I want to add a column whose value is the maximum age among the rows that have a matching id2 and whose stime falls within the last 2 weeks before that row's stime. So for the above example I want to get:
df2 = pd.DataFrame(np.array([[1, 1, 3, pd.to_datetime('2020-01-10 00:30:16'), 3],
                             [2, 1, 10, pd.to_datetime('2020-01-27 00:20:20'), 60],
                             [3, 1, 60, pd.to_datetime('2020-01-26 00:10:08'), 60],
                             [4, 2, 1, pd.to_datetime('2020-01-13 00:20:19'), 3],
                             [5, 2, 2, pd.to_datetime('2020-01-12 00:40:17'), 3],
                             [6, 2, 3, pd.to_datetime('2020-01-10 00:10:53'), 3],
                             [7, 3, 20, pd.to_datetime('2020-01-21 00:20:57'), 40],
                             [8, 3, 40, pd.to_datetime('2020-01-20 00:10:38'), 40],
                             [9, 3, 60, pd.to_datetime('2020-01-01 00:30:38'), 60]
                             ]),
                   columns=['id1', 'id2', 'age', 'stime', 'max_age_last_2w'])
As the df I want to do this on is very large, any help on how to do this efficiently would be greatly appreciated. Thanks in advance!
Try:
df['max_age_last_2w'] = df.groupby(['id2', pd.Grouper(key='stime', freq='2W', closed='right')])['age'].transform('max')
Output:
  id1 id2 age                stime max_age_last_2w
0   1   1   3  2020-01-10 00:30:16               3
1   2   1  10  2020-01-27 00:20:20              60
2   3   1  60  2020-01-26 00:10:08              60
3   4   2   1  2020-01-13 00:20:19               3
4   5   2   2  2020-01-12 00:40:17               3
5   6   2   3  2020-01-10 00:10:53               3
6   7   3  20  2020-01-21 00:20:57              40
7   8   3  40  2020-01-20 00:10:38              40
8   9   3  60  2020-01-01 00:30:38              60
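Note that freq='2W' groups rows into fixed two-week bins. If you instead want a true rolling 14-day lookback per id2, a sketch along these lines should reproduce the expected output as well (assuming stime is datetime and age is numeric; convert first if the columns came out as object dtype):
df['stime'] = pd.to_datetime(df['stime'])
df['age'] = pd.to_numeric(df['age'])
df = df.sort_values('stime')
df['max_age_last_2w'] = (df.groupby('id2', group_keys=False)
                           .apply(lambda g: g.rolling('14D', on='stime')['age'].max()))
df = df.sort_values('id1').reset_index(drop=True)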
