Move row by name to desired location in df - python

I have a df which looks like this:
a b
apple | 7 | 2 |
google | 8 | 8 |
swatch | 6 | 6 |
merc | 7 | 8 |
other | 8 | 9 |
I want to select a row by name, say "apple", and move it to a new position, say -1 (the second-to-last row).
desired output
a b
google | 8 | 8 |
swatch | 6 | 6 |
merc | 7 | 8 |
apple | 7 | 2 |
other | 8 | 9 |
Are there any functions available to achieve this?

Use Index.difference to remove the value and numpy.insert to add it at the new position, then use DataFrame.reindex or DataFrame.loc to change the order of the rows:
a = 'apple'
idx = np.insert(df.index.difference([a], sort=False), -1, a)
print (idx)
Index(['google', 'swatch', 'merc', 'apple', 'other'], dtype='object')
df = df.reindex(idx)
#alternative
#df = df.loc[idx]
print (df)
a b
google 8 8
swatch 6 6
merc 7 8
apple 7 2
other 8 9

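Another option uses pd.Index.insert() and pd.Index.drop_duplicates(): insert the label at the new position, then drop its original occurrence. Note that keep='last' works here because the new position comes after the old one: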
df.reindex(df.index.insert(-1,'apple').drop_duplicates(keep='last'))
a b
google 8 8
swatch 6 6
merc 7 8
apple 7 2
other 8 9

I'm not aware of any built-in function, but one approach would be to manipulate the index only, then use the new index to re-order the DataFrame (assumes all index values are unique):
name = 'apple'
position = -1
new_index = [i for i in df.index if i != name]
new_index.insert(position, name)
df = df.loc[new_index]
Results:
a b
google 8 8
swatch 6 6
merc 7 8
apple 7 2
other 8 9
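The index-manipulation approach above can be wrapped in a small helper. This is only a sketch: the function name move_row is hypothetical, and it assumes the index labels are unique, as the answer already notes.

```python
import pandas as pd

def move_row(df, name, position):
    """Return a copy of df with the row labelled `name` moved to `position`.

    `position` follows list.insert semantics, so -1 means "before the last row".
    Assumes index labels are unique.
    """
    labels = [label for label in df.index if label != name]
    labels.insert(position, name)
    return df.loc[labels]

df = pd.DataFrame(
    {"a": [7, 8, 6, 7, 8], "b": [2, 8, 6, 8, 9]},
    index=["apple", "google", "swatch", "merc", "other"],
)
moved = move_row(df, "apple", -1)
print(moved)
```

Because the helper returns a new DataFrame rather than modifying df in place, the original order remains available if it is needed later.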

Related

column calculation based on column names Python

I have this dataframe with columns like
| LHA_1 | JH_1 | LHA_2 | JH_2 | LHA_3 | JH_3 | LHA_4 | JH_5 | ....
What I would like to do is compute LHA_2 - JH_1, LHA_3 - JH_2, LHA_4 - JH_3, ...
and the final would look like
| LHA_1 | JH_1 | LHA_2 |LHA_2 - JH_1| JH_2 | LHA_3 |LHA_3 - JH_2| JH_3 | LHA_4 | JH_5 | ....
You can use pd.IndexSlice. Suppose the following dataframe:
>>> df
LHA_1 JH_1 LHA_2 JH_2 LHA_3 JH_3 LHA_4 JH_5 LHA_6
0 9 8 7 6 5 4 3 2 1
1 19 18 17 16 15 14 13 12 11
res = df.loc[:, pd.IndexSlice['LHA_2'::2]].values \
- df.loc[:, pd.IndexSlice['JH_1'::2]].values
res = pd.DataFrame(res).add_prefix('LHA_JH_')
out = pd.concat([df, res], axis=1)
print(out)
# Output
LHA_1 JH_1 LHA_2 JH_2 LHA_3 JH_3 LHA_4 JH_5 LHA_6 LHA_JH_0 LHA_JH_1 LHA_JH_2 LHA_JH_3
0 9 8 7 6 5 4 3 2 1 -1 -1 -1 -1
1 19 18 17 16 15 14 13 12 11 -1 -1 -1 -1
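If the new columns should carry names like "LHA_2 - JH_1", as in the question, the same subtraction can be sketched with positional slicing. This assumes the columns strictly alternate LHA/JH as in the sample frame; the variable names lha, jh and diff are illustrative only.

```python
import pandas as pd

df = pd.DataFrame(
    [[9, 8, 7, 6, 5, 4, 3, 2, 1],
     [19, 18, 17, 16, 15, 14, 13, 12, 11]],
    columns=["LHA_1", "JH_1", "LHA_2", "JH_2", "LHA_3",
             "JH_3", "LHA_4", "JH_5", "LHA_6"],
)

# Every second column starting at position 2 (LHA_2, LHA_3, ...)
# and every second column starting at position 1 (JH_1, JH_2, ...).
lha = df.iloc[:, 2::2]
jh = df.iloc[:, 1::2]

# Subtract the raw arrays so pandas does not try to align on column names.
diff = pd.DataFrame(
    lha.to_numpy() - jh.to_numpy(),
    index=df.index,
    columns=[f"{l} - {j}" for l, j in zip(lha.columns, jh.columns)],
)
out = pd.concat([df, diff], axis=1)
print(out)
```

This appends the difference columns at the end; interleaving them next to their source columns would need an extra reordering step.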

Concatenate and sum column values while iterating

I am trying to create a function that will take in CSV files and create dataframes and concatenate/sum like so:
id number_of_visits
0 3902932804358904910 2
1 5972629290368575970 1
2 5345473950081783242 1
3 4289865755939302179 1
4 36619425050724793929 19
+
id number_of_visits
0 3902932804358904910 5
1 5972629290368575970 10
2 5345473950081783242 3
3 4289865755939302179 20
4 36619425050724793929 13
=
id number_of_visits
0 3902932804358904910 7
1 5972629290368575970 11
2 5345473950081783242 4
3 4289865755939302179 21
4 36619425050724793929 32
My main issue is that in the for loop after I create the dataframes, I tried to concatenate by df += new_df and new_df wasn't being added. So I tried the following implementation.
def add_dfs(files):
    master = []
    big = pd.DataFrame({'id': 0, 'number_of_visits': 0}, index=[0]) # dummy df to initialize
    for k in range(len(files)):
        new_df = create_df(str(files[k])) # helper method to read, create and clean dfs
        master.append(new_df) #creates a list of dataframes with in master
    for k in range(len(master)):
        big = pd.concat([big, master[k]]).groupby(['id', 'number_of_visits']).sum().reset_index()
        # iterate through list of dfs and add them together
    return big
Which gives me the following
id number_of_visits
1 1000036822946495682 2
2 1000036822946495682 4
3 1000044447054156512 1
4 1000044447054156512 9
5 1000131582129684623 1
So the number_of_visits values for each id aren't actually being added together; they're just being sorted by number_of_visits.
Grouping on both id and number_of_visits leaves nothing to sum. Pass your list of dataframes directly to concat(), then group on id only and sum:
>>> pd.concat(master).groupby('id').number_of_visits.sum().reset_index()
id number_of_visits
0 36619425050724793929 32
1 3902932804358904910 7
2 4289865755939302179 21
3 5345473950081783242 4
4 5972629290368575970 11
def add_dfs(files):
    master = []
    for f in files:
        new_df = create_df(f)
        master.append(new_df)
    big = pd.concat(master).groupby('id').number_of_visits.sum().reset_index()
    return big
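A self-contained sketch of the concat-then-groupby idea, using two small in-memory frames in place of the CSV files. The short string ids are made up for illustration.

```python
import pandas as pd

# Stand-ins for two cleaned CSV files (ids are illustrative).
df1 = pd.DataFrame({"id": ["a", "b", "c"], "number_of_visits": [2, 1, 19]})
df2 = pd.DataFrame({"id": ["a", "b", "c"], "number_of_visits": [5, 10, 13]})

# Stack the frames, then sum the visits within each id.
total = (
    pd.concat([df1, df2])
    .groupby("id", as_index=False)["number_of_visits"]
    .sum()
)
print(total)
```

Grouping on id alone is what makes rows from different files collapse into one total per id.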
If both dataframes contain the same ids in the same row order, you can use
df1['number_of_visits'] += df2['number_of_visits']
this gives you:
| | id | number_of_visits |
|---:|---------------------:|-------------------:|
| 0 | 3902932804358904910 | 7 |
| 1 | 5972629290368575970 | 11 |
| 2 | 5345473950081783242 | 4 |
| 3 | 4289865755939302179 | 21 |
| 4 | 36619425050724793929 | 32 |
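When the row order differs or some ids appear in only one frame, the addition can be aligned on id instead of on row position. A minimal sketch, assuming ids are unique within each frame and using made-up ids:

```python
import pandas as pd

df1 = pd.DataFrame({"id": ["a", "b", "c"], "number_of_visits": [2, 1, 19]})
df2 = pd.DataFrame({"id": ["c", "a", "d"], "number_of_visits": [13, 5, 4]})

# Index both Series by id so that addition aligns on id, not on row number.
s1 = df1.set_index("id")["number_of_visits"]
s2 = df2.set_index("id")["number_of_visits"]

# fill_value=0 treats an id that is missing from one frame as zero visits.
total = s1.add(s2, fill_value=0)
print(total)
```

The result may come back as floats, because pandas introduces missing values internally before filling them.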

Apply function on dataframe counting power of difference between int and series

I am trying to add a new column to a dataframe with the apply function. I need to compute the distance between the X and Y coordinates in row 0 and those in every other row. I have written the following logic:
import pandas as pd
import numpy as np
data = {'X':[0,0,0,1,1,5,6,7,8],'Y':[0,1,4,2,6,5,6,4,8],'Value':[6,7,4,5,6,5,6,4,8]}
df = pd.DataFrame(data)
def countDistance(lat1, lon1, lat2, lon2):
    print(lat1, lon1, lat2, lon2)
    #use basic knowledge about triangles - values are in meters
    distance = np.sqrt(np.power(lat1-lat2,2)+np.power(lon1-lon2,2))
    return distance
def recModif(df):
    x = df.loc[0,'X']
    y = df.loc[0,'Y']
    df['dist'] = df.apply(lambda n: countDistance(x,y,df['X'],df['Y']), axis=1)
    #more code will come here
recModif(df)
But this always returns error: ValueError: Wrong number of items passed 9, placement implies
I thought that as x and y are scalars, using np.repeat might help but it didn't, the error was still the same. I saw similar posts such as this one, but with multiplication which is simple, how can I achieve subtraction like I need?
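The lambda in .apply() ignores its row argument n and passes the whole columns df['X'] and df['Y'] from the outer scope, so each call returns a Series instead of a scalar. Use the row argument instead and the code works: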
df['dist'] = df.apply(lambda row: countDistance(x,y,row['X'],row['Y']), axis=1)
df
X Y Value dist
0 0 0 6 0.000000
1 0 1 7 1.000000
2 0 4 4 4.000000
3 1 2 5 2.236068
4 1 6 6 6.082763
5 5 5 5 7.071068
6 6 6 6 8.485281
7 7 4 4 8.062258
8 8 8 8 11.313708
Also note that np.power() and np.sqrt() are already vectorized, so .apply itself is redundant for the dataset given:
countDistance(x,y,df['X'],df['Y'])
Out[154]:
0 0.000000
1 1.000000
2 4.000000
3 2.236068
4 6.082763
5 7.071068
6 8.485281
7 8.062258
8 11.313708
dtype: float64
To achieve your end goal I suggest changing the function recModif to:
def recModif(df):
    x = df.loc[0,'X']
    y = df.loc[0,'Y']
    df['dist'] = countDistance(x,y,df['X'],df['Y'])
    #more code will come here
This outputs
X Y Value dist
0 0 0 6 0.000000
1 0 1 7 1.000000
2 0 4 4 4.000000
3 1 2 5 2.236068
4 1 6 6 6.082763
5 5 5 5 7.071068
6 6 6 6 8.485281
7 7 4 4 8.062258
8 8 8 8 11.313708
Solution
Try this:
## Method-1
df['dist'] = ((df.X - df.X[0])**2 + (df.Y - df.Y[0])**2)**0.5
## Method-2: .apply()
x, y = df.X[0], df.Y[0]
df['dist'] = df.apply(lambda row: ((row.X - x)**2 + (row.Y - y)**2)**0.5, axis=1)
Output:
# print(df.to_markdown(index=False))
| X | Y | Value | dist |
|----:|----:|--------:|---------:|
| 0 | 0 | 6 | 0 |
| 0 | 1 | 7 | 1 |
| 0 | 4 | 4 | 4 |
| 1 | 2 | 5 | 2.23607 |
| 1 | 6 | 6 | 6.08276 |
| 5 | 5 | 5 | 7.07107 |
| 6 | 6 | 6 | 8.48528 |
| 7 | 4 | 4 | 8.06226 |
| 8 | 8 | 8 | 11.3137 |
Dummy Data
import pandas as pd
data = {
    'X': [0,0,0,1,1,5,6,7,8],
    'Y': [0,1,4,2,6,5,6,4,8],
    'Value':[6,7,4,5,6,5,6,4,8]
}
df = pd.DataFrame(data)
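As a variation on the vectorized approach, the same Euclidean distance can be sketched with numpy.hypot, which computes sqrt(a**2 + b**2) element-wise. The names x0 and y0 are illustrative and not taken from the answers above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "X": [0, 0, 0, 1, 1, 5, 6, 7, 8],
    "Y": [0, 1, 4, 2, 6, 5, 6, 4, 8],
    "Value": [6, 7, 4, 5, 6, 5, 6, 4, 8],
})

# Reference point: the coordinates in row 0.
x0, y0 = df.loc[0, "X"], df.loc[0, "Y"]

# One vectorized call over both columns; no .apply needed.
df["dist"] = np.hypot(df["X"] - x0, df["Y"] - y0)
print(df)
```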

Concatenate array with new array created in for loop python

I have the following type of data, I am doing predictions:
Input: 1 | 2 | 3 | 4 | 5
1 | 2 | 3 | 4 | 5
1 | 2 | 3 | 4 | 5
Output: 6
7
8
I want to predict one at a time and feed it back to the input as the value for the last column. I use this function but it is not working well:
def moving_window(num_future_pred):
    preds_moving = []
    moving_test_window = [test_X[0,:].tolist()]
    moving_test_window = np.array(moving_test_window)
    for j in range(1, len(test_Y)):
        moving_test_window = [test_X[j,:].tolist()]
        moving_test_window = np.array(moving_test_window)
        pred_one_step = model.predict(moving_test_window[:,:,:])
        preds_moving.append(pred_one_step[0,0])
        pred_one_step = pred_one_step.reshape((1,1,1))
        moving_test_window = np.concatenate((moving_test_window[:,:4,:], pred_one_step), axis= 1)
    return preds_moving
preds_moving = moving_window(len(test_Y))
What I want:
Input: 1 | 2 | 3 | 4 | 5
1 | 2 | 3 | 4 | 6
1 | 2 | 3 | 4 | 17
Output: 6
17
18
Basically to make the first prediction [1,2,3,4,5] --> 6 and then remove the last column [5] from the next inputs and add the predicted value at each time.
What it does now, it just takes all the inputs as they are and makes predictions for each row. Any idea appreciated!
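One way to get the described behaviour is to carry the previous prediction into the next row before predicting, instead of rebuilding the window from test_X on every iteration. The following is only a sketch under simplifying assumptions: it works on 2-D rows rather than the 3-D windows in the question, and rolling_forecast and predict_one are hypothetical names, with predict_one standing in for a call such as model.predict.

```python
import numpy as np

def rolling_forecast(test_X, predict_one):
    """Predict row by row, feeding each prediction back as the last
    feature of the next row (hypothetical helper, 2-D input assumed)."""
    preds = []
    prev = None
    for row in np.asarray(test_X, dtype=float):
        window = row.copy()
        if prev is not None:
            # Replace the last column with the previous prediction.
            window[-1] = prev
        prev = float(predict_one(window))
        preds.append(prev)
    return preds

# Toy stand-in for a trained model: predicts "last feature + 1".
toy_model = lambda window: window[-1] + 1

test_X = [[1, 2, 3, 4, 5]] * 3
preds = rolling_forecast(test_X, toy_model)
print(preds)
```

With a real model, predict_one would need to reshape the window to whatever input shape the model expects and extract a scalar from its output.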

How to identify a specific occurrence across two rows and calculate the count

Let's say I have this pandas dataframe:
id | userid | type
1 | 20 | a
2 | 20 | a
3 | 20 | b
4 | 21 | a
5 | 21 | b
6 | 21 | a
7 | 21 | b
8 | 21 | b
I want to obtain the number of times 'b follows a' for each user, and obtain a new dataframe like this:
userid | b_follows_a
20 | 1
21 | 2
I know I can do this using for loop. However, I wonder if there is a more elegant solution to this.
You can use shift() to check if a is followed by b with vectorized & and then count the trues with a sum:
df.groupby('userid').type.apply(lambda x: ((x == "a") & (x.shift(-1) == "b")).sum()).reset_index()
#userid type
#0 20 1
#1 21 2
Creative solution:
In [49]: df.groupby('userid')['type'].sum().str.count('ab').reset_index()
Out[49]:
userid type
0 20 1
1 21 2
Explanation:
In [50]: df.groupby('userid')['type'].sum()
Out[50]:
userid
20 aab
21 ababb
Name: type, dtype: object
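The shift-based idea can also be sketched without .apply, by shifting within each user group and then summing the boolean mask per user. The output column name b_follows_a comes from the question; the intermediate names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "id": range(1, 9),
    "userid": [20, 20, 20, 21, 21, 21, 21, 21],
    "type": ["a", "a", "b", "a", "b", "a", "b", "b"],
})

# Next row's type within the same user; the last row of each user gets NaN.
next_type = df.groupby("userid")["type"].shift(-1)

# True where an "a" is immediately followed by a "b" for the same user.
follows = df["type"].eq("a") & next_type.eq("b")

counts = (
    follows.groupby(df["userid"]).sum()
    .rename("b_follows_a")
    .reset_index()
)
print(counts)
```

Shifting inside the groupby keeps the last row of one user from being compared with the first row of the next.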
