I am trying to add a new column to a DataFrame with an apply function. I need to compute the distance between the X and Y coordinates of row 0 and those of every other row. I have written the following logic:
import pandas as pd
import numpy as np
data = {'X':[0,0,0,1,1,5,6,7,8],'Y':[0,1,4,2,6,5,6,4,8],'Value':[6,7,4,5,6,5,6,4,8]}
df = pd.DataFrame(data)
def countDistance(lat1, lon1, lat2, lon2):
    print(lat1, lon1, lat2, lon2)
    # use basic knowledge about triangles - values are in meters
    distance = np.sqrt(np.power(lat1-lat2, 2) + np.power(lon1-lon2, 2))
    return distance

def recModif(df):
    x = df.loc[0, 'X']
    y = df.loc[0, 'Y']
    df['dist'] = df.apply(lambda n: countDistance(x, y, df['X'], df['Y']), axis=1)
    #more code will come here

recModif(df)
But this always raises an error: ValueError: Wrong number of items passed 9, placement implies 1
I thought that since x and y are scalars, using np.repeat might help, but it didn't; the error stayed the same. I saw similar posts, such as this one, but they deal with multiplication, which is simple. How can I achieve the subtraction I need?
The problem is the variable used inside .apply(): the lambda references the outer DataFrame columns df['X'] and df['Y'] instead of the current row, so a whole Series is passed where a scalar is expected. Refer to the row argument instead and the code works:
df['dist'] = df.apply(lambda row: countDistance(x,y,row['X'],row['Y']), axis=1)
df
X Y Value dist
0 0 0 6 0.000000
1 0 1 7 1.000000
2 0 4 4 4.000000
3 1 2 5 2.236068
4 1 6 6 6.082763
5 5 5 5 7.071068
6 6 6 6 8.485281
7 7 4 4 8.062258
8 8 8 8 11.313708
Also note that np.power() and np.sqrt() are already vectorized, so .apply is redundant here; you can pass the columns in directly:
countDistance(x,y,df['X'],df['Y'])
Out[154]:
0 0.000000
1 1.000000
2 4.000000
3 2.236068
4 6.082763
5 7.071068
6 8.485281
7 8.062258
8 11.313708
dtype: float64
To achieve your end goal, I suggest changing recModif to:
def recModif(df):
    x = df.loc[0, 'X']
    y = df.loc[0, 'Y']
    df['dist'] = countDistance(x, y, df['X'], df['Y'])
    #more code will come here
This outputs
X Y Value dist
0 0 0 6 0.000000
1 0 1 7 1.000000
2 0 4 4 4.000000
3 1 2 5 2.236068
4 1 6 6 6.082763
5 5 5 5 7.071068
6 6 6 6 8.485281
7 7 4 4 8.062258
8 8 8 8 11.313708
Solution
Try this:
## Method-1
df['dist'] = ((df.X - df.X[0])**2 + (df.Y - df.Y[0])**2)**0.5
## Method-2: .apply()
x, y = df.X[0], df.Y[0]
df['dist'] = df.apply(lambda row: ((row.X - x)**2 + (row.Y - y)**2)**0.5, axis=1)
Output:
# print(df.to_markdown(index=False))
| X | Y | Value | dist |
|----:|----:|--------:|---------:|
| 0 | 0 | 6 | 0 |
| 0 | 1 | 7 | 1 |
| 0 | 4 | 4 | 4 |
| 1 | 2 | 5 | 2.23607 |
| 1 | 6 | 6 | 6.08276 |
| 5 | 5 | 5 | 7.07107 |
| 6 | 6 | 6 | 8.48528 |
| 7 | 4 | 4 | 8.06226 |
| 8 | 8 | 8 | 11.3137 |
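As an aside, the subtract/square/sum/sqrt chain can be written with np.hypot in a single call; a minimal sketch, assuming numpy is imported as np:
import numpy as np

# hypot(a, b) computes sqrt(a**2 + b**2) elementwise
df['dist'] = np.hypot(df.X - df.X[0], df.Y - df.Y[0])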
Dummy Data
import pandas as pd
data = {
'X': [0,0,0,1,1,5,6,7,8],
'Y': [0,1,4,2,6,5,6,4,8],
'Value':[6,7,4,5,6,5,6,4,8]
}
df = pd.DataFrame(data)
Related
From my for-loop, the resulting lists are as follows:
# These lists below are plain Python lists and are ordered/structured.
key=[1234,2345,2223,6578,9976]
index0=[1,4,6,3,4,5,6,2,1]
index1=[4,3,2,1,6,8,5,3,1]
index2=[9,4,6,4,3,2,1,4,1]
How do I merge them all into a table with pandas? Below is the expected result.
key | index0 | index1 | index2
1234 | 1 | 4 | 9
2345 | 4 | 3 | 4
... | ... | ... | ...
9976 | 1 | 1 | 1
I tried using pandas, but ran into an error about the data type. I then set the dtype to int64 and then int32, but still got the same data-type error.
And as an optional question: should I have approached assembling a table from data in lists like this with SQL? I am just learning SQL with MySQL and wonder whether it would have been more convenient than pandas for record keeping and persistent storage.
Just wrap each list in a pd.Series, put them in a dict, and pass that to pd.DataFrame. Series of unequal length are aligned on the index and the shorter ones are padded with NaN, whereas a dict of plain lists would fail here because the lists have different lengths:
dct = {
'key': pd.Series(key),
'index0': pd.Series(index0),
'index1': pd.Series(index1),
'index2': pd.Series(index2),
}
df = pd.DataFrame(dct)
Output:
>>> df
key index0 index1 index2
0 1234.0 1 4 9
1 2345.0 4 3 4
2 2223.0 6 2 6
3 6578.0 3 1 4
4 9976.0 4 6 3
5 NaN 5 8 2
6 NaN 6 5 1
7 NaN 2 3 4
8 NaN 1 1 1
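Note that the shorter key column gets padded with NaN, which forces it to float. If you would rather keep it as integers, pandas' nullable Int64 dtype can hold both values and missing entries; a small sketch:
# 'Int64' (capital I) is the nullable integer dtype
df['key'] = df['key'].astype('Int64')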
Here is another way:
First load data into a dictionary:
d = dict(key=[1234,2345,2223,6578,9976],
index0=[1,4,6,3,4,5,6,2,1],
index1=[4,3,2,1,6,8,5,3,1],
index2=[9,4,6,4,3,2,1,4,1])
Then convert to a df:
df = pd.DataFrame({i:pd.Series(j) for i,j in d.items()})
Output:
key index0 index1 index2
0 1234.0 1 4 9
1 2345.0 4 3 4
2 2223.0 6 2 6
3 6578.0 3 1 4
4 9976.0 4 6 3
5 NaN 5 8 2
6 NaN 6 5 1
7 NaN 2 3 4
8 NaN 1 1 1
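As a one-line alternative (a sketch, using the same d as above), the dict can be loaded row-wise and transposed; pandas pads the shorter rows with NaN:
df = pd.DataFrame.from_dict(d, orient='index').transpose()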
I have a df which looks like this:
a b
apple | 7 | 2 |
google | 8 | 8 |
swatch | 6 | 6 |
merc | 7 | 8 |
other | 8 | 9 |
I want to select a given row by name, say "apple", and move it to a new position, say -1 (second-to-last row).
Desired output:
a b
google | 8 | 8 |
swatch | 6 | 6 |
merc | 7 | 8 |
apple | 7 | 2 |
other | 8 | 9 |
Is there a function available to achieve this?
Use Index.difference to remove the value and numpy.insert to add it at the new position, then use DataFrame.reindex or DataFrame.loc to change the order of the rows:
a = 'apple'
idx = np.insert(df.index.difference([a], sort=False), -1, a)
print (idx)
Index(['google', 'swatch', 'merc', 'apple', 'other'], dtype='object')
df = df.reindex(idx)
#alternative
#df = df.loc[idx]
print (df)
a b
google 8 8
swatch 6 6
merc 7 8
apple 7 2
other 8 9
This also works, using pd.Index.insert() and pd.Index.drop_duplicates():
df.reindex(df.index.insert(-1,'apple').drop_duplicates(keep='last'))
a b
google 8 8
swatch 6 6
merc 7 8
apple 7 2
other 8 9
I'm not aware of any built-in function, but one approach would be to manipulate the index only, then use the new index to re-order the DataFrame (assumes all index values are unique):
name = 'apple'
position = -1
new_index = [i for i in df.index if i != name]
new_index.insert(position, name)
df = df.loc[new_index]
Results:
a b
google 8 8
swatch 6 6
merc 7 8
apple 7 2
other 8 9
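For reuse, that index manipulation can be wrapped in a small helper; move_row is a hypothetical name, and the sketch assumes unique index values:
def move_row(df, name, position):
    # rebuild the index without the row, then insert it at the target position
    new_index = [i for i in df.index if i != name]
    new_index.insert(position, name)
    return df.loc[new_index]

df = move_row(df, 'apple', -1)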
I have the following type of data and I am making predictions:
Input: 1 | 2 | 3 | 4 | 5
1 | 2 | 3 | 4 | 5
1 | 2 | 3 | 4 | 5
Output: 6
7
8
I want to predict one step at a time and feed each prediction back into the input as the value of the last column. I use this function, but it is not working as intended:
def moving_window(num_future_pred):
    preds_moving = []
    moving_test_window = [test_X[0,:].tolist()]
    moving_test_window = np.array(moving_test_window)
    for j in range(1, len(test_Y)):
        moving_test_window = [test_X[j,:].tolist()]
        moving_test_window = np.array(moving_test_window)
        pred_one_step = model.predict(moving_test_window[:,:,:])
        preds_moving.append(pred_one_step[0,0])
        pred_one_step = pred_one_step.reshape((1,1,1))
        moving_test_window = np.concatenate((moving_test_window[:,:4,:], pred_one_step), axis=1)
    return preds_moving

preds_moving = moving_window(len(test_Y))
What I want:
Input: 1 | 2 | 3 | 4 | 5
1 | 2 | 3 | 4 | 6
1 | 2 | 3 | 4 | 17
Output: 6
17
18
Basically: make the first prediction [1,2,3,4,5] -> 6, then drop the last column value [5] from the next input and put the predicted value in its place, repeating at every step.
What it does now is take all the inputs as they are and make a prediction for each row. Any ideas appreciated!
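For what it's worth, here is a minimal sketch of the feedback loop described above, assuming a Keras-style model taking input of shape (samples, timesteps, features) and the test_X/test_Y arrays from the question. The key change is that the window is built once from the first test row and then updated from the predictions, instead of being re-read from test_X on every iteration:
import numpy as np

def moving_window(num_future_pred):
    preds_moving = []
    # build the window once from the first test row; assumed shape (1, 5, 1)
    moving_test_window = np.array(test_X[0, :]).reshape((1, 5, 1))
    for _ in range(num_future_pred):
        pred_one_step = model.predict(moving_test_window)
        preds_moving.append(pred_one_step[0, 0])
        # keep the first four values and append the prediction as the new last value
        pred_one_step = pred_one_step.reshape((1, 1, 1))
        moving_test_window = np.concatenate(
            (moving_test_window[:, :4, :], pred_one_step), axis=1)
    return preds_moving

preds_moving = moving_window(len(test_Y))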
I have a dataset based on different weather stations,
stationID | Time | Temperature | ...
----------+------+-------------+-------
123 | 1 | 30 |
123 | 2 | 31 |
202 | 1 | 24 |
202 | 2 | 24.3 |
202 | 3 | NaN |
...
And I would like to remove stationID groups that have more than a certain number of NaNs. For instance, if I type:
>>> df.groupby('stationID')
then I would like to drop the groups that have (at least) a certain number of NaNs (say 30). As I understand it, I cannot combine dropna(thresh=30) with groupby:
>>> df2.groupby('station').dropna(thresh=30)
AttributeError: Cannot access callable attribute 'dropna' of 'DataFrameGroupBy' objects...
So, what would be the best way to do that with Pandas?
IIUC you can do:
df2.loc[df2.groupby('station')['Temperature'].filter(lambda x: len(x[pd.isnull(x)]) < 30).index]
Example:
In [59]:
df = pd.DataFrame({'id':[0,0,0,1,1,1,2,2,2,2], 'val':[1,1,np.nan,1,np.nan,np.nan, 1,1,1,1]})
df
Out[59]:
id val
0 0 1.0
1 0 1.0
2 0 NaN
3 1 1.0
4 1 NaN
5 1 NaN
6 2 1.0
7 2 1.0
8 2 1.0
9 2 1.0
In [64]:
df.loc[df.groupby('id')['val'].filter(lambda x: len(x[pd.isnull(x)] ) < 2).index]
Out[64]:
id val
0 0 1.0
1 0 1.0
2 0 NaN
6 2 1.0
7 2 1.0
8 2 1.0
9 2 1.0
So this filters out the groups that have two or more NaN values (here, group 1).
You can create a column giving the number of null values per stationID, then use loc to select the relevant data for further processing. Note the comparison must keep the stations at or below the threshold:
df['station_id_null_count'] = \
    df.groupby('stationID').Temperature.transform(lambda group: group.isnull().sum())
df.loc[df.station_id_null_count <= 30, :]  # select relevant data
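Alternatively, DataFrameGroupBy.filter expresses the same idea in one step, keeping only the stations at or under the threshold; a sketch using the column names from the question:
# filter keeps the rows of every group whose lambda returns True
df_filtered = df.groupby('stationID').filter(
    lambda g: g['Temperature'].isnull().sum() <= 30)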
Using @EdChum's setup: since you don't mention your final output, adding this.
vals = df.groupby(['id'])['val'].apply(lambda x: (np.size(x)-x.count()) < 2 )
vals[vals]
id
0 True
2 True
Name: val, dtype: bool
I am trying to work through a problem in pandas, being more accustomed to R.
I have a data frame df with three columns: person, period, value
df.head(), i.e. the top few rows, looks like:
| person | period | value
0 | P22 | 1 | 0
1 | P23 | 1 | 0
2 | P24 | 1 | 1
3 | P25 | 1 | 0
4 | P26 | 1 | 1
5 | P22 | 2 | 1
Notice the last row records a value for period 2 for person P22.
I would now like to add a new column that provides the value from the previous period. So if for P22 the value in period 1 is 0, then this new column would look like:
| person | period | value | lastperiod
5 | P22 | 2 | 1 | 0
I believe I need to do something like the following command, having loaded pandas:
for p in df.period.unique():
df['lastperiod']== [???]
How should this be formulated?
You could groupby person and then apply a shift to the values:
In [11]: g = df.groupby('person')
In [12]: g['value'].apply(lambda s: s.shift())
Out[12]:
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 0
dtype: float64
Adding this as a column:
In [13]: df['lastPeriod'] = g['value'].apply(lambda s: s.shift())
In [14]: df
Out[14]:
person period value lastPeriod
1 P22 1 0 NaN
2 P23 1 0 NaN
3 P24 1 1 NaN
4 P25 1 0 NaN
5 P26 1 1 NaN
6 P22 2 1 0
Here the NaN signify missing data (i.e. there wasn't an entry in the previous period).
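As a side note, on recent pandas versions the apply is unnecessary, since groupby objects support shift directly; a sketch using the same df:
# shift within each person's group, so periods never leak across people
df['lastPeriod'] = df.groupby('person')['value'].shift()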