How can I fill in "nan" values conditionally? - python

I want to fill missing values like this:
data = pd.read_csv("E:\\SPEED.csv")
data is a DataFrame.
Case 1: if fclass is "motorway", "motorway_link", "trunk" or "trunk_link", I want to replace NaN with 110.
Case 2: if fclass is "primary", "primary_link", "secondary" or "secondary_link", I want to replace NaN with 70.
Case 3: if fclass is any other value, I want to replace NaN with 40.
I would be grateful for any help.

Two ways in pandas:
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "A": [1, 2, np.nan, 4],
        "B": [1, 4, 9, np.nan],
        "C": [1, 2, 3, 5],
        "D": list("abcd"),
    }
)
fillna lets you fill NAs (or NaNs) with a fixed value:
df['B'].fillna(12)  # [1, 4, 9, 12]
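It is not limited to scalars, which is handy for conditional fills: a dict fills per column, and an index-aligned Series fills element-wise (a sketch against the same df):
df.fillna({'A': 0, 'B': 12})  # per-column defaults
df['B'].fillna(df['C'])       # take each missing value from another column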
interpolate fills NaNs by interpolation -- linear by default:
df.interpolate()
df['A']  # [1, 2, 3, 4]
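Only the non-linear methods delegate to scipy, for example (a sketch; requires scipy to be installed):
df['A'].interpolate(method='polynomial', order=2)  # fits a polynomial through the known points to fill the gap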

Thank you all for your answers. However, as there are 6812 rows and 16 columns (containing NaN values) in the data, it seems that a different solution is required.

You can try this:
import pandas as pd

def valuesMapper(data, valuesDict, columns_to_update):
    # Map each row's fclass to its speed (40 for anything not in the dict),
    # then use those per-row values to fill the NaNs in every speed column.
    fill_values = data['fclass'].map(valuesDict).fillna(40)
    for col in columns_to_update:
        data[col] = data[col].fillna(fill_values)
    return data

data = pd.read_csv("E:\\SPEED.csv")
valuesDict = {"motorway": 110, "motorway_link": 110, "trunk": 110, "trunk_link": 110,
              "primary": 70, "primary_link": 70, "secondary": 70, "secondary_link": 70}
columns_to_update = ['AGU_PZR_07_10']  # the list of columns to be updated; you can build it from your data, which I don't have
print(valuesMapper(data, valuesDict, columns_to_update))

With the below example:
import numpy as np
import pandas as pd

data = pd.DataFrame({
    'fclass': ['a', 'b', 'c', 'a'],
    'AGU': [float('nan'), float('nan'), float('nan'), 9]
})
You can update it using numpy conditionals, iterating over your columns from the 2nd one onwards ([1:]; use [4:] to start from the 5th) in your data:
for column in data.columns[1:]:
    data[column] = np.where((data['fclass'] == 'b') & (data[column].isna()), 110, data[column])
Or pandas apply:
data['AGU'] = data.apply(
    lambda row: 110 if np.isnan(row['AGU']) and row['fclass'] in ("b", "a") else row['AGU'],
    axis=1,
)
where you can replace ("b", "a") with e.g. ("motorway", "motorway_link", "trunk", "trunk_link")
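To cover all three cases from the question at once, a hedged sketch using np.select (assuming, as above, an fclass column followed by the NaN-holding speed columns):
group1 = data['fclass'].isin(["motorway", "motorway_link", "trunk", "trunk_link"])
group2 = data['fclass'].isin(["primary", "primary_link", "secondary", "secondary_link"])
fill = np.select([group1, group2], [110, 70], default=40)  # one fill value per row

for column in data.columns[1:]:
    # only overwrite the cells that are actually NaN
    data[column] = np.where(data[column].isna(), fill, data[column])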

Related

iterate over a list, apply a function and set each output to rows of a pandas dataframe

So I have the following data, which comes from two different pandas dataframes:
lis = []
for index, rows in full.iterrows():
    my_list = [rows.ARIEL, rows.LBHD, rows.LFHD, rows.RFHD, rows.RBHD]
    lis.append(my_list)

lis2 = []
for index, rows in reduced.iterrows():
    my_list = rows.bar_head
    lis2.append(my_list)
For example, parts of lis and lis2 are shown below:
lis = [[[-205.981, 1638.787, 1145.274], [-264.941, 1482.371, 1168.693], [-263.454, 1579.4370000000001, 1016.279], [-148.062, 1592.005, 1016.75], [-134.313, 1479.1429999999998, 1167.109]], ...
lis2 = [[-203.3502, 1554.3486, 1102.821], [-203.428, 1554.3492, 1103.0592], [-203.4954, 1554.3234, 1103.2794], [-203.5022, 1554.2974, 1103.4522], ...
What I want is to use lis and lis2 with the following apply method (where mdf is another empty dataframe of the same length as the other two, and md is a function I've created):
mdf['head_md'] = mdf['head_md'].apply(md, args=(5, lis, lis2))
But as written, it outputs the same result to all rows of mdf.
What I want is for it to loop through lis and lis2 and based on the indexes, to output the corresponding result to the corresponding row of mdf. All dataframes and variables have length 7446.
I tried for example this, but it doesn't work:
for i in range(len(mdf)):
    for j in range(0, 5):
        mdf['head_md'] = mdf['head_md'].apply(md, args=(5, lis[i][j], lis2[i]))
Let me know if you need any more information from the code, and thanks in advance!
EDIT: Examples of the dataframes:
bar_head
0 [-203.3502, 1554.3486, 1102.821]
1 [-203.428, 1554.3492, 1103.0592]
2 [-203.4954, 1554.3234, 1103.2794]
3 [-203.5022, 1554.2974, 1103.4522]
4 [-203.5014, 1554.2948, 1103.6594]
ARIEL LBHD LFHD RBHD RFHD
0 [-205.981, 1638.787, 1145.274] [-264.941, 1482.371, 1168.693] [-263.454, 1579.4370000000001, 1016.279] [-134.313, 1479.1429999999998, 1167.109] [-148.062, 1592.005, 1016.75]
1 [-206.203, 1638.649, 1145.734] [-264.85400000000004, 1482.069, 1168.776] [-263.587, 1579.6129999999998, 1016.627] [-134.286, 1479.0839999999998, 1167.076] [-148.21, 1592.3310000000001, 1017.0830000000001]
2 [-206.37599999999998, 1638.531, 1146.135] [-264.803, 1481.8210000000001, 1168.8519999999... [-263.695, 1579.711, 1016.922] [-134.265, 1478.981, 1167.104] [-148.338, 1592.5729999999999, 1017.3839999999...
3 [-206.493, 1638.405, 1146.519] [-264.703, 1481.5439999999999, 1168.95] [-263.742, 1579.8139999999999, 1017.207] [-134.15200000000002, 1478.922, 1167.112] [-148.421, 1592.8020000000001, 1017.4730000000...
4 [-206.56900000000002, 1638.33, 1146.828] [-264.606, 1481.271, 1169.0330000000001] [-263.788, 1579.934, 1017.467] [-134.036, 1478.888, 1167.289] [-148.50799999999998, 1593.0510000000002, 1017...
If the items in the columns of full and reduced are lists, convert them to NumPy ndarrays first.
ariel = np.array(full.ARIEL.to_list())
lbhd = np.array(full.LBHD.to_list())
lfhd = np.array(full.LFHD.to_list())
rfhd = np.array(full.RFHD.to_list())
rbhd = np.array(full.RBHD.to_list())
barhead = np.array(reduced.bar_head.to_list())
Subtract barhead from ariel using broadcasting, square the results and sum along the last axis (assuming I understood the comment about your function).
a = np.sum(np.square(ariel-barhead[:,None,:]),-1)
Using the setup below the result is a (4,5) array of values (rounded to two places).
>>> a # a[0] a[1] a[2] a[3] a[4]
array([[8939.02, 8956.22, 8971.93, 8984.87, 8999.85], # b[0]
[8918.35, 8935.3 , 8950.79, 8963.53, 8978.35], # b[1]
[8903.82, 8920.53, 8935.82, 8948.36, 8963.04], # b[2]
[8893.7 , 8910.24, 8925.38, 8937.78, 8952.34]]) # b[3]
It seemed that you wanted a 1-d sequence for the result: a.ravel() produces a 1-d array like:
[(a[0]:b[0]),(a[1]:b[0]),(a[2]:b[0]),...,(a[0]:b[1]),(a[1]:b[1]),...,(a[0]:b[2]),...]
The same for the other four columns of full:
lb = np.sum(np.square(lbhd-barhead[:,None,:]),-1)
lf = np.sum(np.square(lfhd-barhead[:,None,:]),-1)
rf = np.sum(np.square(rfhd-barhead[:,None,:]),-1)
rb = np.sum(np.square(rbhd-barhead[:,None,:]),-1)
Again, assuming I understood your process, the result would be 100 values using the setup below: full contributes 5 rows × 5 columns and reduced contributes 4 rows, i.e. (rows × columns of full) × (rows of reduced) = 5 × 5 × 4 = 100.
x = np.concatenate([a.ravel(),lb.ravel(),lf.ravel(),rf.ravel(),rb.ravel()])
Setup
import numpy as np
import pandas as pd
lis = [[[-205.981, 1638.787, 1145.274],[-264.941, 1482.371, 1168.693],[-263.454, 1579.437, 1016.279],[-134.313, 1479.1429, 1167.109],[-148.062, 1592.005, 1016.75]],
[[-206.203, 1638.649, 1145.734],[-264.854, 1482.069, 1168.776],[-263.587, 1579.6129, 1016.627],[-134.286, 1479.0839, 1167.076],[-148.21, 1592.331, 1017.083]],
[[-206.3759, 1638.531, 1146.135],[-264.803, 1481.821, 1168.85199],[-263.695, 1579.711, 1016.922],[-134.265, 1478.981, 1167.104],[-148.338, 1592.5729, 1017.3839]],
[[-206.493, 1638.405, 1146.519],[-264.703, 1481.5439, 1168.95],[-263.742, 1579.8139, 1017.207],[-134.152, 1478.922, 1167.112],[-148.421, 1592.802, 1017.473]],
[[-206.569, 1638.33, 1146.828],[-264.606, 1481.271, 1169.033],[-263.788, 1579.934, 1017.467],[-134.036, 1478.888, 1167.289],[-148.5079, 1593.051, 1017.666]]]
barhd = [[[-203.3502, 1554.3486, 1102.821]],
[[-203.428, 1554.3492, 1103.0592]],
[[-203.4954, 1554.3234, 1103.2794]],
[[-203.5022, 1554.2974, 1103.4522]]]
full = pd.DataFrame(lis, columns=['ARIEL', 'LBHD', 'LFHD', 'RFHD', 'RBHD'])
reduced = pd.DataFrame(barhd,columns=['bar_head'])
I hope I understood you correctly; is this what you want?
v is lis and v2 is lis2.
An arithmetic function applied between a 3D and a 2D array:
import numpy as np

na = np.array
v = na([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 1]]])
v2 = na([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]])
lst = []
for a in v:
    for b in a:
        for a2 in v2:
            lst.append(b + a2)  # you can use any arithmetic function here
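For reference, the same triple loop collapses to a single broadcast addition (a sketch; the reshape flattens the result in the same order the loops append):
# v has shape (2, 2, 3) and v2 has shape (4, 3); the inserted length-1 axes
# pair every inner vector of v with every row of v2
out = (v[:, :, None, :] + v2[None, None, :, :]).reshape(-1, 3)  # shape (16, 3)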

How to make a list from pandas if the index is a timestamp

Here is my dataframe
ord_datetime
2019-05-01 22.483871
2019-05-02 27.228070
2019-05-03 30.140625
2019-05-04 32.581633
2019-05-05 30.259259
If I run code like this:
b=[]
b.append((df.iloc[2]-df.iloc[1])/(df.iloc[1]))
print(b)
the output is:
[Ordered Items 0.106969
dtype: float64]
I want an output like 0.106969 only. How can I do that?
You are working with Series here, which is why you get this result.
Your iloc returns a Series of 1 element, and the arithmetic operators also return Series.
If you want to get the scalar value, you can simply use my_series[0].
So for your example:
from datetime import datetime

import pandas as pd

data = {datetime(2019, 5, 1): 22.483871, datetime(2019, 5, 2): 27.228070,
        datetime(2019, 5, 3): 30.140625, datetime(2019, 5, 4): 32.581633,
        datetime(2019, 5, 5): 30.259259}
df = pd.DataFrame.from_dict(data, orient="index")
series_result = (df.iloc[2] - df.iloc[1]) / df.iloc[1]
scalar_result = series_result[0]
# you can append the result to your list if you want
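One caveat: with a non-integer index, my_series[0] relies on a positional fallback that recent pandas versions deprecate; the explicit equivalents are:
scalar_result = series_result.iloc[0]  # explicit positional access
scalar_result = series_result.item()   # valid because the Series has exactly one element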
You can do something like the following:
import pandas as pd

data = {
    "ord_datetime": ["2019-05-01", "2019-05-02", "2019-05-03", "2019-05-04", "2019-05-05"],
    "value": [22.483871, 27.228070, 30.140625, 32.581633, 30.259259],
}
df = pd.DataFrame(data=data)
res = [(df.iloc[ridx + 1, 1] - df.iloc[ridx, 1]) / df.iloc[ridx, 1]
       for ridx in range(0, df.shape[0] - 1)]
res  # [0.2110045463256749, 0.10696883767376833, 0.08098730533955406, -0.0712786249848188]
Hope it helps.
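Incidentally, pandas has this computation built in: pct_change returns (current - previous) / previous, so the comprehension above reduces to one line (a sketch against the same df):
res = df['value'].pct_change().dropna().tolist()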
If you want just the values from the output, you can use df.values, which returns a NumPy array. If you then want a list from that NumPy array, use np_array.tolist().
So:
b = ((df.iloc[2] - df.iloc[1]) / df.iloc[1]).values  # returns a numpy array
b.tolist()  # returns a python list

Dividing rows when iterating in pandas using iterrows

I have a Flask app where I get data back and transform it into a pandas DataFrame.
if request.method == 'PUT':
    content = request.get_json(silent=True)
    df = pd.DataFrame.from_dict(content)
    for index, row in df.iterrows():
        if row["label"] == True:
            row['A'] = row['B'] / row['C']
        elif row["label"] == False:
            row['A'] = row["B"]
            if row['D'] == 0:
                row['C'] = 0
            else:
                ...
What I am trying to do here is simple arithmetic: addition, subtraction and division.
I used iterrows() mainly because I needed multiple values per iteration to perform calculations on specific row values. df['..'].item() didn't work in my use case.
Addition and subtraction work fine, but division somehow slips and always returns values like 0, -1 or 1.
Example calculation
row['A'] = row['B'] / row['C']
Most of the time the value of row['B'] is less than row['C']. Example values:
row['A'] = 1232455 / 26719856
The only calculation involved in the app are addition, subtraction & division.
You can try this (here is an example):
import pandas as pd
import numpy as np

data = {'label': [True, False, True, True, False],
        'A': [2012, 2012, 2013, 2014, 2014],
        'B': [4, 24, 31, 21, 3],
        'C': [25, 94, 57, 62, 70],
        'D': [3645, 0, 27, 24, 96]}
df = pd.DataFrame(data)
You can apply your changes directly on your main DataFrame, without having to iterate over each row (note that iterrows yields a copy of each row, so assignments to row are never written back to the DataFrame):
# select only rows with label == True and apply the division function
df.loc[df.label == True, 'A'] = df['B']/df['C']
df.loc[df.label == False, 'A'] = df['B']
df.loc[np.logical_and(df.label == False, df.D == 0), 'C'] = 0
You can select exactly the rows you want to change each time and apply the changes directly, just like I did.
Another point: after applying the division in my example, the integers are promoted to float64. You can make the cast explicit with series.astype('float64'); for row['A'] = 1232455 / 26719856 you will then get 0.046125 and not only the integer part 0.
Maybe that will save you from getting zeros every time you do the divisions.
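A minimal sketch of that cast, using the example DataFrame above:
# cast up front so the assigned ratios keep their fractional part
df['A'] = df['A'].astype('float64')
df.loc[df.label == True, 'A'] = df['B'] / df['C']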

Get idxmax or idxmin of tuple-valued column in pandas groupby

I have a tuple-valued score column, and I'd like to get the row corresponding to its maximum value. A toy example of what I'd like to do:
import pandas as pd

df = pd.DataFrame({'id': ['a', 'a', 'b', 'b'],
                   'score': [(1, 1, 1), (1, 1, 2), (0, 0, 100), (8, 8, 8)],
                   'numeric_score': [1, 2, 3, 4],
                   'value': ['foo', 'bar', 'baz', 'qux']})

# Works, gives correct result:
correct_df = df.loc[df.groupby('id')['numeric_score'].idxmax(), :]

# Fails with a TypeError
goal_df = df.loc[df.groupby('id')['score'].idxmax(), :]
correct_df has the result I'd like in goal_df. This throws a bunch of errors, the core of which seems to be:
TypeError: reduction operation 'argmax' not allowed for this dtype
A working, but ugly solution is:
best_scores = df.groupby('id')['score'].max().reset_index()[['id', 'score']]
goal_df = (pd.merge(df, best_scores, on=['id', 'score'])
           .groupby(['id'])
           .first()
           .reset_index())
Is there a slick version of this?
I understand your question to be:
"NumPy's .argmax() does not work for tuples. For a Series of tuples, how do I determine the index for the maximum valued tuple?"
IIUC, this will return the desired outcome:
df.loc[df.score == df.score.max()]
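If the per-id grouping matters, one hedged alternative avoids argmax entirely: tuples compare lexicographically under Python's ordering (which is also what max uses on an object column), so you can sort by the tuple column and keep the last row of each group:
goal_df = df.sort_values('score').groupby('id').tail(1).sort_index()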

Pandas split-apply-combine, why combining with pd.concat([df]) works when sorting?

I do a split-apply-combine type of workflow with pandas. The 'apply' part returns a DataFrame. When the DataFrame I run groupby on has first been sorted, simply returning a DataFrame from apply raises ValueError: cannot reindex from a duplicate axis. However, I have found it to work properly when I return pd.concat([df]) (instead of just return df). If I don't sort the DataFrame, both ways of combining the results work correctly. I expect sorting must be doing something to the index, yet I don't understand what. Can someone please explain?
import pandas as pd
import numpy as np

def fill_out_ids(df, filling_function, sort=False, sort_col='sort_col',
                 group_by='group_col', to_fill=['id1', 'id2']):
    df = df.copy()
    df.set_index(group_by, inplace=True)
    if sort:
        df.sort_values(by=sort_col, inplace=True)
    g = df.groupby(df.index, sort=False, group_keys=False)
    df = g.apply(filling_function, to_fill)
    df.reset_index(inplace=True)
    return df

def _fill_ids_concat(df, to_fill):
    df[to_fill] = df[to_fill].fillna(method='ffill')
    df[to_fill] = df[to_fill].fillna(method='bfill')
    return pd.concat([df])

def _fill_ids_plain(df, to_fill):
    df[to_fill] = df[to_fill].fillna(method='ffill')
    df[to_fill] = df[to_fill].fillna(method='bfill')
    return df

def test_fill_out_ids():
    input_df = pd.DataFrame(
        [
            ['a', None, 1.0, 1],
            ['a', None, 1.0, 3],
            ['a', 'name1', np.nan, 2],
            ['b', None, 2.0, 3],
            ['b', 'name1', np.nan, 2],
            ['b', 'name2', np.nan, 1],
        ],
        columns=['group_col', 'id1', 'id2', 'sort_col']
    )
    # this works
    fill_out_ids(input_df, _fill_ids_plain, sort=False)
    # this raises: ValueError: cannot reindex from a duplicate axis
    fill_out_ids(input_df, _fill_ids_plain, sort=True)
    # this works
    fill_out_ids(input_df, _fill_ids_concat, sort=True)
    # this works
    fill_out_ids(input_df, _fill_ids_concat, sort=False)

if __name__ == "__main__":
    test_fill_out_ids()
