I have the following dataframe
index,name,score,attempts,qualify
a,Anastasia,12.5,1,yes
b,Dima,9.0,3,no
c,Katherine,16.5,2,yes
d,James,NaN,3,no
e,Emily,9.0,2,no
I am trying to use pandas map function to update name column where name is either James or Emily to any test value 99.
codes = {'James':'99','Emily':'99'}
dff['name'] = dff['name'].map(codes)
dff
I am getting the following output -
index,name,score,attempts,qualify
a,NaN,12.5,1,yes
b,NaN,9.0,3,no
c,NaN,16.5,2,yes
d,99,NaN,3,no
e,99,9.0,2,no
Note that name column values James and Emily have been updated to 99, but the rest of name values are mapped to NaN.
How can we ignore indexes which are not intended to be mapped?
The issue is that map looks up every value of the 'name' column in the dictionary, and any value that is not a key (here, every name other than James and Emily) becomes NaN. To get around this, you can use the replace method instead:
dff['name'] = dff['name'].replace({'James':'99','Emily':'99'})
This will replace only the specified values and leave the others unchanged.
I believe you may be looking for replace instead of map.
import pandas as pd
names = pd.Series([
"Anastasia",
"Dima",
"Katherine",
"James",
"Emily"
])
names.replace({"James": "99", "Emily": "99"})
# 0 Anastasia
# 1 Dima
# 2 Katherine
# 3 99
# 4 99
# dtype: object
If you're really set on using map, then you have to provide a function that knows how to handle every single name it might encounter.
codes = {"James": "99", "Emily": "99"}
# If the lookup into `codes` fails,
# return the name that was used for lookup
names.map(lambda name: codes.get(name, name))
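Another option, as a minimal sketch: the pandas documentation states that when map is given a dict subclass that defines __missing__, that method supplies the value for keys that are not found, instead of NaN. The class name KeepMissing below is made up for this illustration.

```python
import pandas as pd

class KeepMissing(dict):
    """Hypothetical helper: a dict that returns the key itself when it is absent."""
    def __missing__(self, key):
        return key

names = pd.Series(["Anastasia", "Dima", "Katherine", "James", "Emily"])
codes = KeepMissing({"James": "99", "Emily": "99"})

# Names without an entry in `codes` go through __missing__ and stay unchanged
mapped = names.map(codes)
print(mapped)
```

This keeps the plain-dictionary style of the original code while still using map.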
codes = {'James':'99',
'Emily':'99'}
dff['name'] = dff['name'].replace(codes)
dff
replace() satisfies the requirement -
index,name,score,attempts,qualify
a,Anastasia,12.5,1,yes
b,Dima,9.0,3,no
c,Katherine,16.5,2,yes
d,99,NaN,3,no
e,99,9.0,2,no
One way to achieve it is to map first and then fill the unmapped (NaN) entries back in from the original column:
dff['name'] = dff['name'].map(codes).fillna(dff['name'])
codes = {'James':'99','Emily':'99'}
dff['name'] = dff['name'].map(codes).fillna(dff['name'])
dff
index name score attempts qualify
0 a Anastasia 12.5 1 yes
1 b Dima 9.0 3 no
2 c Katherine 16.5 2 yes
3 d 99 NaN 3 no
4 e 99 9.0 2 no
I have the below script that returns data in a list format per quote of (i). I set up an empty list, and then query with the API function get_kline_data, and pass each output into my klines_list with the .extend function
klines_list = []
a = ["REQ-ETH","REQ-BTC","XLM-BTC"]
for i in a:
    klines = client.get_kline_data(i, '5min', 1619317366, 1619317606)
    klines_list.extend([i,klines])
klines_list
klines_list then returns data in this format;
['REQ-ETH',
[['1619317500',
'0.0000491',
'0.0000491',
'0.0000491',
'0.0000491',
'5.1147',
'0.00025113177']],
'REQ-BTC',
[['1619317500',
'0.00000219',
'0.00000219',
'0.00000219',
'0.00000219',
'19.8044',
'0.000043371636']],
'XLM-BTC',
[['1619317500',
'0.00000863',
'0.00000861',
'0.00000863',
'0.00000861',
'653.5693',
'0.005629652673']]]
I then try to convert it into a dataframe;
import pandas as py
df = py.DataFrame(klines_list)
And this is the result;
0
0 REQ-ETH
1 [[1619317500, 0.0000491, 0.0000491, 0.0000491,...
2 REQ-BTC
3 [[1619317500, 0.00000219, 0.00000219, 0.000002...
4 XLM-BTC
5 [[1619317500, 0.00000863, 0.00000861, 0.000008..
The structure of the DF is incorrect and it seems to be due to the way I have put my list together.
I would like the quantitative data in a column corresponding to the correct entry in list a, not in rows. Also, the ticker data, or list a, ("REQ-ETH/REQ-BTC") etc should be in a separate column. What would be a good way to go about restructuring this?
Edit: #Ynjxsjmh
This is the output when following the suggestion below for appending a dictionary within the for loop
REQ-ETH REQ-BTC XLM-BTC
0 [1619317500, 0.0000491, 0.0000491, 0.0000491, ... NaN NaN
1 NaN [1619317500, 0.00000219, 0.00000219, 0.0000021... NaN
2 NaN NaN [1619317500, 0.00000863, 0.00000861, 0.0000086...
pandas.DataFrame() can accept a dict. Each dict key becomes a column header, and the corresponding dict value becomes that column's values.
import pandas as pd
a = ["REQ-ETH","REQ-BTC","XLM-BTC"]
klines_data = {}
for i in a:
    klines = client.get_kline_data(i, '5min', 1619317366, 1619317606)
    klines_data[i] = klines[0]
    #           ^
    #           |
    # Add a key to klines_data
df = pd.DataFrame(klines_data)
print(df)
REQ-ETH REQ-BTC XLM-BTC
0 1619317500 1619317500 1619317500
1 0.0000491 0.00000219 0.00000863
2 0.0000491 0.00000219 0.00000861
3 0.0000491 0.00000219 0.00000863
4 0.0000491 0.00000219 0.00000861
5 5.1147 19.8044 653.5693
6 0.00025113177 0.000043371636 0.005629652673
If the klines lists differ in length, you can use
df = pd.DataFrame.from_dict(klines_data, orient='index').T
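As a small illustration of that last line, here is a sketch with made-up, shortened lists standing in for the API response (the values are truncated from the sample above, and the unequal lengths are an assumption for the demo). Building by rows with orient='index' pads the shorter list, and .T turns the tickers back into columns.

```python
import pandas as pd

# Hypothetical stand-in for the per-ticker kline lists, with unequal lengths
klines_data = {
    "REQ-ETH": ["1619317500", "0.0000491", "5.1147"],
    "REQ-BTC": ["1619317500", "0.00000219"],
}

# Each key becomes a row first (short rows are padded), then transpose
df = pd.DataFrame.from_dict(klines_data, orient="index").T
print(df)
```

The missing entry in the shorter column shows up as a null value rather than raising an error.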
This is an extension to a question I posted earlier: Python Sum lookup dynamic array table with df column
I'm currently investigating a way to efficiently map a decision variable to a dataframe. The main DF and the lookup table will be dynamic in length (+15,000 lines and +20 lines, respectively). Thus was hoping not to do this with a loop, but happy to hear suggestions.
The DF (DF1) will mostly look like the following, where I would like to lookup/search for the decision.
Where the decision value is found on a separate DF (DF0).
For example: the first DF1["ValuesWhereXYcomefrom"] value is 6.915, which falls in the range 3.8 <= value < 7.4 on the key table, so the corresponding value of DF0["Decision"] is -1. The process then repeats until every line is mapped to a decision.
I was thinking of using the Python bisect library, but I have not arrived at a working solution, and that approach would still need a loop. Now I'm wondering whether I am looking at the problem incorrectly, as mapping and looping over 15k lines is time consuming.
Example Main Data (DF1):
time  Value0     Value1       Value2       ValuesWhereXYcomefrom  Value_toSum   Decision Map
1     41.43      6.579482077                                      0.00531021
2     41.650002  6.756817908  46.72466411  6.915187703            0.001200456   -1
3     41.700001  6.221966706  11.64727001  1.871959552            0.000959257   -1
4     41.740002  6.230847055  46.92753343  7.531485368            0.006228989   1
5     42         6.637399856  8.031374656  1.210018204            0.010238095   -1
6     42.43      7.484894608  16.24547568  2.170434793            -0.007777563  -1
7     42.099998  7.595291765  38.73871244  5.100358702            0.003562993   -1
8     42.25      7.567457423  37.07538953  4.899319211            0.01088755    -1
9     42.709999  8.234795546  64.27986403  7.805884636            0.005151042   1
10    42.93      8.369526407  24.72700129  2.954408659            -0.003028209  -1
11    42.799999  8.146653099  61.52243361  7.55186613             0             1
Example KeyTable (DF0):
ValueX       ValueY       SUM          Decision
0.203627201  3.803627201  0.040294925  -1
3.803627201  7.403627201  0.031630668  -1
7.403627201  11.0036272   0.011841521  1
Here's how I would go about this, assuming your first DataFrame is called df and your second is decision:
import numpy as np

def map_func(x):
    # Return the decision of the first interval whose upper bound exceeds x
    for i in range(len(decision)):
        try:
            if x < decision["ValueY"].iloc[i]:
                return decision["Decision"].iloc[i]
        except Exception:
            return np.nan
    # No interval matched (x is NaN or not below any ValueY)
    return np.nan

df["decision"] = df["ValuesWhereXYcomefrom"].apply(map_func)
This will create a new column in your DataFrame called "decision" that contains the looked-up value. You can then just query it:
df.decision.iloc[row]
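Since the question asks for a way to avoid looping, here is a vectorized sketch using numpy.searchsorted. It rests on assumptions not stated in the question: the key table is sorted by ValueX, its intervals do not overlap, and each interval is treated as half-open, ValueX <= value < ValueY. The values 0.1 and NaN in the sample are made up to show the out-of-range case.

```python
import numpy as np
import pandas as pd

decision = pd.DataFrame({
    "ValueX": [0.203627201, 3.803627201, 7.403627201],
    "ValueY": [3.803627201, 7.403627201, 11.0036272],
    "Decision": [-1, -1, 1],
})
df = pd.DataFrame(
    {"ValuesWhereXYcomefrom": [6.915187703, 1.871959552, 7.531485368, 0.1, np.nan]}
)

values = df["ValuesWhereXYcomefrom"].to_numpy()
lower = decision["ValueX"].to_numpy()
upper = decision["ValueY"].to_numpy()

# Index of the last interval whose lower bound is <= value (-1 if none)
pos = np.searchsorted(lower, values, side="right") - 1
safe_pos = pos.clip(0, len(decision) - 1)

# A value only counts if it also lies below that interval's upper bound
in_range = (pos >= 0) & (values < upper[safe_pos])
df["decision"] = np.where(in_range, decision["Decision"].to_numpy()[safe_pos], np.nan)
print(df)
```

Values outside every interval, including NaN, end up as NaN in the decision column, which also makes that column float rather than integer.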
I have a scenario where I need to transform the values of a particular column based on the value present in another column in the same row, and the value in another dataframe.
Example-
print(parent_df)
school location modifed_date
0 school_1 New Delhi 2020-04-06
1 school_2 Kolkata 2020-04-06
2 school_3 Bengaluru 2020-04-06
3 school_4 Mumbai 2020-04-06
4 school_5 Chennai 2020-04-06
print(location_df)
school location
0 school_10 New Delhi
1 school_20 Kolkata
2 school_30 Bengaluru
3 school_40 Mumbai
4 school_50 Chennai
As per this use case, I need to transform the school names present in parent_df, based on the location column present in the same df, and the location property present in location_df
To achieve this transformation, I wrote the following method.
def transform_school_name(row, location_df):
    name_alias = location_df[location_df['location'] == row['location']]
    if len(name_alias) > 0:
        return location_df.school.iloc[0]
    else:
        return row['school']
And this is how I am calling this method
parent_df['school'] = parent_df.apply(UtilityMethods.transform_school_name, args=(self.location_df,), axis=1)
The issue is that for just 46K records, the entire transformation takes around 2 minutes, which is too slow. How can I improve the performance of this solution?
EDITED
Following is the actual scenario I am dealing with, in which a small transformation needs to be done before we can replace the value in the original column. I am not sure whether this can be done with the replace() method mentioned in one of the answers below.
print(parent_df)
school location modifed_date type
0 school_1 _pre_New Delhi_post 2020-04-06 Govt
1 school_2 _pre_Kolkata_post 2020-04-06 Private
2 school_3 _pre_Bengaluru_post 2020-04-06 Private
3 school_4 _pre_Mumbai_post 2020-04-06 Govt
4 school_5 _pre_Chennai_post 2020-04-06 Private
print(location_df)
school location type
0 school_10 New Delhi Govt
1 school_20 Kolkata Private
2 school_30 Bengaluru Private
Custom Method code
def transform_school_name(row, location_df):
    location_values = row['location'].split('_')
    name_alias = location_df[location_df['location'] == location_values[1]]
    name_alias = name_alias[name_alias['type'] == location_df['type']]
    if len(name_alias) > 0:
        return location_df.school.iloc[0]
    else:
        return row['school']
def transform_school_name(row, location_df):
    name_alias = location_df[location_df['location'] == row['location']]
    if len(name_alias) > 0:
        return location_df.school.iloc[0]
    else:
        return row['school']
This is the actual scenario that I need to handle, so using the replace() method won't help.
You can use map/replace:
parent_df['school'] = parent_df.location.replace(location_df.set_index('location')['school'])
Output:
school location modifed_date
0 school_10 New Delhi 2020-04-06
1 school_20 Kolkata 2020-04-06
2 school_30 Bengaluru 2020-04-06
3 school_40 Mumbai 2020-04-06
4 school_50 Chennai 2020-04-06
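One caveat with replace here: a location that has no entry in location_df is left as-is, so the location string itself would be written into school. A hedged alternative is map followed by fillna, which keeps the original school name for unmatched rows. In this sketch the row with location "Pune" and school "school_6" is made up to show the unmatched case.

```python
import pandas as pd

parent_df = pd.DataFrame({
    "school": ["school_1", "school_2", "school_6"],
    "location": ["New Delhi", "Kolkata", "Pune"],  # "Pune" is a hypothetical unmatched row
})
location_df = pd.DataFrame({
    "school": ["school_10", "school_20"],
    "location": ["New Delhi", "Kolkata"],
})

# Series indexed by location, used as a lookup table
lookup = location_df.set_index("location")["school"]

# Unmatched locations map to NaN and are then filled with the original name
new_school = parent_df["location"].map(lookup).fillna(parent_df["school"])
print(new_school)
```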
IIUC, this is more of a regex issue as the pattern doesn't match exactly. First extract the required pattern, create mapping of location in parent_df to location_df, map the values.
pat = '.*?' + '(' + '|'.join(location_df['location']) + ')' + '.*?'
mapping = parent_df['location'].str.extract(pat)[0].map(location_df.set_index('location')['school'])
parent_df['school'] = mapping.combine_first(parent_df['school'])
parent_df
school location modifed_date type
0 school_10 _pre_New Delhi_post 2020-04-06 Govt
1 school_20 _pre_Kolkata_post 2020-04-06 Private
2 school_30 _pre_Bengaluru_post 2020-04-06 Private
3 school_4 _pre_Mumbai_post 2020-04-06 Govt
4 school_5 _pre_Chennai_post 2020-04-06 Private
As I understood the edited task, the following update is to be performed:
for each row in parent_df,
find a row in location_df with matching location (a part of
location column and type),
if found, overwrite school column in parent_df with school
from the row just found.
To do it, proceed as follows:
Step 1: Generate a MultiIndex to locate school names by city and
school type:
ind = pd.MultiIndex.from_arrays([parent_df.location.str
.split('_', expand=True)[2], parent_df.type])
For your sample data, the result is:
MultiIndex([('New Delhi', 'Govt'),
( 'Kolkata', 'Private'),
('Bengaluru', 'Private'),
( 'Mumbai', 'Govt'),
( 'Chennai', 'Private')],
names=[2, 'type'])
Don't worry about strange first level column name (2), it will disappear soon.
Step 2: Generate a list of "new" school names:
locList = location_df.set_index(['location', 'type']).school[ind].tolist()
The result is:
['school_10', 'school_20', 'school_30', nan, nan]
For first 3 schools something has been found, for last 2 - nothing.
Step 3: Perform the actual update with the above list, through "non-null"
mask:
parent_df.school = parent_df.school.mask(pd.notnull(locList), locList)
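The three steps can be sketched end to end on the sample data as follows. One deliberate difference from the code above: the lookup in step 2 uses reindex rather than square-bracket indexing, on the assumption that your pandas version may raise a KeyError when some of the requested labels are absent; reindex returns NaN for those instead.

```python
import pandas as pd

parent_df = pd.DataFrame({
    "school": ["school_1", "school_2", "school_3", "school_4", "school_5"],
    "location": ["_pre_New Delhi_post", "_pre_Kolkata_post", "_pre_Bengaluru_post",
                 "_pre_Mumbai_post", "_pre_Chennai_post"],
    "type": ["Govt", "Private", "Private", "Govt", "Private"],
})
location_df = pd.DataFrame({
    "school": ["school_10", "school_20", "school_30"],
    "location": ["New Delhi", "Kolkata", "Bengaluru"],
    "type": ["Govt", "Private", "Private"],
})

# Step 1: build a (city, type) key for every row of parent_df
city = parent_df["location"].str.split("_", expand=True)[2]
ind = pd.MultiIndex.from_arrays([city, parent_df["type"]])

# Step 2: look the keys up; missing (city, type) pairs come back as NaN
lookup = location_df.set_index(["location", "type"])["school"]
found = lookup.reindex(ind).to_numpy()

# Step 3: overwrite school only where a match was found
new_school = parent_df["school"].mask(pd.notnull(found), found)
print(new_school.tolist())
```

This assumes each (location, type) pair occurs only once in location_df, since reindex needs a unique index to look up against.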
Execution speed
Due to usage of vectorized operations and lookup by index, my code
runs substantially faster than apply to each row.
Example: I replicated your parent_df 10,000 times and checked with
%timeit the execution time of your code (actually, a slightly changed
version, described below) and mine.
To allow repeated execution, I changed both versions so that they set
school_2 column, and school remains unchanged.
Your code ran for 34.9 s, whereas mine took only 161 ms, more than
200 times faster.
An even quicker version
If parent_df has a default index (consecutive numbers starting from 0),
then the whole operation can be performed with a single instruction:
parent_df.school = location_df.set_index(['location', 'type']).school[
pd.MultiIndex.from_arrays(
[parent_df.location.str.split('_', expand=True)[2],
parent_df.type])
]\
.reset_index(drop=True)\
.combine_first(parent_df.school)
Steps:
location_df.set_index(...) - Set index to 2 "criterion" columns.
.school - Leave only school column (with the above index).
[...] - Retrieve from it elements indicated by the MultiIndex
defined inside.
pd.MultiIndex.from_arrays( - Create the MultiIndex.
parent_df.location.str.split('_', expand=True)[2] - The first level
of the MultiIndex - the "city" part from location.
parent_df.type - The second level of the MultiIndex - type.
reset_index(...) - Change the MultiIndex into a default index
(now the index is just the same as in parent_df).
combine_first(...) - Overwrite NaN values in the result generated
so far with original values from school.
parent_df.school = - Save the result back in school column.
For the test purpose, to check the execution speed, you can change it
with parent_df['school_2'].
According to my assessment, the execution time is about 9 % shorter than
for my original solution.
Corrections to your code
Take a look at location_values[1]. It retrieves the pre segment, whereas
actually the next segment (city name) should be retrieved.
There is no need to create a temporary list based on the first condition
and then narrow it down by filtering with the second condition.
Both your conditions (for equality of location and type) can be
performed in a single instruction, so that execution time is a bit
shorter.
The value returned in the "positive" case should be from name_alias,
not location_df.
So if for some reason you wanted to stick with your code, change the
respective fragment to:
name_alias = location_df[location_df['location'].eq(location_values[2]) &
    location_df['type'].eq(row.type)]
if len(name_alias) > 0:
    return name_alias.school.iloc[0]
else:
    return row['school']
If I'm reading the question correctly, what you're implementing with the apply method is a kind of join operation. Pandas excels at vectorizing operations, plus its C-based implementation of join ('merge') is almost certainly faster than a Python / apply-based one. Therefore, I would try the following solution:
parent_df["location_short"] = parent_df.location.str.split("_", expand=True)[2]
parent_df = pd.merge(parent_df, location_df, how = "left", left_on=["location_short", "type"],
right_on=["location", "type"], suffixes = ["", "_by_location"])
parent_df.loc[parent_df.school_by_location.notna(), "school"] = \
parent_df.loc[parent_df.school_by_location.notna(), "school_by_location"]
As far as I could understand, it produces what you're looking for:
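For reference, a self-contained sketch of this merge approach on the sample data from the edited question. It follows the same idea, but writes the result into a separate frame named merged and uses fillna for the final overwrite; both of those choices are made for this illustration only.

```python
import pandas as pd

parent_df = pd.DataFrame({
    "school": ["school_1", "school_2", "school_3", "school_4", "school_5"],
    "location": ["_pre_New Delhi_post", "_pre_Kolkata_post", "_pre_Bengaluru_post",
                 "_pre_Mumbai_post", "_pre_Chennai_post"],
    "type": ["Govt", "Private", "Private", "Govt", "Private"],
})
location_df = pd.DataFrame({
    "school": ["school_10", "school_20", "school_30"],
    "location": ["New Delhi", "Kolkata", "Bengaluru"],
    "type": ["Govt", "Private", "Private"],
})

# Extract the city part, e.g. "_pre_Kolkata_post" -> "Kolkata"
parent_df["location_short"] = parent_df["location"].str.split("_", expand=True)[2]

# Left join keeps every parent row; unmatched rows get NaN in school_by_location
merged = pd.merge(
    parent_df, location_df, how="left",
    left_on=["location_short", "type"], right_on=["location", "type"],
    suffixes=("", "_by_location"),
)

# Prefer the looked-up name, fall back to the original one
merged["school"] = merged["school_by_location"].fillna(merged["school"])
print(merged[["school", "location", "type"]])
```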
I would like to fill a dataset and compute the log returns at the same time:
These are the returns
ret_names =['FTSEMIB_Index_ret', 'FCA_IM_Equity_ret', 'UCG_IM_Equity_ret', 'ISP_IM_Equity_ret',
'ENI_IM_Equity_ret',
'LUX_IM_Equity_ret']
and this is the Dataframe
'FTSEMIB_Index', 'FCA_IM_Equity', 'UCG_IM_Equity', 'ISP_IM_Equity','ENI_IM_Equity', 'LUX_IM_Equity'
0 22793.69 14.840 16.430 2.8860 14.040 49.24
1 22991.99 15.150 16.460 2.8780 14.220 48.98
2 23046.05 15.290 16.760 2.8660 14.300 48.70
3 23014.13 15.660 16.390 2.8500 14.380 48.72
4 23002.85 15.590 16.300 2.8420 14.500 49.13
So my idea is to use enumerate in a for loop.
for index,name in enumerate(ret_names):
    df[name] = np.diff(np.log(df.iloc[:,index]))
But I cannot match the length, because computing the returns drops one value (the first one, I suppose).
Any ideas?
Maybe I found a solution, but I can't figure out why the previous one doesn't work.
for index,name in enumerate(ret_names):
    df[name] = np.log(df.iloc[:,index] / df.iloc[:,index].shift(1))
With this you can fill the columns and assign the names, and the first value will be NaN.
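On why the np.diff version fails: np.diff returns an array with one element fewer than its input, so it cannot be assigned to a column that has to match the DataFrame's length. The pandas diff method keeps the length and puts NaN in the first row. A minimal sketch without the loop, using the first three rows of two of the columns above:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "FTSEMIB_Index": [22793.69, 22991.99, 23046.05],
    "FCA_IM_Equity": [14.840, 15.150, 15.290],
})

price_cols = list(df.columns)

# log(p_t) - log(p_{t-1}) for every price column at once; first row is NaN
log_ret = np.log(df[price_cols]).diff().add_suffix("_ret")

# Append the return columns next to the prices
df = df.join(log_ret)
print(df)
```

The add_suffix call produces the same column names as the ret_names list, so that list is not needed in this variant.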