I have a dataframe where I am creating a new column and populating its value. Based on the condition, the new column needs to have some values appended to it if that row is encountered again.
So for example for a given dataframe:
df
id Stores is_open
1 'Walmart', 'Target' true
2 'Best Buy' false
3 'Target' true
4 'Home Depot' true
Now I want to add a new column, Ticker, that holds either a comma-separated string of tickers or a list (whichever is easier; I have no preference) for the given comma-separated stores.
For example, the ticker of Walmart is wmt and the ticker of Target is tgt. I get the wmt and tgt data from another dataframe by matching on a key, so I tried the following, but not all rows are assigned even though they have values, and only a single value followed by a comma ends up in the Tickers column instead of multiple:
df['Tickers'] = ''
for _, row in df.iterrows():
    stores = row['Stores']
    list_stores = stores.split(',')
    if len(list_stores) > 1:
        for store in list_stores:
            tmp_df = second_df[second_df['store_id'] == store]
            ticker = tmp_df['Ticker'].values[0] if len(tmp_df['Ticker'].values) > 0 else None
            if ticker:
                df.loc[
                    df['Stores'].astype(str).str.contains(store), 'Tickers'] += '{},'.format(ticker)
Expected output:
id Stores is_open Ticker
1 'Walmart', 'Target' true wmt, tgt
2 'Best Buy' false bby
3 'Target' true tgt
4 'Home Depot' true nan
I would really appreciate if someone could help me out here.
You can use the apply method with axis=1 to pass the row and perform your calculations. See the code below:
import pandas as pd
mydict = {'id':[1,2],'Store':["'Walmart','Target'","'Best Buy'"], 'is_open':['true', 'false']}
df = pd.DataFrame(mydict, index=[0,1])
df.set_index('id',drop=True, inplace=True)
The df so far:
Store is_open
id
1 'Walmart','Target' true
2 'Best Buy' false
The lookup dataframe:
df2 = pd.DataFrame({'Store':['Walmart', 'Target','Best Buy'], 'Ticker':['wmt','tgt','bby']})
Store Ticker
0 Walmart wmt
1 Target tgt
2 Best Buy bby
Here is the code for adding the column:
def add_column(row):
    items = row['Store'].split(',')
    tkr_list = []
    for string in items:
        mystr = string.replace("'", "")
        tkr = df2.loc[df2['Store'] == mystr, 'Ticker'].values[0]
        tkr_list.append(tkr)
    return tkr_list

df['Ticker'] = df.apply(add_column, axis=1)
and this is the result for df:
Store is_open Ticker
id
1 'Walmart','Target' true [wmt, tgt]
2 'Best Buy' false [bby]
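If a comma-separated string is preferred over a list, one option is to build a ticker lookup dict from df2 once and join the matches. This is only a minimal sketch reusing the df and df2 defined above; the dict lookup also avoids scanning df2 once per store:
ticker_map = dict(zip(df2['Store'], df2['Ticker']))

def add_ticker_string(row):
    stores = [s.strip().strip("'") for s in row['Store'].split(',')]
    tickers = [ticker_map[s] for s in stores if s in ticker_map]
    # return NaN when no store matched, otherwise a comma-separated string
    return ', '.join(tickers) if tickers else float('nan')

df['Ticker'] = df.apply(add_ticker_string, axis=1)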
I have the following dataframe:
contract
0 Future(conId=482048803, symbol='ESTX50', lastT...
1 Future(conId=497000453, symbol='XT', lastTrade...
2 Stock(conId=321100413, symbol='SXRS', exchange...
3 Stock(conId=473087271, symbol='ETHEEUR', excha...
4 Stock(conId=80268543, symbol='IJPA', exchange=...
5 Stock(conId=153454120, symbol='EMIM', exchange...
6 Stock(conId=75776072, symbol='SXR8', exchange=...
7 Stock(conId=257200855, symbol='EGLN', exchange...
8 Stock(conId=464974581, symbol='VBTC', exchange...
9 Future(conId=478135706, symbol='ZN', lastTrade...
I want to create a new "symbol" column with all symbols (ESTX50, XT, SXRS...).
In order to extract the substring between "symbol='" and the following single quote, I tried the following:
df['symbol'] = df.contract.str.extract(r"symbol='(.*?)'")
but I get a column of NaN.
What am I doing wrong? Thanks
It looks like that is a column of objects, not strings:
import pandas as pd
class Future:
    def __init__(self, symbol):
        self.symbol = symbol

    def __repr__(self):
        return f'Future(symbol=\'{self.symbol}\')'
df = pd.DataFrame({'contract': [Future(symbol='ESTX50'), Future(symbol='XT')]})
df['symbol'] = df.contract.str.extract(r"symbol='(.*?)'")
print(df)
df:
contract symbol
0 Future(symbol='ESTX50') NaN
1 Future(symbol='XT') NaN
Notice that pandas stores strings with object dtype, so the .str accessor is still allowed to attempt the operation. However, it cannot extract anything here because these values are not strings.
We can either convert to string first with astype:
df['symbol'] = df.contract.astype(str).str.extract(r"symbol='(.*?)'")
df:
contract symbol
0 Future(symbol='ESTX50') ESTX50
1 Future(symbol='XT') XT
However, the faster approach is to try to extract the object property:
df['symbol'] = [getattr(x, 'symbol', None) for x in df.contract]
Or with apply (which can be slower than the comprehension):
df['symbol'] = df.contract.apply(lambda x: getattr(x, 'symbol', None))
Both produce:
contract symbol
0 Future(symbol='ESTX50') ESTX50
1 Future(symbol='XT') XT
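If you are not sure whether a column holds strings or objects, a quick look at the element types tells you which approach applies; this is just a small diagnostic sketch using the df built above:
# Count the Python types present in the column
print(df.contract.map(type).value_counts())
# str -> .str.extract works directly; custom objects -> prefer getattr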
The following dataframe, called df_merged, is a snippet of a larger dataframe with 30+ commodities like oil, gold, silver, etc.
Date CRUDE_OIL CRUDE_OIL_SMA200 GOLD GOLD_SMA200 SILVER SILVER_SMA200
0 2021-04-26 61.91 48.415 1779.199951 1853.216498 26.211 25.269005
1 2021-04-27 62.939999 48.5252 1778 1853.028998 26.412001 25.30566
2 2021-04-28 63.860001 48.6464 1773.199951 1852.898998 26.080999 25.341655
3 2021-04-29 65.010002 48.7687 1768.099976 1852.748498 26.052999 25.377005
4 2021-04-30 63.580002 48.8861 1767.300049 1852.529998 25.853001 25.407725
How can I implement a way to compare the regular commodity price with the SMA200 equivalent in an IF statement?
My current setup repeats the if statement below for 30+ columns, but I believe this can be done with a function.
comm_name = []
comm_averages = []
if (df_merged.CRUDE_OIL.tail(1) > df_merged.CRUDE_OIL_SMA200.tail(1)).any():
    print("CRUDE_OIL above 200SMA")
    comm_name.append("CRUDE_OIL")
    comm_averages.append(1)
else:
    print("CRUDE_OIL under 200SMA")
    comm_name.append("CRUDE_OIL")
    comm_averages.append(0)

if (df_merged.GOLD.tail(1) > df_merged.GOLD_SMA200.tail(1)).any():
    print("GOLD above 200SMA")
    comm_name.append("GOLD")
    comm_averages.append(1)
else:
    print("GOLD under 200SMA")
    comm_name.append("GOLD")
    comm_averages.append(0)

if (df_merged.SILVER.tail(1) > df_merged.SILVER_SMA200.tail(1)).any():
    print("SILVER above 200SMA")
    comm_name.append("SILVER")
    comm_averages.append(1)
else:
    print("SILVER under 200SMA")
    comm_name.append("SILVER")
    comm_averages.append(0)

comm_signals = pd.DataFrame(
    {'Name': comm_name,
     'Signal': comm_averages
    })

comm_signals
output of comm_signals:
Name Signal
0 CRUDE_OIL 1
1 GOLD 0
2 SILVER 1
I looked through this SO thread but couldn't figure out how to implement: Find column whose name contains a specific string
I guess the goal is a function like this:
comm_name = []
comm_averages = []

def comm_compare(df):
    if (df_merged["X"].tail(1) > df_merged["X" + "_SMA200"].tail(1)).any():
        print(X + " above 200SMA")
        comm_name.append(X)
        comm_averages.append(1)
    else:
        print(X + " under 200SMA")
        comm_name.append(X)
        comm_averages.append(0)

comm_signals = pd.DataFrame(
    {'Name': comm_name,
     'Signal': comm_averages
    })

print(comm_signals)
Name Signal
0 CRUDE_OIL 1
1 GOLD 0
2 SILVER 1
Try stack + groupby diff
import numpy as np
import pandas as pd
df = pd.DataFrame({
'Date': ['2021-04-26', '2021-04-27', '2021-04-28', '2021-04-29',
'2021-04-30'],
'CRUDE_OIL': [61.91, 62.939999, 63.860001, 65.010002, 63.580002],
'CRUDE_OIL_SMA200': [48.415, 48.5252, 48.6464, 48.7687, 48.8861],
'GOLD': [1779.199951, 1778.0, 1773.199951, 1768.099976, 1767.300049],
'GOLD_SMA200': [1853.216498, 1853.028998, 1852.898998, 1852.748498,
1852.529998],
'SILVER': [26.211, 26.412001, 26.080999, 26.052999, 25.853001],
'SILVER_SMA200': [25.269005, 25.30566, 25.341655, 25.377005, 25.407725]
})
# Grab tail(1) and only numeric columns
# Replace this with more specific select or use filter if not all
# number columns are needed
s = df.tail(1).select_dtypes('number')
# Remove the SMA200 suffix from all columns
# (note: str.rstrip('_SMA200') strips a character set, so a regex replace is safer)
s.columns = s.columns.str.replace('_SMA200$', '', regex=True)
s = (
    # Diff within each commodity group (SMA200 minus price) and check if negative
    (s.stack().groupby(level=1).diff().dropna() < 0)
    .astype(int)   # Convert boolean to int
    .droplevel(0)  # Clean up index levels
    .reset_index()
    .rename(columns={'index': 'Name', 0: 'Signal'})
)
print(s)
s:
Name Signal
0 CRUDE_OIL 1
1 GOLD 0
2 SILVER 1
To get the output per day you can do:
# to make connection between 'equity' and 'equity_sma200':
df.columns = df.columns.str.split("_", expand=True)
# we split dates to index - so we have only prices in the table:
df.set_index([("Date",)], append=False, inplace=True)
# you might not need casting
df = df.T.astype(float)
# since we have only 2 lines per day, per equity - we can just take negative sma, and cross-check aggregate sum:
mask = df.index.get_level_values(1) == "SMA200"
df.loc[mask] = -df.loc[mask]
df = df.groupby(level=0)[df.columns].sum().gt(0)
# moving columns back to human format:
df.columns = map(lambda x: x[0], df.columns)
Output for the sample data:
2021-04-26 2021-04-27 ... 2021-04-29 2021-04-30
CRUDE True True ... True True
GOLD False False ... False False
SILVER True True ... True True
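If you prefer something closer to the function sketched in the question, a plain loop over the base column names also works. This is only a sketch assuming the original wide df_merged layout, where every commodity column has a matching _SMA200 column:
def comm_compare(df):
    names, signals = [], []
    base_cols = [c for c in df.columns if c != 'Date' and not c.endswith('_SMA200')]
    for name in base_cols:
        # Compare the last price against the last SMA200 value
        above = df[name].iloc[-1] > df[name + '_SMA200'].iloc[-1]
        print(name, 'above 200SMA' if above else 'under 200SMA')
        names.append(name)
        signals.append(int(above))
    return pd.DataFrame({'Name': names, 'Signal': signals})

comm_signals = comm_compare(df_merged)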
From a file I have parsed the fields that I need and stored them in variables, which look something like this:
field_list = ['some_value','some_other_value']
raw_data = """sring1|0|2|N.S.|3|
sring2|0|2|N.S.|2|
sring3|0|2|3|5|"""
Now I need to create a df which looks like:
Str Measure Value
0 sring1 some_value N.S.
1 sring1 some_other_value 3
2 sring2 some_value N.S.
3 sring2 some_other_value 2
4 sring3 some_value 3
5 sring3 some_other_value 5
The logic here is as follows:
For example, for the line "sring1|0|2|N.S.|3|" in raw_data, the Str column value would be sring1, the Measure column value would be some_value (which comes from field_list), and the Value column value would be N.S.
For the same line again, the Str column value would be sring1, the Measure column value would be some_other_value, and the Value column value would be 3.
The |2| in "sring1|0|2|N.S.|3|" tells us how many rows that line produces, and the last two fields are the values corresponding to field_list.
Currently I have the following code:
field_list = ['some_value','some_other_value']
db_columns = ['Str','Measure','Value']
raw_data = """sring1|0|2|N.S.|3|
sring2|0|2|N.S.|2|
sring3|0|2|3|5|"""
entry_list = raw_data.splitlines()
final_db_list = []
for entries in entry_list:
    each_entry_list = entries.split('|')
    security = each_entry_list[0].strip()
    print(each_entry_list)
    no_of_fields = int(each_entry_list[2])
    db_list = []
    upload_list = []
    for i in range(0, no_of_fields):
        field = field_list[i]
        value = each_entry_list[3 + i]
        db_list = [security, field, value]
        upload_list.append(db_list)
    final_db_list.append(upload_list)

flatList = [item for elem in final_db_list for item in elem]
df = pd.DataFrame(flatList, columns=db_columns)
print(df)
Can someone please help me with a better way of doing this? What I have works but is too messy. I need to make it more pythonic and I am out of ideas.
Please help!
We can do it like this:
import pandas as pd
from io import StringIO
field_list = ['some_value','some_other_value']
raw_data = """sring1|0|2|N.S.|3|
sring2|0|2|N.S.|2|
sring3|0|2|3|5|"""
df = pd.read_csv(StringIO(raw_data), sep='|', header=None)
df = df.drop(5, axis=1)
df = (df.set_index([0, 1, 2])
        .set_axis(field_list, axis=1)
        .reset_index(level=[1, 2], drop=True)
        .stack()
        .rename('Value')
        .rename_axis(['Str', 'Measure'])
        .reset_index()
      )
print(df)
Output:
Str Measure Value
0 sring1 some_value N.S.
1 sring1 some_other_value 3
2 sring2 some_value N.S.
3 sring2 some_other_value 2
4 sring3 some_value 3
5 sring3 some_other_value 5
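An alternative sketch that reaches the same long format with melt instead of stack; it assumes every line carries exactly the two fields named in field_list:
df = pd.read_csv(StringIO(raw_data), sep='|', header=None)
long_df = (df[[0, 3, 4]]
           .set_axis(['Str', *field_list], axis=1)
           .melt(id_vars='Str', var_name='Measure', value_name='Value')
           .sort_values('Str', kind='stable')  # keeps some_value before some_other_value within each Str
           .reset_index(drop=True))
print(long_df)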
I have run the following script, which uses fuzzy matching to replace some common words from a list. Dataframe df1 contains my default list of possible values. Dataframe df2 is the main dataframe, where transformations are made after referring to df1. The code is as follows:
import difflib
import numpy as np
import pandas as pd

df1 = pd.DataFrame(['one','two','three','four','five','tsst'])
df2 = pd.DataFrame({'not_shifted':[np.nan,'one','too','three','fours','five','six',np.nan,'test']})
# Replace NaN with an empty string
df2 = pd.DataFrame(df2['not_shifted'].fillna(value=''))
df2['not_shifted'] = df2['not_shifted'].map(lambda x: difflib.get_close_matches(x, df1[0]))
The problem is that the output is a dataframe whose values are wrapped in square brackets (lists). To make matters worse, none of the text within df2['not_shifted'] can be retrieved as a plain string:
Out[421]:
not_shifted
0 []
1 [one]
2 [two]
3 [three]
4 [four]
5 [five]
6 []
7 []
8 [tsst]
Please help.
Use df2.not_shifted.apply(lambda x: x[0] if len(x) != 0 else "") or simply df2.not_shifted.str[0], as suggested by @Psidom (note that .str[0] yields NaN rather than an empty string for empty lists). Alternatively, you can strip the brackets and other noise from the string representation of every column:
def replace_all(eg):
    rep = {"[": "",
           "]": "",
           "u": "",
           "}": "",
           "'": "",
           '"': "",
           "frozenset": ""}
    for i, j in rep.items():
        eg = eg.replace(i, j)
    return eg

for each in df.columns:
    df[each] = df[each].apply(lambda x: replace_all(str(x)))
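A cleaner route is to avoid producing lists in the first place by unpacking the difflib result immediately; this is just a minimal sketch using the same df1 and df2 as above:
import difflib

df2['not_shifted'] = df2['not_shifted'].fillna('').map(
    lambda x: next(iter(difflib.get_close_matches(x, df1[0])), '')
)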
I have a pandas dataframe where all missing values are np.nan, and I am trying to replace them. The last column of my data is "class"; I need to group the data by class, compute the mean/median/mode for each group of a column (depending on whether the data is categorical or continuous, normal or not), and replace the missing values in that group of the column with the respective mean/median/mode.
This is the code I have come up with, which I know is overkill. It would be great if I could:
group the columns of the dataframe,
get the median/mode/mean of each group of each column,
replace the missing values of those groups,
and recombine them back into the original dataframe.
But currently I have ended up finding the replacement values (mean/median/mode) group-wise and storing them in a dict, then separating the rows with NaNs from the rows without, replacing the missing values in the NaN rows, and trying to join them back to the dataframe (which I don't yet know how to do).
import collections
import pandas as pd
from scipy import stats

def fillMissing(df, dataType):
    '''
    Args:
        df (2d array / dict):
            eg: ('attribute1': [12, 24, 25], 'attribute2': ['good', 'bad'])
        dataType (dict): dictionary with attribute names of df as keys and values 0/1
            indicating categorical/continuous variable, eg: ('attribute1': 1, 'attribute2': 0)
    Returns:
        dataframe with missing values filled;
        also writes a file with the missing values replaced.
    '''
    dataLabels = list(df.columns.values)
    # the dictionary to hold the values to put in place of nan
    replaceValues = {}
    for eachlabel in dataLabels:
        thisSer = df[eachlabel]
        if dataType[eachlabel] == 1:  # if it is a continuous variable
            _, pval = stats.normaltest(thisSer)
            groupedd = thisSer.groupby(df['class'])
            innerDict = {}
            for name, group in groupedd:
                if pval < 0.5:
                    groupMiddle = group.median()  # get the median of the group
                else:
                    groupMiddle = group.mean()  # get the mean (if the group is normal)
                innerDict[name.strip()] = groupMiddle
            replaceValues[eachlabel] = innerDict
        else:  # if the series is categorical
            # freqCount = collections.Counter(thisSer)
            groupedd = thisSer.groupby(df['class'])
            innerDict = {}
            for name, group in groupedd:
                freqC = collections.Counter(group)
                mostFreq = freqC.most_common(1)  # get the most frequent value of the attribute (grouped by class)
                # newGroup = group.replace(np.nan, mostFreq)
                innerDict[name.strip()] = mostFreq[0][0].strip()
            replaceValues[eachlabel] = innerDict
    print(replaceValues)

    # replace the missing values =======================
    newfile = open('missingReplaced.csv', 'w')
    newdf = df
    mask = False
    for col in df.columns:
        mask = mask | df[col].isnull()
    # get the dataframe of rows that contain nulls
    dfnulls = df[mask]
    dfnotNulls = df[~mask]
    for _, row in dfnulls.iterrows():
        for colname in dataLabels:
            if pd.isnull(row[colname]):
                if row['class'].strip() == '>50K':
                    row[colname] = replaceValues[colname]['>50K']
                else:
                    row[colname] = replaceValues[colname]['<=50K']
            newfile.write(str(row[colname]) + ",")
        newdf.append(row)
        newfile.write("\n")
    # here add newdf to dfnotNulls to get finaldf
    return finaldf
If I understand correctly, this is mostly in the documentation, but probably not where you'd be looking if you're asking the question. See note regarding mode at the bottom as it is slightly trickier than mean and median.
df = pd.DataFrame({ 'v':[1,2,2,np.nan,3,4,4,np.nan] }, index=[1,1,1,1,2,2,2,2],)
df['v_mean'] = df.groupby(level=0)['v'].transform( lambda x: x.fillna(x.mean()))
df['v_med' ] = df.groupby(level=0)['v'].transform( lambda x: x.fillna(x.median()))
df['v_mode'] = df.groupby(level=0)['v'].transform( lambda x: x.fillna(x.mode()[0]))
df
v v_mean v_med v_mode
1 1 1.000000 1 1
1 2 2.000000 2 2
1 2 2.000000 2 2
1 NaN 1.666667 2 2
2 3 3.000000 3 3
2 4 4.000000 4 4
2 4 4.000000 4 4
2 NaN 3.666667 4 4
Note that mode() may not be unique, unlike mean and median and pandas returns it as a Series for that reason. To deal with that, I just took the simplest route and added [0] in order to extract the first member of the series.
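To tie this back to the question, the same transform pattern can be applied across every column, picking the statistic per column. This is only a sketch; it assumes a 'class' column, a dataType dict like the one in the question (1 for continuous, 0 for categorical), and at least one non-null value per group:
def fill_by_class(df, dataType):
    out = df.copy()
    for col in df.columns.drop('class'):
        grouped = df.groupby('class')[col]
        if dataType[col] == 1:
            # continuous: fill with the group median (swap in mean() where appropriate)
            out[col] = grouped.transform(lambda x: x.fillna(x.median()))
        else:
            # categorical: fill with the group mode
            out[col] = grouped.transform(lambda x: x.fillna(x.mode()[0]))
    return out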