Extract string between two substrings in a pandas df column - python

I have the following dataframe:
contract
0 Future(conId=482048803, symbol='ESTX50', lastT...
1 Future(conId=497000453, symbol='XT', lastTrade...
2 Stock(conId=321100413, symbol='SXRS', exchange...
3 Stock(conId=473087271, symbol='ETHEEUR', excha...
4 Stock(conId=80268543, symbol='IJPA', exchange=...
5 Stock(conId=153454120, symbol='EMIM', exchange...
6 Stock(conId=75776072, symbol='SXR8', exchange=...
7 Stock(conId=257200855, symbol='EGLN', exchange...
8 Stock(conId=464974581, symbol='VBTC', exchange...
9 Future(conId=478135706, symbol='ZN', lastTrade...
I want to create a new "symbol" column with all symbols (ESTX50, XT, SXRS...).
In order to extract the substring between "symbol='" and the following single quote, I tried the following:
df['symbol'] = df.contract.str.extract(r"symbol='(.*?)'")
but I get a column of NaN.
What am I doing wrong? Thanks

It looks like that is a column of objects, not strings:
import pandas as pd
class Future:
def __init__(self, symbol):
self.symbol = symbol
def __repr__(self):
return f'Future(symbol=\'{self.symbol}\')'
df = pd.DataFrame({'contract': [Future(symbol='ESTX50'), Future(symbol='XT')]})
df['symbol'] = df.contract.str.extract(r"symbol='(.*?)'")
print(df)
df:
contract symbol
0 Future(symbol='ESTX50') NaN
1 Future(symbol='XT') NaN
Notice pandas considers strings to be object type so the string accessor is still allowed to attempt to perform operations. However, it cannot extract because these are not strings.
We can either convert to string first with astype:
df['symbol'] = df.contract.astype(str).str.extract(r"symbol='(.*?)'")
df:
contract symbol
0 Future(symbol='ESTX50') ESTX50
1 Future(symbol='XT') XT
However, the faster approach is to try to extract the object property:
df['symbol'] = [getattr(x, 'symbol', None) for x in df.contract]
Or with apply (which can be slower than the comprehension)
df['symbol'] = df.contract.apply(lambda x: getattr(x, 'symbol', None))
Both produce:
contract symbol
0 Future(symbol='ESTX50') ESTX50
1 Future(symbol='XT') XT

Related

Cannot set a DataFrame with multiple columns to the single column total_servings

I am a beginner and getting familiar with pandas .
It is throwing an error , When I was trying to create a new column this way :
drinks['total_servings'] = drinks.loc[: ,'beer_servings':'wine_servings'].apply(calculate,axis=1)
Below is my code, and I get the following error for line number 9:
"Cannot set a DataFrame with multiple columns to the single column total_servings"
Any help or suggestion would be appreciated :)
import pandas as pd
drinks = pd.read_csv('drinks.csv')
def calculate(drinks):
return drinks['beer_servings']+drinks['spirit_servings']+drinks['wine_servings']
print(drinks)
drinks['total_servings'] = drinks.loc[:, 'beer_servings':'wine_servings'].apply(calculate,axis=1)
drinks['beer_sales'] = drinks['beer_servings'].apply(lambda x: x*2)
drinks['spirit_sales'] = drinks['spirit_servings'].apply(lambda x: x*4)
drinks['wine_sales'] = drinks['wine_servings'].apply(lambda x: x*6)
drinks
In your code, when functioncalculate is called with axis=1, it passes each row of the Dataframe as an argument. Here, the function calculate is returning dataframe with multiple columns but you are trying to assigned to a single column, which is not possible. You can try updating your code to this,
def calculate(each_row):
return each_row['beer_servings'] + each_row['spirit_servings'] + each_row['wine_servings']
drinks['total_servings'] = drinks.apply(calculate, axis=1)
drinks['beer_sales'] = drinks['beer_servings'].apply(lambda x: x*2)
drinks['spirit_sales'] = drinks['spirit_servings'].apply(lambda x: x*4)
drinks['wine_sales'] = drinks['wine_servings'].apply(lambda x: x*6)
print(drinks)
I suppose the reason is the wrong argument name inside calculate method. The given argument is drink but drinks used to calculate sum of columns.
The reason is drink is Series object that represents Row and sum of its elements is scalar. Meanwhile drinks is a DataFrame and sum of its columns will be a Series object
Sample code shows that this method works.
import pandas as pd
df = pd.DataFrame({
"A":[1,1,1,1,1],
"B":[2,2,2,2,2],
"C":[3,3,3,3,3]
})
def calculate(to_calc_df):
return to_calc_df["A"] + to_calc_df["B"] + to_calc_df["C"]
df["total"] = df.loc[:, "A":"C"].apply(calculate, axis=1)
print(df)
Result
A B C total
0 1 2 3 6
1 1 2 3 6
2 1 2 3 6
3 1 2 3 6
4 1 2 3 6

Querying a list object from API and returning it into dataframe - issues with format

I have the below script that returns data in a list format per quote of (i). I set up an empty list, and then query with the API function get_kline_data, and pass each output into my klines_list with the .extend function
klines_list = []
a = ["REQ-ETH","REQ-BTC","XLM-BTC"]
for i in a:
klines = client.get_kline_data(i, '5min', 1619317366, 1619317606)
klines_list.extend([i,klines])
klines_list
klines_list then returns data in this format;
['REQ-ETH',
[['1619317500',
'0.0000491',
'0.0000491',
'0.0000491',
'0.0000491',
'5.1147',
'0.00025113177']],
'REQ-BTC',
[['1619317500',
'0.00000219',
'0.00000219',
'0.00000219',
'0.00000219',
'19.8044',
'0.000043371636']],
'XLM-BTC',
[['1619317500',
'0.00000863',
'0.00000861',
'0.00000863',
'0.00000861',
'653.5693',
'0.005629652673']]]
I then try to convert it into a dataframe;
import pandas as py
df = py.DataFrame(klines_list)
And this is the result;
0
0 REQ-ETH
1 [[1619317500, 0.0000491, 0.0000491, 0.0000491,...
2 REQ-BTC
3 [[1619317500, 0.00000219, 0.00000219, 0.000002...
4 XLM-BTC
5 [[1619317500, 0.00000863, 0.00000861, 0.000008..
The structure of the DF is incorrect and it seems to be due to the way I have put my list together.
I would like the quantitative data in a column corresponding to the correct entry in list a, not in rows. Also, the ticker data, or list a, ("REQ-ETH/REQ-BTC") etc should be in a separate column. What would be a good way to go about restructuring this?
Edit: #Ynjxsjmh
This is the output when following the suggestion below for appending a dictionary within the for loop
REQ-ETH REQ-BTC XLM-BTC
0 [1619317500, 0.0000491, 0.0000491, 0.0000491, ... NaN NaN
1 NaN [1619317500, 0.00000219, 0.00000219, 0.0000021... NaN
2 NaN NaN [1619317500, 0.00000863, 0.00000861, 0.0000086...
pandas.DataFrame() can accept a dict. It will construct the dict key as column header, dict value as column values.
import pandas as pd
a = ["REQ-ETH","REQ-BTC","XLM-BTC"]
klines_data = {}
for i in a:
klines = client.get_kline_data(i, '5min', 1619317366, 1619317606)
klines_data[i] = klines[0]
# ^
# |
# Add a key to klines_data
df = pd.DataFrame(klines_data)
print(df)
REQ-ETH REQ-BTC XLM-BTC
0 1619317500 1619317500 1619317500
1 0.0000491 0.00000219 0.00000863
2 0.0000491 0.00000219 0.00000861
3 0.0000491 0.00000219 0.00000863
4 0.0000491 0.00000219 0.00000861
5 5.1147 19.8044 653.5693
6 0.00025113177 0.000043371636 0.005629652673
If the length of klines is not equal, you can use
df = pd.DataFrame.from_dict(klines_data, orient='index').T

How to append value to a cell in pandas dataframe

I have a dataframe where I am creating a new column and populating its value. Based on the condition, the new column needs to have some values appended to it if that row is encountered again.
So for example for a given dataframe:
df
id Stores is_open
1 'Walmart', 'Target' true
2 'Best Buy' false
3 'Target' true
4 'Home Depot' true
Now If I want to add a new column as a Ticker that can be a comma-separated string of tickers or list (whichever is preferable and easier. No preference on my end) for the given comma separated stores.
So for example ticker of Walmart is wmt and target is tgt. The wmt and tgt data I am getting from another dataframe based on matching key so I tried to add as follows but not all of them are assigned even though they have values and only one value followed by a comma is assigned to Tickers column and not multiple:
df['Tickers'] = ''
for _, row in df.iterrows():
stores = row['Stores']
list_stores = stores(',')
if len(list_stores) > 1:
for store in list_stores:
tmp_df = second_df[second_df['store_id'] == store]
ticker = tmp_df['Ticker'].values[0] if len(tmp_df['Ticker'].values) > 0 else None
if ticker:
df.loc[
df['Stores'].astype(str).str.contains(store), 'Ticker'] += '{},'.format(ticker)
Expected output:
id Stores is_open Ticker
1 'Walmart', 'Target' true wmt, tgt
2 'Best Buy' false bby
3 'Target' true tgt
4 'Home Depot' true nan
I would really appreciate if someone could help me out here.
You can use the apply method with axis=1 to pass the row and perform your calculations. See the code below:
import pandas as pd
mydict = {'id':[1,2],'Store':["'Walmart','Target'","'Best Buy'"], 'is_open':['true', 'false']}
df = pd.DataFrame(mydict, index=[0,1])
df.set_index('id',drop=True, inplace=True)
The df so far:
Store is_open
id
1 'Walmart','Target' true
2 'Best Buy' false
The lookup dataframe:
df2 = pd.DataFrame({'Store':['Walmart', 'Target','Best Buy'], 'Ticker':['wmt','tgt','bby']})
Store Ticker
0 Walmart wmt
1 Target tgt
2 Best Buy bby
here is the code for adding the column:
def add_column(row):
items = row['Store'].split(',')
tkr_list = []
for string in items:
mystr = string.replace("'","")
tkr = df2.loc[df2['Store']==mystr,'Ticker'].values[0]
tkr_list.append(tkr)
return tkr_list
df['Ticker']=df.apply(add_column, axis=1)
and this is the result for df:
Store is_open Ticker
id
1 'Walmart','Target' true [wmt, tgt]
2 'Best Buy' false [bby]

Get value from pandas series object

I have a bunch of data files, with columns 'Names', 'Gender', 'Count', one file per one year. I need to concatenate all the files for some period, sum all counts for all unique names and add a new column with amount of consonant. I can't extract string value from 'Names'. How can I implement that?
Here is my code:
import os
import re
import pandas as pd
PATH = ...
def consonants_dynamics (years):
names_by_year = {}
for year in years:
names_by_year[year] = pd.read_csv(PATH+"\\yob{}.txt".format(year), names =['Names', 'Gender', 'Count'])
names_all = pd.concat(names_by_year, names=['Year', 'Pos'])
dynamics = names_all.groupby('Names').sum().sort_values(by='Count', ascending=False).unstack('Names')
dynamics['Consonants'] = dynamics.apply(count_vowels(dynamics.Names), axis = 1)
return dynamics.head(10)
def count_vowels (name):
vowels = re.compile('A|E|I|O|U|a|e|i|o|u')
return len(name) - len (vowels.findall(name))
If I run something like
a = consonants_dynamics(i for i in range (1900, 2001, 10))
I get the following error message
<ipython-input-9-942fc155267e> in consonants_dynamcis(years)
...
---> 12 dynamics['Consonants'] = dynamics.apply(count_vowels(dynamics.Names), axis = 1)
AttributeError: 'Series' object has no attribute 'Names'
I tried various ways but all failed. How can it be done?
after doing unstack you converted dynamics to a series object where you no longer have Names column dynamics.Names. I think it should be fixed by removing .unstack('Names')
after that use dynamics.index:
dynamics['Consonants'] = dynamics.reset_index()['Names'].apply(count_vowels)
Convert index to_series and apply function:
print (dynamics)
Count
Names
James 2
John 3
Robert 10
def count_vowels (name):
vowels = re.compile('A|E|I|O|U|a|e|i|o|u')
return len(name) - len (vowels.findall(name))
dynamics['Consonants'] = dynamics.index.to_series().apply(count_vowels)
Solution without function with str.len and substract only wovels by str.count:
pat = 'A|E|I|O|U|a|e|i|o|u'
s = dynamics.index.to_series()
dynamics['Consonants_new'] = s.str.len() - s.str.count(pat)
print (dynamics)
Count Consonants_new Consonants
Names
James 2 3 3
John 3 3 3
Robert 10 4 4
EDIT:
Solutions without to_series is add as_index=False to groupby for return DataFrame:
names_all = pd.DataFrame({
'Names':['James','James','John','John', 'Robert', 'Robert'],
'Count':[10,20,10,30, 80,20]
})
dynamics = names_all.groupby('Names', as_index=False).sum()
.sort_values(by='Count', ascending=False)
pat = 'A|E|I|O|U|a|e|i|o|u'
s = dynamics.index.to_series()
dynamics['Consonants'] = dynamics['Names'].str.len() - dynamics['Names'].str.count(pat)
print (dynamics)
Names Count Consonants
2 Robert 100 4
1 John 40 3
0 James 30 3

pandas dataframe contains list

I have currently run the following script which uses Fuzzylogic to replace some common words from the list. Dataframe df1 contains my default list of possible values. Dataframe df2 is the main dataframe where transformations/changes are undertaken after referring to Dataframe df1. The code is as follows:
df1 = pd.DataFrame(['one','two','three','four','five','tsst'])
df2 = pd.DataFrame({'not_shifted':[np.nan,'one','too','three','fours','five','six',np.nan,'test']})
# Drop nan value
df2=pd.DataFrame(df2['not_shifted'].fillna(value=''))
df2['not_shifted'] = df2['not_shifted'].map(lambda x: difflib.get_close_matches(x, df1[0]))
The problem is the output is a dataframe which contains square brackets. To make matters worse, none of the texts within df2['not_shifted'] are viewable/ recallable:
Out[421]:
not_shifted
0 []
1 [one]
2 [two]
3 [three]
4 [four]
5 [five]
6 []
7 []
8 [tsst]
Please help.
df2.not_shifted.apply(lambda x: x[0] if len(x) != 0 else "") or simply df2.not_shifted.str[0] as solved by #Psidom
def replace_all(eg):
rep = {"[":"",
"]":"",
"u":"",
"}":"",
"'":"",
'"':"",
"frozenset":""}
for i,j in rep.items():
eg = eg.replace(i,j)
return eg
for each in df.columns:
df[each] = df[each].apply(lambda x : replace_all(str(x)))

Categories

Resources