The setup
I want to add a new column containing a URL that has a base/template form, with certain values interpolated into it from the information contained in each row.
Table
What I would LOVE to be able to do
import operator
import pandas as pd

base_link = "https://www.vectorbase.org/Glossina_fuscipes/Location/View?r=%(scaffold)s:%(start)s-%(end)s"

# simplify getting column data from data_frame
start = operator.attrgetter('start')
end = operator.attrgetter('end')
scaffold = operator.attrgetter('seqname')

def get_links_to_genome_browser(data_frame):
    base_links = pd.Series([base_link] * len(data_frame.index))
    links = base_links % {"scaffold": scaffold(data_frame), "start": start(data_frame), "end": end(data_frame)}
    return links
I am answering my own question: I finally figured it out, so I want to close this out and record the solution.
The solution is to use data_frame.apply(), but to change the indexing syntax inside get_links_to_genome_browser to Series indexing rather than DataFrame indexing syntax.
def get_links_to_genome_browser(series):
    # plain label indexing; the original used series.ix, which is removed in modern pandas
    link = base_link % {"scaffold": series['seqname'], "start": series['start'], "end": series['end']}
    return link
Then call it like:
df.apply(get_links_to_genome_browser, axis=1)
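For completeness, a minimal end-to-end sketch with made-up rows (column names taken from the question; the data values are hypothetical):

```python
import pandas as pd

base_link = ("https://www.vectorbase.org/Glossina_fuscipes/Location/View"
             "?r=%(scaffold)s:%(start)s-%(end)s")

def get_links_to_genome_browser(series):
    # label-based Series indexing works row by row under apply(axis=1)
    return base_link % {"scaffold": series['seqname'],
                        "start": series['start'],
                        "end": series['end']}

df = pd.DataFrame({"seqname": ["Scf1", "Scf2"],
                   "start": [100, 2000],
                   "end": [900, 2500]})
df["url"] = df.apply(get_links_to_genome_browser, axis=1)
print(df["url"].iloc[0])
# → https://www.vectorbase.org/Glossina_fuscipes/Location/View?r=Scf1:100-900
```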
I think I get what you're asking; let me know if not.
base_link = "https://www.vectorbase.org/Glossina_fuscipes/Location/View?r=%(scaffold)s:%(start)s-%(end)s"
then you can do something like this (casting numeric columns to str before concatenating):
data_frame['url'] = base_link + data_frame['start'].astype(str) + data_frame['end'].astype(str) + etc...
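Alternatively, a sketch that builds the column without apply, interpolating one URL per row via zip (column names and values assumed from the question):

```python
import pandas as pd

base_link = ("https://www.vectorbase.org/Glossina_fuscipes/Location/View"
             "?r=%(scaffold)s:%(start)s-%(end)s")

df = pd.DataFrame({"seqname": ["Scf1", "Scf2"],
                   "start": [100, 2000],
                   "end": [900, 2500]})

# zip the three columns and interpolate one URL per row
df["url"] = [base_link % {"scaffold": s, "start": a, "end": b}
             for s, a, b in zip(df["seqname"], df["start"], df["end"])]
print(df["url"].iloc[1])
# → https://www.vectorbase.org/Glossina_fuscipes/Location/View?r=Scf2:2000-2500
```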
# fetch the data (a sequence of 1 million rows) as dataframes
df1 = My_functions.get_ais_data(json1)
df2 = My_functions.get_ais_data(json2)
df3 = My_functions.get_ais_data(json3)
df_all = pd.concat([df1, df2, df3], axis=0)

# save the data frame under a name built from oldest_id and the corresponding ISO date
df_all.to_csv('oldest_id + iso_date +.csv')
The last line might be silly, but I am trying to save the data frame under a name built from some variables I created earlier in the code.
You can use an f-string to embed variables in strings, like this:
df_all.to_csv(f'/path/to/folder/{oldest_id}{iso_date}.csv')
If you need the value corresponding to the variable, then mid's answer is correct, thus:
df_all.to_csv(f'/path/to/folder/{oldest_id}{iso_date}.csv')
However, if you want to use the name of the variable itself (the f'{var=}' specifier requires Python 3.8+):
df_all.to_csv('/path/to/folder/' + f'{oldest_id=}'.split('=')[0] + f'{iso_date=}'.split('=')[0] + '.csv')
would do the job.
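A quick self-contained demonstration of the = specifier trick (the variable values here are hypothetical):

```python
oldest_id = 12345          # hypothetical value for illustration
iso_date = "2021-06-01"    # hypothetical value for illustration

# f'{var=}' renders as 'var=<value>' (Python 3.8+), so splitting on '='
# recovers the variable's *name* rather than its value
name_part = f'{oldest_id=}'.split('=')[0] + f'{iso_date=}'.split('=')[0]
print(name_part + '.csv')
# → oldest_idiso_date.csv
```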
Maybe try:
file_name = f"{oldest_id}{iso_date}.csv"
df_all.to_csv(file_name)
Assuming you are using Python 3.6 and up.
I have a question about how to loop over a few lines of code.
get_sol is a function with two parameters: def get_sol(sub_dist_fil, fos_cnt).
banswara, palwal and hathin are some values from a column named "sub-district".
The second argument, 1, is fixed.
I am writing it as:
out_1 = get_sol( "banswara",1)
out_1 = get_sol("palwal",1)
out_1 = get_sol("hathin",1)
How can I apply a for loop to these lines in order to get the results in one go?
Help!!
A few comments have helped me achieve my results (thanks a lot!). The result is as follows:
Now I have a query: how do I display/print the name of the respective district for which the results are running?
Well, in the general case you can do something like this:
data = ['banswara', 'palwal', 'hathin']
result = {}
for item in data:
    result[item] = get_sol(item, 1)
print(result)
This will pack your results into a dictionary, giving you the opportunity to see which result was generated for which input.
Here you go:
# save the values into a list
random_values = column["sub-district"]

# iterate through using a for loop
for random_value in random_values:
    # get the result
    result = get_sol(random_value, 1)
    # print the result or do whatever you want with it
    print(result)
Similar to other answers, but using a list comprehension to make it more pythonic (and usually faster):
districts = ['banswara', 'palwal', 'hathin']
result = [get_sol(item, 1) for item in districts]
I think you are trying to get random values from the column 'subdistrict'.
For the purpose of illustration, let the dataframe be df (so the 'subdistrict' column is df['subdistrict']).
import numpy as np

# select 10 random values from the column
[print(get_sol(x, 1)) for x in np.random.choice(df['subdistrict'], 10)]
Here is the official documentation
Pytrends, for Google Trends data, does not return a column if there is no data for a search parameter in a specific region.
The code below is from pytrends.request
def interest_over_time(self):
    """Request data from Google's Interest Over Time section and return a dataframe"""
    over_time_payload = {
        # convert to string as requests will mangle
        'req': json.dumps(self.interest_over_time_widget['request']),
        'token': self.interest_over_time_widget['token'],
        'tz': self.tz
    }
    # make the request and parse the returned json
    req_json = self._get_data(
        url=TrendReq.INTEREST_OVER_TIME_URL,
        method=TrendReq.GET_METHOD,
        trim_chars=5,
        params=over_time_payload,
    )
    df = pd.DataFrame(req_json['default']['timelineData'])
    if df.empty:
        return df
    df['date'] = pd.to_datetime(df['time'].astype(dtype='float64'), unit='s')
    df = df.set_index(['date']).sort_index()
From the code above, if there is no data, it just returns df, which will be empty.
My question is, how can I make it return a column with "No data" on every line and the search term as header, so that I can clearly see for which search terms there is no data?
Thank you.
I hit this problem, then I hit this web page. My solution was to ask Google Trends for data on a search term it would always have data for, then rename the column and zero out the data.
I used the .drop method to get rid of the "isPartial" column and the .rename method to change the column name. To zero the data in the column, I created a function:
# make every value zero
def MakeZero(x):
    return x * 0
Then use the .apply method on the dataframe to zero the column:
ThisYrRslt=BlankResult.apply(MakeZero)
:) But the question is: what search term do you ask Google Trends about that will always return a value? I chose "Google". :)
I'm sure you can think of some better ones, but it's hard to leave those words in commercial code.
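Building on the question's goal, here is a minimal sketch (standalone, using a plain pandas helper rather than a patch to pytrends itself; the function name is an assumption) of returning a column filled with "No data" under the search term's name whenever the fetched frame comes back empty:

```python
import pandas as pd

def with_no_data_placeholder(df, keyword, index=None):
    """Return df unchanged unless it is empty, in which case return a
    one-column frame named after the keyword and filled with 'No data'."""
    if df.empty:
        idx = index if index is not None else range(1)
        return pd.DataFrame({keyword: ["No data"] * len(idx)}, index=idx)
    return df

empty = pd.DataFrame()  # stand-in for an empty pytrends result
out = with_no_data_placeholder(empty, "obscure term",
                               index=pd.date_range("2020-01-01", periods=3))
print(out)
```

This keeps the search term visible as a column header, so empty results are easy to spot when several frames are concatenated side by side.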
I have a DataFrame whose two separate columns each contain lists.
import pandas as pd
data = pd.DataFrame()
data["Website"] = [["google.com", "amazon.com"], ["google.com"], ["aol.com", "no website"]]
data["App"] = [["Ok Google", "Alexa"], ["Ok Google"], ["AOL App", "Generic Device"]]
That is how the DataFrame looks.
I need to replace certain strings in the first column (here: "no website") with the according string in the second column (here: "Generic Device"). The replacing string has the same index in the list as the string that needs to be replaced.
What did not work so far:
I tried several forms of str.replace(x, y) on lists and DataFrames, and nothing worked. A simple replace(x, y) does not work, as I need to replace several different strings. I think I can't get my head around the indexing.
I already googled and stackoverflowed for two hours and haven't found a solution yet.
Many thanks in advance! Sorry for bad English or noob mistakes, I am still learning.
-Max
Define a replacement function and use apply to run it row-wise:
import pandas as pd

def replacements(websites, apps):
    """Substitute items in websites that appear in replace_items."""
    replace_items = ["no website"]  # can add more keys that trigger replacement
    for i, k in enumerate(websites):
        # check each item in websites for replacement
        if k in replace_items:
            websites[i] = apps[i]  # replace with the corresponding item in apps
    return websites

# create DataFrame
websites = [["google.com", "amazon.com"], ["google.com"], ["aol.com", "no website"]]
app = [["Ok Google", "Alexa"], ["Ok Google"], ["AOL App", "Generic Device"]]
data = list(zip(websites, app))
df = pd.DataFrame(data, columns=['Websites', 'App'])

# perform replacement
df['Websites'] = df.apply(lambda row: replacements(row['Websites'], row['App']), axis=1)
print(df)
Output
Websites App
0 [google.com, amazon.com] [Ok Google, Alexa]
1 [google.com] [Ok Google]
2 [aol.com, Generic Device] [AOL App, Generic Device]
Try this. You can define the replaceable values in a list and execute:
def f(x, items):
    for rep in items:
        if rep in list(x.Website):
            i = list(x.Website).index(rep)
            x.Website[i] = list(x.App)[i]
    return x

items = ["no website"]
data = data.apply(lambda x: f(x, items), axis=1)
Output:
Website App
0 [google.com, amazon.com] [Ok Google, Alexa]
1 [google.com] [Ok Google]
2 [aol.com, Generic Device] [AOL App, Generic Device]
First of all, Happy Holidays!
I wasn't really sure what your expected output was, or what you have tried previously, but since the column holds lists (which Series.replace does not look inside), applying a per-row list comprehension may work:
data["Website"] = data["Website"].apply(
    lambda lst: ["Generic Device" if w == "no website" else w for w in lst])
I really hope this helps!
You can create a function like this:
def f(replaced_value, col1, col2):
    def r(s):
        while replaced_value in s[col1]:
            s[col1][s[col1].index(replaced_value)] = s[col2][s[col1].index(replaced_value)]
        return s
    return r
and use apply:
df=df.apply(f("no website","Website","App"), axis=1)
print(df)
I have been using the SQLAlchemy ORM for a few days, and I'm looking for a way to get the table-name prefix in the results of Session.query().
For instance :
myId = 4
...
data = session.query(Email.address).filter(Email.id==str(myId)).one()
print data.keys()
This would display:
("address",)
And I would like to get something like:
("Email.address",)
Is there any way to do this without changing the class attributes or the table column names?
This example is a bit of a dummy, but more generally I would like to prefix all column names with their table names in the results, to make sure results are always in the same format, even when queries contain joins.
I've read about aliased() and many posts here, but nothing satisfied me.
Can someone please enlighten me on this?
Thank you.
EDIT:
Thanks a lot for your answer @alecxe. I finally managed to do what I wanted. Here is the first batch of my code; there are probably many things to improve:
query = self.session.query(Email.address, User.name)
cols = [{str(column['name']): str(column['expr'])} for column in query.column_descriptions]
someone = query.filter(User.name == str(curName)).all()

r = []
for res in someone:
    p = {}
    for c in map(str, res.__dict__):
        if not c.startswith('_'):
            for k in cols:
                if c == k.keys()[0]:
                    p[k[c]] = res.__dict__[c]
    r.append(p)
print r
The output is :
[{'Email.address': u'john@foobaz.com', 'User.name': u'John'}]
Give column_descriptions a try:
query = session.query(Email.address)
print [str(column['expr']) for column in query.column_descriptions] # should print ["Email.address"]
data = query.filter(Email.id==str(myId)).one()
Hope that helps.