I have honestly tried every solution I could find; I think I am nearly there, but something is still not working.
I have a dataframe with coin names and their tags:
coin       tags
bitcoin    [mineable, pow, sha-256, store-of-value, state-channels]
I want to extract the tags into a binary dataframe, like this:
coin       mineable   Sha 256   scrypt
bitcoin    1          1         0
dogecoin   1          0         1
I have prepared a dataframe like this:
coin       mineable   Sha 256   scrypt
bitcoin    mineable   Sha 256   scrypt
dogecoin   mineable   Sha 256   scrypt
The idea was that, when I run the loop, if it finds the tag in the list it changes the cell to 1, and when it does not it leaves it as is (or, even better, changes it to 0).
for index_tags, row2 in tag_df2.iterrows():      # final data set to be changed
    for index, row in tags_head.iterrows():      # dataset with the tags and the coin names
        for my_tags in clean_set:                # unique list of tags
            if my_tags in (row['tags']):
                print('-----coin name-------------------->>>>', (row['name']))
                print(my_tags)
                tag_df2.loc[index_tags, my_tags] = 1
Now it seems to iterate through everything, but it only finds the first values (those for bitcoin) and copies the same result to all coins. I add a link to my colab notebook too.
When I print, it seems to go through the data without a problem, but when I try to update the dataframe it just copies one coin's values to all coins. I hope someone can help me.
https://colab.research.google.com/drive/1sn5lwqiNicoBy2L00EZNmhLgz_SBxsOg?usp=sharing
You can use get_dummies:
# After you have generated the `tags` DataFrame with
# tags = df_new[['name', 'tags']]
pd.get_dummies(tags.set_index('name')['tags'].explode()).sum(level=0)

# Note: `sum(level=0)` was removed in pandas 2.0; on recent versions use
# .groupby(level=0).sum() instead.
Output (only showing the first 3 columns here to illustrate the result):
1confirmation-portfolio a16z-portfolio ai-big-data \
name
bitcoin 1 1 0
ethereum 1 1 0
binance coin 0 0 0
dogecoin 0 0 0
cardano 0 0 0
... ... ... ...
australian dollar token 0 0 0
chia network 0 0 0
safemars 0 0 0
lendhub 0 0 0
3x long bitcoin token 0 0 0
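For reference, here is a minimal, self-contained sketch of the same idea on toy two-coin data (the column names 'name' and 'tags' are assumed from the question; the real notebook will differ):

import pandas as pd

# toy data shaped like the question's frame
tags = pd.DataFrame({
    'name': ['bitcoin', 'dogecoin'],
    'tags': [['mineable', 'pow', 'sha-256'], ['mineable', 'scrypt']],
})

# explode turns each list element into its own row, get_dummies one-hot
# encodes the tags, and the groupby collapses back to one row per coin
binary = (pd.get_dummies(tags.set_index('name')['tags'].explode())
            .groupby(level=0)
            .sum())
print(binary)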
Is it possible to conditionally append data to an existing template dataframe? I'll try to make the data below as simple as possible, since I'm asking more for conceptual help than actual code, so that I better understand the mindset for solving these kinds of problems in the future (but actual code would be great too).
Example Data
I have a dataframe below that shows 4 dummy product SKUs that a client may order. These SKUs never change. Sometimes a client orders large quantities of each SKU, and sometimes they only order one or two SKUs. Due to reporting, I need to fill the unordered SKUs with zeroes (probably using ffill?).
Dummy dataframe DF

product_sku   quantity   total_cost
1234
5678
4321
2468
Problem
Currently, my data only returns the SKUs that customers have ordered (a), but I would like unordered SKUs to be returned, with zeros filled in for quantity and total_cost (b)
(a)

product_sku   quantity   total_cost
1234          10         50.00
5678          3          75.00
(b)

product_sku   quantity   total_cost
1234          10         50.00
5678          3          75.00
4321          0          0
2468          0          0
I'm wondering if there's a way to take that existing dataframe and simply append any sales that actually occurred, leaving the unordered SKUs as zero or blank (whichever makes more sense).
I just need some help thinking through the steps logically, and I wasn't able to find anything like this. I'm still relatively new to this stuff, so let me know if I'm missing any pertinent information.
Thanks!
One way is to use reindex after setting the column with the product names as the index with set_index. With your notation it would be something like:
l_products = DF['product_sku'].tolist()  # you may have the list differently
b = (a.set_index('product_sku')
      .reindex(l_products, fill_value=0)
      .reset_index()
     )
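To make that concrete, here is a hedged, runnable illustration using the sample frame (a) and the fixed SKU list from the question:

import pandas as pd

a = pd.DataFrame({'product_sku': ['1234', '5678'],
                  'quantity': [10, 3],
                  'total_cost': [50.00, 75.00]})
l_products = ['1234', '5678', '4321', '2468']  # the full, fixed SKU list

b = (a.set_index('product_sku')
      .reindex(l_products, fill_value=0)
      .reset_index())
print(b)  # 4321 and 2468 appear with quantity 0 and total_cost 0.0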
If you know the SKUs a priori, maintain one DataFrame initialized with zeros and update the relevant rows. Then you will always have all SKUs.
For example:
import pandas as pd

# initialization: every SKU starts at zero
df = pd.DataFrame(0, index=['1234', '5678', '4321', '2468'],
                  columns=['total_cost', 'quantity'])
print(df)

# updating a whole row
df.loc['1234', :] = {'total_cost': 100, 'quantity': 4}
print(df)

# incrementing quantity
df.loc['1234', 'quantity'] += 5
print(df)
total_cost quantity
1234 0 0
5678 0 0
4321 0 0
2468 0 0
total_cost quantity
1234 100 4
5678 0 0
4321 0 0
2468 0 0
total_cost quantity
1234 100 9
5678 0 0
4321 0 0
2468 0 0
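If the sales arrive as a frame of their own rather than row by row, a variant of the same idea (just a sketch, with a hypothetical `orders` frame shaped like (a) above) is to add it onto the zero template, letting index alignment fill in the unordered SKUs:

import pandas as pd

# zero template indexed by the fixed SKU list
template = pd.DataFrame(0, index=['1234', '5678', '4321', '2468'],
                        columns=['quantity', 'total_cost'])

# hypothetical frame of the sales that actually occurred
orders = pd.DataFrame({'product_sku': ['1234', '5678'],
                       'quantity': [10, 3],
                       'total_cost': [50.00, 75.00]}).set_index('product_sku')

# add() aligns on the index; SKUs missing from `orders` stay at 0
filled = template.add(orders, fill_value=0)

# alignment sorts the index, so restore the template's original order
filled = filled.reindex(template.index)
print(filled)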
I have this type of data, but in real life it has millions of entries. A product id is always product-specific, but occurs several times during its lifetime.
date         product id           revenue   estimated lifetime value
2021-04-16   0061M00001AXc5lQAD   970       2000
2021-04-17   0061M00001AXbCiQAL   159       50000
2021-04-18   0061M00001AXb9AQAT   80        3000
2021-04-19   0061M00001AXbIHQA1   1100      8000
2021-04-20   0061M00001AXbY8QAL   90        4000
2021-04-21   0061M00001AXbQ1QAL   29        30000
2021-04-21   0061M00001AXc5lQAD   30        2000
2021-05-02   0061M00001AXc5lQAD   50        2000
2021-05-05   0061M00001AXc5lQAD   50        2000
I'm looking to create a new column in pandas that indicates when a certain product id has generated more revenue than a specific threshold, e.g. $100 or $1,000, marking it as a win (1). A win may occur only once during the lifecycle of a product. In addition, I would like to create another column that indicates the row at which a specific product's sales exceed e.g. 10% of the estimated lifetime value.
What would be the most intuitive approach to achieve this in Python / pandas?
Edit:
dw1k_thresh: if the cumulative sales of a specific product id reach >= 1000, the column takes a value of 1, otherwise 0. However, the 1 can occur only once; after that it is always 0 again. Basically it's just an indicator of the date and transaction at which a product's sales exceed the critical value of 1000.
dw10perc: if the cumulative sales of one product id reach >= 10% of the estimated lifetime value, the column takes a value of 1, otherwise 0. However, the 1 can occur only once; after that it is always 0 again. Basically it's just an indicator of the date and transaction at which a product's sales exceed 10% of the estimated lifetime value.
The threshold value is common to all product ids (I'll just replicate the process with different thresholds at a later stage to determine which threshold best predicts future revenue).
I'm trying to achieve this. The code I've written so far tries to establish the cum_rev and dw1k_thresh columns, but unfortunately it doesn't work:
df_final["dw1k_thresh"] = 0
df_final["cum_rev"] = 0

opp_list = set()
for row in df_final["product id"].iteritems():
    opp_list.add(row)
opp_list = list(opp_list)
opp_list = pd.Series(opp_list)

for i in opp_list:
    if i == df_final["product id"].any():
        df_final.cum_rev = df_final.revenue.cumsum()

for x in df_final.cum_rev:
    if x >= 1000 & df_final.dw1k_thresh.sum() == 0:
        df_final.dw1k_thresh = 1
    else:
        df_final.dw1k_thresh = 0

df_final.head(30)
Cumulative revenue: can be calculated fairly simply with groupby and cumsum.
dw1k_thresh: we first check whether cum_rev is greater than or equal to 1000 and then apply a function that keeps the 1 only once, setting every later row back to 0.
dw10_perc: same approach as dw1k_thresh.
As a first step you would need to remove $ and make sure your columns are of numeric type to perform the comparisons you outlined.
# Imports
import pandas as pd
import numpy as np

# Remove $ sign and convert to numeric
cols = ['revenue', 'estimated lifetime value']
df[cols] = df[cols].replace({r'\$': '', ',': ''}, regex=True).astype(float)

# Cumulative revenue per product
df['cum_rev'] = df.groupby('product id')['revenue'].cumsum()

# Helper used for both flags: returns the index of the earliest row per
# product where the threshold condition holds
def f(df, thresh_col):
    return (df[df[thresh_col] == 1].sort_values(['date', 'product id'], ascending=False)
              .groupby('product id', as_index=False, group_keys=False)
              .apply(lambda x: x.tail(1))
           ).index.tolist()

# dw1k_thresh
df['dw1k_thresh'] = np.where(df['cum_rev'].ge(1000), 1, 0)
df['dw1k_thresh'] = np.where(df.index.isin(f(df, 'dw1k_thresh')), 1, 0)

# dw10perc
df['dw10_perc'] = np.where(df['cum_rev'] > 0.10 * df.groupby('product id', observed=True)['estimated lifetime value'].transform('sum'), 1, 0)
df['dw10_perc'] = np.where(df.index.isin(f(df, 'dw10_perc')), 1, 0)
Prints:
>>> df
date product id revenue ... cum_rev dw1k_thresh dw10_perc
0 2021-04-16 0061M00001AXc5lQAD 970 ... 970 0 1
1 2021-04-17 0061M00001AXbCiQAL 159 ... 159 0 0
2 2021-04-18 0061M00001AXb9AQAT 80 ... 80 0 0
3 2021-04-19 0061M00001AXbIHQA1 1100 ... 1100 1 1
4 2021-04-20 0061M00001AXbY8QAL 90 ... 90 0 0
5 2021-04-21 0061M00001AXbQ1QAL 29 ... 29 0 0
6 2021-04-21 0061M00001AXc5lQAD 30 ... 1000 1 0
7 2021-05-02 0061M00001AXc5lQAD 50 ... 1050 0 0
8 2021-05-05 0061M00001AXc5lQAD 50 ... 1100 0 0
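As an aside, the "only the first crossing per product" rule can also be expressed more compactly by comparing the running total with its value on the previous row inside each group. This is not part of the answer above, just a hedged alternative sketch that assumes the numeric `df` built there:

# cumulative revenue per product and its value on the previous row
cum = df.groupby('product id')['revenue'].cumsum()
prev = cum.groupby(df['product id']).shift().fillna(0)

# 1 only on the row where the running total first reaches the threshold
df['dw1k_thresh_alt'] = ((cum >= 1000) & (prev < 1000)).astype(int)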
I have results from an A/B test that I need to evaluate, but while checking the data I noticed that there were users that ended up in both groups, and I need to drop them so they don't skew the test. My data looks something like this:
transactionId visitorId date revenue group
0 906125958 0 2019-08-16 10.8 B
1 1832336629 1 2019-08-04 25.9 B
2 3698129301 2 2019-08-01 165.7 B
3 4214855558 2 2019-08-07 30.5 A
4 797272108 3 2019-08-23 100.4 A
What I need to do is remove every user that was in both A and B groups while leaving the rest intact. So from the example data I need this output:
transactionId visitorId date revenue group
0 906125958 0 2019-08-16 10.8 B
1 1832336629 1 2019-08-04 25.9 B
4 797272108 3 2019-08-23 100.4 A
I tried to do it in various ways but I can't seem to figure it out, and I couldn't find an answer for it anywhere. I would really appreciate some help here,
thanks in advance.
You can get a list of users that are in just one group like this:
group_counts = df.groupby('visitorId').agg({'group': 'nunique'})  # number of distinct groups per visitor
to_include = group_counts[group_counts['group'] == 1]             # keep only visitors in exactly 1 group
And then filter your original data according to which visitors are in that list:
df = df[df['visitorId'].isin(to_include.index)]
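The same idea fits in one line with transform, which broadcasts the per-visitor group count back onto every row (a sketch using the sample data from the question):

import pandas as pd

df = pd.DataFrame({
    'transactionId': [906125958, 1832336629, 3698129301, 4214855558, 797272108],
    'visitorId': [0, 1, 2, 2, 3],
    'date': ['2019-08-16', '2019-08-04', '2019-08-01', '2019-08-07', '2019-08-23'],
    'revenue': [10.8, 25.9, 165.7, 30.5, 100.4],
    'group': ['B', 'B', 'B', 'A', 'A'],
})

# keep only visitors that appear in exactly one group (drops visitorId 2)
df = df[df.groupby('visitorId')['group'].transform('nunique') == 1]
print(df)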
I have a few hundred thousand rows of data with many different currency forms, some examples being:
116,319,545 SAR
40,381,846 CNY
57,712,170 CNY
158,073,425 RUB2
0 MYR
0 EUR
USD 110,169,240
These values are read into a DataFrame, and I am unsure what the best way is (if there is a prebuilt way?) to get just the integer value out of all the possible cases. There are probably more currencies in the data.
Currently the best approach I have is:
df1['value'].str.replace(r"[a-zA-Z,]", '', regex=True).astype(int)
But this obviously fails with the entry xxxx RUB2.
EDIT:
In addition to the working answer, it is also reasonable to expect the currency itself to be important; to extract it, the regex is ([A-Z]+\d*).
Given this df
df = pd.DataFrame()
df["col"] = ["116,319,545 SAR",
             "40,381,846 CNY",
             "57,712,170 CNY",
             "158,073,425 RUB2",
             "0 MYR",
             "0 EUR",
             "USD 110,169,240"]
You can use regex '(\d+)' after removing commas to get
df.col.str.replace(",", "").str.extract(r'(\d+)').astype(int)
0
0 116319545
1 40381846
2 57712170
3 158073425
4 0
5 0
6 110169240
Another more manual solution would be to split and replace
df.col.str.split(' ').apply(
    lambda d: pd.Series(int(x.replace(",", "")) for x in d
                        if x.replace(",", "").isdigit()).item()
)
0 116319545
1 40381846
2 57712170
3 158073425
4 0
5 0
6 110169240
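Following up on the edit in the question, the currency code can be pulled out with the same kind of extract call; this is a sketch that continues from the `df` defined above and uses the regex the asker suggests:

# numeric part: strip commas, grab the digit run
df['value'] = (df.col.str.replace(',', '', regex=False)
                     .str.extract(r'(\d+)', expand=False)
                     .astype(int))

# currency part: letters optionally followed by digits, e.g. RUB2
df['currency'] = df.col.str.extract(r'([A-Z]+\d*)', expand=False)
print(df)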
Here is the sample data file, and I performed the following operation in ipython notebook:
!curl -O http://pbpython.com/extras/sales-funnel.xlsx

import numpy as np
import pandas as pd

df = pd.read_excel('./sales-funnel.xlsx')
df['Status'] = df['Status'].astype('category')
df["Status"].cat.set_categories(["won", "pending", "presented", "declined"], inplace=True)

table = pd.pivot_table(df,
                       index=['Manager', 'Status'],
                       values=['Price', 'Quantity'],
                       columns=['Product'],
                       aggfunc={'Price': [np.sum, np.mean], 'Quantity': len},
                       fill_value=0
                      )
This is what the data looks like in table:
I want to select (Manager=="Debra Henley") & (Status=="won") and it works with the query method:
table.query('(Manager=="Debra Henley") & (Status=="won")')
But how do you perform the same selection with loc? I tried this, but it does not work:
table.loc[['Debra Henley', 'won']]
What do you guys usually use when dealing with MultiIndex? What's the best way to do it?
Update: found two solutions so far:
table.xs(('Debra Henley','won'), level=('Manager', 'Status'))
table.loc[[('Debra Henley', 'won')]]
So I guess tuples should be used instead of lists when indexing with MultiIndex?
Your canonical answer is provided by @ScottBoston.
I'll add this for breadth and perspective, in addition to @jezrael's IndexSlice approach.
You can also use pd.DataFrame.xs to take a cross-section
table.xs(('Debra Henley', 'won'))
                Product
Quantity  len   CPU                1
                Maintenance        0
                Monitor            0
                Software           0
Price     mean  CPU            65000
                Maintenance        0
                Monitor            0
                Software           0
          sum   CPU            65000
                Maintenance        0
                Monitor            0
                Software           0
Name: (Debra Henley, won), dtype: int64
For simpler selections (only the index or only the columns), use the xs approach or select by tuples.
Another, more general solution uses slicers:
idx = pd.IndexSlice
#output is df
print (table.loc[[idx['Debra Henley','won']]])
Quantity Price \
len mean
Product CPU Maintenance Monitor Software CPU Maintenance
Manager Status
Debra Henley won 1 0 0 0 65000 0
sum
Product Monitor Software CPU Maintenance Monitor Software
Manager Status
Debra Henley won 0 0 65000 0 0 0
idx = pd.IndexSlice
#output is series
print (table.loc[idx['Debra Henley','won'],:])
Quantity  len   CPU                1
                Maintenance        0
                Monitor            0
                Software           0
Price     mean  CPU            65000
                Maintenance        0
                Monitor            0
                Software           0
          sum   CPU            65000
                Maintenance        0
                Monitor            0
                Software           0
Name: (Debra Henley, won), dtype: int64
But slicers are better for more complicated selections: if you need to filter the index and the columns together, a single xs call doesn't work:
idx = pd.IndexSlice
#select all rows where first level is Debra Henley in index and
#in columns second level is len and sum
print (table.loc[idx['Debra Henley',:], idx[:, ['len', 'sum'], :]])
Quantity Price \
len sum
Product CPU Maintenance Monitor Software CPU
Manager Status
Debra Henley won 1 0 0 0 65000
pending 1 2 0 0 40000
presented 1 0 0 2 30000
declined 2 0 0 0 70000
Product Maintenance Monitor Software
Manager Status
Debra Henley won 0 0 0
pending 10000 0 0
presented 0 0 20000
declined 0 0 0
Yes, you can use:
table.loc[[('Debra Henley', 'won')]]
to return a pandas data frame or you can use:
table.loc[('Debra Henley','won')]
to return a pandas series.
You can refer to this documentation.
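To see the Series-vs-DataFrame distinction without the pivot table, here is a small self-contained sketch on toy data (not the sales-funnel file):

import pandas as pd

idx = pd.MultiIndex.from_tuples(
    [('Debra Henley', 'won'), ('Debra Henley', 'pending')],
    names=['Manager', 'Status'])
toy = pd.DataFrame({'Price': [65000, 40000]}, index=idx)

print(toy.loc[('Debra Henley', 'won')])    # tuple          -> Series
print(toy.loc[[('Debra Henley', 'won')]])  # list of tuples -> DataFrame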