My problem is pretty general and can probably be solved in many ways. But what is a smart way considering time and memory?
I have time series data of user interactions of the following form:
cookie_id interaction
--------- -----------
1234 did_something
1234 viewed_banner*
1234 did_something
1234 did_something
1234 viewed_and_clicked_banner*
... ...
I want to use it to train models predicting whether a user will click on a banner or not whenever a banner is displayed (i.e. the interactions marked with *). To do this I need to aggregate all previous interactions whenever a point of interest (either viewed_banner or viewed_and_clicked_banner) shows up in the feed:
cookie_id interaction
--------- -----------
1234 did_something
1234 viewed_banner <- point of interest
cookie_id interaction
--------- -----------
1234 did_something
1234 viewed_banner
1234 did_something
1234 did_something
1234 viewed_and_clicked_banner <- point of interest
This is the core of the problem: splitting the data up into overlapping groups! After doing this, each group can then be aggregated into, for instance:
cookie_id did_something viewed_banner viewed_and_cli... clicked?
--------- ------------- ------------- ----------------- --------
1234 1 0 0 no
1234 3 1 0 yes
Here the numbers in did_something and viewed_banner are the counts of these interactions (not including the point of interest), but other kinds of aggregation could be performed as well. The clicked? attribute just describes which of the two kinds of "point of interest" was the last interaction in the interaction feed.
I have tried to look at Pandas apply and groupby methods, but cannot come up with something that generates the desired overlapping groups.
The alternative is to use some for-loops, but I would rather not do that if there is a simple and efficient way to solve the problem.
Here is what I tried; I think it needs more data to verify the code:
data = """cookie_id interaction
1234 did_something
1234 viewed_banner*
1234 did_something
1234 did_something
1234 viewed_and_clicked_banner*
"""
import pandas as pd
import io
df = pd.read_csv(io.BytesIO(data), delim_whitespace=True)
flag = df.interaction.str.endswith("*")
group_flag = flag.astype(float).mask(~flag).ffill(limit=1).fillna(0).cumsum()
df["interaction"] = df.interaction.str.rstrip("*")
interest_df = df[flag]
def f(s):
return s.value_counts()
df2 = df.groupby(group_flag).interaction.apply(f).unstack().fillna(0).cumsum()
result = df2[::2].reset_index(drop=True)
result["clicked"] = interest_df.interaction.str.contains("clicked").reset_index(drop=True)
print result
output:
did_something viewed_and_clicked_banner viewed_banner clicked
0 1 0 0 False
1 3 0 1 True
The basic idea is to split the dataframe into groups:
odd groups are continuous runs of rows without *
even groups consist of a single row with *
It assumes that the first row in the dataframe has no *.
Then do value_counts for every group and combine the results into a dataframe. Taking the cumsum() of the counts and dropping the even rows gives the right counts.
I don't know how the clicked column is calculated. Can you explain this in detail?
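For what it's worth, the clicked column comes straight from interest_df: by that point the * has already been stripped, and interest_df holds only the point-of-interest rows, so str.contains("clicked") is True for viewed_and_clicked_banner and False for a plain viewed_banner. On the sample data:
print(interest_df.interaction.tolist())
# ['viewed_banner', 'viewed_and_clicked_banner']
print(interest_df.interaction.str.contains("clicked").tolist())
# [False, True]  <- one entry per point of interest, assigned to result["clicked"]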
Related
I have 2 datasets (dataframes), one called source and the other crossmap. I am trying to find rows with a specific column value starting with "999"; if one is found, I need to look up the complete value of that column (e.g. "99912345") in the crossmap dataset (dataframe) and return the value from a column on that row in the cross-map.
# Source Dataframe
0 1 2 3 4
------ -------- -- --------- -----
0 303290 544981 2 408300622 85882
1 321833 99910722 1 408300902 85897
2 323241 99902978 3 408056001 95564
# Cross Map Dataframe
ID NDC ID DIN(NDC) GTIN NAME PRDID
------- ------ -------- -------------- ---------------------- -----
44563 321833 99910722 99910722000000 SALBUTAMOL SULFATE (A) 90367
69281 321833 99910722 99910722000000 SALBUTAMOL SULFATE (A) 90367
6002800 323241 99902978 75402850039706 EPINEPHRINE (A) 95564
8001116 323241 99902978 99902978000000 EPINEPHRINE (A) 95564
The 'straw dog' logic I am working with is this:
search source file and find '999' entries in column 1
df_source[df_source['Column1'].str.contains('999')]
iterate through the rows returned, search for the value from column 1 in the crossmap dataframe column (DIN(NDC)), and return the corresponding PRDID
update the source dataframe with the PRDID, and write the updated file
It is these last two logic pieces where I am struggling with how to do this. Appreciate any direction/guidance anyone can provide.
Is there maybe a better/easier means of doing this using python but not pandas/dataframes?
So, if I understood you correctly: we are looking for values in column 1 of the Source Dataframe whose first digits are 999. Next, we find these values in the Cross Map column 'DIN(NDC)' and get the values of the 'PRDID' column on those rows.
If that is correct, then I don't quite understand what further steps you need.
import pandas as pd
import more_itertools as mit

Cross_Map = pd.DataFrame({'DIN(NDC)': [99910722, 99910722, 99902978, 99902978],
                          'PRDID': [90367, 90367, 95564, 95564]})
df = pd.DataFrame({0: [303290, 321833, 323241], 1: [544981, 99910722, 99902978], 2: [2, 1, 3],
                   3: [408300622, 408300902, 408056001], 4: [85882, 85897, 95564]})

m = [i for i in df[1] if str(i)[:3] == '999']  # find the values in column 1 that start with 999
index = list(mit.locate(list(Cross_Map['DIN(NDC)']), lambda x: x in m))  # indexes of the matching DIN(NDC) values
print(Cross_Map['PRDID'][index])
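If you also need the last two steps from the question (writing the PRDID back into the source dataframe and saving the file), a map-based sketch along these lines might work; the output filename is made up, and it assumes each DIN(NDC) maps to a single PRDID:
# Build a DIN(NDC) -> PRDID lookup from the cross-map (keep one row per DIN).
lookup = Cross_Map.drop_duplicates('DIN(NDC)').set_index('DIN(NDC)')['PRDID']

# Rows in the source whose column 1 starts with 999.
mask = df[1].astype(str).str.startswith('999')

# Attach the matched PRDID to those rows and write the updated file.
df.loc[mask, 'PRDID'] = df.loc[mask, 1].map(lookup)
df.to_csv('source_with_prdid.csv', index=False)  # hypothetical output path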
I have a question regarding text file handling. My text file reads in as one column. The column has data scattered throughout the rows and visually looks great and somewhat uniform; however, it is still just one column. Ultimately, I'd like to append the row where a keyword is found to the end of the previous row until the data forms one long row. Then I'll use str.split() to cut sections up into columns as I need.
In Excel (first code block below) I took this same text file, removed the headers, aligned left, and searched for keywords. When one is found, Excel has a nice Offset feature that lets you place or append the cell value basically anywhere using Offset(x, y).Value relative to the active-cell start position. Once done, I would delete the row. This allowed me to get the data into a tabular column format that I could work with.
What I Need:
The Python code below cycles down through each row looking for the keyword 'Address:'. This part of the code works. Once it finds the keyword, the next line should append the row to the end of the previous row. This is where my problem is: I cannot find a way to get the active row number into a variable that I can use in place of [index] for the active row, or [index-1] for the previous row.
Excel Code of similar task
Do
Set Rng = WorkRng.Find("Address", LookIn:=xlValues)
If Not Rng Is Nothing Then
Rng.Offset(-1, 2).Value = Rng.Value
Rng.Value = ""
End If
Loop While Not Rng Is Nothing
Python Equivalent
import pandas as pd
from pandas import DataFrame, Series
file = {'Test': ['Last Name: Nobody', 'First Name: Tommy',
                 'Address: 1234 West Juniper St.',
                 'Fav Toy', 'Notes', 'Time Slot']}
df = pd.DataFrame(file)
Test
0 Last Name: Nobody
1 First Name: Tommy
2 Address: 1234 West Juniper St.
3 Fav Toy
4 Notes
5 Time Slot
I've tried the following:
for line in df.Test:
    if line.startswith('Address:'):
        df.loc[[index-1], :].values = df.loc[index-1].values + ' ' + df.loc[index].values
        # Line above does not work with the index statement
    else:
        pass

# df.loc[[1],:] = df.loc[1].values + ' ' + df.loc[2].values  # copies row 2 at the end of row 1,
#                                                            # works with static row numbers only
# df.drop([2,0], inplace=True)  # Deletes row from df
Expected output:
Test
0 Last Name: Nobody
1 First Name: Tommy Address: 1234 West Juniper St.
2 Address: 1234 West Juniper St.
3 Fav Toy
4 Notes
5 Time Slot
I am trying to wrap my head around the whole Series vectorization approach, but I'm still stuck trying loops that I'm only semi-familiar with. If there is a way to achieve this, please point me in the right direction.
As always, I appreciate your time and your knowledge. Please let me know if you can help with this issue.
Thank You,
Use Series.shift on Test, then use Series.str.startswith to create a boolean mask, and finally use boolean indexing with this mask to update the values in the Test column:
s = df['Test'].shift(-1)
m = s.str.startswith('Address', na=False)
df.loc[m, 'Test'] += (' ' + s[m])
Result:
Test
0 Last Name: Nobody
1 First Name: Tommy Address: 1234 West Juniper St.
2 Address: 1234 West Juniper St.
3 Fav Toy
4 Notes
5 Time Slot
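If you also want to drop the original Address rows afterwards (the equivalent of deleting the row in the Excel macro), a short follow-up, assuming every Address row has already been appended to the row above, is:
# Remove the now-redundant Address rows and renumber the index.
df = df[~df['Test'].str.startswith('Address')].reset_index(drop=True)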
What I'm trying to do
I want to report the weekly rejection rate for multiple users. I use a for loop to go through a monthly dataset to get the numbers for every user. The final dataframe, rates, should look something like:
The end product, rates
Description
I have an initial dataframe (numbers) that contains only the ACCEPT, REJECT and REVIEW counts, to which I added these rows and columns:
Rows: Grand Total, Rejection Rate
Columns: Grand Total
Here's what numbers looks like:
|---|--------|--------|--------|--------|-------------|
| | Week 1 | Week 2 | Week 3 | Week 4 | Grand Total |
|---|--------|--------|--------|--------|-------------|
| 0 | 994 | 699 | 529 | 877 | 3099 |
|---|--------|--------|--------|--------|-------------|
| 1 | 27 | 7 | 8 | 13 | 55 |
|---|--------|--------|--------|--------|-------------|
| 2 | 100 | 86 | 64 | 107 | 357 |
|---|--------|--------|--------|--------|-------------|
| 3 | 1121 | 792 | 601 | 997 | 3511 |
|---|--------|--------|--------|--------|-------------|
The indexes represent the following values:
0 - ACCEPT
1 - REJECT
2 - REVIEW
3 - TOTAL (Accept+Reject+Review)
I wrote 2 pre-defined functions:
get_decline_rates(df): Gets the decline rates by week from the numbers dataframe.
copy(empty_df, data): Transfers all data to a new dataframe with "double" headers (for reporting purposes).
Here's my code where I add rows and columns to numbers, then re-format it:
# Adding "Grand Total" column and rows
totals = numbers.sum(axis=0) # column sum
numbers = numbers.append(totals, ignore_index=True)
grand_total = numbers.sum(axis=1) # row sum
numbers.insert(len(numbers.columns), "Grand Total", grand_total)
# Adding "Rejection Rate" and re-indexing numbers
decline_rates = get_decline_rates(numbers)
numbers = numbers.append(decline_rates, ignore_index=True)
numbers.index = ["ACCEPT","REJECT","REVIEW","Grand Total","Rejection Rate"]
# Creating a new df with report format requirements
final = pd.DataFrame(0, columns=numbers.columns, index=["User A"]+list(numbers.index))
final.ix["User A",:] = final.columns
# Copying data from numbers to newly formatted df
copy(final,numbers)
# Append final df of this user to the final dataframe
rates = rates.append(final)
I'm using Python 3.5.2 and Pandas 0.19.2. If it helps, here's what the initial dataset looks like:
Data format
I do a resampling on the date column to get the data by week.
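For reference, the weekly grouping itself could look roughly like this; the column names 'date' and 'decision' are guesses, since the real layout is only visible in the screenshot:
import pandas as pd

# Hypothetical raw monthly data with made-up column names.
raw = pd.DataFrame({
    'date': pd.to_datetime(['2017-03-01', '2017-03-02', '2017-03-09', '2017-03-15']),
    'decision': ['ACCEPT', 'REJECT', 'ACCEPT', 'REVIEW'],
})

numbers = (raw.groupby(['decision', pd.Grouper(key='date', freq='W')])
              .size()                   # count of rows per decision per calendar week
              .unstack(fill_value=0))   # rows: decisions, columns: week-ending dates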
What's going wrong
Here's the funny part - the code runs fine and I get all the required information in rates. However, I'm seeing this warning message:
RuntimeWarning: invalid value encountered in longlong_scalars
If I break down the code and run it line by line, this message does not appear. Even the message looks weird (what does longlong_scalars even mean?). Does anyone know what this warning message means, and what's causing it?
UPDATE:
I just ran a similar script that takes in exactly the same input and produces a similar output (except I get daily rejection rates instead of weekly). I get the same RuntimeWarning, except more information is given:
RuntimeWarning: invalid value encountered in longlong_scalars
rej_rate = str(int(round((col.ix[1 ]/col.ix[3 ])*100))) + "%"
I suspect something must have gone wrong when I was trying to calculate the decline rates with my pre-defined function, get_decline_rates(df). Could it be due to the dtype of the values? All columns on the input df, numbers, are int64.
Here's the code for my pre-defined function (the input, numbers, can be found under Description):
# Description: Get rejection rates for all weeks.
# Parameters: Pandas Dataframe with ACCEPT, REJECT, REVIEW count by week.
# Output: Pandas Series with rejection rates for all days in input df.
def get_decline_rates(df):
    decline_rates = []
    for i in range(len(df.columns)):
        col = df.ix[:,i]
        try:
            rej_rate = str(int(round((col[1]/col[3])*100))) + "%"
        except ValueError:
            rej_rate = "0%"
        decline_rates.append(rej_rate)
    return pd.Series(decline_rates, index=df.columns)
I had the same RuntimeWarning, and after looking into the data, it was caused by a division by zero. I did not have time to look into your sample, but you could look around id=0, or other records where such a zero division could occur.
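That would fit the posted function: for a user-week where the total in col[3] is 0, col[1]/col[3] is 0/0 on integer scalars, which is the kind of operation that emits the "invalid value encountered" RuntimeWarning and produces NaN; the int(round(...)) on that NaN then raises the ValueError that the except clause silently turns into "0%", which would explain why the output still looks right. A minimal guard (a sketch, not a drop-in replacement for the posted function) is to check the denominator before dividing instead of catching the ValueError:
def rate_str(rejects, total):
    # Avoid 0/0 on integer scalars -- that is what triggers the
    # "invalid value encountered in longlong_scalars" warning.
    if total == 0:
        return "0%"
    return str(int(round(rejects / total * 100))) + "%"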
I have this pandas dataframe:
df =
GROUP MARK
ABC 1
ABC 0
ABC 1
DEF 1
DEF 1
DEF 1
DEF 1
XXX 0
I need to create a pie chart (using Python or R). The size of each pie should correspond to the proportional count (i.e. the percentage) of rows with a particular GROUP. Moreover, each pie should be divided into 2 sub-parts corresponding to the percentage of rows with MARK==1 and MARK==0 within the given GROUP.
I was googling for this type of pie chart and found this one, but that example seems overcomplicated for my case. Another good example is done in JavaScript, which doesn't work for me because of the language.
Can somebody tell me what this type of pie chart is called and where I can find some example code in Python or R?
Here is a solution in R that uses base R only. Not sure how you want to arrange your pies, but I used par(mfrow=...).
df <- read.table(text=" GROUP MARK
ABC 1
ABC 0
ABC 1
DEF 1
DEF 1
DEF 1
DEF 1
XXX 0", header=TRUE)
plot_pie <- function(x, multiplier=1, label){
pie(table(x), radius=multiplier * length(x), main=label)
}
par(mfrow=c(1,3), mar=c(0,0,2,0))
invisible(lapply(split(df, df$GROUP), function(x){
plot_pie(x$MARK, label=unique(x$GROUP),
multiplier=0.2)
}))
This is the result:
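For completeness, here is a rough Python/matplotlib take on the same idea (as far as I know there is no standard name beyond "scaled" or "proportional" pie charts); the radius of each pie is scaled by that group's share of rows:
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({'GROUP': ['ABC', 'ABC', 'ABC', 'DEF', 'DEF', 'DEF', 'DEF', 'XXX'],
                   'MARK':  [1, 0, 1, 1, 1, 1, 1, 0]})

groups = df.groupby('GROUP')['MARK']
fig, axes = plt.subplots(1, len(groups), figsize=(9, 3))
for ax, (name, marks) in zip(axes, groups):
    counts = marks.value_counts()                      # MARK==1 vs MARK==0 split
    ax.pie(counts, labels=[f'MARK={v}' for v in counts.index],
           radius=len(marks) / len(df))                # pie size ~ group's share of rows
    ax.set_title(name)
plt.show()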
I am trying to read an ASCII table into a Numpy/Pandas/Astropy array/dataframe/table in Python. Each row in the table looks something like this:
329444.6949 0.0124 -6.0124 3 97.9459 15 32507 303 7 3 4 8 2 7 HDC-13-O
The problem is that there is no clear separator/delimiter between the columns, so for some rows there is no space between two columns, like this:
332174.9289 0.0995 -6.3039 3 1708.1601219 30501 30336 333 37 136 H2CO
The web page says these are called "card images". The information on the table format is described as this:
The catalog data files are composed of 80-character card images, with
one card image per spectral line. The format of each card image is:
FREQ, ERR, LGINT, DR, ELO, GUP, TAG, QNFMT, QN', QN" (F13.4,F8.4,
F8.4, I2,F10.4, I3, I7, I4, 6I2, 6I2)
I would really like a way where I can just use the format specifier given above. The only thing I found was Numpy's genfromtxt function. However, the following does not work:
np.genfromtxt('tablename', dtype='f13.4,f8.4,f8.4,i2,f10.4,i3,i7,i4,6i2,6i2')
Anyone knows how I could read this table into Python with the use of the format specification of each column that was given?
You can use the fixed-width reader in Astropy. See: http://astropy.readthedocs.org/en/latest/io/ascii/fixed_width_gallery.html#fixedwidthnoheader. This does still require you to count the columns, but you could probably write a simple parser for the dtype expression you showed.
Unlike the pandas solution above (e.g. df['FREQ'] = df.data.str[0:13]), this will automatically determine the column type and give float and int columns in your case. The pandas version results in all str type columns, which is presumably not what you want.
To quote the doc example there:
>>> from astropy.io import ascii
>>> table = """
... #1 9 19 <== Column start indexes
... #| | | <== Column start positions
... #<------><--------><-------------> <== Inferred column positions
... John 555- 1234 192.168.1.10
... Mary 555- 2134 192.168.1.123
... Bob 555- 4527 192.168.1.9
... Bill 555-9875 192.255.255.255
... """
>>> ascii.read(table,
... format='fixed_width_no_header',
... names=('Name', 'Phone', 'TCP'),
... col_starts=(1, 9, 19),
... )
<Table length=4>
Name Phone TCP
str4 str9 str15
---- --------- ---------------
John 555- 1234 192.168.1.10
Mary 555- 2134 192.168.1.123
Bob 555- 4527 192.168.1.9
Bill 555-9875 192.255.255.255
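Applying that to the card-image format above might look roughly like the sketch below. The column widths come straight from the Fortran format string, but the names and the choice to read each 6I2 quantum-number block as a single 12-character column are my own guesses, so check the positions against a few real lines. (np.genfromtxt also accepts a sequence of field widths as its delimiter argument, if you would rather stay in Numpy.)
from astropy.io import ascii
from itertools import accumulate

# Field widths from the format string:
# F13.4, F8.4, F8.4, I2, F10.4, I3, I7, I4, 6I2, 6I2
widths = [13, 8, 8, 2, 10, 3, 7, 4, 12, 12]
names = ['FREQ', 'ERR', 'LGINT', 'DR', 'ELO', 'GUP', 'TAG', 'QNFMT', 'QN1', 'QN2']

ends = list(accumulate(widths))       # 13, 21, 29, ...
col_starts = [0] + ends[:-1]          # 0-based start of each field
col_ends = [e - 1 for e in ends]      # end of each field (inclusive, as I understand the convention)

table = ascii.read('tablename',       # the catalog file from the question
                   format='fixed_width_no_header',
                   names=names,
                   col_starts=col_starts,
                   col_ends=col_ends)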