I have a list of tuples, each containing a date and the number of "beeps" on that date. However, if there were 0 beeps on a certain date, that date is simply not present in the list. I need these missing dates to be present, with 0 as the number of beeps.
I've tried solving this using Excel and Python, but I can't find a solution.
16/10/2017 7
18/10/2017 3
21/10/2017 7
23/10/2017 20
24/10/2017 7
25/10/2017 6
This is the start of what I have, and I need this to become:
16/10/2017 7
17/10/2017 0
18/10/2017 3
19/10/2017 0
20/10/2017 0
21/10/2017 7
22/10/2017 0
23/10/2017 20
24/10/2017 7
25/10/2017 6
First save the first date with its value. Then iterate through the dates, saving the dates between the last saved date and the current date with a value of 0, then saving the current date with its value.
A pseudo-code solution would be:
last_saved_date, value = read_first_date()
save(last_saved_date, value)
while not_at_end_of_file():
    date, value = read_date()
    while last_saved_date + 1 < date:
        last_saved_date += 1
        save(last_saved_date, 0)
    save(date, value)
    last_saved_date = date
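A minimal runnable sketch of this approach, assuming the data is a list of (date_string, count) tuples in dd/mm/yyyy format and already sorted by date:

from datetime import datetime, timedelta

beeps = [
    ("16/10/2017", 7),
    ("18/10/2017", 3),
    ("21/10/2017", 7),
    ("23/10/2017", 20),
    ("24/10/2017", 7),
    ("25/10/2017", 6),
]

filled = []
last_date = None
for date_str, count in beeps:
    current = datetime.strptime(date_str, "%d/%m/%Y").date()
    if last_date is not None:
        # Insert every missing day between the last saved date and this one, with 0 beeps.
        gap = last_date + timedelta(days=1)
        while gap < current:
            filled.append((gap.strftime("%d/%m/%Y"), 0))
            gap += timedelta(days=1)
    filled.append((date_str, count))
    last_date = current

for d, n in filled:
    print(d, n)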
I have three variables
a = 1
b = 2
c = 3
Every day, I need to add 1 to each variable
a = a + 1 (so a=2)
b = b + 1
c = c + 1
But when I run the script tomorrow, I need it to add 1 unit more:
a = a + 2 (so tomorrow a=3, after 2 days a = 4....)
b = b + 2
c = c + 2
And so on... each day the amount added needs to grow by 1.
Any ideas?
Choose some fixed reference date, and when the code runs, calculate the number of days from the reference date, adjust by some constant offset, and add that to your variables. So, maybe I choose 1/1/2022 as my reference date and an offset of 100 days. This means that 100 days after 1/1/2022 the variables don't get increased at all, 101 days after 1/1/2022 they are greater by 1, and so on.
If you only need to increase the values on days when the script actually ran, keep a log file of the days it ran, or for that matter, save the incremented values directly!
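A minimal sketch of that idea (the 1/1/2022 reference date and the 100-day offset are the arbitrary choices from above):

from datetime import date

REFERENCE_DATE = date(2022, 1, 1)  # arbitrary fixed reference date
OFFSET = 100                       # constant offset in days

a, b, c = 1, 2, 3                  # starting values

# 100 days after the reference date the increment is 0,
# 101 days after it the increment is 1, and so on.
increment = (date.today() - REFERENCE_DATE).days - OFFSET
a += increment
b += increment
c += increment
print(a, b, c)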
I would like to spot dates (so rows) where the number of likes is greater than the number of retweets.
My data looks like
Date Text Like Retweet
28/02/2020 wow!!! 1 0
28/02/2020 I have a baby!!! 1 4
28/02/2020 No words 0 0
...
05/12/2019 I love cooking! 4 2
05/12/2020 Hello world! 1 1
...
To find the numbers of likes/retweets per date I did as follows:
df.groupby([df.Date])["Like"].sum()
df.groupby([df.Date])["Retweet"].sum()
Now I would like to see when the number of likes is greater than the number of retweets (in the example this should be 05/12/2019).
You can filter:
grouped = df.groupby('Date')[['Like','Retweet']].sum()
grouped[grouped['Like'] > grouped['Retweet']].index
# similarly
# grouped.query('Like > Retweet').index
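For example, loading the rows shown in the question into a small frame, the filter keeps only the date whose total likes exceed its total retweets:

import pandas as pd

df = pd.DataFrame({
    'Date':    ['28/02/2020', '28/02/2020', '28/02/2020', '05/12/2019', '05/12/2020'],
    'Text':    ['wow!!!', 'I have a baby!!!', 'No words', 'I love cooking!', 'Hello world!'],
    'Like':    [1, 1, 0, 4, 1],
    'Retweet': [0, 4, 0, 2, 1],
})

grouped = df.groupby('Date')[['Like', 'Retweet']].sum()
print(grouped[grouped['Like'] > grouped['Retweet']].index.tolist())
# ['05/12/2019']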
For customer segmentation purposes, I want to analyse how many transactions each customer made in the prior 10 days and the prior 20 days, based on the given table of transaction records with dates.
In this table, the last 2 columns were added using the following code.
But I'm not satisfied with this code; please suggest improvements.
import pandas as pd
df4 = pd.read_excel(path)

# Since there are two customers, A and B, two separate dataframes are created
df4A = df4[df4['Customer_ID'] == 'A']
df4B = df4[df4['Customer_ID'] == 'B']

from datetime import date
from dateutil.relativedelta import relativedelta

txn_prior_10days = []
for i in range(len(df4)):
    current_date = df4.iloc[i, 2]
    prior_10days_date = current_date - relativedelta(days=10)
    if df4.iloc[i, 1] == 'A':
        no_of_txn = ((df4A['Transaction_Date'] >= prior_10days_date) & (df4A['Transaction_Date'] < current_date)).sum()
        txn_prior_10days.append(no_of_txn)
    elif df4.iloc[i, 1] == 'B':
        no_of_txn = ((df4B['Transaction_Date'] >= prior_10days_date) & (df4B['Transaction_Date'] < current_date)).sum()
        txn_prior_10days.append(no_of_txn)

txn_prior_20days = []
for i in range(len(df4)):
    current_date = df4.iloc[i, 2]
    prior_20days_date = current_date - relativedelta(days=20)
    if df4.iloc[i, 1] == 'A':
        no_of_txn = ((df4A['Transaction_Date'] >= prior_20days_date) & (df4A['Transaction_Date'] < current_date)).sum()
        txn_prior_20days.append(no_of_txn)
    elif df4.iloc[i, 1] == 'B':
        no_of_txn = ((df4B['Transaction_Date'] >= prior_20days_date) & (df4B['Transaction_Date'] < current_date)).sum()
        txn_prior_20days.append(no_of_txn)

df4['txn_prior_10days'] = txn_prior_10days
df4['txn_prior_20days'] = txn_prior_20days
df4
Your code would be very difficult to write if you had,
e.g., 10 different Customer_IDs.
Fortunately, there is a much shorter solution:
When you read your file, convert Transaction_Date to datetime,
e.g. passing parse_dates=['Transaction_Date'] to read_excel.
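For example (path being the same Excel file path as in the question's code):

import pandas as pd

df = pd.read_excel(path, parse_dates=['Transaction_Date'])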
Define a function counting how many dates in a group (gr) fall
within the range from tDlt (a Timedelta) before the current
date (dd) up to 1 day before it:
def cntPrevTr(dd, gr, tDlt):
    return gr.between(dd - tDlt, dd - pd.Timedelta(1, 'D')).sum()
It will be applied twice to each member of the current group
by Customer_ID (actually to the Transaction_Date column only),
once with tDlt == 10 days and a second time with tDlt == 20 days.
Define a function computing both columns of previous-transaction
counts for the current group of transaction dates:
def priorTx(td):
    return pd.DataFrame({
        'tx10': td.apply(cntPrevTr, args=(td, pd.Timedelta(10, 'D'))),
        'tx20': td.apply(cntPrevTr, args=(td, pd.Timedelta(20, 'D')))})
Generate the result:
df[['txn_prior_10days', 'txn_prior_20days']] = df.groupby('Customer_ID')\
.Transaction_Date.apply(priorTx)
The code above:
groups df by Customer_ID,
takes from the current group only the Transaction_Date column,
applies the priorTx function to it, and
saves the result in the 2 target columns.
The result, with slightly shortened Transaction_ID values, is:
Transaction_ID Customer_ID Transaction_Date txn_prior_10days txn_prior_20days
0 912410 A 2019-01-01 0 0
1 912341 A 2019-01-03 1 1
2 312415 A 2019-01-09 2 2
3 432513 A 2019-01-12 2 3
4 357912 A 2019-01-19 2 4
5 912411 B 2019-01-06 0 0
6 912342 B 2019-01-11 1 1
7 312416 B 2019-01-13 2 2
8 432514 B 2019-01-20 2 3
9 357913 B 2019-01-21 3 4
You cannot use a rolling computation, because:
the rolling window extends forward from the current row, but you
want to count previous transactions,
rolling calculations include the current row, whereas
you want to exclude it.
This is why I came up with the above solution (just 8 lines of code).
Details of how my solution works
To see all details, create the test DataFrame the following way:
import io
import pandas as pd
txt = '''
Transaction_ID Customer_ID Transaction_Date
912410 A 2019-01-01
912341 A 2019-01-03
312415 A 2019-01-09
432513 A 2019-01-12
357912 A 2019-01-19
912411 B 2019-01-06
912342 B 2019-01-11
312416 B 2019-01-13
432514 B 2019-01-20
357913 B 2019-01-21'''
df = pd.read_fwf(io.StringIO(txt), skiprows=1,
                 widths=[15, 12, 16], parse_dates=[2])
Perform the groupby, but for now retrieve only the group with key 'A':
gr = df.groupby('Customer_ID')
grp = gr.get_group('A')
It contains:
Transaction_ID Customer_ID Transaction_Date
0 912410 A 2019-01-01
1 912341 A 2019-01-03
2 312415 A 2019-01-09
3 432513 A 2019-01-12
4 357912 A 2019-01-19
Let's start with the lowest-level piece: how cntPrevTr works.
Retrieve one of the dates from grp:
dd = grp.iloc[2,2]
It contains Timestamp('2019-01-09 00:00:00').
To test an example invocation of cntPrevTr for this date, run:
cntPrevTr(dd, grp.Transaction_Date, pd.Timedelta(10, 'D'))
i.e. you want to check how many prior transactions this customer performed
before this date, but not earlier than 10 days back.
The result is 2.
To see how the whole first column is computed, run:
td = grp.Transaction_Date
td.apply(cntPrevTr, args=(td, pd.Timedelta(10, 'D')))
The result is:
0 0
1 1
2 2
3 2
4 2
Name: Transaction_Date, dtype: int64
The left column is the index and the right column contains the values
returned by cntPrevTr for each date.
The last thing to show is how the result for the whole group
is generated. Run:
priorTx(grp.Transaction_Date)
The result (a DataFrame) is:
tx10 tx20
0 0 0
1 1 1
2 2 2
3 2 3
4 2 4
The same procedure takes place for all other groups; all partial
results are then concatenated (vertically), and the last step is
to save both columns of the combined result in the respective
columns of df.
I want to compare two continuous home price columns and create a new column that stores a binary variable.
This is my process so far:
dataset['High'] = dataset['November'].map(lambda x: 1 if x>50000 else 0)
This allows me to work on only one column, but I want to compare the November and December home price columns and create a new column that contains binary values.
I want this output
November - December - NewCol
-------------------------------
651200 - 626600 - 0
420900 - 423600 - 1
82300 - 83100 - 1
177000 - 169600 - 0
285500 - 206300 - 0
633900 - 640000 - 1
218900 - 222400 - 1
461700 - 403800 - 0
419100 - 421300 - 1
127600 - 128300 - 1
553400 - 547800 - 0
November and December are continuous variables, and I want to turn the comparison between them into a binary one. I want something like an ifelse() function to create a variable, called "NewCol", which takes on a value of "1" if the ['December'] column is greater than ['November'], and a value of "0" otherwise.
Similar to @3novak's answer, but with casting to int. One uses pandas for efficiency, but when you use something like map, which needs values expressed as (more expensive) Python objects, you may as well just use Python lists. Prefer pandas operations that apply to entire Series and DataFrames instead.
>>> import pandas as pd
>>> df = pd.read_csv('test.csv')
>>> df
November December
0 651200 626600
1 420900 423600
2 82300 83100
3 177000 169600
4 285500 206300
5 633900 640000
6 218900 222400
7 461700 403800
8 419100 421300
9 127600 128300
10 553400 547800
>>> df['Higher'] = df['December'].gt(df['November']).astype(int)
>>> df
November December Higher
0 651200 626600 0
1 420900 423600 1
2 82300 83100 1
3 177000 169600 0
4 285500 206300 0
5 633900 640000 1
6 218900 222400 1
7 461700 403800 0
8 419100 421300 1
9 127600 128300 1
10 553400 547800 0
This would do the trick:
import numpy as np
dataset['deff'] = np.where(dataset['2016-11'] >= dataset['2016-12'], 0, 1)
If I understand correctly, you can use the following to create a boolean column. We don't even need to use an ifelse statement. Instead we can use the vectorized nature of pandas data frames.
data['NewCol'] = data['November'] > data['December']
This returns a column of True and False values instead of 1 and 0, but they are functionally equivalent. You can sum, take means, etc. treating True as 1 and False as 0.
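If 0/1 values are preferred over True/False, the comparison can simply be cast to int, for example:

data['NewCol'] = (data['November'] > data['December']).astype(int)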
I have two files that contain two columns each. The first column is an integer count, and the second column is a linear coordinate. Not every coordinate is represented, and I would like to insert all missing coordinates. Below is an example from one of my files:
3 0
1 10
1 100
2 1000
1 1000002
1 1000005
1 1000006
For this example, coordinates 1-9, 11-99, etc are missing but need to be inserted, and need to be given a count of zero (0).
3 0
0 1
0 2
0 3
0 4
0 5
0 6
0 7
0 8
0 9
1 10
........
With the full set of rows, I then need to add one (1) to every count (the first column). Finally, I would like to compute a simple ratio between the corresponding rows of the first column in the two files. The ratios should be real numbers.
I'd like to be able to do this with Unix if possible, but am somewhat familiar with python scripting as well. Any help is greatly appreciated.
This should work with Python 2.3 onwards.
I assumed that your file is space delimited.
If you want values past 1000006, you will need to change the value of desired_range.
import csv

desired_range = 1000007

reader = csv.reader(open('fill_range_data.txt'), delimiter=' ')

data_map = dict()
for row in reader:
    frequency = int(row[0])
    value = int(row[1])
    data_map[value] = frequency

for i in range(desired_range):
    if i in data_map:
        print data_map[i], i
    else:
        print 0, i
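For Python 3, a sketch that also covers the remaining steps from the question (add one to every count, then take the ratio between the two files) could look like the following; file_a.txt and file_b.txt are placeholder names, and both files are assumed to be space-delimited:

import csv

DESIRED_RANGE = 1000007  # adjust if your coordinates extend further

def load_counts(path):
    # Return a list of counts indexed by coordinate, with 0 for missing coordinates.
    counts = {}
    with open(path) as fh:
        for count, coord in csv.reader(fh, delimiter=' '):
            counts[int(coord)] = int(count)
    return [counts.get(i, 0) for i in range(DESIRED_RANGE)]

counts_a = load_counts('file_a.txt')
counts_b = load_counts('file_b.txt')

for coord, (a, b) in enumerate(zip(counts_a, counts_b)):
    # Add 1 to every count, then take the ratio between the two files.
    print(coord, (a + 1) / (b + 1))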