I have a pandas DataFrame containing a product and its state, along with other information. An example DataFrame can be created as follows:
import pandas as pd
import numpy as np

data = {'Product': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B', 'C', 'C'],
        'Date': ['2020-01-01', '2020-01-02', '2020-01-03', '2020-01-04', '2020-01-01', '2020-01-02', '2020-01-03', '2020-01-04', '2020-01-05', '2020-01-06', '2020-01-01', '2020-01-02'],
        'Price': [10, 20, 30, 40, 15, 25, 35, 45, 55, 65, 101, 102],
        'state': [1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1]}
test = pd.DataFrame(data)
I want to count how many times the product state changes from 0 to 1. I've used the following code to check whether the state goes from 0 to 1, and named the result change:
test['change'] = np.where(test.state < test.state.shift(-1), 1, 0)
The problem is that the code above does not take the product into account, so I need to group by product and then check the change in state.
Output:
+---------+----------+-------+-------+--------+
| Product | Date | Price | state | change |
+---------+----------+-------+-------+--------+
| A | 1/1/2020 | 10 | 1 | 0 |
| A | 1/2/2020 | 20 | 0 | 1 |
| A | 1/3/2020 | 30 | 1 | 0 |
| A | 1/4/2020 | 40 | 0 | 1 |
| B | 1/1/2020 | 15 | 1 | 0 |
| B | 1/2/2020 | 25 | 0 | 0 |
| B | 1/3/2020 | 35 | 0 | 1 |
| B | 1/4/2020 | 45 | 1 | 0 |
| B | 1/5/2020 | 55 | 0 | 1 |
| B | 1/6/2020 | 65 | 1 | 0 |
| C | 1/1/2020 | 101 | 0 | 1 |
| C | 1/2/2020 | 102 | 1 | 0 |
+---------+----------+-------+-------+--------+
As seen in the output above, for product A on the 4th date change is 1 because the state on the next row is 1, but that row belongs to a different product.
Desired Output:
+---------+----------+-------+-------+--------+
| Product | Date | Price | state | change |
+---------+----------+-------+-------+--------+
| A | 1/1/2020 | 10 | 1 | 0 |
| A | 1/2/2020 | 20 | 0 | 1 |
| A | 1/3/2020 | 30 | 1 | 0 |
| A | 1/4/2020 | 40 | 0 | 0 |
| B | 1/1/2020 | 15 | 1 | 0 |
| B | 1/2/2020 | 25 | 0 | 0 |
| B | 1/3/2020 | 35 | 0 | 1 |
| B | 1/4/2020 | 45 | 1 | 0 |
| B | 1/5/2020 | 55 | 0 | 1 |
| B | 1/6/2020 | 65 | 1 | 0 |
| C | 1/1/2020 | 101 | 0 | 1 |
| C | 1/2/2020 | 102 | 1 | 0 |
+---------+----------+-------+-------+--------+
+---------+------------+
| Product |count_change|
+---------+------------+
| A | 1 |
| B | 2 |
| C | 1 |
+---------+------------+
How can I tweak the code so that change is computed after grouping by product, and how can I then get a product-wise count of how many times the state changed from 0 to 1?
Try groupby:
g = test.groupby('Product')
# state.diff(-1) < 0 flags rows whose state increases on the next row
# of the same product, i.e. a 0 -> 1 transition.
test['change'] = (g['state'].diff(-1) < 0).astype(int)
g['change'].sum()
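For the sample data above, this reproduces the counts in the desired output. An equivalent, arguably more explicit formulation (a sketch, not the only way to write it) compares each row's state to the next row's state within the same product:
# A 0 -> 1 transition occurs where the current state is 0 and the
# next state within the same product is 1.
test['change'] = ((test['state'] == 0)
                  & (test.groupby('Product')['state'].shift(-1) == 1)).astype(int)
test.groupby('Product')['change'].sum()
# Product
# A    1
# B    2
# C    1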
I have two datasets, dataset1 and dataset2, which share a common column called SAX (a string object).
dataset1=
SAX
0 glngsyu
1 zicobgm
2 eerptow
3 cqbsynt
4 zvmqben
.. ...
475 rfikekw
476 bnbzvqx
477 rsuhgax
478 ckhloio
479 lbzujtw
480 rows × 2 columns
and
dataset2 =
SAX timestamp
0 hssrlcu 16015
1 ktyuymp 16016
2 xncqmfr 16017
3 aanlmna 16018
4 urvahvo 16019
... ... ...
263455 jeivqzo 279470
263456 bzasxgw 279471
263457 jspqnqv 279472
263458 sxwfchj 279473
263459 gxqnhfr 279474
263460 rows × 2 columns
I need to find and print the timestamps for whenever a value in the SAX column of dataset1 exists in the SAX column of dataset2.
Is there a function/method for accomplishing this?
Thanks.
Let's create an arbitrary dataset to showcase how it works:
import pandas as pd
import numpy as np
def sax_generator(num):
    return [''.join(chr(x) for x in np.random.randint(97, 97 + 26, size=4)) for _ in range(num)]
df1 = pd.DataFrame(sax_generator(10), columns=['sax'])
df2 = pd.DataFrame({'sax': sax_generator(10), 'timestamp': range(10)})
Let's peek into the data:
df1 =
| | sax |
|---:|:------|
| 0 | cvtj |
| 1 | fmjy |
| 2 | rjpi |
| 3 | gwtv |
| 4 | qhov |
| 5 | uriu |
| 6 | kpku |
| 7 | xkop |
| 8 | kzoe |
| 9 | nydj |
df2 =
| | sax | timestamp |
|---:|:------|------------:|
| 0 | kzoe | 0 |
| 1 | npyo | 1 |
| 2 | uriu | 2 |
| 3 | hodu | 3 |
| 4 | rdko | 4 |
| 5 | pspn | 5 |
| 6 | qnut | 6 |
| 7 | gtyz | 7 |
| 8 | gfzs | 8 |
| 9 | gcel | 9 |
Now ensure df2 contains some values from df1, which we can check later:
df2.loc[2, 'sax'] = df1.loc[5, 'sax']
df2.loc[0, 'sax'] = df1.loc[8, 'sax']
Then use:
df2.loc[df1.sax.apply(lambda x: df2.sax.str.contains(x)).any(), 'timestamp']
to get:
| | timestamp |
|---:|------------:|
| 0 | 0 |
| 2 | 2 |
With np.where you can get the indices back as well:
np.where(df1.sax.apply(lambda x: df2.sax.str.contains(x)) == True)
# -> (array([5, 8]), array([2, 0]))
Here we can see that df1 has matching indices [5, 8] and df2 has [2, 0], which is exactly what we enforced with the lines above...
If we have a look at the return value of df1.sax.apply(lambda x: df2.sax.str.contains(x)), the result matches exactly those indices:
| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---:|----:|----:|----:|----:|----:|----:|----:|----:|----:|----:|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 5 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 7 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 8 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 9 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
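If only exact matches are needed (rather than the substring semantics of str.contains), a simpler and typically faster alternative would be Series.isin; a minimal sketch using the same df1/df2:
# Select the timestamps of df2 rows whose sax value appears anywhere in df1.
df2.loc[df2['sax'].isin(df1['sax']), 'timestamp']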
Step1: Convert dataset2 to a dict that maps each timestamp to its SAX value:
import pandas as pd

a_dictionary = dataset2.set_index('timestamp')['SAX'].to_dict()
Step2: Use a comparator in a for loop to extract the timestamps:
lookup_value = "abcdef"  # this can be a list item
all_keys = []
for key, value in a_dictionary.items():
    if value == lookup_value:
        all_keys.append(key)
print(all_keys)
Step3: ENJOY!
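To apply this to every value in dataset1's SAX column rather than a single lookup_value, the same idea extends naturally; a sketch assuming the dict built in Step 1:
# Collect the timestamps of all rows whose SAX value occurs in dataset1.
wanted = set(dataset1['SAX'])
matching_timestamps = [key for key, value in a_dictionary.items() if value in wanted]
print(matching_timestamps)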
I'm trying to convert a table using a pivot to get columns for Sales and Profit for each Year, and I want to prefix the new columns appropriately, i.e. sales_(Year) and profit_(Year).
Is there a way of adding different prefixes in pandas for the different values?
Example:
import pandas as pd

d = {'SourceID': [1, 1, 2, 2, 3, 3, 3], 'Year': [0, 1, 0, 1, 1, 2, 3], 'Sales': [100, 200, 300, 400, 500, 600, 700], 'Profit': [10, 20, 30, 40, 50, 60, 70]}
df = pd.DataFrame(data=d)
I can get the information into the appropriate structure using the following:
result = (
df
.pivot_table(
index=['SourceID'],
columns=['Year'],
values=['Sales', 'Profit'],
fill_value=0,
aggfunc='mean'
)
.add_prefix('sales_')
.reset_index()
)
However, I can't work out how to add the prefixes separately for Sales and Profit; at the moment I'm stuck with just sales_ for everything.
Current Output:
| | ('SourceID', '') | ('sales_Profit', 'sales_0') | ('sales_Profit', 'sales_1') | ('sales_Profit', 'sales_2') | ('sales_Profit', 'sales_3') | ('sales_Sales', 'sales_0') | ('sales_Sales', 'sales_1') | ('sales_Sales', 'sales_2') | ('sales_Sales', 'sales_3') |
| ---: | ---------------: | --------------------------: | --------------------------: | --------------------------: | --------------------------: | -------------------------: | -------------------------: | -------------------------: | -------------------------: |
| 0 | 1 | 10 | 20 | 0 | 0 | 100 | 200 | 0 | 0 |
| 1 | 2 | 30 | 40 | 0 | 0 | 300 | 400 | 0 | 0 |
| 2 | 3 | 0 | 50 | 60 | 70 | 0 | 500 | 600 | 700 |
Desired Output:
| | ('SourceID', '') | ('Profit', 'profit_0') | ('Profit', 'profit_1') | ('Profit', 'profit_2') | ('Profit', 'profit_3') | ('Sales', 'sales_0') | ('Sales', 'sales_1') | ('Sales', 'sales_2') | ('Sales', 'sales_3') |
| ---: | ---------------: | ---------------------: | --------------------: | ---------------------: | ---------------------: | -------------------: | -------------------: | -------------------: | -------------------: |
| 0 | 1 | 10 | 20 | 0 | 0 | 100 | 200 | 0 | 0 |
| 1 | 2 | 30 | 40 | 0 | 0 | 300 | 400 | 0 | 0 |
| 2 | 3 | 0 | 50 | 60 | 70 | 0 | 500 | 600 | 700 |
###### Original Table
| | SourceID | Year | Sales | Profit |
| ---: | -------: | ---: | ----: | -----: |
| 0 | 1 | 0 | 100 | 10 |
| 1 | 1 | 1 | 200 | 20 |
| 2 | 2 | 0 | 300 | 30 |
| 3 | 2 | 1 | 400 | 40 |
| 4 | 3 | 1 | 500 | 50 |
| 5 | 3 | 2 | 600 | 60 |
| 6 | 3 | 3 | 700 | 70 |
Use a list comprehension with f-strings to lowercase the first-level values and prepend them to the second level:
result = (
df
.pivot_table(
index=['SourceID'],
columns=['Year'],
values=['Sales', 'Profit'],
fill_value=0,
aggfunc='mean'
))
L = [(a, f'{a.lower()}_{b}') for a, b in result.columns]
result.columns = pd.MultiIndex.from_tuples(L)
result = result.reset_index()
print(result)

  SourceID   Profit                              Sales
            profit_0 profit_1 profit_2 profit_3 sales_0 sales_1 sales_2 sales_3
0        1       10       20        0        0     100     200       0       0
1        2       30       40        0        0     300     400       0       0
2        3        0       50       60       70       0     500     600     700
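If a completely flat, single-level header is wanted afterwards (an optional extra step, not asked for in the original question), the two levels can be collapsed by keeping the prefixed second-level name wherever it exists:
# Collapse the MultiIndex header into one level; 'SourceID' has an empty
# second level after reset_index, so fall back to the first level there.
result.columns = [b if b else a for a, b in result.columns]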
I have loaded raw_data from MySQL using SQLAlchemy and PyMySQL:
from sqlalchemy import create_engine
import pandas as pd

engine = create_engine('mysql+pymysql://[user]:[passwd]@[host]:[port]/[database]')
df = pd.read_sql_table('data', engine)
df is something like this
| Age Category | Category |
|--------------|----------------|
| 31-26 | Engaged |
| 26-31 | Engaged |
| 31-36 | Not Engaged |
| Above 51 | Engaged |
| 41-46 | Disengaged |
| 46-51 | Nearly Engaged |
| 26-31 | Disengaged |
Then I performed the following analysis:
age = pd.crosstab(df['Age Category'], df['Category'])
| Category | A | B | C | D |
|--------------|---|----|----|---|
| Age Category | | | | |
| 21-26 | 2 | 2 | 4 | 1 |
| 26-31 | 7 | 11 | 12 | 5 |
| 31-36 | 3 | 5 | 5 | 2 |
| 36-41 | 2 | 4 | 1 | 7 |
| 41-46 | 0 | 1 | 3 | 2 |
| 46-51 | 0 | 0 | 2 | 3 |
| Above 51 | 0 | 3 | 0 | 6 |
I want to change it to a pandas DataFrame something like this:
| Age Category | A | B | C | D |
|--------------|---|----|----|---|
| 21-26 | 2 | 2 | 4 | 1 |
| 26-31 | 7 | 11 | 12 | 5 |
| 31-36 | 3 | 5 | 5 | 2 |
| 36-41 | 2 | 4 | 1 | 7 |
| 41-46 | 0 | 1 | 3 | 2 |
| 46-51 | 0 | 0 | 2 | 3 |
| Above 51 | 0 | 3 | 0 | 6 |
Thank you for your time and consideration
Both labels are the columns name and the index name; the solution for changing them is DataFrame.rename_axis:
age = age.rename_axis(index=None, columns='Age Category')
Or set the columns name from the index name, and then set the index name back to the default, None:
age.columns.name = age.index.name
age.index.name = None
print(age)
Age Category Disengaged Engaged Nearly Engaged Not Engaged
26-31 1 1 0 0
31-26 0 1 0 0
31-36 0 0 0 1
41-46 1 0 0 0
46-51 0 0 1 0
Above 51 0 1 0 0
But these labels are essentially metadata, so some functions may drop them.
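If the goal is for Age Category to appear as an ordinary column holding the age ranges (as in the desired output), resetting the index is another option; a sketch assuming the crosstab age from above:
# Turn the index into a regular 'Age Category' column and drop the
# leftover 'Category' columns label.
age_flat = age.reset_index().rename_axis(columns=None)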
I have some data like this:
import pandas as pd

df = pd.DataFrame({'code': ['a', 'a', 'a', 'b', 'b', 'c'],
                   'value': [1, 2, 3, 4, 2, 1]})
+-------+------+-------+
| index | code | value |
+-------+------+-------+
| 0 | a | 1 |
+-------+------+-------+
| 1 | a | 2 |
+-------+------+-------+
| 2 | a | 3 |
+-------+------+-------+
| 3 | b | 4 |
+-------+------+-------+
| 4 | b | 2 |
+-------+------+-------+
| 5 | c | 1 |
+-------+------+-------+
I want to add a column that contains the max value for each code:
| index | code | value | max |
|-------|------|-------|-----|
| 0 | a | 1 | 3 |
| 1 | a | 2 | 3 |
| 2 | a | 3 | 3 |
| 3 | b | 4 | 4 |
| 4 | b | 2 | 4 |
| 5 | c | 1 | 1 |
Is there any way to do this with pandas?
Use GroupBy.transform for a new column of aggregated values:
df['max'] = df.groupby('code')['value'].transform('max')
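For the sample frame above, this should reproduce the desired max column:
print(df)
#   code  value  max
# 0    a      1    3
# 1    a      2    3
# 2    a      3    3
# 3    b      4    4
# 4    b      2    4
# 5    c      1    1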
You can try this as well, although it is quadratic because it rescans the frame for every row:
# For each row's code, find the max value among all rows with that code.
df["max"] = df.code.apply(lambda i: max(df.loc[df["code"] == i, "value"]))
My data looks like this:
+------------------+--------------------------------------+-----+--------------------------------------+-------------+---------------+---------------------------------+-------------------------+------------------------------+--------------------------------------+--------------------+
| SeriousDlqin2yrs | RevolvingUtilizationOfUnsecuredLines | age | NumberOfTime30-59DaysPastDueNotWorse | DebtRatio | MonthlyIncome | NumberOfOpenCreditLinesAndLoans | NumberOfTimes90DaysLate | NumberRealEstateLoansOrLines | NumberOfTime60-89DaysPastDueNotWorse | NumberOfDependents |
+------------------+--------------------------------------+-----+--------------------------------------+-------------+---------------+---------------------------------+-------------------------+------------------------------+--------------------------------------+--------------------+
| 1 | 0.186762647 | 44 | 0 | 0.579385737 | 1920 | 5 | 0 | 1 | 0 | 4 |
| 1 | 0.023579132 | 57 | 0 | 0.299046087 | 8700 | 12 | 0 | 3 | 0 | 0 |
| 1 | 0.003621589 | 58 | 0 | 0.310172457 | 4000 | 14 | 0 | 1 | 0 | 2 |
| 1 | 0.145603022 | 77 | 0 | 0.313491151 | 4350 | 10 | 0 | 1 | 0 | 0 |
| 1 | 0.245191827 | 53 | 0 | 0.238513897 | 7051 | 3 | 0 | 1 | 0 | 0 |
| 1 | 0.443504996 | 23 | 0 | 54 | 6670.221237 | 3 | 0 | 0 | 0 | 0.757222268 |
| 1 | 0.332611231 | 51 | 0 | 0.267748028 | 9000 | 9 | 0 | 2 | 0 | 6 |
| 1 | 0.625243566 | 40 | 0 | 0.474557522 | 7231 | 9 | 0 | 1 | 0 | 2 |
| 1 | 0.091590841 | 51 | 0 | 2359 | 6670.221237 | 3 | 0 | 1 | 0 | 0 |
| 1 | 0.69186808 | 48 | 3 | 0.125587441 | 10000 | 7 | 0 | 0 | 0 | 2 |
| 1 | 0.004999828 | 63 | 1 | 0.246688328 | 4000 | 5 | 0 | 1 | 0 | 0 |
| 1 | 0.064841612 | 53 | 0 | 0.239478872 | 11666 | 13 | 0 | 3 | 0 | 1 |
| 1 | 0.512060051 | 44 | 1 | 0.412406271 | 4400 | 14 | 0 | 0 | 0 | 1 |
| 1 | 0.9999999 | 25 | 0 | 0.024314936 | 2590 | 1 | 0 | 0 | 0 | 0 |
| 1 | 0.372130998 | 32 | 0 | 0.206849144 | 8000 | 7 | 0 | 0 | 0 | 2 |
| 1 | 0.9999999 | 34 | 0 | 0.208158368 | 5000 | 4 | 0 | 0 | 0 | 2 |
| 1 | 0.023464572 | 63 | 0 | 0.149350649 | 1539 | 13 | 0 | 0 | 0 | 0 |
| 1 | 0.937531861 | 64 | 2 | 0.563646207 | 14776 | 9 | 0 | 2 | 1 | 1 |
| 1 | 0.001808414 | 51 | 0 | 1736 | 6670.221237 | 7 | 0 | 1 | 0 | 0 |
| 1 | 0.019950125 | 54 | 1 | 3622 | 6670.221237 | 7 | 0 | 1 | 0 | 0 |
| 1 | 0.183178709 | 42 | 0 | 0.162644416 | 7789 | 12 | 0 | 0 | 0 | 2 |
| 1 | 0.039786673 | 76 | 0 | 0.011729323 | 3324 | 6 | 0 | 0 | 0 | 1 |
| 1 | 0.047418557 | 41 | 1 | 1.178100863 | 2200 | 7 | 0 | 1 | 0 | 2 |
| 1 | 0.127890461 | 59 | 0 | 67 | 6670.221237 | 2 | 0 | 0 | 0 | 0 |
| 1 | 0.074955088 | 57 | 0 | 776 | 6670.221237 | 9 | 0 | 0 | 0 | 0 |
| 1 | 0.025459356 | 63 | 0 | 0.326794591 | 18708 | 30 | 0 | 3 | 0 | 1 |
| 1 | 0.9999999 | 29 | 0 | 0.06104034 | 3767 | 1 | 0 | 0 | 0 | 3 |
| 1 | 0.016754935 | 50 | 0 | 0.046870976 | 7765 | 6 | 0 | 0 | 0 | 0 |
| 0 | 0.566751792 | 40 | 0 | 0.2010541 | 6450 | 9 | 1 | 0 | 1 | 4 |
+------------------+--------------------------------------+-----+--------------------------------------+-------------+---------------+---------------------------------+-------------------------+------------------------------+--------------------------------------+--------------------+
import numpy as np
import pandas as pd
from scipy import stats

# self-defined auto-binning function
def mono_bin(Y, X, n=20):
    r = 0
    good = Y.sum()
    bad = Y.count() - good
    # shrink the number of buckets until the bucket means are monotonic
    # (|Spearman r| == 1)
    while np.abs(r) < 1:
        d1 = pd.DataFrame({"X": X, "Y": Y, "Bucket": pd.qcut(X, n)})
        d2 = d1.groupby('Bucket', as_index=True)
        r, p = stats.spearmanr(d2.mean().X, d2.mean().Y)
        n = n - 1
    d3 = pd.DataFrame(d2.X.min(), columns=['min'])
    d3['min'] = d2.min().X
    d3['max'] = d2.max().X
    d3['sum'] = d2.sum().Y
    d3['total'] = d2.count().Y
    d3['rate'] = d2.mean().Y
    d3['woe'] = np.log((d3['rate'] / (1 - d3['rate'])) / (good / bad))
    d4 = (d3.sort_index(by='min')).reset_index(drop=True)
    print("=" * 60)
    print(d4)
    return d4
I have checked some answers; most of them suggested using something like iteritems, but that seems different from my case.
dfx1, ivx1,cutx1,woex1=mono_bin(data.SeriousDlqin2yrs,data.RevolvingUtilizationOfUnsecuredLines,n=10)
in ()
----> 1 dfx1, ivx1,cutx1,woex1=mono_bin(data.SeriousDlqin2yrs,data.RevolvingUtilizationOfUnsecuredLines,n=10)
ValueError: too many values to unpack (expected 4)
Any idea how I can fix this?
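A note on where the error comes from: mono_bin ends with return d4, a single DataFrame, while the call site tries to unpack four names; unpacking a DataFrame iterates over its column labels (six here: min, max, sum, total, rate, woe), hence "too many values to unpack (expected 4)". A minimal sketch of one fix, assuming only the binning table is needed:
# Assign the single returned DataFrame to one name instead of four.
dfx1 = mono_bin(data.SeriousDlqin2yrs, data.RevolvingUtilizationOfUnsecuredLines, n=10)
# Alternatively, mono_bin could be changed to return four objects
# (e.g. the table plus hypothetical iv/cut/woe values), but those are
# not computed in the code shown above.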