I calculated a change-between-readings column for multiple types using the code below.
for v in df['Type'].unique():
    df[f'Changebetweenreadings_{v}'] = df.loc[df['Type'].eq(v), 'Last'].diff()
Given:

  Type     Last  Changebetweenreadings_ada  Changebetweenreadings_btc  Changebetweenreadings_eur
0  ada  3071.56                        NaN                        NaN                        NaN
1  ada  3097.82                      26.26                        NaN                        NaN
2  btc  1000.00                        NaN                        NaN                        NaN
3  ada  2000.00                   -1097.82                        NaN                        NaN
4  btc  3000.00                        NaN                     2000.0                        NaN
5  eur  1000.00                        NaN                        NaN                        NaN
6  eur  1500.00                        NaN                        NaN                      500.0
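For reproducibility, the base frame shown above can be built like this before running the loop:

import pandas as pd

df = pd.DataFrame({
    'Type': ['ada', 'ada', 'btc', 'ada', 'btc', 'eur', 'eur'],
    'Last': [3071.56, 3097.82, 1000.00, 2000.00, 3000.00, 1000.00, 1500.00],
})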
Now I need to calculate a direction column based on these change columns.
My output should look like:

  Type change_dir_ada change_dir_btc change_dir_eur
0  ada            Nut
1  ada            Pos
2  btc                           Nut
3  ada            Neg
4  btc                           Nut
5  eur
6  eur                                          Pos
A quick fix I tried is this code.
df.loc[df.Changebetweenreadings_btc > 0, 'ChangeDirection_btc'] = 'Pos'
df.loc[df.Changebetweenreadings_btc < 0, 'ChangeDirection_btc'] = 'Neg'
df.loc[df.Changebetweenreadings_btc == 0, 'ChangeDirection_btc'] = 'Nut'
df.loc[df.Changebetweenreadings_ada > 0, 'ChangeDirection_ada'] = 'Pos'
df.loc[df.Changebetweenreadings_ada < 0, 'ChangeDirection_ada'] = 'Neg'
df.loc[df.Changebetweenreadings_ada == 0, 'ChangeDirection_ada'] = 'Nut'
But this is a lot of code, and it's not a dynamic way of doing it, I think.
I expect something like this.
for v in df['Type'].unique():
    df[f'ChangeDirection_{v}']  # --> do the calculation above here
It doesn't work for these values:

   change  type dir_ada dir_btc
 -3637.31   ada
   -4E-08   ada     Neg
 -3637.31   ada     Nut
   3637.8   btc             Nut

In place of Pos, it gives a seemingly random mapping.
I believe you need:
import numpy as np

vals = ['Pos', 'Neg', 'Nut']
for v in df['Type'].unique():
    df[f'change_dir_{v}'] = df.loc[df['Type'].eq(v), 'Last'].diff()
    df[f'change_dir_{v}'] = np.select([df[f'change_dir_{v}'] > 0,
                                       df[f'change_dir_{v}'] < 0,
                                       df[f'change_dir_{v}'] == 0], vals, '')
print(df)

  Type     Last change_dir_ada change_dir_btc change_dir_eur
0  ada  3071.56
1  ada  3097.82            Pos
2  btc  1000.00
3  ada  2000.00            Neg
4  btc  3000.00                           Pos
5  eur  1000.00
6  eur  1500.00                                          Pos
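If tiny floating-point residues such as -4E-08 should count as no change rather than Neg, a tolerance can be folded into the conditions. A minimal sketch, assuming an atol threshold of 1e-6 that you would tune to your data:

import numpy as np

vals = ['Nut', 'Pos', 'Neg']
for v in df['Type'].unique():
    d = df.loc[df['Type'].eq(v), 'Last'].diff().reindex(df.index)
    near_zero = np.isclose(d, 0, atol=1e-6)  # False for the NaN rows of other types
    # near_zero is checked first, so -4E-08 maps to 'Nut' instead of 'Neg'
    df[f'change_dir_{v}'] = np.select([near_zero, d > 0, d < 0], vals, '')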
I want to create a dataframe from census data. I want to calculate the number of people that filed a tax return for each specific earnings group.
For now, I wrote this
census_df = pd.read_csv('../zip code data/19zpallagi.csv')
sub_census_df = census_df[['zipcode', 'agi_stub', 'N02650', 'A02650', 'ELDERLY', 'A07180']].copy()
num_of_returns = ['Number_of_returns_1_25000', 'Number_of_returns_25000_50000', 'Number_of_returns_50000_75000',
'Number_of_returns_75000_100000', 'Number_of_returns_100000_200000', 'Number_of_returns_200000_more']
for i, column_name in zip(range(1, 7), num_of_returns):
    sub_census_df[column_name] = sub_census_df[sub_census_df['agi_stub'] == i]['N02650']
I have 6 groups attached to each zip code. I want to get one row per zip code, with the number of returns for that zip code appearing just once per column. I already tried changing NaNs to 0 and using groupby('zipcode').sum(), but I get about 50 million returns summed for zipcode 0, where it seems only around 800k should exist.
Here is the dataframe that I currently get:
zipcode agi_stub N02650 A02650 ELDERLY A07180 Number_of_returns_1_25000 Number_of_returns_25000_50000 Number_of_returns_50000_75000 Number_of_returns_75000_100000 Number_of_returns_100000_200000 Number_of_returns_200000_more Amount_1_25000 Amount_25000_50000 Amount_50000_75000 Amount_75000_100000 Amount_100000_200000 Amount_200000_more
0 0 1 778140.0 10311099.0 144610.0 2076.0 778140.0 NaN NaN NaN NaN NaN 10311099.0 NaN NaN NaN NaN NaN
1 0 2 525940.0 19145621.0 113810.0 17784.0 NaN 525940.0 NaN NaN NaN NaN NaN 19145621.0 NaN NaN NaN NaN
2 0 3 285700.0 17690402.0 82410.0 9521.0 NaN NaN 285700.0 NaN NaN NaN NaN NaN 17690402.0 NaN NaN NaN
3 0 4 179070.0 15670456.0 57970.0 8072.0 NaN NaN NaN 179070.0 NaN NaN NaN NaN NaN 15670456.0 NaN NaN
4 0 5 257010.0 35286228.0 85030.0 14872.0 NaN NaN NaN NaN 257010.0 NaN NaN NaN NaN NaN 35286228.0 NaN
And here is what I want to get:
zipcode Number_of_returns_1_25000 Number_of_returns_25000_50000 Number_of_returns_50000_75000 Number_of_returns_75000_100000 Number_of_returns_100000_200000 Number_of_returns_200000_more
0 0 778140.0 525940.0 285700.0 179070.0 257010.0 850.0
Here is one way to do it, using groupby and summing the desired columns:
num_of_returns = ['Number_of_returns_1_25000', 'Number_of_returns_25000_50000', 'Number_of_returns_50000_75000',
'Number_of_returns_75000_100000', 'Number_of_returns_100000_200000', 'Number_of_returns_200000_more']
sub_census_df.groupby('zipcode', as_index=False)[num_of_returns].sum()
zipcode Number_of_returns_1_25000 Number_of_returns_25000_50000 Number_of_returns_50000_75000 Number_of_returns_75000_100000 Number_of_returns_100000_200000 Number_of_returns_200000_more
0 0 778140.0 525940.0 285700.0 179070.0 257010.0 0.0
This question needs more information to give a proper answer. For example, you leave out what is meant by certain columns in your data frame:
- `N1: Number of returns`
- `agi_stub: Size of adjusted gross income`
According to the IRS, this has the following levels:

Size of adjusted gross income:
0 = No AGI Stub
1 = 'Under $1'
2 = '$1 under $10,000'
3 = '$10,000 under $25,000'
4 = '$25,000 under $50,000'
5 = '$50,000 under $75,000'
6 = '$75,000 under $100,000'
7 = '$100,000 under $200,000'
8 = '$200,000 under $500,000'
9 = '$500,000 under $1,000,000'
10 = '$1,000,000 or more'
I got the above from https://www.irs.gov/pub/irs-soi/16incmdocguide.doc
With this information, I think what you want to find is the number of people who filed a tax return for each of the income levels of agi_stub. If that is what you mean, this can be achieved by:
import pandas as pd
data = pd.read_csv("./data/19zpallagi.csv")
## select only the desired columns
data = data[['zipcode', 'agi_stub', 'N1']]
## solution to your problem?
df = data.pivot_table(
index='zipcode',
values='N1',
columns='agi_stub',
aggfunc=['sum']
)
## bit of cleaning up.
PREFIX = 'agi_stub_level_'
df.columns = [PREFIX + level for level in df.columns.get_level_values(1).astype(str)]
Here's the output.
In [77]: df
Out[77]:
agi_stub_level_1 agi_stub_level_2 ... agi_stub_level_5 agi_stub_level_6
zipcode ...
0 50061850.0 37566510.0 ... 21938920.0 8859370.0
1001 2550.0 2230.0 ... 1420.0 230.0
1002 2850.0 1830.0 ... 1840.0 990.0
1005 650.0 570.0 ... 450.0 60.0
1007 1980.0 1530.0 ... 1830.0 460.0
... ... ... ... ... ...
99827 470.0 360.0 ... 170.0 40.0
99833 550.0 380.0 ... 290.0 80.0
99835 1250.0 1130.0 ... 730.0 190.0
99901 1960.0 1520.0 ... 1030.0 290.0
99999 868450.0 644160.0 ... 319880.0 142960.0
[27595 rows x 6 columns]
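If the column names from the question are preferred over the generic prefix, the six pivoted columns can be renamed afterwards. A sketch, assuming the pivot produced exactly the levels 1 through 6, in order:

num_of_returns = ['Number_of_returns_1_25000', 'Number_of_returns_25000_50000',
                  'Number_of_returns_50000_75000', 'Number_of_returns_75000_100000',
                  'Number_of_returns_100000_200000', 'Number_of_returns_200000_more']
df.columns = num_of_returns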
I want to do the following calculation, and the outcome has to be a new column calculation trap:
test["calculation trap"] = (0.000164 + 0.000415) / 2
So the outcome of this formula has to be 0.0002895.
I tried the following code to do this calculation for the whole column, but I got the outcome shown below.
test["calculation trap"] = ((test["calculation"][0:]+test["calculation"][1:])/2).reset_index(drop=True)
     Temp  calculation  calculation trap
0   90.01     0.000164               NaN
1   91.03     0.000415          0.000415
2   95.06     0.001315          0.001315
3  100.07     0.002896          0.002896
4  103.50          NaN               NaN
Use Series.shift with -1:
test["calculation trap"] = ((test["calculation"].shift(-1)+test["calculation"])/2)
print (test)
Temp calculation calculation trap
0 90.01 0.000164 0.000290
1 91.03 0.000415 0.000865
2 95.06 0.001315 0.002106
3 100.07 0.002896 NaN
4 103.50 NaN NaN
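An equivalent rolling-window variant, in case the midpoint ever needs to cover more than two readings; rolling(2).mean() pairs each row with the previous one, and shift(-1) realigns the result to the upper row:

test["calculation trap"] = test["calculation"].rolling(2).mean().shift(-1)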
For school I have to do a project about WiFi signals, and I am trying to put the data in a DataFrame.
There are 208,000 rows of data.
When it comes to the code below, it does not complete; it acts as if it were stuck in an infinite loop.
But when I use only 1000 rows my program works, so I think my lists are too small, if that is possible.
Do bigger lists exist in Python, or is it because of bad coding?
Thanks in advance.
Edit 1:
(data is the original DataFrame and WifiInfo is a column of it)
I have this format:
df = pd.DataFrame(columns=['Sender','Time','Date','Place','X','Y','Bezetting','SSID','BSSID','Signal'])
I am trying to fill SSID, BSSID and Signal from the column WifiInfo; for this I have to split the data.
This is what one WifiInfo value looks like:
ODISEE#88-1d-fc-41-dc-50:-83,ODISEE#88-1d-fc-2c-c0-00:-72,ODISEE#88-1d-fc-41-d2-d0:-82,CiscoC5976#58-6d-8f-19-14-38:-78,CiscoC5959#58-6d-8f-19-13-f4:-93,SNB#c8-d7-19-6f-be-b7:-99,ODISEE#88-1d-fc-2c-c5-70:-94,HackingDemo#58-6d-8f-19-11-48:-156,ODISEE#88-1d-fc-30-d4-40:-85,ODISEE#88-1d-fc-41-ac-50:-100
My current approach looks like:
for index, row in data.iterrows():
    bezettingList = list()
    ssidList = list()
    bssidList = list()
    signalList = list()

    # WifiInfo splitting
    wifis = row.WifiInfo.split(',')
    for wifi in wifis:
        # split each wifi entry and add the parts to the lists
        ssid, bssid = wifi.split('#')
        bssid, signal = bssid.split(':')
        ssidList.append(ssid)
        bssidList.append(bssid)
        signalList.append(int(signal))

    # add bezettingen to the list
    bezettingen = row.Bezetting.split(',')
    for bezetting in bezettingen:
        bezettingList.append(bezetting)

    # add the lists to the dataframe
    df.loc[index, 'SSID'] = ssidList
    df.loc[index, 'BSSID'] = bssidList
    df.loc[index, 'Signal'] = signalList
    df.loc[index, 'Bezetting'] = bezettingList

df.head()
IIUC, you need to first explode the row by commas so that this:
SSID BSSID Signal WifiInfo
0 NaN NaN NaN ODISEE#88-1d-fc-41-dc-50:-83,ODISEE#88- ...
becomes this:
SSID BSSID Signal WifiInfo
0 NaN NaN NaN ODISEE#88-1d-fc-41-dc-50:-83
1 NaN NaN NaN ODISEE#88-1d-fc-2c-c0-00:-72
2 NaN NaN NaN ODISEE#88-1d-fc-41-d2-d0:-82
3 NaN NaN NaN CiscoC5976#58-6d-8f-19-14-38:-78
4 NaN NaN NaN CiscoC5959#58-6d-8f-19-13-f4:-93
5 NaN NaN NaN SNB#c8-d7-19-6f-be-b7:-99
6 NaN NaN NaN ODISEE#88-1d-fc-2c-c5-70:-94
7 NaN NaN NaN HackingDemo#58-6d-8f-19-11-48:-156
8 NaN NaN NaN ODISEE#88-1d-fc-30-d4-40:-85
9 NaN NaN NaN ODISEE#88-1d-fc-41-ac-50:-100
# use `.explode`
data = data.assign(WifiInfo=data.WifiInfo.str.split(',')).explode('WifiInfo')
Now you could use .str.extract:
data['SSID'] = data['WifiInfo'].str.extract(r'(.*)#')
data['BSSID'] = data['WifiInfo'].str.extract(r'#(.*):')
data['Signal'] = data['WifiInfo'].str.extract(r':(.*)')
SSID BSSID Signal WifiInfo
0 ODISEE 88-1d-fc-41-dc-50 -83 ODISEE#88-1d-fc-41-dc-50:-83
1 ODISEE 88-1d-fc-2c-c0-00 -72 ODISEE#88-1d-fc-2c-c0-00:-72
2 ODISEE 88-1d-fc-41-d2-d0 -82 ODISEE#88-1d-fc-41-d2-d0:-82
3 CiscoC5976 58-6d-8f-19-14-38 -78 CiscoC5976#58-6d-8f-19-14-38:-78
4 CiscoC5959 58-6d-8f-19-13-f4 -93 CiscoC5959#58-6d-8f-19-13-f4:-93
5 SNB c8-d7-19-6f-be-b7 -99 SNB#c8-d7-19-6f-be-b7:-99
6 ODISEE 88-1d-fc-2c-c5-70 -94 ODISEE#88-1d-fc-2c-c5-70:-94
7 HackingDemo 58-6d-8f-19-11-48 -156 HackingDemo#58-6d-8f-19-11-48:-156
8 ODISEE 88-1d-fc-30-d4-40 -85 ODISEE#88-1d-fc-30-d4-40:-85
9 ODISEE 88-1d-fc-41-ac-50 -100 ODISEE#88-1d-fc-41-ac-50:-100
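As a side note, the three extracts can be collapsed into a single pass with named groups, and Signal cast to a number afterwards. A sketch, assuming every value follows the SSID#BSSID:Signal layout:

# one regex, three named capture groups -> a three-column frame
parts = data['WifiInfo'].str.extract(r'(?P<SSID>.*)#(?P<BSSID>.*):(?P<Signal>.*)')
data[['SSID', 'BSSID', 'Signal']] = parts
data['Signal'] = pd.to_numeric(data['Signal'])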
If you want to keep data grouped after column explosion, I'd assign an ID for each group of entries first:
data['Group'] = pd.factorize(data['WifiInfo'])[0]+1
SSID BSSID Signal WifiInfo Group
0 NaN NaN NaN ODISEE#88-1d-fc-41-dc-50:-83,ODISEE#88- ... 1
1 NaN NaN NaN ASD#22-1d-fc-41-dc-50:-83,QWERTY#88- ... 2
# after you explode the column
SSID BSSID Signal WifiInfo Group
ODISEE 88-1d-fc-41-dc-50 -83 ODISEE#88-1d-fc-41-dc-50:-83 1
ODISEE 88-1d-fc-2c-c0-00 -72 ODISEE#88-1d-fc-2c-c0-00:-72 1
...
...
ASD 22-1d-fc-41-dc-50 -83 ASD#22-1d-fc-41-dc-50:-83 2
QWERTY 88-1d-fc-2c-c0-00 -72 QWERTY#88-1d-fc-2c-c0-00:-72 2
I am setting up a stock price prediction data set, and I am applying the following code for the Ichimoku Cloud indicator:
from datetime import timedelta
high_9 = df['High'].rolling(window= 9).max()
low_9 = df['Low'].rolling(window= 9).min()
df['tenkan_sen'] = (high_9 + low_9) /2
high_26 = df['High'].rolling(window= 26).max()
low_26 = df['Low'].rolling(window= 26).min()
df['kijun_sen'] = (high_26 + low_26) /2
# this is to extend the 'df' in future for 26 days
# the 'df' here is numerical indexed df
# the problem is here
last_index = df.iloc[-1:].index[0]
last_date = df['Date'].iloc[-1].date()
for i in range(26):
    df.loc[last_index + 1 + i, 'Date'] = last_date + timedelta(days=i + 1)
df['senkou_span_a'] = ((df['tenkan_sen'] + df['kijun_sen']) / 2).shift(26)
high_52 = df['High'].rolling(window= 52).max()
low_52 = df['Low'].rolling(window= 52).min()
df['senkou_span_b'] = ((high_52 + low_52) /2).shift(26)
# most charting softwares dont plot this line
df['chikou_span'] = df['Close'].shift(-26)
The above code works, but the problem is that while extending the next 26 time steps (rows) for the 'senkou span a' and 'senkou span b' columns, it turns the other columns' values in those rows to NaN.
So I need help to get the 'senkou span a' and 'senkou span b' predicted rows into my data set without setting the other rows' values to NaN.
The current output is:
Date Open High Low Close Senkou span a Senkou span b
2019-03-16 50 51 52 53 56.0 55.82
2019-03-17 NaN NaN NaN NaN 55.0 56.42
2019-03-18 NaN NaN NaN NaN 54.0 57.72
2019-03-19 NaN NaN NaN NaN 53.0 58.12
2019-03-20 NaN NaN NaN NaN 52.0 59.52
The expected output is:
Date Open High Low Close Senkou span a Senkou span b
2019-03-16 50 51 52 53 56.0 55.82
2019-03-17 55.0 56.42
2019-03-18 54.0 57.72
2019-03-19 53.0 58.12
2019-03-20 52.0 59.52
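A minimal sketch of one way to get the blank cells shown above, assuming the NaNs only need to disappear from the displayed or exported table rather than from the underlying numeric data:

price_cols = ['Open', 'High', 'Low', 'Close']

# for display: show '' instead of NaN in the extended rows
display_df = df.copy()
display_df[price_cols] = display_df[price_cols].astype(object).where(
    display_df[price_cols].notna(), '')

# for export, na_rep achieves the same without touching the frame
df.to_csv('ichimoku.csv', na_rep='')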
I have a question about arithmetic within a dataframe. Please note that each of the below columns in my dataframe is based on one another, except for 'holdings'.
Here is a shortened version of my dataframe
'holdings' & 'cash' & 'total'
0.0 10000.0 10000.0
0.0 10000.0 10000.0
1000 9000.0 10000.0
1500 10000.0 11500.0
2000 10000.0 12000.0
initial_cap = 10000.0
But here is my problem: the first time I have holdings, the cash is calculated correctly, where cash of 10000.0 - holdings of 1000.0 = 9000.0.
I need cash to remain at 9000.0 until my holdings goes back to 0.0 again.
Here is my calculation:
cash = initial_cap - holdings
In other words, how would you calculate cash so that it remains at 9000.0 until holdings goes back to 0.0?
Here is how I want it to look:
'holdings' & 'cash' & 'total'
0.0 10000.0 10000.0
0.0 10000.0 10000.0
1000 9000.0 10000.0
1500 9000.0 10500.0
2000 9000.0 11000.0
So I try to rephrase: you start with initial capital 10 and a given sequence of holdings {0, 0, 1, 1.5, 2}, and you want to create a cash variable that is 10 whenever holdings is 0. As soon as holdings increases in a period by x, you want cash to be 10 - x until holdings equals 0 again.
If this is correct, this is what I would do (the logic of total and all of this is still unclear to me, but this is what you added in the end, so I focus on this).
PS: providing code to create your sample is considered nice.
import pandas as pd

df = pd.DataFrame([0, 1, 2, 2, 0, 2, 3, 3], columns=['holdings'])
x = 10

# triggers are when cash is supposed to be zero
triggers = df['holdings'] == 0
# inits are when holdings change for the first time
inits = df.index[triggers].values + 1

df['cash'] = 0
for i in inits:
    df.loc[i:, 'cash'] = x - df.loc[i, 'holdings']
df.loc[triggers, 'cash'] = 0
df
Out[339]:
holdings cash
0 0 0
1 1 9
2 2 9
3 2 9
4 0 0
5 2 8
6 3 8
7 3 8
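For reference, a vectorized sketch of the same logic, under the same assumptions as the loop (the series starts with a zero-holdings row, and the first holding after each zero row fixes the cash level for that period):

zero = df['holdings'].eq(0)
period = zero.cumsum()  # new period id at every zero-holdings row

# first holding after each zero row sets the period's cash level
first_holding = df.groupby(period)['holdings'].transform(
    lambda s: s.iloc[1] if len(s) > 1 else 0)

df['cash'] = (x - first_holding).where(~zero, 0)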