Filter Data with Pandas in Pycharm - python

This is my code so far in Pycharm for my Streamlit Data app:
import pandas as pd
import plotly.express as px
import streamlit as st
st.set_page_config(page_title='Matching Application Number',
layout='wide')
df = pd.read_csv('Analysis_1.csv')
st.sidebar.header("Filter Data:")
MeetingFileType = st.sidebar.multiselect(
"Select File Type:",
options=df['MEETING_FILE_TYPE'].unique(),
default=df['MEETING_FILE_TYPE'].unique()
)
df_selection = df.query(
    'MEETING_FILE_TYPE == @MeetingFileType'
)
st.dataframe(df_selection)
My code currently produces this result in Streamlit:
Application_ID MEETING_FILE_TYPE
BBC#:1010 1
NBA#:1111 2
BRC#:1212 1
SAC#:1412 4
QRD#:1912 2
BBA#:1092 4
However, I would like to filter the data and return only the Application_ID rows where MEETING_FILE_TYPE is 1 or 2.
I am looking for this result below:
Filter Data: Application_ID MEETING_FILE_TYPE
select type: BBC#:1010 1
1 2 NBA#:1111 2
BRC#:1212 1
QRD#:1912 2

The .isin() method builds a Boolean mask that you can use to filter the rows of your DataFrame. For filtering a categorical column against a list of allowed values, as in your example, it's the simplest way to go. See the pandas documentation for Series.isin.
select_type = [1,2]
df = df[df['MEETING_FILE_TYPE'].isin(select_type)]
It's not necessary in this example, but the ~ operator, which inverts a Boolean mask, often comes in handy together with .isin() when you want to exclude the listed values instead.
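As a minimal sketch of both forms, the snippet below rebuilds the sample table from the question and uses a plain list named selected (a stand-in for whatever st.sidebar.multiselect returns; that name is an assumption, not part of the original code). It also shows the equivalent query() spelling with the in operator.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    'Application_ID': ['BBC#:1010', 'NBA#:1111', 'BRC#:1212',
                       'SAC#:1412', 'QRD#:1912', 'BBA#:1092'],
    'MEETING_FILE_TYPE': [1, 2, 1, 4, 2, 4],
})

# Stand-in for the list returned by the multiselect widget (assumption)
selected = [1, 2]

# Keep only the rows whose file type is in the selection
df_selection = df[df['MEETING_FILE_TYPE'].isin(selected)]

# Invert the mask with ~ to get every row that is NOT in the selection
df_rest = df[~df['MEETING_FILE_TYPE'].isin(selected)]

# The same filter expressed with query(); @ refers to a local Python variable
via_query = df.query('MEETING_FILE_TYPE in @selected')

print(df_selection)
```

In the Streamlit app, passing the multiselect result in place of selected should give the same filtering behaviour.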

Related

Compute number of floats in a int range - Python

I have the following dataframe containing floats as input and would like to count how many values fall in the ranges 0 to 90 and 90 to 180. The output dataframe was obtained using the FREQUENCY() function in Excel.
[Input dataframe]
[Desired output]
I'd like to do the same thing with Python but didn't find a solution. Do you have any suggestions?
I can also provide source files if needed.
Here's one way, by dividing the columns by 90, truncating to integers, then using groupby and count:
import numpy as np
import pandas as pd
data = [
[87.084,5.293],
[55.695,0.985],
[157.504,2.995],
[97.701,179.593],
[97.67,170.386],
[118.713,177.53],
[99.972,176.665],
[124.849,1.633],
[72.787,179.459]
]
df = pd.DataFrame(data,columns=['Var1','Var2'])
df = (df / 90).astype(int)
df1 = pd.DataFrame([["0-90"], ["90-180"]])
df1['Var1'] = df.groupby('Var1').count()
df1['Var2'] = df.groupby('Var2').count()
print(df1)
Output:
0 Var1 Var2
0 0-90 3 4
1 90-180 6 5
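An alternative sketch, assuming the same data, uses pd.cut to assign each value to a labelled bin and then counts the bins per column. The names bins, labels and counts are illustrative. Note that pd.cut closes intervals on the right by default, so a value of exactly 90 would land in the first bin here, whereas the integer-division approach above would put it in the second.

```python
import pandas as pd

data = [
    [87.084, 5.293],
    [55.695, 0.985],
    [157.504, 2.995],
    [97.701, 179.593],
    [97.67, 170.386],
    [118.713, 177.53],
    [99.972, 176.665],
    [124.849, 1.633],
    [72.787, 179.459],
]
df = pd.DataFrame(data, columns=['Var1', 'Var2'])

bins = [0, 90, 180]          # bin edges: [0, 90] and (90, 180]
labels = ['0-90', '90-180']  # one label per bin

# Bin each column, count the members of each bin, keep the bins in order
counts = pd.DataFrame({
    col: pd.cut(df[col], bins=bins, labels=labels, include_lowest=True)
           .value_counts()
           .sort_index()
    for col in df.columns
})
print(counts)
```

Adding more ranges only requires extending bins and labels, which is the main reason to prefer this form over dividing by a fixed width.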

querying a multiindex pandas dataframe with slices

Assuming I have the following multiindex DF
import numpy as np
import pandas as pd
from random import randint
input_id = np.array(['12345'])
docType = np.array(['pre','pub','app','dw'])
docId = np.array(['34455667'])
sec_type = np.array(['bib','abs','cl','de'])
sec_ids = np.array(['x-y','z-k'])
index = pd.MultiIndex.from_product([input_id,docType,docId,sec_type,sec_ids])
content= [str(randint(1,10))+ '##' + str(randint(1,10)) for i in range(len(index))]
df = pd.DataFrame(content, index=index, columns=['content'])
df.rename_axis(index=['input_id','docType','docId','secType','sec_ids'], inplace=True)
df
I know that I can query a multiindex DF as follows:
# querying a multiindex DF
idx = pd.IndexSlice
df.loc[idx[:,['pub','pre'],:,'de',:]]
Basically, with the help of pd.IndexSlice I can pass the values I want for each of the index levels. In the above case I want the resulting DF where the second level is 'pub' OR 'pre' and the fourth one is 'de'.
I am looking for a way to pass a range of values to the query, for example the third level (docId) being between 34567 and 45657. Assume those are integers.
pseudocode: df.loc[idx[:,['pub','pre'],XXXXX,'de',:]]
XXXX = ?
EDIT 1:
The docId index level is of text type, so it is probably necessary to convert it to int first.
Turns out query is very powerful:
df.query('docType in ["pub","pre"] and ("34455667" <= docId <= "3445568") and (secType=="de")')
Output:
content
input_id docType docId secType sec_ids
12345 pre 34455667 de x-y 2##9
z-k 6##1
pub 34455667 de x-y 6##5
z-k 9##8
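The query above compares docId as strings, so the bounds are ordered lexicographically rather than numerically. A possible sketch of the numeric variant mentioned in the edit converts the docId level to integers with set_levels and then uses integer bounds. The second docId value and the bounds 30000000 and 40000000 are made up for illustration, and the content column is a simple counter instead of random strings.

```python
import pandas as pd

# Small MultiIndex frame modelled on the question; '50000000' is a
# hypothetical extra docId so that the range filter has something to exclude
index = pd.MultiIndex.from_product(
    [['12345'],
     ['pre', 'pub', 'app', 'dw'],
     ['34455667', '50000000'],
     ['bib', 'abs', 'cl', 'de'],
     ['x-y', 'z-k']],
    names=['input_id', 'docType', 'docId', 'secType', 'sec_ids'],
)
df = pd.DataFrame({'content': range(len(index))}, index=index)

# Convert the text docId level to integers so comparisons are numeric
df.index = df.index.set_levels(df.index.levels[2].astype(int), level='docId')

# Named index levels can be referenced directly inside query()
result = df.query(
    'docType in ["pub", "pre"] and 30000000 <= docId <= 40000000 and secType == "de"'
)
print(result)
```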

Problems Sorting Data out of a text-file

I have a csv file imported into a dataframe and have trouble sorting the data.
df looks like this:
Data
0 <WindSpeed>0.69</WindSpeed>
1 <PowerOutput>0</PowerOutput>
2 <ThrustCoEfficient>0</ThrustCoEffici...
3 <RotorSpeed>8.17</RotorSpeed>
4 <ReactivePower>0</ReactivePower>
5 </DataPoint>
6 <DataPoint>
7 <WindSpeed>0.87</WindSpeed>
8 <PowerOutput>0</PowerOutput
I want it to look like this:
0 Windspeed Poweroutput
1 0.69 0.0
Here's the code that I wrote so far:
import pandas as pd
from pandas.compat import StringIO
import re
import numpy as np
df= pd.read_csv('powercurve.csv', encoding='utf-8',skiprows=42)
df.columns=['Data']
no_of_rows=df.Data.str.count("WindSpeed").sum()/2
rows=no_of_rows.astype(np.uint32)
TRBX=pd.DataFrame(index=range(0,abs(rows)),columns=['WSpd[m/s]','Power[kW]'],dtype='float')
i=0
for i in range(len(df)):
    if 'WindSpeed' in df['Data']:
        TRBX['WSpd[m/s]', i]= re.findall ("'(\d+)'",'Data')
    elif 'Rotorspeed' in df['Data']:
        TRBX['WSpd[m/s]', i]= re.findall ("'(\d+)'",'Data')
Is this a suitable approach? If yes, so far there are no values written into the TRBX dataframe. Where is my mistake?
The code below should help you if your df is indeed in the same format as the one you show:
import re
# Splitting on '<' and '>' leaves the tag name at index 1 and the value at index 2
split_func = lambda x: re.split('<|>', str(x))
split_series = df.Data.apply(split_func)
data = split_series.apply(lambda x: x[2]).rename('data')
features = split_series.apply(lambda x: x[1]).rename('features')
df = pd.DataFrame(data).set_index(features).T
You may want to drop some columns that have no data or input some N/A values afterwards. You also may want to rename the variables and series to different names that make more sense to you.
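To get one row per DataPoint, as in the desired output, one possible sketch extracts the tag name and value with a regular expression, numbers the occurrences of each tag, and pivots on that number. It assumes every tag appears at most once inside each DataPoint block and in the same order; the sample lines, including the second RotorSpeed value, are illustrative rather than taken from the real file.

```python
import pandas as pd

# Illustrative lines in the same shape as the question's 'Data' column
lines = [
    '<DataPoint>',
    '<WindSpeed>0.69</WindSpeed>',
    '<PowerOutput>0</PowerOutput>',
    '<RotorSpeed>8.17</RotorSpeed>',
    '</DataPoint>',
    '<DataPoint>',
    '<WindSpeed>0.87</WindSpeed>',
    '<PowerOutput>0</PowerOutput>',
    '<RotorSpeed>8.20</RotorSpeed>',
    '</DataPoint>',
]
df = pd.DataFrame({'Data': lines})

# Capture the tag name and the text between the opening and closing tag;
# lines without a value (such as <DataPoint>) yield NaN and are dropped
parsed = df['Data'].str.extract(r'<(?P<tag>\w+)>(?P<value>[^<]+)</').dropna()
parsed['value'] = parsed['value'].astype(float)

# The n-th occurrence of a tag belongs to the n-th DataPoint
parsed['record'] = parsed.groupby('tag').cumcount()

# One row per DataPoint, one column per tag
table = parsed.pivot(index='record', columns='tag', values='value')
print(table[['WindSpeed', 'PowerOutput']])
```

If the source file is well-formed XML, a proper XML parser would be more robust than a regular expression; this sketch only covers the line-per-tag layout shown in the question.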

Warning - value is trying to be set on a copy of a slice

I get the warning below when I run this code. I tried every solution I could think of, but cannot get rid of it. Kindly help!
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
import math
task2_df['price_square'] = None
i = 0
for row in data.iterrows():
    task2_df['price_square'].at[i] = math.pow(task2_df['price'][i],2)
    i += 1
For starters, I don't see your error on Pandas v0.19.2 (tested with code at the bottom of this answer). But that's probably irrelevant to solving your issue. You should avoid iterating rows in Python-level loops. NumPy arrays which are used by Pandas are specifically designed for numerical computations:
df = pd.DataFrame({'price': [54.74, 12.34, 35.45, 51.31]})
df['price_square'] = df['price'].pow(2)
print(df)
price price_square
0 54.74 2996.4676
1 12.34 152.2756
2 35.45 1256.7025
3 51.31 2632.7161
Test on Pandas v0.19.2 with no warnings / errors:
import math
df = pd.DataFrame({'price': [54.74, 12.34, 35.45, 51.31]})
df['price_square'] = None
i = 0
for row in df.iterrows():
    df['price_square'].at[i] = math.pow(df['price'][i],2)
    i += 1
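The question does not show how task2_df was created. A common cause of this warning is that the frame is a filtered slice of another frame (here assumed to be data), and that chained indexing such as task2_df['price_square'].at[i] then writes to a temporary object. Under that assumption, a minimal sketch that avoids the warning takes an explicit copy and assigns the whole column at once. The qty column and the price filter are made up for illustration.

```python
import pandas as pd

# Hypothetical source frame; only 'price' comes from the question
data = pd.DataFrame({
    'price': [54.74, 12.34, 35.45, 51.31],
    'qty': [1, 2, 3, 4],
})

# Assumed origin of task2_df: a filtered slice of data.
# .copy() makes it an independent frame, so later writes are unambiguous.
task2_df = data[data['price'] > 20].copy()

# Vectorised assignment: no loop and no chained indexing
task2_df['price_square'] = task2_df['price'].pow(2)
print(task2_df)
```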

Python Pandas - convert unicode data into dataframe so I can append

I am pulling data using pytreasurydirect. I would like to query each unique cusip, append the results, and build a single pandas dataframe. I am having difficulty generating the pandas dataframe, and I believe it is because of the unicode structure of the data.
import pandas as pd
from pytreasurydirect import TreasuryDirect
td = TreasuryDirect()
cusip_list = [['912796PY9','08/09/2018'],['912796PY9','06/07/2018']]
for i in cusip_list:
    cusip =''.join(i[0])
    issuedate =''.join(i[1])
    cusip_value=(td.security_info(cusip, issuedate))
    #pd.DataFrame(cusip_value.items())
    df = pd.DataFrame(cusip_value, index=['a'])
    td = td.append(df, ignore_index=False)
Example of data from pytreasurydirect :
Index([u'accruedInterestPer100', u'accruedInterestPer1000',
u'adjustedAccruedInterestPer1000', u'adjustedPrice',
u'allocationPercentage', u'allocationPercentageDecimals',
u'announcedCusip', u'announcementDate', u'auctionDate',
u'auctionDateYear',
...
u'totalTendered', u'treasuryDirectAccepted',
u'treasuryDirectTendersAccepted', u'type',
u'unadjustedAccruedInterestPer1000', u'unadjustedPrice',
u'updatedTimestamp', u'xmlFilenameAnnouncement',
u'xmlFilenameCompetitiveResults', u'xmlFilenameSpecialAnnouncement'],
dtype='object', length=116)
I think you want to define a function like this:
def securities(type):
    secs = td.security_type(type)
    keys = secs[0].keys() if secs else []
    seri = [pd.Series([sec[key] for sec in secs]) for key in keys]
    return pd.DataFrame(dict(zip(keys, seri)))
Then, use it:
df = securities('Bond')
df[['cusip', 'issueDate', 'maturityDate']].head()
to get results like these, for example (TreasuryDirect returns a lot of additional columns):
cusip issueDate maturityDate
0 912810SD1 2018-08-15T00:00:00 2048-08-15T00:00:00
1 912810SC3 2018-07-16T00:00:00 2048-05-15T00:00:00
2 912810SC3 2018-06-15T00:00:00 2048-05-15T00:00:00
3 912810SC3 2018-05-15T00:00:00 2048-05-15T00:00:00
4 912810SA7 2018-04-16T00:00:00 2048-02-15T00:00:00
At least, those are the results today. They will change over time as bonds are issued and, alas, mature. Note the multiple issueDates per cusip.
Finally, per the TreasuryDirect website (https://www.treasurydirect.gov/webapis/webapisecurities.htm), the possible security types are: Bill, Note, Bond, CMB, TIPS, FRN.
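For the per-cusip loop in the question, one possible pattern is to collect each returned dictionary in a list and build the DataFrame once at the end, which also avoids overwriting td inside the loop. The sketch below does not call the real service: fetch_security is a hypothetical stand-in for td.security_info(cusip, issuedate), and the three keys it returns are a small illustrative subset of the 116 fields.

```python
import pandas as pd

def fetch_security(cusip, issue_date):
    """Hypothetical stand-in for td.security_info(cusip, issue_date).

    Returns a flat dict, which is the shape this sketch assumes the
    real call produces.
    """
    return {'cusip': cusip, 'issueDate': issue_date, 'type': 'Bill'}

cusip_list = [['912796PY9', '08/09/2018'], ['912796PY9', '06/07/2018']]

# Collect one dict per security, then build the frame in a single step
records = [fetch_security(cusip, issue_date) for cusip, issue_date in cusip_list]
df = pd.DataFrame(records)
print(df)
```

A list of dictionaries maps directly onto rows, so no per-iteration append is needed.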
