I need some help. I am working in a .ipynb file to filter data and extract certain things from a DataFrame.
This is the DataFrame I'm working with.
As you can see, the DataFrame contains multiple rows for the same SYMBOL.
I need help writing a "for" loop that finds the highest CHG_IN_OI for every symbol and returns the full row containing that highest CHG_IN_OI.
For example, if there are 14 rows with ACC as the symbol, I need to find the highest CHG_IN_OI for ACC in the CHG_IN_OI column and get that entire row, retaining the remaining columns as well.
I have made a list named multisymbols which contains these symbols:
multisymbols = [
'ACC',
'ADANIENT',
'ADANIPORTS',
'AMARAJABAT',
'AMBUJACEM',
'APOLLOHOSP',
'APOLLOTYRE',
'ASHOKLEY',
'ASIANPAINT',
'AUROPHARMA',
'AXISBANK',
'BAJAJ-AUTO',
'BAJAJFINSV',
'BAJFINANCE',
'BALKRISIND',
'BANDHANBNK',
'BANKBARODA',
'BATAINDIA',
'BEL',
'BERGEPAINT',
'BHARATFORG',
'BHARTIARTL',
'BHEL',
'BIOCON',
'BOSCHLTD',
'BPCL',
'BRITANNIA',
'CADILAHC',
'CANBK',
'CENTURYTEX',
'CHOLAFIN',
'CIPLA',
'COALINDIA',
'COLPAL',
'CONCOR',
'CUMMINSIND',
'DABUR',
'DIVISLAB',
'DLF',
'DRREDDY',
'EICHERMOT',
'EQUITAS',
'ESCORTS',
'EXIDEIND',
'FEDERALBNK',
'GAIL',
'GLENMARK',
'GMRINFRA',
'GODREJCP',
'GODREJPROP',
'GRASIM',
'HAVELLS',
'HCLTECH',
'HDFC',
'HDFCBANK',
'HDFCLIFE',
'HEROMOTOCO',
'HINDALCO',
'HINDPETRO',
'HINDUNILVR',
'IBULHSGFIN',
'ICICIBANK',
'ICICIPRULI',
'IDEA',
'IDFCFIRSTB',
'IGL',
'INDIGO',
'INDUSINDBK',
'INFRATEL',
'INFY',
'IOC',
'ITC',
'JINDALSTEL',
'JSWSTEEL',
'JUBLFOOD',
'KOTAKBANK',
'L&TFH',
'LICHSGFIN',
'LT',
'LUPIN',
'M&M',
'M&MFIN',
'MANAPPURAM',
'MARICO',
'MARUTI',
'MCDOWELL-N',
'MFSL',
'MGL',
'MINDTREE',
'MOTHERSUMI',
'MRF',
'MUTHOOTFIN',
'NATIONALUM',
'NAUKRI',
'NESTLEIND',
'NIITTECH',
'NMDC',
'NTPC',
'ONGC',
'PAGEIND',
'PEL',
'PETRONET',
'PFC',
'PIDILITIND',
'PNB',
'POWERGRID',
'PVR',
'RAMCOCEM',
'RBLBANK',
'RECLTD',
'RELIANCE',
'SAIL',
'SBILIFE',
'SBIN',
'SHREECEM',
'SEIMENS',
'SRF',
'SRTRANSFIN',
'SUNPHARMA',
'SUNTV',
'TATACHEM',
'TATACONSUM',
'TATAMOTORS',
'TATAPOWER',
'TATASTEEL',
'TCS',
'TECHM',
'TITAN',
'TORNTPHARM',
'TORNTPOWER',
'TVSMOTOR',
'UBL',
'UJJIVAN',
'ULTRACEMCO',
'UPL',
'VEDL',
'VOLTAS',
'WIPRO',
'ZEEL'
]
df = df[df['SYMBOL'].isin(multisymbols)]
df
These are all shares traded on the NSE. I hope you can understand and help me out. I used .groupby(), which successfully gave me the highest CHG_IN_OI, and .agg() to retain the remaining columns, but the resulting data was not correct. I simply want, for every symbol, the row with the HIGHEST CHG_IN_OI.
Thanks in advance!
Although the data differs from what is presented in the question, this answer demonstrates the same operations using equity data as an example.
import pandas as pd
import pandas_datareader.data as web
import datetime

with open('./alpha_vantage_api_key.txt') as f:
    api_key = f.read()

start = datetime.datetime(2019, 1, 1)
end = datetime.datetime(2020, 8, 1)

df_all = pd.DataFrame()
symbol = ['AAPL', 'TSLA']
for i in symbol:
    df = web.DataReader(i, 'av-daily', start, end, api_key=api_key)
    df['symbol'] = i
    df.index = pd.to_datetime(df.index)  # convert each frame's index before concatenating
    df_all = pd.concat([df_all, df], axis=0)
Aggregating a single column
df_all.groupby('symbol')['volume'].agg('max').reset_index()
symbol volume
0 AAPL 106721200
1 TSLA 60938758
Aggregating multiple columns
df_all.groupby('symbol')[['high','volume']].agg(high=('high','max'), volume=('volume','max'))
high volume
symbol
AAPL 425.66 106721200
TSLA 1794.99 60938758
Extracting the target row
symbol_max = df_all.groupby('symbol').apply(lambda x: x.loc[x['volume'].idxmax()]).reset_index(drop=True)
symbol_max
open high low close volume symbol
0 257.26 278.4100 256.37 273.36 106721200 AAPL
1 882.96 968.9899 833.88 887.06 60938758 TSLA
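Applied to the DataFrame from the question, the same idxmax pattern would be (a sketch, assuming the columns are named SYMBOL and CHG_IN_OI as shown and the index is unique):

idx = df.groupby('SYMBOL')['CHG_IN_OI'].idxmax()  # index label of the max row per symbol
highest_oi = df.loc[idx].reset_index(drop=True)   # full rows, all other columns retained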
I want to use a loop to filter on a list of different keywords (i.e., different reference numbers) in a specific column (CaseText), filtering on one keyword at a time so that I do not have to change the keyword manually each time. I want to see the result as a DataFrame.
Unfortunately, my code doesn't work: it just returns the whole dataset.
Can anyone help me find out what's wrong with my code? Additionally, it would be great if the resulting table could be broken into separate results for each keyword.
Many thanks.
import pandas as pd
pd.set_option('display.max_colwidth', 0)
list_of_files = ['https://open.barnet.gov.uk/download/2nq32/c1d/Open%20Data%20Planning%20Q1%2019-20%20NG.csv',
'https://open.barnet.gov.uk/download/2nq32/9wj/Open%20Data%20Planning%202018-19%20-%20NG.csv',
'https://open.barnet.gov.uk/download/2nq32/my7/Planning%20Decisions%202017-2018%20non%20geo.csv',
'https://open.barnet.gov.uk/download/2nq32/303/Planning%20Decisions%202016-2017%20non%20geo.csv',
'https://open.barnet.gov.uk/download/2nq32/zf1/Planning%20Decisions%202015-2016%20non%20geo.csv',
'https://open.barnet.gov.uk/download/2nq32/9b3/Open%20Data%20Planning%202014-2015%20-%20NG.csv',
'https://open.barnet.gov.uk/download/2nq32/6zz/Open%20Data%20Planning%202013-2014%20-%20NG.csv',
'https://open.barnet.gov.uk/download/2nq32/r7m/Open%20Data%20Planning%202012-2013%20-%20NG.csv',
'https://open.barnet.gov.uk/download/2nq32/fzw/Open%20Data%20Planning%202011-2012%20-%20NG.csv',
'https://open.barnet.gov.uk/download/2nq32/x3w/Open%20Data%20Planning%202010-2011%20-%20NG.csv',
'https://open.barnet.gov.uk/download/2nq32/tbc/Open%20Data%20Planning%202009-2010%20-%20NG.csv']
data_container = []
for filename in list_of_files:
    print(filename)
    df = pd.read_csv(filename, encoding='mac_roman')
    data_container.append(df)

all_data = pd.concat(data_container)

reference_list = ['H/04522/11','15/07697/FUL'] # I want to filter the dataset with a single keyword each time, because I have nearly 70 keywords to filter.

select_data = pd.DataFrame()
for keywords in reference_list:
    select_data = select_data.append(all_data[all_data['CaseText'].str.contains("reference_list", na=False)])

select_data = select_data[['CaseReference', 'CaseDate', 'ServiceTypeLabel', 'CaseText',
                           'DecisionDate', 'Decision', 'AppealRef']]
select_data.drop_duplicates(keep='first', inplace=True)
select_data
One of the problems is that the items in reference_list do not match any of the values in the 'CaseReference' column. Once you figure out which CaseReference numbers you want to search for, the code below should work for you. Just put the correct CaseReference numbers in the reference_list list.
import pandas as pd

url = ('https://open.barnet.gov.uk/download/2nq32/fzw/'
       'Open%20Data%20Planning%202011-2012%20-%20NG.csv')
data = pd.read_csv(url, encoding='mac_roman')

reference_list = ['hH/02159/13','16/4324/FUL']

select_data = pd.DataFrame()
for keywords in reference_list:
    select_data = select_data.append(data[data['CaseReference'] == keywords],
                                     ignore_index=True)

select_data = select_data[['CaseDate', 'ServiceTypeLabel', 'CaseText',
                           'DecisionDate', 'Decision', 'AppealRef']]
select_data.drop_duplicates(keep='first', inplace=True)
select_data
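Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so on recent versions the loop above will fail. A sketch of the same filtering with pd.concat instead, under that assumption:

# collect one filtered frame per reference number, then concatenate once
frames = [data[data['CaseReference'] == keyword] for keyword in reference_list]
select_data = pd.concat(frames, ignore_index=True)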
This should work
import pandas as pd
pd.set_option('display.max_colwidth', 0)
list_of_files = ['https://open.barnet.gov.uk/download/2nq32/c1d/Open%20Data%20Planning%20Q1%2019-20%20NG.csv',
'https://open.barnet.gov.uk/download/2nq32/9wj/Open%20Data%20Planning%202018-19%20-%20NG.csv',
'https://open.barnet.gov.uk/download/2nq32/my7/Planning%20Decisions%202017-2018%20non%20geo.csv',
'https://open.barnet.gov.uk/download/2nq32/303/Planning%20Decisions%202016-2017%20non%20geo.csv',
'https://open.barnet.gov.uk/download/2nq32/zf1/Planning%20Decisions%202015-2016%20non%20geo.csv',
'https://open.barnet.gov.uk/download/2nq32/9b3/Open%20Data%20Planning%202014-2015%20-%20NG.csv',
'https://open.barnet.gov.uk/download/2nq32/6zz/Open%20Data%20Planning%202013-2014%20-%20NG.csv',
'https://open.barnet.gov.uk/download/2nq32/r7m/Open%20Data%20Planning%202012-2013%20-%20NG.csv',
'https://open.barnet.gov.uk/download/2nq32/fzw/Open%20Data%20Planning%202011-2012%20-%20NG.csv',
'https://open.barnet.gov.uk/download/2nq32/x3w/Open%20Data%20Planning%202010-2011%20-%20NG.csv',
'https://open.barnet.gov.uk/download/2nq32/tbc/Open%20Data%20Planning%202009-2010%20-%20NG.csv']
# this takes some time
df = pd.concat([pd.read_csv(el, engine='python') for el in list_of_files])  # read all csvs

reference_list = ['H/04522/11','15/07697/FUL']

# Create an empty dictionary. Each key will be a keyword and each value a DataFrame
# filtered to rows whose 'CaseText' contains that keyword.
reference_dict = dict()
for el in reference_list:
    reference_dict[el] = df[(df['CaseText'].str.contains(el)) & ~(df['CaseText'].isna())]
    # notice the two conditions:
    # 1) the column CaseText should contain the keyword: (df['CaseText'].str.contains(el))
    # 2) some elements of CaseText are NaN and need to be excluded,
    #    which is what ~(df['CaseText'].isna()) does

# you can see the resulting dataframes like so: reference_dict[keyword]. For example:
reference_dict['H/04522/11']
UPDATE
If you want one DataFrame that includes the cases where any of the keywords appears in the CaseText column, try this:
# let's start after having read in the data
# Separate your keywords with | in one string.
keywords = 'H/04522/11|15/07697/FUL'  # read up on regular expressions to understand this
final_df = df[(df['CaseText'].str.contains(keywords)) & ~(df['CaseText'].isna())]
final_df
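One caveat: str.contains treats the pattern as a regular expression, so keywords containing metacharacters (., +, parentheses, etc.) can match the wrong rows. If that applies to your reference numbers, a safer sketch is to escape each keyword before joining; na=False also replaces the explicit NaN check:

import re

keywords = '|'.join(re.escape(k) for k in reference_list)  # escape regex metacharacters
final_df = df[df['CaseText'].str.contains(keywords, na=False)]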
I am using Python 3 and pandas to create a script that will:
Read unstructured xlsx data of varying column lengths
Total the "This", "Last" and "Diff" columns
Add "Total" under the Brand column
Dynamically bold the entire row that contains "Total"
On the last point, the challenge I have been struggling with is that the row index changes depending on the data being fed into the script. The code provided does not solve this issue. I have tried every variation I can think of using style.applymap(bold), with and without variables.
Example of input: (screenshot)
Example of desired outcome: (screenshot)
Script:
import pandas as pd
import io
import sys
import warnings

def bold(val):
    return 'font-weight: bold'

excel_file = 'testfile1.xlsx'
df = pd.read_excel(excel_file)

product = df.loc[df['Brand'] == "widgit"]
product = product.append({'Brand': 'Total',
                          'This': product['This'].sum(),
                          'Last': product['Last'].sum(),
                          'Diff': product['Diff'].sum(),
                          '% Chg': product['This'].sum() / product['Last'].sum()},
                         ignore_index=True)
product = product.append({'Brand': ' '}, ignore_index=True)
product.fillna(' ', inplace=True)
Try something like this:

import numpy as np
import pandas as pd

def highlight_max(x):
    return ['font-weight: bold' if v == x.loc[4] else ''
            for v in x]

df = pd.DataFrame(np.random.randn(5, 2))
df.style.apply(highlight_max)

Output: (screenshot)
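Since the difficulty in the question is that the Total row's index changes with the input, a more direct variant is to style row-wise and test the row's label rather than a fixed position. This is a sketch assuming the product frame from the question, with its 'Brand' column and appended 'Total' row:

def bold_total(row):
    # bold every cell of any row whose Brand is 'Total'
    is_total = row['Brand'] == 'Total'
    return ['font-weight: bold' if is_total else '' for _ in row]

styled = product.style.apply(bold_total, axis=1)
styled.to_excel('styled_output.xlsx', index=False)  # writing styles to xlsx requires openpyxl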
I have a list of DataFrames that come from the Census API; I stored each year's pull in a list.
So at the end of my for loop I have a list with one DataFrame per year, and a list of years built alongside it.
The problem I am having is merging all the DataFrames in the list while also tagging them with the corresponding years.
I have tried using the reduce function, but it appears to take only 2 of the 6 DataFrames I have.
concat just appends them into one DataFrame without tagging or changing anything.
# Dependencies
import pandas as pd
import requests
import json
import pprint
from census import Census
from us import states

# Census
from config import (api_key, gkey)

year = 2012
c = Census(api_key, year)
yearlst = []
datalst = []

for length in range(6):
    c = Census(api_key, year)
    data = c.acs5.get(('NAME', "B25077_001E", "B25064_001E",
                       "B15003_022E", "B19013_001E"),
                      {'for': 'zip code tabulation area:*'})
    data_df = pd.DataFrame(data)
    data_df = data_df.rename(columns={"NAME": "Name",
                                      "zip code tabulation area": "Zipcode",
                                      "B25077_001E": "Median Home Value",
                                      "B25064_001E": "Median Rent",
                                      "B15003_022E": "Bachelor Degrees",
                                      "B19013_001E": "Median Income"})
    data_df = data_df.astype({'Zipcode': 'int64'})
    filtervalue = data_df['Median Home Value'] > 0
    filtervalue2 = data_df['Median Rent'] > 0
    filtervalue3 = data_df['Median Income'] > 0
    cleandata = data_df[filtervalue][filtervalue2][filtervalue3]
    cleandata = cleandata.dropna()
    yearlst.append(year)
    datalst.append(cleandata)
    year += 1
This generates the two separate lists, one with the years and the other with the DataFrames.
My output came out as either one DataFrame with missing entries, or everything concatenated without anything being tagged or changed.
What I'm looking for is how to merge everything in the list, with datalst[0] tagged with yearlst[0] when merging, if at all possible.
There is no need for a year list; simply assign a year column to each data frame. Also, avoid incrementing year manually and make it the loop iterator instead. In fact, consider chaining your whole process:
datalst = []

for year in range(2012, 2019):
    c = Census(api_key, year)
    data = c.acs5.get(('NAME', "B25077_001E", "B25064_001E", "B15003_022E", "B19013_001E"),
                      {'for': 'zip code tabulation area:*'})
    cleandata = (pd.DataFrame(data)
                   .rename(columns={"NAME": "Name",
                                    "zip code tabulation area": "Zipcode",
                                    "B25077_001E": "Median_Home_Value",
                                    "B25064_001E": "Median_Rent",
                                    "B15003_022E": "Bachelor_Degrees",
                                    "B19013_001E": "Median_Income"})
                   .astype({'Zipcode': 'int64'})
                   .query('(Median_Home_Value > 0) & (Median_Rent > 0) & (Median_Income > 0)')
                   .dropna()
                   .assign(year_column=year)
                 )
    datalst.append(cleandata)

final_data = pd.concat(datalst, ignore_index=True)
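If you do end up with the two parallel lists from the original loop, pd.concat can still tag each frame at merge time through its keys argument (a sketch assuming datalst and yearlst as built in the question):

final_data = (pd.concat(datalst, keys=yearlst, names=['year', None])
                .reset_index(level='year'))  # lift the year tag out of the index into a column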
I am trying to manipulate a CSV file based on a certain date in a certain column.
I am using pandas (total noob) for that and was pretty successful until I got to dates.
The CSV looks something like this (with more columns and rows, of course).
These are the columns, with a sample row of values:

Circuit    Status         Effective Date
XXXX001    Operational    31-DEC-2007
I tried DataFrame.query (which I use for everything else) without success.
I tried DataFrame.loc (which worked for everything else) without success.
How can I get all rows that are older or newer than a given date? And if I have other conditions to filter the DataFrame, how do I combine them with the date filter?
Here's my "raw" code:
import pandas as pd
# parse_dates = ['Effective Date']
# dtypes = {'Effective Date': 'str'}
df = pd.read_csv("example.csv", dtype=object)
# , parse_dates=parse_dates, infer_datetime_format=True
# tried lot of suggestions found on SO
cols = df.columns
cols = cols.map(lambda x: x.replace(' ', '_'))
df.columns = cols
status1 = 'Suppressed'
status2 = 'Order Aborted'
pool = '2'
region = 'EU'
date1 = '31-DEC-2017'
filt_df = df.query('Status != @status1 and Status != @status2 and Pool == @pool and Region_A == @region')
filt_df.reset_index(drop=True, inplace=True)
filt_df.to_csv('filtered.csv')
# this is working pretty well
supp_df = df.query('Status == @status1 and Effective_Date < @date1')
supp_df.reset_index(drop=True, inplace=True)
supp_df.to_csv('supp.csv')
# this is what is not working at all
I tried many approaches but was not able to put it together. This is just one of the many approaches I tried, so I know it is perhaps completely wrong, as no date parsing is used.
supp.csv is saved, but the dates in it are all over the place, so there is no match with the "logic" in this code.
Thanks for any help!
Make sure you convert your date column to datetime and then filter/slice on it.

df['Effective Date'] = pd.to_datetime(df['Effective Date'])
df[df['Effective Date'] < '2017-12-31']
# This returns all the rows with dates before the 31st of December, 2017.
# You can also use query.
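For example, a sketch that combines the datetime conversion with the question's query-style filtering (assuming the underscored column names produced by the question's code):

import pandas as pd

df = pd.read_csv('example.csv')
df.columns = df.columns.map(lambda x: x.replace(' ', '_'))

# parse strings like '31-DEC-2007'; %b matches abbreviated month names case-insensitively
df['Effective_Date'] = pd.to_datetime(df['Effective_Date'], format='%d-%b-%Y')

status1 = 'Suppressed'
date1 = pd.Timestamp('2017-12-31')
supp_df = df.query('Status == @status1 and Effective_Date < @date1')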