extract only certain rows in a dataframe

extract only certain rows in a dataframe - python

I have a dataframe like this:
Code Date Open High Low Close Volume VWAP TWAP
0 US_GWA_BTC 2014-04-01 467.28 488.62 467.28 479.56 74,776.48 482.76 482.82
1 GWA_BTC 2014-04-02 479.20 494.30 431.32 437.08 114,052.96 460.19 465.93
2 GWA_BTC 2014-04-03 437.33 449.74 414.41 445.60 91,415.08 432.29 433.28
.
316 MWA_XRP_US 2018-01-19 1.57 1.69 1.48 1.53 242,563,870.44 1.59 1.59
317 MWA_XRP_US 2018-01-20 1.54 1.62 1.49 1.57 140,459,727.30 1.56 1.56
I want to filter out rows where code which has GWA infront of it.
I tried this code but it's not working.
df.set_index("Code").filter(regex='[GWA_]*', axis=0)

Try using startswith:
df[df.Code.str.startswith('GWA')]

Related

Getting Empty DataFrame in pandas from table data

I'm getting data from using print command but in Pandas DataFrame throwing result as : Empty DataFrame,Columns: [],Index: [`]
Script:
from bs4 import BeautifulSoup
import requests
import re
import json
import pandas as pd
url='http://financials.morningstar.com/finan/financials/getFinancePart.html?&callback=jsonp1640132253903&t=XNAS:AAPL'
req=requests.get(url).text
#print(req)
data=re.search(r'jsonp1640132253903\((\{.*\})\)',req).group(1)
json_data=json.loads(data)['componentData']
#print(json_data)
# with open('index.html','w') as f:
# f.write(json_data)
soup=BeautifulSoup(json_data,'lxml')
for tr in soup.select('tr'):
row_data=[td.get_text(strip=True) for td in tr.select('td,th') if td.text]
if not row_data:
continue
if len(row_data) < 12:
row_data = ['Particulars'] + row_data
#print(row_data)
df=pd.DataFrame(row_data)
print(df)
Print result:
['Particulars', '2012-09', '2013-09', '2014-09', '2015-09', '2016-09', '2017-09', '2018-09', '2019-09', '2020-09', '2021-09', 'TTM']
['RevenueUSD Mil', '156,508', '170,910', '182,795', '233,715', '215,639', '229,234', '265,595', '260,174', '274,515', '365,817', '365,817']
['Gross Margin %', '43.9', '37.6', '38.6', '40.1', '39.1', '38.5', '38.3', '37.8', '38.2', '41.8', '41.8']
['Operating IncomeUSD Mil', '55,241', '48,999', '52,503', '71,230', '60,024', '61,344', '70,898', '63,930', '66,288', '108,949', '108,949']
['Operating Margin %', '35.3', '28.7', '28.7', '30.5', '27.8', '26.8', '26.7', '24.6', '24.1', '29.8', '29.8']
['Net IncomeUSD Mil', '41,733', '37,037', '39,510', '53,394', '45,687', '48,351', '59,531', '55,256', '57,411',
'94,680', '94,680']
['Earnings Per ShareUSD', '1.58', '1.42', '1.61', '2.31', '2.08', '2.30', '2.98', '2.97', '3.28', '5.61', '5.61'
Expected output:
2012-09 2013-09 2014-09 2015-09 2016-09 2017-09 2018-09 2019-09 2020-09 2021-09 TTM
Revenue USD Mil 156,508 170,910 182,795 233,715 215,639 229,234 265,595 260,174 274,515 365,817 365,817
Gross Margin % 43.9 37.6 38.6 40.1 39.1 38.5 38.3 37.8 38.2 41.8 41.8
Operating Income USD Mil 55,241 48,999 52,503 71,230 60,024 61,344 70,898 63,930 66,288 108,949 108,949
Operating Margin % 35.3 28.7 28.7 30.5 27.8 26.8 26.7 24.6 24.1 29.8 29.8
Net Income USD Mil 41,733 37,037 39,510 53,394 45,687 48,351 59,531 55,256 57,411 94,680 94,680
Earnings Per Share USD 1.58 1.42 1.61 2.31 2.08 2.30 2.98 2.97 3.28 5.61 5.61
Dividends USD 0.09 0.41 0.45 0.49 0.55 0.60 0.68 0.75 0.80 0.85 0.85
Payout Ratio % * — 27.4 28.5 22.3 24.8 26.5 23.7 25.1 23.7 16.3 15.2
Shares Mil 26,470 26,087 24,491 23,172 22,001 21,007 20,000 18,596 17,528 16,865 16,865
Book Value Per Share * USD 4.25 4.90 5.15 5.63 5.93 6.46 6.04 5.43 4.26 3.91 3.85
Operating Cash Flow USD Mil 50,856 53,666 59,713 81,266 65,824 63,598 77,434 69,391 80,674 104,038 104,038
Cap Spending USD Mil -9,402 -9,076 -9,813 -11,488 -13,548 -12,795 -13,313 -10,495 -7,309 -11,085 -11,085
Free Cash Flow USD Mil 41,454 44,590 49,900 69,778 52,276 50,803 64,121 58,896 73,365 92,953 92,953
Free Cash Flow Per Share * USD 1.58 1.61 1.93 2.96 2.24 2.41 2.88 3.07 4.04 5.57 —
Working Capital USD Mil 19,111 29,628 5,083 8,768 27,863 27,831 14,473 57,101 38,321 9,355
Expected columns:
'Particulars', '2012-09', '2013-09', '2014-09', '2015-09', '2016-09', '2017-09', '2018-09', '2019-09', '2020-09', '2021-09', 'TTM'

#QHarr's answer is by far the most straightforward, but in case you are wondering what is wrong with your code, it's that you are resetting the variable row_data for every iteration of the loop.
To make your code work, you can instead store each row as an element in a list. Then to build a DataFrame, you can pass this list of rows and the column names to pd.DataFrame:
data = []
soup=BeautifulSoup(json_data,'lxml')
for tr in soup.select('tr'):
row_data=[td.get_text(strip=True) for td in tr.select('td,th') if td.text]
if not row_data:
continue
elif len(row_data) < 12:
columns = ['Particulars'] + row_data
else:
data.append(row_data)
df=pd.DataFrame(data, columns=columns)
Result:
>>> df
Particulars 2012-09 2013-09 2014-09 2015-09 2016-09 2017-09 2018-09 2019-09 2020-09 2021-09 TTM
0 RevenueUSD Mil 156,508 170,910 182,795 233,715 215,639 229,234 265,595 260,174 274,515 365,817 365,817
1 Gross Margin % 43.9 37.6 38.6 40.1 39.1 38.5 38.3 37.8 38.2 41.8 41.8
2 Operating IncomeUSD Mil 55,241 48,999 52,503 71,230 60,024 61,344 70,898 63,930 66,288 108,949 108,949
3 Operating Margin % 35.3 28.7 28.7 30.5 27.8 26.8 26.7 24.6 24.1 29.8 29.8
4 Net IncomeUSD Mil 41,733 37,037 39,510 53,394 45,687 48,351 59,531 55,256 57,411 94,680 94,680
5 Earnings Per ShareUSD 1.58 1.42 1.61 2.31 2.08 2.30 2.98 2.97 3.28 5.61 5.61
6 DividendsUSD 0.09 0.41 0.45 0.49 0.55 0.60 0.68 0.75 0.80 0.85 0.85
7 Payout Ratio % * — 27.4 28.5 22.3 24.8 26.5 23.7 25.1 23.7 16.3 15.2
8 SharesMil 26,470 26,087 24,491 23,172 22,001 21,007 20,000 18,596 17,528 16,865 16,865
9 Book Value Per Share *USD 4.25 4.90 5.15 5.63 5.93 6.46 6.04 5.43 4.26 3.91 3.85
10 Operating Cash FlowUSD Mil 50,856 53,666 59,713 81,266 65,824 63,598 77,434 69,391 80,674 104,038 104,038
11 Cap SpendingUSD Mil -9,402 -9,076 -9,813 -11,488 -13,548 -12,795 -13,313 -10,495 -7,309 -11,085 -11,085
12 Free Cash FlowUSD Mil 41,454 44,590 49,900 69,778 52,276 50,803 64,121 58,896 73,365 92,953 92,953
13 Free Cash Flow Per Share *USD 1.58 1.61 1.93 2.96 2.24 2.41 2.88 3.07 4.04 5.57 —
14 Working CapitalUSD Mil 19,111 29,628 5,083 8,768 27,863 27,831 14,473 57,101 38,321 9,355 —

Use read_html for the DataFrame creation and then drop the na rows
json_data=json.loads(data)['componentData']
pd.read_html(json_data)[0].dropna(axis=0, how='all')

Dataframe split columns value, how to solve error message?

I have a panda dataframe with the following columns:
Stock ROC5 ROC20 ROC63 ROCmean
0 IBGL.SW -0.59 3.55 6.57 3.18
0 EHYA.SW 0.98 4.00 6.98 3.99
0 HIGH.SW 0.94 4.22 7.18 4.11
0 IHYG.SW 0.56 2.46 6.16 3.06
0 HYGU.SW 1.12 4.56 7.82 4.50
0 IBCI.SW 0.64 3.57 6.04 3.42
0 IAEX.SW 8.34 18.49 14.95 13.93
0 AGED.SW 9.45 24.74 28.13 20.77
0 ISAG.SW 7.97 21.61 34.34 21.31
0 IAPD.SW 0.51 6.62 19.54 8.89
0 IASP.SW 1.08 2.54 12.18 5.27
0 RBOT.SW 10.35 30.53 39.15 26.68
0 RBOD.SW 11.33 30.50 39.69 27.17
0 BRIC.SW 7.24 11.08 75.60 31.31
0 CNYB.SW 1.14 4.78 8.36 4.76
0 FXC.SW 5.68 13.84 19.29 12.94
0 DJSXE.SW 3.11 9.24 6.44 6.26
0 CSSX5E.SW -0.53 5.29 11.85 5.54
How can I write in the dataframe a new columns "Symbol" with the stock without ".SW".
Example first row result should be IBGL (modified value IBGL.SW).
Example last row result should be CSSX5E (splited value SSX5E.SW).
If I send the following command:
new_df['Symbol'] = new_df.loc[:, ('Stock')].str.split('.').str[0]
Than I receive an error message:
:3: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
new_df['Symbol'] = new_df.loc[:, ('Stock')].str.split('.').str[0]
How can I solve this problem?
Thanks a lot for your support.

METHOD 1:
You can do a vectorized operation by str.get(0) -
df['SYMBOL'] = df['Stock'].str.split('.').str.get(0)
METHOD 2:
You can do another vectorized operation by using expand=True in str.split() and then getting the first column.
df['SYMBOL'] = df['Stock'].str.split('.', expand = True)[0]
METHOD 3:
Or you can write a custom lambda function with apply (for more complex processes). Note, this is slower but good if you have your own UDF.
df['SYMBOL'] = df['Stock'].apply(lambda x:x.split('.')[0])

This is not an error, but a warning as you may have probably noticed your script finishes its execution.
edite: Given your comments it seems your issues generate previously in the code, therefore I suggest you use the following:
new_df = new_df.copy(deep=False)
And then proceed to solve it with:
new_df.loc['Symbol'] = new_df['Stock'].str.split('.').str[0]

new_df = new_df.copy()
new_df['Symbol'] = new_df.Stock.str.replace('.SW','')

Loop using to_csv only printing last file (Python)

I am trying to load a number of csv files from a folder, use a function that calculates missing values on each file, then saves new csv files containing the output. When I edit the script to print the output, I get the expected result. However, the loop only ever saves the last file to the directory. The code I am using is:
from pathlib import Path
import pandas as pd
import os
import glob
files = glob("C:/Users/61437/Desktop/test_folder/*.csv") # get all csv's from folder
n = 0
for file in files:
print(file)
df = pd.read_csv(file, index_col = False)
d = calc_missing_prices(df) # calc_missing_prices is a user defined function
print(d)
d.to_csv(r'C:\Users\61437\Desktop\test_folder\derived_files\derived_{}.csv'.format(n+1), index = False)
The print() command returns the expected output, which for my data is:
C:/Users/61437/Desktop/test_folder\file1.csv
V_150 V_200 V_300 V_375 V_500 V_750 V_1000
0 3.00 2.75 4.50 6.03 8.35 12.07 15.00
1 2.32 3.09 4.63 5.00 9.75 12.50 12.25
2 1.85 2.47 3.70 4.62 6.17 9.25 12.33
3 1.75 2.00 4.06 6.50 6.78 10.16 15.20
C:/Users/61437/Desktop/test_folder\file2.csv
V_300 V_375 V_500 V_750 V_1000
0 4.00 4.50 6.06 9.08 11.00
1 3.77 5.00 6.50 8.50 12.56
2 3.00 3.66 4.88 7.31 9.50
C:/Users/61437/Desktop/test_folder\file3.csv
V_500 V_750 V_1000
0 5.50 8.25 11.00
1 6.50 8.50 12.17
2 4.75 7.12 9.50
However the only saved csv file is 'derived_1.csv' which contains the output from file3.csv
What am I doing that is preventing all three files from being created?

You are not incrementing n inside the loop. Your data gets stored in the file derived_1.csv, which is overwritten on every iteration. Once the for loop finishes executing, only the last csv will be saved.
Include the line n += 1 inside the for loop to increment it by 1 on every iteration.

dataFrame duplication extraction row

The code below gives exactly the following Jupyter output:
date open high low close volume
0 29/04/1992 2.21 2.21 1.98 1.99 0
1 29/04/1992 2.21 2.21 1.98 1.98 0
2 30/04/1992 2.02 2.32 1.95 1.98 0
size: 6686
no duplicates? False
date open high low close volume
0 29/04/1992 2.21 2.21 1.98 1.99 0
1 29/04/1992 2.21 2.21 1.98 1.98 0
2 30/04/1992 2.02 2.32 1.95 1.98 0
no duplicates? False
size: 6686
What should I change in the duplication-extraction line?
Thanks!
fskilnik
checking = pd.DataFrame(df)
print(checking.head(3))
size2 = len(checking.index)
print('size:',size2)
print('no duplicates?', checking.date.is_unique)
checking.drop_duplicates(['date'], keep='last')
print(checking.head(3))
print('no duplicates?', checking.date.is_unique)
size2 = len(checking.index)
print('size:',size2)

You should add inplace=True to the drop_duplicates method or reassign the dataframe like:
checking.drop_duplicates(['date'], keep='last', inplace=True)
Or:
checking = checking.drop_duplicates(['date'], keep='last')

Cannot convert input to Timestamp, bday_range(...) - Pandas/Python

Looking to generate a number for the days in business days between current date and the end of the month of a pandas dataframe.
E.g. 26/06/2017 - 4, 23/06/2017 - 5
I'm having trouble as I keep getting a Type Error:
TypeError: Cannot convert input to Timestamp
From line:
result['bdaterange'] = pd.bdate_range(pd.to_datetime(result['dte'], unit='ns').values, pd.to_datetime(result['bdate'], unit='ns').values)
I have a Data Frame result with the column dte in a date format and I'm trying to create a new column (bdaterange) as a simple integer/float that I can use to see how far from month end in business days it has.
Sample data:
bid ask spread dte day bdate
01:49:00 2.17 3.83 1.66 2016-12-20 20.858333 2016-12-30
02:38:00 2.2 3.8 1.60 2016-12-20 20.716667 2016-12-30
22:15:00 2.63 3.12 0.49 2016-12-20 21.166667 2016-12-30
03:16:00 1.63 2.38 0.75 2016-12-21 21.391667 2016-12-30
07:11:00 1.46 2.54 1.08 2016-12-21 21.475000 2016-12-30
I've tried BDay() and using that the day cannot be 6 & 7 in the calculation but have not got anywhere. I came across bdate_range which I believe will be exactly what I'm looking for, but the closest I've got gives me the error Cannot convert input to Timestamp.
My attempt is:
result['bdate'] = pd.to_datetime(result['dte']) + BMonthEnd(0)
result['bdaterange'] = pd.bdate_range(pd.to_datetime(result['dte'], unit='ns').values, pd.to_datetime(result['bdate'], unit='ns').values)
print(result['bdaterange'])
Not sure how to solve the error though.

I think you need length of bdate_range for each row, so need custom function with apply:
#convert only once to datetime
result['dte'] = pd.to_datetime(result['dte'])
f = lambda x: len(pd.bdate_range(x['dte'], x['dte'] + pd.offsets.BMonthEnd(0)))
result['bdaterange'] = result.apply(f, axis=1)
print (result)
bid ask spread dte day bdaterange
01:49:00 2.17 3.83 1.66 2016-12-20 20.858333 9
02:38:00 2.20 3.80 1.60 2016-12-20 20.716667 9
22:15:00 2.63 3.12 0.49 2016-12-20 21.166667 9
03:16:00 1.63 2.38 0.75 2016-12-21 21.391667 8
07:11:00 1.46 2.54 1.08 2016-12-21 21.475000 8

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

extract only certain rows in a dataframe - python

Try using startswith: df[df.Code.str.startswith('GWA')]

Related

Getting Empty DataFrame in pandas from table data

Dataframe split columns value, how to solve error message?

Loop using to_csv only printing last file (Python)

dataFrame duplication extraction row

Cannot convert input to Timestamp, bday_range(...) - Pandas/Python

Categories

Resources