AD AP AR MD MS iS AS
0 169.88 0.00 50.50 814.0 57.3 32.3 43.230
1 12.54 0.01 84.75 93.0 51.3 36.6 43.850
2 321.38 0.00 65.08 986.0 56.7 28.9 42.070
I would like to change the dataframe above into a transposed version where, for each column, the values are put in a single row. E.g. for columns AD and AP, it would look like this:
d1_AD d2_AD d3_AD d1_AP d2_AP d3_AP
169.88 12.54 321.38 0.00 0.01 0.00
I can do a transpose, but how do I get the column names and output structure like above?
NOTE: The output is truncated for legibility, but the actual output should include all the other columns (AR, MD, MS, iS, AS).
We can rename the index labels to the desired form, then stack and sort_index, then collapse the MultiIndex with '_'.join, convert to_frame, and transpose:
# rename index 0,1,2 -> d1,d2,d3, stack to a (row, column) MultiIndexed Series, sort by column label
new_df = df.rename(lambda x: f'd{x + 1}').stack().sort_index(level=1)
# collapse the MultiIndex into flat labels like 'd1_AD'
new_df.index = new_df.index.map('_'.join)
new_df = new_df.to_frame().transpose()
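For reference, the intermediate Series after the rename, stack, and sort (before the index is collapsed) looks roughly like this:
df.rename(lambda x: f'd{x + 1}').stack().sort_index(level=1).head(6)
# d1  AD    169.88
# d2  AD     12.54
# d3  AD    321.38
# d1  AP      0.00
# d2  AP      0.01
# d3  AP      0.00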
Input df:
df = pd.DataFrame({
    'AD': [169.88, 12.54, 321.38], 'AP': [0.0, 0.01, 0.0],
    'AR': [50.5, 84.75, 65.08], 'MD': [814.0, 93.0, 986.0],
    'MS': [57.3, 51.3, 56.7], 'iS': [32.3, 36.6, 28.9],
    'AS': [43.23, 43.85, 42.07]
})
new_df:
d1_AD d2_AD d3_AD d1_AP d2_AP ... d2_MS d3_MS d1_iS d2_iS d3_iS
0 169.88 12.54 321.38 0.0 0.01 ... 51.3 56.7 32.3 36.6 28.9
[1 rows x 21 columns]
If lexicographic sorting does not work (with ten or more rows, 'd10' sorts before 'd2'), we can wait to convert the MultiIndex to strings until after sort_index:
new_df = df.stack().sort_index(level=1)  # sort by column label; level 0 is still integer, so it stays in numeric order
new_df.index = new_df.index.map(lambda x: f'd{x[0]+1}_{x[1]}')
new_df = new_df.to_frame().transpose()
Larger frame:
df = pd.concat([df] * 4, ignore_index=True)
Truncated output:
d1_AD d2_AD d3_AD d4_AD d5_AD ... d8_iS d9_iS d10_iS d11_iS d12_iS
0 169.88 12.54 321.38 169.88 12.54 ... 36.6 28.9 32.3 36.6 28.9
[1 rows x 84 columns]
If the columns must stay in the same order as in df, use melt with ignore_index=False; melt preserves the column order, so there is no need to recalculate groups or sort afterwards:
new_df = df.melt(value_name=0, ignore_index=False)
new_df = new_df[[0]].set_axis(
    # create the new index labels, e.g. 'd1_AD'
    'd' + (new_df.index + 1).astype(str) + '_' + new_df['variable']
).transpose()
Truncated output on the larger frame:
d1_AD d2_AD d3_AD d4_AD d5_AD ... d8_AS d9_AS d10_AS d11_AS d12_AS
0 169.88 12.54 321.38 169.88 12.54 ... 43.85 42.07 43.23 43.85 42.07
[1 rows x 84 columns]
You could also try melt and set_index, numbering the duplicate labels with a groupby cumcount:
x = df.melt().set_index('variable').rename_axis(index=None).T.set_axis([0])
x.set_axis(
    x.columns + x.columns.to_series().groupby(level=0).transform('cumcount').add(1).astype(str),
    axis=1
)
AD1 AD2 AD3 AP1 AP2 AP3 AR1 AR2 AR3 ... MS1 MS2 MS3 iS1 iS2 iS3 AS1 AS2 AS3
0 169.88 12.54 321.38 0.0 0.01 0.0 50.5 84.75 65.08 ... 57.3 51.3 56.7 32.3 36.6 28.9 43.23 43.85 42.07
[1 rows x 21 columns]
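Here groupby(level=0).transform('cumcount') numbers the repeated column labels within each group (AD -> 0, 1, 2), and add(1).astype(str) turns those counters into the 1-based suffixes AD1, AD2, AD3 seen above.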
I have a dataframe close consisting of close prices (with some calculations done beforehand) for some stocks, and I want to create a dataframe (with empty entries or random numbers) whose row names are the tickers of close and whose column names go from 10 to 300 with a step size of 10, i.e. 10, 20, 30, 40, 50, ...
I want to create this df in order to use a for loop to fill in all the entries.
The df close I have is like below:
Close \
ticker AAPL AMD BIDU GOOGL IXIC
Date
2011-06-01 12.339643 8.370000 132.470001 263.063049 2769.189941
2011-06-02 12.360714 8.240000 138.490005 264.294281 2773.310059
2011-06-03 12.265714 7.970000 133.210007 261.801788 2732.780029
2011-06-06 12.072857 7.800000 126.970001 260.790802 2702.560059
2011-06-07 11.858571 7.710000 124.820000 259.774780 2701.560059
......
I first tried to check whether I was creating this dataframe correctly, as below:
rows = close.iloc[0]
columns = [[i] for i in range(10,300,10)]
print(pd.DataFrame(rows, columns))
But what I got is:
2011-06-01
10 20 30 40 50 60 70 80 90 100 110 120 130 140 150 160 170 180 190 200 210 220 230 240 250 260 270 280 290 NaN
After this, I would use something like
percent = pd.DataFrame(rows, columns)
for i in range(10, 300, 10):
    myerror = myfunction(close, i)  # myfunction is a function defined beforehand
    extreme = myerror > 0.1
    percent.iloc[:, i] = extreme.mean()
To be specific, for i=10, my extreme.mean() is something like:
ticker
Absolute Error (Volatility) AAPL 0.420
AMD 0.724
BIDU 0.552
GOOGL 0.316
IXIC 0.176
MSFT 0.320
NDXT 0.228
NVDA 0.552
NXPI 0.476
QCOM 0.468
SWKS 0.560
TXN 0.332
dtype: float64
But if I tried this way, I got:
IndexError: iloc cannot enlarge its target object
How shall I create this df first? Or do I even need to create this df first?
Here is how I would approach it:
from io import StringIO

import numpy as np
import pandas as pd

df = pd.read_csv(StringIO("""ticker_Date AAPL AMD BIDU GOOGL IXIC
2011-06-01 12.339643 8.370000 132.470001 263.063049 2769.189941
2011-06-02 12.360714 8.240000 138.490005 264.294281 2773.310059
2011-06-03 12.265714 7.970000 133.210007 261.801788 2732.780029
2011-06-06 12.072857 7.800000 126.970001 260.790802 2702.560059
2011-06-07 11.858571 7.710000 124.820000 259.774780 2701.560059"""), sep=r"\s+", index_col=0)
col_names = [f"col_{i}" for i in range(10, 300, 10)]
# generate random data
data = np.random.random((df.shape[1], len(col_names)))
# create dataframe
df = pd.DataFrame(data, columns=col_names, index=df.columns.values)
df.head()
This will generate:
col_10 col_20 col_30 col_40 col_50 col_60 col_70 col_80 col_90 col_100 ... col_200 col_210 col_220 col_230 col_240 col_250 col_260 col_270 col_280 col_290
AAPL 0.758983 0.990241 0.804344 0.143388 0.987025 0.402098 0.814308 0.302948 0.551587 0.107503 ... 0.270523 0.813130 0.354939 0.594897 0.711924 0.574312 0.124053 0.586718 0.182854 0.430028
AMD 0.280330 0.540498 0.958757 0.779778 0.988756 0.877748 0.083683 0.935331 0.601838 0.998863 ... 0.426469 0.459916 0.458180 0.047625 0.234591 0.831229 0.975838 0.277486 0.663604 0.773614
BIDU 0.488226 0.792466 0.488340 0.639612 0.829161 0.459805 0.619539 0.614297 0.337481 0.009500 ... 0.049147 0.452581 0.230441 0.943240 0.587269 0.703462 0.528252 0.099104 0.510057 0.151219
GOOGL 0.332762 0.135621 0.653414 0.955116 0.341629 0.213716 0.308320 0.982095 0.762138 0.532052 ... 0.095432 0.908001 0.077070 0.413706 0.036768 0.481697 0.092373 0.016260 0.394339 0.042559
IXIC 0.358842 0.653332 0.994692 0.863552 0.307594 0.269833 0.972357 0.520336 0.124850 0.907647 ... 0.189050 0.664955 0.167708 0.333537 0.295740 0.093228 0.762875 0.779000 0.316752 0.687238
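As for the IndexError: iloc indexes purely by position and cannot create columns that do not yet exist, which is what percent.iloc[:, i] attempts for i = 10, 20, ..., 290. Label-based assignment, on the other hand, can enlarge the frame. A minimal sketch of the fill loop, assuming close is the price frame from the question (tickers as columns) and with a hypothetical stub standing in for myfunction, whose definition is not shown:
def myfunction(close, i):
    # hypothetical stand-in for the question's myfunction
    return close.pct_change(i).abs()

percent = pd.DataFrame(index=close.columns, columns=range(10, 300, 10))
for i in range(10, 300, 10):
    myerror = myfunction(close, i)
    extreme = myerror > 0.1
    percent[i] = extreme.mean()  # label-based assignment; iloc cannot enlarge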
I have a financial close price (weekly) data which looks like this:
Date V1 V2 V3 V4 V5 V6 V7
2010-01-01 77.31 66.94 52.33 34.94 81.38 84.75 482
2010-01-08 78.05 68.85 52.84 34.66 90.15 95.61 508
2010-01-15 79.29 68.3 53.61 35.33 86.97 97.87 490
2010-01-22 80.57 68.19 55.43 35.8 86.04 99.26 480
2010-01-29 81.87 68.79 55.84 35.6 83.36 98.53 462
2010-02-05 83.7 70.35 57.3 36.57 84.54 91.83 464
2010-02-12 81.85 68.32 56.4 37.35 81.2 90.75 455
2010-02-19 82.66 69.04 56.21 36.89 81.85 93.98 457
2010-02-26 86.32 69.7 57.43 37.12 83.96 96.43 467
2010-03-05 85.37 69.98 57.34 36.71 84.01 94.83 466
2010-03-12 84.04 69.76 56.74 36.98 83.02 93.92 466
2010-03-19 84.37 69.76 56.77 37.07 83.29 95.04 458
2010-03-26 85.7 70.06 56.62 36.81 81.64 94.84 459
2010-04-02 85.38 70.72 56.03 36.78 83.91 94.98 464
2010-04-09 89.21 71.7 58.38 37.49 86.95 98.74 471
2010-04-16 89.74 72.35 58.74 38.05 85.58 98.28 487
2010-04-23 90.72 74.26 60.61 38.64 90.5 100.18 492
2010-04-30 99.79 78.67 65.14 38.89 95.82 108.87 494
2010-05-07 102.34 81.48 63.45 41.87 93.18 106.2 478
2010-05-14 96.42 79.81 62.57 41.23 88.94 102.23 484
2010-05-21 96.17 76.9 61.06 39.28 88.22 97.8 444
2010-05-28 95.73 77.67 61.1 39.88 92.88 96.84 421
Here V1, V2, ..., V7 are companies whose weekly closing prices are given in the table above.
What I want to do is calculate, for each month, the open, close, high and low prices.
The open should be the price on the first date in the Date column for that month, and the close the price on the last date.
I am using the code below, which returns the result shown after it:
def calculate(x):
    open = x.loc[x.index.min(), "V1"]
    high = x.loc[x.index, "V1"].max()
    low = x.loc[x.index, "V1"].min()
    close = x.loc[x.index.max(), "V1"]
    return open, high, low, close
result = pd.DataFrame()
result = df.groupby(df["Date"].dt.to_period("M")).apply(calculate)
result
Result of the above:
Date
2010-01 (77.31, 81.87, 77.31, 81.87)
2010-02 (83.7, 86.32, 81.85, 86.32)
2010-03 (85.37, 85.7, 84.04, 85.7)
2010-04 (85.38, 99.79, 85.38, 99.79)
2010-05 (102.34, 102.34, 95.73, 95.73)
...
Now I wanted to:
1. Take these tuples into their respective columns, along with the Date: Date, Open, High, Low, Close.
2. Repeat the above function for all the variables (V1 through V7) using a single loop operation or something similar.
Could someone please suggest how I can do that?
IIUC, you can also try:
df.Date = pd.to_datetime(df.Date)
df = df.sort_values('Date')
df = df.groupby(pd.Grouper(key='Date', freq='1M')).agg(
    **{'open': ('V1', 'first'),
       'high': ('V1', 'max'),
       'low': ('V1', 'min'),
       'close': ('V1', 'last')}
)
OUTPUT:
open high low close
Date
2010-01-31 77.31 81.87 77.31 81.87
2010-02-28 83.70 86.32 81.85 86.32
2010-03-31 85.37 85.70 84.04 85.70
2010-04-30 85.38 99.79 85.38 99.79
2010-05-31 102.34 102.34 95.73 95.73
NOTE: you can also use resample:
df = df.set_index('Date').resample('1M').agg({'V1': ['min', 'max', 'first', 'last']})
Updated Answer:
resample + agg returns a column MultiIndex (e.g. ('V1', 'min')), so flatten it while renaming the statistics:
df1 = df.set_index('Date').resample('1M').agg(['min', 'max', 'first', 'last'])
name_map = {'min': 'low', 'max': 'high', 'first': 'open', 'last': 'close'}
df1.columns = [f'{i}_{name_map[j]}' for i, j in df1.columns]
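This yields flat columns V1_low, V1_high, V1_open, V1_close, and so on through V7_close, covering all seven variables at once, which should answer the second part without an explicit loop.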
df = pd.read_csv('TestdataNALA.csv', sep=';', na_values='ND')
df.fillna(0, inplace=True)
df = df[['ID','VegType','Elevation','C','N','C/N','Soil13C','Soil15N','pH','Na_[mg/g]','K_[mg/g]','Mg_[mg/g]','Ca_[mg/g]','P_[µg/g]']]
new_list = df[['Elevation','C','N','C/N','Soil13C','Soil15N','pH','Na_[mg/g]','K_[mg/g]','Mg_[mg/g]','Ca_[mg/g]','P_[µg/g]']]
nnew_list = [float(i) for i in new_list]
print(type(df['ID']))
#--> <class 'pandas.core.series.Series'>
print('ID' in df.columns)
#-->True
Some data from the CSV:
ID;VegType;Elevation;C;N;C/N;Soil13C;Soil15N;pH;Na_[mg/g];K_[mg/g];Mg_[mg/g];Ca_[mg/g];P_[µg/g]
A2425WC;Cyt;2425;2,78;0,23;12,14;-24,43;4,85;4,88;0,0005;0,0772;0,0646;0,3588;13,8400
C2489WCG;Cyt_Gen;2489;2,18;0,17;12,50;-24,51;5,53;4,21;0,0010;0,0639;0,0286;0,0601;0,6800
C2591WCG;Cyt_Gen;2591;5,13;0,36;14,29;-25,52;3,41;4,30;0,0046;0,0854;0,1169;0,7753;5,7000
E2695WC;Cyt;2695;1,43;0,14;10,55;-25,71;5,75;4,95;ND;0,0766;0,0441;0,0978;8,8500
A2900WC;Cyt;2900;6,95;0,53;13,11;-25,54;3,38;5,35;0,0032;0,1119;0,2356;1,8050;14,7100
A2800WC;Cyt;2800;1,88;0,16;11,62;-23,68;5,88;5,28;0,0025;0,0983;0,0770;0,3777;5,4200
A3050WC;Cyt;3050;1,50;0,12;12,50;-24,62;2,23;5,97;ND;0,0696;0,0729;0,5736;9,4000
When I run this, I get the error:
ValueError: could not convert string to float: 'Elevation'
It seems as if it's trying to convert the column headers, but I only wanted to convert the values in the list.
To convert columns to numeric, you should use pd.to_numeric:
cols = ['Elevation','C','N','C/N','Soil13C','Soil15N','pH','Na_[mg/g]','K_[mg/g]','Mg_[mg/g]','Ca_[mg/g]','P_[µg/g]']
df[cols] = df[cols].apply(pd.to_numeric, errors='coerce')
Your code will not work because iterating over a DataFrame yields its column labels, just as iterating over a dictionary yields its keys.
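A quick way to see this with the question's own new_list (which is a DataFrame, not a list):
for col in new_list:
    print(col)
# prints 'Elevation', 'C', 'N', ...; so float(i) receives the string 'Elevation'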
Update 1
You can also try the read_csv options below:
import pandas as pd
from io import StringIO
mystr = StringIO(""" <bound method NDFrame.head of ID VegType Elevation C N C/N Soil13C Soil15N pH \
1 C2489WCG Cyt_Gen 2489 2,18 0,17 12,50 -24,51 5,53 4,21
2 C2591WCG Cyt_Gen 2591 5,13 0,36 14,29 -25,52 3,41 4,30
3 E2695WC Cyt 2695 1,43 0,14 10,55 -25,71 5,75 4,95
""")
df = pd.read_csv(mystr, skiprows=1, decimal=',', header=None, delim_whitespace=True)
# 0 1 2 3 4 5 6 7 8 9
# 0 1 C2489WCG Cyt_Gen 2489 2.18 0.17 12.50 -24.51 5.53 4.21
# 1 2 C2591WCG Cyt_Gen 2591 5.13 0.36 14.29 -25.52 3.41 4.30
# 2 3 E2695WC Cyt 2695 1.43 0.14 10.55 -25.71 5.75 4.95
Update 2
import pandas as pd
from io import StringIO
mystr = StringIO("""ID;VegType;Elevation;C;N;C/N;Soil13C;Soil15N;pH;Na_[mg/g];K_[mg/g];Mg_[mg/g];Ca_[mg/g];P_[µg/g]
A2425WC;Cyt;2425;2,78;0,23;12,14;-24,43;4,85;4,88;0,0005;0,0772;0,0646;0,3588;13,8400
C2489WCG;Cyt_Gen;2489;2,18;0,17;12,50;-24,51;5,53;4,21;0,0010;0,0639;0,0286;0,0601;0,6800
C2591WCG;Cyt_Gen;2591;5,13;0,36;14,29;-25,52;3,41;4,30;0,0046;0,0854;0,1169;0,7753;5,7000
E2695WC;Cyt;2695;1,43;0,14;10,55;-25,71;5,75;4,95;ND;0,0766;0,0441;0,0978;8,8500
A2900WC;Cyt;2900;6,95;0,53;13,11;-25,54;3,38;5,35;0,0032;0,1119;0,2356;1,8050;14,7100
A2800WC;Cyt;2800;1,88;0,16;11,62;-23,68;5,88;5,28;0,0025;0,0983;0,0770;0,3777;5,4200
A3050WC;Cyt;3050;1,50;0,12;12,50;-24,62;2,23;5,97;ND;0,0696;0,0729;0,5736;9,4000
""")
df = pd.read_csv(mystr, decimal=',', delimiter=';')
# ID VegType Elevation C N C/N Soil13C Soil15N pH \
# 0 A2425WC Cyt 2425 2.78 0.23 12.14 -24.43 4.85 4.88
# 1 C2489WCG Cyt_Gen 2489 2.18 0.17 12.50 -24.51 5.53 4.21
# 2 C2591WCG Cyt_Gen 2591 5.13 0.36 14.29 -25.52 3.41 4.30
# 3 E2695WC Cyt 2695 1.43 0.14 10.55 -25.71 5.75 4.95
# 4 A2900WC Cyt 2900 6.95 0.53 13.11 -25.54 3.38 5.35
# 5 A2800WC Cyt 2800 1.88 0.16 11.62 -23.68 5.88 5.28
# 6 A3050WC Cyt 3050 1.50 0.12 12.50 -24.62 2.23 5.97
# Na_[mg/g] K_[mg/g] Mg_[mg/g] Ca_[mg/g] P_[µg/g]
# 0 0,0005 0.0772 0.0646 0.3588 13.84
# 1 0,0010 0.0639 0.0286 0.0601 0.68
# 2 0,0046 0.0854 0.1169 0.7753 5.70
# 3 ND 0.0766 0.0441 0.0978 8.85
# 4 0,0032 0.1119 0.2356 1.8050 14.71
# 5 0,0025 0.0983 0.0770 0.3777 5.42
# 6 ND 0.0696 0.0729 0.5736 9.40
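Note that Na_[mg/g] stays as text above: the 'ND' entries prevent numeric parsing, so decimal=',' is never applied to that column. Passing na_values='ND' as well (as in the question's original read_csv) should fix that:
mystr.seek(0)  # rewind the StringIO before re-reading it
df = pd.read_csv(mystr, decimal=',', delimiter=';', na_values='ND')
# Na_[mg/g] is now float64, with NaN where the file had 'ND'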