I have lines of text, each containing several variables that together describe a single entry.
I have been trying to use regular expressions, such as the one below, with mixed success (the lines are fairly standardised but do contain typos and inconsistencies):
re.compile('matching factor').findall(input)
I was wondering what the best way to approach this is: what data structures to use and how to loop through multiple lines of text. Here is a sample of the text, with the data I would like to scrape highlighted:
CHINA: National Grain Trade Centre: in auction of state reserves, govt. sold 70,418 t wheat (equivalent to 3.5% of total volume offered) at an average price of CNY2,507/t ($378.19) and 4,359 t maize (4.7%), at an average price of CNY1,290/t ($194.39). Separately, sold 2,100 t of 2013 wheat imports (1.5%) at CNY2,617/t ($394.25). 23 Oct
I am interested in creating a data set containing variables such as:
VOLUME - COMMODITY - PERCENTAGE SOLD - PRICE - DATE
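For instance, a first untested sketch of the kind of pattern I have in mind, tuned to the single sample line above (the group names are my own choice, and messier lines would need per-field fallbacks):

```python
import re

# Pattern tuned to the sample sentence structure; group names are my own.
pattern = re.compile(
    r"(?P<volume>[\d,]+)\s*t\s+"      # e.g. "70,418 t"
    r"(?:of\s+\d{4}\s+)?"             # optional vintage, e.g. "of 2013"
    r"(?P<commodity>[a-z]+)"          # e.g. "wheat"
    r".*?\((?:equivalent to\s*)?"     # up to the opening parenthesis
    r"(?P<pct>[\d.]+)%"               # e.g. "3.5%"
    r".*?CNY(?P<price>[\d,]+)/t",     # e.g. "CNY2,507/t"
    re.IGNORECASE,
)

line = (
    "CHINA: National Grain Trade Centre: in auction of state reserves, govt. "
    "sold 70,418 t wheat (equivalent to 3.5% of total volume offered) at an "
    "average price of CNY2,507/t ($378.19) and 4,359 t maize (4.7%), at an "
    "average price of CNY1,290/t ($194.39). Separately, sold 2,100 t of 2013 "
    "wheat imports (1.5%) at CNY2,617/t ($394.25). 23 Oct"
)

# The date appears once at the end of the line, so it is picked up separately
# and attached to every record from that line.
date = re.search(r"(\d{1,2}\s+[A-Z][a-z]{2})\s*$", line)
records = [dict(m.groupdict(), date=date.group(1)) for m in pattern.finditer(line)]
for r in records:
    print(r)
```

A list of dicts like `records` converts directly to rows of a CSV or a pandas DataFrame, which would be one reasonable data structure for looping over many such lines.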
I have a dataset with the distances between 4 cities. Each city has a sales store from the same company, and a second dataset gives the number of sales from the last month for each store.
What I want to know is the best possible route between the cities to maximise profit (each product is sold for 5), knowing that I only produce in the first city and then have a truck with a maximum truckload of 5000 supplying the other 3 cities.
I can't find anything similar to my problem; the closest I could find were search algorithms. Can someone tell me what approach to take?
Sorry if my question is a bit confusing.
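To make the question concrete, here is an untested sketch with made-up numbers of the brute-force version of what I mean: enumerate every ordering of the three stores and score each route. The distance matrix, demand figures and per-km cost below are all my own placeholders, not real data.

```python
from itertools import permutations

# Made-up symmetric 4-city distance matrix; city 0 is the production site.
dist = [
    [0, 10, 20, 30],
    [10, 0, 15, 25],
    [20, 15, 0, 12],
    [30, 25, 12, 0],
]
demand = {1: 1800, 2: 2500, 3: 1500}  # assumed sales per store (last month)
CAPACITY = 5000                        # maximum truckload
UNIT_PRICE = 5                         # each product is sold for 5
COST_PER_KM = 1.0                      # assumed transport cost per km

best = None
for order in permutations([1, 2, 3]):
    route = [0, *order, 0]
    km = sum(dist[a][b] for a, b in zip(route, route[1:]))
    delivered = 0
    for city in order:  # deliver in visit order until the truck runs out
        delivered += min(demand[city], CAPACITY - delivered)
    profit = delivered * UNIT_PRICE - km * COST_PER_KM
    if best is None or profit > best[0]:
        best = (profit, route)

print(best)
```

With only three stores, a brute force over all 3! orderings is instant; with many more cities this becomes a vehicle-routing problem and would need a proper solver rather than enumeration.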
Background
I have a set of financial data I am trying to analyze and display graphically. The data consists of high resolution price data for a number of commodity contracts. These contracts are specified by having both a product (the actual commodity in question) and a tenor (a specified delivery date for the contract). The combination of a product and tenor therefore gives a fully defined commodity contract, which has a current market price, as follows
T           Product 1                      Product 2
---------   ----------------------------   ----------------------------
Tenor 1     $ Commodity contract price     $ Commodity contract price
            for product 1 at tenor 1       for product 2 at tenor 1
Tenor 2     $ Commodity contract price     $ Commodity contract price
            for product 1 at tenor 2       for product 2 at tenor 2
Using some example data, looking at the September, October and November contracts for two grades of crude oil (Brent and WTI), the current market prices would give us something like this.
T Brent WTI
----------- --------- ---------
Sep 2020 $37.25 $33.40
Oct 2020 $38.10 $33.75
Nov 2020 $38.85 $34.15
But of course these prices aren't static; they move with market forces, so we take a snapshot of them every t seconds. Let's say the above is our starting point, but at t+1 Brent prices have moved up a dollar and WTI up 50c, so our prices now look like this:
T + 1 Brent WTI
----------- --------- ---------
Sep 2020 $38.25 $33.90
Oct 2020 $39.10 $34.25
Nov 2020 $39.85 $34.65
This is broadly the structure of the dataset, though the scope is of course much larger.
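A minimal sketch of how a snapshot series like this can be held in pandas, using the example Brent/WTI figures above (the long-format layout and the column names are my own choice):

```python
import pandas as pd

# One row per (time, product, tenor) observation, taken from the tables above.
records = [
    (0, "Brent", "Sep 2020", 37.25), (0, "WTI", "Sep 2020", 33.40),
    (0, "Brent", "Oct 2020", 38.10), (0, "WTI", "Oct 2020", 33.75),
    (0, "Brent", "Nov 2020", 38.85), (0, "WTI", "Nov 2020", 34.15),
    (1, "Brent", "Sep 2020", 38.25), (1, "WTI", "Sep 2020", 33.90),
    (1, "Brent", "Oct 2020", 39.10), (1, "WTI", "Oct 2020", 34.25),
    (1, "Brent", "Nov 2020", 39.85), (1, "WTI", "Nov 2020", 34.65),
]
df = pd.DataFrame(records, columns=["t", "product", "tenor", "price"])

# Pivot one snapshot back into the product-by-tenor grid shown above.
snapshot = df[df["t"] == 0].pivot(index="tenor", columns="product", values="price")
print(snapshot)
```

The long format makes it easy to slice by product, tenor, or snapshot time, while `pivot` recovers the grid view for any single t.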
What I am trying to do and how far I have gotten
I have a specific visualization of this data (or subsets thereof) that I have been working towards. I am trying to produce what I would call a 3D or extruded polar chart, which would show contract values for different products as the distance from the centre of the polar chart, would extend along the Z-axis to represent the different tenors, and would animate to display the change in contract values over time. I have achieved two thirds of this using the polar chart function of Plotly Express, producing a 2D animated polar chart that displays the contract values for 6 products at 1 tenor over time. The following two images show two frames from this chart.
June14 contract price for Brent, WTI, barges, 180, 380, GC on date of 2014-02-14
June14 contract price for Brent, WTI, barges, 180, 380, GC on date of 2014-03-27
However, as far as I can tell this function cannot be extended to 3D, so I think I have to start fresh. A diagram of what I'm trying to achieve is as follows:
Desired result
I am aiming for a result similar to a very basic Streamtube (see https://plotly.com/python/streamtube-plot/?_ga=2.134282552.895284899.1586457324-1038545846.1585071729 and https://medium.com/plotly/streamtubes-in-plotly-with-python-and-r-a30216ef20a3), but the cross-sectional polygon of a streamtube seems to always have a fixed radius from the centre at each point, rather than varying as in my use case, so I don't believe I can use this pre-built function. Additionally, streamtubes are drawn from vector fields, so producing the correct inputs would require some complex and unhelpful reverse engineering of my data, which suggests this is the wrong route. I believe my best bet is contour maps in either matplotlib or Plotly, but I have hit a bit of a brick wall on how to do this.
Any suggestions would be very appreciated.
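For reference, the closest I have got on my own is the untested sketch below: convert each tenor's polar ring (angle = product, radius = price) to Cartesian coordinates and stack the rings along z with matplotlib's `plot_surface`. All prices here are synthetic placeholders, and the Agg backend is set only so the sketch runs headless.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless; drop this line for interactive use
import matplotlib.pyplot as plt

# 6 products around the circle, 3 tenors along z, radius = contract price.
n_products, n_tenors = 6, 3
theta = np.linspace(0, 2 * np.pi, n_products, endpoint=False)
theta = np.append(theta, theta[0])            # repeat first angle to close the polygon

rng = np.random.default_rng(0)
prices = rng.uniform(30, 40, size=(n_tenors, n_products))  # placeholder prices
prices = np.hstack([prices, prices[:, :1]])   # close each ring too

z = np.arange(n_tenors)[:, None] * np.ones_like(theta)  # one ring per tenor
x = prices * np.cos(theta)                    # polar -> Cartesian
y = prices * np.sin(theta)

fig = plt.figure()
ax = fig.add_subplot(projection="3d")
ax.plot_surface(x, y, z, alpha=0.6)
ax.set_zlabel("tenor index")
fig.savefig("extruded_polar.png")
```

Animating over time would then mean redrawing the surface for each snapshot's price grid, e.g. with `matplotlib.animation.FuncAnimation`.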
I'm trying to write a program to give a deeper analysis of stock trading data but am coming up against a wall. I'm pulling all trades for a given timeframe and creating a new CSV file in order to use that file as the input for a predictive neural network.
The dataframe I currently have has three columns: (1) the price of the stock; (2) the number of shares sold at that price; and (3) the Unix timestamp of that particular trade. I'm having trouble getting any accurate statistical analysis of the data. For example, if I use .median(), the program only looks at the list of prices, ignoring the fact that each price may have been traded hundreds of times according to the volume column.
As an example, this is the partial trading history for one of the stocks that I'm trying to analyze.
0 227.60 40 1570699811183
1 227.40 27 1570699821641
2 227.59 50 1570699919891
3 227.60 10 1570699919891
4 227.36 100 1570699967691
5 227.35 150 1570699967691
...
To better understand the issue, I've also grouped it by price and summed the other columns with df.groupby('p').sum(). I realize this makes the timestamp column meaningless, but it makes visualization easier.
227.22 2 1570700275307
227.23 100 1570699972526
227.25 100 4712101657427
227.30 105 4712101371199
227.33 50 1570700574172
227.35 4008 40838209836171
...
Is there any way to use the number from the trade volume column to perform a statistical analysis of the price column? I've considered creating a new dataframe where each price is listed the number of times that it is traded, but am not sure how to do this.
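To show what I mean, here is an untested sketch of the volume-weighted statistics I am after, using the partial history above (the column names p and v are assumed), including the repeat-each-price idea I considered:

```python
import numpy as np
import pandas as pd

# The first six trades from the history above: p = price, v = volume.
df = pd.DataFrame({
    "p": [227.60, 227.40, 227.59, 227.60, 227.36, 227.35],
    "v": [40, 27, 50, 10, 100, 150],
})

# Volume-weighted average price (VWAP).
vwap = np.average(df["p"], weights=df["v"])

# Volume-weighted median: sort by price and find where the cumulative
# volume crosses half of the total volume.
s = df.sort_values("p")
cum = s["v"].cumsum()
w_median = s.loc[cum >= s["v"].sum() / 2, "p"].iloc[0]

# Alternatively, literally repeat each price v times, so the ordinary
# .median() (or any other statistic) sees one row per share traded.
expanded = df["p"].repeat(df["v"])

print(vwap, w_median, expanded.median())
```

The `repeat` version is the most direct translation of "each price listed the number of times it is traded", but the weighted formulas avoid materialising millions of rows for high-volume data.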
Thanks in advance for any help!
I have been working with a dataset which contains information about houses that have been sold on a particular market. There are two columns, 'price', and 'date'.
I would like to make a line plot to show how the prices in this market have changed over time.
The problem is that some houses have been sold on the same date but at a different price.
So ideally I would need to get the mean/average price of the houses sold on each date before plotting.
So for example, if I had something like this:
DATE / PRICE
02/05/2015 / $100
02/05/2015 / $200
I would need to get a new row with the following average:
DATE / PRICE
02/05/2015 / $150
I just haven't been able to figure it out yet. I would appreciate anyone who could guide me in this matter. Thanks in advance.
Assuming you're using pandas (groupby is a method on the DataFrame, not on the pd module):
df.groupby('DATE')['PRICE'].mean()
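A fuller sketch of this groupby approach, using the two example rows from the question plus one assumed extra date so the result has more than one point to plot:

```python
import pandas as pd

# The two rows from the question, plus an assumed third row on another date.
df = pd.DataFrame({
    "DATE": ["02/05/2015", "02/05/2015", "03/05/2015"],
    "PRICE": [100, 200, 180],
})

# Parse the dates so the x-axis sorts chronologically, not alphabetically.
df["DATE"] = pd.to_datetime(df["DATE"], format="%d/%m/%Y")

# One mean price per date, ready for a line plot via daily.plot().
daily = df.groupby("DATE")["PRICE"].mean()
print(daily)
```

`daily` now holds a single averaged price per date (150 for 02/05/2015 in this example), which is exactly the pre-aggregated series the line plot needs.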
I am a little stuck on a small project I am working on and I would appreciate your help.
I have two data frames.
The first one is larger and it is the one I want to use for my final analyses.
It contains ISINs for bonds based on industry and region, and has ratings from S&P and Moody's. Its columns are:
ISIN, Industry, Region, SP, MD
The second dataframe has Industry, Region and the ratings (S&P and Moody's) as well, plus an estimated rating based on financial information like investments, spending on R&D, etc. Its columns are:
Industry, Region, SP, MD, Internal Estimate
I would like to add a new column to the first dataframe containing the internal rating (labelled "Internal Estimate"), matched on Industry, Region and Rating.
A plain merge wouldn't work, because within an industry you can have several S&P and Moody's ratings, or sometimes those are missing.
That is why I have written code along the following lines (still closer to pseudocode than to working Python):
for i in range(len(Bond_Rating)):
    if Bond_Rating['MD'] == '' and Bond_Rating['SP'] == '':
        Bond_Rating['Internal Estimate'] = ''
    elif Bond_Rating['MD'] == '' and Bond_Rating['SP'] != '':
        Bond_Rating['Internal Estimate'] = Bond_Rating.lookup[concat('BicId', 'RegionName', 'SP'), Internal_Estimate_Table['InternalEstimate']]
    elif Bond_Rating['MD'] != '' and Bond_Rating['SP'] == '':
        Bond_Rating['Internal Estimate'] = Bond_Rating.lookup[concat('BicId', 'RegionName', 'MD'), Internal_Estimate_Table['InternalEstimate']]
    elif Bond_Rating['MD'] != '' and Bond_Rating['SP'] != '':
        Bond_Rating['Internal Estimate'] = Bond_Rating.lookup[concat('BicId', 'RegionName', 'MD', 'SP'), Internal_Estimate_Table['InternalEstimate']]
However, I am unsure why my code doesn’t work. I keep getting errors.
I would appreciate your assistance.
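One merge-based sketch of what I think you are after: do two left merges on the shared keys, one keyed on the S&P rating and one on Moody's, and take the S&P result where S&P is present, falling back to Moody's otherwise. This assumes the column names above and that each (Industry, Region, rating) combination appears at most once in the estimates table; all data below is made up.

```python
import pandas as pd

# Made-up stand-ins for the two dataframes described in the question.
bonds = pd.DataFrame({
    "ISIN": ["XS1", "XS2", "XS3"],
    "Industry": ["Energy", "Energy", "Tech"],
    "Region": ["EU", "EU", "US"],
    "SP": ["BBB", "", "A"],     # empty string = rating missing
    "MD": ["Baa2", "Baa3", ""],
})
estimates = pd.DataFrame({
    "Industry": ["Energy", "Energy", "Tech"],
    "Region": ["EU", "EU", "US"],
    "SP": ["BBB", "", "A"],
    "MD": ["Baa2", "Baa3", ""],
    "InternalEstimate": ["BBB", "BBB-", "A-"],
})

# Match on (Industry, Region, SP)...
by_sp = bonds.merge(estimates.drop(columns="MD"), how="left",
                    on=["Industry", "Region", "SP"])
# ...and on (Industry, Region, MD) as a fallback.
by_md = bonds.merge(estimates.drop(columns="SP"), how="left",
                    on=["Industry", "Region", "MD"])

# Use the S&P match where S&P is present, otherwise the Moody's match.
bonds["Internal Estimate"] = by_sp["InternalEstimate"].where(
    bonds["SP"] != "", by_md["InternalEstimate"])
print(bonds)
```

If the key combinations are not unique in the estimates table, the merges would fan out into extra rows, so you would need to deduplicate (or aggregate) the estimates first; rows with both ratings missing simply end up with NaN, matching the first branch of your conditions.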