I am going through a json file and using a regex to pull out info around company financial KPIs and their corresponding values. For example, the regex for
"grossProfits":{"raw":19805000000,"fmt":"19.8B","longFmt":"19,805,000,000"}
would return the 19.8B. The issue is when the KPI does not have any info. For example
"returnOnEquity":{}.
In this case returnOnEquity would return the next number the regex finds.
"returnOnEquity":{},"grossProfits":{"raw":19805000000,"fmt":"19.8B","longFmt":"19,805,000,000"}.
So the value returned for returnOnEquity will be that of grossProfits (19.8B).
Here is my current regex r'.*?"(\d{1,8}\.\d{1,8}M?B?K?|[{}])%?'
In a perfect world, I would want it to return 0 but even a '{' or '}' will suffice.
Any help is much appreciated.
As the earlier commenters suggested, the json module is the way to go (see the docs).
In your case,
import json
with open('sample.txt') as js:
    data = json.load(js)
for firm in data:
    print(firm)
    print(data[firm]['grossProfits']['raw'])
    print(data[firm]['returnOnEquity'])
It turns your data into a dictionary of dictionaries, so you don't have to worry about parsing.
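To get the 0 the question asks for when a KPI is empty, a dictionary lookup with a default is enough. The following is only a minimal sketch: the firm name "ACME", the helper name kpi_value, and the firm-to-KPI nesting are assumptions for illustration, so adjust them to the real file layout.

```python
import json

# Hypothetical sample: one firm ("ACME" is made up) mapping KPI names to value dicts
sample = (
    '{"ACME": {"returnOnEquity": {}, '
    '"grossProfits": {"raw": 19805000000, "fmt": "19.8B", '
    '"longFmt": "19,805,000,000"}}}'
)


def kpi_value(kpis, name, field="fmt", default=0):
    """Return one field of a KPI, or the default if the KPI is missing or empty."""
    entry = kpis.get(name) or {}
    return entry.get(field, default)


data = json.loads(sample)
results = {
    firm: {name: kpi_value(kpis, name) for name in kpis}
    for firm, kpis in data.items()
}
print(results)
```

Because an empty {} has no "fmt" key, returnOnEquity falls back to the default instead of picking up the next KPI's value.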
I am using the docx library to read a Word document, and I am trying to extract only the questions using regex search and match. I have found many ways of doing it, but I keep getting a "TypeError".
The data I am trying to extract is this:
Will my financial aid pay for housing?
Off Campus Housing - After financial aid applies toward your tuition and fees, any remaining funds will be sent to you as a refund that will either be directly deposited (which can be set up through your account) or mailed to you as a paper check. You can then use the refund to pay your rent. It is important to note that financial aid may not be available when rent is due, so make sure to have a plan in place to pay your rent. Will my financial aid pay for housing?
"financial" "help" "house"
funds "univ oak"
"money" "chisho"
"pay" "chap"
"grant" "laurel"
What are the requirements to receive a room and grant?
How do I pay for my housing?
How do I pay for housing?
If there is also an easier way to export the Word doc into a different type of file, that would be great to know. Thank you
I am using regex101, and I've tried the following regular expressions to match only the sentences that end in a question mark
".*[?=?]$"
"^(W|w).*[?=?]$"
"^[A-Za-z].*[?=?]$"
import re
import sys
from docx import Document
wordDoc = Document('botDoc.docx')
result = re.search('.*[?=?]$', wordDoc)
print(result)
if result:
    print(result.group(0))
for table in wordDoc.tables:
    for row in table.rows:
        for cell in row.cells:
            print("test")
I expect to save the matches into dictionaries so I can export the data to a CSV file
Your error:
result = re.search('.*[?=?]$', wordDoc)
I believe that this line is the cause of the problem. search() is expecting a string as a second parameter, but is receiving a Document object.
What you should do is use the findall() function. search() only finds the first match for a pattern; findall() finds all the matches and returns them as a list of strings, with each string representing one match.
Since you are working with docx, you would have to extract the contents of the docx and use them as second parameter of the findall() method. If I remember correctly, this is done by first extracting all the paragraphs, and then extracting the text of the individual paragraphs. Refer to this question.
FYI, the way you would do this for a simple text file is the following:
import re
# Open the file and read its full text; the with block closes the file afterwards
with open('test.txt', 'r') as f:
    text = f.read()
# Feed the file text into findall(); it returns a list of all the found strings
strings = re.findall(r'your pattern', text)
Your Regex:
Unfortunately, your regex is not quite correct. Although it makes sense logically to match only sentences that end in a ?, one of your matches is "place to pay your rent. Will my financial aid pay for housing?", for example, and only the second part of that is an actual question. So require the match to begin with an upper-case letter. Your regex should be something like:
[A-Z].*\?$
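Note that [A-Z].*\?$ still starts at the first capital letter in a paragraph, so a long paragraph that merely ends in a question is matched whole. The sketch below is one possible tightening, not the only one: it forbids sentence-ending punctuation inside the match, and works on plain strings. The function name extract_questions is an assumption; with python-docx the input could be built as shown in the comment.

```python
import re

# A capital letter, then anything except sentence-ending punctuation, then "?"
QUESTION = re.compile(r"[A-Z][^.?!]*\?")


def extract_questions(paragraphs):
    """Return every question-like sentence found in a list of paragraph strings."""
    questions = []
    for text in paragraphs:
        questions.extend(match.strip() for match in QUESTION.findall(text))
    return questions


# With python-docx the list could come from the document, e.g.:
#   paragraphs = [p.text for p in Document('botDoc.docx').paragraphs]
paragraphs = [
    "Will my financial aid pay for housing?",
    "It is important to note that financial aid may not be available when rent "
    "is due, so make sure to have a plan in place to pay your rent. "
    "Will my financial aid pay for housing?",
    '"financial" "help" "house"',
    "How do I pay for my housing?",
]
questions = extract_questions(paragraphs)
print(questions)
```

The resulting list can then be written out with the standard csv module, one question per row.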
I'm setting up integration between a webflow store and shippo to assist with creating labels and managing shipping. Webflow passes the data as one huge object for address information, however to create a new order in shippo, I need the information parsed, separated as individual line items. I have attempted to use formatter which allows one to extract text, split text, use regex to match data and more.
import re
details = re.search(r'(?<=city:\s).*$', input_data['All Addresses'])
Regex in Python is my best option, yet the result will not find and/or display the data.
Please any experts in Zapier integrations, I need assistance in figuring out a way to parse the incoming data from webflow, pass it to the 'create a order' action with shippo.
Structure of Data:
addressee: string
city: string
country: string
more....
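You can try this:
Combine all the data into one string, then run:
import re
details = re.findall(r'(?<=city:\s).*$', all_addresses, re.MULTILINE)
return details
It will give you a list of all the matches in the text. The re.MULTILINE flag lets $ match at the end of every line; without it, $ only matches at the very end of the string, so a city that is not on the last line is never found.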
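If every field is needed rather than only the city, the same idea extends to one pattern that captures each "key: value" line. This is a sketch under assumptions: the sample values are invented, and the real Webflow payload is assumed to arrive as one string with one field per line, as in the structure shown in the question.

```python
import re

# Hypothetical payload; the values are made up, the keys follow the question
all_addresses = "addressee: Jane Doe\ncity: Austin\ncountry: US"

# Only the city, one match per line thanks to re.MULTILINE
cities = re.findall(r"(?<=city:\s).*$", all_addresses, flags=re.MULTILINE)

# Every "key: value" line collected into a dict
fields = dict(
    re.findall(r"^(\w+):[ \t]*(.*)$", all_addresses, flags=re.MULTILINE)
)

print(cities)
print(fields)
```

Each entry of fields can then be mapped to its own input on the Shippo action.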
I am practicing my programming skills (in Python) and I realized that I don't know what to do when I need to find a value that is unknown but introduced by a key word. I am taking the information for this off a website where in the page source it says, '"size":"10","stockKeepingUnitId":"(random number)"'
How can I figure out what that number is?
This is what I have so far --
def stock():
    global session
    endpoint = '(website)'
    response = session.get(endpoint)
    soup = bs(response.text, "html.parser")
    sizes = soup.find('"size":"10","stockKeepingUnitId":')
Off the top of my head there are two ways to do this. Say you have the string mystr = 'some text...content:"67588978"'. The first way is just to search for "content:" in the string and use string slicing to take everything after it:
num = mystr[mystr.index('content:"') + len('content:"'):-1]
Alternatively, and probably a better solution, you could use regular expressions:
import re
nums = re.findall(r'content:"(\d+)"', mystr)
As you haven't provided an example of the dataset you're trying to analyze, there could also be a number of other solutions. If you're trying to parse a JSON or YAML file, there are simple libraries to turn them into Python dicts (json is part of the standard library, and PyYAML handles YAML files easily).
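Applied to the fragment quoted in the question, the regex approach might look like the sketch below. The page_source string and the SKU numbers are invented for illustration, and sku_for_size is a hypothetical helper name; in practice the input would be response.text.

```python
import re

# Hypothetical page source; the SKU numbers are made up
page_source = (
    '{"size":"9","stockKeepingUnitId":"111"},'
    '{"size":"10","stockKeepingUnitId":"24680"}'
)


def sku_for_size(source, size):
    """Return the stockKeepingUnitId that follows the given size, or None."""
    pattern = (
        r'"size":"' + re.escape(str(size)) + r'","stockKeepingUnitId":"(\d+)"'
    )
    match = re.search(pattern, source)
    return match.group(1) if match else None


print(sku_for_size(page_source, 10))
```

The capture group (\d+) is what pulls out the unknown number that follows the known key.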
I am attempting to isolate TLDs from giant lists of FQDNs using regex, without importing 3rd party modules, and am trying to determine whether there is a more elegant way of doing this. My way works but is a bit cumbersome for my liking.
Sample code:
import re
domains = ['x.sample1.com', 'y.sample2.org', 'z.sample3.biz']
temp = []
for domain in domains:
    temp.append(re.findall(r'\.[a-z0-9]+', domain, re.I))
tlds = []
for item in temp:
    for tld in item:
        tlds.append(tld)
It is inconvenient how the return of the re.findall is a list object as it makes the iterating process an entire level deeper than desired but am unsure of how to get around this.
The "quick fix" is either to take the last item in each list:
domain.split('.')[-1]
Or, if you really don't care about the first matches, then don't capture them at all:
re.search(r'\.[a-z0-9]+$', domain, re.I)
(Note the use of $ to match the end of string.)
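Either quick fix collapses the nested loops from the question into a single list comprehension. A minimal sketch using the sample domains from the question (the variable names are illustrative):

```python
import re

domains = ['x.sample1.com', 'y.sample2.org', 'z.sample3.biz']

# String-splitting version: keep only the text after the last dot
tlds_split = ['.' + domain.rsplit('.', 1)[-1] for domain in domains]

# Regex version: anchor the pattern at the end so only the last label matches
matches = (re.search(r'\.[a-z0-9]+$', domain, re.I) for domain in domains)
tlds_regex = [m.group() for m in matches if m]

print(tlds_split)
print(tlds_regex)
```

Both give one flat list, so no second loop is needed.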
HOWEVER, note that it's impossible to solve this problem properly with regex. For example, how can you know that the TLD for google.co.uk is co.uk, and not just uk?
The only full solution to this problem, unfortunately, is by using a library that implements the public suffix list - which is basically just a very long (manually updated) list of all TLDs. For example, in python: https://pypi.python.org/pypi/publicsuffix/
I'm fairly new to Python, and I'm trying to find out how to do parsing in Python.
I've got a task: to parse some data made of symbols I don't recognize and put it into a DB. I guess I can create the DB and tables with the help of SQLAlchemy, but I have no idea how to do the parsing or what all the symbols below mean.
http://joxi.ru/YmEVXg6Iq3Q426
http://joxi.ru/E2pvG3NFxYgKrY
$$HDRPUBID 112701130020011127162536
H11127011300UNIQUEPONUMBER120011127
D11127011300UNIQUEPONUMBER100001112345678900000001
D21127011300UNIQUEPONUMBER1000011123456789AR000000001
D11127011300UNIQUEPONUMBER200002123456987X000000001
D21127011300UNIQUEPONUMBER200002123456987XIR000000000This item is inactive. 9781605600000
$$EOFPUBID 1127011300200111271625360000005
Thanks in advance to anyone who can give me some advice on where to start and how the parsing should work.
The best approach is to first figure out where each token begins and ends, and write a regular expression to capture these. The site RegexPal might help you design the regex.
As others suggest, take a look at some regex tutorials and at the help for the re module.
You are probably looking for something like this:
import re
# Field positions within a header line: 1-based and inclusive
headerMapping = {'type': (1,5), 'pubid': (6,11), 'batchID': (12,21),
                 'batchDate': (22,29), 'batchTime': (30,35)}
# Assumes the 30 characters after $$HDR are all digits
poaBatchHeaders = re.findall(r'\$\$HDR\d{30}', text)
parsedBatchHeaders = []
for poaHeader in poaBatchHeaders:
    # Create a fresh dict per header so the list entries do not share one object
    batchHeaderDict = {}
    for key in headerMapping:
        start = headerMapping[key][0]-1
        end = headerMapping[key][1]
        batchHeaderDict.update({key: poaHeader[start:end]})
    parsedBatchHeaders.append(batchHeaderDict)
You then have a list of dicts, each holding the data for every attribute of one matched structure (the POA batch header in this example). I assume your data file has been read into text, which is a string.
If you want to parse it further, you have to write a function for each attribute that needs it, for example the date:
def batchDate(batch):
    # Assumes an 8-character YYYYMMDD value, as in the sample header (20011127)
    return (batch[0:4]+'-'+batch[4:6]+'-'+batch[6:8])
for header in parsedBatchHeaders:
    header.update({'batchDate': batchDate( header['batchDate'] )})
Remember, this is only an example, and I don't know the documentation for your data! I guess it works like that, but the rest is up to you.
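Since the records look fixed-width, plain slicing by position may be enough once a line has been picked out, with no regex per field. The sketch below reuses the field positions from the answer and the header line exactly as it appears in the question. Whether those positions and the YYYYMMDD date layout match the real specification is an assumption, and parse_fixed_width is a hypothetical helper name.

```python
from datetime import datetime

# Field positions from the answer above: 1-based and inclusive (assumed layout)
HEADER_FIELDS = {
    "type": (1, 5),
    "pubid": (6, 11),
    "batchID": (12, 21),
    "batchDate": (22, 29),
    "batchTime": (30, 35),
}


def parse_fixed_width(line, fields):
    """Slice one fixed-width record into a dict of field name to raw text."""
    return {name: line[start - 1:end] for name, (start, end) in fields.items()}


# Header line copied from the question as shown there
sample = "$$HDRPUBID 112701130020011127162536"
header = parse_fixed_width(sample, HEADER_FIELDS)

# Assumes the date is stored as YYYYMMDD
batch_date = datetime.strptime(header["batchDate"], "%Y%m%d").date()

print(header)
print(batch_date)
```

The other record types (H1, D1, D2) would each need their own position table, taken from the format's documentation, before the rows are handed to SQLAlchemy.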