Using regex to break string into a dictionary with key and values


I am attempting to turn this list into a list of dictionaries. Below I have included a snippet of the full list I'm using.
With row 4.1 as an example, I want:
the key to be the row number, ('4.1')
the values to include the title ('Properties occupied by the company
(less $ 43,332,898 \nencumbrances')
and the four numbers after it as a list ['68,122,291', '0',
'68,122,291', '64,237,046'].
I have the general loop for how I'd put together each separate dictionary. Where I am struggling is coming up with regex patterns to get the row title and row values. It's difficult since some of the row titles also include numbers. Another issue is that not all of the rows have four numbers at the end. For these instances, I just want the available numbers. Any help figuring out the regex to grab these would be appreciated.
clean = ['4.1 Properties occupied by the company (less $ 43,332,898 \nencumbrances) 68,122,291 0 68,122,291 64,237,046 \n',
'4.2 Properties held for the production of income (less \n $ encumbrances) 0 0 0 0 \n',
'4.3 Properties held for sale (less $ \nencumbrances) 0 0 \n',]
clean_list = []
for n in clean:
    row_num = re.findall(r'\d+\.',n)
    row_title =
    row_values =
    new_dict = {row_num: row_title, row_values}
    clean_list.append(new_dict)

Not sure why you want a separate dictionary for each line, each with just one key. I would think it more useful to end up with one dictionary with several keys.
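import re

d = {}
for line in clean:
    # code, lazy title, then one to four trailing numbers; DOTALL lets the title span the embedded newline
    parts = re.match(r"^([\d.]+)\s+(.*?)\s+(\d[\d,.]*)\s*(?:(\d[\d,.]*)\s*)?(?:(\d[\d,.]*)\s*)?(?:(\d[\d,.]*)\s*)?$",
                     line, re.DOTALL)
    code, title, *values = parts.group(1,2,3,4,5,6)
    # groups that did not participate in the match are None; filter them out
    d[code] = (title, list(filter(None, values)))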
For the sample data, the value of d would be:
{
'4.1': (
'Properties occupied by the company (less $ 43,332,898 \nencumbrances)',
['68,122,291', '0', '68,122,291', '64,237,046']
),
'4.2': (
'Properties held for the production of income (less \n $ encumbrances)',
['0', '0', '0', '0']
),
'4.3': (
'Properties held for sale (less $ \nencumbrances)',
['0', '0']
)
}
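If the numbers are needed for arithmetic later, the strings in each list can be converted afterwards. This is only a minimal sketch, assuming the values never contain anything but digits and thousands separators; `to_int` and the small sample `d` are illustrative names, not part of the answer above:

```python
def to_int(text):
    """Convert a string such as '68,122,291' to an int (assumes digits and commas only)."""
    return int(text.replace(',', ''))

# A small sample shaped like the dictionary built above
d = {
    '4.1': ('Properties occupied by the company (less $ 43,332,898 \nencumbrances)',
            ['68,122,291', '0', '68,122,291', '64,237,046']),
    '4.3': ('Properties held for sale (less $ \nencumbrances)', ['0', '0']),
}

# Keep the row code as key and replace the list of strings with a list of ints
numeric = {code: [to_int(v) for v in values] for code, (title, values) in d.items()}
print(numeric['4.1'])
```

Rows with fewer than four numbers simply produce shorter lists, so nothing needs special-casing.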

Following the examples shared by you
"4.1 Properties occupied by the company (less $ 43,332,898 \nencumbrances) 68,122,291 0 68,122,291 64,237,046"
"4.2 Properties held for the production of income (less \n $ encumbrances) 0 0 0 0"
"4.3 Properties held for sale (less $ \nencumbrances) 0 0 "
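for n in clean:
    n = n.strip()  # drop the trailing space/newline so $ lands right after the last number
    row_num = re.search(r'^\d+\.\d+', n).group()
    row_values = re.search(r'((\d+,?)+ ?)+$', n).group()
    # Python's re module only supports fixed-width lookbehind, so slice the title out instead
    row_title = n[len(row_num):-len(row_values)].strip()
    new_dict = {row_num: (row_title, row_values.split())}
    clean_list.append(new_dict)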
I would seriously recommend going through https://regexr.com/ and trying these patterns and messing with them to learn. The solution is not important. It's how you end up at it.

You can use split() to collect the variable number of trailing values into a list. If your data always has the parentheses, you can use them as part of the pattern:
for n in clean:
    row_num, row_title, row_values = re.findall(
        r'^(\d+\.\d+)\s+(.*\))\s+(.*)$', n, re.DOTALL)[0]
    new_dict = {row_num: (row_title, row_values.split())}
    clean_list.append(new_dict)
This keeps your use of re.findall(). The output looks like:
[{'4.1': ('Properties occupied by the company (less $ 43,332,898 \n'
'encumbrances)',
['68,122,291', '0', '68,122,291', '64,237,046'])},
{'4.2': ('Properties held for the production of income (less \n'
' $ encumbrances)',
['0', '0', '0', '0'])},
{'4.3': ('Properties held for sale (less $ \nencumbrances)', ['0', '0'])}]
This retains the newlines, if you want those.
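If the embedded newlines are not wanted, one possible clean-up (a sketch, not part of the answer above; `flat_title` is just an illustrative name) is to collapse every run of whitespace in the title into a single space:

```python
title = 'Properties occupied by the company (less $ 43,332,898 \nencumbrances)'

# str.split() with no arguments splits on any run of whitespace, including newlines,
# so joining the pieces with a single space normalises the title
flat_title = ' '.join(title.split())
print(flat_title)
```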

Related

How to remove single or double word from list in python and pdfminer unable to convert rupee font

I am extracting text from a PDF by converting it to HTML and then pulling the text out of the HTML with BeautifulSoup. I have run into issues with symbols such as the currency (rupee) symbol, which comes out looking like a tilde: ['``']
[['Amid ', '41'], ['``', '41'], ['3L cr shortfall, GST cess to continue beyond June 2022 ', '41'], ['Cong clips wings of ‘letter writers’ in new appointments ', '32'], ['MVA aims to cut guv’s power to choose VCs ', '28']]
Present output
1. Amid
2. 3L cr shortfall, GST cess to continue beyond June 2022
3. Cong clips wings of ‘letter writers’ in new appointments
4. MVA aims to cut guv’s power to choose VC
I want to output only the text that has a larger font size, and I also want to remove list entries that contain only a stray character, such as ['``', '41'].
My desired output should look like this
1. Amid 3L cr shortfall, GST cess to continue beyond June 2022
2. Cong clips wings of ‘letter writers’ in new appointments
3. Cong clips wings of ‘letter writers’ in new appointments
My full Code:
import sys,os,re,operator,tempfile,fileinput
from bs4 import BeautifulSoup,Tag,UnicodeDammit
from io import StringIO
from pdfminer.layout import LAParams
from pdfminer.high_level import extract_text_to_fp
def convert_html(filename):
    output = StringIO()
    with open(filename, 'rb') as fin:
        extract_text_to_fp(fin, output, laparams=LAParams(),output_type='html', codec=None)
        Out_txt=output.getvalue()
    return Out_txt
def get_the_start_of_font(x,attr):
    """ Return the index of the 'font-size' first occurrence or None. """
    match = re.search(x, attr)
    if match is not None:
        return match.start()
    return None
def get_font_size_from(attr):
    """ Return the font size as string or None if not found. """
    font_start_i = get_the_start_of_font('font-size:',attr)
    if font_start_i is not None:
        font_size=str(attr[font_start_i + len('font-size:'):].split('px')[0])
        if int(font_size)>25:
            return font_size
    return None
def write_to_txtfile(PDF_file,x):
    filename='txt'.join(PDF_file.split('pdf'))
    path_out=(r'c:\Headline\out\\')
    with open(path_out+filename,'w+',encoding="utf-8") as text_file:
        top3=x[:4]
        for idx, line in enumerate(sorted([row for row in top3 if len(row[0]) > 2], key=lambda z: int(z[1]), reverse=True)):
            text_file.write("{}. {}\n".format(idx+1, line[0]))
def main():
    os.chdir(r'c:\Headline\in')
    for PDF_file in os.listdir():
        if PDF_file.endswith('.pdf'):
            raw_html=convert_html(PDF_file)
            #Converting Microsoft smart quotes to HTML or XML entities:
            UnicodeDammit(raw_html, ["windows-1252"], smart_quotes_to="html").unicode_markup
            soup = BeautifulSoup(raw_html, 'html.parser')
            # iterate through all descendants:
            fonts = []
            for child in soup.descendants:
                if isinstance(child, Tag) is True and child.get('style') is not None:
                    font = get_font_size_from(child.get('style'))
                    if font is not None:
                        fonts.append([str(child.text.replace('\n',' ')),font])
            write_to_txtfile(PDF_file,fonts)
            print(" File have Sucess of Extract Headline form this Page%s"%PDF_file )
if __name__ == "__main__":
    main()

headlines = [['In bid to boost realty, state cuts stamp duty for 7 mths ', '42'],
             ['India sees world’s third-biggest spike of 76,000+ cases, toll crosses 60k ','28'],
             ['O', '33'],
             ['Don’t hide behind RBI on loan interest waiver: SC to govt ', '28']]
for idx, line in enumerate(sorted([row for row in headlines if len(row[0]) > 1], key=lambda z: int(z[1]), reverse=True)):
    print("{}. {}".format(idx+1, line[0]))
Output:
1. In bid to boost realty, state cuts stamp duty for 7 mths
2. India sees world’s third-biggest spike of 76,000+ cases, toll crosses 60k
3. Don’t hide behind RBI on loan interest waiver: SC to govt
Breakdown of what is happening above:
[row for row in headlines if len(row[0]) > 1]
This will create a new list, containing all entries in headlines if the length of entry_in_headlines[0] is greater than 1.
sorted(<iterable>, key=lambda z: int(z[1]), reverse=True)
This sorts the given iterable using a lambda function as the key. The lambda takes one argument and returns its second element (index 1) as an integer. reverse=True then puts the largest values first.
for idx, line in enumerate(<iterable>):
Looping over enumerate will return both the "count" of how many times it has been called, and also the next value inside of the iterable.
print("{}. {}".format(idx+1, line[0]))
Using string-formatting we create the new string inside of the for-loop.

I can't really work out what you are trying to do or where your data is, but you need to add an if statement.
For example:
data = ['In bid to boost realty, state cuts stamp duty for 7 mths ', '42']
if len(data[0].split()) >= 2:
    print(data[0])
Any statement with fewer than 2 words will not be printed.
If you have a list of lists:
data = [['In bid to boost realty, state cuts stamp duty for 7 mths ', '42'],
        ['India sees world’s third-biggest spike of 76,000+ cases, toll crosses 60k', '28'],
        ['O', '33'],
        ['Don’t hide behind RBI on loan interest waiver: SC to govt ', '28']]
# Build a new list rather than calling data.remove() inside the loop,
# because removing items from a list while iterating over it skips elements
data = [lists for lists in data if len(lists[0].split()) > 2]
print(*(lists[0] for lists in data), sep='\n')
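Neither snippet above joins 'Amid ' with the fragment that follows it, which the desired output in the question does. Here is a rough sketch of one way to do that, under the assumption that consecutive fragments with the same font size belong to the same headline; `merge_headlines` and `min_len` are made-up names for illustration:

```python
from itertools import groupby

def merge_headlines(fragments, min_len=3):
    """Drop stray fragments, then join neighbours that share a font size."""
    # Keep only fragments whose stripped text has at least min_len characters
    kept = [(text.strip(), size) for text, size in fragments
            if len(text.strip()) >= min_len]
    merged = []
    # groupby only groups *consecutive* items with the same key (the font size)
    for size, group in groupby(kept, key=lambda pair: pair[1]):
        merged.append((' '.join(text for text, _ in group), int(size)))
    # Largest font first; sorted() is stable, so ties keep document order
    return sorted(merged, key=lambda pair: pair[1], reverse=True)

fragments = [['Amid ', '41'], ['``', '41'],
             ['3L cr shortfall, GST cess to continue beyond June 2022 ', '41'],
             ['Cong clips wings of ‘letter writers’ in new appointments ', '32'],
             ['MVA aims to cut guv’s power to choose VCs ', '28']]

headlines = merge_headlines(fragments)
for idx, (text, size) in enumerate(headlines, start=1):
    print("{}. {}".format(idx, text))
```

The filter runs before the grouping on purpose: once the stray '``' entry is gone, 'Amid' and the next fragment become neighbours and end up in the same group.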

While iterating a list, add the value and the next 2 values into a new list

I am currently making a program to scan a PDF file and look for the key word 'Ref'. Once this word is found I need to take the next two strings, 'code' and 'shares' and add them to a new list to be imported into Excel later.
I have written code to take the text from the PDF file and add it to a list. I then iterate through this list and look for the 'Ref' keyword. When the first one is found it adds it to the list no problem. However when it comes to the next, it adds the first instance of Ref (+the code and the shares) to the list again and not the next one in the PDF file...
Here is the code for adding the Ref + code + shares to the new list (python 3):
for word in wordList:
    match = 'false'
    if word == 'Ref':
        match = 'true'
        ref = word
        code = wordList[wordList.index(ref)+1]
        shares = wordList[wordList.index(ref)+2]
    if match == 'true':
        refList.append(ref)
        refList.append(code)
        refList.append(shares)
Here is the output:
['Ref', '1', '266','Ref', '1', '266','Ref', '1', '266','Ref', '1', '266','Ref', '1', '266','Ref', '1', '266']
As you can see, it's the same reference number each time... the correct output should be something like this:
['Ref', '1', '266','Ref', '2', '642','Ref', '3', '435','Ref', '4', '6763'] etc...
If anyone knows why it is always adding the first ref and code with every instance of 'Ref' in the wordList let me know! I am quite stuck! Thanks

Your issue is that the index method of wordList only returns the first instance it finds, i.e. you will always get the position of the first "Ref". A better approach is to use enumerate over the list, which gives you the index and value of each entry as you go; you can then use the index to get the next two elements. Below is a code example.
data = """
this
Ref
1
266
that
hello
Ref
2
642"""
refList = []
wordList = [item.rstrip() for item in data.splitlines()]
for index, word in enumerate(wordList):
    match = 'false'
    if word == 'Ref':
        match = 'true'
        ref = word
        code = wordList[index+1]
        shares = wordList[index+2]
    if match == 'true':
        refList.append(ref)
        refList.append(code)
        refList.append(shares)
print(refList)
OUTPUT
['Ref', '1', '266', 'Ref', '2', '642']
You could also clean up and remove a lot of unneeded code and just write it as:
for index, word in enumerate(wordList):
    if word == 'Ref':
        refList += [word, wordList[index+1], wordList[index+2]]
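Since the question mentions importing the result into Excel later, it may be handier to keep each code/shares pair together as one row. This is only a sketch under that assumption; `collect_refs` is a hypothetical helper, and the bounds check is there because a 'Ref' at the very end of the list would otherwise raise an IndexError:

```python
def collect_refs(words, keyword='Ref'):
    """Return one (code, shares) tuple for every keyword that has two items after it."""
    rows = []
    for index, word in enumerate(words):
        # Only take the pair when both following items actually exist
        if word == keyword and index + 2 < len(words):
            rows.append((words[index + 1], words[index + 2]))
    return rows

# The last 'Ref' has nothing after it and is skipped
wordList = ['this', 'Ref', '1', '266', 'that', 'hello', 'Ref', '2', '642', 'Ref']
rows = collect_refs(wordList)
print(rows)
```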

When you use the list.index(str) function, it returns the first occurrence of str. To fix this, iterate by index:
for i in range(len(wordList)):
    match = False
    if wordList[i] == 'Ref':
        match = True
        ref = wordList[i]
        code = wordList[i+1]
        shares = wordList[i+2]
    if match:
        refList.append(ref)
        refList.append(code)
        refList.append(shares)
I hope this helps. Cheers!

Get elements from list based on content of each element of list

I'm just starting to learn and faced one problem in Python.
I have a srt doc (subtitles). Name - sub. It looks like:
8
00:01:03,090 --> 00:01:05,260
<b><font color="#008080">MATER:</font></b> Yes, sir, you did.
<b><font color="#808000">(MCQUEEN GASPS)</font></b>
9
00:01:05,290 --> 00:01:07,230
You used to say
that all the time.
In Python it looks like:
'3', '00:00:46,570 --> 00:00:48,670', '<b><font color="#008080">MCQUEEN:</font></b> Okay, here we go.', '', '4', '00:00:48,710 --> 00:00:52,280', 'Focus. Speed. I am speed.', '', '5', '00:00:52,310 --> 00:00:54,250', '<b><font color="#808000">(ENGINES ROARING)</font></b>', '',
Also, I had a list of words (name - noun). It looks like:
['man', 'poster', 'motivation', 'son' ... 'boy']
Let's look at this example:
...'4', '00:00:48,710 --> 00:00:52,280', 'Focus. Speed. I am speed.', '', '5',....
What I need to do is find a word from the list in the subtitles (its first appearance; as an illustration, "Speed") and put into a list the time at which the word appears (00:00:48,710 --> 00:00:52,280) and the sequence number (4), which is located before the time in the document. I was trying to get this information by index, but unfortunately I did not succeed.
Can you help me with how to do this?

Welcome to SO and Python. Although this is not an answer, I think it might be helpful. The go-to Python library for tables is pandas. You can read the srt file into a DataFrame and work your way from there. (You would need to learn the pandas syntax to do stuff, but it is well-invested time.)
import pandas as pd
import requests
# Lion King subtitle
data = requests.get("https://opensubtitles.co/download/67071").text
df = pd.DataFrame([i.split("\r\n") for i in data.split("\r\n\r\n")])
df = df.rename(columns={0:"Index",1:"Time",2:"Row1",3:"Row2"}).set_index("Index")
Printing first 5 rows print(df.head()) gives:
                                Time                          Row1  Row2
Index
1      00:01:01,600 --> 00:01:05,800        <i>Nants ingonyama</i>  None
2      00:01:05,900 --> 00:01:07,200           <i>Bagithi baba</i>  None
3      00:01:07,300 --> 00:01:10,600  <i>Sithi uhhmm ingonyama</i>  None
4      00:01:10,700 --> 00:01:13,300              <i>lngonyama</i>  None
5      00:01:13,300 --> 00:01:16,400        <i>Nants ingonyama</i>  None

Continuing with Anton vBR's suggestion:
words=['ingonyama','king']
results=[]
for w in words:
    for row in df.itertuples():
        if row[2] is not None:
            if w in row[2].lower():
                results.append((w, row[0], row[1]))
        if row[3] is not None:
            if w in row[3].lower():
                results.append((w, row[0], row[1]))
print(results)
You'll get a list of tuples, each of which contains a word you're searching for, a sequence number where it appears, and a time-frame where it appears. Then you can write these tuples to a csv file or whatever. Hope this helps.
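For the flat list of lines shown in the question, the same lookup can also be sketched without pandas. The following assumes that cues are separated by empty strings and that each cue is laid out as number, time, then text lines; `first_appearances` is a made-up helper name, and it records only the first cue in which each word occurs:

```python
import re

def first_appearances(lines, nouns):
    """Map each noun to the (sequence number, time) of the first cue containing it."""
    # Split the flat list into cues at the empty strings
    blocks, current = [], []
    for item in lines:
        if item == '':
            if current:
                blocks.append(current)
            current = []
        else:
            current.append(item)
    if current:
        blocks.append(current)

    found = {}
    for block in blocks:
        if len(block) < 3:
            continue  # not a complete cue (number, time, text)
        number, time, *text_lines = block
        # Remove markup such as <b> or <font ...>, then compare whole lowercase words
        text = re.sub(r'<[^>]+>', '', ' '.join(text_lines)).lower()
        words = set(re.findall(r"[a-z']+", text))
        for noun in nouns:
            if noun not in found and noun.lower() in words:
                found[noun] = (number, time)
    return found

sub = ['3', '00:00:46,570 --> 00:00:48,670',
       '<b><font color="#008080">MCQUEEN:</font></b> Okay, here we go.', '',
       '4', '00:00:48,710 --> 00:00:52,280', 'Focus. Speed. I am speed.', '',
       '5', '00:00:52,310 --> 00:00:54,250',
       '<b><font color="#808000">(ENGINES ROARING)</font></b>', '']

result = first_appearances(sub, ['speed', 'engines', 'poster'])
print(result)
```

Matching on whole words rather than substrings is a deliberate choice here, so that 'son' does not match 'person'; drop the word-set step if substring matches are what you want.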

Python: parsing texts in a .txt file

I have a text file like this.
1 firm A Manhattan (company name) 25,000
SK Ventures 25,000
AEA investors 10,000
2 firm B Tencent collaboration 16,000
id TechVentures 4,000
3 firm C xxx 625
(and so on)
I want to make a matrix form and put each item into the matrix.
For example, the first row of matrix would be like:
[[1,Firm A,Manhattan,25,000],['','',SK Ventures,25,000],['','',AEA investors,10,000]]
or,
[[1,'',''],[Firm A,'',''],[Manhattan,SK Ventures,AEA Investors],[25,000,25,000,10,000]]
For doing so, I wanna parse texts from each line of the text file. For example, from the first line, I can create [1,firm A, Manhattan, 25,000]. However, I can't figure out how exactly to do it. Every text starts at the same position, but ends at different positions. Is there any good way to do this?
Thank you.

Well if you know all of the start positions:
# 0123456789012345678901234567890123456789012345678901234567890
# 1 firm A Manhattan (company name) 25,000
# SK Ventures 25,000
# AEA investors 10,000
# 2 firm B Tencent collaboration 16,000
# id TechVentures 4,000
# 3 firm C xxx 625
# Field #1 is 8 wide (0 -> 7)
# Field #2 is 15 wide (8 -> 22)
# Field #3 is 19 wide (23 -> 41)
# Field #4 is arbitrarily wide (42 -> end of line)
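field_lengths = [ 8, 15, 19, ]
data = []
with open('/path/to/file', 'r') as f:
    for row in f:  # process every line, not just the first
        row = row.rstrip('\n')  # keep leading spaces: they carry the column positions
        pieces = []
        for x in field_lengths:
            piece = row[:x].strip()
            pieces.append(piece)
            row = row[x:]
        pieces.append(row.strip())  # whatever is left is the last field
        data.append(pieces)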

From what you've given as data*, the input differs depending on whether a line starts with a number or a space, and the data can be separated as
(numbers)(spaces)(letters with 1 space)(spaces)(letters with 1 space)(spaces)(numbers+commas)
or
(spaces)(letters with 1 space)(spaces)(numbers+commas)
That's what the two regexes below look for, and they build a dictionary with indexes from the leading numbers, each having a firm name and a list of company and value pairs.
I can't really tell what your matrix arrangement is.
import re
data = {}
f = open('data.txt')
for line in f:
    if re.match(r'^\d', line):
        matches = re.findall(r'^(\d+)\s+((\S\s|\s\S|\S)+)\s\s+((\S\s|\s\S|\S)+)\s\s+([0-9,]+)', line)
        idx, firm, x, company, y, value = matches[0]
        data[idx] = {}
        data[idx]['firm'] = firm.strip()
        data[idx]['company'] = [(company.strip(), value)]
    else:
        matches = re.findall(r'\s+((\S\s|\s\S|\S)+)\s\s+([0-9,]+)', line)
        company, x, value = matches[0]
        data[idx]['company'].append((company.strip(), value))
import pprint
pprint.pprint(data)
->
{'1': {'company': [('Manhattan (company name)', '25,000'),
('SK Ventures', '25,000'),
('AEA investors', '10,000')],
'firm': 'firm A'},
'2': {'company': [('Tencent collaboration', '16,000'),
('id TechVentures', '4,000')],
'firm': 'firm B'},
'3': {'company': [('xxx', '625')],
'firm': 'firm C'}
}
* This works on your example, but it may not work on your real data very well. YMMV.

If I understand you correctly (although I'm not totally sure I do), this will produce the output I think you're looking for.
import re
with open('data.txt', 'r') as f:
    f_txt = f.read() # Change file object to text
    f_lines = re.split(r'\n(?=\d)', f_txt)
matrix = []
for line in f_lines:
    inner1 = line.split('\n')
    inner2 = [re.split(r'\s{2,}', l) for l in inner1]
    matrix.append(inner2)
print(matrix)
print('')
for row in matrix:
    print(row)
Output of the program:
[[['1', 'firm A', 'Manhattan (company name)', '25,000'], ['', 'SK Ventures', '25,000'], ['', 'AEA investors', '10,000']], [['2', 'firm B', 'Tencent collaboration', '16,000'], ['', 'id TechVentures', '4,000']], [['3', 'firm C', 'xxx', '625']]]
[['1', 'firm A', 'Manhattan (company name)', '25,000'], ['', 'SK Ventures', '25,000'], ['', 'AEA investors', '10,000']]
[['2', 'firm B', 'Tencent collaboration', '16,000'], ['', 'id TechVentures', '4,000']]
[['3', 'firm C', 'xxx', '625']]
I am basing this on the fact that you wanted the first row of your matrix to be:
[[1,Firm A,Manhattan,25,000],['',SK Ventures,25,000],['',AEA investors,10,000]]
However, to achieve this with more rows, we end up with a list that is nested 3 levels deep, which is what print(matrix) shows. This can be a little unwieldy to use, which is why TessellatingHeckler's answer uses a dictionary to store the data; I think that is a much better way to access what you need. But if a list of lists of "matrices" is what you're after, then the code I wrote above does that.
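If a flat table is easier to work with than the 3-level nesting, the groups can be flattened into rows of equal width. This is a sketch that assumes the missing cells on continuation lines are always the leading ones (number and firm), as in the sample; `flatten` and `width` are illustrative names:

```python
def flatten(matrix, width=4):
    """Turn the nested groups into flat rows, left-padding short rows with ''."""
    rows = []
    for group in matrix:
        for row in group:
            cells = [cell for cell in row if cell != '']
            # Continuation lines lack the number and firm, so pad on the left
            rows.append([''] * (width - len(cells)) + cells)
    return rows

matrix = [[['1', 'firm A', 'Manhattan (company name)', '25,000'],
           ['', 'SK Ventures', '25,000'],
           ['', 'AEA investors', '10,000']],
          [['3', 'firm C', 'xxx', '625']]]

table = flatten(matrix)
for row in table:
    print(row)
```

Every row then has the same four columns, which matches the first layout asked for in the question.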

How can I sort records in a gigantic csv file by date(mm/dd/yyyy)?

I have a csv file that is 3,642,197 lines long and I need it to be sorted from earliest date to latest date.
I wrote a program that searches the database, and writes every line that contains the "API" number the user specifies to a file that will be used for graphing later. It's very important that it has the earliest dates occur first in the file, so I'm running into this problem: whoever put this giant file together used 3 different files from Excel and combined them into one csv, so the dates aren't sorted.
If I can format the database so that all the earliest dates would be found first, I figure that would be the easiest way to solve the problem.
I am somewhat new to Python and I'm trying to wrap my head around how I can sort this file by date. I tried to do it in Excel and LibreOffice Calc, but it exceeds the maximum row allowance.
Here is an example of the text in the file:
"01/31/1986","25003050040000","SHA","Shannon",121,"",0,0,1324,31,False,P,""
I have records from 2013 to 1986, and have to have them sorted, but have not been able to understand how this is done. From what I have searched I cannot find anything that I can understand.
Much thanks and appreciation in advance!
EDIT: the easiest way is with Linux/Unix. A simple sort command does exactly what I'm talking about.
Ex. sort -t/ -g -r -k3 -k1 -k2 infile.csv > outfile.csv
-t/ sets the field delimiter, -g sorts by numerical value, and -r reverses the sort order. -k3 is the year field, -k1 is the month field, and -k2 is the day field. It will sort by year, then by month, then by day. If you need to sort a giant csv file chronologically, and it won't fit into Excel, this is by far the easiest solution I have found.
Note: if your data is comma separated and the field after your date field is a number, you will need to change the first comma delimiter to a / so it doesn't include the trailing data in the sort.
Ex. 02/25/1987,204928169562,62563959401,16375840 <-- this will need to be changed to 02/25/1987/204928169562,62563959401,16375840 so your data is sorted correctly.

One approach (perhaps not the most clever, but it'll work) is to read all the lines into a list. Then the data looks like:
# lines -> ['"01/31/1986",..', '"4/30/2000",..', ..]
Then sorting with a key mapping can be used. This establishes a mapping for each item of what the real ordering is. In this case it's a matter of turning "mm/dd/yyyy" into something well-ordered. Possible keys might be: "YYYYMMDD", a datetime object, or perhaps an epoch timestamp.
For instance:
from datetime import datetime

def lineKey (v): # v -> '"01/31/1986",..'
    r = v[1:11] # r -> '01/31/1986'
    return datetime.strptime(r, "%m/%d/%Y")
lines.sort(key=lineKey)
# or; lines = sorted(lines, key=lineKey)
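Put together as a whole-file routine, the approach could look roughly like the sketch below. It assumes the file fits in memory and that the date is always the first comma-separated field; `date_key` and `sort_csv_by_date` are hypothetical names. Splitting on the first comma instead of slicing fixed positions also copes with dates that are not zero-padded, such as "4/30/2000":

```python
from datetime import datetime

def date_key(line):
    """Parse the first field of a CSV line, e.g. '"01/31/1986",...', as a datetime."""
    first_field = line.split(',', 1)[0].strip().strip('"')
    return datetime.strptime(first_field, "%m/%d/%Y")

def sort_csv_by_date(src_path, dst_path):
    """Read every line, sort by date (earliest first), and write the result."""
    with open(src_path) as src:
        lines = [line.rstrip('\r\n') for line in src if line.strip()]
    lines.sort(key=date_key)
    with open(dst_path, 'w') as dst:
        dst.write('\n'.join(lines) + '\n')

# Small in-memory demonstration of the key function
sample = ['"01/31/1993","25003050040000","SHA"',
          '"4/30/1986","25003050040000","SHA"',
          '"01/28/1993","25003050040000","SHA"']
ordered = sorted(sample, key=date_key)
print(ordered)
```

Sorting the raw lines instead of parsed rows keeps the original quoting of every field untouched.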

You can read the csv file, convert the silly date to ISO 8601 format so that they sort properly and proceed:
csv_txt='''\
"01/31/1987","25003050040000","SHA","Shannon",121,"",0,0,1324,31,False,P,""
"01/31/1986","25003050040000","SHA","Shannon",121,"",0,0,1324,31,False,P,""
"01/31/1993","25003050040000","SHA","Shannon",121,"",0,0,1324,31,False,P,""
"01/28/1993","25003050040000","SHA","Shannon",121,"",0,0,1324,31,False,P,""
"01/31/2013","25003050040000","SHA","Shannon",121,"",0,0,1324,31,False,P,""'''
import csv
import datetime
data=[]
for line in csv.reader(csv_txt.splitlines()):
    d=datetime.datetime.strptime(line[0],'%m/%d/%Y')
    data.append([d.isoformat().partition('T')[0]]+line[1:])
for e in sorted(data):
    print(e)
Prints:
['1986-01-31', '25003050040000', 'SHA', 'Shannon', '121', '', '0', '0', '1324', '31', 'False', 'P', '']
['1987-01-31', '25003050040000', 'SHA', 'Shannon', '121', '', '0', '0', '1324', '31', 'False', 'P', '']
['1993-01-28', '25003050040000', 'SHA', 'Shannon', '121', '', '0', '0', '1324', '31', 'False', 'P', '']
['1993-01-31', '25003050040000', 'SHA', 'Shannon', '121', '', '0', '0', '1324', '31', 'False', 'P', '']
['2013-01-31', '25003050040000', 'SHA', 'Shannon', '121', '', '0', '0', '1324', '31', 'False', 'P', '']

You can use sed and sort for that task:
cat big_file.csv | \
sed -e 's,^"\(..\)/\(..\)/\(....\)",\3\1\2,' | \
sort | \
sed -e 's,^\(....\)\(..\)\(..\),"\2/\3/\1",' > sorted_file.csv
The first sed command converts:
"01/31/1986","25003050040000","SHA","Shannon",121,"",0,0,1324,31,False,P,""
to
19860131,"25003050040000","SHA","Shannon",121,"",0,0,1324,31,False,P,""
Then the lines are sorted lexically by sort.
The second sed restores the US date format.
The > puts the sorted text into a file.
If you want to use Python instead:
lines = (((line[7:11], line[1:3], line[4:6]), line)  # tuples of (date, line)
         for line in open('big_file.csv'))           # that's a "generator"
sorted_lines = (line[1] for line in sorted(lines))   # sort tuples and omit date
sorted_content = ''.join(sorted_lines)               # recreated CSV file
The idea is exactly the same as with the shell script.
I just noticed that you can do this far more easily using the key argument of sorted that @user2864740 mentioned:
content = ''.join(sorted(open('big_file.csv'),
key=lambda line: (line[7:11], line[1:3], line[4:6])))
