Google Cloud Vision form data extraction for handwritten text in Python

I have an image of a handwritten form (the image itself is omitted here), and I'm trying to extract the form data like this:
{
    "comments": "nil",
    "namefirst": "Jhon",
    "last": "Doe",
    "mf": "",
    "address 1": "PICADALLY LONDON",
    "APT": "103",
    "City": "London",
    "State": "Nil",
    "DOB": "",
    "AGE": 43,
    "Phone Number": "+4464343",
    "email": "nil",
    "date": "20-03-2012"
}
But I'm unable to extract it like that; I only get as far as the bounding boxes. I've been stuck on this for five days, so any help would be greatly appreciated.
My code:
items = []
lines = {}
# Group each word annotation into a physical line: a word joins an existing
# line if its top y-coordinate lies above that line's bottom edge.
for text in response.text_annotations[1:]:
    top_x_axis = text.bounding_poly.vertices[0].x
    top_y_axis = text.bounding_poly.vertices[0].y
    bottom_y_axis = text.bounding_poly.vertices[3].y
    if top_y_axis not in lines:
        lines[top_y_axis] = [(top_y_axis, bottom_y_axis), []]
    for s_top_y_axis, s_item in lines.items():
        if top_y_axis < s_item[0][1]:
            lines[s_top_y_axis][1].append((top_x_axis, text.description))
            break
# Within each line, sort the words left to right and join them.
for _, item in lines.items():
    if item[1]:
        words = sorted(item[1], key=lambda t: t[0])
        items.append((item[0], ' '.join(word for _, word in words), words))
print(items)
Can anyone help me with this? Thanks in advance.
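One possible next step (not from the original question): once items holds the reconstructed lines, key/value pairs can be recovered by matching each line against the form's known field labels. This is only a sketch; the FIELD_LABELS list below is hypothetical and would have to match the labels actually printed on the form.
FIELD_LABELS = ["Comments", "Name First", "Last", "M/F", "Address 1", "APT",
                "City", "State", "DOB", "AGE", "Phone Number", "Email", "Date"]

def line_to_field(line_text):
    # Try each known label as a prefix of the reconstructed line.
    for label in FIELD_LABELS:
        if line_text.lower().startswith(label.lower()):
            value = line_text[len(label):].strip(" :")
            return label, value
    return None

form = {}
for _, line_text, _ in items:
    field = line_to_field(line_text)
    if field is not None:
        form[field[0]] = field[1]
print(form)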

Related

How do I join two columns in Python, where one column has a URL in every row and the other column has a list of URL suffixes

I have two columns:
One looks like:
"Link": "https://url.com/item?variant=",
"Link": "https://url2.com/item?variant=",
"Link": "https://url3.com/item?variant=",
The 2nd looks like:
"link extension": ["1","2"],
"link extension": ["1","2"],
"link extension": ["1","1","3"],
What I'm trying to do is combine them so that my Link column looks like this:
"Link": "https://url.com/item?variant=1"
"Link": "https://url.com/item?variant=2"
"Link": "https://url2.com/item?variant=1"
"Link": "https://url2.com/item?variant=2"
"Link": "https://url3.com/item?variant=1"
"Link": "https://url3.com/item?variant=2"
"Link": "https://url3.com/item?variant=3"
However, I'm a beginner in Python, and at a basic level with pandas. I tried to find the answer and came across map/append options, but none of them seem to work; each throws a different TypeError.
Any help or advice on what/where to read would be very helpful.
Thank you in advance.
Here is my basic code:
def parse(self, response):
    items = response.xpath("//*[@id='bc-sf-filter-products']/div")
    for item in items:
        link = item.xpath(".//div[@class='figcaption product--card--text under text-center']/a/@href").get()
        yield response.follow(url=link, callback=self.parse_item)

def parse_item(self, response):
    Title = response.xpath(".//div[@class='hide-on-mobile']/div[@class='productTitle']/text()").get()
    Item_Link = response.url
    n_item_link = f"{Item_Link}?variant="
    idre = r'("id":\d*)'  # defining regex
    id = response.xpath("//script[@id='ProductJson-product']/text()").re(idre)  # applying regex
    id1 = [item.replace('"id":', '') for item in id]  # cleaning list of url-ids
    id2 = id1[1:]  # dropping first item
    test = n_item_link.append(id2)  # doesn't work: a str has no .append()
    test2 = n_item_link.str.cat(id2)  # doesn't work either: .str.cat is a pandas Series method, not a str method
    yield {
        'test': test,
        'test2': test2
    }
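A hedged sketch of a direct fix inside parse_item (reusing the variable names from the question): since n_item_link is a plain string, the variant URLs have to be built with a loop or comprehension rather than .append or .str.cat.
    # One URL per variant id; id2 is the cleaned list built above.
    variant_links = [n_item_link + variant_id for variant_id in id2]
    yield {'title': Title, 'links': variant_links}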
import pandas as pd

# recreating the DataFrame
df = pd.DataFrame({
    "link": ["https://url.com/item?variant=",
             "https://url2.com/item?variant=",
             "https://url3.com/item?variant="],
    "variants": [["1", "2"],
                 ["1", "2"],
                 ["1", "1", "3"]]
})

# creating a new column containing the length of each list
df["len_list"] = [len(x) for x in df["variants"].to_list()]
# creating a flat list of all values in df.variants, converting each value to a string
flat_list_variants = [str(item) for sublist in df["variants"].to_list() for item in sublist]
# creating a new DataFrame in which each index is repeated df["len_list"] times
df_new = df.loc[df.index.repeat(df.len_list)]
# assigning the flat list to a new column
df_new["flat_variants"] = flat_list_variants
# composing the result by concatenating the strings
df_new["results"] = df_new["link"] + df_new["flat_variants"]
I don't know exactly what your input looks like, but assuming you have a list (or another iterable) for your links and your extensions, this will work:
def join_url(links, ext_lists):
    urls = []
    for link, extension_list in zip(links, ext_lists):
        for extension in extension_list:
            urls.append(link + extension)
    return urls
Sample input:
websites = ['web1&hello=', 'web2--', 'web3=']
extensions = [['1', '2'], ['1', '2', '3'], ['3', '1']]
url_list = join_url(websites, extensions)
print(url_list)
Output:
['web1&hello=1', 'web1&hello=2', 'web2--1', 'web2--2', 'web2--3', 'web3=3', 'web3=1']
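As a side note (not part of the original answers): pandas 0.25+ provides DataFrame.explode, which collapses the repeat-and-flatten steps from the DataFrame answer above into a single call. A minimal sketch, reusing that answer's df:
import pandas as pd

df = pd.DataFrame({
    "link": ["https://url.com/item?variant=",
             "https://url2.com/item?variant=",
             "https://url3.com/item?variant="],
    "variants": [["1", "2"], ["1", "2"], ["1", "1", "3"]],
})

# explode() puts each list element on its own row, repeating the link
exploded = df.explode("variants")
exploded["Link"] = exploded["link"] + exploded["variants"]
print(exploded["Link"].tolist())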

How to put all elements from a dictionary into a database

Let me show you my problem.
To begin, I have a dictionary containing a list in Python, like this:
"links": [{"url": "http://catherineingram.com/biography.html", "type": {"key": "/type/link"}, "title": "Biography"}, {"url": "http://www.youtube.com/watch?v=4lJK9cfXP3c", "type": {"key": "/type/link"}, "title": "Interview on Consciousness TV"}, {"url": "http://www.huffingtonpost.com/catherine-ingram/", "type": {"key": "/type/link"}, "title": "Blog on Huffington Post"}]
My goal is to get only the url and title of each link and put them into a database.
For the moment I have worked only with url, and I did this:
for record in csv.DictReader(open(INPUT_FILE, 'r'), fieldnames=COLUMNS, delimiter='\t'):
    j = json.loads(record['json'])
    result = []
    if 'links' in j:
        for link in j['links']:
            result.append({'url': link['url']})
            a = link['url']
            print(a)
            links = a
    else:
        links = ''
The result is:
http://catherineingram.com/biography.html
http://www.youtube.com/watch?v=4lJK9cfXP3c
http://www.huffingtonpost.com/catherine-ingram/
So that's perfect; I got exactly what I wanted. But the problem now is that when I put links into my database with:
links=a
I get only the last url in my database, not all 3 of them.
So I am trying to get all 3 urls into my database, but I only get the last one.
I hope you can help me with my problem.
Thanks for reading!
PS: if you want more detail on the code, here it is:
for record in csv.DictReader(open(INPUT_FILE, 'r'), fieldnames=COLUMNS, delimiter='\t'):
    j = json.loads(record['json'])
    result = []
    if 'links' in j:
        for link in j['links']:
            result.append({'url': link['url']})
            a = link['url']
            print(a)
            links = a
    else:
        links = ''
    # print(n)
    # links_url.append(n['url'])
    # links_title.append(n['title'])
    c.execute('INSERT INTO AUTHORS VALUES (?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)',
              [record['key'],
               j.get('name'),
               j.get('eastern_order'),
               j.get('personal_name'),
               j.get('enumeration'),
               j.get('title'),
               bio,
               alternate_names,
               uris,
               j.get('location'),
               j.get('birth_date'),
               j.get('death_date'),
               j.get('date'),
               j.get('wikipedia'),
               links
              ])
db.commit()
Thanks for answering me!
My goal is to put all of my urls into my database, so I did this:
for record in csv.DictReader(open(INPUT_FILE, 'r'), fieldnames=COLUMNS, delimiter='\t'):
    j = json.loads(record['json'])
    if 'bio' in j and 'value' in j['bio']:
        bio = j['bio']['value']
    else:
        bio = None
    if 'alternate_names' in j:
        for n in j['alternate_names']:
            alternate_names = n
    else:
        alternate_names = None
    if 'uris' in j:
        for n in j['uris']:
            uris = n
    else:
        uris = None
    if 'links' in j:
        for link in j['links']:
            dico = {'url': link['url']}
            print(dico['url'])
            links = dico['url']
    else:
        links = ''
    # print(n)
    # links_url.append(n['url'])
    # links_title.append(n['title'])
    c.execute('INSERT INTO AUTHORS VALUES (?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)',
              [record['key'],
               j.get('name'),
               j.get('eastern_order'),
               j.get('personal_name'),
               j.get('enumeration'),
               j.get('title'),
               bio,
               alternate_names,
               uris,
               j.get('location'),
               j.get('birth_date'),
               j.get('death_date'),
               j.get('date'),
               j.get('wikipedia'),
               links
              ])
db.commit()
But when I do this, all the other elements (bio, alternate_names, ...) work, but not links, because links needs a different approach: it is a list of dictionaries, like this:
"links": [{"url": "http://catherineingram.com/biography.html", "type": {"key": "/type/link"}, "title": "Biography"}, {"url": "http://www.youtube.com/watch?v=4lJK9cfXP3c", "type": {"key": "/type/link"}, "title": "Interview on Consciousness TV"}, {"url": "http://www.huffingtonpost.com/catherine-ingram/"
For the moment I take only the url element of each dictionary and try to put all the urls of links into my database. It works perfectly when there is only one url, but sometimes there are 2 or 3 urls, and when that happens only the last url ends up in my database. That's my problem.
Thanks!
As I mentioned in my comment on your question, you have some indentation issues, so I am only guessing at what you are trying to achieve. You also assign to variables that are never referenced later in the code shown, so it may well be that they have not been declared at the right level.
for record in csv.DictReader(open(INPUT_FILE, 'r'), fieldnames=COLUMNS, delimiter='\t'):
    j = json.loads(record['json'])
    result = []  # what do you do with variable result? Should this be declared before the 'for record' statement?
    if 'links' in j:
        for link in j['links']:
            result.append({'url': link['url']})
            a = link['url']  # what do you do with variable a?
            print(a)
            links = a  # do you need both variables a and links?
            c.execute('INSERT INTO AUTHORS VALUES (?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)',
                      [record['key'],
                       j.get('name'),
                       j.get('eastern_order'),
                       j.get('personal_name'),
                       j.get('enumeration'),
                       j.get('title'),
                       bio,
                       alternate_names,
                       uris,
                       j.get('location'),
                       j.get('birth_date'),
                       j.get('death_date'),
                       j.get('date'),
                       j.get('wikipedia'),
                       links
                      ])
    else:
        links = ''
        # print(n)
        # links_url.append(n['url'])
        # links_title.append(n['title'])
db.commit()  # should this be moved to follow the c.execute statement rather than doing one commit for all the inserts?
The above now writes multiple rows with identical data but different links, which leads to an unnormalized database. Did you perhaps mean to write out one row whose column contained all 3 links? That, too, would be unnormalized. Again, I am just guessing at what you meant by "I am trying to get all 3 urls into my database, but I only get the last one."
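If normalization is the goal, here is a hedged sketch of the usual fix (the AUTHOR_LINKS table and its columns are hypothetical, not from the question): keep one row per author in AUTHORS, and store each url/title pair in a separate table keyed by the author.
c.execute('''CREATE TABLE IF NOT EXISTS AUTHOR_LINKS
             (author_key TEXT, url TEXT, title TEXT)''')

if 'links' in j:
    for link in j['links']:
        # one row per link, all pointing back at the same author key
        c.execute('INSERT INTO AUTHOR_LINKS VALUES (?, ?, ?)',
                  [record['key'], link.get('url'), link.get('title')])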

Creating a column for each output received in one field in Python

I am implementing emotion analysis using an LSTM. I have already finished my training model as well as my prediction part, but all of my predictions appear in one column, as I will show below.
Here is my code:
with open('output1.json', 'w') as f:
    json.dump(new_data, f)

selection1 = new_data['selection1']

# creating empty lists to be able to create a dataframe
names = []
dates = []
commentss = []
labels = []
hotelname = []

for item in selection1:
    name = item['name']
    hotelname.append(name)
    # print('>>>>>>>>>>>>>>>>>> ', name)
    Date = item['reviews']
    for d in Date:
        names.append(name)
        # convert date from 'January 12, 2020' to 2020-01-12
        date = pd.to_datetime(d['date']).strftime("%Y-%m-%d")
        # adding date to the empty list dates[]
        dates.append(date)
        # print('>>>>>>>>>>>>>>>>>> ', date)
    CommentID = item['reviews']
    for com in CommentID:
        comment = com['review']
        lcomment = comment.lower()  # converting all to lowercase
        result = re.sub(r'\d+', '', lcomment)  # remove numbers
        results = (result.translate(
            str.maketrans('', '', string.punctuation))).strip()  # remove punctuation and whitespace
        comments = remove_stopwords(results)
        commentss.append(comment)
        # print('>>>>>>', comments)
        # add the words in comments that are already present in the keys of the dictionary
        encoded_samples = [[word2id[word] for word in comments if word in word2id.keys()]]
        # Padding
        encoded_samples = keras.preprocessing.sequence.pad_sequences(encoded_samples, maxlen=max_words)
        # Make predictions
        label_probs, attentions = model_with_attentions.predict(encoded_samples)
        label_probs = {id2label[_id]: prob for (label, _id), prob in zip(label2id.items(), label_probs[0])}
        labels.append(label_probs)

# creating the dataframe
dataframe = {'name': names, 'date': dates, 'comment': commentss, 'classification': labels}
table = pd.DataFrame(dataframe, columns=['name', 'date', 'comment', 'classification'])
json = table.to_json('hotel.json', orient='records')  # note: assigning to 'json' shadows the imported json module
Here are the results I obtain:
[
    {
        "name": "Radisson Blu Azuri Resort & Spa",
        "date": "February 02, 2020",
        "comment": [
            "enjoy",
            "daily",
            "package",
            "start",
            "welcoming",
            "end",
            "recommend",
            "hotel"
        ],
        "label": {
            "joy": 0.0791392997,
            "surprise": 0.0002606699,
            "love": 0.4324670732,
            "sadness": 0.2866959572,
            "fear": 0.0002588668,
            "anger": 0.2011781186
        }
    },
You can find the complete output at this link: https://jsonblob.com/a9b4035c-5576-11ea-afe8-1d95b3a2e3fd
Is it possible to break the label field into separate fields, like below?
[
    {
        "name": "Radisson Blu Azuri Resort & Spa",
        "date": "February 02, 2020",
        "comment": [
            "enjoy",
            "daily",
            "package",
            "start",
            "welcoming",
            "end",
            "recommend",
            "hotel"
        ],
        "joy": 0.0791392997,
        "surprise": 0.0002606699,
        "love": 0.4324670732,
        "sadness": 0.2866959572,
        "fear": 0.0002588668,
        "anger": 0.2011781186
    },
Can someone please explain how I need to modify my code to make this possible?
If you can't do it before you produce the result, you can easily manipulate the resulting dictionary like so (note that the key must match your output, which uses "label"):
def move_labels_to_dict_root(result):
    labels = result["label"]  # the per-emotion scores shown in your output
    meta_data = result
    del meta_data["label"]
    result = {**meta_data, **labels}
    return result
and then call move_labels_to_dict_root in a list comprehension, like [move_labels_to_dict_root(result) for result in results].
However, I would ask why you want to do this?
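If pandas is already in the pipeline, another option (a sketch, not from the original answer, assuming table is the DataFrame built in the question) is pandas.json_normalize, which expands the per-row classification dict into one column per emotion before writing the JSON:
import pandas as pd

# Drop the nested column, normalize it into flat columns, and glue the two back together.
flat = pd.concat(
    [table.drop(columns=['classification']),
     pd.json_normalize(table['classification'].tolist())],
    axis=1,
)
flat.to_json('hotel.json', orient='records')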

Dictionary from a string with a particular structure

I am using Python 3 to read this file and convert it to a dictionary.
The file contents look like the string below, and I would like to know how I could create a dictionary from it.
[User]
Date=10/26/2003
Time=09:01:01 AM
User=teodor
UserText=Max Cor
UserTextUnicode=392039n9dj90j32
[System]
Type=Absolute
Dnumber=QS236
Software=1.1.1.2
BuildNr=0923875
Source=LAM
Column=OWKD
[Build]
StageX=12345
Spotter=2
ApertureX=0.0098743
ApertureY=0.2431899
ShiftXYZ=-4.234809e-002
[Text]
Text=Here is the Text files
DataBaseNumber=The database number is 918723
..... (There are more than 1000 lines per file) ...
In the text I have lines of the form "Name=Something", and I would like to convert them as follows:
{'Date': '10/26/2003',
 'Time': '09:01:01 AM',
 'User': 'teodor',
 'UserText': 'Max Cor',
 'UserTextUnicode': '392039n9dj90j32', ...}
The words between [ ] can be removed, like [User], [System], [Build], [Text], etc.
For some fields only the key part of the string is present:
[Colors]
Red=
Blue=
Yellow=
DarkBlue=
What you have is an ordinary properties file. You can use this example (Java) to read the values into a map:
try (InputStream input = new FileInputStream("your_file_path")) {
    Properties prop = new Properties();
    prop.load(input);
    // prop.getProperty("User") == "teodor"
} catch (IOException ex) {
    ex.printStackTrace();
}
EDIT:
For a Python solution, see below.
You can use configparser to read .ini or .properties files (the format you have):
import configparser
config = configparser.ConfigParser()
config.read('your_file_path')
# config['User'] == {'Date': '10/26/2003', 'Time': '09:01:01 AM', ...}
# config['User']['User'] == 'teodor'
# config['System'] == {'Type': 'Absolute', ...}
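Since the question asked for the section names to be discarded, here is a small hedged addition (not part of the original answer) that merges every section into one flat dictionary:
import configparser

config = configparser.ConfigParser()
config.read('your_file_path')

# Merge all sections into a single dict, dropping the [Section] headers.
flat = {}
for section in config.sections():
    flat.update(dict(config[section]))
print(flat)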
This can easily be done in plain Python. Assuming your file is named test.txt, the following also handles lines with nothing after the = as well as lines containing multiple = signs.
d = {}
with open('test.txt', 'r') as f:
    for line in f:
        line = line.strip()  # Remove any space or newline characters
        parts = line.split('=')  # Split around the '='
        if len(parts) > 1:
            # Re-join with '=' so values that themselves contain '=' survive
            d[parts[0]] = '='.join(parts[1:])
print(d)
Output:
{
"Date": "10/26/2003",
"Time": "09:01:01 AM",
"User": "teodor",
"UserText": "Max Cor",
"UserTextUnicode": "392039n9dj90j32",
"Type": "Absolute",
"Dnumber": "QS236",
"Software": "1.1.1.2",
"BuildNr": "0923875",
"Source": "LAM",
"Column": "OWKD",
"StageX": "12345",
"Spotter": "2",
"ApertureX": "0.0098743",
"ApertureY": "0.2431899",
"ShiftXYZ": "-4.234809e-002",
"Text": "Here is the Text files",
"DataBaseNumber": "The database number is 918723"
}
I would suggest doing some cleaning first to get rid of the [...] lines. After that you can split each remaining line on the "=" separator and convert the pairs into a dictionary.
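A short sketch of that suggestion (assuming the same test.txt as above): skip the section headers, then split each remaining line on the first = only.
d = {}
with open('test.txt', 'r') as f:
    for line in f:
        line = line.strip()
        if not line or line.startswith('['):  # drop blank lines and [Section] headers
            continue
        key, _, value = line.partition('=')  # split on the first '=' only
        d[key] = value
print(d)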

Python Dictionary: Key with two values possible?

For a test program I'm crawling a webpage. I'd like to crawl all activities for specific IDs, which are associated with their respective cities.
For example, my initial code:
RegionIDArray = {522: "London", 4745: "London", 2718: "London", 3487: "Tokio"}
I'm now wondering whether it's possible to group all IDs (keys) that are related to, e.g., London under one entry:
RegionIDArray = {522, 4745, 2718: "London"}
When I try this, I get no results.
My full code so far:
import requests
from bs4 import BeautifulSoup

RegionIDArray = {522: "London", 4745: "London", 2718: "London", 3487: "Tokio"}
already_printed = set()

for reg in RegionIDArray:
    r = requests.get("https://www.getyourguide.de/-l" + str(reg) + "/")
    soup = BeautifulSoup(r.content, "lxml")
    g_data = soup.find_all("span", {"class": "intro-title"})
    for item in g_data:
        POI_final = str(item.text)
        end_final = "POI: " + POI_final
        if end_final not in already_printed:
            print(end_final)
            already_printed.add(end_final)
Is there any smart way to do this? I appreciate any feedback.
You can do this in 2 steps:
1. Create a dictionary mapping locations to lists of IDs.
2. Reverse this dictionary, taking care to ensure your keys are hashable.
The first step is optimally processed via collections.defaultdict.
For the second step, you can use either tuple or frozenset. I opt for the latter, since it is not clear that ordering is relevant.
from collections import defaultdict

RegionIDArray = {522: "London", 4745: "London", 2718: "London", 3487: "Tokio"}

d = defaultdict(list)
for k, v in RegionIDArray.items():
    d[v].append(k)

res = {frozenset(v): k for k, v in d.items()}
print(res)
{frozenset({522, 2718, 4745}): 'London',
frozenset({3487}): 'Tokio'}
You can use itertools.groupby:
import itertools

RegionIDArray = {522: "London", 4745: "London", 2718: "London", 3487: "Tokio"}

# sort by city first so groupby sees equal keys adjacently
new_results = {
    tuple(c for c, _ in b): a
    for a, b in itertools.groupby(
        sorted(RegionIDArray.items(), key=lambda x: x[-1]),
        key=lambda x: x[-1],
    )
}
Output:
{(3487,): 'Tokio', (4745, 522, 2718): 'London'}
What you can do is build a reverse lookup table from the values to all matching keys, like so:
def reverse(ids):
    table = {}
    for key in ids:
        if ids[key] not in table:
            table[ids[key]] = []
        table[ids[key]].append(key)
    return table
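A short usage sketch (not from the original answer) showing the reversed table and how the crawler loop could consume it:
RegionIDArray = {522: "London", 4745: "London", 2718: "London", 3487: "Tokio"}
lookup = reverse(RegionIDArray)
print(lookup)  # {'London': [522, 4745, 2718], 'Tokio': [3487]} on Python 3.7+

# The crawl can then go city by city:
for city, ids in lookup.items():
    for reg in ids:
        pass  # fetch https://www.getyourguide.de/-l<reg>/ here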
