Split a paragraph into lines - python

I have this paragraph in a variable:
"Information about this scan : abc version : 5.2.5 pqr version : 201 403061815 hello kdshfldfs;dfkfjljcsdlc sljc lsjclsj csjclks cscjsld"
I want to fetch 'abc version' and 'pqr version'.
How can I achieve that?

You can do it as follows:
string = "Your Paragraph String"
string = string.split()
abc_version = string[string.index('abc')+3]
and then either
pqr_version = string[string.index('pqr')+3] # this gives 201
or
pqr_version = ' '.join(string[string.index('pqr')+3:string.index('pqr')+5]) # this gives 201 403061815
Please specify where your pqr version string starts and ends.
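If the token positions are not guaranteed, a regular expression keyed on the labels may be more robust. The following is only a sketch: it assumes the pqr version is a run of space-separated digit groups (so it stops before "hello"), which the question does not confirm, and the paragraph is shortened for illustration.

```python
import re

# Shortened copy of the paragraph from the question.
paragraph = ("Information about this scan : abc version : 5.2.5 "
             "pqr version : 201 403061815 hello kdshfldfs;dfkfjljcsdlc")

# abc version: the first non-space token after the label and colon.
abc_match = re.search(r"abc version\s*:\s*(\S+)", paragraph)
# pqr version: ASSUMED to be one or more digit groups separated by single spaces.
pqr_match = re.search(r"pqr version\s*:\s*(\d+(?: \d+)*)", paragraph)

abc_version = abc_match.group(1) if abc_match else None
pqr_version = pqr_match.group(1) if pqr_match else None

print(abc_version)
print(pqr_version)
```

Checking the match for None before calling group() avoids an AttributeError when a label is missing from the paragraph.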

Related

Separating text/text processing using regex

I have a paragraph that needs to be separated by a certain list of keywords.
Here is the text (a single string):
"Evaluation Note: Suspected abuse by own mother. Date 3/13/2019 ID: #N/A Contact: Not Specified Name: Cecilia Valore Address: 189 West Moncler Drive Home Phone: 353 273 400 Additional Information: Please tell me when the mother arrives, we will have a meeting with her next Monday, 3/17/2019 Author: social worker"
So I want to separate this paragraph based on the variable names using python. "Evaluation Note", "Date","ID","Contact","Name","Address","Home Phone","Additional Information" and "Author" are the variable names. I think using regex seems nice but I don't have a lot of experience in regex.
Here is what I am trying to do:
import re
regex = r"Evaluation Note(?:\:)? (?P<note>\D+) Date(?:\:)? (?P<date>\D+)
ID(?:\:)? (?P<id>\D+) Contact(?:\:)? (?P<contact>\D+)Name(?:\:)? (? P<name>\D+)"
test_str = "Evaluation Note: Suspected abuse by own mother. Date 3/13/2019
ID: #N/A Contact: Not Specified Name: Cecilia Valore "
matches = re.finditer(regex, test_str, re.MULTILINE)
But it doesn't find any matches.
You can probably generate that regex on the fly, as long as the order of the params is fixed.
Here is my try at it; it does the job. The actual regex it is shooting for is something like Some Key(?P<some_key>.*)Some Other Key(?P<some_other_key>.*), and so on.
import re
test_str = r'Evaluation Note: Suspected abuse by own mother. Date 3/13/2019 ID: #N/A Contact: Not Specified Name: Cecilia Valore '
keys = ['Evaluation Note', 'Date', 'ID', 'Contact', 'Name']
def find(keys, string):
    keys = [(key, key.replace(' ', '_')) for key in keys] # spaces aren't valid in group names
    pattern = ''.join([f'{key}(?P<{name}>.*)' for key, name in keys]) # generate the actual regex
    for groups in re.findall(pattern, string): # search the argument, not the global test_str
        for item in groups:
            yield item.strip(':').strip() # clean up the result
for value in find(keys, test_str):
    print(value)
Which returns:
Suspected abuse by own mother.
3/13/2019
#N/A
Not Specified
Cecilia Valore
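A variation on the same idea, offered only as a sketch: use non-greedy groups and return the values in a dict keyed by group name. The function name extract_fields is made up for illustration, and the approach still assumes the keys appear in the given order and do not occur inside the values.

```python
import re

def extract_fields(keys, string):
    """Build one pattern with a named, non-greedy group per key."""
    parts = []
    for key in keys:
        name = key.replace(" ", "_")  # group names cannot contain spaces
        # Optional colon after the key, then the value up to the next key.
        parts.append(rf"{re.escape(key)}:?\s*(?P<{name}>.*?)\s*")
    match = re.search("".join(parts) + "$", string, re.S)
    return match.groupdict() if match else {}

test_str = ("Evaluation Note: Suspected abuse by own mother. Date 3/13/2019 "
            "ID: #N/A Contact: Not Specified Name: Cecilia Valore ")
keys = ["Evaluation Note", "Date", "ID", "Contact", "Name"]

fields = extract_fields(keys, test_str)
for name, value in fields.items():
    print(name, "->", value)
```

This also hints at why the pattern in the question found nothing: \D+ cannot match values that contain digits, such as the date, and "(? P<name>" has a stray space inside the group syntax.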
You can use search to get locations of variables and parse text accordingly. You can customize it easily.
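import re
# text is the paragraph from the question
text = "Evaluation Note: Suspected abuse by own mother. Date 3/13/2019 ID: #N/A Contact: Not Specified Name: Cecilia Valore Address: 189 West Moncler Drive Home Phone: 353 273 400 Additional Information: Please tell me when the mother arrives, we will have a meeting with her next Monday, 3/17/2019 Author: social worker"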
en = re.compile('Evaluation Note:').search(text)
print(en.group())
d = re.compile('Date').search(text)
print(text[en.end()+1: d.start()-1])
print(d.group())
i_d = re.compile('ID:').search(text)
print(text[d.end()+1: i_d.start()-1])
print(i_d.group())
c = re.compile('Contact:').search(text)
print(text[i_d.end()+1: c.start()-1])
print(c.group())
n = re.compile('Name:').search(text)
print(text[c.end()+1: n.start()-1])
print(n.group())
ad = re.compile('Address:').search(text)
print(text[n.end()+1: ad.start()-1])
print(ad.group())
p = re.compile('Home Phone:').search(text)
print(text[ad.end()+1: p.start()-1])
print(p.group())
ai = re.compile('Additional Information:').search(text)
print(text[p.end()+1: ai.start()-1])
print(ai.group())
aut = re.compile('Author:').search(text)
print(text[ai.end()+1: aut.start()-1])
print(aut.group())
print(text[aut.end()+1:])
This will output the following (shown here with each label and its value joined on one line):
Evaluation Note: Suspected abuse by own mother.
Date 3/13/2019
ID: #N/A
Contact: Not Specified
Name: Cecilia Valore
Address: 189 West Moncler Drive
Home Phone: 353 273 400
Additional Information: Please tell me when the mother arrives, we will have a meeting with her next Monday, 3/17/2019
Author: social worker
I hope this helps
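The repeated search calls above can be folded into one loop. This is a sketch rather than a definitive implementation: split_by_keywords is a hypothetical helper, and it assumes that no keyword also appears as a whole word inside one of the values.

```python
import re

def split_by_keywords(text, keywords):
    """Map each keyword to the text between it and the next keyword."""
    # Longest keywords first so a longer label wins over a shorter overlapping one.
    ordered = sorted(keywords, key=len, reverse=True)
    alternation = "|".join(re.escape(k) for k in ordered)
    pattern = re.compile(rf"\b({alternation})\b:?")
    matches = list(pattern.finditer(text))
    fields = {}
    for current, following in zip(matches, matches[1:] + [None]):
        end = following.start() if following else len(text)
        fields[current.group(1)] = text[current.end():end].strip()
    return fields

text = ("Evaluation Note: Suspected abuse by own mother. Date 3/13/2019 "
        "ID: #N/A Contact: Not Specified Name: Cecilia Valore "
        "Address: 189 West Moncler Drive Home Phone: 353 273 400 "
        "Additional Information: Please tell me when the mother arrives, "
        "we will have a meeting with her next Monday, 3/17/2019 "
        "Author: social worker")
keywords = ["Evaluation Note", "Date", "ID", "Contact", "Name", "Address",
            "Home Phone", "Additional Information", "Author"]

result = split_by_keywords(text, keywords)
for key in keywords:
    print(f"{key}: {result[key]}")
```

Unlike a single fixed pattern, this does not depend on the order of the fields, but a keyword that shows up inside a value would split that value in two.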

Find and replace the parameter of MediaWiki template

I am creating a small script to replace the parameter of MediaWiki Template. There are two types of MediaWiki Template form:
First (inline):
{{Infobox|name = ABC |work = ABC |year = 1021 }}
Second (non-inline):
{{Infobox
|name = ABC
|work = ABC
|year = 1021
}}
Now I want to replace the name with CBA:
{{Infobox
|name = CBA
|work = ABC
|year = 1021
}}
I have three variables in the Python script:
param = sheet.cell_value(i + 1, 1)
value = sheet.cell_value(i + 1, 2)
template = sheet.cell_value(i + 1, 3)
Here template = Infobox, param = name, value= CBA
I did some searches on Google and found that it can be done with regex. Let's store the template content in the text variable. So how do we find and replace it?
Please keep in mind that a MediaWiki template may be in either form (inline or non-inline), and the script should not replace the same value in other parameters.
I don't know if this helps:
msg = re.sub(r"^(.*name\s*=\s*)[A-Za-z0-9]+(.*)$", r"\1CBA\2", msg, flags=re.S)
Explanation:
The code replaces the contents of msg with the first captured group, then CBA, then the second captured group, so only the value after "name =" changes.
Here is my test-case:
import re
pattern = r"name\s*=\s*([A-Za-z0-9]+)"
msg = '{{Infobox|name = ABC |work = ABC |year = 1021 }}'
print(msg)
msg_long = '''{{Infobox
|name = ABC
|work = ABC
|year = 1021
}}'''
msg = re.sub(r"^(.*name\s*=\s*)[A-Za-z0-9]+(.*)$", r"\1CBA\2", msg, flags=re.S)
print(msg)
print(msg_long)
msg_long = re.sub(r"^(.*name\s*=\s*)[A-Za-z0-9]+(.*)$", r"\1CBA\2", msg_long, flags=re.S)
print(msg_long)
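To use the template, param and value variables from the question instead of hard-coding "name" and "CBA", something along these lines might work. It is a sketch under stated assumptions: replace_template_param is a hypothetical helper, and it assumes parameter values contain no "|", "{" or "}" characters, so nested templates and piped links inside values are not handled.

```python
import re

def replace_template_param(text, template, param, value):
    """Replace the value of one parameter in the first matching template."""
    pattern = re.compile(
        # "{{Template", then anything up to "|param =" (spaces/tabs after "=").
        r"(\{\{\s*" + re.escape(template) + r"\b[^{}]*?\|\s*"
        + re.escape(param) + r"\s*=[ \t]*)"
        # The old value: everything up to the next "|" or "}}".
        r"[^|{}]*?"
        r"(\s*(?:\||\}\}))"
    )
    # A function replacement keeps backslashes in value from being interpreted.
    return pattern.sub(lambda m: m.group(1) + value + m.group(2), text, count=1)

inline = "{{Infobox|name = ABC |work = ABC |year = 1021 }}"
multiline = "{{Infobox\n|name = ABC\n|work = ABC\n|year = 1021\n}}"

new_inline = replace_template_param(inline, "Infobox", "name", "CBA")
new_multiline = replace_template_param(multiline, "Infobox", "name", "CBA")
print(new_inline)
print(new_multiline)
```

Because the pattern is anchored on "|param =", the work parameter keeps its ABC value even though it is the same string. For wikitext with nested templates, a dedicated parser is likely a safer choice than a regex.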

Split and save text string on scrapy

I need to split a substring from a string. This is the source text:
Article published on: Tutorial
I want to delete "Article published on:" and leave only
Tutorial
so I can save it.
I tried with:
category = items[1]
category.split('Article published on:','')
and with
for p in articles:
bodytext = p.xpath('.//text()').extract()
joined_text = ''
# loop in categories
for each_text in text:
stripped_text = each_text.strip()
if stripped_text:
# all the categories together
joined_text += ' ' + stripped_text
joined_text = joined_text.split('Article published on:','')
items.append(joined_text)
if not is_phrase:
title = items[0]
category = items[1]
print('title = ', title)
print('category = ', category)
and this doesn't work. What am I missing?
The error with this code:
TypeError: 'str' object cannot be interpreted as an integer
You probably just forgot to assign the result:
category = category.replace('Article published on:', '')
Also, it seems that you meant to use replace instead of split: the second argument of split is maxsplit, which must be an integer, hence the TypeError. split also works, though:
category = category.split(':')[1]
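Both variants leave the space that follows the colon, so a strip() is probably wanted as well. A minimal sketch, using the sample string from the question:

```python
category = "Article published on: Tutorial"
prefix = "Article published on:"

# Variant 1: remove the prefix, then trim the leftover whitespace.
by_replace = category.replace(prefix, "").strip()

# Variant 2: split once at the first colon and keep the right-hand part.
by_split = category.split(":", 1)[1].strip()

print(by_replace)
print(by_split)
```

Passing 1 as maxsplit keeps any later colons in the category text intact.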

My POST method returns (u'') and Django includes the (u'') string when saving

This is how I retrieve the POST data from the webpage. The Person model can be saved, but the saved value includes the "(u'')" wrapper. For example, if I change the first name to "Alex", it gets the raw value (u'Alex',) and saves that.
def submit_e(req, person_id=None):
    if(req.POST):
        try:
            person_id = req.POST['driver']
            person = Person.objects.get(pk=person_id)
            person.firstname = req.POST['firstname'],
            person.midname = req.POST['middleinitial'],
            person.lastname = req.POST['lastname'],
            person.full_clean()
            person.save()
        except Exception as e:
            print e
    return HttpResponseRedirect(reverse('users:user_main'))
NB: the following is my best guess at what you are seeing based on your question. If I have guessed wrongly, then please update your post with more details. Putting print statements throughout your code and adding the output to your post would be a good start.
The u prefix on a string indicates a Unicode string. It is not actually part of the contents of the string. If we create a string in the interpreter:
>>> name = u'Me'
and then request details of the string,
>>> name
u'Me'
then the u is shown as it is part of the information about the string, which is what we have requested. If we print the contents of the string
>>> print name
Me
then the u is not shown (just like the quotes aren't shown).
Using the interpreter to try and reproduce your problem, I created a new user with a Unicode string for a username:
>>> from django.contrib.auth.models import User
>>> new_user = User()
>>> new_user.username = u'Me'
>>> new_user.save()
And as before, if we request the details about the string we see the u and the quotes, but if we print the contents of the string we don't:
>>> new_user.username
u'Me'
>>> print new_user.username
Me
To further confirm the u was not stored, we can explore the database directly:
sqlite> select username from auth_user;
Me
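You need to delete the "," at the end of each assignment. A trailing comma turns the right-hand side into a one-element tuple, which is why the saved value looks like (u'Alex',).
So, before: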
person.firstname = req.POST['firstname'],
person.midname = req.POST['middleinitial'],
person.lastname = req.POST['lastname'],
After:
person.firstname = req.POST['firstname']
person.midname = req.POST['middleinitial']
person.lastname = req.POST['lastname']
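The effect of the trailing comma can be reproduced without Django. In this sketch the post dict is a hypothetical stand-in for req.POST, and the code is written for Python 3, where the u prefix no longer appears in the repr:

```python
# Hypothetical stand-in for req.POST.
post = {"firstname": "Alex"}

with_comma = post["firstname"],     # trailing comma: a one-element tuple
without_comma = post["firstname"]   # no comma: the plain string

print(type(with_comma).__name__, repr(with_comma))
print(type(without_comma).__name__, repr(without_comma))
```

Assigning the tuple to a character field means its string representation is what ends up in the database.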

Python - sending unicode characters (prefixed with \u) in an HTTP POST request

I'm writing a program which fetches and edits articles on Wikipedia, and I'm having a bit of trouble handling Unicode characters prefixed with \u. I've tried .encode("utf8"), but it doesn't seem to do the trick here. How can I properly encode these values prefixed with \u to POST to Wikipedia? See this edit for my problem.
Here is some code:
To get the page:
url = "http://en.wikipedia.org/w/api.php?action=query&format=json&titles="+urllib.quote(name)+"&prop=revisions&rvprop=content"
articleContent = ClientCookie.urlopen(url).read().split('"*":"')[1].split('"}')[0].replace("\\n", "\n").decode("utf-8")
Before I POST the page:
data = dict([(key, value.encode('utf8')) for key, value in data.iteritems()])
data["text"] = data["text"].replace("\\", "")
editInfo = urllib2.Request("http://en.wikipedia.org/w/api.php", urllib.urlencode(data))
You are downloading JSON data without decoding it. Use the json library for that:
import json
articleContent = ClientCookie.urlopen(url)
data = json.load(articleContent)
JSON-encoded data looks a lot like Python and uses \u escaping as well, but it is in fact a subset of JavaScript.
The data variable now holds a deep datastructure. Judging by the string splitting, you wanted this piece:
articleContent = data['query']['pages'].values()[0]['revisions'][0]['*']
Now articleContent is an actual unicode() instance; it is the revision text of the page you were looking for:
>>> print u'\n'.join(data['query']['pages'].values()[0]['revisions'][0]['*'].splitlines()[:20])
{{For|the game|100 Bullets (video game)}}
{{GOCEeffort}}
{{italic title}}
{{Supercbbox <!--Wikipedia:WikiProject Comics-->
| title =100 Bullets
| image =100Bullets vol1.jpg
| caption = Cover to ''100 Bullets'' vol. 1 "First Shot, Last Call". Cover art by Dave Johnson.
| schedule = Monthly
| format =
|complete=y
|Crime = y
| publisher = [[Vertigo (DC Comics)|Vertigo]]
| date = August [[1999 in comics|1999]] – April [[2009 in comics|2009]]
| issues = 100
| main_char_team = [[Agent Graves]] <br/> [[Mr. Shepherd]] <br/> The Minutemen <br/> [[List of characters in 100 Bullets#Dizzy Cordova (also known as "The Girl")|Dizzy Cordova]] <br/> [[List of characters in 100 Bullets#Loop Hughes (also known as "The Boy")|Loop Hughes]]
| writers = [[Brian Azzarello]]
| artists = [[Eduardo Risso]]<br>Dave Johnson
| pencillers =
| inkers =
| colorists = Grant Goleash<br>[[Patricia Mulvihill]]
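For readers on Python 3, the same decode-then-encode round trip might look like the sketch below. The raw payload, the page id "123" and the sample text are made up for illustration and only mimic the shape of the response described above; no request is sent.

```python
import json
from urllib.parse import urlencode

# Hypothetical response body: \u escapes are still JSON escapes at this point.
raw = ('{"query": {"pages": {"123": {"revisions": '
       '[{"*": "caf\\u00e9 \\u2013 100 Bullets"}]}}}}')

data = json.loads(raw)  # decoding turns the \u escapes into real characters

# The page id is not known in advance, so take the first (only) page.
page = next(iter(data["query"]["pages"].values()))
article_content = page["revisions"][0]["*"]

# urlencode percent-encodes str values as UTF-8 by default.
body = urlencode({"text": article_content}).encode("ascii")

print(article_content)
print(body)
```

Once the JSON is decoded properly there are no literal backslashes left in the text, so stripping "\\" before posting should no longer be needed.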
