The date document is written like below:
1060301 1030727 1041201 1060606 1060531 1060629 1060623 1060720
...and some of them look like...
831008 751125 1060110 890731 700815 731022 1010724 980116
These represent dates in the format:
Year (2~3 characters) / Month (2 characters) / Day (2 characters)
And some entries are blank because of missing data.
Is there a way to read this data into a proper date format?
Reading 1060301, I assume that means year 106, month 03, and day 01, so I handle the different-length numbers differently with this:
valuelist = []
value = ''
date = ''
file = open('testfile.txt', 'r+')
filetowriteto = open('OUTPUTFILE', 'a+')
for line in file:
    for char in line:
        if char in ' \n':  # a space or the line break ends the current number
            if len(value) == 6:
                date = value[0:2] + '/' + value[2:4] + '/' + value[4:]
                valuelist.append(date)
            elif len(value) == 7:
                date = value[:3] + '/' + value[3:5] + '/' + value[5:]
                valuelist.append(date)
            value = ''
            date = ''
        else:
            value += char
for t in valuelist:
    filetowriteto.write(t + ' ')
file.close()
filetowriteto.close()
Please don't hesitate to comment about anything.
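Since the month and day are always the final four digits, one simpler variant (a sketch, not from the original post) slices each value from the right, which handles the 6- and 7-digit cases with the same code:

def to_date(value):
    # month and day are the fixed-width last four digits;
    # whatever precedes them (2 or 3 digits) is the year
    year, month, day = value[:-4], value[-4:-2], value[-2:]
    return year + '/' + month + '/' + day

print(to_date('1060301'))  # 106/03/01
print(to_date('831008'))   # 83/10/08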
Say I have hundreds of text files like this example :
NAME
John Doe
DATE OF BIRTH
1992-02-16
BIO
THIS is
a PRETTY
long sentence
without ANY structure
HOBBIES
//..etc..
NAME, DATE OF BIRTH, BIO, and HOBBIES (and others) are always there, but text content and the number of lines between them can sometimes change.
I want to iterate through the file and store the string between each of these keys. For example, a variable called Name should contain the value stored between 'NAME' and 'DATE OF BIRTH'.
This is what I came up with:
lines = f.readlines()
for line_number, line in enumerate(lines):
    if "NAME" in line:
        name = lines[line_number + 1]  # In all files, Name is one line long.
    elif "DATE OF BIRTH" in line:
        date = lines[line_number + 2]  # Date is also always two lines after
    elif "BIO" in line:
        for x in range(line_number + 1, line_number + 20):  # Length of other data can be randomly bigger
            if "HOBBIES" not in lines[x]:
                bio += lines[x]
            else:
                break
    elif "HOBBIES" in line:
        #...
This works well enough, but I feel like instead of using many double loops, there must be a smarter and less hacky way to do it.
I'm looking for a general solution where NAME would store everything until DATE OF BIRTH, and BIO would store everything until HOBBIES, etc., with the intention of cleaning up and removing extra white lines later.
Is it possible?
Edit: While I was reading through the answers, I realized I had forgotten a really significant detail: the keys will sometimes be repeated (in the same order).
That is, a single text file can contain more than one person. A list of persons should be created. The key Name signals the start of a new person.
I did it by storing everything in a dictionary; see the code below.
f = open("test.txt")
lines = f.readlines()
dict_text = {"NAME": [], "DATEOFBIRTH": [], "BIO": []}
for line_number, line in enumerate(lines):
    if not ("NAME" in line or "DATE OF BIRTH" in line or "BIO" in line):
        text = line.replace("\n", "")
        dict_text[location].append(text)
    else:
        location = "".join(line.split())
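Since the edit says the keys can repeat, here is a sketch that collects one dict per person (the parse_people helper is hypothetical; it assumes a key line contains nothing but the key, and that NAME starts each new record):

keys = {"NAME", "DATE OF BIRTH", "BIO", "HOBBIES"}

def parse_people(lines):
    people = []
    person = None
    field = None
    for line in lines:
        line = line.strip()
        if line in keys:
            if line == "NAME":
                person = {}  # NAME signals a new person
                people.append(person)
            field = line
        elif line and person is not None:
            # accumulate content lines under the current key
            person[field] = (person.get(field, "") + " " + line).strip()
    return people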
You could use a regular expression:
import re

keys = """
NAME
DATE OF BIRTH
BIO
HOBBIES
""".strip().splitlines()

key_pattern = '|'.join(key.strip() for key in keys)
pattern = re.compile(fr'^({key_pattern})', re.M)
# uncomment to see the pattern
# print(pattern)

with open(filename) as f:
    text = f.read()

parts = pattern.split(text)
... process parts ...
parts will be a list of strings. The odd-indexed positions (parts[1], parts[3], ...) will be the keys ('NAME', etc.) and the even-indexed positions (parts[2], parts[4], ...) will be the text between the keys. parts[0] will be whatever came before the first key.
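For example, one way to pair each key with the text that follows it (a sketch, assuming each key appears once):

records = {key: chunk.strip() for key, chunk in zip(parts[1::2], parts[2::2])}
# records.get('NAME') now holds everything between NAME and DATE OF BIRTH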
Instead of reading lines you could read the whole file in as one long string. Use string.index() to find the start index of your trigger words, then set everything from that index to the next trigger word's index to a variable.
Something like:
def get_data(string, previous_end_index, current_start_index):
    usable_data = string[previous_end_index:current_start_index]
    return usable_data

string = f.read()  # str(f) would give the file object's repr, not its contents
important_words = ['NAME', 'DATE OF BIRTH']
last_phrase = None
for phrase in important_words:
    phrase_start = string.index(phrase)
    phrase_end = phrase_start + len(phrase)
    if last_phrase is not None:
        get_data(string, last_phrase, phrase_start)
    last_phrase = phrase_end
Better/shorter variable names should probably be used
You can just read the text in as one long string and then make use of .split().
This will only work if the categories are in order and don't repeat.
Like so:
Categories = ["NAME", "DATE OF BIRTH", "BIO"]  # in the order they appear in the text
Output = {}
Text = f.read()  # read the whole file as one string (str(f) would not give the contents)
for i in range(1, len(Categories)):
    SplitText = Text.split(Categories[i])
    Output.update({Categories[i - 1]: SplitText[0]})
    Text = SplitText[1]
Output.update({Categories[-1]: Text})
You can try the following.
keys = ["NAME", "DATE OF BIRTH", "BIO", "HOBBIES"]
f = open("data.txt", "r")
result = {}
for line in f:
    line = line.strip('\n')
    if any(v in line for v in keys):
        last_key = line
    else:
        result[last_key] = result.get(last_key, "") + line
print(result)
Output
{'NAME': 'John Doe', 'DATE OF BIRTH': '1992-02-16', 'BIO ': 'THIS is a PRETTY long sentence without ANY structure ', 'HOBBIES ': '//..etc..'}
I have a 4-column tab-separated text file. I also have a list of values that need to be iterated through and searched for in the text file to get the value of one of the columns:
Here's my code (Python 2.7):
def populate_data():
    file = open('file.txt', 'r')
    values = ['value1', 'value2', 'value3']
    secondary_values = ['second_value1', 'second_value2', 'second_value3']
    os = 'iOS'
    i = 0
    outputs = []
    while i < len(values):
        value = values[i]
        secondary_value = secondary_values[i]
        output = lookup(file, os, value, secondary_value)
        if output is not None:
            outputs.append(output)
        i += 1

def lookup(file, input_os, input_value, input_secondary_value):
    for line in file:
        columns = line.strip().split('\t')
        if len(columns) != 4:
            continue
        else:
            value = columns[0]
            secondary_value = columns[1]
            os = columns[2]
            output = columns[3]
            if input_os == os and input_value == value and input_secondary_value == secondary_value:
                return output
The search basically should work as this SQL statement:
SELECT output FROM data_set WHERE os='os' AND value='value' and secondary_value='secondary_value'
The problem I'm experiencing is that the lookup method is called inside the while loop and itself runs a for loop over the file, and the parent while loop apparently doesn't wait for the inner loop to finish and return the value before continuing. The result is that, despite there being a match, no data is returned. If this were JavaScript I would do it with Promises, but I'm not sure how to achieve that in Python.
Any clues how this could be solved?
The correct thing to do here was to read the file once and insert all of the rows into a dict, like so:
dc = dict()
dc[value+secondary_value+os] = output
Then accessing the values in the main while loop.
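A fuller sketch of that idea (the helper name and the tuple key are illustrative, not from the original answer):

def build_index(path):
    # One pass over the file; key each row on (value, secondary_value, os).
    index = {}
    with open(path) as f:
        for line in f:
            columns = line.strip().split('\t')
            if len(columns) == 4:
                value, secondary_value, os_name, output = columns
                index[(value, secondary_value, os_name)] = output
    return index

index = build_index('file.txt')
output = index.get(('value1', 'second_value1', 'iOS'))  # None if no match

A tuple key also avoids the collisions that plain string concatenation can produce (e.g. 'ab' + 'c' and 'a' + 'bc' concatenate to the same string).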
I plan to analyze some conference documents, and before the analysis I need to rearrange these documents into a data frame. The format I expect is that for each row of the data, the first value is the speaker and the second value is the utterance of that speaker, for instance ["Jo", "I just had tacos."]. The sample document can be reached here. Below is my progress so far:
file = open('the document', 'r')
Name = []
sentence = []
for line in file:
    if line.find("Column") != -1:
        continue
    # find() returns -1 (which is truthy) when absent, so each test needs != -1
    if line.find("Section") != -1 or line.find("Index") != -1 or line.find("Home Page") != -1:
        continue
    if line.find(':') != -1:
        tokens = line.split(":")
        Name.append(tokens[0])
    else:
        sentence.append(line + " ")
My first question is how I can combine the speaker and the utterance in one list and then move on to the next speaker. The second question is whether there is a better way to get rid of the content before "Oral Answers to Questions" and after "The House divided: Ayes 240, Noes 329.Division No. 54][9.59 pm".
I appreciate any help.
Here, I have come up with a simple solution. It handles three cases:
When there is an empty line
When the line ends with :
Otherwise
Here is the code:
import re
from collections import defaultdict

def clean_speaker(sp):
    sp = re.sub(r"(\(\w+\))", "", sp)  # remove single words within parentheses
    sp = re.sub(r"(\d+\.?)", "", sp)   # remove digits such as 1. or 2.
    return sp.strip()

document = []
with open('the document', 'r') as fin:
    foundSpeaker = False
    dialogue = defaultdict(str)
    for line in fin.readlines():
        line = line.strip()  # remove white-space
        # ----- when the line is empty -----
        if not line:
            dialogue = defaultdict(str)
            foundSpeaker = False
        # ----- when the line ends with : -----
        elif line[-1] == ":":
            if dialogue:
                document.append(dialogue)
                dialogue = defaultdict(str)
            foundSpeaker = True
            dialogue["Speaker"] = clean_speaker(line[:-1])
        # ----- otherwise -----
        else:
            if foundSpeaker:
                dialogue["Sentence"] += " " + line
            else:
                if dialogue:
                    document.append(dialogue)
                    dialogue = defaultdict(str)
                foundSpeaker = False
                continue
Now, the variable document has all the dialogue in the given file... it's a list of dictionaries where each dictionary has just two keys (Speaker and Sentence). So, we can see what's inside document like so:
for d in document:
    for key, value in d.items():
        print(key + ":", value)
Or you can do something smarter: convert that list into a pandas.DataFrame and write the dataframe to CSV, like so:
import pandas as pd
pd.DataFrame.from_dict(document).to_csv('document.csv')
Now, open document.csv and you will find everything in order... I hope this helps you.
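As for the second question in the original post, a minimal sketch (assuming both marker strings occur verbatim, and only once, in the raw text) is to slice the text between them before any parsing:

def trim_document(text):
    # Both markers are quoted from the question; adjust if the real file differs.
    start_marker = "Oral Answers to Questions"
    end_marker = "The House divided: Ayes 240, Noes 329."
    start = text.find(start_marker)
    end = text.find(end_marker)
    if start == -1 or end == -1:
        return text  # a marker is missing; leave the text unchanged
    return text[start:end]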
I am working on some Latin texts that contain dates and have been using various regex patterns and rule-based statements to extract the dates. I was wondering if I could train an algorithm to extract these dates instead of the method I am currently using. Thanks.
This is an extract of my algorithm:
def checkLatinDates(i, record, no):
    if i == 0 and isNumber(record[i]):  # get deed no
        df.loc[no, 'DeedNo'] = record[i]
    rec = record[i].lower()
    split = rec.split()
    if split[0] == 'die':
        items = deque(split)
        items.popleft()
        split = list(items)
    if 'eodem' in rec:
        n = no - 1
        if no > 1:
            while pd.isnull(df.ix[n]['LatinDate']):
                n = n - 1
                print n
            df['LatinDate'][no] = df.ix[n]['LatinDate']
    if words_in_string(latinMonths, rec.lower()) and len(split) < 10:
        if not dates.loc[dates['Latin'] == split[0], 'Number'].empty:
            day = dates.loc[dates['Latin'] == split[0], 'Number'].iloc[0]
            split[0] = day
            nd = ' '.join(map(str, split))
            df['LatinDate'][no] = nd
        elif convertArabic(split[0]) != '':
            day = convertArabic(split[0])
            split[0] = day
            nd = ' '.join(map(str, split))
            df['LatinDate'][no] = nd
You could use a machine-learning algorithm like AdaBoost with IOB tagging, adding some context features: the type of word, a regex check for whether a token is obviously a date, the types of the surrounding words, etc.
Here is a tutorial.
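As a rough illustration of what IOB tagging with context features might look like (everything here is hypothetical, not from the original code):

import re

def token_features(tokens, i):
    # Context features for the token at position i; the feature names are made up.
    tok = tokens[i]
    return {
        'word': tok.lower(),
        'is_digit': tok.isdigit(),
        'looks_like_day': bool(re.match(r'^\d{1,2}\.?$', tok)),
        'prev_word': tokens[i - 1].lower() if i > 0 else '<s>',
        'next_word': tokens[i + 1].lower() if i < len(tokens) - 1 else '</s>',
    }

# Each token gets a label in {'B-DATE', 'I-DATE', 'O'}; a classifier such as
# AdaBoost is then trained on (features, label) pairs to predict the tags.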
I have a large amount of data of this type:
array(14) {
["ap_id"]=>
string(5) "22755"
["user_id"]=>
string(4) "8872"
["exam_type"]=>
string(32) "PV Technical Sales Certification"
["cert_no"]=>
string(12) "PVTS081112-2"
["explevel"]=>
string(1) "0"
["public_state"]=>
string(2) "NY"
["public_zip"]=>
string(5) "11790"
["email"]=>
string(19) "ivorabey@zeroeh.com"
["full_name"]=>
string(15) "Ivor Abeysekera"
["org_name"]=>
string(21) "Zero Energy Homes LLC"
["org_website"]=>
string(14) "www.zeroeh.com"
["city"]=>
string(11) "Stony Brook"
["state"]=>
string(2) "NY"
["zip"]=>
string(5) "11790"
}
I wrote a for loop in Python which reads through the file, creating a dictionary for each array and storing elements like this:
a = 0
data = [{}]
with open("mess.txt") as messy:
    lines = messy.readlines()
    for i in range(1, len(lines)):
        line = lines[i]
        if "public_state" in line:
            data[a]['state'] = lines[i + 1]
        elif "public_zip" in line:
            data[a]['zip'] = lines[i + 1]
        elif "email" in line:
            data[a]['email'] = lines[i + 1]
        elif "full_name" in line:
            data[a]['contact'] = lines[i + 1]
        elif "org_name" in line:
            data[a]['name'] = lines[i + 1]
        elif "org_website" in line:
            data[a]['website'] = lines[i + 1]
        elif "city" in line:
            data[a]['city'] = lines[i + 1]
        elif "}" in line:
            a += 1
            data.append({})
I know my code is terrible, but I am fairly new to Python. As you can see, the bulk of my project is complete. What's left is to strip away the code tags from the actual data. For example, I need string(15) "Ivor Abeysekera" to become Ivor Abeysekera.
After some research, I considered .lstrip(), but since the preceding text is always different, I got stuck.
Does anyone have a clever way of solving this problem? Cheers!
Edit: I am using Python 2.7 on Windows 7.
Depending on how the code tags are formatted, you could split the line on " then pick out the second element.
s = 'string(15) "Ivor Abeysekera"'
temp = s.split('"')[1]
# temp is 'Ivor Abeysekera'
Note that this will get rid of the trailing "; if you need it, you can always add it back on. In your example this would look like:
data[a]['state'] = lines[i + 1].split('"')[1]
# etc. for each call of lines[i + 1]
Because you are calling it so much (regardless of what answer you use) you should probably turn it into a function:
def prepare_data(line_to_fix):
    return line_to_fix.split('"')[1]

# later on...
data[a]['state'] = prepare_data(lines[i + 1])
This will give you some more flexibility.
BAD SOLUTION (based on the current question)
But to answer the question as asked, just use:
info_string = lines[i + 1]
value_str = info_string.split(" ",1)[-1].strip(" \"")
BETTER SOLUTION
Do you have access to the PHP generating that....? If you do, just do echo json_encode($data); instead of using var_dump.
If instead you have it output JSON, it (the JSON output) will look like
{"variable":"value","variable2":"value2"}
you can then read it in like
import json
json_str = requests.get("http://url.com/json_dump").text # or however you get the original text
data = json.loads(json_str)
print data
You should use regular expressions (regex) for this:
http://docs.python.org/2/library/re.html
What you intend to do can be easily done with the following code:
# Import the library
import re
# This is a string just to demonstrate
a = 'string(32) "PV Technical Sales Certification"'
# Create the regex
p = re.compile('[^"]+"(.*)"$')
# Find a match
m = p.match(a)
# Your result will be now in s
s = m.group(1)
Hope this helps!
You can do this statefully by looping across all the lines and keeping track of where you are in a block:
# Map field names to dict keys
fields = {
    'public_state': 'state',
    'public_zip': 'zip',
    'email': 'email',
    'full_name': 'contact',
    'org_name': 'name',
    'org_website': 'website',
    'city': 'city',
}

data = []
current = {}
key = None

with open("mess.txt") as messy:
    for line in messy:  # iterate the file directly; file objects have no .split()
        line = line.lstrip()
        if line.startswith('}'):
            data.append(current)
            current = {}
        elif line.startswith('['):
            keyname = line.split('"')[1]
            key = fields.get(keyname)
        elif key is not None:
            # Get everything between the first and last quotes on the line
            value = line.split('"', 1)[1].rsplit('"', 1)[0]
            current[key] = value
This avoids having to keep track of your position in the file, and also means that you could work across enormous data files (if you process the dictionary after each record) without having to load the whole thing into memory at once. In fact, let's restructure that as a generator that processes blocks of data at a time and yields dicts for you to work with:
fields = {
    'public_state': 'state',
    'public_zip': 'zip',
    'email': 'email',
    'full_name': 'contact',
    'org_name': 'name',
    'org_website': 'website',
    'city': 'city',
}

def dict_maker(fileobj):
    current = {}
    key = None
    for line in fileobj:
        line = line.lstrip()
        if line.startswith('}'):
            yield current
            current = {}
        elif line.startswith('['):
            keyname = line.split('"')[1]
            key = fields.get(keyname)
        elif key is not None:
            # Get everything between the first and last quotes on the line
            value = line.split('"', 1)[1].rsplit('"', 1)[0]
            current[key] = value

with open("mess.txt") as messy:
    for d in dict_maker(messy):
        print d
That makes your main loop tiny and understandable: you loop across the potentially enormous set of dicts, one at a time, and do something with them. It totally separates the act of making the dictionaries from the act of consuming them. And since the generator is stateful and only processes one line at a time, you could pass in anything that looks like a file: a list of strings, the output of a web request, input from another program writing to sys.stdin, or whatever.
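For instance, a quick check with a list of strings in place of a file (the sample lines are made up for illustration):

sample = [
    '["full_name"]=>',
    'string(15) "Ivor Abeysekera"',
    '}',
]
for d in dict_maker(sample):
    print d  # {'contact': 'Ivor Abeysekera'}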