I'm trying to scrape some text files into a DB. The format is similar to this, with a couple of thousand segments like the following:
Posted By
Date
John Keys
31.08.2019, 10:10 AM
Peter Hall 200 150
Ed Parker 14 1
Posted By
Date
John Keys
31.08.2019, 10:15 AM
Rose Stone 200 150
Travis Anderson 14 1
The records that matter are the ones that come right after "Date", so the logic is:
inside_match_flag = 0
for line in ins:
    if inside_match_flag == 1:
        inside_match_flag = 2  # add one to it as we will get all lines
    if line == "Posted By":    # until we see Posted By again (or EOF)
        inside_match_flag = 0  # we are now outside the segment
    if line == "Date":         # lines after Date are the ones we want
        inside_match_flag = 1  # the following lines are to be stored
So this is the way I've done it before (the above is not the running code): keeping track of a flag, and depending on the flag value I know which lines are most likely coming next.
The issue is of course 'the lines coming next': as I'm reading line by line, I can't easily grab out these segments, and I don't want to rely on loading the complete file into memory (it can get huge).
But the code always gets ugly when I implement something like this, so I'm wondering whether anyone here has a much smarter approach.
And note: I'm also interested in a super-smart, compact way to do this if it requires loading everything into memory, as long as the code doesn't get so ugly; if everything is in memory I guess I can just look for the "Date" field and save all the lines that follow until "Posted By" appears again.
Edit 1
Note: the number of players can be more than 2 per game, so a record could also look like this:
Posted By
Date
John Keys
31.08.2019, 10:10 AM
Peter Hall 200 150
Ed Parker 54 1
Rose Stone 20 15
Travis Anderson 1 150
Posted By
...
....
My dream format would be to have an object like this (example based on the match above with 4 players):
{
    "Game 1":
    {
        "posted by": "john keys",
        "date": "31.08.2019, 10:10 AM",
        "players": {
            { 1, "Peter Hall", "200", "150" },
            { 2, "Ed Parker", "54", "1" },
            { 3, "Rose Stone", "20", "15" },
            { 4, "Travis Anderson", "1", "150" }
        }
    }
}
Note: that's not 100% correct JSON format there, and it doesn't have to be JSON, just an object, as I will throw them into a SQLite database where it's stored per game as illustrated above.
An optimized and memory-efficient generator-function approach, which yields records on demand:
import pprint

def extract_records(fname):
    def prepare_record(rec):
        return {'posted by': rec[0], 'date': rec[1],
                'players': [[i] + p.rsplit(maxsplit=2)
                            for i, p in enumerate(rec[2:], 1)]}

    with open(fname) as f:
        record = []
        add_item = False
        for line in f:
            line = line.strip()
            if line == 'Date':
                add_item = True
                continue
            elif line == 'Posted By':
                add_item = False
                if record:
                    yield prepare_record(record)
                record = []
                continue
            if add_item:
                record.append(line)
        if record:
            yield prepare_record(record)

records_gen = extract_records('datafile.txt')  # generator
for rec in records_gen:
    pprint.pprint(rec)  # further processing, e.g. inserting into DB
The output (2 sample records):
{'date': '31.08.2019, 10:10 AM',
 'players': [[1, 'Peter Hall', '200', '150'],
             [2, 'Ed Parker', '14', '1'],
             [3, 'Rose Stone', '20', '15'],
             [4, 'Travis Anderson', '1', '150']],
 'posted by': 'John Keys'}

{'date': '31.08.2019, 10:15 AM',
 'players': [[1, 'Rose Stone', '200', '150'],
             [2, 'Travis Anderson', '14', '1']],
 'posted by': 'John Keys'}
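Since the end goal is SQLite, here is a minimal sketch of feeding the generator into a database. The two-table schema (games, players) and column names here are hypothetical, just to illustrate the shape; adjust to whatever schema you actually use:

import sqlite3

conn = sqlite3.connect('games.db')
conn.execute('CREATE TABLE IF NOT EXISTS games '
             '(id INTEGER PRIMARY KEY, posted_by TEXT, date TEXT)')
conn.execute('CREATE TABLE IF NOT EXISTS players '
             '(game_id INTEGER, position INTEGER, name TEXT, score1 TEXT, score2 TEXT)')

for rec in extract_records('datafile.txt'):
    cur = conn.execute('INSERT INTO games (posted_by, date) VALUES (?, ?)',
                       (rec['posted by'], rec['date']))
    # each player row is [position, name, score1, score2]
    conn.executemany('INSERT INTO players VALUES (?, ?, ?, ?, ?)',
                     [(cur.lastrowid, pos, name, s1, s2)
                      for pos, name, s1, s2 in rec['players']])
conn.commit()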
There is no magic method for this specific case. Here is an example solution:
buf_size = ...  # pick a chunk size, e.g. 64 * 1024
start_marker = "Posted By\n"
date_marker = "Date\n"

def parse_game(filename):
    fh = open(filename)
    page = ""
    buffer = True  # just the start value
    while buffer:
        buffer = fh.read(buf_size)
        page += buffer
        records = page.split(start_marker)
        if buffer:
            page = records.pop()  # keep the incomplete tail for the next chunk
        for record in records:
            if not record.strip():
                continue  # skip the empty chunk before the first "Posted By"
            # skip everything before "Date" and split by lines
            chunks = record.split(date_marker, 1)[-1].split("\n")
            posted_by, date = chunks[:2]
            players = [chunk.split() for chunk in chunks[2:] if chunk.strip()]
            yield {
                "posted_by": posted_by,
                "date": date,
                "players": players
            }
If you can read the whole file into memory, it will be just:
def read_game(filename):
    for record in open(filename).read().split(start_marker):
        if not record.strip():
            continue  # skip the empty chunk before the first "Posted By"
        # skip everything before "Date" and split by lines
        chunks = record.split(date_marker, 1)[-1].split("\n")
        posted_by, date = chunks[:2]
        players = [chunk.split() for chunk in chunks[2:] if chunk.strip()]
        yield {
            "posted_by": posted_by,
            "date": date,
            "players": players
        }
This solution is very similar to Roman's. It is slightly less memory efficient (it holds roughly buf_size of the file in memory at a time), but it results in fewer I/O operations.
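Either way, usage mirrors the generator above; for example:

for game in parse_game('datafile.txt'):
    print(game['posted_by'], game['date'], len(game['players']))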
Related
I am getting command-line output in the below format:
server
3 threads started
1.1.1.1 ONLINE at SUN
version: 1.2.3.4
en: net
1.1.1.2 ONLINE at SUN
version: 1.2.3.5
en: net
1.1.1.3 OFFLINE at SUN
version: 1.2.3.6
en: net
File: xys
high=600
low=70
name=lmn
I want parsed output like
l1 = [
{
"1.1.1.1": {
"status": "ONLINE",
"version": "1.2.3.4",
"en": "net"
},
"1.1.1.2": {
"status": "ONLINE",
"version": "1.2.3.5",
"en": "net"
},
"1.1.1.3": {
"status": "OFFLINE",
"version": "1.2.3.6",
"en": "net"
}
}
]
l2 = {
"File": "xys",
"high": 600,
"low": 70,
"name": "lmn"
}
I am getting all this in a string.
I split the string by \n to create a list, then split that list in two at the "File" keyword, and parsed both lists separately:
index = [i for i in range(len(output)) if "File" in output[i]]
if index:
    list1 = output[:index[0]]
    list2 = output[index[0]:]
Is there any other, more efficient way to parse this output?
What you did would work alright.
How much you should worry about this depends on whether this is just some quick setup for a few automated tests, or code for a service in an enterprise environment that has to stay running. The one thing I would be worried about is what happens if File: ... is no longer the line that follows the IP addresses. If you want to make sure this doesn't throw off your code, you could go through the string line by line, parsing it.
You would need your parser to check for all of the following cases:
The word server
The comments following the word server about how many threads were started
Any other comments after the word server
The IP address (regex is your friend)
The indented area that follows having found an IP address
key value pairs separated with a colon
key value pairs separated with an equals sign
But in all reality, I think what you did looks great. It's not that hard to change your code from searching for "File" to something else if that need ever arises. You will want to spend a little bit of time verifying that "File" really does always precede the IP addresses. If reliability is super important, then you will have some additional work to do to protect yourself from problems later on if the order things come in changes on you. A rough line-by-line sketch follows.
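Here is a sketch of that line-by-line approach, assuming the format shown above; the regex and the mode switch on "File:" are illustrative, not a complete parser, and the metadata values stay as strings:

import re

# matches a dotted IPv4 address followed by a status word
ip_re = re.compile(r'^(\d{1,3}(?:\.\d{1,3}){3})\s+(\S+)')

def parse_output(output):
    servers, meta = {}, {}
    current = None
    for line in output.splitlines():
        line = line.strip()
        m = ip_re.match(line)
        if line.startswith('File:'):
            current = None  # leave the per-server section
            meta['File'] = line.split(':', 1)[1].strip()
        elif m:
            current = m.group(1)
            servers[current] = {'status': m.group(2)}
        elif current and ':' in line:
            key, _, value = line.partition(':')
            servers[current][key.strip()] = value.strip()
        elif '=' in line:
            key, _, value = line.partition('=')
            meta[key.strip()] = value.strip()
    return servers, meta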
The solution provided below does not need to know the number of server threads running, as it keeps track of the thread entries by stripping off the metadata preceding and following the threads' information:
with open("data.txt", "r") as inFile:
lines = [line for line in inFile]
lines = [line for line in lines[2:] if line != '\n']
threads = lines[:-4]
meta = lines[-4:]
l1 = []
l2 = {}
for i in range(0,len(threads),3):
status = threads[i]
version = threads[i+1]
en = threads[i+2]
status = status.split()
name = status[0]
status = status[1]
version = version.split()
version = version[1].strip()
en = en.split()
en = en[1].strip()
l1.append({name : {'status' : status, "version" : version, "en" : en}})
fileInfo = meta[0].strip().split(": ")
l2.update({fileInfo[0] : fileInfo[1]})
for elem in meta[1:]:
item = elem.strip().split("=")
l2.update({item[0] : item[1]})
The result will be:
For l1:
[{'1.1.1.1': {'status': 'ONLINE', 'version': '1.2.3.4', 'en': 'net'}}, {'1.1.1.2': {'status': 'ONLINE', 'version': '1.2.3.5', 'en': 'net'}}, {'1.1.1.3': {'status': 'OFFLINE', 'version': '1.2.3.6', 'en': 'net'}}]
For l2:
{'File': 'xys', 'high': '600', 'low': '70', 'name': 'lmn'}
I have a text file that is 26 GB. The line format is as follows:
/type/edition /books/OL10000135M 4 2010-04-24T17:54:01.503315 {"publishers": ["Bernan Press"], "physical_format": "Hardcover", "subtitle": "9th November - 3rd December, 1992", "key": "/books/OL10000135M", "title": "Parliamentary Debates, House of Lords, Bound Volumes, 1992-93", "identifiers": {"goodreads": ["6850240"]}, "isbn_13": ["9780107805401"], "languages": [{"key": "/languages/eng"}], "number_of_pages": 64, "isbn_10": ["0107805405"], "publish_date": "December 1993", "last_modified": {"type": "/type/datetime", "value": "2010-04-24T17:54:01.503315"}, "authors": [{"key": "/authors/OL2645777A"}], "latest_revision": 4, "works": [{"key": "/works/OL7925046W"}], "type": {"key": "/type/edition"}, "subjects": ["Government - Comparative", "Politics / Current Events"], "revision": 4}
I'm trying to get only the last column, which is JSON, and from that JSON I'm only trying to save the "title", "isbn_13", "isbn_10".
I was able to save only the last column with this code:
import csv
import sys

csv.field_size_limit(sys.maxsize)

# File names: to read in from and read out to
input_file = '../inputFile/ol_dump_editions_2019-10-31.txt'
output_file = '../outputFile/output.txt'

## ==================== ##
##  Using module 'csv'  ##
## ==================== ##
with open(input_file) as to_read:
    with open(output_file, "w") as tmp_file:
        reader = csv.reader(to_read, delimiter="\t")
        writer = csv.writer(tmp_file)
        desired_column = [4]  # text column
        for row in reader:  # read one row at a time
            myColumn = list(row[i] for i in desired_column)  # build the output row (process)
            writer.writerow(myColumn)  # write it
but this doesn't return a proper JSON object; instead it returns everything with doubled double quotation marks. Also, how would I extract certain values from the JSON as a new JSON?
EDIT:
"{""publishers"": [""Bernan Press""], ""physical_format"": ""Hardcover"", ""subtitle"": ""9th November - 3rd December, 1992"", ""key"": ""/books/OL10000135M"", ""title"": ""Parliamentary Debates, House of Lords, Bound Volumes, 1992-93"", ""identifiers"": {""goodreads"": [""6850240""]}, ""isbn_13"": [""9780107805401""], ""languages"": [{""key"": ""/languages/eng""}], ""number_of_pages"": 64, ""isbn_10"": [""0107805405""], ""publish_date"": ""December 1993"", ""last_modified"": {""type"": ""/type/datetime"", ""value"": ""2010-04-24T17:54:01.503315""}, ""authors"": [{""key"": ""/authors/OL2645777A""}], ""latest_revision"": 4, ""works"": [{""key"": ""/works/OL7925046W""}], ""type"": {""key"": ""/type/edition""}, ""subjects"": [""Government - Comparative"", ""Politics / Current Events""], ""revision"": 4}"
EDIT 2:
So I'm trying to read this file, which is a tab-separated file with the following columns:
type - type of record (/type/edition, /type/work etc.)
key - unique key of the record. (/books/OL1M etc.)
revision - revision number of the record
last_modified - last modified timestamp
JSON - the complete record in JSON format
I'm trying to read the JSON column, and from that JSON I'm only trying to get "title", "isbn_13", "isbn_10" as a JSON and save it to the file as a row,
so every row should look like the original but with only those keys and values.
Here's a straightforward way of doing it. You would need to repeat this, extracting the desired data from each line of the file as it's being read, line by line (the default way text-file reading is handled in Python).
import json
line = '/type/edition /books/OL10000135M 4 2010-04-24T17:54:01.503315 {"publishers": ["Bernan Press"], "physical_format": "Hardcover", "subtitle": "9th November - 3rd December, 1992", "key": "/books/OL10000135M", "title": "Parliamentary Debates, House of Lords, Bound Volumes, 1992-93", "identifiers": {"goodreads": ["6850240"]}, "isbn_13": ["9780107805401"], "languages": [{"key": "/languages/eng"}], "number_of_pages": 64, "isbn_10": ["0107805405"], "publish_date": "December 1993", "last_modified": {"type": "/type/datetime", "value": "2010-04-24T17:54:01.503315"}, "authors": [{"key": "/authors/OL2645777A"}], "latest_revision": 4, "works": [{"key": "/works/OL7925046W"}], "type": {"key": "/type/edition"}, "subjects": ["Government - Comparative", "Politics / Current Events"], "revision": 4}'
csv_cols = line.split('\t')
json_data = json.loads(csv_cols[4])
#print(json.dumps(json_data, indent=4))
desired = {key: json_data[key] for key in ("title", "isbn_13", "isbn_10")}
result = json.dumps(desired, indent=4)
print(result)
Output from sample line:
{
"title": "Parliamentary Debates, House of Lords, Bound Volumes, 1992-93",
"isbn_13": [
"9780107805401"
],
"isbn_10": [
"0107805405"
]
}
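Applied to the whole dump, a minimal sketch could look like this (the file names are placeholders, and .get() is used so a row missing one of the keys doesn't crash the loop):

import json

# stream the dump line by line and write one reduced JSON object per row
with open('ol_dump_editions_2019-10-31.txt') as infile, \
        open('reduced_output.txt', 'w') as outfile:
    for line in infile:
        json_data = json.loads(line.split('\t')[4])
        desired = {key: json_data.get(key)
                   for key in ("title", "isbn_13", "isbn_10")}
        outfile.write(json.dumps(desired) + '\n')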
So given that your current code returns the following:
result = '{""publishers"": [""Bernan Press""], ""physical_format"": ""Hardcover"", ""subtitle"": ""9th November - 3rd December, 1992"", ""key"": ""/books/OL10000135M"", ""title"": ""Parliamentary Debates, House of Lords, Bound Volumes, 1992-93"", ""identifiers"": {""goodreads"": [""6850240""]}, ""isbn_13"": [""9780107805401""], ""languages"": [{""key"": ""/languages/eng""}], ""number_of_pages"": 64, ""isbn_10"": [""0107805405""], ""publish_date"": ""December 1993"", ""last_modified"": {""type"": ""/type/datetime"", ""value"": ""2010-04-24T17:54:01.503315""}, ""authors"": [{""key"": ""/authors/OL2645777A""}], ""latest_revision"": 4, ""works"": [{""key"": ""/works/OL7925046W""}], ""type"": {""key"": ""/type/edition""}, ""subjects"": [""Government - Comparative"", ""Politics / Current Events""], ""revision"": 4}'
Looks like what you need to do is: first, replace those double-double-quotes with regular double quotes, otherwise things are not parsable:
res = result.replace('""','"')
Now res is convertible to a JSON object:
import json
my_json = json.loads(res)
my_json now looks like this:
{'authors': [{'key': '/authors/OL2645777A'}],
'identifiers': {'goodreads': ['6850240']},
'isbn_10': ['0107805405'],
'isbn_13': ['9780107805401'],
'key': '/books/OL10000135M',
'languages': [{'key': '/languages/eng'}],
'last_modified': {'type': '/type/datetime',
'value': '2010-04-24T17:54:01.503315'},
'latest_revision': 4,
'number_of_pages': 64,
'physical_format': 'Hardcover',
'publish_date': 'December 1993',
'publishers': ['Bernan Press'],
'revision': 4,
'subjects': ['Government - Comparative', 'Politics / Current Events'],
'subtitle': '9th November - 3rd December, 1992',
'title': 'Parliamentary Debates, House of Lords, Bound Volumes, 1992-93',
'type': {'key': '/type/edition'},
'works': [{'key': '/works/OL7925046W'}]}
You can conveniently get any field you want from this object:
my_json['title']
# 'Parliamentary Debates, House of Lords, Bound Volumes, 1992-93'
my_json['isbn_10'][0]
# '0107805405'
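As an aside, those doubled quotes are just standard CSV escaping that csv.writer added when the intermediate file was written; reading it back with csv.reader undoes them automatically, which avoids the string replace entirely. A sketch, assuming the output file from the question:

import csv
import json

# csv.reader un-doubles the quotes on the way back in
with open('../outputFile/output.txt') as f:
    for row in csv.reader(f):
        my_json = json.loads(row[0])  # row[0] is the clean JSON string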
Especially because your file is so large, I'd recommend using a specialized library such as pandas, which has a read_csv method, or even dask, which supports out-of-memory operations.
Both of these libraries will automatically parse out the quoting for you, and dask will do so in "pieces" direct from disk, so you never have to try to load 26 GB into RAM.
In both libraries, you can then access the columns you want like this:
import pandas as pd  # or: import dask.dataframe as dd

data = pd.read_csv(PATH, sep="\t")  # the dump is tab-separated
data["ColumnName"]
You can then parse these rows either using json.loads() (import json) or you can use the pandas/dask json implementations. If you can give some more details of what you're expecting, I can help you draft a more specific code example.
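As a rough sketch in pandas, assuming the dump is tab-separated with no header row and the JSON blob sits in the fifth column (the column names and chunk size are illustrative):

import csv
import json
import pandas as pd

# process the 26 GB dump in bounded-memory chunks;
# QUOTE_NONE keeps the raw JSON column intact
cols = ["type", "key", "revision", "last_modified", "json"]
for chunk in pd.read_csv("ol_dump_editions_2019-10-31.txt", sep="\t",
                         header=None, names=cols, chunksize=100_000,
                         quoting=csv.QUOTE_NONE):
    for record in chunk["json"].map(json.loads):
        reduced = {k: record.get(k) for k in ("title", "isbn_13", "isbn_10")}
        # ... write `reduced` wherever it needs to go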
Good luck!
I saved your data to a file to see if I could read just the rows; let me know if this works:
import json

# zzread is assumed to hold the file's contents as a single string
lines = zzread.split('\n')
temp = []
for to_read in lines:
    if len(to_read) == 0:
        break
    new_to_read = '{' + to_read.split('{', 1)[1]
    temp.append(json.loads(new_to_read))
for row in temp:
    print(row['isbn_13'])
If that works, this should create a JSON list for you:
lines = zzread.split('\n')
temp = []
for to_read in lines:
    if len(to_read) == 0:
        break
    new_to_read = '{' + to_read.split('{', 1)[1]
    temp.append(json.loads(new_to_read))
new_json = []
for row in temp:
    new_json.append({'title': row['title'], 'isbn_13': row['isbn_13'],
                     'isbn_10': row['isbn_10']})
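To persist the reduced records, one more step would do it (the output file name is just an example):

import json

with open('reduced_records.json', 'w') as outfile:
    json.dump(new_json, outfile, indent=1)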
I'm using code that compares a JSON list to a list in my Python file; however, the for loop I'm using comes back with TypeError: 'function' object is not iterable, and I'm not sure how to fix this.
I've tried changing "for k2 in occupants:" to "for k2 in occupants():" and "for k2 in occupants[:]:", but each of these had problems.
import json
import urllib.request
# render_template and app come from the Flask setup, which isn't shown here

url = 'http://api.open-notify.org/astros.json'
response = urllib.request.urlopen(url)
result = json.loads(response.read())
people = result["people"]
people_with_exist_state = []

occupants = [
    {
        'name': "Oleg Kononenko",
        'bio': 'Oleg Dmitriyevich Kononenko is a Russian cosmonaut. He has flown to the International Space Station four times as a flight engineer and commander. Kononenko accumulated over 533 days in orbit during his first three long duration flights to ISS.',
        'img': 'https://upload.wikimedia.org/wikipedia/commons/thumb/a/a0/Kononenko.jpg/220px-Kononenko.jpg',
    },
    {
        'name': "David Saint-Jacques",
        'bio': "David Saint-Jacques (born January 6, 1970) is a Canadian astronaut with the Canadian Space Agency (CSA). He is also an astrophysicist, engineer, and a physician.",
        'img': 'https://upload.wikimedia.org/wikipedia/commons/thumb/f/f8/David_Saint-Jacques_official_portrait.jpg/440px-David_Saint-Jacques_official_portrait.jpg',
    },  # continues for all occupants of the ISS
]

@app.route("/occupants")
def occupants():
    for k in people:
        is_here = 0
        for k2 in occupants:
            if k['name'] == k2['name']:
                is_here = 1
        if is_here == 0:
            # name does not exist
            k['exist'] = 0
        else:
            # name exists
            k['exist'] = 1
        people_with_exist_state.append(k)
    return render_template('occupants.html', people=people_with_exist_state)
Assuming people and occupants are global variables in your code: you can't name the function def occupants(): the same as the variable occupants. The def rebinds the name, so by the time the loop runs, occupants refers to the function itself, which is exactly why iterating over it raises TypeError: 'function' object is not iterable. Rename one of them, as sketched below.
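A minimal sketch of the fix, renaming the view function (the new name is arbitrary):

@app.route("/occupants")
def occupants_view():  # renamed so it no longer shadows the list
    for k in people:
        is_here = 0
        for k2 in occupants:  # now refers to the list again
            if k['name'] == k2['name']:
                is_here = 1
        k['exist'] = is_here
        people_with_exist_state.append(k)
    return render_template('occupants.html', people=people_with_exist_state)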
I am getting stuck trying to figure out an efficient way to parse some plaintext that is structured with indents (it comes from a Word doc). Example (note: the indentation below may not render on the mobile version of SO):
Attendance records 8 F 1921-2010 Box 2
1921-1927, 1932-1944
1937-1939,1948-1966,
1971-1979, 1989-1994, 2010
Number of meetings attended each year 1 F 1991-1994 Box 2
Papers re: Safaris 10 F 1951-2011 Box 2
Incomplete; Includes correspondence
about beginning “Safaris” may also
include announcements, invitations,
reports, attendance, and charges; some
photographs.
See also: Correspondence and Minutes
So the unindented text is the parent record's data, and each set of indented text below a parent line holds notes for that record (the notes themselves are split across multiple lines).
So far I have a crude script that parses out the unindented parent lines, giving me a list of dictionaries:
import re

f = open('example_text.txt', 'r')
lines = f.readlines()
records = []

for line in lines:
    if line[0].isalpha():
        processed = re.split(r'\s{2,}', line)
        title = processed[0]
        rec_id = processed[1]
        years = processed[2]
        location = processed[3]
        records.append({
            "title": title,
            "id": rec_id,
            "years": years,
            "location": location
        })
    elif not line[0].isalpha():
        print("These are the notes, but attaching them to the above records is not clear")

print(records)
and this produces:
[{'id': '8 F',
'location': 'Box 2',
'title': 'Attendance records',
'years': '1921-2010'},
{'id': '1 F',
'location': 'Box 2',
'title': 'Number of meetings attended each year',
'years': '1991-1994'},
{'id': '10 F',
'location': 'Box 2',
'title': 'Papers re: Safaris',
'years': '1951-2011'}]
But now I want to add to each record the notes to the effect of:
[{'id': '8 F',
'location': 'Box 2',
'title': 'Attendance records',
'years': '1921-2010',
'notes': '1921-1927, 1932-1944 1937-1939,1948-1966, 1971-1979, 1989-1994, 2010'
},
...]
What's confusing me is this procedural, line-by-line approach; I'm not sure if there is a more Pythonic way to do it. I'm more used to scraping webpages, where at least you have selectors; here it's hard to double back while going one by one down the lines. I was hoping someone might shake my thinking loose and provide a fresh view on a better way to attack this.
Update
Just adding the condition suggested by the answer below over the indented lines worked fine:
import re
from pprint import pprint

f = open('example_text.txt', 'r')
lines = f.readlines()
records = []

for line in lines:
    if line[0].isalpha():
        processed = re.split(r'\s{2,}', line)
        title = processed[0]
        rec_id = processed[1]
        years = processed[2]
        location = processed[3]
    if not line[0].isalpha():
        record['notes'].append(line)
        continue
    record = {"title": title,
              "id": rec_id,
              "years": years,
              "location": location,
              "notes": []}
    records.append(record)

pprint(records)
As you have already solved the parsing of the records, I will focus only on how to read the notes of each one:
records = []

with open('data.txt', 'r') as lines:
    for line in lines:
        if line.startswith('\t'):
            record['notes'].append(line[1:])
            continue
        record = {'title': line, 'notes': []}
        records.append(record)

for record in records:
    print('Record is', record['title'])
    print('Notes are', record['notes'])
    print()
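If you prefer the notes of each record as a single string (as in the desired output above), a small follow-up once the records are built might be:

# flatten each record's note lines into one space-joined string (optional)
for record in records:
    record['notes'] = ' '.join(note.strip() for note in record['notes'])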
I have a group of text files, and I am looking to sequentially add the second column from each text file into a new text file. The files are tab-delimited and of the following format:
name dave
age 35
job teacher
income 30000
To hopefully simplify the problem, I have generated a file that has the first column of one of these files in the place of the second column:
0 name
0 age
0 job
0 income
I have a large number of these files and would like to have them all in a tab-delimited text file such as:
name dave mike sue
age 35 28 40
job teacher postman solicitor
income 30000 20000 40000
I have a text file containing just the names of all the files, called all_libs.txt.
So far I have written:
# make a sorted list of the file names
with open('all_libs.txt', 'r') as lib:
    people = list([line.rstrip() for line in lib])
people_s = sorted(people)

i = 0
while i < len(people_s):
    with open(people_s[i]) as inf:
        for line in inf:
            parts = line.split()  # split line into parts
            if len(parts) > 1:    # if more than 1 discrete unit in parts
                with open("all_data.txt", 'a') as out_file:  # append column 2 to all_data
                    out_file.write((parts[1]) + "\n")
    i = i + 1  # go to the next file in the list
As each new file is opened, I would like to add its values as a new column rather than just appending new lines. I would really appreciate any help. I realize something like SQL would probably make this easy, but I have never used it and don't really have time to commit to the learning curve for SQL. Many thanks.
This is a very impractical way to store your data: each record is distributed over all the lines, so it's going to be hard to reconstruct the records when reading the file and (as you've seen) to add new ones.
You should be using a standard format like CSV or (even better in a case like this) JSON.
For example, you could save them as CSV like this:
name,age,job,income
dave,35,teacher,30000
mike,28,postman,20000
sue,40,solicitor,40000
Reading this file:
>>> import csv
>>> with open("C:/Users/Tim/Desktop/people.csv", newline="") as infile:
... reader = csv.DictReader(infile)
... people = list(reader)
Now you have a list of people:
>>> people
[{'income': '30000', 'age': '35', 'name': 'dave', 'job': 'teacher'},
{'income': '20000', 'age': '28', 'name': 'mike', 'job': 'postman'},
{'income': '40000', 'age': '40', 'name': 'sue', 'job': 'solicitor'}]
which you can access easily:
>>> for item in people:
... print("{0[name]} is a {0[job]}, earning {0[income]} per year".format(item))
...
dave is a teacher, earning 30000 per year
mike is a postman, earning 20000 per year
sue is a solicitor, earning 40000 per year
Adding new records now is only a matter of adding them to the end of your file:
>>> with open("C:/Users/Tim/Desktop/people.csv", "a", newline="") as outfile:
... writer = csv.DictWriter(outfile,
... fieldnames=["name","age","job","income"])
... writer.writerow({"name": "paul", "job": "musician", "income": 123456,
... "age": 70})
Result:
name,age,job,income
dave,35,teacher,30000
mike,28,postman,20000
sue,40,solicitor,40000
paul,70,musician,123456
Or you can save it as JSON:
>>> import json
>>> with open("C:/Users/Tim/Desktop/people.json", "w") as outfile:
... json.dump(people, outfile, indent=1)
Result:
[
{
"income": "30000",
"age": "35",
"name": "dave",
"job": "teacher"
},
{
"income": "20000",
"age": "28",
"name": "mike",
"job": "postman"
},
{
"income": "40000",
"age": "40",
"name": "sue",
"job": "solicitor"
}
]
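Reading the JSON file back in later is just as simple:

>>> import json
>>> with open("C:/Users/Tim/Desktop/people.json") as infile:
...     people = json.load(infile)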
file_1 = """
name dave1
age 351
job teacher1
income 300001"""
file_2 = """
name dave2
age 352
job teacher2
income 300002"""
file_3 = """
name dave3
age 353
job teacher3
income 300003"""
template = """
0 name
0 age
0 job
0 income"""
Assume that the above is read from the files
_dict = {}

def concat():
    # build a column list for each field name in the template
    for cols in template.splitlines():
        if cols:
            _, col_name = cols.split()
            _dict[col_name] = []
    # collect the second column of every file under its field name
    for each_file in [file_1, file_2, file_3]:
        data = each_file.splitlines()
        for line in data:
            if line:
                words = line.split()
                _dict[words[0]].append(words[1])
    _text = ""
    for key in _dict:
        _text += '\t'.join([key, '\t'.join(_dict[key]), '\n'])
    return _text

print(concat())
OUTPUT
name dave1 dave2 dave3
age 351 352 353
job teacher1 teacher2 teacher3
income 300001 300002 300003