I am new to Python. I want to parse a text file in which the first row contains the headers, which act as the keys, and the next row (second row) has the corresponding values.
The problem I'm facing is that the content in the text file is not aligned evenly: there are uneven spaces between the first and the second row, so I'm not able to use a simple delimiter either.
Also, a header will not necessarily have a corresponding value in the next row; it may sometimes be empty.
After that, I want to convert it to JSON format with those key-value pairs.
Any help would be appreciated.
import re

with open("E:\\wipro\\samridh\\test.txt") as read_file:
    line = read_file.readline()
    while line:
        #print(line, end='')
        new_string = re.sub(' +', ' ', line)
        line = read_file.readline()
        print(new_string)
PFA image of my text input
You can find the indices and matches of the header with finditer from the re package. Then use that to process the rest:
import re
import json

thefile = open("file.txt")
line = thefile.readline()
iter = re.finditer("\w+\s+", line)
columns = [(m.group(0), m.start(0), m.end(0)) for m in iter]
records = []
while line:
    line = thefile.readline()
    record = {}
    for col in columns:
        record[col[0]] = line[col[1]:col[2]]
    records.append(record)
print(json.dumps(records))
I'll leave it up to OP to strip whitespace and filter out empty entries. Not to mention error handling ;-).
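A minimal sketch of that clean-up, building on the records list above (dropping records whose values are all empty is just one reasonable choice):
cleaned = []
for record in records:
    stripped = {key.strip(): value.strip() for key, value in record.items()}
    if any(stripped.values()):          # drop records that are entirely empty
        cleaned.append(stripped)
print(json.dumps(cleaned))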
Not quite sure what you want to do, but if I understand it correctly and under these assumptions:
- you only have 2 lines in the file,
- you have the same number of keys and values,
- no spaces are allowed "inside" a key or value, meaning the only spaces are the ones separating elements.
with open(fname) as f:
    content = f.readlines()
    # you may also want to remove whitespace characters like `\n` at the end of each line
    content = [x.strip() for x in content]
After that, content[0] is your keys line and content[1] is your values line.
Now all you need to do is this:
key_value_dict = {}
for key, value in zip(content[0].split(), content[1].split()):
    key_value_dict[key] = value
and your key_value_dict holds a dictionary (JSON-like) of keys and values.
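If you need the actual JSON text rather than a Python dict, a one-line sketch with the standard json module would be:
import json

json_string = json.dumps(key_value_dict)  # e.g. '{"header1": "value1", ...}'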
I assume that each of the headers is a single word without intervening whitespace. Then, to find out where each column starts, you can do this, for example:
with open("E:\\wipro\\samridh\\test.txt") as read_file:
line = next(read_file)
headers = line.split()
l_bounds = [line.find(word) for word in headers]
When splitting data lines, you will also need the right boundaries. If you know, say, that none of your data lines is longer than 1000 characters, you could do something like this:
r_bounds = l_bounds[1:] + [1000]
When you walk over the data lines, you put together the left and right limits and the header words like so:
out_str = json.dumps({name: line[l:r].strip()
                      for name, l, r in zip(headers, l_bounds, r_bounds)})
No regex required, by the way.
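Putting those fragments together, a minimal end-to-end sketch (same assumed path, same generous 1000-character right bound) might look like:
import json

records = []
with open("E:\\wipro\\samridh\\test.txt") as read_file:
    line = next(read_file)
    headers = line.split()
    l_bounds = [line.find(word) for word in headers]
    r_bounds = l_bounds[1:] + [1000]
    for line in read_file:              # every remaining line is a data line
        record = {name: line[l:r].strip()
                  for name, l, r in zip(headers, l_bounds, r_bounds)}
        records.append(record)

print(json.dumps(records))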
Assumptions the below makes:
Headers are one word (as they are in your example)
Headers and values don't overlap... That is, if header 1 spans indices 5 to 15, then its value will also be found within those same indices of the row below
The benefits of this approach are that the values can have spaces in between them (as they do in your example). If you were to split both the header and value strings by spaces, they would have a different number of elements and you wouldn't be able to combine them. Also, you wouldn't be able to find values that were empty (as in your example).
Here is the approach I would take...
If you are sure your file headers are only one word each (no spaces), then find the index of the first character of each word and store them in an array. Every time you figure out two indices, extract the header between them. So between (header1-firstchar, header2-firstchar - 1)...
Then get the second line and sequentially extract substrings from indices: (header1-firstchar, header2-firstchar - 1)...
Once you've done that, combine the extracted headers/keys and values into a dictionary.
dictVersion = dict(zip(headers, values))
Then call the following:
import json
jsonVersion = json.dumps(dictVersion)
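A rough, self-contained sketch of that whole approach (the file name and the regex used to locate the one-word headers are assumptions, not part of the answer above):
import json
import re

with open("test.txt") as f:        # assumed file name
    header_line = f.readline()
    value_line = f.readline()

# Start index of each one-word header in the first line.
starts = [m.start() for m in re.finditer(r"\S+", header_line)]
ends = starts[1:] + [max(len(header_line), len(value_line))]

headers = [header_line[s:e].strip() for s, e in zip(starts, ends)]
values = [value_line[s:e].strip() for s, e in zip(starts, ends)]

dictVersion = dict(zip(headers, values))
jsonVersion = json.dumps(dictVersion)
print(jsonVersion)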
I have a .csv file with comma-separated fields. I am receiving this file from a 3rd party and the content cannot change. I need to import the file to a database, but there are commas in some of the "comma" separated fields. The comma-separated fields are also fixed length: when I straight up print the fields as per the lines below in the insert_line_csv function, they are spaced at a fixed length.
I essentially need an efficient method of collecting fields that could have commas included in them. I was hoping to combine the two methods. Not sure if that would be efficient.
I am using Python 3 and am willing to use any libraries to make the job efficient and easy.
Currently I have the following:
with open(FileName, 'r') as f:
    for count, line in enumerate(f):
        insert_line_csv(count, line)
with the insert_line_csv function looking like:
def insert_line_csv(line_no, line):
    line = line.split(",")
    field0 = line[0]
    field1 = line[1]
    ......
I am importing the line_no, as that is also being entered into the db.
Any insight would be appreciated.
A sample dataset:
text ,2000.00 ,2018-07-07,textwithoutcomma ,text ,1
text ,3000.00 ,2018-07-08,textwith,comma ,text ,7
text ,1000.00 ,2018-07-07,textwithoutcomma ,text ,4
If the comma-separated fields are all fixed length, you should be able to just slice them off by character count instead of splitting on commas; see Split string by count of characters.
As mock-up code (field_length and write_chunk_to_db are just placeholders) you have:
to_parse = line
while to_parse != "":
    chunk = to_parse[:field_length]         # first field_length chars of to_parse
    rest_of_line = to_parse[field_length:]  # to_parse without the chars just cut off
    write_chunk_to_db(chunk)                # write the chunk to the db
    to_parse = rest_of_line
That should work imho
Edit:
Upon seeing your sample dataset: can there only be one field with commas inside of it? If so, you could split on commas, read out the first three fields, then the last two. Whatever is left you concatenate again, because it is the value of the 4th field. (If it had commas, you'll need to actually concatenate there; if not, it's already the value.)
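A minimal sketch of that idea (assuming, as in the sample, six logical fields and that only the 4th one can contain commas):
def split_fields(line):
    parts = line.rstrip("\n").split(",")
    first_three = parts[:3]           # fields before the comma-prone one
    last_two = parts[-2:]             # fields after the comma-prone one
    middle = ",".join(parts[3:-2])    # whatever is left is the 4th field's value
    return first_three + [middle] + last_two

# split_fields("text ,3000.00 ,2018-07-08,textwith,comma ,text ,7")
# -> ['text ', '3000.00 ', '2018-07-08', 'textwith,comma ', 'text ', '7']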
I am currently extracting columns in a file by using awk in os.system():
os.system("awk '{print $'%i'}' < infile > outfile"%some_column)
np.loadtxt('outfile')
Is there an equivalent way to accomplish this using regex?
Thanks.
Edit: I want to clarify that I am looking for the most optimal way to extract specific columns of large files.
Depending on what your data delimiters are, regex is probably overkill for this. If the delimiters are simple (whitespace or a specific character/string), you can separate columns simply by using the string.split method.
Here is an example program to explain how this might work:
column = 0  # First column
with open("data.txt") as file:
    data = file.readlines()
columns = list(map(lambda x: x.strip().split()[column], data))
To break this down:
column = 0

# Read a file named "data.txt" into an array of lines
with open("data.txt") as file:
    data = file.readlines()

# This is where we will store the columns as we extract them
columns = []

# Iterate over each line in the file
for line in data:
    # Strip the whitespace (including the trailing newline character) from the
    # start and end of the string
    line = line.strip()
    # Split the line, using the standard delimiter (arbitrary number of
    # whitespace characters)
    line = line.split()
    # Extract the column data from the desired index and store it in our list
    columns.append(line[column])

# columns now holds a list of strings extracted from that column
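Since the edit asks about large files, a sketch that avoids readlines() and walks the file one line at a time (an assumption about what "optimal" should mean here) keeps memory use low:
column = 0
column_values = []
with open("data.txt") as file:
    for line in file:                   # the file object yields one line at a time
        fields = line.split()           # split on any run of whitespace
        if len(fields) > column:        # skip blank or short lines
            column_values.append(fields[column])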
I'm trying to remove some substrings from a string in a csv file.
import csv
import string

input_file = open('in.csv', 'r')
output_file = open('out.csv', 'w')
data = csv.reader(input_file)
writer = csv.writer(output_file, quoting=csv.QUOTE_ALL)  # dialect='excel')
specials = ("i'm", "hello", "bye")

for line in data:
    line = str(line)
    new_line = str.replace(line, specials, '')
    writer.writerow(new_line.split(','))

input_file.close()
output_file.close()
So for this example:
hello. I'm obviously over the moon. If I am being honest I didn't think I'd get picked, so to get picked is obviously a big thing. bye.
I'd want the output to be:
obviously over the moon. If I am being honest I didn't think I'd get picked, so to get picked is obviously a big thing.
This however only works when I'm searching for a single word, so that specials = "I'm" for example. Do I need to add my words to a list or an array?
It looks like you aren't iterating through specials; str.replace expects a single string at a time rather than a whole tuple of them. Try this:
specials = ["i'm, "hello", "bye"]
for line in data:
new_line = str(line)
for word in specials:
new_line = str.replace(new_line, word, '')
writer.writerow(new_line.split(','))
It seems like you're already splitting the input via the csv.reader, but then you're throwing away all that goodness by turning the split line back into a string. It's best not to do this, but to keep working with the lists that are yielded from the csv reader. So, it becomes something like this:
for row in data:
    new_row = []  # A place to hold the processed row data.
    # look at each field in the row.
    for field in row:
        # remove all the special words.
        new_field = field
        for s in specials:
            new_field = new_field.replace(s, '')
        # add the sanitized field to the new "processed" row.
        new_row.append(new_field)
    # after all fields are processed, write it with the csv writer.
    writer.writerow(new_row)
I would like to write a Python script that addresses the following problem:
I have two tab-separated files; one has just one column containing a variety of words. The other file has one column that contains similar words, as well as columns of other information. However, within the first file, some lines contain multiple words, separated by " /// ". The other file has a similar problem, but the separator is " | ".
File #1
RED
BLUE /// GREEN
YELLOW /// PINK /// PURPLE
ORANGE
BROWN /// BLACK
File #2 (Which contains additional columns of other measurements)
RED|PINK
ORANGE
BROWN|BLACK|GREEN|PURPLE
YELLOW|MAGENTA
I want to parse through each file and match the words that are the same, and then append the columns of additional measurements too. But I want to ignore the /// in the first file, and the | in the second, so that each word will be compared to the other list on its own. The output file should have just one column of any words that appear in both lists, and then the appended additional information from file 2. Any help??
Addition info / update:
Here are 8 lines of File #1. I used color names above to keep it simple, but this is what the words really are; these are the "symbols":
ANKRD38
ANKRD57
ANKRD57
ANXA8 /// ANXA8L1 /// ANXA8L2
AOF1
AOF2
AP1GBP1
APOBEC3F /// APOBEC3G
Here is one line of file #2. What I need to do is take each symbol from file1 and see if it matches any one of the "synonyms" found in file2, in column 5 (here the synonyms are A1B|ABG|GAP|HYST2477). If any symbol from file1 matches ANY of the synonyms from col 5 of file2, then I need to append the additional information (the other columns in file2) onto the symbol in file1 and create one big output file.
9606 '\t' 1 '\t' A1BG '\t' - '\t' A1B|ABG|GAB|HYST2477'\t' HGNC:5|MIM:138670|Ensembl:ENSG00000121410|HPRD:00726 '\t' 19 '\t' 19q13.4'\t' alpha-1-B glycoprotein '\t' protein-coding '\t' A1BG'\t' alpha-1-B glycoprotein'\t' O '\t' alpha-1B-glycoprotein '\t' 20120726
File2 is 22,000 KB, file 1 is much smaller. I have thought of creating a dict much like has been suggested, but I keep getting held up with the different separators in each of the files. Thank you all for questions and help thus far.
EDIT
After your comments below, I think this is what you want to do. I've left the original post below in case anything in that was useful to you.
So, I think you want to do the following. Firstly, this code will read every separate synonym from file1 into a set - this is a useful structure because it will automatically remove any duplicates, and is very fast to look things up. It's like a dictionary but with only keys, no values. If you don't want to remove duplicates, we'll need to change things slightly.
file1_data = set()
with open("file1.txt", "r") as fd:
    for line in fd:
        file1_data.update(i.strip() for i in line.split("///") if i.strip())
Then you want to run through file2 looking for matches:
with open("file2.txt", "r") as in_fd:
with open("output.txt", "w") as out_fd:
for line in in_fd:
items = line.split("\t")
if len(items) < 5:
# This is so we don't crash if we find a line that's too short
continue
synonyms = set(i.strip() for i in items[4].split("|"))
overlap = synonyms & file1_data
if overlap:
# Build string of columns from file2, stripping out 5th column.
output_str = "\t".join(items[:4] + items[5:])
for item in overlap:
out_fd.write("\t".join((item, output_str)))
So what this does is open file2 and an output file. It goes through each line in file2, and first checks it has enough columns to at least have a column 5 - if not, it ignores that line (you might want to print an error).
Then it splits column 5 by | and builds a set from that list (called synonyms). The set is useful because we can find the intersection of this with the previous set of all the synonyms from file1 very fast - this intersection is stored in overlap.
What we do then is check if there was any overlap - if not, we ignore this line because no synonym was found in file1. This check is mostly for speed, so we don't bother building the output string if we're not going to use it for this line.
If there was an overlap, we build a string which is the full list of columns we're going to append to the synonym - we can build this as a string once even if there's multiple matches because it's the same for each match, because it all comes from the line in file2. This is faster than building it as a string each time.
Then, for each synonym that matched in file1, we write to the output a line which is the synonym, then a tab, then the rest of the line from file2. Because we split by tabs we have to put them back in with "\t".join(...). This is assuming I am correct you want to remove column 5 - if you do not want to remove it, then it's even easier because you can just use the line from file2 having stripped off the newline at the end.
Hopefully that's closer to what you need?
ORIGINAL POST
You don't give any indication of the size of the files, but I'm going to assume they're small enough to fit into memory - if not, your problem becomes slightly trickier.
So, the first step is probably to open file #2 and read in the data. You can do it with code something like this:
file2_data = {}
with open("file2.txt", "r") as fd:
    for line in fd:
        items = line.split("\t")
        file2_data[frozenset(i.strip() for i in items[0].split("|"))] = items[1:]
This will create file2_data as a dictionary which maps the set of words from the first column on to a list of the remaining items on that line. You also should consider whether words can repeat and how you wish to handle that, as I mentioned in my earlier comment.
After this, you can then read the first file and attach the data to each word in that file:
with open("file1.txt", "r") as fd:
with open("output.txt", "w") as fd_out:
for line in fd:
words = set(i.strip() for i in line.split("///"))
for file2_words, file2_cols in file2_data.iteritems():
overlap = file2_words & words
if overlap:
fd_out.write("///".join(overlap) + "\t" + "\t".join(file2_cols))
What you should end up with is each row in output.txt being one where the list of words in the two files had at least one word in common and the first item is the words in common separated by ///. The other columns in that output file will be the other columns from the matched row in file #2.
If that's not what you want, you'll need to be a little more specific.
As an aside, there are probably more efficient ways to do this than the O(N^2) approach I outlined above (i.e. it runs across one entire file as many times as there are rows in the other), but that requires more detailed information on how you want to match the lines.
For example, you could construct a dictionary mapping a word to a list of the rows in which that word occurs - this makes it a lot faster to check for matching rows than the complete scan performed above. This is rendered slightly fiddly by the fact you seem to want the overlaps between the rows, however, so I thought the simple approach outlined above would be sufficient without more specifics.
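For what it's worth, a rough sketch of that word-to-rows dictionary, built on file2 with the same column layout as the code above:
word_index = {}
with open("file2.txt", "r") as fd:
    for line in fd:
        items = line.split("\t")
        for word in (i.strip() for i in items[0].split("|")):
            # Each word points at the list of rows (remaining columns) it occurs in.
            word_index.setdefault(word, []).append(items[1:])

# A lookup for a word from file1 is then a single dictionary access:
rows_for_green = word_index.get("GREEN", [])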
Look at http://docs.python.org/2/tutorial/inputoutput.html for file i/o
Loop through each line in each file
file1set = set(file1line.split(' /// '))
file2set = set(file2line.split('|'))
wordsineach = list(file1set & file2set)
split will create an array of the color names
set() turns it into a set so we can easily compare differences in each line
Loop over 'wordsineach' and write to your new file
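Put together, a minimal sketch of that (assuming, as the fragments above do, that line N of file #1 is compared against line N of file #2):
with open('file1.txt') as f1, open('file2.txt') as f2, open('output.txt', 'w') as out:
    for file1line, file2line in zip(f1, f2):
        file1set = set(file1line.strip().split(' /// '))
        file2set = set(file2line.strip().split('|'))
        wordsineach = list(file1set & file2set)
        for word in wordsineach:
            out.write(word + '\n')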
Use the str.replace function
with open('file1.txt', 'r') as f1:
    content1 = f1.read()
content1 = content1.replace(' /// ', '\n').split('\n')

with open('file2.txt', 'r') as f2:
    content2 = f2.read()
content2 = content2.replace('|', '\n').split('\n')
Then use a list comprehension
common_words = [i for i in content1 if i in content2]
However, if you already know that no word appears more than once in each file, you can use set intersection to make life easier
common_words = list(set(content1) & set(content2))
Then to output the remainder to another file:
common_words = [i + '\n' for i in common_words] #so that we print each word on a new line
with open('common_words.txt', 'w') as f:
    f.writelines(common_words)
As to your 'additional information', I cannot help you unless you tell us how it is formatted, etc.
I have a data set that is in the format
100 domain bacteria phylum chloroflexi genus caldilinea
200 domain bacteria phylum acuuhgsdiuh genus blahblahbl
300
Basically, what I have been trying to do is create a function that scans through the different indexes separated by tabs and, when it finds the desired entry, appends the entry after it to a list [e.g. search for 'domain', append 'bacteria']. What I have works, except for the last entry: when I search for 'genus' it appends 'caldilinea\n\n200', which makes sense because it has line breaks after it, but I don't know how to make it append only the last index ['caldilinea' in this case] instead of the last index + line breaks + the first index of the row beneath it.
Here is my code as of now:
in_file = open(input_file,'r')
lines = in_file.read()
segment_tab = lines.split('\t')
next_index = [segment_tab[position + 1] for position, entry in enumerate(segment_tab) if entry == 'genus']
when I print next_index it should give me
'caldilinea','blahblahbl'
but instead it is giving me
'caldilinea\n\n200','blahblahbl\n\n300'
My data is a lot more complex than this and has hundreds of rows.
How can I get it to not include the line breaks and the beginning index of the next row?
You should either split by lines and then split by tabs, or simultaneously split by both.
The former could be done like this:
lines = in_file.readlines()
segment_tab = [line.split('\t') for line in lines]
More idiomatic would be something like:
segment_tab = [line.split('\t') for line in in_file]
Note that this will give you a list of lists of strings, not just a list of strings. This is different than what you seem to expect, but is the more conventional approach.
The other approach is to split by both, like this:
lines = in_file.read()
segment_tab = re.split(r'\t|\n+', lines)
This is kind of unconventional (it treats groups of newlines just like a tab), but seems to be what you're asking for.
Note that you'll need to import re for this to work.
for line in open('input_file', 'r'):
    segment_tab = line.strip().split('\t')
This will give you segment_tab = ['100', 'domain', 'bacteria', 'phylum', 'chloroflexi', 'genus', 'caldilinea'] for each line. Is this good enough?
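If you then want to collect the entry after 'genus' on every line, a short sketch building on that loop might be:
next_index = []
for line in open('input_file', 'r'):
    segment_tab = line.strip().split('\t')
    for position, entry in enumerate(segment_tab[:-1]):
        if entry == 'genus':
            next_index.append(segment_tab[position + 1])

print(next_index)   # e.g. ['caldilinea', 'blahblahbl']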