Python: Use regex to extract a column of a file

I am currently extracting columns in a file by using awk in os.system():
os.system("awk '{print $'%i'}' < infile > outfile"%some_column)
np.loadtxt('outfile')
Is there an equivalent way to accomplish this using regex?
Thanks.
Edit: I want to clarify that I am looking for the most efficient way to extract specific columns from large files.

Depending on what your data delimiters are, regex is probably overkill for this. If the delimiters are simple (whitespace or a specific character/string), you can separate columns simply by using the str.split method.
Here is an example program to explain how this might work:
column = 0  # First column
with open("data.txt") as file:
    data = file.readlines()
columns = list(map(lambda x: x.strip().split()[column], data))
To break this down:
column = 0

# Read a file named "data.txt" into an array of lines
with open("data.txt") as file:
    data = file.readlines()

# This is where we will store the columns as we extract them
columns = []

# Iterate over each line in the file
for line in data:
    # Strip the whitespace (including the trailing newline character) from
    # the start and end of the string
    line = line.strip()
    # Split the line, using the standard delimiter (an arbitrary number of
    # whitespace characters)
    line = line.split()
    # Extract the column data from the desired index and store it in our list
    columns.append(line[column])

# columns now holds a list of strings extracted from that column
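If large-file performance is the concern (per the edit above) and the data is numeric, np.loadtxt can do the extraction itself; a minimal sketch, assuming whitespace-delimited numeric data and the infile/some_column names from the question (note usecols is 0-based, unlike awk's $n):
import numpy as np

some_column = 2  # 0-based column index, i.e. awk's $3
data = np.loadtxt('infile', usecols=(some_column,))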

Related

I get an error about wrong dictionary update sequence length when trying to read lines from txt file into a dictionary

I'm trying to loop through multiple lines, add each into a dictionary, and then a dataframe.
I've had many attempts but no solution yet.
I have a txt file with multiple lines like this for example, and I'm trying to iterate through each line, add it to a dictionary and then append the dictionary to a dataframe.
So the text file, for example, would go from here:
ABC=123, DEF="456",
ABC="789", DEF="101112"
I would like this to be added to a dictionary like this (on the first loop, for the first line):
{ABC: 123, DEF: 456}
and then appended to a df like this:
   ABC     DEF
0  123     456
1  789  101112
So far I have tried this; it only works for one line in the text file. When I add a new line, I get this error:
dictionary update sequence element #6 has length 3; 2 is required
with open("file.txt", "r") as f:
s = f.read().strip()
dictionary = dict(subString.split("=") for subString in s.split(","))
dataframe = dataframe.append(dictionary, ignore_index=True)
dataframe
One suggestion is to parse each line with a regex, then insert the matches (if found) into the dictionary. You can change the regex pattern as needed, but this one matches a word on the left side of = with a number on the right that may be preceded by ' or ".
import re
import pandas as pd

pattern = r'(\w+)=[\'\"]?(\d+)'
str_dict = {}
with open('file.txt') as f:
    for line in f:
        for key, val in re.findall(pattern, line):
            str_dict.setdefault(key, []).append(int(val))
df = pd.DataFrame(str_dict)
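For the two sample lines in the question, str_dict ends up as {'ABC': [123, 789], 'DEF': [456, 101112]}, so the resulting frame matches the desired output:
   ABC     DEF
0  123     456
1  789  101112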
The pattern itself was chosen to match the key=value format shown in the question.
This also works in the scenario of a huge text file with many different strings:
import re
import pandas as pd

df = pd.DataFrame()
with open('event.txt', 'r') as fh:
    lines = fh.readlines()

for group in lines:
    data_dict = {}
    # rename the multi-word key so it survives the split on spaces
    text = group.replace('Event time', 'Event_time')
    # replace spaces inside quoted values with underscores so they
    # survive the split too
    for word in re.findall(r'".*?"', text):
        text = text.replace(word, word.replace(" ", "_"))
    for section in text.strip().split(' '):
        key, val = section.strip().split('=')
        data_dict[key.strip()] = val.strip()
    df = df.append(data_dict, ignore_index=True)
print(df)

To parse text file and create json out of it

I am new to Python. I want to parse a text file in which the first row contains the headers (the keys) and the next row (the second row) has their corresponding values.
The problem I'm facing is that the content in the text file is not aligned consistently: there are uneven spaces between the first and the second row, so I'm not able to use a simple delimiter.
Also, there is no guarantee that a header will always have a corresponding value in the next row; it may be empty sometimes.
After that, I want to make it a JSON format having those key-value pairs.
Any help would be appreciated.
import re

with open("E:\\wipro\\samridh\\test.txt") as read_file:
    line = read_file.readline()
    while line:
        # print(line, end='')
        new_string = re.sub(' +', ' ', line)
        line = read_file.readline()
        print(new_string)
PFA image of my text input
You can find the header fields and their indices with re.finditer from the re package. Then, use those boundaries to process the rest:
import re
import json

thefile = open("file.txt")
line = thefile.readline()
# each match is a header word plus its trailing whitespace; keep the
# name and the column boundaries
columns = [(m.group(0), m.start(0), m.end(0))
           for m in re.finditer(r"\w+\s+", line)]
records = []
while True:
    line = thefile.readline()
    if not line:
        break
    record = {}
    for col in columns:
        record[col[0]] = line[col[1]:col[2]]
    records.append(record)
print(json.dumps(records))
I'll leave it up to OP to strip whitespace and filter out empty entries. Not to mention error handling ;-).
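For instance, one possible cleanup pass (an assumption about what stripping and filtering should mean here):
cleaned = [
    {key.strip(): val.strip() for key, val in rec.items() if val.strip()}
    for rec in records
]
print(json.dumps(cleaned))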
Not quite sure what you want to do, but if I understand it correctly, and under these assumptions:
- you only have 2 lines in the file
- you have the same number of keys and values
- no spaces are allowed "inside" a key or value, i.e. spaces only separate elements
with open(fname) as f:
    content = f.readlines()
# you may also want to remove whitespace characters like `\n` at the end of each line
content = [x.strip() for x in content]
after that, content[0] is your keys line and content[1] is your values.
now all you need to do is this:
key_value_dict = {}
for key, value in zip(content[0].split(), content[1].split()):
    key_value_dict[key] = value
and your key_value_dict holds a dictionary (JSON-like) of keys and values.
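To get the JSON format the question asks for, the dictionary can then be serialized:
import json

json_string = json.dumps(key_value_dict)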
I assume that each of the headers is a single word without intervening whitespace. Then, to find out where each column starts, you can do this, for example:
with open("E:\\wipro\\samridh\\test.txt") as read_file:
line = next(read_file)
headers = line.split()
l_bounds = [line.find(word) for word in headers]
When splitting data lines, you will also need the right boundaries. If you know, say, that none of your data lines is longer than 1000 characters, you could do something like this:
r_bounds = l_bounds[1:] + [1000]
When you walk over the data lines, you put together the left and right limits and the headers like so:
out_str = json.dumps({name: line[l:r].strip()
                      for name, l, r in zip(headers, l_bounds, r_bounds)})
No regex required, by the way.
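Put together, a sketch of the whole approach (same file path as in the question; the 1000-character bound is the assumption stated above):
import json

with open("E:\\wipro\\samridh\\test.txt") as read_file:
    line = next(read_file)
    headers = line.split()
    l_bounds = [line.find(word) for word in headers]
    r_bounds = l_bounds[1:] + [1000]
    records = [{name: line[l:r].strip()
                for name, l, r in zip(headers, l_bounds, r_bounds)}
               for line in read_file]
out_str = json.dumps(records)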
Assumptions the below makes:
- Headers are one word (as they are in your example)
- Headers and values don't overlap: if header 1 goes from index 5 to 15, then its value in the row below will also be found within that same index range
Benefits of this approach are that the values can have spaces in them (as they do in your example). If you were to split both the header and value strings on spaces, they would have different numbers of elements and you wouldn't be able to combine them. Also, you wouldn't be able to detect values that are empty (as in your example).
Here is the approach I would take...
If you are sure your file headers are only one word each (no spaces), find the index of the first character of each header word and store them in an array. Every time you have two consecutive indices, extract the header between them, i.e. between (header1-firstchar, header2-firstchar - 1)...
Then take the second line and sequentially extract substrings from the same index ranges: (header1-firstchar, header2-firstchar - 1)...
Once you've done that combine the extracted header/key and values into a dictionary.
dictVersion = dict(zip(headers, values))
Then call the following:
import json
jsonVersion = json.dumps(dictVersion)
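A minimal sketch of this approach, assuming a hypothetical test.txt whose first line holds the headers and whose second line holds the values:
import json

with open("test.txt") as f:
    header_line = f.readline().rstrip("\n")
    value_line = f.readline().rstrip("\n")

headers = header_line.split()
# index of the first character of each header word
starts = [header_line.find(h) for h in headers]
# each column ends where the next header starts; the last runs to line end
ends = starts[1:] + [max(len(header_line), len(value_line))]

values = [value_line[s:e].strip() for s, e in zip(starts, ends)]
dictVersion = dict(zip(headers, values))
jsonVersion = json.dumps(dictVersion)
print(jsonVersion)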

Reading a specific column of a text file into a list (python 3.6.3)

I know there's a million questions on this but I couldn't find one that matches what I'm looking for. Let's say I have a text file like this:
1 34
2 65
3 106
And I want to scan this file and read only the second column, such that data=[34 65 106]. How might I go about this? Further, I'd like to make this program able to read a dataset of any length and any specific column input by the user. I can do most things in simple Python, but reading files eludes me.
pandas is a useful library for tasks such as this:
import pandas as pd
df = pd.read_csv('file.txt', header=None, delimiter=r"\s+")
lst = df.iloc[:, 1].tolist()
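If the file is large, read_csv can also skip parsing the unwanted columns via its usecols parameter; a possible variant:
import pandas as pd

df = pd.read_csv('file.txt', header=None, delimiter=r"\s+", usecols=[1])
lst = df[1].tolist()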
Solution
Sounds like a case for a small helper function:
def read_col(fname, col=1, convert=int, sep=None):
    """Read text files with columns separated by `sep`.

    fname - file name
    col - index of column to read
    convert - function to convert column entry with
    sep - column separator
    If sep is not specified or is None, any
    whitespace string is a separator and empty strings are
    removed from the result.
    """
    with open(fname) as fobj:
        return [convert(line.split(sep=sep)[col]) for line in fobj]

res = read_col('mydata.txt')
print(res)
Output:
[34, 65, 106]
If you want the first column, i.e. at index 0:
read_col('mydata.txt', col=0)
If you want them to be floats:
read_col('mydata.txt', col=0, convert=float)
If the columns are separated by commas:
read_col('mydata.txt', sep=',')
You can use any combination of these optional arguments.
Explanation
We define a new function with default parameters:
def read_col(fname, col=1, convert=int, sep=None):
This means you have to supply the file name fname. All other arguments are optional; their default values will be used if they are not provided when calling the function.
In the function, we open the file with:
with open(fname) as fobj:
Now fobj is an open file object. The file will be closed when we leave the indented block, i.e. here when we leave the function.
This:
[convert(line.split(sep=sep)[col]) for line in fobj]
creates a list by going through all lines of the file. Each line is split at the separator sep. We take only the value of the column with index col. We also convert the value to the datatype given by convert, i.e. to an integer by default.
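Spelled out as an explicit loop, the comprehension is equivalent to:
result = []
for line in fobj:
    parts = line.split(sep=sep)
    result.append(convert(parts[col]))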
Edit
You can also skip the first line in the file:
with open(fname) as fobj:
    next(fobj)
    return [convert(line.split(sep=sep)[col]) for line in fobj]
Or, more sophisticated, as an optional argument:
def read_col(fname, col=1, convert=int, sep=None, skip_lines=0):
    with open(fname) as fobj:
        # skip the first `skip_lines` lines
        for _ in range(skip_lines):
            next(fobj)
        return [convert(line.split(sep=sep)[col]) for line in fobj]
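For example, to skip a single header line in the hypothetical mydata.txt:
res = read_col('mydata.txt', skip_lines=1)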
You can use a list comprehension:
data = [b for a, b in [i.strip('\n').split() for i in open('filename.txt')]]
You will first need to get a list of all lines via fileobj.readlines().
Then you can run a for loop to iterate through the lines one by one; for each line, split it on the space character.
Then, in the same loop, append the element at index 1 of the split result to an existing list, which will be your final result:
with open('filename.txt') as fil:
    a = fil.readlines()

t = []
for f in a:
    e = f.split(" ")
    t.append(e[1])
Is the file delimited?
You'll want to first open the file:
with open('file.txt', 'r') as f:
    filedata = f.readlines()
Create a list, loop through the lines, split each line into a list based on your delimiter, and then append the indexed item to your output list.
data = []
for line in filedata:
    columns = line.split('*your delimiter*')
    data.append(columns[1])
Then the data list should contain what you want.

Slice strings in .txt and return only one of the new strings

I want to use lines of strings of a .txt file as search queries in other .txt files. But before this, I need to slice those strings of the lines of my original text data. Is there a simple way to do this?
This is my original .txt data:
CHEMBL2057820|MUBD_HDAC2_ligandset|mol2|42|dock12
CHEMBL1957458|MUBD_HDAC2_ligandset|mol2|58|dock10
CHEMBL251144|MUBD_HDAC2_ligandset|mol2|41|dock98
CHEMBL269935|MUBD_HDAC2_ligandset|mol2|30|dock58
... (over thousands)
And I need a new file where the lines contain only part of those strings, like:
CHEMBL2057820
CHEMBL1957458
CHEMBL251144
CHEMBL269935
Open the file, read in the lines, split each line at the | character, then take the first element of the result:
with open("test.txt") as f:
parts = (line.lstrip().split('|', 1)[0] for line in f)
with open('dest.txt', 'w') as dest:
dest.write("\n".join(parts))
Explanation:
- lstrip removes leading whitespace from the line
- split("|") returns a list like ['CHEMBL2057820', 'MUBD_HDAC2_ligandset', 'mol2', '42', 'dock12'] for each line
- Since we're only concerned with the first section, it's redundant to split the rest of the line on the | character, so we can pass a maxsplit argument, which stops splitting the string after that many splits have been made
- So split("|", 1) gives ['CHEMBL2057820', 'MUBD_HDAC2_ligandset|mol2|42|dock12']
- Since we're only interested in the first part, split("|", 1)[0] returns the "CHEMBL..." section
Use split and readlines:
with open('foo.txt') as f, open('bar.txt', 'w') as g:
    lines = f.readlines()
    for line in lines:
        l = line.strip().split('|')[0]
        g.write(l + '\n')

identify csv in python

I have a data dump that is a "messed up" CSV. (About 100 files, each with about 1000 lines of actual CSV data.)
The dump has some other text in addition to CSV. How can I extract the CSV part separately, programmatically?
As an example, the data file looks something like this:
Session:1
Data collection date: 09-09-2016
Related questions:
Question 1: parta, partb, partc,
Question 2: parta, partb, partc
"field1","field2","field3","field4"
"data11","data12","data13","data14"
"data21","data22","data23","data24"
"data31","data32","data33","data34"
"data41","data42","data43","data44"
"data51","data52","data53","data54"
I need to extract the csv part.
Caveats:
- the text at the beginning is NOT limited to 4-5 lines
- the additional text is NOT just at the beginning of the file
I saw this post, which suggests using re.split and/or csv.Sniffer; however, my attempt was not fruitful.
with open("untitled.csv") as csvfile:
dialect = csv.Sniffer().sniff(csvfile.read(1024))
csvfile.seek(0)
print(dialect.__dict__)
csvstarts = False
csvdump = []
for ln in csvfile.readlines():
toks = re.split(r'[,]', ln)
print(toks)
if toks[0] == '"field1"' and not csvstarts: # identify by the header line
csvstarts = True
continue
if csvstarts:
if toks[0] == '"field1"': # identify the start of subsequent csv data
csvstarts = False
continue
csvdump.append(ln) # record the current line
print(csvdump)
For now I am able to identify the csv lines accurately ONLY if there is one bunch of data.
Is there anything better I can do?
How about this:
import re

my_pattern = re.compile(r'("[\w]+",)+')
with open('<your_file>') as fi:
    for line in fi:
        result = my_pattern.match(line)
        if result:
            print(line)
This assumes the csv data can be differentiated from the rest by containing no special characters: each element is letters or numbers surrounded by double quotes, with a comma separating it from the next element.
If your csv lines and only those lines start with \", then you can do this:
import csv

# for quotechar, use something that won't turn up in the data
data = list(csv.reader(open("test.csv"), quotechar='¬'))

def importCSV(data):
    # outputs a list of lists with the required data
    # works on the assumption that all required data starts with "
    # and that no other text starts with "
    out = []
    for line in data:
        if (line != []) and (line[0][0] == "\""):
            line = [el.replace("\"", "") for el in line]
            out.append(line)
    return out

useful = importCSV(data)
Can you not read each line and apply a regex to see whether or not to pull the data?
Maybe something like:
^(["][\w]+["][,])+["][\w]+["]$
My regex is not the best and there may well be a better way, but that seemed to work for me.
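Applied line by line, that idea looks something like this (a sketch; the file name is a placeholder):
import re

pattern = re.compile(r'^(["][\w]+["][,])+["][\w]+["]$')
csv_lines = []
with open('dump.txt') as f:
    for line in f:
        if pattern.match(line.strip()):
            csv_lines.append(line)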
