Reading an Excel spreadsheet of regular expressions - Python

I'm creating a program to parse data. My dictionary is growing quite long. Therefore, I'd like to save it as a file that can be read in. Preferably xlsx, but a txt file will work too. Besides cleaning up the program, this will also allow me to call different dictionaries depending on what data is to be extracted.
The dictionary looks like this:
import re
import pandas as pd

my_Dict = {
    'cat': re.compile(r'CAT (?P<cat>.*)\n'),
    'dog': re.compile(r'DOG (?P<dog>.*)\n'),
    'mouse': re.compile(r'MOUSE (?P<mouse>.*)\n'),
}
What's the best way to lay this out in an xlsx or txt file so it's most easily readable? And how do you then read it back in to use as a dictionary?
I've been able to write this dictionary to a file, but it never reads back in the way I wrote it.
Thanks!

I would recommend a Comma-Separated Values (.csv) file. You can treat it as a plain text file or open it in Excel without much difficulty.
In that file, your dict would look like:
cat, CAT (?P<cat>.*)\n
dog, DOG (?P<dog>.*)\n
mouse, MOUSE (?P<mouse>.*)\n
As for reading it back in, you would just need to loop over the lines and split each one at the comma, using the first part as the key and the second as the value.
import re

my_dict = {}
with open(filename) as f:
    for line in f:
        # Split the line on the comma
        split_line = line.split(',')
        # .strip() removes either the specified characters or, if no argument
        # is given, leading and trailing whitespace
        my_dict[split_line[0].strip()] = re.compile(split_line[1].strip())
However, if you need to include commas in your regexes or names, this will break. In that case, a Tab-Separated Values (.tsv) file would probably work; instead of splitting on ',', you would split on '\t'.
If neither of these works, you can split on just about any arbitrary character; however, MS Excel will recognize and be able to open both .csv and .tsv files readily.
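Alternatively, the csv module handles quoted fields for you, so commas inside a pattern are safe as long as the field is quoted. A minimal sketch, assuming a two-column file (the name patterns.csv is just an example):
import csv
import re

my_dict = {}
with open('patterns.csv', newline='') as f:
    # Assumes exactly two columns per row: name, pattern
    for name, pattern in csv.reader(f, skipinitialspace=True):
        # A quoted field may safely contain commas, e.g. "CAT (?P<cat>.*),\n"
        my_dict[name] = re.compile(pattern)
If you do prefer xlsx, pandas.read_excel can load the same two columns into a DataFrame, and building the dictionary from them is the same idea.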

Related

Replace blank values with string

I need to manipulate a CSV file: go through the file and look for blank fields between columns c0-c5 in my example CSV file. Wherever there are blanks, I would like to replace the blank with any verbiage I want, like "not found".
The only code I have so far drops a column I do not need; for the manipulation itself I really cannot find anything. Maybe it is not possible?
Also, I am wondering how to change a column name. Thanks.
#!/bin/env python
import pandas

data = pandas.read_csv('report.csv')
# Drop the unneeded 'date' column
data = data.drop(['date'], axis=1)
data.to_csv('final_report.csv')
Alternatively, and taking your comment question into account (if you do not necessarily want to use pandas as in n1colas.m's answer), use string replacements and simply loop over your file:
with open("modified_file.csv","w") as of:
with open("report.csv", "r") as inf:
for line in inf:
if "#" not in line: # in the case your csv file has a comment marker somewhere and it is called #, the line is skipped, which means you get a clean comma separated value file as the outfile- if you do want to keep such lines simply remove the if condition
mystring=line.replace(", ,","not_found").replace("data","input") # in case it is not only one blank space you can also use the regex for n times blank space here
print(mystring, file=of, end=""); # prints the replaced line to outfile and writes no newline
I know this is not the most efficient way to do it, but it is probably the easiest to understand and to modify to your heart's desire. For any reasonably sized CSV file it should still work nearly instantaneously.
Also, for testing purposes, always write such replacements to a separate file (of) instead of writing back to your infile, as your question seems to suggest. Check that it did what you wanted; ONLY THEN overwrite your infile. This may seem unnecessary at first, but mistakes happen...
You need to run this line:
data['data'] = data['data'].fillna("not found")
Here is the documentation: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html
Here is an example:
import pandas
data = pandas.read_csv('final_report.csv')
data.info()
data['data'] = data['data'].fillna("Something")
print(data)
I would suggest changing the data variable to something different, because your column has the same name, which can be confusing.
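Since the question also asks about filling blanks across the c0-c5 columns and about renaming a column, here is a hedged sketch; the column names c0-c5 and 'data' are taken from the question and may need adjusting:
import pandas

df = pandas.read_csv('report.csv')
# Fill blanks in the c0-c5 columns in one go
cols = ['c0', 'c1', 'c2', 'c3', 'c4', 'c5']
df[cols] = df[cols].fillna("not found")
# Renaming a column, as the question also asks
df = df.rename(columns={'data': 'input'})
df.to_csv('final_report.csv', index=False)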

python csv to tsv: in case record has comma inside

I'm trying to convert csv format to tsv.
I just used this to change commas to tabs:
tsv = re.sub(',', '\t', csv)
But I can't deal with the string with comma inside, for example:
dept_no,dt,hello
1,20180411,hello its me
2,20180412,here has tab
3,20180412,"here, is, commas"
Is there any way to convert to tsv without affecting comma inside of the string?
Try the following; the csv module should take care of inline commas.
import csv
csv.writer(file('output.tsv', 'w+'), delimiter='\t').writerows(csv.reader(open("input.csv")))
EDIT-1
[Added open method as per discussion]
Instead of using file, where we are calling the constructor directly, you can use open, which is the preferred way as per the documentation here.
The result is the same.
csv.writer(open('output.tsv', 'w+'), delimiter='\t').writerows(csv.reader(open("input.csv")))
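Note that the first one-liner is Python 2 only (file() no longer exists in Python 3), and neither one closes its files explicitly. A hedged Python 3 version uses context managers and newline='', which keeps the csv module from inserting extra blank lines on Windows:
import csv

# Python 3 version: newline='' is the documented way to open csv files,
# and the with-statement closes both files
with open("input.csv", newline='') as src, open("output.tsv", "w", newline='') as dst:
    csv.writer(dst, delimiter='\t').writerows(csv.reader(src))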

Converting a large wrongly created csv file into a tab delimited file using python and pandas

I have a very large csv file (>3GB, > 75million rows).
Problem is, it should not have been created as csv, but tab delimited.
The file has two columns, a string, and an integer. However, the string can have commas (for example: "Yes, it is very nice"), so, now the file may look like this, and it does not have a consistent number of columns and I cannot read it with pandas read_csv.
STRING,CODE
This is nice,1
That is also nice,2
Yes,it is very nice,3
I love everything,4
I am trying to convert it to a tab-delimited file by changing the last comma in each line into a tab. Since the file is huge, I cannot read it into memory. This is what I tried.
I read the file in chunks:
for ch in pandas.read_table("path", chunksize=256):
I define a function, myfunc, as follows:
def myfunc(s):
    # Split on the last comma only, then rejoin the two parts with a tab
    li = s.rsplit(",", 1)
    ret = "\t".join(li)
    return ret
Now, for each chunk I do something like:
data["STRING,CODE"] = data["STRING,CODE"].map(lambda x: x.myfunc(x))
data.to_csv("tmp.csv", sep="\t")
and I get something like:
STRING CODE
0 "This is nice 1
1 "That is also nice
2 "Yes it is very nice 3"
3 "I love everything 4"
Which is nothing like what I want. The entries are not separated the way I want, I get extra indices, and extra quotation marks. Besides, even after I am able to fix this for one chunk, I need to go back and append to the csv file to recreate the whole file.
Sorry this is messy, but I am lost. Any help?
File:
STRING,CODE
This is nice,1
That is also nice,2
Yes,it is very nice,3
I love everything,4
You shouldn't need pandas here. Just iterate through the lines of the file and write the fixed lines to a new file.
with open('new.csv', 'w') as newcsv:
    with open('file.csv') as csvf:
        for line in csvf:
            # rpartition splits on the last comma only
            head, _, tail = line.strip().rpartition(',')
            newcsv.write('{}\t{}\n'.format(head, tail))
This should get the job done.
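A quick, purely illustrative check on the awkward line from the sample file:
line = "Yes,it is very nice,3\n"
head, _, tail = line.strip().rpartition(',')
print('{}\t{}'.format(head, tail))
# prints "Yes,it is very nice" and "3" separated by a tab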
You don't even have to use Python:
sed -i 's/\(.*\),/\1\t/' $INPUT
does an in-place replacement of the last , in each line with a \t (tab).
If you want to preserve the input:
sed 's/\(.*\),/\1\t/' $INPUT > $OUTPUT
I suspect this would be faster than running it through Python, but that's just a guess.

Keeping format of text (.txt) files when reading and rewriting

I have a .txt file containing formatting elements such as \n for line breaks, which I want to read and then rewrite its data, up to a specific line, into a new .txt file. My code looks like this:
with open(filename) as f:
    content = f.readlines()
with open("lf.txt", "w") as file1:
    file1.write(str(content))
    file1.close()  # redundant: the with-statement already closes the file
The output file lf.txt is produced, but it throws away the formatting of the input file. Is there a way to keep the input file's formatting when rewriting it to a new file?
You converted content to a string, while it's really a list of strings (lines).
Use join to convert the lines back to a string:
file1.write(''.join(content))
join is a string method; in the example it is called on an empty string. The string it is called on is used as the separator between the joined items; here we don't need any separator, just to concatenate the lines.
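Since the question mentions rewriting the data only up to a specific line, here is a minimal sketch of that variant; the "STOP" marker is a made-up placeholder for whatever identifies your cut-off line:
with open(filename) as f:
    content = f.readlines()

with open("lf.txt", "w") as file1:
    for line in content:
        if line.startswith("STOP"):  # hypothetical marker for the cut-off line
            break
        file1.write(line)  # each line keeps its trailing '\n', so formatting survives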

Print elements of a list to a .csv file

I am reading in a csv file and dealing with each line as a list. At the end, I'd like to write the lines back out to a .csv file, but they aren't necessarily even. I obviously cannot just go "print row", since that will print each one as a list. How can I print them in .csv format?
Read the manual; there's a csv.writer method (with examples, too). Don't print the data: store it and then write it to a CSV file.
http://docs.python.org/library/csv.html#csv.writer
Assuming that "row" contains a list of strings, you could try using
print(",".join(row))
Though it looks like your question was aimed more at writing to a csv file than at printing to standard output, here is an example of the latter using StringIO:
import io
import csv

a_list_from_csv_file = [['foo', 'bar'], [1, 2]]
out_fd = io.StringIO()
writer = csv.writer(out_fd, delimiter='|')
for row in a_list_from_csv_file:
    writer.writerow(row)
print(out_fd.getvalue())
This way you can use whatever delimiter or escape characters the csv you are reading uses.
What do you mean by "the lines aren't necessarily even"? Are you using a homebrew CSV parser or are you using the csv module?
If the former, and there's nothing special you need to escape, you could try something like
print(",".join('"' + x.replace('"', '""') + '"' for x in row))
If you don't want ""s around each field, maybe make a function like escape_field() that checks whether the value needs to be wrapped in double quotes and/or escaped, as sketched below.
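A hedged sketch of such an escape_field helper (hand-rolled; in practice the csv module does this for you):
def escape_field(x):
    # Quote the field only when it contains a delimiter, a quote, or a newline
    if any(c in x for c in ',"\n'):
        return '"' + x.replace('"', '""') + '"'
    return x

# "row" is the list of strings from the question
print(",".join(escape_field(x) for x in row))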
