Replace blank values with string - Python

I need to manipulate a csv file: go through the file, look for blank fields between c0-c5 in my example csv file, and wherever there is a blank, replace it with any verbiage I want, like "not found".
The only code I have so far drops a column I do not need, but for the manipulation I need I really cannot find anything.. maybe it is not possible?
Also, I am wondering how to change a column name. Thanks.
#!/usr/bin/env python
import pandas
data = pandas.read_csv('report.csv')
data = data.drop(['date'], axis=1)  # drop the unneeded 'date' column
data.to_csv('final_report.csv')

Alternatively, and taking your comment question into account (if you do not necessarily want to use pandas as in n1colas.m's answer), use string replacements and simply loop over your file:
with open("modified_file.csv","w") as of:
with open("report.csv", "r") as inf:
for line in inf:
if "#" not in line: # in the case your csv file has a comment marker somewhere and it is called #, the line is skipped, which means you get a clean comma separated value file as the outfile- if you do want to keep such lines simply remove the if condition
mystring=line.replace(", ,","not_found").replace("data","input") # in case it is not only one blank space you can also use the regex for n times blank space here
print(mystring, file=of, end=""); # prints the replaced line to outfile and writes no newline
I know this is not the most efficient way to do it, but it is probably the one where you can most easily understand what you are doing and modify it to your heart's desire.
For any reasonably sized csv file it should still work nearly instantaneously.
Also, for testing purposes, always write such replacements to a separate file (of) instead of writing to your infile, as your question seems to suggest. Check that it did what you wanted. ONLY THEN overwrite your infile. This may seem unnecessary at first, but mistakes happen...
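If the blank can be more than a single space, the regex hinted at in the comment above can handle it. A minimal sketch, assuming blanks are runs of whitespace between two commas (leading or trailing blank fields would need extra handling):
import re

with open("modified_file.csv", "w") as of:
    with open("report.csv", "r") as inf:
        for line in inf:
            # ",\s*(?=,)" matches a comma plus any run of whitespace that is
            # immediately followed by another comma; the lookahead keeps runs
            # of adjacent empty fields like ",,," from being skipped
            fixed = re.sub(r",\s*(?=,)", ",not_found", line)
            print(fixed, file=of, end="")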

You have to run this line:
data['data'] = data['data'].fillna("not found")
Here is the documentation: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html
Here is an example:
import pandas
data = pandas.read_csv('final_report.csv')
data.info()
data['data'] = data['data'].fillna("Something")
print(data)
I would suggest renaming the data variable to something different, because your column has the same name and that can be confusing.
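To tie this back to the original question, here is a minimal sketch that fills blanks in the c0-c5 columns and also renames one; the column names c0-c5 and the name new_name are assumptions for illustration:
import pandas

df = pandas.read_csv('report.csv')
cols = ['c0', 'c1', 'c2', 'c3', 'c4', 'c5']  # hypothetical column names
# fillna handles the blank (NaN) cells; restricting it to these columns
# leaves any other columns untouched
df[cols] = df[cols].fillna('not found')
# rename answers the "how to change a column name" part of the question
df = df.rename(columns={'c0': 'new_name'})  # 'new_name' is hypothetical
df.to_csv('final_report.csv', index=False)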

Related

Reading an Excel spreadsheet of Regular expressions

I'm creating a program to parse data. My dictionary is growing quite long. Therefore, I'd like to save it as a file that can be read in. Preferably xlsx, but a txt file will work too. Besides cleaning up the program, this will also allow me to call different dictionaries depending on what data is to be extracted.
The dictionary looks like this:
import re
import pandas as pd
my_Dict = {
    'cat': re.compile(r'CAT (?P<cat>.*)\n'),
    'dog': re.compile(r'DOG (?P<dog>.*)\n'),
    'mouse': re.compile(r'MOUSE (?P<mouse>.*)\n'),
}
What's the best format to put this in, in xlsx or txt form, to make it most easily readable? And how do you then read it back in to use as a dictionary?
I've been able to write this dictionary to a file, but it never reads back in how I just wrote it.
Thanks!
I would recommend a Comma Separated Value (.csv) file. You can treat it as a plain text file or open it in Excel without much difficulty.
Your dict would look like:
cat, CAT (?P<cat>.*)\n
dog, DOG (?P<dog>.*)\n
mouse, MOUSE (?P<mouse>.*)\n
As far as reading it, you would just need to loop over the lines and separate them at the comma, using the first part as the key and the second as the value.
my_dict = {}
with open(filename) as f:
    for line in f:
        # Split the line on the comma
        split_line = line.split(',')
        # .strip() removes either specified characters or, if no argument is
        # given, leading and trailing whitespace
        my_dict[split_line[0].strip()] = re.compile(split_line[1].strip())
However, if you need to include commas in your regexes or names, this will break. In that case, a Tab Separated Value (.tsv) file would probably work. Instead of splitting on ',', you would split on '\t'.
If neither of these work, you can split on just about any arbitrary character, however MS Excel will recognize and be able to open both .csv and .tsv files readily.
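As a rough usage sketch tying the pieces together (patterns.csv is an assumed file name holding the layout above; a maxsplit of 1 also tolerates commas inside the pattern part):
import re

my_dict = {}
with open('patterns.csv') as f:  # assumed file name
    for line in f:
        # split only on the first comma, so commas inside the regex survive
        key, pattern = line.split(',', 1)
        my_dict[key.strip()] = re.compile(pattern.strip())

# apply the loaded patterns to a line of data
sample = 'CAT whiskers\n'
for name, regex in my_dict.items():
    match = regex.search(sample)
    if match:
        print(name, match.group(name))  # -> cat whiskers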

How to remove extra semicolons from the end of some lines in a txt

I am new to Stack Overflow, so if my post is not correctly posted or you need more info, please let me know. I have a really weird problem: I have a txt file with a lot of lines separated by ";". Normally there should be 42 fields/columns, but for some reason, when the file is imported and split on ";", a large number of lines get skipped because python "expected 42 fields, saw 45". I import the file using pandas, as most of my transformations are done with it:
text = pd.read_csv('file.txt',encoding='ISO-8859-1', keep_default_na=False,error_bad_lines=False, sep=';')
What I found out is that some lines have 3 extra ";" at the end. Because most of the data is confidential and I cannot share it outside my company, I generated a similar 3-line txt file to show you where my issue lies.
;;;5123123;text1;text2;;;;123124;text3;text4;;;;5234234;text5;text6;;;;412321;text7;text8;;;;512312;text9;text10;;;;15123213;text11;text12;;;;123123;text13;text14
;;;4666190;text1;text2;;;;312312;text3;text4;;;;5123123;text5;text6;;;;;;;;;;;;;;;;;;;;;;55123;text7;text8
;;;5123123;text1;text2;;;;1321321;text3;text4;;;;123124;text5;text6;;;;;;;;;;;;;;;;;;;;;;3123123;512312312;text7;;;
Those are three similar lines from my file, but with substituted names. The first and second lines are correct, but the third yields 45 fields when imported.
So is there a way that I can go through the file before importing it, look for all lines starting with ;;;5123123, check if there are extra ";" at the end, remove them if so, and after that of course import the file? The problem is only with some lines starting with ;;;5123123. There are a few hundred lines with this error, and the whole data is a little more than 50k lines.
I believe pd is pandas, so you can use the usecols argument of the read_csv method:
text = pd.read_csv('file.txt',
                   encoding='ISO-8859-1',
                   keep_default_na=False,
                   error_bad_lines=False,
                   sep=';',
                   usecols=list(range(43)),
                   names=list(range(43)),
                   header=None)
Edit: you can also add the names and header arguments, as shown above.
Have you tried splitting into a list and then removing the blank elements?
with open('file.txt') as f:  # text mode, so split() works on strings
    raw_str = f.read()
full_list = raw_str.split(';')
templist = list(filter(None, full_list))  # drop the empty strings
Printing templist gives a list of all the elements; you can perform any action on it according to your requirements, for example converting it back into a string with a for loop.
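That said, if you want the pre-cleaning step described in the question (strip only the surplus trailing ";" from the offending lines before importing), a minimal sketch might look like this; clean_file.txt is an assumed output name, and 42 fields means 41 separators:
with open('file.txt', encoding='ISO-8859-1') as inf, \
        open('clean_file.txt', 'w', encoding='ISO-8859-1') as outf:
    for line in inf:
        line = line.rstrip('\n')
        if line.startswith(';;;5123123'):
            # drop only the surplus trailing ';' so exactly 42 fields remain
            while line.count(';') > 41 and line.endswith(';'):
                line = line[:-1]
        outf.write(line + '\n')

# afterwards: pd.read_csv('clean_file.txt', encoding='ISO-8859-1', sep=';', ...)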

Converting a large wrongly created csv file into a tab delimited file using python and pandas

I have a very large csv file (>3GB, > 75million rows).
Problem is, it should not have been created as csv, but tab delimited.
The file has two columns, a string and an integer. However, the string can have commas (for example: "Yes, it is very nice"), so the file does not have a consistent number of columns and I cannot read it with pandas read_csv. The data it holds looks like this:
STRING CODE
This is nice 1
That is also nice 2
Yes it is very nice 3
I love everything 4
I am trying to convert it to a tab delimited file by changing the last comma into a tab. Since the file is huge, I cannot read it into memory. This is what I tried.
I read the file in chunks:
for ch in pandas.read_table("path", chunksize=256):
I define a function, myfunc, as follows:
def myfunc(s):
    # split on the last comma only, then rejoin with a tab
    li = s.rsplit(",", 1)
    return "\t".join(li)
Now, for each chunk I do something like:
data["STRING,CODE"] = data["STRING,CODE"].map(lambda x: x.myfunc(x))
data.to_csv("tmp.csv", sep="\t")
and I get something like:
STRING CODE
0 "This is nice 1
1 "That is also nice
2 "Yes it is very nice 3"
3 "I love everything 4"
Which is nothing like what I want. The entries are not separated the way I want, I get extra indices, and extra quotation marks. Besides, even after I am able to fix this for one chunk, I need to go back and append to the csv file to recreate the whole file.
Sorry this is messy, but I am lost. Any help?
File:
STRING,CODE
This is nice,1
That is also nice,2
Yes,it is very nice,3
I love everything,4
You shouldn't need pandas here. Just iterate through the lines of the file and write the fixed lines to a new file.
with open('new.csv', 'w') as newcsv:
    with open('file.csv') as csvf:
        for line in csvf:
            # rpartition splits on the last comma only, so commas inside the
            # string column are preserved
            head, _, tail = line.strip().rpartition(',')
            newcsv.write('{}\t{}\n'.format(head, tail))
This should get the job done.
You don't even have to use Python:
sed -i 's/\(.*\),/\1\t/' $INPUT
does an in-place replacement of the last , in each line with a \t.
If you want to preserve the input:
sed 's/\(.*\),/\1\t/' $INPUT > $OUTPUT
I suspect this would be faster than running it through python, but that's just a guess.

Replacing cell, not string

I have the following code.
import sys
import fileinput

map_dict = {'*': '999999999', '**': '999999999'}
for line in fileinput.FileInput("test.txt", inplace=1):
    for old, new in map_dict.items():  # iteritems() was Python 2; items() works in Python 3
        line = line.replace(old, new)
    sys.stdout.write(line)
I have a txt file
1\tab*
*1\tab**
Then running the python code generates
1\tab999999999
9999999991\tab999999999
However, I want to replace a "cell" (sorry if this is not standard terminology in Python; I am using the terminology of Excel), not a string.
The second cell is
*
So I want to replace it.
The third cell is
1*
This is not *. So I don't want to replace it.
My desired output is
1\tab999999999
*1\tab999999999
How should I do this? The user will tell the program which delimiter is being used, but the program should replace only whole cells, not substrings.
Also, how do I write to a separate output txt rather than overwriting the input?
Open a file for writing, and write to it.
Since you want to replace the exact complete values (for example, not touch 1*), do not use replace. Instead, to analyze each value, split your lines on the tab character ('\t').
You must also remove end-of-line characters (as they may prevent matching the last cell in a row).
Which gives:
MAPS = (('*', '999999999'), ('**', '999999999'))

with open('output.txt', 'w') as out_file:
    for line in open('test.txt', 'r'):
        out_list = []
        # strip the newline, then split the row into cells on the tab character
        for inp_cell in line.rstrip('\n').split('\t'):
            out_cell = inp_cell
            for old, new in MAPS:
                # only replace when the whole cell matches
                if out_cell == old:
                    out_cell = new
            out_list.append(out_cell)
        out_file.write("\t".join(out_list) + "\n")
There are more condensed/compact/optimized ways to do it, but I detailed each step on purpose, so that you may adapt to your needs (I was not sure this is exactly what you ask for).
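For instance, one of those more condensed variants folds the pairs into a dict lookup; a sketch with the same MAPS pairs and identical behavior:
MAPS = (('*', '999999999'), ('**', '999999999'))
map_dict = dict(MAPS)

with open('output.txt', 'w') as out_file, open('test.txt') as in_file:
    for line in in_file:
        cells = line.rstrip('\n').split('\t')
        # dict.get(c, c) swaps a cell only when it matches a key exactly
        out_file.write('\t'.join(map_dict.get(c, c) for c in cells) + '\n')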
The csv module can help:
#!python3
import csv

map_dict = {'*': '999999999', '**': '999999999'}
with open('test.txt', newline='') as inf, open('test2.txt', 'w', newline='') as outf:
    w = csv.writer(outf, delimiter='\t')
    for line in csv.reader(inf, delimiter='\t'):
        line = [map_dict[item] if item in map_dict else item for item in line]
        w.writerow(line)
Notes:
with will automatically close files.
csv.reader parses and splits lines on a delimiter.
A list comprehension translates line items in the dictionary into a new line.
csv.writer writes the line back out.

Using 'r+' mode to overwrite a line in a file with another line of the same length

I have a file called vegetables:
carrots
apples_
cucumbers
What I want to do is open the file in python, and modify it in-place, without overwriting large portions of the file. Specifically, I want to overwrite apples_ with lettuce, such that the file would look like this:
carrots
lettuce
cucumbers
To do this, I've been told to use 'r+' mode. However, I don't know how to overwrite that line in place. Is that possible? All the solutions I am familiar with involve caching the entire file, and then overwriting the entire file, for a small amendment. Is this really the best option?
Important note: the replacement line is always the same length as the original line.
For context: I'm not really concerned with a file on vegetables. Rather, I have a textfile of about 400 lines to which I need to make revisions roughly every two minutes. I have a script to do this, but I want to do it more efficiently.
An answer that works with your example:
with open("vegetables","r+") as t:
data = t.read()
t.seek(data.index("apples_"))
t.write("lettuce")
Although it might not be worth it to complicate things like this; it's fine to just read the entire file and then overwrite the entire file. You aren't going to save much by doing something like my example.
NOTE: this only works if the replacement has exactly the same length as the original text you are replacing.
Edit 1: a (possibly bad) example to replace all matches:
import re

with open("test", "r+") as t:
    data = t.read()
    for m in re.finditer("apples_", data):
        t.seek(m.start())
        t.write("lettuce")
Edit 2: something a little more complex using a closure, so that it can check for multiple words to replace:
import re

def get_find_and_replace(f):
    """f --> a file that is open with r+ mode"""
    data = f.read()
    def find_and_replace(old, new):
        for m in re.finditer(old, data):
            f.seek(m.start())
            f.write(new)
    return find_and_replace

with open("test", "r+") as f:
    find_and_replace = get_find_and_replace(f)
    find_and_replace("apples_", "lettuce")
    #find_and_replace(...,...)
    #find_and_replace(...,...)
If I understand you correctly, fileinput.input should work, provided the string is not a substring of another:
import fileinput

for line in fileinput.input("in.txt", inplace=True):
    print(line.rstrip().replace("apples_", "lettuce"))
With inplace=True, print(line.rstrip().replace("apples_","lettuce")) actually writes to the file in place; it does not print the line to the screen.
You can also check for multiple words to replace in one pass:
old = "apples_"
for line in fileinput.input("in.txt",inplace=True):
if line.rstrip() == old:
print(line.rstrip().replace(old,"lettuce"))
elif ....
elif....
else:
print(line.rstrip())
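As a concrete sketch of that multi-word idea, a dict can stand in for the elif chain; the 'pears'/'bananas' pair is invented purely for illustration:
import fileinput

replacements = {"apples_": "lettuce", "pears": "bananas"}  # second pair is hypothetical

for line in fileinput.input("in.txt", inplace=True):
    cell = line.rstrip()
    # with inplace=True, print() writes into the file instead of to stdout
    print(replacements.get(cell, cell))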
