Excel CSV ending up all on one line - python

I'm looking for help in understanding why my name separator script isn't working. I am working through 'Automate the Boring Stuff With Python' and had the opportunity to test some things out at work today. I recognize this is probably not the most efficient solution, but I'm trying to put my learning to work.
The Goal
I have an excel file with first and last names in a single cell. I need to separate these into two cells, one for first name and one for last name.
The Process
I began by saving the Excel file as a .csv to then open in a text editor.
Used regular expressions to find the full name, grouping first and last names separately. (see the code in link provided)
I copy the raw .csv text to the clipboard using pyperclip (I don't know how to read from files yet.)
I extract the name data using the regex.
I run a for loop which creates a string with first name + ',' + last name + ',' so that excel will put the first and last names in different cells.
I want to end each firstName,lastName, pair with a new line so that my .csv file looks like:
firstname,lastname,
firstname2,lastname2,
etc...
I'm getting stuck on the last step. My for loop gets the firstname,lastname, pairs correct, but when I paste from the clipboard, the newline characters are not inserted. Everything is pasted as one huge string. Since I'm appending a new line character each cycle, shouldn't it paste everything on separate lines? Please help me understand what I'm missing!
Here is a link to my script: https://github.com/RNGeezus/name-separator/blob/master/name_separator.py
Here is what my .csv file looks like (recreated with dummy names to protect people's privacy):
my sample

Figured it out! Turns out I needed each pair to be followed by a \r\n. I was doing carriage returns, but no newlines. Doh!
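For anyone hitting the same thing, here is a minimal sketch of that fix. It is not the script linked above: `NAME_PATTERN` and `to_csv_rows` are hypothetical names, and it assumes each input line holds exactly one first name and one last name separated by spaces or tabs.

```python
import re

# Assumption: one "First Last" pair per line, separated by spaces or tabs.
NAME_PATTERN = re.compile(r"(\S+)[ \t]+(\S+)")


def to_csv_rows(raw_text):
    """Return 'first,last,' rows, each terminated by CR+LF."""
    rows = []
    for line in raw_text.splitlines():
        match = NAME_PATTERN.fullmatch(line.strip())
        if match:
            rows.append(match.group(1) + "," + match.group(2) + ",")
    # End every row with "\r\n" rather than a bare "\r".
    return "".join(row + "\r\n" for row in rows)


csv_text = to_csv_rows("Ada Lovelace\nAlan Turing\n")
```

The resulting string can then be handed to pyperclip.copy() as in the original workflow. Lines that do not match the two-name pattern are skipped in this sketch.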

Related

Search for a word, and modify the whole line in Python text processing

This is my carDatabase.txt
CarID:c01 ModelName:honda VehicleType:city Price:20
CarID:c02 ModelName:honda VehicleType:x Price:30
I want to search for a CarID and modify only that line, without disturbing the others.
my current code is here:
# Converting txt data into a string and modify
carsDatabaseFile = open('carsDatabase.txt', 'r')
allDataFromDatabase = [line.split(',') for line in carsDatabaseFile.readlines()]
Note:
Your question has a couple of issues: your sample from carDatabase.txt looks like it is tab-delimited, but your current code looks like it is splitting the line around the ',' character. This also looks like a place where a list comprehension might be hurting you more than it is helping you. Break that up into a for-loop if you're trying to add some logic to manipulate a single line.
For looking at CSV files, I would highly recommend using pandas for general manipulation of data in comma separated as well as a number of other formats.
That said, if you are truly restricted to only using built-in packages, or you are looking at this as a learning exercise, and your goal is to directly manipulate just one line of that file, what you are looking for is the seek method. You can use this in combination with the tell method (documented alongside seek in the Python documentation for file objects) to find where you are in the file.
Write a for loop to identify which line in the file you are looking for
From there, you can get the output of tell() to find the specific place in the file you are trying to manipulate
Using the output from the above two steps, you can set the file pointer to a specific location using the seek() method (by byte: files are really stored as one dimensional).
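You can now use the write() method to directly update the file at the location you determined above. Keep in mind that write() overwrites bytes in place, so this only works cleanly when the new line has the same length as the old one; otherwise the rest of the file has to be rewritten too.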
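The steps above can be sketched as follows. This is only an illustration under some assumptions: `replace_car_line` is a hypothetical helper, the file is opened in binary mode so that tell() returns plain byte offsets, and the tail of the file is rewritten so the new line may differ in length from the old one.

```python
def replace_car_line(path, car_id, new_line):
    """Replace the line whose first field is CarID:<car_id>.

    Returns True if a line was replaced, False if no line matched.
    """
    key = ("CarID:" + car_id).encode("utf-8")
    with open(path, "r+b") as f:
        while True:
            start = f.tell()        # byte offset where this line begins
            line = f.readline()
            if not line:            # end of file, nothing matched
                return False
            fields = line.split()   # splits on tabs or spaces
            if fields and fields[0] == key:
                rest = f.read()     # everything after the matched line
                f.seek(start)       # jump back to the start of that line
                f.write(new_line.encode("utf-8") + b"\n" + rest)
                f.truncate()        # drop leftovers if the file got shorter
                return True
```

For example, replace_car_line('carsDatabase.txt', 'c02', 'CarID:c02 ModelName:honda VehicleType:x Price:35') would rewrite only the second record. For small files, reading all lines into a list, changing one, and writing the list back is simpler and just as good.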

Edit a few lines of uncompressed PDF in Python

I want to edit a few lines in an uncompressed pdf.
I found a similar problem but since I need to scan the file a few times to get the exact line positions I want to change this doesn't really suit (and the pure number of RegEx matches are more than desired).
The pdf contains utf-8 encodable lines (a few of them I want to edit, bookmark target ids in particular)
and a lot of blobs (guess images and so on).
When I edit the file with notepad it's working fine, but when I do it programmatically (reading in, changing a few lines, writing back)
images and some formatting are missing. (Since they are not read in in the first place, because of the ignore option.)
with codecs.open("merged-uncompressed.pdf", "r", encoding='ascii', errors='ignore') as f:
I can read the file in with errors="surrogateescape" and wanted to map the lines from above import but don't know if this approach can work.
Does anyone know a way how to deal with this?
Best, Lukas
I was able to solve this:
read the file as binary
marked the lines which couldn't be decoded as utf-8
copied the list line by line to a temporary list (lines that could not be decoded were copied with a placeholder 'None\n')
Then I went back to do the searching part on the copied list so I got my lines I wanted to replace
replaced the lines in the original binary list (same indices!)
wrote it back to file
the resulting pdf was a bit corrupted because of whitespace before the target ids of the bookmarks, but recompressing with qpdf fixed it :)
The code is very messy at the moment and so I don't want to publish it right now.
But I want to add it at github within the next few weeks.
If anyone needs it: just comment and it will have more priority.
Thanks to anyone who wanted to help:)
Lukas
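Until that code is published, here is a rough sketch of the approach described above. It is not Lukas's code: `edit_text_lines` is a hypothetical helper, and the sample bytes are made up. It reads the data as bytes, edits only the lines that decode as UTF-8, and copies every other line through untouched.

```python
def edit_text_lines(data, edit):
    """Apply edit() to every line that decodes as UTF-8; keep the rest as-is."""
    out = []
    # keepends=True keeps the original line endings, so joining is lossless.
    for raw in data.splitlines(keepends=True):
        try:
            text = raw.decode("utf-8")
        except UnicodeDecodeError:
            out.append(raw)         # binary blob: copy through unchanged
            continue
        out.append(edit(text).encode("utf-8"))
    return b"".join(out)


# Made-up sample: two text lines around one line of non-UTF-8 bytes.
sample = b"/Dest old_id\n\xff\xfe\x00blob\n/Title (x)\n"
edited = edit_text_lines(sample, lambda s: s.replace("old_id", "new_id"))
```

With a real file, the bytes would come from open("merged-uncompressed.pdf", "rb").read() and be written back with mode "wb". Changing the length of a line shifts the byte offsets recorded in a PDF's cross-reference table, so a repair pass with a tool such as qpdf may still be needed afterwards.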

Replace blank values with string

I need to go through a csv file and look for blank fields in columns c0-c5 of my example csv file. Wherever there is a blank, I would like to replace it with any verbiage I want, like "not found".
The only code I have so far drops a column I do not need; I cannot find anything on the manipulation I actually need. Maybe it is not possible?
Also, I am wondering how to change a column name. Thanks.
#!/bin/env python
import pandas
data = pandas.read_csv('report.csv')
data = data.drop(['date'], axis=1)
data.to_csv('final_report.csv')
Alternatively and taking your "comment question" into account (if you do not necessarily want to use pandas as in n1colas.m's answer) use string replacements and
simply loop over your file with:
with open("modified_file.csv", "w") as of:
    with open("report.csv", "r") as inf:
        for line in inf:
            # Skip lines containing "#", in case your csv file uses it as a
            # comment marker; this gives a clean comma separated outfile.
            # To keep such lines, simply remove the if condition.
            if "#" not in line:
                # If a blank can be more than one space, use a regex here.
                mystring = line.replace(", ,", ",not_found,").replace("data", "input")
                # Write the replaced line to the outfile without adding a newline.
                print(mystring, file=of, end="")
I know this is not the most efficient way to do it, but probably the one where you easily understand what you are doing and are able to modify this to your heart's desire.
For any reasonably sized csv file it should still work nearly instantaneously.
Also for testing purposes always use a separate file (of) for such replacements instead of writing to your infile as your question seems to state. Check that it did what you wanted. ONLY THEN overwrite your infile. This may seem unnecessary at first, but mistakes happen...
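If plain string replacement turns out to be too fragile (quoted fields, blanks of varying width), a variant of the same loop can use the standard csv module. This is only a sketch: `fill_blanks` is a hypothetical name, and it treats any empty or whitespace-only field as blank.

```python
import csv
import io


def fill_blanks(src, dst, placeholder="not found"):
    """Copy csv rows from src to dst, replacing blank fields."""
    reader = csv.reader(src)
    writer = csv.writer(dst, lineterminator="\n")
    for row in reader:
        # A field counts as blank if it is empty or only whitespace.
        writer.writerow([cell if cell.strip() else placeholder for cell in row])


# In-memory demo; with real files, pass open(..., newline="") objects instead.
source = io.StringIO("c0,c1,c2\na,,c\n, ,z\n")
target = io.StringIO()
fill_blanks(source, target)
result = target.getvalue()
```

As in the loop above, write to a separate output file first and only replace the original once the result has been checked.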
You need to run this line:
data['data'] = data['data'].fillna("not found")
Here is the documentation: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html
Here is an example:
import pandas
data = pandas.read_csv('final_report.csv')
data.info()
data['data'] = data['data'].fillna("Something")
print(data)
I would suggest to change the data variable to something different, because your column has the same name and can be confusing.
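To cover several columns at once, and the column rename asked about in the question, something along these lines might work. The column names c0 and c1, the new name "status", and the inline sample data are assumptions for illustration only.

```python
import io

import pandas

# Stand-in for report.csv; dtype=str keeps every column as text so the
# placeholder string can be filled in without mixing types.
sample = io.StringIO("date,c0,c1\n2020-01-01,a,\n2020-01-02,,b\n")
report = pandas.read_csv(sample, dtype=str)

# Fill blanks (read in as NaN) in the chosen columns only.
columns_to_fill = ["c0", "c1"]
report[columns_to_fill] = report[columns_to_fill].fillna("not found")

# Rename a column by mapping the old name to the new one.
report = report.rename(columns={"c1": "status"})
```

The frame can then be written out with report.to_csv('final_report.csv', index=False); index=False leaves out the extra index column that to_csv writes by default.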

Splitting line on python

I need to use a python code without editing it (another's code).
At some point, this code reads the line of a text file to get some file names.
To do so, it uses a line.split()
On the example I was given, I had a file name like /home/directory/fileName
When I do a split on such a line, I get ['/home/directory/fileName']
The point is the files I work on are located on "My Passport".
I had errors during the execution of the code that are caused by the name of the file.
Indeed, when I tried in Python to split the following line: /media/My Passport/directory/fileName, I got ['/media/My', 'Passport/directory/fileName'], a list with two elements, which the program I have cannot handle. This is because at some point in this code fileName[0][0] is called on the result, which should be ['/media/My Passport/directory/fileName'], but which is ['/media/My', 'Passport/directory/fileName']
I tried to change the name of my device, but it turns out I need to reformat it to do so... which I can't...
Does anyone have an idea how I can handle this problem, specifically how I can modify the file names so that, after a line.split(), I get ['/media/My Passport/directory/fileName']?
Thank you
EDIT
I have a text file in which I have a list of file names with their path
/media/My Passport/fileName1
/media/My Passport/fileName2
/media/My Passport/fileName3
I have a code where I split the lines of this file line.split() to get lists like
['/media/My Passport/fileName1']
I know I can get such lists using line.split('\n'), but I have to use line.split()
I am looking for a way to modify the text file so that, when I run line.split(), I get lists like
['/media/My Passport/fileName1']
and not
['/media/My', 'Passport/fileName1']
I have been trying to change the text file using quotes and backslashes:
"/media/My Passport/fileName1"
/media/My\ Passport/fileName1
but the same problem remains
Let us say you have
splitted_result = ['/media/My', 'Passport/fileName1']
Then you can do a simple join
>>> [' '.join(splitted_result)]
['/media/My Passport/fileName1']
This will output a list as its result.
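A short demonstration of why the quoting attempts in the question do not help, and of the join shown above. With no arguments, str.split() splits on every run of whitespace and gives quotes and backslashes no special meaning. The variable names here are just for illustration.

```python
line = "/media/My Passport/fileName1\n"

# split() with no arguments splits on any whitespace, including the space
# inside the path and the trailing newline.
parts = line.split()

# Quotes and backslashes are ordinary characters to split().
quoted_parts = '"/media/My Passport/fileName1"'.split()
escaped_parts = r"/media/My\ Passport/fileName1".split()

# Re-joining with a single space restores the original path.
rejoined = [" ".join(parts)]
```

Note that the join assumes the pieces were originally separated by exactly one space; a path containing a tab or a double space would not be restored exactly.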

Python: Check one element in csv, use another to remove from second file

I am trying to get a script working, where it will check the existence of an IP in a lookup csv file, and then if it exists take the third element and remove that third element from another (second) file. Here is an extract of what I have:
for line in fileinput.input(hostsURLFileLoc,inplace =1):
    elements = open(hostsLookFileLoc, 'r').read().split(".").split("\n")
    first = elements[0].strip()
    third = elements[2].strip()
    if first == hostIP:
        if line != third:
            print line.strip()
This obviously doesn't work, I have tried playing with a few options, but here is my latest (crazy) attempt.
I think the problem is that there are two input files open at once.
Any thoughts welcome,
Cheers
All right, even though I haven't got any response to my comment on the question, here's my shot at a general answer. If I've got something wrong, just say so and I'll edit to try to address the errors.
First, here are my assumptions. You have two files, whose names are stored in the HostsLookFileLoc and HostsURLFileLoc variables.
The file at HostsLookFileLoc is a CSV file, with an IP address in the third column of each row. Something like this:
HostsLookFile.csv:
blah,blah,192.168.1.1,whatever,stuff
spam,spam,82.94.164.162,eggs,spam
me,myself,127.0.0.1,and,I
...
The file at HostsURLFileLoc is a flat text file with one IP address per line, like so:
HostsURLFile.txt:
10.1.1.2
10.1.1.3
10.1.2.253
127.0.0.1
8.8.8.8
192.168.1.22
82.94.164.162
64.34.119.12
...
Your goal is to read and then rewrite the HostsURLFile.txt file, excluding all of the IP addresses that are found in the third column of a row in the CSV file. In the example lists above, localhost (127.0.0.1) and python.org (82.94.164.162) would be excluded, but the rest of the IPs in the list would remain.
Here's how I'd do it, in three steps:
Read in the CSV file and parse it using the csv module to find the IP addresses. Stick them into a set.
Open the flat file and read the IP addresses into a list, closing the file afterwards.
Reopen the flat file and overwrite it with the loaded list of addresses, skipping any that are contained in the set from the first step.
Code:
import csv

def cleanURLFile(HostsLookFileLoc, HostsURLFileLoc):
    """
    Remove IP addresses from file at HostsURLFileLoc if they are in
    the third column of the file at HostsLookFileLoc.
    """
    with open(HostsLookFileLoc, "r") as hostsLookFile:
        reader = csv.reader(hostsLookFile)
        ipsToExclude = set(line[2].strip() for line in reader)
    with open(HostsURLFileLoc, "r") as hostsURLFile:
        ipList = [line.strip() for line in hostsURLFile]
    with open(HostsURLFileLoc, "w") as hostsURLFile: # truncates the file!
        hostsURLFile.write("\n".join(ip for ip in ipList
                                     if ip not in ipsToExclude))
This code is deliberately simple. There are a few things that could be improved, if they are important to your use case:
If something crashes the program during the rewriting step, HostsURLFile.txt may be clobbered. A safer way of rewriting (at least, on Unix-style systems) is to write to a temp file, then after the writing has finished (and the file has been closed), rename the temp file over the top of the old file. That way if the program crashes, you'll still have the original version or a completely written replacement, but never anything in between.
If the checking you needed to do was more complicated than set membership, I'd add an extra step between 2 and 3 to do the actual processing, then write the results out without further manipulation (other than adding newlines).
Speaking of newlines, if you have a trailing newline, it will be passed through as an empty string in the list of IP addresses, which should be OK for this scenario (it won't be in the set of IPs to exclude, unless your CSV file has a messed up line), but might cause trouble if you were doing something more complicated.
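The safer rewriting strategy from the first point could look roughly like this. It is a sketch, not part of the function above: `rewrite_atomically` is a hypothetical helper, and it relies on os.replace, which swaps the file in a single step when the temp file and the target are on the same filesystem.

```python
import os
import tempfile


def rewrite_atomically(path, lines):
    """Write lines to a temp file next to path, then rename it over path."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temp file in the same directory so the rename stays on
    # one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as tmp:
            tmp.write("\n".join(lines) + "\n")
        os.replace(tmp_path, path)  # the old file stays intact until here
    except BaseException:
        os.unlink(tmp_path)         # clean up the partial temp file
        raise
```

In cleanURLFile, the final with block would then become a call such as rewrite_atomically(HostsURLFileLoc, [ip for ip in ipList if ip not in ipsToExclude]).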
In test file test.csv (note there is an IP address in there):
'aajkwehfawe;fh192.168.0.1awefawrgaer'
(I am pretty much ignoring that it is CSV for now. I am just going to use regex matches.)
import re

# Get the file data
with open('test.csv', 'r') as f:
    data = f.read()
# Look for the IP:
find_ip = '192.168.0.1'
# re.escape makes the dots in the IP match literally
m = re.search('[^0-9]({})[^0-9]'.format(re.escape(find_ip)), data)
if m:  # found!
    # this is weird, because you already know the value in find_ip, but anyway...
    ip = m.group(1).split('.')
    print('Third ip = ' + ip[2])
else:
    print('Did not find a match for {}'.format(find_ip))
I do not understand the second part of your question, i.e. removing the third value from a second file. Are there numbers listed line by line, and you want to find the line that contains this number above and delete the line? If yes:
# Make a new list of lines that omits the matched one
new_lines = []
with open('iplist.txt', 'r') as f:
    for line in f:
        if line.strip() != ip[2]:  # skip the matched line
            new_lines.append(line)
# Replace the file with the new list of lines
with open('iplist.txt', 'w') as f:
    f.writelines(new_lines)  # each line already ends with a newline
Once you have found the values in the first file that need to be removed from the second file, I suggest something like this pseudocode:
Load first file into memory
Search string representing first file for matches using a regular expression
(in Python, use regex.findall(string), where regex = re.compile(r"[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}"); with a raw string, single backslashes are enough)
Build up a list of all matches
Close first file
Load second file into memory
Search string representing second file for the start index and end index of each match
For each match, use the expression string = string[:start_of_match] + string[end_of_match:]
Re-write the string representing the second (now trimmed) file to the second file
Essentially, whenever you find a match, redefine the string to be the slices on either side of it, excluding the match from the new string. Then rewrite your string to the file.
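The pseudocode above might translate into something like the following. Treat it as a sketch: `IP_PATTERN` and `remove_ips` are hypothetical names, the pattern accepts any four dot-separated groups of one to three digits without range checking, and the function works on strings that have already been read from the two files.

```python
import re

# Loose IPv4 pattern: four groups of 1-3 digits, no range checking.
IP_PATTERN = re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b")


def remove_ips(first_text, second_text):
    """Cut every IP found in first_text out of second_text."""
    found = set(IP_PATTERN.findall(first_text))
    result = second_text
    for ip in found:
        # The lookarounds stop 1.2.3.4 from matching inside 11.2.3.45;
        # the optional \n removes the line break along with the IP.
        pattern = re.compile(r"(?<![\d.])" + re.escape(ip) + r"(?![\d.])\n?")
        match = pattern.search(result)
        while match:
            # Keep the slices on either side of the match.
            result = result[:match.start()] + result[match.end():]
            match = pattern.search(result)
    return result


lookup = "blah,blah,192.168.1.1,x\nme,myself,127.0.0.1,and,I\n"
hosts = "10.1.1.2\n127.0.0.1\n8.8.8.8\n192.168.1.22\n192.168.1.1\n"
trimmed = remove_ips(lookup, hosts)
```

The trimmed string would then be written back to the second file. Repeated slicing copies the string on every removal, which is fine for small host lists but slow for very large files.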
