I have a TSV file (tab-separated) and would like to filter out a lot of data using Python before I import it into a PostgreSQL database.
My problem is that I can't find a way to keep the format of the original file, which is mandatory because otherwise the import process won't work.
The web suggested that I should use the csv library, but no matter what delimiter I use, I always end up with files in a different format than the original, e.g. files that contain a comma after every character, files that contain a tab after every character, or files that have all the data in one row.
Here is my code:
import csv
import glob

# create a list of all tsv-files in one directory
liste = glob.glob("/some_directory/*.tsv")
# go through all the files
for item in liste:
    # open the tsv-file for reading and a file for writing
    with open(item, 'r') as tsvin, open('/some_directory/new.tsv', 'w') as csvout:
        tsvin = csv.reader(tsvin, delimiter='\t')
        # I am not sure if I have to enter a delimiter here for the outfile.
        # If I enter "delimiter='\t'" like for the in-file, the outfile ends up
        # with a tab after every character
        writer = csv.writer(csvout)
        # go through all lines of the input tsv
        for row in tsvin:
            # do some filtering
            if 'some_substring1' in row[4] or 'some_substring2' in row[4]:
                # do some more filtering
                if 'some_substring1' in str(row[9]) or 'some_substring1' in str(row[9]):
                    # now I get lost...
                    writer.writerow(row)
Do you have any idea what I am doing wrong? The final file has to have a tab between every field and some kind of line break at the end.
Somehow you are passing a string to writer.writerow(), not a list as expected.
Remember that strings are iterable; each iteration returns a single character from the string. writerow() simply iterates over its argument, writing each item separated by the delimiter character (a comma by default). So if you pass a string to writerow(), it will write each character from the string separated by the delimiter.
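For example (a quick demonstration of that behaviour):

import csv, sys

writer = csv.writer(sys.stdout)
writer.writerow("abc")   # a string: each character becomes a cell -> a,b,c
writer.writerow(["abc"]) # a one-element list: a single cell -> abc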
How is it that row is a string? It could be that the delimiter for the input file is incorrect - perhaps the file does not use tabs but has fixed field widths using runs of spaces as the delimiter.
You can check whether the reader is correctly parsing your file by printing out the value of row:
for row in tsvin:
    print(row)
    ...
If the file is being correctly parsed, expect to see that row is a list, and that each element of the list corresponds to a column/field from the file.
If it is not parsing correctly then you might see that row is a string, or that it's a list but the fields are empty and/or out of place.
It would be helpful if you added a sample of your input file to the question.
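For completeness, here is a minimal sketch of a writer setup that preserves the tab-separated format (based on the code in the question, and assuming the rows parse correctly as lists):

import csv

with open(item, 'r', newline='') as tsvin, open('/some_directory/new.tsv', 'w', newline='') as tsvout:
    reader = csv.reader(tsvin, delimiter='\t')
    writer = csv.writer(tsvout, delimiter='\t')  # tab between fields, line break at the end of each row
    for row in reader:
        writer.writerow(row)  # row is a list, so the tabs land between fields, not between characters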
I have a csv which contains text like
AAABBBBCCCDDDDDDD
EEEFFFRRRTTTHHHYY
when I run the code like below:
rows = csv.reader(csvfile)
for row in rows:
    print(" ".join('%s' %row for row in rows))
it outputs the following:
['AAABBBBCCCDDDDDDD']
['EEEFFFRRRTTTHHHYY']
But I want it displayed as one continuous string, like below:
AAABBBBCCCDDDDDDDEEEFFFRRRTTTHHHYY
Is there anything wrong in the code?
Your example looks like you simply need
with open(csvfile) as inputfile:  # misnomer; not really proper CSV
    for row in inputfile:
        print(row.rstrip('\n'), end='')
The example you provided doesn't look like a CSV file. It looks like a simple text file. Then you could have something as simple as:
Input.txt
AAABBBBCCCDDDDDDD
EEEFFFRRRTTTHHHYY
Solution.py
input_filename = "Input.txt"
with open(input_filename) as input_file:
    print("".join(x.rstrip('\n') for x in input_file))
This is taking advantage of the following:
- A file object can be iterated on. This will give you a new line from each iteration.
- Every line received from the file will have a newline character at its end. Since you seem to not want it, we use the method .rstrip('\n') to remove it.
- The .join() method can accept any iterable, even a generator expression, which helps us create an iterable that will be accepted by .join(), using .rstrip('\n') to format every line coming from the input file.
EDIT: OK, let's decompose my answer further:
When you open a file you can iterate over it. In the simplest terms, that means you can loop over it (for line in input_file: ...).
But not only that: from an iterator you can create another iterator by transforming each element. This is what a list comprehension or, in the case I have chosen, a generator expression does. So the expression (x.rstrip('\n') for x in input_file) will be an iterator that takes every element of input_file and applies .rstrip('\n') to it.
The string method .join() will glue together the elements provided by an iterator, using that string as a separator. Since I use an empty string here, there won't be a separator. I have used the iterator defined before for this.
I then print() the string produced by the .join() operation explained before.
I made a minor correction to my answer because of an edge case: if there are space or tab characters at the end of a line in the input file, they would have been removed had I used x.rstrip() instead of x.rstrip('\n').
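For example (a quick check in the interpreter):

>>> "AAABBB  \n".rstrip()
'AAABBB'
>>> "AAABBB  \n".rstrip('\n')
'AAABBB  '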
You could start with an empty string, and for every row read from the csv file, remove the newline at the end and add the contents to the empty string.
joined = ""
with open(csvfile) as f:
for row in f:
joined = joined + row.replace("\n","")
print(joined)
Output:
>> AAABBBBCCCDDDDDDDEEEFFFRRRTTTHHHYY
I have a lot of TSV files that I would like to read one by one and write the last column of each into another file.
Here is my code:
import csv
import os

for filename in os.listdir(path):
    with open(path+'/'+filename, 'r', encoding="utf8") as tsvin, open('temptweets.csv', 'a', encoding='utf-8') as csvout:
        tsvin = csv.reader(tsvin, delimiter='\t')
        csvout = csv.writer(csvout)
        count = 0
        for row in tsvin:
            try:
                count = str(row[-1])
            except ValueError:
                pass  # w.e.
            if len(count) >= 0:
                csvout.writerow([count])
Most of it works perfectly. But the problem is that some of the lines get interpreted together.
I.e. the row variable gets a few lines concatenated together, so it ends up that not only the last column is written into the file, but also ALL the columns of the next line. It also stops after a few rows - I can't tell why.
I have tried to read the files a few other ways (such as with pandas) but got the same result.
I have also tried to open the input file and view all characters (Notepad++), but all the lines (including the problematic ones) DO HAVE CR:LF.
I know there is something wrong with the input file (the input file is given), but I would like to know if there is any way to solve it.
It looks like your file might have multiline fields embedded in double quotes (but it's hard to tell without looking at the data).
Try to add newline='' in your open() call (and maybe add quotechar='"' to reader(), but that's probably the default).
From the doc:
If newline='' is not specified, newlines embedded inside quoted fields will not be interpreted correctly
Or it could be the opposite, and maybe you need to turn off quoting to parse those files correctly..
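A minimal sketch of that suggestion (the file name and the loop body are placeholders):

import csv

with open('some_input.tsv', 'r', encoding='utf-8', newline='') as tsvin:
    reader = csv.reader(tsvin, delimiter='\t')  # quotechar='"' is already the default
    for row in reader:
        print(row[-1])

# ...or, if the quotes are literal data rather than field delimiters:
# reader = csv.reader(tsvin, delimiter='\t', quoting=csv.QUOTE_NONE)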
I am running the following on a csv of UIDs:
with open('C:/uid_sample.csv', newline='') as f:
    reader = csv.reader(f, delimiter=' ')
    uidlist = list(reader)
but the list returned is actually a list of lists:
[['27465307'], ['27459855'], ['27451353']...]
I'm using this workaround to get individual strings within one list:
for r in reader:
    print(' '.join(r))
i.e.
['27465307','27459855','27451353',...]
Am I missing something where I can't do this automatically with the csv.reader or is there an issue with the formatting of my csv perhaps?
A CSV file is a file where each line, or row, contains columns that are usually delimited by commas. In your case, you told csv.reader() that your columns are delimited by a space. Since there aren't any spaces in any of the lines, each row of the csv.reader object has only one item. The problem here is that you aren't looking for a row with a single column; you are looking for a single item.
Really, you just want a list of the lines in the file. You could use f.readlines(), but that would include the newline character in each line. That actually isn't a problem if all you need to do with each line is convert it to an integer, but you might want to remove those characters. That can be done quite easily with a list comprehension:
newlist = [line.strip() for line in f]
If you are merely iterating through the lines (with a for loop, for example), you probably don't need a list. If you don't mind the newline characters, you can iterate through the file object directly:
for line in f:
    uid = int(line)
    print(uid)
If the newline characters need to go, you could either take them out per line:
for line in f:
    line = line.strip()
    ...
or create a generator object:
uids = (line.strip() for line in f)
Note that reading a file is like reading a book: you can't read it again until you turn back to the first page, so remember to use f.seek(0) if you want to read the file more than once.
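For example (a small sketch; the second read only works after the rewind):

with open('C:/uid_sample.csv', newline='') as f:
    uids = [line.strip() for line in f]  # the first pass reads to the end of the file
    f.seek(0)                            # rewind to the beginning
    first_line = f.readline()            # reading works again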
Issue: Remove the hyperlinks, numbers and signs like ^&*$ etc from twitter text. The tweet file is in CSV tabulated format as shown below:
s.No.  username  tweetText
1.     #abc      This is a test #abc example.com
2.     #bcd      This is another test #bcd example.com
Being a novice at Python, I searched and strung together the following code, thanks to the code given here:
import re

fileName = "path-to-file//tweetfile.csv"
fileout = open("Output.txt", "w")
with open(fileName, 'r') as myfile:
    data = myfile.read().lower()  # read the file and convert all text to lowercase
    clean_data = ' '.join(re.sub("(#[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)", " ", data).split())  # regular expression to strip the html out of the text
    fileout.write(clean_data + '\n')  # write the cleaned data to a file
fileout.close()
myfile.close()
print "All done"
It does the data stripping, but the output file format is not as I desire. The output text file is in a single line like
s.no username tweetText 1 abc This is a cleaned tweet 2 bcd This is another cleaned tweet 3 efg This is yet another cleaned tweet
How can I fix this code to give me an output like given below:
s.No. username tweetText
1 abc This is a test
2 bcd This is another test
3 efg This is yet another test
I think something needs to be added in the regular expression code but I don't know what it could be. Any pointers or suggestions will be helpful.
You can read the line, clean it, and write it out in one loop. You can also use the CSV module to help you build out your result file.
import csv
import re

exp = r"(#[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)"

def cleaner(row):
    return [re.sub(exp, " ", item.lower()) for item in row]

with open('input.csv', 'r') as i, open('output.csv', 'wb') as o:
    reader = csv.reader(i, delimiter=',')  # Comma is the default
    writer = csv.writer(o, delimiter=',')
    # Take the first row from the input file (the header)
    # and write it to the output file
    writer.writerow(next(reader))
    for row in reader:
        writer.writerow(cleaner(row))
The csv module knows correctly how to add separators between items; as long as you pass it a collection of items.
So, what the cleaner function does is take each item (column) in the row from the input file, apply the substitution to the lowercase version of the item, and then return a list.
The rest of the code is simply opening the files and configuring the csv module with the separators you want for the input and output files (in the example code, the separator for both files is a comma, but you can change the output separator).
Next, the first row of the input file is read and written out to the output file. No transformation is done on this row (which is why it is not in the loop).
Reading the row from the input file automatically puts the file pointer on the next row - so then we simply loop through the input rows (in reader), for each row apply the cleaner function - this will return a list - and then write that list back to the output file with writer.writerow().
Instead of applying the re.sub() and .lower() expressions to the entire file at once, try iterating over each line in the CSV file, like this:
for line in myfile:
    line = line.lower()
    line = re.sub("(#[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)", " ", line)
    fileout.write(line + '\n')
Also, when you use the with <file> as myfile expression, there is no need to close the file at the end of your program; this is done automatically when you use with.
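For instance, both files can be managed by the same with statement, so neither needs an explicit close() (a sketch reusing the names from the question):

import re

with open(fileName, 'r') as myfile, open("Output.txt", "w") as fileout:
    for line in myfile:
        line = line.lower()
        line = re.sub(r"(#[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)", " ", line)
        fileout.write(line + '\n')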
Try this regex:
clean_data=' '.join(re.sub("[#\^&\*\$]|#\S+|\S+[a-z0-9]\.(com|net|org)"," ",data).split()) # regular expression to strip the html out of the text
Explanation:
- [#\^&\*\$] matches the single characters you want to replace
- #\S+ matches hashtags
- \S+[a-z0-9]\.(com|net|org) matches domain names
If the URLs can't be identified by https?, you'll have to complete the list of potential TLDs.
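For example (a quick check with a made-up sample line):

import re

data = "check this out ^&* example.com"
clean_data = ' '.join(re.sub(r"[#\^&\*\$]|#\S+|\S+[a-z0-9]\.(com|net|org)", " ", data).split())
print(clean_data)  # -> check this out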
I am trying to write a function which allows me to remove certain elements from URLs. These URLs are stored in a CSV called Backlink_Test. I would like to iterate over each item in that list of URLs, remove the unwanted elements from the URL, and then add the modified URLs to a new list, which is then stored in a new CSV called Cleaned_URLs.
The code is working to the extent that I can open the source file, run the loop and then store the results in the destination file. However, I am encountering quite an annoying problem: in the destination file, the URLs are stored with each character in an individual cell, rather than the whole URL in one cell.
This surprised me, as I did a little test where I literally copied the contents from one CSV to another (without modifying anything) and words with multiple characters were stored just fine. So my suspicion is that the for-loop creates the problem?
Any help / insight would be much appreciated! Code below, and screenshot of destination file attached.
import csv

new_strings = []

# replace unwanted elements and add cleaned strings to new list
with open("Backlink_Test.csv", "rb") as csvfile:
    reader = csv.reader(csvfile)
    for string in reader:
        string = str(string)
        string = string.replace("www.", "").replace("http://", "").replace("https://", "")
        new_strings.append(string)

new_strings.sort()
print new_strings  # for testing only; will be removed once function is working

cleaned_file = open("Cleaned_URLS.csv", "w")
writer = csv.writer(cleaned_file)
writer.writerows(new_strings)
cleaned_file.close()
Here is now the working code:
import csv

new_strings = []

# replace unwanted elements and add cleaned strings to new list
with open("Backlink_Test.csv", "rb") as csvfile:
    reader = csv.reader(csvfile)
    for string in reader:
        string = str(string)
        string = string.replace("www.", "").replace("http://", "").replace("https://", "")
        new_strings.append(string)

new_strings.sort()
print new_strings

cleaned_file = open("Cleaned_URLS.csv", "w")
writer = csv.writer(cleaned_file)
for url in new_strings:
    writer.writerow([url])
cleaned_file.close()
csvwriter.writerows expects an iterable of rows. A row is an iterable of cells.
You're feeding it a list of strings. Since a string is itself an iterable of letters, every letter is treated as a cell in your example - and that's exactly what gets written.
What you're doing wrong is assuming csv.reader outputs strings. It outputs rows.
Update:
for url in urls:
    writer.writerow([url])
That's what Python does when you loop over a string instead of a list. Examine the return value from csv.reader() and adjust your code accordingly. In particular, string = str(string) is flattening your input.
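Putting it together, a sketch of the loop without the str() flattening might look like this (assuming each row of Backlink_Test.csv holds one URL in its first column):

import csv

new_strings = []
with open("Backlink_Test.csv", "rb") as csvfile:
    reader = csv.reader(csvfile)
    for row in reader:
        # row is a list like ['http://www.example.com']; take the first cell
        url = row[0].replace("www.", "").replace("http://", "").replace("https://", "")
        new_strings.append(url)

new_strings.sort()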