Remove repeating lines of characters when reading text files in python? - python

I am reading a text file whose contents were copied from a CSV file. When I read the file in Python, I get a ton of unnecessary repeating lines, as seen below. How can I strip away those three unwanted lines, along with the \cf0 and \cell\row at the beginning and end of each text?
Or should I read the text directly from the CSV file itself? The text is in just one of the columns of the CSV file.
\itap1\trowd \taflags1 \trgaph108\trleft-108 \trbrdrl\brdrnil \trbrdrr\brdrnil
\clvertalc \clshdrawnil \clbrdrt\brdrs\brdrw20\brdrcf2 \clbrdrl\brdrs\brdrw20\brdrcf2 \clbrdrb\brdrs\brdrw20\brdrcf2 \clbrdrr\brdrs\brdrw20\brdrcf2 \clpadl100 \clpadr100 \gaph\cellx8640
\pard\intbl\itap1\pardeftab720
\cf0 i have been using your product and it has been helping me a lot to solve business problem,\cell \row
\itap1\trowd \taflags1 \trgaph108\trleft-108 \trbrdrl\brdrnil \trbrdrr\brdrnil
\clvertalc \clshdrawnil \clbrdrt\brdrs\brdrw20\brdrcf2 \clbrdrl\brdrs\brdrw20\brdrcf2 \clbrdrb\brdrs\brdrw20\brdrcf2 \clbrdrr\brdrs\brdrw20\brdrcf2 \clpadl100 \clpadr100 \gaph\cellx8640
\pard\intbl\itap1\pardeftab720
\cf0 I am very happy with your products. Very easy to use.\cell \row
\itap1\trowd \taflags1 \trgaph108\trleft-108 \trbrdrl\brdrnil \trbrdrr\brdrnil
\clvertalc \clshdrawnil \clbrdrt\brdrs\brdrw20\brdrcf2 \clbrdrl\brdrs\brdrw20\brdrcf2 \clbrdrb\brdrs\brdrw20\brdrcf2 \clbrdrr\brdrs\brdrw20\brdrcf2 \clpadl100 \clpadr100 \gaph\cellx8640
\pard\intbl\itap1\pardeftab720
\cf0 Many improvements with income tracker, and other time saving elements. Newer look, easier navigation. I believe there definitely is a time savings from past versions.\cell \row
Here is a snippet of the csv file:
page_url Review_title Product_id Rating Publish_date Review_Description
www.blabla.com Great! 777777 5 01/01/14 Excellent upgrade! Was not disappointed!
I only copied text from the Review_Description column and pasted them all in a text file.
Here is my python code to just read the file:
text_file=open("my_text.txt", "r")
lines=text_file.readlines()
print lines

Your real problem here appears to be that you pasted the CSV into an RTF file, not a text file. Pasting into Wordpad on Windows or TextEdit on Mac (especially if you copied from, say, Excel or Numbers) and saving it without explicitly telling it to "save as plain text" or "convert to plain text" will generally "help" you this way automatically.
While you could try to parse the RTF to recover the original text, you're much better off just using the original text if possible. Parsing CSV files in Python—either with Pandas, or with the stdlib's csv module—is very easy.
For example, your file appears to use tabs as delimiters, and no other non-default features. If so:
import csv
with open('my_csv.csv', 'rb') as f:
    reader = csv.DictReader(f, delimiter='\t')
    reviews = [row['Review_Description'] for row in reader]
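The snippet above targets Python 2, where the csv module expects a file opened in binary mode. Under Python 3 the file is opened in text mode with newline='' instead. Here is a minimal sketch, assuming the tab-delimited layout shown in the question; the helper name read_column and the in-memory sample are illustrative and not part of the original answer:

```python
import csv
import io


def read_column(lines, column, delimiter="\t"):
    """Return every value of one named column from delimited text."""
    reader = csv.DictReader(lines, delimiter=delimiter)
    return [row[column] for row in reader]


# Stand-in for the real file. With a file on disk you would pass
# open("my_csv.csv", newline="", encoding="utf-8") instead (encoding assumed).
sample = io.StringIO(
    "page_url\tReview_title\tProduct_id\tRating\tPublish_date\tReview_Description\n"
    "www.blabla.com\tGreat!\t777777\t5\t01/01/14\tExcellent upgrade! Was not disappointed!\n"
)

reviews = read_column(sample, "Review_Description")
print(reviews)
```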
Now you have a list of all the reviews, and can do anything you want with them. If you just want to print them out, it's even simpler:
import csv
with open('my_csv.csv', 'rb') as f:
    reader = csv.DictReader(f, delimiter='\t')
    for row in reader:
        print row['Review_Description']
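If the original CSV is no longer available and only the RTF paste remains, a regular expression can recover the cell text as a fallback. This is only a rough sketch, assuming every cell follows the pattern in the question: the text starts after \cf0 and ends at \cell \row on the same line. The pattern and function name are illustrative. It is not a general RTF parser, and it leaves RTF escape sequences such as \'92 untouched:

```python
import re

# Capture whatever sits between "\cf0 " and "\cell \row" on one line.
CELL_TEXT = re.compile(r"\\cf0 (.*?)\\cell\s*\\row")


def extract_cells(rtf_text):
    """Return the text of each table cell found in the RTF fragment."""
    return [match.strip() for match in CELL_TEXT.findall(rtf_text)]


# Two lines taken from the question's output.
sample = "\n".join([
    r"\pard\intbl\itap1\pardeftab720",
    r"\cf0 I am very happy with your products. Very easy to use.\cell \row",
])

print(extract_cells(sample))
```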

Related

Differences between the output of reading a file directly vs. using csv.reader() in Python?

I am working with a CSV file in Python and just discovered that after I open it with open("randomfile.csv", "r") as csvfile, I can read its contents in two ways:
#1
Content1 = next(csvfile)
#2
csvreader = csv.reader(csvfile)
Content2 = next(csvreader)
But the type of Content1 is a string, whereas the type of Content2 is a list. Does anyone know why?
EDIT : Turns out if you write
csvreader = csv.reader(csvfile)
Then writing
Content1 = next(csvfile)
will now output a list too
Hi, csv is a standard-library module that makes it easier to work with CSV files, as well as tab-separated (TSV) files and similar delimited formats; csv.reader is the reader object it provides.
In this case using next(csvfile) you are just reading the next line by calling next on the file (let's say - this is the "ordinary python" way of reading a file). When you do this you get the next line as a string. This is exactly what happens if you use a for loop
for line in open('randomfile.csv'):
    print(line)
But when you call next on the csvreader, the csv module is "helping" you by splitting the line into a list of fields. You'll also notice that you don't have to think so much about end-of-line characters (\n) either! So it is basically trying to make the file easier to work with. Splitting a line into a list of fields yourself can be done and isn't super hard, but all kinds of other things can happen in CSV files, such as fields surrounded by double quotes and commas inside fields. The csv module handles all of these as well: https://docs.python.org/3/library/csv.html
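To make the difference concrete, here is a small sketch that reads the same text both ways. The sample data is made up for illustration, and io.StringIO stands in for an open file:

```python
import csv
import io

# Made-up sample: the second line has a comma and a quote inside quoted fields.
data = 'name,comment\n"Smith, J.","said ""hi"""\n'

# Iterating the file object itself yields raw lines (strings).
raw_line = next(io.StringIO(data))

# Iterating a csv.reader yields parsed rows (lists of fields).
rows = list(csv.reader(io.StringIO(data)))

print(type(raw_line).__name__, repr(raw_line))
print(type(rows[0]).__name__, rows[0])
print(rows[1])
```

Note that the quoted comma stays inside one field, which a plain line.split(",") would not manage.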

Why is my csv file separated by " \t " instead of commas (" , ")?

I downloaded data from internet and saved as a csv (comma delimited) file. The image shows what the file looks like in excel.
Using csv.reader in python, I printed each row. I have shown my code below along with the output in Spyder.
import csv
with open('p_dat.csv', 'r') as file:
    reader = csv.reader(file)
    for row in reader:
        print(row)
I am very confused as to why my values are not comma separated. Any help will be greatly appreciated.
As pointed out in the comments, technically this is a TSV (tab-separated values) file, which is actually perfectly valid.
In practice, of course, not all libraries will make a "hard" distinction between a TSV and CSV file. The way you parse a TSV file is basically the same as the way you parse a CSV file, except that the delimiter is different.
There are actually multiple valid delimiters for this kind of file, such as tabs, commas, and semicolons. Which one you choose is honestly a matter of preference, not a "hard" technical limit.
See the specification for CSV files. There are many options for the delimiter in the file. In this case you have a tab, \t.
The choice matters: if your data itself contains commas, then a comma would not be a good choice of delimiter.
Even though they're named comma-separated values, they're sometimes separated by different symbols (like the tab character that you have currently).
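A short sketch of why the printed rows look the way they do, using made-up sample data: with the default comma delimiter each tab-separated line comes back as a single field with the tabs still inside it, while passing delimiter="\t" splits it as intended.

```python
import csv
import io

# Made-up tab-separated sample standing in for p_dat.csv.
tsv_text = "date\tprice\n2020-01-01\t10.5\n"

# Default delimiter is a comma, so nothing gets split.
as_commas = list(csv.reader(io.StringIO(tsv_text)))

# Telling the reader about the tab delimiter gives one field per column.
as_tabs = list(csv.reader(io.StringIO(tsv_text), delimiter="\t"))

print(as_commas)
print(as_tabs)
```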
If you want to use Python to view this as a comma-separated file, you can try something like:
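import csv
...
with open('p_dat.csv', 'r') as file:
    reader = csv.reader(file, delimiter='\t')  # split on tabs, not commas
    for row in reader:
        # row is a list of fields, so join it rather than calling str.replace
        commarow = ','.join(row)
        print(commarow)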
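Swapping tabs for commas as plain text can go wrong when a field itself contains a comma. A hedged alternative is to let csv.writer re-emit the rows, since it quotes such fields. The function name tsv_to_csv and the sample text are illustrative assumptions:

```python
import csv
import io


def tsv_to_csv(tsv_text):
    """Re-encode tab-separated text as comma-separated text."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    out = io.StringIO()
    # lineterminator="\n" keeps the output free of the default "\r\n".
    writer = csv.writer(out, lineterminator="\n")
    writer.writerows(reader)
    return out.getvalue()


converted = tsv_to_csv("id\tnote\n1\thello, world\n")
print(converted)
```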

read csv file as a string in python

I accidentally corrupted a CSV file (the delimiters no longer work, thanks Microsoft Excel!). I want to salvage some data by reading it as a string and searching for things. I can see the text by opening the file in Notepad, but I can't figure out how to load that string from the file path in Python.
I imagine it would be a variation of
csv_string = open(filepath, 'something').read()
but I can't get it to work, or find a solution on SO / google.
It should work with the following code, but it is not the best way to deal with csv.
csv_string = ''.join(open(filepath, 'r').readlines())
Something like:
with open(filepath, 'r') as corrupted_file:
    for line in corrupted_file:
        print(line) # Or whatever
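If the corruption also left bytes that do not decode cleanly, reading can fail partway through. Here is a sketch that loads the whole file as one string and substitutes a replacement character for undecodable bytes. The UTF-8 encoding, the helper name, and the demo file contents are assumptions for illustration:

```python
import os
import tempfile


def read_text_lenient(filepath, encoding="utf-8"):
    """Return the whole file as one string, replacing undecodable bytes."""
    with open(filepath, "r", encoding=encoding, errors="replace", newline="") as fh:
        return fh.read()


# Demo with a throwaway file containing one byte that is invalid UTF-8.
with tempfile.NamedTemporaryFile(delete=False, suffix=".csv") as tmp:
    tmp.write(b"id;name\n1;caf\xe9\n")
    demo_path = tmp.name

csv_string = read_text_lenient(demo_path)
os.remove(demo_path)

# Search the recovered text line by line.
matches = [line for line in csv_string.splitlines() if "caf" in line]
print(matches)
```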
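You can read CSV files with the csv module:
import csv
with open("samples/sample.csv", newline="") as f:
    reader = csv.reader(f)
    # Each row must have exactly three fields for this unpacking to work.
    for title, year, director in reader:
        print(year, title)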

How to create the header of a CSV file?

I want to write a CSV file in Python, and I want to use these two words as the header.
import csv
myFile = open('tabelle.csv','w')
with myFile:
    writer = csv.writer(myFile)
    writer.writerow(["Wort","Haeufigkeit"])
Is that enough to build my header? Next I want to add the other words to this CSV file, under these two words. Does Python now treat this as a header or just as a normal row?
As far as the csv writer is concerned the header is like any other row. The idea of a header comes up only when you want to read and interpret a csv file. So, what you said does work.
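A small sketch of that point: the header is written with writerow like any other row, and it only becomes a "header" when a reader chooses to interpret it that way. The data rows below are made up, and io.StringIO stands in for tabelle.csv:

```python
import csv
import io

buffer = io.StringIO()  # stand-in for open('tabelle.csv', 'w', newline='')
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["Wort", "Haeufigkeit"])       # the header: just the first row
writer.writerows([["Haus", 3], ["Baum", 5]])   # made-up data rows

# A plain reader returns the header as an ordinary row.
buffer.seek(0)
plain_rows = list(csv.reader(buffer))

# DictReader interprets the first row as field names.
buffer.seek(0)
dict_rows = list(csv.DictReader(buffer))

print(plain_rows[0])
print(dict_rows[0]["Wort"], dict_rows[0]["Haeufigkeit"])
```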

How to write lines in a txt file, with data from a csv file

How can I tell Python to open a CSV file, and merge all columns per line, into new lines in a new TXT file?
To explain:
I'm trying to download a bunch of member profiles from a website, for a research project. To do this, I want to write a list of all the URLs in a TXT file.
The URLs are akin to this: website.com-name-country-title-id.html
I have written a script that takes all these bits of information for each member and saves them in columns (name/country/title/id), in a CSV file, like this:
mark japan rookie married
john sweden expert single
suzy germany rookie married
etc...
Now I want to open this CSV and write a TXT file with lines like these:
www.website.com/mark-japan-rookie-married.html
www.website.com/john-sweden-expert-single.html
www.website.com/suzy-germany-rookie-married.html
etc...
Here's the code I have so far. As you can probably tell I barely know what I'm doing so help will be greatly appreciated!!!
import csv
x = "http://website.com/"
y = ".html"
csvFile=csv.DictReader(open("NameCountryTitleId.csv")) #This file is stored on my computer
file = open("urls.txt", "wb")
for row in csvFile:
    strArgument=str(row['name'])+"-"+str(row['country'])+"-"+str(row['title'])+"-"+str(row['id'])
    try:
        file.write(x + strArgument + y)
    except:
        print(strArgument)
file.close()
I don't get any error messages after running this, but the TXT file is completely empty.
Rather than using a DictReader, use a regular reader to make it easier to join the row:
import csv
url_format = "http://website.com/{}.html"
csv_file = 'NameCountryTitleId.csv'
urls_file = 'urls.txt'
with open(csv_file, 'rb') as infh, open(urls_file, 'w') as outfh:
    reader = csv.reader(infh)
    for row in reader:
        url = url_format.format('-'.join(row))
        outfh.write(url + '\n')
The with statement ensures that both files are closed properly when the block exits.
Further changes I made:
In Python 2, open CSV files in binary mode; the csv module handles line endings itself, because correctly quoted column data can contain embedded newlines.
Regular text files should still be opened in text mode, though.
When writing lines to a file, do remember to add a newline character to delineate lines.
Using a string format (str.format()) is far more flexible than using string concatenations.
str.join() lets you join a sequence of strings together with a separator.
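For Python 3, the same approach can be sketched as follows. The CSV is read in text mode (with newline='' when opening a real file) rather than 'rb'. The has_header flag is an assumption added here: a plain csv.reader treats a header row as data, and the DictReader in the question suggests the file has one. The function name and the in-memory sample are illustrative:

```python
import csv
import io

URL_FORMAT = "http://website.com/{}.html"


def build_urls(csv_lines, has_header=False):
    """Yield one URL per CSV row, joining the row's fields with dashes."""
    reader = csv.reader(csv_lines)
    if has_header:
        next(reader, None)  # discard the header row, if any
    for row in reader:
        if row:  # skip blank lines
            yield URL_FORMAT.format("-".join(row))


# Stand-in for open("NameCountryTitleId.csv", newline="").
sample = io.StringIO(
    "name,country,title,id\n"
    "mark,japan,rookie,married\n"
    "john,sweden,expert,single\n"
)
urls = list(build_urls(sample, has_header=True))

# Stand-in for open("urls.txt", "w"): one URL per line.
output = io.StringIO()
output.writelines(url + "\n" for url in urls)
print(output.getvalue())
```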
It's actually quite simple: you are working with strings, yet the file you are writing to is opened in bytes mode, so every single write fails and the except branch prints to the screen instead. Try changing this line:
file = open("urls.txt", "wb")
to this:
file = open("urls.txt", "w")
EDIT:
I stand corrected. However, I would like to point out that without newlines or some other form of separator, how do you intend to use the URLs later on? If you put a newline after each URL, they would be easy to recover.
