Splitting a csv in a weird way - python

I'm currently trying to work with data about artists, their songs, and the lyrics. I have a CSV with the artist, the song name, and the lyrics, in that order. I'm trying to split it so that each field is separate, but the lyrics keep getting split wherever there is a newline. I've tried this:
fp = open('songdata_test.csv', 'r')
for line in fp:
    line_lst = line.split(',')
However, that just produced the problem described above. Does anyone know how to split this CSV so that the lyrics do not get split?
Edit: An example of what I'm trying to split:
Adele,All I Ask,"[Verse 1]
I will leave my heart at the door
I won't say a word
They've all been said before, you know..."
Bob Dylan,4Th Time Around,"When she said, ""Don't waste your words, they're
just lies,""
I cried she was deaf.
And she worked on my face until breaking my eyes,
Then said, ""What else you got left?""
It was then that I got up to leave..."

Parsing a CSV that contains lyrics raises some non-trivial problems that are hard to handle by hand (I can see from your edit that you have already figured this out). In particular, quoted columns that contain newlines or commas inside the data itself are difficult to parse, and there are modules designed for exactly this task.
I suggest trying Python's csv.reader or, better, pandas.
Using csv.reader
From the documentation:
import csv
with open('songdata_test.csv', newline='') as csvfile:
    # delimiter and quotechar are the defaults; they are shown explicitly here,
    # so this is equivalent to csv.reader(csvfile)
    reader = csv.reader(csvfile, delimiter=',', quotechar='"')
    for row in reader:
        print(', '.join(row))
Using pandas
import pandas as pd
df = pd.read_csv('songdata_test.csv')
This returns a pandas DataFrame. Using it well takes some learning, but if you work with CSV files in Python I strongly suggest giving it a try.
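To see why csv.reader avoids the problem in the question, here is a minimal sketch that parses an abbreviated copy of the sample data from an in-memory string (io.StringIO stands in for the real file, and the second lyric is shortened for brevity). Newlines and doubled quotes inside a quoted field stay part of that field:

```python
import csv
import io

# Abbreviated sample taken from the question; io.StringIO replaces the file.
sample = '''Adele,All I Ask,"[Verse 1]
I will leave my heart at the door
I won't say a word
They've all been said before, you know..."
Bob Dylan,4Th Time Around,"When she said, ""Don't waste your words, they're
just lies,""
I cried she was deaf."
'''

rows = list(csv.reader(io.StringIO(sample)))

for artist, title, lyrics in rows:
    # Each record has exactly three fields, however many lines the lyrics span.
    print(artist, '|', title, '|', len(lyrics.splitlines()), 'lyric lines')
```

With a real file, pass the object returned by open('songdata_test.csv', newline='') to csv.reader instead of the StringIO.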

Related

Import pipe delimited txt file into spark dataframe in databricks

I have a data file saved in .txt format which has a header row at the top and is pipe delimited. I am working in Databricks and need to create a Spark dataframe from this data, with all columns read in as StringType(), the headers defined by the first row, and the columns separated on the pipe delimiter.
When importing .csv files I am able to set the delimiter and header options. However, I am not able to get the .txt files to import in the same way.
Example Data (completely made up)... for ease, please imagine it is just called datafile.txt:
URN|Name|Supported
12233345757777701|Tori|Yes
32313185648456414|Dave|No
46852554443544854|Steph|No
I would really appreciate a hand in getting this imported into a Spark dataframe so that I can crack on with other parts of the analysis. Thank you!
Any delimiter-separated file is a good candidate for CSV reading methods. The 'c' in csv is mostly a convention. Thus nothing stops us from reading this:
col1|col2|col3
0|1|2
1|3|8
Like this (in pure python):
import csv
from pathlib import Path
with Path("pipefile.txt").open() as f:
    reader = csv.DictReader(f, delimiter="|")
    data = list(reader)
print(data)
Since whatever custom reader your libraries are using probably uses csv.reader under the hood you simply need to figure out how to pass the right separator to it.
@blackbishop notes in a comment that
spark.read.csv("datafile.txt", header=True, sep="|")
would be the appropriate Spark call.

Output of terminal to a csv with separate columns in python

My code is as follows:
import csv
with open('Remarks_Drug.csv', newline='', encoding='utf-8') as myFile:
    reader = csv.reader(myFile)
    for row in reader:
        product = row[0].lower()
        filename = row[1]
        product_patterns = ', '.join([i.split("+")[0].strip() for i in product.split(",")])
        print(product_patterns, filename)
which prints the output below (where "film-coated tablet" should be one column and the filename another):
film-coated tablet RECD outcome AUBAGIO IAIN-21 AoR.txt
solution for injection 093 Acceptance NO Safety profil.txt
I want to write this to a CSV file with one column for product_patterns and another for filename. I wrote the code below, but only the last row gets appended. Can anyone please help me with the looping here? The code I wrote is:
with open('drug_output.csv', 'a') as csvfile:
    fieldnames = ['product_patterns', 'filename']
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writerow({'product_patterns': product_patterns, 'filename': filename})
Depending on the environment you can use, a dedicated library may be more practical for this problem.
The pandas package in particular seems useful in your case.
You can load the CSV using:
import pandas as pd
df=pd.read_csv(file_path)
After doing the necessary manipulations, you can save it again with
df.to_csv(file_path)
This will save you a lot of issues that typically occur when parsing line by line, and it should also increase performance a bit. Pandas is a pretty good package to learn anyway if you need to do some data manipulation.
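If you would rather stay with the csv module, the reason only the last row is written is that writerow runs once, after the loop has finished. A minimal sketch of the fix is to create the writer first and call writerow inside the loop. The helper name write_patterns and the two input rows are illustrative assumptions, and io.StringIO stands in for the real files:

```python
import csv
import io

def write_patterns(in_file, out_file):
    """Read (product, filename) rows and write one output row per input row."""
    reader = csv.reader(in_file)
    writer = csv.DictWriter(out_file, fieldnames=['product_patterns', 'filename'])
    writer.writeheader()
    for row in reader:
        product = row[0].lower()
        filename = row[1]
        product_patterns = ', '.join(
            part.split('+')[0].strip() for part in product.split(',')
        )
        # writerow is inside the loop, so every row is written, not just the last.
        writer.writerow({'product_patterns': product_patterns, 'filename': filename})

# Hypothetical input standing in for Remarks_Drug.csv.
source = io.StringIO(
    'Film-coated tablet,RECD outcome AUBAGIO IAIN-21 AoR.txt\n'
    'Solution for injection,093 Acceptance NO Safety profil.txt\n'
)
target = io.StringIO()
write_patterns(source, target)
print(target.getvalue())
```

With real files, open both with newline='' and pass the file objects to the function in place of the StringIO objects.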

Compare List against CSV file

I have an RSS feed I want to grab data from, manipulate, and then save to a CSV file. The feed's refresh interval varies widely, from 1 minute to several hours, and it only holds 100 items at a time. So to capture everything, I'm planning to run my script every minute. The problem is that if the script runs before the feed updates, I will grab data I have already seen, which leads to duplicate rows in the CSV.
I tried looking at using examples mentioned here but it kept erroring out.
Data Flow:
RSS Feed --> Python Script --> CSV file
Sample data and code below:
Sample Data from CSV:
gandcrab,acad5fc7ebe8c6979d98cb8537e3a247,18bb2c3b82649314dfd45a379058869804954276,bf0ac94c6ae6f1ecfcccc049ae2373bfc659b2efb2e48e824e2e78fb43b6ebef,54,C
Sample Data from list:
zeus,186e84c5fd7da7331a62f1f13b1f4608,3c34aee767859fd75eb0c8c701716cbfd5655437,05c8e4f01ec8d4e6f4595db93bbcc0f85386c9f1b82b5833d983c9092640573a,49,C
Code for comparing:
if trends_f.is_file():
    with open('trendsv3.csv', 'r+', newline='') as csv_file:
        h_reader = csv.reader(csv_file)
        next(h_reader)  # skip reading header of csv
        # should I load the csv into a list then compare it with diff() against the other list?
        # or is there an easier, faster, more efficient way?
I would recommend downloading everything into a CSV and then deduplicating in batches (e.g. nightly), generating a new "clean" CSV for whatever you're working on.
To dedup, load the data with the pandas library and then call drop_duplicates on the resulting DataFrame.
http://pandas.pydata.org/pandas-docs/version/0.17/generated/pandas.DataFrame.drop_duplicates.html
Adding the ID from the feed turned out to be the easiest thing to check against. Thanks to @blhsing for mentioning that. I ended up reading the IDs from the CSV into a list and checking the new data's IDs against it. There may be a faster, more efficient way, but this works for me.
Code to check the CSV before saving to it:
if trends_f.is_file():
    with open('trendsv3.csv', 'r', newline='') as csv_file:
        h_reader = csv.reader(csv_file, delimiter=',')
        next(h_reader, None)  # skip the header row
        for row in h_reader:
            csv_list.append(row[6])  # column 6 holds the feed ID
    # no explicit close() needed: the with block closes the file
    with open('trendsv3.csv', 'a', newline='') as csv_file:
        h_writer = csv.writer(csv_file)
        for entry in data_list:
            if entry[6].strip() not in csv_list:
                print(entry[6], ' is not in the list, saving ', entry[6], ' to the list')
                h_writer.writerow(entry)
            else:
                print(entry[6], ' is in the list')
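On the "faster, more efficient way": membership tests against a list scan every element, while a set lookup takes roughly constant time. A minimal sketch of the same check using a set follows. The function name append_new_rows, the id_col parameter, and the sample rows are illustrative assumptions, and io.StringIO stands in for trendsv3.csv:

```python
import csv
import io

def append_new_rows(existing_file, out_file, new_rows, id_col=6):
    """Append rows whose ID is not already present; return how many were written."""
    reader = csv.reader(existing_file)
    next(reader, None)  # skip the header row
    seen = {row[id_col].strip() for row in reader if len(row) > id_col}
    writer = csv.writer(out_file)
    written = 0
    for entry in new_rows:
        entry_id = entry[id_col].strip()
        if entry_id not in seen:
            writer.writerow(entry)
            seen.add(entry_id)  # also guards against duplicates within new_rows
            written += 1
    return written

# Hypothetical data: six placeholder columns followed by an ID column.
existing = io.StringIO('c0,c1,c2,c3,c4,c5,id\na,b,c,d,54,C,id-1\n')
out = io.StringIO()
new_rows = [
    ['a', 'b', 'c', 'd', '54', 'C', 'id-1'],  # already stored
    ['e', 'f', 'g', 'h', '49', 'C', 'id-2'],  # new
    ['e', 'f', 'g', 'h', '49', 'C', 'id-2'],  # repeated within the same batch
]
count = append_new_rows(existing, out, new_rows)
print(count, 'new row(s) written')
```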

Count number of columns in multiple csv files in directory

I have a directory that contains a large number of CSV files (more than 1000). I am using the pandas library to count the number of columns in each CSV file.
The problem is that the separator is not always ",": some of the files use "|" or ";".
How can I tackle this?
import pandas as pd
import csv
import os
from collections import OrderedDict
path="C:\\Users\\Username\\Documents\\Sample_Data_August10\\outbound"
files=os.listdir(path)
col_count_dict=OrderedDict()
for file in files:
    df=pd.read_csv(os.path.join(path,file),error_bad_lines=False,sep=",|;|\|",engine='python')
    col_count_dict[file]=len(df.columns)
I am storing it as a dictionary.
I am getting an error like:
Error could possibly be due to quotes being ignored when a multi-char delimiter is used
I have used sep=None, but that didn't work.
Edit :
One of the csv is like this :
Number|CommentText|CreationDate|Detail|EventDate|ProfileLocale_ISO|Event_Number|Message_Number|ProfileInformation_Number|Substitute_UserNo|User_UserNo
Second one is like:
Number,Description
I can't reveal the data. I have only given the column names because the data is sensitive.
Update
After a bit of tweaking and adding print statements to andrey-portnoy's code, I found that csv.Sniffer was identifying the delimiter of the "|" files as "e", so I used an if statement to change it back to "|". Now it gives me the correct output.
Also, in place of read() I used readline() in the following line of Andrey's answer: dialect = csv.Sniffer().sniff(csvfile.read(1024))
But the underlying problem remains unsolved. I only found this out after a lot of inspection, and I cannot guess correctly every time, which can lead to errors.
Any help is appreciated.
A multi-character separator such as sep=",|;|\|" is interpreted by pandas as a regular expression, and regex separators ignore quoting, which is what the error message is warning about.
Instead, use the Sniffer from the csv module to detect the CSV dialect used in each file, in particular the delimiter.
For example, for a single file example.csv:
import csv
import pandas as pd
with open('example.csv', newline='') as csvfile:
    dialect = csv.Sniffer().sniff(csvfile.read(1024))
sep = dialect.delimiter
df = pd.read_csv('example.csv', sep=sep)
Don't enable the Python engine by default, as it is much slower.
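The update to the question reports the Sniffer guessing "e" as the delimiter. Sniffer.sniff accepts a delimiters argument that restricts the guess to a set of candidate characters, which may avoid that kind of misdetection. A minimal sketch, assuming only ",", ";" and "|" occur as separators (as the question states) and that a header with none of them means a single column; the helper name count_columns is hypothetical:

```python
import csv
import io

def count_columns(csvfile, candidates=',;|'):
    """Guess the delimiter from the header line and count the header fields."""
    header = csvfile.readline()
    try:
        dialect = csv.Sniffer().sniff(header, delimiters=candidates)
    except csv.Error:
        # Assumption: no candidate delimiter found means a single-column file.
        return None, 1
    fields = next(csv.reader([header], dialect))
    return dialect.delimiter, len(fields)

# The two headers shown in the question, read from in-memory strings.
pipe_header = (
    'Number|CommentText|CreationDate|Detail|EventDate|ProfileLocale_ISO|'
    'Event_Number|Message_Number|ProfileInformation_Number|'
    'Substitute_UserNo|User_UserNo\n'
)
print(count_columns(io.StringIO(pipe_header)))
print(count_columns(io.StringIO('Number,Description\n')))
```

Because only the header line is read, this counts columns without loading each file into a DataFrame.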

split excel sheet for every nrows using python

I have an Excel file with more than 1 million rows. I need to split it every n rows and save each part in a new file. I am very new to Python. Any help is much appreciated.
As suggested by OhAuth, you can save the Excel document as a CSV file. That would be a good starting point for processing your data.
To process the data you can use Python's csv library. It requires no installation, since it ships with Python.
If you want something more "powerful", you might want to look into pandas. However, that module has to be installed.
If you do not want to use either the csv module or pandas because you do not want to read the docs, you could also do something like this:
with open("myCSVfile", "r") as f:
    for row in f:
        # replace "," with the delimiter that separates your columns
        singleRow = row.split(",")
        print(singleRow)
# each printed row is a list such as [value1, value2, value3, ...];
# lists are well documented and easy to work with, so further processing won't be difficult
However, I strongly recommend looking into the modules, since they handle CSV data better and more efficiently, and in the long run they will save you time and trouble.
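For the actual splitting step, one possible approach with the csv module is to pull n rows at a time from the reader with itertools.islice and repeat the header for every part. This is a minimal sketch under assumptions: the helper name split_rows and the sample data are illustrative, and io.StringIO stands in for the CSV exported from Excel:

```python
import csv
import io
from itertools import islice

def split_rows(reader, n):
    """Yield (header, chunk) pairs, each chunk holding at most n data rows."""
    header = next(reader)
    while True:
        chunk = list(islice(reader, n))
        if not chunk:
            break
        yield header, chunk

# Hypothetical five-row file split into parts of two rows each.
source = io.StringIO('id,value\n1,a\n2,b\n3,c\n4,d\n5,e\n')
chunks = list(split_rows(csv.reader(source), 2))

for index, (header, chunk) in enumerate(chunks, start=1):
    print('part', index, 'has', len(chunk), 'rows')
```

To save the parts, open one output file per chunk with newline='' and write the header followed by the chunk using csv.writer's writerow and writerows. Only one chunk is held in memory at a time if you iterate over split_rows directly instead of building a list.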
