Import pipe delimited txt file into spark dataframe in databricks - python

I have a data file saved as .txt format which has a header row at the top, and is pipe delimited. I am working in databricks, and am needing to create a spark dataframe of this data, with all columns read in as StringType(), the headers defined by the first row, and the columns separated based on the pipe delimiter.
When importing .csv files I am able to set the delimiter and header options. However, I am not able to get the .txt files to import in the same way.
Example Data (completely made up)... for ease, please imagine it is just called datafile.txt:
URN|Name|Supported
12233345757777701|Tori|Yes
32313185648456414|Dave|No
46852554443544854|Steph|No
I would really appreciate a hand in getting this imported into a Spark dataframe so that I can crack on with other parts of the analysis. Thank you!

Any delimiter separated file is a good candidate for csv reading methods. The 'c' of csv is mostly by convention. Thus nothing stops us from reading this:
col1|col2|col3
0|1|2
1|3|8
Like this (in pure python):
import csv
from pathlib import Path

with Path("pipefile.txt").open() as f:
    reader = csv.DictReader(f, delimiter="|")
    data = list(reader)
print(data)
Whatever CSV reader your library provides almost certainly accepts a separator option, so you simply need to figure out how to pass the right one to it.
@blackbishop notes in a comment that
spark.read.csv("datafile.txt", header=True, sep="|")
would be the appropriate Spark call. Since inferSchema is off by default, every column is read in as a string.

Related

Grab values from separate csv file and replace the values of columns in a pipe delimited file

Trying to whip this out in Python. Long story short, I have a CSV file that contains column data I need to inject into another file that is pipe delimited. My understanding is that Python can't replace values, so I have to rewrite the whole file with the new values.
data file(csv):
value1,value2,iwantthisvalue3
source file(txt, | delimited)
value1|value2|iwanttoreplacethisvalue3|value4|value5|etc
fixed file(txt, | delimited)
samevalue1|samevalue2| replacedvalue3|value4|value5|etc
I can't figure out how to accomplish this. This is my latest attempt(broken code):
import re
import csv
result = []
row = []
with open("C:\data\generatedfixed.csv","r") as data_file:
    for line in data_file:
        fields = line.split(',')
        result.append(fields[2])
with open("C:\data\data.txt","r") as source_file, with open("C:\data\data_fixed.txt", "w") as fixed_file:
    for line in source_file:
        fields = line.split('|')
        n=0
        for value in result:
            fields[2] = result[n]
            n=n+1
            row.append(line)
    for value in row
        fixed_file.write(row)
I would highly suggest using the pandas package here: it makes handling tabular data very easy and would help you a lot in this case. Once you have installed pandas, import it with:
import pandas as pd
To read the files simply use:
data_file = pd.read_csv(r"C:\data\generatedfixed.csv")
source_file = pd.read_csv(r"C:\data\data.txt", delimiter="|")
After that, manipulating these two files is easy. I'm not exactly sure how many values or which ones you want to replace, but if the "iwantthisvalue3" and "iwanttoreplacethisvalue3" columns have the same number of rows, this should do the trick:
source_file['iwanttoreplacethisvalue3'] = data_file['iwantthisvalue3']
Now all you need to do is save the dataframe (the table that we just updated) to a file. Since you want a .txt file with "|" as the delimiter, this is the line to do that (you can customize how it is saved in a lot of ways):
source_file.to_csv(r"C:\data\data_fixed.txt", sep='|', index=False)
Let me know if everything works and whether this helped. I would also encourage you to read up (or watch some videos) on pandas if you're planning to work with tabular data; it is an awesome library with great documentation and functionality.
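If installing pandas is not an option, the same replacement can be sketched with the standard csv module. This is only an illustration: it assumes neither file has a header row and that the rows of the two files line up one-to-one. The helper name replace_third_column and the in-memory sample rows (modeled on the question's example) are hypothetical.

```python
import csv
import io


def replace_third_column(source_lines, replacement_lines):
    """Return the pipe-delimited rows with their third field taken from the CSV rows."""
    source = csv.reader(source_lines, delimiter="|")
    replacements = csv.reader(replacement_lines)  # comma-delimited by default
    fixed = []
    for src_row, new_row in zip(source, replacements):
        src_row[2] = new_row[2]  # third column in both files (assumption)
        fixed.append(src_row)
    return fixed


# In-memory stand-ins for the two files
source_txt = io.StringIO("value1|value2|iwanttoreplacethisvalue3|value4|value5\n")
data_csv = io.StringIO("value1,value2,iwantthisvalue3\n")

rows = replace_third_column(source_txt, data_csv)

# Write the result back out with "|" as the delimiter
out = io.StringIO()
csv.writer(out, delimiter="|", lineterminator="\n").writerows(rows)
print(out.getvalue())
```

With real files, the StringIO objects would be replaced by files opened with newline="".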

Importing CSV Data and formatting in Excel via Python

I am importing CSV based data in an Excel spreadsheet via Python. I would like to know if it is possible to import the data and divide it in several columns (like we would do via the importing menu under DATA in Excel).
So far, I have read my CSV into a pandas DataFrame and written it to Excel, but all my data ends up clustered in one column:
df = pd.read_csv(r'C:\Users\Contractuel\Desktop\Test\Candiac_TypeLum_UTF8.csv')
writer = pd.ExcelWriter('TypeLum_TEST.xlsx')
df.to_excel(writer, index=False)
writer.save()
Thanks!
The read_csv function takes an argument sep= which tells pandas what separates the values. You probably need to use it to specify the separator used in your CSV file. The default is a comma (,), but CSVs sometimes use a semicolon (;) or other characters as separators.

Python: How to create a new dataframe with first row when a specific value

I am reading csv files into python using:
df = pd.read_csv(r"C:\csvfile.csv")
But the file has some summary data, and the raw data starts where the value "valx" is found. If "valx" is not found, the file is useless. I would like to create new dataframes that start where "valx" is found. I have been trying for a while with no success. Any help on how to achieve this is greatly appreciated.
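Unfortunately, pandas' skiprows has to be told up front which rows to skip (a count, a list of row numbers, or a callable on the row number); it cannot skip until a value appears. You might want to parse the file before creating the dataframe.
As an example:
import csv
with open(r"C:\csvfile.csv", "r", newline="") as f:
    rows = list(csv.reader(f))
# Index of the first row that has "valx" as one of its cells, or None if absent
start = next((n for n, row in enumerate(rows) if "valx" in row), None)
# Keep the rows from that point on; an empty list means the file is useless
data = rows[start:] if start is not None else []
Using the standard library csv module, you can read the file and check whether valx is in it. If it is found, the rows from that point on are stored in the data variable.
From there you can use the data variable to create your dataframe.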
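For large files, a streaming variant of the "parse first, then build the dataframe" idea may be preferable. The sketch below assumes the marker appears as a whole cell in some row; rows_from_marker and the sample text are hypothetical names and data used only for illustration.

```python
import csv
import io
from itertools import dropwhile


def rows_from_marker(lines, marker="valx"):
    """Return the parsed rows starting at the first row that contains `marker`."""
    reader = csv.reader(lines)
    # dropwhile discards rows until the marker shows up, then keeps everything after
    return list(dropwhile(lambda row: marker not in row, reader))


# In-memory stand-in for a file with two summary rows before the raw data
sample = io.StringIO("summary,42\nnotes,ignore\nvalx,a,b\n1,2,3\n")
data = rows_from_marker(sample)
print(data)
```

An empty list comes back when the marker never appears, which matches the "file is useless" case.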

Count number of columns in multiple csv files in directory

I have a directory that contains a large number of CSV files (more than 1000). I am using python pandas library to count the number of columns in each CSV file.
But the problem is that the separator used in some of the CSV files is not only "," but also "|" and ";".
How to tackle this problem:
import pandas as pd
import csv
import os
from collections import OrderedDict
path="C:\\Users\\Username\\Documents\\Sample_Data_August10\\outbound"
files=os.listdir(path)
col_count_dict=OrderedDict()
for file in files:
    df=pd.read_csv(os.path.join(path,file),error_bad_lines=False,sep=",|;|\|",engine='python')
    col_count_dict[file]=len(df.columns)
I am storing it as a dictionary.
I am getting an error like:
Error could possibly be due to quotes being ignored when a multi-char delimiter is used
I have used sep=None, but that didn't work.
Edit :
One of the csv is like this :
Number|CommentText|CreationDate|Detail|EventDate|ProfileLocale_ISO|Event_Number|Message_Number|ProfileInformation_Number|Substitute_UserNo|User_UserNo
Second one is like:
Number,Description
I can't reveal the data. I have just given the column name as the data is sensitive.
Update
After a little tweaking and some print statements to trace andrey-portnoy's code, I found that the csv sniffer was identifying the delimiter as "e" for the "|" file, so I used an if statement to change it back to "|". Now it gives me the correct output.
Also, in place of read() I used readline() in the following line of Andrey's answer: dialect = csv.Sniffer().sniff(csvfile.read(1024))
But the problem remains unsolved. I only figured this out after a lot of inspection, and I can't guess correctly every time, which can lead to errors.
Any help would be appreciated.
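A separator longer than one character, such as sep=",|;|\|", is treated by pandas as a regular expression. That forces the Python engine and ignores quoting, which is what the error message is about.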
Instead, you want to use the Sniffer from the csv module to detect the CSV dialect used in each file, in particular the delimiter.
For example, for a single file example.csv:
import csv
import pandas as pd

with open('example.csv', newline='') as csvfile:
    dialect = csv.Sniffer().sniff(csvfile.read(1024))
sep = dialect.delimiter
df = pd.read_csv('example.csv', sep=sep)
Don't enable the Python engine by default, as it is much slower.
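The update in the question reports the sniffer guessing "e" as the delimiter. One way to avoid that, sketched here under the assumption that only ",", ";" and "|" ever occur in these files, is to restrict the candidates through the delimiters argument of sniff. count_columns is a hypothetical helper that takes the first chunk of a file as text and counts the fields in its first row.

```python
import csv
import io

CANDIDATES = ",;|"  # assumption: the only separators used in the directory


def count_columns(sample, candidates=CANDIDATES):
    """Count the fields in the first row of `sample`, sniffing only `candidates`."""
    try:
        delimiter = csv.Sniffer().sniff(sample, delimiters=candidates).delimiter
    except csv.Error:
        # No candidate found (e.g. a single-column file): fall back to a comma
        delimiter = ","
    first_row = next(csv.reader(io.StringIO(sample), delimiter=delimiter))
    return len(first_row)


print(count_columns("Number,Description\n"))
print(count_columns("a;b;c\n1;2;3\n"))
```

For a real file, the sample could come from csvfile.read(1024) as in the snippet above, and the detected delimiter could be passed to pd.read_csv instead of counting fields by hand.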

Saving DataFrame to csv but output cells type becomes number instead of text

import pandas as pd
check = pd.read_csv('1.csv')
nocheck = check['CUSIP'].str[:-1]
nocheck = nocheck.to_frame()
nocheck['CUSIP'] = nocheck['CUSIP'].astype(str)
nocheck.to_csv('NoCheck.csv')
This works but while writing the csv, a value for an identifier like 0003418 (type = str) converts to 3418 (type = general) when the csv file is opened in Excel. How do I avoid this?
I couldn't find a dupe for this question, so I'll post my comment as a solution.
This is an Excel issue, not a python error. Excel autoformats numeric columns to remove leading 0's. You can "fix" this by forcing pandas to quote when writing:
import csv
# insert pandas code from question here
# use csv.QUOTE_ALL when writing CSV.
nocheck.to_csv('NoCheck.csv', quoting=csv.QUOTE_ALL)
Note that this will actually put quotes around each value in your CSV. It will render the way you want in Excel, but you may run into issues if you try to read the file some other way.
Another solution is to write the CSV without quoting and import the column into Excel with the "Text" format rather than "General", so that the leading zeros are kept.
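To see what quoting actually changes in the file, here is a small sketch using the standard csv module rather than pandas (the quoting argument of to_csv takes the same csv.QUOTE_* constants). The helper name to_quoted_csv and the sample value are only for illustration.

```python
import csv
import io


def to_quoted_csv(rows):
    """Serialize rows with every field wrapped in double quotes."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, quoting=csv.QUOTE_ALL, lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()


# The identifier is a string, so its leading zeros are present in the output text
text = to_quoted_csv([["CUSIP"], ["0003418"]])
print(text)
```

Opening the written file in a text editor rather than Excel is a quick way to confirm that the zeros are still there.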
