Pyspark : How to escape backslash ( \ ) in input file

Pyspark : How to escape backslash ( \ ) in input file - python

I am loading a csv file into postgresql using pyspark. I have a record in the input file which looks like below -
Id,dept,city,name,country,state
1234,ABC,dallas,markhenry\,USA,texas
When I load it into the postgresql database then it gets loaded like this which is not correct -
Id | dept| city | name | country | state
1234 | ABC | dallas | markhenry,USA | texas | null
correct output in postgresdb should be -
Id | dept| city | name | country | state
1234 | ABC | dallas | markhenry | USA | texas
I am reading the file like below -
input_df = spark.read.format("csv").option("quote", "\"").option("escape", "\"").option("header",
"true").load(filepath)
Is there a way I can modify my code to handle the backslash () coming in the data. Thanks in advance

The purpose of the "quote" option is to specify a quote character, which wraps entire column values. Not sure if that is needed here, but you can use the regexp_replace function to remove specific characters (just select everything else as-is and modify the name column this way).
from pyspark.sql.functions import *
df = spark.read.option("inferSchema", "true").option("header", "true").csv(filepath)
df2 = df.select(col("Id"), col("dept"), col("city"), regexp_replace(col("name"), "\\\\", "").alias("name"), col("country"), col("state"))
df2.show(4, False)
Output:
+----+----+------+---------+-------+-----+
|Id |dept|city |name |country|state|
+----+----+------+---------+-------+-----+
|1234|ABC |dallas|markhenry|USA |texas|
+----+----+------+---------+-------+-----+

Related

Pyspark mapping regex

I have a pyspark dataframe, with text column.
I wanted to map the values which with a regex expression.
df = df.withColumn('mapped_col', regexp_replace('mapped_col', '.*-RH', 'RH'))
df = df.withColumn('mapped_col', regexp_replace('mapped_col', '.*-FI, 'FI'))
Plus I wanted to map specifics values according to a dictionnary, I did the following (mapper is from create_map()):
df = df.withColumn("mapped_col",mapper.getItem(F.col("action")))
Finaly the values which has not been mapped by the dictionnary or the regex expression, will be set null. I do not know how to do this part in accordance to the two others.
Is it possible to have like a dictionnary of regex expression so I can regroup the two 'functions'?
{".*-RH": "RH", ".*FI" : "FI"}
Original Output Example
+-----------------------------+
|message |
+-----------------------------+
|GDF2009 |
|GDF2014 |
|ADS-set |
|ADS-set |
|XSQXQXQSDZADAA5454546a45a4-FI|
|dadaccpjpifjpsjfefspolamml-FI|
|dqdazdaapijiejoajojp565656-RH|
|kijipiadoa
+-----------------------------+
Expected Output Example
+-----------------------------+-----------------------------+
|message |status|
+-----------------------------+-----------------------------+
|GDF2009 | GDF
|GDF2014 | GDF
|ADS/set | ADS
|ADS-set | ADS
|XSQXQXQSDZADAA5454546a45a4-FI| FI
|dadaccpjpifjpsjfefspolamml-FI| FI
|dqdazdaapijiejoajojp565656-RH| RH
|kijipiadoa | null or ??
So first 4th line are mapped with a dict, and the other are mapped using regex. Unmapped are null or ??
Thank you,

You can achieve it using contains function:
from pyspark.sql.types import StringType
df = spark.createDataFrame(
["GDF2009", "GDF2014", "ADS-set", "ADS-set", "XSQXQXQSDZADAA5454546a45a4-FI", "dadaccpjpifjpsjfefspolamml-FI",
"dqdazdaapijiejoajojp565656-RH", "kijipiadoa"], StringType()).toDF("message")
df.show()
names = ("GDF", "ADS", "FI", "RH")
def c(col, names):
return [f.when(f.col(col).contains(i), i).otherwise("") for i in names]
df.select("message", f.concat_ws("", f.array_remove(f.array(*c("message", names)), "")).alias("status")).show()
output:
+--------------------+
| message|
+--------------------+
| GDF2009|
| GDF2014|
| ADS-set|
| ADS-set|
|XSQXQXQSDZADAA545...|
|dadaccpjpifjpsjfe...|
|dqdazdaapijiejoaj...|
| kijipiadoa|
+--------------------+
+--------------------+------+
| message|status|
+--------------------+------+
| GDF2009| GDF|
| GDF2014| GDF|
| ADS-set| ADS|
| ADS-set| ADS|
|XSQXQXQSDZADAA545...| FI|
|dadaccpjpifjpsjfe...| FI|
|dqdazdaapijiejoaj...| RH|
| kijipiadoa| |
+--------------------+------+

problem with line terminator \n on dataframe and .csv

Im getting (with an python API) a .csv file from an email attachment that i received in gmail, transforming it into a dataframe to make some dataprep, and saving as .csv on my pc. It is working great, the problem is that i get '\n' on some columns(it came like that from the source attachment).
the code that i used to get the data and transform into dataframe and .csv
r = io.BytesIO(part.get_payload(decode = True))
df = pd.DataFrame(r)
df.to_csv('C:/Users/x.csv', index = False)
Example of df that i get:
+-------------+----------+---------+----------------------+
| Information | Modified | Created | MD_x0020_Agenda\r\n' |
+-------------+----------+---------+----------------------+
| c | d | f | \r\n' |
| b\n' | | | |
| c | e | \r\n' | |
+-------------+----------+---------+----------------------+
example of answer that is correct:
+-------------+----------+---------+----------------------+
| Information | Modified | Created | MD_x0020_Agenda\r\n' |
+-------------+----------+---------+----------------------+
| c | d | f | \r\n' |
| b | c | e | \r\n' |
+-------------+----------+---------+----------------------+
i tried to use the line_terminator. in my mind, if i force it to get only \r\n and not \n, it would work. It didnt.
df.to_csv('C:/Users/x.csv', index = False, line_terminator='\r\n')
can somebody give me a help with that? its really freaking me out, because of that i cant advance at my project. thanks.

Usually, this "\n" appears to mark that sentence is going for next line i.e ‘return’ key, line break.
You can get rid of it just by applying replace('\n', '') on your dataframe:
df = df.replace('\n', '')
For more details on the function, consider checking this specific Pandas documentation
Hope it works.

I mixed the two answers and got the solution, thanks!!!!!
PS: with some research i found that this is a windows/excel issue, when you export .csv it considers \n and \r\n (\r too?) as new row. DataFrame considers only \r\n as new row(when default).
df = pd.read_csv(io.BytesIO(part.get_payload(decode = True)), header=None)
#grab the first row for the header
new_header = df.iloc[0]
#take the data less the header row
df = df[1:]
#set the header row as the df header
df.columns = new_header
#replace the \n wich is creating new lines
df['Information'] = df['Information'].replace(regex = '\n', value = '')
df.to_csv('C:/Users/x.csv', index = False', index = False)

Pandas not displaying all columns when writing to

I am attempting to export a dataset that looks like this:
+----------------+--------------+--------------+--------------+
| Province_State | Admin2 | 03/28/2020 | 03/29/2020 |
+----------------+--------------+--------------+--------------+
| South Dakota | Aurora | 1 | 2 |
| South Dakota | Beedle | 1 | 3 |
+----------------+--------------+--------------+--------------+
However the actual CSV file i am getting is like so:
+-----------------+--------------+--------------+
| Province_State | 03/28/2020 | 03/29/2020 |
+-----------------+--------------+--------------+
| South Dakota | 1 | 2 |
| South Dakota | 1 | 3 |
+-----------------+--------------+--------------+
Using this here code (runnable by running createCSV(), pulls data from COVID govt GitHub):
import csv#csv reader
import pandas as pd#csv parser
import collections#not needed
import requests#retrieves URL fom gov data
def getFile():
url = 'https://raw.githubusercontent.com/CSSEGISandData/COVID- 19/master/csse_covid_19_data/csse_covid_19_time_series /time_series_covid19_deaths_US.csv'
response = requests.get(url)
print('Writing file...')
open('us_deaths.csv','wb').write(response.content)
#takes raw data from link. creates CSV for each unique state and removes unneeded headings
def createCSV():
getFile()
#init data
data=pd.read_csv('us_deaths.csv', delimiter = ',')
#drop extra columns
data.drop(['UID'],axis=1,inplace=True)
data.drop(['iso2'],axis=1,inplace=True)
data.drop(['iso3'],axis=1,inplace=True)
data.drop(['code3'],axis=1,inplace=True)
data.drop(['FIPS'],axis=1,inplace=True)
#data.drop(['Admin2'],axis=1,inplace=True)
data.drop(['Country_Region'],axis=1,inplace=True)
data.drop(['Lat'],axis=1,inplace=True)
data.drop(['Long_'],axis=1,inplace=True)
data.drop(['Combined_Key'],axis=1,inplace=True)
#data.drop(['Province_State'],axis=1,inplace=True)
data.to_csv('DEBUGDATA2.csv')
#sets province_state as primary key. Searches based on date and key to create new CSVS in root directory of python app
data = data.set_index('Province_State')
data = data.iloc[:,2:].rename(columns=pd.to_datetime, errors='ignore')
for name, g in data.groupby(level='Province_State'):
g[pd.date_range('03/23/2020', '03/29/20')] \
.to_csv('{0}_confirmed_deaths.csv'.format(name))
The reason for the loop is to set the date columns (everything after the first two) to a date, so that i can select only from 03/23/2020 and beyond. If anyone has a better method of doing this, I would love to know.
To ensure it works, it prints out all the field names, inluding Admin2 (county name), province_state, and the rest of the dates.
However, in my CSV as you can see, Admin2 seems to have disappeared. I am not sure how to make this work, if anyone has any ideas that'd be great!

changed
data = data.set_index('Province_State')
to
data = data.set_index((['Province_State','Admin2']))
Needed to create a multi key to allow for the Admin2 column to show. Any smoother tips on the date-range section welcome to reopen
Thanks for the help all!

Print from PrettyTable with Python2 vs Python3

I am playing a little with PrettyTable in Python and I noticed completely different behavior in Python2 and Python 3. Can somebody exactly explain me the difference in output? Nothing in docs gave me satisfied answer for that. But let's start with little code. Let's start with creating my_table
from prettytable import PrettyTable
my_table = PrettyTable()
my_table.field_name = ['A','B']
It creates two column table with column A and column B. Let's add on row to it, but assume that value in cell can have multi lines, separated by Python new line '\n' , as the example some properties of parameter from column A.
row = ['parameter1', 'component: my_component\nname:somename\nmode: magic\ndate: None']
my_table.add_row(row)
Generally the information in row can be anything, it's just a string retrieved from other function. As you can see, it has '\n' inside. The thing that I don't completely understand is the output of print function.
I have in Python2
print(my_table.get_string().encode('utf-8'))
Which have me output like this:
+------------+-------------------------+
| Field 1 | Field 2 |
+------------+-------------------------+
| parameter1 | component: my_component |
| | name:somename |
| | mode: magic |
| | date: None |
+------------+-------------------------+
But in Python3 I have:
+------------+-------------------------+
| Field 1 | Field 2 |
+------------+-------------------------+
| parameter1 | component: my_component |
| | name:somename |
| | mode: magic |
| | date: None |
+------------+-------------------------+
If I completely removes the encode part, it seems that output looks ok on both version of Python.
So when I have
print(my_table.get_string())
It works on Python3 and Python2. Should I remove the encode part from code? It is save to assume it is not necessary? Where is the problem exactly?

Writing values to excel in python using pandas

I'm new to python and would like to pass the ZipCode in excel file to 'uszipcode' package and write the state for that particular zipcode to 'OriginalZipcode' column in the excel sheet. The reason for doing this is, I want to compare the existing states with the original states. I don't understand if the for loop is wrong in the code or something else is. Currently, I cannot write the states to OriginalZipcode column in excel. The code I've written is:
import pandas as pd
from pandas import ExcelWriter
from pandas import ExcelFile
import uszipcode as US
from uszipcode import ZipcodeSearchEngine
search = ZipcodeSearchEngine()
df = pd.read_excel("H:\excel\checking for zip and states\checkZipStates.xlsx", sheet_name='Sheet1')
#print(df.values)
for i, row in df.iterrows():
zipcode = search.by_zipcode(row['ZipCode']) #for searching zipcode
b = zipcode.State
df.at['row','OriginalState'] = b
df.to_excel("H:\\excel\\checking for zip and states\\new.xlsx", sheet_name = "compare", index = False)
The excel sheet is in this format:
| ZipCode |CurrentState | OriginalState |
|-----------|-----------------|---------------|
| 59714 | Montana | |
| 29620 | South Carolina | |
| 54405 | Wisconsin | |
| . | . | |
| . | . | |

You can add the OriginalState column without iterating the df:
Define a function that returns the value you want for any given zip code:
def get_original_state(state):
zipcode = search.by_zipcode(state) #for searching zipcode
return zipcode.State
Then:
df['OriginalState'] = df.apply( lambda row: get_original_state(row['ZipCode']), axis=1)
Finally, export only once the df to excel.
This should do the trick.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Pyspark : How to escape backslash ( \ ) in input file - python

Related

Pyspark mapping regex

problem with line terminator \n on dataframe and .csv

Pandas not displaying all columns when writing to

Print from PrettyTable with Python2 vs Python3

Writing values to excel in python using pandas

Categories

Resources