How to read CSV in pyspark with "," delimiter but not ", " - python

I am using the following code to read the CSV file in PySpark
cb_sdf = sqlContext.read.format("csv") \
    .options(header='true',
             multiLine='true',
             inferSchema='true',
             treatEmptyValuesAsNulls='true') \
    .load(cb_file)
The number of rows is correct. But for some rows, the columns are separated incorrectly. I think it is because the current delimiter is ",", but some cells contain ", " in the text as well.
For example, the following row in the pandas dataframe (I used pd.read_csv to debug):
Unnamed: 0  name                                domain    industry                  locality                   country  size_range
111         cjsc "transport, customs, tourism"  ttt-w.ru  package/freight delivery  vyborg, leningrad, russia  russia   1 - 10
becomes
_c0  name               domain   industry    locality  country                   size_range
111  "cjsc ""transport  customs  tourism"""  ttt-w.ru  package/freight delivery  vyborg, leningrad, russia
when I read it with PySpark.
It seems the cell "cjsc "transport, customs, tourism"" is separated into 3 cells: |"cjsc ""transport| customs| tourism"""|.
How can I set the delimiter to be exactly "," without any whitespace followed?
UPDATE:
I checked the CSV file, the original line is:
111,"cjsc ""transport, customs, tourism""",ttt-w.ru,package/freight delivery,"vyborg, leningrad, russia",russia,1 - 10
So is it still the problem of delimiter, or is it the problem of quotes?

I think that, separating it correctly, we'll have:
col1: 111
col2: "cjsc ""transport, customs, tourism"""
col3: ttt-w.ru
col4: package/freight delivery
col5: "vyborg, leningrad, russia"
col6: russia
col7: 1 - 10
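So it is a problem of quotes, not of the delimiter: the fields use doubled quotes ("") as escapes, while Spark's CSV reader defaults to backslash (\) as the escape character. A minimal sketch of a fix, keeping your other options and assuming the same cb_file, is to set both quote and escape to the double-quote character:
cb_sdf = sqlContext.read.format("csv") \
    .options(header='true',
             multiLine='true',
             inferSchema='true',
             treatEmptyValuesAsNulls='true',
             quote='"',
             escape='"') \
    .load(cb_file)
With escape='"', the embedded "cjsc ""transport, customs, tourism""" field should parse back into a single cell.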

Related

Python dataframe from 2 text files (different number of columns)

I need to make a dataframe from two txt files.
The first txt file looks like this: Street_name space id.
The second txt file looks like this: City_name space id.
Example:
text file 1:
Roseberry st 1234
Brooklyn st 4321
Wolseley 1234567
text file 2:
Winnipeg 4321
Winnipeg 1234
Ste Anne 1234567
I need to make one dataframe out of this. Sometimes there is just one word for Street_name, and sometimes more. The same goes for City_name.
I get an error: ParserError: Error tokenizing data. C error: Expected 2 fields in line 5, saw 3 because I'm trying to put both words for street name into the same column, but don't know how to do it. I want one column for street name (no matter if it consists of one or more words), one for city name, and one for id.
I want a df with 3 rows and 3 cols.
Thanks!
Edit: both text files are huge (each 50 mil rows +), so I need this code not to break and to be optimised for large files.
This is NOT valid CSV, so you may need to read it on your own.
You can use a normal open() and read(), then split on newlines to create a list of lines. Then you can use a for-loop and line.rsplit(" ", 1) to split each line on its last space.
Minimal working example:
I use io to simulate a file in memory - so everyone can simply copy and test it - but you should use open()
import io
import pandas as pd

text = '''Roseberry st 1234
Brooklyn st 4321
Wolseley 1234567'''

#with open('filename') as fh:
with io.StringIO(text) as fh:
    lines = fh.read().splitlines()

print(lines)

lines = [line.rsplit(" ", 1) for line in lines]
print(lines)

df = pd.DataFrame(lines, columns=['name', 'number'])
print(df)
Result:
['Roseberry st 1234', 'Brooklyn st 4321', 'Wolseley 1234567']
[['Roseberry st', '1234'], ['Brooklyn st', '4321'], ['Wolseley', '1234567']]
name number
0 Roseberry st 1234
1 Brooklyn st 4321
2 Wolseley 1234567
EDIT:
read_csv can use a regex to define the separator (i.e. sep=r"\s+" for many spaces), and it can even use lookahead/lookbehind ((?=...)/(?<=...)) to check if there is a digit after the space without catching it as part of the separator.
text = '''Roseberry st 1234
Brooklyn st 4321
Wolseley 1234567'''
import io
import pandas as pd
#df = pd.read_csv('filename', names=['name', 'number'], sep=r'\s(?=\d)', engine='python')
df = pd.read_csv(io.StringIO(text), names=['name', 'number'], sep=r'\s(?=\d)', engine='python')
print(df)
Result:
name number
0 Roseberry st 1234
1 Brooklyn st 4321
2 Wolseley 1234567
And later you can connect both dataframes using .join() or .merge() with the parameter on= (like in an SQL query).
text1 = '''Roseberry st 1234
Brooklyn st 4321
Wolseley 1234567'''
text2 = '''Winnipeg 4321
Winnipeg 1234
Ste Anne 1234567'''
import io
import pandas as pd
df1 = pd.read_csv(io.StringIO(text1), names=['street name', 'id'], sep=r'\s(?=\d)', engine='python')
df2 = pd.read_csv(io.StringIO(text2), names=['city name', 'id'], sep=r'\s(?=\d)', engine='python')
print(df1)
print(df2)
df = df1.merge(df2, on='id')
print(df)
Result:
street name id
0 Roseberry st 1234
1 Brooklyn st 4321
2 Wolseley 1234567
city name id
0 Winnipeg 4321
1 Winnipeg 1234
2 Ste Anne 1234567
street name id city name
0 Roseberry st 1234 Winnipeg
1 Brooklyn st 4321 Winnipeg
2 Wolseley 1234567 Ste Anne
Pandas doc: Merge, join, concatenate and compare
There's nothing that I'm aware of in pandas that does this automatically.
Below, I built a script that merges those addresses (addy + st) into a single column, then merges the two data frames into one based on the "id".
I assume your actual text files are significantly larger, so assuming they follow the pattern set in the two examples, this script should work fine.
Basically, this code turns each line of text in the file into a list, then combines lists of length 3 into length 2 by combining the first two list items.
After that, it turns the "list of lists" into a dataframe and merges those dataframes on column "id".
A couple of caveats:
Make sure you set the correct text file paths.
Make sure the first line of each text file contains two single-string column headers (i.e.: "address id" or "city id").
Make sure each text file's id column header is named "id".
import pandas as pd

# set both text file paths (you may need full path i.e. C:\Users\Name\bla\bla\bla\text1.txt)
text_path_1 = r'text1.txt'
text_path_2 = r'text2.txt'

# reads the first text file
with open(text_path_1) as f1:
    text_file_1 = f1.readlines()

# reads the second text file
with open(text_path_2) as f2:
    text_file_2 = f2.readlines()

# function that massages data into two columns (to put "st" into same column as address name)
def data_massager(text_file_lines):
    data_list = []
    for item in text_file_lines:
        stripped_item = item.strip('\n')
        split_stripped_item = stripped_item.split(' ')
        if len(split_stripped_item) == 3:
            split_stripped_item[0:2] = [' '.join(split_stripped_item[0:2])]
        data_list.append(split_stripped_item)
    return data_list

# runs function on both text files
data_list_1 = data_massager(text_file_1)
data_list_2 = data_massager(text_file_2)

# creates dataframes from both text files
df1 = pd.DataFrame(data_list_1[1:], columns=data_list_1[0])
df2 = pd.DataFrame(data_list_2[1:], columns=data_list_2[0])

# merges data based on id (make sure both text files' id is named "id")
merged_df = df1.merge(df2, how='left', on='id')

# prints dataframe (assuming you're using something like jupyter-lab)
merged_df
pandas has strong support for strings. You can make the lines of each file into a Series and then use a regular expression to separate the fields into separate columns. I assume that "id" is the common value that links the two datasets, so it can become the dataframe index and the columns can just be added together.
import pandas as pd
street_series = pd.Series([line.strip() for line in open("text1.txt")])
street_df = street_series.str.extract(r"(.*?) (\d+)$")
del street_series
street_df.rename({0:"street", 1:"id"}, axis=1, inplace=True)
street_df.set_index("id", inplace=True)
print(street_df)
city_series = pd.Series([line.strip() for line in open("text2.txt")])
city_df = city_series.str.extract(r"(.*?) (\d+)$")
del city_series
city_df.rename({0:"city", 1:"id"}, axis=1, inplace=True)
city_df.set_index("id", inplace=True)
print(city_df)
street_df["city"] = city_df["city"]
print(street_df)

Compare values between 2 dataframes and transform data

The main aim of this script is to compare the format of the ZIP Code data present in the csv with the official ZIP Code regex for that country; if the format does not match, the script should carry out transformations on that data and output it all in one final dataframe.
I have 2 csv files, one (countries.csv) containing the following columns & data examples
INPUT:
Contact ID  Country  Zip Code
1           USA      71293
2           Italy    IT 2310219
and another csv (Regex.csv) with the following data examples:
Country  Regex format
USA      [0-9]{5}(?:-[0-9]{4})?
Italy    \d{5}
Now, the first csv has some 35k records, so I would like to create a function which loops through Regex.csv (as a dataframe) to grab the country column and the regex format. Then it would loop through the country list to grab every instance where regex['country'] == countries['country'], and apply the regex transformation to the ZIP Codes for that country.
So far I have this function but I can't get it to work.
def REGI(dframe):
    dframe = pd.DataFrame().reindex_like(contacts)
    cols = list(contacts.columns)
    for index, row in mergeOne.iterrows():
        country = row['Country']
        reg = row['regex']
        for i, r in contactsS.iterrows():
            if (r['Country of Residence'] == country
                    or r['Country of Residence.1'] == country
                    or r['Mailing Country (text only)'] == country
                    or r['Other Country (text only)'] == country):
                dframe.loc[i] = r
        dframe['Mailing Zip/Postal Code'] = dframe['Mailing Zip/Postal Code'].apply(str).str.extractall(reg).unstack().apply(lambda x: ','.join(x.dropna()), axis=1)
        contacts.loc[contacts['Contact ID'].isin(dframe['Contact ID']), cols] = dframe[cols]
    dframe = dframe.dropna(how='all')
    return dframe
['Contact ID'] is being used as an identifier column.
The second for loop works on its own however I would need to manually re-type a new dataframe name, regex format and country name (without the first for loop).
At the moment I am getting the following error:
ValueError: pattern contains no capture groups
If I paste the results into a new dataframe, it returns the following:
Example as text
Account ID  Country         Zip/Postal Code
1           United Kingdom  WV9 5BT
2           Ireland         D24 EO29
3           Latvia          1009
4           United Kingdom  EN6 1JE
5           Italy           22010
REGEX table
Country         Regex
United Kingdom  (full regex below)
Latvia          [L]{1}[V]{1}-{4}
Ireland         STRNG_LTN_EXT_255
Italy           \d{5}
United Kingdom regex:
([Gg][Ii][Rr] 0[Aa]{2})|((([A-Za-z][0-9]{1,2})|(([A-Za-z][A-Ha-hJ-Yj-y][0-9]{1,2})|(([A-Za-z][0-9][A-Za-z])|([A-Za-z][A-Ha-hJ-Yj-y][0-9]?[A-Za-z])))) [0-9][A-Za-z]{2})
Based on your response to my comment, I would suggest to directly fix the zip code using your regexes:
df3 = df2.set_index('Country')
df1['corrected_Zip'] = (df1.groupby('Country')['Zip Code']
                        .apply(lambda x: x.str.extract('(%s)' % df3.loc[x.name, 'Regex format']))
                        )
df1
df1
This groups by country, applies the regex for that country, and extracts the value.
output:
Contact ID Country Zip Code corrected_Zip
0 1 USA 71293 71293
1 2 Italy IT 2310219 23102
NB. if you want you can directly overwrite Zip Code by doing df1['Zip Code'] = …
NB2. This will work only if all countries have an entry in df2; if this is not the case, you need to add a check for that (let me know). A minimal sketch of such a check follows below.
NB3. if you want to know which rows had an invalid zip, you can fetch them using:
df1[df1['Zip Code']!=df1['corrected_Zip']]
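For NB2, a sketch of the missing-country check, assuming the same df1 and df3 as above:
# guard: find rows whose country has no regex entry in df3
missing = ~df1['Country'].isin(df3.index)
if missing.any():
    print("No regex entry for:", df1.loc[missing, 'Country'].unique())
df1 = df1[~missing]  # or handle these rows separately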

Count how many times a string appears in pandas df with one variable element

I have a pandas df that contains text in one column:
df(print):
Country 1 - Name of Country
paragraph text
Country 2 - Name of Country
paragraph text
Country 3 - Name of Country
paragraph text
Country 4 - Name of Country
paragraph text
I am trying to count how many times the string "Country # -" appears. The thing is that the number in the middle is something that can change. There could be up to 20 countries listed.
With this example I am hoping to get:
print(count):
4
There is an off chance that the word "Country" appears at the start of the paragraph text, which is why I was hoping to be able to search for the full "Country" + number + "-" string.
Any help would be much appreciated. Thanks very much!
Use a regular expression on the column that stores the data, for example:
import numpy as np
import pandas as pd

np.random.seed(10)
countries_sample = ['Country 1 - text text', 'not Country string', 'Country 2']
df = pd.DataFrame(np.random.choice(countries_sample, 10),
                  columns=['text_to_validate'])
df.head(3)
# text_to_validate
# 0 not Country string
# 1 not Country string
# 2 Country 1 - text text
Then use the str attribute followed by the contains method and the regular expression:
total = df['text_to_validate'].str.contains(r'^Country [0-9]+ -', regex=True).sum()
print(total) # 4
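Applied to your own dataframe, a sketch (assuming your text column is named 'col'; adjust to your actual column name):
count = df['col'].str.contains(r'^Country [0-9]+ -', regex=True).sum()
print(count)  # 4 for the example above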

Merging inconsistent data in text files into a single excel spreadsheet

I have a large number of text files with data; each file can be imported into excel separately. However, while most of the columns are the same between the files, in many files there's a column or two added/missing so when I merge all the text files and put it into excel, many columns of data are shifted.
I can make a 'master list' of all the possible data entries, but I'm not exactly sure how to tell excel to put certain types of data in specific columns.
For instance, if I have two files that look like:
Name Year Food Color
Bob 2018 Cake Blue
Charlie 2017 Figs Red
and
LastName Name Age Year Color Size
Lily James 17 2021 green 0
How would I go about merging them like this in excel:
LastName Name Age Year Food Color Size
na Bob na 2018 Cake Blue na
na Charlie na 2017 Figs Red na
Lily James 17 2021 na green 0
Question: Merging inconsistent data in text files into a single excel spreadsheet
This solution uses the following built-ins and modules:
Set Types
Lists
CSV File Reading and Writing
Mapping Types — dict
The core of this solution is to normalize the column names using a set() object and the parameter .DictWriter(..., extrasaction='ignore') to handle the inconsistent columns.
The output format is CSV, which can be read from MS-Excel.
The given data, separated by blanks:
text1 = """Name Year Food Color
Bob 2018 Cake Blue
Charlie 2017 Figs Red
"""
text2 = """LastName Name Age Year Color Size
Lily James 17 2021 green 0
"""
Open three file objects and get the headers.
Aggregate all column names, dropping duplicate names using a set().
Create a DictReader object for each of the in_* files.
Note: Replace io.StringIO(... with open(<Path to file>)
import io
import csv

with io.StringIO(text1) as in_text1, \
     io.StringIO(text2) as in_text2, \
     io.StringIO() as out_csv:

    columns = set()
    reader = []
    for n, fh in enumerate([in_text1, in_text2]):
        fieldnames = fh.readline().rstrip().split()
        [columns.add(name) for name in fieldnames]
        reader.append(csv.DictReader(fh, delimiter=' ', fieldnames=fieldnames))
Create a DictWriter object using the normalized column names.
The parameter extrasaction='ignore' handles the inconsistent columns.
Note: The column order is not guaranteed. If you need a defined order, sort the list(columns) to your needs before assigning to fieldnames= (see the sketch after the output below).
    writer = csv.DictWriter(out_csv, fieldnames=list(columns), extrasaction='ignore')
    writer.writeheader()
Loop all DictReader objects reading all lines and write it to the target .csv file.
    for dictReader in reader:
        for _dict in dictReader:
            writer.writerow(_dict)
Output:
print(out_csv.getvalue())
Color,LastName,Year,Food,Age,Name,Size
Blue,,2018,Cake,,Bob,
Red,,2017,Figs,,Charlie,
green,Lily,2021,,17,James,0
Tested with Python: 3.4.2
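As a follow-up to the column-order note above: for a deterministic header order, one variation (a sketch, using the same columns set) is to sort the names before handing them to the writer:
    writer = csv.DictWriter(out_csv, fieldnames=sorted(columns), extrasaction='ignore')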
If you were happy to work with the text files directly in Excel ... this will work but may need some refinement from yourself.
I understand it’s probably not what you’re looking for but it provides another option.
Open the Visual Basic editor, add a new module and copy the below code and paste in ...
Public Sub ReadAndMergeTextFiles()
    Dim strSrcFolder As String, strFileName As String, strLine As String, strPath As String, bFirstLine As Boolean
    Dim arrHeaders() As String, lngHeaderIndex As Long, arrFields, i As Long, objDestSheet As Worksheet, bFound As Boolean
    Dim objLastHeader As Range, x As Long, lngLastColumn As Long, lngHeaderCol As Long, arrHeaderCols() As Long
    Dim lngWriteRow As Long

    lngLastColumn = 1
    lngWriteRow = 2

    Application.EnableEvents = False
    Application.ScreenUpdating = False

    ' Change the sheet name being assigned to your destination worksheet name.
    ' Alternatively, display a prompt that asks for the sheet or simply uses the active sheet.
    Set objDestSheet = Worksheets("Result")

    With Application.FileDialog(msoFileDialogFolderPicker)
        .Title = "Select Source Folder"
        .Show

        If .SelectedItems.Count = 1 Then
            objDestSheet.Cells.Clear
            strSrcFolder = .SelectedItems(1)
            strFileName = Dir(strSrcFolder & "\*.txt")

            Do While Len(strFileName) > 0
                strPath = strSrcFolder & "\" & strFileName
                Open strPath For Input As #1

                bFirstLine = True

                Do Until EOF(1)
                    Line Input #1, strLine
                    arrFields = Split(strLine, vbTab, , vbTextCompare)

                    lngHeaderIndex = -1

                    For i = 0 To UBound(arrFields)
                        If bFirstLine Then
                            ' Loop through the header fields already written to the destination worksheet and find a match.
                            For x = 1 To objDestSheet.Columns.Count
                                bFound = False
                                If Trim(objDestSheet.Cells(1, x)) = "" Then Exit For
                                If UCase(objDestSheet.Cells(1, x)) = UCase(arrFields(i)) Then
                                    lngHeaderCol = x
                                    bFound = True
                                    Exit For
                                End If
                            Next

                            If Not bFound Then
                                objDestSheet.Cells(1, lngLastColumn) = arrFields(i)
                                lngHeaderCol = lngLastColumn
                                lngLastColumn = lngLastColumn + 1
                            End If

                            lngHeaderIndex = lngHeaderIndex + 1
                            ReDim Preserve arrHeaderCols(lngHeaderIndex)
                            arrHeaderCols(lngHeaderIndex) = lngHeaderCol
                        Else
                            ' Write out each value into the column found.
                            objDestSheet.Cells(lngWriteRow, arrHeaderCols(i)) = "'" & arrFields(i)
                        End If
                    Next

                    If Not bFirstLine Then
                        lngWriteRow = lngWriteRow + 1
                    End If

                    bFirstLine = False
                Loop

                Close #1

                strFileName = Dir
            Loop

            objDestSheet.Columns.AutoFit
        End If
    End With

    Application.ScreenUpdating = True
    Application.EnableEvents = True
End Sub
... I did some basic testing with the data you provided and it seemed to work. If for some reason it fails over the data you're using and you can't work it out, let me know and I'll put a fix in.
Some points ...
The order of the columns depends on the order of your files and which columns appear first. Of course, that could be enhanced upon but it is what it is for now.
It assumes all files are in the one folder and that all files end in .txt.
The separator within each file is assumed to be a TAB.
Let me know if that helps.

Pandas not recognizing csv columns

I am using pandas to read .csv data files. For one of my files I am able to index using the column title. For the other I get this error:
File "/usr/lib/python2.7/dist-packages/pandas/core/internals.py", line 1023, in _check_have
    raise KeyError('no item named %s' % com.pprint_thing(item))
KeyError: u'no item named State'
The code I used is:
filename = "PovertyEstimates.csv"
#filename = "nm.csv"
f = open(filename)
import pandas as pd
data = pd.read_csv(f)#, index_col=0)
print data['State']
Even when I use index_col I get the same error (unless it is 0). I have found that when I print the csv file that isn't working in my terminal, it is not separated into columns like the one that is; rather, the items in each row are printed consecutively, separated by spaces. I believe this incorrect separation is the problem.
I am using LibreOffice Calc on Ubuntu Linux. For the improperly formatted file (which appears in perfect format in LibreOffice) the terminal output is:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 3194 entries, 0 to 3193
Data columns:
FIPStxt State Area_name Rural-urban_Continuum Code_2003 Urban_Influence_Code_2003 Rural-urban_Continuum Code_20013 Urban_Influence_Code_20013 POVALL_2011 CI90LBAll_2011 CI90UBALL_2011 PCTPOVALL_2011 CI90LBALLP_2011 CI90UBALLP_2011 POV017_2011 CI90LB017_2011 CI90UB017_2011 PCTPOV017_2011 CI90LB017P_2011 CI90UB017P_2011 POV517_2011 CI90LB517_2011 CI90UB517_2011 PCTPOV517_2011 CI90LB517P_2011 CI90UB517P_2011 MEDHHINC_2011 CI90LBINC_2011 CI90UBINC_2011 POV05_2011 CI90LB05_2011 CI90UB05_2011 PCTPOV05_2011 CI90LB05P_2011 CI90UB05P_2011 3194 non-null values
dtypes: object(1)
The first few lines of the csv file are:
FIPStxt State Area_name Rural-urban_Continuum Code_2003
01000 AL Alabama
01001 AL Autauga County 2 2
01003 AL Baldwin County 4 5
The spaces are probably the problem. You need to tell pandas what separator to use when parsing the CSV.
data = pd.read_csv(f, sep=" ")
Problem is though, it will pick up all spaces as valid separators (e.g. Alabama County becomes 2 columns). The best would be to convert that one file to an actual comma (semicolon or other) separated file, or make sure that compound values are quoted ("Alabama County") and then specify the quotechar:
data = pd.read_csv(f, sep=" ", quotechar='"')
