Parsing a CSV file with commas inside values in pandas - python
I need to create a pandas.DataFrame from a csv file. For that I am using pandas.read_csv(...). The problem with this file is that one or more columns contain commas within the values (I don't control the file format).
I have been trying to implement the solution from this question, but I get the following error:
pandas.errors.EmptyDataError: No columns to parse from file
For some reason, after implementing this solution, the csv file I tried to fix is blank.
Here is the code I am using:
import csv
import os
import pandas as pd

# fix csv file
with open("/Users/username/works/test.csv", 'rb') as f, \
     open("/Users/username/works/test.csv", 'wb') as g:
    writer = csv.writer(g, delimiter=',')
    for line in f:
        row = line.split(',', 4)
        writer.writerow(row)

# Manipulate csv file
data = pd.read_csv(os.path.expanduser("/Users/username/works/test.csv"),
                   error_bad_lines=False)
Any ideas?
Data overview:
Id0  Id 1  Id 2  Country  Company  Title                                           Email
23   123   456   AR       name     cargador                                        email#email.com
24   123   456   AR       name     Executive assistant                             email#email.com
25   123   456   AR       name     Asistente Administrativo                        email#email.com
26   123   456   AR       name     Atención al cliente vía telefónica vía online   email#email.com
39   123   456   AR       name     Asesor de ventas                                email#email.com
40   123   456   AR       name     inc. International company representative       email#email.com
41   123   456   AR       name     Vendedor de campo                               email#email.com
42   123   456   AR       name     PUBLICIDAD ATENCIÓN AL CLIENTE                  email#email.com
43   123   456   AR       name     Asistente de Marketing                          email#email.com
44   123   456   AR       name     SOLDADOR                                        email#email.com
217  123   456   AR       name     Se requiere vendedores Loja Quevedo Guayas)     email#email.com
218  123   456   AR       name     Ing. Civil recién graduado Yaruquí              email#email.com
219  123   456   AR       name     ayudantes enfermeria                            email#email.com
220  123   456   AR       name     Trip Leader for International Youth Exchange    email#email.com
221  123   456   AR       name     COUNTRY MANAGER / DIRECTOR COMERCIAL            email#email.com
250  123   456   AR       name     Ayudante de Pasteleria                          email#email.com Asesor email#email.com email#email.com
Pre-parsed CSV:
#,Id 1,Id 2,Country,Company,Title,Email,,,,
23,123,456,AR,name,cargador,email#email.com,,,,
24,123,456,AR,name,Executive assistant,email#email.com,,,,
25,123,456,AR,name,Asistente Administrativo,email#email.com,,,,
26,123,456,AR,name,Atención al cliente vía telefónica , vía online,email#email.com,,,
39,123,456,AR,name,Asesor de ventas,email#email.com,,,,
40,123,456,AR,name, inc.,International company representative,email#email.com,,,
41,123,456,AR,name,Vendedor de campo,email#email.com,,,,
42,123,456,AR,name,PUBLICIDAD, ATENCIÓN AL CLIENTE,email#email.com,,,
43,123,456,AR,name,Asistente de Marketing,email#email.com,,,,
44,123,456,AR,name,SOLDADOR,email#email.com,,,,
217,123,456,AR,name,Se requiere vendedores,, Loja , Quevedo, Guayas),email#email.com
218,123,456,AR,name,Ing. Civil recién graduado, Yaruquí,email#email.com,,,
219,123,456,AR,name,ayudantes enfermeria,email#email.com,,,,
220,123,456,AR,name,Trip Leader for International Youth Exchange,email#email.com,,,,
221,123,456,AR,name,COUNTRY MANAGER / DIRECTOR COMERCIAL,email#email.com,,,,
250,123,456,AR,name,Ayudante de Pasteleria,email#email.com, Asesor,email#email.com,email#email.com,
251,123,456,AR,name,Ejecutiva de Ventas,email#email.com,,,,
If you can assume that, for the Company column, any errant commas are followed by spaces, and that all of the remaining errant commas fall in the column immediately before the email address, then a small parser can be written to handle this.
Code:
import csv
import re
VALID_EMAIL = re.compile(r'[^#]+#[^#]+\.[^#]+')
def read_my_csv(file_handle):
    # build csv reader
    reader = csv.reader(file_handle)

    # get the header, and find the e-mail and title columns
    header = next(reader)
    email_column = header.index('Email')
    title_column = header.index('Title')

    # yield the header up to the e-mail column
    yield header[:email_column + 1]

    # for each row, rebuild the columns
    for row in reader:
        # put the Company column back together
        while row[title_column].startswith(' '):
            row[title_column - 1] += ',' + row[title_column]
            del row[title_column]

        # put the Title column back together
        while not VALID_EMAIL.match(row[email_column]):
            row[email_column - 1] += ',' + row[email_column]
            del row[email_column]

        yield row[:email_column + 1]
Test Code:
import pandas as pd

with open("test.csv", 'r', newline='') as f:
    generator = read_my_csv(f)
    columns = next(generator)
    df = pd.DataFrame(generator, columns=columns)
    print(df)
Results:
      #  Id 1  Id 2 Country     Company  \
0    23   123   456      AR        name
1    24   123   456      AR        name
2    25   123   456      AR        name
3    26   123   456      AR        name
4    39   123   456      AR        name
5    40   123   456      AR  name, inc.
6    41   123   456      AR        name
7    42   123   456      AR        name
8    43   123   456      AR        name
9    44   123   456      AR        name
10  217   123   456      AR        name
11  218   123   456      AR        name
12  219   123   456      AR        name
13  220   123   456      AR        name
14  221   123   456      AR        name
15  250   123   456      AR        name
16  251   123   456      AR        name

                                               Title            Email
0                                           cargador  email#email.com
1                                Executive assistant  email#email.com
2                           Asistente Administrativo  email#email.com
3    Atención al cliente vía telefónica , vía online  email#email.com
4                                   Asesor de ventas  email#email.com
5               International company representative  email#email.com
6                                  Vendedor de campo  email#email.com
7                    PUBLICIDAD, ATENCIÓN AL CLIENTE  email#email.com
8                             Asistente de Marketing  email#email.com
9                                           SOLDADOR  email#email.com
10  Se requiere vendedores,, Loja , Quevedo, Guayas)  email#email.com
11               Ing. Civil recién graduado, Yaruquí  email#email.com
12                              ayudantes enfermeria  email#email.com
13      Trip Leader for International Youth Exchange  email#email.com
14              COUNTRY MANAGER / DIRECTOR COMERCIAL  email#email.com
15                            Ayudante de Pasteleria  email#email.com
16                               Ejecutiva de Ventas  email#email.com
Related
Problem concatenating two DataFrames and summing one column
Here is my problem: I have two dataframes. I only want to sum the column "Cantidad" because, as you can see, the other information is the same; that column is the only one that varies. (Samples below.)

First DF (fac):

        Tarifa  Precio  Cantidad  Importe $ Porcentaje
3       Vecina     155        87      13485      49.2%
2   Misma Zona     130        72       9360      40.7%
0      Alejada     229        17       3893       9.6%
1       Grande     250         1        250       0.6%

Second DF (fac2):

        Tarifa  Precio  Cantidad  Importe $ Porcentaje
2       Vecina     155        61       9455      55.5%
1   Misma Zona     130        40       5200      36.4%
0      Alejada     229         9       2061       8.2%

I tried this with no luck:

df_concat = pd.concat([fac, fac2], axis=0)
df_grouped = df_concat.groupby(["Tarifa", "Precio"]).agg({"Cantidad": "sum"}).reset_index()
# Sort the dataframe by the same columns used in the groupby
df_result = df_grouped.sort_values(["Tarifa", "Precio"])
# Show the result
print(df_result)

The result:

        Tarifa  Precio  Cantidad
2       Vecina     155        87
1   Misma Zona     130        72
0      Alejada     229        17

As you can see, there is no sum in the column "Cantidad". I hope you can help me! Best regards!
r = (pd.concat([df1, df2], ignore_index=True)
       .groupby('Tarifa')
       .agg({'Precio': 'first', 'Cantidad': sum})
    )
print(r)

            Precio  Cantidad
Tarifa
Alejada        229        26
Grande         250         1
Misma Zona     130       112
Vecina         155       148
Since only one column is variable, you can try doing:

df = pd.concat([fac, fac2], axis=0)[['Tarifa', 'Precio', 'Cantidad']]
df_result = df.groupby(['Tarifa', 'Precio']).sum().reset_index()
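For the record, run against the two sample frames this should produce one row per (Tarifa, Precio) pair with the summed Cantidad (e.g. Vecina/155 gives 148, matching the totals in the previous answer's output).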
How do I optimize Levenshtein distance calculation for all rows of a single column of a Pandas DataFrame?
I want to calculate the Levenshtein distance for all rows of a single column of a Pandas DataFrame. I am getting a MemoryError when I cross-join my DataFrame containing ~115,000 rows. In the end, I want to keep only those rows where the Levenshtein distance is either 1 or 2. Is there an optimized way to do this? Here's my brute-force approach:

import pandas as pd
from textdistance import levenshtein

# original df
df = pd.DataFrame({'Name': ['John', 'Jon', 'Ron'],
                   'Phone': [123, 456, 789],
                   'State': ['CA', 'GA', 'MA']})

# create another df containing all rows and a few columns needed for further checks
name = df['Name']
phone = df['Phone']
dic_ = {'Name_Match': name, 'Phone_Match': phone}
df_match = pd.DataFrame(dic_, index=range(len(name)))

# cross join df containing all columns with another df containing some of its columns
df['key'] = 1
df_match['key'] = 1
df_merged = pd.merge(df, df_match, on='key').drop(columns='key')

# compute the distance; keep only rows where distance = 1 or distance = 2
df_merged['distance'] = df_merged.apply(
    lambda x: levenshtein.distance(x['Name'], x['Name_Match']), axis=1)

Original DataFrame:

Out[1]:
   Name  Phone State
0  John    123    CA
1   Jon    456    GA
2   Ron    789    MA

New DataFrame from original DataFrame:

df_match
Out[2]:
  Name_Match  Phone_Match
0       John          123
1        Jon          456
2        Ron          789

Cross-join:

df_merged
Out[3]:
   Name  Phone State Name_Match  Phone_Match  distance
0  John    123    CA       John          123         0
1  John    123    CA        Jon          456         1
2  John    123    CA        Ron          789         2
3   Jon    456    GA       John          123         1
4   Jon    456    GA        Jon          456         0
5   Jon    456    GA        Ron          789         1
6   Ron    789    MA       John          123         2
7   Ron    789    MA        Jon          456         1
8   Ron    789    MA        Ron          789         0

Final output:

df_merged[(df_merged.distance == 1) | (df_merged.distance == 2)]
Out[4]:
   Name  Phone State Name_Match  Phone_Match  distance
1  John    123    CA        Jon          456         1
2  John    123    CA        Ron          789         2
3   Jon    456    GA       John          123         1
5   Jon    456    GA        Ron          789         1
6   Ron    789    MA       John          123         2
7   Ron    789    MA        Jon          456         1
Your problem is not related to Levenshtein distance; your main problem is that you are running out of device memory (RAM) while doing the operations (you can check this with the task manager on Windows, or the top or htop commands on Linux/Mac). One solution would be to partition your dataframe into smaller partitions before starting the apply operation, run it on each partition, and delete the results you don't need BEFORE processing the next partition. If you are running it in the cloud, you can get a machine with more memory instead. Bonus: I'd suggest you parallelize the apply operation using something like Pandarallel to make it way faster.
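To make the partitioning idea concrete, here is a minimal sketch (my addition, not from the original answer): it cross-joins one chunk at a time against the full name list and filters each chunk before moving to the next, so peak memory is roughly chunk_size * len(df) rows instead of len(df) squared. It assumes the df and textdistance setup from the question, and pandas >= 1.2 for how='cross'; close_matches and chunk_size are illustrative names.

import pandas as pd
from textdistance import levenshtein

def close_matches(df, chunk_size=10000):
    # the full list of names every chunk is compared against
    names = df['Name'].rename('Name_Match')
    kept = []
    for start in range(0, len(df), chunk_size):
        chunk = df.iloc[start:start + chunk_size]
        # cross-join only this chunk against the full name list
        merged = chunk.merge(names, how='cross')
        merged['distance'] = merged.apply(
            lambda x: levenshtein.distance(x['Name'], x['Name_Match']), axis=1)
        # keep only distances 1 or 2, discarding the rest BEFORE the next chunk
        kept.append(merged[merged['distance'].isin([1, 2])])
    return pd.concat(kept, ignore_index=True)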
Grouping Hierarchical Parent-Child data using Pandas DataFrame - Python
I have a data frame which I want to group based on the value of another column in the same data frame. The ID and Parent_Id columns are linked and define who is related to whom in a hierarchical tree. The dataframe looks like this (input from a csv file):

No  Name     ID   Parent_Id
1   Tom      211  111
2   Galie    209  111
3   Remo     200  101
4   Carmen   212  121
5   Alfred   111  191
6   Marvela  101  111
7   Armin    234  101
8   Boris    454  109
9   Katya    109  323

I would like to group this data frame based on ID and Parent_Id into the grouping below, and generate CSV files from it based on the top-level parent, i.e. Alfred.csv, Carmen.csv (which will have only its own entry, i.e. line #4) and Katya.csv, using the to_csv() function.

Alfred
 |_ Galie
 |_ Tom
 |_ Marvela
     |_ Remo
     |_ Armin
Carmen
Katya
 |_ Boris

And I want to create a new column in the same data frame that will have a tag indicating the hierarchy, like:

No  Name     ID   Parent_Id  Tag
1   Tom      211  111        Alfred
2   Galie    209  111        Alfred
3   Remo     200  101        Marvela, Alfred
4   Carmen   212  121
5   Alfred   111  191
6   Marvela  101  111        Alfred
7   Armin    234  101        Marvela, Alfred
8   Boris    454  109        Katya
9   Katya    109  323

Note that the names can repeat, but the ID is unique. Kindly let me know how to achieve this using pandas. I tried groupby(), but it seems a little complicated and I am not getting what I intend. There should be one file for each parent, with the child records in the parent's file. If a child has children of its own (like Marvela), it qualifies for its own csv file. The final output would be:

Alfred.csv - all records matching Galie, Tom, Marvela
Marvela.csv - all records matching Remo, Armin
Carmen.csv - only the record matching Carmen (its own row)
Katya.csv - all records matching Katya, Boris
I would write a recursive function to do this. First, create dictionaries of {id: name} and {id: parent}, plus the recursive function:

id_name_dict = dict(zip(df.ID, df.Name))
parent_dict = dict(zip(df.ID, df.Parent_Id))

def find_parent(x):
    value = parent_dict.get(x, None)
    if value is None:
        return ""
    else:
        # in case there is an id without a name
        if id_name_dict.get(value, None) is None:
            return "" + find_parent(value)
        return str(id_name_dict.get(value)) + ", " + find_parent(value)

Then create the new column with Series.apply and remove the trailing ', ' with Series.str.rstrip:

df['Tag'] = df.ID.apply(lambda x: find_parent(x)).str.rstrip(', ')
df

   No     Name   ID  Parent_Id              Tag
0   1      Tom  211        111           Alfred
1   2    Galie  209        111           Alfred
2   3     Remo  200        101  Marvela, Alfred
3   4   Carmen  212        121
4   5   Alfred  111        191
5   6  Marvela  101        111           Alfred
6   7    Armin  234        101  Marvela, Alfred
7   8    Boris  454        109            Katya
8   9    Katya  109        323
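The answer stops at the Tag column; for the per-parent CSV files the question also asks for, a rough sketch (my addition, not part of the original answer) could reuse that column. It assumes no name is a substring of another; matching on IDs would be more robust for real data:

# write one CSV per parent that has descendants (Alfred.csv, Marvela.csv, Katya.csv)
for parent in df['Name'].unique():
    descendants = df[df['Tag'].str.contains(parent, na=False)]
    if not descendants.empty:
        descendants.to_csv(parent + '.csv', index=False)

Single-node roots like Carmen have an empty Tag, so their own row would still need to be written separately.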
Simplifying categorical variables with python/pandas
I'm working with an airbnb dataset on Kaggle: https://www.kaggle.com/c/airbnb-recruiting-new-user-bookings and want to simplify the values of the language column into 2 groupings - english and non-english. For instance:

users.language.value_counts()

en    15011
zh      101
fr       99
de       53
es       53
ko       43
ru       21
it       20
ja       19
pt       14
sv       11
no        6
da        5
nl        4
el        2
pl        2
tr        2
cs        1
fi        1
is        1
hu        1
Name: language, dtype: int64

And the result I want is:

users.language.value_counts()

english        15011
non-english      459
Name: language, dtype: int64

This is sort of the solution I want:

def language_groupings():
    for i in users:
        if users.language != 'en':
            replace(users.language.str, 'non-english')
        else:
            replace(users.language.str, 'english')
    return users

users['language'] = users.apply(lambda row: language_groupings)

Except there's obviously something wrong with this, as it returns an empty series when I run value_counts on the column.
Try this (np here is numpy):

import numpy as np

users.language = np.where(users.language != 'en', 'non-english', 'english')
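A quick check after the replacement should reproduce the counts the question asks for:

users.language.value_counts()

english        15011
non-english      459
Name: language, dtype: int64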
Is that what you want?

In [181]: x
Out[181]:
      val
en  15011
zh    101
fr     99
de     53
es     53
ko     43
ru     21
it     20
ja     19
pt     14
sv     11
no      6
da      5
nl      4
el      2
pl      2
tr      2
cs      1
fi      1
is      1
hu      1

In [182]: x.groupby(x.index == 'en').sum()
Out[182]:
         val
False    459
True   15011
Ignore backslash when reading tsv file in python
I have a large sep="|" tsv with an address field that has a bunch of values like the following:

...xxx|yyy|Level 1 2 xxx Street\(MYCompany)|...

This ends up as:

line1) ...xxx|yyy|Level 1 2 xxx Street\
line2) (MYCompany)|...

I tried quoting=2 (to turn non-numeric values into strings) in read_table with Pandas, but it still treats the backslash as a new line. What is an efficient way to ignore rows with values in a field that contain backslash escapes to a new line - is there a way to ignore the new line for \? Ideally this would prepare the data file so it can be read into a dataframe in pandas.

Update: showing 5 lines with breakage on the 3rd line.

1788768|1831171|208434489|2014-08-14 13:40:02|108|c||Desktop|coupon|49 XXX Ave|Australia|Victoria|3025|Melbourne
1788772|1831177|202234489|2014-08-14 13:41:37|108|c||iOS||u7 38-46 South Street|Australia|New South Wales|2116|Sydney
1788776|1831182|205234489|2014-08-14 13:42:41|108|c||Desktop||Level XXX Margaret Street\
(My Company)|Australia|New South Wales|2000|Sydney|Sydney
1788780|1831186|202634489|2014-08-14 13:43:46|108|c||Desktop||Po box ZZZ|Australia|New South Wales|2444|NSW Other|Port Macquarie
Here is another solution using regex:

import pandas as pd
import re

with open('input.tsv') as f:
    fl = f.read()

# Join lines broken by a trailing backslash:
# replace '\' + newline with just '\'
fl = re.sub(r'\\\n', r'\\', fl)

with open('input_fix.tsv', 'w') as o:
    o.write(fl)

# Prime the number of columns by specifying a name for each column;
# this takes care of the issue of a variable number of columns
cols = range(1, 17)
df = pd.read_csv('input_fix.tsv', sep='|', names=cols)

This rejoins the broken rows before pandas parses the file.
I think you can first try read_csv with a sep which is NOT in the values, and it seems that it then reads correctly:

import pandas as pd
import io

temp = u"""
49 XXX Ave|Australia
u7 38-46 South Street|Australia
XXX Margaret Street\
New South Wales|Australia
Po box ZZZ|Australia"""

# after testing, replace io.StringIO(temp) with the filename
df = pd.read_csv(io.StringIO(temp), sep="^", header=None)
print(df)

                                               0
0                           49 XXX Ave|Australia
1                u7 38-46 South Street|Australia
2  XXX Margaret StreetNew South Wales|Australia
3                           Po box ZZZ|Australia

Then you can create a new file with to_csv and read it back with read_csv and sep="|":

df.to_csv('myfile.csv', header=False, index=False)
print(pd.read_csv('myfile.csv', sep="|", header=None))

                                     0          1
0                           49 XXX Ave  Australia
1                u7 38-46 South Street  Australia
2  XXX Margaret StreetNew South Wales   Australia
3                           Po box ZZZ  Australia

The next solution does not create a new file, but writes to the variable output and then uses read_csv with io.StringIO (reusing temp from above):

df = pd.read_csv(io.StringIO(temp), sep=";", header=None)

output = df.to_csv(header=False, index=False)
print(output)

49 XXX Ave|Australia
u7 38-46 South Street|Australia
XXX Margaret StreetNew South Wales|Australia
Po box ZZZ|Australia

print(pd.read_csv(io.StringIO(u"" + output), sep="|", header=None))

                                     0          1
0                           49 XXX Ave  Australia
1                u7 38-46 South Street  Australia
2  XXX Margaret StreetNew South Wales   Australia
3                           Po box ZZZ  Australia

If I test it on your data, it seems that rows 1 and 2 have 14 fields while the next two have 15, so I removed the last item from rows 3 and 4; maybe this is only a typo (I hope so):

import pandas as pd
import io

temp = u"""1788768|1831171|208434489|2014-08-14 13:40:02|108|c||Desktop|coupon|49 XXX Ave|Australia|Victoria|3025|Melbourne
1788772|1831177|202234489|2014-08-14 13:41:37|108|c||iOS||u7 38-46 South Street|Australia|New South Wales|2116|Sydney
1788776|1831182|205234489|2014-08-14 13:42:41|108|c||Desktop||Level XXX Margaret Street\
(My Company)|Australia|New South Wales|2000|Sydney
1788780|1831186|202634489|2014-08-14 13:43:46|108|c||Desktop||Po box ZZZ|Australia|New South Wales|2444|NSW Other"""

# after testing, replace io.StringIO(temp) with the filename
df = pd.read_csv(io.StringIO(temp), sep=";", header=None)
print(df)

                                                   0
0  1788768|1831171|208434489|2014-08-14 13:40:02|...
1  1788772|1831177|202234489|2014-08-14 13:41:37|...
2  1788776|1831182|205234489|2014-08-14 13:42:41|...
3  1788780|1831186|202634489|2014-08-14 13:43:46|...

output = df.to_csv(header=False, index=False)
print(pd.read_csv(io.StringIO(u"" + output), sep="|", header=None))

         0        1          2                    3    4  5    6        7  \
0  1788768  1831171  208434489  2014-08-14 13:40:02  108  c  NaN  Desktop
1  1788772  1831177  202234489  2014-08-14 13:41:37  108  c  NaN      iOS
2  1788776  1831182  205234489  2014-08-14 13:42:41  108  c  NaN  Desktop
3  1788780  1831186  202634489  2014-08-14 13:43:46  108  c  NaN  Desktop

        8                                       9         10               11  \
0  coupon                              49 XXX Ave  Australia         Victoria
1     NaN                   u7 38-46 South Street  Australia  New South Wales
2     NaN  Level XXX Margaret Street(My Company)  Australia  New South Wales
3     NaN                              Po box ZZZ  Australia  New South Wales

     12         13
0  3025  Melbourne
1  2116     Sydney
2  2000     Sydney
3  2444  NSW Other

But if the data are correct, add the parameter names=range(15) to read_csv:

print(pd.read_csv(io.StringIO(u"" + output), sep="|", names=range(15)))

         0        1          2                    3    4  5    6        7  \
0  1788768  1831171  208434489  2014-08-14 13:40:02  108  c  NaN  Desktop
1  1788772  1831177  202234489  2014-08-14 13:41:37  108  c  NaN      iOS
2  1788776  1831182  205234489  2014-08-14 13:42:41  108  c  NaN  Desktop
3  1788780  1831186  202634489  2014-08-14 13:43:46  108  c  NaN  Desktop

        8                                       9         10               11  \
0  coupon                              49 XXX Ave  Australia         Victoria
1     NaN                   u7 38-46 South Street  Australia  New South Wales
2     NaN  Level XXX Margaret Street(My Company)  Australia  New South Wales
3     NaN                              Po box ZZZ  Australia  New South Wales

     12         13              14
0  3025  Melbourne             NaN
1  2116     Sydney             NaN
2  2000     Sydney          Sydney
3  2444  NSW Other  Port Macquarie