I have a .csv file whose rows have varying numbers of columns.
import pandas as pd
df = pd.read_csv(infile, header=None)
returns the
ParserError: Error tokenizing data. C error: Expected 6 fields in line 8, saw 8
error. I know I can use the
names=my_cols
option in the read_csv call, but surely there has to be something more 'pythonic' than that? Also, this is not a duplicate question, since
error_bad_lines=False
causes lines to be skipped (which is not desired). The .csv looks like:
Anne,Beth,Caroline,Ernie,Frank,Hannah
Beth,Caroline,David,Ernie
Caroline,Hannah
David,,Anne,Beth,Caroline,Ernie
Ernie,Anne,Beth,Frank,George
Frank,Anne,Caroline,Hannah
George,
Hannah,Anne,Beth,Caroline,David,Ernie,Frank,George
OK, somewhat inspired by this related question: Pandas variable numbers of columns to binary matrix.
Read in the csv but override the separator to a tab, so the parser doesn't try to split the lines on the commas:
In[7]:
import pandas as pd
import io
t="""Anne,Beth,Caroline,Ernie,Frank,Hannah
Beth,Caroline,David,Ernie
Caroline,Hannah
David,,Anne,Beth,Caroline,Ernie
Ernie,Anne,Beth,Frank,George
Frank,Anne,Caroline,Hannah
George,
Hannah,Anne,Beth,Caroline,David,Ernie,Frank,George"""
df = pd.read_csv(io.StringIO(t), sep='\t', header=None)
df
Out[7]:
0
0 Anne,Beth,Caroline,Ernie,Frank,Hannah
1 Beth,Caroline,David,Ernie
2 Caroline,Hannah
3 David,,Anne,Beth,Caroline,Ernie
4 Ernie,Anne,Beth,Frank,George
5 Frank,Anne,Caroline,Hannah
6 George,
7 Hannah,Anne,Beth,Caroline,David,Ernie,Frank,Ge...
We can now use str.split with expand=True to expand the names into their own columns:
In[8]:
df[0].str.split(',', expand=True)
Out[8]:
0 1 2 3 4 5 6 7
0 Anne Beth Caroline Ernie Frank Hannah None None
1 Beth Caroline David Ernie None None None None
2 Caroline Hannah None None None None None None
3 David Anne Beth Caroline Ernie None None
4 Ernie Anne Beth Frank George None None None
5 Frank Anne Caroline Hannah None None None None
6 George None None None None None None
7 Hannah Anne Beth Caroline David Ernie Frank George
So, just to be clear, modify your read_csv line to this:
df = pd.read_csv(infile, header=None, sep='\t')
and then do the str.split as above.
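Putting it all together (a sketch, with infile standing in for your path):
import pandas as pd

# Read each line into a single cell by using a separator that never
# occurs in the data, then split the names out into real columns
df = pd.read_csv(infile, header=None, sep='\t')
df = df[0].str.split(',', expand=True)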
You can do some manipulation of the csv before handing it to pandas.
# load data into a list of lines
with open('new_data.txt', 'r') as fil:
    data = fil.readlines()

# remove line breaks from string entries
data = [x.replace('\r\n', '') for x in data]
data = [x.replace('\n', '') for x in data]

# find the maximum number of separators on any line
total_cols = max([x.count(',') for x in data])

# pad each line with ',' until every line has the same number of fields
new_data = [x + ',' * (total_cols - x.count(',')) for x in data]

# save data
with open('save_data.txt', 'w') as outp:
    outp.write('\n'.join(new_data))

# read it in as you did
pd.read_csv('save_data.txt', header=None)
This is some rough Python, but it should work. I'll clean it up when I have time.
Or use the other answer; it's neat as it is.
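Another alternative, as a hedged sketch: if you know (or can compute) an upper bound on the number of fields, supplying that many column names up front stops the tokenizer from erroring on wide lines, with no padding needed:
import pandas as pd

# 8 is the widest row in the sample data above; any upper bound works
df = pd.read_csv('new_data.txt', header=None, names=range(8))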
I have a txt file that looks like this:
1000 lewis hamilton 36
1001 sebastian vettel 34
1002 lando norris 21
I want them to look like this.
I tried the solution in here, but it gave me a blank Excel file and an error when trying to open it.
There are more than one million lines, and each line contains around 10 columns.
And one last thing: I am not 100% sure they are tab-delimited, because some columns look like they have more space between them than others, but when I press backspace once they stick to each other, so I guess they are.
You can use pandas read_csv to read your txt file and then save it as an Excel file with .to_excel:
df = pd.read_csv('your_file.txt', delim_whitespace=True)
df.to_excel('your_file.xlsx', index=False)
Here is some documentation:
pandas.read_csv : https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html
.to_excel : https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_excel.html
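One caveat, since you mention more than a million lines: an .xlsx sheet holds at most 1,048,576 rows, so a single to_excel call can fail or truncate. A hedged sketch that splits the output across sheets (the chunk size and sheet names are my own choices):
import pandas as pd

MAX_ROWS = 1_000_000  # stay under Excel's 1,048,576-row sheet limit

df = pd.read_csv('your_file.txt', sep=r'\s+')
with pd.ExcelWriter('your_file.xlsx') as writer:
    # Write the frame in million-row slices, one sheet per slice
    for i, start in enumerate(range(0, len(df), MAX_ROWS)):
        df.iloc[start:start + MAX_ROWS].to_excel(
            writer, sheet_name=f'part_{i}', index=False)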
If you're not sure how the fields are separated, you can use the regex "\s+" to split on any whitespace.
import pandas as pd
df = pd.read_csv('f1.txt', sep=r"\s+", header=None)
# you might need: pip install openpyxl
df.to_excel('f1.xlsx', 'Sheet1')
Example of randomly separated fields (f1.txt):
1000 lewis hamilton 2 36
1001 sebastian vettel 8 34
1002 lando norris 6 21
If some lines have more columns than the first one, causing:
ParserError: Error tokenizing data. C error: Expected 5 fields in line 5, saw 6
You can ignore those by using:
df = pd.read_csv('f1.txt', sep=r"\s+", header=None, error_bad_lines=False)
This is an example of data:
1000 lewis hamilton 2 36
1001 sebastian vettel 8 34
1002 lando norris 6 21
1003 charles leclerc 1 3
1004 carlos sainz ferrari 2 2
The last line will be ignored:
b'Skipping line 5: expected 5 fields, saw 6\n'
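Note that error_bad_lines was deprecated in pandas 1.3 and removed in 2.0; on recent versions the equivalent call is:
df = pd.read_csv('f1.txt', sep=r"\s+", header=None, on_bad_lines='skip')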
import numpy as np
import pandas as pd
df = pd.read_csv("data.csv")
pd.pivot_table(df, index='Employee ID', values=['Member ID', 'Firstname', 'Lastname'], aggfunc='first')
The format seems to work, but only for one value. How do I display everything?
Any help is appreciated.
You can use set_index() and unstack(), but you will need to fix the columns, e.g.:
In []:
df = pd.read_csv("data.csv")
df['ID'] = df['MemberID'] # Copy because you want it in the values too
df = df.set_index(['EmployeeID', 'MemberID']).unstack(level=1, fill_value='').sort_index(level=1, axis=1)
df.columns = df.columns.to_series().apply(lambda x: 'Member{}{}'.format(x[1], x[0]))
print(df)
Out[]:
Member1ID Member1Lastname Member1firstname Member2ID Member2Lastname Member2firstname Member3ID Member3Lastname Member3firstname
EmployeeID
1 1 Ann Anu 2 Ann Aju 3 vAnn Abi
2 1 John Cini 2 John Biju
3 1 Peter Mathew 2 Peter Joseph
But I feel you can simplify: if you really don't need MemberID in the values (you have it in the column name), or if you don't mind a MultiIndex, then:
In []:
df.set_index(['EmployeeID', 'MemberID']).unstack(level=1, fill_value='').swaplevel(axis=1).sort_index(axis=1)
Out[]:
MemberID 1 2 3
Lastname firstname Lastname firstname Lastname firstname
EmployeeID
1 Ann Anu Ann Aju Ann Abi
2 John Cini John Biju
3 Peter Mathew Peter Joseph
You can use pandas' pivot_table:
df = df.pivot_table(index=['EmployeeID'],
                    columns=['MemberID', 'firstname', 'lastname'])
To install pandas, use pip install pandas. Then first make a DataFrame object with read_csv(), and use the method above to convert.
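For completeness, a hedged sketch of the pivot_table route on made-up data (the column names EmployeeID/MemberID/firstname/Lastname are assumptions borrowed from the other answer):
import pandas as pd

# Hypothetical input shaped like the question's data
df = pd.DataFrame({
    'EmployeeID': [1, 1, 2],
    'MemberID':   [1, 2, 1],
    'firstname':  ['Anu', 'Aju', 'Cini'],
    'Lastname':   ['Ann', 'Ann', 'John'],
})

# One row per employee, one column per (field, member) pair
wide = df.pivot_table(index='EmployeeID', columns='MemberID',
                      values=['firstname', 'Lastname'], aggfunc='first')

# Flatten the MultiIndex columns into names like 'Member1firstname'
wide.columns = [f'Member{m}{field}' for field, m in wide.columns]
print(wide)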
Newer programmer here, deeply appreciate any help this knowledgeable community is willing to provide.
I have a column of 140,000 text strings (company names) in a pandas dataframe on which I want to strip all whitespace in and around the strings, remove all punctuation, substitute specific substrings, and uniformly transform to lowercase. I then want to take the first 10 characters of each string and store them in a new dataframe column.
Here is a reproducible example.
import string
import pandas as pd
data = ["West Georgia Co",
"W.B. Carell Clockmakers",
"Spine & Orthopedic LLC",
"LRHS Saint Jose's Grocery",
"Optitech#NYCityScape"]
df = pd.DataFrame(data, columns = ['co_name'])
def remove_punctuations(text):
    for punctuation in string.punctuation:
        text = text.replace(punctuation, '')
    return text
# applying remove_punctuations function
df['co_name_transform'] = df['co_name'].apply(remove_punctuations)
# this next step replaces 'Saint' with 'st' to standardize,
# and I may want to make other substitutions but this is a common one.
df['co_name_transform'] = df.co_name_transform.str.replace('Saint', 'st')
# replace whitespace
df['co_name_transform'] = df.co_name_transform.str.replace(' ', '')
# make lowercase
df['co_name_transform'] = df.co_name_transform.str.lower()
# select first 0:10 of strings
df['co_name_transform'] = df.co_name_transform.str[0:10]
print(df)
co_name co_name_transform
0 West Georgia Co westgeorgi
1 W.B. Carell Clockmakers wbcarellcl
2 Spine & Orthopedic LLC spineortho
3 LRHS Saint Jose's Grocery lrhsstjose
4 Optitech#NYCityScape optitechny
How can I put all these steps into a single function like this?
def clean_text(df[col]):
for co in co_name:
do_all_the_steps
return df[new_col]
Thank you
You don't need a function to do this. Try the following one-liner (note that in pandas 1.4+ you must pass regex=True for the pattern to be treated as a regular expression):
df['co_name_transform'] = df['co_name'].str.replace('[^A-Za-z0-9-]+', '', regex=True).str.replace('Saint', 'st').str.lower().str[0:10]
Final output will be.
co_name co_name_transform
0 West Georgia Co westgeorgi
1 W.B. Carell Clockmakers wbcarellcl
2 Spine & Orthopedic LLC spineortho
3 LRHS Saint Jose's Grocery lrhsstjose
4 Optitech#NYCityScape optitechny
You can do all the steps in the function you pass to the apply method:
import re
df['co_name_transform'] = df['co_name'].apply(lambda s: re.sub(r'[\W_]+', '', s).replace('Saint', 'st').lower()[:10])
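And if you'd rather wrap it all in a reusable function, as your sketch suggests, a minimal version (the function and argument names are illustrative) could be:
import pandas as pd

def clean_text(df, col, new_col, n=10):
    # Strip punctuation and whitespace, substitute, lowercase, keep first n characters
    df[new_col] = (df[col]
                   .str.replace(r'[\W_]+', '', regex=True)
                   .str.replace('Saint', 'st')
                   .str.lower()
                   .str[:n])
    return df

df = clean_text(df, 'co_name', 'co_name_transform')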
Another solution, similar to the previous one, but with the "to_replace" pairs in one dictionary, so you can add more items to replace. Also, the previous solution truncates each string to 10 characters instead of selecting the first 10 rows.
data = ["West Georgia Co",
"W.B. Carell Clockmakers",
"Spine & Orthopedic LLC",
"LRHS Saint Jose's Grocery",
"Optitech#NYCityScape","Optitech#NYCityScape","Optitech#NYCityScape","Optitech#NYCityScape","Optitech#NYCityScape","Optitech#NYCityScape","Optitech#NYCityScape","Optitech#NYCityScape","Optitech#NYCityScape"]
df = pd.DataFrame(data, columns = ['co_name'])
to_replace = {'[^A-Za-z0-9-]+': '', 'Saint': 'st'}
for i in to_replace:
    df['co_name'] = df['co_name'].str.replace(i, to_replace[i], regex=True).str.lower()
df['co_name'][0:10]
Result :
0 westgeorgiaco
1 wbcarellclockmakers
2 spineorthopedicllc
3 lrhssaintjosesgrocery
4 optitechnycityscape
5 optitechnycityscape
6 optitechnycityscape
7 optitechnycityscape
8 optitechnycityscape
9 optitechnycityscape
Name: co_name, dtype: object
Previous solution (truncates each string to 10 characters instead of selecting the first 10 rows):
df['co_name_transform'] = df['co_name'].str.replace('[^A-Za-z0-9-]+', '', regex=True).str.replace('Saint', 'st').str.lower().str[0:10]
Result :
0 westgeorgi
1 wbcarellcl
2 spineortho
3 lrhssaintj
4 optitechny
5 optitechny
6 optitechny
7 optitechny
8 optitechny
9 optitechny
10 optitechny
11 optitechny
12 optitechny
Name: co_name_transform, dtype: object
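A hedged variation on the same idea: Series.replace accepts a whole {pattern: replacement} dictionary when regex=True, so every substitution happens in one pass before lowercasing and truncation:
import pandas as pd

to_replace = {'[^A-Za-z0-9-]+': '', 'Saint': 'st'}

# All dictionary substitutions are applied in a single pass
df['co_name_transform'] = (df['co_name']
                           .replace(to_replace, regex=True)
                           .str.lower()
                           .str[:10])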
So I am reading in a bunch of .xlsx files from a directory, and I want to transform them into a table.
Easy, but I am running into the problem that these Excel files do not have the same headers. How can I write code that checks each Excel file's headers and either appends it to a table that has the same columns, or creates a new one if that set of columns does not exist yet?
My Code:
import sqlite3 as sql
import pandas as pd
import os
def obtain_data(filename, db):
    connect = sql.connect('filepath.sqlite3')
    workbook = pd.ExcelFile('filepath' + filename)
    df = workbook.parse('Sheet1')
    new_db = db.append(df)
    print(new_db)
    new_db = new_db.rename(columns={'INDEX': 'INDX'})
    connect.close()
    return new_db
usable_files = []
for filename in os.listdir('filepath'):
    if filename.endswith(".xlsx"):
        print(filename)
        usable_files.append(filename)
    else:
        print('no')
        print(filename)
new_db = pd.DataFrame()
for file in usable_files:
    new_db = new_db.append(obtain_data(file, new_db))
Note, I do not know in advance whether an Excel file will have a matching set of columns or not. Thanks in advance.
UPDATE:
import hashlib

def obtain_data(filename, connect, data):
    workbook = pd.ExcelFile('filepath' + filename)
    df = workbook.parse('Sheet1')
    df = df.rename(columns={'INDEX': 'INDX'})
    headers = df.dtypes.index
    header_list = str(headers.tolist())
    header_list = ''.join(header_list)
    hash_t = str(hashlib.md5(header_list.encode('utf-8')).hexdigest())
    if hash_t not in hash_list:
        x = pd.DataFrame(df)
        print(x.name)
        x.name = hash_t
        print(x.name)
        hash_list.append(hash_t)
        data_frames = data.append(x)
        connect.close()
    elif hash_t in hash_list:
        print('hash is repeating. Find a way to make this code get table name.')
    print(filename + ' has been completed successfully.')
    final_results = {'df': df, 'hash_t': hash_t}
    return final_results
usable_files = []
for filename in os.listdir('filepath'):
    if filename.endswith(".xlsx"):
        usable_files.append(filename)
    else:
        print('cool')

hash_list = []
data_frames = []
new_db = pd.DataFrame()
for file in usable_files:
    connect = sql.connect('filepath_test.sqlite3')
    x = new_db.append(obtain_data(file, connect, data_frames),
                      ignore_index=True)
    if x['hash_t'] not in hash_list:
        new_db = new_db.append(x['df'])
        new_db.append(x['hash_t'])
    else:
        new_db = new_db.append(x['df'])
    print(new_db)
    connect.commit()
    connect.close()
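(One hedged side note on the code above, reusing its names: DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so on current versions the accumulation loop would collect frames in a list and concatenate once:)
import pandas as pd

# Collect one DataFrame per file, then concatenate once at the end
frames = [obtain_data(file, connect, data_frames)['df']
          for file in usable_files]
new_db = pd.concat(frames, ignore_index=True)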
Not sure if this is exactly what you're looking for, but have a look. If your dataframes have common column names, they're merged together, resulting in a new dataframe with all the columns from both dataframes, and any overlapping entry names are combined into a single row (I'm not sure if this is what you want). EDIT: For an example of this, see how the two Toms are combined in the output.
If the two dataframes don't have any columns in common, they're simply concatenated, resulting in a dataframe with the columns from both dataframes, but with no merging of overlapping entry names.
I've included a (pretty long) printout to make it clearer what's going on.
import pandas as pd
def merge_dataframes(merge_this_df, with_this_df):
    print("-----------------------------------------------------")
    print("Merging this:")
    print(merge_this_df)
    print("\nWith this:")
    print(with_this_df)
    print("\nResult:")
    # Check if they have common columns
    any_common_columns = any([column_name in merge_this_df.columns for column_name in with_this_df.columns])
    if any_common_columns:
        merged_df = merge_this_df.merge(with_this_df, how="outer")
        print(merged_df)
        return merged_df
    else:
        concatenated_df = pd.concat([merge_this_df, with_this_df])
        print(concatenated_df)
        return concatenated_df
# Create some dummy data
df = pd.DataFrame({
    "name": ["Tom", "David", "Helen"],
    "age": ["30", "40", "50"]
})
df2 = pd.DataFrame({
    "name": ["Tom", "Juan", "Maria", "Julia"],
    "occupation": ["Plumber", "Chef", "Astronaut", "Teacher"],
})
df3 = pd.DataFrame({
    "animal": ["Cat", "Platypus"],
    "food": ["Catfoot", "Platypus-food"]
})

# Collect all dummy data in a list
all_dfs = [df, df2, df3]

# Merge or concatenate all dataframes into a single dataframe
final_df = pd.DataFrame()
for dataframe in all_dfs:
    final_df = merge_dataframes(final_df, dataframe)
Printout:
-----------------------------------------------------
Merging this:
Empty DataFrame
Columns: []
Index: []
With this:
age name
0 30 Tom
1 40 David
2 50 Helen
Result:
age name
0 30 Tom
1 40 David
2 50 Helen
-----------------------------------------------------
Merging this:
age name
0 30 Tom
1 40 David
2 50 Helen
With this:
name occupation
0 Tom Plumber
1 Juan Chef
2 Maria Astronaut
3 Julia Teacher
Result:
age name occupation
0 30 Tom Plumber
1 40 David NaN
2 50 Helen NaN
3 NaN Juan Chef
4 NaN Maria Astronaut
5 NaN Julia Teacher
-----------------------------------------------------
Merging this:
age name occupation
0 30 Tom Plumber
1 40 David NaN
2 50 Helen NaN
3 NaN Juan Chef
4 NaN Maria Astronaut
5 NaN Julia Teacher
With this:
animal food
0 Cat Catfoot
1 Platypus Platypus-food
Result:
age animal food name occupation
0 30 NaN NaN Tom Plumber
1 40 NaN NaN David NaN
2 50 NaN NaN Helen NaN
3 NaN NaN NaN Juan Chef
4 NaN NaN NaN Maria Astronaut
5 NaN NaN NaN Julia Teacher
0 NaN Cat Catfoot NaN NaN
1 NaN Platypus Platypus-food NaN NaN
EDIT2: Another approach: read the sqlite database into a pandas dataframe -> fix the column-related stuff -> write the pandas dataframe back into the sqlite database (overwriting the previous one):
import sqlite3 as sql
import pandas as pd
import os
def obtain_data(df_to_add):
    # Connect to database
    connect = sql.connect("my_database.sqlite")
    print("--------------------------------------")
    # Read current database into a dataframe
    try:
        current_df = pd.read_sql_query("SELECT * FROM my_database", connect)
        print("Database currently looks like:")
        print(current_df)
        # Now, we check if we have overlapping column names in our database and our dataframe
        if any([c in current_df.columns for c in df_to_add.columns]):
            # If we do, we can merge them
            new_df = current_df.merge(df_to_add, how="outer")
        else:
            # If there are no common columns, we just concatenate them
            new_df = pd.concat([current_df, df_to_add])
        # Now, we simply overwrite the DB with our current dataframe
        print("Adding to database")
        new_df.to_sql("my_database", connect, if_exists="replace", index=False)
        # For good measure, read the database again and print it out
        database_df = pd.read_sql_query("SELECT * FROM my_database", connect)
        print("Database now looks like:")
        print(database_df)
        connect.close()
    except pd.io.sql.DatabaseError:
        # There's no table called my_database yet, so simply insert our dataframe
        print("Creating initial database named my_database")
        df_to_add.to_sql("my_database", connect, index=False)
        print("Current database:")
        print(df_to_add)
        # We're done here
        connect.close()
    return
# Create some dummy data
df1 = pd.DataFrame({
    "name": ["Tom", "David", "Helen"],
    "age": ["30", "40", "50"]
})
df2 = pd.DataFrame({
    "name": ["Tom", "Juan", "Maria", "Julia"],
    "occupation": ["Plumber", "Chef", "Astronaut", "Teacher"],
})
df3 = pd.DataFrame({
    "animal": ["Cat", "Platypus"],
    "food": ["Catfoot", "Platypus-food"]
})

# Read all dummy data into the database
for df in [df1, df2, df3]:
    obtain_data(df)
And the output:
--------------------------------------
Creating initial database named my_database
Current database:
age name
0 30 Tom
1 40 David
2 50 Helen
--------------------------------------
Database currently looks like:
age name
0 30 Tom
1 40 David
2 50 Helen
Adding to database
Database now looks like:
age name occupation
0 30 Tom Plumber
1 40 David None
2 50 Helen None
3 None Juan Chef
4 None Maria Astronaut
5 None Julia Teacher
--------------------------------------
Database currently looks like:
age name occupation
0 30 Tom Plumber
1 40 David None
2 50 Helen None
3 None Juan Chef
4 None Maria Astronaut
5 None Julia Teacher
Adding to database
Database now looks like:
age animal food name occupation
0 30 None None Tom Plumber
1 40 None None David None
2 50 None None Helen None
3 None None None Juan Chef
4 None None None Maria Astronaut
5 None None None Julia Teacher
6 None Cat Catfoot None None
7 None Platypus Platypus-food None None
Let me know if this isn't what you're looking for.
I've been searching for this for a while but still can't figure it out. I'd appreciate it if you could provide some help.
I have an excel file:
, John, James, Joan,
, Smith, Smith, Smith,
Index1, 234, 432, 324,
Index2, 2987, 234, 4354,
I'd like to read it into a dataframe, such that
"John Smith, James Smith, Joan Smith" is my header.
I've tried the following, but my header is still "John, James, Joan":
xl = pd.ExcelFile(myfile, header=None)
row = df.apply(lambda x: str(x.iloc[0]) + str(x.iloc[1]))
df.append(row,ignore_index=True)
nrow = df.shape[0]
df = pd.concat([df.ix[nrow:], df.ix[2:nrow-1]])
Maybe it's easier to do it by hand:
>>> import itertools
>>> xl = pd.ExcelFile(myfile)
>>> sh = xl.book.sheet_by_index(0)
>>> rows = (sh.row_values(i) for i in range(sh.nrows))
>>> hd = list(zip(*itertools.islice(rows, 2)))[1:]  # read first two rows
>>> df = pd.DataFrame(rows)  # create DataFrame from remaining rows
>>> df = df.set_index(0)
>>> df.columns = [' '.join(x) for x in hd]  # rename columns
>>> df
John Smith James Smith Joan Smith
0
Index1 234 432 324
Index2 2987 234 4354
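A simpler modern route (a sketch, assuming a recent pandas with an Excel engine such as openpyxl installed) is to let read_excel consume the first two rows as a header and join the levels yourself:
import pandas as pd

# header=[0, 1] reads the first two rows into a column MultiIndex;
# index_col=0 uses the Index1/Index2 column as the row index
df = pd.read_excel(myfile, header=[0, 1], index_col=0)

# Collapse ('John', 'Smith') -> 'John Smith'
df.columns = [' '.join(col) for col in df.columns]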
You can keep the two levels separate if you want. This might be useful if you want to filter columns based on last name alone, for instance. Otherwise, the other solution(s) certainly work better than this seems to.
Normally this works for me:
In [103]: txt = '''John,James,Joan
...: Smith,Smith,Smith
...: 234,432,324
...: 2987,234,4354
...: '''
In [104]: x = pandas.read_csv(StringIO(txt), header=[0,1])
...: x.columns = pandas.MultiIndex.from_tuples(x.columns.tolist())
...: x
...:
But for some reason, that's missing the first row :/
In [105]: x
Out[105]:
John James Joan
Smith Smith Smith
0 2987 234 4354
I'll check in with the pandas mailing list to see if that's a bug.
I worked around it by converting the Excel file to a csv file and doing the following (note .ix was removed in pandas 1.0; .iloc does the same here):
df = pd.read_csv(myfile, header=None)
header = df.apply(lambda x: str(x.iloc[0]) + ' ' + str(x.iloc[1]))
df = df[2:]
df.columns = header
Here's the output:
Out[252]:
John Smith James Smith Joan Smith
2 234 432 324
3 3453 2342 563
However, when I read it in via pd.ExcelFile (and parse the specific sheet I'm interested in), there is a similar issue to the one #Paul H had. It seems the Excel reader considers the first row as column names by default and returns me something like:
Smith 234 Smith 432 Smith 324
3 3453 2342 563
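For the ExcelFile route specifically, passing the header rows explicitly to parse should avoid that first-row-as-header behavior (a sketch, assuming the sheet layout shown above):
xl = pd.ExcelFile(myfile)
df = xl.parse(xl.sheet_names[0], header=[0, 1], index_col=0)
df.columns = [' '.join(col) for col in df.columns]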