I have data consisting of 3004 rows without a header, and each row has a different number of fields (e.g. rows 1, 2, 3 and 4 have 16, 17, 21 and 12 fields, respectively). Here is the code I use to read the file:
df = pd.read_csv(file,'rb', delimiter ='\t', engine='python')
Here is the output:
$GPRMC,160330.40,A,1341.,N,10020.,E,0.006,,150517,,,A*7D
$GPGGA,160330.40,1341.,N,10020.,E,1,..
$PUBX,00,160330.40,1341.,N,10020.,E,...
$PUBX,03,20,2,-,056,40,,000,5,U,014,39,41,026,...
$PUBX,04,160330.40,150517,144210.39,1949,18,-6...
ÿ$GPRMC,160330.60,A,1341.,N,10020.,E...
$GPGGA,160330.60,1341.,N,10020.,E,1,...
It seems the delimiter didn't work at all to separate the data into columns. Hence, I tried passing column names based on the number of fields in the ($PUBX, 00) messages. Here is the code with the columns added:
my_cols = ['MSG type', 'ID MSG', 'UTC','LAT', 'N/S', 'LONG', 'E/W', 'Alt', 'Status','hAcc', 'vAcc','SOG', 'COG', 'VD','HDOP', 'VDOP', 'TDOP', 'Svs', 'reserved', 'DR', 'CS', '<CR><LF>']
df = pd.read_csv(file, 'rb', header = None, na_filter = False, engine = 'python', index_col=False, names=my_cols)
The result looks like the picture below: the whole file ends up in a single column, 'MSG type'.
My goal, once this file reads correctly, is to keep only the rows combining $PUBX,00,... with one column of $PUBX,04,... and write them to CSV. But I am still struggling to separate the file into columns. Please advise me on this matter. Thank you very much.
pd.read_csv
is used for reading CSV (comma-separated values) files, so you don't need to specify a delimiter.
If you want to read a TSV (tab-separated values) file, you can use:
pd.read_table(filepath)
The default separator is a tab.
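A minimal sketch of the equivalence (the file contents here are hypothetical):

```python
import io
import pandas as pd

tsv = "a\tb\n1\t2\n3\t4\n"  # hypothetical tab-separated content

df1 = pd.read_table(io.StringIO(tsv))          # default sep is '\t'
df2 = pd.read_csv(io.StringIO(tsv), sep="\t")  # equivalent call

# Both calls produce the same dataframe with columns 'a' and 'b'.
```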
Hat Tip to Ilja Everilä
@Hasanah Based on your code:
df = pd.read_csv(file,'rb', delimiter ='\t', engine='python')
delimiter='\t' tells pandas to separate the data into fields based on tab characters.
The default delimiter when pandas reads CSV files is a comma, so you should not need to define one. Also drop the stray 'rb' argument, which pandas would otherwise treat as the separator:
df = pd.read_csv(file, engine='python')
I am trying to read a text file from a third party that has lines randomly split at column 28.
When I convert it to CSV it is fine, but when I feed the files to Athena, it cannot read them because of the split.
Is there a way to find the CR here and rejoin the line with the rest of its row, as the other lines are?
Thanks,
SM
This is a code snippet :
import pandas as pd
add_columns = ["col1", "col2", "col3"...."col59"]
res = pd.read_csv("file_name.txt", names= add_columns, sep=',\s+', delimiter=',', encoding="utf-8", skipinitialspace=True)
df = pd.DataFrame(res)
df.to_csv('final_name.csv', index = None)
file_name.txt
99,999,00499013,X701,,,5669,5669,1232,,1,1,,2,,,,0,0,0,,,,,,,,,,,,,,2400,1232,LXA,,<<line is split on column 28>>
2,5669,,,,68,,,1,,,,,,,,,,,,71,
99,999,00499017,X701,,,5669,5669,1160,,1,1,,2,,,,0,0,0,,,,,,,,,,,,,,2400,1160,LXA,,1,5669,,,,,,,1,,,,,,,,,,,,71,
99,999,00499019,X701,,,5669,5669,1284,,1,1,,2,,,,0,0,0,,,,,,,,,,,,,,2400,1284,LXA,,2,5669,,,,66,,,1,,,,,,,,,,,,71,
I have tried str.split, but no luck.
If you are able to convert it successfully to CSV using pandas, you can try to save it as a CSV to feed into Athena.
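One way to sketch the repair (assuming, as stated, a fixed number of delimiters per logical row; the sample below uses 6 columns instead of 59 for brevity): accumulate physical lines until the buffer has the expected comma count, then parse the repaired text.

```python
import io
import pandas as pd

EXPECTED_COMMAS = 5  # a 6-column row has 5 commas (58 for the real 59-column file)

# Hypothetical sample: the first logical row is split across two physical lines
raw = "a,b,c,\nd,e,f\n1,2,3,4,5,6\n"

merged, buf = [], ""
for line in raw.splitlines():
    buf += line  # rejoin the continuation directly, with no extra delimiter
    if buf.count(",") >= EXPECTED_COMMAS:
        merged.append(buf)
        buf = ""

df = pd.read_csv(io.StringIO("\n".join(merged)), header=None)
# df now has 2 rows and 6 columns
```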
I'm trying to read in a CSV file into a pandas dataframe and select a column, but keep getting a key error.
The file reads in successfully and I can view the dataframe in an IPython notebook, but when I select any column other than the first one, it throws a key error.
I am using this code:
import pandas as pd
transactions = pd.read_csv('transactions.csv',low_memory=False, delimiter=',', header=0, encoding='ascii')
transactions['quarter']
This is the file I'm working on:
https://www.dropbox.com/s/81iwm4f2hsohsq3/transactions.csv?dl=0
Thank you!
Use sep=r'\s*,\s*' so that the spaces around the column names are handled:
transactions = pd.read_csv('transactions.csv', sep=r'\s*,\s*',
header=0, encoding='ascii', engine='python')
Alternatively, you can make sure that you don't have unquoted spaces in your CSV file and use your command unchanged.
Proof:
print(transactions.columns.tolist())
Output:
['product_id', 'customer_id', 'store_id', 'promotion_id', 'month_of_year', 'quarter', 'the_year', 'store_sales', 'store_cost', 'unit_sales', 'fact_count']
If you need to select multiple columns from the dataframe, use two pairs of square brackets, e.g.
df[["product_id","customer_id","store_id"]]
I ran into the same problem: key errors when selecting columns after reading from CSV.
Reason
The main cause of this problem is the extra leading whitespace in your CSV file (found in your uploaded CSV file, e.g. , customer_id, store_id, promotion_id, month_of_year, ).
Proof
To prove this, try print(list(df.columns)): the column names come back as ['product_id', ' customer_id', ' store_id', ' promotion_id', ' month_of_year', ...].
Solution
The direct way to solve this is to add the skipinitialspace parameter to pd.read_csv(), for example:
pd.read_csv('transactions.csv',
sep = ',',
skipinitialspace = True)
Reference: https://stackoverflow.com/a/32704818/16268870
The key error generally occurs when the key doesn't match any of the dataframe's column names exactly.
You could also try:
import pandas as pd

with open(filename, "r") as file:
    df = pd.read_csv(file, delimiter=",")
# strip the leading/trailing whitespace from the column names
df.columns = df.columns.str.strip()
print(df.columns)
Give the full path of the CSV file in the pd.read_csv(). This works for me.
Datasets, when split on ',', can create features with a space at the beginning. Removing the space with a regex (or str.strip) might help.
For the time being I did this:
label_name = ' Label'
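To make the failure mode concrete, here is a small sketch with hypothetical data showing how the leading spaces end up in the column names, and how skipinitialspace removes them:

```python
import io
import pandas as pd

# Hypothetical CSV with a space after each comma, like the file in question
csv_text = "product_id, customer_id, store_id\n1, 2, 3\n"

df = pd.read_csv(io.StringIO(csv_text))
# list(df.columns) -> ['product_id', ' customer_id', ' store_id']
# so df['customer_id'] raises a KeyError

df2 = pd.read_csv(io.StringIO(csv_text), skipinitialspace=True)
# list(df2.columns) -> ['product_id', 'customer_id', 'store_id']
# df2['customer_id'] now works
```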
Using pandas to read in a large tab-delimited file:
df = pd.read_csv(file_path, sep='\t', encoding='latin 1', dtype = str, keep_default_na=False, na_values='')
The problem is that there are 200 columns and the 3rd column is text with occasional newline characters. The text is not delimited with any special characters. These lines get chopped into multiple lines with data going into the wrong columns.
There are a fixed number of tabs in each line - that is all I have to go on.
The idea is to use a regex to find all instances of text separated by a given number of tabs and ending in a newline, then build a dataframe from that.
import pandas as pd
import re

def wonky_parser(fn):
    with open(fn) as f:
        txt = f.read()
    # the {8} below is where the expected number of tabs per logical row is set
    preparse = re.findall('(([^\t]*\t[^\t]*){8}(\n|\Z))', txt)
    parsed = [t[0].split('\t') for t in preparse]
    return pd.DataFrame(parsed)
Pass a filename to the function and get your dataframe back.
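A quick check of the idea on an in-memory string (2 tabs per row here instead of 8; the sample text is hypothetical). Note that the character class [^\t] deliberately matches newlines, which is what lets a split row be stitched back together; an embedded newline simply stays inside its field:

```python
import re
import pandas as pd

# Hypothetical 3-column sample where the middle field of row 1
# contains a stray newline
txt = "a\tb1\nb2\tc\nd\te\tf\n"

preparse = re.findall('(([^\t]*\t[^\t]*){2}(\n|\Z))', txt)
# strip the trailing newline so it doesn't end up in the last field
parsed = [t[0].rstrip('\n').split('\t') for t in preparse]
df = pd.DataFrame(parsed)
# df has 2 rows x 3 columns; the stray newline survives inside its field
```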
Name your third column:
df.columns.values[2] = "some_name"
and use converters to pass your cleanup function:
pd.read_csv("foo.csv", sep='\t', encoding='latin-1', dtype=str, keep_default_na=False, converters={'some_name': lambda x: x.replace('\n', '')})
You can use any manipulation function that works for you inside the lambda.