I need to make a dataframe from two txt files.
The first txt file looks like this: Street_name space id.
The second txt file looks like this: City_name space id.
Example:
text file 1:
Roseberry st 1234
Brooklyn st 4321
Wolseley 1234567
text file 2:
Winnipeg 4321
Winnipeg 1234
Ste Anne 1234567
I need to make one dataframe out of this. Sometimes there is just one word for Street_name, and sometimes more. The same goes for City_name.
I get an error: ParserError: Error tokenizing data. C error: Expected 2 fields in line 5, saw 3 because I'm trying to put both words for street name into the same column, but don't know how to do it. I want one column for street name (no matter if it consists of one or more words), one for city name, and one for id.
I want a df with 3 rows and 3 cols.
Thanks!
Edit: both text files are huge (each 50 million rows+), so I need this code not to break and to be optimised for large files.
It is NOT correct CSV, so you may need to read it on your own.
You can use normal open() and read(), and later split on newlines to create a list of lines. After that you can use a for-loop (or list comprehension) with line.rsplit(" ", 1) to split each line on its last space.
Minimal working example:
I use io to simulate a file in memory - so everyone can simply copy and test it - but you should use open().
text = '''Roseberry st 1234
Brooklyn st 4321
Wolseley 1234567'''
import io
#with open('filename') as fh:
with io.StringIO(text) as fh:
    lines = fh.read().splitlines()
print(lines)
lines = [line.rsplit(" ", 1) for line in lines]
print(lines)
import pandas as pd
df = pd.DataFrame(lines, columns=['name', 'number'])
print(df)
Result:
['Roseberry st 1234', 'Brooklyn st 4321', 'Wolseley 1234567']
[['Roseberry st', '1234'], ['Brooklyn st', '4321'], ['Wolseley', '1234567']]
name number
0 Roseberry st 1234
1 Brooklyn st 4321
2 Wolseley 1234567
EDIT:
read_csv can use a regex as separator (e.g. sep=r"\s+" for many spaces) and it can even use lookahead/lookbehind ((?=...)/(?<=...)) to check if there is a digit after the space without capturing it as part of the separator.
text = '''Roseberry st 1234
Brooklyn st 4321
Wolseley 1234567'''
import io
import pandas as pd
#df = pd.read_csv('filename', names=['name', 'number'], sep=r'\s(?=\d)', engine='python')
df = pd.read_csv(io.StringIO(text), names=['name', 'number'], sep=r'\s(?=\d)', engine='python')
print(df)
Result:
name number
0 Roseberry st 1234
1 Brooklyn st 4321
2 Wolseley 1234567
And later you can try to connect both dataframes using .join() or .merge() with the parameter on= (like a JOIN in an SQL query).
text1 = '''Roseberry st 1234
Brooklyn st 4321
Wolseley 1234567'''
text2 = '''Winnipeg 4321
Winnipeg 1234
Ste Anne 1234567'''
import io
import pandas as pd
df1 = pd.read_csv(io.StringIO(text1), names=['street name', 'id'], sep=r'\s(?=\d)', engine='python')
df2 = pd.read_csv(io.StringIO(text2), names=['city name', 'id'], sep=r'\s(?=\d)', engine='python')
print(df1)
print(df2)
df = df1.merge(df2, on='id')
print(df)
Result:
street name id
0 Roseberry st 1234
1 Brooklyn st 4321
2 Wolseley 1234567
city name id
0 Winnipeg 4321
1 Winnipeg 1234
2 Ste Anne 1234567
street name id city name
0 Roseberry st 1234 Winnipeg
1 Brooklyn st 4321 Winnipeg
2 Wolseley 1234567 Ste Anne
Pandas doc: Merge, join, concatenate and compare
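Since the question mentions files of 50+ million rows, here is a minimal sketch of the same idea with chunksize=, so the street file is processed piece by piece instead of being parsed in one go (file names, chunk size and memory behaviour at that scale are assumptions on my part):
import pandas as pd

sep = r'\s(?=\d)'  # same separator as above: split on the space before the trailing id

# read the city file once
df_city = pd.read_csv('text2.txt', names=['city name', 'id'], sep=sep, engine='python')

# read the street file in chunks and merge each chunk against the city table
parts = []
for chunk in pd.read_csv('text1.txt', names=['street name', 'id'],
                         sep=sep, engine='python', chunksize=1_000_000):
    parts.append(chunk.merge(df_city, on='id'))

df = pd.concat(parts, ignore_index=True)
print(df.head())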
There's nothing that I'm aware of in pandas that does this automatically.
Below, I built a script that merges those addresses (address + "st") into a single column, then merges the two data frames into one based on "id".
I assume your actual text files are significantly larger, so assuming they follow the pattern set in the two examples, this script should work fine.
Basically, this code turns each line of text in the file into a list, then combines lists of length 3 into length 2 by combining the first two list items.
After that, it turns the "list of lists" into a dataframe and merges those dataframes on column "id".
A couple of caveats:
Make sure you set the correct text file paths
Make sure the first line of each text file contains two single-word column headers (i.e. "address id" or "city id")
Make sure the id column header in each text file is named "id"
import pandas as pd
import numpy as np
# set both text file paths (you may need full path i.e. C:\Users\Name\bla\bla\bla\text1.txt)
text_path_1 = r'text1.txt'
text_path_2 = r'text2.txt'
# declares first text file
with open(text_path_1) as f1:
    text_file_1 = f1.readlines()
# declares second text file
with open(text_path_2) as f2:
    text_file_2 = f2.readlines()
# function that massages data into two columns (to put "st" into same column as address name)
def data_massager(text_file_lines):
    data_list = []
    for item in text_file_lines:
        stripped_item = item.strip('\n')
        split_stripped_item = stripped_item.split(' ')
        if len(split_stripped_item) == 3:
            split_stripped_item[0:2] = [' '.join(split_stripped_item[0:2])]
        data_list.append(split_stripped_item)
    return data_list
# runs function on both text files
data_list_1 = data_massager(text_file_1)
data_list_2 = data_massager(text_file_2)
# creates dataframes on both text files
df1 = pd.DataFrame(data_list_1[1:], columns = data_list_1[0])
df2 = pd.DataFrame(data_list_2[1:], columns = data_list_2[0])
# merges data based on id (make sure both text files' id is named "id")
merged_df = df1.merge(df2, how='left', on='id')
# prints dataframe (assuming you're using something like jupyter-lab)
merged_df
pandas has strong support for strings. You can make the lines of each file into a Series and then use a regular expression to separate the fields into separate columns. I assume that "id" is the common value that links the two datasets, so it can become the dataframe index and the columns can just be added together.
import pandas as pd
street_series = pd.Series([line.strip() for line in open("text1.txt")])
street_df = street_series.str.extract(r"(.*?) (\d+)$")
del street_series
street_df.rename({0:"street", 1:"id"}, axis=1, inplace=True)
street_df.set_index("id", inplace=True)
print(street_df)
city_series = pd.Series([line.strip() for line in open("text2.txt")])
city_df = city_series.str.extract(r"(.*?) (\d+)$")
del city_series
city_df.rename({0:"city", 1:"id"}, axis=1, inplace=True)
city_df.set_index("id", inplace=True)
print(city_df)
street_df["city"] = city_df["city"]
print(street_df)
I am using the following code to read the CSV file in PySpark
cb_sdf = sqlContext.read.format("csv") \
    .options(header='true',
             multiLine='True',
             inferschema='true',
             treatEmptyValuesAsNulls='true') \
    .load(cb_file)
The number of rows is correct. But for some rows, the columns are separated incorrectly. I think it is because the current delimiter is ",", but some cells contain ", " in the text as well.
For example, the following row in the pandas dataframe (I used pd.read_csv to debug)
Unnamed: 0    111
name          cjsc "transport, customs, tourism"
domain        ttt-w.ru
industry      package/freight delivery
locality      vyborg, leningrad, russia
country       russia
size_range    1 - 10
becomes
_c0           111
name          "cjsc ""transport
domain        customs
industry      tourism"""
locality      ttt-w.ru
country       package/freight delivery
size_range    vyborg, leningrad, russia
when I read it with PySpark.
It seems the cell "cjsc "transport, customs, tourism"" is separated into 3 cells: |"cjsc ""transport| customs| tourism"""|.
How can I set the delimiter to be exactly "," without any whitespace followed?
UPDATE:
I checked the CSV file, the original line is:
111,"cjsc ""transport, customs, tourism""",ttt-w.ru,package/freight delivery,"vyborg, leningrad, russia",russia,1 - 10
So is it still the problem of delimiter, or is it the problem of quotes?
I think that after splitting correctly we should have:
col1: 111
col2: "cjsc ""transport, customs, tourism"""
col3: ttt-w.ru,package/freight delivery
col4: "vyborg, leningrad, russia"
col5: russia
col6: 1 - 10
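The sample line uses standard CSV quoting (a quote inside a quoted field is doubled), but Spark's CSV reader by default expects embedded quotes to be escaped with a backslash, which is why the quoted field gets split. A minimal sketch (untested against your data; cb_file and sqlContext as in the question) that tells Spark the escape character is also a double quote:
cb_sdf = sqlContext.read.format("csv") \
    .options(header='true',
             multiLine='true',
             inferSchema='true',
             quote='"',
             escape='"',
             treatEmptyValuesAsNulls='true') \
    .load(cb_file)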
I have multiple CSV files (about 200) in a folder that I want to merge into one unique dataframe. For example, each file has 3 columns, of which 2 are common to all the files (Country and Year); the third column is different in each file.
For example, one file has the following columns:
Country Year X
----------------------
Mexico 2015 10
Spain 2014 6
And other file can be like this:
Country Year A
--------------------
Mexico 2015 90
Spain 2014 67
USA 2020 8
I can read these files and merge them with the following code:
x = pd.read_csv("x.csv")
a = pd.read_csv("a.csv")
df = pd.merge(a, x, how="left", left_on=["country", "year"],
right_on=["country", "year"], indicator=False)
And this result in the output that I want, like this:
Country Year A X
-------------------------
Mexico 2015 90 10
Spain 2014 67 6
USA 2020 8
However, my problem is doing the previous process with each file; there are more than 200. I want to know if I can use a loop (or some other method) to read the files and merge them into a unique dataframe.
Thank you very much, I hope I was clear enough.
Use glob like this:
import glob
print(glob.glob("/home/folder/*.csv"))
This gives all your files in a list: ['/home/folder/file1.csv', '/home/folder/file2.csv', .... ]
Now, you can just iterate over this list from 1 to the end, keeping 0 as your base, and do pd.read_csv() and pd.merge() - it should be sorted! A sketch of that loop follows.
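A minimal sketch of that loop (assuming the shared key columns are named Country and Year as in the example data, and using an outer merge so rows that appear in only some files are kept):
import glob
from functools import reduce
import pandas as pd

files = glob.glob("/home/folder/*.csv")

# read every csv, then merge them pairwise on the shared columns
dfs = [pd.read_csv(f) for f in files]
df = reduce(lambda left, right: pd.merge(left, right, on=["Country", "Year"], how="outer"), dfs)
print(df)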
Try this:
import os
import pandas as pd
# update this to path that contains your .csv's
path = '.'
# get files that end with csv in path
dir_list = [file for file in os.listdir(path) if file.endswith('.csv')]
# initiate empty list
df_list = []
# simple for loop with Try, Except that passes on iterations that throw errors when trying to 'read_csv' your files
for file in dir_list:
    try:
        # append to df_list and set your indices to match across your df's for later pd.concat to work
        df_list.append(pd.read_csv(file).set_index(['Country', 'Year']))
    except:  # change this depending on whatever Errors pd.read_csv() throws
        pass
concatted = pd.concat(df_list, axis=1)  # axis=1 so each file's third column lines up on the shared index
I have a series of very messy *.csv files that are being read in by pandas. An example csv is:
Instrument 35392
"Log File Name : station"
"Setup Date (MMDDYY) : 031114"
"Setup Time (HHMMSS) : 073648"
"Starting Date (MMDDYY) : 031114"
"Starting Time (HHMMSS) : 090000"
"Stopping Date (MMDDYY) : 031115"
"Stopping Time (HHMMSS) : 235959"
"Interval (HHMMSS) : 010000"
"Sensor warmup (HHMMSS) : 000200"
"Circltr warmup (HHMMSS) : 000200"
"Date","Time","","Temp","","SpCond","","Sal","","IBatt",""
"MMDDYY","HHMMSS","","øC","","mS/cm","","ppt","","Volts",""
"Random message here 031114 073721 to 031114 083200"
03/11/14,09:00:00,"",15.85,"",1.408,"",.74,"",6.2,""
03/11/14,10:00:00,"",15.99,"",1.96,"",1.05,"",6.3,""
03/11/14,11:00:00,"",14.2,"",40.8,"",26.12,"",6.2,""
03/11/14,12:00:01,"",14.2,"",41.7,"",26.77,"",6.2,""
03/11/14,13:00:00,"",14.5,"",41.3,"",26.52,"",6.2,""
03/11/14,14:00:00,"",14.96,"",41,"",26.29,"",6.2,""
"message 3"
"message 4"**
I have been using this code to import the *.csv file, process the double headers, pull out the empty columns, and then strip the offending rows with bad data:
DF = pd.read_csv(BADFILE, parse_dates={'Datetime_(ascii)': [0, 1]}, sep=",",
                 header=[10, 11], na_values=['', 'na', 'nan nan'],
                 skiprows=[10], encoding='cp1252')
DF = DF.dropna(how="all", axis=1)
DF = DF.dropna(thresh=2)
droplist = ['message', 'Random']
DF = DF[~DF['Datetime_(ascii)'].str.contains('|'.join(droplist))]
DF.head()
Datetime_(ascii) (Temp, øC) (SpCond, mS/cm) (Sal, ppt) (IBatt, Volts)
0 03/11/14 09:00:00 15.85 1.408 0.74 6.2
1 03/11/14 10:00:00 15.99 1.960 1.05 6.3
2 03/11/14 11:00:00 14.20 40.800 26.12 6.2
3 03/11/14 12:00:01 14.20 41.700 26.77 6.2
4 03/11/14 13:00:00 14.50 41.300 26.52 6.2
This was working fine and dandy until I got a file that has an erroneous one-row message after the header: "Random message here 031114 073721 to 031114 083200"
The error I receive is:
C:\Users\USER\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\io\parsers.py in _do_date_conversions(self, names, data)
   1554         data, names = _process_date_conversion(
   1555             data, self._date_conv, self.parse_dates, self.index_col,
-> 1556             self.index_names, names, keep_date_col=self.keep_date_col)
   1557
   1558         return names, data

C:\Users\USER\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\io\parsers.py in _process_date_conversion(data_dict, converter, parse_spec, index_col, index_names, columns, keep_date_col)
   2975     if not keep_date_col:
   2976         for c in list(date_cols):
-> 2977             data_dict.pop(c)
   2978             new_cols.remove(c)
   2979

KeyError: ('Time', 'HHMMSS')
If I remove that line, the code works fine. Similarly, if I remove the header= line the code works fine. However, I want to be able to preserve this because I am reading in hundreds of these files.
Difficulty: I would prefer not to open each file before the call to pandas.read_csv() as these files can be rather large - thus I don't want to read and save multiple times! Also, I would prefer a real pandas/pythonic solution that doesn't involve opening the file first as a StringIO buffer to remove the offending lines.
Here's one approach, making use of the fact that skiprows accepts a callable function. The function receives only the row index being considered, which is a built-in limitation of that parameter.
As such, the callable function skip_test() first checks whether the current index is in the set of known indices to skip. If not, then it opens the actual file and checks the corresponding row to see if its contents match.
The skip_test() function is a little hacky in the sense that it does inspect the actual file, although it only inspects up until the row index it's currently evaluating. It also assumes that the bad line always begins with the same string (in the example case, "foo"), but that seems to be a safe assumption given the OP's example.
# example data
""" foo.csv
uid,a,b,c
0,1,2,3
skip me
1,11,22,33
foo
2,111,222,333
"""
import pandas as pd
def skip_test(r, fn, fail_on, known):
    if r in known:  # we know we always want to skip these
        return True
    # check if row index matches problem line in file
    # for efficiency, quit after we pass row index in file
    with open(fn, "r") as f:
        data = f.read()
    for i, line in enumerate(data.splitlines()):
        if (i == r) and line.startswith(fail_on):
            return True
        elif i > r:
            break
    return False
fname = "foo.csv"
fail_str = "foo"
known_skip = [2]
pd.read_csv(fname, sep=",", header=0,
            skiprows=lambda x: skip_test(x, fname, fail_str, known_skip))
# output
uid a b c
0 0 1 2 3
1 1 11 22 33
2 2 111 222 333
If you know exactly which line the random message will appear on when it does appear, then this will be much faster, as you can just tell it not to inspect the file contents for any index past the potential offending line.
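For example, a sketch of that shortcut (the max_suspect_row parameter and its value are my own additions, not part of the original answer):
import pandas as pd

def skip_test_fast(r, fn, fail_on, known, max_suspect_row=12):
    if r in known:           # rows we always want to skip
        return True
    if r > max_suspect_row:  # past the region where a stray message could appear,
        return False         # so never touch the file
    # otherwise inspect the file, but only up to row r
    with open(fn, "r") as f:
        for i, line in enumerate(f):
            if i == r:
                return line.startswith(fail_on)
    return False

pd.read_csv("foo.csv", sep=",", header=0,
            skiprows=lambda x: skip_test_fast(x, "foo.csv", "foo", [2]))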
After some tinkering yesterday I found a solution and what the potential issue may be.
I tried the skip_test() function answer above, but I was still getting errors with the size of the table:
pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader.read (pandas\_libs\parsers.c:10862)()
pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader._read_low_memory (pandas\_libs\parsers.c:11138)()
pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader._read_rows (pandas\_libs\parsers.c:11884)()
pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader._tokenize_rows (pandas\_libs\parsers.c:11755)()
pandas\_libs\parsers.pyx in pandas._libs.parsers.raise_parser_error (pandas\_libs\parsers.c:28765)()
ParserError: Error tokenizing data. C error: Expected 1 fields in line 14, saw 11
So after playing around with skiprows= I discovered that I was just not getting the behavior I wanted when using engine='c'. read_csv() was still determining the number of columns from those first few rows, and some of those single-column rows were still being passed through. It may be that I have a few more bad single-column rows in my csv set that I did not plan on.
Instead, I create an arbitrary sized DataFrame as a template. I pull in the entire .csv file, then use logic to strip out the NaN rows.
For example, I know that the widest table that I will encounter in my data will be 10 columns wide. So my call to pandas is:
DF = pd.read_csv(csv_file, sep=',',
                 parse_dates={'Datetime_(ascii)': [0, 1]},
                 na_values=['', 'na', '999999', '#'], engine='c',
                 encoding='cp1252', names=list(range(0, 10)))
I then use these two lines to drop the NaN rows and columns from the DataFrame:
#drop the null columns created by double delimiters
DF = DF.dropna(how="all", axis=1)
DF = DF.dropna(thresh=2) # drop if we don't have at least 2 cells with real values
If anyone in the future comes across this question, pandas has now implemented the on_bad_lines argument. You can now solve this problem by using on_bad_lines = "skip"
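For reference, a sketch of what that looks like (pandas 1.3 or newer; applied to the read_csv call from the question, otherwise unchanged):
import pandas as pd

DF = pd.read_csv(BADFILE, sep=",",
                 parse_dates={'Datetime_(ascii)': [0, 1]},
                 header=[10, 11], na_values=['', 'na', 'nan nan'],
                 skiprows=[10], encoding='cp1252',
                 on_bad_lines='skip')  # drop rows with too many fields instead of raising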
I have an issue with a CSV file I am trying to import into pandas. The structure of the file is as follows:
first character of the file is a single quote;
last character of the file is a single quote;
every line of the CSV starts with a double quote and ends with a double quote followed by \n
So I have issues importing it with pandas.read_csv. Ideally I would like pandas to just ignore the single and double quotes when importing (not taking them into account for the structure of the data frame, and not importing these as characters).
I do not really know if I should manipulate the CSV file before using pandas.read_csv, or if I have option for just ignoring these characters.
The pd.read_csv method's first argument is either a file name or a stream.
You can read the file manually and manipulate the stream before handing it to pandas.
from io import StringIO
import pandas as pd

sio = StringIO("id,category,value\n1,beer,2.40\n2,wine,6.40\n3,$$$Theawsomestuff$$$###,166.00")
pd.read_csv(sio)
id category value
0 1 beer 2.4
1 2 wine 6.4
2 3 $$$Theawsomestuff$$$### 166.0
Thus, by subclassing StringIO, you can change the behavior of the read method:
class StreamChanger(StringIO):
    def read(self, **kwargs):
        data = super().read(**kwargs)
        data = data.replace("$", "")
        data = data.replace("#", "")
        return data
sio = StreamChanger("id,category,value\n1,beer,2.40\n2,wine,6.40\n3,$$$Theawsomestuff$$$###,166.00")
pd.read_csv(sio)
id category value
0 1 beer 2.4
1 2 wine 6.4
2 3 Theawsomestuff 166.0
Use the parameter 'quoting' and pass the value 3 (csv.QUOTE_NONE) to read_csv. Once you have the dataframe created, you should take care of the quotes in the data and headers.
import pandas as pd
df=pd.read_csv('check.txt',doublequote=True,delimiter=',',quoting=3)
df=df.replace({'"': '','\'':''}, regex=True)
df.columns = ['Id1','StartTime','start_lat','start_long','StartGeohash']
print(df)
Sample File
'Id1,StartTime,start_lat,start_long,StartGeohash
"113,2016-11-01 10:50:28.063,-33.139507,-100.226715,9vbsx2"
"113,2016-11-02 10:49:24.063,-33.139507,-100.226715,9vbsx2"
"115,2016-11-03 10:55:20.063,-36.197660,-101.186050,9y2jcm"'
output
Id1 StartTime start_lat start_long StartGeohash
0 113 2016-11-01 10:50:28.063 -33.139507 -100.226715 9vbsx2
1 113 2016-11-02 10:49:24.063 -33.139507 -100.226715 9vbsx2
2 115 2016-11-03 10:55:20.063 -36.197660 -101.186050 9y2jcm