I have a dataset with more than 1,000,000 rows.
However, read_csv cannot read them all.
products = pd.read_csv("PAD_NEW.csv", encoding = "ISO-8859-1", error_bad_lines=False)
products.shape
(859971, 137)
But in R, using fread, I can get 1048575:
> dim(products)
[1] 1048575 137
I tried reading the file in R first and then writing a new file for Python, but that did not work either.
UPDATE: I manually checked the rows being ignored. There is a column named description that contains sentences like "new product, next week". I think Python takes the "," in this column as a separator, because after I delete this column it works.
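On recent pandas versions (1.4 and later), error_bad_lines is deprecated in favour of on_bad_lines, which can also take a callable so you can inspect the rows being dropped instead of losing them silently. A minimal sketch, assuming that newer pandas and the python engine (which the callable form requires):

import pandas as pd

bad_rows = []

def log_bad_line(fields):
    # Collect the fields of each malformed line instead of silently dropping it
    bad_rows.append(fields)
    return None  # returning None tells pandas to skip the line

products = pd.read_csv(
    "PAD_NEW.csv",
    encoding="ISO-8859-1",
    engine="python",          # the callable form of on_bad_lines requires the python engine
    on_bad_lines=log_bad_line,
)
print(len(bad_rows), "rows were skipped")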
Related
I have a semicolon-delimited CSV loaded into a pandas DataFrame, with all columns of dtype object. Within some of the cells the string value can contain a double quote ("), a comma (,), or both (e.g. TES"T_ING,_VALUE). I am querying the DF using df.query based on some condition to get a subset of the DataFrame, but the rows that have the pattern described in the example are being omitted completely, while the remaining rows are returned just fine. Another requirement is that I need to match every " within the text with a closing quote, but applying a lambda to replace " with "" is also not working properly. I have tried several methods; they are listed below.
Problem 1:
pd.read_csv("file.csv", delimiter=';')
pd.read_csv("file.csv", delmiter=';', thousands=',')
pd.read_csv("file.csv", delimiter=";", escapechar='"')
pd.read_csv("file.csv", delimiter=";", encoding='utf-8')
All of the above fail to load the data in question.
Problem 2:
Input: TES"T_ING,_VALUE to TES""T_ING,_VALUE
I have tried:
df.apply(lambda s: s.str.replace('"', '""'))
which doesn't do anything.
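One thing worth checking first (an assumption on my part, since the surrounding code is not shown): DataFrame.apply does not modify the frame in place, so the result has to be assigned back, for example:

# apply returns a new DataFrame; the original df is unchanged unless you assign the result back
df = df.apply(lambda s: s.str.replace('"', '""', regex=False))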
What exactly is going on? I haven't been able to find any questions tackling this particular type of issue anywhere.
Appreciate your help in advance.
EDIT: Sorry, I didn't provide mockup data due to sensitivity, but here is some fake data that illustrates the issue.
The following is a sample of the CSV structure:
Column1;Column2;Column3;Column4;Column5\n
TES"T_ING,_VALUE;Col2Value;Col3Value;Col4Value;Col5Value\n
Col1value;TES"T_ING,_VALUE2;Col3Value;Col4Value;Col5Value\n
I have tried utilizing quoting=csv.QUOTE_ALL/QUOTE_NONNUMERIC and quotechar='"' when loading the df, but the result ends up being:
Column1;Column2;Column3;Column4;Column5\n
"TES"T_ING,_VALUE;Col2Value;Col3Value;Col4Value;Col5Value";;;;\n
"Col1value;TES"T_ING,_VALUE2;Col3Value;Col4Value;Col5Value";;;;\n
So it interprets the whole row as a value in column 1 rather than actually splitting on the ; and applying the quoting to only column 1. Truthfully, I could iterate through each row in the df, do a split, and load the remaining values into their respective columns, but the CSV is quite large so this operation would take some time. The subset of the data the user queries on is supposed to be returned from an endpoint (this part is already working).
The problem was solved by using df.apply with a custom function to process each record:
df = pd.read_csv("csv_file.csv", delimiter=';', escapechar='\\')
def mapper(record):
if ';' in record['col1']:
content = record['col1'].split(';')
if len(content) == num_columns:
if '"' in content[0]:
content[0] = content[0].replace('"', '""')
record['col1'] = content[0]
# repeat for remaining columns
processed = df.apply(lambda x: mapper(x), axis=1)
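An alternative worth noting (a sketch, not tested against the real file): disabling quote handling entirely at read time keeps pandas from swallowing the whole row into column 1, because the stray " characters are then treated as ordinary text:

import csv
import pandas as pd

# csv.QUOTE_NONE tells the parser to treat every " as a literal character,
# so each line is split on every ; and nothing is merged into the first column
df = pd.read_csv("file.csv", delimiter=";", quoting=csv.QUOTE_NONE)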
I'm trying to read 100 CSVs and collate data from all into a single CSV.
I made use of:
all_files = pd.DataFrame()
for file in files:
    all_files = all_files.append(pd.read_csv(file, encoding='unicode_escape')).reset_index(drop=True)
where files = list of filepaths of 100 CSVs
Now, each CSV may have a different number of columns. Within a single CSV, each row may have a different number of columns too.
I want to match the column header names, put the data from all the CSVs in the correct columns, and keep adding new columns to my final DF as I go.
The above code works fine for 30-40 CSVs and then breaks and gives the following error:
ParserError: Error tokenizing data. C error: Expected 16 fields in line 78, saw 17
Any help will be much appreciated!
There are a couple of ways to read variable-length CSV files.
First, you can specify the column names beforehand. If you are not sure of the number of columns, you can give a reasonably large number of columns:
df = pd.read_csv("filename.csv", header=None, names=list(range(10)))
The other option is to read the entire file into a single column using a different delimiter, and then split on commas:
df = pd.read_csv("filename.csv", header=None, sep='\t')
df = df[0].str.split(',', expand=True)
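If the files have a header line, a small follow-up step (a sketch) is to promote that first row to the column names after the split:

# The original header line is now data row 0; promote it to column names
df.columns = df.iloc[0]
df = df.iloc[1:].reset_index(drop=True)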
It's because you are trying to read all the CSV files into a single DataFrame. When the first file is read, the number of columns for the DataFrame is decided, and an error results when a different number of columns is fed in. If you really want to concat them, you should read them all in Python, adjust their columns, and then concat them, as sketched below.
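A minimal sketch of that read-then-concat approach, assuming files is the list of paths from the question and a pandas version (1.3 or later) that has on_bad_lines; pd.concat aligns columns by header name across files and fills the missing ones with NaN:

import pandas as pd

# files = [...]  # the list of 100 CSV paths from the question
frames = [
    pd.read_csv(f, encoding="unicode_escape", on_bad_lines="skip")  # skips rows with extra fields
    for f in files
]
# concat aligns columns by name and fills columns missing from a file with NaN
all_files = pd.concat(frames, ignore_index=True)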
I have a number of .xls datasheets which I am looking to clean and merge.
Each data sheet is generated by a larger system which cannot be changed.
The method that generates the data sets displays the selected parameters for the data set (E.g. 1). I am looking to automate the removal of these.
The number of rows that this takes up varies, so I am unable to blanket remove x rows from each sheet. Furthermore, the system that generates the report arbitrarily merges cells in the blank sections to the right of the information.
Currently I am attempting what feels like a very inelegant solution where I convert the file to a CSV, read it as a string, and remove everything before the first column header.
data_xls = pd.read_excel('InputFile.xls', index_col=None)
data_xls.to_csv('Change1.csv', encoding='utf-8')
with open("Change1.csv") as f:
s = f.read() + '\n'
a=(s[s.index("Col1"):])
df = pd.DataFrame([x.split(',') for x in a.split('\n')])
This works but it seems wildly inefficient:
Multiple format conversions
Reading every line in the file, when the only rows being altered occur within the first ~20
The dataframe ends up with column headers shifted over by one and must be re-aligned (less of a concern)
With some of the files being around 20mb, merging a batch of 8 can take close to 10 minutes.
A little hacky, but here is an idea to speed up your process by doing some operations directly on your dataframe. Since you know your first column name is Col1, you could try something like this:
df = pd.read_excel('InputFile.xls', index_col=None)
# Find the first occurrence of "Col1"
column_row = df.index[df.iloc[:, 0] == "Col1"][0]
# Use this row as header
df.columns = df.iloc[column_row]
# Clear the columns' name (currently a useless index number)
df.columns.name = None
# Keep only the data after the (old) column row
df = df.iloc[column_row + 1:]
# And tidy it up by resetting the index
df.reset_index(drop=True, inplace=True)
This should work for any dynamic number of header rows in your Excel (xls & xlsx) files, as long as you know the title of the first column...
If you know the number of junk rows, you can skip them using skiprows:
data_xls = pd.read_excel('InputFile.xls', index_col=None, skiprows=2)
I imported a CSV file into Python (using a pandas DataFrame) and there are some missing values in the file. In the data frame I have rows like the following:
> 08,63.40,86.21,63.12,72.78,,
I have tried everything to remove the rows containing elements similar to the last element in the above data, but nothing works. I do not know if the above is categorized as whitespace, an empty string, or something else.
Here is what I have:
result = pandas.read_csv(file,sep='delimiter')
result[result!=',,']
This did not work. Then I did the following:
result.replace(' ', np.nan, inplace=True)
result.dropna(inplace=True)
This also did not work.
result = result.replace(r'\s+', np.nan, regex=True)
This also did not work. I still see the row containing the ,, element.
Also, my dataframe is 100 by 1. When I import it from the CSV file all the columns become one. (I do not know if this helps.)
Can anyone tell me how to remove rows containing ,, elements?
Also my dataframe is 100 by 1. When I import it from CSV file all the columns become 1
This is probably the key, and IMHO it is weird. When you import a CSV into a pandas DataFrame you normally want each field to go into its own column, precisely so you can later process that column's values individually. So (still IMHO) the correct solution is to fix that.
Now, to directly answer your (probably XY) question: you do not want to remove rows containing blank or empty columns, because each row only contains one single column, but rather rows containing consecutive commas (,,). So you should use:
df = df.drop(df[df.iloc[:, 0].str.contains(',,')].index)
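If you would rather fix the parsing itself, as suggested above, a minimal sketch (assuming the file really is comma-separated and file is the path from the question):

import pandas as pd

# Split on the real delimiter so each field gets its own column,
# then drop every row that has an empty field
result = pd.read_csv(file, sep=",", header=None)
result = result.dropna()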
I think your code should work with a minor change:
result.replace('', np.nan, inplace=True)
result.dropna(inplace=True)
In case you have several rows in your CSV file, you can avoid the extra conversion step to NaN:
result = pandas.read_csv(file)
result = result[result.notnull().all(axis = 1)]
This will remove any row where there is an empty element.
However, your added comment explains that there is just one row in the CSV file, and it seems that the CSV reader shows some special behavior. Since you need to select the columns without NaN, I suggest these lines:
result = pandas.read_csv(file, header = None)
selected_columns = result.columns[result.notnull().any()]
result = result[selected_columns]
Note the option header = None with read_csv.
I'm having a tough time correctly loading a CSV file into a pandas DataFrame. The file is a CSV saved from MS Excel, where the rows look like this:
Montservis, s.r.o.;"2 012";"-14.98";"-34.68";"- 11.7";"0.02";"0.09";"0.16";"284.88";"10.32";"
I am using
filep="file_name.csv"
raw_data = pd.read_csv(filep,engine="python",index_col=False, header=None, delimiter=";")
(I have tried several combinations and alternatives of read_csv arguments, but without any success... I have also tried read_table.)
What I want to see in my dataframe is that each semicolon-separated value ends up in a separate column (I understand that read_csv works this way(?)).
Unfortunately, I always end up with the whole row being placed in the first column of the dataframe. So basically, after loading I have many rows but only one column (two if I also count the index).
I have placed a sample here:
datafile
Any ideas welcomed.
Add quoting=3. 3 stands for QUOTE_NONE.
raw_data = pd.read_csv(filep,engine="python",index_col=False, header=None, delimiter=";", quoting = 3)
This will give a [7 rows x 23 columns] dataframe.
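As an optional follow-up (a sketch; whether you want this depends on the data), the literal " characters stay inside the values when quoting is disabled, and can be stripped afterwards:

# Remove the leading/trailing " that QUOTE_NONE leaves in place, column by column
raw_data = raw_data.apply(lambda s: s.str.strip('"') if s.dtype == object else s)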
The problem is the enclosing quote characters; they can be ignored by prefixing the delimiter with the \ character, which turns it into a regex separator and makes pandas split on the semicolons without treating the quotes specially:
raw_data = pd.read_csv(filep,engine="python",index_col=False, header=None, delimiter='\;')