I'm trying to read 100 CSVs and collate data from all into a single CSV.
I made use of:
all_files = pd.DataFrame()
for file in files:
    all_files = all_files.append(pd.read_csv(file, encoding='unicode_escape')).reset_index(drop=True)
where files is the list of filepaths of the 100 CSVs.
Each CSV may have a different number of columns, and within a single CSV each row may have a different number of columns too.
I want to match the column header names, put the data from all the CSVs into the correct columns, and keep adding new columns to my final DataFrame as I go.
The above code works fine for 30-40 CSVs, then breaks with the following error:
ParserError: Error tokenizing data. C error: Expected 16 fields in line 78, saw 17
Any help will be much appreciated!
There are a couple of ways to read variable-length csv files.
First, you can specify the column names beforehand. If you are not sure of the number of columns, you can give a reasonably large number; rows with fewer fields are padded with NaN:
df = pd.read_csv('filename.csv', header=None, names=list(range(10)))
The other option is to read the entire file into a single column using a delimiter that does not occur in the data, such as a tab, and then split on commas:
df = pd.read_csv('filename.csv', header=None, sep='\t')
df = df[0].str.split(',', expand=True)
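If you'd rather not guess a "reasonably large" number for the first option, you can scan the file once for the widest row. A minimal sketch, assuming a file named 'filename.csv' with no quoted fields containing commas:
import pandas as pd

# find the maximum number of comma-separated fields in any row
with open('filename.csv') as f:
    max_cols = max(line.count(',') + 1 for line in f)

# with explicit names, pandas pads shorter rows with NaN instead of raising an error
df = pd.read_csv('filename.csv', header=None, names=list(range(max_cols)))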
It's because you are trying to read all the CSV files into a single DataFrame. The expected number of fields is decided from the first rows pandas parses, and it raises a tokenizing error when a later line contains a different number of fields. If you really want to concat the files, you should read each of them individually, adjust their columns, and then concat them.
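A minimal sketch of that approach, assuming files is the list of filepaths from the question and that rows within each individual file are consistent (otherwise combine this with the names= trick above):
import pandas as pd

# read each file into its own DataFrame first
frames = [pd.read_csv(f, encoding='unicode_escape') for f in files]

# concat aligns on column names and fills NaN where a file lacks a column
all_files = pd.concat(frames, ignore_index=True, sort=False)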
My code is shown below:
import os
import pandas as pd

indexing_file_path = 'indexing.csv'
if not os.path.exists(indexing_file_path):
    df = pd.DataFrame([['1111', '20200101', '20200101'],
                       ['1112', '20200101', '20200101'],
                       ['1113', '20200101', '20200101']],
                      columns=['nname', 'nstart', 'nend'])
else:
    df = pd.read_csv(indexing_file_path, header=0)
print(df)
df.loc[len(df)] = ['1113', '20200202', '20200303']
# the append() method is not working either
print(df)
df.drop_duplicates('nname', keep='last', inplace=True)
print(df)
df.to_csv(indexing_file_path, index=False)
I want to keep the nname column unique in this file.
When the code runs the first time, it saves the records to the csv file correctly, even though 1113 is not unique.
When the code runs a second time, it saves two 1113 rows to the csv file, because the DataFrame is now created from a csv file.
From the third run onwards, it always keeps two 1113 rows.
Now I have a workaround:
1. Save to the csv file with the two 1113 rows.
2. Read the csv file again.
3. Use drop_duplicates again.
4. Save to the csv file again.
Why is a DataFrame created from a csv file so different?
How can I save the unique rows to the csv file in one pass?
I can answer my own question now.
The reason is:
When the DataFrame is created from the csv file, pandas recognizes the nname column as integer.
But when I add the 1113 row again, pandas treats the new row's nname as a string, and the integer 1113 does not equal the string '1113', so pandas keeps both rows.
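A minimal demonstration of the mismatch:
import pandas as pd

# an int and a str that print identically are still different values
s = pd.Series([1113, '1113'])
print(s.duplicated().any())   # False: the int 1113 does not equal the str '1113'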
The solution is:
Read the csv file as strings:
df = pd.read_csv(indexing_file_path, header=0, dtype=str)
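Putting it together, a sketch of the corrected flow (only the dtype=str argument changes; the if/else branch above stays the same):
# read everything as strings so a newly added row (whose values are strings) compares equal
df = pd.read_csv(indexing_file_path, header=0, dtype=str)
df.loc[len(df)] = ['1113', '20200202', '20200303']
df.drop_duplicates('nname', keep='last', inplace=True)
df.to_csv(indexing_file_path, index=False)   # only one 1113 row is saved now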
I'm aware this has been asked so many times, but it's left me really scratching my head. I have a .txt file which looks like:
"0,1,2,3,4, ....63,0,1,2,3,4.....63"
"0,1,2,3,4, ....63,0,1,2,3,4.....63"
"0,1,2,3,4, ....63,0,1,2,3,4.....63"
and so on for multiple rows.
So that's 64+64 = 128 columns separated by commas, while each row is enclosed in double quotes.
I have used the commands:
#Used this initially
df = pd.read_csv('test_data.txt')
#Used this after reading more stackoverflow answers
df = pd.read_csv('test_data.txt', header=None, sep=',', delimiter=',', quotechar='"', index_col=None)
I know sep and delimiter are the same parameter, but I tried both anyway; I shouldn't have to specify them either, since pandas uses commas by default.
After this I'm just using:
df.head()
And it outputs:
0
0 0.00,1.00,2.00,3.00,4.00,5.00,6.00,7.00,8.00,9...
1 0.00,1.00,2.00,3.00,4.00,5.00,6.00,7.00,8.00,9...
2 0.00,1.00,2.00,3.00,4.00,5.00,6.00,7.00,8.00,9...
3 0.00,1.00,2.00,3.00,4.00,5.00,6.00,7.00,8.00,9...
4 0.00,1.00,2.00,3.00,4.00,5.00,6.00,7.00,8.00,9...
It just reads it all as one column; please advise on how I can read all 128 columns.
This will get you to the desired result:
df = pd.read_csv('test_data.txt', header=None)
df = pd.DataFrame(df[0].str.split(',').tolist())
Because each row is wrapped in quote marks, the parser treats the whole line as one quoted field and packs it into a single column.
Then you split that column on commas and construct a new dataframe from the result.
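The split produces strings; if you need numbers (the sample output shows float-like values), a short follow-up, assuming every field is numeric:
# str.split yields strings, so convert each column to a numeric dtype
df = df.apply(pd.to_numeric)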
I want to convert data from a .data file to a .csv file and put the data from the .data file into columns with values under them. However, the .data file has a specific format and I don't know how to put its text into columns. Here is what the .data file looks like:
column1
column2
column3
column4
column5
column6
column7
column8
column9
column10
column11
column12
column13
........
column36
1243;6543;5754;5678;4567;4567;4567;2573;7532;6332;6432;6542;5542;7883;7643;4684;4568;4573
3567;5533;6532;6432;7643;8635;7654;6543;8753;7643;7543;7543;7543;6543;6444;7543;6444;6444
1243;6543;5754;5678;4567;4567;4567;2573;7532;6332;6432;6542;5542;7883;7643;4684;4568;4573
3567;5533;6532;6432;7643;8635;7654;6543;8753;7643;7543;7543;7543;6543;6444;7543;6444;6444
1243;6543;5754;5678;4567;4567;4567;2573;7532;6332;6432;6542;5542;7883;7643;4684;4568;4573
3567;5533;6532;6432;7643;8635;7654;6543;8753;7643;7543;7543;7543;6543;6444;7543;6444;6444
1243;6543;5754;5678;4567;4567;4567;2573;7532;6332;6432;6542;5542;7883;7643;4684;4568;4573
3567;5533;6532;6432;7643;8635;7654;6543;8753;7643;7543;7543;7543;6543;6444;7543;6444;6444
The file as shown above has the names of the 36 columns, each on its own line. Under these are many datapoints, each containing 36 values separated by semicolons. Each datapoint spans 2 lines, and datapoints are separated by a blank line. The .csv file must look like this:
column1,column2,column3,column4,column5,column6,column7,column8,column9,column10,column11,column12,column13,column14,column15,column16,column17,column18,column19,column20,column21,column22,column23,column24,column25,column26,column27,column28,column29,column30,column31,column32,column33,column34,column35,column36
1243,6543,5754,5678,4567,4567,4567,2573,7532,6332,6432,6542,5542,7883,7643,4684,4568,4573,3567,5533,6532,6432,7643,8635,7654,6543,8753,7643,7543,7543,7543,6543,6444,7543,6444,6444
1243,6543,5754,5678,4567,4567,4567,2573,7532,6332,6432,6542,5542,7883,7643,4684,4568,4573,3567,5533,6532,6432,7643,8635,7654,6543,8753,7643,7543,7543,7543,6543,6444,7543,6444,6444
1243,6543,5754,5678,4567,4567,4567,2573,7532,6332,6432,6542,5542,7883,7643,4684,4568,4573,3567,5533,6532,6432,7643,8635,7654,6543,8753,7643,7543,7543,7543,6543,6444,7543,6444,6444
1243,6543,5754,5678,4567,4567,4567,2573,7532,6332,6432,6542,5542,7883,7643,4684,4568,4573,3567,5533,6532,6432,7643,8635,7654,6543,8753,7643,7543,7543,7543,6543,6444,7543,6444,6444
The first line of the .csv file, as shown above, must consist of the 36 column names separated by commas. The following lines must each contain one datapoint, with its 36 values separated by commas.
Can I use the pandas library for this? Anyway, this is my starting code:
with open("file.data") as fIn, open("file.csv", "w") as fOut:
    for r, line in enumerate(fIn):
        if not line:
            break
Thanks
Sure, you can do it with pandas. You just need to read the first N lines (36 in your case) to use as the header, then read the rest of the file like a normal csv (pandas is good at that). Then you can save the pandas.DataFrame object to csv.
Since each datapoint is split across two adjacent lines, we should split the DataFrame we've read in two and stack the halves side by side (horizontally).
Consider the following code:
import pandas as pd

COLUMNS_COUNT = 36

# read the first `COLUMNS_COUNT` lines to serve as the header
with open('data.data', 'r') as f:
    columns = [next(f).strip() for _ in range(COLUMNS_COUNT)]

# read the rest of the file into a temporary DataFrame
temp_df = pd.read_csv('data.data', skiprows=COLUMNS_COUNT, header=None, delimiter=';', skip_blank_lines=True)

# split the temporary DataFrame into even and odd rows
even_df = temp_df.iloc[::2].reset_index(drop=True)
odd_df = temp_df.iloc[1::2].reset_index(drop=True)

# stack the even and odd DataFrames horizontally
df = pd.concat([even_df, odd_df], axis=1)

# assign column names
df.columns = columns

# save the resulting DataFrame to csv
df.to_csv('out.csv', index=False)
UPD: code updated to correctly process datapoints split across two lines.
I'm having a tough time loading a csv file into a pandas dataframe correctly. The file is a csv saved from MS Excel, where the rows look like this:
Montservis, s.r.o.;"2 012";"-14.98";"-34.68";"- 11.7";"0.02";"0.09";"0.16";"284.88";"10.32";"
I am using
filep="file_name.csv"
raw_data = pd.read_csv(filep,engine="python",index_col=False, header=None, delimiter=";")
(I have tried several combinations and alternatives of read_csv arguments, but without any success... I have also tried read_table.)
What I want to see in my dataframe is each semicolon-separated value in its own column (I understand that this is how read_csv works?).
Unfortunately, I always end up with the whole row placed in the first column of the dataframe. So basically after loading I have many rows, but only one column (two if I also count the index).
I have placed sample here:
datafile
Any ideas welcome.
Add quoting=3. 3 is the value of QUOTE_NONE in the standard csv module.
raw_data = pd.read_csv(filep,engine="python",index_col=False, header=None, delimiter=";", quoting = 3)
This will give [7 rows x 23 columns] dataframe
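Equivalently, you can pass the named constant from the standard csv module instead of the magic number 3:
import csv
import pandas as pd

# csv.QUOTE_NONE == 3: tell the parser not to treat quotes specially
raw_data = pd.read_csv(filep, engine="python", index_col=False, header=None,
                       delimiter=";", quoting=csv.QUOTE_NONE)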
The problem is the enclosing quote characters, which can be worked around by escaping the delimiter with a \ character (the python engine then treats it as a regular expression):
raw_data = pd.read_csv(filep, engine="python", index_col=False, header=None, delimiter='\;')