Pandas data frame, to_csv creating duplicate rows - python

Here is my current code:
data = pd.read_csv('file', sep='\t', header=[2])
ndf = pd.DataFrame(data=nd)
new_data = pd.concat([data, ndf])
new_data.to_csv('file', sep='\t', index=False, mode='a', header=False)
The file I am reading has 3 header rows. The headers in the first 2 rows are not used, but I need to keep them there.
The headers in row 3 are the same as the headers in ndf, so when I concat data and ndf the new_data dataframe is correctly aligned. There's no problem there.
The problem comes when I try to write new_data back to the original file in append mode. Every row of data that was in the original file is duplicated. This happens each time.
I have tried adding drop_duplicates:
new_data = pd.concat([data, ndf]).drop_duplicates(subset='item_sku', keep=False)
But this still leaves me with 2 of each row each time I write back to the file.
I also tried reading the file with multiple header rows: header=[0, 1, 2]
But this makes the concat fail, I'm guessing because I haven't told the concat function which row of headers to align with. I think passing keys= would work, but I'm not understanding the documentation very well.
EDIT-
This is an example of the file I am reading
load v1.0 74b FlatFile
ver raid week month
Dept Date Sales IsHoliday
1 2010-02-05 24924.50 False
This would be the data I am trying to append
Dept Date Sales IsHoliday
3 2010-07-05 6743.50 False
And this is the output I am getting
load v1.0 74b FlatFile
ver raid week month
Dept Date Sales IsHoliday
1 2010-02-05 24924.50 False
1 2010-02-05 24924.50 False
3 2010-07-05 6743.50 False

Try re-setting the columns of nd to the three-level header before concat:
data = pd.read_csv("file1.csv",sep="\t",header=[0,1,2])
nd = pd.read_csv("file2.csv",sep="\t")
nd.columns = data.columns
output = pd.concat([data,nd])
output.to_csv('file', sep='\t', index=False)
>>> output
load v1.0 74b FlatFile
ver raid week month
Dept Date Sales IsHoliday
0 1 2010-02-05 24924.5 False
0 3 2010-07-05 6743.5 False

I'm sure there's a better way of doing it but I've ended up with this result that works.
data = pd.read_csv('file', sep='\t', header=[0, 1, 2])
columns = data.columns
ndf = pd.DataFrame(data=nd, columns=data.columns.get_level_values(2))
data.columns = data.columns.get_level_values(2)
new_data = pd.concat([data, ndf])
new_data.columns = columns
new_data.to_csv('file', sep='\t', index=False, header=True)
What I did was set ndf to have the same columns as the third header row of data, then did the same for data itself.
This allowed me to concat the two dataframes.
That still left me without the first 2 header rows, but because I had saved the columns from the original data file, I could assign them back to their original values before saving to csv.
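The duplication in the question comes from writing new_data (old rows plus new rows) in append mode to a file that already holds the old rows. A minimal sketch of an alternative, assuming the goal is only to add the new rows: append just ndf and leave the three header rows in the file untouched. The temporary path and sample values are made up to mirror the example above, and the file is assumed to end with a newline.

```python
import os
import tempfile

import pandas as pd

# Hypothetical sample file mirroring the question: 3 header rows, 1 data row.
path = os.path.join(tempfile.mkdtemp(), "sales.tsv")
with open(path, "w") as f:
    f.write("load\tv1.0\t74b\tFlatFile\n")
    f.write("ver\traid\tweek\tmonth\n")
    f.write("Dept\tDate\tSales\tIsHoliday\n")
    f.write("1\t2010-02-05\t24924.50\tFalse\n")

# The new rows to add (stand-in for ndf in the question).
ndf = pd.DataFrame(
    {"Dept": [3], "Date": ["2010-07-05"], "Sales": [6743.50], "IsHoliday": [False]}
)

# Read only the third header row to learn the column order; no data is loaded.
cols = pd.read_csv(path, sep="\t", header=2, nrows=0).columns

# Append the new rows only, so rows already in the file are not written again.
ndf[cols].to_csv(path, sep="\t", index=False, header=False, mode="a")

with open(path) as f:
    lines = f.read().splitlines()
print(len(lines))
```

With this approach the existing file content is never rewritten, so the first two header rows survive without any MultiIndex handling.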

Related

Pandas CSV Move first row to header row

I have this table, which I export to CSV using this code:
df['time'] = df['time'].astype("datetime64").dt.date
df = df.set_index("time")
df = df.groupby(df.index).agg(['min', 'max', 'mean'])
df = df.reset_index()
df = df.to_csv(r'C:\****\Exports\exportMMA.csv', index=False)
While exporting this, my result is:
column 1      column 2    column 3
time          BufTF2      BufTF3
12/12/2022    10          150
I want to get rid of column 1, 2, 3 and replace the header with BufTF2 and BufTF3.
I tried this:
new_header = df.iloc[0] #grab the first row for the header
df = df[1:] #take the data less the header row
df.columns = new_header #set the header row as the df header
And this:
df.columns = df.iloc[0]
df = df[1:]
Somehow it won't work. I don't really need to replace the headers in the dataframe; having the right headers in the CSV is more important.
Thanks!
You can try rename:
df = df.rename(columns=df.iloc[0]).drop(df.index[0])
When loading the input file, you can specify which row to use as the header:
pd.read_csv(inputfile,header=1) # this will use the 2nd row in the file as column titles
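As a small illustration of the header argument, the sketch below reads an in-memory CSV shaped like the export in the question. The CSV text is an assumption made up to mirror the table shown there; with header=1 the first line is discarded and the second becomes the column names.

```python
from io import StringIO

import pandas as pd

# Hypothetical CSV text mirroring the exported table in the question.
csv_text = (
    "column 1,column 2,column 3\n"
    "time,BufTF2,BufTF3\n"
    "12/12/2022,10,150\n"
)

# header=1 uses the second line as the header; the line above it is skipped.
df = pd.read_csv(StringIO(csv_text), header=1)
print(list(df.columns))
```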

How to change my Excel file values using Python

I have an Excel file like this, and I want the numbers in the date field to be converted to a date like (2021.7.22) and written back to the date field using Python.
You can try something like this:
import pandas as pd
# sheet_name=None loads every worksheet into a dict of DataFrames
dfs = pd.read_excel('Test.xlsx', sheet_name=None)
output = {}
for ws, df in dfs.items():
    if 'date' in df.columns:
        # rebuild the integer as 'YYYY.M.DD' text, then parse it into a date
        df['date'] = pd.to_datetime(df['date'].apply(lambda x: f'{str(x)[:4]}.{str(x)[4:6 if len(str(x)) > 7 else 5]}.{str(x)[-2:]}')).dt.date
    output[ws] = df
writer = pd.ExcelWriter('TestOutput.xlsx')
for ws, df in output.items():
    df.to_excel(writer, index=False, sheet_name=ws)
# close() also saves the file; writer.save() was removed in pandas 2.0
writer.close()
For each worksheet containing the column date in the input xlsx file, it will convert the integer it finds to a date, assuming that the month portion may be 1 or 2 digits and that the day portion is always a full 2 digits. If the actual month/day protocol in your data is different, you can adjust the logic accordingly.
The code creates a new output xlsx reflecting the above changes.
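The slicing rule described above can be tried without an Excel file. This is a minimal sketch under the same assumption (month may be 1 or 2 digits, day is always 2 digits); the helper name int_to_date and the sample values are hypothetical.

```python
from datetime import date

import pandas as pd


def int_to_date(value):
    """Split an integer such as 20210722 or 2021722 into year, month, day."""
    text = str(value)
    # 8 digits means a 2-digit month; 7 digits means a 1-digit month.
    month_end = 6 if len(text) > 7 else 5
    return date(int(text[:4]), int(text[4:month_end]), int(text[-2:]))


# Hypothetical sample values standing in for the worksheet's date column.
df = pd.DataFrame({"date": [20210722, 2021722, 20211105]})
df["date"] = df["date"].apply(int_to_date)
print(df)
```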

How to append a pandas dataframe to a csv and if necessary create new columns?

So I have an output called df that came from a pandas dataframe created within a loop of many films like this, so each df is the data for one film:
df = pandas.get_dummies(data=df, columns=['genre1', 'genre2', 'genre3', 'genre4'])
df = df.rename(columns=lambda x: x.replace('genre1_', ''))
df = df.rename(columns=lambda x: x.replace('genre2_', ''))
df = df.rename(columns=lambda x: x.replace('genre3_', ''))
df = df.rename(columns=lambda x: x.replace('genre4_', ''))
df = pd.concat([df[col].sum(axis=1).rename(col) if len(df[col].shape)==2 else
                df[col] for col in df.columns.unique()], axis=1)
print(df)
with open('test.csv', 'a') as f:
    df.to_csv(f, mode='a', header=f.tell()==0)
But the problem is that each iteration of the loop has different genres from the one before.
So for the first loop the output looks like this:
title runTime comedy action drama biography ......
film1 90mins 1 1 1 1
which then gets written to the csv.
But on the next iteration of the loop the next film is as follows:
title runTime comedy action history ......
film2 90mins 1 1 1
I now want it to create a new column called history with a 1 in that row for film2 but a 0 for film1, and assign 0 to the biography and drama columns for film2.
Currently it simply treats the first film's header as the default and then assumes every other film has the same genres.
Appending to the CSV file is what makes this impossible: the header is frozen on the first write, before you have all the potential headers.
An easier approach is to build the full dataframe first and only then apply the same processing you already have, as in this example:
# initialize full dataframe
df_full = pd.DataFrame()
#... loop reading df of one film data
# once created the raw df, you would do:
df_full = pd.concat([df_full,df])
# when finished, run your code on df_full, like:
df_full = pd.get_dummies(data=df_full, columns=['genre1', 'genre2', 'genre3', 'genre4'])
# continue with the rest of your code, eventually writing the csv file
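To see why building the full dataframe first solves the missing-column problem, here is a minimal sketch. It assumes, for brevity, that each per-film frame is already one-hot encoded; the film data is made up to match the question. pd.concat takes the union of the columns, and the gaps it leaves as NaN can then be set to 0.

```python
import pandas as pd

# Hypothetical per-film frames with different genre columns.
film1 = pd.DataFrame({"title": ["film1"], "runTime": ["90mins"],
                      "comedy": [1], "action": [1], "drama": [1], "biography": [1]})
film2 = pd.DataFrame({"title": ["film2"], "runTime": ["90mins"],
                      "comedy": [1], "action": [1], "history": [1]})

# concat aligns on column names; genres missing from a film become NaN.
df_full = pd.concat([film1, film2], ignore_index=True)

# Replace the NaN gaps in the genre columns with 0.
genre_cols = df_full.columns.difference(["title", "runTime"])
df_full[genre_cols] = df_full[genre_cols].fillna(0).astype(int)
print(df_full)
```

Writing df_full once with to_csv then produces a single header that covers every genre.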
Following the OP's comment, here is code to handle the situation where this cannot be done in memory:
# initialize the full set of genres and the fixed list of genre columns
full_genre_cols = set()
genre_cols = ['genre1', 'genre2', 'genre3', 'genre4']
# first pass: for each raw dataframe, record the genres it contains
non_genre_cols = df.drop(genre_cols, axis=1).columns
df = pd.get_dummies(df, columns=genre_cols, prefix_sep='', prefix='')
actual_genre_cols = df.drop(non_genre_cols, axis=1).columns
full_genre_cols.update(actual_genre_cols)
# ... finish reading all dataframes, then start over
# second pass: build dataframes with the full column set and append them to the CSV file
# prepare the raw df and transform it
non_genre_cols = df.drop(genre_cols, axis=1).columns
df = pd.get_dummies(df, columns=genre_cols, prefix_sep='', prefix='')
actual_genre_cols = df.drop(non_genre_cols, axis=1).columns
# prepare the full column list (sorted so the order is the same on every cycle)
full_cols = list(non_genre_cols)
full_cols.extend(sorted(full_genre_cols))
# keep the current values and set genres that do not exist in this cycle to 0
df_fullcols = df.reindex(columns=full_cols, fill_value=0)
# all that is left is to append to the file
with open('test.csv', 'a') as f:
    df_fullcols.to_csv(f, mode='a', header=f.tell()==0)
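The two-pass idea can be sketched end to end on toy data. This is an illustration under stated assumptions, not the answerer's exact code: the frames are already one-hot encoded, the film data and temporary path are made up, and DataFrame.reindex is used to add the missing genre columns with 0.

```python
import os
import tempfile

import pandas as pd

# Hypothetical per-film frames with different genre columns.
films = [
    pd.DataFrame({"title": ["film1"], "runTime": ["90mins"],
                  "comedy": [1], "action": [1], "drama": [1], "biography": [1]}),
    pd.DataFrame({"title": ["film2"], "runTime": ["90mins"],
                  "comedy": [1], "action": [1], "history": [1]}),
]
id_cols = ["title", "runTime"]

# Pass 1: collect every genre that appears in any film.
all_genres = set()
for film in films:
    all_genres.update(c for c in film.columns if c not in id_cols)
full_cols = id_cols + sorted(all_genres)

# Pass 2: give each frame the full column set, then append it to the CSV.
path = os.path.join(tempfile.mkdtemp(), "test.csv")
for film in films:
    row = film.reindex(columns=full_cols, fill_value=0)
    # Write the header only when the file does not exist yet.
    row.to_csv(path, mode="a", index=False, header=not os.path.exists(path))

result = pd.read_csv(path)
print(result)
```

Only one film is held in memory at a time during the second pass, at the cost of reading the source data twice.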

Problems when pandas reading Excel file that has blank top row and left column

I tried to read an Excel file that looks like below,
I was using pandas like this
xls = pd.ExcelFile(file_path)
assets = xls.parse(sheetname="Sheet1", header=1, index_col=1)
But I got error
ValueError: Expected 4 fields in line 3, saw 5
I also tried
assets = xls.parse(sheetname="Sheet1", header=1, index_col=1, parse_cols="B:E")
But I got misparsed result as follows
Then tried
assets = xls.parse(sheetname="Sheet1", header=1, index_col=0, parse_cols="B:E")
This finally works, but why index_col=0 and parse_cols="B:E"? This confuses me because, based on the pandas documentation, assets = xls.parse(sheetname="Sheet1", header=1, index_col=1) should work fine. Have I missed something?
The read_excel documentation is not clear on one point.
skiprows=1 skips the first empty row at the top of the file; header=1 also works, using the second row as the column index.
parse_cols='B:E' skips the first empty column at the left of the file.
index_col=0 is optional and defines the first parsed column (B in this example) as the DataFrame index. This is where the mistake was: index_col is relative to the columns selected through the parse_cols parameter.
With your example, you can use the following code (in recent pandas versions these arguments are named sheet_name and usecols):
pd.read_excel('test.xls', sheetname='Sheet1', skiprows=1,
              parse_cols='B:E', index_col=0)
# AA BB CC
# 10/13/16 1 12 -1
# 10/14/16 3 12 -2
# 10/15/16 5 12 -3
# 10/16/16 3 12 -4
# 10/17/16 5 23 -5

pd.read_csv ignores columns that don't have headers

I have a .csv file that is generated by a third-party program. The data in the file is in the following format:
%m/%d/%Y 49.78 85 6 15
03/01/1984 6.63368 82 7 9.8 34.29056405 2.79984079 2.110346498 0.014652412 2.304545521 0.004732732
03/02/1984 6.53368 68 0 0.2 44.61471002 3.21623666 2.990408898 0.077444779 2.793385466 0.02661873
03/03/1984 4.388344 55 6 0 61.14463457 3.637231063 3.484310818 0.593098236 3.224973641 0.214360796
There are 5 column headers (row 1 in excel, columns A-E) but 11 columns in total (row 1 columns F-K are empty, rows 2-N contain float values for columns A-K)
I was not sure how to paste the .csv lines in so they are easily replicable, sorry for that. An image of the excel sheet is shown here: Excel sheet to read in
when I use the following code:
FWInds=pd.read_csv("path.csv")
or:
FWInds=pd.read_csv("path.csv", header=None)
the resulting dataframe FWInds does not contain the last 6 columns - it only contains the columns with headers (columns A-E from excel, column A as index values).
FWIDat.shape
Out[48]: (245, 4)
Ultimately the last 6 columns are the only ones I even want to read in.
I also tried:
FWInds=pd.read_csv('path.csv', header=None, index_col=False)
but got the following error
CParserError: Error tokenizing data. C error: Expected 5 fields in line 2, saw 11
I also tried to ignore the first row since the column titles are unimportant:
FWInds=pd.read_csv('path.csv', header=None, skiprows=0)
but get the same error.
Also no luck with the "usecols" parameter, it doesn't seem to understand that I'm referring to the column numbers (not names), unless I'm doing it wrong:
FWInds=pd.read_csv('path.csv', header=None, usecols=[5,6,7,8,9,10])
Any tips? I'm sure it's an easy fix but I'm very new to python.
There are a couple of parameters that can be passed to pd.read_csv():
import pandas as pd
colnames = list('ABCDEFGHIKL')
df = pd.read_csv('test.csv', sep='\t', names=colnames)
With this, I can import your data without problems (and it is accessible via e.g. df['K'] afterwards).
You could do it as shown:
col_name = list('ABCDEFGHIJK')
data = 'path.csv'
pd.read_csv(data, delim_whitespace=True, header=None, names=col_name, usecols=col_name[5:])
To read all the columns from A → K, simply omit the usecols parameter.
Data:
from io import StringIO
import pandas as pd
data = StringIO(
'''
%m/%d/%Y,49.78,85,6,15
03/01/1984,6.63368,82,7,9.8,34.29056405,2.79984079,2.110346498,0.014652412,2.304545521,0.004732732
03/02/1984,6.53368,68,0,0.2,44.61471002,3.21623666,2.990408898,0.077444779,2.793385466,0.02661873
03/03/1984,4.388344,55,6,0,61.14463457,3.637231063,3.484310818,0.593098236,3.224973641,0.214360796
''')
col_name = list('ABCDEFGHIJK')
pd.read_csv(data, header=None, names=col_name, usecols=col_name[5:])
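Since the question says the first row is unimportant, another option is to skip it. Note that skiprows=0 skips nothing, while skiprows=1 drops the short first line, after which every remaining line has 11 fields and integer usecols positions work. A minimal sketch, assuming comma-separated data as in the sample above (the variable names are illustrative):

```python
from io import StringIO

import pandas as pd

# Sample lines from the question, assumed here to be comma-separated.
raw = (
    "%m/%d/%Y,49.78,85,6,15\n"
    "03/01/1984,6.63368,82,7,9.8,34.29056405,2.79984079,2.110346498,0.014652412,2.304545521,0.004732732\n"
    "03/02/1984,6.53368,68,0,0.2,44.61471002,3.21623666,2.990408898,0.077444779,2.793385466,0.02661873\n"
    "03/03/1984,4.388344,55,6,0,61.14463457,3.637231063,3.484310818,0.593098236,3.224973641,0.214360796\n"
)

# skiprows=1 drops the 5-field first line; usecols picks the last 6 columns by position.
df = pd.read_csv(StringIO(raw), header=None, skiprows=1, usecols=list(range(5, 11)))
print(df.shape)
```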
