I'm attempting the following workflow:
Import large .csv file and isolate by column.
Remove duplicates
Import Second, smaller .csv file and isolate by column.
Concatenate the two into a single column in a dataframe and remove the duplicates so I can check the diff.
Here's the code I have so far.
import pandas as pd
df1 = pd.read_csv('file.csv', header=0, usecols=[4]).drop_duplicates()
# Place a checkpoint here to verify the file wrote correctly.
df1.to_csv('orders_new.csv', index=False, header=None, encoding='utf-8')
df2 = pd.read_csv('file2.csv', header=0, usecols=[0])
# Place a checkpoint here to verify the file wrote correctly.
df2.to_csv('tables_new.csv', index=False, header=None, encoding='utf-8')
new_dataframe = pd.concat([df1, df2], ignore_index=True).drop_duplicates(keep=False)
new_dataframe.to_csv('output.csv')
When I look in the output.csv file, file1 has been properly imported and dupes removed and dropped into the new file, but when I get to the point where that file ends and the next begins, the file 2 data ends up in the row to the right of it.
Looking for some guidance here.
Thanks!
I tried doing a simple concat within pandas, and it worked but the output was in two columns, and not one.
Related
I have a csv file with a wrong first row data. The names of labels are in the row number 2. So when I am storing this file to the DataFrame the names of labels are incorrect. And correct names become values of the row 0. Is there any function similar to reset_index() but for columns? PS I can not change csv file. Here is an image for better understanding. DataFrame with wrong labels
Hello let's suppose you csv file is data.csv :
Try this code:
import pandas as pd
#reading the csv file
df = pd.read_csv('data.csv')
#changing the headers name to integers
df.columns = range(df.shape[1])
#saving the data in another csv file
df.to_csv('data_without_header.csv',header=None,index=False)
#reading the new csv file
new_df = pd.read_csv('data_without_header.csv')
#plotting the new data
new_df.head()
If you do not care about the rows preceding your column names, you can pass in the "header" argument with the value of the correct row, for example if the proper column names are in row 2:
df = pd.read_csv('my_csv.csv', header=2)
Keep in mind that this will erase the previous rows from the DataFrame. If you still want to keep them, you can do the following thing:
df = pd.read_csv('my_csv.csv')
df.columns = df.iloc[2, :] # replace columns with values in row 2
Cheers.
I am trying to save a csv to a folder after making some edits to the file.
Every time I use pd.to_csv('C:/Path of file.csv') the csv file has a separate column of indexes. I want to avoid printing the index to csv.
I tried:
pd.read_csv('C:/Path to file to edit.csv', index_col = False)
And to save the file...
pd.to_csv('C:/Path to save edited file.csv', index_col = False)
However, I still got the unwanted index column. How can I avoid this when I save my files?
Use index=False.
df.to_csv('your.csv', index=False)
There are two ways to handle the situation where we do not want the index to be stored in csv file.
As others have stated you can use index=False while saving your
dataframe to csv file.
df.to_csv('file_name.csv',index=False)
Or you can save your dataframe as it is with an index, and while reading you just drop the column unnamed 0 containing your previous index.Simple!
df.to_csv(' file_name.csv ')
df_new = pd.read_csv('file_name.csv').drop(['unnamed 0'],axis=1)
If you want no index, read file using:
import pandas as pd
df = pd.read_csv('file.csv', index_col=0)
save it using
df.to_csv('file.csv', index=False)
As others have stated, if you don't want to save the index column in the first place, you can use df.to_csv('processed.csv', index=False)
However, since the data you will usually use, have some sort of index themselves, let's say a 'timestamp' column, I would keep the index and load the data using it.
So, to save the indexed data, first set their index and then save the DataFrame:
df.set_index('timestamp')
df.to_csv('processed.csv')
Afterwards, you can either read the data with the index:
pd.read_csv('processed.csv', index_col='timestamp')
or read the data, and then set the index:
pd.read_csv('filename.csv')
pd.set_index('column_name')
Another solution if you want to keep this column as index.
pd.read_csv('filename.csv', index_col='Unnamed: 0')
If you want a good format next statement is the best:
dataframe_prediction.to_csv('filename.csv', sep=',', encoding='utf-8', index=False)
In this case you have got a csv file with ',' as separate between columns and utf-8 format.
In addition, numerical index won't appear.
def Text2Col(df_File):
for i in range(0,len(df_File)):
with open(df_File.iloc[i]['Input']) as inf:
with open(df_File.iloc[i]['Output'], 'w') as outf:
i=0
for line in inf:
i=i+1
if i==2 or i==3:
continue
outf.write(','.join(line.split(';')))
Above code is used to convert a csv file from text to column.
This code makes all values string ( because split() ) which is problematic for me.
I tried using map function but cant make it.
Is there any other way in which I can do this.
My input file has 5 columns, the first column is string, the second is int and the rest are float.
I think it required some modification in last statement
outf.write(','.join(line.split(';')))
Please let me know if any other input is required.
Ok, trying to help here. If this doesn't work, please specify in your question, what you're missing or what else needs to be done:
Use pandas to read in a csv file:
import pandas as pd
df = pd.read_csv('your_file.csv')
If you have a header on the first row, then use:
import pandas as pd
df = pd.read_csv('your_file.csv', header=0)
If you have a tab delimiter instead of a comma delimiter, then use:
import pandas as pd
df = pd.read_csv('your_file.csv', header=0, sep='\t')
Thank you !
Following Code worked:
def Text2Col(df_File):
for i in range(0,len(df_File)):
df = pd.read_csv(df_File.iloc[i]['Input'],sep=';')
df = df[df.index != 0]
df= df[df.index != 1]
df.to_csv(df_File.iloc[i]['Output'])
File_List="File_List.csv"
df_File=pd.read_csv(File_List)
Text2Col(df_File)
Input files are kept in same folder with same name as mentioned in File_List.xls
Output files will be created in same folder with separated in column. I deleted row 0 and 1 for my use. One can skip or add depending upon his requirement.
In above code df_file is dataframe contain two column list, first column is input file name and second column is output file name.
I have 30 csv files. I want to give it as input in for loop, in pandas?
Each file has names such as fileaa, fileab,fileac,filead,....
And i would like to receive one output.
Usually i use read_csv but due to memory error, 'read_csv' doesn't work.
f = "./file.csv"
df = pd.read_csv(f, sep="/", header=0, dtype=str)
P.S only the first file has the column title but number of columns are same
I am trying to save a csv to a folder after making some edits to the file.
Every time I use pd.to_csv('C:/Path of file.csv') the csv file has a separate column of indexes. I want to avoid printing the index to csv.
I tried:
pd.read_csv('C:/Path to file to edit.csv', index_col = False)
And to save the file...
pd.to_csv('C:/Path to save edited file.csv', index_col = False)
However, I still got the unwanted index column. How can I avoid this when I save my files?
Use index=False.
df.to_csv('your.csv', index=False)
There are two ways to handle the situation where we do not want the index to be stored in csv file.
As others have stated you can use index=False while saving your
dataframe to csv file.
df.to_csv('file_name.csv',index=False)
Or you can save your dataframe as it is with an index, and while reading you just drop the column unnamed 0 containing your previous index.Simple!
df.to_csv(' file_name.csv ')
df_new = pd.read_csv('file_name.csv').drop(['unnamed 0'],axis=1)
If you want no index, read file using:
import pandas as pd
df = pd.read_csv('file.csv', index_col=0)
save it using
df.to_csv('file.csv', index=False)
As others have stated, if you don't want to save the index column in the first place, you can use df.to_csv('processed.csv', index=False)
However, since the data you will usually use, have some sort of index themselves, let's say a 'timestamp' column, I would keep the index and load the data using it.
So, to save the indexed data, first set their index and then save the DataFrame:
df.set_index('timestamp')
df.to_csv('processed.csv')
Afterwards, you can either read the data with the index:
pd.read_csv('processed.csv', index_col='timestamp')
or read the data, and then set the index:
pd.read_csv('filename.csv')
pd.set_index('column_name')
Another solution if you want to keep this column as index.
pd.read_csv('filename.csv', index_col='Unnamed: 0')
If you want a good format next statement is the best:
dataframe_prediction.to_csv('filename.csv', sep=',', encoding='utf-8', index=False)
In this case you have got a csv file with ',' as separate between columns and utf-8 format.
In addition, numerical index won't appear.