Pandas: Reading CSV files with different delimiters - merge error - python

I have 4 separate CSV files that I wish to read into Pandas. I want to merge these CSV files into one dataframe.
The problem is that the columns within the CSV files contain the following: , ; | and spaces. Therefore I have to use different delimiters when reading the different CSV files and do some transformations to get them in the correct format.
Each CSV file contains an 'ID' column. When I merge my dataframes, it is not done correctly and I get 'NaN' in the column which has been merged.
Do you have to use the same delimiter in order for the dataframes to merge properly?

In short: no, the files do not need to share a delimiter for pandas DataFrames to merge properly. Once each file has been imported (which requires setting the right delimiter for that file), the data lives in memory and the DataFrame keeps no record of the original delimiter. You can see this by writing your imported DataFrames back out with the .to_csv method: the delimiter will be , by default, whatever the input used.
To understand what is going wrong with your merge, please post more details about your data and the code you are using to perform the operation.
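As a rough illustration of the point above, here is a minimal sketch in which three inputs use different delimiters and still merge on ID. The sample data and variable names are made up for the example, and io.StringIO stands in for real file paths. If a merge like this still gives NaN on your data, comparing the dtypes of the ID columns and checking the keys for stray whitespace is a reasonable first step, though without seeing the data that is only a guess.

```python
import io
import pandas as pd

# Hypothetical file contents; each one uses a different delimiter.
comma_csv = io.StringIO("ID,name\n1,alice\n2,bob\n")
semicolon_csv = io.StringIO("ID;city\n1;Paris\n2;Oslo\n")
pipe_csv = io.StringIO("ID|score\n1|10\n2|20\n")

# The delimiter only matters at read time.
df_a = pd.read_csv(comma_csv, sep=",")
df_b = pd.read_csv(semicolon_csv, sep=";")
df_c = pd.read_csv(pipe_csv, sep="|")

# Quick diagnostic: the key column should have the same dtype everywhere.
print(df_a["ID"].dtype, df_b["ID"].dtype, df_c["ID"].dtype)

# Merge on the shared key; the original delimiters play no role here.
merged = df_a.merge(df_b, on="ID", how="left").merge(df_c, on="ID", how="left")
print(merged)
```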

Related

Split a spark dataframe into multiple frames and write as CSV

I have a use case where in I am reading data from a source into a dataframe, doing a groupBy on a field and essentially breaking that dataframe into an array of dataframes.
My target state is to have all these dataframes written as individual CSV files in S3 (CSV because they need to be downloaded by the client and need to be human readable).
What's the best way of going about this?
I used this to split df into df_array : df_array = [(df.where(df[column_name] == i),i) for i in distinct_values]
And I call df.toPandas().to_csv(output_path + '.csv', index=False) on each dataframe individually to convert it to a CSV file. The challenges I face with this approach are:
My understanding is that, since I require a single CSV file per value of my grouping field, toPandas() will bring data from all worker nodes to the driver and may cause a driver OOM issue.
I am unable to use Python multiprocessing to write the individual dataframes to S3, since the data is distributed on the worker nodes, and I get an error: Spark: Broadcast variables: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation
No space left on device.
The pipeline is pretty slow as well, what is the better way I can approach this use case?
[EDIT]
I want to control the name of the CSV file which gets created as well. Target state is 1 CSV file per my group-by field ( let's call that Name ) so if there are 10 different Names in my initial df, output will be 10 CSV files each with the title as Name1.csv, Name2.csv and so on
As you're using pyspark, why don't you use the repartition and partitionBy to achieve your goal?
df.repartition(1).write.partitionBy('grouping_field1', 'grouping_field2', ...).save('/path/to/save', format='csv')

Can Pandas output inferred schema for a CSV file?

Is there a method I can use to output the inferred schema on a large CSV using pandas?
In addition, is there any way to have it tell me, along with the type, whether each column is nullable/blank based on the CSV?
File is about 500k rows with 250 columns.
With my new job, I'm constantly being handed CSV files with zero format documentation.
Is it necessary to load the whole CSV file? At the least, you can use the read_csv function if you know the separator, or cat the file to find out what the separator is. Then use .info():
df = pd.read_csv(path_to_file,...)
df.info()
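To get something closer to a schema listing, one option is to put the inferred dtypes next to a per-column blank check. This is only a sketch: the sample data and the has_blank column name are invented for the example, and the dtypes are whatever pandas infers from the rows it reads, not a guarantee about the file's real format.

```python
import io
import pandas as pd

# Hypothetical sample standing in for the real file.
sample = io.StringIO("id,price,comment\n1,9.5,ok\n2,,\n3,7.0,late\n")
df = pd.read_csv(sample)

# One row per column: the inferred dtype and whether any value was blank.
schema = pd.DataFrame({
    "dtype": df.dtypes.astype(str),
    "has_blank": df.isna().any(),
})
print(schema)
```

For a large file, passing nrows to read_csv limits how much is loaded, at the cost of inferring types from only the first rows.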

Is there a way to import several .txt files each becoming a separate dataframe using pandas?

I have to work with 50+ .txt files each containing 2 columns and 631 rows where I have to do different operations to each (sometimes with each other) before doing data analysis. I was hoping there was a way to import each text file under a different dataframe in pandas instead of doing it individually. The code I've been using individually has been
df = pd.read_table(file_name, skiprows=1, index_col=0)
print(df)
I use index_col=0 because the first column is the x-value. I use skiprows=1 because I have to drop the title, which is the first row (and the file name in the folder) of each .txt file. I was thinking maybe I could use the glob package, import everything from the folder as a single dataframe, and then split it into different dataframes while keeping the first column as the name of each variable. Is there a feasible way to import all of these files at once into different dataframes from a folder, storing them under the first column name? All .txt files would be dataframes of 2 columns x 631 rows, not including the first title row. All values in the columns are integers.
Thank you
Yes. If you store your file names in a list named filelist (maybe using glob), you can use the following command to read all the files and store them in a dict.
dfdict = {f: pd.read_table(f,...) for f in filelist}
Then you can use each data frame with dfdict["filename.txt"].
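A fuller sketch of the same idea, with glob building filelist. The temporary folder, the file names, and the x/y header line are all assumptions made so the example runs on its own; the read_table arguments are the ones from the question. The dict here is keyed by the bare file name rather than the full path.

```python
import glob
import os
import tempfile
import pandas as pd

with tempfile.TemporaryDirectory() as folder:
    # Create two tiny stand-in files: a title line, a header line, then data.
    for name in ("a.txt", "b.txt"):
        with open(os.path.join(folder, name), "w") as fh:
            fh.write(f"{name}\nx\ty\n1\t10\n2\t20\n")

    # Collect every .txt file in the folder.
    filelist = sorted(glob.glob(os.path.join(folder, "*.txt")))

    # One dataframe per file, keyed by the file name.
    dfdict = {
        os.path.basename(f): pd.read_table(f, skiprows=1, index_col=0)
        for f in filelist
    }

print(dfdict["a.txt"])
```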

Why does the output of pd.read_csv() and pd.read_excel() deliver a dataframe with different functioning column headers?

I'm importing an .xlsx file with pd.read_excel(). I received this .xlsx file as a CSV file and used Excel to separate it by comma so I get the proper .xlsx file with columns etc. Six of the dataframe columns have a number as header (e.g. 5030, 5031,...). When I want to change the column name with df = df.rename(columns={...}) this does not work. Also df["5030"] does not work; it throws an error: KeyError:'5030'. This code works for columns which have regular/non-integer names.
However, when I import the raw .csv file with pd.read_csv(), all the code above does work. I can just rename column names. The dataframes look exactly the same when imported with both techniques, but apparently something is different.
It is not a serious issue as I can change the column names to non-integers manually in Excel, but I'm very curious about what the underlying "problem" is here and how these two functions operate differently.
Thanks!
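The thread has no answer, so what follows is only a likely explanation: read_csv produces string column labels, while read_excel keeps numeric header cells as numbers, so the label is the integer 5030 rather than the text "5030". The sketch below builds a dataframe with an integer label by hand as a stand-in for the read_excel result (an assumption, to avoid needing an actual .xlsx file) and shows two ways to handle it.

```python
import io
import pandas as pd

# read_csv: header values come back as strings.
csv_df = pd.read_csv(io.StringIO("name,5030\na,1\n"))
print([type(c) for c in csv_df.columns])

# Stand-in for the read_excel result (assumption): an integer column label.
excel_like = pd.DataFrame({"name": ["a"], 5030: [1]})
print([type(c) for c in excel_like.columns])

# Option 1: refer to the column by its integer label.
renamed = excel_like.rename(columns={5030: "code_5030"})

# Option 2: convert every label to text so it behaves like the CSV version.
as_text = excel_like.rename(columns=str)
print(as_text["5030"])
```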

how to write comma separated list items to csv in a single column in python

I have a list (fulllist) of 292 items that I converted to a data frame. Then I tried writing it to CSV in Python.
import pandas as pd
my_df = pd.DataFrame(fulllist)
my_df.to_csv('Desktop/pgm/111.csv', index=False,sep=',')
But some of the comma-separated values are spread across several columns of the CSV. I am trying to keep those values in a single column.
Portion of output is shown below.
I have also tried writerows, but it won't work.
import csv
with open('Desktop/pgm/111.csv', "wb") as f:
    writer = csv.writer(fulllist)
    writer.writerows(fulllist)
I also tried "".join each time the length of the list is greater than 1. That did not give the result either. How do I make a proper CSV so that each field fills a single column?
My expected output csv is
Please keep in mind that .csv files are in fact plain text files, and how a given piece of software interprets a .csv depends on its implementation. For example, some programs allow a newline character as part of a field when it is between " and ", while others treat every newline character as the start of the next row.
Do you have to use the .csv format? If not, consider other possibilities:
DSV https://en.wikipedia.org/wiki/Delimiter-separated_values is similar to CSV, but you can use for example ; instead of ,, which should help if you do not have ; in your data
openpyxl allows writing and reading of .xlsx files.
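If you do stay with .csv, one possible approach is to join each item into a single string and let the csv module quote it, so that the commas inside stay in one field. This is a sketch under the assumption that each element of fulllist is itself a list of strings; the sample data is invented and an in-memory buffer stands in for the output file.

```python
import csv
import io

# Hypothetical data: each item is a list of strings.
fulllist = [["a", "b", "c"], ["d"], ["e", "f"]]

buffer = io.StringIO()  # stand-in for open(path, "w", newline="")
writer = csv.writer(buffer)
for item in fulllist:
    # One-element row: the joined string is quoted if it contains a comma.
    writer.writerow([", ".join(item)])

print(buffer.getvalue())

# Reading it back gives one field per row.
rows = list(csv.reader(io.StringIO(buffer.getvalue())))
```

Whether a spreadsheet program then shows each row in one column still depends on that program honouring the quotes when it opens the file.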
