How to capture specific info from a CSV column in Python

I have a CSV file that has a table with information that I'd like to reference in another table. To give you a better perspective, I have the following example:
"ID","Name","Flavor"
"45fc754d-6a9b-4bde-b7ad-be91ae60f582","account1-test1","m1.medium"
"83dbc739-e436-4c9f-a561-c5b40a3a6da5","account3-test2","m1.tiny"
"ef68fcf3-f624-416d-a59b-bb8f1aa2a769","account1-test3","m1.medium"
I would like to add columns that reference the Name column, pulling the customer name into one column and the rest of the info into another, for example:
"ID","Name","Flavor","Customer","Misc"
"45fc754d-6a9b-4bde-b7ad-be91ae60f582","account1-test1","m1.medium","account1","test1"
"83dbc739-e436-4c9f-a561-c5b40a3a6da5","account3-test2","m1.tiny","account3,"test2"
"ef68fcf3-f624-416d-a59b-bb8f1aa2a769","account1-test3","m1.medium","account1","test3"
The task here is to have a python script that opens the original CSV file, and creates a new CSV file with the added column. Any ideas? I've been having trouble parsing through the name column successfully.

import pandas as pd

data = pd.read_csv('your_file.csv')
# split the Name column on the hyphen into two new columns
data[['Customer', 'Misc']] = data.Name.str.split('-', expand=True)
Now you can save it back to a csv file with:
data.to_csv('another_file.csv', index=False)
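One caveat worth noting: if a Name can contain more than one hyphen (e.g. a hypothetical "account1-test-1"), an unlimited split would produce more than two columns. Limiting the split keeps everything after the first hyphen in Misc:
data[['Customer', 'Misc']] = data.Name.str.split('-', n=1, expand=True)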

Have you tried opening your csv file as a pandas DataFrame? This can be done with:
df = pd.read_csv('input_data.csv')
If the Customer and Misc columns are part of another csv file, you can load that file with the same method as above (naming it df2) and then copy the column across with:
df['Customer'] = df2['Customer']
You can then output the DataFrame as a csv file with:
df.to_csv('output_data_name.csv')
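Note that df['Customer'] = df2['Customer'] aligns on the index, so it relies on both files having rows in the same order. If the two files instead share a key column, a merge is safer; here 'ID' is an assumed key:
df = df.merge(df2[['ID', 'Customer']], on='ID', how='left')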

Related

how to filter a .csv/.txt file using a list from another .txt

So I have an excel sheet that contains in this order:
Sample_name | column data | column data2 | column data ... n
I also have a .txt file that contains
Sample_name
What I want to do is filter the excel file for only the sample names contained in the .txt file. My current idea is to go through each column (excel sheet) and see if it matches any name in the .txt file, and if it does, grab the whole column. However, this seems like an inefficient way to do it, and I also need to do this using Python. I was hoping someone could give me an idea on how to approach this better. Thank you very much.
Excel PowerQuery should do the trick:
Load the .txt file as a table (list)
Load the sheet with the data columns as another table
Merge (e.g. left join) the first table with the second table
Optional: adjust/select the columns to be included or excluded in the resulting table
In Python, the same can be accomplished with Pandas DataFrames by joining or filtering the two frames, as in the sketch below.
P.S. Pandas supports loading CSV files and txt files (as a variant of CSV) into a DataFrame
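A minimal pandas sketch of that approach, assuming the .txt file holds one sample name per line and the sheet's first column is Sample_name (the file names here are made up):
import pandas as pd

samples = pd.read_csv('names.txt', header=None)[0]   # one sample name per line
data = pd.read_excel('data.xlsx')                    # needs an Excel engine, e.g. openpyxl

# keep only the rows whose Sample_name appears in the .txt list
filtered = data[data['Sample_name'].isin(samples)]
filtered.to_csv('filtered.csv', index=False)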

How to extract data from a specific column in the first CSV file to another column in another CSV file?

I have two different CSV files which I have imported using pd.read_csv.
Both files have different header names. I would like to export the column under the header ["Model"] in the first CSV file to the second CSV file under the header ["Product"].
I have tried the following code, but it produced a ValueError:
writer=df1[df1['Model']==df2['Product']]
Would appreciate any help.
Try joining the DataFrames on the index using pandas.DataFrame.join, then exporting the result as a csv using pandas.DataFrame.to_csv:
joined = df1.join(df2)   # join returns a new DataFrame, so keep the result
joined.to_csv('./df2.csv')
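Alternatively, if the two files line up row for row and you only need the one column, a direct assignment under the new header may be all that's required (the output file name here is made up):
df2['Product'] = df1['Model']   # relies on both frames sharing the same index
df2.to_csv('output.csv', index=False)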

open a comma-separated text file, read specific field data from it, and use it with a dataframe

The first file is a CSV containing column names.
The second file is a comma-separated requirement file.
In this requirement file, the first line contains the Excel name, the path of the Excel file, and the db name.
The second line onwards contains sr.no, column name, column dtype, user-required dtype, and so on.
What I want is to read the requirement file with the "with open" command, read every element, and use it with the dataframe.
Suppose:
df = pd.read_excel("path of the excel file")  # this path is read from the req. file
df[column name]  # the column name is also read from the req. file
Suppose the first column's dtype is int and I want to convert it into another dtype, i.e. 'object':
df['column name'].astype('object')  # this 'object' is read from the req. file
The scope of your question is not clear, but if the CSV file name is sample.csv you can read the data using
import pandas
df = pandas.read_csv('sample.csv')
You can then determine the column names using
list(df.columns)
And you can access a given column using
df['postgres']
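As a rough sketch of the "with open" workflow described in the question — the requirement-file layout (Excel name/path/dbname on the first line, then sr.no, column name, current dtype, required dtype per row) is taken from the question, and the file name is made up:
import csv
import pandas as pd

with open('requirements.csv', newline='') as f:
    reader = csv.reader(f)
    excel_name, excel_path, dbname = next(reader)      # first line: Excel name, path, dbname
    df = pd.read_excel(excel_path)
    for sr_no, col_name, col_dtype, required_dtype in reader:
        # convert each listed column to the dtype the requirement file asks for
        df[col_name] = df[col_name].astype(required_dtype)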

exporting and indexing csv file with pandas

I've created a csv file with the column names and saved it using the pandas library. This file will be used to build a historic record, where rows are loaded one by one at different moments. To add rows to this previously created csv, I transform each record into a DataFrame and then call to_csv() with mode='a' as a parameter to append the record to the existing file. The problem is that I would like to see an automatically generated index in the file every time I add a new row. I know that when I import this file as a DataFrame an index is generated automatically, but that only exists within the IDLE interface; when I open the csv with Excel, for example, the file doesn't have an index.
While writing your file to csv, you can set index=True in the to_csv method (it is the default). This ensures that the index of your DataFrame is written explicitly to the csv file.
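Note, though, that each appended chunk carries its own index starting at 0, so the numbering in the file will restart with every append. A minimal sketch that keeps the numbering continuous (the file and column names are made up):
import pandas as pd

hist = pd.read_csv('history.csv', index_col=0)            # existing records
new = pd.DataFrame([{'timestamp': '2021-01-01', 'value': 42}])
new.index = range(len(hist), len(hist) + len(new))        # continue the numbering
new.to_csv('history.csv', mode='a', header=False)         # index is written by default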

Pyspark: write df to file with specific name, plot df

I'm working with the latest version of Spark (2.1.1). I read multiple csv files into a dataframe with spark.read.csv.
After processing this dataframe, how can I save it to an output csv file with a specific name?
For example, there are 100 input files (in1.csv, in2.csv, in3.csv, ..., in100.csv).
The rows that belong to in1.csv should be saved as in1-result.csv, the rows that belong to in2.csv as in2-result.csv, and so on. (The default file names look like part-xxxx-xxxxx, which is not readable.)
I have seen partitionBy(col), but it looks like it can only partition by a column.
Another question: I want to plot my dataframe. Spark has no built-in plotting library. Many people use df.toPandas() to convert to pandas and plot it. Is there a better solution? My data is very big, so toPandas() will cause a memory error. I'm working on a server and want to save the plot as an image instead of showing it.
I suggest the following solution for writing the DataFrame into specific directories related to each input file:
In a loop, for each file:
read the csv file
add a new column with information about the input file using the withColumn transformation
union all the DataFrames using the union transformation
Then:
do the required preprocessing
save the result using partitionBy, providing the column with the input-file information, so that rows related to the same input file are saved in the same output directory
Code could look like:
from pyspark.sql.functions import lit

all_df = None
for file in files:  # files is the list of input CSV paths you want to read
    df = spark.read.csv(file)
    # tag every row with the file it came from; withColumn needs a Column, hence lit()
    df = df.withColumn("input_file", lit(file))
    if all_df is None:
        all_df = df
    else:
        all_df = all_df.union(df)

# ... do preprocessing on all_df, producing `result` ...

# partitionBy takes the column name; rows from the same input file
# land in the same output directory
result.write.partitionBy("input_file").csv(outdir)
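For the plotting part, one common workaround is to aggregate or sample in Spark first, so that only a small DataFrame is converted with toPandas(), then render off-screen and save the image. A sketch, with made-up column names:
import matplotlib
matplotlib.use('Agg')                 # render without a display, e.g. on a server
import matplotlib.pyplot as plt

# sample ~1% of the rows before converting to pandas to avoid memory errors
sample_pdf = result.select('x', 'y').sample(False, 0.01).toPandas()
sample_pdf.plot(x='x', y='y', kind='scatter')
plt.savefig('plot.png')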
