I have a scenario where I receive a file (a NiFi flowfile stream) as a CSV, create a DataFrame from it, and dump it; that's it.
But after creating the DataFrame, the structure of the file gets disturbed. If I open
the same flowfile on my disk, I can see a clear structure with columns separated by
tabs, but in the Python DataFrame I don't get the same structure. If I could get the same structure, I could perform row manipulation.
Here is what I am doing:
1: Using the ExecuteSQL processor, I fetch the database records.
2: I pass these records to the ConvertRecord processor to convert the Avro records to a tab-separated CSV file.
[screenshot of the ConvertRecord step][1]
3: I then read the flowfile from step 2 as Python data using ExecuteStreamCommand, because I want to perform some actions on the database records; it is at this step that my record structure gets changed in the DataFrame.
The flowfile generated by ConvertRecord:
[flowfile generated by ConvertRecord][2]
The output of the DataFrame:
[DataFrame output][3]
[1]: https://i.stack.imgur.com/Z4P97.png
[2]: https://i.stack.imgur.com/AU3ke.png
[3]: https://i.stack.imgur.com/LcICS.png
Sample Python code:

import sys
import pandas as pd

# The flowfile is tab-separated; pd.read_csv defaults to a comma delimiter,
# which is why the structure looked disturbed. Pass the delimiter explicitly.
df = pd.read_csv(sys.stdin, sep='\t')
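With ExecuteStreamCommand, the script must also write the modified records back to stdout for NiFi to pick them up as the outgoing flowfile content. A minimal sketch of the write-back, continuing from the snippet above and assuming the output should stay tab-separated:

# ... perform your row manipulation on df here ...

# Write back to stdout with the tab delimiter and without the pandas
# index, so the outgoing flowfile keeps the original column structure
df.to_csv(sys.stdout, sep='\t', index=False)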
When I read the Excel file into Python:
import pandas as pd
data = pd.read_excel('copy.xlsx')
data
Some of my time data was uploaded successfully, but another part of it has problems. The problems are in these columns: in_time, call_time, process_in_time, out_time.
Why is this happening?
And how can I handle and normalize this time data?
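Without seeing the workbook it is hard to be sure, but a common cause is that Excel stores some cells as text and others as real datetimes, so pandas reads the column with mixed types. A hedged sketch that coerces the four time columns named in the question to one dtype (everything else is an assumption):

import pandas as pd

data = pd.read_excel('copy.xlsx')

# Coerce each time column to datetime; unparseable cells become NaT
# instead of raising, so you can inspect them afterwards
time_cols = ['in_time', 'call_time', 'process_in_time', 'out_time']
for col in time_cols:
    data[col] = pd.to_datetime(data[col], errors='coerce')

# Rows where any conversion failed, kept aside for inspection
bad_rows = data[data[time_cols].isna().any(axis=1)]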
I am a real beginner in Python, but for my job as a planner in a company I want to automatically fill in an Excel form, where the information comes from a register PDF file.
I tried many Python libraries and settled on the fitz module (PyMuPDF). It generates the best output because the data has coordinates (x and y axes). But now the issue is how to order everything, since all the data ends up in just one column.
import fitz  # PyMuPDF
import pandas as pd

doc = fitz.open('C:\\Users\\fhoqu\\OneDrive\\Desktop\\Sures19al21julio2022.pdf')
print(doc.page_count)
page1 = doc[0]
# get_text("words") returns one tuple per word:
# (x0, y0, x1, y1, word, block_no, line_no, word_no)
words = page1.get_text("words")
df = pd.DataFrame(words)
# column 0 is the x coordinate and column 1 the y coordinate of each word
df.rename(columns={0: "axisX", 1: "axisY", 4: "Datos"}, inplace=True)
This is the original PDF,
but it generates a CSV file like this,
and I need a file ordered like this:
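One way to restore the reading order is to sort the word tuples by their y coordinate and then by x, grouping words with (nearly) the same y into one row. A minimal sketch, continuing from the words list above; rounding y to the nearest point is a heuristic and may need tuning for your PDF:

# Group words by (rounded) baseline, keeping their x position for ordering
rows = {}
for x0, y0, x1, y1, word, *rest in words:
    key = round(y0)  # words on roughly the same baseline share a row
    rows.setdefault(key, []).append((x0, word))

# Sort rows top-to-bottom, and words within each row left-to-right
ordered_lines = [
    " ".join(w for _, w in sorted(cells))
    for _, cells in sorted(rows.items())
]
df = pd.DataFrame({"line": ordered_lines})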
I have a CSV file that has a table with information that I'd like to reference in another table. To give you a better perspective, I have the following example:
"ID","Name","Flavor"
"45fc754d-6a9b-4bde-b7ad-be91ae60f582","account1-test1","m1.medium"
"83dbc739-e436-4c9f-a561-c5b40a3a6da5","account3-test2","m1.tiny"
"ef68fcf3-f624-416d-a59b-bb8f1aa2a769","account1-test3","m1.medium"
I would like to add more columns that reference the Name column and pull the customer name into one column and the rest of the info into another column, for example:
"ID","Name","Flavor","Customer","Misc"
"45fc754d-6a9b-4bde-b7ad-be91ae60f582","account1-test1","m1.medium","account1","test1"
"83dbc739-e436-4c9f-a561-c5b40a3a6da5","account3-test2","m1.tiny","account3,"test2"
"ef68fcf3-f624-416d-a59b-bb8f1aa2a769","account1-test3","m1.medium","account1","test3"
The task here is to have a Python script that opens the original CSV file and creates a new CSV file with the added columns. Any ideas? I've been having trouble parsing the Name column successfully.
import pandas as pd

data = pd.read_csv('your_file.csv')
# n=1 splits on the first hyphen only, giving exactly two columns
data[['Customer', 'Misc']] = data.Name.str.split('-', n=1, expand=True)
Now you can save it back to a CSV file with:
data.to_csv('another_file.csv', index=False)
Have you tried opening your CSV file as a pandas DataFrame? This can be done with:
df = pd.read_csv('input_data.csv')
If the Customer and Misc columns are part of another CSV file, you can load it with the same method as above (naming it df2) and then append the column with the following:
df['Customer'] = df2['Customer']
You can then output the DataFrame as a csv file with the following:
df.to_csv('output_data_name.csv')
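Note that df['Customer'] = df2['Customer'] relies on the two files having their rows in the same order, since pandas aligns on the index. If that is not guaranteed, a safer sketch is an explicit merge on the shared ID column:

# Join on ID so rows match by key rather than by position
df = df.merge(df2[['ID', 'Customer', 'Misc']], on='ID', how='left')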
I am trying to convert YAML data to a DataFrame through pandas with the yamltodb package, but it shows only a single row enclosed with the header, with only one record. I tried converting the YAML file to JSON and then using the normalize function, but it is not working out. I need to categorize the data under batsman, bowler, runs, etc.
Just guessing, as I don't know what your data actually looks like:

import pandas as pd
import yaml

with open('fName.yaml', 'r') as f:
    # safe_load avoids the PyYAML Loader warning; json_normalize flattens
    # nested dictionaries into columns
    df = pd.json_normalize(yaml.safe_load(f))
df.head()
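If the YAML contains nested lists (for example, one entry per delivery with batsman, bowler, and runs fields), json_normalize can explode them into one row each via record_path. A hypothetical sketch; 'innings' and 'deliveries' are assumed key names, so adjust them to your actual file:

import pandas as pd
import yaml

with open('fName.yaml', 'r') as f:
    raw = yaml.safe_load(f)

# record_path walks into the nested list so each delivery becomes a row;
# the key names here are assumptions about the file's structure
df = pd.json_normalize(raw, record_path=['innings', 'deliveries'])
df.head()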
I'm working with the latest version of Spark (2.1.1). I read multiple CSV files into a DataFrame with spark.read.csv.
After processing this DataFrame, how can I save it to output CSV files with specific names?
For example, there are 100 input files (in1.csv, in2.csv, in3.csv, ..., in100.csv).
The rows that belong to in1.csv should be saved as in1-result.csv, the rows that belong to in2.csv as in2-result.csv, and so on. (The default file names look like part-xxxx-xxxxx, which is not readable.)
I have seen partitionBy(col), but it looks like it can only partition by column.
Another question: I want to plot my DataFrame, but Spark has no built-in plotting library. Many people use df.toPandas() to convert to pandas and plot it. Is there a better solution? My data is very big, so toPandas() will cause a memory error. I'm working on a server and want to save the plot as an image instead of showing it.
I suggest the following solution for writing the DataFrame into directories specific to each input file:
In a loop, for each file:
read the CSV file
add a new column with information about the input file, using the withColumn transformation
union all the DataFrames using the union transformation
Then do the required preprocessing, and save the result using partitionBy, providing the column with the input-file information, so that rows related to the same input file are saved in the same output directory.
The code could look like:
from pyspark.sql.functions import lit

all_df = None
for file in files:  # where files is the list of input CSV paths you want to read
    df = spark.read.csv(file)
    # lit() wraps the Python string as a column value; withColumn returns
    # a new DataFrame, so the result must be reassigned
    df = df.withColumn("input_file", lit(file))
    if all_df is None:
        all_df = df
    else:
        all_df = all_df.union(df)

# do preprocessing
all_df.write.partitionBy("input_file").csv(outdir)
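For the plotting part of the question: the usual workaround is to aggregate in Spark first, so only a small summary table is converted with toPandas(), and to use matplotlib's non-interactive Agg backend so the figure is saved to disk instead of shown. A minimal sketch; 'some_column' is a placeholder for whatever you group by:

import matplotlib
matplotlib.use('Agg')  # headless backend: render to file, never to screen
import matplotlib.pyplot as plt

# Aggregate in Spark so only the small result reaches the driver
summary = all_df.groupBy('some_column').count().toPandas()  # 'some_column' is hypothetical

summary.plot(x='some_column', y='count', kind='bar')
plt.tight_layout()
plt.savefig('plot.png')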