pandas read_csv add attributes by stdin issue - python

I want to add a new column to the dataframe. The new column depends on some rules.
This is my code:
#!/usr/bin/python3.6
# coding=utf-8
import sys
import pandas as pd
import numpy as np
import io
import csv
df = pd.read_csv(sys.stdin, sep=',', encoding='utf-8', engine='python')
col_0 = check
df['df_cal'] = df.groupby(col_0)[col_0].transform('count')
df['status'] = np.where(df['df_cal'] > 1, 'change', 'New')
df = df.drop_duplicates(subset=df.columns.difference(['keep']), keep=False)
df = df[(df.keep == '2')]
df.drop(['keep', 'df_cal'], axis=1, inplace=True)
# print(sys.stdin)
df.to_csv(sys.stdout, encoding='utf-8', index=None)
sample csv:
VIP_number,keep
ab1,1
ab1,2
ab2,2
ab3,1
When I run this code, I write the command like this:
python3.6 nifi_python.py < test.csv check = VIP_number
and I get the error:
name 'check' is not defined
This still does not work because I don't know how to pass the column name into col_0. col_0 should be 'VIP_number'. I don't want to hardcode the column name because the script will be reused later and the columns will be different.
How can I pass the column name to the script when the data itself comes in on stdin?
Any help would be very much appreciated.

#!/usr/bin/python3.6
# coding=utf-8
import sys
import pandas as pd
import numpy as np

if len(sys.argv) < 2:
    print("Usage: nifi_python.py check=<column>")
    sys.exit(1)
df = pd.read_csv(sys.stdin, sep=',', encoding='utf-8', engine='python')
col_0 = sys.argv[1].split('=')[1]
...
python nifi_python.py check=VIP_number < test.csv
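If you'd rather not split check=<column> by hand, the standard library's argparse can do the flag handling for you. Here is a minimal sketch of the same idea; note it changes the interface to a --check option (my choice, not something from the original script):

#!/usr/bin/python3.6
# coding=utf-8
import argparse
import sys
import numpy as np
import pandas as pd

parser = argparse.ArgumentParser(description='Flag rows as change/New by a key column.')
parser.add_argument('--check', required=True, help="column to group on, e.g. 'VIP_number'")
args = parser.parse_args()

# the CSV still arrives on stdin; only the column name comes from the flag
df = pd.read_csv(sys.stdin, sep=',', encoding='utf-8', engine='python')
col_0 = args.check
df['df_cal'] = df.groupby(col_0)[col_0].transform('count')
df['status'] = np.where(df['df_cal'] > 1, 'change', 'New')
df.to_csv(sys.stdout, encoding='utf-8', index=None)

which you would run as:
python3.6 nifi_python.py --check VIP_number < test.csv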

Related

Iterate through df and update based on prediction

I am not a Python programmer, so I am struggling with the following:
def py_model(df):
    import pickle
    import pandas as pd
    import numpy as np
    from pandas import Series, DataFrame
    filename = 'C:/aaaTENNIS-DATA/votingC.pkl'
    loaded_model = pickle.load(open(filename, 'rb'))
    for index, row in df.iterrows():
        ab = row[['abc', 'def', 'ghi', 'jkl']]
        input = np.array(ab)
        df['Prediction'] = pd.DataFrame(loaded_model.predict([input]))
        df['AccScore'] = ??
    return df
For each row of the dataframe, I wish to get a prediction and put it in df['Prediction'] and also get the model score and put it in another field.
You don't need to iterate:
import pickle
filename = 'C:/aaaTENNIS-DATA/votingC.pkl'
loaded_model = pickle.load(open(filename, 'rb'))
df['Prediction'] = loaded_model.predict(df[['abc', 'def', 'ghi', 'jkl']])
Tip #1: don't use input as a variable name; it's a built-in function in Python: https://docs.python.org/3/library/functions.html#input
Tip #2: don't put import statements inside a function; put them all at the beginning of your file.
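For the AccScore column, if the pickled model is a scikit-learn classifier that implements predict_proba (an assumption, since the model type isn't shown; a hard-voting VotingClassifier, for instance, does not have it), you can get a per-row confidence score in the same vectorised style, with df being the frame from the question:

import pickle

filename = 'C:/aaaTENNIS-DATA/votingC.pkl'
with open(filename, 'rb') as f:
    loaded_model = pickle.load(f)

features = df[['abc', 'def', 'ghi', 'jkl']]
df['Prediction'] = loaded_model.predict(features)
# predict_proba returns one column per class; the row-wise maximum is
# the probability of the predicted class, usable as a confidence score
df['AccScore'] = loaded_model.predict_proba(features).max(axis=1)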

Refresh a Panda dataframe while printing using loop

How can I print a new dataframe and clear the previously printed dataframe while using a loop, so the output shows only the last dataframe instead of all of them?
Using print(df, end="\r") doesn't work.
import pandas as pd
import numpy as np

while True:
    df = pd.DataFrame(np.random.rand(10, 10))
    print(df)
If I get live data from an API to insert into the df, I'll use the while loop to constantly update the data. But how can I print only the newest dataframe instead of printing all the dataframes underneath each other in the output?
If I use the snippet below it does work, but I think there should be a more elegant solution.
import sys

import pandas as pd
import numpy as np

Height_DF = 10
Width_DF = 10

while True:
    df = pd.DataFrame(np.random.rand(10, 10))
    print(df)
    for i in range(Height_DF + 1):
        sys.stdout.write("\033[F")
try this:
import pandas as pd
import numpy as np
import time
import sys

while True:
    df = pd.DataFrame(np.random.rand(10, 10))
    print(df)
    # move the cursor up once per printed line (10 data rows + 1 header)
    # so the next frame overwrites this one
    sys.stdout.write("\033[F" * (df.shape[0] + 1))
    time.sleep(1)
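A terminal-agnostic alternative is to wipe the whole screen before each print instead of counting lines; a minimal sketch using the standard os module ('cls' is the Windows command, 'clear' the Unix one):

import os
import time

import numpy as np
import pandas as pd

while True:
    df = pd.DataFrame(np.random.rand(10, 10))
    # clear the terminal, then draw the fresh frame at the top
    os.system('cls' if os.name == 'nt' else 'clear')
    print(df)
    time.sleep(1)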

Create a loop to process multiple files

I have written the code below, but currently I need to retype the same conditions for each file and, as there are over 100 files, this is not ideal.
I couldn't come up with a way to implement this using a loop that reads all of these files and filters the MP values out; adding two new columns to each filtered file, as in the code below, is the only method I know so far.
I am trying to obtain a new combined dataframe from all the filtered files with their conditions applied.
Please suggest ways of implementing this using a loop:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from scipy import signal

df1 = pd.read_csv(r'E:\Unmanned Cars\Unmanned Cars\2017040810_052.csv')
df2 = pd.read_csv(r'E:\Unmanned Cars\Unmanned Cars\2017040901_052.csv')
df3 = pd.read_csv(r'E:\Unmanned Cars\Unmanned Cars\2017040902_052.csv')

df1 = df1["MP"].unique()
df1 = pd.DataFrame(df1, columns=['MP'])
df1["Dates"] = "2017-04-08"
df1["Inspection"] = "10"
##
df2 = df2["MP"].unique()
df2 = pd.DataFrame(df2, columns=['MP'])
df2["Dates"] = "2017-04-09"
df2["Inspection"] = "01"
##
df3 = df3["MP"].unique()
df3 = pd.DataFrame(df3, columns=['MP'])
df3["Dates"] = "2017-04-09"
df3["Inspection"] = "02"

Final = pd.concat([df1, df2, df3], axis=0, sort=False)
Maybe this sample code will help you.
#!/usr/bin/env python3
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from scipy import signal
from os import path
import glob
import re

def process_file(file_path):
    result = None
    file_path = file_path.replace("\\", "/")
    filename = path.basename(file_path)
    regex = re.compile("^(\\d{4})(\\d{2})(\\d{2})(\\d{2})")
    match = regex.match(filename)
    if match:
        date = "%s-%s-%s" % (match[1], match[2], match[3])
        inspection = match[4]
        df1 = pd.read_csv(file_path)
        df1 = df1["MP"].unique()
        df1 = pd.DataFrame(df1, columns=['MP'])
        df1["Dates"] = date
        df1["Inspection"] = inspection
        result = df1
    return result

def main():
    # files_list = [
    #     r'E:\Unmanned Cars\Unmanned Cars\2017040810_052.csv',
    #     r'E:\Unmanned Cars\Unmanned Cars\2017040901_052.csv',
    #     r'E:\Unmanned Cars\Unmanned Cars\2017040902_052.csv'
    # ]
    directory = 'E:\\Unmanned Cars\\Unmanned Cars\\'
    files_list = [f for f in glob.glob(directory + "*_052.csv")]
    result_list = [process_file(filename) for filename in files_list]
    Final = pd.concat(result_list, axis=0, sort=False)

if __name__ == "__main__":
    main()
I've created a process_file function for processing each file. A regular expression extracts the date and inspection number from the filename, and the glob module reads the files from the directory with pattern matching and expansion.
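As a quick illustration of what the regular expression extracts, here is the pattern run against one of the sample filenames in isolation:

import re

# e.g. 2017040810_052.csv -> year 2017, month 04, day 08, inspection 10
regex = re.compile("^(\\d{4})(\\d{2})(\\d{2})(\\d{2})")
match = regex.match("2017040810_052.csv")
print("%s-%s-%s" % (match[1], match[2], match[3]))  # 2017-04-08
print(match[4])                                     # 10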

Appending dataframes from json files in a for loop

I am trying to iterate through json files in a folder and append them all into one pandas dataframe.
If I say
import pandas as pd
import numpy as np
import json
from pandas.io.json import json_normalize
import os

directory_in_str = 'building_data'
directory = os.fsencode(directory_in_str)
df_all = pd.DataFrame()

with open("building_data/rooms.json") as file:
    data = json.load(file)
    df = json_normalize(data['rooms'])
    df_y.append(df, ignore_index=True)
I get a dataframe with the data from the one file. If I turn this thinking into a for loop, I have tried
import pandas as pd
import numpy as np
import json
from pandas.io.json import json_normalize
import os

directory_in_str = 'building_data'
directory = os.fsencode(directory_in_str)
df_all = pd.DataFrame()

for file in os.listdir(directory):
    with open(directory_in_str+'/'+filename) as file:
        data = json.load(file)
        df = json_normalize(data['rooms'])
        df_all.append(df, ignore_index=True)
print(df_all)
This returns an empty dataframe. Does anyone know why this is happening? If I print df before appending it, it prints the correct values, so I am not sure why it is not appending.
Thank you!
Instead of appending the next DataFrame, I would try to join them like this:
if df_all.empty:
    df_all = df
else:
    df_all = df_all.join(df)
When joining DataFrames, you can specify what they should be joined on (the index or a specific key column) as well as how (the default option, 'left', is similar to appending).
Here's docs about pandas.DataFrame.join.
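As a side note, the reason df_all stays empty is that DataFrame.append is not in-place: it returns a new DataFrame that the loop never assigns back (and the loop variable is file while filename is what gets opened). A minimal sketch that collects the frames in a list and concatenates once at the end, keeping the question's building_data layout:

import json
import os

import pandas as pd
from pandas.io.json import json_normalize

directory_in_str = 'building_data'

frames = []
for filename in os.listdir(directory_in_str):
    with open(os.path.join(directory_in_str, filename)) as file:
        data = json.load(file)
        frames.append(json_normalize(data['rooms']))
# one concat at the end instead of repeated appends
df_all = pd.concat(frames, ignore_index=True)
print(df_all)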
In these instances I load everything from the json files into a list by appending each file's returned records onto that list. Then I pass the list to pandas.DataFrame.from_records (docs).
In this case the source would become something like...
import pandas as pd
import json
import os

directory_in_str = 'building_data'

json_data = []
for filename in os.listdir(directory_in_str):
    with open(os.path.join(directory_in_str, filename)) as file:
        data = json.load(file)
        # put each file's room records onto one flat list
        json_data.extend(data['rooms'])

df_all = pd.DataFrame.from_records(json_data)
print(df_all)

Changing Column Heading CSV File

I am currently trying to change the headings of the file I am creating. The code I am using is as follows:
import pandas as pd
import os, sys
import glob

path = "C:\\Users\\cam19\\Desktop\\Test1\\*.csv"
list_ = []
for fname in glob.glob(path):
    df = pd.read_csv(fname, dtype=None, low_memory=False)
    output = (df['logid'].value_counts())
    list_.append(output)
df1 = pd.DataFrame()
df2 = pd.concat(list_, axis=1)
df2.to_csv('final.csv')
Basically I am looping through a file directory and extracting data from each file. This outputs the following image:
http://imgur.com/a/LE7OS
All I want to do is change the column name from 'logid' to the name of the file currently being searched, but I am not sure how to do this. Any help is great! Thanks.
Instead of appending the raw values, create a DataFrame from the counts and set its column name, i.e.
output = pd.DataFrame(df['logid'].value_counts())
output.columns = [os.path.basename(fname).split('.')[0]]
list_.append(output)
The changes applied to the code in the question:
import pandas as pd
import os, sys
import glob

path = "C:\\Users\\cam19\\Desktop\\Test1\\*.csv"
list_ = []
for fname in glob.glob(path):
    df = pd.read_csv(fname)
    output = pd.DataFrame(df['logid'].value_counts())
    output.columns = [os.path.basename(fname).split('.')[0]]
    list_.append(output)
df2 = pd.concat(list_, axis=1)
df2.to_csv('final.csv')
Hope it helps
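A slightly shorter variant of the same idea, in case it's useful: value_counts can stay a Series, and Series.rename sets the name that pd.concat will later use as the column heading (a sketch, assuming the same directory layout as the question):

import glob
import os

import pandas as pd

path = "C:\\Users\\cam19\\Desktop\\Test1\\*.csv"
list_ = []
for fname in glob.glob(path):
    df = pd.read_csv(fname, dtype=None, low_memory=False)
    # name the counts Series after the file so concat uses it as the header
    name = os.path.splitext(os.path.basename(fname))[0]
    list_.append(df['logid'].value_counts().rename(name))
df2 = pd.concat(list_, axis=1)
df2.to_csv('final.csv')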
