Compare 2 values in an Excel file and get the common values - python

I am trying to extract particular data using Python. I have one month of data, which records how many jobs have failed with a given return code over that month.
I have 30 Excel files, and so far I have loaded the data into my dataframe using the code below:
import glob2
import os
import pandas as pd

def concatenate(indir="C:\\Users\\hp", outfile="C:\\Users\\hp\\new1.csv"):
    os.chdir(indir)
    filelist = glob2.glob("*.csv")
    dfList = []
    for f in filelist:
        print(f)
        df = pd.read_csv(f)
        dfList.append(df)
    concatDf = pd.concat(dfList, axis=0)
    b = concatDf[['JOB NAME', ' RC ']]
I have extracted the required columns, and now I need to work out, over the one month of data, how many jobs have failed for the same reason.
Input:

STATUS  JOB NAME  RC     DATE   TIME
R       ABCDEFGH  U0900  18163  19:53
X       SSTUFGHI  C0001  18164  2:04
R       LMNOPQRS  SB37   18164  2:41
R       ABCDEFGH  U0900  18164  3:36

Output required:

JOB NAME  RC
ABCDEFGH  U0900
ABCDEFGH  U0900
I do not understand how to compare the two values and get the output above. Please help; I am very new to Python.
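Purely as an illustration of one possible approach (assuming the two-column frame `b` built above and the sample column names, including the spaces in ' RC '): keep only the rows whose JOB NAME / RC combination appears more than once, e.g. with DataFrame.duplicated:

import pandas as pd

# Hedged sketch, using the sample rows from the question; in practice `b`
# (the two-column frame built in concatenate above) would be used instead.
b = pd.DataFrame({
    'JOB NAME': ['ABCDEFGH', 'SSTUFGHI', 'LMNOPQRS', 'ABCDEFGH'],
    ' RC ': ['U0900', 'C0001', 'SB37', 'U0900'],
})

# Keep only rows whose (JOB NAME, RC) pair occurs more than once
repeated = b[b.duplicated(subset=['JOB NAME', ' RC '], keep=False)]
print(repeated)

# Count failures per (job, return code) to see which reasons repeat
counts = b.groupby(['JOB NAME', ' RC ']).size().reset_index(name='failures')
print(counts[counts['failures'] > 1])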

Related

What is the most efficient way to read and augment (copy samples and change some values) a large dataset in .csv

Currently, I have managed to solve this, but it is slower than what I need: approximately 1 hour for 500k samples. The entire dataset is ~100M samples, which would require ~200 hours.
Hardware/Software specs: RAM 8GB, Windows 11 64bit, Python 3.8.8
The problem:
I have a dataset in .csv (~13GB) where each sample has a value and a respective start-end period of a few months. I want to create a dataset where each sample has the same value but refers to each specific month.
For example:
from:

idx | start date | end date   | month | year | value
0   | 20/05/2022 | 20/07/2022 | 0     | 0    | X

to:

0 | 20/05/2022 | 20/07/2022 | 5 | 2022 | X
1 | 20/05/2022 | 20/07/2022 | 6 | 2022 | X
2 | 20/05/2022 | 20/07/2022 | 7 | 2022 | X
Ideas: do it in parallel (with something like Dask, but I am not sure how to apply it to this task).
My implementation:
Read in chunks with pandas, augment in dictionaries, append to CSV. A function, given a df, calculates for each sample the months from start date to end date and creates a copy of the sample for each month, appending it to a dictionary; it then returns the final dictionary.
The calculations are done in dictionaries because they proved much faster than doing them in pandas. I then iterate through the original CSV in chunks, apply the function to each chunk, and append the resulting augmented data to another csv.
The function:
def augment_to_monthly_dict(chunk):
    '''
    Function takes a df or subdf data and creates and returns an augmented dataset with monthly data in
    dictionary form (for efficiency)
    '''
    dict = {}
    l = 1
    for i in range(len(chunk)):  # iterate through every sample
        # print(str(chunk.iloc[i].APO)[4:6])
        # Find the months and years period
        mst = int(float((str(chunk.iloc[i].start)[4:6])))  # start month
        mend = int(str(chunk.iloc[i].end)[4:6])            # end month
        yst = int(str(chunk.iloc[i].start)[:4])            # start year
        yend = int(str(chunk.iloc[i].end)[:4])             # end year
        if yend == yst:
            months = [m for m in range(mst, mend + 1)]
            years = [yend for i in range(len(months))]
        elif yend == yst + 1:  # year change within the same sample
            months = [m for m in range(mst, 13)]
            years = [yst for i in range(mst, 13)]
            months = months + [m for m in range(1, mend + 1)]
            years = years + [yend for i in range(1, mend + 1)]
        else:
            continue
        # months is a list of each month in the period of the sample and years is a
        # same-length list of the respective years, e.g. months=[11,12,1,2],
        # years=[2021,2022,2022,2022]
        for j in range(len(months)):  # iterate through list of months
            # copy the original sample and make it a dictionary
            tmp = pd.DataFrame(chunk.iloc[i]).transpose().to_dict(orient='records')
            # change the month and year values accordingly (they were 0 for initiation)
            tmp[0]['month'] = months[j]
            tmp[0]['year'] = years[j]
            # Here one could add more calcs, e.g. drop irrelevant columns, change datatypes etc.
            # to reduce size
            # -------------------------------------
            # Append new row to the augmented data
            dict[l] = tmp[0]
            l += 1
    return dict
Reading the original dataset (~13GB .csv), augmenting with the function, and appending the result to a new .csv:
import csv
import pandas as pd

chunk_count = 0
for chunk in pd.read_csv('enc_star_logar_ek.csv', delimiter=';', chunksize=10000):
    chunk.index = chunk.reset_index().index
    aug_dict = augment_to_monthly_dict(chunk)  # make chunk dictionary to work faster
    chunk_count += 1
    if chunk_count == 1:  # get the column names, open csv, write headers and 1st chunk
        # Find the dict's keys, the column names only from the first dict (not reading all data)
        for kk in aug_dict.values():
            key_names = [i for i in kk.keys()]
            print(key_names)
            break  # break after first input dict
        # Open csv file and write ';'-separated data
        with open('dic_to_csv2.csv', 'w', newline='') as csvfile:
            writer = csv.DictWriter(csvfile, delimiter=';', fieldnames=key_names)
            writer.writeheader()
            writer.writerows(aug_dict.values())
    else:  # Save the rest of the data chunks
        print('added chunk: ', chunk_count)
        with open('dic_to_csv2.csv', 'a', newline='') as csvfile:
            writer = csv.DictWriter(csvfile, delimiter=';', fieldnames=key_names)
            writer.writerows(aug_dict.values())
Pandas efficiency comes into play when you need to manipulate columns of data; to do that, Pandas reads the input row-by-row, building up a series of data for each column. That's a lot of extra computation your problem doesn't benefit from, and in fact it just slows your solution down.
You actually need to manipulate rows, and for that the fastest way is to use the standard csv module; all you need to do is read a row in, write the derived rows out, and repeat:
import csv
import sys
from datetime import datetime

def parse_dt(s):
    return datetime.strptime(s, r"%d/%m/%Y")

def get_dt_range(beg_dt, end_dt):
    """
    Returns a range of (month, year) tuples, from beg_dt up-to-and-including end_dt.
    """
    if end_dt < beg_dt:
        raise ValueError(f"end {end_dt} is before beg {beg_dt}")
    mo, yr = beg_dt.month, beg_dt.year
    dt_range = []
    while True:
        dt_range.append((mo, yr))
        if mo == 12:
            mo = 1
            yr = yr + 1
        else:
            mo += 1
        if (yr, mo) > (end_dt.year, end_dt.month):
            break
    return dt_range

fname = sys.argv[1]
with open(fname, newline="") as f_in, open("output_csv.csv", "w", newline="") as f_out:
    reader = csv.reader(f_in)
    writer = csv.writer(f_out)
    writer.writerow(next(reader))  # transfer header
    for row in reader:
        beg_dt = parse_dt(row[1])
        end_dt = parse_dt(row[2])
        for mo, yr in get_dt_range(beg_dt, end_dt):
            row[3] = mo
            row[4] = yr
            writer.writerow(row)
And, to compare with Pandas in general, let's examine #abokey's specific Pandas solution. I'm not sure if there is a better Pandas implementation, but this one more or less does the right thing:
import sys

import pandas as pd

fname = sys.argv[1]
df = pd.read_csv(fname)
df["start date"] = pd.to_datetime(df["start date"], format="%d/%m/%Y")
df["end date"] = pd.to_datetime(df["end date"], format="%d/%m/%Y")
df["month"] = df.apply(
    lambda x: pd.date_range(
        start=x["start date"], end=x["end date"] + pd.DateOffset(months=1), freq="M"
    ).month.tolist(),
    axis=1,
)
df["year"] = df["start date"].dt.year
out = df.explode("month").reset_index(drop=True)
out.to_csv("output_pd.csv")
Let's start with the basics, though: do the programs actually do the right thing? Given this input:
idx,start date,end date,month,year,value
0,20/05/2022,20/05/2022,0,0,X
0,20/05/2022,20/07/2022,0,0,X
0,20/12/2022,20/01/2023,0,0,X
My program, ./main.py input.csv, produces:
idx,start date,end date,month,year,value
0,20/05/2022,20/05/2022,5,2022,X
0,20/05/2022,20/07/2022,5,2022,X
0,20/05/2022,20/07/2022,6,2022,X
0,20/05/2022,20/07/2022,7,2022,X
0,20/12/2022,20/01/2023,12,2022,X
0,20/12/2022,20/01/2023,1,2023,X
I believe that's what you're looking for.
The Pandas solution, ./main_pd.py input.csv, produces:
,idx,start date,end date,month,year,value
0,0,2022-05-20,2022-05-20,5,2022,X
1,0,2022-05-20,2022-07-20,5,2022,X
2,0,2022-05-20,2022-07-20,6,2022,X
3,0,2022-05-20,2022-07-20,7,2022,X
4,0,2022-12-20,2023-01-20,12,2022,X
5,0,2022-12-20,2023-01-20,1,2022,X
Ignoring the added column for the frame index, and the fact that the date format has been changed (I'm pretty sure that can be fixed with some Pandas directive I don't know), it still does the right thing with regards to creating new rows with the appropriate date range.
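(For those two cosmetic points, a small tweak that may help, assuming a reasonably recent pandas: to_csv accepts index and date_format arguments, so the extra index column can be dropped and the datetime columns written back out in day/month/year form.)

# Hedged sketch: write the exploded frame `out` from the Pandas solution above,
# dropping the frame index and keeping the original d/m/Y date formatting.
out.to_csv("output_pd.csv", index=False, date_format="%d/%m/%Y")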
So, both do the right thing. Now, on to performance. I duplicated your initial sample, just the 1 row, for 1_000_000 and 10_000_000 rows:
import sys

nrows = int(sys.argv[1])
with open(f"input_{nrows}.csv", "w") as f:
    f.write("idx,start date,end date,month,year,value\n")
    for _ in range(nrows):
        f.write("0,20/05/2022,20/07/2022,0,0,X\n")
I'm running a 2020 M1 MacBook Air with the 2TB SSD (which gives very good read/write speeds):
              1M rows (sec, RAM)    10M rows (sec, RAM)
csv module    7.8s, 6MB             78s, 6MB
Pandas        75s, 569MB            750s, 5.8GB
You can see that both programs' run times increase linearly with the number of rows. The csv module's memory use remains constantly negligible because it streams data in and out, holding on to virtually nothing; Pandas's memory use rises with the number of rows it has to hold so that it can do the actual date-range computations, again on whole columns. Also, not shown, but for the 10M-row Pandas test, Pandas spent nearly 2 minutes just writing the CSV, longer than the csv-module approach took to complete the entire task.
Now, for all my putting-down of Pandas, that solution is far fewer lines and is probably bug-free from the get-go. I did have a problem writing get_dt_range(), and had to spend about 5 minutes thinking about what it actually needed to do and debugging it.
You can view my setup with the small test harness, and the results, here.
I suggest you use pandas (or even dask) to build the list of months between the two date columns of a huge dataset (e.g., a ~13GB .csv). First you need to convert your two columns to datetime using pandas.to_datetime. Then you can use pandas.date_range to get your list.
Try this:
import pandas as pd
from io import StringIO

# NOTE: the columns in `s` are tab-separated (hence sep='\t' below)
s = """start date\tend date\tmonth\tyear\tvalue
20/05/2022\t20/07/2022\t0\t0\tX
"""

df = pd.read_csv(StringIO(s), sep='\t')
df['start date'] = pd.to_datetime(df['start date'], format="%d/%m/%Y")
df['end date'] = pd.to_datetime(df['end date'], format="%d/%m/%Y")
df["month"] = df.apply(lambda x: pd.date_range(start=x["start date"], end=x["end date"] + pd.DateOffset(months=1), freq="M").month.tolist(), axis=1)
df['year'] = df['start date'].dt.year
out = df.explode('month').reset_index(drop=True)
>>> print(out)
start date end date month year value
0 2022-05-20 2022-07-20 5 2022 X
1 2022-05-20 2022-07-20 6 2022 X
2 2022-05-20 2022-07-20 7 2022 X
Note: I tested the code above on a 1-million-row .csv dataset and it took ~10 min to get the output.
You can read a very large csv file with dask, process it (same API as pandas), then convert it to a pandas dataframe if you need to.
Dask is perfect when pandas fails due to data size or computation speed. But for data that fits into RAM, pandas can often be faster and easier to use than a Dask DataFrame.
import dask.dataframe as dd

# 1. read the large csv
dff = dd.read_csv('path_to_big_csv_file.csv')  # returns a Dask DataFrame

# if that is still not enough, try reducing IO costs further:
dff = dd.read_csv('largefile.csv', blocksize=25e6)  # blocksize: number of bytes by which to cut up larger files
dff = dd.read_csv('largefile.csv', usecols=["a", "b", "c"])  # read only columns a, b and c

# 2. work with dff; dask has the same API as pandas:
# https://docs.dask.org/en/stable/dataframe-api.html

# 3. then, finally, convert dff to a pandas dataframe if you want
df = dff.compute()  # returns a pandas dataframe
You can also try other alternatives for reading very large csv files efficiently, with high speed and low memory usage: polars, modin, koalas.
All of those packages, like dask, use an API similar to pandas.
If you have a very big csv file, pandas read_csv with chunksize usually doesn't succeed, and even if it does, it is a waste of time and energy.
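As an illustration only (not from the original answer), here is a minimal polars sketch, assuming polars is installed and using the column names from the question's sample; scan_csv builds a lazy query, so the whole file does not need to fit in memory before collect() runs:

import polars as pl

# Hedged sketch: lazily scan a large CSV with polars and only materialize
# the columns needed. File name and column names are placeholders.
lazy_df = pl.scan_csv("largefile.csv")
result = (
    lazy_df
    .select(["idx", "start date", "end date", "value"])
    .collect()  # executes the lazy query
)
print(result.head())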
There's a Table helper in the convtools library (I must confess, a lib of mine). This helper processes csv files as a stream, using the simple csv.reader under the hood:
from datetime import datetime

from convtools import conversion as c
from convtools.contrib.tables import Table

def dt_range_to_months(dt_start, dt_end):
    return tuple(
        (year_month // 12, year_month % 12 + 1)
        for year_month in range(
            dt_start.year * 12 + dt_start.month - 1,
            dt_end.year * 12 + dt_end.month,
        )
    )

(
    Table.from_csv("tmp/in.csv", header=True)
    .update(
        year_month=c.call_func(
            dt_range_to_months,
            c.call_func(datetime.strptime, c.col("start date"), "%d/%m/%Y"),
            c.call_func(datetime.strptime, c.col("end date"), "%d/%m/%Y"),
        )
    )
    .explode("year_month")
    .update(
        year=c.col("year_month").item(0),
        month=c.col("year_month").item(1),
    )
    .drop("year_month")
    .into_csv("tmp/out.csv")
)
Input/Output:
~/o/convtools ❯❯❯ head tmp/in.csv
idx,start date,end date,month,year,value
0,20/05/2022,20/07/2022,0,0,X
0,20/05/2022,20/07/2022,0,0,X
0,20/05/2022,20/07/2022,0,0,X
0,20/05/2022,20/07/2022,0,0,X
0,20/05/2022,20/07/2022,0,0,X
0,20/05/2022,20/07/2022,0,0,X
0,20/05/2022,20/07/2022,0,0,X
0,20/05/2022,20/07/2022,0,0,X
0,20/05/2022,20/07/2022,0,0,X
...
~/o/convtools ❯❯❯ head tmp/out.csv
idx,start date,end date,month,year,value
0,20/05/2022,20/07/2022,5,2022,X
0,20/05/2022,20/07/2022,6,2022,X
0,20/05/2022,20/07/2022,7,2022,X
0,20/05/2022,20/07/2022,5,2022,X
0,20/05/2022,20/07/2022,6,2022,X
0,20/05/2022,20/07/2022,7,2022,X
0,20/05/2022,20/07/2022,5,2022,X
0,20/05/2022,20/07/2022,6,2022,X
0,20/05/2022,20/07/2022,7,2022,X
...
On my M1 Mac, on a file where each row explodes into three, it processes 100K rows per second. For 100M rows of the same structure it should take ~1000 s (< 17 min). Of course it depends on how many months each row expands into.

Python & Pandas: appending data to new column

With Python and Pandas, I'm writing a script that passes text data from a csv through the pylanguagetool library to calculate the number of grammatical errors in a text. The script successfully runs, but appends the data to the end of the csv instead of to a new column.
The structure of the csv is:
The working code is:
import pandas as pd
from pylanguagetool import api

df = pd.read_csv("Streamlit\stack.csv")
text_data = df["text"].fillna('')
length1 = len(text_data)

for i, x in enumerate(range(length1)):
    # this is the pylanguagetool operation
    errors = api.check(text_data, api_url='https://languagetool.org/api/v2/', lang='en-US')
    result = str(errors)
    # this pulls the error count "message" from the pylanguagetool json
    error_count = result.count("message")
    output_df = pd.DataFrame({"error_count": [error_count]})
    output_df.to_csv("Streamlit\stack.csv", mode="a", header=(i == 0), index=False)
The output is:
Expected output:
What changes are necessary to append the output like this?
Instead of using a loop, you might consider apply with a lambda, which accomplishes what you want in one line:
df["error_count"] = df["text"].fillna("").apply(lambda x: len(api.check(x, api_url='https://languagetool.org/api/v2/', lang='en-US')["matches"]))
>>> df
user_id ... error_count
0 10 ... 2
1 11 ... 0
2 12 ... 0
3 13 ... 0
4 14 ... 0
5 15 ... 2
Edit:
You can write the above to a .csv file with:
df.to_csv("Streamlit\stack.csv", index=False)
You don't want to use mode="a" as that opens the file in append mode whereas you want (the default) write mode.
My strategy would be to keep the error counts in a list, then create a separate column in the original dataframe, and finally write that dataframe to csv:
text_data = df["text"].fillna('')
length1 = len(text_data)
error_count_lst = []

for i in range(length1):
    # check each text individually
    errors = api.check(text_data[i], api_url='https://languagetool.org/api/v2/', lang='en-US')
    result = str(errors)
    error_count = result.count("message")
    error_count_lst.append(error_count)

df['error_count'] = error_count_lst
df.to_csv('file.csv', index=False)

Reading from a .dat file into a DataFrame in Python

I have a .dat file which looks something like the below....
#| step | Channel| Mode | Duration|Freq.| Amplitude | Phase|
0 1 AWG Pi/2 100 2 1
1 1 SIN^2 100 1 1
2 1 SIN^2 200 0.5 1
3 1 REC 50 100 1 1
100 0 REC Pi/2 150 1 1
I created a data frame and wanted to extract data from it, but I get this error:
TypeError: expected str, bytes or os.PathLike object, not DataFrame
My code is below:
import pandas as pd
import numpy as np

path = "updated.dat"
datContent = [i.strip().split() for i in open(path).readlines()]
# print(datContent)
column_names = datContent.pop(0)
print(column_names)
df = pd.DataFrame(datContent)
print(df)

extract_column = df.iloc[:, 2]

with open(df, 'r') as openfile:
    for line in openfile:
        for column_search in line:
            column_search = df.iloc[:, 2]
            if "REC" in column_search:
                print("Rec found")
Any suggestion would be appreciated
Since your post does not have a clear question, I have to guess based on your code. I am assuming that what you want is to find all rows in the DataFrame where the column Mode contains the value REC.
Based on that, I prepared a small, self-contained example that works on your data.
In your situation, the only line that you should use is the last one. Assuming that your DataFrame is created and filled correctly, your code below print(df) can be replaced by this single line.
I would really recommend reading the official documentation on indexing and selecting data in DataFrames: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html
import pandas as pd
from io import StringIO
data = StringIO("""
no;step;Channel;Mode;Duration;Freq.;Amplitude;Phase
;0;1;AWG;Pi/2;100;2;1
;1;1;SIN^2;;100;1;1
;2;1;SIN^2;;200;0.5;1
;3;1;REC;50;100;1;1
;100;0;REC;Pi/2;150;1;1
""")
df = pd.read_csv(data, sep=";")
df.loc[df.loc[:, 'Mode'] == "REC", :]

How to import data from a .txt file into arrays in python

I am trying to import data from a .txt file that contains four columns separated by tabs and is several thousand lines long. This is how the start of the document looks:
Data info
File name: D:\(path to file)
Start time: 6/26/2019 15:39:54.222
Number of channels: 3
Sample rate: 1E6
Store type: fast on trigger
Post time: 20
Global header information: from DEWESoft
Comments:
Events
Event Type Event Time Comment
1 storing started at 7.237599
2 storing stopped at 7.257599
Data1
Time Incidente Transmitida DI 6
s um/m um/m -
0 2.1690152 140.98599 1
1E-6 2.1690152 140.98599 1
2E-6 4.3380303 145.32402 1
3E-6 4.3380303 145.32402 1
4E-6 -2.1690152 145.32402 1
I have several of these files that I want to loop through and store in a cell/list so that each cell/list item contains the four columns. After that I just use that cell/list to plot the data with a loop.
I saw that the pandas library was suitable, but I don't understand how to use it.
fileNames = (["Test1_0001.txt", "Test2_0000.txt", "Test3_0000.txt",
              "Test4_0000.txt", "Test5_0000.txt", "Test6_0001.txt", "Test7_0000.txt",
              "Test8_0000.txt", "Test9_0000.txt", "Test10_0000.txt", "RawblueMat_0000.txt"])
folderName = 'AuxeticsSHPB\\'  # Source folder for all files above

# Loop through each source document
for i in range(0, len(fileNames)):
    print('File location: ' + folderName + fileNames[i])
    # Get data from source as arrays, cut out the first 20 lines
    temp = pd.read_csv(folderName + fileNames[i], sep='\t', lineterminator='\r',
                       skiprows=[19], error_bad_lines=False)
    # Store data in list/cell
    # data[i] = temp  # sort it
This is something I tried that didn't work; I don't really know how to proceed. I know there is some documentation on this problem, but I am new to this and need some help.
An error I get when trying the above:
ParserError: Error tokenizing data. C error: Expected 1 fields in line 12, saw 4
So it was an easy fix: I just had to remove the brackets from skiprows=[19] and use skiprows=19 (skip the first 19 rows instead of only row 19).
The code now looks like this and works.
import pandas as pd

fileNames = ["Test1_0001.txt", "Test2_0000.txt", "Test3_0000.txt",
             "Test4_0000.txt", "Test5_0000.txt", "Test6_0001.txt", "Test7_0000.txt",
             "Test8_0000.txt", "Test9_0000.txt", "Test10_0000.txt", "RawblueMat_0000.txt"]
folderName = 'AuxeticsSHPB\\'  # Source folder for all files above

# Preallocation
data = []
for i in range(0, len(fileNames)):
    temp = pd.read_csv(folderName + fileNames[i], sep='\t', lineterminator='\r',
                       skiprows=19)
    data.append(temp)
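The question mentions plotting each loaded file in a loop afterwards; a minimal sketch of how that could look is below (assuming matplotlib is available and that the first two columns of each file are the time and 'Incidente' signals shown in the sample; the exact column positions depend on how the header lines are skipped):

import matplotlib.pyplot as plt

# Hedged sketch: plot the second column against the first for every loaded file.
# Column positions are assumptions based on the sample data above.
for name, temp in zip(fileNames, data):
    plt.plot(temp.iloc[:, 0], temp.iloc[:, 1], label=name)

plt.xlabel("Time (s)")
plt.ylabel("Signal (um/m)")
plt.legend()
plt.show()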

Exporting max values of different csv files into one

I have 3 datasets which contain the flow in m3/s per location. Dataset 1 is a 5-year ARI flood, Dataset 2 is a 20-year ARI flood, and Dataset 3 is a 50-year ARI flood.
Per location I found the maximum discharge (5, 20 & 50).
Code:
for key in Data_5_ARI_RunID_Flow_New.keys():
    m = key
    y5F_RunID = Data_5_ARI_RunID_Flow_New.loc[:, m]
    y20F_RunID = Data_20_ARI_RunID_Flow_New.loc[:, m]
    y50F_RunID = Data_50_ARI_RunID_Flow_New.loc[:, m]
    max_y5F = max(y5F_RunID)
    max_y20F = max(y20F_RunID)
    max_y50F = max(y50F_RunID)
    Max_DataID = m, max_y5F, max_y20F, max_y50F
    print(Max_DataID)
The output is like this:
('G60_18', 44.0514, 47.625, 56.1275)
('Area5_11', 1028.4065, 1191.5946, 1475.9685)
('Area5_12', 1017.8286, 1139.2628, 1424.4304)
('Area5_13', 994.5626, 1220.0084, 1501.1483)
('Area5_14', 995.9636, 1191.8066, 1517.4541)
Now I want to export this result to a csv file, but I don't know how. I used this line of code, but it didn't work:
Max_DataID.to_csv(r'C:\Users\Max_DataID.csv', sep=',', index = False)
Use the file name myexample.csv with the specific path where you want to create the file.
Please check that Max_DataID is an iterable value. Since, per your output, the values are tuples, I use list() to convert each tuple into a list, which is a supported input for writerow in the csv module.
import csv

# open in text mode; newline='' avoids blank lines on Windows
file = open('myexample.csv', 'w', newline='')
filewriter = csv.writer(file, delimiter=',')
for data in Max_DataID:
    filewriter.writerow(list(data))
file.close()
You can do the following.
df.to_csv(file_name, sep='\t')
Also, if you want to split it into chunks, like 10,000 rows, or whatever, you can do this.
import pandas as pd

for i, chunk in enumerate(pd.read_csv('C:/your_path_here/main.csv', chunksize=10000)):
    chunk.to_csv('chunk{}.csv'.format(i))
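As a further illustration (a sketch, not from either answer above): if the per-location maxima are collected into a list of tuples inside the loop shown in the question, pandas can write them out to csv in one go:

import pandas as pd

# Hedged sketch: collect each (location, max_5, max_20, max_50) tuple in a list
# while looping, then write everything to csv at once. Variable names are taken
# from the question's code.
rows = []
for key in Data_5_ARI_RunID_Flow_New.keys():
    rows.append((
        key,
        Data_5_ARI_RunID_Flow_New.loc[:, key].max(),
        Data_20_ARI_RunID_Flow_New.loc[:, key].max(),
        Data_50_ARI_RunID_Flow_New.loc[:, key].max(),
    ))

max_df = pd.DataFrame(rows, columns=['location', 'max_5', 'max_20', 'max_50'])
max_df.to_csv(r'C:\Users\Max_DataID.csv', sep=',', index=False)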
