Python code only reading a partial csv file from a GCS bucket - python
I am facing a strange issue while reading a csv file from a Google Cloud Storage bucket and writing it to a file in a different folder in the same bucket.
I have a csv file named test.csv with 1000001 lines in it. I am trying to strip every " character from each line and write the result to a file called cleansed_test.csv.
I tested my code locally and it works as expected.
Below is the code I am using locally:
import pandas as pd
import csv
import re

new_lines = []
new_lines_error_less_cols = []
new_lines_error_more_cols = []

with open('c:\\Users\\test_file.csv', 'r') as f:
    lines = f.readlines()
    print(len(lines))
    for line in lines:
        new_line = re.sub('["]', '', line)
        new_line = new_line.strip()
        new_lines.append(new_line)
        # elif line.count('|') < 295:
        #     new_line_error_less = re.sub('["]', 'inches', line)
        #     new_line_error_less = new_line_error_less.strip()
        #     new_lines_error_less_cols.append(new_line_error_less)
        # else:
        #     new_line_error_more = re.sub('["]', 'inches', line)
        #     new_line_error_more = new_line_error_more.strip()
        #     new_lines_error_more_cols.append(new_line_error_more)

new_data = pd.DataFrame(new_lines)
print(new_data.info())
# new_data.to_csv('c:\\cleansed_file.csv', header=None, index=False, encoding='utf-8')
But when I try doing the same on the file in the GCS bucket, only 67514 rows are being read.
This is the code I am using in Composer:
def replace_quotes(project, bucket, **context):
    import pandas as pd
    import numpy as np
    import csv
    import os
    import re
    import gcsfs
    import io

    fs = gcsfs.GCSFileSystem(project='project_name')
    updated_file_list = fs.ls('bucketname/FULL')
    updated_file_list = [x for x in updated_file_list if "filename" in x]
    new_lines = []
    new_lines_error_less_cols = []
    new_lines_error_more_cols = []
    for f in updated_file_list:
        file_name = os.path.splitext(f)[0]
        parse_names = file_name.split('/')
        filename = parse_names[2]
        bucketname = parse_names[0]
        with fs.open("gs://" + f, 'r') as pf:
            lines = pf.readlines()
            print("length of lines----->", len(lines))  # even here showing 67514
            for line in lines:
                new_line = re.sub('["]', '', line)
                new_line = new_line.strip()
                new_lines.append(new_line)
    new_data = pd.DataFrame(new_lines)
    # new_data.to_csv("gs://"+bucketname+"/ERROR_FILES/cleansed_"+filename+".csv", escapechar='', header=None, index=False, encoding='utf-8', quoting=csv.QUOTE_NONE)
Also, in the bucket I see that the sizes of test.csv and cleansed_test.csv are the same.
The only thing I can think of is that, since files are compressed in GCS buckets, maybe I should be opening them in a different way, because when I download the files to local they are a lot larger than what I see in the bucket.
Please advise.
Thanks.
I think you can achieve what you want by using the replace method of the DataFrame column object and specifying regex=True (otherwise the field string must match the replacement condition exactly). That way you can simply iterate over each column, replace the unwanted string, and overwrite each column with the newly modified one afterwards.
I modified your code a bit and ran it on my VM in GCP. As you can see, I preferred to use pandas.read_csv(), as the gcsfs approach was throwing me some errors. The code seems to do its job: I initially tested it by replacing a dummy common string and it worked smoothly.
This is your modified code. Please also note that I refactored the reading part a bit, as it did not properly match the path of the csv in my bucket.
from pandas.api.types import is_string_dtype
import pandas as pd
import numpy as np
import csv
import os
import re
import gcsfs
import io

fs = gcsfs.GCSFileSystem(project='my-project')
updated_file_list = fs.ls('test-bucket/')
updated_file_list = [x for x in updated_file_list if "simple.csv" in x]
new_lines = []
new_lines_error_less_cols = []
new_lines_error_more_cols = []
for f in updated_file_list:
    file_name = os.path.splitext(f)[0]
    print(f, file_name)
    parse_names = file_name.split('/')
    filename = parse_names[1]
    bucketname = parse_names[0]
    with fs.open("gs://" + f) as pf:
        df = pd.read_csv(pf)
        # print(df.head(len(df)))  # To check results
        for col in df:
            if is_string_dtype(df[col]):
                df[col] = df[col].replace(to_replace=['"'], value='', regex=True)
        # print(df.head(len(df)))  # To check results
        new_data = pd.DataFrame(df)
        # new_data.to_csv("gs://"+bucketname+"/ERROR_FILES/cleansed_"+filename+".csv", escapechar='', header=None, index=False, encoding='utf-8', quoting=csv.QUOTE_NONE)
Hope my efforts solved your issue.
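As a side note on the regex=True flag mentioned above, here is a small self-contained sketch (with made-up data, not tied to the bucket code) showing the difference it makes:

import pandas as pd

df = pd.DataFrame({'size': ['27" monitor', '24" screen']})

# Without regex=True, replace() only substitutes cells whose whole value
# matches '"' exactly, so nothing changes here.
print(df['size'].replace(to_replace='"', value=''))

# With regex=True, the pattern is searched inside each string,
# so every double quote is stripped out.
print(df['size'].replace(to_replace=['"'], value='', regex=True))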
For anyone curious, this is how to inflate a file that has the extension .csv but is actually compressed with gzip:

gsutil cat gs://BUCKET/File_Name.csv | zcat | gsutil cp - gs://BUCKET/Newfile.csv

The only issue I see here is that I can't use wildcards; to put it plainly, we have to give the destination file name explicitly.
The downside is that, since I have to specify the destination file name, I cannot use it in a BashOperator in Airflow (this is what I think, I may be wrong).
Thanks.
Anyway, hope this helps.
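If you would rather avoid the gsutil round trip, one possible alternative is to decompress the object directly in Python with gcsfs and the standard gzip module before cleansing. This is only a sketch under the assumption that the object really is gzip-compressed; the project, bucket and file names below are placeholders, not the real ones.

import gzip
import gcsfs

fs = gcsfs.GCSFileSystem(project='project_name')  # placeholder project

# Open the compressed object in binary mode and let gzip decompress it on the fly.
with fs.open('bucketname/FULL/test.csv', 'rb') as raw:
    with gzip.open(raw, mode='rt', encoding='utf-8') as decompressed:
        lines = decompressed.readlines()

print(len(lines))  # should now report every line, not just the first chunk

# Cleanse and write the result back to the bucket as an uncompressed csv.
new_lines = [line.replace('"', '').strip() for line in lines]
with fs.open('bucketname/ERROR_FILES/cleansed_test.csv', 'w') as out:
    out.write('\n'.join(new_lines))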
Related
Making efficient code using re.findall to sort files in folder
This is how the files are named in the folder: Data20210608_FL_.xlsx, Data20210608_FLFR_.xlsx, Data20210510-fl_.xlsx, Data20210510-flfr_.xlsx, Data20210608_LRC_.xlsx, Data20210609_LRC_.xlsx. I would like to use a loop to open only the ones containing FL or FLFR, and to separate the ones ending with FL from the ones ending with FLFR. This is my code, but it does not work and I don't fully understand how to use re.findall:

import glob
import os
import re
import pandas as pd

# %%
directory = r'C:/ .../Licor/'
appendix = "_.xlsx"
location = directory + appendix
datafinal = pd.DataFrame()

# %%
for filepath in glob.iglob(location):
    print(filepath)
    head_tail = os.path.split(filepath)
    Treatment = re.findall("[_FL][^_]*", head_tail[1])[0]
    data = pd.read_excel(filepath)
    data['Spectrum'] = Treatment
    datafinal = pd.concat([datafinal, data])

Thank you!
You are doing it in too complicated a way; try:

import os

for fn in os.listdir("./"):
    fn_without_ext = os.path.splitext(fn)[0]
    if fn_without_ext.endswith("FLFR_"):
        print(fn)  # do your FLFR stuff
    if fn_without_ext.endswith("FL_"):
        print(fn)  # do your FL stuff

More info:
https://docs.python.org/3/library/os.html#os.listdir
https://docs.python.org/3/library/stdtypes.html?highlight=endswith#str.endswith
https://docs.python.org/3/library/os.path.html#os.path.splitext
Ignoring commas in string literals while reading in .csv file without using any outside libraries
I am trying to read in a .csv file that has a line that looks something like this: "Red","Apple, Tomato". I want to read that line into a dictionary, using "Red" as the key and "Apple, Tomato" as the value. I also want to do this without using any libraries or modules that need to be imported. The issue I am facing is that the code tries to split that line into 3 separate pieces, because there is a comma between "Apple" and "Tomato" that it is splitting on. This is what I have right now:

import sys

file_folder = sys.argv[1]
file_path = open(file_folder + "/food_colors.csv", "r")
food_dict = {}
for line in file_path:
    (color, description) = line.rstrip().split(',')
    print(f"{color}, {description}")

But this gives me an error, because the line has 3 pieces of data and I am only giving it 2 variables to store them in. How can I make it ignore the comma inside the string literal?
You can collect the remaining strings into a list, like so:

color, *description = line.rstrip().split(',')

You can then join the description strings back together to make the value for your dict.

Another way:

color, description = line.rstrip().split(',', 1)

This means you only perform the split operation once, and the rest of the string remains unsplit.
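A minimal end-to-end sketch of the second approach, assuming the two-column layout from the question (reusing the food_colors.csv name, opened directly here rather than via sys.argv):

food_dict = {}
with open("food_colors.csv", "r") as f:
    for line in f:
        # Split on the first comma only, so commas inside the second field survive.
        color, description = line.rstrip().split(',', 1)
        # Strip the surrounding double quotes from both pieces.
        food_dict[color.strip('"')] = description.strip(' "')

print(food_dict)  # e.g. {'Red': 'Apple, Tomato'}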
You can use the pandas package and pandas.read_csv. For example, this works:

from io import StringIO
import pandas as pd

TESTDATA = StringIO('"Red","Apple, Tomato"')
df = pd.read_csv(TESTDATA, sep=",", header=None)
print(df)
Wrong boolean result in the main program (python)
I'm trying to write this simple code in Python: if the second element of a line of a csv file contains one of the families specified in the malware_list list, the main program should print True. However, the program always prints False. Each line in the file is of the form "NAME,FAMILY". This is the code:

malware_list = ["FakeInstaller", "DroidKungFu", "Plankton", "Opfake", "GingerMaster",
                "BaseBridge", "Iconosys", "Kmin", "FakeDoc", "Geinimi", "Adrd",
                "DroidDream", "LinuxLotoor", "GoldDream", "MobileTx", "FakeRun",
                "SendPay", "Gappusin", "Imlog", "SMSreg"]

def is_malware(line):
    line_splitted = line.split(",")
    family = line_splitted[1]
    if family in malware_list:
        return True
    return False

def main():
    with open("datset_small.csv", "r") as f:
        for i in range(1, 100):
            line = f.readline()
            print(is_malware(line))

if __name__ == "__main__":
    main()
line = f.readline()

readline() doesn't strip the trailing newline off of the result, so most likely line here looks something like "STEVE,FakeDoc\n". Then family becomes "FakeDoc\n", which is not a member of malware_list, so your function returns False. Try stripping out the whitespace after reading:

line = f.readline().strip()
Python has a package called pandas. Using pandas, we can read a CSV file into a DataFrame:

import pandas as pd

df = pd.read_csv("datset_small.csv")

Please post the content of your CSV file so that I can help you further.
It can be easily achieved using a DataFrame. Example code is as follows:

import pandas as pd

malware_list = ["FakeInstaller", "DroidKungFu", "Plankton", "Opfake", "GingerMaster",
                "BaseBridge", "Iconosys", "Kmin", "FakeDoc", "Geinimi", "Adrd",
                "DroidDream", "LinuxLotoor", "GoldDream", "MobileTx", "FakeRun",
                "SendPay", "Gappusin", "Imlog", "SMSreg"]

# read csv into dataframe
df = pd.read_csv('datset_small.csv')
print(df['FAMILY'].isin(malware_list))

The output is:

0    True
1    True
2    True

The sample csv used is:

NAME,FAMILY
090b5be26bcc4df6186124c2b47831eb96761fcf61282d63e13fa235a20c7539,Plankton
bedf51a5732d94c173bcd8ed918333954f5a78307c2a2f064b97b43278330f54,DroidKungFu
149bde78b32be3c4c25379dd6c3310ce08eaf58804067a9870cfe7b4f51e62fe,Plankton
I would use a set instead of a list for speed, and Pandas is definitely better due to its speed and the simplicity of the code. You can use "x in y" logic to get the results:

import io  # not needed in your case
import pandas as pd

data = io.StringIO('''090b5be26bcc4df6186124c2b47831eb96761fcf61282d63e13fa235a20c7539,Plankton
bedf51a5732d94c173bcd8ed918333954f5a78307c2a2f064b97b43278330f54,DroidKungFu
149bde78b32be3c4c25379dd6c3310ce08eaf58804067a9870cfe7b4f51e62fe,Plankton''')

df = pd.read_csv(data, sep=',', header=None)
malware_set = {"FakeInstaller", "DroidKungFu", "Plankton", "Opfake", "GingerMaster",
               "BaseBridge", "Iconosys", "Kmin", "FakeDoc", "Geinimi", "Adrd",
               "DroidDream", "LinuxLotoor", "GoldDream", "MobileTx", "FakeRun",
               "SendPay", "Gappusin", "Imlog", "SMSreg"}
df.columns = ['id', 'software']
df['malware'] = df['software'].apply(lambda x: x.strip() in malware_set)
print(df)
How to change names of a list of numpy files?
I have a list of numpy files and I need to change their names. Let's assume that I have this list of files:

AES_Trace=1_key=hexaNumber_Plaintext=hexaNumber_Ciphertext=hexaNumber.npy
AES_Trace=2_key=hexaNumber_Plaintext=hexaNumber_Ciphertext=hexaNumber.npy
AES_Trace=3_key=hexaNumber_Plaintext=hexaNumber_Ciphertext=hexaNumber.npy

What I need to change is the number in each file name, so that as a result I have:

AES_Trace=100001_key=hexaNumber_Plaintext=hexaNumber_Ciphertext=hexaNumber.npy
AES_Trace=100002_key=hexaNumber_Plaintext=hexaNumber_Ciphertext=hexaNumber.npy
AES_Trace=100003_key=hexaNumber_Plaintext=hexaNumber_Ciphertext=hexaNumber.npy

I have tried:

import os
import numpy as np
import struct

path_For_Numpy_Files = 'C:\\Users\\user\\My_Test_Traces\\1000_Traces_npy'
os.chdir(path_For_Numpy_Files)
list_files_Without_Sort = os.listdir(os.getcwd())
list_files_Sorted = sorted(list_files_Without_Sort, key=os.path.getmtime)
for file in list_files_Sorted:
    print(file)
    os.rename(file, file[11] + 100000)

I think that is not a good solution. Firstly, it doesn't work; it gives me this error:

os.rename(file, file[11] + 100000)
IndexError: string index out of range
Your file variable is a str, so you can't add an int like 100000 to it. You can build the new name with string formatting instead:

>>> file = 'Tracenumber=01_Pltx5=23.npy'
>>> '{}=1000{}'.format(file.split('=')[0], '='.join(file.split('=')[1:]))
'Tracenumber=100001_Pltx5=23.npy'

So you can use:

os.rename(file, '{}=1000{}'.format(file.split('=')[0], '='.join(file.split('=')[1:])))
I'm sure you can do this in one line, or with a regex, but I think that clarity is more valuable. Try this:

import os

path = 'C:\\Users\\user\\My_Test_Traces\\1000_Traces_npy'
file_names = os.listdir(path)
for file in file_names:
    start = file[0:file.index("Trace=") + 6]
    end = file[file.index("_key"):]
    num = file[len(start):file.index(end)]
    new_name = start + str(100000 + int(num)) + end
    os.rename(os.path.join(path, file), os.path.join(path, new_name))

This also works with numbers greater than 9, onto which the other answer would stick extra zeros.
Shutil multiple files after reading with "pydicom"
What I basically want is for myvar to vary between 1 and 280 so that I can use it to read the files with pydicom, i.e. I want to read the files /data/lfs2/model-mie/inputDataTest/subj2/mp2rage/0-280_tfl3d1.IMA. Then, if the gender is M, I want to shutil them into a folder. It doesn't seem to be working with count. Thanks for the help!

from pydicom import dicomio
import shutil

myvar = str(count(0))
file = "/data/lfs2/model-mie/inputDataTest/subj2/mp2rage/" + myvar + "_tfl3d1.IMA"
ds = dicomio.read_file(file)
gender = ds.PatientSex
print(gender)
if gender == "M":
    shutil.copy(file, "/mnt/nethomes/s4232182/Desktop/New")
I think the range() function should do what you want, something like this:

import shutil
from pydicom import dicomio

for i in range(281):
    filename = "/data/lfs2/model-mie/inputDataTest/subj2/mp2rage/" + str(i) + "_tfl3d1.IMA"
    ds = dicomio.read_file(filename)
    if ds.get('PatientSex') == "M":
        shutil.copy(filename, "/mnt/nethomes/s4232182/Desktop/New")

I've also used ds.get() to avoid problems if the dataset does not contain a PatientSex data element. In one place in your question the numbering is 1-280, in another it is 0-280; if the former, use range(1, 281) instead.