I am trying to load some files with PySpark from a Databricks data lake. To do this, I use sqlContext to create the DataFrame, which works without problems. Each file is named after its creation date, for example "20211001.csv". The files arrive daily, and I was using "*.csv" to load them all. But now I need to load only the files from a certain date forward, and I can't find a way to do it.
The statement I am using looks like this:
df_example = (sqlContext
    .read
    .format("com.databricks.spark.csv")
    .option("delimiter", ";")
    .option("header", "true")
    .option("inferSchema", "true")
    .option("encoding", "windows-1252")
    .load("/mnt/path/202110*.csv"))
I need to be able to select files from a certain date forward in the .load() call. Is it possible to do this with PySpark? For example, something like "NameFile.csv >= 202110". Do you have an example, please?
Thanks in advance!
I believe you can't do this, at least not the way you intend it.
If the data had been written partitioned by date, that date would be part of the path, and Spark would add it as another column which you could then use to filter with the DataFrame API as you would with any other column.
So if the files were, let's say:
your_main_df_path
├── date_at=20211001
│ └── file.csv
├── date_at=20211002
│ └── file.csv
├── date_at=20211003
│ └── file.csv
└── ...
You could then do:
df = spark_session.read.format("csv").load("your_main_df_path") # with all your options
df.filter("date_at>=20211002") # or any other date you need
Spark would use the date in the path to do partition pruning and only read the dates you need. If you can modify how the data is written, this is probably the best option.
If you can't control that, or it is hard to change for all the data you already have there, you could try to write a little Python function that takes a start date (and maybe an optional end_date) and returns a list of files that fall in that range. That list of files can then be passed to the DataFrameReader's load(), as sketched below.
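A minimal sketch of such a helper, assuming the files sit under the /mnt/path mount from the question and that dbutils is available (as it is in Databricks notebooks); list_csvs_from and the date strings are illustrative:
def list_csvs_from(base_path, start_date, end_date=None):
    # dbutils.fs.ls is Databricks-specific; each entry has .name and .path
    selected = []
    for f in dbutils.fs.ls(base_path):
        date_part = f.name.split(".")[0]   # "20211001" from "20211001.csv"
        # YYYYMMDD strings compare correctly as plain strings
        if date_part >= start_date and (end_date is None or date_part <= end_date):
            selected.append(f.path)
    return selected

paths = list_csvs_from("/mnt/path/", "20211002")

df_example = (sqlContext
    .read
    .format("com.databricks.spark.csv")
    .option("delimiter", ";")
    .option("header", "true")
    .option("inferSchema", "true")
    .option("encoding", "windows-1252")
    .load(paths))   # load() accepts a list of paths as well as a single glob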
Get all the dates greater than [particular-date], build a list from them, and pass those values as a parameter while iterating, like this:
%python
table_name ='my_table_name'
survey_curated_delta_path = f"abfss://container@datalake.dfs.core.windows.net/path1/path2/stage/validation/results/{table_name}.csv"
survey_sdf = spark.read.format("csv").load(survey_curated_delta_path)
display(survey_sdf)
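The snippet above reads a single file; here is a rough sketch of the iteration it describes (the mount path and date format are taken from the first question, and missing days would need to be filtered out first, since load() fails on paths that do not exist):
from datetime import datetime, timedelta

start = datetime(2021, 10, 2)
end = datetime.today()

# build one path per day in the range; spark.read.load() accepts a list of paths
dates = [(start + timedelta(days=d)).strftime("%Y%m%d")
         for d in range((end - start).days + 1)]
paths = [f"/mnt/path/{d}.csv" for d in dates]

survey_sdf = spark.read.format("csv").load(paths)
display(survey_sdf)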
The easiest option is to also bring in the file name:
from pyspark.sql.functions import input_file_name, regexp_replace
df = df.withColumn("filename", input_file_name())
Then strip the non-numeric characters from the filename column and cast it to a number (this is a Spark DataFrame, so use regexp_replace and cast rather than the pandas string methods; this assumes the rest of the path contains no digits):
df = df.withColumn("filename", regexp_replace("filename", r"\D+", "").cast("long"))
Now we can filter:
df.filter("filename >= 20211002")
This is what I'm trying to do.
Scan the csv using Polars lazy dataframe
Format the phone number using a function
Remove nulls and duplicates
Write the csv in a new file
Here is my code
import sys
import json
import polars as pl
import phonenumbers
#define the variable and parse the encoded json
args = json.loads(sys.argv[1])
#format phone number as E164
def parse_phone_number(phone_number):
    try:
        return phonenumbers.format_number(phonenumbers.parse(phone_number, "US"), phonenumbers.PhoneNumberFormat.E164)
    except phonenumbers.NumberParseException:
        pass
    return None

#scan the csv file, do some filtering, modify the data, and then write the output to a new csv file
pl.scan_csv(args['path'], sep=args['delimiter']).select(
    [args['column']]
).with_columns(
    #convert the int phone number to string and apply the parse_phone_number function
    [pl.col(args['column']).cast(pl.Utf8).apply(parse_phone_number).alias(args['column']),
     #add another column list_id with value 100
     pl.lit(args['list_id']).alias("list_id")
    ]
).filter(
    #filter out nulls
    pl.col(args['column']).is_not_null()
).unique(keep="last").collect().write_csv(args['saved_path'], sep=",")
I tested a file with 800k rows and 23 columns (150 MB), and it takes around 20 seconds and more than 500 MB of RAM to complete the task.
Is this normal? Can I optimize the performance (the memory usage at least)?
I'm really new to Polars; I work with PHP and I'm quite a noob at Python too, so sorry if my code looks a bit dumb haha.
You are using an apply, which means you are effectively writing a Python for loop. This is often 10-100x slower than using expressions.
Try to avoid apply. And if you do use apply, don't expect it to be fast.
P.S. You can reduce memory usage by not casting the whole column to Utf8, but instead casting inside your apply function. That said, I don't think using 500MB is that high. Ideally Polars uses as much RAM as is available without going OOM; unused RAM might be wasted potential.
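As a rough illustration of that P.S. (reusing the parse function from the question; df and "column" stand in for the lazy frame and args['column']), do the conversion on each value inside the function instead of casting the whole column up front:
import phonenumbers
import polars as pl

def parse_phone_number(raw):
    # convert the single value here instead of casting the whole Int64 column to Utf8 first
    try:
        return phonenumbers.format_number(
            phonenumbers.parse(str(raw), "US"),
            phonenumbers.PhoneNumberFormat.E164)
    except phonenumbers.NumberParseException:
        return None

df = df.with_columns(pl.col("column").apply(parse_phone_number).alias("column"))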
Presumably you have something that looks like this...
df = pl.DataFrame({'column': [9345551234, 9945554321, 8005559876]})
and you want to end up with something that looks like
shape: (3, 1)
┌────────────────┐
│ phnum │
│ --- │
│ str │
╞════════════════╡
│ (934) 555-1234 │
│ (994) 555-4321 │
│ (800) 555-9876 │
└────────────────┘
You can get this using the str.slice method
df.select(pl.col('column').cast(pl.Utf8())) \
.select((pl.lit("(") + pl.col('column').str.slice(0,3) +
pl.lit(") ") + pl.col('column').str.slice(3,3) +
pl.lit("-")+pl.col('column').str.slice(6,4)).alias('phnum'))
I have a dataset called preprocessed_sample in the following format:
preprocessed_sample.ftr.zstd
and I am opening it using the following code:
df = pd.read_feather(filepath)
The output looks something like this:
index text
0 0 i really dont come across how i actually am an...
1 1 music has become the only way i am staying san...
2 2 adults are contradicting
3 3 exo are breathing 553 miles away from me. they...
4 4 im missing people that i met when i was hospit...
and finally I would like to save this dataset into a folder called 'examples', with all of these texts as txt files.
Update: @Tsingis I would like each of the above lines to become a txt file; for example, the first line 'i really dont come across how i actually am an...' will be a file named 'line1.txt'. In the same way, all the lines will become txt files inside a folder called 'examples'.
You can use the following code:
import pathlib
data_dir = pathlib.Path('./examples')
data_dir.mkdir(exist_ok=True)
for i, text in enumerate(df['text'], 1):
    with open(f'examples/line{i}.txt', 'w') as fp:
        fp.write(text)
Output:
examples/
├── line1.txt
├── line2.txt
├── line3.txt
├── line4.txt
└── line5.txt
1 directory, 5 files
line1.txt:
i really dont come across how i actually am an...
Another way is to use the pandas built-ins itertuples and to_csv:
import pandas as pd

for row in df.itertuples():
    pd.Series(row.text).to_csv(f"examples/line{row.index+1}.txt",
                               index=False, header=False)
I'm a doctor trying to learn some code for work, and was hoping you could help me solve a problem I have with regards to importing multiple images into python.
I am working in Jupyter Notebook, where I have created a dataframe (named df_1) using pandas. In this dataframe each row represents a patient, and the first column shows the case number for each patient (e.g. 85).
Now, what I want to do is import multiple images (.bmp) from a given folder(same location as the .ipynb file). There are many images in this folder, and I do not want all of them - only the ones who have filenames corresponding to the "case_number" column in my dataframe (e.g. 85.bmp).
I already read this post, but I must admit it was way too complicated for me to understand.
Is there some simple loop (or something else) I could create to import all images with filenames corresponding to the values of the "case number" column in the dataframe?
I was imagining something like the below would be possible, I just do not know how to write it.
for i=[(df_1['case_number'()]
cv2.imread('[i].bmp')
The images don't really need to be implemented in the dataframe, but I would like to be able to view them in my notebook by using e.g. plt.imshow(85) afterwards.
Here is an image of the head of my dataframe
Thank you for helping!
You can access all of your files using this:
imageList = []
fileNames = []
for i in range(len(df_1)):
    path = './' + str(df_1['case_number'][i]) + '.bmp'
    # note: cv2 loads images in BGR order; use cv2.cvtColor(img, cv2.COLOR_BGR2RGB) if the colours look off in matplotlib
    imageList.append(cv2.imread(path))   # store the loaded image array
    fileNames.append(path)               # keep the path so it can be looked up by name later

plt.imshow(imageList[x])
This loops through every item in the case_number column; the ./ means the file is in the current directory. By converting each case number to a string and joining the pieces, you build a file path that looks something like ./85.bmp, which cv2.imread can open. The loaded image arrays are appended to imageList so they can be shown with plt.imshow(), while the matching paths are kept in fileNames.
If you would like to access an image by its file name, you can use another variable (which could be set as an input) and the code below:
fileName = input('Enter Your Value: ')
inputFile = fileNames.index('./' + fileName + '.bmp')
and from here you can use the same plt.imshow(imageList[x]), replacing x with the inputFile variable.
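If the goal is to look an image up directly by its case number (closer to the plt.imshow(85) idea from the question), a small variation using a dictionary keyed by case number (the name images is just illustrative) may be more convenient:
import cv2
import matplotlib.pyplot as plt

# map each case number to its loaded image array
images = {case: cv2.imread(f'./{case}.bmp') for case in df_1['case_number']}

plt.imshow(images[85])   # show the image for case 85
plt.show()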
I am trying to convert a CSV file to Parquet (I don't really care if it is done in Python or on the command line, or...). In any case, this question addresses it, but the answers seem to require reading the whole CSV in first, and since in my case the CSV is 17 GB, that is not really feasible, so I would like some "offline" or streaming approach.
I successfully converted a 7GB+ (2.7 millions lines) CSV file into a parquet file, using csv2parquet.
The process is simple:
First I had to clean my CSV with csvclean from csvkit (but you might not need this)
Generate a JSON schema with csv2parquet
Edit the schema by hand, as it might not suit you
Generate the parquet file thanks to csv2parquet
Bonus: use DuckDB to test simple SQL queries directly on the parquet file
You can probably reproduce the process if you download our CSV export at https://world.openfoodfacts.org/data
# Not needed for you, just in case you want to reproduce
wget https://static.openfoodfacts.org/data/en.openfoodfacts.org.products.csv
csvclean -t en.openfoodfacts.org.products.csv
# Generate the schema
./csv2parquet --header true -p -n en.openfoodfacts.org.products_out.csv products_zstd.pqt > parquet.schema
# It has to be modified because column detection is sometimes wrong.
# From the Open Food Facts CSV, for example, the code column is detected as an Int64, but it's in fact a "Utf8".
nano parquet.schema
# Generate parquet file.
# Using -c for compression is optional.
# -c zstd appears to be the best option regarding speed/compression.
./csv2parquet --header true -c zstd -s parquet.schema en.openfoodfacts.org.products_out.csv products_zstd.pqt
# Try a query thanks to DuckDB. It's as fast as a database!
time ./duckdb test-duck.db "select * FROM (select count(data_quality_errors_tags) as products_with_issues from read_parquet('products_zstd.pqt') where data_quality_errors_tags != ''), (select count(data_quality_errors_tags) as products_with_issues_but_without_images from read_parquet('products_zstd.pqt') where data_quality_errors_tags != '' and last_image_datetime == '');"
┌──────────────────────┬─────────────────────────────────────────┐
│ products_with_issues │ products_with_issues_but_without_images │
├──────────────────────┼─────────────────────────────────────────┤
│ 29333 │ 4897 │
└──────────────────────┴─────────────────────────────────────────┘
real 0m0,211s
user 0m0,645s
sys 0m0,053s
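If you would rather stay in Python, one alternative (a sketch of my own, not part of the csv2parquet workflow above) is to stream the CSV with pyarrow, which reads and writes record batches so the 17 GB file never has to fit in memory; the file names here are placeholders:
import pyarrow as pa
import pyarrow.csv as pv
import pyarrow.parquet as pq

# open_csv returns a streaming reader that yields record batches;
# pin column types via pv.ConvertOptions(column_types=...) if inference gets them wrong,
# like the code column mentioned above
reader = pv.open_csv("big.csv")

writer = pq.ParquetWriter("big.parquet", reader.schema, compression="zstd")
for batch in reader:
    # each batch is a pyarrow.RecordBatch; wrap it in a Table to write it out
    writer.write_table(pa.Table.from_batches([batch]))
writer.close()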
I want to delete the files whose name contains the smallest (oldest) date (though I believe numbers sort the same way as letters do). For example, we have the following files in one directory named "BOB":
10-10-2000.txt
11-10-2000.txt
12-10-2000.txt
13-10-2000.txt
14-10-2000.txt
But I only want to keep 4, so I need to remove the file with the least recent date, which is 10-10-2000.txt. However, this can't be done just by removing the least recently created file, as 10-10-2000.txt may have been created just yesterday.
Thanks.
You could try to create datetime objects from the file names, and then do comparisons on them. I won't write out your code for you, but something like this:
import os
import datetime as dt

file_names = ...  # (maybe try os.listdir()?)
dates = []
for name in file_names:
    dates.append((name, dt.datetime.strptime(name, "%d-%m-%Y.txt")))

# ... logic to sort dates ...

os.remove(dates[0][0])  # if the sort worked, your first entry is the earliest date
If you want help with sorting the dates, ask me, but you should be able to use basic comparison operators once you have dates
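For completeness, here is a minimal sketch that puts those pieces together, assuming the files live in the "BOB" directory from the question and all match the DD-MM-YYYY.txt pattern (the directory name and the keep-4 rule come straight from the question):
import os
import datetime as dt

directory = "BOB"
files = [f for f in os.listdir(directory) if f.endswith(".txt")]

# pair each file name with its parsed date and sort oldest-first
dated = sorted(
    ((f, dt.datetime.strptime(f, "%d-%m-%Y.txt")) for f in files),
    key=lambda pair: pair[1],
)

# keep only the 4 most recent files and delete everything older
for name, _ in dated[:-4]:
    os.remove(os.path.join(directory, name))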