This is what I'm trying to do:
1. Scan the CSV using a Polars lazy dataframe
2. Format the phone number using a function
3. Remove nulls and duplicates
4. Write the result to a new CSV file
Here is my code:
import sys
import json
import polars as pl
import phonenumbers
# parse the JSON-encoded arguments
args = json.loads(sys.argv[1])

# format a phone number as E.164
def parse_phone_number(phone_number):
    try:
        return phonenumbers.format_number(
            phonenumbers.parse(phone_number, "US"),
            phonenumbers.PhoneNumberFormat.E164,
        )
    except phonenumbers.NumberParseException:
        return None
# scan the CSV, transform the data, then write the output to a new CSV file
pl.scan_csv(args['path'], sep=args['delimiter']).select(
    [args['column']]
).with_columns(
    [
        # cast the integer phone number to a string and apply parse_phone_number
        pl.col(args['column']).cast(pl.Utf8).apply(parse_phone_number).alias(args['column']),
        # add another column list_id with value 100
        pl.lit(args['list_id']).alias("list_id"),
    ]
).filter(
    # drop nulls
    pl.col(args['column']).is_not_null()
).unique(keep="last").collect().write_csv(args['saved_path'], sep=",")
I tested it on a file with 800k rows and 23 columns (150 MB); it takes around 20 seconds and more than 500 MB of RAM to complete.
Is this normal? Can I optimize the performance (or at least the memory usage)?
I'm really new to Polars; I normally work with PHP and I'm quite the Python noob too, so sorry if my code looks a bit dumb, haha.
You are using apply, which means you are effectively writing a Python for loop. This is often 10-100x slower than using native expressions.
Try to avoid apply. And if you do use apply, don't expect it to be fast.
P.S. You can reduce memory usage by not casting the whole column to Utf8 and instead casting inside your apply function (see the sketch below). That said, I don't think using 500 MB is that high; ideally Polars uses as much RAM as is available without going OOM. Unused RAM is wasted potential.
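For illustration, here is a minimal, untested sketch of that suggestion, reusing the argument names from the question: the column is left as an integer in the scan and each value is converted to a string only inside the UDF.

# Hedged sketch: cast per value inside the UDF instead of casting the whole column.
# Column/argument names are taken from the question; adjust to your data.
def parse_phone_number(phone_number):
    try:
        return phonenumbers.format_number(
            phonenumbers.parse(str(phone_number), "US"),  # cast here, per value
            phonenumbers.PhoneNumberFormat.E164,
        )
    except phonenumbers.NumberParseException:
        return None

lf = pl.scan_csv(args['path'], sep=args['delimiter']).select([args['column']])
lf = lf.with_columns(
    pl.col(args['column']).apply(parse_phone_number).alias(args['column'])
)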
Presumably you have something that looks like this...
df = pl.DataFrame({'column': [9345551234, 9945554321, 8005559876]})
and you want to end up with something that looks like this:
shape: (3, 1)
┌────────────────┐
│ phnum          │
│ ---            │
│ str            │
╞════════════════╡
│ (934) 555-1234 │
│ (994) 555-4321 │
│ (800) 555-9876 │
└────────────────┘
You can get this using the str.slice method:
df.select(pl.col('column').cast(pl.Utf8())) \
  .select(
      (pl.lit("(") + pl.col('column').str.slice(0, 3) +
       pl.lit(") ") + pl.col('column').str.slice(3, 3) +
       pl.lit("-") + pl.col('column').str.slice(6, 4)).alias('phnum')
  )
Related
I have a dataset called preprocessed_sample in the following format:
preprocessed_sample.ftr.zstd
and I am opening it using the following code:
import pandas as pd
df = pd.read_feather(filepath)
The output looks something like this:
   index                                               text
0      0  i really dont come across how i actually am an...
1      1  music has become the only way i am staying san...
2      2                          adults are contradicting
3      3  exo are breathing 553 miles away from me. they...
4      4  im missing people that i met when i was hospit...
Finally, I would like to save this dataset into a folder called 'examples' that contains all these texts as txt files.
Update: @Tsingis I would like to have the above lines as txt files; for example, the first line 'i really dont come across how i actually am an...' will be a file named 'line1.txt'. In the same way, all the lines will become txt files in a folder called 'examples'.
You can use the following code:
import pathlib

data_dir = pathlib.Path('./examples')
data_dir.mkdir(exist_ok=True)

for i, text in enumerate(df['text'], 1):
    with open(data_dir / f'line{i}.txt', 'w') as fp:
        fp.write(text)
Output:
examples/
├── line1.txt
├── line2.txt
├── line3.txt
├── line4.txt
└── line5.txt
1 directory, 5 files
line1.txt:
i really dont come across how i actually am an...
Another way is to use the pandas built-ins itertuples and to_csv:
import pandas as pd

for row in df.itertuples():
    pd.Series(row.text).to_csv(f"examples/line{row.index+1}.txt",
                               index=False, header=False)
I am trying to read a large database table with polars. Unfortunately, the data is too large to fit into memory and the code below eventually fails.
Is there a way in Polars to define a chunk size and write these chunks to Parquet, or to use the lazy dataframe interface to keep the memory footprint low?
import polars as pl
df = pl.read_sql("SELECT * from TABLENAME", connection_string)
df.write_parquet("output.parquet")
Yes and no.
There's no predefined method to do it, but you can certainly do it yourself. You'd do something like:
import os

rows_at_a_time = 1000
curindx = 0
while True:
    df = pl.read_sql(
        f"SELECT * from TABLENAME limit {curindx},{rows_at_a_time}",
        connection_string,
    )
    if df.shape[0] == 0:
        break
    df.write_parquet(f"output{curindx}.parquet")
    curindx += rows_at_a_time

ldf = pl.concat(
    [pl.scan_parquet(x) for x in os.listdir(".") if "output" in x and "parquet" in x]
)
This borrows the limit syntax from this answer, assuming you're using MySQL or a DB with the same syntax, which isn't a trivial assumption. You may need to do something like this if you're not using MySQL.
Otherwise, you just read your table in chunks, saving each chunk to a local file. When the chunk returned by your query has 0 rows, the loop stops and all the files are loaded into a lazy df.
You can almost certainly (and should) increase rows_at_a_time to something greater than 1000, but that depends on your data and your computer's memory.
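If the end goal is a single Parquet file rather than a directory of chunk files, a rough sketch (assuming a Polars version that provides LazyFrame.sink_parquet) could stream the concatenated lazy frame back out:

# Hedged sketch: combine the chunk files into one parquet file without
# materializing everything in memory; assumes sink_parquet is available.
import os
import polars as pl

chunk_files = sorted(
    f for f in os.listdir(".") if f.startswith("output") and f.endswith(".parquet")
)
ldf = pl.concat([pl.scan_parquet(f) for f in chunk_files])
ldf.sink_parquet("output.parquet")  # streaming write of the combined data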
I have a 3-column CSV file where I perform a simple calculation with Python and pandas.
The file is very large, just under 4 GB; after the calculation it is about 1.9 GB.
The CSV file is:
data1,data2,data3
aftqgdjqv0av3q56jvd82tkdjpy7gdp9ut8tlqmgrpmv24sq90ecnvqqjwvw97,856521536521321,112535
aftqgdjqv0av3q56jvd82tkdjpy7gdp9ut8tlqmgrpmv24sq90ecnvqqjwvw98,6521321,112138
aftqgdjqv0av3q56jvd82tkdjpy7gdp9ut8tlqmgrpmv24sq90ecnvqqjwvw98,856521536521321,122135
aftqgdjqv0av3q56jvd82tkdjpy7gdp9ut8tlqmgrpmv24sq90ecnvqqjwvw99,521321,112132
aftqgdjqv0av3q56jvd82tkdjpy7gdp9ut8tlqmgrpmv24sq90ecnvqqjwvw99,856521536521321,212135
The calculation is a trivial sum: if data1 is identical, then add up data2 and rewrite the CSV.
Example result:
data1,data2
aftqgdjqv0av3q56jvd82tkdjpy7gdp9ut8tlqmgrpmv24sq90ecnvqqjwvw97,856521536521321
aftqgdjqv0av3q56jvd82tkdjpy7gdp9ut8tlqmgrpmv24sq90ecnvqqjwvw98,856521543042642
aftqgdjqv0av3q56jvd82tkdjpy7gdp9ut8tlqmgrpmv24sq90ecnvqqjwvw99,856521537042642
import pandas as pd
#Read csv
df = pd.read_csv('data.csv', sep=',' , engine='python')
# Groupby and sum
df_new = df.groupby(["data1"]).agg({"data2": "sum"}).reset_index()
# Save in new file
df_new.to_csv('data2.csv', encoding='utf-8', index=False)
How could I improve the code to speed up execution?
It currently takes about 7 hours on a VPS to complete the calculation.
Additional info:
RAM usage is almost always at 100% (8 GB). The choice of engine='python' is because I used code already present on https://stackoverflow.com/; honestly, I don't know whether that option is useful or not, but I have seen that the calculation works correctly.
data3 is actually useless to me (right now; it will probably be useful in the future).
There's an alternative option: use convtools for this. It is a pure Python library that generates pure Python code to build ad hoc converters. Of course, bare Python cannot beat pandas in terms of speed, but at least it doesn't need any wrappers and it works just as if you'd implemented everything by hand.
So, normally the following would work for you:
from convtools import conversion as c
from convtools.contrib.tables import Table

# you can store the converter somewhere for further reuse
converter = (
    c.group_by(c.item("data1"))
    .aggregate({
        "data1": c.item("data1"),
        "data2": c.ReduceFuncs.Sum(c.item("data2")),
    })
    .gen_converter()
)

# this is an iterable (stream of rows), not the list
rows = Table.from_csv("tmp4.csv", header=True).into_iter_rows(dict)
Table.from_rows(converter(rows)).into_csv("out.csv")
JFYI: if you run the script manually, you can monitor the speed using e.g. tqdm; just wrap the iterable you are consuming with it:
from tqdm import tqdm
# same code as above, except for the last line:
Table.from_rows(converter(tqdm(rows))).into_csv("out.csv")
HOWEVER:
The solution above doesn't require the input file to fit into memory, but the result has to. In your case, if the result is a 1.9 GB CSV file, the corresponding Python objects are unlikely to fit into 8 GB of RAM.
Then you may need to:
remove the header: tail -n +2 raw_file.csv > raw_file_no_header.csv
pre-sort the file: sort raw_file_no_header.csv > sorted_file.csv
and then:
from convtools import conversion as c
from convtools.contrib.tables import Table

converter = (
    c.chunk_by(c.item("data1"))
    .aggregate(
        {
            "data1": c.ReduceFuncs.First(c.item("data1")),
            "data2": c.ReduceFuncs.Sum(c.item("data2")),
        }
    )
    .gen_converter()
)

rows = Table.from_csv("sorted_file.csv", header=True).into_iter_rows(dict)
Table.from_rows(converter(rows)).into_csv("out.csv")
This only requires a single group to fit into memory.
Remove engine='python'; it does no good.
Get more RAM; 8 GB is not enough, and you should never hit 100% (this is what slows you down).
(It is too late now, but) don't use .csv files for large datasets. Look into Feather or Parquet.
If you can't get more RAM, then maybe @Afaq will elaborate on the file-splitting approach. The problem I see there is that you are not reducing your dataset much, so map-reduce may choke on the reduce part, unless you split your file in such a way that the same data1 strings always go into the same file (a rough chunked sketch is below).
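For illustration only, a minimal sketch of a chunked pandas approach that keeps just the running per-key sums in memory; the column and file names are taken from the question, and the chunk size is a guess to tune for your machine:

# Hedged sketch: read the CSV in chunks and keep only running sums per data1 key.
# Assumes the aggregated result (one row per unique data1) fits in memory.
import pandas as pd

sums = None
for chunk in pd.read_csv('data.csv', usecols=['data1', 'data2'], chunksize=1_000_000):
    partial = chunk.groupby('data1')['data2'].sum()
    sums = partial if sums is None else sums.add(partial, fill_value=0)

sums.astype('int64').rename('data2').reset_index().to_csv('data2.csv', index=False)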
I am trying to load some files with PySpark from a Databricks data lake. To do this, I use sqlContext to create the data frame, which works without problems. Each file is named by its creation date, for example "20211001.csv". These arrive on a daily basis, and I was using "*.csv" to load them all. But now I need to load only the files from a certain date forward, and I can't find a way to do it, which is why I'm turning to you.
The statement style I am using is the following:
df_example= (sqlContext
.read
.format("com.databricks.spark.csv")
.option("delimiter", ";")
.option("header","true")
.option("inferSchema","true")
.option("encoding","windows-1252")
.load("/mnt/path/202110*.csv"))
I need to be able to select files from a certain date forward in the .load() statement. Is it possible to do this with PySpark? For example, something like "NameFile.csv >= 202110". Do you have an example, please?
Thanks in advance!
I believe you can't do this, at least not the way you intend.
If the data had been written partitioned by date, that date would be part of the path, and Spark would add it as another column, which you could then use for filtering with the DataFrame API as you would any other column.
So if the files were, let's say:
your_main_df_path
├── date_at=20211001
│   └── file.csv
├── date_at=20211002
│   └── file.csv
├── date_at=20211003
│   └── file.csv
└── ...
You could then do:
df = spark_session.read.format("csv").load("your_main_df_path") # with all your options
df.filter("date_at>=20211002") # or any other date you need
Spark would use the date in the path to do partition pruning and only read the dates you need. If you can modify how the data is written, this is probably the best option (a minimal sketch of such a write follows).
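For context, here is a minimal, untested sketch of writing the data partitioned by a date column so that later reads can prune; the column name date_at is just an example, not something from the original pipeline:

# Hedged sketch: write the data partitioned by a date column so future reads
# can use partition pruning; "date_at" is an illustrative column name.
from pyspark.sql import functions as F

(df.withColumn("date_at", F.lit("20211001"))   # illustrative; derive from your data
   .write
   .partitionBy("date_at")
   .format("csv")          # Parquet is usually a better fit for this
   .option("header", "true")
   .save("your_main_df_path"))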
If you can't control that, or it is hard to change for all the data you already have there, maybe you can write a little Python function that takes a start date (and maybe an optional end_date) and returns a list of files that fall in that range. That list of files could then be passed to the DataFrameReader, as in the rough sketch below.
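A rough sketch of that helper, assuming the file names follow the YYYYMMDD.csv pattern from the question; the helper name files_from is just illustrative, and dbutils.fs.ls is the Databricks utility for listing a mount path:

# Hedged sketch: list the files in the mount, keep the ones at or after a given
# date, and pass the resulting list of paths to the reader.
def files_from(path, start_date, end_date=None):
    names = [f.name for f in dbutils.fs.ls(path)]            # e.g. "20211001.csv"
    dates = [n[:-4] for n in names if n.endswith(".csv")]
    keep = [d for d in dates if d >= start_date and (end_date is None or d <= end_date)]
    return [f"{path}/{d}.csv" for d in keep]

paths = files_from("/mnt/path", "20211015")
df_example = (sqlContext.read
              .format("com.databricks.spark.csv")
              .option("delimiter", ";")
              .option("header", "true")
              .load(paths))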
Get all the dates greater than [particular-date], make them a list, and pass those values as a parameter in an iteration, like this:
%python
table_name = 'my_table_name'
survey_curated_delta_path = f"abfss://container@datalake.dfs.core.windows.net/path1/path2/stage/validation/results/{table_name}.csv"
survey_sdf = spark.read.format("csv").load(survey_curated_delta_path)
display(survey_sdf)
The easiest approach will be to also bring in the filename:
from pyspark.sql.functions import input_file_name

df = df.withColumn("filename", input_file_name())
Then remove the non-numeric characters from the filename column and cast it to an integer:
from pyspark.sql.functions import regexp_replace

df = df.withColumn("filename", regexp_replace("filename", r"\D+", "").cast("int"))
Now we can filter:
df.filter("filename >= 20211002")
I am trying to convert a CSV file to Parquet (I don't really care if it is done in Python or on the command line, or...). In any case, this question addresses it, but the answers seem to require one to read the CSV in first, and since in my case the CSV is 17 GB, this is not really feasible, so I would like some "offline" or streaming approach.
I successfully converted a 7 GB+ (2.7 million lines) CSV file into a Parquet file using csv2parquet.
The process is simple:
First I had to clean my CSV with csvclean from csvkit (but you might not need this)
Generate a JSON schema with csv2parquet
Edit the schema by hand, as it might not suit you
Generate the parquet file thanks to csv2parquet
Bonus: use DuckDB to test simple SQL queries directly on the parquet file
You can probably reproduce the process if you download our CSV export at https://world.openfoodfacts.org/data
# Not needed for you, just in case you want to reproduce
wget https://static.openfoodfacts.org/data/en.openfoodfacts.org.products.csv
csvclean -t en.openfoodfacts.org.products.csv
# Generate the schema
./csv2parquet --header true -p -n en.openfoodfacts.org.products_out.csv products_zstd.pqt > parquet.schema
# It has to be modified because column detection is sometimes wrong.
# From the Open Food Facts CSV, for example, the code column is detected as an Int64, but it's in fact a Utf8.
nano parquet.shema
# Generate parquet file.
# Using -c for compression is optional.
# -c zstd appears to be the best option regarding speed/compression.
./csv2parquet --header true -c zstd -s parquet.schema en.openfoodfacts.org.products_out.csv products_zstd.pqt
# Try a query thanks to DuckDB. It's as fast as a database!
time ./duckdb test-duck.db "select * FROM (select count(data_quality_errors_tags) as products_with_issues from read_parquet('products_zstd.pqt') where data_quality_errors_tags != ''), (select count(data_quality_errors_tags) as products_with_issues_but_without_images from $db where data_quality_errors_tags != '' and last_image_datetime == '');"
┌──────────────────────┬──────────────────────────────────────────┐
│ products_with_issues │ products_with_issues_but_without_images  │
├──────────────────────┼──────────────────────────────────────────┤
│                29333 │                                     4897 │
└──────────────────────┴──────────────────────────────────────────┘
real 0m0,211s
user 0m0,645s
sys 0m0,053s