I have a dataset called preprocessed_sample in the following format:
preprocessed_sample.ftr.zstd
I am opening it with the following code:
df = pd.read_feather(filepath)
The output looks something like this:
   index                                               text
0      0  i really dont come across how i actually am an...
1      1  music has become the only way i am staying san...
2      2  adults are contradicting
3      3  exo are breathing 553 miles away from me. they...
4      4  im missing people that i met when i was hospit...
Finally, I would like to save this dataset into a folder called 'examples', with all of these texts as txt files.
Update: @Tsingis I would like to have the above lines as txt files. For example, the first line 'i really dont come across how i actually am an...' will be a file named 'line1.txt'; in the same way, all the lines will become txt files inside a folder called 'examples'.
You can use the following code:
import pathlib
data_dir = pathlib.Path('./examples')
data_dir.mkdir(exist_ok=True)
for i, text in enumerate(df['text'], 1):
    with open(data_dir / f'line{i}.txt', 'w') as fp:
        fp.write(text)
Output:
examples/
├── line1.txt
├── line2.txt
├── line3.txt
├── line4.txt
└── line5.txt
1 directory, 5 files
line1.txt:
i really dont come across how i actually am an...
Another way is to use the pandas built-ins itertuples and to_csv:
import pandas as pd

for row in df.itertuples():
    pd.Series(row.text).to_csv(f"examples/line{row.index + 1}.txt",
                               index=False, header=False)
This is what I'm trying to do:
Scan the CSV using a Polars lazy DataFrame
Format the phone numbers using a function
Remove nulls and duplicates
Write the CSV to a new file
Here is my code:
import sys
import json

import polars as pl
import phonenumbers

# parse the JSON-encoded arguments passed on the command line
args = json.loads(sys.argv[1])

# format a phone number as E164
def parse_phone_number(phone_number):
    try:
        return phonenumbers.format_number(
            phonenumbers.parse(phone_number, "US"),
            phonenumbers.PhoneNumberFormat.E164,
        )
    except phonenumbers.NumberParseException:
        return None

# scan the csv file, filter and modify the data, then write the output to a new csv file
pl.scan_csv(args['path'], sep=args['delimiter']).select(
    [args['column']]
).with_columns(
    [
        # convert the int phone number to a string and apply the parse_phone_number function
        pl.col(args['column']).cast(pl.Utf8).apply(parse_phone_number).alias(args['column']),
        # add another column list_id with value 100
        pl.lit(args['list_id']).alias("list_id"),
    ]
).filter(
    # filter out nulls
    pl.col(args['column']).is_not_null()
).unique(keep="last").collect().write_csv(args['saved_path'], sep=",")
I tested a file with 800k rows and 23 columns (150 MB); it takes around 20 seconds and more than 500 MB of RAM to complete the task.
Is this normal? Can I optimize the performance (at least the memory usage)?
I'm really new to Polars, I normally work with PHP, and I'm very much a Python noob too, so sorry if my code looks a bit dumb haha.
You are using an apply, which means you are effectively writing a Python for loop. This is often 10-100x slower than using expressions.
Try to avoid apply. And if you do use apply, don't expect it to be fast.
P.S. You can reduce memory usage by not casting the whole column to Utf8, but instead casting inside your apply function. That said, I don't think using 500MB is that high. Ideally Polars uses as much RAM as is available without going OOM; unused RAM might be wasted potential.
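A minimal sketch of that P.S., reusing the args dict and the question's parse_phone_number: the cast to str happens per value inside the function, so no intermediate Utf8 column has to be materialized.
def parse_phone_number(phone_number):
    try:
        # cast to str here instead of casting the whole column beforehand
        return phonenumbers.format_number(
            phonenumbers.parse(str(phone_number), "US"),
            phonenumbers.PhoneNumberFormat.E164,
        )
    except phonenumbers.NumberParseException:
        return None

# apply directly on the raw integer column; no .cast(pl.Utf8) step
pl.col(args['column']).apply(parse_phone_number).alias(args['column'])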
Presumably you have something that looks like this...
df = pl.DataFrame({'column': [9345551234, 9945554321, 8005559876]})
and you want to end up with something that looks like
shape: (3, 1)
┌────────────────┐
│ phnum │
│ --- │
│ str │
╞════════════════╡
│ (934) 555-1234 │
│ (994) 555-4321 │
│ (800) 555-9876 │
└────────────────┘
You can get this using the str.slice method:
df.select(pl.col('column').cast(pl.Utf8())) \
  .select((pl.lit("(") + pl.col('column').str.slice(0, 3) +
           pl.lit(") ") + pl.col('column').str.slice(3, 3) +
           pl.lit("-") + pl.col('column').str.slice(6, 4)).alias('phnum'))
I want to build a model for emotion classification and, to be honest, I am struggling with the dataset. I am using CK+ since I read it is an industry standard. I don't know how to format it the right way so I can start working.
The Dataset is formatted in the following way.
Anger (Folder)
File 1
File 2
...
Contempt (Folder)
File 3
File 4
...
I need the folder names as labels for the files inside each folder, but I don't really know how to get there.
You can load all your data into a tf.data.Dataset using the tf.keras.utils.image_dataset_from_directory function. Assuming that your Anger and Contempt folders are located in a directory named Parent, you can do it like this:
import tensorflow as tf
dataset = tf.keras.utils.image_dataset_from_directory('Parent')
You can then access the images and labels directly from the Dataset, for example like this:
iterator = dataset.as_numpy_iterator()
print(iterator.next())
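If you also need the mapping from the integer labels back to the folder names, the returned dataset keeps it in class_names (labels follow the alphabetical order of the folder names):
# folder names, in the order matching the integer labels
print(dataset.class_names)  # e.g. ['Anger', 'Contempt', ...]

# each element is an (images, labels) batch
for images, labels in dataset.take(1):
    print(images.shape)  # (batch_size, 256, 256, 3) with the default image_size
    print(labels)        # one integer label per image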
I am trying to load some files with PySpark from a Databricks data lake. To do this, I use the "sqlContext" statement to create the DataFrame, which works without problems. Each file is named by its creation date, for example "20211001.csv". These arrive on a daily basis, and I was using "*.csv" to load them all. But now I need to load only the files from a certain date forward, and I can't find a way to do it, which is why I'm turning to you.
The statement style I am using is the following:
df_example = (sqlContext
.read
.format("com.databricks.spark.csv")
.option("delimiter", ";")
.option("header","true")
.option("inferSchema","true")
.option("encoding","windows-1252")
.load("/mnt/path/202110*.csv"))
I need to be able to select files from a certain date forward in the ".load" statement. Is it possible to do this with PySpark? For example, something like "NameFile.csv >= 202110". Do you have an example, please?
Thanks in advance!
I believe you can't do this, at least not the way you intend it.
If the data had been written partitioned by date, that date would be part of the path, and Spark would add it as another column which you could then use to filter with the DataFrame API, as you do with any other column.
So if the files were, let's say:
your_main_df_path
├── date_at=20211001
│ └── file.csv
├── date_at=20211002
│ └── file.csv
├── date_at=20211003
│ └── file.csv
└── ...
You could then do:
df = spark_session.read.format("csv").load("your_main_df_path") # with all your options
df.filter("date_at>=20211002") # or any other date you need
Spark would use the date in the path to do the partition pruning and only read the dates you need. If you can modify how the data is written, this is probably the best option.
If you can't control that, or it is hard to change for all the data you already have there, you can try writing a little Python function that takes a start date (and maybe an optional end date) and returns the list of files that fall in that range. That list of files can then be passed to the DataFrameReader, as sketched below.
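A hypothetical sketch of such a helper. It assumes the mount is also visible through the local /dbfs FUSE path (on Databricks you could use dbutils.fs.ls instead) and that the files are named YYYYMMDD.csv as in the question:
import os

def csv_paths_since(start_date, end_date=None, base="/dbfs/mnt/path"):
    """Return the mount paths of the CSVs whose date-named stem is in range."""
    paths = []
    for name in sorted(os.listdir(base)):
        stem, ext = os.path.splitext(name)
        # lexicographic comparison works because the stems are YYYYMMDD
        if ext == ".csv" and stem >= start_date and (end_date is None or stem <= end_date):
            paths.append(f"/mnt/path/{name}")  # Spark reads via the mount path
    return paths

# .load() also accepts a list of paths
df = (sqlContext.read.format("com.databricks.spark.csv")
      .option("delimiter", ";").option("header", "true")
      .load(csv_paths_since("20211002")))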
Get all the dates greater than the particular date, build a list from them, and pass those values as a parameter on each iteration, like this:
%python
table_name = 'my_table_name'
survey_curated_delta_path = f"abfss://container@datalake.dfs.core.windows.net/path1/path2/stage/validation/results/{table_name}.csv"
survey_sdf = spark.read.format("csv").load(survey_curated_delta_path)
display(survey_sdf)
The easiest approach is to also import the filename:
from pyspark.sql.functions import input_file_name, regexp_replace

df = df.withColumn("filename", input_file_name())
Then remove the non-numeric characters from the filename column and cast it to a number:
df = df.withColumn("filename", regexp_replace("filename", r"\D+", "").cast("long"))
Now we can filter:
df = df.filter("filename >= 20211002")
I have a folder which contains different images like dell_01.png, hp_01.png, and toshiba_01.png, and I would like to create a dataframe from it.
If the filename starts with hp, it should be assigned class 0; if it starts with toshiba, class 1; and if it starts with dell, class 2, as seen in the expected dataframe output below.
filename        class
hp_01.png       0
toshiba_01.png  1
dell_01.png     2
Break the problem up:
I have a folder... different images
So you need to get the filenames from the folder. Use pathlib.Path.glob:
from pathlib import Path

for fn in Path("/path/to/folder/").glob("*.png"):
    ...
if the file starts with hp... class 0, toshiba... class 1
So you have a condition (note that class is a reserved word in Python, so use another name such as label):
if fn.stem.startswith("hp"):
    label = 0
elif ...
Now you can solve the two parts individually.
in a dataframe
Use the two constructs above to make a dictionary for every file. Then make a dataframe from that list of dictionaries. Your code will look something like this:
files = []
for fn in Path("/path/to/pngs").glob("*.png"):
    if fn.stem.startswith("hp"):
        label = 0
    ...
    files.append({"filename": fn.name, "class": label})
(Yes, there are more direct routes to getting this into a dataframe, but I was trying to make what was going on clearer.)
Does that help? I've deliberately avoided just writing the answer for you, but tried to get just close enough you can fill in the rest.
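For what it's worth, one of those more direct routes could look like this sketch, assuming the three prefixes from the question and a hypothetical folder path:
import pandas as pd
from pathlib import Path

# prefix -> class mapping taken from the question
prefix_to_class = {"hp": 0, "toshiba": 1, "dell": 2}

df = pd.DataFrame(
    [
        {"filename": fn.name, "class": prefix_to_class[fn.stem.split("_")[0]]}
        for fn in Path("/path/to/pngs").glob("*.png")
    ]
)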
I would like to save some tables from a Word document to a CSV or Excel file, it doesn't matter which.
I tried readlines(), but it doesn't work and I don't know why.
The tables in the Word document look like this:
Name   Age  Gender
Alex   12   F
Willy  14   M
.
.
.
However, I would like to save this table as a single row in the CSV or Excel file, i.e.:
Alex 12 F Willy 14 M ....
import win32com.client

word = win32com.client.Dispatch('Word.Application')
f = word.Documents.Open('C:/3.doc')
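Continuing from that stub, a rough sketch of reading the first table cell by cell and writing everything into a single CSV row; the table index, the skipped header row, and the output filename are all assumptions:
import csv

table = f.Tables(1)  # assumed: the first table in the document
cells = []
for r in range(2, table.Rows.Count + 1):  # start at row 2 to skip the header row
    for c in range(1, table.Columns.Count + 1):
        # cell text from the Word COM API ends with "\r\x07"; strip it off
        cells.append(table.Cell(r, c).Range.Text.strip("\r\x07"))

with open("table.csv", "w", newline="") as out:
    csv.writer(out).writerow(cells)  # everything in the same row

f.Close()
word.Quit()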
Have a look at www.ironpython.com: it runs on .NET, so it has all the libraries needed to access the Microsoft world.
For your case, read this small tutorial about converting a .doc file to a .txt file. It should be very useful:
http://www.ironpython.info/index.php/Converting_a_Word_document_to_Text