How do I convert the value of a pyspark dataframe column? - python

I have a column in a pyspark dataframe for the age of an electronic device, and these values are given in milliseconds. Is there an easy way to convert that column's values to years? I am not well versed in Spark.
EDIT: I understand that you can convert milliseconds to years pretty easily with basic math; I'm trying to take a column of a pyspark dataframe, iterate through it, and convert all of its values. Is there a specific pyspark function that makes this easier, or not? I have a column where all values are very large integers with time in milliseconds, and I am trying to filter out values that are too small or too large to make sense given the lifespan of the device.
table.filter(F.col("age")>0).filter(F.col("age")<yearsToSeconds(20))
where yearsToSeconds is a very basic function converting years to seconds. I'd prefer being able to convert the column values to years, but I haven't worked with spark before and I don't know an optimal way to do that.

well, one way is to use withColumn.
here I'm demonstrating adding a new column called "ageinMin" to the dataframe, calculated from the "age" column (which is in milliseconds) by dividing it by 60,000 to get the equivalent minutes:
from pyspark.sql.functions import col
df.withColumn("ageinMin", col("age") / 60000)
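Following the same pattern, here is a minimal sketch for the original question, assuming "age" holds milliseconds and reusing the table name from the question (a 365-day year is assumed just for illustration):
from pyspark.sql import functions as F

# Assumed conversion factor: milliseconds per (365-day) year
MS_PER_YEAR = 1000 * 60 * 60 * 24 * 365

# Add a derived column with the age expressed in years
table = table.withColumn("age_years", F.col("age") / MS_PER_YEAR)

# The filter from the question can then be written directly in years
table = table.filter((F.col("age_years") > 0) & (F.col("age_years") < 20))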

Related

Dates management in Pandas - np.array, timestamp, datetime, etc

I'm trying to convert a whole column (actually four columns) from object to datetime.
As I observed in dataframe, there are some NaN values which are making my life more difficult, so I took the following strategy:
Fill NaN values with a specific date using df[column].fillna('2030-1-1')
Convert to date time using pd.to_datetime
The curious thing is that this strategy worked only for two columns, while the error persists for the other two, as shown in the picture below.
As you can see, the first two columns have been successfully converted.
Although all columns received the same treatment from the fillna command, the last two columns cannot be converted, as shown in this picture.
I also observed that these columns originally differ in datatype: as you can see, the "error column" originally holds a datetime.datetime array.
I have searched the NumPy array documentation, but I could not solve this problem yet. Would you help me?
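For reference, a minimal sketch of the strategy described above, using hypothetical column names and made-up example data (errors="coerce" is an optional extra that turns unparseable entries into NaT instead of raising):
import pandas as pd

# Hypothetical frame with two object-dtype date columns containing NaN
df = pd.DataFrame({
    "start_date": ["2021-05-01", None, "2021-07-15"],
    "end_date": ["2021-06-01", None, "not a date"],
})

for column in ["start_date", "end_date"]:
    # Step 1: fill NaN values with a specific placeholder date
    df[column] = df[column].fillna("2030-1-1")
    # Step 2: convert the whole column; errors="coerce" leaves NaT where parsing fails
    df[column] = pd.to_datetime(df[column], errors="coerce")

print(df.dtypes)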

Interpolation in data frame without making a new data frame

I have a pandas dataframe with an index and just one column. The index has dates and the column has values. I would like to find the NewValue for a NewDate that is not in the index. To do that, I suppose I may use an interpolation function such as: NewValue = InterpolationFunction(NewDate, Index, Column, method, etc.). So what is the InterpolationFunction? It seems that most interpolation functions are used for padding, i.e. finding the missing values, etc. This is not what I want. I just want the NewValue, not to build a new DataFrame, etc.
Thank you very, very much for taking the time to read this.
I am afraid that you cannot find the missing values without constructing a base for your data. Here is the answer to your question if you make a base dataframe:
You need to construct a panel in order to set up your data for proper interpolation. For instance, in the case of dates, let's say that you have yearly data and you want to add information for a missing year in between, or generate data for quarterly intervals.
You need to construct a base for the time series i.e.:
import pandas as pd
dates = pd.date_range(start="1987-01-01", end="2021-01-01", freq="Q").values
panel = pd.DataFrame({'date_Q': dates})
Now you can join your data to this structure, and the dates without information will show up as missing values. Then you need to use a proper interpolation algorithm to fill them in. The pandas .interpolate() method has some basic interpolation methods, such as polynomial and linear, which you can find here.
However, many more interpolation methods are offered by SciPy, which you can find in the tutorials here.
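As a rough sketch of the join-then-interpolate idea (the dates, values, and column names below are made up for illustration):
import pandas as pd

# Hypothetical sparse series: known values only at two quarter-end dates
data = pd.DataFrame(
    {"value": [10.0, 30.0]},
    index=pd.to_datetime(["2020-03-31", "2020-09-30"]),
)

# Base time series covering the period of interest
dates = pd.date_range(start="2020-01-01", end="2021-01-01", freq="Q")
panel = pd.DataFrame({"date_Q": dates})

# Join the known values onto the base; missing dates show up as NaN
merged = panel.set_index("date_Q").join(data)

# Fill the gaps; method="time" weights by the spacing between dates
merged["value"] = merged["value"].interpolate(method="time")

# Interpolated value for a date that was not in the original data
print(merged.loc["2020-06-30", "value"])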

Iterating through big data with pandas, large and small dataframes

This is my first post here and it's based upon an issue I've created and tried to solve at work. I'll try to precisely summarize my issue, as I'm having trouble wrapping my head around a preferred solution. #3 below is a real stumper for me.
1. Grab a large data file based on a parquet - no problem.
2. Select 5 columns from the parquet and create a dataframe - no problem:
import pandas as pd
df = pd.read_parquet('/Users/marmicha/Downloads/sample.parquet',
                     columns=["ts", "session_id", "event", "duration", "sample_data"])
3. But here is where it gets a bit tricky for me. One column (a key column) is called "session_id". Many values are unique, but many duplicate session_id values also exist and have multiple associated rows of data. I wish to iterate through the master dataframe and create a unique dataframe per session_id. Each of these unique (sub) dataframes would have a calculation done that simply gets the SUM of the "duration" column per session_id. Again, that SUM would be unique per session_id, so each sub dataframe would have its own SUM, with a row added listing that total along with the session_id. I'm thinking there is a nested loop formula that will work for me, but every effort has been a mess to date.
4. Ultimately, I'd like to have a final dataframe that is a collection of these unique sub dataframes. I guess I'd need to define this final dataframe and append each new sub dataframe to it as I iterate through the data. I should be able to do that simply.
5. Finally, write this final df to a new parquet file. Should be simple enough, so I won't need help with that.
But that is my challenge in a nutshell. The main design I'd need help with is #3. I've played with itertuples and iterrows.
I think the groupby function will work:
df.groupby('session_id')['duration'].sum()
More info here:
https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html
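A minimal end-to-end sketch along those lines, reusing the columns and input path from the question (the output path is hypothetical):
import pandas as pd

df = pd.read_parquet("/Users/marmicha/Downloads/sample.parquet",
                     columns=["ts", "session_id", "event", "duration", "sample_data"])

# One row per session_id with the summed duration - no explicit loop needed
totals = (df.groupby("session_id", as_index=False)["duration"]
            .sum()
            .rename(columns={"duration": "total_duration"}))

# Write the aggregated result to a new parquet file
totals.to_parquet("/Users/marmicha/Downloads/session_totals.parquet", index=False)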

Is there a way to map 2 dataframes onto each other to produce a rearranged dataframe (with one data frame values acting as the new column names)?

I have one dataframe of readings that come in a particular arrangement due to the nature of the experiment. I also have another dataframe, that contains information about each point on the dataframe and what each point corresponds to, in terms of what chemical was at that point. Note, there are only a few different chemicals over the dataframe, but they are arranged all over the dataframe.
What I want to do is create a new, reorganised dataframe where the columns are the types of chemical. My initial thought was to compare the data and information dataframes to produce a dictionary, which I could then transform into a new dataframe. I could not figure out how to do this, and it might not actually be the best approach either!
I have previously achieved this by manually rearranging the points on the dataframe to match the pattern I want, but I'm not happy with that approach and there must be a better way.
Thanks in advance for any help!
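One possible sketch of that idea (the readings, labels, and chemical names below are made up, and it assumes every chemical occurs the same number of times):
import pandas as pd

# Hypothetical readings laid out by position, plus a frame saying which
# chemical sits at each position
readings = pd.DataFrame([[1.0, 2.0], [3.0, 4.0]], columns=["c1", "c2"])
labels = pd.DataFrame([["A", "B"], ["B", "A"]], columns=["c1", "c2"])

# Flatten both frames so each cell becomes one (chemical, value) pair
long = pd.DataFrame({
    "chemical": labels.stack().values,
    "value": readings.stack().values,
})

# Regroup so each chemical gets its own column in the reorganised frame
reorganised = pd.DataFrame({chem: grp["value"].to_list()
                            for chem, grp in long.groupby("chemical")})
print(reorganised)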

How can I manipulate tabular data with python?

I have column-based data in a CSV file and I would like to manipulate it in several ways. People have pointed me to R because it gives you easy access to both rows and columns, but I am already familiar with Python and would rather use it.
For example, I want to be able to delete all the rows that have a certain value in one of the columns. Or I want to change all the values of one column (e.g., trim the string). I also want to be able to aggregate rows based on common values (like a SQL GROUP BY).
Is there a way to do this in python without having to write a loop to iterate over all of the rows each time?
Look at the pandas library. It provides a DataFrame type similar to R's dataframe that lets you do the kind of thing you're talking about.
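For example, a short sketch of those three operations in pandas (the file name and column names are hypothetical):
import pandas as pd

# Hypothetical CSV with columns "name", "category" and "amount"
df = pd.read_csv("data.csv")

# Delete all the rows that have a certain value in one of the columns
df = df[df["category"] != "discard"]

# Change all the values of one column (e.g. trim whitespace from a string column)
df["name"] = df["name"].str.strip()

# Aggregate rows based on common values, like a SQL GROUP BY
totals = df.groupby("category")["amount"].sum()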
