How do I copy pandas nested column to another DF? - python

We have some data in a Delta source which has nested structures. For this example we are focusing on a particular field named status, which has a number of sub-fields: commissionDate, decommissionDate, isDeactivated, isPreview, terminationDate.
In our transformation we currently read the Delta file in using PySpark, convert the DF to pandas using df.toPandas() and operate on this pandas DF using the pandas API. Once we have this pandas DF we would like to access its fields without using row iteration.
The data in Pandas looks like the following when queried using inventory_df["status"][0] (i.e. each element of inventory_df["status"] is a Row):
Row(commissionDate='2011-07-24T00:00:00+00:00', decommissionDate='2013-07-15T00:00:00+00:00', isDeactivated=True, isPreview=False, terminationDate=None)
We have found success using row iteration like:
unit_df["Active"] = [
not row["isDeactivated"] for row in inventory_df["status"]
]
but we have to use row iteration each time we want to access data from inventory_df, which is more verbose and less efficient.
We would love to be able to do something like:
unit_df["Active"] = [
not inventory_df["status.isDeactivated"]
]
which is similar to the Spark destructuring approach and would allow accessing all of the rows at once, but there doesn't seem to be equivalent pandas logic.
The data within PySpark has a format like status: struct<commissionDate:string,decommissionDate:string,isDeactivated:boolean,isPreview:boolean,terminationDate:string> and we can use the format mentioned above, selecting a subcolumn like df.select("status.isDeactivated").
How can this approach be done using pandas?

This may get you where you want to be:
unit_df["Active"] = unit_df["Active"].apply(lambda x: pd.DataFrame([x.asDict()]))
From here I would do:
unit_df = pd.concat([pd.concat(unit_df["Active"].tolist(), ignore_index=True), unit_df], axis=1)
Which would get you a single pd.DataFrame, now with columns for commissionDate, decommissionDate, etc.
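An alternative sketch that avoids per-column row iteration: convert the Row objects to dicts once and build a flat DataFrame from them. This assumes inventory_df["status"] holds pyspark.sql.Row objects as shown above and that isDeactivated contains clean booleans:

import pandas as pd

# Expand the Row objects into a flat DataFrame, one column per sub-field.
status_df = pd.DataFrame([row.asDict() for row in inventory_df["status"]])

# Every sub-field is now an ordinary pandas column and can be used vectorised.
unit_df["Active"] = ~status_df["isDeactivated"]

If you control the Spark side, it may be even simpler to flatten before converting, e.g. df.select("status.*").toPandas(), so the nested struct never reaches pandas as Row objects.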

Related

How to combine rows with same id number using pandas in python?

I have a big csv file and I would like to combine rows with the same id#.
For instance, this is what my csv shows right now.
And I would like it to be like this:
How can I do this using pandas?
Try this:
df = df.groupby('id').agg({'name': 'last',
                           'type': 'last',
                           'date': 'last'}).reset_index()
This way you can have a customized function for handling each column (by changing the function from 'last' to your own function).
You can read the csv with the pd.read_csv() function and then use GroupBy.last() to aggregate rows with the same id.
Something like:
df = pd.read_csv('file_name.csv')
df1 = df.groupby('id').last()
You should also decide on an aggregation function instead of simply using the last row's value.
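For example, a minimal sketch mixing built-in and custom aggregations (the file name and column names are illustrative, taken from the snippets above rather than from the actual csv):

import pandas as pd

# Hypothetical file and column names, for illustration only.
df = pd.read_csv('file_name.csv')

df = (
    df.groupby('id')
      .agg({
          'name': 'last',                                        # keep the last name seen per id
          'type': lambda s: ', '.join(s.astype(str).unique()),   # custom: join the distinct values
          'date': 'max',                                         # keep the most recent date
      })
      .reset_index()
)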

Efficiently load and store data using Dask by changing one column at a time

I'm in the process of implementing a csv parser using Dask and pandas dataframes. I'd like to make it load only the columns it needs, so it works well with, and doesn't need to load, large amounts of data.
Currently the only method I've found of writing a column to a parquet/Dask dataframe is by loading all the data as a pandas dataframe, modifying the column and converting from pandas.
all_data = self.data_set.compute() # Loads all data, compute to pandas dataframe
all_data[column] = column_data # Modifies one column
self.data_set = dd.from_pandas(all_data, npartitions=2) # Store all data into dask dataframe
This seems really inefficient, so I was looking for a way to avoid having to load all the data and perhaps modify one column at a time or write directly to parquet.
I've stripped away most of the code but here is an example function that is meant to normalise the data for just one column.
import pandas as pd
import dask.dataframe as dd

def normalise_column(self, column, normalise_type=NormaliseMethod.MEAN_STDDEV):
    column_data = self.data_set.compute()[column]  # This also converts all data to a pd dataframe

    if normalise_type is NormaliseMethod.MIN_MAX:
        [min, max] = [column_data.min(), column_data.max()]
        column_data = column_data.apply(lambda x: (x - min) / (max - min))
    elif normalise_type is NormaliseMethod.MEAN_STDDEV:
        [mean, std_dev] = [column_data.mean(), column_data.std()]
        column_data = column_data.apply(lambda x: (x - mean) / std_dev)

    all_data = self.data_set.compute()
    all_data[column] = column_data
    self.data_set = dd.from_pandas(all_data, npartitions=2)
Can someone please help me make this more efficient for large amounts of data?
Due to the binary nature of the parquet format, and because compression is normally applied to the column chunks, it is never possible to update the values of a column in a file without a full load-process-save cycle (the number of bytes would not stay constant). At least, Dask should enable you to do this partition-by-partition, without running out of memory.
It would be possible to write custom code that avoids parsing the compressed binary data in columns you know you don't want to change, and just reads and writes them again, but implementing this would take some work.
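A rough sketch of the partition-by-partition idea (paths and column names are placeholders, and this is a standalone script rather than a drop-in replacement for the class method above): compute the aggregates out-of-core, assign the transformed column lazily, and write a new parquet dataset.

import dask
import dask.dataframe as dd

# Placeholder paths and column name.
ddf = dd.read_parquet('input_data.parquet')

# Mean and standard deviation are reduced out-of-core by Dask.
mean, std = dask.compute(ddf['some_column'].mean(), ddf['some_column'].std())

# The assignment stays lazy; nothing is pulled into a single pandas dataframe.
ddf['some_column'] = (ddf['some_column'] - mean) / std

# Parquet files cannot be patched in place, so write a new dataset.
ddf.to_parquet('output_data.parquet')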

How can I extract specific rows which contain a specific keyword from my JSON dataset using pandas in Python?

Sorry, this might be a very simple question, but I am new to Python/JSON and everything. I am trying to filter my Twitter JSON data set based on user_location/country_code/gb, but I have no idea how to do this. I have tried several ways but still no luck. I have attached my data set and some of the code I have used here. I would appreciate any help.
Here is what I did to get the best result so far; however, I do not know how to make it go through the whole data set and print out the resulting tweet_id values:
import json
import pandas as pd

df = pd.read_json('example.json', lines=True)

if df['user_location'][4]['country_code'] == 'th':
    print(df.tweet_id[4])
else:
    print('false')
This code shows me the tweet_id: 1223489829817577472
However, I couldn't extend it to the whole data set.
I have tried this code as well, still with no luck:
dataset = df[df['user_location'].isin([ "gb" ])].copy()
print (dataset)
This is what my data set looks like:
I would break the user_location column into multiple columns using the following:
df = pd.concat([df, df.pop('user_location').apply(pd.Series)], axis=1)
Running this should give you a column each for the keys contained within the user_location json. Then it should be easy to print out tweet_ids based on country_code using:
df[df['country_code']=='th']['tweet_id']
An explanation of what is actually happening here:
df.pop('user_location') removes the 'user_location' column from df and returns it at the same time
With the returned column, we use the .apply method to apply a function to the column
pd.Series converts each JSON dictionary into a Series; applied row by row, this expands the column into a DataFrame
pd.concat concatenates the original df (now without the 'user_location' column) with the new columns created from the 'user_location' data
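A small self-contained sketch of the same idea (the data below is made up to mirror the structure described in the question):

import pandas as pd

# Toy data imitating the structure described in the question.
df = pd.DataFrame({
    'tweet_id': [1223489829817577472, 2, 3],
    'user_location': [
        {'country_code': 'th', 'city': 'Bangkok'},
        {'country_code': 'gb', 'city': 'London'},
        {'country_code': 'gb', 'city': 'Leeds'},
    ],
})

# Expand the nested dictionaries into their own columns ...
df = pd.concat([df, df.pop('user_location').apply(pd.Series)], axis=1)

# ... then filtering by country_code becomes a plain boolean mask.
print(df[df['country_code'] == 'gb']['tweet_id'])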

Does dask dataframe have any efficient way to group by one column and then join on this column?

I have a dask.DataFrame like this:
uid|name
1|A
2|A
3|B
4|C
I want to get the following result:
uid|name|new_id
1|A|A_NEW_ID
2|A|A_NEW_ID
3|B|B_NEW_ID
4|C|C_NEW_ID
I tried to get the result in the following way:
First, I use groupby to get a name table:
df2 = df.groupby("name").reset_index()
I get a new DataFrame like the following:
index|name
0|A
1|B
2|C
Then, I can join the two DataFrames:
final_df = df.join(df2,on="name")
However, my table is very large and the name field is also big, so the join consumes too many resources. Is there a more efficient way to do this?
If you have a small pandas dataframe already that maps uids to names, then a join between a dask dataframe and a pandas dataframe should be fast and efficient.
If you are looking for a unique set of uids then I recommend:
df.uid.unique().compute()
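Building on that suggestion, a hedged sketch of the overall pattern (assuming df is the dask DataFrame from the question with columns uid and name; the new-id scheme is purely illustrative): compute the distinct names, build a small pandas lookup table in memory, and merge it back into the dask dataframe. Merging a large dask DataFrame with a small pandas DataFrame broadcasts the small table to every partition, which is much cheaper than a dask-to-dask join.

import pandas as pd

# Distinct names; assumed small enough to fit in memory.
unique_names = df['name'].unique().compute()

# Small pandas lookup table: one row per distinct name, with an illustrative new id.
name_map = pd.DataFrame({'name': pd.Series(unique_names)})
name_map['new_id'] = name_map['name'].astype(str) + '_NEW_ID'

# dask-to-pandas merge: the small table is broadcast to every partition.
result = df.merge(name_map, on='name', how='left')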

Function in pandas dataframe that replicates dplyr group_by(multiple variables) function in R

Consider this case:
Python pandas equivalent to R groupby mutate
In dplyr:
df = df %>% group_by(a, b) %>%
means the dataframe is grouped first by column a, then by b.
In my case I am trying to group my data first by the group_name column, then by user_name, then by type_of_work. There are more than three columns (which is why I got confused) but I need the data grouped according to these three headers, in the same order. I already have an algorithm to work with the columns after this stage. I only need an algorithm for creating a dataframe grouped according to these three columns.
It is important in my case that the sequence is preserved like the dplyr function.
Do we have anything similar for pandas dataframes?
grouped = df.groupby(['a', 'b'])
Read more on "split-apply-combine" strategy in the pandas docs to see how pandas deals with these issues compared to R.
From your comment it seems you want to assign the grouped frames. You can either use the groupby object through the API, e.g. grouped.mean(), or you can iterate through the groupby object; you will get the group name and the group frame in each loop.
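A short sketch using the column names from the question (df is assumed to be the existing dataframe; sort=False keeps groups in their order of first appearance instead of sorting by the key columns, which matches the concern about preserving sequence):

# Group by the three columns, preserving the order in which groups first appear.
grouped = df.groupby(['group_name', 'user_name', 'type_of_work'], sort=False)

# Iterate over the groups: keys is a (group_name, user_name, type_of_work) tuple
# and frame is the sub-dataframe for that combination.
for keys, frame in grouped:
    print(keys)
    print(frame)

# Or use the API directly, e.g. a row count per combination.
counts = grouped.size().reset_index(name='n')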
