Creating new columns and filling them based on another column's values - python

Let's say I have a dataframe df looking like this:
|ColA |
|---------|
|B=7 |
|(no data)|
|C=5 |
|B=3,C=6 |
How do I extract the data into new columns, so it looks like this:
|ColA | B | C |
|------|---|---|
|True | 7 | |
|False | | |
|True | | 5 |
|True | 3 | 6 |
For filling the columns I know I can use regex .extract, as shown in this solution.
But how do I set the column name at the same time? So far I use a loop over df.ColA.loc[df["ColA"].isna()].iteritems(), but that does not seem like the best option for a lot of data.

You could use str.extractall to get the data, then reshape the output and join to a derivative of the original dataframe:
# create the B/C columns
df2 = (df['ColA'].str.extractall('([^=]+)=([^=,]+),?')
         .set_index(0, append=True)
         .droplevel('match')[1]
         .unstack(0, fill_value='')
       )
# rework ColA and join previous output
df.notnull().join(df2).fillna('')
# or if several columns:
df.assign(ColA=df['ColA'].notnull()).join(df2).fillna('')
output:
    ColA  B  C
0   True  7
1  False
2   True     5
3   True  3  6
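For reference, here is a self-contained sketch that reproduces the example above (the sample frame is rebuilt from the question's table):

import pandas as pd

df = pd.DataFrame({'ColA': ['B=7', None, 'C=5', 'B=3,C=6']})

# extract every key=value pair, then pivot the keys into columns
df2 = (df['ColA'].str.extractall('([^=]+)=([^=,]+),?')
         .set_index(0, append=True)
         .droplevel('match')[1]
         .unstack(0, fill_value='')
       )

print(df.assign(ColA=df['ColA'].notnull()).join(df2).fillna(''))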


Python: Filtering a data structure depending on column value [duplicate]

I have two pandas dataframes structured like so:
DF1:
| ID | Zone |
|:--:|:----:|
| 11 |  1   |
| 12 |  2   |
| 10 |  0   |
DF2:
| ID | Time |
|:--:|:----:|
| 11 |  1   |
| 11 |  2   |
| 12 |  1   |
| 12 |  2   |
And I want to add a new column to DF2 named Zone that contains the correct Zone value for each ID. See the example below.
| ID | Time | Zone |
|:--:|:----:|:----:|
| 11 |  1   |  1   |
| 11 |  2   |  1   |
| 12 |  1   |  2   |
| 12 |  2   |  2   |
For this small example I have written some code that works fine, but I would like to use it on two large DFs. So my question is: is there a better (more elegant) way to do this? My current code is:
import numpy as np

df2['Zone'] = np.empty([len(df2.index)])
for i in df2.index:
    for j in df1.index:
        if df2['ID'][i] == df1['ID'][j]:
            df2.loc[i, 'Zone'] = df1.loc[j, 'Zone']
You need to use the merge function to perform a join:
pd.merge(df1, df2, on="ID")
Where on="ID" indicates which is the reference column and must be present in both dataframes (if you want to join using columns with different names, it is also possible, just check the docs).
merge is also available as a dataframe's method so you can alternatively call it as:
df1.merge(df2, on="ID")
With exactly the same result.
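For instance, a minimal runnable sketch, with the frames rebuilt from the question's tables:

import pandas as pd

df1 = pd.DataFrame({'ID': [11, 12, 10], 'Zone': [1, 2, 0]})
df2 = pd.DataFrame({'ID': [11, 11, 12, 12], 'Time': [1, 2, 1, 2]})

# inner join on the shared ID column; IDs missing from either
# frame (here ID 10) are dropped
out = df2.merge(df1, on='ID')
print(out)
#    ID  Time  Zone
# 0  11     1     1
# 1  11     2     1
# 2  12     1     2
# 3  12     2     2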
df1.merge(df2, how='right')
or
df2.merge(df1, how='left')
Either joins df2 with df1 using the ID columns (when on is not given, merge defaults to the columns the two frames have in common).
Here's a nice one-liner:
df2["Zone"] = [df1.set_index("ID")["Zone"][df2["ID"][i]] for i in df2.index]
Setting the index of df1 to its 'ID' column allows the values of df2's 'ID' column to act as indices, making the call a little simpler.
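Note that the list comprehension rebuilds the ID-indexed lookup on every iteration; for large frames, a vectorized sketch of the same idea would be:

# map each ID in df2 to its Zone via an ID-indexed lookup Series
df2["Zone"] = df2["ID"].map(df1.set_index("ID")["Zone"])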

pyspark create all possible combinations of column values of a dataframe

I want to get all the possible combinations of size 2 of a column in pyspark dataframe.
My pyspark dataframe looks like
| id |
| 1 |
| 2 |
| 3 |
| 4 |
For above input, I want to get output as
| id1 | id2 |
| 1 | 2 |
| 1 | 3 |
| 1 | 4 |
| 2 | 3 |
and so on..
One way would be to collect the values into a Python iterable (list, pandas df) and use itertools.combinations to generate all combinations.
import itertools
from pyspark.sql import functions as F

values = df.select(F.collect_list('id')).first()[0]
combns = list(itertools.combinations(values, 2))
However, I want to avoid collecting the dataframe column to the driver since the rows can be extremely large. Is there a better way to achieve this using spark APIs?
You can use the crossJoin method and then keep only the rows with id1 < id2 (this also drops the self-pairs where id1 equals id2).
df = df.toDF('id1').crossJoin(df.toDF('id2')).filter('id1 < id2')
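A minimal end-to-end sketch, assuming a local SparkSession and the single id column from the question:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1,), (2,), (3,), (4,)], ['id'])

# cross join the column with itself, then keep each unordered pair once
pairs = df.toDF('id1').crossJoin(df.toDF('id2')).filter('id1 < id2')
pairs.show()
# +---+---+
# |id1|id2|
# +---+---+
# |  1|  2|
# |  1|  3|
# |  1|  4|
# |  2|  3|
# |  2|  4|
# |  3|  4|
# +---+---+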

How to apply a function on each group of data in a pandas group by

Suppose the data frame below:
|id |day | order |
|---|--- |-------|
| a | 2 | 6 |
| a | 4 | 0 |
| a | 7 | 4 |
| a | 8 | 8 |
| b | 11 | 10 |
| b | 15 | 15 |
I want to apply a function to the day and order columns of each group, grouping the rows by the id column.
The function is:
def mean_of_differences(my_list):
    return sum([my_list[i] - my_list[i-1] for i in range(1, len(my_list))]) / len(my_list)
This function calculates the mean of the differences between each element and the next one. For example, for id=a, day would be 2+3+1 divided by 4. I know how to use lambda, but I didn't find a way to implement this in a pandas group by. Also, each column should be sorted independently to get my desired output, so apparently it is not possible to just sort by one column before the group by.
The output should be like this:
|id |day| order |
|---|---|-------|
| a |1.5| 2 |
| b | 2 | 2.5 |
Anyone know how to do this in a group by?
First, sort your data by day then group by id and finally compute your diff/mean.
df = df.sort_values('day') \
       .groupby('id') \
       .agg({'day': lambda x: x.diff().fillna(0).mean()}) \
       .reset_index()
Output:
>>> df
  id  day
0  a  1.5
1  b  2.0
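The aggregation above only covers day. Starting again from the original frame, a sketch that sorts each column independently (as the question requires for order) could look like this:

def mean_of_differences(s):
    # sort each column on its own, then average the consecutive differences
    s = s.sort_values()
    return s.diff().fillna(0).mean()

out = df.groupby('id')[['day', 'order']].agg(mean_of_differences).reset_index()
print(out)
#   id  day  order
# 0  a  1.5    2.0
# 1  b  2.0    2.5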

Using pandas apply to pass in both a row and the entire dataframe with it [duplicate]

I have a df and I want to create some new cols with it. How would I use the apply function to both pass in the row, and the entire df with it? I need the entire df to do some filtering, and the data is subject to the values in each row.
Or maybe I don't need to use apply, but that's the first thing that came to my mind. Thank you and all help is appreciated!
Ex of df:
+----+--------+--------+
| ID | Family | Amount |
+----+--------+--------+
| 1 | A | 2 |
| 2 | A | 10 |
| 3 | B | 4 |
| 4 | B | 7 |
+----+--------+--------+
Result:
+----+--------+--------+-----------+------------+
| ID | Family | Amount | Total_Fam | Id_Percent |
+----+--------+--------+-----------+------------+
| 1 | A | 2 | 12 | .166 |
| 2 | A | 10 | 12 | .833 |
| 3 | B | 4 | 11 | .363 |
| 4 | B | 7 | 11 | .636 |
+----+--------+--------+-----------+------------+
First, group by Family and transform Amount with a sum; then you can directly divide Amount by the new column.
import numpy as np

df['Total_Fam'] = df.groupby('Family')['Amount'].transform(np.sum)
df['Id_Percent'] = df['Amount'] / df['Total_Fam']
df
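The same result can be written as a one-step sketch using the built-in 'sum' reducer, which avoids the numpy import:

df['Id_Percent'] = df['Amount'] / df.groupby('Family')['Amount'].transform('sum')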
Using apply on a column passes each row individually. If you use apply on the entire dataset, it sees the entire dataset, hence you can use all columns. As you can see in the example below, df['new_2'] is made using a function applied to the whole dataset, so I do not need to pass the df to it.
import pandas as pd
import seaborn as sns

df = sns.load_dataset('iris')

df['new'] = df['species'].apply(lambda x: x[:2])

def sumIsMore(dataframe):
    x = dataframe['sepal_length']
    y = dataframe['sepal_width']
    return x + y >= 8.5

df['new_2'] = df.apply(sumIsMore, axis=1)

Pandas - applying groupings and counts to multiple columns in order to generate/change a dataframe

I'm sure what I'm trying to do is fairly simple for those with better knowledge of PD, but I'm simply stuck at transforming:
+---------+------------+-------+
| Trigger | Date | Value |
+---------+------------+-------+
| 1 | 01/01/2016 | a |
+---------+------------+-------+
| 2 | 01/01/2016 | b |
+---------+------------+-------+
| 3 | 01/01/2016 | c |
+---------+------------+-------+
...etc, into:
+------------+---------------------+---------+---------+---------+
| Date | #of triggers | count a | count b | count c |
+------------+---------------------+---------+---------+---------+
| 01/01/2016 | 3 | 1 | 1 | 1 |
+------------+---------------------+---------+---------+---------+
| 02/01/2016 | 5 | 2 | 1 | 2 |
+------------+---------------------+---------+---------+---------+
... and so on
The issue is, I've got no bloody idea how to achieve this.
I've scoured SO, but I can't seem to find anything that applies to my specific case.
I presume I'd have to group it all by date, but then once that is done, what do I need to do to get the remaining columns?
The initial DF is loaded from an SQL Alchemy query object, and then I want to manipulate it to get the result I described above. How would one do this?
Thanks
df.groupby(['Date','Value']).count().unstack(level=-1)
You can use GroupBy.size with unstack; the parameter sort=False can also be helpful:
df1 = df.groupby(['Date','Value'])['Value'].size().unstack(fill_value=0)
df1['Total'] = df1.sum(axis=1)
cols = df1.columns[-1:].union(df1.columns[:-1])
df1 = df1[cols]
print (df1)
Value       Total  a  b  c
Date
01/01/2016      3  1  1  1
The difference between size and count is:
size counts NaN values, count does not.
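To get the exact headers from the question, a hedged follow-up sketch on top of the same groupby/size result (the column labels are taken from the desired output in the question):

df1 = df.groupby(['Date', 'Value'])['Value'].size().unstack(fill_value=0)
# total number of triggers per date, placed as the first column
df1.insert(0, '#of triggers', df1.sum(axis=1))
df1 = df1.rename(columns={'a': 'count a', 'b': 'count b', 'c': 'count c'}).reset_index()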
