Union Row inside Row PySpark Dataframe - python

I want to convert my Dataframe that has rows inside rows to a unique row, like this:
My dataframe:
[Row(Autorzc=u'S', Cd=u'00000012793', ClassCli=u'A', Op=Row(CEP=u'04661904', CaracEspecial='S', Venc=Row(v110=u'1', v120=u'2'))),
Row(Autorzc=u'S', Cd=u'00000012794', ClassCli=u'A', Op=Row(CEP=u'04661904', CaracEspecial='S', Venc=Row(v110=u'1', v120=u'2')))]
and I want to transform to this:
[Row(Autorzc=u'S', Cd=u'00000012793', ClassCli=u'A', CEP=u'04661904', CaracEspecial='S', v110=u'1', v120=u'2'),
Row(Autorzc=u'S', Cd=u'00000012794', ClassCli=u'A', CEP=u'04661904', CaracEspecial='S', v110=u'1', v120=u'2')]
Any suggestion?

You can do a simple select operation and your columns will be renamed accordingly.
final = initial.select("Autorzc", "Cd", "ClassCli", "Op.CEP",
                       "Op.CaracEspecial", "Op.Venc.v110", "Op.Venc.v120")
print(final.first())
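If you prefer explicit control over the resulting column names (for instance, to avoid collisions between nested fields), a minimal sketch using col(...).alias(...) could look like this, reusing the initial and final names from above:
from pyspark.sql.functions import col

final = initial.select(
    col("Autorzc"), col("Cd"), col("ClassCli"),
    col("Op.CEP").alias("CEP"),
    col("Op.CaracEspecial").alias("CaracEspecial"),
    col("Op.Venc.v110").alias("v110"),
    col("Op.Venc.v120").alias("v120"))
print(final.first())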

Related

How do I get the range in one dataframe column based on duplicate items in two other columns?

I have a dataframe that contains three columns: 'sequences', 'smiles' and 'labels'. Some of the rows have the same string entries in the 'sequences' and 'smiles' columns, but a different float value in the 'labels' column. For duplicate sequences and smiles, I would like to get the range of values in the 'labels' column for those duplicate rows and store it in a fourth column. I intend to reject rows whose range is above a certain value.
I have made a dataframe that contains all the duplicate values:
duplicate_df = pd.concat(g for _, g in df.groupby(['sequence', 'smiles']) if len(g) > 1)
How do I get the range of the labels from the df?
Is there something like this I can do?
duplicate_df.groupby(['Target_sequence', 'processed_SMILES']).range()
My duplicate_df looks like this:
pd.DataFrame({'Label': {86468: 55700.0,
86484: 55700.0,
86508: 55700.0,
124549: 55690.0,
124588: 55690.0},
'Target_sequence': {86468: 'AAPYLKTKFICVTPTTCSNTIDLPMSPRTLDSLMQFGNGEGAEPSAGGQF',
86484: 'AAPYLKTKFICVTPTTCSNTIDLPMSPRTLDSLMQFGNGEGAEPSAGGQF',
86508: 'AAPYLKTKFICVTPTTCSNTIDLPMSPRTLDSLMQFGNGEGAEPSAGGQF',
124549: 'AAPYLKTKFICVTPTTCSNTIDLPMSPRTLDSLMQFGNGEGAEPSAGGQF',
124588: 'AAPYLKTKFICVTPTTCSNTIDLPMSPRTLDSLMQFGNGEGAEPSAGGQF'},
'processed_SMILES': {86468: 'CCOC(=O)[NH+]1CC[NH+](C(=O)c2ccc(-n3c(=S)[n-]c4ccccc4c3=O)cc2)CC1',
86484: 'C[NH+]1CC[NH+](Cc2nc3ccccc3c(=O)n2Cc2nc(-c3ccccc3F)cs2)CC1',
86508: 'C[NH+]1CC[NH+](Cc2nc3ccccc3c(=O)n2Cc2nc(-c3cccc([N+](=O)[O-])c3)cs2)CC1',
124549: 'C[NH+]1CC[NH+](Cc2nc3ccccc3c(=O)n2Cc2nc(-c3cccc([N+](=O)[O-])c3)cs2)CC1',
124588: 'CCOC(=O)[NH+]1CC[NH+](C(=O)c2ccc(-n3c(=S)[n-]c4ccccc4c3=O)cc2)CC1'}})
For example, for duplicate rows where the 'Label' values are all the same, I would like to have 0 in the 'range' column.
std() is a valid aggregation function for a group-by object. Therefore, after creating your df with the duplicated data, you can try:
duplicate_df.groupby(['Target_sequence', 'processed_SMILES'])['Label'].std()
Edit:
This is a nice opportunity to use pd.NamedAgg which was released in version 0.25:
df.groupby(['Target_sequence', 'processed_SMILES']).agg(
    Minimum=pd.NamedAgg(column='Label', aggfunc='min'),
    Maximum=pd.NamedAgg(column='Label', aggfunc='max'))
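Since the question asks for the range rather than the standard deviation, a minimal sketch (assuming duplicate_df from the question and a placeholder threshold) could subtract the two named aggregations and filter on the result:
ranges = duplicate_df.groupby(['Target_sequence', 'processed_SMILES']).agg(
    Minimum=pd.NamedAgg(column='Label', aggfunc='min'),
    Maximum=pd.NamedAgg(column='Label', aggfunc='max'))
ranges['range'] = ranges['Maximum'] - ranges['Minimum']

# keep only groups whose label spread stays below the chosen cutoff
threshold = 10.0  # placeholder value
accepted = ranges[ranges['range'] <= threshold]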

How to calculate sum of specific column based on more than 2 complex conditions in python dataframe

Basically, what I want to figure out is whether there is a way of calculating 'batsman_runs' (not visible in the image, but the column does exist) per 'match_id' for each different 'batsman', and then storing the results as a dictionary or a list, or just printing the values.
The following link is a snapshot of the dataset
https://i.stack.imgur.com/zVWSh.jpg
Assuming you have imported numpy:
result = your_df['batsman_runs'].to_numpy() / your_df['match_id'].to_numpy()
result will be a NumPy array, which holds all the values of the 'batsman_runs' column divided by the respective values of the 'match_id' column.
You can try something like this, since you said you have a column called 'batsman_runs':
df = df.groupby(by=['match_id','batsman'])['batsman_runs'].sum()
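If you then want the per-match, per-batsman totals as a dictionary, as mentioned in the question, a minimal sketch converting the grouped Series would be:
totals = df.groupby(by=['match_id', 'batsman'])['batsman_runs'].sum()
totals_dict = totals.to_dict()  # keys are (match_id, batsman) tuples, values are summed runs
print(totals_dict)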

Integrating two diff dataframes using contains condition and creating a new dataframe

First data frame looks like below:
OSIED geometry
257005 POLYGON ((311852.712 178933.993, 312106.023 17...
017049 POLYGON ((272943.107 137755.159, 272647.627 13...
017032 POLYGON ((276637.425 146141.397, 276601.509 14...
Second data frame looks like below:
small_area Median_BER
217099001/217112002/217112003/2570052005/217112... 212.9
047041005/047041001/2570051004/047041002/047041... 271.3
157041002/157041004/157041003/157041001/157129... 222.5
I need to search col1 (df1) within col1 (df2) using a "contains" condition.
If a match is found, I need to fetch the corresponding values from both df1 and df2.
I tried merge, df.get and str.contains.
str.contains works, but I am unable to fetch the other columns.
Output should look like this:
OSIED   geometry                                            small_area                                        Median_BER
257005  POLYGON ((311852.712 178933.993, 312106.023 17...  217099001/217112002/217112003/2570052005/217112  212.9
017049  POLYGON ((272943.107 137755.159, 272647.627 13...  047041005/047041001/2570051004/047041002/047041  222.5
Playing around with some code, I was able to generate the following:
small_area_oseid_df = pd.DataFrame(
    [
        # df here is the second dataframe; the first 6 characters of each
        # small_area element give the matching OSIED code
        {'OSIED': oseid[:6], 'median_ber': row['Median_BER']}
        for row in df.to_dict(orient='records')
        for oseid in row['small_area'].split('/')
    ]
)
Then you can join this table with the first table on the OSIED key. How large small_area_oseid_df becomes depends on how many elements each row yields in the split, since this step explodes one row of the second dataframe into one row per small_area element.
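A minimal sketch of that final join, assuming the first dataframe is loaded as df1 with the OSIED and geometry columns shown above:
# inner join keeps only OSIED values that appear in at least one small_area string
result = df1.merge(small_area_oseid_df, on='OSIED', how='inner')
print(result[['OSIED', 'geometry', 'median_ber']])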

Slice a dataframe based on one column starting with the value of another column

I have a dataframe called data, that looks like like this:
|...|category|...|ngram|...|
I need to slice this dataframe so that instances where category starts with the value of ngram are dropped. So for example, if I had the following instance:
category: beds
ngram: bed
then that instance should be dropped from the resulting dataframe.
In T-SQL, I use the following query (which may not be the best way, but it works):
SELECT
*
FROM mytable
WHERE category NOT LIKE ngram+'%';
I have read up on this a bit, and my best attempt is:
data[data.category.str.startswith(data.ngram.str) == True]
But this does not return any rows, nor does the inverse (using == False).
# Use df.apply to keep the rows whose category starts with ngram; prepend ~ to the mask to drop them instead, matching the NOT LIKE query.
data[data.apply(lambda x: x.category.startswith(x.ngram), axis=1)]
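A minimal self-contained sketch of the dropping variant, using a made-up two-row frame named sample purely to illustrate the mask:
import pandas as pd

# made-up data: one row whose category starts with its ngram, one that does not
sample = pd.DataFrame({'category': ['beds', 'sofas'], 'ngram': ['bed', 'chair']})

mask = sample.apply(lambda x: x.category.startswith(x.ngram), axis=1)
print(sample[~mask])   # ~mask drops 'beds' (starts with 'bed'), mirroring the NOT LIKE query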

how to locate row in dataframe without headers

I noticed that when using .loc on a pandas dataframe, it not only finds the row of data I am looking for but also includes the column headers of the dataframe I am searching within.
So when I try to append the .loc row of data, it includes the data + column headers - I don't want any column headers!
##1st dataframe
df_futures.head(1)
date max min
19990101 2000 1900
##2nd dataframe
df_cash.head(1)
date$ max$ min$
1999101 50 40
## if date is found in dataframe 2, I will collect the row of data
data_to_track = []
for ii in range(len(df_futures['date'])):
    ## date I will try to find in df2
    date_to_find = df_futures['date'][ii]
    ## append the row of data to my list
    data_to_track.append(df_cash.loc[df_cash['Date$'] == date_to_find])
I want the for loop to return just 19990101 50 40
It currently returns 0 19990101 50 40, date$, max$, min$
I agree with other comments regarding the clarity of the question. However, if what you want is just a string that contains a particular row's data, then you could use the to_string() method of pandas.
In your case, instead of this:
df_cash.loc[df_cash['Date$'] == date_to_find]
You could get a string that includes only the row data:
df_cash[df_cash['Date$'] == date_to_find].to_string(header=None)
Also notice that I dropped the .loc part, which outputs the same result.
If your dataframe has multiple columns and you don't want them joined into a string (which may bring data type issues and is potentially problematic if you want to separate them later on), you could use the list() built-in, such as:
list(df_cash[df_cash['Date$'] == date_to_find].iloc[0])
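Putting the list() variant back into the loop from the question, a minimal sketch that keeps the question's variable names would be:
data_to_track = []
for date_to_find in df_futures['date']:
    match = df_cash[df_cash['Date$'] == date_to_find]
    if not match.empty:
        # append plain values only: no index label and no column headers
        data_to_track.append(list(match.iloc[0]))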
