get subset of dataframe and name each sub-dataframe differently using pandas - python

I am working with Python.
I have a dataframe and I want to get subsets of it.
I also want to name each subset of the dataframe differently.
All I could find out so far was:
import pandas as pd

df = pd.DataFrame({'Name': list('aabbef'),
                   'A': [4, 5, 4, 5, 5, 4],
                   'B': [7, 8, 9, 4, 2, 3],
                   'C': [1, 3, 5, 7, 1, 0]}, columns=['Name', 'A', 'B', 'C'])
print(df)

d = dict(tuple(df.groupby('Name')))
print(d)
print(d['a'])
In the above example, I could get a subset of the dataframe using a specific value in the Name column.
I wonder if there is any way that I could assign a different name to each of the subsets.
For example, in the above case, there are 4 different values in the Name column (a, b, e, f).
If I have 26 values (a, b, c, ..., z) and I generate a list of names for each sub-dataframe, qqq = [q1, q2, q3, q4, ... q26],
I want to get,
q1 = d['a']
q2 = d['b']
q3 = d['e']
q4 = d['f']
...
q26 = d['z']
Is there any way that I could use a loop for this name assignment?
Thank you!
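For what it's worth, a minimal sketch of such a loop, assuming the dict d built above and keeping the generated names q1, q2, ... as dictionary keys rather than as 26 separate variables (which is usually easier to work with):

# Sketch only: map generated names q1, q2, ... to each sub-dataframe.
d = dict(tuple(df.groupby('Name')))
named = {f'q{i}': sub for i, (key, sub) in enumerate(sorted(d.items()), start=1)}
print(named['q1'])   # the sub-dataframe for Name == 'a'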
Edit:
Thank you for the comments.
The data I am working with now looks like this.
For each ID, there are 15 more questions, and for each question there are two to three answers. Right now, the data is stacked (I would say in a long shape).
What I want is to transform the data into a wide shape.
For each ID, I want a question + answer variable as a column.
What I thought of initially is:
I generate a new column ('hca_qa') by concatenating the question and answer columns.
Then I get subsets of the dataframe by 'hca_qa'.
Then I merge the subsets of the dataframe into one to get wide-shape data.
So far, I know how to get subsets of the dataframe, but to merge them into one, I may need a name for each sub-dataframe and then merge them.
I am open to a better way so please let me know.
I sincerely appreciate it.
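A minimal sketch of that long-to-wide reshape, assuming hypothetical column names ID, question, and answer (the real column names are not given here), and marking each question + answer combination with a 1/0 indicator column:

import pandas as pd

# Hypothetical long-shape data; the actual column names and values are assumptions.
long_df = pd.DataFrame({'ID': [1, 1, 2, 2],
                        'question': ['q1', 'q2', 'q1', 'q2'],
                        'answer': ['yes', 'no', 'no', 'yes']})

# Concatenate question and answer, then pivot to one row per ID.
long_df['hca_qa'] = long_df['question'] + '_' + long_df['answer']
wide = (long_df.assign(flag=1)
               .pivot_table(index='ID', columns='hca_qa', values='flag', fill_value=0))
print(wide)

This avoids splitting into named sub-dataframes and merging them back together by hand.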

Related

How to create a new df with pandas based on a set of conditions that compares each row with the others within the same df?

I am attempting to use pandas to create a new df based on a set of conditions that compares the rows within the original df to one another. I am new to using pandas and feel comfortable comparing two dfs with one another and doing basic column comparisons, but for some reason the row-by-row comparison is stumping me. My specific conditions and problem are found below:
Cosine_i_ start_time fid_ Shape_Area
0 0.820108 2022-08-31T10:48:34Z emit20220831t104834_o24307_s000 0.067763
1 0.962301 2022-08-27T12:25:06Z emit20220827t122506_o23908_s000 0.067763
2 0.811369 2022-08-19T15:39:39Z emit20220819t153939_o23110_s000 0.404882
3 0.788322 2023-01-29T13:23:39Z emit20230129t132339_o02909_s000 0.404882
4 0.811369 2022-08-19T15:39:39Z emit20220819t153939_o23110_s000 0.108256
^^Above is my original df that I will be working with.
Goal: I am hoping to create a new df that contains only the FIDs that meet the following conditions: the shape areas are equal, the cosi values have a difference greater than 0.1, and the start times differ by more than 5 days. This is going to be applied to a large dataset; the df displayed is just a small sample I made to help write the code.
For example: Rows 2 & 3 have the same shape area, so then looking at the cosi values, they have a difference in values greater than 0.1, and lastly they have a difference in their start times that is greater than 5 days. They meet all set conditions, so I would then like to take the FID values for BOTH of these rows and append it to a new df.
So essentially I want to compare every row with the other rows and that's where I am having trouble.
I am looking for as much guidance as possible on how to set this up as I am very very new to coding and am hoping to get a tutorial of some sort!
Thanks in advance.
Group by Shape_Area and filter each pair (single items on Shape_Area are omitted) by the required conditions:
# start_time must be a datetime column, e.g. df['start_time'] = pd.to_datetime(df['start_time'])
fids = df.groupby('Shape_Area').filter(
    lambda x: x.index.size > 1
              and x['Cosine_i_'].diff().abs().values[-1] >= 0.1
              and x['start_time'].diff().abs().dt.days.values[-1] > 5,
    dropna=True)['fid_'].tolist()
print(fids)
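For reference, a small setup sketch that rebuilds the sample frame from the question with start_time already parsed to datetime, which the timedelta arithmetic above requires:

import pandas as pd

# Sample rows copied from the question.
df = pd.DataFrame({
    'Cosine_i_': [0.820108, 0.962301, 0.811369, 0.788322, 0.811369],
    'start_time': pd.to_datetime(['2022-08-31T10:48:34Z', '2022-08-27T12:25:06Z',
                                  '2022-08-19T15:39:39Z', '2023-01-29T13:23:39Z',
                                  '2022-08-19T15:39:39Z']),
    'fid_': ['emit20220831t104834_o24307_s000', 'emit20220827t122506_o23908_s000',
             'emit20220819t153939_o23110_s000', 'emit20230129t132339_o02909_s000',
             'emit20220819t153939_o23110_s000'],
    'Shape_Area': [0.067763, 0.067763, 0.404882, 0.404882, 0.108256],
})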

How could I create a column with matching values from different datasets with different lengths

I want to create a new column in the dataset in which a ZipCode is assigned to a specific Region.
There are 5 Regions in total. Every Region consists of some number of ZipCodes. I would like to use the two different datasets to create this new column.
I have already tried some code; however, I failed because the series are not identically labeled. How should I tackle this problem?
I have two datasets: one of them has 1518 rows x 3 columns and the other one has 46603 rows x 3 columns.
As you can see in the picture:
df1 is the first dataset with the Postcode and Regio columns, which are the ZipCodes assigned to the corresponding Regio.
df2 is the second dataset where the Regio column is missing as you can see. I would like to add a new column into the df2 dataset which contains the corresponding Regio.
I hope someone could help me out.
Kind regards.
I believe you need to map the zipcode from dataframe 2 to the region column from the first dataframe, assuming Postcode and ZipCode are the same thing.
First create a dictionary from df1, then translate the zipcode values through the dictionary to fill the new Regio column:
zip_dict = dict(zip(df1.Postcode, df1.Regio))
df2['Regio'] = df2.ZipCode.replace(zip_dict)
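A minimal runnable sketch of the same idea with made-up zip codes (using .map instead of .replace, so that zip codes missing from df1 come out as NaN rather than keeping their original value):

import pandas as pd

# Hypothetical example frames; the real data has 1518 and 46603 rows.
df1 = pd.DataFrame({'Postcode': [1011, 1012, 2511], 'Regio': ['North', 'North', 'West']})
df2 = pd.DataFrame({'ZipCode': [2511, 1012, 9999]})

zip_dict = dict(zip(df1.Postcode, df1.Regio))
df2['Regio'] = df2['ZipCode'].map(zip_dict)   # 9999 has no match, so its Regio is NaN
print(df2)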

Comparing specific row in a column with all columns for that specific row in a dataframe

I'm new to Python and have been trying to figure this out for a week now.
I have a dataset of 2 rows by roughly 2000 columns; the data came in dictionary format and I used pd.DataFrame to convert it (I don't know if this is helpful or not).
Here is an example
gene1 gene2 gene3 etc
location [1,2] [3,4] [5,6]
enhancer ATCG GGGG CATA
I want to compare the enhancer from gene1 to the enhancer of each of the other genes, one by one, to tell me how many differences there are between them. I know I can't make a new column for this since that won't work; I think the best solution is to save the new information to a new DataFrame.
Example output
gene1 gene2 gene3
difference 0 3 4
I would like an idea of how to approach this from a different perspective; I've tried doing it using nested loops but couldn't figure it out.
Thank you
This is definitely not the best way to do that, but I would try the following as it is the most straightforward to me.
import pandas as pd

df = pd.DataFrame({"gene1": [[1, 2], "ATCG"],
                   "gene2": [[3, 4], "GGGG"],
                   "gene3": [[5, 6], "CATA"]}, index=["location", "enhancer"])

target_gene = df.loc["enhancer", "gene1"]
# Count position-by-position mismatches against gene1's enhancer.
df.loc["difference", :] = list(map(lambda x: sum(c1 != c2 for c1, c2 in zip(x, target_gene)),
                                   df.loc["enhancer", :]))
df
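As a small follow-up (a layout assumption, not part of the original answer), the counts can also go into a separate one-row DataFrame, as the question describes, instead of appending a row to the original df:

target = df.loc["enhancer", "gene1"]
diff_counts = df.loc["enhancer"].apply(
    lambda seq: sum(c1 != c2 for c1, c2 in zip(seq, target)))
diff_df = diff_counts.to_frame(name="difference").T
print(diff_df)   # one row named 'difference': gene1=0, gene2=3, gene3=4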

How to create a new python DataFrame with multiple columns of differing row lengths?

I'm organizing a new dataframe in order to easily insert data into a Bokeh visualization code snippet. I think my problem is due to differing row lengths, but I am not sure.
Below, I organized the dataset in alphabetical order by country name and created an alphabetical list of the individual countries. Checking new_data.tail(), Zimbabwe is listed last and there are 80336 rows, hence the sorting.
df_ind_data = pd.DataFrame(ind_data)
new_data = df_ind_data.sort_values(by=['country'])
new_data = new_data.reset_index(drop=True)
country_list = list(ind_data['country'])
new_country_set = sorted(set(country_list))
My goal is to create a new DataFrame with 76 columns (country names), with the specific 'trust' data in the rows underneath each country column.
df = pd.DataFrame()
for country in new_country_set:
    pink = new_data.loc[new_data['country'] == country]
    df[country] = pink.trust
In the resulting output, the data does not get included for the rest of the columns after the first. I believe this is due to the fact that the number of rows of 'trust' data for each country varies. While the first column has 1000 rows, there are some countries with as many as 2500 data points, and some with as few as 500.
I have attempted a few different methods to specify the number of rows in 'df', but to no avail.
The visualization code snippet I have uses this exact data structure for the template data, which is why I'm attempting to put it in a dataframe. Plus, I can't do it, so I want to know how to do it.
Yes, I can put it in a dictionary, but I want to put it in a dataframe.
You should use combine_first when you add a new column, so that the dataframe index gets extended to the union of all indexes. Instead of
df[country] = pink.trust
you should use something like
df = df.combine_first(pink[['trust']].rename(columns={'trust': country}))
which ensures that your index is always the union of all added columns' indexes.
I think in this case df.pivot(columns='var', values='val') will work for you, especially since you already have a dataframe. This function transfers the values of a particular column into column names. You can see the documentation for additional info. I hope that helps.
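A minimal sketch of that pivot idea, applied to this question's column names; the cumcount index is an addition (an assumption, not from the answer above) so that each country's values get their own row numbers instead of keeping the original sparse index:

import pandas as pd

# Hypothetical long-form data standing in for new_data.
new_data = pd.DataFrame({'country': ['Albania', 'Albania', 'Zimbabwe'],
                         'trust': [0.4, 0.6, 0.5]})

wide = (new_data.assign(obs=new_data.groupby('country').cumcount())
                .pivot(index='obs', columns='country', values='trust'))
print(wide)   # one column per country, padded with NaN where a country has fewer rows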

How to combine multiple rows of data into a single string per group

To preface: I'm new to using Python.
I'm working on cleaning up a file where data was spread across multiple rows. I'm struggling to find a solution that will concatenate multiple text strings into a single cell. The .csv data looks similar to this:
name,date,description
bundy,12-12-2017,good dog
,,smells kind of weird
,,needs to be washed
with one or two blank rows between each entry, too.
The number of rows used for 'description' isn't consistent. Sometimes it's just one cell, sometimes up to about four. The ideal output turns these multiple rows into a single row of useful data, without all the wasted space. I thought maybe I could create a series of masks by copying the data across a few columns, shifted up, and then iterating in some way. I haven't found a solution that matches what I'm trying to do, though. This is where I'm at so far:
# Add a column of description stuff and shift up a row for concatenation
DogData['Z'] = DogData['Y'].shift(-1)
DogData['AA'] = DogData['Z'].shift(-1)
DogData['AB'] = DogData['AA'].shift(-1)
#create series checks to determine how to concat values properly
YNAs = DogData['Y'].isnull()
ZNAs = DogData['Z'].isnull()
AANAs = DogData['AA'].isnull()
The idea here was basically that I'd iterate over column 'Y', check whether the same row in column 'Z' was NA or had a value, and concatenate if it did; if not, just use the value in 'Y'. Carry that logic across the columns, but stop if an NA is encountered in any subsequent column. I can't figure out how to do that, or whether there's a more efficient way to do this.
What do I have to do to get to my end result? I can't figure out the right way to iterate or concatenate in the way I was hoping to.
import pandas as pd

# The sample data below is assumed to be on the clipboard:
'''
name,date,description
bundy,12-12-2017,good dog
,,smells kind of weird
,,needs to be washed
'''
df = pd.read_clipboard(sep=',')
df.fillna(method='ffill').groupby([
    'name',
    'date'
]).description.apply(lambda x: ', '.join(x)).to_frame(name='description')
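If the data lives in a file rather than on the clipboard, the same approach could start from read_csv and drop the blank separator rows mentioned in the question first (the file name here is hypothetical):

import pandas as pd

df = pd.read_csv('dog_data.csv')   # hypothetical file name
df = df.dropna(how='all')          # drop the fully blank rows between entries
tidy = (df.ffill()
          .groupby(['name', 'date'], sort=False)['description']
          .apply(', '.join)
          .reset_index())
print(tidy)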
I'm not sure I follow exactly what you mean. I took that text, saved it as a csv file, and successfully read it into a pandas dataframe.
import pandas as pd
df = pd.read_csv('test.csv')
df
Output:
name date description
0 bundy 12-12-2017 good dog
1 NaN NaN smells kind of weird
2 NaN NaN needs to be washed
Isn't this the output you require?
