Pandas dataframe of tuples? - python

I have a pandas dataframe that I create from a list (which is created from a Spark RDD) by calling:
newRdd = rdd.map(lambda row: Row(row.__fields__ + ["tag"])(row + (tagScripts(row), ))).collect()
df = pd.DataFrame(newRdd)
My data ends up looking like a dataframe of tuples as shown below:
0 (2017-06-21, Sun, ATL, 10)
1 (2017-06-21, Sun, ATL, 11)
2 (2017-06-21, Sun, ATL, 11)
but I need it to look like a standard table with column headers as such:
date dayOfWeek airport val1
2017-06-11 Sun ATL 11
I'm honestly out of ideas on this one and need some help. I've tried a lot of different things and nothing has seemed to work. Any help would be greatly appreciated. Thank you for your time.

You can do it like this:
df = pd.DataFrame([*df.A],columns = ['date','dayOfWeek','airport','val1','val2','val3','val4','val5','val6'])
I assumed the column name in the dataframe you already have is A.
You can check here for tuple unpacking.
Hope this was helpful. If there are any questions, please let me know.
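As a hedged, self-contained illustration of the same unpacking idea (the sample tuples and the shorter column list are made up to match the question's data):

import pandas as pd

# A one-column dataframe whose cells are tuples, similar to pd.DataFrame(newRdd)
df = pd.DataFrame({'A': [('2017-06-21', 'Sun', 'ATL', 10),
                         ('2017-06-21', 'Sun', 'ATL', 11)]})

# Unpacking each tuple spreads its elements across the named columns
out = pd.DataFrame([*df.A], columns=['date', 'dayOfWeek', 'airport', 'val1'])
print(out)
#          date dayOfWeek airport  val1
# 0  2017-06-21       Sun     ATL    10
# 1  2017-06-21       Sun     ATL    11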

Related

Pivoting data in Python using Pandas

I am doing a time series analysis. I have run the code below to generate a random year in the dataframe, as the original data did not have year values:
wc['Random_date'] = wc.Monthdate.apply(lambda val: f'{val} {randint(2019,2022)}')
#Generating random year from 2019 to 2022 to create ideal conditions
And now I have a dataframe that looks like this:
wc.head()
The ID column is the index currently, and I would like to generate a pivoted dataframe that looks like this:
Random_date   Count_of_ID
Jul 3 2019    2
Jul 4 2019    3
I do understand that aggregation will be needed to be done after I pivot the data, but the following code is not working:
abscount = wc.pivot(index= 'Random_date', columns= 'Random_date', values= 'ID')
Please help. Thanks.
You may check with:
df['Random_date'].value_counts()
If you need the unique count:
df.reset_index().drop_duplicates('ID')['Random_date'].value_counts()
Or:
df.reset_index().groupby('Random_date')['ID'].nunique()
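As a hedged illustration of the difference on made-up data (the dataframe index standing in for the ID column):

import pandas as pd

# Made-up data; the index stands in for the ID column mentioned above
df = pd.DataFrame({'Random_date': ['Jul 3 2019', 'Jul 3 2019',
                                   'Jul 4 2019', 'Jul 4 2019', 'Jul 4 2019']},
                  index=pd.Index([1, 2, 3, 3, 4], name='ID'))

# Plain row counts per date
print(df['Random_date'].value_counts())

# Unique-ID counts per date; ID 3 appearing twice on Jul 4 is counted once
print(df.reset_index().groupby('Random_date')['ID'].nunique())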

How do I merge two tables with dates within the key (Python)

I wandered around for a long time before I could find a solution to my issue, and I wanted to ask the community whether you have a better idea than the one I came up with.
My problem is the following:
I have two tables (one table is my source data and the other is the mapping) that I want to merge on a certain key.
In my source data, I have two dates: Date_1 and Date_2
In my mapping, I have four dates: Date_1_begin, Date_1_end, Date_2_begin, Date_2_end
The problem is: those dates are part of my key.
For example:
df
A B date
0 1 A 20210310
1 1 A 20190101
2 3 C 19981231
mapping
A B date_begin date_end code
0 1 A 19600101 20201231 1
1 1 A 20210101 20991231 2
2 3 C 19600101 20991231 3
The idea is that doing something like this:
pd.merge(df, mapping, on=['A','B'])
would give me two codes for key 1_A: 1 and 2. But I want a 1-to-1 relation.
In order to assign the right code considering the dates, I did something like this using piecewise from the numpy library:
df_date = df['date'].values
conds = [(df_date >= start_date) & (df_date <= end_date) for start_date, end_date in zip(mapping.date_begin.values, mapping.date_end.values)]
result = np.piecewise(np.zeros(len(df)), conds, mapping['code'].values)
df['code'] = result
And it works fine... But I figured it must exist somewhere something easier and classier maybe...
Many thanks in advance!
Clem
You need to add an enumeration to the duplicated rows:
(df1.assign(enum=df1.groupby(['A','B']).cumcount())
    .merge(df2.assign(enum=df2.groupby(['A','B']).cumcount()),
           on=['A','B','enum'])
)
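If a 1-to-1 match by date range is the goal rather than enumeration, a hedged alternative sketch using the question's own sample frames: merge on the key, then keep only the rows whose date falls inside [date_begin, date_end]. This assumes each date matches exactly one interval per key, as in the example:

import pandas as pd

# The question's sample frames
df = pd.DataFrame({'A': [1, 1, 3], 'B': ['A', 'A', 'C'],
                   'date': [20210310, 20190101, 19981231]})
mapping = pd.DataFrame({'A': [1, 1, 3], 'B': ['A', 'A', 'C'],
                        'date_begin': [19600101, 20210101, 19600101],
                        'date_end': [20201231, 20991231, 20991231],
                        'code': [1, 2, 3]})

# Merge on the key, then filter down to the matching interval
merged = df.merge(mapping, on=['A', 'B'])
result = merged[merged['date'].between(merged['date_begin'], merged['date_end'])]
print(result[['A', 'B', 'date', 'code']])  # one code per input row: 2, 1, 3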

Python 3.6 Dataframe Sorting by DateTime column [10:00:01 comes before 00:00:01]

Thank you in advance for any advice. I have searched all over and could not find a solution.
I would like to sort a dataframe by a TIME column formatted as "%H:%M:%S".
df_VFT['DATE'] = pd.to_datetime(df_VFT['DATE'])
pd.to_datetime(df_VFT.TIME, format="%H:%M:%S")
df_VFT['WEEK'] = df_VFT['DATE'].dt.year.map(str) + "-" + df_VFT['DATE'].dt.week.map(str)
df_VFT.sort_values(['TIME'], ascending=True, inplace=True)
print(df_VFT)
However, the resulting dataframe has the TIME column sorted almost correctly, but starting with 10:00:01. The sort runs from 10:00:01 to 23:59:59, followed by 0:00:01 to 9:59:59.
How can I correct this so that the time when sorted starts with 00:00:01?
Any help is really appreciated.
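A hedged observation on the snippet above: the bare pd.to_datetime(df_VFT.TIME, format="%H:%M:%S") call never assigns its result back, so TIME may still be an unparsed string when sort_values runs. A minimal sketch of sorting on properly parsed times (the toy df_VFT stands in for the real data):

import pandas as pd

# Toy stand-in for df_VFT; assumption: TIME holds "%H:%M:%S" strings
df_VFT = pd.DataFrame({'TIME': ['10:00:01', '00:00:01', '23:59:59']})

# Assign the parsed result back so the sort is chronological
df_VFT['TIME'] = pd.to_datetime(df_VFT['TIME'], format="%H:%M:%S")
df_VFT.sort_values('TIME', ascending=True, inplace=True)
print(df_VFT)  # 00:00:01 now sorts first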

get subset of dataframe and name each sub-dataframe differently using pandas

I am working with Python.
I have a dataframe and I want to get subsets of it.
I also want to name each subset of the dataframe differently.
All I could find out so far was:
df = pd.DataFrame({'Name': list('aabbef'),
                   'A': [4, 5, 4, 5, 5, 4],
                   'B': [7, 8, 9, 4, 2, 3],
                   'C': [1, 3, 5, 7, 1, 0]}, columns=['Name', 'A', 'B', 'C'])
print (df)
d = dict(tuple(df.groupby('Name')))
print (d)
print (d['a'])
In the above example, I could get a subset of dataframe using a specific value in Name column.
I wonder whether there is any way to assign a different name to each of the subsets.
For example, in the above case, there are 4 different values in the Name column (a, b, e, f).
If I had 26 values (a, b, c, ..., z) and I generated a list of names for each sub-dataframe, qqq = [q1, q2, q3, q4, ... q26],
I want to get,
q1 = d['a']
q2 = d['b']
q3 = d['e']
q4 = d['f']
...
q26 = d['z']
Is there any way that I could use loop for this name assignment?
Thank you!
Edit:
Thank you for the comments.
The data I am working with now looks like this.
For each ID, there are 15 more questions and for each question there are two to three answers. Now, the data is stacked (I would say as a long shape).
What I want is to transform the data into wide shape.
For each ID, I want each question + answer variable as a column.
My initial thought was:
I generate a new column ('hca_qa') concatenating the question and answer columns.
Then I get the subsets of the dataframe by 'hca_qa'.
Then I merge the subsets of the dataframe into one to get wide-shape data.
So far, I know how to get the subsets of dataframes, but to merge them into one, I may need a name for each dataframe first.
I am open to a better way, so please let me know.
I sincerely appreciate it.
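For what it's worth, a hedged sketch building on the question's own d = dict(tuple(df.groupby('Name'))): a dict keyed by the Name values avoids generating q1..q26 variables altogether, and a loop can visit every subset:

import pandas as pd

# The question's toy data
df = pd.DataFrame({'Name': list('aabbef'),
                   'A': [4, 5, 4, 5, 5, 4],
                   'B': [7, 8, 9, 4, 2, 3],
                   'C': [1, 3, 5, 7, 1, 0]}, columns=['Name', 'A', 'B', 'C'])

# One sub-dataframe per Name value, addressable by that value
d = dict(tuple(df.groupby('Name')))

# Loop over the subsets instead of assigning q1 = d['a'], q2 = d['b'], ...
for name, sub in d.items():
    print(name, len(sub))

For the long-to-wide reshape described in the edit, pandas' pivot or pivot_table on ID and the concatenated 'hca_qa' column may avoid splitting and re-merging entirely.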

putting .size() into new column python pandas

I am very new to python (and to stack overflow!) so hopefully this makes sense!
I have a dataframe which contains years and names (amongst other things; however, this is all I am interested in working with).
I have done df = df.groupby(['year', 'name']).size() to get the number of times each name appears in each year.
It returns something similar to this:
year name
2001 nameone 2
2001 nametwo 3
2002 nameone 1
2002 nametwo 5
What I'm trying to do is put the size data into a new column called 'count'.
(Eventually I intend to plot this on graphs.)
Any help would be greatly appreciated!
Here is the raw code (I have condensed it a bit for convenience):
hso_df = pd.read_csv('HibernationSurveyObservationsCleaned.csv')
hso_df[["startDate", "endDate", "commonName"]]
year_df = hso_df
year_df['startDate'] = pd.to_datetime(hso_df['startDate'])
year_df['year'] = year_df['startDate'].dt.year
year_df = year_df[["year", "commonName"]].sort_values('year')
year_df = year_df.groupby(['year', 'commonName']).size()
Here is an image of the first 3 rows of the data displayed with .head().
The only columns that are of interest from this data are the commonName and the year (I have taken this from startDate)
IIUC you want transform to add the result of the groupby with its index aligned to the original df:
df['count'] = df.groupby(['year', 'name'])['name'].transform('size')
EDIT
Looking at your requirements, I suggest calling reset_index on the groupby result and then merging this back to your main df:
year_df = year_df.reset_index()
hso_df.merge(year_df).rename(columns={0: 'count'})
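For completeness, a hedged, self-contained sketch of both variants on made-up data (the names mirror the example above):

import pandas as pd

# Made-up data mirroring the year/name example
df = pd.DataFrame({'year': [2001, 2001, 2001, 2002],
                   'name': ['nameone', 'nameone', 'nametwo', 'nameone']})

# Variant 1: transform keeps the original row count, one size per row
df['count'] = df.groupby(['year', 'name'])['name'].transform('size')

# Variant 2: one row per (year, name) pair, with the size as a named column
counts = df.groupby(['year', 'name']).size().reset_index(name='count')
print(counts)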
