I am doing a time series analysis. I have run the code below to generate a random year in the dataframe, as the original data did not have year values:
from random import randint

# Generating a random year from 2019 to 2022 (inclusive) to create ideal conditions
wc['Random_date'] = wc.Monthdate.apply(lambda val: f'{val} {randint(2019, 2022)}')
And now I have a dataframe that looks like this:
wc.head()
The ID column is currently the index, and I would like to generate a pivoted dataframe that looks like this:
Random_date  Count_of_ID
Jul 3 2019   2
Jul 4 2019   3
I understand that aggregation will need to be done after I pivot the data, but the following code is not working:
abscount = wc.pivot(index= 'Random_date', columns= 'Random_date', values= 'ID')
Here is the ending part of the error that I see:
Please help. Thanks.
You may check with:
df['Random_date'].value_counts()
If you need a unique count:
df.reset_index().drop_duplicates('ID')['Random_date'].value_counts()
Or:
df.reset_index().groupby('Random_date')['ID'].nunique()
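For illustration, a minimal runnable sketch with made-up Monthdate values (the ID index and column names follow the question):

import pandas as pd
from random import randint

# Hypothetical data mirroring the question's setup
wc = pd.DataFrame({'Monthdate': ['Jul 3', 'Jul 3', 'Jul 4', 'Jul 4', 'Jul 4']},
                  index=pd.Index([1, 2, 3, 4, 5], name='ID'))
wc['Random_date'] = wc.Monthdate.apply(lambda val: f'{val} {randint(2019, 2022)}')

# Unique IDs per date -- aggregation only, no pivot needed
counts = wc.reset_index().groupby('Random_date')['ID'].nunique()
print(counts)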
Looking to clean multiple data sets in a more automated way. The current format has the year as a column, the month as a row, and the numbers as values.
Below is an example of the current format; the original data has multiple years/months.
Current Format:

Year  Jan  Feb
2022  300  200
Below is an example of how I would like the new format to look. It combines month and year into one column and transposes the numbers into another column.
How would I go about doing this in Excel or Python? I have files with many years and multiple months.
New Format:

Date     Number
2022-01  300
2022-02  200
Check the solution below. You will need to extend month_df to cover all twelve months; the current one only caters to the example.
import pandas as pd

df = pd.DataFrame({'Year': [2022], 'Jan': [300], 'Feb': [200]})
# Lookup table mapping month abbreviations to zero-padded month numbers
month_df = pd.DataFrame({'Char_Month': ['Jan', 'Feb'], 'Int_Month': ['01', '02']})

# Wide to long: one row per (Year, month) pair
melted_df = pd.melt(df, id_vars=['Year'], value_vars=['Jan', 'Feb'],
                    var_name='Char_Month', value_name='Number')

(pd.merge(melted_df, month_df, on='Char_Month')
   .assign(Date=lambda d: d['Year'].astype(str) + '-' + d['Int_Month'])
   [['Date', 'Number']])
Output:

      Date  Number
0  2022-01     300
1  2022-02     200
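If the real files contain all twelve months, a more general sketch (assuming the month columns use standard three-letter abbreviations) avoids hand-building month_df:

import pandas as pd

# Made-up sample with two years
df = pd.DataFrame({'Year': [2021, 2022], 'Jan': [250, 300], 'Feb': [180, 200]})

long_df = df.melt(id_vars='Year', var_name='Month', value_name='Number')
# Parse e.g. "2022 Jan" into a datetime, then format as YYYY-MM
long_df['Date'] = pd.to_datetime(
    long_df['Year'].astype(str) + ' ' + long_df['Month'], format='%Y %b'
).dt.strftime('%Y-%m')
print(long_df[['Date', 'Number']].sort_values('Date', ignore_index=True))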
I have a dataframe that has Date as its index. The dataframe holds stock market data, so the dates are not continuous. If I want to move, let's say, 120 rows up in the dataframe, how do I do that? For example:
If I want to get the data starting from 120 trading days before the start of the year 2018, how do I do that with the selection below:
df['2018-01-01':'2019-12-31']
Thanks
Try this:
df.iloc[df.index.get_loc('2018-01-01'):df.index.get_loc('2019-12-31')]
Get the location of both dates in the index and slice between those positions to get the desired range.
UPDATE :
Based on your requirement, make some small modifications to the above.
Yearly Indexing
>>> df.iloc[df.index.get_loc('2018').start:df.index.get_loc('2019').stop]
Here df.index.get_loc('2018') does partial string indexing on the DatetimeIndex and returns a slice object; its .start attribute is the position of the first row of 2018, and likewise .stop marks the end of 2019.
Monthly Indexing
Now suppose you want data for the first 6 months of 2018 (without knowing what the first day is). The same can be done using:
>>> df.iloc[df.index.get_loc('2018-01').start:df.index.get_loc('2018-06').stop]
As you can see, this indexes the first 6 months of 2018 using the same logic.
Assuming you are using pandas and the dataframe is sorted by date (with Date as the index), a very simple way would be:
initial_date = '2018-01-01'
# searchsorted gives the position of the first index entry on or after
# initial_date, which also works when that exact date is not a trading day
initial_date_index = df.index.searchsorted(pd.Timestamp(initial_date))
offset = 120
start_index = initial_date_index - offset
new_df = df.iloc[start_index:]
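For a self-contained check, here is a minimal sketch with a synthetic business-day index; the close column is a made-up placeholder:

import pandas as pd
import numpy as np

# Business days only, so the dates are not continuous (weekends are missing)
idx = pd.bdate_range('2017-01-02', '2019-12-31')
df = pd.DataFrame({'close': np.random.rand(len(idx))}, index=idx)

# Position of the first trading day on or after 2018-01-01
pos = df.index.searchsorted(pd.Timestamp('2018-01-01'))

# From 120 trading days earlier through the end of 2019
window = df.iloc[pos - 120:].loc[:'2019-12-31']
print(window.index.min(), window.index.max())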
I have a DataFrame with these columns:
DF.head():
Email          Month  Year
abc@Mail.com   1      2018
abb@Mail.com   1      2018
abd@Mail.com   2      2019
...
abbb@Mail.com  6      2019
What I want to do is to get the total of email addresses in each month for both years 2018 and 2019 (knowing that I don't need to filter, since I have only these two years).
This is what I've done, but I want to make sure that it is right:
Stats = DF.groupby(['Year','Month'])['Email'].count()
Any suggestions?
It depends on what you need.
If you need to exclude missing values, or no missing values exist in the Email column, your solution is right; use GroupBy.count:
Stats = DF.groupby(['Year','Month'])['Email'].count()
If you need to count all rows in each group, including missing values (if any exist), use GroupBy.size:
Stats = DF.groupby(['Year','Month']).size()
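A small made-up sample shows the difference between the two:

import pandas as pd
import numpy as np

# One missing email in the 2018/1 group
DF = pd.DataFrame({'Email': ['a@mail.com', 'b@mail.com', np.nan, 'c@mail.com'],
                   'Month': [1, 1, 1, 2],
                   'Year': [2018, 2018, 2018, 2019]})

print(DF.groupby(['Year', 'Month'])['Email'].count())  # 2018/1 -> 2, NaN excluded
print(DF.groupby(['Year', 'Month']).size())            # 2018/1 -> 3, NaN included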
I have a pandas dataframe that I create from a list (which is created from a Spark RDD) by calling:
newRdd = rdd.map(lambda row: Row(row.__fields__ + ["tag"])(row + (tagScripts(row), ))).collect()
df = pd.DataFrame(newRdd)
My data ends up looking like a dataframe of tuples as shown below:
0 (2017-06-21, Sun, ATL, 10)
1 (2017-06-21, Sun, ATL, 11)
2 (2017-06-21, Sun, ATL, 11)
but I need it to look like a standard table with column headers as such:
date        dayOfWeek  airport  val1
2017-06-11  Sun        ATL      11
I'm honestly out of ideas on this one and need some help. I've tried a lot of different things and nothing has seemed to work. Any help would be greatly appreciated. Thank you for your time.
You can do it like this:
df = pd.DataFrame([*df.A], columns=['date', 'dayOfWeek', 'airport', 'val1'])
I supposed the column holding the tuples in your existing dataframe is named A; adjust if yours differs. The columns list must match the tuple length (four fields in your sample).
You can check here for tuple unpacking.
Hope this was helpful. If there are any questions, please let me know.
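A minimal sketch with made-up data, assuming the tuples sit in a column named A as above:

import pandas as pd

# Stand-in for the collected RDD result: one column of tuples
df = pd.DataFrame({'A': [('2017-06-21', 'Sun', 'ATL', 10),
                         ('2017-06-21', 'Sun', 'ATL', 11)]})

# Unpack each tuple into its own named column
tidy = pd.DataFrame([*df.A], columns=['date', 'dayOfWeek', 'airport', 'val1'])
print(tidy)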
I am getting familiar with Pandas and I want to learn the logic with a few simple examples.
Let us say I have the following pandas DataFrame object:
import pandas as pd
d = {'year':pd.Series([2014,2014,2014,2014], index=['a','b','c','d']),
'dico':pd.Series(['A','A','A','B'], index=['a','b','c','d']),
'mybool':pd.Series([True,False,True,True], index=['a','b','c','d']),
'values':pd.Series([10.1,1.2,9.5,4.2], index=['a','b','c','d'])}
df = pd.DataFrame(d)
Basic Question.
How do I take a column as a list?
I.e., df['year']
would return
[2013, 2014, 2014, 2014]
Question 0
How do I take rows 'a' and 'b' and columns 'year' and 'values' as a new DataFrame?
If I try:
d[['a','b'],['year','values']]
it doesn't work.
Question 1.
How would I aggregate (sum/average) the values column by the year and dico columns? I.e., different year/dico combinations would not be added together, and mybool would simply be dropped.
I.e., after aggregation (in this case averaging) I should get:

dico  values       year
A     10.1         2013
A     (9.5+1.2)/2  2014
B     4.2          2014

When I try the groupby function it seems to output some odd new DataFrame structure with bool in it and all possible year/dico combinations; my objective is rather the simpler, smaller sliced dataframe shown above.
Question 2. How do I filter by a condition?
I.e., I want to filter out all rows where mybool is False.
It'd return:

dico  values  year  mybool
A     10.1    2013  True
A     9.5     2014  True
B     4.2     2014  True
I've tried the pandas tutorial but I still get some odd behavior, so asking directly seems to be a better idea.
Thanks!
Values from a Series as an array or list:
df['year'].values   # returns a NumPy array
df['year'].tolist() # returns a plain Python list
loc lets you subset a dataframe by index labels:
df.loc[['a','b'],['year','values']]
groupby lets you aggregate over columns:
df.groupby(['year','dico'], as_index=False)['values'].mean()  # note: there is no 2013 in your df
Filtering by a column value:
df[df['mybool']]  # no need to compare to True explicitly
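Putting the four snippets together against the df defined in the question (all years there are 2014, so the groupby output will differ from the 2013 example):

import pandas as pd

d = {'year': pd.Series([2014, 2014, 2014, 2014], index=['a', 'b', 'c', 'd']),
     'dico': pd.Series(['A', 'A', 'A', 'B'], index=['a', 'b', 'c', 'd']),
     'mybool': pd.Series([True, False, True, True], index=['a', 'b', 'c', 'd']),
     'values': pd.Series([10.1, 1.2, 9.5, 4.2], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)

print(df['year'].tolist())                     # column as a plain list
print(df.loc[['a', 'b'], ['year', 'values']])  # rows a/b, two columns
print(df.groupby(['year', 'dico'], as_index=False)['values'].mean())
print(df[df['mybool']])                        # rows where mybool is True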