I have yearly data sets with some missing data. I used the code below to read the file, but I am unable to omit the whitespace present at the end of February. Can anyone help me solve this problem?
import pandas as pd
import numpy as np

df1 = pd.read_fwf('DQ404.7_77.txt', widths=ws, header=9, nrows=31, keep_default_na=False)
df1 = df1.drop('Day', axis=1)
df2 = np.array(df1).T
What I want is to arrange all the data in one column with respect to date. My data is uploaded at this link, which you can download:
https://drive.google.com/open?id=0B2rkXkOkG7ExbEVwZUpHR29LNFE
What I want is to get time series data from this file, and it should look like:
Feb,25 13
Feb,26 13
Feb,27 13
Feb,28 13
March, 1 10
March, 2 10
March, 3 10
not with empty strings in between February and March.
After a lot of comments, it looks like df[df != ''] works for you.
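Putting it together, a minimal sketch under the assumptions that the file has one row per day and one column per month, and that missing cells come back as empty strings because of keep_default_na=False; the widths in ws are placeholders, not the real layout:

import pandas as pd

ws = [4] + [7] * 12   # placeholder widths; use the actual fixed-width layout
df1 = pd.read_fwf('DQ404.7_77.txt', widths=ws, header=9, nrows=31,
                  keep_default_na=False)
df1 = df1.drop('Day', axis=1)

# Transpose so each month's days come out consecutively, then stack into one column.
s = df1.T.stack()

# Drop the empty padding cells (e.g. the cells after Feb 28), i.e. the df[df != ''] idea above.
s = s[s != '']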
Looking to clean multiple data sets in a more automated way. The current format is year as a column, month as a row, and the numbers as values.
Below is an example of the current format; the original data has multiple years/months.
Current Format:

Year  Jan  Feb
2022  300  200
Below is an example of how I would like the new format to look. It combines month and year into one column and transposes the number into another column.
How would I go about doing this in Excel or Python? I have files with many years and multiple months.
New Format:

Date     Number
2022-01  300
2022-02  200
Check the solution below. You will need to extend month_df for all the months; currently it just caters to the example.
import pandas as pd

df = pd.DataFrame({'Year': [2022], 'Jan': [300], 'Feb': [200]})
# Extend this mapping to cover all twelve months for real data.
month_df = pd.DataFrame({'Char_Month': ['Jan', 'Feb'], 'Int_Month': ['01', '02']})
melted_df = pd.melt(df, id_vars=['Year'], value_vars=['Jan', 'Feb'],
                    var_name='Char_Month', value_name='Number')
result = (pd.merge(melted_df, month_df, on='Char_Month')
          .assign(Date=lambda d: d['Year'].astype(str) + '-' + d['Int_Month'])
          [['Date', 'Number']])
print(result)
Output:

      Date  Number
0  2022-01     300
1  2022-02     200
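A more general variant, a sketch that avoids hard-coding month_df by letting pandas parse the month abbreviations; it assumes every non-Year column is a three-letter English month name:

import pandas as pd

df = pd.DataFrame({'Year': [2022], 'Jan': [300], 'Feb': [200]})
long = df.melt(id_vars='Year', var_name='Month', value_name='Number')
# Parse e.g. '2022-Jan' and reformat it to '2022-01'.
long['Date'] = pd.to_datetime(long['Year'].astype(str) + '-' + long['Month'],
                              format='%Y-%b').dt.strftime('%Y-%m')
result = long[['Date', 'Number']]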
I am doing a time series analysis. I ran the code below to generate a random year in the dataframe, as the original data did not have year values:
from random import randint

# Generating a random year from 2019 to 2022 to create ideal conditions
wc['Random_date'] = wc.Monthdate.apply(lambda val: f'{val} {randint(2019, 2022)}')
And now I have a dataframe that looks like this:
wc.head()
The ID column is the index currently, and I would like to generate a pivoted dataframe that looks like this:
Random_date   Count_of_ID
Jul 3 2019    2
Jul 4 2019    3
I understand that aggregation will need to be done after I pivot the data, but the following code is not working:
abscount = wc.pivot(index= 'Random_date', columns= 'Random_date', values= 'ID')
Here is the ending part of the error that I see:
Please help. Thanks.
You may check with:
df['Random_date'].value_counts()
If you need a unique count:
df.reset_index().drop_duplicates('ID')['Random_date'].value_counts()
Or:
df.reset_index().groupby('Random_date')['ID'].nunique()
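If you want the counts back as a two-column frame matching the desired Random_date/Count_of_ID layout, a small sketch building on the last option; it assumes ID is the index of wc, as stated in the question:

counts = (wc.reset_index()
            .groupby('Random_date')['ID'].nunique()
            .reset_index(name='Count_of_ID'))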
I have a dataframe that has Date as its index. The dataframe has stock market related data, so the dates are not continuous. If I want to move, let's say, 120 rows up in the dataframe, how do I do that? For example:
If I want to get the data starting from 120 trading days before the start of the year 2018, how do I do that with the slice below?
df['2018-01-01':'2019-12-31']
Thanks
Try this:
df.iloc[df.index.get_loc('2018-01-01'):df.index.get_loc('2019-12-31')]
Get the positions of both dates in the index and slice positionally to get the desired range.
UPDATE:
Based on your requirement, here are some small modifications of the above.
Yearly Indexing
>>> df.iloc[df.index.get_loc('2018').start:df.index.get_loc('2019').stop]
Above, df.index.get_loc('2018') gives out a slice object covering 2018, so its .start attribute is the position of the first row of 2018; similarly, the .stop of df.index.get_loc('2019') is one past the last row of 2019.
Monthly Indexing
Now suppose you want the data for the first 6 months of 2018 (without knowing what the first day is); the same can be done using:
>>> df.iloc[df.index.get_loc('2018-01').start:df.index.get_loc('2018-06').stop]
As you can see above, we have indexed the first 6 months of 2018 using the same logic.
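To tie this back to the original ask (the 120 trading days before 2018), a sketch that combines the yearly slice above with a positional offset, under the same assumption that get_loc on the sorted DatetimeIndex returns a slice for the partial year string:

start_2018 = df.index.get_loc('2018').start   # position of the first row of 2018
df.iloc[max(start_2018 - 120, 0):]            # also include the 120 trading days before it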
Assuming you are using pandas and the dataframe is sorted by dates, a very simple way would be:
initial_date = '2018-01-01'
initial_date_index = df.loc[df['dates'] == initial_date].index[0]
offset = 120
start_index = initial_date_index - offset
new_df = df.loc[start_index:]
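Since the question says Date is the index, here is a variant of the same idea for a sorted DatetimeIndex; a sketch only, and the cutoff date is just an example:

import pandas as pd

# df is assumed to have a sorted DatetimeIndex of trading days.
pos = df.index.searchsorted(pd.Timestamp('2018-01-01'))  # first position at or after the date
new_df = df.iloc[max(pos - 120, 0):]                      # step back 120 trading days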
I have a DataFrame with these columns:
DF.head():
Email          Month  Year
abc#Mail.com   1      2018
abb#Mail.com   1      2018
abd#Mail.com   2      2019
.
.
abbb#Mail.com  6      2019
What I want to do is to get the total of email addresses in each month for both years 2018 and 2019 (knowing that I don't need to filter, since I only have these two years).
This is what I've done, but I want to make sure that this is right:
Stats = DF.groupby(['Year','Month'])['Email'].count()
Any suggestions?
It depends on what you need.
If you need to exclude missing values, or no missing values exist in the Email column, your solution is right; use GroupBy.count:
Stats = DF.groupby(['Year','Month'])['Email'].count()
If you need to count all rows in each group, including missing values (if they exist), use GroupBy.size:
Stats = DF.groupby(['Year','Month']).size()
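A small sketch of the difference on toy data with one missing email (the values here are made up):

import pandas as pd
import numpy as np

DF = pd.DataFrame({'Email': ['a@mail.com', np.nan, 'b@mail.com'],
                   'Month': [1, 1, 2],
                   'Year': [2018, 2018, 2018]})

print(DF.groupby(['Year', 'Month'])['Email'].count())  # (2018, 1) -> 1, NaN excluded
print(DF.groupby(['Year', 'Month']).size())            # (2018, 1) -> 2, NaN included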
I am working on processing a very large data set with pandas into more manageable data frames. I have a loop that splits the data frame into smaller data frames based on a leading ID number, and I then sort by the date column. However, I notice that after everything runs there are still some issues with dates not being sorted correctly. I want to create a manual filter that loops through the date column and checks whether the next date is greater than or equal to the previous date. This would ideally eliminate cases where the date column goes something like this (obviously in more of a data frame format):
[2017,2017,2018,2018,2018,2017,2018,2018]
I am writing some code to take care of this; however, I keep getting errors and was hoping someone could point me in the right direction.
for i in range(len(Rcols)):
    dfs[i] = data.filter(regex=f'{Rcols[i]}-')
    dfs[i]['Engine'] = data['Operation_ID:-PARAMETER_NAME:']
    dfs[i].set_index('Engine', inplace=True)
    dfs[i][f'{Rcols[i]}-DATE_TIME_START'] = pd.to_datetime(dfs[i][f'{Rcols[i]}-DATE_TIME_START'], errors='ignore')
    dfs[i].sort_values(by=f'{Rcols[i]}-DATE_TIME_START', ascending=True, inplace=True)
    for index, item in enumerate(dfs[i][f'{Rcols[i]}-DATE_TIME_START']):
        if dfs[i][f'{Rcols[i]}-DATE_TIME_START'][index + 1] >= dfs[i][f'{Rcols[i]}-DATE_TIME_START'][index]:
            continue
        else:
            dfs[i].drop(dfs[i][index])
Here Rcols is just a list of the column-header leading IDs, and dfs is a list of pandas data frames.
Thanks
This isn't particularly "manual", but you can use pd.Series.shift. Here's a minimal example, but the principle works equally well with a series of dates:
df = pd.DataFrame({'Years': [2017,2017,2018,2018,2018,2017,2018,2018]})
mask = df['Years'].shift() > df['Years']
df = df[~mask]
print(df)
Years
0 2017
1 2017
2 2018
3 2018
4 2018
6 2018
7 2018
Notice how the row with index 5 has been dropped since 2017 < 2018 (the row before). You can extend this for multiple columns via a for loop.
You should under no circumstances modify rows while you are iterating over them. This is spelt out in the docs for pd.DataFrame.iterrows:
You should never modify something you are iterating over. This is not
guaranteed to work in all cases. Depending on the data types, the
iterator returns a copy and not a view, and writing to it will have no
effect.
However, this becomes irrelevant when there is a vectorised solution available, as described above.
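For completeness, a sketch of the same mask applied to an actual datetime column; the column name DATE_TIME_START mirrors the question, and the dates are made up:

import pandas as pd

df = pd.DataFrame({'DATE_TIME_START': pd.to_datetime(
    ['2017-03-01', '2017-06-01', '2018-01-01', '2017-12-01', '2018-02-01'])})

# Keep rows whose date is not smaller than the date on the previous row.
mask = df['DATE_TIME_START'].shift() > df['DATE_TIME_START']
df = df[~mask]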