pandas: extract a new header from an already existing header? - python

import pandas as pd

df = pd.DataFrame({'DATE': ['2017-01-01', '2017-01-02'], 'Sexuality/us': ['female', 'male'], 'Height/us': [190, 195]})
         DATE Sexuality/us  Height/us
0  2017-01-01       female        190
1  2017-01-02         male        195
As you can see, this is a pandas DataFrame.
I want to write this DataFrame to a CSV file.
When I use df.to_csv('demo.csv'), the header is written as a single flat row.
What I really want is to extract the us into a separate header row; of course, there may be many countries, and I want all the countries extracted into their own header row.
Can anyone help me? Thanks very much.

If you put DATE into the index, you can split your columns on / and create a MultiIndex.
df = df.set_index('DATE')
df.columns = df.columns.str.split('/', expand=True)
df.reset_index().swaplevel(axis=1)
                  us
         DATE Height Sexuality
0  2017-01-01    190    female
1  2017-01-02    195      male
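For completeness, a minimal end-to-end sketch of the same idea that also writes the result out, assuming the column names always follow the name/country pattern (the MultiIndex header then lands in demo.csv as two header rows):
import pandas as pd

df = pd.DataFrame({'DATE': ['2017-01-01', '2017-01-02'],
                   'Sexuality/us': ['female', 'male'],
                   'Height/us': [190, 195]})

df = df.set_index('DATE')
df.columns = df.columns.str.split('/', expand=True)   # ('Sexuality', 'us'), ('Height', 'us')
df = df.swaplevel(axis=1)                              # country becomes the top header level
df.to_csv('demo.csv')                                  # writes two header rows plus the DATE index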

Related

How do I delete specific dataframe rows based on a column's value?

I have a pandas dataframe with 2 columns ("Date" and "Gross Margin"). I want to delete rows based on the value in the "Date" column. This is my dataframe:
Date Gross Margin
0 2021-03-31 44.79%
1 2020-12-31 44.53%
2 2020-09-30 44.47%
3 2020-06-30 44.36%
4 2020-03-31 43.69%
.. ... ...
57 2006-12-31 49.65%
58 2006-09-30 52.56%
59 2006-06-30 49.86%
60 2006-03-31 46.20%
61 2005-12-31 40.88%
I want to delete every row where the "Date" value doesn't end with "12-31". I read some similar posts on this and the pandas.drop() function seemed to be the solution, but I haven't figured out how to use it for this specific case.
Please leave any suggestions as to what I should do.
You can try the following code, which matches the day and month:
df['Date'] = pd.to_datetime(df['Date'], format='%Y-%m-%d')
df = df[df['Date'].dt.strftime('%m-%d') == '12-31']
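An equivalent sketch that compares the parsed month and day directly instead of formatting them back into strings (assuming the same already-parsed Date column):
df['Date'] = pd.to_datetime(df['Date'], format='%Y-%m-%d')
df = df[(df['Date'].dt.month == 12) & (df['Date'].dt.day == 31)]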
Assuming the dates are formatted as year-month-day, keep only the rows whose string ends with "12-31":
df = df[df['Date'].str.endswith('12-31')]
If the dates use a consistent format, you can do it like this:
df = df[df['Date'].str.contains("12-31", regex=False)]
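Since the question specifically mentions drop, the same filter can also be phrased with DataFrame.drop; a hedged sketch, assuming the Date values are still strings:
df = df.drop(df[~df['Date'].str.endswith('12-31')].index)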

How to fill missing dates with corresponding NaN in other columns

I have a CSV that initially creates following dataframe:
Date Portfoliovalue
0 2021-05-01 50000.0
1 2021-05-05 52304.0
Using the following script, I would like to fill in the missing dates with a corresponding NaN in the Portfoliovalue column. So the result would be this:
Date Portfoliovalue
0 2021-05-01 50000.0
1 2021-05-02 NaN
2 2021-05-03 NaN
3 2021-05-04 NaN
4 2021-05-05 52304.0
I first tried the method here: Fill the missing date values in a Pandas Dataframe column
However, the bfill replaces all my NaNs, and removing it only returns an error.
So far I have tried this:
df = pd.read_csv("Tickers_test5.csv")
df2 = pd.read_csv("Portfoliovalues.csv")
portfolio_value = df['Currentvalue'].sum()
portfolio_value = portfolio_value + cash
date = datetime.date(datetime.now())
df2.loc[len(df2)] = [date, portfolio_value]
print(df2.asfreq('D'))
However, this only returns this:
Date Portfoliovalue
1970-01-01 NaN NaN
Thanks for your help. I am really impressed at how helpful this community is.
Quick update:
I have added the code, so that it fills my missing dates. However, it is part of a programme, which tries to update the missing dates every time it launches. So when I execute the code and no dates are missing, I get the following error:
ValueError: cannot reindex from a duplicate axis
The code is as follows:
df2 = pd.read_csv("Portfoliovalues.csv")
portfolio_value = df['Currentvalue'].sum()
date = datetime.date(datetime.now())
df2.loc[date, 'Portfoliovalue'] = portfolio_value
#Solution provided by Uts after asking on Stackoverflow
df2.Date = pd.to_datetime(df2.Date)
df2 = df2.set_index('Date').asfreq('D').reset_index()
So by the looks of it, the code adds a duplicate date, which then causes the .reindex() function to raise the ValueError. However, I am not sure how to proceed. Is there an alternative to .reindex(), or does the assignment of today's date need changing?
Pandas has an asfreq function for a DatetimeIndex; it is basically just a thin but convenient wrapper around reindex(), which generates a date_range and calls reindex.
Code
df.Date = pd.to_datetime(df.Date)
df = df.set_index('Date').asfreq('D').reset_index()
Output
Date Portfoliovalue
0 2021-05-01 50000.0
1 2021-05-02 NaN
2 2021-05-03 NaN
3 2021-05-04 NaN
4 2021-05-05 52304.0
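The follow-up error in the question ("cannot reindex from a duplicate axis") appears when the same date occurs twice before asfreq runs. One possible guard, sketched here under the assumption that the last value per date should win, is to deduplicate first:
df.Date = pd.to_datetime(df.Date)
df = df.drop_duplicates(subset='Date', keep='last')   # a duplicate date would make asfreq raise
df = df.set_index('Date').asfreq('D').reset_index()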
Pandas has a reindex method: given a list of indices, it keeps only the labels from that list and inserts NaN rows for any labels that are missing.
In your case, you can create all the dates you want, with date_range for example, and then pass it to reindex. You might need a simple set_index and reset_index, but I assume you don't care much about the original index.
Example:
df.set_index('Date').reindex(pd.date_range(start=df['Date'].min(), end=df['Date'].max(), freq='D')).reset_index()
First we set the 'Date' column as the index. Then we reindex with the full list of dates (built by date_range from the minimal to the maximal date in the 'Date' column, with daily frequency) as the new index. This produces NaN in every position that had no former value.
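One caveat worth noting: after reset_index the column built from date_range is named index rather than Date, and reindex only matches if the labels are real timestamps. A small sketch of the same one-liner with those two details handled (the variable name out is only illustrative):
df['Date'] = pd.to_datetime(df['Date'])   # labels must be timestamps so reindex can match them
out = (df.set_index('Date')
         .reindex(pd.date_range(start=df['Date'].min(), end=df['Date'].max(), freq='D'))
         .reset_index()
         .rename(columns={'index': 'Date'}))   # reset_index names the new column 'index'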

Read in a csv with a different separator for each column?

Columns id and runtime are comma-separated. However, column genres is separated by a pipe (|).
df = pd.read_csv(path, sep=',') results in the table below. However, I can't run any queries on column genres, for instance finding the most popular genre by year. Is it possible to split the pipe-separated values into separate rows?
df.head()
id runtime genres Year
0 135397 124 Action|Adventure|Science Fiction|Thriller 2000
1 76341 120 Action|Adventure|Science Fiction|Thriller 2002
2 262500 119 Adventure|Science Fiction|Thriller 2001
3 140607 136 Action|Adventure|Science Fiction|Fantasy 2000
4 168259 137 Action|Crime|Thriller 1999
You're better off reading the file as is, then splitting the genres into new rows with pandas explode:
df = df.assign(genres = df.genres.str.split('|')).explode('genres')
so that you can easily manipulate your data.
For example, to get the most frequent (i.e. mode) genres per year:
df.groupby('Year').genres.apply(lambda x: x.mode()).droplevel(1)
To identify the counts:
def get_all_max(grp):
    counts = grp.value_counts()
    return counts[counts == counts.max()]
df.groupby('Year').genres.apply(get_all_max)\
.rename_axis(index={None:'Genre'}).to_frame(name='Count')
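If ties don't matter and a single genre per year is enough, a shorter hedged sketch that takes the most frequent label from each year's value counts:
df.groupby('Year').genres.agg(lambda g: g.value_counts().idxmax())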

Merging DataFrames on specific columns

I have a frame moviegoers that includes zip codes but not cities.
I then redefined moviegoers to be zipcodes and changed the data type of zip codes to be a data frame instead of a series.
zipcodes = pd.read_csv('NYC1-moviegoers.csv',dtype={'zip_code': object})
I know the dataset URL I need is this: https://raw.githubusercontent.com/mafudge/datasets/master/zipcodes/free-zipcode-database-Primary.csv.
I defined a dataframe, zip_codes, to pull the data from that dataset and changed the dataset type from series to dataframe so it's in the same format as the zipcodes dataframe.
I want to merge the dataframes so I can have the movie goer data. But, instead of zipcodes, I want to have the state abbreviation. This is where I am having issues.
The end goal is to count the number of movie goers per state. Example ideal output:
CA 116
MN 78
NY 60
TX 51
IL 50
Any ideas would be greatly appreciated.
I think you need map by a Series and then value_counts for the count:
print (zipcodes)
zip_code
0 85711
1 94043
2 32067
3 43537
4 15213
s = zip_codes.set_index('Zipcode')['State']
df = zipcodes['zip_code'].map(s).value_counts().rename_axis('state').reset_index(name='count')
print (df.head())
state count
0 OH 1
1 CA 1
2 FL 1
3 AZ 1
4 PA 1
Simply merge both datasets on the Zipcode column, then run groupby for the state counts.
# READ DATA FILES WITH RENAMING OF ZIP COLUMN IN FIRST
url = "https://raw.githubusercontent.com/mafudge/datasets/master/zipcodes/free-zipcode-database-Primary.csv"
moviegoers = pd.read_csv('NYC1-moviegoers.csv', dtype={'zip_code': object}).rename(columns={'zip_code': 'Zipcode'})
zipcodes = pd.read_csv(url, dtype={'Zipcode': object})
# MERGE ON COMMON FIELD
merged_df = pd.merge(moviegoers, zipcodes, on='Zipcode')
# AGGREGATE BY INDICATOR (STATE)
merged_df.groupby('State').size()
# ALTERNATIVE GROUP BY COUNT
merged_df.groupby('State')['Zipcode'].agg('count')
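To match the ideal output in the question (largest counts first), the result can simply be sorted; a small follow-up sketch on top of merged_df from above:
state_counts = merged_df.groupby('State').size().sort_values(ascending=False)
print(state_counts.head())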

Filtering data-frame on multiple criteria

I have a dataframe df which has a head that looks like:
Shop Opening date
0 London NaT
22 Brighton 01/03/2016
27 Manchester 01/31/2017
54 Bristol 03/31/2017
69 Glasgow 04/09/2017
I also have a variable startPeriod which is set to the date 1/04/2017, and a variable endPeriod that has a value of 30/06/17.
I am trying to create a new dataframe based on df that filters out any rows that do not have a date (so removing any rows with an Opening date of NaT) and also filters out any rows with an opening date between startPeriod and endPeriod. So in the above example I would be left with the following new dataframe:
Shop Opening date
22 Brighton 01/03/2016
69 Glasgow 04/09/2017
I have tried to filter out the 'NaT' using the following:
df1 = df['Opening date '] != 'NaT'
but I am unsure how to also filter out any Opening dates that fall inside the startPeriod/endPeriod range.
You can use between with boolean indexing:
df['date'] = pd.to_datetime(df['date'])
df = df[df['date'].between('2016-03-01', '2017-04-05')]
print (df)
Shop Opening date
2 27 Manchester 2017-01-31
3 54 Bristol 2017-03-31
I think filtering out NaNs is not necessary, but if you need it, chain a new condition:
df = df[df['date'].between('2016-03-01', '2017-04-05') & df['date'].notnull()]
First of all, be careful with the space after date in df['Opening date ']
try this solution:
df1 = df[df['Opening date'] != 'NaT']
It would be much better if you create a copy of the subset you're making:
df1 = df[df['Opening date'] != 'NaT'].copy()
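For the asker's stated goal (drop the NaT rows and exclude openings inside the period), a hedged sketch; the date formats in the example are ambiguous, so the startPeriod/endPeriod values below are only an assumed reading:
df['Opening date'] = pd.to_datetime(df['Opening date'], errors='coerce')   # unparsable/missing -> NaT
startPeriod = pd.to_datetime('2017-04-01')   # assumed reading of 1/04/2017
endPeriod = pd.to_datetime('2017-06-30')     # assumed reading of 30/06/17
mask = df['Opening date'].notna() & ~df['Opening date'].between(startPeriod, endPeriod)
df1 = df[mask].copy()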
