I have a dataframe with a column containing dates and cities; the dates are repeated every time a new city appears.
I want to keep only 4 specific dates and delete the others. I found an expression that does this, but it only handles one date at a time.
I want to create a function that does this whole process and keeps only the dates I want. Below are the df and the code that eliminates one date at a time.
df[df.column != '2020-06-19']
You can do it like this:
df = df[df.column.isin(['2020-06-19', '2020-06-20', '2020-06-21'])]
Or, if you want to remove these dates instead:
df = df[~df.column.isin(['2020-06-19', '2020-06-20', '2020-06-21'])]
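If you want the whole process in one reusable function, as asked, here is a minimal sketch (the column name and dates are just placeholders from the question):
import pandas as pd

def keep_dates(df, column, dates):
    # Return only the rows whose value in `column` is one of `dates`.
    return df[df[column].isin(dates)]

# Example usage with four placeholder dates:
# df = keep_dates(df, 'column', ['2020-06-19', '2020-06-20', '2020-06-21', '2020-06-22'])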
I have a df with a datetime column of expiration dates for employee certifications. Some certifications never expire, however, and those are NaT in the column.
I want to filter the df for dates after a certain date ('2019-12-31') while KEEPING NaTs as well.
Initially I tried:
valid_cert = tcert[tcert.cert_expire_dt>='2019-12-31']
which filtered out the NaTs. I then tried adding
or (tcert.cert_expire_dt == 'NaT')
before the final bracket (with parentheses around the previous condition as well), but that didn't work.
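One way to keep the NaT rows as well is to OR the date condition with an explicit missing-value check; a minimal sketch, assuming cert_expire_dt is a datetime column:
valid_cert = tcert[(tcert.cert_expire_dt >= '2019-12-31') | (tcert.cert_expire_dt.isna())]
Comparisons against NaT evaluate to False, which is why the plain >= filter dropped those rows; .isna() picks them back up, and | (not the Python keyword or) is the element-wise operator to use on pandas Series.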
Hello Stack Overflow community. I am having an issue while trying to do a simple merge between two dataframes that share the same date column. Sorry, I am new to Python and perhaps the way I express myself is not very clear. I am working on a project related to stock price calculations. The first dataframe has date and closing-price columns, while the second one only has a similar date column. My goal is to obtain a single date column with the matching closing-price column next to it.
this is what I have done to merge two dataframes
inner_join = pd.merge(df.iloc[7:79],df1[['Ex-Date','FDX UN Equity']],on ='Ex-date',how ='inner')
inner_join
Ex-date refers to the date column and FDX UN Equity refers to the column with closing prices.
I get this as a result:
... (the tail of the traceback points into pandas' merge-key validation, self._get_merge_keys())
KeyError: 'Ex-date'
Pandas reads the format of the date columns differently, so I gave both date columns the same format in the original Excel file, but that hasn't helped. I also tried all sorts of merges, but none of them worked either.
Does anyone have any idea what is going on?
The code would look like this:
import pandas as pd
inner_join = pd.merge_asof(df, df1, on = 'Ex-date')
Change both column headers to the same lowercase name and merge again. Check 'Ex-Date': the column name must be identical in both dataframes before you merge, and use how='left'.
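For example, a minimal sketch of that fix, assuming df already has a column called 'Ex-date' while df1 calls it 'Ex-Date', as in the question:
import pandas as pd

df1 = df1.rename(columns={'Ex-Date': 'Ex-date'})  # make the key name identical in both frames
inner_join = pd.merge(df.iloc[7:79], df1[['Ex-date', 'FDX UN Equity']], on='Ex-date', how='inner')
Use how='left' instead of how='inner' if you want to keep all dates from df even when there is no matching price.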
I am a complete noob at this Python and Jupyter Notebook stuff. I am taking an Intro to Python course and have been assigned a task: extract information from a .csv file. The following is a snapshot of my .csv file, titled "feeds1.csv":
https://i.imgur.com/BlknyC3.png
I can import the .csv into Jupyter Notebook and have tried the groupby function to sort it, but it won't work because that column also contains the time.
import pandas as pd
df = pd.read_csv("feeds1.csv")
I need it to output as follows:
https://i.imgur.com/BDfnZrZ.png
The ultimate goal would be to create a csv file with this accumulated data and use it to plot a chart.
If you do not need the time of day but just the date, you can simply use this:
df.created_at = df.created_at.str.split(' ').str[0]
dfout = df.groupby(['created_at']).count()
dfout.reset_index(level=0, inplace=True)
finaldf = dfout[['created_at', 'entry_id']]
finaldf.columns = ['Date', 'field2']
finaldf.to_csv('outputfile.csv', index=False)
The first line will split the created_at column at the space between the date and time. The .str[0] means it will only keep the first part of the split (which is the date).
The second line groups them by date and gives you the count.
When writing to csv, if you do not want the index to show (as in your pic), then use index=False. If you want the index, then just leave that portion out.
First you need to parse your date right:
df["date_string"] = df["created_at"].str.split(" ").str[0]
df["date_time"] = pd.to_datetime(df["date_string"])
# You can choose to drop the earlier columns
# Now you just want to group by the date and apply the aggregation/function you want
df = df.groupby(["date_time"])["field2"].sum().reset_index()  # for example
df.to_csv("abc.csv", index=False)
I have a Pandas df with one column (Reservation_Dt_Start) representing the start of a date range and another (Reservation_Dt_End) representing the end of a date range.
Rather than each row having a date range, I'd like to expand each row to have as many records as there are dates in the date range, with each new row representing one of those dates.
See the two pics below for an example input and the desired output.
The code snippet below works!! However, for every 250 rows in the input table, it takes 1 second to run. Given my input table is 120,000,000 rows in size, this code will take about one week to run.
pd.concat([pd.DataFrame({'Book_Dt': row.Book_Dt,
                         'Day_Of_Reservation': pd.date_range(row.Reservation_Dt_Start, row.Reservation_Dt_End),
                         'Pickup': row.Pickup,
                         'Dropoff': row.Dropoff,
                         'Price': row.Price},
                        columns=['Book_Dt', 'Day_Of_Reservation', 'Pickup', 'Dropoff', 'Price'])
           for i, row in df.iterrows()], ignore_index=True)
There has to be a faster way to do this. Any ideas? Thanks!
pd.concat in a loop with a large dataset gets pretty slow as it will make a copy of the frame each time and return a new dataframe. You are attempting to do this 120m times. I would try to work with this data as a simple list of tuples instead then convert to dataframe at the end.
e.g.
Given an empty list, e.g. rows = [] (avoid naming it list, which shadows the built-in)
For each row in the dataframe:
get the list of dates in the range (you can still use pd.date_range here) and store it in a variable dates
for each date in that range, append a tuple to the list: rows.append((row.Book_Dt, dates[i], row.Pickup, row.Dropoff, row.Price))
Finally you can convert the list of tuples to a dataframe, as in the sketch below:
df = pd.DataFrame(rows, columns=['Book_Dt', 'Day_Of_Reservation', 'Pickup', 'Dropoff', 'Price'])
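Put together, a runnable sketch of that approach, assuming the column names from the question (itertuples is used instead of iterrows because it is usually faster):
import pandas as pd

rows = []
for row in df.itertuples(index=False):
    # one output row per date in the reservation range
    for d in pd.date_range(row.Reservation_Dt_Start, row.Reservation_Dt_End):
        rows.append((row.Book_Dt, d, row.Pickup, row.Dropoff, row.Price))

out = pd.DataFrame(rows, columns=['Book_Dt', 'Day_Of_Reservation', 'Pickup', 'Dropoff', 'Price'])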
I am trying to combine two separate data series of one-minute data to create a ratio, then create Open High Low Close (OHLC) files for the ratio for the entire day. I bring in the two time series and create the associated dataframes using pandas. The time series have missing data, so I create a datetime variable in each file and then merge the files with pd.merge on that datetime variable. Up to this point everything is going fine.
Next I group the data by the date using groupby. I then feed the grouped data to a for loop that calculates the OHLC and feeds that into a new dataframe for each respective day. However, the newly populated dataframe uses the date (from the grouping) as the dataframe index and the sorting is off. The index data looks like this (even when sorted):
01/29/2013
01/29/2014
01/29/2015
12/2/2013
12/2/2014
In short, the sorting is being done only on the month, not on the whole date as a date, so it isn't chronological. My goal is to get it sorted by date so it is chronological. Perhaps I need to create a new column in the dataframe referencing the index (not sure how), or maybe there is a way to tell pandas that the index is a date, not just a value? I tried various sort approaches, including sort_index, but since the dates are the index and don't seem to be treated as dates, the sort functions sort by the month regardless of the year, and thus my output file is out of order. In more general terms, I am not sure how to reference/manipulate the unique identifier index in a pandas dataframe, so any related material would be useful.
Thank you
Years later...
This fixes the problem (here df is your dataframe):
import pandas as pd
df.index = pd.to_datetime(df.index)  # convert the index to a DatetimeIndex
df = df.sort_index()  # sort the converted index
This should get the sorting back into chronological order.
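If the string parsing is ambiguous, you can also pass an explicit format; a hedged example, assuming the index values really are month/day/year as shown above:
df.index = pd.to_datetime(df.index, format='%m/%d/%Y')  # parse as month/day/year
df = df.sort_index()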