Find the same date from two sets of data - python

I am new to Python. I got two sets of data shown as below.
Set 1:
Gmt time,Open,High,Low,Close,Volume,RSI,,Change,Gain,Loss,Avg Gain,Avg Loss,RS
15.06.2017 00:00:00.000,0.75892,0.76313,0.7568,0.75858,107799.5406,0,,,,,,,
16.06.2017 00:00:00.000,0.75857,0.76294,0.75759,0.76202,94367.4299,0,,0.00344,0.00344,0,,,
18.06.2017 00:00:00.000,0.76202,0.76236,0.76152,0.76188,5926.0998,0,,-0.00014,0,0.00014,,,
19.06.2017 00:00:00.000,0.76189,0.76289,0.75848,0.75902,87514.849,0,,-0.00286,0,0.00286,,,
...
Set 2:
Gmt time,Open,High,Low,Close,Volume
15.06.2017 00:00:00.000,0.75892,0.75933,0.75859,0.75883,4777.4702
15.06.2017 01:00:00.000,0.75885,0.76313,0.75833,0.76207,7452.5601
15.06.2017 02:00:00.000,0.76207,0.76214,0.76106,0.76143,4798.4102
15.06.2017 03:00:00.000,0.76147,0.76166,0.76015,0.76154,4961.4502
15.06.2017 04:00:00.000,0.76154,0.76162,0.76104,0.76121,2977.6399
15.06.2017 05:00:00.000,0.7612,0.76154,0.76101,0.76151,3105.4399
...
I want to find lines in Set 2 in the same date with Set 1. I tried this: print(daily['Gmt time'][0].date == hourly['Gmt time'][0].date), but I don't know why it came out False. Isn't there a way to compare the date(just date, not including time) from two sets of data?

First read the data sets into dataframes:
import pandas as pd
df_one = pd.DataFrame.from_csv('data_set_one.csv', index_col=False)
df_two = pd.DataFrame.from_csv('data_set_two.csv', index_col=False)
Convert date column to date
df_one['Gmt date'] = pd.to_datetime(df_one['Gmt time']).dt.date
df_two['Gmt date'] = pd.to_datetime(df_two['Gmt time']).dt.date
now compare both the dataframes:
for i, row in df_one.iterrows():
df_one_date = row['Gmt date']
print('df_one_date', df_one_date)
print(df_two[df_two['Gmt date'] == df_one_date])
print('----')
it's still unclear how you want to handle for different dates from df_one to match df_two. Hope this gives you enough idea on how to handle it.

Since using iterrows can be slow, a better option might be to use merge.
import pandas as pd
# load data
df_one = pd.read_csv('data_set_one.csv', index_col=False)
df_two = pd.read_csv('data_set_two.csv', index_col=False)
# convert times to datetime and then strip off the time to leave the date
df_one['Gmt date'] = pd.to_datetime(df_one['Gmt time']).dt.date
df_two['Gmt date'] = pd.to_datetime(df_two['Gmt time']).dt.date
# merge
# selecting only the date in each dataframe for clarity
df_merge = df_two[['Gmt date']].merge(df_one[['Gmt date']], on=['Gmt date'], how='inner', right_index=True)
# get list of indices from df_two where dates exist in both frames
ix = list(df_two.index.unique())
print ix
[0, 1, 2, 3, 4, 5]

Related

Partial string filter pandas

On Pandas 1.3.4 and Python 3.9.
So I'm having issues filtering for a partial piece of the string. The "Date" column is listed in the format of MM/DD/YYYY HH:MM:SS A/PM where the most recent one is on top. If the date is single digit (example: November 3rd), it does not have the 0 such that it is 11/3 instead of 11/03. Basically I'm looking to go look at column named "Date" and have python read parts of the string to filter for only today.
This is what the original csv looks like. This is what I want to do to the file. Basically looking for a specific date but not any time of that date and implement the =RIGHT() formula. However this is what I end up with with the following code.
from datetime import date
import pandas as pd
df = pd.read_csv(r'file.csv', dtype=str)
today = date.today()
d1 = today.strftime("%m/%#d/%Y") # to find out what today is
df = pd.DataFrame(df, columns=['New Phone', 'Phone number', 'Date'])
df['New Phone'] = df['Phone number'].str[-10:]
df_today = df['Date'].str.contains(f'{d1}',case=False, na=False)
df_today.to_csv(r'file.csv', index=False)
This line is wrong:
df_today = df['Date'].str.contains(f'{d1}',case=False, na=False)
All you're doing there is creating a mask; essentially what that is is just a Pandas series, containg True or False in each row, according to the condition you created the mask in. The spreadsheet get's only FALSE as you showed because non of the items in the Date contain the string that the variable d1 holds...
Instead, try this:
from datetime import date
import pandas as pd
# Load the CSV file, and change around the columns
df = pd.DataFrame(pd.read_csv(r'file.csv', dtype=str), columns=['New Phone', 'Phone number', 'Date'])
# Take the last ten chars of each phone number
df['New Phone'] = df['Phone number'].str[-10:]
# Convert each date string to a pd.Timestamp, removing the time
df['Date'] = pd.to_datetime(df['Date'].str.split(r'\s+', n=1).str[0])
# Get the phone numbers that are from today
df_today = df[df['Date'] == date.today().strftime('%m/%d/%Y')]
# Write the result to the CSV file
df_today.to_csv(r'file.csv', index=False)

how to sum of columns based on another column value of excel

I would like to ask how to sum using python or excel.
Like to do summation of "number" columns based on "time" column.
Sum of the Duration for (00:00 am - 00:59 am) is (2+4) 6.
Sum of the Duration for (02:00 am - 02:59 am) is (3+1) 4.
Could you please advise how to ?
When you have a dataframe you can use groupby to accomplish this:
# import pandas module
import pandas as pd
# Create a dictionary with the values
data = {
'time' : ["12:20:51", "12:40:51", "2:26:35", "2:37:35"],
'number' : [2, 4, 3, 1]}
# create a Pandas dataframe
df = pd.DataFrame(data)
# or load the CSV
df = pd.read_csv('path/dir/filename.csv')
# Convert time column to datetime data type
df['time'] = df['time'].apply(pd.to_datetime, format='%H:%M:%S')
# add values by hour
dff = df.groupby(df['time'].dt.hour)['number'].sum()
print(dff.head(50))
output:
time
12 6
2 4
When you need more than one column. You can pass the columns as a list inside .groupby(). The code will look like this:
import pandas as pd
df = pd.read_csv('filename.csv')
# Convert time column to datetime data type
df['time'] = df['time'].apply(pd.to_datetime, format='%H:%M:%S')
df['date'] = df['date'].apply(pd.to_datetime, format='%d/%m/%Y')
# add values by hour
dff = df.groupby([df['date'], df['time'].dt.hour])['number'].sum()
print(dff.head(50))
# save the file
dff.to_csv("filename.csv")

Unable to convert a list into a dataframe. Keep getting the error "ValueError: Must pass 2-d input. shape=(1, 4, 5)"

I have to 2 dfs:
dfMiss:
and
dfSuper:
I need to create a final output that summarises the data in the 2 tables which I am able to do shown in the code below:
dfCity = dfSuper \
.groupby(by='City').count() \
.drop(columns='Superhero ID') \
.rename(columns={'Superhero': 'Total count'})
print("This is the df city : ")
print(dfCity)
## Convert column MissionEndDate to DateTime format
for df in dfMiss:
# Dates are interpreted as MM/dd/yyyy by default, dayfirst=False
df['Mission End date'] = pd.to_datetime(df['Mission End date'], dayfirst=True)
# Get Year and Quarter, given Q1 2020 starts in April
date = df['Mission End date'] - pd.DateOffset(months=3)
df['Mission End quarter'] = date.dt.year.astype(str) + ' Q' + date.dt.quarter.astype(str)
## Get no. Superheros working per City per Quarter
dfCount = []
for dfM in dfMiss:
# Merge DataFrames
df = dfSuper.merge(dfM, left_on='Superhero ID', right_on='SID')
df = df.pivot_table(index=['City', 'Superhero'], columns='Mission End quarter', aggfunc='nunique')
# Get the first group (all the groups have the same values)
df = df[df.columns[0][0]]
# Group the values by City (effectively "collapsing" the 'Superhero' column)
df = df.groupby(by=['City']).count()
dfCount += [df]
## Get no. Superheros available per City per Quarter
dfFree = []
for dfC in dfCount:
# Merge DataFrames
df = dfCity.merge(right=dfC, on='City', how='outer').fillna(0) # convert NaN values to 0
# Subtract no. working superheros from total no. superheros per city
for col in df.columns[1:]:
df[col] = df['Total count'] - df[col]
dfFree += [df.astype(int)]
print(dfFree)
dfResult = pd.DataFrame(dfFree)
The problem is when I try to convert DfFree into a dataframe I get the error:
"ValueError: Must pass 2-d input. shape=(1, 4, 5) "
The line that raises the error is
dfResult = pd.DataFrame(dfFree)
Anyone have any idea what this means and how I can convert the list into a df?
Thanks :)
separate your code using SOLID. separation of concerns. It is not easy to read
sid=[665544,665544,2121,665544,212121,123456,666666]
mission_end_date=["10/10/2020", "03/03/2021", "02/02/2021", "05/12/2020", "15/07/2021", "03/06/2021", "12/10/2020"]
superherod_sid=[212121,364331,678523,432432,665544,123456,555555,666666,432432]
hero=["Spiderman","Ironman","Batman","Dr. Strange","Thor","Superman","Nightwing","Loki","Wolverine"]
city=["New York","New York","Gotham","New York","Asgard","Metropolis","Gotham","Asgard","New York"]
df_mission=pd.DataFrame({'sid':sid,'mission_end_date':mission_end_date})
df_super=pd.DataFrame({'sid':superherod_sid,'hero':hero, 'city':city})
df=df_super.merge(df_mission,on="sid", how="left")
df['mission_end_date']=pd.to_datetime(df['mission_end_date'])
df['mission_end_date_quarter']=df['mission_end_date'].dt.quarter
df['mission_end_date_year']=df['mission_end_date'].dt.year
print(df.head(20))
pivot = df.pivot_table(index=['city', 'hero'], columns='mission_end_date_quarter', aggfunc='nunique').fillna(0)
print(pivot.head())

Filter particular date in a DF column

I want to filter particular date in a DF column.
My code:
df
df["Crawl Date"]=pd.to_datetime(df["Crawl Date"]).dt.date
date=pd.to_datetime("03-21-2020")
df=df[df["Crawl Date"]==date]
It is showing no match.
Note: df column is having time also with date which need to be trimmed.
Thanks in advance.
The following script assumes that the 'Crawl Dates' column contains strings:
import pandas as pd
import datetime
column_names = ["Crawl Date"]
df = pd.DataFrame(columns = column_names)
#Populate dataframe with dates
df.loc[0] = ['03-21-2020 23:45:57']
df.loc[1] = ['03-22-2020 23:12:33']
df["Crawl Date"]=pd.to_datetime(df["Crawl Date"]).dt.date
date=pd.to_datetime("03-21-2020")
df=df[df["Crawl Date"]==date]
Then df returns:
Crawl Date 0 2020-03-21

Python Pandas filtering dataframe on date

I am trying to manipulate a CSV file on a certain date in a certain column.
I am using pandas (total noob) for that and was pretty successful until i got to dates.
The CSV looks something like this (with more columns and rows of course).
These are the columns:
Circuit
Status
Effective Date
These are the values:
XXXX001
Operational
31-DEC-2007
I tried dataframe query (which i use for everything else) without success.
I tried dataframe loc (which worked for everything else) without success.
How can i get all rows that are older or newer from a given date? If i have other conditions to filter the dataframe, how do i combine them with the date filter?
Here's my "raw" code:
import pandas as pd
# parse_dates = ['Effective Date']
# dtypes = {'Effective Date': 'str'}
df = pd.read_csv("example.csv", dtype=object)
# , parse_dates=parse_dates, infer_datetime_format=True
# tried lot of suggestions found on SO
cols = df.columns
cols = cols.map(lambda x: x.replace(' ', '_'))
df.columns = cols
status1 = 'Suppressed'
status2 = 'Order Aborted'
pool = '2'
region = 'EU'
date1 = '31-DEC-2017'
filt_df = df.query('Status != #status1 and Status != #status2 and Pool == #pool and Region_A == #region')
filt_df.reset_index(drop=True, inplace=True)
filt_df.to_csv('filtered.csv')
# this is working pretty well
supp_df = df.query('Status == #status1 and Effective_Date < #date1')
supp_df.reset_index(drop=True, inplace=True)
supp_df.to_csv('supp.csv')
# this is what is not working at all
I tried many approaches, but i was not able to put it together. This is just one of many approaches i tried.. so i know it is perhaps completely wrong, as no date parsing is used.
supp.csv will be saved, but the dates present are all over the place, so there's no match with the "logic" in this code.
Thanks for any help!
Make sure you convert your date to datetime and then filter slice on it.
df['Effective Date'] = pd.to_datetime(df['Effective Date'])
df[df['Effective Date'] < '2017-12-31']
#This returns all the values with dates before 31th of December, 2017.
#You can also use Query

Categories

Resources