Partial string filter pandas - python

On Pandas 1.3.4 and Python 3.9.
So I'm having issues filtering for a partial piece of the string. The "Date" column is listed in the format of MM/DD/YYYY HH:MM:SS A/PM where the most recent one is on top. If the date is single digit (example: November 3rd), it does not have the 0 such that it is 11/3 instead of 11/03. Basically I'm looking to go look at column named "Date" and have python read parts of the string to filter for only today.
This is what the original csv looks like. This is what I want to do to the file. Basically looking for a specific date but not any time of that date and implement the =RIGHT() formula. However this is what I end up with with the following code.
from datetime import date
import pandas as pd
df = pd.read_csv(r'file.csv', dtype=str)
today = date.today()
d1 = today.strftime("%m/%#d/%Y") # to find out what today is
df = pd.DataFrame(df, columns=['New Phone', 'Phone number', 'Date'])
df['New Phone'] = df['Phone number'].str[-10:]
df_today = df['Date'].str.contains(f'{d1}',case=False, na=False)
df_today.to_csv(r'file.csv', index=False)

This line is wrong:
df_today = df['Date'].str.contains(f'{d1}',case=False, na=False)
All you're doing there is creating a mask; essentially what that is is just a Pandas series, containg True or False in each row, according to the condition you created the mask in. The spreadsheet get's only FALSE as you showed because non of the items in the Date contain the string that the variable d1 holds...
Instead, try this:
from datetime import date
import pandas as pd
# Load the CSV file, and change around the columns
df = pd.DataFrame(pd.read_csv(r'file.csv', dtype=str), columns=['New Phone', 'Phone number', 'Date'])
# Take the last ten chars of each phone number
df['New Phone'] = df['Phone number'].str[-10:]
# Convert each date string to a pd.Timestamp, removing the time
df['Date'] = pd.to_datetime(df['Date'].str.split(r'\s+', n=1).str[0])
# Get the phone numbers that are from today
df_today = df[df['Date'] == date.today().strftime('%m/%d/%Y')]
# Write the result to the CSV file
df_today.to_csv(r'file.csv', index=False)

Related

how to sum of columns based on another column value of excel

I would like to ask how to sum using python or excel.
Like to do summation of "number" columns based on "time" column.
Sum of the Duration for (00:00 am - 00:59 am) is (2+4) 6.
Sum of the Duration for (02:00 am - 02:59 am) is (3+1) 4.
Could you please advise how to ?
When you have a dataframe you can use groupby to accomplish this:
# import pandas module
import pandas as pd
# Create a dictionary with the values
data = {
'time' : ["12:20:51", "12:40:51", "2:26:35", "2:37:35"],
'number' : [2, 4, 3, 1]}
# create a Pandas dataframe
df = pd.DataFrame(data)
# or load the CSV
df = pd.read_csv('path/dir/filename.csv')
# Convert time column to datetime data type
df['time'] = df['time'].apply(pd.to_datetime, format='%H:%M:%S')
# add values by hour
dff = df.groupby(df['time'].dt.hour)['number'].sum()
print(dff.head(50))
output:
time
12 6
2 4
When you need more than one column. You can pass the columns as a list inside .groupby(). The code will look like this:
import pandas as pd
df = pd.read_csv('filename.csv')
# Convert time column to datetime data type
df['time'] = df['time'].apply(pd.to_datetime, format='%H:%M:%S')
df['date'] = df['date'].apply(pd.to_datetime, format='%d/%m/%Y')
# add values by hour
dff = df.groupby([df['date'], df['time'].dt.hour])['number'].sum()
print(dff.head(50))
# save the file
dff.to_csv("filename.csv")

Filter particular date in a DF column

I want to filter particular date in a DF column.
My code:
df
df["Crawl Date"]=pd.to_datetime(df["Crawl Date"]).dt.date
date=pd.to_datetime("03-21-2020")
df=df[df["Crawl Date"]==date]
It is showing no match.
Note: df column is having time also with date which need to be trimmed.
Thanks in advance.
The following script assumes that the 'Crawl Dates' column contains strings:
import pandas as pd
import datetime
column_names = ["Crawl Date"]
df = pd.DataFrame(columns = column_names)
#Populate dataframe with dates
df.loc[0] = ['03-21-2020 23:45:57']
df.loc[1] = ['03-22-2020 23:12:33']
df["Crawl Date"]=pd.to_datetime(df["Crawl Date"]).dt.date
date=pd.to_datetime("03-21-2020")
df=df[df["Crawl Date"]==date]
Then df returns:
Crawl Date 0 2020-03-21

Python Pandas filtering dataframe on date

I am trying to manipulate a CSV file on a certain date in a certain column.
I am using pandas (total noob) for that and was pretty successful until i got to dates.
The CSV looks something like this (with more columns and rows of course).
These are the columns:
Circuit
Status
Effective Date
These are the values:
XXXX001
Operational
31-DEC-2007
I tried dataframe query (which i use for everything else) without success.
I tried dataframe loc (which worked for everything else) without success.
How can i get all rows that are older or newer from a given date? If i have other conditions to filter the dataframe, how do i combine them with the date filter?
Here's my "raw" code:
import pandas as pd
# parse_dates = ['Effective Date']
# dtypes = {'Effective Date': 'str'}
df = pd.read_csv("example.csv", dtype=object)
# , parse_dates=parse_dates, infer_datetime_format=True
# tried lot of suggestions found on SO
cols = df.columns
cols = cols.map(lambda x: x.replace(' ', '_'))
df.columns = cols
status1 = 'Suppressed'
status2 = 'Order Aborted'
pool = '2'
region = 'EU'
date1 = '31-DEC-2017'
filt_df = df.query('Status != #status1 and Status != #status2 and Pool == #pool and Region_A == #region')
filt_df.reset_index(drop=True, inplace=True)
filt_df.to_csv('filtered.csv')
# this is working pretty well
supp_df = df.query('Status == #status1 and Effective_Date < #date1')
supp_df.reset_index(drop=True, inplace=True)
supp_df.to_csv('supp.csv')
# this is what is not working at all
I tried many approaches, but i was not able to put it together. This is just one of many approaches i tried.. so i know it is perhaps completely wrong, as no date parsing is used.
supp.csv will be saved, but the dates present are all over the place, so there's no match with the "logic" in this code.
Thanks for any help!
Make sure you convert your date to datetime and then filter slice on it.
df['Effective Date'] = pd.to_datetime(df['Effective Date'])
df[df['Effective Date'] < '2017-12-31']
#This returns all the values with dates before 31th of December, 2017.
#You can also use Query

Python: store a value in a variable so that you can recognize each reoccurence

If this question is unclear, I am very open to constructive criticism.
I have an excel table with about 50 rows of data, with the first column in each row being a date. I need to access all the data for only one date, and that date appears only about 1-5 times. It is the most recent date so I've already organized the table by date with the most recent being at the top.
So my goal is to store that date in a variable and then have Python look only for that variable (that date) and take only the columns corresponding to that variable. I need to use this code on 100's of other excel files as well, so it would need to arbitrarily take the most recent date (always at the top though).
My current code below simply takes the first 5 rows because I know that's how many times this date occurs.
import os
from numpy import genfromtxt
import pandas as pd
path = 'Z:\\folderwithcsvfile'
for filename in os.listdir(path):
file_path = os.path.join(path, filename)
if os.path.isfile(file_path):
broken_df = pd.read_csv(file_path)
df3 = broken_df['DATE']
df4 = broken_df['TRADE ID']
df5 = broken_df['AVAILABLE STOCK']
df6 = broken_df['AMOUNT']
df7 = broken_df['SALE PRICE']
print (df3)
#print (df3.head(6))
print (df4.head(6))
print (df5.head(6))
print (df6.head(6))
print (df7.head(6))
This is a relatively simple filtering operation. You state that you want to "take only the columns" that are the latest date, so I assume that an acceptable result will be a filter DataFrame with just the correct columns.
Here's a simple CSV that is similar to your structure:
DATE,TRADE ID,AVAILABLE STOCK
10/11/2016,123,123
10/11/2016,123,123
10/10/2016,123,123
10/9/2016,123,123
10/11/2016,123,123
Note that I mixed up the dates a little bit, because it's hacky and error-prone to just assume that the latest dates will be on the top. The following script will filter it appropriately:
import pandas as pd
import numpy as np
df = pd.read_csv('data.csv')
# convert the DATE column to datetimes
df['DATE'] = pd.to_datetime(df['DATE'])
# find the latest datetime
latest_date = df['DATE'].max()
# use index filtering to only choose the columns that equal the latest date
latest_rows = df[df['DATE'] == latest_date]
print (latest_rows)
# now you can perform your operations on latest_rows
In my example, this will print:
DATE TRADE ID AVAILABLE STOCK
0 2016-10-11 123 123
1 2016-10-11 123 123
4 2016-10-11 123 123

how to extract date/time parameters from a list of strings?

i have a pandas dataframe having a column as
from pandas import DataFrame
df = pf.DataFrame({ 'column_name' : [u'Monday,30 December,2013', u'Delivered', u'19:23', u'1']})
now i want to extract every thing from it and store in three columns as
date status time
[30/December/2013] ['Delivered'] [19:23]
i have so far used this :
import dateutil.parser as dparser
dparser.parse([u'Monday,30 December,2013', u'Delivered', u'19:23', u'1'])
but this throws an error . can anyone please guide me to a solution ?
You can apply() a function to a column, see the whole example:
from pandas import DataFrame
df = DataFrame({'date': ['Monday,30 December,2013'], 'delivery': ['Delivered'], 'time': ['19:23'], 'status':['1']})
# delete the status column
del df['status']
def splitter(val):
parts = val.split(',')
return parts[1]
df['date'] = df['date'].apply(splitter)
This yields a dataframe with date, delivery and the time.

Categories

Resources