I have a CSV file with four columns, like this:

Freq  ID    Date           Name
0     2053  1998           apple
2     2054  1998 May-June  orange
3     2055  2019           apple
5     2056  1999 Oct-Nov   orange

It is a large file. I have to remove "May-June" from the Date column, and for every row that has a year plus a month range I have to keep only the year. How can I do this in Python?
You can use pandas to read the file and extract the year from the Date column. You can use the str.split() accessor: split on the space, and the first item will be your year, like this:
import pandas as pd
df = pd.read_csv(filename)
df['Date'] = df["Date"].str.split(" ").str.get(0)
print(df)
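If some Date values might have leading whitespace or a different separator, a slightly more defensive sketch (assuming every value begins with a four-digit year) extracts the first four-digit run instead:

```python
import pandas as pd

df = pd.DataFrame({"Date": ["1998", "1998 May-June", "1999 Oct-Nov"]})

# pull out the first run of four digits; rows without one would become NaN
df["Date"] = df["Date"].str.extract(r"(\d{4})", expand=False)
print(df)
```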
Hi, I recently came across a similar problem and resolved it using the snippet below. You can try it; I think it is a reasonably efficient solution.
import pandas as pd
import csv
from datetime import datetime
# keep only the leading four-digit year
to_datetime = lambda d: datetime.strptime(d[:4], '%Y')

# raw strings so the backslashes are not treated as escapes
path = r"D:\python_poc"
filename = r"\Input.csv"

df = pd.read_csv(path + filename, converters={'Date': to_datetime})
df.to_csv(path + filename, index=False, quoting=csv.QUOTE_ALL)
I get a file every day with around 15 columns. Some days there are two date columns and some days only one. Also, the date format is YYYY-MM-DD on some days and DD-MM-YYYY on others. The task is to convert the one or two date columns to MM-DD-YYYY. Sample data in the CSV file for a few columns:
Execution_date  Extract_date  Requestor_Name  Count
2023-01-15      2023-01-15    John Smith      7
Sometimes we don't get the second column above, Extract_date:
Execution_date  Requestor_Name  Count
17-01-2023      Andrew Mill     3
The task is to find all the date columns in the file and change their format to MM-DD-YYYY.
So the sample output for the two files above will be:

Execution_date  Extract_date  Requestor_Name  Count
01-15-2023      01-15-2023    John Smith      7

Execution_date  Requestor_Name  Count
01-17-2023      Andrew Mill     3
I am using pandas and can't figure out how to deal with the second column being missing on some days and with the change in date format.
I can hardcode the two column names and change the format with:
df['Execution_Date'] = pd.to_datetime(df['Execution_Date'], format='%d-%m-%Y').dt.strftime('%m-%d-%Y')
df['Extract_Date'] = pd.to_datetime(df['Extract_Date'], format='%d-%m-%Y').dt.strftime('%m-%d-%Y')
This only works when the file has both columns and the values are in DD-MM-YYYY format.
I'm looking for guidance on how to dynamically find the number of date columns and the date format, so that I can use them in the two lines above; any other solution would also work for me. I could use PowerShell if it can't be done in Python, but I am guessing Python offers far more avenues for this than PowerShell does.
The following loads a CSV file into a dataframe, checks each value that is a str to see whether it matches one of the date formats, and if it does, rearranges the date into the format you're looking for. Other values are untouched.
import pandas as pd
import re
df = pd.read_csv("today.csv")
# compiling the patterns ahead of time saves a lot of processing power later
d_m_y = re.compile(r"(\d\d)-(\d\d)-(\d\d\d\d)")
d_m_y_replace = r"\2-\1-\3"
y_m_d = re.compile(r"(\d\d\d\d)-(\d\d)-(\d\d)")
y_m_d_replace = r"\2-\3-\1"
def change_dt(value):
    if isinstance(value, str):
        if d_m_y.fullmatch(value):
            return d_m_y.sub(d_m_y_replace, value)
        elif y_m_d.fullmatch(value):
            return y_m_d.sub(y_m_d_replace, value)
    return value
new_df = df.applymap(change_dt)
However, if there are other columns containing dates that you don't want to change, and you just want to specify the columns to be altered, use this instead of the last line above:
cols = ["Execution_date", "Extract_date"]
for col in cols:
    if col in df.columns:
        df[col] = df[col].apply(change_dt)
You can convert the columns to datetimes if you wish.
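That last step could look like this, as a minimal sketch assuming the columns now hold MM-DD-YYYY strings:

```python
import pandas as pd

df = pd.DataFrame({"Execution_date": ["01-15-2023", "01-17-2023"]})

# parse the already-normalized strings into real datetime64 values
df["Execution_date"] = pd.to_datetime(df["Execution_date"], format="%m-%d-%Y")
```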
You can use a function that handles every column whose name contains "date", using .fillna to try the other format (add all formats that can occur):
import pandas as pd
def convert_to_datetime(df: pd.DataFrame, column_name: str) -> pd.DataFrame:
    for column in df.columns[df.columns.str.contains(column_name, case=False)]:
        df[column] = (
            pd.to_datetime(df[column], format="%d-%m-%Y", errors="coerce")
            .fillna(pd.to_datetime(df[column], format="%Y-%m-%d", errors="coerce"))
        ).dt.strftime("%m-%d-%Y")
    return df
data1 = {'Execution_date': '2023-01-15', 'Extract_date': '2023-01-15', 'Requestor_Name': "John Smith", 'Count': 7}
df1 = pd.DataFrame(data=[data1])
data2 = {'Execution_date': '17-01-2023', 'Requestor_Name': 'Andrew Mill', 'Count': 3}
df2 = pd.DataFrame(data=[data2])
final1 = convert_to_datetime(df=df1, column_name="date")
print(final1)
final2 = convert_to_datetime(df=df2, column_name="date")
print(final2)
Output:
Execution_date Extract_date Requestor_Name Count
0 01-15-2023 01-15-2023 John Smith 7
Execution_date Requestor_Name Count
0 01-17-2023 Andrew Mill 3
I have a CSV file that looks something like this:
# data.csv (this line is not there in the file)
Names, Age, Names
John, 5, Jane
Rian, 29, Rath
And when I read it through Pandas in Python I get something like this:
import pandas as pd
data = pd.read_csv("data.csv")
print(data)
And the output of the program is:
Names Age Names
0 John 5 Jane
1 Rian 29 Rath
Is there any way to get:
Names Age
0 John 5
1 Rian 29
2 Jane
3 Rath
First, I'd suggest having a unique name for each column: either edit the CSV file to change a column header, or do so in pandas.
Using 'Names2' as the header of the column with the second occurrence of the duplicated name, try this.
Starting from
datalist = [['John', 5, 'Jane'], ['Rian', 29, 'Rath']]
df = pd.DataFrame(datalist, columns=['Names', 'Age', 'Names2'])
We have
Names Age Names
0 John 5 Jane
1 Rian 29 Rath
So, use:
dff = (pd.concat([df['Names'].append(df['Names2'])
                            .reset_index(drop=True),
                  df.iloc[:, 1]], ignore_index=True, axis=1)
         .fillna('')
         .rename(columns=dict(enumerate(['Names', 'Ages']))))
to get your desired result.
From the inside out:
df['Names'].append(df['Names2']) stacks the two name columns into one Series.
pd.concat( ... ) combines the results of df.append with the rest of the dataframe.
To discover what the other commands do, I suggest removing them one-by-one and looking at the results.
Please forgive the formatting of dff; the line breaks and indentation are there to make each step clear from an educational perspective.
You can use:

usecols, which reads only the selected columns.
low_memory=False, which makes pandas read the file in one pass instead of in chunks, avoiding mixed-type inference.

import pandas as pd
data = pd.read_csv("data.csv", usecols=['Names', 'Age'], low_memory=False)
print(data)

Please use unique column names in your CSV.
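If the goal is the stacked output shown in the question, one sketch is to let read_csv mangle the duplicate header (it becomes Names.1 by default) and then stack that column under the first one; the frame is built inline here to stand in for the file:

```python
import pandas as pd

# stand-in for pd.read_csv("data.csv"), where the duplicate header is mangled to "Names.1"
df = pd.DataFrame({"Names": ["John", "Rian"],
                   "Age": [5, 29],
                   "Names.1": ["Jane", "Rath"]})

# stack the second Names column under the first; Age is left empty for those rows
stacked = pd.concat(
    [df[["Names", "Age"]],
     df[["Names.1"]].rename(columns={"Names.1": "Names"})],
    ignore_index=True,
)
print(stacked.fillna(""))
```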
I want to append an expense df to a revenue df but can't do so properly. Can anyone suggest how I might do this?
import pandas as pd
import lxml
from lxml import html
import requests
import numpy as np
symbol = 'MFC'
url = 'https://www.marketwatch.com/investing/stock/'+ symbol +'/financials'
df=pd.read_html(url)
revenue = pd.concat(df[0:1]) # the revenue dataframe obj
revenue = revenue.dropna(axis='columns') # drop naN column
header = revenue.iloc[:0] # revenue df header row
expense = pd.concat(df[1:2]) # the expense dataframe obj
expense = expense.dropna(axis='columns') # drop naN column
statement = revenue.append(expense) #results in a dataframe with an added column (Unnamed:0)
revenue = pd.concat(df[0:1]) has the columns:

Fiscal year is January-December. All values CAD millions.  2015  2016  2017  2018  2019

expense = pd.concat(df[1:2]) has the columns:

Unnamed: 0  2015  2016  2017  2018  2019
How can I append the expense dataframe to the revenue dataframe so that I am left with a single dataframe object?
Thanks,
Rename columns.
df = df.rename(columns={'old_name': 'new_name',})
Then append with merge(), join(), or concat().
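For instance, a minimal sketch with small hypothetical frames standing in for the scraped tables, after both first columns have been renamed to a common LineItem:

```python
import pandas as pd

# hypothetical stand-ins for the scraped revenue and expense tables
revenue = pd.DataFrame({"LineItem": ["Sales"], "2019": [100]})
expense = pd.DataFrame({"LineItem": ["SG&A"], "2019": [40]})

# stack the rows of the two frames into a single statement
statement = pd.concat([revenue, expense], ignore_index=True)
print(statement)
```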
I managed to append the dataframes with the following code. Thanks, David, for putting me on the right track. I admit this is not the best way to do it, because in a runtime environment I don't know the value of the text to rename, and I've hardcoded it here. Ideally it would be best to reference a placeholder at df.iloc[:0,0] instead, but I'm having a tough time getting that to work.
df=pd.read_html(url)
revenue = pd.concat(df[0:1])
revenue = revenue.dropna(axis='columns')
revenue.rename({'Fiscal year is January-December. All values CAD millions.':'LineItem'},axis=1,inplace=True)
header = revenue.iloc[:0]
expense = pd.concat(df[1:2])
expense = expense.dropna(axis='columns')
expense.rename({'Unnamed: 0':'LineItem'}, axis=1, inplace=True)
statement = pd.concat([revenue, expense], ignore_index=True)
Using the df = pd.read_html(url) construct, several dataframes are returned when scraping MarketWatch financials. The function below returns a single dataframe of all balance-sheet elements; the same code applies to quarterly and annual income and cash-flow statements.
def getBalanceSheet(url):
    df = pd.read_html(url)
    # count the tables whose first column came through unnamed
    count = sum(1 for listitem in df if 'Unnamed: 0' in listitem)
    statement = pd.concat(df[0:1])
    statement = statement.dropna(axis='columns')
    if 'q' in url:  # quarterly
        statement.rename({'All values CAD millions.': 'LineItem'}, axis=1, inplace=True)
    else:
        statement.rename({'Fiscal year is January-December. All values CAD millions.': 'LineItem'}, axis=1, inplace=True)
    for rowidx in range(count):
        piece = pd.concat(df[rowidx + 1:rowidx + 2])
        piece = piece.dropna(axis='columns')
        piece.rename({'Unnamed: 0': 'LineItem'}, axis=1, inplace=True)
        statement = pd.concat([statement, piece], ignore_index=True)
    return statement
I have a problem with replacing. Here's what I wrote; I need to replace 1999 with 1900, as you can see. I started recently, so please excuse me. (I searched a lot and watched clips on YouTube, but the methods didn't work.)
import pandas as pd
df = pd.read_excel('book1.xlsx')
#replace
df.replace("1999","1900")
#I also tried this method, but it didn't work.
#df.replace(to_replace = "1999", value = "1900")
#writer
writer = pd.ExcelWriter('book2.xlsx')
df.to_excel(writer,'new_sheet')
writer.save()
My second question: how can I drive the replacements from a text file (or Excel)? For example, replace 1999 (in column A of book1.xlsx) using column B of mistakes.xlsx:
A B
1999 1900
Thank you guys for the help.
You could define a function and apply it element-wise with Series.apply:
import pandas

df = pandas.DataFrame.from_records([('Cryptonomicon', 1999), ('Snow Crash', 1992), ('Quicksilver', 2003)], columns=['Title', 'Year'])
# df is:
# Title Year
# 0 Cryptonomicon 1999
# 1 Snow Crash 1992
# 2 Quicksilver 2003
# Imagine this dataframe came from an Excel spreadsheet...
df_replacements = pandas.DataFrame.from_records([(1999, 1900), (2003, 3003)], columns=['A', 'B'])
replacements = pandas.Series(df_replacements['B'].values, index=df_replacements['A'])
def replaced(value):
return replacements.get(value, value)
df['Year'] = df['Year'].apply(replaced)
# df is:
# Title Year
# 0 Cryptonomicon 1900
# 1 Snow Crash 1992
# 2 Quicksilver 3003
If you have a very large dataframe, you could vectorize this using pandas.Series.map() together with Series.where():

year = df['Year']
df['Year'] = year.where(~year.isin(replacements.index),
                        year.map(replacements))
This should work. It handles either strings or numbers, but it stores the values as strings. If you know 1999 only ever appears as a number, remove .astype(str) and drop the quotes around the years.
import pandas as pd

df = pd.read_excel('book1.xlsx', sheet_name='Sheet1')
for key in df.columns:
    df[key] = df[key].astype(str).replace(to_replace='1999', value='1900')
with pd.ExcelWriter('book2.xlsx') as writer:
    df.to_excel(writer, sheet_name='new_sheet', index=False)
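It may also be worth noting that replace() returns a new dataframe rather than modifying the original, which is why the snippet in the question appeared to do nothing; a minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({"A": [1999, 1992]})

# replace() returns a new DataFrame, so assign the result back
df = df.replace(1999, 1900)
print(df)
```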
I have a pandas df that looks like this:
df = pd.DataFrame([[1,'hello,bye','USA','3/20/2016 7:00:17 AM'],[2,'good morning','UK','3/20/2016 7:00:20 AM']],columns=['id','text','country','datetime'])
id text country datetime
0 1 hello,bye USA 3/20/2016 7:00:17 AM
1 2 good morning UK 3/20/2016 7:00:20 AM
I want to print this output to csv but only if the country column contains 'USA'.
This is what I tried:
if 'USA' in df.country.values:
df.to_csv('test.csv')
but it prints the entire df to the test.csv file still.
Here is a simple solution to your problem: build a boolean mask for the rows whose country is 'USA' and write only those rows.

df = pd.DataFrame([[1,'hello,bye','USA','3/20/2016 7:00:17 AM'],[2,'good morning','UK','3/20/2016 7:00:20 AM']],columns=['id','text','country','datetime'])
usa_df = df[df['country'] == 'USA']
if not usa_df.empty:
    usa_df.to_csv('test.csv', index=False)

Alternatively, if 'USA' can appear as part of a longer string, use df['country'].str.contains('USA') as the mask instead.

Hope this helps you :)