CSV file handling with Pandas for text comparison - python

I have this csv file called input.csv
KEY;Rate;BYld;DataAsOfDate
CH04;0.719;0.674;2020-01-29
CH03;1.5;0.148;2020-01-29
then I execute the following code:
import pandas as pd
input_df = pd.read_csv('input.csv', sep=";")
input_df.to_csv('output.csv', sep=";", index=False)
and get the following output.csv file
KEY;Rate;BYld;DataAsOfDate
CH04;0.7190000000000001;0.674;2020-01-29
CH03;1.5;0.14800000000000002;2020-01-29
I was hoping for and expecting an output like this:
(to be able to use a tool like winmerge.org to detect real differences on each row)
(my real code truly modifies the dataframe - this stack overflow example is for demonstration only)
KEY;Rate;BYld;DataAsOfDate
CH04;0.719;0.674;2020-01-29
CH03;1.5;0.148;2020-01-29
What is the idiomatic way to achieve such an unmodified output with Pandas?

Python does not use traditional rounding; it uses banker's rounding (round half to even), which can produce surprising results. However, if being close is not a problem, you could use the round function and replace the "2" with however many decimal places you would like to round to:
import pandas as pd

d = [['CH04', 0.719, 0.674, '2020-01-29']]
df = pd.DataFrame(d, columns=['KEY', 'Rate', 'BYld', 'DataAsOfDate'])
df['Rate'] = df['Rate'].apply(lambda x: round(x, 2))
df

Using #Prokos idea I changed the code like this:
import pandas as pd
input_df = pd.read_csv('input.csv', dtype=str, sep=";")
input_df.to_csv('str_output.csv', sep=";", index=False)
and that meets the requirement - all columns come out unchanged.
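If you do need the columns as real numbers (so reading everything as str is not an option), another option is to control how floats are written with to_csv's float_format parameter. A minimal sketch, assuming three decimal places suits your data:
import pandas as pd

input_df = pd.read_csv('input.csv', sep=";")
# ... modify the dataframe here ...
# write floats with a fixed number of decimals so float round-trip noise disappears
input_df.to_csv('output.csv', sep=";", index=False, float_format="%.3f")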


How do I apply language tool to a Python df and add results as a new column in the df?

I am trying to add a column to a df (a large Excel file imported as a df with Pandas). The new column would hold the output errors from using the LanguageTool import when applied to a column in the df. So for each row, I'd have the errors, or blank/no errors, in the new column 'Issues'.
import language_tool_python
import pandas as pd
tool = language_tool_python.LanguageTool('en-US')
fn = "Example.xlsx"
xlreader = pd.read_excel(fn, sheet_name="This is Starting File")
for row in xlreader:
    text = str(xlreader[['Description']])
    xlreader['Issues'] = tool.check(text)
The above results in a ValueError.
I also tried,
xlreader['Issues'] = xlreader.apply(lambda x: tool.check(text))
The result was NaN, even though there are errors.
Is there a way to accomplish the desired output?
Desired output:
ID      Description                       Added column 'Issues'
1-432   "The text withissues to check"    Possible spelling mistake
Maybe make these changes:
To cast as str:
xlreader['Description'] = xlreader['Description'].astype('str')
To apply the function:
xlreader['Issues'] = xlreader['Description'].apply(lambda x: tool.check(x))
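Note that tool.check(x) returns a list of Match objects, so the new column will contain lists. If you would rather have a readable string per row (empty when there are no issues), a sketch along these lines should work; the column names follow the question, and joining the match messages is my own choice:
import language_tool_python
import pandas as pd

tool = language_tool_python.LanguageTool('en-US')
xlreader = pd.read_excel("Example.xlsx", sheet_name="This is Starting File")

def issues_as_text(text):
    # one string per row: every match message, separated by "; "
    return '; '.join(match.message for match in tool.check(str(text)))

xlreader['Issues'] = xlreader['Description'].apply(issues_as_text)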

PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance

I got following warning
PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider using pd.concat instead. To get a de-fragmented frame, use `newframe = frame.copy()`
when I tried to append multiple dataframes like
df1 = pd.DataFrame()
for file in files:
    df = pd.read_pickle(file)
    df['id'] = file
    df1 = df1.append(df, ignore_index=True)
where
df['id'] = file
seems to cause the warning. I wonder if anyone can explain how copy() can avoid or reduce the fragmentation problem, or suggest other solutions to avoid the issue.
Thanks,
I tried to create testing code to duplicate the problem, but I don't see the PerformanceWarning with a testing dataset (random integers). The same code continues to produce the warning when reading in the real dataset, so it looks like something in the real dataset triggers the issue.
import pandas as pd
import numpy as np
import os
import glob
rows = 35000
cols = 1900
def gen_data(rows, cols, num_files):
    if not os.path.isdir('./data'):
        os.mkdir('./data')
    files = []
    for i in range(num_files):
        file = f'./data/{i}.pkl'
        pd.DataFrame(
            np.random.randint(1, 1_000, (rows, cols))
        ).to_pickle(file)
        files.append(file)
    return files
# Comment out the second line to run the testing dataset, or the first line to run the real dataset
files = gen_data(rows, cols, 10) # testing dataset, runs okay
files = glob.glob('../pickles3/my_data_*.pickle') # real dataset, get performance warning
dfs = []
for file in files:
    df = pd.read_pickle(file)
    df['id'] = file
    dfs.append(df)
dfs = pd.concat(dfs, ignore_index=True)
append is not an efficient method for this operation. concat is more appropriate in this situation.
Replace
df1 = df1.append(df, ignore_index=True)
with
df1 = pd.concat((df1, df), axis=0, ignore_index=True)
Details about the differences are in this question: Pandas DataFrame concat vs append
I had the same problem. This raised the PerformanceWarning:
df['col1'] = False
df['col2'] = 0
df['col3'] = 'foo'
This didn't:
df[['col1', 'col2', 'col3']] = (False, 0, 'foo')
Maybe you're adding single columns elsewhere?
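If you really do have to add several new columns, one way to follow the warning's advice is to build them in a separate frame and concatenate once. A minimal sketch with hypothetical column names:
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3]})
new_cols = pd.DataFrame(
    {'col1': False, 'col2': 0, 'col3': 'foo'},
    index=df.index,
)
# one concat instead of three single-column inserts
df = pd.concat([df, new_cols], axis=1)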
copy() is supposed to consolidate the dataframe and thus defragment it. There was a bug fix for this in pandas 1.3.1 (GH 42579, https://github.com/pandas-dev/pandas/pull/42579). Copies of a larger dataframe can get expensive.
Tested on pandas 1.5.2, python 3.8.15
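A minimal sketch of where the copy() fits, with made-up column names: after a loop that inserts many single columns, one copy() consolidates the blocks.
import pandas as pd

df = pd.DataFrame({'a': range(1000)})
for i in range(200):
    df[f'col_{i}'] = i      # each assignment inserts a new block (fragmentation)
df = df.copy()              # consolidates into a de-fragmented frame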
This is a problem with a recent update. Check this issue from pandas-dev. It seems to be resolved in pandas version 1.3.1 (reference PR).

Changing column values for a value in an adjacent column in the same dataframe using Python

I am quite new to Python programming.
I am working with the following dataframe:
Before
Note that in column "FBgn", there is a mix of FBgn and FBtr string values. I would like to replace the FBtr-containing values with FBgn values provided in the adjacent column called "## FlyBase_FBgn". However, I want to keep the FBgn values in column "FBgn". Maybe keep in mind that I am showing only a portion of the dataframe (reality: 1432 rows). How would I do that? I tried the replace() method from Pandas, but it did not work.
This is actually what I would like to have:
After
Thanks a lot!
With Pandas, you could try:
df.loc[df["FBgn"].str.contains("FBtr"), "FBgn"] = df["## FlyBase_FBgn"]
Welcome to Stack Overflow. Please provide more info next time, including your code; it is always helpful.
Please see the code below, I think you need something similar:
import pandas as pd
#ignore the dict1, I just wanted to recreate your df
dict1= {"FBgn": ['FBtr389394949', 'FBgn3093840', 'FBtr000025'], "FBtr": ['FBgn546466646', '', 'FBgn15565555']}
df = pd.DataFrame(dict1) #recreating your dataframe
#print df
print(df)
#function to replace the values
def replace_values(df):
    for i in range(len(df)):
        if 'tr' in df.loc[i, 'FBgn']:
            df.loc[i, 'FBgn'] = df.loc[i, 'FBtr']
    return df
df = replace_values(df)
#print new df
print(df)

Python module export Pandas DataFrame

I am relatively new to Python, but my understanding of Python modules is that any object defined in a module can be exported; for example, if you had:
# my_module.py
obj1 = 4
obj2 = 8
you can import both these objects simply with from my_module import obj1, obj2.
While working with Pandas, it is common to have code which looks like this (not actual working code):
# pandas_module.py
import pandas as pd
df = pd.DataFrame(...)
df = df.drop()
df = df[df.col > 0]
where the same object (df) is redefined multiple times. If I want to export df, how should I handle this? My guess is that if I simply from pandas_module import df from elsewhere, all the pandas code will run first and I will get the final df as expected, but I'm not sure if this is good practice. Maybe it is better to do something like final_df = df.copy() and export final_df instead. This seems like it would be more understandable for someone who is not that familiar with Python.
So my question is, what is the proper way to handle this situation of exporting a df which is defined multiple times?
Personally, I usually create a function that returns a Dataframe object. Such as:
# pandas_module.py
import pandas as pd
def clean_data():
    df = pd.DataFrame(...)
    df = df.drop()
    df = df[df.col > 0]
    return df
Then you can call the function from your main work flow and get the expected Dataframe:
from pandas_module import clean_data
df = clean_data()

Sum all values in each column in csv using Pandas

The program I have written has generally done what I wanted it to do: add totals for each column. My dataframe comes from a csv file. My code is below:
import pandas as pd
import matplotlib.pyplot
class ColumnCalculation:
    """This houses the functions for all the column manipulation calculations"""

    def max_electricity(self):
        df.set_index('Date', inplace=True)
        df.loc['Total'] = df.sum()
        print(df)

df = pd.read_csv("2011-onwards-city-elec-consumption.csv")
ColumnCalculation.max_electricity(df)
Also my dataset (I didn't know how to format it properly)
The code nicely adds the totals into a Total row at the bottom of each column, except when it comes to the last column (2017) (image below):
I am not sure what the program is doing wrong; I've tried different indexing options like .iloc or .ix but it doesn't seem to make a difference. I have also tried adding each column individually (below):
def max_electricity(self):
    df.set_index('Date', inplace=True)
    df.loc['Total', '2011'] = df['2011'].sum()
    df.loc['Total', '2012'] = df['2012'].sum()
    df.loc['Total', '2013'] = df['2013'].sum()
    df.loc['Total', '2014'] = df['2014'].sum()
    df.loc['Total', '2015'] = df['2015'].sum()
    df.loc['Total', '2016'] = df['2016'].sum()
    df.loc['Total', '2017'] = df['2017'].sum()
    print(df)
But I receive an error; I assume this approach would be too much? I've tried to figure this out for a good hour and a bit.
Your last column isn't being parsed as floats, but strings.
To fix this, try casting to numeric before summing:
import locale
locale.setlocale(locale.LC_NUMERIC, '')
df['2017'] = df['2017'].map(locale.atoi)
Better still, try reading in the data as numeric data. For example:
df = pd.read_csv('file.csv', sep='\t', thousands=',')
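If re-reading the file is not convenient, a hedged alternative is to strip the thousands separators and convert the existing column in place (this assumes the only extra characters are commas):
# remove thousands separators, then convert; non-numeric cells become NaN
df['2017'] = pd.to_numeric(
    df['2017'].astype(str).str.replace(',', '', regex=False),
    errors='coerce',
)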
