Pandas: import CSV with user-corrected faulty values - python

I'm trying to import a CSV and deal with faulty values, e.g. a wrong decimal separator or strings in int/double columns. I use converters to do the error fixing. When a string appears in a number column, the user sees an input box where they have to fix the value. Is it possible to get the column name and/or the row that is currently being imported? If not, is there a better way to do the same?
example csv:
------------
description;elevation
point a;-10
point b;10,0
point c;35.5
point d;30x
from PyQt4 import QtGui
import numpy
from pandas import read_csv

def fixFloat(x):
    # return x as float if possible
    try:
        return float(x)
    except ValueError:
        # if not, check for a decimal comma, replace it with a dot and retry
        try:
            return float(x.replace(",", "."))
        except ValueError:
            # still not a number: ask the user to correct the value
            changedValue, ok = QtGui.QInputDialog.getText(None, 'Faulty value', 'Please correct the faulty value:', text=x)
            if ok:
                return fixFloat(changedValue)
            else:
                return -9999999999

def fixEmptyStrings(s):
    if s == '':
        return None
    else:
        return s

converters = {
    'description': fixEmptyStrings,
    'elevation': fixFloat
}
dtypes = {
    'description': object,
    'elevation': numpy.float64
}
csvData = read_csv('/tmp/csv.txt',
                   sep=';',
                   error_bad_lines=True,
                   dtype=dtypes,
                   converters=converters
                   )

If you want to iterate over the rows yourself, the built-in csv.DictReader is pretty handy. I wrote up this function:
import csv
import pandas

def read_points(csv_file):
    point_names, elevations = [], []
    message = (
        "Found bad data for {0}'s row: {1}. Type new data to use "
        "for this value: "
    )
    with open(csv_file, 'r') as open_csv:
        r = csv.DictReader(open_csv, delimiter=";")
        for row in r:
            tmp_point = row.get("description", "some_default_name")
            tmp_elevation = row.get("elevation", "some_default_elevation")
            point_names.append(tmp_point)
            try:
                tmp_elevation = float(tmp_elevation.replace(',', '.'))
            except ValueError:
                # keep asking until the user types something that parses
                while True:
                    user_val = raw_input(message.format(tmp_point,
                                                        tmp_elevation))
                    try:
                        tmp_elevation = float(user_val)
                        break
                    except ValueError:
                        tmp_elevation = user_val
            elevations.append(tmp_elevation)
    return pandas.DataFrame({"Point": point_names, "Elevation": elevations})
And for the four-line test file, it gives me the following:
In [41]: read_points("/home/ely/tmp.txt")
Found bad data for point d's row: 30x. Type new data to use for this value: 30
Out[41]:
   Elevation    Point
0      -10.0  point a
1       10.0  point b
2       35.5  point c
3       30.0  point d

[4 rows x 2 columns]
Displaying a whole Qt dialog box seems like overkill for this task; why not just a command prompt? You can also add more conversion functions, and turn things like the delimiter into keyword arguments, if you want it to be more customizable.
One question is how much data there is to iterate through. If it's a lot of data, this will be time-consuming and tedious. In that case, you may just want to discard observations like the '30x', or write their point ID names to some other file so you can go back and deal with them all in one swoop inside something like Emacs or Vim, where manipulating a big swath of text at once is easier.
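For the log-and-fix-later variant, here is a minimal sketch (the bad_rows.txt name and the semicolon layout are my assumptions, matching the example CSV above):

import csv

def split_good_and_bad(csv_file, bad_file="bad_rows.txt"):
    # keep rows whose elevation parses; log the rest for later hand-editing
    good = []
    with open(csv_file, 'r') as open_csv, open(bad_file, 'w') as bad:
        for row in csv.DictReader(open_csv, delimiter=";"):
            try:
                row["elevation"] = float(row["elevation"].replace(",", "."))
                good.append(row)
            except ValueError:
                bad.write("{0};{1}\n".format(row["description"], row["elevation"]))
    return good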

I would take a different approach here.
Rather than doing this at read_csv time, I would read the CSV naively and then fix/convert the column to float:
In [11]: df = pd.read_csv(csv_file, sep=';')
In [12]: df['elevation']
Out[12]:
0     -10
1    10,0
2    35.5
3     30x
Name: elevation, dtype: object
Now just iterate through this column:
In [13]: df['elevation'] = df['elevation'].apply(fixFloat)
This is going to make it much easier to reason about the code (which columns you're applying functions to, how to access other columns etc. etc.).
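This also gives you the context the converter never had: walking the rows by label tells you exactly where the bad value sits, and you already know the column name. A minimal sketch (the prompt text is mine; raw_input matches the Python 2 style of the answer above):

def fix_column(df, column):
    # walk the rows by label so the prompt can say where the bad value is
    for label in df.index:
        value = df.at[label, column]
        try:
            df.at[label, column] = float(str(value).replace(",", "."))
        except ValueError:
            fixed = raw_input("Bad value {0!r} in column {1!r}, row {2}: "
                              .format(value, column, label))
            df.at[label, column] = float(fixed)
    df[column] = df[column].astype(float)

fix_column(df, 'elevation')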

Related

I am having issues with my code working properly and I'm stuck

I am having a problem with my code and getting it to work. I'm not sure if I'm sorting this correctly. I am trying to sort without lambda, pandas, or itemgetter.
Here is my code that I am having issues with.
with open('ManufacturerList.csv', 'r') as man_list:
    ml = csv.reader(man_list, delimiter=',')
    for row in ml:
        manufacturerList.append(row)
        print(row)
with open('PriceList.csv', 'r') as price_list:
    pl = csv.reader(price_list, delimiter=',')
    for row in pl:
        priceList.append(row)
        print(row)
with open('ManufacturerList.csv', 'r') as service_list:
    sl = csv.reader(service_list, delimiter=',')
    for row in sl:
        serviceList.append(row)
        print(row)
new_mfl = (sorted(manufacturerList, key='None'))
new_prl = (sorted(priceList, key='None'))
new_sdl = (sorted(serviceList, key='None'))
for x in range(0, len(new_mfl)):
    new_mfl[x].append(priceList[x][1])
for x in range(0, len(new_mfl)):
    new_mfl[x].append(serviceList[x][1])
new_list = new_mfl
inventoryList = (sorted(list, key=1))
I have tried to use a def function to get it to work, but I don't know if I'm doing it right. This is what I tried:
def new_mfl(x):
    return x[0]

x.sort(key=new_mfl)
You can do it like this:
def manufacturer_key(x):
    return x[0]

sorted_mfl = sorted(manufacturerList, key=manufacturer_key)
The key argument is the function that extracts the field of the CSV that you want to sort by.
For comparison, the same key as a one-liner with lambda:
sorted_mfl = sorted(manufacturerList, key=lambda x: x[0])
The csv module offers different Dialects and Formatting Parameters for handling input and output of comma-separated-value files. With the correct delimiter for the type of data you handle, this might be done with fewer statements, combined with built-in methods like split for string data or other methods to sort and manipulate lists. For single-column data read with delimiter=',', each row from csv.reader comes out as a flat list of values rather than a list of lists:
['9.310788653967691', '4.065746465800029', '6.6363356879192965', '7.279020237137884', '4.010297786910394']
['9.896092029283933', '7.553018448286675', '0.3268282119829197', '2.348011394854333', '3.964531054345021']
['5.078622663277619', '4.542467725728741', '3.743648062104161', '12.761916277286993', '9.164698479088221']
# out:
             column1             column2             column3             column4             column5
0  4.737897984379577   6.078414943611958  2.7021438955897095  5.8736388919905895   7.878958949784588
1  4.436982168483749  3.9453563399358544   12.66647791861843   5.323017508568736   4.156777982870004
2  4.798241413768279  12.690268531982028   9.638858110105895   7.881360524434767  4.2948334000783195
This works because the lists used here contain single values. For columns or lists of the form sorted_mfl = {'First Name': ['name', 'name', 'name'], 'Second Name': [...], 'ID': [...]}, new_prl = ['example', 'example', 'example'], new_sdl = [...], the data would be combined with something like sorted_mfl + new_prl + new_sdl. And since several different modules can be used to read and manage comma-separated files, you should add more information to your question, such as the data types you use, or create a minimal reproducible example with pandas.
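Putting the pieces together for the original task, a minimal sketch (assuming the three CSVs cover the same items in the same order once sorted, with the ID in column 0 and the value of interest in column 1; the ServiceList.csv name is my assumption, since the original code reopened ManufacturerList.csv there, likely by accident):

import csv

def first_field(row):
    # sort key: the first column of each row (no lambda or itemgetter needed)
    return row[0]

def read_sorted(path):
    with open(path, 'r') as f:
        return sorted(csv.reader(f, delimiter=','), key=first_field)

manufacturers = read_sorted('ManufacturerList.csv')
prices = read_sorted('PriceList.csv')
services = read_sorted('ServiceList.csv')

# merge the price and service columns onto the manufacturer rows
inventoryList = [m + [p[1], s[1]]
                 for m, p, s in zip(manufacturers, prices, services)]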

Convert string variables into ints in a dataset

I'm trying to convert values from strings to ints in a certain column of a dataset. I tried using a for loop, and even though the loop does seem to iterate through the data, it fails to convert any of the variables. I'm certain I'm making a super basic mistake but can't figure it out, as I'm very new at this.
I downloaded a data file from https://www.kaggle.com/datasets/majunbajun/himalayan-climbing-expeditions
Then I proceeded to process the data so that I could analyse it statistically.
Here's the start of the code
#import pandas
import pandas as pd
#import expeditions as csv file
exp = pd.read_csv('C:\\file\\path\\to\\expeditions.csv')
#create subset for success vs failure
exp_win_v_fail = exp[['termination_reason', 'basecamp_date', 'season']]
#drop successes in dispute
exp_win_v_fail = exp_win_v_fail[(exp_win_v_fail['termination_reason'] != 'Success (claimed)') & (exp_win_v_fail['termination_reason'] != 'Attempt rumoured')]
This is the part I can't figure out
#recode termination reason to be binary
for element in exp_win_v_fail['termination_reason']:
    if element == 'Success (main peak)':
        element = 1
    elif element == 'Success (subpeak)':
        element = 1
    else:
        element = 0
Any help would be very much appreciated
The loop doesn't change anything because element is only a copy of each value; reassigning it never writes back to the DataFrame. To replace all values beginning with 'Success' with 1, and all other values with 0:
from pandas import read_csv

RE = '^Success.*$'
NRE = '^((?!Success).)*$'
TR = 'termination_reason'
BD = 'basecamp_date'
SE = 'season'

data = read_csv('expeditions.csv')
exp_win_v_fail = data[[TR, BD, SE]]
for v, re_ in enumerate((NRE, RE)):
    exp_win_v_fail[TR] = exp_win_v_fail[TR].replace(to_replace=re_, value=v, regex=True)
for e in exp_win_v_fail[TR]:
    print(e)
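A simpler, vectorized alternative (my sketch, not part of the original answer) recodes the column with pandas' string accessor; fillna(False) guards against missing values:

import pandas as pd

exp = pd.read_csv('expeditions.csv')
exp_win_v_fail = exp[['termination_reason', 'basecamp_date', 'season']].copy()
# rows whose reason starts with "Success" become 1, everything else 0
exp_win_v_fail['termination_reason'] = (
    exp_win_v_fail['termination_reason']
    .str.startswith('Success')
    .fillna(False)
    .astype(int)
)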

Preserve string when writing pandas data frame to csv [duplicate]

I am having a recurring problem with saving large numbers in Python to CSV. The numbers are millisecond epoch time stamps, which I cannot convert or truncate and have to save in this format. As the columns with the millisecond timestamps also contain some NaN values, pandas casts them automatically to float (see the documentation in the Gotchas under "Support for integer NA").
I cannot seem to avoid this behaviour, so my question is: how can I save these numbers as integer values when using df.to_csv, i.e. with no decimal point or trailing zeros? I have columns with numbers of different floating precision in the same dataframe, and I do not want to lose the information there. The float_format parameter in to_csv seems to apply the same format to ALL float columns in my dataframe.
An example:
>>> df = pd.DataFrame({'a':[1.25, 2.54], 'b':[1424380449437, 1425510731187]})
>>> df['b'].dtype
Out[1]: dtype('int64')
>>> df.loc[2] = np.NaN
>>> df
Out[1]:
      a             b
0  1.25  1.424380e+12
1  2.54  1.425511e+12
2   NaN           NaN
>>> df['b'].dtype
dtype('float64')
>>> df.to_csv('test.csv')
>>> with open('test.csv') as f:
...     for line in f:
...         print(line)
,a,b
0,1.25,1.42438044944e+12
1,2.54,1.42551073119e+12
2,,
As you can see, I lost the precision of the last two digits of my epoch time stamp.
While to_csv does not have a parameter to change the format of individual columns, to_string does. It is a little cumbersome and might be a problem for very large DataFrames, but you can use it to produce a properly formatted string and then write that string to a file (as suggested in this answer to a similar question). to_string's formatters parameter takes, for example, a dictionary of functions to format individual columns. In your case, you could write your own custom formatter for the "b" column, leaving the defaults for the other column(s). This formatter might look somewhat like this:
def printInt(b):
    if pd.isnull(b):
        return "NaN"
    else:
        return "{:d}".format(int(b))
Now you can use this to produce your string:
df.to_string(formatters={"b": printInt}, na_rep="NaN")
which gives:
' a b\n0 1.25 1424380449437\n1 2.54 1425510731187\n2 NaN NaN'
You can see that there is still the problem that this is not comma separated and to_string actually has no parameter to set a custom delimiter, but this can easily be fixed by a regex:
import re
re.sub("[ \t]+(NaN)?", ",",
       df.to_string(formatters={"b": printInt}, na_rep="NaN"))
gives:
',a,b\n0,1.25,1424380449437\n1,2.54,1425510731187\n2,,'
This can now be written into the file:
with open("/tmp/test.csv", "w") as f:
print(re.sub("[ \t]+(NaN)?", ",",
df.to_string(formatters={"b": printInt}, na_rep="NaN")),
file=f)
which results in what you wanted:
,a,b
0,1.25,1424380449437
1,2.54,1425510731187
2,,
If you want to keep the NaN's in the csv-file, you can just change the regex:
with open("/tmp/test.csv", "w") as f:
print(re.sub("[ \t]+", ",",
df.to_string(formatters={"b": printInt}, na_rep="NaN")),
file=f)
will give:
,a,b
0,1.25,1424380449437
1,2.54,1425510731187
2,NaN,NaN
If your DataFrame contains strings with whitespace in them, a robust solution is not as easy. You could insert another character in front of every value that indicates the start of the next entry. If all strings contain only single whitespaces, you could use another whitespace, for example. This would change the code to this:
import pandas as pd
import numpy as np
import re

df = pd.DataFrame({'a a':[1.25, 2.54], 'b':[1424380449437, 1425510731187]})
df.loc[2] = np.NaN

def printInt(b):
    if pd.isnull(b):
        return " NaN"
    else:
        return " {:d}".format(int(b))

def printFloat(a):
    if pd.isnull(a):
        return " NaN"
    else:
        return " {}".format(a)

with open("/tmp/test.csv", "w") as f:
    print(re.sub("[ \t][ \t]+", ",",
                 df.to_string(formatters={"a a": printFloat, "b": printInt},
                              na_rep="NaN", col_space=2)),
          file=f)
which would give:
,a a,b
0,1.25,1424380449437
1,2.54,1425510731187
2,NaN,NaN
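On recent pandas versions there is a much shorter route that did not exist when this answer was written: the nullable integer dtype "Int64" (capital I) keeps integers intact alongside missing values, and to_csv writes them without decimal points. A minimal sketch, assuming pandas >= 0.24:

import pandas as pd
import numpy as np

df = pd.DataFrame({'a': [1.25, 2.54], 'b': [1424380449437, 1425510731187]})
df.loc[2] = np.nan
df['b'] = df['b'].astype('Int64')  # nullable integer dtype; NaN becomes <NA>
df.to_csv('test.csv', na_rep='NaN')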
Maybe this could work:
pd.set_option('precision',15)
df = pd.DataFrame({'a':[1.25, 2.54], 'b':[1424380449437, 1425510731187]})
fg = df.applymap(lambda x: str(x))
fg.loc[2] = np.NaN
fg.to_csv('test.csv', na_rep='NaN')
Your output should be something like this (I'm on a mac): [screenshot of the resulting CSV omitted]
I had the same problem with large numbers; for Excel files, the trick is to prefix the values with a tab character so Excel treats them as text rather than numbers:
df = "\t" + df

Python Loop Addition

No matter what I do, I don't seem to be able to add all the base volumes and quote volumes together easily! I want to end up with a total base volume and a total quote volume for all of the data in the data frame. Can someone help me with how to do this easily?
I have tried summing and saving the data in a dictionary first and then adding it, but I just can't seem to make this work!
import urllib
import pandas as pd
import json

def call_data():  # Call data from Poloniex
    global df
    datalink = 'https://poloniex.com/public?command=returnTicker'
    df = urllib.request.urlopen(datalink)
    df = df.read().decode('utf-8')
    df = json.loads(df)
    global current_eth_price
    for k, v in df.items():
        if 'ETH' in k:
            if 'USDT_ETH' in k:
                current_eth_price = round(float(v['last']), 2)
                print("Current ETH Price $:", current_eth_price)

def calc_volumes():  # Calculate the base & quote volumes
    global volume_totals
    for k, v in df.items():
        if 'ETH' in k:
            basevolume = float(v['baseVolume'])*current_eth_price
            quotevolume = float(v['quoteVolume'])*float(v['last'])*current_eth_price
            if quotevolume > 0:
                percentages = (quotevolume - basevolume) / basevolume * 100
                volume_totals = {'key': [k],
                                 'basevolume': [basevolume],
                                 'quotevolume': [quotevolume],
                                 'percentages': [percentages]}
                print("volume totals:", volume_totals)
                print("#"*8)

call_data()
calc_volumes()
A few notes:
For the next 2 years, don't use the keyword global for anything.
Put function documentation in a docstring under the def line.
Using the requests library would be much easier than urllib. However ...
pandas can fetch the JSON and parse it, all in one step.
OK, it doesn't have to be as split up as this; I'm just showing you how to properly pass variables around instead of using globals.
I could not find "ETH" by itself. In the data they sent there are these 3: ['BTC_ETH', 'USDT_ETH', 'USDC_ETH'], so I used "USDT_ETH". I hope the substitution is OK.
calc_volumes seems to both do the calculation and act as some sort of filter (it's picky as to what it prints). This function needs to be broken up into its two separate jobs: printing and calculating. (Maybe there was a filter step, but I leave that for homework.)
import pandas as pd

eth_price_url = 'https://poloniex.com/public?command=returnTicker'

def get_data(url=''):
    """ Call data from Poloniex and put it in a dataframe"""
    data = pd.read_json(url)
    return data

def get_current_eth_price(data=None):
    """ grab the price out of the dataframe """
    current_eth_price = data['USDT_ETH']['last'].round(2)
    return current_eth_price

def calc_volumes(data=None, current_eth_price=None):
    """ Calculate the base & quote volumes """
    data = data[data.columns[data.columns.str.contains('ETH')]].loc[['baseVolume', 'quoteVolume', 'last']]
    data = data.transpose()
    data[['baseVolume', 'quoteVolume']] *= current_eth_price
    data['quoteVolume'] *= data['last']
    data['percentages'] = (data['quoteVolume'] - data['baseVolume']) / data['quoteVolume'] * 100
    return data

df = get_data(url=eth_price_url)
the_price = get_current_eth_price(data=df)
print(f'the current eth price is: {the_price}')
volumes = calc_volumes(data=df, current_eth_price=the_price)
print(volumes)
This code seems kind of odd and inconsistent... for example, you're importing pandas and calling your variable df but you're not actually using dataframes. If you used df = pd.read_json('https://poloniex.com/public?command=returnTicker', 'index')* to get a dataframe, most of your data manipulation here would become much easier, and wouldn't require any loops either.
For example, the first function's code would become as simple as current_eth_price = df.loc['USDT_ETH','last'].
The second function's code would basically be
eth_rows = df[df.index.str.contains('ETH')]
total_base_volume = (eth_rows.baseVolume * current_eth_price).sum()
total_quote_volume = (eth_rows.quoteVolume * eth_rows['last'] * current_eth_price).sum()
(*The 'index' argument tells pandas to read the JSON dictionary indexed by rows, then columns, rather than columns, then rows.)
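Putting those pieces together, a self-contained sketch (assuming the Poloniex endpoint still serves the same JSON shape):

import pandas as pd

# rows are ticker pairs, columns are fields such as baseVolume / quoteVolume / last
df = pd.read_json('https://poloniex.com/public?command=returnTicker', orient='index')

current_eth_price = df.loc['USDT_ETH', 'last']
eth_rows = df[df.index.str.contains('ETH')]

total_base_volume = (eth_rows.baseVolume * current_eth_price).sum()
total_quote_volume = (eth_rows.quoteVolume * eth_rows['last'] * current_eth_price).sum()
print(total_base_volume, total_quote_volume)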

pandas - drop row with list of values, if contains from list

I have a huge set of data, something like 100k lines, and I am trying to drop a row from a dataframe if the row, which contains a list, contains a value from another dataframe. Here's a small example.
has = [['#a'], ['#b'], ['#c, #d, #e, #f'], ['#g']]
use = [1,2,3,5]
z = ['#d','#a']
df = pd.DataFrame({'user': use, 'tweet': has})
df2 = pd.DataFrame({'z': z})
              tweet  user
0              [#a]     1
1              [#b]     2
2  [#c, #d, #e, #f]     3
3              [#g]     5

    z
0  #d
1  #a
The desired outcome would be
  tweet  user
0  [#b]     2
1  [#g]     5
Things I've tried:
# this seems to work for dropping #a but not #d
for a in range(df.tweet.size):
    for search in df2.z:
        if search in df.loc[a].tweet:
            df.drop(a)

# this works for my small-scale example but throws an error on my big data
df['tweet'] = df.tweet.apply(', '.join)
test = df[~df.tweet.str.contains('|'.join(df2['z'].astype(str)))]
# the error being "unterminated character set at position 1343770"
# I went to check what was on that line and it returned this
basket.iloc[1343770]
user_id                                 17060480
tweet      [#IfTheyWereBlackOrBrownPeople, #WTF]
Name: 4612505, dtype: object
Any help would be greatly appreciated.
Is ['#c, #d, #e, #f'] one string, or a list like ['#c', '#d', '#e', '#f']? Assuming the latter:
has = [['#a'], ['#b'], ['#c', '#d', '#e', '#f'], ['#g']]
use = [1,2,3,5]
z = ['#d','#a']
df = pd.DataFrame({'user': use, 'tweet': has})
df2 = pd.DataFrame({'z': z})
A simple solution would be:
screen = set(df2.z.tolist())
to_delete = list()  # this will speed things up, doing only 1 delete
for id, row in df.iterrows():
    if set(row.tweet).intersection(screen):
        to_delete.append(id)
df.drop(to_delete, inplace=True)
Speed comparison (for 10,000 rows):
st = time.time()
screen = set(df2.z.tolist())
to_delete = list()
for id, row in df.iterrows():
    if set(row.tweet).intersection(screen):
        to_delete.append(id)
df.drop(to_delete, inplace=True)
print(time.time()-st)
2.142000198364258

st = time.time()
for a in df.tweet.index:
    for search in df2.z:
        if search in df.loc[a].tweet:
            df.drop(a, inplace=True)
            break
print(time.time()-st)
43.99799990653992
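The same idea can skip drop entirely by building a boolean mask, which is usually the more idiomatic pandas pattern (my variant, not from the original answer):

screen = set(df2.z.tolist())
# keep rows whose tweet list shares no element with the screening set
keep = df.tweet.apply(lambda tags: not screen.intersection(tags))
df = df[keep]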
For me, your code works if I make several adjustments.
First, you're missing the last line when using range(df.tweet.size); either increase this or (more robustly, in case you don't have a plain increasing index) use df.tweet.index.
Second, you never actually apply your dropping; use inplace=True for that.
Third, you have #d inside a string: '#c, #d, #e, #f' is not a list, and you have to change it to a list for this to work.
So if you change that, the following code works fine:
has = [['#a'], ['#b'], ['#c', '#d', '#e', '#f'], ['#g']]
use = [1,2,3,5]
z = ['#d','#a']
df = pd.DataFrame({'user': use, 'tweet': has})
df2 = pd.DataFrame({'z': z})
for a in df.tweet.index:
    for search in df2.z:
        if search in df.loc[a].tweet:
            df.drop(a, inplace=True)
            break  # so if we already dropped it we no longer look whether we should drop this line
This will produce the desired result. Be aware that it is potentially suboptimal, since it is not vectorized.
EDIT:
You can turn each string into a list with the following:
from itertools import chain
df.tweet = df.tweet.apply(lambda l: list(chain(*map(lambda lelem: lelem.split(","), l))))
This applies a function to each line (assuming each line contains a list with one or more elements): split each element (which should be a string) by comma into a new list, and "flatten" all the lists in one line (if there are multiple) together.
EDIT2:
Yes, this is not really performant, but it basically does what was asked. Keep that in mind, and once it is working, try to improve your code (fewer for-iterations; use tricks like collecting the indices and then dropping all of them at once).
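One last aside (my addition, not from the answers above): the "unterminated character set" error from the str.contains attempt arises because '|'.join(...) is used as a regular expression, so any tag containing a regex metacharacter such as [ breaks the pattern. Escaping each value first avoids it:

import re

# escape regex metacharacters in each tag before joining them into one pattern
pattern = '|'.join(re.escape(t) for t in df2['z'].astype(str))
test = df[~df.tweet.apply(', '.join).str.contains(pattern)]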
