I am having a recurring problem with saving large numbers in Python to CSV. The numbers are millisecond epoch timestamps, which I cannot convert or truncate and have to save in this format. Because the columns with the millisecond timestamps also contain some NaN values, pandas automatically casts them to float (see the documentation in the Gotchas under "Support for integer NA").
I cannot seem to avoid this behaviour, so my question is: how can I save these numbers as integer values when using df.to_csv, i.e. with no decimal point or trailing zeros? I have columns with numbers of different floating precision in the same DataFrame and I do not want to lose the information there. The float_format parameter in to_csv seems to apply the same format to ALL float columns in my DataFrame.
An example:
>>> df = pd.DataFrame({'a':[1.25, 2.54], 'b':[1424380449437, 1425510731187]})
>>> df['b'].dtype
Out[1]: dtype('int64')
>>> df.loc[2] = np.NaN
>>> df
Out[1]:
      a             b
0  1.25  1.424380e+12
1  2.54  1.425511e+12
2   NaN           NaN
>>> df['b'].dtype
dtype('float64')
>>> df.to_csv('test.csv')
>>> with open('test.csv') as f:
...     for line in f:
...         print(line)
,a,b
0,1.25,1.42438044944e+12
1,2.54,1.42551073119e+12
2,,
As you can see, I lost the precision of the last two digits of my epoch time stamp.
While DataFrame.to_csv does not have a parameter to change the format of individual columns, DataFrame.to_string does. It is a little cumbersome and might be a problem for very large DataFrames, but you can use it to produce a properly formatted string and then write that string to a file (as suggested in this answer to a similar question). to_string's formatters parameter takes a dictionary of functions for formatting individual columns. In your case, you could write your own custom formatter for the "b" column, leaving the defaults for the other column(s). This formatter might look somewhat like this:
def printInt(b):
    if pd.isnull(b):
        return "NaN"
    else:
        return "{:d}".format(int(b))
Now you can use this to produce your string:
df.to_string(formatters={"b": printInt}, na_rep="NaN")
which gives:
'      a              b\n0  1.25  1424380449437\n1  2.54  1425510731187\n2   NaN            NaN'
You can see that there is still the problem that this is not comma separated and to_string actually has no parameter to set a custom delimiter, but this can easily be fixed by a regex:
import re
re.sub("[ \t]+(NaN)?", ",",
df.to_string(formatters={"b": printInt}, na_rep="NaN"))
gives:
',a,b\n0,1.25,1424380449437\n1,2.54,1425510731187\n2,,'
This can now be written into the file:
with open("/tmp/test.csv", "w") as f:
print(re.sub("[ \t]+(NaN)?", ",",
df.to_string(formatters={"b": printInt}, na_rep="NaN")),
file=f)
which results in what you wanted:
,a,b
0,1.25,1424380449437
1,2.54,1425510731187
2,,
If you want to keep the NaNs in the csv file, you can just change the regex:
with open("/tmp/test.csv", "w") as f:
print(re.sub("[ \t]+", ",",
df.to_string(formatters={"b": printInt}, na_rep="NaN")),
file=f)
will give:
,a,b
0,1.25,1424380449437
1,2.54,1425510731187
2,NaN,NaN
If your DataFrame contained strings with whitespace before, a robust solution is not as easy. You could insert another character in front of every value that indicates the start of the next entry. If all your strings contain only single whitespaces, you could use another whitespace, for example. This changes the code to this:
import pandas as pd
import numpy as np
import re

df = pd.DataFrame({'a a':[1.25, 2.54], 'b':[1424380449437, 1425510731187]})
df.loc[2] = np.NaN

def printInt(b):
    if pd.isnull(b):
        return " NaN"
    else:
        return " {:d}".format(int(b))

def printFloat(a):
    if pd.isnull(a):
        return " NaN"
    else:
        return " {}".format(a)

with open("/tmp/test.csv", "w") as f:
    print(re.sub("[ \t][ \t]+", ",",
                 df.to_string(formatters={"a a": printFloat, "b": printInt},
                              na_rep="NaN", col_space=2)),
          file=f)
which would give:
,a a,b
0,1.25,1424380449437
1,2.54,1425510731187
2,NaN,NaN
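Alternatively, a shorter sketch of the same idea, assuming it is acceptable to cast the column to object dtype before writing: convert the non-null timestamps back to Python ints, after which to_csv writes them without a decimal point:

df['b'] = df['b'].apply(lambda x: x if pd.isnull(x) else int(x))  # dtype becomes object
df.to_csv('test.csv', na_rep='NaN')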
Maybe this could work:
pd.set_option('precision',15)
df = pd.DataFrame({'a':[1.25, 2.54], 'b':[1424380449437, 1425510731187]})
fg = df.applymap(lambda x: str(x))
fg.loc[2] = np.NaN
fg.to_csv('test.csv', na_rep='NaN')
Your output should be something like this (I'm on a mac):
,a,b
0,1.25,1424380449437
1,2.54,1425510731187
2,NaN,NaN
I had the same problem with large numbers; for Excel files, prefixing each value with a tab character works:
df = "\t" + df
Related
Assuming that I have a pandas dataframe and I want to add thousand separators to all the numbers (integer and float), what is an easy and quick way to do it?
When formatting a number with a comma as the thousands separator, you can just use '{:,}'.format:
n = 10000
print '{:,}'.format(n)  # 10,000
n = 1000.1
print '{:,}'.format(n)  # 1,000.1
In pandas, you can use the formatters parameter to to_html as discussed here.
num_format = lambda x: '{:,}'.format(x)

def build_formatters(df, format):
    return {
        column: format
        for column, dtype in df.dtypes.items()
        if dtype in [np.dtype('int64'), np.dtype('float64')]
    }
formatters = build_formatters(data_frame, num_format)
data_frame.to_html(formatters=formatters)
Adding the thousands separator has actually been discussed quite a bit on Stack Overflow. You can read here or here.
Use Series.map or Series.apply with these solutions:
df['col'] = df['col'].map('{:,}'.format)
df['col'] = df['col'].map(lambda x: f'{x:,}')
df['col'] = df['col'].apply('{:,}'.format)
df['col'] = df['col'].apply(lambda x: f'{x:,}')
Assuming you just want to display (or render to HTML) the floats/integers with a thousands separator, you can use styling, which was added in version 0.17.1:
import pandas as pd
df = pd.DataFrame({'int': [1200, 320], 'flt': [5300.57, 12000000.23]})
df.style.format('{:,}')
To render this output to html you use the render method on the Styler.
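For example, a minimal sketch of writing the styled table to an HTML file (note that in newer pandas versions, Styler.to_html supersedes render):

html = df.style.format('{:,}').render()
with open('styled.html', 'w') as f:
    f.write(html)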
Steps
1. Use df.applymap() to apply a function to every cell in your dataframe.
2. Check if the cell value is of type int or float.
3. Format numbers using f'{x:,d}' for integers and f'{x:,f}' for floats.
Here is a simple example for integers only:
df = df.applymap(lambda x: f'{x:,d}' if isinstance(x, int) else x)
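A sketch extending this to handle floats as well (keep in mind that isinstance(x, int) is also True for booleans, so exclude bool if your frame contains any):

# format ints with a thousands separator, floats with the f'{x:,f}'
# format from the steps above, and leave everything else untouched
df = df.applymap(lambda x: f'{x:,d}' if isinstance(x, int)
                 else (f'{x:,f}' if isinstance(x, float) else x))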
If you want "." as thousand separator and "," as decimal separator this will works:
Data = pd.read_Excel(path)
Data[my_numbers] = Data[my_numbers].map('{:,.2f}'.format).str.replace(",", "~").str.replace(".", ",").str.replace("~", ".")
If you want three decimals instead of two, change ".2f" to ".3f":
Data[my_numbers] = Data[my_numbers].map('{:,.3f}'.format).str.replace(",", "~", regex=False).str.replace(".", ",", regex=False).str.replace("~", ".", regex=False)
The formatters parameter in to_html will take a dictionary.
Click the example link for reference.
I have a pandas dataframe with columns and rows. Now I want to create another column which will be a concatenation of two strings and a column from the dataframe.
So the way it would work is: string one (see the dictionary below) + colx (from the dataframe) + string two.
stringList = {
    'one': """ AC:A000 AMI:NO CM:B C:YES CL:CPN:'#US3L+""",
    'two': """ FRQ:4 NOT:1 PX:C PXND:1E-6:DOWN RDTE:MAT RP:1 SET:0WW XD:NO """
}
I tried to create a function, but I don't think it is working as I want. I want this to be a function so I can call it from another function.
def fun(final):
    for i in dm:
        c = stringList['one'] + str(dm[i]) + stringList['two']
        final.append(c)
Please help with this, as I am stuck on this problem for now.
Required Output:
str1 |QM |str2 |output
AC:A000 AMI:NO CM:B C:YES CL:CPN:'#US3L+ |0.0125 | RQ:4 NOT:1 PX:C PXND:1E-6:DOWN RDTE:MAT RP:1 SET:0WW XD:NO| AC:A000 AMI:NO CM:B C:YES CL:CPN:'#US3L+0.0125RQ:4 NOT:1 PX:C PXND:1E-6:DOWN RDTE:MAT RP:1 SET:0WW XD:NO
AC:A000 AMI:NO CM:B C:YES CL:CPN:'#US3L+ 0.016 RQ:4 NOT:1 PX:C PXND:1E-
I hope this helps explain it. I know it is not a very good representation, but this problem is critical for me to solve.
Thanks
After looking at your output, I realized that you want to combine the three columns str1, QM, and str2. I am assuming here that str1 and str2 have dtype str and QM has dtype float. You can use the following code to get the output column:
df["output"] = df["str1"] + df["QM"].astype(str) + df["str2"]
For a project for my lab, I'm analyzing Twitter data. The tweets we've captured all have the word 'sex' in them, that's the keyword we filtered the TwitterStreamer to capture based on.
I converted the CSV where all of the tweet data (JSON metatags) is housed into a pandas DataFrame and saved the 'text' column to isolate the tweet text.
import pandas as pd
import csv
df = pd.read_csv('tweets_hiv.csv')
saved_column4 = df.text
print saved_column4
Out comes the correct output:
0 Some example tweet text
1 Oh hey look more tweet text #things I hate #stuff
...a bunch more lines
Name: text, Length: 8540, dtype: object
But, when I try this
from textblob import TextBlob
tweetstr = str(saved_column4)
tweets = TextBlob(tweetstr).upper()
print tweets.words.count('sex', case_sensitive=False)
My output is 22.
There should be AT LEAST as many incidences of the word 'sex' as there are lines in the CSV, and likely more. I can't figure out what's happening here. Is TextBlob not configuring right around a dtype:object?
I'm not entirely sure this is methodologically correct as far as language processing goes, but using join will give you the count you need.
import pandas as pd
from textblob import TextBlob
tweets = pd.Series('sex {}'.format(x) for x in range(1000))
tweetstr = " ".join(tweets.tolist())
tweetsb = TextBlob(tweetstr).upper()
print tweetsb.words.count('sex', case_sensitive=False)
# 1000
If you just need the count without necessarily using TextBlob, then just do:
import pandas as pd
tweets = pd.Series('sex {}'.format(x) for x in range(1000))
sex_tweets = tweets.str.contains('sex', case=False)
print sex_tweets.sum()
# 1000
You can get a TypeError in the first snippet if one of your elements is not of type string. This is more of a join issue. A simple test can be done using the following snippet:
# tweets = pd.Series('sex {}'.format(x) for x in range(1000))
tweets = pd.Series(x for x in range(1000))
tweetstr = " ".join(tweets.tolist())
Which gives the following result:
Traceback (most recent call last):
File "F:\test.py", line 6, in <module>
tweetstr = " ".join(tweets.tolist())
TypeError: sequence item 0: expected string, numpy.int64 found
A simple workaround is to convert x in the list comprehension into a string before using join, like so:
tweets = pd.Series(str(x) for x in range(1000))
Or you can be more explicit and create a list first, map the str function to it, and then use join.
tweetlist = tweets.tolist()
tweetstr = map(str, tweetlist)
tweetstr = " ".join(tweetstr)
The CSV conversion is not the problem! When you use str() on a column of a DataFrame (that is, a Series), it makes a "print-friendly" output of the Series, which means cutting out the majority of the data and displaying just the first few and the last few rows. Here is a transcript of an IPython session that will probably illustrate the issue better:
In [1]: import pandas as pd
In [2]: blah = pd.Series('tweet %d' % n for n in range(1000))
In [3]: blah
Out[3]:
0 tweet 0
1 tweet 1
... (output continues from 1 to 29)
29 tweet 29
... (OUTPUT SKIPS HERE)
970 tweet 970
... (output continues from 970 to 998)
998 tweet 998
999 tweet 999
dtype: object
In [4]: blahstr = str(blah)
In [5]: blahstr.count('tweet')
Out[5]: 60
So, since the output of the str() operation cuts off my data (and might even truncate column values if I had used longer strings), I don't get 1000, I get 60.
If you want to do it your way (combine everything back into a single string and work with it that way), there's no point in using a library like Pandas. Pandas gives you better ways:
Working With a Series of Strings
Pandas has tools for working with a Series that contains strings. Here is a tutorial-like page about it, and here is the full string handling API documentation. In particular, for finding the number of uses of the word "sex", you could do something like this (assuming df is a DataFrame, and text is the column containing the tweets):
import re
counts = df['text'].str.count('sex', re.IGNORECASE)
counts should be a Series containing the number of occurrences of "sex" in each tweet. counts.sum() would give you the total number of usages, which should hopefully be more than 1000.
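A small self-contained sketch of this approach, using made-up tweets instead of the real data:

import re
import pandas as pd

df = pd.DataFrame({'text': ['Sex education matters', 'no match here', 'sex sex']})
# count occurrences per tweet, case-insensitively
counts = df['text'].str.count('sex', re.IGNORECASE)
print(counts.tolist())  # [1, 0, 2]
print(counts.sum())     # 3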
I am trying to import a CSV and deal with faulty values, e.g. a wrong decimal separator or strings in int/double columns. I use converters to do the error fixing. In the case of strings in number columns, the user sees an input box where they have to fix the value. Is it possible to get the column name and/or the row that is currently being imported? If not, is there a better way to do the same?
example csv:
------------
description;elevation
point a;-10
point b;10,0
point c;35.5
point d;30x
from PyQt4 import QtGui
import numpy
from pandas import read_csv

def fixFloat(x):
    # return x as float if possible
    try:
        return float(x)
    except:
        # if not, test if there is a , inside, replace it with a . and return it as float
        try:
            return float(x.replace(",", "."))
        except:
            changedValue, ok = QtGui.QInputDialog.getText(None, 'Faulty value', 'Please correct the faulty value:', text=x)
            if ok:
                return fixFloat(changedValue)
            else:
                return -9999999999

def fixEmptyStrings(s):
    if s == '':
        return None
    else:
        return s

converters = {
    'description': fixEmptyStrings,
    'elevation': fixFloat
}
dtypes = {
    'description': object,
    'elevation': numpy.float64
}
csvData = read_csv('/tmp/csv.txt',
                   sep=';',
                   error_bad_lines=True,
                   dtype=dtypes,
                   converters=converters
                   )
If you want to iterate over them, the built-in csv.DictReader is pretty handy. I wrote up this function:
import csv
import pandas

def read_points(csv_file):
    point_names, elevations = [], []
    message = (
        "Found bad data for {0}'s row: {1}. Type new data to use "
        "for this value: "
    )
    with open(csv_file, 'r') as open_csv:
        r = csv.DictReader(open_csv, delimiter=";")
        for row in r:
            tmp_point = row.get("description", "some_default_name")
            tmp_elevation = row.get("elevation", "some_default_elevation")
            point_names.append(tmp_point)
            try:
                tmp_elevation = float(tmp_elevation.replace(',', '.'))
            except:
                while True:
                    user_val = raw_input(message.format(tmp_point,
                                                        tmp_elevation))
                    try:
                        tmp_elevation = float(user_val)
                        break
                    except:
                        tmp_elevation = user_val
            elevations.append(tmp_elevation)
    return pandas.DataFrame({"Point": point_names, "Elevation": elevations})
And for the four-line test file, it gives me the following:
In [41]: read_points("/home/ely/tmp.txt")
Found bad data for point d's row: 30x. Type new data to use for this value: 30
Out[41]:
Elevation Point
0 -10.0 point a
1 10.0 point b
2 35.5 point c
3 30.0 point d
[4 rows x 2 columns]
Displaying a whole QT dialog box seems way overkill for this task. Why not just a command prompt? You can also add more conversion functions and change some things like the delimiter to be keyword arguments if you want it to be more customizable.
One question is how much data there is to iterate through. If it's a lot of data, this will be time-consuming and tedious. In that case, you may just want to discard observations like the '30x', or write their point ID names to some other file so you can go back and deal with them all in one swoop inside something like Emacs or Vim, where manipulating a big swath of text at once is easier; a sketch of this follows below.
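If discarding is acceptable, a short sketch of that idea using pandas' to_numeric with errors='coerce' (bad values become NaN, and the affected point IDs are written to a separate file to revisit later):

import pandas as pd

df = pd.read_csv('/tmp/csv.txt', sep=';')
# unparseable values such as '30x' become NaN
elevation = pd.to_numeric(df['elevation'].str.replace(',', '.', regex=False),
                          errors='coerce')
# save the IDs of the bad rows so they can be fixed in one sweep later
df.loc[elevation.isnull(), 'description'].to_csv('/tmp/bad_points.txt', index=False)
df['elevation'] = elevation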
I would take a different approach here.
Rather than at read_csv time, I would read the csv naively and then fix / convert to float:
In [11]: df = pd.read_csv(csv_file, sep=';')
In [12]: df['elevation']
Out[12]:
0 -10
1 10,0
2 35.5
3 30x
Name: elevation, dtype: object
Now just iterate through this column:
In [13]: df['elevation'] = df['elevation'].apply(fixFloat)
This is going to make it much easier to reason about the code (which columns you're applying functions to, how to access other columns, etc.).
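Putting it together, a minimal non-GUI sketch of this approach, using a plain input prompt instead of the QInputDialog from the question:

import pandas as pd

def fixFloat(x):
    # try a plain float, then with the decimal separator fixed,
    # and finally ask the user to correct the value interactively
    try:
        return float(x)
    except ValueError:
        try:
            return float(x.replace(",", "."))
        except ValueError:
            fixed = input("Please correct the faulty value %r: " % (x,))
            return fixFloat(fixed)

df = pd.read_csv('/tmp/csv.txt', sep=';')
df['elevation'] = df['elevation'].apply(fixFloat)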