I have a Pandas Series with 76 elements. When I try to print the Series (for debugging), the output is abbreviated with "...". Is there a way to pretty print all of the elements of the Series?
In this example, the Series is called "data"
print str(data)
gives me this
Open 40.4568
High 40.4568
Low 39.806
Close 40.114
Volume 796146.2
Active 1
TP1_ema 700
stop_ema_width 0.5
LS_ema 10
stop_window 210
target_width 3
LS_width 0
TP1_pct 1
TP1_width 4
stop_ema 1400
...
ValueSharesHeld NaN
AccountIsWorth NaN
Profit NaN
BuyPrice NaN
SellPrice NaN
ShortPrice NaN
BtcPrice NaN
LongStopPrice NaN
ShortStopPrice NaN
LongTargetPrice NaN
ShortTargetPrice NaN
LTP1_Price NaN
STP1_Price NaN
TradeOpenPrice NaN
TheEnd False
Name: 2000-11-03 14:00, Length: 76, dtype: object
Note the "..." inserted in the middle. I'm debugging using PTVS (Python Tools for Visual Studio) on Visual Studio 2013. I get the same behaviour with Enthought Canopy.
pd.options.display.max_rows = 100
The default is set at 60 (so dataframes or series with more elements will be truncated when printed).
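If you would rather not change the global setting, the limit can also be lifted temporarily. The following is a minimal sketch, assuming a stand-in Series of 76 made-up elements (the names `field_0` ... `field_75` are hypothetical, not the asker's data):

```python
import pandas as pd

# Hypothetical stand-in for the 76-element Series from the question
data = pd.Series(range(76), index=[f"field_{i}" for i in range(76)])

# Lift the row limit only inside this block; None means "no limit"
with pd.option_context("display.max_rows", None):
    full_text = repr(data)

# Alternative: to_string() renders every row regardless of display options
full_text_alt = data.to_string()

print(full_text)
```

The context manager restores the previous setting on exit, so other printouts in the same debugging session keep their usual truncation.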
Good afternoon
I am trying to import more than 100 separate .txt files containing data I want to plot. I would like to automate this process, since repeating the same steps for every individual file is very tedious.
I have read up on how to read multiple .txt files, and found a nice explanation. However, following the example all my data gets imported as NaNs. I read up some more and found a more reliable way of importing .txt files, namely by using pd.read_fwf() as can be seen here.
Although I can at least see my data now, I have no clue how to plot it, since the data is in one column separated by \t, e.g.
0 Extension (mm)\tLoad (kN)\tMachine extension (mm)\tPreload extension
1 0.000000\t\t\t
2 0.152645\t0.000059312\t.....
... etc.
I have tried using different separators in both pd.read_csv() and pd.read_fwf(), including ' ', '\t' and '-s+', but to no avail.
Of course this causes a problem, because now I can not plot my data. Speaking of, I am also not sure how to plot the data in the dataframe. I want to plot each .txt file's data separately on the same scatter plot.
I am very new to Stack Overflow, so pardon the format of the question if it does not conform to the normal standard. I attach my code below, but unfortunately I can not attach my .txt files. Each .txt file contains about a thousand rows of data. I attach a picture of the general format of all the .txt files.
import numpy as np
import pandas as pd
from matplotlib import pyplot as pp
import os
import glob
# change the working directory
os.chdir(r"C:\Users\Philip de Bruin\Desktop\Universiteit van Pretoria\Nagraads\sterktetoetse_basislyn\trektoetse\speel")
# get the file names
leggername = [i for i in glob.glob("*.txt")]
# put everything in a dataframe
df = [pd.read_fwf(legger) for legger in leggername]
df
EDIT: the output I get now for the DataFrame is:
[ Time (s)\tLoad (kN)\tMachine Extension (mm)\tExtension
0
1 0.000000\t\t\t
2
3 0.152645\t0.000059312\t-...
4
... ...
997 76.0173\t0.037706\t0.005...
998
999 76.1699\t0.037709\t\t
1000
1001
from Preload (mm)
0 NaN NaN NaN
1 NaN NaN NaN
2 NaN NaN NaN
3 NaN NaN NaN
4 NaN NaN NaN
... ... ... ...
997 NaN NaN NaN
998 NaN NaN NaN
999 NaN NaN NaN
1000 NaN NaN NaN
1001 NaN NaN NaN
[1002 rows x 4 columns],
Time (s)\tLoad (kN)\tMachine Extension (mm)\tExtension
0
1 0.000000\t\t\t
2
3 0.128151\t0.000043125\t-...
4
... ...
997 63.8191\t0.034977\t-0.00...
998
999 63.9473\t0.034974\t\t
1000
1001
from Preload (mm)
0 NaN NaN NaN
1 NaN NaN NaN
2 NaN NaN NaN
3 NaN NaN NaN
4 NaN NaN NaN
... ... ... ...
997 NaN NaN NaN
998 NaN NaN NaN
999 NaN NaN NaN
1000 NaN NaN NaN
1001 NaN NaN NaN
[1002 rows x 4 columns],
Time (s)\tLoad (kN)\tMachine Extension (mm)\tExtension
0
1 0.000000\t\t\t
2
3 0.174403\t0.000061553\t0...
4
... ...
997 86.8529\t0.036093\t-0.00...
998
999 87.0273\t\t-0.0059160\t-...
1000
1001
from Preload (mm)
0 NaN NaN NaN
1 NaN NaN NaN
2 NaN NaN NaN
3 NaN NaN NaN
4 NaN NaN NaN
... ... ... ...
997 NaN NaN NaN
998 NaN NaN NaN
999 NaN NaN NaN
1000 NaN NaN NaN
1001 NaN NaN NaN
... etc
The basic gist is to skip the first data row (that has a single value in it), then read the individual files with pd.read_csv, using tab as the separator, and stack them together.
There is, however, a more problematic issue: the data files turn out to be UTF-16 encoded (the binary data show a NUL character at the even positions), but there is no byte-order-mark (BOM) to indicate this. As a result, you can't specify the encoding in read_csv, but have to manually read each file as binary, then decode it with UTF-16 to a string, then feed that string to read_csv. Since the latter requires a filename or IO-stream, the text data needs to be put into a StringIO object first (or save the corrected data to disk first, then read the corrected file; might not be a bad idea).
import pandas as pd
import os
import glob
import io
# change the working directory
os.chdir(r"C:\Users\Philip de Bruin\Desktop\Universiteit van Pretoria\Nagraads\sterktetoetse_basislyn\trektoetse\speel")
dfs = []
for filename in glob.glob("*.txt"):
    with open(filename, 'rb') as fp:
        data = fp.read()  # a single file should fit in memory just fine
    # Decode the UTF-16 data that is missing a BOM
    string = data.decode('UTF-16')
    # And put it into a stream, for ease-of-use with `read_csv`
    stream = io.StringIO(string)
    # Read the data from the, now properly decoded, stream
    # Skip the single-value row, and use tabs as separators
    df = pd.read_csv(stream, sep='\t', skiprows=[1])
    # To keep track of the individual files, add an "origin" column
    # with its value set to the corresponding filename
    df['origin'] = filename
    dfs.append(df)
# Concatenate all dataframes (default is to stack the rows)
df = pd.concat(dfs)
# For a quick and dirty plot, you can enjoy the power of Seaborn
import seaborn as sns
# Use appropriate (full) column names, and use the 'origin'
# column for the hue and symbol
sns.scatterplot(data=df, x='Time (s)', y='Machine Extension (mm)', hue='origin', style='origin')
Seaborn's scatterplot documentation.
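To try the decode-then-parse step without the original files, here is a small self-contained sketch. The sample text is made up to mimic the layout described in the question, the helper name `read_utf16_tsv` is hypothetical, and the explicit little-endian codec (`utf-16-le`) is an assumption, chosen so the result does not depend on the decoder's default byte order:

```python
import io
import pandas as pd

# Made-up sample: a header row, a single-value row, then data rows
text = (
    "Time (s)\tLoad (kN)\tMachine Extension (mm)\n"
    "0.000000\t\t\n"
    "0.152645\t0.000059312\t-0.001\n"
    "0.305290\t0.000118\t-0.002\n"
)
# Encoding with 'utf-16-le' produces UTF-16 bytes without a BOM
raw = text.encode("utf-16-le")


def read_utf16_tsv(raw_bytes):
    """Decode BOM-less UTF-16 bytes and parse them as tab-separated data."""
    decoded = raw_bytes.decode("utf-16-le")  # assumption: little-endian
    # skiprows=[1] drops the single-value row that follows the header
    return pd.read_csv(io.StringIO(decoded), sep="\t", skiprows=[1])


df = read_utf16_tsv(raw)
print(df)
```

If the real files turn out to be big-endian, `utf-16-be` would be the codec to try instead.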
I am reading an excel spreadsheet into pandas as:
input_df: pd.DataFrame = pd.read_excel(data_filename, engine='openpyxl')
Here's a screenshot of the beginning of the excel file:
However, when I examine the dataframe, fractional parts are added to two out of the three time columns.
Out[6]:
Real Time Current(nA) Unnamed: 2 Unnamed: 3 Sensor 4 Time Sensor 4 Current nA Unnamed: 6 FS Time FS Value
0 11:58:03.111700 119.400 NaN NaN 10:53:39 119.428 NaN 10:43:12 101.0
1 11:58:04.681197 119.439 NaN NaN 10:53:40.795800 119.474 NaN 10:44:06 103.0
2 11:58:07.246866 119.417 NaN NaN 10:53:43.214300 119.447 NaN 10:51:36 88.0
3 11:58:09.388763 119.416 NaN NaN 10:53:45.294400 119.439 NaN 10:53:39 88.0
4 11:58:11.454134 119.411 NaN NaN 10:53:47.302400 119.451 NaN 11:06:58 83.0
These don't appear in the original excel file as evidenced by the screenshot below:
I have no idea where these fractions come from. They don't appear in the original file. Why is this happening, and how can I read in the correct times?
OK. This was my fault. It turns out that there is a fractional part to the timestamps. Google Sheets needed to be configured to show that fractional part. In summary, the xlsx file and the pandas dataframe agree.
ValueError: Input contains NaN
I have run
from sklearn.preprocessing import OrdinalEncoder
data_.iloc[:,1:-1] = OrdinalEncoder().fit_transform(data_.iloc[:,1:-1])
here is data_
Age Sex Embarked Survived
0 22.0 male S 0
1 38.0 female C 2
2 26.0 female S 2
3 35.0 female S 2
4 35.0 male S 0
Before doing any processing, you always have to preprocess the data and get a summary of what it looks like. Concretely, the error you obtained is telling you that you have NaN values. To check, try this command:
df.isnull().any().any()
If the output is True, you have NaN values. You can run the next command if you want to know where these NaN values are:
df.isnull().any()
Then, you will know in which column are your NaN values.
Once you know you have NaN values, you have to preprocess them (eliminate, fill,... whatever you believe is the best option). The link gtomer commented is a nice resource.
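As a rough illustration of the check-then-clean steps above, here is a sketch on a made-up frame (the values are hypothetical, and filling with the median or the most frequent value is just one possible choice, not a recommendation for your data):

```python
import numpy as np
import pandas as pd

# Made-up data with one missing value in two of the columns
df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0],
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", np.nan],
})

has_nan = df.isnull().any().any()   # True if any cell is NaN
nan_per_column = df.isnull().sum()  # number of NaN values in each column

# Option 1: fill the gaps (median for numbers, most frequent value for labels)
filled = df.copy()
filled["Age"] = filled["Age"].fillna(filled["Age"].median())
filled["Embarked"] = filled["Embarked"].fillna(filled["Embarked"].mode()[0])

# Option 2: drop every row that contains a NaN
dropped = df.dropna()

print(nan_per_column)
```

Either `filled` or `dropped` no longer contains NaN, so the missing-value complaint should not come up when that frame is passed on to the encoder.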
I'm sorry for not posting the data, but it wouldn't really help. The thing is, I need to make a graph, and I have a CSV file full of information organised by date. It has 'Cases', 'Deaths', 'Recoveries', 'Critical', 'Hospitalized' and 'States' as categories. It is ordered by date and has the number of cases, deaths and recoveries per day for each state. How do I sum these categories to make a graph that shows how the total is increasing? I really have no idea how to start, so I can't post my data. Below are some numbers that try to explain what I have.
0 2020-02-20 1 Andalucía NaN NaN NaN
1 2020-02-20 2 Aragón NaN NaN NaN
2 2020-02-20 3 Asturias NaN NaN NaN
3 2020-02-20 4 Baleares 1.0 NaN NaN
4 2020-02-20 5 Canarias 1.0 NaN NaN
.. ... ... ... ... ... ...
888 2020-04-06 19 Melilla 92.0 40.0 3.0
889 2020-04-06 14 Murcia 1283.0 500.0 84.0
890 2020-04-06 15 Navarra 3355.0 1488.0 124.0
891 2020-04-06 16 País Vasco 9021.0 4856.0 417.0
892 2020-04-06 17 La Rioja 2846.0 918.0 66.0
It's unclear exactly what you mean by "sum this categories". I'm assuming you mean that for each date, you want to sum the values across all different regions to come up with the total values for Spain?
In which case, you will want to groupby date, then .sum() the columns (you can drop the States category).
grouped_df = df.groupby("date")[["Cases", "Deaths", ...]].sum()
grouped_df.plot()  # "date" is already the index after the groupby
This snippet will probably not work directly, you may need to reformat the dates etc. But should be enough to get you started.
I think you are looking for groupby followed by a cumsum not including dates.
columns_to_group = ['Cases', 'Deaths',
'Recoveries', 'Critical', 'Hospitalized', 'date']
new_columns = ['Cases_sum', 'Deaths_sum',
'Recoveries_sum', 'Critical_sum', 'Hospitalized_sum']
df_grouped = df[columns_to_group].groupby('date').sum().reset_index()
For plotting, seaborn provides an easy function:
import seaborn as sns
df_melted = df_grouped.melt(id_vars=["date"])
sns.lineplot(data=df_melted, x='date', y = 'value', hue='variable')
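To make the groupby step concrete, here is a small sketch on made-up rows shaped like the sample in the question (the numbers and the two-column selection are hypothetical). Whether the extra `cumsum` is needed is an assumption: it only applies if the file holds new counts per day; if the values are already running totals, the per-date sum is the curve to plot.

```python
import pandas as pd

# Made-up rows: one row per date and region
df = pd.DataFrame({
    "date": ["2020-02-20", "2020-02-20", "2020-02-21", "2020-02-21"],
    "States": ["Baleares", "Canarias", "Baleares", "Canarias"],
    "Cases": [1.0, 1.0, 2.0, None],
    "Deaths": [None, None, 1.0, 0.0],
})
df["date"] = pd.to_datetime(df["date"])

value_columns = ["Cases", "Deaths"]

# Sum over all regions for each date (NaN values are skipped)
daily_totals = df.groupby("date")[value_columns].sum()

# Only if the file holds per-day counts: accumulate them over time
running_totals = daily_totals.cumsum()

print(daily_totals)
print(running_totals)
```

Both results are indexed by date, so calling `.plot()` on either one draws a line per column against time.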
When I run the code below I get the error:
TypeError: 'NoneType' object has no attribute '__getitem__'
import pyarrow
import pandas
import pyarrow.parquet as pq
df = pq.read_table("file.parquet").to_pandas()
df = df.iloc[1:,:]
df = df.dropna (how="any", inplace = True) # modifies it in place, creates new dataset without NAN
average_age = df["_c2"].mean()
print average_age
The dataframe looks like this:
_c0 _c1 _c2
0 RecId Class Age
1 1 1st 29
2 2 1st NA
3 3 1st 30
If I print the df after calling the dropna method, I get 'None'.
Shouldn't it be creating a new dataframe without the 'NA' in it, which would then allow me to get the average age without throwing an error?
As per OP’s comment, the NA is a string rather than NaN. So dropna() is no good here. One of many possible options for filtering out the string value ‘NA’ is:
df = df[df["_c2"] != "NA"]
A better option to catch inexact matches (e.g. with trailing spaces), as suggested by @DJK in the comments:
df = df[~df["_c2"].str.contains('NA')]
This one should remove any strings rather than only ‘NA’:
df = df[df["_c2"].apply(lambda x: x.isnumeric())]
This will also work. Note that if the NA in your df is a real NaN (np.nan), it will not affect computing the mean of the column; it is only a problem if your NA is the string 'NA'.
(df.apply(pd.to_numeric,errors ='coerce',axis=1)).describe()
Out[9]:
_c0 _c1 _c2
count 3.0 0.0 2.000000
mean 2.0 NaN 29.500000
std 1.0 NaN 0.707107
min 1.0 NaN 29.000000
25% 1.5 NaN 29.250000
50% 2.0 NaN 29.500000
75% 2.5 NaN 29.750000
max 3.0 NaN 30.000000
More info
df.apply(pd.to_numeric,errors ='coerce',axis=1)# all object change to NaN and will not affect getting mean
Out[10]:
_c0 _c1 _c2
0 NaN NaN NaN
1 1.0 NaN 29.0
2 2.0 NaN NaN
3 3.0 NaN 30.0
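Putting the pieces together, here is a minimal sketch on a frame rebuilt by hand from the sample in the question (so no parquet file is needed). It also shows why the original code ended up with None: `dropna(..., inplace=True)` modifies the frame and returns None, so assigning its result back to `df` replaces the frame with None.

```python
import pandas as pd

# Rebuilt by hand from the sample shown in the question
df = pd.DataFrame({
    "_c0": ["RecId", "1", "2", "3"],
    "_c1": ["Class", "1st", "1st", "1st"],
    "_c2": ["Age", "29", "NA", "30"],
})
df = df.iloc[1:, :].copy()  # drop the header row that was read as data

# Turn the strings into numbers; anything unparseable ('NA') becomes NaN
ages = pd.to_numeric(df["_c2"], errors="coerce")
average_age = ages.mean()  # NaN values are skipped by default

# inplace=True returns None, which is what the question's `df = ...` stored
in_place_result = df.dropna(how="any", inplace=True)

print(average_age)
print(in_place_result)
```

Either use `df = df.dropna(how="any")` or call `df.dropna(how="any", inplace=True)` without the assignment, but not both at once.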