pd.to_csv using different sep for rows - python

I read the following txt file with pd.read_csv(filename, sep=','):
read pass 1000K.
-128,-50,-48,-47,-41,-45,-41,-41,-39,-37
-127,-49,-46,-46,-40,-44,-40,-40,-38,-36
-126,-48,-44,-45,-39,-43,-39,-39,-37,-35
-125,-47,-42,-44,-38,-42,-38,-38,-36,-34
then I convert it back to csv using
df.to_csv(filename, index=None)
I get the following:
read pass 1000K.
-37
-36
-35
-34
Only the last column is preserved: the header row has a single field, so pandas absorbs the first nine data columns into the index, and index=None drops them on write.
Does anyone know how to treat the first row as space-separated and the remaining rows as comma-separated,
so I can get all the data into cells like this?
read|pass|1000K.
-128|-50|-48|-47|-41|-45|-41|-41|-39|-37
-127|-49|-46|-46|-40|-44|-40|-40|-38|-36
-126|-48|-44|-45|-39|-43|-39|-39|-37|-35
-125|-47|-42|-44|-38|-42|-38|-38|-36|-34

I tried the following, and it is working fine.
hello.txt
read pass 1000K.
-128,-50,-48,-47,-41,-45,-41,-41,-39,-37
-127,-49,-46,-46,-40,-44,-40,-40,-38,-36
-126,-48,-44,-45,-39,-43,-39,-39,-37,-35
-125,-47,-42,-44,-38,-42,-38,-38,-36,-34
In [1]: import pandas as pd
In [2]: df = pd.read_csv('hello.txt')
In [3]: df
Out[3]:
read pass 1000K.
-128 -50 -48 -47 -41 -45 -41 -41 -39 -37
-127 -49 -46 -46 -40 -44 -40 -40 -38 -36
-126 -48 -44 -45 -39 -43 -39 -39 -37 -35
-125 -47 -42 -44 -38 -42 -38 -38 -36 -34
In [4]: df.to_csv("test3.csv")
Now if I check test3.csv, it has all the columns preserved:
,,,,,,,,,read pass 1000K.
-128,-50,-48,-47,-41,-45,-41,-41,-39,-37
-127,-49,-46,-46,-40,-44,-40,-40,-38,-36
-126,-48,-44,-45,-39,-43,-39,-39,-37,-35
-125,-47,-42,-44,-38,-42,-38,-38,-36,-34
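If the goal is to turn the first row into column names, here is a minimal sketch (assuming the input file is named hello.txt as above): read the header line separately, split it on whitespace, and let read_csv parse the comma-separated data rows.
import pandas as pd

with open('hello.txt') as f:
    columns = f.readline().split()  # ['read', 'pass', '1000K.']

df = pd.read_csv('hello.txt', skiprows=1, header=None)
# pad the names in case there are more data columns than header tokens
df.columns = columns + ['col%d' % i for i in range(len(columns), df.shape[1])]
df.to_csv('test3.csv', index=False)
This writes a csv where every data value gets its own cell, with the header tokens naming the first three columns.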

Related

Issue with astype losing decimals

I have a problem with astype: when I use it, I lose decimals that are really important, because they are longitude and latitude coordinates.
df[["Latitud","Longitud"]] = df[["Latitud","Longitud"]].astype(float)
Here is what I need:
df[["Latitud", "Longitud"]]
Latitud Longitud
0 -34.807023 -56.0336021
1 -34.8879924 -56.1846677
2 -34.8895332 -56.1560728
3 -34.8860972 -56.1635684
4 -34.7242753 -56.2012194
393 -34.8575722 -56.0534571
394 -34.7448815 -56.2132383
395 -34.8539222 -56.2320066
396 -34.8513169 -56.1721213
397 -34.8220428 -55.9906951
And here it's what astype gives me:
df[["Latitud", "Longitud"]]
Latitud Longitud
0 -35 -56
1 -35 -56
2 -35 -56
3 -35 -56
4 -35 -56
393 -35 -56
394 -35 -56
395 -35 -56
396 -35 -56
397 -35 -56
I try with no luck:
df[["Latitud","Longitud"]] = pd.to_numeric(df[["Latitud","Longitud"]],errors='coerce')
pd.options.display.float_format = '{:.08f}'.format
How can I can keep my decimals?
Well, the thing that seemed to solve the problem was using:
pd.options.display.float_format = '{:,.8f}'.format
Only to discover that the display format was not the real problem! But I hope this helps someone with decimals!
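For reference, a minimal sketch showing that astype(float) keeps full precision and that float_format only changes how values are printed (the data is made up from the example above):
import pandas as pd

df = pd.DataFrame({"Latitud": [-34.807023], "Longitud": [-56.0336021]})
df = df.astype(float)                                # stored values keep full precision
print(df["Latitud"].iloc[0])                         # -34.807023
pd.options.display.float_format = '{:,.8f}'.format   # affects display only
print(df)                                            # now prints eight decimals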

data.dropna() doesn't work for my data.csv file and I still get data with NaN elements

I'm studying Pandas in Python.
I'm trying to remove NaN elements from my data.csv file with data.dropna(), but it isn't removing them.
import pandas as pd
data = pd.read_csv('data.csv')
new_data = data.dropna()
print(new_data)
This is data.csv content.
Duration Date Pulse Maxpulse Calories
60 '2020/12/01' 110 130 409.1
60 '2020/12/02' 117 145 479.0
60 '2020/12/03' 103 135 340.0
45 '2020/12/04' 109 175 282.4
45 '2020/12/05' 117 148 406.0
60 '2020/12/06' 102 127 300.0
60 '2020/12/07' 110 136 374.0
450 '2020/12/08' 104 134 253.3
30 '2020/12/09' 109 133 195.1
60 '2020/12/10' 98 124 269.0
60 '2020/12/11' 103 147 329.3
60 '2020/12/12' 100 120 250.7
60 '2020/12/12' 100 120 250.7
60 '2020/12/13' 106 128 345.3
60 '2020/12/14' 104 132 379.3
60 '2020/12/15' 98 123 275.0
60 '2020/12/16' 98 120 215.2
60 '2020/12/17' 100 120 300.0
45 '2020/12/18' 90 112 NaN
60 '2020/12/19' 103 123 323.0
45 '2020/12/20' 97 125 243.0
60 '2020/12/21' 108 131 364.2
45 NaN 100 119 282.0
60 '2020/12/23' 130 101 300.0
45 '2020/12/24' 105 132 246.0
60 '2020/12/25' 102 126 334.5
60 2020/12/26 100 120 250.0
60 '2020/12/27' 92 118 241.0
60 '2020/12/28' 103 132 NaN
60 '2020/12/29' 100 132 280.0
60 '2020/12/30' 102 129 380.3
60 '2020/12/31' 92 115 243.0
My guess is that data.csv is written incorrectly?
The data.csv file is indeed written wrong; to fix it, you need to add commas.
Corrected format: data.csv
Duration,Date,Pulse,Maxpulse,Calories
60,'2020/12/01',110,130,409.1
60,'2020/12/02',117,145,479.0
60,'2020/12/03',103,135,340.0
45,'2020/12/04',109,175,282.4
45,'2020/12/05',117,148,406.0
60,'2020/12/06',102,127,300.0
60,'2020/12/07',110,136,374.0
450,'2020/12/08',104,134,253.3
30,'2020/12/09',109,133,195.1
60,'2020/12/10',98,124,269.0
60,'2020/12/11',103,147,329.3
60,'2020/12/12',100,120,250.7
60,'2020/12/12',100,120,250.7
60,'2020/12/13',106,128,345.3
60,'2020/12/14',104,132,379.3
60,'2020/12/15',98,123,275.0
60,'2020/12/16',98,120,215.2
60,'2020/12/17',100,120,300.0
45,'2020/12/18',90,112,
60,'2020/12/19',103,123,323.0
45,'2020/12/20',97,125,243.0
60,'2020/12/21',108,131,364.2
45,,100,119,282.0
60,'2020/12/23',130,101,300.0
45,'2020/12/24',105,132,246.0
60,'2020/12/25',102,126,334.5
60,2020/12/26,100,120,250.0
60,'2020/12/27',92,118,241.0
60,'2020/12/28',103,132,
60,'2020/12/29',100,132,280.0
60,'2020/12/30',102,129,380.3
60,'2020/12/31',92,115,243.0
TL;DR:
Try this:
new_data = data.fillna(pd.NA).dropna()
or:
import numpy as np
new_data = data.fillna(np.nan).dropna()
Is that the real csv file? I don't think so.
The CSV format has no standard way to mark missing values. In my experience, a missing value in a csv is represented by nothing between two separators (if the separator is a comma, it looks like ,,).
From the pandas docs, pandas.read_csv has an na_values argument:
na_values : scalar, str, list-like, or dict, optional
Additional strings to recognize as NA/NaN. If dict passed, specific per-column NA values. By default the following values are interpreted as NaN: ‘’, ‘#N/A’, ‘#N/A N/A’, ‘#NA’, ‘-1.#IND’, ‘-1.#QNAN’, ‘-NaN’, ‘-nan’, ‘1.#IND’, ‘1.#QNAN’, ‘<NA>’, ‘N/A’, ‘NA’, ‘NULL’, ‘NaN’, ‘n/a’, ‘nan’, ‘null’.
If your csv file contains the string 'NaN', pandas is able to infer it and read it as NaN, but you can pass the parameter as you need.
Also, you can inspect the actual cell types (take i as the row number and j as the column number):
type(df.iloc[i,j])
Compare with:
type(np.nan) # numpy NaN
float
type(pd.NA) # pandas NaN
pandas._libs.missing.NAType
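Putting this together, a minimal sketch, assuming the original whitespace-separated layout shown in the question:
import pandas as pd

# the file is whitespace-separated, not comma-separated; with the right sep,
# the literal string 'NaN' is recognized as a missing value by default
data = pd.read_csv('data.csv', sep=r'\s+')
new_data = data.dropna()   # the rows with missing Date or Calories are dropped
print(new_data)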

Making separate plots with unique identifiers in Python using CSV file

I have a CSV file where one column has a unique identifier (a,b,c...) and I would like to make separate plots based on this identifier (so a separate line on the same graph for a,b and so forth).
SSID Time RSSI
0 a 13:14:42 -33
1 a 13:14:46 -30
2 a 13:14:49 -31
3 a 13:14:52 -31
4 a 13:14:55 -35
.. ... ... ...
64 b 13:15:43 -58
65 b 13:15:46 -56
66 b 13:15:50 -65
67 b 13:15:53 -52
68 b 13:15:57 -65
What I've written plots every point together as one line. How can I plot them on the same graph but separated by identifier?
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

temp = np.genfromtxt('file.csv', delimiter=',')
plt.figure()
plt.plot(temp)
plt.show()
Thank you!
Reshape the data so that the SSID values become columns, then use a simple pandas plot():
import io
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv(io.StringIO(""" SSID Time RSSI
0 a 13:14:42 -33
1 a 13:14:46 -30
2 a 13:14:49 -31
3 a 13:14:52 -31
4 a 13:14:55 -35
64 b 13:15:43 -58
65 b 13:15:46 -56
66 b 13:15:50 -65
67 b 13:15:53 -52
68 b 13:15:57 -65"""), sep=r"\s+")

fig, ax = plt.subplots(1, figsize=[10, 6])
df.set_index(["SSID", "Time"]).unstack(0).droplevel(0, 1).plot(ax=ax)
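An equivalent sketch using groupby, assuming the same df as above; this skips the reshape and draws one line per SSID directly:
fig, ax = plt.subplots(1, figsize=[10, 6])
for ssid, group in df.groupby("SSID"):
    ax.plot(group["Time"], group["RSSI"], label=ssid)  # one line per identifier
ax.legend(title="SSID")
plt.show()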

Possible to store getvalue() from FTP into a dataframe in Python?

I'm trying to see if I can take .txt files directly from an FTP server and turn them into a single concatenated dataframe, without saving each file on my local drive.
My method of doing this right now looks like this:
# import packages
import pandas as pd
from ftplib import FTP
from io import BytesIO

ftp = FTP('ftp.cpc.ncep.noaa.gov')  # enter the main page
ftp.login()  # log in to archive
# directory for monthly data
ftp.cwd('htdocs/products/analysis_monitoring/cdus/degree_days/archives/Heating degree Days/monthly states/2018')
filenames_monthly = ftp.nlst()  # get list of filenames in archive
# grab wanted files; date_list_monthly is built elsewhere from files already on the local machine
filenames_wanted_monthly = list(set(filenames_monthly).intersection(date_list_monthly))
df_list = []
for file_month in filenames_wanted_monthly:
    r = BytesIO()
    ftp.retrbinary('RETR ' + file_month, r.write)
    df_list.append(r.getvalue().decode('utf-8'))
    print(r.getvalue())
When I print the values I get the text I'm looking for but in an unaggregated and messy form:
b' \n HEATING DEGREE DAY DATA MONTHLY SUMMARY\n POPULATION-WEIGHTED STATE,REGIONAL,AND NATIONAL AVERAGES\n CLIMATE PREDICTION CENTER-NCEP-NWS-NOAA\n \n MONTHLY DATA FOR OCT 2018\n ACCUMULATIONS ARE FROM JULY 1, 2018\n -999 = NORMAL LESS THAN 100 OR RATIO INCALCULABLE\n \n STATE MONTH MON MON CUM CUM CUM CUM CUM\n TOTAL DEV DEV TOTAL DEV DEV DEV DEV\n FROM FROM FROM FROM FROM FROM\n NORM L YR NORM L YR NORM L YR\n PRCT PRCT\n \n ALABAMA 90 -63 -28 90 -78 -48 -46 -35\n ALASKA 685 -314 -149 1436 -629 -251 -30 -15\n ARIZONA 43 -33 42 43 -40 42 -999 -999\n ARKANSAS 197 24 24 215 14 24 7 13\n CALIFORNIA 43 -76 -5 43 -126 -22 -75 -34\n COLORADO 593 20 37 721 -187 -38 -21 -5\n CONNECTICUT 399 -22 179 498 -56 137 -10 38\n DELAWARE 242 -43 94 245 -83 75 -25 44\n DISTRCT COLUMBIA 194 -11 103 195 -36 97 -16 99\n FLORIDA 3 -7 -19 3 -7 -19 -999 -999\n GEORGIA 117 -37 -18 117 -52 -39 -31 -25\n HAWAII 0 0 0 0 0 0 -999 -999\n IDAHO 535 3 -78 732 -112 -104 -13 -12\n ILLINOIS 430 35 135 529 9 107 2 25\n INDIANA 386 10 95 471 -24 45 -5 11\n IOWA
Appending to the list just gives me one long string per file.
Is there a way to store these .txt files, splitting them on the \n characters, into a single readable dataframe?
My desired output would be a dataframe with 2 columns with the first 2 values after each instance of \n:
STATE TOTAL
ALABAMA 90
ARIZONA 43
CALIFORNIA 43
CONNECTICUT 399
.
.
Use StringIO to wrap the string from getvalue() as an in-memory file, then read it with pandas to get the desired dataframe:
test_result = StringIO(result)           # wrap the decoded string as a file-like object
df = pd.read_csv(test_result, sep="\n")  # one row per line (note: newer pandas versions reject "\n" as a separator; the sketch below avoids it)
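Since the reports are fixed-width text rather than csv, here is a minimal sketch of extracting the desired STATE/TOTAL columns from the decoded strings; the regex is an illustrative assumption based on the sample layout above:
import re
import pandas as pd

rows = []
for text in df_list:  # each element is one decoded monthly report
    for line in text.splitlines():
        # state rows start with an upper-case name followed by at least two numbers
        m = re.match(r'\s*([A-Z][A-Z ]+?)\s+(-?\d+)\s+-?\d+', line)
        if m:
            rows.append({'STATE': m.group(1).strip(), 'TOTAL': int(m.group(2))})

result_df = pd.DataFrame(rows)
print(result_df.head())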

Why can't I apply shift from within a pandas function?

I am trying to build a function that uses .shift() but it is giving me an error.
Consider this:
In [40]:
import pandas as pd
from pandas import DataFrame

data = {'level1': [20, 19, 20, 21, 25, 29, 30, 31, 30, 29, 31],
        'level2': [10, 10, 20, 20, 20, 10, 10, 20, 20, 10, 10]}
index = pd.date_range('12/1/2014', periods=11)
frame = DataFrame(data, index=index)
frame
Out[40]:
level1 level2
2014-12-01 20 10
2014-12-02 19 10
2014-12-03 20 20
2014-12-04 21 20
2014-12-05 25 20
2014-12-06 29 10
2014-12-07 30 10
2014-12-08 31 20
2014-12-09 30 20
2014-12-10 29 10
2014-12-11 31 10
A normal function works fine. To demonstrate, I calculate the same result twice, using the direct approach and then a function:
In [63]:
frame['horizontaladd1'] = frame['level1'] + frame['level2']  # works

def horizontaladd(x):
    test = x['level1'] + x['level2']
    return test

frame['horizontaladd2'] = frame.apply(horizontaladd, axis=1)
frame
Out[63]:
level1 level2 horizontaladd1 horizontaladd2
2014-12-01 20 10 30 30
2014-12-02 19 10 29 29
2014-12-03 20 20 40 40
2014-12-04 21 20 41 41
2014-12-05 25 20 45 45
2014-12-06 29 10 39 39
2014-12-07 30 10 40 40
2014-12-08 31 20 51 51
2014-12-09 30 20 50 50
2014-12-10 29 10 39 39
2014-12-11 31 10 41 41
But while calling shift directly works, inside a function it doesn't:
frame['verticaladd1'] = frame['level1'] + frame['level1'].shift(1)  # works

def verticaladd(x):
    test = x['level1'] + x['level1'].shift(1)
    return test

frame.apply(verticaladd)  # error
results in
KeyError: ('level1', u'occurred at index level1')
I also tried applying to a single column which makes more sense in my mind, but no luck:
def verticaladd2(x):
    test = x - x.shift(1)
    return test

frame['level1'].map(verticaladd2)  # error, also with apply
error:
AttributeError: 'numpy.int64' object has no attribute 'shift'
Why not call shift directly? I need to embed it in a function to calculate multiple columns at the same time, along axis 1. See the related question: Ambiguous truth value with boolean logic.
Try passing the frame to the function, rather than using apply (I am not sure why apply doesn't work, even column-wise):
def f(x):
    return x.level1 + x.level1.shift(1)

f(frame)
returns:
2014-12-01 NaN
2014-12-02 39
2014-12-03 39
2014-12-04 41
2014-12-05 46
2014-12-06 54
2014-12-07 59
2014-12-08 61
2014-12-09 61
2014-12-10 59
2014-12-11 60
Freq: D, Name: level1, dtype: float64
Check whether the values you are trying to shift are a plain numpy array. If so, you need to convert the array to a Series; then you will be able to shift the values. I was having the same issue, and now I can get the shifted values.
This is the relevant part of my code, for your reference:
X = grouped['Confirmed_day'].values   # a numpy array, which has no .shift()
X_series = pd.Series(X)               # convert the array to a Series
X_lag1 = X_series.shift(1)            # now shifting works
I'm not entirely following along, but if frame['level1'].shift(1) works, then I can only imagine that frame['level1'] is not a numpy.int64 object, while whatever you are passing into the verticaladd function is. You probably need to look at your types.
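For what it's worth, a minimal sketch of why both errors occur, reusing the frame defined above (the name verticaladd3 is illustrative):
# DataFrame.apply with the default axis=0 passes each *column* to the function
# as a Series, so inside verticaladd, x is the 'level1' Series itself and
# x['level1'] raises a KeyError.
# Series.map passes each *scalar*, so x.shift(1) fails because a numpy.int64
# has no .shift() method. Calling shift on the whole column works:
def verticaladd3(col):
    return col - col.shift(1)

frame['verticaladd3'] = verticaladd3(frame['level1'])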
