How do I present my output as a Pandas dataframe? - python

Currently, the output I am getting is in the string format. I am not sure how to convert that string to a pandas dataframe.
I am getting 3 different tables in my output. It is in a string format.
One of the following 2 solutions will work for me:
Convert that string output to 3 different dataframes. OR
Change something in the function so that I get the output as 3 different data frames.
I have tried using RegEx to convert the string output to a dataframe, but it won't work in my case since I want my output to be dynamic: it should still work if I give another input.
def column_ch(self, sample_count=10):
    report = render("header.txt")
    match_stats = []
    match_sample = []
    any_mismatch = False
    for column in self.column_stats:
        if not column["all_match"]:
            any_mismatch = True
            match_stats.append(
                {
                    "Column": column["column"],
                    "{} dtype".format(self.df1_name): column["dtype1"],
                    "{} dtype".format(self.df2_name): column["dtype2"],
                    "# Unequal": column["unequal_cnt"],
                    "Max Diff": column["max_diff"],
                    "# Null Diff": column["null_diff"],
                }
            )
            if column["unequal_cnt"] > 0:
                match_sample.append(
                    self.sample_mismatch(column["column"], sample_count, for_display=True)
                )
    if any_mismatch:
        for sample in match_sample:
            report += sample.to_string()
            report += "\n\n"
    print("type is", type(report))
    return report

Since you have a string, you can pass your string into a file-like buffer and then read it with pandas read_csv into a dataframe.
Assuming that your string with the dataframe is called dfstring, the code would look like this:
import io
bufdf = io.StringIO(dfstring)
df = pd.read_csv(bufdf, sep=???)
If your string contains multiple dataframes, split it with split and use a loop.
import io
dflist = []
for sdf in dfstring.split('\n\n'):  # '\n\n' seems to be the separator between two dataframes
    bufdf = io.StringIO(sdf)
    dflist.append(pd.read_csv(bufdf, sep=???))
Be careful to pass an appropriate sep parameter; my ??? means that I cannot tell from your output what the proper value should be. Your fields are separated by spaces, so you could use sep='\s+', but I see that you also have spaces which are not meant to be separators, so this may cause a parsing error.
sep accepts a regex, so to use 2 or more consecutive spaces as the separator you could write sep='\s\s+' (this requires the additional parameter engine='python'). But again, make sure there are at least 2 spaces between two consecutive fields.
See here for reference about the io module and StringIO.
Note that in Python 2 StringIO lived in a separate module, but since the latest pandas versions require Python 3, I assume you are using Python 3.
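A runnable sketch of that loop, under the assumption that the tables use to_string()-style alignment with at least two spaces between fields (the table content below is invented):

```python
import io
import pandas as pd

# Hypothetical report string: two tables in to_string()-style layout,
# separated by a blank line, with at least two spaces between fields.
dfstring = (
    "Column  Max Diff\n"
    "a          0.5\n"
    "b          0.0"
    "\n\n"
    "Column  Max Diff\n"
    "c          1.5"
)

dflist = []
for sdf in dfstring.split("\n\n"):
    # Two-or-more spaces as separator requires the python engine
    dflist.append(pd.read_csv(io.StringIO(sdf), sep=r"\s\s+", engine="python"))

print(len(dflist))                     # 2
print(dflist[0]["Max Diff"].tolist())  # [0.5, 0.0]
```

Because the separator is a regex, single spaces inside a header like "Max Diff" survive intact while runs of alignment spaces split the fields.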


I want to print "identifier":"OPTSTKHDFC25-08-2022PE1660.00" in python

I tried the code below, but it gave:
TypeError: string indices must be integers
from nsepython import *
import pandas as pd
import json
positions = nse_optionchain_scrapper ('HDFC')
json_decode = json.dumps(positions,indent = 4, sort_keys=True,separators =(". ", " = "))
print(json_decode['data'][0]['identifier'])
print(json_decode['filtered']['data'][0]['PE']['identifier'])
json_decode = json.dumps(positions,indent = 4, sort_keys=True,separators =(". ", " = "))
You can't build a JSON string (which is what json.dumps does) and then try to access part of the result as if it were the original data structure. Your json_decode is just a string, not a dict; as far as Python is concerned it has no structure beyond the individual characters that make it up.
If you want to access parts of the data, just use positions directly:
print(positions['data'][0]['identifier'])
You can encode just that bit to JSON if you like:
print(json.dumps(positions['data'][0]['identifier']))
but that's probably just a quoted string in this case.
So I'm not sure what your goal is. If you want to print out the JSON version of positions, great, just print it out. But the JSON form is for input and output only; it's not suitable for messing around with inside your Python code.
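A minimal illustration of the difference (positions here is a stand-in dict, not the real nsepython output):

```python
import json

# Stand-in for the scraper output; the real structure is assumed.
positions = {"data": [{"identifier": "OPTSTKHDFC25-08-2022PE1660.00"}]}

encoded = json.dumps(positions)
print(type(encoded))   # <class 'str'>
print(encoded[0])      # '{' -- indexing a string yields characters, not fields

# Index the original dict, not the JSON string:
print(positions["data"][0]["identifier"])   # OPTSTKHDFC25-08-2022PE1660.00
```

Trying encoded['data'] instead would raise exactly the TypeError from the question, because string indices must be integers.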

Why does my Python code think a variable is a str when it should be an int?

So we are making a Python program in class in which we extract data about road accidents.
The file we are extracting from is a table with each line giving information about the people involved in a given accident
usagers_2016 = open('usagers_2016.csv', 'w', encoding='utf8', errors='ignore', newline="\n")
usagers_2016.write("Num_Acc;place;catu;grav;sexe;trajet;secu;locp;actp;etatp;an_nais;num_veh\n"
                   "201600000001;1;1;1;2;0;11;0;0;0;1983;B02\n"
                   "201600000001;1;1;3;1;9;21;0;0;0;2001;A01\n"
                   "201600000002;1;1;3;1;5;11;0;0;0;1960;A01\n"
                   "201600000002;2;2;3;1;0;11;0;0;0;2000;A01\n"
                   "201600000002;3;2;3;2;0;11;0;0;0;1962;A01\n"
                   "201600000003;1;1;1;1;1;11;0;0;0;1997;A01\n")
next(usagers_2016)
dict_acc = {}
for ligne in usagers_2016.readlines():
    ligne = ligne[:-2].split(";")
I chose to extract the info into a dictionary, where the accident is the key. The value of each key is a list whose first element is a list of the people involved, each person being represented by a list including their gender and birth year.
    if ligne[0] not in dict_acc.keys():
        dict_acc[ligne[0]] = [[], 0, 0, 0, 0]
    dict_acc[ligne[0]][0].append([ligne[4], ligne[10]])
usagers_2016.close()
for accident in dict_acc:
    accident[1] = len(accident[0])  # TypeError: 'str' object does not support item assignment
My problem is the following: I want to add, as the second element of the main list (the value of the key), the number of people involved in each accident (which is the len() of the first element of the list). However, while running the code it turned out that the first 0 (line 2 of the previous code extract) is treated as a str and does not support item assignment. The problem is that it was supposed to be an int! I thought that making the type explicit, as in dict_acc[ligne[0]] = [[], int(0), int(0), int(0), int(0)], would correct it, but no, my 0s are still treated as strings. Would you know why?
OK, so the problem was that I was calling accident[1] instead of dict_acc[accident][1].
The solution is:
for accident in dict_acc:
    dict_acc[accident][1] = len(dict_acc[accident][0])
Thanks to @MisterMiyagi.
The reason is that when you read the file with open, you get an object of type _io.TextIOWrapper, and each line is a string, which you later split on the delimiter ';'.
The line next(usagers_2016) makes me think you are dropping the first row with headers.
So what you could do instead is to open this csv as a Dataframe in Pandas like this:
import pandas as pd
df = pd.read_csv('usagers_2016.csv', sep=';')
# To remove the trailing \n
df.columns = df.columns.str.replace(r'\n', '')
df = df.replace(r'\\n', '', regex=True)
# Now, to calculate the number of people involved in each accident:
df.groupby(['Num_Acc']).size()
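A runnable sketch of that approach, with the sample rows inlined through a StringIO buffer instead of the actual file:

```python
import io
import pandas as pd

# Same layout as usagers_2016.csv, inlined for the example.
csv_text = (
    "Num_Acc;place;catu;grav;sexe;trajet;secu;locp;actp;etatp;an_nais;num_veh\n"
    "201600000001;1;1;1;2;0;11;0;0;0;1983;B02\n"
    "201600000001;1;1;3;1;9;21;0;0;0;2001;A01\n"
    "201600000002;1;1;3;1;5;11;0;0;0;1960;A01\n"
)

df = pd.read_csv(io.StringIO(csv_text), sep=";")
counts = df.groupby("Num_Acc").size()   # people per accident
print(counts.to_dict())   # {201600000001: 2, 201600000002: 1}
```

groupby().size() replaces the whole manual-dictionary bookkeeping in one line, and the columns keep their inferred dtypes.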

Format French number into English number - Python

I need to convert French formatted numbers extracted from .csv into English formatted numbers so I can use dataframe functions. The .csv gives:
Beta Alpha
2014-07-31 100 100
2014-08-01 99,55 100,01336806
2014-08-04 99,33 100,05348297
2014-08-05 99,63 100,06685818
2014-08-06 98,91 100,08023518
"99,5" & "100,01336806" are actually objects for python.
I need to turn them into floats with the following format "99.5" and "100.01336806"
I tried:
df = df.str.replace(to_replace =',', value = '.', case = False)
It doesn't give any error for that line, but it doesn't switch the ',' into '.' either.
df = pd.to_numeric(df, error = 'coerce')
TypeError: arg must be a list, tuple, 1-d array, or Series
Also tried the regex module without success, and I would rather use built-in function if possible.
Any help welcome!
What are the types of the source objects "99,55" & "100,01336806", and what type of target objects do you want?
The following was tested with Python 3.8.
Case 1: the source object is numeric and the target is a string. Format specifiers do not support the "French" convention, only the "English" one, so you have to substitute . and , yourself.
E.g. (float) 99.55 -> (string) '99,55'
v1 = float(99.55)
f"{v1:,.2f}"
'99.55'
f"{v1:,.2f}".replace(".",",")
'99,55'
Case 2: the source is a string in "French" format and the target is a float. The , must be replaced by a . before converting the string to float.
E.g. (string) '99,55' -> (float) 99.55
v2 = "99,55"
float(v2.replace(",","."))
99.55
Try using the replace() function:
x = "100,01336806"
y = x.replace(",",".")
print(y)
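Applied to a whole pandas column (the values below are invented), the same idea becomes Series.str.replace followed by pd.to_numeric; if the data comes straight from a file, read_csv's decimal=',' parameter achieves the same at load time:

```python
import pandas as pd

# Hypothetical column of French-formatted number strings.
df = pd.DataFrame({"Beta": ["100", "99,55", "99,33"]})

# Swap the decimal comma for a dot, then convert the whole column at once.
df["Beta"] = pd.to_numeric(df["Beta"].str.replace(",", ".", regex=False))
print(df["Beta"].tolist())   # [100.0, 99.55, 99.33]
```

Note that to_numeric is applied to a single column (a Series); passing the whole DataFrame is what triggers the "arg must be a list, tuple, 1-d array, or Series" error from the question.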

Pandas - Can't change datatype of dataframe columns

Downloading some data from here:
http://insideairbnb.com/get-the-data.html
Then
listings = pd.read_csv('listings.csv')
Trying to change types
listings.bathrooms = listings.bathrooms.astype('int64',errors='ignore')
listings.bedrooms = listings.bedrooms.astype('int64',errors='ignore')
listings.beds = listings.beds.astype('int64',errors='ignore')
listings.price = listings.price.replace(r'[\$,]', '', regex=True).astype('float')
listings.price = listings.price.astype('int64',errors='ignore')
Tried some other combinations but at the end pops error or just doesn't change datatype.
EDIT: corrected some typos
The apostrophes in the last line are not in the correct place, and the last one is not the correct character: you need ' instead of ` (maybe it was accidentally added because of the code block).
So for me it works like this:
listings.price.astype('int64', errors='ignore')
But if you would like to reassign it to the original variable then you need the same structure as you used in the previous lines:
listings.price = listings.price.astype('int64', errors='ignore')
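For context on why astype('int64', errors='ignore') may silently change nothing: if the column contains NaN, the conversion fails and the original column is returned unchanged. A sketch with made-up values; pandas' nullable Int64 dtype (capital I) is one way to keep missing values in an integer column:

```python
import numpy as np
import pandas as pd

bedrooms = pd.Series([1.0, 2.0, np.nan])

# bedrooms.astype('int64') would raise here: NaN has no int64
# representation, which is why errors='ignore' leaves the column float64.
nullable = bedrooms.astype("Int64")
print(nullable.dtype)      # Int64
print(nullable.tolist())   # [1, 2, <NA>]
```

With Int64 the missing entry survives as pd.NA while the rest of the column becomes genuine integers.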

I can't use python's "replace" to make my 0 a missing value (0->np.nan)

I used pandas to read my csv file from the cloud, then used replace() expecting 0 to become a missing value, but it doesn't seem to work.
I use Google's colab
I tried two methods:
user_data = user_data.replace(0,np.nan) # first
user_data.replace(0,np.nan,inplace = True) # second
user_data.head() # I use this to view the data.
But the data is the same as when I first read it; the 0s are unchanged.
Here is the function I use to read the file; I read it in chunks:
# Read function
def get_df2(file):
    mydata2 = []
    for chunk in pd.read_csv(file, chunksize=500000, header=None, sep='\t'):
        mydata2.append(chunk)
    user_data = pd.concat(mydata2, axis=0)
    names2 = ['user_id', 'age', 'gender', 'area', 'status']
    user_data.columns = names2
    return user_data
# read
user_data_path = 'a_url'
user_data = get_df2(user_data_path)
user_data.head()
Note: my code doesn't report an error; it outputs a result, but not the one I want.
Your 0s are probably just strings, try using:
user_data = user_data.replace('0', np.nan)
Python can get irritating under such scenarios.
As pointed out earlier, it is probably because of 0 being a string and not an integer.
which can be catered by
user_data.replace("0",np.nan,inplace = True)
But I wanted to point out that in scenarios where you know what kind of data a column in a pandas dataframe should hold, you should explicitly set it to that type. That way, whenever such a scenario occurs, an error will be raised and you will know exactly where the problem is.
In your case, columns are:
names2=['user_id','age','gender','area','status']
Let's assume
user_id is string
age is integer
gender is string
area is string
status is string
You can tell pandas which column is supposed to be which datatype:
user_data = user_data.astype({"user_id": str, "age": int, "gender": str, "area": str, "status": str})
There are many other ways to do that. Choose whichever suits you or your needs.
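To make the string-vs-int point concrete (the values below are invented): replace() matches by both value and type, so the integer 0 never matches the string '0' that a text file produces:

```python
import numpy as np
import pandas as pd

# Column read from a text file without dtype hints: every value is a string.
user_data = pd.DataFrame({"age": ["25", "0", "31"]})

unchanged = user_data.replace(0, np.nan)   # int 0 matches nothing here
fixed = user_data.replace("0", np.nan)     # string "0" does match

print(unchanged["age"].tolist())      # ['25', '0', '31']
print(int(fixed["age"].isna().sum())) # 1
```

Declaring the column as int up front (or letting read_csv infer it by not forcing header=None strings) would have made replace(0, np.nan) work directly.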
