How to convert Complex CSV to JSON? - python

I have a CSV file which contains many fields and strings.
Here is that CSV file:
https://drive.google.com/file/d/1GVLvTqRv80Gfg1fvBYNx0z5xIiiLCHUo/view?usp=sharing
I want properly structured JSON. When I convert the file with the standard Python functions, some fields come back as plain strings.
But I want "population_mesh" and other similar fields as arrays (not strings).
I also want the "population", "intervention", and "Outcomes" fields in array form.
Below is a sample of the target JSON file:
https://drive.google.com/file/d/1Y7oGVyOG777APVqOsLBxc9eSd38pDMbc/view?usp=sharing
I tried converting "population" and the other similar fields to arrays of strings, but some strings contain quotes that get treated as part of the JSON syntax; escape characters have to be added so the output matches the target JSON exactly. A sketch of one approach follows.
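A sketch of one way to do this with the standard library, assuming the array-like columns hold delimiter-separated text. The column names come from the question, while the file names and the ";" delimiter are assumptions (the linked files could not be inspected). json.dump takes care of escaping embedded quotes, which addresses the escaping problem above:

import csv
import json

# array-like columns named in the question; the ";" delimiter is an assumption
ARRAY_FIELDS = {"population_mesh", "population", "intervention", "Outcomes"}

def row_to_record(row):
    record = {}
    for key, value in row.items():
        if key in ARRAY_FIELDS:
            # split into a list; json.dump escapes any embedded quotes for us
            record[key] = [part.strip() for part in value.split(";") if part.strip()]
        else:
            record[key] = value
    return record

with open("input.csv", newline="", encoding="utf-8") as src:
    records = [row_to_record(row) for row in csv.DictReader(src)]

with open("output.json", "w", encoding="utf-8") as dst:
    json.dump(records, dst, indent=2, ensure_ascii=False)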

Related

I have a CSV generated from CICFLOWMETER and I'm unable to generate a correlation matrix: it either produces an empty data frame or raises the error below.

This is the code I'm using. I have also tried converting the dtype of my columns from object to float, but I got this error:
import pandas as pd

df = pd.read_csv('DDOSping.csv')
pearsoncorr = df.corr(method='pearson')
ValueError: could not convert string to float:
'172.27.224.251-172.27.224.250-56003-502-6'
Somewhere in your CSV the string value '172.27.224.251-172.27.224.250-56003-502-6' exists. Do you know why it's there? What does it represent? It looks like it shouldn't be in the data you include in your correlation matrix calculation.
The df.corr method is trying to convert that string value to a float, which obviously isn't possible because it's a long composite string with various characters, not a regular number.
You should clean your CSV of unnecessary data (or make a copy and clean that, so you don't lose anything important). Remove anything, like metadata, that isn't the exact data df.corr needs, including the string in the error message.
If it's just a few values you need to clean, open the file in Excel or a text editor and do the cleaning there. If it's a lot, and all the irrelevant data is in specific rows and/or columns, you can instead drop them from your DataFrame before calling df.corr, as sketched below.
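A minimal sketch of that second option, assuming the offending strings live in non-numeric columns (for example a flow-ID column); the file name comes from the question:

import pandas as pd

df = pd.read_csv('DDOSping.csv')
# keep only the numeric columns before computing the correlation matrix
numeric_df = df.select_dtypes(include='number')
pearsoncorr = numeric_df.corr(method='pearson')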

How do I clean this weird JSON data which I extracted from an Excel file so that it becomes a proper dictionary?

I have an Excel file that contains JSON-like data. I extracted the data from one column of that file and converted it into a dictionary using the .to_dict() function. One cell of that column looks like this, and the other rows of the column hold the same kind of data:
{\currentPortfolioId":null/"isNewRTQ":true/"isNewInvestmentTenure":true/"isNearTermVolatility":false/"getPath":true/"riskProfile":"Moderate"/"initialInvestment":200000/"cashflowDate":"01-01-2021"/"currentWealth":200000/"goalPriority":"Wish"/"rebalancing":"yearly"/"goalAmount":2000000/"startDate":"16-06-2021"/"endDate":"01-01-2031"/"isNewGoalPriority":true/"infusions":[0/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/0]/"scenario_type":"regular"/"infusion_type":"monthly"/"xforwardForValue":"49.228.234.102:49907/ 13.86.190.104:3072/ 172.30.217.148:36243"}"
As you can see, the data is not clean; it contains stray characters such as "/" and "\".
Can anyone help me clean this data and convert it into a proper dictionary so that I can work with it afterwards?
I did try ast.literal_eval(), but it doesn't seem to work.
Please help!
It's very similar to JSON. Maybe it was mangled by autocorrect or something like that in Excel? The commas have been replaced with forward slashes, every " is prefixed with a backslash, and a single " is missing before the first key.
If you add the missing ", strip out all the backslashes, and replace every / with ,, you can parse it with json.loads():
data = r'''{\"currentPortfolioId\":null/\"isNewRTQ\":true/\"isNewInvestmentTenure\":true/\"isNearTermVolatility\":false/\"getPath\":true/\"riskProfile\":\"Moderate\"/\"initialInvestment\":200000/\"cashflowDate\":\"01-01-2021\"/\"currentWealth\":200000/\"goalPriority\":\"Wish\"/\"rebalancing\":\"yearly\"/\"goalAmount\":2000000/\"startDate\":\"16-06-2021\"/\"endDate\":\"01-01-2031\"/\"isNewGoalPriority\":true/\"infusions\":[0/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/10000/0]/\"scenario_type\":\"regular\"/\"infusion_type\":\"monthly\"/\"xforwardForValue\":\"49.228.234.102:49907/ 13.86.190.104:3072/ 172.30.217.148:36243\"}'''
import json

# strip the stray backslashes, then turn the "/" separators back into commas
parsed = json.loads(data.replace("\\", "").replace("/", ","))
# note: this also replaces the "/" separators inside the xforwardForValue string

DataFrame is saving brackets while exporting to csv

I have a Pandas DataFrame that looks like this.
DataFrame picture
I thought of saving a tuple of two values under a column and then retrieving whichever value is needed. But now, for example, if I want the first value of the tuple located in the first row of the 'Ref' column, I get "(" instead of "c0_4":
df = pd.read_csv(df_path)
print(df['Ref'][0][0])
The output for this is "(" and not "c0_4".
I don't want to use split() because I want the values to be searchable in the dataframe. For example, I would want to search for "c0_8" under the "Ref" column and get the row.
What other alternatives do I have to save two values in a row under the same column?
The immediate problem is that you're simply accessing character 0 of a string.
A file is character-oriented storage; there is no "data frame" abstraction at that level. Hence we use CSV to hold the columnar data as text, a format that allows easy writing and later recovery.
A CSV file consists only of text, with the separator character and newline having special meanings. There is no "tuple" type. Your data frame is stored as string data, so if you want to recover the original tuple form, you will need to write parsing code to convert the strings back to tuples. Alternatively, you can switch to the pickle (.pkl) format for storing your data.
That should be enough leads to allow you to research whatever alternatives you want; a sketch of both options follows.
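A minimal sketch of both ideas, assuming the 'Ref' column was written out as the string form of a tuple such as "('c0_4', 'c0_8')" (which is what str() of a tuple produces); df_path is the question's own variable:

import ast
import pandas as pd

# re-parse the stringified tuples while reading the CSV
df = pd.read_csv(df_path, converters={'Ref': ast.literal_eval})
print(df['Ref'][0][0])  # now prints the first tuple element, e.g. 'c0_4'

# or skip the text round-trip entirely and keep real tuples with pickle
df.to_pickle('data.pkl')
df = pd.read_pickle('data.pkl')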
Your data is stored as a string.
To turn it back into a tuple, split every string in your DataFrame and save it back as a tuple, with something like:
for n in df.columns:
    for m in range(len(df)):
        value = df[n][m]
        # strip the surrounding parentheses and split on the comma
        df.at[m, n] = (value.split(",")[0][1:], value.split(",")[1][:-1])

How to parse integers in JSON file with a character following them (type marker)?

I have a JSON file which contains strings of additional JSON-like data in the format below.
["{id:\"thaumcraft:celestial_notes\",Count:5b,Damage:10s}",
"{id:\"bloodmagic:ritual_stone\",Count:1b,Damage:0s}",
"{id:\"enderio:block_lava_generator\",Count:13b,Damage:0s}"]
Notice that there are type markers appended to the numbers.
How do I get Python to parse these without error (it thinks they are supposed to be strings)?
I cannot modify the JSON file, as it is 50,000 lines long and will change dynamically from user to user.
I've thought up different ways to parse the strings, but they are all inefficient or not practical and dynamic (there is another structure that looks like this in the JSON data, which I will also need to account for):
"{id:\"enderio:item_inventory_charger_basic\",Count:15b,tag:{enderio.darksteel.upgrade.energyUpgrade:{level:3,energy:5000000}},Damage:0s}",
A correct answer would end with the JSON loading correctly, i.e. with every string parsed to look like this instead:
"{id:\"enderio:block_wired_charger\",Count:13,Damage:0}"

Python/Pandas: read nested JSON

I am reading a data table from an API that returns the data in JSON format, and one of the columns is itself a JSON string. I succeed in creating a pandas DataFrame for the overall table, but in the process of reading it, the double quotes in the nested JSON string get converted to single quotes, and I can't parse the nested JSON.
I can't provide a reproducible example, but here is the key code:
import requests
import pandas as pd

myResult = requests.get(myURL, headers=myHeaders).text
myDF = pd.read_json(myResult, orient="records", dtype={"custom": str}, encoding="unicode_escape")
Here custom is the nested JSON string. Try as I might with the dtype and encoding arguments, I cannot force pandas to preserve the double quotes in the string.
So what started off as:
"custom": {"Field1":"Value1","Field2":"Value2"}
gets into the dataframe as:
{'Field1':'Value1','Field2':'Value2'}
I found another question which suggests using a custom parser for read_csv, but I can't see that this option is available for read_json.
I found a few suggestions here but the only one I could try was manually replacing the double quotes - and this causes fresh errors because there are apostrophes contained within the nested field values themselves...
The JSON strings are formatted correctly within myResult, so it's the parsing applied by read_json that's the problem. Is there any way to change that, or do I need to find some other way of reading this in?
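One way around it (a sketch, not an answer from the original thread): what looks like single quotes may just be Python's repr of a dict that read_json has already parsed, so the nested data is not necessarily lost; alternatively, parse the top-level JSON yourself and keep or expand the nested column explicitly. myURL and myHeaders are the question's own placeholders:

import json
import pandas as pd
import requests

myResult = requests.get(myURL, headers=myHeaders).text

records = json.loads(myResult)     # parse the top level once, by hand
myDF = pd.DataFrame(records)       # "custom" stays a real dict in each row

# if the nested fields should become columns of their own:
expanded = pd.json_normalize(records, sep=".")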
