I am working with CSV files where several of the columns have a simple json object (several key value pairs) while other columns are normal. Here is an example:
name,dob,stats
john smith,1/1/1980,"{""eye_color"": ""brown"", ""height"": 160, ""weight"": 76}"
dave jones,2/2/1981,"{""eye_color"": ""blue"", ""height"": 170, ""weight"": 85}"
bob roberts,3/3/1982,"{""eye_color"": ""green"", ""height"": 180, ""weight"": 94}"
After using df = pandas.read_csv('file.csv'), what's the most efficient way to parse and split the stats column into additional columns?
After about an hour, the only thing I could come up with was:
import json
stdf = df['stats'].apply(json.loads)
stlst = list(stdf)
stjson = json.dumps(stlst)
df.join(pandas.read_json(stjson))
This seems like I'm doing it wrong, and it's quite a bit of work considering I'll need to do this on three columns regularly.
The desired output is the dataframe object below. I added the following lines of code to get there in my (crappy) way:
df = df.join(pandas.read_json(stjson))
del(df['stats'])
In [14]: df
Out[14]:
name dob eye_color height weight
0 john smith 1/1/1980 brown 160 76
1 dave jones 2/2/1981 blue 170 85
2 bob roberts 3/3/1982 green 180 94
I think applying json.loads is a good idea, but from there you can simply convert it directly to dataframe columns instead of writing/loading it again:
stdf = df['stats'].apply(json.loads)
pd.DataFrame(stdf.tolist()) # or stdf.apply(pd.Series)
or alternatively in one step:
df.join(df['stats'].apply(json.loads).apply(pd.Series))
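Putting it together, a minimal end-to-end sketch (file and column names taken from the question):
import json
import pandas as pd

df = pd.read_csv('file.csv')
# parse each JSON string into a dict, expand the dicts into columns, then join back
stats = pd.DataFrame(df['stats'].apply(json.loads).tolist(), index=df.index)
df = df.drop(columns=['stats']).join(stats)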
There is a slightly easier way, but ultimately you'll have to call json.loads. There is a notion of a converter in pandas.read_csv:
converters : dict, optional
Dict of functions for converting values in certain columns. Keys can either be integers or column labels.
So first define your custom parser. In this case the below should work:
def CustomParser(data):
    import json
    j1 = json.loads(data)
    return j1
In your case you'll have something like:
df = pandas.read_csv(f1, converters={'stats': CustomParser}, header=0)
We are telling read_csv to read the data in the standard way, but for the stats column to use our custom parser. This makes the stats column a column of dicts.
From here, we can use a little hack to directly append these columns in one step with the appropriate column names. This will only work for regular data (the JSON object needs to have the same keys in every row, or else missing values need to be handled in our CustomParser).
df[sorted(df['stats'][0].keys())] = df['stats'].apply(pandas.Series)
On the Left Hand Side, we get the new column names from the keys of the element of the stats column. Each element in the stats column is a dictionary. So we are doing a bulk assign. On the Right Hand Side, we break up the 'stats' column using apply to make a data frame out of each key/value pair.
Option 1
If you dumped the column with json.dumps before you wrote it to csv, you can read it back in with:
import json
import pandas as pd
df = pd.read_csv('data/file.csv', converters={'json_column_name': json.loads})
Option 2
If you didn't, then you might need to use this (note that eval executes whatever is in the cell, so only use it on data you trust):
import json
import pandas as pd
df = pd.read_csv('data/file.csv', converters={'json_column_name': eval})
Option 3
For more complicated situations you can write a custom converter like this:
import json
import pandas as pd
def parse_column(data):
    try:
        return json.loads(data)
    except Exception as e:
        print(e)
        return None
df = pd.read_csv('data/file.csv', converters={'json_column_name': parse_column})
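Whichever option you use, the parsed column then holds dicts rather than strings. A sketch of expanding those dicts into real columns afterwards (assuming every row parsed successfully):
expanded = pd.DataFrame(df['json_column_name'].tolist(), index=df.index)
df = df.join(expanded).drop(columns=['json_column_name'])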
Paul's original answer was very nice but not correct in general, because there is no assurance that the ordering of columns is the same on the left-hand side and the right-hand side of the last line. (In fact, it does not seem to work on the test data in the question, instead erroneously switching the height and weight columns.)
We can fix this by ensuring that the list of dict keys on the LHS is sorted. This works because the apply on the RHS automatically sorts by the index, which in this case is the list of column names.
def CustomParser(data):
    import json
    j1 = json.loads(data)
    return j1

df = pandas.read_csv(f1, converters={'stats': CustomParser}, header=0)
df[sorted(df['stats'][0].keys())] = df['stats'].apply(pandas.Series)
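Alternatively, building a DataFrame from the already-parsed dicts and joining it back sidesteps the column-ordering issue entirely (a sketch, reusing the converter above):
stats = pandas.DataFrame(df['stats'].tolist(), index=df.index)
df = df.drop(columns=['stats']).join(stats)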
The json_normalize function in the pandas.io.json package helps to do this without a custom function.
(assuming you are loading the data from a file)
import json
import pandas as pd
from pandas.io.json import json_normalize
df = pd.read_csv(file_path)
stats_df = json_normalize(df['stats'].apply(json.loads).tolist())
stats_df.set_index(df.index, inplace=True)
df = df.join(stats_df)
df.drop(columns=['stats'], inplace=True)
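Note that in pandas 1.0 and later, json_normalize is also exposed at the top level, so the pandas.io.json import is unnecessary there:
stats_df = pd.json_normalize(df['stats'].apply(json.loads))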
If you have DateTime values in your .csv file, df[sorted(df['stats'][0].keys())] = df['stats'].apply(pandas.Series) will mess up the datetime values.
This link has some tips on how to read a csv file with JSON strings into a dataframe.
You could do the following to read csv file with json string column and convert your json string into columns.
Read your csv into the dataframe (read_df)
import json
read_df = pd.read_csv('yourFile.csv', converters={'state': json.loads}, header=0, quotechar="'")
Convert the json string column to a new dataframe
state_df = read_df['state'].apply(pd.Series)
Merge the two dataframes on their index.
df = pd.merge(read_df, state_df, left_index=True, right_index=True)
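Since both dataframes share the same index, join is equivalent here and a bit shorter (dropping the original 'state' column is optional):
df = read_df.join(state_df).drop(columns=['state'])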
Related
I am trying to create a list from a CSV. This CSV contains a two-dimensional table [540 rows and 8 columns], and I would like to create a list that contains the values of a specific column, column 4 to be specific.
I tried list(df.columns.values)[4]; it does return the name of the column, but I'm trying to get the values from the rows in column 4 and make them a list.
import pandas as pd
import urllib
#This is the empty list
company_name = []
#Uploading CSV file
df = pd.read_csv(r'Downloads\Dropped_Companies.csv')
#Extracting list of all companies name from column "Name of Stock"
companies_column=list(df.columns.values)[4] #This returns the name of the column.
companies_column = list(df.iloc[:,4].values)
So for this you can just add the following line after the code you've posted:
company_name = df[companies_column].tolist()
This will get the column data in the companies column as pandas Series (essentially a Series is just a fancy list) and then convert it to a regular python list.
Or, if you were to start from scratch, you can also just use these lines:
import pandas as pd
df = pd.read_csv(r'Downloads\Dropped_Companies.csv')
company_name = df[df.columns[4]].tolist()
Another option: if this is the only thing you need to do with your csv file, you can also get away with just using the csv library that comes with Python instead of installing pandas, along the lines of the sketch below.
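A minimal sketch (the file name and column position are taken from your code):
import csv

with open(r'Downloads\Dropped_Companies.csv', newline='') as f:
    reader = csv.reader(f)
    next(reader)  # skip the header row
    company_name = [row[4] for row in reader]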
If you want to learn more about how to get data out of your pandas DataFrame (the df variable in your code), you might find this blog post helpful.
I think that you can try this to get all the values of a specific column:
companies_column = df['column name']
Replace 'column name' with the name of the column whose values you want to access.
I have a Pandas dataframe in which one column contains JSON data (the JSON structure is simple: only one level, there is no nested data):
ID,Date,attributes
9001,2020-07-01T00:00:06Z,"{"State":"FL","Source":"Android","Request":"0.001"}"
9002,2020-07-01T00:00:33Z,"{"State":"NY","Source":"Android","Request":"0.001"}"
9003,2020-07-01T00:07:19Z,"{"State":"FL","Source":"ios","Request":"0.001"}"
9004,2020-07-01T00:11:30Z,"{"State":"NY","Source":"windows","Request":"0.001"}"
9005,2020-07-01T00:15:23Z,"{"State":"FL","Source":"ios","Request":"0.001"}"
I would like to normalize the JSON content in the attributes column so the JSON attributes become each a column in the dataframe.
ID,Date,attributes.State, attributes.Source, attributes.Request
9001,2020-07-01T00:00:06Z,FL,Android,0.001
9002,2020-07-01T00:00:33Z,NY,Android,0.001
9003,2020-07-01T00:07:19Z,FL,ios,0.001
9004,2020-07-01T00:11:30Z,NY,windows,0.001
9005,2020-07-01T00:15:23Z,FL,ios,0.001
I have been trying to use pandas json_normalize, which requires a dictionary. So I figured I would convert the attributes column to a dictionary, but it does not quite work out as expected, for the dictionary has the form:
df.attributes.to_dict()
{0: '{"State":"FL","Source":"Android","Request":"0.001"}',
1: '{"State":"NY","Source":"Android","Request":"0.001"}',
2: '{"State":"FL","Source":"ios","Request":"0.001"}',
3: '{"State":"NY","Source":"windows","Request":"0.001"}',
4: '{"State":"FL","Source":"ios","Request":"0.001"}'}
And the normalization takes the key (0, 1, 2, ...) as the column name instead of the JSON keys.
I have the feeling that I am close but I can't quite work out how to do this exactly. Any idea is welcome.
Thank you!
Normalize expects to work on an object, not a string.
import json
import pandas as pd
df_final = pd.json_normalize(df.attributes.apply(json.loads))
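If you also want the prefixed column names from the desired output, with the other columns kept alongside, a sketch:
attrs = pd.json_normalize(df['attributes'].apply(json.loads)).add_prefix('attributes.')
attrs.index = df.index  # keep row alignment for the join
df_final = df.drop(columns=['attributes']).join(attrs)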
You shouldn't need to convert to a dictionary first.
Try:
import pandas as pd
pd.json_normalize(df['attributes'])
I found a solution but I am not overly happy with it. I reckon it is very inefficient.
import pandas as pd
import json
# Import full dataframe
df = pd.read_csv(r'D:/tmp/sample_simple.csv', parse_dates=['Date'])
# Create empty dataframe to hold the results of data conversion
df_attributes = pd.DataFrame()
# Loop through the data to fill the dataframe
for index in df.index:
    row_json = json.loads(df.attributes[index])
    normalized_row = pd.json_normalize(row_json)
    # df_attributes = df_attributes.append(normalized_row) is deprecated; use concat instead
    df_attributes = pd.concat([df_attributes, normalized_row], ignore_index=True)
# Reset the index of the attributes dataframe
df_attributes = df_attributes.reset_index(drop=True)
# Drop the original attributes column
df = df.drop(columns=['attributes'])
# Join the results
df_final = df.join(df_attributes)
# Show results
print(df_final)
print(df_final.info())
Which gives me the expected result. However, as I said, there are several inefficiencies in it. For starters, growing the dataframe inside the for loop. According to the documentation the best practice is to build a list and concatenate once, but I could not figure out how to do that while keeping the shape I wanted. I welcome all critique and ideas.
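For what it's worth, a sketch of that build-a-list-first pattern, which avoids growing the dataframe inside the loop:
rows = [json.loads(s) for s in df['attributes']]  # one parsed dict per row
df_attributes = pd.DataFrame(rows, index=df.index)
df_final = df.drop(columns=['attributes']).join(df_attributes)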
I am importing study data into a Pandas data frame using read_csv.
My subject codes are 6 numbers coding, among others, the day of birth. For some of my subjects this results in a code with a leading zero (e.g. "010816").
When I import into Pandas, the leading zero is stripped off and the column is formatted as int64.
Is there a way to import this column unchanged maybe as a string?
I tried using a custom converter for the column, but it does not work - it seems as if the custom conversion takes place before Pandas converts to int.
As indicated in this answer by Lev Landau, there is a simple solution: use the converters option for the column in the read_csv function.
converters={'column_name': str}
Let's say I have csv file projects.csv like below:
project_name,project_id
Some Project,000245
Another Project,000478
For example, the code below trims the leading zeros:
from pandas import read_csv
dataframe = read_csv('projects.csv')
print(dataframe)
Result:
project_name project_id
0 Some Project 245
1 Another Project 478
Solution code example:
from pandas import read_csv
dataframe = read_csv('projects.csv', converters={'project_id': str})
print(dataframe)
Required result:
project_name project_id
0 Some Project 000245
1 Another Project 000478
To have all columns as str:
pd.read_csv('sample.csv', dtype=str)
To have certain columns as str:
# column names which need to be string
lst_str_cols = ['prefix', 'serial']
dict_dtypes = {x: 'str' for x in lst_str_cols}
pd.read_csv('sample.csv', dtype=dict_dtypes)
Here is a shorter, robust and fully working solution:
Simply define a mapping (dictionary) between variable names and the desired data types:
dtype_dic = {'subject_id': str, 'subject_number': float}
Use that mapping with pd.read_csv():
df = pd.read_csv(yourdata, dtype=dtype_dic)
et voila!
If you have a lot of columns and you don't know which ones contain leading zeros that might be missed, or you might just need to automate your code, you can do the following:
df = pd.read_csv("your_file.csv", nrows=1) # Just take the first row to extract the columns' names
col_str_dic = {column:str for column in list(df)}
df = pd.read_csv("your_file.csv", dtype=col_str_dic) # Now you can read the compete file
You could also do:
df = pd.read_csv("your_file.csv", dtype=str)
By doing this you will have all your columns as strings and you won't lose any leading zeros.
You can do this; it works on all versions of pandas:
pd.read_csv('filename.csv', dtype={'zero_column_name': object})
You can use converters to pad the number to a fixed width if you know the width.
For example, if the width is 5, then
data = pd.read_csv('text.csv', converters={'column1': lambda x: f"{int(x):05d}"})
(read_csv passes each raw field to the converter as a string, hence the int() conversion before zero-padding.)
This will do the trick. It works for pandas==0.23.0 and also for read_excel.
Python 3.6 or higher is required (for the f-string).
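If the raw field is already a string, str.zfill does the same zero-padding without the round-trip through int:
data = pd.read_csv('text.csv', converters={'column1': lambda x: x.zfill(5)})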
I don't think you can specify a column type the way you want (if there haven't been changes recently, and if the 6-digit number is not a date that you can convert to datetime). You could try using np.genfromtxt() and create the DataFrame from there.
EDIT: Take a look at Wes McKinney's blog; there might be something for you. It seems that a new parser is coming in pandas 0.10 in November.
As an example, consider the following my_data.txt file:
id,A
03,5
04,6
To preserve the leading zeros for the id column:
df = pd.read_csv("my_data.txt", dtype={"id":"string"})
df
id A
0 03 5
1 04 6
I need to account for folks entering data into a spreadsheet completely wrong. I cannot control their behavior because I'm scraping it from another website. However, there is some truly bad data entry, such as the following values for "Tons" of cargo:
1,19,55
1,18,62
Lovely, right? I need to figure out a way to read numbers like that into pandas without pandas auto-casting them to dates, after which point it's impossible to convert them back to 11955 and 11862. To add a cherry on top, the following won't work:
dfx = pd.read_excel(ii,header=None,dtype={'Tons': str})
because often the data has no column headers and I'm inferring the header from the order of the data, which thankfully doesn't change. So how to get pandas to be agreeable here?
Once I read in the data, even if I then change the entire column to unicode or string, it'll just be a unicode or string representation of the date:
2055-01-19 00:00:00
2062-01-18 00:00:00
So I need to read it in either "raw" (not sure what that means) as 1,19,55 without pandas trying to guess at the type, or just somehow as a number ignoring the commas...
Thanks!
You can create a converter for the column Tons to format the data as you want, as the pd.read_excel documentation explains:
converters : dict, default None
Dict of functions for converting values in certain columns. Keys can either be integers or column labels, values are functions that take one input argument, the Excel cell content, and return the transformed content.
For example, you can use the following converter:
tons_converter = lambda x: int("".join(x.split(',')))
dfx = pd.read_excel(ii, header=None, converters={0: tons_converter})
Reproducible example
Here's an example creating a csv file on the fly and applying the conversion.
from io import StringIO
import pandas as pd
data = """
1,125,125
10,578,589
12
"""
tons_converter = lambda x: int("".join(x.split(',')))
dfx = pd.read_csv(StringIO(data), header=None, sep="|", converters={0: tons_converter})
print(dfx.head())
The output is what you want:
0
0 1125125
1 10578589
2 12
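As an aside, read_csv also accepts a thousands parameter that strips separators during numeric parsing; a sketch against the same data (worth verifying on malformed groupings like 1,19,55 before relying on it):
dfx = pd.read_csv(StringIO(data), header=None, sep="|", thousands=",")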
I'm a newbie for programming and python, so I would appreciate your advice!
I have a dataframe like this.
In the 'info' column, there are 7 different categories: activities, locations, groups, skills, sights, types and other, and each category has unique values within [ ] (i.e., "activities":["Tour"]).
I would like to split 'info' column into 7 different columns based on each category as shown below.
I would like to allocate appropriate column names and put the corresponding values within [ ] into each row.
Is there any easy way to split dataframe like that?
I was thinking of using the str.split function to split it into pieces and merge everything later, but I'm not sure that is the best way to go, and I wanted to see if there is a more sophisticated way to make a dataframe like this.
Any advice is appreciated!
--UPDATE--
When I print(dframe['info']), each row shows as a JSON-formatted string.
It looks like the content of the info column is JSON-formatted, so you can parse that into a dict object easily:
>>> import json
>>> s = '''{"activities": ["Tour"], "locations": ["Tokyo"], "groups": []}'''
>>> j = json.loads(s)
>>> j
{u'activities': [u'Tour'], u'locations': [u'Tokyo'], u'groups': []}
Once you have the data as a dict, you can do whatever you like with it.
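Applied to the whole column, that might look like this (a sketch, assuming every row holds valid JSON):
import json
import pandas as pd

info = df['info'].apply(json.loads)  # Series of dicts
df = df.drop(columns=['info']).join(pd.DataFrame(info.tolist(), index=df.index))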
OK, here is how to do it:
import pandas as pd
import ast
#Initial Dataframe is df
mylist = list(df['info'])
mynewlist = []
for l in mylist:
    mynewlist.append(ast.literal_eval(l))
df_info = pd.DataFrame(mynewlist)
#Add columns of decoded info to the initial dataset
df_new = pd.concat([df,df_info],axis=1)
#Remove the column info
del df_new['info']
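The loop can also be collapsed into a single apply, which is a bit more compact:
df_info = pd.DataFrame(df['info'].apply(ast.literal_eval).tolist())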
You can use the json library to do that.
1) Import the json library:
import json
2) Turn all the rows of that column into strings, then apply the json.loads function to all of them and store the result in an object:
jsonO = df['info'].map(str).apply(json.loads)
3) The result is a Series of parsed JSON objects that you can navigate. For each field you want, create a column in your final dataframe:
df['activities'] = jsonO.apply(lambda x: x['activities'])
Here, for each row, the value of that field is dumped into the new column of your final dataframe df.
4) Repeat step 3 for all the columns you're interested in.
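For example, looping over the seven categories from the question (key names assumed to be lowercase, as shown there; .get tolerates rows where a key is missing):
for col in ['activities', 'locations', 'groups', 'skills', 'sights', 'types', 'other']:
    df[col] = jsonO.apply(lambda x, c=col: x.get(c))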