Leading zero issues with pandas read_csv function - python

I have a column of values such as this:
123, 234, 345, 456, 567
When I do
pd.read_csv('data.csv', dtype={'column': str})
or
pd.read_csv('data.csv', dtype={'column': object})
they both produce values like
00123, 00234, 00345, 00456, 00567.
I was searching through Stack Exchange, and people say that you should use dtype: object, but it doesn't work for me.

If you want to read in your data as integers, drop the dtype:
df = pd.read_csv('data.csv')
If you want to convert the fields to strings, you can apply a str transformation with df.astype:
df = pd.read_csv('data.csv').astype(str)
Another option would be to use a converter:
df = pd.read_csv('data.csv', converters={'ColName': str})
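The difference between the last two options matters for leading zeros. A minimal sketch (assuming a hypothetical file whose ColName column holds zero-padded values): .astype(str) runs after the values have already been parsed as integers, so any padding is gone, while a converter is applied to the raw field and preserves it:

import io
import pandas as pd

# hypothetical file contents with zero-padded values
csv = io.StringIO("ColName\n00123\n00234\n")

# parsed as integers first, then stringified: the padding is lost
print(pd.read_csv(csv).astype(str)['ColName'].tolist())  # ['123', '234']

csv.seek(0)
# the converter is applied to the raw text, so the padding survives
print(pd.read_csv(csv, converters={'ColName': str})['ColName'].tolist())  # ['00123', '00234']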

Related

PySpark: Create a subset of a dataframe for all dates

I have a DataFrame that has a lot of columns and I need to create a subset of that DataFrame that has only date values.
For example, my DataFrame could be:
1, 'John Smith', '12/10/1982', '123 Main St', '01/01/2000'
2, 'Jane Smith', '11/21/1999', 'Abc St', '12/12/2020'
And my new DataFrame should only have:
'12/10/1982', '01/01/2000'
'11/21/1999', '12/12/2020'
The dates could be in any format and could be in any column. I can use dateutil.parser to parse them to make sure they are dates, but I'm not sure how to call parse() on all the columns and filter only the ones that parse successfully into another DataFrame.
If you know which columns the datetimes are in, it's easy:
df2 = df[["col_name_1", "col_name_2"]]
# or
df2 = df.iloc[:, [2, 4]]
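If the columns are already parsed as datetimes but you don't know their names up front, pandas can also select them by dtype. A one-line sketch (note that select_dtypes won't catch dates stored as plain strings):

df2 = df.select_dtypes(include=['datetime'])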
You can find your columns' datatype by checking each tuple in your_dataframe.dtypes.
from datetime import datetime

schema = "id int, name string, date timestamp, date2 timestamp"
df = spark.createDataFrame([(1, "John", datetime.now(), datetime.today())], schema)

list_of_columns = []
for (field_name, data_type) in df.dtypes:
    if data_type == "timestamp":
        list_of_columns.append(field_name)
Now you can use this list inside .select():
df_subset_only_timestamps = df.select(list_of_columns)
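For what it's worth, the same selection can be written more compactly as a list comprehension over df.dtypes:

timestamp_cols = [name for name, dtype in df.dtypes if dtype == "timestamp"]
df_subset_only_timestamps = df.select(timestamp_cols)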
EDIT: I realized your date columns might be StringType.
You could try something like (note the extra imports):
from pyspark.sql.functions import col, when

df_subset_only_timestamps = df.select([when(col(column).like("%/%/%"), col(column)).alias(column) for column in df.columns]).na.drop()
Inspired by this answer. Let me know if it works!

Pandas Dataframe from list nested in json

I have a request that gets me some data that looks like this:
[{'__rowType': 'META',
'__type': 'units',
'data': [{'name': 'units.unit', 'type': 'STRING'},
{'name': 'units.classification', 'type': 'STRING'}]},
{'__rowType': 'DATA', '__type': 'units', 'data': ['A', 'Energie']},
{'__rowType': 'DATA', '__type': 'units', 'data': ['bar', ' ']},
{'__rowType': 'DATA', '__type': 'units', 'data': ['CCM', 'Volumen']},
{'__rowType': 'DATA', '__type': 'units', 'data': ['CDM', 'Volumen']}]
and would like to construct a (pandas) DataFrame that looks like this:
  unit classification
1    A        Energie
2  bar
3  CCM        Volumen
4  CDM        Volumen
Things like pd.DataFrame(pd.json_normalize(test)['data']) are close but still throw the whole list into the column instead of making separate columns. record_path sounded right, but I can't get it to work correctly either.
Any help?
It's difficult to know how the example generalizes, but for this particular case you could use:
pd.DataFrame([d['data'] for d in test
              if d.get('__rowType', None) == 'DATA' and 'data' in d],
             columns=['unit', 'classification'])
NB: this assumes test is the input list.
output:
  unit classification
0    A        Energie
1  bar
2  CCM        Volumen
3  CDM        Volumen
Instead of just giving you the code, I'll first explain how to do this in detail and then show you the exact steps to follow and the final code. This way you'll understand everything for any further situation.
When you want to create a pandas DataFrame with two columns, you can do so by creating a dictionary and passing it to the DataFrame class:
my_data = {'col1': [1, 2], 'col2': [3, 4]}
df = pd.DataFrame(data=my_data)
This will result in this dataframe:
   col1  col2
0     1     3
1     2     4
So if you want to have the DataFrame you specified in your question, the my_data dictionary should look like this:
import numpy as np

my_data = {
    'unit': ['A', 'bar', 'CCM', 'CDM'],
    'classification': ['Energie', '', 'Volumen', 'Volumen'],
}
df = pd.DataFrame(data=my_data)
df.index = np.arange(1, len(df) + 1)
df
(Note the df.index = ... part. It's there because the index of the desired DataFrame in your question starts at 1.)
So to get there, you just have to extract these data from the input you provided and convert them into the exact dictionary mentioned above (the my_data dictionary).
To do so you can do this:
# This will get the data values like 'bar', 'CCM', etc. from your initial data
values = [x['data'] for x in d if x['__rowType'] == 'DATA']
# This gets the column names from the META row
meta = list(filter(lambda x: x['__rowType'] == 'META', d))[0]
columns = [x['name'].split('.')[-1] for x in meta['data']]
# This line creates the exact dictionary we need to send to the DataFrame class.
my_data = {column: [v[i] for v in values] for i, column in enumerate(columns)}
So the whole code would be this:
import numpy as np
import pandas as pd

d = YOUR_DATA
# This will get the data values like 'bar', 'CCM', etc.
values = [x['data'] for x in d if x['__rowType'] == 'DATA']
# This gets the column names from the META row
meta = list(filter(lambda x: x['__rowType'] == 'META', d))[0]
columns = [x['name'].split('.')[-1] for x in meta['data']]
# This line creates the exact dictionary we need to send to the DataFrame class.
my_data = {column: [v[i] for v in values] for i, column in enumerate(columns)}
df = pd.DataFrame(data=my_data)
df.index = np.arange(1, len(df) + 1)
df  # or print(df)
Note: Of course you could do all of this in one complex expression, but to avoid confusion I decided to split it over a couple of lines.
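For reference, a sketch of what that single dense expression could look like (same logic and same assumptions about d as above):

df = pd.DataFrame(
    [row['data'] for row in d if row['__rowType'] == 'DATA'],
    columns=[f['name'].split('.')[-1]
             for row in d if row['__rowType'] == 'META'
             for f in row['data']],
    index=range(1, 1 + sum(row['__rowType'] == 'DATA' for row in d)),
)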

Convert JSON data in data frame Python [duplicate]

This question already has answers here:
Split / Explode a column of dictionaries into separate columns with pandas
(13 answers)
Closed 2 years ago.
I am a beginner at programming, so your help and support would be appreciated.
Here is the DataFrame; one column's data is JSON-like:
ID, Name, Information
1234, xxxx, '{'age': 25, 'gender': 'male'}'
2234, yyyy, '{'age': 34, 'gender': 'female'}'
3234, zzzz, '{'age': 55, 'gender': 'male'}'
I would like to convert this DataFrame as below.
ID, Name, age, gender
1234, xxxx, 25, male
2234, yyyy, 34, female
3234, zzzz, 55, male
I found that ast.literal_eval() can convert str to dict type, but I have no idea how to write the code for this.
Would you please give an example of code that can solve this issue?
Given test.csv
ID,Name,Information
1234,xxxx,"{'age': 25, 'gender': 'male'}"
2234,yyyy,"{'age': 34, 'gender': 'female'}"
3234,zzzz,"{'age': 55, 'gender': 'male'}"
Read the file in with pd.read_csv and use the converters parameter with ast.literal_eval, which will convert the data in the Information column from a str type to dict type.
Use pd.json_normalize to unpack the dict with keys as column headers and values in the rows
.join the normalized columns with df
.drop the Information column
import pandas as pd
from ast import literal_eval
df = pd.read_csv('test.csv', converters={'Information': literal_eval})
df = df.join(pd.json_normalize(df.Information))
df.drop(columns=['Information'], inplace=True)
# display(df)
     ID  Name  age  gender
0  1234  xxxx   25    male
1  2234  yyyy   34  female
2  3234  zzzz   55    male
If the data is not from a csv file
import pandas as pd
from ast import literal_eval
data = {'ID': [1234, 2234, 3234],
        'Name': ['xxxx', 'yyyy', 'zzzz'],
        'Information': ["{'age': 25, 'gender': 'male'}", "{'age': 34, 'gender': 'female'}", "{'age': 55, 'gender': 'male'}"]}
df = pd.DataFrame(data)
# apply literal_eval to Information
df.Information = df.Information.apply(literal_eval)
# normalize the Information column and join to df
df = df.join(pd.json_normalize(df.Information))
# drop the Information column
df.drop(columns=['Information'], inplace=True)
If the third column were a JSON string, ' would not be valid; it should be ", so we need to fix this.
If the third column is a string representation of a Python dict, you can use eval to convert it.
A sample of code to split the third column of dict type and merge it into the original DataFrame:
import json

import pandas as pd

data = [
    [1234, 'xxxx', "{'age': 25, 'gender': 'male'}"],
    [2234, 'yyyy', "{'age': 34, 'gender': 'female'}"],
    [3234, 'zzzz', "{'age': 55, 'gender': 'male'}"],
]
df = pd.DataFrame(data)
df[2] = df[2].apply(lambda x: json.loads(x.replace("'", '"')))  # fix the quotes and convert to dict
merged = pd.concat([df[[0, 1]], df[2].apply(pd.Series)], axis=1)
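As an aside, ast.literal_eval (used in the previous answer) parses these single-quoted dicts directly, without the quote replacement, which would break if any value contained an apostrophe:

from ast import literal_eval

df[2] = df[2].apply(literal_eval)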

Map two dataframes and perform sum operation using a dictionary

I have a dataframe df
df
   Object        Action  Cost1  Cost2
0     123      renovate  10000   2000
1     456  do something      0     10
2     789        review   1000     50
and a dictionary (called dictionary)
dictionary
{'Object_new': ['Object'],
'Action_new': ['Action'],
'Total_Cost': ['Cost1', 'Cost2']}
Further, I have an (initially empty) DataFrame df_new that should contain almost the same information as df, except that the column names need to be different (named according to the dictionary) and some columns from df should be consolidated (e.g. via a sum operation) based on the dictionary.
The result should look like this:
df_new
   Object_new    Action_new  Total_Cost
0         123      renovate       12000
1         456  do something          10
2         789        review        1050
How can I achieve this result using only the dictionary? I tried to use the .map() function but could not figure out how to perform the sum-operation with it.
The code to reproduce both dataframes and the dictionary is attached:
# import libraries
import pandas as pd

### create df
data_df = {'Object': [123, 456, 789],
           'Action': ['renovate', 'do something', 'review'],
           'Cost1': [10000, 0, 1000],
           'Cost2': [2000, 10, 50],
           }
df = pd.DataFrame(data_df)

### create dictionary
dictionary = {'Object_new': ['Object'],
              'Action_new': ['Action'],
              'Total_Cost': ['Cost1', 'Cost2']}

### create df_new
# data_df_new = pd.DataFrame(columns=['Object_new', 'Action_new', 'Total_Cost'])
data_df_new = {'Object_new': [123, 456, 789],
               'Action_new': ['renovate', 'do something', 'review'],
               'Total_Cost': [12000, 10, 1050],
               }
df_new = pd.DataFrame(data_df_new)
A play with groupby:
inv_dict = {x: k for k, v in dictionary.items() for x in v}
df_new = df.groupby(df.columns.map(inv_dict), axis=1).sum()
Output:
     Action_new  Object_new  Total_Cost
0      renovate         123       12000
1  do something         456          10
2        review         789        1050
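One caveat: DataFrame.groupby(..., axis=1) is deprecated in recent pandas releases (2.1+, as far as I recall). Under that assumption, an equivalent is to transpose, group on the mapped index, and transpose back:

df_new = df.T.groupby(df.columns.map(inv_dict)).sum().T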
Given the complexity of your algorithm, I would suggest performing a Series addition operation to solve this problem.
Why? In Pandas, every column in a DataFrame works as a Series under the hood.
data_df_new = {
    'Object_new': df['Object'],
    'Action_new': df['Action'],
    'Total_Cost': df['Cost1'] + df['Cost2'],  # addition of two Series
}
df_new = pd.DataFrame(data_df_new)
Running this code builds each new column directly from the corresponding Series of df, with the two cost columns summed element-wise into Total_Cost.
You can use an empty DataFrame, copy the new columns into it, and use to_dict to convert it to a dictionary.
import pandas as pd

data_df = {'Object': [123, 456, 789],
           'Action': ['renovate', 'do something', 'review'],
           'Cost1': [10000, 0, 1000],
           'Cost2': [2000, 10, 50],
           }
df = pd.DataFrame(data_df)
print(df)

MyEmptydf = pd.DataFrame()
MyEmptydf['Object_new'] = df['Object']
MyEmptydf['Action_new'] = df['Action']
MyEmptydf['Total_Cost'] = df['Cost1'] + df['Cost2']
print(MyEmptydf)

dictionary = MyEmptydf.to_dict(orient="index")
print(dictionary)
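For reference, the to_dict(orient="index") call at the end produces a row-keyed dictionary like this:

{0: {'Object_new': 123, 'Action_new': 'renovate', 'Total_Cost': 12000},
 1: {'Object_new': 456, 'Action_new': 'do something', 'Total_Cost': 10},
 2: {'Object_new': 789, 'Action_new': 'review', 'Total_Cost': 1050}}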
you can run the code here:https://repl.it/repls/RealisticVillainousGlueware
If you are trying to avoid pandas entirely and only use the dictionary, this should solve it:
Object = []
totalcost = []
action = []
for i in range(0, 3):
    Object.append(data_df['Object'][i])
    totalcost.append(data_df['Cost1'][i] + data_df['Cost2'][i])
    action.append(data_df['Action'][i])
dict2 = {'Object': Object, 'Action': action, 'TotalCost': totalcost}
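Should you later want a DataFrame after all, the resulting dictionary plugs straight into the constructor:

df_new = pd.DataFrame(dict2)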

Manipulate data in dictionary-column from TSV

I have a TSV file where one of the columns are a dictionary-format type.
Example of headers and one row (notice the string-quotes in Preferences-column)
Name, Age, Preferences
Nick, 18, "[{"Hobby":"Football", "Food":"Pizza", "FavoriteNumber":"72"}]"
To read the file into python:
df = pd.read_csv('search_data_assessment.tsv',delimiter='\t')
To strip the string quoting at the beginning and end of "Preferences", I used ast.literal_eval:
df["Preferences"] = ast.literal_eval(df["Preferences"])
This raises "ValueError: malformed node or string: 0", but it seems to do the trick.
The question: How can I check all rows and look for "FavoriteNumber" in Preferences, and if it == 72, change it to 100 (arbitrary example)?
You can use pd.Series.apply with a custom function. Just note this is bordering on abuse of Pandas. Pandas isn't designed to hold lists of dictionaries in series. Here, you are running a loop in a particularly inefficient way.
from ast import literal_eval

import pandas as pd

df = pd.DataFrame([['Nick', 18, '[{"Hobby":"Football", "Food":"Pizza", "FavoriteNumber":"72"}]']],
                  columns=['Name', 'Age', 'Preferences'])

def updater(x):
    if x[0]['FavoriteNumber'] == '72':
        x[0]['FavoriteNumber'] = '100'
    return x

df['Preferences'] = df['Preferences'].apply(literal_eval)
df['Preferences'] = df['Preferences'].apply(updater)
print(df['Preferences'].iloc[0])
[{'Hobby': 'Football', 'Food': 'Pizza', 'FavoriteNumber': '100'}]
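If you then need to persist the change, a sketch of writing the frame back out as TSV (the filename is hypothetical; the dict column is serialized via its repr again):

df.to_csv('search_data_assessment_updated.tsv', sep='\t', index=False)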
