Unexpected character when writing to Excel using Pandas - python

I have a dictionary like this:
film = {
    'ID': [],
    'Name': [],
    'Run Time': [],
    'Genre': [],
    'link': [],
    'name 2': []
}
Then I populate it in a for loop, like this:
film['ID'].append(film_id)
film['Name'].append(film_name)
film['Run Time'].append(film_runtime)
film['Genre'].append(film_genre)
film['link'].append(film_link)
film['name 2'].append(film_name2)
Then I convert the dictionary to a Pandas DataFrame so that I can write it to an .xlsx file. Before I actually write it, I print it to check the values of the Run Time column, and everything is OK:
output_df = pd.DataFrame(film).set_index('ID')
print(output_df['Run Time'])
output:
ID
102 131
103 60
104
105
Name: Run Time, dtype: object
But then, when I write it, like this:
writer = ExcelWriter('output.xlsx')
output_df.to_excel(writer, 'فیلم')
writer.save()
In the resulting file there is an extra ' (single quote) character before each Run Time value. The character is not visible in the cell, but I can highlight it, and if I remove it, the number renders right-to-left (RTL).
So I thought the invisible character was the LEFT-TO-RIGHT MARK (\u200E). I tried removing it like this:
film['Run Time'].append(film_runtime.replace('\u200E', ''))
But nothing happened, and the character is still there.
How can I fix this?

You need to make sure that cells that should contain numbers are converted to numbers (typically ints) before writing to the .xlsx file.
In your case just:
film['Run Time'].append(int(film_runtime))
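Note that if some runtimes can be empty, as rows 104 and 105 of the printed output suggest, int('') raises a ValueError, so you may need a guard such as int(film_runtime) if film_runtime else None (one possible choice, depending on what an empty runtime should mean).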

The ' before a value in Excel forces the value to be treated as a string. It looks like the Excel writer is interpreting the list as a string array.
Changing the type in the DataFrame should solve it.
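For example, a minimal sketch using pd.to_numeric, assuming the empty strings seen in rows 104 and 105 should become blank cells rather than raise errors:
import pandas as pd

# coerce the text column to numbers; errors='coerce' turns
# non-numeric entries such as '' into NaN (blank in Excel)
output_df['Run Time'] = pd.to_numeric(output_df['Run Time'], errors='coerce')

with pd.ExcelWriter('output.xlsx') as writer:
    output_df.to_excel(writer, sheet_name='فیلم')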

Related

JSON Fields with Pandas DataFrame

I'm using Python (Google Colab) and I have a JSON dataframe with some fields like:
[{'ActedBy': ['team'], 'ActedAt': '2022-03-07T22:43:46Z', 'Status': 'Completed', 'LAB': 'No'}]
I need to get the "ActedAt" field in order to get the date. How can I do this?
Thanks!
You have an array of dictionaries. First, grab a dictionary from the array by index, then proceed to get the ActedAt property. Something like this:
json = [{'ActedBy': ['team'], 'ActedAt': '2022-03-07T22:43:46Z', 'Status': 'Completed', 'LAB': 'No'}]
# put the index in a variable for readability
index = 0
# get the date you want
date = json[index]['ActedAt']
print(date)
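If you then need just the date part, one option (a sketch; the format string assumes the timestamps always look like the sample above) is to parse the string with datetime:
from datetime import datetime

parsed = datetime.strptime(date, '%Y-%m-%dT%H:%M:%SZ')
print(parsed.date())  # 2022-03-07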

python convert text rows to dictionary based on conditional match

I have the below string and need help writing an if condition in a for loop that checks if row.startswith('name') and, if so, takes the value and stores it in a variable called name, and similarly for dob:
'name john\n \n\nDOB\n12/08/1984\n\ncurrent company\ngoogle\n'
Once the for loop completes, the output should be a dictionary as below, which I can convert to a pandas dataframe.
This is what I have tried so far, but I do not know how to get the values into a dictionary:
for row in lines.split('\n'):
    if row.startswith('name'):
        name = row.split()[-1]
Final Output
data = {"name":"john", "dob": "12/08/1984"}
Try using a list comprehension and split:
s = '''name
john
dob
12/08/1984
current company
google'''
d = dict([i.splitlines() for i in s.split('\n\n')])
print(d)
Output:
{'name': 'john', 'dob': '12/08/1984', 'current company': 'google'}
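This assumes each label and its value sit on separate lines with blank lines between records, which differs slightly from the original string (where 'name john' shares one line). If you would rather keep the startswith loop from the question, a minimal sketch handling the original string might look like:
lines = 'name john\n \n\nDOB\n12/08/1984\n\ncurrent company\ngoogle\n'
data = {}
rows = [r.strip() for r in lines.split('\n') if r.strip()]
for i, row in enumerate(rows):
    if row.lower().startswith('name'):
        # 'name john' keeps the label and value on one line
        data['name'] = row.split()[-1]
    elif row.lower().startswith('dob'):
        # the value follows on the next row
        data['dob'] = rows[i + 1]
print(data)  # {'name': 'john', 'dob': '12/08/1984'}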

String to dataframe

I am developing a QR code scanner that exports to CSV. I am very new to Python. I am getting this output after scanning a QR code: 'Employee ID: 101\nEmployee Name: Abhinav Jha\nDesignation: Student\nDepartment: Mechanical'.
Can anyone help me convert it to CSV? Thanks in advance.
Well, you can convert this incoming string to a Python dictionary and then easily write the values to a CSV file.
s = 'Employee ID: 101\nEmployee Name: Abhinav Jha\nDesignation: Student\nDepartment: Mechanical'
dict_1 = {(item.split(':')[0]).strip(): (item.split(':')[1]).strip() for item in s.split('\n')}
print(dict_1)
output -
{'Employee ID': '101',
'Employee Name': 'Abhinav Jha',
'Designation': 'Student',
'Department': 'Mechanical'}
Now, if you want to access the values, use the .values() method -
print(dict_1.values())  # dict_values(['101', 'Abhinav Jha', 'Student', 'Mechanical'])
I'm assuming you want to write these values to a CSV file.
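For instance, a minimal sketch that writes the parsed dictionary as one CSV row (the file name scans.csv is just an example):
import csv

# the header comes from the parsed keys; call writerow again
# for each additional scanned code
with open('scans.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=dict_1.keys())
    writer.writeheader()
    writer.writerow(dict_1)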

Google Sheets API and Pandas. Inconsistent data length from the API

I'm using the google sheets API to get data which I then pass to Pandas so I can easily work with the data.
Let's say I want to get a sheet with the following data (depicted as a JSON object, since tables don't render well here):
{
    columns: ['Name', 'Age', 'Tlf.', 'Address'],
    data: ['Julie', '35', '12345', '8 Leafy Street']
}
The sheets API will return something along the lines of this:
{
    'range': 'Cases!A1:AE999',
    'majorDimension': 'ROWS',
    'values': [
        ['Name', 'Age', 'Tlf.', 'Address'],
        ['Julie', '35', '12345', '8 Leafy Street']
    ]
}
This is great and allows me to easily pass the column headings and data to Pandas without much fuss. I do this in the following manner:
values = sheets_api_result["values"]
df = pd.DataFrame(values[1:], columns=values[0])
My Problem
If I have a Google Sheet that looks like the table below, again depicted as key:value data:
{
    columns: ['Name', 'Age', 'Tlf.', 'Address'],
    data: ['Julie', '35', '', '']
}
I will receive the following response
{
    'range': 'Cases!A1:AE999',
    'majorDimension': 'ROWS',
    'values': [
        ['Name', 'Age', 'Tlf.', 'Address'],
        ['Julie', '35']
    ]
}
Note that the lengths of the two arrays are unequal, and that instead of None or null values being returned, the missing data is simply absent from the response.
When working with this data in my code, I end up with an error that looks like this
ValueError: 4 columns passed, passed data had 2 columns
So as far as I can tell I have two options:
Come up with a clever way to pad my response where necessary with None
If possible, instruct the API to return a null value in the JSON where null values exist, especially when the last column(s) have no data at all.
With regard to point 1, I think I can append x None values to the list, where x equals length_of_column_heading_array - length_of_data_array. This does, however, seem ugly, and perhaps there is a more elegant way of doing it.
And with regard to point 2, I haven't managed to find an answer that helps me.
If anyone has any ideas on how I can solve this, I'd be very grateful.
Cheers!
If anyone is interested, here is how I solved the issue.
First, we need to get all the data from the Sheets API.
# define the names of the tabs I want to get
ranges = ['tab1', 'tab2']
# Call the Sheets API
request = service.spreadsheets().values().batchGet(spreadsheetId=document, ranges=ranges,)
response = request.execute()
Now I want to go through the returned values and ensure that each row's list contains the same number of elements as the first row, which contains the column headings.
# response is the response from google sheets API,
# and from the code above. It contains column headings
# and data from every row.
# valueRanges is the key to access the data.
def extract_case_data(response, keyword):
    for obj in response["valueRanges"]:
        if keyword in obj["range"]:
            values = pad_data(obj["values"])
            df = pd.DataFrame(values[1:], columns=values[0])
            return df
    return None
And finally, the method to pad the data
def pad_data(data: list):
    # build a new list starting with the column headings;
    # this is the list which we will return
    return_data = [data[0]]
    for row in data[1:]:
        difference = len(data[0]) - len(row)
        new_row = row
        # append None to the rows which are shorter
        # than the column heading list
        for count in range(1, difference + 1):
            new_row.append(None)
        return_data.append(new_row)
    return return_data
I'm certainly not saying that this is the best or most elegant solution, but it has done the trick for me.
Hope this helps someone.
Same idea, but maybe a simpler look.
Get raw values
result = service.spreadsheets().values().get(spreadsheetId=spreadsheet_id, range=data_range).execute()
raw_values = result.get('values', [])
Then pad each row while iterating:
# the first row holds the column headings
expected_length = len(raw_values[0])
for row in raw_values:
    # += extends the list in place, so raw_values itself is updated
    row += [''] * (expected_length - len(row))
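As an aside, a pandas-native alternative (a sketch using the sample values from above) avoids manual padding entirely, since pandas pads ragged rows with NaN when no column names are passed:
import pandas as pd

values = [
    ['Name', 'Age', 'Tlf.', 'Address'],
    ['Julie', '35']
]

# build without column names so short rows are padded with NaN,
# reindex so every heading gets a column even if all rows are short,
# then attach the headings
df = pd.DataFrame(values[1:]).reindex(columns=range(len(values[0])))
df.columns = values[0]
print(df)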

Replace values from pandas dataset with dictionary

I am extracting a column from an Excel document with pandas. After that, for each row of the selected column, I want to replace all keys contained in multiple dictionaries grouped in a list.
import pandas as pd
file_loc = "excelFile.xlsx"
df = pd.read_excel(file_loc, usecols = "C")
In this case, my column is accessed as df['Q10'], and this data frame has more than 10k rows.
Traditionally, if I want to replace a value in df I use:
df['Q10'].str.replace('val1', 'val2')
Now, I have a list of dictionaries like:
mydic = [
    {
        'key': "wasn't",
        'value': 'was not'
    },
    {
        'key': "I'm",
        'value': 'I am'
    }
    # ... + tons of key-value pairs
]
Currently, I have created a function that iterates over mydic and replaces all occurrences one by one.
def replaceContractions(df, contractions):
    for cont in contractions:
        df.str.replace(cont['key'], cont['value'])
Next I call this function, passing my column and the list:
replaceContractions(df['Q10'], contractions)
First problem: this is very expensive, because mydic has a lot of items and the dataset is iterated once for each of them.
Second: it doesn't seem to work :(
Any ideas?
Convert your "dictionary" to a more friendly format:
m = {d['key'] : d['value'] for d in mydic}
m
{"I'm": 'I am', "wasn't": 'was not'}
Next, call replace with the regex switch and pass m to it.
df['Q10'] = df['Q10'].replace(m, regex=True)
replace accepts a dictionary of key-replacement pairs, and it should be much faster than iterating over one pair at a time. Note also that str.replace returns a new Series instead of modifying the column in place, which is why the original function appeared to do nothing; assigning the result back, as above, fixes that.
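For example, a minimal end-to-end sketch with made-up sample rows:
import pandas as pd

df = pd.DataFrame({'Q10': ["I'm sure it wasn't him", "it wasn't me"]})
m = {"wasn't": 'was not', "I'm": 'I am'}

# one vectorized pass; the keys are treated as regex patterns
df['Q10'] = df['Q10'].replace(m, regex=True)
print(df['Q10'].tolist())
# ['I am sure it was not him', 'it was not me']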
