I'm trying to read an Excel table that looks like this:
        B       C
A
data
data
data
data
data
but read_excel doesn't recognize that one column's header doesn't start in the first row, and it reads it like this:
Unnamed: 0      B       C
A
data
data
data
data
data
Is there a way to read the data the way I need? I have checked parameters like header=, but that's not what I need.
A similar question was asked and solved here. So basically the easiest thing would be to either drop the first column (if that's always the problematic column) with
df = pd.read_csv('data.csv', index_col=0)
or remove the unnamed column via
df = df.loc[:, ~df.columns.str.contains('^Unnamed')]
You can skip the automatic column labeling with something like pd.read_excel(..., header=None). This avoids the generated Unnamed labels.
Then you can use a more elaborate computation (e.g. taking the first non-empty value in each column) to get the labels, such as
df.apply(lambda s: s.dropna().reset_index(drop=True)[0])
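Putting the two steps together, a minimal sketch (the file name 'data.xlsx' is a placeholder):

import pandas as pd

df = pd.read_excel('data.xlsx', header=None)  # no automatic Unnamed labels
# the first non-empty value of each column becomes its label, e.g. A, B, C
labels = df.apply(lambda s: s.dropna().reset_index(drop=True)[0])
print(labels)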
I have an Excel workbook with multiple sheets, and I am trying to import/read the data starting from an empty column.
The raw data look like this:
A               C
One             Two
                Three
and I am trying to get this data:
C
Two
Three
I can't use usecols, as the position of this empty column changes in each sheet of the workbook.
I have tried this, but it didn't work out for me:
df = df[~df.header.shift().eq('').cummax()]
I would appreciate any suggestions or hints. Many thanks in advance!
Assuming that you want to start from the first empty header, then:
df = df[df.columns[list(df.columns).index('Unnamed: 1'):]]
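Since the position of the empty header varies per sheet, here is a sketch of how you could locate it dynamically across all sheets (the workbook name 'book.xlsx' is a placeholder):

import pandas as pd

sheets = pd.read_excel('book.xlsx', sheet_name=None)  # dict: sheet name -> DataFrame
for name, df in sheets.items():
    # pandas labels empty headers 'Unnamed: <position>'
    unnamed = [c for c in df.columns if str(c).startswith('Unnamed')]
    if unnamed:
        sheets[name] = df[df.columns[list(df.columns).index(unnamed[0]):]]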
Sorry, this might be a very simple question, but I am new to Python/JSON and everything. I am trying to filter my Twitter JSON data set based on user_location/country_code/gb, but I have no idea how to do this. I have tried several ways, but still no luck. I have attached my data set and some of the code I have used here. I would appreciate any help.
Here is what I did to get the best result; however, I do not know how to make it run over the whole data set and print out the resulting tweet_id:
import json
import pandas as pd

df = pd.read_json('example.json', lines=True)
if df['user_location'][4]['country_code'] == 'th':
    print(df.tweet_id[4])
else:
    print('false')
This code shows me the tweet_id: 1223489829817577472.
However, I couldn't extend it to the whole data set.
I have tried this code as well, still with no luck:
dataset = df[df['user_location'].isin([ "gb" ])].copy()
print (dataset)
That is what my data set looks like.
I would break the user_location column into multiple columns using the following:
df = pd.concat([df, df.pop('user_location').apply(pd.Series)], axis=1)
Running this should give you a column for each of the keys contained within the user_location JSON. Then it should be easy to print out tweet_ids based on country_code using:
df[df['country_code']=='th']['tweet_id']
An explanation of what is actually happening here:
df.pop('user_location') removes the 'user_location' column from df and returns it at the same time
With the returned column, we use the .apply method to apply a function to the column
pd.Series converts each JSON dictionary into a Series; applied across the column, the results stack into a DataFrame
pd.concat concatenates the original df (now without the 'user_location' column) with the new columns created from the 'user_location' data
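Putting it all together, a sketch of the whole flow (using the example.json file name from the question):

import pandas as pd

df = pd.read_json('example.json', lines=True)
# expand the user_location dicts into their own columns
df = pd.concat([df, df.pop('user_location').apply(pd.Series)], axis=1)
# all tweet_ids posted from the UK
print(df[df['country_code'] == 'gb']['tweet_id'])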
I am importing study data into a Pandas data frame using read_csv.
My subject codes are six digits encoding, among other things, the day of birth. For some of my subjects this results in a code with a leading zero (e.g. "010816").
When I import into Pandas, the leading zero is stripped off and the column is formatted as int64.
Is there a way to import this column unchanged, maybe as a string?
I tried using a custom converter for the column, but it does not work - it seems as if the custom conversion takes place before Pandas converts to int.
As indicated in this answer by Lev Landau, there is a simple solution: use the converters option for the column in question in the read_csv function:
converters={'column_name': str}
Let's say I have csv file projects.csv like below:
project_name,project_id
Some Project,000245
Another Project,000478
For example, the code below trims the leading zeros:
from pandas import read_csv

dataframe = read_csv('projects.csv')
print(dataframe)
Result:
      project_name  project_id
0     Some Project         245
1  Another Project         478
Solution code example:
from pandas import read_csv

dataframe = read_csv('projects.csv', converters={'project_id': str})
print(dataframe)
Required result:
      project_name  project_id
0     Some Project      000245
1  Another Project      000478
To have all columns as str:
pd.read_csv('sample.csv', dtype=str)
To have certain columns as str:
# column names which need to be string
lst_str_cols = ['prefix', 'serial']
dict_dtypes = {x: 'str' for x in lst_str_cols}
pd.read_csv('sample.csv', dtype=dict_dtypes)
Here is a shorter, robust, and fully working solution: simply define a mapping (dictionary) between variable names and the desired data types:
dtype_dic = {'subject_id': str,
             'subject_number': 'float'}
Use that mapping with pd.read_csv():
df = pd.read_csv(yourdata, dtype=dtype_dic)
Et voilà!
If you have a lot of columns and you don't know which ones contain leading zeros that might be lost, or if you just need to automate your code, you can do the following:
df = pd.read_csv("your_file.csv", nrows=1) # Just take the first row to extract the columns' names
col_str_dic = {column:str for column in list(df)}
df = pd.read_csv("your_file.csv", dtype=col_str_dic) # Now you can read the compete file
You could also do:
df = pd.read_csv("your_file.csv", dtype=str)
By doing this you will have all your columns as strings and you won't lose any leading zeros.
You can do this; it works on all versions of pandas:
pd.read_csv('filename.csv', dtype={'zero_column_name': object})
You can use converters to pad numbers to a fixed width if you know the width.
For example, if the width is 5, then:
data = pd.read_csv('text.csv', converters={'column1': lambda x: f"{int(x):05d}"})
This will do the trick. It works for pandas==0.23.0 and also read_excel.
Python 3.6 or higher is required.
I don't think you can specify a column type the way you want (if there haven't been changes recently, and if the six-digit number is not a date that you can convert to datetime). You could try using np.genfromtxt() and creating the DataFrame from there.
EDIT: Take a look at Wes McKinney's blog; there might be something there for you. It seems that a new parser is coming in pandas 0.10 in November.
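A minimal sketch of the np.genfromtxt() route (the file name 'study_data.csv' is a placeholder; dtype=str keeps every field as text, leading zeros included):

import numpy as np
import pandas as pd

raw = np.genfromtxt('study_data.csv', delimiter=',', dtype=str)
df = pd.DataFrame(raw[1:], columns=raw[0])  # the first row holds the headers
print(df)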
As an example, consider the following my_data.txt file:
id,A
03,5
04,6
To preserve the leading zeros for the id column:
df = pd.read_csv("my_data.txt", dtype={"id":"string"})
df
   id  A
0  03  5
1  04  6
What I want to do is select a column and copy the values just under the same column I selected. I know I can use a pandas DataFrame to select the column just by its name, but I don't know if it's better to use openpyxl instead. There are many similar questions about this, but none of them answers mine. Here is my code, where I try to use DataFrames and numpy:
import os
import numpy as np
import pandas as pd

for file in files:
    fileName = os.path.splitext(file)[0]
    if fileName == 'fileNameA':
        try:
            df = pd.read_excel(file)
            list_dates = ['the string of the date i need' for dates in df['Date']]
            # for every date this generates a list with the dates
            print(list_dates)
            # repeat every row once per date in the list
            new_df = df.loc[np.repeat(df.index, len(list_dates))]
            writer = pd.ExcelWriter('fileNameA1.xlsx', engine='xlsxwriter')
            new_df.to_excel(writer, 'Sheet 1')
            writer.save()
        except Exception as e:
            print(e)
#Input data:
Date
01/12/2018
02/12/2018
03/12/2018
04/12/2018
#Output I want:
Date
01/12/2018
02/12/2018
03/12/2018
04/12/2018
01/12/2018
02/12/2018
03/12/2018
04/12/2018
Which is the better alternative: working directly with openpyxl, or using pandas and then a writer to generate the xlsx?
In this question they use df_try or concat(), but how do I know the number of times I should repeat it?
Just use NewDF = pd.concat([df, df])
This will duplicate all rows of df.
If you're trying to duplicate your rows three times or some other odd interval, you could just mash together a temporary df to get the desired results (for adding two copies of df, use the following):
tempdf = pd.concat([df, df])
NewDF = pd.concat([df, tempdf])
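More generally, pd.concat() accepts a list, so you can stack as many copies as you need without a temporary frame. A sketch with stand-in data:

import pandas as pd

df = pd.DataFrame({'Date': ['01/12/2018', '02/12/2018']})  # stand-in data
n = 3  # however many copies you need
new_df = pd.concat([df] * n, ignore_index=True)
print(new_df)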
"Best" is usually too subjective to be any good, and it is for this reason that questions asking for library recommendations are closed.
If you're not doing any real manipulation of the data for statistical purposes, etc., then you probably don't need Pandas. Sticking with a single library can mean your code is easier to understand and maintain.
One approach in openpyxl would be to simply append() the dates at the end of the current worksheet. Something like this (the code will probably need some changes):
# snapshot the rows first so appending doesn't interfere with the iteration
for row in list(ws):
    ws.append(row[:1])
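A fuller sketch of that idea (the file names are placeholders; it copies column A's values, minus the header row, below the existing data):

from openpyxl import load_workbook

wb = load_workbook('fileNameA.xlsx')
ws = wb.active
# collect the values first so we don't append to the sheet while iterating over it
dates = [row[0].value for row in ws.iter_rows(min_row=2)]
for value in dates:
    ws.append([value])
wb.save('fileNameA1.xlsx')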