How to extract a data frame from a messy String? - python

I have a dataset that was revised before its release. However, its accompanying code was not revised, and it now raises this error.
The code expects a data frame describing the features of 255 homes, but the item is just a messy string with no consistent delimiter to split on!
I have shown the error, the types of the items in the new dataset, and the content of the string in this [picture][1].

I'm sure there's a better way, but I use this trick to get dataframes to work with from poorly formatted SO questions.
Print the string (to let print take care of things like return characters, '\n'), then select-all and copy it. Then use:
df = pd.read_clipboard(sep=r"\s\s+")
Sometimes I have to manually adjust the spacing a little bit between a few column names for it to work correctly, but it is unreasonably effective.
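If the messy string is already in a Python variable rather than on the clipboard, the same separator trick can be applied through io.StringIO (the variable name and sample text here are made up):

```python
import io
import pandas as pd

# stand-in for the badly formatted string from the dataset
messy_string = """col_a   col_b    col_c
1       2.5      foo
3       4.5      bar"""

# two-or-more whitespace characters act as the column delimiter
df = pd.read_csv(io.StringIO(messy_string), sep=r"\s\s+", engine="python")
```

engine="python" is needed because the default C parser does not support regex separators.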

Related

Converting 0-1 values in dataset with the name of the column if the value of the cell is 1

I have a CSV dataset with 0-1 values for the features of the elements. I want to iterate over each cell and replace every 1 with the name of its column. There are more than 500 thousand rows and 200 columns, and because the table is exported from another annotation tool that I update often, I want to find a way to do this automatically in Python.
This is not the actual table, but a sample I was using while trying to write the code. I tried a few approaches, but without success.
I would really appreciate it if you could share your knowledge with me; it would be a huge help. The final result I want is of the type (abonojnë, token_pos_verb). If you know any method to do this in Excel without the help of Python, that would be even better.
Thank you,
Brikena
Text,Comment,Role,ParentID,doc_completeness,lemma,MultiWord_Expr,token,pos,punctuation,verb,noun,adjective
abonojnë,,,,,,,1,1,0,1,0,0
çokasin,,,,,,,1,1,0,1,0,1
gërgasin,,,,,,,1,1,0,1,0,0
godasin,,,,,,,1,1,0,1,0,0
përkasin,,,,,,,1,1,1,1,0,0
përdjegin,,,,,,,1,1,0,1,0,0
lakadredhin,,,,,,,1,1,0,1,1,0
përdredhin,,,,,,,1,1,0,1,0,0
spërdredhin,,,,,,,1,1,0,1,0,0
përmbledhin,,,,,,,1,1,0,1,0,0
shpërdredhin,,,,,,,1,1,0,1,0,0
arsejnë,,,,,,,1,1,0,1,1,0
çapëlejnë,,,,,,,1,1,0,1,0,0
Using pandas, this is quite easy:
# pip install pandas
import pandas as pd
# read data (here example with csv, but use "read_excel" for excel)
df = pd.read_csv('input.csv').set_index('Text')
# reshape and export
(df.mul(df.columns).where(df.eq(1))
.stack().rename('xxx')
.groupby(level=0).apply('_'.join)
).to_csv('output.csv') # here use "to_excel" for excel format
output file:
Text,xxx
abonojnë,token_pos_verb
arsejnë,token_pos_verb_noun
godasin,token_pos_verb
gërgasin,token_pos_verb
lakadredhin,token_pos_verb_noun
përdjegin,token_pos_verb
përdredhin,token_pos_verb
përkasin,token_pos_punctuation_verb
përmbledhin,token_pos_verb
shpërdredhin,token_pos_verb
spërdredhin,token_pos_verb
çapëlejnë,token_pos_verb
çokasin,token_pos_verb_adjective
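To see what that pipeline produces at small scale, here is the same chain on a tiny made-up frame (a .dropna() is added after stack() so the sketch also works on pandas versions where stack no longer drops NaN by default):

```python
import pandas as pd

df = pd.DataFrame(
    {'token': [1, 1], 'verb': [1, 0]},
    index=pd.Index(['abonojnë', 'çokasin'], name='Text'),
)

# 1 * 'token' -> 'token'; cells not equal to 1 become NaN and are dropped,
# then the surviving column names are joined per word
result = (df.mul(df.columns).where(df.eq(1))
            .stack().dropna()
            .groupby(level=0).apply('_'.join))
```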
An update for those who may find it helpful in the future. Thank you to @mozway for helping me. A friend of mine suggested working with an Excel formula, because the solution with pandas and groupby eliminates duplicates. I need all the duplicates: since it's an annotated corpus, it's normal for repeated words to appear in every context, not only the first occurrence.
The other alternative is this:
Use a second sheet in the Excel file: write the formula =IF(Sheet1!B2=1,Sheet2!B$1,"") in the first cell with 0-1 values and drag it across all the other cells. This keeps all the occurrences of the words. It's quick, and it works like magic.
I hope this can be helpful to others who want to convert a 0-1 dataset to feature names without having to code.
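For pandas users who hit the same duplicate problem, a row-wise join (instead of a groupby on the index) also keeps repeated words; this is only a sketch on a made-up frame in the same 0-1 shape:

```python
import pandas as pd

df = pd.DataFrame(
    {'token': [1, 1], 'pos': [1, 1], 'verb': [1, 0]},
    index=pd.Index(['abonojnë', 'abonojnë'], name='Text'),  # duplicated word
)

# for each row, keep the names of the columns whose value is 1
out = df.apply(lambda row: '_'.join(row.index[row.eq(1)]), axis=1)
```

Because the join runs per row, both occurrences of the duplicated word survive in the output.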

How to display all rows of a DataFrame based on a condition that a specific column's strings are blank?

I'm having difficulty displaying rows of the dataframe.
I want to display all rows of the DataFrame where the strings of one specific column are blank.
If I try df.loc[df.column == ''] or df.loc[df.column == None] I get nothing.
Here's the link to my Github page where you find 2 files: module_2.ipynb and cacao_flavors.csv which is a DataFrame I use. (https://github.com/drlivesey85/projects.git)
I need to get all rows where company is blank.
You do have some strange encoding in the example dataframe (I'm not sure where it comes from); it may be a symbol encoded in some Mac character set. Anyway, if you write your condition with that in mind, you can actually get the rows you're looking for.
Here's my solution:
cacao.loc[cacao['company'] == '\xa0', :]
EDIT:
here you can find a way to replace it: How to replace \xA0 characters in an NSString
Note that the code also works properly with this variant: cacao.loc[cacao['company'] == '\u00a0', :]
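Alternatively, you could normalise the column first and then filter on empty strings; a sketch with made-up data:

```python
import pandas as pd

cacao = pd.DataFrame({'company': ['A. Morin', '\xa0', 'Bonnat']})

# replace non-breaking spaces with empty strings, then filter blanks
cacao['company'] = cacao['company'].str.replace('\xa0', '', regex=False)
blank_rows = cacao.loc[cacao['company'] == '']
```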
You can try this:
cacao.loc[cacao.company.isna()]
Also, you should refer to this page; it helps others provide quick responses.

Python data frame, for loop and type casts

I have a data frame with several columns. My goal is to manipulate the last column in a certain way. Until now the last column has type string. I need help building a for loop that walks through the last column, removes the last two characters of each entry, and then casts it to a float.
An example entry in the last column is "1234.5678;;". I want it to look like 1234.5678, and the same for every entry in the last column.
Thanks in advance.
I'm unsure what exactly you are asking: are you asking how to manipulate a string to remove the last two characters, or how to access (and edit) a data frame? If the latter, are you using pandas?
Clarifying that may help other people help you with your issue more effectively.
In python, given any string, you can cut off the last two characters like this:
string = string[:-2]
I assume the variable string holds the string you want.
For the future, it would be greatly appreciated if you explained your issue in more detail, said what you want to do and where you need help, and generally put more effort into your question; a spelling mistake in the title is not a good look.
When you're using a pandas DataFrame, you can do it like this:
w.iloc[:, -1] = w.iloc[:, -1].str.replace(";;", "")
w is your DataFrame. This line removes the ";;" from the last column (assigning through a single .iloc call avoids the chained indexing of w.iloc[:][...], which may not write back to the frame). I assume that the last two characters always follow the same pattern; this obviously will not work if they follow an unknown pattern.
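Putting both halves of the question together (strip the trailing two characters, then cast to float), a vectorised sketch without a for loop could look like this; the frame here is made up:

```python
import pandas as pd

# made-up frame; the real data has more columns
df = pd.DataFrame({'id': [1, 2], 'value': ['1234.5678;;', '99.9;;']})

# drop the last two characters of the last column, then cast to float
df.iloc[:, -1] = df.iloc[:, -1].str[:-2].astype(float)
```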

Save Pandas dataframe with numeric column as text in Excel

I am trying to export a Pandas dataframe to Excel where all columns are of text format. By default, the pandas.to_excel() function lets Excel decide the data type. Exporting a column with [1,2,'w'] results in the cells containing 1 and 2 to be numeric, and the cell containing 'w' to be text. I'd like all rows in the column to be text (i.e. ['1','2','w']).
I was able to solve the problem by assigning the column I need to be text using the .astype(str). However, if the data is large, I am concerned that I will run into performance issues. If I understand correctly, df[col] = df[col].astype(str) makes a copy of the data, which is not efficient.
import pandas as pd
df = pd.DataFrame({'a':[1,2,'w'], 'b':['x','y','z']})
df['a'] = df['a'].astype(str)
df.to_excel(r'c:\tmp\test.xlsx')
Is there a more efficient way to do this?
I searched SO several times and didn't see anything on this. Forgive me if this has been answered before. This is my first post, and I'm really happy to participate in this cool forum.
Edit: Thanks to the comments I've received, I see that Converting a series of ints to strings - Why is apply much faster than astype? gives me other options to astype(str). This is really useful. I also wanted to know if astype(str) was inefficient because it made a copy of the data, which I now see that it does not.
I don't think you'll have performance issues with that approach, since the data is replaced rather than copied. You may also convert the whole dataframe to string type using
df = df.astype(str)
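As a minimal sketch of that one-liner (the output path is just an example), converting the whole frame before export:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 'w'], 'b': ['x', 'y', 'z']})

# one-shot conversion: every cell becomes a string
df_text = df.astype(str)

# df_text.to_excel(r'c:\tmp\test.xlsx')  # export unchanged, needs openpyxl
```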

Time Data does not match format "'%H:%M.%S%f'"

I am trying to forecast time series data.
The time series data in my csv file is in the form 0:00.000
Hence, I indexed the time series data column as follows:
df.columns=['Elapsed','I']
df['Elapsed']=pd.to_datetime(df['Elapsed'], format='%H:%M.%S%f')
df['Elapsed']=df['Elapsed'].dt.time
df.set_index('Elapsed', inplace=True)
Then later I split my data into the test section and the train section
train = df.loc['0:00.000':'0:28.778']
test = df.loc['0:28.779':]
My stack trace is
An extract of my data is:
Can anyone explain how to prevent this error from occurring?
Since the question has now changed, I'll write a new answer.
Your dataframe is indexed by instances of datetime.time, but you're trying to slice it with strings - pandas doesn't want to compare strings with times.
To get your slicing to work, try this:
import datetime

split_from = datetime.datetime.strptime('0:00.000', '%H:%M.%S%f').time()
split_to = datetime.datetime.strptime('0:28.778', '%H:%M.%S%f').time()
train = df[split_from:split_to]
It would also be useful to hold the format in a variable since you're now using it in several places.
Or, if you have fixed split times, you could instead construct them directly. Note that datetime.time takes whole seconds plus microseconds, and '0:28.778' under the '%H:%M.%S%f' format is 28 minutes, 7.78 seconds:
split_from = datetime.time(0, 0, 0)
split_to = datetime.time(0, 28, 7, 780000)
train = df[split_from:split_to]
Without seeing your data, I'm just guessing, but here goes:
I'm guessing your original data in the 'Elapsed' column looks like
'12:34.5678'
'12:35.1234'
In particular, it has quotes on each side of the numbers. Otherwise your line
df['Elapsed']=pd.to_datetime(df['Elapsed'], format="'%H:%M.%S%f'")
would fail.
So the error message is telling you that your slicing times have the wrong format: they are missing quotes on each side. Change it to
train = df.loc["'0:00.000'":"'0:28.778'"]
(likewise for the next line) and hopefully that will sort it out.
If you can extract your source data in a way that avoids having quote characters in the timestamps, you'll probably find things a little simpler.
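If the quotes really are embedded in the data, one way (sketched here on made-up rows) is to strip them before parsing, so the plain format works:

```python
import pandas as pd

df = pd.DataFrame({'Elapsed': ["'0:00.000'", "'0:28.778'"]})

# remove the surrounding single quotes, then parse with the plain format
cleaned = df['Elapsed'].str.strip("'")
df['Elapsed'] = pd.to_datetime(cleaned, format='%H:%M.%S%f').dt.time
```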
