Convert a series of numbers into one single line of numbers - python

If I have a series of numbers in a DataFrame with one column, e.g.:
import pandas as pd
data = [4, 5, 6, 7, 8, 9, 10, 11]
pd.DataFrame(data)
Which looks like this (left column = index, right column = data):
0 4
1 5
2 6
3 7
4 8
5 9
6 10
7 11
How do I turn it into one sequence of numbers, i.e. (4 5 6 7 8 9 10 11), in Python or pandas? I want to put it into an XML file so it looks like this:
<Or>
<numbers>
<example>4 5 6 7 8 9 10 11</example>
</numbers>
</Or>

You can use an f-string, converting the integers to strings with astype and joining them with str.join (here s is the DataFrame's single column, e.g. s = df[0]):
s = pd.DataFrame(data)[0]

text = f'''<Or>
<numbers>
<example>{" ".join(s.astype(str))}</example>
</numbers>
</Or>'''
Output:
<Or>
<numbers>
<example>4 5 6 7 8 9 10 11</example>
</numbers>
</Or>

Use tolist() to convert the column values to a list, then join them after converting each element to str (here the column is named 'Marks'; substitute your own column name):
marks_list = df['Marks'].tolist()
marks_str = ' '.join(str(x) for x in marks_list)
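If the end goal is an actual XML file rather than just the formatted string, the standard library's xml.etree.ElementTree can build it from the joined values; a minimal sketch (the output file name numbers.xml is just an example):
import xml.etree.ElementTree as ET
import pandas as pd

data = [4, 5, 6, 7, 8, 9, 10, 11]
df = pd.DataFrame(data)

root = ET.Element("Or")
numbers = ET.SubElement(root, "numbers")
example = ET.SubElement(numbers, "example")
example.text = " ".join(str(x) for x in df[0])

print(ET.tostring(root, encoding="unicode"))
# <Or><numbers><example>4 5 6 7 8 9 10 11</example></numbers></Or>
ET.ElementTree(root).write("numbers.xml")  # file name is just an example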

Related

Reading only a few fields from a text file with multiple delimiters

I have a text file with multiple delimiters separating the values; from it I just want to read the pipe-separated values. The data looks like this, for example:
10|10|10|10|10|10|10|10|10;10:10:10,10,10,10 ... etc
I want to read only up to the 8 pipe-separated values as a dataframe and ignore the values separated by ";,:". How do I do that?
It would be a two-step process. First read the CSV with | as the delimiter:
import pandas as pd
from io import StringIO

df = pd.read_csv(StringIO(
    "10|10|10|10|10|10|10|10|10;10:10:10,10,10,10"
), delimiter='|', header=None)
0 1 2 3 4 5 6 7 8
0 10 10 10 10 10 10 10 10 10;10:10:10,10,10,10
Then update the last column by removing everything from the first occurrence of any of ; , : onward:
df.iloc[:, -1] = df.iloc[:, -1].str.replace(r'[;,:].*', '', regex=True)
0 1 2 3 4 5 6 7 8
0 10 10 10 10 10 10 10 10 10
If you know the exact character after which everything should be ignored, you can use the comment parameter as follows. Everything after that one-character string is ignored.
df = pd.read_csv(StringIO(
    "10|10|10|10|10|10|10|10|10;10:10:10,10,10,10"
), delimiter='|', header=None, comment=';')
df
0 1 2 3 4 5 6 7 8
0 10 10 10 10 10 10 10 10 10
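If the trailing field with the mixed ; : , delimiters is not needed at all (one reading of "only up to the 8 pipe separated values"), another option is to keep just the first eight columns with usecols; a short sketch:
df = pd.read_csv(StringIO(
    "10|10|10|10|10|10|10|10|10;10:10:10,10,10,10"
), delimiter='|', header=None, usecols=range(8))
This keeps columns 0-7 and drops the last column entirely, so no cleanup step is needed.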
This is longer than the other proposed solutions, but also possibly faster because it only reads what's needed. It collects the result as a list, but it could be another container type:
s = "10,10,10,10|10|10|10|10|10|10|10|10;10:10:10,10,10,10"
coll = []
start = 0
prevIdx = -1
while True:
    try:
        idx = s.index("|", start)      # position of the next "|"
        if prevIdx >= 0:               # skip whatever precedes the first "|"
            coll.append(int(s[prevIdx+1:idx]))
        start = idx + 1
        prevIdx = idx
    except ValueError:                 # no more "|" found
        break
print(coll)  # ==> [10, 10, 10, 10, 10, 10, 10]

Python - Create multiple lists and zip

I am looking to produce multiple lists based on the same function, which randomises data based on a list. I want to be able to easily change how many of these new lists I create and then combine them. The code that creates each list is the following:
"""
"""
R_ensemble=[]
for i in range(0,len(R)):
if R[i]==0:
R_ensemble.append(0)
else:
R_ensemble.append(np.random.normal(loc=R[i],scale=R[i]/4,size=None))
return R_ensemble
This perturbs each value from the list based on a normal distribution.
Combining them is fine when I just want a handful of lists:
ensemble_form_1, ensemble_form_2, ensemble_form_3 = [], [], []
ensemble_form_1 = normal_transform(R)
ensemble_form_2 = normal_transform(R)
ensemble_form_3 = normal_transform(R)
zipped_ensemble = list(zip(ensemble_form_1, ensemble_form_2, ensemble_form_3))
df_ensemble = pd.DataFrame(zipped_ensemble, columns=['Ensemble_1', 'Ensemble_2', 'Ensemble_3'])
return ensemble_form_1, ensemble_form_2, ensemble_form_3
How could I repeat the same randomisation process to create a fixed number of lists (say 50 or 100), and then combine them into a table? Is there an easy way to do this with a for loop, or any other method? I'd need to be able to pick out each new list/column individually, as I would be combining the results in some way.
Any help would be greatly appreciated.
You can construct multiple lists and a table like this:
import pandas as pd
import numpy as np

# Your function for creating the individual lists
def normal_transform(R):
    R_ensemble = []
    for i in range(0, len(R)):
        if R[i] == 0:
            R_ensemble.append(0)
        else:
            R_ensemble.append(np.random.normal(loc=R[i], scale=R[i]/4, size=None))
    return R_ensemble

# Construction of multiple lists and the dataframe
NUM_LISTS = 50
R = list(range(100))

data = dict()
for i in range(NUM_LISTS):
    data['Ensemble_' + str(i)] = normal_transform(R)

df_ensemble = pd.DataFrame(data)
You can access the individual lists/columns like this:
df_ensemble['Ensemble_42']
df_ensemble[df_ensemble.columns[42]]
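As a side note, the element-wise loop in normal_transform can also be vectorized with NumPy, which helps once you generate 50 or 100 ensembles; a minimal sketch, assuming R contains only non-negative values (the same constraint the original scale=R[i]/4 imposes):
import numpy as np

def normal_transform_vec(R):
    R = np.asarray(R, dtype=float)
    # scale is 0 wherever R == 0, so those entries come back as exactly 0
    return np.random.normal(loc=R, scale=R / 4)
With this, data = {'Ensemble_' + str(i): normal_transform_vec(R) for i in range(NUM_LISTS)} builds the same dict as above.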
You can use zip() with * to create a dataframe with a variable number of columns. For example:
import pandas as pd

def generate_list(n):
    # ... generate your list here
    return [*range(n)]

def get_dataframe(n_columns, n):
    return pd.DataFrame(zip(*[generate_list(n) for _ in range(n_columns)]),
                        columns=['Ensemble_{}'.format(i) for i in range(1, n_columns + 1)])

print(get_dataframe(8, 10))
Prints (8 columns, 10 rows):
Ensemble_1 Ensemble_2 Ensemble_3 Ensemble_4 Ensemble_5 Ensemble_6 Ensemble_7 Ensemble_8
0 0 0 0 0 0 0 0 0
1 1 1 1 1 1 1 1 1
2 2 2 2 2 2 2 2 2
3 3 3 3 3 3 3 3 3
4 4 4 4 4 4 4 4 4
5 5 5 5 5 5 5 5 5
6 6 6 6 6 6 6 6 6
7 7 7 7 7 7 7 7 7
8 8 8 8 8 8 8 8 8
9 9 9 9 9 9 9 9 9
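The zip(*lists) trick works because * unpacks the list of generated lists and zip then yields one tuple per row, effectively transposing columns into rows; a tiny illustration:
rows = list(zip(*[[1, 2, 3], [4, 5, 6]]))
print(rows)  # [(1, 4), (2, 5), (3, 6)] -- each tuple becomes one DataFrame row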

Creating a dataframe from several lists of lists

I need to build a dataframe from 10 lists of lists. I did it manually, but it takes time. What is a better way to do it?
I have tried doing it manually; it works fine (#1).
I tried code (#2) for better performance, but it returns only the last column.
1
import pandas as pd
import numpy as np

a1T = [[7, 8, 9], [10, 11, 12], [13, 14, 15]]
a2T = [[1, 2, 3], [5, 0, 2], [3, 4, 5]]
print(a1T)
# Output: [[7, 8, 9], [10, 11, 12], [13, 14, 15]]

vis1 = np.array(a1T)
vis_1_1 = vis1.T
tmp2 = np.array(a2T)
tmp_2_1 = tmp2.T
X = np.column_stack([vis_1_1, tmp_2_1])
dataset_all = pd.DataFrame({"Visab1": X[:, 0], "Visab2": X[:, 1], "Visab3": X[:, 2],
                            "Temp1": X[:, 3], "Temp2": X[:, 4], "Temp3": X[:, 5]})
print (dataset_all)
Output: Visab1 Visab2 Visab3 Temp1 Temp2 Temp3
0 7 10 13 1 5 3
1 8 11 14 2 0 4
2 9 12 15 3 2 5
> Actually I have a varying number of columns in the dataframe (500-1500), that's why I need auto-generated column names. The extra index (1, 2, 3) after the names Visab, Temp and so on is constant for every case. See the code below.
For better performance I tried this (#2):
n = 3  # This is a varying parameter. It affects the number of columns in the table.
m = 2  # This is constant for every case; here it is 2, because we have "Visab", "Temp"
mlist = ('Visab', 'Temp')
nlist = [range(1, n)]
for j in range(1, n):
    for i in range(1, m):
        col = i + (j - 1) * n
        dataset_all = pd.DataFrame({mlist[j] + str(i): X[:, col]})
I expect output like
Visab1 Visab2 Visab3 Temp1 Temp2 Temp3
0 7 10 13 1 5 3
1 8 11 14 2 0 4
2 9 12 15 3 2 5
but I get no result (only the error "expected an indented block").
Ok, so the number of columns n is the number of sublists in each list, right? You can measure that with len:
len(a1T)
#Output
3
I'll simplify the answer above so you don't need X and add automatic column-names creation:
my_lists = [a1T, a2T]
my_names = ["Visab", "Temp"]

dfs = []
for one_list, name in zip(my_lists, my_names):
    n_columns = len(one_list)
    col_names = [name + "_" + str(n) for n in range(n_columns)]
    df = pd.DataFrame(one_list).T
    df.columns = col_names
    dfs.append(df)

dataset_all = pd.concat(dfs, axis=1)
#Output
Visab_0 Visab_1 Visab_2 Temp_0 Temp_1 Temp_2
0 7 10 13 1 5 3
1 8 11 14 2 0 4
2 9 12 15 3 2 5
Now it's much clearer. So you have:
X=np.column_stack([vis_1_1, tmp_2_1])
Let's create a list with the names of the columns:
columns_names = ["Visab1","Visab2","Visab3","Temp1","Temp2","Temp3"]
Now you can directly make a dataframe like this:
dataset_all = pd.DataFrame(X,columns=columns_names)
#Output
Visab1 Visab2 Visab3 Temp1 Temp2 Temp3
0 7 10 13 1 5 3
1 8 11 14 2 0 4
2 9 12 15 3 2 5
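Since the real data has 500-1500 columns, writing the column list out by hand does not scale; the names can be generated instead. A sketch, assuming the same Visab/Temp naming and three measurements per group as in the example:
group_names = ["Visab", "Temp"]   # one entry per original list of lists
n_per_group = 3                   # number of sublists in each list

columns_names = [f"{name}{i}" for name in group_names for i in range(1, n_per_group + 1)]
# ['Visab1', 'Visab2', 'Visab3', 'Temp1', 'Temp2', 'Temp3']

dataset_all = pd.DataFrame(X, columns=columns_names)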

Pandas DataFrame removes too many rows

I have a dataframe with a lot of tweets and I want to remove the duplicates. The tweets are stored in fh1.df['Tweets']. i counts the number of non-duplicates, j the number of duplicates. In the else branch I remove the rows of the duplicates, and in the if branch I build a new list tweetChecklist where I put all the good tweets.
OK, if I compute i + j, I get the number of original tweets, so that's good. But in the else branch, I don't know why, it removes too many rows: the shape of my dataframe is much smaller after the for loop (about 1/10).
How does the line fh1.df = fh1.df[fh1.df.Tweets != current_tweet] remove too many rows?
tweetChecklist = []
for current_tweet in fh1.df['Tweets']:
    if current_tweet not in tweetChecklist:
        i = i + 1
        tweetChecklist.append(current_tweet)
    else:
        j = j + 1
        fh1.df = fh1.df[fh1.df.Tweets != current_tweet]

fh1.df['Tweets'] = pd.Series(tweetChecklist)
NOTE
Graipher's solution tells you how to generate a unique dataframe. My answer tells you why your current operation removes too many rows (per your question).
END NOTE
When you enter the "else" statement to remove the duplicated tweet you are removing ALL of the rows that have the specified tweet. Let's demonstrate:
import numpy as np
import pandas as pd
df = pd.DataFrame(data=np.random.randint(0, 10, (10, 5)), columns=list('ABCDE'))
This produces:
Out[118]:
A B C D E
0 2 7 0 5 4
1 2 8 8 3 7
2 9 7 4 6 2
3 9 7 7 9 2
4 6 5 7 6 8
5 8 8 7 6 7
6 6 1 4 5 3
7 1 4 7 8 7
8 3 2 5 8 5
9 5 8 9 2 4
With your method (assume you want to remove duplicates from "A" instead of "Tweets") you would end up keeping only the rows whose value in A never repeats, because the else branch drops every row containing a duplicated value, including its first occurrence:
Out[118]:
A B C D E
5 8 8 7 6 7
7 1 4 7 8 7
8 3 2 5 8 5
9 5 8 9 2 4
If you just want to make this unique, implement Graipher's suggestion. If you want to count how many duplicates you have you can do this:
total = df.shape[0]
duplicates = total - df.A.unique().size
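Equivalently, duplicated() counts the duplicate occurrences directly; a small alternative sketch, reusing the df built above:
duplicates = df["A"].duplicated().sum()  # rows whose value in A already appeared earlier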
In pandas there is almost always a better way than iterating over the dataframe with a for loop.
In this case, what you really want is to group equal tweets together and just retain the first one. This can be achieved with pandas.DataFrame.groupby:
import random
import string
import pandas as pd

# some random one-character tweets, so there are many duplicates
df = pd.DataFrame({"Tweets": random.choices(string.ascii_lowercase, k=100),
                   "Data": [random.random() for _ in range(100)]})
df.groupby("Tweets", as_index=False).first()
# Tweets Data
# 0 a 0.327766
# 1 b 0.677697
# 2 c 0.517186
# 3 d 0.925312
# 4 e 0.748902
# 5 f 0.353826
# 6 g 0.991566
# 7 h 0.761849
# 8 i 0.488769
# 9 j 0.501704
# 10 k 0.737816
# 11 l 0.428117
# 12 m 0.650945
# 13 n 0.530866
# 14 o 0.337835
# 15 p 0.567097
# 16 q 0.130282
# 17 r 0.619664
# 18 s 0.365220
# 19 t 0.005407
# 20 u 0.905659
# 21 v 0.495603
# 22 w 0.511894
# 23 x 0.094989
# 24 y 0.089003
# 25 z 0.511532
Even better, there is a method explicitly for that, DataFrame.drop_duplicates, which is about twice as fast:
df.drop_duplicates(subset="Tweets", keep="first")
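If you still want the counts that i and j tracked in the original loop, they can be recovered without iterating; a sketch, assuming the frame is fh1.df as in the question:
n_unique = fh1.df["Tweets"].nunique()     # i: tweets seen for the first time
n_duplicates = len(fh1.df) - n_unique     # j: duplicated occurrences
deduped = fh1.df.drop_duplicates(subset="Tweets", keep="first")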

Using numpy.genfromtxt to read a csv file with strings containing commas

I am trying to read in a csv file with numpy.genfromtxt but some of the fields are strings which contain commas. The strings are in quotes, but numpy is not recognizing the quotes as defining a single string. For example, with the data in 't.csv':
2012, "Louisville KY", 3.5
2011, "Lexington, KY", 4.0
the code
np.genfromtxt('t.csv', delimiter=',')
produces the error:
ValueError: Some errors were detected !
Line #2 (got 4 columns instead of 3)
The data structure I am looking for is:
array([['2012', 'Louisville KY', '3.5'],
['2011', 'Lexington, KY', '4.0']],
dtype='|S13')
Looking over the documentation, I don't see any options to deal with this. Is there a way to do it with numpy, or do I just need to read in the data with the csv module and then convert it to a numpy array?
You can use pandas (which has become the default library for working with dataframes, i.e. heterogeneous data, in scientific Python) for this. Its read_csv can handle this. From the docs:
quotechar : string
The character used to denote the start and end of a quoted item. Quoted items can include the delimiter and it will be ignored.
The default value is ". An example:
In [1]: import pandas as pd
In [2]: from io import StringIO
In [3]: s="""year, city, value
...: 2012, "Louisville KY", 3.5
...: 2011, "Lexington, KY", 4.0"""
In [4]: pd.read_csv(StringIO(s), quotechar='"', skipinitialspace=True)
Out[4]:
year city value
0 2012 Louisville KY 3.5
1 2011 Lexington, KY 4.0
The trick here is that you also have to use skipinitialspace=True to deal with the spaces after the comma-delimiter.
Apart from a powerful csv reader, I can also strongly advise using pandas for the heterogeneous data you have (the example numpy output you give is all strings, although you could use structured arrays).
The problem is the additional comma; np.genfromtxt does not deal with that.
One simple solution is to read the file with csv.reader() from python's csv module into a list and then dump it into a numpy array if you like.
If you really want to use np.genfromtxt, note that it can take iterators instead of files, e.g. np.genfromtxt(my_iterator, ...). So, you can wrap a csv.reader in an iterator and give it to np.genfromtxt.
That would go something like this:
import csv
import numpy as np
np.genfromtxt(("\t".join(i) for i in csv.reader(open('myfile.csv'))), delimiter="\t")
This essentially replaces on-the-fly only the appropriate commas with tabs.
If you are using numpy you probably want to work with a numpy.ndarray. This will give you one:
import pandas
# .as_matrix() was removed in pandas 1.0; use .to_numpy() instead
data = pandas.read_csv('file.csv').to_numpy()
Pandas will handle the "Lexington, KY" case correctly.
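If staying within NumPy itself is a requirement, note that newer NumPy versions (1.23 and later) added a quotechar parameter to np.loadtxt; a sketch under that assumption, keeping in mind that the stray space before each opening quote in this particular file may still need cleaning up:
import numpy as np

# requires NumPy >= 1.23 for the quotechar parameter
arr = np.loadtxt('t.csv', delimiter=',', dtype=str, quotechar='"')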
You can make a better function that combines the power of the standard csv module and NumPy's recfromcsv. For instance, the csv module has good control and customization of dialects, quotes, escape characters, etc., which you can add to the example below.
The recfromcsv_mod function below reads a complicated CSV file, similar to what Microsoft Excel sees, which may contain commas within quoted fields. Internally, it uses a generator function that rewrites each row with tab delimiters.
import csv
import numpy as np

def recfromcsv_mod(fname, **kwargs):
    def rewrite_csv_as_tab(fname):
        with open(fname, newline='') as fp:
            dialect = csv.Sniffer().sniff(fp.read(1024))
            fp.seek(0)
            for row in csv.reader(fp, dialect):
                yield "\t".join(row)
    return np.recfromcsv(
        rewrite_csv_as_tab(fname), delimiter="\t", encoding=None, **kwargs)

# Use it to read a CSV file into a record array
x = recfromcsv_mod("t.csv", case_sensitive=True)
You can try this code. We are reading the .csv file with the np.genfromtxt() method.
Code:
myfile = np.genfromtxt('MyData.csv', delimiter = ',')
myfile = myfile.astype('int64')
print(myfile)
Output:
[[ 1 1 1 1 1 1 1 1 1 1 1]
[ 3 3 3 3 3 3 3 3 3 3 3]
[ 3 3 3 3 3 3 3 3 3 3 3]
[ 4 4 4 4 4 4 4 4 4 4 4]
[ 5 5 5 5 5 5 5 5 5 5 5]
[ 6 6 6 6 6 6 6 6 6 6 6]
[ 7 7 7 7 7 7 7 7 7 7 7]
[ 8 8 8 8 8 8 8 8 8 8 8]
[ 9 9 9 9 9 9 9 9 9 9 9]
[10 10 10 10 10 10 10 10 10 10 10]
[11 11 11 11 11 11 11 11 11 11 11]
[12 12 12 12 12 12 12 12 12 12 12]
[13 13 13 13 13 13 13 13 13 13 13]
[14 14 14 14 14 14 14 14 14 14 14]
[15 15 15 15 15 15 15 15 15 15 15]
[16 17 18 19 20 21 22 23 24 25 26]]
Input File "MyData.csv"
